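### Data Preprocessing

- ML algorithms learn from data, so the results depend on what data is fed in
- Missing values: most scikit-learn estimators do not accept NaN or null values, so they must be handled first
- String values are not accepted as input either: convert them to numeric form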
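
The two rules above can be checked on a small table before any model is trained. The following is a minimal sketch, assuming a hypothetical toy DataFrame (`toy`, `height`, and `city` are illustrative names that are not part of these notes) and mean imputation as one of several possible ways to handle a gap.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one numeric column with a gap, one string column.
toy = pd.DataFrame({
    "height": [170.0, np.nan, 165.0, 180.0],
    "city": ["서울", "부산", "부산", "대구"],
})

# Count missing values per column before deciding how to handle them.
missing_counts = toy.isna().sum()
print(missing_counts)

# One simple option (an assumption, not the only choice): fill with the column mean.
toy["height"] = toy["height"].fillna(toy["height"].mean())

# Strings must become numbers; here via integer category codes.
toy["city_code"] = toy["city"].astype("category").cat.codes
print(toy)
```

Whether mean imputation is appropriate depends on the data; dropping rows or using a median are equally common choices.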

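#### Data Encoding

#### Label Encoding

- Converts string categories into integer category codes
- Some ML algorithms treat the magnitude of these integers as meaningful (an order that does not really exist), which can degrade performance
    - Should not be used for linear regression
    - Fine for tree-based algorithms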
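
To make the ordering problem concrete, here is a small sketch on hypothetical data (the codes and target values are made up for illustration). The target is highest for the middle code, so a straight line through the integer codes cannot follow it, while one indicator column per category can.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data: category codes 0, 1, 2 whose target is NOT monotonic in the code.
codes = np.array([0, 1, 2, 0, 1, 2])
y = np.array([10.0, 30.0, 10.0, 10.0, 30.0, 10.0])

# Label-encoded input: a single numeric column, so the model fits one line through the codes.
X_label = codes.reshape(-1, 1)
r2_label = LinearRegression().fit(X_label, y).score(X_label, y)

# Indicator (one-hot) input: one column per category, so each category gets its own coefficient.
X_onehot = np.eye(3)[codes]
r2_onehot = LinearRegression().fit(X_onehot, y).score(X_onehot, y)

print(f"R^2 with label encoding  : {r2_label:.3f}")
print(f"R^2 with one-hot encoding: {r2_onehot:.3f}")
```

A decision tree would not have this problem, since it can split the codes at 0.5 and 1.5 and isolate each category.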

In [1]:
area = ['서울', '부산', '부산', '대구', '서울']
set(area)

{'대구', '부산', '서울'}

In [2]:
area = ['서울', '부산', '부산', '대구', '서울']
area = [0 if item == '서울' else 1 if item == '부산' else 2 for item in area]
area

[0, 1, 1, 2, 0]

In [3]:
import pandas as pd
dt = {'서울' : 0, '부산' : 1, '대구' : 2}
area = ['서울', '부산', '부산', '대구', '서울']

area_df = pd.DataFrame(data = area, columns = ['지역'])
area_df['지역'] = area_df['지역'].map(dt)
area_df

| | 지역 |
|---|---|
| 0 | 0 |
| 1 | 1 |
| 2 | 1 |
| 3 | 2 |
| 4 | 0 |
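
One thing to watch with the dictionary approach: `Series.map` returns NaN for any value that is missing from the dictionary, which would reintroduce the missing values that preprocessing is meant to remove. A minimal sketch, assuming a hypothetical extra city ('인천') that is deliberately left out of the mapping:

```python
import pandas as pd

mapping = {'서울': 0, '부산': 1, '대구': 2}
areas = pd.Series(['서울', '부산', '인천'])  # '인천' is deliberately absent from the mapping

# Values without a dictionary entry become NaN.
codes = areas.map(mapping)
print(codes)

# Detect unmapped categories before they reach a model.
unmapped = areas[codes.isna()].unique()
print(unmapped)
```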


In [4]:
from sklearn.preprocessing import LabelEncoder

item = ['서울', '부산', '부산', '대구', '서울']

# Create a LabelEncoder object
encoder = LabelEncoder()

# Label-encode with fit() and transform()
encoder.fit(item)
labels = encoder.transform(item)

print(f'Encoded values  : {labels}')
print(f'Encoder classes : {encoder.classes_}')
print(f'Decoded values  : {encoder.inverse_transform(labels)}')

Encoded values  : [2 1 1 0 2]
Encoder classes : ['대구' '부산' '서울']
Decoded values  : ['서울' '부산' '부산' '대구' '서울']


#### One-Hot Encoding

- Adds one new feature (column) per distinct value of the original feature, marking 1 in the column that matches the row's value and 0 in all the others

In [9]:
from sklearn.preprocessing import OneHotEncoder
import numpy as np

item = ['서울', '부산', '부산', '대구', '서울']

# Convert to a 2D ndarray (OneHotEncoder expects 2D input)
item = np.array(item)
print(item)
item = item.reshape(-1, 1)
print(item)

['서울' '부산' '부산' '대구' '서울']
[['서울']
 ['부산']
 ['부산']
 ['대구']
 ['서울']]



In [10]:
# Apply one-hot encoding
enc = OneHotEncoder()
enc.fit(item)
labels = enc.transform(item)   # returns a SciPy sparse matrix
print(labels)
labels = labels.toarray()      # convert to a dense ndarray

print(f'One-hot encoded array : \n{labels}')
print(f'One-hot encoded shape : {labels.shape}')

  (0, 2)	1.0
  (1, 1)	1.0
  (2, 1)	1.0
  (3, 0)	1.0
  (4, 2)	1.0
One-hot encoded array : 
[[0. 0. 1.]
 [0. 1. 0.]
 [0. 1. 0.]
 [1. 0. 0.]
 [0. 0. 1.]]
One-hot encoded shape : (5, 3)
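
The columns of the encoded array follow the sorted category order. As a small self-contained sketch (the variable names `cities` and `enc_demo` are illustrative), the fitted encoder's `categories_` attribute shows which column belongs to which city, and `inverse_transform` recovers the original labels:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

cities = np.array(['서울', '부산', '부산', '대구', '서울']).reshape(-1, 1)

enc_demo = OneHotEncoder()
onehot = enc_demo.fit_transform(cities).toarray()

# categories_ lists the learned categories; column j corresponds to categories_[0][j].
print(enc_demo.categories_[0])

# inverse_transform maps one-hot rows back to the original labels.
restored = enc_demo.inverse_transform(onehot)
print(restored.ravel())
```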


In [12]:
# Using pandas
import pandas as pd

df = pd.DataFrame({'item' : ['서울', '부산', '부산', '대구', '서울']})
df

| | item |
|---|---|
| 0 | 서울 |
| 1 | 부산 |
| 2 | 부산 |
| 3 | 대구 |
| 4 | 서울 |


In [13]:
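# dtype=int keeps the 0/1 output shown below; recent pandas versions return booleans by default
df1 = pd.get_dummies(df['item'], dtype=int)
df1

| | 대구 | 부산 | 서울 |
|---|---|---|---|
| 0 | 0 | 0 | 1 |
| 1 | 0 | 1 | 0 |
| 2 | 0 | 1 | 0 |
| 3 | 1 | 0 | 0 |
| 4 | 0 | 0 | 1 |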


In [14]:
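# Attach the encoded columns to the original data
df2 = pd.concat([df,df1],axis=1)
df2

| | item | 대구 | 부산 | 서울 |
|---|---|---|---|---|
| 0 | 서울 | 0 | 0 | 1 |
| 1 | 부산 | 0 | 1 | 0 |
| 2 | 부산 | 0 | 1 | 0 |
| 3 | 대구 | 1 | 0 | 0 |
| 4 | 서울 | 0 | 0 | 1 |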
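
As an alternative to encoding and concatenating in two steps, `pd.get_dummies` can take the whole DataFrame plus a `columns=` list. A brief sketch (`df_city` and `encoded` are illustrative names); note that, unlike the concat approach above, this replaces the original `item` column rather than keeping it:

```python
import pandas as pd

df_city = pd.DataFrame({'item': ['서울', '부산', '부산', '대구', '서울']})

# columns= encodes the named column in place of the original and prefixes the new column names.
# dtype=int keeps 0/1 output; recent pandas versions default to booleans.
encoded = pd.get_dummies(df_city, columns=['item'], dtype=int)
print(encoded)
```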


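#### Feature Scaling

- The task of bringing variables with different value ranges onto a common scale
- Standardization
    - Transforms each feature to have mean 0 and variance 1 (the shape of the distribution is unchanged, so the result is Gaussian only if the original feature was)
- Normalization
    - Rescales features of different magnitudes so that their sizes are comparable
    - Puts every individual value onto the same unit scale

- Notes on training and test data
    - Option 1: scale the entire dataset first, then split it into training and test sets. This keeps both sets on one scale, but the scaler sees test-set statistics, so evaluation scores can be slightly optimistic
    - Option 2: if the data is already split, fit() the scaler on the training data only and use that same fitted scaler to transform() the test data; do not call fit() again on the test data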
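
The second note (fit on the training data, then reuse the fitted scaler) can be sketched as follows. The data here is synthetic and hypothetical (two random features on very different scales), chosen only to keep the example self-contained.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical data: 100 samples, 2 features on very different scales.
rng = np.random.default_rng(0)
X = np.column_stack([rng.normal(50, 10, 100), rng.normal(0.5, 0.1, 100)])
X_tr, X_te = train_test_split(X, random_state=0)

sc = StandardScaler()
X_tr_s = sc.fit_transform(X_tr)   # learn mean/std from the training split only
X_te_s = sc.transform(X_te)       # reuse those statistics; no second fit()

print(X_tr_s.mean(axis=0))  # approximately 0 by construction
print(X_te_s.mean(axis=0))  # near 0, but generally not exactly 0
```

The test-set mean is not exactly 0 because the test data was shifted by the training mean; that is expected and is what keeps both splits on the same scale.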

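#### StandardScaler

- Transforms each feature to mean 0 and variance 1
- Centers and rescales the data; it does not change the shape of the distribution, so the output is Gaussian only if the input already was
- Support vector machines (SVM), linear regression, and logistic regression are often implemented with components that assume features centered around 0 with comparable variance (for example the RBF kernel of an SVM or the L1/L2 regularizers of linear models), so standardizing beforehand can be an important factor in predictive performance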
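
The transformation itself is just z = (x - mean) / std applied per column. A minimal check, assuming the iris data used below, that a manual NumPy computation agrees with `StandardScaler`:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

X = load_iris().data

# z = (x - mean) / std, computed per column with the population std (ddof=0).
z_manual = (X - X.mean(axis=0)) / X.std(axis=0)
z_sklearn = StandardScaler().fit_transform(X)

print(np.allclose(z_manual, z_sklearn))
print(z_manual.var(axis=0))
```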

In [1]:
from sklearn.datasets import load_iris
import pandas as pd

In [32]:
iris = load_iris()
df_iris = pd.DataFrame(data=iris.data, columns = iris.feature_names)
df_iris

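| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) |
|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 |
| ... | ... | ... | ... | ... |
| 145 | 6.7 | 3.0 | 5.2 | 2.3 |
| 146 | 6.3 | 2.5 | 5.0 | 1.9 |
| 147 | 6.5 | 3.0 | 5.2 | 2.0 |
| 148 | 6.2 | 3.4 | 5.4 | 2.3 |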


In [4]:
# Per-column mean and variance before scaling (target column added for reference)
df = df_iris.assign(target=iris.target)
print(f'Mean :\n{df.mean()}')
print(f'Variance : \n{df.var()}')

Mean :
sepal length (cm)    5.843333
sepal width (cm)     3.057333
petal length (cm)    3.758000
petal width (cm)     1.199333
target               1.000000
dtype: float64
Variance : 
sepal length (cm)    0.685694
sepal width (cm)     0.189979
petal length (cm)    3.116278
petal width (cm)     0.581006
target               0.671141
dtype: float64


In [26]:
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(df_iris, iris['target'], random_state = 7)

In [27]:
from sklearn.tree import DecisionTreeClassifier

dt = DecisionTreeClassifier(random_state = 156)

dt.fit(x_train, y_train)
dtp = dt.predict(x_test)
dtp

array([2, 1, 0, 1, 2, 0, 1, 1, 0, 1, 2, 1, 0, 2, 0, 2, 2, 2, 0, 0, 1, 2,
       1, 1, 2, 2, 1, 1, 2, 2, 2, 1, 0, 2, 1, 0, 0, 0])

In [28]:
from sklearn.metrics import accuracy_score

accuracy_score(y_test, dtp)

0.9210526315789473

In [29]:
from sklearn.svm import SVC

svc = SVC().fit(x_train, y_train)
pred = svc.predict(x_test)
accuracy_score(y_test, pred)

0.868421052631579

In [33]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(df_iris)
iris_scaler = scaler.transform(df_iris)

df_iris_scaler = pd.DataFrame(data=iris_scaler, columns = iris['feature_names'])
print(f'Mean : \n{df_iris_scaler.mean()}')
print(f'Variance : \n{df_iris_scaler.var()}')

Mean : 
sepal length (cm)   -1.690315e-15
sepal width (cm)    -1.842970e-15
petal length (cm)   -1.698641e-15
petal width (cm)    -1.409243e-15
dtype: float64
Variance : 
sepal length (cm)    1.006711
sepal width (cm)     1.006711
petal length (cm)    1.006711
petal width (cm)     1.006711
dtype: float64
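
The variance prints as 1.006711 rather than exactly 1 because pandas `var()` defaults to the sample variance (`ddof=1`), while `StandardScaler` divides by the population standard deviation (`ddof=0`). With n = 150 rows the ratio is n / (n - 1). A short sketch that makes this visible (`iris_data` and `scaled` are illustrative names):

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

iris_data = load_iris()
scaled = pd.DataFrame(StandardScaler().fit_transform(iris_data.data),
                      columns=iris_data.feature_names)

n = len(scaled)
print(scaled.var(ddof=0))   # population variance: what StandardScaler normalizes to 1
print(scaled.var(ddof=1))   # sample variance: the pandas default
print(n / (n - 1))          # ratio between the two
```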


In [34]:
x_train, x_test, y_train, y_test = train_test_split(df_iris_scaler, iris['target'], random_state = 7)

In [36]:
svc = SVC().fit(x_train, y_train)
pred = svc.predict(x_test)
accuracy_score(y_test, pred)

0.8947368421052632
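
The cells above scale the full dataset before splitting. As a hedged alternative sketch, a scikit-learn `Pipeline` fits the scaler on the training split only and applies the same transformation at prediction time, which avoids using test-set statistics. The split below reuses `random_state=7` from the cells above; the resulting score is not claimed to match any number shown here.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=7)

# fit() trains the scaler on X_tr only; score() reuses that fitted scaler on X_te.
model = make_pipeline(StandardScaler(), SVC())
model.fit(X_tr, y_tr)
acc = model.score(X_te, y_te)
print(acc)
```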

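#### MinMaxScaler

- Rescales each feature to the range 0 to 1 (the default feature_range)
    - Negative values are also mapped into 0 to 1; to get a -1 to 1 range, set feature_range=(-1, 1) or use MaxAbsScaler
- Often used when the data does not follow a Gaussian distribution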
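
The underlying formula is x' = (x - min) / (max - min) per feature. The sketch below, on a hypothetical column that contains negative values, compares a manual computation with `MinMaxScaler` at its default range and with `feature_range=(-1, 1)`:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical column containing negative values.
x = np.array([[-10.0], [-5.0], [0.0], [5.0], [10.0]])

# x' = (x - min) / (max - min)
manual = (x - x.min(axis=0)) / (x.max(axis=0) - x.min(axis=0))
default_scaled = MinMaxScaler().fit_transform(x)
wide_scaled = MinMaxScaler(feature_range=(-1, 1)).fit_transform(x)

print(default_scaled.ravel())  # spans 0 to 1 even though the input has negatives
print(wide_scaled.ravel())     # spans -1 to 1 because feature_range was set explicitly
```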

In [25]:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()

# Fit the MinMaxScaler and transform the dataset
scaler.fit(df_iris)
df_iris_minmax = scaler.transform(df_iris)

# transform() returns an ndarray, so convert it back to a DataFrame
df_iris_minmax = pd.DataFrame(data=df_iris_minmax, columns = iris['feature_names'])

df_iris_minmax

| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) |
|---|---|---|---|---|
| 0 | 0.222222 | 0.625000 | 0.067797 | 0.041667 |
| 1 | 0.166667 | 0.416667 | 0.067797 | 0.041667 |
| 2 | 0.111111 | 0.500000 | 0.050847 | 0.041667 |
| 3 | 0.083333 | 0.458333 | 0.084746 | 0.041667 |
| 4 | 0.194444 | 0.666667 | 0.067797 | 0.041667 |
| ... | ... | ... | ... | ... |
| 145 | 0.666667 | 0.416667 | 0.711864 | 0.916667 |
| 146 | 0.555556 | 0.208333 | 0.677966 | 0.750000 |
| 147 | 0.611111 | 0.416667 | 0.711864 | 0.791667 |
| 148 | 0.527778 | 0.583333 | 0.745763 | 0.916667 |


In [37]:
df_iris_minmax.describe()

| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) |
|---|---|---|---|---|
| count | 150.0 | 150.0 | 150.0 | 150.0 |
| mean | 0.428704 | 0.440556 | 0.467458 | 0.458056 |
| std | 0.230018 | 0.181611 | 0.299203 | 0.317599 |
| min | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 0.222222 | 0.333333 | 0.101695 | 0.083333 |
| 50% | 0.416667 | 0.416667 | 0.567797 | 0.5 |
| 75% | 0.583333 | 0.541667 | 0.694915 | 0.708333 |
| max | 1.0 | 1.0 | 1.0 | 1.0 |


In [38]:
from sklearn.model_selection import train_test_split

df = pd.DataFrame(data=iris['data'], columns = iris['feature_names'])
df['target'] = iris['target']

x_train, x_test, y_train, y_test = train_test_split(df.iloc[:,:-1], df['target'], random_state=7)

In [39]:
df

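| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | target |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | 0 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | 0 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | 0 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | 0 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | 0 |
| ... | ... | ... | ... | ... | ... |
| 145 | 6.7 | 3.0 | 5.2 | 2.3 | 2 |
| 146 | 6.3 | 2.5 | 5.0 | 1.9 | 2 |
| 147 | 6.5 | 3.0 | 5.2 | 2.0 | 2 |
| 148 | 6.2 | 3.4 | 5.4 | 2.3 | 2 |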


In [41]:
# Create a MinMaxScaler object
scaler = MinMaxScaler()

# Fit on the training data and transform it
scaler.fit(x_train)
x_train_minmax = scaler.transform(x_train)

# Convert to a DataFrame
x_train_minmax = pd.DataFrame(x_train_minmax, columns = iris['feature_names'])

In [42]:
x_train

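| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) |
|---|---|---|---|---|
| 17 | 5.1 | 3.5 | 1.4 | 0.3 |
| 102 | 7.1 | 3.0 | 5.9 | 2.1 |
| 124 | 6.7 | 3.3 | 5.7 | 2.1 |
| 76 | 6.8 | 2.8 | 4.8 | 1.4 |
| 132 | 6.4 | 2.8 | 5.6 | 2.2 |
| ... | ... | ... | ... | ... |
| 142 | 5.8 | 2.7 | 5.1 | 1.9 |
| 92 | 5.8 | 2.6 | 4.0 | 1.2 |
| 103 | 6.3 | 2.9 | 5.6 | 1.8 |
| 67 | 5.8 | 2.7 | 4.1 | 1.0 |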


In [43]:
x_train_minmax

| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) |
|---|---|---|---|---|
| 0 | 0.222222 | 0.625000 | 0.051724 | 0.083333 |
| 1 | 0.777778 | 0.416667 | 0.827586 | 0.833333 |
| 2 | 0.666667 | 0.541667 | 0.793103 | 0.833333 |
| 3 | 0.694444 | 0.333333 | 0.637931 | 0.541667 |
| 4 | 0.583333 | 0.333333 | 0.775862 | 0.875000 |
| ... | ... | ... | ... | ... |
| 107 | 0.416667 | 0.291667 | 0.689655 | 0.750000 |
| 108 | 0.416667 | 0.250000 | 0.500000 | 0.458333 |
| 109 | 0.555556 | 0.375000 | 0.775862 | 0.708333 |
| 110 | 0.416667 | 0.291667 | 0.517241 | 0.375000 |


In [48]:
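# Anti-pattern, shown on purpose: refitting on the test set learns a different min/max,
# so test values end up on a different scale from the training values
scaler.fit(x_test)
x_test_minmax = scaler.transform(x_test)

x_test_minmax = pd.DataFrame(x_test_minmax, columns = iris['feature_names'])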

In [49]:
x_test

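| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) |
|---|---|---|---|---|
| 149 | 5.9 | 3.0 | 5.1 | 1.8 |
| 84 | 5.4 | 3.0 | 4.5 | 1.5 |
| 40 | 5.0 | 3.5 | 1.3 | 0.3 |
| 66 | 5.6 | 3.0 | 4.5 | 1.5 |
| 106 | 4.9 | 2.5 | 4.5 | 1.7 |
| 41 | 4.5 | 2.3 | 1.3 | 0.3 |
| 52 | 6.9 | 3.1 | 4.9 | 1.5 |
| 94 | 5.6 | 2.7 | 4.2 | 1.3 |
| 11 | 4.8 | 3.4 | 1.6 | 0.2 |
| 51 | 6.4 | 3.2 | 4.5 | 1.5 |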


In [50]:
x_test_minmax

| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) |
|---|---|---|---|---|
| 0 | 0.518519 | 0.421053 | 0.803922 | 0.708333 |
| 1 | 0.333333 | 0.421053 | 0.686275 | 0.583333 |
| 2 | 0.185185 | 0.684211 | 0.058824 | 0.083333 |
| 3 | 0.407407 | 0.421053 | 0.686275 | 0.583333 |
| 4 | 0.148148 | 0.157895 | 0.686275 | 0.666667 |
| 5 | 0.0 | 0.052632 | 0.058824 | 0.083333 |
| 6 | 0.888889 | 0.473684 | 0.764706 | 0.583333 |
| 7 | 0.407407 | 0.263158 | 0.627451 | 0.5 |
| 8 | 0.111111 | 0.631579 | 0.117647 | 0.041667 |
| 9 | 0.703704 | 0.526316 | 0.686275 | 0.583333 |


In [51]:
dt_train = dict(zip(x_train['sepal length (cm)'], x_train_minmax['sepal length (cm)']))
dt_train

{5.1: 0.2222222222222221,
 7.1: 0.7777777777777777,
 6.7: 0.6666666666666667,
 6.8: 0.6944444444444444,
 6.4: 0.5833333333333335,
 6.5: 0.6111111111111112,
 5.7: 0.38888888888888884,
 5.0: 0.19444444444444442,
 6.0: 0.4722222222222223,
 4.7: 0.11111111111111116,
 4.6: 0.08333333333333326,
 7.7: 0.9444444444444442,
 4.3: 0.0,
 6.3: 0.5555555555555556,
 5.5: 0.33333333333333326,
 4.4: 0.0277777777777779,
 7.3: 0.833333333333333,
 5.2: 0.25,
 7.2: 0.8055555555555556,
 5.4: 0.3055555555555556,
 5.8: 0.4166666666666665,
 6.1: 0.4999999999999998,
 4.8: 0.13888888888888884,
 6.6: 0.6388888888888888,
 7.0: 0.75,
 4.9: 0.16666666666666674,
 5.6: 0.36111111111111094,
 7.9: 1.0,
 7.6: 0.9166666666666665,
 7.4: 0.8611111111111112,
 5.3: 0.2777777777777777,
 6.2: 0.5277777777777779,
 6.9: 0.7222222222222223,
 5.9: 0.44444444444444464}

In [52]:
dt_test = dict(zip(x_test['sepal length (cm)'], x_test_minmax['sepal length (cm)']))
dt_test

{5.9: 0.5185185185185186,
 5.4: 0.3333333333333335,
 5.0: 0.18518518518518512,
 5.6: 0.40740740740740744,
 4.9: 0.14814814814814836,
 4.5: 0.0,
 6.9: 0.8888888888888888,
 4.8: 0.11111111111111116,
 6.4: 0.7037037037037037,
 6.7: 0.8148148148148149,
 6.0: 0.5555555555555558,
 5.2: 0.2592592592592593,
 7.2: 1.0,
 5.1: 0.2222222222222221,
 5.8: 0.4814814814814814,
 6.1: 0.5925925925925926,
 6.2: 0.6296296296296298,
 5.5: 0.37037037037037024,
 5.7: 0.44444444444444464,
 4.6: 0.03703703703703698}

In [53]:
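# For each sepal length present in both splits, print:
# raw value, value scaled by the train-fitted scaler, value scaled by the test-fitted scaler
for k, v in dt_test.items():
    if k in dt_train:
        print(k, dt_train[k], v)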

5.9 0.44444444444444464 0.5185185185185186
5.4 0.3055555555555556 0.3333333333333335
5.0 0.19444444444444442 0.18518518518518512
5.6 0.36111111111111094 0.40740740740740744
4.9 0.16666666666666674 0.14814814814814836
6.9 0.7222222222222223 0.8888888888888888
4.8 0.13888888888888884 0.11111111111111116
6.4 0.5833333333333335 0.7037037037037037
6.7 0.6666666666666667 0.8148148148148149
6.0 0.4722222222222223 0.5555555555555558
5.2 0.25 0.2592592592592593
7.2 0.8055555555555556 1.0
5.1 0.2222222222222221 0.2222222222222221
5.8 0.4166666666666665 0.4814814814814814
6.1 0.4999999999999998 0.5925925925925926
6.2 0.5277777777777779 0.6296296296296298
5.5 0.33333333333333326 0.37037037037037024
5.7 0.38888888888888884 0.44444444444444464
4.6 0.08333333333333326 0.03703703703703698
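
The mismatches above come from fitting a second time on the test set. A minimal sketch of the consistent alternative, assuming the same iris split (`random_state=7`): fit once on the training split and reuse that scaler, so any raw value maps to one scaled value in both splits. The names `mm`, `train_map`, and `test_map` are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=7)

mm = MinMaxScaler().fit(X_tr)      # min/max come from the training split only
X_tr_mm = mm.transform(X_tr)
X_te_mm = mm.transform(X_te)       # same min/max reused, no refit

# With one shared scaler, the same raw sepal length always maps to the same scaled value.
train_map = dict(zip(X_tr[:, 0], X_tr_mm[:, 0]))
test_map = dict(zip(X_te[:, 0], X_te_mm[:, 0]))
shared = set(train_map) & set(test_map)
consistent = all(abs(train_map[k] - test_map[k]) < 1e-12 for k in shared)
print(consistent)
```

With this approach, test values can fall slightly outside the 0 to 1 range if the test split contains values beyond the training minimum or maximum; that is normal and not a sign of an error.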


- Split after applying MinMaxScaler to the whole dataset

In [54]:
from sklearn.model_selection import train_test_split

df = pd.DataFrame(data=iris['data'], columns = iris['feature_names'])

scaler.fit(df)
df_minmax = scaler.transform(df)

df_minmax = pd.DataFrame(df_minmax, columns = iris['feature_names'])

df_minmax['target'] = iris['target']
x_train, x_test, y_train, y_test = train_test_split(df_minmax.iloc[:,:-1], df_minmax['target'], random_state=7)

In [55]:
svc = SVC().fit(x_train, y_train)
pred = svc.predict(x_test)
accuracy_score(y_test, pred)

0.9210526315789473