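# Data Preprocessing
- Building a good training dataset has one of the largest effects on model performance

- Goals
    1. Preprocessing that makes the dataset usable for training
        - Machine learning algorithms only handle numbers. If missing values or strings remain, the model cannot train or run inference
    2. Preprocessing that helps the model learn better
        - Feature engineering
        - Preprocessing based on domain knowledge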

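## Handling Missing Values
- Missing value: a value that was not collected or is unknown
- Machine learning algorithms generally cannot train or run inference on a dataset that contains missing values, so they must be handled
- Missing values are handled during the data preprocessing stage

- Ways to handle missing values
    1. Removal (by column or by row)
        - Remove rows by default; remove a column instead when that column has too many missing values
    2. Replacement with another value
        - Replace with the most plausible value
            - Numeric: mean, median
            - Categorical: mode
            - Train a machine learning model that predicts the missing value and use its inference
        - Replace with a value that represents "missing" itself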
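
The options above can be sketched with pandas. This is a minimal illustration on a made-up DataFrame (`toy` and its values are assumptions, not part of the adult data used later); it shows row removal, median/mode replacement, and replacement with an explicit "Unknown" category.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value per column
toy = pd.DataFrame({
    "age": [25.0, np.nan, 40.0, 31.0],
    "job": ["clerk", "sales", np.nan, "sales"],
})

# 1. Removal: drop every row that contains at least one missing value
dropped = toy.dropna()

# 2-a. Replacement with the most plausible value
filled = toy.copy()
filled["age"] = filled["age"].fillna(toy["age"].median())   # numeric: median
filled["job"] = filled["job"].fillna(toy["job"].mode()[0])  # categorical: mode

# 2-b. Replacement with a value that stands for "missing" itself
flagged = toy.copy()
flagged["job"] = flagged["job"].fillna("Unknown")

print(dropped)
print(filled)
print(flagged)
```

Which option fits best depends on how many values are missing and on whether the fact that a value is missing carries information.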

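## Handling Outliers
- Values far removed from most of the other values

- Erroneous values
    - Values that were collected incorrectly
    - Handling
        - Convert to missing values, then handle them as missing values

- Extreme values (values that fall outside the distribution)
    - Values that follow a different pattern from the rest
    - Extremely large or small values
    - Handling
        1. Keep as they are
        2. Convert to missing values, then handle them as missing values
        3. Replace with another value
            - Set the Min/Max the value may take, then change the value to that bound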
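
As a rough sketch of options 2 and 3, the snippet below bounds a made-up Series. The text does not say how to choose the Min/Max; the 1.5 × IQR rule used here is only one common convention and is an assumption of this example.

```python
import pandas as pd

# Hypothetical values; 300 plays the role of an extreme value
s = pd.Series([10, 12, 11, 13, 12, 300])

# Assumed rule (not from the text): fences at 1.5 * IQR beyond the quartiles
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Option 3: replace values outside [lower, upper] with the nearest bound
clipped = s.clip(lower=lower, upper=upper)

# Option 2: turn them into missing values and handle them later
as_missing = s.where(s.between(lower, upper))

print(lower, upper)
print(clipped.tolist())
print(as_missing.tolist())
```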

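## Preprocessing by Feature Type
### Types of Features (Variables)
- Categorical variables / discrete variables
    - Mostly made up of values that belong to one of a few categories
    - Nominal variables / unordered variables
        - The categories have no order
        - Gender, blood type
    - Ordinal variables / ordered variables
        - The categories have an order
        - Grades, job rank

- Continuous variables
    - Variables whose values are continuous, usually any real number within a given range
    - Interval variables
        - The order of the measured items and the gaps between them are known, and the gaps are equally spaced
        - 0 is used with a specific meaning and may not be an absolute zero
            - Temperature: 0 °C is not an absolute zero but the freezing point of water
    - Ratio variables
        - The order of the measured items and the gaps between them are known, and the gaps are equally spaced (same as interval variables)
        - 0 is an absolute zero
        - Age, weight, distance, income

> - A Feature made up of floating-point data is continuous
> - A Feature made up of string data is either plain text or categorical
> - A Feature made up of integer data is either categorical or continuous
>   - Check how many unique values it has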
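
One way to run the unique-value check from the note above is `DataFrame.nunique()`. The data and the cut-off `threshold` below are hypothetical; the right cut-off depends on the dataset, so treat this as a sketch only.

```python
import pandas as pd

# Hypothetical integer columns: one categorical-looking, one continuous-looking
toy = pd.DataFrame({
    "rating": [1, 3, 2, 3, 1, 2, 3, 1],
    "income": [3100, 4250, 2870, 5120, 3990, 4480, 2650, 6030],
})

# Number of unique values per column
n_unique = toy.nunique()

# Assumed cut-off: integer columns with few unique values are treated as categorical
threshold = 5
likely_categorical = n_unique[n_unique <= threshold].index.tolist()

print(n_unique)
print(likely_categorical)
```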

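# Preprocessing Categorical Data
- scikit-learn's machine learning APIs can only process Features and Labels whose values are numbers
- Strings must be converted to numbers
    - Categorical variables are converted to integers through preprocessing
    - Plain strings that are not categorical are generally removed

## Handling Categorical Features
- Label Encoding
- One-Hot Encoding

### Label Encoding
- Sorts the unique values of a categorical Feature in ascending order and maps them to integers starting at 0 and increasing by 1
- `Apply it to tree-based models (decision tree, random forest), where differences in numeric magnitude do not affect the model`
- `Do not use it with linear-family models (logistic regression, SVM, neural networks), where differences in numeric magnitude do affect the model`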

![image](https://blogfiles.pstatic.net/MjAyMDA5MTRfMjQg/MDAxNjAwMDEwMjUxOTU0.pylH6lfAa43xNG2N-GyI967J2-YhdeViwwMLZx6hwe0g.O_X_At3qfN-VsUGGLDAvVj4bYhP8ePXUe0d58wKUfWcg.PNG.dalgoon02121/1.PNG)

- Use sklearn.preprocessing.LabelEncoder
    - fit(): learns how to convert
    - transform(): converts strings to numbers (encoding)
    - fit_transform(): learns and converts in one step
    - inverse_transform(): converts numbers back to strings (decoding)
    - classes_: the classes that were encoded
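
The methods listed above can be tried on a small made-up list before applying them to a real dataset. The blood-type values here are illustrative only.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical categorical values
blood = ["B", "O", "A", "AB", "O", "A"]

le = LabelEncoder()
encoded = le.fit_transform(blood)        # learn the mapping and encode in one step
classes = le.classes_                    # unique values in ascending order
decoded = le.inverse_transform(encoded)  # map the integers back to the strings

print(classes)   # index in this array == encoded integer
print(encoded)
print(decoded)
```

The position of each value in `classes_` is the integer it is encoded to, which is why the encoder object has to be kept if the integers need to be decoded later.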

## Applying label encoding to the adult data
- US adult income dataset
- The target is income, which has two classes: income of at most $50,000 or more than $50,000

### Loading the data

In [1]:
cols = ['age', 'workclass','fnlwgt','education', 'education-num', 'marital-status', 'occupation','relationship', 'race', 'gender','capital-gain','capital-loss', 'hours-per-week','native-country', 'income']

In [2]:
import pandas as pd

data = pd.read_csv('data/adult.data',
                   header = None,
                   names = cols,
                   na_values = '?',
                   skipinitialspace = True
                   )
data.shape

(32561, 15)

In [3]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   age             32561 non-null  int64 
 1   workclass       30725 non-null  object
 2   fnlwgt          32561 non-null  int64 
 3   education       32561 non-null  object
 4   education-num   32561 non-null  int64 
 5   marital-status  32561 non-null  object
 6   occupation      30718 non-null  object
 7   relationship    32561 non-null  object
 8   race            32561 non-null  object
 9   gender          32561 non-null  object
 10  capital-gain    32561 non-null  int64 
 11  capital-loss    32561 non-null  int64 
 12  hours-per-week  32561 non-null  int64 
 13  native-country  31978 non-null  object
 14  income          32561 non-null  object
dtypes: int64(6), object(9)
memory usage: 3.7+ MB


In [4]:
data.head(3)

| | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | gender | capital-gain | capital-loss | hours-per-week | native-country | income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K |
| 1 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K |
| 2 | 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |


### Handling missing values - removal

In [5]:
data.isnull().sum()

age                  0
workclass         1836
fnlwgt               0
education            0
education-num        0
marital-status       0
occupation        1843
relationship         0
race                 0
gender               0
capital-gain         0
capital-loss         0
hours-per-week       0
native-country     583
income               0
dtype: int64

In [6]:
# Remove rows that contain missing values
df = data.dropna()
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 30162 entries, 0 to 32560
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   age             30162 non-null  int64 
 1   workclass       30162 non-null  object
 2   fnlwgt          30162 non-null  int64 
 3   education       30162 non-null  object
 4   education-num   30162 non-null  int64 
 5   marital-status  30162 non-null  object
 6   occupation      30162 non-null  object
 7   relationship    30162 non-null  object
 8   race            30162 non-null  object
 9   gender          30162 non-null  object
 10  capital-gain    30162 non-null  int64 
 11  capital-loss    30162 non-null  int64 
 12  hours-per-week  30162 non-null  int64 
 13  native-country  30162 non-null  object
 14  income          30162 non-null  object
dtypes: int64(6), object(9)
memory usage: 3.7+ MB


In [7]:
df.isnull().sum()

age               0
workclass         0
fnlwgt            0
education         0
education-num     0
marital-status    0
occupation        0
relationship      0
race              0
gender            0
capital-gain      0
capital-loss      0
hours-per-week    0
native-country    0
income            0
dtype: int64

In [8]:
print(df.shape)
print(df.income.value_counts())

(30162, 15)
<=50K    22654
>50K      7508
Name: income, dtype: int64


### Applying label encoding

In [9]:
encoding_columns = ['workclass','education','marital-status', 'occupation','relationship','race','gender','native-country', 'income']
not_encoding_columns = ['age','fnlwgt', 'education-num','capital-gain','capital-loss','hours-per-week']

In [10]:
adult_df = df.copy()
adult_df.head(3)

| | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | gender | capital-gain | capital-loss | hours-per-week | native-country | income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K |
| 1 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K |
| 2 | 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |


In [11]:
from sklearn.preprocessing import LabelEncoder
import numpy as np

le_dict = {}
for col in encoding_columns:
    le = LabelEncoder()
    adult_df[col] = le.fit_transform(adult_df[col])
    le_dict[col] = le

In [12]:
le_dict.keys()

dict_keys(['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race', 'gender', 'native-country', 'income'])

In [13]:
le_dict['workclass'].classes_

array(['Federal-gov', 'Local-gov', 'Private', 'Self-emp-inc',
       'Self-emp-not-inc', 'State-gov', 'Without-pay'], dtype=object)

In [14]:
adult_df.head(3)

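| | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | gender | capital-gain | capital-loss | hours-per-week | native-country | income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39 | 5 | 77516 | 9 | 13 | 4 | 0 | 1 | 4 | 1 | 2174 | 0 | 40 | 38 | 0 |
| 1 | 50 | 4 | 83311 | 9 | 13 | 2 | 3 | 0 | 4 | 1 | 0 | 0 | 13 | 38 | 0 |
| 2 | 38 | 2 | 215646 | 11 | 9 | 0 | 5 | 1 | 4 | 1 | 0 | 0 | 40 | 38 | 0 |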


In [15]:
# Using apply()
# Function that takes a column and label-encodes it
le_dict2 = {}

def encoding(column):
    le = LabelEncoder()
    result = le.fit_transform(column)
    le_dict2[column.name] = le
    return result

In [16]:
result_df = df.copy()
result_df.head(2)

| | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | gender | capital-gain | capital-loss | hours-per-week | native-country | income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K |
| 1 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K |


In [17]:
adult_df2 = result_df[encoding_columns].apply(encoding)
adult_df2.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 30162 entries, 0 to 32560
Data columns (total 9 columns):
 #   Column          Non-Null Count  Dtype
---  ------          --------------  -----
 0   workclass       30162 non-null  int32
 1   education       30162 non-null  int32
 2   marital-status  30162 non-null  int32
 3   occupation      30162 non-null  int32
 4   relationship    30162 non-null  int32
 5   race            30162 non-null  int32
 6   gender          30162 non-null  int32
 7   native-country  30162 non-null  int32
 8   income          30162 non-null  int32
dtypes: int32(9)
memory usage: 1.3 MB


In [19]:
result = pd.concat([adult_df2, result_df[not_encoding_columns]], axis = 1)
result.head(3)

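| | workclass | education | marital-status | occupation | relationship | race | gender | native-country | income | age | fnlwgt | education-num | capital-gain | capital-loss | hours-per-week |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5 | 9 | 4 | 0 | 1 | 4 | 1 | 38 | 0 | 39 | 77516 | 13 | 2174 | 0 | 40 |
| 1 | 4 | 9 | 2 | 3 | 0 | 4 | 1 | 38 | 0 | 50 | 83311 | 13 | 0 | 0 | 13 |
| 2 | 2 | 11 | 0 | 5 | 1 | 4 | 1 | 38 | 0 | 38 | 215646 | 9 | 0 | 0 | 40 |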


### Modeling income prediction on the Adult dataset

In [20]:
adult_df.head()

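| | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | gender | capital-gain | capital-loss | hours-per-week | native-country | income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39 | 5 | 77516 | 9 | 13 | 4 | 0 | 1 | 4 | 1 | 2174 | 0 | 40 | 38 | 0 |
| 1 | 50 | 4 | 83311 | 9 | 13 | 2 | 3 | 0 | 4 | 1 | 0 | 0 | 13 | 38 | 0 |
| 2 | 38 | 2 | 215646 | 11 | 9 | 0 | 5 | 1 | 4 | 1 | 0 | 0 | 40 | 38 | 0 |
| 3 | 53 | 2 | 234721 | 1 | 7 | 2 | 5 | 0 | 2 | 1 | 0 | 0 | 40 | 38 | 0 |
| 4 | 28 | 2 | 338409 | 9 | 13 | 2 | 9 | 5 | 2 | 0 | 0 | 0 | 40 | 4 | 0 |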


### Splitting the data
- Split into X and y
- Split into train/validation/test sets

In [21]:
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

In [22]:
# Separate X and y from adult_df
y = adult_df['income']
X = adult_df.drop(columns = 'income')

In [23]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, stratify = y, random_state = 0)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size = 0.2, stratify = y_train, random_state = 0)
print(X_train.shape, X_val.shape, X_test.shape)

(19303, 14) (4826, 14) (6033, 14)


### Creating and training the model
- DecisionTreeClassifier
- Uses the train set

In [24]:
max_depth = 7
tree = DecisionTreeClassifier(max_depth = max_depth, random_state = 0)
tree.fit(X_train, y_train)

DecisionTreeClassifier(max_depth=7, random_state=0)

### Validation
- Metric: accuracy
- Uses the train set and the validation set

In [25]:
# Validation
pred_train = tree.predict(X_train)
pred_val = tree.predict(X_val)

# Compute accuracy
train_acc = accuracy_score(y_train, pred_train)
val_acc = accuracy_score(y_val, pred_val)

In [26]:
print(f"max_depth: {max_depth}")
print("train accuracy:", train_acc)
print("val accuracy:", val_acc)

max_depth: 7
train accuracy: 0.8564471843754857
val accuracy: 0.855781185246581


### Final evaluation
- Final evaluation on the test set

In [27]:
pred_test = tree.predict(X_test)
test_acc = accuracy_score(y_test, pred_test)
print('Final evaluation result: ', test_acc)

Final evaluation result:  0.8488314271506713


In [28]:
# cross validation (pass the fresh, unfitted estimator tree2)
tree2 = DecisionTreeClassifier(max_depth = 7, random_state = 0)
result = cross_val_score(tree2, X, y, scoring='accuracy', cv = 5)

In [29]:
result

array([0.84767114, 0.8478369 , 0.8512931 , 0.85460875, 0.84930371])

In [30]:
np.mean(result)

0.8501427218819921

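## One-Hot Encoding
- Converts N classes into N-dimensional one-hot vectors
    - Each unique value becomes a Feature (column); the column that matches the row's value is set to 1 and the rest to 0
- For linear-family models (logistic regression, SVM, neural networks), where differences in numeric magnitude affect the model, use One-Hot Encoding rather than Label Encoding when converting categorical data
- DecisionTree-family algorithms can perform worse when Features contain many zeros (a sparse matrix), so Label Encoding is generally used for them.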

![image](https://blogfiles.pstatic.net/MjAyMDA5MTRfMTM3/MDAxNjAwMDQ1ODAyNzMz.E9qug25o4TesPxstb7XqqlHaPesC6np5dbq3Xfsro1Qg.783mGfoBFOVAp6uf_Uu1of1vVXjdPTaPCEYGqMD_HRsg.PNG.dalgoon02121/3.PNG)

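### Converting with One-Hot Encoding
- Scikit-learn
    - Use sklearn.preprocessing.OneHotEncoder
        - fit(dataset): learns how to convert, based on the dataset
        - transform(dataset): one-hot encodes the dataset passed as the argument
        - fit_transform(dataset): learns and converts in one step
        - get_feature_names_out(): returns the names of the Features (columns) produced by one-hot encoding
        - Pass the dataset as a 2-D array; each Feature is one-hot encoded separately.
            - A DataFrame also works
            - Values of every type are converted, so gather only the variables to convert before encoding

- Pandas
    - Use the pandas.get_dummies(DataFrame [, columns = [column names to convert]]) function
    - By default, only the categorical (object, category) variables in the DataFrame are converted

> - When a categorical variable holds numeric values
>   - List those columns explicitly with the columns parameter: get_dummies(columns = ["column name", "column name"])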

## adult dataset - applying One-Hot Encoding

### Loading the data

In [31]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import OneHotEncoder, LabelEncoder

In [32]:
cols = ['age', 'workclass','fnlwgt','education', 'education-num', 'marital-status', 'occupation','relationship', 'race', 'gender','capital-gain','capital-loss', 'hours-per-week','native-country', 'income']

In [34]:
data = pd.read_csv('data/adult.data',
                   header = None,
                   names = cols,
                   na_values = '?',
                   skipinitialspace = True
                   )
print(data.shape)

(32561, 15)


### Selecting only the needed Features

In [35]:
adult_df = data[['age', 'workclass','education', 'occupation', 'gender', 'hours-per-week', 'income']].copy()
adult_df.head(3)

| | age | workclass | education | occupation | gender | hours-per-week | income |
|---|---|---|---|---|---|---|---|
| 0 | 39 | State-gov | Bachelors | Adm-clerical | Male | 40 | <=50K |
| 1 | 50 | Self-emp-not-inc | Bachelors | Exec-managerial | Male | 13 | <=50K |
| 2 | 38 | Private | HS-grad | Handlers-cleaners | Male | 40 | <=50K |


In [36]:
adult_df.dropna(inplace = True)

In [37]:
adult_df.isnull().sum()

age               0
workclass         0
education         0
occupation        0
gender            0
hours-per-week    0
income            0
dtype: int64

In [38]:
adult_df.shape

(30718, 7)

In [39]:
# Dropping rows leaves gaps in a sequential index; reset the index so it is consecutive again
adult_df.reset_index(drop = True, inplace = True)
adult_df

| | age | workclass | education | occupation | gender | hours-per-week | income |
|---|---|---|---|---|---|---|---|
| 0 | 39 | State-gov | Bachelors | Adm-clerical | Male | 40 | <=50K |
| 1 | 50 | Self-emp-not-inc | Bachelors | Exec-managerial | Male | 13 | <=50K |
| 2 | 38 | Private | HS-grad | Handlers-cleaners | Male | 40 | <=50K |
| 3 | 53 | Private | 11th | Handlers-cleaners | Male | 40 | <=50K |
| 4 | 28 | Private | Bachelors | Prof-specialty | Female | 40 | <=50K |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 30713 | 27 | Private | Assoc-acdm | Tech-support | Female | 38 | <=50K |
| 30714 | 40 | Private | HS-grad | Machine-op-inspct | Male | 40 | >50K |
| 30715 | 58 | Private | HS-grad | Adm-clerical | Female | 40 | <=50K |
| 30716 | 22 | Private | HS-grad | Adm-clerical | Male | 20 | <=50K |


In [40]:
category_columns = ['workclass', 'education', 'occupation', 'gender']
continuous_columns = ['age', 'hours-per-week']
target = 'income'

In [41]:
# Label-encode income (the target) and assign the result to y
le = LabelEncoder()
y = le.fit_transform(adult_df[target])
y.shape

(30718,)

In [42]:
X = adult_df.drop(columns = 'income')
X.head()

| | age | workclass | education | occupation | gender | hours-per-week |
|---|---|---|---|---|---|---|
| 0 | 39 | State-gov | Bachelors | Adm-clerical | Male | 40 |
| 1 | 50 | Self-emp-not-inc | Bachelors | Exec-managerial | Male | 13 |
| 2 | 38 | Private | HS-grad | Handlers-cleaners | Male | 40 |
| 3 | 53 | Private | 11th | Handlers-cleaners | Male | 40 |
| 4 | 28 | Private | Bachelors | Prof-specialty | Female | 40 |


### Applying one-hot encoding

In [43]:
# Convert with get_dummies()
X_ohe = pd.get_dummies(X)

X_ohe.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30718 entries, 0 to 30717
Data columns (total 41 columns):
 #   Column                        Non-Null Count  Dtype
---  ------                        --------------  -----
 0   age                           30718 non-null  int64
 1   hours-per-week                30718 non-null  int64
 2   workclass_Federal-gov         30718 non-null  uint8
 3   workclass_Local-gov           30718 non-null  uint8
 4   workclass_Private             30718 non-null  uint8
 5   workclass_Self-emp-inc        30718 non-null  uint8
 6   workclass_Self-emp-not-inc    30718 non-null  uint8
 7   workclass_State-gov           30718 non-null  uint8
 8   workclass_Without-pay         30718 non-null  uint8
 9   education_10th                30718 non-null  uint8
 10  education_11th                30718 non-null  uint8
 11  education_12th                30718 non-null  uint8
 12  education_1st-4th             30718 non-null  uint8
 13  education_5th-6th             3

In [44]:
# Convert with scikit-learn's OneHotEncoder
# (sparse_output requires scikit-learn >= 1.2; older versions use sparse = False)
ohe = OneHotEncoder(sparse_output = False)
values = ohe.fit_transform(X[category_columns])
X_ohe2 = np.concatenate([values, X[continuous_columns].values], axis = 1)
X_ohe2.shape, X_ohe.shape

((30718, 41), (30718, 41))

### Training the model
- Split into train, validation, and test sets

In [46]:
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

In [47]:
X_train, X_test, y_train, y_test = train_test_split(X_ohe, y, test_size = 0.2, stratify = y, random_state = 0)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size = 0.25, stratify = y_train, random_state = 0)
print(X_train.shape, X_val.shape, X_test.shape)

(18430, 41) (6144, 41) (6144, 41)


### Creating the models
- DecisionTreeClassifier
- LogisticRegression

### Validation

In [49]:
tree = DecisionTreeClassifier(max_depth = 7, random_state = 0)
lr = LogisticRegression(max_iter = 10000, random_state = 0)

estimators = [
    ('DecisionTree', tree),
    ('LogisticRegression', lr)
]

for name, model in estimators:
    model.fit(X_train, y_train)
    pred_train = model.predict(X_train)
    pred_val = model.predict(X_val)
    train_acc = accuracy_score(y_train, pred_train)
    val_acc = accuracy_score(y_val, pred_val)
    print(f"{name}, trainset accuracy: {train_acc}")
    print(f"{name}, validation set accuracy: {val_acc}")
    print('='*50)
    

DecisionTree, trainset accuracy: 0.8116115029842648
DecisionTree, validation set accuracy: 0.7985026041666666
LogisticRegression, trainset accuracy: 0.8075420510037982
LogisticRegression, validation set accuracy: 0.8059895833333334


### Evaluation

In [50]:
for name, model in estimators:
    pred_test = model.predict(X_test)
    test_acc = accuracy_score(y_test, pred_test)
    print(f"{name} test set accuracy: {test_acc}")

DecisionTree test set accuracy: 0.79736328125
LogisticRegression test set accuracy: 0.8064778645833334


In [51]:
# Modeling with cross-validation
result_tree = cross_val_score(DecisionTreeClassifier(max_depth = 7, random_state = 0),
                              X_ohe, y, scoring = 'accuracy', cv = 4,
                              n_jobs = -1
                              )
print(result_tree)
print(np.mean(result_tree))

[0.7953125  0.80377604 0.80609454 0.80622477]
0.8028519635192841


In [53]:
result_lr = cross_val_score(LogisticRegression(max_iter = 1000, random_state = 0),
                            X_ohe, y, scoring = 'accuracy', cv = 4 ,
                            n_jobs = -1
                            )
print(result_lr)
print(np.mean(result_lr))

[0.80338542 0.80742187 0.80830837 0.80700612]
0.806530446435354


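# Preprocessing Continuous (Numeric) Data
- Continuous data is data whose values are continuous; usually any real number within a given range can be a value

### Feature Scaling (Normalization)
- When Features have different value ranges, rescale them to a common range
- Most machine learning algorithms other than tree-based ones are affected by Features being on different scales
    - Linear models, SVM models, neural network models
- `Fit the scaler on the train set; convert the test set and any new data to predict with the scaler fitted on the train set`
    - `Use the scaler fitted on the train set to transform the Train/Validation/Test sets`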
### Types
- Standardization Scaling
    - Uses StandardScaler
- Min Max Scaling
    - Uses MinMaxScaler

### Methods
- fit(): learns how to convert
    - Takes a 2-D array and learns along axis 0 (per column)
- transform(): converts
    - Takes a 2-D array and converts along axis 0 (per column)
- fit_transform(): learns and converts in one step
- inverse_transform(): restores converted values to the original values
## Standardization (StandardScaler)
- Converts each Feature so that its values have mean 0 and standard deviation 1
    - All the data gathers around 0
- Uses sklearn.preprocessing.StandardScaler
$$
New\,x_i = \cfrac{x_i-\mu}{\sigma}\\
\mu: \text{mean},\;  \sigma: \text{standard deviation}
$$
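
To connect the formula to the API, the sketch below standardizes a tiny made-up array (`X_toy` is an assumption) both with `StandardScaler` and by hand, then restores it with `inverse_transform`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical data: two features on very different scales
X_toy = np.array([[1.0, 100.0],
                  [2.0, 300.0],
                  [3.0, 500.0]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_toy)   # learns mean/std per column, then converts

# The same computation written out from the formula
mu = X_toy.mean(axis=0)
sigma = X_toy.std(axis=0)                # population standard deviation (ddof=0)
manual = (X_toy - mu) / sigma

X_back = scaler.inverse_transform(X_scaled)  # back to the original values

print(X_scaled)
print(np.allclose(X_scaled, manual))
print(np.allclose(X_back, X_toy))
```

`StandardScaler` divides by the population standard deviation (ddof=0), whereas pandas' `std` defaults to the sample standard deviation (ddof=1). A DataFrame check on a scaled train set of n rows therefore reports sqrt(n/(n-1)) instead of exactly 1, which for 364 rows is about 1.001376.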

## MinMaxScaler
- Converts all values in the dataset to values between 0 (the Min value) and 1 (the Max value)
$$
New\,x_i = \cfrac{x_i - min(X)}{max(X) - min(X)}
$$
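
The formula can be checked against `MinMaxScaler` on made-up numbers (`X_train_toy` and `X_new_toy` are assumptions). The sketch also shows what happens when data that was not used for fitting falls outside the fitted range.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical single-feature data
X_train_toy = np.array([[10.0], [20.0], [30.0]])
X_new_toy = np.array([[5.0], [25.0], [40.0]])   # partly outside the train range

scaler = MinMaxScaler()
train_scaled = scaler.fit_transform(X_train_toy)  # learns min=10, max=30

# The same computation written out from the formula
x_min = X_train_toy.min(axis=0)
x_max = X_train_toy.max(axis=0)
manual = (X_train_toy - x_min) / (x_max - x_min)

# New data is converted with the min/max learned from the train set
new_scaled = scaler.transform(X_new_toy)

print(train_scaled.ravel())
print(new_scaled.ravel())
```

Because the min and max come from the train set only, validation or test values can land below 0 or above 1 after conversion. This is expected and is not a sign that the scaler was applied incorrectly.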

### Wisconsin Breast Cancer Dataset
- Breast cancer diagnosis data provided by the University of Wisconsin
- Features: tumor measurements
    - All Features are continuous
- target: malignant or benign

In [54]:
from sklearn.datasets import load_breast_cancer
data = load_breast_cancer()

X = data.data
y = data.target
X.shape, y.shape

((569, 30), (569,))

In [55]:
# Ratio of malignant (0) to benign (1)
np.unique(y, return_counts = True)[1]/y.size

array([0.37258348, 0.62741652])

### Splitting the data
- train/test : test_size = 0.2
- train/val : test_size = 0.2

In [56]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, MinMaxScaler
import numpy as np
import pandas as pd

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, stratify=y, random_state = 0)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size = 0.2, stratify=y_train, random_state = 0)
X_train.shape, X_val.shape, X_test.shape

((364, 30), (91, 30), (114, 30))

### Scaling

#### Standardization (StandardScaler)
- Assign the converted result for each dataset to the following variables
    - X_train_scaled1, X_val_scaled1, X_test_scaled1

In [57]:
# Mean and standard deviation before scaling
train_df = pd.DataFrame(X_train, columns=data.feature_names)
val_df = pd.DataFrame(X_val, columns = data.feature_names)
test_df = pd.DataFrame(X_test, columns = data.feature_names)

# Mean and standard deviation of the train set
train_df.agg(['mean', 'std']).T

| | mean | std |
|---|---|---|
| mean radius | 14.173478 | 3.622847 |
| mean texture | 19.174231 | 4.240037 |
| mean perimeter | 92.28717 | 24.999251 |
| mean area | 661.099451 | 365.089909 |
| mean smoothness | 0.096343 | 0.014091 |
| mean compactness | 0.104168 | 0.053197 |
| mean concavity | 0.08862 | 0.078731 |
| mean concave points | 0.049121 | 0.039568 |
| mean symmetry | 0.18029 | 0.027193 |
| mean fractal dimension | 0.062672 | 0.006991 |


In [58]:
# test set
test_df.agg(['mean', 'std']).T

| | mean | std |
|---|---|---|
| mean radius | 14.073789 | 3.213541 |
| mean texture | 19.532544 | 4.075435 |
| mean perimeter | 91.648509 | 22.223488 |
| mean area | 643.833333 | 304.290282 |
| mean smoothness | 0.095877 | 0.014783 |
| mean compactness | 0.105227 | 0.055819 |
| mean concavity | 0.092296 | 0.0901 |
| mean concave points | 0.048186 | 0.038487 |
| mean symmetry | 0.18152 | 0.028763 |
| mean fractal dimension | 0.063111 | 0.007385 |


In [59]:
# validation set
val_df.agg(['mean', 'std']).T

| | mean | std |
|---|---|---|
| mean radius | 14.009571 | 3.528089 |
| mean texture | 19.447033 | 4.819788 |
| mean perimeter | 91.098022 | 24.171235 |
| mean area | 643.897802 | 356.786712 |
| mean smoothness | 0.097035 | 0.013129 |
| mean compactness | 0.103921 | 0.047704 |
| mean concavity | 0.085136 | 0.069899 |
| mean concave points | 0.04903 | 0.03643 |
| mean symmetry | 0.184201 | 0.02663 |
| mean fractal dimension | 0.06291 | 0.006986 |


In [60]:
# Scaling
s_scaler = StandardScaler()

X_train_scaled1 = s_scaler.fit_transform(X_train)
X_val_scaled1 = s_scaler.transform(X_val)
X_test_scaled1 = s_scaler.transform(X_test)

### Check
- Check the mean and standard deviation

In [61]:
pd.DataFrame(X_train_scaled1).agg(['mean', 'std']).T

| | mean | std |
|---|---|---|
| 0 | -2.573491e-15 | 1.001376 |
| 1 | -1.329218e-15 | 1.001376 |
| 2 | -6.431058e-16 | 1.001376 |
| 3 | 2.443101e-16 | 1.001376 |
| 4 | -8.979386e-16 | 1.001376 |
| 5 | 4.886201e-16 | 1.001376 |
| 6 | 4.111485e-16 | 1.001376 |
| 7 | -5.053955e-16 | 1.001376 |
| 8 | -1.679975e-15 | 1.001376 |
| 9 | 2.691986e-15 | 1.001376 |


In [62]:
pd.DataFrame(X_val_scaled1).agg(['mean', 'std']).T

| | mean | std |
|---|---|---|
| 0 | -0.045305 | 0.975185 |
| 1 | 0.064428 | 1.138297 |
| 2 | -0.047633 | 0.968209 |
| 3 | -0.047181 | 0.978602 |
| 4 | 0.04917 | 0.933004 |
| 5 | -0.004655 | 0.89799 |
| 6 | -0.044309 | 0.889047 |
| 7 | -0.002286 | 0.921953 |
| 8 | 0.144029 | 0.980635 |
| 9 | 0.034098 | 1.000597 |


In [63]:
pd.DataFrame(X_test_scaled1).agg(['mean', 'std']).T

| | mean | std |
|---|---|---|
| 0 | -0.027555 | 0.888242 |
| 1 | 0.084623 | 0.962502 |
| 2 | -0.025582 | 0.89019 |
| 3 | -0.047358 | 0.834614 |
| 4 | -0.033099 | 1.050593 |
| 5 | 0.019931 | 1.050728 |
| 6 | 0.04675 | 1.145982 |
| 7 | -0.023651 | 0.97402 |
| 8 | 0.045306 | 1.059172 |
| 9 | 0.062934 | 1.057814 |


In [64]:
print('train set mean, standard deviation')
print(X_train_scaled1.mean(axis = 0))
print(X_train_scaled1.std(axis = 0))

print('val set mean, standard deviation')
print(X_val_scaled1.mean(axis = 0))
print(X_val_scaled1.std(axis = 0))

print('test set mean, standard deviation')
print(X_test_scaled1.mean(axis = 0))
print(X_test_scaled1.std(axis = 0))

train set mean, standard deviation
[-2.57349087e-15 -1.32921757e-15 -6.43105837e-16  2.44310067e-16
 -8.97938622e-16  4.88620133e-16  4.11148527e-16 -5.05395481e-16
 -1.67997484e-15  2.69198583e-15 -3.18807862e-16  8.82688306e-16
 -4.05810916e-16  6.43563347e-16  7.83179992e-16 -3.52434809e-16
  5.37573649e-16 -8.97824245e-16 -1.95623432e-15  1.57078258e-16
 -2.40116230e-16 -7.12189770e-17 -1.59483995e-15  1.17412185e-15
  5.79512018e-15  1.85504847e-15  2.56205313e-17 -3.49232243e-16
  5.84392120e-16 -1.46769044e-15]
[1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
 1. 1. 1. 1. 1. 1.]
val set mean, standard deviation
[-0.04530476  0.06442814 -0.04763283 -0.04718105  0.0491703  -0.00465532
 -0.04430904 -0.00228626  0.1440289   0.03409771  0.11236541  0.01312549
  0.11299126  0.09671882  0.133951   -0.00739747 -0.05742507  0.02280448
  0.04359547  0.06701324 -0.02866774 -0.0063776  -0.02648608 -0.03962948
  0.0723013  -0.01883656 -0.0728466  -0.01860934  0.00212971  0.03249858]
[0.96981196 1.

### MinMax Scaling
- Assign the converted result for each dataset to the following variables
    - X_train_scaled2, X_val_scaled2, X_test_scaled2

In [65]:
# Check the min/max values before scaling
train_df.agg(['min', 'max']).T

| | min | max |
|---|---|---|
| mean radius | 7.691 | 27.42 |
| mean texture | 9.71 | 33.81 |
| mean perimeter | 47.92 | 186.9 |
| mean area | 170.4 | 2501.0 |
| mean smoothness | 0.05263 | 0.1425 |
| mean compactness | 0.01938 | 0.3454 |
| mean concavity | 0.0 | 0.3754 |
| mean concave points | 0.0 | 0.1913 |
| mean symmetry | 0.106 | 0.304 |
| mean fractal dimension | 0.04996 | 0.09744 |


In [66]:
val_df.agg(['min', 'max']).T

| | min | max |
|---|---|---|
| mean radius | 8.571 | 28.11 |
| mean texture | 11.28 | 39.28 |
| mean perimeter | 54.53 | 188.5 |
| mean area | 221.3 | 2499.0 |
| mean smoothness | 0.06883 | 0.1335 |
| mean compactness | 0.03813 | 0.277 |
| mean concavity | 0.0 | 0.3514 |
| mean concave points | 0.0 | 0.1595 |
| mean symmetry | 0.1342 | 0.2743 |
| mean fractal dimension | 0.05054 | 0.09575 |


In [67]:
test_df.agg(['min', 'max']).T

| | min | max |
|---|---|---|
| mean radius | 6.981 | 24.25 |
| mean texture | 12.17 | 32.47 |
| mean perimeter | 43.79 | 166.2 |
| mean area | 143.5 | 1761.0 |
| mean smoothness | 0.06613 | 0.1634 |
| mean compactness | 0.0265 | 0.2867 |
| mean concavity | 0.0 | 0.4268 |
| mean concave points | 0.0 | 0.2012 |
| mean symmetry | 0.1167 | 0.2678 |
| mean fractal dimension | 0.05025 | 0.09502 |


In [68]:
#MinMax Scaling
mm_scaler = MinMaxScaler()
X_train_scaled2 = mm_scaler.fit_transform(X_train)
X_val_scaled2 = mm_scaler.transform(X_val)
X_test_scaled2 = mm_scaler.transform(X_test)

### Check
- Check the min and max values

In [69]:
pd.DataFrame(X_train_scaled2).agg(['min', 'max']).T

| | min | max |
|---|---|---|
| 0 | 0.0 | 1.0 |
| 1 | 0.0 | 1.0 |
| 2 | 0.0 | 1.0 |
| 3 | 0.0 | 1.0 |
| 4 | 0.0 | 1.0 |
| 5 | 0.0 | 1.0 |
| 6 | 0.0 | 1.0 |
| 7 | 0.0 | 1.0 |
| 8 | 0.0 | 1.0 |
| 9 | 0.0 | 1.0 |


In [70]:
pd.DataFrame(X_val_scaled2).agg(['min', 'max']).T

| | min | max |
|---|---|---|
| 0 | 0.044604 | 1.034974 |
| 1 | 0.065145 | 1.226971 |
| 2 | 0.047561 | 1.011512 |
| 3 | 0.02184 | 0.999142 |
| 4 | 0.18026 | 0.899855 |
| 5 | 0.057512 | 0.790197 |
| 6 | 0.0 | 0.936068 |
| 7 | 0.0 | 0.833769 |
| 8 | 0.142424 | 0.85 |
| 9 | 0.012216 | 0.964406 |


In [71]:
pd.DataFrame(X_test_scaled2).agg(['min', 'max']).T

| | min | max |
|---|---|---|
| 0 | -0.035988 | 0.839323 |
| 1 | 0.102075 | 0.944398 |
| 2 | -0.029717 | 0.851058 |
| 3 | -0.011542 | 0.682485 |
| 4 | 0.150217 | 1.232558 |
| 5 | 0.021839 | 0.81995 |
| 6 | 0.0 | 1.136921 |
| 7 | 0.0 | 1.051751 |
| 8 | 0.05404 | 0.817172 |
| 9 | 0.006108 | 0.949031 |


In [72]:
print('train set Min values')
print(X_train_scaled2.min(axis=0))
print('Max values')
print(X_train_scaled2.max(axis=0))

print('val set Min values')
print(X_val_scaled2.min(axis=0))
print('Max values')
print(X_val_scaled2.max(axis=0))

print('test set Min values')
print(X_test_scaled2.min(axis=0))
print('Max values')
print(X_test_scaled2.max(axis=0))

train set Min values
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0.]
Max values
[1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
 1. 1. 1. 1. 1. 1.]
val set Min values
[0.04460439 0.06514523 0.0475608  0.02183987 0.18026038 0.05751181
 0.         0.         0.14242424 0.01221567 0.00119072 0.00594501
 0.00921589 0.00084423 0.03824319 0.02523508 0.         0.
 0.05006473 0.01355823 0.02905489 0.05810235 0.04478674 0.01290195
 0.13775342 0.04187303 0.         0.         0.01636113 0.00178026]
Max values
[1.0349739  1.22697095 1.01151245 0.99914185 0.89985535 0.79019692
 0.93606819 0.83376895 0.85       0.96440607 1.13385342 0.70893741
 1.1861063  0.96899503 0.68181664 0.60622766 0.29262673 1.04150751
 0.56210953 0.56786719 0.80666618 0.87553305 0.73717655 0.72434498
 0.85471835 0.92334809 0.74976038 0.91284878 0.55450424 0.68972533]
test set Min values
[-0.03598763  0.10207469 -0.02971651 -0.01154209  0.15021698  0.02183915
  0.          0.   

### Modeling

In [73]:
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

#### Using the unscaled dataset

In [74]:
# Create the model
# svc = SVC(C=0.1, gamma=0.1, random_state = 0) # C, gamma - hyper parameter
svc = SVC(random_state = 0)
# Train
svc.fit(X_train, y_train)

# Validation/evaluation
## Inference
pred_train = svc.predict(X_train)
pred_val = svc.predict(X_val)
pred_test = svc.predict(X_test)
print('train:', accuracy_score(y_train, pred_train))
print('val:', accuracy_score(y_val, pred_val))
print('test:', accuracy_score(y_test, pred_test))

train: 0.9203296703296703
val: 0.9120879120879121
test: 0.9122807017543859


#### Using the StandardScaler dataset

In [75]:
# Create the model
# svc = SVC(C=0.1, gamma=0.1, random_state = 0)
svc = SVC(random_state = 0)

# Train
svc.fit(X_train_scaled1, y_train)

# Validation/evaluation
## Inference
pred_train1 = svc.predict(X_train_scaled1)
pred_val1 = svc.predict(X_val_scaled1)
pred_test1 = svc.predict(X_test_scaled1)
print('train:', accuracy_score(y_train, pred_train1))
print('val:', accuracy_score(y_val, pred_val1))
print('test:', accuracy_score(y_test, pred_test1))

train: 0.9917582417582418
val: 0.989010989010989
test: 0.9473684210526315


#### Using the MinMax Scaling dataset

In [76]:
# Create the model
# svc = SVC(C=0.1, gamma=0.1, random_state = 0)
svc = SVC(random_state = 0)

# Train
svc.fit(X_train_scaled2, y_train)

# Validation/evaluation
## Inference
pred_train2 = svc.predict(X_train_scaled2)
pred_val2 = svc.predict(X_val_scaled2)
pred_test2 = svc.predict(X_test_scaled2)
print('train:', accuracy_score(y_train, pred_train2))
print('val:', accuracy_score(y_val, pred_val2))
print('test:', accuracy_score(y_test, pred_test2))

train: 0.9862637362637363
val: 0.989010989010989
test: 0.9473684210526315


### `Feature scaling improved performance here (SVC test accuracy 0.912 to 0.947); continuous data needs scaling for scale-sensitive models`