# Preprocessing

- Data Cleansing: cleaning the data
- Feature Engineering: creating and transforming features
- Data Encoding: converting text (categorical) data to numbers
- Data Scaling: normalizing numeric values
- Outlier: handling outliers

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

### Data Encoding

##### Label Encoder
- Converts categorical data into integer labels

In [2]:
from sklearn.preprocessing import LabelEncoder

candies = ['캬라멜', '커피사탕', '땅콩캬라멜', '아몬드사탕', '페레로로쉐', '커피사탕', '아몬드사탕']

lb_enc = LabelEncoder()

lb_enc.fit(candies)     # deduplicate and sort in ascending order, then assign integer labels
encoded_candies = lb_enc.transform(candies)
encoded_candies

array([2, 3, 0, 1, 4, 3, 1])

In [3]:
lb_enc.classes_

array(['땅콩캬라멜', '아몬드사탕', '캬라멜', '커피사탕', '페레로로쉐'], dtype='<U5')
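
The fit/transform steps above can be reproduced by hand to see where the integers come from. The following is a minimal sketch in plain Python, not scikit-learn's actual implementation; the names `classes`, `to_label`, `encoded`, and `decoded` are illustrative.

```python
candies = ['캬라멜', '커피사탕', '땅콩캬라멜', '아몬드사탕', '페레로로쉐', '커피사탕', '아몬드사탕']

# "fit": deduplicate, then sort in ascending order
classes = sorted(set(candies))

# each class gets its position in the sorted list as its label
to_label = {name: idx for idx, name in enumerate(classes)}

# "transform": look up the label of every item
encoded = [to_label[name] for name in candies]

# the mapping is reversible, so labels can be decoded back to the original strings
decoded = [classes[idx] for idx in encoded]

print(encoded)
print(decoded == candies)
```

The integers only reflect sort order, so they carry no meaningful ranking among the categories.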

##### One-hot Encoder
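- Converts the given data into a sparse array with one column per category (one-vs-rest layout)
- What is a sparse array? An array in which most entries are 0 and only a few indices hold a value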

In [4]:
from sklearn.preprocessing import OneHotEncoder

candies_2d = np.array(candies).reshape(-1, 1)

oh_enc = OneHotEncoder()

oh_enc.fit(candies_2d)      # deduplicate and sort in ascending order; each row gets a 1 only at its category's index (sparse matrix)
encoded_candies = oh_enc.transform(candies_2d)
encoded_candies

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 7 stored elements and shape (7, 5)>

In [5]:
print(encoded_candies)

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 7 stored elements and shape (7, 5)>
  Coords	Values
  (0, 2)	1.0
  (1, 3)	1.0
  (2, 0)	1.0
  (3, 1)	1.0
  (4, 4)	1.0
  (5, 3)	1.0
  (6, 1)	1.0


In [6]:
print(encoded_candies.toarray())

[[0. 0. 1. 0. 0.]
 [0. 0. 0. 1. 0.]
 [1. 0. 0. 0. 0.]
 [0. 1. 0. 0. 0.]
 [0. 0. 0. 0. 1.]
 [0. 0. 0. 1. 0.]
 [0. 1. 0. 0. 0.]]


In [7]:
oh_enc.categories_

[array(['땅콩캬라멜', '아몬드사탕', '캬라멜', '커피사탕', '페레로로쉐'], dtype='<U5')]
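
The dense array printed by `toarray()` can also be built by hand from integer labels, which makes the "one 1 per row" structure explicit. This is a minimal NumPy sketch, not how `OneHotEncoder` works internally; `labels`, `n_classes`, and `one_hot` are illustrative names, and the labels are the ones produced by the label encoder earlier.

```python
import numpy as np

labels = np.array([2, 3, 0, 1, 4, 3, 1])  # integer labels of the seven candies
n_classes = 5

# row i of the identity matrix is the one-hot vector for class i,
# so indexing the identity matrix by label builds the whole dense array
one_hot = np.eye(n_classes)[labels]

print(one_hot)
print(one_hot.sum(axis=1))  # every row contains exactly one 1
```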

In [8]:
# One-hot encoding on a DataFrame
candies_df = pd.DataFrame({
    'candy': ['캬라멜', '커피사탕', '땅콩캬라멜', '아몬드사탕', '페레로로쉐', '커피사탕', '아몬드사탕']
})
# candies_df

df_dummies = pd.get_dummies(candies_df)

# Convert the DataFrame to an ndarray (the two calls below are equivalent)
df_dummies.to_numpy()
np.array(df_dummies)

array([[False, False,  True, False, False],
       [False, False, False,  True, False],
       [ True, False, False, False, False],
       [False,  True, False, False, False],
       [False, False, False, False,  True],
       [False, False, False,  True, False],
       [False,  True, False, False, False]])
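
The output above is boolean. If 0/1 integers are preferred, passing `dtype=int` to `pd.get_dummies` is one option. The sketch below assumes a pandas version that supports the `dtype` argument; `df_dummies_int` is an illustrative name.

```python
import pandas as pd

candies_df = pd.DataFrame({
    'candy': ['캬라멜', '커피사탕', '땅콩캬라멜', '아몬드사탕', '페레로로쉐', '커피사탕', '아몬드사탕']
})

# one column per category, named '<column>_<category>', filled with 0/1 integers
df_dummies_int = pd.get_dummies(candies_df, dtype=int)

print(df_dummies_int.columns.tolist())
print(df_dummies_int.to_numpy())
```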

### Data Scaling

In [9]:
from sklearn.datasets import load_iris

iris_ds = load_iris()
iris_features = iris_ds.data

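##### Standard Scaler (standardization, z-score transform)
- Rescales the data so that each feature has mean 0 and standard deviation 1
- Works best when the data is approximately normally distributed
- Less sensitive to outliers than min-max scaling, although outliers still shift the mean and standard deviation (well suited to algorithms such as linear regression and logistic regression)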
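
As a worked illustration of the definition above, the z-score can be computed directly with NumPy. This is a sketch on a small made-up array (`x` is not the iris data); it uses the population standard deviation (`ddof=0`), which is also what `StandardScaler` uses.

```python
import numpy as np

# made-up data: 4 samples, 2 features on very different scales
x = np.array([[1.0, 100.0],
              [2.0, 200.0],
              [3.0, 300.0],
              [4.0, 400.0]])

mean = x.mean(axis=0)   # per-feature mean
std = x.std(axis=0)     # per-feature population standard deviation (ddof=0)

z = (x - mean) / std    # z = (x - mean) / std

print(z.mean(axis=0))   # approximately 0 for every feature
print(z.std(axis=0))    # approximately 1 for every feature
```

After scaling, both columns hold the same values even though the raw features differ by a factor of 100.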

In [10]:
from sklearn.preprocessing import StandardScaler

standard_sc = StandardScaler()
standard_sc.fit(iris_features)
standard_sc.transform(iris_features)

array([[-9.00681170e-01,  1.01900435e+00, -1.34022653e+00,
        -1.31544430e+00],
       [-1.14301691e+00, -1.31979479e-01, -1.34022653e+00,
        -1.31544430e+00],
       [-1.38535265e+00,  3.28414053e-01, -1.39706395e+00,
        -1.31544430e+00],
       [-1.50652052e+00,  9.82172869e-02, -1.28338910e+00,
        -1.31544430e+00],
       [-1.02184904e+00,  1.24920112e+00, -1.34022653e+00,
        -1.31544430e+00],
       [-5.37177559e-01,  1.93979142e+00, -1.16971425e+00,
        -1.05217993e+00],
       [-1.50652052e+00,  7.88807586e-01, -1.34022653e+00,
        -1.18381211e+00],
       [-1.02184904e+00,  7.88807586e-01, -1.28338910e+00,
        -1.31544430e+00],
       [-1.74885626e+00, -3.62176246e-01, -1.34022653e+00,
        -1.31544430e+00],
       [-1.14301691e+00,  9.82172869e-02, -1.28338910e+00,
        -1.44707648e+00],
       [-5.37177559e-01,  1.47939788e+00, -1.28338910e+00,
        -1.31544430e+00],
       [-1.26418478e+00,  7.88807586e-01, -1.22655167e+00,
       ...

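##### MinMax Scaler (min-max normalization, 0-1 transform)
- Rescales each feature to the range 0 to 1 (minimum == 0, maximum == 1)
- Well suited to distance-based models (KNN, SVM, etc.)
- Sensitive to outliers (a single outlier can compress the remaining values into a narrow range)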
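
The formula `(value - min) / (max - min)` and the outlier sensitivity noted above can both be seen in a few lines. This is a minimal sketch with made-up numbers; `min_max` is a hypothetical helper, not a scikit-learn function.

```python
import numpy as np

def min_max(values):
    """Scale a 1-D array to the range [0, 1]: (value - min) / (max - min)."""
    values = np.asarray(values, dtype=float)
    return (values - values.min()) / (values.max() - values.min())

print(min_max([20, 30, 40]))         # evenly spread over [0, 1]
print(min_max([20, 30, 40, 1000]))   # the outlier pushes the other values toward 0
```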

In [11]:
from sklearn.preprocessing import MinMaxScaler

minmax_sc = MinMaxScaler()
minmax_sc.fit(iris_features)
minmax_sc.transform(iris_features)

# minmax_sc.fit_transform([[20], [30], [40]])
# (value - min) / (max - min)

array([[0.22222222, 0.625     , 0.06779661, 0.04166667],
       [0.16666667, 0.41666667, 0.06779661, 0.04166667],
       [0.11111111, 0.5       , 0.05084746, 0.04166667],
       [0.08333333, 0.45833333, 0.08474576, 0.04166667],
       [0.19444444, 0.66666667, 0.06779661, 0.04166667],
       [0.30555556, 0.79166667, 0.11864407, 0.125     ],
       [0.08333333, 0.58333333, 0.06779661, 0.08333333],
       [0.19444444, 0.58333333, 0.08474576, 0.04166667],
       [0.02777778, 0.375     , 0.06779661, 0.04166667],
       [0.16666667, 0.45833333, 0.08474576, 0.        ],
       [0.30555556, 0.70833333, 0.08474576, 0.04166667],
       [0.13888889, 0.58333333, 0.10169492, 0.04166667],
       [0.13888889, 0.41666667, 0.06779661, 0.        ],
       [0.        , 0.41666667, 0.01694915, 0.        ],
       [0.41666667, 0.83333333, 0.03389831, 0.04166667],
       [0.38888889, 1.        , 0.08474576, 0.125     ],
       [0.30555556, 0.79166667, 0.05084746, 0.125     ],
       [0.22222222, 0.625     ,
       ...

### [Try it yourself] Preprocessing for Titanic survival prediction

##### 1. Load the data

In [12]:
df = pd.read_csv('./data/titanic.csv')

##### 2. Preprocessing (-> preprocessing functions)

In [13]:
def fillna(df):
    '''
    Handle missing values
    - Age: replace with the mean
    - Cabin: replace with the default value 'N'
    - Embarked: replace with the default value 'N'
    '''
    df['Age'] = df['Age'].fillna(df['Age'].mean())
    df['Cabin'] = df['Cabin'].fillna('N')
    df['Embarked'] = df['Embarked'].fillna('N')

    return df
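
The same filling logic can be tried on a tiny frame to see its effect. This is a sketch on made-up data (`toy` is not the Titanic dataset), kept small so the result is easy to check by eye.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'Age': [20.0, np.nan, 40.0],
    'Cabin': ['C85', None, None],
})

print(toy.isna().sum())   # missing values per column before filling

# mean() skips NaN, so the missing age is filled with (20 + 40) / 2
toy['Age'] = toy['Age'].fillna(toy['Age'].mean())
toy['Cabin'] = toy['Cabin'].fillna('N')

print(toy)
```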

In [14]:
def drop_feature(df):
    '''
    Drop features that are irrelevant to model training
    - PassengerId, Name, Ticket
    '''
    return df.drop(['PassengerId', 'Name', 'Ticket'], axis=1)

In [15]:
from sklearn.preprocessing import LabelEncoder

def encode_feature(df):
    '''
    Encode categorical data as numbers
    - Sex, Cabin, Embarked
    - [tip] Cabin values are mostly distinct strings, so keep only the first character to turn them into categories
    '''
    df['Cabin'] = df['Cabin'].str[:1]

    categories = ['Sex', 'Cabin', 'Embarked']

    for cate_item in categories:
        label_encoder = LabelEncoder()
        df[cate_item] = label_encoder.fit_transform(df[cate_item])
        
    return df
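
The Cabin tip can be illustrated on its own. The sketch below uses made-up cabin values and swaps in pandas category codes in place of `LabelEncoder`; for plain strings both assign integers in sorted order, but this is an illustration and not the notebook's method.

```python
import pandas as pd

# made-up cabin values, including the 'N' placeholder used for missing entries
cabins = pd.Series(['C85', 'N', 'E46', 'C123', 'N', 'B42'])

deck = cabins.str[:1]   # keep only the first character, e.g. 'C85' -> 'C'
print(deck.tolist())

# category codes follow the sorted category order: B, C, E, N
codes = deck.astype('category').cat.codes
print(codes.tolist())
```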

In [16]:
def preprocess_data(df):
    '''
    Call all preprocessing functions
    '''
    df = drop_feature(df)
    df = fillna(df)
    df = encode_feature(df)
    return df

In [17]:
from sklearn.preprocessing import StandardScaler

def scaling_feature(train_data, test_data):
    '''
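    Feature scaling (standardization): fit the scaler on the training data only,
    then apply the same transform to the test data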
    '''
    scaler = StandardScaler()
    train_scaled = scaler.fit_transform(train_data)
    test_scaled = scaler.transform(test_data)

    return train_scaled, test_scaled
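
The function fits the scaler on the training data and only transforms the test data. The reason can be sketched with NumPy: the mean and standard deviation come from the training set alone, so no information from the test set leaks into preprocessing. The arrays below are made up for illustration.

```python
import numpy as np

# made-up single-feature data
train = np.array([[1.0], [2.0], [3.0]])
test = np.array([[2.0], [10.0]])

# statistics come from the training set only
mean = train.mean(axis=0)
std = train.std(axis=0)

train_scaled = (train - mean) / std
test_scaled = (test - mean) / std    # reuse the training statistics, do not refit

print(train_scaled.ravel())
print(test_scaled.ravel())           # test values need not have mean 0 or std 1
```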

In [18]:
# Call the preprocessing function
df = preprocess_data(df)

##### 3. Split the data

In [19]:
from sklearn.model_selection import train_test_split

# Split into input and label data
titanic_input = df.drop(['Survived'], axis=1)
titanic_label = df['Survived']

# Split into training and test data
X_train, X_test, y_train, y_test = \
    train_test_split(titanic_input, titanic_label, test_size=.2, random_state=0)
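
Conceptually, a random split with `test_size=.2` shuffles the row indices and sets aside 20% of them for testing. The sketch below shows that idea with NumPy on made-up data; it does not reproduce scikit-learn's exact shuffling or rounding, so it would not select the same rows as `train_test_split`.

```python
import numpy as np

rng = np.random.default_rng(0)      # fixed seed so the split is repeatable

X = np.arange(20).reshape(10, 2)    # made-up inputs: 10 rows, 2 features
y = np.arange(10)                   # made-up labels, one per row

indices = rng.permutation(len(X))   # shuffled row indices
n_test = int(len(X) * 0.2)          # 20% of the rows go to the test set

test_idx, train_idx = indices[:n_test], indices[n_test:]
X_tr, X_te = X[train_idx], X[test_idx]
y_tr, y_te = y[train_idx], y[test_idx]

print(X_tr.shape, X_te.shape)
```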

##### 4. Feature scaling

In [21]:
# Call the scaling function
X_scaled_train, X_scaled_test = scaling_feature(X_train, X_test)

##### 5. Train a LogisticRegression model

In [22]:
from sklearn.linear_model import LogisticRegression

lr_classifier = LogisticRegression()
lr_classifier.fit(X_scaled_train, y_train)

LogisticRegression()


##### 6. Evaluation

In [23]:
lr_classifier.score(X_scaled_train, y_train), lr_classifier.score(X_scaled_test, y_test)

(0.7935393258426966, 0.8100558659217877)
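
For a classifier, `score` returns the mean accuracy: the fraction of samples whose predicted label matches the true label. The sketch below computes that quantity by hand on made-up labels; `y_true` and `y_pred` are illustrative and unrelated to the Titanic results above.

```python
import numpy as np

y_true = np.array([1, 0, 1, 1, 0])    # made-up true labels
y_pred = np.array([1, 0, 0, 1, 0])    # made-up predictions

accuracy = (y_true == y_pred).mean()  # fraction of matching labels
print(accuracy)
```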