<a href='https://github.com/SeWonKwon' ><div> <img src ='https://slid-capture.s3.ap-northeast-2.amazonaws.com/public/image_upload/6556674324ed41a289a354258718280d/964e5a8b-75ad-41fc-ae75-0ca66d06fbc7.png' align='left' /> </div></a>


In [1]:
import pandas as pd
import numpy as np


import warnings
warnings.filterwarnings('ignore')

# Scaler

## `preprocessing`: the data preprocessing module

* Methods for feature scaling


1. Standardization

\begin{equation}
x_i^{\prime} = \frac{x_i-mean(x)}{stdev(x)}
\end{equation}

2. Normalization (min-max scaling)

\begin{equation}
x_i^{\prime} = \frac{x_i-min(x)}{max(x)-min(x)}
\end{equation}

3. MaxAbsScaler

\begin{equation}
x_i^{\prime} = \frac{x_i}{max(|x|)}
\end{equation}

4. RobustScaler

\begin{equation}
x_i^{\prime} = \frac{x_i- x_{median}}{IQR}
\end{equation}

5. log1p transform

\begin{equation}
x_i^{\prime} = \log(x_i + 1)
\end{equation}
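
The five formulas above can be checked numerically. The following is a minimal sketch, assuming a made-up single-column array `x` that is not part of the notebook. It computes each transform with NumPy and compares it with the corresponding scikit-learn class. Two details are worth noting: `StandardScaler` divides by the population standard deviation (`ddof=0`), and `RobustScaler` by default takes the 25th to 75th percentile range as the IQR.

```python
import numpy as np
from sklearn.preprocessing import (
    MaxAbsScaler, MinMaxScaler, RobustScaler, StandardScaler,
)

# Made-up column with one large value, for illustration only
x = np.array([[1.0], [2.0], [4.0], [8.0], [100.0]])

standard = (x - x.mean(axis=0)) / x.std(axis=0)  # np.std defaults to ddof=0
minmax = (x - x.min(axis=0)) / (x.max(axis=0) - x.min(axis=0))
maxabs = x / np.abs(x).max(axis=0)
q1, q3 = np.percentile(x, [25, 75], axis=0)
robust = (x - np.median(x, axis=0)) / (q3 - q1)
log1p = np.log1p(x)  # log(x + 1); plain NumPy, no scaler class involved

# Compare the hand-written formulas with the scikit-learn implementations
checks = {
    "standard": np.allclose(standard, StandardScaler().fit_transform(x)),
    "minmax": np.allclose(minmax, MinMaxScaler().fit_transform(x)),
    "maxabs": np.allclose(maxabs, MaxAbsScaler().fit_transform(x)),
    "robust": np.allclose(robust, RobustScaler().fit_transform(x)),
}
print(checks)
```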

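* In scikit-learn, "normalization" can also mean rescaling each individual sample (row vector) to unit norm, which is what `Normalizer` does. That is different from the min-max normalization above (`MinMaxScaler`), which works per feature.

* Scalers can be chained with an estimator into a single estimator through the `pipeline` module (`Pipeline`, `make_pipeline`).

* Differences between standardization, normalization, and the other scalers:

||Standardization|Normalization|MaxAbs|Robust|
|:--|:--:|:--:|:-:|:-:|
|Based on|mean and standard deviation|min and max|max absolute value|median and interquartile range|
|Result|centered on 0|between 0 and 1|between -1 and 1|centered on the median|
|Arrangement|**changes**|**unchanged**|**changes slightly**|**changes**|
|Outliers|moderately affected|strongly affected|strongly affected|least affected|

* Many naturally occurring quantities are said to look more natural (closer to symmetric) after a log transform, so a log scale is often worth trying on right-skewed features.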
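
The point about combining a scaler and an estimator through the `pipeline` module can be sketched as follows. This is a minimal example, assuming the iris data, a fixed `random_state=0`, and `max_iter=1000`, none of which come from the notebook. The pipeline fits the scaler on the training rows only and reuses those parameters when scoring.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# fit() fits the scaler on X_train, transforms X_train, then fits the model
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(X_train, y_train)

# score() transforms X_test with the training-set mean and std before predicting
test_score = pipe.score(X_test, y_test)
print(f"test accuracy: {test_score:.3f}")
```

`make_pipeline` names each step after its lowercased class name, so the fitted scaler is available as `pipe.named_steps["standardscaler"]`.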

### Model on iris without preprocessing

In [2]:
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
iris_df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
iris_df.describe()

,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm)
count,150.0,150.0,150.0,150.0
mean,5.843333,3.057333,3.758,1.199333
std,0.828066,0.435866,1.765298,0.762238
min,4.3,2.0,1.0,0.1
25%,5.1,2.8,1.6,0.3
50%,5.8,3.0,4.35,1.3
75%,6.4,3.3,5.1,1.8
max,7.9,4.4,6.9,2.5


In [3]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(iris_df, iris.target, test_size=0.3)

model = LogisticRegression()
model.fit(X_train, y_train)

print("Train score: {}".format(model.score(X_train, y_train)))
print("Test score: {}".format(model.score(X_test, y_test)))


Train score: 1.0
Test score: 0.9333333333333333
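
The two scores above come from a single random split, because `train_test_split` is called without `random_state`. With 45 test samples, one misclassified sample moves the test score by about 0.022, so differences of that size between runs fall within split-to-split noise. A minimal sketch of that variation, where the 20 seeds and `max_iter=1000` are assumptions chosen for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Repeat the same experiment with different random splits
test_scores = []
for seed in range(20):
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=seed
    )
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    test_scores.append(model.score(X_test, y_test))

print(f"min={min(test_scores):.3f}, max={max(test_scores):.3f}")
```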


### `StandardScaler`: standardization class

\begin{equation}
x_i^{\prime} = \frac{x_i-mean(x)}{stdev(x)}
\end{equation}


In [4]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
iris_scaled = scaler.fit_transform(iris_df)
iris_df_scaled = pd.DataFrame(data=iris_scaled, columns=iris.feature_names)
iris_df_scaled

,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm)
0,-0.900681,1.019004,-1.340227,-1.315444
1,-1.143017,-0.131979,-1.340227,-1.315444
2,-1.385353,0.328414,-1.397064,-1.315444
3,-1.506521,0.098217,-1.283389,-1.315444
4,-1.021849,1.249201,-1.340227,-1.315444
...,...,...,...,...
145,1.038005,-0.131979,0.819596,1.448832
146,0.553333,-1.282963,0.705921,0.922303
147,0.795669,-0.131979,0.819596,1.053935
148,0.432165,0.788808,0.933271,1.448832


In [5]:
X_train, X_test, y_train, y_test = train_test_split(iris_df_scaled, iris.target, test_size=0.3)

model = LogisticRegression()
model.fit(X_train, y_train)

print("Train score: {}".format(model.score(X_train, y_train)))
print("Test score: {}".format(model.score(X_test, y_test)))


Train score: 0.9809523809523809
Test score: 0.9555555555555556
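
To see what `fit_transform` learned, the fitted scaler can be inspected. The sketch below assumes the same iris data. It shows the learned `mean_` and `scale_`, confirms that each scaled column has mean 0 and standard deviation 1, and undoes the scaling with `inverse_transform`. `scale_` is the population standard deviation (`ddof=0`), which is why it is slightly smaller than the `std` row of `describe()`, where pandas uses `ddof=1`.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

X = load_iris().data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

print(scaler.mean_)   # per-feature mean learned by fit
print(scaler.scale_)  # per-feature population standard deviation (ddof=0)

col_means = X_scaled.mean(axis=0)                # approximately 0 for every column
col_stds = X_scaled.std(axis=0)                  # approximately 1 for every column
X_restored = scaler.inverse_transform(X_scaled)  # back to the original units
```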


### `MinMaxScaler`: normalization class

\begin{equation}
x_i^{\prime} = \frac{x_i-min(x)}{max(x)-min(x)}
\end{equation}

In [6]:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
iris_scaled = scaler.fit_transform(iris_df)
iris_df_scaled = pd.DataFrame(data=iris_scaled, columns=iris.feature_names)
iris_df_scaled

,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm)
0,0.222222,0.625000,0.067797,0.041667
1,0.166667,0.416667,0.067797,0.041667
2,0.111111,0.500000,0.050847,0.041667
3,0.083333,0.458333,0.084746,0.041667
4,0.194444,0.666667,0.067797,0.041667
...,...,...,...,...
145,0.666667,0.416667,0.711864,0.916667
146,0.555556,0.208333,0.677966,0.750000
147,0.611111,0.416667,0.711864,0.791667
148,0.527778,0.583333,0.745763,0.916667


In [7]:
X_train, X_test, y_train, y_test = train_test_split(iris_df_scaled, iris.target, test_size=0.3)

model = LogisticRegression()
model.fit(X_train, y_train)

print("Train score: {}".format(model.score(X_train, y_train)))
print("Test score: {}".format(model.score(X_test, y_test)))


Train score: 0.9523809523809523
Test score: 0.9111111111111111


### `MaxAbsScaler`

In [8]:
from sklearn.preprocessing import MaxAbsScaler

scaler = MaxAbsScaler()
iris_scaled = scaler.fit_transform(iris_df)
iris_df_scaled = pd.DataFrame(data=iris_scaled, columns=iris.feature_names)
iris_df_scaled

,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm)
0,0.645570,0.795455,0.202899,0.08
1,0.620253,0.681818,0.202899,0.08
2,0.594937,0.727273,0.188406,0.08
3,0.582278,0.704545,0.217391,0.08
4,0.632911,0.818182,0.202899,0.08
...,...,...,...,...
145,0.848101,0.681818,0.753623,0.92
146,0.797468,0.568182,0.724638,0.76
147,0.822785,0.681818,0.753623,0.80
148,0.784810,0.772727,0.782609,0.92


In [9]:
X_train, X_test, y_train, y_test = train_test_split(iris_df_scaled, iris.target, test_size=0.3)

model = LogisticRegression()
model.fit(X_train, y_train)

print("Train score: {}".format(model.score(X_train, y_train)))
print("Test score: {}".format(model.score(X_test, y_test)))


Train score: 0.9428571428571428
Test score: 0.9111111111111111
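
All iris measurements are positive, so the scaled values above stay in (0, 1]. With signed data, `MaxAbsScaler` maps each feature into [-1, 1] and leaves zeros at zero, because it only divides and never shifts. A small sketch with made-up values:

```python
import numpy as np
from sklearn.preprocessing import MaxAbsScaler

# Made-up signed data: three samples, two features
X = np.array([[-4.0, 0.0],
              [2.0, 5.0],
              [0.0, -10.0]])

scaler = MaxAbsScaler()
X_scaled = scaler.fit_transform(X)

print(scaler.max_abs_)  # per-feature max(|x|)
print(X_scaled)         # signs and zeros are preserved
```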


### `RobustScaler`

In [10]:
from sklearn.preprocessing import RobustScaler

scaler = RobustScaler()
iris_scaled = scaler.fit_transform(iris_df)
iris_df_scaled = pd.DataFrame(data=iris_scaled, columns=iris.feature_names)
iris_df_scaled

,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm)
0,-0.538462,1.0,-0.842857,-0.733333
1,-0.692308,0.0,-0.842857,-0.733333
2,-0.846154,0.4,-0.871429,-0.733333
3,-0.923077,0.2,-0.814286,-0.733333
4,-0.615385,1.2,-0.842857,-0.733333
...,...,...,...,...
145,0.692308,0.0,0.242857,0.666667
146,0.384615,-1.0,0.185714,0.400000
147,0.538462,0.0,0.242857,0.466667
148,0.307692,0.8,0.300000,0.666667


In [11]:
X_train, X_test, y_train, y_test = train_test_split(iris_df_scaled, iris.target, test_size=0.3)

model = LogisticRegression()
model.fit(X_train, y_train)

print("Train score: {}".format(model.score(X_train, y_train)))
print("Test score: {}".format(model.score(X_test, y_test)))


Train score: 0.9619047619047619
Test score: 0.9111111111111111
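
Iris has no extreme values, so the benefit of `RobustScaler` is not visible in the output above. The following sketch uses a made-up feature with one outlier and measures how much of the output range the nine ordinary values still occupy after scaling. It illustrates the outlier behaviour of each scaler and is not a benchmark.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

# Made-up feature: nine ordinary values plus one extreme outlier
x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 1000], dtype=float).reshape(-1, 1)

spreads = {}
for scaler in (StandardScaler(), MinMaxScaler(), RobustScaler()):
    scaled = scaler.fit_transform(x)
    inliers = scaled[:-1]  # the nine ordinary values
    spreads[type(scaler).__name__] = float(inliers.max() - inliers.min())

for name, spread in spreads.items():
    print(f"{name:>14}: inlier spread = {spread:.4f}")
```

The mean, standard deviation, minimum, and maximum are all pulled by the outlier, so `StandardScaler` and `MinMaxScaler` squeeze the ordinary values into a narrow band, while the median and IQR used by `RobustScaler` barely move.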


### `log1p` transform

In [12]:
# np.log1p is applied directly; no scikit-learn scaler object is needed here
iris_scaled = np.log1p(iris_df)
iris_df_scaled = pd.DataFrame(data=iris_scaled, columns=iris.feature_names)
iris_df_scaled

,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm)
0,1.808289,1.504077,0.875469,0.182322
1,1.774952,1.386294,0.875469,0.182322
2,1.740466,1.435085,0.832909,0.182322
3,1.722767,1.410987,0.916291,0.182322
4,1.791759,1.526056,0.875469,0.182322
...,...,...,...,...
145,2.041220,1.386294,1.824549,1.193922
146,1.987874,1.252763,1.791759,1.064711
147,2.014903,1.386294,1.824549,1.098612
148,1.974081,1.481605,1.856298,1.193922


In [13]:
X_train, X_test, y_train, y_test = train_test_split(iris_df_scaled, iris.target, test_size=0.3)

model = LogisticRegression()
model.fit(X_train, y_train)

print("Train score: {}".format(model.score(X_train, y_train)))
print("Test score: {}".format(model.score(X_test, y_test)))

Train score: 0.9333333333333333
Test score: 0.9111111111111111
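
Each section above draws a fresh random split, so the scores are not directly comparable across scalers. One way to compare them on equal footing is to evaluate every preprocessing option on the same cross-validation folds. The sketch below is an assumption-laden variant and not the notebook's procedure: it uses 5 stratified folds with `random_state=0`, `max_iter=1000`, and wraps `np.log1p` in `FunctionTransformer` so that it fits into a pipeline like the scaler classes.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import (
    FunctionTransformer, MaxAbsScaler, MinMaxScaler, RobustScaler, StandardScaler,
)

X, y = load_iris(return_X_y=True)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)  # same folds for all

preprocessors = {
    "none": None,
    "standard": StandardScaler(),
    "minmax": MinMaxScaler(),
    "maxabs": MaxAbsScaler(),
    "robust": RobustScaler(),
    "log1p": FunctionTransformer(np.log1p),
}

mean_scores = {}
for name, prep in preprocessors.items():
    steps = [LogisticRegression(max_iter=1000)]
    if prep is not None:
        steps.insert(0, prep)  # preprocessing goes before the classifier
    pipe = make_pipeline(*steps)
    # The scaler is refit inside each fold, on that fold's training part only
    mean_scores[name] = cross_val_score(pipe, X, y, cv=cv).mean()

for name, score in mean_scores.items():
    print(f"{name:>8}: {score:.3f}")
```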


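## Caveats when applying scaling

* When applying `fit` and `transform`:
    1. Call `fit` on the training data only.
    2. Apply `transform` with those fitted parameters to both the training and the test data.
* The iris cells above call `fit_transform` on the full dataset before splitting, so they do not follow this rule; fitting after the split avoids leaking test-set statistics into preprocessing.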
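
The two steps above can be sketched as follows, assuming the iris data and a fixed `random_state=0` for reproducibility. The scaler learns its mean and standard deviation from the training rows only, and the same parameters are then applied to both sets.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

scaler = StandardScaler()
scaler.fit(X_train)                         # 1. learn mean/std from training data only
X_train_scaled = scaler.transform(X_train)  # 2. apply the same parameters to both sets
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.mean(axis=0).round(6))  # zero up to floating-point error
print(X_test_scaled.mean(axis=0).round(3))   # usually near zero, but not exactly zero
```

Calling `fit_transform` on the test set would rescale it with its own statistics, so the same raw value would map to different scaled values in training and testing.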

# Credit Card Fraud Exercise

In [14]:
import os

import gdown
import pandas as pd
from sklearn.model_selection import train_test_split

# Source file: https://drive.google.com/file/d/1TwKDZ24Gp76MhZFP4kRee2uAJYtYZLve/view?usp=sharing
def get_creditcard_dataset():
    """Download creditcard.csv, split it into features and target, then delete the file."""
    google_path = 'https://drive.google.com/uc?id='
    file_id_train = '1TwKDZ24Gp76MhZFP4kRee2uAJYtYZLve'

    gdown.download(google_path + file_id_train, 'creditcard.csv', quiet=False)

    df = pd.read_csv('creditcard.csv')
    y = df.iloc[:, -1]   # last column is the target, kept as a 1-D Series
    X = df.iloc[:, :-1]  # every other column is a feature

    os.remove('creditcard.csv')

    return X, y

X, y = get_creditcard_dataset()

Downloading...
From: https://drive.google.com/uc?id=1TwKDZ24Gp76MhZFP4kRee2uAJYtYZLve
To: C:\Users\N\OneDrive\WorkJ\Machine Learning\Machine_Learning\creditcard.csv
151MB [00:04, 35.9MB/s] 



In [16]:
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, roc_auc_score
from sklearn.metrics import f1_score

def plot_conf_mat(conf_mat):
    import matplotlib.pyplot as plt
    
    fig, ax = plt.subplots(figsize=(2.5, 2.5))
    ax.matshow(conf_mat, cmap=plt.cm.Blues, alpha=0.3)
    for i in range(conf_mat.shape[0]):
        for j in range(conf_mat.shape[1]):
            ax.text(x=j, y=i, s=conf_mat[i, j], va='center', ha='center',fontsize=19)
    
    plt.xlabel('Predicted label')
    plt.ylabel('True label')
    plt.tight_layout()
    plt.show()

def get_clf_eval(y_test, pred=None, pred_proba=None):
    confusion = confusion_matrix(y_test, pred)
    accuracy = accuracy_score(y_test, pred)
    precision = precision_score(y_test, pred)
    recall = recall_score(y_test, pred)
    f1 = f1_score(y_test, pred)
    roc_auc = roc_auc_score(y_test, pred_proba)
#     print('Confusion matrix')
#     plot_conf_mat(confusion)
#     print('Accuracy:{0:.4f}, Precision:{1:.4f}, Recall:{2:.4f}, F1_score:{3:.4f},\
#             AUC: {4:.4f}'.format(accuracy, precision, recall, f1, roc_auc))
    
    return accuracy, precision, recall, f1, roc_auc
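
`get_clf_eval` returns several metrics because accuracy alone says little when the positive class is rare. A minimal sketch with made-up labels (2 positives in 1000 samples, an assumed ratio chosen for illustration) shows a classifier that never predicts the positive class and still reaches very high accuracy:

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Made-up labels: 2 positives among 1000 samples
y_true = np.zeros(1000, dtype=int)
y_true[:2] = 1

# A trivial "model" that always predicts the negative class
y_pred = np.zeros(1000, dtype=int)

accuracy = accuracy_score(y_true, y_pred)
# zero_division=0 avoids a warning when no positive is predicted
precision = precision_score(y_true, y_pred, zero_division=0)
recall = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred, zero_division=0)

print(accuracy, precision, recall, f1)
```

Accuracy is 0.998 here even though recall is 0, which is why recall, F1, and ROC AUC are the more informative columns for fraud detection.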

In [17]:
def get_model_train_eval(model, X_train, y_train, X_test, y_test):
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    pred_proba = model.predict_proba(X_test)[:, 1]
    
    model_name = type(model).__name__  # class name, e.g. 'LogisticRegression'
    accuracy, precision, recall, f1, roc_auc = get_clf_eval(y_test, pred, pred_proba)
    
    return model_name, accuracy, precision, recall, f1, roc_auc

In [18]:
from sklearn.linear_model import LogisticRegression
from lightgbm import LGBMClassifier

lr_clf = LogisticRegression()
lgbm_model= LGBMClassifier(n_estimators=1000, num_leaves=64, n_jobs=-1, boost_from_average=False)

In [19]:
def get_train_test_dataset(df_X=None, df_y=None, scaler=None, test_size=0.3, random_state=0):
    df_copy = get_preprocessed_df(df_X, scaler)
    X_train, X_test, y_train, y_test = train_test_split(df_copy, df_y, test_size=test_size, 
                                            random_state=random_state, stratify = df_y)
    
    return X_train, X_test, y_train, y_test
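
The `stratify` argument used above keeps the class ratio the same in the training and test sets, which matters when positives are rare. A small sketch with a made-up target (20 positives among 1000 samples, and a placeholder feature):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up imbalanced target: 20 positives among 1000 samples
y = np.array([1] * 20 + [0] * 980)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder feature

_, _, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# Both subsets keep the 2% positive rate of the full data
print(y_train.mean(), y_test.mean())
```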
    


In [20]:
from sklearn.preprocessing import StandardScaler, MinMaxScaler, MaxAbsScaler, RobustScaler
import numpy as np

scalers = [StandardScaler(), MinMaxScaler(), MaxAbsScaler(), RobustScaler(), 'log1p']

def get_preprocessed_df(df, scaler):
    df_copy = df.copy()
    if scaler=='log1p':
        df_copy['Amount_S'] = np.log1p(df_copy['Amount'].values.reshape(-1, 1))
    else:
        df_copy['Amount_S'] = scaler.fit_transform(df_copy['Amount'].values.reshape(-1, 1))
    
    df_copy.drop(['Time', 'Amount'], axis=1, inplace=True)
    
    return df_copy
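
`get_preprocessed_df` fits the scaler on the full dataset before the split. A leakage-free alternative is to put the column-wise preprocessing inside a pipeline, so the scaler is fitted on the training rows only. The sketch below is one possible arrangement and not the notebook's method: the data frame is a made-up stand-in with hypothetical columns `V1` and `V2` next to `Time` and `Amount`, and `RobustScaler` is an arbitrary choice.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler

# Made-up stand-in for the credit card frame
rng = np.random.default_rng(0)
n = 500
X = pd.DataFrame({
    "Time": np.arange(n, dtype=float),
    "V1": rng.normal(size=n),
    "V2": rng.normal(size=n),
    "Amount": rng.lognormal(mean=3.0, sigma=1.0, size=n),
})
y = (X["V1"] + 0.5 * X["V2"] > 1.0).astype(int)  # synthetic target

preprocess = ColumnTransformer(
    transformers=[
        ("amount", RobustScaler(), ["Amount"]),  # scale Amount only
        ("drop_time", "drop", ["Time"]),         # discard Time
    ],
    remainder="passthrough",                     # keep V1 and V2 unchanged
)
model = Pipeline([
    ("preprocess", preprocess),
    ("clf", LogisticRegression(max_iter=1000)),
])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)
model.fit(X_train, y_train)  # the scaler sees the training rows only
test_score = model.score(X_test, y_test)
print(test_score)
```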

In [21]:
result_df = pd.DataFrame(columns=['name', 'accuracy', 'precision', 'recall', 'f1', 'roc_auc'])
idx=0
for scaler in scalers:
    
    X_train, X_test, y_train, y_test = get_train_test_dataset(X, y,  scaler)
    if scaler == 'log1p':
        scaler_name= 'log1p'
    else:
        scaler_name = type(scaler).__name__[:-len('Scaler')]  # e.g. 'StandardScaler' -> 'Standard'
        
    model_name, accuracy, precision, recall, f1, roc_auc = get_model_train_eval(lr_clf, X_train, y_train, X_test, y_test)
    result_df.loc[idx] = str(model_name + scaler_name), accuracy, precision, recall, f1, roc_auc
    idx +=1
    
    model_name, accuracy, precision, recall, f1, roc_auc = get_model_train_eval(lgbm_model, X_train, y_train, X_test, y_test)
    result_df.loc[idx] = str(model_name + scaler_name), accuracy, precision, recall, f1, roc_auc
    idx +=1

In [22]:
result_df

,name,accuracy,precision,recall,f1,roc_auc
0,LogisticRegressionStandard,0.999157,0.865385,0.608108,0.714286,0.970227
1,LGBMClassifierStandard,0.999532,0.965517,0.756757,0.848485,0.978225
2,LogisticRegressionMinMax,0.999181,0.882353,0.608108,0.72,0.971258
3,LGBMClassifierMinMax,0.99952,0.957265,0.756757,0.845283,0.979034
4,LogisticRegressionMaxAbs,0.999181,0.882353,0.608108,0.72,0.971258
5,LGBMClassifierMaxAbs,0.99952,0.957265,0.756757,0.845283,0.979034
6,LogisticRegressionRobust,0.999157,0.865385,0.608108,0.714286,0.970216
7,LGBMClassifierRobust,0.99952,0.957265,0.756757,0.845283,0.979108
8,LogisticRegressionlog1p,0.999169,0.881188,0.601351,0.714859,0.972683
9,LGBMClassifierlog1p,0.99952,0.957265,0.756757,0.845283,0.979034


In [23]:
result_df.sort_values(['roc_auc','f1'], ascending=False)

,name,accuracy,precision,recall,f1,roc_auc
7,LGBMClassifierRobust,0.99952,0.957265,0.756757,0.845283,0.979108
3,LGBMClassifierMinMax,0.99952,0.957265,0.756757,0.845283,0.979034
5,LGBMClassifierMaxAbs,0.99952,0.957265,0.756757,0.845283,0.979034
9,LGBMClassifierlog1p,0.99952,0.957265,0.756757,0.845283,0.979034
1,LGBMClassifierStandard,0.999532,0.965517,0.756757,0.848485,0.978225
8,LogisticRegressionlog1p,0.999169,0.881188,0.601351,0.714859,0.972683
2,LogisticRegressionMinMax,0.999181,0.882353,0.608108,0.72,0.971258
4,LogisticRegressionMaxAbs,0.999181,0.882353,0.608108,0.72,0.971258
0,LogisticRegressionStandard,0.999157,0.865385,0.608108,0.714286,0.970227
6,LogisticRegressionRobust,0.999157,0.865385,0.608108,0.714286,0.970216
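
In the table, the LGBM rows for MinMax, MaxAbs, and log1p are identical, while the logistic regression rows change with the scaler. One plausible reason is that tree-based models split on thresholds, so a monotonic rescaling of a feature leaves the possible partitions of the training rows unchanged. The sketch below demonstrates that property with scikit-learn's `DecisionTreeClassifier` as a stand-in on made-up data. This is an assumption: LightGBM bins features into histograms and can differ slightly, as the Standard and Robust rows suggest. Only training-set predictions are compared, since thresholds between training points can land differently after a nonlinear transform.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier

# Made-up data: an integer-valued "amount" feature plus a noise feature
rng = np.random.default_rng(0)
amount = rng.integers(1, 10_000, size=300).astype(float)
noise = rng.normal(size=300)
y = ((amount > 3_000) & (noise > -1.0)).astype(int)

def train_predictions(amount_column):
    """Fit a small tree and return its predictions on the training rows."""
    X = np.column_stack([amount_column, noise])
    tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
    return tree.predict(X)

pred_raw = train_predictions(amount)
pred_log = train_predictions(np.log1p(amount))
pred_minmax = train_predictions(
    MinMaxScaler().fit_transform(amount.reshape(-1, 1)).ravel()
)

# The same training rows end up with the same predicted labels
same = np.array_equal(pred_raw, pred_log) and np.array_equal(pred_raw, pred_minmax)
print(same)
```

The identical MinMax and MaxAbs rows for logistic regression are what one would expect if the minimum of `Amount` is 0, since the two formulas then coincide.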


# Sources:

* <a href='https://github.com/SeWonKwon' ><div> <img src ='https://slid-capture.s3.ap-northeast-2.amazonaws.com/public/image_upload/6556674324ed41a289a354258718280d/964e5a8b-75ad-41fc-ae75-0ca66d06fbc7.png' align='left' /> </div></a>

<br>

* 빅데이터분석기사 필기 (Big Data Analysis Engineer, written exam), DataEDU

* [이수안컴퓨터연구소](https://www.youtube.com/c/%EC%9D%B4%EC%88%98%EC%95%88%EC%BB%B4%ED%93%A8%ED%84%B0%EC%97%B0%EA%B5%AC%EC%86%8C)

* [R Friend](https://rfriend.tistory.com/636)

* [sklearn](https://scikit-learn.org/)