# Data Scaling

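Commonly used scaling concepts
# Standardization
- Rescales each feature so that its mean is 0 and its variance is 1.
- This only shifts and rescales the values. It does not change the shape of the distribution, so a feature that is not normally distributed stays that way.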
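
As a minimal sketch of this definition, the z-score can be computed by hand with NumPy. The skewed sample array below is made up for illustration and is not part of the original notebook.

```python
import numpy as np

# Hypothetical, right-skewed sample used only for illustration.
x = np.array([1.0, 1.0, 2.0, 2.0, 3.0, 20.0])

# z = (x - mean) / std, using the population standard deviation (ddof=0).
z = (x - x.mean()) / x.std()

print(z.mean())  # approximately 0
print(z.std())   # approximately 1

# The ordering and relative spacing of the values are unchanged,
# so a skewed feature is still skewed after standardization.
print(z.max() > abs(z.min()))
```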

# Normalization
- Rescales each feature to a fixed range, usually [0, 1].
- The smallest value maps to 0 and the largest to 1, so every feature ends up in the range [0, 1].
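
The min-max formula behind this can be sketched in a few lines of NumPy. The array reuses the small example values that appear later in this notebook.

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0, 10.0])

# x_scaled = (x - min) / (max - min)
x_scaled = (x - x.min()) / (x.max() - x.min())

print(x_scaled)  # the minimum maps to 0 and the maximum to 1
```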

# Before using scikit-learn scalers
- Before using a scikit-learn scaler, there is one point to watch out for.
- A scaler has fit and transform methods.
- Call fit on the training data only, so that the scaler learns the distribution of the training data.
- Then apply transform to both the training data and the test data to rescale them.
- In practice, call fit_transform() on the training data and transform() on the test data.
- fit_transform() is a shortcut that combines fit and transform.
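
The workflow above can be sketched as follows. The random data and the 80/20 split are assumptions made for illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical data: 100 samples, 3 features.
rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(100, 3))

X_train, X_test = train_test_split(X, test_size=0.2, random_state=0)

scaler = StandardScaler()
# Learn the mean and standard deviation from the training data and scale it.
X_train_scaled = scaler.fit_transform(X_train)
# Reuse the training statistics for the test data; never call fit on it.
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.mean(axis=0))  # approximately 0 for each feature
print(X_test_scaled.mean(axis=0))   # generally close to, but not exactly, 0
```

Fitting on the test data as well would leak information about it into the preprocessing step, which is why only transform() is used there.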
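
Also, while it is important to bring all features to a similar range when scaling, they do not all need to follow the same distribution.
Depending on the feature, applying a different scaling method to each one can work better.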
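
One way to apply a different scaler to each feature is scikit-learn's `ColumnTransformer`. The sketch below is only an illustration: the column names `age` and `income` and their values are hypothetical.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler, RobustScaler

# Hypothetical data: "income" contains an outlier, "age" does not.
df = pd.DataFrame({
    "age": [20, 30, 40, 50, 60],
    "income": [100, 200, 300, 400, 10000],
})

# Each entry is (name, scaler, columns the scaler is applied to).
ct = ColumnTransformer([
    ("minmax", MinMaxScaler(), ["age"]),
    ("robust", RobustScaler(), ["income"]),
])

scaled = ct.fit_transform(df)
print(scaled)  # column 0: scaled age, column 1: scaled income
```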
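
Next, we look at the five scaling methods that scikit-learn provides.

# StandardScaler()
Rescales each feature so that its mean is 0 and its variance is 1.\
It only shifts and rescales the values, so the shape of the distribution is unchanged.\
Because it does not bound the minimum and maximum values, this can be a problem for some algorithms,\
and it is very sensitive to outliers.\
It is often described as more useful for classification than for regression.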

In [39]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

In [40]:
a = [2, 4, 6, 8, 10]  # training data
b = [3, 5, 7]         # test data
df_a = pd.DataFrame(a)
df_b = pd.DataFrame(b)
std = StandardScaler()

In [41]:
std.fit(df_a)

In [42]:
a_scaled = std.transform(df_a)

# Scale the test data with the statistics learned from the training data
b_scaled = std.transform(df_b)

In [43]:
a_scaled 

array([[-1.41421356],
       [-0.70710678],
       [ 0.        ],
       [ 0.70710678],
       [ 1.41421356]])

In [44]:
a_scaled.mean()

0.0

In [45]:
a_scaled.std()

0.9999999999999999

In [46]:
b_scaled

array([[-1.06066017],
       [-0.35355339],
       [ 0.35355339]])

In [47]:
b_scaled.mean()

-0.35355339059327373

In [48]:
b_scaled.std()

0.5773502691896257
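
The mean of `b_scaled` is not 0 and its standard deviation is not 1 because transform() reuses the statistics learned from `df_a`. A short sketch that inspects the fitted `mean_` and `scale_` attributes and reproduces the test-data result by hand:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

df_a = pd.DataFrame([2, 4, 6, 8, 10])  # training data
df_b = pd.DataFrame([3, 5, 7])         # test data

std = StandardScaler().fit(df_a)
print(std.mean_)   # mean learned from the training data
print(std.scale_)  # standard deviation learned from the training data

# transform() computes (x - mean_) / scale_ with the training statistics.
b_manual = (df_b.to_numpy() - std.mean_) / std.scale_
print(b_manual.ravel())
```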

In [49]:
c = [2,3,4,5,6,7,8,10]
df_c = pd.DataFrame(c)

In [50]:
std.fit(df_c)

In [51]:
c_scaled = std.transform(df_c)
c_scaled

array([[-1.45181591],
       [-1.05131497],
       [-0.65081403],
       [-0.25031309],
       [ 0.15018785],
       [ 0.55068879],
       [ 0.95118973],
       [ 1.75219161]])

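# MinMaxScaler()
Rescales each feature to the range [0, 1], as described under Normalization above.\
The smallest value in the training data maps to 0 and the largest to 1.

In [52]: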
a =[2,4,6,8,10]
b = [3,5,7]
df_a = pd.DataFrame(a)
df_b = pd.DataFrame(b)
mm = MinMaxScaler()

In [53]:
mm.fit(df_a)

a_scaled = mm.transform(df_a)

# Scale the test data
b_scaled = mm.transform(df_b)

In [54]:
a_scaled 

array([[0.  ],
       [0.25],
       [0.5 ],
       [0.75],
       [1.  ]])

In [55]:
b_scaled

array([[0.125],
       [0.375],
       [0.625]])

In [56]:
c = [2,3,4,5,6,7,8,10]
df_c = pd.DataFrame(c)

In [57]:
mm.fit(df_c)
c_scaled = mm.transform(df_c)

In [58]:
c_scaled 

array([[0.   ],
       [0.125],
       [0.25 ],
       [0.375],
       [0.5  ],
       [0.625],
       [0.75 ],
       [1.   ]])
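
Because the minimum and maximum are learned from the training data, test values outside that range fall outside [0, 1]. A small sketch, where the new values 0 and 12 are hypothetical:

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df_a = pd.DataFrame([2, 4, 6, 8, 10])  # training data
mm = MinMaxScaler().fit(df_a)
print(mm.data_min_, mm.data_max_)  # minimum and maximum learned from df_a

# Hypothetical new values below and above the training range.
df_new = pd.DataFrame([0, 12])
new_scaled = mm.transform(df_new)
print(new_scaled.ravel())  # one value below 0 and one above 1
```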

# MaxAbsScaler
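Divides each feature by its maximum absolute value, so that absolute values fall between 0 and 1.\
In other words, every value lies between -1 and 1. On positive-only data it behaves similarly to MinMaxScaler, but the results match only when the minimum is 0.\
It is very sensitive to outliers.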

In [60]:
from sklearn.preprocessing import MaxAbsScaler

ma = MaxAbsScaler()
ma.fit(df_a)
a_scaled = ma.transform(df_a)

# Scale the test data
b_scaled = ma.transform(df_b)

In [61]:
a_scaled

array([[0.2],
       [0.4],
       [0.6],
       [0.8],
       [1. ]])

In [62]:
b_scaled

array([[0.3],
       [0.5],
       [0.7]])

In [67]:
e = [-10,-6,-2,2,6,10]
df_e = pd.DataFrame(e)
ma.fit(df_e)

In [68]:
e_scaled = ma.transform(df_e)
e_scaled

array([[-1. ],
       [-0.6],
       [-0.2],
       [ 0.2],
       [ 0.6],
       [ 1. ]])
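
To make the relationship with MinMaxScaler concrete, the sketch below compares both scalers on two positive arrays. The second array, whose minimum is 0, is an assumption added for illustration.

```python
import numpy as np
from sklearn.preprocessing import MaxAbsScaler, MinMaxScaler

a = np.array([[2.0], [4.0], [6.0], [8.0], [10.0]])          # positive, minimum is 2
z = np.array([[0.0], [2.0], [4.0], [6.0], [8.0], [10.0]])   # positive, minimum is 0

# With a minimum of 2 the results differ: x / 10 versus (x - 2) / 8.
a_maxabs = MaxAbsScaler().fit_transform(a)
a_minmax = MinMaxScaler().fit_transform(a)
print(a_maxabs.ravel())
print(a_minmax.ravel())

# With a minimum of 0 both scalers reduce to x / 10.
z_maxabs = MaxAbsScaler().fit_transform(z)
z_minmax = MinMaxScaler().fit_transform(z)
print(np.allclose(z_maxabs, z_minmax))
```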

# RobustScaler()
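Uses the median and the quartiles instead of the mean and variance.\
The median is the middle value of the sorted data,\
and the quartiles are the values at the 1/4 and 3/4 positions. Each value has the median subtracted and is then divided by the interquartile range (IQR), the distance between these two quartiles.\
This minimizes the influence of outliers.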

In [73]:
from sklearn.preprocessing import RobustScaler

rb = RobustScaler()
rb.fit(df_a)
a_scaled = rb.transform(df_a)

# Scale the test data
b_scaled = rb.transform(df_b)

In [74]:
a_scaled 

array([[-1. ],
       [-0.5],
       [ 0. ],
       [ 0.5],
       [ 1. ]])

In [77]:
a_scaled.std()

0.7071067811865476

In [78]:
a_scaled.mean()

0.0

In [75]:
b_scaled

array([[-0.75],
       [-0.25],
       [ 0.25]])
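
The fitted `center_` and `scale_` attributes show the median and IQR that RobustScaler uses. The second half of this sketch adds a hypothetical outlier (1000) to compare the effect on RobustScaler and StandardScaler.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

a = np.array([[2.0], [4.0], [6.0], [8.0], [10.0]])
rb = RobustScaler().fit(a)
print(rb.center_)  # median of the training data
print(rb.scale_)   # IQR = Q3 - Q1 of the training data

# Hypothetical data with one extreme outlier.
x = np.array([[2.0], [4.0], [6.0], [8.0], [1000.0]])
x_robust = RobustScaler().fit_transform(x)
x_standard = StandardScaler().fit_transform(x)

print(x_robust.ravel())    # the four inliers keep their spacing
print(x_standard.ravel())  # the four inliers are squeezed close together
```

The median and IQR barely move when a single extreme value is added, whereas the mean and standard deviation are pulled toward it.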

# Normalizer
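- The four scalers above work with the statistics of each feature (column).
- Normalizer, in contrast, is applied to each sample (row).
- It rescales each row so that the Euclidean length (L2 norm) of its feature vector is 1.
- It is not meant for general data preprocessing. It is typically applied to vectors learned inside a model (especially in deep learning).
- It should be avoided in particular when the features have different units (height, age, income, and so on).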

In [81]:
from sklearn.preprocessing import Normalizer

nm = Normalizer()
nm.fit(df_a)
a_scaled = nm.transform(df_a)

# Scale the test data
b_scaled = nm.transform(df_b)

In [82]:
a_scaled

array([[1.],
       [1.],
       [1.],
       [1.],
       [1.]])

In [83]:
b_scaled

array([[1.],
       [1.],
       [1.]])
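
Every value above becomes 1 because each row holds a single positive feature, and dividing a one-element row by its own length always gives 1. The row-wise behaviour is easier to see with more than one feature per row. The 2-feature array below is made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import Normalizer

# Hypothetical samples with two features each.
X = np.array([[3.0, 4.0],
              [1.0, 0.0],
              [5.0, 12.0]])

# Each row is divided by its own L2 norm; no column statistics are used.
X_norm = Normalizer(norm="l2").fit_transform(X)
print(X_norm)

# Every row now has Euclidean length 1.
row_lengths = np.linalg.norm(X_norm, axis=1)
print(row_lengths)
```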

## Trying it on real data

In [85]:
df = pd.read_csv("./data/성장성 지표.csv")

In [88]:
df.fillna(0,inplace=True)


In [95]:
from sklearn.preprocessing import StandardScaler

std = StandardScaler()

In [96]:
# Mistake: the scaler itself is passed instead of the data, so this call raises a TypeError.
std.fit(std)

TypeError: float() argument must be a string or a number, not 'StandardScaler'

In [98]:
df.columns

Index(['통계표', '업종코드', '규모선택', '지표선택', '단위', '변환', '2022/Q3', '2022/Q2',
       '2022/Q1', '2021/Q4', '2021/Q3', '2021/Q2', '2021/Q1', '2020/Q4',
       '2020/Q3', '2020/Q2', '2020/Q1', '2019/Q4', '2019/Q3'],
      dtype='object')

In [102]:
df_data = df[['2022/Q3', '2022/Q2','2022/Q1', '2021/Q4', '2021/Q3', '2021/Q2', '2021/Q1', '2020/Q4',
       '2020/Q3', '2020/Q2', '2020/Q1', '2019/Q4', '2019/Q3']]
df_data

Unnamed: 0,2022/Q3,2022/Q2,2022/Q1,2021/Q4,2021/Q3,2021/Q2,2021/Q1,2020/Q4,2020/Q3,2020/Q2,2020/Q1,2019/Q4,2019/Q3
0,2.76,2.33,3.73,0.0,3.05,1.40,3.29,0.0,1.87,1.10,1.52,0.0,1.12
1,17.52,20.51,17.04,0.0,15.44,18.65,7.37,0.0,-3.17,-10.11,-1.86,0.0,-2.79
2,2.72,2.21,3.78,0.0,2.86,0.79,3.19,0.0,1.72,0.78,1.16,0.0,0.85
3,19.00,23.02,20.15,0.0,16.74,20.22,7.09,0.0,-3.60,-11.34,-1.87,0.0,-3.34
4,2.95,2.86,3.55,0.0,3.70,3.50,3.76,0.0,2.52,2.49,3.12,0.0,2.33
...,...,...,...,...,...,...,...,...,...,...,...,...,...
121,24.72,23.36,18.65,0.0,26.07,33.88,-6.60,0.0,-23.38,-27.84,-11.79,0.0,13.24
122,0.00,0.00,0.00,0.0,0.00,0.00,0.00,0.0,0.00,0.00,0.00,0.0,0.00
123,0.00,0.00,0.00,0.0,0.00,0.00,0.00,0.0,0.00,0.00,0.00,0.0,0.00
124,0.00,0.00,0.00,0.0,0.00,0.00,0.00,0.0,0.00,0.00,0.00,0.0,0.00


In [106]:
std.fit(df_data)

In [107]:
df_data_scaled = std.transform(df_data)
df_data_scaled

array([[-0.19724402, -0.27259568, -0.06540187, ...,  0.71426371,
         0.        ,  0.46469538],
       [ 1.57898362,  1.80523432,  1.57778714, ..., -0.67740354,
         0.        , -1.22300698],
       [-0.20205764, -0.28631073, -0.05922911, ...,  0.56603879,
         0.        ,  0.34815327],
       ...,
       [-0.52938414, -0.53889622, -0.52588986, ...,  0.08842518,
         0.        , -0.01873855],
       [-0.52938414, -0.53889622, -0.52588986, ...,  0.08842518,
         0.        , -0.01873855],
       [-0.52938414, -0.53889622, -0.52588986, ...,  0.08842518,
         0.        , -0.01873855]])

In [108]:
df_data_scaled = pd.DataFrame(df_data_scaled, columns = df_data.columns)

In [109]:
df_data_scaled

Unnamed: 0,2022/Q3,2022/Q2,2022/Q1,2021/Q4,2021/Q3,2021/Q2,2021/Q1,2020/Q4,2020/Q3,2020/Q2,2020/Q1,2019/Q4,2019/Q3
0,-0.197244,-0.272596,-0.065402,0.0,-0.096458,-0.329849,0.465546,0.0,0.621585,0.537519,0.714264,0.0,0.464695
1,1.578984,1.805234,1.577787,0.0,1.500074,1.674529,1.680655,0.0,-0.652441,-1.569096,-0.677404,0.0,-1.223007
2,-0.202058,-0.286311,-0.059229,0.0,-0.120941,-0.400728,0.435764,0.0,0.583668,0.477384,0.566039,0.0,0.348153
3,1.757088,2.092107,1.961733,0.0,1.667587,1.856957,1.597265,0.0,-0.761137,-1.800241,-0.681521,0.0,-1.460408
4,-0.174379,-0.212021,-0.087624,0.0,-0.012702,-0.085837,0.605522,0.0,0.785894,0.798732,1.373041,0.0,0.986977
...,...,...,...,...,...,...,...,...,...,...,...,...,...
121,2.445436,2.130967,1.776550,0.0,2.869818,3.444191,-2.479902,0.0,-5.761184,-4.900968,-4.765941,0.0,5.696141
122,-0.529384,-0.538896,-0.525890,0.0,-0.489471,-0.492523,-0.514285,0.0,0.148881,0.330804,0.088425,0.0,-0.018739
123,-0.529384,-0.538896,-0.525890,0.0,-0.489471,-0.492523,-0.514285,0.0,0.148881,0.330804,0.088425,0.0,-0.018739
124,-0.529384,-0.538896,-0.525890,0.0,-0.489471,-0.492523,-0.514285,0.0,0.148881,0.330804,0.088425,0.0,-0.018739


In [110]:
df_data_scaled.mean()

2022/Q3    1.586033e-17
2022/Q2    7.577713e-17
2022/Q1    1.621278e-16
2021/Q4    0.000000e+00
2021/Q3   -3.700743e-17
2021/Q2   -4.317534e-17
2021/Q1    6.872809e-17
2020/Q4    0.000000e+00
2020/Q3   -1.480297e-16
2020/Q2   -2.819614e-17
2020/Q1    5.286776e-17
2019/Q4    0.000000e+00
2019/Q3    1.070572e-16
dtype: float64

In [111]:
df_data_scaled.std()

2022/Q3    1.003992
2022/Q2    1.003992
2022/Q1    1.003992
2021/Q4    0.000000
2021/Q3    1.003992
2021/Q2    1.003992
2021/Q1    1.003992
2020/Q4    0.000000
2020/Q3    1.003992
2020/Q2    1.003992
2020/Q1    1.003992
2019/Q4    0.000000
2019/Q3    1.003992
dtype: float64
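
The standard deviations above are slightly larger than 1 because pandas' `std()` defaults to the sample standard deviation (`ddof=1`), while StandardScaler divides by the population standard deviation (`ddof=0`). The ratio between the two is sqrt(n / (n - 1)) for n rows. The columns that show 0 are constant columns, which StandardScaler leaves at 0. A small sketch with made-up data:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical data: one varying column and one constant column.
df = pd.DataFrame({"x": [2.0, 4.0, 6.0, 8.0, 10.0], "const": [0.0] * 5})
scaled = pd.DataFrame(StandardScaler().fit_transform(df), columns=df.columns)

print(scaled.std())        # ddof=1 (pandas default): "x" is sqrt(5 / 4)
print(scaled.std(ddof=0))  # ddof=0 (what StandardScaler uses): "x" is 1
```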