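# Data Normalization

A technique for transforming data to a specific range or scale before it is processed or analyzed.

The goal of data normalization is to improve data analysis or machine learning model performance by `bringing data with different units or ranges onto a common scale`.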

## #01. Preparation

### Package imports

In [2]:
from pandas import read_excel
from sklearn.preprocessing import MinMaxScaler, StandardScaler, RobustScaler

In [3]:
df = read_excel("https://data.hossam.kr/D05/gradeuate.xlsx")
df

Unnamed: 0,합격여부,필기점수,학부성적,병원경력
0,0,380,3.61,3
1,1,660,3.67,3
2,1,800,4.00,1
3,1,640,3.19,4
4,0,520,2.93,4
...,...,...,...,...
395,0,620,4.00,2
396,0,560,3.04,3
397,0,460,2.63,2
398,0,700,3.65,2

Column names: `합격여부` (admission result), `필기점수` (written test score), `학부성적` (undergraduate grade), `병원경력` (hospital experience).


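## #02. Min-Max Normalization

Rescales all values into the range `0~1`.

The minimum value in the data maps to 0 and the maximum maps to 1.

$X_{scaled} = \frac{X - X_{min}}{X_{max} - X_{min}}$

This method is useful for compressing data into a fixed range while preserving the shape of its distribution.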
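
As a minimal sketch of the formula, the helper below rescales a plain Python list. The function name `min_max_scale` and the sample scores are hypothetical and not part of the dataset; returning zeros when every value is identical is just one possible convention.

```python
def min_max_scale(values):
    """Rescale a sequence of numbers to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical, so the denominator would be zero.
        # Returning zeros is an assumed convention for this sketch.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical scores, not taken from the dataset
scores = [200, 350, 500, 800]
scaled = min_max_scale(scores)
print(scaled)  # [0.0, 0.25, 0.5, 1.0]
```

The smallest score becomes 0, the largest becomes 1, and every other score keeps its relative position between them.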

### Manual calculation

In [4]:
Xmin = df['필기점수'].min()
Xmax = df['필기점수'].max()
df['필기점수_MinMax(1)'] = (df['필기점수']-Xmin) / (Xmax - Xmin)
df

Unnamed: 0,합격여부,필기점수,학부성적,병원경력,필기점수_MinMax(1)
0,0,380,3.61,3,0.275862
1,1,660,3.67,3,0.758621
2,1,800,4.00,1,1.000000
3,1,640,3.19,4,0.724138
4,0,520,2.93,4,0.517241
...,...,...,...,...,...
395,0,620,4.00,2,0.689655
396,0,560,3.04,3,0.586207
397,0,460,2.63,2,0.413793
398,0,700,3.65,2,0.827586


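### Using scikit-learn

In [6]:
# Create the min-max scaler object
scaler = MinMaxScaler()

# Let the scaler learn the minimum and maximum of the target column
scaler.fit(df[['필기점수']])

# Apply the scaling
df['필기점수_MinMax(2)'] = scaler.transform(df[['필기점수']])

# To fit and transform in a single call, use scaler.fit_transform instead
df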

Unnamed: 0,합격여부,필기점수,학부성적,병원경력,필기점수_MinMax(1),필기점수_MinMax(2)
0,0,380,3.61,3,0.275862,0.275862
1,1,660,3.67,3,0.758621,0.758621
2,1,800,4.00,1,1.000000,1.000000
3,1,640,3.19,4,0.724138,0.724138
4,0,520,2.93,4,0.517241,0.517241
...,...,...,...,...,...,...
395,0,620,4.00,2,0.689655,0.689655
396,0,560,3.04,3,0.586207,0.586207
397,0,460,2.63,2,0.413793,0.413793
398,0,700,3.65,2,0.827586,0.827586
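
One reason `fit` and `transform` are separate steps is that a fitted scaler can be reused on data it has not seen. The sketch below uses small hypothetical arrays (not the dataset) to show `fit_transform`, `transform` on new values, and `inverse_transform`.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical training scores, shaped (n_samples, 1) as scikit-learn expects
train = np.array([[200.0], [500.0], [800.0]])
# Hypothetical new scores; 900 lies outside the training range
new = np.array([[350.0], [900.0]])

scaler = MinMaxScaler()
train_scaled = scaler.fit_transform(train)  # fit and transform in one call
new_scaled = scaler.transform(new)          # reuses the learned min and max

print(train_scaled.ravel())  # training values mapped into 0~1
print(new_scaled.ravel())    # 900 maps above 1 because it exceeds the training max

# inverse_transform maps scaled values back to the original units
restored = scaler.inverse_transform(train_scaled)
print(restored.ravel())
```

With the default settings, values outside the range seen during `fit` are not clipped, so they can fall below 0 or above 1.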


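## #03. Standardization (StandardScaler), z-score normalization

Transforms the data so that its mean is `0` and its standard deviation is `1`. This only shifts and rescales the values. It does not change the shape of the distribution, so the result follows a standard normal distribution only if the original data was already normally distributed.

$z = \frac{X - \mu}{\sigma}$, where $\mu$ is the mean and $\sigma$ is the standard deviation

Because the output is not squeezed into a fixed range, a single extreme value compresses the remaining values less than it does with Min-Max scaling. The mean and standard deviation themselves are still affected by outliers.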
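
A quick check of what z-score scaling does and does not change, using hypothetical right-skewed values rather than the dataset: the mean and standard deviation become 0 and 1, while the skewness stays the same.

```python
import pandas as pd

# Hypothetical right-skewed values (one large value stretches the tail)
s = pd.Series([1.0, 2.0, 2.0, 3.0, 3.0, 3.0, 4.0, 20.0])

# ddof=0 selects the population standard deviation
z = (s - s.mean()) / s.std(ddof=0)

print(z.mean(), z.std(ddof=0))  # approximately 0 and 1
print(s.skew(), z.skew())       # skewness is unchanged by the scaling
```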

So which one should you use?
- If the features have similar units, use MinMax
- If the features have very different units, use Standard
- If you're not sure, use Standard


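### Manual calculation

In [7]:
# 평균 = mean, 표준편차 = standard deviation
# Note: pandas std() defaults to the sample standard deviation (ddof=1)
평균 = df['학부성적'].mean()
표준편차 = df['학부성적'].std()
df['학부성적_Standard(1)'] = (df['학부성적']-평균)/표준편차
df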

Unnamed: 0,합격여부,필기점수,학부성적,병원경력,필기점수_MinMax(1),필기점수_MinMax(2),학부성적_Standard(1)
0,0,380,3.61,3,0.275862,0.275862,0.578348
1,1,660,3.67,3,0.758621,0.758621,0.736008
2,1,800,4.00,1,1.000000,1.000000,1.603135
3,1,640,3.19,4,0.724138,0.724138,-0.525269
4,0,520,2.93,4,0.517241,0.517241,-1.208461
...,...,...,...,...,...,...,...
395,0,620,4.00,2,0.689655,0.689655,1.603135
396,0,560,3.04,3,0.586207,0.586207,-0.919418
397,0,460,2.63,2,0.413793,0.413793,-1.996758
398,0,700,3.65,2,0.827586,0.827586,0.683455


### Using scikit-learn

In [8]:
scaler = StandardScaler()
df['학부성적_Standard(2)'] = scaler.fit_transform(df[['학부성적']])
df

Unnamed: 0,합격여부,필기점수,학부성적,병원경력,필기점수_MinMax(1),필기점수_MinMax(2),학부성적_Standard(1),학부성적_Standard(2)
0,0,380,3.61,3,0.275862,0.275862,0.578348,0.579072
1,1,660,3.67,3,0.758621,0.758621,0.736008,0.736929
2,1,800,4.00,1,1.000000,1.000000,1.603135,1.605143
3,1,640,3.19,4,0.724138,0.724138,-0.525269,-0.525927
4,0,520,2.93,4,0.517241,0.517241,-1.208461,-1.209974
...,...,...,...,...,...,...,...,...
395,0,620,4.00,2,0.689655,0.689655,1.603135,1.605143
396,0,560,3.04,3,0.586207,0.586207,-0.919418,-0.920570
397,0,460,2.63,2,0.413793,0.413793,-1.996758,-1.999259
398,0,700,3.65,2,0.827586,0.827586,0.683455,0.684310
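
The `학부성적_Standard(1)` and `학부성적_Standard(2)` columns differ slightly from the third decimal place on. The gap is what you would expect from the two standard deviation conventions: pandas `std()` defaults to the sample standard deviation (`ddof=1`), while `StandardScaler` uses the population standard deviation (`ddof=0`). The sketch below reproduces the effect on hypothetical values; the column name `gpa` is made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical grade values, not the dataset
gpa = pd.DataFrame({"gpa": [2.5, 3.0, 3.5, 4.0]})
col = gpa["gpa"]

z_sample = (col - col.mean()) / col.std()            # ddof=1, the pandas default
z_population = (col - col.mean()) / col.std(ddof=0)  # ddof=0
z_sklearn = StandardScaler().fit_transform(gpa[["gpa"]]).ravel()

print(np.allclose(z_population, z_sklearn))  # True
print(np.allclose(z_sample, z_sklearn))      # False
```

Passing `ddof=0` to `std()` in the manual calculation should make it agree with `StandardScaler`. With several hundred rows the two conventions differ only marginally, so either is usually acceptable for scaling.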


## #04. RobustScaler

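A scaling method for data that contains outliers.

It scales the data while minimizing the influence of the outliers.

When data containing outliers is standardized or normalized, the outliers distort the scale of all the other values.

RobustScaler addresses this by scaling with the median and the interquartile range (IQR) instead of the mean and standard deviation.

$X_{scaled} = \frac{X - median}{IQR}$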
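
To make the effect of an outlier concrete, the sketch below runs the three scalers on a hypothetical column with four ordinary values and one extreme value. The data is invented for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler, RobustScaler

# Hypothetical values: four ordinary points and one extreme outlier
x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

results = {}
for scaler in (MinMaxScaler(), StandardScaler(), RobustScaler()):
    name = type(scaler).__name__
    results[name] = scaler.fit_transform(x).ravel()
    print(name, np.round(results[name], 3))
```

With Min-Max scaling the four ordinary points are squeezed into roughly 0 to 0.03, because the outlier defines the top of the range. With RobustScaler the median is 3 and the IQR is 2, so the ordinary points keep a spread from -1 to 0.5 while the outlier remains far away at 48.5.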

### Manual calculation

In [9]:
# 중앙값 = median; IQR = 3rd quartile (75%) - 1st quartile (25%)
중앙값 = df['병원경력'].median()
iqr = df['병원경력'].quantile(0.75) - df['병원경력'].quantile(0.25)
df['병원경력_Robust(1)'] = (df['병원경력'] - 중앙값) / iqr
df

Unnamed: 0,합격여부,필기점수,학부성적,병원경력,필기점수_MinMax(1),필기점수_MinMax(2),학부성적_Standard(1),학부성적_Standard(2),병원경력_Robust(1)
0,0,380,3.61,3,0.275862,0.275862,0.578348,0.579072,1.0
1,1,660,3.67,3,0.758621,0.758621,0.736008,0.736929,1.0
2,1,800,4.00,1,1.000000,1.000000,1.603135,1.605143,-1.0
3,1,640,3.19,4,0.724138,0.724138,-0.525269,-0.525927,2.0
4,0,520,2.93,4,0.517241,0.517241,-1.208461,-1.209974,2.0
...,...,...,...,...,...,...,...,...,...
395,0,620,4.00,2,0.689655,0.689655,1.603135,1.605143,0.0
396,0,560,3.04,3,0.586207,0.586207,-0.919418,-0.920570,1.0
397,0,460,2.63,2,0.413793,0.413793,-1.996758,-1.999259,0.0
398,0,700,3.65,2,0.827586,0.827586,0.683455,0.684310,0.0


### Using scikit-learn

In [10]:
scaler = RobustScaler()
scaler.fit(df[['병원경력']])
df['병원경력_Robust(2)'] = scaler.transform(df[['병원경력']])
df

Unnamed: 0,합격여부,필기점수,학부성적,병원경력,필기점수_MinMax(1),필기점수_MinMax(2),학부성적_Standard(1),학부성적_Standard(2),병원경력_Robust(1),병원경력_Robust(2)
0,0,380,3.61,3,0.275862,0.275862,0.578348,0.579072,1.0,1.0
1,1,660,3.67,3,0.758621,0.758621,0.736008,0.736929,1.0,1.0
2,1,800,4.00,1,1.000000,1.000000,1.603135,1.605143,-1.0,-1.0
3,1,640,3.19,4,0.724138,0.724138,-0.525269,-0.525927,2.0,2.0
4,0,520,2.93,4,0.517241,0.517241,-1.208461,-1.209974,2.0,2.0
...,...,...,...,...,...,...,...,...,...,...
395,0,620,4.00,2,0.689655,0.689655,1.603135,1.605143,0.0,0.0
396,0,560,3.04,3,0.586207,0.586207,-0.919418,-0.920570,1.0,1.0
397,0,460,2.63,2,0.413793,0.413793,-1.996758,-1.999259,0.0,0.0
398,0,700,3.65,2,0.827586,0.827586,0.683455,0.684310,0.0,0.0
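
The manual and scikit-learn columns match because a fitted `RobustScaler` stores the median in `center_` and the IQR in `scale_`. A small sketch on hypothetical values (the column name `level` is made up) that inspects those attributes:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import RobustScaler

# Hypothetical small integer values, not the dataset
levels = pd.DataFrame({"level": [1, 2, 2, 3, 3, 4]})

scaler = RobustScaler().fit(levels[["level"]])

median = levels["level"].median()
iqr = levels["level"].quantile(0.75) - levels["level"].quantile(0.25)

print(scaler.center_[0], median)  # learned center vs. pandas median
print(scaler.scale_[0], iqr)      # learned scale vs. pandas IQR
```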
