<a href="https://colab.research.google.com/github/JakeOh/202007_itw_bd18/blob/master/lab_python/python58_transform.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [4]:
np.random.seed(1)

df = pd.DataFrame(data={'gender': ['M'] * 3 + ['F'] * 3,
                        'income': np.random.randint(1, 11, 6)})

df

  gender  income
0      M       6
1      M       9
2      M      10
3      F       6
4      F       1
5      F       1


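* **Standardization**: transform a variable (column) so that its mean becomes 0 and its standard deviation becomes 1.
* **Normalization** (min-max scaling): transform a variable (column) so that its minimum becomes 0, its maximum becomes 1, and every value in between falls in the range 0 to 1.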

In [10]:
def standardization(x):
    """Standardize x to mean 0 and standard deviation 1.

    x: array-like (numpy.ndarray, pandas.Series, ...).
    Returns x_prime = (x - mean(x)) / std(x).
    """
    return (x - x.mean()) / x.std()

In [11]:
def normalization(x):
    """Rescale x to the range [0, 1] (min-max normalization).

    x: array-like (numpy.ndarray, pandas.Series, ...).
    Returns x_prime = (x - min(x)) / (max(x) - min(x)).
    """
    return (x - x.min()) / (x.max() - x.min())

In [12]:
income_std = standardization(df['income'])
income_std

0    0.130410
1    0.912871
2    1.173691
3    0.130410
4   -1.173691
5   -1.173691
Name: income, dtype: float64

In [13]:
income_std.mean()  #> mean of the transformed data = 0

0.0

In [14]:
income_std.std()  #> standard deviation of the transformed data = 1

1.0
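
The two checks above hold because `pandas.Series.std()` uses the sample standard deviation (`ddof=1`) by default, whereas `numpy.ndarray.std()` defaults to the population standard deviation (`ddof=0`). The sketch below re-declares `standardization` so that it runs on its own, hard-codes the six income values from the table, and shows that the same function therefore returns slightly different numbers for the two input types.

```python
import numpy as np
import pandas as pd


def standardization(x):
    return (x - x.mean()) / x.std()


income = [6, 9, 10, 6, 1, 1]  # the income column generated above

z_series = standardization(pd.Series(income))  # Series.std() defaults to ddof=1
z_array = standardization(np.array(income))    # ndarray.std() defaults to ddof=0

# Each result has unit standard deviation only under its own convention.
print(z_series.std(ddof=1))  # approximately 1
print(z_array.std())         # approximately 1 (ddof=0)
print(z_series.std(ddof=0))  # sqrt(5/6), approximately 0.913
```

If results must match across the two types, passing an explicit `ddof` to `std()` in both cases is one way to remove the ambiguity.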

In [16]:
income_norm = normalization(df['income'])
income_norm

0    0.555556
1    0.888889
2    1.000000
3    0.555556
4    0.000000
5    0.000000
Name: income, dtype: float64

In [20]:
df['income'].transform(standardization)

0    0.130410
1    0.912871
2    1.173691
3    0.130410
4   -1.173691
5   -1.173691
Name: income, dtype: float64

In [21]:
df['income'].transform([standardization, normalization])

   standardization  normalization
0         0.130410       0.555556
1         0.912871       0.888889
2         1.173691       1.000000
3         0.130410       0.555556
4        -1.173691       0.000000
5        -1.173691       0.000000


Lambda expression:
```
lambda param1, param2, ...: return_value
```
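
As a small illustration of the syntax above, the sketch below defines the same computation once with `def` and once with `lambda`. The names `add`, `add_lambda`, and `words` are made up for this example.

```python
# A regular function and an equivalent lambda expression.
def add(x, y):
    return x + y


add_lambda = lambda x, y: x + y  # parameters before the colon, return value after

print(add(1, 2), add_lambda(1, 2))  # 3 3

# Lambdas are typically passed inline, for example as a sort key.
words = ['pandas', 'np', 'seaborn']
print(sorted(words, key=lambda w: len(w)))  # ['np', 'pandas', 'seaborn']
```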

In [22]:
df['income'].transform(lambda x: (x - x.mean()) / x.std())

0    0.130410
1    0.912871
2    1.173691
3    0.130410
4   -1.173691
5   -1.173691
Name: income, dtype: float64

In [23]:
df['gender'].transform(lambda x: x.lower())

0    m
1    m
2    m
3    f
4    f
5    f
Name: gender, dtype: object

In [26]:
df['gender'].transform(lambda x: 0 if x == 'M' else 1)

0    0
1    0
2    0
3    1
4    1
5    1
Name: gender, dtype: int64
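
The two elementwise conversions above can also be written without a lambda. The following is a sketch of that alternative, using `Series.str.lower()` and `Series.map()` with a dictionary; it rebuilds the `gender` column so that it runs on its own.

```python
import pandas as pd

gender = pd.Series(['M'] * 3 + ['F'] * 3, name='gender')

lowered = gender.str.lower()            # same values as transform(lambda x: x.lower())
encoded = gender.map({'M': 0, 'F': 1})  # same values as the 0/1 lambda

print(lowered.tolist())  # ['m', 'm', 'm', 'f', 'f', 'f']
print(encoded.tolist())  # [0, 0, 0, 1, 1, 1]
```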

In [32]:
df['income_std'] = df['income'].transform(standardization)
df

  gender  income  income_std
0      M       6    0.130410
1      M       9    0.912871
2      M      10    1.173691
3      F       6    0.130410
4      F       1   -1.173691
5      F       1   -1.173691


In [34]:
df['income_norm'] = df['income'].transform(normalization)
df

  gender  income  income_std  income_norm
0      M       6    0.130410     0.555556
1      M       9    0.912871     0.888889
2      M      10    1.173691     1.000000
3      F       6    0.130410     0.555556
4      F       1   -1.173691     0.000000
5      F       1   -1.173691     0.000000


In [36]:
df.groupby('gender')['income'].transform(standardization)

0   -1.120897
1    0.320256
2    0.800641
3    1.154701
4   -0.577350
5   -0.577350
Name: income, dtype: float64

In [37]:
df.groupby('gender')['income'].transform(normalization)

0    0.00
1    0.75
2    1.00
3    1.00
4    0.00
5    0.00
Name: income, dtype: float64
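
To confirm that the group-wise transforms above operate within each gender, the sketch below recomputes the standardization on the same six income values and checks the per-group statistics. It also shows an equivalent formulation that broadcasts the built-in `'mean'` and `'std'` aggregations back to the original rows; that second form is offered as an alternative, not as what the notebook does.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'gender': ['M'] * 3 + ['F'] * 3,
                   'income': [6, 9, 10, 6, 1, 1]})
grouped = df.groupby('gender')['income']

# Group-wise standardization with a user-defined function.
z = grouped.transform(lambda x: (x - x.mean()) / x.std())

# Same result: transform('mean') and transform('std') return the group
# statistic repeated for every row of that group.
z_alt = (df['income'] - grouped.transform('mean')) / grouped.transform('std')

# Within each gender the standardized values have mean 0 and std 1.
check = z.groupby(df['gender']).agg(['mean', 'std'])
print(check.round(6))
```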

* Replacing missing values:
  * Replace with the mean, replace with the mode (most frequent value), ...
  * Replace missing values per group by using a group-wise transform
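
The first bullet can be sketched as follows: the mean is a natural fill value for a numeric column, while the most frequent value (mode) also works for categorical data. The two small Series, `scores` and `colors`, are made up for this illustration.

```python
import numpy as np
import pandas as pd

scores = pd.Series([1.0, 2.0, 2.0, np.nan, 5.0])
colors = pd.Series(['red', 'blue', None, 'blue'])

# Mean imputation: (1 + 2 + 2 + 5) / 4 = 2.5
scores_mean = scores.fillna(scores.mean())

# Mode imputation: mode() returns a Series (there may be ties),
# so take its first entry.
colors_mode = colors.fillna(colors.mode().iloc[0])

print(scores_mean.tolist())  # [1.0, 2.0, 2.0, 2.5, 5.0]
print(colors_mode.tolist())  # ['red', 'blue', 'blue', 'blue']
```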

In [39]:
df = pd.DataFrame(data={'gender': ['M'] * 3 + ['F'] * 3,
                        'income': [1, np.nan, 3, np.nan, 4, 6]})
df

  gender  income
0      M     1.0
1      M     NaN
2      M     3.0
3      F     NaN
4      F     4.0
5      F     6.0


In [40]:
df['income'].mean()  # (1 + 3 + 4 + 6) / 4

3.5

In [41]:
df['income'].fillna(df['income'].mean())

0    1.0
1    3.5
2    3.0
3    3.5
4    4.0
5    6.0
Name: income, dtype: float64

In [51]:
s = df.groupby('gender')['income'].mean()
s

gender
F    5.0
M    2.0
Name: income, dtype: float64

In [55]:
df.fillna(s)  # no effect: s is indexed by gender ('F', 'M'), which matches no column of df

  gender  income
0      M     1.0
1      M     NaN
2      M     3.0
3      F     NaN
4      F     4.0
5      F     6.0


In [56]:
df.groupby('gender')['income'].transform(lambda x: x.fillna(x.mean()))

0    1.0
1    2.0
2    3.0
3    5.0
4    4.0
5    6.0
Name: income, dtype: float64
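
Since the group means in `s` are indexed by gender rather than by row, another way to reach the same result is to look up each row's group mean first and fill from that. This is a sketch of an alternative to the group-wise `transform`, with `row_means` and `filled` as names chosen for the example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'gender': ['M'] * 3 + ['F'] * 3,
                   'income': [1, np.nan, 3, np.nan, 4, 6]})

s = df.groupby('gender')['income'].mean()  # F -> 5.0, M -> 2.0

# Look up each row's group mean; the result shares df's row index.
row_means = df['gender'].map(s)
filled = df['income'].fillna(row_means)

print(filled.tolist())  # [1.0, 2.0, 3.0, 5.0, 4.0, 6.0]
```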

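* Create a data frame from the iris data set included in the seaborn package.
* Standardize/normalize all variables except the species.
* Standardize/normalize all variables except the species, separately for each species.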
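
One possible approach to this exercise is sketched below. To stay self-contained it uses a few made-up rows with the same column names as the iris data set; with network access, `sns.load_dataset('iris')` would load the real data instead. `standardization` and `normalization` are re-declared from earlier in the notebook.

```python
import numpy as np
import pandas as pd


def standardization(x):
    return (x - x.mean()) / x.std()


def normalization(x):
    return (x - x.min()) / (x.max() - x.min())


# Hypothetical stand-in rows, not real iris measurements.
iris = pd.DataFrame({
    'sepal_length': [5.0, 4.6, 7.0, 6.0, 6.3, 5.8],
    'sepal_width': [3.4, 3.0, 3.2, 2.8, 3.3, 2.7],
    'petal_length': [1.4, 1.6, 4.7, 4.1, 6.0, 5.1],
    'petal_width': [0.2, 0.4, 1.4, 1.2, 2.5, 1.9],
    'species': ['setosa', 'setosa', 'versicolor', 'versicolor',
                'virginica', 'virginica'],
})
features = [col for col in iris.columns if col != 'species']

# Over the whole data set: apply() passes each column to the function.
iris_std = iris[features].apply(standardization)
iris_norm = iris[features].apply(normalization)

# Per species: transform() passes each group to the function.
by_species = iris.groupby('species')[features]
iris_std_by_species = by_species.transform(standardization)
iris_norm_by_species = by_species.transform(normalization)

print(iris_norm_by_species)
```

A column whose values are all identical within a group would produce NaN here, because both formulas then divide by zero.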

* Load the tips sample data frame from the seaborn package.
* Compute the mean total bill and the maximum and minimum tip by sex and time.
  * pivot_table
  * groupby
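
A sketch of both approaches for this exercise, again on a handful of made-up rows that borrow the `sex`, `time`, `total_bill`, and `tip` column names from the tips data set (the real data would come from `sns.load_dataset('tips')`). The output column names `bill_mean`, `tip_max`, and `tip_min` are chosen for this example.

```python
import pandas as pd

# Hypothetical stand-in rows, not real tips records.
tips = pd.DataFrame({
    'sex': ['Female', 'Male', 'Male', 'Female', 'Male', 'Female'],
    'time': ['Dinner', 'Dinner', 'Lunch', 'Lunch', 'Dinner', 'Lunch'],
    'total_bill': [10.0, 20.0, 30.0, 12.0, 40.0, 18.0],
    'tip': [1.0, 3.0, 4.5, 2.0, 6.0, 3.0],
})

# pivot_table: a dict maps each value column to its aggregation(s);
# the result has two-level columns such as ('tip', 'max').
by_pivot = tips.pivot_table(index=['sex', 'time'],
                            values=['total_bill', 'tip'],
                            aggfunc={'total_bill': 'mean',
                                     'tip': ['max', 'min']})

# groupby: named aggregation gives flat, explicit column names.
by_groupby = tips.groupby(['sex', 'time']).agg(
    bill_mean=('total_bill', 'mean'),
    tip_max=('tip', 'max'),
    tip_min=('tip', 'min'),
)

print(by_pivot)
print(by_groupby)
```

Both results are indexed by the (`sex`, `time`) pairs; the main difference is the shape of the column labels.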