# mLab 6: Encode & Transform Data

- Columns represent variables (features).
- $y=f(x)$, where $y$ is the target/dependent variable and $x$ is the independent variable.
- Encoding is the process of converting text (categorical) values to numbers.
- The process of replacing missing values is called **imputation**.

In [104]:
from numpy import array
from sklearn.preprocessing import LabelEncoder
import pandas as pd

In [132]:
# y = array(['Positive', 'Negative', 'Negative', 'Positive', 'Positive'])
y = array(['warm','hot','cold','luke warm','hot','cold','warm'])

In [146]:
label_encoder = LabelEncoder()

In [147]:
integer_encoded = label_encoder.fit_transform(y)

In [148]:
integer_encoded

array([3, 1, 0, 2, 1, 0, 3])
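The integer codes above are not arbitrary: `LabelEncoder` sorts the unique labels and uses each label's position as its code. A minimal sketch, reusing the same `y`, that prints the mapping (the name `mapping` is only illustrative):

```python
from numpy import array
from sklearn.preprocessing import LabelEncoder

y = array(['warm', 'hot', 'cold', 'luke warm', 'hot', 'cold', 'warm'])

label_encoder = LabelEncoder()
integer_encoded = label_encoder.fit_transform(y)

# classes_ holds the unique labels in sorted order;
# the position of a label in classes_ is its integer code
mapping = {str(label): code for code, label in enumerate(label_encoder.classes_)}
print(mapping)
```

Because the labels are sorted alphabetically, `'cold'` gets 0 and `'warm'` gets 3, which matches the array shown above.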

In [149]:
df = pd.Series(integer_encoded)

In [150]:
df

0    3
1    1
2    0
3    2
4    1
5    0
6    3
dtype: int64

## One Hot Encoder

In [152]:
from sklearn.preprocessing import OneHotEncoder

In [153]:
one_hot_encoder = OneHotEncoder(sparse_output=False)

In [154]:
integer_encoded = integer_encoded.reshape(len(integer_encoded), 1)

In [155]:
integer_encoded

array([[3],
       [1],
       [0],
       [2],
       [1],
       [0],
       [3]])

In [156]:
one_hot_encoded = one_hot_encoder.fit_transform(integer_encoded)

In [157]:
one_hot_encoded

array([[0., 0., 0., 1.],
       [0., 1., 0., 0.],
       [1., 0., 0., 0.],
       [0., 0., 1., 0.],
       [0., 1., 0., 0.],
       [1., 0., 0., 0.],
       [0., 0., 0., 1.]])

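One-hot encoding removes the artificial ordering that integer codes impose on non-numeric data. Here the codes follow alphabetical order, so a model could read `luke warm` (2) as greater than `hot` (1); with one-hot encoding, each column only marks whether a category is present.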

`[0, :]` selects the first row (index 0) and all columns.

In [158]:
import numpy as np

In [191]:
label_encoder.inverse_transform([np.argmax(one_hot_encoded[0,:])])[0]

'warm'

In [192]:
label_encoder.inverse_transform([np.argmax(one_hot_encoded[1,:])])[0]

'hot'

In [193]:
label_encoder.inverse_transform([np.argmax(one_hot_encoded[2,:])])[0]

'cold'

In [194]:
label_encoder.inverse_transform([np.argmax(one_hot_encoded[3,:])])[0]

'luke warm'

## Dealing With Missing Data

In [198]:
df = pd.DataFrame({
    'A':[np.nan, 13, 14, 15, 16, np.nan],
    'B':[12, np.nan, 14, 15, 16, np.nan],
    'C':[7, 8, 9, np.nan, 12, np.nan],
})

In [199]:
df

      A     B     C
0   NaN  12.0   7.0
1  13.0   NaN   8.0
2  14.0  14.0   9.0
3  15.0  15.0   NaN
4  16.0  16.0  12.0
5   NaN   NaN   NaN


In [201]:
df.dropna(how='all')

      A     B     C
0   NaN  12.0   7.0
1  13.0   NaN   8.0
2  14.0  14.0   9.0
3  15.0  15.0   NaN
4  16.0  16.0  12.0


In [202]:
df.dropna(how='any')

      A     B     C
2  14.0  14.0   9.0
4  16.0  16.0  12.0


In [203]:
df

      A     B     C
0   NaN  12.0   7.0
1  13.0   NaN   8.0
2  14.0  14.0   9.0
3  15.0  15.0   NaN
4  16.0  16.0  12.0
5   NaN   NaN   NaN


In [205]:
df.A.fillna(value=np.mean(df.A))

0    14.5
1    13.0
2    14.0
3    15.0
4    16.0
5    14.5
Name: A, dtype: float64

In [206]:
df

      A     B     C
0   NaN  12.0   7.0
1  13.0   NaN   8.0
2  14.0  14.0   9.0
3  15.0  15.0   NaN
4  16.0  16.0  12.0
5   NaN   NaN   NaN


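`fillna` returns a new object and leaves `df` unchanged, as the output above shows. To keep the result, assign it back to the column, or pass `inplace=True` to `df.fillna(...)` on the DataFrame itself.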
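A minimal sketch of the assign-back approach, using the same `df` as above. Only column `A` is filled here; the other columns are left untouched.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': [np.nan, 13, 14, 15, 16, np.nan],
    'B': [12, np.nan, 14, 15, 16, np.nan],
    'C': [7, 8, 9, np.nan, 12, np.nan],
})

# fillna returns a new Series; assigning it back updates the column in df
df['A'] = df['A'].fillna(df['A'].mean())
print(df['A'].tolist())
```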

In [207]:
df

      A     B     C
0   NaN  12.0   7.0
1  13.0   NaN   8.0
2  14.0  14.0   9.0
3  15.0  15.0   NaN
4  16.0  16.0  12.0
5   NaN   NaN   NaN


Before encoding categorical data, first impute it.

In [208]:
# I always want to see with numeric missing values



In [211]:
df.ffill()  # forward fill; fillna(method='ffill') is deprecated since pandas 2.1

      A     B     C
0   NaN  12.0   7.0
1  13.0  12.0   8.0
2  14.0  14.0   9.0
3  15.0  15.0   9.0
4  16.0  16.0  12.0
5  16.0  16.0  12.0


In [212]:
df.ffill()  # replaces the deprecated fillna(method='pad'); 'pad' is an alias of 'ffill'

      A     B     C
0   NaN  12.0   7.0
1  13.0  12.0   8.0
2  14.0  14.0   9.0
3  15.0  15.0   9.0
4  16.0  16.0  12.0
5  16.0  16.0  12.0


In [213]:
df.bfill()  # backward fill; fillna(method='bfill') is deprecated since pandas 2.1

      A     B     C
0  13.0  12.0   7.0
1  13.0  14.0   8.0
2  14.0  14.0   9.0
3  15.0  15.0  12.0
4  16.0  16.0  12.0
5   NaN   NaN   NaN


In [214]:
from sklearn.impute import SimpleImputer

In [215]:
df

      A     B     C
0   NaN  12.0   7.0
1  13.0   NaN   8.0
2  14.0  14.0   9.0
3  15.0  15.0   NaN
4  16.0  16.0  12.0
5   NaN   NaN   NaN


In [217]:
imp = SimpleImputer(strategy='mean')

In [219]:
imp.fit_transform(df)  # fills missing numerical values with each column's mean

array([[14.5 , 12.  ,  7.  ],
       [13.  , 14.25,  8.  ],
       [14.  , 14.  ,  9.  ],
       [15.  , 15.  ,  9.  ],
       [16.  , 16.  , 12.  ],
       [14.5 , 14.25,  9.  ]])
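`fit_transform` returns a plain NumPy array, so the column names are lost. A small sketch, assuming the same `df`, that wraps the result back into a DataFrame and inspects the fill values the imputer learned (`imputed` is an illustrative name):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({
    'A': [np.nan, 13, 14, 15, 16, np.nan],
    'B': [12, np.nan, 14, 15, 16, np.nan],
    'C': [7, 8, 9, np.nan, 12, np.nan],
})

imp = SimpleImputer(strategy='mean')

# Rebuild a DataFrame so the original column names and index are kept
imputed = pd.DataFrame(imp.fit_transform(df), columns=df.columns, index=df.index)

# statistics_ stores the per-column value used to fill the gaps
print(imp.statistics_)
print(imputed)
```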

1. Separate categorical and numeric columns.
2. Impute them separately.
3. Combine them.