# Imputation

https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html

__Fill in missing values__ in the *sex* column

1. Execute the code 
   

2. Understand what is happening
   
   QUESTION: What other imputation strategies exist (check out the "strategy" parameter in the documentation)?

3. Search the internet for the benefits of imputation

4. Explain to the rest of the group what you did
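
As a starting point for the question in step 2, here is a minimal sketch that tries four values of the `strategy` parameter on made-up toy data. The frames `numeric` and `categorical` and their values are assumptions for illustration, not the penguins data; check the linked documentation for the full list of strategies in your scikit-learn version.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy data (hypothetical values, not the penguins data)
numeric = pd.DataFrame({"body_mass_g": [3000.0, 3500.0, np.nan, 5500.0]})
categorical = pd.DataFrame({"sex": ["Male", "Female", np.nan, "Male"]}, dtype=object)

# "mean" and "median" only make sense for numeric columns
mean_filled = SimpleImputer(strategy="mean").fit_transform(numeric)
median_filled = SimpleImputer(strategy="median").fit_transform(numeric)

# "most_frequent" and "constant" also work on string columns
mode_filled = SimpleImputer(strategy="most_frequent").fit_transform(categorical)
const_filled = SimpleImputer(strategy="constant", fill_value="Unknown").fit_transform(categorical)

# Row 2 held the missing value in both frames
print(mean_filled[2, 0], median_filled[2, 0], mode_filled[2, 0], const_filled[2, 0])
```

The mean is pulled towards the one heavy animal while the median is not, which is one reason to compare strategies before picking one.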

### Import Libraries

In [115]:
import pandas as pd
from sklearn.impute import SimpleImputer

### Read data into a dataframe df

In [116]:
df = pd.read_csv('../data/penguins.csv')
df

Unnamed: 0,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex
0,Adelie,Torgersen,39.1,18.7,181.0,3750.0,Male
1,Adelie,Torgersen,39.5,17.4,186.0,3800.0,Female
2,Adelie,Torgersen,40.3,18.0,195.0,3250.0,Female
3,Adelie,Torgersen,36.7,19.3,193.0,3450.0,Female
4,Adelie,Torgersen,39.3,20.6,190.0,3650.0,Male
...,...,...,...,...,...,...,...
337,Gentoo,Biscoe,47.2,13.7,214.0,4925.0,Female
338,Gentoo,Biscoe,46.8,14.3,215.0,4850.0,Female
339,Gentoo,Biscoe,50.4,15.7,222.0,5750.0,Male
340,Gentoo,Biscoe,45.2,14.8,212.0,5200.0,Female


### Check for missing values in the sex column

In [117]:
# Select sex col
columns = df[['sex']]
columns

Unnamed: 0,sex
0,Male
1,Female
2,Female
3,Female
4,Male
...,...
337,Female
338,Female
339,Male
340,Female


In [118]:
df.sex.describe()

count      333
unique       2
top       Male
freq       168
Name: sex, dtype: object

In [119]:
df.sex.unique()

array(['Male', 'Female', nan], dtype=object)

In [120]:
# count the number of missing values
print(columns['sex'].isna().sum())

9


### Impute a categorical column

In [121]:
df[['sex']].mode()

Unnamed: 0,sex
0,Male


In [122]:
df.sex.isna().sum()

9

In [124]:
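# Count each category; NaN never matches ==, so count it with isna()
male = (df['sex'] == 'Male').sum()
female = (df['sex'] == 'Female').sum()
nan_c = df['sex'].isna().sum()

# Share of each category among all rows (len(df) instead of a hard-coded 342)
print(male / len(df), female / len(df), nan_c / len(df))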

0.49122807017543857 0.4824561403508772 0.02631578947368421
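
The same shares can probably be obtained in one call with `value_counts`. A minimal sketch on a made-up toy column (the `sex` Series here is an assumption for illustration, not the penguins data):

```python
import numpy as np
import pandas as pd

# Toy column (hypothetical values, not the penguins data)
sex = pd.Series(["Male", "Female", "Male", np.nan], name="sex", dtype=object)

# normalize=True returns shares instead of counts;
# dropna=False keeps the missing values as their own category
shares = sex.value_counts(normalize=True, dropna=False)
print(shares)
```

On the real data this would be `df['sex'].value_counts(normalize=True, dropna=False)`.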


In [75]:
# Define the transformer
imputer = SimpleImputer(strategy='most_frequent')

#### "Fitting" the imputer transformer
During `fit`, the imputer learns the most frequent value, because the strategy is `most_frequent`. This value is then stored in the `statistics_` attribute of the *imputer* object.

In [76]:
# Select column
columns

Unnamed: 0,sex
0,Male
1,Female
2,Female
3,Female
4,Male
...,...
337,Female
338,Female
339,Male
340,Female


In [77]:
imputer.fit(columns)            # learn the most frequent value

In [78]:
imputer.statistics_

array(['Male'], dtype=object)

#### Transform the columns
The missing values are replaced only during transformation, using the value stored in the `statistics_` attribute.
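
To make the split between fitting and transforming concrete, here is a minimal sketch on made-up toy data. The frames `train` and `new` are hypothetical and only serve to show that the value learned in `fit` is the one reused in `transform`.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy data (hypothetical values, not the penguins data)
train = pd.DataFrame({"sex": ["Male", "Male", "Female", np.nan]}, dtype=object)
new = pd.DataFrame({"sex": [np.nan, "Female", "Female"]}, dtype=object)

imputer = SimpleImputer(strategy="most_frequent")
imputer.fit(train)                      # learns 'Male' from train only

# fit() does not modify its input: train still has one missing value
print(train["sex"].isna().sum())

# transform() fills gaps with the learned value, even though 'Female'
# is the most frequent value in `new`
new_filled = imputer.transform(new)
print(new_filled.ravel())
```

This is why an imputer is usually fitted on training data and then applied unchanged to any later data.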

In [79]:
t = imputer.transform(columns)
print(t.shape)
print()
t

(342, 1)



array([['Male'],
       ['Female'],
       ['Female'],
       ['Female'],
       ['Male'],
       ['Female'],
       ['Male'],
       ['Male'],
       ['Male'],
       ['Male'],
       ['Male'],
       ['Female'],
       ['Male'],
       ['Male'],
       ['Female'],
       ['Female'],
       ['Male'],
       ['Female'],
       ['Male'],
       ['Female'],
       ['Male'],
       ['Female'],
       ['Male'],
       ['Male'],
       ['Female'],
       ['Male'],
       ['Female'],
       ['Female'],
       ['Male'],
       ['Female'],
       ['Male'],
       ['Female'],
       ['Male'],
       ['Female'],
       ['Male'],
       ['Male'],
       ['Female'],
       ['Female'],
       ['Male'],
       ['Female'],
       ['Male'],
       ['Female'],
       ['Male'],
       ['Female'],
       ['Male'],
       ['Male'],
       ['Male'],
       ['Female'],
       ['Male'],
       ['Female'],
       ['Male'],
       ['Female'],
       ['Male'],
       ['Female'],
       ['Male'],
       ['Female'],
       ...

In [80]:
# Format the output as a DataFrame
cols_imputed = pd.DataFrame(t, columns=columns.columns)
cols_imputed.head()

Unnamed: 0,sex
0,Male
1,Female
2,Female
3,Female
4,Male


__Hint__: You may have noticed that scikit-learn transformers return a NumPy array. If you want or need a DataFrame as output instead, add the following to your code:
```python
from sklearn import set_config
set_config(transform_output="pandas")
```


In [81]:
cols_imputed

Unnamed: 0,sex
0,Male
1,Female
2,Female
3,Female
4,Male
...,...
337,Female
338,Female
339,Male
340,Female


In [82]:
pd_conc = df.copy()                    # copy, so the original df keeps its missing values
pd_conc['sex'] = cols_imputed['sex']


In [83]:
pd_conc.sex.unique()

array(['Male', 'Female'], dtype=object)
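
For a single column, a pandas-only version of most-frequent imputation is also possible. This is a sketch of an alternative, not what the notebook does, and the `toy` frame is made up for illustration:

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values, not the penguins data)
toy = pd.DataFrame({"sex": ["Male", np.nan, "Male", "Female"]}, dtype=object)

# mode() ignores missing values by default and returns a Series
most_frequent = toy["sex"].mode().iloc[0]

# fillna() returns a new Series; toy itself is left unchanged
filled = toy["sex"].fillna(most_frequent)
print(filled.tolist())
```

Unlike `SimpleImputer`, this does not keep a fitted object around, so the learned value has to be stored by hand if it should be reused on other data.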

In [88]:
pd_conc

Unnamed: 0,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex
0,Adelie,Torgersen,39.1,18.7,181.0,3750.0,Male
1,Adelie,Torgersen,39.5,17.4,186.0,3800.0,Female
2,Adelie,Torgersen,40.3,18.0,195.0,3250.0,Female
3,Adelie,Torgersen,36.7,19.3,193.0,3450.0,Female
4,Adelie,Torgersen,39.3,20.6,190.0,3650.0,Male
...,...,...,...,...,...,...,...
337,Gentoo,Biscoe,47.2,13.7,214.0,4925.0,Female
338,Gentoo,Biscoe,46.8,14.3,215.0,4850.0,Female
339,Gentoo,Biscoe,50.4,15.7,222.0,5750.0,Male
340,Gentoo,Biscoe,45.2,14.8,212.0,5200.0,Female
