## Cleaning 5: scikit-learn

### Throughout this pair programming exercise we will try to remove the null values from our columns. In the lesson we learned several scikit-learn methods, so let's try to apply them all. Let's get to work!

In [1]:
import pandas as pd
import sidetable
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from sklearn.impute import KNNImputer

In [2]:
df = pd.read_csv("datos/Limpieza-3.csv", index_col = 0)
df.head()

Unnamed: 0,year,type,country,activity,age,species,month,fatal,sex
0,2018,Boating,usa,Paddling,57.0,White Shark,Jun,N,F
1,2018,Unprovoked,brazil,Swimming,18.0,Tiger Shark,Jun,Y,M
2,2018,Unprovoked,usa,Walking,15.0,Bull Shark,May,N,M
3,2018,Provoked,australia,Feeding sharks,32.0,Grey Shark,May,N,M
4,2018,Invalid,england,Fishing,21.0,Unspecified,May,N,M


In [3]:
df.stb.missing()

Unnamed: 0,missing,total,percent
month,181,1672,10.825359
age,158,1672,9.449761
species,126,1672,7.535885
fatal,99,1672,5.921053
activity,31,1672,1.854067
sex,14,1672,0.837321
country,10,1672,0.598086
year,0,1672,0.0
type,0,1672,0.0


### 1. It is time to remove the nulls:
- Replace the null values in the age column with the mean age using the SimpleImputer method.

In [4]:
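def simple_imputer(df, col, estadis):
    # estadis is the SimpleImputer strategy, e.g. "mean" or "most_frequent"
    imputer = SimpleImputer(strategy = estadis, missing_values = np.nan, copy = False)
    # fit_transform returns a 2D array of shape (n_rows, 1); flatten it before assigning it back as a column
    df[col] = imputer.fit_transform(df[[col]]).ravel()
    return df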

In [5]:
df = simple_imputer(df, "age", "mean")

In [6]:
df.stb.missing()

Unnamed: 0,missing,total,percent
month,181,1672,10.825359
species,126,1672,7.535885
fatal,99,1672,5.921053
activity,31,1672,1.854067
sex,14,1672,0.837321
country,10,1672,0.598086
year,0,1672,0.0
type,0,1672,0.0
age,0,1672,0.0


- Replace the null values in the sex column with the mode, using the SimpleImputer method.
  
   💡 Hint 💡 In this approach the mode is specified as most_frequent.
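
As a minimal sketch of what the `most_frequent` strategy does, the toy column below (invented for illustration, not taken from the shark dataset) has one missing value that gets filled with the most common category.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy data, made up for this illustration only
toy = pd.DataFrame({"sex": pd.Series(["M", "F", np.nan, "M"], dtype = object)})

imputer = SimpleImputer(strategy = "most_frequent", missing_values = np.nan)

# fit_transform returns a 2D array, so flatten it before assigning it back
toy["sex"] = imputer.fit_transform(toy[["sex"]]).ravel()

print(toy["sex"].tolist())  # the NaN is replaced by "M", the most frequent value
```

Unlike `mean` or `median`, this strategy also works on text columns, which is why it suits a categorical column such as sex.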


In [7]:
df = simple_imputer(df, "sex", "most_frequent")

In [8]:
df.stb.missing()

Unnamed: 0,missing,total,percent
month,181,1672,10.825359
species,126,1672,7.535885
fatal,99,1672,5.921053
activity,31,1672,1.854067
country,10,1672,0.598086
year,0,1672,0.0
type,0,1672,0.0
age,0,1672,0.0
sex,0,1672,0.0


- Replace the null values in the type column with the most frequent value (the mode) using the SimpleImputer method.

The type column has no null values, so there is nothing to impute.

- Use the KNN Imputer method to replace all the null values in the numeric columns.

In [9]:
numericas = df.select_dtypes(include = np.number)
numericas.isnull().sum()

year    0
age     0
dtype: int64

The two numeric columns we have no longer contain nulls, so we reload the original dataframe.

In [10]:
df1 = pd.read_csv("datos/Limpieza-3.csv", index_col = 0)

In [11]:
numericas = df1.select_dtypes(include = np.number)
numericas.isnull().sum()

year      0
age     158
dtype: int64

In [12]:
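def knn_imputer(df, vecino):
    # Select the numeric columns of the dataframe passed in, not the global `numericas`
    numericas = df.select_dtypes(include = np.number)

    imputer = KNNImputer(n_neighbors = vecino)

    # Keep the original index so the imputed values align with the right rows
    df_numericas_trans = pd.DataFrame(imputer.fit_transform(numericas), columns = numericas.columns, index = df.index)

    columnas = df_numericas_trans.columns

    df.drop(columnas, axis = 1, inplace = True)

    df[columnas] = df_numericas_trans[columnas]

    return df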

In [13]:
df1 = knn_imputer(df1, 5)

In [14]:
df1.stb.missing()

Unnamed: 0,missing,total,percent
month,181,1672,10.825359
species,126,1672,7.535885
fatal,99,1672,5.921053
activity,31,1672,1.854067
sex,14,1672,0.837321
country,10,1672,0.598086
type,0,1672,0.0
year,0,1672,0.0
age,0,1672,0.0
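
Because year and age are the only numeric columns, the neighbours used to fill a missing age are chosen by year alone. The sketch below illustrates this on a small made-up array (the values are assumptions for illustration, not rows from the dataset), relying on the NaN-aware Euclidean distance that `KNNImputer` uses by default.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Columns: year, age. Toy values, made up for this illustration
X = np.array([
    [2018, 20.0],
    [2018, 30.0],
    [2017, np.nan],  # age to impute
    [2000, 60.0],
])

imputer = KNNImputer(n_neighbors = 2)
X_imputed = imputer.fit_transform(X)

# The two closest rows by year are the 2018 ones, so the imputed age
# is the mean of 20 and 30
print(X_imputed[2, 1])  # 25.0
```

The 2000 row is too far away in year to count as a neighbour, so its age of 60 does not influence the result.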


- Use the Iterative Imputer method to replace all the null values in the numeric columns.

In [15]:
df2 = pd.read_csv("datos/Limpieza-3.csv", index_col = 0)

In [16]:
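def iterative_imputer(df2, estadis):
    # Select the numeric columns of the dataframe passed in, not the global `numericas`
    numericas = df2.select_dtypes(include = np.number)

    imputer = IterativeImputer(n_nearest_features = None, initial_strategy = estadis, imputation_order = "ascending")

    # Keep the original index so the imputed values align with the right rows
    df_numericas_trans = pd.DataFrame(imputer.fit_transform(numericas), columns = numericas.columns, index = df2.index)

    columnas = df_numericas_trans.columns

    df2.drop(columnas, axis = 1, inplace = True)

    df2[columnas] = df_numericas_trans[columnas]

    return df2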

In [17]:
# Pass the freshly loaded df2; df already had age and sex imputed in the steps above
df2 = iterative_imputer(df2, "mean")

In [18]:
df2.stb.missing()

Unnamed: 0,missing,total,percent
month,181,1672,10.825359
species,126,1672,7.535885
fatal,99,1672,5.921053
activity,31,1672,1.854067
sex,14,1672,0.837321
country,10,1672,0.598086
type,0,1672,0.0
year,0,1672,0.0
age,0,1672,0.0


- Could you explain the difference between these last two methods?
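
One way to approach this question: `KNNImputer` fills a missing value with the average of that feature over the k most similar rows, while `IterativeImputer` models each feature with missing values as a function of the other features using a regressor (`BayesianRidge` by default). The sketch below contrasts them on made-up data where the second column is roughly twice the first. The data and variable names are assumptions for illustration only.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401, needed to import IterativeImputer
from sklearn.impute import IterativeImputer, KNNImputer

# Toy data, made up for this illustration: second column = 2 * first column
X = np.array([
    [1.0, 2.0],
    [2.0, 4.0],
    [3.0, 6.0],
    [4.0, 8.0],
    [5.0, np.nan],  # value to impute
])

# KNN: average of the second column over the 2 nearest rows (first column 4 and 3)
valor_knn = KNNImputer(n_neighbors = 2).fit_transform(X)[4, 1]

# Iterative: regression of the second column on the first, then a prediction
valor_iter = IterativeImputer(random_state = 0).fit_transform(X)[4, 1]

print(valor_knn)   # 7.0, the mean of 8 and 6
print(valor_iter)  # expected to land near 10, following the linear trend
```

On this toy data the KNN value stays inside the range of the observed neighbours, whereas the regression-based value can follow the trend beyond it. With only year available to predict age in this dataset, either method has little information to work with.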

### 2. Save the CSV to keep working with it in the following pair exercises.
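
A minimal sketch of the save step, using a small stand-in dataframe and a temporary folder so it runs on its own. The output filename `Limpieza-5.csv` is an assumption; in the notebook you would call `to_csv` on the cleaned dataframe with whatever path the next exercise expects.

```python
import os
import tempfile

import pandas as pd

# Stand-in for the cleaned dataframe, made up for this illustration
df_limpio = pd.DataFrame({"year": [2018, 2018], "age": [57.0, 18.0]})

# Hypothetical output path; in the notebook this would live under "datos/"
ruta = os.path.join(tempfile.mkdtemp(), "Limpieza-5.csv")

# The index is written as the first column, matching the index_col = 0 used when reading
df_limpio.to_csv(ruta)

# Read it back the same way the notebook loads its input file
df_check = pd.read_csv(ruta, index_col = 0)
print(df_check.shape)
```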