# Pair Programming Cleaning IV - Null Values with sklearn

Hypotheses

- Age, job, marital status, education, debt status, and contact method may influence the probability that a client accepts the offer.

- The number of times a client has been contacted in the past (campaign field), the number of days since the last contact (pdays field), and the outcome of the previous campaign (poutcome field) may affect the client's response to a new offer.

- Economic variables, such as the consumer price index (cons.price.idx) and the employment variation rate (emp.var.rate), may influence the probability that a client accepts the offer.

- Clients who already have a mortgage (housing) or a loan (loan) may be less likely to accept a new offer, since they could be financially constrained.

In [1]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np


from sklearn.impute import SimpleImputer
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from sklearn.impute import KNNImputer

In [2]:
df = pd.read_csv("bank-additional-full.csv", index_col = 0)
df.head(2)

Unnamed: 0,age,job,marital,education,default,housing,loan,contact,duration,campaign,...,poutcome,"emp,var,rate","cons,price,idx","cons,conf,idx",euribor3m,"nr,employed",y,month_day_week,month,weekday
0,56,housemaid,married,basic 4y,Si,Si,Si,telephone,261,1,...,NONEXISTENT,1.1,93.994,-36.4,4.857,5191.0,no,"['may', 'mon']",'may','mon'
1,57,services,married,high school,,Si,Si,telephone,149,1,...,NONEXISTENT,1.1,93.994,-36.4,4.857,5191.0,no,"['may', 'mon']",'may','mon'


It is time to work on null values again 💪🏽. Throughout this pair programming exercise we will try to remove the null values from our columns. In the lesson we learned several sklearn methods, so let's try to apply them all. Let's get to work!

Time to remove the nulls:

- Replace the null values in the rest of the categorical columns with the mode, using the SimpleImputer method.

In [3]:
nulos = pd.DataFrame((df.isnull().sum() * 100) / df.shape[0]).reset_index()
nulos.columns = ["columna", "porcentaje"]
nulos

Unnamed: 0,columna,porcentaje
0,age,0.0
1,job,0.801438
2,marital,0.194288
3,education,4.201477
4,default,20.876239
5,housing,2.404313
6,loan,2.404313
7,contact,0.0
8,duration,0.0
9,campaign,0.0
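
As a side note, the percentage of nulls per column can also be obtained from the mean of the boolean null mask. The sketch below uses a small toy DataFrame with made-up values (not the bank dataset) and builds the same two-column layout, columna and porcentaje, as the table above.

```python
import numpy as np
import pandas as pd

# Toy data with hypothetical values: one null out of four rows in "job".
toy = pd.DataFrame({
    "age": [56, 57, 37, 40],
    "job": ["housemaid", np.nan, "services", "services"],
})

# The mean of a boolean mask is the fraction of True values,
# so multiplying by 100 gives the percentage of nulls per column.
nulos = (
    toy.isna().mean() * 100
).rename("porcentaje").rename_axis("columna").reset_index()

print(nulos)
```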


In [4]:
categoricas = df.select_dtypes(include= object)
categoricas

Unnamed: 0,job,marital,education,default,housing,loan,contact,poutcome,y,month_day_week,month,weekday
0,housemaid,married,basic 4y,Si,Si,Si,telephone,NONEXISTENT,no,"['may', 'mon']",'may','mon'
1,services,married,high school,,Si,Si,telephone,NONEXISTENT,no,"['may', 'mon']",'may','mon'
2,services,married,high school,Si,No,Si,telephone,NONEXISTENT,no,"['may', 'mon']",'may','mon'
3,administration,married,basic 6y,Si,Si,Si,telephone,NONEXISTENT,no,"['may', 'mon']",'may','mon'
4,services,married,high school,Si,Si,No,telephone,NONEXISTENT,no,"['may', 'mon']",'may','mon'
...,...,...,...,...,...,...,...,...,...,...,...,...
41183,retired,married,professional course,Si,No,Si,cellular,NONEXISTENT,yes,"['nov', 'fri']",'nov','fri'
41184,blue-collar,married,professional course,Si,Si,Si,cellular,NONEXISTENT,no,"['nov', 'fri']",'nov','fri'
41185,retired,married,university degree,Si,No,Si,cellular,NONEXISTENT,no,"['nov', 'fri']",'nov','fri'
41186,technician,married,professional course,Si,Si,Si,cellular,NONEXISTENT,yes,"['nov', 'fri']",'nov','fri'


In [5]:
imputer = SimpleImputer(strategy='most_frequent', missing_values=np.nan)

In [6]:
imputer = imputer.fit(df[categoricas.columns])

In [7]:
df[categoricas.columns] = imputer.transform(df[categoricas.columns])

The categorical columns of the original df have to be overwritten with the output of transform; otherwise the nulls stay in the DataFrame.
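
A minimal sketch of that idea on toy data (the values below are made up for illustration): the imputer returns a NumPy array rather than modifying the DataFrame, so the result only takes effect once it is assigned back to the same columns.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy data with hypothetical values, not taken from the bank dataset.
toy = pd.DataFrame({
    "job": ["services", np.nan, "services", "retired"],
    "marital": ["married", "single", np.nan, "married"],
})

imputer = SimpleImputer(strategy="most_frequent", missing_values=np.nan)

# fit_transform returns a NumPy array; assigning it back to the same
# columns is what actually overwrites the nulls in the DataFrame.
toy[toy.columns] = imputer.fit_transform(toy)

print(toy)
```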

In [8]:
df.isnull().sum()

age               0
job               0
marital           0
education         0
default           0
housing           0
loan              0
contact           0
duration          0
campaign          0
pdays             0
previous          0
poutcome          0
emp,var,rate      0
cons,price,idx    0
cons,conf,idx     0
euribor3m         0
nr,employed       0
y                 0
month_day_week    0
month             0
weekday           0
dtype: int64


💡 Hint 💡 In this approach the mode is specified as most_frequent.

- Use the Iterative Imputer method to replace all the null values in the numeric columns.
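
Before applying it to the full dataset, it may help to see what IterativeImputer does on a tiny example. The sketch below uses two made-up columns, x and y, where y is twice x. The imputer models each column with missing values as a function of the other columns, so the gap in y is estimated from x. The random_state value is an arbitrary choice for reproducibility.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Toy data with hypothetical values: y follows x, with one gap.
toy = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0, 5.0],
    "y": [2.0, 4.0, np.nan, 8.0, 10.0],
})

imputer = IterativeImputer(random_state=0)

# Wrap the returned array in a DataFrame, keeping column names and index.
imputed = pd.DataFrame(
    imputer.fit_transform(toy),
    columns=toy.columns,
    index=toy.index,
)

print(imputed)
```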

In [9]:
def eliminar_nulos_iterative(dataframe_completo):
    # Select the numeric columns, which are the ones to impute.
    df_numericas = dataframe_completo.select_dtypes(include=np.number)
    imputer = IterativeImputer(n_nearest_features=None, imputation_order='ascending')
    # Reuse the original index so that concat pairs each row with its own values.
    numericas_trans = pd.DataFrame(
        imputer.fit_transform(df_numericas),
        columns=df_numericas.columns,
        index=df_numericas.index,
    )
    # Swap the original numeric columns for the imputed ones without mutating the input.
    resto = dataframe_completo.drop(columns=df_numericas.columns)
    return pd.concat([resto, numericas_trans], axis=1)

In [10]:
df2 = df.copy()

In [11]:
df2 = eliminar_nulos_iterative(df2)
df2.head()

Unnamed: 0,job,marital,education,default,housing,loan,contact,poutcome,y,month_day_week,...,age,duration,campaign,pdays,previous,"emp,var,rate","cons,price,idx","cons,conf,idx",euribor3m,"nr,employed"
0,housemaid,married,basic 4y,Si,Si,Si,telephone,NONEXISTENT,no,"['may', 'mon']",...,56.0,261.0,1.0,999.0,0.0,1.1,93.994,-36.4,4.857,5191.0
1,services,married,high school,Si,Si,Si,telephone,NONEXISTENT,no,"['may', 'mon']",...,57.0,149.0,1.0,999.0,0.0,1.1,93.994,-36.4,4.857,5191.0
2,services,married,high school,Si,No,Si,telephone,NONEXISTENT,no,"['may', 'mon']",...,37.0,226.0,1.0,999.0,0.0,1.1,93.994,-36.4,4.857,5191.0
3,administration,married,basic 6y,Si,Si,Si,telephone,NONEXISTENT,no,"['may', 'mon']",...,40.0,151.0,1.0,999.0,0.0,1.1,93.994,-36.4,4.857,5191.0
4,services,married,high school,Si,Si,No,telephone,NONEXISTENT,no,"['may', 'mon']",...,56.0,307.0,1.0,999.0,0.0,1.1,93.994,-36.4,4.857,5191.0


In [13]:
df2.isnull().sum()

job               0
marital           0
education         0
default           0
housing           0
loan              0
contact           0
poutcome          0
y                 0
month_day_week    0
month             0
weekday           0
age               0
duration          0
campaign          0
pdays             0
previous          0
emp,var,rate      0
cons,price,idx    0
cons,conf,idx     0
euribor3m         0
nr,employed       0
dtype: int64
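
One detail is worth illustrating when imputed values are put back next to the other columns: pd.concat with axis=1 aligns rows by index label, not by position. An imputer returns a plain NumPy array, and wrapping it in a new DataFrame gives it a fresh 0..n-1 index. If the original index has gaps (for example after dropping rows), the labels no longer match and NaN values appear. The sketch below reproduces this on made-up data and shows that passing the original index avoids it.

```python
import numpy as np
import pandas as pd

# Toy frame whose index has a gap (label 2 is missing), e.g. after dropping rows.
left = pd.DataFrame({"job": ["a", "b", "c"]}, index=[0, 1, 3])
values = np.array([[10.0], [20.0], [30.0]])  # stand-in for imputer output

# Fresh RangeIndex (0, 1, 2): labels differ from `left`, so concat pads with NaN.
misaligned = pd.concat(
    [left, pd.DataFrame(values, columns=["age"])], axis=1
)

# Reusing the original index keeps every row paired with its own values.
aligned = pd.concat(
    [left, pd.DataFrame(values, columns=["age"], index=left.index)], axis=1
)

print(misaligned)
print(aligned)
```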

- Replace the null values in the age column using the KNN Imputer method, which fills each gap with the mean age of the nearest neighbors.

In [14]:
def eliminar_nulos_knn(dataframe_completo):
    # Select the numeric columns; KNNImputer only works with numeric data.
    df_numericas = dataframe_completo.select_dtypes(include=np.number)
    imputer_knn = KNNImputer(n_neighbors=5)
    # Reuse the original index so that concat pairs each row with its own values.
    numericas_trans = pd.DataFrame(
        imputer_knn.fit_transform(df_numericas),
        columns=df_numericas.columns,
        index=df_numericas.index,
    )
    # Swap the original numeric columns for the imputed ones without mutating the input.
    resto = dataframe_completo.drop(columns=df_numericas.columns)
    return pd.concat([resto, numericas_trans], axis=1)

In [19]:
df3 = df['age'].reset_index()

In [20]:
df3 = eliminar_nulos_knn(df3)

In [21]:
df3.head()

Unnamed: 0,index,age
0,0.0,56.0
1,1.0,57.0
2,2.0,37.0
3,3.0,40.0
4,4.0,56.0


In [22]:
df3.isnull().sum()

index    0
age      0
dtype: int64
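
To make the behavior of KNNImputer more concrete, here is a small sketch on made-up data. It assumes a second numeric column (campaign) is available to measure how close two rows are, and n_neighbors=2 is chosen only to keep the example readable. The missing age is filled with the average age of the rows that are nearest in the other column.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Toy data with hypothetical values: one missing age.
toy = pd.DataFrame({
    "campaign": [1.0, 1.0, 2.0, 9.0, 10.0],
    "age": [30.0, 32.0, np.nan, 60.0, 62.0],
})

imputer = KNNImputer(n_neighbors=2)
imputed = pd.DataFrame(
    imputer.fit_transform(toy),
    columns=toy.columns,
    index=toy.index,
)

# Rows 0 and 1 are the nearest in `campaign`, so the gap
# becomes the mean of their ages: (30 + 32) / 2 = 31.
print(imputed.loc[2, "age"])
```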


- Replace the null values in the age column with the mean age using the Simple Imputer method.


In [None]:
imputer = SimpleImputer(strategy='mean', missing_values=np.nan)

In [23]:
imputer = imputer.fit(df[['age']])

In [24]:
df['age'] = imputer.transform(df[['age']])

In [26]:
df.isnull().sum()

age               0
job               0
marital           0
education         0
default           0
housing           0
loan              0
contact           0
duration          0
campaign          0
pdays             0
previous          0
poutcome          0
emp,var,rate      0
cons,price,idx    0
cons,conf,idx     0
euribor3m         0
nr,employed       0
y                 0
month_day_week    0
month             0
weekday           0
dtype: int64


- Could you explain the difference between these last three exercises?
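
One way to explore this question is to run the three imputers on the same toy column and compare what each writes into the gap. In short, SimpleImputer uses a single statistic computed over the whole column, KNNImputer averages the values of the most similar rows, and IterativeImputer predicts the missing value from the other columns with a regression model. The data, n_neighbors=2, and random_state=0 below are illustrative assumptions, not settings from the exercise.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, KNNImputer, SimpleImputer

# Toy data with hypothetical values: one missing age at row 2.
toy = pd.DataFrame({
    "campaign": [1.0, 1.0, 2.0, 9.0, 10.0],
    "age": [30.0, 32.0, np.nan, 60.0, 62.0],
})

imputers = {
    "simple_mean": SimpleImputer(strategy="mean"),
    "knn": KNNImputer(n_neighbors=2),
    "iterative": IterativeImputer(random_state=0),
}

# Collect the value each method writes into the single missing `age` cell
# (row 2, column 1 of the transformed array).
filled = {
    name: float(imputer.fit_transform(toy)[2, 1])
    for name, imputer in imputers.items()
}

print(filled)
```

The mean strategy ignores the campaign column entirely, while the other two methods use it to decide which ages are relevant for that row.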

Save the csv.

In [31]:
df2.to_csv('datos/bank-additional-full-limpio-iterative.csv')
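
Note that to_csv does not create missing folders, so the call above assumes the datos directory already exists. A minimal sketch of a helper that creates it first is shown below. The name save_csv is hypothetical and not part of pandas.

```python
from pathlib import Path

import pandas as pd


def save_csv(frame: pd.DataFrame, path: str) -> Path:
    """Hypothetical helper: create the parent folder if needed, then write the CSV."""
    target = Path(path)
    # parents=True builds intermediate folders; exist_ok=True avoids an
    # error when the folder is already there.
    target.parent.mkdir(parents=True, exist_ok=True)
    frame.to_csv(target)
    return target
```

For example, `save_csv(df2, 'datos/bank-additional-full-limpio-iterative.csv')` would write to the same path used in the cell above.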