# Notebook objective

The main goal of this notebook is to process the dataset that will be used. In particular, it converts the nominal categorical columns into numeric ones so that they can be used to build the models.

# Importing the libraries

The following cell imports all the external libraries and specific methods used throughout the notebook.

In [49]:
# Libraries and methods for data analysis and manipulation
import numpy as np
import pandas as pd
from sklearn.datasets import fetch_openml 

# Scikit-learn classes and methods
from sklearn.preprocessing import MinMaxScaler
from sklearn.impute import KNNImputer

# Other libraries
import os
import warnings
import random

warnings.filterwarnings(action='ignore')

# Data processing (Data Wrangling)

Before starting, I should clarify that the goal of processing the dataset in this notebook is to leave it ready for training a final model. Regardless of whether more or fewer features are selected later, I will try to adapt every variable and its records to a form that a Machine Learning model can process.

In [2]:
# Load the dataset into memory
adult_df__ = fetch_openml('adult', version = 4)

# Define a variable that points to the DataFrame held by the returned object
adult_df = adult_df__.frame

# Show the first 5 records of the dataset
adult_df.head()

Unnamed: 0,age,fnlwgt,education-num,capital-gain,capital-loss,hours-per-week,workclass,education,marital-status,occupation,relationship,race,sex,native-country,class
0,19,134974.0,10,0.0,0.0,20,,Some-college,Never-married,,Own-child,White,Female,United-States,<=50K
1,41,195096.0,13,0.0,0.0,50,Self-emp-inc,Bachelors,Married-civ-spouse,Prof-specialty,Husband,White,Male,United-States,<=50K
2,31,152109.0,9,0.0,0.0,50,Private,HS-grad,Never-married,Exec-managerial,Not-in-family,White,Male,United-States,<=50K
3,40,202872.0,12,0.0,0.0,45,Private,Assoc-acdm,Never-married,Adm-clerical,Own-child,White,Female,United-States,<=50K
4,35,98989.0,5,0.0,0.0,38,,9th,Divorced,,Own-child,Amer-Indian-Eskimo,Male,United-States,<=50K


## Processing null records

The first notebook of this project showed that the dataset reports no nulls when we run the .info() method.

The cell below shows the result of that call again.

In [3]:
adult_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48842 entries, 0 to 48841
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype   
---  ------          --------------  -----   
 0   age             48842 non-null  int64   
 1   fnlwgt          48842 non-null  float64 
 2   education-num   48842 non-null  int64   
 3   capital-gain    48842 non-null  float64 
 4   capital-loss    48842 non-null  float64 
 5   hours-per-week  48842 non-null  int64   
 6   workclass       48842 non-null  category
 7   education       48842 non-null  category
 8   marital-status  48842 non-null  category
 9   occupation      48842 non-null  category
 10  relationship    48842 non-null  category
 11  race            48842 non-null  category
 12  sex             48842 non-null  category
 13  native-country  48842 non-null  category
 14  class           48842 non-null  object  
dtypes: category(8), float64(3), int64(3), object(1)
memory usage: 3.0+ MB


As we can see, this call reports no null values. However, when we inspect the first 5 records, for example, several of them have missing values in different columns. These are stored as the string 'nan' rather than as real nulls, so an additional processing step is needed for Pandas to treat them as such.

![Nulls in a sample of the first 5 records](../src/data/images/muestra_nulos.png)

In [4]:
# Replace the 'nan' strings with NumPy nulls (np.nan), building a new dataframe row by row
columnas_names = adult_df.columns
registros_adult_df = []
for i, row in adult_df.iterrows():
    row_values = row.values
    row_values_processed = [np.nan if value == 'nan' else value for value in row_values]
    registros_adult_df.append(row_values_processed)

# Define the new dataframe, with nulls stored as np.nan
wtnan_adult_df = pd.DataFrame(data = registros_adult_df,
                              columns = columnas_names)

# First 5 records of the dataframe
wtnan_adult_df.head(5)

# Delete the original dataset
del adult_df

In [5]:
wtnan_adult_df.shape

(48842, 15)

In [6]:
# Run the dataframe's .info() method again
wtnan_adult_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48842 entries, 0 to 48841
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   age             48842 non-null  int64  
 1   fnlwgt          48842 non-null  float64
 2   education-num   48842 non-null  int64  
 3   capital-gain    48842 non-null  float64
 4   capital-loss    48842 non-null  float64
 5   hours-per-week  48842 non-null  int64  
 6   workclass       46043 non-null  object 
 7   education       48842 non-null  object 
 8   marital-status  48842 non-null  object 
 9   occupation      46033 non-null  object 
 10  relationship    48842 non-null  object 
 11  race            48842 non-null  object 
 12  sex             48842 non-null  object 
 13  native-country  47985 non-null  object 
 14  class           48842 non-null  object 
dtypes: float64(3), int64(3), object(9)
memory usage: 5.6+ MB


With this new dataframe, pandas correctly detects the null values.
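
The row-by-row loop above can be slow on a dataset of this size. As a hedged alternative, the following minimal sketch performs the same replacement column by column. The toy dataframe and the helper name `strings_to_nan` are illustrative assumptions and are not part of the original notebook.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_numeric_dtype


def strings_to_nan(df, placeholder="nan"):
    """Return a copy of df where the placeholder string becomes np.nan."""
    out = df.copy()
    for col in out.columns:
        if is_numeric_dtype(out[col]):
            continue  # numeric columns cannot hold the placeholder string
        # Work on plain Python objects so categorical columns are handled too
        as_object = out[col].astype(object)
        out[col] = as_object.mask(as_object == placeholder, np.nan)
    return out


# Hypothetical stand-in for the first rows of the real dataset
toy_df = pd.DataFrame({
    "age": [19, 41, 35],
    "workclass": pd.Categorical(["nan", "Private", "nan"]),
    "occupation": ["nan", "Sales", "Adm-clerical"],
})

cleaned = strings_to_nan(toy_df)
print(cleaned.isna().sum())
```

As in the loop-based version, the categorical columns end up stored as plain objects, while the numeric columns keep their dtype.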

We can now see that there are nulls in the following columns:

In [7]:
nulos_workclass = wtnan_adult_df.shape[0] - 46043
nulos_occupation = wtnan_adult_df.shape[0] - 46033
nulos_native_country = wtnan_adult_df.shape[0] - 47985

print(f'Column workclass ==> {nulos_workclass} nulls')
print(f'Column occupation ==> {nulos_occupation} nulls')
print(f'Column native-country ==> {nulos_native_country} nulls')

Column workclass ==> 2799 nulls
Column occupation ==> 2809 nulls
Column native-country ==> 857 nulls
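
The counts above are derived from totals copied by hand from the `.info()` output. A minimal sketch of how the same kind of figures could be read directly from a dataframe is shown below. The toy dataframe is an assumption that stands in for `wtnan_adult_df`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for wtnan_adult_df
toy_df = pd.DataFrame({
    "workclass": [np.nan, "Private", np.nan],
    "education": ["9th", "Bachelors", "HS-grad"],
    "occupation": [np.nan, "Sales", "Adm-clerical"],
})

# Count nulls per column and keep only the columns that have any
null_counts = toy_df.isna().sum()
columns_with_nulls = null_counts[null_counts > 0]

for column, n_nulls in columns_with_nulls.items():
    print(f"Column {column} ==> {n_nulls} nulls")
```

Reading the counts from the data avoids labelling mistakes and stays correct if the dataset changes.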


When a dataset contains null values, there are several ways to deal with them. Given the large number of records available, we could simply drop every instance that contains a null.
However, I consider this practice a bit like "sweeping the problem under the rug", since we would be discarding those records just to avoid the extra null-handling work.

As noted, if we had an exceptionally large number of records, I could drop the rows with nulls directly without a notable loss of data.

In this case, however, I will try to estimate the missing values with the methods and classes that scikit-learn provides for this purpose.

I will apply a KNN-based approach to impute the null values. The class used for this accepts only numeric values, so I will first convert all the nominal categorical variables to numeric ones. Once they are in numeric form, I can generate the estimates needed to fill in the nulls.
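
To make the KNN-based idea concrete, here is a minimal sketch on a small numeric array. The data is invented for illustration. `KNNImputer` fills each missing entry with the mean of that feature over the nearest rows, where distance is measured on the features both rows have available.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Toy numeric data (assumption): the second feature is missing in the third row
X = np.array([
    [1.0, 10.0],
    [2.0, 20.0],
    [3.0, np.nan],
    [8.0, 80.0],
])

imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)

# The two nearest rows by the first feature are [2, 20] and [1, 10],
# so the missing value becomes their mean
print(X_filled[2, 1])  # 15.0
```

Only entries that are actual NaN values are filled; every other value is returned unchanged.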

## Defining the variables to process

Although I will process every record and every variable in the dataset, some variables require additional processing. These are the nominal categorical variables, whose information cannot be fed "raw" to a Machine Learning model.

In [8]:
object_columns = []
for i, column in enumerate(wtnan_adult_df.dtypes):
    if column == "object":
        object_columns.append(wtnan_adult_df.dtypes.index[i])
        
        
# Show the columns to process
for column in object_columns:
    print(column)

workclass
education
marital-status
occupation
relationship
race
sex
native-country
class
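
As a hedged alternative to looping over `dtypes`, pandas can select columns by type directly. The sketch below uses a toy dataframe (an assumption) and `exclude="number"`, which keeps every non-numeric column. In the notebook's dataframe those are the nine object columns listed above.

```python
import pandas as pd

# Hypothetical stand-in with two numeric and two text columns
toy_df = pd.DataFrame({
    "age": [19, 41],
    "fnlwgt": [134974.0, 195096.0],
    "workclass": ["Private", "Self-emp-inc"],
    "class": ["<=50K", "<=50K"],
})

# Keep the names of all columns that are not numeric
non_numeric_columns = toy_df.select_dtypes(exclude="number").columns.tolist()
print(non_numeric_columns)
```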


### Column "workclass"



In [9]:
# Build a dataframe of dummy variables for this column
workclass_dummy_df = pd.get_dummies(wtnan_adult_df['workclass'], dummy_na=True).astype(int)

# First 5 records of the dummy dataframe
workclass_dummy_df.head()

Unnamed: 0,Federal-gov,Local-gov,Never-worked,Private,Self-emp-inc,Self-emp-not-inc,State-gov,Without-pay,NaN
0,0,0,0,0,0,0,0,0,1
1,0,0,0,0,1,0,0,0,0
2,0,0,0,1,0,0,0,0,0
3,0,0,0,1,0,0,0,0,0
4,0,0,0,0,0,0,0,0,1


In [10]:
# Rename the generated columns
column_names_workclass = workclass_dummy_df.columns
workclass_dummy_df.rename(columns = {column:f"workclass_{column}" for column in column_names_workclass}, inplace = True)

# Overwrite the previous list with the updated column names
column_names_workclass = workclass_dummy_df.columns
# First 5 records of the dummy dataframe
workclass_dummy_df.head()

Unnamed: 0,workclass_Federal-gov,workclass_Local-gov,workclass_Never-worked,workclass_Private,workclass_Self-emp-inc,workclass_Self-emp-not-inc,workclass_State-gov,workclass_Without-pay,workclass_nan
0,0,0,0,0,0,0,0,0,1
1,0,0,0,0,1,0,0,0,0
2,0,0,0,1,0,0,0,0,0
3,0,0,0,1,0,0,0,0,0
4,0,0,0,0,0,0,0,0,1


In [11]:
# Drop the original column from the dataframe that holds the dataset
wtnan_adult_df.drop(columns = ['workclass'], inplace = True)

# Concatenate the dataset with the dummy dataframe just created
wtnan_adult_df = pd.concat([wtnan_adult_df, workclass_dummy_df], axis = 1)

# First 5 records
wtnan_adult_df.head()

Unnamed: 0,age,fnlwgt,education-num,capital-gain,capital-loss,hours-per-week,education,marital-status,occupation,relationship,...,class,workclass_Federal-gov,workclass_Local-gov,workclass_Never-worked,workclass_Private,workclass_Self-emp-inc,workclass_Self-emp-not-inc,workclass_State-gov,workclass_Without-pay,workclass_nan
0,19,134974.0,10,0.0,0.0,20,Some-college,Never-married,,Own-child,...,<=50K,0,0,0,0,0,0,0,0,1
1,41,195096.0,13,0.0,0.0,50,Bachelors,Married-civ-spouse,Prof-specialty,Husband,...,<=50K,0,0,0,0,1,0,0,0,0
2,31,152109.0,9,0.0,0.0,50,HS-grad,Never-married,Exec-managerial,Not-in-family,...,<=50K,0,0,0,1,0,0,0,0,0
3,40,202872.0,12,0.0,0.0,45,Assoc-acdm,Never-married,Adm-clerical,Own-child,...,<=50K,0,0,0,1,0,0,0,0,0
4,35,98989.0,5,0.0,0.0,38,9th,Divorced,,Own-child,...,<=50K,0,0,0,0,0,0,0,0,1
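
The three steps used for this column (build the dummies, prefix their names, swap them in for the original column) are repeated for every categorical variable in the sections that follow. A minimal sketch of how they could be wrapped in one helper is given here. The name `one_hot_encode_column` and the toy dataframe are assumptions for illustration. The `prefix` argument of `pd.get_dummies` produces the same `workclass_...` naming, including `workclass_nan` for the null indicator.

```python
import numpy as np
import pandas as pd


def one_hot_encode_column(df, column, dummy_na=False):
    """Replace `column` with prefixed 0/1 dummy columns and return a new dataframe."""
    dummies = pd.get_dummies(df[column], prefix=column, dummy_na=dummy_na, dtype=int)
    return pd.concat([df.drop(columns=[column]), dummies], axis=1)


# Hypothetical stand-in for a slice of the dataset
toy_df = pd.DataFrame({
    "age": [19, 41, 31],
    "workclass": [np.nan, "Self-emp-inc", "Private"],
})

encoded = one_hot_encode_column(toy_df, "workclass", dummy_na=True)
print(encoded.columns.tolist())
```

The helper returns a new dataframe instead of modifying its input, so each encoding step can be checked in isolation.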


### Column "education"

In [12]:
# Build a dataframe of dummy variables for this column
# Note: education has no missing values (see the .info() output), so the NaN indicator column is all zeros
education_dummy_df = pd.get_dummies(wtnan_adult_df['education'], dummy_na=True).astype(int)

# First 5 records of the dummy dataframe
education_dummy_df.head()

Unnamed: 0,10th,11th,12th,1st-4th,5th-6th,7th-8th,9th,Assoc-acdm,Assoc-voc,Bachelors,Doctorate,HS-grad,Masters,Preschool,Prof-school,Some-college,NaN
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0
1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0
3,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0


In [13]:
# Rename the generated columns
column_names_education = education_dummy_df.columns
education_dummy_df.rename(columns = {column:f"education_{column}" for column in column_names_education}, inplace = True)

# Overwrite the previous list with the updated column names
column_names_education = education_dummy_df.columns
# First 5 records of the dummy dataframe
education_dummy_df.head()

Unnamed: 0,education_10th,education_11th,education_12th,education_1st-4th,education_5th-6th,education_7th-8th,education_9th,education_Assoc-acdm,education_Assoc-voc,education_Bachelors,education_Doctorate,education_HS-grad,education_Masters,education_Preschool,education_Prof-school,education_Some-college,education_nan
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0
1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0
3,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0


In [14]:
# Drop the original column from the dataframe that holds the dataset
wtnan_adult_df.drop(columns = ['education'], inplace = True)

# Concatenate the dataset with the dummy dataframe just created
wtnan_adult_df = pd.concat([wtnan_adult_df, education_dummy_df], axis = 1)

# First 5 records
wtnan_adult_df.head()

Unnamed: 0,age,fnlwgt,education-num,capital-gain,capital-loss,hours-per-week,marital-status,occupation,relationship,race,...,education_Assoc-acdm,education_Assoc-voc,education_Bachelors,education_Doctorate,education_HS-grad,education_Masters,education_Preschool,education_Prof-school,education_Some-college,education_nan
0,19,134974.0,10,0.0,0.0,20,Never-married,,Own-child,White,...,0,0,0,0,0,0,0,0,1,0
1,41,195096.0,13,0.0,0.0,50,Married-civ-spouse,Prof-specialty,Husband,White,...,0,0,1,0,0,0,0,0,0,0
2,31,152109.0,9,0.0,0.0,50,Never-married,Exec-managerial,Not-in-family,White,...,0,0,0,0,1,0,0,0,0,0
3,40,202872.0,12,0.0,0.0,45,Never-married,Adm-clerical,Own-child,White,...,1,0,0,0,0,0,0,0,0,0
4,35,98989.0,5,0.0,0.0,38,Divorced,,Own-child,Amer-Indian-Eskimo,...,0,0,0,0,0,0,0,0,0,0


### Column marital-status

In [15]:
# Build a dataframe of dummy variables for this column
marital_status_dummy_df = pd.get_dummies(wtnan_adult_df['marital-status']).astype(int)

# First 5 records of the dummy dataframe
marital_status_dummy_df.head()

Unnamed: 0,Divorced,Married-AF-spouse,Married-civ-spouse,Married-spouse-absent,Never-married,Separated,Widowed
0,0,0,0,0,1,0,0
1,0,0,1,0,0,0,0
2,0,0,0,0,1,0,0
3,0,0,0,0,1,0,0
4,1,0,0,0,0,0,0


In [16]:
# Rename the generated columns
column_names_marital_status = marital_status_dummy_df.columns
marital_status_dummy_df.rename(columns = {column:f"marital-status_{column}" for column in column_names_marital_status}, inplace = True)

# Overwrite the previous list with the updated column names
column_names_marital_status = marital_status_dummy_df.columns
# First 5 records of the dummy dataframe
marital_status_dummy_df.head()

Unnamed: 0,marital-status_Divorced,marital-status_Married-AF-spouse,marital-status_Married-civ-spouse,marital-status_Married-spouse-absent,marital-status_Never-married,marital-status_Separated,marital-status_Widowed
0,0,0,0,0,1,0,0
1,0,0,1,0,0,0,0
2,0,0,0,0,1,0,0
3,0,0,0,0,1,0,0
4,1,0,0,0,0,0,0


In [17]:
# Drop the original column from the dataframe that holds the dataset
wtnan_adult_df.drop(columns = ['marital-status'], inplace = True)

# Concatenate the dataset with the dummy dataframe just created
wtnan_adult_df = pd.concat([wtnan_adult_df, marital_status_dummy_df], axis = 1)

# First 5 records
wtnan_adult_df.head()

Unnamed: 0,age,fnlwgt,education-num,capital-gain,capital-loss,hours-per-week,occupation,relationship,race,sex,...,education_Prof-school,education_Some-college,education_nan,marital-status_Divorced,marital-status_Married-AF-spouse,marital-status_Married-civ-spouse,marital-status_Married-spouse-absent,marital-status_Never-married,marital-status_Separated,marital-status_Widowed
0,19,134974.0,10,0.0,0.0,20,,Own-child,White,Female,...,0,1,0,0,0,0,0,1,0,0
1,41,195096.0,13,0.0,0.0,50,Prof-specialty,Husband,White,Male,...,0,0,0,0,0,1,0,0,0,0
2,31,152109.0,9,0.0,0.0,50,Exec-managerial,Not-in-family,White,Male,...,0,0,0,0,0,0,0,1,0,0
3,40,202872.0,12,0.0,0.0,45,Adm-clerical,Own-child,White,Female,...,0,0,0,0,0,0,0,1,0,0
4,35,98989.0,5,0.0,0.0,38,,Own-child,Amer-Indian-Eskimo,Male,...,0,0,0,1,0,0,0,0,0,0


### Column occupation

In [18]:
# Build a dataframe of dummy variables for this column
# Note: dummy_na is not set here, so records with a missing occupation are encoded as all zeros
occupation_dummy_df = pd.get_dummies(wtnan_adult_df['occupation']).astype(int)

# First 5 records of the dummy dataframe
occupation_dummy_df.head()

Unnamed: 0,Adm-clerical,Armed-Forces,Craft-repair,Exec-managerial,Farming-fishing,Handlers-cleaners,Machine-op-inspct,Other-service,Priv-house-serv,Prof-specialty,Protective-serv,Sales,Tech-support,Transport-moving
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,1,0,0,0,0
2,0,0,0,1,0,0,0,0,0,0,0,0,0,0
3,1,0,0,0,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [19]:
# Rename the generated columns
column_names_occupation = occupation_dummy_df.columns
occupation_dummy_df.rename(columns = {column:f"occupation_{column}" for column in column_names_occupation}, inplace = True)

# Overwrite the previous list with the updated column names
column_names_occupation = occupation_dummy_df.columns
# First 5 records of the dummy dataframe
occupation_dummy_df.head()

Unnamed: 0,occupation_Adm-clerical,occupation_Armed-Forces,occupation_Craft-repair,occupation_Exec-managerial,occupation_Farming-fishing,occupation_Handlers-cleaners,occupation_Machine-op-inspct,occupation_Other-service,occupation_Priv-house-serv,occupation_Prof-specialty,occupation_Protective-serv,occupation_Sales,occupation_Tech-support,occupation_Transport-moving
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,1,0,0,0,0
2,0,0,0,1,0,0,0,0,0,0,0,0,0,0
3,1,0,0,0,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [20]:
# Drop the original column from the dataframe that holds the dataset
wtnan_adult_df.drop(columns = ['occupation'], inplace = True)

# Concatenate the dataset with the dummy dataframe just created
wtnan_adult_df = pd.concat([wtnan_adult_df, occupation_dummy_df], axis = 1)

# First 5 records
wtnan_adult_df.head()

Unnamed: 0,age,fnlwgt,education-num,capital-gain,capital-loss,hours-per-week,relationship,race,sex,native-country,...,occupation_Farming-fishing,occupation_Handlers-cleaners,occupation_Machine-op-inspct,occupation_Other-service,occupation_Priv-house-serv,occupation_Prof-specialty,occupation_Protective-serv,occupation_Sales,occupation_Tech-support,occupation_Transport-moving
0,19,134974.0,10,0.0,0.0,20,Own-child,White,Female,United-States,...,0,0,0,0,0,0,0,0,0,0
1,41,195096.0,13,0.0,0.0,50,Husband,White,Male,United-States,...,0,0,0,0,0,1,0,0,0,0
2,31,152109.0,9,0.0,0.0,50,Not-in-family,White,Male,United-States,...,0,0,0,0,0,0,0,0,0,0
3,40,202872.0,12,0.0,0.0,45,Own-child,White,Female,United-States,...,0,0,0,0,0,0,0,0,0,0
4,35,98989.0,5,0.0,0.0,38,Own-child,Amer-Indian-Eskimo,Male,United-States,...,0,0,0,0,0,0,0,0,0,0


### Column relationship

In [21]:
# Build a dataframe of dummy variables for this column
relationship_dummy_df = pd.get_dummies(wtnan_adult_df['relationship']).astype(int)

# First 5 records of the dummy dataframe
relationship_dummy_df.head()

Unnamed: 0,Husband,Not-in-family,Other-relative,Own-child,Unmarried,Wife
0,0,0,0,1,0,0
1,1,0,0,0,0,0
2,0,1,0,0,0,0
3,0,0,0,1,0,0
4,0,0,0,1,0,0


In [22]:
# Rename the generated columns
column_names_relationship = relationship_dummy_df.columns
relationship_dummy_df.rename(columns = {column:f"relationship_{column}" for column in column_names_relationship}, inplace = True)

# Overwrite the previous list with the updated column names
column_names_relationship = relationship_dummy_df.columns
# First 5 records of the dummy dataframe
relationship_dummy_df.head()

Unnamed: 0,relationship_Husband,relationship_Not-in-family,relationship_Other-relative,relationship_Own-child,relationship_Unmarried,relationship_Wife
0,0,0,0,1,0,0
1,1,0,0,0,0,0
2,0,1,0,0,0,0
3,0,0,0,1,0,0
4,0,0,0,1,0,0


In [23]:
# Drop the original column from the dataframe that holds the dataset
wtnan_adult_df.drop(columns = ['relationship'], inplace = True)

# Concatenate the dataset with the dummy dataframe just created
wtnan_adult_df = pd.concat([wtnan_adult_df, relationship_dummy_df], axis = 1)

# First 5 records
wtnan_adult_df.head()

Unnamed: 0,age,fnlwgt,education-num,capital-gain,capital-loss,hours-per-week,race,sex,native-country,class,...,occupation_Protective-serv,occupation_Sales,occupation_Tech-support,occupation_Transport-moving,relationship_Husband,relationship_Not-in-family,relationship_Other-relative,relationship_Own-child,relationship_Unmarried,relationship_Wife
0,19,134974.0,10,0.0,0.0,20,White,Female,United-States,<=50K,...,0,0,0,0,0,0,0,1,0,0
1,41,195096.0,13,0.0,0.0,50,White,Male,United-States,<=50K,...,0,0,0,0,1,0,0,0,0,0
2,31,152109.0,9,0.0,0.0,50,White,Male,United-States,<=50K,...,0,0,0,0,0,1,0,0,0,0
3,40,202872.0,12,0.0,0.0,45,White,Female,United-States,<=50K,...,0,0,0,0,0,0,0,1,0,0
4,35,98989.0,5,0.0,0.0,38,Amer-Indian-Eskimo,Male,United-States,<=50K,...,0,0,0,0,0,0,0,1,0,0


### Column race

In [24]:
# Build a dataframe of dummy variables for this column
race_dummy_df = pd.get_dummies(wtnan_adult_df['race']).astype(int)

# First 5 records of the dummy dataframe
race_dummy_df.head()

Unnamed: 0,Amer-Indian-Eskimo,Asian-Pac-Islander,Black,Other,White
0,0,0,0,0,1
1,0,0,0,0,1
2,0,0,0,0,1
3,0,0,0,0,1
4,1,0,0,0,0


In [25]:
# Rename the generated columns
column_names_race = race_dummy_df.columns
race_dummy_df.rename(columns = {column:f"race_{column}" for column in column_names_race}, inplace = True)

# Overwrite the previous list with the updated column names
column_names_race = race_dummy_df.columns
# First 5 records of the dummy dataframe
race_dummy_df.head()

Unnamed: 0,race_Amer-Indian-Eskimo,race_Asian-Pac-Islander,race_Black,race_Other,race_White
0,0,0,0,0,1
1,0,0,0,0,1
2,0,0,0,0,1
3,0,0,0,0,1
4,1,0,0,0,0


In [26]:
# Drop the original column from the dataframe that holds the dataset
wtnan_adult_df.drop(columns = ['race'], inplace = True)

# Concatenate the dataset with the dummy dataframe just created
wtnan_adult_df = pd.concat([wtnan_adult_df, race_dummy_df], axis = 1)

# First 5 records
wtnan_adult_df.head()

Unnamed: 0,age,fnlwgt,education-num,capital-gain,capital-loss,hours-per-week,sex,native-country,class,workclass_Federal-gov,...,relationship_Not-in-family,relationship_Other-relative,relationship_Own-child,relationship_Unmarried,relationship_Wife,race_Amer-Indian-Eskimo,race_Asian-Pac-Islander,race_Black,race_Other,race_White
0,19,134974.0,10,0.0,0.0,20,Female,United-States,<=50K,0,...,0,0,1,0,0,0,0,0,0,1
1,41,195096.0,13,0.0,0.0,50,Male,United-States,<=50K,0,...,0,0,0,0,0,0,0,0,0,1
2,31,152109.0,9,0.0,0.0,50,Male,United-States,<=50K,0,...,1,0,0,0,0,0,0,0,0,1
3,40,202872.0,12,0.0,0.0,45,Female,United-States,<=50K,0,...,0,0,1,0,0,0,0,0,0,1
4,35,98989.0,5,0.0,0.0,38,Male,United-States,<=50K,0,...,0,0,1,0,0,1,0,0,0,0


### Column sex

In [27]:
# Build a dataframe of dummy variables for this column
sex_dummy_df = pd.get_dummies(wtnan_adult_df['sex']).astype(int)

# First 5 records of the dummy dataframe
sex_dummy_df.head()

Unnamed: 0,Female,Male
0,1,0
1,0,1
2,0,1
3,1,0
4,0,1


In [28]:
# Rename the generated columns
column_names_sex = sex_dummy_df.columns
sex_dummy_df.rename(columns = {column:f"sex_{column}" for column in column_names_sex}, inplace = True)

# Overwrite the previous list with the updated column names
column_names_sex = sex_dummy_df.columns
# First 5 records of the dummy dataframe
sex_dummy_df.head()

Unnamed: 0,sex_Female,sex_Male
0,1,0
1,0,1
2,0,1
3,1,0
4,0,1


In [29]:
# Drop the original column from the dataframe that holds the dataset
wtnan_adult_df.drop(columns = ['sex'], inplace = True)

# Concatenate the dataset with the dummy dataframe just created
wtnan_adult_df = pd.concat([wtnan_adult_df, sex_dummy_df], axis = 1)

# First 5 records
wtnan_adult_df.head()

Unnamed: 0,age,fnlwgt,education-num,capital-gain,capital-loss,hours-per-week,native-country,class,workclass_Federal-gov,workclass_Local-gov,...,relationship_Own-child,relationship_Unmarried,relationship_Wife,race_Amer-Indian-Eskimo,race_Asian-Pac-Islander,race_Black,race_Other,race_White,sex_Female,sex_Male
0,19,134974.0,10,0.0,0.0,20,United-States,<=50K,0,0,...,1,0,0,0,0,0,0,1,1,0
1,41,195096.0,13,0.0,0.0,50,United-States,<=50K,0,0,...,0,0,0,0,0,0,0,1,0,1
2,31,152109.0,9,0.0,0.0,50,United-States,<=50K,0,0,...,0,0,0,0,0,0,0,1,0,1
3,40,202872.0,12,0.0,0.0,45,United-States,<=50K,0,0,...,1,0,0,0,0,0,0,1,1,0
4,35,98989.0,5,0.0,0.0,38,United-States,<=50K,0,0,...,1,0,0,1,0,0,0,0,0,1


### Column native-country

In [30]:
# Build a dataframe of dummy variables for this column
native_country_dummy_df = pd.get_dummies(wtnan_adult_df['native-country'], dummy_na=True).astype(int)

# First 5 records of the dummy dataframe
native_country_dummy_df.head()

Unnamed: 0,Cambodia,Canada,China,Columbia,Cuba,Dominican-Republic,Ecuador,El-Salvador,England,France,...,Puerto-Rico,Scotland,South,Taiwan,Thailand,Trinadad&Tobago,United-States,Vietnam,Yugoslavia,NaN
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0


In [31]:
# Rename the generated columns
column_names_native_country = native_country_dummy_df.columns
native_country_dummy_df.rename(columns = {column:f"native-country_{column}" for column in column_names_native_country}, inplace = True)

# Overwrite the previous list with the updated column names
column_names_native_country = native_country_dummy_df.columns
# First 5 records of the dummy dataframe
native_country_dummy_df.head()

Unnamed: 0,native-country_Cambodia,native-country_Canada,native-country_China,native-country_Columbia,native-country_Cuba,native-country_Dominican-Republic,native-country_Ecuador,native-country_El-Salvador,native-country_England,native-country_France,...,native-country_Puerto-Rico,native-country_Scotland,native-country_South,native-country_Taiwan,native-country_Thailand,native-country_Trinadad&Tobago,native-country_United-States,native-country_Vietnam,native-country_Yugoslavia,native-country_nan
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0


In [32]:
# Drop the original column from the dataframe that holds the dataset
wtnan_adult_df.drop(columns = ['native-country'], inplace = True)

# Concatenate the dataset with the dummy dataframe just created
wtnan_adult_df = pd.concat([wtnan_adult_df, native_country_dummy_df], axis = 1)

# First 5 records
wtnan_adult_df.head()

Unnamed: 0,age,fnlwgt,education-num,capital-gain,capital-loss,hours-per-week,class,workclass_Federal-gov,workclass_Local-gov,workclass_Never-worked,...,native-country_Puerto-Rico,native-country_Scotland,native-country_South,native-country_Taiwan,native-country_Thailand,native-country_Trinadad&Tobago,native-country_United-States,native-country_Vietnam,native-country_Yugoslavia,native-country_nan
0,19,134974.0,10,0.0,0.0,20,<=50K,0,0,0,...,0,0,0,0,0,0,1,0,0,0
1,41,195096.0,13,0.0,0.0,50,<=50K,0,0,0,...,0,0,0,0,0,0,1,0,0,0
2,31,152109.0,9,0.0,0.0,50,<=50K,0,0,0,...,0,0,0,0,0,0,1,0,0,0
3,40,202872.0,12,0.0,0.0,45,<=50K,0,0,0,...,0,0,0,0,0,0,1,0,0,0
4,35,98989.0,5,0.0,0.0,38,<=50K,0,0,0,...,0,0,0,0,0,0,1,0,0,0


### Column class

In [33]:
# Build a dataframe of dummy variables for this column
class_dummy_df = pd.get_dummies(wtnan_adult_df['class']).astype(int)

# Drop one of the two columns (the target is binary, so one column is enough)
class_dummy_df.drop(columns = ['>50K'], inplace = True)

# First 5 records of the dummy dataframe
class_dummy_df.head()

Unnamed: 0,<=50K
0,1
1,1
2,1
3,1
4,1


In [34]:
# Rename the generated columns
column_names_class = class_dummy_df.columns
class_dummy_df.rename(columns = {column:f"class_{column}" for column in column_names_class}, inplace = True)

# First 5 records of the dummy dataframe
class_dummy_df.head()

Unnamed: 0,class_<=50K
0,1
1,1
2,1
3,1
4,1


In [35]:
# Drop the original column from the dataframe that holds the dataset
wtnan_adult_df.drop(columns = ['class'], inplace = True)

# Concatenate the dataset with the dummy dataframe just created
wtnan_adult_df = pd.concat([wtnan_adult_df, class_dummy_df], axis = 1)

# First 5 records
wtnan_adult_df.head()

Unnamed: 0,age,fnlwgt,education-num,capital-gain,capital-loss,hours-per-week,workclass_Federal-gov,workclass_Local-gov,workclass_Never-worked,workclass_Private,...,native-country_Scotland,native-country_South,native-country_Taiwan,native-country_Thailand,native-country_Trinadad&Tobago,native-country_United-States,native-country_Vietnam,native-country_Yugoslavia,native-country_nan,class_<=50K
0,19,134974.0,10,0.0,0.0,20,0,0,0,0,...,0,0,0,0,0,1,0,0,0,1
1,41,195096.0,13,0.0,0.0,50,0,0,0,0,...,0,0,0,0,0,1,0,0,0,1
2,31,152109.0,9,0.0,0.0,50,0,0,0,1,...,0,0,0,0,0,1,0,0,0,1
3,40,202872.0,12,0.0,0.0,45,0,0,0,1,...,0,0,0,0,0,1,0,0,0,1
4,35,98989.0,5,0.0,0.0,38,0,0,0,0,...,0,0,0,0,0,1,0,0,0,1
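
For a binary target such as `class`, building two dummy columns and dropping one is equivalent to a single comparison. The sketch below, on an invented toy series, shows both routes producing the same 0/1 values. It is offered as a possible shortcut, not as the notebook's method.

```python
import pandas as pd

# Hypothetical stand-in for the `class` column
toy_class = pd.Series(["<=50K", ">50K", "<=50K"], name="class")

# Route 1: dummies, then drop the redundant column
via_dummies = pd.get_dummies(toy_class, dtype=int).drop(columns=[">50K"])["<=50K"]

# Route 2: direct comparison against the label of interest
via_comparison = (toy_class == "<=50K").astype(int)

print(via_comparison.tolist())  # [1, 0, 1]
```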


I have now applied One Hot encoding to the nominal categorical variables of the original dataset. However, two additional steps are still needed:

* Impute estimated values for the columns with null records, and drop the `_nan` indicator column of each such variable.

* Drop one of the columns of each variable that was One Hot encoded, mainly to avoid falling into the "dummy variable trap" (more information here ==> https://www.statology.org/dummy-variable-trap/)

I will apply this last step once the nulls have been imputed in the columns with missing data.
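
The dummy variable trap mentioned above comes from the fact that a complete set of One Hot columns always sums to 1 in every row, so any one column is fully determined by the others. A minimal sketch on an invented toy series, using the `drop_first` option of `pd.get_dummies` as one possible way to remove the redundant column:

```python
import pandas as pd

# Hypothetical stand-in for the `sex` column
sex = pd.Series(["Female", "Male", "Male", "Female"], name="sex")

# Full encoding: one column per category
full = pd.get_dummies(sex, prefix="sex", dtype=int)

# Reduced encoding: the first category is dropped
reduced = pd.get_dummies(sex, prefix="sex", drop_first=True, dtype=int)

print(full.sum(axis=1).tolist())  # every row sums to 1
print(list(reduced.columns))      # ['sex_Male']
```

The dropped category is still represented: it corresponds to the rows where all remaining columns are 0.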

## Null value imputation

In [36]:
# Instantiate a KNNImputer object
imputer = KNNImputer(n_neighbors=5)

#### Null imputation for the workclass variable

In [37]:
columnas_para_imputar = column_names_workclass
df_imputado = imputer.fit_transform(wtnan_adult_df[columnas_para_imputar])

# Build a DataFrame from the imputer output
df_imputado = pd.DataFrame(df_imputado, columns=columnas_para_imputar)

# Note: the dummy columns hold only 0s and 1s (no NaN), so fit_transform returns
# them unchanged. For rows flagged in workclass_nan every category column is 0,
# so argmax returns position 0 and the first category (workclass_Federal-gov) is
# assigned, as the output below shows.

# Convert each row to a strict one-hot vector, ignoring the null-indicator column.
# This is done on a NumPy array because writes to the rows yielded by iterrows()
# are not guaranteed to reach the DataFrame.
valores = df_imputado.iloc[:, :-1].to_numpy()
max_index = valores.argmax(axis=1)
one_hot = np.zeros_like(valores)
one_hot[np.arange(len(valores)), max_index] = 1

# Replace the original columns in `wtnan_adult_df` with the adjusted ones
wtnan_adult_df[columnas_para_imputar[:-1]] = one_hot

# Drop the null-indicator column
wtnan_adult_df.drop(columnas_para_imputar[-1], axis=1, inplace=True)

# Show the resulting DataFrame for verification
wtnan_adult_df.head()

Unnamed: 0,age,fnlwgt,education-num,capital-gain,capital-loss,hours-per-week,workclass_Federal-gov,workclass_Local-gov,workclass_Never-worked,workclass_Private,...,native-country_Scotland,native-country_South,native-country_Taiwan,native-country_Thailand,native-country_Trinadad&Tobago,native-country_United-States,native-country_Vietnam,native-country_Yugoslavia,native-country_nan,class_<=50K
0,19,134974.0,10,0.0,0.0,20,1.0,0.0,0.0,0.0,...,0,0,0,0,0,1,0,0,0,1
1,41,195096.0,13,0.0,0.0,50,0.0,0.0,0.0,0.0,...,0,0,0,0,0,1,0,0,0,1
2,31,152109.0,9,0.0,0.0,50,0.0,0.0,0.0,1.0,...,0,0,0,0,0,1,0,0,0,1
3,40,202872.0,12,0.0,0.0,45,0.0,0.0,0.0,1.0,...,0,0,0,0,0,1,0,0,0,1
4,35,98989.0,5,0.0,0.0,38,1.0,0.0,0.0,0.0,...,0,0,0,0,0,1,0,0,0,1
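
`KNNImputer` only fills entries that are actual NaN values, and the One Hot columns contain only 0s and 1s. For the imputer to estimate a category, the rows flagged by the indicator column would first need their category columns masked with NaN, and the imputer would need other features to measure similarity with. The following is a minimal sketch of that idea under stated assumptions: the toy data, the `wc_` column names, and the use of `age` as the only neighbour feature are all invented for illustration and are not taken from the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Toy data (assumption): two age groups, each with one record of unknown workclass
toy = pd.DataFrame({
    "age": [20, 22, 21, 60, 62, 61],
    "wc_Private": [1, 1, 0, 0, 0, 0],
    "wc_State-gov": [0, 0, 0, 1, 1, 0],
    "wc_nan": [0, 0, 1, 0, 0, 1],
})
dummy_cols = ["wc_Private", "wc_State-gov"]
missing = toy["wc_nan"] == 1

# Mask the category columns of the flagged rows so the imputer has something to fill
features = toy[["age"] + dummy_cols].astype(float)
features.loc[missing, dummy_cols] = np.nan

imputer = KNNImputer(n_neighbors=2)
imputed = pd.DataFrame(
    imputer.fit_transform(features),
    columns=features.columns,
    index=features.index,
)

# Turn the averaged neighbour values back into a strict one-hot vector
winner = imputed[dummy_cols].to_numpy().argmax(axis=1)
one_hot = np.zeros((len(toy), len(dummy_cols)), dtype=int)
one_hot[np.arange(len(toy)), winner] = 1

result = toy.drop(columns=dummy_cols + ["wc_nan"]).join(
    pd.DataFrame(one_hot, columns=dummy_cols, index=toy.index)
)
print(result)
```

In this toy case the record aged 21 is assigned `wc_Private` and the record aged 61 is assigned `wc_State-gov`, following their nearest neighbours by age. On real data with several numeric features, scaling them first (for example with the `MinMaxScaler` imported at the top of the notebook) would probably be advisable, since KNN distances are sensitive to feature ranges.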


#### Null imputation for the education variable

In [38]:
columnas_para_imputar = column_names_education
df_imputado = imputer.fit_transform(wtnan_adult_df[columnas_para_imputar])

# Build a DataFrame from the imputer output
df_imputado = pd.DataFrame(df_imputado, columns=columnas_para_imputar)

# Note: education has no missing values (see the .info() output), so
# education_nan is all zeros and the category columns keep their values.

# Convert each row to a strict one-hot vector, ignoring the null-indicator column.
# This is done on a NumPy array because writes to the rows yielded by iterrows()
# are not guaranteed to reach the DataFrame.
valores = df_imputado.iloc[:, :-1].to_numpy()
max_index = valores.argmax(axis=1)
one_hot = np.zeros_like(valores)
one_hot[np.arange(len(valores)), max_index] = 1

# Replace the original columns in `wtnan_adult_df` with the adjusted ones
wtnan_adult_df[columnas_para_imputar[:-1]] = one_hot

# Drop the null-indicator column
wtnan_adult_df.drop(columnas_para_imputar[-1], axis=1, inplace=True)

# Show the resulting DataFrame for verification
wtnan_adult_df.head()

Unnamed: 0,age,fnlwgt,education-num,capital-gain,capital-loss,hours-per-week,workclass_Federal-gov,workclass_Local-gov,workclass_Never-worked,workclass_Private,...,native-country_Scotland,native-country_South,native-country_Taiwan,native-country_Thailand,native-country_Trinadad&Tobago,native-country_United-States,native-country_Vietnam,native-country_Yugoslavia,native-country_nan,class_<=50K
0,19,134974.0,10,0.0,0.0,20,1.0,0.0,0.0,0.0,...,0,0,0,0,0,1,0,0,0,1
1,41,195096.0,13,0.0,0.0,50,0.0,0.0,0.0,0.0,...,0,0,0,0,0,1,0,0,0,1
2,31,152109.0,9,0.0,0.0,50,0.0,0.0,0.0,1.0,...,0,0,0,0,0,1,0,0,0,1
3,40,202872.0,12,0.0,0.0,45,0.0,0.0,0.0,1.0,...,0,0,0,0,0,1,0,0,0,1
4,35,98989.0,5,0.0,0.0,38,1.0,0.0,0.0,0.0,...,0,0,0,0,0,1,0,0,0,1
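
The row-by-row loop used in the cell above can also be written without iteration. What follows is only a minimal sketch on made-up data, assuming that the last column plays the role of the null indicator; the function name `snap_to_one_hot` and the toy column names are hypothetical and do not appear in the notebook.

```python
import numpy as np
import pandas as pd


def snap_to_one_hot(imputed: pd.DataFrame) -> pd.DataFrame:
    """Set the largest value of each row (ignoring the last column) to 1 and the rest to 0."""
    values = imputed.to_numpy(dtype=float)
    # Position of the winning category per row; the trailing null-indicator column is excluded
    winners = values[:, :-1].argmax(axis=1)
    one_hot = np.zeros_like(values)
    one_hot[np.arange(len(values)), winners] = 1.0
    return pd.DataFrame(one_hot, columns=imputed.columns, index=imputed.index)


# Toy data: two category columns plus a null-indicator column (hypothetical names)
toy = pd.DataFrame(
    {"cat_a": [0.2, 1.0, 0.4], "cat_b": [0.8, 0.0, 0.6], "cat_nan": [1.0, 0.0, 1.0]}
)
snapped = snap_to_one_hot(toy)
print(snapped)
```

Because the result is built as a new DataFrame, this variant does not rely on `iterrows` returning a view of the underlying data.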


#### Null imputation for the native-country variable

In [39]:
columnas_para_imputar = column_names_native_country
df_imputado = imputer.fit_transform(wtnan_adult_df[columnas_para_imputar])

# Create a DataFrame with the imputed results
df_imputado = pd.DataFrame(df_imputado, columns=columnas_para_imputar)

# Convert the imputed values to 1s and 0s so that each row reflects one specific category
for index, row in df_imputado.iterrows():
    # Ignore the null-indicator column when deciding which column should be 1
    max_index = row.iloc[:-1].argmax()
    row[:] = 0  # First, set every value to 0
    row.iloc[max_index] = 1  # Then, set the column that held the maximum to 1
    df_imputado.loc[index] = row  # Write the row back explicitly: iterrows may yield copies

# Now replace the original columns in `wtnan_adult_df` with the adjusted columns
wtnan_adult_df[columnas_para_imputar[:-1]] = df_imputado.iloc[:, :-1]

# Drop the null-indicator column, which is no longer needed
wtnan_adult_df.drop(columnas_para_imputar[-1], axis=1, inplace=True)

# Show the final DataFrame for verification
wtnan_adult_df.head()

Unnamed: 0,age,fnlwgt,education-num,capital-gain,capital-loss,hours-per-week,workclass_Federal-gov,workclass_Local-gov,workclass_Never-worked,workclass_Private,...,native-country_Puerto-Rico,native-country_Scotland,native-country_South,native-country_Taiwan,native-country_Thailand,native-country_Trinadad&Tobago,native-country_United-States,native-country_Vietnam,native-country_Yugoslavia,class_<=50K
0,19,134974.0,10,0.0,0.0,20,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1
1,41,195096.0,13,0.0,0.0,50,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1
2,31,152109.0,9,0.0,0.0,50,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1
3,40,202872.0,12,0.0,0.0,45,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1
4,35,98989.0,5,0.0,0.0,38,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1


In [40]:
wtnan_adult_df[column_names_workclass[:-1]].head()

Unnamed: 0,workclass_Federal-gov,workclass_Local-gov,workclass_Never-worked,workclass_Private,workclass_Self-emp-inc,workclass_Self-emp-not-inc,workclass_State-gov,workclass_Without-pay
0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
2,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
4,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
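
Beyond inspecting the first rows, one quick way to confirm that a dummy group still has exactly one active category per record is to sum each row. A small sketch on a hypothetical three-column group (the column names mimic the notebook's, the data is made up):

```python
import pandas as pd

# Hypothetical dummy group after imputation (made-up data)
group = pd.DataFrame(
    {
        "workclass_Private": [1.0, 0.0, 0.0],
        "workclass_State-gov": [0.0, 1.0, 0.0],
        "workclass_Local-gov": [0.0, 0.0, 1.0],
    }
)

# Every row of a valid one-hot group sums to exactly 1
row_sums = group.sum(axis=1)
is_valid_one_hot = bool((row_sums == 1).all())
print(is_valid_one_hot)
```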


In [41]:
for columna, nulos in wtnan_adult_df.isna().sum().items():
    print(f"{columna} ==> {nulos} nulls")

age ==> 0 nulls
fnlwgt ==> 0 nulls
education-num ==> 0 nulls
capital-gain ==> 0 nulls
capital-loss ==> 0 nulls
hours-per-week ==> 0 nulls
workclass_Federal-gov ==> 0 nulls
workclass_Local-gov ==> 0 nulls
workclass_Never-worked ==> 0 nulls
workclass_Private ==> 0 nulls
workclass_Self-emp-inc ==> 0 nulls
workclass_Self-emp-not-inc ==> 0 nulls
workclass_State-gov ==> 0 nulls
workclass_Without-pay ==> 0 nulls
education_10th ==> 0 nulls
education_11th ==> 0 nulls
education_12th ==> 0 nulls
education_1st-4th ==> 0 nulls
education_5th-6th ==> 0 nulls
education_7th-8th ==> 0 nulls
education_9th ==> 0 nulls
education_Assoc-acdm ==> 0 nulls
education_Assoc-voc ==> 0 nulls
education_Bachelors ==> 0 nulls
education_Doctorate ==> 0 nulls
education_HS-grad ==> 0 nulls
education_Masters ==> 0 nulls
education_Preschool ==> 0 nulls
education_Prof-school ==> 0 nulls
education_Some-college ==> 0 nulls
marital-status_Divorced ==> 0 nulls
marital-status_Married-AF-spouse ==> 0 nulls
marital-status_Married-

Finally, we have successfully handled the null records in our dataset.

What remains, as mentioned, is to reduce the number of dummy columns for each variable that was one-hot encoded, in order to lower the risk of multicollinearity.
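
To illustrate why a full set of dummy columns is redundant, the sketch below (on a made-up three-level variable, not on the notebook's data) compares the rank of a design matrix with an intercept before and after one dummy is dropped. It uses `pd.get_dummies` with `drop_first=True` purely for the demonstration; the notebook itself removes a column in a different way.

```python
import numpy as np
import pandas as pd

# Made-up categorical column with three levels
levels = pd.Series(["a", "b", "c", "a", "b", "c"], name="level")

full = pd.get_dummies(levels, dtype=float)                      # one column per level
reduced = pd.get_dummies(levels, drop_first=True, dtype=float)  # one level dropped

# Add an intercept column and measure how many columns are linearly independent
intercept = np.ones((len(levels), 1))
rank_full = np.linalg.matrix_rank(np.hstack([intercept, full.to_numpy()]))
rank_reduced = np.linalg.matrix_rank(np.hstack([intercept, reduced.to_numpy()]))

print(rank_full, "independent columns out of", full.shape[1] + 1)
print(rank_reduced, "independent columns out of", reduced.shape[1] + 1)
```

With all dummies present, the columns of a group add up to the intercept, so one of them carries no extra information.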

## Removing "surplus" dummy columns

In [42]:
object_columns

['workclass',
 'education',
 'marital-status',
 'occupation',
 'relationship',
 'race',
 'sex',
 'native-country',
 'class']

In [43]:
# For each one-hot-encoded variable, pick one of its generated dummy columns at random and drop it.
for variable in object_columns[:-1]:
    # The trailing underscore prevents 'education' from also matching 'education-num'
    columns_dummy = [column for column in wtnan_adult_df.columns if column.startswith(f"{variable}_")]
    column_to_delete = random.choice(columns_dummy)
    print('Column selected for removal ==>', column_to_delete)
    
    # Drop the selected column
    wtnan_adult_df.drop(columns = [column_to_delete], inplace = True)

Column selected for removal ==> workclass_State-gov
Column selected for removal ==> education_1st-4th
Column selected for removal ==> marital-status_Married-spouse-absent
Column selected for removal ==> occupation_Tech-support
Column selected for removal ==> relationship_Own-child
Column selected for removal ==> race_Amer-Indian-Eskimo
Column selected for removal ==> sex_Male
Column selected for removal ==> native-country_Canada
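
Since `random.choice` is called without a seed, the columns dropped will differ from one run to the next. A possible reproducible variant is sketched below on a made-up frame; the seed value 42 is arbitrary and the frame is hypothetical. It also shows why the prefix test benefits from a delimiter: `education-num` shares its prefix with the `education` dummies.

```python
import random

import pandas as pd

# Made-up frame: "education-num" shares a prefix with the "education" dummies
frame = pd.DataFrame(
    0,
    index=range(3),
    columns=["education-num", "education_HS-grad", "education_Masters", "sex_Male", "sex_Female"],
)

rng = random.Random(42)  # fixed seed (arbitrary value) so the selection is repeatable
dropped = []
for variable in ["education", "sex"]:
    # The trailing underscore keeps "education-num" out of the candidate list
    candidates = [c for c in frame.columns if c.startswith(f"{variable}_")]
    dropped.append(rng.choice(candidates))

frame = frame.drop(columns=dropped)
print(dropped)
```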


In [44]:
wtnan_adult_df.head()

Unnamed: 0,age,fnlwgt,education-num,capital-gain,capital-loss,hours-per-week,workclass_Federal-gov,workclass_Local-gov,workclass_Never-worked,workclass_Private,...,native-country_Puerto-Rico,native-country_Scotland,native-country_South,native-country_Taiwan,native-country_Thailand,native-country_Trinadad&Tobago,native-country_United-States,native-country_Vietnam,native-country_Yugoslavia,class_<=50K
0,19,134974.0,10,0.0,0.0,20,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1
1,41,195096.0,13,0.0,0.0,50,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1
2,31,152109.0,9,0.0,0.0,50,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1
3,40,202872.0,12,0.0,0.0,45,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1
4,35,98989.0,5,0.0,0.0,38,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1


In [45]:
# Rename the column that holds the target variable
wtnan_adult_df.rename(columns = {'class_<=50K':'class'}, inplace = True)

It is important to note that the target variable, now processed, is held in a single column and remains binary.

The target variable therefore takes the following values:

* 1 ==> '<=50K' (the less economically favored class)
* 0 ==> '>50K' (the more economically favored class)
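
As a small illustration of this mapping, the sketch below encodes a few made-up raw labels the same way (1 for '<=50K', 0 for '>50K'). It is not the notebook's own encoding step, only an equivalent expression of the result.

```python
import pandas as pd

# Made-up raw labels in the dataset's original notation
raw_class = pd.Series(["<=50K", ">50K", "<=50K"])

# 1 marks '<=50K', 0 marks '>50K', matching the mapping listed above
encoded = (raw_class == "<=50K").astype(int)
print(encoded.tolist())
```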


## Data normalization

The vast majority of columns are now binary variables. However, the original numeric variables must be normalized so that all features share a common scale and those with large ranges (such as fnlwgt) do not dominate the distances between observations when the final model is trained.

In [50]:
# Instantiate a MinMaxScaler object
scaler = MinMaxScaler()

X = wtnan_adult_df.drop(columns = ['class'])
y = wtnan_adult_df['class']

column_names = X.columns
# Normalize the feature matrix
X = scaler.fit_transform(X)

# Build a new DataFrame with every instance normalized, keeping the original index
wtnan_adult_processed_df = pd.DataFrame(data = X,
                                        columns = column_names,
                                        index = y.index)

# Attach the target variable column to the new DataFrame
wtnan_adult_processed_df['class'] = y

# First 5 records of the DataFrame
wtnan_adult_processed_df.head()

Unnamed: 0,age,fnlwgt,education-num,capital-gain,capital-loss,hours-per-week,workclass_Federal-gov,workclass_Local-gov,workclass_Never-worked,workclass_Private,...,native-country_Puerto-Rico,native-country_Scotland,native-country_South,native-country_Taiwan,native-country_Thailand,native-country_Trinadad&Tobago,native-country_United-States,native-country_Vietnam,native-country_Yugoslavia,class
0,0.027397,0.083004,0.6,0.0,0.0,0.193878,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1
1,0.328767,0.123678,0.8,0.0,0.0,0.5,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1
2,0.191781,0.094596,0.533333,0.0,0.0,0.5,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1
3,0.315068,0.128939,0.733333,0.0,0.0,0.44898,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1
4,0.246575,0.058658,0.266667,0.0,0.0,0.377551,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1
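
With its default settings, `MinMaxScaler` maps each column to the [0, 1] range through (x - min) / (max - min). The scaled values of the first record are consistent with that formula if age ranges from 17 to 90 and hours-per-week from 1 to 99. Those bounds are an assumption inferred from the output above; the notebook does not print them.

```python
def min_max(x: float, lo: float, hi: float) -> float:
    """Min-max scaling of a single value to the [0, 1] range."""
    return (x - lo) / (hi - lo)


# Assumed column bounds (inferred from the scaled output, not printed by the notebook)
AGE_MIN, AGE_MAX = 17, 90
HOURS_MIN, HOURS_MAX = 1, 99

age_scaled = min_max(19, AGE_MIN, AGE_MAX)        # first record: age = 19
hours_scaled = min_max(20, HOURS_MIN, HOURS_MAX)  # first record: hours-per-week = 20
print(round(age_scaled, 6), round(hours_scaled, 6))
```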


In [51]:
# Dimensions of the DataFrame holding the processed dataset
wtnan_adult_processed_df.shape

(48842, 98)

As we can see, our dataset now contains many more features than it had at the start. This should not be a problem, however, given how many examples the dataset holds: the number of instances is more than two orders of magnitude larger than the number of features.
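
The ratio can be checked directly from the reported shape. A quick calculation, counting every column except the target `class` as a feature:

```python
import math

n_rows, n_columns = 48842, 98   # shape reported above
n_features = n_columns - 1      # every column except the target 'class'

ratio = n_rows / n_features
orders_of_magnitude = math.log10(ratio)
print(round(ratio, 1), round(orders_of_magnitude, 2))
```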

# Saving the processed dataset

To persist the processing changes, I will generate a new .csv file with the contents of the processed DataFrame and store it in the project's local directory.

In [52]:
# Path where the project's data is stored
dataset__route = "../data/processed/"

# makedirs also creates missing parent folders and tolerates an existing directory
os.makedirs(dataset__route, exist_ok = True)
    
# Save the DataFrame "wtnan_adult_processed_df" as a .csv file
wtnan_adult_processed_df.to_csv(os.path.join(dataset__route, 'processed__census_income.csv'), index = False)
print('Processed dataset saved successfully.')

Processed dataset saved successfully.
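
To double-check a saved file, it can be read back and compared with what was written. The sketch below does this in a temporary directory with a made-up DataFrame, so it does not touch the project's data; the file name `example.csv` is hypothetical.

```python
import os
import tempfile

import pandas as pd

# Made-up stand-in for the processed DataFrame
processed = pd.DataFrame({"age": [0.25, 0.5], "class": [1, 0]})

with tempfile.TemporaryDirectory() as tmp_dir:
    # makedirs creates intermediate folders and tolerates an existing one
    target_dir = os.path.join(tmp_dir, "data", "processed")
    os.makedirs(target_dir, exist_ok=True)

    csv_path = os.path.join(target_dir, "example.csv")  # hypothetical file name
    processed.to_csv(csv_path, index=False)

    reloaded = pd.read_csv(csv_path)

# index=False means no extra index column appears on reload
same_columns = list(reloaded.columns) == list(processed.columns)
print(same_columns, reloaded.shape)
```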
