<h1>Data preparation</h1>

This notebook walks through some basic data preparation: changing a column's type and treating NaNs in numerical and categorical variables.

In [21]:
import pandas as pd
import pickle
from ipynb.fs.full.funPyModeling import status, freq_tbl

<h3>Data loading</h3>

In [2]:
data = pd.read_csv("data/eph2.txt", sep = ",")

<h3>Dataset status</h3>

In [4]:
status(data)

Unnamed: 0,variable,q_nan,p_nan,q_zeros,p_zeros,unique,type
0,edad,134,0.038111,0,0.0,70,float64
1,sexo,0,0.0,0,0.0,2,object
2,alfabeto,0,0.0,0,0.0,2,object
3,sistema_salud,0,0.0,0,0.0,7,object
4,nivel_educativo,66,0.018771,0,0.0,7,object
5,ocupacion_jerarquia,0,0.0,0,0.0,4,object
6,estado_civil,0,0.0,0,0.0,5,int64
7,ingreso_15k,0,0.0,0,0.0,2,object


There are 134 NaNs in 'edad' and 66 in 'nivel_educativo'. Also, 'estado_civil' is stored as a numerical variable (int64), but it should be categorical.
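To make the columns of the status table easier to read, here is a rough pandas-only approximation of that summary. It is not the funPyModeling implementation: `status_sketch` and the toy data are hypothetical, and I am assuming that zeros are counted only in numeric columns.

```python
import pandas as pd

def status_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical approximation of the per-column summary shown above."""
    n_rows = len(df)
    q_nan = df.isna().sum()
    # Assumption: zeros are only meaningful for numeric columns; others get 0.
    q_zeros = (df.select_dtypes("number") == 0).sum().reindex(df.columns, fill_value=0)
    return pd.DataFrame({
        "variable": df.columns,
        "q_nan": q_nan.values,
        "p_nan": (q_nan / n_rows).values,
        "q_zeros": q_zeros.values,
        "p_zeros": (q_zeros / n_rows).values,
        "unique": df.nunique().values,  # nunique() ignores NaN by default
        "type": df.dtypes.astype(str).values,
    })

# Toy data, not the EPH dataset
toy = pd.DataFrame({
    "edad": [25.0, None, 40.0, 0.0],
    "sexo": ["mujer", "varon", "mujer", "varon"],
})
summary = status_sketch(toy)
print(summary)
```

Reading the result the same way as the real table: `q_nan` and `p_nan` are the count and share of missing values, `q_zeros` and `p_zeros` the count and share of zeros, and `unique` the number of distinct non-missing values.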

<h3>Data preparation</h3>

Convert 'estado_civil' from numerical to categorical

In [5]:
data['estado_civil_cat'] = data['estado_civil'].astype('str')

In [7]:
data = data.drop('estado_civil', axis = 1)
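The two cells above can be reproduced on a toy frame to see why the conversion matters. This is only a sketch with made-up codes: once the values are strings, `pd.get_dummies` treats the column as categorical and creates one indicator per code instead of leaving a single numeric column.

```python
import pandas as pd

# Toy data: integer codes standing in for marital status categories
toy = pd.DataFrame({"estado_civil": [1, 2, 2, 5]})

# Cast the codes to strings so they are treated as labels, not quantities
toy["estado_civil_cat"] = toy["estado_civil"].astype(str)
toy = toy.drop(columns="estado_civil")

# String columns are one-hot encoded; the original int column would not be
dummies = pd.get_dummies(toy)
print(list(dummies.columns))
```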

In [8]:
status(data)

Unnamed: 0,variable,q_nan,p_nan,q_zeros,p_zeros,unique,type
0,edad,134,0.038111,0,0.0,70,float64
1,sexo,0,0.0,0,0.0,2,object
2,alfabeto,0,0.0,0,0.0,2,object
3,sistema_salud,0,0.0,0,0.0,7,object
4,nivel_educativo,66,0.018771,0,0.0,7,object
5,ocupacion_jerarquia,0,0.0,0,0.0,4,object
6,ingreso_15k,0,0.0,0,0.0,2,object
7,estado_civil_cat,0,0.0,0,0.0,5,object


NaN treatment in 'nivel_educativo' (I will add 'nulo' as a new category)

In [9]:
data['nivel_educativo'] = data['nivel_educativo'].fillna(value="nulo") 

In [10]:
status(data['nivel_educativo'])

Unnamed: 0,variable,q_nan,p_nan,q_zeros,p_zeros,unique,type
0,nivel_educativo,0,0.0,0,0.0,8,object


In [12]:
freq_tbl(data['nivel_educativo'])

Unnamed: 0,nivel_educativo,frequency,percentage,cumulative_perc
0,Secundaria Completa,915,0.260239,0.260239
1,Primaria Completa,673,0.191411,0.45165
2,Secundaria Incompleta,630,0.179181,0.63083
3,Superior Universitaria Completa,596,0.169511,0.800341
4,Superior Universitaria Incompleta,419,0.11917,0.919511
5,Primaria Incompleta(incluye educación especial),200,0.056883,0.976394
6,nulo,66,0.018771,0.995165
7,Sin instrucción,17,0.004835,1.0
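As a small illustration of this step, the sketch below fills missing labels with 'nulo' on a toy series and then builds a frequency table with plain pandas. It only approximates what `freq_tbl` shows; the toy values and the table construction are my own assumptions, not the library's code.

```python
import pandas as pd

# Toy data with one missing label
toy = pd.Series(["Primaria", None, "Secundaria", "Secundaria"], name="nivel_educativo")

# Replace NaN with an explicit 'nulo' category
filled = toy.fillna("nulo")

# Frequency table: counts, share of the total, and running share
counts = filled.value_counts()
freq = pd.DataFrame({
    "frequency": counts,
    "percentage": counts / counts.sum(),
})
freq["cumulative_perc"] = freq["percentage"].cumsum()
print(freq)
```

Keeping the missing values as their own category means no rows are dropped, and a model can still learn whether "missing" carries information.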


NaN treatment in 'edad' (first I will discretize the variable, and then add the 'nulo' category as before)

In [13]:
data['edad_cat'] = pd.qcut(data['edad'], q = 5)

In [14]:
data['edad_cat'] = data['edad_cat'].cat.add_categories('nulo')

In [16]:
freq_tbl(data['edad_cat'])

Unnamed: 0,edad_cat,frequency,percentage,cumulative_perc
0,"(32.0, 41.0]",762,0.216724,0.22531
1,"(14.999, 32.0]",687,0.195392,0.428445
2,"(48.0, 57.0]",673,0.191411,0.627439
3,"(57.0, 95.0]",643,0.182878,0.817564
4,"(41.0, 48.0]",617,0.175484,1.0
5,nulo,0,0.0,1.0


In [17]:
data['edad_cat'] = data['edad_cat'].fillna(value="nulo")
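The three steps for 'edad' (quantile binning, registering the new category, filling) can be sketched on toy ages as follows. The numbers are invented and I use 3 bins instead of 5 only to keep the example small.

```python
import numpy as np
import pandas as pd

# Toy ages with one missing value
ages = pd.Series([18, 25, 33, 41, 50, 62, np.nan, 75, 29, 47], name="edad")

# Quantile-based bins; the missing age stays NaN after binning
age_bins = pd.qcut(ages, q=3)
n_missing_after_qcut = age_bins.isna().sum()

# A categorical only accepts values that are declared categories,
# so 'nulo' has to be added before it can be used as a fill value
age_bins = age_bins.cat.add_categories("nulo").fillna("nulo")
print(age_bins.value_counts())
```

This is why the notebook calls `add_categories` first: filling a categorical column with a label that is not among its categories raises an error in pandas.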

In [18]:
data = data.drop('edad', axis = 1)

In [19]:
status(data)

Unnamed: 0,variable,q_nan,p_nan,q_zeros,p_zeros,unique,type
0,sexo,0,0.0,0,0.0,2,object
1,alfabeto,0,0.0,0,0.0,2,object
2,sistema_salud,0,0.0,0,0.0,7,object
3,nivel_educativo,0,0.0,0,0.0,8,object
4,ocupacion_jerarquia,0,0.0,0,0.0,4,object
5,ingreso_15k,0,0.0,0,0.0,2,object
6,estado_civil_cat,0,0.0,0,0.0,5,object
7,edad_cat,0,0.0,0,0.0,6,category


The new dataset doesn't have NaNs.

Supposing that I will use this data in a predictive model, I will convert all the data to numerical and save it using pickle.

In [20]:
data_final = pd.get_dummies(data, drop_first=True)
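To show what `drop_first=True` does, here is a minimal sketch on a toy frame (the values are made up). For each categorical column, the first category in sorted order is dropped and becomes the implicit baseline, so a column with k categories yields k-1 indicator columns.

```python
import pandas as pd

# Toy data with two binary categorical columns
toy = pd.DataFrame({
    "sexo": ["mujer", "varon", "mujer"],
    "alfabeto": ["Si", "No", "Si"],
})

# 'mujer' and 'No' sort first, so they are dropped and act as baselines
encoded = pd.get_dummies(toy, drop_first=True)
print(encoded)
```

Depending on the pandas version, the indicator columns may come back as `bool` rather than the `uint8` shown in this notebook's output; passing `dtype="uint8"` to `pd.get_dummies` is one way to request integers explicitly.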

In [22]:
status(data_final)

Unnamed: 0,variable,q_nan,p_nan,q_zeros,p_zeros,unique,type
0,sexo_mujer,0,0.0,2226,0.633106,2,uint8
1,alfabeto_Si,0,0.0,23,0.006542,2,uint8
2,sistema_salud_Mutual/prepaga/servicio de emerg...,0,0.0,3513,0.999147,2,uint8
3,sistema_salud_No paga ni le descuentan,0,0.0,1817,0.51678,2,uint8
4,sistema_salud_Ns./Nr.,0,0.0,3512,0.998862,2,uint8
5,sistema_salud_Obra social (incluye PAMI),0,0.0,2043,0.581058,2,uint8
6,sistema_salud_Obra social y mutual/prepaga/ser...,0,0.0,3444,0.979522,2,uint8
7,sistema_salud_Planes y seguros públicos,0,0.0,3471,0.987201,2,uint8
8,nivel_educativo_Primaria Incompleta(incluye ed...,0,0.0,3316,0.943117,2,uint8
9,nivel_educativo_Secundaria Completa,0,0.0,2601,0.739761,2,uint8


In [None]:
with open('data/d_eph5.pickle', 'wb') as f:
    pickle.dump(data_final, f)
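For completeness, a small round-trip sketch of saving and loading a DataFrame with pickle. It uses a temporary directory and a hypothetical file name instead of the notebook's path, and a toy frame instead of `data_final`.

```python
import os
import pickle
import tempfile

import pandas as pd

# Toy frame standing in for the encoded dataset
frame = pd.DataFrame({"sexo_mujer": [1, 0, 1], "alfabeto_Si": [1, 1, 0]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example.pickle")  # hypothetical file name

    # Save in binary mode, as in the cell above
    with open(path, "wb") as f:
        pickle.dump(frame, f)

    # Load it back, also in binary mode
    with open(path, "rb") as f:
        restored = pickle.load(f)

round_trip_ok = restored.equals(frame)
print(round_trip_ok)
```

Pickle preserves dtypes and the index, which a CSV round trip would not do automatically. Pickle files should only be loaded from sources you trust.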