<h1>Discretization</h1>

In this notebook, I discretize a numeric variable by equal frequency and save the cut points so they can be reused on the test set, simulating a production situation.

In [11]:
import pandas as pd
from sklearn.model_selection import train_test_split 
from ipynb.fs.full.funPyModeling import status

<h3>Data loading</h3>

In [2]:
data = pd.read_csv("data/eph2.txt", sep=",")

<h3>Dataset status</h3>

In [6]:
status(data)

   variable             q_nan  p_nan     q_zeros  p_zeros  unique  type
0  edad                 134    0.038111  0        0.0      70      float64
1  sexo                 0      0.0       0        0.0      2       object
2  alfabeto             0      0.0       0        0.0      2       object
3  sistema_salud        0      0.0       0        0.0      7       object
4  nivel_educativo      66     0.018771  0        0.0      7       object
5  ocupacion_jerarquia  0      0.0       0        0.0      4       object
6  estado_civil         0      0.0       0        0.0      5       int64
7  ingreso_15k          0      0.0       0        0.0      2       object


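This data set has NaNs in 'edad' (134 rows, about 3.8%) and 'nivel_educativo' (66 rows, about 1.9%). By dtype, six columns are of type object (categorical) and two are numeric: 'edad' (float64) and 'estado_civil' (int64). With only 5 distinct values, 'estado_civil' is probably an integer-coded category.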
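The `status` helper comes from a local notebook, so it may not be available elsewhere. As a rough sketch, most of the same columns can be derived with plain pandas. The toy frame below is made up (only the column names mirror the data set), and the logic is an assumption about what `status` reports, not its actual implementation.

```python
import numpy as np
import pandas as pd

# Toy frame: column names mirror the notebook, the values are invented.
toy = pd.DataFrame({
    "edad": [25.0, np.nan, 40.0, 0.0],
    "sexo": ["M", "F", "F", "M"],
})

# Count zeros on numeric columns only, then report 0 for the others.
zeros = toy.select_dtypes("number").eq(0).sum().reindex(toy.columns, fill_value=0)

summary = pd.DataFrame({
    "q_nan": toy.isna().sum(),        # number of missing values
    "p_nan": toy.isna().mean(),       # share of missing values
    "q_zeros": zeros,                 # number of zeros
    "unique": toy.nunique(),          # distinct non-missing values
    "type": toy.dtypes.astype(str),   # pandas dtype as text
})
print(summary)
```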

<h3>Discretization</h3>

By equal frequency (quantile-based bins)

In [7]:
edad_cat = pd.qcut(data['edad'], q=5)

In [9]:
edad_cat

0       (14.999, 32.0]
1         (57.0, 95.0]
2       (14.999, 32.0]
3       (14.999, 32.0]
4         (57.0, 95.0]
             ...      
3511      (48.0, 57.0]
3512      (41.0, 48.0]
3513      (32.0, 41.0]
3514      (41.0, 48.0]
3515      (32.0, 41.0]
Name: edad, Length: 3516, dtype: category
Categories (5, interval[float64]): [(14.999, 32.0] < (32.0, 41.0] < (41.0, 48.0] < (48.0, 57.0] < (57.0, 95.0]]

In [8]:
status(edad_cat)

   variable  q_nan  p_nan     q_zeros  p_zeros  unique  type
0  edad      134    0.038111  0        0.0      5       category


Now, 'edad' is a categorical variable with 5 categories.
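Note that the status output still reports 134 NaNs: `pd.qcut` does not assign missing values to any bin. If a model needs every row labelled, one option is to give the missing values their own category. The sketch below uses made-up ages, and the label `"missing"` is an arbitrary choice, not something the notebook defines.

```python
import numpy as np
import pandas as pd

# Made-up ages with one missing value.
ages = pd.Series([18, 25, 33, 47, 60, np.nan], name="edad")

ages_cat = pd.qcut(ages, q=2)

# Missing inputs stay missing after binning.
n_missing = ages_cat.isna().sum()

# Optional: register a new category and use it to fill the gaps.
ages_cat_filled = ages_cat.cat.add_categories("missing").fillna("missing")
print(ages_cat_filled.value_counts())
```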

In [10]:
edad_cat.value_counts()

(32.0, 41.0]      762
(14.999, 32.0]    687
(48.0, 57.0]      673
(57.0, 95.0]      643
(41.0, 48.0]      617
Name: edad, dtype: int64
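The counts are similar but not identical, most likely because ages are whole numbers with many ties, so the quantile edges cannot split the rows perfectly evenly. To see how equal-frequency binning (`pd.qcut`) differs from equal-width binning (`pd.cut` with an integer `bins`), here is a small sketch on synthetic, right-skewed data. The values are randomly generated and are not taken from the data set.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic right-skewed values standing in for a real column.
values = pd.Series(rng.exponential(scale=10.0, size=1000))

equal_freq = pd.qcut(values, q=5)      # edges at quantiles: similar counts per bin
equal_width = pd.cut(values, bins=5)   # evenly spaced edges: uneven counts

print(equal_freq.value_counts(sort=False))
print(equal_width.value_counts(sort=False))
```

With continuous values and no ties, the equal-frequency bins hold the same number of rows, while the equal-width bins concentrate most rows in the lowest interval.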

<h3>Discretization in production</h3>

In [12]:
data_tr, data_ts = train_test_split(data, test_size=0.3)

In [13]:
edad_tr, saved_bins = pd.qcut(data_tr['edad'],
                              q=5,
                              retbins=True)  # important: also return the bin edges

Now 'edad_tr' holds the discretized 'edad' column, and 'saved_bins' holds the cut points.

In [14]:
edad_tr

1079    (41.0, 48.0]
2225    (41.0, 48.0]
3208    (48.0, 57.0]
2941    (33.0, 41.0]
206     (48.0, 57.0]
            ...     
730     (33.0, 41.0]
538     (41.0, 48.0]
598     (41.0, 48.0]
1058    (48.0, 57.0]
22      (57.0, 86.0]
Name: edad, Length: 2461, dtype: category
Categories (5, interval[float64]): [(14.999, 33.0] < (33.0, 41.0] < (41.0, 48.0] < (48.0, 57.0] < (57.0, 86.0]]

In [15]:
saved_bins

array([15., 33., 41., 48., 57., 86.])
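In a real production setting the cut points would usually be written to disk so that a separate scoring process can load them. A minimal sketch using JSON follows. The edges are copied from the output above, while the file name `edad_bins.json` and its location in the temp directory are hypothetical.

```python
import json
import os
import tempfile

import numpy as np

# Edges copied from the saved_bins output.
saved_bins = np.array([15., 33., 41., 48., 57., 86.])

# Hypothetical path; any location the scoring job can read would do.
path = os.path.join(tempfile.gettempdir(), "edad_bins.json")

# Training side: store the edges as a plain list of floats.
with open(path, "w") as fh:
    json.dump(saved_bins.tolist(), fh)

# Scoring side: load them back into an array for pd.cut.
with open(path) as fh:
    loaded_bins = np.array(json.load(fh))

print(loaded_bins)
```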

Next, I will apply the same discretization to the test set by passing the saved cut points to pd.cut.

In [16]:
edad_ts = pd.cut(data_ts['edad'],
                 bins=saved_bins,
                 include_lowest=True)

In [17]:
edad_ts

647       (48.0, 57.0]
1281      (48.0, 57.0]
399       (41.0, 48.0]
2696      (48.0, 57.0]
2076      (33.0, 41.0]
             ...      
423       (48.0, 57.0]
623       (33.0, 41.0]
2036      (41.0, 48.0]
295     (14.999, 33.0]
2612      (48.0, 57.0]
Name: edad, Length: 1055, dtype: category
Categories (5, interval[float64]): [(14.999, 33.0] < (33.0, 41.0] < (41.0, 48.0] < (48.0, 57.0] < (57.0, 86.0]]
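One caveat: the training edges span 15 to 86, while the full data set reaches 95, so any test row with an age outside the training range comes back as NaN from `pd.cut`. A common workaround, sketched here with made-up ages, is to replace the outer edges with infinities so the first and last bins are open-ended. Whether that is appropriate depends on the use case.

```python
import numpy as np
import pandas as pd

# Edges copied from the saved_bins output; the ages are invented.
saved_bins = np.array([15., 33., 41., 48., 57., 86.])
new_ages = pd.Series([14, 15, 50, 86, 95], name="edad")

# Strict edges: 14 and 95 fall outside [15, 86] and become NaN.
strict = pd.cut(new_ages, bins=saved_bins, include_lowest=True)

# Open-ended edges: the outer bins absorb out-of-range values.
open_bins = saved_bins.copy()
open_bins[0], open_bins[-1] = -np.inf, np.inf
relaxed = pd.cut(new_ages, bins=open_bins)

print(strict.isna().sum(), relaxed.isna().sum())
```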