# Model Performance Transformations

Let's practice some basic data transformations to improve ML model performance.

In [230]:
# Imports
# sklearn.compose.ColumnTransformer

import numpy as np
import pandas as pd
import time
from sklearn.preprocessing import RobustScaler, StandardScaler
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
import warnings

In [2]:
# Categorical data analyser

def cat_var(df, cols):
    '''
    Return: a Pandas dataframe object with the following columns:
        - "categorical_variable" => every categorical variable included as an input parameter (string).
        - "number_of_possible_values" => the number of unique values that a given categorical variable can take (integer).
        - "values" => a list with the possible unique values for every categorical variable (list).

    Input parameters:
        - df -> Pandas dataframe object: a dataframe with categorical variables.
        - cols -> list object: a list with the name (string) of every categorical variable to analyse.
    '''
    cat_list = []
    for col in cols:
        cat = df[col].unique()
        cat_num = len(cat)
        cat_dict = {"categorical_variable":col,
                    "number_of_possible_values":cat_num,
                    "values":cat}
        cat_list.append(cat_dict)
    df = pd.DataFrame(cat_list).sort_values(by="number_of_possible_values", ascending=False)
    return df.reset_index(drop=True)

## Scaling

Some ML algorithms perform poorly when the scale of the data differs greatly between features. In those cases, scaling the data is usually your best option.

- [RobustScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.RobustScaler.html#sklearn.preprocessing.RobustScaler)

- [StandardScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html#sklearn.preprocessing.StandardScaler)

Try both options and see what happens with performance (i.e.: AUC).
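To make the difference between the two scalers concrete, here is a minimal sketch on a hypothetical toy feature (the values are invented, not taken from the weather data). It checks that `StandardScaler` matches `(x - mean) / std` and that `RobustScaler`, with its default settings, matches `(x - median) / IQR`.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

# Hypothetical toy feature with one extreme value (invented for illustration)
toy = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

standard = StandardScaler().fit_transform(toy)
robust = RobustScaler().fit_transform(toy)

# StandardScaler: (x - mean) / std, so the outlier inflates both statistics
manual_standard = (toy - toy.mean()) / toy.std()

# RobustScaler (default quantile range 25-75): (x - median) / IQR
q1, median, q3 = np.percentile(toy, [25, 50, 75])
manual_robust = (toy - median) / (q3 - q1)

print(np.allclose(standard, manual_standard))
print(np.allclose(robust, manual_robust))
print(standard.ravel().round(2))
print(robust.ravel().round(2))
```

With the robust version, the four ordinary values stay spread around zero and only the outlier is pushed far away; with the standard version, the outlier compresses the other values together.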

<img src="../images/scaling.png" alt="Drawing" style="width: 500px;"/>

In [14]:
# Weather dataset (https://www.kaggle.com/jsphyg/weather-dataset-rattle-package)

weather = pd.read_csv('../data/weatherAUS.csv')
print(weather.shape)
weather.head()

(145460, 23)


Unnamed: 0,Date,Location,MinTemp,MaxTemp,Rainfall,Evaporation,Sunshine,WindGustDir,WindGustSpeed,WindDir9am,...,Humidity9am,Humidity3pm,Pressure9am,Pressure3pm,Cloud9am,Cloud3pm,Temp9am,Temp3pm,RainToday,RainTomorrow
0,2008-12-01,Albury,13.4,22.9,0.6,,,W,44.0,W,...,71.0,22.0,1007.7,1007.1,8.0,,16.9,21.8,No,No
1,2008-12-02,Albury,7.4,25.1,0.0,,,WNW,44.0,NNW,...,44.0,25.0,1010.6,1007.8,,,17.2,24.3,No,No
2,2008-12-03,Albury,12.9,25.7,0.0,,,WSW,46.0,W,...,38.0,30.0,1007.6,1008.7,,2.0,21.0,23.2,No,No
3,2008-12-04,Albury,9.2,28.0,0.0,,,NE,24.0,SE,...,45.0,16.0,1017.6,1012.8,,,18.1,26.5,No,No
4,2008-12-05,Albury,17.5,32.3,1.0,,,W,41.0,ENE,...,82.0,33.0,1010.8,1006.0,7.0,8.0,17.8,29.7,No,No


In [15]:
# Uluru weather (numerical features)

weather = weather[weather['Location'].isin(['Uluru'])].reset_index(drop=True)
weather = weather[weather['RainToday'].isin(['No','Yes'])].reset_index(drop=True)
weather = weather[weather['RainTomorrow'].isin(['No','Yes'])]
weather = weather[['MinTemp',
                   'MaxTemp',
                   'Rainfall',
                   'WindSpeed9am',
                   'WindSpeed3pm',
                   'Humidity9am',
                   'Humidity3pm',
                   'Pressure9am',
                   'Pressure3pm',
                   'Temp9am',
                   'Temp3pm',
                   'RainTomorrow']]
weather = weather.dropna().reset_index(drop=True)
col_weather = list(weather.columns)
print(col_weather)
print(weather.shape)
print(weather.describe())
weather.head()

['MinTemp', 'MaxTemp', 'Rainfall', 'WindSpeed9am', 'WindSpeed3pm', 'Humidity9am', 'Humidity3pm', 'Pressure9am', 'Pressure3pm', 'Temp9am', 'Temp3pm', 'RainTomorrow']
(1479, 12)
           MinTemp      MaxTemp     Rainfall  WindSpeed9am  WindSpeed3pm  \
count  1479.000000  1479.000000  1479.000000   1479.000000   1479.000000   
mean     14.368627    30.402299     0.716700     17.613928     17.050710   
std       7.432857     7.624058     4.208585      7.887082      6.893016   
min      -1.900000    11.300000     0.000000      0.000000      0.000000   
25%       8.100000    23.800000     0.000000     11.000000     11.000000   
50%      14.900000    31.200000     0.000000     17.000000     17.000000   
75%      20.800000    37.100000     0.000000     24.000000     22.000000   
max      31.000000    44.400000    83.800000     41.000000     48.000000   

       Humidity9am  Humidity3pm  Pressure9am  Pressure3pm      Temp9am  \
count  1479.000000  1479.000000  1479.000000  1479.000000  1479.0

Unnamed: 0,MinTemp,MaxTemp,Rainfall,WindSpeed9am,WindSpeed3pm,Humidity9am,Humidity3pm,Pressure9am,Pressure3pm,Temp9am,Temp3pm,RainTomorrow
0,19.7,30.0,0.8,30.0,24.0,76.0,54.0,1010.6,1007.5,21.7,28.4,No
1,21.6,33.1,0.0,22.0,11.0,44.0,33.0,1010.5,1006.5,24.6,31.3,No
2,21.3,36.1,0.0,24.0,13.0,39.0,27.0,1006.9,1002.7,27.6,34.5,No
3,22.9,37.7,0.0,28.0,13.0,35.0,22.0,1006.0,1002.1,28.7,35.4,No
4,24.0,39.0,0.0,20.0,19.0,33.0,21.0,1006.9,1003.5,29.9,37.3,No


### Data preparation

In [19]:
# Features + target
X = weather[['MinTemp',
          'MaxTemp',
          'Rainfall',
          'WindSpeed9am',
          'WindSpeed3pm',
          'Humidity9am',
          'Humidity3pm',
          'Pressure9am',
          'Pressure3pm',
          'Temp9am',
          'Temp3pm']]
y = pd.get_dummies(weather['RainTomorrow'], drop_first=True)['Yes']
print(X.shape,y.shape)

(1479, 11) (1479,)


In [20]:
# Train + test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(f"X_train: {X_train.shape}, X_test: {X_test.shape}, y_train: {y_train.shape}, y_test: {y_test.shape}")
print(f"X_train: {type(X_train)}, X_test: {type(X_test)}, y_train: {type(y_train)}, y_test: {type(y_test)}")

X_train: (1183, 11), X_test: (296, 11), y_train: (1183,), y_test: (296,)
X_train: <class 'pandas.core.frame.DataFrame'>, X_test: <class 'pandas.core.frame.DataFrame'>, y_train: <class 'pandas.core.series.Series'>, y_test: <class 'pandas.core.series.Series'>


### Scaling

In [24]:
# Get the column names of the X_train dataframe
columns_name = X_train.columns.values
columns_name

array(['MinTemp', 'MaxTemp', 'Rainfall', 'WindSpeed9am', 'WindSpeed3pm',
       'Humidity9am', 'Humidity3pm', 'Pressure9am', 'Pressure3pm',
       'Temp9am', 'Temp3pm'], dtype=object)

#### RobustScaler

In [27]:
# Scaling X_train - RobustScaler
transformer_robust = RobustScaler().fit(X_train)
X_train_robust = transformer_robust.transform(X_train)

X_train_robust_df = pd.DataFrame(X_train_robust)
X_train_robust_df.columns = columns_name
X_train_robust_df.head()

Unnamed: 0,MinTemp,MaxTemp,Rainfall,WindSpeed9am,WindSpeed3pm,Humidity9am,Humidity3pm,Pressure9am,Pressure3pm,Temp9am,Temp3pm
0,-0.535433,-0.785185,0.0,0.076923,0.636364,0.84375,1.375,0.948454,1.072917,-0.632727,-0.824903
1,-1.094488,-0.851852,0.0,-0.769231,-0.545455,0.6875,0.1875,0.731959,0.802083,-0.858182,-0.871595
2,0.984252,0.874074,0.0,-0.769231,0.0,-0.46875,-0.375,-1.134021,-1.177083,0.96,0.77821
3,-1.086614,-0.881481,0.0,0.0,0.181818,0.625,0.5,0.731959,0.729167,-0.836364,-0.964981
4,0.401575,0.533333,0.0,1.076923,0.454545,-0.8125,-0.75,-0.762887,-0.5,0.785455,0.513619


In [29]:
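# Scaling X_test - RobustScaler
# Note: this fits a second scaler on X_test; reusing the scaler fitted on X_train keeps both splits on the same scale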
transformer_robust = RobustScaler().fit(X_test)
X_test_robust = transformer_robust.transform(X_test)

X_test_robust_df = pd.DataFrame(X_test_robust)
X_test_robust_df.columns = columns_name
X_test_robust_df.head()

Unnamed: 0,MinTemp,MaxTemp,Rainfall,WindSpeed9am,WindSpeed3pm,Humidity9am,Humidity3pm,Pressure9am,Pressure3pm,Temp9am,Temp3pm
0,0.730924,0.877339,0.0,0.230769,0.222222,-0.645161,-0.473684,-0.572271,-0.60745,1.017682,0.900862
1,0.827309,0.819127,0.2,0.846154,-0.666667,-0.419355,-0.157895,-0.820059,-0.87106,0.884086,0.780172
2,-0.441767,-0.045738,0.0,-0.846154,-0.222222,0.580645,0.263158,0.135693,-0.011461,-0.247544,0.012931
3,-0.208835,0.794179,0.0,-0.846154,-0.222222,-0.903226,-0.789474,-0.79646,-0.561605,0.310413,0.685345
4,0.971888,0.06237,0.0,-0.307692,1.0,-0.677419,1.052632,-0.171091,0.022923,0.569745,-0.168103
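The cells above fit one scaler on `X_train` and a separate one on `X_test`. A common alternative is to learn the scaling statistics from the training split only and apply them to both splits. The sketch below illustrates that pattern on synthetic data; the names `X_train_demo`, `X_test_demo` and `scaler_demo` are hypothetical stand-ins, not part of the notebook.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

rng = np.random.default_rng(42)

# Hypothetical stand-ins for X_train / X_test (synthetic, not the Uluru data)
X_train_demo = rng.normal(loc=20.0, scale=5.0, size=(200, 3))
X_test_demo = rng.normal(loc=20.0, scale=5.0, size=(50, 3))

# Learn the median and IQR from the training split only
scaler_demo = RobustScaler().fit(X_train_demo)

# Apply the same learned statistics to both splits
X_train_scaled = scaler_demo.transform(X_train_demo)
X_test_scaled = scaler_demo.transform(X_test_demo)

# Training medians are zero (up to floating point); test medians are only close to zero
print(np.median(X_train_scaled, axis=0))
print(np.median(X_test_scaled, axis=0))
```

The test split is then expressed in the same units the model saw during training, which is what the model would face with genuinely new data.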


#### StandardScaler 

In [28]:
# Scaling X_train - StandardScaler
transformer_standard = StandardScaler().fit(X_train)
X_train_standard = transformer_standard.transform(X_train)

X_train_standard_df = pd.DataFrame(X_train_standard)
X_train_standard_df.columns = columns_name
X_train_standard_df.head()

Unnamed: 0,MinTemp,MaxTemp,Rainfall,WindSpeed9am,WindSpeed3pm,Humidity9am,Humidity3pm,Pressure9am,Pressure3pm,Temp9am,Temp3pm
0,-0.861467,-1.291213,-0.172964,0.298824,1.021273,1.043379,1.11676,1.408494,1.555798,-1.065644,-1.347235
1,-1.821583,-1.409177,-0.172964,-1.09594,-0.871947,0.814233,-0.05227,1.084717,1.157436,-1.466209,-1.4286
2,1.748425,1.64476,-0.172964,-1.09594,0.001847,-0.881446,-0.606021,-1.705931,-1.75367,1.764156,1.446322
3,-1.80806,-1.461605,-0.172964,0.172028,0.293111,0.722574,0.255369,1.084717,1.050184,-1.427444,-1.591332
4,0.747741,1.041836,-0.172964,1.947182,0.730008,-1.385567,-0.975189,-1.150885,-0.757766,1.454041,0.98525


In [30]:
# Scaling X_test - StandardScaler
transformer_standard = StandardScaler().fit(X_test)
X_test_standard = transformer_standard.transform(X_test)

X_test_standard_df = pd.DataFrame(X_test_standard)
X_test_standard_df.columns = columns_name
X_test_standard_df.head()

Unnamed: 0,MinTemp,MaxTemp,Rainfall,WindSpeed9am,WindSpeed3pm,Humidity9am,Humidity3pm,Pressure9am,Pressure3pm,Temp9am,Temp3pm
0,1.339316,1.515163,-0.168327,0.317912,0.24296,-1.030095,-0.779035,-0.726011,-0.808638,1.70382,1.566997
1,1.497833,1.422887,-0.099698,1.333858,-0.903116,-0.699853,-0.407309,-1.06507,-1.17881,1.485032,1.375546
2,-0.589306,0.051928,-0.168327,-1.459993,-0.330078,0.76265,0.088327,0.242731,0.028274,-0.368225,0.158465
3,-0.206224,1.38334,-0.168327,-1.459993,-0.330078,-1.407515,-1.150762,-1.032779,-0.74426,0.545534,1.22512
4,1.735608,0.223298,-0.168327,-0.57104,1.245777,-1.077273,1.017644,-0.177057,0.076557,0.970239,-0.128712


### Use the model to predict with both scalers

#### RobustScaler
(X_train_robust, X_test_robust)

In [39]:
# Linear model - RobustScaler
linear_model = LogisticRegression(max_iter=1000)
linear_param = linear_model.fit(X_train_robust, y_train)
linear_pred = linear_model.predict(X_test_robust)
linear_auc = roc_auc_score(y_test, linear_pred)
print(f"Linear model AUC is: {linear_auc}")

Linear model AUC is: 0.6542846285388563


In [40]:
# Ensemble model - RobustScaler
ensemble_model = RandomForestClassifier()
ensemble_param = ensemble_model.fit(X_train_robust, y_train)
ensemble_pred = ensemble_model.predict(X_test_robust)
ensemble_auc = roc_auc_score(y_test, ensemble_pred)
print(f"Ensemble model AUC is: {ensemble_auc}")

Ensemble model AUC is: 0.6225536766102983


#### StandardScaler
(X_train_standard, X_test_standard)

In [34]:
# Linear model - StandardScaler
linear_model = LogisticRegression(max_iter=1000)
linear_param = linear_model.fit(X_train_standard, y_train)
linear_pred = linear_model.predict(X_test_standard)
linear_auc = roc_auc_score(y_test, linear_pred)
print(f"Linear model AUC is: {linear_auc}")

Linear model AUC is: 0.6787953638609159


In [35]:
# Ensemble model - StandardScaler
ensemble_model = RandomForestClassifier()
ensemble_param = ensemble_model.fit(X_train_standard, y_train)
ensemble_pred = ensemble_model.predict(X_test_standard)
ensemble_auc = roc_auc_score(y_test, ensemble_pred)
print(f"Ensemble model AUC is: {ensemble_auc}")

Ensemble model AUC is: 0.6996959908797264
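The cells above pass hard 0/1 predictions to `roc_auc_score`. The metric also accepts continuous scores, and passing the positive-class probability from `predict_proba` lets it rank samples across all thresholds instead of a single one. The following is a minimal sketch on synthetic data generated with `make_classification` (an assumption; it is not the weather dataset).

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced binary problem (an assumption; not the weather data)
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=10, weights=[0.9, 0.1], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# AUC from hard class labels: a single operating point on the ROC curve
auc_labels = roc_auc_score(y_te, model.predict(X_te))

# AUC from positive-class probabilities: the whole ROC curve
auc_scores = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])

print(f"AUC from labels: {auc_labels:.3f}")
print(f"AUC from probabilities: {auc_scores:.3f}")
```

The two numbers usually differ, so it is worth deciding which one to report before comparing scalers.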


---

## Encoding

Most ML algorithms cannot work with categorical data directly, so you need to transform categorical values into numerical ones. Compare the results of both techniques: __One Hot Encoding__ and __Label Encoding__.

- [OneHotEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html#sklearn.preprocessing.OneHotEncoder)

- [LabelEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html#sklearn.preprocessing.LabelEncoder)

<img src="../images/encoding.png" alt="Drawing" style="width: 500px;"/>

In [42]:
# Mushrooms dataset (https://www.kaggle.com/uciml/mushroom-classification)
mushrooms = pd.read_csv('../data/mushrooms.csv')
col_mushrooms = list(mushrooms.columns)
print(mushrooms.shape)
pd.set_option('display.max_columns', None)   # Show all dataframe column names
mushrooms.head()

(8124, 23)


Unnamed: 0,class,cap-shape,cap-surface,cap-color,bruises,odor,gill-attachment,gill-spacing,gill-size,gill-color,stalk-shape,stalk-root,stalk-surface-above-ring,stalk-surface-below-ring,stalk-color-above-ring,stalk-color-below-ring,veil-type,veil-color,ring-number,ring-type,spore-print-color,population,habitat
0,p,x,s,n,t,p,f,c,n,k,e,e,s,s,w,w,p,w,o,p,k,s,u
1,e,x,s,y,t,a,f,c,b,k,e,c,s,s,w,w,p,w,o,p,n,n,g
2,e,b,s,w,t,l,f,c,b,n,e,c,s,s,w,w,p,w,o,p,n,n,m
3,p,x,y,w,t,p,f,c,n,n,e,e,s,s,w,w,p,w,o,p,k,s,u
4,e,x,s,g,f,n,f,w,b,k,t,e,s,s,w,w,p,w,o,e,n,a,g


In [227]:
mushrooms.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8124 entries, 0 to 8123
Data columns (total 23 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   class                     8124 non-null   object
 1   cap-shape                 8124 non-null   object
 2   cap-surface               8124 non-null   object
 3   cap-color                 8124 non-null   object
 4   bruises                   8124 non-null   object
 5   odor                      8124 non-null   object
 6   gill-attachment           8124 non-null   object
 7   gill-spacing              8124 non-null   object
 8   gill-size                 8124 non-null   object
 9   gill-color                8124 non-null   object
 10  stalk-shape               8124 non-null   object
 11  stalk-root                8124 non-null   object
 12  stalk-surface-above-ring  8124 non-null   object
 13  stalk-surface-below-ring  8124 non-null   object
 14  stalk-color-above-ring  

In [37]:
# Features analysis
cat_mushrooms = cat_var(mushrooms, col_mushrooms)
cat_mushrooms

Unnamed: 0,categorical_variable,number_of_possible_values,values
0,gill-color,12,"[k, n, g, p, w, h, u, e, b, r, y, o]"
1,cap-color,10,"[n, y, w, g, e, p, b, u, c, r]"
2,spore-print-color,9,"[k, n, u, h, w, r, o, y, b]"
3,odor,9,"[p, a, l, n, f, c, y, s, m]"
4,stalk-color-below-ring,9,"[w, p, g, b, n, e, y, o, c]"
5,stalk-color-above-ring,9,"[w, g, p, n, b, e, o, c, y]"
6,habitat,7,"[u, g, m, d, p, w, l]"
7,cap-shape,6,"[x, b, s, f, k, c]"
8,population,6,"[s, n, a, v, y, c]"
9,ring-type,5,"[p, e, l, f, n]"


In [59]:
# Get the column names of the mushrooms dataframe
columns_name = mushrooms.columns.values
columns_name

array(['class', 'cap-shape', 'cap-surface', 'cap-color', 'bruises',
       'odor', 'gill-attachment', 'gill-spacing', 'gill-size',
       'gill-color', 'stalk-shape', 'stalk-root',
       'stalk-surface-above-ring', 'stalk-surface-below-ring',
       'stalk-color-above-ring', 'stalk-color-below-ring', 'veil-type',
       'veil-color', 'ring-number', 'ring-type', 'spore-print-color',
       'population', 'habitat'], dtype=object)

### Encoding
#### One Hot Encoding

In [105]:
# Features + target (encoding). IMPORTANT: you may pick any of the two-label features as your target (choose wisely!)
X = mushrooms
# No `drop` argument, so every category keeps its own column (as in the output below)
enc_one = OneHotEncoder().fit(X)
X_enc_one = enc_one.transform(X).toarray()

# Extract the column names after the transformation
columns_name_enc_one = enc_one.get_feature_names_out(columns_name)

# Transform to a dataframe
X_enc_df = pd.DataFrame(X_enc_one, columns=columns_name_enc_one)
X_enc_df.head()

Unnamed: 0,class_e,class_p,cap-shape_b,cap-shape_c,cap-shape_f,cap-shape_k,cap-shape_s,cap-shape_x,cap-surface_f,cap-surface_g,cap-surface_s,cap-surface_y,cap-color_b,cap-color_c,cap-color_e,cap-color_g,cap-color_n,cap-color_p,cap-color_r,cap-color_u,cap-color_w,cap-color_y,bruises_f,bruises_t,odor_a,odor_c,odor_f,odor_l,odor_m,odor_n,odor_p,odor_s,odor_y,gill-attachment_a,gill-attachment_f,gill-spacing_c,gill-spacing_w,gill-size_b,gill-size_n,gill-color_b,gill-color_e,gill-color_g,gill-color_h,gill-color_k,gill-color_n,gill-color_o,gill-color_p,gill-color_r,gill-color_u,gill-color_w,gill-color_y,stalk-shape_e,stalk-shape_t,stalk-root_?,stalk-root_b,stalk-root_c,stalk-root_e,stalk-root_r,stalk-surface-above-ring_f,stalk-surface-above-ring_k,stalk-surface-above-ring_s,stalk-surface-above-ring_y,stalk-surface-below-ring_f,stalk-surface-below-ring_k,stalk-surface-below-ring_s,stalk-surface-below-ring_y,stalk-color-above-ring_b,stalk-color-above-ring_c,stalk-color-above-ring_e,stalk-color-above-ring_g,stalk-color-above-ring_n,stalk-color-above-ring_o,stalk-color-above-ring_p,stalk-color-above-ring_w,stalk-color-above-ring_y,stalk-color-below-ring_b,stalk-color-below-ring_c,stalk-color-below-ring_e,stalk-color-below-ring_g,stalk-color-below-ring_n,stalk-color-below-ring_o,stalk-color-below-ring_p,stalk-color-below-ring_w,stalk-color-below-ring_y,veil-type_p,veil-color_n,veil-color_o,veil-color_w,veil-color_y,ring-number_n,ring-number_o,ring-number_t,ring-type_e,ring-type_f,ring-type_l,ring-type_n,ring-type_p,spore-print-color_b,spore-print-color_h,spore-print-color_k,spore-print-color_n,spore-print-color_o,spore-print-color_r,spore-print-color_u,spore-print-color_w,spore-print-color_y,population_a,population_c,population_n,population_s,population_v,population_y,habitat_d,habitat_g,habitat_l,habitat_m,habitat_p,habitat_u,habitat_w
0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
1,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
2,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
3,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
4,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0


In [106]:
# Features + target
X_one = X_enc_df.iloc[:, 2:]
y_one = X_enc_df.iloc[:, 1:2]

#### Label Encoding

In [104]:
# Features + target (encoding). IMPORTANT: you may pick any of the two-label features as your target (choose wisely!)
X_enc_label_df = pd.DataFrame()

# Obtain the dataframe encoded
for column in mushrooms.columns:
    if mushrooms[column].dtype == 'object':
        enc_label = LabelEncoder()
        X_enc_label_df[column] = enc_label.fit_transform(mushrooms[column])

X_enc_label_df.head()

Unnamed: 0,class,cap-shape,cap-surface,cap-color,bruises,odor,gill-attachment,gill-spacing,gill-size,gill-color,stalk-shape,stalk-root,stalk-surface-above-ring,stalk-surface-below-ring,stalk-color-above-ring,stalk-color-below-ring,veil-type,veil-color,ring-number,ring-type,spore-print-color,population,habitat
0,1,5,2,4,1,6,1,0,1,4,0,3,2,2,7,7,0,2,1,4,2,3,5
1,0,5,2,9,1,0,1,0,0,4,0,2,2,2,7,7,0,2,1,4,3,2,1
2,0,0,2,8,1,3,1,0,0,5,0,2,2,2,7,7,0,2,1,4,3,2,3
3,1,5,3,8,1,6,1,0,1,5,0,3,2,2,7,7,0,2,1,4,2,3,5
4,0,5,2,3,0,5,1,1,0,4,1,3,2,2,7,7,0,2,1,0,3,0,1
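scikit-learn documents `LabelEncoder` as a transformer for target values, while `OrdinalEncoder` is its counterpart for feature matrices and encodes all columns in one call. The sketch below shows that alternative on a hypothetical two-column sample that reuses category letters from the mushrooms output; it is not the notebook's own method.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical two-column sample using category letters seen in the mushrooms data
sample = pd.DataFrame({
    "cap-shape": ["x", "x", "b", "x"],
    "odor": ["p", "a", "l", "p"],
})

# One encoder handles every column; categories are sorted per column
encoder = OrdinalEncoder()
encoded = pd.DataFrame(encoder.fit_transform(sample), columns=sample.columns).astype(int)

print(encoded)
print(encoder.categories_)
```

Because the fitted encoder keeps the category list for every column in `categories_`, the same mapping can be applied later with `encoder.transform`.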


In [153]:
# Features + target
# Note: iloc[:, 2:] skips both 'class' (column 0) and 'cap-shape' (column 1), leaving 21 features
X_label = X_enc_label_df.iloc[:, 2:]
y_label = X_enc_label_df.iloc[:, 0:1]

### Obtain the train and test
#### One Hot Encoding

In [142]:
# Train + test
X_train_one, X_test_one, y_train_one, y_test_one = train_test_split(X_one, y_one, test_size=0.2, random_state=42)
print(f"X_train: {X_train_one.shape}, X_test: {X_test_one.shape}, y_train: {y_train_one.shape},\
      y_test: {y_test_one.shape}")
print(f"X_train: {type(X_train_one)}, X_test: {type(X_test_one)}, y_train: {type(y_train_one)},\
      y_test: {type(y_test_one)}")

X_train: (6499, 117), X_test: (1625, 117), y_train: (6499, 1),      y_test: (1625, 1)
X_train: <class 'pandas.core.frame.DataFrame'>, X_test: <class 'pandas.core.frame.DataFrame'>, y_train: <class 'pandas.core.frame.DataFrame'>,      y_test: <class 'pandas.core.frame.DataFrame'>


#### Label Encoding

In [154]:
# Train + test
X_train_label, X_test_label, y_train_label, y_test_label = train_test_split(X_label, y_label, test_size=0.2, 
                                                                               random_state=42)
print(f"X_train: {X_train_label.shape}, X_test: {X_test_label.shape}, y_train: {y_train_label.shape},\
      y_test: {y_test_label.shape}")
print(f"X_train: {type(X_train_label)}, X_test: {type(X_test_label)}, y_train: {type(y_train_label)},\
      y_test: {type(y_test_label)}")

X_train: (6499, 21), X_test: (1625, 21), y_train: (6499, 1),      y_test: (1625, 1)
X_train: <class 'pandas.core.frame.DataFrame'>, X_test: <class 'pandas.core.frame.DataFrame'>, y_train: <class 'pandas.core.frame.DataFrame'>,      y_test: <class 'pandas.core.frame.DataFrame'>


### Scaling
#### One Hot Encoding - RobustScaler

In [130]:
# Scaling X_train - RobustScaler
# transformer_robust = RobustScaler().fit(X_train_one)
# X_train_one_robust = transformer_robust.transform(X_train_one)
X_train_one_robust = X_train_one

# Scaling X_test - RobustScaler
# transformer_robust = RobustScaler().fit(X_test_one)
# X_test_one_robust = transformer_robust.transform(X_test_one)
X_test_one_robust = X_test_one

# X_train_one_robust, X_test_one_robust

#### One Hot Encoding - StandardScaler

In [131]:
# Scaling X_train - StandardScaler
# transformer_standard = StandardScaler().fit(X_train_one)
# X_train_one_standard = transformer_standard.transform(X_train_one)
X_train_one_standard = X_train_one

# Scaling X_test - StandardScaler
# transformer_standard = StandardScaler().fit(X_test_one)
# X_test_one_standard = transformer_standard.transform(X_test_one)
X_test_one_standard = X_test_one

# X_train_one_standard, X_test_one_standard

#### Label Encoding - RobustScaler

In [113]:
# Scaling X_train - RobustScaler
transformer_robust = RobustScaler().fit(X_train_label)
X_train_label_robust = transformer_robust.transform(X_train_label)

# Scaling X_test - RobustScaler
transformer_robust = RobustScaler().fit(X_test_label)
X_test_label_robust = transformer_robust.transform(X_test_label)

# X_train_label_robust, X_test_label_robust

#### Label Encoding - StandardScaler

In [114]:
# Scaling X_train - StandardScaler
transformer_standard = StandardScaler().fit(X_train_label)
X_train_label_standard = transformer_standard.transform(X_train_label)

# Scaling X_test - StandardScaler
transformer_standard = StandardScaler().fit(X_test_label)
X_test_label_standard = transformer_standard.transform(X_test_label)

# X_train_label_standard, X_test_label_standard
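The imports at the top include `ColumnTransformer`, which can bundle encoding and scaling into one object so that each step is fitted once and reused. The following is a rough sketch under stated assumptions: the dataframe is invented, `cap_diameter` is a hypothetical numerical column that does not exist in the mushrooms data, and `Pipeline` is an extra import not used elsewhere in this notebook.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical mixed-type data (values invented for illustration)
demo = pd.DataFrame({
    "odor": ["p", "a", "l", "p", "n", "a", "l", "n"],
    "cap_diameter": [5.0, 7.5, 6.1, 4.8, 9.0, 7.0, 6.5, 8.8],
    "target": [1, 0, 0, 1, 0, 0, 0, 0],
})
X_demo = demo[["odor", "cap_diameter"]]
y_demo = demo["target"]

# Encode the categorical column and scale the numerical one
preprocess = ColumnTransformer([
    ("categorical", OneHotEncoder(handle_unknown="ignore"), ["odor"]),
    ("numerical", StandardScaler(), ["cap_diameter"]),
])

# Chain preprocessing and model so fit/predict run every step in order
pipeline = Pipeline([
    ("preprocess", preprocess),
    ("model", LogisticRegression(max_iter=1000)),
])
pipeline.fit(X_demo, y_demo)

transformed = pipeline.named_steps["preprocess"].transform(X_demo)
print(transformed.shape)
print(pipeline.predict(X_demo))
```

Calling `pipeline.fit` on the training split and `pipeline.predict` on the test split would apply the training-time encoding and scaling to the test data automatically.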

### Prediction

In [212]:
# Initialize the dataframe in which to store all the information for each iteration
# result = pd.DataFrame(columns=['Encoding', 'Scaler', 'Model', 'Result', 'Time'])
result = []

#### Prediction - One Hot Encoding - RobustScaler

In [None]:
# Prepare data
y_train_one = y_train_one['class_p'].values
y_test_one = y_test_one['class_p'].values

In [213]:
# Measure execution time
start_time = time.time()

# Linear model
# X_train_one_robust, X_test_one_robust, y_train_one, y_test_one
linear_model = LogisticRegression(max_iter=1000)
linear_param = linear_model.fit(X_train_one_robust, y_train_one)
linear_pred = linear_model.predict(X_test_one_robust)
linear_auc = roc_auc_score(y_test_one, linear_pred)
# print(f"Linear model AUC is: {linear_auc}")

end_time = time.time()
elapsed_time = (end_time - start_time)*1000

# Store the values
result_dict = {'Encoding': 'One Hot', 
               'Scaler': 'Robust', 
               'Model': 'Linear',
               'Result': linear_auc,
               'Time (ms)': elapsed_time}
result.append(result_dict)

In [214]:
# Measure execution time
start_time = time.time()

# Ensemble model
ensemble_model = RandomForestClassifier()
ensemble_param = ensemble_model.fit(X_train_one_robust, y_train_one)
ensemble_pred = ensemble_model.predict(X_test_one_robust)
ensemble_auc = roc_auc_score(y_test_one, ensemble_pred)
# print(f"Ensemble model AUC is: {ensemble_auc}")

end_time = time.time()
elapsed_time = (end_time - start_time)*1000

result_dict = {'Encoding': 'One Hot', 
               'Scaler': 'Robust', 
               'Model': 'RandomForest',
               'Result': ensemble_auc,
               'Time (ms)': elapsed_time}
result.append(result_dict)

#### Prediction - One Hot Encoding - StandardScaler

In [215]:
# Measure execution time
start_time = time.time()

# Linear model
# X_train_one_standard, X_test_one_standard, y_train_one, y_test_one
linear_model = LogisticRegression(max_iter=1000)
linear_param = linear_model.fit(X_train_one_standard, y_train_one)
linear_pred = linear_model.predict(X_test_one_standard)
linear_auc = roc_auc_score(y_test_one, linear_pred)
#print(f"Linear model AUC is: {linear_auc}")

end_time = time.time()
elapsed_time = (end_time - start_time)*1000

result_dict = {'Encoding': 'One Hot', 
               'Scaler': 'Standard', 
               'Model': 'Linear',
               'Result': linear_auc,
               'Time (ms)': elapsed_time}
result.append(result_dict)

In [216]:
# Measure execution time
start_time = time.time()

# Ensemble model
ensemble_model = RandomForestClassifier()
ensemble_param = ensemble_model.fit(X_train_one_standard, y_train_one)
ensemble_pred = ensemble_model.predict(X_test_one_standard)
ensemble_auc = roc_auc_score(y_test_one, ensemble_pred)
#print(f"Ensemble model AUC is: {ensemble_auc}")

end_time = time.time()
elapsed_time = (end_time - start_time)*1000

result_dict = {'Encoding': 'One Hot', 
               'Scaler': 'Standard', 
               'Model': 'RandomForest',
               'Result': ensemble_auc,
               'Time (ms)': elapsed_time}
result.append(result_dict)

#### Prediction - Label Encoding - RobustScaler

In [207]:
# Prepare data
y_train_label = np.ravel(y_train_label['class'].values)
y_test_label = np.ravel(y_test_label['class'].values)

In [235]:
# Measure execution time
start_time = time.time()
# Ignore warnings
warnings.filterwarnings("ignore")

# Linear model
# X_train_label_robust, X_test_label_robust, y_train_label, y_test_label
linear_model = LogisticRegression(max_iter=1000)
linear_param = linear_model.fit(X_train_label_robust, y_train_label)
linear_pred = linear_model.predict(X_test_label_robust)
linear_auc = roc_auc_score(y_test_label, linear_pred)
#print(f"Linear model AUC is: {linear_auc}")

end_time = time.time()
elapsed_time = (end_time - start_time)*1000

# Store the values
result_dict = {'Encoding': 'Label', 
               'Scaler': 'Robust', 
               'Model': 'Linear',
               'Result': linear_auc,
               'Time (ms)': elapsed_time}
result.append(result_dict)


In [236]:
# Measure execution time
start_time = time.time()
# Ignore warnings
warnings.filterwarnings("ignore")

# Ensemble model
ensemble_model = RandomForestClassifier()
ensemble_param = ensemble_model.fit(X_train_label_robust, y_train_label)
ensemble_pred = ensemble_model.predict(X_test_label_robust)
ensemble_auc = roc_auc_score(y_test_label, ensemble_pred)
#print(f"Ensemble model AUC is: {ensemble_auc}")

end_time = time.time()
elapsed_time = (end_time - start_time)*1000

# Store the values
result_dict = {'Encoding': 'Label', 
               'Scaler': 'Robust', 
               'Model': 'RandomForest',
               'Result': ensemble_auc,
               'Time (ms)': elapsed_time}
result.append(result_dict)

#### Prediction - Label Encoding - StandardScaler

In [237]:
# Measure execution time
start_time = time.time()
# Ignore warnings
warnings.filterwarnings("ignore")

# Linear model
# X_train_label_standard, X_test_label_standard, y_train_label, y_test_label
linear_model = LogisticRegression(max_iter=1000)
linear_param = linear_model.fit(X_train_label_standard, y_train_label)
linear_pred = linear_model.predict(X_test_label_standard)
linear_auc = roc_auc_score(y_test_label, linear_pred)
#print(f"Linear model AUC is: {linear_auc}")

end_time = time.time()
elapsed_time = (end_time - start_time)*1000

# Store the values
result_dict = {'Encoding': 'Label', 
               'Scaler': 'Standard', 
               'Model': 'Linear',
               'Result': linear_auc,
               'Time (ms)': elapsed_time}
result.append(result_dict)

In [238]:
# Measure execution time
start_time = time.time()
# Ignore warnings
warnings.filterwarnings("ignore")

# Ensemble model
ensemble_model = RandomForestClassifier()
ensemble_param = ensemble_model.fit(X_train_label_standard, y_train_label)
ensemble_pred = ensemble_model.predict(X_test_label_standard)
ensemble_auc = roc_auc_score(y_test_label, ensemble_pred)
#print(f"Ensemble model AUC is: {ensemble_auc}")

end_time = time.time()
elapsed_time = (end_time - start_time)*1000

# Store the values
result_dict = {'Encoding': 'Label', 
               'Scaler': 'Standard', 
               'Model': 'RandomForest',
               'Result': ensemble_auc,
               'Time (ms)': elapsed_time}
result.append(result_dict)

### Result

In [221]:
result_df = pd.DataFrame(result)
result_df

Unnamed: 0,Encoding,Scaler,Model,Result,Time (ms)
0,One Hot,Robust,Linear,1.0,22.433758
1,One Hot,Robust,RandomForest,1.0,232.445002
2,One Hot,Standard,Linear,1.0,22.821188
3,One Hot,Standard,RandomForest,1.0,237.267256
4,Label,Robust,Linear,0.912189,35.54368
5,Label,Robust,RandomForest,1.0,205.275059
6,Label,Standard,Linear,0.956185,25.339842
7,Label,Standard,RandomForest,1.0,211.426258
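The eight prediction cells repeat the same fit, predict, time and store pattern. One possible way to reduce the repetition is a small helper that returns a result row; the sketch below uses a hypothetical `evaluate` function and synthetic data from `make_classification`, so the encoded and scaled mushroom splits would have to be swapped in.

```python
import time

import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split


def evaluate(model, X_train, X_test, y_train, y_test, **labels):
    """Fit, predict and time one model; `labels` are copied into the result row."""
    start = time.perf_counter()
    model.fit(X_train, y_train)
    auc = roc_auc_score(y_test, model.predict(X_test))
    elapsed_ms = (time.perf_counter() - start) * 1000
    return {**labels, "Result": auc, "Time (ms)": elapsed_ms}


# Synthetic stand-in data (an assumption; not the mushrooms dataset)
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

models = {
    "Linear": LogisticRegression(max_iter=1000),
    "RandomForest": RandomForestClassifier(random_state=42),
}

# One row per model, with the same columns as the result table above
rows = [
    evaluate(m, X_tr, X_te, y_tr, y_te, Encoding="demo", Scaler="none", Model=name)
    for name, m in models.items()
]
demo_result_df = pd.DataFrame(rows)
print(demo_result_df)
```

Looping over encodings and scalers as well as models would reproduce the full table with a single cell.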


---

## Bonus

Now that you have a grasp of the potential of preprocessing your data, what would you do with the following dataset?

<img src="../images/bonus.jpg" alt="Drawing" style="width: 500px;"/>

In [225]:
# Netflix dataset (https://www.kaggle.com/shivamb/netflix-shows)

netflix = pd.read_csv('../data/netflix_titles.csv')
col_netflix = list(netflix.columns)
pd.set_option('display.max_columns', None)   # Show all dataframe column names
print(netflix.shape)
netflix.head()

(7787, 12)


Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
0,s1,TV Show,3%,,"João Miguel, Bianca Comparato, Michel Gomes, R...",Brazil,"August 14, 2020",2020,TV-MA,4 Seasons,"International TV Shows, TV Dramas, TV Sci-Fi &...",In a future where the elite inhabit an island ...
1,s2,Movie,7:19,Jorge Michel Grau,"Demián Bichir, Héctor Bonilla, Oscar Serrano, ...",Mexico,"December 23, 2016",2016,TV-MA,93 min,"Dramas, International Movies",After a devastating earthquake hits Mexico Cit...
2,s3,Movie,23:59,Gilbert Chan,"Tedd Chan, Stella Chung, Henley Hii, Lawrence ...",Singapore,"December 20, 2018",2011,R,78 min,"Horror Movies, International Movies","When an army recruit is found dead, his fellow..."
3,s4,Movie,9,Shane Acker,"Elijah Wood, John C. Reilly, Jennifer Connelly...",United States,"November 16, 2017",2009,PG-13,80 min,"Action & Adventure, Independent Movies, Sci-Fi...","In a postapocalyptic world, rag-doll robots hi..."
4,s5,Movie,21,Robert Luketic,"Jim Sturgess, Kevin Spacey, Kate Bosworth, Aar...",United States,"January 1, 2020",2008,PG-13,123 min,Dramas,A brilliant group of students become card-coun...


In [229]:
netflix['listed_in'][0]

'International TV Shows, TV Dramas, TV Sci-Fi & Fantasy'

In [226]:
# Features analysis
cat_netflix = cat_var(netflix, col_netflix)
cat_netflix

Unnamed: 0,categorical_variable,number_of_possible_values,values
0,show_id,7787,"[s1, s2, s3, s4, s5, s6, s7, s8, s9, s10, s11,..."
1,title,7787,"[3%, 7:19, 23:59, 9, 21, 46, 122, 187, 706, 19..."
2,description,7769,[In a future where the elite inhabit an island...
3,cast,6832,"[João Miguel, Bianca Comparato, Michel Gomes, ..."
4,director,4050,"[nan, Jorge Michel Grau, Gilbert Chan, Shane A..."
5,date_added,1566,"[August 14, 2020, December 23, 2016, December ..."
6,country,682,"[Brazil, Mexico, Singapore, United States, Tur..."
7,listed_in,492,"[International TV Shows, TV Dramas, TV Sci-Fi ..."
8,duration,216,"[4 Seasons, 93 min, 78 min, 80 min, 123 min, 1..."
9,release_year,73,"[2020, 2016, 2011, 2009, 2008, 2019, 1997, 201..."


In [228]:
netflix.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7787 entries, 0 to 7786
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   show_id       7787 non-null   object
 1   type          7787 non-null   object
 2   title         7787 non-null   object
 3   director      5398 non-null   object
 4   cast          7069 non-null   object
 5   country       7280 non-null   object
 6   date_added    7777 non-null   object
 7   release_year  7787 non-null   int64 
 8   rating        7780 non-null   object
 9   duration      7787 non-null   object
 10  listed_in     7787 non-null   object
 11  description   7787 non-null   object
dtypes: int64(1), object(11)
memory usage: 730.2+ KB


In [None]:
# ML workflow -> what would you do?

First, separate the records that correspond to TV shows from those that correspond to movies.

That way, the duration can be expressed as a numerical value.

Split listed_in into separate genres.

Do not use the show_id, type, title, and description columns.

Split the cast column into individual actors and check how many different actors there are.

Look at the rating column and what it indicates.
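Two of the ideas above, turning `duration` into a number and splitting `listed_in` into genres, could be sketched as follows. The sample reuses values from the first three rows shown earlier; the column name `duration_value` and the variables `movies`, `shows` and `genres` are hypothetical, and this is only one possible approach.

```python
import pandas as pd

# Small sample copied from the first three rows shown above (three of the twelve columns)
sample = pd.DataFrame({
    "type": ["TV Show", "Movie", "Movie"],
    "duration": ["4 Seasons", "93 min", "78 min"],
    "listed_in": [
        "International TV Shows, TV Dramas, TV Sci-Fi & Fantasy",
        "Dramas, International Movies",
        "Horror Movies, International Movies",
    ],
})

# Numeric part of `duration`: seasons for TV shows, minutes for movies
sample["duration_value"] = sample["duration"].str.extract(r"(\d+)", expand=False).astype(int)

# Split by type so the two units are never mixed in one model
movies = sample[sample["type"] == "Movie"]
shows = sample[sample["type"] == "TV Show"]

# One indicator column per genre found in `listed_in`
genres = sample["listed_in"].str.get_dummies(sep=", ")

print(movies[["duration", "duration_value"]])
print(shows[["duration", "duration_value"]])
print(genres.columns.tolist())
```

The genre indicators are already numerical, so they can be joined back to the numeric duration and fed to a model without further encoding.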





---