# Model Performance Transformations

Lets practice some basic data transformation for ML performance enhancement

In [1]:
# Imports

import numpy as np
import pandas as pd
from sklearn.preprocessing import RobustScaler, StandardScaler
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

In [2]:
# Categorical data analyser

def cat_var(df, cols):
    '''
    Return: a Pandas dataframe object with the following columns:
        - "categorical_variable" => every categorical variable include as an input parameter (string).
        - "number_of_possible_values" => the amount of unique values that can take a given categorical variable (integer).
        - "values" => a list with the posible unique values for every categorical variable (list).

    Input parameters:
        - df -> Pandas dataframe object: a dataframe with categorical variables.
        - cols -> list object: a list with the name (string) of every categorical variable to analyse.
    '''
    cat_list = []
    for col in cols:
        cat = df[col].unique()
        cat_num = len(cat)
        cat_dict = {"categorical_variable":col,
                    "number_of_possible_values":cat_num,
                    "values":cat}
        cat_list.append(cat_dict)
    df = pd.DataFrame(cat_list).sort_values(by="number_of_possible_values", ascending=False)
    return df.reset_index(drop=True)

## Scaling

Some ML algorithms have problems performing well whenever the data scale differ greatly between features. In those cases scaling the data is your best option.

- [RobustScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.RobustScaler.html#sklearn.preprocessing.RobustScaler)

- [StandardScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html#sklearn.preprocessing.StandardScaler)

Try both options and see what happens with performance (i.e.: AUC).

<img src="../images/scaling.png" alt="Drawing" style="width: 500px;"/>

In [3]:
# Weather dataset (https://www.kaggle.com/jsphyg/weather-dataset-rattle-package)

weather = pd.read_csv('../data/weatherAUS.csv')
print(weather.shape)
weather.head()

(145460, 23)


Unnamed: 0,Date,Location,MinTemp,MaxTemp,Rainfall,Evaporation,Sunshine,WindGustDir,WindGustSpeed,WindDir9am,...,Humidity9am,Humidity3pm,Pressure9am,Pressure3pm,Cloud9am,Cloud3pm,Temp9am,Temp3pm,RainToday,RainTomorrow
0,2008-12-01,Albury,13.4,22.9,0.6,,,W,44.0,W,...,71.0,22.0,1007.7,1007.1,8.0,,16.9,21.8,No,No
1,2008-12-02,Albury,7.4,25.1,0.0,,,WNW,44.0,NNW,...,44.0,25.0,1010.6,1007.8,,,17.2,24.3,No,No
2,2008-12-03,Albury,12.9,25.7,0.0,,,WSW,46.0,W,...,38.0,30.0,1007.6,1008.7,,2.0,21.0,23.2,No,No
3,2008-12-04,Albury,9.2,28.0,0.0,,,NE,24.0,SE,...,45.0,16.0,1017.6,1012.8,,,18.1,26.5,No,No
4,2008-12-05,Albury,17.5,32.3,1.0,,,W,41.0,ENE,...,82.0,33.0,1010.8,1006.0,7.0,8.0,17.8,29.7,No,No


In [4]:
# Uluru weather (numerical features)

weather = weather[weather['Location'].isin(['Uluru'])].reset_index(drop=True)
weather = weather[weather['RainToday'].isin(['No','Yes'])].reset_index(drop=True)
weather = weather[weather['RainTomorrow'].isin(['No','Yes'])]
weather = weather[['MinTemp',
                   'MaxTemp',
                   'Rainfall',
                   'WindSpeed9am',
                   'WindSpeed3pm',
                   'Humidity9am',
                   'Humidity3pm',
                   'Pressure9am',
                   'Pressure3pm',
                   'Temp9am',
                   'Temp3pm',
                   'RainTomorrow']]
weather = weather.dropna().reset_index(drop=True)
col_weather = list(weather.columns)
print(col_weather)
print(weather.shape)
print(weather.describe())
weather.head()

['MinTemp', 'MaxTemp', 'Rainfall', 'WindSpeed9am', 'WindSpeed3pm', 'Humidity9am', 'Humidity3pm', 'Pressure9am', 'Pressure3pm', 'Temp9am', 'Temp3pm', 'RainTomorrow']
(1479, 12)
           MinTemp      MaxTemp     Rainfall  WindSpeed9am  WindSpeed3pm  \
count  1479.000000  1479.000000  1479.000000   1479.000000   1479.000000   
mean     14.368627    30.402299     0.716700     17.613928     17.050710   
std       7.432857     7.624058     4.208585      7.887082      6.893016   
min      -1.900000    11.300000     0.000000      0.000000      0.000000   
25%       8.100000    23.800000     0.000000     11.000000     11.000000   
50%      14.900000    31.200000     0.000000     17.000000     17.000000   
75%      20.800000    37.100000     0.000000     24.000000     22.000000   
max      31.000000    44.400000    83.800000     41.000000     48.000000   

       Humidity9am  Humidity3pm  Pressure9am  Pressure3pm      Temp9am  \
count  1479.000000  1479.000000  1479.000000  1479.000000  1479.0

Unnamed: 0,MinTemp,MaxTemp,Rainfall,WindSpeed9am,WindSpeed3pm,Humidity9am,Humidity3pm,Pressure9am,Pressure3pm,Temp9am,Temp3pm,RainTomorrow
0,19.7,30.0,0.8,30.0,24.0,76.0,54.0,1010.6,1007.5,21.7,28.4,No
1,21.6,33.1,0.0,22.0,11.0,44.0,33.0,1010.5,1006.5,24.6,31.3,No
2,21.3,36.1,0.0,24.0,13.0,39.0,27.0,1006.9,1002.7,27.6,34.5,No
3,22.9,37.7,0.0,28.0,13.0,35.0,22.0,1006.0,1002.1,28.7,35.4,No
4,24.0,39.0,0.0,20.0,19.0,33.0,21.0,1006.9,1003.5,29.9,37.3,No


In [5]:
# Features + target

X = weather[['MinTemp',
          'MaxTemp',
          'Rainfall',
          'WindSpeed9am',
          'WindSpeed3pm',
          'Humidity9am',
          'Humidity3pm',
          'Pressure9am',
          'Pressure3pm',
          'Temp9am',
          'Temp3pm']]
y = pd.get_dummies(weather['RainTomorrow'], drop_first=True)['Yes']
print(X.shape,y.shape)

(1479, 11) (1479,)


In [6]:
# Train + test

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(f"X_train: {X_train.shape}, X_test: {X_test.shape}, y_train: {y_train.shape}, y_test: {y_test.shape}")
print(f"X_train: {type(X_train)}, X_test: {type(X_test)}, y_train: {type(y_train)}, y_test: {type(y_test)}")

X_train: (1183, 11), X_test: (296, 11), y_train: (1183,), y_test: (296,)
X_train: <class 'pandas.core.frame.DataFrame'>, X_test: <class 'pandas.core.frame.DataFrame'>, y_train: <class 'pandas.core.series.Series'>, y_test: <class 'pandas.core.series.Series'>


In [7]:
X_train

Unnamed: 0,MinTemp,MaxTemp,Rainfall,WindSpeed9am,WindSpeed3pm,Humidity9am,Humidity3pm,Pressure9am,Pressure3pm,Temp9am,Temp3pm
405,8.0,20.5,0.0,20.0,24.0,65.0,42.0,1026.3,1023.5,12.7,19.1
478,0.9,19.6,0.0,9.0,11.0,60.0,23.0,1024.2,1020.9,9.6,18.5
598,27.3,42.9,0.0,9.0,17.0,23.0,14.0,1006.1,1001.9,34.6,39.7
1178,1.0,19.2,0.0,19.0,19.0,58.0,28.0,1024.2,1020.2,9.9,17.3
528,19.9,38.3,0.0,33.0,22.0,12.0,8.0,1009.7,1008.4,32.2,36.3
...,...,...,...,...,...,...,...,...,...,...,...
1130,8.6,27.8,0.0,9.0,41.0,57.0,25.0,1015.4,1010.1,16.3,27.6
1294,23.4,37.4,0.0,17.0,13.0,36.0,24.0,1008.2,1004.9,30.7,36.2
860,6.2,29.1,0.0,13.0,6.0,38.0,14.0,1021.8,1016.2,16.1,27.3
1459,4.9,20.7,0.0,17.0,24.0,38.0,11.0,1027.1,1023.9,10.3,19.8


## Standar scaling

In [10]:
# Scaling

# Standarization (z-score)

scaler = StandardScaler()

X_train_stand = scaler.fit_transform(X_train)
X_test_stand = scaler.fit_transform(X_test)


In [12]:
# Linear model

linear_model = LogisticRegression(max_iter=1000)
linear_param_stand = linear_model.fit(X_train_stand, y_train)
linear_pred_stand = linear_model.predict(X_test_stand)
linear_auc_stand = roc_auc_score(y_test, linear_pred_stand) # Área bajo la Curva ROC (AUC-ROC) para evaluar el rendimiento
print(f"Linear model AUC is: {linear_auc_stand}") # Cercano a 1 indica un buen rendimiento del modelo, mientras 0.5 es al azar


Linear model AUC is: 0.6787953638609159


In [None]:
''' Modelo Lineal (Logistic Regression):
1. Se utiliza para problemas de clasificación binaria
2. La regresión logísica asume una relación lineal entre las features y la probablidad de que la salida 
pertenezca a una clase particular. 
3. SE ajusta a los datos mediante la minimización de la función de pérdida, utilizado métodos como la 
optimización de gradiente descendente? 
4. Puede proporcionar una interpretación directa de la importancia de cada característica a través de los 
coeficientes asignados'''

In [13]:
# Ensemble model

ensemble_model = RandomForestClassifier()
ensemble_param_stand = ensemble_model.fit(X_train_stand, y_train)
ensemble_pred_stand = ensemble_model.predict(X_test_stand)
ensemble_auc_stand = roc_auc_score(y_test, ensemble_pred_stand)
print(f"Linear model AUC is: {ensemble_auc_stand}")

Linear model AUC is: 0.6960858825764773


In [None]:
''' Modelo Ensemble (Random Forest): 
1. Tipo de modelo de conjutno que combina múltiples modelos de árboles de dcisión para mejorar el rendimiento y la 
generalización. 
2. Cada árbol toma decisiones no lineales basadas en las característias de entrada. 
3. Cada árbol se entrena en una submuestra de los datos y utiliza un subconjunto aleatorio de características, lo que 
reduce el riesgo de sobreajuste y mejora la generalización. 
4. Son conocidos por ser robustos y tener buna capacidad de generalización, ya que reducen la variaza inherente a los
modelos de árboles individuales'''

### ¿Cuál funciona mejor en standar? > Ensemble model = 0.6960

## Robust scaling

In [15]:
# Scaling

# Robust

scaler = RobustScaler()

X_train_rob = scaler.fit_transform(X_train)
X_test_rob = scaler.fit_transform(X_test)

In [17]:
# Linear model

linear_model = LogisticRegression(max_iter=1000)
linear_param_rob = linear_model.fit(X_train_rob, y_train)
linear_pred_rob = linear_model.predict(X_test_rob)
linear_auc_rob = roc_auc_score(y_test, linear_pred_rob)
print(f"Linear model AUC is: {linear_auc_rob}")

Linear model AUC is: 0.6542846285388563


In [18]:
# Ensemble model

ensemble_model = RandomForestClassifier()
ensemble_param_rob = ensemble_model.fit(X_train_rob, y_train)
ensemble_pred_rob = ensemble_model.predict(X_test_rob)
ensemble_auc_rob = roc_auc_score(y_test, ensemble_pred_rob)
print(f"Linear model AUC is: {ensemble_auc_rob}")

Linear model AUC is: 0.6207486224586737


### ¿Cuál funciona mejor en standar? > Linear model = 0.6542

## Conclusion sobre scaling

1. Para standarization, resulta mejor el modelo Random Forest
2. Para robust, resulta mejor el modelo Logistic Regression
3. Entre ambos, resulta mejor escalado standar + modelo Random Forest

---

## Enconding

ML algorithms do not support categorical data. Therefore you need to find a way to transform categorical data into numerical. You must compare the results using both techniques: __One Hot Encoding__ or __Label Encoding__

- [OneHotEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html#sklearn.preprocessing.OneHotEncoder)

- [LabelEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html#sklearn.preprocessing.LabelEncoder)

<img src="../images/encoding.png" alt="Drawing" style="width: 500px;"/>

In [20]:
# Mushrooms dataset (https://www.kaggle.com/uciml/mushroom-classification)

mushrooms = pd.read_csv('../data/mushrooms.csv')
col_mushrooms = list(mushrooms.columns)
print(mushrooms.shape)
mushrooms.head()

(8124, 23)


Unnamed: 0,class,cap-shape,cap-surface,cap-color,bruises,odor,gill-attachment,gill-spacing,gill-size,gill-color,...,stalk-surface-below-ring,stalk-color-above-ring,stalk-color-below-ring,veil-type,veil-color,ring-number,ring-type,spore-print-color,population,habitat
0,p,x,s,n,t,p,f,c,n,k,...,s,w,w,p,w,o,p,k,s,u
1,e,x,s,y,t,a,f,c,b,k,...,s,w,w,p,w,o,p,n,n,g
2,e,b,s,w,t,l,f,c,b,n,...,s,w,w,p,w,o,p,n,n,m
3,p,x,y,w,t,p,f,c,n,n,...,s,w,w,p,w,o,p,k,s,u
4,e,x,s,g,f,n,f,w,b,k,...,s,w,w,p,w,o,e,n,a,g


In [21]:
col_mushrooms

['class',
 'cap-shape',
 'cap-surface',
 'cap-color',
 'bruises',
 'odor',
 'gill-attachment',
 'gill-spacing',
 'gill-size',
 'gill-color',
 'stalk-shape',
 'stalk-root',
 'stalk-surface-above-ring',
 'stalk-surface-below-ring',
 'stalk-color-above-ring',
 'stalk-color-below-ring',
 'veil-type',
 'veil-color',
 'ring-number',
 'ring-type',
 'spore-print-color',
 'population',
 'habitat']

In [22]:
# Features analysis

cat_mushrooms = cat_var(mushrooms, col_mushrooms)
cat_mushrooms

Unnamed: 0,categorical_variable,number_of_possible_values,values
0,gill-color,12,"[k, n, g, p, w, h, u, e, b, r, y, o]"
1,cap-color,10,"[n, y, w, g, e, p, b, u, c, r]"
2,spore-print-color,9,"[k, n, u, h, w, r, o, y, b]"
3,odor,9,"[p, a, l, n, f, c, y, s, m]"
4,stalk-color-below-ring,9,"[w, p, g, b, n, e, y, o, c]"
5,stalk-color-above-ring,9,"[w, g, p, n, b, e, o, c, y]"
6,habitat,7,"[u, g, m, d, p, w, l]"
7,cap-shape,6,"[x, b, s, f, k, c]"
8,population,6,"[s, n, a, v, y, c]"
9,ring-type,5,"[p, e, l, f, n]"


In [None]:
# Features + target (encoding). IMPORTANT: you may pick any of the 2-labeled features as you target (choose wisely!!!)

# Target será la columna class

## One hot encoding

In [23]:
mushrooms_ohe = pd.get_dummies(mushrooms[col_mushrooms], 
                                          columns=['class',
                                                     'cap-shape',
                                                     'cap-surface',
                                                     'cap-color',
                                                     'bruises',
                                                     'odor',
                                                     'gill-attachment',
                                                     'gill-spacing',
                                                     'gill-size',
                                                     'gill-color',
                                                     'stalk-shape',
                                                     'stalk-root',
                                                     'stalk-surface-above-ring',
                                                     'stalk-surface-below-ring',
                                                     'stalk-color-above-ring',
                                                     'stalk-color-below-ring',
                                                     'veil-type',
                                                     'veil-color',
                                                     'ring-number',
                                                     'ring-type',
                                                     'spore-print-color',
                                                     'population',
                                                     'habitat'], 
                                          drop_first=True)
mushrooms_ohe

Unnamed: 0,class_p,cap-shape_c,cap-shape_f,cap-shape_k,cap-shape_s,cap-shape_x,cap-surface_g,cap-surface_s,cap-surface_y,cap-color_c,...,population_n,population_s,population_v,population_y,habitat_g,habitat_l,habitat_m,habitat_p,habitat_u,habitat_w
0,True,False,False,False,False,True,False,True,False,False,...,False,True,False,False,False,False,False,False,True,False
1,False,False,False,False,False,True,False,True,False,False,...,True,False,False,False,True,False,False,False,False,False
2,False,False,False,False,False,False,False,True,False,False,...,True,False,False,False,False,False,True,False,False,False
3,True,False,False,False,False,True,False,False,True,False,...,False,True,False,False,False,False,False,False,True,False
4,False,False,False,False,False,True,False,True,False,False,...,False,False,False,False,True,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8119,False,False,False,True,False,False,False,True,False,False,...,False,False,False,False,False,True,False,False,False,False
8120,False,False,False,False,False,True,False,True,False,False,...,False,False,True,False,False,True,False,False,False,False
8121,False,False,True,False,False,False,False,True,False,False,...,False,False,False,False,False,True,False,False,False,False
8122,True,False,False,True,False,False,False,False,True,False,...,False,False,True,False,False,True,False,False,False,False


In [24]:
# Features + target (class)
m2 = mushrooms_ohe.drop('class_p', axis=1)

X = m2
y = mushrooms_ohe['class_p']
print(X.shape,y.shape)

(8124, 95) (8124,)


In [30]:
y

0        True
1       False
2       False
3        True
4       False
        ...  
8119    False
8120    False
8121    False
8122     True
8123    False
Name: class_p, Length: 8124, dtype: bool

In [25]:
# Train + test


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(f"X_train: {X_train.shape}, X_test: {X_test.shape}, y_train: {y_train.shape}, y_test: {y_test.shape}")
print(f"X_train: {type(X_train)}, X_test: {type(X_test)}, y_train: {type(y_train)}, y_test: {type(y_test)}")


X_train: (6499, 95), X_test: (1625, 95), y_train: (6499,), y_test: (1625,)
X_train: <class 'pandas.core.frame.DataFrame'>, X_test: <class 'pandas.core.frame.DataFrame'>, y_train: <class 'pandas.core.series.Series'>, y_test: <class 'pandas.core.series.Series'>


## Standar scaling

In [26]:
# Scaling

# Standarization (z-score)

scaler = StandardScaler()
X_train_stand = scaler.fit_transform(X_train)
X_test_stand = scaler.fit_transform(X_test)


In [27]:
# Linear model

linear_model = LogisticRegression(max_iter=1000)
linear_param_stand = linear_model.fit(X_train_stand, y_train)
linear_pred_stand = linear_model.predict(X_test_stand)
linear_auc_stand = roc_auc_score(y_test, linear_pred_stand)
print(f"Linear model AUC is: {linear_auc_stand}")

Linear model AUC is: 1.0


In [29]:
# Ensemble model

ensemble_model = RandomForestClassifier()
ensemble_param_stand = ensemble_model.fit(X_train_stand, y_train)
ensemble_pred_stand = ensemble_model.predict(X_test_stand)
ensemble_auc_stand = roc_auc_score(y_test, ensemble_pred_stand)
print(f"Linear model AUC is: {ensemble_auc_stand}")

Linear model AUC is: 1.0


### Todo muy extraño porque es igual a 1.0 en ambos modelos

## Robust Scaler

In [32]:
# Scaling

# Robust

scaler = RobustScaler()
X_train_rob = scaler.fit_transform(X_train)
X_test_rob = scaler.fit_transform(X_test)

In [33]:
# Linear model

linear_model = LogisticRegression(max_iter=1000)
linear_param_rob = linear_model.fit(X_train_rob, y_train)
linear_pred_rob = linear_model.predict(X_test_rob)
linear_auc_rob = roc_auc_score(y_test, linear_pred_rob)
print(f"Linear model AUC is: {linear_auc_rob}")

Linear model AUC is: 1.0


In [35]:
# Ensemble model

ensemble_model = RandomForestClassifier()
ensemble_param_rob = ensemble_model.fit(X_train_rob, y_train)
ensemble_pred_rob = ensemble_model.predict(X_test_rob)
ensemble_auc_rob = roc_auc_score(y_test, ensemble_pred_rob)
print(f"Linear model AUC is: {ensemble_auc_rob}")

Linear model AUC is: 1.0


### Again. Voy a probar con los datos SIN escalar

In [36]:
# Linear model

linear_model = LogisticRegression(max_iter=1000)
linear_param_simple = linear_model.fit(X_train, y_train)
linear_pred_simple = linear_model.predict(X_test)
linear_auc_simple = roc_auc_score(y_test, linear_pred_simple)
print(f"Linear model AUC is: {linear_auc_simple}")

Linear model AUC is: 1.0


In [37]:
# Ensemble model

ensemble_model = RandomForestClassifier()
ensemble_param_simple = ensemble_model.fit(X_train, y_train)
ensemble_pred_simple = ensemble_model.predict(X_test)
ensemble_auc_simple = roc_auc_score(y_test, ensemble_pred_simple)
print(f"Linear model AUC is: {ensemble_auc_simple}")

Linear model AUC is: 1.0


### Again

## Ordinal enconding
- A diferencia del label enconding, asigna un número a cada categoría de manera más informada, manteniendo el orden 
natural de las categorías cuando sea posible

In [38]:
mushrooms.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8124 entries, 0 to 8123
Data columns (total 23 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   class                     8124 non-null   object
 1   cap-shape                 8124 non-null   object
 2   cap-surface               8124 non-null   object
 3   cap-color                 8124 non-null   object
 4   bruises                   8124 non-null   object
 5   odor                      8124 non-null   object
 6   gill-attachment           8124 non-null   object
 7   gill-spacing              8124 non-null   object
 8   gill-size                 8124 non-null   object
 9   gill-color                8124 non-null   object
 10  stalk-shape               8124 non-null   object
 11  stalk-root                8124 non-null   object
 12  stalk-surface-above-ring  8124 non-null   object
 13  stalk-surface-below-ring  8124 non-null   object
 14  stalk-color-above-ring  

In [39]:
X

Unnamed: 0,cap-shape_c,cap-shape_f,cap-shape_k,cap-shape_s,cap-shape_x,cap-surface_g,cap-surface_s,cap-surface_y,cap-color_c,cap-color_e,...,population_n,population_s,population_v,population_y,habitat_g,habitat_l,habitat_m,habitat_p,habitat_u,habitat_w
0,False,False,False,False,True,False,True,False,False,False,...,False,True,False,False,False,False,False,False,True,False
1,False,False,False,False,True,False,True,False,False,False,...,True,False,False,False,True,False,False,False,False,False
2,False,False,False,False,False,False,True,False,False,False,...,True,False,False,False,False,False,True,False,False,False
3,False,False,False,False,True,False,False,True,False,False,...,False,True,False,False,False,False,False,False,True,False
4,False,False,False,False,True,False,True,False,False,False,...,False,False,False,False,True,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8119,False,False,True,False,False,False,True,False,False,False,...,False,False,False,False,False,True,False,False,False,False
8120,False,False,False,False,True,False,True,False,False,False,...,False,False,True,False,False,True,False,False,False,False
8121,False,True,False,False,False,False,True,False,False,False,...,False,False,False,False,False,True,False,False,False,False
8122,False,False,True,False,False,False,False,True,False,False,...,False,False,True,False,False,True,False,False,False,False


In [40]:
y

0        True
1       False
2       False
3        True
4       False
        ...  
8119    False
8120    False
8121    False
8122     True
8123    False
Name: class_p, Length: 8124, dtype: bool

In [41]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [43]:
ordinal_encoder = OrdinalEncoder()

X_train_encoded_simple = ordinal_encoder.fit_transform(X_train)
X_test_encoded_simple = ordinal_encoder.transform(X_test)

In [46]:
# Linear model

linear_model = LogisticRegression(max_iter=1000)
linear_param_simple =linear_model.fit(X_train_encoded_simple, y_train)
linear_pred_simple = linear_model.predict(X_test_encoded_simple)
linear_auc_simple = roc_auc_score(y_test, linear_pred_simple)
print(f"Linear model AUC: {linear_auc_simple}")

Linear model AUC: 1.0


In [47]:
# ensemble model 

ensemble_model = RandomForestClassifier()
ensemble_param_simple = ensemble_model.fit(X_train_encoded_simple, y_train)
ensemble_pred_simple = ensemble_model.predict(X_test_encoded_simple)
ensemble_auc_simple = roc_auc_score(y_test, ensemble_pred_simple)
print(f"Random Forest AUC: {ensemble_auc_simple}")

Random Forest AUC: 1.0


### Todo es 1.0

---

## Bonus

Now that you can grasp the potential of pre-processing your data...what would you do about the following dataset?

<img src="../images/bonus.jpg" alt="Drawing" style="width: 500px;"/>

In [48]:
# Netflix dataset (https://www.kaggle.com/shivamb/netflix-shows)

netflix = pd.read_csv('../data/netflix_titles.csv')
col_netflix = list(netflix.columns)
print(netflix.shape)
netflix.head()

(7787, 12)


Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
0,s1,TV Show,3%,,"João Miguel, Bianca Comparato, Michel Gomes, R...",Brazil,"August 14, 2020",2020,TV-MA,4 Seasons,"International TV Shows, TV Dramas, TV Sci-Fi &...",In a future where the elite inhabit an island ...
1,s2,Movie,7:19,Jorge Michel Grau,"Demián Bichir, Héctor Bonilla, Oscar Serrano, ...",Mexico,"December 23, 2016",2016,TV-MA,93 min,"Dramas, International Movies",After a devastating earthquake hits Mexico Cit...
2,s3,Movie,23:59,Gilbert Chan,"Tedd Chan, Stella Chung, Henley Hii, Lawrence ...",Singapore,"December 20, 2018",2011,R,78 min,"Horror Movies, International Movies","When an army recruit is found dead, his fellow..."
3,s4,Movie,9,Shane Acker,"Elijah Wood, John C. Reilly, Jennifer Connelly...",United States,"November 16, 2017",2009,PG-13,80 min,"Action & Adventure, Independent Movies, Sci-Fi...","In a postapocalyptic world, rag-doll robots hi..."
4,s5,Movie,21,Robert Luketic,"Jim Sturgess, Kevin Spacey, Kate Bosworth, Aar...",United States,"January 1, 2020",2008,PG-13,123 min,Dramas,A brilliant group of students become card-coun...


In [49]:
netflix.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7787 entries, 0 to 7786
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   show_id       7787 non-null   object
 1   type          7787 non-null   object
 2   title         7787 non-null   object
 3   director      5398 non-null   object
 4   cast          7069 non-null   object
 5   country       7280 non-null   object
 6   date_added    7777 non-null   object
 7   release_year  7787 non-null   int64 
 8   rating        7780 non-null   object
 9   duration      7787 non-null   object
 10  listed_in     7787 non-null   object
 11  description   7787 non-null   object
dtypes: int64(1), object(11)
memory usage: 730.2+ KB


In [None]:
# ML workflow -> ¿what would you do?

1. Tomar la decisión sobre qué hacer con los datos nulos ya que, por ejemplo, en director hay gran cantidad de nulos pero 
puede llegar a ser una columna clave en función de lo que se busque. 

2. Eliminar ciertas features que, a priori, no aportan valor como el date_added cuando ya tenemos release_year o description

3. Transformar algunas features a valores numéricos como, por ejemplo, ver si puede hacerse con duration 

4. El resto de features las transformaríamos a valores númericos con one hot encoding o similar. 

5. Se podría fijar como la columna target el rating para calcular el rating predecido de nuevas series teniendo en cuenta directores, países, etc. 

6. Otro uso podría ser un recommender systems. 





---