## Exercise

We will use the Hitters dataset to train an ensemble model that classifies `Salary` as high or low.

The ensemble consists of:

* A Gaussian Naive Bayes classifier

* A regularized logistic regression model

* A classification tree

The prediction for each instance is the average of the probabilities produced by these three models.
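
To make the averaging rule concrete, here is a minimal sketch with made-up numbers. The three probability arrays are hypothetical values, not outputs of the models trained in this notebook, and the 0.5 cutoff is an assumed default for turning the averaged score into a label.

```python
import numpy as np

# Hypothetical class-1 probabilities from three models for four instances.
p_nb = np.array([0.90, 0.20, 0.70, 0.10])
p_lr = np.array([0.80, 0.30, 0.40, 0.20])
p_tree = np.array([1.00, 0.00, 0.50, 0.00])

# Ensemble score: unweighted mean across the three models, per instance.
p_ensemble = np.mean([p_nb, p_lr, p_tree], axis=0)

# Assumed cutoff of 0.5 to convert the averaged score into a 0/1 label.
labels = (p_ensemble > 0.5).astype(int)
```

The averaged score can be passed directly to an AUC computation; the cutoff is only needed when hard labels are required.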


1. Read the data and, for simplicity, keep only the complete records and the numeric features.

2. Create a categorical variable from `Salary` with values high / low encoded as 1 / 0, using a `Salary` threshold of 600.

3. Create the train and test sets.

4. Standardize the features.

5. Train each base model of the ensemble and evaluate each one's performance with AUC.

6. Write a function that returns the ensemble prediction, and compare its test performance with that of the base models.

In [1]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import binarize
from sklearn.preprocessing import StandardScaler

from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

from sklearn.metrics import roc_auc_score, roc_curve, auc

from sklearn.ensemble import VotingClassifier

In [2]:
df = pd.read_csv('../Data/Hitters.csv')
df.head()

,AtBat,Hits,HmRun,Runs,RBI,Walks,Years,CAtBat,CHits,CHmRun,CRuns,CRBI,CWalks,League,Division,PutOuts,Assists,Errors,Salary,NewLeague
0,293,66,1,30,29,14,1,293,66,1,30,29,14,A,E,446,33,20,,A
1,315,81,7,24,38,39,14,3449,835,69,321,414,375,N,W,632,43,10,475.0,N
2,479,130,18,66,72,76,3,1624,457,63,224,266,263,A,W,880,82,14,480.0,A
3,496,141,20,65,78,37,11,5628,1575,225,828,838,354,N,E,200,11,3,500.0,N
4,321,87,10,39,42,30,2,396,101,12,48,46,33,N,E,805,40,4,91.5,N


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 322 entries, 0 to 321
Data columns (total 20 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   AtBat      322 non-null    int64  
 1   Hits       322 non-null    int64  
 2   HmRun      322 non-null    int64  
 3   Runs       322 non-null    int64  
 4   RBI        322 non-null    int64  
 5   Walks      322 non-null    int64  
 6   Years      322 non-null    int64  
 7   CAtBat     322 non-null    int64  
 8   CHits      322 non-null    int64  
 9   CHmRun     322 non-null    int64  
 10  CRuns      322 non-null    int64  
 11  CRBI       322 non-null    int64  
 12  CWalks     322 non-null    int64  
 13  League     322 non-null    object 
 14  Division   322 non-null    object 
 15  PutOuts    322 non-null    int64  
 16  Assists    322 non-null    int64  
 17  Errors     322 non-null    int64  
 18  Salary     263 non-null    float64
 19  NewLeague  322 non-null    object 
dtypes: float64(1), int64(16), object(3)

In [4]:
df.isna().sum()

AtBat         0
Hits          0
HmRun         0
Runs          0
RBI           0
Walks         0
Years         0
CAtBat        0
CHits         0
CHmRun        0
CRuns         0
CRBI          0
CWalks        0
League        0
Division      0
PutOuts       0
Assists       0
Errors        0
Salary       59
NewLeague     0
dtype: int64

In [5]:
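# Note: this imputes the mean instead of dropping incomplete rows as step 1 describes.
# The imputed value (about 536) is below the 600 threshold, so the 59 imputed players end up in class 0.
df['Salary'] = df['Salary'].fillna(df['Salary'].mean())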
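
Step 1 asks to keep only complete records, while the cell above imputes the mean. A minimal sketch of the row-dropping alternative follows, on a small made-up frame (`toy` and its values are illustrative, not the Hitters data). Applying it to the real dataset would change the row count and every result further down.

```python
import numpy as np
import pandas as pd

# Tiny illustrative frame with one missing Salary and one non-numeric column.
toy = pd.DataFrame({
    "Hits": [66, 81, 130, 141],
    "League": ["A", "N", "A", "N"],
    "Salary": [np.nan, 475.0, 480.0, 500.0],
})

# Keep only rows without missing values, then only the numeric columns.
complete = toy.dropna()
numeric = complete.select_dtypes(include="number")
```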

In [6]:
df.isna().sum()

AtBat        0
Hits         0
HmRun        0
Runs         0
RBI          0
Walks        0
Years        0
CAtBat       0
CHits        0
CHmRun       0
CRuns        0
CRBI         0
CWalks       0
League       0
Division     0
PutOuts      0
Assists      0
Errors       0
Salary       0
NewLeague    0
dtype: int64

In [7]:
df.describe()

,AtBat,Hits,HmRun,Runs,RBI,Walks,Years,CAtBat,CHits,CHmRun,CRuns,CRBI,CWalks,PutOuts,Assists,Errors,Salary
count,322.0,322.0,322.0,322.0,322.0,322.0,322.0,322.0,322.0,322.0,322.0,322.0,322.0,322.0,322.0,322.0,322.0
mean,380.928571,101.024845,10.770186,50.909938,48.02795,38.742236,7.444099,2648.68323,717.571429,69.490683,358.795031,330.118012,260.23913,288.937888,106.913043,8.040373,535.925882
std,153.404981,46.454741,8.709037,26.024095,26.166895,21.639327,4.926087,2324.20587,654.472627,86.266061,334.105886,333.219617,267.058085,280.704614,136.854876,6.368359,407.557548
min,16.0,1.0,0.0,0.0,0.0,0.0,1.0,19.0,4.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,67.5
25%,255.25,64.0,4.0,30.25,28.0,22.0,4.0,816.75,209.0,14.0,100.25,88.75,67.25,109.25,7.0,3.0,226.25
50%,379.5,96.0,8.0,48.0,44.0,35.0,6.0,1928.0,508.0,37.5,247.0,220.5,170.5,212.0,39.5,6.0,535.925882
75%,512.0,137.0,16.0,69.0,64.75,53.0,11.0,3924.25,1059.25,90.0,526.25,426.25,339.25,325.0,166.0,11.0,700.0
max,687.0,238.0,40.0,130.0,121.0,105.0,24.0,14053.0,4256.0,548.0,2165.0,1659.0,1566.0,1378.0,492.0,32.0,2460.0


In [8]:
to_drop = ['League', 'Division','NewLeague']

df = df.drop(to_drop, axis=1)

In [9]:
# 1 = high salary (Salary > 600), 0 = low salary
salario_categorico = (df.Salary > 600).astype(int)

In [10]:
df['Salario_categorico'] = salario_categorico

In [11]:
X = df.drop(['Salario_categorico', 'Salary'], axis=1)

y = df.Salario_categorico

In [12]:
# No random_state is set, so the split and all scores below change from run to run.
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, shuffle=True)
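
As a side note on reproducibility, the sketch below shows how a fixed seed makes a stratified split repeatable. The toy arrays and the seed value 42 are arbitrary assumptions for illustration; they are not what produced the results recorded in this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 20 rows, 16 of class 0 and 4 of class 1 (20% positives).
X_toy = np.arange(40).reshape(20, 2)
y_toy = np.array([0] * 16 + [1] * 4)

# Same seed twice: the two splits are identical.
X_tr_a, X_te_a, y_tr_a, y_te_a = train_test_split(
    X_toy, y_toy, stratify=y_toy, random_state=42
)
X_tr_b, X_te_b, y_tr_b, y_te_b = train_test_split(
    X_toy, y_toy, stratify=y_toy, random_state=42
)

# Stratification keeps the 20% positive rate in both parts
# (default test_size=0.25 gives 15 train rows and 5 test rows).
train_pos_rate = y_tr_a.mean()
test_pos_rate = y_te_a.mean()
```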

In [13]:
esc = StandardScaler()

X_train = esc.fit_transform(X_train)
X_test = esc.transform(X_test)
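
The scaler is fitted on the training set only and then reused on the test set, so the test data is scaled with the training mean and standard deviation. A minimal sketch on toy numbers (the arrays are made up for illustration):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# One toy feature: train values 1, 2, 3 and test values 2, 4.
X_train_toy = np.array([[1.0], [2.0], [3.0]])
X_test_toy = np.array([[2.0], [4.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(X_train_toy)  # learns mean and std from train
test_scaled = scaler.transform(X_test_toy)        # reuses the train statistics

# The train mean (2.0) maps to 0, so the test value 2.0 also maps to 0,
# while the test set as a whole is not centered at 0.
train_mean = scaler.mean_[0]
```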

In [14]:
nb = GaussianNB()
nb.fit(X_train, y_train)

GaussianNB()

In [15]:
lr = LogisticRegression(max_iter=10000)

lr.fit(X_train, y_train)

LogisticRegression(max_iter=10000)

In [16]:
arbol = DecisionTreeClassifier(max_depth=4)
arbol.fit(X_train, y_train)



DecisionTreeClassifier(max_depth=4)

In [17]:
arbol_pred=arbol.predict_proba(X_test)

In [18]:
nb_pred = nb.predict_proba(X_test)
lr_pred = lr.predict_proba(X_test)

In [19]:
y_pred_nb = nb_pred[:, 1]  # probability of class 1 (high salary)

fpr, tpr, th = roc_curve(y_test, y_pred_nb)
auc_nb = auc(fpr, tpr)
auc_nb

0.7946026986506747

In [20]:
y_pred_lr = lr_pred[:, 1]  # probability of class 1 (high salary)

fpr, tpr, th = roc_curve(y_test, y_pred_lr)
auc_lr = auc(fpr, tpr)
auc_lr

0.8170914542728636

In [21]:
y_pred_arbol = arbol_pred[:, 1]  # probability of class 1 (high salary)

fpr, tpr, th = roc_curve(y_test, y_pred_arbol)
auc_arbol = auc(fpr, tpr)
auc_arbol


0.7949775112443778

In [22]:
# Ensemble prediction: unweighted mean of the three models' probabilities
prediccion_promedio = (arbol_pred + nb_pred + lr_pred) / 3
y_pred_promedio = prediccion_promedio[:, 1]

fpr, tpr, th = roc_curve(y_test, y_pred_promedio)
auc_prom = auc(fpr, tpr)
auc_prom

0.828335832083958

In [23]:
roc_auc_score(y_test, y_pred_promedio)

0.828335832083958
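
Step 6 asks for a function that returns the ensemble prediction, whereas the cells above compute the average inline. One possible shape for such a function is sketched below. The name `ensemble_predict_proba`, the synthetic data from `make_classification`, and the seeds are assumptions for illustration, so the resulting AUC is unrelated to the Hitters scores above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier


def ensemble_predict_proba(models, X):
    """Return the mean class-1 probability across already fitted models."""
    probs = [m.predict_proba(X)[:, 1] for m in models]
    return np.mean(probs, axis=0)


# Synthetic binary classification data (illustrative only).
X_syn, y_syn = make_classification(n_samples=300, n_features=10, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X_syn, y_syn, stratify=y_syn, random_state=0)

models = [
    GaussianNB(),
    LogisticRegression(max_iter=1000),
    DecisionTreeClassifier(max_depth=4, random_state=0),
]
for m in models:
    m.fit(Xtr, ytr)

ensemble_scores = ensemble_predict_proba(models, Xte)
ensemble_auc = roc_auc_score(yte, ensemble_scores)
```

In this notebook the equivalent call would be `ensemble_predict_proba([nb, lr, arbol], X_test)`, which should reproduce `y_pred_promedio`.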

In [25]:
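# voting='soft' averages predicted probabilities. VotingClassifier fits clones of
# these estimators, so the tree inside vc may differ from `arbol` fitted above.
vc = VotingClassifier([('nb', nb), ('lr', lr), ('arbol', arbol)], voting='soft')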

In [26]:
vc.fit(X_train, y_train)

VotingClassifier(estimators=[('nb', GaussianNB()),
                             ('lr', LogisticRegression(max_iter=10000)),
                             ('arbol', DecisionTreeClassifier(max_depth=4))],
                 voting='soft')

In [27]:
# predict() returns hard 0/1 labels, not probabilities
y_pred_promedio_vc = vc.predict(X_test)

In [28]:
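# AUC computed on hard labels, so it is not directly comparable with the
# probability-based AUCs above; vc.predict_proba(X_test)[:, 1] is the like-for-like input.
roc_auc_score(y_test, y_pred_promedio_vc)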

0.7181409295352323
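
The score above is computed from hard labels, which collapses the ROC curve to a single threshold, so it is not comparable with the manual average. The sketch below contrasts both inputs on synthetic data; the dataset, seeds, and `vc_toy` are illustrative assumptions, and the numbers it yields say nothing about the Hitters results.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (illustrative only).
X_syn, y_syn = make_classification(n_samples=300, n_features=10, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X_syn, y_syn, stratify=y_syn, random_state=0)

vc_toy = VotingClassifier(
    [
        ("nb", GaussianNB()),
        ("lr", LogisticRegression(max_iter=1000)),
        ("arbol", DecisionTreeClassifier(max_depth=4, random_state=0)),
    ],
    voting="soft",
)
vc_toy.fit(Xtr, ytr)

hard_labels = vc_toy.predict(Xte)              # 0/1 labels
soft_scores = vc_toy.predict_proba(Xte)[:, 1]  # averaged class-1 probability

auc_from_labels = roc_auc_score(yte, hard_labels)  # single-threshold ROC
auc_from_probs = roc_auc_score(yte, soft_scores)   # full ranking information
```

Passing `soft_scores` is the input that matches how the base models were evaluated earlier.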

In [35]:
vc.get_params(deep=False)

{'estimators': [('nb', GaussianNB()),
  ('lr', LogisticRegression(max_iter=10000)),
  ('arbol', DecisionTreeClassifier(max_depth=4))],
 'flatten_transform': True,
 'n_jobs': None,
 'verbose': False,
 'voting': 'soft',
 'weights': None}

In [39]:
# Compute the ensemble probabilities once and reuse them for both columns
proba_vc = vc.predict_proba(X_test)
pd.DataFrame({'Probabilidades Clase 0': proba_vc[:, 0], 'Probabilidades Clase 1': proba_vc[:, 1]})

,Probabilidades Clase 0,Probabilidades Clase 1
0,0.415465,0.584535
1,0.287883,0.712117
2,0.265561,0.734439
3,0.960454,0.039546
4,0.967949,0.032051
...,...,...
76,0.096973,0.903027
77,0.977520,0.022480
78,0.607596,0.392404
79,0.215631,0.784369
