# Submission

## Setup

### Imports

In [1]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
from joblib import dump, load

In [2]:
from preprocessing import reemplazarNulls,reemplazarCategoricas,reemplazarFechas,regularizar,targetBooleano
from preprocessing import reemplazarCategoricas_OHE, keepFeat_OHE, reemplazarNullsNum
from preprocessing import reemplazarCategoricas_HashTrick, normalizar_HashTrick

[###] Initial Preprocessings Done                           
[###] Aditional Preprocessings Done                                                   


In [3]:
from utilities import score2

### Test Holdout

In [4]:
df_feat = pd.read_csv("datasets/holdout_features.csv", low_memory=False).set_index('id')
df_targ = pd.read_csv("datasets/holdout_target.csv")

## Preprocessing

preprocessing | description | function
:--:|:--:|:--:
convert target to boolean | Converts 'si' and 'no' to True and False | `targetBooleano`
replace nulls in all features | Replaces nulls in the features using a `simple imputer` | `reemplazarNulls`
handle numeric missing values | Replaces missing values with the column mean and adds a boolean missing-indicator feature | `reemplazarNullsNum`
replace categorical features | Converts categorical features to numeric | `reemplazarCategoricas`
replace date features | Converts date features to numeric | `reemplazarFechas`
regularize features | Normalizes the features and drops the least significant ones using lasso | `regularizar`
scale features | After normalization, features can be scaled by the weight lasso assigned to them | `regularizar`
One Hot | Replaces categorical features with one-hot encoding | `reemplazarCategoricas_OHE`
OHE selection | Selects the `N` most significant features | `keepFeat_OHE(N)`
Hash Trick | Replaces categorical features using the hashing trick | `reemplazarCategoricas_HashTrick`
Normalize HT | Normalizes the features produced by the hashing trick | `normalizar_HashTrick`
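
As a rough illustration of the `reemplazarNullsNum` row, mean imputation with a missing-indicator column might look like the sketch below. The `preprocessing` module is not shown in this notebook, so `impute_mean_with_flag` is a hypothetical stand-in rather than the project's implementation; the `missing_` prefix simply mirrors the column names that appear in the notebook output.

```python
import pandas as pd

def impute_mean_with_flag(df):
    """Hypothetical stand-in: fill numeric NaNs with the column mean and
    add a boolean `missing_<col>` indicator for every numeric column."""
    out = df.copy()
    for col in df.select_dtypes(include="number").columns:
        out[f"missing_{col}"] = df[col].isna()      # remember where the gaps were
        out[col] = df[col].fillna(df[col].mean())   # then fill them with the mean
    return out

# Toy frame (made-up values); the non-numeric column is left untouched
toy = pd.DataFrame({
    "horas_de_sol": [10.0, None, 2.0],
    "barrio": ["Liniers", "Caballito", None],
})
filled = impute_mean_with_flag(toy)
print(filled)
```

Keeping the indicator lets a model distinguish a measured value from an imputed one, which matters when missingness itself carries signal.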

identifier | preprocessing steps
:--:|:--:
`Comun` | `targetBooleano` `reemplazarFechas`
`BAS` | `Comun` `reemplazarNulls` `reemplazarCategoricas`
`REG` | `BAS` `regularizar`
`OHE` | `Comun` `reemplazarCategoricas_OHE`
`OHE(N)` | `OHE` `keepFeat_OHE(N)`
`HT` | `Comun` `reemplazarCategoricas_HashTrick`
`HTN` | `HT` `normalizar_HashTrick`
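
The `HT` and `HTN` pipelines rely on the hashing trick. A minimal sketch of the idea follows; the bucket count, the choice of MD5, and the helper name `hash_encode` are assumptions made for illustration, and `reemplazarCategoricas_HashTrick` may work differently.

```python
import hashlib

def hash_encode(value, n_buckets=8):
    """Map a category string to a fixed-width indicator vector.
    Illustrative only: bucket count and hash function are assumptions."""
    digest = hashlib.md5(value.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % n_buckets  # stable bucket for the same string
    vec = [0] * n_buckets
    vec[bucket] = 1
    return vec

for barrio in ["Liniers", "Caballito", "Liniers"]:
    print(barrio, hash_encode(barrio))
```

Unlike one-hot encoding, the output width is fixed in advance and does not grow with the number of categories, at the cost of possible collisions between categories that land in the same bucket.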

In [5]:
targetBooleano(df_targ, inplace=True)
df_targ = df_targ.llovieron_hamburguesas_al_dia_siguiente

ohe_feat = reemplazarCategoricas_OHE(df_feat)
ht_feat = reemplazarCategoricas_HashTrick(df_feat)

reemplazarNulls(df_feat , inplace=True)
reemplazarCategoricas(df_feat , inplace=True)
reemplazarFechas(df_feat , inplace=True)

df_reg = regularizar(df_feat)

reemplazarNullsNum(ohe_feat, inplace=True)
reemplazarFechas(ohe_feat , inplace=True)
ohe_feat2 = keepFeat_OHE(ohe_feat, 10)

reemplazarFechas(ht_feat , inplace=True)
reemplazarNullsNum(ht_feat, inplace=True)
ht_feat2 = normalizar_HashTrick(ht_feat)

## Comparison against the Test Holdout

In [6]:
predictions = pd.DataFrame()

In [7]:
def predict( model, name, preproc, feat ):
    # Score `model` on the holdout features; prob[:,1] is the positive-class probability
    pred = model.predict(feat)
    prob = model.predict_proba(feat)
    return score2( name, preproc, df_targ, pred, prob[:,1] )

### Models

#### Decision Tree

In [8]:
arbol = load('models/Tree/tree.sk')

In [9]:
pdf = predict(arbol,"Arbol","BAS",df_feat)
predictions = predictions.append( pdf )
pdf['AUC-ROC'][0]

0.8545815778158975
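
`DataFrame.append`, which the cells in this section use to accumulate results, was deprecated in pandas 1.4 and removed in pandas 2.0. On a recent pandas the same accumulation can be written with `pd.concat`. The sketch below uses placeholder frames with made-up contents, not the real output of `score2`.

```python
import pandas as pd

# Placeholder per-model frames standing in for what predict() returns
pdf_a = pd.DataFrame({"Modelo": ["A"] * 3, "Clase": ["AVG", "True", "False"]})
pdf_b = pd.DataFrame({"Modelo": ["B"] * 3, "Clase": ["AVG", "True", "False"]})

frames = []
for pdf in (pdf_a, pdf_b):
    frames.append(pdf)  # list.append, not DataFrame.append

# One concat at the end; each frame keeps its own 0, 1, 2 index
predictions = pd.concat(frames)
print(predictions)
```

Collecting the frames in a list and concatenating once also avoids copying the growing frame on every iteration.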

#### KNN

In [10]:
knn = load('models/KNN/knn.sk')

In [11]:
pdf = predict(knn,"KNN","REG",df_reg)
predictions = predictions.append( pdf )
pdf['AUC-ROC'][0]

0.8731459243361229

#### Naive Bayes

In [12]:
nb = load('models/NB/nb.sk')

In [13]:
pdf = predict(nb,"Naive Bayes","REG",df_reg)
predictions = predictions.append( pdf )
pdf['AUC-ROC'][0]

0.8294019163885583

#### SVM (Poly)

In [14]:
svm = load('models/SVM/svm.sk')

In [15]:
pdf = predict(svm,"SVM (Poly)","OHE",ohe_feat)
predictions = predictions.append( pdf )
pdf['AUC-ROC'][0]

0.8743984357683136

#### Neural Network

In [16]:
nn = load('models/NN/nn.sk')

In [17]:
pdf = predict(nn,"Red Neuronal","HTN",ht_feat2)
predictions = predictions.append( pdf )
pdf['AUC-ROC'][0]

0.8766236292275961

#### Random Forest

In [18]:
random_forest = load('models/Ensambles/random_forest.sk')

In [19]:
pdf = predict(random_forest,"Random Forest","BAS",df_feat)
predictions = predictions.append( pdf )
pdf['AUC-ROC'][0]

0.8733757476434599

#### Boosting

In [20]:
boost = load('models/Ensambles/boost.sk')

In [21]:
pdf = predict(boost,"BOOST","OHE",ohe_feat)
predictions = predictions.append( pdf )
pdf['AUC-ROC'][0]

0.9032807293813307

## Results

In [22]:
predictions

index | Modelo | Preprocesamientos | Clase | AUC-ROC | Accuracy | Precision | Recall | F1 score | Support
:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:
0 | Arbol | BAS | AVG | 0.854582 | 0.840587 | 0.829657 | 0.840587 | 0.828618 | 11373
1 | Arbol | BAS | True | | | 0.713536 | 0.48055 | 0.574313 | 2545
2 | Arbol | BAS | False | | | 0.863133 | 0.944382 | 0.901931 | 8828
0 | KNN | REG | AVG | 0.873146 | 0.842786 | 0.833702 | 0.842786 | 0.825936 | 11373
1 | KNN | REG | True | | | 0.760137 | 0.434578 | 0.553 | 2545
2 | KNN | REG | False | | | 0.85491 | 0.960467 | 0.90462 | 8828
0 | Naive Bayes | REG | AVG | 0.829402 | 0.825991 | 0.817808 | 0.825991 | 0.820698 | 11373
1 | Naive Bayes | REG | True | | | 0.63114 | 0.535167 | 0.579205 | 2545
2 | Naive Bayes | REG | False | | | 0.871622 | 0.909832 | 0.890318 | 8828
0 | SVM (Poly) | OHE | AVG | 0.874398 | 0.849028 | 0.84039 | 0.849028 | 0.835431 | 11373


## Conclusion

**Recommended model:** Boosting. Evaluated on the test holdout, it has the best metrics in every column except Precision.

- Which model would we choose if we needed the fewest possible false positives?

> None as trained: the models were trained to optimize AUC-ROC. To minimize the number of **FP**, I would train to optimize **Precision** instead.
>
> Of the trained models, the one with the highest Precision on the test holdout was `Boost`, both for the **True** class (0.78) and for the **False** class (0.89), *and therefore for the weighted average as well*.

- Which model would we choose if we need a list of every day on which it might rain hamburgers the next day, without worrying much about including days on which it did not actually rain hamburgers the next day?

> None as trained: the models were trained to optimize AUC-ROC. To minimize the number of **FN**, I would train to optimize **Recall** instead.
>
> Of the trained models, the highest Recall on the test holdout was `Boost` for the **True** class (0.57), `Random Forest` for the **False** class (0.96), and `Boost` for the weighted average (0.87).
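
Besides retraining for a different metric, the trade-off between false positives and false negatives can also be shifted by moving the decision threshold applied to the `predict_proba` scores. This notebook does not do that; the sketch below only illustrates the effect, using made-up labels and scores.

```python
def precision_recall(y_true, scores, threshold):
    """Precision and recall of the positive class at a given score threshold."""
    pred = [s >= threshold for s in scores]
    tp = sum(p and t for p, t in zip(pred, y_true))
    fp = sum(p and not t for p, t in zip(pred, y_true))
    fn = sum((not p) and t for p, t in zip(pred, y_true))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Toy labels and positive-class scores (made up for illustration)
y_true = [True, True, True, False, False, False, False, True]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2, 0.1, 0.6]

for thr in (0.3, 0.5, 0.75):
    p, r = precision_recall(y_true, scores, thr)
    print(f"threshold={thr}: precision={p:.2f} recall={r:.2f}")
```

On this toy data a higher threshold removes false positives (higher Precision) while a lower one recovers more true positives (higher Recall), which corresponds to the two scenarios in the questions above.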

#### Comparison with the Baseline

In [23]:
df_feat_base = pd.read_csv("datasets/holdout_features.csv", low_memory=False).set_index('id')
reemplazarNullsNum(df_feat_base , inplace=True)

id,barrio,dia,direccion_viento_tarde,direccion_viento_temprano,horas_de_sol,humedad_tarde,humedad_temprano,llovieron_hamburguesas_hoy,mm_evaporados_agua,mm_lluvia_dia,...,missing_nubosidad_temprano,missing_presion_atmosferica_tarde,missing_presion_atmosferica_temprano,missing_rafaga_viento_max_velocidad,missing_temp_max,missing_temp_min,missing_temperatura_tarde,missing_temperatura_temprano,missing_velocidad_viendo_tarde,missing_velocidad_viendo_temprano
54297,Liniers,2017-02-19,Noroeste,Nornoreste,7.629393,23.0,53.0,si,5.470542,5.2,...,True,False,False,False,False,False,False,False,False,False
91989,Caballito,2011-01-17,Estenoreste,Noreste,7.629393,74.0,81.0,no,5.470542,0.0,...,True,False,False,False,False,False,False,False,False,False
58424,Coghlan,2014-12-04,Noreste,Norte,7.629393,51.0,89.0,si,5.470542,10.8,...,True,True,True,False,False,False,False,False,False,False
69479,Villa Soldati,2013-07-29,Nornoreste,Noreste,10.100000,44.0,68.0,no,4.400000,0.0,...,False,False,False,False,False,False,False,False,False,False
96106,Barracas,2012-08-20,Noreste,Estenoreste,7.629393,59.0,72.0,no,2.800000,0.0,...,False,False,False,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
88861,Villa Lugano,2016-08-19,Oestenoroeste,Nornoreste,1.900000,56.0,74.0,no,3.000000,0.2,...,False,False,False,False,False,False,False,False,False,False
70212,Barracas,2012-06-21,Este,suroeste,7.629393,46.0,59.0,no,2.000000,0.0,...,False,False,False,False,False,False,False,False,False,False
4839,Villa Crespo,2015-08-11,Oestenoroeste,Norte,7.800000,21.0,40.0,no,5.400000,0.0,...,False,False,False,False,False,False,False,False,False,False
14019,Almagro,2016-01-16,Estesureste,Sursureste,7.629393,74.0,84.0,no,5.470542,0.6,...,True,False,False,False,False,False,False,False,False,False


In [24]:
def funcion_baseline(row):
    if row["llovieron_hamburguesas_hoy"] == "si":
        if row['horas_de_sol'] < 2:
            return True
        if row['nubosidad_tarde'] > 7:
            return True
        if row["humedad_tarde"] > 70:
            return True

    if row["mm_lluvia_dia"] > 10:
        return True
    if row["humedad_tarde"] > 80:
        return True

    return False

def baseline(df):
    return df.apply(funcion_baseline, axis=1)

In [30]:
pred = baseline(df_feat_base)
prob = pred.replace({True:80,False:20})
pdf = score2( "Baseline", "reemplazar nulls", df_targ, pred, prob )
predictions = predictions.append( pdf )
pdf['AUC-ROC'][0]

0.6972850494452817
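
Because the baseline emits only two distinct scores (80 and 20), its ROC curve has a single interior point, and the trapezoidal area under it reduces to the mean of the two per-class recalls. The check below plugs in the per-class Recall values that `score2` reports for the baseline in this notebook (0.474656 for True, 0.919914 for False).

```python
# Per-class recall of the baseline, copied from the notebook's results table
recall_true = 0.474656   # true positive rate
recall_false = 0.919914  # true negative rate

# For a two-valued score the ROC curve is (0,0) -> (FPR,TPR) -> (1,1),
# so the trapezoidal area equals (TPR + TNR) / 2
auc_two_valued = (recall_true + recall_false) / 2
print(round(auc_two_valued, 6))
```

The result agrees with the AUC-ROC printed above to the precision of the tabulated recalls, so for the baseline this metric is effectively a balanced accuracy rather than a measure of ranking quality.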

In [33]:
predictions.tail(6)

index | Modelo | Preprocesamientos | Clase | AUC-ROC | Accuracy | Precision | Recall | F1 score | Support
:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:
0 | BOOST | OHE | AVG | 0.903281 | 0.867405 | 0.860987 | 0.867405 | 0.859862 | 11373
1 | BOOST | OHE | True | | | 0.775651 | 0.573281 | 0.659286 | 2545
2 | BOOST | OHE | False | | | 0.885588 | 0.952198 | 0.917686 | 8828
0 | Baseline | reemplazar nulls | AVG | 0.697285 | 0.820276 | 0.807656 | 0.820276 | 0.810679 | 11373
1 | Baseline | reemplazar nulls | True | | | 0.630809 | 0.474656 | 0.541704 | 2545
2 | Baseline | reemplazar nulls | False | | | 0.858638 | 0.919914 | 0.88822 | 8828
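
The `AVG` rows are consistent with a support-weighted average of the per-class rows. A quick check using the `BOOST` figures from the table (the helper name `weighted` is introduced here for illustration only):

```python
# Per-class BOOST metrics and supports copied from the table
support = {"True": 2545, "False": 8828}
precision = {"True": 0.775651, "False": 0.885588}
recall = {"True": 0.573281, "False": 0.952198}

def weighted(metric):
    """Support-weighted average of a per-class metric."""
    total = sum(support.values())
    return sum(metric[c] * support[c] for c in support) / total

print(round(weighted(precision), 6))  # compare with the AVG Precision column
print(round(weighted(recall), 6))     # compare with the AVG Recall column
```

The support-weighted Recall is the number of correctly classified rows divided by the total, which is why the `Recall` and `Accuracy` columns coincide in every `AVG` row.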


An impressive improvement: Boosting raises AUC-ROC from 0.697 for the baseline to 0.903.

## Predictions

In [89]:
try:
    # Use the local cache when it exists
    df_feat = pd.read_csv('predictions/pred_feat.csv', low_memory=False)
except FileNotFoundError:
    # No cache yet: download the sheet and store it for later runs
    df_feat = pd.read_csv('https://docs.google.com/spreadsheets/d/1mR_JNN0-ceiB5qV42Ff9hznz0HtWaoPF3B9zNGoNPY8/export?format=csv', low_memory=False)
    df_feat.to_csv('predictions/pred_feat.csv')
df_feat.drop('Unnamed: 0',axis=1,inplace=True)
df_feat

Unnamed: 0,barrio,dia,direccion_viento_tarde,direccion_viento_temprano,horas_de_sol,humedad_tarde,humedad_temprano,id,llovieron_hamburguesas_hoy,mm_evaporados_agua,...,presion_atmosferica_tarde,presion_atmosferica_temprano,rafaga_viento_max_direccion,rafaga_viento_max_velocidad,temp_max,temp_min,temperatura_tarde,temperatura_temprano,velocidad_viendo_tarde,velocidad_viendo_temprano
0,Villa General Mitre,2014-12-16,Oestesuroeste,Sursureste,13.4,38.0,51.0,116706,,,...,1010.9,1014.4,suroeste,41.0,26.8,8.9,24.9,20.6,28.0,13.0
1,Nueva Pompeya,2010-10-21,Nornoreste,Estesureste,,39.0,57.0,58831,no,,...,1020.2,1023.8,Norte,28.0,23.3,5.0,21.5,14.7,11.0,6.0
2,Constitución,2013-04-09,Estesureste,Oestenoroeste,3.6,73.0,90.0,31981,si,2.4,...,1024.3,1026.7,Oestenoroeste,24.0,22.0,15.6,20.7,16.7,6.0,15.0
3,Agronomía,2016-02-05,Sureste,Sureste,,34.0,47.0,2533,no,,...,1015.8,1018.3,Sureste,30.0,29.9,14.2,27.0,20.0,11.0,15.0
4,Balvanera,2012-06-05,suroeste,Noroeste,,77.0,87.0,7270,no,2.0,...,1007.6,1006.0,suroeste,39.0,11.5,5.5,11.2,7.0,20.0,17.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
29087,Parque Chas,2013-04-24,suroeste,Oestenoroeste,,71.0,77.0,73456,no,,...,1018.9,1021.2,Oeste,37.0,19.8,9.8,17.3,12.8,9.0,13.0
29088,Belgrano,2015-10-30,Norte,Noreste,,37.0,64.0,14471,no,,...,1017.9,1021.8,Nornoreste,41.0,29.3,15.6,27.8,20.2,15.0,28.0
29089,Villa Crespo,2011-08-09,Nornoreste,Norte,10.1,31.0,77.0,106482,no,3.2,...,1011.1,1016.3,suroeste,41.0,19.8,5.5,18.6,11.1,20.0,11.0
29090,Caballito,2017-04-25,Nornoreste,Norte,,81.0,90.0,21057,no,,...,1008.2,1014.6,Nornoreste,39.0,25.4,17.8,22.0,19.5,33.0,15.0


In [90]:
cols = ['id', 'barrio', 'dia', 'direccion_viento_tarde','direccion_viento_temprano', 'horas_de_sol', 'humedad_tarde','humedad_temprano', 'llovieron_hamburguesas_hoy', 'mm_evaporados_agua','mm_lluvia_dia', 'nubosidad_tarde', 'nubosidad_temprano','presion_atmosferica_tarde', 'presion_atmosferica_temprano','rafaga_viento_max_direccion', 'rafaga_viento_max_velocidad','temp_max', 'temp_min', 'temperatura_tarde', 'temperatura_temprano','velocidad_viendo_tarde', 'velocidad_viendo_temprano']
ids = df_feat.id
df_feat = df_feat.reindex(cols, axis=1).set_index("id")

In [91]:
ohe_feat = reemplazarCategoricas_OHE(df_feat)
ht_feat = reemplazarCategoricas_HashTrick(df_feat)

reemplazarNulls(df_feat , inplace=True)
reemplazarCategoricas(df_feat , inplace=True)
reemplazarFechas(df_feat , inplace=True)

df_reg = regularizar(df_feat)

reemplazarNullsNum(ohe_feat, inplace=True)
reemplazarFechas(ohe_feat , inplace=True)
ohe_feat2 = keepFeat_OHE(ohe_feat, 10)

reemplazarFechas(ht_feat , inplace=True)
reemplazarNullsNum(ht_feat, inplace=True)
ht_feat2 = normalizar_HashTrick(ht_feat)

In [125]:
def save_pred(name, model, feat):
    pred_targ = model.predict(feat)
    pred_df = pd.DataFrame({'id':ids, 'llovieron_hamburguesas_al_dia_siguiente': pred_targ}).set_index('id').replace( {False:'no', True:'si'} )
    pred_df.to_csv(f'predictions/{name}.csv')
    return pred_df

In [126]:
toPredict = [
    ("arbol",arbol,df_feat),
    ("knn",knn,df_reg),
    ("naive_bayes",nb,df_reg),
    ("svm",svm,ohe_feat),
    ("red_neuronal",nn,ht_feat2),
    ("random_forest",random_forest,df_feat),
    ("boost",boost,ohe_feat),
]

In [130]:
%%time
newPredictions = {}
for name,model,feat in toPredict:
    pred = save_pred(name,model,feat)
    newPredictions[name] = pred.llovieron_hamburguesas_al_dia_siguiente

CPU times: user 1min 44s, sys: 417 ms, total: 1min 45s
Wall time: 1min 44s


In [142]:
newPredictions = pd.DataFrame(newPredictions).replace( {'no':False, 'si':True} )

In [154]:
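# Share of positive days (%) in the holdout target, used as the reference rate
train_targ_mean = df_targ.mean()*100
print(f'mean \t {train_targ_mean}')
# Reference rate minus each model's predicted positive rate, in percentage points
train_targ_mean - newPredictions.mean()*100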

mean 	 22.377560889826782


arbol            7.205005
knn              8.899629
naive_bayes      2.839544
svm              7.572804
red_neuronal     5.823182
random_forest    9.133370
boost            4.774783
dtype: float64
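
Every difference is positive, so each model predicts 'si' less often on the new data than the 22.4% positive rate observed in the holdout target. Each model's predicted positive rate can be recovered by subtraction; the sketch below copies the printed values. The new data may have a different base rate from the holdout, so this is only a sanity check, not an accuracy measure.

```python
holdout_rate = 22.377560889826782  # % of 'si' in the holdout target

# Differences printed by the last cell (reference rate minus predicted rate)
diff = {
    "arbol": 7.205005, "knn": 8.899629, "naive_bayes": 2.839544,
    "svm": 7.572804, "red_neuronal": 5.823182,
    "random_forest": 9.133370, "boost": 4.774783,
}

# Predicted positive rate (%) per model, from most to least conservative
predicted_rate = {name: holdout_rate - d for name, d in diff.items()}
for name, rate in sorted(predicted_rate.items(), key=lambda kv: kv[1]):
    print(f"{name:14s} {rate:6.2f}% predicted 'si'")
```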