# [75.06 / 95.58] Data Organization <br> Practical Assignment 2: Machine Learning
# Main Notebook

**Group 30: Datatouille**

- 101055 - Bojman, Camila
- 100029 - del Mazo, Federico
- 100687 - Hortas, Cecilia
- 97649 - Souto, Rodrigo

**http://fdelmazo.github.io/7506-Datos/**

**https://www.kaggle.com/datatouille2018/competitions**

Continuing the investigation of the company Trocafone carried out in [TP1](https://fdelmazo.github.io/7506-Datos/TP1/TP1.html), the goal is to estimate the probability that a user of the site makes a conversion in the given period.

Notebooks, in the order they should be run and read:

0. [TP1](https://fdelmazo.github.io/7506-Datos/TP1/TP1.html) --> Familiarization with the dataset and exploration of it.

1. [Preliminary Research](https://fdelmazo.github.io/7506-Datos/TP2/investigacion.html) --> Building on the work done in TP1, the data is examined further, looking for anything that can be reused.

2. [Dataframe Creation](https://fdelmazo.github.io/7506-Datos/TP2/new_dataframes.html) --> As part of the feature engineering, new dataframes are created with information about the site's products and about how the site is accessed (brands, operating systems, etc.).

3. [Feature Engineering](https://fdelmazo.github.io/7506-Datos/TP2/feature_engineering.html) --> Search for attributes of the users whose conversion is to be predicted.

4. [Submission Framework](https://fdelmazo.github.io/7506-Datos/TP2/submission_framework.html) --> Small framework for building the label submissions.

5. [Parameter Tuning](https://fdelmazo.github.io/7506-Datos/TP2/parameter_tuning.html) --> Search for the best hyperparameters for each ML algorithm.

6. [Feature Selection](https://fdelmazo.github.io/7506-Datos/TP2/feature_selection.html) --> Search for the most favorable combination of features.

7. TP2 (this notebook) --> With all of the above in place, and using the dataframes that hold every attribute found, the classification algorithms are defined and applied, the models are trained and used to predict conversions, and finally the label submissions are built.

In [1]:
# Initial set-up, left commented out to avoid installing modules on the user's machine
## First, download the datasets if you don't have them

# Before starting, set the credentials (user and token)

# 1. Visit: https://www.kaggle.com/datatouille2018/account (with whichever account you use)
# 2. Click on Create New API Token
# 3. Save the downloaded file to ~/.kaggle/kaggle.json

#!pip install kaggle # https://github.com/Kaggle/kaggle-api
# !kaggle competitions download -c trocafone -p data
# !unzip -q data/events_up_to_01062018.csv.zip -d data
# !rm data/events_up_to_01062018.csv.zip
# !ls data/

## Then, install the modules used throughout the work

#!pip install nbimporter
#!conda install -y -c conda-forge xgboost 
#!conda install -y -c conda-forge lightgbm 
#!conda install -y -c conda-forge catboost

In [2]:
import nbimporter # pip install nbimporter
import pandas as pd
import numpy as np
import calendar
import requests
from bs4 import BeautifulSoup
from time import sleep
from parameter_tuning import get_hiper_params
from feature_selection import get_feature_selection
import submission_framework as SF

seed = 42
hiper_params = get_hiper_params()
feature_selection = get_feature_selection()

Importing Jupyter notebook from parameter_tuning.ipynb
Importing Jupyter notebook from submission_framework.ipynb
Importing Jupyter notebook from feature_selection.ipynb


In [3]:
df_users = pd.read_csv('data/user-features.csv',low_memory=False).set_index('person')
df_y = pd.read_csv('data/labels_training_set.csv').groupby('person').sum()

display(df_users.head(), df_y.head())

| person | total_viewed_products | total_checkouts | total_conversions | total_events | total_sessions | total_session_checkouts | total_session_conversions | total_events_ad_session | total_ad_sessions | avg_events_per_session | ... | dom_last_viewed_product | woy_last_viewed_product | last_conversion_sku | last_conversion_price | percentage_last_week_activity | percentage_last_month_activity | days_between_last_event_and_checkout | percentage_regular_celphones_activity | var_viewed | conversion_gt_media |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0008ed71 | 0 | 3 | 0 | 6 | 3 | 3 | 0 | 0 | 0 | 2 | ... | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 00091926 | 372 | 2 | 0 | 448 | 34 | 2 | 0 | 39 | 5 | 13 | ... | 1 | 1 | 0 | 0 | 0 | 1 | 5 | 0 | 12 | 0 |
| 00091a7a | 3 | 0 | 0 | 10 | 1 | 0 | 0 | 10 | 1 | 10 | ... | 1 | 1 | 0 | 0 | 0 | 0 | 180 | 0 | 11 | 0 |
| 000ba417 | 153 | 6 | 1 | 206 | 5 | 4 | 1 | 0 | 0 | 41 | ... | 1 | 1 | 7631 | 2469 | 0 | 1 | 0 | 0 | 12 | 1 |
| 000c79fe | 3 | 1 | 0 | 17 | 1 | 1 | 0 | 0 | 0 | 17 | ... | 1 | 1 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 0 |


| person | label |
|---|---|
| 0008ed71 | 0 |
| 000c79fe | 0 |
| 001802e4 | 0 |
| 0019e639 | 0 |
| 001b0bf9 | 0 |


## Machine Learning Algorithms

In [4]:
posibilidades_algoritmos = []

---

### Decision Tree


In [5]:
from sklearn.tree import DecisionTreeClassifier, export_graphviz

model_name = 'decision_tree'
params = hiper_params[model_name]
model = DecisionTreeClassifier(**params,random_state=seed)
model_with_name = (model_name,model)

SF.full_framework_wrapper(df_users,df_y,model_with_name)

posibilidades_algoritmos.append(model_with_name)

Model: decision_tree - AUC: 0.7442 - AUCPR:0.1345 - Accuracy: 0.9496 
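
Every model cell in this notebook follows the same pattern: look up a dictionary of keyword arguments by model name and unpack it into the constructor. The sketch below illustrates that pattern only. `ToyClassifier` and the values inside `hiper_params` are made up for the example; the real dictionary comes from `get_hiper_params()` and its contents are not shown here.

```python
class ToyClassifier:
    """Hypothetical stand-in for an estimator; not a real library class."""

    def __init__(self, max_depth=None, min_samples_leaf=1, random_state=None):
        self.max_depth = max_depth
        self.min_samples_leaf = min_samples_leaf
        self.random_state = random_state


seed = 42

# Made-up values, keyed by model name as in the notebook.
hiper_params = {"decision_tree": {"max_depth": 5, "min_samples_leaf": 20}}

params = hiper_params["decision_tree"]
model = ToyClassifier(**params, random_state=seed)  # same call shape as the cell above
print(model.max_depth, model.min_samples_leaf, model.random_state)

# If the tuned dictionary already carried random_state, passing it a second
# time would raise TypeError (duplicate keyword argument).
try:
    ToyClassifier(**{**params, "random_state": 0}, random_state=seed)
    duplicate_error = None
except TypeError as exc:
    duplicate_error = exc
print(type(duplicate_error).__name__)
```

The second half is a reminder that `**params, random_state=seed` only works as long as the tuned dictionary does not define `random_state` itself.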


---

### Random Forest

In [6]:
from sklearn.ensemble import RandomForestClassifier

model_name = 'random_forest'
params = hiper_params[model_name]
model = RandomForestClassifier(**params,random_state=seed)
model_with_name = (model_name,model)

SF.full_framework_wrapper(df_users,df_y,model_with_name)

posibilidades_algoritmos.append(model_with_name)

Model: random_forest - AUC: 0.7755 - AUCPR:0.1541 - Accuracy: 0.9496 


---

### XGBoost


In [7]:
import xgboost as xgb #conda install -c conda-forge xgboost 

model_name = 'xgboost'
params = hiper_params[model_name]
model = xgb.XGBClassifier(**params,random_state=seed)
model_with_name = (model_name,model)

SF.full_framework_wrapper(df_users,df_y,model_with_name)

posibilidades_algoritmos.append(model_with_name)

Model: xgboost - AUC: 0.8556 - AUCPR:0.2193 - Accuracy: 0.9496 


---

### KNN

In [8]:
from sklearn.neighbors import KNeighborsClassifier

model_name = 'knn'
params = hiper_params[model_name]
K = params['n_neighbors']
model_name = f'KNN{K}'

model = KNeighborsClassifier(**params)
model_with_name = (model_name,model)

SF.full_framework_wrapper(df_users,df_y,model_with_name)

posibilidades_algoritmos.append(model_with_name)

Model: KNN21 - AUC: 0.7792 - AUCPR:0.1612 - Accuracy: 0.9497 


---

### Naive-Bayes

In [9]:
from sklearn.naive_bayes import GaussianNB

model_name = 'naive_bayes'
params = hiper_params[model_name]
model = GaussianNB(**params)
model_with_name = (model_name,model)

SF.full_framework_wrapper(df_users,df_y,model_with_name)

posibilidades_algoritmos.append(model_with_name)

Model: naive_bayes - AUC: 0.7809 - AUCPR:0.1534 - Accuracy: 0.9274 


---

### LightGBM

In [10]:
import lightgbm as lgb  #conda install -c conda-forge lightgbm 

model_name = 'lightgbm'
params = hiper_params[model_name]
model = lgb.LGBMClassifier(**params)
model_with_name = (model_name,model)

SF.full_framework_wrapper(df_users,df_y,model_with_name)

posibilidades_algoritmos.append(model_with_name)

Model: lightgbm - AUC: 0.8644 - AUCPR:0.2216 - Accuracy: 0.9491 


---

### Neural Network

In [11]:
from sklearn.neural_network import MLPClassifier

model_name = 'neuralnetwork'
params = {'activation':'relu', 'alpha':1e-05, 'beta_1':0.9, 
          'beta_2':0.999, 'early_stopping':False, 'epsilon':1e-08, 
          'hidden_layer_sizes':(4, 7), 'learning_rate':'constant', 
          'learning_rate_init':0.001, 'max_iter':200, 'momentum':0.9, 
          'nesterovs_momentum':True, 'power_t':0.5, 'random_state':seed, 
          'shuffle':True, 'solver':'adam', 'tol':0.0001, 'validation_fraction':0.1, 'verbose':False, 
          'warm_start':False}

model = MLPClassifier(**params)
model_with_name = (model_name,model)

SF.full_framework_wrapper(df_users,df_y,model_with_name)
posibilidades_algoritmos.append(model_with_name)

Model: neuralnetwork - AUC: 0.5000 - AUCPR:0.0504 - Accuracy: 0.9496 
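
An AUC of exactly 0.5000 next to an accuracy of 0.9496 (the same figures appear for `gradient_boosting` further down) is consistent with a model that gives every user the same score and therefore always predicts "no conversion". The reported AUCPR of 0.0504 also equals the positive rate implied by that accuracy (1 - 0.9496). The following sketch reproduces those numbers with a hypothetical label vector; the 63-out-of-1250 split is an assumption chosen only to match the 5.04 % rate, and the pairwise AUC is a plain-Python illustration, not the metric code used by the submission framework.

```python
def roc_auc(y_true, scores):
    """AUC as the probability that a random positive outranks a random negative.

    Ties count as half a win (the Mann-Whitney U formulation).
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(int(a == b) for a, b in zip(y_true, y_pred)) / len(y_true)


# Hypothetical label vector: 63 converters out of 1250 users (5.04 %).
y_true = [1] * 63 + [0] * 1187

# A degenerate model that gives every user the same score and label.
constant_scores = [0.0] * len(y_true)
constant_labels = [0] * len(y_true)

auc = roc_auc(y_true, constant_scores)
acc = accuracy(y_true, constant_labels)
prevalence = sum(y_true) / len(y_true)

print(f"AUC: {auc:.4f} - Accuracy: {acc:.4f} - Positive rate: {prevalence:.4f}")
# AUC: 0.5000 - Accuracy: 0.9496 - Positive rate: 0.0504
```

With classes this imbalanced, accuracy barely separates a useful model from a constant one, which is why the results are ranked by AUC.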


---

### CatBoost

In [12]:
import catboost as cb #conda install -c conda-forge catboost

model_name = 'catboost'
params = hiper_params[model_name]

model = cb.CatBoostClassifier(**params,verbose=False)
model_with_name = (model_name,model)

SF.full_framework_wrapper(df_users,df_y,model_with_name)
posibilidades_algoritmos.append(model_with_name)

Model: catboost - AUC: 0.8362 - AUCPR:0.2023 - Accuracy: 0.9496 


---

### Gradient Boosting

In [13]:
from sklearn.ensemble import GradientBoostingClassifier as GBC  

model_name = 'gradient_boosting'
params = hiper_params[model_name]

model = GBC(**params)
model_with_name = (model_name,model)

SF.full_framework_wrapper(df_users,df_y,model_with_name)
posibilidades_algoritmos.append(model_with_name)

Model: gradient_boosting - AUC: 0.5000 - AUCPR:0.0504 - Accuracy: 0.9496 


---

## Finding the Best Submission

We run every algorithm defined above, including ensembles, over those combinations, looking for the best hyperparameter combination for each one.

Finally, every algorithm in its best configuration is run against every feature set defined below, in search of the best overall pairing of model and features.

In [14]:
columnas_a_mano = ['total_checkouts_month_5',
                    'timestamp_last_checkout',
                    'timestamp_last_event',
                    'has_checkout_month_5',
                    'total_checkouts',
                    'days_to_last_event',
                    'total_checkouts_last_week',
                    'total_checkouts_months_1_to_4',
                    'total_conversions',
                    'total_session_conversions',
                    'total_events',
                    'total_sessions',
                    'avg_events_per_session',
                    'total_session_checkouts',
                    'has_checkout'
                    ]

columnas_a_mano_2 = ['dow_last_conversion', 
                     'has_conversion_last_week', 'total_conversions_month_4', 
                     'total_session_checkouts', 'doy_last_conversion', 'timestamp_last_event', 
                     'dow_last_checkout', 'total_checkouts', 'has_checkout', 'doy_last_checkout', 
                     'has_checkout_month_1', 'timestamp_last_checkout', 'total_sessions', 
                     'woy_last_event', 'has_checkout_month_5', 'avg_events_per_session']

In [15]:
posibilidades_features = {
    'Full Dataframe':None,
    'Best Cumulative Importance':feature_selection['best_features_progresivo'],
    'Best Forward Selection':feature_selection['best_features_forward'],
    'Best Backward Elimination':feature_selection['best_features_backward'],
    'Leap Cumulative Importance':feature_selection['features_con_saltos_progresivo'],
    'Leap Forward Selection':feature_selection['features_con_saltos_forward'],
    'Selección a Mano': columnas_a_mano,
    'Selección a Mano 2': columnas_a_mano_2
}

In [16]:
from itertools import combinations

def ensamblar_algoritmos(n):
    # Every n-model combination, named by joining the member names with '+'
    result = list(combinations(posibilidades_algoritmos, n))
    result_names = ['+'.join(nombre for nombre, _ in x) for x in result]
    return list(zip(result_names,result))

In [17]:
posibilidades_algoritmos_y_ensambles = posibilidades_algoritmos + ensamblar_algoritmos(2)
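
To get a sense of the size of the search, the sketch below enumerates the pair names the same way `combinations` does and counts the runs. The nine names are the ones appended to `posibilidades_algoritmos` above and the eight feature sets are the entries of `posibilidades_features`; the models themselves are left out, so this only illustrates the counting.

```python
from itertools import combinations
from math import comb

# Names of the nine models appended to posibilidades_algoritmos above.
nombres = ["decision_tree", "random_forest", "xgboost", "KNN21", "naive_bayes",
           "lightgbm", "neuralnetwork", "catboost", "gradient_boosting"]

# Pairs come out in list order, so the first ones start with decision_tree.
pares = ["+".join(par) for par in combinations(nombres, 2)]
assert len(pares) == comb(len(nombres), 2)

n_feature_sets = 8  # entries in posibilidades_features
n_candidatos = len(nombres) + len(pares)

print(len(pares), n_candidatos, n_candidatos * n_feature_sets)
# 36 45 360
print(pares[:3])
# ['decision_tree+random_forest', 'decision_tree+xgboost', 'decision_tree+KNN21']
```

That is 45 candidates per feature set and 360 training runs in total, which explains why the search cell takes a long time to finish.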

In [18]:
resultados = [
    # (auc, forma, (nombre, algoritmo), features)
]

In [None]:
for forma, features in posibilidades_features.items():
    print(f'{forma}:')
    for nombre,algoritmo in posibilidades_algoritmos_y_ensambles:
        print('\t * ',end='')
        model_with_name = (f'{nombre}',algoritmo)
        model, auc = SF.full_framework_wrapper(df_users, df_y, model_with_name, columns=features)
        resultados.append((auc, forma, (nombre, algoritmo), features))

Full Dataframe:
	 * Model: decision_tree - AUC: 0.7442 - AUCPR:0.1345 - Accuracy: 0.9496 
	 * Model: random_forest - AUC: 0.7755 - AUCPR:0.1541 - Accuracy: 0.9496 
	 * Model: xgboost - AUC: 0.8556 - AUCPR:0.2193 - Accuracy: 0.9496 
	 * Model: KNN21 - AUC: 0.7792 - AUCPR:0.1612 - Accuracy: 0.9497 
	 * Model: naive_bayes - AUC: 0.7809 - AUCPR:0.1534 - Accuracy: 0.9274 
	 * Model: lightgbm - AUC: 0.8644 - AUCPR:0.2216 - Accuracy: 0.9491 
	 * Model: neuralnetwork - AUC: 0.5000 - AUCPR:0.0504 - Accuracy: 0.9496 
	 * Model: catboost - AUC: 0.8362 - AUCPR:0.2023 - Accuracy: 0.9496 
	 * Model: gradient_boosting - AUC: 0.5000 - AUCPR:0.0504 - Accuracy: 0.9496 
	 * Model: decision_tree+random_forest - AUC: 0.8169 - Accuracy: 0.9496
	 * Model: decision_tree+xgboost - AUC: 0.8792 - Accuracy: 0.9496
	 * Model: decision_tree+KNN21 - AUC: 0.8437 - Accuracy: 0.9496
	 * Model: decision_tree+naive_bayes - AUC: 0.8228 - Accuracy: 0.9368
	 * Model: decision_tree+lightgbm - AUC: 0.8897 - Accuracy: 0.9496
	

In [None]:
# Sort on the AUC only, so ties never fall through to comparing model objects
resultados.sort(key=lambda r: r[0], reverse=True)
display([(x[0],x[1],x[2][0]) for x in resultados])

In [None]:
max_auc, campeon_forma, (campeon_nombre, campeon_algoritmo), campeon_features = resultados[0]
display(f"Mejor Apuesta: {campeon_nombre} ({max_auc:.4f} AUC) - Features: {campeon_forma}")
display(f"Features: {campeon_features}")
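
Picking the champion amounts to ranking the result tuples by their first field. The sketch below uses three rows copied from the partial output above and a hypothetical `FakeModel` class in place of the real estimators; the actual champion depends on the complete run, which is not shown. It also illustrates why ranking on the AUC alone is the safer choice: a plain tuple sort would try to compare the model objects if every earlier field tied.

```python
class FakeModel:
    """Hypothetical stand-in for an estimator; defines no ordering."""


# (auc, forma, (nombre, algoritmo), features), same layout as `resultados`.
resultados = [
    (0.8644, "Full Dataframe", ("lightgbm", FakeModel()), None),
    (0.8897, "Full Dataframe", ("decision_tree+lightgbm", FakeModel()), None),
    (0.8556, "Full Dataframe", ("xgboost", FakeModel()), None),
]

# Ranking on the AUC alone never compares the model objects.
ranking = sorted(resultados, key=lambda r: r[0], reverse=True)
max_auc, campeon_forma, (campeon_nombre, _), _ = ranking[0]
print(f"{campeon_nombre} ({max_auc:.4f} AUC) - Features: {campeon_forma}")
# decision_tree+lightgbm (0.8897 AUC) - Features: Full Dataframe

# A full tie on auc, forma and nombre forces Python to compare the models.
empate = [(0.5, "Full Dataframe", ("a", FakeModel()), None),
          (0.5, "Full Dataframe", ("a", FakeModel()), None)]
try:
    sorted(empate, reverse=True)
    tie_error = None
except TypeError as exc:
    tie_error = exc
print(type(tie_error).__name__)
# TypeError
```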

## Final Run

The final submission is produced by training on the full X (rather than only X_train).

In [None]:
n_ensamble = 300

campeon_model, campeon_auc, csv_name, campeon_message = SF.full_framework_wrapper(df_users, 
                                                                                    df_y, 
                                                                                    (campeon_nombre,campeon_algoritmo),
                                                                                    columns=campeon_features,
                                                                                    n_ensamble=n_ensamble,
                                                                                    submit=True,
                                                                                    verbosity=1,
                                                                                    all_in=True)   

In [None]:
## Uncomment and submit!
## Careful: run this only once!!!

#!kaggle competitions submit -f {csv_name} -m "{campeon_message}" trocafone

In [None]:
# # Burn 10 submissions end to end

# n_ensamble = 300

# for resultado in resultados[:10]:
#     max_auc, campeon_forma, (campeon_nombre, campeon_algoritmo), campeon_features = resultado
#     campeon_model, campeon_auc, csv_name, campeon_message = SF.full_framework_wrapper(df_users, 
#                                                                                     df_y, 
#                                                                                     (campeon_nombre,campeon_algoritmo),
#                                                                                     columns=campeon_features,
#                                                                                     n_ensamble=n_ensamble,
#                                                                                     submit=True,
#                                                                                     verbosity=1,
#                                                                                     all_in=True)   
#     !kaggle competitions submit -f {csv_name} -m "{campeon_message}" trocafone
#     sleep(10)

In [None]:
sleep(15) # Give Kaggle 15 seconds to evaluate the submission before fetching the leaderboard

!kaggle competitions leaderboard -d trocafone
!unzip -o trocafone.zip

print("Last Best Score... ")
!cat trocafone-publicleaderboard.csv | grep Datatouille | tail -n 1 | awk '{split($0,a,","); print a[3],a[4]}'

!rm trocafone.zip
!rm trocafone-publicleaderboard.csv
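
The `grep | tail | awk` pipeline above keeps the last leaderboard row that mentions the team and prints its third and fourth comma-separated fields. A rough Python equivalent is sketched below. The sample rows and the column layout (`TeamId,TeamName,SubmissionDate,Score`) are assumptions made for the example, not the contents of the real file, and `last_entry` is a hypothetical helper.

```python
import csv
import io

# Made-up leaderboard rows; the real file's column layout is an assumption.
sample = io.StringIO(
    "TeamId,TeamName,SubmissionDate,Score\n"
    "1,Datatouille,2018-11-20 10:00:00,0.85000\n"
    "2,OtherTeam,2018-11-21 09:30:00,0.86000\n"
    "1,Datatouille,2018-11-22 18:45:00,0.87000\n"
)


def last_entry(lines, team):
    """Return fields 3 and 4 of the last row that mentions `team`, or None."""
    rows = [row for row in csv.reader(lines) if any(team in cell for cell in row)]
    if not rows:
        return None
    return rows[-1][2], rows[-1][3]


resultado = last_entry(sample, "Datatouille")
print(resultado)
# ('2018-11-22 18:45:00', '0.87000')
```

Unlike `split($0,a,",")`, `csv.reader` also copes with quoted fields that contain commas, such as a team name with a comma in it.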

https://www.kaggle.com/c/trocafone/submissions?sortBy=date

https://www.kaggle.com/c/trocafone/leaderboard