

# Predicting ad clicking behavior with LASSO, elastic net, decision tree, random forest, and XGBoost.

In this exercise, we will predict users' ad clicking behavior using a LASSO model, an elastic net model, a decision tree, a random forest, and XGBoost.

<a id='1.1'></a>
## Loading the Python packages

In [None]:
# Load libraries

import warnings
warnings.filterwarnings('ignore')

#import os

import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
%config InlineBackend.figure_format = 'retina'

# for higher resolution
from IPython.display import set_matplotlib_formats
set_matplotlib_formats('svg','pdf')

# nice format for matplotlib https://tonysyu.github.io/raw_content/matplotlib-style-gallery/gallery.html
plt.style.use('bmh')
import pandas as pd
from pandas import read_csv, set_option
from pandas.plotting import scatter_matrix
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, KFold, cross_val_score, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier, RandomForestClassifier, ExtraTreesClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import OneHotEncoder
from scipy import sparse
#Libraries for Deep Learning Models
from keras.models import Sequential
from keras.layers import Dense
import xgboost as xgb
from sklearn.linear_model import Lasso, ElasticNet
from sklearn.metrics import mean_squared_error
# from keras.wrappers.scikit_learn import KerasClassifier
# from tensorflow.keras.wrappers.scikit_learn import KerasClassifier
#from keras.optimizers import SGD
from tensorflow.keras.optimizers import SGD


<a id='1.2'></a>
## Loading the Data



In [None]:
# load sample dataset (only load the first 300000 rows) (1pts)
data = pd.read_csv('PS2_casestudy_data.csv')
# data = data.drop(data.columns[0], axis=1).head(300000) #try a smaller size first
data = data.drop(data.columns[0], axis=1).head(50000) #try a smaller size first
data

In [None]:
data.info()

In [None]:
#Explore the data (print the first and last 5 rows) (2pts)
print('First five rows:')
print(data.head(5))
print('Last five rows:')
print(data.tail(5))

In [None]:
#remove nas from dataframe by simply dropping these rows (2pts)
# dropna() returns a new DataFrame, so the result has to be assigned back
data = data.dropna()

In [None]:
#create Y and X. Y is the "click" column, and X is the other 19 columns (Do not use ['id', 'hour', 'device_id', 'device_ip']) (1pts)
Y = data['click']
X = data.drop(columns=['click', 'id', 'hour', 'device_id', 'device_ip'])


In [None]:
# Use sklearn one-hot encoder to transform string variables in X to categorical columns (make sure to save the data as a sparse matrix) (3pts)

# columns stored with dtype object hold the string variables
string_cols = [i for i in X.columns if X[i].dtype == object]
print(string_cols)

# sparse output is the encoder default; the `sparse` keyword was renamed to `sparse_output` in scikit-learn 1.2
encoder = OneHotEncoder(handle_unknown='ignore')

encoded_x = encoder.fit_transform(X[string_cols])

# encoded_df = pd.DataFrame(encoded_x.toarray(), columns=encoder.get_feature_names_out(string_cols))

unencoded_x = X[[col for col in X.columns if col not in string_cols]].values

# note: .toarray() densifies the one-hot block so it can be stacked with the numeric columns
X_encoded = np.hstack([unencoded_x, encoded_x.toarray()])
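The encoder already returns a SciPy sparse matrix, so one way to keep the final feature matrix sparse is to stack the numeric columns onto it with `scipy.sparse.hstack` instead of calling `.toarray()`. The sketch below uses a made-up toy array (the values and the two-categorical-columns layout are assumptions for illustration, not the case-study data):

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import OneHotEncoder

# Toy data (hypothetical): one numeric column and two string columns
numeric_part = np.array([[0.0], [1.0], [0.0], [1.0]])
string_part = np.array(
    [["a", "x"], ["b", "x"], ["a", "y"], ["c", "z"]], dtype=object
)

encoder = OneHotEncoder(handle_unknown="ignore")  # sparse output is the default
encoded = encoder.fit_transform(string_part)      # 3 + 3 one-hot columns

# Stack numeric and one-hot columns without converting to a dense array
X_sparse = sparse.hstack([sparse.csr_matrix(numeric_part), encoded]).tocsr()

print(sparse.issparse(X_sparse), X_sparse.shape)
```

If a sparse matrix like this is passed to `StandardScaler`, the scaler has to be created with `with_mean=False`, because centering would destroy the sparsity.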


In [None]:
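#do a train-test split. Use the first 90% of the data as training. (1pts)
# shuffle=False keeps the row order, so the first 90% of the rows form the training set
X_train, X_test, Y_train, Y_test = train_test_split(X_encoded, Y.values, test_size=0.1, shuffle=False)

# fit the scaler on the training rows only so that test-set statistics do not leak into training
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)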

# Save the encoded data as a sparse matrix
# X_train = sparse.csr_matrix(X_train)
# X_test = sparse.csr_matrix(X_test)



## LASSO and ElasticNet
First, use scikit-learn's LASSO model with alpha=0.005 (the regularization penalty) to predict clicking behavior, and report the prediction accuracy. Then use scikit-learn's elastic net with alpha=0.005 and l1_ratio=0.5 (l1_ratio is a number from 0 to 1 that gives the share of the L1 penalty in the total penalty term).
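
For reference, the objectives that scikit-learn's `Lasso` and `ElasticNet` minimize can be written as follows, where $n$ is the number of samples, $w$ the coefficient vector, $\alpha$ stands for `alpha`, and $\rho$ stands for `l1_ratio` (the symbols are chosen here for illustration):

$$
\text{Lasso:}\quad \min_w \; \frac{1}{2n}\lVert y - Xw\rVert_2^2 + \alpha\lVert w\rVert_1
$$

$$
\text{ElasticNet:}\quad \min_w \; \frac{1}{2n}\lVert y - Xw\rVert_2^2 + \alpha\rho\lVert w\rVert_1 + \frac{\alpha(1-\rho)}{2}\lVert w\rVert_2^2
$$

With $\alpha = 0.005$ and $\rho = 0.5$, the elastic net puts a weight of $0.0025$ on the L1 term, half of what the LASSO applies, plus a weight of $0.00125$ on the squared L2 term.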

In [None]:
#initialize and train the model (1pts)
lasso_model = Lasso(alpha=0.005)
lasso_model.fit(X_train, Y_train)


In [None]:
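#testing the model (2pts)
# Lasso is a regressor, so its predictions are continuous scores;
# threshold them at 0.5 (an assumed cutoff) to get 0/1 labels for accuracy_score
Y_pred_lasso = (lasso_model.predict(X_test) >= 0.5).astype(int)
lasso_accuracy = accuracy_score(Y_test, Y_pred_lasso)
print(f"LASSO accuracy: {lasso_accuracy}")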

In [None]:
#initialize and train the elastic net (2pts)
elasticnet_model = ElasticNet(alpha=0.005, l1_ratio=0.5)
elasticnet_model.fit(X_train, Y_train)

In [None]:
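#testing the elastic net (2pts)
# ElasticNet is also a regressor: threshold its scores at 0.5 (an assumed cutoff) before computing accuracy
Y_pred_elasticnet = (elasticnet_model.predict(X_test) >= 0.5).astype(int)
elasticnet_accuracy = accuracy_score(Y_test, Y_pred_elasticnet)
# elasticnet_mse = mean_squared_error(Y_test, Y_pred_elasticnet)
print(f"ElasticNet accuracy: {elasticnet_accuracy}")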
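
`Lasso` and `ElasticNet` are regressors, so their predictions are real-valued scores rather than class labels, while `accuracy_score` expects labels. A minimal NumPy-only sketch of that conversion; the helper name `threshold_accuracy`, the 0.5 cutoff, and the toy numbers are all assumptions for illustration:

```python
import numpy as np

def threshold_accuracy(scores, y_true, threshold=0.5):
    """Turn real-valued scores into 0/1 labels and return the share that match y_true."""
    labels = (np.asarray(scores) >= threshold).astype(int)
    return float(np.mean(labels == np.asarray(y_true)))

# Hypothetical regression outputs for five ad impressions
scores = [0.12, 0.58, 0.31, 0.07, 0.66]
y_true = [0, 1, 1, 0, 0]

print(threshold_accuracy(scores, y_true))        # default cutoff of 0.5
print(threshold_accuracy(scores, y_true, 0.3))   # a lower cutoff changes the labels
```

The accuracy depends on the chosen cutoff, which is one reason a threshold-free metric such as ROC AUC is used for the tree models later in the notebook.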


What do you think is causing these differences?

LASSO uses L1 regularization, which performs variable selection by shrinking some coefficients exactly to zero; among highly correlated features it tends to keep one and drop the rest. ElasticNet combines L1 and L2 regularization, so it handles highly correlated variables better and balances variable selection against coefficient shrinkage. With the same alpha and l1_ratio=0.5, ElasticNet also applies only half of LASSO's L1 penalty, so it typically zeroes out fewer coefficients.
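
The contrast between selection and shrinkage is easiest to see in the textbook case of an orthonormal design, where (up to how the penalty is scaled) an L1 penalty soft-thresholds each least-squares coefficient and an L2 penalty rescales it. A small NumPy sketch; the coefficient values and the penalty `lam` are made up for illustration:

```python
import numpy as np

def l1_shrink(beta, lam):
    """Soft-thresholding: move each coefficient toward zero by lam, clipping at zero."""
    return np.sign(beta) * np.maximum(np.abs(beta) - lam, 0.0)

def l2_shrink(beta, lam):
    """Ridge-style shrinkage: scale every coefficient by 1 / (1 + lam)."""
    return beta / (1.0 + lam)

beta_ols = np.array([2.0, -0.3, 0.1, -1.5])  # hypothetical least-squares coefficients
lam = 0.5

beta_l1 = l1_shrink(beta_ols, lam)
beta_l2 = l2_shrink(beta_ols, lam)

print("L1:", beta_l1, "nonzero:", np.count_nonzero(beta_l1))
print("L2:", beta_l2, "nonzero:", np.count_nonzero(beta_l2))
```

The L1 rule sets the two small coefficients exactly to zero, while the L2 rule keeps all four and only makes them smaller. An elastic net mixes the two effects.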

## Decision tree and random forest

In [None]:
#build a single decision tree model using gini impurity and roc_auc scoring (2pts)
decision_tree = DecisionTreeClassifier(criterion='gini')
decision_tree.fit(X_train, Y_train)
# score with predicted click probabilities; hard 0/1 labels discard the ranking that AUC measures
y_score = decision_tree.predict_proba(X_test)[:, 1]
roc_auc_score(Y_test, y_score)

In [None]:
#do a grid search on the max-depth variable [3,10,None]. (3pts)
param_grid = {'max_depth': [3, 10, None]}
grid_search_dt = GridSearchCV(decision_tree, param_grid, scoring='roc_auc', cv=5)
grid_search_dt.fit(X_train, Y_train)


In [None]:
#print the auc of the optimal model applied on the test set (3pts)
# GridSearchCV refits the best estimator on the full training set by default
best_dt_model = grid_search_dt.best_estimator_
# use the predicted probability of a click, not the hard 0/1 label, when computing AUC
Y_pred_dt = best_dt_model.predict_proba(X_test)[:, 1]
auc_dt = roc_auc_score(Y_test, Y_pred_dt)
print(f"Decision Tree Best AUC: {auc_dt}")
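
ROC AUC measures how well a model ranks positives above negatives, so it is normally computed from predicted probabilities; feeding it hard 0/1 predictions collapses most of the ranking into ties. The sketch below illustrates this with a hand-written pairwise AUC (the helper `pairwise_auc` and the toy probabilities are assumptions for illustration, not part of scikit-learn):

```python
import numpy as np

def pairwise_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    diff = pos[:, None] - neg[None, :]  # every positive compared with every negative
    return float((np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / diff.size)

y_true = [0, 0, 1, 1, 1]
proba = [0.2, 0.45, 0.3, 0.48, 0.9]     # hypothetical predicted click probabilities
hard = [int(p >= 0.5) for p in proba]   # labels obtained with a 0.5 cutoff

auc_proba = pairwise_auc(y_true, proba)
auc_hard = pairwise_auc(y_true, hard)
print(f"AUC from probabilities: {auc_proba:.3f}")
print(f"AUC from hard labels:   {auc_hard:.3f}")
```

Here the same model scores 5/6 on probabilities but only 4/6 on thresholded labels, because two of the three positives fall below the cutoff and become tied with the negatives.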

In [None]:
#build a random forest using gini impurity and roc_auc scoring (2pts)
random_forest = RandomForestClassifier(criterion='gini')


In [None]:
#do grid search to tune n_estimators and max_depth -- 'max_depth': [3, 10, None],'n_estimators': [10,50,100,200]. (3pts)
param_grid_rf = {
    'n_estimators': [10, 50, 100, 200],
    'max_depth': [3, 10, None]
}
grid_search_rf = GridSearchCV(random_forest, param_grid_rf, scoring='roc_auc', n_jobs = 4)
grid_search_rf.fit(X_train, Y_train)


In [None]:
#print the performance of the best model (3pts)
best_rf_model = grid_search_rf.best_estimator_
# use the predicted probability of a click, not the hard 0/1 label, when computing AUC
Y_pred_rf = best_rf_model.predict_proba(X_test)[:, 1]
auc_rf = roc_auc_score(Y_test, Y_pred_rf)
print(f"Random Forest Best AUC: {auc_rf}")

## XGBoost

Use the XGBoost classifier to predict clicking behavior. Fine-tune n_estimators over $[10,50,100]$ and $\eta$ (eta) over $[0.01,0.05,0.1]$. Use roc_auc as the scoring criterion for CV.
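
As a reminder of what the two tuned parameters control, gradient boosting builds its prediction additively. In the notation below (chosen here for illustration), $f_m$ is the tree fitted at round $m$ and $M$ corresponds to `n_estimators`:

$$
F_m(x) = F_{m-1}(x) + \eta \, f_m(x), \qquad m = 1, \dots, M
$$

The step size $\eta$ is passed to XGBoost's scikit-learn wrapper as `learning_rate`. A smaller $\eta$ shrinks each tree's contribution, so it usually needs a larger $M$ to reach the same fit, which is why the two are tuned together.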

In [None]:
#initialize the parameter grid and the model (2pts)

xgb_model = xgb.XGBClassifier(use_label_encoder=False, eval_metric='logloss')
param_grid_xgb = {
    'n_estimators': [10, 50, 100],
    'learning_rate': [0.01, 0.05, 0.1]
}

In [None]:
#do grid search (3pts)
grid_search_xgb = GridSearchCV(xgb_model, param_grid_xgb, scoring='roc_auc', cv=5)
grid_search_xgb.fit(X_train, Y_train)


In [None]:
#print the performance of the best model (2pts)
best_xgb_model = grid_search_xgb.best_estimator_
# use the predicted probability of a click, not the hard 0/1 label, when computing AUC
Y_pred_xgb = best_xgb_model.predict_proba(X_test)[:, 1]
auc_xgb = roc_auc_score(Y_test, Y_pred_xgb)
print(f"XGBoost Best ROC AUC: {auc_xgb}")


In [None]:
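#Looking at the CV results of the decision tree, random forest, and XGBoost model, which ones are likely underfitted/overfitted. (3pts)
# print the mean cross-validated AUC for every parameter setting of each grid search
for name, search in [('Decision tree', grid_search_dt), ('Random forest', grid_search_rf), ('XGBoost', grid_search_xgb)]:
    cv_results = search.cv_results_
    for mean_score, params in zip(cv_results['mean_test_score'], cv_results['params']):
        print(f"{name}: mean CV ROC AUC {mean_score:.4f} with params: {params}")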
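
Validation scores alone do not separate underfitting from overfitting; comparing them with training scores does. One possible approach is to pass `return_train_score=True` to `GridSearchCV`, which adds `mean_train_score` to `cv_results_`. The sketch below runs on a synthetic dataset from `make_classification` as a stand-in (it is not the case-study data, and the sample size is arbitrary):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the click data (an assumption for illustration)
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)

search = GridSearchCV(
    DecisionTreeClassifier(criterion="gini", random_state=0),
    {"max_depth": [3, 10, None]},
    scoring="roc_auc",
    cv=5,
    return_train_score=True,  # also record AUC on the training folds
)
search.fit(X_demo, y_demo)

results = search.cv_results_
for params, train, val in zip(
    results["params"], results["mean_train_score"], results["mean_test_score"]
):
    print(f"{params}: train AUC={train:.3f}, validation AUC={val:.3f}, gap={train - val:.3f}")
```

A large gap between training and validation AUC suggests overfitting (typical for an unrestricted `max_depth`), while low scores on both suggest underfitting.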

