Name: Aaron Bastian

## Data
#### Source: 
[Kaggle Link](https://www.kaggle.com/competitions/online-purchase-prediction/data?select=shop_train.csv)

#### Description:
Each row represents a visit to an online store.  The following are the column/feature descriptions:

`admin_pages`, `info_pages`, `product_pages` - number of pages in different categories visited by the user  
`admin_seconds`, `info_seconds`, `product_seconds` - time spent by the user on different page categories  
`bounce_rate`, `quit_rate`, `page_value` - numbers from Google Analytics  
`is_holiday` - the proximity of important days for retail (such as the New Year)  
`month` - month (categorical variable)  
`operating_system_id`, `browser_id`, `region_id`, `traffic_type_id` are also categorical variables, although they are written as numbers  
`is_new_visitor`, `is_weekend` - binary signs  
`has_purchase` - binary attribute, target variable. This is what the model needs to learn to predict.  

In [1]:
import numpy as np
import pandas as pd

In [2]:
train_df = pd.read_csv("Data/shop_train.csv")

train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6165 entries, 0 to 6164
Data columns (total 18 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   admin_pages          6165 non-null   int64  
 1   admin_seconds        6165 non-null   float64
 2   info_pages           6165 non-null   int64  
 3   info_seconds         6165 non-null   float64
 4   product_pages        6165 non-null   int64  
 5   product_seconds      6165 non-null   float64
 6   bounce_rate          6165 non-null   float64
 7   quit_rate            6165 non-null   float64
 8   page_value           6165 non-null   float64
 9   is_holiday           6165 non-null   float64
 10  month                6165 non-null   object 
 11  operating_system_id  6165 non-null   int64  
 12  browser_id           6165 non-null   int64  
 13  region_id            6165 non-null   int64  
 14  traffic_type_id      6165 non-null   int64  
 15  is_new_visitor       6165 non-null   i

In [3]:
train_df.describe()

Unnamed: 0,admin_pages,admin_seconds,info_pages,info_seconds,product_pages,product_seconds,bounce_rate,quit_rate,page_value,is_holiday,operating_system_id,browser_id,region_id,traffic_type_id,is_new_visitor,is_weekend,has_purchase
count,6165.0,6165.0,6165.0,6165.0,6165.0,6165.0,6165.0,6165.0,6165.0,6165.0,6165.0,6165.0,6165.0,6165.0,6165.0,6165.0,6165.0
mean,2.33528,80.852833,0.507867,33.280329,32.408273,1208.587423,0.021689,0.042682,5.937796,0.060697,2.122952,2.392863,3.086456,4.074615,0.133658,0.23017,0.154745
std,3.356446,175.414952,1.272868,134.065358,45.243435,1810.567652,0.047605,0.047973,18.211659,0.198423,0.906463,1.787116,2.377666,4.054737,0.340312,0.420976,0.36169
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0
25%,0.0,0.0,0.0,0.0,7.0,191.0,0.0,0.014196,0.0,0.0,2.0,2.0,1.0,2.0,0.0,0.0,0.0
50%,1.0,7.6,0.0,0.0,18.0,608.883333,0.003188,0.025492,0.0,0.0,2.0,2.0,3.0,2.0,0.0,0.0,0.0
75%,4.0,93.3,0.0,0.0,38.0,1503.25,0.016667,0.05,0.0,0.0,3.0,2.0,4.0,4.0,0.0,0.0,0.0
max,27.0,2720.5,16.0,2256.916667,686.0,24844.1562,0.2,0.2,360.953384,1.0,8.0,13.0,9.0,20.0,1.0,1.0,1.0


In [4]:
train_df.head()

Unnamed: 0,admin_pages,admin_seconds,info_pages,info_seconds,product_pages,product_seconds,bounce_rate,quit_rate,page_value,is_holiday,month,operating_system_id,browser_id,region_id,traffic_type_id,is_new_visitor,is_weekend,has_purchase
0,0,0.0,0,0.0,8,335.0,0.025,0.05,63.891,0.0,May,3,2,4,1,0,0,1
1,2,54.5,4,29.5,11,1055.75,0.0,0.026667,0.0,0.0,Dec,2,2,1,2,1,0,0
2,4,72.0,0,0.0,4,46.5,0.04,0.06,0.0,0.0,Nov,3,2,2,3,0,0,0
3,3,23.166667,0,0.0,12,122.225,0.0,0.014286,0.0,0.0,Dec,2,2,9,2,1,1,0
4,12,356.125,0,0.0,44,2187.338725,0.005102,0.029042,0.0,0.0,Dec,1,1,8,1,0,0,0


Our target variable will be `has_purchase`.  It is heavily imbalanced, since most visits do not end in a purchase: about 15.5% of the visits in this dataset did (we can read this off the mean because it is a binary variable).  Thus, our target accuracy will be >84.5%, which is what a model could achieve by simply predicting 0 (no purchase) for everything.  I will also be using an ensemble method for my classifier, as ensembles combine many weighted internal models and tend to be more resistant to skew and overfitting than a single model.  In particular, I will use the `XGBClassifier` class from `xgboost`.
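The majority-class benchmark described above can be sketched with scikit-learn's `DummyClassifier`.  The labels below are synthetic stand-ins with roughly the same class balance as `has_purchase` (the names `X_demo` and `y_demo` are illustrative, not part of the notebook), so the printed numbers are not results for this dataset.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, f1_score

# Synthetic labels: roughly 15.5% positives, mimicking has_purchase (assumption)
rng = np.random.default_rng(0)
y_demo = (rng.random(2000) < 0.155).astype(int)
X_demo = np.zeros((len(y_demo), 1))  # the dummy model ignores the features

# Always predict the most frequent class (0, i.e. "no purchase")
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_demo, y_demo)
baseline_preds = baseline.predict(X_demo)

baseline_acc = accuracy_score(y_demo, baseline_preds)
# zero_division=0 avoids a warning: the baseline never predicts a purchase
baseline_f1 = f1_score(y_demo, baseline_preds, zero_division=0)

print("Baseline accuracy:", baseline_acc)
print("Baseline F1:", baseline_f1)
```

The baseline accuracy equals the share of non-purchases, while its F1 score is 0 because it never predicts a purchase.  This is why I report F1 alongside accuracy later on.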

In [5]:
from sklearn.model_selection import train_test_split

X = train_df.drop("has_purchase", axis=1)
y = train_df.has_purchase

X_train, X_test, y_train, y_test = train_test_split(X, y)

y_train.value_counts(normalize=True), y_test.value_counts(normalize=True)

(0    0.844041
 1    0.155959
 Name: has_purchase, dtype: float64,
 0    0.848898
 1    0.151102
 Name: has_purchase, dtype: float64)
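The split above happens to keep the class proportions close in both halves, but that is left to chance.  A minimal sketch of a reproducible, stratified alternative follows, using a synthetic frame (`demo_df` is hypothetical) rather than the real data.  Re-running the notebook with these arguments would change the numbers reported below.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with exactly 15% positives (assumption, not the real file)
rng = np.random.default_rng(0)
demo_df = pd.DataFrame({
    "feature": rng.normal(size=1000),
    "has_purchase": (np.arange(1000) % 20 < 3).astype(int),
})

X_demo = demo_df.drop("has_purchase", axis=1)
y_demo = demo_df["has_purchase"]

# stratify keeps the purchase rate nearly identical in both splits;
# random_state makes the split repeatable between runs
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=42
)

print("Train purchase rate:", y_tr.mean())
print("Test purchase rate: ", y_te.mean())
```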

In [6]:
from sklearn.preprocessing import FunctionTransformer


month_dict = {"jan": 1, "feb": 2, "mar": 3, "apr": 4, "may": 5, "jun": 6,"jul": 7, "aug": 8, "sep": 9, "oct": 10, "nov": 11, "dec": 12}


def month_to_int(_X):
    _X = _X.copy()
    _X.month = _X.month.apply(str.lower).map(month_dict)
    return _X

month_transformer = FunctionTransformer(month_to_int)
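The mapping above only matches three-letter abbreviations; any other spelling would silently become `NaN`.  If the raw file contained a longer form such as "June" (an assumption, not something checked against this dataset), a variant that keeps only the first three letters would handle it.  The function name `month_to_int_safe` is hypothetical.

```python
import pandas as pd

month_dict = {"jan": 1, "feb": 2, "mar": 3, "apr": 4, "may": 5, "jun": 6,
              "jul": 7, "aug": 8, "sep": 9, "oct": 10, "nov": 11, "dec": 12}


def month_to_int_safe(_X):
    """Map month names to integers, tolerating spellings longer than three letters."""
    _X = _X.copy()
    # Lowercase and truncate to three characters before the lookup
    keys = _X["month"].str.lower().str[:3]
    _X["month"] = keys.map(month_dict)
    return _X


demo = pd.DataFrame({"month": ["May", "June", "Dec", "Sept"]})
converted = month_to_int_safe(demo)
print(converted["month"].tolist())
```

Like the original function, this works on a copy, so the input frame is left unchanged.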

In [7]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from xgboost import XGBClassifier

steps = [
    ("month_trans", month_transformer),
    ("scaler", StandardScaler()),
    ("xgboost", XGBClassifier())
]

pipe = Pipeline(steps)

In [8]:
pipe.fit(X_train, y_train)



In [9]:
from sklearn.metrics import accuracy_score, f1_score

def analyze_model(model) -> None:

    train_preds = model.predict(X_train)
    test_preds = model.predict(X_test)

    print("Training Acc: ", accuracy_score(y_train, train_preds))
    print("Training F1: ", f1_score(y_train, train_preds), "\n")
    print("Testing Acc: ", accuracy_score(y_test, test_preds))
    print("Testing F1: ", f1_score(y_test, test_preds))

analyze_model(pipe)


Training Acc:  0.9974042829331603
Training F1:  0.9916083916083916 

Testing Acc:  0.8942931258106356
Testing F1:  0.6053268765133172


Not bad.  We are already outperforming our benchmark of 84.5%, but we can do better.  It appears that we are overfitting our data: we have near-perfect accuracy on our training data (99.7%) and significantly lower accuracy on our testing data (89.4%, about 10 percentage points less).  To reduce the overfitting, I will tune the model hyperparameters using a grid search with cross-validation, as the dataset is not too large and there are only two hyperparameters I would like to tune: `n_estimators` (default is 100) and `max_depth` (default is 6). [docs](https://xgboost.readthedocs.io/en/stable/parameter.html)

In [10]:
from sklearn.model_selection import GridSearchCV

pipe_2 = Pipeline([
    ("month_transformer", month_transformer),
    ("scaler", StandardScaler()),
    ("xgb", XGBClassifier())
])

param_grid = {
    "xgb__max_depth": [2, 3, 5, 7],
    "xgb__n_estimators" : [50, 75, 100]
}

grid_search = GridSearchCV(pipe_2, param_grid, cv=5, scoring="accuracy")
grid_search.fit(X_train, y_train)

grid_search.best_params_

{'xgb__max_depth': 2, 'xgb__n_estimators': 50}
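`best_params_` reports only the single winning combination.  To see how close the other candidates came, the full table of cross-validated scores can be inspected through `cv_results_`.  The sketch below uses synthetic data and scikit-learn's `GradientBoostingClassifier` as a stand-in for `XGBClassifier`, so the scores it prints say nothing about this dataset.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Small synthetic, imbalanced problem (assumption: a stand-in for the shop data)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, weights=[0.85, 0.15], random_state=0
)

demo_grid = {"max_depth": [2, 3], "n_estimators": [25, 50]}
demo_search = GridSearchCV(
    GradientBoostingClassifier(random_state=0), demo_grid, cv=3, scoring="accuracy"
)
demo_search.fit(X_demo, y_demo)

# One row per parameter combination, best-ranked first
results = pd.DataFrame(demo_search.cv_results_)
summary = results[
    ["param_max_depth", "param_n_estimators",
     "mean_test_score", "std_test_score", "rank_test_score"]
].sort_values("rank_test_score")

print(summary.to_string(index=False))
```

When the mean scores of neighboring settings differ by less than their standard deviations, the choice between them is largely a judgment call.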

In [11]:
# Note: the grid search above reported max_depth=2 as best; this cell fits max_depth=3
xgb = XGBClassifier(
    max_depth=3,
    n_estimators=50)

pipe_3 = Pipeline([
    ("month_transformer", month_transformer),
    ("scaler", StandardScaler()),
    ("xgb", xgb)
])

pipe_3.fit(X_train, y_train)

analyze_model(pipe_3)

Training Acc:  0.9251568245727883
Training F1:  0.7334360554699538 

Testing Acc:  0.9014267185473411
Testing F1:  0.6292682926829268


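A testing accuracy of 90.1%, which is a 0.7 percentage point increase over our untuned model (89.4%).  The gap between training and testing accuracy has also shrunk from about 10 points to about 2.4, so the tuned model overfits far less.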

In [12]:
from sklearn.metrics import roc_auc_score

final_X = pd.read_csv("Data/shop_test.csv")

final_preds = pipe_3.predict_proba(final_X)[:,1]
final_preds

array([0.05175634, 0.01371861, 0.0061592 , ..., 0.0498422 , 0.080585  ,
       0.00568756], dtype=float32)

In [13]:
sub = pd.DataFrame(final_preds, columns=["prediction"])
sub["id"] = sub.index
sub

Unnamed: 0,prediction,id
0,0.051756,0
1,0.013719,1
2,0.006159,2
3,0.018625,3
4,0.005062,4
...,...,...
6160,0.008612,6160
6161,0.024360,6161
6162,0.049842,6162
6163,0.080585,6163


In [14]:
sub.to_csv("submission.csv", index=False)

The submissions for the Kaggle competition required the predictions to be in the form of probabilities so they could calculate the ROC AUC as the evaluation metric.  I received a final private score of 93.45%, which would have put me in 2nd place! (out of 3 people...)
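To illustrate why probabilities matter for this metric, here is a small sketch with made-up labels and scores (none of these values come from the competition).  ROC AUC measures how well the scores rank purchases above non-purchases, and thresholding the scores into hard 0/1 labels throws most of that ranking away.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Made-up ground truth and predicted purchase probabilities (illustrative only)
y_true = np.array([0, 0, 0, 0, 1, 1])
proba = np.array([0.05, 0.10, 0.40, 0.30, 0.35, 0.80])

# Hard labels at the usual 0.5 threshold, as .predict() would return
hard = (proba >= 0.5).astype(int)

auc_proba = roc_auc_score(y_true, proba)
auc_hard = roc_auc_score(y_true, hard)

print("AUC from probabilities:", auc_proba)
print("AUC from hard labels:  ", auc_hard)
```

The probabilities score higher because the purchase with probability 0.35 still outranks three of the four non-purchases, while the hard labels tie it with all of them.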