# I Introduction

## **Team**: Diamonds
#### Francisco Arrieta, Lucia Camenisch, and Emily Schmidt

### Data Description

A company wants to assess the quality of its online advertising campaign, which targets online users. The users see a web banner during their browsing activity. For each user, the company wants to predict, based on the information it has about that user, whether the user will subscribe to the advertised product through the advertisement banner. To subscribe, the user has to click on the banner and then subscribe to the service.

* The target variable name: `subscription`.

### Data Fields

**Unique Identifier**
* `Id`: A unique identifier of the observations in each dataset. In the test set, it is used to match your predictions with the true values.

**Target Variable** (only in the training data)
* `subscription`: whether the user subscribed through the banner (1: yes, 0: no)

**Demographic Variables**
* `age`: (numeric)

* `job`: Type of job (categorical: teacher, industrial_worker, entrepreneur, housekeeper, manager, retired, freelance, salesman, student, technology, unemployed, na)

* `marital`: marital status (categorical: married, divorced, single)

* `education`: (categorical: high_school, university, grad_school, na)

**Variables about the current campaign**
* `device`: From which device does the user see the banner? (categorical: smartphone, desktop, na)

* `day`: Last day of the month when the user saw the banner (numeric)

* `month`: Last month of the year when the user saw the banner (numeric)

* `time_spent`: How long the user looked at the banner last time (in seconds) (numeric)

* `banner_views`: Number of times the user saw the banner (numeric)

**Variables About an Old Campaign for the Same Product**
* `banner_views_old`: Number of times the user saw the banner during an old (and related) online ads campaign (numeric)

* `days_elapsed_old`: Number of days since the user saw the banner of an old (and related) online ads campaign (numeric, -1 if the user never saw the banner)

* `outcome_old`: Outcome of the old (and related) online ads campaign (categorical: failure, other, success, na)

**Variables with No Name**
* X1: (categorical: 1, 0)

* X2: (categorical: 1, 0)

* X3: (categorical: 1, 0)

* X4: (numeric)

## Project Structure

### I Introduction
### II Data Exploration
### III Data Imputation
### IV Variable Selection
### V Attempted Models
* LDA and QDA

* Logistic Regression

* k-Nearest Neighbors

* Support Vector Machines (SVM)

* Decision Tree

* Random Forest

* Neural Networks

* XGBoost

### VI Model Comparison
* Model Selection Approach

### VII Final Model
* Best Predictive Model Description

* Tuning Parameters Analysis

### VIII Best Model Diagnostics and Final Kaggle Prediction

### IX Conclusion




# II Data Exploration

In [None]:
import pandas as pd
import numpy as np
import datetime as dt
import matplotlib.pyplot as plt
import seaborn as sns
from plotnine import ggplot, aes, labs, theme, theme_classic, facet_wrap, element_text
from plotnine import geom_histogram, geom_bar, geom_point
import missingno as msno
from pandas.plotting import scatter_matrix

## Training Set

In [None]:
campaign_ad = pd.read_csv("MLUnige2023_subscriptions_train.csv", index_col="Id")
campaign_ad.info()

There appear to be no missing values in the dataset as identified by Python.
However, the columns

- `job`
- `education`
- `device`
- `outcome_old`

have missing values in the form of the string `'na'` and this can be seen within the output below.

In [None]:
campaign_ad

Therefore, we replace the string `'na'` with `NaN`, which Python recognizes as a missing value.

In [None]:
campaign_ad.replace('na', np.nan, inplace=True)
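
As a quick check that such a replacement has worked, the sketch below counts the missing values per column on a small made-up frame. The frame `toy` and its values are illustrative assumptions, not rows from the campaign data.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the campaign data (values are made up)
toy = pd.DataFrame({
    "job": ["manager", "na", "student", "na"],
    "device": ["smartphone", "desktop", "na", "smartphone"],
    "age": [41, 35, 22, 58],
})

# Turn the 'na' strings into real missing values
toy = toy.replace("na", np.nan)

# Count and share of missing values per column
missing_count = toy.isna().sum()
missing_share = toy.isna().mean()
print(pd.DataFrame({"count": missing_count, "share": missing_share}))
```

The same two calls, `isna().sum()` and `isna().mean()`, could be applied to `campaign_ad` directly.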

In [None]:
# Set the desired color
color = '#2A76B1'  # Replace with your preferred color

msno.bar(campaign_ad, figsize = (10,8), fontsize = 8, color = color, sort = "ascending",);

Now, we can clearly see the respective missing values throughout the data set.


In [None]:
campaign_ad.describe()

From a quantitative view, there are several points that we can interpret:
- The average age is 41 years.
- `time_spent`, which measures how long the user looked at the banner the last time, averages about 17.1 seconds.
- A banner is seen 2.54 times on average, while the old banner was viewed less frequently, at 0.76 times on average.
- `days_elapsed_old` shows many -1 values in its summary statistics because -1 marks users who never saw the old banner.
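
Because -1 is a sentinel rather than a real number of days, it pulls the summary statistics of `days_elapsed_old` down. A minimal sketch, on made-up values, of separating the sentinel from the genuine observations before summarizing:

```python
import pandas as pd

# Made-up values; -1 marks users who never saw the old banner
days = pd.Series([-1, -1, 90, -1, 180, 30], name="days_elapsed_old")

# Share of users who never saw the old banner
never_seen_share = (days == -1).mean()

# Summary restricted to users who did see the old banner
seen = days[days != -1]
print(f"never saw old banner: {never_seen_share:.0%}")
print(seen.describe())
```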

In [None]:
pd.options.display.max_seq_items = 2000
# For each non-numeric predictor, print the subscription rate (mean)
# and the number of observations (count) per level
for col in campaign_ad.select_dtypes(exclude = "number").columns:
    print(campaign_ad[[col, "subscription"]].groupby(col, as_index = False).mean())
    print(campaign_ad[[col, "subscription"]].groupby(col, as_index = False).count())


To read this output, take `entrepreneur` under `job`: its `subscription` mean of 0.338 means that 33.8% of the entrepreneurs in the training set subscribed. `student` has the highest subscription rate among the job levels, at 68%. Taking the level with the highest rate in each categorical predictor, we can assume that the individuals most likely to subscribe are `student`s who are `single`, attended or are enrolled in graduate school (`grad_school`), use a `smartphone`, and for whom the outcome of the old (and related) online ads campaign was a `success`.
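
The subscription rate and the group size can also be read side by side from a single table. The sketch below shows one way to do this with a named aggregation; the toy data and the column names `rate` and `n` are assumptions made for illustration.

```python
import pandas as pd

# Made-up observations for two job levels
toy = pd.DataFrame({
    "job": ["student", "student", "manager", "manager", "manager"],
    "subscription": [1, 1, 0, 1, 0],
})

# One row per level: share of subscribers and number of observations
summary = (
    toy.groupby("job")["subscription"]
       .agg(rate="mean", n="count")
       .sort_values("rate", ascending=False)
)
print(summary)
```

Keeping `n` next to `rate` helps to spot levels whose high rate rests on very few observations.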

In [None]:
# Plot histogram for continuous variables to look at distribution
camp_num_melt = campaign_ad[['age', 'time_spent', 'banner_views', 'banner_views_old', 'days_elapsed_old', 'X4']].melt()

ggplot(camp_num_melt) + \
    aes('value') + \
    facet_wrap('variable', scales = 'free') + \
    geom_histogram(bins = 20, color = 'black', fill = '#2A76B1', alpha = 0.5) + \
    theme_classic() + \
    theme(subplots_adjust = {'wspace': 0.5, 'hspace': 0.5}, 
          axis_text_x = element_text(rotation = 45, ha = 'right'),
          figure_size = (15, 10)) + \
    labs(title = 'Histograms of Numerical Variables',
         x = 'Value of Variable',
         y = 'Count')

All of the continuous predictors appear to be skewed. Therefore, we will transform these variables as needed for the specific model. These transformations take place within each of the respective model notebooks.
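
As an illustration of what such a transformation can look like, the sketch below applies `np.log1p` to made-up right-skewed values and compares the skewness before and after. The choice of `log1p` is an assumption for this example, not necessarily the transformation used in the model notebooks, and it only suits non-negative variables (so not `days_elapsed_old` with its -1 values).

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Made-up right-skewed, non-negative values standing in for `time_spent`
time_spent = pd.Series(rng.exponential(scale=15.0, size=1000))

# log1p(x) = log(1 + x), which is defined at zero
log_time_spent = np.log1p(time_spent)

print(f"skewness before: {time_spent.skew():.2f}")
print(f"skewness after:  {log_time_spent.skew():.2f}")
```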

In [None]:
# Plot histogram for qualitative variables to look at distribution
camp_cat_melt = campaign_ad[['job', 'marital', 'education', 'device', 'outcome_old', 'X1', 'X2', 'X3', 'subscription']].melt()

ggplot(camp_cat_melt) + \
    aes('value') + \
    facet_wrap('variable', scales = 'free') + \
    geom_bar(fill = '#2A76B1', color = 'black', alpha = 0.5) + \
    theme_classic() + \
    theme(subplots_adjust = {'wspace': 0.5, 'hspace': 0.5},
          axis_text_x = element_text(angle = 45, vjust = 1, hjust = 1, size = 8),
          figure_size = (15, 10),
          axis_title_x = element_text(margin = {'t': 20})) + \
    labs(title = 'Histograms of Categorical Variables - Train',
         x = 'Value of Variable',
         y = 'Count')

These bar charts show how the observations are distributed across the levels of each categorical predictor. For instance, `married` under `marital` and `manager` under `job` have the most observations. This information may help later if missing-value imputation is needed.

## Possible Feature Engineering

Feature engineering involves the extraction and transformation of variables from the original data by converting those observations into desired features. This process was explored to get a better understanding of how the categorical variables relate to the response. In addition, some of the predictors had many levels. Therefore, the relationship to `subscription` was analyzed to see if these levels could possibly be combined to reduce the number of parameters.  

In [None]:
# Select columns to include in the scatterplot matrix (excluding 'X1', 'X2', and `X3`) for TRAINING
included_columns = [col for col in campaign_ad.columns if col not in ['X1', 'X2', 'X3']]

# Create scatter matrix plot with selected columns
# (stored as `axes` so the imported `scatter_matrix` function is not overwritten)
axes = pd.plotting.scatter_matrix(campaign_ad[included_columns], figsize=(12, 12))

# Adjust spacing between subplots
plt.subplots_adjust(hspace=0.1, wspace=0.1)

# Rotate the x- and y-axis labels by 45 degrees
for ax in axes.flatten():
    ax.xaxis.label.set_rotation(45)
    ax.yaxis.label.set_rotation(45)
    ax.xaxis.label.set_ha('right')
    ax.yaxis.label.set_ha('right')

# Display the plot
plt.show()


In [None]:
campaign_ad_test = pd.read_csv("MLUnige2023_subscriptions_test.csv", index_col="Id")

In [None]:
# Select columns to include in the scatterplot matrix (excluding 'X1', 'X2', and `X3`) for TEST
included_columns = [col for col in campaign_ad_test.columns if col not in ['X1', 'X2', 'X3']]

# Create scatter matrix plot with selected columns
# (stored as `axes` so the imported `scatter_matrix` function is not overwritten)
axes = pd.plotting.scatter_matrix(campaign_ad_test[included_columns], figsize=(12, 12))

# Adjust spacing between subplots
plt.subplots_adjust(hspace=0.1, wspace=0.1)

# Rotate the x- and y-axis labels by 45 degrees
for ax in axes.flatten():
    ax.xaxis.label.set_rotation(45)
    ax.yaxis.label.set_rotation(45)
    ax.xaxis.label.set_ha('right')
    ax.yaxis.label.set_ha('right')

# Display the plot
plt.show()

Between the training and test scatterplot matrices, three variables, `time_spent`, `banner_views`, and `X4`, seem to have similar distributions. In the test data, however, the second and third bins hold a different number of observations than in the training data, as the data seem to be dispersed differently.

In [None]:
import math

# Get unique job categories
job_categories = campaign_ad['job'].unique()

# Determine the number of columns and rows for subplots
num_categories = len(job_categories)
num_columns = math.ceil(num_categories / 4)
num_rows = math.ceil(num_categories / num_columns)

# Create subplots
fig, axs = plt.subplots(num_rows, num_columns, figsize=(12, 12), sharey=True)

# Flatten the axs array to simplify indexing
axs = axs.flatten()

# Iterate over job categories and create bar plot for each category
for i, job_category in enumerate(job_categories):
    ax = axs[i]

    # Check if there are any records for the current job category
    if job_category in campaign_ad['job'].values:
        subset = campaign_ad[campaign_ad['job'] == job_category]
        subset['subscription'].value_counts().plot(kind='bar', ax=ax)

        ax.set_title(job_category)
        ax.set_xlabel('subscription')
        ax.set_ylabel('Count')
    else:
        # Remove the empty subplot
        fig.delaxes(ax)

# Adjust spacing between subplots
plt.subplots_adjust(hspace=0.5)

# Display the plot
plt.show()

Among the 11 levels, several follow a similar pattern. If feature engineering were used, we would combine these levels based on their distribution with respect to `subscription`:
1. `entrepreneur`, `housekeeper`, `industrial_worker`, and `salesman`

2. `freelance`, `manager`, and `na` (Later in variable selection, this is relevant because `manager` and `na` follow the same distribution, and `manager` has the most observations)

3. `teacher` and `technology`

4. `retired`

5. `student`

6. `unemployed`

Therefore, the parameters would be reduced from 11 to 6.

*Note*: With additional domain expertise, `retired` and `student` could possibly be combined into one parameter, but since we do not know the background of this data set, they remain separate.
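
The grouping listed above can be sketched as a simple mapping. The group labels (`group_1`, `group_2`, `group_3`) and the toy series are hypothetical; only the assignment of levels to groups comes from the list.

```python
import pandas as pd

# Assignment of job levels to the six groups; labels are hypothetical
job_groups = {
    "entrepreneur": "group_1", "housekeeper": "group_1",
    "industrial_worker": "group_1", "salesman": "group_1",
    "freelance": "group_2", "manager": "group_2", "na": "group_2",
    "teacher": "group_3", "technology": "group_3",
    "retired": "retired",
    "student": "student",
    "unemployed": "unemployed",
}

# Made-up job values, including one missing entry
jobs = pd.Series(["manager", "student", None, "salesman", "teacher"], name="job")

# Label missing jobs as 'na' first so that they map to the manager group
job_grouped = jobs.fillna("na").map(job_groups)
print(job_grouped.value_counts())
```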

In [None]:
# Get unique marital categories
marital_categories = campaign_ad['marital'].unique()

# Determine the number of columns and rows for subplots
num_categories = len(marital_categories)
num_columns = math.ceil(num_categories / 4)
num_rows = math.ceil(num_categories / num_columns)

# Create subplots
fig, axs = plt.subplots(num_rows, num_columns, figsize=(12, 8), sharey=True)

# Flatten the axs array to simplify indexing
axs = axs.flatten()

# Iterate over marital categories and create bar plot for each category
for i, marital_category in enumerate(marital_categories):
    ax = axs[i]

    # Check if there are any records for the current marital category
    if marital_category in campaign_ad['marital'].values:
        subset = campaign_ad[campaign_ad['marital'] == marital_category]
        subset['subscription'].value_counts().plot(kind='bar', ax=ax)

        ax.set_title(marital_category)
        ax.set_xlabel('subscription')
        ax.set_ylabel('Count')
    else:
        # Remove the empty subplot
        fig.delaxes(ax)

# Adjust spacing between subplots
plt.subplots_adjust(hspace=0.5)

# Display the plot
plt.show()

Combining these three levels into fewer parameters does not seem worthwhile because their distributions differ.

In [None]:
# Get unique device categories
device_categories = campaign_ad['device'].unique()

# Determine the number of columns and rows for subplots
num_categories = len(device_categories)
num_columns = math.ceil(num_categories / 4)
num_rows = math.ceil(num_categories / num_columns)

# Create subplots
fig, axs = plt.subplots(num_rows, num_columns, figsize=(12, 6), sharey=True)

# Flatten the axs array to simplify indexing
axs = axs.flatten()

# Iterate over device categories and create bar plot for each category
for i, device_category in enumerate(device_categories):
    ax = axs[i]

    # Check if there are any records for the current device category
    if device_category in campaign_ad['device'].values:
        subset = campaign_ad[campaign_ad['device'] == device_category]
        subset['subscription'].value_counts().plot(kind='bar', ax=ax)

        ax.set_title(device_category)
        ax.set_xlabel('subscription')
        ax.set_ylabel('Count')
    else:
        # Remove the empty subplot
        fig.delaxes(ax)

# Adjust spacing between subplots
plt.subplots_adjust(hspace=0.5)

# Display the plot
plt.show()

`device` will stay as is. This predictor will be reviewed again later in **Variable Selection**: since it has the fewest levels, a model could be used to predict its missing values.

In [None]:
# Get unique outcome_old categories
outcome_old_categories = campaign_ad['outcome_old'].unique()

# Determine the number of columns and rows for subplots
num_categories = len(outcome_old_categories)
num_columns = math.ceil(num_categories / 4)
num_rows = math.ceil(num_categories / num_columns)

# Create subplots
fig, axs = plt.subplots(num_rows, num_columns, figsize=(12, 8), sharey=True)

# Flatten the axs array to simplify indexing
axs = axs.flatten()

# Iterate over outcome_old categories and create bar plot for each category
for i, outcome_old_category in enumerate(outcome_old_categories):
    ax = axs[i]

    # Check if there are any records for the current outcome_old category
    if outcome_old_category in campaign_ad['outcome_old'].values:
        subset = campaign_ad[campaign_ad['outcome_old'] == outcome_old_category]
        subset['subscription'].value_counts().plot(kind='bar', ax=ax)

        ax.set_title(outcome_old_category)
        ax.set_xlabel('subscription')
        ax.set_ylabel('Count')
    else:
        # Remove the empty subplot
        fig.delaxes(ax)

# Adjust spacing between subplots
plt.subplots_adjust(hspace=0.5)

# Display the plot
plt.show()

There appears to be an empty spot between `success` and `other`, which is actually the missing (`na`) level. Once again, we would not consider combining these levels, as they do not have similar distributions.
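
The empty spot most likely arises because `NaN` does not compare equal to itself, so the membership check in the loop fails for the missing level and its subplot is removed. A minimal sketch, on made-up data, of giving missing values an explicit label (here the hypothetical label `missing`) so that they appear as a level of their own:

```python
import numpy as np
import pandas as pd

# Made-up observations with two missing outcomes
toy = pd.DataFrame({
    "outcome_old": ["success", np.nan, "failure", np.nan, "other", "success"],
    "subscription": [1, 0, 0, 1, 0, 1],
})

# NaN is never equal to itself, which is why equality checks skip it
print(float("nan") == float("nan"))

# Give missing values an explicit label so they form their own level
outcome = toy["outcome_old"].fillna("missing")
counts = pd.crosstab(outcome, toy["subscription"])
print(counts)
```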

In [None]:
# Get unique X1 categories
X1_categories = campaign_ad['X1'].unique()

# Determine the number of columns and rows for subplots
num_categories = len(X1_categories)
num_columns = math.ceil(num_categories / 4)
num_rows = math.ceil(num_categories / num_columns)

# Create subplots
fig, axs = plt.subplots(num_rows, num_columns, figsize=(12, 8), sharey=True)

# Flatten the axs array to simplify indexing
axs = axs.flatten()

# Iterate over X1 categories and create bar plot for each category
for i, X1_category in enumerate(X1_categories):
    ax = axs[i]

    # Check if there are any records for the current X1 category
    if X1_category in campaign_ad['X1'].values:
        subset = campaign_ad[campaign_ad['X1'] == X1_category]
        subset['subscription'].value_counts().plot(kind='bar', ax=ax)

        ax.set_title(X1_category)
        ax.set_xlabel('subscription')
        ax.set_ylabel('Count')
    else:
        # Remove the empty subplot
        fig.delaxes(ax)

# Adjust spacing between subplots
plt.subplots_adjust(hspace=0.5)

# Display the plot
plt.show()

The majority of `X1` is zero. It appears that most of the observations fall under not having a subscription.

In [None]:
# Get unique X2 categories
X2_categories = campaign_ad['X2'].unique()

# Determine the number of columns and rows for subplots
num_categories = len(X2_categories)
num_columns = math.ceil(num_categories / 4)
num_rows = math.ceil(num_categories / num_columns)

# Create subplots
fig, axs = plt.subplots(num_rows, num_columns, figsize=(12, 8), sharey=True)

# Flatten the axs array to simplify indexing
axs = axs.flatten()

# Iterate over X2 categories and create bar plot for each category
for i, X2_category in enumerate(X2_categories):
    ax = axs[i]

    # Check if there are any records for the current X2 category
    if X2_category in campaign_ad['X2'].values:
        subset = campaign_ad[campaign_ad['X2'] == X2_category]
        subset['subscription'].value_counts().plot(kind='bar', ax=ax)

        ax.set_title(X2_category)
        ax.set_xlabel('subscription')
        ax.set_ylabel('Count')
    else:
        # Remove the empty subplot
        fig.delaxes(ax)

# Adjust spacing between subplots
plt.subplots_adjust(hspace=0.5)

# Display the plot
plt.show()

`X1` and `X2` have similar patterns in that the majority of their observations are in class 0. Therefore, these two predictors could possibly be combined even though we are unsure of their categorical meaning.

In [None]:
# Get unique X3 categories
X3_categories = campaign_ad['X3'].unique()

# Determine the number of columns and rows for subplots
num_categories = len(X3_categories)
num_columns = math.ceil(num_categories / 4)
num_rows = math.ceil(num_categories / num_columns)

# Create subplots
fig, axs = plt.subplots(num_rows, num_columns, figsize=(12, 8), sharey=True)

# Flatten the axs array to simplify indexing
axs = axs.flatten()

# Iterate over X3 categories and create bar plot for each category
for i, X3_category in enumerate(X3_categories):
    ax = axs[i]

    # Check if there are any records for the current X3 category
    if X3_category in campaign_ad['X3'].values:
        subset = campaign_ad[campaign_ad['X3'] == X3_category]
        subset['subscription'].value_counts().plot(kind='bar', ax=ax)

        ax.set_title(X3_category)
        ax.set_xlabel('subscription')
        ax.set_ylabel('Count')
    else:
        # Remove the empty subplot
        fig.delaxes(ax)

# Adjust spacing between subplots
plt.subplots_adjust(hspace=0.5)

# Display the plot
plt.show()

`X3` differs from `X1` and `X2` in that classes 0 and 1 of the predictor appear to be more evenly distributed. When `X3` is 1, most individuals do not have a subscription, while when it is 0, the opposite holds.

## Test Set

In [None]:
campaign_ad_test.info()

Although the test set does not contain `subscription`, we thought it important to review the missing data and the distributions of its variables.

In [None]:
campaign_ad_test.replace('na', np.nan, inplace=True)

In [None]:
# Set the desired color
color = '#2A76B1'  # Replace with your preferred color

msno.bar(campaign_ad_test, figsize = (10,8), fontsize = 8, color = color, sort = "ascending",);

Among the missing values, we see a similar pattern: `outcome_old`, `device`, `education`, and `job` have missing values, while all other variables are complete.

In [None]:
campaign_ad_test.describe()

Between the training and test sets, the averages of the quantitative variables differ little. For example, the average `age` was 41.2 years in training and slightly lower, at 41.1, in test. `time_spent` was slightly lower on average than in training, while `banner_views` moved in the opposite direction, with a higher average than in the training data.

The profile of the individual most likely to subscribe cannot be built for the test data since `subscription` is not given.

In [None]:
# Continuous variables to look at distribution
camp_num_melt_test = campaign_ad_test[['age', 'time_spent', 'banner_views', 'banner_views_old', 'days_elapsed_old', 'X4']].melt()

In [None]:
# Set the figure size
plt.figure(figsize=(15, 5))

plt.subplot(1, 3, 1) # row 1, column 1 of 3
plt.hist(campaign_ad['age'], bins=20, color='#2A76B1', alpha=0.5, label='age Train')
plt.hist(campaign_ad_test['age'], bins=20, color='green', alpha=0.5, label='age Test')
plt.title("age")
plt.ylabel('Count')
plt.legend()


plt.subplot(1, 3, 2) # row 1, column 2 of 3
plt.hist(campaign_ad['time_spent'], bins=20, color='#2A76B1', alpha=0.5, label='time_spent Train')
plt.hist(campaign_ad_test['time_spent'], bins=20, color='green', alpha=0.5, label='time_spent Test')
plt.title("time_spent")
plt.legend()


plt.subplot(1, 3, 3) # row 1, column 3 of 3
plt.hist(campaign_ad['banner_views'], bins=20, color='#2A76B1', alpha=0.5, label='banner_views Train')
plt.hist(campaign_ad_test['banner_views'], bins=20, color='green', alpha=0.5, label='banner_views Test')
plt.title("banner_views")
plt.legend()


In these first three histograms, the test set has far fewer observations (3,837) than the training set, which has about 9,000 rows. This is irrelevant, though, as we want to compare the relative distributions of each predictor. For `age`, the training distribution appears more rigid than the test distribution but, as seen earlier, the two have a similar mean. In addition, the difference between bins one and two of `time_spent` is smaller in the test set than in the training set. `banner_views`, on the other hand, appears to follow a similar distribution in the respective sets, although the gap between bins one and two is again smaller than in training.
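
Since the two sets differ in size, one way to compare relative distributions directly is to put both on shared bin edges and convert counts to proportions. The sketch below does this on made-up samples; the variable names and the simulated data are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Made-up samples from the same distribution but with different sizes
train = rng.normal(loc=41, scale=10, size=9000)
test = rng.normal(loc=41, scale=10, size=3837)

# Shared bin edges so that the bars are directly comparable
edges = np.histogram_bin_edges(np.concatenate([train, test]), bins=20)

# Convert counts to proportions so that the sample size drops out
train_share = np.histogram(train, bins=edges)[0] / train.size
test_share = np.histogram(test, bins=edges)[0] / test.size

# Largest gap between the two sets in any single bin
max_gap = np.abs(train_share - test_share).max()
print(f"largest per-bin gap: {max_gap:.3f}")
```

With `plt.hist`, a comparable effect can be obtained by passing the same `bins` array to both calls together with `density=True`.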

In [None]:
# Set the figure size
plt.figure(figsize=(15, 5))

plt.subplot(1, 2, 1) # row 1, column 1 of 2
plt.hist(campaign_ad['days_elapsed_old'], bins=20, color='#2A76B1', alpha=0.5, label='days_elapsed_old Train')
plt.hist(campaign_ad_test['days_elapsed_old'], bins=20, color='green', alpha=0.5, label='days_elapsed_old Test')
plt.title("days_elapsed_old")
plt.ylabel('Count')
plt.legend()

plt.subplot(1, 2, 2) # row 1, column 2 of 2
plt.hist(campaign_ad['X4'], bins=20, color='#2A76B1', alpha=0.5, label='X4 Train')
plt.hist(campaign_ad_test['X4'], bins=20, color='green', alpha=0.5, label='X4 Test')
plt.title("X4")
plt.legend()


The other two predictors for which we wanted to compare the training and test distributions are `days_elapsed_old` and `X4`. Both appear relatively similar in their pattern. Keep in mind that these five predictors could not be compared against `subscription` in the test set since the response is not included in that data set.

In [None]:
# Plot histogram for qualitative variables to look at distribution
camp_cat_melt_test = campaign_ad_test[['job', 'marital', 'education', 'device', 'outcome_old', 'X1', 'X2', 'X3']].melt()

ggplot(camp_cat_melt_test) + \
    aes('value') + \
    facet_wrap('variable', scales = 'free') + \
    geom_bar(fill = '#2A76B1', color = 'black', alpha = 0.5) + \
    theme_classic() + \
    theme(subplots_adjust = {'wspace': 0.5, 'hspace': 0.5},
          axis_text_x = element_text(angle = 45, vjust = 1, hjust = 1, size = 8),
          figure_size = (15, 10),
          axis_title_x = element_text(margin = {'t': 20})) + \
    labs(title = 'Histograms of Categorical Variables - Test',
         x = 'Value of Variable',
         y = 'Count')

Other than `subscription` not being present in the test set, the distributions of all the other categorical predictors look almost exactly like those in the training set.

Throughout each model in the **Attempted Models** section, there will be additional data exploration such as reviewing correlations, variance inflation factor (VIF), etc.

# III Data Imputation

# IV Variable Selection

Lasso regression was considered as a method to select variables that might be more relevant for prediction and would allow us to reduce the variables in our models. A logistic regression with $\ell_{1}$-norm penalty was generated.

The variables `banner_views_old`, `days_elapsed_old`, `X3`, `marital_divorced`, `job_entrepreneur`, `job_freelance`, `job_housekeeper`, `job_technology`, `job_unemployed` were all deemed of low predictive value as their $\beta$ coefficient was reduced to 0. However, the removal of these predictors from the data set showed a reduction of prediction accuracy on all data sets. This is the main reason why removing these columns is **not** considered in the final model.

For comparison, the "Boosted tree", XX, and XX models were tested with this modified data set, and all returned lower prediction accuracies on all data sets.
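
The selection step described in this section can be sketched with scikit-learn's `LogisticRegression` and an $\ell_{1}$ penalty. This is a minimal sketch on simulated data, not the notebook's actual fit: the feature names `f0` to `f5`, the simulated response, and the penalty strength `C=0.05` are all assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 400
names = [f"f{i}" for i in range(6)]  # hypothetical feature names

# Simulated predictors on a common scale
X = rng.normal(size=(n, 6))
# Only f0 and f1 drive the made-up response; f2..f5 are pure noise
y = (2 * X[:, 0] - 2 * X[:, 1] + rng.normal(size=n) > 0).astype(int)

# Smaller C means a stronger L1 penalty; C=0.05 is an arbitrary choice
lasso_logit = LogisticRegression(penalty="l1", solver="liblinear", C=0.05)
lasso_logit.fit(X, y)

# Predictors whose coefficient was shrunk exactly to zero are dropped
coefs = lasso_logit.coef_.ravel()
kept = [name for name, c in zip(names, coefs) if c != 0]
dropped = [name for name, c in zip(names, coefs) if c == 0]
print("kept:", kept)
print("dropped:", dropped)
```

Because the penalty acts on the size of the coefficients, predictors are usually standardized before such a fit so that the shrinkage treats them comparably.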

# V Attempted Models