# <center> Kaggle Competition Assignment <center> Ian Brandenburg (2304791) <center> [GitHub Repo](https://github.com/Iandrewburg/Data_Science/tree/main/Data_Science_2/Assignments/Take_Home_Final)
    
    
The Kaggle competition has launched; please register using this [link](https://www.kaggle.com/t/f79b637ede074e70a233661b4614083c).

You will find the training and test data in the data section of the competition, along with a description of the features. You will need to build models on the training data and make predictions on the test data and submit your solutions to Kaggle. You will also find a sample solution file in the data section that shows the format you will need to use for your own submissions.

The deadline for Kaggle solutions is 8PM on 19 April. You will be graded primarily on the basis of your work and how clearly you explain your methods and results. Those in the top three in the competition will receive some extra points. I expect you to experiment with all the methods we have covered: linear models, random forest, gradient boosting, neural networks + parameter tuning, feature engineering.

You will see the public score of your best model on the leaderboard. A private dataset will be used to evaluate the final performance of your model to avoid overfitting based on the leaderboard.

You should also submit to Moodle the documentation (ipynb and pdf) of your work, including exploratory data analysis, data cleaning, parameter tuning and evaluation. Aim for concise explanations.

### Import Libraries
---

In [1]:
# General utilities
import numpy as np
import pandas as pd
import time
import os
import warnings
from itertools import combinations

# Sklearn model selection, preprocessing, metrics, and ensemble methods
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.metrics import roc_auc_score
from sklearn.linear_model import LogisticRegression, LassoCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

# Sklearn pipeline utilities
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline, make_pipeline

# XGBoost
import xgboost as xgb

# Cat Boost Classifier
from catboost import CatBoostClassifier

# Light GBM
import lightgbm as lgb

# InterpretML for explainable boosting
from interpret.glassbox import ExplainableBoostingClassifier

# TensorFlow and Keras for neural networks
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, Conv1D, MaxPooling1D, Flatten, BatchNormalization
from tensorflow.keras.metrics import AUC
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.optimizers import Adam

# Suppress warnings
warnings.filterwarnings('ignore')


# Data Wrangling
---

## Data Import
---

In [2]:
train_data = pd.read_csv("https://raw.githubusercontent.com/Iandrewburg/Data_Science/main/Data_Science_2/Assignments/Take_Home_Final/train.csv")
train_data.head()


Unnamed: 0,timedelta,n_tokens_title,n_tokens_content,n_unique_tokens,n_non_stop_words,n_non_stop_unique_tokens,num_hrefs,num_self_hrefs,num_imgs,num_videos,...,max_positive_polarity,avg_negative_polarity,min_negative_polarity,max_negative_polarity,title_subjectivity,title_sentiment_polarity,abs_title_subjectivity,abs_title_sentiment_polarity,is_popular,article_id
0,594,9,702,0.454545,1.0,0.620438,11,2,1,0,...,1.0,-0.153395,-0.4,-0.1,0.0,0.0,0.5,0.0,0,1
1,346,8,1197,0.470143,1.0,0.666209,21,6,2,13,...,1.0,-0.308167,-1.0,-0.1,0.0,0.0,0.5,0.0,0,3
2,484,9,214,0.61809,1.0,0.748092,5,2,1,0,...,0.433333,-0.141667,-0.2,-0.05,0.0,0.0,0.5,0.0,0,5
3,639,8,249,0.621951,1.0,0.66474,16,5,8,0,...,0.5,-0.5,-0.8,-0.4,0.0,0.0,0.5,0.0,0,6
4,177,12,1219,0.397841,1.0,0.583578,21,1,1,2,...,0.8,-0.441111,-1.0,-0.05,0.0,0.0,0.5,0.0,0,7


In [3]:
test_data = pd.read_csv("https://raw.githubusercontent.com/Iandrewburg/Data_Science/main/Data_Science_2/Assignments/Take_Home_Final/test.csv")
test_data.head()


Unnamed: 0,timedelta,n_tokens_title,n_tokens_content,n_unique_tokens,n_non_stop_words,n_non_stop_unique_tokens,num_hrefs,num_self_hrefs,num_imgs,num_videos,...,min_positive_polarity,max_positive_polarity,avg_negative_polarity,min_negative_polarity,max_negative_polarity,title_subjectivity,title_sentiment_polarity,abs_title_subjectivity,abs_title_sentiment_polarity,article_id
0,134,11,217,0.631579,1.0,0.818966,4,2,2,0,...,0.136364,0.5,-0.17037,-0.2,-0.155556,0.288889,-0.155556,0.211111,0.155556,2
1,415,11,1041,0.489423,1.0,0.700321,22,3,0,14,...,0.05,1.0,-0.426268,-1.0,-0.1,0.975,0.3,0.475,0.3,4
2,625,9,486,0.599585,1.0,0.727273,4,3,1,0,...,0.0625,0.7,-0.387821,-1.0,-0.05,0.0,0.0,0.5,0.0,10
3,148,14,505,0.509018,1.0,0.718861,8,4,1,1,...,0.1,1.0,-0.284722,-0.4,-0.05,0.0,0.0,0.5,0.0,13
4,294,14,274,0.620301,1.0,0.72619,5,1,1,0,...,0.1,0.6,-0.333333,-0.333333,-0.333333,0.0,0.0,0.5,0.0,26


In [4]:
test_data.columns

Index(['timedelta', 'n_tokens_title', 'n_tokens_content', 'n_unique_tokens',
       'n_non_stop_words', 'n_non_stop_unique_tokens', 'num_hrefs',
       'num_self_hrefs', 'num_imgs', 'num_videos', 'average_token_length',
       'num_keywords', 'data_channel_is_lifestyle',
       'data_channel_is_entertainment', 'data_channel_is_bus',
       'data_channel_is_socmed', 'data_channel_is_tech',
       'data_channel_is_world', 'kw_min_min', 'kw_max_min', 'kw_avg_min',
       'kw_min_max', 'kw_max_max', 'kw_avg_max', 'kw_min_avg', 'kw_max_avg',
       'kw_avg_avg', 'self_reference_min_shares', 'self_reference_max_shares',
       'self_reference_avg_sharess', 'weekday_is_monday', 'weekday_is_tuesday',
       'weekday_is_wednesday', 'weekday_is_thursday', 'weekday_is_friday',
       'weekday_is_saturday', 'weekday_is_sunday', 'is_weekend', 'LDA_00',
       'LDA_01', 'LDA_02', 'LDA_03', 'LDA_04', 'global_subjectivity',
       'global_sentiment_polarity', 'global_rate_positive_words',
       'glob

## Exploratory Data Analysis
---

### Variable Descriptions
---


    timedelta: Days between the article publication and the dataset acquisition (non-predictive)
    n_tokens_title: Number of words in the title
    n_tokens_content: Number of words in the content
    n_unique_tokens: Rate of unique words in the content
    n_non_stop_words: Rate of non-stop words in the content
    n_non_stop_unique_tokens: Rate of unique non-stop words in the content
    num_hrefs: Number of links
    num_self_hrefs: Number of links to other articles published by Mashable
    num_imgs: Number of images
    num_videos: Number of videos
    average_token_length: Average length of the words in the content
    num_keywords: Number of keywords in the metadata
    data_channel_is_lifestyle: Is data channel 'Lifestyle'?
    data_channel_is_entertainment: Is data channel 'Entertainment'?
    data_channel_is_bus: Is data channel 'Business'?
    data_channel_is_socmed: Is data channel 'Social Media'?
    data_channel_is_tech: Is data channel 'Tech'?
    data_channel_is_world: Is data channel 'World'?
    kw_min_min: Worst keyword (min. shares)
    kw_max_min: Worst keyword (max. shares)
    kw_avg_min: Worst keyword (avg. shares)
    kw_min_max: Best keyword (min. shares)
    kw_max_max: Best keyword (max. shares)
    kw_avg_max: Best keyword (avg. shares)
    kw_min_avg: Avg. keyword (min. shares)
    kw_max_avg: Avg. keyword (max. shares)
    kw_avg_avg: Avg. keyword (avg. shares)
    self_reference_min_shares: Min. shares of referenced articles in Mashable
    self_reference_max_shares: Max. shares of referenced articles in Mashable
    self_reference_avg_sharess: Avg. shares of referenced articles in Mashable
    weekday_is_monday: Was the article published on a Monday?
    weekday_is_tuesday: Was the article published on a Tuesday?
    weekday_is_wednesday: Was the article published on a Wednesday?
    weekday_is_thursday: Was the article published on a Thursday?
    weekday_is_friday: Was the article published on a Friday?
    weekday_is_saturday: Was the article published on a Saturday?
    weekday_is_sunday: Was the article published on a Sunday?
    is_weekend: Was the article published on the weekend?
    LDA_00: Closeness to LDA topic 0
    LDA_01: Closeness to LDA topic 1
    LDA_02: Closeness to LDA topic 2
    LDA_03: Closeness to LDA topic 3
    LDA_04: Closeness to LDA topic 4
    global_subjectivity: Text subjectivity
    global_sentiment_polarity: Text sentiment polarity
    global_rate_positive_words: Rate of positive words in the content
    global_rate_negative_words: Rate of negative words in the content
    rate_positive_words: Rate of positive words among non-neutral tokens
    rate_negative_words: Rate of negative words among non-neutral tokens
    avg_positive_polarity: Avg. polarity of positive words
    min_positive_polarity: Min. polarity of positive words
    max_positive_polarity: Max. polarity of positive words
    avg_negative_polarity: Avg. polarity of negative words
    min_negative_polarity: Min. polarity of negative words
    max_negative_polarity: Max. polarity of negative words
    title_subjectivity: Title subjectivity
    title_sentiment_polarity: Title polarity
    abs_title_subjectivity: Absolute subjectivity level
    abs_title_sentiment_polarity: Absolute polarity level
    is_popular: Whether or not the article was among the most popular ones based on shares on social media
    article_id: Unique identifier of the article


In [5]:
train_data.describe()

Unnamed: 0,timedelta,n_tokens_title,n_tokens_content,n_unique_tokens,n_non_stop_words,n_non_stop_unique_tokens,num_hrefs,num_self_hrefs,num_imgs,num_videos,...,max_positive_polarity,avg_negative_polarity,min_negative_polarity,max_negative_polarity,title_subjectivity,title_sentiment_polarity,abs_title_subjectivity,abs_title_sentiment_polarity,is_popular,article_id
count,29733.0,29733.0,29733.0,29733.0,29733.0,29733.0,29733.0,29733.0,29733.0,29733.0,...,29733.0,29733.0,29733.0,29733.0,29733.0,29733.0,29733.0,29733.0,29733.0,29733.0
mean,355.645646,10.390812,545.008274,0.555076,1.005852,0.695432,10.91269,3.290788,4.524535,1.263546,...,0.75778,-0.259709,-0.520981,-0.107793,0.281878,0.069691,0.341427,0.155234,0.121649,19834.91353
std,214.288261,2.110135,469.358037,4.064572,6.039655,3.768796,11.316508,3.840874,8.213823,4.18908,...,0.247293,0.128488,0.290454,0.095672,0.323461,0.264379,0.188735,0.225066,0.326886,11432.376037
min,8.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,-1.0,-1.0,-1.0,0.0,-1.0,0.0,0.0,0.0,1.0
25%,164.0,9.0,246.0,0.4714,1.0,0.626126,4.0,1.0,1.0,0.0,...,0.6,-0.328704,-0.7,-0.125,0.0,0.0,0.166667,0.0,0.0,9965.0
50%,342.0,10.0,409.0,0.539894,1.0,0.690566,8.0,2.0,1.0,0.0,...,0.8,-0.252827,-0.5,-0.1,0.144444,0.0,0.5,0.0,0.0,19859.0
75%,545.0,12.0,712.0,0.609375,1.0,0.755208,14.0,4.0,4.0,1.0,...,1.0,-0.186494,-0.3,-0.05,0.5,0.136364,0.5,0.25,0.0,29742.0
max,731.0,23.0,8474.0,701.0,1042.0,650.0,304.0,74.0,111.0,91.0,...,1.0,0.0,0.0,0.0,1.0,1.0,0.5,1.0,1.0,39643.0
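
The summary above reports maxima of 701, 1042 and 650 for `n_unique_tokens`, `n_non_stop_words` and `n_non_stop_unique_tokens`, even though the variable descriptions define all three as rates. A minimal sketch for locating such rows, assuming a rate should lie in [0, 1]. The helper name `find_out_of_range` and the toy frame are hypothetical and not part of the original notebook:

```python
import pandas as pd

def find_out_of_range(df, columns, lower=0.0, upper=1.0):
    """Return the rows where any of `columns` falls outside [lower, upper]."""
    mask = ((df[columns] < lower) | (df[columns] > upper)).any(axis=1)
    return df.loc[mask]

# Toy data standing in for train_data (values are illustrative only)
toy = pd.DataFrame({
    "n_unique_tokens": [0.45, 701.0, 0.62],
    "n_non_stop_words": [1.0, 1042.0, 1.0],
})

suspect_rows = find_out_of_range(toy, ["n_unique_tokens", "n_non_stop_words"])
print(suspect_rows.index.tolist())
```

On the real data the same call could be run with `train_data` and the three rate columns; whether to drop, cap, or keep the flagged rows is a separate modelling decision.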


In [6]:
print(f"The shape of the training set is {train_data.shape[0]} rows, and {train_data.shape[1]} columns.") 

The shape of the training set is 29733 rows, and 61 columns.


In [7]:
total_missing_values = train_data.isnull().sum().sum()
print(f"There are a total of {total_missing_values} missing values in the dataset.")


There are a total of 0 missing values in the dataset.


In [8]:
train_data.columns

Index(['timedelta', 'n_tokens_title', 'n_tokens_content', 'n_unique_tokens',
       'n_non_stop_words', 'n_non_stop_unique_tokens', 'num_hrefs',
       'num_self_hrefs', 'num_imgs', 'num_videos', 'average_token_length',
       'num_keywords', 'data_channel_is_lifestyle',
       'data_channel_is_entertainment', 'data_channel_is_bus',
       'data_channel_is_socmed', 'data_channel_is_tech',
       'data_channel_is_world', 'kw_min_min', 'kw_max_min', 'kw_avg_min',
       'kw_min_max', 'kw_max_max', 'kw_avg_max', 'kw_min_avg', 'kw_max_avg',
       'kw_avg_avg', 'self_reference_min_shares', 'self_reference_max_shares',
       'self_reference_avg_sharess', 'weekday_is_monday', 'weekday_is_tuesday',
       'weekday_is_wednesday', 'weekday_is_thursday', 'weekday_is_friday',
       'weekday_is_saturday', 'weekday_is_sunday', 'is_weekend', 'LDA_00',
       'LDA_01', 'LDA_02', 'LDA_03', 'LDA_04', 'global_subjectivity',
       'global_sentiment_polarity', 'global_rate_positive_words',
       'glob

## Feature Engineering
---

In [9]:
# Defining variable groups
basic_text_features = ['n_tokens_title',
                       'n_tokens_content',
                       'n_unique_tokens',
                       'n_non_stop_words',
                       'n_non_stop_unique_tokens',
                       'average_token_length',
                       'num_keywords']
content_properties = ['num_hrefs',
                      'num_self_hrefs',
                      'num_imgs',
                      'num_videos',
                      'global_subjectivity',
                      'global_sentiment_polarity',
                      'global_rate_positive_words',
                      'global_rate_negative_words']
keyword_performance = ['kw_min_min',
                       'kw_max_min',
                       'kw_avg_min',
                       'kw_min_max',
                       'kw_max_max',
                       'kw_avg_max',
                       'kw_min_avg',
                       'kw_max_avg',
                       'kw_avg_avg']
self_reference_metrics = ['self_reference_min_shares',
                          'self_reference_max_shares',
                          'self_reference_avg_sharess']

# Dropped 'weekday_is_monday' (reference category) and 'is_weekend' (redundant with the Saturday and Sunday dummies)
publication_timing = ['weekday_is_tuesday',
                      'weekday_is_wednesday',
                      'weekday_is_thursday',
                      'weekday_is_friday',
                      'weekday_is_saturday',
                      'weekday_is_sunday']

# dropped 'data_channel_is_lifestyle'
content_topic_and_sentiment = ['data_channel_is_entertainment',
                               'data_channel_is_bus',
                               'data_channel_is_socmed',
                               'data_channel_is_tech',
                               'data_channel_is_world',
                               'LDA_00',
                               'LDA_01',
                               'LDA_02',
                               'LDA_03',
                               'LDA_04',
                               'rate_positive_words',
                               'rate_negative_words',
                               'avg_positive_polarity',
                               'min_positive_polarity', 
                               'max_positive_polarity',
                               'avg_negative_polarity',
                               'min_negative_polarity',
                               'max_negative_polarity']
title_sentiment = ['title_subjectivity',
                   'title_sentiment_polarity',
                   'abs_title_subjectivity',
                   'abs_title_sentiment_polarity']


In [10]:
def square_features(variables, df):
    """Add a squared copy of each variable to df in place; return the new column names."""
    squared_features = []
    for var in variables:
        feature_name = f'{var}_squared'
        df[feature_name] = df[var] ** 2
        squared_features.append(feature_name)
    return squared_features

def cube_features(variables, df):
    """Add a cubed copy of each variable to df in place; return the new column names."""
    cubed_features = []
    for var in variables:
        feature_name = f'{var}_cubed'
        df[feature_name] = df[var] ** 3
        cubed_features.append(feature_name)
    return cubed_features

def interact_features(variables, df):
    """Add the product of every pair of variables to df in place; return the new column names."""
    interacted_features = []
    for (var1, var2) in combinations(variables, 2):
        feature_name = f'{var1}_{var2}_interaction'
        df[feature_name] = df[var1] * df[var2]
        interacted_features.append(feature_name)
    return interacted_features
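
The helpers above modify the DataFrame in place and return only the names of the new columns, which is why each one is called once on `train_data` (keeping the names) and once on `test_data` (discarding them). A small self-contained sketch of that behaviour on a toy frame; the column names `a`, `b`, `c` are hypothetical:

```python
from itertools import combinations

import pandas as pd

def interact_features(variables, df):
    """Add pairwise products of `variables` to `df` in place; return the new names."""
    names = []
    for var1, var2 in combinations(variables, 2):
        name = f"{var1}_{var2}_interaction"
        df[name] = df[var1] * df[var2]
        names.append(name)
    return names

toy = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})
new_cols = interact_features(["a", "b", "c"], toy)

print(new_cols)                        # names of the three pairwise products
print(toy["a_b_interaction"].tolist()) # element-wise product of a and b
```

A group of n variables yields n(n-1)/2 interaction columns, so the nine keyword features alone contribute 36 new columns.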

In [11]:
##################SQUARED TERMS###################
# square basic features
sqrd_basic_text_features = square_features(basic_text_features, train_data)
square_features(basic_text_features, test_data)
    
# square title sentiment features
sqrd_title_sentiment = square_features(title_sentiment, train_data)
square_features(title_sentiment, test_data)

# square content properties
sqrd_content_properties = square_features(content_properties, train_data)
square_features(content_properties, test_data)

# square keyword performance
sqrd_keyword_performance = square_features(keyword_performance, train_data)
square_features(keyword_performance, test_data)

# square self reference metrics
sqrd_self_reference_metrics = square_features(self_reference_metrics, train_data)
square_features(self_reference_metrics, test_data)

##################CUBED TERMS###################
# CUBED basic features
cube_basic_text_features = cube_features(basic_text_features, train_data)
cube_features(basic_text_features, test_data)
    
# CUBED title sentiment features
cube_title_sentiment = cube_features(title_sentiment, train_data)
cube_features(title_sentiment, test_data)

# CUBED content properties
cube_content_properties = cube_features(content_properties, train_data)
cube_features(content_properties, test_data)

# CUBED keyword performance
cube_keyword_performance = cube_features(keyword_performance, train_data)
cube_features(keyword_performance, test_data)

# CUBED self reference metrics
cube_self_reference_metrics = cube_features(self_reference_metrics, train_data)
cube_features(self_reference_metrics, test_data)
  
################INTERACTION TERMS##################
# Interacting the basic features
interaction_basic_text_features = interact_features(basic_text_features, train_data)
interact_features(basic_text_features, test_data)

# Interacting the title sentiment features
interaction_title_sentiment = interact_features(title_sentiment, train_data)
interact_features(title_sentiment, test_data)

# Interacting content properties
interaction_content_properties = interact_features(content_properties, train_data)
interact_features(content_properties, test_data)

# Interacting keyword performance
interaction_keyword_performance = interact_features(keyword_performance, train_data)
interact_features(keyword_performance, test_data)

# Interacting self reference metrics
interaction_self_reference_metrics = interact_features(self_reference_metrics, train_data)
interact_features(self_reference_metrics, test_data)


['self_reference_min_shares_self_reference_max_shares_interaction',
 'self_reference_min_shares_self_reference_avg_sharess_interaction',
 'self_reference_max_shares_self_reference_avg_sharess_interaction']

In [12]:
perm_importance_variables = ['n_tokens_title',
                             'n_tokens_content',
                             'n_unique_tokens',
                             'n_non_stop_words',
                             'n_non_stop_unique_tokens',
                             'average_token_length',
                             'num_keywords',
                             'num_hrefs', 
                             'num_self_hrefs', 
                             'num_imgs', 
                             'num_videos', 
                             'global_subjectivity', 
                             'global_sentiment_polarity', 
                             'kw_min_min', 
                             'kw_max_min', 
                             'kw_avg_min', 
                             'kw_min_max', 
                             'kw_max_max',
                             'kw_avg_max', 
                             'kw_min_avg', 
                             'kw_max_avg', 
                             'kw_avg_avg', 
                             'self_reference_min_shares', 
                             'self_reference_max_shares', 
                             'self_reference_avg_sharess',
                             'weekday_is_thursday', 
                             'weekday_is_friday',
                             'weekday_is_sunday', 
                             'data_channel_is_entertainment',
                             'data_channel_is_bus', 
                             'data_channel_is_socmed', 
                             'data_channel_is_tech', 
                             'data_channel_is_world', 
                             'LDA_00', 
                             'LDA_01',
                             'LDA_02', 
                             'LDA_03', 
                             'LDA_04', 
                             'rate_positive_words', 
                             'avg_positive_polarity',
                             'min_positive_polarity', 
                             'avg_negative_polarity', 
                             'min_negative_polarity', 
                             'max_negative_polarity', 
                             'title_subjectivity',
                             'abs_title_subjectivity']


In [13]:
test_data.columns

Index(['timedelta', 'n_tokens_title', 'n_tokens_content', 'n_unique_tokens',
       'n_non_stop_words', 'n_non_stop_unique_tokens', 'num_hrefs',
       'num_self_hrefs', 'num_imgs', 'num_videos',
       ...
       'kw_max_max_kw_avg_avg_interaction',
       'kw_avg_max_kw_min_avg_interaction',
       'kw_avg_max_kw_max_avg_interaction',
       'kw_avg_max_kw_avg_avg_interaction',
       'kw_min_avg_kw_max_avg_interaction',
       'kw_min_avg_kw_avg_avg_interaction',
       'kw_max_avg_kw_avg_avg_interaction',
       'self_reference_min_shares_self_reference_max_shares_interaction',
       'self_reference_min_shares_self_reference_avg_sharess_interaction',
       'self_reference_max_shares_self_reference_avg_sharess_interaction'],
      dtype='object', length=216)

In [14]:
# Define models
models = {
    'M1': basic_text_features,
    'M2': basic_text_features + content_properties,
    'M3': basic_text_features + content_properties + keyword_performance,
    'M4': basic_text_features + content_properties + keyword_performance + self_reference_metrics,
    'M5': basic_text_features + content_properties + keyword_performance + self_reference_metrics + publication_timing,
    'M6': basic_text_features + content_properties + keyword_performance + self_reference_metrics + publication_timing + content_topic_and_sentiment,
    'M7': basic_text_features + content_properties + keyword_performance + self_reference_metrics + publication_timing + content_topic_and_sentiment + title_sentiment, 
    'M8': basic_text_features + content_properties + keyword_performance + self_reference_metrics + publication_timing + content_topic_and_sentiment + title_sentiment + sqrd_title_sentiment,
    'M9': basic_text_features + content_properties + keyword_performance + self_reference_metrics + publication_timing + content_topic_and_sentiment + title_sentiment + sqrd_title_sentiment + sqrd_basic_text_features,
    'M10': basic_text_features + content_properties + keyword_performance + self_reference_metrics + publication_timing + content_topic_and_sentiment + title_sentiment + sqrd_title_sentiment + sqrd_basic_text_features + interaction_basic_text_features,
    'M11': basic_text_features + content_properties + keyword_performance + self_reference_metrics + publication_timing + content_topic_and_sentiment + title_sentiment + sqrd_title_sentiment + sqrd_basic_text_features + interaction_basic_text_features + interaction_title_sentiment,
    'M12': perm_importance_variables,
    'M13': basic_text_features + content_properties + keyword_performance + self_reference_metrics + publication_timing + content_topic_and_sentiment + title_sentiment + sqrd_content_properties + sqrd_keyword_performance,
    'M14': basic_text_features + content_properties + keyword_performance + self_reference_metrics + publication_timing + content_topic_and_sentiment + title_sentiment + sqrd_title_sentiment + sqrd_basic_text_features + sqrd_content_properties,
    'M15': basic_text_features + content_properties + keyword_performance + self_reference_metrics + publication_timing + content_topic_and_sentiment + title_sentiment + sqrd_title_sentiment + sqrd_basic_text_features + sqrd_content_properties + sqrd_keyword_performance,
    'M16': basic_text_features + content_properties + keyword_performance + self_reference_metrics + publication_timing + content_topic_and_sentiment + title_sentiment + sqrd_title_sentiment + sqrd_basic_text_features + sqrd_content_properties + sqrd_keyword_performance + sqrd_self_reference_metrics,
    'M17': basic_text_features + content_properties + keyword_performance + self_reference_metrics + publication_timing + content_topic_and_sentiment + title_sentiment + interaction_content_properties,
    'M18': basic_text_features + content_properties + keyword_performance + self_reference_metrics + publication_timing + content_topic_and_sentiment + title_sentiment + interaction_content_properties + interaction_keyword_performance,
    'M19': basic_text_features + content_properties + keyword_performance + self_reference_metrics + publication_timing + content_topic_and_sentiment + title_sentiment + interaction_content_properties + interaction_keyword_performance + interaction_self_reference_metrics,
    'M20': basic_text_features + content_properties + keyword_performance + self_reference_metrics + publication_timing + content_topic_and_sentiment + title_sentiment + interaction_content_properties + interaction_keyword_performance + interaction_self_reference_metrics + sqrd_title_sentiment + sqrd_basic_text_features + sqrd_content_properties + sqrd_keyword_performance + sqrd_self_reference_metrics,
    'M21': basic_text_features + content_properties + keyword_performance + self_reference_metrics + publication_timing + content_topic_and_sentiment + title_sentiment + interaction_content_properties + interaction_basic_text_features + interaction_title_sentiment + interaction_keyword_performance + interaction_self_reference_metrics + sqrd_title_sentiment + sqrd_basic_text_features + sqrd_content_properties + sqrd_keyword_performance + sqrd_self_reference_metrics,
    'M22': basic_text_features + content_properties + keyword_performance + self_reference_metrics + publication_timing + content_topic_and_sentiment + title_sentiment + sqrd_title_sentiment + cube_title_sentiment, 
    'M23': basic_text_features + content_properties + keyword_performance + self_reference_metrics + publication_timing + content_topic_and_sentiment + title_sentiment + sqrd_title_sentiment + cube_title_sentiment + sqrd_basic_text_features + cube_basic_text_features, 
    'M24': basic_text_features + content_properties + keyword_performance + self_reference_metrics + publication_timing + content_topic_and_sentiment + title_sentiment + sqrd_title_sentiment + cube_title_sentiment + sqrd_basic_text_features + cube_basic_text_features + sqrd_content_properties + cube_content_properties, 
    'M25': basic_text_features + content_properties + keyword_performance + self_reference_metrics + publication_timing + content_topic_and_sentiment + title_sentiment + sqrd_title_sentiment + cube_title_sentiment + sqrd_basic_text_features + cube_basic_text_features + sqrd_content_properties + cube_content_properties + sqrd_keyword_performance + cube_keyword_performance, 
    'M26': basic_text_features + content_properties + keyword_performance + self_reference_metrics + publication_timing + content_topic_and_sentiment + title_sentiment + sqrd_title_sentiment + cube_title_sentiment + sqrd_basic_text_features + cube_basic_text_features + sqrd_content_properties + cube_content_properties + sqrd_keyword_performance + cube_keyword_performance + sqrd_self_reference_metrics + cube_self_reference_metrics, 
    'M27': basic_text_features + content_properties + keyword_performance + self_reference_metrics + publication_timing + content_topic_and_sentiment + title_sentiment + sqrd_title_sentiment + cube_title_sentiment + sqrd_basic_text_features + cube_basic_text_features + sqrd_content_properties + cube_content_properties + sqrd_keyword_performance + cube_keyword_performance + sqrd_self_reference_metrics + cube_self_reference_metrics + interaction_content_properties, 
    'M28': basic_text_features + content_properties + keyword_performance + self_reference_metrics + publication_timing + content_topic_and_sentiment + title_sentiment + sqrd_title_sentiment + cube_title_sentiment + sqrd_basic_text_features + cube_basic_text_features + sqrd_content_properties + cube_content_properties + sqrd_keyword_performance + cube_keyword_performance + sqrd_self_reference_metrics + cube_self_reference_metrics + interaction_content_properties + interaction_basic_text_features, 
    'M29': basic_text_features + content_properties + keyword_performance + self_reference_metrics + publication_timing + content_topic_and_sentiment + title_sentiment + sqrd_title_sentiment + cube_title_sentiment + sqrd_basic_text_features + cube_basic_text_features + sqrd_content_properties + cube_content_properties + sqrd_keyword_performance + cube_keyword_performance + sqrd_self_reference_metrics + cube_self_reference_metrics + interaction_content_properties + interaction_basic_text_features + interaction_title_sentiment, 
    'M30': basic_text_features + content_properties + keyword_performance + self_reference_metrics + publication_timing + content_topic_and_sentiment + title_sentiment + sqrd_title_sentiment + cube_title_sentiment + sqrd_basic_text_features + cube_basic_text_features + sqrd_content_properties + cube_content_properties + sqrd_keyword_performance + cube_keyword_performance + sqrd_self_reference_metrics + cube_self_reference_metrics + interaction_content_properties + interaction_basic_text_features + interaction_title_sentiment + interaction_keyword_performance, 
    'M31': basic_text_features + content_properties + keyword_performance + self_reference_metrics + publication_timing + content_topic_and_sentiment + title_sentiment + sqrd_title_sentiment + cube_title_sentiment + sqrd_basic_text_features + cube_basic_text_features + sqrd_content_properties + cube_content_properties + sqrd_keyword_performance + cube_keyword_performance + sqrd_self_reference_metrics + cube_self_reference_metrics + interaction_content_properties + interaction_basic_text_features + interaction_title_sentiment + interaction_keyword_performance + interaction_self_reference_metrics 


}
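
Models M1 to M7 are nested: each adds one feature group to the previous model. As a sketch only, such nested sets can be generated with `itertools.accumulate` instead of being written out by hand. The shortened group lists and the name `models_sketch` below are hypothetical stand-ins for the lists defined in the notebook:

```python
from itertools import accumulate

# Hypothetical, shortened feature groups (stand-ins for the notebook's lists)
groups = [
    ["n_tokens_title", "n_tokens_content"],  # basic_text_features
    ["num_hrefs", "num_imgs"],               # content_properties
    ["kw_avg_avg"],                          # keyword_performance
]

# Running concatenation: [g1], [g1 + g2], [g1 + g2 + g3]
nested = list(accumulate(groups, lambda acc, g: acc + g))
models_sketch = {f"M{i}": feats for i, feats in enumerate(nested, start=1)}

for name, feats in models_sketch.items():
    print(name, len(feats))
```

The non-nested specifications (for example M12 or M13) would still need to be listed explicitly.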


In [15]:
# Split 'train_data' into training and validation sets
X = train_data.drop(['is_popular', 'timedelta', 'article_id'], axis=1)
y = train_data['is_popular']
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=20240407)
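
The `describe()` output gives a mean of about 0.12 for `is_popular`, so positives are a minority class. The split above is purely random; an alternative, shown here only as a sketch and not what the notebook does, is to pass `stratify` so both parts keep the same positive rate. The toy data below is synthetic:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in: 1000 rows with 12% positives
X_toy = pd.DataFrame({"x": rng.normal(size=1000)})
y_toy = pd.Series(np.r_[np.ones(120, dtype=int), np.zeros(880, dtype=int)])

X_tr, X_va, y_tr, y_va = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=20240407, stratify=y_toy
)

# Positive rate in each part
print(y_tr.mean(), y_va.mean())
```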


# Models
---

In [16]:
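def calculateRMSLE(prediction, y_obs):
    """Root mean squared logarithmic error; negative predictions are clipped to 0."""
    clipped = np.where(prediction < 0, 0, prediction)
    squared_log_error = (np.log1p(clipped) - np.log1p(y_obs)) ** 2
    return round(np.sqrt(np.mean(squared_log_error)), 4)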
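
RMSLE is most often used for regression targets. For predicted probabilities against 0/1 labels, log loss and the Brier score are common alternatives available in scikit-learn. The sketch below computes all three on toy values so they can be compared; the alternatives are a suggestion and not part of the original notebook:

```python
import numpy as np
from sklearn.metrics import brier_score_loss, log_loss

# Toy labels and predicted probabilities of the positive class
y_true = np.array([0, 0, 1, 1])
p_hat = np.array([0.1, 0.2, 0.7, 0.9])

# RMSLE with the same definition as calculateRMSLE
clipped = np.clip(p_hat, 0, None)
rmsle = np.sqrt(np.mean((np.log1p(clipped) - np.log1p(y_true)) ** 2))

# Probability-specific metrics
ll = log_loss(y_true, p_hat)
brier = brier_score_loss(y_true, p_hat)

print(round(rmsle, 4), round(ll, 4), round(brier, 4))
```

All three are lower-is-better, but only log loss and the Brier score are proper scoring rules for probabilities.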

In [17]:
# Initialize results list
results = []

## Logistic Regression
---

### Simple Logistic Regression
---

In [18]:
for model_name, features in models.items():
    # Append "Logistic Regression" to the model name for clarity
    full_model_name = f"{model_name} Logistic Regression"

    # Define steps for pipeline: feature scaling and logistic regression
    steps = [
        ("scale_features", ColumnTransformer([("scale", StandardScaler(), features)], remainder='drop')),
        ("log_reg", LogisticRegression())
    ]

    # Create pipeline
    pipeline = Pipeline(steps)

    # Fit the model on training data
    pipeline.fit(X_train[features], y_train)

    # Predict probabilities on the training and validation data
    # Note: predict_proba returns one column per class; column 1 holds the probability of the positive class (is_popular = 1)
    train_prob = pipeline.predict_proba(X_train[features])[:, 1]
    val_prob = pipeline.predict_proba(X_val[features])[:, 1]

    # Calculate AUC
    train_auc = roc_auc_score(y_train, train_prob)
    val_auc = roc_auc_score(y_val, val_prob)
    
    # Calculate RMSLE
    train_rmsle = calculateRMSLE(train_prob, y_train)
    val_rmsle = calculateRMSLE(val_prob, y_val)

    # Append results
    results.append([full_model_name, train_auc, val_auc, train_rmsle, val_rmsle])

results_df = pd.DataFrame(results, columns=['Model', 'Training AUC', 'Validation AUC', 'Training RMSLE', 'Validation RMSLE'])

results_df.tail(31)


| | Model | Training AUC | Validation AUC | Training RMSLE | Validation RMSLE |
|---|---|---|---|---|---|
| 0 | M1 Logistic Regression | 0.548108 | 0.555135 | 0.2271 | 0.2314 |
| 1 | M2 Logistic Regression | 0.624687 | 0.62781 | 0.2253 | 0.2291 |
| 2 | M3 Logistic Regression | 0.682657 | 0.686424 | 0.2225 | 0.2259 |
| 3 | M4 Logistic Regression | 0.686342 | 0.688129 | 0.2224 | 0.2259 |
| 4 | M5 Logistic Regression | 0.687915 | 0.684988 | 0.2223 | 0.226 |
| 5 | M6 Logistic Regression | 0.693311 | 0.694309 | 0.222 | 0.2255 |
| 6 | M7 Logistic Regression | 0.694318 | 0.695099 | 0.2219 | 0.2253 |
| 7 | M8 Logistic Regression | 0.695176 | 0.696331 | 0.2219 | 0.2251 |
| 8 | M9 Logistic Regression | 0.695775 | 0.694353 | 0.2218 | 0.2252 |
| 9 | M10 Logistic Regression | 0.699305 | 0.693886 | 0.2216 | 0.2251 |
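
To compare the feature sets at a glance, the rows printed above can be ranked by validation AUC. The sketch below copies the ten displayed rows into a small DataFrame; the name `log_reg_results` and the `AUC gap` column are assumptions added for illustration.

```python
import pandas as pd

# Values copied from the logistic regression results shown above
log_reg_results = pd.DataFrame({
    "Model": [f"M{i} Logistic Regression" for i in range(1, 11)],
    "Training AUC": [0.548108, 0.624687, 0.682657, 0.686342, 0.687915,
                     0.693311, 0.694318, 0.695176, 0.695775, 0.699305],
    "Validation AUC": [0.555135, 0.627810, 0.686424, 0.688129, 0.684988,
                       0.694309, 0.695099, 0.696331, 0.694353, 0.693886],
})

# Gap between training and validation AUC as a rough overfitting indicator
log_reg_results["AUC gap"] = log_reg_results["Training AUC"] - log_reg_results["Validation AUC"]

best_row = log_reg_results.loc[log_reg_results["Validation AUC"].idxmax()]
print(best_row["Model"], best_row["Validation AUC"])
print(log_reg_results.sort_values("Validation AUC", ascending=False).head(3))
```

On these ten rows, validation AUC peaks at M8 and the train/validation gap is largest for M10, which suggests that the larger feature sets add little for a plain logistic regression.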


### Tuned Logistic Regression
---

In [None]:
for model_name, features in models.items():

    start_time = time.time()
    # Define steps for pipeline: feature scaling and logistic regression
    steps = [
        ("scale_features", ColumnTransformer([("scale", StandardScaler(), features)], remainder='drop')),
        ("log_reg", LogisticRegression(solver='liblinear'))
    ]

    # Create pipeline
    pipeline = Pipeline(steps)

    # Define a range of inverse regularization strength `C`
    param_grid = {
        'log_reg__C': [0.001, 0.01, 0.1, 1, 10, 100],
        'log_reg__penalty': ['l2']  # L2 regularization
    }

    grid_search = GridSearchCV(pipeline, param_grid, cv=5, scoring='roc_auc')
    grid_search.fit(X_train[features], y_train)

    best_model = grid_search.best_estimator_
    train_prob = best_model.predict_proba(X_train[features])[:, 1]
    val_prob = best_model.predict_proba(X_val[features])[:, 1]

    train_auc = roc_auc_score(y_train, train_prob)
    val_auc = roc_auc_score(y_val, val_prob)
    
    # Calculate RMSLE
    train_rmsle = calculateRMSLE(train_prob, y_train)
    val_rmsle = calculateRMSLE(val_prob, y_val)

    # Append results
    results.append([f"{model_name} Logistic Regression Tuned", train_auc, val_auc, train_rmsle, val_rmsle])
    
    end_time = time.time()  # End timer
    print(f"Completed {model_name} in {end_time - start_time:.2f} seconds")

results_df = pd.DataFrame(results, columns=['Model', 'Training AUC', 'Validation AUC', 'Training RMSLE', 'Validation RMSLE'])

results_df.tail(31)


Completed M1 in 0.98 seconds
Completed M2 in 1.91 seconds
Completed M3 in 4.39 seconds
Completed M4 in 4.96 seconds
Completed M5 in 6.05 seconds
Completed M6 in 10.84 seconds
Completed M7 in 12.25 seconds
Completed M8 in 17.74 seconds
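
The loop above keeps only the best estimator from each grid search. To see which value of `C` was selected and how close the alternatives were, `best_params_` and `cv_results_` can be inspected. The sketch below does this on synthetic data from `make_classification`, which is a stand-in assumption for the competition data; the names `X_demo`, `y_demo`, `pipe`, and `cv_table` are illustrative.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption; the notebook uses the competition data)
X_demo, y_demo = make_classification(n_samples=400, n_features=8, random_state=0)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("log_reg", LogisticRegression(solver="liblinear")),
])
grid = {"log_reg__C": [0.001, 0.01, 0.1, 1, 10, 100], "log_reg__penalty": ["l2"]}

search = GridSearchCV(pipe, grid, cv=5, scoring="roc_auc")
search.fit(X_demo, y_demo)

# One row per candidate value of C with its mean cross-validated AUC
cv_table = pd.DataFrame(search.cv_results_)[["param_log_reg__C", "mean_test_score"]]
print(search.best_params_)
print(cv_table)
```

If the mean scores are nearly flat across `C`, regularization strength matters little for that feature set and the tuned model should perform about the same as the untuned one.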


## Lasso Model
---

In [None]:
for group_name, features in models.items():
    start_time = time.time()  # Start timer

    steps = [
        ("scale_features", ColumnTransformer([("scale_numeric_features", MinMaxScaler(), features)], remainder='drop')),
        ("lasso", LassoCV())
    ]
    pipe_lasso = Pipeline(steps)
    pipe_lasso.fit(X_train[features], y_train)

    train_scores = pipe_lasso.predict(X_train[features])
    val_scores = pipe_lasso.predict(X_val[features])

    # AUC is a ranking metric, so compute it on the continuous Lasso outputs;
    # thresholding them into 0/1 predictions first would discard the ranking
    train_auc = roc_auc_score(y_train, train_scores)
    val_auc = roc_auc_score(y_val, val_scores)

    # Calculate RMSLE on the same continuous outputs (calculateRMSLE clips negative values to 0)
    train_rmsle = calculateRMSLE(train_scores, y_train)
    val_rmsle = calculateRMSLE(val_scores, y_val)

    new_row = pd.DataFrame([[f"{group_name} Lasso", train_auc, val_auc, train_rmsle, val_rmsle]], 
                           columns=['Model', 'Training AUC', 'Validation AUC', 'Training RMSLE', 'Validation RMSLE'])
    results_df = pd.concat([results_df, new_row], ignore_index=True)

    end_time = time.time()  # End timer
    print(f"Completed {group_name} in {end_time - start_time:.2f} seconds")

results_df.tail(31)
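
AUC measures how well a model ranks positives above negatives, so it is sensitive to whether it receives continuous scores or thresholded 0/1 predictions. The toy example below (hand-made values, assumed purely for illustration) computes both versions with `roc_auc_score` and a median threshold.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hand-made labels and continuous scores (assumed for illustration)
y_true = np.array([0, 0, 1, 1, 0, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7])

# Threshold the scores at their median to obtain hard 0/1 predictions
threshold = np.median(scores)
binary_pred = np.where(scores > threshold, 1, 0)

auc_from_scores = roc_auc_score(y_true, scores)       # uses the full ranking
auc_from_binary = roc_auc_score(y_true, binary_pred)  # only two distinct values remain

print(auc_from_scores, auc_from_binary)
```

Here the continuous scores give an AUC of 8/9, while the thresholded predictions give 2/3, because thresholding turns many correctly ordered pairs into ties.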

## Stacking Model
---

In [None]:
from sklearn.ensemble import StackingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
import xgboost as xgb
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import roc_auc_score

# Define your base models
base_models = [
    ('dt', DecisionTreeClassifier(random_state=20240407)),
    ('rf', RandomForestClassifier(random_state=20240407)),
    ('xgb', xgb.XGBClassifier(use_label_encoder=False, eval_metric='logloss', random_state=20240407))
]

# Meta-model
meta_model = LogisticRegression()

# Stacking classifier
stacking_model = StackingClassifier(estimators=base_models, final_estimator=meta_model, cv=5)

for model_name, features in models.items():
    start_time = time.time()  # Start timer
    
    # Create a pipeline with scaling and stacking model
    pipeline = Pipeline([
        ("scale_features", ColumnTransformer([("scale", StandardScaler(), features)], remainder='drop')),
        ("stacking", stacking_model)
    ])
    
    # Fit the pipeline
    pipeline.fit(X_train[features], y_train)
    
    # Predict probabilities on the training and validation data
    train_prob = pipeline.predict_proba(X_train[features])[:, 1]
    val_prob = pipeline.predict_proba(X_val[features])[:, 1]
    
    # Calculate AUC
    train_auc = roc_auc_score(y_train, train_prob)
    val_auc = roc_auc_score(y_val, val_prob)
    
    # Calculate RMSLE
    train_rmsle = calculateRMSLE(train_prob, y_train)
    val_rmsle = calculateRMSLE(val_prob, y_val)
    
    # Append results
    new_row = pd.DataFrame([[f"{model_name} STACKED", train_auc, val_auc, train_rmsle, val_rmsle]], 
                           columns=['Model', 'Training AUC', 'Validation AUC', 'Training RMSLE', 'Validation RMSLE'])
    results_df = pd.concat([results_df, new_row], ignore_index=True)
    
    end_time = time.time()  # End timer
    print(f"Completed {model_name} in {end_time - start_time:.2f} seconds")


results_df.tail(31)


## Decision Tree Classifier
---

### Decision Tree Classifier Max Depth 5

In [None]:
for group_name, features in models.items():
    start_time = time.time()  # Start timer
    
    steps = [
        ("scale_features", ColumnTransformer([("scale_numeric_features", MinMaxScaler(), features)], remainder='drop')),
        ("tree", DecisionTreeClassifier(max_depth=5, random_state=20240407))
    ]
    pipe_tree = Pipeline(steps)

    # Fit the model on training data
    pipe_tree.fit(X_train[features], y_train)

    # Predict probabilities for the positive class
    train_prob = pipe_tree.predict_proba(X_train[features])[:, 1]
    val_prob = pipe_tree.predict_proba(X_val[features])[:, 1]

    # Calculate AUC scores
    train_auc = roc_auc_score(y_train, train_prob)
    val_auc = roc_auc_score(y_val, val_prob)
    
    # Calculate RMSLE
    train_rmsle = calculateRMSLE(train_prob, y_train)
    val_rmsle = calculateRMSLE(val_prob, y_val)

    # Append results
    new_row = pd.DataFrame([[f"{group_name} Decision Tree MD5", train_auc, val_auc, train_rmsle, val_rmsle]], 
                           columns=['Model', 'Training AUC', 'Validation AUC', 'Training RMSLE', 'Validation RMSLE'])
    results_df = pd.concat([results_df, new_row], ignore_index=True)
    
    end_time = time.time()  # End timer
    print(f"Completed {group_name} in {end_time - start_time:.2f} seconds")

results_df.tail(31)


### Decision Tree Classifier Max Depth 6
---

In [None]:
for group_name, features in models.items():
    start_time = time.time()  # Start timer
    
    steps = [
        ("scale_features", ColumnTransformer([("scale_numeric_features", MinMaxScaler(), features)], remainder='drop')),
        ("tree", DecisionTreeClassifier(max_depth=6, random_state=20240407))
    ]
    pipe_tree = Pipeline(steps)

    # Fit the model on training data
    pipe_tree.fit(X_train[features], y_train)

    # Predict probabilities for the positive class
    train_prob = pipe_tree.predict_proba(X_train[features])[:, 1]
    val_prob = pipe_tree.predict_proba(X_val[features])[:, 1]

    # Calculate AUC scores using the probabilities
    train_auc = roc_auc_score(y_train, train_prob)
    val_auc = roc_auc_score(y_val, val_prob)

    # Calculate RMSLE
    train_rmsle = calculateRMSLE(train_prob, y_train)
    val_rmsle = calculateRMSLE(val_prob, y_val)

    # Append results
    new_row = pd.DataFrame([[f"{group_name} Decision Tree MD6", train_auc, val_auc, train_rmsle, val_rmsle]], 
                           columns=['Model', 'Training AUC', 'Validation AUC', 'Training RMSLE', 'Validation RMSLE'])
    results_df = pd.concat([results_df, new_row], ignore_index=True)
    
    end_time = time.time()  # End timer
    print(f"Completed {group_name} in {end_time - start_time:.2f} seconds")

results_df.tail(31)

### Decision Tree Classifier Max Depth 7
---

In [None]:
for group_name, features in models.items():
    start_time = time.time()  # Start timer
    
    steps = [
        ("scale_features", ColumnTransformer([("scale_numeric_features", MinMaxScaler(), features)], remainder='drop')),
        ("tree", DecisionTreeClassifier(max_depth=7, random_state=20240407))
    ]
    pipe_tree = Pipeline(steps)

    # Fit the model on training data
    pipe_tree.fit(X_train[features], y_train)

    # Predict probabilities for the positive class
    train_prob = pipe_tree.predict_proba(X_train[features])[:, 1]
    val_prob = pipe_tree.predict_proba(X_val[features])[:, 1]

    # Calculate AUC scores using the probabilities
    train_auc = roc_auc_score(y_train, train_prob)
    val_auc = roc_auc_score(y_val, val_prob)

    # Calculate RMSLE
    train_rmsle = calculateRMSLE(train_prob, y_train)
    val_rmsle = calculateRMSLE(val_prob, y_val)

    # Append results
    new_row = pd.DataFrame([[f"{group_name} Decision Tree MD7", train_auc, val_auc, train_rmsle, val_rmsle]], 
                           columns=['Model', 'Training AUC', 'Validation AUC', 'Training RMSLE', 'Validation RMSLE'])
    results_df = pd.concat([results_df, new_row], ignore_index=True)
    
    end_time = time.time()  # End timer
    print(f"Completed {group_name} in {end_time - start_time:.2f} seconds")

results_df.tail(31)

### Decision Tree Classifier Grid Search
---

In [None]:
for group_name, features in models.items():
    start_time = time.time()  # Start timer
    
    # Define the steps of the pipeline
    steps = [
        ("scale_features", ColumnTransformer([("scale_numeric_features", MinMaxScaler(), features)], remainder='drop')),
        ("tree", DecisionTreeClassifier(random_state=20240407))
    ]
    pipe_tree = Pipeline(steps)
    
    # Define the parameter grid to search over
    param_grid = {
        "tree__max_depth": range(3, 9) 
    }
    
    # Initialize GridSearchCV
    grid_search = GridSearchCV(pipe_tree, param_grid, cv=5, scoring='roc_auc', n_jobs=-1)
    
    # Fit the model on training data
    grid_search.fit(X_train[features], y_train)
    
    # Best model after grid search
    best_model = grid_search.best_estimator_
    
    # Predict probabilities for the positive class with the best model
    train_prob = best_model.predict_proba(X_train[features])[:, 1]
    val_prob = best_model.predict_proba(X_val[features])[:, 1]

    # Calculate AUC scores using the probabilities
    train_auc = roc_auc_score(y_train, train_prob)
    val_auc = roc_auc_score(y_val, val_prob)

    # Calculate RMSLE
    train_rmsle = calculateRMSLE(train_prob, y_train)
    val_rmsle = calculateRMSLE(val_prob, y_val)

    # Append results
    best_depth = best_model.named_steps['tree'].max_depth
    new_row = pd.DataFrame([[f"{group_name} Decision Tree Grid Search", train_auc, val_auc, train_rmsle, val_rmsle]], 
                           columns=['Model', 'Training AUC', 'Validation AUC', 'Training RMSLE', 'Validation RMSLE'])
    results_df = pd.concat([results_df, new_row], ignore_index=True)
    
    end_time = time.time()  # End timer
    print(f"Completed {group_name} with best max_depth={best_depth} in {end_time - start_time:.2f} seconds")

results_df.tail(31)


## Random Forest
---

In [None]:
for group_name, features in models.items():
    start_time = time.time()  # Start timer
    
    steps = [
        ("scale_features", ColumnTransformer([("scale_numeric_features", MinMaxScaler(), features)], remainder='drop')),
        ("random_forest", RandomForestClassifier(random_state=20240407))
    ]
    pipe_rf = Pipeline(steps)
    
    # Define the parameter grid to search over
    param_grid = {
        "random_forest__max_depth": [None, 3, 5, 7],  # None means no limit on the depth
        "random_forest__n_estimators": [10, 50, 100],  # Number of trees
        "random_forest__min_samples_split": [2, 4]  # Minimum number of samples required to split an internal node
    }
    
    # Initialize GridSearchCV
    grid_search = GridSearchCV(pipe_rf, param_grid, cv=5, scoring='roc_auc', n_jobs=-1)
    
    # Fit the model on training data
    grid_search.fit(X_train[features], y_train)
    
    # Best model after grid search
    best_model = grid_search.best_estimator_
    
    # Predict probabilities for the positive class with the best model
    train_prob = best_model.predict_proba(X_train[features])[:, 1]
    val_prob = best_model.predict_proba(X_val[features])[:, 1]

    # Calculate AUC scores using the probabilities
    train_auc = roc_auc_score(y_train, train_prob)
    val_auc = roc_auc_score(y_val, val_prob)
    
    # Calculate RMSLE
    train_rmsle = calculateRMSLE(train_prob, y_train)
    val_rmsle = calculateRMSLE(val_prob, y_val)

    # Append results
    best_params = grid_search.best_params_
    new_row = pd.DataFrame([[f"{group_name} Random Forest", train_auc, val_auc, train_rmsle, val_rmsle]],
                           columns=['Model', 'Training AUC', 'Validation AUC', 'Training RMSLE', 'Validation RMSLE'])
    results_df = pd.concat([results_df, new_row], ignore_index=True)
    
    end_time = time.time()  # End timer
    print(f"Completed {group_name} with best parameters {best_params} in {end_time - start_time:.2f} seconds")

results_df.tail(31)


## Gradient Boosted Trees (XGBoost)
---

In [None]:
for group_name, features in models.items():
    start_time = time.time()  # Timer start
    
    steps = [
        ("scale_features", ColumnTransformer([("scale_numeric_features", MinMaxScaler(), features)], remainder='drop')),
        ("xgb", xgb.XGBClassifier(use_label_encoder=False, eval_metric='logloss'))
    ]
    pipe_xgb = Pipeline(steps)
    
    # Define the parameter grid
    param_grid = {
        "xgb__n_estimators": [100, 200],  # Number of trees
        "xgb__max_depth": [3, 5, 7],  # Depth of trees
        "xgb__learning_rate": [0.01, 0.1]  # Step-size shrinkage applied at each update to help prevent overfitting
    }
    
    # Initialize GridSearchCV
    grid_search = GridSearchCV(pipe_xgb, param_grid, cv=5, scoring='roc_auc', n_jobs=-1)
    
    # Fit the model on the training data
    grid_search.fit(X_train[features], y_train)
    
    # Best model after grid search
    best_model = grid_search.best_estimator_
    
    # Predict probabilities for the positive class
    train_prob = best_model.predict_proba(X_train[features])[:, 1]
    val_prob = best_model.predict_proba(X_val[features])[:, 1]

    # Calculate AUC scores using the probabilities
    train_auc = roc_auc_score(y_train, train_prob)
    val_auc = roc_auc_score(y_val, val_prob)
    
    # Calculate RMSLE
    train_rmsle = calculateRMSLE(train_prob, y_train)
    val_rmsle = calculateRMSLE(val_prob, y_val)

    # Append results
    best_params = grid_search.best_params_
    new_row = pd.DataFrame([[f"{group_name} XGBoost", train_auc, val_auc, train_rmsle, val_rmsle]],
                           columns=['Model', 'Training AUC', 'Validation AUC', 'Training RMSLE', 'Validation RMSLE'])
    results_df = pd.concat([results_df, new_row], ignore_index=True)
    
    end_time = time.time()  # End timer
    print(f"Completed {group_name} with best parameters {best_params} in {end_time - start_time:.2f} seconds")

results_df.tail(31)


## Light Gradient Boosting Model
---

### Simple Light Gradient Boosting
---

In [None]:
for group_name, features in models.items():
    start_time = time.time()

    # Create datasets for LightGBM
    lgb_train = lgb.Dataset(X_train[features], label=y_train)
    lgb_val = lgb.Dataset(X_val[features], label=y_val, reference=lgb_train)

    # Simplify params by only setting the essentials
    params = {
        'objective': 'binary',
        'metric': 'auc',
        'verbose': -1,
        'random_state': 20240325
    }

    # Train model with a fixed number of boost rounds to simplify
    num_boost_round = 100
    lgb_model = lgb.train(params,
                          lgb_train,
                          num_boost_round=num_boost_round,
                          valid_sets=[lgb_val])

    # Prediction and evaluation
    train_prob = lgb_model.predict(X_train[features], num_iteration=lgb_model.best_iteration)
    val_prob = lgb_model.predict(X_val[features], num_iteration=lgb_model.best_iteration)

    train_auc = roc_auc_score(y_train, train_prob)
    val_auc = roc_auc_score(y_val, val_prob)

    # Calculate RMSLE (predictions first, observed labels second)
    train_rmsle = calculateRMSLE(train_prob, y_train)
    val_rmsle = calculateRMSLE(val_prob, y_val)

    # Append results
    new_row = pd.DataFrame([[f"{group_name} LightGBM Simple", train_auc, val_auc, train_rmsle, val_rmsle]], 
                           columns=['Model', 'Training AUC', 'Validation AUC', 'Training RMSLE', 'Validation RMSLE'])
    results_df = pd.concat([results_df, new_row], ignore_index=True)

    end_time = time.time()
    print(f"Completed {group_name} in {end_time - start_time:.2f} seconds")

results_df.tail(31)

### Tuned Light Gradient Boosting
---

In [None]:
for group_name, features in models.items():
    start_time = time.time()

    # Create datasets for LightGBM
    lgb_train = lgb.Dataset(X_train[features], label=y_train)
    lgb_val = lgb.Dataset(X_val[features], label=y_val, reference=lgb_train)

    # Adjust parameters to reduce overfitting
    params = {
        'objective': 'binary',
        'metric': 'auc',
        'learning_rate': 0.05,  # Lowered learning rate
        'num_leaves': 20,  # Fewer leaves
        'lambda_l1': 0.5,  # Added L1 regularization
        'lambda_l2': 0.5,  # Added L2 regularization
        'verbose': -1,
        'random_state': 20240325
    }

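    # Train model with early stopping on the validation AUC
    # (the patience of 50 rounds is an assumed value)
    lgb_model = lgb.train(params,
                          lgb_train,
                          valid_sets=[lgb_val],
                          num_boost_round=1000,  # Maximum number of boosting rounds
                          callbacks=[lgb.early_stopping(stopping_rounds=50, verbose=False)])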

    # Prediction and evaluation
    train_prob = lgb_model.predict(X_train[features], num_iteration=lgb_model.best_iteration)
    val_prob = lgb_model.predict(X_val[features], num_iteration=lgb_model.best_iteration)

    train_auc = roc_auc_score(y_train, train_prob)
    val_auc = roc_auc_score(y_val, val_prob)

    # Calculate RMSLE (predictions first, observed labels second)
    train_rmsle = calculateRMSLE(train_prob, y_train)
    val_rmsle = calculateRMSLE(val_prob, y_val)

    # Append results
    new_row = pd.DataFrame([[f"{group_name} LightGBM Tuned", train_auc, val_auc, train_rmsle, val_rmsle]], 
                           columns=['Model', 'Training AUC', 'Validation AUC', 'Training RMSLE', 'Validation RMSLE'])
    results_df = pd.concat([results_df, new_row], ignore_index=True)

    end_time = time.time()
    print(f"Completed {group_name} in {end_time - start_time:.2f} seconds")


results_df.tail(31)

## Cat Boosting
---

### Simple Cat Boost

In [None]:
for group_name, features in models.items():
    start_time = time.time()

    # Defining CatBoost model
    cb_model = CatBoostClassifier(
        iterations=500,  # Upper bound on boosting rounds; early stopping may end training sooner
        learning_rate=0.01,  # Small learning rate for gradual learning
        depth=4,  # Shallow trees to limit model complexity and overfitting
        random_state=20240325,
        verbose=False  # Silence the output to avoid flooding the notebook/console
    )
    
    # Fit the model
    cb_model.fit(X_train[features], y_train, eval_set=(X_val[features], y_val), early_stopping_rounds=50, verbose=False)

    # Predict and evaluate
    train_prob = cb_model.predict_proba(X_train[features])[:, 1]
    val_prob = cb_model.predict_proba(X_val[features])[:, 1]

    train_auc = roc_auc_score(y_train, train_prob)
    val_auc = roc_auc_score(y_val, val_prob)
    
    # Calculate RMSLE (predictions first, observed labels second)
    train_rmsle = calculateRMSLE(train_prob, y_train)
    val_rmsle = calculateRMSLE(val_prob, y_val)

    # Append results
    new_row = pd.DataFrame([[f"{group_name} CatBoost Simple", train_auc, val_auc, train_rmsle, val_rmsle]], 
                           columns=['Model', 'Training AUC', 'Validation AUC', 'Training RMSLE', 'Validation RMSLE'])
    results_df = pd.concat([results_df, new_row], ignore_index=True)

    end_time = time.time()
    print(f"Completed {group_name} in {end_time - start_time:.2f} seconds")

results_df.tail(31)

### Tuned Cat Boost

In [None]:
for group_name, features in models.items():
    start_time = time.time()

    cb_model = CatBoostClassifier(
        iterations=2000,  # More iterations to compensate for the lower learning rate
        learning_rate=0.001,  # Lower learning rate for more gradual learning
        depth=7,  # Deeper trees to capture more complex patterns
        l2_leaf_reg=5,  # Stronger L2 regularization to offset the added depth
        bagging_temperature=1,  # Bayesian bootstrap setting (1 samples weights from an exponential distribution)
        early_stopping_rounds=100,
        random_state=20240325,
        verbose=False)  # Silence per-iteration output
    
    # Early stopping is configured once, in the constructor above
    cb_model.fit(X_train[features], y_train, eval_set=(X_val[features], y_val), verbose=False)

    train_prob = cb_model.predict_proba(X_train[features])[:, 1]
    val_prob = cb_model.predict_proba(X_val[features])[:, 1]

    train_auc = roc_auc_score(y_train, train_prob)
    val_auc = roc_auc_score(y_val, val_prob)
    
    # Calculate RMSLE
    train_rmsle = calculateRMSLE(train_prob, y_train)
    val_rmsle = calculateRMSLE(val_prob, y_val)


    new_row = pd.DataFrame([[f"{group_name} CatBoost Tuned", train_auc, val_auc, train_rmsle, val_rmsle]], 
                           columns=['Model', 'Training AUC', 'Validation AUC', 'Training RMSLE', 'Validation RMSLE'])
    results_df = pd.concat([results_df, new_row], ignore_index=True)

    end_time = time.time()
    print(f"Completed {group_name} in {end_time - start_time:.2f} seconds")

results_df.tail(31)

## Explainable Boosting Machine
---

### Simple EBM

In [None]:
for group_name, features in models.items():
    start_time = time.time()  # Timer start

    # Fit an EBM with default hyperparameters (no scaling or imputation step)
    ebm = ExplainableBoostingClassifier(random_state=20240325)

    ebm.fit(X_train[features], y_train)

    # Predict probabilities for the positive class
    train_prob = ebm.predict_proba(X_train[features])[:, 1]
    val_prob = ebm.predict_proba(X_val[features])[:, 1]

    # Calculate AUC scores using the probabilities
    train_auc = roc_auc_score(y_train, train_prob)
    val_auc = roc_auc_score(y_val, val_prob)
    
    # Calculate RMSLE
    train_rmsle = calculateRMSLE(train_prob, y_train)
    val_rmsle = calculateRMSLE(val_prob, y_val)

    # Append results
    new_row = pd.DataFrame([[f"{group_name} EBM", train_auc, val_auc, train_rmsle, val_rmsle]],
                           columns=['Model', 'Training AUC', 'Validation AUC', 'Training RMSLE', 'Validation RMSLE'])
    results_df = pd.concat([results_df, new_row], ignore_index=True)

    end_time = time.time()  # End timer
    print(f"Completed {group_name} in {end_time - start_time:.2f} seconds")

results_df.tail(31)


### **Permutation Importance**

In [None]:
# Choose a feature set (here M7) and fit an EBM on it for illustration
ebm = ExplainableBoostingClassifier(random_state=20240325)
ebm.fit(X_train[models['M7']], y_train)

# Compute permutation-based feature importance
perm_importance = permutation_importance(ebm, X_val[models['M7']], y_val, n_repeats=10, random_state=42, scoring='roc_auc')

# Retrieve and display feature importances
feature_names = np.array(models['M7'])
sorted_idx = perm_importance.importances_mean.argsort()

import matplotlib.pyplot as plt

plt.figure(figsize=(12, 6))
plt.barh(feature_names[sorted_idx], perm_importance.importances_mean[sorted_idx])
plt.xlabel("Permutation Importance")
plt.show()


In [None]:
# Assuming perm_importance is calculated as shown previously
feature_names = np.array(models['M7'])  # Adjust to use the correct model features as needed

# Identify features with positive permutation importance values
positive_importance_features = feature_names[perm_importance.importances_mean > 0]

# Print out the feature names
print("Features with positive permutation importance:")
for feature in positive_importance_features:
    print(feature)

# Create a variable group with these features
perm_importance_positive = positive_importance_features.tolist()

print("Variable group with positive permutation importance:")
print(perm_importance_positive)
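
Permutation importance shuffles one feature at a time on held-out data and records how much the score drops. The sketch below reproduces the selection step on synthetic data in which only one column carries signal. A `LogisticRegression` is swapped in for the EBM to keep the example light, and the data and the names `x0` to `x3` are assumptions made for illustration.

```python
import numpy as np
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(600, 4))
# Only the first column drives the label; the other three are pure noise
y_demo = (X_demo[:, 0] + 0.5 * rng.normal(size=600) > 0).astype(int)
feature_names = np.array(["x0", "x1", "x2", "x3"])

# Simple split: fit on the first 400 rows, measure importance on the rest
X_fit, X_holdout = X_demo[:400], X_demo[400:]
y_fit, y_holdout = y_demo[:400], y_demo[400:]

clf = LogisticRegression().fit(X_fit, y_fit)
result = permutation_importance(clf, X_holdout, y_holdout,
                                n_repeats=10, random_state=42, scoring="roc_auc")

for name, mean, std in zip(feature_names, result.importances_mean, result.importances_std):
    print(f"{name}: {mean:.4f} +/- {std:.4f}")

# Same selection rule as above: keep features whose mean importance is positive
positive_features = feature_names[result.importances_mean > 0].tolist()
print(positive_features)
```

Noise features can end up with a small positive mean by chance, so comparing each mean against its standard deviation is a stricter filter than the sign alone.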


### Adjusted EBM 1
---

In [None]:
for group_name, features in models.items():
    start_time = time.time()  # Timer start

    # EBM with explicitly set learning rate, bin count, interaction terms, and early stopping
    ebm_adjusted = ExplainableBoostingClassifier(
        random_state=20240325,
        learning_rate=0.01,
        max_bins=256,
        interactions=10,
        early_stopping_rounds=50
    )

    ebm_adjusted.fit(X_train[features], y_train)

    # Predict probabilities for the positive class
    train_prob = ebm_adjusted.predict_proba(X_train[features])[:, 1]
    val_prob = ebm_adjusted.predict_proba(X_val[features])[:, 1]

    # Calculate AUC scores using the probabilities
    train_auc = roc_auc_score(y_train, train_prob)
    val_auc = roc_auc_score(y_val, val_prob)
    
    # Calculate RMSLE
    train_rmsle = calculateRMSLE(train_prob, y_train)
    val_rmsle = calculateRMSLE(val_prob, y_val)

    # Append results
    new_row = pd.DataFrame([[f"{group_name} EBM Adjusted 1", train_auc, val_auc, train_rmsle, val_rmsle]],
                           columns=['Model', 'Training AUC', 'Validation AUC', 'Training RMSLE', 'Validation RMSLE'])
    results_df = pd.concat([results_df, new_row], ignore_index=True)

    end_time = time.time()  # End timer
    print(f"Completed {group_name} in {end_time - start_time:.2f} seconds")

results_df.tail(31)


### Adjusted EBM 2
---

In [None]:
for group_name, features in models.items():
    start_time = time.time()

    ebm_more_adjusted = ExplainableBoostingClassifier(
        random_state=20240325,
        learning_rate=0.005,  # Slightly lower learning rate for more fine-grained adjustments
        max_bins=512,  # Increased number of bins for potentially capturing more detail
        interactions=15,  # Allowing for more interactions
        early_stopping_rounds=100,  # More patience on early stopping to allow more rounds for convergence
        n_jobs=-1  # Utilize all CPU cores for faster training
    )

    ebm_more_adjusted.fit(X_train[features], y_train)

    train_prob = ebm_more_adjusted.predict_proba(X_train[features])[:, 1]
    val_prob = ebm_more_adjusted.predict_proba(X_val[features])[:, 1]

    train_auc = roc_auc_score(y_train, train_prob)
    val_auc = roc_auc_score(y_val, val_prob)

    train_rmsle = calculateRMSLE(train_prob, y_train)
    val_rmsle = calculateRMSLE(val_prob, y_val)

    new_row = pd.DataFrame([[f"{group_name} EBM Adjusted 2", train_auc, val_auc, train_rmsle, val_rmsle]], 
                           columns=['Model', 'Training AUC', 'Validation AUC', 'Training RMSLE', 'Validation RMSLE'])
    results_df = pd.concat([results_df, new_row], ignore_index=True)

    end_time = time.time()
    print(f"Completed {group_name} in {end_time - start_time:.2f} seconds")

results_df.tail(31)


## Neural Network Models
---

In [None]:
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_val_scaled = scaler.transform(X_val)

n_features = X_train.shape[1]
    
# Reshape your data accordingly
X_train_reshaped = X_train_scaled.reshape((-1, n_features, 1)) 
X_val_reshaped = X_val_scaled.reshape((-1, n_features, 1))
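
The scaler above is fitted on the training split only and then reused for the validation split, which keeps validation statistics out of preprocessing. The sketch below illustrates the effect on random stand-in arrays (an assumption; the names `X_fit` and `X_holdout` are illustrative) and shows the shape produced by the reshape step.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_fit = rng.normal(loc=5.0, scale=2.0, size=(100, 3))     # stand-in for the training split
X_holdout = rng.normal(loc=5.0, scale=2.0, size=(40, 3))  # stand-in for the validation split

scaler = StandardScaler().fit(X_fit)   # statistics come from the training split only
X_fit_scaled = scaler.transform(X_fit)
X_holdout_scaled = scaler.transform(X_holdout)

# Adding a trailing axis gives (samples, features, 1), the layout that layers such as Conv1D expect
X_fit_3d = X_fit_scaled.reshape((-1, X_fit_scaled.shape[1], 1))

print(X_fit_scaled.mean(axis=0).round(6), X_fit_scaled.std(axis=0).round(6))
print(X_holdout_scaled.mean(axis=0).round(3))
print(X_fit_3d.shape)
```

The training columns come out with mean 0 and standard deviation 1, while the validation columns are only approximately centered. `Dense` layers take the 2D scaled arrays directly, so the 3D version is only needed for sequence-style layers.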

### Simple Neural Network Model 1
---

In [None]:
for model_name, features in models.items():
    start_time = time.time()  # Timer start
    
    # Scale only this feature set (scaler fitted on the training split) so each run matches its label
    scaler_nn = StandardScaler().fit(X_train[features])
    X_train_nn = scaler_nn.transform(X_train[features])
    X_val_nn = scaler_nn.transform(X_val[features])
    
    # Define the model
    model = Sequential([
        Dense(64, activation='relu', input_shape=(X_train_nn.shape[1],)),
        Dense(1, activation='sigmoid')
    ])
    
    # Compile the model
    model.compile(optimizer=Adam(), loss='binary_crossentropy', metrics=[AUC(name='auc')])
    
    model.fit(X_train_nn, y_train, epochs=100, batch_size=32, verbose=0,
              validation_data=(X_val_nn, y_val),
              callbacks=[EarlyStopping(monitor='val_auc', patience=3, restore_best_weights=True, mode='max')])

    _, train_auc = model.evaluate(X_train_nn, y_train, verbose=0)
    _, val_auc = model.evaluate(X_val_nn, y_val, verbose=0)
    
    # Predict probabilities with this network for the RMSLE calculation
    train_prob = model.predict(X_train_nn, verbose=0).flatten()
    val_prob = model.predict(X_val_nn, verbose=0).flatten()
    
    train_rmsle = calculateRMSLE(train_prob, y_train)
    val_rmsle = calculateRMSLE(val_prob, y_val)
    
    new_row = pd.DataFrame([[f"{model_name} NN Simple", train_auc, val_auc, train_rmsle, val_rmsle]], 
                           columns=['Model', 'Training AUC', 'Validation AUC', 'Training RMSLE', 'Validation RMSLE'])
    results_df = pd.concat([results_df, new_row], ignore_index=True)
    
    end_time = time.time()  # End timer
    print(f"Completed {model_name} in {end_time - start_time:.2f} seconds")

results_df.tail(31)

### Simple Neural Network Model 2
---

In [None]:
for model_name, features in models.items():
    start_time = time.time()
    
    # Scale only this feature set (scaler fitted on the training split) so each run matches its label
    scaler_nn = StandardScaler().fit(X_train[features])
    X_train_nn = scaler_nn.transform(X_train[features])
    X_val_nn = scaler_nn.transform(X_val[features])
    
    model = Sequential([
        Dense(32, activation='relu', input_shape=(X_train_nn.shape[1],)),
        Dropout(0.5),
        Dense(16, activation='relu'),
        Dropout(0.5),
        Dense(1, activation='sigmoid')
    ])
    
    model.compile(optimizer=Adam(learning_rate=0.001), loss='binary_crossentropy', metrics=[AUC(name='auc')])
    
    model.fit(X_train_nn, y_train, epochs=100, batch_size=32, verbose=0,
              validation_data=(X_val_nn, y_val),
              callbacks=[EarlyStopping(monitor='val_auc', patience=5, restore_best_weights=True, mode='max')])

    _, train_auc = model.evaluate(X_train_nn, y_train, verbose=0)
    _, val_auc = model.evaluate(X_val_nn, y_val, verbose=0)
    
    # Predict probabilities with this network for the RMSLE calculation
    train_prob = model.predict(X_train_nn, verbose=0).flatten()
    val_prob = model.predict(X_val_nn, verbose=0).flatten()
    
    train_rmsle = calculateRMSLE(train_prob, y_train)
    val_rmsle = calculateRMSLE(val_prob, y_val)
    
    new_row = pd.DataFrame([[f"{model_name} NN Simple 2", train_auc, val_auc, train_rmsle, val_rmsle]], 
                           columns=['Model', 'Training AUC', 'Validation AUC', 'Training RMSLE', 'Validation RMSLE'])
    results_df = pd.concat([results_df, new_row], ignore_index=True)
    
    end_time = time.time()
    print(f"Completed {model_name} in {end_time - start_time:.2f} seconds")

results_df.tail(31)


### Simple Neural Network Model 3
---

In [None]:
for model_name, features in models.items():
    start_time = time.time()
    
    # Scale only this feature set (scaler fitted on the training split) so each run matches its label
    scaler_nn = StandardScaler().fit(X_train[features])
    X_train_nn = scaler_nn.transform(X_train[features])
    X_val_nn = scaler_nn.transform(X_val[features])
    
    model = Sequential([
        Dense(64, activation='relu', input_shape=(X_train_nn.shape[1],)),
        Dropout(0.3),
        Dense(32, activation='relu'),
        Dropout(0.3),
        Dense(16, activation='relu'),
        Dense(1, activation='sigmoid')
    ])
    
    model.compile(optimizer=Adam(learning_rate=0.0005), loss='binary_crossentropy', metrics=[AUC(name='auc')])
    
    es = EarlyStopping(monitor='val_auc', patience=10, restore_best_weights=True, mode='max')
    model.fit(X_train_nn, y_train, epochs=150, batch_size=64, verbose=0,
              validation_data=(X_val_nn, y_val),
              callbacks=[es])

    # Predicted probabilities (sigmoid outputs, already in [0, 1])
    train_pred = model.predict(X_train_nn, verbose=0).flatten()
    val_pred = model.predict(X_val_nn, verbose=0).flatten()

    _, train_auc = model.evaluate(X_train_nn, y_train, verbose=0)
    _, val_auc = model.evaluate(X_val_nn, y_val, verbose=0)
    
    # Calculate RMSLE (predictions first, observed labels second)
    train_rmsle = calculateRMSLE(train_pred, y_train)
    val_rmsle = calculateRMSLE(val_pred, y_val)
    
    new_row = pd.DataFrame([[f"{model_name} NN Simple 3", train_auc, val_auc, train_rmsle, val_rmsle]], 
                           columns=['Model', 'Training AUC', 'Validation AUC', 'Training RMSLE', 'Validation RMSLE'])
    results_df = pd.concat([results_df, new_row], ignore_index=True)
    
    end_time = time.time()
    print(f"Completed {model_name} in {end_time - start_time:.2f} seconds")

results_df.tail(31)


### Complex Neural Network Model
---

In [None]:
for model_name, features in models.items():
    start_time = time.time()
    
    model = Sequential([
        Dense(128, activation='relu', input_shape=(X_train_scaled.shape[1],)),
        BatchNormalization(),
        Dropout(0.5),
        Dense(64, activation='relu'),
        BatchNormalization(),
        Dropout(0.5),
        Dense(32, activation='relu'),
        BatchNormalization(),
        Dropout(0.3),
        Dense(1, activation='sigmoid')
    ])
    
    model.compile(optimizer=Adam(learning_rate=0.001), loss='binary_crossentropy', metrics=[AUC(name='auc')])
    
    model.fit(X_train_scaled, y_train, epochs=100, batch_size=32, verbose=0,
              validation_data=(X_val_scaled, y_val),
              callbacks=[EarlyStopping(monitor='val_auc', patience=5, restore_best_weights=True, mode='max')])

    _, train_auc = model.evaluate(X_train_scaled, y_train, verbose=0)
    _, val_auc = model.evaluate(X_val_scaled, y_val, verbose=0)
    
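    # Predicted probabilities from this model, needed for the RMSLE calculation
    train_prob = model.predict(X_train_scaled, verbose=0).flatten()
    val_prob = model.predict(X_val_scaled, verbose=0).flatten()

    train_rmsle = calculateRMSLE(y_train, train_prob)
    val_rmsle = calculateRMSLE(y_val, val_prob)
    
    new_row = pd.DataFrame([[f"{model_name} NN Complex", train_auc, val_auc, train_rmsle, val_rmsle]], 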
                           columns=['Model', 'Training AUC', 'Validation AUC', 'Training RMSLE', 'Validation RMSLE'])
    results_df = pd.concat([results_df, new_row], ignore_index=True)
    
    end_time = time.time()
    print(f"Completed {model_name} in {end_time - start_time:.2f} seconds")

results_df.tail(31)


### Conv1D Adjusted Neural Network 1
---

In [None]:
for model_name, features in models.items():
    start_time = time.time()
    
    model = Sequential([
        # Applying Conv1D on the reshaped data; treating each feature as a timestep
        Conv1D(filters=32, kernel_size=1, activation='relu', input_shape=(n_features, 1)),
        MaxPooling1D(pool_size=2, strides=2),
        Flatten(),
        Dense(128, activation='relu'),
        Dropout(0.3),
        Dense(64, activation='relu'),
        Dropout(0.3),
        Dense(1, activation='sigmoid')
    ])
    
    model.compile(optimizer=Adam(learning_rate=0.0001),
                  loss='binary_crossentropy', metrics=[AUC(name='auc')])
    
    es = EarlyStopping(monitor='val_auc', patience=15, restore_best_weights=True, mode='max')
    model.fit(X_train_reshaped, y_train, epochs=200, batch_size=32, verbose=0,
              validation_data=(X_val_reshaped, y_val),
              callbacks=[es])

    _, train_auc = model.evaluate(X_train_reshaped, y_train, verbose=0)
    _, val_auc = model.evaluate(X_val_reshaped, y_val, verbose=0)

    # Predicted probabilities, needed for the RMSLE calculation
    train_pred = model.predict(X_train_reshaped).flatten()
    val_pred = model.predict(X_val_reshaped).flatten()

    train_rmsle = calculateRMSLE(y_train, np.clip(train_pred, 0, None))
    val_rmsle = calculateRMSLE(y_val, np.clip(val_pred, 0, None))

    new_row = pd.DataFrame([[f"{model_name} NN Conv1D Adjusted", train_auc, val_auc, train_rmsle, val_rmsle]], 
                           columns=['Model', 'Training AUC', 'Validation AUC', 'Training RMSLE', 'Validation RMSLE'])
    results_df = pd.concat([results_df, new_row], ignore_index=True)

    end_time = time.time()
    print(f"Completed {model_name} in {end_time - start_time:.2f} seconds")

results_df.tail(31)
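
The Conv1D models take three-dimensional input of shape `(samples, n_features, 1)`. The reshaping step itself is defined elsewhere in the notebook; the block below is only a minimal sketch of what such a step might look like, with a small made-up array standing in for `X_train_scaled` (all `*_demo` names are illustrative). Note that with `kernel_size=1` each filter sees one feature at a time, so the convolution itself does not combine neighbouring features; any mixing happens in the pooling and dense layers that follow.

```python
import numpy as np

# Made-up stand-in for a scaled feature matrix: 4 samples, 6 features
X_scaled_demo = np.arange(24, dtype="float32").reshape(4, 6)

n_features_demo = X_scaled_demo.shape[1]

# Add a trailing channel axis so each feature becomes one "timestep" with one channel
X_reshaped_demo = X_scaled_demo.reshape(-1, n_features_demo, 1)

print(X_scaled_demo.shape)    # (4, 6)
print(X_reshaped_demo.shape)  # (4, 6, 1)
```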


### Conv1D Adjusted Neural Network 2
---

In [None]:
for model_name, features in models.items():
    start_time = time.time()
    
    model = Sequential([
        Conv1D(filters=64, kernel_size=1, activation='relu', input_shape=(n_features, 1)), 
        MaxPooling1D(pool_size=2),
        Conv1D(filters=64, kernel_size=1, activation='relu'),  # Additional Conv layer
        MaxPooling1D(pool_size=2),
        Flatten(),
        Dense(128, activation='relu'),
        Dropout(0.4),  # Slightly increased dropout
        Dense(64, activation='relu'),
        Dropout(0.4),
        Dense(1, activation='sigmoid')
    ])
    
    model.compile(optimizer=Adam(learning_rate=0.0005),  # Increased learning rate
                  loss='binary_crossentropy', metrics=[AUC(name='auc')])
    
    es = EarlyStopping(monitor='val_auc', patience=10, restore_best_weights=True, mode='max')  # Adjusted patience
    model.fit(X_train_reshaped, y_train, epochs=100, batch_size=64, verbose=0,  # Reduced epochs, increased batch size
              validation_data=(X_val_reshaped, y_val),
              callbacks=[es])

    _, train_auc = model.evaluate(X_train_reshaped, y_train, verbose=0)
    _, val_auc = model.evaluate(X_val_reshaped, y_val, verbose=0)

    train_pred = model.predict(X_train_reshaped).flatten()
    val_pred = model.predict(X_val_reshaped).flatten()

    train_rmsle = calculateRMSLE(y_train, np.clip(train_pred, 0, None))
    val_rmsle = calculateRMSLE(y_val, np.clip(val_pred, 0, None))

    new_row = pd.DataFrame([[f"{model_name} NN Conv1D Adjusted 2", train_auc, val_auc, train_rmsle, val_rmsle]], 
                           columns=['Model', 'Training AUC', 'Validation AUC', 'Training RMSLE', 'Validation RMSLE'])
    results_df = pd.concat([results_df, new_row], ignore_index=True)

    end_time = time.time()
    print(f"Completed {model_name} in {end_time - start_time:.2f} seconds")

results_df.tail(31)


# Hypertuning
---

## Feature Importance

# Model Selection
---

In [None]:
# Add a 'Difference AUC' column to measure overfitting
results_df['Difference AUC'] = abs(results_df['Training AUC'] - results_df['Validation AUC'])

# Add a 'Complexity' column based on the model name. Assuming 'M1' is simpler than 'M11'.
results_df['Complexity'] = results_df['Model'].apply(lambda x: int(x.split()[0][1:]))

# Sort by Validation AUC (desc), then by Difference AUC (asc), then by Complexity (asc)
sorted_results_df = results_df.sort_values(by=['Validation AUC', 'Difference AUC', 'Complexity'], ascending=[False, True, True])

# Get the top 20 models
top_20_models = sorted_results_df.head(20)
top_20_models
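
The `Complexity` lambda above assumes every model label starts with `M` followed by a number, and raises a `ValueError` on any label that does not. A slightly more defensive variant could look like the sketch below; the helper name `complexity_from_label` and the example labels are illustrative, not taken from the results table.

```python
import re

def complexity_from_label(label):
    """Return the number after a leading 'M' (e.g. 'M11 ...' -> 11), or None if absent."""
    match = re.match(r"M(\d+)\b", label)
    return int(match.group(1)) if match else None

print(complexity_from_label("M11 NN Simple 3"))  # 11
print(complexity_from_label("M1 EBM"))           # 1
print(complexity_from_label("Baseline"))         # None
```

It could be applied in the same way as the lambda, for example `results_df['Model'].apply(complexity_from_label)`.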


In [None]:
# Sorting models by Validation RMSLE (ascending), then by Validation AUC (descending) for a focus on prediction accuracy
sorted_by_rmsle_df = results_df.sort_values(by=['Validation RMSLE', 'Validation AUC'], ascending=[True, False])

# Get the top 20 models focused on RMSLE
top_20_models_rmsle = sorted_by_rmsle_df.head(20)
print("Top 20 Models Sorted by RMSLE:")
top_20_models_rmsle


In [None]:
# Normalize RMSLE (assuming lower is better and to align with AUC's higher is better)
max_rmsle = results_df['Validation RMSLE'].max()
results_df['Normalized RMSLE'] = 1 - (results_df['Validation RMSLE'] / max_rmsle)

# Simple combined score (example: 70% weight on AUC, 30% weight on Normalized RMSLE)
results_df['Combined Score'] = 0.7 * results_df['Validation AUC'] + 0.3 * results_df['Normalized RMSLE']

# Sort by combined score (descending)
sorted_by_combined_score_df = results_df.sort_values(by='Combined Score', ascending=False)

# Get the top 20 models based on the combined score
top_20_models_combined = sorted_by_combined_score_df.head(20)
print("Top 20 Models Sorted by Combined Score (AUC & RMSLE):")
top_20_models_combined
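
To see how the combined score behaves, here is a small worked example on made-up numbers (the three models and their scores are hypothetical). Because the normalised RMSLE spans the whole range from 0 up to almost 1, while validation AUC values often sit close together, the RMSLE term can decide the ranking even at a 30% weight.

```python
import pandas as pd

# Made-up validation scores for three hypothetical models
demo = pd.DataFrame({
    "Model": ["A", "B", "C"],
    "Validation AUC": [0.70, 0.72, 0.68],
    "Validation RMSLE": [0.20, 0.25, 0.10],
})

# Same normalisation as above: divide by the largest (worst) RMSLE and subtract from 1
max_rmsle_demo = demo["Validation RMSLE"].max()
demo["Normalized RMSLE"] = 1 - demo["Validation RMSLE"] / max_rmsle_demo

# 70% weight on AUC, 30% weight on normalised RMSLE
demo["Combined Score"] = 0.7 * demo["Validation AUC"] + 0.3 * demo["Normalized RMSLE"]

ranked = demo.sort_values("Combined Score", ascending=False)
print(ranked)
```

In this toy table, model C ranks first despite having the lowest AUC, and model B, which has the highest AUC but the worst RMSLE, ranks last.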


# Test Set Prediction
---

In [None]:
def prediction_folder(day):
    # Create the output folder for the given day if it does not already exist
    folder_path = f'Predictions/Day_{day}'
    os.makedirs(folder_path, exist_ok=True)

### Prediction Functions
---

Because the same model types are fitted repeatedly on different feature sets, the prediction steps are wrapped in functions to avoid duplicating code.
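
The three functions in this section differ only in the estimator settings and the output file name, so they could in principle be folded into one parameterised helper. The block below is a minimal sketch of that idea, not the code used for the submissions: `predict_and_save`, `MeanRateClassifier`, and the `*_demo` data are all hypothetical, and the helper assumes only that the estimator exposes `fit` and `predict_proba`.

```python
import os
import tempfile

import numpy as np
import pandas as pd


def predict_and_save(estimator, X_train, y_train, test_df, features, out_path):
    """Fit `estimator` on the chosen features and write article_id/score to CSV."""
    estimator.fit(X_train[features], y_train)
    # Work on a copy so the caller's test frame is not modified
    out = test_df[["article_id"]].copy()
    out["score"] = estimator.predict_proba(test_df[features])[:, 1]
    out_dir = os.path.dirname(out_path)
    if out_dir:
        os.makedirs(out_dir, exist_ok=True)
    out.to_csv(out_path, index=False)
    return out


class MeanRateClassifier:
    """Toy stand-in for a real classifier: predicts the training positive rate."""

    def fit(self, X, y):
        self.rate_ = float(np.mean(y))
        return self

    def predict_proba(self, X):
        p = np.full(len(X), self.rate_)
        return np.column_stack([1 - p, p])


# Made-up data, only to show the call
X_train_demo = pd.DataFrame({"f1": [0.1, 0.4, 0.8, 0.3], "f2": [1, 0, 1, 1]})
y_train_demo = pd.Series([0, 1, 1, 0])
test_demo = pd.DataFrame({"article_id": [10, 11], "f1": [0.2, 0.9], "f2": [0, 1]})

out_file = os.path.join(tempfile.mkdtemp(), "Day_demo", "demo_predictions.csv")
saved = predict_and_save(MeanRateClassifier(), X_train_demo, y_train_demo,
                         test_demo, ["f1", "f2"], out_file)
print(saved)
```

With the notebook's own objects, a call might look like `predict_and_save(ExplainableBoostingClassifier(random_state=20240325), X_train, y_train, test_data, models['M9'], 'Predictions/Day_1/M9_ebm_predictions.csv')`.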

#### Simple EBM Prediction Function
---

In [None]:
def simple_ebm_prediction(model, day):
    features = models[model]

    # Training the simple EBM on the selected feature set
    ebm = ExplainableBoostingClassifier(random_state=20240325)
    ebm.fit(X_train[features], y_train)

    X_test = test_data[features]

    # Predicting with the model
    test_data['score'] = ebm.predict_proba(X_test)[:, 1]

    # Saving the required predictions
    test_data[['article_id', 'score']].to_csv(f'Predictions/Day_{day}/{model}_ebm_predictions.csv', index=False)


#### Adjusted EBM 1 Prediction Function
---

In [None]:
def ebm_adjusted_1_prediction(model, day):
    features = models[model]

    # Adjusted EBM Model 1
    ebm_adjusted_1 = ExplainableBoostingClassifier(
        random_state=20240325,
        learning_rate=0.01,
        max_bins=256,
        interactions=10,
        early_stopping_rounds=50
    )
    ebm_adjusted_1.fit(X_train[features], y_train)

    X_test = test_data[features]

    # Predicting with the model
    test_data['score'] = ebm_adjusted_1.predict_proba(X_test)[:, 1]

    # Saving the required predictions
    test_data[['article_id', 'score']].to_csv(f'Predictions/Day_{day}/{model}_ebm_adjusted_1_predictions.csv', index=False)


#### Adjusted EBM 2 Prediction Function
---

In [None]:
def ebm_adjusted_2_prediction(model, day):
    features = models[model]

    # Adjusted EBM Model 2
    ebm_adjusted_2 = ExplainableBoostingClassifier(
        random_state=20240325,
        learning_rate=0.005,
        max_bins=512,
        interactions=15,
        early_stopping_rounds=100,
        n_jobs=-1  # Utilize all available CPU cores
    )
    ebm_adjusted_2.fit(X_train[features], y_train)

    X_test = test_data[features]

    # Predicting with the model
    test_data['score'] = ebm_adjusted_2.predict_proba(X_test)[:, 1]

    # Saving the required predictions
    test_data[['article_id', 'score']].to_csv(f'Predictions/Day_{day}/{model}_ebm_adjusted_2_predictions.csv', index=False)


## Day 1 Predictions
---

All of the Day 1 predictions came from the simple EBM model.

In [None]:
prediction_folder('1')

### Simple EBM M9 Prediction 
---

In [None]:
simple_ebm_prediction('M9', '1')

### Simple EBM M7 Prediction
---

In [None]:
simple_ebm_prediction('M7', '1')

### Simple EBM M10 Prediction
---

In [None]:
simple_ebm_prediction('M10', '1')

### Simple EBM M6 Prediction
---

In [None]:
simple_ebm_prediction('M6', '1')

### Simple EBM M12 Prediction
---

In [None]:
simple_ebm_prediction('M12', '1')

## Day 2 Predictions
---

In [None]:
prediction_folder('2')

### Simple EBM M11 Prediction
---

In [None]:
simple_ebm_prediction('M11', '2')

### Adjusted EBM 1 M10 Prediction
---

In [None]:
ebm_adjusted_1_prediction('M10', '2')

### Adjusted EBM 1 M11 Prediction
---

In [None]:
ebm_adjusted_1_prediction('M11', '2')

### Adjusted EBM 1 M12 Prediction
---

In [None]:
ebm_adjusted_1_prediction('M12', '2')

### Adjusted EBM 1 M9 Prediction
---

In [None]:
ebm_adjusted_1_prediction('M9', '2')

## Day 3 Predictions
---

In [None]:
prediction_folder('3')

In [None]:
ebm_adjusted_1_prediction('M19', '3')

In [None]:
ebm_adjusted_2_prediction('M18', '3')

In [None]:
ebm_adjusted_2_prediction('M19', '3')

In [None]:
ebm_adjusted_2_prediction('M20', '3')

In [None]:
simple_ebm_prediction('M18', '3')

## Day 4 Predictions
---

In [None]:
prediction_folder('4')