# Kaggle Competition Jupyter Script

Jacob Foster
jdf3438

This is my Python script for the Business Data Science Kaggle Competition. I used a variety of preprocessing tactics and predictive models to place 3rd on the private leaderboard for the class.

#### Table of Contents
* Data Exploration and Preprocessing
* Model Creation
    * Basic Models
    * Boosting Models
    * Neural Network
* Initial Predictions and Evaluation
* Model Improvement
    * Feature Selection
    * Ensembling
    * Personal Stacking
* Final Model Training
* Summary

In addition to using XGBoost, I researched some other libraries that provide powerful classification models. Ultimately, I decided to use CatBoost and LightGBM after reading about some of the algorithms previous Kaggle winners had used.

In [None]:
# installs
! pip install xgboost
! pip install catboost
! pip install lightgbm

I imported some basic data manipulation and visualization libraries.

In [3]:
# imports
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import PolynomialFeatures
import warnings 

warnings.filterwarnings('ignore')

In [4]:
# read in data
train = pd.read_csv("train.csv", index_col='Id')
test = pd.read_csv("test.csv", index_col='Id')
rf1 = pd.read_csv("rf1.csv", index_col='Id')

## Data Exploration and Preprocessing

During my data exploration, I generated basic descriptive statistics to learn the key measurements of each column, such as the minimum, maximum, mean, and standard deviation. I found that missing values were denoted with -999. I removed the rows containing these values to improve the integrity and representativeness of the data. After looking through the columns of data, I eventually decided to remove columns that only contained the number zero to test whether that would improve the predictive capabilities of the models. This greatly improved the predictions on the public leaderboard, so I kept the changes. Instead of doing this manually for each column, I did it programmatically, removing columns where the maximum and minimum values were both zero.

I also created a separate dataframe `X_norm` where I normalized the data to prevent the numeric columns with comparatively large values from skewing the predictions of the models. Ultimately, this did not improve my predictive accuracy either locally or on Kaggle. Next, I wrote code to remove records that contained outliers. For the purpose of this dataset, I considered any value more than three standard deviations from the mean to be an outlier. Removing these outliers left me with too few rows to remain representative of the data (the final number of rows was 1,126). Regardless, I still tested my models locally to determine whether there was any significant benefit to removing these outliers. Upon running the tests, the accuracy of all my models suffered greatly, and I decided that removing the outliers was not a good idea. This was relatively surprising to me at first, since most datasets I had previously worked with required the removal of outliers to improve the quality of the data, but after looking closer at the numeric columns, the range of the values was so large that it made sense that so many rows were being removed.
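One factor that may have contributed to the large number of dropped rows: for a rare binary column, every row holding a 1 sits far more than three standard deviations from the mean. The sketch below works through the arithmetic using the mean and standard deviation that `train.describe()` reports for column `4`. Those figures were computed before the -999 rows were dropped, so the cleaned values will differ slightly; treat this as an illustration rather than a full diagnosis.

```python
# mean and std for column "4", copied from the train.describe() output
mean_4 = 0.00666
std_4 = 0.081349

# z-score of any row where this binary column equals 1
z_for_one = abs(1 - mean_4) / std_4
print(f"z-score when column 4 == 1: {z_for_one:.2f}")

# the filter keeps a row only if every column has |z| < 3,
# so a single 1 in a rare binary column is enough to drop the row
print(f"flagged as outlier: {z_for_one > 3}")
```

With dozens of sparse binary columns, a row only needs one such value to be removed, which is consistent with more than half of the rows disappearing.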

I then created a new dataframe `interaction` that contained a multiplicative interaction term between every term of the data. This dataframe was to be used later for forward and backward selection to determine if there was any predictive power hidden in an interaction term within the data. As we will see later, the interaction terms and forward/backward selection themselves did not offer much benefit to the model.
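As a rough illustration of what a multiplicative interaction term looks like, here is a minimal sketch on a hypothetical three-column frame (the column names `a`, `b`, `c` and their values are made up). It uses `itertools.combinations` so that each unordered pair appears only once; this is one possible construction, not necessarily the one used to build `interaction`.

```python
from itertools import combinations

import pandas as pd

# hypothetical toy data, not the competition data
toy = pd.DataFrame({"a": [1, 2, 3], "b": [0, 1, 0], "c": [2.0, 0.5, 4.0]})

inter = toy.copy()
for i, j in combinations(toy.columns, 2):
    # one new column per unordered pair of original columns
    inter[f"{i}x{j}"] = toy[i] * toy[j]

print(inter.columns.tolist())
print(inter)
```

For `n` original columns this adds `n * (n - 1) / 2` new columns, which is why the interaction dataframe grows so quickly and why selection over it is expensive.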

Ultimately, the preprocessing operations that yielded the most predictive data were removing the noisy columns of only zeros and removing the rows that had missing (-999) values. Thus, the preprocessing was relatively simple, yet effective.

In [4]:
train.describe()

,Y,2,3,4,5,6,7,8,9,10,...,77,78,79,80,81,82,83,84,85,86
count,2853.0,2853.0,2853.0,2853.0,2853.0,2853.0,2853.0,2853.0,2853.0,2853.0,...,2853.0,2853.0,2853.0,2853.0,2853.0,2853.0,2853.0,2853.0,2853.0,2853.0
mean,0.438486,-0.332282,-2.367683,0.00666,0.445846,0.031195,0.002454,0.04136,0.035401,0.011216,...,0.019979,0.191377,0.079215,0.004907,0.000351,0.039958,1.055731,4206.567473,0.042411,0.229861
std,0.496289,18.703948,49.436698,0.081349,0.497146,0.173875,0.049481,0.199156,0.184824,0.10533,...,0.139952,0.393454,0.270121,0.069891,0.018722,0.195895,0.312289,2429.57765,0.271268,1.89276
min,0.0,-999.0,-999.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,-3.0,0.0,-8.487258
25%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,2114.0,0.0,-1.033209
50%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,4197.0,0.0,0.516411
75%,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,6302.0,0.0,1.545231
max,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,6.0,8416.0,7.0,7.473742


In [5]:
# find that -999 likely represents missing values
train["3"].value_counts()

# drop these missing values
train.replace(-999, np.nan, inplace=True)
train.dropna(inplace=True)

In [6]:
# remove noisy columns with only 0s
noise = [col for col in train.columns if (train[col].min() == 0) and (train[col].max() == 0)]
train = train.drop(noise, axis=1)
test = test.drop(noise, axis=1)

The data does not need a `train_test_split` because the models are evaluated later with cross-validation over the entire training set.

In [7]:
# data for testing algos and feature generation
X = train.drop(columns="Y")
y = train['Y']

# set the random state seed
rs = 0

In [8]:
def normalize(s):
    # min-max scale a column to [0, 1]; constant columns map to 0.0
    rng = s.max() - s.min()
    if rng > 0:
        return (s - s.min()) / rng
    return s * 0.0
normalized = X.copy()
X_norm = normalized.apply(normalize)

In [9]:
# dataframe with all outliers removed
z_scores = np.abs((train - train.mean()) / train.std())
no_outliers = train[(z_scores < 3).all(axis=1)]
len(no_outliers)

1126

In [10]:
# generate interaction terms
interaction = train.copy()
for i in train.columns:
    for j in train.columns:
        if i != j:
            interaction[i+'x'+j] = interaction[i]*interaction[j]
interaction.head()

Id,Y,2,3,4,5,6,7,8,9,10,...,86x76,86x77,86x78,86x79,86x80,86x81,86x82,86x83,86x84,86x85
1,0,0.0,0.0,0,1,0,0,0,0,0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,7587.0,0.0
2,1,0.0,0.0,0,1,0,0,0,0,0,...,2.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,10700.0,0.0
3,1,0.0,0.0,0,1,0,0,0,1,1,...,-3.785554,-0.0,-0.0,-0.0,-0.0,-0.0,-0.0,-3.785554,-7858.809122,-0.0
4,0,0.0,1.0,0,0,0,0,0,0,0,...,-0.0,-0.0,-0.0,-0.0,-0.0,-0.0,-0.0,-1.733455,-4335.370892,-0.0
5,0,0.0,0.0,0,0,0,0,0,0,0,...,-0.0,-0.0,-0.0,-0.0,-0.0,-0.0,-0.0,-1.869119,-4764.384203,-0.0


## Model Creation

In order to create the most predictive model possible, I started with simpler models and then moved to more complex models that used boosting and neural networks. After researching the models, reading documentation, and reading advice from past Kaggle competitors on several forums, I decided which parameters to target with `GridSearchCV` and picked a range of reasonable values to test for each model.

The Basic Models required very little time to train and had fewer parameters to test different values for. Out of the Basic Models, Random Forest was the best single predictor.

The Boosting Models required more time than the Basic Models to train and research. After consulting online forums I decided to include the LightGBM classifier since it had a strong reputation of working well for other Kaggle classification competitions. Since there were so many parameters for the CatBoost, XGBoost, and LightGBM, I had to research which parameters historically had the most impact on the predictive capabilities of the models. Something that was really surprising to me was the stark difference in the parameters that were considered the "best" for a model depending on the preprocessing performed on the data. The parameters for the models without the noisy data removed or when the data was normalized were drastically different from each other. 

After some time researching and experimenting, I determined these were the most impactful parameters for CatBoost and LightGBM:
* CatBoost
    * `depth`
    * `learning_rate`
    * `l2_leaf_reg`
    * `iterations`
    * `random_strength`
    
* LightGBM
    * `objective`
    * `metric`
    * `boosting_type`
    * `max_depth`
    * `num_iterations`
    * `learning_rate`
    
After running `GridSearchCV` on most of the Boosting Models, I used the returned `best_params_` as the new parameters for each model. The best parameters for each model are visible in their respective cells. The LightGBM model was the best single predictor out of the Boosting Models.
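The search-then-refit workflow described above can be sketched as follows. This is a minimal example on synthetic data from `make_classification` with a deliberately tiny, hypothetical grid (`toy_params`), not the grids or data used in this notebook. Passing `scoring="roc_auc"` is an assumption chosen to match the competition metric.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# synthetic stand-in for the training data
X_toy, y_toy = make_classification(n_samples=200, n_features=10, random_state=0)

# hypothetical, intentionally small grid
toy_params = {"n_estimators": [10, 20], "max_depth": [3, 5]}

search = GridSearchCV(
    estimator=RandomForestClassifier(random_state=0),
    param_grid=toy_params,
    cv=5,
    scoring="roc_auc",
)
search.fit(X_toy, y_toy)

# best_params_ holds the winning combination; reuse it to build the final model
print(search.best_params_)
final_model = RandomForestClassifier(random_state=0, **search.best_params_)
final_model.fit(X_toy, y_toy)
```

Hard-coding the winning values into the estimator, as the cells in this section do, avoids rerunning the search every time the notebook is executed.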

Next, I trained a Multilayer Perceptron to test the predictive capabilities of a relatively simple neural network. I was disappointed at the lack of accuracy of the Multilayer Perceptron, and since the model took about 15 hours to perform GridSearchCV on, I opted against training another Neural Network Model for the sake of time.

In total, I tried 8 different models and, through the use of `GridSearchCV`, evaluated a combined total of 11,166 model configurations.
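The 11,166 figure appears to match the number of parameter combinations in the grids defined in the cells of this section, plus one each for the two models that were not tuned (Logistic Regression and AdaBoost). The sketch below recomputes it with scikit-learn's `ParameterGrid`. The grids are copied from those cells; LightGBM's fixed `metric` and `feature_fraction` are left out because single values do not change the count.

```python
from sklearn.model_selection import ParameterGrid

grids = {
    "rf": {"n_estimators": [50, 100, 200, 300],
           "criterion": ["gini", "entropy", "log_loss"],
           "max_depth": [3, 4, 5, 6, 7]},
    "xgb": {"n_estimators": [50, 100, 200, 300],
            "learning_rate": [0.01, 0.02, 0.05, 0.075, 0.1, 0.2],
            "max_depth": [3, 4, 5, 6, 7]},
    "gbc": {"n_estimators": [50, 100, 200, 300, 500],
            "learning_rate": [0.01, 0.02, 0.03, 0.04, 0.06, 0.08, 0.1, 0.15],
            "max_depth": [3, 4, 5, 6, 7],
            "loss": ["log_loss", "exponential"]},
    "cat": {"depth": [4, 5, 6, 7, 8],
            "learning_rate": [0.04, 0.05, 0.06, 0.07, 0.08, 0.09, 0.1],
            "iterations": [50, 100, 200, 300],
            "l2_leaf_reg": [3, 4, 5, 6, 7, 8],
            "random_strength": [3, 4, 5, 6, 7, 8]},
    "lgb": {"objective": ["binary", "regression"],
            "boosting_type": ["gbdt", "rf", "dart"],
            "max_depth": [3, 4, 5, 6, 7],
            "num_iterations": [50, 100, 200, 300],
            "learning_rate": [0.025, 0.05, 1]},
    "mlp": {"hidden_layer_sizes": [(100, 50), (150, 100, 50), (50,), (100,), (50, 100, 50), (150, 100)],
            "activation": ["identity", "logistic", "tanh", "relu"],
            "solver": ["adam", "sgd", "lbfgs"],
            "max_iter": [100, 200, 300, 400],
            "alpha": [0.00005, 0.0001, 0.00015, 0.00025, 0.0005, 0.001],
            "learning_rate": ["constant", "invscaling", "adaptive"]},
}

# number of parameter combinations per grid
counts = {name: len(ParameterGrid(grid)) for name, grid in grids.items()}
print(counts)

# add the two untuned models (Logistic Regression and AdaBoost)
total = sum(counts.values()) + 2
print(total)
```

With `cv=5`, each combination is fit once per fold, so the number of individual fits is roughly five times the number of combinations. That helps explain why the MLP grid alone, at 5,184 combinations, took so long to search.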

In [11]:
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, BaggingClassifier, AdaBoostClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_predict, StratifiedKFold, GridSearchCV
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.neural_network import MLPClassifier
from sklearn.feature_selection import RFECV
from catboost import CatBoostClassifier
from datetime import datetime
import xgboost as xgb
import lightgbm as lgb

### Basic Models

#### Logistic Regression

In [25]:
log_model = LogisticRegression(random_state=rs, max_iter=1000).fit(X, y)

#### Random Forest

In [26]:
rf_params = {
    "n_estimators": [50, 100, 200, 300],
    "criterion": ['gini', 'entropy', 'log_loss'],
    "max_depth": [3, 4, 5, 6, 7],
}

grid_rf_model = RandomForestClassifier(random_state=rs, criterion='entropy', n_estimators=50, max_depth=7)

'''
grid_rf_model = GridSearchCV(estimator=grid_rf_model, param_grid=rf_params, cv=5, n_jobs=-1)
best: {criterion='entropy', n_estimators=50, max_depth=7}
'''

grid_rf_model.fit(X, y)

### Boosting Models

#### AdaBoost

In [27]:
adb_model = AdaBoostClassifier(random_state=rs, 
                               n_estimators=100, 
                               learning_rate=.2).fit(X, y)

#### XGBoost

In [28]:
xgb_params = {
    'n_estimators': [50, 100, 200, 300],
    'learning_rate': [0.01, 0.02, 0.05, 0.075, 0.1, .2],
    'max_depth': [3, 4, 5, 6, 7]
}
grid_xgb_model = xgb.XGBClassifier(random_state=rs,
                                   n_estimators=300, 
                                   learning_rate=.02, 
                                   max_depth=3,
                                   booster='gbtree')

'''
# grid_xgb_model = GridSearchCV(estimator=grid_xgb_model, param_grid=xgb_params, cv=5, n_jobs=-1)
# new best:  {'learning_rate': 0.02, 'max_depth': 3, 'n_estimators': 300}
'''
grid_xgb_model.fit(X, y)

#### GradientBoosting

In [29]:
gbc_params = {'n_estimators': [50, 100, 200, 300, 500],
               'learning_rate': [0.01, 0.02, 0.03, 0.04, 0.06, 0.08, 0.1, 0.15,],
               'max_depth': [3, 4, 5, 6, 7],
               'loss': ['log_loss', 'exponential']}
grid_gbc_model = GradientBoostingClassifier(learning_rate=0.1, 
                                            loss='exponential', 
                                            max_depth=6, 
                                            n_estimators=50, 
                                            random_state=rs)
'''
grid_gbc_model = GridSearchCV(estimator=grid_gbc_model, param_grid=gbc_params, cv=5, n_jobs=-1)
new best: {'learning_rate': 0.1, 'loss': 'exponential', 'max_depth': 6, 'n_estimators': 50}
'''

grid_gbc_model.fit(X, y)

#### CatBoost

In [30]:
cat_parameters = {'depth': [4, 5, 6, 7, 8],
                  'learning_rate' : [0.04, 0.05, 0.06, 0.07, 0.08, 0.09, 0.1],
                  'iterations'    : [50, 100, 200, 300],
                  'l2_leaf_reg'   : [3, 4, 5, 6, 7, 8],
                  'random_strength': [3, 4, 5, 6, 7, 8]}

grid_cat_model = CatBoostClassifier(random_state=rs, 
                                    verbose=False, 
                                    depth=5, 
                                    l2_leaf_reg=7,
                                    iterations=200,
                                    learning_rate=0.09,
                                    random_strength=7,
                                    loss_function='CrossEntropy',
                                    eval_metric='AUC')
'''
grid_cat_model = GridSearchCV(estimator=grid_cat_model, param_grid = cat_parameters, cv = 5, n_jobs=-1)
new best 
 {'depth': 5, 'iterations': 200, 'l2_leaf_reg': 7, 'learning_rate': 0.09, 'random_strength': 7}
'''
grid_cat_model.fit(X, y)

<catboost.core.CatBoostClassifier at 0x1ee5b9f6490>

#### LightGBM

In [31]:
# GridSearchCV requires every value in the grid to be a list
lgb_params = {
    'objective': ['binary', 'regression'],
    'metric': ['binary_logloss'],
    'boosting_type': ['gbdt', 'rf', 'dart'],
    'max_depth': [3, 4, 5, 6, 7],
    'num_iterations': [50, 100, 200, 300],
    'learning_rate': [0.025, 0.05, 1],
    'feature_fraction': [0.9]
}

grid_lgb_model = lgb.LGBMClassifier(random_state=rs, 
                                    boosting_type='gbdt',
                                    learning_rate=.05,
                                    max_depth=4,
                                    metric='binary_logloss',
                                    num_iterations=200,
                                    objective='binary',
                                    verbose=-1)

'''
grid_lgb_model = GridSearchCV(estimator=grid_lgb_model, param_grid=lgb_params, cv=5, scoring='accuracy', n_jobs=-1)
best: {'boosting_type': 'gbdt', 'learning_rate': 0.05, 'max_depth': 4, 
'metric': 'binary_logloss', 'num_iterations': 200, 'objective': 'binary'}
'''
grid_lgb_model.fit(X, y)

### Neural Networks

#### MultiLayerPerceptron

In [32]:
mlp_params = {'hidden_layer_sizes': [(100, 50), (150, 100, 50), (50,), (100,), (50, 100, 50), (150, 100)],
              'activation': ['identity', 'logistic', 'tanh', 'relu'],
              'solver': ['adam', 'sgd', 'lbfgs'],
              'max_iter': [100, 200, 300, 400],
              'alpha': [0.00005, 0.0001, 0.00015, 0.00025, 0.0005, 0.001],
              'learning_rate': ['constant', 'invscaling', 'adaptive']}

grid_mlp_model = MLPClassifier(activation='tanh', 
                               alpha=0.00015, 
                               hidden_layer_sizes=(100, 50), 
                               learning_rate='constant', 
                               max_iter=600, 
                               solver='lbfgs', 
                               random_state=rs)
'''
grid_mlp_model = GridSearchCV(estimator=grid_mlp_model, param_grid=mlp_params, cv=5, n_jobs=-1)
best parameters were activation='tanh', alpha=0.00015, hidden_layer_sizes=(100, 50), 
learning_rate='constant', max_iter=400, solver='lbfgs'
'''
grid_mlp_model.fit(X_norm, y)

## Initial Predictions and Evaluations

To evaluate the predictive capability of each model, I used `cross_val_predict` from scikit-learn so each model could be trained and evaluated on the entirety of the training dataset. Using the `predict_proba` method, I generated soft prediction values for the predicted class. I then used the AUC of each model to evaluate the quality of the predictions, since that was the metric used to evaluate our performance on Kaggle. Using the cross-validation prediction method worked extremely well, as it gave me a relatively accurate benchmark of what to expect from my Kaggle results. For the sake of time, I used 5-fold validation, as using a higher number of folds took much more time and did not offer any significant insight into the predictive power of the data.

At the end of each `cross_val_predict` statement there is an index of `[:, 1]`. This is used because the `predict_proba` method produces a 2-dimensional array with one row per record in `X` and 2 columns. The second column holds the predicted probability that a record is of class 1. Thus, the index `[:, 1]` selects the entire vector of predicted class-1 probabilities.
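To make the indexing concrete, here is a small sketch on synthetic data. The `X_toy` and `y_toy` names and the `make_classification` data are illustrative only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

# synthetic binary classification data, 100 records
X_toy, y_toy = make_classification(n_samples=100, n_features=5, random_state=0)
model = LogisticRegression(max_iter=1000)

# out-of-fold class probabilities for every record
proba = cross_val_predict(model, X_toy, y_toy, cv=5, method="predict_proba")
print(proba.shape)      # (100, 2): one row per record, one column per class

# each row sums to 1 because the two columns are P(class 0) and P(class 1)
print(np.allclose(proba.sum(axis=1), 1.0))

# keep only the class-1 probability for every record
p_class1 = proba[:, 1]
print(p_class1.shape)   # (100,)
```

`roc_auc_score` expects this one-dimensional vector of class-1 scores for a binary problem, which is why every prediction line ends with `[:, 1]`.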

In [20]:
# predictions
log_pred = cross_val_predict(log_model, X, y, cv=5, method='predict_proba')[:, 1]
rf_pred = cross_val_predict(grid_rf_model, X, y, cv=5, method='predict_proba')[:,1]
adb_pred = cross_val_predict(adb_model, X, y, cv=5, method='predict_proba')[:,1]
gbc_pred = cross_val_predict(grid_gbc_model, X, y, cv=5, method='predict_proba')[:,1]
xgb_pred = cross_val_predict(grid_xgb_model, X, y, cv=5, method='predict_proba')[:,1]
cat_pred = cross_val_predict(grid_cat_model, X, y, cv=5, method='predict_proba')[:,1]
lgb_pred = cross_val_predict(grid_lgb_model, X, y, cv=5, method='predict_proba')[:,1]
mlp_pred = cross_val_predict(grid_mlp_model, X, y, cv=5, method='predict_proba')[:,1]

# get auc score
log_auc = roc_auc_score(y, log_pred)
rf_auc = roc_auc_score(y, rf_pred)
adb_auc = roc_auc_score(y, adb_pred)
gbc_auc = roc_auc_score(y, gbc_pred)
xgb_auc = roc_auc_score(y, xgb_pred)
cat_auc = roc_auc_score(y, cat_pred)
lgb_auc = roc_auc_score(y, lgb_pred)
mlp_auc = roc_auc_score(y, mlp_pred)

# display scores
print(f"Logistic AUC: {log_auc}")
print(f"Random Forest AUC: {rf_auc}")
print(f"AdaBoost AUC: {adb_auc}")
print(f"GradientBoosting: {gbc_auc}")
print(f"XGBoost: {xgb_auc}")
print(f"CatBoost: {cat_auc}")
print(f"LGBM:\t {lgb_auc}")
print(f"MLP: {mlp_auc}")

Logistic AUC: 0.8759196190751259
Random Forest AUC: 0.8869991736465455
AdaBoost AUC: 0.8895943146080036
GradientBoosting: 0.899469689676198
XGBoost: 0.8995047896408974
CatBoost: 0.9027615649369405
LGBM:	 0.9035605905619203
MLP: 0.8734310315779339


## Model Improvement

After getting the initial model predictions and performance, I used several of the ideas discussed in class in an attempt to improve the predictive capability of my models. I attempted backward and forward selection, model stacking, and bagging various models.

Backward selection was extremely time intensive and did not offer any significant improvement when used on the regular dataset or the dataset created with the interaction terms. It took several hours to run and did not perform as well as the basic boosting models did. Thus, when given the time sacrifice for the high computational cost, it was not beneficial to use backward selection.

Surprisingly, forward selection was a very similar story to backward selection. The computation was time intensive and ultimately did not produce as predictive a model as the base boosting models. For both backward and forward selection, I added LightGBM after I had run the initial feature selection tests, and since the performance of the original 4 boosted models was not improved, I decided it was not worth the time to test backward or forward selection on LightGBM.

Next, I used the `StackingClassifier` from scikit-learn, combining the AdaBoost, GradientBoost, XGBoost, and CatBoost models into a single model. Again, this model was surprisingly not as predictive as the basic, unstacked models. When creating the stacked model, I was confident that the results would improve, but I was wrong; it was still better to go with a single boosted model over the stacked model, even when considering the performance on the private leaderboard. I also bagged the stacked model, but this did not prove to be helpful either. Admittedly, I could have used GridSearchCV to explore what the best `final_estimator` should have been for my stacking model, but the results on Kaggle were too discouraging to continue exploring this idea, so I decided to pursue other ideas that seemed more promising.

Even though stacking with the `StackingClassifier` did not perform as well as I had hoped, I tried to create my own stacked model by combining the predictions of the boosted models into a single feature of the dataset. This underperformed both in my local environment and on Kaggle, **but** it did lead me to the idea that ultimately delivered my best predictions on Kaggle, securing me 3rd place on the private leaderboard. Even though my idea of combining predictions into a single feature did not work, it was a crucial stepping stone to the idea that did.

Lastly, I bagged all of the boosted models in an attempt to create more generalized models that would perform better on data they had not seen before. The CatBoost bagged model performed very well on Kaggle and was my leading score for a while before I tried my final idea. The bagging was computationally intense, but I think the time sacrifice was worth the predictive capabilities of the final models.

### Feature Selection

#### Backward Selection

In [None]:
# backward select the models
adb_backward = RFECV(estimator=adb_model)
gbc_backward = RFECV(estimator=grid_gbc_model)
xgb_backward = RFECV(estimator=grid_xgb_model)
cat_backward = RFECV(estimator=grid_cat_model)

# fit the models
adb_backward.fit(X, y)
gbc_backward.fit(X, y)
xgb_backward.fit(X, y)
cat_backward.fit(X, y)

# adjust X based on the features selected for each model
# (X is a DataFrame, so select columns with .loc and the boolean support_ mask)
adb_selected = X.loc[:, adb_backward.support_]
gbc_selected = X.loc[:, gbc_backward.support_]
xgb_selected = X.loc[:, xgb_backward.support_]
cat_selected = X.loc[:, cat_backward.support_]

# fit the adjusted models
adb_model.fit(adb_selected, y)
grid_gbc_model.fit(gbc_selected, y)
grid_xgb_model.fit(xgb_selected, y)
grid_cat_model.fit(cat_selected, y)

# generate predictions from each model on its own selected features
adb_backward_pred = cross_val_predict(adb_model, adb_selected, y, cv=5, method='predict_proba')[:,1]
gbc_backward_pred = cross_val_predict(grid_gbc_model, gbc_selected, y, cv=5, method='predict_proba')[:,1]
xgb_backward_pred = cross_val_predict(grid_xgb_model, xgb_selected, y, cv=5, method='predict_proba')[:,1]
cat_backward_pred = cross_val_predict(grid_cat_model, cat_selected, y, cv=5, method='predict_proba')[:,1]

# get the AUC score for each model
adb_backward_auc = roc_auc_score(y, adb_backward_pred)
gbc_backward_auc = roc_auc_score(y, gbc_backward_pred)
xgb_backward_auc = roc_auc_score(y, xgb_backward_pred)
cat_backward_auc = roc_auc_score(y, cat_backward_pred)

# display the AUCs
print(f"adb_backward AUC: {adb_backward_auc}")
print(f"gbc_backward AUC: {gbc_backward_auc}")
print(f"xgb_backward AUC: {xgb_backward_auc}")
print(f"cat_backward AUC: {cat_backward_auc}")

#### Forward Selection

When doing forward selection, I selected the CatBoost and GradientBoosting models as my initial test models since they routinely achieved a high AUC with their base models. After conducting forward selection and testing these models, I decided not to continue with forward selection for the other models, since it did not result in an increase in predictive power for these two. Initially, I chose 9 features as the `n_features_to_select` to follow the same philosophy as random forest feature selection, but this resulted in a poor local AUC. Thus, I increased `n_features_to_select` to 15 as a compromise between training time and increased predictiveness.

In [None]:
from sklearn.feature_selection import SequentialFeatureSelector as sfs

# build the forward selection models
gbc_sfs = sfs(estimator=grid_gbc_model, n_features_to_select=15, direction='forward', cv=5, n_jobs=-1)
cat_sfs = sfs(estimator=grid_cat_model, n_features_to_select=15, direction='forward', cv=5, n_jobs=-1)

# fit the models
gbc_sfs.fit(X, y)
cat_sfs.fit(X, y)

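# generate predictions using only the selected features
# (the selector itself has no predict_proba, so transform X and score the underlying estimator)
gbc_sfs_pred = cross_val_predict(grid_gbc_model, gbc_sfs.transform(X), y, cv=5, method='predict_proba')[:, 1]
cat_sfs_pred = cross_val_predict(grid_cat_model, cat_sfs.transform(X), y, cv=5, method='predict_proba')[:, 1]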

# get AUC
gbc_sfs_auc = roc_auc_score(y, gbc_sfs_pred)
cat_sfs_auc = roc_auc_score(y, cat_sfs_pred)

# display AUCs
print(f"gbc_sfs AUC: {gbc_sfs_auc}")
print(f"cat_sfs AUC: {cat_sfs_auc}")

### Ensembling

#### Stacking

In [23]:
from sklearn.ensemble import StackingClassifier

# estimators to stack
base_estimators = [
    ('adaboost', adb_model),
    ('gradientboost', grid_gbc_model),
    ('xgboost', grid_xgb_model),
    ('catboost', grid_cat_model)
]

# create the stacked model
stacked_model = StackingClassifier(estimators=base_estimators, 
                                   stack_method='predict_proba',
                                   final_estimator=xgb.XGBClassifier(random_state=rs, n_estimators=100, learning_rate=.02), 
                                   passthrough=True, 
                                   n_jobs=-1)
# bag the stacked model
stacked_bag = BaggingClassifier(estimator=stacked_model, n_estimators=20, n_jobs=-1)

# fit the models
stacked_model.fit(X, y)
stacked_bag.fit(X, y)

# generate predictions
stack_pred = cross_val_predict(stacked_model, X, y, cv=5, method='predict_proba')[:,1]
stack_bag_pred = cross_val_predict(stacked_bag, X, y, cv=5, method='predict_proba')[:,1]

# get AUC of predictions
stack_auc = roc_auc_score(y, stack_pred)
stack_bag_auc = roc_auc_score(y, stack_bag_pred)

# display the AUCs
print(f"Stacked Model AUC: {stack_auc}")
print(f"Stacked Bagged Model AUC: {stack_bag_auc}")

Stacked Model AUC: 0.8976555229293027
Stacked Bagged Model AUC: 0.897493060235551


This next cell was the experiment that ultimately led me to the idea that would deliver my best Kaggle performance.

In [24]:
# stacking retrain on same model
stacked = X.copy()
stacked['combo_pred'] = ((cat_pred + xgb_pred + gbc_pred + adb_pred) / 4)

grid_xgb_model.fit(stacked, y)
grid_cat_model.fit(stacked, y)
grid_gbc_model.fit(stacked, y)

xgb_stack_pred = cross_val_predict(grid_xgb_model, stacked, y, cv=5, method='predict_proba')[:,1]
cat_stack_pred = cross_val_predict(grid_cat_model, stacked, y, cv=5, method='predict_proba')[:,1]
gbc_stack_pred = cross_val_predict(grid_gbc_model, stacked, y, cv=5, method='predict_proba')[:,1]

xgb_stack_auc = roc_auc_score(y, xgb_stack_pred)
cat_stack_auc = roc_auc_score(y, cat_stack_pred)
gbc_stack_auc = roc_auc_score(y, gbc_stack_pred)

print(f"XGBoost (Stacked): {xgb_stack_auc}")
print(f"CatBoost (Stacked): {cat_stack_auc}")
print(f"GBoost (Stacked): {gbc_stack_auc}")

XGBoost (Stacked): 0.8948312293411637
CatBoost (Stacked): 0.8934279828952858
GBoost (Stacked): 0.8952105596739512


#### Bagging

In [21]:
# bagging only the best performing models
adb_bag = BaggingClassifier(estimator=adb_model, n_estimators=20, n_jobs=-1)
gbc_bag = BaggingClassifier(estimator=grid_gbc_model, n_estimators=20, n_jobs=-1)
xgb_bag = BaggingClassifier(estimator=grid_xgb_model, n_estimators=20, n_jobs=-1)
cat_bag = BaggingClassifier(estimator=grid_cat_model, n_estimators=20, n_jobs=-1)
lgb_bag = BaggingClassifier(estimator=grid_lgb_model, n_estimators=20, n_jobs=-1)
rf_bag = BaggingClassifier(estimator=grid_rf_model, n_estimators=20, n_jobs=-1)

# fitting the models
adb_bag.fit(X, y)
gbc_bag.fit(X, y)
xgb_bag.fit(X, y)
cat_bag.fit(X, y)
lgb_bag.fit(X, y)
rf_bag.fit(X, y)

# get the model predictions
adb_bag_pred = cross_val_predict(adb_bag, X, y, cv=5, method='predict_proba')[:,1]
gbc_bag_pred = cross_val_predict(gbc_bag, X, y, cv=5, method='predict_proba')[:,1]
xgb_bag_pred = cross_val_predict(xgb_bag, X, y, cv=5, method='predict_proba')[:,1]
cat_bag_pred = cross_val_predict(cat_bag, X, y, cv=5, method='predict_proba')[:,1]
lgb_bag_pred = cross_val_predict(lgb_bag, X, y, cv=5, method='predict_proba')[:,1]
rf_bag_pred = cross_val_predict(rf_bag, X, y, cv=5, method='predict_proba')[:,1]

# get the AUC score
adb_bag_auc = roc_auc_score(y, adb_bag_pred)
gbc_bag_auc = roc_auc_score(y, gbc_bag_pred)
xgb_bag_auc = roc_auc_score(y, xgb_bag_pred)
cat_bag_auc = roc_auc_score(y, cat_bag_pred)
lgb_bag_auc = roc_auc_score(y, lgb_bag_pred)
rf_bag_auc = roc_auc_score(y, rf_bag_pred)

# display AUC scores
print(f"AdaBoost AUC: {adb_bag_auc}")
print(f"GradientBoosting: {gbc_bag_auc}")
print(f"XGBoost: {xgb_bag_auc}")
print(f"CatBoost: {cat_bag_auc}")
print(f"LightGBM: {lgb_bag_auc}")
print(f"RandomForest: {rf_bag_auc}")

AdaBoost AUC: 0.8892756570713392
GradientBoosting: 0.8984778649594043
XGBoost: 0.8986187662462695
CatBoost: 0.9017531930939315
LightGBM: 0.9036716568787908
RandomForest: 0.8880245940438368


### Personal Stacking

This was the idea that generated the best predictions on Kaggle. It grew out of my previous idea of creating a new feature based on the stacked predictions from the various models. When that posed some issues with being able to fit the `test` dataset, I decided to simply average the predictions on the test set, creating a `y_pred` based on the average soft prediction value of the models I chose to combine.

The AUC from the combined `predict_proba` outputs was notably high, so I was unsure what to expect on Kaggle and was concerned that the predictions were overfit to the provided training data. The Kaggle score was indeed lower than the local AUC, but these were still the most accurate Kaggle predictions I was able to generate. I was surprised that the predictions remained relatively accurate even though they appeared to be very overfit to the training data, and even more surprised when that accuracy carried over to the private leaderboard.
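One likely reason the local AUC is so high is that, in the cell that follows, the fitted models predict on the same `X` they were trained on, so the score is measured in-sample. The sketch below contrasts in-sample and out-of-fold AUC for an averaged two-model ensemble on synthetic data. The models, data, and names (`X_toy`, `avg_in`, `avg_oof`) are illustrative stand-ins, not the notebook's.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_predict

# noisy synthetic data so that overfitting is visible
X_toy, y_toy = make_classification(n_samples=400, n_features=20, n_informative=5,
                                   flip_y=0.2, random_state=0)

models = [GradientBoostingClassifier(random_state=0),
          RandomForestClassifier(n_estimators=50, random_state=0)]

# in-sample: fit on all rows, then predict those same rows
in_preds = [m.fit(X_toy, y_toy).predict_proba(X_toy)[:, 1] for m in models]
avg_in = sum(in_preds) / len(in_preds)

# out-of-fold: every row is predicted by a model that never saw it
oof_preds = [cross_val_predict(m, X_toy, y_toy, cv=5, method="predict_proba")[:, 1]
             for m in models]
avg_oof = sum(oof_preds) / len(oof_preds)

in_auc = roc_auc_score(y_toy, avg_in)
oof_auc = roc_auc_score(y_toy, avg_oof)
print(f"in-sample AUC:   {in_auc:.3f}")
print(f"out-of-fold AUC: {oof_auc:.3f}")
```

Averaging the `cross_val_predict` outputs from the earlier evaluation cell would give a local estimate for the ensemble that is more directly comparable to a leaderboard score.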

In [22]:
# get the predictions of the ensemble models

# condense variables
cat_pred = grid_cat_model.predict_proba(X)[:, 1]
lgb_pred = grid_lgb_model.predict_proba(X)[:, 1]
xgb_pred = grid_xgb_model.predict_proba(X)[:, 1]
gbc_pred = grid_gbc_model.predict_proba(X)[:, 1]
adb_pred = adb_model.predict_proba(X)[:, 1]
rf_pred = grid_rf_model.predict_proba(X)[:, 1]

# regular boosted models
cat_lgb_pred = (cat_pred + lgb_pred) / 2
cat_gbc_xgb_adb_pred = (cat_pred + gbc_pred + xgb_pred + adb_pred) / 4
cat_lgb_gbc_xgb_pred = (cat_pred + lgb_pred + gbc_pred + xgb_pred) / 4
cat_lgb_gbc_xgb_adb_pred = (cat_pred + lgb_pred + gbc_pred + xgb_pred + adb_pred) / 5
cat_xgb_lgb_gbc_rf_pred = (cat_pred + xgb_pred + lgb_pred + gbc_pred + rf_pred) / 5
cat_lgb_rf_pred = (cat_pred + lgb_pred + rf_pred) / 3
cat_xgb_rf_pred = (cat_pred + xgb_pred + rf_pred) / 3

# bagged models
bag_cat_xgb_gbc_lgb_pred = (cat_bag.predict_proba(X)[:, 1] + xgb_bag.predict_proba(X)[:, 1] + gbc_bag.predict_proba(X)[:, 1] + lgb_bag.predict_proba(X)[:, 1]) / 4

# get the AUC scores
cat_lgb = roc_auc_score(y, cat_lgb_pred)
cat_gbc_xgb_adb = roc_auc_score(y, cat_gbc_xgb_adb_pred)
cat_lgb_gbc_xgb = roc_auc_score(y, cat_lgb_gbc_xgb_pred)
cat_lgb_gbc_xgb_adb = roc_auc_score(y, cat_lgb_gbc_xgb_adb_pred)
cat_xgb_lgb_gbc_rf = roc_auc_score(y, cat_xgb_lgb_gbc_rf_pred)
cat_lgb_rf = roc_auc_score(y, cat_lgb_rf_pred)
cat_xgb_rf = roc_auc_score(y, cat_xgb_rf_pred)

# bagged models
bag_cat_xgb_gbc_lgb = roc_auc_score(y, bag_cat_xgb_gbc_lgb_pred)

# display the AUC scores
print(f"cat_lgb AUC: {cat_lgb}")
print(f"cat_gbc_xgb_adb AUC: {cat_gbc_xgb_adb}")
print(f"cat_lgb_gbc_xgb AUC: {cat_lgb_gbc_xgb}")
print(f"cat_lgb_gbc_xgb_adb AUC: {cat_lgb_gbc_xgb_adb}")
print(f"cat_xgb_lgb_gbc_rf AUC: {cat_xgb_lgb_gbc_rf}")
print(f"cat_lgb_rf AUC: {cat_lgb_rf}")
print(f"cat_xgb_rf AUC: {cat_xgb_rf}")

# bagged models
print(f"bag_cat_xgb_gbc_lgb AUC: {bag_cat_xgb_gbc_lgb}")

cat_lgb AUC: 0.9508720836943616
cat_gbc_xgb_adb AUC: 0.9552279893135649
cat_lgb_gbc_xgb AUC: 0.9554160248387407
cat_lgb_gbc_xgb_adb AUC: 0.9549010582137928
cat_xgb_lgb_gbc_rf AUC: 0.9523031593979655
cat_lgb_rf AUC: 0.9463146039921696
cat_xgb_rf AUC: 0.9367980007060107
bag_cat_xgb_gbc_lgb AUC: 0.9545651014088123


## Training on the best model

I took the best-performing model from all the models tested above, generated the final `y_pred`, converted it to a dataframe, and finally wrote it to a csv file for submission on Kaggle. I printed the length of the dataframe to ensure the file was formatted correctly for Kaggle.

In [33]:
y_pred = (grid_cat_model.predict_proba(test)[:, 1] + grid_lgb_model.predict_proba(test)[:, 1] + grid_gbc_model.predict_proba(test)[:, 1] + grid_xgb_model.predict_proba(test)[:, 1]) / 4
y_pred = pd.DataFrame(y_pred, columns=['Y'], index=range(2854, 2854+len(y_pred)))
y_pred.index.name = 'Id'
print(len(y_pred))

3854


In [34]:
y_pred.to_csv("output.csv")

## Summary

To conclude, I began with preprocessing the data by looking at standard statistical measures for the columns of the dataset. Then, I continued with preprocessing by removing missing values and noisy columns. I attempted to normalize the data and remove outliers; however, these ideas were not beneficial to the predictive power of my models. 

Then I moved to creating my models, beginning with basic models, moving to boosted models, then concluding with a neural network. To select the best parameters for my models, I used `GridSearchCV`, and models that had GridSearch run on them are denoted with the prefix `grid_`. Out of all these models, the LightGBM model was the single most predictive model. When generating predictions, I used the soft probability values for each record being classified as a 1. I used the AUC score to compare the predictive power of the models.

After creating my initial models, I tried various methods to improve them. I began with feature selection, trying forward and backward selection. These methods were computationally expensive, and I was surprised that neither improved the predictive capability of my models. Next, I attempted to ensemble the boosted models I created earlier using stacking and bagging. The stacking classifier and bagged models were better than my initial predictions using the normalized data and better than the boosted models that were not bagged. However, these ensembling methods did not generate the best predictions I got. By averaging the final predictions of my boosted models and bagged boosted models, I was able to generate my best predictions. I scored an AUC of .908 on the private leaderboard, placing me third, even though my public leaderboard score did not place in the top 20.

Ultimately, my best model was the average predictions of the LightGBM, GradientBoosting, CatBoost, and XGBoost models.