# Tabular Playground Series - Oct 21

This month's data consists of 285 feature variables (f0 to f284) and a binary target, so this is a binary classification problem. We will first perform some basic EDA to get a better look at the data, and then start working on our models.

## Plan

This is the plan we are going to follow. It is not set in stone and I might change it as we move through the notebook, but it shows how I approach these datasets.

- *Memory Reduction*
- *Sampling to Reduce Training Time*
- *EDA*
- *Model Development*
- *Hyperparameter Tuning*
- *Feature Importance from top models*
- *Selecting the best Model*

## Imports 

Let's import some of the libraries we will be using throughout the notebook

In [1]:
# Data Import on Kaggle
import os
import time
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Importing processing libraries
import numpy as np
import pandas as pd

# Importing Visualisation libraries
import seaborn as sns
import matplotlib.pyplot as plt

# Importing libraries for the metrics
from sklearn import metrics
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split, GridSearchCV, KFold

# Importing libraries for the model
import xgboost as xgb 
import lightgbm as lgb
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import SGDClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from collections import Counter
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from catboost import CatBoostClassifier
from sklearn.experimental import enable_hist_gradient_boosting 
from sklearn.ensemble import HistGradientBoostingClassifier

# sklearn imports for analysis
from sklearn.metrics import accuracy_score, precision_recall_fscore_support, classification_report, confusion_matrix, roc_auc_score
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import KFold, StratifiedKFold
from scipy.stats import randint

In [2]:
data = pd.read_csv('../input/tabular-playground-series-oct-2021/train.csv')
test_data = pd.read_csv('../input/tabular-playground-series-oct-2021/test.csv')

In [3]:
data = data.drop('id', axis=1)

## Memory Reduction

If you don't have any issues with memory, you can skip this step. 
Here, we look at the memory consumed by the data as a whole and by each feature, and then try to reduce it. 

There are several other methods to save RAM - you can refer to this article on [14 tips to save RAM memory](https://www.kaggle.com/pavansanagapati/14-simple-tips-to-save-ram-memory-for-1-gb-dataset). 
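
As one example of such an alternative, pandas can pick a smaller dtype for you with `pd.to_numeric(..., downcast=...)`. The sketch below uses a made-up two-column frame (the names `cont` and `flag` are hypothetical, not columns of the competition data). Note that this route never goes below float32, unlike the float16 conversion used further down.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the competition data.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "cont": rng.random(1_000),                # float64 by default
    "flag": rng.integers(0, 2, size=1_000),   # int64 by default
})
before = toy.memory_usage(deep=True).sum()

small = toy.copy()
# downcast="float" picks the smallest float dtype that fits (float32 at minimum)
small["cont"] = pd.to_numeric(small["cont"], downcast="float")
# downcast="integer" picks the smallest signed integer dtype that fits
small["flag"] = pd.to_numeric(small["flag"], downcast="integer")
after = small.memory_usage(deep=True).sum()

print(small.dtypes)
print(f"{before} bytes -> {after} bytes")
```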

In [4]:
memory_usage = data.memory_usage(deep=True) / 1024 ** 2
print('memory usage of features: \n', memory_usage.head(7))
print('memory usage sum: ',memory_usage.sum())

In [5]:
def reduce_memory_usage(df, verbose=True):
    numerics = ["int8", "int16", "int32", "int64", "float16", "float32", "float64"]
    start_mem = df.memory_usage().sum() / 1024 ** 2
    for col in df.columns:
        col_type = df[col].dtypes
        if col_type in numerics:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == "int":
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)
            else:
                if (
                    c_min > np.finfo(np.float16).min
                    and c_max < np.finfo(np.float16).max
                ):
                    df[col] = df[col].astype(np.float16)
                elif (
                    c_min > np.finfo(np.float32).min
                    and c_max < np.finfo(np.float32).max
                ):
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
    end_mem = df.memory_usage().sum() / 1024 ** 2
    if verbose:
        print(
            "Mem. usage decreased to {:.2f} Mb ({:.1f}% reduction)".format(
                end_mem, 100 * (start_mem - end_mem) / start_mem
            )
        )
    return df

data = reduce_memory_usage(data, verbose=True)
test_data = reduce_memory_usage(test_data, verbose=True)

In [6]:
data.describe()

## Sampling Data

Now that we have reduced the memory usage by over 70%, let's sample the data. 

Why are we doing this? Well, you don't have to. But if you're like me and own a MacBook Air that can't handle a dataset bigger than 100 MB, this might be a good idea.

When we are performing model selection and hyperparameter tuning later, we can't afford to let the notebook run for hours on end testing every model. Taking a random 20% sample of the dataset approximately preserves the distribution of each feature, and using this sampled data reduces the training time.

We can then perform EDA, modelling, hyperparameter tuning and other steps on this sampled data. Once we decide on the model we want to use and improve its performance, we can train the final model on the entire dataset again.
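
A plain random sample keeps the class balance only approximately. If you want the target ratio preserved exactly, one option is to sample within each class. This is a minimal sketch on a made-up frame (the 70/30 split and the column `f0` are assumptions for illustration), not what the notebook does below.

```python
import numpy as np
import pandas as pd

# Hypothetical imbalanced toy data: 7,000 negatives and 3,000 positives.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "f0": rng.random(10_000),
    "target": np.repeat([0, 1], [7_000, 3_000]),
})

# Draw 20% of the rows from each target class separately.
stratified = toy.groupby("target", group_keys=False).sample(frac=0.2, random_state=42)

print(toy["target"].mean(), stratified["target"].mean())
```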

In [7]:
sample_df = data.sample(int(len(data) * 0.2), random_state=42)  # fixed seed so the sample is reproducible
sample_df.shape

In [8]:
# Let's confirm if the sampling is retaining the feature distributions

fig, ax = plt.subplots(figsize=(6, 4))

sns.histplot(
    data=data, x="f6", label="Original data", color="red", alpha=0.3, bins=15
)
sns.histplot(
    data=sample_df, x="f6", label="Sample data", color="green", alpha=0.3, bins=15
)

plt.legend()
plt.show();

## EDA

Let's start looking at any correlations that might exist among the features.
We will also be looking at the densities of every feature.
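
With this many features a heatmap gets hard to read, so it can help to list only the strongly correlated pairs. The sketch below does that on a made-up frame (columns `a`, `b`, `c` and the 0.9 threshold are assumptions): it keeps the upper triangle of the absolute correlation matrix so each pair appears once.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: b is almost a multiple of a, c is independent noise.
rng = np.random.default_rng(0)
a = rng.normal(size=500)
toy = pd.DataFrame({
    "a": a,
    "b": 2 * a + rng.normal(scale=0.01, size=500),
    "c": rng.normal(size=500),
})

corr = toy.corr().abs()
# Keep only the entries above the diagonal so each pair is counted once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack()
strong = pairs[pairs > 0.9]

print(strong)
```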

In [None]:
sample_df

In [None]:
# Check na values
print('Number of NaN values:', sample_df.isna().sum().sum())

print('---------')
# Target Class Distribution
target_dist = sample_df.target.value_counts()
print('Distribution of Target Class \n',target_dist)
print(target_dist[0]/(target_dist[0] + target_dist[1]))

There don't seem to be any NaN values in the data. Also, the target class is split evenly between the two groups.

In [None]:
f, ax = plt.subplots(figsize=(8, 6))
corr = sample_df.iloc[:,:20].corr()
sns.heatmap(corr, mask=np.zeros_like(corr, dtype=bool), cmap=sns.diverging_palette(220, 10, as_cmap=True),
            square=True, ax=ax)
plt.show()

Before we look at distributions, we need to split the data into continuous and categorical variables.

In [None]:
cat_variables = []

for column in sample_df.columns:
    if len(sample_df[column].unique()) < 10:
        cat_variables.append(column)
print(cat_variables)

So, the categorical features in the dataset are f22, f43 and all the ones from f242 to f284. Let's plot the distributions of the remaining (continuous) features using kdeplot, and the distributions of the categorical ones using barplot (split by target).
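
The uniqueness check above can also be written without a loop using `DataFrame.nunique`. This is a minimal sketch on a made-up frame (the column names `f_cont` and `f_bin` are hypothetical); dropping the target first keeps it out of the feature lists.

```python
import pandas as pd

# Hypothetical toy frame: one continuous-looking column and one binary column.
toy = pd.DataFrame({
    "f_cont": [0.11, 0.52, 0.93, 0.34, 0.75, 0.26, 0.67, 0.08, 0.49, 0.80, 0.31, 0.62],
    "f_bin": [0, 1] * 6,
    "target": [1, 0] * 6,
})

# Count distinct values per feature, leaving the target out.
n_unique = toy.drop(columns="target").nunique()
cat_cols = n_unique[n_unique < 10].index.tolist()
cont_cols = n_unique[n_unique >= 10].index.tolist()

print("categorical:", cat_cols)
print("continuous:", cont_cols)
```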

In [None]:
fig = plt.figure(figsize = (18, 100))
sns.set_style("white")

# f0 to f241 are the candidate continuous columns (f22 and f43 are categorical)
cont_columns = sample_df.columns.tolist()[:242]

for i, column in enumerate(cont_columns):
    if column in ['f22', 'f43']:
        continue
    plt.subplot(25, 10, i + 1)
    plt.title(column, size = 12, fontname = 'monospace')
    # fill=True replaces the deprecated shade=True
    a = sns.kdeplot(sample_df[column], color = '#1a5d57', fill = True, alpha = 0.9, linewidth = 1.5, edgecolor = 'black')
    plt.ylabel('')
    plt.xlabel('')
    plt.xticks(fontname = 'monospace')
    plt.yticks([])
    for j in ['right', 'left', 'top']:
        a.spines[j].set_visible(False)
    a.spines['bottom'].set_linewidth(1.2)

fig.tight_layout(h_pad = 3)

plt.show()

In [None]:
# Code from https://www.kaggle.com/craigmthomas/tps-oct-2021-eda

cat_features = ["f22", "f43"]
cat_features.extend(["f{}".format(x) for x in range(242, 285)])

fig, axs = plt.subplots(11, 4, figsize=(4*4, 11*3), squeeze=False, sharey=True)

ptr = 0
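# The 11 x 4 grid holds 44 plots, so the last of the 45 features is not drawn.
for row in range(11):
    for col in range(4):
        # Count samples per (feature value, target) pair; naming the Series
        # keeps the column label stable across pandas versions.
        x = sample_df[[cat_features[ptr], "target"]].value_counts().sort_index().rename("# of Samples").reset_index()
        sns.barplot(x=cat_features[ptr], y="# of Samples", hue="target", data=x, ax=axs[row][col])
        axs[row][col].set_xlabel(cat_features[ptr])
        ptr += 1
        del x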
plt.tight_layout()    
plt.show()

## Data Preparation

In this section, we will do some preprocessing. This part involves feature scaling and splitting the data into train and test sets.

### Scaling 

Most of the models I plan to use in the 'model selection' section (XGBoost, Random Forest, etc.) do not require any form of feature scaling, but some of them (such as KNN and SVM) need it to work well. 

##### Why

In general, algorithms that exploit distances or similarities (e.g. in the form of scalar product) between data samples, such as K-NN and Support Vector Machines, are sensitive to feature transformations.

Graphical-model based classifiers, such as Fisher LDA or Naive Bayes, as well as Decision trees and Tree-based ensemble methods (Random Forests, XGBoost) are invariant to feature scaling, but still, it might be a good idea to rescale/standardize your data.
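
To make the point about distances concrete, here is a small sketch with four made-up points whose second feature has a much larger range than the first. The data and the helper `nearest_to_first` are hypothetical. The nearest neighbour of the first point changes once both features are scaled to [0, 1].

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical points: feature 0 lies in [0, 1], feature 1 in [0, 1000].
X = np.array([
    [0.0, 0.0],      # query point
    [0.1, 100.0],
    [1.0, 10.0],
    [0.5, 1000.0],
])

def nearest_to_first(points):
    """Return the row index of the point closest (Euclidean) to row 0."""
    distances = np.linalg.norm(points[1:] - points[0], axis=1)
    return int(np.argmin(distances)) + 1

raw_nn = nearest_to_first(X)
scaled_nn = nearest_to_first(MinMaxScaler().fit_transform(X))

print("nearest neighbour on raw data:   ", raw_nn)
print("nearest neighbour on scaled data:", scaled_nn)
```

On the raw data the large-range feature dominates the distance, so row 2 is closest; after scaling, row 1 is closest.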

In [9]:
# We only want to scale the continuous features.

cont_features = []
for x in range(0, 242):
    if (x != 22) and (x!=43):
        cont_features.append("f{}".format(x))

In [10]:
from sklearn.preprocessing import MinMaxScaler

scale = MinMaxScaler()
# fit_transform both learns the min/max and applies the scaling, so a second
# transform call on the already scaled data is not needed.
sample_df[cont_features] = scale.fit_transform(sample_df[cont_features])

print('Data scaled using : ', scale)

### Train-Test Split

Let's split our sampled data into train and test sets
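
A common variation, sketched here on synthetic data rather than on the notebook's frame, is to stratify the split on the target and to fit the scaler on the training part only, so that no information from the held-out rows leaks into preprocessing. All names in this block (`X_toy`, `y_toy`, `scaler`, and so on) are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Hypothetical data: 1,000 rows, 3 features, 60/40 class split.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(1_000, 3))
y_toy = np.repeat([0, 1], [600, 400])

# stratify keeps the class ratio the same in both parts of the split.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, train_size=0.7, stratify=y_toy, random_state=42
)

# Learn the min/max on the training rows only, then apply to both parts.
scaler = MinMaxScaler().fit(X_tr)
X_tr_scaled = scaler.transform(X_tr)
X_te_scaled = scaler.transform(X_te)

print(y_tr.mean(), y_te.mean())
print(X_tr_scaled.min(), X_tr_scaled.max())
```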

In [11]:
X = sample_df.drop('target', axis=1)
y = sample_df.target

X_train, X_test, y_train, y_test = train_test_split( X, y, train_size=0.7, random_state=42)

del sample_df # we do this to remove sample_df from memory

In [12]:
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)

In [13]:
id_test_submission = test_data.id
X_test_submission = test_data.drop('id', axis=1)

del test_data

## Model Selection

Finally, now that we're done with the preprocessing and EDA, we are going to take a look at how some basic models perform on a subset of the data (20%) without any parameter tuning. We can then retrain and evaluate the top performing models with a bigger dataset and tuned parameters.

You can take a look at the model_dict and add any models that you think might perform well. Or leave a comment and I'll add them asap!
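
If you do add models, one way to make the comparison less dependent on a single split is cross-validation. This is a minimal sketch, assuming synthetic data from `make_classification` and two stand-in models (it does not use the notebook's data or model_dict).

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the competition data.
X_toy, y_toy = make_classification(n_samples=500, n_features=10, random_state=0)

candidates = {
    "Logistic Reg": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(max_depth=3, random_state=0),
}

cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)

# Mean ROC AUC over the folds for each candidate model.
cv_auc = {
    name: cross_val_score(clf, X_toy, y_toy, cv=cv, scoring="roc_auc").mean()
    for name, clf in candidates.items()
}

for name, score in cv_auc.items():
    print(f"{name}: {score:.3f}")
```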


*P.S: Kaggle has a time-out error if the run time of a notebook exceeds a certain time limit. So, I'll comment some of these models out. However, I'll keep the top performing models uncommented.*

In [None]:
model_dict = {
    'ADABoost': AdaBoostClassifier(),
    'CatBoost': CatBoostClassifier(verbose=False),
    'Light GBM': lgb.LGBMClassifier(random_state=0, verbose=-1),
#     'XGB': xgb.XGBClassifier(random_state=0, n_estimators=10), 
#     'Gradient Boosting Classifier': GradientBoostingClassifier(random_state=0, verbose=1),
#     'Logistic Reg': LogisticRegression(random_state=0, max_iter=350, solver='lbfgs'),
#     'Naive Bayes': GaussianNB(), 
#     'Support Vector Machine': SVC(random_state=0, verbose=1),
#     'K Nearest Classifier': KNeighborsClassifier(),
#     'Decision Tree': DecisionTreeClassifier(random_state=0),
            }
model_list = []
train_auc_list = []
test_auc_list = []

for model, clf in model_dict.items():
    start_time = time.time()

    clf.fit(X_train, y_train)

    # ROC AUC is computed from the predicted probability of the positive class.
    # Models without predict_proba (e.g. SVC with default settings) would need
    # decision_function instead.
    test_proba = clf.predict_proba(X_test)[:, 1]
    test_auc = roc_auc_score(y_test, test_proba)

    train_proba = clf.predict_proba(X_train)[:, 1]
    train_auc = roc_auc_score(y_train, train_proba)

    # Hard class labels are only needed for the report and the confusion matrix
    test_pred = clf.predict(X_test)

    print(model, 'Model')
    print('Classification Report \n',classification_report(y_test, test_pred))
    print('Confusion Matrix \n',confusion_matrix(y_test,test_pred))
    print('Train ROC AUC: ', train_auc)
    print('Test ROC AUC: ', test_auc)
    print("\n Ran in %s seconds" % (time.time() - start_time))
    print('--------------------------------')

    model_list.append(model)
    train_auc_list.append(train_auc)
    test_auc_list.append(test_auc)


results = pd.DataFrame({"model": model_list, "train_auc": train_auc_list, "test_auc": test_auc_list})

In [None]:
results

## Initial Model Selection


Now that we've trained our first batch of models on default parameters, we can eliminate a few which didn't do well. The CatBoost, AdaBoost and LightGBM models performed the best, so we will keep those and perform hyperparameter tuning on them.
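
A ranking like this depends on how ROC AUC is computed: the metric is meant to be fed scores or probabilities rather than hard class labels. A small sketch with made-up labels and probabilities shows the difference (the 0.5 threshold is an assumption).

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Made-up labels and predicted probabilities of the positive class.
y_true = np.array([0, 0, 0, 1, 1, 1])
proba = np.array([0.1, 0.2, 0.3, 0.4, 0.6, 0.7])

# Thresholding at 0.5 turns the probabilities into hard labels.
hard = (proba >= 0.5).astype(int)

auc_from_proba = roc_auc_score(y_true, proba)
auc_from_labels = roc_auc_score(y_true, hard)

print("AUC from probabilities:", auc_from_proba)
print("AUC from hard labels:  ", auc_from_labels)
```

Here every positive has a higher probability than every negative, so the AUC from probabilities is 1.0, while thresholding throws away part of the ordering and the AUC from labels drops to about 0.83.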


## Hyperparameter Tuning

We'll start off by using GridSearchCV/RandomizedSearchCV on these models with various parameters and selecting the best performing ones based on their roc_auc score.

Again, since the Grid Search and Randomized Search run for a long time, we run into a time-out error. So, I have commented those segments out but you can run them if you'd like! I have used the best chosen parameters and built the final model with those.
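
For readers who have not used it before, this is roughly what a grid search looks like. It is a minimal sketch, assuming synthetic data and scikit-learn's GradientBoostingClassifier as a stand-in model with a deliberately tiny grid so it runs in seconds; it is not the search used for the models below.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the competition data.
X_toy, y_toy = make_classification(n_samples=300, n_features=8, random_state=0)

# Every combination in the grid is tried: 2 x 2 = 4 candidates.
param_grid = {"max_depth": [2, 3], "n_estimators": [20, 40]}

grid = GridSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_grid=param_grid,
    scoring="roc_auc",  # candidates are ranked by cross-validated ROC AUC
    cv=3,
)
grid.fit(X_toy, y_toy)

print(grid.best_params_, grid.best_score_)
```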

## XGBoost

In [16]:
params = {
    'max_depth': 6,
    'n_estimators': 9500,
    'subsample': 0.7,
    'colsample_bytree': 0.2,
    'colsample_bylevel': 0.6000000000000001,
    'min_child_weight': 56.41980735551558,
    'reg_lambda': 75.56651890088857,
    'reg_alpha': 0.11766857055687065,
    'gamma': 0.6407823221122686,
    'booster': 'gbtree',
    'eval_metric': 'auc',
    'tree_method': 'gpu_hist',
    'predictor': 'gpu_predictor',
    'use_label_encoder': False
    }

In [None]:
%%time

# Code from https://www.kaggle.com/mohammadkashifunique/tsp-single-xgboost-model
kf = StratifiedKFold(n_splits=2, shuffle=True, random_state=42)

preds = []
scores = []

for fold, (idx_train, idx_valid) in enumerate(kf.split(X_train, y_train)):
    # The fold indices are positions within X_train / y_train, so index those
    X_train_t, y_train_t = X_train.iloc[idx_train], y_train.iloc[idx_train]
    X_test_t, y_test_t = X_train.iloc[idx_valid], y_train.iloc[idx_valid]
    
    params['learning_rate']=0.007
    model1 = xgb.XGBClassifier(**params)
    
    model1.fit(X_train_t,y_train_t,
              eval_set=[(X_train_t, y_train_t),(X_test_t,y_test_t)],
              early_stopping_rounds=200,
              verbose=False)
    
    params['learning_rate']=0.01
    model2 = xgb.XGBClassifier(**params)
    
    # Fit on the fold's training part only, so the validation rows stay unseen
    model2.fit(X_train_t,y_train_t,
              eval_set=[(X_train_t, y_train_t),(X_test_t,y_test_t)],
              early_stopping_rounds=200,
              verbose=False,
              xgb_model=model1)
    
    params['learning_rate']=0.05
    model3 = xgb.XGBClassifier(**params)
    
    model3.fit(X_train_t,y_train_t,
              eval_set=[(X_train_t, y_train_t),(X_test_t,y_test_t)],
              early_stopping_rounds=200,
              verbose=False,
              xgb_model=model2)
    
    pred_valid = model3.predict_proba(X_test_t)[:,1]
    # roc_auc_score is already imported and gives the same area under the ROC curve
    score = roc_auc_score(y_test_t, pred_valid)
    scores.append(score)
    
    print(f"Fold: {fold + 1} Score: {score}")
    print('||'*40)
    
    test_preds = model3.predict_proba(X_test_submission)[:,1]
    preds.append(test_preds)
    
print(f"Overall Validation Score: {np.mean(scores)}")

In [18]:
predictions = np.mean(np.column_stack(preds),axis=1)

new_df = pd.DataFrame({'id': id_test_submission, 'target': predictions})
new_df.to_csv('./xgb_submission.csv', index=False)


## LightGBM

The parameters we set for grid search were:
- learning_rate: 0.003, 0.009
- max_depth: -1, 3, 5, 7
- n_estimators: 500, 1000, 2500
- num_leaves: 28, 31, 50, 75

and the top performing parameters were
- learning_rate: 0.003
- max_depth: -1
- n_estimators: 1000
- num_leaves: 50

with a score of 0.76.

In [None]:
# params = {
#     'num_leaves': [28, 31, 50, 75],
#     'learning_rate': [0.003, 0.009],
#     'max_depth': [-1, 3, 5, 7],
#     'n_estimators': [500, 1000, 2500],
# }

# lgb_estimator = lgb.LGBMClassifier(random_state=42)

# grid = GridSearchCV(lgb_estimator, param_grid=params, scoring='roc_auc_ovr', cv=5, verbose=100)
# lgb_model = grid.fit(X_train, y_train)

# print(lgb_model.best_params_, lgb_model.best_score_)

In [None]:
lgb_model = lgb.LGBMClassifier(learning_rate=0.003, max_depth=-1, n_estimators=1000, num_leaves=50, random_state=42, verbose=100)
lgb_model.fit(X_train, y_train)
# Use the probability of the positive class so the AUC reflects the ranking
preds = lgb_model.predict_proba(X_test)[:, 1]
roc_auc = roc_auc_score(y_test, preds)
print('roc_auc of lgb:', roc_auc)

In [None]:
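# test_data was deleted earlier, so use the saved X_test_submission / id_test_submission.
# Probabilities are submitted, consistent with the XGBoost submission above.
submission_preds = lgb_model.predict_proba(X_test_submission)[:, 1]
new_df = pd.DataFrame({'id': id_test_submission, 'target': submission_preds})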
new_df.to_csv("lgb_submission.csv",index=False)


## CatBoost

To show you more techniques, I have used RandomizedSearchCV for this model which is an alternative to GridSearchCV. Now that you have code samples of both, you can test them out and select which ones you like.
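
The main difference from a grid search is that parameters are drawn from distributions and only `n_iter` combinations are tried. This is a minimal sketch, assuming synthetic data and scikit-learn's GradientBoostingClassifier as a stand-in for CatBoost; the ranges are illustrative, not tuned values.

```python
from scipy.stats import randint, uniform
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for the competition data.
X_toy, y_toy = make_classification(n_samples=300, n_features=8, random_state=0)

param_dist = {
    "learning_rate": uniform(0.01, 0.19),  # uniform on [0.01, 0.20]
    "max_depth": randint(2, 5),            # integers 2, 3 or 4
}

search = RandomizedSearchCV(
    GradientBoostingClassifier(n_estimators=30, random_state=0),
    param_distributions=param_dist,
    n_iter=5,           # only 5 random combinations are evaluated
    scoring="roc_auc",
    cv=3,
    random_state=0,
)
search.fit(X_toy, y_toy)

print(search.best_params_, search.best_score_)
```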

In [None]:
# param_dist = { "learning_rate": np.linspace(0.01,0.2,5),  # start above 0, a zero learning rate is not valid
#                "max_depth": randint(3, 10)}
               
# #Instantiate RandomSearchCV object
# cat_model = CatBoostClassifier(random_state=42, verbose=500)
# rscv = RandomizedSearchCV(cat_model , param_dist, scoring='roc_auc', cv=3)

# #Fit the model
# rscv.fit(X_train,y_train)

# # Print the tuned parameters and score
# print(rscv.best_params_)
# print(rscv.best_score_)


In [None]:
cat_model = CatBoostClassifier(learning_rate=0.003, max_depth=3, n_estimators=1000, random_state=42, verbose=100)
cat_model.fit(X_train, y_train)
# Use the probability of the positive class so the AUC reflects the ranking
preds = cat_model.predict_proba(X_test)[:, 1]
roc_auc = roc_auc_score(y_test, preds)
print('roc_auc of cat:', roc_auc)

In [None]:
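# test_data was deleted earlier, so use the saved X_test_submission / id_test_submission.
# Probabilities are submitted, consistent with the XGBoost submission above.
submission_preds = cat_model.predict_proba(X_test_submission)[:, 1]
new_df = pd.DataFrame({'id': id_test_submission, 'target': submission_preds})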
new_df.to_csv("cat_submission.csv",index=False)


## AdaBoost


In [None]:
# AdaBoostClassifier does not accept a verbose argument
ada_model = AdaBoostClassifier(random_state=42)
ada_model.fit(X_train, y_train)
# Use the probability of the positive class so the AUC reflects the ranking
preds = ada_model.predict_proba(X_test)[:, 1]
roc_auc = roc_auc_score(y_test, preds)
print('\n roc_auc of ada:', roc_auc)

## Histogram Gradient Boosting

In [None]:
# Code from https://www.kaggle.com/ankitkalauni/tps-21-oct-single-histgbm-0-85651
hist_params = {'l2_regularization': 1.3244040135051264e-10,
               'early_stopping': True,  # boolean, not the string 'True'
               'learning_rate': 0.0366777965884429, 
               'max_iter': 10000, 
               'max_depth': 3, 
               'max_bins': 129, 
               'min_samples_leaf': 13449, 
               'max_leaf_nodes': 68}

kf = StratifiedKFold(n_splits=3, shuffle=True, random_state=1)

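for fold, (idx_train, idx_valid) in enumerate(kf.split(X_train, y_train)):
    # Train on the training part of the fold and score on the held-out part
    X_fold_train, y_fold_train = X_train.iloc[idx_train], y_train.iloc[idx_train]
    X_fold_valid, y_fold_valid = X_train.iloc[idx_valid], y_train.iloc[idx_valid]

    hgbm_model = HistGradientBoostingClassifier(**hist_params)
    hgbm_model.fit(X_fold_train, y_fold_train)
    pred_valid = hgbm_model.predict_proba(X_fold_valid)[:,1]
    score = roc_auc_score(y_fold_valid, pred_valid)
    print(f"Fold: {fold + 1} Score: {score}")
    print('--'*20)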

## The Finale

Phew! Now that we're done testing all these different models and tuning them, we have the results. About time, huh. You've been reading for ages (Thank you for that btw!) :)

Of the models we tested, the Histogram Gradient Boosting Classifier gave the best results. Let's train it (with the new parameters) on the entire dataset and see how it does.

In [None]:
# Splitting the entire dataset into train and test

X = data.drop('target', axis=1)
y = data.target

X_train, X_test, y_train, y_test = train_test_split( X, y, train_size=0.85, random_state=42)

In [None]:
hist_params = {'l2_regularization': 1.3244040135051264e-10,
               'early_stopping': True,  # boolean, not the string 'True'
               'learning_rate': 0.0366777965884429, 
               'max_iter': 10000, 
               'max_depth': 3, 
               'max_bins': 129, 
               'min_samples_leaf': 13449, 
               'max_leaf_nodes': 68,
              'verbose': 500}


hgbm_model = HistGradientBoostingClassifier(**hist_params)
hgbm_model.fit(X_train, y_train)
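# Use the probability of the positive class so the AUC reflects the ranking
preds = hgbm_model.predict_proba(X_test)[:, 1]
score = roc_auc_score(y_test, preds)
print('roc_auc of hgbm on the full data:', score)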

## Submission

In [None]:
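# test_data was deleted earlier, so use the saved X_test_submission / id_test_submission.
# Probabilities are submitted, consistent with the XGBoost submission above.
submission_preds = hgbm_model.predict_proba(X_test_submission)[:, 1]
new_df = pd.DataFrame({'id': id_test_submission, 'target': submission_preds})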
new_df.to_csv("submission1.csv",index=False)

## What Next?

I'll continue testing with other models to find better results. In the meantime, you could reach out to me through the comments if you have ideas on models I could add. 
 
#### If you liked this notebook, I'd appreciate it if you would leave an upvote!
