# <center>TabularPlaygroundClassifier April2021</center>
<img src= "https://www.pixelstalk.net/wp-content/uploads/images1/Titanic-Wallpapers-HD-Free-download.jpg" height="200" align="center"/>


<a id="Table-Of-Contents"></a>
# Table Of Contents
* [Table Of Contents](#Table-Of-Contents)
* [Introduction](#Introduction)
* [Importing Libraries](#Importing-Libraries)
* [Task Details](#Task-Details)
* [Read in Data](#Read-in-Data)
    - [Train.csv](#Train.csv)
    - [Test.csv](#Test.csv)
    - [Notes](#Notes)
* [Data Visualization](#Data-Visualization)
    - [Data Dictionary](#Data-Dictionary)
    - [Variable Notes](#Variable-Notes)
    - [Categorical Features](#Categorical-Features)
    - [Continuous Features](#Continuous-Features)
    - [Target](#Target)
* [Preprocessing Data](#Preprocessing-Data)
    - [Label Encoding](#Label-Encoding)
    - [One Hot Encoding](#One-Hot-Encoding)
    - [Imputation](#Imputation)
    - [Train-Test Stratified Split](#Train-Test-Stratified-Split)
* [Model Performance Metrics](#Model-Performance-Metrics)
    - [ROC Curve](#ROC-Curve)
    - [Confusion Matrix](#Confusion-Matrix)
* [Initial Models](#Initial-Models)
* [Random Forest Classifier](#Random-Forest-Classifier)
    - [Random Forest Bayesian Optimization](#Random-Forest-Bayesian-Optimization)
    - [Random Forest Cross Validation](#Random-Forest-Cross-Validation)
    - [Random Forest CV Model Performance](#Random-Forest-CV-Model-Peformance)
* [LightGBM Classifier](#LightGBM-Classifier)
    - [Bayesian Optimization](#Bayesian-Optimization)
    - [Tuning LightGBM](#Tuning-LightGBM)
    - [Feature Importance](#Feature-Importance)
    - [Cross Validation](#Cross-Validation)
    - [LightGBM CV Model Performance](#LightGBM-CV-Model-Peformance)
* [Prediction for Test.csv](#Prediction-for-Test.csv)
* [Conclusion](#Conclusion)

<a id="Importing-Libraries"></a>
# Importing Libraries

In [None]:
#%% Importing Libraries

# Basic Imports 
import numpy as np
import pandas as pd
from IPython.display import display, HTML

# Plotting 
from matplotlib import pyplot
import matplotlib.pyplot as plt
import plotly.express as px
import seaborn as sns
%matplotlib inline

# Preprocessing
from sklearn.model_selection import train_test_split, StratifiedKFold,cross_val_score
from sklearn.preprocessing import LabelEncoder
from sklearn.impute import SimpleImputer
# explicitly require this experimental feature
from sklearn.experimental import enable_iterative_imputer  # noqa
# now you can import normally from sklearn.impute
from sklearn.impute import IterativeImputer

# Metrics 
from sklearn.metrics import confusion_matrix
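# plot_confusion_matrix was removed in scikit-learn 1.2; ConfusionMatrixDisplay is its replacement
from sklearn.metrics import ConfusionMatrixDisplay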
from sklearn.metrics import precision_recall_fscore_support
from sklearn.metrics import classification_report,accuracy_score, recall_score, roc_auc_score, precision_score
from sklearn.metrics import roc_curve, auc


# ML Models
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
import xgboost as xgb
import lightgbm as lgb
from lightgbm import LGBMClassifier 

# Model Tuning 
from bayes_opt import BayesianOptimization

# Feature Importance 
import shap 

# Ignore Warnings 
import warnings
warnings.filterwarnings('ignore')

<a id="Introduction"></a>
# Introduction
This is my fourth competition notebook on Kaggle. I hope to learn more about working with tabular data, and I hope anyone who reads this learns more as well! This notebook works on a classification task. If you have any questions or comments, please leave them below!

<a id="Task-Details"></a>
# Task Details

## Goal
Your task is to predict whether or not a passenger survived the sinking of the Synthanic (a synthetic, much larger dataset based on the actual Titanic dataset). For each PassengerId row in the test set, you must predict a 0 or 1 value for the Survived target.

## Metric
Your score is the percentage of passengers you correctly predict. This is known as **[accuracy](https://en.wikipedia.org/wiki/Accuracy_and_precision#In_binary_classification)**.

Accuracy = (TP + TN)/(TP + TN + FP + FN)
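
As a quick illustration of the formula above, here is a minimal sketch that counts TP, TN, FP, and FN by hand. The labels in `y_true` and `y_pred` are made-up toy values, not competition data.

```python
# Toy labels (assumed for illustration only)
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

pairs = list(zip(y_true, y_pred))

# Count each cell of the confusion matrix
tp = sum(t == 1 and p == 1 for t, p in pairs)
tn = sum(t == 0 and p == 0 for t, p in pairs)
fp = sum(t == 0 and p == 1 for t, p in pairs)
fn = sum(t == 1 and p == 0 for t, p in pairs)

# Accuracy = (TP + TN) / (TP + TN + FP + FN)
accuracy = (tp + tn) / (tp + tn + fp + fn)
print(f"TP={tp} TN={tn} FP={fp} FN={fn} accuracy={accuracy}")
```

The same number should come out of `sklearn.metrics.accuracy_score(y_true, y_pred)`, which is what the notebook uses later.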

<a id="Read-in-Data"></a>
# Read in Data

<a id="Train.csv"></a>
## Train.csv

In [None]:
#%% Read train.csv
train_csv = pd.read_csv('../input/tabular-playground-series-apr-2021/train.csv')

# Initial glance at train.csv
print(train_csv.info(verbose = True,show_counts=True))

<a id="Test.csv"></a>
## Test.csv

In [None]:
#%% Read test.csv
test_csv = pd.read_csv('../input/tabular-playground-series-apr-2021/test.csv')

# Initial glance at test.csv
print(test_csv.info(verbose = True,show_counts=True))

<a id="Notes"></a>
## Notes

Train.csv and Test.csv both have missing values, so imputation might be needed. Since there aren't many features in this dataset, a quick exploratory data analysis can be done on the features and the target.
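
To put numbers on the missing values, a small sketch like the following can help. The `toy` frame is a hypothetical stand-in for `train_csv`; the column names are borrowed from the dataset, but the values are invented.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for train_csv (values are invented)
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 35.0, np.nan],
    "Fare": [7.25, 71.28, np.nan, 8.05],
    "Embarked": ["S", "C", None, "S"],
})

# Count and percentage of missing values per column
missing_count = toy.isna().sum()
missing_pct = (toy.isna().mean() * 100).round(1)

summary = pd.DataFrame({"missing": missing_count, "percent": missing_pct})
print(summary)
```

Running the same two lines on `train_csv` and `test_csv` should show which columns need imputation.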

<a id="Data-Visualization"></a>
# Data Visualization 

<a id="Data-Dictionary"></a>
## Data Dictionary
| Variable | Definition                                 | Key                                            |
|----------|--------------------------------------------|------------------------------------------------|
| Survived | Survival                                   | 0 = No, 1 = Yes                                |
| Pclass   | Ticket class                               | 1 = 1st, 2 = 2nd, 3 = 3rd                      |
| Sex      | Sex                                        |                                                |
| Age      | Age in years                               |                                                |
| SibSp    | # of siblings / spouses aboard the Titanic |                                                |
| Parch    | # of parents / children aboard the Titanic |                                                |
| Ticket   | Ticket number                              |                                                |
| Fare     | Passenger fare                             |                                                |
| Cabin    | Cabin number                               |                                                |
| Embarked | Port of Embarkation                        | C = Cherbourg, Q = Queenstown, S = Southampton |

<a id="Variable-Notes"></a>
## Variable Notes
pclass: A proxy for socio-economic status (SES)
- 1st = Upper
- 2nd = Middle
- 3rd = Lower

age: Age is fractional if less than 1. If the age is estimated, it is in the form of xx.5.

sibsp: The dataset defines family relations in this way:
- Sibling = brother, sister, stepbrother, stepsister
- Spouse = husband, wife (mistresses and fiancés were ignored)

parch: The dataset defines family relations in this way:
- Parent = mother, father
- Child = daughter, son, stepdaughter, stepson

Some children travelled only with a nanny, therefore parch=0 for them.

In [None]:
#%% PlotMultiplePie 
# Input: df = Pandas dataframe, categorical_features = list of features , dropna = boolean variable to use NaN or not
# Output: prints multiple px.pie() 

def PlotMultiplePie(df,categorical_features = None,dropna = False):
    # set a threshold of 30 unique values; more than that can lead to ugly pie charts 
    threshold = 30
    
    # if user did not set categorical_features 
    if categorical_features is None: 
        categorical_features = df.select_dtypes(['object','category']).columns.to_list()
        
    print("The Categorical Features are:",categorical_features)
    
    # loop through the list of categorical_features 
    for cat_feature in categorical_features: 
        num_unique = df[cat_feature].nunique(dropna = dropna)
        num_missing = df[cat_feature].isna().sum()
        # prints pie chart and info if unique values below threshold 
        if num_unique <= threshold:
            print('Pie Chart for: ', cat_feature)
            print('Number of Unique Values: ', num_unique)
            print('Number of Missing Values: ', num_missing)
            counts = df[cat_feature].value_counts(dropna = dropna)
            # pass arrays explicitly so the plot does not depend on how pandas names the counts column
            fig = px.pie(values = counts.values, names = counts.index.astype(str),
                 title = cat_feature,template='ggplot2')
            fig.show()
        else: 
            print('Pie Chart for ',cat_feature,' is unavailable due to a high number of unique values ')
            print('Number of Unique Values: ', num_unique)
            print('Number of Missing Values: ', num_missing)
            print('\n')

In [None]:
#%% Use PlotMultiplePie to see the distribution of the categorical variables for train_csv
PlotMultiplePie(train_csv)

In [None]:
#%% Use PlotMultiplePie to see the distribution of the categorical variables for test_csv
PlotMultiplePie(test_csv)

The categorical features Name, Cabin, and Ticket have a high number of unique values, which makes it difficult to extract meaningful information from them. Although Cabin has many missing and unique values, a closer look shows that it can be reduced to its leading letter (A - G). The letter represents the floor (deck) the passenger's cabin was on. A diagram of the actual Titanic is shown below to visualize this.
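
As a hedged alternative to the list comprehension used in the cells that follow, the leading letter can also be pulled out with the pandas string accessor. The cabin values here are invented examples, not rows from the dataset.

```python
import numpy as np
import pandas as pd

# Invented cabin values; NaN marks a passenger with no recorded cabin
cabins = pd.Series(["C123", np.nan, "E46", "B57 B59", np.nan], name="Cabin")

# .str[0] takes the first character and leaves missing values as NaN
deck = cabins.str[0]
print(deck.value_counts(dropna=False))
```

Both approaches should give the same letters; `.str[0]` simply avoids the explicit `isinstance` check.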

<img src= "https://upload.wikimedia.org/wikipedia/commons/thumb/0/0d/Olympic_%26_Titanic_cutaway_diagram.png/1200px-Olympic_%26_Titanic_cutaway_diagram.png" height="200" align="center"/>


In [None]:
# Cabin Feature for train.csv
# https://stackoverflow.com/questions/35552874/get-first-letter-of-a-string-from-column
train_csv_cabin_letter = pd.DataFrame(train_csv['Cabin'],columns = ['Cabin'])
train_csv_cabin_letter['Cabin'] = [x[0] if isinstance(x, str) else np.nan for x in train_csv_cabin_letter['Cabin']]
PlotMultiplePie(train_csv_cabin_letter)

In [None]:
# Cabin Feature for test.csv
# https://stackoverflow.com/questions/35552874/get-first-letter-of-a-string-from-column
test_csv_cabin_letter = pd.DataFrame(test_csv['Cabin'],columns = ['Cabin'])
test_csv_cabin_letter['Cabin'] = [x[0] if isinstance(x, str) else np.nan for x in test_csv_cabin_letter['Cabin']]
PlotMultiplePie(test_csv_cabin_letter)

<a id="Continuous-Features"></a>
## Continuous Features

In [None]:
#%% Plot the continuous features in train_csv 
continous_features = train_csv.drop(["Survived","PassengerId"],axis = "columns").select_dtypes(['float64']).columns.to_list()

for cont_feature in continous_features: 
    plt.figure()
    plt.title(cont_feature)
    ax = sns.histplot(train_csv[cont_feature])

In [None]:
#%% Plot the continuous features in test_csv 
continous_features = test_csv.drop(["PassengerId"],axis = "columns").select_dtypes(['float64']).columns.to_list()

for cont_feature in continous_features: 
    plt.figure()
    plt.title(cont_feature)
    ax = sns.histplot(test_csv[cont_feature])

<a id="Target"></a>
## Target

In [None]:
PlotMultiplePie(train_csv,['Survived'])

The target variable, **Survived**, has about a 40-60 split, so imbalanced classification techniques are not needed.
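
For readers who prefer numbers over a pie chart, the class shares can be checked with `value_counts(normalize=True)`. The `survived` series below is a made-up 10-row example chosen to mimic a 40-60 split; it is not the real target column.

```python
import pandas as pd

# Made-up target values (6 zeros, 4 ones) standing in for train_csv['Survived']
survived = pd.Series([0, 0, 0, 1, 1, 0, 1, 0, 1, 0], name="Survived")

# Share of each class, ordered by class label
shares = survived.value_counts(normalize=True).sort_index()
print(shares)
```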

<a id="Preprocessing-Data"></a>
# Preprocessing Data
Because Train.csv and Test.csv have missing data, imputation is needed.

Label encoding is also still required, as this dataset has categorical features.

In [None]:
# Save 'PassengerId' for Train and Test 
train_csv_id = train_csv['PassengerId'].to_list()
test_csv_id = test_csv['PassengerId'].to_list()

# Separate train_csv into target and features 
y_train_csv = train_csv['Survived']
X_train_csv = train_csv.drop(['Survived','PassengerId'],axis = 'columns')

# Create a copy of test_csv 
X_test_csv = test_csv.copy(deep=True).drop('PassengerId',axis = 'columns')

# Features to drop
drop_features = ['Name','Ticket']
X_train_csv = X_train_csv.drop(drop_features,axis = 'columns')
X_test_csv = X_test_csv.drop(drop_features,axis = 'columns')

# Use list comprehension to extract letter from Cabin, if NaN replace with string 'Null'
X_train_csv['Cabin'] = [x[0] if isinstance(x, str) else 'Null' for x in X_train_csv['Cabin']]
X_test_csv['Cabin'] = [x[0] if isinstance(x, str) else 'Null' for x in X_test_csv['Cabin']]

display(X_train_csv.head())
display(X_test_csv.head())

<a id="Imputation"></a>
## Imputation

The features Age, Fare, and Embarked need to be imputed before building a model. I did not want to use NaN as a value, as there weren't that many missing values. Cabin, on the other hand, has many missing values, so its NaN is treated as a separate category ('Null'). The code below uses `SimpleImputer` with `strategy="most_frequent"`, which fills every column, numeric or categorical, with its most frequent value (the mode). In addition, imputation is done separately for train.csv and test.csv, as model building is strictly on train.csv and the final predictions are made on test.csv.
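
The cells that follow use `strategy="most_frequent"` for every column. If median imputation for numeric columns and mode imputation for categorical columns is wanted instead, one possible pandas-only sketch is shown here. The helper `impute_median_mode` and the toy frames are assumptions for illustration and are not part of the notebook.

```python
import numpy as np
import pandas as pd

# Toy frames standing in for X_train_csv and X_test_csv (values are invented)
train = pd.DataFrame({
    "Age": [20.0, np.nan, 40.0, 30.0],
    "Embarked": ["S", "S", None, "C"],
})
test = pd.DataFrame({
    "Age": [np.nan, 50.0],
    "Embarked": [None, "Q"],
})

def impute_median_mode(df):
    """Hypothetical helper: median for numeric columns, mode for the rest."""
    out = df.copy()
    for col in out.columns:
        if pd.api.types.is_numeric_dtype(out[col]):
            fill = out[col].median()
        else:
            fill = out[col].mode(dropna=True).iloc[0]
        out[col] = out[col].fillna(fill)
    return out

# Each file is imputed from its own statistics, mirroring the notebook's choice
train_imputed = impute_median_mode(train)
test_imputed = impute_median_mode(test)
print(train_imputed)
print(test_imputed)
```

A common alternative is to compute the fill values on the training data only and reuse them for the test data; the sketch keeps the two files separate because that is what the notebook describes.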

In [None]:
# https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html#sklearn.impute.SimpleImputer
# from sklearn.impute import SimpleImputer

# use the most_frequent SimpleImputer, which fills every column (numeric or categorical) with its mode
X_train_most_frequent_imputer = SimpleImputer(missing_values=np.nan, strategy="most_frequent",
                            add_indicator=False).fit(X_train_csv)
X_train_simple_imputed = pd.DataFrame(X_train_most_frequent_imputer.transform(X_train_csv), columns = X_train_csv.columns.to_list()).convert_dtypes()

X_train_simple_imputed[X_train_simple_imputed.select_dtypes(['string']).columns.to_list()] =  X_train_simple_imputed.select_dtypes(['string']).astype('object')
X_train_simple_imputed[X_train_simple_imputed.select_dtypes(['Int64']).columns.to_list()] =  X_train_simple_imputed.select_dtypes(['Int64']).astype('int64')
X_train_simple_imputed[X_train_simple_imputed.select_dtypes(['Float64']).columns.to_list()] =  X_train_simple_imputed.select_dtypes(['Float64']).astype('float64')

# SimpleImputer summary statistics for X_train_csv
X_train_simple_imputed.describe(include='all')

In [None]:
# use the most_frequent SimpleImputer, which fills every column (numeric or categorical) with its mode
X_test_most_frequent_imputer = SimpleImputer(missing_values=np.nan, strategy="most_frequent",
                            add_indicator=False).fit(X_test_csv)
X_test_simple_imputed = pd.DataFrame(X_test_most_frequent_imputer.transform(X_test_csv), columns = X_test_csv.columns.to_list()).convert_dtypes()

X_test_simple_imputed[X_test_simple_imputed.select_dtypes(['string']).columns.to_list()] =  X_test_simple_imputed.select_dtypes(['string']).astype('object')
X_test_simple_imputed[X_test_simple_imputed.select_dtypes(['Int64']).columns.to_list()] =  X_test_simple_imputed.select_dtypes(['Int64']).astype('int64')
X_test_simple_imputed[X_test_simple_imputed.select_dtypes(['Float64']).columns.to_list()] =  X_test_simple_imputed.select_dtypes(['Float64']).astype('float64')

# SimpleImputer summary statistics for X_test_csv
X_test_simple_imputed.describe(include='all')

In [None]:
X_test_simple_imputed.head()

In [None]:
# Save the index for X_train_csv 
X_train_csv_index = X_train_simple_imputed.index.to_list()

# Row bind train.csv features with test.csv features 
# this makes it easier to apply label encoding or one hot encoding onto the entire dataset 
# DataFrame.append was removed in pandas 2.0; pd.concat does the same row bind
X_train_test = pd.concat([X_train_simple_imputed, X_test_simple_imputed], ignore_index = True)

# save the index for test.csv 
X_test_csv_index = np.setdiff1d(X_train_test.index.to_list() ,X_train_csv_index) 

<a id="Label-Encoding"></a>
## Label Encoding

In [None]:
#%% MultiColumnLabelEncoder
# Code snippet found on Stack Overflow 
# https://stackoverflow.com/questions/24458645/label-encoding-across-multiple-columns-in-scikit-learn
# from sklearn.preprocessing import LabelEncoder

class MultiColumnLabelEncoder:
    def __init__(self,columns = None):
        self.columns = columns # array of column names to encode

    def fit(self,X,y=None):
        return self # not relevant here

    def transform(self,X):
        '''
        Transforms columns of X specified in self.columns using
        LabelEncoder(). If no columns specified, transforms all
        columns in X.
        '''
        output = X.copy()
        if self.columns is not None:
            for col in self.columns:
                # convert float NaN --> string NaN
                output[col] = output[col].fillna('NaN')
                output[col] = LabelEncoder().fit_transform(output[col])
        else:
            for colname,col in output.items():  # iteritems() was removed in pandas 2.0
                output[colname] = LabelEncoder().fit_transform(col)
        return output

    def fit_transform(self,X,y=None):
        return self.fit(X,y).transform(X)

# store the categorical feature names as a list      
cat_features = X_train_test.select_dtypes(['object']).columns.to_list()

# use MultiColumnLabelEncoder to apply LabelEncoding on cat_features 
# uses NaN as a value , no imputation will be used for missing data

# using One Hot Encoding for this version 
# X_train_test_encoded = MultiColumnLabelEncoder(columns = cat_features).fit_transform(X_train_test)

<a id="One-Hot-Encoding"></a>
## One Hot Encoding

Since we only have 8 features, 4 of which are categorical (Pclass, Sex, Cabin, Embarked), I believe one-hot encoding might improve the model.
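
To make the effect of one-hot encoding concrete, here is a minimal sketch on a toy frame. It uses the `columns=` argument of `pd.get_dummies` rather than the `astype("object")` cast used in the next cell; that substitution, and the data values, are assumptions made for illustration.

```python
import pandas as pd

# Toy frame with two categorical columns and one numeric column (values invented)
toy = pd.DataFrame({
    "Pclass": [1, 3, 2, 3],
    "Sex": ["male", "female", "female", "male"],
    "Fare": [71.3, 7.9, 13.0, 8.1],
})

# drop_first=True drops one level per feature to avoid the dummy variable trap
encoded = pd.get_dummies(toy, columns=["Pclass", "Sex"], drop_first=True)
print(encoded.columns.tolist())
print(encoded)
```

Each categorical feature with k levels becomes k - 1 indicator columns, while `Fare` passes through unchanged.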

In [None]:
# cast categorical features to object dtype
cat_features = ['Pclass','Sex','Cabin','Embarked']
X_train_test[cat_features] = X_train_test[cat_features].astype("object")
# get_dummies to apply one-hot encoding to X_train_test, drop_first = True to avoid the dummy variable trap
X_train_test_encoded = pd.get_dummies(X_train_test,drop_first=True)

In [None]:
##% Split X_train_clean_encoded 
X_train_csv_encoded = X_train_test_encoded.iloc[X_train_csv_index, :]
X_test_csv_encoded = X_train_test_encoded.iloc[X_test_csv_index, :].reset_index(drop = True) 

In [None]:
##% Before and after one-hot encoding for train.csv 
display(X_train_csv.head())
display(X_train_csv_encoded.head())

In [None]:
##% Before and after one-hot encoding for test.csv 
display(X_test_csv.head())
display(X_test_csv_encoded.head())
X_test_csv_encoded.info()

<a id="Train-Test-Stratified-Split"></a>
## Train-Test Stratified Split

In [None]:
#%% Create train and test sets with a stratified 80-20 split
X_train, X_test, y_train, y_test = train_test_split(X_train_csv_encoded, y_train_csv, test_size=0.2, shuffle = True, stratify = y_train_csv, random_state=0)

for df in [X_train, X_test, y_train, y_test]:
    df.reset_index(drop = True,inplace = True)
    
print(" Training Target")
print(y_train.value_counts())
print("\n")
print(" Test Target")
print(y_test.value_counts())

In [None]:
display(X_train)
display(X_test)

<a id="Model-Performance-Metrics"></a>
# Model Performance Metrics

I created several functions to help evaluate the performance of a model. Some functions help binarize an array, plot the confusion matrix, and plot the ROC curve.
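
The idea behind these helpers is to pick a probability threshold that maximizes the G-mean, sqrt(TPR * TNR), and then binarize with it. The sketch below does this in plain Python by scanning the observed probabilities as candidate thresholds; it is a simplified stand-in for `roc_curve`, and the labels and probabilities are invented.

```python
import math

# Invented labels and predicted probabilities
y_true = [0, 0, 1, 1, 0, 1, 0, 1]
y_prob = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.6, 0.55]

def binarize(probs, threshold=0.5):
    # Same rule as the notebook's binarizeArray: below threshold -> 0, else 1
    return [0 if p < threshold else 1 for p in probs]

def g_mean(truth, pred):
    tp = sum(t == 1 and p == 1 for t, p in zip(truth, pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(truth, pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(truth, pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(truth, pred))
    tpr = tp / (tp + fn)  # true positive rate
    tnr = tn / (tn + fp)  # true negative rate, equal to 1 - FPR
    return math.sqrt(tpr * tnr)

# Try every observed probability as a threshold and keep the best one
candidates = sorted(set(y_prob))
best_threshold = max(candidates, key=lambda t: g_mean(y_true, binarize(y_prob, t)))
best_g = g_mean(y_true, binarize(y_prob, best_threshold))
print(best_threshold, best_g)
```

The `plot_roc_curve` function defined in the next cell follows the same idea but takes its candidate thresholds from `sklearn.metrics.roc_curve`.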

<a id="ROC-Curve"></a>
## ROC Curve

In [None]:
# https://scikit-learn.org/stable/auto_examples/model_selection/plot_roc.html#sphx-glr-auto-examples-model-selection-plot-roc-py

def plot_roc_curve(y_true,y_probas, title = 'ROC Curve for training data'):
    # calculate roc curves
    fpr, tpr, thresholds = roc_curve(y_true, y_probas)
    # calculate the g-mean for each threshold
    gmeans = np.sqrt(tpr * (1-fpr))
    # locate the index of the largest g-mean
    ix = np.argmax(gmeans)
    print('Best Threshold=%f, G-Mean=%.3f' % (thresholds[ix], gmeans[ix]))
    # plot the roc curve for the model
    pyplot.figure(num=0, figsize=[6.4, 4.8])
    pyplot.plot([0,1], [0,1], linestyle='--', label='No Skill')
    pyplot.plot(fpr, tpr, marker='.', label='Model')  # generic label: also used for Random Forest
    pyplot.scatter(fpr[ix], tpr[ix], marker='o', color='black', label='Best')
    # axis labels
    pyplot.xlabel('False Positive Rate')
    pyplot.ylabel('True Positive Rate')
    plt.title(title)
    pyplot.legend()
    # show the plot
    pyplot.show()
    
    return thresholds[ix]

<a id="Confusion-Matrix"></a>
## Confusion Matrix

In [None]:
# Great Function found on Kaggle for plotting a Confusion Matrix
# https://www.kaggle.com/grfiv4/plot-a-confusion-matrix
def plot_confusion_matrix_kaggle(cm,
                          target_names,
                          title='Confusion matrix',
                          cmap=None,
                          normalize=True):
    """
    given a sklearn confusion matrix (cm), make a nice plot

    Arguments
    ---------
    cm:           confusion matrix from sklearn.metrics.confusion_matrix

    target_names: given classification classes such as [0, 1, 2]
                  the class names, for example: ['high', 'medium', 'low']

    title:        the text to display at the top of the matrix

    cmap:         the gradient of the values displayed from matplotlib.pyplot.cm
                  see http://matplotlib.org/examples/color/colormaps_reference.html
                  plt.get_cmap('jet') or plt.cm.Blues

    normalize:    If False, plot the raw numbers
                  If True, plot the proportions

    Usage
    -----
    plot_confusion_matrix_kaggle(cm           = cm,                  # confusion matrix created by
                                                                     # sklearn.metrics.confusion_matrix
                                 normalize    = True,                # show proportions
                                 target_names = y_labels_vals,       # list of names of the classes
                                 title        = best_estimator_name) # title of graph

    Citation
    --------
    http://scikit-learn.org/stable/auto_examples/model_selection/plot_confusion_matrix.html

    """
    import matplotlib.pyplot as plt
    import numpy as np
    import itertools

    accuracy = np.trace(cm) / float(np.sum(cm))
    misclass = 1 - accuracy

    if cmap is None:
        cmap = plt.get_cmap('Blues')

    plt.figure(figsize=(8, 6))
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()

    if target_names is not None:
        tick_marks = np.arange(len(target_names))
        plt.xticks(tick_marks, target_names, rotation=45)
        plt.yticks(tick_marks, target_names)

    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]


    thresh = cm.max() / 1.5 if normalize else cm.max() / 2
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        if normalize:
            plt.text(j, i, "{:0.4f}".format(cm[i, j]),
                     horizontalalignment="center",
                     color="white" if cm[i, j] > thresh else "black")
        else:
            plt.text(j, i, "{:,}".format(cm[i, j]),
                     horizontalalignment="center",
                     color="white" if cm[i, j] > thresh else "black")


    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label\naccuracy={:0.4f}; misclass={:0.4f}'.format(accuracy, misclass))
    plt.show()

# binarize an array based on a threshold 
def binarizeArray(array,threshold = 0.5):
    return [0 if num < threshold else 1 for num in array]

<a id="Initial-Models"></a>
# Initial Models
I applied different machine learning algorithms to test which models perform better on this dataset. The techniques applied in this section are listed below.

1. Logistic Regression
2. XGBoost Classifier
3. Random Forest Classifier
4. LightGBM Classifier

In [None]:
#% Initial Models
# Create initial models
LogReg = LogisticRegression(random_state=0).fit(X_train, y_train)

XGBClass = xgb.XGBClassifier(eval_metric  = "logloss", max_depth=5, learning_rate=0.01, n_estimators=100, gamma=0, 
                        min_child_weight=1, subsample=0.8, colsample_bytree=0.8, reg_alpha=0.005,seed = 0).fit(X_train,y_train)

RFClass = RandomForestClassifier(n_estimators = 50, max_depth = 50,n_jobs = -1, random_state = 0).fit(X_train,y_train)

LGBMClass = LGBMClassifier(random_state=0).fit(X_train, y_train)

In [None]:
# Initial model performance on training data
print("             Model Performance on Training Data\n")

pred_y = LogReg.predict(X_train)
print("                    Logistic Regression")
print(classification_report(y_train,pred_y,digits=3))

pred_y = XGBClass.predict(X_train)
print("                    XGBoost Classifier")
print(classification_report(y_train,pred_y,digits=3))

pred_y = RFClass.predict(X_train)
print("                    Random Forest Classifier")
print(classification_report(y_train,pred_y,digits=3))

pred_y = LGBMClass.predict(X_train)
print("                    LightGBM Classifier")
print(classification_report(y_train,pred_y,digits=3))

In [None]:
# Initial model performance on testing data
print("             Model Performance on Testing Data\n")

pred_y = LogReg.predict(X_test)
print("                    Logistic Regression")
print(classification_report(y_test,pred_y,digits=3))

pred_y = XGBClass.predict(X_test)
print("                    XGBoost Classifier")
print(classification_report(y_test,pred_y,digits=3))

pred_y = RFClass.predict(X_test)
print("                    Random Forest Classifier")
print(classification_report(y_test,pred_y,digits=3))

pred_y = LGBMClass.predict(X_test)
print("                    LightGBM Classifier")
print(classification_report(y_test,pred_y,digits=3))

<a id="Random-Forest-Classifier"></a>
# Random Forest Classifier

<a id="Random-Forest-Bayesian-Optimization"></a>
## Random Forest Bayesian Optimization

In [None]:
# https://github.com/fmfn/BayesianOptimization
# https://github.com/fmfn/BayesianOptimization/blob/master/bayes_opt/bayesian_optimization.py
# https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
# https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html
# https://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter
# https://tech.ovoenergy.com/bayesian-optimisation/
# https://www.kdnuggets.com/2019/07/xgboost-random-forest-bayesian-optimisation.html
def search_best_param_rf(X, y):
    def rf_cv(X, y, **kwargs):
        estimator = RandomForestClassifier(**kwargs)
        cval = cross_val_score(
            estimator,
            X,
            y,
            scoring="roc_auc",
            cv=5,
            verbose=0,
            n_jobs=-1,
            error_score=0,
        )
        return cval.mean() # maximize roc_auc

    def rf_crossval(n_estimators, max_depth, min_samples_split, min_samples_leaf):
        return rf_cv(
            X=X,
            y=y,
            n_estimators=int(n_estimators),
            max_depth=int(max(max_depth, 1)),
            min_samples_split=int(max(min_samples_split, 2)),
            min_samples_leaf=int(max(min_samples_leaf, 1)),
        )
    
    RFC_BO_params = {
        "n_estimators": (10, 100),
        "max_depth": (1, 100),
        "min_samples_split": (2, 10),
        "min_samples_leaf": (1, 5),
    }

    RFC_Bo = BayesianOptimization(rf_crossval, 
                                  RFC_BO_params, 
                                  random_state=0, 
                                  verbose=2
                                 )
    np.random.seed(1)
    
    RFC_Bo.maximize(init_points=4, n_iter=4) # 4 + 4 = 8 evaluations 
    # n_iter: How many steps of bayesian optimization you want to perform. The more steps the more likely to find a good maximum you are.
    # init_points: How many steps of random exploration you want to perform. Random exploration can help by diversifying the exploration space.
    # more iterations more time spent searching 
    
    params_set = RFC_Bo.max['params']
    
    params_set['n_estimators'] = int(round(params_set['n_estimators']))
    params_set['max_depth'] = int(round(params_set['max_depth']))
    params_set['min_samples_split'] = int(round(params_set['min_samples_split']))
    params_set['min_samples_leaf'] = int(round(params_set['min_samples_leaf']))
    
    params_set.update({'n_jobs': -1})
    params_set.update({'random_state': 0})
    
    return params_set

<a id="Random-Forest-Cross-Validation"></a>
## Random Forest Cross Validation 

In [None]:
# Random Forest Cross Validation

def K_Fold_RandomForest(X_train,y_train, params_set = [], num_folds = 5):
    model_num = 0 # model number 
    models = [] # model list
    folds = StratifiedKFold(n_splits=num_folds, shuffle=True, random_state=0) # create folds

        # num_folds times ; default is 5
    for n_fold, (train_idx, valid_idx) in enumerate (folds.split(X_train, y_train)):
        
        print(f"     model{model_num}")
        
        train_X, train_y = X_train.iloc[train_idx], y_train.iloc[train_idx]
        valid_X, valid_y = X_train.iloc[valid_idx], y_train.iloc[valid_idx]
        
        if not params_set: # if params_set is empty
            # search once on the first fold's training split and reuse the result for later folds
            params_set = search_best_param_rf(train_X,train_y) 
        
        # fit RFC based on params_set and current fold
        CV_RF = RandomForestClassifier(**params_set).fit(train_X, train_y)
        
        # append RF model to model list 
        models.append(CV_RF)
        
        # model metrics for current fold 
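        # score on this fold's own splits instead of the global X_test/y_test,
        # which overlap the training rows when the full train.csv is passed in
        print("Training Fold")
        print("ROC_AUC_SCORE: ",roc_auc_score(train_y,CV_RF.predict_proba(train_X)[:,1]))
        print("Validation Fold")
        print("ROC_AUC_SCORE: ",roc_auc_score(valid_y,CV_RF.predict_proba(valid_X)[:,1]))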
        print("\n")
        
        model_num = model_num + 1
        
    return models

In [None]:
best_params_rf_cv = search_best_param_rf(X_train_csv_encoded,y_train_csv)

In [None]:
# Print best_params_rf_cv
for key, value in best_params_rf_cv.items():
    print(key, ' : ', value)

In [None]:
rf_models = K_Fold_RandomForest(X_train_csv_encoded,y_train_csv,params_set = best_params_rf_cv,num_folds = 10)

<a id="Random-Forest-CV-Model-Peformance"></a>
## Random Forest CV Model Performance

In [None]:
# Predict y_preds using models from RFC cross validation 
def predict_models_RFC(models_cv,X):
    y_preds = np.zeros(shape = X.shape[0])
    for model in models_cv:
        y_preds += model.predict_proba(X)[:,1]
        
    return y_preds/len(models_cv)

In [None]:
# RFC ROC Curve
RFC_pred_y = predict_models_RFC(rf_models,X_train_csv_encoded)
RFC_threshold = plot_roc_curve(y_train_csv,RFC_pred_y)

In [None]:
# RFC Confusion Matrix
RFC_pred_y_bin = binarizeArray(RFC_pred_y,RFC_threshold)

cm = confusion_matrix(y_train_csv,RFC_pred_y_bin)
plot_confusion_matrix_kaggle(cm =cm, 
                      normalize    = False,
                      target_names = ['Did Not Survive (0)', 'Survived (1)'],
                      title        = "Confusion Matrix")
print(classification_report(y_train_csv,RFC_pred_y_bin))
print("Accuracy: %.2f%%" % (accuracy_score(y_train_csv, RFC_pred_y_bin)*100.0))

<a id="LightGBM-Classifier"></a>
# LightGBM Classifier

<a id="Bayesian-Optimization"></a>
## Bayesian Optimization

In [None]:
##% parameter tuning for lightgbm 

# store the categorical feature names as a list      
# cat_features = X_train_test.select_dtypes(['object']).columns.to_list() # for label encoding
# the one-hot columns live in the encoded frame; older pandas emits uint8 dummies, newer pandas bool
cat_features = X_train_test_encoded.select_dtypes(['uint8','bool']).columns.to_list() # for one hot encoding 
# print(cat_features)

# Create the LightGBM data containers
# Make sure that cat_features are used
train_lgbdata=lgb.Dataset(X_train,label=y_train, categorical_feature = cat_features,free_raw_data=False)
test_lgbdata=lgb.Dataset(X_test,label=y_test, categorical_feature = cat_features,free_raw_data=False)

In [None]:
# https://github.com/fmfn/BayesianOptimization
# https://testlightgbm.readthedocs.io/en/latest/Parameters.html
def search_best_param(X,y,cat_features):
    
    trainXY = lgb.Dataset(data=X, label=y,categorical_feature = cat_features,free_raw_data=False)
    # define the lightGBM cross validation
    def lightGBM_CV(max_depth, num_leaves, n_estimators, learning_rate, subsample, colsample_bytree, 
                lambda_l1, lambda_l2, min_child_weight):
    
        params = {'boosting_type': 'gbdt', 'objective': 'binary', 'metric':'auc', 'verbose': -1,
                  'early_stopping_round':100}
        
        params['max_depth'] = int(round(max_depth))
        params["num_leaves"] = int(round(num_leaves))
        params["n_estimators"] = int(round(n_estimators))
        params['learning_rate'] = learning_rate
        params['subsample'] = subsample
        params['colsample_bytree'] = colsample_bytree
        params['lambda_l1'] = max(lambda_l1, 0)
        params['lambda_l2'] = max(lambda_l2, 0)
        params['min_child_weight'] = min_child_weight
    
        score = lgb.cv(params, trainXY, nfold=5, seed=1, stratified=True, verbose_eval =False, metrics=['auc'])
        return np.mean(score['auc-mean']) # maximize auc-mean

    # use bayesian optimization to search for the best hyper-parameter combination
    lightGBM_Bo = BayesianOptimization(lightGBM_CV, 
                                       {
                                          'max_depth': (5, 50),
                                          'num_leaves': (20, 100),
                                          'n_estimators': (50, 500),
                                          'learning_rate': (0.01, 0.3),
                                          'subsample': (0.7, 0.8),
                                          'colsample_bytree' :(0.5, 0.99),
                                          'lambda_l1': (0, 5),
                                          'lambda_l2': (0, 3),
                                          'min_child_weight': (2, 50) 
                                      },
                                       random_state = 1,
                                       verbose = 2
                                      )
    np.random.seed(1)
    
    lightGBM_Bo.maximize(init_points= 5, n_iter=5) # 5 + 5, 10 iterations 
    # n_iter: number of Bayesian optimization steps; more steps make it more likely to find a good maximum.
    # init_points: number of random exploration steps; random exploration helps by diversifying the search space.
    # more iterations mean more time spent searching
    
    params_set = lightGBM_Bo.max['params']
    
    # get the params of the maximum target     
    max_target = -np.inf
    for i in lightGBM_Bo.res: # loop through all the optimization results
        if i['target'] > max_target:
            params_set = i['params']
            max_target = i['target']
    
    params_set.update({'verbose': -1})
    params_set.update({'metric': 'auc'})
    params_set.update({'boosting_type': 'gbdt'})
    params_set.update({'objective': 'binary'})
    
    params_set['max_depth'] = int(round(params_set['max_depth']))
    params_set['num_leaves'] = int(round(params_set['num_leaves']))
    params_set['n_estimators'] = int(round(params_set['n_estimators']))
    params_set['seed'] = 1 #set seed
    
    return params_set
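
The optimizer above proposes a float for every bound, so the integer-valued parameters are rounded and the regularization terms are clipped at zero before LightGBM sees them. A minimal sketch of that clean-up step as a standalone helper follows; `cast_params` and the example `proposal` values are hypothetical and only illustrate the idea.

```python
def cast_params(raw_params,
                int_keys=("max_depth", "num_leaves", "n_estimators"),
                nonneg_keys=("lambda_l1", "lambda_l2")):
    """Return a copy of raw_params that is safe to pass on to LightGBM.

    Hypothetical helper: rounds the integer-valued keys and clips the
    regularization keys at zero, mirroring what lightGBM_CV does inline.
    """
    casted = dict(raw_params)  # copy so the optimizer's own record is untouched
    for key in int_keys:
        if key in casted:
            casted[key] = int(round(casted[key]))
    for key in nonneg_keys:
        if key in casted:
            casted[key] = max(casted[key], 0)
    return casted


# Made-up proposal, shaped like one entry of the optimizer's results
proposal = {"max_depth": 17.6, "num_leaves": 42.3, "n_estimators": 249.7,
            "learning_rate": 0.07, "lambda_l1": -0.2}
clean = cast_params(proposal)
print(clean)
```

Keeping the casting in one place would let the same helper serve both the objective function and the final `params_set`, instead of repeating the `int(round(...))` calls.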

In [None]:
best_params = search_best_param(X_train,y_train,cat_features)

In [None]:
# Print best_params
for key, value in best_params.items():
    print(key, ' : ', value)

<a id="Tuning-LightGBM"></a>
## Tuning LightGBM

In [None]:
# Train lgbm_best using the best params found from Bayesian Optimization
lgbm_best = lgb.train(best_params,
                 train_lgbdata,
                 num_boost_round = 100,
                 valid_sets = test_lgbdata,
                 early_stopping_rounds = 100,
                 verbose_eval = 50
                 )

<a id="LightGBM-Model-Peformance "></a>
## LightGBM Model Performance

In [None]:
print("     LGBM Tuned")
print("Training Dataset")
print("ROC_AUC_SCORE: ",roc_auc_score(y_train,lgbm_best.predict(X_train)))
print("Test Dataset")
print("ROC_AUC_SCORE: ",roc_auc_score(y_test,lgbm_best.predict(X_test)))

<a id="Feature-Importance "></a>
## Feature Importance 

In [None]:
##% Feature Importance 
# https://scikit-learn.org/stable/auto_examples/ensemble/plot_forest_importances.html
lgb.plot_importance(lgbm_best,figsize=(25,20),max_num_features = 10)

In [None]:
##% Feature Importance using shap package 
# sample 10000 data points from X_train_test once, so the SHAP values and the plotted features refer to the same rows
X_shap_sample = X_train_test_encoded.sample(n=10000, random_state=1)
shap_values = shap.TreeExplainer(lgbm_best).shap_values(X_shap_sample)
shap.summary_plot(shap_values, X_shap_sample)

<a id="Cross-Validation "></a>
## Cross Validation 

In [None]:
# Cross Validation with LightGBM

def K_Fold_LightGBM(X_train, y_train, cat_features, num_folds = 5, params_set = None):
    params_set = params_set or {} # LightGBM expects a dict; avoid a mutable default argument
    num = 0 # model number
    models = [] # list of models 
    folds = StratifiedKFold(n_splits=num_folds, shuffle=True, random_state=0) # create folds 

    # train one model per fold (num_folds models in total)
    for n_fold, (train_idx, valid_idx) in enumerate (folds.split(X_train, y_train)):
        
        print(f"     model{num}")
        train_X, train_y = X_train.iloc[train_idx], y_train.iloc[train_idx]
        valid_X, valid_y = X_train.iloc[valid_idx], y_train.iloc[valid_idx]
        
        train_data=lgb.Dataset(train_X,label=train_y, categorical_feature = cat_features,free_raw_data=False)
        valid_data=lgb.Dataset(valid_X,label=valid_y, categorical_feature = cat_features,free_raw_data=False)
        
        
        # params_set = search_best_param(train_X,train_y,cat_features) # find best param_set in each fold
        
        CV_LGBM = lgb.train(params_set,
                            train_data,
                            num_boost_round = 100,
                            valid_sets = valid_data,
                            early_stopping_rounds = 100,
                            verbose_eval = 50
                           )
        # increasing early_stopping_rounds can lead to overfitting
        
        # append LGBM model to models list 
        models.append(CV_LGBM)
        
        # model metrics for each fold, scored on this fold's own training and validation rows
        print("Training Fold")
        print("ROC_AUC_SCORE: ",roc_auc_score(train_y,models[num].predict(train_X)))
        print("Validation Fold")
        print("ROC_AUC_SCORE: ",roc_auc_score(valid_y,models[num].predict(valid_X)))
        print("\n")
        
        num = num + 1
        
    return models

In [None]:
best_params_cv = search_best_param(X_train_csv_encoded,y_train_csv,cat_features)

In [None]:
lgbm_models = K_Fold_LightGBM(X_train_csv_encoded,y_train_csv,cat_features,10,params_set = best_params_cv)
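
The cross-validation above relies on `StratifiedKFold` so that every fold keeps roughly the same share of survivors as the full training set. The sketch below illustrates that idea with plain NumPy. It is a simplified stand-in, not scikit-learn's implementation, and `stratified_fold_ids` and `y_demo` are hypothetical names used only for illustration.

```python
import numpy as np


def stratified_fold_ids(y, num_folds, seed=0):
    """Assign each row a fold id so every fold keeps roughly the overall class ratio."""
    y = np.asarray(y)
    rng = np.random.default_rng(seed)
    fold_ids = np.empty(len(y), dtype=int)
    for cls in np.unique(y):
        idx = np.flatnonzero(y == cls)  # row positions belonging to this class
        rng.shuffle(idx)                # shuffle within the class
        # deal the shuffled rows of this class out to the folds in turn
        fold_ids[idx] = np.arange(len(idx)) % num_folds
    return fold_ids


# Made-up labels: 60 negatives and 40 positives
y_demo = np.array([0] * 60 + [1] * 40)
fold_ids = stratified_fold_ids(y_demo, num_folds=5)

fold_sizes = []
fold_rates = []
for k in range(5):
    valid_mask = fold_ids == k
    fold_sizes.append(int(valid_mask.sum()))
    fold_rates.append(float(y_demo[valid_mask].mean()))
    print(f"fold {k}: size={fold_sizes[-1]}, positive rate={fold_rates[-1]:.2f}")
```

Because each class is dealt out separately, no fold can end up with far fewer positives than the others, which keeps per-fold AUC scores comparable.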

<a id="LightGBM-CV-Model-Peformance "></a>
## LightGBM CV Model Performance

In [None]:
# Predict y_preds using models from cross validation
def predict_models_LGBM(models_cv,X):
    y_preds = np.zeros(shape = X.shape[0])
    for model in models_cv:
        y_preds += model.predict(X)
        
    return y_preds/len(models_cv)
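
The averaging in `predict_models_LGBM` can be checked on toy inputs. The sketch below uses a hypothetical `FixedModel` in place of trained boosters and a hypothetical `binarize` helper; the notebook's own `binarizeArray` is defined earlier and may treat values exactly at the threshold differently, so the `>=` comparison here is an assumption.

```python
import numpy as np


class FixedModel:
    """Hypothetical stand-in for a trained fold model: predict() returns stored probabilities."""

    def __init__(self, probs):
        self.probs = np.asarray(probs, dtype=float)

    def predict(self, X):
        return self.probs[: len(X)]


def average_predictions(models, X):
    """Average the predicted probabilities of all fold models, row by row."""
    preds = np.zeros(len(X))
    for model in models:
        preds += model.predict(X)
    return preds / len(models)


def binarize(probs, threshold):
    """Map probabilities to 0/1 labels (assumption: values >= threshold become 1)."""
    return (np.asarray(probs) >= threshold).astype(int)


X_demo = np.zeros((3, 2))  # feature values are irrelevant for the stub models
models_demo = [FixedModel([0.1, 0.6, 0.9]),
               FixedModel([0.3, 0.4, 0.7]),
               FixedModel([0.2, 0.8, 0.8])]

avg_demo = average_predictions(models_demo, X_demo)
labels_demo = binarize(avg_demo, threshold=0.5)
print(avg_demo, labels_demo)
```

Averaging probabilities before thresholding lets the fold models vote with their confidence, rather than with hard labels.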

In [None]:
# LightGBM ROC Curve
LGBM_pred_y = predict_models_LGBM(lgbm_models,X_train_csv_encoded)
LGBM_threshold = plot_roc_curve(y_train_csv,LGBM_pred_y)

In [None]:
# LightGBM Confusion Matrix
LGBM_pred_y_bin = binarizeArray(LGBM_pred_y,LGBM_threshold)

cm = confusion_matrix(y_train_csv,LGBM_pred_y_bin)
plot_confusion_matrix_kaggle(cm =cm, 
                      normalize    = False,
                      target_names = ['Did Not Survive(0)', 'Survived(1)'],
                      title        = "Confusion Matrix")
print(classification_report(y_train_csv,LGBM_pred_y_bin))
print("Accuracy: %.2f%%" % (accuracy_score(y_train_csv, LGBM_pred_y_bin)*100.0))

<a id="Prediction-for-Test.csv"></a>
# Prediction for Test.csv

In [None]:
# Prediction for Test.csv using LightGBM CV 
predictLGBM = predict_models_LGBM(lgbm_models,X_test_csv_encoded) 
predictLGBM_bin = binarizeArray(predictLGBM,LGBM_threshold)

submissionLGBM = pd.DataFrame({'PassengerId':test_csv_id,'Survived':predictLGBM_bin})

display(submissionLGBM.head())

# Prediction for Test.csv using RFC CV 
predictRFC = predict_models_RFC(rf_models,X_test_csv_encoded) 
predictRFC_bin = binarizeArray(predictRFC,RFC_threshold)

submissionRFC = pd.DataFrame({'PassengerId':test_csv_id,'Survived':predictRFC_bin})

display(submissionRFC.head())

In [None]:
#% Submit Predictions 
submissionLGBM.to_csv('submissionCV_LGBM3.csv',index=False)
submissionRFC.to_csv('submissionCV_RFC3.csv',index=False)

<a id="Conclusion"></a>
# Conclusion

**Conclusion**
* LightGBM is a great ML algorithm that handles categorical features and missing values
* Cross Validation is useful to combat overfitting
* Bayesian Optimization is a useful way to find good hyperparameters when building an initial model
* This is a great dataset to work on, and a lot of knowledge can be gained from working with it
* Researching and reading other Kaggle notebooks is essential for becoming a better data scientist

**Challenges**
* Using all the features effectively is difficult; feature engineering might be worth researching
* Overfitting might have occurred, which would reduce the model performance on the test set

**More Notebooks** 

**Regression Notebooks**   
[https://www.kaggle.com/josephchan524/housepricesregressor-using-lightgbm](https://www.kaggle.com/josephchan524/housepricesregressor-using-lightgbm)

[https://www.kaggle.com/josephchan524/tabularplaygroundregressor-using-lightgbm-feb2021](https://www.kaggle.com/josephchan524/tabularplaygroundregressor-using-lightgbm-feb2021)

[https://www.kaggle.com/josephchan524/studentperformanceregressor-rmse-12-26-r2-0-26](https://www.kaggle.com/josephchan524/studentperformanceregressor-rmse-12-26-r2-0-26)


**Classification Notebooks**  
[https://www.kaggle.com/josephchan524/bankruptcyclassifier-using-lightgbm-91-recall](https://www.kaggle.com/josephchan524/bankruptcyclassifier-using-lightgbm-91-recall)  

[https://www.kaggle.com/josephchan524/hranalytics-lightgbm-classifier-auc-80](https://www.kaggle.com/josephchan524/hranalytics-lightgbm-classifier-auc-80)

[https://www.kaggle.com/josephchan524/bankchurnersclassifier-recall-97-accuracy-95](https://www.kaggle.com/josephchan524/bankchurnersclassifier-recall-97-accuracy-95)

[https://www.kaggle.com/josephchan524/tabularplaygroundclassifier-using-lightgbm-mar2021](https://www.kaggle.com/josephchan524/tabularplaygroundclassifier-using-lightgbm-mar2021)  


04-06-2020
Joseph Chan 