# Titanic : Visualization & Prediction

Predict survival on the Titanic and get familiar with ML basics.

<img src="https://miro.medium.com/max/1680/1*vLzwEHLZH0vt3t3PzZnTAg.jpeg" height="70%" width="80%" >

Getting started with competitive data science can be quite intimidating, so I built this notebook as a quick overview of the `Titanic: Machine Learning from Disaster` competition. If there is interest, I’m happy to do deep dives into the intuition behind the feature engineering and models used in this kernel.

I encourage you to fork this kernel, play with the code and enter the competition. Good luck!

#### Competition Description

In this challenge, you are asked to build a predictive model that predicts which passengers survived the Titanic shipwreck and answers the question “what sorts of people were more likely to survive?” using passenger data (i.e. name, age, gender, socio-economic class, etc.).

#### Executive Summary

I started this competition by focusing on getting a good understanding of the dataset. The EDA and visualizations are included to let developers dive into the analysis of the dataset.

### Key features of the model training process in this kernel

K-Fold Cross Validation: Using 5-fold cross-validation.

First Level Learning Model: On each run of cross-validation, the following models were fitted:
1. Random Forest classifier
2. Extra Trees classifier
3. AdaBoost classifier
4. Gradient Boosting classifier
5. Support Vector Machine

Second Level Learning Model: Trained an XGBClassifier using xgboost

In [None]:
# Load in our libraries
import random
import pandas as pd
import numpy as np
import re
import sklearn
from xgboost import XGBClassifier
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

import plotly.offline as py
py.init_notebook_mode(connected=True)
import plotly.graph_objs as go
import plotly.tools as tls

import warnings
warnings.filterwarnings('ignore')

# Going to use these 5 base models for the stacking
from sklearn.ensemble import (RandomForestClassifier, AdaBoostClassifier, 
                              GradientBoostingClassifier, ExtraTreesClassifier)
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold, cross_val_score


In [None]:
train = pd.read_csv('../input/titanic/train.csv')
test = pd.read_csv('../input/titanic/test.csv')
print("Train set size:", train.shape)
print("Test set size:", test.shape)

# Data Pre-processing

In [None]:
# Store our passenger ID for easy access 
PassengerId = test['PassengerId'] 
train.head(5)

In [None]:
train.info()
print('_'*40)
test.info()

#### What is the distribution of numerical feature values across the samples?

This helps us determine, among other early insights, how representative the training dataset is of the actual problem domain.

* Total samples are 891 or 40% of the actual number of passengers on board the Titanic (2,224).
* Survived is a categorical feature with 0 or 1 values.
* Around 38% of the samples survived, compared with the actual survival rate of 32%.
* Most passengers (> 75%) did not travel with parents or children.
* Nearly 30% of the passengers had siblings and/or spouse aboard.
* Fares varied significantly, with a few passengers (<1%) paying as much as $512.
* Few elderly passengers (< 1%) within age range 65-80.
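
The shares quoted in the list above are simple ratios and means of boolean masks. The sketch below shows the pattern on a tiny, made-up frame (the `toy` rows are hypothetical, not taken from `train.csv`); only the 891 and 2,224 counts come from the text.

```python
import pandas as pd

# Hypothetical rows in the Titanic schema, used only to show the pattern.
toy = pd.DataFrame({
    "Survived": [0, 0, 1, 0, 1, 0, 0, 1, 0, 0],
    "Parch":    [0, 0, 0, 1, 0, 0, 2, 0, 0, 0],
    "SibSp":    [1, 0, 0, 0, 1, 0, 0, 0, 3, 0],
})

# Share of all passengers covered by the training set (counts from the text).
sample_share = 891 / 2224

# The mean of a 0/1 column is the fraction of ones.
survival_rate = toy["Survived"].mean()

# The mean of a boolean mask is the fraction of rows matching the condition.
no_parents_children = (toy["Parch"] == 0).mean()
with_sibling_or_spouse = (toy["SibSp"] > 0).mean()

print(f"sample share: {sample_share:.0%}")
print(f"survival rate: {survival_rate:.0%}")
print(f"travelled without parents/children: {no_parents_children:.0%}")
print(f"had siblings/spouse aboard: {with_sibling_or_spouse:.0%}")
```

Running the same expressions on `train` instead of `toy` gives the figures for the real data.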

In [None]:
train.describe()

# Review survived rate using `percentiles=[.61, .62]` knowing our problem description mentions 38% survival rate.
# Review Parch distribution using `percentiles=[.75, .8]`
# SibSp distribution `[.68, .69]`
# Age and Fare `[.1, .2, .3, .4, .5, .6, .7, .8, .9, .99]`

#### What is the distribution of categorical features?

* Names are unique across the dataset (count=unique=891).
* The Sex variable has two possible values, with 65% male (top=male, freq=577/count=891).
* Cabin values have several duplicates across samples. Put differently, several passengers shared a cabin.
* Embarked takes three possible values. The S port was used by most passengers (top=S).
* The Ticket feature has a high ratio (about 24%) of duplicate values (unique=681 out of count=891).

In [None]:
train.describe(include=['O'])

# Visualisations

### 1. Pearson Correlation Heatmap

In [None]:
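# Correlate numeric columns only; text columns such as Name cannot be correlated
corr_matrix = train.select_dtypes(include='number').corr()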
plt.figure(figsize=(8, 7))
sns.heatmap(data = corr_matrix,cmap='BrBG', annot=True, linewidths=0.2)

There are no very highly correlated columns

### 2. Number of missing values

In [None]:
train.isnull().sum()

The 'Age' and 'Cabin' columns contain the most null values.

### 3. Pclass wise - Survival probability

In [None]:
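# Keep the returned Axes in `ax` so the matplotlib `plt` module is not overwritten
ax = train[['Pclass', 'Survived']].groupby('Pclass').mean().Survived.plot(kind='bar')
ax.set_xlabel('Pclass')
ax.set_ylabel('Survival Probability')

First-class passengers have a higher chance of surviving than those in the other two classes.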
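
The survival probability plotted in this and the following sections is just the mean of the 0/1 `Survived` column within each group. A minimal sketch on made-up rows (the `toy` frame is hypothetical) shows the calculation without the plotting step:

```python
import pandas as pd

# Hypothetical passengers: class and survival flag only.
toy = pd.DataFrame({
    "Pclass":   [1, 1, 1, 2, 2, 3, 3, 3, 3, 3],
    "Survived": [1, 1, 0, 1, 0, 0, 0, 1, 0, 0],
})

# Group by class, then average the 0/1 flag: the result is the
# fraction of survivors per class, indexed by Pclass.
survival_by_class = toy.groupby("Pclass")["Survived"].mean()
print(survival_by_class)
```

Calling `.plot(kind='bar')` on the resulting Series produces the bar chart used in the notebook cells.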

### 4. Sex wise - Survival probability

In [None]:
# Keep the returned Axes in `ax` so the matplotlib `plt` module is not overwritten
ax = train[['Sex', 'Survived']].groupby('Sex').mean().Survived.plot(kind='bar')
ax.set_xlabel('Sex')
ax.set_ylabel('Survival Probability')

The survival probability for females is higher. Women may have been given priority over men.

### 5. Embarked wise - Survival probability

In [None]:
# Keep the returned Axes in `ax` so the matplotlib `plt` module is not overwritten
ax = train[['Embarked', 'Survived']].groupby('Embarked').mean().Survived.plot(kind='bar')
ax.set_xlabel('Embarked')
ax.set_ylabel('Survival Probability')

Survival probability: C > Q > S

### 6. SibSp - Siblings/Spouse wise - Survival probability

In [None]:
# Keep the returned Axes in `ax` so the matplotlib `plt` module is not overwritten
ax = train[['SibSp', 'Survived']].groupby('SibSp').mean().Survived.plot(kind='bar')
ax.set_xlabel('SibSp')
ax.set_ylabel('Survival Probability')

Passengers with one sibling/spouse have the highest survival probability.

'1' > '2' > '0' > '3' > '4'

### 7. Parch - Children/Parents wise - Survival probability

In [None]:
# Keep the returned Axes in `ax` so the matplotlib `plt` module is not overwritten
ax = train[['Parch', 'Survived']].groupby('Parch').mean().Survived.plot(kind='bar')
ax.set_xlabel('Parch')
ax.set_ylabel('Survival Probability')

Passengers with three children/parents have the highest survival probability.

'3' > '1' > '2' > '0' > '5'

### 8. Age wise - Survival probability

In [None]:
from matplotlib import pyplot as plt
fig = plt.figure(figsize=(20, 7))
sns.violinplot(x='Sex', y='Age', 
               hue='Survived', data=train, 
               split=True,
               palette={0: "blue", 1: "yellow"}
              );

Younger males tend to survive.

Infants (Age <= 4) had a high survival rate.

The oldest passenger (Age = 80) survived.

A large number of passengers between 20 and 40 died.

Age doesn't seem to have a direct impact on female survival.

# Feature Engineering

In [None]:
y = train.shape[0]
print(y)
y2 = test.shape[0]

In [None]:
features = pd.concat([train, test]).reset_index(drop=True)
print(features.shape)

In [None]:
features['Name_length'] = features['Name'].str.len()

In [None]:
features['Has_Cabin'] = features["Cabin"].apply(lambda x: 0 if type(x) == float else 1)

In [None]:
features['FamilySize'] = features['SibSp'] + features['Parch'] + 1

In [None]:
features['IsAlone'] = 0
features.loc[features['FamilySize'] == 1, 'IsAlone'] = 1

In [None]:
features['Embarked'] = features['Embarked'].fillna('S')

In [None]:
features['Fare'] = features['Fare'].fillna(train['Fare'].median())
# Mapping Fare into four ordinal bands
features.loc[features['Fare'] <= 7.91, 'Fare'] = 0
features.loc[(features['Fare'] > 7.91) & (features['Fare'] <= 14.454), 'Fare'] = 1
features.loc[(features['Fare'] > 14.454) & (features['Fare'] <= 31), 'Fare'] = 2
features.loc[features['Fare'] > 31, 'Fare'] = 3
features['Fare'] = features['Fare'].astype(int)

In [None]:
features.head(3)

In [None]:
features['Age'] = features['Age'].apply(lambda x: np.random.choice(features['Age'].dropna().values) if np.isnan(x) else x)
print(features['Age'].head(20))

In [None]:
features['Age'] = features['Age'].astype(int)
# Mapping Age
features.loc[features['Age'] <= 16, 'Age'] = 0
features.loc[(features['Age'] > 16) & (features['Age'] <= 32), 'Age'] = 1
features.loc[(features['Age'] > 32) & (features['Age'] <= 48), 'Age'] = 2
features.loc[(features['Age'] > 48) & (features['Age'] <= 64), 'Age'] = 3
features.loc[features['Age'] > 64, 'Age'] = 4


In [None]:
# Raw string avoids the invalid-escape warning; expand=False returns a Series
features['Title'] = features['Name'].str.extract(r' ([A-Za-z]+)\.', expand=False)
features['Title'] = features['Title'].replace('Mlle', 'Miss')
features['Title'] = features['Title'].replace('Ms', 'Miss')
features['Title'] = features['Title'].replace('Mme', 'Mrs')

features['Title'] = features['Title'].replace(['Lady', 'Countess','Capt', 'Col','Don', 'Dr', 'Major', 'Rev', 'Sir', 'Jonkheer', 'Dona'], 'Rare')

In [None]:
# Mapping titles
title_mapping = {"Mr": 1, "Miss": 2, "Mrs": 3, "Master": 4, "Rare": 5}
features['Title'] = features['Title'].map(title_mapping)
features['Title'] = features['Title'].fillna(0)

In [None]:
features.tail(3)

In [None]:
# Feature selection
drop_elements = ['PassengerId', 'Name', 'Ticket', 'Cabin', 'SibSp', 'Name_length']
features = features.drop(drop_elements, axis = 1)

In [None]:
# Mapping Categorical data
features['Embarked'] = features['Embarked'].map( {'S': 0, 'C': 1, 'Q': 2} ).astype(int)
features['Sex'] = features['Sex'].map( {'female': 0, 'male': 1} ).astype(int)
#features['Age*Class'] = features['Age']*features['Pclass']

#features = pd.get_dummies(features, prefix=['Embarked', 'Sex'], columns=['Embarked', 'Sex'])
# One hot encoding is not helping in this case

In [None]:
train = features.iloc[:y]
test = features.iloc[-y2:]

In [None]:
train.shape

In [None]:
test.shape

In [None]:
test  = test.drop('Survived', axis = 1)
test.head(3)

All right, so now we have cleaned the features, extracted the relevant information, dropped the redundant columns and converted the categorical columns to numerical ones, a format suitable to feed into our Machine Learning models.
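
To make the feature-engineering steps easier to follow, here is a minimal sketch that applies the same ideas to a handful of made-up rows. The `toy` frame and the `AgeBand` column name are hypothetical, and `pd.cut` is used here as a compact alternative to the chained `.loc` assignments; it is not what the cells above run.

```python
import pandas as pd

# Hypothetical passengers in the Titanic schema.
toy = pd.DataFrame({
    "Name": ["Braund, Mr. Owen Harris",
             "Heikkinen, Miss. Laina",
             "Cumings, Mrs. John Bradley",
             "Palsson, Master. Gosta Leonard",
             "Uruchurtu, Don. Manuel E"],
    "SibSp": [1, 0, 1, 3, 0],
    "Parch": [0, 0, 0, 1, 0],
    "Age":   [22.0, 26.0, 38.0, 2.0, 40.0],
})

# Family size counts the passenger plus siblings/spouse and parents/children.
toy["FamilySize"] = toy["SibSp"] + toy["Parch"] + 1
toy["IsAlone"] = (toy["FamilySize"] == 1).astype(int)

# The title is the word that precedes the first period in the name.
toy["Title"] = toy["Name"].str.extract(r" ([A-Za-z]+)\.", expand=False)
toy["Title"] = toy["Title"].replace(["Don"], "Rare")

# Right-closed bins (a, b] reproduce the <= 16, <= 32, <= 48, <= 64 cut-offs.
# The outer edges -1 and 200 are arbitrary bounds chosen for this sketch.
age_bins = [-1, 16, 32, 48, 64, 200]
toy["AgeBand"] = pd.cut(toy["Age"].astype(int), bins=age_bins, labels=False)

print(toy[["Title", "FamilySize", "IsAlone", "AgeBand"]])
```

Binning with `pd.cut` keeps the original column intact, which avoids any risk of a later condition matching values that an earlier assignment has already overwritten.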

### 9. Title wise - Survival probability

In [None]:
# Keep the returned Axes in `ax` so the matplotlib `plt` module is not overwritten
ax = train[['Title', 'Survived']].groupby('Title').mean().Survived.plot(kind='bar')
ax.set_xlabel('Title')
ax.set_ylabel('Survival Probability')

The survival probability for 'Mrs' (code 3) and 'Miss' (code 2) is high compared to the other classes.

### 10. Correlation between columns

In [None]:
import matplotlib.pyplot as plt
colormap = plt.cm.RdBu
plt.figure(figsize=(14,12))
plt.title('Pearson Correlation of Features', y=1.05, size=15)
sns.heatmap(train.astype(float).corr(),linewidths=0.1,vmax=1.0, 
            square=True, cmap=colormap, linecolor='white', annot=True)

# Model 

### Trying Different Model, predict and solve without Any Cross Validation

We must understand the type of problem and the solution requirement to narrow down to a select few models which we can evaluate. Our problem is a classification problem: we want to identify the relationship between the output (Survived or not) and the other variables or features (Gender, Age, Port...). We are also performing a category of machine learning called supervised learning, as we are training our model with a given dataset. With these two criteria, supervised learning plus classification, we can narrow down our choice of models to a few. These include:

* Logistic Regression
* KNN or k-Nearest Neighbors
* Support Vector Machines
* Naive Bayes classifier
* Decision Tree
* Random Forest
* Perceptron
* Artificial neural network
* RVM or Relevance Vector Machine
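
Because this section scores each model on the same rows it was trained on, the numbers can look better than the model really is. The sketch below illustrates the gap on synthetic data (the `make_classification` data and the `X_demo`, `y_demo` names are assumptions for illustration, not the Titanic features), comparing training accuracy with 5-fold cross-validated accuracy for a decision tree.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; not the Titanic features).
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

# Accuracy on the training rows: an unpruned tree can memorise them.
tree = DecisionTreeClassifier(random_state=0)
train_acc = tree.fit(X_demo, y_demo).score(X_demo, y_demo)

# Accuracy on held-out folds: each row is scored by a model that never saw it.
cv_acc = cross_val_score(
    DecisionTreeClassifier(random_state=0), X_demo, y_demo, cv=5
).mean()

print(f"training accuracy:        {train_acc:.3f}")
print(f"cross-validated accuracy: {cv_acc:.3f}")
```

The cross-validated figure is usually the better guide to how a model will score on unseen passengers.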

In [None]:
X_train = train.drop("Survived", axis=1)
Y_train = train["Survived"]
X_test  = test.copy()
X_train.shape, Y_train.shape, X_test.shape

In [None]:
# machine learning
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC, LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import Perceptron
from sklearn.linear_model import SGDClassifier
from sklearn.tree import DecisionTreeClassifier

In [None]:
# Logistic Regression

logreg = LogisticRegression()
logreg.fit(X_train, Y_train)
Y_pred = logreg.predict(X_test).astype(int)
acc_log = round(logreg.score(X_train, Y_train) * 100, 2)
acc_log

In [None]:
# Logistic regression coefficients per feature (the sign shows the direction of the effect)
coeff_df = pd.DataFrame({'Feature': X_train.columns,
                         'Coefficient': logreg.coef_[0]})

coeff_df.sort_values(by='Coefficient', ascending=False)

In [None]:
# Support Vector Machines

svc = SVC()
svc.fit(X_train, Y_train)
Y_pred = svc.predict(X_test).astype(int)
acc_svc = round(svc.score(X_train, Y_train) * 100, 2)
acc_svc

In [None]:
# k-Nearest Neighbors

knn = KNeighborsClassifier(n_neighbors = 3)
knn.fit(X_train, Y_train)
knn.predict(X_test).astype(int)
acc_knn = round(knn.score(X_train, Y_train) * 100, 2)
acc_knn

In [None]:
# Decision Tree

decision_tree = DecisionTreeClassifier()
decision_tree.fit(X_train, Y_train)
decision_tree.predict(X_test).astype(int)
acc_decision_tree = round(decision_tree.score(X_train, Y_train) * 100, 2)
acc_decision_tree

In [None]:
# Random Forest

random_forest = RandomForestClassifier(n_estimators=100)
random_forest.fit(X_train, Y_train)
random_forest.predict(X_test).astype(int)
random_forest.score(X_train, Y_train)
acc_random_forest = round(random_forest.score(X_train, Y_train) * 100, 2)
acc_random_forest

In [None]:
xgboost = XGBClassifier(
 learning_rate = 0.95,
 n_estimators= 5000,
 max_depth= 4,
 min_child_weight= 2,
 gamma=1,                        
 subsample=0.8,
 colsample_bytree=0.8,
 objective= 'binary:logistic',
 nthread= -1,
 scale_pos_weight=1)

xgboost.fit(X_train, Y_train)
xgboost.predict(X_test).astype(int)
xgboost.score(X_train, Y_train)
acc_xgboost = round(xgboost.score(X_train, Y_train) * 100, 2)
acc_xgboost

#### Plotting above XGBoost Decision Tree ( Without Cross Validation )

In [None]:
from xgboost import plot_tree
from matplotlib.pylab import rcParams

##set up the parameters
rcParams['figure.figsize'] = 80,50
# plot single tree
plot_tree(xgboost, rankdir='LR')


#### Model Comparison

In [None]:
# 'Confidence Score' is the accuracy measured on the training data itself
models = pd.DataFrame({
    'Model': ['Support Vector Machines', 'KNN', 'Logistic Regression',
              'Random Forest', 'Decision Tree', 'Xgboost'],
    'Confidence Score': [acc_svc, acc_knn, acc_log, acc_random_forest, acc_decision_tree, acc_xgboost],
    'Real Score': [0.78947, 0.74641, 0.77511, 0.76076, 0.76076, 0.74162]})
models.sort_values(by='Real Score', ascending=False)

In [None]:
submission = pd.DataFrame({
        "PassengerId": PassengerId,
        "Survived": Y_pred
    })
submission.to_csv("Submission2new.csv", index=False)

### Setup cross validation method

"kf" means K-Folds cross-validator, so "kf = KFold(n_splits = NFOLDS)" generates a K-Folds cross-validator which provides train/test indices to split the data into train/test sets.

In [None]:
# Helpers via Python Classes

ntrain = train.shape[0]
ntest = test.shape[0]
# Choose a random seed
SEED = 0
NFOLDS = 5 # set folds for out-of-fold prediction
n_splits = NFOLDS # needed later by get_oof
# random_state only has an effect with shuffle=True, and recent scikit-learn
# versions raise an error if it is set without shuffling, so it is omitted here
kf = KFold(n_splits = NFOLDS)

# Class to extend the Sklearn classifier
# Standard code taken from http://blog.keyrus.co.uk/ensembling_ml_models.html.
class SklearnHelper(object):
    def __init__(self, clf, seed=0, params=None):
        """Get the name of the model for labelling purposes and set a random seed."""
        self.name = re.search("('.+')>$", str(clf)).group(1)
        # Copy so the caller's dict is not mutated (and params=None works)
        params = dict(params or {})
        params['random_state'] = seed
        self.clf = clf(**params)

    def train(self, x_train, y_train):
        """Fit with training data."""
        self.clf.fit(x_train, y_train)

    def predict(self, x):
        """Make a prediction."""
        return self.clf.predict(x)
    
    def fit(self,x,y):
        return self.clf.fit(x,y)
    
    def feature_importances(self, x, y):
        """Refit once and return the feature importances."""
        importances = self.clf.fit(x, y).feature_importances_
        print(importances)
        return importances

#### Explanation of the code above and below

In each split of K-Folds, the original train data is split into new train data and new test data, which are then used to fit a model and predict with it. kf generates an index array for the current split: the locations selected as train data are marked as "train_index", and the locations selected as test data are marked as "test_index".

"oof_train[test_index] = clf.predict(x_te)" only provides prediction values at the "test_index" locations, so after the first split the other values of "oof_train", at the "train_index" locations, still hold zero.

In the next split, "oof_train" keeps the values provided in the previous split, and only the values at the new "test_index" locations are provided by the current prediction.

The "test_index" locations are different in each split, so the last iteration never overwrites the "oof_train[test_index]" result of a previous one.

After n splits, all of the values of "oof_train" have been filled in, each by exactly one of the n predictions.
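
The index bookkeeping described above can be checked on a tiny example. The sketch below uses ten dummy rows and a separate `kf_demo` splitter (both hypothetical names for this illustration) and counts how many times each position ends up in a test fold.

```python
import numpy as np
from sklearn.model_selection import KFold

# Ten dummy rows; only their positions matter here.
X_demo = np.arange(10).reshape(-1, 1)
kf_demo = KFold(n_splits=5)

# Count how often each row lands in a held-out (test) fold.
times_held_out = np.zeros(len(X_demo), dtype=int)

for fold, (train_index, test_index) in enumerate(kf_demo.split(X_demo)):
    times_held_out[test_index] += 1
    print(f"fold {fold}: train={train_index} test={test_index}")

print("times each row was held out:", times_held_out)
```

Every row is held out exactly once, which is why the out-of-fold array ends up completely filled and no entry is overwritten.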

### Out-of-Fold Predictions

In [None]:
def get_oof(clf, x_train, y_train, x_test):
    """Get out-of-fold predictions for a given classifier."""
    # Initialise the correct sized dfs that we will need to store our results.
    oof_train = np.zeros((ntrain,))
    oof_test = np.zeros((ntest,))
    oof_test_skf = np.empty((n_splits, ntest))
    
    # Loop through our kfold object
    for i, (train_index, test_index) in enumerate(kf.split(x_train)):
        
        # Use kfold object indexes to select the fold for train/test split
        x_tr = x_train[train_index]
        y_tr = y_train[train_index]
        x_te = x_train[test_index]

        # Train the model
        clf.train(x_tr, y_tr)
        
        # Predict on the held-out fold (rows this fold's model has not seen)
        oof_train[test_index] = clf.predict(x_te)
        
        # Predict on the full test set with this fold's model
        oof_test_skf[i, :] = clf.predict(x_test)
    
    # Take the mean of the per-fold test predictions
    oof_test[:] = oof_test_skf.mean(axis=0)
    
    # Returns both the training and testing predictions
    return oof_train.reshape(-1, 1), oof_test.reshape(-1, 1)

# Our Base First-Level Models

So now let us prepare five learning models as our first level classification. These models can all be conveniently invoked via the Sklearn library and are listed as follows:

* Random Forest classifier
* Extra Trees classifier
* AdaBoost classifier
* Gradient Boosting classifier
* Support Vector Machine

#### Parameters

Just a quick summary of the parameters that we will be listing here for completeness:

**n_jobs :** Number of cores used for the training process. If set to -1, all cores are used.

**n_estimators :** Number of classification trees in your learning model (the default is 100 in scikit-learn 0.22 and later; it was 10 in earlier versions).

**max_depth :** Maximum depth of a tree, or how far a node may be expanded. Beware: setting this too high runs the risk of overfitting, as the tree would be grown too deep.

**verbose :** Controls whether any text is output during the learning process. A value of 0 suppresses all text, while a value of 3 outputs the tree learning process at every iteration.

In [None]:
# Random Forest parameters
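rf_params = {
    'n_jobs': -1,
    'n_estimators': 575,
    # warm_start is left at its default (False): with warm_start=True and an
    # unchanged n_estimators, repeated fit() calls in get_oof grow no new trees,
    # so later folds would reuse the forest fitted on the first fold
    'max_depth': 5,
    'min_samples_leaf': 2,
    'max_features': 'sqrt',
    'verbose': 3
}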

# Extra Trees Parameters
et_params = {
    'n_jobs': -1,
    'n_estimators':575,
    #'max_features': 0.5,
    'max_depth': 8,
    'min_samples_leaf': 3,
    'verbose': 3
}

# AdaBoost parameters
ada_params = {
    'n_estimators': 500,
    'learning_rate' : 0.95
}

# Gradient Boosting parameters
gb_params = {
    'n_estimators': 575,
     #'max_features': 0.2,
    'max_depth': 5,
    'min_samples_leaf': 3,
    'verbose': 3
}

# Support Vector Classifier parameters 
svc_params = {
    'kernel' : 'linear',
    'C' : 0.025
    }

Now we can use our helper class to initialise our level 1 classifiers. We have already imported all of our classifiers (e.g. from sklearn.ensemble import RandomForestClassifier) and created dictionaries of their respective hyper-parameters (e.g. rf_params).

In [None]:
# Create 5 classifier objects that represent our 5 models
rf = SklearnHelper(clf=RandomForestClassifier, seed=SEED, params=rf_params)
et = SklearnHelper(clf=ExtraTreesClassifier, seed=SEED, params=et_params)
ada = SklearnHelper(clf=AdaBoostClassifier, seed=SEED, params=ada_params)
gb = SklearnHelper(clf=GradientBoostingClassifier, seed=SEED, params=gb_params)
svc = SklearnHelper(clf=SVC, seed=SEED, params=svc_params)

In [None]:
# Create NumPy arrays of the train, test and target (Survived) dataframes to feed into our models
y_train = train['Survived'].astype(int).to_numpy() # Labels as a 1-D integer array
train = train.drop(['Survived'], axis=1)
x_train = train.values # Creates an array of the train data
x_test = test.values # Creates an array of the test data

### Output of the First level Predictions

We now feed the training and test data into our 5 base classifiers and use the Out-of-Fold prediction function we defined earlier to generate our first level predictions.

In [None]:
# Create our out-of-fold train and test predictions. These base results will be used as new features.
et_oof_train, et_oof_test = get_oof(et, x_train, y_train, x_test) # Extra Trees
rf_oof_train, rf_oof_test = get_oof(rf,x_train, y_train, x_test) # Random Forest
ada_oof_train, ada_oof_test = get_oof(ada, x_train, y_train, x_test) # AdaBoost 
gb_oof_train, gb_oof_test = get_oof(gb,x_train, y_train, x_test) # Gradient Boost
svc_oof_train, svc_oof_test = get_oof(svc,x_train, y_train, x_test) # Support Vector Classifier

print("Training is complete")

#### Feature importances generated from the different classifiers
We can utilise a very nifty feature of the Sklearn models: the tree-based classifiers expose the importance of each feature, as learned from the training data, through the feature_importances_ attribute, which can be read with one very simple line of code.
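
As a minimal sketch of what that attribute holds, the example below fits a small random forest on synthetic data (the data and the `forest` and `importances` names are assumptions for illustration) and ranks the features by importance.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data with 5 features, 2 of which carry signal (an assumption).
X_demo, y_demo = make_classification(
    n_samples=200, n_features=5, n_informative=2, n_redundant=0, random_state=0
)

forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_demo, y_demo)

# One value per feature; for scikit-learn forests the values sum to 1.
importances = forest.feature_importances_

# List features from most to least important.
for idx in np.argsort(importances)[::-1]:
    print(f"feature {idx}: {importances[idx]:.3f}")
```

The attribute is only available after fitting, and only on estimators that define it, which is why the notebook collects importances for the four tree-based models and not for the SVC.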

In [None]:
rf_feature = rf.feature_importances(x_train,y_train)
et_feature = et.feature_importances(x_train, y_train)
ada_feature = ada.feature_importances(x_train, y_train)
gb_feature = gb.feature_importances(x_train,y_train)

In [None]:
rf_features = list(rf_feature) 
et_features = list(et_feature) 
ada_features = list(ada_feature) 
gb_features = list(gb_feature) 


In [None]:
cols = train.columns.values
# Create a dataframe with features
feature_dataframe = pd.DataFrame( {'features': cols,
    'Random Forest feature importances': rf_features,
    'Extra Trees  feature importances': et_features,
    'AdaBoost feature importances': ada_features,
    'Gradient Boost feature importances': gb_features
    })

### Interactive feature importances via Plotly scatterplots

In [None]:
# Scatter plot 
trace = go.Scatter(
    y = feature_dataframe['Random Forest feature importances'].values,
    x = feature_dataframe['features'].values,
    mode='markers',
    marker=dict(
        sizemode = 'diameter',
        sizeref = 1,
        size = 25,
#       size= feature_dataframe['AdaBoost feature importances'].values,
        #color = np.random.randn(500), #set color equal to a variable
        color = feature_dataframe['Random Forest feature importances'].values,
        colorscale='Portland',
        showscale=True
    ),
    text = feature_dataframe['features'].values
)
data = [trace]

layout= go.Layout(
    autosize= True,
    title= 'Random Forest Feature Importance',
    hovermode= 'closest',
#     xaxis= dict(
#         title= 'Pop',
#         ticklen= 5,
#         zeroline= False,
#         gridwidth= 2,
#     ),
    yaxis=dict(
        title= 'Feature Importance',
        ticklen= 5,
        gridwidth= 2
    ),
    showlegend= False
)
fig = go.Figure(data=data, layout=layout)
py.iplot(fig,filename='scatter2010')

# Scatter plot 
trace = go.Scatter(
    y = feature_dataframe['Extra Trees  feature importances'].values,
    x = feature_dataframe['features'].values,
    mode='markers',
    marker=dict(
        sizemode = 'diameter',
        sizeref = 1,
        size = 25,
#       size= feature_dataframe['AdaBoost feature importances'].values,
        #color = np.random.randn(500), #set color equal to a variable
        color = feature_dataframe['Extra Trees  feature importances'].values,
        colorscale='Portland',
        showscale=True
    ),
    text = feature_dataframe['features'].values
)
data = [trace]

layout= go.Layout(
    autosize= True,
    title= 'Extra Trees Feature Importance',
    hovermode= 'closest',
#     xaxis= dict(
#         title= 'Pop',
#         ticklen= 5,
#         zeroline= False,
#         gridwidth= 2,
#     ),
    yaxis=dict(
        title= 'Feature Importance',
        ticklen= 5,
        gridwidth= 2
    ),
    showlegend= False
)
fig = go.Figure(data=data, layout=layout)
py.iplot(fig,filename='scatter2010')

# Scatter plot 
trace = go.Scatter(
    y = feature_dataframe['AdaBoost feature importances'].values,
    x = feature_dataframe['features'].values,
    mode='markers',
    marker=dict(
        sizemode = 'diameter',
        sizeref = 1,
        size = 25,
#       size= feature_dataframe['AdaBoost feature importances'].values,
        #color = np.random.randn(500), #set color equal to a variable
        color = feature_dataframe['AdaBoost feature importances'].values,
        colorscale='Portland',
        showscale=True
    ),
    text = feature_dataframe['features'].values
)
data = [trace]

layout= go.Layout(
    autosize= True,
    title= 'AdaBoost Feature Importance',
    hovermode= 'closest',
#     xaxis= dict(
#         title= 'Pop',
#         ticklen= 5,
#         zeroline= False,
#         gridwidth= 2,
#     ),
    yaxis=dict(
        title= 'Feature Importance',
        ticklen= 5,
        gridwidth= 2
    ),
    showlegend= False
)
fig = go.Figure(data=data, layout=layout)
py.iplot(fig,filename='scatter2010')

# Scatter plot 
trace = go.Scatter(
    y = feature_dataframe['Gradient Boost feature importances'].values,
    x = feature_dataframe['features'].values,
    mode='markers',
    marker=dict(
        sizemode = 'diameter',
        sizeref = 1,
        size = 25,
#       size= feature_dataframe['AdaBoost feature importances'].values,
        #color = np.random.randn(500), #set color equal to a variable
        color = feature_dataframe['Gradient Boost feature importances'].values,
        colorscale='Portland',
        showscale=True
    ),
    text = feature_dataframe['features'].values
)
data = [trace]

layout= go.Layout(
    autosize= True,
    title= 'Gradient Boosting Feature Importance',
    hovermode= 'closest',
#     xaxis= dict(
#         title= 'Pop',
#         ticklen= 5,
#         zeroline= False,
#         gridwidth= 2,
#     ),
    yaxis=dict(
        title= 'Feature Importance',
        ticklen= 5,
        gridwidth= 2
    ),
    showlegend= False
)
fig = go.Figure(data=data, layout=layout)
py.iplot(fig,filename='scatter2010')

Now let us calculate the mean of all the feature importances and store it as a new column in the feature importance dataframe.

In [None]:
# Create the new column containing the average of values

# axis=1 computes the mean row-wise; numeric_only=True skips the text 'features' column
feature_dataframe['mean'] = feature_dataframe.mean(axis=1, numeric_only=True)
feature_dataframe.head(10)

# Second-Level Predictions from the First-level Output

Our new columns are therefore the first-level predictions from our earlier classifiers, and we train the next classifier on these.

In [None]:
base_predictions_train = pd.DataFrame( {'RandomForest': rf_oof_train.ravel(),
     'ExtraTrees': et_oof_train.ravel(),
     'AdaBoost': ada_oof_train.ravel(),
      'GradientBoost': gb_oof_train.ravel()
    })
base_predictions_train.head()

#### Correlation Heatmap of the Second Level Training set

In [None]:
data = [
    go.Heatmap(
        z= base_predictions_train.astype(float).corr().values ,
        x=base_predictions_train.columns.values,
        y= base_predictions_train.columns.values,
          colorscale='Viridis',
            showscale=True,
            reversescale = True
    )
]
py.iplot(data, filename='labelled-heatmap')

In [None]:
# Concatenate the training and test sets
x_train = np.concatenate(( et_oof_train, rf_oof_train, ada_oof_train, gb_oof_train, svc_oof_train), axis=1)
x_test = np.concatenate(( et_oof_test, rf_oof_test, ada_oof_test, gb_oof_test, svc_oof_test), axis=1)

Now, we can use the results from our level 1 classifier as inputs for our level 2 classifier. For our level 2 learner, we are going to use an XGBoost model.
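
The two-level idea can be sketched end to end with scikit-learn alone. This is a simplified illustration under stated assumptions, not the notebook's pipeline: the data is synthetic, only two base models are used, `LogisticRegression` stands in for XGBoost as the level 2 learner, and the test-set features come from base models refitted on all training rows instead of being averaged across folds.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, train_test_split

# Synthetic stand-in data (an assumption; not the Titanic features).
X_all, y_all = make_classification(n_samples=400, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_all, y_all, test_size=0.25, random_state=0)

base_models = [
    RandomForestClassifier(n_estimators=50, random_state=0),
    ExtraTreesClassifier(n_estimators=50, random_state=0),
]

# Level 1: out-of-fold predictions become the training features for level 2.
meta_train = np.column_stack(
    [cross_val_predict(model, X_tr, y_tr, cv=5) for model in base_models]
)

# Test features: refit each base model on all training rows, then predict.
meta_test = np.column_stack(
    [model.fit(X_tr, y_tr).predict(X_te) for model in base_models]
)

# Level 2: the meta-learner only ever sees base-model predictions.
meta_model = LogisticRegression().fit(meta_train, y_tr)
stacked_pred = meta_model.predict(meta_test)

print("level 2 train features:", meta_train.shape)
print("level 2 test features: ", meta_test.shape)
print("stacked accuracy:", (stacked_pred == y_te).mean())
```

Using out-of-fold predictions for `meta_train` matters: if the base models predicted rows they were trained on, the level 2 learner would be fitted on over-optimistic inputs.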

### Second level learning model via XGBoost

Here we choose the eXtremely famous library for boosted tree learning model, XGBoost. It was built to optimize large-scale boosted tree algorithms.

We call an XGBClassifier and fit it to the first-level train and target data and use the learned model to predict the test data.

Just a quick run down of the XGBoost parameters used in the model:

max_depth : How deep you want to grow your tree. Beware: setting it too high runs the risk of overfitting.

gamma : Minimum loss reduction required to make a further partition on a leaf node of the tree. The larger the value, the more conservative the algorithm will be.

In [None]:
xgboost = XGBClassifier(
 learning_rate = 0.95,
 n_estimators= 5000,
 max_depth= 4,
 min_child_weight= 2,
 gamma=1,                        
 subsample=0.8,
 colsample_bytree=0.8,
 objective= 'binary:logistic',
 nthread= -1,
 scale_pos_weight=1)

xgb_model_full_data = xgboost.fit(x_train, y_train)
predictions = xgb_model_full_data.predict(x_test).astype(int)

#### Plot final XGBoost Decision Tree

In [None]:
from xgboost import plot_tree
from matplotlib.pylab import rcParams

##set up the parameters
rcParams['figure.figsize'] = 80,50
# plot single tree
plot_tree(xgb_model_full_data, rankdir='LR')


#### Producing the Submission file

Finally, having trained and fit all our first-level and second-level models, we can now output the predictions in the proper format for submission to the Titanic competition as follows:

In [None]:
# Generate Submission File 
StackingSubmission = pd.DataFrame({ 'PassengerId': PassengerId,'Survived': predictions })
StackingSubmission.to_csv("Submission.csv", index=False)

# Acknowledgments

Inspiration is drawn from various Kaggle notebooks, but the main motivation comes from the following:

1. https://www.kaggle.com/arthurtok/0-808-with-simple-stacking
2. https://www.kaggle.com/usharengaraju/data-visualization-titanic-survival
3. https://www.kaggle.com/arthurtok/introduction-to-ensembling-stacking-in-python
4. https://www.kaggle.com/startupsci/titanic-data-science-solutions

Till next time, Peace Out