# Predict survival on the Titanic
In this lab, we ask you to apply the tools of machine learning to predict which passengers survived the tragedy.

### Dataset
The dataset contains 891 observations of 12 variables:
* **PassengerId**: Unique ID for each passenger
* **Survived**: Survival (0 = No; 1 = Yes)
* **Pclass**: Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd)
* **Name**: Name
* **Sex**: Sex
* **Age**: Age
* **SibSp**: Number of Siblings/Spouses Aboard
* **Parch**: Number of Parents/Children Aboard
* **Ticket**: Ticket Number
* **Fare**: Passenger Fare
* **Cabin**: Cabin
* **Embarked**: Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)

In [1]:
# imports
import warnings
warnings.filterwarnings('ignore')
import pandas as pd
import numpy as np

In [2]:
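titanic = pd.read_csv("titanic.csv")
titanic = titanic.drop('Cabin', axis=1)  # Drop this column because it contains a lot of NaN values
# Assign the filled columns back instead of calling fillna(inplace=True) on a column selection,
# which recent pandas versions warn about and may not apply to the original frame
titanic["Age"] = titanic["Age"].fillna(titanic["Age"].median())  # missing ages -> median age
titanic["Embarked"] = titanic["Embarked"].fillna("S")  # missing ports -> "S" (Southampton)
print('survival rate =', titanic.Survived.mean())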

survival rate = 0.3838383838383838
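The cleaning step above can be tried on a toy frame. This is only a sketch: the values below are made up (they are not rows of `titanic.csv`), and filling `Embarked` with the most frequent value via `mode()` is an assumption standing in for the hard-coded `"S"` used in the lab.

```python
import numpy as np
import pandas as pd

# Toy data with made-up values, only to illustrate the imputation pattern
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 38.0, 26.0],
    "Embarked": ["S", "C", None, "S"],
})

# Numeric column: replace missing values with the median of the observed ones
toy["Age"] = toy["Age"].fillna(toy["Age"].median())

# Text column: replace missing values with the most frequent category
# (assumption: the lab itself hard-codes "S" instead of computing the mode)
toy["Embarked"] = toy["Embarked"].fillna(toy["Embarked"].mode()[0])

print(toy)
```

The median is less sensitive to a few very old or very young passengers than the mean, which is one common reason to prefer it for a column like `Age`.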


## Model training

In [3]:
# Some of the columns don't have predictive power, so let's specify which ones are included for prediction
predictors = ["Pclass", "Sex", "Age", 'SibSp' ,'Parch', "Fare", "Embarked"]  
# We need now to convert text columns in predictors to numerical ones
for col in predictors: # Loop through all columns in predictors
    if titanic[col].dtype == 'object':  # check if column's type is object (text)
        titanic[col] = pd.Categorical(titanic[col]).codes  # convert text to numerical

titanic.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | 1 | 22.0 | 1 | 0 | A/5 21171 | 7.25 | 2 |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | 0 | 38.0 | 1 | 0 | PC 17599 | 71.2833 | 0 |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | 0 | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | 2 |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | 0 | 35.0 | 1 | 0 | 113803 | 53.1 | 2 |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | 1 | 35.0 | 0 | 0 | 373450 | 8.05 | 2 |
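To see how `pd.Categorical(...).codes` turns text into numbers, here is a small sketch on hand-written values. The categories are sorted before codes are assigned, which is consistent with `Sex` being 1 for male passengers and `Embarked` being 2 for Southampton in the table above.

```python
import pandas as pd

# Hand-written values, not read from titanic.csv
sex = pd.Categorical(["male", "female", "female", "male"])
sex_categories = list(sex.categories)   # categories are sorted alphabetically
sex_codes = sex.codes.tolist()          # position of each value in that sorted list

embarked = pd.Categorical(["S", "C", "Q", "S"])
embarked_mapping = dict(enumerate(embarked.categories))  # code -> original label

print(sex_categories, sex_codes)
print(embarked_mapping)
```

Note that these codes impose an order (C < Q < S) that has no real meaning; a linear model such as logistic regression will treat it as numeric. One-hot encoding is a common alternative, which this lab does not use.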


In [11]:
# Split the data into a training set and a testing set
from sklearn.model_selection import train_test_split 
X_train, X_test, y_train, y_test = train_test_split(titanic[predictors], titanic['Survived'], test_size=0.3, random_state=1)

from sklearn.linear_model import LogisticRegression
clf_log = LogisticRegression(random_state=1)
clf_log.fit(X_train, y_train)
print ('train accuracy =', clf_log.score(X_train, y_train))
print ('test accuracy =', clf_log.score(X_test, y_test))

from sklearn import model_selection
scores = model_selection.cross_val_score(clf_log, titanic[predictors], titanic["Survived"], scoring='accuracy', cv=5)
print('scores =',scores)
print('cross validation accuracy =', scores.mean())

train accuracy = 0.8105939004815409
test accuracy = 0.7686567164179104
scores = [0.7877095  0.79888268 0.78089888 0.76404494 0.81920904]
cross validation accuracy = 0.7901490077087383
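As a rough illustration of what `cross_val_score` does with `cv=5` for a classifier, the sketch below spells out the loop by hand. It runs on synthetic data from `make_classification` (an assumption, so that it does not need `titanic.csv`), and `max_iter=1000` is set only to avoid convergence warnings.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the Titanic predictors (7 features, binary target)
X, y = make_classification(n_samples=300, n_features=7, random_state=1)
clf = LogisticRegression(max_iter=1000, random_state=1)

# Manual 5-fold loop: fit a fresh copy on 4 folds, score on the held-out fold
manual_scores = []
for train_idx, test_idx in StratifiedKFold(n_splits=5).split(X, y):
    model = clone(clf).fit(X[train_idx], y[train_idx])
    manual_scores.append(model.score(X[test_idx], y[test_idx]))
manual_scores = np.array(manual_scores)

# Same computation through the helper used in the lab
auto_scores = cross_val_score(clf, X, y, scoring="accuracy", cv=5)

print(manual_scores, auto_scores)
```

Because every data point is used for testing exactly once, the mean of the five scores is a more stable estimate than the accuracy from a single train/test split.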


The logistic regression above is what we saw last time. Now let's apply [Bagging](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.BaggingClassifier.html#sklearn.ensemble.BaggingClassifier) to this basic classifier.

## Bagging

We'll train the above classifier 400 times with bagging: each copy is fitted on a random subsample of the rows and a random subset of the features. The bagging parameters were chosen manually here.

In [17]:
from sklearn.ensemble import BaggingClassifier
clf_log = LogisticRegression(random_state=1)  # base classifier (estimator)
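# Note: scikit-learn 1.2 renamed base_estimator to estimator (and the fitted attribute base_estimator_ to estimator_)
clf_bag = BaggingClassifier(base_estimator=clf_log, random_state=1, n_estimators=400, max_samples=0.7, max_features=0.8)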
clf_bag.fit(X_train, y_train)
print ('train accuracy =', clf_bag.score(X_train, y_train))
print ('test accuracy =', clf_bag.score(X_test, y_test))

train accuracy = 0.8186195826645265
test accuracy = 0.7649253731343284


The training accuracy improved a little, but the test accuracy did not. This is not a problem: a single train/test split is noisy, and it is the cross validation accuracy that should improve before we can say we have a better classifier. We'll check that later.
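To make the idea concrete, here is a simplified, hand-rolled version of bagging. It is a sketch under stated assumptions, not scikit-learn's implementation: it uses synthetic data, only 25 estimators, and a plain majority vote, whereas `BaggingClassifier` averages predicted probabilities when the base estimator provides them.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the Titanic data
X, y = make_classification(n_samples=400, n_features=7, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)

rng = np.random.default_rng(1)
n_estimators, max_samples, max_features = 25, 0.7, 0.8
n_rows = int(max_samples * X_tr.shape[0])    # rows drawn per estimator
n_cols = int(max_features * X_tr.shape[1])   # features drawn per estimator

models = []
for _ in range(n_estimators):
    rows = rng.integers(0, X_tr.shape[0], size=n_rows)            # with replacement
    cols = rng.choice(X_tr.shape[1], size=n_cols, replace=False)  # without replacement
    model = LogisticRegression(max_iter=1000).fit(X_tr[rows][:, cols], y_tr[rows])
    models.append((model, cols))

# Each model votes using only the features it was trained on
votes = np.array([model.predict(X_te[:, cols]) for model, cols in models])
y_pred = (votes.mean(axis=0) >= 0.5).astype(int)   # majority vote over 0/1 labels
accuracy = (y_pred == y_te).mean()
print('hand-rolled bagging test accuracy =', accuracy)
```

Averaging many models trained on slightly different data mainly reduces variance, so the gain is usually larger for unstable base learners (such as deep decision trees) than for a stable one like logistic regression.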

Access the base classifier (estimator)

In [18]:
clf_bag.base_estimator_ 

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=1, solver='warn',
          tol=0.0001, verbose=0, warm_start=False)

Access the 400 classifiers (estimators)

In [19]:
print (len(clf_bag.estimators_))
clf_bag.estimators_

400


[LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
           intercept_scaling=1, max_iter=100, multi_class='warn',
           n_jobs=None, penalty='l2', random_state=1028862084,
           solver='warn', tol=0.0001, verbose=0, warm_start=False),
 LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
           intercept_scaling=1, max_iter=100, multi_class='warn',
           n_jobs=None, penalty='l2', random_state=870353631, solver='warn',
           tol=0.0001, verbose=0, warm_start=False),
 LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
           intercept_scaling=1, max_iter=100, multi_class='warn',
           n_jobs=None, penalty='l2', random_state=788373214, solver='warn',
           tol=0.0001, verbose=0, warm_start=False),
 LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
           intercept_scaling=1, max_iter=100, multi_class='warn',
           n_jobs=None, penalty='

As you can see, the same classifier is repeated 400 times; the only thing that changes is the random_state.

Access the subsample that is used by each classifier

In [21]:
print (len(clf_bag.estimators_samples_))
clf_bag.estimators_samples_

400


[array([300, 365, 292,  74, 244, 316, 472, 363, 191, 615,  65, 499,  53,
        466, 412, 228, 413,  60,  13, 487, 331,  83, 376,  69,  33, 345,
         86, 303, 359, 549, 412, 387, 344, 129, 465, 395, 488, 612, 322,
         26, 221, 506, 185, 501, 204, 540, 598, 178, 444, 479,  18, 116,
        286, 524,  77, 177, 444, 282, 455, 223, 533, 211, 396, 183, 554,
        616, 124, 173, 518, 163,  55, 181, 241, 545, 154, 504, 529,  32,
        148, 320, 168, 269, 574, 297, 478, 572, 216, 120, 310, 490, 106,
        361, 477, 386, 605, 276,  75, 342, 270, 203, 106, 436, 549, 206,
         92, 100, 419, 404, 550,  85, 258, 103, 240, 194, 417, 443, 131,
        534, 375, 536,  57, 538, 462, 384, 170, 200, 556, 216, 590, 285,
        519, 387, 548,  83, 133, 438, 321, 181,  96, 235, 622, 423, 525,
        577, 352, 493, 460, 570, 267,  72, 280, 492,   0, 252, 394, 200,
         74, 487, 322, 120, 327, 408,  15, 552, 448, 353,  54, 368, 365,
        528,  44, 606, 576,  59, 249, 438,  97, 438

A list containing 400 arrays is returned. Let's see the first one.

In [15]:
print (len(clf_bag.estimators_samples_[0]))
clf_bag.estimators_samples_[0].sum()

436


132593

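Each array holds the row positions in X_train that were drawn for one classifier; it is not a boolean mask. The first classifier was trained on 436 draws, which is 70% of the 623 data points in X_train (max_samples=0.7). Sampling is done with replacement by default, so some indices repeat (412 and 444 each appear twice in the printout above) and the number of distinct points is below 436. The sum of the indices has no meaning in itself; it only serves as a quick fingerprint of the subsample.<br>
Let's see the second one.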

In [24]:
print (len(clf_bag.estimators_samples_[1]))
clf_bag.estimators_samples_[1].sum()

436


130509
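To count how many distinct training points a subsample contains, `np.unique` is more informative than the sum of the indices. The sketch below assumes a scikit-learn version in which `estimators_samples_` returns index arrays, and it uses synthetic data with `max_samples=0.5` so that the expected number of draws is easy to check.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.linear_model import LogisticRegression

# Synthetic data: 200 rows, so max_samples=0.5 means 100 draws per estimator
X, y = make_classification(n_samples=200, n_features=7, random_state=1)

# The base estimator is passed positionally so the call does not depend on
# whether the keyword is named base_estimator or estimator
bag = BaggingClassifier(LogisticRegression(max_iter=1000), n_estimators=10,
                        max_samples=0.5, random_state=1).fit(X, y)

idx = bag.estimators_samples_[0]     # row positions drawn for the first estimator
n_drawn = len(idx)                   # number of draws (duplicates included)
n_distinct = len(np.unique(idx))     # number of different rows actually used
print(n_drawn, n_distinct)
```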

As you can see, the second classifier used a different subsample (the two index sums differ). Now let's look at the features used by each classifier.

In [26]:
print (len(clf_bag.estimators_features_))
clf_bag.estimators_features_

400


[array([0, 1, 5, 3, 2]),
 array([2, 3, 0, 6, 5]),
 array([0, 6, 4, 3, 2]),
 array([0, 5, 4, 2, 3]),
 array([1, 4, 0, 5, 3]),
 array([4, 1, 2, 0, 6]),
 array([1, 4, 5, 6, 0]),
 array([3, 1, 0, 6, 2]),
 array([2, 6, 1, 0, 3]),
 array([6, 2, 3, 5, 0]),
 array([2, 6, 1, 5, 0]),
 array([4, 0, 3, 5, 2]),
 array([1, 0, 4, 5, 2]),
 array([4, 0, 2, 6, 3]),
 array([4, 1, 2, 6, 0]),
 array([4, 0, 5, 2, 1]),
 array([4, 5, 0, 2, 3]),
 array([2, 3, 5, 1, 6]),
 array([6, 3, 4, 5, 2]),
 array([3, 5, 2, 4, 6]),
 array([4, 2, 6, 1, 0]),
 array([5, 6, 1, 3, 0]),
 array([1, 6, 5, 3, 2]),
 array([4, 6, 5, 0, 1]),
 array([2, 5, 6, 0, 4]),
 array([1, 4, 2, 0, 6]),
 array([4, 2, 3, 1, 5]),
 array([0, 1, 2, 3, 4]),
 array([0, 2, 4, 5, 1]),
 array([3, 6, 5, 0, 4]),
 array([1, 6, 5, 0, 4]),
 array([3, 4, 2, 0, 1]),
 array([5, 1, 2, 0, 4]),
 array([5, 6, 4, 3, 0]),
 array([5, 2, 4, 6, 3]),
 array([3, 2, 6, 1, 0]),
 array([2, 5, 0, 6, 1]),
 array([1, 6, 0, 5, 3]),
 array([6, 1, 2, 5, 3]),
 array([6, 5, 1, 2, 4]),


Here too, each classifier uses a different subset of features: 5 of the 7 predictors, since max_features=0.8.
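The feature arrays are positions in the `predictors` list, so they can be translated back to column names. The short sketch below does this for the first three arrays printed above; the variable names `first_three` and `named` are only illustrative.

```python
predictors = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked"]

# First three index arrays copied from the output above
first_three = [[0, 1, 5, 3, 2], [2, 3, 0, 6, 5], [0, 6, 4, 3, 2]]

# Translate each position into the corresponding column name
named = [[predictors[i] for i in idx] for idx in first_three]
for names in named:
    print(names)

# 80% of 7 features, truncated to an integer, gives the 5 features per estimator
n_features_used = int(0.8 * len(predictors))
print(n_features_used)
```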

### With cross validation

In [27]:
clf_bag = BaggingClassifier(base_estimator=clf_log, random_state= 1, n_estimators=200, max_samples=0.7, max_features= 0.7)
scores_bag = model_selection.cross_val_score(clf_bag, titanic[predictors], titanic["Survived"], scoring='accuracy', cv=5)
print('scores =',scores_bag)
print('cross validation accuracy =',scores_bag.mean())

scores = [0.7877095  0.80446927 0.81460674 0.76966292 0.82485876]
cross validation accuracy = 0.8002614381866431


As you can see, with the same base estimator used 200 times (this run uses n_estimators=200 and max_features=0.7), we went from **0.790149007709** CV accuracy to **0.800261438187**.<br> Further improvement may be possible by tuning the clf_bag parameters with grid search.
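A grid search over the bagging parameters could look like the sketch below. The grid values are arbitrary choices for illustration, and the data is synthetic so that the example runs without `titanic.csv`; on the lab data you would pass `titanic[predictors]` and `titanic["Survived"]` instead.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the Titanic predictors
X, y = make_classification(n_samples=300, n_features=7, random_state=1)

# Base estimator passed positionally to stay independent of the keyword name
bag = BaggingClassifier(LogisticRegression(max_iter=1000), random_state=1)

# Small illustrative grid; larger grids cost proportionally more fits
param_grid = {
    "n_estimators": [10, 30],
    "max_samples": [0.5, 0.7],
    "max_features": [0.7, 1.0],
}

# Each parameter combination is scored with 5-fold cross validation
search = GridSearchCV(bag, param_grid, scoring="accuracy", cv=5)
search.fit(X, y)
print('best parameters =', search.best_params_)
print('best cross validation accuracy =', search.best_score_)
```

Since the best score is chosen among several candidates on the same folds, it is slightly optimistic; a held-out test set gives a fairer final estimate.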