#Kaggle Titanic Dataset

The following is an implementation of a solution to the Titanic classification challenge using stacking. This model scored 80% accuracy in the competition, which placed it in the top 10% of the leaderboard. In this notebook, we will explore feature engineering and stacking methods.

In [1]:
import pandas as pd
import shutil
import numpy as np
import re
from google.colab import drive
drive.mount('/content/gdrive')
#import dataset
train_set = pd.read_csv('/content/gdrive/My Drive/Colab Notebooks/titanic/data/train.csv')
test_set = pd.read_csv('/content/gdrive/My Drive/Colab Notebooks/titanic/data/test.csv')

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


##Preprocessing dataset

For the Titanic dataset, we extract the titles from passengers' names as an indicator of social status, alongside the Pclass feature. Numeric/continuous data (Age and Fare) are bucketized into ranges and used as categorical inputs, and synthetic features such as FamilySize and IsAlone are created from the SibSp and Parch features. Credit for the feature engineering methodology goes to the Kaggle kernel "EDA to Predict".
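
As a small illustration of these steps, the sketch below applies them to a few made-up rows. The passenger names, the `Title`, `FareEqualWidth` and `FareEqualCount` column names, and all values are assumptions for demonstration only; the notebook's own function stores the title back into `Name`.

```python
import re
import pandas as pd

# Made-up rows in the "Surname, Title. Given names" layout of the Name column.
toy = pd.DataFrame({
    "Name": ["Doe, Mr. John", "Roe, Miss. Jane", "Poe, Mlle. Anne", "Moe, Master. Tim"],
    "SibSp": [1, 0, 0, 3],
    "Parch": [0, 0, 2, 1],
    "Fare": [7.0, 8.0, 9.0, 100.0],
})

def extract_title(name):
    # The title is the first run of letters directly followed by a period.
    return re.search(r"([A-Za-z]+)\.", name).group(1)

# Rare titles are merged into a common one, e.g. Mlle -> Miss.
toy["Title"] = toy["Name"].apply(extract_title).replace({"Mlle": "Miss"})

# Synthetic features: the passenger plus siblings/spouses and parents/children.
toy["FamilySize"] = toy["SibSp"] + toy["Parch"] + 1
toy["IsAlone"] = (toy["FamilySize"] == 1).astype(int)

# pd.cut splits the value range into equal-width bins,
# while pd.qcut places roughly the same number of rows in each bin.
toy["FareEqualWidth"] = pd.cut(toy["Fare"], 2)
toy["FareEqualCount"] = pd.qcut(toy["Fare"], 2)

print(toy[["Title", "FamilySize", "IsAlone", "FareEqualWidth", "FareEqualCount"]])
```

With a skewed column such as the toy `Fare`, the equal-width bins end up unbalanced, while the quantile bins hold two rows each. That is one plausible reason to use `pd.qcut` for Fare and `pd.cut` for Age.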

In [0]:
def preprocess_data(data):
    def get_init(name):
        #the title is the word directly followed by a period, e.g. "Mr."
        return re.search(r'([A-Za-z]+)\.', name).group(1)
    description = data.describe(include = 'all')
    #fill missing ports in the frame being processed (not always train_set)
    data['Embarked'] = data['Embarked'].fillna('S')
    data['Age'] = data['Age'].fillna(description.loc['mean']['Age'])
    data['Age'] = pd.cut(data['Age'],5)
    data['Fare'] = data['Fare'].fillna(description.loc['mean']['Fare'])
    data['Fare'] = pd.qcut(data['Fare'],4)
    data['Pclass'] = data['Pclass'].astype(str)
    data['FamilySize'] = data['SibSp'] + data['Parch'] + 1
    data['IsAlone'] = (data['FamilySize'] == 1).astype(int)
    data['Name'] = data['Name'].apply(get_init)
    data['Name'] = data['Name'].replace(['Mlle','Mme','Ms','Dr','Major','Lady','Countess','Jonkheer','Col','Rev','Capt','Sir','Don'],['Miss','Miss','Miss','Mr','Mr','Mrs','Mrs','Other','Other','Other','Mr','Mr','Mr'])
    
    return data

In [3]:
#check dataset
train_set = preprocess_data(train_set)
print(train_set)

     PassengerId  Survived Pclass    Name     Sex               Age  SibSp  \
0              1         0      3      Mr    male  (16.336, 32.252]      1   
1              2         1      1     Mrs  female  (32.252, 48.168]      1   
2              3         1      3    Miss  female  (16.336, 32.252]      0   
3              4         1      1     Mrs  female  (32.252, 48.168]      1   
4              5         0      3      Mr    male  (32.252, 48.168]      0   
5              6         0      3      Mr    male  (16.336, 32.252]      0   
6              7         0      1      Mr    male  (48.168, 64.084]      0   
7              8         0      3  Master    male    (0.34, 16.336]      3   
8              9         1      3     Mrs  female  (16.336, 32.252]      0   
9             10         1      2     Mrs  female    (0.34, 16.336]      1   
10            11         1      3    Miss  female    (0.34, 16.336]      1   
11            12         1      1    Miss  female  (48.168, 64.0

In [4]:
#We select only these features; columns such as Ticket contain strings with no clear meaning
features = train_set[['Name','Pclass','SibSp','Parch','Sex','Age','FamilySize','Fare','Embarked','IsAlone']]
features = pd.get_dummies(features)
print(features.columns)

Index(['SibSp', 'Parch', 'FamilySize', 'IsAlone', 'Name_Master', 'Name_Miss',
       'Name_Mr', 'Name_Mrs', 'Name_Other', 'Pclass_1', 'Pclass_2', 'Pclass_3',
       'Sex_female', 'Sex_male', 'Age_(0.34, 16.336]', 'Age_(16.336, 32.252]',
       'Age_(32.252, 48.168]', 'Age_(48.168, 64.084]', 'Age_(64.084, 80.0]',
       'Fare_(-0.001, 7.91]', 'Fare_(7.91, 14.454]', 'Fare_(14.454, 31.0]',
       'Fare_(31.0, 512.329]', 'Embarked_C', 'Embarked_Q', 'Embarked_S'],
      dtype='object')


##Training base models

For stacking, we train several different models to produce first-stage predictions and then use those predictions as input features for a second-stage model. First, let us choose models that perform well in the first stage. Note that the models and their parameters were already chosen through a parameter search.
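
As a rough illustration of the two-stage idea, the sketch below builds second-stage features from two base models and fits a meta-learner on them. It is not this notebook's pipeline: the data is synthetic, `LogisticRegression` is only a stand-in for the second-stage model, and the first-stage predictions are taken out-of-fold with `cross_val_predict`, which is an assumption rather than what the cells below do.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

# Synthetic stand-in data; the Titanic features are not used here.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

base_models = {
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "gradient_boosting": GradientBoostingClassifier(n_estimators=50, random_state=0),
}

# Stage 1: each column holds one base model's out-of-fold predictions,
# so no row is predicted by a model that saw it during fitting.
meta_features = np.column_stack([
    cross_val_predict(model, X, y, cv=5) for model in base_models.values()
])

# Stage 2: a meta-learner is fitted on the base models' predictions.
meta_model = LogisticRegression().fit(meta_features, y)
print(meta_features.shape)
```

Using out-of-fold predictions keeps the second-stage model from learning on predictions the base models made for rows they had already seen during fitting.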

In [0]:
#get labels
labels = train_set.pop('Survived')
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier, ExtraTreesClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

#fine-tuned models
classifiers = {
    'random_forest':RandomForestClassifier(n_estimators=3000,max_depth=20,max_features='sqrt',min_samples_leaf=2,min_samples_split=5,oob_score = True),
    'adaBoost':AdaBoostClassifier(n_estimators=500,learning_rate=0.05),
    'gradientBoosting':GradientBoostingClassifier(learning_rate=0.1,max_depth=3,max_features='sqrt',n_estimators=500,min_samples_split=4,min_samples_leaf=2),
    'svc':SVC(kernel='linear',C=0.5,gamma='auto',degree=1),
    'ex_tree':ExtraTreesClassifier(max_depth=12,max_features='sqrt',min_samples_leaf=1,min_samples_split=10,oob_score=True,bootstrap=True,n_estimators=500)
}


In [0]:
#prepare test features 
test_set = preprocess_data(test_set)
test_features = test_set[['Name','Pclass','SibSp','Parch','Sex','Age','FamilySize','Fare','Embarked','IsAlone']]
test_features = pd.get_dummies(test_features)
test_features = test_features.drop(['Name_Dona'],axis=1)
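
Dropping `Name_Dona` by hand handles this one mismatch. A more general option, sketched below on hypothetical title columns, is to reindex the test dummies against the training columns; the frame names `train_titles` and `test_titles` are made up for the example.

```python
import pandas as pd

# Hypothetical title columns: the test split contains a title unseen in training.
train_titles = pd.DataFrame({"Name": ["Mr", "Mrs", "Miss"]})
test_titles = pd.DataFrame({"Name": ["Mr", "Dona"]})

train_dummies = pd.get_dummies(train_titles)
test_dummies = pd.get_dummies(test_titles)

# Keep exactly the training columns, in training order; columns missing from
# the test split are filled with 0 and test-only columns are dropped.
aligned = test_dummies.reindex(columns=train_dummies.columns, fill_value=0)
print(aligned.columns.tolist())
```

Matching is done by column name, so interval-named columns such as `Age_(...)` and `Fare_(...)` would only line up if both sets were binned with the same edges.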


In [7]:
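#fit each model on 7 shuffled folds; the accuracy printed below is measured on the full training set (fitted rows included), not only on the held-out fold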
from sklearn.model_selection import KFold
from sklearn.metrics import accuracy_score
kf = KFold(n_splits=7,shuffle=True)
for key, classifier in classifiers.items():
    for train_index,test_index in kf.split(features):
        X_train, y_train = features.iloc[train_index], labels.iloc[train_index]
        X_test, y_test = features.iloc[test_index], labels.iloc[test_index]
        classifier.fit(X_train,y_train)
        print(key,accuracy_score(classifier.predict(features),labels))
    #print overall performance
    print("overall",key,accuracy_score(classifier.predict(features),labels))
    #get predictions from test_features
    predictions = classifier.predict(test_features)
    predictions_set = test_set.loc[:,['PassengerId']]
    predictions_set = predictions_set.assign(Survived=pd.Series(predictions))
    predictions_set.to_csv('/content/gdrive/My Drive/Colab Notebooks/titanic/saved_model/'+key+'.csv',index=False)
    #get stacking features from training set
    stacking_features = classifier.predict(features)
    stacking_set = train_set.loc[:,['PassengerId']]
    stacking_set = stacking_set.assign(Survived=pd.Series(stacking_features))
    stacking_set.to_csv('/content/gdrive/My Drive/Colab Notebooks/titanic/stacking_model/'+key+'.csv',index=False)

random_forest 0.8619528619528619
random_forest 0.8585858585858586
random_forest 0.8608305274971941
random_forest 0.8585858585858586
random_forest 0.8653198653198653
random_forest 0.8608305274971941
random_forest 0.8585858585858586
overall random_forest 0.8585858585858586
adaBoost 0.8237934904601572
adaBoost 0.8260381593714927
adaBoost 0.8271604938271605
adaBoost 0.8249158249158249
adaBoost 0.8282828282828283
adaBoost 0.8282828282828283
adaBoost 0.8260381593714927
overall adaBoost 0.8260381593714927
gradientBoosting 0.8765432098765432
gradientBoosting 0.8698092031425365
gradientBoosting 0.8720538720538721
gradientBoosting 0.8698092031425365
gradientBoosting 0.8731762065095399
gradientBoosting 0.8754208754208754
gradientBoosting 0.8787878787878788
overall gradientBoosting 0.8787878787878788
svc 0.8249158249158249
svc 0.8226711560044894
svc 0.8249158249158249
svc 0.8260381593714927
svc 0.8260381593714927
svc 0.8226711560044894
svc 0.8282828282828283
overall svc 0.8282828282828283
ex_tree 

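The overall result from the KFold loop seems very promising, with over 80% accuracy for most models. Note that each score is computed on the full training set, including the rows the model was fitted on, so it is likely to be higher than the accuracy on unseen data. Now, let's proceed to the second stage in the **xgboost-titanic** notebook, which applies stacking with an XGBoost Regressor.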
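
The loop above fits each model on the training folds but reports accuracy on all of `features`. A held-out estimate could be obtained along the following lines; this is only a sketch on synthetic data with a smaller random forest, and the variable names are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data, not the Titanic features.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

kf = KFold(n_splits=7, shuffle=True, random_state=0)
model = RandomForestClassifier(n_estimators=50, random_state=0)

# One accuracy per fold, each computed only on the rows held out from fitting.
fold_scores = cross_val_score(model, X, y, cv=kf, scoring="accuracy")
print(fold_scores.mean(), fold_scores.std())
```

The mean of the per-fold scores is usually a closer guide to leaderboard accuracy than a score that includes the fitted rows.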