# Titanic Survival Prediction

## Introduction

This is my first machine learning kernel experiment. I will use the Titanic dataset from Kaggle to predict how likely passengers were to survive the disaster.

## Steps
1. Load Data
2. Feature engineering
3. Train model using Decision Tree
4. Conclusion

### Load data

In [1]:
#Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import make_scorer, accuracy_score
%matplotlib inline
from subprocess import check_output
print(check_output(["ls", "./inputs"]).decode("utf8"))

test.csv
train.csv



In [2]:
train = pd.read_csv("./inputs/train.csv")
test = pd.read_csv("./inputs/test.csv")

In [3]:
train.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


Some observations:
- We can drop the Name and Ticket columns.
- We can drop Fare because Pclass already captures much of the same information.
- Cabin contains NaN values, but it is also potentially related to survival.
- SibSp and Parch could be combined into a single column.
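
The last observation (combining SibSp and Parch) is not implemented in the cells of this kernel. A minimal sketch of one way it could be done, on a toy frame; the column names `FamilySize` and `IsAlone` are my own assumptions, not part of the dataset:

```python
import pandas as pd

# Toy rows shaped like the Titanic columns (values are illustrative only).
toy = pd.DataFrame({
    "SibSp": [1, 0, 3],
    "Parch": [0, 0, 2],
})

# FamilySize (hypothetical name): relatives aboard plus the passenger.
toy["FamilySize"] = toy["SibSp"] + toy["Parch"] + 1

# IsAlone (hypothetical name): 1 when the passenger travels without family.
toy["IsAlone"] = (toy["FamilySize"] == 1).astype(int)

print(toy)
```

Whether a single combined column helps the tree more than the two original columns would need to be checked on validation data.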

### Feature Engineering

In [4]:
train.tail()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.45,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0,C148,C
890,891,0,3,"Dooley, Mr. Patrick",male,32.0,0,0,370376,7.75,,Q


Check for missing values:

In [5]:
print(pd.isnull(train).sum())

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64


- Drop Cabin because most of its values are null (687 of 891 rows).
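
The counts above are easier to compare as fractions: Cabin is missing in 687 of 891 rows (about 77%), while Age is missing in 177 (about 20%). A small sketch of picking columns to drop by missing ratio, on toy data; the 0.5 threshold is an arbitrary assumption for illustration:

```python
import numpy as np
import pandas as pd

# Toy frame with the same kind of gaps as the real data (values are made up).
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Embarked": ["S", "C", "S", "S"],
})

# Fraction of missing values per column.
missing_ratio = toy.isnull().mean()

# Hypothetical cut-off: drop columns where more than half the values are missing.
threshold = 0.5
to_drop = missing_ratio[missing_ratio > threshold].index.tolist()

print(missing_ratio)
print(to_drop)
```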


In [6]:
# Drop the columns identified above. drop() returns a new frame, so
# assigning it back avoids echoing the removed Cabin column as cell output.
train = train.drop(columns=['Name', 'Fare', 'Ticket', 'Cabin'])

In [7]:
print(pd.isnull(train).sum())

PassengerId      0
Survived         0
Pclass           0
Sex              0
Age            177
SibSp            0
Parch            0
Embarked         2
dtype: int64


In [8]:
train=train.dropna(how='any',axis=0)

In [9]:
print(pd.isnull(train).sum())

PassengerId    0
Survived       0
Pclass         0
Sex            0
Age            0
SibSp          0
Parch          0
Embarked       0
dtype: int64


In [10]:
train.head()

Unnamed: 0,PassengerId,Survived,Pclass,Sex,Age,SibSp,Parch,Embarked
0,1,0,3,male,22.0,1,0,S
1,2,1,1,female,38.0,1,0,C
2,3,1,3,female,26.0,0,0,S
3,4,1,1,female,35.0,1,0,S
4,5,0,3,male,35.0,0,0,S


In [11]:
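def age_to_int(value):
    # Map an age in years to an ordinal bucket; missing ages stay NaN.
    if value == 0:
        return 0
    if 0 < value <= 20:
        return 1
    if 20 < value <= 38:
        return 2
    if value > 38:
        return 3
    return np.nan  # NaN (or negative) ages fall through to here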
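
The same bucketing can also be written without a Python-level function. A sketch using `pd.cut`, assuming ages are never negative so that the first bin only ever catches an age of exactly 0; this is an alternative, not what the kernel runs:

```python
import numpy as np
import pandas as pd

ages = pd.Series([0, 5, 20, 21, 38, 39, np.nan])

# Bin edges mirror age_to_int: 0 -> 0, (0, 20] -> 1, (20, 38] -> 2, > 38 -> 3.
# Intervals are closed on the right, which is the pd.cut default.
bins = [-np.inf, 0, 20, 38, np.inf]
age_cat = pd.cut(ages, bins=bins, labels=[0, 1, 2, 3]).astype(float)

# Missing ages stay NaN, as in the function-based version.
print(age_cat.tolist())
```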

In [12]:
train['Sex']= train['Sex'].astype('category')
train['Sex_cat']=train['Sex'].cat.codes
train['Embarked']= train['Embarked'].astype('category')
train['Embarked_cat']=train['Embarked'].cat.codes
train['Age_cat']=train['Age'].apply(age_to_int)

train.head()
train.tail()

Unnamed: 0,PassengerId,Survived,Pclass,Sex,Age,SibSp,Parch,Embarked,Sex_cat,Embarked_cat,Age_cat
885,886,0,3,female,39.0,0,5,Q,0,1,3
886,887,0,2,male,27.0,0,0,S,1,2,2
887,888,1,1,female,19.0,0,0,S,0,2,1
889,890,1,1,male,26.0,0,0,C,1,0,2
890,891,0,3,male,32.0,0,0,Q,1,1,2


In [13]:
test.pop('Name')
test.pop('Fare')
test.pop('Ticket')
test.pop('Cabin')

test['Sex']= test['Sex'].astype('category')
test['Sex_cat']=test['Sex'].cat.codes
test['Embarked']= test['Embarked'].astype('category')
test['Embarked_cat']=test['Embarked'].cat.codes
test['Age_cat']=test['Age'].apply(age_to_int)
test = test.drop(['Sex','Age','Embarked'],axis=1)
test.head()

Unnamed: 0,PassengerId,Pclass,SibSp,Parch,Sex_cat,Embarked_cat,Age_cat
0,892,3,0,0,1,1,2.0
1,893,3,1,0,0,2,3.0
2,894,2,0,0,1,1,3.0
3,895,3,0,0,1,2,2.0
4,896,3,1,1,0,2,2.0
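
Note that train and test are encoded separately here. `cat.codes` numbers the categories found in each frame, so the codes only line up if both frames contain the same set of values. That appears to be the case in this kernel, but it is not guaranteed in general. A sketch of a safer pattern with a fixed category list (the toy values and variable names are assumptions):

```python
import pandas as pd

train_ports = pd.Series(["S", "C", "Q", "S"])
test_ports = pd.Series(["S", "Q", "S"])  # "C" never appears in this toy test set

# Independent encoding: codes depend on which values each frame contains.
independent = test_ports.astype("category").cat.codes.tolist()

# Shared encoding: fix the category order once and reuse it for both frames.
port_categories = ["C", "Q", "S"]
shared_train = pd.Categorical(train_ports, categories=port_categories).codes.tolist()
shared_test = pd.Categorical(test_ports, categories=port_categories).codes.tolist()

print(independent)   # "S" is coded differently here than in shared_train
print(shared_train)
print(shared_test)
```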


In [14]:
print(pd.isnull(test).sum())

PassengerId      0
Pclass           0
SibSp            0
Parch            0
Sex_cat          0
Embarked_cat     0
Age_cat         86
dtype: int64


## Training model

In [15]:
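# Assign the result back; calling fillna(inplace=True) on a selected column
# is deprecated in recent pandas and may not modify the frame.
test["Age_cat"] = test["Age_cat"].fillna(test["Age_cat"].mean())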
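
Filling `Age_cat` with its mean produces a fractional bucket value for the 86 passengers without an age. One alternative, sketched here on toy data, is to impute the raw age first (using the training median) and bucket afterwards, so every passenger lands in a whole-number category. The variable names are assumptions and this is not what the kernel does:

```python
import numpy as np
import pandas as pd

train_age = pd.Series([22.0, 38.0, 26.0, np.nan, 35.0])
test_age = pd.Series([34.5, np.nan, 62.0])

# Use a statistic from the training data only, so no test information leaks in.
age_median = train_age.median()
test_age_filled = test_age.fillna(age_median)

# Same buckets as age_to_int: 0 -> 0, (0, 20] -> 1, (20, 38] -> 2, > 38 -> 3.
bins = [-np.inf, 0, 20, 38, np.inf]
test_age_cat = pd.cut(test_age_filled, bins=bins, labels=[0, 1, 2, 3]).astype(int)

print(age_median)
print(test_age_cat.tolist())
```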

### Split data

In [16]:
X=train.drop(['Survived','Sex','Age','Embarked'],axis=1)
y=train['Survived']
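
This cell separates features from the label but does not hold out any rows for validation. A minimal sketch of a hold-out split with `train_test_split`, shown on synthetic stand-ins for `X` and `y`; the 80/20 ratio is an assumption:

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
X_toy = rng.randint(0, 4, size=(100, 3))  # synthetic stand-in for X
y_toy = rng.randint(0, 2, size=100)       # synthetic stand-in for y

# stratify keeps the survived/died ratio similar in both parts.
X_train, X_valid, y_train, y_valid = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0, stratify=y_toy
)

print(X_train.shape, X_valid.shape)
```

Scoring the model on `X_valid` instead of the training rows gives a rough estimate of how it might do on the Kaggle test set.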

In [24]:
X.head()

Unnamed: 0,PassengerId,Pclass,SibSp,Parch,Sex_cat,Embarked_cat,Age_cat
0,1,3,1,0,1,2,2
1,2,1,1,0,0,0,2
2,3,3,0,0,0,2,2
3,4,1,1,0,0,2,2
4,5,3,0,0,1,2,2


In [17]:
tree= DecisionTreeClassifier(max_depth = 10, random_state = 0,max_leaf_nodes=None)


In [18]:
clf=tree.fit(X,y)

In [19]:
from IPython.display import Image
from sklearn.tree import export_graphviz
from io import StringIO  # sklearn.externals.six was removed from scikit-learn
import pydotplus

Visualize tree

In [20]:
# Collect the feature names to label the tree nodes
feature_names = list(X.columns)
print(feature_names)

['PassengerId', 'Pclass', 'SibSp', 'Parch', 'Sex_cat', 'Embarked_cat', 'Age_cat']


open image
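
The cell that actually rendered the tree is not included above. Based on the imports, it probably went through `export_graphviz` and `pydotplus`. A hedged sketch of that route on a toy tree; the data, feature names and class names are made up for illustration:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_graphviz

rng = np.random.RandomState(0)
X_toy = rng.randint(0, 4, size=(50, 2))
y_toy = (X_toy[:, 0] > 1).astype(int)
toy_feature_names = ["feature_a", "feature_b"]  # hypothetical names

toy_tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_toy, y_toy)

# out_file=None makes export_graphviz return the DOT source as a string.
dot_data = export_graphviz(
    toy_tree,
    out_file=None,
    feature_names=toy_feature_names,
    class_names=["died", "survived"],
    filled=True,
)
print(dot_data[:80])

# With pydotplus and the Graphviz binaries installed, the DOT source
# could then be rendered inside the notebook like this:
# import pydotplus
# from IPython.display import Image
# Image(pydotplus.graph_from_dot_data(dot_data).create_png())
```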


In [21]:
clf.score(X, y)

0.9648876404494382
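
Keep in mind that `clf.score(X, y)` is measured on the same rows the tree was trained on, so it says little about unseen passengers. A sketch comparing training accuracy with 5-fold cross-validated accuracy on synthetic noisy data (the data and the fold count are assumptions):

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
X_toy = rng.rand(200, 5)
# The label depends on one feature plus noise, so a deep tree can memorise noise.
y_toy = ((X_toy[:, 0] + 0.3 * rng.randn(200)) > 0.5).astype(int)

deep_tree = DecisionTreeClassifier(max_depth=10, random_state=0)

# Accuracy on the rows used for fitting.
train_accuracy = deep_tree.fit(X_toy, y_toy).score(X_toy, y_toy)

# Mean accuracy over 5 folds, each scored on rows not used for fitting.
cv_accuracy = cross_val_score(deep_tree, X_toy, y_toy, cv=5).mean()

print(train_accuracy, cv_accuracy)
```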

In [22]:
predictions = clf.predict(test)

Write the predictions to a submission file

In [23]:
passengerid=test['PassengerId']
output = pd.DataFrame({ 'PassengerId' : passengerid, 'Survived': predictions })
output.to_csv('./output/submission.csv', index=False)

### Result for Decision tree

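The submission received a score of 76.0, well below the training accuracy of about 96.5%. A training error this small next to a much lower test score indicates that the tree is overfitting. Keeping PassengerId as a feature and allowing a depth of 10 likely contribute to this.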
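
One possible next step, sketched on synthetic data: leave the identifier column out of the features and let cross-validation choose `max_depth`. The depth grid, the toy columns and the label rule are all assumptions, and I have not run this on the Titanic data:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
toy = pd.DataFrame({
    "PassengerId": np.arange(1, 201),
    "Pclass": rng.randint(1, 4, size=200),
    "Sex_cat": rng.randint(0, 2, size=200),
})
# Synthetic label: mostly the opposite of Sex_cat, with 20% of rows flipped.
flip = rng.rand(200) < 0.2
y_toy = np.where(flip, toy["Sex_cat"], 1 - toy["Sex_cat"])

# An identifier carries no real signal, so it is excluded from the features.
X_toy = toy.drop(columns=["PassengerId"])

search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": [2, 3, 5, 10]},  # hypothetical grid
    cv=5,
    scoring="accuracy",
)
search.fit(X_toy, y_toy)

print(search.best_params_, search.best_score_)
```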