# Titanic Survival Machine Learning

ML with feature processing and model selecting 

Data can be downloaded from [this link](https://www.kaggle.com/c/titanic/data)

In [1]:
import pandas as pd
import numpy as np

import sklearn.preprocessing as prep
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC, LinearSVC
from sklearn.tree import DecisionTreeClassifier

In [2]:
df_train = pd.read_csv('train.csv')
df_train.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

## Feature Processing

Based on my previous EDA report, there is no duplicate value in the training data set.

For the missing values(column 'Age', 'Cabin', 'Embarked') in training data set:

1. 'Cabin' column will be dropped as most of its values are Nan;
2. 'S' will be used as the assumption values of 'Embarked' column considering the avaliable data's distribution
3. Median value will be used as a good approach for age column's missing values ( according to relate information : pclass and sex)
4. Adding new feature['Title'] based on the information from Name column

In [3]:
df_train = df_train.drop(['PassengerId', 'Ticket', 'Cabin'], axis = 1)
df_train['Embarked'].fillna('S', inplace = True)

# extract new feature from Name column
df_train['Title'] = df_train['Name'].str.extract(' ([A-Za-z]+)\.', expand=False)

# checking how many titles collected
pd.crosstab(df_train['Sex'], df_train['Title'])

Title,Capt,Col,Countess,Don,Dr,Jonkheer,Lady,Major,Master,Miss,Mlle,Mme,Mr,Mrs,Ms,Rev,Sir
Sex,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1
female,0,0,1,0,1,0,1,0,0,182,2,1,0,125,1,0,0
male,1,2,0,1,6,1,0,2,40,0,0,0,517,0,0,6,1


In [4]:
# Replace the title with more common ones
df_train['Title'] = df_train['Title'].replace(['Lady', 'Countess','Capt', 'Col','Don', 'Dr',\
                                               'Major', 'Rev', 'Sir', 'Jonkheer', 'Dona'], 'Rare')

df_train['Title'] = df_train['Title'].replace(['Mlle', 'Ms', 'Mme'],['Miss', 'Miss', 'Mrs'])

df_train[['Title', 'Survived']].groupby('Title', as_index = False).mean()

Unnamed: 0,Title,Survived
0,Master,0.575
1,Miss,0.702703
2,Mr,0.156673
3,Mrs,0.793651
4,Rare,0.347826


In [5]:
guess_median = df_train[['Sex', 'Pclass', 'Age']].groupby(['Sex','Pclass']).median()
guess_values = guess_median['Age'].values.tolist()

#check the guess values
guess_median

Unnamed: 0_level_0,Unnamed: 1_level_0,Age
Sex,Pclass,Unnamed: 2_level_1
female,1,35.0
female,2,28.0
female,3,21.5
male,1,40.0
male,2,30.0
male,3,25.0


In [6]:
#encoding the sex information
df_train['Sex'] = df_train['Sex'].map({'female':1, 'male':0})

#replace the missing age values with guess values
for sex in range(0,2) :
    for pclass in range(1,4):
        df_train.loc[(df_train['Sex'] == sex) & (df_train['Pclass'] == pclass)\
                     & (df_train['Age'].isnull()), 'Age'] = guess_values[pclass*(sex+1)-1]
        
# encoding categorical features
df_train = pd.get_dummies(df_train, columns = ['Pclass', 'Embarked', 'Title'])

# scaling numerical features
df_train['Age'] = prep.scale(df_train['Age'])
df_train['Fare'] = prep.scale(df_train['Fare'])
df_train = df_train.drop('Name', axis = 1)

# obtain the trainng data set
X_train = df_train.drop('Survived', axis = 1)
y_train = df_train['Survived']

# Training the data
#logreg = LogisticRegression()
#logreg.fit(X_train, y_train)

In [7]:
### featuring the test data set in the same way
df_test = pd.read_csv('test.csv')
df_test.isnull().sum()

PassengerId      0
Pclass           0
Name             0
Sex              0
Age             86
SibSp            0
Parch            0
Ticket           0
Fare             1
Cabin          327
Embarked         0
dtype: int64

In [8]:
df_test = df_test.drop(['PassengerId', 'Ticket', 'Cabin'], axis = 1)

# get the title information
df_test['Title'] = df_test['Name'].str.extract(' ([A-Za-z]+)\.', expand=False)
pd.crosstab(df_test['Sex'], df_test['Title'])

Title,Col,Dona,Dr,Master,Miss,Mr,Mrs,Ms,Rev
Sex,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
female,0,1,0,0,78,0,72,1,0
male,2,0,1,21,0,240,0,0,2


In [9]:
df_test['Title'] = df_test['Title'].replace(['Col','Dona','Dr','Rev'], 'Rare')
df_test['Title'] = df_test['Title'].replace('Ms', 'Miss')

#guess_median_t = df_test[['Sex', 'Pclass', 'Age']].groupby(['Sex','Pclass']).median()
#guess_values_t = guess_median_t['Age'].values.tolist()

#encoding the sex information
df_test['Sex'] = df_test['Sex'].map({'female':1, 'male':0})

#replace the missing age values with guess values

for sex in range(0,2) :
    for pclass in range(1,4):
        df_test.loc[(df_test['Sex'] == sex) & (df_test['Pclass'] == pclass)\
                     & (df_test['Age'].isnull()), 'Age'] = guess_values[pclass*(sex+1)-1]  # training data guessing values

# Replacing the missing value for test data
fare_md = df_test['Fare'].describe()[5]
df_test.loc[df_test['Fare'].isnull(),'Fare'] = fare_md

# Encoding the test features
df_test = pd.get_dummies(df_test, columns = ['Pclass', 'Embarked', 'Title'])
df_test['Age'] = prep.scale(df_test['Age'])
df_test['Fare'] = prep.scale(df_test['Fare'])
df_test = df_test.drop('Name', axis = 1)

# Predict the test data
#y_test = logreg.predict(df_test)

### ML Model Selecting 

In [10]:
# Making the train data(80%) and cross validation data (20%) based on the whole train data
# considering the whole data is list in random information, I will directly take the first 80% data 

train_index = int(round(0.8 * X_train.shape[0]))

X_train_08 = X_train.iloc[0:train_index, :]
y_train_08 = y_train[0:train_index]

X_cv = X_train.iloc[train_index::,:]
y_cv = y_train[train_index::]

In [11]:
# Training the data with Logreg
logreg = LogisticRegression()
logreg.fit(X_train_08, y_train_08)
y_logreg_p = logreg.predict(X_cv)
score_logreg = accuracy_score(y_cv, y_logreg_p)
print('The accuracy of Logreg is: {}\n'.format(score_logreg))

# Training the data with SVC
svc = SVC(C=10)
svc.fit(X_train_08, y_train_08)
y_svc_p = svc.predict(X_cv)
score_svc = accuracy_score(y_cv, y_svc_p)
print('The accuracy of SVC is: {}\n'.format(score_svc))

# Training the data with linear_SVC
linear_svc = LinearSVC()
linear_svc.fit(X_train_08, y_train_08)
y_lsvc_p = linear_svc.predict(X_cv)
score_lsvc = accuracy_score(y_cv, y_lsvc_p)
print('The accuracy of Linear_svc is: {}\n'.format(score_lsvc))

# Training the data with DecisionTree
dtree = DecisionTreeClassifier()
dtree.fit(X_train_08, y_train_08)
y_dt_p = dtree.predict(X_cv)
score_dtree = accuracy_score(y_cv, y_dt_p)
print('The accuracy of Decision Tree is: {}\n'.format(score_dtree))

The accuracy of Logreg is: 0.88202247191

The accuracy of SVC is: 0.898876404494

The accuracy of Linear_svc is: 0.88202247191

The accuracy of Decision Tree is: 0.747191011236



In [12]:
# choosing the SVC model and train with whole data
svc = SVC(C=10)
svc.fit(X_train, y_train)
test_predict = svc.predict(df_test)

# output the resultes as csv file for submition online
sub_test = pd.read_csv('test.csv')
sub_test['Survived'] = test_predict
ans = sub_test.loc[:,['PassengerId', 'Survived']]

# please remove the '#' if you need to output the file in your computer
#ans.to_csv('titanic_submission.csv', index = False)

# linear_svc model had my best score in this Kaggle competition

## Reference 

1. [Titanic Data Science Solutions](https://www.kaggle.com/startupsci/titanic-data-science-solutions)
2. [achine Learning from Start to Finish with Scikit-Learn](https://www.kaggle.com/jeffd23/scikit-learn-ml-from-start-to-finish)