### This is one of the most famous Kaggle projects for beginners: Titanic
#### 2019/04/25
#### The main aim of this notebook is to improve the model trained for this project. We will try different approaches:
    * using SVM
    * using GradientBoosting
    * feature engineering
#### Many thanks to this blog which gave me a lot of intuitions: http://www.mashangxue123.com/TensorFlow/2743066386.html
##### Also, we will not focus too much on tuning the parameters
##### The original project can be found on Kaggle: https://www.kaggle.com/startupsci/titanic-data-science-solutions

#### The prediction scored 76% on the test set

### First, let's load the dataset and pre-process it

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import os

from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import imputation
from sklearn.model_selection import cross_val_score

from sklearn.svm import SVC
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

from sklearn.metrics import accuracy_score




In [2]:
train_path = os.path.join('titanic','train.csv')
test_path = os.path.join('titanic','test.csv')

train_data = pd.read_csv(train_path)
test_data = pd.read_csv(test_path)

submission_sample_path = os.path.join('titanic','gender_submission.csv')
sample_submission = pd.read_csv(submission_sample_path)

train_y = train_data['Survived']
train_data.drop(axis=1,columns='Survived',inplace=True)
# DataFrame.append was removed in pandas 2.0; pd.concat stacks train and test the same way
train_data = pd.concat([train_data,test_data])
train_data.reset_index(drop=True,inplace=True)

#### Let us get a feel for the dataset

In [3]:
train_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 11 columns):
PassengerId    1309 non-null int64
Pclass         1309 non-null int64
Name           1309 non-null object
Sex            1309 non-null object
Age            1046 non-null float64
SibSp          1309 non-null int64
Parch          1309 non-null int64
Ticket         1309 non-null object
Fare           1308 non-null float64
Cabin          295 non-null object
Embarked       1307 non-null object
dtypes: float64(2), int64(4), object(5)
memory usage: 112.6+ KB


In [4]:
train_data.head()

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


## Pre-processing: 
    * Check: Is there any missing data in the training dataset?

In [5]:
missDataPosition = np.where(train_data == np.nan)
missDataPosition

(array([], dtype=int64), array([], dtype=int64))

No missing data? That cannot be right: the table above shows NaN entries in some columns. Perhaps they are stored as the string 'NaN'. Let's look for them:

In [6]:
missDataPosition = np.where(train_data == 'NaN')
missDataPosition

(array([], dtype=int64), array([], dtype=int64))

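That looks weird. The truth is that NaN never compares equal to anything, not even to itself, so `== np.nan` is False everywhere, and the values are real NaN rather than strings.
We have to use `isna()` to build the mask instead of `np.where(... == np.nan)`; this is important.
So let us first check which columns contain missing data using `isna().any()`: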

In [7]:
train_data.isna().any()

PassengerId    False
Pclass         False
Name           False
Sex            False
Age             True
SibSp          False
Parch          False
Ticket         False
Fare            True
Cabin           True
Embarked        True
dtype: bool
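
Why the first two attempts found nothing can be reproduced on a tiny frame. This is a minimal sketch; the frame `toy` and its values are made up for illustration and are not taken from the Titanic data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing value in each column
toy = pd.DataFrame({"Age": [22.0, np.nan, 26.0], "Cabin": ["C85", None, "C123"]})

# Equality against NaN is False everywhere, so no position is ever reported
eq_hits = int((toy == np.nan).to_numpy().sum())

# isna() is the reliable way to locate missing values
na_per_column = toy.isna().sum()

print("matches found with == np.nan:", eq_hits)
print(na_per_column)
```

The same `isna().sum()` call on `train_data` would give the per-column counts printed in the next cell.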

In [8]:
print('total instance:',train_data.shape[0])
print('Age contains nan:',np.where(train_data['Age'].isna())[0].shape[0])
print('Cabin contains nan:',np.where(train_data['Cabin'].isna())[0].shape[0])
print('Embarked contains nan:',np.where(train_data['Embarked'].isna())[0].shape[0])

total instance: 1309
Age contains nan: 263
Cabin contains nan: 1014
Embarked contains nan: 2


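#### We can see that 'Age', 'Cabin', 'Embarked' and 'Fare' contain missing data. Let's handle them:
    * For 'Age': use the mean/median
    * For 'Cabin': it has a very high missing rate (1014 of 1309), so we ignore this column
    * For 'Embarked': let us just drop these 2 instances
    * For 'Fare': the single missing value is filled with the mean later, after the train/test split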
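
The rules in the list above can be sketched on a small made-up frame. This version is only an illustration: it uses the median for 'Age' (the cells below use the mean) and `dropna(subset=...)` so that row labels do not have to be hard-coded. The frame `toy` and its values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with the same three problem columns
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Embarked": ["S", "C", np.nan, "S"],
})

cleaned = toy.copy()
# Age: fill missing values with the median of the observed ages
cleaned["Age"] = cleaned["Age"].fillna(cleaned["Age"].median())
# Cabin: mostly missing, so drop the whole column
cleaned = cleaned.drop(columns="Cabin")
# Embarked: drop the few rows where it is missing
cleaned = cleaned.dropna(subset=["Embarked"])

print(cleaned)
```

When rows are dropped this way in the notebook, the same rows also have to be removed from `train_y` so that features and labels stay aligned.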

### Pre-processing: Missing data

In [9]:
# fill age with mean 
train_data2 = train_data.copy()
meanAge = train_data['Age'].mean()
train_data2.loc[train_data['Age'].isna(),'Age'] = meanAge # fill with mean value

In [10]:
#drop Cabin
train_data2.drop(axis=1,columns='Cabin',inplace=True)

In [11]:
train_data2.shape

(1309, 10)

In [12]:
#drop instances where Embarked is nan
print('Index of two instances which contains nan',np.where(train_data['Embarked'].isna())[0])
train_data2.drop(axis=0,index=[61,829],inplace=True)

Index of two instances which contains nan [ 61 829]


### Pre-processing: Drop, OneHotEncoding, Standardization
    * ID: this is unlikely to be helpful, so let us drop this feature

In [13]:
train_x = train_data2.drop(axis=1,columns='PassengerId')
#drop the two instance when processing train_y
train_y.drop(axis=0,index=[61,829],inplace=True)

In [14]:
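# gender_submission.csv holds placeholder predictions, not true labels; they only pad train_y to the length of the combined frame
train_y = np.append(np.array(train_y),np.array(sample_submission['Survived']))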

In [15]:
train_x.head()

Unnamed: 0,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Embarked
0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,S
1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C
2,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,S
3,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,S
4,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,S


0. Drop Ticket, Name
1. One-hot encode Pclass, Sex, Embarked
2. Standardize Age, Fare, SibSp, Parch
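
Steps 1 and 2 can be sketched compactly on a hypothetical frame: one-hot encode the categorical columns, then fit a single `StandardScaler` on the training rows only and apply it to both splits. The frame `toy`, the 3/2 split and the helper name `encode_and_scale` are assumptions for illustration; the notebook itself scales one column at a time.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

NUMERIC = ["Age", "Fare", "SibSp", "Parch"]
CATEGORICAL = ["Pclass", "Sex", "Embarked"]

def encode_and_scale(frame, n_train):
    """One-hot encode CATEGORICAL, then scale NUMERIC using training statistics only."""
    encoded = pd.get_dummies(frame, columns=CATEGORICAL, drop_first=True)
    # Cast to float first so the scaled values can be written back cleanly
    encoded = encoded.astype({col: float for col in NUMERIC})
    train = encoded.iloc[:n_train].copy()
    test = encoded.iloc[n_train:].copy()
    scaler = StandardScaler().fit(train[NUMERIC])
    train[NUMERIC] = scaler.transform(train[NUMERIC])
    test[NUMERIC] = scaler.transform(test[NUMERIC])
    return train, test

# Hypothetical toy data: the first 3 rows play the role of the training set
toy = pd.DataFrame({
    "Pclass": [3, 1, 3, 2, 1],
    "Sex": ["male", "female", "female", "male", "female"],
    "Embarked": ["S", "C", "S", "Q", "S"],
    "Age": [22.0, 38.0, 26.0, 35.0, 54.0],
    "Fare": [7.25, 71.28, 7.93, 13.0, 51.86],
    "SibSp": [1, 1, 0, 0, 0],
    "Parch": [0, 2, 0, 1, 2],
})
toy_train, toy_test = encode_and_scale(toy, n_train=3)
print(toy_train.columns.tolist())
```

Fitting the scaler on the training rows only keeps test-set statistics out of the features the model is trained on.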

In [16]:
train_x.drop(axis=1,columns=['Name','Ticket'],inplace=True)

In [17]:
train_x = pd.get_dummies(train_x,columns=['Pclass','Sex','Embarked'],drop_first=True)
train_x.reset_index(drop=True,inplace=True)
#train_x.drop(axis=1,columns=['Pclass','Sex','Embarked'],inplace = True)

In [18]:
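# .copy() makes the splits independent frames, so the column assignments below do not trigger SettingWithCopyWarning
test_x = train_x.iloc[-418:,:].copy()
test_y = train_y[-418:]

train_x = train_x.iloc[0:-418,:].copy()
train_y = train_y[:-418]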

In [19]:
# process Age
AgeStd = StandardScaler()
AgeStd.fit(np.array(train_x['Age']).reshape(-1,1))
train_x['Age'] = AgeStd.transform(np.array(train_x['Age']).reshape(-1,1))
test_x['Age'] = AgeStd.transform(np.array(test_x['Age']).reshape(-1,1))

In [29]:
#process SibSp
SpStd = StandardScaler()
SpStd.fit(np.array(train_x['SibSp']).reshape(-1,1))
train_x['SibSp'] = SpStd.transform(np.array(train_x['SibSp']).reshape(-1,1))
test_x['SibSp'] = SpStd.transform(np.array(test_x['SibSp']).reshape(-1,1))



In [32]:
#process Parch
ParchStd = StandardScaler()
ParchStd.fit(np.array(train_x['Parch']).reshape(-1,1))
train_x['Parch'] = ParchStd.transform(np.array(train_x['Parch']).reshape(-1,1))
test_x['Parch'] = ParchStd.transform(np.array(test_x['Parch']).reshape(-1,1))



In [41]:
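# Fill the missing Fare value in the test set with the mean
from sklearn.impute import SimpleImputer  # replaces sklearn.preprocessing.imputation.Imputer, which was removed from scikit-learn
test_x.reset_index(drop=True,inplace=True)
imputer = SimpleImputer(missing_values=np.nan,strategy='mean')
test_x['Fare'] = imputer.fit_transform(np.array(test_x['Fare']).reshape(-1,1))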

In [44]:
FareStd = StandardScaler()
FareStd.fit(np.array(train_x['Fare']).reshape(-1,1))
train_x['Fare'] = FareStd.transform(np.array(train_x['Fare']).reshape(-1,1))
test_x['Fare'] = FareStd.transform(np.array(test_x['Fare']).reshape(-1,1))

In [46]:
test_x

Unnamed: 0,Age,SibSp,Parch,Fare,Pclass_2,Pclass_3,Sex_male,Embarked_Q,Embarked_S
0,0.371126,-0.475199,-0.474326,-0.488579,0,1,1,1,0
1,1.335528,0.431350,-0.474326,-0.505273,0,1,0,0,1
2,2.492810,-0.475199,-0.474326,-0.451165,1,0,1,1,0
3,-0.207515,-0.475199,-0.474326,-0.471802,0,1,1,0,1
4,-0.593276,0.431350,0.765897,-0.398819,0,1,0,0,1
5,-1.210493,-0.475199,-0.474326,-0.460477,0,1,1,0,1
6,0.023941,-0.475199,-0.474326,-0.492605,0,1,0,1,0
7,-0.284667,0.431350,0.765897,-0.062346,1,0,1,0,1
8,-0.901884,-0.475199,-0.474326,-0.500659,0,1,0,0,0
9,-0.670428,1.337900,-0.474326,-0.159991,0,1,1,0,1


### We have finished data pre-processing, so let's try some models:
    * Logistic regression
    * RandomForest
    * GradientBoosting
    * SVC
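
The four models can also be compared in a single loop. The sketch below uses synthetic data from `make_classification` as a stand-in for `train_x` and `train_y`, so the printed scores say nothing about the Titanic results. The smaller `n_estimators`, the `max_iter` value and the `random_state` settings are assumptions chosen only to keep the example quick and repeatable.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in for the pre-processed features and labels
X, y = make_classification(n_samples=200, n_features=9, random_state=0)

models = {
    "Logistic regression": LogisticRegression(max_iter=1000),
    "Random forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "Gradient boosting": GradientBoostingClassifier(n_estimators=50, random_state=0),
    "SVC": SVC(kernel="rbf", C=1),
}

# Mean 5-fold cross-validated accuracy for each model
scores = {name: cross_val_score(model, X, y, cv=5).mean() for name, model in models.items()}
for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```

Keeping the models in a dictionary makes it easy to add or remove a candidate without repeating the cross-validation call.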

In [47]:
# Logistic regression
model_lr = LogisticRegression()
acc = cross_val_score(model_lr,train_x,train_y,cv=5)
acc.mean()

0.7941979305529105

In [48]:
#random forest
model_rf = RandomForestClassifier()
acc = cross_val_score(model_rf,train_x,train_y,cv=5)
acc.mean()

0.7986923125753824

In [53]:
#GradientBoosting
model_gb = GradientBoostingClassifier(n_estimators=500)
acc = cross_val_score(model_gb,train_x,train_y,cv=5)
acc.mean()

0.8324319177299563

In [62]:
#SVC
model_svc = SVC(kernel='rbf',C=1)
acc = cross_val_score(model_svc,train_x,train_y,cv=5)
acc.mean()

0.8223005141877738

In [70]:
model_gb.fit(train_x,train_y)
sample_submission['Survived'] = model_gb.predict(test_x)

In [78]:
sample_submission.to_csv('Titanic_sub.csv',index=False)

In [79]:
pd.read_csv('Titanic_sub.csv')

Unnamed: 0,PassengerId,Survived
0,892,0
1,893,0
2,894,0
3,895,0
4,896,0
5,897,0
6,898,0
7,899,0
8,900,1
9,901,0


In [80]:
sample_submission

Unnamed: 0,PassengerId,Survived
0,892,0
1,893,0
2,894,0
3,895,0
4,896,0
5,897,0
6,898,0
7,899,0
8,900,1
9,901,0
