# Titanic: Machine Learning from Disaster
# Research goal
This is an introduction to Kaggle prediction competitions. In this challenge, we apply machine learning tools to predict which passengers survived the sinking of the RMS Titanic. Submissions are scored by accuracy: the percentage of passengers in the test set whose outcome is predicted correctly. The goal is to maximize accuracy on the test set.
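
To make the evaluation metric concrete, here is a minimal sketch that computes accuracy by hand. The arrays `y_true` and `y_pred` are made-up values for illustration only, not competition data.

```python
import numpy as np

# Hypothetical labels for five passengers (1 = survived, 0 = did not survive)
y_true = np.array([0, 1, 1, 0, 1])
y_pred = np.array([0, 1, 0, 0, 1])

# Accuracy is the fraction of predictions that match the true labels
accuracy = (y_true == y_pred).mean()
print(accuracy)
```

Four of the five hypothetical predictions match, so the accuracy is 4/5.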

In [1449]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC, SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV

# Retrieving data

In [1450]:
train = pd.read_csv('train_titanic.csv')
test = pd.read_csv('test_titanic.csv')

# Data Preparation
This step covers cleansing, integrating, and transforming the data, along with the preliminary analysis needed to perform these tasks.

In [1451]:
train.info()
print('-'*40)
test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB
----------------------------------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
PassengerId    418 non-null int64
Pclass         418 non-null int64
Name           418 non-null object
Sex            418 non-null object
Age            332 non-null float64
SibSp          418 non-null int64
Parch          418 non-null int64
Ticket         418 non-null

The train and test data share 11 features: 5 are strings and 6 are numerical. The train data contains an additional column, "Survived", which the model will predict for the test data (0 = did not survive, 1 = survived). Some features in both datasets have missing values.

In [1452]:
train.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [1453]:
test.head()

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,,S
2,894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q
3,895,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S


In [1454]:
train.describe()

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,891.0,891.0,891.0,714.0,891.0,891.0,891.0
mean,446.0,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,257.353842,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,1.0,0.0,1.0,0.42,0.0,0.0,0.0
25%,223.5,0.0,2.0,20.125,0.0,0.0,7.9104
50%,446.0,0.0,3.0,28.0,0.0,0.0,14.4542
75%,668.5,1.0,3.0,38.0,1.0,0.0,31.0
max,891.0,1.0,3.0,80.0,8.0,6.0,512.3292


In [1455]:
train.describe(include=['O'])

Unnamed: 0,Name,Sex,Ticket,Cabin,Embarked
count,891,891,891,204,889
unique,891,2,681,147,3
top,"Davies, Mr. Charles Henry",male,347082,C23 C25 C27,S
freq,1,577,7,4,644


In [1456]:
train.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

In [1457]:
test.isnull().sum()

PassengerId      0
Pclass           0
Name             0
Sex              0
Age             86
SibSp            0
Parch            0
Ticket           0
Fare             1
Cabin          327
Embarked         0
dtype: int64

The Cabin feature is mostly missing in both the train and test data, and PassengerId is not relevant to survival, so we will drop both features from both datasets. Our initial thought was to drop the ticket number as well, but some passengers share a ticket number, so we will investigate this further. All names are unique; however, they contain common titles that may give our model some predictive power. A new categorical feature, Title, will be extracted from Name before the Name feature is dropped. SibSp and Parch can be added together to create a new feature, family size, which may be better correlated with survival rate. We will fill missing values for the Age, Embarked, and Fare features.

To work with both datasets at once, we will create a variable "data" that contains both the train and test data. We are not too worried about data leakage, since the test data does not have Survived values.
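
As a small illustration of the two engineered features described above, the sketch below extracts a title and computes a family size for two rows copied from `train.head()`. The frame `toy_names` is a hypothetical stand-in for the real data, and counting the passenger themselves (the `+ 1`) is one reasonable convention for family size.

```python
import pandas as pd

# Two passengers from train.head(), with their SibSp and Parch values
toy_names = pd.DataFrame({
    'Name': ['Braund, Mr. Owen Harris', 'Heikkinen, Miss. Laina'],
    'SibSp': [1, 0],
    'Parch': [0, 0],
})

# The title is the word that follows a space and ends with a period
toy_names['Title'] = toy_names['Name'].str.extract(r' ([A-Za-z]+)\.', expand=False)

# Family size counts siblings/spouses, parents/children, and the passenger
toy_names['FamilySize'] = toy_names['SibSp'] + toy_names['Parch'] + 1

print(toy_names[['Title', 'FamilySize']])
```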

In [1458]:
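# Stack train and test; sort=True orders the columns alphabetically
# (DataFrame.append was removed in pandas 2.0)
data = pd.concat([train, test], sort=True)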
data.head()

Unnamed: 0,Age,Cabin,Embarked,Fare,Name,Parch,PassengerId,Pclass,Sex,SibSp,Survived,Ticket
0,22.0,,S,7.25,"Braund, Mr. Owen Harris",0,1,3,male,1,0.0,A/5 21171
1,38.0,C85,C,71.2833,"Cumings, Mrs. John Bradley (Florence Briggs Th...",0,2,1,female,1,1.0,PC 17599
2,26.0,,S,7.925,"Heikkinen, Miss. Laina",0,3,3,female,0,1.0,STON/O2. 3101282
3,35.0,C123,S,53.1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",0,4,1,female,1,1.0,113803
4,35.0,,S,8.05,"Allen, Mr. William Henry",0,5,3,male,0,0.0,373450


In [1459]:
data.groupby('Pclass').Fare.median()

Pclass
1    60.0000
2    15.0458
3     8.0500
Name: Fare, dtype: float64

In [1460]:
# The one missing Fare belongs to a 3rd-class passenger, so use the 3rd-class median
data['Fare'] = data['Fare'].fillna(8.05)

In [1461]:
dup_ticket = data[data.duplicated(subset='Ticket', keep=False)].sort_values('Ticket')
dup_ticket.head(15)

Unnamed: 0,Age,Cabin,Embarked,Fare,Name,Parch,PassengerId,Pclass,Sex,SibSp,Survived,Ticket
257,30.0,B77,S,86.5,"Cherry, Miss. Gladys",0,258,1,female,0,1.0,110152
504,16.0,B79,S,86.5,"Maioni, Miss. Roberta",0,505,1,female,0,1.0,110152
759,33.0,B77,S,86.5,"Rothes, the Countess. of (Lucy Noel Martha Dye...",0,760,1,female,0,1.0,110152
558,39.0,E67,S,79.65,"Taussig, Mrs. Emil (Tillie Mandelbaum)",1,559,1,female,1,1.0,110413
262,52.0,E67,S,79.65,"Taussig, Mr. Emil",1,263,1,male,1,0.0,110413
585,18.0,E68,S,79.65,"Taussig, Miss. Ruth",2,586,1,female,0,1.0,110413
475,,A14,S,52.0,"Clifford, Mr. George Quincy",0,476,1,male,0,0.0,110465
110,47.0,C110,S,52.0,"Porter, Mr. Walter Chamberlain",0,111,1,male,0,0.0,110465
366,60.0,D37,C,75.25,"Warren, Mrs. Frank Manley (Anna Sophia Atkinson)",0,367,1,female,1,1.0,110813
236,64.0,D37,C,75.25,"Warren, Mr. Frank Manley",0,1128,1,male,1,,110813


In [1462]:
single_ticket = data[~data.duplicated(subset='Ticket', keep=False)].sort_values('Ticket')
single_ticket.head()

Unnamed: 0,Age,Cabin,Embarked,Fare,Name,Parch,PassengerId,Pclass,Sex,SibSp,Survived,Ticket
335,30.0,C106,S,26.0,"Maguire, Mr. John Edward",0,1227,1,male,0,,110469
158,42.0,D22,S,26.55,"Borebank, Mr. John James",0,1050,1,male,0,,110489
430,28.0,C52,S,26.55,"Bjornstrom-Steffansson, Mr. Mauritz Hakan",0,431,1,male,0,1.0,110564
191,,,S,26.0,"Salomon, Mr. Abraham L",0,1083,1,male,0,,111163
170,61.0,B19,S,33.5,"Van der hoef, Mr. Wyckoff",0,171,1,male,0,0.0,111240


In [1463]:
dup_ticket.describe()

Unnamed: 0,Age,Fare,Parch,PassengerId,Pclass,SibSp,Survived
count,513.0,596.0,596.0,596.0,596.0,596.0,410.0
mean,28.581228,58.892919,0.807047,648.498322,2.035235,1.003356,0.517073
std,16.198831,67.789731,1.121831,381.473391,0.873763,1.329683,0.500319
min,0.17,0.0,0.0,2.0,1.0,0.0,0.0
25%,18.0,20.5625,0.0,314.5,1.0,0.0,0.0
50%,28.0,31.33125,0.0,642.5,2.0,1.0,1.0
75%,39.0,73.5,1.0,974.0,3.0,1.0,1.0
max,76.0,512.3292,9.0,1309.0,3.0,8.0,1.0


In [1464]:
single_ticket.describe()

Unnamed: 0,Age,Fare,Parch,PassengerId,Pclass,SibSp,Survived
count,533.0,713.0,713.0,713.0,713.0,713.0,481.0
mean,31.13227,11.863054,0.032258,660.434783,2.511921,0.077139,0.27027
std,12.342727,8.356034,0.225664,375.290706,0.740244,0.353071,0.444562
min,9.0,0.0,0.0,1.0,1.0,0.0,0.0
25%,22.0,7.75,0.0,345.0,2.0,0.0,0.0
50%,28.0,8.05,0.0,664.0,3.0,0.0,0.0
75%,37.0,13.0,0.0,986.0,3.0,0.0,1.0
max,80.0,50.4958,2.0,1308.0,3.0,4.0,1.0


Passengers who share a ticket number also share the same fare, and most of them share a family name, so we assume that they purchased the ticket together and traveled together. The Parch statistics (mean 0.81 on shared tickets versus 0.03 on single tickets) show that nearly all children traveled on a shared ticket. We would expect median fares to be lower for shared tickets, yet this is not the case: the median fare is 8.05 for single tickets compared to 31.33 for shared tickets. We will therefore assume that Fare is the total amount of a group ticket, and engineer a feature named Price, the average price per person, to replace Fare.
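
The per-person price idea can be sketched on toy data as follows. The frame `toy_tickets` and its values are hypothetical; the assumption, as stated above, is that every passenger on a shared ticket carries the full group fare.

```python
import pandas as pd

# Ticket 'A' is shared by three passengers, ticket 'B' belongs to one
toy_tickets = pd.DataFrame({
    'Ticket': ['A', 'A', 'A', 'B'],
    'Fare': [30.0, 30.0, 30.0, 8.0],
})

# Count how many passengers hold each ticket number
toy_tickets['Group_Size'] = toy_tickets.groupby('Ticket')['Ticket'].transform('count')

# Split the group fare evenly across the passengers on the ticket
toy_tickets['Price'] = toy_tickets['Fare'] / toy_tickets['Group_Size']

print(toy_tickets)
```

Under this assumption the three passengers on ticket 'A' each get a Price of 10.0, while the single passenger on ticket 'B' keeps the full fare.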

In [1465]:
data['Group_Size']=data.groupby('Ticket')['Ticket'].transform('count')
data['Price']=data['Fare']/data['Group_Size']
train.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [1466]:
# Raw string avoids escape warnings; expand=False returns a Series
data['Title'] = data.Name.str.extract(r' ([A-Za-z]+)\.', expand=False)
train.head()



Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [1467]:
data.groupby('Title')['Title'].count()

Title
Capt          1
Col           4
Countess      1
Don           1
Dona          1
Dr            8
Jonkheer      1
Lady          1
Major         2
Master       61
Miss        260
Mlle          2
Mme           1
Mr          757
Mrs         197
Ms            2
Rev           8
Sir           1
Name: Title, dtype: int64

In [1468]:
data[['Title', 'Survived']].groupby(['Title'], as_index=False).mean()

Unnamed: 0,Title,Survived
0,Capt,0.0
1,Col,0.5
2,Countess,1.0
3,Don,0.0
4,Dona,
5,Dr,0.428571
6,Jonkheer,0.0
7,Lady,1.0
8,Major,0.5
9,Master,0.575


In [1469]:
data['Title']=data['Title'].replace(['Jonkheer', 'Don', 'Sir', 'Countess', 'Dona', 'Lady'], 'Royalty')
data['Title']=data['Title'].replace('Mlle', 'Miss')
data['Title']=data['Title'].replace(['Mme', 'Ms'], 'Mrs')
data['Title']=data['Title'].replace(['Col', 'Capt', 'Major'], 'Military')

In [1470]:
MedianAgeGrouped = data.groupby(['Sex', 'Pclass', 'Title'])

In [1471]:
MedianAgeGrouped.Age.median()

Sex     Pclass  Title   
female  1       Dr          49.0
                Miss        30.0
                Mrs         45.0
                Royalty     39.0
        2       Miss        20.0
                Mrs         30.0
        3       Miss        18.0
                Mrs         31.0
male    1       Dr          47.0
                Master       6.0
                Military    53.0
                Mr          41.5
                Royalty     40.0
        2       Dr          38.5
                Master       2.0
                Mr          30.0
                Rev         41.5
        3       Master       6.0
                Mr          26.0
Name: Age, dtype: float64

In [1472]:
# Fill missing ages with the median age of the passenger's Sex/Pclass/Title group
data['Age'] = MedianAgeGrouped['Age'].transform(lambda x: x.fillna(x.median()))

In [1473]:
data[pd.isnull(data.Embarked)]

Unnamed: 0,Age,Cabin,Embarked,Fare,Name,Parch,PassengerId,Pclass,Sex,SibSp,Survived,Ticket,Group_Size,Price,Title
61,38.0,B28,,80.0,"Icard, Miss. Amelie",0,62,1,female,0,1.0,113572,2,40.0,Miss
829,62.0,B28,,80.0,"Stone, Mrs. George Nelson (Martha Evelyn)",0,830,1,female,0,1.0,113572,2,40.0,Mrs


In [1474]:
data.Embarked.value_counts()

S    914
C    270
Q    123
Name: Embarked, dtype: int64

In [1475]:
# 'S' is the most common port of embarkation
data['Embarked'] = data['Embarked'].fillna('S')

In [1476]:
# Check the passenger whose missing Fare was filled in earlier
data[data.PassengerId == 1044]

Unnamed: 0,Age,Cabin,Embarked,Fare,Name,Parch,PassengerId,Pclass,Sex,SibSp,Survived,Ticket,Group_Size,Price,Title
152,60.5,,S,8.05,"Storey, Mr. Thomas",0,1044,3,male,0,,3701,1,8.05,Mr


In [1477]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1309 entries, 0 to 417
Data columns (total 15 columns):
Age            1309 non-null float64
Cabin          295 non-null object
Embarked       1309 non-null object
Fare           1309 non-null float64
Name           1309 non-null object
Parch          1309 non-null int64
PassengerId    1309 non-null int64
Pclass         1309 non-null int64
Sex            1309 non-null object
SibSp          1309 non-null int64
Survived       891 non-null float64
Ticket         1309 non-null object
Group_Size     1309 non-null int64
Price          1309 non-null float64
Title          1309 non-null object
dtypes: float64(4), int64(5), object(6)
memory usage: 163.6+ KB


In [1478]:
data['FamilySize'] = data.SibSp + data.Parch + 1

In [1479]:
#data.to_csv('Titanic Exploration Set.csv', index=False)

In [1480]:
Cabin = data.Cabin
PassengerId = test.PassengerId
data.drop(['Fare', 'Name', 'Ticket', 'SibSp', 'Parch', 'PassengerId', 'Cabin'], axis=1, inplace=True)

In [1481]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1309 entries, 0 to 417
Data columns (total 9 columns):
Age           1309 non-null float64
Embarked      1309 non-null object
Pclass        1309 non-null int64
Sex           1309 non-null object
Survived      891 non-null float64
Group_Size    1309 non-null int64
Price         1309 non-null float64
Title         1309 non-null object
FamilySize    1309 non-null int64
dtypes: float64(3), int64(3), object(3)
memory usage: 102.3+ KB


In [1482]:
# Encode Sex as 0 (male) / 1 (female)
data['Sex'] = data['Sex'].map({'male': 0, 'female': 1})

In [1483]:
data = pd.get_dummies(data, columns= ['Pclass', 'Embarked', 'Title'])

# Data Exploration

In [1484]:
#make a heat map 
data.corr()

Unnamed: 0,Age,Sex,Survived,Group_Size,Price,FamilySize,Pclass_1,Pclass_2,Pclass_3,Embarked_C,Embarked_Q,Embarked_S,Title_Dr,Title_Master,Title_Military,Title_Miss,Title_Mr,Title_Mrs,Title_Rev,Title_Royalty
Age,1.0,-0.069857,-0.059304,-0.167985,0.395069,-0.217925,0.436975,0.007273,-0.384054,0.088014,-0.102744,-0.012292,0.085944,-0.390373,0.138834,-0.303155,0.187018,0.235967,0.069894,0.060063
Sex,-0.069857,1.0,0.543351,0.172765,0.122091,0.188583,0.107371,0.028862,-0.116562,0.066564,0.088651,-0.115193,-0.037831,-0.164375,-0.054516,0.672819,-0.870678,0.571176,-0.058302,0.020408
Survived,-0.059304,0.543351,1.0,0.064962,0.288337,0.016639,0.285904,0.093349,-0.322308,0.16824,0.00365,-0.149683,0.008185,0.085221,0.002496,0.332795,-0.549199,0.344935,-0.064988,0.033391
Group_Size,-0.167985,0.172765,0.064962,1.0,0.09428,0.800556,0.097786,-0.045356,-0.047429,0.028193,-0.114046,0.047711,-0.021006,0.305164,-0.015962,0.112016,-0.28686,0.105835,-0.026516,-0.010232
Price,0.395069,0.122091,0.288337,0.09428,1.0,-0.050028,0.809439,-0.127794,-0.595607,0.346157,-0.151948,-0.208849,0.09162,-0.113112,0.096917,0.003938,-0.096999,0.146873,-0.012406,0.060448
FamilySize,-0.217925,0.188583,0.016639,0.800556,-0.050028,1.0,-0.029656,-0.039976,0.05843,-0.036553,-0.08719,0.087771,-0.006632,0.355061,-0.021089,0.08735,-0.326487,0.157233,-0.019016,-0.0236
Pclass_1,0.436975,0.107371,0.285904,0.097786,0.809439,-0.029656,1.0,-0.296526,-0.622172,0.325722,-0.166101,-0.1818,0.091535,-0.084504,0.128109,-0.011733,-0.099725,0.141102,-0.044882,0.118561
Pclass_2,0.007273,0.028862,0.093349,-0.045356,-0.127794,-0.039976,-0.296526,1.0,-0.56318,-0.134675,-0.121973,0.196532,0.00737,-0.016933,-0.037988,-0.02544,-0.038595,0.071103,0.151358,-0.035156
Pclass_3,-0.384054,-0.116562,-0.322308,-0.047429,-0.595607,0.05843,-0.622172,-0.56318,1.0,-0.17143,0.243706,-0.003805,-0.085242,0.086998,-0.079706,0.031007,0.117925,-0.180375,-0.085242,-0.073765
Embarked_C,0.088014,0.066564,0.16824,0.028193,0.346157,-0.036553,0.325722,-0.134675,-0.17143,1.0,-0.164166,-0.778262,0.008476,-0.014172,0.040285,-0.014351,-0.065538,0.098379,-0.039974,0.077213


# Data Modeling

In [1485]:
# Rows with a known Survived value are the training set; the rest are the test set
train = data[~pd.isnull(data.Survived)]
test = data[pd.isnull(data.Survived)]

X_train = train.drop('Survived', axis=1)
Y_train = train.Survived.astype(int)
X_test = test.drop('Survived', axis=1)

In [1486]:
logreg = LogisticRegression().fit(X_train, Y_train)
log_acc = round(logreg.score(X_train, Y_train) * 100, 2)
log_acc

83.39

In [1487]:
logreg_coef_df = pd.DataFrame(X_train.columns)
logreg_coef_df.columns = ['Feature']
logreg_coef_df['Coefficient'] = logreg.coef_[0]
logreg_coef_df.sort_values('Coefficient', ascending=False)

Unnamed: 0,Feature,Coefficient
12,Title_Master,1.810736
1,Sex,1.672599
16,Title_Mrs,1.069628
5,Pclass_1,1.020095
8,Embarked_C,0.328144
6,Pclass_2,0.244155
9,Embarked_Q,0.182994
14,Title_Miss,0.1469
2,Group_Size,0.086722
3,Price,0.02343


In [1488]:
knn = KNeighborsClassifier().fit(X_train, Y_train)
knn_acc = round(knn.score(X_train, Y_train) * 100, 2)
knn_acc

84.06

In [1489]:
lin_svc = LinearSVC().fit(X_train, Y_train)
lin_svc_acc = round(lin_svc.score(X_train, Y_train) * 100, 2)
lin_svc_acc

83.5

In [1490]:
svc = SVC().fit(X_train, Y_train)
svc_acc = round(svc.score(X_train, Y_train) * 100, 2)
svc_acc

86.98

In [1491]:
gaussian = GaussianNB().fit(X_train, Y_train)
gaussian_acc = round(gaussian.score(X_train, Y_train) * 100, 2)
gaussian_acc

74.64

In [1492]:
tree = DecisionTreeClassifier().fit(X_train, Y_train)
tree_acc = round(tree.score(X_train, Y_train) * 100, 2)
tree_acc

98.32

In [1493]:
forest = RandomForestClassifier().fit(X_train, Y_train)
forest_acc = round(forest.score(X_train, Y_train) * 100, 2)
forest_acc

97.08

In [1494]:
gb = GradientBoostingClassifier().fit(X_train, Y_train)
gb_acc = round(gb.score(X_train, Y_train) * 100, 2)
gb_acc

89.45

In [1495]:
models = pd.DataFrame({
    'Model': ['Logistic Regression', 'KNN', 'Linear SVC', 'SVC',
              'Gaussian Naive Bayes', 'Decision Tree', 'Random Forest',
              'Gradient Boosting'],
    'Score': [log_acc, knn_acc, lin_svc_acc, svc_acc, gaussian_acc,
              tree_acc, forest_acc, gb_acc]})
models.sort_values(by='Score', ascending=False)

Unnamed: 0,Model,Score
5,Decision Tree,98.32
6,Random Forest,97.08
7,Gradient Boosting,89.45
3,SVC,86.98
1,KNN,84.06
2,Linear SVC,83.5
0,Logistic Regression,83.39
4,Gaussian Naive Bayes,74.64


Decision Tree scores best, but these are accuracies on the training data, so this is likely due to overfitting. Let's perform 3-fold cross validation.
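
The gap between training accuracy and cross-validated accuracy can be illustrated with a minimal sketch on synthetic data. The dataset from `make_classification` and all `*_demo` names are assumptions for illustration; this is not the Titanic data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic, noisy binary classification problem (20% of labels flipped)
X_demo, y_demo = make_classification(n_samples=300, n_features=10,
                                     flip_y=0.2, random_state=0)

# An unconstrained tree can memorize the training data
tree_demo = DecisionTreeClassifier(random_state=0).fit(X_demo, y_demo)
train_acc = tree_demo.score(X_demo, y_demo)

# Cross validation scores each fold on data the tree has not seen
cv_acc = cross_val_score(DecisionTreeClassifier(random_state=0),
                         X_demo, y_demo, cv=3).mean()

print(train_acc, cv_acc)
```

On data like this, the training accuracy overstates how well the tree generalizes, which is why held-out scores are a better basis for comparing models.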

In [1496]:
Classifier_func = [LogisticRegression(), KNeighborsClassifier(), LinearSVC(), SVC(), 
         GaussianNB(), DecisionTreeClassifier(), RandomForestClassifier(), GradientBoostingClassifier()]
acc_scores = []
for i in Classifier_func:
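    # cv=3 is set explicitly because the default number of folds depends on the scikit-learn version
    acc_scores.append(cross_val_score(i, X_train, Y_train, cv=3))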

In [1497]:
acc_scores

[array([0.81818182, 0.82154882, 0.84175084]),
 array([0.72053872, 0.74747475, 0.77104377]),
 array([0.76430976, 0.82491582, 0.82828283]),
 array([0.71043771, 0.78787879, 0.8013468 ]),
 array([0.71043771, 0.71043771, 0.3973064 ]),
 array([0.76094276, 0.76430976, 0.78451178]),
 array([0.79461279, 0.7979798 , 0.78787879]),
 array([0.8047138 , 0.82828283, 0.83164983])]

In [1498]:
cross_val = pd.DataFrame(acc_scores, columns=['1stFold', '2ndFold', '3rdFold'])

In [1499]:
cross_val['Model']=['Logistic Regression', 'KNN', 'Linear SVC', 'SVC',
              'Gaussian Naive Bayes', 'Decision Tree', 'Random Forest', 
              'Gradient Boosting']

In [1500]:
cross_val = cross_val[['Model','1stFold', '2ndFold', '3rdFold']]

In [1501]:
# Average only the fold columns; the Model column holds strings
cross_val['Mean']=cross_val.loc[:,'1stFold':'3rdFold'].mean(axis=1)

In [1502]:
cross_val['Std Dev']=cross_val.loc[:,'1stFold':'3rdFold'].std(axis=1)

In [1503]:
cross_val.sort_values(by = 'Mean', ascending=False, inplace=True)
cross_val

Unnamed: 0,Model,1stFold,2ndFold,3rdFold,Mean,Std Dev
0,Logistic Regression,0.818182,0.821549,0.841751,0.82716,0.012747
7,Gradient Boosting,0.804714,0.828283,0.83165,0.821549,0.014676
2,Linear SVC,0.76431,0.824916,0.828283,0.805836,0.036002
6,Random Forest,0.794613,0.79798,0.787879,0.79349,0.005143
5,Decision Tree,0.760943,0.76431,0.784512,0.769921,0.012747
3,SVC,0.710438,0.787879,0.801347,0.766554,0.049063
1,KNN,0.720539,0.747475,0.771044,0.746352,0.025271
4,Gaussian Naive Bayes,0.710438,0.710438,0.397306,0.606061,0.180786


Mean accuracy across the 3 folds is highest with Logistic Regression, which also has less variance than the second-best classifier, Gradient Boosting. We will perform a grid search with 3-fold cross validation to tune the parameters of our logistic model. Let's see if we can improve accuracy!
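
As a worked check of the summary columns, the sketch below recomputes the mean and standard deviation for Logistic Regression from its three fold scores. The variable names are illustrative; `ddof=1` is used because pandas' `std` defaults to the sample standard deviation, whereas NumPy's default is the population version.

```python
import numpy as np

# Fold scores for Logistic Regression, copied from the acc_scores output
logreg_folds = np.array([0.81818182, 0.82154882, 0.84175084])

fold_mean = logreg_folds.mean()
# ddof=1 gives the sample standard deviation, matching pandas' default
fold_std = logreg_folds.std(ddof=1)

print(round(fold_mean, 6), round(fold_std, 6))
```

The results agree with the 0.82716 and 0.012747 reported in the table.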

In [1504]:
penalty = ['l1', 'l2']
C = np.logspace(-3,3, num=7)
param_grid = dict(C=C, penalty=penalty)

In [1505]:
# liblinear supports both the l1 and l2 penalties in the grid
grid_search = GridSearchCV(LogisticRegression(solver='liblinear'), param_grid, cv=3)
grid_search.fit(X_train, Y_train)

GridSearchCV(cv=3, error_score='raise',
       estimator=LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'C': array([1.e-03, 1.e-02, 1.e-01, 1.e+00, 1.e+01, 1.e+02, 1.e+03]), 'penalty': ['l1', 'l2']},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

In [1506]:
grid_search_results=grid_search.cv_results_['mean_test_score'].reshape(7,2)

In [1507]:
grid_search_df = pd.DataFrame(grid_search_results, index =C, columns=penalty)

In [1508]:
grid_search_df

Unnamed: 0,l1,l2
0.001,0.616162,0.720539
0.01,0.700337,0.766554
0.1,0.801347,0.819304
1.0,0.830527,0.82716
10.0,0.823793,0.824916
100.0,0.823793,0.822671
1000.0,0.822671,0.823793


In [1509]:
grid_search.best_score_

0.8305274971941639

In [1510]:
grid_search.best_estimator_

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l1', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [1511]:
Y_predict = grid_search.predict(X_test)

In [1512]:
Kaggle_Submission = pd.DataFrame({'PassengerId':PassengerId, 'Survived':Y_predict})

In [1513]:
Kaggle_Submission.to_csv('Titanic Submission.csv', index=False)

My first submission with logistic regression scored 0.7990, which placed me in the top 46% (4978 out of 11006). This score is lower than my mean cross-validation accuracy of 0.827, which suggests that the model is slightly overfitting.

In [1514]:
grid_search.best_estimator_.coef_

array([[-0.03222607,  2.67850653,  0.06840704,  0.02361124, -0.46699501,
         0.82222255,  0.        , -0.99636627,  0.16476508,  0.        ,
        -0.15104446,  0.        ,  2.83686312,  0.        ,  0.        ,
        -0.1690901 ,  0.87286536,  0.        ,  0.        ]])

In [1515]:
logreg_coef_df = pd.DataFrame(X_train.columns)
logreg_coef_df.columns = ['Feature']
logreg_coef_df['Coefficient'] = grid_search.best_estimator_.coef_[0]
logreg_coef_df.sort_values('Coefficient', ascending=False)

Unnamed: 0,Feature,Coefficient
12,Title_Master,2.836863
1,Sex,2.678507
16,Title_Mrs,0.872865
5,Pclass_1,0.822223
8,Embarked_C,0.164765
2,Group_Size,0.068407
3,Price,0.023611
9,Embarked_Q,0.0
17,Title_Rev,0.0
14,Title_Miss,0.0


After inspecting the visuals, I will bin Age and Price to reduce the influence of outliers. Family size will also be binned into alone, small family, or large family, which should help reduce multicollinearity between family size and group size. Group size differs from family size because some passengers traveled with companions who were not family, so group size can still be useful information for our model.
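
Binning of this kind can also be expressed with `pd.cut`. The sketch below is one possible alternative, not the approach used in this notebook; the sample ages are made up, and the bin edges mirror the 16/32/48/64 thresholds used for Age.

```python
import numpy as np
import pandas as pd

# Hypothetical ages covering each side of the bin edges
toy_ages = pd.Series([10, 16, 17, 40, 70])

# Right-closed intervals: (-inf, 16], (16, 32], (32, 48], (48, 64], (64, inf)
age_bins = [-np.inf, 16, 32, 48, 64, np.inf]

# labels=False returns the integer index of the bin each age falls into
age_band = pd.cut(toy_ages, bins=age_bins, labels=False)

print(age_band.tolist())
```

Because the intervals are closed on the right, an age of exactly 16 lands in band 0, the same as the `<= 16` condition.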

In [1516]:
data.loc[data['Age'] <= 16, 'Age'] = 0
data.loc[(data['Age'] > 16) & (data['Age'] <= 32), 'Age'] = 1
data.loc[(data['Age'] > 32) & (data['Age'] <= 48), 'Age'] = 2
data.loc[(data['Age'] > 48) & (data['Age'] <= 64), 'Age'] = 3
data.loc[ data['Age'] > 64, 'Age'] = 4

In [1517]:
data.head()

Unnamed: 0,Age,Sex,Survived,Group_Size,Price,FamilySize,Pclass_1,Pclass_2,Pclass_3,Embarked_C,Embarked_Q,Embarked_S,Title_Dr,Title_Master,Title_Military,Title_Miss,Title_Mr,Title_Mrs,Title_Rev,Title_Royalty
0,1.0,0,0.0,1,7.25,2,0,0,1,0,0,1,0,0,0,0,1,0,0,0
1,2.0,1,1.0,2,35.64165,2,1,0,0,1,0,0,0,0,0,0,0,1,0,0
2,1.0,1,1.0,1,7.925,1,0,0,1,0,0,1,0,0,0,1,0,0,0,0
3,2.0,1,1.0,2,26.55,2,1,0,0,0,0,1,0,0,0,0,0,1,0,0
4,2.0,0,0.0,1,8.05,1,0,0,1,0,0,1,0,0,0,0,1,0,0,0


In [1518]:
data['PriceBand'] = pd.qcut(data['Price'], 4)
print (data[['PriceBand', 'Survived']].groupby(['PriceBand'], as_index=False).mean())

         PriceBand  Survived
0   (-0.001, 7.55]  0.268398
1     (7.55, 8.05]  0.251101
2     (8.05, 15.0]  0.393519
3  (15.0, 128.082]  0.635945


In [1519]:
data.loc[ data['Price'] <= 7.55, 'Price'] = 0
data.loc[(data['Price'] > 7.55) & (data['Price'] <= 8.05), 'Price'] = 1
data.loc[(data['Price'] > 8.05) & (data['Price'] <= 15.0), 'Price']   = 2
data.loc[ data['Price'] > 15.0, 'Price'] = 3
data['Price'] = data['Price'].astype(int)

In [1520]:
family_map = {1: 'Alone', 2: 'Small', 3: 'Small', 4: 'Small', 5: 'Large', 6: 'Large', 7: 'Large', 8: 'Large', 11: 'Large'}
data['FamilySize'] = data['FamilySize'].map(family_map)

In [1521]:
data.head()

Unnamed: 0,Age,Sex,Survived,Group_Size,Price,FamilySize,Pclass_1,Pclass_2,Pclass_3,Embarked_C,...,Embarked_S,Title_Dr,Title_Master,Title_Military,Title_Miss,Title_Mr,Title_Mrs,Title_Rev,Title_Royalty,PriceBand
0,1.0,0,0.0,1,0,Small,0,0,1,0,...,1,0,0,0,0,1,0,0,0,"(-0.001, 7.55]"
1,2.0,1,1.0,2,3,Small,1,0,0,1,...,0,0,0,0,0,0,1,0,0,"(15.0, 128.082]"
2,1.0,1,1.0,1,1,Alone,0,0,1,0,...,1,0,0,0,1,0,0,0,0,"(7.55, 8.05]"
3,2.0,1,1.0,2,3,Small,1,0,0,0,...,1,0,0,0,0,0,1,0,0,"(15.0, 128.082]"
4,2.0,0,0.0,1,1,Alone,0,0,1,0,...,1,0,0,0,0,1,0,0,0,"(7.55, 8.05]"


In [1522]:
data = pd.get_dummies(data, columns= ['FamilySize'])

In [1523]:
data.drop(['PriceBand'], axis=1, inplace=True)

In [1524]:
data.head()

Unnamed: 0,Age,Sex,Survived,Group_Size,Price,Pclass_1,Pclass_2,Pclass_3,Embarked_C,Embarked_Q,...,Title_Master,Title_Military,Title_Miss,Title_Mr,Title_Mrs,Title_Rev,Title_Royalty,FamilySize_Alone,FamilySize_Large,FamilySize_Small
0,1.0,0,0.0,1,0,0,0,1,0,0,...,0,0,0,1,0,0,0,0,0,1
1,2.0,1,1.0,2,3,1,0,0,1,0,...,0,0,0,0,1,0,0,0,0,1
2,1.0,1,1.0,1,1,0,0,1,0,0,...,0,0,1,0,0,0,0,1,0,0
3,2.0,1,1.0,2,3,1,0,0,0,0,...,0,0,0,0,1,0,0,0,0,1
4,2.0,0,0.0,1,1,0,0,1,0,0,...,0,0,0,1,0,0,0,1,0,0


In [1525]:
train = data[~pd.isnull(data.Survived)]
test = data[pd.isnull(data.Survived)]

X_train = train.drop('Survived', axis=1)
Y_train = train.Survived.astype(int)
X_test = test.drop('Survived', axis=1)

logreg = LogisticRegression().fit(X_train, Y_train)
log_acc = round(logreg.score(X_train, Y_train) * 100, 2)
log_acc

82.94

In [1526]:
# liblinear supports both the l1 and l2 penalties in the grid
grid_search = GridSearchCV(LogisticRegression(solver='liblinear'), param_grid, cv=3)
grid_search.fit(X_train, Y_train)

GridSearchCV(cv=3, error_score='raise',
       estimator=LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'C': array([1.e-03, 1.e-02, 1.e-01, 1.e+00, 1.e+01, 1.e+02, 1.e+03]), 'penalty': ['l1', 'l2']},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

In [1527]:
grid_search.best_score_

0.8271604938271605

In [1528]:
grid_search.best_estimator_

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [1529]:
Y_predict = grid_search.predict(X_test)
Kaggle_Submission = pd.DataFrame({'PassengerId':PassengerId, 'Survived':Y_predict})

In [1530]:
#Kaggle_Submission.to_csv('Titanic Submission2.csv', index=False)

This submission scored 0.78468, which moved me to the top 37%.

In [1531]:
data['Deck'] = Cabin.apply(lambda s: s[0] if pd.notnull(s) else 'M')

In [1532]:
# Merge deck 'T' into deck 'A'
data['Deck'] = data['Deck'].replace('T', 'A')

In [1533]:
data = pd.get_dummies(data, columns= ['Deck'])

In [1534]:
train = data[~pd.isnull(data.Survived)]
test = data[pd.isnull(data.Survived)]

X_train = train.drop('Survived', axis=1)
Y_train = train.Survived.astype(int)
X_test = test.drop('Survived', axis=1)

svc = SVC().fit(X_train, Y_train)
svc_acc = round(svc.score(X_train, Y_train) * 100, 2)
svc_acc

83.61

In [1535]:
gamma = np.logspace(-2,2, num=5)
C = np.logspace(-2,2, num=5)
param_grid = dict(C=C, gamma=gamma)
grid_search = GridSearchCV(SVC(), param_grid, cv=3)
grid_search.fit(X_train, Y_train)

GridSearchCV(cv=3, error_score='raise',
       estimator=SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'C': array([1.e-02, 1.e-01, 1.e+00, 1.e+01, 1.e+02]), 'gamma': array([1.e-02, 1.e-01, 1.e+00, 1.e+01, 1.e+02])},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

In [1536]:
grid_search.best_score_

0.8316498316498316

In [1537]:
grid_search.best_estimator_

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma=0.1, kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

In [1538]:
Y_predict = grid_search.predict(X_test)
Kaggle_Submission = pd.DataFrame({'PassengerId':PassengerId, 'Survived':Y_predict})

In [1539]:
#Kaggle_Submission.to_csv('Titanic Submission3.csv', index=False)

This submission put me in the top 20%.