# TITANIC: MACHINE LEARNING FROM DISASTER

In [1]:
#IMPORTING THE BASIC LIBRARIES
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

## PREPROCESSING DATA

In [2]:
train=pd.read_csv("train.csv")
test=pd.read_csv("test.csv")
print("Shape of train data (%d,%d)"%train.shape)
print("Shape of test data (%d,%d)"%test.shape)

Shape of train data (891,12)
Shape of test data (418,11)


In [3]:
train.columns

Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')

The sinking of the RMS Titanic is one of the most infamous shipwrecks in history.  On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.

One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class.

Here we apply machine learning tools to predict which passengers survived and to see which groups were most likely to survive.

ABOUT THE DATA :

pclass: A proxy for socio-economic status (SES)

1st = Upper

2nd = Middle

3rd = Lower

age: Age is fractional if less than 1. If the age is estimated, it is in the form of xx.5

sibsp: The dataset defines family relations in this way...

Sibling = brother, sister, stepbrother, stepsister

Spouse = husband, wife (mistresses and fiancés were ignored)

parch: The dataset defines family relations in this way...

Parent = mother, father

Child = daughter, son, stepdaughter, stepson

Some children travelled only with a nanny, therefore parch=0 for them.

In [4]:
train.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [5]:
train.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

In [6]:
test.isnull().sum()

PassengerId      0
Pclass           0
Name             0
Sex              0
Age             86
SibSp            0
Parch            0
Ticket           0
Fare             1
Cabin          327
Embarked         0
dtype: int64

## FILLING THE MISSING VALUES

In [7]:
# AGE
train.Age=train.Age.fillna(train.Age.median())
test.Age=test.Age.fillna(train.Age.median())

In [8]:
# Embarked
train.Embarked.value_counts()

S    644
C    168
Q     77
Name: Embarked, dtype: int64

As you can see, 'S' has the highest frequency in the data, so we replace the missing Embarked values in the train data with 'S'. The test data has no missing Embarked values.

In [9]:
train.Embarked=train.Embarked.fillna('S')
train.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age              0
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         0
dtype: int64

In [10]:
# FARE in test data 
test.Fare=test.Fare.fillna(train.Fare.median())
test.isnull().sum()

PassengerId      0
Pclass           0
Name             0
Sex              0
Age              0
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          327
Embarked         0
dtype: int64
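The filling steps above can be gathered into one helper. This is a minimal sketch rather than the notebook's own code: the name `fill_from_train` and the toy frames are hypothetical, and it assumes every fill value should be computed on the training data only.

```python
import numpy as np
import pandas as pd

def fill_from_train(train_df, test_df, median_cols=(), mode_cols=()):
    """Fill missing values in both frames using statistics computed on train_df only."""
    train_out, test_out = train_df.copy(), test_df.copy()
    for col in median_cols:
        value = train_df[col].median()          # median skips NaN by default
        train_out[col] = train_out[col].fillna(value)
        test_out[col] = test_out[col].fillna(value)
    for col in mode_cols:
        value = train_df[col].mode().iloc[0]    # most frequent value
        train_out[col] = train_out[col].fillna(value)
        test_out[col] = test_out[col].fillna(value)
    return train_out, test_out

# Hypothetical toy data standing in for train.csv / test.csv
toy_train = pd.DataFrame({"Age": [22.0, np.nan, 30.0], "Embarked": ["S", None, "S"]})
toy_test = pd.DataFrame({"Age": [np.nan, 40.0], "Embarked": ["C", "Q"]})

filled_train, filled_test = fill_from_train(
    toy_train, toy_test, median_cols=["Age"], mode_cols=["Embarked"]
)
print(filled_train)
print(filled_test)
```

Computing the fill values once on the training frame keeps the test set from influencing preprocessing.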

## EXTRACTING THE FEATURES

In [11]:
train.groupby("Embarked").Survived.value_counts()

Embarked  Survived
C         1            93
          0            75
Q         0            47
          1            30
S         0           427
          1           219
Name: Survived, dtype: int64

In [12]:
train.groupby("Sex").Survived.value_counts()

Sex     Survived
female  1           233
        0            81
male    0           468
        1           109
Name: Survived, dtype: int64

The counts show that females survived at a much higher rate than males: 233 of 314 females (about 74%) survived, compared with 109 of 577 males (about 19%).
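The raw counts can be turned into survival rates directly. The sketch below rebuilds a small table from the counts printed above (the frame `counts` is a hypothetical stand-in, not the real `train` frame) and divides survivors by group totals.

```python
import pandas as pd

# Counts copied from the value_counts() output above
counts = pd.DataFrame({
    "Sex": ["female", "female", "male", "male"],
    "Survived": [1, 0, 1, 0],
    "n": [233, 81, 109, 468],
})

totals = counts.groupby("Sex")["n"].sum()
survivors = counts[counts["Survived"] == 1].set_index("Sex")["n"]
survival_rate = (survivors / totals).round(3)
print(survival_rate)
```

On the full frame the same figures should come from `train.groupby("Sex")["Survived"].mean()`, because the mean of a 0/1 column is the share of ones.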

In [13]:
# converting the categorical features to numerical features
train.Sex=train.Sex.replace(['male','female'],[0,1])
train.Embarked=train.Embarked.replace(['C','S','Q'],[0,1,2])

test.Sex=test.Sex.replace(['male','female'],[0,1])
test.Embarked=test.Embarked.replace(['C','S','Q'],[0,1,2])
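An alternative to `replace` is an explicit dictionary with `map`, which makes the coding visible in one place and turns unexpected labels into NaN instead of leaving them untouched. This is a sketch under that assumption; `SEX_CODES`, `EMBARKED_CODES` and `encode_categories` are hypothetical names, and the toy frame stands in for the real data.

```python
import pandas as pd

SEX_CODES = {"male": 0, "female": 1}
EMBARKED_CODES = {"C": 0, "S": 1, "Q": 2}

def encode_categories(df):
    """Return a copy with Sex and Embarked mapped to the integer codes used above."""
    out = df.copy()
    out["Sex"] = out["Sex"].map(SEX_CODES)
    out["Embarked"] = out["Embarked"].map(EMBARKED_CODES)
    # map() turns any label missing from the dictionary into NaN, so check for that
    if out[["Sex", "Embarked"]].isnull().any().any():
        raise ValueError("unexpected category label")
    return out

toy = pd.DataFrame({"Sex": ["male", "female"], "Embarked": ["S", "Q"]})
encoded = encode_categories(toy)
print(encoded)
```

Applying the same function to both frames guarantees that train and test share one coding.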

In [14]:
train.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",0,22.0,1,0,A/5 21171,7.25,,1
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",1,38.0,1,0,PC 17599,71.2833,C85,0
2,3,1,3,"Heikkinen, Miss. Laina",1,26.0,0,0,STON/O2. 3101282,7.925,,1
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",1,35.0,1,0,113803,53.1,C123,1
4,5,0,3,"Allen, Mr. William Henry",0,35.0,0,0,373450,8.05,,1


In [15]:
train.groupby("SibSp").Survived.value_counts()

SibSp  Survived
0      0           398
       1           210
1      1           112
       0            97
2      0            15
       1            13
3      0            12
       1             4
4      0            15
       1             3
5      0             5
8      0             7
Name: Survived, dtype: int64

In [16]:
train.groupby("Parch").Survived.value_counts()

Parch  Survived
0      0           445
       1           233
1      1            65
       0            53
2      0            40
       1            40
3      1             3
       0             2
4      0             4
5      0             4
       1             1
6      0             1
Name: Survived, dtype: int64

Treat parents and children (Parch) together with siblings and spouses (SibSp) as family: adding the two columns gives a Family feature, the number of relatives aboard.

In [17]:
train['Family']=train['SibSp']+train['Parch']
test['Family']=test['SibSp']+test['Parch']
train.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Family
0,1,0,3,"Braund, Mr. Owen Harris",0,22.0,1,0,A/5 21171,7.25,,1,1
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",1,38.0,1,0,PC 17599,71.2833,C85,0,1
2,3,1,3,"Heikkinen, Miss. Laina",1,26.0,0,0,STON/O2. 3101282,7.925,,1,0
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",1,35.0,1,0,113803,53.1,C123,1,1
4,5,0,3,"Allen, Mr. William Henry",0,35.0,0,0,373450,8.05,,1,0


In [18]:
train.groupby('Family').Survived.value_counts()

Family  Survived
0       0           374
        1           163
1       1            89
        0            72
2       1            59
        0            43
3       1            21
        0             8
4       0            12
        1             3
5       0            19
        1             3
6       0             8
        1             4
7       0             6
10      0             7
Name: Survived, dtype: int64

In [19]:
# collapse Family into two groups: 0 = travelling alone, 1 = with at least one relative
train.loc[train['Family']>0,'Family']=1

In [20]:
test.loc[test['Family']>0,'Family']=1
test.head()

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Family
0,892,3,"Kelly, Mr. James",0,34.5,0,0,330911,7.8292,,2,0
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",1,47.0,1,0,363272,7.0,,1,1
2,894,2,"Myles, Mr. Thomas Francis",0,62.0,0,0,240276,9.6875,,2,0
3,895,3,"Wirz, Mr. Albert",0,27.0,0,0,315154,8.6625,,1,0
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",1,22.0,1,1,3101298,12.2875,,1,1


In [21]:
train.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Family
0,1,0,3,"Braund, Mr. Owen Harris",0,22.0,1,0,A/5 21171,7.25,,1,1
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",1,38.0,1,0,PC 17599,71.2833,C85,0,1
2,3,1,3,"Heikkinen, Miss. Laina",1,26.0,0,0,STON/O2. 3101282,7.925,,1,0
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",1,35.0,1,0,113803,53.1,C123,1,1
4,5,0,3,"Allen, Mr. William Henry",0,35.0,0,0,373450,8.05,,1,0


In [22]:
# bin Age into five ordinal bands (0 to 4), each 16 years wide
train.loc[ train['Age'] <= 16, 'Age'] = 0
train.loc[(train['Age'] > 16) & (train['Age'] <= 32), 'Age'] = 1
train.loc[(train['Age'] > 32) & (train['Age'] <= 48), 'Age'] = 2
train.loc[(train['Age'] > 48) & (train['Age'] <= 64), 'Age'] = 3
train.loc[ train['Age'] > 64, 'Age'] = 4  


test.loc[ test['Age'] <= 16, 'Age'] = 0
test.loc[(test['Age'] > 16) & (test['Age'] <= 32), 'Age'] = 1
test.loc[(test['Age'] > 32) & (test['Age'] <= 48), 'Age'] = 2
test.loc[(test['Age'] > 48) & (test['Age'] <= 64), 'Age'] = 3
test.loc[ test['Age'] > 64, 'Age'] = 4

train.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Family
0,1,0,3,"Braund, Mr. Owen Harris",0,1.0,1,0,A/5 21171,7.25,,1,1
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",1,2.0,1,0,PC 17599,71.2833,C85,0,1
2,3,1,3,"Heikkinen, Miss. Laina",1,1.0,0,0,STON/O2. 3101282,7.925,,1,0
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",1,2.0,1,0,113803,53.1,C123,1,1
4,5,0,3,"Allen, Mr. William Henry",0,2.0,0,0,373450,8.05,,1,0
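The same banding can be written with `pd.cut`, which assigns every band in one pass instead of overwriting the column step by step. This is a sketch, not the notebook's code: `AGE_EDGES`, `AGE_LABELS` and `band_age` are hypothetical names, and the sample ages are made up to hit each boundary.

```python
import numpy as np
import pandas as pd

AGE_EDGES = [-np.inf, 16, 32, 48, 64, np.inf]   # right-closed intervals, e.g. (16, 32]
AGE_LABELS = [0, 1, 2, 3, 4]

def band_age(age):
    """Map raw ages to the five ordinal bands used above."""
    return pd.cut(age, bins=AGE_EDGES, labels=AGE_LABELS).astype(int)

ages = pd.Series([0.42, 16.0, 22.0, 32.0, 38.0, 64.0, 80.0])
age_bands = band_age(ages)
print(age_bands.tolist())   # [0, 0, 1, 1, 2, 3, 4]
```

Because `pd.cut` returns a new series, the original ages stay available for later checks.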


In [23]:
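# Title is the text between ", " and "." in Name, e.g. "Braund, Mr. Owen Harris" -> "Mr"
train['Title'] = train['Name'].str.split(", ", expand=True)[1].str.split(".", expand=True)[0]
stat_min = 10  # titles seen fewer than stat_min times are grouped as 'Misc'
title_names = (train['Title'].value_counts() < stat_min)
train['Title'] = train['Title'].apply(lambda x: 'Misc' if title_names.loc[x] else x)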
train.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Family,Title
0,1,0,3,"Braund, Mr. Owen Harris",0,1.0,1,0,A/5 21171,7.25,,1,1,Mr
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",1,2.0,1,0,PC 17599,71.2833,C85,0,1,Mrs
2,3,1,3,"Heikkinen, Miss. Laina",1,1.0,0,0,STON/O2. 3101282,7.925,,1,0,Miss
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",1,2.0,1,0,113803,53.1,C123,1,1,Mrs
4,5,0,3,"Allen, Mr. William Henry",0,2.0,0,0,373450,8.05,,1,0,Mr


In [24]:
train.Title.value_counts()

Mr        517
Miss      182
Mrs       125
Master     40
Misc       27
Name: Title, dtype: int64

In [25]:
# same extraction for the test data; rare titles are decided from the test set's own counts
test['Title'] = test['Name'].str.split(", ", expand=True)[1].str.split(".", expand=True)[0]
stat_min = 10
title_names = (test['Title'].value_counts() < stat_min)
test['Title'] = test['Title'].apply(lambda x: 'Misc' if title_names.loc[x] else x)
test.head()

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Family,Title
0,892,3,"Kelly, Mr. James",0,2.0,0,0,330911,7.8292,,2,0,Mr
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",1,2.0,1,0,363272,7.0,,1,1,Mrs
2,894,2,"Myles, Mr. Thomas Francis",0,3.0,0,0,240276,9.6875,,2,0,Mr
3,895,3,"Wirz, Mr. Albert",0,1.0,0,0,315154,8.6625,,1,0,Mr
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",1,1.0,1,1,3101298,12.2875,,1,1,Mrs


In [26]:
test.Title.value_counts()

Mr        240
Miss       78
Mrs        72
Master     21
Misc        7
Name: Title, dtype: int64
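The title extraction and the grouping of rare titles can also be sketched with a regular expression and one shared list of common titles. This differs from the cells above, which decide rarity separately for train and test; here it is assumed that the training counts should decide for both. `extract_title` and `group_rare_titles` are hypothetical helper names.

```python
import pandas as pd

def extract_title(names):
    """Pull the honorific between the comma and the first period of each name."""
    return names.str.extract(r",\s*([^.]+)\.", expand=False).str.strip()

def group_rare_titles(train_titles, other_titles, min_count=10):
    """Replace titles that are rare in the training data with 'Misc' in both series."""
    counts = train_titles.value_counts()
    common = set(counts[counts >= min_count].index)
    relabel = lambda t: t if t in common else "Misc"
    return train_titles.map(relabel), other_titles.map(relabel)

names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Futrelle, Mrs. Jacques Heath (Lily May Peel)",
])
titles = extract_title(names)
print(titles.tolist())

# Toy series with a low threshold so the effect is visible
train_titles, test_titles = group_rare_titles(
    pd.Series(["Mr", "Mr", "Dr"]), pd.Series(["Mr", "Col"]), min_count=2
)
print(train_titles.tolist(), test_titles.tolist())
```

With a shared list, a title that is common in one file and rare in the other is labelled the same way in both.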

In [27]:
# fare bands like age
train.loc[train['Fare']<=7.91 , 'Fare']=0
train.loc[(train['Fare']>7.91) & (train['Fare']<=14.454) , 'Fare']=1
train.loc[(train['Fare']>14.454) & (train['Fare']<=31) , 'Fare']=2
train.loc[train['Fare']>31 , 'Fare']=3
train['Fare']=train['Fare'].astype(int)

test.loc[test['Fare']<=7.91 , 'Fare']=0
test.loc[(test['Fare']>7.91) & (test['Fare']<=14.454) , 'Fare']=1
test.loc[(test['Fare']>14.454) & (test['Fare']<=31) , 'Fare']=2
test.loc[test['Fare']>31 , 'Fare']=3
test['Fare']=test['Fare'].astype(int)
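The fare bands can be expressed as a single lookup against the three thresholds. A minimal sketch using `np.digitize`, with `FARE_EDGES` and `band_fare` as hypothetical names and made-up sample fares chosen to sit on and around each edge:

```python
import numpy as np

FARE_EDGES = [7.91, 14.454, 31.0]   # upper bounds of bands 0, 1 and 2

def band_fare(fare):
    """Return 0..3 so that edges[i-1] < fare <= edges[i] maps to band i."""
    return np.digitize(fare, FARE_EDGES, right=True)

fares = np.array([7.25, 7.91, 7.925, 14.454, 26.0, 31.0, 71.2833])
fare_bands = band_fare(fares)
print(fare_bands.tolist())   # [0, 0, 1, 1, 2, 2, 3]
```

`right=True` keeps each upper edge inside its band, matching the `<=` comparisons in the cell above.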

In [28]:
train_df_passenger=train["PassengerId"]
test_df_passenger=test["PassengerId"]

# keep the banded Fare aside; it is one-hot encoded separately and appended to X later
x1=np.array(train['Fare'])
x2=np.array(test['Fare'])
del train['PassengerId']
del train['Name']
del train['Cabin']
del train['Ticket']
del train['Fare']

del test['PassengerId']
del test['Name']
del test['Cabin']
del test['Ticket']
del test['Fare']
train.head()

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Embarked,Family,Title
0,0,3,0,1.0,1,0,1,1,Mr
1,1,1,1,2.0,1,0,0,1,Mrs
2,1,3,1,1.0,0,0,1,0,Miss
3,1,1,1,2.0,1,0,1,1,Mrs
4,0,3,0,2.0,0,0,1,0,Mr


In [29]:
test.head()

Unnamed: 0,Pclass,Sex,Age,SibSp,Parch,Embarked,Family,Title
0,3,0,2.0,0,0,2,0,Mr
1,3,1,2.0,1,0,1,1,Mrs
2,2,0,3.0,0,0,2,0,Mr
3,3,0,1.0,0,0,1,0,Mr
4,3,1,1.0,1,1,1,1,Mrs


In [30]:
train["Age"]=train["Age"].astype(int)
test["Age"]=test["Age"].astype(int)
train.head()

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Embarked,Family,Title
0,0,3,0,1,1,0,1,1,Mr
1,1,1,1,2,1,0,0,1,Mrs
2,1,3,1,1,0,0,1,0,Miss
3,1,1,1,2,1,0,1,1,Mrs
4,0,3,0,2,0,0,1,0,Mr


In [31]:
del train['SibSp']
del test['SibSp']

del train['Parch']
del test['Parch']

In [32]:
train.head()

Unnamed: 0,Survived,Pclass,Sex,Age,Embarked,Family,Title
0,0,3,0,1,1,1,Mr
1,1,1,1,2,0,1,Mrs
2,1,3,1,1,1,0,Miss
3,1,1,1,2,1,1,Mrs
4,0,3,0,2,1,0,Mr


In [33]:
test.head()

Unnamed: 0,Pclass,Sex,Age,Embarked,Family,Title
0,3,0,2,2,0,Mr
1,3,1,2,1,1,Mrs
2,2,0,3,2,0,Mr
3,3,0,1,1,0,Mr
4,3,1,1,1,1,Mrs


In [34]:
train.Title.value_counts()

Mr        517
Miss      182
Mrs       125
Master     40
Misc       27
Name: Title, dtype: int64

In [35]:
train.Title=train.Title.replace(['Mr','Miss','Mrs','Master','Misc'],[0,1,2,3,4])
test.Title=test.Title.replace(['Mr','Miss','Mrs','Master','Misc'],[0,1,2,3,4])

train.head()

Unnamed: 0,Survived,Pclass,Sex,Age,Embarked,Family,Title
0,0,3,0,1,1,1,0
1,1,1,1,2,0,1,2
2,1,3,1,1,1,0,1
3,1,1,1,2,1,1,2
4,0,3,0,2,1,0,0


In [36]:
X=train.iloc[:,1:].values
y=train.iloc[:,0].values
X_test=test.iloc[:,:].values

print(X.shape,y.shape,X_test.shape)
print(X)

(891, 6) (891,) (418, 6)
[[3 0 1 1 1 0]
 [1 1 2 0 1 2]
 [3 1 1 1 0 1]
 ..., 
 [3 1 1 1 1 1]
 [1 0 1 0 0 0]
 [3 0 1 2 0 0]]


In [37]:
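from sklearn.preprocessing import OneHotEncoder
# one-hot encode Pclass (column 0); the encoded columns are placed first in the output
# note: `categorical_features` was removed in scikit-learn 0.22, so this needs an older release
one_pclass=OneHotEncoder(categorical_features=[0])
X=one_pclass.fit_transform(X).toarray()
# reuse the encoder fitted on the training data instead of refitting it on the test data
X_test=one_pclass.transform(X_test).toarray()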
print(X.shape,X_test.shape)

(891, 8) (418, 8)


In [38]:
print(X)
print("\n")
print(X_test)

[[ 0.  0.  1. ...,  1.  1.  0.]
 [ 1.  0.  0. ...,  0.  1.  2.]
 [ 0.  0.  1. ...,  1.  0.  1.]
 ..., 
 [ 0.  0.  1. ...,  1.  1.  1.]
 [ 1.  0.  0. ...,  0.  0.  0.]
 [ 0.  0.  1. ...,  2.  0.  0.]]


[[ 0.  0.  1. ...,  2.  0.  0.]
 [ 0.  0.  1. ...,  1.  1.  2.]
 [ 0.  1.  0. ...,  2.  0.  0.]
 ..., 
 [ 0.  0.  1. ...,  1.  0.  0.]
 [ 0.  0.  1. ...,  1.  0.  0.]
 [ 0.  0.  1. ...,  0.  1.  3.]]


In [39]:
# drop one dummy column to avoid the dummy variable trap
X=X[:,1:]
X_test=X_test[:,1:]
print(X_test.shape,X.shape)

(418, 7) (891, 7)


In [40]:
print(X)
print('\n')
print(X_test)

[[ 0.  1.  0. ...,  1.  1.  0.]
 [ 0.  0.  1. ...,  0.  1.  2.]
 [ 0.  1.  1. ...,  1.  0.  1.]
 ..., 
 [ 0.  1.  1. ...,  1.  1.  1.]
 [ 0.  0.  0. ...,  0.  0.  0.]
 [ 0.  1.  0. ...,  2.  0.  0.]]


[[ 0.  1.  0. ...,  2.  0.  0.]
 [ 0.  1.  1. ...,  1.  1.  2.]
 [ 1.  0.  0. ...,  2.  0.  0.]
 ..., 
 [ 0.  1.  0. ...,  1.  0.  0.]
 [ 0.  1.  0. ...,  1.  0.  0.]
 [ 0.  1.  0. ...,  0.  1.  3.]]


In [41]:
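# one-hot encode the Age band (column 3 after the Pclass step)
one_age=OneHotEncoder(categorical_features=[3])
X=one_age.fit_transform(X).toarray()
# reuse the encoder fitted on the training data
X_test=one_age.transform(X_test).toarray()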
print(X.shape,X_test.shape)

(891, 11) (418, 11)


In [42]:
train['Age'].value_counts()

1    523
2    188
0    100
3     69
4     11
Name: Age, dtype: int64

In [43]:
print(X)
print('\n')
print(X_test)

[[ 0.  1.  0. ...,  1.  1.  0.]
 [ 0.  0.  1. ...,  0.  1.  2.]
 [ 0.  1.  0. ...,  1.  0.  1.]
 ..., 
 [ 0.  1.  0. ...,  1.  1.  1.]
 [ 0.  1.  0. ...,  0.  0.  0.]
 [ 0.  1.  0. ...,  2.  0.  0.]]


[[ 0.  0.  1. ...,  2.  0.  0.]
 [ 0.  0.  1. ...,  1.  1.  2.]
 [ 0.  0.  0. ...,  2.  0.  0.]
 ..., 
 [ 0.  0.  1. ...,  1.  0.  0.]
 [ 0.  1.  0. ...,  1.  0.  0.]
 [ 0.  1.  0. ...,  0.  1.  3.]]


In [44]:
# drop one dummy column to avoid the dummy variable trap
X=X[:,1:]
X_test=X_test[:,1:]
print(X.shape,X_test.shape)

(891, 10) (418, 10)


In [45]:
print(X)
print('\n')
print(X_test)

[[ 1.  0.  0. ...,  1.  1.  0.]
 [ 0.  1.  0. ...,  0.  1.  2.]
 [ 1.  0.  0. ...,  1.  0.  1.]
 ..., 
 [ 1.  0.  0. ...,  1.  1.  1.]
 [ 1.  0.  0. ...,  0.  0.  0.]
 [ 1.  0.  0. ...,  2.  0.  0.]]


[[ 0.  1.  0. ...,  2.  0.  0.]
 [ 0.  1.  0. ...,  1.  1.  2.]
 [ 0.  0.  1. ...,  2.  0.  0.]
 ..., 
 [ 0.  1.  0. ...,  1.  0.  0.]
 [ 1.  0.  0. ...,  1.  0.  0.]
 [ 1.  0.  0. ...,  0.  1.  3.]]


In [46]:
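# NOTE: at this point column 8 holds Family (Embarked is column 7), so this cell encodes
# the binary Family flag and Embarked stays as the ordinal code 0/1/2.
# That is why the shape is unchanged once one dummy column is dropped.
one_em=OneHotEncoder(categorical_features=[8])
X=one_em.fit_transform(X).toarray()
# reuse the encoder fitted on the training data
X_test=one_em.transform(X_test).toarray()

# drop one dummy column to avoid the dummy variable trap
X=X[:,1:]
X_test=X_test[:,1:]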
print(X.shape,X_test.shape)

(891, 10) (418, 10)


In [47]:
train.head()

Unnamed: 0,Survived,Pclass,Sex,Age,Embarked,Family,Title
0,0,3,0,1,1,1,0
1,1,1,1,2,0,1,2
2,1,3,1,1,1,0,1
3,1,1,1,2,1,1,2
4,0,3,0,2,1,0,0


In [48]:
print(X)
print('\n')
print(X_test)

[[ 1.  1.  0. ...,  0.  1.  0.]
 [ 1.  0.  1. ...,  1.  0.  2.]
 [ 0.  1.  0. ...,  1.  1.  1.]
 ..., 
 [ 1.  1.  0. ...,  1.  1.  1.]
 [ 0.  1.  0. ...,  0.  0.  0.]
 [ 0.  1.  0. ...,  0.  2.  0.]]


[[ 0.  0.  1. ...,  0.  2.  0.]
 [ 1.  0.  1. ...,  1.  1.  2.]
 [ 0.  0.  0. ...,  0.  2.  0.]
 ..., 
 [ 0.  0.  1. ...,  0.  1.  0.]
 [ 0.  1.  0. ...,  0.  1.  0.]
 [ 1.  1.  0. ...,  0.  0.  3.]]


In [49]:
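# one-hot encode Title (column 9)
one_title=OneHotEncoder(categorical_features=[9])
X=one_title.fit_transform(X).toarray()
# reuse the encoder fitted on the training data
X_test=one_title.transform(X_test).toarray()

# drop one dummy column to avoid the dummy variable trap
X=X[:,1:]
X_test=X_test[:,1:]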
print(X.shape,X_test.shape)

(891, 13) (418, 13)


In [50]:
print(x1.shape,x2.shape)
x1=x1.reshape(x1.size,1)
x2=x2.reshape(x2.size,1)
print(x1.shape,x2.shape)

(891,) (418,)
(891, 1) (418, 1)


In [51]:
# one-hot encode the Fare band held in x1/x2
one_fare=OneHotEncoder(categorical_features=[0])
x1=one_fare.fit_transform(x1).toarray()
# reuse the encoder fitted on the training data
x2=one_fare.transform(x2).toarray()
print(x1.shape,x2.shape)

(891, 4) (418, 4)


In [52]:
# drop one dummy column to avoid the dummy variable trap
x1=x1[:,1:]
x2=x2[:,1:]

X=np.hstack((X,x1))
X_test=np.hstack((X_test,x2))
print(X.shape,X_test.shape)

(891, 16) (418, 16)
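The encode-then-drop-a-column steps above can be sketched more compactly with `pd.get_dummies`. This is an alternative under stated assumptions, not the notebook's method: `one_hot` is a hypothetical helper, the toy frames stand in for the real data, and the test frame is aligned to the training columns so both matrices share one layout.

```python
import pandas as pd

def one_hot(train_df, test_df, columns):
    """One-hot encode `columns`, dropping the first level of each to avoid the dummy trap."""
    train_enc = pd.get_dummies(train_df, columns=columns, drop_first=True, dtype=int)
    # Encode every level in the test frame, then keep exactly the training columns:
    # levels missing from test become all-zero columns, extra ones are discarded
    test_enc = pd.get_dummies(test_df, columns=columns, dtype=int)
    test_enc = test_enc.reindex(columns=train_enc.columns, fill_value=0)
    return train_enc, test_enc

toy_train = pd.DataFrame({"Pclass": [1, 2, 3, 3], "Sex": [0, 1, 1, 0]})
toy_test = pd.DataFrame({"Pclass": [3, 1], "Sex": [1, 0]})

train_enc, test_enc = one_hot(toy_train, toy_test, columns=["Pclass"])
print(train_enc.columns.tolist())
print(test_enc)
```

Named dummy columns such as `Pclass_2` also make it easier to see which feature each column of the matrix holds, which positional indices do not.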


## TRAINING

In [53]:
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier


In [54]:
from sklearn.model_selection import train_test_split
X_train,X_cross,y_train,y_cross=train_test_split(X,y,test_size=0.1,random_state=42)

print(X_train.shape,y_train.shape,X_cross.shape,y_cross.shape)

(801, 16) (801,) (90, 16) (90,)


## Logistic Regression

In [55]:
lr=LogisticRegression()
lr.fit(X_train,y_train)
print(lr.score(X_train,y_train))   # training accuracy
print(lr.score(X_cross,y_cross))   # validation accuracy

0.811485642946
0.833333333333


In [56]:
y_pred_lr=lr.predict(X_test)
print(y_pred_lr.shape)

(418,)


## Decision Tree

In [57]:
dt=DecisionTreeClassifier(random_state=42)
dt.fit(X_train,y_train)
print(dt.score(X_cross,y_cross))   # validation accuracy
print(dt.score(X_train,y_train))   # training accuracy

0.811111111111
0.871410736579


## Random Forest Classifier

In [58]:
from sklearn.ensemble import RandomForestClassifier
rf=RandomForestClassifier(n_estimators=20,random_state=42)
rf.fit(X_train,y_train)
print(rf.score(X_cross,y_cross))   # validation accuracy
print(rf.score(X_train,y_train))   # training accuracy

0.811111111111
0.868913857678
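The validation scores above come from a single hold-out of 90 rows, so they can shift noticeably with the split. k-fold cross-validation is one way to get a steadier estimate. The sketch below assumes scikit-learn is installed and uses synthetic data (`X_demo`, `y_demo` are hypothetical stand-ins for `X` and `y`), so its scores say nothing about the Titanic data.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Hypothetical stand-in for the encoded feature matrix X and labels y
rng = np.random.default_rng(42)
X_demo = rng.integers(0, 2, size=(200, 16)).astype(float)
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 1).astype(int)

rf_demo = RandomForestClassifier(n_estimators=20, random_state=42)
# cv=5 trains five models, each validated on a different fifth of the data
cv_scores = cross_val_score(rf_demo, X_demo, y_demo, cv=5, scoring="accuracy")
print(cv_scores.mean(), cv_scores.std())
```

In the notebook the call would take `X` and `y` in place of the synthetic arrays.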


In [59]:
y_predict_dt=pd.Series(dt.predict(X_test))
y_predict_rf=pd.Series(rf.predict(X_test))

In [73]:
rf_ans=pd.DataFrame({
    "PassengerId":test_df_passenger,
    "Survived": y_predict_rf
})
dt_ans=pd.DataFrame({
    "PassengerId":test_df_passenger,
    "Survived":y_predict_dt
})
dt_ans.head(10)

Unnamed: 0,PassengerId,Survived
0,892,0
1,893,1
2,894,0
3,895,0
4,896,1
5,897,0
6,898,1
7,899,0
8,900,1
9,901,0


In [61]:
rf_ans.to_csv("long_time_no_see_fare.csv",index=False)                 # got accuracy of 79.425 %
dt_ans.to_csv("long_time_no_see_dt_fare.csv",index=False)               
print('DONE')

DONE


Before adding the Fare feature, I got an accuracy of 79.425 % with the random forest classifier; after adding it, I get 79.904 % with the decision tree.

## XGBoost

In [62]:
from xgboost import XGBClassifier 
xgb=XGBClassifier()
xgb.fit(X_train,y_train)

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=1, gamma=0, learning_rate=0.1, max_delta_step=0,
       max_depth=3, min_child_weight=1, missing=None, n_estimators=100,
       n_jobs=1, nthread=None, objective='binary:logistic', random_state=0,
       reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
       silent=True, subsample=1)

In [63]:
y_pred_cross=xgb.predict(X_cross)
from sklearn.metrics import accuracy_score
print(accuracy_score(y_cross,y_pred_cross))

0.811111111111
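For reference, accuracy is the fraction of predictions that match the true labels. A minimal NumPy sketch of that definition follows; the function `accuracy` and the toy labels are illustrative and are not part of scikit-learn.

```python
import numpy as np

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    return float(np.mean(y_true == y_pred))

# Toy labels: 4 of 5 predictions are correct
acc = accuracy([1, 0, 1, 1, 0], [1, 0, 0, 1, 0])
print(acc)

# With 90 hold-out rows, 73 correct predictions give the score printed above
holdout_acc = round(73 / 90, 6)
print(holdout_acc)
```

With only 90 validation rows, one extra correct prediction moves the score by about 1.1 percentage points, so small gaps between models are within noise.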


In [64]:
y_pred_xg=xgb.predict(X_test)
xg_ans=pd.DataFrame({
    "PassengerId":test_df_passenger,
    "Survived":y_pred_xg
})
print(type(y_pred_xg[0]))

<class 'numpy.int64'>


In [65]:
xg_ans.to_csv("XGB_model.csv",index=False)
print("Done")

Done

## Neural Network


In [66]:
from keras.models import Sequential
from keras.layers import Dense,Dropout

Using TensorFlow backend.


In [67]:
model=Sequential()

model.add(Dense(32,input_shape=(16,),activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(32,activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(32,activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(16,activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(16,activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(1,activation='sigmoid'))

model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_1 (Dense)              (None, 32)                544       
_________________________________________________________________
dropout_1 (Dropout)          (None, 32)                0         
_________________________________________________________________
dense_2 (Dense)              (None, 32)                1056      
_________________________________________________________________
dropout_2 (Dropout)          (None, 32)                0         
_________________________________________________________________
dense_3 (Dense)              (None, 32)                1056      
_________________________________________________________________
dropout_3 (Dropout)          (None, 32)                0         
_________________________________________________________________
dense_4 (Dense)              (None, 16)                528       
__________
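The parameter counts in the summary follow from the layer widths: a Dense layer has one weight per input-unit pair plus one bias per unit, and Dropout adds no parameters. The sketch below reproduces the visible counts; `dense_params` is a hypothetical helper, and the values for the last two layers are computed from the model definition because the summary output is cut off.

```python
def dense_params(n_inputs, n_units):
    """Weights plus biases of a fully connected layer."""
    return n_inputs * n_units + n_units

# Input width followed by the units of each Dense layer in the model above
layer_sizes = [16, 32, 32, 32, 16, 16, 1]
param_counts = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
print(param_counts)        # [544, 1056, 1056, 528, 272, 17]
print(sum(param_counts))   # 3473
```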

In [68]:
model.compile(loss='binary_crossentropy',optimizer='adam',metrics=['accuracy'])

In [69]:
model.fit(X_train,y_train,batch_size=20,epochs=50,validation_data=(X_cross,y_cross))

Train on 801 samples, validate on 90 samples
Epoch 1/50
...
Epoch 50/50


<keras.callbacks.History at 0x7f847a19b860>

In [70]:
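# predict_classes() is no longer available in recent Keras/TensorFlow releases;
# thresholding the sigmoid output at 0.5 gives the same class labels
y_pred_nn=(model.predict(X_test)>0.5).astype(int)
print(y_pred_nn.shape)
# flatten (418, 1) to (418,) so it fits in a DataFrame column
y_pred_nn=y_pred_nn.ravel()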

(418, 1)


In [71]:
nn_ans=pd.DataFrame({
    "PassengerId":test_df_passenger,
    "Survived":y_pred_nn
})
nn_ans.to_csv("nn_model.csv",index=False)