# Challenge understanding

## Objective

Predict which passengers survived, using the Kaggle Titanic dataset.

## Competition Description
The sinking of the RMS Titanic is one of the most infamous shipwrecks in history.  On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.

One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class.

In this challenge, we ask you to complete the analysis of what sorts of people were likely to survive. In particular, we ask you to apply the tools of machine learning to predict which passengers survived the tragedy.
<https://www.kaggle.com/c/titanic>


### Initial Idea
1. Load Library Modules
2. Load Datasets
3. Explore datasets
4. Analyse relations between features
5. Analyse missing values
6. Analyse features
7. Prepare for modelling
8. Modelling
9. Prepare the prediction for submission

### 1. Loading Library Modules

In [13]:
import warnings
warnings.filterwarnings('ignore')

# SKLearn Model Algorithms
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression , Perceptron

from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC, LinearSVC

# SKLearn ensemble classifiers
from sklearn.ensemble import RandomForestClassifier , GradientBoostingClassifier
from sklearn.ensemble import ExtraTreesClassifier , BaggingClassifier
from sklearn.ensemble import VotingClassifier , AdaBoostClassifier

# SKLearn Modelling Helpers
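from sklearn.impute import SimpleImputer  # replaces the removed sklearn.preprocessing.Imputer
from sklearn.preprocessing import Normalizer , scale
from sklearn.model_selection import train_test_split , StratifiedKFold  # formerly sklearn.cross_validation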
from sklearn.feature_selection import RFECV

# Handle table-like data and matrices
import numpy as np
import pandas as pd

# Visualisation
import matplotlib as mpl
import matplotlib.pyplot as plt
import matplotlib.pylab as pylab
import seaborn as sns

# plot functions
import pltFunctions as pfunc

# Configure visualisations
%matplotlib inline
mpl.style.use( 'ggplot' )
sns.set_style( 'white' )
pylab.rcParams[ 'figure.figsize' ] = 8 , 6

### 2. Loading Datasets

In [14]:
train = pd.read_csv("./input/train.csv")
test    = pd.read_csv("./input/test.csv")

In [15]:
#combined = pd.concat([train.drop('Survived',1),test])
#combined = train.append( test, ignore_index = True)
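# DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent call.
# sort=True keeps the alphabetical column order shown in the outputs below.
full = pd.concat( [train, test], ignore_index = True, sort = True )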
del train, test
#train = full[ :891 ]
#combined = combined.drop( 'Survived',1)

In [16]:
#print ('Datasets:' , 'combined:' , combined.shape , 'full:' , full.shape , 'train:' , train.shape)
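
Because `train` and `test` are deleted after merging, it helps to know how to recover them later. The following is a minimal sketch on toy data (the values and the names `n_train`, `train_part`, `test_part` are illustrative assumptions, not part of the original notebook); it relies on the test rows having no `Survived` label.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for train.csv / test.csv (hypothetical values)
train = pd.DataFrame({"PassengerId": [1, 2, 3],
                      "Survived": [0, 1, 1],
                      "Age": [22.0, 38.0, 26.0]})
test = pd.DataFrame({"PassengerId": [4, 5],
                     "Age": [35.0, np.nan]})

n_train = len(train)  # remember where the training rows end
full = pd.concat([train, test], ignore_index=True, sort=True)

# Test rows carry no label, so Survived is NaN for them
train_part = full.iloc[:n_train]
test_part = full.iloc[n_train:].drop(columns="Survived")

print(full.shape, train_part.shape, test_part.shape)
```

In the real notebook the training set has 891 rows, which is why the commented-out code slices `full[ :891 ]`.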

### 3. Exploring datasets

In [17]:
full.head(10)

| | Age | Cabin | Embarked | Fare | Name | Parch | PassengerId | Pclass | Sex | SibSp | Survived | Ticket |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 22.0 | NaN | S | 7.25 | Braund, Mr. Owen Harris | 0 | 1 | 3 | male | 1 | 0.0 | A/5 21171 |
| 1 | 38.0 | C85 | C | 71.2833 | Cumings, Mrs. John Bradley (Florence Briggs Th... | 0 | 2 | 1 | female | 1 | 1.0 | PC 17599 |
| 2 | 26.0 | NaN | S | 7.925 | Heikkinen, Miss. Laina | 0 | 3 | 3 | female | 0 | 1.0 | STON/O2. 3101282 |
| 3 | 35.0 | C123 | S | 53.1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | 0 | 4 | 1 | female | 1 | 1.0 | 113803 |
| 4 | 35.0 | NaN | S | 8.05 | Allen, Mr. William Henry | 0 | 5 | 3 | male | 0 | 0.0 | 373450 |
| 5 | NaN | NaN | Q | 8.4583 | Moran, Mr. James | 0 | 6 | 3 | male | 0 | 0.0 | 330877 |
| 6 | 54.0 | E46 | S | 51.8625 | McCarthy, Mr. Timothy J | 0 | 7 | 1 | male | 0 | 0.0 | 17463 |
| 7 | 2.0 | NaN | S | 21.075 | Palsson, Master. Gosta Leonard | 1 | 8 | 3 | male | 3 | 0.0 | 349909 |
| 8 | 27.0 | NaN | S | 11.1333 | Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) | 2 | 9 | 3 | female | 0 | 1.0 | 347742 |
| 9 | 14.0 | NaN | C | 30.0708 | Nasser, Mrs. Nicholas (Adele Achem) | 0 | 10 | 2 | female | 1 | 1.0 | 237736 |


In [18]:
print(full.isnull().sum())

Age             263
Cabin          1014
Embarked          2
Fare              1
Name              0
Parch             0
PassengerId       0
Pclass            0
Sex               0
SibSp             0
Survived        418
Ticket            0
dtype: int64


In [19]:
pd.crosstab(full['Pclass'], full['Sex'])

| Pclass | Sex = female | Sex = male |
|---|---|---|
| 1 | 144 | 179 |
| 2 | 106 | 171 |
| 3 | 216 | 493 |


In [20]:
# Mean age for each (Sex, Pclass) group; compute once and reuse
agedf = full.groupby(['Sex','Pclass'])['Age'].mean()
print( agedf )
type( agedf )

Sex     Pclass
female  1         37.037594
        2         27.499223
        3         22.185329
male    1         41.029272
        2         30.815380
        3         25.962264
Name: Age, dtype: float64


pandas.core.series.Series

In [21]:
#for age in full:
#    if full['Age'].isnull():
#        print (agedf.where(agedf['Sex'] == full['Sex'])&(agedf['Pclass']==full['Pclass']))
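
The commented-out loop above tries to fill each missing age with the mean of the passenger's `Sex`/`Pclass` group. One way this could be done without an explicit loop is `groupby(...).transform('mean')`, sketched below on toy data. The helper name `fill_age_by_group` and the values are assumptions for illustration; the notebook itself goes on to use the overall mean instead.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values) with two missing ages
toy = pd.DataFrame({
    "Sex":    ["female", "female", "male", "male", "male"],
    "Pclass": [1, 1, 3, 3, 3],
    "Age":    [30.0, np.nan, 20.0, 40.0, np.nan],
})

def fill_age_by_group(dframe):
    # transform('mean') returns one value per row: the mean Age of that row's group
    group_mean = dframe.groupby(["Sex", "Pclass"])["Age"].transform("mean")
    out = dframe.copy()  # leave the caller's frame untouched
    out["Age"] = out["Age"].fillna(group_mean)
    return out

filled = fill_age_by_group(toy)
print(filled)
```

Here the missing female first-class age becomes 30.0 and the missing male third-class age becomes 30.0 (the mean of 20 and 40).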

In [22]:
def fillMissingAge(dframe):
    dframe['Age'] = dframe['Age'].fillna( dframe['Age'].mean())
    return dframe

def fillMissingFare(dframe):
    dframe['Fare'] = dframe['Fare'].fillna( dframe['Fare'].mean() )
    return dframe

In [23]:
full = fillMissingAge(full)
full = fillMissingFare(full)
print(full.isnull().sum())

Age               0
Cabin          1014
Embarked          2
Fare              0
Name              0
Parch             0
PassengerId       0
Pclass            0
Sex               0
SibSp             0
Survived        418
Ticket            0
dtype: int64


In [24]:
print(full[full['Embarked'].isnull()])

      Age Cabin Embarked  Fare                                       Name  \
61   38.0   B28      NaN  80.0                        Icard, Miss. Amelie   
829  62.0   B28      NaN  80.0  Stone, Mrs. George Nelson (Martha Evelyn)   

     Parch  PassengerId  Pclass     Sex  SibSp  Survived  Ticket  
61       0           62       1  female      0       1.0  113572  
829      0          830       1  female      0       1.0  113572  


In [25]:
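# 'Sex' still holds the strings 'male'/'female' here (it is converted to
# numbers in a later cell), so filter on 'female' rather than on 1.
pd.crosstab(full['Embarked'], full['Sex'].where(full['Sex'] == 'female'))

In [26]:
full.where((full['Sex']=='female') & (full['Pclass']==1)).groupby(['Embarked','Pclass','Parch','SibSp']).size()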

In [27]:
nt=(115+60+291)
pC=115/nt
pQ=60/nt
pS=291/nt
print('Prob C :', pC, 'Prob Q :', pQ ,'Prob S :' , pS)

nC=(30+2+20)
p0C=30/nC
p0Q=2/nC
p0S=20/nC
print('Prob C :', p0C, 'Prob Q :', p0Q ,'Prob S :' , p0S)

print( 'Sum of probabilities')
print('Prob C :', pC+p0C, 'Prob Q :', pQ+p0Q ,'Prob S :' , pS+p0S)


Prob C : 0.24678111587982832 Prob Q : 0.12875536480686695 Prob S : 0.6244635193133047
Prob C : 0.5769230769230769 Prob Q : 0.038461538461538464 Prob S : 0.38461538461538464
Sum of probabilities
Prob C : 0.8237041928029052 Prob Q : 0.1672169032684054 Prob S : 1.0090789039286894
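
The hand-typed counts above can also be derived directly from the data. The sketch below uses a small toy frame (values are hypothetical) and `value_counts(normalize=True)`, which ignores missing ports and returns shares that sum to 1 within the selected group. Note that adding two sets of shares that each sum to 1 gives values totalling 2, so the summed figures above are better read as scores than as probabilities.

```python
import pandas as pd

# Toy data (hypothetical values); one female passenger has no port recorded
toy = pd.DataFrame({
    "Sex":      ["female", "female", "female", "female", "male", "male"],
    "Embarked": ["S", "S", "C", None, "S", "Q"],
})

females = toy[toy["Sex"] == "female"]

# Share of each port among female passengers; NaN ports are excluded
shares = females["Embarked"].value_counts(normalize=True)
print(shares)

# The most common port is a simple candidate for filling the gaps
most_likely = shares.idxmax()
print("Most likely port:", most_likely)
```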


In [28]:
# Try 'S' for both passengers (rows 61 and 829).
# A single .loc assignment avoids chained indexing, which may not write back to full.
full.loc[[61, 829], 'Embarked'] = 'S'

In [29]:
print(full.isnull().sum())

Age               0
Cabin          1014
Embarked          0
Fare              0
Name              0
Parch             0
PassengerId       0
Pclass            0
Sex               0
SibSp             0
Survived        418
Ticket            0
dtype: int64


In [30]:
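def fillCabin(dframe):
    # Replace missing cabins with 'U' (unknown) and keep only the deck letter.
    # Note: this modifies dframe['Cabin'] in place.
    dframe[ 'Cabin' ] = dframe['Cabin'].fillna( 'U' )
    dframe[ 'Cabin' ] = dframe[ 'Cabin' ].map( lambda c : c[0] )
    # One-hot encode the deck letter; only the dummy columns are returned
    dummies = pd.get_dummies( dframe['Cabin'] , prefix = 'Cabin' )
    return dummies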
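
To make the two steps in `fillCabin` concrete, here is a small sketch on toy cabin values (the values are illustrative): the first letter of the cabin string is taken as the deck, `'U'` stands for unknown, and `pd.get_dummies` creates one indicator column per deck.

```python
import numpy as np
import pandas as pd

# Toy cabin values (hypothetical); NaN means the cabin is unknown
toy = pd.DataFrame({"Cabin": ["C85", np.nan, "E46", "B28"]})

# Keep only the deck letter, using 'U' for unknown cabins
deck = toy["Cabin"].fillna("U").str[0]

# One indicator column per deck letter, named Cabin_<letter>
dummies = pd.get_dummies(deck, prefix="Cabin")

print(deck.tolist())
print(dummies.columns.tolist())
```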

In [31]:
# Build the dummy columns once, then print and attach them
newDF = fillCabin(full)
print(newDF)
full = pd.concat([full, newDF], axis=1)
#full = full.drop(columns='Cabin')

      Cabin_A  Cabin_B  Cabin_C  Cabin_D  Cabin_E  Cabin_F  Cabin_G  Cabin_T  \
0           0        0        0        0        0        0        0        0   
1           0        0        1        0        0        0        0        0   
2           0        0        0        0        0        0        0        0   
3           0        0        1        0        0        0        0        0   
4           0        0        0        0        0        0        0        0   
5           0        0        0        0        0        0        0        0   
6           0        0        0        0        1        0        0        0   
7           0        0        0        0        0        0        0        0   
8           0        0        0        0        0        0        0        0   
9           0        0        0        0        0        0        0        0   
10          0        0        0        0        0        0        1        0   
11          0        0        1        0

In [32]:
full

| | Age | Cabin | Embarked | Fare | Name | Parch | PassengerId | Pclass | Sex | SibSp | ... | Ticket | Cabin_A | Cabin_B | Cabin_C | Cabin_D | Cabin_E | Cabin_F | Cabin_G | Cabin_T | Cabin_U |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 22.000000 | U | S | 7.2500 | Braund, Mr. Owen Harris | 0 | 1 | 3 | male | 1 | ... | A/5 21171 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 1 | 38.000000 | C | C | 71.2833 | Cumings, Mrs. John Bradley (Florence Briggs Th... | 0 | 2 | 1 | female | 1 | ... | PC 17599 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 26.000000 | U | S | 7.9250 | Heikkinen, Miss. Laina | 0 | 3 | 3 | female | 0 | ... | STON/O2. 3101282 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 3 | 35.000000 | C | S | 53.1000 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | 0 | 4 | 1 | female | 1 | ... | 113803 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 35.000000 | U | S | 8.0500 | Allen, Mr. William Henry | 0 | 5 | 3 | male | 0 | ... | 373450 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 5 | 29.881138 | U | Q | 8.4583 | Moran, Mr. James | 0 | 6 | 3 | male | 0 | ... | 330877 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 6 | 54.000000 | E | S | 51.8625 | McCarthy, Mr. Timothy J | 0 | 7 | 1 | male | 0 | ... | 17463 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 7 | 2.000000 | U | S | 21.0750 | Palsson, Master. Gosta Leonard | 1 | 8 | 3 | male | 3 | ... | 349909 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 8 | 27.000000 | U | S | 11.1333 | Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) | 2 | 9 | 3 | female | 0 | ... | 347742 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 9 | 14.000000 | U | C | 30.0708 | Nasser, Mrs. Nicholas (Adele Achem) | 0 | 10 | 2 | female | 1 | ... | 237736 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |


In [33]:
#print( full.where((full['Sex'] == 0) & (full['Pclass'] == 1)).groupby(['Pclass','Sex'])['Age'].mean() )
print( full['Sex'].isnull().sum() )

0


In [34]:
#byTicket = full.where(full['Cabin'].isnull()).groupby(['Name'])['Ticket']
#byFare = full.where(full['Cabin'].isnull()).groupby(['Pclass'])['Fare']
#byTicket.head(5)
#byFare.head(5)

In [35]:
full = pfunc.convertSexToNum(full)
full.head()

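| | Age | Cabin | Embarked | Fare | Name | Parch | PassengerId | Pclass | SibSp | Survived | ... | Cabin_A | Cabin_B | Cabin_C | Cabin_D | Cabin_E | Cabin_F | Cabin_G | Cabin_T | Cabin_U | Sex |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 22.0 | U | S | 7.25 | Braund, Mr. Owen Harris | 0 | 1 | 3 | 1 | 0.0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 1 | 38.0 | C | C | 71.2833 | Cumings, Mrs. John Bradley (Florence Briggs Th... | 0 | 2 | 1 | 1 | 1.0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 2 | 26.0 | U | S | 7.925 | Heikkinen, Miss. Laina | 0 | 3 | 3 | 0 | 1.0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |
| 3 | 35.0 | C | S | 53.1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | 0 | 4 | 1 | 1 | 1.0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 4 | 35.0 | U | S | 8.05 | Allen, Mr. William Henry | 0 | 5 | 3 | 0 | 0.0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |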


In [36]:
# Name the Deck according to the Cabin description;
# use 'U' where the Cabin is unknown
full = pfunc.fillDeck(full)

pd.crosstab(full['Deck'], full['Survived'])

| Deck | Survived = 0.0 | Survived = 1.0 |
|---|---|---|
| A | 8 | 7 |
| B | 12 | 35 |
| C | 24 | 35 |
| D | 8 | 25 |
| E | 8 | 24 |
| F | 5 | 8 |
| G | 2 | 2 |
| T | 1 | 0 |
| U | 481 | 206 |


In [37]:
print(full.isnull().sum())
print("========================================")
print(full.info())

Age              0
Cabin            0
Embarked         0
Fare             0
Name             0
Parch            0
PassengerId      0
Pclass           0
SibSp            0
Survived       418
Ticket           0
Cabin_A          0
Cabin_B          0
Cabin_C          0
Cabin_D          0
Cabin_E          0
Cabin_F          0
Cabin_G          0
Cabin_T          0
Cabin_U          0
Sex              0
Deck             0
dtype: int64
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 22 columns):
Age            1309 non-null float64
Cabin          1309 non-null object
Embarked       1309 non-null object
Fare           1309 non-null float64
Name           1309 non-null object
Parch          1309 non-null int64
PassengerId    1309 non-null int64
Pclass         1309 non-null int64
SibSp          1309 non-null int64
Survived       891 non-null float64
Ticket         1309 non-null object
Cabin_A        1309 non-null uint8
Cabin_B        1309 non-null uint

In [38]:
# Run the feature engineering once, then inspect the result
full = pfunc.featureEng( full )
print(full)

            Age Cabin Embarked      Fare  \
0     22.000000     U        S    7.2500   
1     38.000000     C        C   71.2833   
2     26.000000     U        S    7.9250   
3     35.000000     C        S   53.1000   
4     35.000000     U        S    8.0500   
5     29.881138     U        Q    8.4583   
6     54.000000     E        S   51.8625   
7      2.000000     U        S   21.0750   
8     27.000000     U        S   11.1333   
9     14.000000     U        C   30.0708   
10     4.000000     G        S   16.7000   
11    58.000000     C        S   26.5500   
12    20.000000     U        S    8.0500   
13    39.000000     U        S   31.2750   
14    14.000000     U        S    7.8542   
15    55.000000     U        S   16.0000   
16     2.000000     U        Q   29.1250   
17    29.881138     U        S   13.0000   
18    31.000000     U        S   18.0000   
19    29.881138     U        C    7.2250   
20    35.000000     U        S   26.0000   
21    34.000000     D        S  

In [39]:
#pfunc.pltCorrel( combined )
#pfunc.pltCorrel( full )
#pfunc.pltCorrel( full )

### Correlations to Investigate

  __Pclass__ is correlated with __Fare__ ( 1st class tickets are more expensive than those of other classes )

  __Pclass__ x __Age__

  __SibSp__ x __Age__

  __SibSp__ x __Fare__

  __SibSp__ is correlated with __Parch__ ( large families have high counts of siblings/spouses and of parents/children aboard, while solo travellers have zero of both )

  __Pclass__ is noticeably correlated with __Survived__ ( as expected, passengers in higher classes were more likely to survive )
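
These pairs can be checked numerically with `DataFrame.corr`. The sketch below uses a toy frame (the values are hypothetical and chosen only to illustrate the call); non-numeric columns such as `Name` are excluded first, since Pearson correlation is only defined for numeric data.

```python
import pandas as pd

# Toy data (hypothetical values)
toy = pd.DataFrame({
    "Name":   ["a", "b", "c", "d", "e", "f"],
    "Pclass": [1, 1, 2, 3, 3, 3],
    "Fare":   [80.0, 71.3, 26.0, 7.9, 8.05, 7.25],
    "SibSp":  [1, 0, 0, 3, 0, 1],
    "Parch":  [0, 0, 1, 2, 0, 1],
})

# Keep numeric columns only, then compute pairwise Pearson correlations
numeric = toy.select_dtypes(include="number")
corr = numeric.corr()

print(corr.round(2))
```

A strongly negative `Pclass`/`Fare` value is expected, because class 1 is the most expensive and class 3 the cheapest.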

In [40]:
# Plot distributions of Age of passengers who survived or did not survive
#pfunc.pltDistro( train , var = 'Age' , target = 'Survived' , row = 'Sex' )

In [41]:
# Plot distributions of Survived by Pclass, one row per Sex
#pfunc.pltDistro( train , var = 'Survived' , target = 'Pclass' , row = 'Sex' )

In [42]:
# Plot distributions of Parch of passengers who survived or did not survive
#pfunc.pltDistro( train , var = 'Parch' , target = 'Survived' , row = 'Sex' )

In [43]:
full.head(5)

| | Age | Cabin | Embarked | Fare | Name | Parch | PassengerId | Pclass | SibSp | Survived | ... | FamilyLarge | TicketType | Title | Fare_cat | Bad_ticket | Young | Shared_ticket | Ticket_group | Fare_eff | Fare_eff_cat |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 22.0 | U | S | 7.25 | Braund, Mr. Owen Harris | 0 | 1 | 3 | 1 | 0.0 | ... | 0 | A | Mr | 0 | True | True | 0 | 1 | 7.25 | 0 |
| 1 | 38.0 | C | C | 71.2833 | Cumings, Mrs. John Bradley (Florence Briggs Th... | 0 | 2 | 1 | 1 | 1.0 | ... | 0 | P | Mrs | 1 | False | False | 1 | 2 | 35.64165 | 2 |
| 2 | 26.0 | U | S | 7.925 | Heikkinen, Miss. Laina | 0 | 3 | 3 | 0 | 1.0 | ... | 0 | S | Miss | 0 | False | True | 0 | 1 | 7.925 | 0 |
| 3 | 35.0 | C | S | 53.1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | 0 | 4 | 1 | 1 | 1.0 | ... | 0 | 1 | Mrs | 1 | False | False | 1 | 2 | 26.55 | 2 |
| 4 | 35.0 | U | S | 8.05 | Allen, Mr. William Henry | 0 | 5 | 3 | 0 | 0.0 | ... | 0 | 3 | Mr | 0 | True | False | 0 | 1 | 8.05 | 0 |


In [49]:
# Plot survival by category, and the Age distribution of passengers who survived or did not survive

#pfunc.pltCategories( train , cat = 'Embarked' , target = 'Survived' ) 
#pfunc.pltCategories( train , cat = 'Pclass' , target = 'Survived' )
#pfunc.pltCategories( train , cat = 'Sex' , target = 'Survived' )
#pfunc.pltCategories( train , cat = 'Parch' , target = 'Survived' )
#pfunc.pltCategories( train , cat = 'SibSp' , target = 'Survived' )
#pfunc.pltDistro( train , var = 'Age' , target = 'Survived' , row = 'Sex' )
# The positional axis argument was removed in pandas 2.0; name the columns explicitly
full = full.drop(columns='Survived')

In [None]:
def getTitles(dframe):
    # The title sits between the comma and the first period:
    # "Braund, Mr. Owen Harris" -> "Mr"
    dframe['Title'] = dframe['Name'].map(lambda name: name.split(',')[1].split('.')[0].strip())
    # Collapse rare titles into a few groups; titles missing from this map become NaN
    titleMap = {
        "Capt":         "Officer",
        "Col":          "Officer",
        "Major":        "Officer",
        "Dr":           "Officer",
        "Rev":          "Officer",
        "Lady":         "Royalty",
        "Jonkheer":     "Royalty",
        "Don":          "Royalty",
        "Sir":          "Royalty",
        "the Countess": "Royalty",
        "Dona":         "Royalty",
        "Mme":          "Mrs",
        "Mlle":         "Miss",
        "Ms":           "Mrs",
        "Mr":           "Mr",
        "Mrs":          "Mrs",
        "Miss":         "Miss",
        "Master":       "Master"
    }

    dframe['Title'] = dframe['Title'].map(titleMap)
    return dframe
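
The extraction step can be tried in isolation. The sketch below uses a regular expression with `Series.str.extract` as an alternative to the chained `split` calls (this is a swapped-in technique, not the notebook's own code), and shows that a title absent from the mapping turns into `NaN`. The deliberately short mapping `partial_map` is a hypothetical example.

```python
import pandas as pd

names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Palsson, Master. Gosta Leonard",
])

# Capture the text between the comma and the first period
titles = names.str.extract(r",\s*([^.]+)\.", expand=False).str.strip()
print(titles.tolist())

# A mapping that omits 'Master' (hypothetical) to show what happens to unmapped titles
partial_map = {"Mr": "Mr", "Miss": "Miss"}
mapped = titles.map(partial_map)

# Titles that were lost in the mapping and would need an entry or a fallback
unmapped = titles[mapped.isnull()].unique().tolist()
print(unmapped)
```

Checking `unmapped` after applying the mapping is a cheap way to confirm that no title silently becomes missing.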

In [57]:
full = getTitles(full)
full.head()

| | Age | Cabin | Embarked | Fare | Name | Parch | PassengerId | Pclass | SibSp | Ticket | ... | FamilyLarge | TicketType | Title | Fare_cat | Bad_ticket | Young | Shared_ticket | Ticket_group | Fare_eff | Fare_eff_cat |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 22.0 | U | S | 7.25 | Braund, Mr. Owen Harris | 0 | 1 | 3 | 1 | A/5 21171 | ... | 0 | A | Mr | 0 | True | True | 0 | 1 | 7.25 | 0 |
| 1 | 38.0 | C | C | 71.2833 | Cumings, Mrs. John Bradley (Florence Briggs Th... | 0 | 2 | 1 | 1 | PC 17599 | ... | 0 | P | Mrs | 1 | False | False | 1 | 2 | 35.64165 | 2 |
| 2 | 26.0 | U | S | 7.925 | Heikkinen, Miss. Laina | 0 | 3 | 3 | 0 | STON/O2. 3101282 | ... | 0 | S | Miss | 0 | False | True | 0 | 1 | 7.925 | 0 |
| 3 | 35.0 | C | S | 53.1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | 0 | 4 | 1 | 1 | 113803 | ... | 0 | 1 | Mrs | 1 | False | False | 1 | 2 | 26.55 | 2 |
| 4 | 35.0 | U | S | 8.05 | Allen, Mr. William Henry | 0 | 5 | 3 | 0 | 373450 | ... | 0 | 3 | Mr | 0 | True | False | 0 | 1 | 8.05 | 0 |


In [56]:
# plot functions
import pltFunctions as pfunc
train_X, test_X, target_y = pfunc.prepareTrainTestTarget(full)
#train_valid_X = full[ 0:891 ]
#train_valid_y = full.Survived
#test_X = full[ 891: ]
#train_X , valid_X , train_y , valid_y = train_test_split( train_X , train_valid_y , train_size = .7 )

print (full.shape , train_X.shape , target_y.shape , test_X.shape)

(1309, 37) (891, 37) (891,) (418, 37)


In [51]:
model = RandomForestClassifier(n_estimators=100)
#model = SVC()
model.fit( train_X , target_y )

ValueError: could not convert string to float: 'Mr'
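
The error arises because scikit-learn estimators expect numeric input, while `train_X` still contains string columns such as `Title`. One possible fix, sketched below on toy data, is to one-hot encode every non-numeric column with `pd.get_dummies` before fitting. The toy frame, the names `toy_X`, `toy_y`, `non_numeric` and `encoded_X`, and the choice of one-hot encoding are assumptions for illustration; in the real data, free-text columns such as `Name` and `Ticket` would more likely be dropped than encoded.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Toy features (hypothetical values) with two string columns
toy_X = pd.DataFrame({
    "Age":      [22.0, 38.0, 26.0, 35.0, 2.0, 54.0],
    "Title":    ["Mr", "Mrs", "Miss", "Mrs", "Master", "Mr"],
    "Embarked": ["S", "C", "S", "S", "S", "S"],
})
toy_y = pd.Series([0, 1, 1, 1, 0, 0])

# Find the columns the model cannot consume directly
non_numeric = [col for col in toy_X.columns
               if not pd.api.types.is_numeric_dtype(toy_X[col])]

# Replace each of them with one indicator column per category
encoded_X = pd.get_dummies(toy_X, columns=non_numeric)

model = RandomForestClassifier(n_estimators=10, random_state=0)
model.fit(encoded_X, toy_y)  # no ValueError: every column is numeric or boolean

print(encoded_X.columns.tolist())
```

When train and test rows are encoded together, as with the combined `full` frame here, both parts end up with the same indicator columns.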