# Titanic: Machine Learning from Disaster -> Predict if a passenger will survive or not

On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone onboard, resulting in the death of 1502 out of 2224 passengers and crew.

While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.

In this notebook, we ask you to build a predictive model that answers the question: “what sorts of people were more likely to survive?” using passenger data (i.e., name, age, gender, socio-economic class, etc.).

## Install libraries if you are using Binder

In [None]:
!pip install pandas numpy matplotlib seaborn scikit-learn xlrd

## Import Libraries

In [None]:
#importing the libraries
import pandas as pd
import os
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix
import seaborn as sns
# from google.colab import files
import io

## Load Training Data

In [None]:
# reading the training dataset
# uploaded = files.upload()

In [None]:
# titanic_train = pd.read_excel(io.BytesIO(uploaded['train.xlsx']))
titanic_train =  pd.read_csv('./data/train.csv')
#titanic_test =  pd.read_csv('./data/test.csv')

In [None]:
#shape command will give number of rows/samples/examples and number of columns/features/predictors in dataset
print(titanic_train.shape)
print(titanic_train.dtypes)
titanic_train._____() # fill here

In [None]:
titanic_train.dtypes

#### Data Dictionary

- Survival
    - 0 = No
    - 1 = Yes
- pclass - Ticket class
    - 1 = 1st
    - 2 = 2nd
    - 3 = 3rd
- sex - Sex
- age - Age in years
- sibsp - # of siblings / spouses aboard the Titanic
- parch - # of parents / children aboard the Titanic
- ticket - Ticket number
- fare - Passenger fare
- cabin - Cabin number
- embarked - Port of Embarkation
    - C = Cherbourg
    - Q = Queenstown
    - S = Southampton


#### Variable Notes
- pclass: A proxy for socio-economic status (SES)
    - 1st = Upper
    - 2nd = Middle
    - 3rd = Lower

- age: Age is fractional if less than 1. If the age is estimated, it is in the form xx.5

- sibsp: The dataset defines family relations in this way...
    - Sibling = brother, sister, stepbrother, stepsister
    - Spouse = husband, wife (mistresses and fiancés were ignored)

- parch: The dataset defines family relations in this way...
    - Parent = mother, father
    - Child = daughter, son, stepdaughter, stepson
    - Some children travelled only with a nanny, therefore parch=0 for them.

In [None]:
#info method provides information about dataset like 
#total values in each column, null/not null, datatype, memory occupied etc
titanic_train._____() # fill here

In [None]:
#Describe gives statistical information about numerical columns in the dataset
titanic_train.describe()
#you can check from count if there are missing values in columns, here Age has missing values

In [None]:
#show first 5 rows of dataset
titanic_train.head()

In [None]:
#check missing values
titanic_train.isnull().sum()

describe() only gives statistics for numerical columns, so it does not reveal that Embarked and Cabin also have missing values.

Age, Cabin and Embarked have missing values. We will see how to fill missing values next.
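
The difference between the two checks can be sketched on a tiny frame. The rows below are made up for illustration (they are not taken from train.csv); only the column names follow the dataset.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 35.0, np.nan],
    "Fare": [7.25, 71.28, 8.05, 53.10],
    "Cabin": [None, "C85", None, "C123"],
    "Embarked": ["S", "C", None, "S"],
})

# isnull() marks missing cells; sum() counts them per column
missing_counts = toy.isnull().sum()
print(missing_counts)

# describe() summarises only numeric columns by default,
# so Cabin and Embarked do not appear here
numeric_summary = toy.describe()
print(numeric_summary.columns.tolist())

# include="all" adds the non-numeric columns to the summary
full_summary = toy.describe(include="all")
print(full_summary.loc["count"])
```

The `count` row of `describe(include="all")` gives the number of non-missing values per column, which is another way to spot gaps in text columns.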

# Pandas Operations

In [None]:
#Create a sample dataframe
titanic_sample = titanic_train.copy()
titanic_sample.shape

In [None]:
# select the 'Age' Series using bracket notation
titanic_sample[_____] # fill here

# or equivalently, use dot notation
titanic_sample.Age.head()

In [None]:
#selecting multiple features
titanic_sample[['Age', 'Name']].head()

In [None]:
type(titanic_sample), type(titanic_sample.Age), type(titanic_sample['Age']), type(titanic_sample[['Age']])

In [None]:
#renaming a column in dataframe: Parch -> ParentChildren, SibSp -> SiblingSpouse
print('before renaming',titanic_sample.columns)
titanic_sample.rename(columns={'Parch':'ParentChildren', 'SibSp':_____}, inplace=True) # fill here
titanic_sample.columns

In [None]:
#remove column from a dataframe. Remove PassengerId, Survived columns from dataframe
titanic_sample.drop(['PassengerId', _____],axis = 1, inplace = True) # fill here
titanic_sample.columns

In [None]:
#drop a row (temporarily)
titanic_sample.drop(2, axis=0).head()

In [None]:
#Filter rows in dataframe
# Give passenger name & pclass with Age greater than 35 
titanic_sample.head()

In [None]:
# select age greater than 35
titanic_sample[titanic_sample.Age > _____][['Name','Pclass']].head() # fill here

In [None]:
titanic_sample.loc[titanic_sample.Age > 35,['Name','Pclass']].head()

In [None]:
titanic_sample.iloc[0:2, 0:4] # Exclusive of last row and column
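
A small sketch of how `loc` and `iloc` differ, using a made-up frame with a non-default index so the two are easy to tell apart. `loc` selects by label and boolean mask and includes both ends of a label slice; `iloc` selects by integer position and excludes the end.

```python
import pandas as pd

# Toy data, invented for illustration; the index deliberately starts at 10
toy = pd.DataFrame(
    {
        "Name": ["A", "B", "C", "D"],
        "Pclass": [1, 3, 2, 1],
        "Age": [40.0, 22.0, 36.0, 19.0],
    },
    index=[10, 11, 12, 13],
)

# Label-based selection with a boolean mask and a list of column names
by_label = toy.loc[toy.Age > 35, ["Name", "Pclass"]]

# Position-based selection: first two rows, first two columns (end excluded)
by_position = toy.iloc[0:2, 0:2]

# Label slices include both endpoints, so this returns the same block
label_slice = toy.loc[10:11, "Name":"Pclass"]

print(by_label)
print(by_position.equals(label_slice))
```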

In [None]:
#iterate through rows
i = 0
for index, row in titanic_sample.iterrows():
    print (index,row.Age,row.Pclass)
    i += 1
    if i == 10:
        break

In [None]:
#calculate the mean age for each pclass category
titanic_sample.groupby(_____).Age.mean() # fill here

In [None]:
titanic_sample.groupby('Pclass').Age.agg(['count', 'mean', 'min', 'max'])

In [None]:
# count how many times each value in the Series/Feature occurs
titanic_sample.Sex.value_counts()

In [None]:
# Sorting series by count
titanic_sample.Sex.value_counts().sort_values()

In [None]:
# display percentages instead of raw counts
titanic_sample.Sex.value_counts(normalize= True)

In [None]:
#find unique values
titanic_sample.Pclass.unique()

In [None]:
#Simply mean of age
titanic_sample.Age.mean()

In [None]:
# Convert a feature from a numeric type to categorical
titanic_sample['Pclass'] = titanic_sample['Pclass'].astype('category')
titanic_sample.info()

# EDA (Exploratory Data Analysis)

In [None]:
# this line allows ipython notebook to display the plots in the output
%matplotlib inline

In [None]:
# gives the count of each unique value in a column
titanic_train.Sex.value_counts()

In [None]:
sns.countplot(x='Sex', data=_____) # fill here

- A count plot is the graphical counterpart of value_counts: it shows the count of each unique value in a feature
- Bar plot - shows the count of each category in a categorical feature

# Pclass vs Fare

In [None]:
ax = sns.boxplot(x="Pclass", y="Fare", hue="_____", data=titanic_train) # fill here
ax.set_yscale('log')

- Fares decrease as the Pclass increases

# Embarked vs Fare

In [None]:
ax = sns.boxplot(x="Embarked", y="Fare", hue="Survived", data=titanic_train)
ax.set_yscale('log')

- Passengers who embarked at S and C and paid higher fares appear to have survived more often

# Fare vs survival rates

In [None]:
#making fares into categories of ranges(<=7.91,[7.91,14.454],[14.454,31],[31,513])
titanic_train['Fare_cat']=0
titanic_train.loc[titanic_train['Fare']<=7.91,'Fare_cat']=0
titanic_train.loc[(titanic_train['Fare']>7.91)&(titanic_train['Fare']<=14.454),'Fare_cat']=1
titanic_train.loc[(titanic_train['Fare']>14.454)&(titanic_train['Fare']<=31),'Fare_cat']=2
titanic_train.loc[(titanic_train['Fare']>_____)&(titanic_train['Fare']<=_____),'Fare_cat']=3 # fill here
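
As an alternative sketch, the same right-closed ranges can be expressed with `pd.cut` instead of a chain of `.loc` assignments. The fare values below are made up to exercise the bin edges; the lower edge of `-inf` is an assumption added so that a fare of 0 lands in category 0, as it does with the `<=7.91` condition above.

```python
import pandas as pd

# Made-up fares chosen to sit on and around the bin edges
fares = pd.Series([0.0, 7.91, 7.92, 14.454, 20.0, 31.0, 31.5, 512.33])

# Right-closed bins: (-inf, 7.91], (7.91, 14.454], (14.454, 31], (31, 513]
bins = [-float("inf"), 7.91, 14.454, 31, 513]
fare_cat = pd.cut(fares, bins=bins, labels=[0, 1, 2, 3]).astype(int)

print(fare_cat.tolist())
```

One difference to keep in mind: `pd.cut` returns NaN for values outside the bins (here, a fare above 513), whereas the `.loc` approach would leave the initial value 0 in place. The `astype(int)` call would fail on such NaN values, so they need handling first.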

In [None]:
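sns.catplot(x='Fare_cat', y='Survived', kind='point', data=titanic_train, hue='Sex', aspect=2.5)
#factorplot was renamed to catplot (and later removed from seaborn); kind='point' draws the point plot
#aspect -> ratio of width to height of the plot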
plt.show()

- Clearly, as Fare_cat increases, the chance of survival increases

In [None]:
g = sns.FacetGrid(titanic_train, hue="Survived", col="_____", margin_titles=True,
                palette="Set1",hue_kws=dict(marker=["^", "v"]),height=5) # fill here
g.map(plt.scatter, "Fare", "Age",edgecolor="w").add_legend()
#plt.subplots_adjust(top=0.8)
g.fig.suptitle('Survival by Gender , Age and Fare')

- Most females who paid higher fares survived, but the same is not true for males

## Survival Rates 

In [None]:
titanic_train[['Pclass', 'Survived']].groupby(['Pclass'], as_index=False).mean().sort_values(by='Survived', ascending=False)

In [None]:
print("% of survivals in") 
print("Pclass=1 : ", titanic_train.Survived[titanic_train.Pclass == 1].sum()/titanic_train[titanic_train.Pclass == 1].Survived.count())
print("Pclass=2 : ", titanic_train.Survived[titanic_train.Pclass == 2].sum()/titanic_train[titanic_train.Pclass == 2].Survived.count())
print("Pclass=3 : ", titanic_train.Survived[titanic_train.Pclass == 3].sum()/titanic_train[titanic_train.Pclass == 3].Survived.count())

In [None]:
sns._____(x='Pclass',y='Survived', kind='point', data=titanic_train) # fill here
#catplot with kind='point' draws a point plot (the older factorplot did this by default)
#They are particularly good at showing interactions: 
#i.e.,how the relationship between levels of one categorical variable changes across levels of a second categorical variable
plt.show()

- Survival rate is highest for class 1 passengers
- Survival rate decreases as the Pclass number increases
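
Because Survived is coded 0/1, the survival rate of a group is simply the mean of that column within the group. A minimal sketch on made-up rows shows that the grouped mean matches the manual sum divided by count:

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "Pclass":   [1, 1, 1, 2, 2, 3, 3, 3, 3],
    "Survived": [1, 1, 0, 1, 0, 0, 0, 1, 0],
})

# Mean of a 0/1 column = fraction of ones = survival rate per class
rate_by_class = toy.groupby("Pclass")["Survived"].mean()

# The same number for class 1, computed as sum / count
first = toy.Survived[toy.Pclass == 1]
manual_first = first.sum() / first.count()

print(rate_by_class)
print(manual_first)
```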

## Survival rate based on gender

In [None]:
titanic_train.groupby(['Survived','Sex'])['Survived'].count()

In [None]:
sns.catplot(x='Sex', col='Survived', kind='count', data=titanic_train)

In [None]:
print("% of women survived: " , titanic_train[titanic_train.Sex == 'female'].Survived.sum()/titanic_train[titanic_train.Sex == 'female'].Survived.count())
print("% of men survived:   " , titanic_train[titanic_train.Sex == 'male'].Survived.sum()/titanic_train[titanic_train.Sex == 'male'].Survived.count())

 - Females were more likely to survive than males
 - 74% of women survived while only 19% of men survived (233 out of 314 females survived while only 109 out of 577 males survived)

In [None]:
sns.catplot(x='Sex', y='Survived', kind='point', data=titanic_train)

- Most of the females survived (above 70%) & most of the males died (survival rate below 20%)

In [None]:
sns.catplot(x='Pclass', y='Survived', hue='Sex', kind='point', data=titanic_train)
#the 'hue' parameter splits the plot by the values of a column; with hue='Sex' we can compare the survival rates for each gender
plt._____() # fill here

- Almost all women in Pclass 1 and 2 survived and nearly all men in Pclass 2 and 3 died

## Survival rate based on embarked(boarding) place

In [None]:
sns.catplot(x='Survived', col='Embarked', kind='count', data=titanic_train)

- Most of the passengers who died embarked at S

In [None]:
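#order fixes the category order so the tick labels below match the data
g = sns.catplot(x='Embarked', y='Survived', kind='point', order=['S', 'C', 'Q'], data=titanic_train, height=5)
g.set_xticklabels(["Southampton(S)", "Cherbourg(C)", "Queenstown(Q)"])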

- Passengers who embarked at C had the highest survival rate; more than half of them survived

In [None]:
sns.catplot(x='Embarked', y='Survived', hue='Sex', kind='point', data=titanic_train)

- Approximately 85% of the women who embarked at C survived.


## Embarked, Pclass and Sex vs Survival

In [None]:
sns.catplot(x='Embarked', y='Survived', col='Pclass', hue='Sex', kind='point', data=titanic_train, errorbar=None)
#separate plots for embarked
plt.show()

- Practically all women of Pclass 2 that embarked in C and Q survived, also nearly all women of Pclass 1 survived.
- All men of Pclass 1 and 2 embarked at Q died, survival rate for men in Pclass 2 and 3 is always below 0.2
- For the remaining men in Pclass 1 that embarked at S and C, survival rate is approx. 0.4

In [None]:
_ = sns.catplot(x='Pclass', y='Survived', hue='Sex', col='Embarked', kind='point', data=titanic_train)
#separate plots for Pclass
_ = sns.catplot(x='Pclass', y='Survived', col='Embarked', kind='point', data=titanic_train)

- As noticed already before, the class 1 passengers had a higher survival rate.
- Most of the women who died were from the 3rd class.
- Embarking at Q as a 3rd class passenger gave slightly better survival chances than embarking at S in the same class.

# Embarked vs Pclass (categorical vs categorical)

In [None]:
tab = pd.crosstab(titanic_train['Embarked'],titanic_train['Pclass'])
print(tab)
tab_prop = tab.div(tab.sum(1).astype(float), axis=0)
tab_prop.plot(kind="bar", stacked=True)
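
The row-wise proportions computed with `div` above can also be obtained directly from `pd.crosstab` through its `normalize` argument. A sketch on made-up rows, checking that both routes agree:

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "Embarked": ["S", "S", "S", "C", "C", "Q"],
    "Pclass":   [3, 3, 1, 1, 2, 3],
})

# Counts, then divide each row by its row total
tab = pd.crosstab(toy["Embarked"], toy["Pclass"])
manual = tab.div(tab.sum(axis=1).astype(float), axis=0)

# Same proportions in one call: normalize="index" scales each row to sum to 1
direct = pd.crosstab(toy["Embarked"], toy["Pclass"], normalize="index")

print(direct)
```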

## Age Distribution

In [None]:
age_counts = titanic_train[['Age', 'PassengerId']].groupby('Age').count().reset_index()
age_counts.columns = ['AGE', 'FREQUENCY']
plt.figure(figsize=(12,6))
plt.bar(age_counts.AGE, age_counts.FREQUENCY)
plt.title('Bar Plot on Ages (AGE > 1 and AGE < 90)')
plt.xlabel('Age')
plt.ylabel('_____') # fill here
plt.show()

- Most of the passengers are aged between 18 and 36

## Survival based on Age

In [None]:
facet = sns.FacetGrid(titanic_train, hue="Survived",aspect=4)
facet.map(sns.kdeplot, 'Age', fill=True)
facet.set(xlim=(0, titanic_train['Age'].max()))
facet.add_legend()

In [None]:
g = sns.FacetGrid(titanic_train, col='Survived', height=5)
g.map(plt.hist, 'Age', bins=20)

- Infants (Age <=4) had high survival rate
- Oldest passengers (Age = 80) survived
- Large number of 15-25 year olds did not survive
- Most passengers are in 15-35 age range

In [None]:
fig, axes = plt.subplots(nrows=1, ncols=2,figsize=(15,5))
#create 1 row and 2 columns (2 subplots)
females = titanic_train[titanic_train['Sex']=='female']
males = titanic_train[titanic_train['Sex']=='male']

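#histplot replaces the deprecated distplot; label gives ax.legend() something to show
ax = sns.histplot(females[females['Survived']==1].Age.dropna(), bins=30, ax=axes[0], kde=False, label='survived')
ax = sns.histplot(females[females['Survived']==0].Age.dropna(), bins=30, ax=axes[0], kde=False, label='not survived')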
#plot both on the same axes(axes[0])
ax.legend()
ax.set_title('_____') # fill here
ax = sns.histplot(males[males['Survived']==1].Age.dropna(), bins=30, ax=axes[1], kde=False, label='survived')
ax = sns.histplot(males[males['Survived']==0].Age.dropna(), bins=30, ax=axes[1], kde=False, label='not survived')
#plot both on axes[1]
ax.legend()
_ = ax.set_title('Male')

- Most of the Females with age 18-38 survived.
- Most of the Males with age 18-35 died.

# Survival Rate vs Number of family members aboard

In [None]:
# To get the full family size of a person, added siblings and parch.
#fig, axes = plt.subplots(nrows=1, ncols=1,figsize=(15, 5))
titanic_train['family_size'] = titanic_train['SibSp'] + titanic_train['Parch'] + 1 
_ = sns.catplot(x='family_size', y='Survived', hue='Sex', kind='point', data=titanic_train, aspect=4)
#separate for male and female
_ = sns.catplot(x='SibSp', y='Survived', kind='point', data=titanic_train, aspect=4)
#all passengers

- Assumption: the fewer people in your family, the faster you could get to a lifeboat; the more people there are, the more coordination is required. However, passengers with no family members aboard might have chosen to help others and thereby sacrificed themselves
- Females travelling with up to 2 other family members had a higher chance of survival. However, the survival rate varies widely once family size exceeds 4, possibly because mothers and daughters spent longer searching for family members, which reduced their chances of survival
- Men travelling alone might have sacrificed themselves to help other people survive
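
When reading survival rates by family size, it helps to look at the group counts next to the means, since large families are rare and their rates rest on few passengers. A sketch on made-up rows:

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "SibSp":    [0, 1, 1, 0, 4, 0],
    "Parch":    [0, 0, 2, 0, 2, 0],
    "Survived": [0, 1, 1, 1, 0, 0],
})

# Family size = siblings/spouses + parents/children + the passenger
toy["family_size"] = toy["SibSp"] + toy["Parch"] + 1

# count shows how many passengers each mean is based on
rate_by_size = toy.groupby("family_size")["Survived"].agg(["count", "mean"])
print(rate_by_size)
```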

# General overview of all variables vs survival

In [None]:
plain_features = ['Pclass', 'Sex', 'SibSp', 'Parch', 'Embarked']
fig, ax = plt.subplots(nrows = 2, ncols = 3 ,figsize=(20,10))
start = 0
for j in range(2):
    for i in range(3):
        if start == len(plain_features):
            break
        sns.barplot(x=plain_features[start], y='Survived', data=titanic_train, ax=ax[j,i], errorbar=None)
        start += 1

Observations in a Nutshell for all features:
- Sex: The chance of survival for women is high as compared to men.

- Pclass: There is a visible trend that being a 1st class passenger gives you better chances of survival. The survival rate for Pclass 3 is very low. For women, the chance of survival in Pclass 1 is almost 1 and is also high in Pclass 2. Money wins!

- Age: Children younger than about 5-10 years have a high chance of survival. The oldest passenger (age 80) also survived. Most of the passengers aged 15 to 35 died.

- Embarked: This is a very interesting feature. The chance of survival at C looks better, even though the majority of Pclass 1 passengers boarded at S. Passengers at Q were almost all from Pclass 3.

- Parch+SibSp: Having 1-2 siblings or a spouse on board, or 1-3 parents/children, shows a greater probability of survival than being alone or travelling with a large family

# Missing Value Imputation

In [None]:
titanic_train.isnull().sum()

In [None]:
# get records which has null values in 'Embarked' column
titanic_train[titanic_train[_____].isnull()] # fill here

PassengerId 62 and 830 have missing embarked values

Both have Passenger class 1 and fare $80.

Plot a graph to visualize the fares and try to guess where they embarked

In [None]:
sns.boxplot(x="_____", y="_____", hue="Pclass", data=titanic_train) # fill here

In [None]:
titanic_train["Embarked"] = titanic_train["Embarked"].fillna('C')

For 1st class passengers who embarked at 'C', the median fare is around $80. So we can replace the NA values in the Embarked column with 'C'
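
The visual reasoning above can also be sketched numerically: among first-class passengers with a known port, find the port whose median fare is closest to $80. The fares below are made up so the example is self-contained; they are not the real medians from train.csv.

```python
import pandas as pd

# Toy data, invented for illustration; the last row has a missing port
toy = pd.DataFrame({
    "Pclass":   [1, 1, 1, 1, 1, 1, 3, 1],
    "Fare":     [78.0, 82.0, 50.0, 60.0, 90.0, 90.0, 8.0, 80.0],
    "Embarked": ["C", "C", "S", "S", "Q", "Q", "S", None],
})

# First-class passengers whose port is known
known = toy[toy["Embarked"].notnull() & (toy["Pclass"] == 1)]

# Median fare per port, then the port closest to the $80 fare
median_fare = known.groupby("Embarked")["Fare"].median()
best_port = (median_fare - 80.0).abs().idxmin()

toy["Embarked"] = toy["Embarked"].fillna(best_port)
print(median_fare)
print(best_port)
```

This is a heuristic, not a proof of where the two passengers boarded; filling with the most frequent port is another common choice.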

# Data Preprocessing

In [None]:
from sklearn.model_selection import train_test_split
from sklearn import tree, model_selection
from sklearn.metrics import accuracy_score

In [None]:
# Preprocessing data
titanic_train1 = pd.get_dummies(titanic_train, columns = ['Embarked','Sex'])
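
A quick sketch of what `pd.get_dummies` does to the listed columns, using three made-up rows: each category becomes its own indicator column, and the original text column is removed.

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "C", "Q"],
    "Fare": [7.25, 71.28, 7.75],
})

encoded = pd.get_dummies(toy, columns=["Embarked", "Sex"])
print(encoded.columns.tolist())

# Exactly one indicator is set per row within each encoded feature
sex_indicator_total = encoded[["Sex_female", "Sex_male"]].sum(axis=1)
print(sex_indicator_total.tolist())
```

Since `Sex_female` and `Sex_male` always sum to 1, one of them is redundant. Tree models are not affected by this, but for linear models passing `drop_first=True` is a common way to avoid the redundancy.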

In [None]:
titanic_train1.dtypes

In [None]:
# get features and target variables from dataframe. Here target variable is 'Survived'
X = titanic_train1.drop(['PassengerId','Survived', 'Cabin', 'Name', 'Age', 'Ticket', ], axis = 1)
y = titanic_train1[_____] # fill here

In [None]:
X.head()

In [None]:
y.head()

In [None]:
# Dividing into train and test dataset. Here test_size is 25% i.e., 0.25
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = _____, stratify = y, random_state = 10 ) # fill here
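
The effect of `stratify` can be checked on synthetic labels: the class proportions in the train and test parts match those of the full data. The arrays below are made up, with 60% zeros and 40% ones.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 100 samples, 60 of class 0 and 40 of class 1
X_demo = np.arange(100).reshape(-1, 1)
y_demo = np.array([0] * 60 + [1] * 40)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=10
)

# Both parts keep the 40% share of class 1
print(len(X_tr), len(X_te))
print(y_tr.mean(), y_te.mean())
```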

In [None]:
#Baseline Model - As deduced from EDA, most passengers died. Make all passengers as died.
X_test['y_pred_bas'] = 0

In [None]:
accuracy_score(y_test, X_test.y_pred_bas)
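
The same majority-class baseline can be sketched with scikit-learn's `DummyClassifier`, which avoids adding a prediction column to the feature frame. The arrays are made up; the features are irrelevant to this baseline, so zeros are used.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Synthetic labels: class 0 is the majority in the training part
X_tr = np.zeros((8, 1))
y_tr = np.array([0, 0, 0, 0, 0, 1, 1, 1])
X_te = np.zeros((4, 1))
y_te = np.array([0, 0, 1, 0])

# most_frequent always predicts the majority class seen during fit
baseline = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)
baseline_pred = baseline.predict(X_te)
baseline_acc = accuracy_score(y_te, baseline_pred)

print(baseline_pred, baseline_acc)
```

The baseline accuracy equals the share of the majority class in the test labels, which is the number any real model should beat.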

# Model Building

In [None]:
#Make Decision Tree Model
dt = tree.DecisionTreeClassifier()
dt.fit(X_train, y_train)

In [None]:
X_test.columns

In [None]:
y_pred = dt.predict(X_test.drop('y_pred_bas', axis=1))

In [None]:
y_pred

In [None]:
_____(y_test, y_pred) # fill here

In [None]:
#Before Actually applying to the test data, you need to check how your model is performing. So we keep some data for validation

In [None]:
cv_scores = model_selection.cross_val_score(dt, X_train, y_train, cv=10, verbose=1)
print ('Cross Validation Score : ',cv_scores.mean() )

In [None]:
print('Training Accuracy : ',dt.score(X_train,y_train))
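
Comparing the training accuracy with the cross-validation score is a quick check for overfitting. The sketch below uses synthetic data from `make_classification` (an assumption, so that the example runs without train.csv) and an unconstrained decision tree, which typically fits its training data perfectly while scoring lower on held-out folds.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the Titanic features
X_demo, y_demo = make_classification(n_samples=400, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=10
)

tree_demo = DecisionTreeClassifier(random_state=0)

# 10-fold cross-validation on the training part only
cv_scores_demo = cross_val_score(tree_demo, X_tr, y_tr, cv=10)

tree_demo.fit(X_tr, y_tr)
train_acc = tree_demo.score(X_tr, y_tr)
test_acc = tree_demo.score(X_te, y_te)

print("Training accuracy:", train_acc)
print("Cross-validation mean:", cv_scores_demo.mean())
print("Test accuracy:", test_acc)
```

A large gap between training accuracy and the cross-validation mean suggests limiting the tree, for example with `max_depth` or `min_samples_leaf`.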

In [None]:
confusion_matrix(y_test, _____) # fill here (true labels come first, predictions second)
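
scikit-learn's `confusion_matrix` expects the true labels first and the predictions second; rows then correspond to true classes and columns to predicted classes. A sketch with made-up labels shows the layout and what happens when the arguments are swapped:

```python
from sklearn.metrics import confusion_matrix

# Made-up labels for illustration
y_true = [0, 0, 0, 1, 1, 0, 0]
y_hat = [0, 1, 0, 1, 0, 0, 1]

# Rows = true class, columns = predicted class
cm = confusion_matrix(y_true, y_hat)
tn, fp, fn, tp = cm.ravel()

# Swapping the arguments transposes the matrix
cm_swapped = confusion_matrix(y_hat, y_true)

print(cm)
print(cm_swapped)
```

Accuracy is unaffected by the swap, but false positives and false negatives trade places, which changes precision and recall.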