Understanding the data 

PassengerId - this is just a generated ID
Pclass - the class the passenger travelled in: first, second or third
Name - self explanatory
Sex - male or female
Age
SibSp - number of the passenger's siblings or spouses aboard the ship
Parch - number of the passenger's parents or children aboard the ship
Ticket - ticket number
Fare - ticket price
Cabin
Embarked - port of embarkation
Survived - did the passenger survive the sinking of the Titanic? (1 = yes, 0 = no)


In [1]:
import numpy as np
import pandas as pd

In [2]:
pd.read_csv("C:/Users/omola/Downloads/TitanicData.csv")

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Survived
0,1,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S,0
1,2,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,1
2,3,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S,1
3,4,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S,1
4,5,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S,0
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S,0
887,888,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S,1
888,889,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S,0
889,890,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C,1


In [3]:
titanic = pd.read_csv("C:/Users/omola/Downloads/TitanicData.csv")

In [4]:
import matplotlib.pyplot as plt

In [5]:
print('Total number of passengers in the training data =', len(titanic))

Total number of passengers in the training data = 891


In [6]:
#number of passengers in the training data that survived
print('Total number of passengers that survived =', len(titanic[titanic['Survived']==1]))

Total number of passengers that survived = 342


# EDA

In [7]:
print('% of men that survived =', 100*np.mean(titanic['Survived'][titanic['Sex']=='male']))
print('% of women that survived =', 100*np.mean(titanic['Survived'][titanic['Sex']=='female']))

#Survived is 0 or 1, so 100 times the mean of Survived over the male (or female) rows is the percentage of that group who survived

% of men that survived = 18.890814558058924
% of women that survived = 74.20382165605095


In [8]:
#Analysing if class has an effect on survival
print('% of first-class passengers who survived =', 100*np.mean(titanic['Survived'][titanic['Pclass']==1]))
print('% of third-class passengers who survived =', 100*np.mean(titanic['Survived'][titanic['Pclass']==3]))

% of first-class passengers who survived = 62.96296296296296
% of third-class passengers who survived = 24.236252545824847


In [9]:
#Analysing if age has an effect on survival
#Note: passengers aged exactly 18, and those with a missing Age, fall into neither group
print('% of children who survived =', 100*np.mean(titanic['Survived'][titanic['Age']<18]))
print('% of adults who survived =', 100*np.mean(titanic['Survived'][titanic['Age']>18]))

% of children who survived = 53.98230088495575
% of adults who survived = 38.26086956521739
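
The group-wise rates in this section can also be computed in one call with `groupby`. The sketch below runs on a small made-up frame (`sample`, not the real Titanic file) so that it is self-contained; only the column names are taken from the dataset.

```python
import pandas as pd

# Tiny illustrative frame, NOT the real Titanic data
sample = pd.DataFrame({
    "Sex": ["male", "female", "female", "male", "male", "female"],
    "Pclass": [3, 1, 3, 1, 3, 2],
    "Survived": [0, 1, 1, 0, 0, 1],
})

# Survived is 0/1, so the mean within each group is that group's survival rate
rate_by_sex = 100 * sample.groupby("Sex")["Survived"].mean()
rate_by_class = 100 * sample.groupby("Pclass")["Survived"].mean()

print(rate_by_sex)
print(rate_by_class)
```

This returns one row per group, so adding a new category (for example second class) needs no extra line of code.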


# Data Preprocessing

In [10]:
#A decision tree needs numeric inputs, so convert the non-numeric Sex column to numbers (male = 1, female = 0).
#apply runs a function on every value in the column; lambda defines that small function inline.
titanic['Sex'] = titanic['Sex'].apply(lambda x: 1 if x == 'male' else 0)
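
An alternative encoding, assuming the column should contain only 'male' and 'female', is `Series.map` with an explicit dictionary. The `lambda` version turns anything that is not 'male' into 0, whereas `map` leaves unexpected values as NaN, which makes them easy to spot. The frame below is a made-up sample, and `Sex_encoded` is an illustrative column name.

```python
import pandas as pd

# Illustrative values, including one unexpected entry
sample = pd.DataFrame({"Sex": ["male", "female", "female", "unknown"]})

sex_codes = {"male": 1, "female": 0}
sample["Sex_encoded"] = sample["Sex"].map(sex_codes)  # unmapped values become NaN

print(sample)
```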

In [11]:
titanic.head()

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Survived
0,1,3,"Braund, Mr. Owen Harris",1,22.0,1,0,A/5 21171,7.25,,S,0
1,2,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",0,38.0,1,0,PC 17599,71.2833,C85,C,1
2,3,3,"Heikkinen, Miss. Laina",0,26.0,0,0,STON/O2. 3101282,7.925,,S,1
3,4,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",0,35.0,1,0,113803,53.1,C123,S,1
4,5,3,"Allen, Mr. William Henry",1,35.0,0,0,373450,8.05,,S,0


In [12]:
#Next, check for missing values and replace them accordingly
titanic.isnull().sum()

PassengerId      0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
Survived         0
dtype: int64
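
Raw counts are easier to judge as a share of the rows: in the output above, 177 of 891 ages (about 20%) and 687 of 891 cabins (about 77%) are missing. A minimal sketch of computing that share directly, on a made-up frame rather than the real file:

```python
import numpy as np
import pandas as pd

# Illustrative frame, NOT the real Titanic data
sample = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare": [7.25, 71.28, 7.93, 53.10],
})

# isnull() gives True/False per cell; the column mean is the fraction missing
missing_pct = 100 * sample.isnull().mean()
print(missing_pct.round(1))
```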

In [13]:
#Age has many missing values. Filling them with 0 would distort the model, so fill with the mean instead (the median or mode are alternatives).
titanic['Age'] = titanic['Age'].fillna(titanic['Age'].mean())
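
The choice between mean and median matters most when a column has outliers. The sketch below uses made-up ages with one large value to show how far apart the two fill values can be; it is an illustration, not a recommendation for this dataset.

```python
import numpy as np
import pandas as pd

ages = pd.Series([20.0, 22.0, 24.0, np.nan, 80.0])  # illustrative values

# Both mean() and median() skip NaN by default
mean_filled = ages.fillna(ages.mean())      # mean of 20, 22, 24, 80 is 36.5
median_filled = ages.fillna(ages.median())  # median of 20, 22, 24, 80 is 23.0

print(mean_filled.iloc[3], median_filled.iloc[3])
```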

In [14]:
titanic.isnull().sum()

PassengerId      0
Pclass           0
Name             0
Sex              0
Age              0
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
Survived         0
dtype: int64

In [15]:
#Embarked has two missing values; since it is categorical, they can be replaced with the mode (the most frequent value)
titanic.Embarked.value_counts()

S    644
C    168
Q     77
Name: Embarked, dtype: int64

In [16]:
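#Assign the result back rather than using inplace=True on a selected column, which newer pandas versions may not apply to the original frame
titanic['Embarked'] = titanic['Embarked'].fillna('S')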
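
Rather than hard-coding 'S', the most frequent value can be read from the column with `Series.mode`. A minimal sketch on made-up values:

```python
import numpy as np
import pandas as pd

embarked = pd.Series(["S", "C", "S", np.nan, "Q", "S", np.nan])  # illustrative values

most_common = embarked.mode()[0]  # mode() returns a Series; take its first entry
embarked = embarked.fillna(most_common)

print(most_common, embarked.isnull().sum())
```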

In [17]:
titanic.isnull().sum()

PassengerId      0
Pclass           0
Name             0
Sex              0
Age              0
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         0
Survived         0
dtype: int64

In [19]:
#Keep only the columns used for modelling; you can either drop the rest or, as here, reselect the ones to keep
titanic = titanic[['Survived', 'Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare']]

In [20]:
#Separate the input features from the target: X holds every column except Survived, y holds Survived
X = titanic.drop('Survived', axis=1)
y = titanic['Survived']

In [21]:
#The PyPI package is named scikit-learn; it is imported as sklearn
!pip install scikit-learn


In [22]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
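
With `test_size=0.3`, scikit-learn rounds the test set up, so 891 rows split into 623 for training and 268 for testing. The sketch below checks this on placeholder arrays of the same length; `X_dummy` and `y_dummy` are dummy values, not the Titanic features.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X_dummy = np.arange(891 * 2).reshape(891, 2)  # placeholder features
y_dummy = np.arange(891) % 2                  # placeholder labels

X_tr, X_te, y_tr, y_te = train_test_split(
    X_dummy, y_dummy, test_size=0.3, random_state=42
)

print(len(X_tr), len(X_te))
```

Fixing `random_state` makes the split reproducible, so the accuracy figures can be compared between runs.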

In [24]:
from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier()
model.fit(X_train, y_train)

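Training happens in the third line, the call to "fit".
During training, the algorithm grows the tree greedily: at each node it picks the split that most reduces impurity in the training labels. The result is a good tree, not a guaranteed optimal one.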
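
By default, scikit-learn's `DecisionTreeClassifier` scores candidate splits with Gini impurity. The sketch below computes it by hand for a made-up node and one candidate split. The helpers `gini` and `weighted_gini` are illustrative names, not part of scikit-learn.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def weighted_gini(left, right):
    """Impurity after a split, weighting each child by its share of the rows."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

parent = [1, 1, 1, 0, 0, 0, 0, 0]  # made-up Survived labels at one node
left = [1, 1, 1, 0]                # e.g. the rows where Sex == 0
right = [0, 0, 0, 0]               # e.g. the rows where Sex == 1

print(gini(parent))                # impurity before the split
print(weighted_gini(left, right))  # impurity after the split (lower is better)
```

The tree compares this weighted impurity across all candidate splits at a node and keeps the one with the lowest value.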

# Evaluating the Model

In [25]:
!pip install graphviz

Collecting graphviz
  Downloading graphviz-0.20.1-py3-none-any.whl (47 kB)
Installing collected packages: graphviz
Successfully installed graphviz-0.20.1
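
graphviz is commonly used to draw a fitted tree. A minimal sketch, assuming made-up data with two of the dataset's column names: `export_graphviz` returns the tree as DOT text, which the graphviz package can then render. Rendering also needs the Graphviz system binaries, so that step is left as a comment, and the output file name `titanic_tree` is hypothetical.

```python
from sklearn.tree import DecisionTreeClassifier, export_graphviz

# Made-up rows: [Pclass, Sex], with Sex encoded as 1 = male, 0 = female
X_small = [[3, 1], [1, 0], [3, 0], [1, 1], [2, 0], [3, 1]]
y_small = [0, 1, 1, 0, 1, 0]

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_small, y_small)

dot_text = export_graphviz(
    tree,
    out_file=None,                     # return the DOT source as a string
    feature_names=["Pclass", "Sex"],
    class_names=["died", "survived"],
    filled=True,
)
print(dot_text[:200])

# To render an image (requires the Graphviz system binaries):
# import graphviz
# graphviz.Source(dot_text).render("titanic_tree", format="png")
```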


In [29]:
y_pred = model.predict(X_test)

In [32]:
from sklearn.metrics import accuracy_score
print('Accuracy:', accuracy_score(y_test, y_pred))

Accuracy: 0.75
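
Accuracy is the share of predictions that match the true labels. A minimal check of that definition on made-up arrays:

```python
import numpy as np

y_true = np.array([0, 1, 1, 0, 1, 0, 0, 1])  # made-up labels
y_hat = np.array([0, 1, 0, 0, 1, 1, 0, 1])   # made-up predictions

# Comparing element-wise gives True/False; the mean is the fraction correct
accuracy = np.mean(y_true == y_hat)
print(accuracy)
```

Six of the eight made-up predictions match, so this prints 0.75.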


In [31]:
print('Training accuracy...', accuracy_score(y_train, model.predict(X_train)))

Training accuracy... 0.9807383627608347


In [34]:
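#Improve the model: limit the tree depth to reduce overfitting (training accuracy was 0.98 but test accuracy only 0.75)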
model_improved = DecisionTreeClassifier(max_depth=3)
model_improved.fit(X_train, y_train)

In [37]:
print('Training accuracy:', accuracy_score(y_train, model_improved.predict(X_train)))
print('Test accuracy:', accuracy_score(y_test, model_improved.predict(X_test)))

Training accuracy: 0.8314606741573034
Test accuracy: 0.8097014925373134
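
A large gap between training and test accuracy is the usual sign of overfitting, and `max_depth=3` is only one possible setting. The sketch below shows one way to compare several depths. It uses synthetic data from `make_classification` in place of the Titanic features, so the numbers it prints are not the notebook's results.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data, NOT the Titanic dataset
X_syn, y_syn = make_classification(n_samples=500, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.3, random_state=42
)

results = {}
for depth in [1, 2, 3, 5, None]:  # None lets the tree grow until its leaves are pure
    clf = DecisionTreeClassifier(max_depth=depth, random_state=0)
    clf.fit(X_tr, y_tr)
    results[depth] = (
        accuracy_score(y_tr, clf.predict(X_tr)),  # training accuracy
        accuracy_score(y_te, clf.predict(X_te)),  # test accuracy
    )

for depth, (train_acc, test_acc) in results.items():
    print(depth, round(train_acc, 3), round(test_acc, 3))
```

Choosing the depth by looking at test accuracy leaks information from the test set; cross-validation on the training data is the safer way to pick it.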
