## Loading libraries

In [1]:
import numpy as np
import pandas as pd

In [43]:
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')

In [3]:
train.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


Here is a description of the columns:
- PassengerId - a generated identifier with no predictive meaning
- Pclass - the passenger's ticket class: first (1), second (2) or third (3)
- Name - the passenger's name
- Sex - male or female
- Age - age in years
- SibSp - number of siblings or spouses travelling with the passenger
- Parch - number of parents or children travelling with the passenger
- Ticket - ticket number
- Fare - ticket price
- Cabin - cabin number
- Embarked - port of embarkation
- Survived - whether the passenger survived the sinking of the Titanic (1) or not (0)

Survived is the **target**

## Exploratory Data Analysis (EDA)
After loading the data, we need to examine it. A model built without understanding the dataset is harder to get right and more likely to be inaccurate.

In [4]:
train.describe()

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,891.0,891.0,891.0,714.0,891.0,891.0,891.0
mean,446.0,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,257.353842,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,1.0,0.0,1.0,0.42,0.0,0.0,0.0
25%,223.5,0.0,2.0,20.125,0.0,0.0,7.9104
50%,446.0,0.0,3.0,28.0,0.0,0.0,14.4542
75%,668.5,1.0,3.0,38.0,1.0,0.0,31.0
max,891.0,1.0,3.0,80.0,8.0,6.0,512.3292


In [15]:
print(('We have {} total of passengers in the training set').format(len(train)))
print(('We have {} total of survivors in the training set').format(sum(train['Survived'])))
print(('\nWe have {}% of survivors').format(round(sum(train['Survived'])/len(train)*100)))

We have 891 total of passengers in the training set
We have 342 total of survivors in the training set

We have 38% of survivors


Is there a difference between men and women?

In [34]:
print(('There are {} men in the boat and {} survived').format(len(train[train['Sex']=='male']), sum(train[train['Sex']=='male']['Survived']==1)))
print(('{}% survived').format(round(sum(train[train['Sex']=='male']['Survived']==1)/len(train[train['Sex']=='male'])*100)))

print(('\nThere are {} women in the boat and {} survived').format(len(train[train['Sex']=='female']), sum(train[train['Sex']=='female']['Survived']==1)))
print(('{}% survived').format(round(sum(train[train['Sex']=='female']['Survived']==1)/len(train[train['Sex']=='female'])*100)))

There are 577 men in the boat and 109 survived
19% survived

There are 314 women in the boat and 233 survived
74% survived


**Women survived at a much higher rate than men (74% versus 19%), which suggests that women were given priority for rescue**
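
The per-group counts and rates printed above can also be computed in one step with `groupby`. This is a minimal sketch on a small made-up DataFrame (the `toy` frame and its values are hypothetical, not the Titanic data); on the real data the same call would be applied to `train` while `Sex` still holds the labels.

```python
import pandas as pd

# Hypothetical toy data, only to illustrate the pattern (not the real dataset).
toy = pd.DataFrame({
    'Sex': ['male', 'male', 'male', 'female', 'female'],
    'Survived': [0, 0, 1, 1, 1],
})

# One row per group: number of passengers, number of survivors, survival rate.
summary = toy.groupby('Sex')['Survived'].agg(['count', 'sum', 'mean'])
summary['percent'] = (summary['mean'] * 100).round()
print(summary)
```

Because `Survived` is 0 or 1, its mean within a group is that group's survival rate, so no separate division is needed.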

What about passenger class?

In [38]:
print(('There are {} first class in the boat and {} survived').format(len(train[train['Pclass']==1]), sum(train[train['Pclass']==1]['Survived']==1)))
print(('{}% survived').format(round(sum(train[train['Pclass']==1]['Survived']==1)/len(train[train['Pclass']==1])*100)))

print(('\nThere are {} third class in the boat and {} survived').format(len(train[train['Pclass']==3]), sum(train[train['Pclass']==3]['Survived']==1)))
print(('{}% survived').format(round(sum(train[train['Pclass']==3]['Survived']==1)/len(train[train['Pclass']==3])*100)))

There are 216 first class in the boat and 136 survived
63% survived

There are 491 third class in the boat and 119 survived
24% survived


**There seems to be a correlation between passenger class and the chance of survival**

What about the age?

In [40]:
print(('There are {} children in the boat and {} survived').format(len(train[train['Age']<18]), sum(train[train['Age']<18]['Survived']==1)))
print(('{}% survived').format(round(sum(train[train['Age']<18]['Survived']==1)/len(train[train['Age']<18])*100)))

print(('\nThere are {} adults in the boat and {} survived').format(len(train[train['Age']>18]), sum(train[train['Age']>18]['Survived']==1)))
print(('{}% survived').format(round(sum(train[train['Age']>18]['Survived']==1)/len(train[train['Age']>18])*100)))

There are 113 children in the boat and 61 survived
54% survived

There are 575 adults in the boat and 220 survived
38% survived


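**Children also survived at a higher rate than adults (54% versus 38%), which suggests they were given priority.** Note that `Age > 18` leaves out passengers aged exactly 18, and the 177 passengers with a missing age fall into neither group.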
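
Boundary ages and missing values are easier to keep track of when every passenger is assigned to an explicit group. This is a minimal sketch, assuming a made-up `toy` frame and an 18-year cutoff for adulthood (both are illustrative assumptions, not part of the original analysis):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; NaN marks an unknown age.
toy = pd.DataFrame({
    'Age': [4.0, 17.0, 18.0, 40.0, np.nan],
    'Survived': [1, 1, 0, 0, 1],
})

# right=False makes the bins [0, 18) and [18, inf), so age 18 counts as adult.
toy['AgeGroup'] = pd.cut(toy['Age'], bins=[0, 18, np.inf],
                         right=False, labels=['child', 'adult'])

# Rows with a missing age get a NaN group and are left out by groupby.
rates = toy.groupby('AgeGroup', observed=True)['Survived'].mean()
print(rates)
print('Unknown age:', toy['AgeGroup'].isna().sum())
```

Counting the unknown ages separately makes it visible how many passengers the comparison does not cover.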

## Data Pre-processing
### Non-numeric features
A model cannot work with non-numeric values directly. Columns with two categories can be converted to 0/1 flags; here male=1 and female=0.

In [45]:
train['Sex'] = train['Sex'].apply(lambda x: 1 if x == "male" else 0)
train

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",1,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",0,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",0,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",0,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",1,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",1,27.0,0,0,211536,13.0000,,S
887,888,1,1,"Graham, Miss. Margaret Edith",0,19.0,0,0,112053,30.0000,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",0,,1,2,W./C. 6607,23.4500,,S
889,890,1,1,"Behr, Mr. Karl Howell",1,26.0,0,0,111369,30.0000,C148,C
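
A 0/1 flag only works for columns with two categories. For a column with more values, such as `Embarked`, one common option is one-hot encoding. This is a sketch on hypothetical toy data, and using `pd.get_dummies` here is my suggestion rather than something the tutorial does:

```python
import pandas as pd

# Hypothetical toy data using port codes as they appear in the Embarked column.
toy = pd.DataFrame({'Sex': ['male', 'female', 'female'],
                    'Embarked': ['S', 'C', 'Q']})

# Two categories: a single 0/1 column is enough.
toy['Sex'] = toy['Sex'].map({'male': 1, 'female': 0})

# More than two categories: one 0/1 column per value (one-hot encoding).
encoded = pd.get_dummies(toy, columns=['Embarked'], dtype=int)
print(encoded)
```

Each port gets its own column, so the model does not read an artificial order into the codes.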


## Are there missing values?
Let's check.

In [46]:
train.isna().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

We can handle missing values in different ways:
- Removing the affected rows or columns
- Replacing them with the column mean

Here we'll replace the missing Age values with the mean.

In [48]:
train['Age'] = train['Age'].fillna(np.mean(train['Age']))

In [49]:
train.isna().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age              0
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
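
To make the options concrete, here is a small sketch comparing mean filling with two alternatives, median filling and dropping rows. The `ages` series is hypothetical toy data, and the median is an extra option that the tutorial does not use:

```python
import numpy as np
import pandas as pd

# Hypothetical toy ages with one missing value and one outlier (80).
ages = pd.Series([20.0, 22.0, 24.0, 80.0, np.nan], name='Age')

mean_filled = ages.fillna(ages.mean())      # mean of the known ages is 36.5
median_filled = ages.fillna(ages.median())  # median of the known ages is 23.0
dropped = ages.dropna()                     # removes the row instead of filling it

print(mean_filled.iloc[-1], median_filled.iloc[-1], len(dropped))
```

The median is less sensitive to outliers than the mean, which can matter for a skewed column.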

## Remove useless data in our dataset
Let's look at the data again to decide which columns we need.

In [50]:
train.head(5)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",1,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",0,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",0,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",0,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",1,35.0,0,0,373450,8.05,,S


We don't need:
- PassengerId
- Name
- Ticket
- Embarked
- Cabin (it could actually be informative, but 687 of its 891 values are missing)

In [52]:
columns_to_del = ['PassengerId', 'Name', 'Ticket', 'Embarked', 'Cabin']
train = train.drop(columns_to_del, axis=1)
train

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Fare
0,0,3,1,22.000000,1,0,7.2500
1,1,1,0,38.000000,1,0,71.2833
2,1,3,0,26.000000,0,0,7.9250
3,1,1,0,35.000000,1,0,53.1000
4,0,3,1,35.000000,0,0,8.0500
...,...,...,...,...,...,...,...
886,0,2,1,27.000000,0,0,13.0000
887,1,1,0,19.000000,0,0,30.0000
888,0,3,0,29.699118,1,2,23.4500
889,1,1,1,26.000000,0,0,30.0000


Let's define our X (features) and y (target)

In [53]:
X = train.drop('Survived', axis=1)
y = train['Survived']

## Split our dataset in 80/20
It's important to hold out part of the dataset so that the model is evaluated on data it has not seen during training. (Cross-validation is another good method, but I don't know how to use it properly yet.)

In [54]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
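
For reference, a minimal sketch of what cross-validation could look like with scikit-learn's `cross_val_score`. The data comes from `make_classification` as a synthetic stand-in (an assumption for illustration; `X_demo` and `y_demo` are not the Titanic features), and `cv=5` is an arbitrary choice:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption for illustration, not the Titanic set).
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=42)

# cv=5: train on four fifths, score on the remaining fifth, five times over.
scores = cross_val_score(DecisionTreeClassifier(max_depth=3, random_state=42),
                         X_demo, y_demo, cv=5, scoring='accuracy')
print(scores, scores.mean())
```

Averaging the five scores gives an estimate that depends less on one particular 80/20 split.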

## Training the model
The tutorial I followed uses DecisionTreeClassifier.

In [55]:
from sklearn.tree import DecisionTreeClassifier

classifier = DecisionTreeClassifier()
classifier.fit(X_train, y_train)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
                       max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort=False,
                       random_state=None, splitter='best')

## Evaluation of the model
Let's calculate the accuracy score: the number of correct predictions divided by the total number of predictions. Here, it is the share of passengers whose survival the model predicted correctly.

In [58]:
from sklearn.metrics import accuracy_score
print(('Training accuracy : {}').format(accuracy_score(y_train, classifier.predict(X_train))))
print(('Validation accuracy : {}').format(accuracy_score(y_test, classifier.predict(X_test))))

Training accuracy : 0.9803370786516854
Validation accuracy : 0.7653631284916201
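
As a quick check of that definition, accuracy can be computed by hand and compared with `accuracy_score`. The labels below are hypothetical, made up for five passengers:

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical true labels and predictions for five passengers.
y_true = np.array([1, 0, 1, 1, 0])
y_pred = np.array([1, 0, 0, 1, 1])

# Accuracy by hand: fraction of positions where prediction equals truth.
manual = (y_true == y_pred).mean()   # 3 correct out of 5
print(manual, accuracy_score(y_true, y_pred))
```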


The gap between the training score and the validation score is a sign of **overfitting**.

## Improving the model

Overfitting can be reduced by limiting the depth of the tree.

In [60]:
classifier = DecisionTreeClassifier(max_depth=3)
classifier.fit(X_train, y_train)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=3,
                       max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort=False,
                       random_state=None, splitter='best')

In [61]:
print(('Training accuracy : {}').format(accuracy_score(y_train, classifier.predict(X_train))))
print(('Validation accuracy : {}').format(accuracy_score(y_test, classifier.predict(X_test))))

Training accuracy : 0.8342696629213483
Validation accuracy : 0.7988826815642458
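
Rather than trying a single depth, one could compare several values of `max_depth` side by side. This is a rough sketch on synthetic stand-in data (`make_classification` with some label noise, an assumption for illustration); the depth values are arbitrary, and the numbers it prints say nothing about the Titanic data:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with noisy labels (not the Titanic set).
X_demo, y_demo = make_classification(n_samples=300, n_features=6,
                                     flip_y=0.2, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X_demo, y_demo,
                                            test_size=0.2, random_state=42)

results = {}
for depth in [1, 2, 3, 5, None]:   # None lets the tree grow without a depth limit
    tree = DecisionTreeClassifier(max_depth=depth, random_state=42)
    tree.fit(X_tr, y_tr)
    results[depth] = (accuracy_score(y_tr, tree.predict(X_tr)),
                      accuracy_score(y_val, tree.predict(X_val)))

# Print (depth, training accuracy, validation accuracy) for each setting.
for depth, (train_acc, val_acc) in results.items():
    print(depth, round(train_acc, 3), round(val_acc, 3))
```

A depth where the training accuracy keeps rising while the validation accuracy stalls or drops is a hint that the tree has started to memorise the training set.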


# Conclusion
This tutorial was quite good, but not great: we don't go deep with this model and we don't compare it with other models. Also, there isn't any data visualization, which would be interesting here for looking at the chance of survival depending on sex, age, etc.

What I've learned:
- 1st step: EDA. Explore the data a bit to understand it and draw some insights
- 2nd step: Data pre-processing (handling categorical features, missing values and useless columns)
- 3rd step: choose a model and train it
- 4th step: 