# Probability of survival of Titanic's passengers using Decision Trees and Random Forests

Techniques used: supervised learning, Decision Trees and Random Forests.

In this exercise, we use a Kaggle dataset available at this URL: https://www.kaggle.com/code/faressayah/decision-trees-random-forest-for-beginners/input

In [1]:
import pandas as pd
import numpy as np

In [2]:
# The data was downloaded as CSV files, so we load each one into a pandas DataFrame.
train_df = pd.read_csv('train.csv')
test_df = pd.read_csv('test.csv')

# First, we look at the first rows of the data.
train_df.head(5)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


Before starting the analysis, note the following column definitions from the dataset documentation:
- pclass: Ticket class. 1=1st, 2=2nd, 3=3rd
- sibsp: # of siblings/spouses aboard the Titanic
- parch: # of parents/children aboard the Titanic
- fare: Passenger fare
- Cabin: Cabin number
- Embarked: Port of embarkation. C=Cherbourg, Q=Queenstown, S=Southampton

In [3]:
# Now that the data is readable, our first step is to get to know both DataFrames.
# Size of each DataFrame (rows, columns)
print(train_df.shape)
print(test_df.shape)

#Information per column
print("\n Train dataset information")
print(train_df.info())
print("\n Test dataset information")
print(test_df.info())

(891, 12)
(418, 11)

 Train dataset information
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
None

 Test dataset information
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype  
---  ------      

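Looking at both summaries, there are several points to address:
1. In the training set, "Age" (714 of 891 non-null), "Cabin" (204 of 891) and "Embarked" (889 of 891) contain null values, and the test set has missing "Age" and "Cabin" values as well. These must be handled before modelling, since many estimators cannot handle missing values and dropping the affected rows would discard a large share of the data.
2. The columns "Name", "Sex", "Ticket", "Cabin" and "Embarked" have the object dtype. They must be converted, because the models we plan to use expect numeric input.
3. The column "Age" is stored as a float, which looks unexpected for an age. Part of the reason is that missing values are stored as NaN, which forces a float dtype; the data also contains fractional ages such as 34.5.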
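
The first point can be checked directly by counting nulls per column. The snippet below is a minimal sketch on a small stand-in frame: `toy_df` and its values are hypothetical and only mirror some column names of the dataset. On the real data the same calls would be applied to `train_df` and `test_df`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for train_df (hypothetical values, same column names as the dataset)
toy_df = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, "C123"],
    "Embarked": ["S", "C", "S", np.nan],
    "Fare": [7.25, 71.2833, 7.925, 53.1],
})

# Number of missing values per column, keeping only columns with at least one
missing_counts = toy_df.isna().sum()
missing_counts = missing_counts[missing_counts > 0]
print(missing_counts)

# Share of missing values per column, as a percentage
missing_pct = (toy_df.isna().mean() * 100).round(1)
print(missing_pct)
```

Seeing the percentages side by side helps decide which columns are worth imputing and which are too sparse to repair.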

Now that we know more about our data, let's dive more into the samples to correct each case.

The easiest case to correct is the "Sex" column. Intuitively, it should contain only two labels, one for female and one for male. If that's the case, we can simply turn it into binary data. If not, we must make sure there are at most three options: the two labels plus a missing-value marker such as "N/A".

In [4]:
print(train_df["Sex"].value_counts()) #data label counting
print(test_df["Sex"].value_counts()) #data label counting

# Both sets contain only two labels, so we can turn the column into binary data.
train_df["Sex_bool"] = train_df["Sex"].map({'female':1, 'male':0})
test_df["Sex_bool"] = test_df["Sex"].map({'female':1, 'male':0})

# We check that the changes have been made correctly.
train_df.head(5)
test_df.head(5)

# Now we delete the object-type "Sex" column.
train_df = train_df.drop('Sex', axis=1)
test_df = test_df.drop('Sex', axis=1)

# We check the resulting columns.
print(train_df.columns)
print(test_df.columns)

male      577
female    314
Name: Sex, dtype: int64
male      266
female    152
Name: Sex, dtype: int64
Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Age', 'SibSp', 'Parch',
       'Ticket', 'Fare', 'Cabin', 'Embarked', 'Sex_bool'],
      dtype='object')
Index(['PassengerId', 'Pclass', 'Name', 'Age', 'SibSp', 'Parch', 'Ticket',
       'Fare', 'Cabin', 'Embarked', 'Sex_bool'],
      dtype='object')


According to the documentation, the "Embarked" column has three possible values, so we can treat it as categorical data. Since the goal of the project is to practice Decision Trees and Random Forests, our best choice is One-hot Encoding. Ordinal encoding is not appropriate here, because trees split on numeric thresholds and would therefore treat the encoded values as if they had a real order, even though these categories have none.
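
To make the difference concrete, here is a small sketch on a hypothetical four-row frame (`ports` is not part of the dataset). It compares integer codes, which impose an order, with the indicator columns produced by `pd.get_dummies`.

```python
import pandas as pd

# Hypothetical toy column using the three documented port codes
ports = pd.DataFrame({"Embarked": ["S", "C", "Q", "S"]})

# Ordinal-style encoding: integer codes follow alphabetical order (C=0, Q=1, S=2),
# which implies an ordering that the ports do not have
ordinal_codes = ports["Embarked"].astype("category").cat.codes
print(ordinal_codes.tolist())

# One-hot encoding: one indicator column per port, no ordering implied
one_hot = pd.get_dummies(ports, columns=["Embarked"], dtype=int)
print(one_hot)
```

Note that `pd.get_dummies` ignores missing values by default, so a row with a null "Embarked" gets 0 in all three indicator columns.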

In [7]:
train_df = pd.get_dummies(train_df, columns=['Embarked'])
train_df

test_df = pd.get_dummies(test_df, columns=['Embarked'])
test_df

Unnamed: 0,PassengerId,Pclass,Name,Age,SibSp,Parch,Ticket,Fare,Cabin,Sex_bool,Embarked_C,Embarked_Q,Embarked_S
0,892,3,"Kelly, Mr. James",34.5,0,0,330911,7.8292,,0,0,1,0
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",47.0,1,0,363272,7.0000,,1,0,0,1
2,894,2,"Myles, Mr. Thomas Francis",62.0,0,0,240276,9.6875,,0,0,1,0
3,895,3,"Wirz, Mr. Albert",27.0,0,0,315154,8.6625,,0,0,0,1
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",22.0,1,1,3101298,12.2875,,1,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...
413,1305,3,"Spector, Mr. Woolf",,0,0,A.5. 3236,8.0500,,0,0,0,1
414,1306,1,"Oliva y Ocana, Dona. Fermina",39.0,0,0,PC 17758,108.9000,C105,1,1,0,0
415,1307,3,"Saether, Mr. Simon Sivertsen",38.5,0,0,SOTON/O.Q. 3101262,7.2500,,0,0,0,1
416,1308,3,"Ware, Mr. Frederick",,0,0,359309,8.0500,,0,0,0,1


The next columns to treat are "Name" and "Ticket". What they have in common is that both could be considered unique identifiers per passenger. For now, we simply convert them to categorical data, in case we later need to turn either of them into numerical codes. This changes only the data type, so the values are displayed exactly as before. It may also reduce memory usage, although the saving is small when most values are distinct. As an advantage, having them as categorical data lets us check whether they are indeed unique and analyze them against the target.
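
Whether the conversion actually saves memory depends on how often values repeat. The sketch below measures this on two hypothetical Series (`repeated` and `distinct` are made up for illustration): the saving is clear for a column with few repeated labels, and typically much smaller, or absent, when nearly every value is different, as with names.

```python
import pandas as pd

# Hypothetical toy columns: one with few repeated labels, one with all-distinct values
repeated = pd.Series(["S", "C", "Q", "S"] * 250, dtype="object")
distinct = pd.Series([f"Passenger {i}" for i in range(1000)], dtype="object")

def memory_bytes(series):
    """Memory used by a Series, including the Python string objects it holds."""
    return series.memory_usage(deep=True)

repeated_before = memory_bytes(repeated)
repeated_after = memory_bytes(repeated.astype("category"))
distinct_before = memory_bytes(distinct)
distinct_after = memory_bytes(distinct.astype("category"))

print("repeated labels:", repeated_before, "->", repeated_after)
print("distinct values:", distinct_before, "->", distinct_after)
```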

In [11]:
train_df['Name'] = train_df['Name'].astype('category')
train_df['Ticket'] = train_df['Ticket'].astype('category')

test_df['Name'] = test_df['Name'].astype('category')
test_df['Ticket'] = test_df['Ticket'].astype('category')

#Unique values
print("Name unique count: ",train_df['Name'].nunique())
print("Ticket unique count: ",train_df['Ticket'].nunique())

print("Test Name unique count: ",test_df['Name'].nunique())
print("Test ticket unique count: ",test_df['Ticket'].nunique())

Name unique count:  891
Ticket unique count:  681
Test Name unique count:  418
Test ticket unique count:  363


As we can see, the names are unique in both sets (891 and 418 distinct values, matching the number of rows), but the tickets are not: there are only 681 distinct tickets for 891 training passengers and 363 for 418 test passengers. This might just be a characteristic of the data, or it may be a clue to a structural relationship, such as passengers travelling together on a shared ticket. We will address this possible relationship when we get to the exploratory analysis. In this part, we focus only on getting quality data to analyze.
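
One way to inspect repeated tickets later is to count how many passengers share each one. This is a minimal sketch on a hypothetical frame: `toy_df`, its duplicated tickets and the `TicketGroupSize` column name are assumptions for illustration, not part of the dataset.

```python
import pandas as pd

# Toy stand-in: the ticket strings are illustrative and the duplication is invented
toy_df = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4, 5],
    "Ticket": ["113803", "113803", "373450", "PC 17599", "PC 17599"],
})

# Number of passengers travelling on each ticket
ticket_counts = toy_df["Ticket"].value_counts()

# Tickets that appear more than once
shared_tickets = ticket_counts[ticket_counts > 1]
print(shared_tickets)

# Attach the group size to every passenger row (hypothetical helper column)
toy_df["TicketGroupSize"] = toy_df["Ticket"].map(ticket_counts)
print(toy_df)
```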

Next, we address the null data in the "Age" column. There are two candidate solutions:
1. Calculate the median age per ticket class and impute it. On the one hand, this is simple and fast, the median is robust to outliers, and it makes sense that age is correlated with socioeconomic conditions. On the other hand, this technique ignores other variables that could make the estimate more accurate, such as sex or the number of family members aboard.
2. Impute using KNN. On the bright side, this solution takes more variables into account, which leads to more personalized estimates, and it can capture finer patterns such as the typical age for a specific passenger profile. This type of solution is normally much slower on huge datasets, but our dataset is small enough for it to be practical.

Both are reasonable for this problem, but since our dataset is small and we want a value that reflects each passenger's situation more closely, the most adequate solution is number 2.
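
Both options can be sketched on a small stand-in frame. The values in `toy_df`, the feature list and `n_neighbors=2` are assumptions chosen for illustration, not the settings used in this project. The KNN option uses scikit-learn's `KNNImputer`.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Toy stand-in for train_df (hypothetical values, numeric columns only)
toy_df = pd.DataFrame({
    "Pclass": [1, 1, 1, 3, 3, 3],
    "SibSp": [1, 1, 0, 0, 0, 1],
    "Parch": [0, 0, 0, 0, 0, 2],
    "Fare": [71.28, 53.10, 30.00, 7.25, 8.05, 23.45],
    "Sex_bool": [1, 1, 0, 0, 0, 1],
    "Age": [38.0, 35.0, np.nan, 22.0, 35.0, np.nan],
})

# Option 1: fill each missing age with the median age of the same ticket class
age_median = toy_df["Age"].fillna(
    toy_df.groupby("Pclass")["Age"].transform("median")
)

# Option 2: KNN imputation, where each missing age becomes the mean age of the
# n_neighbors most similar rows, measured on the other numeric features
features = ["Pclass", "SibSp", "Parch", "Fare", "Sex_bool", "Age"]
imputer = KNNImputer(n_neighbors=2)  # n_neighbors=2 is an arbitrary choice here
imputed = imputer.fit_transform(toy_df[features])
age_knn = pd.Series(imputed[:, features.index("Age")], index=toy_df.index)

print(pd.DataFrame({"median_by_class": age_median, "knn": age_knn}))
```

`KNNImputer` relies on a Euclidean-style distance, so a wide-ranging feature such as "Fare" can dominate the neighbor search unless the features are scaled first. To avoid leaking information, the imputer would normally be fitted on the training set and then applied to the test set with `transform`.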

Unnamed: 0,PassengerId,Survived,Pclass,Name,Age,SibSp,Parch,Ticket,Fare,Cabin,Sex_bool,Embarked_C,Embarked_Q,Embarked_S
0,1,0,3,"Braund, Mr. Owen Harris",22.0,1,0,A/5 21171,7.2500,,0,0,0,1
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",38.0,1,0,PC 17599,71.2833,C85,1,1,0,0
2,3,1,3,"Heikkinen, Miss. Laina",26.0,0,0,STON/O2. 3101282,7.9250,,1,0,0,1
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",35.0,1,0,113803,53.1000,C123,1,0,0,1
4,5,0,3,"Allen, Mr. William Henry",35.0,0,0,373450,8.0500,,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",27.0,0,0,211536,13.0000,,0,0,0,1
887,888,1,1,"Graham, Miss. Margaret Edith",19.0,0,0,112053,30.0000,B42,1,0,0,1
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",,1,2,W./C. 6607,23.4500,,1,0,0,1
889,890,1,1,"Behr, Mr. Karl Howell",26.0,0,0,111369,30.0000,C148,0,1,0,0


In [None]:
#Statistical description quick view
print(train_df.describe())