This is the legendary Titanic ML competition: the best first challenge for getting started with ML competitions and learning how the Kaggle platform works. The challenge can be found here: https://www.kaggle.com/competitions/titanic.

The sinking of the Titanic is one of the most notorious shipwrecks in history.

On April 15, 1912, during its maiden voyage, the RMS Titanic, widely considered "unsinkable", sank after colliding with an iceberg. Unfortunately, there were not enough lifeboats for everyone on board, resulting in the death of 1502 out of 2224 passengers and crew.

Although there was some element of luck involved in survival, it appears that some groups of people were more likely to survive than others.

In this challenge, you are asked to build a predictive model that answers the question "what kind of people were most likely to survive?" using passenger data (name, age, gender, socio-economic class, etc.).

In [None]:
# load libraries
import pandas as pd
import numpy as np

The data were divided into two groups:
* training set (train.csv);
* test set (test.csv).

The training set is to be used to build your machine learning models. For the training set, the outcome for each passenger is provided. Your model will be based on features such as gender and passenger class. You can also use feature engineering to create new features.
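
As an illustration of feature engineering, here is a minimal sketch that derives two new columns from the existing ones. The helper name `add_family_features` and the two features (`FamilySize`, `Title`) are assumptions chosen for illustration, not part of the competition material; whether they actually help a model has to be checked empirically.

```python
import pandas as pd


def add_family_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with two illustrative engineered features."""
    out = df.copy()
    # FamilySize: the passenger plus siblings/spouses and parents/children on board
    out["FamilySize"] = out["SibSp"] + out["Parch"] + 1
    # Title: the text between the comma and the first period in Name,
    # e.g. "Braund, Mr. Owen Harris" -> "Mr"
    out["Title"] = (
        out["Name"].str.extract(r",\s*([^.]+)\.", expand=False).str.strip()
    )
    return out
```

Once the training set is loaded, `add_family_features(train)` returns a new DataFrame and leaves `train` itself unchanged.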

The test set should be used to see your model's performance on new data. For the test set, the ground truth is not provided. It is your job to predict these results. For each passenger in the test set, use the model you trained to predict whether or not they survived the sinking of the Titanic.
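
Once a model has produced one prediction per test passenger, the results need to be paired with the passenger identifiers. The sketch below assumes the submission is a table with a `PassengerId` column and a 0/1 `Survived` column; check the competition page for the exact format. The helper name `make_submission` is hypothetical.

```python
import pandas as pd


def make_submission(test_df: pd.DataFrame, predictions) -> pd.DataFrame:
    """Pair each test PassengerId with its predicted 0/1 Survived label."""
    if len(predictions) != len(test_df):
        raise ValueError("need exactly one prediction per test passenger")
    return pd.DataFrame({
        # to_numpy() avoids accidental index alignment between the two inputs
        "PassengerId": test_df["PassengerId"].to_numpy(),
        "Survived": pd.Series(predictions).astype(int).to_numpy(),
    })
```

The result can then be written out with `to_csv(..., index=False)` under whatever file name you prefer.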

In [None]:
# load training set
path = "/content/drive/MyDrive/Didattica/Corsi/Machine Learning (DLI)/notebook/train.csv"
train = pd.read_csv(path)

In [None]:
train

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C


The columns have the following meanings:
* Survived: survival (0 = no; 1 = yes);
* Pclass: passenger class (1 = first; 2 = second; 3 = third);
* Name: name;
* Sex: gender;
* Age: age;
* SibSp: number of siblings/spouses on board;
* Parch: number of parents/children on board;
* Ticket: ticket number;
* Fare: passenger fare;
* Cabin: cabin;
* Embarked: port of embarkation (C = Cherbourg; Q = Queenstown; S = Southampton).

In [None]:
train.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


Let's do some preliminary analysis.

In [None]:
train.describe()

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,891.0,891.0,891.0,714.0,891.0,891.0,891.0
mean,446.0,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,257.353842,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,1.0,0.0,1.0,0.42,0.0,0.0,0.0
25%,223.5,0.0,2.0,20.125,0.0,0.0,7.9104
50%,446.0,0.0,3.0,28.0,0.0,0.0,14.4542
75%,668.5,1.0,3.0,38.0,1.0,0.0,31.0
max,891.0,1.0,3.0,80.0,8.0,6.0,512.3292


In [None]:
# overall chance of survival
train['Survived'].mean()

0.3838383838383838

In [None]:
# chance of survival by class
train.groupby('Pclass')['Survived'].mean()

Pclass
1    0.629630
2    0.472826
3    0.242363
Name: Survived, dtype: float64

In [None]:
# chance of survival by class and gender
train.groupby(['Pclass','Sex'])['Survived'].mean()

Pclass  Sex   
1       female    0.968085
        male      0.368852
2       female    0.921053
        male      0.157407
3       female    0.500000
        male      0.135447
Name: Survived, dtype: float64
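
In every class, women survived at a higher rate than men, which suggests a very simple baseline: predict that a passenger survived if and only if the passenger is female. The sketch below is one way to score such a rule; the helper names `gender_baseline` and `accuracy` are hypothetical. The rule is weakest for third-class women, whose survival rate in the table above is exactly 0.5.

```python
import pandas as pd


def gender_baseline(df: pd.DataFrame) -> pd.Series:
    """Predict 1 (survived) for female passengers and 0 for everyone else."""
    return (df["Sex"] == "female").astype(int)


def accuracy(y_true: pd.Series, y_pred: pd.Series) -> float:
    """Fraction of passengers whose predicted label matches the true label."""
    return float((y_true == y_pred).mean())
```

Calling `accuracy(train['Survived'], gender_baseline(train))` gives a reference score that any more elaborate model should be expected to beat.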

In [None]:
# chance of survival by age group (10-year bins)
group_by_age = pd.cut(train['Age'], np.arange(0, 90, 10))
train.groupby(group_by_age)['Survived'].mean()

Age
(0, 10]     0.593750
(10, 20]    0.382609
(20, 30]    0.365217
(30, 40]    0.445161
(40, 50]    0.383721
(50, 60]    0.404762
(60, 70]    0.235294
(70, 80]    0.200000
Name: Survived, dtype: float64

In [None]:
# non-null values per column: Age, Cabin and Embarked have missing data
train.count()

PassengerId    891
Survived       891
Pclass         891
Name           891
Sex            891
Age            714
SibSp          891
Parch          891
Ticket         891
Fare           891
Cabin          204
Embarked       889
dtype: int64
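
Subtracting these counts from the 891 rows shows that `Age` is missing for 177 passengers, `Cabin` for 687 and `Embarked` for 2. A small helper can report this directly; the sketch below is one possible way to do it, and the name `missing_summary` is hypothetical.

```python
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Count and fraction of missing values per column, most incomplete first."""
    n_missing = df.isna().sum()
    summary = pd.DataFrame({
        "missing": n_missing,
        "fraction": n_missing / len(df),
    })
    return summary.sort_values("missing", ascending=False)
```

The `fraction` column helps decide between dropping a feature (when most values are absent) and filling in the gaps (when only a few are).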

# **Homework**

Remove unnecessary features. Remove features with missing data or fill in the missing values. Transform categorical values into numeric values. Normalize your features. Remember to apply to the test set the same transformations you applied to the training set. Finally, create and evaluate different predictive models.
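
One possible starting point is sketched below. It is not a reference solution: the choice of features, the median/mode imputation, the standardization of `Age` and `Fare`, the two models and the use of scikit-learn are all assumptions made for illustration, and the function names are hypothetical. The key idea is that every statistic is learned from the training set and then reused unchanged on the test set.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Name, Ticket and Cabin are dropped in this sketch (an assumption, not a rule)
FEATURES = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked"]


def fit_preprocessing(train_df: pd.DataFrame) -> dict:
    """Learn imputation and scaling statistics from the training set only."""
    params = {
        "age_median": train_df["Age"].median(),
        "fare_median": train_df["Fare"].median(),
        "embarked_mode": train_df["Embarked"].mode()[0],
    }
    age = train_df["Age"].fillna(params["age_median"])
    fare = train_df["Fare"].fillna(params["fare_median"])
    params["mean"] = {"Age": age.mean(), "Fare": fare.mean()}
    params["std"] = {"Age": age.std(), "Fare": fare.std()}
    return params


def apply_preprocessing(df: pd.DataFrame, params: dict) -> pd.DataFrame:
    """Apply the same transformations to any split (training or test)."""
    X = df[FEATURES].copy()
    # replace missing values with the statistics learned on the training set
    X["Age"] = X["Age"].fillna(params["age_median"])
    X["Fare"] = X["Fare"].fillna(params["fare_median"])
    X["Embarked"] = X["Embarked"].fillna(params["embarked_mode"])
    # categorical -> numeric: binary flag for Sex, one column per port
    X["Sex"] = (X["Sex"] == "female").astype(int)
    for port in ["C", "Q", "S"]:
        X[f"Embarked_{port}"] = (X["Embarked"] == port).astype(int)
    X = X.drop(columns="Embarked")
    # normalize the continuous features to zero mean and unit variance
    for col in ["Age", "Fare"]:
        X[col] = (X[col] - params["mean"][col]) / params["std"][col]
    return X


def compare_models(X: pd.DataFrame, y: pd.Series, cv: int = 5) -> dict:
    """Mean cross-validated accuracy for two example classifiers."""
    models = {
        "logistic_regression": LogisticRegression(max_iter=1000),
        "random_forest": RandomForestClassifier(n_estimators=200, random_state=0),
    }
    return {
        name: cross_val_score(model, X, y, cv=cv, scoring="accuracy").mean()
        for name, model in models.items()
    }
```

A typical call sequence would be `params = fit_preprocessing(train)`, then `apply_preprocessing(train, params)` and `apply_preprocessing(test, params)`, then `compare_models(...)` on the training features and `train['Survived']`. Strictly speaking, the imputation and scaling statistics should be recomputed inside each cross-validation fold; that refinement is skipped here for brevity.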