# Titanic - Machine Learning from Disaster

## The Dataset

The Titanic dataset is a classic in machine learning.

The data for this project comes from [Kaggle](https://www.kaggle.com/competitions/titanic/data) - you can explore & learn from [other people's solutions](https://www.kaggle.com/competitions/titanic/code) as well.

## Project Goals

Predict whether a passenger survived the Titanic disaster. This is a classification problem.

You can find the Jupyter Notebook for this lesson here - run it on Binder here.

## Project Plan

In this project, we will:

- explore the Titanic dataset using pandas,
- develop a first pipeline to make predictions with a baseline & random forest,
- develop a second pipeline to also use logistic regression and do grid searching.

## Exploratory Data Analysis

Let's start by loading our dataset:

In [1]:
import pandas as pd

data = pd.read_csv('./data/train.csv')

One option here is to separate out a holdout set before we continue with any further data analysis.  For this project, we will continue with the entire dataset.
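A minimal sketch of the holdout option, assuming a small made-up frame in place of `train.csv`. The toy data, the 20% holdout size and the seed are illustrative choices, not part of the original project:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the Titanic data, for illustration only.
toy = pd.DataFrame({
    "Survived": [0, 1] * 10,
    "Fare": [float(i) for i in range(20)],
})

# Set the holdout aside once; stratify keeps the class balance the same in both parts.
working, holdout = train_test_split(
    toy, test_size=0.2, random_state=0, stratify=toy["Survived"]
)

print(working.shape, holdout.shape)
```

All exploration and modelling would then use `working` only, with `holdout` kept back for a final check.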

### How many Rows and Columns Are There?

Our dataset has 891 rows and 12 columns:

In [2]:
data.shape

(891, 12)

### What Does The Data Look Like?

We can take a look at the raw data directly with `head`, `tail` and `sample`:

In [3]:
data.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [4]:
data.tail()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.45,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0,C148,C
890,891,0,3,"Dooley, Mr. Patrick",male,32.0,0,0,370376,7.75,,Q


In [5]:
data.sample(n=5)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
34,35,0,1,"Meyer, Mr. Edgar Joseph",male,28.0,1,0,PC 17604,82.1708,,C
156,157,1,3,"Gilnagh, Miss. Katherine ""Katie""",female,16.0,0,0,35851,7.7333,,Q
576,577,1,2,"Garside, Miss. Ethel",female,34.0,0,0,243880,13.0,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
367,368,1,3,"Moussa, Mrs. (Mantoura Boulos)",female,,0,0,2626,7.2292,,C


### Exploring the Features

Most of our features are self-explanatory - some of the less obvious features are explored below.  [The dataset is also documented on Kaggle](https://www.kaggle.com/competitions/titanic/data).

`SibSp` describes family relations - it is the number of siblings and spouses that passenger had on the ship:

In [6]:
data['SibSp'].value_counts()

0    608
1    209
2     28
4     18
3     16
8      7
5      5
Name: SibSp, dtype: int64

`Parch` also describes family relations - it is the number of parents and children that passenger had on the ship:

In [7]:
data['Parch'].value_counts()

0    678
1    118
2     80
5      5
3      5
4      4
6      1
Name: Parch, dtype: int64

`Ticket` is the ticket number - multiple passengers can be on the same ticket:

In [8]:
data['Ticket'].value_counts().sort_values()

312993      1
349234      1
A/5 3540    1
3101264     1
PC 17595    1
           ..
CA 2144     6
3101295     6
1601        7
CA. 2343    7
347082      7
Name: Ticket, Length: 681, dtype: int64

`Fare` is the cost of a ticket:

In [9]:
data['Fare'].describe()

count    891.000000
mean      32.204208
std       49.693429
min        0.000000
25%        7.910400
50%       14.454200
75%       31.000000
max      512.329200
Name: Fare, dtype: float64

`Cabin` is the cabin number - multiple passengers can be in the same cabin:

In [10]:
data['Cabin'].value_counts().sort_values()

B102               1
C99                1
B94                1
C87                1
D15                1
A31                1
B80                1
B86                1
B4                 1
C49                1
A7                 1
B19                1
D47                1
D7                 1
F E69              1
A32                1
C95                1
E10                1
B39                1
B82 B84            1
D6                 1
B3                 1
F38                1
E77                1
D11                1
D30                1
C46                1
D45                1
B101               1
B38                1
C45                1
C90                1
C62 C64            1
F G63              1
C110               1
A36                1
D10 D12            1
E31                1
C111               1
C104               1
C82                1
C106               1
E50                1
D37                1
E38                1
C128               1
E40                1
C91          

`Embarked` is the port of embarkation - where the passenger boarded the Titanic:

In [11]:
data['Embarked'].value_counts()

S    644
C    168
Q     77
Name: Embarked, dtype: int64

### Missing Values

We can check for missing values by taking the `sum` across the boolean array returned by `pd.DataFrame.isnull()`.

We can see we have missing values in `Age`, `Cabin` and `Embarked`:

In [12]:
data.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

In [13]:
data[~data['Cabin'].isnull()].shape

(204, 12)
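Raw counts are easier to judge as a share of all rows. In the output above, `Cabin` is missing in 687 of 891 rows (about 77%) and `Age` in 177 of 891 (about 20%). A minimal sketch of computing those shares directly, using a small made-up frame rather than the real data:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame, for illustration only.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare": [7.25, 71.28, 7.92, 53.1],
})

# The mean of a boolean column is the fraction of True values,
# so this gives the share of missing values per column.
missing_share = toy.isnull().mean().sort_values(ascending=False)
print(missing_share)
```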

## First Pipeline - Predict with a Baseline Model & Random Forest

For our first pipeline, we will:

1. test train split,
2. data cleaning / feature eng as needed (little as possible),
3. baseline model (dummy classification),
4. random forest.

The mindset for this first iteration is trying to figure out whether this problem is worth spending more time on.

### Test Train Split

The first thing we do is split our data into a train and a test set:

In [14]:
from sklearn.model_selection import train_test_split

train, test = train_test_split(data, test_size=0.15)
assert train.shape[0] > test.shape[0]

print(train.shape, test.shape)

(757, 12) (134, 12)
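The split above is not seeded, so each run puts different rows in the train and test sets and the scores change with it. A minimal sketch of making the split reproducible with `random_state`, on a made-up frame (the seed value is an arbitrary choice):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy frame, for illustration only.
toy = pd.DataFrame({"x": range(20)})

# Same seed, same split: the two calls below select identical rows.
train_a, test_a = train_test_split(toy, test_size=0.15, random_state=42)
train_b, test_b = train_test_split(toy, test_size=0.15, random_state=42)

print(test_a.index.equals(test_b.index))
```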


As we discovered during EDA, our data has null values:

In [15]:
train.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            156
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          589
Embarked         1
dtype: int64


### Drop Rows Where Age & Embarked Are Missing

We can deal with our missing values in the `Age` and `Embarked` columns by dropping the rows.  

We choose to drop rows here as we will not lose too much data when doing so:

In [16]:
train = train[~train['Age'].isnull()]
train = train[~train['Embarked'].isnull()]
print(train.isnull().sum(), train.shape)

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age              0
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          450
Embarked         0
dtype: int64 (600, 12)



### Drop Entire Cabin Column

The `Cabin` column has so many missing values that dropping the affected rows would cost us most of our training data, so we drop the entire column instead:

In [17]:
train = train.drop('Cabin', axis=1)
assert train.isnull().sum().sum() == 0
print(train.shape)

(600, 11)



### Encode Categorical Variables by Dropping

Our feature engineering for categorical variables here is to remove them - in a later iteration, we would integrate these as features using either one-hot encoding or label encoding:

In [18]:
train = train.drop(['Name', 'Sex', 'Embarked', 'Ticket'], axis=1)
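For reference, a minimal sketch of what one-hot encoding looks like, using `pd.get_dummies` on a made-up frame. This is one possible approach, not the encoding used in this first pipeline:

```python
import pandas as pd

# Hypothetical toy frame, for illustration only.
toy = pd.DataFrame({
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "C", "S"],
    "Fare": [7.25, 71.28, 7.92],
})

# One-hot encoding: each category becomes its own 0/1 indicator column.
encoded = pd.get_dummies(toy, columns=["Sex", "Embarked"], dtype=int)
print(encoded.columns.tolist())
```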

### Drop No Information PassengerId

`PassengerId` is a unique identifier for each passenger - it does not provide any information about the passenger, so we can drop it:

In [19]:
train = train.drop('PassengerId', axis=1)

### Create Target

Our target engineering involves separating the `Survived` column into a separate dataframe:

In [20]:
target = train['Survived'].to_frame()
features = train.drop('Survived', axis=1)
print(target.shape, features.shape)

(600, 1) (600, 5)



### Dummy Classifier

At this point we have both our target and our features - let's train our baseline:

In [21]:
from sklearn.dummy import DummyClassifier

mdl = DummyClassifier()
mdl = mdl.fit(features, target)
predictions = mdl.predict(features)
print(mdl.score(features, target))

0.605



We use the `.score()` method to get the accuracy of our model - what `.score` measures depends on the estimator: for classifiers it is mean accuracy, for regressors it is the R² coefficient of determination.  scikit-learn has [metrics for many machine learning problems](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.metrics).
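A small sketch of why the dummy classifier is a useful baseline, assuming made-up labels. With the `most_frequent` strategy (set explicitly here) it always predicts the majority class, so its accuracy equals the share of that class:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Hypothetical labels: 6 of 10 passengers did not survive.
y = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
X = np.zeros((10, 1))  # the dummy model ignores the features

baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
predictions = baseline.predict(X)

# For a classifier, .score() is the same number as accuracy_score.
print(baseline.score(X, y), accuracy_score(y, predictions))
```

Any real model has to beat this number to show it has learned something from the features.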

### Bring It All Together

We can bring together all the code for our first iteration into a single script:

In [22]:
from sklearn.model_selection import train_test_split

def pipeline(train):
    train = train[~train['Age'].isnull()]
    train = train[~train['Embarked'].isnull()]
    train = train.drop('Cabin', axis=1)
    train = train.drop(['Name', 'Sex', 'Embarked', 'Ticket'], axis=1)
    train = train.drop('PassengerId', axis=1)
    target = train['Survived'].to_frame()
    features = train.drop('Survived', axis=1)
    return features, target

data = pd.read_csv('data/train.csv')
train, test = train_test_split(data, test_size=0.15)
assert train.shape[0] > test.shape[0]

features_tr, target_tr = pipeline(train)
features_te, target_te = pipeline(test)

mdl = DummyClassifier()
mdl = mdl.fit(features_tr, target_tr)
print(mdl.score(features_tr, target_tr))
print(mdl.score(features_te, target_te))

0.6085526315789473
0.5192307692307693


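In this run, our baseline model reaches about 61% accuracy on the training set and 52% on the test set. The split is not seeded, so these numbers change from run to run.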

### Add Random Forest

Now let's add a random forest:

In [23]:
from sklearn.ensemble import RandomForestClassifier

mdl = RandomForestClassifier()
mdl = mdl.fit(features_tr, target_tr)
print(mdl.score(features_tr, target_tr))
print(mdl.score(features_te, target_te))

0.975328947368421
0.7692307692307693




The random forest reaches about 77% accuracy on the test set in this run, compared with almost 98% on the training set.

This gap suggests the model is overfitting - if we can reduce it, we can perhaps improve our generalization.
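One common way to reduce overfitting in a random forest is to constrain the trees. The sketch below compares an unconstrained and a constrained forest on synthetic, deliberately noisy data. The dataset, the seed and the chosen `max_depth` / `min_samples_leaf` values are assumptions for illustration, not tuned settings for the Titanic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic data; flip_y randomly reassigns a share of the labels to add noise.
X, y = make_classification(
    n_samples=600, n_features=5, n_informative=3, n_redundant=0,
    flip_y=0.2, random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.15, random_state=0)

candidates = {
    "unconstrained": {},
    "constrained": {"max_depth": 3, "min_samples_leaf": 5},
}

scores = {}
for name, params in candidates.items():
    mdl = RandomForestClassifier(random_state=0, **params).fit(X_tr, y_tr)
    # store (train accuracy, test accuracy) to compare the gap between them
    scores[name] = (mdl.score(X_tr, y_tr), mdl.score(X_te, y_te))
    print(name, scores[name])
```

A smaller gap between train and test accuracy is the sign to look for. Whether test accuracy itself improves depends on the data.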

### Final First Pipeline

Here is our complete final pipeline:

In [24]:
import pandas as pd
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

def pipeline(train):
    train = train[~train['Age'].isnull()]
    train = train[~train['Embarked'].isnull()]
    train = train.drop('Cabin', axis=1)
    train = train.drop(['Name', 'Sex', 'Embarked', 'Ticket'], axis=1)
    train = train.drop('PassengerId', axis=1)
    target = train['Survived'].to_frame()
    features = train.drop('Survived', axis=1)
    return features, target

data = pd.read_csv('data/train.csv')
train, test = train_test_split(data, test_size=0.15)
assert train.shape[0] > test.shape[0]

features_tr, target_tr = pipeline(train)
features_te, target_te = pipeline(test)

mdl = DummyClassifier()
mdl = mdl.fit(features_tr, target_tr)
print(mdl.score(features_tr, target_tr))
print(mdl.score(features_te, target_te))

mdl = RandomForestClassifier()
mdl = mdl.fit(features_tr, target_tr)
print(mdl.score(features_tr, target_tr))
print(mdl.score(features_te, target_te))



0.5963756177924218
0.5904761904761905
0.9736408566721582
0.6666666666666666


## Second Pipeline

For a second pipeline, we want to add:

- missing value imputation,
- categorical features,
- logistic regression,
- grid searching for hyperparameters.

### Small Refactor

Let's start with a small refactor of data loading:

In [25]:
def load_data():
    data = pd.read_csv('data/train.csv')
    train, test = train_test_split(data, test_size=0.15, random_state=42)
    assert train.shape[0] > test.shape[0]
    return train, test

train, test = load_data()

### Missing Value Imputation

In our first iteration, we dropped some samples due to missing values.

We will impute the missing values in the `Age` column using the median age:

In [26]:
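def impute_age(train, test):
    # work on copies so the caller's DataFrames are not modified in place
    train, test = train.copy(), test.copy()
    # compute the median on the training set only, then reuse it for both sets
    median_age = train['Age'].median()
    train['Age'] = train['Age'].fillna(median_age)
    test['Age'] = test['Age'].fillna(median_age)
    return train, test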

train, test = load_data()
train, test = impute_age(train, test)
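The same idea can be expressed with scikit-learn's `SimpleImputer`, which stores the statistic it learned during `fit`. This is a sketch on made-up ages, offered as an alternative rather than the approach used in this project:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy frames, for illustration only.
train_toy = pd.DataFrame({"Age": [20.0, 30.0, np.nan, 40.0]})
test_toy = pd.DataFrame({"Age": [np.nan, 50.0]})

# Learn the median from the training rows only, then reuse it for the test rows.
imputer = SimpleImputer(strategy="median")
train_age = imputer.fit_transform(train_toy[["Age"]])
test_age = imputer.transform(test_toy[["Age"]])

print(imputer.statistics_, test_age.ravel())
```

Fitting on the training rows only keeps information from the test set out of the preprocessing.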

### Encode Categorical Variables

To include categorical variables in our model, we will apply one-hot encoding to the `Sex` and `Embarked` columns:

In [27]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

def encode_categorical(train, test):
    ohe = OneHotEncoder(sparse_output=False)
    column_transformer = ColumnTransformer(
        transformers=[
            ('one_hot', ohe, ['Sex', 'Embarked'])
        ],
        remainder='passthrough',
        verbose_feature_names_out=False
    )

    # Fit the transformer on the train dataset and transform both train and test datasets
    column_transformer.fit(train)
    train_transformed = column_transformer.transform(train)
    test_transformed = column_transformer.transform(test)

    # Get the new column names after encoding
    columns = column_transformer.get_feature_names_out(input_features=train.columns)

    # Convert the transformed datasets back to DataFrames
    train_encoded = pd.DataFrame(train_transformed, columns=columns, index=train.index)
    test_encoded = pd.DataFrame(test_transformed, columns=columns, index=test.index)

    return train_encoded, test_encoded

train, test = load_data()
train, test = encode_categorical(train, test)

### Logistic Regression

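In addition to the Random Forest model, we will also use Logistic Regression as a classifier. Here we fit it on the first-pipeline features (`features_tr`, `target_tr`) that are still in memory: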

In [28]:
from sklearn.linear_model import LogisticRegression

mdl = LogisticRegression()
mdl.fit(features_tr, target_tr)
mdl.score(features_tr, target_tr)



0.7067545304777595
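scikit-learn classifiers expect the target as a 1-D array. Fitting with a single-column DataFrame, which is what `to_frame()` produces, typically triggers a `DataConversionWarning`. A minimal sketch of flattening the target first, using made-up data:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data, for illustration only.
features_toy = pd.DataFrame({"Fare": [7.25, 71.28, 7.92, 53.1, 8.05, 30.0]})
target_toy = pd.DataFrame({"Survived": [0, 1, 0, 1, 0, 1]})  # shape (6, 1)

# ravel() turns the (n, 1) column into the flat (n,) array scikit-learn expects.
y = target_toy.values.ravel()

mdl = LogisticRegression(max_iter=1000).fit(features_toy, y)
print(y.shape, mdl.score(features_toy, y))
```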

### Grid Search

Let's find good hyperparameters for our models using grid search:

In [29]:
from sklearn.model_selection import GridSearchCV

def grid_search(mdl, param_grid, features, target):
    grid = GridSearchCV(estimator=mdl, param_grid=param_grid, cv=2, verbose=1)
    # fit on the arguments passed in, with the target flattened to 1-D
    grid.fit(features, target.values.reshape(-1, ))
    return grid.best_params_

# Random Forest
param_grid_rf = {
    'n_estimators': [100, 200],
    'max_depth': [None, 10, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
}
mdl_rf = RandomForestClassifier()
best_rf_params = grid_search(mdl_rf, param_grid_rf, features_tr, target_tr)
print(f"best random forest params: {best_rf_params}")

# Logistic Regression
param_grid_log = {
    'C': [0.001, 0.1, 1, 100],
    'penalty': [None, 'l2'],
    'max_iter': [1000]
}
mdl_log = LogisticRegression()
best_log_params = grid_search(mdl_log, param_grid_log, features_tr, target_tr)
print(f"best logistic regression params: {best_log_params}")

Fitting 2 folds for each of 54 candidates, totalling 108 fits
best random forest params: {'max_depth': 30, 'min_samples_leaf': 2, 'min_samples_split': 10, 'n_estimators': 200}
Fitting 2 folds for each of 8 candidates, totalling 16 fits
best logistic regression params: {'C': 1, 'max_iter': 1000, 'penalty': 'l2'}
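Besides `best_params_`, a fitted `GridSearchCV` exposes the cross-validated score of the winning candidate and a refit model. The sketch below shows both on synthetic data. The dataset, the grid and `cv=5` are illustrative assumptions, not the settings used above:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic data, for illustration only.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

grid = GridSearchCV(
    estimator=LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 1, 100]},
    cv=5,
)
grid.fit(X, y)

# best_score_ is the mean cross-validated accuracy of the best candidate;
# best_estimator_ is that candidate refit on all of X, y (refit=True by default).
print(grid.best_params_, grid.best_score_)
print(grid.best_estimator_.score(X, y))
```

Using `best_estimator_` directly avoids building and fitting the model a second time from `best_params_`.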




### Final Second Pipeline

Now we will combine all the new steps into the second pipeline function:

In [30]:
def pipeline(train, test):
    train, test = encode_categorical(train, test)
    train, test = impute_age(train, test)
    train = train.drop(['Name', 'Ticket', 'Cabin', 'PassengerId'], axis=1)
    test = test.drop(['Name', 'Ticket', 'Cabin', 'PassengerId'], axis=1)

    # cast to int - after encode_categorical the passthrough columns no longer have an integer dtype
    target_tr = train['Survived'].to_frame().astype(int)
    target_te = test['Survived'].to_frame().astype(int)

    features_tr = train.drop('Survived', axis=1)
    features_te = test.drop('Survived', axis=1)
    return features_tr, target_tr, features_te, target_te

train, test = load_data()
features_tr, target_tr, features_te, target_te = pipeline(train, test)

# Dummy Classifier
mdl = DummyClassifier()
mdl = mdl.fit(features_tr, target_tr)
print(mdl.score(features_tr, target_tr))
print(mdl.score(features_te, target_te))

# Random Forest
params = grid_search(RandomForestClassifier(), param_grid_rf, features_tr, target_tr)
mdl_rf = RandomForestClassifier(**params)
mdl_rf = mdl_rf.fit(features_tr, target_tr)
print(mdl_rf.score(features_tr, target_tr))
print(mdl_rf.score(features_te, target_te))

# Logistic Regression
params = grid_search(LogisticRegression(), param_grid_log, features_tr, target_tr)
mdl_log = LogisticRegression(**params)
mdl_log = mdl_log.fit(features_tr, target_tr)
print(mdl_log.score(features_tr, target_tr))
print(mdl_log.score(features_te, target_te))

0.6221928665785997
0.582089552238806
Fitting 2 folds for each of 54 candidates, totalling 108 fits
0.8982826948480845
0.8283582089552238
Fitting 2 folds for each of 8 candidates, totalling 16 fits
0.8018494055482166
0.8059701492537313


We end our project with about 83% test accuracy from our random forest - a clear improvement over the 58% of the baseline and the 67% the random forest reached in our final first pipeline.

## Now It's Your Turn

Time for a third iteration!  

Take the code developed above and add to it:

- better data cleaning,
- more feature engineering,
- different models,
- different grid searches.

You can explore & learn from [other people's solutions on Kaggle](https://www.kaggle.com/competitions/titanic/code) as well.