# Titanic

This is my first attempt at tackling an ML problem (mostly) on my own. I'll try to train a model and submit its predictions for Kaggle's [Titanic competition]().

First: let's load the data.

In [1]:
import os
import pandas as pd

titanic_train_set = pd.read_csv(os.path.join(os.getcwd(), "datasets", "train.csv"))

Nothing too difficult. Let's examine it and see what we got.

In [2]:
titanic_train_set.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [3]:
n_examples = titanic_train_set.shape[0]
print("Number of examples: ", n_examples)

Number of examples:  891


First impressions:
- `Cabin` has missing values for some rows.
- `Name`, `Sex`, `Ticket`, `Cabin` and `Embarked` have text values.

The label column is `Survived`. Don't see anything special there, so I'll put it aside. I might have to come back to clean it up in case there are missing entries.

In [4]:
labels = titanic_train_set["Survived"].copy()
features = titanic_train_set.drop(labels=["Survived"], axis=1)

Ok. The features (`X`) and the labels (`y`) have been separated.

Now I'd like to do some cleanup. Right off the bat I have a hunch that says the `Name` column won't help a lot with the predictions. From what I know about the Titanic tragedy, and what's stated on the [competition page](https://www.kaggle.com/c/titanic/data), a person's sex and age played a big part in determining whether a person was allowed aboard the lifeboats. `Name` and `Sex` are probably highly correlated, but I'll use the latter because it's way easier to encode numerically and well, because why would we want to predict a person's sex through their name if we already know it?

I can't see how `Embarked` (the column holding the port where each passenger embarked) will be of much use, so I'll drop it too, along with `PassengerId`.

Another feature that I suspect won't tell us much is `Cabin`.

In [5]:
features["Cabin"].notna().sum()

204

687 rows out of 891 don't have a value for `Cabin`. Only 204 rows have one, which is less than a quarter of the available examples, and right now I wouldn't know how to extrapolate from those rows to the rest. :/

I'll drop it.
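
To put a number on "too sparse", here is a minimal sketch of the missing-value share. The `cabin` Series is a hypothetical toy stand-in for `features["Cabin"]`, not the real data; the last line just redoes the arithmetic with the counts reported above.

```python
import pandas as pd

# Hypothetical toy column standing in for features["Cabin"] (not the real data).
cabin = pd.Series([None, "C85", None, "C123", None, None, "E46", None], name="Cabin")

n_present = cabin.notna().sum()        # rows that have a value
missing_fraction = cabin.isna().mean() # share of rows without a value
print(n_present, missing_fraction)

# Same arithmetic with the counts from the notebook: 204 present out of 891 rows.
present_share_real = 204 / 891
print(round(present_share_real, 3))
```

On the real column, `features["Cabin"].isna().mean()` should give the same kind of figure directly.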

Last, I'm pretty sure the ticket number doesn't matter, but I want to check something first.

In [6]:
print("Number of examples (rows): ", n_examples)
print("Number of unique Ticket values: ", features["Ticket"].nunique())

Number of examples (rows):  891
Number of unique Ticket values:  681


_Huhhh_. I really thought there was gonna be one ticket per passenger. Maybe there are rows with no value for `Ticket`.

In [7]:
print("Number of null/ NA Ticket values: ", features["Ticket"].isna().sum())

Number of null/ NA Ticket values:  0


I guess everyone had a ticket, then!

Ok, so maybe some passengers share a single ticket.

In [8]:
features["Ticket"].value_counts().head()

1601        7
CA. 2343    7
347082      7
CA 2144     6
3101295     6
Name: Ticket, dtype: int64

Bingo. Nevertheless, with 681 distinct tickets for 891 passengers (about 1.3 passengers per ticket), there are way too many unique values to try and create a category out of the tickets. I'll drop it too.
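
A small sketch of how the ticket sharing could be summarized. The `toy` frame and its ticket strings are made up for illustration; only `value_counts` and `nunique` are the same calls used above.

```python
import pandas as pd

# Hypothetical toy data: six passengers spread over three tickets.
toy = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4, 5, 6],
    "Ticket": ["1601", "1601", "347082", "1601", "A/5 21171", "347082"],
})

# How many passengers travel on each ticket.
passengers_per_ticket = toy["Ticket"].value_counts()

# Share of tickets that are held by exactly one passenger.
single_share = (passengers_per_ticket == 1).mean()

# Average number of passengers per ticket (rows / unique tickets).
ratio = len(toy) / toy["Ticket"].nunique()

print(passengers_per_ticket)
print(single_share, ratio)
```

The closer `ratio` is to 1, the closer the column is to a unique identifier, which is why it looks like a poor candidate for a categorical feature.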

I'm not sure about the `SibSp` and `Parch` columns, but I'll leave them there for the time being.

In [9]:
dropped_cols = ["PassengerId", "Name", "Ticket", "Cabin", "Embarked"]
features_clean = features.drop(labels=dropped_cols, axis=1)
features_clean.head()

Unnamed: 0,Pclass,Sex,Age,SibSp,Parch,Fare
0,3,male,22.0,1,0,7.25
1,1,female,38.0,1,0,71.2833
2,3,female,26.0,0,0,7.925
3,1,female,35.0,1,0,53.1
4,3,male,35.0,0,0,8.05


_Boom_.

2nd impressions:
- I have to divide `Sex` into categories. Seems like a job for a one-hot encoder, according to Chapter 2 of [Hands On Machine Learning](https://learning.oreilly.com/library/view/hands-on-machine-learning/9781492032632/). Specifically, it has few values (`male` and `female`), so the sparse matrix won't be too large.
- I have a strong feeling that `Fare` and `Pclass` (passenger class) are strongly correlated B)
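
The hunch about `Fare` and `Pclass` could be checked with a plain Pearson correlation. This is only a sketch on a hypothetical toy frame (six invented rows), so the number it prints says nothing about the real dataset.

```python
import pandas as pd

# Hypothetical toy rows; the real check would use features_clean instead.
toy = pd.DataFrame({
    "Pclass": [3, 1, 3, 1, 2, 2],
    "Fare": [7.25, 71.28, 7.93, 53.10, 13.00, 26.00],
})

# Series.corr computes the Pearson correlation by default.
corr = toy["Pclass"].corr(toy["Fare"])
print(corr)
```

A negative value would be the expected direction here, since first class is encoded as `1` and third class as `3`.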

Let's transform the `Sex` feature into a category.

In [10]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

col_trans = ColumnTransformer([
    ("cat", OneHotEncoder(), ["Sex"]),
], remainder="passthrough")

features_trans = col_trans.fit_transform(features_clean)

features_trans

array([[ 0.    ,  1.    ,  3.    , ...,  1.    ,  0.    ,  7.25  ],
       [ 1.    ,  0.    ,  1.    , ...,  1.    ,  0.    , 71.2833],
       [ 1.    ,  0.    ,  3.    , ...,  0.    ,  0.    ,  7.925 ],
       ...,
       [ 1.    ,  0.    ,  3.    , ...,  1.    ,  2.    , 23.45  ],
       [ 0.    ,  1.    ,  1.    , ...,  0.    ,  0.    , 30.    ],
       [ 0.    ,  1.    ,  3.    , ...,  0.    ,  0.    ,  7.75  ]])
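
It isn't obvious from the array which of the first two columns is `female` and which is `male`. A minimal sketch of how to look that up, using a hypothetical three-row toy frame; `named_transformers_` and `categories_` are attributes of the fitted `ColumnTransformer` and `OneHotEncoder`.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy frame with the same kind of columns as features_clean.
toy = pd.DataFrame({"Sex": ["male", "female", "female"], "Pclass": [3, 1, 3]})

ct = ColumnTransformer([
    ("cat", OneHotEncoder(), ["Sex"]),
], remainder="passthrough")
out = ct.fit_transform(toy)

# The result may be sparse or dense depending on its density; normalize to dense.
dense = out.toarray() if hasattr(out, "toarray") else np.asarray(out)

# The fitted encoder stores its categories in column order.
encoder = ct.named_transformers_["cat"]
print(encoder.categories_)
print(dense)
```

The categories are sorted, so the first one-hot column corresponds to `female` and the second to `male`, followed by the passthrough columns.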

Let's train a linear regression model!

In [11]:
from sklearn.linear_model import LinearRegression

lin_reg = LinearRegression()
lin_reg.fit(features_trans, labels)

ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

Whooops. An error.
```
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
```
I forgot to check all the other columns for `NaN` values...

In [None]:
features_clean.isna().any()

The only column with missing values is `Age`. Let's see how many.

In [None]:
features_clean["Age"].isna().sum()

More than I expected, but still less than 1/4 of the rows. I'll follow the advice from the book and fill it with the median using an imputer.
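
To see what median imputation does in isolation, here is a tiny sketch on a made-up `ages` array (hypothetical values, not taken from the dataset).

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical single-column input with two missing ages.
ages = np.array([[22.0], [np.nan], [26.0], [35.0], [np.nan]])

imputer = SimpleImputer(strategy="median")
filled = imputer.fit_transform(ages)

# statistics_ holds the value learned per column (here: the median of 22, 26, 35).
print(imputer.statistics_)
print(filled.ravel())
```

Every `NaN` gets replaced by the same learned value, so the column's median stays put while its spread shrinks a little.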

In [14]:
from sklearn.impute import SimpleImputer

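# "Sex" holds strings, so leave it out of the columns that get median-imputed.
features_clean_num = features_clean.drop(["Sex"], axis=1)

# It's easier to do everything in one go with a single ColumnTransformer.
# Note: the output now has the numeric columns first, then the one-hot "Sex" columns.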
col_trans = ColumnTransformer([
    ("imputer", SimpleImputer(strategy="median"), list(features_clean_num)),
    ("cat", OneHotEncoder(), ["Sex"]),
], remainder="passthrough")

features_trans_nona = col_trans.fit_transform(features_clean)
lin_reg.fit(features_trans_nona, labels)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)

w00t, it worked!

Let's load the test data and make some predictions.

In [16]:
test_raw = pd.read_csv(os.path.join(os.getcwd(), "datasets", "test.csv"))
test_raw.head()

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,,S
2,894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q
3,895,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S


The dataset has the same columns, except the labels (`Survived`), as expected.
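
A side note on preprocessing a test set: as far as I understand, the usual scikit-learn pattern is to fit transformers on the training data only and then call `transform` on new data. A minimal sketch of the difference, with hypothetical `train_age` and `test_age` arrays:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical toy columns; the medians differ on purpose.
train_age = np.array([[20.0], [30.0], [np.nan], [40.0]])
test_age = np.array([[np.nan], [60.0], [80.0]])

imputer = SimpleImputer(strategy="median")
imputer.fit(train_age)                     # learns the median from training rows only

# transform reuses the training median for the missing test value.
test_filled = imputer.transform(test_age)

# Refitting on the test set would instead learn the test median.
test_refit = SimpleImputer(strategy="median").fit_transform(test_age)

print(test_filled.ravel())
print(test_refit.ravel())
```

With `transform`, the test rows are filled using what was learned from the training set, so the model sees inputs prepared the same way it was trained on.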

In [43]:
import numpy as np

test_clean = test_raw.drop(labels=dropped_cols, axis=1)

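# Reuse the transformer fitted on the training set; calling fit_transform here
# would re-learn the medians and categories from the test data.
test_trans = col_trans.transform(test_clean)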

predictions = list(map(int, lin_reg.predict(test_trans).round()))
predictions

[0,
 0,
 0,
 0,
 1,
 0,
 1,
 0,
 1,
 0,
 0,
 0,
 1,
 0,
 1,
 1,
 0,
 0,
 1,
 1,
 0,
 0,
 1,
 0,
 1,
 0,
 1,
 0,
 0,
 0,
 0,
 0,
 1,
 1,
 0,
 0,
 1,
 1,
 0,
 0,
 0,
 0,
 0,
 1,
 1,
 0,
 0,
 0,
 1,
 1,
 0,
 0,
 1,
 1,
 0,
 0,
 0,
 0,
 0,
 1,
 0,
 0,
 0,
 1,
 1,
 1,
 1,
 0,
 0,
 1,
 1,
 0,
 1,
 0,
 1,
 1,
 0,
 1,
 0,
 1,
 0,
 0,
 0,
 0,
 0,
 0,
 1,
 1,
 1,
 0,
 1,
 0,
 1,
 0,
 0,
 0,
 1,
 0,
 1,
 0,
 1,
 0,
 0,
 0,
 1,
 0,
 0,
 0,
 0,
 0,
 0,
 1,
 1,
 1,
 1,
 0,
 0,
 1,
 0,
 1,
 1,
 0,
 1,
 0,
 0,
 1,
 0,
 1,
 0,
 0,
 0,
 0,
 1,
 0,
 0,
 0,
 0,
 0,
 1,
 0,
 0,
 1,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 1,
 0,
 0,
 1,
 0,
 0,
 1,
 1,
 0,
 1,
 1,
 0,
 1,
 0,
 0,
 1,
 0,
 0,
 1,
 1,
 0,
 0,
 0,
 0,
 0,
 1,
 1,
 0,
 1,
 1,
 0,
 0,
 1,
 0,
 1,
 0,
 1,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 1,
 1,
 0,
 1,
 1,
 0,
 0,
 1,
 0,
 0,
 1,
 0,
 1,
 0,
 0,
 0,
 0,
 1,
 0,
 0,
 1,
 0,
 1,
 0,
 1,
 0,
 1,
 0,
 1,
 1,
 0,
 1,
 0,
 0,
 0,
 1,
 0,
 0,
 0,
 0,
 0,
 0,
 1,
 1,
 1,
 1,
 0,
 0,
 0,
 0,
 1,
 0,
 1,
 1,
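
One caveat about rounding: a linear regression's outputs aren't bounded to the 0 to 1 range, so rounding them can in principle produce labels other than `0` and `1`. A small sketch with hypothetical raw outputs, comparing rounding against an explicit 0.5 threshold:

```python
import numpy as np

# Hypothetical raw regression outputs, including values outside [0, 1].
raw = np.array([-0.12, 0.31, 0.5, 0.77, 1.2, 1.6])

# Rounding keeps whatever integer is nearest, so 1.6 becomes 2.
rounded = raw.round().astype(int)

# Thresholding always yields 0 or 1.
thresholded = (raw >= 0.5).astype(int)

print(rounded)
print(thresholded)
```

Note that NumPy rounds exact halves to the nearest even number, so `0.5` rounds to `0` while the threshold version maps it to `1`. Which behavior is preferable is a judgment call; the threshold is just the more explicit of the two.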


Now let's export it into the format expected by Kaggle.

In [44]:
# Zip the passenger ID and corresponding predictions together. 
output = np.array(list(zip(test_raw["PassengerId"], predictions)))
formatted_output = pd.DataFrame(output, columns=["PassengerId", "Survived"])
formatted_output.to_csv(os.path.join(os.getcwd(), "output.csv"), index=False)
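
The same file could probably be built without the `zip` and NumPy detour by passing a dict of columns to `pd.DataFrame`. A sketch with hypothetical IDs and predictions, writing to an in-memory buffer instead of `output.csv` so it can be inspected:

```python
import io
import pandas as pd

# Hypothetical stand-ins for test_raw["PassengerId"] and the predictions list.
passenger_ids = [892, 893, 894]
predictions = [0, 1, 0]

# One column per dict key; column order follows the dict.
submission = pd.DataFrame({"PassengerId": passenger_ids, "Survived": predictions})

# Write to a buffer here; in the notebook this would be a file path.
buffer = io.StringIO()
submission.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```

`index=False` matters in both versions: without it, pandas adds an extra unnamed index column to the file.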

Let's submit it!

_drumroll_

![1st Submission](img/1st-submission.png)

A score of ~77% accuracy! I was expecting my first submission to perform waaay worse. No doubt it can improve. I can explore other models, come up with new features, or drop some more.

To try the simplest thing first, I will drop the `SibSp` and `Parch` columns and see whether the performance drops.
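
One way to compare the two feature sets locally, without spending a Kaggle submission on each try, might be cross-validation. This is only a sketch under assumptions: the data is synthetic (generated with a fixed seed), `cv_accuracy` is a hypothetical helper, and the 0.5 threshold is my choice for turning regression outputs into labels.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_predict

# Synthetic toy data: the label depends mostly on "sex", not on "sibsp".
rng = np.random.default_rng(0)
n = 200
sex = rng.integers(0, 2, size=n)
sibsp = rng.integers(0, 4, size=n)
y = (sex + rng.normal(0, 0.3, size=n) > 0.5).astype(int)

X_full = np.column_stack([sex, sibsp])   # with the extra column
X_reduced = X_full[:, :1]                # extra column dropped


def cv_accuracy(X, y):
    """Hypothetical helper: 5-fold out-of-fold accuracy of a thresholded linear regression."""
    preds = cross_val_predict(LinearRegression(), X, y, cv=5)
    return float(((preds >= 0.5).astype(int) == y).mean())


acc_full = cv_accuracy(X_full, y)
acc_reduced = cv_accuracy(X_reduced, y)
print(acc_full, acc_reduced)
```

If the two accuracies come out about the same, the dropped column probably wasn't contributing much, at least for this model.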