# Titanic

2018-04-19

This is my first attempt at tackling an ML problem (mostly) on my own.

First: let's load the data.

In [13]:
import os
import pandas as pd

titanic_train_set = pd.read_csv(os.path.join(os.getcwd(), "datasets", "train.csv"))

Nothing too difficult. Let's examine it and see what we got.

In [14]:
titanic_train_set.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


In [58]:
n_examples = titanic_train_set.shape[0]
print("Number of examples: ", n_examples)

Number of examples:  891


First impressions:
- `Cabin` has missing values for some rows.
- `Name`, `Sex`, `Ticket`, `Cabin` and `Embarked` have text values.

The label column is `Survived`. Don't see anything special there, so I'll put it aside. I might have to come back to clean it up in case there are missing entries.

In [25]:
labels = titanic_train_set["Survived"].copy()
features = titanic_train_set.drop(labels=["Survived"], axis=1)

Ok. The features (`X`) and the labels (`y`) have been separated.
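The worry about missing entries in `Survived` can be checked directly instead of left for later. Here is a minimal sketch, assuming a toy stand-in built from the five `Survived` values shown by `head()` above; `labels_sample` is a name made up for this example, and on the real data the same two calls would run on `labels`.

```python
import pandas as pd

# Toy stand-in for the label column: the five `Survived` values shown by head() above.
labels_sample = pd.Series([0, 1, 1, 1, 0], name="Survived")

# Count missing entries; 0 means the labels need no cleanup.
n_missing = int(labels_sample.isna().sum())

# Share of each class, to see how balanced the labels are.
class_share = labels_sample.value_counts(normalize=True)

print("Missing labels:", n_missing)
print(class_share)
```

If the count comes back as 0 on the full `labels` series, there is nothing to clean up in the label column.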

Now I'd like to do some cleanup. Right off the bat, I have a hunch that the `Name` column won't help much with the predictions. From what I know about the Titanic tragedy, and from what's stated on the [competition page](https://www.kaggle.com/c/titanic/data), a person's sex and age played a big part in determining whether they were allowed aboard the lifeboats. `Name` and `Sex` are probably highly correlated (the titles `Mr.`, `Mrs.` and `Miss.` give it away), but I'll use the latter because it's way easier to encode numerically. And, well, why would we want to infer a person's sex from their name if we already know it?

I can't see how `Embarked` (the column holding the port where each passenger embarked) will be of much use, so I'll drop it too.

Another feature that I suspect won't tell us much is `Cabin`.

In [61]:
features["Cabin"].notna().sum()

204

687 rows out of 891 don't have a value for `Cabin`. The 204 rows that do are under a quarter of the examples, and right now I wouldn't know how to extrapolate them to the rest. :/

I'll drop it.
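To put a number on how sparse a column is, the fraction of missing values per column is a quick check. The sketch below assumes a toy frame built from the `Age` and `Cabin` values of the five rows shown by `head()` above (`sample` and `missing_share` are names made up for this example), and then repeats the ratio for the full set using the counts already printed.

```python
import numpy as np
import pandas as pd

# Toy frame built from the five rows shown by head() above (Age and Cabin only).
sample = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0, 35.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, "C123", np.nan],
})

# isna() gives booleans, so the column mean is the fraction of missing values.
missing_share = sample.isna().mean()
print(missing_share)

# The same ratio for the full training set, from the counts printed above.
full_cabin_missing = 687 / 891
print(round(full_cabin_missing, 3))
```

On the real data, `features.isna().mean()` would give the same per-column view for every feature at once.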

Last, I'm pretty sure the ticket number doesn't matter, but I wanna check something first.

In [59]:
print("Number of examples (rows): ", n_examples)
print("Number of unique Ticket values: ", features["Ticket"].nunique())

Number of examples (rows):  891
Number of unique Ticket values:  681


_Huhhh_. I really thought there was gonna be one ticket per passenger. Maybe there are rows with no value for `Ticket`.

In [27]:
print("Number of null/ NA Ticket values: ", features["Ticket"].isna().sum())

Number of null/ NA Ticket values:  0


I guess everyone had a ticket, then!

Ok, so maybe some passengers share a single ticket.

In [34]:
features["Ticket"].value_counts().head()

347082      7
CA. 2343    7
1601        7
3101295     6
CA 2144     6
Name: Ticket, dtype: int64

Bingo. Nevertheless, I think there are way too many distinct tickets (681 for 891 passengers) to try and create a category out of them. I'll drop it too.
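For reference, shared tickets can also be turned into a plain number rather than a category: how many passengers travel on the same ticket. This is only a sketch of an alternative, not what this notebook does. It assumes a partial `Ticket` column rebuilt from the five value counts printed above, and `ticket_group_size` is a hypothetical feature name.

```python
import pandas as pd

# Partial Ticket column rebuilt from the five value counts printed above
# (not the full 891-row column).
tickets = pd.Series(
    ["347082"] * 7 + ["CA. 2343"] * 7 + ["1601"] * 7
    + ["3101295"] * 6 + ["CA 2144"] * 6,
    name="Ticket",
)

# Map each row's ticket to the number of rows sharing it.
# `ticket_group_size` is a hypothetical feature name, not from the notebook.
ticket_group_size = tickets.map(tickets.value_counts()).rename("ticket_group_size")

print(ticket_group_size.value_counts())
```

Whether such a feature helps is an open question here; it may largely duplicate what `SibSp` and `Parch` already say.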

In [64]:
features_clean = features.drop(labels=["Name", "Ticket", "Cabin", "Embarked"], axis=1)
features_clean.head()

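| | PassengerId | Pclass | Sex | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 3 | male | 22.0 | 1 | 0 | 7.25 |
| 1 | 2 | 1 | female | 38.0 | 1 | 0 | 71.2833 |
| 2 | 3 | 3 | female | 26.0 | 0 | 0 | 7.925 |
| 3 | 4 | 1 | female | 35.0 | 1 | 0 | 53.1 |
| 4 | 5 | 3 | male | 35.0 | 0 | 0 | 8.05 |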


_Boom_.

2nd impressions:
- I have to make a category out of `Sex`. Seems like a job for a one-hot encoder.
- I have a strong feeling that `Fare` and `Pclass` (passenger class) are strongly correlated.
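Both impressions can be sketched in a few lines. The example below assumes a toy frame built from the five rows shown above, uses `pd.get_dummies` as a stand-in for a one-hot encoder (scikit-learn's `OneHotEncoder` would be another option), and checks the `Fare`/`Pclass` hunch with a Pearson correlation. The names `sample`, `encoded` and `pclass_fare_corr` are made up for this example.

```python
import pandas as pd

# Toy frame built from the five rows shown by head() above.
sample = pd.DataFrame({
    "Pclass": [3, 1, 3, 1, 3],
    "Sex": ["male", "female", "female", "female", "male"],
    "Fare": [7.25, 71.2833, 7.925, 53.1, 8.05],
})

# One-hot encode `Sex`: one indicator column per category.
encoded = pd.get_dummies(sample, columns=["Sex"])
print(encoded.columns.tolist())

# Pearson correlation between passenger class and fare.
pclass_fare_corr = sample["Pclass"].corr(sample["Fare"])
print(pclass_fare_corr)
```

A negative value is what the hunch predicts, since class 1 is the most expensive. Five rows prove nothing, though; the same call on `features_clean` would be the real check.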