# Titanic - Machine Learning from Disaster
The following notebook outlines my approach to the Kaggle competition ["Titanic - Machine Learning from Disaster"](https://www.kaggle.com/competitions/titanic/overview); full details are available via the link. The goal of the competition is to build a binary classifier that achieves the highest accuracy on an unseen data set.

## The Data
The website provides data in the form of csv files; [train.csv](data/train.csv) is intended for training models, while [test.csv](data/test.csv) is intended for evaluating the performance of said models. Note that the test set does not include ground truth observations. The first step in our analysis will be to perform some exploratory data analysis to better understand the data set.

In [9]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

# Reading in the data.
# File paths are relative to the notebook directory, which the Python kernel uses as its working directory.
train = pd.read_csv('../data/train.csv')
test = pd.read_csv('../data/test.csv')

print('Training data has {} rows and {} columns.'.format(*train.shape))
print('Testing data has {} rows and {} columns.'.format(*test.shape))

Training data has 891 rows and 12 columns.
Testing data has 418 rows and 11 columns.


### Exploring the Data
As expected, the training set has one more column than the test set: our variable of interest (whether or not the passenger survived). The training set is also relatively small, with only 891 observations. Our next step is to explore the features given by the columns.
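The column comparison above can also be checked programmatically rather than by counting. The following is a minimal sketch, not part of the original analysis: `column_difference` is a hypothetical helper, and the two toy frames stand in for `train` and `test` so the example runs without the CSV files.

```python
import pandas as pd


def column_difference(left: pd.DataFrame, right: pd.DataFrame) -> list:
    """Return the columns of `left` that are absent from `right`, in `left`'s order."""
    return [col for col in left.columns if col not in right.columns]


# Toy stand-ins for the real train/test frames (hypothetical values).
toy_train = pd.DataFrame({"PassengerId": [1, 2], "Survived": [0, 1], "Pclass": [3, 1]})
toy_test = pd.DataFrame({"PassengerId": [3, 4], "Pclass": [2, 3]})

extra = column_difference(toy_train, toy_test)
print(extra)  # ['Survived']
```

With the real data loaded, `column_difference(train, test)` would be expected to return only the target column.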

In [21]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


The output above shows 12 columns. Seven are numeric (including the identifier `PassengerId` and the target `Survived`):
- PassengerId
- Survived
- Pclass
- Age
- SibSp
- Parch
- Fare

The remaining 5 are non-numeric (`object` dtype):
- Name
- Sex
- Ticket
- Cabin
- Embarked
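The two lists above were read off the `info()` output by hand. As a rough sketch, the same split can be derived with pandas' `select_dtypes`; the helper name `split_by_dtype` and the toy frame below are assumptions made for illustration only.

```python
import pandas as pd


def split_by_dtype(df: pd.DataFrame):
    """Return (numeric column names, non-numeric column names)."""
    numeric = df.select_dtypes(include="number").columns.tolist()
    non_numeric = df.select_dtypes(exclude="number").columns.tolist()
    return numeric, non_numeric


# Toy frame with a few Titanic-like columns (hypothetical values).
toy = pd.DataFrame(
    {
        "Pclass": [3, 1],
        "Age": [22.0, None],  # None becomes NaN, so the column stays float64
        "Sex": ["male", "female"],
        "Embarked": ["S", "C"],
    }
)

numeric_cols, non_numeric_cols = split_by_dtype(toy)
print(numeric_cols)      # ['Pclass', 'Age']
print(non_numeric_cols)  # ['Sex', 'Embarked']
```

Calling `split_by_dtype(train)` on the real data should reproduce the 7/5 split listed above.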

We can also review summary statistics for the numeric columns, which may prove useful later.

In [27]:
train.describe()

| Statistic | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|---|
| count | 891.0 | 891.0 | 891.0 | 714.0 | 891.0 | 891.0 | 891.0 |
| mean | 446.0 | 0.383838 | 2.308642 | 29.699118 | 0.523008 | 0.381594 | 32.204208 |
| std | 257.353842 | 0.486592 | 0.836071 | 14.526497 | 1.102743 | 0.806057 | 49.693429 |
| min | 1.0 | 0.0 | 1.0 | 0.42 | 0.0 | 0.0 | 0.0 |
| 25% | 223.5 | 0.0 | 2.0 | 20.125 | 0.0 | 0.0 | 7.9104 |
| 50% | 446.0 | 0.0 | 3.0 | 28.0 | 0.0 | 0.0 | 14.4542 |
| 75% | 668.5 | 1.0 | 3.0 | 38.0 | 1.0 | 0.0 | 31.0 |
| max | 891.0 | 1.0 | 3.0 | 80.0 | 8.0 | 6.0 | 512.3292 |


The `count` row shows that, in the training data at least, every numeric feature is complete except `Age`, which is missing 177 of its 891 values (891 - 714).
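Because `describe()` only covers numeric columns, a per-column count of missing values gives a fuller picture. Below is a minimal sketch under stated assumptions: `missing_summary` is a hypothetical helper, and the toy frame replaces the real training data so the block is self-contained.

```python
import numpy as np
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Count and fraction of missing values per column, for columns with any missing."""
    missing = df.isna().sum()
    summary = pd.DataFrame({"missing": missing, "fraction": missing / len(df)})
    # Keep only columns that have gaps, most incomplete first.
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)


# Toy frame mimicking the pattern of gaps in the real data (hypothetical values).
toy = pd.DataFrame(
    {
        "Age": [22.0, np.nan, 38.0, np.nan],
        "Fare": [7.25, 71.28, 8.05, 13.0],
        "Cabin": [None, "C85", None, None],
    }
)

report = missing_summary(toy)
print(report)
```

Going by the non-null counts in the `info()` output, `missing_summary(train)` should list `Cabin` (687 missing), `Age` (177), and `Embarked` (2).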