# Titanic: Machine Learning from Disaster
##### By Stephen Mak

The Titanic sank on April 15th, 1912, during its maiden voyage, killing 1502 of the 2224 people on board (67.5%). The intent of this competition is to determine which passengers were likely to survive using various machine learning tools.

The exam question is to predict whether or not a given passenger survived the sinking of the Titanic.

The metric used to evaluate the model is the accuracy (also known as the Rand accuracy). It is defined as the sum of True Positives (TP) and True Negatives (TN) divided by the total number of examples.
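
As a minimal sketch of the metric described above, the accuracy can be computed by counting TP and TN directly. The function name `accuracy` and the labels below are made up for illustration; they are not part of the competition data.

```python
def accuracy(y_true, y_pred):
    """Return (TP + TN) / N for binary labels encoded as 0 and 1."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    # True positives: predicted 1 and actually 1.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    # True negatives: predicted 0 and actually 0.
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return (tp + tn) / len(y_true)


# Made-up labels for illustration only (1 = survived, 0 = did not survive).
y_true = [1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 1, 0]
example_score = accuracy(y_true, y_pred)
print(example_score)
```

Here four of the five predictions match the true labels, so the score is 4/5.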



## Features available

The response variable will be "Survival" which is a binary yes/no (1 or 0).

Various features are available including:

"PassengerId" - Unique passenger ID.

"pclass", or Ticket Class, a categorical variable where 1 = 1st Class, 2 = 2nd Class, 3 = 3rd Class.
This could be an important feature as the number of lifeboats available to each ticket class may have varied.

"Sex" could be encoded as a binary variable, with one value denoting male and the other female.
This could be an important feature as it is believed that women and children were directed to the lifeboats first.

"Age" is the passenger's age in years. If the age has been estimated, it takes the form "xx.5", so there is potential for feature engineering to create an "estimated age yes/no" feature. Ages below 1 are given as fractions.

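The "estimated age yes/no" idea for the "Age" feature could be sketched roughly as follows. The helper name `add_age_estimated_flag`, the new column name `AgeEstimated` and the toy values are assumptions made for illustration. Ages below 1 are excluded from the flag because they are fractional for a different reason.

```python
import pandas as pd


def add_age_estimated_flag(df, age_col="Age"):
    """Add a 0/1 column flagging ages of the form xx.5 (hypothetical helper)."""
    out = df.copy()
    age = out[age_col]
    # Ages under 1 are genuine fractions, so only ages >= 1 ending in .5 are flagged.
    # Missing ages compare as False and therefore get a 0.
    out["AgeEstimated"] = ((age >= 1) & ((age % 1) == 0.5)).astype(int)
    return out


# Toy ages for illustration only, not taken from train.csv.
toy_ages = pd.DataFrame({"Age": [22.0, 28.5, 0.42, None, 40.5]})
flagged = add_age_estimated_flag(toy_ages)
print(flagged)
```
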
"sibsp" - Number of siblings/spouses onboard - defined as brother/sister/stepbrother/stepsister/husband/wife.

"parch" - Number of parents/children onboard - defined as mother/father/daughter/son/stepdaughter/stepson.

"ticket" - Ticket Number

"fare" - Passenger Fare

"cabin" - Cabin Number

"embarked" - Port of Embarkation - C = Cherbourg (France), Q = Queenstown (now Cobh, Ireland), S = Southampton (UK)

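One possible way to turn the categorical features above into numbers is sketched below. It assumes that sex is stored as the strings "male" and "female" and that the port of embarkation uses the codes C, Q and S. The helper `encode_categoricals` and the output column names are hypothetical.

```python
import pandas as pd


def encode_categoricals(df):
    """Binary-encode Sex and one-hot encode Embarked (illustrative sketch)."""
    out = df.copy()
    # Assumption: the raw values are the strings "male" and "female".
    out["IsFemale"] = (out["Sex"] == "female").astype(int)
    # One 0/1 indicator column per port code.
    ports = pd.get_dummies(out["Embarked"], prefix="Embarked", dtype=int)
    return pd.concat([out.drop(columns=["Sex", "Embarked"]), ports], axis=1)


# Toy rows for illustration only.
toy_passengers = pd.DataFrame(
    {"Sex": ["male", "female", "female"], "Embarked": ["S", "C", "Q"]}
)
encoded = encode_categoricals(toy_passengers)
print(encoded)
```
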
## Interesting sidenotes

Titanic only had enough lifeboats for 1178 people due to outdated maritime safety regulations.

Women and children were evacuated first, so sex and age are suspected to be crucial features for the model.
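
A quick way to check this suspicion is to compare survival rates across groups. The sketch below uses made-up rows, a hypothetical helper `survival_rate_by`, and an assumed child cut-off of 16 years; none of these come from the competition data.

```python
import pandas as pd


def survival_rate_by(df, column):
    """Mean of the 0/1 Survived column within each group of `column`."""
    return df.groupby(column)["Survived"].mean()


# Made-up rows for illustration only.
toy_survival = pd.DataFrame(
    {
        "Sex": ["male", "female", "female", "male"],
        "Age": [30, 25, 8, 6],
        "Survived": [0, 1, 1, 1],
    }
)
# Assumption: passengers under 16 are treated as children.
toy_survival["IsChild"] = toy_survival["Age"] < 16

rate_by_sex = survival_rate_by(toy_survival, "Sex")
rate_by_child = survival_rate_by(toy_survival, "IsChild")
print(rate_by_sex)
print(rate_by_child)
```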

RMS Carpathia arrived roughly two hours after the sinking and rescued approximately 705 people.

In [1]:
import os
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

from pathlib import Path

In [2]:
print(os.getcwd())

C:\Users\Stephen\Desktop\Documents\A Learning Journey\Kaggle


In [3]:
PATH = Path('C:/Users/Stephen/Desktop/Documents/A Learning Journey/Kaggle/Titanic/')

TRAIN = PATH/'train/train.csv'
TEST = PATH/'test/test.csv'

First, let's read in the data and explore the dataset to better understand the data and what features could be useful for our model.

In [5]:
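# Custom column names, in the same order as the columns in train.csv.
columns = ['ID', 'Survived', 'Class', 'Name', 'Sex', 'Age', 'SibSp', 'ParCh', 'TicketNo', 'Price', 'CabinNo', 'EmbarkCity']

# Note: passing names= without header=0 treats the file's own header line as data,
# which is why row 0 below reads 'Embarked' and every dtype is object.
data = pd.read_csv(TRAIN, names=columns)

print(data.columns)
print(data.dtypes)
print(data.iloc[:6, 11])  # first six entries of the last column (EmbarkCity)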

Index(['ID', 'Survived', 'Class', 'Name', 'Sex', 'Age', 'SibSp', 'ParCh',
       'TicketNo', 'Price', 'CabinNo', 'EmbarkCity'],
      dtype='object')
ID            object
Survived      object
Class         object
Name          object
Sex           object
Age           object
SibSp         object
ParCh         object
TicketNo      object
Price         object
CabinNo       object
EmbarkCity    object
dtype: object
0    Embarked
1           S
2           C
3           S
4           S
5           S
Name: EmbarkCity, dtype: object
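
The all-object dtypes and the `Embarked` entry in row 0 above suggest that the CSV's own header line was read as a data row. A minimal sketch of one fix is to pass `header=0` alongside `names`, so that pandas discards the original header and infers numeric dtypes. The two-row CSV below is a made-up stand-in for train.csv so that the example runs on its own.

```python
import io

import pandas as pd

columns = ['ID', 'Survived', 'Class', 'Name', 'Sex', 'Age', 'SibSp', 'ParCh',
           'TicketNo', 'Price', 'CabinNo', 'EmbarkCity']

# Made-up stand-in for train.csv, including a header line like the real file.
sample_csv = io.StringIO(
    "PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked\n"
    '1,0,3,"Doe, Mr. John",male,22,1,0,A/5 00001,7.25,,S\n'
    '2,1,1,"Roe, Mrs. Jane",female,38,1,0,PC 00002,71.28,C85,C\n'
)

# header=0 tells pandas that line 0 is a header, which names= then replaces.
fixed = pd.read_csv(sample_csv, names=columns, header=0)

print(fixed.dtypes)
print(fixed.iloc[:6, 11])
```

With the real file, the equivalent call would be `pd.read_csv(TRAIN, names=columns, header=0)`.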


Printing the first few rows shows that the data is ordered by passenger ID. For the model, it is therefore important to randomly shuffle the rows, and PassengerId should not be used as a predictive feature, as that wouldn't make much sense.
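
A minimal sketch of the shuffling step, using `DataFrame.sample` on a toy frame. The seed value, the toy rows and the variable names are assumptions for illustration.

```python
import pandas as pd

# Toy frame for illustration only, ordered by ID like the real data.
toy_rows = pd.DataFrame(
    {"ID": [1, 2, 3, 4, 5], "Age": [22, 38, 26, 35, 35], "Survived": [0, 1, 1, 1, 0]}
)

# frac=1 samples every row without replacement, i.e. a full shuffle.
# random_state fixes the seed so that the shuffle is reproducible.
shuffled = toy_rows.sample(frac=1, random_state=42).reset_index(drop=True)

# Keep the ID (and the response) out of the feature matrix.
features = shuffled.drop(columns=["ID", "Survived"])
target = shuffled["Survived"]
print(shuffled)
```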

The 'TicketNo' column has an irregular structure. I need to better understand what it means, and whether more features can be extracted from it. The same applies to 'CabinNo'.
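
As a first, speculative attempt at extracting features from these two columns, the sketch below takes the leading letter of the cabin and any text before the final token of the ticket. Whether these pieces carry meaning is an open question. The helper name, the new column names and the toy values are all assumptions.

```python
import pandas as pd


def add_ticket_cabin_features(df):
    """Derive simple candidate features from TicketNo and CabinNo (sketch)."""
    out = df.copy()
    # First character of the cabin string; missing cabins become "Unknown".
    out["CabinLetter"] = out["CabinNo"].str[0].fillna("Unknown")
    out["HasCabin"] = out["CabinNo"].notna().astype(int)
    # Text before the last whitespace-separated token, if there is any.
    prefix = out["TicketNo"].str.extract(r"^(.*)\s+\S+$", expand=False)
    out["TicketPrefix"] = prefix.fillna("None")
    return out


# Toy values for illustration only.
toy_tickets = pd.DataFrame(
    {"TicketNo": ["A/5 00001", "100003", "PC 00002"], "CabinNo": [None, "C123", "C85"]}
)
with_features = add_ticket_cabin_features(toy_tickets)
print(with_features)
```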

In [7]:
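# Matplotlib quick reference. Placeholder data is used so that the cell runs.
x = np.linspace(0, 10, 200)
y = np.sin(x)

print(plt.style.available)  # a list of style names (an attribute, not a function)
plt.style.use('ggplot')

plt.subplot(2, 1, 1)
# Number of rows, number of columns, which subplot to make active. Order goes from top to bottom, left to right, indexed from 1.
plt.plot(x, y, color='blue', label='sin(x)')
plt.xlabel('x')
plt.ylabel('y')
plt.title('Line plot')
plt.axis([0, 10, -1.5, 1.5])
# [x_min, x_max, y_min, y_max], in data units.
plt.annotate('first peak', xy=(np.pi / 2, 1), xytext=(3, 1.3), arrowprops={'arrowstyle': '->'})
plt.legend()

plt.subplot(2, 1, 2)
plt.hist(y, bins=20)

plt.tight_layout()
# Pads spaces between subplots so that there is no overlap.

plt.axes([0.7, 0.62, 0.2, 0.2])
# [x_lo, y_lo, width, height]. Units in figure units, between 0 and 1.
plt.hexbin(x, y, gridsize=15)
# plt.hist2d(x, y, bins=15) is the rectangular-bin equivalent.
plt.colorbar()

plt.savefig('matplotlib_reference.png')  # placeholder file name
plt.show()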