# TITANIC: Survival of a Disaster

Long before artificial intelligence and machine learning were a thing, we were already explorers of endless possibilities.

All these opportunities come with risk. With today's technology, would it be possible to predict an outcome of the past?

The Titanic disaster happened over 100 years ago, and a lot of passenger data has been preserved. The dataset consists of the following columns:

* ROW - Row number of datapoint
* PCLASS - Class of ticket (1st, 2nd, 3rd)
* SURVIVED - Did the passenger survive the incident (1, 0)
* NAME - Name of the passenger
* AGE - Age of the passenger
* EMBARKED - Place of embarkation onto the Titanic (city)
* HOME - Home city of passenger
* ROOM - Passenger room number
* TICKET - Ticket number of the passenger
* BOAT - Escape boat for passenger
* SEX - Gender of passenger

Can we produce a model that predicts survival from this data, using one-hot encoding on all categorical features?

## Importing Libraries & Data

In [1]:
import numpy as np
import pandas as pd
import matplotlib as mpl
import sklearn as sk

print('numpy version:', np.__version__)
print('pandas version:', pd.__version__)
print('scikit-learn version:', sk.__version__)
print('matplotlib version:', mpl.__version__)

numpy version: 1.20.1
pandas version: 1.2.4
scikit-learn version: 0.24.1
matplotlib version: 3.3.4


In [2]:
column_names = ['ROW', 'PCLASS', 'SURVIVED', 'NAME',
                'AGE', 'EMBARKED', 'HOME', 'ROOM', 'TICKET', 'BOAT', 'SEX']
dataset = pd.read_csv('/Users/matt/Desktop/AI/PersonalNotebooks/Titanic/Titanic.csv',
                     delimiter=',', names=column_names)
dataset = dataset.iloc[1: , :]
dataset.head()

Unnamed: 0,ROW,PCLASS,SURVIVED,NAME,AGE,EMBARKED,HOME,ROOM,TICKET,BOAT,SEX
1,1,1st,1,"Allen, Miss Elisabeth Walton",29.0,Southampton,"St Louis, MO",B-5,24160 L221,2.0,female
2,2,1st,0,"Allison, Miss Helen Loraine",2.0,Southampton,"Montreal, PQ / Chesterville, ON",C26,,,female
3,3,1st,0,"Allison, Mr Hudson Joshua Creighton",30.0,Southampton,"Montreal, PQ / Chesterville, ON",C26,,-135.0,male
4,4,1st,0,"Allison, Mrs Hudson J.C. (Bessie Waldo Daniels)",25.0,Southampton,"Montreal, PQ / Chesterville, ON",C26,,,female
5,5,1st,1,"Allison, Master Hudson Trevor",0.9167,Southampton,"Montreal, PQ / Chesterville, ON",C22,,11.0,male
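A likely side effect of the load above: passing `names=` without `header=0` makes pandas read the file's own header line as a data row, so each column mixes text with numbers and ends up with a non-numeric dtype even after the first row is dropped. The sketch below contrasts the two ways of loading on a made-up excerpt (`csv_text` is hypothetical, not the real file).

```python
import io
import pandas as pd

# Hypothetical excerpt standing in for Titanic.csv
csv_text = """ROW,PCLASS,SURVIVED,AGE,SEX
1,1st,1,29.0,female
2,1st,0,2.0,female
3,1st,0,,male
"""

column_names = ['ROW', 'PCLASS', 'SURVIVED', 'AGE', 'SEX']

# Variant A: names= only. The header line is parsed as data, then dropped by hand;
# the columns stay text-typed because they were inferred with the header text in them.
as_strings = pd.read_csv(io.StringIO(csv_text), names=column_names)
as_strings = as_strings.iloc[1:, :]

# Variant B: header=0 tells pandas to replace the file's header line with names=,
# so numeric columns are inferred as numbers.
typed = pd.read_csv(io.StringIO(csv_text), header=0, names=column_names)

print(as_strings.dtypes)
print(typed.dtypes)
```

With numeric dtypes in place, `describe()` reports mean, standard deviation and quartiles instead of `unique`/`top`/`freq`.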


In [3]:
print('Shape of the dataset: {}'.format(dataset.shape))

Shape of the dataset: (1313, 11)


## Treating the Dataset

In [4]:
dataset.describe()

Unnamed: 0,ROW,PCLASS,SURVIVED,NAME,AGE,EMBARKED,HOME,ROOM,TICKET,BOAT,SEX
count,1313,1313,1313,1313,633,821,754,77,69,347,1313
unique,1313,3,2,1310,73,3,371,53,41,99,2
top,1002,3rd,0,"Connolly, Miss Kate",30,Southampton,"New York, NY",F-33,17608 L262 7s 6d,4,male
freq,1,711,864,2,28,573,65,4,5,27,850


The summary shows that:

1. A few features are categorical, namely PCLASS, SURVIVED, EMBARKED and SEX.
2. The counts of TICKET and ROOM are relatively low.

There are some oddities as well:

1. The top ROW value is reported as 1002 with a frequency of 1. Every row number is unique, so the "top" value is arbitrary and carries no information.
2. The name 'Connolly, Miss Kate' appears twice. There are 1313 names but only 1310 unique ones, and the highest frequency is 2, so two more names must also appear twice.
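The duplicate names can be listed directly instead of inferring them from the counts. This is a minimal sketch on a made-up frame (`people` and its rows are hypothetical); on the real data the same two expressions would be applied to `dataset['NAME']`.

```python
import pandas as pd

# Hypothetical mini-sample; only the NAME column matters here
people = pd.DataFrame({
    'NAME': ['Connolly, Miss Kate', 'Allen, Miss Elisabeth Walton',
             'Connolly, Miss Kate', 'Kelly, Mr James', 'Kelly, Mr James'],
    'AGE': [22.0, 29.0, 30.0, 34.0, 44.0],
})

# keep=False flags every member of a duplicate group, not only the repeats
dupes = people[people['NAME'].duplicated(keep=False)].sort_values('NAME')

# Surplus rows: total rows minus distinct names
surplus = len(people) - people['NAME'].nunique()

print(dupes)
print('surplus rows:', surplus)
```

A shared name does not have to be the same person, so it is worth comparing the other columns before dropping anything.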

To start, ROW has no importance because it only represents the index of the datapoint.

NAME has little to no value for predicting survival. In a more extensive analysis it could be useful, but not here.

In addition, ROOM and TICKET can be dropped because they are mostly null (77 and 69 values out of 1313 rows). Dropping them could affect the model, but estimating over 90% of a column's values is not feasible.

BOAT is also poorly described and would not be of great importance in a basic analysis.

The columns ROW, NAME, ROOM, BOAT, HOME and TICKET will be dropped.
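The "mostly null" argument can be made explicit by computing the share of missing values per column and dropping everything above a cut-off. The frame and the 0.5 threshold below are assumptions chosen for illustration, not values taken from the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: ROOM is mostly empty, AGE is partly empty
frame = pd.DataFrame({
    'PCLASS': ['1st', '2nd', '3rd', '3rd', '1st'],
    'AGE': [29.0, np.nan, 30.0, np.nan, 25.0],
    'ROOM': ['B-5', None, None, None, None],
})

# Fraction of missing values per column (mean of the boolean null mask)
missing_share = frame.isnull().mean()

# Hypothetical cut-off: drop any column that is more than half empty
threshold = 0.5
to_drop = missing_share[missing_share > threshold].index.tolist()
reduced = frame.drop(columns=to_drop)

print(missing_share)
print('dropped:', to_drop)
```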

In [5]:
dataset = dataset.drop(['ROW', 'ROOM', 'TICKET', 'NAME', 'BOAT', 'HOME'], axis=1)
dataset.head()

Unnamed: 0,PCLASS,SURVIVED,AGE,EMBARKED,SEX
1,1st,1,29.0,Southampton,female
2,1st,0,2.0,Southampton,female
3,1st,0,30.0,Southampton,male
4,1st,0,25.0,Southampton,female
5,1st,1,0.9167,Southampton,male


As a final basic preparation step, the SURVIVED feature will be moved to the rightmost column.

In [6]:
df1 = dataset.pop('SURVIVED')
dataset['SURVIVED'] = df1
dataset.head()

Unnamed: 0,PCLASS,AGE,EMBARKED,SEX,SURVIVED
1,1st,29.0,Southampton,female,1
2,1st,2.0,Southampton,female,0
3,1st,30.0,Southampton,male,0
4,1st,25.0,Southampton,female,0
5,1st,0.9167,Southampton,male,1


### Null values

In [7]:
dataset.isnull().sum()

PCLASS        0
AGE         680
EMBARKED    492
SEX           0
SURVIVED      0
dtype: int64

The remaining null values still have to be filled in. The intent is to use a median; for now, the cells below forward-fill EMBARKED and fill AGE with 0.
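A median fill for AGE could look like the following sketch. The series `age` is a made-up stand-in for the real column, stored as text to show that a numeric conversion has to come first.

```python
import pandas as pd

# Hypothetical AGE column stored as text, with gaps
age = pd.Series(['29', '2', None, '30', None, '0.9167'], name='AGE')

# Convert to numbers first; anything that cannot be parsed becomes NaN
age_num = pd.to_numeric(age, errors='coerce')

# median() skips NaN by default, so it is computed from the known ages only
median_age = age_num.median()
age_filled = age_num.fillna(median_age)

print(median_age)
print(age_filled.tolist())
```

Unlike a fill value of 0, the median keeps the imputed rows inside the range of plausible ages.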

### Categorizing data

#### Embarked

In [8]:
dataset['EMBARKED'].value_counts()

Southampton    573
Cherbourg      203
Queenstown      45
Name: EMBARKED, dtype: int64

In [9]:
# Forward fill: each missing port takes the value of the previous row
dataset['EMBARKED'] = dataset['EMBARKED'].ffill()
dataset['EMBARKED'].value_counts()

Southampton    1058
Cherbourg       209
Queenstown       46
Name: EMBARKED, dtype: int64
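Forward fill copies the previous row's port, so the result depends on the order of the rows in the file. As an alternative sketch (not what this notebook does), the gaps can be filled with the most frequent port or kept visible as their own category. The series `ports` is a made-up sample.

```python
import pandas as pd

# Hypothetical EMBARKED values with gaps
ports = pd.Series(['Southampton', None, 'Cherbourg', None,
                   'Southampton', 'Queenstown'], name='EMBARKED')

# Option 1: fill with the most frequent port (the mode)
most_common = ports.mode()[0]
filled_mode = ports.fillna(most_common)

# Option 2: keep missingness visible as an explicit category
filled_unknown = ports.fillna('Unknown')

print(filled_mode.value_counts())
print(filled_unknown.value_counts())
```

With option 2, a later mapping of ports to integers would need an extra entry for 'Unknown'.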

In [10]:
dataset['EMBARKED'] = dataset['EMBARKED'].map({'Southampton': 0, 'Cherbourg': 1, 'Queenstown': 2})
dataset.sample(5)

Unnamed: 0,PCLASS,AGE,EMBARKED,SEX,SURVIVED
120,1st,71.0,1,male,0
5,1st,0.9167,0,male,1
145,1st,35.0,0,female,1
445,2nd,23.0,0,male,0
407,2nd,38.0,0,male,0


In [11]:
print('Shape of the dataset: {}'.format(dataset.shape))

Shape of the dataset: (1313, 5)


#### Class

In [12]:
dataset['PCLASS'] = dataset['PCLASS'].map({'1st': 1, '2nd': 2, '3rd': 3})
dataset.sample(5)

Unnamed: 0,PCLASS,AGE,EMBARKED,SEX,SURVIVED
1103,3,,0,male,0
973,3,,0,male,0
377,2,8.0,0,female,1
1265,3,,0,female,0
59,1,31.0,0,female,1


#### Age

In [13]:
dataset['AGE'].value_counts()

30        28
18        25
36        23
22        23
24        22
          ..
67         1
7          1
0.1667     1
10         1
0.9167     1
Name: AGE, Length: 73, dtype: int64

The dataset contains all kinds of values: strings, integers and floats. In order to use this data, it has to be converted first.

In [14]:
# Fill missing ages with 0 as a placeholder
# (alternative that was tried: dataset['AGE'].fillna(dataset['AGE'].mean()))
dataset['AGE'] = dataset['AGE'].fillna(0)

# The results of the next two calls are not assigned back, so they have no
# effect on dataset: AGE keeps dtype object, as the output below shows
pd.to_numeric(dataset['AGE'], errors='coerce', downcast='signed')
dataset.round(0)
dataset['AGE'].head(10)

1         29
2          2
3         30
4         25
5     0.9167
6         47
7         63
8         39
9         58
10        71
Name: AGE, dtype: object
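The output above still shows dtype object and an unrounded 0.9167, most likely because `pd.to_numeric` and `round` return new objects rather than changing the data in place. A sketch of the assigning version, on a made-up series `ages`:

```python
import pandas as pd

# Hypothetical mix of text ages and a numeric 0 placeholder
ages = pd.Series(['29', '2', '0.9167', 0], dtype=object, name='AGE')

# Keep the converted result; without the assignment the conversion is discarded
ages_num = pd.to_numeric(ages, errors='coerce')

# Round to whole years, again keeping the returned series
ages_rounded = ages_num.round(0)

print(ages_num.dtype)
print(ages_rounded.tolist())
```

In the notebook this would correspond to assigning the result back to `dataset['AGE']`.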

In [15]:
dataset

Unnamed: 0,PCLASS,AGE,EMBARKED,SEX,SURVIVED
1,1,29,0,female,1
2,1,2,0,female,0
3,1,30,0,male,0
4,1,25,0,female,0
5,1,0.9167,0,male,1
...,...,...,...,...,...
1309,3,0,0,male,0
1310,3,0,0,male,0
1311,3,0,0,male,0
1312,3,0,0,female,0


#### Sex

In [16]:
dataset['SEX'] = dataset['SEX'].map({'male': 0, 'female': 1})
dataset.sample(5)

Unnamed: 0,PCLASS,AGE,EMBARKED,SEX,SURVIVED
652,3,17,1,1,0
65,1,26,1,1,1
475,2,0,2,0,0
1027,3,0,0,0,0
656,3,32,0,0,0


## Machine Learning

This gives us a first pass at data preparation. Let's try a random forest classifier to check whether we are heading in the right direction.

In [17]:
X_train = dataset.drop(["SURVIVED"], axis=1)
Y_train = dataset['SURVIVED']

# Note: X_test holds the same four feature columns and rows as X_train;
# no rows are held out yet
X_test = dataset.iloc[:, :4]

X_train.shape, Y_train.shape, X_test.shape

((1313, 4), (1313,), (1313, 4))

In [18]:
from sklearn.model_selection import train_test_split 
from sklearn.ensemble import RandomForestClassifier 

random_forest = RandomForestClassifier(criterion='gini',
                                       n_estimators=700,
                                       min_samples_split=10,
                                       min_samples_leaf=1,
                                       max_features='sqrt',  # same as 'auto' for classifiers in scikit-learn 0.24
                                       oob_score=True,
                                       random_state=1,
                                       n_jobs=-1)
random_forest.fit(X_train, Y_train)
Y_pred = random_forest.predict(X_test)

# Accuracy on the training data itself (in percent), not on unseen rows
acc_random_forest = round(random_forest.score(X_train, Y_train) * 100, 3)
acc_random_forest

85.758
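The 85.758 above is measured on the same rows the forest was trained on, so it probably overstates how well the model generalises. A minimal sketch of a held-out evaluation follows; the features and labels are synthetic stand-ins, and the split size and the smaller `n_estimators` are assumptions made to keep the example quick.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the four prepared features and the SURVIVED label
rng = np.random.default_rng(1)
X = rng.integers(0, 3, size=(400, 4)).astype(float)
y = (X[:, 3] + rng.normal(0, 0.5, size=400) > 1).astype(int)

# Hold out 25% of the rows; stratify keeps the class balance similar in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1, stratify=y)

forest = RandomForestClassifier(n_estimators=100, min_samples_split=10,
                                oob_score=True, random_state=1)
forest.fit(X_train, y_train)

train_acc = forest.score(X_train, y_train)  # accuracy on rows the model has seen
test_acc = forest.score(X_test, y_test)     # accuracy on held-out rows
oob_acc = forest.oob_score_                 # out-of-bag estimate from the training rows

print(train_acc, test_acc, oob_acc)
```

Since the notebook already sets `oob_score=True`, reading `random_forest.oob_score_` after fitting would give a comparable estimate without a separate split.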

In [19]:
from numpy import asarray
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelEncoder

#print(asarray(dataset['EMBARKED']).reshape(-1, 1))

#get data from dataset
#data = dataset['EMBARKED'].values
#print(data)
#separate data into input and output
#X = data.astype(str).reshape(-1, 1)
#Y = data.astype(str)
#onehot_encoder = OneHotEncoder(sparse=False)
#X = onehot_encoder.fit_transform(X)
#label_encoder = LabelEncoder()
#Y = label_encoder.fit_transform(Y)
#print('Input', X.shape)
#print(X[:5, :])
#print(Y[:5])

#dataset.head()
#encoder = OneHotEncoder(sparse=False)
#onehot = encoder.fit_transform(asarray(dataset['EMBARKED']).reshape(-1, 1))
#print(onehot)
#dataset['EMBARKED'] = onehot
#dataset.sample(5)
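The commented-out drafts above work towards one-hot encoding EMBARKED with scikit-learn. As an alternative sketch, `pd.get_dummies` can encode several categorical columns in one call. The frame `sample` is made up and keeps the categories in their text form; this swaps in pandas for the `OneHotEncoder` approach and is not the notebook's finished method.

```python
import pandas as pd

# Made-up rows with the categorical columns still in their text form
sample = pd.DataFrame({
    'PCLASS': ['1st', '3rd', '2nd', '3rd'],
    'EMBARKED': ['Southampton', 'Cherbourg', 'Queenstown', 'Southampton'],
    'SEX': ['female', 'male', 'male', 'female'],
    'AGE': [29.0, 30.0, 25.0, 2.0],
})

# One indicator column per category value; AGE passes through unchanged
encoded = pd.get_dummies(sample, columns=['PCLASS', 'EMBARKED', 'SEX'])

print(encoded.columns.tolist())
print(encoded.head())
```

Compared with mapping ports to 0, 1 and 2, indicator columns avoid implying an order between the categories.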