# Machine Learning for Disaster

The sinking of the RMS Titanic is one of the most infamous shipwrecks in history.  On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.

One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper class.

In this challenge, we ask you to complete the analysis of what sorts of people were likely to survive. In particular, we ask you to apply the tools of machine learning to predict which passengers survived the tragedy.

# 1. Introduction

The idea of this work is to show, in a simple and easy way, a solution to the classification problem posed by the Titanic disaster. In this notebook we use the XGBoost classifier.

# 2. Let's Start

In [1]:
"""Importing libraries and stuff"""
# Author: Fernando-Lopez-Velasco

import numpy as np
import pandas as pd
from sklearn.preprocessing import Imputer
import category_encoders as ce
from sklearn import preprocessing
import xgboost as xgb
from sklearn.metrics import accuracy_score
from sklearn.ensemble import RandomForestClassifier

In [2]:
"""Loading files as a pandas dataframe"""

train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')

In [3]:
"""Splitting data"""

Y = train['Survived'].copy() # We extract the target vector
Xtrain = train.drop(['Survived','PassengerId', 'Name'], axis=1) # Drop some columns which are not useful
Xtest = test.drop(['PassengerId','Name'], axis=1)

In [4]:
Xtest.head()

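|   | Pclass | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | male | 34.5 | 0 | 0 | 330911 | 7.8292 | NaN | Q |
| 1 | 3 | female | 47.0 | 1 | 0 | 363272 | 7.0 | NaN | S |
| 2 | 2 | male | 62.0 | 0 | 0 | 240276 | 9.6875 | NaN | Q |
| 3 | 3 | male | 27.0 | 0 | 0 | 315154 | 8.6625 | NaN | S |
| 4 | 3 | female | 22.0 | 1 | 1 | 3101298 | 12.2875 | NaN | S |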


# 3. Handling null values

In this section we will solve the problem of missing or null values.

In [5]:
"""First we split the data into categorical and non-categorical columns"""

train_category = Xtrain.select_dtypes(include=['object']).copy()
test_category = Xtest.select_dtypes(include=['object']).copy()
train_float = Xtrain.select_dtypes(exclude=['object']).copy()
test_float = Xtest.select_dtypes(exclude=['object']).copy()
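
As a quick check of what this split produces, the sketch below rebuilds a few rows of the `Xtest` preview shown earlier as a small hand-typed frame (the name `sample` is hypothetical). It selects by `number` rather than by `object`, which is an alternative spelling of the same split and not what the notebook itself uses.

```python
import pandas as pd

# Hand-typed copy of the first rows of the Xtest preview (illustrative only).
sample = pd.DataFrame({
    'Pclass': [3, 3, 2],
    'Sex': ['male', 'female', 'male'],
    'Age': [34.5, 47.0, 62.0],
    'SibSp': [0, 1, 0],
    'Parch': [0, 0, 0],
    'Ticket': ['330911', '363272', '240276'],  # kept as text, like the raw ticket codes
    'Fare': [7.8292, 7.0, 9.6875],
    'Cabin': [None, None, None],
    'Embarked': ['Q', 'S', 'Q'],
})

# Numeric columns go one way, everything else goes the other.
numeric_part = sample.select_dtypes(include=['number'])
categorical_part = sample.select_dtypes(exclude=['number'])

print(list(numeric_part.columns))
print(list(categorical_part.columns))
```

The numeric half is what gets imputed and scaled later, while the other half is what the encoder receives.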

## 3.1 Null values in non-categorical data

We need a method to address this problem. In this case we will use the Imputer class provided by scikit-learn, which replaces each missing value with the mean of its column.

In [6]:
imp = Imputer(missing_values='NaN', strategy='mean', axis=0)
imp.fit(train_float)

Imputer(axis=0, copy=True, missing_values='NaN', strategy='mean', verbose=0)

In [7]:
Xtrain_float= imp.transform(train_float)
Xtest_float = imp.transform(test_float)
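
`Imputer` was removed from `sklearn.preprocessing` in scikit-learn 0.22, so the cells above only run on older releases. On a newer release, a rough equivalent is `SimpleImputer` from `sklearn.impute`. The sketch below uses small made-up arrays (`toy_train` and `toy_test` are hypothetical) rather than the Titanic data.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up data: two numeric columns with some missing entries.
toy_train = np.array([[1.0, 10.0],
                      [np.nan, 30.0],
                      [3.0, np.nan]])
toy_test = np.array([[np.nan, np.nan]])

# Column-wise mean imputation; SimpleImputer always works per column,
# so there is no axis argument.
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer.fit(toy_train)

# The means learned from the training data are reused for the test data.
filled_train = imputer.transform(toy_train)
filled_test = imputer.transform(toy_test)

print(filled_train)
print(filled_test)
```

As in the notebook, the imputer is fitted on the training data only and then applied to both sets.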

## 3.2 Transformation of categorical data into numerical format
Now that we have solved the problem of null values in the numerical data, we need to transform the categorical values into a numerical format. To do this we will use the "Backward Difference Encoder" technique.

In [8]:
"""Declaring the object of BackwardDifferenceEncoder and fitting"""

encoder = ce.BackwardDifferenceEncoder(cols=['Sex', 'Ticket','Cabin','Embarked'])
encoder.fit(train_category)

BackwardDifferenceEncoder(cols=['Sex', 'Ticket', 'Cabin', 'Embarked'],
             drop_invariant=False, handle_unknown='impute',
             impute_missing=True, return_df=True, verbose=0)

In [9]:
"""Transforming data"""

Xtrain_category = encoder.transform(train_category)
Xtest_category = encoder.transform(test_category)
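
To make the encoding less opaque, the sketch below builds a backward-difference contrast matrix by hand with NumPy. It is not the `category_encoders` implementation, and the helper `backward_difference_matrix` is a hypothetical name. Each column compares one level of the category with the level before it.

```python
import numpy as np

def backward_difference_matrix(k):
    """Return the k x (k - 1) backward-difference contrast matrix."""
    contrast = np.zeros((k, k - 1))
    for j in range(1, k):
        # Column j compares level j + 1 with level j.
        contrast[:j, j - 1] = -(k - j) / k  # levels 1..j
        contrast[j:, j - 1] = j / k         # levels j + 1..k
    return contrast

# A two-level column (such as Sex) and a three-level column (such as Embarked).
sex_contrast = backward_difference_matrix(2)
embarked_contrast = backward_difference_matrix(3)

print(sex_contrast)
print(embarked_contrast)
```

Each row is the code for one level of the category. The two-level case gives -0.5 and 0.5, and the three-level case gives thirds, which are the values that show up in the `col_Sex_1` and `col_Embarked_*` columns of the encoded test set printed later in this notebook. The constant `col_*_0` columns in that output appear to be an intercept that the encoder adds.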

In [10]:
"""We need to drop some columns because the transformation has generated extra columns in the training set"""

train_cols = Xtrain_category.columns
test_cols = Xtest_category.columns

In [11]:
# Collect the training columns that have no counterpart in the test set
cols_to_drop = [col for col in train_cols if col not in test_cols]

In [12]:
"""Dropping columns"""

Xtrain_category = Xtrain_category.drop(cols_to_drop, axis=1)
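
The same alignment can also be written with pandas index operations. This is a minimal sketch on two made-up frames (`toy_train` and `toy_test` are hypothetical), assuming the goal is to keep only the training columns that also exist in the test set.

```python
import pandas as pd

# Made-up frames: 'b' exists only in the training frame.
toy_train = pd.DataFrame({'a': [1, 2], 'b': [3, 4], 'c': [5, 6]})
toy_test = pd.DataFrame({'a': [7], 'c': [8]})

# Columns present in train but missing from test.
extra_cols = toy_train.columns.difference(toy_test.columns)

# Drop them so both frames share the same columns.
aligned_train = toy_train.drop(columns=extra_cols)

print(list(extra_cols))
print(list(aligned_train.columns))
```

Like the cells above, this only removes extras from the training side. Columns that exist only in the test frame would need separate handling.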

In [13]:
print(Xtrain_category.shape)
print(Xtest_category.shape)

(891, 160)
(418, 160)


## 3.3 Null values in categorical data

To solve the problem of null values in the encoded categorical data, we will reuse the Imputer provided by scikit-learn, refitting it on the encoded training columns.

In [14]:
"""Refitting the imputer object on the encoded data"""

imp.fit(Xtrain_category)

Imputer(axis=0, copy=True, missing_values='NaN', strategy='mean', verbose=0)

In [15]:
"""Transforming data"""

Xtrain_category = pd.DataFrame(imp.transform(Xtrain_category), columns = Xtrain_category.columns)
Xtest_category = pd.DataFrame(imp.transform(Xtest_category), columns = Xtest_category.columns)

# 4. Scaling data

To scale the data, we will use the MinMaxScaler class provided by scikit-learn.

In [16]:
"""Initializing and fitting"""

min_max_scaler = preprocessing.MinMaxScaler()
min_max_scaler.fit(Xtrain_float)

MinMaxScaler(copy=True, feature_range=(0, 1))

In [17]:
"""Scaling"""

Xtrain_float = pd.DataFrame(min_max_scaler.transform(Xtrain_float), columns = train_float.columns)
Xtest_float = pd.DataFrame(min_max_scaler.transform(Xtest_float), columns = test_float.columns)

In [18]:
Xtest_float.head()

|   | Pclass | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|
| 0 | 1.0 | 0.428248 | 0.0 | 0.0 | 0.015282 |
| 1 | 1.0 | 0.585323 | 0.125 | 0.0 | 0.013663 |
| 2 | 0.5 | 0.773813 | 0.0 | 0.0 | 0.018909 |
| 3 | 1.0 | 0.334004 | 0.0 | 0.0 | 0.016908 |
| 4 | 1.0 | 0.271174 | 0.125 | 0.166667 | 0.023984 |
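
Min-max scaling maps each column with x' = (x - min) / (max - min), where min and max come from the data the scaler was fitted on. The sketch below reproduces that by hand on made-up numbers (`toy_train` and `toy_test` are hypothetical) and compares the result with `MinMaxScaler`.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up single-column data; the test set contains a value above the training maximum.
toy_train = np.array([[0.0], [20.0], [80.0]])
toy_test = np.array([[40.0], [100.0]])

scaler = MinMaxScaler()
scaler.fit(toy_train)

# Manual version of the same transformation, using training statistics only.
col_min = toy_train.min(axis=0)
col_max = toy_train.max(axis=0)
manual = (toy_test - col_min) / (col_max - col_min)

scaled = scaler.transform(toy_test)

print(manual.ravel())
print(scaled.ravel())
```

Because the scaler is fitted on the training data only, test values outside the training range land outside [0, 1], as the second test value does here.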


# 5. Concatenating categorical and numerical data

In [19]:
Xtest_category.head()

|   | col_Sex_0 | col_Sex_1 | col_Ticket_0 | col_Ticket_1 | col_Ticket_2 | col_Ticket_3 | col_Ticket_4 | col_Ticket_5 | col_Ticket_6 | col_Ticket_7 | ... | col_Cabin_32 | col_Cabin_33 | col_Cabin_34 | col_Cabin_35 | col_Cabin_36 | col_Cabin_37 | col_Cabin_38 | col_Embarked_0 | col_Embarked_1 | col_Embarked_2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | -0.5 | 1.0 | -0.991379 | -0.982759 | -0.974138 | -0.965517 | -0.956897 | -0.948276 | -0.939655 | ... | -0.179487 | -0.153846 | -0.128205 | -0.102564 | -0.076923 | -0.051282 | -0.025641 | 1.0 | 0.333333 | 0.666667 |
| 1 | 1.0 | 0.5 | 1.0 | -0.991379 | -0.982759 | -0.974138 | -0.965517 | -0.956897 | -0.948276 | -0.939655 | ... | -0.179487 | -0.153846 | -0.128205 | -0.102564 | -0.076923 | -0.051282 | -0.025641 | 1.0 | -0.666667 | -0.333333 |
| 2 | 1.0 | -0.5 | 1.0 | -0.991379 | -0.982759 | -0.974138 | -0.965517 | -0.956897 | -0.948276 | -0.939655 | ... | -0.179487 | -0.153846 | -0.128205 | -0.102564 | -0.076923 | -0.051282 | -0.025641 | 1.0 | 0.333333 | 0.666667 |
| 3 | 1.0 | -0.5 | 1.0 | -0.991379 | -0.982759 | -0.974138 | -0.965517 | -0.956897 | -0.948276 | -0.939655 | ... | -0.179487 | -0.153846 | -0.128205 | -0.102564 | -0.076923 | -0.051282 | -0.025641 | 1.0 | -0.666667 | -0.333333 |
| 4 | 1.0 | 0.5 | 1.0 | 0.008621 | 0.017241 | 0.025862 | 0.034483 | 0.043103 | 0.051724 | 0.060345 | ... | -0.179487 | -0.153846 | -0.128205 | -0.102564 | -0.076923 | -0.051282 | -0.025641 | 1.0 | -0.666667 | -0.333333 |


In [20]:
"""We have two kinds of data, categorical and non-categorical, so we need to concatenate both"""

Xtrain = pd.concat([Xtrain_float,Xtrain_category], axis=1)
Xtest = pd.concat([Xtest_float,Xtest_category], axis=1)

# 6. XGBoost Classifier

To solve this classification problem we will apply the XGBoost classifier.

In [21]:
"""Initializing the XGBoost classifier"""

model = xgb.XGBClassifier(n_estimators=2000, max_depth=5, learning_rate=0.1)

In [22]:
"""Fitting"""

model.fit(Xtrain, Y)

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=1, gamma=0, learning_rate=0.1, max_delta_step=0,
       max_depth=5, min_child_weight=1, missing=None, n_estimators=2000,
       n_jobs=1, nthread=None, objective='binary:logistic', random_state=0,
       reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
       silent=True, subsample=1)
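
The notebook imports `accuracy_score` but never measures accuracy before predicting on the test set. One common way to get an estimate is to hold out part of the training data. The sketch below shows the pattern on synthetic data from `make_classification`, using `RandomForestClassifier` (also imported at the top of the notebook) as a stand-in so that it runs without XGBoost. The split size, seeds, and variable names are arbitrary assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic binary classification data standing in for the prepared features.
features, labels = make_classification(n_samples=300, n_features=8, random_state=0)

# Hold out 20% of the rows for validation, keeping the class balance.
X_fit, X_val, y_fit, y_val = train_test_split(
    features, labels, test_size=0.2, random_state=0, stratify=labels
)

# Stand-in classifier; any estimator with fit/predict follows the same pattern.
stand_in = RandomForestClassifier(n_estimators=100, random_state=0)
stand_in.fit(X_fit, y_fit)

# Accuracy on rows the model has not seen during fitting.
val_accuracy = accuracy_score(y_val, stand_in.predict(X_val))
print(f"validation accuracy: {val_accuracy:.3f}")
```

The same `fit` and `predict` calls should work with the `XGBClassifier` defined above, applied to a split of `Xtrain` and `Y`.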

In [23]:
"""Making a prediction"""

Ypred = model.predict(Xtest)

In [24]:
"""Saving data"""
Ypred = pd.DataFrame({'Survived':Ypred})
prediction = pd.concat([test['PassengerId'], Ypred], axis=1)
prediction.to_csv('predictions_xboost.csv', sep=',', index=False)

In [25]:
prediction.head()

|   | PassengerId | Survived |
|---|---|---|
| 0 | 892 | 0 |
| 1 | 893 | 0 |
| 2 | 894 | 0 |
| 3 | 895 | 1 |
| 4 | 896 | 0 |
