![alt text](Tita3.png "Title")

# Titanic: Machine Learning from Disaster #

## Rafael S. Barbosa ##

This work presents the development of a machine learning model to predict who survived the Titanic disaster. The dataset and the relevant information come from a Kaggle challenge, described at the link below.

https://www.kaggle.com/c/titanic/overview

For this challenge, two files are given:

 - "train.csv", containing the data to be used to train the model;
 - "test.csv", containing the data to which the model is applied to predict the survivors.

The list of survivors predicted by the model needs to be written to a third file called "gender_submission.csv".

## Part 1: Defining the prediction model from the training data ##

Initially, the data from the file "train.csv" is evaluated and used to develop the prediction model.

The libraries to be used are:

In [1]:
import pandas as PD
from sklearn import linear_model
import random as RD
import numpy as NP
import warnings

### 1.1 Data reading and cleaning ###

The first step is to read the "train.csv" file and evaluate the data. The pandas library is used for this purpose.


In [2]:
file1 = 'train.csv'
train_df = PD.read_csv(file1, encoding='utf-8', delimiter=',')
train_df

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 886 | 887 | 0 | 2 | Montvila, Rev. Juozas | male | 27.0 | 0 | 0 | 211536 | 13.0000 | NaN | S |
| 887 | 888 | 1 | 1 | Graham, Miss. Margaret Edith | female | 19.0 | 0 | 0 | 112053 | 30.0000 | B42 | S |
| 888 | 889 | 0 | 3 | Johnston, Miss. Catherine Helen "Carrie" | female | NaN | 1 | 2 | W./C. 6607 | 23.4500 | NaN | S |
| 889 | 890 | 1 | 1 | Behr, Mr. Karl Howell | male | 26.0 | 0 | 0 | 111369 | 30.0000 | C148 | C |


The data contain some NaN values (for example, in the 'Age' and 'Cabin' columns). Therefore, the rows with missing values in the columns of interest for this analysis are dropped.

In [3]:
train_df = train_df.dropna(subset=['PassengerId','Survived','Pclass','Sex', \
'Age','SibSp','Parch','Fare','Embarked'])
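
To illustrate how the `subset` argument behaves, here is a minimal sketch on a toy frame (the values and the names `toy_df` and `clean_df` are hypothetical, not taken from the dataset). A row is dropped only when a value is missing in one of the listed columns, so gaps in an unlisted column such as 'Cabin' do not remove rows.

```python
import numpy as NP
import pandas as PD

# Toy frame (hypothetical values) with gaps in 'Age' and 'Cabin'
toy_df = PD.DataFrame({
    'PassengerId': [1, 2, 3, 4],
    'Age': [22.0, NP.nan, 26.0, 35.0],
    'Cabin': [NP.nan, 'C85', NP.nan, NP.nan],
})

# 'Cabin' is not listed in subset, so its missing values are ignored
clean_df = toy_df.dropna(subset=['PassengerId', 'Age'])

rows_before = len(toy_df)
rows_after = len(clean_df)
kept_ids = clean_df['PassengerId'].tolist()
print(rows_before, rows_after, kept_ids)
```

Only the row with the missing 'Age' is removed, which is why the same approach keeps most of the training rows despite the sparse 'Cabin' column.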

Now, the selected columns are converted into arrays with the proper adjustments.

In [4]:
warnings.filterwarnings('ignore')
#Passenger identification:
pidcol = train_df.loc[:,'PassengerId']
pid = NP.array(pidcol.values,dtype=int)
#Survived:
surcol = train_df.loc[:,'Survived']
sur = NP.array(surcol.values,dtype=float)
#Class where the passenger was traveling:
clacol = train_df.loc[:,'Pclass']
cla = NP.array(clacol.values,dtype=float)
#Passenger sex:
sexcol = train_df.loc[:,'Sex']
sx1 = NP.array(sexcol.values,dtype=str)
sex = NP.zeros(len(sx1))
for i in range(len(sx1)):
	if(sx1[i] == 'male'):
		sex[i] = 1.0
	else:
		sex[i] = 0.0
#Passenger age:
agecol = train_df.loc[:,'Age']
age = NP.array(agecol.values,dtype=float)
#Fare:
farcol = train_df.loc[:,'Fare']
far = NP.array(farcol.values,dtype=float)
#Port where the passenger embarked - here a small adjustment is made
#(C -> 0, Q -> 1, any other value, i.e. S -> 2):
embcol = train_df.loc[:,'Embarked']
emb = NP.array(embcol.values,dtype=str)
por = NP.zeros(len(emb))
for i in range(len(emb)):
	if(emb[i] == 'C'):
		por[i] = 0.0
	elif(emb[i] == 'Q'):
		por[i] = 1.0
	else:
		por[i] = 2.0
#Relatives traveling together. Here the two columns regarding passengers
#traveling with relatives are joined into a single 0/1 flag:
train_df['SibSp'] = train_df['SibSp'].astype(int)
train_df['Parch'] = train_df['Parch'].astype(int)
cond1 = (train_df['SibSp'] != 0)
train_df['par'] = NP.where(cond1, 1, 0)
cond2 = (train_df['Parch'] != 0)
train_df['par'] = NP.where(cond2, 1,train_df['par'])
relcol = train_df.loc[:,'par']
rel = NP.array(relcol.values,dtype=int)
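
The same encodings can also be written without explicit loops. The following is a sketch on a toy frame (hypothetical values and variable names), assuming that only the categories seen above occur. Unlike the loops, `Series.map` returns NaN for a category missing from the dictionary instead of silently assigning the fallback code.

```python
import pandas as PD

# Toy frame (hypothetical values) with the columns encoded above
toy_df = PD.DataFrame({
    'Sex': ['male', 'female', 'female'],
    'Embarked': ['S', 'C', 'Q'],
    'SibSp': [1, 0, 0],
    'Parch': [0, 0, 2],
})

# Same codes as the loops: male -> 1, female -> 0
sex_codes = toy_df['Sex'].map({'male': 1.0, 'female': 0.0}).to_numpy()

# Same codes as the loops: C -> 0, Q -> 1, S -> 2
port_codes = toy_df['Embarked'].map({'C': 0.0, 'Q': 1.0, 'S': 2.0}).to_numpy()

# 1 if the passenger has any sibling/spouse or parent/child aboard, else 0
rel_flag = ((toy_df['SibSp'] != 0) | (toy_df['Parch'] != 0)).astype(int).to_numpy()

print(sex_codes, port_codes, rel_flag)
```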

### 1.2 Regression model to predict who survived ###

The scikit-learn library is used to develop a regression model.

Two functions are defined:
 - a function that builds the feature vector;
 - a function that calculates the mean squared error of the predictions.

In [5]:
def feature(cla,sex,age):
	feat = [1.0,cla,sex,age]
	return feat
def MSE(model, X, y):
	predictions = model.predict(X)
	predictions = NP.around(predictions)
	predictions = predictions.astype(int)
	differences = [(a-b)**2 for (a,b) in zip(predictions, y)]
	return sum(differences) / len(differences)

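Now, to test the model accuracy properly, the dataset is split into two subsets: the first three quarters of the rows are used to train the model and the remaining quarter to test it. Both subsets come from "train.csv"; they should not be confused with the two files given by the challenge.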

In [6]:
N = len(sur)
X = [feature(cla[ind],sex[ind],age[ind]) for ind in range(N)]
y = sur
X_train = X[:3*N//4]
X_test = X[3*N//4:]
y_train = y[:3*N//4]
y_test = y[3*N//4:]
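
The split above takes the rows in file order. If the file order were correlated with the outcome, shuffling the rows before splitting could be a safer choice. The following is a minimal sketch of that alternative, not the method used in this work; the toy arrays, the seed, and the variable names are assumptions for illustration.

```python
import numpy as NP

# Toy data standing in for X and y (hypothetical values)
X_toy = NP.arange(16, dtype=float).reshape(8, 2)
y_toy = NP.array([0, 1, 0, 1, 1, 0, 0, 1], dtype=float)

rng = NP.random.default_rng(seed=0)   # fixed seed, so the split is reproducible
order = rng.permutation(len(y_toy))   # shuffled row indices

cut = 3 * len(y_toy) // 4             # same 75/25 proportion as above
train_idx, test_idx = order[:cut], order[cut:]

X_tr, X_te = X_toy[train_idx], X_toy[test_idx]
y_tr, y_te = y_toy[train_idx], y_toy[test_idx]

print(len(X_tr), len(X_te))
```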

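A linear regression model with ridge (L2) regularization is applied. Since `fit_intercept=False` is used, the constant 1.0 in the feature vector plays the role of the intercept.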

In [7]:
lamb = 1.0
model = linear_model.Ridge(lamb, fit_intercept=False)
model.fit(X_train, y_train)

Ridge(alpha=1.0, copy_X=True, fit_intercept=False, max_iter=None,
      normalize=False, random_state=None, solver='auto', tol=0.001)
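
As a sketch of what this fit computes: with `fit_intercept=False`, ridge regression minimizes $\|y - Xw\|^2 + \lambda \|w\|^2$, whose minimizer is $w = (X^\top X + \lambda I)^{-1} X^\top y$. The helper `ridge_fit` and the toy rows below are hypothetical and only illustrate this formula; scikit-learn may reach the same solution with a different solver.

```python
import numpy as NP

def ridge_fit(X, y, lamb):
    """Closed-form ridge weights without a separate intercept term."""
    X = NP.asarray(X, dtype=float)
    y = NP.asarray(y, dtype=float)
    # Regularized normal equations: (X^T X + lamb * I) w = X^T y
    A = X.T @ X + lamb * NP.eye(X.shape[1])
    return NP.linalg.solve(A, X.T @ y)

# Toy rows in the same layout as feature(): [1.0, class, sex, age]
X_toy = [[1.0, 3.0, 1.0, 22.0],
         [1.0, 1.0, 0.0, 38.0],
         [1.0, 3.0, 0.0, 26.0],
         [1.0, 1.0, 0.0, 35.0],
         [1.0, 3.0, 1.0, 35.0]]
y_toy = [0.0, 1.0, 1.0, 1.0, 0.0]

w = ridge_fit(X_toy, y_toy, 1.0)
scores = NP.asarray(X_toy) @ w   # real-valued scores, rounded later to 0/1
print(w, scores)
```

Note that in this formulation the weight of the constant column is penalized like any other weight.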

Then, the predictions can be made.

In [8]:
predictionsTrain = model.predict(X_train)
predictionsTest = model.predict(X_test)
#Model prediction validation:
predictionsTrain = NP.around(predictionsTrain)
predictionsTrain = predictionsTrain.astype(int)
predictionsTest = NP.around(predictionsTest)
predictionsTest = predictionsTest.astype(int)
#
y_train= y_train.astype(int)
y_test = y_test.astype(int)

Checking the model accuracy:

In [9]:
correctPredictionsTrain = (predictionsTrain == y_train)
correctPredictionsTest = (predictionsTest == y_test)
#
tracc = sum(correctPredictionsTrain) / len(correctPredictionsTrain)
teacc = sum(correctPredictionsTest) / len(correctPredictionsTest) 
mseTrain = MSE(model, X_train, y_train)
mseTest = MSE(model, X_test, y_test)
#
print('Training set accuracy: ' + str(tracc))
print('Test set accuracy: ' + str(teacc))
print('Mean squared error - training: '+ str(mseTrain))
print('Mean squared error - test: '+ str(mseTest))

Training set accuracy: 0.7771535580524345
Test set accuracy: 0.8033707865168539
Mean squared error - training: 0.22284644194756553
Mean squared error - test: 0.19662921348314608
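
In the output above, each accuracy and its corresponding error add up to 1. This is expected when the error is computed on rounded predictions: if every rounded prediction is 0 or 1, each squared difference is 1 for a wrong prediction and 0 for a correct one, so the mean squared error equals the misclassification rate. A minimal sketch with hypothetical labels:

```python
import numpy as NP

y_true = NP.array([0, 1, 1, 0, 1, 0, 0, 1])
y_pred = NP.array([0, 1, 0, 0, 1, 1, 0, 1])   # hypothetical rounded predictions

accuracy = NP.mean(y_pred == y_true)          # fraction of correct predictions
mse = NP.mean((y_pred - y_true) ** 2)         # squared error on the 0/1 values

print(accuracy, mse, accuracy + mse)
```

The identity would not hold if a rounded prediction fell outside {0, 1}, which a linear model can produce in principle.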


## Part 2: Using the model to predict the survivors ##

Here, the model is fitted on all the data from the 'train.csv' file and then used to predict the survivors from the features in the 'test.csv' file.

### 2.1 Data reading and cleaning ###

Now, the final dataset is cleaned and organized.

In [10]:
#Reading:
file2 = 'test.csv'
test_df = PD.read_csv(file2, encoding='utf-8', delimiter=',')
test_df

| | PassengerId | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 892 | 3 | Kelly, Mr. James | male | 34.5 | 0 | 0 | 330911 | 7.8292 | NaN | Q |
| 1 | 893 | 3 | Wilkes, Mrs. James (Ellen Needs) | female | 47.0 | 1 | 0 | 363272 | 7.0000 | NaN | S |
| 2 | 894 | 2 | Myles, Mr. Thomas Francis | male | 62.0 | 0 | 0 | 240276 | 9.6875 | NaN | Q |
| 3 | 895 | 3 | Wirz, Mr. Albert | male | 27.0 | 0 | 0 | 315154 | 8.6625 | NaN | S |
| 4 | 896 | 3 | Hirvonen, Mrs. Alexander (Helga E Lindqvist) | female | 22.0 | 1 | 1 | 3101298 | 12.2875 | NaN | S |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 413 | 1305 | 3 | Spector, Mr. Woolf | male | NaN | 0 | 0 | A.5. 3236 | 8.0500 | NaN | S |
| 414 | 1306 | 1 | Oliva y Ocana, Dona. Fermina | female | 39.0 | 0 | 0 | PC 17758 | 108.9000 | C105 | C |
| 415 | 1307 | 3 | Saether, Mr. Simon Sivertsen | male | 38.5 | 0 | 0 | SOTON/O.Q. 3101262 | 7.2500 | NaN | S |
| 416 | 1308 | 3 | Ware, Mr. Frederick | male | NaN | 0 | 0 | 359309 | 8.0500 | NaN | S |


Since there are some NaNs in the 'Age' column, and given that the final submission file must list all the passengers in the 'test.csv' file, the rows cannot simply be dropped as before.
Instead, the mean value of the 'Age' column is used to fill the NaN positions.

In [11]:
aux_df = test_df.dropna(subset=['Age'])
medage = aux_df['Age'].mean() 
test_df['Age'] = test_df['Age'].fillna(medage)
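
A small sketch of this imputation on hypothetical values shows the effect. It relies on `Series.mean` skipping NaN entries by default, so the mean is computed from the known ages only.

```python
import numpy as NP
import pandas as PD

# Toy 'Age' column (hypothetical values) with two missing entries
toy_df = PD.DataFrame({'Age': [20.0, NP.nan, 40.0, NP.nan]})

mean_age = toy_df['Age'].mean()                  # NaN entries are skipped
toy_df['Age'] = toy_df['Age'].fillna(mean_age)   # every row is kept

filled = toy_df['Age'].tolist()
print(mean_age, filled)
```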

Inserting the data into arrays:

In [12]:
pidcol = test_df.loc[:,'PassengerId']
pid2 = NP.array(pidcol.values,dtype=int)
#
clacol = test_df.loc[:,'Pclass']
cla2 = NP.array(clacol.values,dtype=float)
#
sexcol = test_df.loc[:,'Sex']
sx1 = NP.array(sexcol.values,dtype=str)
sex2 = NP.zeros(len(sx1))
for i in range(len(sx1)):
	if(sx1[i] == 'male'):
		sex2[i] = 1.0
	else:
		sex2[i] = 0.0
#
agecol = test_df.loc[:,'Age']
age2 = NP.array(agecol.values,dtype=float)

This time, the model is fitted using all the data from 'train.csv'.

In [13]:
model = linear_model.Ridge(lamb, fit_intercept=False)
model.fit(X, y)

Ridge(alpha=1.0, copy_X=True, fit_intercept=False, max_iter=None,
      normalize=False, random_state=None, solver='auto', tol=0.001)

Accordingly, the predictions can be made.

In [14]:
N = len(cla2)
Xfinal = [feature(cla2[ind],sex2[ind],age2[ind]) for ind in range(N)]
predictions = model.predict(Xfinal)
predictions = NP.around(predictions)
predictions = predictions.astype(int)

Finally, the pandas library is used to quickly organize the results in the requested format.

In [15]:
aux =  list(zip(pid2, predictions))  
out_df = PD.DataFrame(aux, columns = ['PassengerId','Survived']) 
out_df.to_csv('gender_submission.csv', sep=',', index=False)
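
Before submitting, it may be worth checking the layout of the file: a header line with the two column names, followed by one line per passenger with a 0/1 prediction. The sketch below uses hypothetical ids and predictions and writes to an in-memory buffer instead of a file.

```python
import io
import pandas as PD

# Hypothetical ids and predictions, in the same layout as out_df
pid_toy = [892, 893, 894]
pred_toy = [0, 1, 0]
out_toy = PD.DataFrame({'PassengerId': pid_toy, 'Survived': pred_toy})

# Write to an in-memory buffer so the resulting text can be inspected
buffer = io.StringIO()
out_toy.to_csv(buffer, sep=',', index=False)
lines = buffer.getvalue().splitlines()

only_binary = set(out_toy['Survived']) <= {0, 1}
print(lines, only_binary)
```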