# TITANIC

An analysis of which passengers survived the sinking of the Titanic. Children, women and upper-class passengers were given priority for the lifeboats, so their chances of survival were higher. Machine learning tools are applied to the data to predict whether a passenger survived. The dataset contains: Name, PassengerId, Age, Sex, Class, Ticket Number, Cabin, Number of siblings and spouses aboard (SibSp), Number of parents and children aboard (Parch), Port of embarkation, Ticket Price, and Survived or not.

The data is read from file and stored in pandas DataFrame objects. There are 2 files in CSV format: train.csv and test.csv. train.csv contains every attribute, including Survived, and is used to find the relationship between the other attributes and Survived. test.csv contains every attribute except Survived. The relationship found from train.csv is applied to test.csv to predict Survived.

In [11]:
import os
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

DATA_PATH = "C:/Users/Aishu/Documents/GitHub/MachineLearning/datasets"
def load_titanic_data(data_path, file_name):
    csv_path = os.path.join(data_path, file_name)
    return pd.read_csv(csv_path)

train_set = load_titanic_data(DATA_PATH, "train.csv")
test_set = load_titanic_data(DATA_PATH, "test.csv")

# Data Preprocessing

The attributes are first preprocessed to make them suitable for the data analysis tools. Some preprocessing techniques are:

- Missing values are filled with a meaningful value.
- Attributes that don't affect the analysis are removed.
- Additional attributes may be derived to make the data more meaningful.
- Continuous values can be converted to discrete values.
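The four techniques above can be sketched on a small toy table. The rows and the `AgeBin` column name are made up for illustration and are not taken from the Titanic files.

```python
import pandas as pd

# Toy data standing in for a few Titanic columns (values are invented).
toy = pd.DataFrame({
    "Age": [22.0, None, 38.0, 4.0],
    "Fare": [7.25, 71.28, None, 16.70],
    "Ticket": ["T1", "T2", "T3", "T4"],
    "SibSp": [1, 1, 0, 1],
    "Parch": [0, 0, 0, 1],
})

# 1. Fill missing values with a meaningful value (here, the column median).
toy["Age"] = toy["Age"].fillna(toy["Age"].median())
toy["Fare"] = toy["Fare"].fillna(toy["Fare"].median())

# 2. Remove an attribute that is unlikely to help the analysis.
toy = toy.drop(columns=["Ticket"])

# 3. Derive an additional attribute from existing ones.
toy["FamilySize"] = toy["SibSp"] + toy["Parch"] + 1

# 4. Convert a continuous attribute to discrete, equal-width bins (codes 0..2).
toy["AgeBin"] = pd.cut(toy["Age"], bins=3, labels=False)

print(toy)
```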

In test_set, the Fare attribute has a few missing values. They are replaced by the median Fare of test_set. The Ticket and Cabin values are mostly unique, with only a few repetitions, which suggests they have little impact on the Survived attribute. Hence they are removed from both the train and test datasets. Preprocessing needs to be applied to the test and train data in the same manner, so the two DataFrames are stored together in a list variable called combine.

In [12]:
# Assign the result back; in-place fillna on a selected column may not update test_set
test_set['Fare'] = test_set['Fare'].fillna(test_set['Fare'].median())

train_set = train_set.drop(["Cabin", "Ticket"], axis=1)
test_set = test_set.drop(["Cabin", "Ticket"], axis=1)
combine = [train_set, test_set]

A new attribute called Title is created by extracting the title (the word followed by a period) from the Name attribute.

In [311]:
for dataset in combine:
    # Raw string so that \. is passed to the regex engine unchanged
    dataset['Title'] = dataset.Name.str.extract(r' ([A-Za-z]+)\.', expand=False)

pd.crosstab(train_set['Title'], train_set['Sex'])

Title,female,male
Capt,0,1
Col,0,2
Countess,1,0
Don,0,1
Dr,1,6
Jonkheer,0,1
Lady,1,0
Major,0,2
Master,0,40
Miss,182,0
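As a quick check of the extraction pattern, the same regular expression can be run on a few names in isolation. The names below are the complete ones shown in this notebook's own output.

```python
import pandas as pd

names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Futrelle, Mrs. Jacques Heath (Lily May Peel)",
    "Allen, Mr. William Henry",
])

# A space, then a run of letters, then a literal period: the letters are the title.
titles = names.str.extract(r" ([A-Za-z]+)\.", expand=False)
print(titles.tolist())  # ['Mr', 'Miss', 'Mrs', 'Mr']
```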


The titles are grouped into 5 categories, and the survival rate of each category is estimated.

In [312]:
for dataset in combine:
    dataset['Title'] = dataset['Title'].replace(
        ['Lady', 'Countess', 'Capt', 'Col', 'Don', 'Dr', 'Major', 'Rev', 'Sir', 'Jonkheer', 'Dona'],
        'Rare')

    dataset['Title'] = dataset['Title'].replace('Mlle', 'Miss')
    dataset['Title'] = dataset['Title'].replace('Ms', 'Miss')
    dataset['Title'] = dataset['Title'].replace('Mme', 'Mrs')
    
train_set[['Title', 'Survived']].groupby(['Title'], as_index=False).mean()

Unnamed: 0,Title,Survived
0,Master,0.575
1,Miss,0.702703
2,Mr,0.156673
3,Mrs,0.793651
4,Rare,0.347826


Each title is mapped to a pre-defined integer value and stored in the dataset. If a title is not found in the mapping, its value is set to 0.

In [313]:
title_mapping = {"Mr": 1, "Miss": 2, "Mrs": 3, "Master": 4, "Rare": 5}
for dataset in combine:
    dataset['Title'] = dataset['Title'].map(title_mapping)
    dataset['Title'] = dataset['Title'].fillna(0)

train_set.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Fare,Embarked,Title
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,7.25,S,1
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,71.2833,C,3
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,7.925,S,2
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,53.1,S,3
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,8.05,S,1
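The grouping and mapping steps can be tried on a standalone Series. This is a minimal sketch; the title "Unknown" is a hypothetical value added only to show the fallback to 0.

```python
import pandas as pd

# "Unknown" is a made-up title that is absent from the mapping.
titles = pd.Series(["Mr", "Mlle", "Dr", "Ms", "Mme", "Master", "Unknown"])

rare = ["Lady", "Countess", "Capt", "Col", "Don", "Dr",
        "Major", "Rev", "Sir", "Jonkheer", "Dona"]
synonyms = {"Mlle": "Miss", "Ms": "Miss", "Mme": "Mrs"}

# Collapse rare titles into one category, then merge spelling variants.
grouped = titles.replace(rare, "Rare").replace(synonyms)

title_mapping = {"Mr": 1, "Miss": 2, "Mrs": 3, "Master": 4, "Rare": 5}
# Titles missing from the mapping become NaN, which is then replaced by 0.
codes = grouped.map(title_mapping).fillna(0).astype(int)
print(codes.tolist())  # [1, 2, 5, 2, 3, 4, 0]
```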


Attributes like Name and PassengerId are almost unique and hence don't affect the survival condition. They are therefore removed from the dataset. The PassengerId of test_set is retained for the final submission.

In [13]:
train_set = train_set.drop(["Name", "PassengerId"], axis=1)
test_set = test_set.drop(["Name"], axis=1)
combine = [train_set, test_set]

The Sex attribute takes one of 2 values: male or female. Most machine learning tools expect numeric rather than string attribute values, so these are mapped to integers.

In [14]:
for dataset in combine:
    dataset['Sex'] = dataset['Sex'].map( {'female': 1, 'male': 0} ).astype(int)

train_set.head()

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked
0,0,3,0,22.0,1,0,7.25,S
1,1,1,1,38.0,1,0,71.2833,C
2,1,3,1,26.0,0,0,7.925,S
3,1,1,1,35.0,1,0,53.1,S
4,0,3,0,35.0,0,0,8.05,S


The Sex and Pclass attributes are combined to guess the age of a person: the guess is the median age of passengers with the same Sex and Pclass. This guessed age replaces any missing value in the Age attribute. The ages are then divided into 5 ranges of equal width. The survival rate for each range is computed, which confirms that the ranges are meaningful and makes it easier to predict survival. Finally, the ranges are replaced by integer values.

In [5]:
import numpy as np

# guess_ages[i, j] holds the median age for Sex == i and Pclass == j + 1
guess_ages = np.zeros((2,3))
for dataset in combine:
    for i in range(0, 2):
        for j in range(0, 3):
            guess_set = dataset[(dataset['Sex'] == i) & (dataset['Pclass'] == j+1)]['Age'].dropna()
            age = guess_set.median()
            # Round the median to the nearest 0.5
            guess_ages[i,j] = int(age/0.5+0.5)*0.5
    # Fill each missing age with the guess for that passenger's Sex and Pclass
    for i in range(0,2):
        for j in range(0,3):
            dataset.loc[(dataset.Age.isnull()) & (dataset.Sex == i) & (dataset.Pclass == j+1), 'Age'] = guess_ages[i,j]
    dataset['Age'] = dataset['Age'].astype(int)
train_set["AgeRange"] = pd.cut(train_set['Age'],5)
train_set[['AgeRange', 'Survived']].groupby(['AgeRange'], as_index=False).mean().sort_values(by='AgeRange', ascending=True)
 


Unnamed: 0,AgeRange,Survived
0,"(-0.08, 16.0]",0.55
1,"(16.0, 32.0]",0.337374
2,"(32.0, 48.0]",0.412037
3,"(48.0, 64.0]",0.434783
4,"(64.0, 80.0]",0.090909
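The group-median imputation can also be written with `groupby(...).transform`. The following is only a sketch on invented rows, and it skips the rounding to the nearest 0.5 that the notebook cell performs.

```python
import pandas as pd

# Toy rows: two (Sex, Pclass) groups, each with one missing age (values invented).
toy = pd.DataFrame({
    "Sex":    [0, 0, 0, 1, 1, 1],
    "Pclass": [3, 3, 3, 1, 1, 1],
    "Age":    [20.0, 30.0, None, 40.0, None, 50.0],
})

# Median age of each (Sex, Pclass) group, aligned row by row with toy.
group_median = toy.groupby(["Sex", "Pclass"])["Age"].transform("median")

# Only the missing ages are replaced by their group's median.
toy["Age"] = toy["Age"].fillna(group_median).astype(int)
print(toy["Age"].tolist())  # [20, 30, 25, 40, 45, 50]
```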


In [6]:
for dataset in combine:    
    dataset.loc[ dataset['Age'] <= 16, 'Age'] = 0
    dataset.loc[(dataset['Age'] > 16) & (dataset['Age'] <= 32), 'Age'] = 1
    dataset.loc[(dataset['Age'] > 32) & (dataset['Age'] <= 48), 'Age'] = 2
    dataset.loc[(dataset['Age'] > 48) & (dataset['Age'] <= 64), 'Age'] = 3
    dataset.loc[ dataset['Age'] > 64, 'Age'] = 4
train_set = train_set.drop(['AgeRange'], axis=1)
combine = [train_set, test_set]
train_set.head()

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked
0,0,3,0,1,1,0,7.25,S
1,1,1,1,2,1,0,71.2833,C
2,1,3,1,1,0,0,7.925,S
3,1,1,1,2,1,0,53.1,S
4,0,3,0,2,0,0,8.05,S
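The same range-to-integer mapping can be expressed in one call with `pd.cut` and explicit edges. This is a sketch on sample ages; the lower edge -1 and the upper edge 200 are arbitrary assumptions chosen so that every age falls into a bin.

```python
import pandas as pd

ages = pd.Series([0, 16, 17, 32, 40, 64, 65, 80])

# Right-closed intervals: (-1, 16], (16, 32], (32, 48], (48, 64], (64, 200]
age_bins = [-1, 16, 32, 48, 64, 200]

# labels=False returns the integer position of each bin instead of an interval.
codes = pd.cut(ages, bins=age_bins, labels=False)
print(codes.tolist())  # [0, 0, 1, 1, 2, 3, 4, 4]
```

Using the fixed edges for both the train and test data keeps the two encodings consistent.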


Missing values of Embarked (the port of embarkation) are replaced by its most frequent value. The values are then mapped to pre-defined integer values.

In [15]:
freq_embarked = train_set.Embarked.dropna().mode()[0]
for dataset in combine:
    dataset['Embarked'] = dataset['Embarked'].fillna(freq_embarked)
    dataset['Embarked'] = dataset['Embarked'].map({'C':0, 'Q':1, 'S':2}).astype(int)
train_set.head()

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked
0,0,3,0,22.0,1,0,7.25,2
1,1,1,1,38.0,1,0,71.2833,0
2,1,3,1,26.0,0,0,7.925,2
3,1,1,1,35.0,1,0,53.1,2
4,0,3,0,35.0,0,0,8.05,2
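The fill-then-map pattern for Embarked can be checked on a short Series. The sample values below are invented.

```python
import pandas as pd

embarked = pd.Series(["S", "C", None, "S", "Q"])

# mode() ignores missing values by default; [0] picks the most frequent port.
most_common = embarked.mode()[0]

codes = embarked.fillna(most_common).map({"C": 0, "Q": 1, "S": 2}).astype(int)
print(codes.tolist())  # [2, 0, 2, 2, 1]
```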


The Fare attribute is grouped into 4 quantile-based bands and the survival rate of each band is checked. Since the survival rate clearly depends on the fare band, Fare is mapped to pre-defined integer values.

In [16]:
train_set['FareBand'] = pd.qcut(train_set['Fare'], 4)
train_set[['FareBand', 'Survived']].groupby(['FareBand'], as_index=False).mean().sort_values(by='FareBand', ascending=True)

Unnamed: 0,FareBand,Survived
0,"(-0.001, 7.91]",0.197309
1,"(7.91, 14.454]",0.303571
2,"(14.454, 31.0]",0.454955
3,"(31.0, 512.329]",0.581081


In [17]:
for dataset in combine:
    dataset.loc[ dataset['Fare'] <= 7.91, 'Fare'] = 0
    dataset.loc[(dataset['Fare'] > 7.91) & (dataset['Fare'] <= 14.454), 'Fare'] = 1
    dataset.loc[(dataset['Fare'] > 14.454) & (dataset['Fare'] <= 31), 'Fare']   = 2
    dataset.loc[ dataset['Fare'] > 31, 'Fare'] = 3
    dataset['Fare'] = dataset['Fare'].astype(int)

train_set = train_set.drop(['FareBand'], axis=1)
combine = [train_set, test_set]

train_set.head()

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked
0,0,3,0,22.0,1,0,0,2
1,1,1,1,38.0,1,0,3,0
2,1,3,1,26.0,0,0,1,2
3,1,1,1,35.0,1,0,3,2
4,0,3,0,35.0,0,0,1,2
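Instead of copying the band edges by hand, the edges computed by `pd.qcut` on the training fares can be reused for the test fares. This is a sketch on invented fares; widening the outer edges to infinity is an assumption made here so that test fares outside the training range still get a band.

```python
import numpy as np
import pandas as pd

# Invented fares standing in for train_set['Fare'] and test_set['Fare'].
train_fare = pd.Series([5.0, 7.5, 8.0, 10.0, 15.0, 20.0, 40.0, 100.0])
test_fare = pd.Series([6.0, 9.0, 24.0, 300.0])

# Quartile bands on the training data; retbins=True also returns the edges.
train_codes, edges = pd.qcut(train_fare, 4, labels=False, retbins=True)

# Open the outer edges so that unseen extreme fares still fall into a band.
edges[0], edges[-1] = -np.inf, np.inf

# Apply the training edges to the test data.
test_codes = pd.cut(test_fare, bins=edges, labels=False)

print(train_codes.tolist())
print(test_codes.tolist())
```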


Age and Pclass are combined into a new attribute, Age*Class, by multiplying their integer values.

In [321]:
for dataset in combine:
    dataset['Age*Class'] = dataset.Age * dataset.Pclass

train_set.loc[:, ['Age*Class', 'Age', 'Pclass']].head(10)

Unnamed: 0,Age*Class,Age,Pclass
0,3,1,3
1,2,2,1
2,3,1,3
3,2,2,1
4,6,2,3
5,3,1,3
6,3,3,1
7,0,0,3
8,3,1,3
9,0,0,2


The numbers of siblings, spouses, parents and children (SibSp and Parch) are added together, plus one for the passenger, and stored in a new attribute called FamilySize. The survival rate for each FamilySize is calculated to check its relationship with the Survived attribute.

In [322]:
for dataset in combine:
    dataset['FamilySize'] = dataset['SibSp'] + dataset['Parch'] + 1

train_set[['FamilySize', 'Survived']].groupby(['FamilySize'], as_index=False).mean().sort_values(by='Survived', ascending=False)

Unnamed: 0,FamilySize,Survived
3,4,0.724138
2,3,0.578431
1,2,0.552795
6,7,0.333333
0,1,0.303538
4,5,0.2
5,6,0.136364
7,8,0.0
8,11,0.0


The FamilySize attribute is used to create a new attribute called IsAlone, which holds only 2 values: 1 if FamilySize is 1 (the passenger travelled alone), otherwise 0. The attributes FamilySize, SibSp and Parch can then be removed.

In [323]:
for dataset in combine:
    dataset['IsAlone'] = 0
    dataset.loc[dataset['FamilySize'] == 1, 'IsAlone'] = 1

train_set[['IsAlone', 'Survived']].groupby(['IsAlone'], as_index=False).mean()

train_set = train_set.drop(['Parch', 'SibSp', 'FamilySize'], axis=1)
test_set = test_set.drop(['Parch', 'SibSp', 'FamilySize'], axis=1)
combine = [train_set, test_set]

train_set.head()

Unnamed: 0,Survived,Pclass,Sex,Age,Fare,Embarked,Title,Age*Class,IsAlone
0,0,3,0,1,0,2,1,3,0
1,1,1,1,2,3,0,3,2,0
2,1,3,1,1,1,2,2,3,1
3,1,1,1,2,3,2,3,2,0
4,0,3,0,2,1,2,1,6,1
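The FamilySize and IsAlone rules can be verified on three invented passengers. This sketch uses a boolean comparison instead of the two-step assignment in the cell above; the result is the same.

```python
import pandas as pd

# Invented passengers: one with a spouse, one alone, one in a large family.
toy = pd.DataFrame({"SibSp": [1, 0, 3], "Parch": [0, 0, 2]})

# Relatives aboard plus the passenger.
toy["FamilySize"] = toy["SibSp"] + toy["Parch"] + 1

# 1 when the passenger has no relatives aboard, otherwise 0.
toy["IsAlone"] = (toy["FamilySize"] == 1).astype(int)
print(toy)
```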


# Modeling the dataset

A Decision Tree Classifier is used to predict whether a person survived or not. Decision trees suit classification problems such as this one, where the target attribute takes only 2 values. The model uses all the preprocessed attributes as features to predict the Survived attribute.

The train_set is split into 2 parts: train and test. Train contains 80% of train_set and test contains the remaining 20%. X contains the preprocessed attributes and Y contains the Survived attribute. The Decision Tree classifier is fitted on train_X and train_Y. This classifier is then applied to test_X to predict the Survived attribute, and the predictions are compared with the actual values of Survived. The accuracy of the model is calculated as the number of correct predictions divided by the total number of predictions.

In [18]:
from sklearn.model_selection import train_test_split

train_X, test_X, train_Y, test_Y = train_test_split(train_set, train_set['Survived'], test_size=0.2, random_state=42)
train_X = train_X.drop(["Survived"],axis=1)
test_X = test_X.drop(["Survived"],axis=1)

tree_classifier = DecisionTreeClassifier()
tree_classifier.fit(train_X, train_Y) 
y_pred = tree_classifier.predict(test_X)
# Accuracy is measured on the held-out 20% split, not on the training rows
print("Accuracy on held-out split: ", sum(y_pred == test_Y) / len(test_Y))

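ValueError: Input contains NaN, infinity or a value too large for dtype('float32').

This error is raised when the feature matrix still contains missing values. Here that most likely means the Age imputation cell was not run in the current session before the classifier was fitted.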
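The split, fit and accuracy steps can be sketched end to end on synthetic data. The rows and the survival rule below are invented for illustration; `accuracy_score` from scikit-learn computes the same ratio of correct predictions to total predictions. The check for missing values before fitting guards against the NaN error.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, already-encoded features (not the real Titanic data).
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "Pclass": rng.integers(1, 4, size=200),
    "Sex": rng.integers(0, 2, size=200),
    "Age": rng.integers(0, 5, size=200),
})
# Made-up rule so that the target depends on the features.
toy["Survived"] = ((toy["Sex"] == 1) | (toy["Pclass"] == 1)).astype(int)

X = toy.drop(columns=["Survived"])
y = toy["Survived"]

# Fail early if any feature still contains missing values.
assert not X.isna().any().any(), "fill missing values before fitting"

# 80% of the rows for fitting, 20% held out for evaluation.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42)

clf = DecisionTreeClassifier(random_state=42)
clf.fit(X_train, y_train)

# Fraction of held-out rows predicted correctly.
val_accuracy = accuracy_score(y_val, clf.predict(X_val))
print("Accuracy on held-out split:", val_accuracy)
```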

The Decision Tree classifier is built on the entire training set and applied to the test_set to get the final results of the given problem.

In [325]:
train_Y = train_set["Survived"]
train_X = train_set.drop(["Survived"],axis=1)
test_data = test_set.drop(["PassengerId"],axis=1)

tree_classifier1 = DecisionTreeClassifier()
tree_classifier1.fit(train_X, train_Y) 
test_set["Survived"] = tree_classifier1.predict(test_data)
test_set[['PassengerId', 'Survived']].to_csv('datasets/titanic_result.csv', index=False)
test_set[['PassengerId', 'Survived']]

Unnamed: 0,PassengerId,Survived
0,892,0
1,893,0
2,894,0
3,895,0
4,896,1
5,897,0
6,898,1
7,899,0
8,900,1
9,901,0
