# Titanic: Machine Learning from Disaster

## Overview

The sinking of the RMS Titanic is one of the most infamous shipwrecks in history.  On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.

One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class.

This challenge is from [Kaggle.com](https://www.kaggle.com/c/titanic/overview).

We start by looking at the data provided. It is already split into two files, `train.csv` and `test.csv`. We will do all the feature engineering and model fitting on the `train` set and then apply the result to the `test` set.

In [1]:
#Importing packages
import pandas as pd

#Reading csv file
data = pd.read_csv('train.csv')

data.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


## Feature Selection

As we can see, the training set has 12 columns. The `Survived` column is the *"ground truth"* for this ML problem: the task is to predict whether a passenger survived, given the remaining attributes.

We have to remove certain attributes from the data to get a good result. Looking at the dataset, the `Name`, `PassengerId`, `Ticket` and `Cabin` attributes do not appear to contribute much to the target attribute, i.e. `Survived`. So, we remove these attributes and move forward.

In [2]:
#Defining the feature and target attributes
feature_attr = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked"]
target_attr = ["Survived"]

#Selecting the feature and target columns by name
feature = data[feature_attr]
target = data[target_attr]
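
One way to sanity-check the decision to drop `Name`, `PassengerId`, `Ticket` and `Cabin` is to count how many distinct values each column has and how much of it is missing. A column that is unique per row (like an identifier) or mostly empty carries little signal in its raw form. The following is only a minimal sketch on the five rows shown earlier, not part of the original notebook; `sample`, `dropped` and `summary` are illustrative names.

```python
import pandas as pd

# Toy stand-in for train.csv, built from the five rows displayed by data.head().
sample = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4, 5],
    "Name": [
        "Braund, Mr. Owen Harris",
        "Cumings, Mrs. John Bradley (Florence Briggs Th...",
        "Heikkinen, Miss. Laina",
        "Futrelle, Mrs. Jacques Heath (Lily May Peel)",
        "Allen, Mr. William Henry",
    ],
    "Ticket": ["A/5 21171", "PC 17599", "STON/O2. 3101282", "113803", "373450"],
    "Cabin": [None, "C85", None, "C123", None],
    "Survived": [0, 1, 1, 1, 0],
})

# Columns the notebook decides to drop.
dropped = ["PassengerId", "Name", "Ticket", "Cabin"]

# nunique() ignores missing values; isna().mean() gives the share of missing rows.
summary = pd.DataFrame({
    "unique_values": sample[dropped].nunique(),
    "missing_fraction": sample[dropped].isna().mean(),
})
print(summary)
```

On the full training set the same two calls can be run on `data[dropped]`. The numbers from this five-row sample say nothing about the real distribution.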

In [3]:
feature.head()

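| | Pclass | Sex | Age | SibSp | Parch | Fare | Embarked |
|---|---|---|---|---|---|---|---|
| 0 | 3 | male | 22.0 | 1 | 0 | 7.25 | S |
| 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C |
| 2 | 3 | female | 26.0 | 0 | 0 | 7.925 | S |
| 3 | 1 | female | 35.0 | 1 | 0 | 53.1 | S |
| 4 | 3 | male | 35.0 | 0 | 0 | 8.05 | S |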


In [4]:
target.head()

| | Survived |
|---|---|
| 0 | 0 |
| 1 | 1 |
| 2 | 1 |
| 3 | 1 |
| 4 | 0 |


## Removing Null Values

Before we move forward, we have to check for NaN values in the dataset that we are working on.

In [16]:
null_columns = feature.columns[feature.isna().any()]
null_columns

Index(['Age', 'Embarked'], dtype='object')

So, there are null values in the feature dataset. Let's see how many there are.

In [17]:
feature[null_columns].isna().sum()

Age         177
Embarked      2
dtype: int64
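
How these gaps are handled is a modelling choice. One common option, shown here only as a sketch and not necessarily what this notebook does later, is to fill the numeric `Age` column with its median and the categorical `Embarked` column with its most frequent value. The `toy` and `filled` frames below are made-up illustrative data, not rows from `train.csv`.

```python
import numpy as np
import pandas as pd

# Made-up data with gaps in the same two columns that contain nulls above.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0, np.nan],
    "Embarked": ["S", "C", None, "S", "S"],
})

# Work on a copy so the original frame stays untouched.
filled = toy.copy()

# Numeric column: replace missing values with the median of the observed ones.
filled["Age"] = filled["Age"].fillna(filled["Age"].median())

# Categorical column: replace missing values with the most frequent category.
filled["Embarked"] = filled["Embarked"].fillna(filled["Embarked"].mode()[0])

# Every column should now report zero missing values.
print(filled.isna().sum())
```

The simpler alternative is `toy.dropna()`, which removes any row with a missing value. With 177 missing ages in the training set, that would discard a sizeable share of the rows, which is why filling is often preferred for `Age`.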

## Categorical Attributes

Before we can figure out which attributes contribute most to the target, we first encode the categorical attributes. Here, they are `Sex` and `Embarked`.

In [5]:
feature_cat = feature[["Sex", "Embarked"]]

In [6]:
feature_cat.head()

| | Sex | Embarked |
|---|---|---|
| 0 | male | S |
| 1 | female | C |
| 2 | female | S |
| 3 | female | S |
| 4 | male | S |
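
Encoding here means turning these string labels into numbers a model can consume. One widely used option is one-hot encoding with `pd.get_dummies`, which creates one 0/1 column per category. The sketch below assumes that approach purely for illustration; the notebook may use a different encoder. The names `toy_cat` and `encoded` are hypothetical.

```python
import pandas as pd

# The five rows of categorical features shown by feature_cat.head().
toy_cat = pd.DataFrame({
    "Sex": ["male", "female", "female", "female", "male"],
    "Embarked": ["S", "C", "S", "S", "S"],
})

# One-hot encode both columns; dtype=int yields 0/1 instead of booleans.
encoded = pd.get_dummies(toy_cat, columns=["Sex", "Embarked"], dtype=int)
print(encoded)
```

Only categories present in the data get a column, so this five-row sample produces `Embarked_C` and `Embarked_S` but no column for any other port. When encoding the `train` and `test` sets separately, it is worth checking that both end up with the same columns.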
