# Titanic: Machine Learning from Disaster

This is one of the most common problems for getting started with machine learning algorithms. It is based on the sinking of the Titanic, one of the most infamous shipwrecks in history. On April 15, 1912, the Titanic sank after hitting an iceberg during her maiden voyage.

The ship didn't have enough lifeboats for everyone on board, resulting in the death of 1,502 out of 2,224 passengers and crew. While there was an element of luck in survival, it appears that certain groups of people were more likely to survive than others.

The problem is framed as a supervised classification task. Here, a logistic regression model is fitted to predict whether each passenger survived.

## Setup

First, let's load the libraries we need.

In [1]:
import pandas as pd

## Load the Data

The data is already split into a training set and a test set, so let's load both.

In [2]:
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')

In [3]:
train.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


## Data Cleaning

First, let's check whether any values are missing from the data.

In [4]:
train.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

In [5]:
test.isnull().sum()

PassengerId      0
Pclass           0
Name             0
Sex              0
Age             86
SibSp            0
Parch            0
Ticket           0
Fare             1
Cabin          327
Embarked         0
dtype: int64
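Raw counts are hard to compare between tables of different sizes, so it can help to look at the fraction of missing values per column instead. The sketch below uses a small made-up frame (`toy` is a hypothetical name with invented values, not the real data); the same `isnull().mean()` call should work on `train` and `test`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for train/test; the values are made up.
toy = pd.DataFrame({
    'Age': [22.0, np.nan, 26.0, np.nan],
    'Cabin': [np.nan, 'C85', np.nan, np.nan],
    'Fare': [7.25, 71.28, 7.93, 53.10],
})

# isnull() returns a boolean frame; the mean of booleans is the fraction of True.
missing_fraction = toy.isnull().mean()
print(missing_fraction)
```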

There are null values in the Age feature in both tables. Leaving them in would complicate the analysis, because age is considered an important feature. To fill in the missing data, let's calculate the median age of the training dataset and use it to replace the null values in both tables (train and test). Cabin, Embarked and Fare also have missing values, but they are not used in the model below.
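Taking the median from the training set only and reusing it for the test set keeps information from the test data out of the preprocessing. Here is a minimal sketch of that pattern on made-up values (`toy_train`, `toy_test` and `toy_median` are hypothetical names):

```python
import numpy as np
import pandas as pd

# Made-up ages; NaN marks a missing value.
toy_train = pd.DataFrame({'Age': [20.0, np.nan, 30.0, 40.0]})
toy_test = pd.DataFrame({'Age': [np.nan, 50.0]})

# median() skips NaN by default, so this is the median of 20, 30 and 40.
toy_median = toy_train['Age'].median()

# Fill both tables with the value computed from the training table only.
toy_train['Age'] = toy_train['Age'].fillna(toy_median)
toy_test['Age'] = toy_test['Age'].fillna(toy_median)

print(toy_median)                # 30.0
print(toy_test['Age'].tolist())  # [30.0, 50.0]
```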

In [6]:
med_age = train['Age'].median()
train['Age'] = train['Age'].fillna(med_age)
test['Age'] = test['Age'].fillna(med_age)

Let's check that the fill worked.

In [7]:
train.Age.isnull().sum()

0

In [8]:
test.Age.isnull().sum()

0

Great! There are no null values left in the Age feature. Next, the Sex feature is currently stored as strings (male or female). Let's convert it to integers by creating another column that is 1 if the passenger is female and 0 if male.

In [9]:
train['IsFemale'] = (train['Sex'] == 'female').astype(int)
test['IsFemale'] = (test['Sex'] == 'female').astype(int)
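The comparison with `'female'` yields booleans, and `.astype(int)` turns `True`/`False` into 1/0. A small sketch on made-up values (`toy_sex` is a hypothetical name):

```python
import pandas as pd

# Made-up values in the same format as the Sex column.
toy_sex = pd.Series(['male', 'female', 'female', 'male'])

# Boolean comparison first, then cast True/False to 1/0.
is_female = (toy_sex == 'female').astype(int)
print(is_female.tolist())  # [0, 1, 1, 0]
```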

For the model we will use the passenger class (Pclass), the IsFemale value and the age (Age) as features. The label is the Survived column, which indicates whether the passenger survived (1) or not (0).

In [10]:
# Feature columns used as model inputs
features = ['Pclass', 'IsFemale', 'Age']
X_train = train[features].values
X_test = test[features].values
# Target: 1 if the passenger survived, 0 otherwise
y_train = train['Survived'].values
X_train[:5]

array([[ 3.,  0., 22.],
       [ 1.,  1., 38.],
       [ 3.,  1., 26.],
       [ 1.,  1., 35.],
       [ 3.,  0., 35.]])

Great! Now on to the model.

## Model to Predict Survival

We will use scikit-learn's LogisticRegression model.
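Logistic regression computes a weighted sum of the features and passes it through the logistic (sigmoid) function to get a probability of survival; class 1 is predicted when that probability is above 0.5. The sketch below illustrates the idea in plain Python. The `weights` and `bias` values are made up for illustration and are not the coefficients the fitted model will learn.

```python
import math

def sigmoid(z):
    """Map any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def survival_probability(features, weights, bias):
    """Weighted sum of the features, squashed by the sigmoid."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)

# Hypothetical coefficients for [Pclass, IsFemale, Age]; chosen by hand.
weights = [-1.0, 2.5, -0.03]
bias = 1.0

p_male_third = survival_probability([3, 0, 22], weights, bias)
p_female_first = survival_probability([1, 1, 38], weights, bias)

# A probability above 0.5 corresponds to a predicted label of 1.
print(p_male_third > 0.5, p_female_first > 0.5)  # False True
```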

In [11]:
from sklearn.linear_model import LogisticRegression

In [12]:
model = LogisticRegression()
model.fit(X_train, y_train)

LogisticRegression()

In [13]:
y_predict = model.predict(X_test)

In [14]:
y_predict[:10]

array([0, 0, 0, 0, 1, 0, 1, 0, 1, 0])
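The test set has no Survived column, so these predictions cannot be scored locally. One common way to estimate accuracy is cross-validation on the training data. The sketch below is an assumption about how that could look, not a step from the original notebook. It runs on synthetic data (`X_demo` and `y_demo` are made up) so that it is self-contained; the same `cross_val_score` call should work with `X_train` and `y_train`.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 200

# Synthetic stand-ins for Pclass, IsFemale and Age; not real passenger data.
pclass = rng.integers(1, 4, size=n)
is_female = rng.integers(0, 2, size=n)
age = rng.uniform(1, 80, size=n)
X_demo = np.column_stack([pclass, is_female, age])

# Made-up label: mostly follows is_female, with about 20% flipped at random.
flip = rng.random(n) < 0.2
y_demo = np.where(flip, 1 - is_female, is_female)

# 5-fold cross-validation; each score is the accuracy on one held-out fold.
scores = cross_val_score(LogisticRegression(max_iter=1000), X_demo, y_demo, cv=5)
print(scores.mean())
```

Averaging the fold scores gives a single accuracy estimate that does not depend on one particular train/validation split.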