# Titanic: Machine Learning from Disaster

## Overview

The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.

One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class.

This challenge is from [Kaggle.com](https://www.kaggle.com/c/titanic/overview).

We start with the data that Kaggle provides, which is already split into two files: `train.csv` and `test.csv`. We will do all the feature engineering and model fitting on the `train` set and then apply the fitted pipeline and model to the `test` set.

Now, let's load the data.

In [None]:
import os

TITANIC_PATH = os.path.join('datasets', 'titanic')

In [None]:
import pandas as pd

def load_titanic_data(filename, path=TITANIC_PATH):
    csv_path = os.path.join(path, filename)
    return pd.read_csv(csv_path)

In [None]:
train_dataset = load_titanic_data("train.csv")
test_dataset = load_titanic_data("test.csv")

Let's look at the dataset.

In [None]:
train_dataset.head()

In [None]:
train_dataset.info()

As we can see, almost **20%** of the values in **Age** are null. Around **77%** of the values in **Cabin** are null, so we will ignore that column for now. Only a very small percentage of **Embarked** is null. For the moment, we fill the null values in **Age** with the median age.
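The percentages quoted above can be computed directly rather than read off `info()`. The following is a minimal sketch on a small made-up frame (the `toy` data and its values are assumptions for illustration, not rows from `train.csv`); the same two expressions should work on `train_dataset`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for train_dataset (values are made up).
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan, np.nan],
    "Embarked": ["S", "C", "S", np.nan, "Q"],
})

# isna() marks missing cells; the column-wise mean is the fraction missing.
missing_pct = toy.isna().mean() * 100
print(missing_pct.round(1))

# Fill the missing ages with the median of the observed ages.
age_median = toy["Age"].median()
toy["Age"] = toy["Age"].fillna(age_median)
```

In the notebook itself the median fill is done by `SimpleImputer` inside the pipeline, so the manual `fillna` here only illustrates what that step does.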

Let's begin with the preprocessing pipeline.

In [None]:
from sklearn.base import BaseEstimator, TransformerMixin

class DataFrameSelector(BaseEstimator, TransformerMixin):
    def __init__(self, attribute_names):
        self.attribute_names = attribute_names
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X[self.attribute_names]

In [None]:
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

num_pipeline = Pipeline([
    ("num_attr", DataFrameSelector(['Age', 'SibSp', 'Parch', 'Fare'])),
    ("imputer", SimpleImputer(strategy='median')),
])

In [None]:
num_pipeline.fit_transform(train_dataset)

Now, let's create an imputer for the categorical attributes that fills each column with its most frequent value.

In [None]:
class MostFrequentImputer(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        self.most_frequent_ = pd.Series([X[c].value_counts().index[0] for c in X],
                                        index=X.columns)
        return self
    
    def transform(self, X, y=None):
        return X.fillna(self.most_frequent_)
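As an alternative to the custom class above, scikit-learn's built-in `SimpleImputer` also accepts `strategy="most_frequent"` and works on string columns. The sketch below uses a made-up `toy` frame (an assumption for illustration); note that it returns a NumPy array rather than a DataFrame, which is one reason to keep the custom class if later steps expect column names.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy frame; dtype=object keeps the strings and NaN as plain objects.
toy = pd.DataFrame({
    "Sex": ["male", "female", "male", np.nan],
    "Embarked": ["S", np.nan, "S", "C"],
}, dtype=object)

# Learn the most frequent value per column, then fill the gaps with it.
imputer = SimpleImputer(strategy="most_frequent")
filled = imputer.fit_transform(toy)

print(imputer.statistics_)  # per-column fill values learned during fit
print(filled)
```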

In [None]:
from sklearn.preprocessing import OneHotEncoder

cat_pipeline = Pipeline([
    ('cat_attr', DataFrameSelector(['Pclass', 'Sex', 'Embarked'])),
    ('imputer', MostFrequentImputer()),
    ('cat_one_hot', OneHotEncoder(sparse_output=False)),  # scikit-learn >= 1.2; older releases use sparse=False
])

In [None]:
cat_pipeline.fit_transform(train_dataset)

In [None]:
# Full preprocessing pipeline: numeric and categorical features side by side
from sklearn.pipeline import FeatureUnion

preprocess_pipeline = FeatureUnion(transformer_list=[
    ('num_pipeline', num_pipeline),
    ('cat_pipeline', cat_pipeline),
])
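The `FeatureUnion` plus `DataFrameSelector` pattern above can also be written with scikit-learn's `ColumnTransformer`, which selects columns by name itself. This is a different construction from the one used in this notebook, shown only as a sketch: the `toy` rows are made up, and the built-in `SimpleImputer(strategy="most_frequent")` stands in for `MostFrequentImputer`.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy rows using the same column names as train.csv.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "SibSp": [1, 1, 0, 1],
    "Parch": [0, 0, 0, 0],
    "Fare": [7.25, 71.28, 7.92, 53.10],
    "Pclass": [3, 1, 3, 1],
    "Sex": ["male", "female", "female", "female"],
    "Embarked": ["S", "C", np.nan, "S"],
})

num_cols = ["Age", "SibSp", "Parch", "Fare"]
cat_cols = ["Pclass", "Sex", "Embarked"]

# Categorical branch: fill gaps with the most frequent value, then one-hot encode.
cat_steps = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("one_hot", OneHotEncoder()),
])

# sparse_threshold=0 asks ColumnTransformer to always return a dense array.
preprocess = ColumnTransformer(
    transformers=[
        ("num", SimpleImputer(strategy="median"), num_cols),
        ("cat", cat_steps, cat_cols),
    ],
    sparse_threshold=0,
)

X_toy = preprocess.fit_transform(toy)
print(X_toy.shape)
```

The output has the four numeric columns first, followed by one indicator column per category value seen in the toy data.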

In [None]:
X_train = preprocess_pipeline.fit_transform(train_dataset)
X_train.shape

In [None]:
y_train = train_dataset['Survived']

In [None]:
from sklearn.svm import SVC

svm_clf = SVC(gamma='auto')
svm_clf.fit(X_train, y_train)

In [None]:
X_test = preprocess_pipeline.transform(test_dataset)
y_pred = svm_clf.predict(X_test)

In [None]:
from sklearn.model_selection import cross_val_score

svm_scores = cross_val_score(svm_clf, X_train, y_train, cv=10)
svm_scores.mean()

This is a reasonable baseline, but let's try another classifier and see if we can get better performance.

In [None]:
from sklearn.ensemble import RandomForestClassifier

forest_clf = RandomForestClassifier(n_estimators=100, random_state=42)
forest_scores = cross_val_score(forest_clf, X_train, y_train, cv=10)

In [None]:
forest_scores.mean()

Since this model has the better mean cross-validation score, we will use it.
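A mean score alone can hide how much the folds disagree, so it may help to look at the spread as well before picking a model. The sketch below compares the same two classifier types on synthetic data from `make_classification` (an assumption for illustration, not the Titanic features), so the printed numbers say nothing about this dataset; with the notebook's variables, `X_demo` and `y_demo` would be replaced by `X_train` and `y_train`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in data (an assumption; not the Titanic features).
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=42)

models = {
    "svm": SVC(gamma="auto"),
    "forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

# Collect the mean and standard deviation of the 10 fold accuracies per model.
summary = {}
for name, model in models.items():
    scores = cross_val_score(model, X_demo, y_demo, cv=10)
    summary[name] = (scores.mean(), scores.std())
    print(f"{name}: mean={scores.mean():.3f} std={scores.std():.3f}")
```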

In [None]:
forest_clf.fit(X_train, y_train)

In [None]:
forest_predictions = forest_clf.predict(X_test)

In [None]:
forest_predictions

Finally, we write the predictions to a file.

In [None]:
submission = pd.DataFrame({'PassengerId': test_dataset['PassengerId'], 'Survived': forest_predictions})
submission.to_csv('predictions.csv', index=False)  # IDs come from the test set, not a hard-coded range