# Predicting the survivors of the Titanic shipwreck

This notebook builds a machine learning model to predict which passengers survived the Titanic shipwreck. We will use Python-based machine learning libraries, along with pandas and NumPy, to build the model.

## Problem
Create a machine learning model to predict the survivors of the Titanic shipwreck.

## Data
The data is taken from the Kaggle competition *Titanic: Machine Learning from Disaster*:

https://www.kaggle.com/c/titanic/data

## Evaluation
The evaluation metric is accuracy, that is, the percentage of passengers whose survival we predict correctly.

https://www.kaggle.com/c/titanic/overview/evaluation
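
As a small illustration of the metric, the sketch below computes accuracy by hand. The labels are made up for the example and are not taken from the competition data.

```python
import numpy as np

# Hypothetical labels for eight passengers (1 = survived, 0 = did not survive)
y_true = np.array([0, 1, 1, 0, 1, 0, 0, 1])
y_pred = np.array([0, 1, 0, 0, 1, 1, 0, 1])

# Accuracy = number of correct predictions / total number of predictions
accuracy = (y_true == y_pred).mean()
print(f"Accuracy: {accuracy:.2f}")
```

Six of the eight made-up predictions match, so the accuracy here is 0.75.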

## Features

The data has been split into two groups:

* training set (train.csv)
* test set (test.csv)

It consists of the following features:
* survival
* pclass
* sex
* Age
* sibsp
* parch
* ticket
* fare
* cabin
* embarked

## Getting Workspace ready

In [None]:
import numpy as np 
import pandas as pd
import os
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
# Remaining imports will be added later whenever we require them.

## Exploratory Data Analysis (EDA)

In [None]:
# Import the data into a pandas DataFrame
df = pd.read_csv("../input/titanic/train.csv")
df

In [None]:
# Count how many passengers fall into each class of the target column
df["Survived"].value_counts()

In [None]:
df["Survived"].value_counts().plot(kind="bar", color=["salmon", "lightblue"])

In [None]:
# More info about the data
df.info()

In [None]:
# Check for any missing values
df.isna().sum()

Our DataFrame has missing values in the Age, Cabin and Embarked columns.
In the data preprocessing stage, we will fill the missing Age values with the column mean and treat missing Cabin and Embarked values as their own 'missing' category.
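
To make this plan concrete, here is a minimal sketch on a toy DataFrame. The values are invented and only the column names mirror the Titanic data. The preprocessing later in this notebook reaches a similar result through category codes rather than a literal `"missing"` string.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values; only the column names mirror the Titanic data
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 38.0],
    "Cabin": ["C85", None, None],
    "Embarked": ["S", "C", None],
})

# Numeric column: replace missing ages with the column mean
toy["Age"] = toy["Age"].fillna(toy["Age"].mean())

# Text columns: mark missing entries with an explicit "missing" label
for col in ["Cabin", "Embarked"]:
    toy[col] = toy[col].fillna("missing")

print(toy)
```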

In [None]:
# Check all the values in cabin column
df["Cabin"].value_counts()

In [None]:
# Check all the values in the Embarked column
df["Embarked"].value_counts()

### Survival against Sex

In [None]:
# Compare Survival column with Sex column
pd.crosstab(df.Survived, df.Sex)

In [None]:
# create a plot for crosstab
pd.crosstab(df.Survived, df.Sex).plot(kind="bar", color=["salmon", "lightblue"])
plt.title("Survival vs Sex")
plt.xlabel("0=Not Survived 1=Survived")
plt.ylabel("Count")
plt.legend(["Female", "Male"]);
plt.xticks(rotation=0);

In [None]:
df.Age.hist();

In [None]:
df.Age.mean(), df.Age.median()

## Data Preprocessing

### Make a copy of original dataframe

In [None]:
# make a copy
df_bkp = df.copy()

### Convert strings to numbers

One way to turn all of our data into numbers is to convert the string columns into pandas categories.
The pandas API reference lists the dtype checks we can use for this: https://pandas.pydata.org/pandas-docs/version/0.25.3/reference/general_utility_functions.html#data-types-related-functionality
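
The sketch below shows what a category conversion does on a small, made-up Series. Each distinct value gets an integer code, and missing values get the code -1, which is why adding 1 to the codes turns missing entries into 0.

```python
import pandas as pd

# Made-up embarkation ports with one missing entry
ports = pd.Series(["S", "C", None, "Q", "S"])

# Converting to a categorical assigns each distinct value an integer code
as_category = ports.astype("category")
print(as_category.cat.categories.tolist())
print(as_category.cat.codes.tolist())

# Missing values get code -1, so adding 1 maps them to 0
shifted_codes = as_category.cat.codes + 1
print(shifted_codes.tolist())
```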

In [None]:
pd.api.types.is_string_dtype(df["Name"])

In [None]:
# Find the columns which contains strings
for label, content in df.items():
    if pd.api.types.is_string_dtype(content):
        print(label)

In [None]:
# Turn all the string values into category values
for label, content in df.items():
    if pd.api.types.is_string_dtype(content):
        df[label] = content.astype("category").cat.as_ordered()

In [None]:
df.info()

In [None]:
df.Ticket.cat.categories

### Fill missing values

In [None]:
# Print all columns which are numeric
for label, content in df.items():
    if pd.api.types.is_numeric_dtype(content):
        print(label)

In [None]:
# Check which numeric columns have null values
for label, content in df.items():
    if pd.api.types.is_numeric_dtype(content):
        if pd.isnull(content).sum():
            print(label)

In [None]:
# Fill missing values in numeric columns with the column mean (average)
for label, content in df.items():
    if pd.api.types.is_numeric_dtype(content):
        if pd.isnull(content).sum():
            df[label] = content.fillna(content.mean())


In [None]:
# Check if there are any null numeric values once again
for label, content in df.items():
    if pd.api.types.is_numeric_dtype(content):
        if pd.isnull(content).sum():
            print(label)

In [None]:
df.isna().sum()

### Turn categorical variables into numbers and fill missing

In [None]:
pd.Categorical(df['Cabin']).codes

In [None]:
for label, content in df.items():
    if not pd.api.types.is_numeric_dtype(content):
        df[label] = pd.Categorical(content).codes + 1

In [None]:
df.isna().sum()

## Modelling

There are no more missing values and all the data is numeric, so let's move on to modelling.

In [None]:
# Split the data into X and y
X = df.drop("Survived", axis=1)
y = df['Survived']

In [None]:
X

In [None]:
y

In [None]:
np.random.seed(42)

# Split the data into train and validation datasets

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2)

In [None]:
X_train.shape, X_val.shape, y_train.shape, y_val.shape

We are going to experiment with three models:
* Logistic Regression
* K-Nearest Neighbors Classifier
* Random Forest Classifier

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression

# Put models in a dictionary
models = {"Logistic Regression": LogisticRegression(),
          "KNN": KNeighborsClassifier(),
          "Random Forest": RandomForestClassifier()}

# Create a function to fit and score models
def fit_and_score(models, X_train, X_val, y_train, y_val):
    """
    Fits and evaluates given machine learning models.
    models: A dictionary of machine learning models
    X_train: training data (with no labels)
    X_val: validation data (with no labels)
    y_train: training labels
    y_val: validation labels
    """
    # setup random seed
    np.random.seed(42)
    # create a dictionary to store model scores
    model_scores = {}
    # Loop through the models
    for name, model in models.items():
        # fit the model
        model.fit(X_train, y_train)
        # evaluate the model and store its score in model_scores
        model_scores[name] = model.score(X_val, y_val)
    return model_scores

In [None]:
model_scores = fit_and_score(models, X_train, X_val, y_train, y_val)
model_scores

Since Random Forest gives a good score, we will use it as our base model and try to improve it further with hyperparameter tuning.

### Hyperparameter tuning using RandomizedSearchCV

In [None]:
from sklearn.model_selection import RandomizedSearchCV

# Create a hyperparameter grid for Random Forest
rf_grid = {"n_estimators": np.arange(10, 1000, 50),
           "max_depth": [None, 3, 5, 10],
           "min_samples_split": np.arange(2, 20, 2),
           "min_samples_leaf": np.arange(1, 20, 2)}

# setup random hyperparameter search for Random Forest
rs_rf = RandomizedSearchCV(RandomForestClassifier(),
                           param_distributions=rf_grid,
                           cv=5,
                           n_iter=50,
                           verbose=True,
                           random_state=42)

# fit hyperparameter search for RandomForest
rs_rf.fit(X_train, y_train)

In [None]:
rs_rf.best_params_

In [None]:
rs_rf.score(X_val, y_val)

In [None]:
# Train model with best hyperparameters
ideal_model = RandomForestClassifier(n_estimators=460,
                                     min_samples_split=2,
                                     min_samples_leaf=3,
                                     max_depth=None,
                                     random_state=42)
# Fit the ideal model
ideal_model.fit(X_train, y_train)

In [None]:
ideal_model.score(X_val, y_val)
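
As an alternative to copying the best hyperparameters by hand, a fitted `RandomizedSearchCV` already holds the refitted best model in `best_estimator_` (with the default `refit=True`). The sketch below shows this on synthetic data. The dataset, the small grid and the names `demo_grid`, `search` and `best_model` are assumptions made for illustration and are not part of this notebook's run.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic stand-in data, not the Titanic set
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=42)
X_tr, X_va, y_tr, y_va = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# Deliberately small grid so the sketch runs quickly
demo_grid = {"n_estimators": [10, 30, 50],
             "max_depth": [None, 3, 5],
             "min_samples_leaf": [1, 3, 5]}

search = RandomizedSearchCV(RandomForestClassifier(random_state=42),
                            param_distributions=demo_grid,
                            cv=3,
                            n_iter=5,
                            random_state=42)
search.fit(X_tr, y_tr)

# With the default refit=True, the best candidate is retrained on all of X_tr
best_model = search.best_estimator_
print(search.best_params_)
print(best_model.score(X_va, y_va))
```

Using `best_estimator_` avoids transcription mistakes when the search is rerun and picks different values.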

## Make predictions on test data

In [None]:

df_test = pd.read_csv("../input/titanic/test.csv")
df_test.head()

In [None]:
# Turn all the string values into category values
for label, content in df_test.items():
    if pd.api.types.is_string_dtype(content):
        df_test[label] = content.astype("category").cat.as_ordered()

In [None]:
df_test.info()

In [None]:
# Fill missing values in numeric columns with the column mean (average)
for label, content in df_test.items():
    if pd.api.types.is_numeric_dtype(content):
        if pd.isnull(content).sum():
            df_test[label] = content.fillna(content.mean())

In [None]:
# Turn categorical variables into numbers and fill missing
for label, content in df_test.items():
    if not pd.api.types.is_numeric_dtype(content):
        df_test[label] = pd.Categorical(content).codes + 1
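
One thing to be aware of: category codes computed separately on the training and test sets need not agree, because each set has its own list of distinct values. The sketch below illustrates this with made-up port values and shows one possible way to keep the mapping consistent by reusing the training categories. It is an illustration only and not what this notebook does.

```python
import pandas as pd

# Made-up port values; the test set lacks "Q"
train_ports = pd.Series(["S", "C", "Q", "S"])
test_ports = pd.Series(["S", "S", "C"])

# Encoding each set on its own follows the notebook's approach
train_codes = (pd.Categorical(train_ports).codes + 1).tolist()
test_codes_separate = (pd.Categorical(test_ports).codes + 1).tolist()

# Reusing the training categories keeps the mapping identical across sets
train_categories = pd.Categorical(train_ports).categories
test_codes_shared = (
    pd.Categorical(test_ports, categories=train_categories).codes + 1
).tolist()

print(train_codes)
print(test_codes_separate)
print(test_codes_shared)
```

In this toy case "S" is encoded as 3 in the training set but as 2 when the test set is encoded on its own. With shared categories it stays 3.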

In [None]:
X_test = df_test

In [None]:
test_preds = ideal_model.predict(X_test)

In [None]:
test_preds

In [None]:
len(test_preds)

We have made some predictions. Now we have to format the output as requested by Kaggle.

In [None]:
df_preds = pd.DataFrame()
# Use the test set's passenger IDs (not the training set's) so they line up with test_preds
df_preds['PassengerId'] = df_test['PassengerId']
df_preds
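
Kaggle expects a CSV file with two columns, `PassengerId` and `Survived`. The sketch below shows that layout using made-up IDs and predictions in place of `df_test["PassengerId"]` and `test_preds`. It writes to an in-memory buffer so the result can be inspected. In the notebook one would pass a file path instead (for example `"submission.csv"`, a name assumed here).

```python
import io
import pandas as pd

# Made-up IDs and predictions standing in for df_test["PassengerId"] and test_preds
passenger_ids = [892, 893, 894]
predictions = [0, 1, 0]

submission = pd.DataFrame({"PassengerId": passenger_ids,
                           "Survived": predictions})

# index=False keeps the DataFrame's row index out of the file
buffer = io.StringIO()
submission.to_csv(buffer, index=False)
print(buffer.getvalue())
```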