# Table of contents

- [Titanic](#Titanic)
  - [Setup](#Setup)
  - [Data](#Data)
    - [Download](#Download)
    - [Explore with QuickDA](#Explore-with-QuickDA)
    - [Split Data](#Split-Data)
  - [Model's Common Functions](#Model's-Common-Functions)
  - [Baseline Only Females Survived : 0.76315](#Baseline-Only-Females-Survived-:-0.76315)
  - [Log Sex Pclass : 0.76555](#Log-Sex-Pclass-:-0.76555)
    - [Transformations](#Transformations)
    - [Model](#Model)
    - [Submission](#Submission)

# Titanic

This notebook was inspired by the book [*Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow*](https://www.oreilly.com/library/view/hands-on-machine-learning/9781492032632/).

Thanks to the author, [Aurélien Géron](https://github.com/ageron).

## Setup

What does the environment require?

In [1]:
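# Python ≥3.6 is required (f-strings are used below)
from pathlib import Path
import sys

import numpy as np
import pandas as pd
import sklearn

# Compare version numbers as integers; comparing the raw strings would rank '0.9' above '0.20'
sklearn_version = tuple(int(part) for part in sklearn.__version__.split('.')[:2])
assert sklearn_version >= (0, 20)
assert sys.version_info >= (3, 6)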

np.random.seed(42)

## Data

### Download

In [2]:
def load_titanic_dataset(filename, path='titanic_dataset'):
    csv_path = Path(path) / filename
    return pd.read_csv(csv_path)


data = load_titanic_dataset('train.csv')
submit = load_titanic_dataset('test.csv')
gender_submission = load_titanic_dataset('gender_submission.csv')

### Split Data

Before you modify your data, you need to create a test set!

Sex is an important feature, so both sets should keep the same proportions of males and females. Passing `stratify=data['Sex']` to `train_test_split` does exactly that.

In [3]:
from sklearn.model_selection import train_test_split

train, test = train_test_split(data,
                               test_size=0.2,
                               random_state=42,
                               stratify=data['Sex'])
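
To see what `stratify` does, here is a minimal sketch on a toy frame (the `toy` data and its 65/35 split are made up for illustration, not taken from the Titanic files). It compares the share of each sex in the full data, the training split and the test split.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the Titanic data: 65 males and 35 females (made-up numbers).
toy = pd.DataFrame({
    'Sex': ['male'] * 65 + ['female'] * 35,
    'Survived': [0] * 100,
})

toy_train, toy_test = train_test_split(toy,
                                       test_size=0.2,
                                       random_state=42,
                                       stratify=toy['Sex'])

# Share of each sex in the full data and in both splits.
full_share = toy['Sex'].value_counts(normalize=True)
train_share = toy_train['Sex'].value_counts(normalize=True)
test_share = toy_test['Sex'].value_counts(normalize=True)

print(pd.DataFrame({'full': full_share,
                    'train': train_share,
                    'test': test_share}))
```

The three columns should show (almost) the same proportions. Running the same `value_counts(normalize=True)` call on `train['Sex']` and `test['Sex']` gives the equivalent check for the notebook's data.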

## Model's Common Functions

To keep the code [DRY](https://en.wikipedia.org/wiki/Don%27t_repeat_yourself), the helper functions shared by all the models are defined here. Feel free to skip them and come back if you need to.

In [4]:
from sklearn.metrics import accuracy_score


def show_accuracy(predictions, ground_truth):
    # accuracy_score expects (y_true, y_pred)
    accuracy = accuracy_score(ground_truth, predictions)
    print(f"Test set's accuracy : {accuracy:.5f}.")


def save_submission(passenger_id, pred, file_name='submission_000'):
    submission = pd.DataFrame({
        'PassengerId': passenger_id,
        'Survived': pred
    })
    # Create the output directory if it does not exist yet
    output_dir = Path('submissions')
    output_dir.mkdir(parents=True, exist_ok=True)

    submission.to_csv(output_dir / f'{file_name}.csv', index=False)

## Baseline Only Females Survived : 0.76315

Since the survival rate is 74.2% for females and 18.9% for males,
we can build a quick and easy model which predicts that every female survived and every male died.
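
Survival rates like these can be computed with a `groupby`. The sketch below uses a few made-up rows (the `toy` frame is hypothetical), so the rates it prints are not the Titanic figures.

```python
import pandas as pd

# Toy rows only; the rates printed here are not the Titanic figures.
toy = pd.DataFrame({
    'Sex': ['female', 'female', 'female', 'female',
            'male', 'male', 'male', 'male'],
    'Survived': [1, 1, 1, 0, 0, 0, 0, 1],
})

# 'Survived' holds 0/1 values, so its mean per group is the survival rate.
survival_rate = toy.groupby('Sex')['Survived'].mean()
print(survival_rate)
```

The same call on the notebook's data would be `data.groupby('Sex')['Survived'].mean()`.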

In [5]:
def baseline_female(data):
    # Predict 1 (survived) for every female and 0 (died) for every male
    predictions = np.zeros(data.shape[0], dtype=int)
    predictions[data['Sex'] == 'female'] = 1
    return predictions

In [6]:
predictions = baseline_female(test)
show_accuracy(predictions, test['Survived'])

Test set's accuracy : 0.77654.


Not bad. Let's make a submission:

In [15]:
predictions = baseline_female(submit)
predictions[1] = 0  # Otherwise, Kaggle won't compute your score...

passenger_id = submit['PassengerId']
FILE_NAME = 'baseline_female_00'

save_submission(passenger_id, predictions, FILE_NAME)

## Log Sex Pclass : 0.76555

Now, let's do a *machine learning* model.

In [8]:
train_copy = train.copy()
test_copy = test.copy()
submit_copy = submit.copy()

### Transformations

Here, you can add some attributes to `x_attributes`.

In [9]:
x_attributes = ['Sex', 'Pclass']
y_attribute = ['Survived']


x_train = train_copy[x_attributes]
y_train = train_copy[y_attribute]

x_test = test_copy[x_attributes]
y_test = test_copy[y_attribute]

x_submit = submit_copy[x_attributes]

**You can do some data preprocessing here.**

Machine learning algorithms don't understand strings, but do understand vectors.

Therefore, you should use `OneHotEncoder()` to transform your features. The encoder sorts the categories of each feature, so `'female'` comes before `'male'`. As a dense array (`sparse=False`, renamed `sparse_output=False` in scikit-learn 1.2), the result looks like this:
* `'female' -> [1, 0]`
* `'male' -> [0, 1]`

If we use two features, the two encodings are concatenated into one vector:
* `'male from first class' -> [0, 1, 1, 0, 0]`
* `'female from first class' -> [1, 0, 1, 0, 0]`
* `'female from second class' -> [1, 0, 0, 1, 0]`

As you can see, the first two values of the vector encode the passenger's sex and the last three values encode the passenger's class.

_Can you tell why we use `OneHotEncoder` on `Pclass`?_
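
To check which column stands for which category, you can inspect the fitted encoder. This is a minimal sketch on four made-up rows (the `toy` frame is hypothetical). It calls `.toarray()` on the default sparse output, so it does not depend on the `sparse` / `sparse_output` parameter name.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Four made-up passengers covering both sexes and all three classes.
toy = pd.DataFrame({
    'Sex': ['male', 'female', 'female', 'male'],
    'Pclass': [1, 1, 2, 3],
})

encoder = OneHotEncoder()
# The default output is a sparse matrix; convert it to a dense array to read it.
encoded = encoder.fit_transform(toy).toarray()

# One array of sorted categories per input column.
categories = [list(column) for column in encoder.categories_]
print(categories)

for (_, row), vector in zip(toy.iterrows(), encoded):
    print(row['Sex'], row['Pclass'], '->', vector)
```

The order of `encoder.categories_` is the order of the columns in each encoded vector.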

In [10]:
from sklearn.preprocessing import OneHotEncoder

one_hot = OneHotEncoder()

# Fit the encoder on the training set only, then reuse it for the other sets
x_train_tfm = one_hot.fit_transform(x_train)
x_test_tfm = one_hot.transform(x_test)
x_submit_tfm = one_hot.transform(x_submit)
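
A word of caution about `fit_transform` versus `transform`: an encoder learns its categories when it is fitted. The sketch below uses made-up frames (`toy_train` and `toy_test` are hypothetical) in which the test rows happen to contain no second-class passenger. It shows that refitting on those rows changes the number of columns.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

toy_train = pd.DataFrame({'Sex': ['male', 'female', 'male'],
                          'Pclass': [1, 2, 3]})
# These rows happen to contain no second-class passenger.
toy_test = pd.DataFrame({'Sex': ['female', 'male'],
                         'Pclass': [1, 3]})

encoder = OneHotEncoder()
train_encoded = encoder.fit_transform(toy_train)
# transform() reuses the categories learned from the training rows.
test_encoded = encoder.transform(toy_test)

# A fresh encoder fitted on the test rows only sees two classes.
refit_encoded = OneHotEncoder().fit_transform(toy_test)

print(train_encoded.shape, test_encoded.shape, refit_encoded.shape)
```

With a single fitted encoder, the training and test matrices have the same columns in the same order, which is what the model expects at prediction time.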

### Model

Once you have transformed the data into something *edible* for your algorithm, you can train it.

In [11]:
from sklearn.linear_model import LogisticRegression

log_reg = LogisticRegression(random_state=42)
log_reg.fit(x_train_tfm, np.array(y_train).ravel())

test_pred = log_reg.predict(x_test_tfm)

show_accuracy(test_pred, y_test['Survived'])

Test set's accuracy : 0.77654.


This is the same result that we got with the baseline. How should you interpret this accuracy?
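
One way to investigate is to measure how often the model and the baseline make the same prediction. The sketch below does this on made-up data (the `toy` frame and its survivor counts are hypothetical) and scores the model on the rows it was trained on, purely to keep the example short.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder

# Made-up data: 10 passengers per (sex, class) group with a chosen number of survivors.
rows = []
for sex, pclass, survivors in [('female', 1, 9), ('female', 2, 9), ('female', 3, 8),
                               ('male', 1, 2), ('male', 2, 1), ('male', 3, 1)]:
    for i in range(10):
        rows.append({'Sex': sex, 'Pclass': pclass, 'Survived': int(i < survivors)})
toy = pd.DataFrame(rows)

encoder = OneHotEncoder()
x = encoder.fit_transform(toy[['Sex', 'Pclass']])
model = LogisticRegression(random_state=42).fit(x, toy['Survived'])

model_pred = model.predict(x)
# Baseline rule: every female survived, every male died.
baseline_pred = (toy['Sex'] == 'female').astype(int).to_numpy()

# Share of rows on which both approaches predict the same label.
agreement = np.mean(model_pred == baseline_pred)
print(f'Share of identical predictions: {agreement:.3f}')
```

In the notebook, the equivalent comparison would be `np.mean(test_pred == baseline_female(test))`. A share of 1.0 would mean that the model has learned the same rule as the baseline, so the extra feature has not changed any prediction.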

### Submission

In [12]:
passenger_id = submit['PassengerId']
pred = log_reg.predict(x_submit_tfm)
FILE_NAME = 'log_sex_pclass'

save_submission(passenger_id, pred, FILE_NAME)

# Congrats!
**You're now done with the first part.**

**Before going further, try to get a higher score by:**

* Adding some features
* Doing some data preprocessing
* Tweaking the model's hyperparameters
* Trying another [model](https://scikit-learn.org/stable/auto_examples/classification/plot_classifier_comparison.html)

Also, don't forget to share your score with us!