# Table of contents

- [TODO](#TODO)
- [Titanic](#Titanic)
  - [Setup](#Setup)
  - [Data](#Data)
    - [Download](#Download)
    - [Explore](#Explore)
    - [Split Data](#Split-Data)
  - [Utils](#Utils)
  - [Baseline Only Females Survived : 0.76315](#Baseline-Only-Females-Survived-:-0.76315)
  - [Baseline Log Sex Pclass : 0.76555](#Baseline-Log-Sex-Pclass-:-0.76555)
    - [Transformations](#Transformations)
    - [Model](#Model)
    - [Submission](#Submission)
- [Titanic Advanced](#Titanic-Advanced)
  - [Utils](#Utils)
  - [Custom Transformers](#Custom-Transformers)
  - [Pipelines](#Pipelines)
  - [Logistic Regression : 0.76555](#Logistic-Regression-:-0.76555)

# TODO
* French!
* Add some Markdown to explain the structure: what is quickda?
* Leave some white space
* mail

# Titanic

This notebook was inspired by the book [*Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow*](https://www.oreilly.com/library/view/hands-on-machine-learning/9781492032632/).

Thanks to the author, [Aurélien Géron](https://github.com/ageron).

## Setup

The environment requires numpy, pandas, sklearn, and quickda.

In [32]:
# Python ≥3.5 is required
from pathlib import Path
import sys

import numpy as np
import pandas as pd
import sklearn

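# Compare version numbers as integers; comparing strings would rank "0.9" above "0.20".
assert tuple(int(part) for part in sklearn.__version__.split(".")[:2]) >= (0, 20)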
assert sys.version_info >= (3, 5)

np.random.seed(42)

## Data

### Download

In [33]:
print('In the same directory as this script, make sure you have a directory named "titanic_dataset" '
    'containing the data files.\n')

print(list(Path.cwd().joinpath("titanic_dataset").glob('**/*')))

In the same directory as this script, make sure you have a directory named "titanic_dataset" containing the data files.

[WindowsPath('c:/Sources/kaggle_titanic_template/titanic_dataset/gender_submission.csv'), WindowsPath('c:/Sources/kaggle_titanic_template/titanic_dataset/test.csv'), WindowsPath('c:/Sources/kaggle_titanic_template/titanic_dataset/train.csv')]


In [34]:
def load_titanic_dataset(filename, path='titanic_dataset'):
    csv_path = Path(path) / filename
    return pd.read_csv(csv_path)


data = load_titanic_dataset("train.csv")
submit = load_titanic_dataset("test.csv")
gender_submission = load_titanic_dataset("gender_submission.csv")

### Explore

`py -m pip install quickda`

In [35]:
try:
    import quickda.explore_data as qda

    qda.explore(data, method="profile", report_name="Design Report")

except ModuleNotFoundError:
    print("quickda is not installed correctly")

quickda is not installed correctly


In [36]:
data[["Sex", "Survived"]].groupby(["Sex"]).mean()

| Sex | Survived |
|--------|----------|
| female | 0.742038 |
| male | 0.188908 |


In [37]:
data[["Pclass", "Survived"]].groupby(["Pclass"]).mean()

| Pclass | Survived |
|--------|----------|
| 1 | 0.62963 |
| 2 | 0.472826 |
| 3 | 0.242363 |


### Split Data

In [38]:
from sklearn.model_selection import train_test_split

train, test = train_test_split(data,
                               test_size=0.2,
                               random_state=42,
                               stratify=data['Sex'])

## Utils

In [39]:
from sklearn.metrics import accuracy_score


def show_performance(predictions, ground_truth):
    # accuracy_score expects the true labels first, then the predictions.
    accuracy = accuracy_score(ground_truth, predictions)
    print(f'Test set\'s accuracy : {accuracy:.5f}.')

def submit_csv(submission: pd.DataFrame, file_name: str) -> None:
    output_dir = Path('submissions')
    output_dir.mkdir(parents=True, exist_ok=True)

    submission.to_csv(output_dir.joinpath(file_name), index=False)
    submission.head()


## Baseline Only Females Survived : 0.76315

Since the survival rate is 74.2% for females and 18.9% for males,
a quick and easy baseline is to predict that every female survived and every male died.

In [40]:
test_pred = np.zeros(test['PassengerId'].shape)
test_pred[test['Sex'] == 'male'] = 0
test_pred[test['Sex'] == 'female'] = 1

show_performance(test_pred, test['Survived'])

Test set's accuracy : 0.77654.


Not bad. Let's make a submission:

In [41]:
predictions = np.zeros(submit['PassengerId'].shape)
predictions[submit['Sex'] == "male"] = 0
predictions[submit['Sex'] == "female"] = 1
predictions[1] = 0  # Otherwise, Kaggle won't compute your score...

submission = pd.DataFrame({
    'PassengerId': submit['PassengerId'],
    'Survived': predictions
})

FILE_NAME = 'baseline_female.csv'

submit_csv(submission, FILE_NAME)


## Baseline Log Sex Pclass : 0.76555

Now, let's build a *machine learning* model.

In [42]:
train_copy = train.copy()
test_copy = test.copy()
submit_copy = submit.copy()

### Transformations

Here you can add some features and do some data preprocessing.

In [43]:
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder

x_att = ['Sex', 'Pclass']
y_att = ['Survived']

x_train = train_copy[x_att]
y_train = train_copy[y_att]
x_test = test_copy[x_att]
y_test = test_copy[y_att]
x_submit = submit_copy[x_att]

# The ML algorithm can't read strings, but it can read vectors.
# OneHotEncoder sorts the categories and transforms your data this way:
#                                                   female = [1, 0]
#                                                   male = [0, 1]
# Fit the encoder on the training set only, then reuse it on the other sets
# so that every set is encoded with the same columns.
one_hot = OneHotEncoder()
x_train_tfm = one_hot.fit_transform(x_train)
x_test_tfm = one_hot.transform(x_test)
x_submit_tfm = one_hot.transform(x_submit)

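To illustrate how `OneHotEncoder` turns categories into vectors, here is a minimal sketch on made-up rows (the `toy_train` and `toy_test` frames are assumptions for illustration, not the Titanic data). It learns the categories from one frame and reuses them on another, so both end up with the same columns.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy frames standing in for the train and test splits (made-up rows).
toy_train = pd.DataFrame({"Sex": ["male", "female", "female"],
                          "Pclass": [3, 1, 2]})
toy_test = pd.DataFrame({"Sex": ["female", "male"],
                         "Pclass": [3, 3]})

encoder = OneHotEncoder()
# Learn the categories from the training rows only...
train_encoded = encoder.fit_transform(toy_train).toarray()
# ...then reuse them, so the test columns line up with the training columns.
test_encoded = encoder.transform(toy_test).toarray()

print(encoder.categories_)  # categories are sorted: 'female' comes before 'male'
print(test_encoded.shape)   # (2, 5): two Sex columns + three Pclass columns

# Refitting on the test rows alone would only see Pclass 3 and yield fewer columns.
refit_shape = OneHotEncoder().fit_transform(toy_test).shape
print(refit_shape)          # (2, 3)
```

On the real splits every category happens to appear in each set, so the column counts match either way, but reusing the fitted encoder does not rely on that.
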
### Model

In [44]:
log_reg = LogisticRegression(random_state=42)
log_reg.fit(x_train_tfm, np.array(y_train).ravel())
test_pred = log_reg.predict(x_test_tfm)

show_performance(test_pred, y_test['Survived'])

Test set's accuracy : 0.77654.


This is the same result as the baseline. How should you interpret it?

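One way to approach this question is to list what the model predicts for every possible input. With only `Sex` and `Pclass` as features there are six combinations, so the predictions can be compared directly with the "every female survived" rule. The sketch below uses made-up passengers (the `toy` frame is an assumption, not the Titanic data), so its numbers say nothing about the real dataset.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder

# Made-up passengers, NOT the Titanic data.
toy = pd.DataFrame({
    "Sex": ["female"] * 6 + ["male"] * 6,
    "Pclass": [1, 2, 3, 1, 2, 3] * 2,
    "Survived": [1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0],
})

encoder = OneHotEncoder()
model = LogisticRegression(random_state=42)
model.fit(encoder.fit_transform(toy[["Sex", "Pclass"]]), toy["Survived"])

# With two sexes and three classes, the model can only ever see six inputs.
combos = pd.DataFrame(
    [(sex, pclass) for sex in ["female", "male"] for pclass in [1, 2, 3]],
    columns=["Sex", "Pclass"],
)
predicted = model.predict(encoder.transform(combos))
combos["Predicted"] = predicted
combos["SexRule"] = (combos["Sex"] == "female").astype(int)

# Share of input combinations where the model matches the sex-only rule.
agreement = (combos["Predicted"] == combos["SexRule"]).mean()
print(combos)
print(f"Agreement with the sex-only rule: {agreement:.2f}")
```

If the same check on the real data shows full agreement, the logistic regression has effectively learned the baseline rule, which would explain the identical accuracy.
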
### Submission

In [45]:
submit_pred = log_reg.predict(x_submit_tfm)
submission = pd.DataFrame({
    'PassengerId': submit['PassengerId'],
    'Survived': submit_pred
})

FILE_NAME = 'baseline_log_sex_pclass.csv'
submit_csv(submission, FILE_NAME)

# Titanic Advanced


## Utils


In [46]:
import joblib
from IPython.display import Audio

SOUND_FILE_NAME = './no_sound.wav'
USE_SOUND_FILE = False


def show_model_stats_and_ring(cv_clf,
                              use_sound_file=USE_SOUND_FILE,
                              sound_file=SOUND_FILE_NAME):
    print('\n\n' f'{cv_clf.best_params_}\n' f'{cv_clf.best_score_}')
    if use_sound_file:
        return Audio(sound_file, rate=1, autoplay=True)


def submit_and_save_model(cv_clf, model_name, x_test_tfm):
    predictions = cv_clf.predict(x_test_tfm)

    submission = pd.DataFrame({
        'PassengerId': submit['PassengerId'],
        'Survived': predictions
    })

    file_name = f'{model_name}.csv'

    output_dir = Path('submissions')
    output_dir.mkdir(parents=True, exist_ok=True)

    submission.to_csv(output_dir.joinpath(file_name), index=False)
    submission.head()

    joblib.dump(cv_clf, output_dir.joinpath(f'{model_name}.pkl'))
    return joblib.load(output_dir.joinpath(f'{model_name}.pkl'))

## Custom Transformers

In [47]:
from sklearn.base import TransformerMixin, BaseEstimator


class tfm_example(TransformerMixin, BaseEstimator):
    def __init__(self, do_tfm=False):
        self.do_tfm = do_tfm

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
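        if self.do_tfm:
            # Apply your custom transformation to X here.
            return X
        # Transformation disabled: pass the data through unchanged.
        return X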

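As a possible way to fill in the template above, here is a sketch of a transformer that adds a family-size feature. The class name `FamilySizeAdder`, its `add_family_size` flag, and the `FamilySize` column are hypothetical; only `SibSp` and `Parch` are columns of the Titanic dataset.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class FamilySizeAdder(TransformerMixin, BaseEstimator):
    """Hypothetical example: optionally add FamilySize = SibSp + Parch + 1."""

    def __init__(self, add_family_size=True):
        self.add_family_size = add_family_size

    def fit(self, X, y=None):
        # Nothing to learn: the transformer is stateless.
        return self

    def transform(self, X, y=None):
        X = X.copy()  # avoid mutating the caller's DataFrame
        if self.add_family_size:
            # Siblings/spouses + parents/children + the passenger.
            X["FamilySize"] = X["SibSp"] + X["Parch"] + 1
        return X


toy = pd.DataFrame({"SibSp": [1, 0], "Parch": [0, 2]})
with_feature = FamilySizeAdder().fit_transform(toy)
without_feature = FamilySizeAdder(add_family_size=False).fit_transform(toy)
print(with_feature)
```

Exposing the switch as a constructor argument means a grid search can try the pipeline with and without the feature.
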
## Pipelines

In [48]:
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

x_att = ['Sex']
y_att = ['Survived']

x_train = train[x_att]
y_train = train[y_att]
x_test = test[x_att]
y_test = test[y_att]
x_submit = submit[x_att]

categorical_tfm = Pipeline([('one_hot', OneHotEncoder())])

sex_idx = 0  # position of 'Sex' in x_att
pipeline = ColumnTransformer([('cat', categorical_tfm, [sex_idx])],
                             remainder='drop')

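The cell above imports `StandardScaler` and `SimpleImputer` without using them yet. As a sketch of how the pipeline could grow, the example below adds a numeric branch next to the categorical one and selects columns by name instead of by position. The `toy` frame, the choice of `Age` and `Fare`, and the median imputation strategy are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up rows with a missing age, NOT the Titanic data.
toy = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Age": [22.0, np.nan, 30.0, 40.0],
    "Fare": [7.25, 71.28, 8.05, 13.0],
})

# Numeric columns: fill missing values, then standardize.
numeric_tfm = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
# Categorical columns: one-hot encode.
categorical_tfm = Pipeline([
    ("one_hot", OneHotEncoder(handle_unknown="ignore")),
])

preprocess = ColumnTransformer([
    ("num", numeric_tfm, ["Age", "Fare"]),
    ("cat", categorical_tfm, ["Sex"]),
], remainder="drop")

features = preprocess.fit_transform(toy)
print(features.shape)  # (4, 4): two scaled numeric columns + two one-hot columns
```

Selecting by column name keeps the transformer valid if the order of the input columns changes.
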
## Logistic Regression : 0.76555


In [49]:
from sklearn.linear_model import LogisticRegression

log_pipeline = Pipeline([('pipe', pipeline),
                         ('log', LogisticRegression(random_state=42))])

print(log_pipeline.get_params().keys())

dict_keys(['memory', 'steps', 'verbose', 'pipe', 'log', 'pipe__n_jobs', 'pipe__remainder', 'pipe__sparse_threshold', 'pipe__transformer_weights', 'pipe__transformers', 'pipe__verbose', 'pipe__cat', 'pipe__cat__memory', 'pipe__cat__steps', 'pipe__cat__verbose', 'pipe__cat__one_hot', 'pipe__cat__one_hot__categories', 'pipe__cat__one_hot__drop', 'pipe__cat__one_hot__dtype', 'pipe__cat__one_hot__handle_unknown', 'pipe__cat__one_hot__sparse', 'log__C', 'log__class_weight', 'log__dual', 'log__fit_intercept', 'log__intercept_scaling', 'log__l1_ratio', 'log__max_iter', 'log__multi_class', 'log__n_jobs', 'log__penalty', 'log__random_state', 'log__solver', 'log__tol', 'log__verbose', 'log__warm_start'])


In [50]:
from sklearn.model_selection import GridSearchCV

params = {
    'log__C': [1, 10],
    'log__dual': [True, False],
    'log__fit_intercept': [True, False],
    'log__max_iter': [10**4],
    'log__penalty': ['l1', 'l2', 'elasticnet', 'none'],
    'log__solver': ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga'],
    'log__tol': [10**-4]
}
cv_log = GridSearchCV(log_pipeline, params, verbose=0, scoring="accuracy")
cv_log.fit(x_train, y_train.values.ravel())
show_model_stats_and_ring(cv_log)

Traceback (most recent call last):
  File "c:\Sources\kaggle_titanic_template\env\lib\site-packages\sklearn\model_selection\_validation.py", line 598, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "c:\Sources\kaggle_titanic_template\env\lib\site-packages\sklearn\pipeline.py", line 346, in fit
    self._final_estimator.fit(Xt, y, **fit_params_last_step)
  File "c:\Sources\kaggle_titanic_template\env\lib\site-packages\sklearn\linear_model\_logistic.py", line 1306, in fit
    solver = _check_solver(self.solver, self.penalty, self.dual)
  File "c:\Sources\kaggle_titanic_template\env\lib\site-packages\sklearn\linear_model\_logistic.py", line 443, in _check_solver
    raise ValueError("Solver %s supports only 'l2' or 'none' penalties, "
ValueError: Solver newton-cg supports only 'l2' or 'none' penalties, got l1 penalty.



{'log__C': 1, 'log__dual': True, 'log__fit_intercept': True, 'log__max_iter': 10000, 'log__penalty': 'l2', 'log__solver': 'liblinear', 'log__tol': 0.0001}
0.7892839554811386


Traceback (most recent call last):
  File "c:\Sources\kaggle_titanic_template\env\lib\site-packages\sklearn\model_selection\_validation.py", line 598, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "c:\Sources\kaggle_titanic_template\env\lib\site-packages\sklearn\pipeline.py", line 346, in fit
    self._final_estimator.fit(Xt, y, **fit_params_last_step)
  File "c:\Sources\kaggle_titanic_template\env\lib\site-packages\sklearn\linear_model\_logistic.py", line 1306, in fit
    solver = _check_solver(self.solver, self.penalty, self.dual)
  File "c:\Sources\kaggle_titanic_template\env\lib\site-packages\sklearn\linear_model\_logistic.py", line 454, in _check_solver
    raise ValueError(
ValueError: penalty='none' is not supported for the liblinear solver


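The `ValueError`s in the output above come from solver and penalty combinations that `LogisticRegression` rejects. One way to avoid them, sketched here under assumptions, is to pass `GridSearchCV` a list of grids so that each solver is only paired with penalties it supports. The synthetic data from `make_classification` and the particular grids are illustrative; inside the notebook's pipeline the keys would carry the `log__` prefix.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the training data.
X, y = make_classification(n_samples=200, n_features=4, random_state=42)

# Each dict is searched separately, so invalid pairs are never generated.
param_grid = [
    {"solver": ["liblinear"], "penalty": ["l1", "l2"], "C": [1, 10]},
    {"solver": ["lbfgs", "newton-cg"], "penalty": ["l2"], "C": [1, 10]},
]

search = GridSearchCV(
    LogisticRegression(max_iter=10**4, random_state=42),
    param_grid,
    scoring="accuracy",
    cv=5,
)
search.fit(X, y)

n_candidates = len(search.cv_results_["params"])
print(n_candidates)  # 8 = 1*2*2 (liblinear) + 2*1*2 (lbfgs, newton-cg)
print(search.best_params_)
```

Because every candidate is valid, no fit fails and each one receives a cross-validation score.
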
In [51]:
model_name = 'log_000'
joblib_log = submit_and_save_model(cv_log, model_name, x_submit)
score = round(joblib_log.score(x_test, y_test), 5)
print(f'Test set\'s score : {score:.5f}.')

Test set's score : 0.77654.
