<a href="https://colab.research.google.com/github/claudiohfg/jornada_colaborativa/blob/master/Summit_Titanic.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Problem definition

### Description

<p><b>RMS Titanic</b> was a British passenger liner operated by the White Star Line that sank in the North Atlantic Ocean in the early morning hours of <b>15 April 1912</b>, after striking an iceberg during her maiden voyage from Southampton to New York City. Of the estimated <b>2,224</b> passengers and crew aboard, more than <b>1,500</b> died, making the sinking one of modern history's deadliest peacetime commercial marine disasters.</p>

<p>After leaving <b>Southampton</b> on 10 April 1912, Titanic called at <b>Cherbourg</b> in France and <b>Queenstown</b> (now Cobh) in Ireland, before heading west to <b>New York</b>. On 14 April, four days into the crossing and about 375 miles (600 km) south of Newfoundland, she <b>hit an iceberg at 11:40 p.m. ship's time</b>. The collision caused the hull plates to buckle inwards along her starboard (right) side and opened five of her sixteen watertight compartments to the sea.</p>

<p>Meanwhile, passengers and some crew members were evacuated in lifeboats, many of which were launched only partially loaded. A disproportionate number of men were left aboard because of a "women and children first" protocol for loading lifeboats. <b>At 2:20 a.m., she broke apart and foundered with well over one thousand people still aboard.</b> Just under two hours after Titanic sank, the Cunard liner RMS Carpathia arrived and brought aboard an estimated <b>705 survivors</b>.</p>


### The problem

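Binary classification: given a passenger's attributes (class, sex, age, fare, and so on), predict whether they survived (`survived = 1`) or not (`survived = 0`).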
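
Before training anything, it helps to know what a trivial classifier would score. The sketch below uses only the figures quoted in the description (2,224 aboard, 705 survivors); the dataset loaded later covers passengers only, so treat the number as a rough reference point rather than the exact baseline for the CSV.

```python
# Figures quoted in the description above (everyone aboard, not the CSV rows).
total_aboard = 2224
survivors = 705

died = total_aboard - survivors
baseline_accuracy = died / total_aboard

# A model that always predicts "died" would be right this often.
print(f"Always predicting 'died': {baseline_accuracy:.1%} accuracy")
```

Any model worth keeping should beat this majority-class baseline.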

# Import libs

In [None]:
# Removes all warning messages
import warnings
warnings.filterwarnings('ignore')

In [None]:
import numpy as np
import pandas as pd

In [None]:
from collections import Counter
from io import BytesIO
import matplotlib.pyplot as plt
from PIL import Image
from PIL import ImageDraw
import requests
from scipy.special import factorial
import seaborn as sns
from termcolor import colored

In [None]:
import sklearn as sk
from sklearn.base import BaseEstimator
from sklearn.base import TransformerMixin
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from sklearn.metrics import confusion_matrix
from sklearn.metrics import ConfusionMatrixDisplay
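try:
    from sklearn.metrics import plot_confusion_matrix
except ImportError:  # plot_confusion_matrix was removed in scikit-learn 1.2
    def plot_confusion_matrix(estimator, X, y, **kwargs):
        return ConfusionMatrixDisplay.from_estimator(estimator, X, y, **kwargs)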
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.preprocessing import LabelEncoder

# Constants (URLs)

In [None]:
URL_CSV = "http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/titanic3.csv"
URL_DECKS = "https://upload.wikimedia.org/wikipedia/commons/thumb/f/f3/Titanic_side_plan_with_lifeboats.png/1280px-Titanic_side_plan_with_lifeboats.png"
URL_DISASTER = "https://upload.wikimedia.org/wikipedia/commons/thumb/b/b6/The_destruction_of_RMS_Titanic.jpg/1280px-The_destruction_of_RMS_Titanic.jpg"
URL_HTML = "http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/Ctitanic3.html"
URL_ICEBERG = "https://upload.wikimedia.org/wikipedia/commons/thumb/3/3d/Iceberg_in_the_Arctic_with_its_underside_exposed.jpg/1280px-Iceberg_in_the_Arctic_with_its_underside_exposed.jpg"
URL_LIFEBOAT = "https://upload.wikimedia.org/wikipedia/commons/thumb/a/a2/Titanic_lifeboat.jpg/1280px-Titanic_lifeboat.jpg"
URL_PASSENGERS = "https://upload.wikimedia.org/wikipedia/commons/thumb/a/af/Crowd_at_Pier_54_awaiting_Carpathia_arrival_1912.jpg/1280px-Crowd_at_Pier_54_awaiting_Carpathia_arrival_1912.jpg"
URL_PYTHON = "https://upload.wikimedia.org/wikipedia/commons/thumb/1/1f/Pit%C3%B3n_de_la_India_%28Python_molurus%29%2C_Zoo_de_Ciudad_Ho_Chi_Minh%2C_Vietnam%2C_2013-08-14%2C_DD_08.JPG/1280px-Pit%C3%B3n_de_la_India_%28Python_molurus%29%2C_Zoo_de_Ciudad_Ho_Chi_Minh%2C_Vietnam%2C_2013-08-14%2C_DD_08.JPG"
URL_SHIPYARD = "https://upload.wikimedia.org/wikipedia/commons/thumb/e/e7/RMS_Titanic_ready_for_launch%2C_1911.jpg/1280px-RMS_Titanic_ready_for_launch%2C_1911.jpg"
URL_SURVIVORS = "https://upload.wikimedia.org/wikipedia/commons/thumb/9/91/19120417_Some_who_were_saved_when_the_Titanic_went_down_-_The_New_York_Times.png/1280px-19120417_Some_who_were_saved_when_the_Titanic_went_down_-_The_New_York_Times.png"
URL_TEST = "https://upload.wikimedia.org/wikipedia/commons/thumb/c/ca/Adelaide_tram_number_1_on_trial_run_in_North_Tce_30_Nov_1908.jpg/1280px-Adelaide_tram_number_1_on_trial_run_in_North_Tce_30_Nov_1908.jpg"
URL_TITANIC = "https://upload.wikimedia.org/wikipedia/commons/thumb/f/fd/RMS_Titanic_3.jpg/1280px-RMS_Titanic_3.jpg"

# Auxiliary functions

In [None]:
def label_feature(
    dataframe: pd.DataFrame,
    feature: str,
    n_parts: int,
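    qcut: bool = True) -> pd.DataFrame:
    """Bin `feature` into `n_parts` intervals and add the integer-coded bins
    as `<feature>_lbl`. Uses quantile bins (pd.qcut) when `qcut` is True,
    equal-width bins (pd.cut) otherwise."""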
    df = dataframe.copy()

    feat_sparse = f"{feature}_label"
    feat_label = f"{feature}_lbl"

    if qcut:
        df[feat_sparse] = pd.qcut(df[feature], n_parts)
    else:
        df[feat_sparse] = pd.cut(df[feature], n_parts)

    df[feat_label] = LabelEncoder().fit_transform(df[feat_sparse])
    df.drop(labels=[feat_sparse], axis=1, inplace=True)

    return df

In [None]:
mean = 0
std_dev = 0.1
nd = np.random.normal(mean, std_dev, 1000)  # normal distribution

df = pd.DataFrame(nd, columns=["val"])  # dataframe of the distribution

df = label_feature(
    dataframe=df,
    feature="val",
    n_parts=4,
    qcut=False
)  # add val_lbl: val cut into 4 equal-width bins, coded 0-3

df.hist()
plt.show()

del mean
del std_dev
del nd
del df
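
`label_feature` switches between `pd.qcut` and `pd.cut`, and the two behave very differently on skewed data. A minimal sketch on a made-up series with one extreme value (the numbers are invented for illustration):

```python
import pandas as pd

values = pd.Series([1, 2, 3, 4, 5, 6, 7, 100])

# Equal-width bins: the range 1..100 is split into 4 intervals of the same width,
# so the single extreme value pushes everything else into the first bin.
equal_width = pd.cut(values, 4, labels=False)

# Quantile bins: boundaries are chosen so each bin holds the same number of rows.
equal_freq = pd.qcut(values, 4, labels=False)

print(equal_width.tolist())
print(equal_freq.tolist())
```

This is why the notebook later uses quantile bins for `fare`, which has a long right tail, and equal-width bins for `age`.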

In [None]:
def red_print(text: str):
    """Print text in red, framed by dashes"""
    print(colored(f"\n----- {text} -----", "red"), end="\n\n")

In [None]:
red_print("Roses are red.")

In [None]:
def simple_imputer(
    dataframe: pd.DataFrame,
    simple=True
) -> pd.DataFrame:
    """Fills null values"""
    df = dataframe.copy()

    cat_df = df[
        df.select_dtypes(include="object").columns.tolist() + ['survived']
    ]
    num_df = df.select_dtypes(exclude="object").drop(labels="survived", axis=1)

    if simple:
        imputer = IterativeImputer(random_state=42)
    else:
        imputer = sk.impute.KNNImputer(n_neighbors=7)
    imputed = imputer.fit_transform(num_df.values)

    # Keep the original index so the merge below realigns rows correctly.
    num_df = pd.DataFrame(
        imputed, columns=num_df.columns, index=num_df.index, dtype=float
    )

    df = pd.merge(cat_df, num_df, left_index=True, right_index=True)

    return df

In [None]:
def show_img(url: str, x0: int, x1: int):
    """Show image"""
    response = requests.get(url)
    img = Image.open(BytesIO(response.content))
    fig, ax = plt.subplots(figsize=(15, 10))
    plt.imshow(np.asarray(img)[x0: x1, :], cmap='gray', vmin=0, vmax=255)
    plt.show()

In [None]:
show_img(URL_TEST, 200, 800)

In [None]:
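def show_python(url: str = URL_PYTHON):
    """Show the python picture with rectangles drawn over it"""
    # Fall back to URL_PYTHON when an empty URL is passed.
    response = requests.get(url or URL_PYTHON)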
    im = Image.open(BytesIO(response.content))
    W, H = im.size

    draw = ImageDraw.Draw(im)
    draw.rectangle((520, 230, 1100, 250), fill=0)
    draw.rectangle((500, 250, 540, 270), fill=0)

    draw.rectangle((600, 250, 860, 270), fill=0)
    draw.rectangle((630, 250, 650, 270), fill=(255,255,255))
    draw.rectangle((620, 270, 860, 290), fill=0)
    draw.rectangle((650, 270, 670, 290), fill=(255,255,255))
    draw.rectangle((640, 290, 840, 310), fill=0)
    draw.rectangle((670, 290, 690, 310), fill=(255,255,255))
    draw.rectangle((660, 310, 820, 330), fill=0)

    draw.rectangle((900, 250, 1080, 270), fill=0)
    draw.rectangle((930, 250, 950, 270), fill=(255,255,255))
    draw.rectangle((920, 270, 1080, 290), fill=0)
    draw.rectangle((950, 270, 970, 290), fill=(255,255,255))
    draw.rectangle((940, 290, 1060, 310), fill=0)
    draw.rectangle((970, 290, 990, 310), fill=(255,255,255))
    draw.rectangle((960, 310, 1040, 330), fill=0)

    fig, ax = plt.subplots(figsize=(15, 10))
    plt.imshow(np.asarray(im)[50: 650, :])
    plt.show()

In [None]:
def titanic_load(url=URL_CSV) -> pd.DataFrame:
    """Load Titanic dataset"""
    df = pd.read_csv(url)
    df.drop(labels=["boat", "body", "home.dest"], axis=1, inplace=True)
    df.rename(
        columns={
            "sibsp": "siblings_spouses",
            "parch": "parents_children"
        }, inplace=True)
    return df

# Data description

In [None]:
show_img(URL_TITANIC, 200, 800)

In [None]:
desc = pd.read_html(URL_HTML)
desc = pd.DataFrame(desc[0].values[1:], columns=desc[0].iloc[0])
desc

# Data loading

In [None]:
show_img(URL_PASSENGERS, 200, 800)

In [None]:
# Load Titanic dataset
df = titanic_load()

# This backup copy will be used to show the usefulness of a Transformer
backup_1 = df.copy()

df.sample(n=10).sort_values(by=['name'])

# Train Eval Split

In [None]:
# The Titanic dataset will be split in two: train and eval datasets.
# Train will have 891 rows to mirror Kaggle's challenge.

train = df.sample(n=891, random_state=0).index
eval = df.loc[~df.index.isin(train)].index

assert len(set(train.to_list()) & set(eval.to_list())) == 0, "Split problem"
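
The split above works on index labels rather than on copies of the data, so the same `train` and `eval` indexes can be reused after every transformation. A minimal sketch of that pattern on a toy frame (the frame and the names `toy`, `train_idx`, `eval_idx` are illustrative, not part of the notebook):

```python
import pandas as pd

toy = pd.DataFrame({"value": range(10)})

# Sample row labels for training; everything else goes to evaluation.
train_idx = toy.sample(n=7, random_state=0).index
eval_idx = toy.index.difference(train_idx)

# The two label sets never overlap and together cover the frame.
print(len(train_idx), len(eval_idx))
print(train_idx.intersection(eval_idx).empty)
```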

# Exploratory Data Analysis (EDA)

In [None]:
show_img(URL_SURVIVORS, 165, 765)

## Feature engineering

### Null values

In [None]:
# Show columns and total missing values
dt = df.isna().sum().reset_index().sort_values(by=[0])
dt[dt[0] > 0][["index", 0]]

### Null values: fare

In [None]:
red_print("fare = null")
print(df[df.fare.isna()][["pclass", "ticket", "fare"]], end="\n\n")

# Since fare has only one missing value,
# we will use the most common value in the column (mode) for the same pclass.
df.loc[df.fare.isna(), "fare"] = df[
   (df.pclass == 3) & (df.fare.notna())
]["fare"].mode().values[0]

print(df.loc[[1225, 1226], ["pclass", "ticket", "fare"]])

red_print("pclass x survived")
sns.FacetGrid(df.loc[train], col="survived").map(sns.distplot, "pclass")
plt.show()
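
The fare fix above fills one gap with the mode of third-class fares. A minimal sketch of the same idea generalized to every class, on a toy frame with invented values:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "pclass": [1, 1, 3, 3, 3, 3],
    "fare": [80.0, 90.0, 8.05, 8.05, 7.75, np.nan],
})

# For each class, replace missing fares with the most common fare of that class.
toy["fare"] = toy.groupby("pclass")["fare"].transform(
    lambda s: s.fillna(s.mode().iloc[0])
)

print(toy)
```

With a single missing value the hard-coded `pclass == 3` filter is enough; the grouped version only pays off when gaps appear in several classes.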

### New feature: fare_lbl

In [None]:
red_print("fare x survived")
sns.FacetGrid(df.loc[train], col="survived").map(sns.distplot, "fare")
plt.show()

# fare is a continuous feature.
# Let's make it discrete by splitting it into 4 quantile bins (quartiles).
df = label_feature(df, "fare", 4)

red_print("fare_lbl x survived")
sns.FacetGrid(df.loc[train], col="survived").map(sns.distplot, "fare_lbl")
plt.show()

### Null values: embarked

In [None]:
red_print("embarked = null")
print(df[df.embarked.isna()][["ticket"]], end="\n\n")

# Since embarked has only two missing values,
# let's use the most common value in the dataset.
df.loc[df.embarked.isna(), "embarked"] = df.embarked.mode().values[0]
print(df.loc[[168, 284], ["ticket", "embarked"]])

df["embarked_lbl"] = LabelEncoder().fit_transform(df.embarked)

### Hidden feature: honorific

In [None]:
red_print("name samples")
print(df.name.sample(n=10), end="\n\n")

In [None]:
# This feature will connect age, sex and social status of the passenger.
# It is the honorific title used to refer to the passenger.
# This feature occurs in the middle of the name.
df["honorific"] = df.name.str.split(",").apply(
    lambda x: x[1]
).str.split(".").apply(lambda x: x[0]).str.strip()

red_print("honorific list")
print(df.honorific.unique(), end="\n\n")

red_print("honorific x pclass")
print(df.groupby(["honorific", "pclass"]).size(), end="\n\n")

In [None]:
# Let's select honorific titles with fewer than 10 occurrences and group them.
honorific_selection = (df.honorific.value_counts() < 10)
honorific_selection

In [None]:
# The group will be called Rare.
df.honorific = df.honorific.apply(
    lambda x: "Rare" if honorific_selection.loc[x] else x
)

red_print("honorific x pclass")
print(df.groupby(["honorific", "pclass"]).size(), end="\n\n")

df["honorific_lbl"] = LabelEncoder().fit_transform(df.honorific)
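
The two chained `split`/`apply` calls above can also be written as a single regular expression. A minimal sketch, assuming names follow the `Surname, Title. Given names` pattern; the names below are invented placeholders and the threshold of 2 is only for this toy example:

```python
import pandas as pd

names = pd.Series([
    "Doe, Mr. John",
    "Roe, Mr. Richard",
    "Doe, Miss. Jane",
    "Roe, Miss. Mary",
    "Poe, the Countess. of Example",
])

# Capture whatever sits between the comma and the first period.
honorific = names.str.extract(r",\s*([^.]+)\.", expand=False).str.strip()

# Replace titles seen fewer than 2 times with "Rare".
counts = honorific.map(honorific.value_counts())
grouped = honorific.where(counts >= 2, "Rare")

print(grouped.tolist())
```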

### New features: family size and alone

In [None]:
# Siblings and spouses = horizontal relationships.
# Parents and children = vertical relationships.
# Family size = horizontal and vertical relationships and self.
df['family_size'] = 1 + df['siblings_spouses'] + df['parents_children']

# Whether the passenger had no family on the ship (1 = alone, 0 = with family).
df["alone"] = (df.family_size == 1).astype(int)

red_print("family_size x survived")
sns.FacetGrid(df.loc[train], col="survived").map(sns.distplot, "family_size")
plt.show()

red_print("alone x survived")
sns.FacetGrid(df.loc[train], col="survived").map(sns.distplot, "alone")
plt.show()

### Label encoder: sex

In [None]:
red_print("sex")
print(df.sex.value_counts())

# Due to women and children first, most men died.
red_print("sex (%)")
dt = df.groupby(
    ["sex"]
).size().sort_index().reset_index().rename(columns={0: "total"})
dt.total = dt.total / df.shape[0] * 1000 // 10
print(dt)

# Even though they were the majority onboard.
red_print("sex x survived (%)")
dt = df.loc[train].groupby(
    ["survived", "sex"]
).size().sort_index().reset_index().rename(columns={0: "total"})
dt.total = dt.total / df.loc[train].shape[0] * 1000 // 10
print(dt)

# We could use Pandas map, but we will use LabelEncoder instead.
df["sex_lbl"] = LabelEncoder().fit_transform(df.sex)
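
For reference, `LabelEncoder` sorts the classes and numbers them in that order, which for this column gives the same result as an explicit `map`. A minimal sketch on a toy series:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

sex = pd.Series(["male", "female", "female", "male"])

encoder = LabelEncoder()
encoded = encoder.fit_transform(sex)

# The equivalent explicit mapping with pandas.
mapped = sex.map({"female": 0, "male": 1})

print(list(encoder.classes_))
print(encoded.tolist())
```

The encoder keeps the mapping in `classes_`, so `inverse_transform` can recover the original labels later.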

### Null values: age

In [None]:
# Let us visualize how age behaves.

red_print("age w/ null")
df.age.hist()
plt.show()

red_print("age w/ null x survived")
sns.FacetGrid(df.loc[train], col="survived").map(sns.distplot, "age")
plt.show()

In [None]:
# We will use an IterativeImputer to fill null values
# and check if the behavior of the feature doesn't change much.
df = simple_imputer(df)

red_print("age wo/ null")
df.age.hist()
plt.show()

red_print("age wo/ null x survived")
sns.FacetGrid(df.loc[train], col="survived").map(sns.distplot, "age")
plt.show()
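
`simple_imputer` can use either `IterativeImputer` or `KNNImputer`. A minimal sketch of how the two fill one gap, on an invented array where the second column is ten times the first (the neighbor count of 2 is chosen for this toy data, not taken from the notebook):

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, KNNImputer

X = np.array([
    [1.0, 10.0],
    [2.0, 20.0],
    [3.0, np.nan],
    [4.0, 40.0],
    [5.0, 50.0],
])

# Models each feature from the others and predicts the missing entries.
iterative = IterativeImputer(random_state=42).fit_transform(X)

# Averages the feature over the 2 nearest rows that have it.
knn = KNNImputer(n_neighbors=2).fit_transform(X)

print(iterative[2, 1], knn[2, 1])
```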

### New feature: age_lbl

In [None]:
red_print("age x survived")
sns.FacetGrid(df.loc[train], col="survived").map(sns.distplot, "age")
plt.show()

# age is also a feature with too many different values.
# Let's group these values by stages of life using Pandas cut.
df = label_feature(df, "age", 5, False)

red_print("age_lbl x survived")
sns.FacetGrid(df.loc[train], col="survived").map(sns.distplot, "age_lbl")
plt.show()

### Hidden feature: deck

In [None]:
show_img(URL_DECKS, 0, 350)

In [None]:
# deck can be derived from the first letter of the cabin feature.
# The problem is that most cabin values are null (marked "M" for missing).
# So we won't use this feature afterwards.
df['deck'] = df.cabin.apply(lambda x: x[0] if isinstance(x, str) else "M")

red_print("deck")
print(df.deck.value_counts(dropna=False))

red_print("Cabin T was close to A")
df.loc[df.deck == "T", "deck"] = "A"

red_print("deck")
print(df.deck.value_counts(dropna=False))

# As an exercise, let's label encode the deck feature.
df["deck_lbl"] = LabelEncoder().fit_transform(df.deck)

## Feature selection

### Data sample

In [None]:
# Enough engineering, let's see what our dataset looks like.
df.sample(n=5)

### Feature selection

In [None]:
# Since we have encoded most of our features into discrete values,
# let's select all numerical features and exclude continuous features.
# Let's also exclude deck_lbl (mostly missing) and features already
# folded into newer ones: siblings_spouses, parents_children and alone.
numerical = df.select_dtypes(exclude="object").columns.tolist()
numerical.remove("age")
numerical.remove("fare")
numerical.remove("deck_lbl")
numerical.remove("siblings_spouses")
numerical.remove("parents_children")
numerical.remove("alone")

# The result is this list of features.
red_print("Numerical features")
print(numerical)

In [None]:
# Let's exercise our coding by checking categorical features
# for those with more than 10 unique values.
# These are the ones that should be discarded because they vary
# too much to be useful as categories.
categorical = []
for column in df.select_dtypes(include="object").columns:
    if df[column].value_counts().shape[0] <= 10:
        categorical.append(column)
    else:
        print("Removed feature:", column)

red_print("Categorical features")
print(categorical)

In [None]:
# Time to assert that all selected features are filled with values.
assert sum(df[numerical].isna().sum().values) == 0, "Null value detected"

## Graphical analysis

In [None]:
# In order to proceed into a graphical analysis, let's consider
# only training data. Otherwise we would be looking into the future.
sel = df.loc[train, numerical]

In [None]:
# A heatmap allows us to look for features with high correlation.
# Luckily there is none.
fig, ax = plt.subplots(figsize=(8, 8))
g = sns.heatmap(sel.corr(), annot=True, cmap='coolwarm')

In [None]:
# Pandas corr allows us to do the same without the colors. Lame.
sel.corr(method='pearson')

In [None]:
# Pandas skew measures the asymmetry of each distribution:
# 0 is symmetric, positive means a longer right tail, negative a longer left tail.
sel.skew()

In [None]:
# Pandas hist lets us analyse the global behavior of features.
sel.hist(figsize=(15, 15))
plt.show()

In [None]:
# Boxplot is a useful graph to detect outliers.
# The box spans the interquartile range (IQR).
# The whiskers reach the most extreme points within 1.5 x IQR of the box.
# The dots beyond the whiskers are the outliers.
# Not very useful here though.
for column in df.select_dtypes(exclude="object").columns:
    if column == "survived" or "_" in column:
        continue
    plt.boxplot(df.loc[train, column], showmeans=True, meanline=True)
    plt.ylabel(column)
    plt.show()
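
The outlier rule a boxplot draws can be computed directly. A minimal sketch with NumPy on invented values, assuming the usual 1.5 x IQR whisker convention (Matplotlib's default):

```python
import numpy as np

values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 50])

# Quartiles and interquartile range.
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1

# Points outside these fences are drawn as individual dots.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = values[(values < lower_fence) | (values > upper_fence)]

print(q1, q3, lower_fence, upper_fence, outliers.tolist())
```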

In [None]:
# This time we will plot histograms comparing how two features
# relate to the survived feature.
hist = sns.FacetGrid(df.loc[train], row='sex', col='pclass', hue='survived')
hist.map(plt.hist, 'age', alpha = .75)
hist.add_legend()
plt.show()

In [None]:
# We can also compare how each feature relates to the survived feature.
# This helps us see whether these features make sense for the problem at hand.
label = "survived"
columns = [x for x in sel.columns if x != label]

for column in columns:
    print(f"Survival rate by {column}")
    # The mean of a 0/1 label is the survival rate of each group.
    print(
        sel[[column, label]].groupby([column], as_index=False)
        .mean().sort_values(by=[label], ascending=False),
        end="\n\n"
    )
    # Plot training rows only, consistent with the rest of this section.
    sns.FacetGrid(sel, col=label).map(sns.distplot, column)
    plt.show()
    print()
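
Because `survived` is coded 0/1, the group mean is the survival rate and the group sum is the number of survivors. A minimal sketch on a toy frame with invented rows:

```python
import pandas as pd

toy = pd.DataFrame({
    "pclass": [1, 1, 3, 3, 3],
    "survived": [1, 1, 0, 1, 0],
})

grouped = toy.groupby("pclass")["survived"]
survival_rate = grouped.mean()    # fraction of each class that survived
survivor_count = grouped.sum()    # how many in each class survived

print(survival_rate)
print(survivor_count)
```

Rates are comparable across groups of different sizes; raw counts are not.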

In [None]:
# Barplots are used to relate two features to the survived feature.
# The shorter the error bar, the greater the certainty.

# "A bar plot represents an estimate of central tendency for a numeric 
# variable with the height of each rectangle and provides some indication 
# of the uncertainty around that estimate using error bars."
# (https://seaborn.pydata.org/generated/seaborn.barplot.html)

fig, qaxis = plt.subplots(3, 1, figsize=(15, 15))

sns.barplot(x='sex_lbl', y='survived', hue='pclass', data=sel, ax=qaxis[0])
qaxis[0].set_title('sex_lbl x pclass x survived')

sns.barplot(x='fare_lbl', y='survived', hue='honorific_lbl', data=sel, ax=qaxis[1])
qaxis[1].set_title('fare_lbl x honorific_lbl x survived')

sns.barplot(x='embarked_lbl', y='survived', hue='age_lbl', data=sel, ax=qaxis[2])
qaxis[2].set_title('embarked_lbl x age_lbl x survived')

plt.show()

In [None]:
# Let's check what our data looks like.
sel.sample(n=10)

# Transformer

In [None]:
# Now that you are certain about your data engineering, put
# everything in a Transformer to allow the use of pipelines
# and to guarantee that all data will be prepared in the
# same way.

class TitanicTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, key=None):
        self.key = key

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        df = X.copy()

        # FARE
        df.loc[df.fare.isna(), "fare"] = df[(df.pclass == 3) & (df.fare.notna())]["fare"].mode().values[0]
        df = label_feature(df, "fare", 4)

        # EMBARKED
        df.loc[df.embarked.isna(), "embarked"] = df.embarked.mode().values[0]
        df["embarked_lbl"] = LabelEncoder().fit_transform(df.embarked)

        # HONORIFIC
        df["honorific"] = df.name.str.split(",").apply(lambda x: x[1]).str.split(".").apply(lambda x: x[0]).str.strip()
        honorific_selection = (df.honorific.value_counts() < 10)
        df.honorific = df.honorific.apply(lambda x: "Rare" if honorific_selection.loc[x] else x)
        df["honorific_lbl"] = LabelEncoder().fit_transform(df.honorific)

        # FAMILY
        df['family_size'] = 1 + df['siblings_spouses'] + df['parents_children']
        df["alone"] = (df.family_size == 1).astype(int)

        # SEX
        df["sex_lbl"] = LabelEncoder().fit_transform(df.sex)

        # AGE
        df = simple_imputer(df)
        df = label_feature(df, "age", 5, False)

        # SELECT NUMERICAL FEATURES
        numerical = df.select_dtypes(exclude="object").columns.tolist()
        numerical.remove("age")
        numerical.remove("fare")
        numerical.remove("siblings_spouses")
        numerical.remove("parents_children")
        numerical.remove("alone")

        return df[numerical]

In [None]:
# We will apply the transformation on the original data.
sel = TitanicTransformer().fit_transform(backup_1)

In [None]:
sel.sample(n=10)
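
Wrapping the preparation steps in a transformer is what allows them to be chained with a model in a scikit-learn `Pipeline`. A minimal sketch of that pattern, using a much smaller hypothetical transformer (`FamilySizeAdder` is invented for this example and is not the notebook's `TitanicTransformer`) and a toy frame with made-up rows:

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC


class FamilySizeAdder(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: derives family_size and keeps two columns."""

    def fit(self, X, y=None):
        # Nothing to learn from the data in this simple case.
        return self

    def transform(self, X):
        out = X.copy()
        out["family_size"] = 1 + out["siblings_spouses"] + out["parents_children"]
        return out[["pclass", "family_size"]]


toy = pd.DataFrame({
    "pclass": [1, 3, 3, 2, 1, 3],
    "siblings_spouses": [0, 1, 0, 2, 0, 4],
    "parents_children": [0, 0, 0, 1, 2, 2],
})
labels = [1, 0, 0, 1, 1, 0]

# The pipeline applies the transformer, then trains the classifier on its output.
pipeline = Pipeline([("features", FamilySizeAdder()), ("clf", SVC())])
pipeline.fit(toy, labels)
predictions = pipeline.predict(toy)

print(predictions)
```

The same `pipeline` object can then be passed to tools such as cross-validation, which re-run the preparation on each fold.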

# Model training

In [None]:
# First things first. Let's load the iceberg. Without it there's no accident.
show_img(URL_ICEBERG, 0, 600)

## **Without** data preprocessing

In [None]:
show_img(URL_DISASTER, 0, 600)

In [None]:
# In order to check if our preprocessing makes any difference we will run a
# model on the original data with very basic preprocessing like filling
# missing data and label encoding categorical features.
sel = backup_1.copy()

# Drop features with too many unique values
sel.drop(labels=["name", "ticket", "cabin"], axis=1, inplace=True)

# Fill null values
sel["age"] = sel["age"].fillna(0)
sel["embarked"] = sel["embarked"].fillna("")
sel["fare"] = sel["fare"].fillna(0)  # keep fare numeric

# Encode categorical features
for column in sel.select_dtypes(include="object").columns.to_list():
    sel[column] = LabelEncoder().fit_transform(sel[column].astype(str))

# Train eval split
train_df = sel.loc[train]
eval_df = sel.loc[eval]

# Split dataset into labels and features
y = train_df.survived
X = train_df.drop(labels="survived", axis=1)
y_eval = eval_df.survived
X_eval = eval_df.drop(labels="survived", axis=1)

# Train test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.7, random_state=42
)

# Instantiate classifier
clf = SVC()

# Model training
clf.fit(X_train, y_train)
score = clf.score(X_test, y_test)
print("Hold-out score", score)

# Final training and evaluation
clf.fit(X, y)
score = clf.score(X_eval, y_eval)
print("   Final score", score)

# The result can be seen in the following confusion matrix.
# The confusion matrix shows true negatives, true positives, false negatives 
# and false positives.
plot_confusion_matrix(clf, X_eval, y_eval, values_format="d")
plt.show()

# Basically, the model decided to kill everyone onboard. Bad model.

## **With** data preprocessing

In [None]:
# SPOILER ALERT!
show_img(URL_LIFEBOAT, 150, 750)

### Label encoded

In [None]:
# This first model will use our transformer with label encoded data.
sel = TitanicTransformer().fit_transform(backup_1)

# Train eval split
train_df = sel.loc[train]
eval_df = sel.loc[eval]

# Split dataset into labels and features
y = train_df.survived
X = train_df.drop(labels="survived", axis=1)
y_eval = eval_df.survived
X_eval = eval_df.drop(labels="survived", axis=1)

# Train test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.7, random_state=42
)

# Instantiate classifier
clf = SVC()

# Model training
clf.fit(X_train, y_train)
score = clf.score(X_test, y_test)
print("Hold-out score", score)

# Predict
y_pred_1 = clf.predict(X_eval)

# Final training and evaluation
clf.fit(X, y)
score = clf.score(X_eval, y_eval)
print("   Final score", score)

# Confusion matrix
plot_confusion_matrix(clf, X_eval, y_eval, values_format="d")
plt.show()

# 80%! Not bad. The model decided to abandon its evil deeds.

### One hot encoded

In [None]:
# This time we will use the transformer and one hot encoder 
# to see the difference.
sel = TitanicTransformer().fit_transform(backup_1)

# Train eval split
train_df = sel.loc[train]
eval_df = sel.loc[eval]

# Split dataset into labels and features
y = train_df.survived
X = train_df.drop(labels="survived", axis=1)
y_eval = eval_df.survived
X_eval = eval_df.drop(labels="survived", axis=1)

for column in X.columns:
    X[column] = X[column].apply(lambda x: f"{column}_{x}")
    X_eval[column] = X_eval[column].apply(lambda x: f"{column}_{x}")

encoder = sk.preprocessing.OneHotEncoder(drop="first")
X_oh = encoder.fit_transform(X)
X_eval_oh = encoder.transform(X_eval)

# Train test split
X_train, X_test, y_train, y_test = train_test_split(
    X_oh, y, train_size=0.7, random_state=42
)

# Instantiate classifier
clf = SVC()

# Model training
clf.fit(X_train, y_train)
score = clf.score(X_test, y_test)
print("Hold-out score", score)

# Predict
y_pred_2 = clf.predict(X_eval_oh)

# Final training and evaluation
clf.fit(X_oh, y)
score = clf.score(X_eval_oh, y_eval)
print("   Final score", score)

# Confusion matrix
plot_confusion_matrix(clf, X_eval_oh, y_eval, values_format="d")
plt.show()

# The result is very similar to the one with only label encoding.
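
To make the encoding step concrete: with `drop="first"`, a column with three categories becomes two binary columns, and the dropped category is represented by all zeros. A minimal sketch on a toy column:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

toy = pd.DataFrame({"embarked": ["C", "Q", "S", "S"]})

encoder = OneHotEncoder(drop="first")
# The encoder returns a sparse matrix by default; densify it for display.
encoded = encoder.fit_transform(toy).toarray()

print(encoder.categories_)
print(encoded)
```

Dropping one column per feature avoids perfectly redundant inputs, at the cost of making the dropped category implicit.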

### Difference between results

In [None]:
# A confusion matrix between the two sets of predictions shows where the
# label-encoded and one-hot-encoded models agree and where they disagree.

cm = confusion_matrix(y_pred_1, y_pred_2)
cmd = ConfusionMatrixDisplay(cm, display_labels=["DIED", "SURVIVED"])
cmd.plot(values_format="d")
plt.show()

# As we can see, there are some disturbances in the Force. We could use this
# to our advantage, trying many models and choosing the most common results.
# But that's for another summit.
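
The idea of "choosing the most common result" is a majority vote. A minimal sketch with NumPy, assuming three hypothetical models whose 0/1 predictions for four passengers are invented here:

```python
import numpy as np

# One row per model, one column per passenger (0 = died, 1 = survived).
predictions = np.array([
    [0, 1, 1, 0],
    [0, 1, 0, 0],
    [1, 1, 0, 0],
])

# A passenger is predicted to survive when more than half the models say so.
votes = predictions.sum(axis=0)
majority = (votes > predictions.shape[0] / 2).astype(int)

# Fraction of passengers on which the first two models agree.
agreement = (predictions[0] == predictions[1]).mean()

print(majority.tolist(), agreement)
```

Using an odd number of models avoids ties in a binary vote.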

# What now?

#### Cross validation
https://scikit-learn.org/stable/modules/cross_validation.html

#### Hyper-parameters tuning
https://scikit-learn.org/stable/modules/grid_search.html

#### Feature selection
https://scikit-learn.org/stable/modules/feature_selection.html

#### Oversampling (data augmentation)
https://machinelearningmastery.com/smote-oversampling-for-imbalanced-classification/

#### Ensemble
https://scikit-learn.org/stable/modules/ensemble.html
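
As a starting point for the first item, here is a minimal sketch of cross-validation with `cross_val_score`. It runs on synthetic data from `make_classification`, not on the Titanic features, and the 5 folds are an arbitrary choice:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Synthetic binary classification data, used only to keep the example standalone.
X, y = make_classification(n_samples=100, random_state=0)

# Train and score the model on 5 different train/validation splits.
scores = cross_val_score(SVC(), X, y, cv=5)

print(scores)
print(scores.mean())
```

Averaging over several folds gives a steadier estimate than the single 70/30 split used in the cells above.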



# Any doubts? Questions?

In [None]:
show_python("")

# My social media

**Cláudio Gomes**

https://www.linkedin.com/in/claudiohfg/

https://www.kaggle.com/claudiohfg

https://www.facebook.com/claudiohfg

https://twitter.com/claudiohfg

https://www.instagram.com/claudiohfg.art/

https://www.instagram.com/claudiohfg/

# References

Datasets
http://biostat.mc.vanderbilt.edu/wiki/Main/DataSets

Encyclopedia Titanica
https://www.encyclopedia-titanica.org/

Wikipedia RMS Titanic
https://en.wikipedia.org/wiki/RMS_Titanic

Wikipedia English Honorifics
https://en.wikipedia.org/wiki/English_honorifics

Pandas cut
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.cut.html

Pandas qcut
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.qcut.html

Scikit-Learn User Guide
https://scikit-learn.org/stable/user_guide.html

Scikit-Learn Cheat Sheet
https://s3.amazonaws.com/assets.datacamp.com/blog_assets/Scikit_Learn_Cheat_Sheet_Python.pdf

Matplotlib Tutorials
https://matplotlib.org/tutorials/index.html

Seaborn Tutorials
https://seaborn.pydata.org/tutorial.html

ClaudioHFG's Kaggle notebook
https://www.kaggle.com/claudiohfg/titanic-ensemble-with-sklearn-0-81339

LD Freeman's Kaggle notebook
https://www.kaggle.com/ldfreeman3/a-data-science-framework-to-achieve-99-accuracy