# Titanic Notebook with explanations 
---

Not being in the field, I struggled with this even though it is one of the first tutorials and is very well designed.
I chose to enrich my notebook with information coming from:
 - my experience (mainly mistakes! 😅)
 - other sections of Kaggle
 - DataCamp

We are going to use `pandas`, a Python library for data analysis and manipulation.

Here is a <a href="https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf" target="_blank">cheatsheet</a> for reference.
<br>
<a href="https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf" target="_blank"><img src="https://i.imgur.com/rcSXjEZ.png" width="250px"></a>

## Set up the environment

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load
import numpy as np   # linear algebra
import pandas as pd  # data processing, CSV file I/O (e.g. pd.read_csv)
import re            # regular expression

# Data Visualization
pd.plotting.register_matplotlib_converters()
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Nice banner (hey, Bruce!)
from IPython.display import clear_output
#os.system("pip install art==5.6")
!pip install art --user
clear_output()

import art
art.tprint("Titanic Notebook", font="cybermedium")

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

## Train data
The training file `train.csv` contains both numeric and non-numeric columns.<br>
We will see how to leverage non-numeric values by converting them into numeric ones.
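
To make "converting into numeric" more concrete, here is a minimal sketch on a made-up three-passenger sample (the sample and the `SexNum` column name are my own assumptions, not part of the dataset). It shows two common options: one-hot encoding with `pd.get_dummies`, and a hand-written mapping.

```python
import pandas as pd

# Hypothetical three-passenger sample; column names mirror the Titanic data
sample = pd.DataFrame({
    "Pclass": [3, 1, 2],
    "Sex": ["male", "female", "female"],
})

# One-hot encoding: each category becomes its own 0/1 column
encoded = pd.get_dummies(sample, columns=["Sex"], dtype=int)
print(encoded)

# Alternative: map each category to an integer code of our choosing
sample["SexNum"] = sample["Sex"].map({"male": 0, "female": 1})
print(sample)
```

One-hot encoding avoids suggesting an order between categories, while a single integer column keeps the table narrower; which one works better probably depends on the column.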

In [None]:
# Train data contains more columns, including the "Survived" column
train_data = pd.read_csv("/kaggle/input/titanic/train.csv")
print("(Rows, columns) = ", train_data.shape)
print()

print(train_data.columns)
print()

train_data.head()    # <-- not visible in the output, only the last statement is displayed (see tail() below). Here, use `print(...)` instead as above
train_data.tail()

## Test data
It contains the same columns, except for "Survived".

In [None]:
test_data = pd.read_csv("/kaggle/input/titanic/test.csv")
print(test_data.columns)
test_data.head()

In [None]:
# Test data contains 418 rows
test_data.describe()     # only summarizes the numeric columns

## Exploring the data: "Women and children first!"

Let's first study: Women against Men
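
The two subsections below compute each rate separately. As a side note, `groupby` can probably give both rates in one go; here is a small sketch on a made-up sample (the six rows are hypothetical, only the column names come from `train.csv`).

```python
import pandas as pd

# Hypothetical mini-sample with the same column names as train.csv
sample = pd.DataFrame({
    "Sex": ["female", "male", "female", "male", "male", "female"],
    "Survived": [1, 0, 1, 0, 1, 0],
})

# The mean of a 0/1 column is the share of ones, i.e. the survival rate
rates = sample.groupby("Sex")["Survived"].mean()
for sex, rate in rates.items():
    print(f"{sex}: {100 * rate:.1f} %")
```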

### Women 👩‍🦱

In [None]:
# Filter only female passengers using `loc`
women = train_data.loc[train_data.Sex == 'female']
women.head()

In [None]:
# Does the gender completely determine the survival? 
# (i.e. being a female ensures survival, being male means certain death)
women = train_data.loc[train_data.Sex == 'female']["Survived"]
rate_women = sum(women)/len(women)
print(f"Women ♀️ who survived: {100 * rate_women:.2f} %")    # this syntax is called 'Formatted string literals': https://docs.python.org/3/tutorial/inputoutput.html#tut-f-strings

plt.title("Women ♀️ who survived", size=16)
women.value_counts().plot.pie(colors=["mistyrose", "black"], shadow=True)   # mistyROSE?! Got this one? 😜

### Men 👨‍🦰

In [None]:

men = train_data.loc[train_data.Sex == 'male']["Survived"]
rate_men = sum(men)/len(men)
print("Men ♂️ who survived:", round(100 * rate_men, 2), "%")

plt.title("Men ♂️ who survived", size=16)
men.value_counts().plot.pie(colors=["black", "dodgerblue"], shadow=True)

### What about age-wise: 👶 vs 🧑 vs 👨 vs 👴?
Now, look at how age is related to the survival rate.<br>
<i>Although I used male emoji, the analysis will be performed gender-free.</i>
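
Age categories like these can also be built with `pd.cut`. The sketch below uses made-up ages and bin edges that I chose myself (left-closed ranges, so 5 goes to "2-5" and 34 to "18-34"); treat the boundaries as an assumption rather than the notebook's definition.

```python
import pandas as pd

# Hypothetical ages, including a missing value and boundary cases
ages = pd.Series([0.5, 5, 12, 17, 34, 49, 64, 70, None])

# Left-closed bins: [0, 2), [2, 6), [6, 13), ..., [65, inf)
bins = [0, 2, 6, 13, 18, 35, 50, 65, float("inf")]
labels = ["0-2", "2-5", "6-13", "13-18", "18-34", "35-49", "50-65", "65+"]

categories = pd.cut(ages, bins=bins, labels=labels, right=False)

# pd.cut returns categorical data; register "Unknown" before using it as a fill value
categories = categories.cat.add_categories("Unknown").fillna("Unknown")
print(categories.tolist())
```

A nice side effect is that the categories keep the order of `labels`, so sorting does not depend on padding the labels with spaces.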

In [None]:
# labels = ["0-2", "2-5", "6-13", "13-18", "18-34", "35-49", "50-65", "65+"]
# Each range includes its lower bound and excludes its upper bound, so no age falls between two categories
train_data["AgeCategory"] = train_data["Age"].map(lambda a: " 0-2"   if (0 <= a < 2)  else 
                                                            " 2-5"   if (2 <= a < 6) else 
                                                            " 6-13"  if (6 <= a < 13) else 
                                                            "13-18" if (13 <= a < 18) else 
                                                            "18-34" if (18 <= a < 35) else 
                                                            "35-49" if (35 <= a < 50) else 
                                                            "50-65" if (50 <= a < 65) else 
                                                            "65+"   if (a >= 65) else "Unknown")

age_list = train_data["AgeCategory"].drop_duplicates().sort_values().reset_index(drop=True)

age_distribution =  pd.DataFrame({
    "AgeCategory": age_list, 
    "Male": age_list.map(lambda age: train_data.loc[(train_data["AgeCategory"] == age) & (train_data["Sex"] == "male"), "AgeCategory"].count()), 
    "Female": age_list.map(lambda age: train_data.loc[(train_data["AgeCategory"] == age) & (train_data["Sex"] == "female"), "AgeCategory"].count()), 
    "Total": age_list.map(lambda age: train_data.loc[train_data["AgeCategory"] == age, "AgeCategory"].count()),
    "MaleSurvivors": age_list.map(lambda age: train_data.loc[(train_data["AgeCategory"] == age) & (train_data["Sex"] == "male")]["Survived"].sum()),
    "FemaleSurvivors": age_list.map(lambda age: train_data.loc[(train_data["AgeCategory"] == age) & (train_data["Sex"] == "female")]["Survived"].sum()),
    "Survivors": age_list.map(lambda age: train_data.loc[train_data["AgeCategory"] == age]["Survived"].sum()) })

age_distribution["MaleRate"]   = round(100 * age_distribution["MaleSurvivors"] / age_distribution["Male"], 1)
age_distribution["FemaleRate"] = round(100 * age_distribution["FemaleSurvivors"] / age_distribution["Female"], 1)
age_distribution["Rate"]       = round(100 * age_distribution["Survivors"] / age_distribution["Total"], 1)

print(age_distribution[["AgeCategory", "Male", "Female", "MaleRate", "FemaleRate", "Rate"]].sort_values(by="AgeCategory", ascending=True))
print()

# plt.title("Age distribution of Passengers", size=16)
# train_data["AgeCategory"].value_counts().plot.bar()

fig, ax = plt.subplots()
labels = age_distribution["AgeCategory"]
width = 0.65  

ax.bar(labels, age_distribution["Total"] - age_distribution["Survivors"], width, color="black", label="Dead")
ax.bar(labels, age_distribution["Survivors"], width, bottom=age_distribution["Total"] - age_distribution["Survivors"], color="tan", label="Survivors")

ax.set_title('Age distribution of Survivors')
ax.legend()

# Texts on top of bars
for i in age_distribution.index:
    plt.text(i - 0.3, age_distribution.at[i, "Total"] + 5, str(age_distribution.at[i, "Rate"]) + "%")
    
plt.show()

### Conclusion
<ul>
    <li>A large majority of babies and very young children (between 0 and 2 years old) survived the shipwreck</li>
    <li>Girls ranging from 6 to 13 years old have an unusually low survival rate: much lower than other females AND lower than males of the same age</li>
    <li>Mortality is (surprisingly) quite high for boys in the 13-18 years old range! Their survival rate is even lower than that of adults, with a dreadful 10%! 😨</li>
    <li>Another finding is that, among adults, survival runs in reverse order of age (with the exception of 65+): the 50-65 group survived significantly more than the 18-34 group.<br>
    It would be interesting to know whether other factors could explain this (for instance, seniors tend to have better tickets, hence better places in the lifeboats, ... ?)</li>
</ul>

After this quick study, we can acknowledge that the old saying "Women and children first!" is actually rather true!
    

# A bit of theory 👨‍🏫
---
## Random forest model: a machine learning (ML) model

This model is constructed of several "trees" (there are three trees in the picture below, but we'll construct 100!) that will individually consider each passenger's data and vote on whether the individual survived. Then, the random forest model makes a democratic decision: the outcome with the most votes wins!

![](https://i.imgur.com/AC9Bq63.png)
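
The "democratic decision" can be sketched in a few lines of plain Python. This is only an illustration of majority voting with made-up votes; the helper name `majority_vote` is mine and this is not how the library is implemented internally.

```python
from collections import Counter

def majority_vote(votes):
    """Return the label that received the most votes."""
    return Counter(votes).most_common(1)[0][0]

# Hypothetical votes from three trees for one passenger (1 = survived, 0 = died)
tree_votes = [1, 0, 1]
print(majority_vote(tree_votes))  # 1
```

As far as I understand the scikit-learn documentation, `RandomForestClassifier` actually averages the probability estimates of the trees and picks the class with the highest mean, which is a weighted variant of this vote.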

# Working on train_data 🏋
---

In [None]:
# Get statistics about train_data.csv
train_data.describe()

In [None]:
# Only 183 records contain data in each and every column
train_data.dropna(axis="index").describe()

# From Name column to a new "Title" feature

The name itself is of little use, but it contains the passenger's title.
To use it with `RandomForestClassifier` (see below), it seems necessary to somehow convert string values to numeric values.

That seems interesting enough, so let's extract the title into a new "Title" column with a regular expression (`re.search`), then encode it as a number in a "TitleNum" column, for both train_data and test_data.

*Reminder*: use the <a href="https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf" target="_blank">cheatsheet</a> to get familiar with `pandas` functions
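
Before applying it to the whole column, here is a small standalone sketch of the regular expression on a few names written in the dataset's "Family name, Title. First names" format. The helper `extract_title` is a name I made up for this illustration.

```python
import re

# A comma, one whitespace character, then an optional capitalised word (group 1)
TITLE_PATTERN = re.compile(r",\s([A-Z][a-z]*)?")

def extract_title(name):
    """Return the title found after the family name, or '' if there is none."""
    match = TITLE_PATTERN.search(name)
    if match is None:
        return ""
    return match.group(1) or ""

names = [
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)",
]
print([extract_title(n) for n in names])
```

Note that a title starting with a lowercase word, such as "the Countess", is not captured and ends up as an empty string.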

In [None]:
#-------------------------------------------------------------------------------
# Simple string > int conversion
#-------------------------------------------------------------------------------
# train_data["Survived"] = train_data["Survived"].map(lambda s: int(s))

#-------------------------------------------------------------------------------
# Extend Train data with a 'Title' column
#-------------------------------------------------------------------------------
# Legacy way:
#   lastname_data = pd.DataFrame({"Rest": train_data["Name"].str.split(",", expand=True)[1]})
#   title_data = pd.DataFrame({"Title": lastname_data["Rest"].str.split(".", expand=True)[0]})
#   train_data = train_data.assign(Title=title_data)
#   train_data["Title"] = train_data["Name"].map(lambda t: t.split(",")[1].split(".")[0])

train_data["Title"] = train_data["Name"].map(lambda t: (re.search(r",\s([A-Z][a-z]*)?", t).group(1) or "").strip())   # raw string so that `\s` reaches the regex engine untouched
train_data.head()

#-------------------------------------------------------------------------------
# Encode 'Title' as numeric 
#-------------------------------------------------------------------------------
# so that RandomForestClassifier can be used on it
title_map = train_data["Title"].drop_duplicates().reset_index(drop=True)
print(title_map)
print()

# Let's define an auxiliary function
def title_to_numeric(t):
    # Exact match: `str.contains` also matches substrings, and an empty title would match every entry
    result = title_map.index[title_map == t]
    return int(result[0]) if len(result) else -1

print("'Miss' value is converted to the number: " + str(title_to_numeric("Miss")))
print("'Capt' value is converted to the number: " + str(title_to_numeric("Capt")))
print()

#-------------------------------------------------------------------------------
# Extend Train data with a 'TitleNum' column
#-------------------------------------------------------------------------------

for i in train_data.index:
    title = train_data.at[i, "Title"]
    try:
        train_data.at[i, "TitleNum"] = title_to_numeric(title)
    except IndexError:   # no matching entry in title_map
        train_data.at[i, "TitleNum"] = -1
    
train_data.describe()


## "Title" column exploration
Now that we have all the titles, let's explore this new data!

In [None]:
title_explore = train_data[["Title", "Survived"]]

title_explore = title_explore.groupby(["Title"]).sum()
title_explore["Count"] = train_data["Title"].value_counts()
title_explore["SurvivalRate"] = round(100 * title_explore["Survived"] / title_explore["Count"], 1)

title_explore.sort_values("SurvivalRate")

Now apply the same "Title" and "TitleNum" processing to `test_data`.

In [None]:
#-------------------------------------------------------------------------------
# Extend Test data with a 'Title' column
#-------------------------------------------------------------------------------
test_data["Title"] = test_data["Name"].map(lambda t: (re.search(r",\s([A-Z][a-z]*)?", t).group(1) or "").strip())   # raw string so that `\s` reaches the regex engine untouched
test_data.head()

#-------------------------------------------------------------------------------
# Extend Test data with a 'TitleNum' column
#-------------------------------------------------------------------------------

for i in test_data.index:
    title = test_data.at[i, "Title"]
    try:
        test_data.at[i, "TitleNum"] = title_to_numeric(title)
    except IndexError:   # title never seen in train_data
        test_data.at[i, "TitleNum"] = -1
    
test_data.describe()

## Dealing with Missing Values 
📘 <a href="https://www.kaggle.com/code/alexisbcook/missing-values?kernelSessionId=79127568" target="_blank">Missing Values notebook by Alexis Cook & DanB</a>
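
As a warm-up, here is a minimal sketch of the two steps used in this section (count the gaps, then fill them) on a made-up four-row sample. The values are hypothetical; only the column names come from the dataset.

```python
import pandas as pd

# Hypothetical sample with gaps in a numeric and a text column
sample = pd.DataFrame({
    "Age": [22.0, None, 38.0, 26.0],
    "Embarked": ["S", "C", None, "S"],
})

print(sample.isnull().sum())  # missing values per column

# Numeric gap: use the median of the known values (here 26.0)
sample["Age"] = sample["Age"].fillna(sample["Age"].median())
# Text gap: use an empty string as a neutral placeholder
sample["Embarked"] = sample["Embarked"].fillna("")

print(sample)
```

The median is less sensitive to a few very old or very young passengers than the mean, which is why I prefer it here.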

In [None]:
# [Train data] Missing values per column
print("Missing values in Train data\n" + str(train_data.isnull().sum()))
print()

# Fill missing ages with the median value
#train_data["Age"] = train_data["Age"].fillna(29)   <-- hard-coding a value is one way, but computing the `median` like below is better
# Assign the result back: `fillna(..., inplace=True)` on a column selection may not update the DataFrame in recent pandas versions
train_data["Age"] = train_data["Age"].fillna(train_data["Age"].median())
train_data["Cabin"] = train_data["Cabin"].fillna('')
train_data["Embarked"] = train_data["Embarked"].fillna('')

print("Train data after missing-value processing\n" + str(train_data.isnull().sum()))
print()


In [None]:
# [Test data] Missing values per column
print("Missing values in Test data\n" + str(test_data.isnull().sum()))
print()

# Fill missing ages and fares with the median value
# Assign the result back: `fillna(..., inplace=True)` on a column selection may not update the DataFrame in recent pandas versions
test_data["Age"] = test_data["Age"].fillna(test_data["Age"].median())
test_data["Cabin"] = test_data["Cabin"].fillna('')
test_data["Fare"] = test_data["Fare"].fillna(test_data["Fare"].median())
test_data["Embarked"] = test_data["Embarked"].fillna('')

print("Test data after missing-value processing\n" + str(test_data.isnull().sum()))

# Let's predict! 🔮
---
[<img src="https://i.imgur.com/uhfugn1.png" width="640"/>](https://i.imgur.com/uhfugn1.png)

The code cell below (originally) looks for patterns in four different columns (**"Pclass"**, **"Sex"**, **"SibSp"**, and **"Parch"**) of the data. 

It constructs the trees in the random forest model based on patterns in the **train.csv** file, before generating predictions for the passengers in **test.csv**. 
The code also saves these new predictions in a CSV file **submission.csv**.
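
One thing to watch when encoding train and test data separately with `pd.get_dummies`: both tables must end up with the same columns. With the features used here this should not be a problem, since `Sex` takes the same two values in both files, but it can matter once a column such as `Embarked` is added. A hedged sketch on made-up frames:

```python
import pandas as pd

# Hypothetical frames: the test sample lacks one category seen in training
train = pd.DataFrame({"Pclass": [1, 3, 2], "Embarked": ["S", "C", "Q"]})
test = pd.DataFrame({"Pclass": [3, 1], "Embarked": ["S", "C"]})

X = pd.get_dummies(train)
X_test = pd.get_dummies(test)

# Give the test matrix exactly the training columns, filling absent ones with 0
X_test = X_test.reindex(columns=X.columns, fill_value=0)
print(list(X_test.columns))
```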

In [None]:
from sklearn.ensemble import RandomForestClassifier

#-------------------
# Fitting ML
#-------------------

y = train_data["Survived"]

# 2022-06-09: Version with more features
# all_features = ["Pclass", "Name", "Sex", "Age", "SibSp", "Parch", "Ticket", "Fare", "Cabin", "Embarked"]
# min_features = ["Pclass", "Sex", "SibSp", "Parch"]
# features = ["Pclass", "Age", "Sex", "SibSp", "Parch"]
features = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "TitleNum"]
X = pd.get_dummies(train_data[features])
X_test = pd.get_dummies(test_data[features])

model = RandomForestClassifier(n_estimators=100, max_depth=len(features)+1, random_state=1)

model.fit(X, y)
predictions = model.predict(X_test)

#-------------------
# Fit score
#-------------------

X_train = pd.get_dummies(train_data[features])
checks = model.predict(X_train)

fit = pd.DataFrame({'PassengerId': train_data.PassengerId, 'HasSurvived': train_data.Survived, 'PredictSurvived': checks})   # actual outcome vs model prediction
fit_score = fit.PredictSurvived == fit.HasSurvived
# print('Fit score on train_data:' + str(sum(fit_score) / len(fit_score)))
art.tprint('Fit score ', font="white_bubble")
print(f'{100 * sum(fit_score) / len(fit_score):.2f} %')
print()

#-------------------
# Submission!
#-------------------

output = pd.DataFrame({'PassengerId': test_data.PassengerId, 'Survived': predictions})
output.head()

output.to_csv('submission.csv', index=False)
print("Your submission was successfully saved!")

# Going further 🎯
---
## Fit score VS Test score
So, we get a **fit score** (on train data) of about **91%** whereas our **test score** is barely at **78%**.
Why such a gap? Is this **overfitting**?
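
One way to investigate is to compare the score on the training data with a cross-validated score, where each part of the data is scored by a model that never saw it. The sketch below uses synthetic data from `make_classification` as a stand-in (an assumption, not the Titanic data), and `max_depth=8` simply mirrors `len(features)+1` from the cell above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the Titanic features (an assumption, not the real data)
X, y = make_classification(n_samples=500, n_features=7, random_state=1)

model = RandomForestClassifier(n_estimators=100, max_depth=8, random_state=1)

# Score on the data the model was trained on: optimistic by construction
train_score = model.fit(X, y).score(X, y)

# 5-fold cross-validation: each fold is scored by a model that never saw it
cv_scores = cross_val_score(model, X, y, cv=5)

print(f"Train score: {100 * train_score:.2f} %")
print(f"Cross-validated score: {100 * cv_scores.mean():.2f} %")
```

If the cross-validated score on the real data lands close to the test score, the gap would mostly reflect the optimism of measuring on training data, which is what overfitting looks like in practice.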

## How to go even further? 🚀
@mirfanazam managed to go (a bit) further with a *CatBoost* classifier
https://www.kaggle.com/code/mirfanazam/prophet-titanic/notebook

@blue7red managed to achieve *81.5%* with an *AutoGluon* approach
https://www.kaggle.com/code/rhythmcam/titanic-autogluon-label-encoding/notebook