
# Hello World! This is my attempt at tackling the Titanic problem as an ML beginner.

### We will start by authenticating with Kaggle's API and downloading the Titanic dataset.

In [None]:
!mkdir -p ~/.kaggle


In [None]:
!cp kaggle.json ~/.kaggle/kaggle.json

In [None]:
!chmod 600 ~/.kaggle/kaggle.json
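
The same three steps can also be done from Python in a way that is safe to re-run. This is only a minimal sketch: `install_kaggle_token` is a hypothetical helper written for this notebook, not part of the Kaggle package, and the demo uses a dummy token in a temporary directory so that no real credentials are touched.

```python
import shutil
import tempfile
from pathlib import Path


def install_kaggle_token(token_path, home=None):
    """Copy a Kaggle API token into <home>/.kaggle and restrict its permissions."""
    home = Path(home) if home is not None else Path.home()
    kaggle_dir = home / ".kaggle"
    kaggle_dir.mkdir(parents=True, exist_ok=True)  # no error if it already exists
    target = kaggle_dir / "kaggle.json"
    shutil.copy(token_path, target)
    target.chmod(0o600)  # owner read/write only, same as `chmod 600`
    return target


# Demo in a throwaway directory with a dummy token (hypothetical values).
with tempfile.TemporaryDirectory() as tmp:
    dummy = Path(tmp) / "kaggle.json"
    dummy.write_text('{"username": "demo", "key": "not-a-real-key"}')
    installed = install_kaggle_token(dummy, home=Path(tmp) / "home")
    # Running it a second time does not fail, unlike a plain `mkdir`.
    installed_again = install_kaggle_token(dummy, home=Path(tmp) / "home")
    token_exists = installed.exists()
    token_mode = oct(installed.stat().st_mode & 0o777)
    print(token_exists, token_mode)
```

In Colab, calling `install_kaggle_token("kaggle.json")` with no `home` argument would target `~/.kaggle`, the same location the shell commands use.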

In [None]:
!kaggle competitions download -c titanic

Downloading titanic.zip to /content
  0% 0.00/34.1k [00:00<?, ?B/s]
100% 34.1k/34.1k [00:00<00:00, 2.70MB/s]


In [None]:
!unzip titanic.zip

Archive:  titanic.zip
  inflating: gender_submission.csv   
  inflating: test.csv                
  inflating: train.csv               


We will now import the necessary Python packages to help us on our voyage.

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib.pyplot import rcParams
import os
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

%matplotlib inline
rcParams['figure.figsize'] = 10,8
sns.set(style='whitegrid', palette='muted',
        rc={'figure.figsize': (12,8)})

Load the data into pandas DataFrames.

In [None]:
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

We will go ahead and check the dataset for missing values and find the best way to deal with them.

In [None]:
print(train.isnull().sum())
print()
print(test.isnull().sum())

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

PassengerId      0
Pclass           0
Name             0
Sex              0
Age             86
SibSp            0
Parch            0
Ticket           0
Fare             1
Cabin          327
Embarked         0
dtype: int64
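
Raw counts are easier to judge as a share of the rows. The sketch below is one way to get that view. `missing_report` is a hypothetical helper, and the toy frame uses made-up values standing in for `train.csv`.

```python
import numpy as np
import pandas as pd


def missing_report(df):
    """Return the count and percentage of missing values for columns that have any."""
    counts = df.isnull().sum()
    report = pd.DataFrame({
        "missing": counts,
        "percent": (100 * counts / len(df)).round(1),
    })
    return report[report["missing"] > 0].sort_values("missing", ascending=False)


# Toy frame standing in for train.csv (values are made up for illustration).
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare": [7.25, 71.28, 7.92, 53.1],
})
report = missing_report(toy)
print(report)
```

Calling `missing_report(train)` and `missing_report(test)` on the real data would show at a glance that Cabin is mostly empty while Embarked and Fare have only a handful of gaps.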


## Note: whatever preprocessing steps we take will also be applied to the test dataset. Very important.

We can see that Age, Cabin and Embarked have missing values in the training set, and that the test set is also missing one Fare value. We will deal with each variable separately. For Age, we could simply impute the overall median, but that would not be the best approach. Cabin will be tackled differently; more on that later. For Embarked, we will just use pandas backfill. There might be better approaches, but for now these are the steps I will be taking.

First, the Age variable. The approach here is to create a new variable called Title, group the dataset by title and compute the median age of each group. We will then use each title's median to impute the missing ages in that group. This kills two birds with one stone: we fill in Age and gain a new feature.
To get the titles, we can use a regular expression to extract them from the Name column.

In [None]:
# Raw string avoids an invalid-escape warning; expand=False returns a Series
train["Title"] = train["Name"].str.extract(r"([A-Za-z]+)\.", expand=False)
test["Title"] = test["Name"].str.extract(r"([A-Za-z]+)\.", expand=False)
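
To see what the pattern picks up, here is a small sketch on a few sample names written in the "Surname, Title. Given names" format of the Name column. The pattern captures the first run of letters that is immediately followed by a dot.

```python
import pandas as pd

# Sample names in the "Surname, Title. Given names" format used in the Name column.
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
    "Heikkinen, Miss. Laina",
    "Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)",
])

# One or more letters immediately followed by a literal dot.
titles = names.str.extract(r"([A-Za-z]+)\.", expand=False)
print(titles.tolist())  # ['Mr', 'Mrs', 'Miss', 'Countess']
```

The surname is skipped because it is followed by a comma, not a dot, so only the title matches.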

That is done, so we will check the unique titles in the Title variable and see what we can infer from it.

In [None]:
print(train.Title.value_counts(), end="\n\n")
print(test.Title.value_counts())

Title
Mr          517
Miss        182
Mrs         125
Master       40
Dr            7
Rev           6
Mlle          2
Major         2
Col           2
Countess      1
Capt          1
Ms            1
Sir           1
Lady          1
Mme           1
Don           1
Jonkheer      1
Name: count, dtype: int64

Title
Mr        240
Miss       78
Mrs        72
Master     21
Col         2
Rev         2
Ms          1
Dr          1
Dona        1
Name: count, dtype: int64


It turns out that most of the rare titles are French forms or honorifics that correspond to a common English title, e.g. Mme = Madame = Mrs and Mlle = Mademoiselle = Miss. We will reduce the titles to the six most common ones by replacing each of the others with the closest match among these six.

In [None]:
# Create a mapping to replace the rare titles with their appropriate titles in English
mapping = {
    "Mlle": "Miss", "Major": "Mr", "Col": "Mr", "Sir": "Mr",
    "Don": "Mr", "Mme": "Mrs", "Jonkheer": "Mr", "Lady": "Mrs",
    "Capt": "Mr", "Countess": "Mrs", "Ms": "Miss", "Dona": "Mrs"
}
train.replace({"Title": mapping}, inplace=True)
test.replace({"Title": mapping}, inplace=True)

In [None]:
print(train.Title.value_counts(), end="\n\n")
print(test.Title.value_counts())

Title
Mr        525
Miss      185
Mrs       128
Master     40
Dr          7
Rev         6
Name: count, dtype: int64

Title
Mr        242
Miss       79
Mrs        73
Master     21
Rev         2
Dr          1
Name: count, dtype: int64
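
Because the title medians will be looked up by title, it helps to confirm that the test set has no title the training set lacks. The sketch below uses toy stand-ins for the two Title columns and shows what the check would report if, hypothetically, "Dona" had been left out of the mapping.

```python
import pandas as pd

# Toy stand-ins for the Title columns after the replacement step.
train_titles = pd.Series(["Mr", "Miss", "Mrs", "Master", "Dr", "Rev"])
test_titles = pd.Series(["Mr", "Miss", "Mrs", "Master", "Rev", "Dr", "Dona"])

# Titles in the test set that never occur in the training set.
unseen = sorted(set(test_titles) - set(train_titles))
print(unseen)  # ['Dona']
```

On the real data the equivalent check is `set(test["Title"]) - set(train["Title"])`, which should be empty given the counts printed above.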


## Using the median of the title group.
We can now go ahead and fill the missing ages with the median of each title group.

In [25]:
# Median age per title, computed on the training set only and reused for the test set
group_title_ages = train.groupby("Title")["Age"].median()

# Temporary column holding each passenger's title-group median age
train["Med_Age"] = train["Title"].map(group_title_ages)
test["Med_Age"] = test["Title"].map(group_title_ages)

# Impute the missing ages with the title-group median
train["Age"] = train["Age"].fillna(train["Med_Age"])
test["Age"] = test["Age"].fillna(test["Med_Age"])

# Drop the temporary column
train.drop("Med_Age", axis=1, inplace=True)
test.drop("Med_Age", axis=1, inplace=True)
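
The same idea can be written without a temporary column. This is only a sketch on toy data, and it assumes the medians are learned from the training frame and then reused for the test frame, so the test data does not influence the fill values.

```python
import numpy as np
import pandas as pd

# Toy data with made-up ages.
toy_train = pd.DataFrame({
    "Title": ["Mr", "Mr", "Mr", "Miss", "Miss", "Miss"],
    "Age": [30.0, 40.0, np.nan, 20.0, np.nan, 24.0],
})
toy_test = pd.DataFrame({
    "Title": ["Mr", "Miss"],
    "Age": [np.nan, np.nan],
})

# Medians are learned from the training data only...
title_medians = toy_train.groupby("Title")["Age"].median()

# ...and then looked up by title to fill the gaps in both frames.
toy_train["Age"] = toy_train["Age"].fillna(toy_train["Title"].map(title_medians))
toy_test["Age"] = toy_test["Age"].fillna(toy_test["Title"].map(title_medians))
print(toy_train["Age"].tolist(), toy_test["Age"].tolist())
```

Here the missing "Mr" age becomes 35.0 (the median of 30 and 40) and the missing "Miss" age becomes 22.0 (the median of 20 and 24) in both frames.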

We are done with the Age variable. On to the next one, the Fare variable.

## Dealing with Fare missing values

We can impute the missing value with a median in the same way as we did for the Age variable, this time grouping by Pclass.

In [31]:
# Median fare per passenger class, computed on the training set only
fares_class = train.groupby("Pclass")["Fare"].median()

# Temporary column holding each passenger's class median fare
train["Med_Fare"] = train["Pclass"].map(fares_class)
test["Med_Fare"] = test["Pclass"].map(fares_class)

# Impute the missing fares with the class median
train["Fare"] = train["Fare"].fillna(train["Med_Fare"])
test["Fare"] = test["Fare"].fillna(test["Med_Fare"])

In [32]:
# Drop new column created
train.drop("Med_Fare", axis=1, inplace=True)
test.drop("Med_Fare", axis=1, inplace=True)

## Dealing with Embarked missing values

There are only 2 missing values here, both in the training dataset, so my approach is to use the pandas "backfill" method, which fills each gap with the next valid value below it. The test dataset has no missing Embarked values, so nothing is done to it.

In [35]:
# bfill() replaces the deprecated fillna(method="backfill")
train["Embarked"] = train["Embarked"].bfill()
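
Backfill has one caveat worth knowing: a gap in the last row has nothing below it to copy from. The toy sketch below shows that behaviour next to a common alternative, filling with the most frequent value. The alternative is only an option to consider, not what this notebook does.

```python
import numpy as np
import pandas as pd

# Toy Embarked column with a gap in the middle and one at the very end.
embarked = pd.Series(["S", np.nan, "C", "S", np.nan])

# Backfill copies the next valid value upwards; the trailing gap stays missing.
backfilled = embarked.bfill()

# Alternative: fill every gap with the most frequent port.
most_common = embarked.mode()[0]
mode_filled = embarked.fillna(most_common)
print(backfilled.tolist(), mode_filled.tolist())
```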

## Dealing with the Cabin variable's missing values

The approach is to extract the deck letter, which is the first character of each Cabin value, into a new column called Deck. We will then drop the Cabin variable. For now, missing values are filled with "Missing" in the Deck column.

In [None]:
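# Return the deck letter of a cabin, or "Missing" when the cabin is not recorded
def extract_deck(cabin):
    if isinstance(cabin, str):
        return cabin[0]
    else:
        return "Missing"

# Apply the same step to both datasets
train["Deck"] = train["Cabin"].apply(extract_deck)
test["Deck"] = test["Cabin"].apply(extract_deck)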

In [None]:
train.drop("Cabin", axis=1, inplace=True)
test.drop("Cabin", axis=1, inplace=True)
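
The same extraction can be done without a helper function by using pandas string indexing. A minimal sketch on toy Cabin values:

```python
import numpy as np
import pandas as pd

# Toy Cabin values: a single cabin, several cabins in one field, and missing entries.
cabin = pd.Series(["C85", np.nan, "B57 B59 B63 B66", "E46", np.nan])

# .str[0] takes the first character and leaves NaN as NaN, which fillna then labels.
deck = cabin.str[0].fillna("Missing")
print(deck.tolist())  # ['C', 'Missing', 'B', 'E', 'Missing']
```

When a passenger has several cabins listed, only the deck of the first one is kept, which matches what `cabin[0]` does.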

## Let us add a family size variable to the data

We can achieve this by adding the `Parch` and `SibSp` variables together to get the `Family_Size` variable.

In [None]:
# Create Family_Size variable
train["Family_Size"] = train.Parch + train.SibSp
test["Family_Size"] = test.Parch + test.SibSp
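
Each frame's Family_Size has to come from its own columns. pandas adds Series by index label, so mixing columns from two different frames pairs up unrelated passengers and leaves NaN where a label exists on one side only. The sketch below uses toy data and a hypothetical helper, `add_family_size`, to keep the step identical for both frames.

```python
import pandas as pd


def add_family_size(df):
    """Return a copy of df with Family_Size = Parch + SibSp from its own columns."""
    out = df.copy()
    out["Family_Size"] = out["Parch"] + out["SibSp"]
    return out


# Toy frames with made-up values.
toy_train = pd.DataFrame({"Parch": [0, 1, 2], "SibSp": [1, 0, 3]})
toy_test = pd.DataFrame({"Parch": [0, 2], "SibSp": [0, 1]})

toy_train = add_family_size(toy_train)
toy_test = add_family_size(toy_test)
print(toy_train["Family_Size"].tolist(), toy_test["Family_Size"].tolist())

# Mixing frames aligns on the index: index 2 exists only in toy_train, so it becomes NaN.
mixed = toy_train["Parch"] + toy_test["SibSp"]
print(mixed.tolist())
```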

We can now check to see if all the missing values have been filled in our data.

In [None]:
print(train.isnull().sum(), end="\n\n")
print(test.isnull().sum())

PassengerId    0
Survived       0
Pclass         0
Name           0
Sex            0
Age            0
SibSp          0
Parch          0
Ticket         0
Fare           0
Embarked       0
Title          0
Med_Age        0
Family_Size    0
dtype: int64

PassengerId    0
Pclass         0
Name           0
Sex            0
Age            0
SibSp          0
Parch          0
Ticket         0
Fare           0
Embarked       0
Title          0
Med_Age        0
Family_Size    0
dtype: int64


## Let us save our cleaned datasets for future analysis and modelling.

In [38]:
train.to_csv("train_cleaned.csv", index=False)
test.to_csv("test_cleaned.csv", index=False)