# 1 Introduction

Our script has three parts:

* Feature engineering
* Missing value imputation
* Prediction!

## 1.1 Setting up the environment

In [1]:
# Import packages

import matplotlib.pyplot as plt
import missingno as msno
import numpy as np
import pandas as pd
import seaborn as sns
from CustomDataFrame import CustomDataFrame
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from statsmodels.graphics.mosaicplot import mosaic
from tabulate import tabulate

In [2]:
# Set figsize for plots
plt.rcParams['figure.figsize'] = (8, 5)

# Set random state
random_state = 754

## 1.2 Loading data

Now that our packages are loaded, let’s read in and take a peek at the data.

In [3]:
# Read data
train = pd.read_csv('input/train.csv')
test = pd.read_csv('input/test.csv')

# Concatenate training & test data
full = pd.concat([train, test], ignore_index=True)

At this point the authors of the original notebook use the `df.info()` method to print the most important information about the data. That output reports non-null counts, however, which makes it hard to see at a glance which columns have missing values that need handling.  
For this purpose, a `CustomDataFrame` class was created that lists the null count for each column instead of the non-null count:

In [4]:
# Custom DataFrame info
print(CustomDataFrame(full))

Rows: 1309, Columns: 12

+---------------+-------------+--------------+
| Column Name   | Data Type   | Null Count   |
|---------------+-------------+--------------|
| PassengerId   | int64       | 0 (0.0%)     |
| Survived      | float64     | 418 (31.9%)  |
| Pclass        | int64       | 0 (0.0%)     |
| Name          | object      | 0 (0.0%)     |
| Sex           | object      | 0 (0.0%)     |
| Age           | float64     | 263 (20.1%)  |
| SibSp         | int64       | 0 (0.0%)     |
| Parch         | int64       | 0 (0.0%)     |
| Ticket        | object      | 0 (0.0%)     |
| Fare          | float64     | 1 (0.1%)     |
| Cabin         | object      | 1014 (77.5%) |
| Embarked      | object      | 2 (0.2%)     |
+---------------+-------------+--------------+
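
The `CustomDataFrame` class itself is not shown in this section. As a rough illustration of the idea, a null-count summary can be built with plain pandas. The function name `null_summary`, the column labels, and the toy data below are assumptions made for this sketch, not the actual implementation:

```python
import numpy as np
import pandas as pd


def null_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column with its dtype, null count and null percentage."""
    null_count = df.isna().sum()
    null_pct = (null_count / len(df) * 100).round(1)
    return pd.DataFrame({
        'Data Type': df.dtypes.astype(str),
        'Null Count': null_count,
        'Null %': null_pct,
    })


# Toy frame standing in for the Titanic data (values are made up)
toy = pd.DataFrame({
    'Age': [22.0, np.nan, 26.0, 35.0],
    'Embarked': ['S', 'C', None, 'S'],
    'Pclass': [3, 1, 3, 1],
})

summary = null_summary(toy)
print(summary)
```

Reporting nulls directly, rather than non-null counts as `df.info()` does, puts the columns that need attention in plain view.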


As we can see, columns containing missing values are:  
* Survived (**y** variable)
* Age
* Fare
* Cabin
* Embarked

The **Cabin** column is 77.5% missing. With roughly three quarters of its values absent there is no sensible way to impute them, so the column has to be dropped altogether.  
The **Survived** variable contains missing values as well; these are the rows that came from `test.csv`, which has no labels. They cannot be imputed either, because imputing the target would bias the data, either adding noise or causing target leakage, depending on the chosen imputation method.  
The other variables can be imputed because they have comparatively few nulls, but first every observation with a missing **Survived** value has to be removed, before any other feature engineering.
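
The drop-or-impute reasoning above can be expressed as a simple rule on the fraction of missing values per feature column. The sketch below is only an illustration: the 0.5 cut-off (`MAX_MISSING_FRACTION`) and the toy data are assumptions, and the target column is assumed to be handled separately, as described above.

```python
import numpy as np
import pandas as pd

# Hypothetical cut-off: feature columns missing more than half their values are dropped
MAX_MISSING_FRACTION = 0.5

# Toy feature frame (values are made up; the target column is excluded)
toy = pd.DataFrame({
    'Cabin': [None, 'C85', None, None],
    'Age': [22.0, np.nan, 26.0, 35.0],
    'Fare': [7.25, 71.28, 7.92, 53.10],
})

# Fraction of missing values in each column
missing_fraction = toy.isna().mean()

# Columns too sparse to impute sensibly
to_drop = missing_fraction[missing_fraction > MAX_MISSING_FRACTION].index.tolist()

# Columns with some, but not too many, missing values
to_impute = missing_fraction[
    (missing_fraction > 0) & (missing_fraction <= MAX_MISSING_FRACTION)
].index.tolist()

print('drop:', to_drop)
print('impute:', to_impute)
```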

In [5]:
# Keep only rows with a known target; copy so later edits do not act on a slice
full = full.loc[~full['Survived'].isna()].copy()

print(CustomDataFrame(full))

Rows: 891, Columns: 12

+---------------+-------------+--------------+
| Column Name   | Data Type   | Null Count   |
|---------------+-------------+--------------|
| PassengerId   | int64       | 0 (0.0%)     |
| Survived      | float64     | 0 (0.0%)     |
| Pclass        | int64       | 0 (0.0%)     |
| Name          | object      | 0 (0.0%)     |
| Sex           | object      | 0 (0.0%)     |
| Age           | float64     | 177 (19.9%)  |
| SibSp         | int64       | 0 (0.0%)     |
| Parch         | int64       | 0 (0.0%)     |
| Ticket        | object      | 0 (0.0%)     |
| Fare          | float64     | 0 (0.0%)     |
| Cabin         | object      | 687 (77.1%)  |
| Embarked      | object      | 2 (0.2%)     |
+---------------+-------------+--------------+
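
The row filter used above can also be written with `dropna`. The sketch below, on made-up toy data, shows that both forms keep the same rows. Taking a `.copy()` is a suggestion of this sketch: it yields an independent DataFrame, so later in-place modifications do not operate on a slice of the original.

```python
import numpy as np
import pandas as pd

# Toy frame: two labelled rows and two unlabelled rows (values are made up)
toy = pd.DataFrame({
    'PassengerId': [1, 2, 3, 4],
    'Survived': [0.0, 1.0, np.nan, np.nan],
})

# Boolean-mask form
by_mask = toy.loc[~toy['Survived'].isna()].copy()

# Equivalent dropna form, restricted to the target column
by_dropna = toy.dropna(subset=['Survived']).copy()

print(by_mask.equals(by_dropna))
```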


We now have only 891 observations, and missing values remain in these columns:
* Age
* Cabin
* Embarked

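The one missing value in **Fare** disappeared when the rows with a null **Survived** were removed.  
Unlike the original notebook, which created a **Deck** variable from **Cabin**, here **Cabin** is dropped.  
The missing values in **Age** and **Embarked** are imputed. The intent is the median for **Age** and the most frequent value for **Embarked**; note, however, that the cell below fits a single `SimpleImputer(strategy='most_frequent')` on both columns, so **Age** is actually filled with its most frequent value.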

In [6]:
# Drop the Cabin column
full = full.drop(columns='Cabin')

# Define the variables to impute
variables_to_impute = ['Age', 'Embarked']

# Fit one imputer on both columns; 'most_frequent' is the only strategy used,
# so Age is filled with its mode and the transformed array has object dtype
imputer = SimpleImputer(strategy='most_frequent')
imputed_values = imputer.fit_transform(full[variables_to_impute])

# Create a DataFrame with the new values, aligned to the index of full
imputed_df = pd.DataFrame(imputed_values, columns=variables_to_impute, index=full.index)

# Fill the missing values from the imputed DataFrame
full.loc[:, variables_to_impute] = full.loc[:, variables_to_impute].fillna(imputed_df)

# Print new DataFrame info
print(CustomDataFrame(full))

Rows: 891, Columns: 11

+---------------+-------------+--------------+
| Column Name   | Data Type   | Null Count   |
|---------------+-------------+--------------|
| PassengerId   | int64       | 0 (0.0%)     |
| Survived      | float64     | 0 (0.0%)     |
| Pclass        | int64       | 0 (0.0%)     |
| Name          | object      | 0 (0.0%)     |
| Sex           | object      | 0 (0.0%)     |
| Age           | object      | 0 (0.0%)     |
| SibSp         | int64       | 0 (0.0%)     |
| Parch         | int64       | 0 (0.0%)     |
| Ticket        | object      | 0 (0.0%)     |
| Fare          | float64     | 0 (0.0%)     |
| Embarked      | object      | 0 (0.0%)     |
+---------------+-------------+--------------+
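
In the table above **Age** is listed as `object`, a side effect of imputing a numeric and a text column together with one `most_frequent` imputer. A minimal sketch of a per-column alternative follows, using plain pandas `fillna` rather than `SimpleImputer`: the median for the numeric column and the mode for the categorical one. The toy data is made up, and this is one possible approach rather than the method used in this notebook.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the two columns to impute (values are made up)
toy = pd.DataFrame({
    'Age': [22.0, np.nan, 26.0, 35.0, 26.0],
    'Embarked': ['S', 'C', None, 'S', 'S'],
})

imputed = toy.copy()

# Numeric column: fill with the median, which keeps the float dtype
imputed['Age'] = imputed['Age'].fillna(imputed['Age'].median())

# Categorical column: fill with the most frequent value
most_frequent_port = imputed['Embarked'].mode().iloc[0]
imputed['Embarked'] = imputed['Embarked'].fillna(most_frequent_port)

print(imputed)
print(imputed.dtypes)
```

Handling each column with its own strategy keeps **Age** numeric, which matters for any later step that expects a float column.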
