# Phase 2 - Week 1 - Day 1 PM - Artificial Neural Network (ANN) - Binary Classification

> **NOTES**

> Before running this notebook, **we recommend enabling a GPU runtime** so that the training process doesn't take too long.
> In Google Colab, you can do this via `Runtime` >> `Change runtime type` >> `T4 GPU`.

# A. Binary Classification

## A.1 - Import Libraries & Data Loading

In this first tutorial, we will build a neural network model for Binary Classification using the Titanic dataset.

The purpose of this notebook is to demonstrate how to create an ANN model in general, so some steps, such as EDA, handling outliers, and checking the types of missing values, are omitted.

In a real project, you should cover these steps so that your ANN model performs as well as it can.

In [1]:
# Import Libraries

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.metrics import classification_report

In [2]:
# Load the Titanic dataset

url = 'https://raw.githubusercontent.com/FTDS-learning-materials/phase-1/master/w1/P1W1D3AM%20-%20Feature%20Engineering%20-%20Part%201%20-%20Titanic.csv'
data = pd.read_csv(url)
data.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


## A.2 - Feature Engineering

### A.2.1 - Data Splitting

In [3]:
# Splitting between `X` and `y`

X = data.drop('Survived', axis=1)
y = data['Survived']

In [4]:
# Splitting between Train-Set, Val-Set, and Test-Set

X_train_val, X_test, y_train_val, y_test = train_test_split(X, y, test_size=0.15, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_train_val, y_train_val, test_size=0.15, random_state=42)

print('Train Size : ', X_train.shape)
print('Val Size   : ', X_val.shape)
print('Test Size  : ', X_test.shape)

Train Size :  (643, 11)
Val Size   :  (114, 11)
Test Size  :  (134, 11)
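
The sizes above follow from applying `test_size=0.15` twice. As a rough check, the arithmetic can be sketched as below. It assumes the dataset has 891 rows (the sum of the three printed sizes) and that `train_test_split` rounds a fractional test size up to the next whole row.

```python
import math

n_total = 643 + 114 + 134          # 891 rows, taken from the printed shapes

# First split: 15% of all rows go to the Test-Set (rounded up)
n_test = math.ceil(0.15 * n_total)
n_train_val = n_total - n_test

# Second split: 15% of the remaining rows go to the Val-Set (rounded up)
n_val = math.ceil(0.15 * n_train_val)
n_train = n_train_val - n_val

print(n_train, n_val, n_test)
```

Note that the Val-Set is 15% of the remaining 85%, which is roughly 12.75% of the full dataset rather than 15%.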


### A.2.2 - Handling Missing Values

In [5]:
# Check Missing Values on X_train

X_train.isnull().sum()

PassengerId      0
Pclass           0
Name             0
Sex              0
Age            131
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          497
Embarked         2
dtype: int64

In [6]:
# Check Missing Values on X_val

X_val.isnull().sum()

PassengerId     0
Pclass          0
Name            0
Sex             0
Age            22
SibSp           0
Parch           0
Ticket          0
Fare            0
Cabin          94
Embarked        0
dtype: int64

In [7]:
# Check Missing Values on X_test

X_test.isnull().sum()

PassengerId     0
Pclass          0
Name            0
Sex             0
Age            24
SibSp           0
Parch           0
Ticket          0
Fare            0
Cabin          96
Embarked        0
dtype: int64

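We will impute the missing values in the numerical column `Age` inside the Pipeline using median imputation. `Cabin` and `Embarked` are dropped in the next section, so they need no imputation.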
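
As a minimal sketch of what median imputation does, the example below uses a handful of made-up `Age` values (they are not taken from the Titanic data). The names `age_train` and `age_val` are hypothetical. The point it illustrates is that the median is learned from the training values only and then reused for other sets.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy `Age` values (made up for illustration, not taken from the Titanic data)
age_train = np.array([[22.0], [38.0], [np.nan], [26.0], [35.0]])
age_val = np.array([[np.nan], [40.0]])

imputer = SimpleImputer(strategy='median')
age_train_filled = imputer.fit_transform(age_train)   # median is learned from the train values only
age_val_filled = imputer.transform(age_val)           # the same learned median is reused here

print(imputer.statistics_)      # the learned median per column
print(age_val_filled.ravel())   # the missing value is replaced by that median
```

Here the non-missing training values are 22, 26, 35, and 38, so the learned median is 30.5, and that value fills the gap in `age_val` as well.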

### A.2.3 - Feature Selection

Let's assume that the columns `PassengerId`, `Cabin`, `Ticket`, `Embarked`, and `Name` are not strongly correlated with the target (column `Survived`), so we drop them.

In [8]:
# Drop Columns

drop_columns = ['PassengerId', 'Cabin', 'Ticket', 'Embarked', 'Name']

# Reassign instead of using `inplace=True` to avoid pandas' SettingWithCopyWarning on split subsets
X_train = X_train.drop(columns=drop_columns)
X_val = X_val.drop(columns=drop_columns)
X_test = X_test.drop(columns=drop_columns)

X_train

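| | Pclass | Sex | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|
| 868 | 3 | male | NaN | 0 | 0 | 9.5000 |
| 223 | 3 | male | NaN | 0 | 0 | 7.8958 |
| 846 | 3 | male | NaN | 8 | 2 | 69.5500 |
| 171 | 3 | male | 4.0 | 4 | 1 | 29.1250 |
| 435 | 1 | female | 14.0 | 1 | 2 | 120.0000 |
| ... | ... | ... | ... | ... | ... | ... |
| 533 | 3 | female | NaN | 0 | 2 | 22.3583 |
| 302 | 3 | male | 19.0 | 0 | 0 | 0.0000 |
| 473 | 2 | female | 23.0 | 0 | 0 | 13.7917 |
| 283 | 3 | male | 19.0 | 0 | 0 | 8.0500 |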


### A.2.4 - Pipeline

In [9]:
# Get Numerical Columns and Categorical Columns

num_columns = X_train.select_dtypes(include=np.number).columns.tolist()
cat_columns = X_train.select_dtypes(include=['object']).columns.tolist()

print('Numerical Columns : ', num_columns)
print('Categorical Columns : ', cat_columns)

Numerical Columns :  ['Pclass', 'Age', 'SibSp', 'Parch', 'Fare']
Categorical Columns :  ['Sex']


In [10]:
# Create A Pipeline

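# Numerical columns: fill missing values with the median, then standardize
num_pipeline = make_pipeline(SimpleImputer(strategy='median'),
                             StandardScaler())

# Categorical columns: one-hot encode each category
cat_pipeline = make_pipeline(OneHotEncoder())

# Apply each sub-pipeline to its own group of columns
final_pipeline = ColumnTransformer([
    ('pipe_num', num_pipeline, num_columns),
    ('pipe_cat', cat_pipeline, cat_columns)
])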

In [11]:
# Fit on the Train-Set only, then transform the Val-Set and Test-Set with the same fitted pipeline

X_train = final_pipeline.fit_transform(X_train)
X_val = final_pipeline.transform(X_val)
X_test = final_pipeline.transform(X_test)
X_train.shape

(643, 7)
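
The 7 output columns are the 5 scaled numerical columns plus 2 one-hot columns for `Sex`. The sketch below rebuilds an equivalent pipeline on a few made-up rows to show this. The names `toy` and `toy_pipeline` are hypothetical and the values are illustrative only; the column lists are written out by hand here instead of being derived with `select_dtypes`.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy rows with the same six columns as X_train (values are made up)
toy = pd.DataFrame({
    'Pclass': [3, 1, 2, 3],
    'Sex': ['male', 'female', 'female', 'male'],
    'Age': [22.0, np.nan, 26.0, 35.0],
    'SibSp': [1, 0, 0, 2],
    'Parch': [0, 1, 0, 2],
    'Fare': [7.25, 71.28, 13.0, 8.05],
})

num_columns = ['Pclass', 'Age', 'SibSp', 'Parch', 'Fare']
cat_columns = ['Sex']

toy_pipeline = ColumnTransformer([
    ('pipe_num', make_pipeline(SimpleImputer(strategy='median'), StandardScaler()), num_columns),
    ('pipe_cat', make_pipeline(OneHotEncoder()), cat_columns),
])

toy_transformed = toy_pipeline.fit_transform(toy)

# ColumnTransformer may return a sparse matrix; convert it to a dense array if so
if hasattr(toy_transformed, 'toarray'):
    toy_transformed = toy_transformed.toarray()

print(toy_transformed.shape)     # rows x (5 numerical + 2 one-hot columns)
print(toy_transformed[:, 5:])    # one-hot columns for Sex, ordered as female, male
```

Each row has exactly one `1` across the last two columns, and each of the first five columns has a mean of approximately zero after standardization.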