## ARTIFICIAL NEURAL NETWORK (ANN)


NAME: TIMILEYIN SAMUEL AKINTILO

STUDENT ID: C00302909

#### INTRODUCTION

This notebook showcases the implementation of a fully connected neural network using Keras. It was developed from scratch to demonstrate a practical and theoretical understanding of the underlying deep learning algorithm.

#### LOG OF CHANGES

This log records all the computations carried out for this analysis and how they affected its results. The log is structured to follow the Cross Industry Standard Process for Data Mining (CRISP-DM) model, and the changes are logged under each of the six phases as follows:

**1. Business understanding**

The analysis investigates the Titanic disaster and aims to develop a model that classifies whether or not a passenger on the Titanic survived, based on certain features. The project evaluates the effectiveness of both Random Forest and fully connected neural network algorithms on the Titanic dataset to determine which performs better.

**2. Data Understanding**

The dataset used for this analysis was obtained from Kaggle (https://www.kaggle.com/datasets/yasserh/titanic-dataset). It contains information about the passengers who boarded the ship, including age, sex, port of embarkation, name, cabin number, ticket number, ticket class, and survival status. With this passenger information, we will build a classifier that predicts whether a passenger survived or not.

Here is the data dictionary for the dataset:

Data Dictionary

survival: Survival (0 = No, 1 = Yes)

pclass:	Ticket class (1 = 1st, 2 = 2nd, 3 = 3rd)

sex:	Sex	

Age:	Age in years
	
sibsp:	No of siblings / spouses aboard the Titanic	

parch:	No of parents / children aboard the Titanic	

ticket:	Ticket number	

fare:	Passenger fare	

cabin:	Cabin number	

embarked: Port of Embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)


**3. Data Preparation**

In the data preprocessing stage, the following steps were taken to make the data fit for modelling:

a) The missing values in each column were filled with appropriate values.

b) Categorical features were encoded into numeric features using a label encoder.

c) Columns that were unlikely to affect the model were dropped to simplify the analysis.



**4. Modelling**

The following steps were implemented during the modelling phase:

**a) Standardizing the features**

**Change:** All the features were standardized to keep them on the same scale.

**Result:** The accuracy of the model improved from 81% to 82%, while the precision and recall remained the same.

**b) Encoding the target variable**

**Change:** The categorical features were encoded into numeric variables using a label encoder.

**Result:** This preprocessed the data, making it ready to be fit into the algorithm.

**c) Building the first architecture**

**Change:** The first architecture has three dense layers with Adam as the optimizer.

**Result:** The model did not perform as well as the Random Forest algorithm, achieving an accuracy of 0.77.

**d) Building the second architecture**

**Change:** The second architecture has three dense layers with dropout, and Adamax was used as the optimizer.

**Result:** The model achieved a validation accuracy of 0.83 and a test accuracy of 0.78.



**5. Evaluation:**

The performance of the model was evaluated using accuracy as the metric.


**6. Deployment:**

The best model was saved as a Hierarchical Data Format version 5 (HDF5) file but was not deployed due to time constraints.

#### ANALYSIS

First things first, let's import the necessary libraries.



In [3]:
import pickle
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split, StratifiedKFold
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import classification_report, accuracy_score
from sklearn.model_selection import cross_val_score
from tensorflow.keras.layers import SimpleRNN, LSTM, Dense, Dropout, Embedding,  BatchNormalization 
from tensorflow.keras.models import Sequential 
from tensorflow.keras.preprocessing.text import Tokenizer 
from tensorflow.keras.utils import pad_sequences
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV


Next, we will load the data set and take a look at it.

In [4]:
# load the Titanic dataset
df = pd.read_csv('Titanic-Dataset.csv')

In [5]:
# check the first few rows of the dataframe
df.head()


Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [6]:
# Examine the shape of the dataset
df.shape

(891, 12)

Now, we will check the distribution of the target variable.

The distribution shows that the dataset is a bit imbalanced.
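
The cell that produced this distribution is not shown here. A minimal sketch of such a check is given below; it uses a small made-up `Survived` series (an assumption for illustration, not the real data). On the actual DataFrame, the equivalent call would be `df['Survived'].value_counts()`.

```python
import pandas as pd

# Made-up stand-in for df['Survived']; the real column comes from Titanic-Dataset.csv
survived = pd.Series([0, 1, 1, 1, 0, 0, 0, 0], name="Survived")

counts = survived.value_counts()                     # absolute count per class
proportions = survived.value_counts(normalize=True)  # share per class, sums to 1

print(counts)
print(proportions.round(3))
```

A clearly larger share for one class is what "imbalanced" refers to; with a mild imbalance, accuracy is still informative but is best read alongside precision and recall.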

In [7]:
# Examine the columns in the dataframe
df.columns

Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')

In [8]:
# check for missing values
df.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

Three of the columns have missing values. We will attempt to deal with these missing values before we proceed with the analysis.

In [9]:
# fill missing values in the Age column with the mean
df['Age'] = df['Age'].fillna(df['Age'].mean())

In [10]:
# fill missing values in the Embarked column with the mode
df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])

We will drop the Cabin column, as most of its values are missing and it is unlikely to be useful in our analysis.

In [11]:
# drop the Cabin column
df.drop('Cabin', axis=1, inplace=True)

Furthermore, we will drop the Name, Ticket, and PassengerId columns, as they are unique to each passenger and not useful for our analysis.

In [12]:
# drop the Name, Ticket and PassengerId columns
df.drop(['Name', 'Ticket', 'PassengerId'], axis=1, inplace=True)

Let us see the resulting dataset after the data preprocessing.

In [13]:
df.head()

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked
0,0,3,male,22.0,1,0,7.25,S
1,1,1,female,38.0,1,0,71.2833,C
2,1,3,female,26.0,0,0,7.925,S
3,1,1,female,35.0,1,0,53.1,S
4,0,3,male,35.0,0,0,8.05,S


Some of the columns are categorical, so we need to convert them to numerical values. We can use a label encoder for this.

In [14]:
# Select only categorical columns
categorical_columns = df.select_dtypes(include=['object', 'category'])
categorical_columns.head()

Unnamed: 0,Sex,Embarked
0,male,S
1,female,C
2,female,S
3,female,S
4,male,S


In [15]:
# Create a LabelEncoder instance
label_encoder = LabelEncoder()

# Iterate through each column in the DataFrame and apply label encoding
for column in categorical_columns.columns:
    df[column] = label_encoder.fit_transform(df[column])

In [16]:
df.tail()

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked
886,0,2,1,27.0,0,0,13.0,2
887,1,1,0,19.0,0,0,30.0,2
888,0,3,0,29.699118,1,2,23.45,2
889,1,1,1,26.0,0,0,30.0,0
890,0,3,1,32.0,0,0,7.75,1


In [17]:
# Display the class labels
class_labels = label_encoder.classes_
print(f'Class Labels: {class_labels}')

Class Labels: ['C' 'Q' 'S']


Since the label encoder assigns codes in alphabetical order, C is encoded as 0, Q as 1, and S as 2. Likewise, in the Sex column, female is encoded as 0 and male as 1.
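
Because a single encoder is refitted for every column, `classes_` only reflects the last column fitted (Embarked). The ordering rule itself can be illustrated with a plain-Python sketch; `label_encode` is a hypothetical helper that mimics the sorted-order behaviour and is not scikit-learn's implementation.

```python
def label_encode(values):
    """Map each distinct value to an integer code, following sorted order."""
    classes = sorted(set(values))
    mapping = {label: code for code, label in enumerate(classes)}
    return [mapping[value] for value in values], classes

# Made-up samples of the two categorical columns
embarked_codes, embarked_classes = label_encode(["S", "C", "S", "Q", "S"])
sex_codes, sex_classes = label_encode(["male", "female", "female", "male"])

print(embarked_classes, embarked_codes)  # ['C', 'Q', 'S'] [2, 0, 2, 1, 2]
print(sex_classes, sex_codes)            # ['female', 'male'] [1, 0, 0, 1]
```

Note that these integer codes impose an order (C < Q < S) that has no real meaning for the ports; one-hot encoding is a common alternative when that matters.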

Great! Now that we have encoded the categorical columns, we can proceed to train a neural network classifier using the encoded data.

In [18]:
# Split the data into features and target
X = df.drop('Survived', axis=1)
y = df['Survived']

# take a copy of the features
features = X.copy()


In [19]:
X.columns

Index(['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked'], dtype='object')

Now, we can split the dataset into a training set and a test set. We will use 80% of the data for training and 20% for testing.

In [20]:
# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
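
The log above mentions standardizing the features, although the cells shown here do not include that step. A minimal sketch of z-score standardization follows, using made-up Age and Fare values (the arrays and their names are assumptions for illustration). The statistics are learned from the training rows only and then reused on the test rows; scikit-learn's `StandardScaler`, which the notebook imports, applies the same formula.

```python
import numpy as np

# Made-up Age and Fare values standing in for X_train / X_test
X_train_toy = np.array([[22.0, 7.25], [38.0, 71.28], [26.0, 7.93], [35.0, 53.10]])
X_test_toy = np.array([[35.0, 8.05], [27.0, 13.00]])

# Learn the mean and standard deviation from the training rows only
mean = X_train_toy.mean(axis=0)
std = X_train_toy.std(axis=0)

# Apply the same transformation to both sets
X_train_scaled = (X_train_toy - mean) / std
X_test_scaled = (X_test_toy - mean) / std

print(X_train_scaled.round(3))
print(X_test_scaled.round(3))
```

Fitting on the training rows alone keeps information about the test set from leaking into the model.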

Next, we will create our fully connected neural network model using the Sequential API.

In [21]:
# Build the artificial neural network
model = Sequential()
model.add(Dense(128, activation='relu', input_shape=(X_train.shape[1],)))
model.add(Dense(32, activation='tanh'))
model.add(Dense(1, activation='sigmoid'))
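
To get a feel for the size of this network, the number of trainable parameters can be worked out by hand. The sketch below assumes the 7 input features listed in `X.columns` and the three `Dense` layers defined above; `dense_params` is a hypothetical helper.

```python
def dense_params(n_inputs, n_units):
    """Weights plus biases of one fully connected layer."""
    return n_inputs * n_units + n_units

layer_sizes = [7, 128, 32, 1]  # 7 input features, then the three Dense layers
per_layer = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]

print(per_layer, sum(per_layer))  # [1024, 4128, 33] 5185
```

With roughly 5,000 parameters and about 570 rows used for fitting, the model has enough capacity to overfit, which is one motivation for regularization.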




In [22]:
# Compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
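
The `binary_crossentropy` loss and the `accuracy` metric chosen here can be illustrated on a few made-up predictions. This is a plain-Python sketch of the formulas, assuming a 0.5 decision threshold; it is not the Keras implementation, and the logits and labels are invented for illustration.

```python
import math

def sigmoid(z):
    """Squash a raw score into a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over all samples."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

logits = [2.0, -1.0, -0.5]  # made-up outputs of the last layer before the sigmoid
labels = [1, 0, 1]          # made-up survival labels

probs = [sigmoid(z) for z in logits]
loss = binary_crossentropy(labels, probs)
accuracy = sum(int(p > 0.5) == y for p, y in zip(probs, labels)) / len(labels)

print(round(loss, 4), round(accuracy, 4))
```

The loss penalizes confident wrong predictions heavily, while accuracy only counts whether each thresholded prediction matches the label.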

In [23]:
#train the model
history = model.fit(X_train, y_train, epochs=100, batch_size=16, validation_split=0.2, verbose=1)


Epoch 1/100
36/36 ━━━━━━━━━━━━━━━━━━━━ 2s 10ms/step - accuracy: 0.4535 - loss: 0.8436 - val_accuracy: 0.6783 - val_loss: 0.5983
Epoch 2/100
36/36 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - accuracy: 0.6784 - loss: 0.6132 - val_accuracy: 0.6993 - val_loss: 0.5856
Epoch 3/100
36/36 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - accuracy: 0.6880 - loss: 0.5875 - val_accuracy: 0.6783 - val_loss: 0.5784
Epoch 4/100
36/36 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step - accuracy: 0.7115 - loss: 0.5707 - val_accuracy: 0.6923 - val_loss: 0.5630
Epoch 5/100
36/36 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step - accuracy: 0.7106 - loss: 0.5830 - val_accuracy: 0.7133 - val_loss: 0.5784
Epoch 6/100
36/36 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step - accuracy: 0.6924 - loss: 0.5909 - val_accuracy: 0.7063 - val_loss: 0.5483
Epoch 7/100
36/36 ━━

In [25]:
# evaluate the model on the test set
loss, accuracy = model.evaluate(X_test, y_test)
print(f'Accuracy: {accuracy}')

6/6 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - accuracy: 0.7816 - loss: 0.4601
Accuracy: 0.7709497213363647
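
The step counters in the logs (36/36 during training, 6/6 during evaluation) can be traced back to the split sizes. The sketch below reproduces them with plain arithmetic. It assumes that `train_test_split` rounds the test share up, that Keras holds out the last 20% of the training rows for validation (truncating the fitted share), and that `model.evaluate` uses its default batch size of 32.

```python
import math

n_rows = 891                                   # rows reported by df.shape

n_test = math.ceil(0.2 * n_rows)               # test rows from test_size=0.2
n_train_full = n_rows - n_test                 # rows passed to model.fit
n_fit = int(n_train_full * (1 - 0.2))          # rows actually used for weight updates
n_val = n_train_full - n_fit                   # rows held out by validation_split=0.2

train_steps = math.ceil(n_fit / 16)            # batch_size=16 in model.fit
test_steps = math.ceil(n_test / 32)            # assumed default batch size in evaluate

print(n_test, n_train_full, n_fit, n_val)      # 179 712 569 143
print(train_steps, test_steps)                 # 36 6

# Inference, not logged: 138 correct out of 179 would give the reported accuracy
print(round(138 / n_test, 4))                  # 0.7709
```

With only 143 validation rows and 179 test rows, a single prediction moves the accuracy by more than half a percentage point, so small differences between runs should be read with caution.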


This model achieved an accuracy of 0.77 on the test set.

We will add dropout to the model and change some other parameters to see if we can improve the model's performance.
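
As a rough illustration of what a dropout layer does, here is a NumPy sketch of inverted dropout, the scheme Keras uses: during training a random fraction of activations is zeroed and the survivors are scaled up, while at inference the layer passes its input through unchanged. The `dropout` function and the toy activations are assumptions for illustration, not the Keras internals.

```python
import numpy as np

def dropout(activations, rate, training, rng):
    """Inverted dropout: zero a fraction `rate` of units and rescale the rest."""
    if not training or rate == 0.0:
        return activations                        # inference: identity
    keep_mask = rng.random(activations.shape) >= rate
    return activations * keep_mask / (1.0 - rate)

rng = np.random.default_rng(0)
hidden = np.ones((4, 32))                         # made-up output of the 32-unit layer

train_out = dropout(hidden, rate=0.5, training=True, rng=rng)
infer_out = dropout(hidden, rate=0.5, training=False, rng=rng)

print(np.unique(train_out))   # values are either 0 (dropped) or 2 (kept and rescaled)
print(train_out.mean())       # close to 1 on average, thanks to the rescaling
```

Because a different random subset of units is silenced on every batch, the network cannot rely on any single unit, which tends to reduce overfitting on small datasets.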

In [60]:
# Build the artificial neural network
model = Sequential()
model.add(Dense(128, activation='relu', input_shape=(X_train.shape[1],)))
#model.add(Dropout(0.5))
model.add(Dense(32, activation='tanh'))
model.add(Dropout(0.5))
model.add(Dense(1, activation='sigmoid'))




In [61]:
# Compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

In [62]:
#train the model
history = model.fit(X_train, y_train, epochs=100, batch_size=16, validation_split=0.2, verbose=1)


Epoch 1/100
36/36 ━━━━━━━━━━━━━━━━━━━━ 2s 8ms/step - accuracy: 0.6368 - loss: 0.7400 - val_accuracy: 0.6993 - val_loss: 0.6019
Epoch 2/100
36/36 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - accuracy: 0.6153 - loss: 0.7647 - val_accuracy: 0.6993 - val_loss: 0.5760
Epoch 3/100
36/36 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step - accuracy: 0.6775 - loss: 0.6508 - val_accuracy: 0.6993 - val_loss: 0.5901
Epoch 4/100
36/36 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - accuracy: 0.6250 - loss: 0.8140 - val_accuracy: 0.7273 - val_loss: 0.5875
Epoch 5/100
36/36 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step - accuracy: 0.6144 - loss: 0.7314 - val_accuracy: 0.6713 - val_loss: 0.5833
Epoch 6/100
36/36 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - accuracy: 0.6463 - loss: 0.7036 - val_accuracy: 0.6923 - val_loss: 0.5627
Epoch 7/100
36/36 ━━━

In [63]:
# evaluate the model on the test set
loss, accuracy = model.evaluate(X_test, y_test)
print(f'Accuracy: {accuracy}')

6/6 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - accuracy: 0.7954 - loss: 0.4547
Accuracy: 0.7821229100227356


Great. The model performed a bit better than the Random Forest classifier. The model achieved a validation accuracy of 0.83 and a test accuracy of 0.78.

In [None]:
# Save the model
model.save('titanic_model.h5')

### BIBLIOGRAPHY

https://www.kaggle.com/datasets/yasserh/titanic-dataset