# Artificial Neural Network

## Main Task
This dataset contains details of a bank's customers. The target variable is binary: it records whether the customer left the bank (closed their account) or remains a customer.
> Understand the correlation between the independent variables and the dependent variable, then predict whether a new customer will stay with or leave the bank (a binary classification task).

## Data Understanding

1.0. What is the domain area of the dataset?
Banking. The dataset contains details of a bank's customers, and the binary target variable records whether the customer left the bank (closed their account) or remains a customer.

2.0. Which data format?
The dataset is in CSV format.

2.1. Do the files have headers or another file describing the data?
The file has a header row: each column has a name that describes the data it contains.

2.2. Are the data values separated by commas, semicolon, or tabs?
The data values are separated by commas.
Example (header row followed by the first record):
* *RowNumber,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited*  
* *1,15634602,Hargrave,619,France,Female,42,2,0,1,1,1,101348.88,1*
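
The two answers above can be checked programmatically. The following is a minimal sketch that uses only Python's standard `csv` module on the two example lines; the variable names (`sample`, `dialect`, `first_row`) are illustrative and not part of the original notebook.

```python
import csv
import io

# Two lines copied from the example above: the header and the first record.
sample = (
    "RowNumber,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,"
    "Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited\n"
    "1,15634602,Hargrave,619,France,Female,42,2,0,1,1,1,101348.88,1\n"
)

# Let the sniffer choose among the three candidate separators.
dialect = csv.Sniffer().sniff(sample, delimiters=",;\t")

# DictReader treats the first line as the header row.
reader = csv.DictReader(io.StringIO(sample), dialect=dialect)
first_row = next(reader)

print("delimiter:", repr(dialect.delimiter))
print("number of columns:", len(reader.fieldnames))
print("Geography of first record:", first_row["Geography"])
```

On the real file, the same check would be run on the first few lines read from disk rather than on a hard-coded string.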

3.0. How many features and how many observations does the dataset have?
The dataset has:
* 14 features (columns)
* 10000 observations (rows)

4.0. Does it contain numerical features? How many?
Yes, it contains 8 numerical features.

5.0. Does it contain categorical features? How many?
Yes, it contains 3 categorical features, or 4 if the target is counted (the target class takes the values 1 or 0).
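
One way to reproduce these counts is sketched below. It assumes that the 8 numerical features exclude the identifier columns (`RowNumber`, `CustomerId`) and the target (`Exited`); that reading is an assumption, not something the notebook states. The small frame holds three records from the dataset, and the names `sample`, `numerical`, and `categorical` are illustrative.

```python
import pandas as pd

# Three records from the dataset, typed in by hand for illustration.
sample = pd.DataFrame(
    {
        "RowNumber": [1, 2, 3],
        "CustomerId": [15634602, 15647311, 15619304],
        "Surname": ["Hargrave", "Hill", "Onio"],
        "CreditScore": [619, 608, 502],
        "Geography": ["France", "Spain", "France"],
        "Gender": ["Female", "Female", "Female"],
        "Age": [42, 41, 42],
        "Tenure": [2, 1, 8],
        "Balance": [0.0, 83807.86, 159660.8],
        "NumOfProducts": [1, 1, 3],
        "HasCrCard": [1, 0, 1],
        "IsActiveMember": [1, 1, 0],
        "EstimatedSalary": [101348.88, 112542.58, 113931.57],
        "Exited": [1, 0, 1],
    }
)

# Assumption: identifiers and the target are not counted as features.
features = sample.drop(columns=["RowNumber", "CustomerId", "Exited"])

# Split the remaining columns by dtype.
numerical = features.select_dtypes(include="number").columns.tolist()
categorical = features.select_dtypes(exclude="number").columns.tolist()

print(len(numerical), "numerical:", numerical)
print(len(categorical), "categorical:", categorical)
```

Note that `HasCrCard` and `IsActiveMember` are stored as integers, so a dtype-based count treats them as numerical even though they are binary flags.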

In [19]:
# Importing necessary libraries

import pandas as pd
import numpy as np
import tensorflow as tf
from sklearn.model_selection import train_test_split

In [3]:
dataset = pd.read_csv("../dataset/Churn_Modelling.csv")

## Basic Exploratory Data Analysis

In [4]:
dataset.head()

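| | RowNumber | CustomerId | Surname | CreditScore | Geography | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 15634602 | Hargrave | 619 | France | Female | 42 | 2 | 0.0 | 1 | 1 | 1 | 101348.88 | 1 |
| 1 | 2 | 15647311 | Hill | 608 | Spain | Female | 41 | 1 | 83807.86 | 1 | 0 | 1 | 112542.58 | 0 |
| 2 | 3 | 15619304 | Onio | 502 | France | Female | 42 | 8 | 159660.8 | 3 | 1 | 0 | 113931.57 | 1 |
| 3 | 4 | 15701354 | Boni | 699 | France | Female | 39 | 1 | 0.0 | 2 | 0 | 0 | 93826.63 | 0 |
| 4 | 5 | 15737888 | Mitchell | 850 | Spain | Female | 43 | 2 | 125510.82 | 1 | 1 | 1 | 79084.1 | 0 |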


In [10]:
print(f"Number of features in the dataset is {dataset.shape[1]} and the number of observations/rows in the dataset is {dataset.shape[0]}")

Number of features in the dataset is 14 and the number of observations/rows in the dataset is 10000


### Handling Missing Values

In [9]:
dataset.isnull().sum()

RowNumber          0
CustomerId         0
Surname            0
CreditScore        0
Geography          0
Gender             0
Age                0
Tenure             0
Balance            0
NumOfProducts      0
HasCrCard          0
IsActiveMember     0
EstimatedSalary    0
Exited             0
dtype: int64

In [8]:
dataset.isna().sum()

RowNumber          0
CustomerId         0
Surname            0
CreditScore        0
Geography          0
Gender             0
Age                0
Tenure             0
Balance            0
NumOfProducts      0
HasCrCard          0
IsActiveMember     0
EstimatedSalary    0
Exited             0
dtype: int64
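
The two outputs above are identical because, in pandas, `isnull()` is an alias of `isna()`. The sketch below illustrates this on a toy frame with deliberately injected gaps; the frame and its values are hypothetical and do not come from the dataset, which has no missing values.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing value per column.
toy = pd.DataFrame(
    {
        "Balance": [0.0, np.nan, 159660.8],
        "Geography": ["France", None, "Spain"],
    }
)

# Both methods flag exactly the same cells.
same_result = toy.isnull().equals(toy.isna())

# Count of missing values per column.
missing_per_column = toy.isna().sum()

print("isnull() matches isna():", same_result)
print(missing_per_column)
```

Calling either method once is therefore enough to check a dataset for missing values.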

### Data Pre-processing

In [20]:
RANDOM_STATE = 42

In [11]:
# The features or columns (RowNumber), (CustomerId) and (Surname) don't have any effect on whether 
# a customer will stay OR leave.
# So we don't include them in the independent variables.
X = dataset.iloc[:, 3:-1].values # INDEPENDENT VARIABLES
print(X) 

[[619 'France' 'Female' ... 1 1 101348.88]
 [608 'Spain' 'Female' ... 0 1 112542.58]
 [502 'France' 'Female' ... 1 0 113931.57]
 ...
 [709 'France' 'Female' ... 0 1 42085.58]
 [772 'Germany' 'Male' ... 1 0 92888.52]
 [792 'France' 'Female' ... 1 0 38190.78]]


In [13]:
y = dataset.iloc[:, -1].values # DEPENDENT VARIABLE
print(y)

[1 0 1 ... 1 1 0]


Machine learning algorithms and deep learning neural networks require that input and output variables be numbers.   
This means that categorical data must be converted to a numerical form. 

### Encoding categorical data

**Label Encoding:** It involves converting each value in a column to a number. For example, ‘male’ and ‘female’ might become 1 and 2.

In [14]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
# 'Gender' is the third column of X (index 2).
# LabelEncoder assigns codes in sorted label order, so 'Female' becomes 0 and 'Male' becomes 1.
X[:, 2] = le.fit_transform(X[:, 2])

In [15]:
print(X)

[[619 'France' 0 ... 1 1 101348.88]
 [608 'Spain' 0 ... 0 1 112542.58]
 [502 'France' 0 ... 1 0 113931.57]
 ...
 [709 'France' 0 ... 0 1 42085.58]
 [772 'Germany' 1 ... 1 0 92888.52]
 [792 'France' 0 ... 1 0 38190.78]]
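
To see why 'Female' maps to 0 and 'Male' to 1, the encoding can be imitated with plain NumPy. This is a sketch of the idea (integer codes assigned in sorted order of the labels), not the scikit-learn implementation; the array `gender` is a small hand-written example.

```python
import numpy as np

# A few hand-written labels for illustration.
gender = np.array(["Female", "Female", "Male", "Female", "Male"])

# np.unique returns the sorted distinct labels, and return_inverse gives,
# for every element, the position of its label in that sorted list.
classes, codes = np.unique(gender, return_inverse=True)

# Readable label-to-code mapping.
mapping = {str(label): int(code) for code, label in enumerate(classes)}

print(mapping)
print(codes)
```

Because the codes follow alphabetical order, they carry no meaning beyond identifying the category, which is harmless for a two-valued column such as Gender.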


**One-Hot Encoding:** It creates new (binary) columns, indicating the presence of each possible value from the original data.  
For example, for ‘geography’ with three countries ‘France’, ‘Spain’, and ‘Germany’, one-hot encoding will result in three columns   
(‘France’, ‘Spain’, ‘Germany’), which could have the value 0 or 1. Each row will have one ‘1’ and two '0’s.  

After one-hot encoding, the dummy variables become the first columns of the feature matrix. Here we get 3  
dummy variables (3 new features), one per country.

In [17]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [1])], remainder='passthrough')
X = np.array(ct.fit_transform(X))

In [18]:
print(X) # The first 3 columns are new dummy variables for 3 different geographical locations.

[[1.0 0.0 0.0 ... 1 1 101348.88]
 [0.0 0.0 1.0 ... 0 1 112542.58]
 [1.0 0.0 0.0 ... 1 0 113931.57]
 ...
 [1.0 0.0 0.0 ... 0 1 42085.58]
 [0.0 1.0 0.0 ... 1 0 92888.52]
 [1.0 0.0 0.0 ... 1 0 38190.78]]
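
The column order of the dummy variables (France, Germany, Spain) follows the sorted category names. The sketch below rebuilds the same kind of matrix with plain NumPy to make that visible; it illustrates the idea and is not how `OneHotEncoder` is implemented. The array `geography` is a small hand-written example.

```python
import numpy as np

# A few hand-written country labels for illustration.
geography = np.array(["France", "Spain", "France", "Germany"])

# Sorted distinct categories and, for each row, the index of its category.
categories, index = np.unique(geography, return_inverse=True)

# Picking rows of the identity matrix yields one 1 per row, in the category's column.
one_hot = np.eye(len(categories))[index]

print(categories)
print(one_hot)
```

Each row sums to 1, which matches the description above: one '1' and two '0's per customer.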


### Splitting the Dataset into Training and Test Sets

In [21]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=RANDOM_STATE)

In [22]:
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((8000, 12), (2000, 12), (8000,), (2000,))
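
The shapes above follow from simple arithmetic: 20% of 10000 rows is 2000 test rows, and the 12 columns are the 3 Geography dummies plus the 9 remaining features. The sketch below mimics a shuffled 80/20 split with NumPy on placeholder data. It is not the `train_test_split` implementation, the random data is hypothetical, and the `*_demo` names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(42)  # seed chosen only to make the sketch repeatable

n_rows, n_cols = 10000, 3 + 9  # 3 Geography dummies + 9 other features
X_demo = rng.normal(size=(n_rows, n_cols))  # placeholder feature matrix
y_demo = rng.integers(0, 2, size=n_rows)    # placeholder binary target

# Shuffle the row indices, then cut off 20% of them for the test set.
order = rng.permutation(n_rows)
n_test = n_rows * 20 // 100
test_idx, train_idx = order[:n_test], order[n_test:]

X_train_demo, X_test_demo = X_demo[train_idx], X_demo[test_idx]
y_train_demo, y_test_demo = y_demo[train_idx], y_demo[test_idx]

print(X_train_demo.shape, X_test_demo.shape, y_train_demo.shape, y_test_demo.shape)
```

The key property is that no row appears in both parts, so the test set stays unseen during training.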

### Feature Scaling
Feature scaling is a method used to standardize the range of independent variables or features of data.  
In data processing, it is also known as data normalization and is generally performed during the data preprocessing step.

The goal of feature scaling is to make sure that all the features of your dataset have the same scale.  
This is important because the scale of the features can have a big impact on the performance of your machine learning  
algorithm or deep learning model.

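Feature scaling is especially important in deep learning because it:
1. Speeds up learning, since gradient descent converges faster when the features share a scale.
2. Avoids numerical instability caused by very large input values.
3. Keeps features with large ranges (such as Balance) from dominating features with small ranges (such as Tenure).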

In [24]:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

# The reason we use fit_transform() on the training data and transform() on the test data is to 
# prevent data leakage from the test set into the training set.

# fit_transform(X_train): With this, the StandardScaler calculates the mean and standard deviation 
# of each feature in the training set (X_train), and then it transforms the training data using 
# these calculated parameters to have a mean of 0 and a standard deviation of 1.

# transform(X_test): Here, the StandardScaler uses the previously computed mean and standard deviation 
# of the training set to transform the test data. It does not calculate these parameters again because 
# it’s crucial that both the training and test datasets are transformed using the same scaling parameters.
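
The fit/transform distinction described in the comments can be written out by hand. The sketch below standardizes a tiny array with NumPy, computing the mean and standard deviation from the training rows only and reusing them for the test row. The numbers are CreditScore and Age values from the first rows of the dataset; the split into `train` and `test` is made up for illustration.

```python
import numpy as np

# Columns: CreditScore, Age. The train/test assignment here is illustrative.
train = np.array([[619.0, 42.0], [608.0, 41.0], [502.0, 42.0], [699.0, 39.0]])
test = np.array([[850.0, 43.0]])

# "fit": statistics come from the training data only.
mean = train.mean(axis=0)
std = train.std(axis=0)  # population standard deviation (ddof=0), as StandardScaler uses

# "transform": the same statistics are applied to both sets.
train_scaled = (train - mean) / std
test_scaled = (test - mean) / std

print(train_scaled.mean(axis=0), train_scaled.std(axis=0))
print(test_scaled)
```

After scaling, the training columns have mean 0 and standard deviation 1, while the test row is simply expressed in the training set's units and need not be centered.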

### Building the ANN model

In [26]:
ann = tf.keras.models.Sequential() # Initializing the ANN
# A Sequential model is appropriate for a plain stack of layers where each layer has exactly one 
# input tensor and one output tensor.

# Adding the input layer and the first hidden layer
# Dense layers are the regular deeply connected neural network layers. 
# units=6 means that this layer will have 6 neurons or nodes. (6 has been chosen after some trials).
# Determining the number of neurons or nodes in the hidden layers of a neural network is 
# more of an art than a science, and it often involves a bit of trial and error.
ann.add(tf.keras.layers.Dense(units=6, activation='relu'))
# Adding the second hidden layer
ann.add(tf.keras.layers.Dense(units=6, activation='relu'))
# The number of hidden layers and the number of neurons in each layer are hyperparameters of 
# the neural network: they are set before the learning process begins and can be tuned to 
# optimize performance. There is no hard and fast rule for how many layers or neurons to use; 
# the choice often comes down to trial and error.

# Adding the output layer
# Since this is a binary classification task (predicting whether a customer will stay or leave), 
# you have one neuron in the output layer. The activation='sigmoid' part means that this layer 
# will use the sigmoid activation function, which is common for binary classification since it 
# squashes values between 0 and 1, effectively giving you a probability of class membership.
ann.add(tf.keras.layers.Dense(units=1, activation='sigmoid'))
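
To make the architecture concrete, the sketch below runs a forward pass through a network of the same shape (12 inputs, two hidden layers of 6 ReLU units, 1 sigmoid output) using plain NumPy. It assumes 12 input features, taken from the shape of `X_train`. The weights are random placeholders rather than trained values, and the helper names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen only to make the sketch repeatable

layer_sizes = [12, 6, 6, 1]  # inputs, hidden layer 1, hidden layer 2, output

# Random placeholder weights and zero biases, one pair per Dense layer.
weights = [
    rng.normal(scale=0.1, size=(n_in, n_out))
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])
]
biases = [np.zeros(n_out) for n_out in layer_sizes[1:]]


def relu(z):
    return np.maximum(z, 0.0)


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def forward(x):
    """Hidden layers use ReLU; the output layer uses sigmoid."""
    a = x
    for w, b in zip(weights[:-1], biases[:-1]):
        a = relu(a @ w + b)
    return sigmoid(a @ weights[-1] + biases[-1])


batch = rng.normal(size=(4, 12))  # four hypothetical scaled customers
probs = forward(batch)

# (12*6 + 6) + (6*6 + 6) + (6*1 + 1) trainable parameters.
n_params = sum(w.size + b.size for w, b in zip(weights, biases))

print(probs.ravel())
print("trainable parameters:", n_params)
```

Each output is a number strictly between 0 and 1 that can be read as the predicted probability of leaving, and the whole model has only 127 trainable parameters.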

### Training the ANN model

In [27]:
# Compiling the ANN model or setting up the learning process for the neural network.
ann.compile(optimizer = 'adam', loss = 'binary_crossentropy', metrics = ['accuracy'])
# The optimizer and the loss are hyperparameters of the training process. The metrics are only 
# reported during training and do not influence the weight updates.

# The optimizer is used to update the weights of the neural network through backpropagation 
# in order to minimize the loss function. Adam is a popular choice because it combines the advantages
# of two other extensions of stochastic gradient descent: AdaGrad and RMSProp.

# 'binary_crossentropy' (binary cross-entropy) is a common choice for binary classification problems. 
# The loss function measures how well the model did on each instance, and the goal of training is 
# to find the model parameters that minimize the loss function across all instances.

# Metrics are used to judge the performance of your model. Accuracy is a common metric for classification problems, 
# and it measures the proportion of correct predictions out of total predictions.
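
The loss and the metric named above can be computed by hand on a few predictions. The sketch below implements binary cross-entropy, $-\frac{1}{N}\sum_i \left[y_i \log p_i + (1 - y_i)\log(1 - p_i)\right]$, and accuracy with NumPy. The labels and probabilities are made up, and the clipping constant `eps` is an assumption added to avoid `log(0)`.

```python
import numpy as np


def binary_cross_entropy(y_true, p_pred, eps=1e-7):
    """Mean binary cross-entropy; eps keeps the logarithm finite."""
    p = np.clip(p_pred, eps, 1.0 - eps)
    return float(-np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p)))


y_true = np.array([1, 0, 1, 0])              # hypothetical labels
uninformed = np.full(4, 0.5)                 # a model that always answers 0.5
confident = np.array([0.9, 0.1, 0.8, 0.2])   # a model that is mostly right

loss_uninformed = binary_cross_entropy(y_true, uninformed)
loss_confident = binary_cross_entropy(y_true, confident)

# Accuracy: threshold the probabilities at 0.5 and compare with the labels.
accuracy = float(np.mean((confident >= 0.5).astype(int) == y_true))

print(loss_uninformed, loss_confident, accuracy)
```

A model that always predicts 0.5 scores ln 2 (about 0.693), so a training loss near that value means the network has learned little so far.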

In [28]:
# Training the ANN on the training set.
ann.fit(X_train, y_train, batch_size = 32, epochs = 100)

# batch_size is a hyperparameter of gradient descent that controls how many training samples 
# are processed before the model's internal parameters are updated.
# batch_size = 32 means that the model takes 32 samples from the training set, computes the gradients, 
# updates its parameters, and then moves on to the next 32 samples. One pass through all 8000 training 
# samples (8000 / 32 = 250 updates, shown as 250/250 in the log below) constitutes one epoch, and 
# training runs for 100 epochs.

Epoch 1/100
250/250 ━━━━━━━━━━━━━━━━━━━━ 1s 369us/step - accuracy: 0.7469 - loss: 0.6274
Epoch 2/100
250/250 ━━━━━━━━━━━━━━━━━━━━ 0s 288us/step - accuracy: 0.8022 - loss: 0.4876
Epoch 3/100
250/250 ━━━━━━━━━━━━━━━━━━━━ 0s 294us/step - accuracy: 0.7913 - loss: 0.4701
Epoch 4/100
250/250 ━━━━━━━━━━━━━━━━━━━━ 0s 292us/step - accuracy: 0.7868 - loss: 0.4557
Epoch 5/100
250/250 ━━━━━━━━━━━━━━━━━━━━ 0s 290us/step - accuracy: 0.7992 - loss: 0.4319
Epoch 6/100
250/250 ━━━━━━━━━━━━━━━━━━━━ 0s 289us/step - accuracy: 0.8007 - loss: 0.4343
Epoch 7/100
250/250 ━━━━━━━━━━━━━━━━━━━━ 0s 292us/step - accuracy: 0.8214 - loss: 0.4079
Epoch 8/100
250/250 ━━━━━━━━━━━━━━━━━━━━ 0s 333us/step - accuracy: 0.8183 - loss: 0.4230
Epoch 9/100
...

<keras.src.callbacks.history.History at 0x18c34d6a5a0>
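
The `250/250` counter in the training log follows directly from the training-set size and the batch size. The short calculation below spells this out, using the 8000 training rows, `batch_size = 32`, and 100 epochs from the cells above; the variable names are illustrative.

```python
import math

n_train = 8000   # rows in X_train
batch_size = 32
epochs = 100

# One weight update per batch; the last batch may be smaller, hence the ceiling.
steps_per_epoch = math.ceil(n_train / batch_size)
total_updates = steps_per_epoch * epochs

print("updates per epoch:", steps_per_epoch)
print("updates over the whole run:", total_updates)
```

A smaller batch size would give more, noisier updates per epoch, while a larger one would give fewer and smoother updates.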