# Deep Learning 03: Artificial Neural Networks

## Abstract

Artificial Neural Networks (ANNs) are computational models inspired by the biological neural systems of the human brain. They are widely used in machine learning and deep learning due to their capability to approximate complex, non-linear functions and learn from data adaptively. This paper presents a comprehensive theoretical discussion of Artificial Neural Networks, covering their architectural design, mathematical formulation, learning and training algorithms, advantages, limitations, and practical applications. The objective is to provide a structured and rigorous explanation suitable for academic and research contexts.

## 1. Introduction

Artificial Neural Networks constitute a fundamental class of models in modern artificial intelligence and machine learning. Unlike traditional algorithmic approaches that rely on explicitly programmed rules, ANNs learn patterns directly from data through iterative optimization of parameters. This data-driven learning capability makes ANNs particularly effective for problems involving high-dimensional data, non-linearity, and uncertainty.

With advancements in computational power and optimization techniques, ANNs have become the foundation of deep learning architectures used in areas such as computer vision, speech recognition, natural language processing, and predictive analytics.


## 2. Biological Motivation

The design of Artificial Neural Networks is inspired by the structure and functioning of biological neurons. In biological systems, neurons receive electrical signals through dendrites, integrate them in the cell body, and transmit outputs via axons to other neurons. Learning occurs through the strengthening or weakening of synaptic connections.

Similarly, artificial neurons receive numerical inputs, apply weighted summation, pass the result through an activation function, and produce an output. Learning is achieved by adjusting connection weights based on observed errors.


## 3. Architecture of Artificial Neural Networks

### 3.1 Artificial Neuron Model
An artificial neuron is the fundamental processing unit of an ANN. It consists of:
* **Input signals:** $x_1, x_2, \dots, x_n$
* **Weights:** $w_1, w_2, \dots, w_n$
* **Bias term:** $b$
* **Activation function:** $f(\cdot)$

The neuron computes a weighted sum of inputs and applies a non-linear activation function.


### 3.2 Layered Network Structure

An ANN is typically organized into layers:

1. **Input Layer**
   Receives raw input features and passes them to the hidden layers.

2. **Hidden Layer(s)**
   Performs intermediate computations and feature transformations. A network may contain one or multiple hidden layers.

3. **Output Layer**
   Produces the final output, such as a class label or a continuous value.

Based on the number of hidden layers, networks are commonly classified as shallow neural networks (typically a single hidden layer) or deep neural networks (multiple hidden layers).


## 4. Mathematical Modeling of ANN

### 4.1 Neuron Output Computation

The linear combination of inputs for a neuron is defined as:

$$ z = \sum_{i=1}^{n} w_i x_i + b $$

The output of the neuron is obtained by applying an activation function:

$$ a = f(z) $$

**Where:**

* $x_i$ : Represents input features
* $w_i$ : Denotes weights
* $b$ : Is the bias term
* $f(z)$ : Is the activation function
* $a$ : Is the final neuron output (activation)
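
As a minimal sketch of the two formulas above, the following plain-Python function computes $z$ and $a$ for a single neuron. The input values, weights, and bias are made-up illustration values, and the sigmoid is used only as an example activation.

```python
import math

def neuron_output(x, w, b, f):
    """Return (z, a) for one neuron: z = sum(w_i * x_i) + b, a = f(z)."""
    z = sum(w_i * x_i for w_i, x_i in zip(w, x)) + b
    return z, f(z)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Illustrative values only (not taken from any dataset)
x = [1.0, 2.0, 3.0]
w = [0.5, -0.25, 0.0]
b = 0.0

z, a = neuron_output(x, w, b, sigmoid)
print(z, a)  # 0.0 0.5
```

Here the weighted inputs cancel out, so $z = 0$ and the sigmoid returns its midpoint value of 0.5.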


### 4.2 Activation Functions

Activation functions introduce non-linearity into the network, enabling the modeling of complex relationships.

Common activation functions include:

* **Sigmoid Function**
  
  $$ f(z) = \frac{1}{1 + e^{-z}} $$

* **Hyperbolic Tangent (Tanh)**

  $$f(z) = \tanh(z)$$

* **Rectified Linear Unit (ReLU)**
  
  $$ f(z) = \max(0, z) $$
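
The three functions listed above can be sketched directly with Python's `math` module. This is an illustrative, unvectorized version; deep learning libraries provide their own optimized implementations.

```python
import math

def sigmoid(z):
    """Squashes z into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def tanh(z):
    """Squashes z into the interval (-1, 1)."""
    return math.tanh(z)

def relu(z):
    """Passes positive values through and clamps negatives to zero."""
    return max(0.0, z)

for z in (-2.0, 0.0, 2.0):
    print(z, round(sigmoid(z), 4), round(tanh(z), 4), relu(z))
```

Note that this naive `sigmoid` raises an `OverflowError` for very large negative inputs, because `math.exp(-z)` exceeds the floating-point range; it is meant only to mirror the formula.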

## 5. Training and Learning Algorithms

### 5.1 Forward Propagation

Forward propagation refers to the process of passing input data through the network layer by layer to compute predicted outputs.
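
To illustrate the idea, here is a small sketch of a forward pass through one hidden layer and one output layer. The network shape, weights, and biases are hypothetical values chosen so the arithmetic is easy to follow by hand.

```python
def relu(z):
    return max(0.0, z)

def identity(z):
    return z

def dense(inputs, weights, biases, activation):
    """One fully connected layer: each row of `weights` feeds one neuron."""
    return [
        activation(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]

# Hypothetical network: 2 inputs -> 2 hidden ReLU neurons -> 1 linear output
hidden_w = [[1.0, -1.0], [0.5, 0.5]]
hidden_b = [0.0, -1.0]
output_w = [[2.0, 1.0]]
output_b = [0.5]

x = [3.0, 1.0]
hidden = dense(x, hidden_w, hidden_b, relu)           # [2.0, 1.0]
y_hat = dense(hidden, output_w, output_b, identity)   # [5.5]
print(hidden, y_hat)
```

Each layer's output becomes the next layer's input, which is all that "layer by layer" means here.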


### 5.2 Loss Function

The loss function quantifies the discrepancy between predicted outputs and actual target values.

Examples include:

* **Mean Squared Error (MSE)** for regression:

$$ L = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 $$
  
* **Binary Cross-Entropy Loss** for classification:

$$ L = -\frac{1}{n} \sum_{i=1}^{n} \left[y_i \log(\hat{y}_i) + (1 - y_i)\log(1 - \hat{y}_i)\right] $$
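
Both losses translate almost line for line into code. The sketch below is a plain-Python illustration; the `eps` clipping parameter is an assumption added here to keep `log()` finite and is not part of the formula above.

```python
import math

def mse(y_true, y_pred):
    """Mean squared error over n samples."""
    n = len(y_true)
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / n

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Binary cross-entropy; eps clips predictions away from 0 and 1."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(y_true)

print(mse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))        # (0 + 0 + 4) / 3
print(binary_cross_entropy([1, 0], [0.5, 0.5]))     # equals log(2)
```

A prediction of 0.5 for every sample gives a cross-entropy of $\log 2 \approx 0.693$, which is a useful reference point: a binary classifier whose loss stays near that value has learned essentially nothing.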

### 5.3 Backpropagation Algorithm

Backpropagation is the primary learning algorithm used to train ANNs. It computes the gradient of the loss function with respect to each weight using the chain rule of calculus.

$$\frac{\partial L}{\partial w} = \frac{\partial L}{\partial a} \cdot \frac{\partial a}{\partial z} \cdot \frac{\partial z}{\partial w}$$

This gradient is used to update the weights in the direction that minimizes the loss.
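
The chain rule above can be checked numerically for a single neuron. The sketch below assumes a sigmoid activation and a squared-error loss $L = \tfrac{1}{2}(a - y)^2$ (both chosen here for illustration) and compares the analytic gradient with a finite-difference estimate. All numeric values are made up.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(w, b, x):
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

def loss(w, b, x, y):
    """Squared-error loss 0.5 * (a - y)^2 for one sigmoid neuron."""
    return 0.5 * (forward(w, b, x) - y) ** 2

def grad_w(w, b, x, y):
    """Chain rule: dL/dw_i = (dL/da) * (da/dz) * (dz/dw_i)."""
    a = forward(w, b, x)
    dL_da = a - y            # derivative of 0.5 * (a - y)^2
    da_dz = a * (1.0 - a)    # derivative of the sigmoid
    return [dL_da * da_dz * xi for xi in x]   # dz/dw_i = x_i

# Illustrative values only
w, b, x, y = [0.2, -0.4], 0.1, [1.5, 2.0], 1.0

analytic = grad_w(w, b, x, y)

# Central finite differences as an independent check
h = 1e-6
numeric = []
for i in range(len(w)):
    w_plus, w_minus = list(w), list(w)
    w_plus[i] += h
    w_minus[i] -= h
    numeric.append((loss(w_plus, b, x, y) - loss(w_minus, b, x, y)) / (2 * h))

print(analytic)
print(numeric)
```

The two printed lists should agree to several decimal places. In a multi-layer network the same product of local derivatives is extended backwards through each layer.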


### 5.4 Optimization Techniques

Weights are updated using optimization algorithms such as:

* Gradient Descent
* Stochastic Gradient Descent (SGD)
* Adam Optimizer
* RMSprop

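The basic gradient descent update rule, which the other optimizers extend (for example with momentum or per-parameter adaptive step sizes), is: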


$$ w^{(t+1)} = w^{(t)} - \eta \frac{\partial L}{\partial w} $$

where $\eta$ is the learning rate.
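
A minimal sketch of this update rule, applied to the toy loss $L(w) = (w - 3)^2$ (a hypothetical example chosen because its minimum at $w = 3$ is known):

```python
def gradient_descent(grad, w0, eta, steps):
    """Repeatedly apply w <- w - eta * dL/dw."""
    w = w0
    for _ in range(steps):
        w = w - eta * grad(w)
    return w

def toy_grad(w):
    """Gradient of L(w) = (w - 3)^2."""
    return 2.0 * (w - 3.0)

w_final = gradient_descent(toy_grad, w0=0.0, eta=0.1, steps=100)
print(w_final)  # very close to 3.0
```

With $\eta = 0.1$ the distance to the minimum shrinks by a factor of 0.8 per step. For this particular loss, any $\eta$ above 1.0 makes the iterates diverge, which is one reason the learning rate is usually tuned.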

## 6. Advantages of Artificial Neural Networks

* Ability to model complex, non-linear relationships
* High predictive accuracy with sufficient data
* Adaptive learning capability
* Robustness to noisy and incomplete data

## 7. Limitations of Artificial Neural Networks

* High computational and memory requirements
* Requirement of large labeled datasets
* Difficulty in interpretability (black-box nature)
* Risk of overfitting without proper regularization


## 8. Applications of ANN

Artificial Neural Networks are widely applied in various domains, including:

* Image and facial recognition
* Speech and handwriting recognition
* Medical diagnosis and disease prediction
* Fraud detection in financial systems
* Recommendation systems
* Time-series forecasting


In [1]:
# Importing the libraries
import numpy as np
import pandas as pd
import tensorflow as tf

from sklearn.preprocessing import LabelEncoder
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

In [2]:
print("TensorFlow Version:", tf.__version__)

TensorFlow Version: 2.10.1


# Data Preprocessing

We will:

1. Import dataset
2. Separate features (X) and target (y)
3. Encode categorical variables (Gender, Geography)
4. Split into training & test sets
5. Perform Feature Scaling

In [3]:
# Import dataset
dataset = pd.read_csv('Churn_Modelling.csv')
dataset

| | RowNumber | CustomerId | Surname | CreditScore | Geography | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 15634602 | Hargrave | 619 | France | Female | 42 | 2 | 0.00 | 1 | 1 | 1 | 101348.88 | 1 |
| 1 | 2 | 15647311 | Hill | 608 | Spain | Female | 41 | 1 | 83807.86 | 1 | 0 | 1 | 112542.58 | 0 |
| 2 | 3 | 15619304 | Onio | 502 | France | Female | 42 | 8 | 159660.80 | 3 | 1 | 0 | 113931.57 | 1 |
| 3 | 4 | 15701354 | Boni | 699 | France | Female | 39 | 1 | 0.00 | 2 | 0 | 0 | 93826.63 | 0 |
| 4 | 5 | 15737888 | Mitchell | 850 | Spain | Female | 43 | 2 | 125510.82 | 1 | 1 | 1 | 79084.10 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 9995 | 9996 | 15606229 | Obijiaku | 771 | France | Male | 39 | 5 | 0.00 | 2 | 1 | 0 | 96270.64 | 0 |
| 9996 | 9997 | 15569892 | Johnstone | 516 | France | Male | 35 | 10 | 57369.61 | 1 | 1 | 1 | 101699.77 | 0 |
| 9997 | 9998 | 15584532 | Liu | 709 | France | Female | 36 | 7 | 0.00 | 1 | 0 | 1 | 42085.58 | 1 |
| 9998 | 9999 | 15682355 | Sabbatini | 772 | Germany | Male | 42 | 3 | 75075.31 | 2 | 1 | 0 | 92888.52 | 1 |


In [4]:
X = dataset.iloc[:, 3:-1].values   # Features
y = dataset.iloc[:, -1].values    # Target (Exited)

In [5]:
print("X Sample:\n", X[:3])
print("y Sample:\n", y[:10])

X Sample:
 [[619 'France' 'Female' 42 2 0.0 1 1 1 101348.88]
 [608 'Spain' 'Female' 41 1 83807.86 1 0 1 112542.58]
 [502 'France' 'Female' 42 8 159660.8 3 1 0 113931.57]]
y Sample:
 [1 0 1 0 0 1 0 1 0 0]


## Encoding Categorical Data

- **Label Encoding** for Gender
- **OneHot Encoding** for Geography


In [6]:
# Label Encoding Gender
le = LabelEncoder()
X[:, 2] = le.fit_transform(X[:, 2])

In [7]:
# One Hot Encoding Geography
ct = ColumnTransformer(
    transformers=[('encoder', OneHotEncoder(), [1])], 
    remainder='passthrough')

In [8]:
X = np.array(ct.fit_transform(X))

In [9]:
print("After Encoding:\n", X[:3])

After Encoding:
 [[1.0 0.0 0.0 619 0 42 2 0.0 1 1 1 101348.88]
 [0.0 0.0 1.0 608 0 41 1 83807.86 1 0 1 112542.58]
 [1.0 0.0 0.0 502 0 42 8 159660.8 3 1 0 113931.57]]
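
To make the two encodings concrete, here is a dependency-free sketch of what they do. It is not how scikit-learn implements them; it only reproduces the sorted-category ordering, which explains why `Female` becomes `0` and why `France` appears as `[1.0, 0.0, 0.0]` and `Spain` as `[0.0, 0.0, 1.0]` in the output above. The helper names and sample lists are hypothetical.

```python
def label_encode(values):
    """Map each distinct value to its index in sorted order."""
    classes = sorted(set(values))
    index = {c: i for i, c in enumerate(classes)}
    return [index[v] for v in values], classes

def one_hot_encode(values):
    """One column per distinct value, in sorted order."""
    classes = sorted(set(values))
    rows = [[1.0 if v == c else 0.0 for c in classes] for v in values]
    return rows, classes

genders = ["Female", "Female", "Male"]
countries = ["France", "Spain", "France", "Germany"]

codes, gender_classes = label_encode(genders)
rows, country_classes = one_hot_encode(countries)

print(gender_classes, codes)
print(country_classes)
for country, row in zip(countries, rows):
    print(country, row)
```

Label encoding is adequate for a two-valued column such as Gender. For Geography, one-hot encoding avoids implying an ordering among the three countries.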


## Splitting Dataset


In [10]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

In [11]:
print("Train Shape:", X_train.shape)
print("Test Shape:", X_test.shape)

Train Shape: (8000, 12)
Test Shape: (2000, 12)


## Feature Scaling

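Gradient-based training usually converges faster when input features are standardized to zero mean and unit variance. The scaler is fitted on the training set only and then applied to the test set, so that no test-set statistics leak into training.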


In [12]:
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
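
As a rough illustration of what `fit_transform` and `transform` do in the cell above, the sketch below standardizes a single made-up column by hand. The helper names and the sample values are hypothetical. `statistics.pstdev` is used because `StandardScaler` divides by the population standard deviation.

```python
import statistics

def fit_scaler(column):
    """Learn the mean and population standard deviation from training data."""
    return statistics.fmean(column), statistics.pstdev(column)

def transform(column, mean, std):
    """Apply z = (v - mean) / std using previously learned statistics."""
    return [(v - mean) / std for v in column]

# Made-up values for a single feature
train_age = [20.0, 30.0, 40.0, 50.0, 60.0]
test_age = [40.0, 70.0]

mean, std = fit_scaler(train_age)                # statistics from training data only
train_scaled = transform(train_age, mean, std)
test_scaled = transform(test_age, mean, std)     # reuses the training statistics

print(mean, round(std, 4))
print([round(v, 4) for v in train_scaled])
print([round(v, 4) for v in test_scaled])
```

The test values are scaled with the training mean and standard deviation, so the scaled test column need not have zero mean or unit variance.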

# Building the ANN

We start with a small network of two hidden layers (6 units each), then **add more layers** and compare performance.


In [13]:
# Initializing the ANN
ann = tf.keras.models.Sequential()

# Input + First Hidden Layer
ann.add(tf.keras.layers.Dense(units=6, activation='relu'))

# Second Hidden Layer
ann.add(tf.keras.layers.Dense(units=6, activation='relu'))

# Output Layer
ann.add(tf.keras.layers.Dense(units=1, activation='sigmoid'))

# Training the ANN

- Optimizer: **Adam**
- Loss Function: **Binary Crossentropy** (since this is binary classification)
- Metric: **Accuracy**


In [14]:
# Compile ANN
ann.compile(optimizer = 'adam', 
            loss = 'binary_crossentropy', 
            metrics = ['accuracy'])

In [15]:
# Train ANN
history = ann.fit(X_train, y_train, batch_size = 32, epochs = 100, verbose=1)

Epoch 1/100
Epoch 2/100
...
Epoch 77/100
Epoch 78

# Model Evaluation

We will predict on the **Test set** and evaluate using:

- Predictions
- Confusion Matrix
- Accuracy


In [16]:
# Predicting the Test set results
y_pred = ann.predict(X_test)
y_pred = (y_pred > 0.5)



In [17]:
# Compare predictions vs actual
print(np.concatenate((y_pred.reshape(len(y_pred),1), y_test.reshape(len(y_test),1)),1)[:10])

[[0 0]
 [0 1]
 [0 0]
 [0 0]
 [0 0]
 [1 1]
 [0 0]
 [0 0]
 [0 1]
 [1 1]]


In [18]:
# Confusion Matrix
from sklearn.metrics import confusion_matrix, accuracy_score

cm = confusion_matrix(y_test, y_pred)
acc = accuracy_score(y_test, y_pred)

In [19]:
print("Confusion Matrix:\n", cm)
print("Accuracy:", acc)

Confusion Matrix:
 [[1512   83]
 [ 189  216]]
Accuracy: 0.864
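
Accuracy alone can hide how the model treats the minority class (customers who left). As a small worked example, the sketch below derives precision and recall from the confusion matrix printed above. The variable name `cm_counts` is introduced here for illustration, and the row/column layout follows scikit-learn's convention of rows for actual classes and columns for predicted classes.

```python
# Confusion matrix printed above: rows = actual class, columns = predicted class
cm_counts = [[1512, 83],
             [189, 216]]

tn, fp = cm_counts[0]   # actual 0 (stayed): predicted 0, predicted 1
fn, tp = cm_counts[1]   # actual 1 (exited): predicted 0, predicted 1

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # of predicted churners, the share that actually left
recall = tp / (tp + fn)      # of actual churners, the share the model caught

print(f"accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f}")
# accuracy=0.8640 precision=0.7224 recall=0.5333
```

So although accuracy is 86.4%, the model identifies only about 53% of the customers who actually exited.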


# Experimenting with More Layers

We now build a deeper network with three hidden layers of 8 units each and compare its performance.


In [20]:
# Build deeper ANN
ann_deep = tf.keras.models.Sequential()

In [21]:
# Input + 3 Hidden Layers
ann_deep.add(tf.keras.layers.Dense(units=8, activation='relu'))
ann_deep.add(tf.keras.layers.Dense(units=8, activation='relu'))
ann_deep.add(tf.keras.layers.Dense(units=8, activation='relu'))

In [22]:
# Output Layer
ann_deep.add(tf.keras.layers.Dense(units=1, activation='sigmoid'))

In [23]:
# Compile
ann_deep.compile(optimizer = 'adam', 
                 loss = 'binary_crossentropy', 
                 metrics = ['accuracy'])

In [24]:
# Train
history_deep = ann_deep.fit(X_train, y_train, batch_size = 32, epochs = 100, verbose=1)

Epoch 1/100
Epoch 2/100
...
Epoch 77/100
Epoch 78

In [25]:
# Evaluate deeper model
y_pred_deep = ann_deep.predict(X_test)
y_pred_deep = (y_pred_deep > 0.5)



In [26]:
cm_deep = confusion_matrix(y_test, y_pred_deep)
acc_deep = accuracy_score(y_test, y_pred_deep)

In [27]:
print("Confusion Matrix (Deeper ANN):\n", cm_deep)
print("Accuracy (Deeper ANN):", acc_deep)

Confusion Matrix (Deeper ANN):
 [[1493  102]
 [ 185  220]]
Accuracy (Deeper ANN): 0.8565


## Three Hidden Layers (6 Units Each)

We now train a network with 3 hidden layers of 6 units each and compare performance.

In [28]:
# Initializing the ANN
ann = tf.keras.models.Sequential()

In [29]:
# Input + Hidden Layers
ann.add(tf.keras.layers.Dense(units=6, activation='relu'))
ann.add(tf.keras.layers.Dense(units=6, activation='relu'))
ann.add(tf.keras.layers.Dense(units=6, activation='relu'))

In [30]:
# Output Layer
ann.add(tf.keras.layers.Dense(units=1, activation='sigmoid'))

In [31]:
# Compile and Train
ann.compile(optimizer='adam', 
            loss='binary_crossentropy', 
            metrics=['accuracy'])

ann.fit(X_train, y_train, batch_size=32, epochs=100)

Epoch 1/100
Epoch 2/100
...
Epoch 77/100
Epoch 78

<keras.callbacks.History at 0x2315f1b3a60>

In [32]:
# Evaluate
y_pred = (ann.predict(X_test) > 0.5)
from sklearn.metrics import confusion_matrix, accuracy_score
cm = confusion_matrix(y_test, y_pred)
acc = accuracy_score(y_test, y_pred)



In [33]:
print("Confusion Matrix (3 Hidden Layers):")
print(cm)
print("Accuracy (3 Hidden Layers):", acc)

Confusion Matrix (3 Hidden Layers):
[[1512   83]
 [ 196  209]]
Accuracy (3 Hidden Layers): 0.8605


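# Comparison of Models

We experimented with **different ANN architectures** (1, 2, and 3 hidden layers) and compared their performance. The figures in the summary table below do not exactly match the cell outputs shown above: no TensorFlow random seed is fixed in this notebook, so weight initialization, and therefore the results, vary from run to run. The 1-hidden-layer configuration is reported in the table although its cells are not shown above.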



## Results Summary

| ANN Architecture   | Confusion Matrix                | Accuracy |
|--------------------|---------------------------------|----------|
| **1 Hidden Layer** | [[1492  103] <br> [ 186  219]] | **0.8555** |
| **2 Hidden Layers**| [[1525   70] <br> [ 198  207]] | **0.8660** |
| **3 Hidden Layers**| [[1526   69] <br> [ 210  195]] | **0.8605** |


## Interpretation

- **1 Hidden Layer:**  
  - Accuracy: ~85.5%  
  - Performs reasonably well, but leaves some misclassifications.  

- **2 Hidden Layers:**  
  - Accuracy: ~86.6%  
  - Best overall performance in terms of accuracy.  
  - Fewer false positives than with 1 hidden layer (70 vs. 103).  

- **3 Hidden Layers:**  
  - Accuracy: ~86.0%  
  - Slightly worse than 2 layers, indicating that **adding more layers did not help**.  
  - More false negatives than with 2 layers (210 vs. 198), i.e. customers who left were predicted as staying.  


## Conclusion

- Increasing from **1 → 2 hidden layers** improved performance.
- Adding a **3rd hidden layer** did **not** improve accuracy; performance dropped slightly.
- **Best Model:** ANN with **2 hidden layers**, giving **~86.6% accuracy**.
- More layers are not always better: they may cause **overfitting** or add unnecessary complexity.
- To further improve performance, we should explore:  
  - **Hyperparameter tuning** (units per layer, learning rate, batch size).  
  - **Regularization techniques** (Dropout, L2).  
  - **Feature engineering** (new features, removing noisy ones).  
