 <center> <h1> <b> Pattern Recognition and Machine Learning (EE5610 - EE2802 - AI2000 - AI5000) </b> </h1> </center>

<b> Programming Assignment - 04 : Neural Networks </b>


This programming assignment gives you a chance to perform the classification task using neural networks. You will get to build a neural network from scratch and train and test it on a standard classification dataset. Further you will learn different tricks and techniques to train a neural network eficiently by observing few important issues and trying to overcome them. This includes observing the performance of the network for different activation functions and optimization algorithms. We will conclude with implementation of various regularization techniques to overcome the problems of overfitting and vanishing gradients.

<b> Instructions </b>
1. Plagiarism is strictly prohibited.
2. Delayed submissions will be penalized with a scaling factor of 0.5 per day.
3. Please DO NOT use any machine learning libraries unless and otherwise specified.




### Basically in Glance

- NN Based Classification from Scratch
- Understanding various Activation Functions
- Understanding Optimization Algorithms
- Understanding Regularization Methods
- Comparision with Linear Classifiers

In [1]:
#All imports
import numpy as np
from mpl_toolkits.mplot3d import Axes3D
import matplotlib.pyplot as plt
import collections

#  Part 1 🧠✨ Neural Network Based Digit Classification — MNIST 🖼️🔢

This programming assignment focuses on building a **Feedforward Neural Network** from scratch to classify handwritten digits using the **MNIST dataset**. The dataset consists of grayscale images of size **28×28 pixels**, each representing a digit from **0 to 9**.

---

## 🗂️ 1: Load MNIST Data & Create Train-Test Splits

📌 **Instructions:**
- The MNIST dataset contains **70,000 images** in total.
- Split the dataset into:
  - ✅ **60,000** images for training  
  - 🧪 **10,000** images for testing  
- ✅ The code for downloading and splitting the dataset is provided.

---

## 🏗️  2: Design a Feedforward Classification Network

We will build a **3-layer Feedforward Neural Network** with the following architecture:

📐 **Network Dimensions:**
- Input: 784 (28 × 28 flattened)
- Hidden Layer 1: 512 nodes
- Hidden Layer 2: 512 nodes
- Output Layer: 10 nodes (one for each digit class)

🧮 **Mathematical Representation:**

$$
\mathbf{y} = h(\mathbf{W}_3 \cdot g(\mathbf{W}_2 \cdot g(\mathbf{W}_1 \cdot \mathbf{x})))
$$

- $ \mathbf{W}_1 \in \mathbb{R}^{512 \times 784} $
- $ \mathbf{W}_2 \in \mathbb{R}^{512 \times 512} $
- $ \mathbf{W}_3 \in \mathbb{R}^{10 \times 512} $

✨ **Activations:**
- Hidden layers: `ReLU` ⚡  
- Output layer: `Softmax` 🎯

---

## 🏋️  3: Train the Neural Network

### 🔄 Steps:
1. 📦 **Flatten** 28×28 images into 784-dimensional vectors.  
2. 🔁 **Randomly initialize** the weights $ \mathbf{W}_1, \mathbf{W}_2, \mathbf{W}_3 $  
3. 🧠 **Feedforward pass** to compute class posteriors.  
4. 📉 **Compute loss** between predicted probabilities and true labels (cross-entropy loss suggested).  
5. 🔧 **Backpropagate** the loss to compute gradients.  
6. ⚙️ **Update parameters** using **Stochastic Gradient Descent (SGD)**.  
7. 🧪 **Tune hyperparameters** such as learning rate, batch size, and epochs.

---

## 📊 4: Evaluate the Model

### ✅ Steps:
- 🔁 Run a **feedforward pass** on the test data.  
- 🧠 **Predict the class** with the highest posterior probability.  
- 📉 **Compute loss** and **accuracy**.  
- 📋 **Report observations**, performance metrics, and any interesting behaviors.

---

### 🚀 Bonus Tips:
- Try visualizing some predictions 📸  
- Experiment with learning rates and see how they affect training!  
- Track loss/accuracy over epochs using plots 📈

---


In [2]:
##1 Load data and create train test splits
 ##################################################
#Load MNIST data.
##################################################
import torchvision.datasets as datasets
mnist_trainset = datasets.MNIST(root='./data', train=True, download=True, transform=None)
mnist_testset = datasets.MNIST(root='./data', train=False, download=True, transform=None)

#Training data
mnist_traindata = mnist_trainset.data.numpy()
mnist_trainlabel = mnist_trainset.targets.numpy()
print("Training data",mnist_traindata.shape)
print("Training labels",mnist_trainlabel.shape)

#Testing data
mnist_testdata = mnist_testset.data.numpy()
mnist_testlabel = mnist_testset.targets.numpy()
print("Testing data",mnist_testdata.shape)
print("Testing labels",mnist_testlabel.shape)


Training data (60000, 28, 28)
Training labels (60000,)
Testing data (10000, 28, 28)
Testing labels (10000,)


In [5]:


##################################################
#2 Define the architecture
##################################################

#Complete the below function to impliment ReLU activation function
def ReLu(inp):
  return np.maximum(inp,0)
  

#Complete the below function to impliment gradient of ReLU activation function
def gradReLu(inp):
    return (inp > 0).astype(float)





def LeakyReLu(inp, alpha=0.01):
  return np.maximum(inp, alpha * inp)
  
def gradLeakyReLu(inp, alpha=0.01):
  return np.where(inp > 0, 1.0, alpha)





  
#Complete the below function to impliment softmax activation function
# def softmax(X):
#     X = X - np.max(X, axis=0, keepdims=True)  # keepdims is key here
#     exp_X = np.exp(X)
#     return exp_X / np.sum(exp_X, axis=0, keepdims=True)

def softmax(Z):
    A = np.exp(Z) / sum(np.exp(Z))
    return A

#Complete the below function to impliment forward propagation of data
def fwdPropagate(inputs, weights,bias):
  #Inputs: input data, paramters of network
  W1, W2, W3 = weights
  b1,b2,b3 = bias

  A1 = W1 @ inputs + b1
  # print(A1.shape)
  Z1 = LeakyReLu(A1)
  A2 = W2 @ Z1 + b2
  Z2 = LeakyReLu(A2)
  A3 = W3 @ Z2 + b3
  outps = softmax(A3)



  #Return the requires outputs, i.e., final output and intermediate activations
  return [A1,A2,A3],[inputs,Z1,Z2,outps]


#Complete the below function to compute the gradients
# def computeGradients(inputs, targets, weights, activations):
#   #Inputs: input data, targets, parameters of netwrok, intermediate activations

#   #Compute the loss



#   #Compote the derivative of loss at parameters

#   #Return the gradients
#   return [dj_dw1, dj_dw2, dj_dw3]

#Complete the below function to update the parameters using the above computed gradients
def applyGradients(weights,biases, gradients_weights, gradient_biases, learning_rate):
  #Inputs: weights, gradients, and learning rate
  W1, W2, W3 = weights
  b1,b2,b3 = biases
  nabla_W1, nabla_W2, nabla_W3 = gradients_weights
  nabla_b1,nabla_b2,nabla_b3 = gradient_biases

  # print("W1 Shape", W1.shape)
  # print("nabla_W1 shape", nabla_W1.shape)
  # for i in range(3):
  #   print("Weights shape", weights[i].shape , "nabla_Weight shape: ",gradients_weights[i].shape )

  W1 = W1 - learning_rate * nabla_W1
  W2 = W2 - learning_rate * nabla_W2
  W3 = W3 - learning_rate * nabla_W3

  b1= b1 - learning_rate * nabla_b1
  b2 = b2 - learning_rate * nabla_b2
  b3 = b3 - learning_rate * nabla_b3

  


  #Return the updated parameters
  return [W1, W2, W3],[b1,b2,b3]

#Complete the below function to complete the backpropagation ste
def backPropagate( targets, weights,biases, activations,neuronal_outputs,learning_rate):
  #Inputs: input data, targets, parameters of network, intermediate activations, learning rate of optimization algorithm

  # nabla_b = [np.zeros(b.shape) for b in biases]
  # nabla_w = [np.zeros(w.shape) for w in weights]
  W1,W2,W3 = weights
  b1,b2,b3 = biases
  m = activations[0].shape[1]

  X,Z1,Z2,Z3 = neuronal_outputs
  A1,A2,A3 = activations

  # # print("Activation shape")
  # # for activation in activations:
  # #   print(activation.shape)

  # # print("Neuronal Outputs")
  # # for w in neuronal_outputs:
  # #   print(w.shape)
  
  # delta = neuronal_outputs[-1] - targets

  # nabla_b[-1] = np.sum(delta,axis =1 )[:,np.newaxis]
  # nabla_w[-1] = delta @ neuronal_outputs[-2].T
  # # print(f"Layer {3} Activation shape {activations[-1].shape}  delta shape {delta.shape} nabla_w shape  {nabla_w[-1].shape}")

  # layer_count = 3
  # for l in range(2,layer_count+1):
  #   a = activations[-l]
  #   h_prime = gradLeakyReLu(a)

  #   delta = weights[-l +1].T @ delta  * h_prime
  #   col_min = np.min(delta, axis=0)  # shape: (N,)
  #   col_max = np.max(delta, axis=0)  # shape: (N,)    # print(f"Layer {3-l} Activation shape {a.shape}  delta shape {delta.shape}")

  #   print("Min per sample:", col_min[:10])
  #   print("Max per sample:", col_max[:10])

  #   nabla_b[-l] = np.sum(delta,axis =1)[:,np.newaxis]

  #   nabla_w[-l] = delta @ neuronal_outputs[-l-1].T

  delta3 = Z3 - targets
  nabla_w3 = 1/m * delta3 @ Z2.T
  nabla_b3 = 1/m * np.sum(delta3)

  delta2 = W3.T @delta3  * gradReLu(A2)
  nabla_w2  = 1/m * delta2@ Z1.T
  nabla_b2 = 1/m * np.sum(delta2)


  delta1 = W2.T @delta2  * gradReLu(A1)
  nabla_w1  = 1/m * delta1@ X.T
  nabla_b1 = 1/m * np.sum(delta1)



  
  



  
  return applyGradients(weights,biases,[nabla_w1,nabla_w2,nabla_w3],[nabla_b1,nabla_b2,nabla_b3],learning_rate)


def error(predicted, true):
    eps = 1e-10
    return -np.sum(true * np.log(predicted + eps)) / predicted.shape[1]

def accuracy(t,t_hat):
    return np.sum(t==t_hat)/t.size *100
      


##################################################
#3 Train the network
##################################################

def vectorized_result(j):
    """Return a 10-dimensional unit vector with a 1.0 in the jth
    position and zeroes elsewhere.  This is used to convert a digit
    (0...9) into a corresponding desired output from the neural
    network."""
    e = np.zeros((10, 1))
    e[j] = 1.0
    return e

#Complete the below function to complete the training of network
def training(inputs, targets_idx, batch_size = 128, epochs=30, train_val_split=0.8, learning_rate=0.01):

  #Set the hyperparameters
  hidden_units = [512,512]
  n_classes = 10
  n_samples = inputs.shape[1]


  sizes = [inputs.shape[0],512,512,n_classes]

  #Split the training data into two parts.
  #Use 90 percent of training data for training the network.
  #Remaining 10 percent as validation data

  train_samples = int(train_val_split * n_samples) 
  # validation_samples = n_samples - train_samples



  permutation = np.random.permutation(n_samples)

  # Apply the permutation to both inputs and labels
  shuffled_inputs = inputs[:, permutation]
  shuffled_labels = targets_idx[:, permutation]

  # Split based on shuffled data
  train_X = shuffled_inputs[:, :train_samples]
  train_idx = shuffled_labels[:, :train_samples]

  validate_X = shuffled_inputs[:, train_samples:]
  validate_idx = shuffled_labels[:, train_samples:]

  
  n_batches = int(train_samples/batch_size)

  # print("Train_X shape",train_X.shape)
  # print("Train_idx shape",train_idx.shape)




  #Randomly initialize the weights
  biases = [np.random.randn(y,1) *0.01 for y in sizes[1:]]


  weights = [np.random.randn(y, x) * np.sqrt(2. / x) for x, y in zip(sizes[:-1], sizes[1:])]
  # batch =3
  # batch_X = train_X[:,batch* batch_size : (batch+1 )* batch_size]
  # batch_idx = targets_idx[:,batch * batch_size : (batch+1) * batch_size]



  # print("Batch_X shape",batch_X.shape)
  # print("Batch_idx shape",batch_idx.shape)



  #Interate for epochs times
  for epoch in range(epochs):
    #Shuffle the training data

    #Interate through the batches of data
    for batch in range(n_batches):
      #Get the batch of data
      batch_X = train_X[:,batch* batch_size : (batch+1 )* batch_size]
      batch_idx = targets_idx[:,batch * batch_size : (batch+1) * batch_size]

      #Forward propagation
      activations,neuronal_outputs = fwdPropagate(batch_X, weights,biases)

      #Backward propagation
      weights,biases = backPropagate(batch_idx,weights,biases, activations,neuronal_outputs, learning_rate)

      # active_neurons = np.mean(activations > 0)
      # print(f"Active neurons: {active_neurons*100:.2f}%")

      # print(f"Batch {batch} Exectued successfully Updated Weights and Biases ")



    #Compute outpus on trianing data
    activations,neuronal_outputs = fwdPropagate(inputs,weights,biases)
    outputs = neuronal_outputs[-1]

    
    predictions = np.argmax(outputs,axis = 0)[:,np.newaxis]
    actual_labels = np.argmax(targets_idx,axis = 0)[:,np.newaxis]
    print("Predictions",predictions[:15])
    print("actual labels", actual_labels[:15])
    #Compute training accuracy, and training error
    train_error = error(outputs,targets_idx)
    train_accuracy = accuracy(predictions,actual_labels)



    print(f"Epoch {epoch+1} : Accuracy : {train_accuracy} Error : {train_error}")
    





    #Compute outputs on validation data
    val_activations, val_outputs = fwdPropagate(validate_X, weights, biases)
    val_preds = np.argmax(val_outputs[-1], axis=0)[:, np.newaxis]
    val_labels = np.argmax(validate_idx, axis=0)[:, np.newaxis]



    #Compute validation accuracy, and validation error
    val_acc = accuracy(val_preds, val_labels)
    val_err = error(val_outputs[-1], validate_idx)

    print(f"Epoch {epoch+1} | Val Acc: {val_acc:.2f} | Val Error: {val_err:.2f}")

    #Print the statistics of training, i.e., training error, training accuracy, validation error, and validation accuracy


    #Save the parameters of network


#Call the training function to train the network




train_inputs=np.array([i.flatten() for i in mnist_traindata]).T[:,:150]
train_labels= np.array([vectorized_result(j).flatten() for j in mnist_trainlabel]).T[:,:150]

training(train_inputs,train_labels,learning_rate =0.0001)










Predictions [[1]
 [1]
 [8]
 [0]
 [2]
 [8]
 [9]
 [8]
 [9]
 [6]
 [8]
 [8]
 [9]
 [0]
 [9]]
actual labels [[5]
 [0]
 [4]
 [1]
 [9]
 [2]
 [1]
 [3]
 [1]
 [4]
 [3]
 [5]
 [3]
 [6]
 [1]]
Epoch 1 : Accuracy : 10.0 Error : 20.232949406118053
Epoch 1 | Val Acc: 16.67 | Val Error: 18.47
Predictions [[1]
 [1]
 [8]
 [0]
 [2]
 [8]
 [9]
 [8]
 [9]
 [6]
 [8]
 [8]
 [9]
 [0]
 [9]]
actual labels [[5]
 [0]
 [4]
 [1]
 [9]
 [2]
 [1]
 [3]
 [1]
 [4]
 [3]
 [5]
 [3]
 [6]
 [1]]
Epoch 2 : Accuracy : 10.0 Error : 20.232949406118053
Epoch 2 | Val Acc: 16.67 | Val Error: 18.47
Predictions [[1]
 [1]
 [8]
 [0]
 [2]
 [8]
 [9]
 [8]
 [9]
 [6]
 [8]
 [8]
 [9]
 [0]
 [9]]
actual labels [[5]
 [0]
 [4]
 [1]
 [9]
 [2]
 [1]
 [3]
 [1]
 [4]
 [3]
 [5]
 [3]
 [6]
 [1]]
Epoch 3 : Accuracy : 10.0 Error : 20.232949406118053
Epoch 3 | Val Acc: 16.67 | Val Error: 18.47
Predictions [[1]
 [1]
 [8]
 [0]
 [2]
 [8]
 [9]
 [8]
 [9]
 [6]
 [8]
 [8]
 [9]
 [0]
 [9]]
actual labels [[5]
 [0]
 [4]
 [1]
 [9]
 [2]
 [1]
 [3]
 [1]
 [4]
 [3]
 [5]
 [3]
 [6]
 [1

In [4]:
x= np.array([1,2,3])
exp_x = np.e**x
exp_x/  (np.sum(exp_x))


array([0.09003057, 0.24472847, 0.66524096])

In [5]:


X= np.array([
  [1,4,5],
  [4,5,-1]
])

softmax(X)  

array([[0.04742587, 0.26894142, 0.99752738],
       [0.95257413, 0.73105858, 0.00247262]])

In [None]:
sizes = [784,512,512,10]
for x,y in zip(sizes[:-1],sizes[1:]):
    print(y,x)


512 784
512 512
10 512


In [None]:
np.flatten()

AttributeError: module 'numpy' has no attribute 'flatten'

In [None]:
##################################################
#4 Evaluate the performance on test data
##################################################



In [None]:
A = np.array([
    [-1,2,3],
    [4,5,6]
])

def gradReLu(inp):
  if inp >=0:
    return 1
  else:
    return 0







np.apply_along_axis(gradReLu_vec,0,A)

array([[0, 1, 1],
       [1, 1, 1]])

# 🔍🧠 Part 2: Understanding Activation Functions ⚡📉

In this part, you will explore how different **activation functions** affect the performance of a feedforward neural network trained on the **MNIST digit classification** task.

---

## 🧪 1. Experiment with Activation Functions

Train the same network architecture as in Part 1 using **different activation functions** in the hidden layers:

🔁 **Activation Functions to try:**
- 🌀 `Sigmoid`
- 🔄 `Tanh`
- ⚡ `ReLU`
- ⚡💧 `Leaky ReLU`

📌 Keep the following fixed:
- Use **Stochastic Gradient Descent (SGD)** as the optimization algorithm
- Keep the rest of the architecture, loss function, and learning setup same

---

## 📊 2. Report & Compare Performance

🧠 **Evaluation Instructions:**
- Run the trained model on the **MNIST test dataset**
- Record and report the **accuracy** for each activation function
- ✍️ Write down your **observations** in the report:
  - How does the choice of activation affect learning?
  - Which function performed best and why?
  - Are there any signs of vanishing gradients or training instability?

---

### 📝 Sample Table for Results (fill this in later):

| Activation Function | Test Accuracy (%) | Observations |
|---------------------|-------------------|--------------|
| Sigmoid             |                   |              |
| Tanh                |                   |              |
| ReLU                |                   |              |
| Leaky ReLU          |                   |              |

---

### 💡 Tips:
- You may visualize training loss curves for each activation function 📉
- Consider running multiple trials to average out randomness
- Look at confusion matrices to understand misclassification patterns 🔍

---


# ⚙️📈 Part 3: Understanding Optimization Algorithms 🧠🔧

In this part, you'll explore how different **optimization algorithms** affect the training performance of your classification network.

---

## 🧠 1. Use the Best Activation Function

- Choose the **best-performing activation function** from your experiments in **Part 2** (e.g., ReLU or Tanh).
- Use this activation function in your network’s hidden layers.

---

## 🚀 2. Train with Adam Optimizer

Train the same classification network using the **Adam optimization algorithm**, instead of SGD.

🔁 Keep the rest of the setup **unchanged**:
- Same architecture (3-layer FFNN)
- Same initialization
- Same learning rate (or tune slightly if necessary)
- Same loss function (Cross-Entropy)

---

## ⚖️ 3. Compare Performance: SGD vs Adam

Evaluate and compare the models trained using:
- 🔁 **Stochastic Gradient Descent (SGD)**
- ⚙️ **Adam Optimizer**

🧪 **Metrics to Compare:**
- Test Accuracy
- Convergence speed (training loss curves)
- Stability of training (variance in accuracy across epochs)

---

### 📝 Sample Comparison Table:

| Optimizer | Activation Function | Test Accuracy (%) | Observations |
|-----------|---------------------|-------------------|--------------|
| SGD       | ReLU (or best)      |                   |              |
| Adam      | ReLU (or best)      |                   |              |

---

## ✍️ 4. Report Your Observations

🔍 **What to observe:**
- Did Adam improve the accuracy?
- Was convergence faster or slower than SGD?
- Any trade-offs noticed (e.g., stability vs. generalization)?

🗒️ Document your insights in the report.

---

### 💡 Tips:
- Plot accuracy/loss over epochs for both optimizers
- Try using same learning rates for fairness, or optimize them slightly for best results
- Optionally try RMSprop, Adagrad, etc., for deeper exploration 🔍

---


# 🧪🧰 Part 4: Understanding Regularization Methods 🔒📉

In this part of the assignment, you'll explore **regularization techniques** to reduce **overfitting** in your neural network. You'll use the network from previous parts and incorporate different methods to improve generalization.

---

## 🧠 Goal:
Retrain the neural network using the following **regularization techniques**, and compare their performance on the **MNIST test set**.

---

## 🔗 1. Weight Regularization (L2 Penalty)

- Add an L2 regularization term to the loss function:
  
  $$
  \text{Loss}_{\text{total}} = \text{CrossEntropyLoss} + \lambda \left( \lVert \mathbf{W}_1 \rVert^2 + \lVert \mathbf{W}_2 \rVert^2 + \lVert \mathbf{W}_3 \rVert^2 \right)
  $$

- Experiment with different values of **$\lambda$** (e.g., `0.001`, `0.01`, `0.1`) and report the impact.

---

## 💧 2. Dropout Regularization

- Introduce **Dropout** in the hidden layers with a probability of `0.2`.
- Dropout randomly disables some neurons during training to prevent co-adaptation.
- 🧪 **Important:** Disable Dropout during inference/evaluation.

### 🔁 Suggestions:
- Try with different probabilities: `0.1`, `0.2`, `0.5`
- Compare their effects on validation accuracy

---

## ⏹️ 3. Early Stopping

- Monitor **validation loss** during training
- Stop training when validation loss starts increasing, even if training loss is decreasing
- Prevents the model from overfitting the training data

---

## 📊 Results Table

| Regularization Method | Accuracy on Test Set (%) | Observations |
|------------------------|--------------------------|--------------|
| L2 (λ = 0.001)         |                          |              |
| L2 (λ = 0.01)          |                          |              |
| Dropout (p = 0.2)      |                          |              |
| Dropout (p = 0.5)      |                          |              |
| Early Stopping         |                          |              |

---

## ✍️ Report Your Observations

🧠 Consider reflecting on:
- Which regularization method gave the best test accuracy?
- Did any method significantly reduce overfitting?
- How did training time, convergence, and stability change with each method?

---

### 💡 Tips:
- You can visualize **training vs. validation loss curves** to better understand overfitting and the effect of early stopping.
- For Dropout, visualize training accuracy vs test accuracy to observe generalization.

---


# ⚖️📊 Part 5: Comparison with Linear Classifiers 🤖 vs 📉

In this final part of the assignment, you'll compare the classification performance of **deep neural networks** and **linear classifiers** on both linearly and non-linearly separable datasets.

---

## 🔹 1. Linearly Separable Data Generation

📈 **Class 1**:
- Gaussian distribution
- Mean: $ \begin{bmatrix} 1 \\ 1 \end{bmatrix} $
- Covariance: $ \begin{bmatrix} 0.3 & 0.0 \\ 0.0 & 0.3 \end{bmatrix} $

📉 **Class 2**:
- Gaussian distribution
- Mean: $ \begin{bmatrix} 3 \\ 3 \end{bmatrix} $
- Covariance: $ \begin{bmatrix} 0.3 & 0.0 \\ 0.0 & 0.3 \end{bmatrix} $

📦 **Split:**
- 4500 samples per class for **training**
- 500 samples per class for **testing**

---

## 🔹 2. Non-linearly Separable Data

- Predefined code provides:
  - `class1_data` and `class2_data`
  - ~5000 points per class

📦 **Split:**
- Use **90% for training**, **10% for testing**

---

# 🧠 Programming Tasks

---

## 🔸 3. Linear Classification Models - Logistic Regression

📘 **Model:**
- $ y = \frac{1}{1 + e^{-\mathbf{w}^T \mathbf{x}}} $

### ✅ Your tasks:

**a) Build a function `Logistic_Regression(X_train, Y_train, X_test)`**:
- Initialize $ \mathbf{w} $ randomly
- Optimize using **iterative reweighted least squares**
- Predict using learned $ \mathbf{w} $

**b) Evaluate test accuracy**  
**c) Visualize decision regions**:
- Use boundary plots or color-coded regions

---

## 🔸 4. Deep Neural Network-Based Classification

🏗 **Architecture:**

$ \mathbf{y} = h(\mathbf{W}_3 \cdot g(\mathbf{W}_2 \cdot g(\mathbf{W}_1 \cdot \mathbf{x}))) $

### 🧱 Weight Dimensions:
- $ \mathbf{W}_1 \in \mathbb{R}^{3 \times 2} $
- $ \mathbf{W}_2 \in \mathbb{R}^{3 \times 3} $
- $ \mathbf{W}_3 \in \mathbb{R}^{1 \times 3} $

🧩 **Activation Functions**:
- $g(.)$ = ReLU (hidden layers)
- $h(.)$ = Sigmoid (output layer)

🔁 **Posterior Probabilities**:
- Class 1: $ \sigma(z) $
- Class 2: $ 1 - \sigma(z) $

### ✅ Your tasks:

- Train the network on both datasets
- Plot **second layer activation potentials**:
  - Input entire data
  - Visualize 3D hidden representation
- Evaluate performance on test set
- Comment on how non-linearity transforms the space

---

## 🆚 5. Comparison and Observations

📊 Compare:
- **Logistic Regression (Linear Model)**
- **Feedforward Neural Network**

### Suggested Comparison Table:

| Dataset Type         | Classifier            | Test Accuracy (%) | Comments |
|----------------------|------------------------|-------------------|----------|
| Linearly Separable   | Logistic Regression    |                   |          |
| Linearly Separable   | Neural Network         |                   |          |
| Non-linearly Separable | Logistic Regression  |                   |          |
| Non-linearly Separable | Neural Network       |                   |          |

---

## ✍️ Final Observations

- Which model performs better on linearly separable data?
- Which model excels on non-linear data?
- How do hidden activations help in class separation?
- Is the added complexity of neural networks justified?

---

### 💡 Tips:
- Use meshgrid and contour plots for decision boundaries
- For 3D visualization of hidden layers, use `matplotlib`'s `Axes3D`
- Observe the **power of feature transformation** done by neural networks!

---
