**OBJECTIVE:**
Write a program to implement a three-layer neural network using the TensorFlow library (only, no Keras) to classify the MNIST handwritten digits dataset. Demonstrate the implementation of feed-forward and back-propagation approaches.  

**Model Description:**
*Architecture:*
  Input Layer (784 neurons) → Takes flattened 28×28 grayscale images.
  Hidden Layer 1 (300 neurons) → Sigmoid activation.
  Hidden Layer 2 (150 neurons) → Sigmoid activation.
  Output Layer (10 neurons) → Produces logits, passed to Softmax for classification.
*Hyperparameters:*
  Loss Function: Softmax Cross-Entropy
  Optimizer: Stochastic Gradient Descent (SGD) 
  Learning Rate: 0.1
  Batch Size: 32
  Epochs: 100
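
As a quick sanity check on the architecture above, the number of trainable parameters can be counted directly from the layer sizes. This is a minimal sketch in plain Python; the helper name `count_parameters` is hypothetical and is not part of the original program.

```python
# Layer sizes taken from the architecture above: input, hidden 1, hidden 2, output.
layer_sizes = [784, 300, 150, 10]

def count_parameters(sizes):
    """Count weights and biases of a fully connected network (hypothetical helper)."""
    total = 0
    for fan_in, fan_out in zip(sizes[:-1], sizes[1:]):
        total += fan_in * fan_out  # entries of the weight matrix
        total += fan_out           # entries of the bias vector
    return total

total_params = count_parameters(layer_sizes)
print(f"Trainable parameters: {total_params}")
```

Most of the parameters sit in the first weight matrix (784 × 300), which is typical for fully connected networks on flattened images.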


In [1]:
import tensorflow as tf
import tensorflow_datasets as tfds

# Load and preprocess MNIST dataset
dataset, info = tfds.load("mnist", as_supervised=True, with_info=True)
train_dataset, test_dataset = dataset["train"], dataset["test"]

# Define batch size
BATCH_SIZE = 32

def preprocess(image, label):
    image = tf.cast(image, tf.float32) / 255.0  # Normalize (0 to 1)
    image = tf.reshape(image, [-1])  # Flatten (28x28 → 784)
    label = tf.one_hot(label, depth=10)  # One-hot encode labels
    return image, label

train_dataset = train_dataset.map(preprocess).shuffle(10000).batch(BATCH_SIZE)
test_dataset = test_dataset.map(preprocess).batch(BATCH_SIZE)

# Define model parameters
input_dim = 784
hidden_dim1 = 300
hidden_dim2 = 150
output_dim = 10

# Initialize weights and biases
W1 = tf.Variable(tf.random.normal([input_dim, hidden_dim1], stddev=0.1))
b1 = tf.Variable(tf.zeros([hidden_dim1]))
W2 = tf.Variable(tf.random.normal([hidden_dim1, hidden_dim2], stddev=0.1))
b2 = tf.Variable(tf.zeros([hidden_dim2]))
W3 = tf.Variable(tf.random.normal([hidden_dim2, output_dim], stddev=0.1))
b3 = tf.Variable(tf.zeros([output_dim]))

# Define forward pass
def model(x):
    hidden_layer1 = tf.sigmoid(tf.matmul(x, W1) + b1)  # First Hidden Layer
    hidden_layer2 = tf.sigmoid(tf.matmul(hidden_layer1, W2) + b2)  # Second Hidden Layer
    logits = tf.matmul(hidden_layer2, W3) + b3  # Output layer (logits)
    return logits

# Loss function (Softmax Cross-Entropy)
def compute_loss(logits, labels):
    return tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=labels))

# Accuracy function
def compute_accuracy(dataset):
    correct_preds, total_samples = 0, 0
    for images, labels in dataset:
        logits = model(images)
        correct_preds += tf.reduce_sum(tf.cast(tf.equal(tf.argmax(logits, axis=1), tf.argmax(labels, axis=1)), tf.float32)).numpy()
        total_samples += images.shape[0]
    return correct_preds / total_samples

# Optimizer (SGD)
optimizer = tf.optimizers.SGD(learning_rate=0.1)


# Training step function
def train_step(images, labels):
    with tf.GradientTape() as tape:
        logits = model(images)
        loss = compute_loss(logits, labels)
    gradients = tape.gradient(loss, [W1, b1, W2, b2, W3, b3])
    optimizer.apply_gradients(zip(gradients, [W1, b1, W2, b2, W3, b3]))
    return loss

# Training loop
epochs = 100
for epoch in range(epochs):
    total_loss = 0.0  # Running sum of per-batch mean losses for this epoch
    for images, labels in train_dataset:
        loss = train_step(images, labels)
        total_loss += loss.numpy()
    
    print(f"Epoch {epoch+1}, Loss: {total_loss:.4f}")

# Compute final training and test accuracy
train_accuracy = compute_accuracy(train_dataset)
test_accuracy = compute_accuracy(test_dataset)

print(f"Final Training Accuracy (sgd): {train_accuracy:.4f}")
print(f"Final Test Accuracy (sgd): {test_accuracy:.4f}")


Epoch 1, Loss: 1279.7699
Epoch 2, Loss: 593.3142
Epoch 3, Loss: 508.1269
Epoch 4, Loss: 447.6935
Epoch 5, Loss: 396.6103
Epoch 6, Loss: 352.4718
Epoch 7, Loss: 315.1519
Epoch 8, Loss: 284.7723
Epoch 9, Loss: 258.7495
Epoch 10, Loss: 236.2349
Epoch 11, Loss: 217.0366
Epoch 12, Loss: 199.7343
Epoch 13, Loss: 185.6813
Epoch 14, Loss: 171.9853
Epoch 15, Loss: 161.2843
Epoch 16, Loss: 149.1017
Epoch 17, Loss: 139.6993
Epoch 18, Loss: 130.0861
Epoch 19, Loss: 122.5064
Epoch 20, Loss: 114.1880
Epoch 21, Loss: 107.6347
Epoch 22, Loss: 100.6482
Epoch 23, Loss: 95.1949
Epoch 24, Loss: 89.5251
Epoch 25, Loss: 83.8410
Epoch 26, Loss: 78.9405
Epoch 27, Loss: 74.4654
Epoch 28, Loss: 69.9460
Epoch 29, Loss: 65.4590
Epoch 30, Loss: 61.7510
Epoch 31, Loss: 58.0726
Epoch 32, Loss: 54.0214
Epoch 33, Loss: 51.3869
Epoch 34, Loss: 47.9156
Epoch 35, Loss: 45.1794
Epoch 36, Loss: 42.3969
Epoch 37, Loss: 40.1871
Epoch 38, Loss: 37.8948
Epoch 39, Loss: 35.4205
Epoch 40, Loss: 33.4166
Epoch 41, Loss: 30.9513
Ep
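
The loss printed for each epoch is the sum of the per-batch mean losses, so its magnitude depends on the number of batches. The sketch below converts a few of the logged totals into an average loss per batch. It assumes the standard 60,000-image MNIST training split, which the notebook does not state explicitly.

```python
import math

TRAIN_SAMPLES = 60_000  # assumption: standard MNIST training split size
BATCH_SIZE = 32         # from the hyperparameters

# Number of batches (optimizer steps) per epoch; ceil covers a partial last batch.
steps_per_epoch = math.ceil(TRAIN_SAMPLES / BATCH_SIZE)

# Totals copied from the training log above.
logged_totals = {1: 1279.7699, 10: 236.2349, 41: 30.9513}

# Average loss per batch for each selected epoch.
mean_losses = {epoch: total / steps_per_epoch for epoch, total in logged_totals.items()}

for epoch, mean_loss in mean_losses.items():
    print(f"Epoch {epoch}: mean batch loss = {mean_loss:.4f}")
```

Reporting the mean rather than the sum makes runs with different batch sizes easier to compare.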

**CODE DESCRIPTION:**
1. *Dataset Loading & Preprocessing:*
   - The MNIST dataset is loaded using tensorflow_datasets.
   - Images are normalized (pixel values scaled to the range 0 to 1) for better training stability, then flattened from 28×28 into 784-element vectors.
   - Labels are one-hot encoded into 10-element vectors.
   - The training set is shuffled (buffer size 10,000), and both the training and test sets are batched in groups of 32.
2. *Model Definition:*
   - Weights (W) are initialized from a normal distribution with standard deviation 0.1; biases (b) are initialized to zero.
   - First hidden layer: applies sigmoid activation to input · W1 + b1.
   - Second hidden layer: applies sigmoid activation to hidden_layer1 · W2 + b2.
   - The output layer returns logits (raw scores). Softmax is applied inside the loss function; for prediction, taking the argmax of the logits is sufficient.
3. *Loss Function:*
   - The loss function is Softmax Cross-Entropy, which measures how far the predicted probability distribution is from the actual label distribution. It is averaged over the examples in each batch.
4. *Accuracy Computation:*
   - Converts logits into class predictions using tf.argmax.
   - Compares them with the actual labels and calculates the fraction of correct predictions.
5. *Optimization (SGD):*
   - Uses Stochastic Gradient Descent (SGD) with a learning rate of 0.1.
   - Gradients of the loss with respect to all weights and biases are computed with tf.GradientTape (back-propagation), and the optimizer uses them to update the parameters.
6. *Training Loop:*
   - Runs for 100 epochs, where each epoch loops through the entire training dataset.
   - Computes the total loss for each epoch (the sum of the per-batch mean losses) and prints it.
   - At the end of training, accuracy is computed for both the training and test sets.
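
To make the loss and back-propagation steps described above more concrete, here is a minimal pure-Python sketch (no TensorFlow) of softmax cross-entropy for a single example, together with its gradient with respect to the logits, softmax(z) - y. This gradient is the starting point that automatic differentiation propagates backward from the output layer. The function names and sample values are illustrative assumptions and are not taken from the original code.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    shift = max(logits)
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, one_hot):
    """Softmax cross-entropy for one example with a one-hot label."""
    probs = softmax(logits)
    return -sum(y * math.log(p) for y, p in zip(one_hot, probs))

def grad_wrt_logits(logits, one_hot):
    """Analytic gradient of the loss with respect to the logits: softmax(z) - y."""
    return [p - y for p, y in zip(softmax(logits), one_hot)]

# Illustrative 3-class example (values are made up).
logits = [2.0, 1.0, 0.1]
label = [1.0, 0.0, 0.0]

loss = cross_entropy(logits, label)
analytic = grad_wrt_logits(logits, label)

# Check the analytic gradient against central finite differences.
eps = 1e-6
numeric = []
for i in range(len(logits)):
    plus = list(logits)
    minus = list(logits)
    plus[i] += eps
    minus[i] -= eps
    numeric.append((cross_entropy(plus, label) - cross_entropy(minus, label)) / (2 * eps))

# One gradient-descent step on the logits, using the learning rate from the report.
learning_rate = 0.1
updated = [z - learning_rate * g for z, g in zip(logits, analytic)]

print(f"loss before: {loss:.4f}, loss after one step: {cross_entropy(updated, label):.4f}")
```

In the actual network the update is applied to the weights and biases rather than to the logits; the step on the logits here only illustrates that moving against the gradient lowers the loss.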

**Training Accuracy:** 1.0000
**Test Accuracy:** 0.9811
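
As a rough illustration of what these numbers mean, the sketch below shows an argmax-based accuracy calculation on a tiny made-up batch and converts the reported test accuracy into an approximate error count. It assumes the standard 10,000-image MNIST test split, which the notebook does not state explicitly; the helper `accuracy` is hypothetical.

```python
def accuracy(logits_batch, one_hot_batch):
    """Fraction of examples whose argmax prediction matches the label (hypothetical helper)."""
    correct = 0
    for logits, one_hot in zip(logits_batch, one_hot_batch):
        predicted = max(range(len(logits)), key=logits.__getitem__)
        actual = max(range(len(one_hot)), key=one_hot.__getitem__)
        correct += int(predicted == actual)
    return correct / len(logits_batch)

# Made-up logits and one-hot labels for four examples and three classes.
toy_logits = [[0.1, 2.0, 0.3], [1.5, 0.2, 0.1], [0.2, 0.1, 0.05], [0.0, 0.3, 0.9]]
toy_labels = [[0, 1, 0], [1, 0, 0], [0, 0, 1], [0, 0, 1]]
toy_accuracy = accuracy(toy_logits, toy_labels)

# Reported results from the run above.
train_accuracy = 1.0000
test_accuracy = 0.9811
TEST_SAMPLES = 10_000  # assumption: standard MNIST test split size

test_errors = round((1 - test_accuracy) * TEST_SAMPLES)
generalization_gap = train_accuracy - test_accuracy

print(f"Toy accuracy: {toy_accuracy:.2f}")
print(f"Approximate test errors: {test_errors}")
print(f"Train/test gap: {generalization_gap:.4f}")
```

The gap between training and test accuracy suggests the network fits the training set more closely than unseen data, which is common after many epochs without regularization.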

**MY COMMENTS:**
- SGD with momentum gives better accuracy than SGD without momentum.
- Adam is faster than SGD.
- The ReLU activation function is a better option than sigmoid.
- Plain SGD takes more time to train than both SGD with momentum and Adam.
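
The difference between plain SGD and SGD with momentum can be illustrated with a small sketch of the two update rules. This example uses a toy two-dimensional quadratic, not MNIST, so it does not reproduce the comparisons above; the function names, the momentum value of 0.9, and the toy loss are all assumptions chosen for illustration.

```python
def loss(w):
    """Toy quadratic with very different curvature along each axis."""
    return 0.5 * (1.0 * w[0] ** 2 + 0.01 * w[1] ** 2)

def grad(w):
    """Gradient of the toy quadratic."""
    return [1.0 * w[0], 0.01 * w[1]]

def run_sgd(steps, learning_rate, momentum=0.0):
    """Run gradient descent with an optional heavy-ball momentum term."""
    w = [1.0, 1.0]
    velocity = [0.0, 0.0]
    for _ in range(steps):
        g = grad(w)
        # With momentum=0.0 this reduces to the plain update w <- w - lr * g.
        velocity = [momentum * v - learning_rate * gi for v, gi in zip(velocity, g)]
        w = [wi + v for wi, v in zip(w, velocity)]
    return loss(w)

initial_loss = loss([1.0, 1.0])
plain_loss = run_sgd(steps=100, learning_rate=0.1)
momentum_loss = run_sgd(steps=100, learning_rate=0.1, momentum=0.9)

print(f"initial loss:       {initial_loss:.6f}")
print(f"plain SGD loss:     {plain_loss:.6f}")
print(f"with momentum loss: {momentum_loss:.6f}")
```

On this toy problem the momentum run reaches a lower loss in the same number of steps because the accumulated velocity speeds up progress along the low-curvature direction. Whether the same holds for a given network depends on the learning rate and the data.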