**OBJECTIVE:**
Write a program to implement a three-layer neural network using the TensorFlow library (only, no Keras) to classify the MNIST handwritten digits dataset. Demonstrate the implementation of the feed-forward and back-propagation approaches.

**Model Description:**
*Architecture:*
  - Input Layer (784 neurons) → Takes flattened 28×28 grayscale images.
  - Hidden Layer 1 (128 neurons) → Sigmoid activation.
  - Hidden Layer 2 (64 neurons) → Sigmoid activation.
  - Output Layer (10 neurons) → Produces logits, passed to Softmax for classification.
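
To make the size of this architecture concrete, the short sketch below counts the trainable parameters implied by the layer widths listed above. The helper name `count_parameters` is hypothetical and only serves this illustration.

```python
# Layer widths taken from the architecture description: input, two hidden layers, output.
layer_sizes = [784, 128, 64, 10]

def count_parameters(sizes):
    """Return a (weights, biases) pair for each dense layer between consecutive sizes."""
    counts = []
    for fan_in, fan_out in zip(sizes[:-1], sizes[1:]):
        counts.append((fan_in * fan_out, fan_out))
    return counts

per_layer = count_parameters(layer_sizes)
total = sum(weights + biases for weights, biases in per_layer)

for index, (weights, biases) in enumerate(per_layer, start=1):
    print(f"Layer {index}: {weights} weights + {biases} biases")
print(f"Total trainable parameters: {total}")  # 109386
```

Most of the parameters (100,352 of 109,386) sit in the first weight matrix, which connects the 784 input pixels to the 128 units of the first hidden layer.
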
*Hyperparameters:*
  - Loss Function: Softmax Cross-Entropy
  - Optimizer: Stochastic Gradient Descent (SGD) with Momentum (0.9)
  - Learning Rate: 0.1
  - Batch Size: 32
  - Epochs: 10
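
As an illustration of what the momentum and learning-rate settings do, here is a minimal pure-Python sketch of a momentum update of the form `velocity = momentum * velocity - learning_rate * gradient`, which is, as far as I know, the rule TensorFlow's SGD optimizer documents for the non-Nesterov case. The function `sgd_momentum_step` and the toy objective f(w) = w² are assumptions made for this example, not part of the program below.

```python
def sgd_momentum_step(w, velocity, grad, learning_rate=0.1, momentum=0.9):
    """Apply one SGD-with-momentum update to a single scalar parameter."""
    velocity = momentum * velocity - learning_rate * grad
    w = w + velocity
    return w, velocity

# Toy problem (assumption for illustration): minimize f(w) = w**2, whose gradient is 2 * w.
w, velocity = 1.0, 0.0
history = []
for step in range(200):
    w, velocity = sgd_momentum_step(w, velocity, grad=2.0 * w)
    history.append(w)

print(f"After 1 step:    w = {history[0]:.4f}")
print(f"After 200 steps: w = {history[-1]:.2e}")
```

The velocity term carries over a fraction (0.9) of the previous update, so consecutive gradients that point the same way build up speed, while gradients that flip sign partly cancel.
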


import tensorflow as tf
import tensorflow_datasets as tfds

# Load and preprocess MNIST dataset
dataset, info = tfds.load("mnist", as_supervised=True, with_info=True)
train_dataset, test_dataset = dataset["train"], dataset["test"]

# Define batch size
BATCH_SIZE = 32

def preprocess(image, label):
    image = tf.cast(image, tf.float32) / 255.0  # Normalize (0 to 1)
    image = tf.reshape(image, [-1])  # Flatten (28x28 → 784)
    label = tf.one_hot(label, depth=10)  # One-hot encode labels
    return image, label

train_dataset = train_dataset.map(preprocess).shuffle(10000).batch(BATCH_SIZE)
test_dataset = test_dataset.map(preprocess).batch(BATCH_SIZE)

# Define model parameters
input_dim = 784
hidden_dim1 = 128
hidden_dim2 = 64
output_dim = 10

# Initialize weights and biases
W1 = tf.Variable(tf.random.normal([input_dim, hidden_dim1], stddev=0.1))
b1 = tf.Variable(tf.zeros([hidden_dim1]))
W2 = tf.Variable(tf.random.normal([hidden_dim1, hidden_dim2], stddev=0.1))
b2 = tf.Variable(tf.zeros([hidden_dim2]))
W3 = tf.Variable(tf.random.normal([hidden_dim2, output_dim], stddev=0.1))
b3 = tf.Variable(tf.zeros([output_dim]))

# Define forward pass
def model(x):
    hidden_layer1 = tf.sigmoid(tf.matmul(x, W1) + b1)  # First Hidden Layer
    hidden_layer2 = tf.sigmoid(tf.matmul(hidden_layer1, W2) + b2)  # Second Hidden Layer
    logits = tf.matmul(hidden_layer2, W3) + b3  # Output layer (logits)
    return logits

# Loss function (Softmax Cross-Entropy)
def compute_loss(logits, labels):
    return tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=labels))

# Accuracy function
def compute_accuracy(dataset):
    correct_preds, total_samples = 0, 0
    for images, labels in dataset:
        logits = model(images)
        correct_preds += tf.reduce_sum(tf.cast(tf.equal(tf.argmax(logits, axis=1), tf.argmax(labels, axis=1)), tf.float32)).numpy()
        total_samples += images.shape[0]
    return correct_preds / total_samples

# Optimizer (SGD with Momentum)
optimizer = tf.optimizers.SGD(learning_rate=0.1, momentum=0.9)

# Training step function
def train_step(images, labels):
    with tf.GradientTape() as tape:
        logits = model(images)
        loss = compute_loss(logits, labels)
    gradients = tape.gradient(loss, [W1, b1, W2, b2, W3, b3])
    optimizer.apply_gradients(zip(gradients, [W1, b1, W2, b2, W3, b3]))
    return loss

# Training loop
epochs = 10
for epoch in range(epochs):
    total_loss = 0.0  # Sum of the per-batch mean losses over the epoch (not an average)
    for images, labels in train_dataset:
        loss = train_step(images, labels)
        total_loss += loss.numpy()

    print(f"Epoch {epoch+1}, Loss: {total_loss:.4f}")

# Compute final training and test accuracy
train_accuracy = compute_accuracy(train_dataset)
test_accuracy = compute_accuracy(test_dataset)

print(f"Final Training Accuracy (SGD): {train_accuracy:.4f}")
print(f"Final Test Accuracy (SGD): {test_accuracy:.4f}")


Epoch 1, Loss: 671.7911
Epoch 2, Loss: 248.4321
Epoch 3, Loss: 169.9272
Epoch 4, Loss: 133.2994
Epoch 5, Loss: 99.0014
Epoch 6, Loss: 80.7056
Epoch 7, Loss: 63.2901
Epoch 8, Loss: 50.5881
Epoch 9, Loss: 39.6281
Epoch 10, Loss: 35.4824
Final Training Accuracy (SGD): 0.9970
Final Test Accuracy (SGD): 0.9801
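
The loss values printed above are sums of the per-batch mean losses, because the training loop adds `loss.numpy()` for every batch. The sketch below converts them to a mean loss per batch, assuming the standard MNIST training split of 60,000 images (an assumption; the program does not print the split size).

```python
import math

TRAIN_SAMPLES = 60_000  # Assumption: size of the standard MNIST training split
BATCH_SIZE = 32         # Batch size used in the program

# Summed losses copied from the printed training output.
summed_losses = [671.7911, 248.4321, 169.9272, 133.2994, 99.0014,
                 80.7056, 63.2901, 50.5881, 39.6281, 35.4824]

num_batches = math.ceil(TRAIN_SAMPLES / BATCH_SIZE)  # 1875 batches per epoch
mean_losses = [total / num_batches for total in summed_losses]

for epoch, mean_loss in enumerate(mean_losses, start=1):
    print(f"Epoch {epoch}, mean batch loss: {mean_loss:.4f}")
```

Under that assumption, the mean batch loss falls from about 0.358 in epoch 1 to about 0.019 in epoch 10.
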


**CODE DESCRIPTION:**
1. *Dataset Loading & Preprocessing:*
   - The MNIST dataset is loaded using tensorflow_datasets.
   - Images are cast to float32, normalized to the range [0, 1] for better training stability, and flattened from 28×28 to 784-element vectors.
   - Labels are one-hot encoded into 10-element vectors.
   - The training set is shuffled (buffer size 10,000), and both the training and test sets are batched in groups of 32.
2. *Model Definition:*
   - Weights (W) for each layer are initialized from a normal distribution with standard deviation 0.1; biases (b) are initialized to zero.
   - Each hidden layer performs a matrix multiplication, adds the bias, and applies the sigmoid activation function.
   - The final output is logits (raw scores); softmax is applied inside the loss function to convert them into probabilities.
3. *Loss Function:*
   - The loss function is Softmax Cross-Entropy, averaged over the batch, which measures how far the predicted probability distribution is from the actual label distribution.
4. *Accuracy Computation:*
   - Accuracy is the fraction of samples whose predicted class (the argmax of the logits) matches the actual label.
5. *Optimization (SGD with Momentum):*
   - SGD (Stochastic Gradient Descent) with Momentum (0.9) and a learning rate of 0.1 is used to update the weights and biases.
   - The gradients are obtained with tf.GradientTape, which back-propagates the loss through the forward pass.
   - Momentum accumulates past gradients, which typically speeds up convergence and can help the optimizer move through flat regions and shallow local minima.
6. *Training Loop:*
   - The model is trained for 10 epochs.
   - In each epoch, every training sample is passed through the model once in mini-batches, and the weights are updated after each batch using the gradients of the loss.
   - The loss printed for each epoch is the sum of the per-batch losses.
   - At the end of training, accuracy is computed for both the training and test sets.
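
The program leaves the back-propagation step to tf.GradientTape. To show what that step computes, here is a minimal NumPy sketch (NumPy is swapped in for TensorFlow purely for illustration) of the same 784-128-64-10 forward pass with hand-derived gradients, checked against finite differences. All function names are hypothetical, and the inputs are random values rather than MNIST images.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    # Subtract the row-wise maximum for numerical stability.
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def forward(x, params):
    W1, b1, W2, b2, W3, b3 = params
    h1 = sigmoid(x @ W1 + b1)
    h2 = sigmoid(h1 @ W2 + b2)
    logits = h2 @ W3 + b3
    return h1, h2, logits

def mean_cross_entropy(x, y, params):
    _, _, logits = forward(x, params)
    return -np.mean(np.sum(y * np.log(softmax(logits)), axis=1))

def backward(x, y, params):
    """Hand-derived gradients of the mean softmax cross-entropy loss."""
    W1, b1, W2, b2, W3, b3 = params
    h1, h2, logits = forward(x, params)
    d_logits = (softmax(logits) - y) / x.shape[0]   # dLoss/dlogits
    d_z2 = (d_logits @ W3.T) * h2 * (1.0 - h2)      # through the second sigmoid
    d_z1 = (d_z2 @ W2.T) * h1 * (1.0 - h1)          # through the first sigmoid
    return [x.T @ d_z1, d_z1.sum(axis=0),
            h1.T @ d_z2, d_z2.sum(axis=0),
            h2.T @ d_logits, d_logits.sum(axis=0)]

def numeric_gradient(x, y, params, index, position, eps=1e-5):
    """Central finite difference for a single parameter entry."""
    original = params[index][position]
    params[index][position] = original + eps
    loss_plus = mean_cross_entropy(x, y, params)
    params[index][position] = original - eps
    loss_minus = mean_cross_entropy(x, y, params)
    params[index][position] = original
    return (loss_plus - loss_minus) / (2.0 * eps)

rng = np.random.default_rng(0)
sizes = [784, 128, 64, 10]
params = []
for fan_in, fan_out in zip(sizes[:-1], sizes[1:]):
    params.append(rng.normal(0.0, 0.1, size=(fan_in, fan_out)))  # weights, stddev 0.1
    params.append(np.zeros(fan_out))                             # biases start at zero

x = rng.random((8, 784))                      # random stand-in for a batch of 8 images
y = np.eye(10)[rng.integers(0, 10, size=8)]   # random one-hot labels

grads = backward(x, y, params)
checks = [(0, (5, 3)), (2, (10, 20)), (4, (7, 2)), (5, (4,))]
max_error = max(abs(numeric_gradient(x, y, params, i, pos) - grads[i][pos])
                for i, pos in checks)
print(f"Largest gap between analytic and numeric gradients: {max_error:.2e}")
```

Each `d_z` line applies the chain rule one layer further back: the incoming gradient is multiplied by the transpose of that layer's weights and by the sigmoid derivative `h * (1 - h)`.
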

**Training Accuracy:** 0.9970
**Test Accuracy:** 0.9801

**MY COMMENTS:**
- SGD with momentum provides better accuracy than SGD without momentum.
- Adam converges faster than SGD.
- The ReLU activation function is a better option than Sigmoid.
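
One commonly cited reason for the ReLU comment is that the sigmoid derivative never exceeds 0.25, so the gradient can shrink at every sigmoid layer it passes through, while the ReLU derivative is exactly 1 for positive inputs. The sketch below only illustrates that property; it does not reproduce any experiment from this report, and the helper names are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    """Derivative of the sigmoid: s * (1 - s), which peaks at 0.25 when z = 0."""
    s = sigmoid(z)
    return s * (1.0 - s)

def relu_grad(z):
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise."""
    return 1.0 if z > 0 else 0.0

for z in (-4.0, 0.0, 0.5, 4.0):
    print(f"z={z:+.1f}  sigmoid'={sigmoid_grad(z):.4f}  relu'={relu_grad(z):.1f}")

# Upper bound on the activation-derivative factor picked up across two sigmoid hidden layers.
max_sigmoid_factor = sigmoid_grad(0.0) ** 2
print(f"Largest possible factor through two sigmoid layers: {max_sigmoid_factor}")
```

With two sigmoid hidden layers, the activation derivatives alone scale the gradient reaching the first layer by at most 0.0625, before the weight matrices are taken into account.
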
