### Write a program to implement a three-layer neural network using the TensorFlow library (only, no Keras) to classify the MNIST handwritten digits dataset. Demonstrate the implementation of feed-forward and back-propagation approaches.

### Description of the code

* The TensorFlow library provides the building blocks for the artificial neural network.
* The MNIST dataset is loaded.
* Preprocessing: pixel values are normalized to the range 0 to 1.
* An input layer with 784 neurons (flattened 28x28 images).
* Two hidden layers with 128 and 64 neurons, using the sigmoid activation function.
* An output layer with 10 neurons (one per digit class).
* Epochs: 20. If the model is still improving at the final epoch, try more epochs; if it stopped improving well before the final epoch, try fewer.
* Batch size: 100. The batch size is the number of samples used in one training iteration.
* Optimization: the Adam optimizer (learning rate 0.01) minimizes the loss.
* Loss function: softmax cross-entropy.
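To get a feel for the size of the architecture listed above, the number of trainable parameters can be counted directly from the layer sizes. This is a small sketch in plain Python; the helper name `count_parameters` is only illustrative and does not appear in the program.

```python
# Layer sizes taken from the description above: input, two hidden layers, output.
layer_sizes = [784, 128, 64, 10]


def count_parameters(sizes):
    """Return the number of weights plus biases in a fully connected network."""
    total = 0
    for fan_in, fan_out in zip(sizes[:-1], sizes[1:]):
        total += fan_in * fan_out  # entries of the weight matrix
        total += fan_out           # entries of the bias vector
    return total


total_params = count_parameters(layer_sizes)
print(total_params)  # 109386
```

Most of the parameters (784 x 128 + 128 = 100,480) sit in the first hidden layer.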

import tensorflow as tf
import numpy as np
# Disable eager execution through the public compat API so the code below
# builds and runs a TensorFlow 1.x-style graph
tf.compat.v1.disable_eager_execution()

# Load MNIST dataset
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train, x_test = x_train.reshape(-1, 784) / 255.0, x_test.reshape(-1, 784) / 255.0

# Convert labels to one-hot encoding
y_train = np.eye(10)[y_train]
y_test = np.eye(10)[y_test]

# Define model hyperparameters
input_size = 784
hidden1_size = 128
hidden2_size = 64
output_size = 10
learning_rate = 0.01
batch_size = 100
epochs = 20

# Define placeholders for input and output
X = tf.compat.v1.placeholder(tf.float32, [None, input_size])
y = tf.compat.v1.placeholder(tf.float32, [None, output_size])

# Initialize weights and biases
weights = {
    'w1': tf.Variable(tf.random.truncated_normal([input_size, hidden1_size], stddev=0.1)),
    'w2': tf.Variable(tf.random.truncated_normal([hidden1_size, hidden2_size], stddev=0.1)),
    'w3': tf.Variable(tf.random.truncated_normal([hidden2_size, output_size], stddev=0.1))
}

biases = {
    'b1': tf.Variable(tf.zeros([hidden1_size])),
    'b2': tf.Variable(tf.zeros([hidden2_size])),
    'b3': tf.Variable(tf.zeros([output_size]))
}

# Define feed-forward neural network
def neural_network(X):
    layer1 = tf.nn.sigmoid(tf.matmul(X, weights['w1']) + biases['b1'])
    layer2 = tf.nn.sigmoid(tf.matmul(layer1, weights['w2']) + biases['b2'])
    output_layer = tf.matmul(layer2, weights['w3']) + biases['b3']
    return output_layer

# Compute logits
logits = neural_network(X)

# Define loss function (cross-entropy)
loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=y, logits=logits))

# Define optimizer
optimizer = tf.compat.v1.train.AdamOptimizer(learning_rate=learning_rate).minimize(loss)

# Define accuracy metric
correct_pred = tf.equal(tf.argmax(logits, 1), tf.argmax(y, 1))
accuracy = tf.reduce_mean(tf.cast(correct_pred, tf.float32))

# Run the graph in a session (eager execution is already disabled at the top)
with tf.compat.v1.Session() as sess:
    sess.run(tf.compat.v1.global_variables_initializer())
    
    # Training loop
    for epoch in range(epochs):
        for i in range(0, len(x_train), batch_size):
            batch_x, batch_y = x_train[i:i+batch_size], y_train[i:i+batch_size]
            sess.run(optimizer, feed_dict={X: batch_x, y: batch_y})
        
        # Calculate and display loss and accuracy at each epoch
        train_loss, train_acc = sess.run([loss, accuracy], feed_dict={X: x_train, y: y_train})
        test_acc = sess.run(accuracy, feed_dict={X: x_test, y: y_test})
        print(f"Epoch {epoch+1}, Loss: {train_loss:.4f}, Train Accuracy: {train_acc*100:.2f}, Test Accuracy: {test_acc*100:.2f}")
    
    # Compute final train and test accuracy
    final_train_acc = sess.run(accuracy, feed_dict={X: x_train, y: y_train})
    final_test_acc = sess.run(accuracy, feed_dict={X: x_test, y: y_test})
    print(f"Final Train Accuracy: {final_train_acc*100:.2f}")
    print(f"Final Test Accuracy: {final_test_acc*100:.2f}")
    
    print("Training Complete!")



Epoch 1, Loss: 0.1419, Train Accuracy: 95.72, Test Accuracy: 95.16
Epoch 2, Loss: 0.0987, Train Accuracy: 96.92, Test Accuracy: 96.10
Epoch 3, Loss: 0.0815, Train Accuracy: 97.38, Test Accuracy: 96.27
Epoch 4, Loss: 0.0734, Train Accuracy: 97.62, Test Accuracy: 96.44
Epoch 5, Loss: 0.0547, Train Accuracy: 98.25, Test Accuracy: 96.91
Epoch 6, Loss: 0.0694, Train Accuracy: 97.72, Test Accuracy: 96.59
Epoch 7, Loss: 0.0646, Train Accuracy: 97.91, Test Accuracy: 96.49
Epoch 8, Loss: 0.0411, Train Accuracy: 98.66, Test Accuracy: 97.03
Epoch 9, Loss: 0.0504, Train Accuracy: 98.32, Test Accuracy: 96.84
Epoch 10, Loss: 0.0402, Train Accuracy: 98.68, Test Accuracy: 97.08
Epoch 11, Loss: 0.0355, Train Accuracy: 98.78, Test Accuracy: 97.21
Epoch 12, Loss: 0.0326, Train Accuracy: 98.93, Test Accuracy: 97.09
Epoch 13, Loss: 0.0320, Train Accuracy: 98.92, Test Accuracy: 97.07
Epoch 14, Loss: 0.0304, Train Accuracy: 98.99, Test Accuracy: 97.24
Epoch 15, Loss: 0.0207, Train Accuracy: 99.34, Test Accur

### Description of the code

* Imported NumPy and TensorFlow.
* Disabled eager execution (the TensorFlow 2.x default) so that the code runs in TensorFlow 1.x-style graph mode through the tf.compat.v1 API.

 **Data Preprocessing**:
   - Loads the MNIST dataset.
   - Normalizes pixel values from 0-255 to the range 0 to 1; small, uniformly scaled inputs help training converge.
   - Reshapes each 28x28 image into a 784-element vector and converts the labels to one-hot encoding.
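The preprocessing steps above can be tried on a tiny array without downloading MNIST. The sketch below uses two made-up 28x28 images and two made-up labels (both hypothetical stand-ins for the real data) and applies the same reshape, scaling and `np.eye` one-hot trick as the program.

```python
import numpy as np

# Hypothetical stand-in for two 28x28 grayscale images with uint8 pixels.
images = np.zeros((2, 28, 28), dtype=np.uint8)
images[0, 0, 0] = 255          # one fully bright pixel
labels = np.array([3, 7])      # made-up digit labels

# Flatten each image to 784 values and scale pixels from 0-255 to 0-1.
flat = images.reshape(-1, 784) / 255.0

# Row i of the 10x10 identity matrix is the one-hot vector for digit i.
one_hot = np.eye(10)[labels]

print(flat.shape, flat.max())  # (2, 784) 1.0
print(one_hot[0])              # 1.0 at index 3, zeros elsewhere
```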

 **Network Initialization**:
   - Defines the structure with two hidden layers.
   - Initializes weights and biases.
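      * W1, b1 and W2, b2 are the weights and biases of the two hidden layers.
      * W3 and b3 are the weights and biases of the output layer.
      * Weights are drawn from a truncated normal distribution with standard deviation 0.1, so every initial value lies within two standard deviations of zero (between -0.2 and 0.2).
      * Biases are initialized to zeros.
   - Placeholders define the shape and data type of the inputs (X) and labels (y). They hold no data themselves; the actual batches are supplied through feed_dict when the session runs.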
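The program initializes its weights with `tf.random.truncated_normal(..., stddev=0.1)`, which re-draws any sample that falls more than two standard deviations from the mean. The NumPy function below is only a rough sketch of that behaviour (the name `truncated_normal` and the seed are illustrative, and it is not TensorFlow's implementation); it shows that the resulting weights stay within ±0.2.

```python
import numpy as np


def truncated_normal(shape, stddev=0.1, seed=0):
    """Sample N(0, stddev^2) and re-draw values beyond two standard deviations."""
    rng = np.random.default_rng(seed)
    values = rng.normal(0.0, stddev, size=shape)
    out_of_range = np.abs(values) > 2 * stddev
    while out_of_range.any():
        # Re-draw only the entries that fell outside the allowed range.
        values[out_of_range] = rng.normal(0.0, stddev, size=int(out_of_range.sum()))
        out_of_range = np.abs(values) > 2 * stddev
    return values


w1 = truncated_normal((784, 128))  # same shape as the first weight matrix
b1 = np.zeros(128)                 # biases start at zero, as in the program

print(w1.shape, float(np.abs(w1).max()) <= 0.2)
```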
  
  **Feed-Forward Neural Network**
   - The first and second hidden layers use the sigmoid activation function.
   - The output layer applies no activation and returns raw logits; softmax is applied inside the loss function.
   - tf.matmul performs the matrix multiplication.
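The same feed-forward pass can be written in a few lines of NumPy, which makes the shapes at each layer easy to inspect. This is a sketch with randomly generated weights and a random batch (both hypothetical), not the trained network.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def forward(x, params):
    """Two sigmoid hidden layers followed by a linear output layer (logits)."""
    w1, b1, w2, b2, w3, b3 = params
    h1 = sigmoid(x @ w1 + b1)   # shape: (batch, 128)
    h2 = sigmoid(h1 @ w2 + b2)  # shape: (batch, 64)
    return h2 @ w3 + b3         # shape: (batch, 10), no activation


rng = np.random.default_rng(0)
params = (
    rng.normal(0, 0.1, (784, 128)), np.zeros(128),
    rng.normal(0, 0.1, (128, 64)), np.zeros(64),
    rng.normal(0, 0.1, (64, 10)), np.zeros(10),
)
batch = rng.random((5, 784))    # five made-up flattened images
logits = forward(batch, params)
print(logits.shape)             # (5, 10)
```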

  **Compute Logits**
   - Passes X through the network to compute the logits (raw class scores before softmax).
   - The logits are passed to the softmax cross-entropy loss function, and the per-example losses are averaged over the batch.
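To make the loss concrete, here is a NumPy sketch of softmax cross-entropy computed directly from logits. It subtracts the row maximum before exponentiating, a common trick for numerical stability; the function name and the example values are illustrative, and this is not TensorFlow's own code.

```python
import numpy as np


def softmax_cross_entropy(logits, labels):
    """Per-example cross-entropy between one-hot labels and softmax(logits)."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # avoid overflow in exp
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -(labels * log_probs).sum(axis=1)


logits = np.array([[2.0, 1.0, 0.1],
                   [0.0, 0.0, 0.0]])
labels = np.array([[1.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0]])

per_example = softmax_cross_entropy(logits, labels)
mean_loss = per_example.mean()  # corresponds to the reduce_mean in the program
print(per_example, mean_loss)
```

For the second row all three logits are equal, so the softmax assigns probability 1/3 to each class and the loss is ln 3.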

  **Backpropagation & Optimization**
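   - Uses the Adam optimizer to minimize the cross-entropy loss.
   - Calling minimize() adds the gradient computation (back-propagation) to the graph, together with the operations that use those gradients to update the weights and biases.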
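TensorFlow derives the gradients automatically, so the back-propagation step is not visible in the program. As an illustration only, the sketch below computes the gradients of the output layer by hand in NumPy, checks one of them against a finite-difference estimate, and applies a single plain gradient-descent step. Note the assumptions: the hidden activations `h` are random stand-ins, the sizes are shrunk to 6 inputs and 3 classes, and plain gradient descent is used in place of Adam.

```python
import numpy as np


def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)


def loss_fn(w, b, h, y):
    """Mean softmax cross-entropy of a single linear output layer."""
    p = softmax(h @ w + b)
    return -np.mean(np.sum(y * np.log(p), axis=1))


def output_layer_grads(w, b, h, y):
    """Back-propagate the mean loss to the output layer's weights and biases."""
    p = softmax(h @ w + b)
    d_logits = (p - y) / len(h)           # gradient of the mean loss w.r.t. logits
    return h.T @ d_logits, d_logits.sum(axis=0)


rng = np.random.default_rng(0)
h = rng.random((4, 6))                    # hypothetical hidden activations
y = np.eye(3)[[0, 2, 1, 2]]               # made-up one-hot labels
w = rng.normal(0, 0.1, (6, 3))
b = np.zeros(3)

grad_w, grad_b = output_layer_grads(w, b, h, y)

# Finite-difference check of one weight gradient.
eps = 1e-6
w_plus, w_minus = w.copy(), w.copy()
w_plus[2, 1] += eps
w_minus[2, 1] -= eps
numeric = (loss_fn(w_plus, b, h, y) - loss_fn(w_minus, b, h, y)) / (2 * eps)

# One plain gradient-descent step (Adam would additionally rescale the update).
lr = 0.1
loss_before = loss_fn(w, b, h, y)
loss_after = loss_fn(w - lr * grad_w, b - lr * grad_b, h, y)
print(abs(numeric - grad_w[2, 1]) < 1e-6, loss_after < loss_before)
```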

  **Accuracy Metric**
   - argmax(logits, 1): Gets the predicted class.
   - argmax(y, 1): Gets the actual class.
   - equal(): Compares predicted vs actual class.
   - cast() and reduce_mean(): Convert the matches to 0/1 values and average them to give the accuracy.
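The four steps above map one-to-one onto NumPy operations. The logits and labels in this sketch are made up for illustration (3 classes, 4 examples).

```python
import numpy as np

logits = np.array([[0.1, 2.0, -1.0],
                   [1.5, 0.2, 0.3],
                   [0.0, 0.1, 3.0],
                   [0.9, 0.8, 0.7]])
labels = np.eye(3)[[1, 0, 2, 1]]      # one-hot labels for classes 1, 0, 2, 1

predicted = logits.argmax(axis=1)     # predicted class per row: [1, 0, 2, 0]
actual = labels.argmax(axis=1)        # actual class per row:    [1, 0, 2, 1]
correct = (predicted == actual).astype(np.float32)
accuracy = correct.mean()
print(accuracy)                       # 0.75
```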

  **Session**
   - Creates a session and runs the global variable initializer, which assigns the initial values to all weights and biases.

  **Training of the model**
   * Inside the session, the model is trained by iterating over the training data in mini-batches.
   * Training runs for 20 epochs with a batch size of 100, so each epoch consists of 600 iterations (60,000 training images / 100). Each iteration performs one feed-forward pass and one back-propagation update on a batch of 100 samples.
   * Prints the training loss and the training and test accuracy after each epoch.
   * Evaluates the final accuracy on both training and test data.
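The batching logic can be isolated into a small generator. The sketch below is an illustration rather than the program's own loop: the name `iterate_minibatches` is hypothetical, the arrays are zero-filled stand-ins with the same number of rows as the MNIST training set, and the optional shuffling is an addition (the program visits the batches in the same order every epoch).

```python
import numpy as np


def iterate_minibatches(x, y, batch_size, rng=None):
    """Yield (x, y) mini-batches; shuffle the sample order if rng is given."""
    indices = np.arange(len(x))
    if rng is not None:
        rng.shuffle(indices)
    for start in range(0, len(x), batch_size):
        batch = indices[start:start + batch_size]
        yield x[batch], y[batch]


# Zero-filled stand-ins: 60,000 rows like MNIST, but only 4 columns to save memory.
x = np.zeros((60000, 4), dtype=np.float32)
y = np.zeros((60000, 10), dtype=np.float32)

rng = np.random.default_rng(0)
steps_per_epoch = sum(1 for _ in iterate_minibatches(x, y, 100, rng))
total_steps = steps_per_epoch * 20    # 20 epochs
print(steps_per_epoch, total_steps)   # 600 12000
```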



### Performance Evaluation 
* The model achieves high accuracy (~99% on training and ~97% on test data).
* The performance is satisfactory for MNIST classification.

### My Comments (Limitations and Improvements)
 * Sigmoid activation is not ideal for deep networks due to the vanishing gradient problem, which slows down learning.
 * This code uses TensorFlow 1.x style graph execution, which requires disabling eager execution.

#### Improvements
 * Use TensorFlow 2.x with the Keras Functional API for a modern implementation.
 * Use the ReLU activation function to speed up training and improve gradient flow.
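The vanishing-gradient remark can be made concrete by comparing the derivatives of the two activations. The NumPy sketch below (helper names are illustrative) shows that the sigmoid derivative never exceeds 0.25, so a gradient passing through two sigmoid layers is scaled by at most 0.0625 from the activations alone, whereas the ReLU derivative is exactly 1 for positive inputs.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)


def relu_grad(z):
    return (np.asarray(z) > 0).astype(float)


z = np.linspace(-6.0, 6.0, 241)
peak = sigmoid_grad(0.0)             # the maximum of the sigmoid derivative
two_layer_scale = peak ** 2          # best case through two sigmoid layers

print(peak, two_layer_scale)         # 0.25 0.0625
print(relu_grad([-1.0, 2.0]))        # [0. 1.]
```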