**OBJECTIVE**:
Write a program to evaluate the performance of the implemented three-layer neural network with variations in activation function, hidden layer size, learning rate, batch size, and number of epochs.

**Description of the Model:**
This is a basic neural network with one hidden layer, meaning it has:
 1. Input Layer: Takes the flattened image (28×28 pixels = 784 values).
 2. Hidden Layer: Contains 256 neurons that learn patterns from the images.
      -Uses ReLU activation (a non-linearity that lets the model learn complex patterns).
 3. Output Layer: Contains 10 neurons, one for each digit (0-9).
      -Outputs raw logits; softmax is applied inside the loss function to turn them into probabilities for each digit.
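
The layer sizes described above can be checked with a small shape walk-through. This is only a sketch: it uses NumPy instead of TensorFlow so it runs without a GPU setup, and the batch of 4 random "images" and the helper names `forward` and `softmax` are assumptions made for illustration, not part of the notebook.

```python
import numpy as np

rng = np.random.default_rng(42)

# Same layer sizes as the notebook: 784 inputs -> 256 hidden units -> 10 outputs
n_input, n_hidden, n_output = 784, 256, 10
W1 = rng.normal(0.0, 0.1, size=(n_input, n_hidden))
b1 = rng.normal(0.0, 0.1, size=n_hidden)
W2 = rng.normal(0.0, 0.1, size=(n_hidden, n_output))
b2 = rng.normal(0.0, 0.1, size=n_output)

def forward(X):
    """Return raw logits for a batch of flattened images (hypothetical helper)."""
    A1 = np.maximum(0.0, X @ W1 + b1)   # hidden layer followed by ReLU
    return A1 @ W2 + b2                 # output layer: logits, no softmax yet

def softmax(logits):
    """Convert each row of logits into probabilities that sum to 1."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

X = rng.random((4, n_input))            # 4 made-up images with pixels in [0, 1)
logits = forward(X)
probs = softmax(logits)
print(logits.shape, probs.sum(axis=1))
```

Each row of `probs` holds ten values, one per digit, and the predicted digit is the index of the largest value.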


*The training process involves*:
1. Calculating loss (cross-entropy loss) to measure how well the model is performing.
2. Using the Adam optimizer to update weights and improve predictions.
3. Repeating training over multiple epochs (full passes over the training set).
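
To make the loss in step 1 concrete, here is a minimal sketch of softmax cross-entropy for a single sample, written in plain Python. The function name `softmax_cross_entropy` and the example logits are assumptions for illustration; the notebook itself relies on `tf.nn.softmax_cross_entropy_with_logits`.

```python
import math

def softmax_cross_entropy(logits, one_hot_label):
    """Cross-entropy between a one-hot label and softmax(logits) for one sample."""
    m = max(logits)
    # log-sum-exp trick: subtract the largest logit before exponentiating
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    log_probs = [z - log_sum_exp for z in logits]
    return -sum(y * lp for y, lp in zip(one_hot_label, log_probs))

label = [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]                        # digit 3
uniform = [0.0] * 10                                          # model has no preference
confident = [0.0, 0.0, 0.0, 8.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]  # strongly favors digit 3

loss_uniform = softmax_cross_entropy(uniform, label)
loss_confident = softmax_cross_entropy(confident, label)
print(loss_uniform, loss_confident)
```

With equal logits for all ten digits the loss is ln(10), roughly 2.30, which is a useful reference point: a training loss near that value means the model is still guessing.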

import tensorflow as tf
import numpy as np
import time
import pandas as pd
from sklearn.metrics import confusion_matrix
import matplotlib.pyplot as plt

# Check if GPU is available
print("Num GPUs Available: ", len(tf.config.list_physical_devices('GPU')))
device_name = tf.test.gpu_device_name()
if device_name != '/device:GPU:0':
    print('GPU not found. Using CPU instead.')

# Load MNIST dataset (converted to tensors)
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train = tf.convert_to_tensor(x_train.reshape(-1, 784).astype(np.float32) / 255.0)
x_test = tf.convert_to_tensor(x_test.reshape(-1, 784).astype(np.float32) / 255.0)

y_train = tf.one_hot(y_train, depth=10, dtype=tf.float32)
y_test_labels = tf.convert_to_tensor(y_test)  # Keep original labels for confusion matrix
y_test = tf.one_hot(y_test, depth=10, dtype=tf.float32)

# Define model parameters using random initialization
n_hidden = 256
tf.random.set_seed(42)

W1 = tf.Variable(tf.random.normal([784, n_hidden], mean=0.0, stddev=0.1, dtype=tf.float32))
b1 = tf.Variable(tf.random.normal([n_hidden], mean=0.0, stddev=0.1, dtype=tf.float32))
W2 = tf.Variable(tf.random.normal([n_hidden, 10], mean=0.0, stddev=0.1, dtype=tf.float32))
b2 = tf.Variable(tf.random.normal([10], mean=0.0, stddev=0.1, dtype=tf.float32))

# Define model using pure tensors
@tf.function
def model(X):
    Z1 = tf.add(tf.matmul(X, W1), b1)
    A1 = tf.nn.relu(Z1)
    Z2 = tf.add(tf.matmul(A1, W2), b2)
    return Z2

# Loss function using tensors
def loss_fn(y_true, y_pred):
    return tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=y_true, logits=y_pred))

# Optimizer
optimizer = tf.optimizers.Adam(learning_rate=0.001)

# Training step using tensors
@tf.function
def train_step(batch_x, batch_y):
    with tf.GradientTape() as tape:
        logits = model(batch_x)
        loss = loss_fn(batch_y, logits)
    grads = tape.gradient(loss, [W1, b1, W2, b2])
    optimizer.apply_gradients(zip(grads, [W1, b1, W2, b2]))
    return loss

# Hyperparameter tuning configurations
configs = [(10, 100), (10, 50), (10, 10), (100, 100), (100, 50), (100, 10), (1, 100), (1, 50), (1, 10)]

results = []
output_file = "mnist_nn_results.csv"

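# Run on the GPU when one is available, otherwise fall back to the CPU
device = '/GPU:0' if tf.config.list_physical_devices('GPU') else '/CPU:0'
with tf.device(device):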
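    for batch_size, epochs in configs:
        # Re-initialize the weights so that each configuration is trained from scratch
        # instead of continuing from the previous run (Adam's moment estimates still carry over)
        for var in (W1, b1, W2, b2):
            var.assign(tf.random.normal(var.shape, mean=0.0, stddev=0.1, dtype=tf.float32))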
        start_time = time.time()
        loss_curve, acc_curve = [], []

        # Shuffle individual samples before batching so that batch contents differ every epoch
        dataset = tf.data.Dataset.from_tensor_slices((x_train, y_train)).shuffle(x_train.shape[0]).batch(batch_size).prefetch(tf.data.AUTOTUNE)

        for epoch in range(epochs):
            for batch_x, batch_y in dataset:
                train_loss = train_step(batch_x, batch_y)

            # Compute train accuracy using pure tensors
            train_logits = model(x_train)
            train_acc = tf.reduce_mean(tf.cast(tf.argmax(train_logits, axis=1) == tf.argmax(y_train, axis=1), tf.float32))
            # Note: train_loss is the loss of the last mini-batch of this epoch, not the epoch average
            loss_curve.append(train_loss.numpy())
            acc_curve.append(train_acc.numpy())

        # Compute test accuracy using pure tensors
        test_logits = model(x_test)
        test_acc = tf.reduce_mean(tf.cast(tf.argmax(test_logits, axis=1) == tf.argmax(y_test, axis=1), tf.float32))
        y_pred = tf.argmax(test_logits, axis=1)
        conf_matrix = confusion_matrix(y_test_labels.numpy(), y_pred.numpy())
        exec_time = time.time() - start_time

        # Store results
        results.append([batch_size, epochs, train_loss.numpy(), train_acc.numpy(), test_acc.numpy(), exec_time])

        # Print confusion matrix
        print(f"Confusion Matrix for Batch Size {batch_size}, Epochs {epochs}:\n", conf_matrix)

        # Plot loss and accuracy curves
        plt.figure(figsize=(12, 5))
        plt.subplot(1, 2, 1)
        plt.plot(range(epochs), loss_curve, label='Loss', color='red')
        plt.xlabel('Epochs')
        plt.ylabel('Loss')
        plt.title(f'Loss Curve (Batch Size {batch_size}, Epochs {epochs})')
        plt.legend()

        plt.subplot(1, 2, 2)
        plt.plot(range(epochs), acc_curve, label='Accuracy', color='blue')
        plt.xlabel('Epochs')
        plt.ylabel('Accuracy')
        plt.title(f'Accuracy Curve (Batch Size {batch_size}, Epochs {epochs})')
        plt.legend()

        plt.show()

        # Print results
        print(f"Batch Size: {batch_size}, Epochs: {epochs}")
        print(f"Test Accuracy: {test_acc.numpy():.4f}")
        print(f"Execution Time: {exec_time:.2f} seconds\n")

    # Save results to file
    df = pd.DataFrame(results, columns=["Batch Size", "Epochs", "Train Loss", "Train Accuracy", "Test Accuracy", "Execution Time"])
    df.to_csv(output_file, index=False)
    print(f"Results saved to {output_file}")


**Description of code:**
1. Loading the MNIST Dataset
   11. The MNIST dataset contains 60,000 training images and 10,000 test images of handwritten digits.
   12. Images are grayscale (0-255 pixel values), so they are normalized (converted to values between 0 and 1).
   13. Labels (digits 0-9) are converted into one-hot encoded format (e.g., digit 3 → [0,0,0,1,0,0,0,0,0,0]).
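
The preprocessing steps above can be tried on a tiny made-up batch. This sketch uses NumPy rather than the `tf.one_hot` call from the notebook, and the two synthetic "images" and their labels are assumptions for illustration only.

```python
import numpy as np

# A made-up batch of two 28x28 "images" with integer pixel values in 0..255
images = np.array([np.zeros((28, 28)), np.full((28, 28), 255)], dtype=np.uint8)
labels = np.array([3, 7])

# Flatten each image to 784 values and scale the pixels to the range [0, 1]
x = images.reshape(-1, 784).astype(np.float32) / 255.0

# One-hot encode: row i has a 1 in column labels[i] and 0 everywhere else
y = np.eye(10, dtype=np.float32)[labels]

print(x.shape, x.min(), x.max())
print(y[0])
```

Indexing an identity matrix with the label array is one common NumPy idiom for one-hot encoding; the first row comes out as the digit 3 pattern shown in item 13.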

2. Defining the Neural Network
   21. Weights and biases are created using tf.random.normal(), which gives random initial values.
   22. The model is implemented manually (without using Keras layers).
   23. Forward propagation is done using matrix multiplication (tf.matmul()) and activation functions (tf.nn.relu()).
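
As a quick sanity check on the size of this network, the number of trainable values in `W1`, `b1`, `W2`, and `b2` can be counted by hand. The helper name `dense_param_count` is hypothetical and used only for this sketch.

```python
def dense_param_count(n_in, n_out):
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out

hidden = dense_param_count(784, 256)   # W1 (784 x 256) and b1 (256)
output = dense_param_count(256, 10)    # W2 (256 x 10) and b2 (10)
total = hidden + output
print(hidden, output, total)           # prints 200960 2570 203530
```

Almost all of the parameters sit in the first layer, so changing the hidden layer size is the setting that changes the model size the most.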

3. Training Process
   31. The model is trained in mini-batches (a small set of images is processed at a time).
   32. For each batch:
       -->Predictions are made using forward propagation.
       -->Loss is calculated (how far the predictions are from the actual labels).
       -->Gradients of the loss are computed using tf.GradientTape().
       -->Weights and biases are updated from those gradients using the Adam optimizer.
   33. Training repeats for multiple epochs (full passes through the dataset).
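
The weight update performed by Adam can be sketched for a single scalar parameter. This is the textbook form of the update, assuming the default hyperparameters of the Keras Adam optimizer; TensorFlow's actual implementation may differ in minor details, and the function name `adam_step` and the toy objective are made up for illustration.

```python
import math

def adam_step(param, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-7):
    """One Adam update for a single scalar parameter; returns (param, m, v)."""
    m = beta1 * m + (1 - beta1) * grad            # running mean of gradients
    v = beta2 * v + (1 - beta2) * grad * grad     # running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)                  # bias correction, t counts from 1
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v

# Toy problem: minimize f(w) = (w - 3)^2, whose gradient is 2 * (w - 3)
w, m, v = 0.0, 0.0, 0.0
for t in range(1, 2001):
    grad = 2 * (w - 3)
    w, m, v = adam_step(w, grad, m, v, t, lr=0.01)
print(w)
```

On the very first step the update has a size of about `lr` regardless of the gradient's magnitude, which is one reason Adam is fairly insensitive to the scale of the loss.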

4. Performance Evaluation
   41. After training, the model is tested on the 10,000 test images, which it has not seen during training.
   42. Accuracy is calculated as the fraction of predictions that match the correct labels.
   43. A confusion matrix is created to see which digits are misclassified.
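
How a confusion matrix relates to accuracy can be shown with a pure-Python sketch. The notebook uses `sklearn.metrics.confusion_matrix`; the two helper functions and the 3-class toy labels below are assumptions for illustration and are not MNIST results.

```python
def confusion_matrix_counts(y_true, y_pred, n_classes):
    """Rows are true classes, columns are predicted classes."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix

def accuracy_from_matrix(matrix):
    """Correct predictions lie on the diagonal of the confusion matrix."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total

# Made-up labels for a 3-class toy problem
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
cm = confusion_matrix_counts(y_true, y_pred, n_classes=3)
acc = accuracy_from_matrix(cm)
print(cm, acc)
```

Off-diagonal entries show which class was confused with which: here one sample of class 0 was predicted as class 1, and one sample of class 2 as class 0.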

5. Saving Results
   51. The results for different batch sizes and epochs are saved in a CSV file (mnist_nn_results.csv) for later analysis.


**MY COMMENTS:**
*HIDDEN LAYER:*
-->The model has one hidden layer with 256 neurons, which is a good choice, but adding one more layer might improve performance a bit.

*BATCH SIZE:*
-->Batch size 1 is too slow: every image triggers its own weight update, so one epoch needs 60,000 updates.
-->Batch size 10 is also slow, but better than batch size 1.
-->Batch size 100 is the best choice: it makes training fast while keeping accuracy good.
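
One way to see why small batch sizes are slow is to count the optimizer updates each configuration needs. The sketch below assumes the 60,000 MNIST training images and the `(batch_size, epochs)` pairs from the notebook; the helper name `total_updates` is hypothetical. The count is only a rough proxy for run time, since the cost per update also depends on the batch size and the hardware.

```python
import math

n_train = 60_000  # number of MNIST training images

# (batch_size, epochs) pairs from the notebook's `configs` list
configs = [(10, 100), (10, 50), (10, 10), (100, 100), (100, 50), (100, 10), (1, 100), (1, 50), (1, 10)]

def total_updates(batch_size, epochs, n_samples=n_train):
    """Number of optimizer steps needed for one configuration."""
    steps_per_epoch = math.ceil(n_samples / batch_size)
    return steps_per_epoch * epochs

for batch_size, epochs in configs:
    updates = total_updates(batch_size, epochs)
    print(f"batch={batch_size:>3} epochs={epochs:>3} -> {updates:>9,} updates")
```

Batch size 1 with 100 epochs needs 6,000,000 updates, while batch size 100 with 10 epochs needs only 6,000.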

*EPOCHS:*
-->Accuracy is almost the same for 10, 50, and 100 epochs; it increases only slightly with the number of epochs.
-->Training for 50 or 100 epochs costs extra time without a major improvement.

*COMBINATION REVIEW:*
-->Batch Size: 100, Epochs: 10 is the best combination, with almost the same accuracy and faster execution.
-->Combinations with Batch Size: 1 are the worst, because they increase the training time too much.
-->Batch Size: 10 combinations are better than Batch Size: 1 but still take more time than Batch Size: 100.

*GPU Vs CPU:*
-->CPU takes too much training time (not advisable even for a single combination with batch size 1 or 10).
-->GPU is faster than CPU.
-->GPU takes less time for batch sizes 10 and 100 but takes more training time for batch size 1.

**The results are saved to another file and provided in the same folder**