

### Training Cost Plot
![Training Cost vs Epochs](training_cost.png)


### Training Results

Training FFNN using batch gradient descent...

Initial accuracy (before training): 18.70%

Epoch 0, Cost: 15.0183<br>
Epoch 100, Cost: 5.6735<br>
Epoch 200, Cost: 3.8015<br>
Epoch 300, Cost: 2.9149<br>
Epoch 400, Cost: 2.3716<br>
Epoch 500, Cost: 1.9932<br>
Epoch 600, Cost: 1.7098<br>
Epoch 700, Cost: 1.4900<br>
Epoch 800, Cost: 1.3178<br>
Epoch 900, Cost: 1.1808<br>
Final accuracy (after training): 75.50%


Results Comparison:<br>
Before training accuracy: 18.70%<br>
After training accuracy: 75.50%

Training starts from randomly initialized weights, so the network's initial predictions are essentially random, with an accuracy of 18.70%. Early progress is fast: the random weights are far from optimal, so the gradients are large and each update changes the weights substantially. This explains the sharp drop in cost from 15.02 to 5.67 in the first 100 epochs.

As training progresses, the pace of improvement slows. The network has already captured the more obvious patterns in the data, and the remaining improvements require subtler weight adjustments. The gradients shrink as the network approaches a minimum in the loss landscape, so the weight updates shrink too. The logged costs show these diminishing returns clearly: the cost falls by about 9.34 in the first 100 epochs, but by only about 0.14 between epochs 800 and 900.
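The slowdown can be read directly off the logged costs. The short sketch below only does arithmetic on the values printed in the training output above; the variable names are illustrative.

```python
# Costs logged every 100 epochs (copied from the training output above).
costs = [15.0183, 5.6735, 3.8015, 2.9149, 2.3716,
         1.9932, 1.7098, 1.4900, 1.3178, 1.1808]

# Drop in cost over each 100-epoch window.
drops = [round(prev - curr, 4) for prev, curr in zip(costs, costs[1:])]

for start, drop in zip(range(0, 900, 100), drops):
    print(f"epochs {start:3d}-{start + 100:3d}: cost fell by {drop:.4f}")
```

Each window's drop is smaller than the one before it, which is the diminishing-returns pattern described above.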

The final accuracy of 75.50% shows that substantial learning has occurred, but the network is still well short of perfect classification. Several explanations are possible: the model architecture might be too simple to capture all the complexities in the data, the dataset might contain inherent noise that makes perfect classification impossible, or the model might still be underfitting because training stopped early (the cost was still falling at epoch 900, from 1.3178 to 1.1808). The smooth, monotonic decrease in the cost suggests that the learning rate of 0.005 is small enough to avoid the oscillation that results from overshooting the minimum. Whether it is also large enough for efficient learning is less certain, since the cost had not yet plateaued when training ended.
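To illustrate why a monotonic cost curve points to a learning rate that is not too large, here is a minimal sketch on a toy one-dimensional quadratic cost. This is not the network from this report: the function `run_gd`, the curvature value, and the larger learning rate are all assumptions chosen so that 0.005 is stable and the larger rate overshoots.

```python
def run_gd(lr, steps=20, w0=5.0, curvature=100.0):
    """Gradient descent on the toy cost J(w) = 0.5 * curvature * w**2.

    Returns the cost recorded before each update.
    """
    w = w0
    costs = []
    for _ in range(steps):
        costs.append(0.5 * curvature * w ** 2)
        grad = curvature * w      # dJ/dw
        w -= lr * grad            # standard gradient step
    return costs

# lr * curvature = 0.5: w is halved each step, so the cost falls monotonically.
small = run_gd(lr=0.005)

# lr * curvature = 2.1: w is multiplied by -1.1 each step, so it jumps
# across the minimum, flips sign, and the cost grows.
large = run_gd(lr=0.021)

print("small lr monotonic:", all(a > b for a, b in zip(small, small[1:])))
print("large lr final cost > initial cost:", large[-1] > large[0])
```

The threshold between the two regimes depends on the curvature of the cost, so the specific numbers here do not transfer to the real network; only the qualitative behavior does.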

The decreasing cost confirms that the model is learning to classify the input data. The steep initial drop reflects rapid progress in the early stages of training, while the gradual decline towards the end indicates that the model is approaching a minimum, although the cost had not fully plateaued by epoch 900.

Over the 1000 training epochs, accuracy improves from 18.70% to 75.50%. The plot shows cost rather than accuracy, but the gain in accuracy is consistent with the observed reduction in cost: the model is learning the underlying patterns in the data.
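For reference, accuracy figures like the ones above are commonly computed as the fraction of examples whose highest-scoring output class matches the label. The sketch below assumes the network outputs one score per class and that labels are integer class ids; the helper name `accuracy` and the toy arrays are hypothetical and not taken from the original code.

```python
import numpy as np

def accuracy(scores, labels):
    """Fraction of examples whose arg-max class equals the true label.

    scores: array of shape (n_examples, n_classes)
    labels: integer array of shape (n_examples,)
    """
    predicted = np.argmax(scores, axis=1)
    return float(np.mean(predicted == labels))

# Toy data: 4 examples, 3 classes. The last example is misclassified.
scores = np.array([[0.7, 0.2, 0.1],
                   [0.1, 0.8, 0.1],
                   [0.3, 0.3, 0.4],
                   [0.6, 0.3, 0.1]])
labels = np.array([0, 1, 2, 1])

print(f"accuracy: {accuracy(scores, labels):.2%}")
```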

### Mini-batch and Stochastic Gradient Descent
![Training Cost vs Epochs](minibatch_comparison.png)

Batch gradient descent shows slow but stable convergence: the cost starts at 15.0183 and decreases gradually to 1.1808 by the last logged epoch (900). This behavior is typical, because each update uses the entire dataset to compute the gradient. The updates are stable, but the parameters change only once per epoch.

In contrast, mini-batch training (batch_size=64) converges much faster at the start, with the cost dropping sharply from 13.3386 to 0.6720 in the first 100 epochs. Mini-batch training updates the parameters many times per epoch, while averaging over 64 examples keeps each update reasonably stable. The final cost of 0.0428 is far lower than that of batch gradient descent, so the optimizer gets much further in the same number of epochs.

Stochastic Gradient Descent (batch_size=1) converges the most aggressively, reaching a cost of 0.0052 by epoch 100 and stabilizing around 0.0022. SGD updates the parameters after every individual example, which allows very quick adaptation. The noise in single-example updates typically produces more variance in the learning trajectory, though this is not apparent in the per-epoch averages shown here.
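The three regimes differ only in how many examples feed each update. The sketch below assumes a shuffled mini-batch iterator of the usual form; `iterate_minibatches`, the dataset size of 200, and the random data are hypothetical and not taken from the original training code.

```python
import numpy as np

def iterate_minibatches(X, y, batch_size, rng):
    """Yield shuffled (X_batch, y_batch) pairs covering the data once (one epoch)."""
    n = X.shape[0]
    order = rng.permutation(n)
    for start in range(0, n, batch_size):
        idx = order[start:start + batch_size]
        yield X[idx], y[idx]

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))        # hypothetical data: 200 examples, 4 features
y = rng.integers(0, 3, size=200)     # hypothetical integer labels

updates_per_epoch = {}
for name, batch_size in [("batch", len(X)), ("mini-batch", 64), ("SGD", 1)]:
    # One parameter update would be performed per yielded batch.
    updates_per_epoch[name] = sum(1 for _ in iterate_minibatches(X, y, batch_size, rng))
    print(f"{name:10s} batch_size={batch_size:3d}: {updates_per_epoch[name]} updates per epoch")
```

With these assumed sizes, batch gradient descent makes one update per epoch, mini-batch makes four (the last batch is smaller than 64), and SGD makes 200.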

The stark difference in final cost values (1.1808 for batch, 0.0428 for mini-batch, and 0.0022 for SGD) shows how much further smaller batch sizes get in the same number of epochs. The most direct explanation is the number of updates: with M training examples, batch gradient descent makes one update per epoch, mini-batch training makes about M/64, and SGD makes M. The noise in small-batch updates may also help the optimizer escape poor regions of the loss landscape and can act as a form of regularization, although these results report only training cost and so do not test that effect directly.

However, while SGD reaches the lowest cost per epoch, it carries significant computational overhead. With batch_size=1, the model updates its parameters after every single example, resulting in M updates per epoch (where M is the number of training examples) compared to only about M/64 updates for mini-batch training. Each update requires its own forward pass, backward pass, and weight update, which makes SGD computationally intensive. Processing single examples also fails to take advantage of hardware optimizations for vectorized operations, which further lengthens training time.
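The vectorization point can be made concrete with a small sketch. It uses a linear model with a squared-error cost rather than the network from this report, and all names and shapes are assumptions; the point is only that one matrix product over a batch yields the same averaged gradient as a per-example loop.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 5))   # one hypothetical mini-batch: 64 examples, 5 features
y = rng.normal(size=64)
w = rng.normal(size=5)

# Vectorized: a single matrix product gives the batch-averaged gradient of
# the cost 0.5 * mean((X @ w - y) ** 2) with respect to w.
grad_vectorized = X.T @ (X @ w - y) / len(X)

# Per-example: the same gradient accumulated one example at a time,
# which is how batch_size=1 processing touches the data.
grad_loop = np.zeros_like(w)
for x_i, y_i in zip(X, y):
    grad_loop += (x_i @ w - y_i) * x_i
grad_loop /= len(X)

print("gradients match:", np.allclose(grad_vectorized, grad_loop))
```

Both paths compute the same numbers, but the vectorized form hands the whole batch to optimized linear algebra routines in one call, while the loop pays Python and dispatch overhead for each example.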

Among these three approaches, mini-batch training (batch_size=64) is the most practical choice for this task. It optimizes far better than batch GD in the same number of epochs (0.0428 vs 1.1808). It does not reach SGD's low cost (0.0022), but it needs roughly 64 times fewer updates per epoch and can use vectorized operations, so it should train much faster than SGD, although this section reports no timings. It also balances stability against stochastic noise, keeping enough noise to move out of poor regions of the loss landscape. Overall, mini-batch training offers the best trade-off between computational efficiency and optimization quality here.