# MNIST Data Preprocessing
1. **Loads MNIST dataset:**
    - `(x_train, y_train), (x_test, y_test) = mnist.load_data()`
2. **Prepares training data:**
    - Reshapes first 60000 images: `images = x_train[0:60000].reshape(60000, 28*28) / 255`
    - One-hot encodes labels: `labels = np.zeros((len(labels), 10)); ...`
3. **Prepares test data:**
    - Reshapes and normalizes all test images: `test_images_all = x_test.reshape(len(x_test), 28*28) / 255`
    - One-hot encodes test labels: `test_labels_all = np.zeros((len(y_test), 10)); ...`
    - Selects first 6000 test images and labels: `test_images, test_labels = test_images_all[0:6000], test_labels_all[0:6000]`



In [1]:
import sys, numpy as np
from keras.datasets import mnist

# Preparing the data
(x_train, y_train), (x_test, y_test) = mnist.load_data()
images, labels = (x_train[0:10000].reshape(10000,28*28) / 255, y_train[0:10000])
one_hot_labels = np.zeros((len(labels), 10))
for i, l in enumerate(labels):
    one_hot_labels[i][l] = 1
labels = one_hot_labels
test_images = x_test.reshape(len(x_test), 28*28) / 255
test_labels = np.zeros((len(y_test), 10))
for i, l in enumerate(y_test):
    test_labels[i][l] = 1



## Building a Simple Neural Network for Image Classification

This code demonstrates building and training a simple neural network for image classification using the MNIST dataset:

**Initialization:**

- `np.random.seed(1)`: Sets a random seed for reproducibility.
- `relu`: Defines the ReLU activation function.
- `relu2deriv`: Defines the derivative of ReLU for backpropagation.
- Hyperparameters:
    - `alpha`: Learning rate.
    - `iterations`: Number of training epochs.
    - `hidden_size`: Number of neurons in the hidden layer.
    - `pixels_per_image`: Number of pixels in each image (784 for MNIST).
    - `num_labels`: Number of output classes (10 for MNIST digits).
- `weights_0_1`: Randomly initialized weights between input layer and hidden layer.
- `weights_1_2`: Randomly initialized weights between hidden layer and output layer.

**Training Loop:**

- **Each iteration:**
    - Initialize error and correct count for the iteration.
    - **Loop through each image and label:**
        - `layer_0`: Select the current image data (1x784).
        - `layer_1`: Apply ReLU activation to the weighted sum of input and weights_0_1 (1x40).
        - `layer_2`: Calculate the weighted sum of hidden layer and weights_1_2 (1x10).
        - Update error based on squared difference between prediction (layer_2) and true label.
        - Count correct predictions based on highest predicted and actual class.
        - Calculate deltas for both layers using error and activation derivatives.
        - Update weights_1_2 and weights_0_1 using learning rate, deltas, and dot products.
    - Print training progress: iteration, average training error, and accuracy (correct/total).
    - **Every 10 iterations or at the end:**
        - Evaluate on test set:
            - Calculate test error and accuracy similar to training.
            - Print test error and accuracy.

In [10]:
# building the model
np.random.seed(1)
relu = lambda x: (x > 0) * x
relu2deriv = lambda x: x >= 0
alpha, iterations, hidden_size, pixels_per_image, num_labels = (0.005, 120, 40, 784, 10)

weights_0_1 = 0.2 * np.random.random((pixels_per_image, hidden_size)) - 0.1 # 784x40
weights_1_2 = 0.2 * np.random.random((hidden_size, num_labels)) - 0.1 # 40x10

for j in range(iterations):
    error, correct_cnt = (0.0, 0)
    for i in range(len(images)):
        layer_0 = images[i:i+1] # 1x784
        layer_1 = relu(np.dot(layer_0, weights_0_1)) # 1x784 * 784x40 = 1x40
        layer_2 = np.dot(layer_1,weights_1_2) # 1x40 * 40x10 = 1x10
        error += np.sum((layer_2 - labels[i:i+1]) ** 2)
        correct_cnt += int(np.argmax(layer_2) == np.argmax(labels[i:i+1]))
        layer_2_delta = layer_2 - labels[i:i+1] # 1x10 
        layer_1_delta = layer_2_delta.dot(weights_1_2.T) * relu2deriv(layer_1) # 1x10 * 10x40
        weights_1_2 -= alpha * layer_1.T.dot(layer_2_delta) # 40x1 * 1x10 = 40x10
        weights_0_1 -= alpha * layer_0.T.dot(layer_1_delta) # 784x1 * 1x40
        
    if(j % 10 == 0 or j == iterations - 1):
        error, correct_cnt = (0.0, 0)
        for i in range(len(test_images)):
            layer_0 = test_images[i:i+1]
            layer_1 = relu(np.dot(layer_0, weights_0_1))
            layer_2 = np.dot(layer_1, weights_1_2)
            error += np.sum((layer_2 - test_labels[i:i+1]) ** 2)
            correct_cnt += int(np.argmax(layer_2) == \
                               np.argmax(test_labels[i:i+1]))
        
        print(f'Epoch {j+1} | Error: {error/len(images)} | Correct: {correct_cnt/len(images)} | Test-Err: {error/len(test_images)} | Test-Acc: {correct_cnt/len(test_images)}')

Epoch 1 | Error: 0.36543277291589404 | Correct: 0.8421 | Test-Err: 0.36543277291589404 | Test-Acc: 0.8421
Epoch 11 | Error: 0.27270289186306745 | Correct: 0.8757 | Test-Err: 0.27270289186306745 | Test-Acc: 0.8757
Epoch 21 | Error: 0.2723293025897816 | Correct: 0.8769 | Test-Err: 0.2723293025897816 | Test-Acc: 0.8769
Epoch 31 | Error: 0.2796880624653478 | Correct: 0.8683 | Test-Err: 0.2796880624653478 | Test-Acc: 0.8683
Epoch 41 | Error: 0.2820254375195656 | Correct: 0.8722 | Test-Err: 0.2820254375195656 | Test-Acc: 0.8722
Epoch 51 | Error: 0.2931874475899022 | Correct: 0.8732 | Test-Err: 0.2931874475899022 | Test-Acc: 0.8732
Epoch 61 | Error: 0.2904323556559943 | Correct: 0.8792 | Test-Err: 0.2904323556559943 | Test-Acc: 0.8792
Epoch 71 | Error: 0.29646876681543705 | Correct: 0.8808 | Test-Err: 0.29646876681543705 | Test-Acc: 0.8808
Epoch 81 | Error: 0.2837620871913358 | Correct: 0.8799 | Test-Err: 0.2837620871913358 | Test-Acc: 0.8799
Epoch 91 | Error: 0.2823585056392112 | Correct: 0.

## Adding a dropout layer to the model

Dropout is a powerful technique commonly used in Deep Learning to **prevent overfitting** and improve **generalization**. It works by randomly deactivating a certain proportion of neurons in each hidden layer during the training process. This forces the remaining neurons to learn independently, preventing them from relying too heavily on each other and becoming overly specialized to the training data.

**Here's how it works:**

1. During training, at each iteration, each neuron has a probability (dropout rate) of being temporarily "dropped out" or deactivated.
2. The remaining active neurons are forced to compensate for the missing ones, learning to extract features that are more robust and independent.
3. This process helps to break down complex co-adaptations between neurons and prevents overfitting the model to the specific training data.

**Benefits of Dropout:**

* **Reduced overfitting:** By forcing neurons to learn independently, dropout reduces the model's dependence on specific patterns in the training data, leading to better generalization performance on unseen data.
* **Improved generalization:** The diverse learning encouraged by dropout leads to models that are less likely to overfit and can perform better on new data not encountered during training.
* **More robust models:** By preventing neurons from co-adapting too much, dropout helps to create models that are less susceptible to noise and variations in the input data.

**Dropout is particularly useful for:**

* Deep neural networks with many layers and parameters, which are more prone to overfitting.
* Complex tasks where the model needs to learn intricate relationships between features.

**Important points to remember:**

* The optimal dropout rate can vary depending on the specific dataset and model architecture. Finding the right value often requires experimentation.
* Dropout can be applied to different layers of the network, each with potentially different dropout rates.
* While dropout is a powerful technique, it can also negatively impact performance if not used correctly.


In [12]:
# building the model with dropout
np.random.seed(1)
relu = lambda x: (x > 0) * x
relu2deriv = lambda x: x >= 0
alpha, iterations, hidden_size, pixels_per_image, num_labels = (0.005, 120, 100, 784, 10)

weights_0_1 = 0.2 * np.random.random((pixels_per_image, hidden_size)) - 0.1 # 784x40
weights_1_2 = 0.2 * np.random.random((hidden_size, num_labels)) - 0.1 # 40x10

for j in range(iterations):
    error, correct_cnt = (0.0, 0)
    for i in range(len(images)):
        layer_0 = images[i:i+1] # 1x784
        layer_1 = relu(np.dot(layer_0, weights_0_1)) # 1x784 * 784x40 = 1x40
        dropout_mask = np.random.randint(2, size=layer_1.shape) # set of 1s and 0s with layer_1 shape 
        layer_1 *= dropout_mask * 2 # multiplying the result by 2 due to that wew cut the ouput of layer_1 by half (because dropout) so we are trying to rebalance the network
        layer_2 = np.dot(layer_1,weights_1_2) # 1x40 * 40x10 = 1x10
        error += np.sum((layer_2 - labels[i:i+1]) ** 2) # sum of the result to remove the [] from it
        correct_cnt += int(np.argmax(layer_2) == np.argmax(labels[i:i+1]))
        layer_2_delta = layer_2 - labels[i:i+1] # 1x10
        layer_1_delta = layer_2_delta.dot(weights_1_2.T) * relu2deriv(layer_1)  # 1x10 * 10x40 = 1x40
        layer_1_delta *= dropout_mask
        weights_1_2 -= alpha * layer_1.T.dot(layer_2_delta) # 40x1 * 1x10 = 
        weights_0_1 -= alpha * layer_0.T.dot(layer_1_delta) # 784x1 * 1x40 = 784x40

    if(j%10 == 0):
        test_error = 0.0
        test_correct_cnt = 0

        for i in range(len(test_images)):
            layer_0 = test_images[i:i+1]
            layer_1 = relu(np.dot(layer_0,weights_0_1))
            layer_2 = np.dot(layer_1, weights_1_2)

            test_error += np.sum((test_labels[i:i+1] - layer_2) ** 2)
            test_correct_cnt += int(np.argmax(layer_2) == np.argmax(test_labels[i:i+1]))
        
        print(f'Epoch {j+1} | Error: {error/len(images)} | Correct: {correct_cnt/len(images)} | Test-Err: {error/len(test_images)} | Test-Acc: {correct_cnt/len(test_images)}')

Epoch 1 | Error: 0.5949953995101731 | Correct: 0.6583 | Test-Err: 0.5949953995101731 | Test-Acc: 0.6583
Epoch 11 | Error: 0.4312827242059426 | Correct: 0.8017 | Test-Err: 0.4312827242059426 | Test-Acc: 0.8017
Epoch 21 | Error: 0.42152608446708434 | Correct: 0.8126 | Test-Err: 0.42152608446708434 | Test-Acc: 0.8126
Epoch 31 | Error: 0.4357730935201847 | Correct: 0.8077 | Test-Err: 0.4357730935201847 | Test-Acc: 0.8077
Epoch 41 | Error: 0.43866602129806 | Correct: 0.8119 | Test-Err: 0.43866602129806 | Test-Acc: 0.8119
Epoch 51 | Error: 0.4440785659798922 | Correct: 0.8056 | Test-Err: 0.4440785659798922 | Test-Acc: 0.8056
Epoch 61 | Error: 0.4578244726373815 | Correct: 0.7966 | Test-Err: 0.4578244726373815 | Test-Acc: 0.7966
Epoch 71 | Error: 0.4764422043105311 | Correct: 0.7739 | Test-Err: 0.4764422043105311 | Test-Acc: 0.7739
Epoch 81 | Error: 0.4809077892497466 | Correct: 0.7794 | Test-Err: 0.4809077892497466 | Test-Acc: 0.7794
Epoch 91 | Error: 0.47223267047633055 | Correct: 0.7756 | 

# Feeding the data in a set of batches to the model

Batching is a technique commonly used in machine learning and deep learning to improve efficiency and performance. It involves grouping individual data points into smaller sets called **batches** before feeding them to the model for training or inference. Here's an explanation of the technique:

**Concept:**

Imagine you have a large dataset of images you want to train a model to recognize cats and dogs. Feeding each image individually to the model would be slow and inefficient, especially for complex models requiring extensive calculations. Batching allows you to:

* **Process multiple data points simultaneously:** Instead of feeding each image one by one, you group them into batches (e.g., 32 or 64 images). The model then processes these batches together, utilizing resources more efficiently.
* **Reduce memory overhead:** Loading the entire dataset at once can overwhelm your system's memory. Batching keeps only the current batch in memory, optimizing memory usage and enabling training on larger datasets.
* **Parallelize computations:** Many modern hardware architectures support parallel processing. Batching allows you to leverage this capability by processing multiple data points in parallel on GPUs or other multi-core processors, significantly speeding up training.

**Benefits:**

* **Faster training:** By utilizing resources more efficiently and parallelizing computations, batching can significantly reduce training time.
* **Improved memory management:** Smaller batches reduce memory footprint, allowing you to train on larger datasets or complex models with limited memory.
* **Efficient hardware utilization:** Batching takes advantage of parallel processing capabilities in modern hardware, leading to faster training and inference.

**Considerations:**

* **Choosing the batch size:** There's no one-size-fits-all batch size. Different model architectures and datasets perform best with specific batch sizes. Experimenting to find the optimal size is often necessary.
* **Impact on learning rate:** Smaller batches can lead to more frequent updates and potentially faster convergence. However, adjusting the learning rate might be necessary to avoid instability.
* **Hardware limitations:** Choose batch sizes compatible with your hardware capabilities to avoid memory limitations or performance bottlenecks.

In [16]:
# building the model with dropout and batch
np.random.seed(1)
relu = lambda x: (x > 0) * x
relu2deriv = lambda x: x >= 0
batch_size = 100
alpha, iterations = (0.001, 120)
hidden_size, pixels_per_image, num_labels = (100, 784, 10)

weights_0_1 = 0.2 * np.random.random((pixels_per_image, hidden_size)) - 0.1 # 784x40
weights_1_2 = 0.2 * np.random.random((hidden_size, num_labels)) - 0.1 # 40x10

for j in range(iterations):
    error, correct_cnt = (0.0, 0)
    for i in range(int(len(images) / batch_size)): # 60000 / 100 = 600
        batch_start, batch_end = ((i* batch_size), ((i+1) * batch_size))
        layer_0 = images[batch_start:batch_end] # 100x784
        layer_1 = relu(np.dot(layer_0, weights_0_1)) # 100x784 * 784x40 = 100x40
        dropout_mask = np.random.randint(2, size=layer_1.shape) # set of 1s and 0s with layer_1 shape 
        layer_1 *= dropout_mask * 2 # multiplying the result by 2 due to that wew cut the ouput of layer_1 by half (because dropout) so we are trying to rebalance the network
        layer_2 = np.dot(layer_1,weights_1_2) # 100x40 * 40x10 = 100x10
        error += np.sum((layer_2 - labels[batch_start:batch_end]) ** 2) # sum of the result to remove the [] from it, 100x10 - 100x10
        for k in range(batch_size):
            correct_cnt += int(np.argmax(layer_2[k:k+1]) == np.argmax(labels[batch_start+k:batch_start+k+1]))
            layer_2_delta = (layer_2 - labels[batch_start:batch_end]) / batch_size # 100x10
            layer_1_delta = layer_2_delta.dot(weights_1_2.T) * relu2deriv(layer_1)  # 1x10 * 10x40 = 1x40
            layer_1_delta *= dropout_mask
            weights_1_2 -= alpha * layer_1.T.dot(layer_2_delta) # 40x1 * 1x10 = 
            weights_0_1 -= alpha * layer_0.T.dot(layer_1_delta) # 784x1 * 1x40 = 784x40

    if(j%10 == 0 or j == iterations-1):
        test_error = 0.0
        test_correct_cnt = 0

        for i in range(len(test_images)):
            layer_0 = test_images[i:i+1]
            layer_1 = relu(np.dot(layer_0,weights_0_1))
            layer_2 = np.dot(layer_1, weights_1_2)

            test_error += np.sum((test_labels[i:i+1] - layer_2) ** 2)
            test_correct_cnt += int(np.argmax(layer_2) == np.argmax(test_labels[i:i+1]))
        
        print(f'Epoch {j+1} | Error: {error/len(images)} | Correct: {correct_cnt/len(images)} | Test-Err: {error/len(test_images)} | Test-Acc: {test_correct_cnt/len(test_images)}')

Epoch 1 | Error: 0.7752947227502046 | Correct: 0.5039 | Test-Err: 0.7752947227502046 | Test-Acc: 0.7451
Epoch 11 | Error: 0.4842224741600875 | Correct: 0.7576 | Test-Err: 0.4842224741600875 | Test-Acc: 0.8222
Epoch 21 | Error: 0.45936947308853976 | Correct: 0.777 | Test-Err: 0.45936947308853976 | Test-Acc: 0.8092
Epoch 31 | Error: 0.43699727223051243 | Correct: 0.7967 | Test-Err: 0.43699727223051243 | Test-Acc: 0.8299
Epoch 41 | Error: 0.42742085666875484 | Correct: 0.8084 | Test-Err: 0.42742085666875484 | Test-Acc: 0.8375
Epoch 51 | Error: 0.4155963094341605 | Correct: 0.8129 | Test-Err: 0.4155963094341605 | Test-Acc: 0.8465
Epoch 61 | Error: 0.4129700470032456 | Correct: 0.8169 | Test-Err: 0.4129700470032456 | Test-Acc: 0.8439
Epoch 71 | Error: 0.4087772568840016 | Correct: 0.824 | Test-Err: 0.4087772568840016 | Test-Acc: 0.8532
Epoch 81 | Error: 0.40514287201002447 | Correct: 0.8223 | Test-Err: 0.40514287201002447 | Test-Acc: 0.8521
Epoch 91 | Error: 0.4014275547902344 | Correct: 0.

# Draft Cells

In [13]:
print(x_train[0:1000].shape)
print(x_train[0].shape)
print(x_test.shape)
print(images[0:1].shape)
print(images[0].shape)
print()
print(y_train[0:1000].shape)
print(y_train[0])
print(labels.shape)
print(labels[9])
print(one_hot_labels.shape)

(1000, 28, 28)
(28, 28)
(10000, 28, 28)
(1, 784)
(784,)

(1000,)
5
(10000, 10)
[0. 0. 0. 0. 1. 0. 0. 0. 0. 0.]
(10000, 10)


In [148]:
# understanding the droput
import numpy as np
np.random.seed(1)
dropout = np.random.randint(2, size=40) # 1x40
turned_of_nodes = [i for i in dropout if i==0]
print(f"turned off nodes is:  {len(turned_of_nodes)} of {len(dropout)} \n")
layer = 2 * np.random.random((1, 40)) - 1 # 1x40
print(f"layer values before droput:\n {layer}", end='\n\n')
layer = layer * dropout * 2
print(f"layer values after droput:\n {layer}")

turned off nodes is:  21 of 40 

layer values before droput:
 [[ 0.60148914  0.93652315 -0.37315164  0.38464523  0.7527783   0.78921333
  -0.82991158 -0.92189043 -0.66033916  0.75628501 -0.80330633 -0.15778475
   0.91577906  0.06633057  0.38375423 -0.36896874  0.37300186  0.66925134
  -0.96342345  0.50028863  0.97772218  0.49633131 -0.43911202  0.57855866
  -0.79354799 -0.10421295  0.81719101 -0.4127717  -0.42444932 -0.73994286
  -0.96126608  0.35767107 -0.57674377 -0.46890668 -0.01685368 -0.89327491
   0.14823521 -0.70654285  0.17861107  0.39951672]]

layer values after droput:
 [[ 1.20297827  1.8730463  -0.          0.          1.50555661  1.57842665
  -1.65982315 -1.84378087 -1.32067832  0.         -0.         -0.3155695
   0.          0.13266114  0.76750846 -0.          0.          1.33850269
  -0.          0.          0.          0.99266262 -0.          0.
  -1.58709597 -0.          0.         -0.         -0.84889865 -0.
  -0.          0.         -1.15348754 -0.93781336 -0.0337073