## MNIST Data Preparation

**1. Get the Data:**

- We load the MNIST dataset of handwritten digits and use only the first 1,000 training images.

**2. Make the Data Easier to Use:**

- We flatten each image from a 2D grid (28x28 pixels) into a 1D array of 784 values.
- We rescale the pixel values from the 0-255 range to 0-1, which keeps the inputs small and tends to make training more stable.

**3. Convert Labels to One-Hot Vectors:**

- Instead of just having a number for each digit (e.g., 3 for '3'), we represent each label as a vector of 10 numbers.
- The vector has a 1 at the index of the correct digit and 0s everywhere else (one-hot encoding). This gives one target value per class, matching the 10 outputs of the network.

**4. Do the Same for the Test Data:**

- We repeat steps 2 and 3 for the full set of test images and labels, preparing them for later evaluation.

**Now the MNIST data is ready for a machine learning model to learn to recognize handwritten digits!**

In [1]:
import sys
import numpy as np
np.random.seed(1)

from keras.datasets import mnist

(x_train, y_train), (x_test, y_test) = mnist.load_data()

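# Keep the first 1000 training images, flatten 28x28 -> 784, scale pixels to [0, 1]
images, labels = (x_train[0:1000].reshape(1000,28*28) / 255, y_train[0:1000])

# One-hot encode the training labels: row i has a 1 at column labels[i]
one_hot_labels = np.zeros((len(labels),10))
for i,l in enumerate(labels):
    one_hot_labels[i][l] = 1
labels = one_hot_labels

# Flatten, scale, and one-hot encode the full test set the same way
test_images = x_test.reshape(len(x_test),28*28) / 255
test_labels = np.zeros((len(y_test),10))
for i,l in enumerate(y_test):
    test_labels[i][l] = 1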
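
The one-hot loops above can also be written without an explicit Python loop. The following is a minimal sketch, not part of the original notebook: the helper name `one_hot` is hypothetical, and it assumes the labels are integers in the range 0-9.

```python
import numpy as np

def one_hot(labels, num_classes=10):
    """Return a (len(labels), num_classes) array with a 1 at each label's index.

    Hypothetical helper: indexes the rows of an identity matrix by label.
    """
    labels = np.asarray(labels)
    return np.eye(num_classes)[labels]

example = one_hot([3, 0, 9])
print(example.shape)  # one row per label, one column per class
print(example[0])     # a 1 at index 3, zeros elsewhere
```

Indexing the identity matrix by label picks out the matching one-hot row for each example, so the result should equal what the loop version produces.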

Explanation of the `tanh` and `softmax` functions:

**1. tanh:**

- **Behavior:**
    - S-shaped function, mapping real numbers to the open interval **(-1, 1)**.
    - Output increases gradually as input increases, but plateaus near -1 and 1.
- **Properties:**
    - Zero-centered: Output is 0 when input is 0.
    - Smooth gradient: Enables easier learning compared to step-like functions.
    - Non-linearity: Introduces non-linearity, crucial for learning complex patterns.
- **Applications:**
    - Often used in hidden layers of neural networks.
    - Can also be used as an output layer activation for tasks with a bipolar scale (e.g., sentiment analysis).

**2. softmax:**

- **Behavior:**
    - Takes a vector of real numbers and normalizes it into a probability distribution.
    - Each element in the output vector represents the probability of belonging to a specific class.
    - Outputs always sum to 1.
- **Properties:**
    - Ensures outputs are non-negative and interpretable as probabilities.
    - Forces competition between classes: because the outputs sum to 1, raising the probability of one class lowers the others.
- **Applications:**
    - Primarily used in the output layer of multi-class classification tasks (e.g., image recognition, handwritten digit recognition).
    - Useful when you want the network to predict the class an input belongs to with a confidence score.

**Key Comparisons:**

| Feature        | tanh                                   | softmax                                     |
|----------------|----------------------------------------|---------------------------------------------|
| Output range   | (-1, 1)                                | (0, 1), entries sum to 1                    |
| Zero-centered  | Yes                                    | No                                          |
| Gradient       | Smooth; shrinks for large-magnitude inputs | Smooth; each output depends on every input |
| Non-linearity  | Yes                                    | Yes                                         |
| Applications   | Hidden layers, bipolar outputs         | Output layer, multi-class                   |

**1. `tanh(x)`:** This function implements the hyperbolic tangent activation function. For each input value `x`, it squishes it between -1 and 1 using the formula `tanh(x) = (e^x - e^(-x)) / (e^x + e^(-x))`. This helps to limit the output of the network and introduce non-linearity, improving its learning capabilities.

**2. `tanh2deriv(output)`:** This function calculates the derivative of `tanh`, expressed in terms of the function's output rather than its input. Because the derivative of `tanh(x)` is `1 - tanh(x)^2`, it can be computed directly from the already-available activation `output` as `1 - (output ** 2)`. The derivative tells us how much the output of `tanh` changes in response to a small change in its input.

**3. `softmax(x)`:** This function implements the softmax activation function. It takes a vector of values `x` (e.g., outputs from a network layer) and normalizes them into a probability distribution. So, after applying softmax, each element in the vector represents the probability of belonging to a certain class. This is useful for multi-class classification tasks where you want the network to predict the most likely class for an input.

Table summarizing the functions:

| Function | Purpose |
|---|---|
| `tanh(x)` | Applies the hyperbolic tangent activation |
| `tanh2deriv(output)` | Calculates the derivative of `tanh` from its output value `output` |
| `softmax(x)` | Converts a vector into a probability distribution |
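
As a sanity check on the derivative formula, the sketch below compares `tanh2deriv` against a central finite-difference estimate. It repeats the two function definitions from the notebook so that it runs on its own; the step size `eps`, the sample points `x`, and the variable names `numeric`, `analytic`, and `max_gap` are assumptions chosen for illustration.

```python
import numpy as np

def tanh(x):
    return np.tanh(x)

def tanh2deriv(output):
    # Derivative of tanh written in terms of its output
    return 1 - (output ** 2)

x = np.linspace(-3, 3, 13)  # sample inputs (illustrative choice)
eps = 1e-6                  # finite-difference step (illustrative choice)

# Central difference approximation of d/dx tanh(x)
numeric = (tanh(x + eps) - tanh(x - eps)) / (2 * eps)
# Analytic derivative, computed from the tanh output
analytic = tanh2deriv(tanh(x))

max_gap = np.max(np.abs(numeric - analytic))
print(max_gap)  # should be very small if the formula is right
```

Note that `tanh2deriv` is given `tanh(x)`, not `x`. Passing the raw input instead would give a different, incorrect value.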

In [2]:
def tanh(x):
    return np.tanh(x)

def tanh2deriv(output):
    return 1 - (output ** 2)

def softmax(x):
    temp = np.exp(x)
    return temp / np.sum(temp, axis=1, keepdims=True)
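
One caveat with the `softmax` above is that `np.exp(x)` can overflow when the inputs are large. A common workaround, sketched below, is to subtract each row's maximum before exponentiating; this leaves the result mathematically unchanged. The name `stable_softmax` and the sample `scores` are hypothetical and not part of the original notebook.

```python
import numpy as np

def stable_softmax(x):
    """Row-wise softmax that shifts each row by its maximum to avoid overflow.

    Hypothetical variant of the notebook's softmax; expects a 2D array.
    """
    shifted = x - np.max(x, axis=1, keepdims=True)
    temp = np.exp(shifted)
    return temp / np.sum(temp, axis=1, keepdims=True)

# Second row is the first row plus 999, so both rows should give the same probabilities
scores = np.array([[1.0, 2.0, 3.0],
                   [1000.0, 1001.0, 1002.0]])
probs = stable_softmax(scores)
print(probs)
print(probs.sum(axis=1))  # each row sums to 1
```

Adding a constant to every entry of a row multiplies the numerator and denominator by the same factor, so the shift does not change the output.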

## Detailed Explanation of the MNIST Classification Code

**Initialization:**

- `alpha, iterations, hidden_size`: Learning rate, training iterations, and hidden layer size.
- `pixels_per_image, num_labels`: Number of pixels per image (784) and number of output classes (10 digits).
- `batch_size`: Number of images used in each training step (100).
- `weights_0_1, weights_1_2`: Randomly initialized weights between layers.

**Training Loop:**

- **Outer loop:** Iterates for the specified number of training epochs (`iterations`).
- **Inner loop:** Iterates over batches of images (`batch_size`).
    - **Forward pass:**
        - Extract a batch of training images `layer_0`.
        - Apply tanh activation to the first hidden layer `layer_1`.
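        - Implement dropout with a random mask `dropout_mask`: roughly half of the hidden units are zeroed and the remaining ones are multiplied by 2 so the expected activation stays the same.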
        - Apply softmax activation to the output layer `layer_2`.
        - Count correctly classified examples within the batch `correct_cnt`.
    - **Backward pass:**
        - Calculate output layer error `layer_2_delta`.
        - Backpropagate error to hidden layer using tanh derivative `layer_1_delta`.
        - Apply dropout mask again to `layer_1_delta`.
        - Update weights using mini-batch gradient descent with learning rate `alpha`.
- **Test accuracy:**
    - After each epoch, loop through all test images (without dropout) and calculate accuracy `test_correct_cnt`.
    - Print training and test accuracy every 10 epochs.
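
The dropout step in the forward pass can be illustrated in isolation. The sketch below assumes a hidden layer filled with ones, purely so the effect of the mask is easy to see; the array sizes and the name `dropped` are illustrative and not taken from the training code.

```python
import numpy as np

np.random.seed(1)

# Stand-in for a batch of hidden activations (all ones, for illustration only)
layer_1 = np.ones((1000, 100))

# Each entry is 0 or 1 with equal probability, as in the training loop
dropout_mask = np.random.randint(2, size=layer_1.shape)

# Zero the dropped units and double the kept ones
dropped = layer_1 * dropout_mask * 2

print(np.unique(dropped))  # only 0 and 2 remain
print(dropped.mean())      # close to the original mean of 1
```

Doubling the surviving units compensates for dropping about half of them, which is why the test loop can skip dropout without rescaling the weights.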

In [3]:


alpha, iterations, hidden_size = (2, 300, 100)
pixels_per_image, num_labels = (784, 10)
batch_size = 100

weights_0_1 = 0.02*np.random.random((pixels_per_image,hidden_size))-0.01 # 784x100
weights_1_2 = 0.2*np.random.random((hidden_size,num_labels)) - 0.1 # 100x10

for j in range(iterations):
    correct_cnt = 0
    for i in range(int(len(images) / batch_size)):
        batch_start, batch_end=((i * batch_size),((i+1)*batch_size))
        layer_0 = images[batch_start:batch_end] # 100x784
        layer_1 = tanh(np.dot(layer_0,weights_0_1)) # 100x784 * 784x100 = 100x100
        dropout_mask = np.random.randint(2,size=layer_1.shape) # 100x100 of 0s and 1s
        layer_1 *= dropout_mask * 2 # zero about half the units, double the rest
        layer_2 = softmax(np.dot(layer_1,weights_1_2)) # 100x100 * 100x10 = 100x10

        for k in range(batch_size):
            correct_cnt += int(np.argmax(layer_2[k:k+1]) == np.argmax(labels[batch_start+k:batch_start+k+1]))
            

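        # Output error, divided by batch_size * layer_2.shape[0] (both equal 100 here)
        layer_2_delta = (labels[batch_start:batch_end]-layer_2) / (batch_size * layer_2.shape[0])
        # Backpropagate through tanh, then zero the deltas of the dropped units
        layer_1_delta = layer_2_delta.dot(weights_1_2.T) * tanh2deriv(layer_1)
        layer_1_delta *= dropout_mask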

        weights_1_2 += alpha * layer_1.T.dot(layer_2_delta)
        weights_0_1 += alpha * layer_0.T.dot(layer_1_delta)

    test_correct_cnt = 0

    for i in range(len(test_images)):

        layer_0 = test_images[i:i+1]
        layer_1 = tanh(np.dot(layer_0,weights_0_1))
        layer_2 = np.dot(layer_1,weights_1_2)

        test_correct_cnt += int(np.argmax(layer_2) == np.argmax(test_labels[i:i+1]))
    if(j % 10 == 0):
        sys.stdout.write("\n"+ \
         "I:" + str(j) + \
         " Test-Acc:"+str(test_correct_cnt/float(len(test_images)))+\
         " Train-Acc:" + str(correct_cnt/float(len(images))))


I:0 Test-Acc:0.394 Train-Acc:0.156
I:10 Test-Acc:0.6867 Train-Acc:0.723
I:20 Test-Acc:0.7025 Train-Acc:0.732
I:30 Test-Acc:0.734 Train-Acc:0.763
I:40 Test-Acc:0.7663 Train-Acc:0.794
I:50 Test-Acc:0.7913 Train-Acc:0.819
I:60 Test-Acc:0.8102 Train-Acc:0.849
I:70 Test-Acc:0.8228 Train-Acc:0.864
I:80 Test-Acc:0.831 Train-Acc:0.867
I:90 Test-Acc:0.8364 Train-Acc:0.885
I:100 Test-Acc:0.8407 Train-Acc:0.883
I:110 Test-Acc:0.845 Train-Acc:0.891
I:120 Test-Acc:0.8481 Train-Acc:0.901
I:130 Test-Acc:0.8505 Train-Acc:0.901
I:140 Test-Acc:0.8526 Train-Acc:0.905
I:150 Test-Acc:0.8555 Train-Acc:0.914
I:160 Test-Acc:0.8577 Train-Acc:0.925
I:170 Test-Acc:0.8596 Train-Acc:0.918
I:180 Test-Acc:0.8619 Train-Acc:0.933
I:190 Test-Acc:0.863 Train-Acc:0.933
I:200 Test-Acc:0.8642 Train-Acc:0.926
I:210 Test-Acc:0.8653 Train-Acc:0.931
I:220 Test-Acc:0.8668 Train-Acc:0.93
I:230 Test-Acc:0.8672 Train-Acc:0.937
I:240 Test-Acc:0.8681 Train-Acc:0.938
I:250 Test-Acc:0.8687 Train-Acc:0.937
I:260 Test-Acc:0.8684 Train-