# Introduction

The MNIST dataset is a widely used benchmark in machine learning and computer vision. It consists of 28x28 pixel grayscale images of handwritten digits (0 through 9) with their corresponding labels, split into 60,000 training images and 10,000 test images. It is often used for image classification tasks such as digit recognition.

# Package Imports

In [1]:
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, models
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

# Loading the dataset

In [2]:
# Load MNIST dataset
mnist = tf.keras.datasets.mnist
(x_train, y_train), (x_test, y_test) = mnist.load_data()

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz


Let us inspect the loaded arrays.

In [4]:
x_train

array([[[0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        ...,
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0]],

       [[0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        ...,
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0]],

       [[0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        ...,
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0]],

       ...,

       [[0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        ...,
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0]],

       [[0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        ...,
        [0, 0, 0, ..., 

In [5]:
y_train

array([5, 0, 4, ..., 5, 6, 8], dtype=uint8)

In [6]:
x_test

array([[[0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        ...,
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0]],

       [[0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        ...,
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0]],

       [[0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        ...,
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0]],

       ...,

       [[0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        ...,
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0]],

       [[0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        ...,
        [0, 0, 0, ..., 

In [7]:
y_test

array([7, 2, 1, ..., 4, 5, 6], dtype=uint8)

In [8]:
# Print shapes of the arrays
print("Training data shape:", x_train.shape)  # Shape of training images
print("Training labels shape:", y_train.shape)  # Shape of training labels
print("Test data shape:", x_test.shape)  # Shape of test images
print("Test labels shape:", y_test.shape)  # Shape of test labels

Training data shape: (60000, 28, 28)
Training labels shape: (60000,)
Test data shape: (10000, 28, 28)
Test labels shape: (10000,)


# Normalizing the data

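The pixel values are integers in the range 0 to 255, so dividing by 255 rescales them to the range [0, 1]. This is equivalent to min-max scaling with a fixed minimum of 0 and maximum of 255; no scikit-learn `MinMaxScaler` object is used.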

In [9]:
# Normalize pixel values to between 0 and 1
x_train, x_test = x_train / 255.0, x_test / 255.0
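
To make the link to min-max scaling concrete, here is a minimal sketch on a small made-up array. The array `pixels` is a hypothetical stand-in for an MNIST image, not data from the notebook.

```python
import numpy as np

# Hypothetical 2x3 "image" with uint8 pixel values (stand-in for an MNIST image)
pixels = np.array([[0, 128, 255], [64, 32, 16]], dtype=np.uint8)

# General min-max scaling: (x - min) / (max - min), with the known pixel range
x_min, x_max = 0.0, 255.0
scaled = (pixels - x_min) / (x_max - x_min)

# With min=0 and max=255 this reduces to a plain division by 255
same_as_division = np.allclose(scaled, pixels / 255.0)
print(scaled.min(), scaled.max(), same_as_division)
```

Using the fixed range 0 to 255 rather than the per-dataset minimum and maximum keeps the training and test sets on exactly the same scale.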


# Flattening the images for processing

In [10]:
# Flatten images
x_train_flat = x_train.reshape(x_train.shape[0], -1)
x_test_flat = x_test.reshape(x_test.shape[0], -1)

In [11]:
# Print shapes of the flattened arrays
print("Flattened training data shape:", x_train_flat.shape)  # Shape of flattened training images
print("Flattened test data shape:", x_test_flat.shape)  # Shape of flattened test images

Flattened training data shape: (60000, 784)
Flattened test data shape: (10000, 784)


Each image is now a vector of 784 values (28 x 28), so the network takes 784 input features per image. Let us move on to the labels.
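
As a small illustration of what the reshape does, the sketch below flattens a made-up batch of two 28x28 arrays and checks where a given pixel ends up. The array `batch` and the chosen pixel position are assumptions for illustration only.

```python
import numpy as np

# Hypothetical batch of two 28x28 "images" filled with distinct integers
batch = np.arange(2 * 28 * 28).reshape(2, 28, 28)

# Same call pattern as in the notebook: keep the first axis, merge the rest
flat = batch.reshape(batch.shape[0], -1)
print(flat.shape)

# NumPy flattens row by row, so pixel (r, c) lands at index r * 28 + c
r, c = 3, 5
same_pixel = flat[0, r * 28 + c] == batch[0, r, c]

# Flattening loses no information: reshaping back recovers the images
restored = flat.reshape(-1, 28, 28)
```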

# One hot encoding of the labels

In the case of the MNIST dataset, where each image represents a handwritten digit from 0 to 9, one-hot encoding converts the single scalar label for each image (which represents the digit) into a binary vector of length 10. Each element in this vector corresponds to a possible digit label, with a value of 1 indicating the presence of that digit and 0 otherwise.

For example, if we have an input image of the digit 3, its corresponding one-hot encoded label would be [0, 0, 0, 1, 0, 0, 0, 0, 0, 0].

The length of this one-hot encoded vector matches the number of neurons in the output layer of the neural network. In this case, there are 10 neurons in the output layer, each representing one possible digit label.

During training, the network receives the flattened images as input and the one-hot encoded vectors as targets. The loss compares the predicted class probabilities with these targets, which lets the network learn the relationship between image features and digit labels.
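
One-hot encoding can also be sketched with plain NumPy, which makes the mapping from label to vector easy to see. This is an illustrative alternative, not the encoder used in the notebook; the `labels` array is made up (its first three values match the start of `y_train` shown above).

```python
import numpy as np

labels = np.array([5, 0, 4, 3])  # illustrative labels
num_classes = 10

# Row i of the identity matrix is the one-hot vector for class i
one_hot = np.eye(num_classes)[labels]
print(one_hot[3])  # vector for the digit 3

# argmax undoes the encoding and recovers the original labels
recovered = one_hot.argmax(axis=1)
```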

## Why sparse in One Hot Encoding?
With `sparse_output=True`, which is the default, `OneHotEncoder` returns a SciPy sparse matrix that stores only the non-zero elements instead of a dense array. This saves memory when there are many classes, because a dense one-hot matrix is mostly zeros. (The parameter was named `sparse` before scikit-learn 1.2.) Here there are only 10 classes, and a dense NumPy array is the simplest target format to pass to Keras, so the code below sets `sparse_output=False`.
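
The memory argument can be illustrated by counting entries in a dense one-hot matrix. The sketch below uses plain NumPy and made-up labels; it does not reproduce the internal storage format of a SciPy sparse matrix, only the idea that one position per row is enough.

```python
import numpy as np

labels = np.array([5, 0, 4, 1, 9, 2])  # hypothetical labels
num_classes = 10
dense = np.eye(num_classes)[labels]

total_entries = dense.size                 # every cell is stored
nonzero_entries = np.count_nonzero(dense)  # exactly one 1 per row
print(total_entries, nonzero_entries)

# A sparse format only needs to remember where the 1s are
rows, cols = np.nonzero(dense)
```

With 10 classes, 90% of the dense matrix is zeros; the saving grows with the number of classes.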

In [15]:
# One-hot encode labels
onehot_encoder = OneHotEncoder(sparse_output=False)
y_train_onehot = onehot_encoder.fit_transform(y_train.reshape(-1, 1))
y_test_onehot = onehot_encoder.transform(y_test.reshape(-1, 1))

In [16]:
print("Shape of one-hot encoded training labels:", y_train_onehot.shape)
print("Shape of one-hot encoded testing labels:", y_test_onehot.shape)

Shape of one-hot encoded training labels: (60000, 10)
Shape of one-hot encoded testing labels: (10000, 10)


# Defining the ANN Model

`layers.Dense` refers to a fully connected layer. A fully connected layer is a type of neural network layer where each neuron in the layer is connected to every neuron in the preceding layer.
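
What a fully connected layer computes can be sketched in a few lines of NumPy: a matrix product, a bias, and an activation. The random inputs and weights below are assumptions for illustration and are unrelated to the trained model.

```python
import numpy as np

rng = np.random.default_rng(0)

x = rng.random((4, 784))                    # batch of 4 flattened images (random stand-ins)
W = rng.normal(0.0, 0.05, size=(784, 128))  # one weight per (input, neuron) pair
b = np.zeros(128)                           # one bias per neuron

# Every output neuron is a weighted sum over all 784 inputs, followed by ReLU
h = np.maximum(0.0, x @ W + b)
print(h.shape)
```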

`layers.Dropout` is a regularization technique commonly used in neural networks to prevent overfitting. Overfitting occurs when a model learns to memorize the training data rather than generalize well to new, unseen data.

The Dropout layer works by randomly setting a fraction of its input units to zero at each training step, which temporarily "drops out" those units. A dropped unit contributes nothing to the forward pass or to backpropagation for that step. The remaining units are scaled up by 1 / (1 - rate) so that the expected sum of the activations is unchanged, and the layer does nothing at inference time. By randomly dropping units, Dropout adds noise that discourages the network from relying too heavily on any particular set of features.
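
The behaviour can be sketched in NumPy as follows. The function `dropout` is a hypothetical helper written for this illustration, not the Keras implementation; it assumes the common "inverted dropout" scheme in which kept units are rescaled by 1 / (1 - rate).

```python
import numpy as np

def dropout(activations, rate, rng, training=True):
    """Illustrative inverted dropout: zero units with probability `rate`."""
    if not training or rate == 0.0:
        return activations  # no-op at inference time
    keep_prob = 1.0 - rate
    mask = rng.random(activations.shape) < keep_prob  # True = keep the unit
    return activations * mask / keep_prob             # rescale the survivors

rng = np.random.default_rng(0)
h = np.ones((2, 10))

dropped = dropout(h, rate=0.2, rng=rng)                     # training mode
unchanged = dropout(h, rate=0.2, rng=rng, training=False)   # inference mode
print(dropped)
```

Each entry of `dropped` is either 0 (dropped) or 1.25 (kept and rescaled), while `unchanged` equals the input.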

In [17]:
# Define ANN model
model = models.Sequential([
    layers.Dense(128, activation='relu', input_shape=(784,)),  # Hidden layer with 128 neurons and ReLU activation; takes 784 input features
    layers.Dropout(0.2),  # Dropout layer to prevent overfitting
    layers.Dense(10, activation='softmax')  # Output layer with 10 neurons (for 10 classes) and softmax activation
])
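
As a quick check on the size of this model, the number of trainable parameters can be worked out by hand. The helper `dense_params` is a hypothetical name used only for this sketch.

```python
def dense_params(n_in, n_out):
    """Weights (n_in * n_out) plus one bias per output neuron."""
    return n_in * n_out + n_out

hidden = dense_params(784, 128)  # 784 inputs -> 128 hidden neurons
output = dense_params(128, 10)   # 128 hidden neurons -> 10 classes
total = hidden + output          # Dropout adds no parameters

print(hidden, output, total)
```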


# Compiling the model

Compiling the model in the context of neural networks, specifically in frameworks like TensorFlow or Keras, involves configuring the model for training. When you compile a model, you specify several parameters that are necessary for training, including:

**Optimizer:** This is the algorithm used to update the weights of the neural network based on the training data. Common choices include Stochastic Gradient Descent (SGD), Adam, RMSprop, etc. Each optimizer has its own set of hyperparameters that can be tuned.

**Loss Function:** The loss function measures how well the model performs on the training data by comparing the predicted output with the actual target output. It represents the objective that the model aims to minimize during training. The choice of loss function depends on the type of problem you're solving (e.g., binary classification, multi-class classification, regression) and the nature of your data.

**Metrics:** Metrics are used to monitor the performance of the model during training and evaluation. Common metrics for classification tasks include accuracy, precision, recall, F1-score, etc. For regression tasks, metrics like Mean Squared Error (MSE) or Mean Absolute Error (MAE) are often used.

When you compile the model, you specify these components using appropriate arguments. Once the model is compiled, it's ready to be trained on the training data using the specified optimizer, loss function, and metrics.

**Backpropagation** is used implicitly during training in the code provided. When you call `model.fit`, TensorFlow computes the gradient of the loss with respect to every weight by backpropagation (automatic differentiation), and the optimizer chosen at compile time (e.g., SGD or Adam) uses those gradients to update the weights.
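
To show the two roles separately, the sketch below computes a gradient by hand for a softmax output with cross-entropy loss, checks it numerically, and then applies one plain gradient-descent update. The logits, target, and step size are made-up values; a framework would obtain the same gradient automatically.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())  # subtract the max for numerical stability
    return e / e.sum()

def loss(z, y):
    return -np.sum(y * np.log(softmax(z)))  # cross-entropy for one example

z = np.array([2.0, 1.0, 0.1])  # hypothetical logits
y = np.array([1.0, 0.0, 0.0])  # one-hot target

# Gradient step: for softmax + cross-entropy, dLoss/dz = softmax(z) - y
analytic = softmax(z) - y

# Numerical check with central differences
eps = 1e-6
numeric = np.zeros_like(z)
for i in range(z.size):
    step = np.zeros_like(z)
    step[i] = eps
    numeric[i] = (loss(z + step, y) - loss(z - step, y)) / (2 * eps)

# Optimizer step: use the gradient to move the values (plain gradient descent)
z_new = z - 0.1 * analytic
print(loss(z, y), loss(z_new, y))
```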

In [18]:
model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['accuracy'])

The Adam optimizer is a popular optimization algorithm for training deep learning models. It extends stochastic gradient descent (SGD) by computing an adaptive learning rate for each parameter from two moving averages of the gradients: the first moment (mean) and the second moment (uncentered variance).

Here are some key characteristics of the Adam optimizer:

**Adaptive Learning Rate:** Adam adjusts the learning rate for each parameter based on the magnitude of its gradients and the history of past gradients. This adaptivity helps in effectively navigating the loss landscape and converging faster, especially in cases where the gradients have varying magnitudes.

**Bias Correction:** Adam performs bias correction to counteract the initialization bias and ensure that the estimated moments are unbiased, particularly at the beginning of training when the moving averages are initialized to zero.

**Parameter Updates:** It computes individual adaptive learning rates for each parameter, which are used to update the parameters in the direction that minimizes the loss function.

**Momentum:** Adam incorporates momentum similar to SGD with momentum, which helps accelerate convergence by accumulating gradients from previous time steps.

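**Regularization:** Adam itself does not include weight decay or any other regularization. Adding an L2 penalty to the loss is a separate choice, and decoupled weight decay is provided by the AdamW variant. In this notebook, regularization comes from the Dropout layer.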

The Adam optimizer is widely used in various deep learning tasks due to its effectiveness and ease of use. It typically offers good performance across different types of neural network architectures and datasets, making it a popular choice for many researchers and practitioners.
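
The moving averages and bias correction described above can be sketched as a single update step in NumPy. The function `adam_step` is a hypothetical helper following the standard published form of the algorithm, with default values matching the usual Keras defaults; the actual Keras implementation may differ in small details. The toy objective f(w) = w^2 is an assumption for illustration.

```python
import numpy as np

def adam_step(param, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-7):
    """One illustrative Adam update for step number t (starting at 1)."""
    m = beta1 * m + (1 - beta1) * grad       # first moment: moving average of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2  # second moment: moving average of squared gradients
    m_hat = m / (1 - beta1 ** t)             # bias correction (averages start at zero)
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (np.sqrt(v_hat) + eps)
    return param, m, v

# Toy problem: minimize f(w) = w^2, whose gradient is 2w
w, m, v = 1.0, 0.0, 0.0
w_first, m, v = adam_step(w, 2 * w, m, v, t=1)

w = w_first
for t in range(2, 501):
    w, m, v = adam_step(w, 2 * w, m, v, t)
print(w_first, w)
```

On the first step the bias-corrected ratio is close to 1, so the parameter moves by roughly the learning rate regardless of the gradient's magnitude.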

**Categorical cross-entropy** is a loss function used in multi-class classification tasks, where the target variable (i.e., the true labels) is categorical and can take on more than two possible classes. It measures the dissimilarity between the true probability distribution of the classes and the predicted probability distribution output by the model.
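
For one-hot targets the loss reduces to the negative log of the probability assigned to the true class. The sketch below is a NumPy illustration with made-up predictions; `categorical_crossentropy` here is a hypothetical helper, not the Keras function of the same name.

```python
import numpy as np

def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    """Per-example loss: -sum(y_true * log(y_pred))."""
    y_pred = np.clip(y_pred, eps, 1.0 - eps)  # avoid log(0)
    return -np.sum(y_true * np.log(y_pred), axis=-1)

# True label: digit 3
y_true = np.array([[0, 0, 0, 1, 0, 0, 0, 0, 0, 0]], dtype=float)

# A confident, correct prediction: 0.91 on class 3, 0.01 elsewhere
confident = np.full((1, 10), 0.01)
confident[0, 3] = 0.91

# An uninformative prediction: equal probability for all classes
uniform = np.full((1, 10), 0.1)

loss_confident = categorical_crossentropy(y_true, confident)[0]
loss_uniform = categorical_crossentropy(y_true, uniform)[0]
print(loss_confident, loss_uniform)
```

The confident prediction gives a loss of -log(0.91), about 0.09, while the uniform one gives log(10), about 2.30.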

# Train the Model

In [19]:
# Split training data into training and validation sets
x_train_split, x_val, y_train_split, y_val = train_test_split(x_train_flat, y_train_onehot, test_size=0.2, random_state=5)

**Epoch:**
An epoch refers to one complete pass through the entire training dataset. During one epoch, the model sees each training example once and updates its parameters (weights and biases) accordingly to minimize the chosen loss function. Training for multiple epochs allows the model to learn complex patterns from the data by adjusting its parameters iteratively. Typically, the number of epochs is a hyperparameter that needs to be specified by the user based on experimentation and observing the model's performance on a validation set.

**Batch Size:** Batch size refers to the number of training examples used in one iteration. Instead of computing gradients over the entire dataset before each update (as in batch gradient descent), mini-batch training updates the parameters after every batch. A smaller batch size gives more frequent but noisier updates and usually makes each epoch slower, because hardware parallelism is used less efficiently. A larger batch size gives smoother gradient estimates and faster epochs, but fewer updates per epoch and higher memory use. Choosing an appropriate batch size involves balancing computational efficiency and model performance. Common batch sizes range from 8 to 256, depending on the dataset size and computational resources.

In summary, during neural network training:

* Epochs control the number of times the entire dataset is passed forward and backward through the neural network.
* Batch size determines how many training examples are processed simultaneously before updating the model's parameters.
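
The relationship between epochs, batch size, and the number of parameter updates is simple arithmetic. The sketch below uses the values from this notebook (60,000 training images, a 20% validation split, batch size 128, 10 epochs); the variable names are chosen for illustration.

```python
import math

n_total = 60000                   # MNIST training images
n_val = round(n_total * 0.2)      # held out for validation
n_train = n_total - n_val         # used for gradient updates

batch_size = 128
epochs = 10

steps_per_epoch = math.ceil(n_train / batch_size)  # updates per epoch
total_updates = steps_per_epoch * epochs

print(n_train, n_val, steps_per_epoch, total_updates)
```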

In [20]:
#Training the model
img_model = model.fit(x_train_split, y_train_split, epochs=10, batch_size=128, validation_data=(x_val, y_val))

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


# Evaluating the model on test set

`model.evaluate(x_test_flat, y_test_onehot)`: This function evaluates the model on the test data (`x_test_flat`) and corresponding true labels (`y_test_onehot`). It computes the loss and any other metrics specified during the compilation of the model. In this case, it calculates the loss and accuracy.

In [22]:
# Evaluate model on test data
test_loss, test_accuracy = model.evaluate(x_test_flat, y_test_onehot)
print(f'Test Accuracy: {test_accuracy}')


Test Accuracy: 0.9757999777793884


# Predicting on the test set

In [23]:
predictions = model.predict(x_test_flat)



In [24]:
predictions

array([[1.0757532e-07, 4.0014578e-09, 1.1442638e-06, ..., 9.9962610e-01,
        3.2582109e-05, 1.1780765e-04],
       [1.9341680e-08, 1.4482893e-04, 9.9979210e-01, ..., 2.2339750e-13,
        2.0373584e-05, 2.3300431e-10],
       [1.3315276e-06, 9.9848735e-01, 2.5100767e-04, ..., 6.2476232e-04,
        4.4810868e-04, 6.8912045e-06],
       ...,
       [9.8724355e-11, 9.8095609e-10, 3.5479525e-10, ..., 1.2193240e-06,
        9.5068835e-06, 2.6077894e-04],
       [5.4147414e-08, 3.3857582e-08, 3.2052776e-12, ..., 1.9250338e-08,
        8.8209449e-04, 2.5835329e-08],
       [9.8873011e-07, 1.5431493e-11, 2.5710563e-07, ..., 2.0438418e-11,
        6.3097270e-09, 4.7623028e-10]], dtype=float32)
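
Each row of `predictions` is a probability distribution over the 10 digits, so the predicted digit is the index of the largest value. The sketch below shows the conversion on a small made-up array with three classes; `probs` and `true_onehot` are hypothetical stand-ins for `predictions` and `y_test_onehot`.

```python
import numpy as np

# Hypothetical softmax outputs for three examples and three classes
probs = np.array([
    [0.05, 0.05, 0.90],
    [0.70, 0.20, 0.10],
    [0.20, 0.50, 0.30],
])
true_onehot = np.array([
    [0, 0, 1],
    [1, 0, 0],
    [0, 0, 1],
])

predicted_labels = probs.argmax(axis=1)    # most probable class per row
true_labels = true_onehot.argmax(axis=1)   # undo the one-hot encoding

accuracy = (predicted_labels == true_labels).mean()
print(predicted_labels, true_labels, accuracy)
```

In the notebook, the same pattern would be `predictions.argmax(axis=1)` compared against `y_test`.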