# MNIST Digit Classification with our own Framework

Lab Assignment from [AI for Beginners Curriculum](https://github.com/microsoft/ai-for-beginners).

### Reading the Dataset

This code downloads the dataset from the repository on the internet. You can also manually copy the dataset from the `/data` directory of the AI Curriculum repo.

In [4]:
!rm -f *.pkl
!wget https://raw.githubusercontent.com/microsoft/AI-For-Beginners/main/data/mnist.pkl.gz
!gzip -d mnist.pkl.gz

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed

  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100  9.9M  100  9.9M    0     0   9.9M      0  0:00:01 --:--:--  0:00:01 15.8M


In [3]:
import pickle
with open('mnist.pkl','rb') as f:
    MNIST = pickle.load(f)

In [4]:
labels = MNIST['Train']['Labels']
data = MNIST['Train']['Features']

Let's check the shape of the data:

In [5]:
data.shape

(42000, 784)
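Each row holds one flattened image, and 784 = 28 × 28. The sketch below shows how such rows could be reshaped back into 2-D images. It uses a synthetic array (`fake_data`, a hypothetical stand-in for `data`) and assumes the pixels are stored row by row.

```python
import numpy as np

# Synthetic stand-in for `data`: 5 flattened images of 784 values each.
# (Assumption: real rows are 28x28 images stored row by row.)
fake_data = np.arange(5 * 784).reshape(5, 784)

# Recover the 2-D layout: one 28x28 grid per sample.
images = fake_data.reshape(-1, 28, 28)

print(images.shape)  # (5, 28, 28)
```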

### Splitting the Data

We will use scikit-learn to split the data into training and test sets:

In [6]:
from sklearn.model_selection import train_test_split

features_train, features_test, labels_train, labels_test = train_test_split(data,labels,test_size=0.2)

print(f"Train samples: {len(features_train)}, test samples: {len(features_test)}")

Train samples: 33600, test samples: 8400
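To make the 80/20 arithmetic concrete, here is a minimal NumPy sketch of a shuffled index split. It is not how scikit-learn implements `train_test_split`; the seed and variable names are assumptions chosen for illustration. For a reproducible split with scikit-learn itself, `train_test_split` accepts a `random_state` argument.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, chosen arbitrarily

num_samples = 42000
test_size = 0.2

# Shuffle the sample indices, then cut off the first 20% for testing.
indices = rng.permutation(num_samples)
num_test = round(num_samples * test_size)

test_idx = indices[:num_test]
train_idx = indices[num_test:]

print(len(train_idx), len(test_idx))  # 33600 8400
```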


### Instructions

1. Take the framework code from the lesson and paste it into this notebook, or (even better) into a separate Python module.
1. Define and train a one-layer perceptron, observing training and validation accuracy during training.
1. Check whether overfitting took place, and adjust layer parameters to improve accuracy.
1. Repeat the previous steps for 2- and 3-layer perceptrons. Experiment with different activation functions between layers.
1. Try to answer the following questions:
    - Does the inter-layer activation function affect network performance?
    - Do we need a 2- or 3-layer network for this task?
    - Did you experience any problems training the network, especially as the number of layers increased?
    - How do the weights of the network behave during training? You may plot the max abs value of the weights vs. epoch to understand the relation.
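For the last question, one possible way to record the largest absolute weight per epoch is sketched below. It uses a toy two-class problem and the classic perceptron update rule, not the lesson's framework; the data, sizes, and the name `max_abs_history` are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy two-class data (hypothetical, not MNIST); labels are +1 / -1.
X = rng.normal(size=(200, 5))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)

weights = np.zeros(5)
max_abs_history = []  # one entry per epoch

for epoch in range(10):
    for xi, yi in zip(X, y):
        # Classic perceptron rule: update only on misclassified samples.
        if yi * np.dot(weights, xi) <= 0:
            weights += yi * xi
    # Record the largest absolute weight at the end of the epoch.
    max_abs_history.append(float(np.max(np.abs(weights))))

print(max_abs_history)
```

Passing `max_abs_history` to matplotlib's `plt.plot` would then give the max-abs-weight vs. epoch curve.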

### Answer
1. The inter-layer activation function can either improve or degrade the network's performance. It may let training move past a plateau in accuracy, but it may also push the network toward a worse solution (no backtracking is implemented).
2. Simply brute-forcing a few hyperparameters of the network's gradient descent already increased accuracy drastically. Adding more layers did not improve the accuracy of the network's predictions, so additional perceptron layers are not always the way to make a system learn better.
3. As the number of layers increases, the algorithm often converges to a set of weights that it treats as the global maximum when it is only a local maximum. Algorithms that let the network escape local maxima could be added for possibly higher accuracy.
4. 
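One reason the inter-layer activation matters (first answer above) is that linear layers stacked without a nonlinearity collapse into a single linear layer, so extra layers add no expressive power. The sketch below checks this numerically; the layer sizes and random weights are arbitrary assumptions, not values from this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)

x = rng.normal(size=(4, 784))     # 4 hypothetical input vectors
W1 = rng.normal(size=(784, 100))  # first linear layer (sizes are arbitrary)
W2 = rng.normal(size=(100, 10))   # second linear layer

# Two linear layers with no activation in between...
two_layer = (x @ W1) @ W2
# ...give the same result as one linear layer with weights W1 @ W2.
one_layer = x @ (W1 @ W2)
print(np.allclose(two_layer, one_layer))  # True

# With a ReLU between the layers, the collapse no longer holds.
with_relu = np.maximum(0, x @ W1) @ W2
print(np.allclose(with_relu, one_layer))  # False
```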

In [3]:
import matplotlib.pyplot as plt
import numpy as np
import pickle
import os
import random
import gzip

# Load the MNIST data with the correct encoding
with gzip.open('mnist.pkl.gz', 'rb') as mnist_pickle:
    MNIST = pickle.load(mnist_pickle, encoding='latin1')  # Specify encoding explicitly

training_data, validation_data, test_data = MNIST  # Unpack the tuple
# Access training features and labels
train_features, train_labels = training_data
val_features, val_labels = validation_data
test_features, test_labels = test_data
# Normalize features to the range [0, 1] to improve training stability
# Note: this is common practice for many machine learning algorithms. The MNIST dataset contains grayscale images whose pixel values range from 0 to 255
train_features = train_features.astype(np.float32) / 255.0
val_features = val_features.astype(np.float32) / 255.0
test_features = test_features.astype(np.float32) / 255.0

# Filter positive and negative examples for training
def set_mnist_pos_neg(target_label, x_labels, x_features):
    positive_indices = [i for i, label in enumerate(x_labels) if label == target_label]
    negative_indices = [i for i, label in enumerate(x_labels) if label != target_label]

    positive_images = x_features[positive_indices]
    negative_images = x_features[negative_indices]
    
    return positive_images, negative_images

# Train function for a single binary classifier (one-vs-all)
def train(positive_examples, negative_examples, num_iterations, lambda_reg, weights, activation_func):
    num_dims = positive_examples.shape[1]  # Number of features
    if weights is None:  # Initialize weights if not provided
        weights = np.zeros((num_dims, 1))  # Shape: (num_features, 1), initialized to zeros

    for i in range(num_iterations):  # Optimize weights through gradient descent
        pos = random.choice(positive_examples).reshape(-1, 1)  # Shape: (num_features, 1)
        neg = random.choice(negative_examples).reshape(-1, 1)  # Shape: (num_features, 1)

        # Apply the activation function to the weighted sum (dot product)
        pos_output = activation_func(np.dot(weights.T, pos))
        neg_output = activation_func(np.dot(weights.T, neg))

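        # Update weights based on activation outputs (perceptron-style rule)
        # Note: the `< 0` test can only fire if activation_func can return negative values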
        if pos_output < 0:
            weights += pos  # Update weights with positive example
        if neg_output >= 0:
            weights -= neg  # Update weights with negative example

        # Apply L2 regularization
        weights -= lambda_reg * weights  # Regularization step

    return weights

# Train one-vs-all classifiers for one layer with weight updates
def train_all_classes(train_features, train_labels, num_iterations, lambda_reg, weights_list, activation_func):
    if weights_list is None:
        # Initialize weights for 10 classes if not provided
        weights_list = [np.zeros((train_features.shape[1], 1)) for _ in range(10)]  

    updated_weights_list = []
    for digit in range(10):  # Train one classifier per digit (0-9)
        pos_examples, neg_examples = set_mnist_pos_neg(digit, train_labels, train_features)
        # Pass existing weights to the train function to update them
        weights = train(pos_examples, neg_examples, num_iterations, lambda_reg, weights_list[digit], activation_func)
        updated_weights_list.append(weights)

    return updated_weights_list


# Repeated training rounds: each "layer" here re-trains the same 10 one-vs-all
# weight vectors, starting from the weights produced by the previous round
def multi_layer_training(train_features, train_labels, num_iterations, lambda_reg, num_layers, activation_func):
    weights_list = None  # Initialize weights list for the first layer
    accuracies = []  # Track accuracies for each layer
    for layer in range(num_layers):
        print(f"Training Layer {layer + 1}")
        weights_list = train_all_classes(train_features, train_labels, num_iterations, lambda_reg, weights_list, activation_func)
        # Evaluate the current weights on the test set
        predicted_labels = classify_multi_class(weights_list, test_features, test_labels)
        layer_accuracy = accuracy(predicted_labels, test_labels)
        accuracies.append(layer_accuracy)
        print(f"Accuracy after Layer {layer + 1}: {layer_accuracy:.4f}")
    return weights_list, accuracies


# Classification Function with Matrix Multiplication
def classify_multi_class(weights_list, test_features, test_labels):
    # Convert weights list to a (num_features, 10) matrix
    weights_matrix = np.column_stack(weights_list)  # Shape: (num_features, 10)
    # Compute scores: test_features (num_samples, num_features) * weights_matrix (num_features, 10)
    scores = np.dot(test_features, weights_matrix)  # Shape: (num_samples, 10)
    # For each test sample, pick the class with the highest score
    predicted_labels = np.argmax(scores, axis=1)  # Shape: (num_samples, ) - max score index for each sample
    return predicted_labels

def accuracy(predicted_labels, test_labels):
    return float(np.sum(predicted_labels == test_labels) / len(test_labels))

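The index lists built in `set_mnist_pos_neg` could also be expressed with a NumPy boolean mask, which avoids the Python-level loops. The sketch below uses a hypothetical helper name (`split_pos_neg`) and toy arrays; it assumes the labels can be converted to a NumPy array.

```python
import numpy as np

def split_pos_neg(target_label, labels, features):
    """Hypothetical mask-based variant of set_mnist_pos_neg."""
    mask = np.asarray(labels) == target_label  # True where the label matches
    return features[mask], features[~mask]

# Toy data: four samples with one feature each.
labels = np.array([3, 1, 3, 7])
features = np.array([[0.1], [0.2], [0.3], [0.4]])

pos, neg = split_pos_neg(3, labels, features)
print(pos.ravel(), neg.ravel())
```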

In [24]:
# View activation functions here: https://www.geeksforgeeks.org/activation-functions-neural-networks/
# No activation function
def no_activation(x):
    return x
    
# Sigmoid activation function
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# ReLU activation function
def relu(x):
    return np.maximum(0, x)

# Softmax activation function (used for output layer in multi-class classification)
def softmax(x):
    exp_values = np.exp(x - np.max(x, axis=1, keepdims=True))  # Stability improvement
    return exp_values / np.sum(exp_values, axis=1, keepdims=True)

# Train and evaluate the multi-layer perceptron
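num_layers = 10  # Number of layers (training rounds)
# lambda_reg is negative here (-0.015), so the step `weights -= lambda_reg * weights`
# scales the weights up by 1.5% per iteration instead of shrinking them
weights_list, accuracies = multi_layer_training(train_features, train_labels, 200, -0.015, num_layers, no_activation)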
# Final accuracy after all layers
print("Final multi-layer perceptron accuracy:", accuracies[-1])

Training Layer 1
Accuracy after Layer 1: 0.8149
Training Layer 2
Accuracy after Layer 2: 0.8248
Training Layer 3
Accuracy after Layer 3: 0.8256
Training Layer 4
Accuracy after Layer 4: 0.8258
Training Layer 5
Accuracy after Layer 5: 0.8258
Training Layer 6
Accuracy after Layer 6: 0.8258
Training Layer 7
Accuracy after Layer 7: 0.8258
Training Layer 8
Accuracy after Layer 8: 0.8258
Training Layer 9
Accuracy after Layer 9: 0.8258
Training Layer 10
Accuracy after Layer 10: 0.8258
Final multi-layer perceptron accuracy: 0.8258
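A point to keep in mind when swapping `no_activation` for `sigmoid` or `relu`: the update rule in `train` tests `pos_output < 0`, and neither of those functions returns negative values, so that branch would never fire. The sketch below checks the output ranges on a small set of inputs. The 0.5 threshold at the end is one possible alternative and an assumption, not something the notebook uses.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def relu(x):
    return np.maximum(0, x)

z = np.arange(-20, 21, dtype=float)  # a range of pre-activation values

# Neither function returns a negative number on this range,
# so a test of the form `output < 0` is never true.
print(bool(np.any(sigmoid(z) < 0)))  # False
print(bool(np.any(relu(z) < 0)))     # False

# Possible alternative (assumption): compare the sigmoid output with 0.5,
# which is equivalent to testing z < 0.
print(bool(np.array_equal(sigmoid(z) < 0.5, z < 0)))  # True
```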
