[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://githubtocolab.com/HURU-School/HURUAI/blob/main/Lesson%202/07-First%20Model%20Breakdown.ipynb)

# Breaking Down Our First Model

## Setting Up Our Development Environment

### Mounting Colab to Gdrive

In [None]:
#  Mounts Google Colab on Gdrive.
from google.colab import drive
drive.mount('/content/gdrive')

### Move to Drive, Create a Working Directory and Move into it.

In [None]:
# Selects our Gdrive we just mounted above
%cd /content/gdrive/My Drive

# Create our working directory (-p avoids an error if it already exists)
%mkdir -p HuruAI

# Move into the working directory (relative to My Drive)
%cd HuruAI

### Notebook Setup

In [None]:
# The code below sets us up with some nice formatting for our plots.

%matplotlib inline
%config InlineBackend.figure_format = 'retina'

In [None]:
# Import the required packages

import numpy as np

import torch
from torch import nn, optim
import torch.nn.functional as F
from torchvision import datasets, transforms

import matplotlib.pyplot as plt

### Define a plot function that takes an image and shows its predicted class.
Helper functions like this usually live in a separate module and are imported. We keep it in the notebook here so that we do not have to deal with that extra complexity.

In [None]:
def image_preds(image, probs):
    ''' This function is for viewing an image and its predicted class.
    '''
    probs = probs.data.numpy().squeeze()

    fig, (ax1, ax2) = plt.subplots(figsize=(6,9), ncols=2)
    ax1.imshow(image.resize_(1, 28, 28).numpy().squeeze())
    ax1.axis('off')
    ax2.barh(np.arange(10), probs)
    ax2.set_aspect(0.1)
    ax2.set_yticks(np.arange(10))
    ax2.set_yticklabels(['T-shirt/top',
                            'Trouser',
                            'Pullover',
                            'Dress',
                            'Coat',
                            'Sandal',
                            'Shirt',
                            'Sneaker',
                            'Bag',
                            'Ankle Boot'], size='small');
    ax2.set_title('Returned Class Probabilities')
    ax2.set_xlim(0, 1.1)

    plt.tight_layout()

### Preparing the Dataset

#### Defining Our Transforms
Transforms preprocess our images and can also add variety to our data. Common transforms include:
  * Centre Crop - crops a given image at the centre.
  * Color Jitter - randomly changes the brightness, hue, saturation or contrast of an image.
  * Grayscale - converts a color image to grayscale.
  * Random Horizontal Flip - horizontally flips an image with a given probability.

There are more transforms than these. The [transforms documentation](https://pytorch.org/vision/stable/transforms.html) describes each of them and what it does.  
In addition, we convert the images in our dataset into tensors and normalize them. Each pixel in an image is stored as an integer between 0 and 255, and `ToTensor` converts the image to a floating-point tensor with values scaled to [0, 1]. Normalizing then rescales these values, which typically helps the model train faster. We pass `Normalize` the mean and standard deviation it should use, one value per color channel. Our images have a single channel, so we pass `(0.5,)` for both.
$$
image = (image-mean) / std
$$
With a mean and standard deviation of 0.5, this maps the values from [0, 1] to [-1, 1].

In [None]:
# Define a transform to normalize the data
transform = transforms.Compose([transforms.ToTensor(),
                              transforms.Normalize((0.5,), (0.5,)),
                              ])
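As a quick check of the normalization formula, the sketch below applies the same arithmetic to a few hand-picked values using plain tensors. The sample values are made up for illustration, and the sketch assumes the pixels have already been scaled to [0, 1], as `ToTensor` does.

```python
import torch

# Hypothetical pixel values after ToTensor, i.e. already scaled to [0, 1]
pixels = torch.tensor([0.0, 0.25, 0.5, 1.0])

# Same mean and standard deviation we pass to transforms.Normalize
mean, std = 0.5, 0.5
normalized = (pixels - mean) / std

print(normalized.tolist())  # [-1.0, -0.5, 0.0, 1.0]
```

The smallest possible input (0) lands on -1 and the largest (1) lands on 1, which is where the [-1, 1] range comes from.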

#### Downloading the Dataset
We then download the dataset and apply the transformations defined above. Data loaders provide a convenient way to iterate over our datasets. They yield each image together with its label, group the data into batches of the defined size and, with `shuffle=True`, reshuffle the data every time we go through the data loader, which reduces bias from the ordering of the samples. Samplers can also be defined here if the dataset is imbalanced.

In [None]:
# Download and load the training data
trainset = datasets.FashionMNIST('./Data', download=True, train=True, transform=transform)
trainloader = torch.utils.data.DataLoader(trainset, batch_size=64, shuffle=True)

# Download and load the test data
testset = datasets.FashionMNIST('./Data', download=True, train=False, transform=transform)
testloader = torch.utils.data.DataLoader(testset, batch_size=64, shuffle=True)
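To see batching and shuffling without downloading anything, here is a minimal sketch that uses a small synthetic dataset in place of FashionMNIST. The names `fake_images`, `fake_labels` and `toy_loader` are hypothetical and only serve this illustration.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Synthetic stand-in for FashionMNIST: 10 blank 1x28x28 images with labels 0..9
fake_images = torch.zeros(10, 1, 28, 28)
fake_labels = torch.arange(10)
toy_dataset = TensorDataset(fake_images, fake_labels)

# batch_size=4 splits the 10 samples into batches of 4, 4 and 2
toy_loader = DataLoader(toy_dataset, batch_size=4, shuffle=True)

batch_sizes = []
seen_labels = []
for batch_images, batch_labels in toy_loader:
    batch_sizes.append(batch_images.shape[0])
    seen_labels.extend(batch_labels.tolist())

print(batch_sizes)          # [4, 4, 2]
print(sorted(seen_labels))  # every label appears exactly once per pass
```

The order of `seen_labels` changes from pass to pass because of `shuffle=True`, but each sample is still visited exactly once per pass.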

#### Prepare an iterator
We prepare an iterator that will allow us to loop through our data loader, each time picking a batch of images with their corresponding labels, as shown below
```python
for images, labels in trainloader:
    # What to do with the batch of images and labels.
    pass
```

In [None]:
train_iterator = iter(trainloader)
# Use the built-in next(); data loader iterators do not provide a .next() method in Python 3
images, labels = next(train_iterator)
print(type(images))
print(images.shape)
print(labels.shape)

We can then print out an image from the data loader as below

In [None]:
plt.imshow(images[9].numpy().squeeze(), cmap='Greys_r');

### Building Our Network
The network we implemented in our first model is called a _fully connected_ or _dense network_. In this network, each unit in a layer is connected to every unit in the next layer. The input for each image must be a 1-D tensor. Since our images are 2-D tensors (28 x 28 pixels), we need to convert them to 1-D tensors. This process is called [_flattening_](https://pytorch.org/docs/stable/generated/torch.flatten.html). For a batch, we convert the shape from (64, 1, 28, 28) to (64, 784).
Similar to the network before, we need 10 output units, one for each class of clothing item we would like to predict. We calculate the probability that the image provided belongs to each of the classes defined in our labels. This is known as a _discrete probability distribution_ over the classes (clothing items), and it tells us the most likely class.

#### The Network Architecture

Our network consists of an input layer, two hidden layers and an output layer. Typically, the network will need to be _deeper_ than this, but we are keeping things simple. Our network will look as below.
![Network architecture](../images/Lesson_2/nn.png)  

#### Building Our Network Purely from Tensors
In this section we will explore building the network purely from weight matrices. Next, we will explore using torch's nn module to build the network. 

##### Define our activation function
In the first network, we used a ReLU activation function. For this one we will switch things up and explore a new activation function, _the sigmoid activation function_. Mathematically, it is expressed as below:
$$
\sigma(x) = \frac{1}{1 + e^{-x}}
$$
Graphically, the function is represented as below.
![Sigmoid Function](../images/Lesson_2/sigmoid.PNG)  

In [None]:
## Initialize the sigmoid activation function.

def activation(x):
    return 1/(1+torch.exp(-x))
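As a sanity check, the hand-written function can be compared against PyTorch's built-in `torch.sigmoid`. This is only a small sketch on three made-up input values.

```python
import torch

def activation(x):
    # Hand-written sigmoid, as defined in the cell above
    return 1 / (1 + torch.exp(-x))

x = torch.tensor([-2.0, 0.0, 2.0])
ours = activation(x)
builtin = torch.sigmoid(x)

print(ours[1].item())                 # 0.5, since sigmoid(0) = 1 / (1 + 1)
print(torch.allclose(ours, builtin))  # True
```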


##### Flatten the Input Images

In [None]:
# Flatten the input images
inputs = images.view(images.shape[0], -1)
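The same flattening can also be written with `torch.flatten`. The sketch below uses a random tensor with the shape of one batch (an assumption standing in for real images) to show that both approaches give the same result.

```python
import torch

# Synthetic batch with the same shape as a batch from our data loader
batch = torch.randn(64, 1, 28, 28)

# Keep the batch dimension and merge everything else into one dimension
flat_view = batch.view(batch.shape[0], -1)
flat_fn = torch.flatten(batch, start_dim=1)

print(flat_view.shape)                  # torch.Size([64, 784])
print(torch.equal(flat_view, flat_fn))  # True
```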

##### Randomly Initialize Our Weights and Bias

In [None]:
# Initializing Weights and Bias
w1 = torch.randn(784, 256)
b1 = torch.randn(256)

w2 = torch.randn(256, 64)
b2 = torch.randn(64)

w3 = torch.randn(64, 10)
b3 = torch.randn(10)

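# Hidden layer 1: (64, 784) x (784, 256) -> (64, 256)
h1 = activation(torch.mm(inputs, w1) + b1)

# Hidden layer 2: (64, 256) x (256, 64) -> (64, 64)
h2 = activation(torch.mm(h1, w2) + b2)

# Output layer: raw scores for the 10 classes, shape (64, 10)
out = torch.mm(h2, w3) + b3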

##### Calculate Probability Distribution
The probability distribution is calculated by applying a _softmax function_ across the 10 classes. Mathematically, this function is represented as below:
$$
\Large \sigma(x_i) = \cfrac{e^{x_i}}{\sum_{k=1}^{K}{e^{x_k}}}
$$
It squashes each input $x_i$ into the range between 0 and 1 and normalizes the values, resulting in a proper probability distribution whose values all add up to one.

In [None]:
## Define the softmax function

def softmax(x):
    return torch.exp(x)/torch.sum(torch.exp(x), dim=1).view(-1, 1)

probabilities = softmax(out)

# Confirm that indeed the shape is (64, 10)
print(probabilities.shape)
# Confirm that the probabilities all add up to 1
print(probabilities.sum(dim=1))
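One practical caveat with the direct formula is that `torch.exp` overflows for large inputs. A common variation, sketched below, subtracts the row-wise maximum before exponentiating. The function name `stable_softmax` and the test values are hypothetical and chosen for illustration.

```python
import torch

def softmax(x):
    # Direct formula, as in the cell above
    return torch.exp(x) / torch.sum(torch.exp(x), dim=1).view(-1, 1)

def stable_softmax(x):
    # Subtracting the row-wise maximum leaves the result unchanged
    # mathematically, but keeps exp() from overflowing.
    shifted = x - x.max(dim=1, keepdim=True).values
    exps = torch.exp(shifted)
    return exps / exps.sum(dim=1, keepdim=True)

small = torch.tensor([[1.0, 2.0, 3.0]])
large = torch.tensor([[1000.0, 1001.0, 1002.0]])

print(torch.allclose(softmax(small), stable_softmax(small)))  # True
print(torch.isnan(softmax(large)).any().item())               # True: exp(1000) overflows to inf
print(torch.allclose(stable_softmax(large), stable_softmax(small)))  # True
```

PyTorch's own `torch.softmax` can be used in place of either function in practice.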

#### Building the Network with Pytorch's _nn_ Module
Torch provides a handy module called _nn_ that makes building neural networks from scratch pretty easy.
First we inherit from the _nn.Module_ class. Combined with the call to *super().__init__()*, this gives us a Python class that keeps track of our layers and comes with some useful methods and attributes.

```python
self.hidden1 = nn.Linear(784, 256)
```
The line above creates a module for a linear transformation $x\mathbf{W} + b$, with 784 units as input and 256 units as output. The transformation from hidden layer 1 to hidden layer 2 follows a similar approach, with 256 units as input and 64 units as output, as does the output layer, with 64 inputs and 10 outputs.
Next we define the sigmoid activation function and the softmax output function. Setting `dim=1` calculates the softmax across the columns, so that the 10 class probabilities in each row add up to one.
Every _nn.Module_ subclass needs a `forward` method. It takes an input tensor and passes it through the operations defined in the *__init__* method.  
**Note:** The order of the definitions is not particularly important in *__init__*, but the order of the operations is crucial in the forward method.

In [None]:
# Define the network class

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        
        # Inputs to hidden layer 1 linear transformation
        self.hidden1 = nn.Linear(784, 256)
        # Hidden Layer 1 to hidden layer 2 linear transformation
        self.hidden2 = nn.Linear(256, 64)
        # Output layer, 10 units - one for each item of clothing
        self.output = nn.Linear(64, 10)
        
        # Define sigmoid activation and softmax output 
        self.sigmoid = nn.Sigmoid()
        self.softmax = nn.Softmax(dim=1)
        
    def forward(self, x):
        # Pass the input tensor through each of our operations
        x = self.hidden1(x)
        x = self.sigmoid(x)
        x = self.hidden2(x)
        x = self.sigmoid(x)
        x = self.output(x)
        x = self.softmax(x)
        
        return x

In [None]:
# Initialize the network

model = Net()
model
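To get a feel for the size of this architecture, the sketch below counts its learnable parameters. It redefines the same layers in a compact form, and the instance name `net` is only used here to avoid clashing with `model`.

```python
from torch import nn

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.hidden1 = nn.Linear(784, 256)
        self.hidden2 = nn.Linear(256, 64)
        self.output = nn.Linear(64, 10)
        self.sigmoid = nn.Sigmoid()
        self.softmax = nn.Softmax(dim=1)

    def forward(self, x):
        x = self.sigmoid(self.hidden1(x))
        x = self.sigmoid(self.hidden2(x))
        return self.softmax(self.output(x))

net = Net()

# Each nn.Linear holds a weight matrix and a bias vector
for name, param in net.named_parameters():
    print(name, tuple(param.shape))

# (784*256 + 256) + (256*64 + 64) + (64*10 + 10)
total = sum(p.numel() for p in net.parameters())
print(total)  # 218058
```

Note that `nn.Linear(784, 256)` stores its weight with shape (256, 784), that is (outputs, inputs).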

#### Building the network using Pytorch's *nn.functional* Module
This module provides a more concise way to build the network architecture, and it is a very common way to define networks in PyTorch.

In [None]:
# Define the network class
class Net(nn.Module):
    def __init__(self):
        super().__init__()
        # Inputs to hidden layer 1 linear transformation
        self.hidden1 = nn.Linear(784, 256)
        # Hidden Layer 1 to hidden layer 2 linear transformation
        self.hidden2 = nn.Linear(256, 64)
        # Output layer, 10 units - one for each item of clothing
        self.output = nn.Linear(64, 10)
        
    def forward(self, x):
        # Hidden layer 1 with sigmoid activation
        x = F.sigmoid(self.hidden1(x))
        # Hidden layer 2 with sigmoid activation
        x = F.sigmoid(self.hidden2(x))
        # Output layer with softmax activation
        x = F.softmax(self.output(x), dim=1)
        
        return x

##### Weights and Bias Initialization
The weights and biases are initialized for you automatically from a random distribution, unlike when we were building the network purely from tensors.

In [None]:
# Check the weights and biases initialized

print(model.hidden1.weight)
print(model.hidden1.bias)

We could also customize how our weights and biases are initialized as shown below

###### Initializing using a constant value

In [None]:
# Fill all the bias values with zero
model.hidden1.bias.data.fill_(0)

###### Initializing by sampling from a distribution function

In [None]:
# sample from random normal with standard dev = 0.03
model.hidden1.weight.data.normal_(std=0.03)
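An alternative to editing `.data` directly is the `torch.nn.init` module, which offers in-place initializers. The sketch below applies the same two initializations to a standalone layer. The variable `layer` is hypothetical and stands in for `model.hidden1`.

```python
import torch
from torch import nn

torch.manual_seed(0)  # only so that the printed numbers are repeatable
layer = nn.Linear(784, 256)

# Fill all the bias values with zero
nn.init.zeros_(layer.bias)
# Sample the weights from a normal distribution with standard deviation 0.03
nn.init.normal_(layer.weight, mean=0.0, std=0.03)

print(layer.bias.abs().sum().item())  # 0.0
print(layer.weight.std().item())      # close to 0.03
```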

##### Making our forward pass through the Network.

In [None]:
# Grab some data 
train_iterator = iter(trainloader)
# Use the built-in next(); data loader iterators do not provide a .next() method in Python 3
images, labels = next(train_iterator)

# Flatten each image into a 1D vector, new shape is (batch size, color channels, image pixels)
# Using images.shape[0] picks up the batch size automatically
images.resize_(images.shape[0], 1, 784)

# Forward pass through the network
image_index = 0
probs = model.forward(images[image_index,:])

image = images[image_index]
image_preds(image.view(1, 28, 28), probs)

Our Network is not yet trained. As you can see in the plot above, it is just making random guesses. This is because we initialized the weights and biases from a random distribution, hence the random predictions.

#### Building the Network using the nn.Sequential Module

`nn.Sequential` is a container that passes the input tensor through its modules in the order in which they are listed, so we do not need to write a `forward` method ourselves.

In [None]:
# Hyperparameters for our network
input_size = 784
hidden_sizes = [256, 64]
output_size = 10

# Build a feed-forward network
model = nn.Sequential(nn.Linear(input_size, hidden_sizes[0]),
                      nn.ReLU(),
                      nn.Linear(hidden_sizes[0], hidden_sizes[1]),
                      nn.ReLU(),
                      nn.Linear(hidden_sizes[1], output_size),
                      nn.Softmax(dim=1))
print(model)

# Forward pass through the network and display output
images, labels = next(iter(trainloader))
images.resize_(images.shape[0], 1, 784)
probs = model.forward(images[0,:])
image_preds(images[0].view(1, 28, 28), probs)
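`nn.Sequential` also accepts an `OrderedDict`, which lets us give each layer a name instead of referring to it only by index. The layer names below (`fc1`, `relu1` and so on) are arbitrary choices for this sketch, and the input is random data rather than real images.

```python
from collections import OrderedDict

import torch
from torch import nn

named_model = nn.Sequential(OrderedDict([
    ('fc1', nn.Linear(784, 256)),
    ('relu1', nn.ReLU()),
    ('fc2', nn.Linear(256, 64)),
    ('relu2', nn.ReLU()),
    ('output', nn.Linear(64, 10)),
    ('softmax', nn.Softmax(dim=1)),
]))

# Layers can be reached by index or by name
print(named_model[0] is named_model.fc1)  # True

# Forward pass on 5 random "images" that are already flattened
probs = named_model(torch.randn(5, 784))
print(probs.shape)  # torch.Size([5, 10])
```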

### Training the Network.
Training a neural network is the process of finding a set of weights that gives the network architecture the best performance on the problem at hand. The network we saw above is not that smart yet. We train it by showing it examples of real-world data and adjusting its weights so that it approximates the function that maps our inputs (images) to the correct outputs (classes). The power of neural networks is that, given enough data and compute capacity, a large enough network can be trained to approximate this function, and a very wide range of other functions as well.

#### Define the loss function
To train our network, we first need to have some measure of how well the network is performing. We calculate a **loss function**, which measures our prediction error. A common loss function used in regression problems is the **mean squared loss**, expressed as below:
$$
\large \ell = \frac{1}{2n}\sum_{i=1}^{n}{\left(y_i - \hat{y}_i\right)^2}
$$

where $n$ is the number of training examples, $y_i$ are the true labels, and $\hat{y}_i$ are the predictions. For a multi-class classification problem like ours, we instead use the **negative log likelihood loss** (`nn.NLLLoss`), which expects log-probabilities as its input. We will be diving deeper into loss functions later, but in the meantime, you can find the [documentation for loss functions](https://pytorch.org/docs/stable/nn.html#loss-functions) here.  
By continuously minimizing this loss with respect to the network parameters, we arrive at a set of parameters that gives us a minimal loss. The iterative process used to find this minimum is called **gradient descent.**  

In [None]:
# Define Our loss function.
criterion = nn.NLLLoss()
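To make clear what `nn.NLLLoss` computes, the sketch below evaluates it on made-up scores in three ways: with `nn.NLLLoss` on log-probabilities, by hand, and with `nn.CrossEntropyLoss` on the raw scores. The tensors `scores` and `targets` are invented for this illustration.

```python
import torch
from torch import nn
import torch.nn.functional as F

torch.manual_seed(0)
scores = torch.randn(4, 10)           # made-up raw outputs for 4 images
targets = torch.tensor([3, 0, 9, 1])  # made-up class labels

# NLLLoss expects log-probabilities, so apply log_softmax first
log_probs = F.log_softmax(scores, dim=1)
nll = nn.NLLLoss()(log_probs, targets)

# By hand: take the log-probability of the correct class for each image,
# negate it, and average over the batch
manual = -log_probs[torch.arange(4), targets].mean()

# CrossEntropyLoss combines log_softmax and NLLLoss in a single step
ce = nn.CrossEntropyLoss()(scores, targets)

print(torch.allclose(nll, manual))  # True
print(torch.allclose(nll, ce))      # True
```

The key point is that the input to `nn.NLLLoss` must be log-probabilities. Passing plain softmax probabilities would give a loss that is not the negative log likelihood.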

#### Calculate Gradients
Torch provides a module called **Autograd** to automatically calculate the gradients of tensors. These gradients are used to update the parameters of our architecture. PyTorch automatically creates all the parameters of our layers with *requires_grad=True*. After we calculate the loss, we can call `loss.backward()` and the gradients are calculated automatically.  

In [None]:
# Flatten Our images so that we can pass them through a fully connected network
images = images.view(images.shape[0], -1)

# Pass the images through our model to get the probabilities, then take the log
# because NLLLoss expects log-probabilities
logps = torch.log(model(images))

# Calculate the loss
loss = criterion(logps, labels)

# Let us view the gradients before and after the backward pass.
print('Before backward pass: \n', model[0].weight.grad)

loss.backward()

print('After backward pass: \n', model[0].weight.grad)
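The same before and after behaviour is easier to verify on a tiny example where we know the derivative. In the sketch below, $y = \sum_i x_i^2$, so the gradient with respect to each $x_i$ should be $2x_i$. The values are made up for illustration.

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y = (x ** 2).sum()

print(x.grad)           # None: no backward pass has run yet
y.backward()
print(x.grad.tolist())  # [2.0, 4.0, 6.0], i.e. 2 * x
```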

#### Define Our Optimizer Function
Torch also provides an _optim_ package to update the weights with the calculated gradients. Optimizers usually require that we pass the parameters we want to optimize, and a learning rate. More on the learning rate soon.  
When we do multiple backward passes, the gradients are accumulated. Hence we need to call `optimizer.zero_grad()` on every training pass, before the backward pass, to clear the gradients left over from the previous training passes.

In [None]:
# Define the optimizer passing in the parameters and a learning rate
optimizer = optim.SGD(model.parameters(), lr=0.01)
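Gradient accumulation can be seen on a single made-up parameter. In this sketch, `w` and `toy_optimizer` are hypothetical names, and the loss is simply `3 * w`, so each backward pass contributes a gradient of 3.

```python
import torch
from torch import optim

w = torch.tensor([1.0], requires_grad=True)
toy_optimizer = optim.SGD([w], lr=0.01)

(3 * w).sum().backward()
first = w.grad.item()        # 3.0

(3 * w).sum().backward()     # no zero_grad() in between
accumulated = w.grad.item()  # 6.0: added on top of the previous gradient

toy_optimizer.zero_grad()
# Depending on the PyTorch version, zero_grad() either sets .grad to None
# or fills it with zeros
cleared = w.grad is None or w.grad.item() == 0.0

print(first, accumulated, cleared)  # 3.0 6.0 True
```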

#### Training Process In Full
The process of training a neural network is as below:
  * We make a forward pass through the network.
  * We take the output of the forward pass and use it to calculate our loss
  * We perform a backward pass to calculate our gradients
  * We take an optimizer step to update the parameters of our architecture

In [None]:
# Instantiate the network once again to cancel out everything we have just done.
# The output layer uses LogSoftmax because NLLLoss expects log-probabilities.
model = nn.Sequential(nn.Linear(input_size, hidden_sizes[0]),
                      nn.ReLU(),
                      nn.Linear(hidden_sizes[0], hidden_sizes[1]),
                      nn.ReLU(),
                      nn.Linear(hidden_sizes[1], output_size),
                      nn.LogSoftmax(dim=1))

# Define Our Loss
criterion = nn.NLLLoss()

# Define our optimization function
optimizer = optim.SGD(model.parameters(), lr=0.001)

# Define the number of iterations on the full dataset to make during the training process
epochs = 5

# Train the Network
for e in range(epochs):
    running_loss = 0
    for images, labels in trainloader:
        # Flatten our images into a 784 long vector
        images = images.view(images.shape[0], -1)
    
        # Zero Out the gradients on every training pass
        optimizer.zero_grad()
        
        # Calculate the log-probabilities (predictions) generated by the model
        output = model(images)
        # Calculate our loss
        loss = criterion(output, labels)
        # Calculate Our gradients to update the model
        loss.backward()
        # Perform an optimization step to update the parameters(Weights)
        optimizer.step()
        
        # Track Our loss. It should be decreasing on every iteration
        running_loss += loss.item()
    else:
        print(f"Training loss: {running_loss/len(trainloader)}")
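To see what a single optimizer step does, the sketch below runs one pass of the four steps on a single made-up parameter and compares the result with the plain gradient descent update, `w_new = w - lr * gradient`. The names `w`, `lr` and `toy_optimizer` are chosen only for this illustration.

```python
import torch
from torch import optim

lr = 0.1
w = torch.tensor([2.0], requires_grad=True)
toy_optimizer = optim.SGD([w], lr=lr)

toy_optimizer.zero_grad()
loss = (w ** 2).sum()  # "forward pass" and loss; d(loss)/dw = 2w = 4
loss.backward()        # backward pass fills w.grad

# What plain gradient descent predicts for the new value
expected = w.item() - lr * w.grad.item()

toy_optimizer.step()   # update the parameter
print(w.item(), expected)  # both approximately 1.6
```

With plain SGD (no momentum or weight decay), `optimizer.step()` is exactly this update applied to every parameter.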

### Testing the Network

In [None]:
test_loss = 0
accuracy = 0
test_losses = []

# Turn off gradients for validation, saves memory and computations
with torch.no_grad():
    model.eval()
    for images, labels in testloader:
        # Flatten our images into a 784 long vector
        images = images.view(images.shape[0], -1)
        # Pass the images through the model and get the log-probabilities
        log_ps = model(images)
        # Calculate the test loss
        test_loss += criterion(log_ps, labels).item()
        
        # Convert the log-probabilities to probabilities and pick the most likely class
        ps = torch.exp(log_ps)
        top_p, top_class = ps.topk(1, dim=1)
        equals = top_class == labels.view(*top_class.shape)
        accuracy += torch.mean(equals.type(torch.FloatTensor)).item()

test_losses.append(test_loss/len(testloader))

print("Test Loss: {:.3f}.. ".format(test_losses[-1]),
      "Test Accuracy: {:.3f}%".format((accuracy * 100)/len(testloader)))

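The accuracy calculation can be traced by hand on a tiny example. The sketch below uses made-up probabilities for 4 images and 3 classes (not real model output) and follows the same `topk` and comparison steps as the test loop.

```python
import torch

# Made-up log-probabilities for 4 images and 3 classes
log_ps = torch.log(torch.tensor([[0.7, 0.2, 0.1],
                                 [0.1, 0.8, 0.1],
                                 [0.3, 0.3, 0.4],
                                 [0.6, 0.3, 0.1]]))
labels = torch.tensor([0, 1, 2, 1])

# Most likely class per image
ps = torch.exp(log_ps)
top_p, top_class = ps.topk(1, dim=1)

# top_class has shape (4, 1), so reshape the labels to match before comparing
equals = top_class == labels.view(*top_class.shape)
accuracy = equals.float().mean().item()

print(top_class.view(-1).tolist())  # [0, 1, 2, 0]
print(accuracy)                     # 0.75: three of the four predictions are correct
```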
### Deployment.