# Logistic Regression Model:

In [2]:
# Imports
import numpy as np
import torch 
import torch.nn.functional as F
from torchvision import datasets, transforms
from tqdm.notebook import tqdm

PyTorch AutoGrad Mechanics

The optimizer in PyTorch, such as "torch.optim.SGD", does not compute gradients itself. It relies on PyTorch's automatic differentiation system, known as Autograd, to compute the gradients of the loss with respect to the model parameters (e.g., weights and biases), and then uses those gradients to update the parameters. Here's how it works:
    1. Autograd Mechanics: PyTorch's Autograd system automatically computes the gradients of tensors. When you perform operations on tensors that have "requires_grad=True", PyTorch records these operations in a computation graph. When you call ".backward()" on a tensor (usually the loss tensor), Autograd computes the gradient of that tensor with respect to every leaf tensor in the graph that has "requires_grad=True". These gradients are accumulated in the ".grad" attribute of each such tensor, which is why the training loop calls "optimizer.zero_grad()" before each new backward pass.
    2. Gradient Computation: In the training loop, after the forward pass and the computation of the loss ("cross_entropy"), calling "cross_entropy.backward()" triggers the computation of gradients for the tensors involved in computing the loss that have "requires_grad=True". Here these are the model parameters "W" and "b".
    3. Optimizer Updates Parameters: Once the gradients are computed, the optimizer ("torch.optim.SGD" in this case) reads them from ".grad" and updates the model parameters in the direction that decreases the loss. For plain SGD this means subtracting the gradient times the learning rate from the current parameter values. The optimizer does not need to know the value of the loss itself; it only needs the gradients to perform the update.
Summary: Autograd computes the gradients for tensors that have "requires_grad=True" when ".backward()" is called, and the optimizer then uses these gradients to update the model parameters, enabling the learning process.
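
The division of labor described above can be checked on a small example. The sketch below uses hypothetical toy tensors ("w", "b", "x", "target") and a squared-error loss, not the MNIST model from this notebook, and assumes plain SGD with no momentum or weight decay. It compares the result of "optimizer.step()" with a manual "param - lr * grad" update.

```python
import torch

torch.manual_seed(0)

# Hypothetical toy parameters and data (assumptions for illustration only)
w = torch.randn(3, requires_grad=True)
b = torch.zeros(1, requires_grad=True)
x = torch.tensor([1.0, 2.0, 3.0])
target = torch.tensor(2.0)

lr = 0.1
optimizer = torch.optim.SGD([w, b], lr=lr)

# Forward pass: autograd records the operations that produce the loss
loss = ((w * x).sum() + b - target).pow(2).sum()

# Backward pass: autograd (not the optimizer) fills in w.grad and b.grad
optimizer.zero_grad()
loss.backward()

# What a plain SGD step should produce: param - lr * grad
expected_w = (w - lr * w.grad).detach().clone()
expected_b = (b - lr * b.grad).detach().clone()

# The optimizer only reads .grad and updates the parameters in place
optimizer.step()

step_matches_manual = torch.allclose(w, expected_w) and torch.allclose(b, expected_b)
print(step_matches_manual)
```

The loss value is never passed to the optimizer; it only sees the parameters it was constructed with and their ".grad" attributes.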

In [3]:
# Load the data
mnist_train = datasets.MNIST(root="./datasets", train=True, transform=transforms.ToTensor(), download=True)
mnist_test = datasets.MNIST(root="./datasets", train=False, transform=transforms.ToTensor(), download=True)
train_loader = torch.utils.data.DataLoader(mnist_train, batch_size=100, shuffle=True)
test_loader = torch.utils.data.DataLoader(mnist_test, batch_size=100, shuffle=False)

Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz to ./datasets\MNIST\raw\train-images-idx3-ubyte.gz


100%|██████████| 9912422/9912422 [00:00<00:00, 20369357.00it/s]


Extracting ./datasets\MNIST\raw\train-images-idx3-ubyte.gz to ./datasets\MNIST\raw

Downloading http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz to ./datasets\MNIST\raw\train-labels-idx1-ubyte.gz


100%|██████████| 28881/28881 [00:00<00:00, 28821245.26it/s]

Extracting ./datasets\MNIST\raw\train-labels-idx1-ubyte.gz to ./datasets\MNIST\raw

Downloading http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz to ./datasets\MNIST\raw\t10k-images-idx3-ubyte.gz



100%|██████████| 1648877/1648877 [00:00<00:00, 13241781.96it/s]


Extracting ./datasets\MNIST\raw\t10k-images-idx3-ubyte.gz to ./datasets\MNIST\raw

Downloading http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz to ./datasets\MNIST\raw\t10k-labels-idx1-ubyte.gz


100%|██████████| 4542/4542 [00:00<?, ?it/s]

Extracting ./datasets\MNIST\raw\t10k-labels-idx1-ubyte.gz to ./datasets\MNIST\raw





In [4]:
# Training
# Initialize parameters
# Xavier-style initialization: scaling random weights by 1/sqrt(784) (the number of inputs)
# keeps the initial outputs at a moderate scale, which helps avoid vanishing/exploding gradients,
# instead of starting from all-zero weights

W = torch.randn(784, 10)/np.sqrt(784)
W.requires_grad_() # Tells PyTorch autograd to track operations on W so its gradient can be computed
b = torch.zeros(10, requires_grad=True)

# Optimizer
optimizer = torch.optim.SGD([W,b], lr=0.1)
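
The effect of dividing W by sqrt(784) can be seen by comparing the spread of the initial outputs with and without the scaling. This is only a rough sketch: the input "x" is a hypothetical stand-in for a batch of flattened images (uniform random values in [0, 1)), not real MNIST data.

```python
import torch

torch.manual_seed(0)

# Hypothetical stand-in for a batch of 100 flattened 28x28 images
x = torch.rand(100, 784)

W_unscaled = torch.randn(784, 10)
W_scaled = W_unscaled / 784 ** 0.5  # same scaling as in the notebook

# Standard deviation of the initial outputs (logits) in each case
logit_std_unscaled = (x @ W_unscaled).std().item()
logit_std_scaled = (x @ W_scaled).std().item()
print(logit_std_unscaled, logit_std_scaled)
```

Without the scaling, the outputs are 28 times larger (sqrt(784) = 28), so softmax would start out close to saturated on most inputs.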

In [5]:
# Iterate through train set minibatch
for images, labels in tqdm(train_loader):
    # Zero out the gradients
    optimizer.zero_grad()
    
    # Forward pass
    x = images.view(-1, 28*28) # Flatten each 28x28 image: shape [batch_size, 784] | the "-1" lets PyTorch infer the size of the first dimension (100 here)
    y = torch.matmul(x, W) + b # Matrix multiplication plus bias: scores (logits) of shape [batch_size, 10]
    cross_entropy = F.cross_entropy(y, labels) # F.cross_entropy applies softmax (as log-softmax) as well as the cross-entropy loss
        # Softmax: converts the logits (the matmul output y) into probabilities
            # Each output is between 0 and 1, and the outputs for one image add up to 1
        # Cross-entropy loss: negative log of the probability assigned to the correct label, averaged over the minibatch
    
    # Backward Pass
    cross_entropy.backward() # Calculate the gradient of the loss with respect to W and b (gradient = slope of the loss)
    optimizer.step() # Update values based on the gradient: new_value = old_value - (learning_rate * gradient)
        # Learning rate (LR): dictates the step size of the update
        # Gradient: points in the direction in which the loss increases fastest, so subtracting it decreases the loss
        # Keep in mind that LR and the gradient are multiplied, so with SGD a large gradient causes a large update step

  0%|          | 0/600 [00:00<?, ?it/s]
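
To make the softmax and cross-entropy comments in the training loop concrete, the sketch below computes the loss by hand and compares it with "F.cross_entropy". The logits and labels are hypothetical random values, not outputs of the trained model.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Hypothetical logits for a batch of 4 examples and 10 classes
logits = torch.randn(4, 10)
labels = torch.tensor([3, 0, 7, 1])

# Softmax turns each row of logits into probabilities that sum to 1
probs = F.softmax(logits, dim=1)

# Cross-entropy: negative log-probability of the correct class,
# averaged over the batch (the default reduction of F.cross_entropy)
manual_loss = -torch.log(probs[torch.arange(4), labels]).mean()

builtin_loss = F.cross_entropy(logits, labels)
print(torch.allclose(manual_loss, builtin_loss))
```

"F.cross_entropy" expects raw logits, so applying softmax before calling it would normalize twice.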

In [6]:
# Testing
correct = 0
total = len(mnist_test)

# Keep in mind testing occurs after training, so W and b hold their final (most recently updated) values
# torch.no_grad() turns off gradient tracking, since no parameters are updated during testing

with torch.no_grad():
    # Iterate through test set minibatches
    for images, labels in tqdm(test_loader):
        # Forward pass
        x = images.view(-1, 28*28)
        y = torch.matmul(x, W) + b
        
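        # Keep in mind that each row of "y" holds 10 scores (logits), one per digit --> [score0, score1, ..., score9]
        # torch.argmax with dim=1 returns the index of the largest score in each row, which is the predicted digit
        # Softmax is not needed here because it does not change which score is the largest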
        
        predictions = torch.argmax(y, dim=1)
        correct += torch.sum((predictions == labels).float())
        
print('Test accuracy: {}'.format(correct/total))

  0%|          | 0/100 [00:00<?, ?it/s]

Test accuracy: 0.9028000235557556
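
The prediction and counting steps of the test loop can be traced on a tiny example. The logits and labels below are hypothetical values with 4 classes instead of 10, chosen only so the result is easy to check by eye.

```python
import torch
import torch.nn.functional as F

# Hypothetical logits for 3 examples and 4 classes (the notebook uses 10 classes)
logits = torch.tensor([[2.0, 0.5, -1.0, 0.1],
                       [0.0, 0.3, 3.0, -2.0],
                       [1.0, 1.5, 0.2, 0.0]])
labels = torch.tensor([0, 2, 3])

# argmax over dim=1 gives the index of the largest score in each row
preds_from_logits = torch.argmax(logits, dim=1)
# Softmax preserves the ordering of the scores, so the predictions are the same
preds_from_probs = torch.argmax(F.softmax(logits, dim=1), dim=1)

# .item() converts the count to a plain Python number
num_correct = (preds_from_logits == labels).sum().item()
accuracy = num_correct / len(labels)
print(preds_from_logits.tolist(), accuracy)  # predictions [0, 2, 1], 2 of 3 correct
```

Converting the count with ".item()" keeps the accuracy as a Python float, which avoids the float32 rounding visible in the printed 0.9028000235557556.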


Extra Notes on Non-Linear Activation Functions:

Non-linearity in the context of neural networks refers to the ability of the network to model complex, non-linear relationships between inputs and outputs. This is crucial for tasks where the underlying data or the relationships between features are not linear. The logistic regression model above computes a single linear map (x times W plus b) followed by softmax, so its scores are linear in the inputs. Non-linearity is introduced into neural networks by placing non-linear activation functions between multiple layers. Stacking layers alone is not enough, because a sequence of linear layers without activations is still a single linear map. Example: the ReLU function.
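
A short sketch of why the activation function matters, using hypothetical random weights "W1" and "W2" with biases left out for brevity: two linear layers with nothing in between behave exactly like one linear layer, and inserting ReLU between them breaks that equivalence.

```python
import torch

torch.manual_seed(0)

# Hypothetical input batch and layer weights (biases omitted for brevity)
x = torch.randn(5, 8)
W1 = torch.randn(8, 16)
W2 = torch.randn(16, 4)

# Two linear layers with no activation in between...
two_linear = (x @ W1) @ W2
# ...are equivalent to one linear layer with weights W1 @ W2
one_linear = x @ (W1 @ W2)
collapses = torch.allclose(two_linear, one_linear, atol=1e-3)

# Inserting ReLU between the layers breaks that equivalence
with_relu = torch.relu(x @ W1) @ W2
relu_differs = not torch.allclose(with_relu, one_linear, atol=1e-3)
print(collapses, relu_differs)
```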