# PyTorch

- open-source (by Facebook, now Meta; co-created by Soumith Chintala)
- for building and training neural networks (NN)
- flexible
- efficient
- GPU support
- fast
- integrates with other libraries (Hugging Face, LangChain, etc.)
- widely used

## Key Features
**Dynamic Computational Graphs**: The data flow through the graph is flexible and depends on the input structure, so different batches can produce different graphs. For example, in NLP, where sequences need padding, dynamic computational graphs (DCGs) let us pad at the batch level rather than to a global maximum length.

**The graph is built on the fly as operations are executed**, i.e. the structure of the graph is determined at runtime based on the data flowing through it.
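
A minimal sketch of batch-level padding, assuming a toy vocabulary of 100 token ids with id 0 reserved for padding (the helper name `embed_batch` and all sizes are illustrative, not from the notes). Each batch is padded only to its own longest sequence, so the tensors flowing through the same layer have different shapes from batch to batch.

```python
import torch
from torch import nn
from torch.nn.utils.rnn import pad_sequence

# Assumed toy setup: 100 token ids, id 0 is the padding token.
embedding = nn.Embedding(num_embeddings=100, embedding_dim=8, padding_idx=0)

def embed_batch(sequences):
    # Pad only to the longest sequence in *this* batch, not a global maximum.
    padded = pad_sequence(sequences, batch_first=True, padding_value=0)
    return embedding(padded)

batch_a = [torch.tensor([5, 7, 9]), torch.tensor([3, 4])]
batch_b = [torch.tensor([1, 2, 3, 4, 5, 6]), torch.tensor([8])]

out_a = embed_batch(batch_a)  # padded to length 3
out_b = embed_batch(batch_b)  # padded to length 6

print(out_a.shape, out_b.shape)
```

The same `embedding` layer handles both batches because the graph is rebuilt on every forward pass.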

**GPU Acceleration**: built-in CUDA support
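
As a small illustration (the tensor names and sizes are arbitrary), tensors can be created on or moved to a CUDA device when one is available. The sketch falls back to the CPU otherwise, so it runs on any machine.

```python
import torch

# Use the GPU if PyTorch can see one, otherwise stay on the CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"

a = torch.ones(3, 3, device=device)  # created directly on the device
b = torch.ones(3, 3).to(device)      # created on the CPU, then moved

c = a @ b           # the matrix multiply runs on `device`
result = c.cpu()    # copy back to the CPU, e.g. for printing or NumPy

print(device, result)
```

Tensors that take part in the same operation must be on the same device, which is why the model and each batch are both moved to the device before a forward pass.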

**Autograd functionality**: enables automatic computation of gradients, simplifying the implementation of backpropagation
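
A minimal worked example of autograd, using an arbitrary function chosen for illustration: for y = sum(x^2), the gradient dy/dx is 2x, and autograd recovers it without any hand-written derivative.

```python
import torch

# requires_grad=True asks autograd to track operations on x.
x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)

y = (x ** 2).sum()  # y = x1^2 + x2^2 + x3^2
y.backward()        # populates x.grad with dy/dx

print(x.grad)  # tensor([2., 4., 6.]), i.e. 2 * x
```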

## PyTorch API

- Working with data
- Creating Models
- Optimizing the Model Parameters
- Saving Models
- Loading Models

## Working with data

- _**torch.utils.data.DataLoader**_: wraps an iterable around the Dataset
- _**torch.utils.data.Dataset**_: stores samples and their corresponding labels
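
To show how the two classes fit together, here is a sketch of a custom Dataset wrapped in a DataLoader. The class `ToyDataset` and its synthetic data are hypothetical and only for illustration. The notes themselves use the ready-made FashionMNIST dataset from torchvision.

```python
import torch
from torch.utils.data import DataLoader, Dataset

class ToyDataset(Dataset):
    """Hypothetical dataset: one float feature per sample, parity as label."""

    def __init__(self, n_samples=10):
        self.features = torch.arange(n_samples, dtype=torch.float32).unsqueeze(1)
        self.labels = torch.arange(n_samples) % 2

    def __len__(self):
        # Number of samples; the DataLoader uses this to plan batches.
        return len(self.features)

    def __getitem__(self, idx):
        # Return one (sample, label) pair.
        return self.features[idx], self.labels[idx]

toy_loader = DataLoader(ToyDataset(), batch_size=4, shuffle=False)

# 10 samples with batch_size=4 gives batches of 4, 4 and 2.
batch_shapes = [(X.shape, y.shape) for X, y in toy_loader]
print(batch_shapes)
```

A map-style Dataset only needs `__len__` and `__getitem__`. Batching, and optionally shuffling via `shuffle=True`, is handled by the DataLoader.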

In [1]:
import torch
from torch.utils.data import DataLoader, Dataset

from torchvision import datasets
from torchvision.transforms import ToTensor

In [2]:
training_data = datasets.FashionMNIST(
    root="data",
    train=True,
    download=True,
    transform=ToTensor()
)

test_data = datasets.FashionMNIST(
    root="data",
    train=False,
    download=True,
    transform=ToTensor()
)

Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/train-images-idx3-ubyte.gz
Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/train-images-idx3-ubyte.gz to data\FashionMNIST\raw\train-images-idx3-ubyte.gz


100%|██████████| 26421880/26421880 [00:22<00:00, 1151834.41it/s]


Extracting data\FashionMNIST\raw\train-images-idx3-ubyte.gz to data\FashionMNIST\raw

Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/train-labels-idx1-ubyte.gz
Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/train-labels-idx1-ubyte.gz to data\FashionMNIST\raw\train-labels-idx1-ubyte.gz


100%|██████████| 29515/29515 [00:00<00:00, 360389.52it/s]


Extracting data\FashionMNIST\raw\train-labels-idx1-ubyte.gz to data\FashionMNIST\raw

Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/t10k-images-idx3-ubyte.gz
Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/t10k-images-idx3-ubyte.gz to data\FashionMNIST\raw\t10k-images-idx3-ubyte.gz


100%|██████████| 4422102/4422102 [00:01<00:00, 2461669.40it/s]


Extracting data\FashionMNIST\raw\t10k-images-idx3-ubyte.gz to data\FashionMNIST\raw

Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/t10k-labels-idx1-ubyte.gz
Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/t10k-labels-idx1-ubyte.gz to data\FashionMNIST\raw\t10k-labels-idx1-ubyte.gz


100%|██████████| 5148/5148 [00:00<?, ?it/s]


Extracting data\FashionMNIST\raw\t10k-labels-idx1-ubyte.gz to data\FashionMNIST\raw



In [4]:
batch_size = 64

train_dataloader = DataLoader(
    dataset=training_data,
    batch_size=batch_size,
)

test_dataloader = DataLoader(
    dataset=test_data,
    batch_size=batch_size
)

for X, y in train_dataloader:
    print(f"Shape of train X: {X.shape}")
    print(f"Shape of y: {y.shape}")
    break

for X, y in test_dataloader:
    print(f"Shape of test X: {X.shape}")
    print(f"Shape of y: {y.shape}")
    break

Shape of train X: torch.Size([64, 1, 28, 28])
Shape of y: torch.Size([64])
Shape of test X: torch.Size([64, 1, 28, 28])
Shape of y: torch.Size([64])


Each batch holds 64 images, and each image has 1 channel and is 28x28 pixels, so X has shape [64, 1, 28, 28]. y holds the 64 corresponding class labels.

In [6]:
training_data, test_data

(Dataset FashionMNIST
     Number of datapoints: 60000
     Root location: data
     Split: Train
     StandardTransform
 Transform: ToTensor(),
 Dataset FashionMNIST
     Number of datapoints: 10000
     Root location: data
     Split: Test
     StandardTransform
 Transform: ToTensor())

There are 60000 images in the training set and 10000 in the test set.

## Creating Models

- _**nn.Module**_: create a class that inherits from nn.Module
- _**__init__**_: define the layers of the network in the __init__ function
- _**forward**_: define how data passes through the network in the forward function
- _**GPU instance**_: to accelerate operations in the NN, we move the model instance to the GPU (if one is available)

In [7]:
from torch import nn

In [8]:
device = "cuda" if torch.cuda.is_available() else "cpu"

device

'cpu'

In [16]:
class NeuralNetwork(nn.Module):
    def __init__(self):
        super().__init__()
        self.flatten = nn.Flatten()
        self.linear_relu_stack = nn.Sequential(
            nn.Linear(28*28, 512),
            nn.ReLU(),
            nn.Linear(512, 512),
            nn.ReLU(),
            nn.Linear(512, 10)
        )

    def forward(self, x):
        x = self.flatten(x)
        logits = self.linear_relu_stack(x)
        return logits

model = NeuralNetwork().to(device=device)
model

NeuralNetwork(
  (flatten): Flatten(start_dim=1, end_dim=-1)
  (linear_relu_stack): Sequential(
    (0): Linear(in_features=784, out_features=512, bias=True)
    (1): ReLU()
    (2): Linear(in_features=512, out_features=512, bias=True)
    (3): ReLU()
    (4): Linear(in_features=512, out_features=10, bias=True)
  )
)

## Optimizing the Model Parameters

We need a loss function and an optimizer.

In [17]:
loss_function = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(params=model.parameters(), lr=1e-3)

model.parameters(), loss_function, optimizer

(<generator object Module.parameters at 0x00000223D940F220>,
 CrossEntropyLoss(),
 SGD (
 Parameter Group 0
     dampening: 0
     differentiable: False
     foreach: None
     lr: 0.001
     maximize: False
     momentum: 0
     nesterov: False
     weight_decay: 0
 ))
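
To make the two pieces concrete, here is a small sketch with made-up logits and a single made-up parameter `w` (neither comes from the notebook). It checks that nn.CrossEntropyLoss takes raw logits and is equivalent to log-softmax followed by negative log-likelihood, and it shows one plain SGD update, w <- w - lr * grad.

```python
import torch
import torch.nn.functional as F
from torch import nn

# Illustrative logits for 2 samples and 3 classes, plus their true classes.
logits = torch.tensor([[2.0, 0.5, -1.0],
                       [0.1, 0.2, 3.0]])
targets = torch.tensor([0, 2])

# CrossEntropyLoss expects unnormalized logits, not probabilities.
loss_ce = nn.CrossEntropyLoss()(logits, targets)
loss_manual = F.nll_loss(F.log_softmax(logits, dim=1), targets)

# One SGD step on a single parameter with loss = w^2.
w = torch.tensor([1.0], requires_grad=True)
opt = torch.optim.SGD([w], lr=0.1)
(w ** 2).sum().backward()  # gradient is 2 * w = 2.0
opt.step()                 # w becomes 1.0 - 0.1 * 2.0 = 0.8

print(loss_ce.item(), loss_manual.item(), w.item())
```

This is why the model above returns logits without a final softmax layer: the loss function applies the normalization itself.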

In a single training loop, the model makes predictions on the training set (fed to it in batches) and backpropagates the prediction error to adjust the model's parameters.

In [18]:
len(train_dataloader.dataset)

60000

In [19]:
def train(dataloader, model, loss_function, optimizer):
    size = len(dataloader.dataset)
    model.train()
    for batch, (X, y) in enumerate(dataloader):
        X, y = X.to(device), y.to(device)

        # Compute predictions
        pred = model(X)
        loss = loss_function(pred, y)

        # Backpropagation
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

        if batch%100==0:
            loss, current = loss.item(), (batch+1)*len(X)
            print(f"Loss: {loss}, current: [{current}/{size}]")

**loss.backward()**: calculates the gradients of the loss w.r.t. each parameter

**optimizer.step()**: once we have the gradients, optimizer.step() adjusts the parameters using the gradients computed in loss.backward()

**optimizer.zero_grad()**: resets the gradients of the model parameters. Gradients accumulate by default, so to prevent double counting we zero them at each iteration.
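
The accumulation behaviour is easy to see on a single parameter. In this sketch (the parameter `w` and the loss 3 * w are arbitrary choices for illustration), calling backward twice without resetting doubles the stored gradient.

```python
import torch

w = torch.tensor([1.0], requires_grad=True)
opt = torch.optim.SGD([w], lr=0.1)

# The gradient of 3 * w with respect to w is 3.
(3 * w).sum().backward()
first = w.grad.clone()        # tensor([3.])

# A second backward pass adds to the stored gradient.
(3 * w).sum().backward()
accumulated = w.grad.clone()  # tensor([6.])

# After zero_grad(), the next backward pass starts from a clean slate.
opt.zero_grad()
(3 * w).sum().backward()
after_reset = w.grad.clone()  # tensor([3.])

print(first, accumulated, after_reset)
```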

In [22]:
def test(dataloader, model, loss_function):
    size = len(dataloader.dataset)
    num_batches = len(dataloader)
    model.eval()
    test_loss, correct = 0, 0
    with torch.no_grad():
        for X, y in dataloader:
            X, y = X.to(device), y.to(device)
            pred = model(X)
            test_loss += loss_function(pred, y).item()
            correct += (pred.argmax(1) == y).type(torch.float).sum().item()
    test_loss /= num_batches
    correct /= size
    print(f"Test error: \n Accuracy: {correct*100}, Avg. Loss: {test_loss}")


**model.train()**: puts the model in training mode. This matters for layers like Dropout and BatchNorm, which are designed to behave differently during training and evaluation.

**model.eval()**: puts the model in evaluation mode.

**torch.no_grad()**: disables gradient tracking during testing, which avoids unnecessary computation and saves memory and time
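
The network in these notes has no Dropout or BatchNorm layers, so the following sketch uses a standalone nn.Dropout layer (an assumption made purely for illustration) to show what the two modes change, and a throwaway nn.Linear layer to show the effect of torch.no_grad().

```python
import torch
from torch import nn

torch.manual_seed(0)
drop = nn.Dropout(p=0.5)
x = torch.ones(1000)

drop.train()
train_out = drop(x)  # roughly half the entries zeroed, the rest scaled to 2.0

drop.eval()
eval_out = drop(x)   # dropout is a no-op in evaluation mode

# Inside no_grad(), results are not tracked by autograd.
with torch.no_grad():
    y = nn.Linear(3, 1)(torch.ones(2, 3))

print(train_out.unique(), torch.equal(eval_out, x), y.requires_grad)
```

Note that model.eval() and torch.no_grad() are independent: the first changes layer behaviour, the second turns off gradient tracking, and evaluation code typically uses both.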

Training over several epochs

In [23]:
%%time
epochs = 5

for i in range(epochs):
    print(f"Epoch: {i+1}")
    train(train_dataloader, model, loss_function, optimizer)
    test(test_dataloader, model, loss_function)

print("Done")

Epoch: 1
Loss: 2.1852340698242188, current: [64/60000]
Loss: 2.1673810482025146, current: [6464/60000]
Loss: 2.1269333362579346, current: [12864/60000]
Loss: 2.14335560798645, current: [19264/60000]
Loss: 2.091064214706421, current: [25664/60000]
Loss: 2.0306830406188965, current: [32064/60000]
Loss: 2.0656158924102783, current: [38464/60000]
Loss: 1.9896384477615356, current: [44864/60000]
Loss: 2.0044243335723877, current: [51264/60000]
Loss: 1.9353084564208984, current: [57664/60000]
Test error: 
 Accuracy: 53.290000000000006, Avg. Loss: 1.9290706891163139
Epoch: 2
Loss: 1.9601246118545532, current: [64/60000]
Loss: 1.9209266901016235, current: [6464/60000]
Loss: 1.8250185251235962, current: [12864/60000]
Loss: 1.869011402130127, current: [19264/60000]
Loss: 1.7492424249649048, current: [25664/60000]
Loss: 1.7017980813980103, current: [32064/60000]
Loss: 1.7345941066741943, current: [38464/60000]
Loss: 1.6324387788772583, current: [44864/60000]
Loss: 1.6658046245574951, current: [51

Single prediction

In [66]:
for X, y in test_dataloader:
    print(X, y)
    print(X.shape, y.shape)
    break
pred = model(X[0])
pred, pred.argmax(1)

tensor([[[[0., 0., 0.,  ..., 0., 0., 0.],
          [0., 0., 0.,  ..., 0., 0., 0.],
          [0., 0., 0.,  ..., 0., 0., 0.],
          ...,
          [0., 0., 0.,  ..., 0., 0., 0.],
          [0., 0., 0.,  ..., 0., 0., 0.],
          [0., 0., 0.,  ..., 0., 0., 0.]]],


        [[[0., 0., 0.,  ..., 0., 0., 0.],
          [0., 0., 0.,  ..., 0., 0., 0.],
          [0., 0., 0.,  ..., 0., 0., 0.],
          ...,
          [0., 0., 0.,  ..., 0., 0., 0.],
          [0., 0., 0.,  ..., 0., 0., 0.],
          [0., 0., 0.,  ..., 0., 0., 0.]]],


        [[[0., 0., 0.,  ..., 0., 0., 0.],
          [0., 0., 0.,  ..., 0., 0., 0.],
          [0., 0., 0.,  ..., 0., 0., 0.],
          ...,
          [0., 0., 0.,  ..., 0., 0., 0.],
          [0., 0., 0.,  ..., 0., 0., 0.],
          [0., 0., 0.,  ..., 0., 0., 0.]]],


        ...,


        [[[0., 0., 0.,  ..., 0., 0., 0.],
          [0., 0., 0.,  ..., 0., 0., 0.],
          [0., 0., 0.,  ..., 0., 0., 0.],
          ...,
          [0., 0., 0.,  ..., 0.

(tensor([[-3.0277, -3.3167, -0.8365, -2.5314, -0.8681,  3.1824, -1.0871,  3.2341,
           2.2635,  3.8444]], grad_fn=<AddmmBackward0>),
 tensor([9]))

In [40]:
with torch.no_grad():
    for X, y in test_dataloader:
        X, y = X.to(device), y.to(device)
        print(X, y.shape)
        break

len(X), X.shape, y.shape

tensor([[[[0., 0., 0.,  ..., 0., 0., 0.],
          [0., 0., 0.,  ..., 0., 0., 0.],
          [0., 0., 0.,  ..., 0., 0., 0.],
          ...,
          [0., 0., 0.,  ..., 0., 0., 0.],
          [0., 0., 0.,  ..., 0., 0., 0.],
          [0., 0., 0.,  ..., 0., 0., 0.]]],


        [[[0., 0., 0.,  ..., 0., 0., 0.],
          [0., 0., 0.,  ..., 0., 0., 0.],
          [0., 0., 0.,  ..., 0., 0., 0.],
          ...,
          [0., 0., 0.,  ..., 0., 0., 0.],
          [0., 0., 0.,  ..., 0., 0., 0.],
          [0., 0., 0.,  ..., 0., 0., 0.]]],


        [[[0., 0., 0.,  ..., 0., 0., 0.],
          [0., 0., 0.,  ..., 0., 0., 0.],
          [0., 0., 0.,  ..., 0., 0., 0.],
          ...,
          [0., 0., 0.,  ..., 0., 0., 0.],
          [0., 0., 0.,  ..., 0., 0., 0.],
          [0., 0., 0.,  ..., 0., 0., 0.]]],


        ...,


        [[[0., 0., 0.,  ..., 0., 0., 0.],
          [0., 0., 0.,  ..., 0., 0., 0.],
          [0., 0., 0.,  ..., 0., 0., 0.],
          ...,
          [0., 0., 0.,  ..., 0.

(64, torch.Size([64, 1, 28, 28]), torch.Size([64]))

In [42]:
X[0].shape

torch.Size([1, 28, 28])

In [43]:
pred = model(X[0])
pred

tensor([[-3.0277, -3.3167, -0.8365, -2.5314, -0.8681,  3.1824, -1.0871,  3.2341,
          2.2635,  3.8444]], grad_fn=<AddmmBackward0>)

In [49]:
(torch.argmax(pred).item() == y[0]).type(torch.float).item()

1.0

In [55]:
pred.argmax(1) == y[0], (pred.argmax(1) == y[0]).item(), (pred.argmax(1) == y[0]).type(torch.float), (pred.argmax(1) == y[0]).type(torch.float).sum(), torch.argmax(pred, 1) == y[0]

(tensor([True]), True, tensor([1.]), tensor(1.), tensor([True]))
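
Calling the model on X[0] works here because the image has a single channel, so its shape [1, 28, 28] is read as a batch of one. A more explicit pattern is to add the batch dimension yourself and disable gradient tracking. The sketch below uses an untrained stand-in model and a random image (both assumptions) so that it runs on its own.

```python
import torch
from torch import nn

# Untrained stand-in for the NeuralNetwork class defined above.
model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))
model.eval()

image = torch.rand(1, 28, 28)  # one image: (channels, height, width)
batch = image.unsqueeze(0)     # add a batch dimension: (1, 1, 28, 28)

with torch.no_grad():          # no grad_fn is attached to the output
    logits = model(batch)

predicted_class = logits.argmax(dim=1).item()
print(batch.shape, logits.shape, predicted_class)
```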

Multiple predictions

In [56]:
preds = model(X[0:5])
preds

tensor([[-3.0277, -3.3167, -0.8365, -2.5314, -0.8681,  3.1824, -1.0871,  3.2341,
          2.2635,  3.8444],
        [ 1.2637, -3.4860,  4.3094, -1.0402,  3.7479, -2.1592,  3.2621, -3.9865,
          2.0015, -2.5964],
        [ 1.8172,  5.7545, -0.1377,  4.0086,  0.9133, -3.1278,  0.7877, -3.9263,
         -2.1961, -4.0858],
        [ 1.1486,  4.4776, -0.2569,  3.1102,  0.4883, -2.2147,  0.4146, -2.8601,
         -1.8106, -2.8417],
        [ 1.1762, -1.6343,  1.8896, -0.1472,  1.8013, -1.2123,  1.8479, -2.3607,
          1.0001, -1.5686]], grad_fn=<AddmmBackward0>)

In [62]:
torch.argmax(preds, 1), y[:5]

(tensor([9, 2, 1, 1, 2]), tensor([9, 2, 1, 1, 6]))

In [61]:
(torch.argmax(preds, 1) == y[:5]).sum().item()

4
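
The raw outputs above are logits. As a sketch, they can be turned into probabilities with softmax and mapped to readable class names. The `classes` list follows the standard Fashion-MNIST label order and is not defined in the notebook itself. The logits are copied from the first prediction shown above.

```python
import torch

# Standard Fashion-MNIST label order (index = class id).
classes = ["T-shirt/top", "Trouser", "Pullover", "Dress", "Coat",
           "Sandal", "Shirt", "Sneaker", "Bag", "Ankle boot"]

# Logits of the first test image, copied from the output above.
logits = torch.tensor([[-3.0277, -3.3167, -0.8365, -2.5314, -0.8681,
                        3.1824, -1.0871, 3.2341, 2.2635, 3.8444]])

probs = torch.softmax(logits, dim=1)  # each row sums to 1
predicted = classes[probs.argmax(dim=1).item()]

print(predicted)  # Ankle boot
```

Softmax preserves the ordering of the logits, so argmax on the probabilities picks the same class (9) as argmax on the logits.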

## Saving Models

A common way to save a model is to serialize its internal state dictionary, which contains the model parameters.

In [24]:
torch.save(model.state_dict(), "model.pth")

## Loading Models

To load a model, recreate the model structure and load the state dictionary into it.

In [25]:
model = NeuralNetwork().to(device=device)
model.load_state_dict(torch.load("model.pth", weights_only=True))

<All keys matched successfully>
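
As a quick sanity check of the save and load steps, the following sketch round-trips a tiny stand-in model through a temporary file and confirms that the restored copy produces identical outputs. The helper `make_model`, the layer sizes and the file name are illustrative assumptions.

```python
import os
import tempfile

import torch
from torch import nn

def make_model():
    # Hypothetical tiny architecture; it must match when saving and loading.
    return nn.Sequential(nn.Linear(4, 3), nn.ReLU(), nn.Linear(3, 2))

original = make_model()

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "tiny_model.pth")
    torch.save(original.state_dict(), path)

    # A fresh instance has different random weights until the state is loaded.
    restored = make_model()
    restored.load_state_dict(torch.load(path, weights_only=True))

restored.eval()

x = torch.randn(5, 4)
with torch.no_grad():
    same = torch.equal(original(x), restored(x))

print(same)  # True
```

Because only the parameters are stored, the class definition has to be available at load time. This is why the notes recreate NeuralNetwork() before calling load_state_dict.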