# Pytorch Reference Notebook

pyenv name: ner-chatbot-env

### Squeeze vs Unsqueeze

Unsqueeze: adds one dimensiion to embedding

Squeeze: removes one dimension to embedding

In [1]:
import torch
import numpy as np

In [28]:
x = torch.tensor([[1,2,3,4,5,6,7,8]])
x.shape

torch.Size([1, 8])

In [29]:
y = torch.unsqueeze(x, 0)
y

tensor([[[1, 2, 3, 4, 5, 6, 7, 8]]])

In [30]:
y.shape

torch.Size([1, 1, 8])

In [31]:
z = torch.squeeze(x)
z

tensor([1, 2, 3, 4, 5, 6, 7, 8])

In [32]:
z.shape

torch.Size([8])

### Quickstart 

Link (https://pytorch.org/tutorials/beginner/basics/quickstart_tutorial.html)

Link (https://pytorch.org/text/stable/index.html)

#### Working with Data

Pytorch has two approaches to handling datasets:

- DataLoader from torch.utils.data
    - Wraps an iterable around the Dataset
- Dataset from torch.utils.data
    - Stores all samples and labels

In [95]:
import torch 
from torch import nn 
from torch.utils.data import DataLoader
from torchvision import datasets 
from torchvision.transforms import ToTensor 

In [96]:
# datasets offer a wide range of datasets pertaining to text, audio, and vision 

training_data = datasets.FashionMNIST(
    root="data",
    train=True,
    download=True,
    transform=ToTensor()
)

testing_data = datasets.FashionMNIST(
    root="data",
    train=False,
    download=True,
    transform=ToTensor()
)

In [97]:
# Instatiate a DataLoader Class by passing a Dataset Class to it.
# This will wrap an iterable over the dataset to handle 
# ...batching of data, shuffling, multiprocess data loading, and sampling


train_dataloader = DataLoader(training_data, batch_size=64)
test_dataloader = DataLoader(testing_data, batch_size=64)


In [98]:
train_dataloader.dataset, test_dataloader.dataset

(Dataset FashionMNIST
     Number of datapoints: 60000
     Root location: data
     Split: Train
     StandardTransform
 Transform: ToTensor(),
 Dataset FashionMNIST
     Number of datapoints: 10000
     Root location: data
     Split: Test
     StandardTransform
 Transform: ToTensor())

In [99]:
for x, y in train_dataloader:
    print(x.shape, y.shape)

torch.Size([64, 1, 28, 28]) torch.Size([64])
torch.Size([64, 1, 28, 28]) torch.Size([64])
torch.Size([64, 1, 28, 28]) torch.Size([64])
torch.Size([64, 1, 28, 28]) torch.Size([64])
torch.Size([64, 1, 28, 28]) torch.Size([64])
torch.Size([64, 1, 28, 28]) torch.Size([64])
torch.Size([64, 1, 28, 28]) torch.Size([64])
torch.Size([64, 1, 28, 28]) torch.Size([64])
torch.Size([64, 1, 28, 28]) torch.Size([64])
torch.Size([64, 1, 28, 28]) torch.Size([64])
torch.Size([64, 1, 28, 28]) torch.Size([64])
torch.Size([64, 1, 28, 28]) torch.Size([64])
torch.Size([64, 1, 28, 28]) torch.Size([64])
torch.Size([64, 1, 28, 28]) torch.Size([64])
torch.Size([64, 1, 28, 28]) torch.Size([64])
torch.Size([64, 1, 28, 28]) torch.Size([64])
torch.Size([64, 1, 28, 28]) torch.Size([64])
torch.Size([64, 1, 28, 28]) torch.Size([64])
torch.Size([64, 1, 28, 28]) torch.Size([64])
torch.Size([64, 1, 28, 28]) torch.Size([64])
torch.Size([64, 1, 28, 28]) torch.Size([64])
torch.Size([64, 1, 28, 28]) torch.Size([64])
torch.Size

#### Creating Models

To create a model in Pytorch, it must be defined by a class that inherits from the torch.nn.Module class. Our model class should include an __init__ and forward method. The __init__ function will define each layer of the model and the forward function will connect the layers together of which breaks down the runthrough of the input and hidden states passed from layer to layer. Additionally, it helps to utlize a GPU or MPS rather than a CPU

In [100]:
device = (
    "cuda"
    if torch.cuda.is_available()
    else 'mps'
    if torch.backends.mps.is_available()
    else "cpu"
)

print(device)

mps


In [148]:
class FashionMNISTClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.layer_1_flatten = nn.Flatten()
        self.layer_2_linear_relu = nn.Sequential(
            nn.Linear(in_features=(28*28), out_features=512),
            nn.ReLU(),
            nn.Linear(in_features=512, out_features=512),
            nn.ReLU(),
            # RuntimeError: linear(): 
            # input and weight.T shapes cannot be 
            # multiplied (64x64 and 512x10)

            # nn.Linear(in_features=512, out_features=64),
            # nn.ReLU(),
            nn.Linear(in_features=512, out_features=10)
        )

    def forward(self, x):
        hidden_states_1 = self.layer_1_flatten(x)
        logits_2 = self.layer_2_linear_relu(hidden_states_1)
        return logits_2
    
model = FashionMNISTClassifier().to(device)
model

FashionMNISTClassifier(
  (layer_1_flatten): Flatten(start_dim=1, end_dim=-1)
  (layer_2_linear_relu): Sequential(
    (0): Linear(in_features=784, out_features=512, bias=True)
    (1): ReLU()
    (2): Linear(in_features=512, out_features=512, bias=True)
    (3): ReLU()
    (4): Linear(in_features=512, out_features=10, bias=True)
  )
)

#### Optimizing Model Hyperparameters

In [149]:
# Establish the loss function and optimizer to use

loss = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(),lr=1e-3)
loss, optimizer

(CrossEntropyLoss(),
 SGD (
 Parameter Group 0
     dampening: 0
     differentiable: False
     foreach: None
     lr: 0.001
     maximize: False
     momentum: 0
     nesterov: False
     weight_decay: 0
 ))

Batches of data are fed into the model to make predictions on training set, then backpropagates the prediction error to adjust the parameters of the model. I.e. the model computes the loss and then adjusting the parameters through back propagation to minimize that said loss. Furthermore, be sure to set training/testing data to the same .to(device) as the model above.

In [150]:
def ner_train(dataloader, model, loss_fn, optimizer):
    # size of dataset
    len_train = train_dataloader.dataset.data.shape[0]
    # Compute loss
    temp_loss = 0
    # train model
    model.train()
    # run a training loop that computes loss, backpropogates error, then adjusts the parameters
    for batch, (x, y) in enumerate(train_dataloader):
        x, y = x.to(device), y.to(device)
        # Compute the model predictions over the training features
        pred = model(x)
        # Compute the loss based on the predicted labels and the true labels
        loss = loss_fn(pred,y)
        temp_loss += loss

        # Backpropagation
        loss.backward()
        # Take a step in some direction on the loss surface that helps lower the prediction loss
        optimizer.step
        # Zero out the gradients since the model is not influenced by past results. 
        # Note there is no need to keep the past gradients when that info was used
        # ...to update the location of the optimizer on the loss surface. 
        optimizer.zero_grad()

        # print(f"Batch: {batch}, Loss: {temp_loss}", "\n")

        # if batch % 100 == 0:
        #     loss, current = loss.item(), (batch+1)*len(x)
        #     print(f"Loss: {loss}, Samples {current/len_train}")

        if batch % 100 == 0:
            loss, current = loss.item(), (batch + 1) * len(x)
            print(f"loss: {loss:>7f}  [{current:>5d}/{len_train:>5d}]")




In [151]:
def ner_test(dataloader, model, loss_fn):
    len_test = test_dataloader.dataset.data.shape[0]
    num_batches = len(dataloader)
    model.eval()
    test_loss, correct = 0, 0
    with torch.no_grad():
        for x, y in dataloader:
            x, y = x.to(device), y.to(device)
            pred_y = model(x)
            test_loss += loss_fn(pred_y, y).item()
            correct += (pred_y.argmax(1) == y).type(torch.float).sum().item()
    test_loss /= num_batches
    correct /= len_test
    # print(f"Testing: \n Accuracy: {correct} \n Loss: {test_loss}")
    print(f"Test Error: \n Accuracy: {(100*correct):>0.1f}%, Avg loss: {test_loss:>8f} \n")

Training the model means to learn the patterns of data and the parameters to make optimal predictions. Training occurs over several epochs, where each epoch trains on a batch of data. Notice in the code, the training function prints out the loss and the number of samples used. The test set prints out the accuracy and loss for each training batch. 

In [152]:
model, loss, optimizer, train_dataloader

(FashionMNISTClassifier(
   (layer_1_flatten): Flatten(start_dim=1, end_dim=-1)
   (layer_2_linear_relu): Sequential(
     (0): Linear(in_features=784, out_features=512, bias=True)
     (1): ReLU()
     (2): Linear(in_features=512, out_features=512, bias=True)
     (3): ReLU()
     (4): Linear(in_features=512, out_features=10, bias=True)
   )
 ),
 CrossEntropyLoss(),
 SGD (
 Parameter Group 0
     dampening: 0
     differentiable: False
     foreach: None
     lr: 0.001
     maximize: False
     momentum: 0
     nesterov: False
     weight_decay: 0
 ),
 <torch.utils.data.dataloader.DataLoader at 0x2a1460670>)

In [153]:
epochs = 5

for epoch in range(5):
    print(f"Epoch: {epoch+1}")
    ner_train(train_dataloader, model, loss, optimizer)
    ner_test(test_dataloader, model, loss)
print('Fin')


Epoch: 1
loss: 2.306123  [   64/60000]
loss: 2.309023  [ 6464/60000]
loss: 2.309962  [12864/60000]
loss: 2.310167  [19264/60000]
loss: 2.301754  [25664/60000]
loss: 2.300050  [32064/60000]
loss: 2.305031  [38464/60000]
loss: 2.308173  [44864/60000]
loss: 2.311109  [51264/60000]
loss: 2.285941  [57664/60000]
Test Error: 
 Accuracy: 9.3%, Avg loss: 2.303915 

Epoch: 2
loss: 2.306123  [   64/60000]
loss: 2.309023  [ 6464/60000]
loss: 2.309962  [12864/60000]
loss: 2.310167  [19264/60000]
loss: 2.301754  [25664/60000]
loss: 2.300050  [32064/60000]
loss: 2.305031  [38464/60000]
loss: 2.308173  [44864/60000]
loss: 2.311109  [51264/60000]
loss: 2.285941  [57664/60000]
Test Error: 
 Accuracy: 9.3%, Avg loss: 2.303915 

Epoch: 3
loss: 2.306123  [   64/60000]
loss: 2.309023  [ 6464/60000]
loss: 2.309962  [12864/60000]
loss: 2.310167  [19264/60000]
loss: 2.301754  [25664/60000]
loss: 2.300050  [32064/60000]
loss: 2.305031  [38464/60000]
loss: 2.308173  [44864/60000]
loss: 2.311109  [51264/60000]
l

#### Saving Model

Save the model and its parameters in an internal state dictionary

In [154]:
model.state_dict()

  nonzero_finite_vals = torch.masked_select(


OrderedDict([('layer_2_linear_relu.0.weight',
              tensor([[-0.0196, -0.0325, -0.0176,  ...,  0.0033, -0.0076, -0.0189],
                      [-0.0199,  0.0266,  0.0286,  ..., -0.0157,  0.0154, -0.0274],
                      [ 0.0039, -0.0239,  0.0177,  ..., -0.0039, -0.0232, -0.0278],
                      ...,
                      [-0.0288,  0.0055,  0.0166,  ...,  0.0208, -0.0169,  0.0231],
                      [-0.0301,  0.0011,  0.0143,  ...,  0.0187, -0.0289,  0.0303],
                      [ 0.0227, -0.0251,  0.0306,  ...,  0.0211, -0.0037,  0.0220]],
                     device='mps:0')),
             ('layer_2_linear_relu.0.bias',
              tensor([-1.0901e-03, -1.8960e-02,  2.2796e-03, -3.5242e-02, -2.6809e-02,
                      -3.4505e-02,  1.2617e-02, -2.5151e-02, -2.8374e-02,  9.3289e-03,
                       4.0030e-03, -8.9758e-03, -2.2560e-02,  6.6295e-03,  1.3166e-02,
                       2.5571e-02, -3.4357e-02,  3.0701e-03, -3.3995e-02,  6.8

In [156]:
torch.save(model.state_dict(), "my_ner_model.pth")

#### Loading Model

In [161]:
new_model = FashionMNISTClassifier().to(device)
new_model.load_state_dict(torch.load("my_ner_model.pth"))

<All keys matched successfully>

In [164]:
new_model.eval()

FashionMNISTClassifier(
  (layer_1_flatten): Flatten(start_dim=1, end_dim=-1)
  (layer_2_linear_relu): Sequential(
    (0): Linear(in_features=784, out_features=512, bias=True)
    (1): ReLU()
    (2): Linear(in_features=512, out_features=512, bias=True)
    (3): ReLU()
    (4): Linear(in_features=512, out_features=10, bias=True)
  )
)

In [172]:
classes = [
    "T-shirt/top",
    "Trouser",
    "Pullover",
    "Dress",
    "Coat",
    "Sandal",
    "Shirt",
    "Sneaker",
    "Bag",
    "Ankle boot",
]

new_model.eval()

test_x_example = testing_data[0][0]
test_y_example = testing_data[0][1]

with torch.no_grad():
    test_x_example = test_x_example.to(device)

    pred_logits = model(test_x_example)

    prediction, actual = classes[pred_logits[0].argmax(axis=0)],  classes[test_y_example]
    print(f"Predicted: {prediction} \n Actual: {actual}") 

Predicted: T-shirt/top 
 Actual: Ankle boot


### Builidng with Models and layer types

#### Example of a Pytorch Model


To create a Pytorch model, it must inherit from the torch.nn.Module class to make sure it's compatible with Pytorch. The huge benefit is being able to track the weights (parameters) of each hidden layer. When assigned, the parameters become instances of the torch.nn.Parameter class which is a subclass of the torch.nn.Tensor class. As the weights (tensors) are updated the class will add them to a list of parameters based on the torch.nn.Module class. The weights can be accessed through the parameters() method from the torch.nn.Module class

In [173]:
import torch
from torch import nn

In [176]:
class LinearClassifierModel(nn.Module):
    def __init__(self):
        super().__init__()

        self.linear_layer = nn.Linear(in_features=5, out_features=10)
        self.activation_fn = nn.ReLU()
        self.linear_layer_2 = nn.Linear(in_features=10, out_features=2)
        self.softmax_layer = nn.Softmax()
    
    def forward(self, x):
        layer_1 = self.linear_layer(x)
        layer_2 = self.activation_fn()(layer_1)
        layer_3 = self.linear_layer_2(layer_2)
        layer_4 = self.softmax_layer()(layer_3)
        return layer_4

In [177]:
# The displayed output inherits from nn.Module and also shows the model layers 
model = LinearClassifierModel()
model

LinearClassifierModel(
  (linear_layer): Linear(in_features=5, out_features=10, bias=True)
  (activation_fn): ReLU()
  (linear_layer_2): Linear(in_features=10, out_features=2, bias=True)
  (softmax_layer): Softmax(dim=None)
)

In [186]:
# The .parameters() method can be used since it's from the torch.nn.Module class
for i in model.parameters():
    print(i)

Parameter containing:
tensor([[-0.0393,  0.2507, -0.4054, -0.0228,  0.4069],
        [-0.4357,  0.4470, -0.0297,  0.2698, -0.2882],
        [ 0.1768, -0.1114, -0.2653,  0.2500,  0.2786],
        [ 0.2372,  0.2003, -0.1650, -0.3076,  0.2704],
        [ 0.4116,  0.1751, -0.2421,  0.2375, -0.4243],
        [-0.1958, -0.1220,  0.1047, -0.1787,  0.1606],
        [ 0.1934,  0.0592,  0.1174, -0.1835,  0.0038],
        [ 0.3793, -0.2489, -0.1857,  0.1972,  0.4322],
        [ 0.4318,  0.3358,  0.1332, -0.2016, -0.3435],
        [-0.2290, -0.3268, -0.4252,  0.3387,  0.3440]], requires_grad=True)
Parameter containing:
tensor([ 0.2368, -0.3312,  0.1434, -0.0631,  0.4451,  0.2488, -0.1648,  0.0921,
        -0.2638, -0.3270], requires_grad=True)
Parameter containing:
tensor([[-0.2418, -0.2944,  0.1352, -0.2306,  0.1630,  0.0178, -0.1322, -0.1990,
          0.2017, -0.2476],
        [ 0.1565,  0.2878, -0.1379,  0.1578,  0.2889,  0.2515, -0.2311,  0.2666,
         -0.2227, -0.2847]], requires_grad=Tru

In [179]:
model.linear_layer

Linear(in_features=5, out_features=10, bias=True)

In [185]:
for i in model.linear_layer.parameters():
    print(i)

Parameter containing:
tensor([[-0.0393,  0.2507, -0.4054, -0.0228,  0.4069],
        [-0.4357,  0.4470, -0.0297,  0.2698, -0.2882],
        [ 0.1768, -0.1114, -0.2653,  0.2500,  0.2786],
        [ 0.2372,  0.2003, -0.1650, -0.3076,  0.2704],
        [ 0.4116,  0.1751, -0.2421,  0.2375, -0.4243],
        [-0.1958, -0.1220,  0.1047, -0.1787,  0.1606],
        [ 0.1934,  0.0592,  0.1174, -0.1835,  0.0038],
        [ 0.3793, -0.2489, -0.1857,  0.1972,  0.4322],
        [ 0.4318,  0.3358,  0.1332, -0.2016, -0.3435],
        [-0.2290, -0.3268, -0.4252,  0.3387,  0.3440]], requires_grad=True)
Parameter containing:
tensor([ 0.2368, -0.3312,  0.1434, -0.0631,  0.4451,  0.2488, -0.1648,  0.0921,
        -0.2638, -0.3270], requires_grad=True)


#### Linear Layers

Every node is computed via y = wx+ b where w are the weights, x is the input, b is the bias, and y is the outpu value for that node in the neural network. Every node is connected to each other meaning the weights have an equal impact on those nodes. With an input of size m and an output of size n, the weights matrix would be size mxn

In [188]:
import torch
from torch.nn import Linear

In [191]:
x = torch.rand(4,4)
x.shape

torch.Size([4, 4])

In [216]:
# Input 4x4 matrix -> 2x4 matrix
linear_layer = Linear(in_features=4, out_features=2)
linear_layer

Linear(in_features=4, out_features=2, bias=True)

In [217]:
for i in linear_layer.state_dict():
    print(i)

weight
bias


In [218]:
# 4x4 -> 2x4 Tensor (Weights) and 1x2 Tensor (Bias)
# I.e. one bias per row of data
for i in linear_layer.parameters():
    print(i)

Parameter containing:
tensor([[ 0.0393, -0.2453, -0.3070,  0.3471],
        [ 0.0014,  0.0895, -0.3446,  0.2589]], requires_grad=True)
Parameter containing:
tensor([0.3911, 0.4799], requires_grad=True)


In [219]:
# 4x4 -> 4 samples (rows), 4 columns (features)
x

tensor([[0.6062, 0.1219, 0.5724, 0.0812],
        [0.4876, 0.9861, 0.6739, 0.3195],
        [0.8459, 0.2756, 0.7410, 0.1012],
        [0.2101, 0.9687, 0.9727, 0.3913]])

In [220]:
# 2x4 -> 4 samples (rows), 
# 2 columns (outputs: think of a binary classifier 
# where each of the 4 samples has a two 
# probablities for each class. 
# Therefore, it would need only a softmax 
# function to extract the highest logit 
# from each sample.
y = linear_layer(x)
y

tensor([[ 0.2375,  0.3155],
        [ 0.0724,  0.4193],
        [ 0.1644,  0.2766],
        [-0.0011,  0.3330]], grad_fn=<AddmmBackward0>)

Manual matrix multiplication of x times the weights plus the bias will output the same y value above. 

In [224]:
weights, bias = linear_layer.parameters()

In [229]:
weights.shape, bias.shape, x.shape

(torch.Size([2, 4]), torch.Size([2]), torch.Size([4, 4]))

In [226]:
weights, bias

(Parameter containing:
 tensor([[ 0.0393, -0.2453, -0.3070,  0.3471],
         [ 0.0014,  0.0895, -0.3446,  0.2589]], requires_grad=True),
 Parameter containing:
 tensor([0.3911, 0.4799], requires_grad=True))

In [251]:
# Matrix Multiplication with torch.linalg.multi_dot [(rows x columns)
# This example shows matrix multiplication of a 2x3 and a 3x1 tensor to output a 2x1 matrix
torch.linalg.multi_dot([torch.Tensor([[1,2,3], [4,5,6]]), torch.Tensor([[7],[8],[9]])])

tensor([[ 50.],
        [122.]])

In [252]:
# Not correct
# 2x4 = 2x4 x 4x4
torch.linalg.multi_dot([weights, x]).T + bias



tensor([[0.1086, 0.2873],
        [0.4056, 0.7241],
        [0.3584, 0.5375],
        [0.4207, 0.5751]], grad_fn=<AddBackward0>)

In [254]:
# Correct: Transpose the wieghts so it can be multipliced with input x
# 4x2 = 4x4 x 4x2
torch.linalg.multi_dot([x, weights.T]) + bias

tensor([[ 0.2375,  0.3155],
        [ 0.0724,  0.4193],
        [ 0.1644,  0.2766],
        [-0.0011,  0.3330]], grad_fn=<AddBackward0>)

In [255]:
y

tensor([[ 0.2375,  0.3155],
        [ 0.0724,  0.4193],
        [ 0.1644,  0.2766],
        [-0.0011,  0.3330]], grad_fn=<AddmmBackward0>)

#### Convolutional Layers

Convolutions layers are used to help extract high dimensional features like an image or text. Each layer has a kernal with a sliding window of size mxn or mxm. The sliding window uses a function to determine the output for a reduced window size based on the values in each sliding window. E.g. 

In [198]:
# 2 pm cst 

weight
bias


#### Recurrent Layers

#### Transformer Layers

#### Data Manipulation Layers

##### Max Pooling Layers

##### Normalization Layers

##### Dropout Layers

#### Activation Functions

#### Loss Functions

In [3]:
nums = [1,1,2,2,2,3,3,3,4,5,5]
nums.remove(nums[8])
nums

[1, 1, 2, 2, 2, 3, 3, 3, 5, 5]