# PyTorch Basics

*by Arif Ozan Kızıldağ*

This tutorial will introduce you to the basics of PyTorch while we tackle a news classification task using the AG NEWS dataset. While creating this tutorial, we utilized the [``Text classification with the torchtext library``](https://pytorch.org/tutorials/beginner/text_sentiment_ngrams_tutorial.html) tutorial from the official PyTorch tutorials as a reference.

PyTorch is a Python deep learning library originally developed by Facebook (now Meta). Its core is written in C++ and uses CUDA, NVIDIA's general-purpose GPU computing platform, for GPU acceleration. Its low-level design is intended to be similar to NumPy and to follow Python principles as much as possible. It facilitates deep learning applications with its CUDA backend, its automatic differentiation mechanism on computational graphs, and its utilities for machine learning and deep learning.

In [1]:
from time import time
import torch
import torchtext
import torchdata
%matplotlib inline
print('version of the torch:' + torch.__version__)
print('version of the torchtext:' + torchtext.__version__)
print('version of the torchdata:' + torchdata.__version__)

version of the torch:2.0.1
version of the torchtext:0.15.2
version of the torchdata:0.6.1


PyTorch supports the CUDA backend for accelerating parallelizable operations such as matrix multiplication and convolution on the GPU. You can check whether CUDA is available in the current session by calling the `torch.cuda.is_available()` function.

In [2]:
print(torch.cuda.is_available())

True


## Tensors and Gradient

With PyTorch, you can create tensors the same way you create arrays in NumPy. `torch.Tensor` is the main data structure of PyTorch. It holds the information needed to evaluate computational graphs in both forward and backward modes. Each tensor is an edge in the computational graph.

In [3]:
# Basic data operations
x1 = torch.zeros(5, 3) # You can initialize tensors with zeros, ones, or rand methods
x2 = torch.ones(5, 3) * 2.5  #example of ones method
x3 = torch.rand(5, 3)  # example of rand method
print('\033[1m'+'Tensor generated with zeros method:'+'\033[0m')
print(x1)
print('\033[1m'+'\nTensor generated with ones method:'+'\033[0m')
print(x2)
print('\033[1m'+'\nRandomly generated tensor:'+'\033[0m')
print(x3)

Tensor generated with zeros method:
tensor([[0., 0., 0.],
        [0., 0., 0.],
        [0., 0., 0.],
        [0., 0., 0.],
        [0., 0., 0.]])

Tensor generated with ones method:
tensor([[2.5000, 2.5000, 2.5000],
        [2.5000, 2.5000, 2.5000],
        [2.5000, 2.5000, 2.5000],
        [2.5000, 2.5000, 2.5000],
        [2.5000, 2.5000, 2.5000]])

Randomly generated tensor:
tensor([[0.2983, 0.4563, 0.9009],
        [0.3079, 0.5502, 0.3269],
        [0.6809, 0.4641, 0.1907],
        [0.9260, 0.7665, 0.1745],
        [0.3402, 0.9769, 0.3311]])


Also, you can apply mathematical operations on tensors just as in other scientific computing libraries.

In [4]:
print('\033[1m'+'Sum of x2 and x3:'+'\033[0m')
print(x2 + x3)
print('\033[1m'+'\nTranspose of x3:'+'\033[0m')
print(x3.T)
print('\033[1m'+'\nShape of x2'+'\033[0m')
print(x2.shape)    # the .shape attribute returns the dimensions of the tensor
print('\033[1m'+'\nShape of x3.T '+'\033[0m')
print(x3.T.shape)
print('\033[1m'+'\nMatrix multiplication of x2 and x3.T'+'\033[0m')
print(x2 @ x3.T) # @ is matrix multiplication operator. You can use .matmul() method as well.

Sum of x2 and x3:
tensor([[2.7983, 2.9563, 3.4009],
        [2.8079, 3.0502, 2.8269],
        [3.1809, 2.9641, 2.6907],
        [3.4260, 3.2665, 2.6745],
        [2.8402, 3.4769, 2.8311]])

Transpose of x3:
tensor([[0.2983, 0.3079, 0.6809, 0.9260, 0.3402],
        [0.4563, 0.5502, 0.4641, 0.7665, 0.9769],
        [0.9009, 0.3269, 0.1907, 0.1745, 0.3311]])

Shape of x2
torch.Size([5, 3])

Shape of x3.T
torch.Size([3, 5])

Matrix multiplication of x2 and x3.T
tensor([[4.1387, 2.9624, 3.3394, 4.6675, 4.1206],
        [4.1387, 2.9624, 3.3394, 4.6675, 4.1206],
        [4.1387, 2.9624, 3.3394, 4.6675, 4.1206],
        [4.1387, 2.9624, 3.3394, 4.6675, 4.1206],
        [4.1387, 2.9624, 3.3394, 4.6675, 4.1206]])


PyTorch's autograd module lets you take derivatives with respect to the leaf nodes of a computational graph by calling the `.backward()` method of the scalar tensor to be differentiated. In this example, autograd computes the derivative of `out = mean(x + y)` with respect to `x`. Note that differentiating with respect to a matrix (or tensor) is nothing but differentiating with respect to each of its elements.

By default, PyTorch does not track gradients for newly created tensors unless you request it with `requires_grad=True` when creating the tensor.

In [5]:
x = torch.ones(2, 3, requires_grad=True)
y = torch.ones(2, 3, requires_grad=True) * 0.5
print(x)

tensor([[1., 1., 1.],
        [1., 1., 1.]], requires_grad=True)


PyTorch also keeps track of the operations that create tensors so that it can apply the chain rule during backpropagation. For example, `y` is created by a multiplication while `x` is not.

In [6]:
print(y)

tensor([[0.5000, 0.5000, 0.5000],
        [0.5000, 0.5000, 0.5000]], grad_fn=<MulBackward0>)


In [7]:
z = x + y
print(z)

tensor([[1.5000, 1.5000, 1.5000],
        [1.5000, 1.5000, 1.5000]], grad_fn=<AddBackward0>)


In [8]:
out_scalar = z.mean()
print(out_scalar)

tensor(1.5000, grad_fn=<MeanBackward0>)


When you call `.backward()` on a scalar-valued tensor, the autograd module automatically populates the `.grad` fields of the leaf tensors in the computational graph that were created with `requires_grad=True`.

In [9]:
out_scalar.backward()
print(x.grad)

tensor([[0.1667, 0.1667, 0.1667],
        [0.1667, 0.1667, 0.1667]])
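
The value above can be checked by hand. Since `out_scalar` is the mean of the six entries of `x + y`, its partial derivative with respect to each element of `x` should be 1/6 ≈ 0.1667. The following is a minimal sketch of that check; the name `expected` is introduced here only for illustration.

```python
import torch

x = torch.ones(2, 3, requires_grad=True)
# y is the result of a multiplication, so it is not a leaf tensor
y = torch.ones(2, 3, requires_grad=True) * 0.5

out = (x + y).mean()  # out = (1/6) * sum of all (x_i + y_i)
out.backward()

# d(out)/d(x_i) = 1/6 for every element of x
expected = torch.full((2, 3), 1.0 / 6.0)
print(torch.allclose(x.grad, expected))
print(x.is_leaf, y.is_leaf)
```

Only `x` is a leaf here, which is why the tutorial inspects `x.grad` rather than `y.grad`.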


## GPU vs CPU

You can move tensors to GPU memory by calling their `.cuda()` method. If you run an operation on tensors that are in GPU memory, it is automatically computed on the GPU, so you don't need to write CUDA code yourself for most operations.

Here is a benchmark for multiplication of large matrices on the CPU and the GPU.

In [10]:
x = torch.rand(500, 700)
y = torch.rand(700, 900)

start = time()

for n in range(5000):  # depending on the system CPU, this process may take some time
    x.matmul(y)

end = time()

cpu_time = end - start
print(f'{cpu_time:.2f} seconds')

2.61 seconds


In [11]:
x = torch.rand(500, 700).cuda()
y = torch.rand(700, 900).cuda()

start = time()

for n in range(5000):
    x.matmul(y)

end = time()

gpu_time = end - start
print(f'{gpu_time:.2f} seconds')

0.72 seconds


The GPU had a clear advantage here. In fact, this gap is modest compared to the speed-up GPUs typically provide when training deep neural networks.
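
One caveat about timings like the ones above: CUDA operations are launched asynchronously, so a wall-clock measurement can stop before the GPU has actually finished. A more careful sketch calls `torch.cuda.synchronize()` before reading the clock. The helper name `timed_matmul` and the repeat count are assumptions made for this illustration, and the code falls back to the CPU when CUDA is unavailable.

```python
from time import perf_counter

import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
a = torch.rand(500, 700, device=device)
b = torch.rand(700, 900, device=device)


def timed_matmul(a, b, repeats=20):
    """Time `repeats` matrix multiplications, waiting for the GPU if one is used."""
    if a.is_cuda:
        torch.cuda.synchronize()  # make sure earlier GPU work is finished
    start = perf_counter()
    for _ in range(repeats):
        a.matmul(b)
    if a.is_cuda:
        torch.cuda.synchronize()  # wait for the queued kernels to complete
    return perf_counter() - start


elapsed = timed_matmul(a, b)
print(f"{elapsed:.4f} seconds on {device}")
```

The absolute numbers depend heavily on the hardware, so no expected output is shown.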

## Data Utilities

In addition to the scientific computing features of PyTorch, it has utilities for common operations in machine learning. These include dataset operations like sampling, shuffling, and batching of data, as well as loading of commonly used datasets. Some of these features are provided by the `torchtext` library, which is an NLP-specialized library that is officially supported by the PyTorch team. In this example, the AG_NEWS dataset is loaded using `torchtext.datasets`.

In [12]:
from torchtext.datasets import AG_NEWS

train_iter = iter(AG_NEWS(split='train'))

This iterator facilitates the efficient processing of raw data and provides easy access to it.

In [13]:
next(train_iter)

(3,
 "Wall St. Bears Claw Back Into the Black (Reuters) Reuters - Short-sellers, Wall Street's dwindling\\band of ultra-cynics, are seeing green again.")

In the standard NLP process, we convert text into tokens, which in our context are essentially words. These tokens are then mapped to numbers. This can be observed in the following code block:

In [14]:
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator

tokenizer = get_tokenizer("basic_english")
train_iter = AG_NEWS(split="train")


def yield_tokens(data_iter):
    for _, text in data_iter:
        yield tokenizer(text)


vocab = build_vocab_from_iterator(yield_tokens(train_iter), specials=["<unk>"])  # Creates a dictionary for tokens
vocab.set_default_index(vocab["<unk>"])

In [15]:
vocab(['here', 'is', 'an', 'example'])

[475, 21, 30, 5297]

In [16]:
text_pipeline = lambda x: vocab(tokenizer(x))   # Pipelines for conversion
label_pipeline = lambda x: int(x) - 1
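
To see what a vocabulary does without downloading any data, the following pure-Python sketch builds a toy token-to-index mapping. The corpus, the whitespace-based `tokenize` function, and the `encode` helper are all made up for illustration and are not part of `torchtext`; the index ordering used here is likewise just an assumption of this sketch.

```python
from collections import Counter

corpus = ["the cat sat", "the dog sat down"]  # toy data, not AG_NEWS


def tokenize(text):
    # Simplified stand-in for a real tokenizer: lowercase and split on spaces
    return text.lower().split()


counts = Counter(tok for line in corpus for tok in tokenize(line))

# Index 0 is reserved for "<unk>"; other tokens are ordered by
# descending frequency, with ties broken alphabetically
itos = ["<unk>"] + [
    tok for tok, _ in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
]
stoi = {tok: idx for idx, tok in enumerate(itos)}


def encode(text):
    # Unknown tokens fall back to the "<unk>" index
    return [stoi.get(tok, stoi["<unk>"]) for tok in tokenize(text)]


print(encode("the cat barked"))  # [2, 3, 0]
```

The word "barked" never appears in the toy corpus, so it maps to index 0, which mirrors the role of `vocab.set_default_index(vocab["<unk>"])` above.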

The Data Loader in PyTorch efficiently batches and loads datasets for neural network training and evaluation. It supports data shuffling, automatic batching, and parallel data loading with multiple worker processes.

Before a batch is sent to the model, the ``collate_fn`` function processes the samples that ``DataLoader`` has grouped together. The input to ``collate_fn`` is a batch of data with the batch size given to ``DataLoader``, and ``collate_fn`` processes it according to the data processing pipelines declared previously. Make sure that ``collate_fn`` is declared as a top-level def so that the function is available to each worker.

In this example, the text entries in the original data batch input are packed into a list and concatenated as a single tensor for the input of ``nn.EmbeddingBag``. The offset is a tensor of delimiters to represent the beginning index of the individual sequence in the text tensor. The label is a tensor saving the labels of individual text entries.

In [17]:
from torch.utils.data import DataLoader

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")


def collate_batch(batch):
    label_list, text_list, offsets = [], [], [0]
    for _label, _text in batch:
        label_list.append(label_pipeline(_label))
        processed_text = torch.tensor(text_pipeline(_text), dtype=torch.int64)
        text_list.append(processed_text)
        offsets.append(processed_text.size(0))
    label_list = torch.tensor(label_list, dtype=torch.int64)
    offsets = torch.tensor(offsets[:-1]).cumsum(dim=0)
    text_list = torch.cat(text_list)
    return label_list.to(device), text_list.to(device), offsets.to(device)


train_iter = AG_NEWS(split="train")
dataloader = DataLoader(
    train_iter, batch_size=8, shuffle=False, collate_fn=collate_batch
)
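
The offset bookkeeping in `collate_batch` is easier to follow on a tiny example. The sketch below uses three made-up token sequences (the numbers are arbitrary and not real vocabulary indices) and applies the same steps: record each length, drop the last one, and take the cumulative sum.

```python
import torch

# Three hypothetical tokenized texts of lengths 3, 2, and 4
seqs = [
    torch.tensor([4, 7, 1]),
    torch.tensor([9, 2]),
    torch.tensor([5, 5, 5, 3]),
]

lengths = [0] + [s.size(0) for s in seqs]           # [0, 3, 2, 4]
offsets = torch.tensor(lengths[:-1]).cumsum(dim=0)  # start index of each text
flat = torch.cat(seqs)                              # one long 1-D tensor

print(flat)     # tensor([4, 7, 1, 9, 2, 5, 5, 5, 3])
print(offsets)  # tensor([0, 3, 5])
```

Each entry of `offsets` marks where a text begins inside `flat`, which is the layout that ``nn.EmbeddingBag`` expects for 1-D input.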

## The model

To create neural networks with PyTorch, you need to create a class that inherits from `torch.nn.Module`. Then, you define the layers in the `__init__` method and the forward propagation logic in the `forward` method. This class can be considered a blueprint of the network: it specifies the shapes of the parameters, which are only allocated and initialized when the class is instantiated. The example below shows a 3-layer network with input size 30 and layer sizes [20, 15, 1], respectively. The most common layers are provided by the `torch.nn` module.

In [18]:
class DumbNetwork(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.layer1 = torch.nn.Linear(30, 20)
        self.layer2 = torch.nn.Linear(20, 15)
        self.layer3 = torch.nn.Linear(15, 1)

    def forward(self, x):
        x = self.layer1(x)
        x = torch.nn.functional.relu(x)
        x = self.layer2(x)
        x = torch.nn.functional.relu(x)
        x = self.layer3(x)
        return x

In [19]:
input_data = torch.randn(16, 30) #dummy input to match the network's input shape

In [20]:
print(input_data.shape)

torch.Size([16, 30])


To initialize and instantiate a network from the blueprint, you can use the class initialization syntax of Python, which is shown below.
Note that if the `__init__` method takes arguments other than `self`, then you need to initialize the network with these arguments.


In [21]:
network = DumbNetwork()

In [22]:
print(network)

DumbNetwork(
  (layer1): Linear(in_features=30, out_features=20, bias=True)
  (layer2): Linear(in_features=20, out_features=15, bias=True)
  (layer3): Linear(in_features=15, out_features=1, bias=True)
)


When the network is called via the function call operator `()`, the `forward` method of the network is invoked. Note that the network handles inputs as batches and the batch dimension can be arbitrary.

In [23]:
output = network(input_data)

In [24]:
print(output.shape)

torch.Size([16, 1])
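
As a complement, here is a small sketch of a network whose `__init__` takes arguments, together with a check that the batch dimension can vary. The class name `ConfigurableNetwork` and its argument names are hypothetical and chosen only for this illustration.

```python
import torch


class ConfigurableNetwork(torch.nn.Module):  # hypothetical example class
    def __init__(self, in_features, hidden_features, out_features):
        super().__init__()
        self.layer1 = torch.nn.Linear(in_features, hidden_features)
        self.layer2 = torch.nn.Linear(hidden_features, out_features)

    def forward(self, x):
        return self.layer2(torch.relu(self.layer1(x)))


# The constructor arguments must be supplied when instantiating
net = ConfigurableNetwork(30, 20, 1)

# The same network accepts any batch size
for batch_size in (1, 16, 128):
    print(net(torch.randn(batch_size, 30)).shape)

# Weights and biases: (30*20 + 20) + (20*1 + 1) = 641
num_params = sum(p.numel() for p in net.parameters())
print(num_params)  # 641
```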


### Let's Create a Model for Processing AG_NEWS

Now that we know how to create models, let's create a model for our task. As mentioned, we are going to utilize an "embedding bag," which pools the embeddings of all tokens in a text into a single vector (by averaging them in the default `mean` mode). If you're unfamiliar with what embeddings are, you can think of them as representing objects in a multi-dimensional space rather than listing them in a linear array. For example, you could represent the computer you're using in terms of its coordinates (x, y, z) relative to the center of the room.

Text embeddings work similarly; instead of representing each word with a simple number, we represent them in different, abstract dimensions that capture various semantic attributes or ideas.
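
To make the pooling concrete, the sketch below compares the output of `nn.EmbeddingBag` with a manual average of the corresponding rows of its weight matrix. The sizes and token indices are arbitrary values picked for illustration, and the comparison relies on the layer's default `mode="mean"`.

```python
import torch
from torch import nn

torch.manual_seed(0)

bag = nn.EmbeddingBag(num_embeddings=10, embedding_dim=4)  # default mode="mean"

tokens = torch.tensor([1, 2, 4, 5, 4])  # two texts stored back to back
offsets = torch.tensor([0, 3])          # text 1 = tokens[0:3], text 2 = tokens[3:]

pooled = bag(tokens, offsets)           # one vector per text

# Manual equivalent: look up each token's row and average per text
manual = torch.stack([
    bag.weight[tokens[:3]].mean(dim=0),
    bag.weight[tokens[3:]].mean(dim=0),
])

print(pooled.shape)
print(torch.allclose(pooled, manual, atol=1e-6))
```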

You have **five minutes** to fill in the empty spaces and make the following class functional. You can verify the accuracy of your work using the next code block.

In [None]:
from torch import nn
class TextClassificationModel(nn.Module):
    def __init__(self, vocab_size, embed_dim, num_class):
        super(TextClassificationModel, self).__init__()
        self.embedding = nn.EmbeddingBag(vocab_size, embed_dim, sparse=False)
        self.fc =         ## fill here
        self.init_weights()

    def init_weights(self):
        initrange = 0.5
        self.embedding.weight.data.uniform_(-initrange, initrange)
        self.fc.weight.data.uniform_(-initrange, initrange)
        self.fc.bias.data.zero_()

    def forward(self, text, offsets):
        embedded =  ## fill here
        out =   ## fill here
        return out

The ``AG_NEWS`` dataset has four labels and therefore the number of classes is four.

1.  World
2.  Sports
3.  Business
4.  Sci/Tec

Let us select the embedding size as 64.

In [26]:
train_iter = AG_NEWS(split="train")
num_class = len(set([label for (label, text) in train_iter]))
vocab_size = len(vocab)
emsize = 64
model = TextClassificationModel(vocab_size, emsize, num_class).to(device)
print(model)

TextClassificationModel(
  (embedding): EmbeddingBag(95811, 64, mode='mean')
  (fc): Linear(in_features=64, out_features=4, bias=True)
)


### Spoilers

In [25]:
from torch import nn
class TextClassificationModel(nn.Module):
    def __init__(self, vocab_size, embed_dim, num_class):
        super(TextClassificationModel, self).__init__()
        self.embedding = nn.EmbeddingBag(vocab_size, embed_dim, sparse=False)
        self.fc = nn.Linear(embed_dim, num_class)  # maps the pooled embedding to class scores
        self.init_weights()

    def init_weights(self):
        initrange = 0.5
        self.embedding.weight.data.uniform_(-initrange, initrange)
        self.fc.weight.data.uniform_(-initrange, initrange)
        self.fc.bias.data.zero_()

    def forward(self, text, offsets):
        embedded = self.embedding(text, offsets)  # one pooled vector per text in the batch
        out = self.fc(embedded)  # class scores (logits) for each text
        return out

## Model Training

### Example

Since most deep neural networks are trained with a Stochastic Gradient Descent (SGD) variant, PyTorch also has a module that provides most deep learning optimizers, and utilities like learning rate schedulers.


In [27]:
x = torch.tensor(100.0, requires_grad=True)

In [28]:
y = torch.tensor(100.0, requires_grad=True)

In [29]:
optimizer = torch.optim.SGD(params=[x, y],
                            lr=0.01)

Before proceeding, I suggest you try to find the values of `x` and `y` that minimize `my_function(x, y)`.

In [30]:
def my_function(x, y):
    return torch.square(x) + torch.square(y)

You can create optimizers by instantiating the classes in the `torch.optim` module with the parameters to be optimized (which must be an iterable, such as a list) and hyperparameters such as the learning rate and momentum.
Actually, an optimizer does not know anything about the quantity being optimized. It can only access (and reset) the gradient fields (`.grad`) of the tensors in its `params=` argument.

The `.grad` field of a tensor is populated by calling the `.backward()` method of the tensor to be minimized.

In this example, `x` and `y` are optimized to minimize the `z` tensor, so `.backward()` needs to be called on `z`.


In [31]:
#This optimizer is going to optimize x and y
optimizer = torch.optim.SGD(params=[x, y],
                            lr=0.01)


Now, try to run the cell below multiple times to see how the SGD optimizer minimizes z by subtracting the gradients of x and y from themselves after multiplying gradients by the learning rate (lr).

You can run a cell without moving on to the next one by pressing Ctrl+Enter.

In [32]:
#calculate x^2 + y^2
z = my_function(x, y)

print(f'x before step is: {x.item()}')
print(f'y before step is: {y.item()}')
#calculate gradients of z wrt. x and y
optimizer.zero_grad()
z.backward()

print(f'gradient of x is: {x.grad.item()}')
print(f'gradient of y is: {y.grad.item()}')

#take a gradient descent step
optimizer.step()

print(f'x^2 + y^2 = {z.item()}')

x before step is: 100.0
y before step is: 100.0
gradient of x is: 200.0
gradient of y is: 200.0
x^2 + y^2 = 20000.0
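
The update can also be followed by hand. For f(x) = x², the gradient is 2x, so with `lr=0.01` each plain SGD step computes x ← x − 0.01 · 2x = 0.98 · x. The sketch below, which tracks only `x` for brevity, repeats the step three times; the values should come out close to 98.0, 96.04, and 94.12.

```python
import torch

x = torch.tensor(100.0, requires_grad=True)
optimizer = torch.optim.SGD([x], lr=0.01)

for step in range(3):
    optimizer.zero_grad()    # reset the gradient from the previous step
    loss = torch.square(x)   # f(x) = x^2
    loss.backward()          # x.grad = 2 * x
    optimizer.step()         # x <- x - lr * x.grad = 0.98 * x
    print(step, x.item())
```

After n steps the value is approximately 100 · 0.98ⁿ, which approaches the minimizer x = 0.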


### Splitting the data and preparing for training

Since the original ``AG_NEWS`` has no validation dataset, we split the training
dataset into train/validation sets with a split ratio of 0.95 (train) and
0.05 (validation). Here we use the
[torch.utils.data.dataset.random_split](https://pytorch.org/docs/stable/data.html?highlight=random_split#torch.utils.data.random_split)
function in the PyTorch core library.

The [CrossEntropyLoss](https://pytorch.org/docs/stable/nn.html?highlight=crossentropyloss#torch.nn.CrossEntropyLoss)
criterion combines ``nn.LogSoftmax()`` and ``nn.NLLLoss()`` in a single class.
It is useful when training a classification problem with C classes.
[SGD](https://pytorch.org/docs/stable/_modules/torch/optim/sgd.html)
implements the stochastic gradient descent method as the optimizer. The initial
learning rate is set to 5.0.
[StepLR](https://pytorch.org/docs/master/_modules/torch/optim/lr_scheduler.html#StepLR)
is used here to adjust the learning rate through epochs.


In [33]:
from torch.utils.data.dataset import random_split
from torchtext.data.functional import to_map_style_dataset

# Hyperparameters
EPOCHS = 10  # epoch
LR = 5  # learning rate
BATCH_SIZE = 64  # batch size for training

criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=LR)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, 1.0, gamma=0.1)
total_accu = None
train_iter, test_iter = AG_NEWS()
train_dataset = to_map_style_dataset(train_iter)
test_dataset = to_map_style_dataset(test_iter)
num_train = int(len(train_dataset) * 0.95)
split_train_, split_valid_ = random_split(
    train_dataset, [num_train, len(train_dataset) - num_train]
)

train_dataloader = DataLoader(
    split_train_, batch_size=BATCH_SIZE, shuffle=True, collate_fn=collate_batch
)
valid_dataloader = DataLoader(
    split_valid_, batch_size=BATCH_SIZE, shuffle=True, collate_fn=collate_batch
)
test_dataloader = DataLoader(
    test_dataset, batch_size=BATCH_SIZE, shuffle=True, collate_fn=collate_batch
)
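
The statement that ``CrossEntropyLoss`` combines ``nn.LogSoftmax()`` and ``nn.NLLLoss()`` can be checked numerically. The sketch below uses random logits for a made-up batch of five samples and four classes; the names `ce` and `two_step` are introduced only for this comparison.

```python
import torch
from torch import nn

torch.manual_seed(0)

logits = torch.randn(5, 4)                # 5 samples, 4 classes (raw scores)
targets = torch.tensor([0, 3, 1, 2, 2])   # arbitrary class indices

# One-step version
ce = nn.CrossEntropyLoss()(logits, targets)

# Two-step version: log-softmax over classes, then negative log-likelihood
two_step = nn.NLLLoss()(nn.LogSoftmax(dim=1)(logits), targets)

print(torch.allclose(ce, two_step))
```

This is also why the model's `forward` returns raw scores without applying a softmax: the criterion handles that step internally.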

### Let's create the optimization loop

Let us fill in the missing parts of the training loop. Then run the next code block to test your code and train your model.

In [34]:
import time


def train(dataloader):
    model.train()
    total_acc, total_count = 0, 0
    log_interval = 500
    start_time = time.time()

    for idx, (label, text, offsets) in enumerate(dataloader):
        ##########################
        # call zero_grad, run the model (with text and offsets), compute the loss, and calculate the gradients
        
        predicted_label =
        loss = 
        
        ############
        torch.nn.utils.clip_grad_norm_(model.parameters(), 0.1)  ## To prevent exploding gradients
        ############
        ## call optimizer step function
        ############
        total_acc += (predicted_label.argmax(1) == label).sum().item()
        total_count += label.size(0)
        if idx % log_interval == 0 and idx > 0:
            elapsed = time.time() - start_time
            print(
                "| epoch {:3d} | {:5d}/{:5d} batches "
                "| accuracy {:8.3f}".format(
                    epoch, idx, len(dataloader), total_acc / total_count
                )
            )
            total_acc, total_count = 0, 0
            start_time = time.time()


def evaluate(dataloader):
    model.eval()
    total_acc, total_count = 0, 0

    with torch.no_grad():
        for idx, (label, text, offsets) in enumerate(dataloader):
            predicted_label = model(text, offsets)
            loss = criterion(predicted_label, label)
            total_acc += (predicted_label.argmax(1) == label).sum().item()
            total_count += label.size(0)
    return total_acc / total_count

In [35]:
for epoch in range(1, EPOCHS + 1):
    epoch_start_time = time.time()
    train(train_dataloader)
    accu_val = evaluate(valid_dataloader)
    if total_accu is not None and total_accu > accu_val:
        scheduler.step()
    else:
        total_accu = accu_val
    print("-" * 59)
    print(
        "| end of epoch {:3d} | time: {:5.2f}s | "
        "valid accuracy {:8.3f} ".format(
            epoch, time.time() - epoch_start_time, accu_val
        )
    )
    print("-" * 59)

| epoch   1 |   500/ 1782 batches | accuracy    0.685
| epoch   1 |  1000/ 1782 batches | accuracy    0.849
| epoch   1 |  1500/ 1782 batches | accuracy    0.879
-----------------------------------------------------------
| end of epoch   1 | time:  6.84s | valid accuracy    0.879 
-----------------------------------------------------------
| epoch   2 |   500/ 1782 batches | accuracy    0.898
| epoch   2 |  1000/ 1782 batches | accuracy    0.902
| epoch   2 |  1500/ 1782 batches | accuracy    0.901
-----------------------------------------------------------
| end of epoch   2 | time:  6.82s | valid accuracy    0.903 
-----------------------------------------------------------
| epoch   3 |   500/ 1782 batches | accuracy    0.915
| epoch   3 |  1000/ 1782 batches | accuracy    0.912
| epoch   3 |  1500/ 1782 batches | accuracy    0.915
-----------------------------------------------------------
| end of epoch   3 | time:  6.99s | valid accuracy    0.905 
-------------------------------
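
In the loop above, `scheduler.step()` is called only in epochs where the validation accuracy falls below the best value seen so far. With `gamma=0.1` and a step size of 1, every such call should multiply the learning rate by 0.1. The sketch below illustrates this on a dummy parameter; `param` and `lrs` are names made up for this example.

```python
import torch

param = torch.nn.Parameter(torch.zeros(1))  # dummy parameter, never trained
optimizer = torch.optim.SGD([param], lr=5.0)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1, gamma=0.1)

lrs = [optimizer.param_groups[0]["lr"]]
for _ in range(2):
    optimizer.step()   # called before scheduler.step(), as PyTorch expects
    scheduler.step()   # multiplies the learning rate by gamma
    lrs.append(optimizer.param_groups[0]["lr"])

print(lrs)  # approximately [5.0, 0.5, 0.05]
```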

### Spoilers

In [None]:
import time


def train(dataloader):
    model.train()
    total_acc, total_count = 0, 0
    log_interval = 500
    start_time = time.time()

    for idx, (label, text, offsets) in enumerate(dataloader):
        ##########################
        # call zero_grad, run the model (with text and offsets), compute the loss, and calculate the gradients
        optimizer.zero_grad()
        predicted_label = model(text, offsets)
        loss = criterion(predicted_label, label)
        loss.backward()
        ############
        torch.nn.utils.clip_grad_norm_(model.parameters(), 0.1)  ## To prevent exploding gradients
        ############
        ## call optimizer step function
        optimizer.step()
        ############
        total_acc += (predicted_label.argmax(1) == label).sum().item()
        total_count += label.size(0)
        if idx % log_interval == 0 and idx > 0:
            elapsed = time.time() - start_time
            print(
                "| epoch {:3d} | {:5d}/{:5d} batches "
                "| accuracy {:8.3f}".format(
                    epoch, idx, len(dataloader), total_acc / total_count
                )
            )
            total_acc, total_count = 0, 0
            start_time = time.time()


def evaluate(dataloader):
    model.eval()
    total_acc, total_count = 0, 0

    with torch.no_grad():
        for idx, (label, text, offsets) in enumerate(dataloader):
            predicted_label = model(text, offsets)
            loss = criterion(predicted_label, label)
            total_acc += (predicted_label.argmax(1) == label).sum().item()
            total_count += label.size(0)
    return total_acc / total_count
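
The gradient clipping call in `train` may be easier to understand in isolation. The sketch below assigns a hand-picked gradient with norm 5 to a dummy parameter and clips it to a maximum norm of 0.1, the same threshold used above. The parameter `w` and its gradient values are invented for this illustration.

```python
import torch

w = torch.nn.Parameter(torch.zeros(4))
w.grad = torch.tensor([3.0, 4.0, 0.0, 0.0])  # L2 norm = 5

# Rescales gradients in place; returns the total norm before clipping
total_norm = torch.nn.utils.clip_grad_norm_([w], max_norm=0.1)

print(total_norm.item())     # norm before clipping, about 5.0
print(w.grad.norm().item())  # norm after clipping, about 0.1
```

The direction of the gradient is preserved; only its length is scaled down, which limits how far a single step can move the parameters.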

### Continued

Checking the results of the test dataset…



In [36]:
print("Checking the results of test dataset.")
accu_test = evaluate(test_dataloader)
print("test accuracy {:8.3f}".format(accu_test))

Checking the results of test dataset.
test accuracy    0.909


Use the best model so far and test it on a golf news article.




In [37]:
ag_news_label = {1: "World", 2: "Sports", 3: "Business", 4: "Sci/Tec"}


def predict(text, text_pipeline):
    with torch.no_grad():
        text = torch.tensor(text_pipeline(text))
        output = model(text, torch.tensor([0]))
        return output.argmax(1).item() + 1


ex_text_str = "MEMPHIS, Tenn. – Four days ago, Jon Rahm was \
    enduring the season’s worst weather conditions on Sunday at The \
    Open on his way to a closing 75 at Royal Portrush, which \
    considering the wind and the rain was a respectable showing. \
    Thursday’s first round at the WGC-FedEx St. Jude Invitational \
    was another story. With temperatures in the mid-80s and hardly any \
    wind, the Spaniard was 13 strokes better in a flawless round. \
    Thanks to his best putting performance on the PGA Tour, Rahm \
    finished with an 8-under 62 for a three-stroke lead, which \
    was even more impressive considering he’d never played the \
    front nine at TPC Southwind."

model = model.to("cpu")

print("This is a %s news" % ag_news_label[predict(ex_text_str, text_pipeline)])

This is a Sports news


## Vision

In our previous lessons, we explored NLP using transformer models. Now, let's pivot to the realm of computer vision. We'll work with the classic MNIST dataset, which consists of handwritten digit images. Instead of dealing with the hassle of downloading and preprocessing the dataset manually, we'll leverage the convenience of the torchvision library.

In [38]:
import torchvision
import torchvision.transforms as transforms

In [39]:
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.5,), (0.5,))
])

train_dataset = torchvision.datasets.MNIST(root='./data', train=True, transform=transform, download=True)
train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=64, shuffle=True)
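
A brief note on the transform: `ToTensor()` scales pixel values to the range [0, 1], and `Normalize((0.5,), (0.5,))` then computes `(x - 0.5) / 0.5` for the single grayscale channel, which maps the values to [-1, 1]. The sketch below reproduces that arithmetic on a few hand-picked pixel values using plain `torch`, so it does not need `torchvision`.

```python
import torch

pixels = torch.tensor([0.0, 0.25, 0.5, 1.0])  # example values after ToTensor()
mean, std = 0.5, 0.5

normalized = (pixels - mean) / std  # same formula that Normalize applies
print(normalized)  # tensor([-1.0000, -0.5000,  0.0000,  1.0000])
```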

Let's construct a simple neural network module with two layers. This network will have a hidden dimension of 500. Remember to incorporate the ReLU activation function between the two layers. 

In [51]:
class SimpleNN(nn.Module):
    def __init__(self, input_size, hidden_size, num_classes):
        super(SimpleNN, self).__init__()


    def forward(self, x):

        return x

model = SimpleNN(28*28, 500, 10).to(device)
print(model)
model(torch.rand(10, 784).to(device)).size()

SimpleNN(
  (fc1): Linear(in_features=784, out_features=500, bias=True)
  (fc2): Linear(in_features=500, out_features=10, bias=True)
  (relu): ReLU()
)


torch.Size([10, 10])

### Spoilers

In [40]:
class SimpleNN(nn.Module):
    def __init__(self, input_size, hidden_size, num_classes):
        super(SimpleNN, self).__init__()
        self.fc1 = nn.Linear(input_size, hidden_size)
        self.fc2 = nn.Linear(hidden_size, num_classes)
        self.relu = nn.ReLU()

    def forward(self, x):
        x = x.view(x.size(0), -1)
        x = self.relu(self.fc1(x))
        x = self.fc2(x)
        return x

model = SimpleNN(28*28, 500, 10).to(device)
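
The `x.view(x.size(0), -1)` line in `forward` flattens each image into a vector so that it fits the first linear layer. A minimal sketch with a random batch shaped like MNIST input (64 images, 1 channel, 28×28 pixels) shows the effect; the tensor names are chosen for illustration.

```python
import torch

images = torch.rand(64, 1, 28, 28)       # batch shaped like MNIST images
flat = images.view(images.size(0), -1)   # keep the batch dim, flatten the rest

print(flat.shape)  # torch.Size([64, 784])
```

The `-1` lets PyTorch infer the remaining size (1 × 28 × 28 = 784), which matches the `input_size` passed to `SimpleNN`.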

### Continued

In [41]:
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

In [42]:
num_epochs = 5
for epoch in range(num_epochs):
    for i, (images, labels) in enumerate(train_loader):
        images = images.to(device)
        labels = labels.to(device)
        optimizer.zero_grad()
        # Forward pass
        outputs = model(images)
        loss = criterion(outputs, labels)

        # Backward and optimize
        
        loss.backward()
        optimizer.step()

        if (i+1) % 300 == 0:
            print(f'Epoch [{epoch+1}/{num_epochs}], Step [{i+1}/{len(train_loader)}], Loss: {loss.item():.4f}')

Epoch [1/5], Step [300/938], Loss: 0.5572
Epoch [1/5], Step [600/938], Loss: 0.2961
Epoch [1/5], Step [900/938], Loss: 0.2194
Epoch [2/5], Step [300/938], Loss: 0.0729
Epoch [2/5], Step [600/938], Loss: 0.1057
Epoch [2/5], Step [900/938], Loss: 0.0985
Epoch [3/5], Step [300/938], Loss: 0.4766
Epoch [3/5], Step [600/938], Loss: 0.1331
Epoch [3/5], Step [900/938], Loss: 0.0943
Epoch [4/5], Step [300/938], Loss: 0.0733
Epoch [4/5], Step [600/938], Loss: 0.0594
Epoch [4/5], Step [900/938], Loss: 0.0346
Epoch [5/5], Step [300/938], Loss: 0.1152
Epoch [5/5], Step [600/938], Loss: 0.0254
Epoch [5/5], Step [900/938], Loss: 0.1785


In [43]:
test_dataset = torchvision.datasets.MNIST(root='./data', train=False, transform=transform, download=True)
test_loader = torch.utils.data.DataLoader(test_dataset, batch_size=64, shuffle=False)

In [44]:
with torch.no_grad():
    correct = 0
    total = 0
    for images, labels in test_loader:
        images = images.to(device)
        labels = labels.to(device)
        outputs = model(images)
        predicted = outputs.argmax(1)
        total += labels.size(0)
        correct += (predicted == labels).sum().item()
print(f'Accuracy of the model on the test images: {100 * correct / total:.2f}%')

Accuracy of the model on the test images: 97.42%
