<a href="https://colab.research.google.com/github/Alex112525/Neural-Networks-with-PyTorch-course/blob/main/PyTorch_Project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

*TorchText* is a powerful library that can be used for many natural language processing (NLP) tasks. Its use cases include text classification, sequence tagging, machine translation, and sentiment analysis.

In [None]:
!pip install "portalocker>=2.0.0"
!pip install torchtext --upgrade

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader
from torch.utils.data.dataset import random_split

import torchtext
from torchtext.datasets import DBpedia
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator
from torchtext.data.functional import to_map_style_dataset

torchtext.__version__

'0.15.2+cpu'

## Dataset processing and vocabulary building

In [None]:
test_DB = iter(DBpedia(split="train"))

In [None]:
next(test_DB)

(1,
 'E. D. Abbott Ltd  Abbott of Farnham E D Abbott Limited was a British coachbuilding business based in Farnham Surrey trading under that name from 1929. A major part of their output was under sub-contract to motor vehicle manufacturers. Their business closed in 1972.')

Tokenization splits a piece of text into smaller units called tokens. Tokens can be words, characters, or subwords, so tokenization is broadly classified into three types: *word*, *character*, and *subword* (character n-grams).
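To make the three levels concrete, here is a minimal plain-Python sketch. The helpers `word_tokens`, `char_tokens`, and `char_ngrams` are hypothetical names chosen for illustration; they are not part of TorchText and are much cruder than a real tokenizer.

```python
# Illustrative only: plain-Python versions of the three tokenization levels.
# These helpers are hypothetical and are not part of TorchText.

def word_tokens(text: str) -> list[str]:
    """Split on whitespace after lowercasing (a crude word tokenizer)."""
    return text.lower().split()

def char_tokens(text: str) -> list[str]:
    """Treat every non-space character as a token."""
    return [ch for ch in text.lower() if not ch.isspace()]

def char_ngrams(word: str, n: int = 3) -> list[str]:
    """Slide a window of n characters over a single word."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]

sample = "Neural networks"
print(word_tokens(sample))       # ['neural', 'networks']
print(char_tokens(sample)[:6])   # ['n', 'e', 'u', 'r', 'a', 'l']
print(char_ngrams("neural"))     # ['neu', 'eur', 'ura', 'ral']
```

Word tokens keep the vocabulary readable but cannot represent unseen words, while character and n-gram tokens trade longer sequences for better coverage.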

In [None]:
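from typing import Iterable, Iterator

tokenizer = get_tokenizer("basic_english")
train_iter = DBpedia(split="train")

def yield_tokens(data_iter: Iterable) -> Iterator[list]:
  # Generator: lazily yields the token list of each (label, text) sample
  for _, text in data_iter:
    yield tokenizer(text)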

vocab = build_vocab_from_iterator(yield_tokens(train_iter), specials=["<unknown>"])
vocab.set_default_index(vocab["<unknown>"])

In [None]:
test = tokenizer("Hello I am Alex, I'm studying in Platzi")
test

['hello', 'i', 'am', 'alex', ',', 'i', "'", 'm', 'studying', 'in', 'platzi']

In [None]:
vocab(test)

[7296, 187, 2409, 2215, 90515, 187, 17, 104, 4782, 3, 0]

In [None]:
text_pipeline = lambda x: vocab(tokenizer(x))
label_pipeline = lambda x: int(x) - 1

In [None]:
text_pipeline("Hello I'm Alex")

[7296, 187, 17, 104, 2215]

## Creating a Dataloader

In [None]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

def collate_batch(batch: list) -> tuple:
  label_list = []
  text_list = []
  offsets = [0]

  for (_label, _text) in batch:
    label_list.append(label_pipeline(_label))
    processed_text = torch.tensor(text_pipeline(_text), dtype=torch.int64)
    text_list.append(processed_text)
    offsets.append(processed_text.size(0))

  label_list = torch.tensor(label_list, dtype=torch.int64)
  offsets = torch.tensor(offsets[:-1]).cumsum(dim=0)
  text_list = torch.cat(text_list)

  return label_list.to(device), text_list.to(device), offsets.to(device)

A *dataloader* is a PyTorch utility that loads and batches data. It provides an iterable over a dataset that yields batches, and it can also shuffle the data and load it in parallel with multiple worker processes.

Dataloaders provide a number of benefits such as:

- **Efficient data loading**: Dataloaders can load data in parallel using multiprocessing workers. 
- **Batching**: Dataloaders can batch data together, which is useful for training deep learning models. 
- **Shuffling**: Dataloaders can shuffle the data before each epoch. This is useful for training deep learning models because the model does not see the samples in the same order every epoch, which reduces order-related bias in the gradient updates.
- **Customization**: Dataloaders are highly customizable. You can define your own collate function to control how samples are batched together, and you can define your own sampler to control how samples are selected from the dataset.
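The custom collate function defined above packs every sample of a batch into one flat tensor and records where each sample starts. The following plain-Python sketch shows that bookkeeping with lists instead of tensors; the token ids in `samples` are made up for illustration.

```python
from itertools import accumulate

# Hypothetical token-id lists standing in for three tokenized samples.
samples = [[7, 3, 9], [4, 1], [8, 2, 6, 5]]

# Concatenate every sample into one flat sequence, as torch.cat does in collate_batch.
flat = [token for sample in samples for token in sample]

# Each offset is the index in `flat` where a sample starts:
# a running sum of the lengths of all previous samples.
lengths = [len(sample) for sample in samples]
offsets = [0] + list(accumulate(lengths))[:-1]

# Recover the original samples from the flat sequence and the offsets.
bounds = offsets + [len(flat)]
recovered = [flat[start:end] for start, end in zip(bounds, bounds[1:])]

print(flat)                  # [7, 3, 9, 4, 1, 8, 2, 6, 5]
print(offsets)               # [0, 3, 5]
print(recovered == samples)  # True
```

Because the offsets mark the sample boundaries, samples of different lengths can share one tensor without any padding.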

In [None]:
train_iter = DBpedia(split="train")
dataloader = DataLoader(train_iter, batch_size=8, shuffle=True, collate_fn=collate_batch)

In [None]:
for i, (label, text, offset) in enumerate(dataloader):
    print(f"label: {label}, text: {text}, offset: {offset}")
    break

label: tensor([0, 0, 0, 0, 0, 0, 0, 0], device='cuda:0'), text: tensor([184535, 184535,    453,      2,      6,      5,     50,  17634,   1804,
             4,   1545,   4034,   3905,    684,      3,    287,    906,    153,
             2,      1,     54,    845,      3,   2752,     55,   1004,    153,
           910,   2071,    390,    886,     42,    705,     42,    120,    383,
           713,      7,   1159,      2, 184535,     10,   5820,      3,    121,
           193,     11,   9501, 344439,   2617,  78806,      7,   3268, 220285,
             2,  78806,      7, 220285,   2576,    445,     22,   5605,   1129,
            21,      4,    334,      7,    161,   1720,     14,      1,    826,
            17,     18, 132749,     17,     18,      2,  51445,   4674,   8618,
         51445,      6,     19,   4674,   3038,   1112,      7,    305,     54,
            40,     10,    212,      3,   2379,    390,      2,     98,      3,
           647,     11,   3299,   4727,      5,     42, 

## Creating architecture for the classification model

In [None]:
class ClassificationModel(nn.Module):
  def __init__(self, vocab_size: int, embed_dim: int, num_class: int) -> None:
    super(ClassificationModel, self).__init__()

    self.embedding = nn.EmbeddingBag(vocab_size, embed_dim)
    self.bn1 = nn.BatchNorm1d(embed_dim)
    self.fc = nn.Linear(embed_dim, num_class)

  def forward(self, text: torch.Tensor, offsets: torch.Tensor) -> torch.Tensor:
    embedded = self.embedding(text, offsets)
    embedded_norm = self.bn1(embedded)
    embedded_activated = F.relu(embedded_norm)

    return self.fc(embedded_activated)

In [None]:
train_iter = DBpedia(split="train")
num_class = len(set([label for (label, _) in train_iter]))
vocab_size = len(vocab)
embedding = 128

model_cm = ClassificationModel(vocab_size=vocab_size, embed_dim=embedding, num_class=num_class).to(device)

In [None]:
model_cm

ClassificationModel(
  (embedding): EmbeddingBag(802998, 128, mode='mean')
  (bn1): BatchNorm1d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (fc): Linear(in_features=128, out_features=14, bias=True)
)

In [None]:
model_cm(text, offset)

tensor([[-3.2848e-01,  3.2775e-01,  2.3615e-02, -4.9947e-01,  2.7661e-01,
         -1.7258e-01, -4.8254e-01,  2.8958e-01,  4.3761e-02, -3.3177e-01,
          2.8847e-01, -2.8055e-01,  1.4088e-01,  1.3044e-01],
        [-6.9019e-01, -3.8628e-01, -2.7909e-01, -6.5109e-01, -2.3605e-03,
         -5.6130e-03,  5.3470e-01, -8.6261e-02, -2.0238e-01, -5.9136e-01,
          4.6033e-01,  3.1968e-02, -5.2840e-01, -5.3199e-01],
        [ 1.1558e-01, -5.7493e-02,  1.1665e-03, -6.1120e-01,  2.3265e-01,
          5.3019e-01,  4.2965e-01, -2.2074e-01,  7.4407e-02,  1.0521e-01,
          1.5437e-01, -5.0819e-01, -3.1330e-01, -4.4089e-01],
        [ 5.2036e-01,  3.2213e-01,  1.5288e-01, -4.9221e-01, -3.0451e-03,
         -4.8278e-02,  1.1372e-01, -5.6361e-01, -2.5408e-01, -7.9610e-01,
         -1.2549e-01, -3.0932e-01, -8.1546e-01,  4.6232e-01],
        [ 5.9216e-01, -3.4557e-01,  3.1020e-01, -9.5563e-01,  3.1538e-01,
         -1.2268e-01,  5.7765e-02,  4.0617e-02,  3.1634e-01, -1.4622e+00,
         -2.

*numel()* is a PyTorch tensor method that returns the total number of elements in the tensor. Summing it over all trainable parameters gives the size of the model.

In [None]:
def count_parameters(model: ClassificationModel) -> int:
  return sum(p.numel() for p in model.parameters() if p.requires_grad)

In [None]:
print(f"The model has {count_parameters(model_cm):,} trainable parameters")

The model has 102,785,806 trainable parameters
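The total can be checked by hand from the layer sizes in the model summary printed earlier. This is a small arithmetic sketch; the three sizes are copied from that summary, and the variable names are only illustrative.

```python
# Sizes taken from the model summary printed earlier in the notebook.
vocab_size, embed_dim, num_class = 802_998, 128, 14

embedding_params = vocab_size * embed_dim          # EmbeddingBag weight matrix
batchnorm_params = 2 * embed_dim                   # BatchNorm1d scale and shift (affine=True)
linear_params = embed_dim * num_class + num_class  # Linear weights plus biases

total = embedding_params + batchnorm_params + linear_params
print(f"{total:,}")  # 102,785,806
```

Almost all of the parameters sit in the embedding table, so the vocabulary size dominates the size of this model.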


## Training Function

The *nn.utils.clip_grad_norm_* function is used to avoid the "exploding gradient" problem in neural network training. It rescales the gradients of the model parameters so that they are not too large. Gradients that are too large can keep the model from converging or even make it diverge.

The function *clip_grad_norm_* takes as input the model parameters and a maximum value for the gradient norm. It computes a single norm over all the gradients together. If that norm is greater than the maximum value, every gradient is scaled down by the same factor so that the total norm equals the maximum value; otherwise the gradients are left unchanged.
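The arithmetic behind norm clipping can be sketched in plain Python. The helper `clip_by_global_norm` below is a hypothetical illustration of the idea, not the PyTorch implementation, and the small epsilon in the denominator is an assumption added for numerical stability.

```python
import math

def clip_by_global_norm(
    grads: list[list[float]], max_norm: float
) -> tuple[list[list[float]], float]:
    """Rescale all gradients together so their combined L2 norm is at most max_norm.

    Illustrative helper, not the PyTorch implementation.
    """
    total_norm = math.sqrt(sum(g * g for grad in grads for g in grad))
    # Only shrink: a scale above 1 would enlarge gradients that are already small.
    scale = min(1.0, max_norm / (total_norm + 1e-6))
    clipped = [[g * scale for g in grad] for grad in grads]
    return clipped, total_norm

# Two hypothetical parameter gradients with a combined norm of 5.
grads = [[3.0, 0.0], [0.0, 4.0]]
clipped, norm_before = clip_by_global_norm(grads, max_norm=0.1)
norm_after = math.sqrt(sum(g * g for grad in clipped for g in grad))
print(norm_before, round(norm_after, 4))
```

All gradients share one scale factor, so clipping changes the length of the update step but not its direction.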

In [None]:
def train(model: ClassificationModel, dataloader: DataLoader) -> tuple:
  model.train()

  epoch_acc = 0
  epoch_loss = 0
  total_count = 0

  for i, (label, text, offset) in enumerate(dataloader):
    optimizer.zero_grad()

    predict = model(text, offset)

    loss = loss_fn(predict, label)

    loss.backward()

    acc = (predict.argmax(1) == label).sum()

    nn.utils.clip_grad_norm_(model.parameters(), 0.1)

    optimizer.step()

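    epoch_acc += acc.item()
    # loss.item() is already the batch mean, so epoch_loss/total_count below is
    # roughly the mean loss divided by the batch size, not a per-sample loss
    epoch_loss += loss.item()
    total_count += label.size(0)

    if i % 500 == 0:
      # `epoch` is read from the enclosing training loop (a global variable)
      print(f"epoch:{epoch} | {i}/{len(dataloader)} batches| loss:{epoch_loss/total_count} | acc:{epoch_acc/total_count}")

  return epoch_acc/total_count, epoch_loss/total_count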


In [None]:
def eval(model: ClassificationModel, dataloader: DataLoader) -> tuple:
  model.eval()
  epoch_acc = 0
  total_count = 0
  epoch_loss = 0

  with torch.no_grad():
    for i, (label, text, offset) in enumerate(dataloader):
      predict = model(text, offset)

      loss = loss_fn(predict, label)
      acc = (predict.argmax(1) == label).sum()

      epoch_loss += loss.item()
      epoch_acc += acc.item()
      total_count += label.size(0)

  return epoch_acc/total_count, epoch_loss/total_count

## Preparing training: data splitting, loss and optimization.

In machine learning, *hyperparameters* are parameters that are not learned from data, but are set prior to training. They are used to control the learning process and can have a significant impact on the performance of the model.

In [None]:
#Hyperparameters
EPOCH = 4
LEARNING_RATE = 0.22
BATCH_SIZE = 64

In [None]:
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model_cm.parameters(), lr=LEARNING_RATE)

The *to_map_style_dataset* function converts an iterable-style dataset into a map-style dataset. It takes an iterable dataset and returns a map-style dataset.

A map-style dataset is a PyTorch Dataset that implements `__getitem__` and `__len__`, so each sample can be fetched by index and the total number of samples is known. Here this is needed because *random_split* requires the dataset length to divide the training data into training and validation subsets.
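The idea of a map-style dataset can be sketched without PyTorch. The class `ListBackedDataset` and the generator `stream` below are hypothetical stand-ins written for illustration; they show the two methods that indexing and splitting rely on, not how TorchText implements the conversion.

```python
class ListBackedDataset:
    """Hypothetical stand-in for a map-style dataset: it supports len() and indexing."""

    def __init__(self, iterable):
        # Materialize the stream once so any sample can be fetched by position.
        self._samples = list(iterable)

    def __len__(self) -> int:
        return len(self._samples)

    def __getitem__(self, index: int):
        return self._samples[index]

def stream():
    """A one-pass iterable of (label, text) pairs, like an iterable-style dataset."""
    yield 1, "a company"
    yield 9, "a village"
    yield 13, "a film"

dataset = ListBackedDataset(stream())
print(len(dataset))   # 3
print(dataset[1])     # (9, 'a village')
```

The trade-off is memory: the whole stream is held in a list, which is what makes random access and a known length possible.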

In [None]:
train_iter, test_iter = DBpedia()
train_data = to_map_style_dataset(train_iter)
test_dataset = to_map_style_dataset(test_iter)

split_train = int(len(train_data)* 0.95)

train_dataset, validation_dataset = random_split(train_data, [split_train, len(train_data)-split_train]) 

train_dataloader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True, collate_fn=collate_batch)
# Evaluation does not depend on sample order, so shuffling is unnecessary here
validation_dataloader = DataLoader(validation_dataset, batch_size=BATCH_SIZE, shuffle=False, collate_fn=collate_batch)
test_dataloader = DataLoader(test_dataset, batch_size=BATCH_SIZE, shuffle=False, collate_fn=collate_batch)

## Training and Evaluation.

In [None]:
best_valid_loss = float("inf")

for epoch in range(1, EPOCH+1):
  train_acc, train_loss = train(model_cm, train_dataloader)

  eval_acc, eval_loss = eval(model_cm, validation_dataloader)

  # Save a checkpoint only when the validation loss improves
  if eval_loss < best_valid_loss:
    best_valid_loss = eval_loss
    torch.save(model_cm.state_dict(), "best_model.pt")


epoch:1 | 0/8313 batches| loss:0.021196749061346054 | acc:0.5625
epoch:1 | 500/8313 batches| loss:0.018923175156845187 | acc:0.6487337824351297
epoch:1 | 1000/8313 batches| loss:0.018162808962516135 | acc:0.6593874875124875
epoch:1 | 1500/8313 batches| loss:0.017615180856596024 | acc:0.6672426715522984
epoch:1 | 2000/8313 batches| loss:0.017196083099286684 | acc:0.6730775237381309
epoch:1 | 2500/8313 batches| loss:0.0168769058746482 | acc:0.6771603858456617
epoch:1 | 3000/8313 batches| loss:0.01656188335771195 | acc:0.6816061312895701
epoch:1 | 3500/8313 batches| loss:0.016331112985354224 | acc:0.6852372536418166
epoch:1 | 4000/8313 batches| loss:0.01610981972417446 | acc:0.6883396338415396
epoch:1 | 4500/8313 batches| loss:0.01590883541749606 | acc:0.6913463674738947
epoch:1 | 5000/8313 batches| loss:0.015709996309952273 | acc:0.6944267396520696
epoch:1 | 5500/8313 batches| loss:0.015560685703776394 | acc:0.696575054535539
epoch:1 | 6000/8313 batches| loss:0.01540673974446097 | acc:0.

## Inference

*torch.compile(model, mode="reduce-overhead")* compiles the model so that it runs with less Python overhead. This optimization may cost a small amount of additional memory. The mode is aimed at small models and small batches, such as our classifier.

In [None]:
DBpedia_label = {1: "Company",
                 2: "Educational Institution",
                 3: "Artist",
                 4: "Athlete",
                 5: "OfficeHolder",
                 6: "Mean of transportation",
                 7: "Building",
                 8: "Natural place",
                 9: "Village",
                 10: "Animal",
                 11: "Plant",
                 12: "Album",
                 13: "Film", 
                 14: "Written Work"}

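def predict(model: ClassificationModel, text: str) -> int:
  # Switch BatchNorm to inference mode; in training mode it rejects a batch of one sample
  model.eval()
  with torch.no_grad():
    text = torch.tensor(text_pipeline(text), dtype=torch.int64)
    opt_mod = torch.compile(model, mode="reduce-overhead")
    # A single sample starts at offset 0 of the flat token tensor
    output = opt_mod(text, torch.tensor([0]))
    # Shift the 0-based class index back to the 1-based DBpedia label
    return output.argmax(1).item() + 1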

In [None]:
Example_1 = ("Nithari is a village in the western part of the state of Uttar Pradesh "
             "India bordering on New Delhi. Nithari forms part of the New Okhla Industrial "
             "Development Authority's planned industrial city Noida falling in Sector 31. "
             "Nithari made international news headlines in December 2006 when the skeletons "
             "of a number of apparently murdered women and children were unearthed in the village.")

model_test = model_cm.to("cpu")
print(f"Output example 1: {DBpedia_label[predict(model_test, Example_1)]}")

Output example 1: Village
