<a href="https://colab.research.google.com/github/LennyRBriones/pytorch/blob/main/torchtext_data_classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Library & dataset


In [1]:
%%capture
!pip install "portalocker>=2.0.0"
!pip install torchtext --upgrade

In [3]:
import torch
import torchtext
from torchtext.datasets import DBpedia

# version
torchtext.__version__

'0.15.2+cpu'

## Processing the dataset and starting the vocabulary

In [4]:
train_iter = iter(DBpedia(split="train"))

In [5]:
next(train_iter)

(1,
 'E. D. Abbott Ltd  Abbott of Farnham E D Abbott Limited was a British coachbuilding business based in Farnham Surrey trading under that name from 1929. A major part of their output was under sub-contract to motor vehicle manufacturers. Their business closed in 1972.')

In [7]:
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator

tokenizer = get_tokenizer("basic_english")
train_iter = DBpedia(split="train")

def yield_tokens(data_iter):
  for _, text in data_iter:
    yield tokenizer(text)

vocab = build_vocab_from_iterator(yield_tokens(train_iter), specials=["<unk>"])
vocab.set_default_index(vocab["<unk>"])


The vocabulary maps a list of tokens to a list of integer indices.

In [8]:
tokenizer("Hi there!, here Lenny making tests")

['hi', 'there', '!', ',', 'here', 'lenny', 'making', 'tests']

In [10]:
vocab(tokenizer("Hi there!, here Lenny making tests, nihao!"))

[10371, 313, 403, 90515, 1538, 13823, 1031, 5247, 90515, 0, 403]

The vocabulary maps every token it saw while being built to that token's index. Tokens it has not seen, such as "nihao", fall back to `<unk>` thanks to `set_default_index`, so they are stored as 0.
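
To make the fallback behaviour concrete, here is a minimal, library-free sketch of a vocabulary with a default index. It is a simplified stand-in, not torchtext's implementation: the helper names `build_toy_vocab` and `lookup` are hypothetical, and the tie-breaking rule (alphabetical) is an assumption chosen only to make the result deterministic.

```python
from collections import Counter

def build_toy_vocab(token_lists, specials=("<unk>",)):
    """Map tokens to indices: specials first, then tokens by descending frequency."""
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    # Sort by frequency (descending), then alphabetically so ties are deterministic.
    ordered = sorted(counts, key=lambda tok: (-counts[tok], tok))
    itos = list(specials) + ordered
    return {tok: idx for idx, tok in enumerate(itos)}

def lookup(stoi, tokens, default_token="<unk>"):
    """Return indices; unknown tokens fall back to the index of default_token."""
    default_index = stoi[default_token]
    return [stoi.get(tok, default_index) for tok in tokens]

corpus = [["hi", "there"], ["hi", "lenny"]]
toy_vocab = build_toy_vocab(corpus)
indices = lookup(toy_vocab, ["hi", "lenny", "nihao"])
print(indices)  # "nihao" is not in the corpus, so it maps to index 0
```

Because `<unk>` is listed first among the specials, it always receives index 0, which is why unknown words show up as 0 in the notebook output.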

In [11]:
text_pipeline = lambda x: vocab(tokenizer(x))
label_pipeline = lambda x: int(x) - 1
# DBpedia labels start at 1; subtract 1 so class indices start at 0

In [12]:
text_pipeline("Hi, I'am Lenny")

[10371, 90515, 187, 17, 2409, 13823]

In [13]:
label_pipeline("10")

9
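
As a quick sanity check on the label shift, the sketch below assumes the 14-class DBpedia ontology dataset, whose raw labels run from 1 to 14. The constant `NUM_CLASSES` is introduced here for illustration. The shift matters because classification losses such as `torch.nn.CrossEntropyLoss` expect target indices in the range 0 to C-1.

```python
NUM_CLASSES = 14  # assumption: DBpedia ontology classes, labelled 1..14 in the raw data

def label_pipeline(label):
    # Shift 1-based labels to 0-based class indices.
    return int(label) - 1

shifted = [label_pipeline(raw) for raw in range(1, NUM_CLASSES + 1)]
print(min(shifted), max(shifted))
```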

The `DataLoader` lets us load a large dataset in small batches.

In [15]:
# Use CUDA when available to speed up processing of large datasets
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

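def collate_batch(batch):
  label_list = []
  text_list = []
  offsets = [0]

  for (_label, _text) in batch:
    label_list.append(label_pipeline(_label))
    # Convert the text to a 1-D tensor of vocabulary indices
    processed_text = torch.tensor(text_pipeline(_text), dtype=torch.int64)
    text_list.append(processed_text)
    # Record the length of this text; lengths become start positions below
    offsets.append(processed_text.size(0))

  label_list = torch.tensor(label_list, dtype=torch.int64)
  # Drop the last length and take the cumulative sum to get each text's start index
  offsets = torch.tensor(offsets[:-1]).cumsum(dim=0)
  # Concatenate all texts into one flat tensor; offsets mark where each one begins
  text_list = torch.cat(text_list)
  return label_list.to(device), text_list.to(device), offsets.to(device)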
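
The offsets arithmetic in `collate_batch` is easier to follow on a tiny example. The sketch below reproduces it in plain Python, without tensors, on a hypothetical batch of three already-numericalized texts. A flat list plus offsets is the layout that `torch.nn.EmbeddingBag` accepts, although the notebook has not yet said which model consumes these batches.

```python
from itertools import accumulate

# Hypothetical batch: three texts of different lengths, already mapped to indices.
batch_token_ids = [[10371, 313], [187, 17, 2409], [13823]]

lengths = [len(ids) for ids in batch_token_ids]
# Start position of each text = cumulative sum of the lengths before it.
offsets = [0] + list(accumulate(lengths))[:-1]
# All tokens concatenated into one flat list, as torch.cat does for tensors.
flat = [tok for ids in batch_token_ids for tok in ids]

# Recover each text from the flat list using consecutive offsets.
bounds = offsets + [len(flat)]
recovered = [flat[start:end] for start, end in zip(bounds, bounds[1:])]
print(offsets)
```

Here the lengths are 2, 3 and 1, so the texts start at positions 0, 2 and 5 of the flat list, and slicing between consecutive offsets returns the original texts.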

In [16]:
from torch.utils.data import DataLoader

train_iter = DBpedia(split="train")
# batch_size=8 is kept small because this notebook runs on a Colab CPU
dataloader = DataLoader(train_iter, batch_size=8, shuffle=False, collate_fn=collate_batch)

In [17]:
dataloader

<torch.utils.data.dataloader.DataLoader at 0x7f461610c8b0>
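
The printed object is only a handle: batches are produced when the `DataLoader` is iterated. As a rough, library-free sketch of what that iteration does with `shuffle=False`, the hypothetical helper `iterate_batches` below reads `batch_size` samples at a time and passes each group to a collate function. It is an illustration of the idea, not PyTorch's implementation, and the sample records are made up.

```python
from itertools import islice

def iterate_batches(samples, batch_size, collate_fn=list):
    """Yield collated batches from any iterable, reading batch_size items at a time."""
    iterator = iter(samples)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return  # the underlying iterable is exhausted
        yield collate_fn(batch)

# Hypothetical (label, text) samples standing in for DBpedia records.
samples = [(1, "a"), (2, "b"), (3, "c"), (1, "d"), (2, "e")]
batches = list(iterate_batches(samples, batch_size=2))
print(len(batches))
```

The final batch holds the single leftover sample, which mirrors the `DataLoader` default of keeping an incomplete last batch (`drop_last=False`).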