# PyTorch Deep Learning Tutorial

## 1. Introduction

This tutorial introduces a general approach to training deep learning models with PyTorch. It is divided into the following parts:
1. Introduction (the current section)
2. Introduction to PyTorch and Tensor
3. Pipeline of Training a Deep Learning Model
4. Dataset
5. Tokenizer
6. Vocabulary
7. Label Mapping
8. Model
9. Loss Function
10. Optimizer
11. Data Loader
12. Training, Validation, Testing and Prediction
13. Complete Implementation

## 2. Introduction to PyTorch and Tensor

### 2.1 PyTorch

PyTorch is an open-source machine learning framework for training deep learning models. Similar frameworks include TensorFlow, Keras and JAX.

The install command for the latest PyTorch version can be found on its [homepage](https://pytorch.org/). Several options need to be selected there:
+ PyTorch Build: The PyTorch release to install (e.g. Stable).
+ Your OS: Your operating system.
+ Package: Installation method. Conda and Pip are generally used.
+ Language: Python in most cases.
+ Compute Platform: GPU (CUDA), CPU or others.

After selecting the above options, run the command shown under "Run this Command" to install PyTorch.  
E.g. Stable (1.12.0), Linux, Pip, Python, CPU:
```bash
pip3 install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cpu
```

Note that `torchvision` and `torchaudio` are task-specific packages for computer vision and audio processing, respectively. For NLP, they are not needed and can be removed, resulting in the following command:
```bash
pip3 install torch --extra-index-url https://download.pytorch.org/whl/cpu
```

After installing `torch`, check that it has been installed correctly with the following **Python code**:

In [1]:
import torch


If nothing is printed after executing the above line of code, the `torch` package has been installed successfully.
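As an optional extra check, printing the installed version confirms which build Python actually imports. This is a minimal sketch; the version string you see depends on your installation.

```python
import torch

# torch.__version__ is a string such as "1.12.0+cpu"; the exact value
# depends on the build that was installed.
installed_version = torch.__version__
print(installed_version)
```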

Many tutorials on training deep learning models with PyTorch can be found [here](https://pytorch.org/tutorials/). They cover many topics, including NLP and CV.
The PyTorch documentation is [here](https://pytorch.org/docs/stable/index.html). Check it out if you are not familiar with certain classes, methods, etc.

### 2.2 `torchtext`

`torchtext`, similar to `torchvision` and `torchaudio`, is a task-specific package for text. It is useful for NLP tasks. `torchtext` can be installed with the following command:
```bash
pip install torchtext
```

<!-- Note that the version of `torchtext` should be compatible with `torch`. But don't worry, when downloading `torchtext`, the correct version of `torch` will also be downloaded. -->

After installing `torchtext`, check that it has been installed correctly with the following **Python code**:

In [2]:
import torchtext

Again, if nothing is printed after executing the above line of code, the `torchtext` package has been installed successfully.

### 2.3 Tensors

Tensors are the basic units in PyTorch. A tensor is like a Python **list** or a NumPy **ndarray**. **A tensor is analogous to a matrix generalized to any number of dimensions, and it supports many deep learning operations (e.g. matrix operations, running on a GPU, automatic differentiation, etc.).**
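To illustrate what sets a tensor apart from a plain list, the sketch below uses automatic differentiation on a small tensor. The variable names and values are illustrative only.

```python
import torch

# A tensor that records operations so that gradients can be computed.
x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)

# y = x1^2 + x2^2 + x3^2, so dy/dx = 2 * x.
y = (x ** 2).sum()
y.backward()

print(x.grad)  # tensor([2., 4., 6.])
```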

### 2.3.1 Tensor Creation

There are a lot of ways to create a tensor. See the examples below. More examples can be found [here](https://pytorch.org/tutorials/beginner/basics/tensorqs_tutorial.html).

#### 2.3.1.1 Creating a Tensor from a Python List

In [3]:
data = [
    [1, 2],
    [3, 4]
]
tensor = torch.tensor(data)

print(f"data:\n{data}\n")
print(f"tensor:\n{tensor}\n")

data:
[[1, 2], [3, 4]]

tensor:
tensor([[1, 2],
        [3, 4]])



#### 2.3.1.2 Creating a Tensor from a NumPy ndarray

In [4]:
# Creating a numpy.ndarray.
import numpy as np
data = [
    [1, 2],
    [3, 4]
]
np_array = np.array(data)

# numpy.ndarray -> torch.tensor.
tensor = torch.from_numpy(np_array)

print(f"data:\n{data}\n")
print(f"numpy array:\n{np_array}\n")
print(f"tensor:\n{tensor}")

data:
[[1, 2], [3, 4]]

numpy array:
[[1 2]
 [3 4]]

tensor:
tensor([[1, 2],
        [3, 4]], dtype=torch.int32)
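
Two details are worth knowing here. First, the dtype is inherited from the NumPy array, which is why the output above shows `torch.int32`; on other platforms NumPy's default integer type, and therefore the tensor dtype, may be `int64`. Second, `torch.from_numpy` does not copy the data, so the tensor and the array share memory, whereas `torch.tensor` makes a copy. A short sketch (the variable names are illustrative):

```python
import numpy as np
import torch

np_array = np.array([[1, 2], [3, 4]])
shared = torch.from_numpy(np_array)  # shares memory with np_array
copied = torch.tensor(np_array)      # makes an independent copy

np_array[0, 0] = 100  # modify the NumPy array in place

print(shared[0, 0].item())  # 100: the change is visible through the tensor
print(copied[0, 0].item())  # 1: the copy is unaffected
```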


### 2.3.2 Tensor Attributes

Tensor attributes describe their shape, datatype, and the device on which they are stored.

In [5]:
# Create a tensor from a python list.
data = [
    [1, 2],
    [3, 4],
    [5, 6]
]
tensor = torch.tensor(data)

print(f"Shape of the tensor is:\n{tensor.shape}\n")
print(f"Datatype of the tensor is:\n{tensor.dtype}\n")
print(f"Device that the tensor is stored on is:\n{tensor.device}\n")

Shape of the tensor is:
torch.Size([3, 2])

Datatype of the tensor is:
torch.int64

Device that the tensor is stored on is:
cpu
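
These attributes can also be set explicitly. As a small sketch, the `dtype` argument of `torch.tensor` and the `.to()` method control the datatype:

```python
import torch

data = [[1, 2], [3, 4], [5, 6]]

# Request a floating-point tensor at creation time.
float_tensor = torch.tensor(data, dtype=torch.float32)
print(float_tensor.dtype)  # torch.float32

# Convert an existing tensor to another datatype.
int_tensor = float_tensor.to(torch.int64)
print(int_tensor.dtype)    # torch.int64
print(int_tensor.shape)    # torch.Size([3, 2])
```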



### 2.3.3 Tensor Operations

Over 100 tensor operations are supported by PyTorch, including arithmetic, linear algebra, matrix manipulation (transposing, indexing, slicing), and sampling. Some tensor operations are presented below. More detailed descriptions can be found [here](https://pytorch.org/docs/stable/torch.html). **Note: you DON'T have to memorize all operations. Look up what you need in the documentation or with Google / Baidu.**

#### 2.3.3.1 Moving a Tensor from CPU to GPU

By default, tensors are created on the CPU. We need to explicitly move tensors to the GPU using the `.to()` method (after checking for GPU availability with `torch.cuda.is_available()`).

In [6]:
print(torch.cuda.is_available())
if torch.cuda.is_available():
    tensor = tensor.to("cuda")  # Move the tensor to gpu.
print(tensor.device)

False
cpu
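
A common pattern, shown here as a sketch, is to choose the device once and reuse it, so that the same code runs with or without a GPU:

```python
import torch

# Pick the GPU when one is available, otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

tensor = torch.tensor([[1, 2], [3, 4]]).to(device)
print(tensor.device.type == device.type)  # True
```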


#### 2.3.3.2 Indexing and Slicing

In [7]:
data = [
    [1, 2],
    [3, 4],
    [5, 6]
]
tensor = torch.tensor(data)
print(f"tensor:\n{tensor}\n")

print(f"First row:\n{tensor[0]}\n")
print(f"First column:\n{tensor[:, 0]}\n")
print(f"Last column:\n{tensor[:, -1]}\n")

tensor[:,1] = 0
print(f"After changing value:\n{tensor}")

tensor:
tensor([[1, 2],
        [3, 4],
        [5, 6]])

First row:
tensor([1, 2])

First column:
tensor([1, 3, 5])

Last column:
tensor([2, 4, 6])

After changing value:
tensor([[1, 0],
        [3, 0],
        [5, 0]])


#### 2.3.3.3 Converting Single-Element Tensors to Python Numerical Values

If you have a one-element tensor (e.g. `tensor([233])`), you can convert it to a Python numerical value using the `item` method:

In [8]:
single_elem_tensor = torch.tensor([233])
print(f"single element tensor:\n{single_elem_tensor}\n")
numerical_val = single_elem_tensor.item()
print(f"numerical value:\n{numerical_val}\n")

single_elem_tensor = torch.tensor([[[233]]])
print(f"single element tensor:\n{single_elem_tensor}\n")
numerical_val = single_elem_tensor.item()
print(f"numerical value:\n{numerical_val}\n")

single element tensor:
tensor([233])

numerical value:
233

single element tensor:
tensor([[[233]]])

numerical value:
233
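
For tensors with more than one element, `item` does not apply; the `tolist` method converts the whole tensor to (nested) Python lists instead. A brief sketch:

```python
import torch

multi_elem_tensor = torch.tensor([[1, 2], [3, 4]])
values = multi_elem_tensor.tolist()
print(values)  # [[1, 2], [3, 4]]
```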



#### 2.3.3.4 Arithmetic Operations

**Matrix multiplication**

In [9]:
data = [
    [1, 2],
    [3, 4],
    [5, 6]
]
tensor = torch.tensor(data)
print(f"tensor:\n{tensor}\n")

y1 = tensor @ tensor.T
print(f"tensor @ tensor.T:\n{y1}\n")
y2 = tensor.matmul(tensor.T)
print(f"tensor.matmul(tensor.T):\n{y2}\n")

tensor:
tensor([[1, 2],
        [3, 4],
        [5, 6]])

tensor @ tensor.T:
tensor([[ 5, 11, 17],
        [11, 25, 39],
        [17, 39, 61]])

tensor.matmul(tensor.T):
tensor([[ 5, 11, 17],
        [11, 25, 39],
        [17, 39, 61]])



**Element-wise multiplication**

In [10]:
data = [
    [1, 2],
    [3, 4],
    [5, 6]
]
tensor = torch.tensor(data)
print(f"tensor:\n{tensor}\n")

z1 = tensor * tensor
print(f"tensor * tensor:\n{z1}\n")
z2 = tensor.mul(tensor)
print(f"tensor * tensor:\n{z2}\n")

# Element-wise multiplication + broadcasting.
z3 = tensor * 233
print(f"tensor * 233:\n{z3}\n")

tensor:
tensor([[1, 2],
        [3, 4],
        [5, 6]])

tensor * tensor:
tensor([[ 1,  4],
        [ 9, 16],
        [25, 36]])

tensor * tensor:
tensor([[ 1,  4],
        [ 9, 16],
        [25, 36]])

tensor * 233:
tensor([[ 233,  466],
        [ 699,  932],
        [1165, 1398]])
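
Broadcasting also applies between tensors of different shapes. In the sketch below, a two-element row (an illustrative value, not part of the original example) is multiplied with every row of the (3, 2) tensor:

```python
import torch

tensor = torch.tensor([[1, 2], [3, 4], [5, 6]])
row = torch.tensor([10, 100])  # shape (2,), broadcast across the 3 rows

z4 = tensor * row
print(z4)
# tensor([[ 10, 200],
#         [ 30, 400],
#         [ 50, 600]])
```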



## 3. Pipeline of Training a Deep Learning Model

In this tutorial, we will train a deep learning model for **text classification**. The task is adapted from the AG News classification task.
+ [Input] A piece of news.
+ [Output] Category of the news (one of Business, Sci/Tech, Sports, World).

Data sample:
+ [Input] Editors' Picks: What do you think of the iMac's newest design? From the introduction of the first Macintosh computer through to the release of the iPod, Apple Computer has earned a reputation for cutting-edge industrial design. Does the newest iMac live up to that reputation?
+ [Output] Sci/Tech

### 3.1 Pipeline

![pipeline](./pipeline.png)

**Pseudo code** for the pipeline is as follows. Details are omitted for brevity.
```python
train_dataset = Dataset(path_of_training_data)
dev_dataset = Dataset(path_of_dev_data)
test_dataset = Dataset(path_of_test_data)

tokenizer = Tokenizer()
vocab = Vocab()
label_mapping = make_label_mapping()

model = Model()

loss_function = LossFunction()
optimizer = Optimizer()

train_loader = DataLoader(train_dataset)
dev_loader = DataLoader(dev_dataset)
test_loader = DataLoader(test_dataset)

best_valid_loss = float("inf")
best_model = None
for i in range(num_epochs):
    train(train_loader, tokenizer, vocab, model, loss_function, optimizer, label_mapping)

    valid_loss = valid(dev_loader, tokenizer, vocab, model, loss_function, label_mapping)
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        best_model = copy.deepcopy(model)

test(test_loader, tokenizer, vocab, best_model, label_mapping)
```

### 3.2 Required Components

According to the pipeline above, seven components are required to train a deep learning model for text classification:
1. **Dataset**: producing input texts and output labels.
2. **Tokenizer**: tokenizing texts.
3. **Vocabulary**: converting tokens to integer ids.
4. **Label mapping**: converting classes to ids.
5. **Model**
6. **Loss function**
7. **Optimizer**

Note that when training, the model often receives **a batch** of training examples instead of one training example. Thus, a data loader is needed for sampling training examples from the dataset, constructing batches, and so on.

8. **Data loader**: Sampling training examples and constructing batches.
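
As a preview of what a data loader does, the sketch below batches a toy dataset (a plain Python list standing in for a real dataset) with `torch.utils.data.DataLoader`. The batch size of 4 is an arbitrary choice for illustration.

```python
from torch.utils.data import DataLoader

toy_dataset = list(range(10))  # ten "examples": 0, 1, ..., 9

# Without shuffling, examples are grouped into batches in order.
loader = DataLoader(toy_dataset, batch_size=4, shuffle=False)

batch_sizes = [len(batch) for batch in loader]
print(batch_sizes)  # [4, 4, 2]
```

The last batch is smaller because 10 is not a multiple of 4.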

The following sections will detail how to construct each component.

## 4. Dataset

### 4.1 Goal of this Section

First, we will create datasets for the training, development and test data. Datasets are created for
+ reading data from data files
+ storing data
+ producing data for training

What we will accomplish in this section can be explained by the following pseudo code:
```python
train_dataset = Dataset(path_of_training_data)
dev_dataset = Dataset(path_of_dev_data)
test_dataset = Dataset(path_of_test_data)
```

### 4.2 Required Methods for Implementing a PyTorch Custom Dataset

In PyTorch, a custom dataset corresponds to a Python **class**. The dataset class must implement three methods: 
+ `__init__`: reading data from data files.
+ `__len__`: returning the length, i.e. number of examples, of the dataset.
+ `__getitem__`: producing an (input, output) pair, given an index.
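
To make the three methods concrete before turning to AG News, here is a minimal sketch with a hypothetical in-memory dataset (`ToyDataset` and its contents are invented for illustration):

```python
from torch.utils.data import Dataset

class ToyDataset(Dataset):
    """A hypothetical dataset holding (text, label) pairs in memory."""

    def __init__(self):
        # A real dataset would read these from a file.
        self.texts = ["good movie", "bad movie"]
        self.labels = ["positive", "negative"]

    def __len__(self) -> int:
        return len(self.labels)

    def __getitem__(self, idx: int):
        return self.texts[idx], self.labels[idx]

toy = ToyDataset()
print(len(toy))  # 2
print(toy[1])    # ('bad movie', 'negative')
```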

### 4.3 Implementation of the Three Methods

Now let's create a dataset class for the text classification task. As mentioned above, it should contain three methods `__init__`, `__len__` and `__getitem__`.

In [11]:
from torch.utils.data import Dataset

class AGNewsDataset(Dataset):
    """Dataset for AG News.

    :param fp: File path of the data.
    """

    def __init__(self, fp: str):
        ...

    def __len__(self):
        ...

    def __getitem__(self, idx: int):
        ...

#### 4.3.1 Implementing `__init__`

First, we'll write a method `read` for reading data from the data file. The method receives the path of the data file to read, and returns a list of texts (model inputs) and a list of labels (model outputs).

Specifically, the data files (`train.csv`, `dev.csv` and `test.csv`) used in this tutorial are in CSV (comma-separated values) format. Each line contains the news category, the news title and the news body, separated by commas. Some examples are as follows:
+ World,"Bush, Kerry Tentatively OK Three Debates (AP)","AP - The campaigns of President Bush and Sen. John Kerry tentatively have agreed to a series of three debates that both sides hope will give them momentum in the closing weeks of the presidential election campaign, a person familiar with the debate negotiations said Sunday night."
+ Sports,Roddick to Lead U.S. Against Belarus (AP),AP - Andy Roddick and the rest of the U.S. Davis Cup team figure it's about time the country reclaimed the championship.
+ Sci/Tech,IMPlanet Weekly News Break,"The free, hosted MSN Spaces service is Microsoft #39;s first consumer foray into providing a blogging platform. Spaces will enable MSN members to maintain a blog and control it with some sophisticated features that are not typical of all blogging services."

We concatenate the news title and news body as text (model input), and use the news category as label (model output).
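
The quoted fields matter here: titles and bodies may contain commas, so splitting on `,` by hand would break them apart. The sketch below parses a shortened version of the first example line from an in-memory string to show how `csv.reader` handles this:

```python
import csv
import io

# A shortened sample line; quoted fields may contain commas.
sample = (
    'World,"Bush, Kerry Tentatively OK Three Debates (AP)",'
    '"AP - The campaigns of President Bush and Sen. John Kerry '
    'tentatively have agreed to a series of three debates."\n'
)

row = next(csv.reader(io.StringIO(sample)))
label, title, body = row

print(len(row))  # 3
print(label)     # World
print(title)     # Bush, Kerry Tentatively OK Three Debates (AP)

# Naive splitting cuts the title at its inner comma.
naive_fields = sample.split(",")
print(len(naive_fields) > 3)  # True
```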

In [12]:
import csv
from typing import Tuple, List

class AGNewsDataset(Dataset):
    """Dataset for AG News.

    :param fp: File path of the data.
    """

    def __init__(self, fp: str):
        self.texts, self.labels = self.read(fp)

    def __len__(self):
        ...

    def __getitem__(self, idx: int):
        ...
    
    @classmethod
    def read(cls, fp: str) -> Tuple[List[str], List[str]]:
        """Obtain texts and labels from the data file.

        :param fp: File path of the data.
        :return: texts and labels.
        """

        texts = []
        labels = []
        with open(fp, encoding="utf-8") as f:
            csv_reader = csv.reader(f)
            for line in csv_reader:  # Read each line in the csv file.
                label, title, body = line  # Each field in the line is split by the csv reader automatically.

                texts.append(f"{title} {body}")  # Concatenate the title and the body as text.
                labels.append(label)

        return texts, labels

#### 4.3.2 Implementing `__len__`

Next, let's implement the `__len__` method. It's generally implemented as the length of the labels.

In [13]:
def __len__(self) -> int:
    return len(self.labels)

#### 4.3.3 Implementing `__getitem__`

Now there is only one method left: `__getitem__`. This method receives an index and returns a pair (input, output).

In [14]:
def __getitem__(self, idx: int) -> Tuple[str, str]:
    text = self.texts[idx]
    label = self.labels[idx]

    return text, label

### 4.4 Complete Implementation for the Dataset

Done! The complete implementation for the dataset is as follows:

In [15]:
class AGNewsDataset(Dataset):
    """Dataset for AG News.

    :param fp: File path of the data.
    """

    def __init__(self, fp: str):
        self.texts, self.labels = self.read(fp)

    def __len__(self) -> int:
        return len(self.labels)

    def __getitem__(self, idx: int) -> Tuple[str, str]:
        text = self.texts[idx]
        label = self.labels[idx]

        return text, label

    @classmethod
    def read(cls, fp: str) -> Tuple[List[str], List[str]]:
        """Obtain texts and labels from the data file.

        :param fp: File path of the data.
        :return: texts and labels.
        """

        texts = []
        labels = []
        with open(fp, encoding="utf-8") as f:
            csv_reader = csv.reader(f)
            for line in csv_reader:
                label, title, body = line

                # Concatenate the title and the body as text.
                texts.append(f"{title} {body}")
                labels.append(label)

                # TODO: Some operations in collate_func can be placed here.

        return texts, labels

Let's check whether we've implemented it correctly.

In [16]:
train_data = AGNewsDataset(fp="../train.csv")

# Check the read method in __init__.
print(f"Model input of the first example in the training data:\n{train_data.texts[0]}\n")
print(f"Model output of the first example in the training data:\n{train_data.labels[0]}\n")

# Check the __len__ method.
print(f"Length of the training data:\n{len(train_data)}\n")

# Check the __getitem__ method.
print(f"Model input of the first example in the training data:\n{train_data[0][0]}\n")
print(f"Model output of the first example in the training data:\n{train_data[0][1]}\n")

Model input of the first example in the training data:
Bush, Kerry Tentatively OK Three Debates (AP) AP - The campaigns of President Bush and Sen. John Kerry tentatively have agreed to a series of three debates that both sides hope will give them momentum in the closing weeks of the presidential election campaign, a person familiar with the debate negotiations said Sunday night.

Model output of the first example in the training data:
World

Length of the training data:
114000

Model input of the first example in the training data:
Bush, Kerry Tentatively OK Three Debates (AP) AP - The campaigns of President Bush and Sen. John Kerry tentatively have agreed to a series of three debates that both sides hope will give them momentum in the closing weeks of the presidential election campaign, a person familiar with the debate negotiations said Sunday night.

Model output of the first example in the training data:
World



### 4.5 Creating Datasets for Training/Development/Test Data

After the check, let's create datasets for training data, development data and test data, respectively.

In [17]:
train_dataset = AGNewsDataset(fp="../train.csv")
dev_dataset = AGNewsDataset(fp="../dev.csv")
test_dataset = AGNewsDataset(fp="../test.csv")

## 5. Tokenizer

### 5.1 Goal of this Section

A tokenizer is for splitting text into tokens. Many tokenizers are available, including `word_tokenize` in `nltk`, `spacy` and so on. Note that some tokenizers will **lowercase** all the tokens when tokenizing, while some will not. Lowercasing tokens can reduce the vocabulary size and the number of embedding parameters, which is beneficial for some tasks.
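
A toy illustration of the effect of lowercasing on vocabulary size, using plain whitespace splitting instead of a real tokenizer (the sentence is invented for the example):

```python
text = "The cat saw the dog . The dog saw THE cat ."

cased_tokens = text.split()
lowercased_tokens = text.lower().split()

# "The", "the" and "THE" count as three different tokens unless lowercased.
print(len(set(cased_tokens)))       # 7
print(len(set(lowercased_tokens)))  # 5
```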

In this section, we create a tokenizer.

### 5.2 Multiple Ways for Tokenizer Creation

Tokenizers can be created in many ways. The following are two ways to construct a tokenizer.

In [18]:
text = "I'am learning to train a deeplearning model."

# Creating a tokenizer with torchtext.
from torchtext.data import get_tokenizer
tokenizer = get_tokenizer('basic_english')
tokens = tokenizer(text)
print(f"Tokenized using torchtext.data.get_tokenizer(\'basic_english\'): {tokens}")

# Creating a tokenizer with nltk.
from nltk.tokenize import word_tokenize
tokens = word_tokenize(text)
print(f"Tokenized using nltk.tokenize.word_tokenize: {tokens}")

Tokenized using torchtext.data.get_tokenizer('basic_english'): ['i', "'", 'am', 'learning', 'to', 'train', 'a', 'deeplearning', 'model', '.']
Tokenized using nltk.tokenize.word_tokenize: ["I'am", 'learning', 'to', 'train', 'a', 'deeplearning', 'model', '.']


### 5.3 Creating a Tokenizer

In this tutorial, we construct a tokenizer with `torchtext`. This tokenizer lowercases all tokens, which is acceptable for text classification here because letter case rarely affects the category of a text.

In [19]:
tokenizer = get_tokenizer('basic_english')

## 6. Vocabulary

### 6.1 Goal of this Section

A vocabulary stores all the **tokens that the model understands**. It is also responsible for **converting tokens into token ids**. A token id is used to look up the embedding of the corresponding token. Sometimes we need to create the vocabulary ourselves, and sometimes we do not (e.g. when we use pre-trained embeddings, the vocab is determined by the vocab used in pre-training).

**Note that for simplicity we do not distinguish between words and tokens in this tutorial; the two terms are used interchangeably.**

In this tutorial, we create a vocab manually, as in the pseudo code below:
```python
def make_vocab():
    ...

vocab = make_vocab()
```

### 6.2 Three Steps for Creating a Vocab

A vocab can be created in three steps:

1. Obtaining all the tokens from the data. In common practice, the vocab is made from the **training data**, as it is for training the model. The development data and the test data are for evaluating the performance of the model, and are not appropriate for vocab construction.
2. Making a token:count mapping.
3. Making the vocab with (1) the token:count mapping and (2) a `min_freq` threshold. Tokens whose frequency is at least `min_freq` are kept, and tokens that occur less often are discarded. The `min_freq` threshold is useful for reducing the vocab size. A smaller vocab results in a smaller embedding lookup table, which requires fewer parameters.
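
The three steps can be sketched in plain Python, without `torchtext`, to show the idea. The toy tokens, the `min_freq` value and the variable names below are assumptions made for illustration. A token is kept when its count is at least `min_freq`, and index 0 is reserved for unknown tokens as a common convention.

```python
from collections import Counter

# Step 1: all tokens from the (toy) training data.
all_tokens = ["the", "cat", "sat", "the", "cat", "the"]

# Step 2: token:count mapping.
counter = Counter(all_tokens)  # the: 3, cat: 2, sat: 1

# Step 3: keep tokens whose count is at least min_freq.
min_freq = 2
kept_tokens = [token for token, count in counter.items() if count >= min_freq]

# Reserve id 0 for unknown tokens, then number the kept tokens.
token_to_id = {"<unk>": 0}
for token in kept_tokens:
    token_to_id[token] = len(token_to_id)

print(token_to_id)  # {'<unk>': 0, 'the': 1, 'cat': 2}

# "sat" was discarded, so it falls back to the unknown id 0.
print([token_to_id.get(t, 0) for t in ["the", "sat"]])  # [1, 0]
```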

### 6.3 Implementation of Each Step

#### 6.3.1 Obtaining all Tokens

Obtaining all the tokens from the training data is quite simple. We have already created the training dataset, and all of its texts can be accessed through its `texts` attribute. Here are the first ten texts:

In [20]:
train_dataset.texts[:10]

['Bush, Kerry Tentatively OK Three Debates (AP) AP - The campaigns of President Bush and Sen. John Kerry tentatively have agreed to a series of three debates that both sides hope will give them momentum in the closing weeks of the presidential election campaign, a person familiar with the debate negotiations said Sunday night.',
 "Roddick to Lead U.S. Against Belarus (AP) AP - Andy Roddick and the rest of the U.S. Davis Cup team figure it's about time the country reclaimed the championship.",
 'IMPlanet Weekly News Break The free, hosted MSN Spaces service is Microsoft #39;s first consumer foray into providing a blogging platform. Spaces will enable MSN members to maintain a blog and control it with some sophisticated features that are not typical of all blogging services.',
 'U.S., Militants Battle in Central Baghdad (AP) AP - A gunbattle between U.S. forces and militants erupted in central Baghdad on Friday, witnesses said.',
 'In the frame OPINIONS were split in the Australian camp 

Therefore, it is not hard to obtain all tokens from it:

In [21]:
all_tokens = [
    token
    for text in train_dataset.texts
    for token in tokenizer(text)
]

# Check it by printing the first 100 tokens:
print(all_tokens[:100])

['bush', ',', 'kerry', 'tentatively', 'ok', 'three', 'debates', '(', 'ap', ')', 'ap', '-', 'the', 'campaigns', 'of', 'president', 'bush', 'and', 'sen', '.', 'john', 'kerry', 'tentatively', 'have', 'agreed', 'to', 'a', 'series', 'of', 'three', 'debates', 'that', 'both', 'sides', 'hope', 'will', 'give', 'them', 'momentum', 'in', 'the', 'closing', 'weeks', 'of', 'the', 'presidential', 'election', 'campaign', ',', 'a', 'person', 'familiar', 'with', 'the', 'debate', 'negotiations', 'said', 'sunday', 'night', '.', 'roddick', 'to', 'lead', 'u', '.', 's', '.', 'against', 'belarus', '(', 'ap', ')', 'ap', '-', 'andy', 'roddick', 'and', 'the', 'rest', 'of', 'the', 'u', '.', 's', '.', 'davis', 'cup', 'team', 'figure', 'it', "'", 's', 'about', 'time', 'the', 'country', 'reclaimed', 'the', 'championship', '.']


#### 6.3.2 Making a Token:Count Mapping

Now let's make the token:count mapping.

In [22]:
counter = {}
for token in all_tokens:
    try:
        counter[token] += 1
    except KeyError:
        counter[token] = 1

An equivalent and more efficient way is as follows.

In [23]:
from collections import Counter
counter = Counter(all_tokens)

# Have a look at the first 10 items:
for token, count in list(counter.items())[:10]:
    print(f"{token}: {count}")

bush: 3403
,: 157171
kerry: 1335
tentatively: 40
ok: 246
three: 4537
debates: 84
(: 39039
ap: 15264
): 38741


Let's sort them by frequency. This is not necessary, though.

In [24]:
counter = dict(sorted(
    counter.items(),
    key=lambda x: x[1],
    reverse=True
))

# Have a look at the first 10 items:
for token, count in list(counter.items())[:10]:
    print(f"{token}: {count}")

.: 214545
the: 193701
,: 157171
to: 113276
a: 104552
of: 92908
in: 90710
and: 65455
s: 58627
on: 53678
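
`Counter` also offers `most_common()`, which returns the (token, count) pairs already sorted by descending count. A small sketch on toy data:

```python
from collections import Counter

toy_counter = Counter(["b", "a", "b", "c", "b", "a"])

sorted_manually = sorted(toy_counter.items(), key=lambda x: x[1], reverse=True)
sorted_builtin = toy_counter.most_common()

print(sorted_builtin)                     # [('b', 3), ('a', 2), ('c', 1)]
print(sorted_manually == sorted_builtin)  # True
```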


#### 6.3.3 Creating a Vocab

After making the token:count mapping, we can construct a vocab. The `torchtext` package provides us a convenient function `torchtext.vocab.vocab` for making vocab. Let's first of all learn its usage from [here](https://pytorch.org/text/stable/vocab.html?highlight=vocab#torchtext.vocab.vocab).

`torchtext.vocab.vocab` accepts four params. Let's focus on the first two:
+ ordered_dict: Ordered dictionary mapping tokens to their corresponding occurrence frequencies.
+ min_freq: The minimum frequency needed to include a token in the vocabulary.

The first param `ordered_dict` is the token:count mapping we just created. `min_freq` is the threshold mentioned earlier; let's set it to 5.

In [25]:
from torchtext.vocab import vocab as vc
min_freq = 5
vocab = vc(
    ordered_dict=counter,
    min_freq=min_freq,
)

Done! Now let's try to use the vocab to convert tokens to ids.

In [26]:
text = "I'am learning to train a deeplearning model."

# Tokenize the text.
tokens = tokenizer(text)
print(tokens)

# Tokens -> token ids
token_ids = vocab(tokens)
print(token_ids)

['i', "'", 'am', 'learning', 'to', 'train', 'a', 'deeplearning', 'model', '.']


RuntimeError: Token deeplearning not found and default index is not set

Oops! A runtime error is raised. It says the token "deeplearning" is not found and the default index is not set. The reason is that "deeplearning" is not in vocab.

In [27]:
vocab.get_stoi()["deeplearning"]

KeyError: 'deeplearning'

An **out-of-vocabulary (OOV)** token should be represented by a special symbol, such as `<unk>`. We should also specify an index for the `<unk>` symbol. First, insert the unknown token into the vocab. Second, tell the vocab that the index of `<unk>` is the default index, so that every OOV token is mapped to `<unk>`.

In [28]:
unk_token = "<unk>"
unk_idx = 0

vocab.insert_token(
    token=unk_token,
    index=unk_idx
)
vocab.set_default_index(index=unk_idx)

Now let's run the previous code again.

In [29]:
text = "I'am learning to train a deeplearning model."

# Tokenize the text.
tokens = tokenizer(text)
print(tokens)

# Tokens -> token ids
token_ids = vocab(tokens)
print(token_ids)

['i', "'", 'am', 'learning', 'to', 'train', 'a', 'deeplearning', 'model', '.']
[275, 16, 1915, 4754, 4, 1933, 5, 0, 2083, 1]


It works! The text is first split into tokens by the tokenizer. Then the tokens are converted into token ids by the vocab. The OOV token "deeplearning" is treated as `<unk>`, whose token id is 0.

### 6.4 Complete Implementation for Vocab Creation

Let's wrap the code into a function `make_vocab`.

In [30]:
from typing import Dict, Callable

def make_vocab(
        texts: List[str], tokenizer: Callable,
        min_freq: int = 1, unk_token: str = "<unk>", unk_idx: int = 0,
) -> torchtext.vocab.Vocab:
    """Make a vocabulary from the specified texts.

    :param texts: Texts for making vocab.
    :param tokenizer: Tokenizer for tokenizing texts.
    :param min_freq: Min frequency of the words to be added to the vocab.
    :param unk_token: Unknown token.
    :param unk_idx: Unknown token index.
    :return: Constructed vocab.
    """

    def make_token_count_mapping(texts) -> Dict[str, int]:
        """Make a mapping {token: count}"""

        all_tokens = [
            token
            for text in texts
            for token in tokenizer(text)
        ]
        counter = Counter(all_tokens)
        counter = dict(sorted(
            counter.items(),
            key=lambda x: x[1],
            reverse=True
        ))

        return counter

    vocab = vc(
        ordered_dict=make_token_count_mapping(texts=texts),
        min_freq=min_freq,
    )
    vocab.insert_token(
        token=unk_token,
        index=unk_idx
    )
    vocab.set_default_index(index=unk_idx)

    return vocab

### 6.5 Creating a Vocab

Now, we can construct a vocab with the function above.

In [31]:
min_freq = 5
vocab = make_vocab(
    texts=train_dataset.texts,
    tokenizer=tokenizer,
    min_freq=min_freq,
)

## 7. Label Mapping

### 7.1 Goal of this Section

Textual label classes (Business, Sci/Tech, Sports, World) should be mapped to integer ids for model training. In this section, we create a mapping that maps each textual class to an integer id.

### 7.2 Two Steps for Making the Mapping

Only two steps are needed to create the mapping:
1. Obtaining all textual label classes from the **training data** and removing duplicates.
2. Making the mapping with the deduplicated label classes.

### 7.3 Implementation of the Two Steps

First, let's obtain all textual labels with the `labels` attribute of the training dataset, followed by deduplication.

In [32]:
classes = list(set(train_dataset.labels))

print(classes)

['Sports', 'Business', 'Sci/Tech', 'World']


Then, we can make the mapping, which is represented by a Python dict.

In [33]:
class_index_mapping = {
    class_: idx
    for idx, class_ in enumerate(classes)
}

print(class_index_mapping)

{'Sports': 0, 'Business': 1, 'Sci/Tech': 2, 'World': 3}


With the constructed mapping `class_index_mapping`, we can map textual labels to label ids.

In [34]:
class_index_mapping["World"]

3

### 7.4 Complete Implementation for Making the Label Mapping

Let's wrap the code into a function `make_class_index_mapping()`.

In [35]:
def make_class_index_mapping(labels: List[str]) -> Dict[str, int]:
    """Make a mapping that maps the classes to integral indices.

    :param labels: Label strings.
    :return: Mapping from each class to its index.
    """

    classes = list(set(labels))
    class_index_mapping = {
        class_: idx
        for idx, class_ in enumerate(classes)
    }

    return class_index_mapping
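
One caveat: the iteration order of a Python `set` of strings is not guaranteed to be the same across runs, so the ids assigned above can differ from one run to the next. If a reproducible mapping is needed (for example, to reuse a saved model), sorting the classes is one option. The function name below is hypothetical, and the inverse mapping is included for converting predicted ids back to class names:

```python
from typing import Dict, List

def make_sorted_class_index_mapping(labels: List[str]) -> Dict[str, int]:
    """Like make_class_index_mapping, but with a reproducible order."""
    classes = sorted(set(labels))  # sorting fixes the order of the classes
    return {class_: idx for idx, class_ in enumerate(classes)}

toy_labels = ["World", "Sports", "Sci/Tech", "World", "Business"]
mapping = make_sorted_class_index_mapping(toy_labels)

# Inverse mapping: id -> class name.
index_class_mapping = {idx: class_ for class_, idx in mapping.items()}

print(mapping)
# {'Business': 0, 'Sci/Tech': 1, 'Sports': 2, 'World': 3}
print(index_class_mapping[3])  # World
```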

### 7.5 Creating a Label Mapping

Now, we can make the label mapping with the function above.

In [36]:
class_index_mapping = make_class_index_mapping(labels=train_dataset.labels)

## 8. Model

### 8.1 Goal of this Section

This section introduces how to construct a deep learning model.

### 8.2 Required Methods for Implementing a PyTorch Deep Learning Model

We define our neural network by subclassing PyTorch's `nn.Module`, and initialize the neural network layers in `__init__`. Every `nn.Module` subclass implements the operations on input data in the `forward` method. In other words, two methods are required:
+ `__init__`: defining components of the network.
+ `forward`: specifying how the inputs are processed by each component.

### 8.3 Implementation of the Two Methods

Now let's create a model class for the text classification task. In this tutorial, we use a single-layer RNN as the model structure. It contains three layers:
1. **Embedding layer** for converting token ids to embeddings. The size of the embedding lookup table is (`vocab_len`, `embed_dim`), where `vocab_len` is the length of the vocab and `embed_dim` is the dimension of the embeddings. In this tutorial, we set `embed_dim` to 50.
2. **RNN layer** for extracting textual info. In this tutorial, we set the hidden dimension `hidden_dim` of the RNN layer to 50.
3. **Linear layer** for decoding. The input dimension of the linear layer is `hidden_dim` of the RNN layer, and the output dimension equals the number of news categories `class_num`, which is 4 in this tutorial.

Receiving the input (a batch of token ids), the network processes it as follows:
1. The embedding layer converts the token ids to embeddings.
2. The RNN layer encodes the embeddings sequentially, and outputs the last hidden state.
3. The linear layer transforms the hidden state into a four-element vector. Each element of the vector is the score of the corresponding class predicted by the model; these scores can be normalized into probabilities (e.g. with softmax).
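
The data flow above can be traced with tensor shapes. The sketch below wires the three layers together outside of a class; `batch_first=True`, the toy sizes and the explicit softmax are assumptions made for illustration and may differ from the implementation that follows.

```python
import torch
from torch import nn

# Illustrative sizes; vocab_len and the batch shape are made up.
vocab_len, embed_dim, hidden_dim, class_num = 100, 50, 50, 4

embedding = nn.Embedding(num_embeddings=vocab_len, embedding_dim=embed_dim)
rnn = nn.RNN(input_size=embed_dim, hidden_size=hidden_dim, batch_first=True)
linear = nn.Linear(in_features=hidden_dim, out_features=class_num)

token_ids = torch.randint(0, vocab_len, (2, 7))  # 2 texts, 7 token ids each

embedded = embedding(token_ids)        # (2, 7, 50)
outputs, last_hidden = rnn(embedded)   # last_hidden: (1, 2, 50)
logits = linear(last_hidden[-1])       # (2, 4): one score per class
probs = torch.softmax(logits, dim=-1)  # each row sums to 1

print(embedded.shape, last_hidden.shape, logits.shape)
```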

As mentioned above, the model class should contain two methods `__init__` and `forward`.

In [37]:
from torch import nn

class RNNTextClassifier(nn.Module):
    """A text classifier based on RNN.

    :param vocab_len: Length of the vocab.
    :param class_num: Number of classes.
    :param embed_dim: Embedding dimension.
    :param hidden_dim: Dimension of the RNN hidden layer.
    """

    def __init__(self, vocab_len: int, class_num: int, embed_dim: int, hidden_dim: int):
        super(RNNTextClassifier, self).__init__()

        ...

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        ...


#### 8.3.1 Implementing `__init__`

First, let's implement the `__init__` method. The arguments it receives include `vocab_len`, `embed_dim`, `hidden_dim` and `class_num`, as described above. We use [nn.Embedding](https://pytorch.org/docs/stable/generated/torch.nn.Embedding.html?highlight=embedding#torch.nn.Embedding) for creating the lookup table, [nn.RNN](https://pytorch.org/docs/stable/generated/torch.nn.RNN.html?highlight=rnn#torch.nn.RNN) for creating the RNN layer, and [nn.Linear](https://pytorch.org/docs/stable/generated/torch.nn.Linear.html?highlight=linear#torch.nn.Linear) for the linear layer. 

Note that the constructor of **each layer** takes **a different set of arguments**. **Read the documentation** to learn more! For example, to construct an embedding layer, we use `nn.Embedding`. We should read its [doc](https://pytorch.org/docs/stable/generated/torch.nn.Embedding.html?highlight=embedding#torch.nn.Embedding) to figure out which argument represents the length of the vocab and which represents the dimension of the embeddings. From the doc, we see that `num_embeddings` denotes the *size of the dictionary of embeddings*, i.e., the vocab length, and that `embedding_dim` denotes *the size of each embedding vector*. Therefore, we pass `vocab_len` to `num_embeddings` and `embed_dim` to `embedding_dim`. Similarly, for the RNN and the linear layer, we first read their docs and then pass values to the corresponding arguments.

In [38]:
def __init__(self, vocab_len: int, class_num: int, embed_dim: int, hidden_dim: int):
    super(RNNTextClassifier, self).__init__()

    self.embed_dim = embed_dim
    self.hidden_dim = hidden_dim

    self.embedding = nn.Embedding(
        num_embeddings=vocab_len,
        embedding_dim=self.embed_dim
    )
    self.rnn = nn.RNN(
        input_size=self.embed_dim,
        hidden_size=self.hidden_dim,
        batch_first=True
    )
    self.linear = nn.Linear(
        in_features=self.hidden_dim,
        out_features=class_num
    )

#### 8.3.2 Implementing `forward`

Now let's implement the `forward` method, which determines how the input `x` should be processed. When implementing the `forward` method, two things should be clear:
1. The input and output of each layer.
2. The dimension of the inputs and outputs.

Again, **consult the documentation** to find everything we need.

Specifically, let's start from the input `x`. Generally it's a tensor of size (batch_size, text_len). (As a reminder, we can use `x.size()` to obtain the size of `x`.)

First, `x` is fed into the embedding layer. From the [doc](https://pytorch.org/docs/stable/generated/torch.nn.Embedding.html?highlight=embedding#torch.nn.Embedding) of `nn.Embedding`, we can learn about the input and output description of the layer from the **Shape** section:
+ Input: (∗), IntTensor or LongTensor of arbitrary shape containing the indices to extract
+ Output: (∗,H), where * is the input shape and H=`embedding_dim`

According to the description, the input `x` with size (batch_size, text_len) will be transformed to (batch_size, text_len, embed_dim). Therefore, the size of `embeddings` is (batch_size, text_len, embed_dim).

Next comes the RNN layer, which receives the `embeddings` as input. Read the **Inputs** and **Outputs** sections of the [doc](https://pytorch.org/docs/stable/generated/torch.nn.RNN.html?highlight=rnn#torch.nn.RNN) for details. For the model used in this tutorial, only the final hidden state is required. The final hidden state is the second element of the outputs (`last_hidden` in the code), whose size is (1, batch_size, hidden_dim) according to the doc. The 1 here is the number of layers (`num_layers`, which defaults to 1) times the number of directions (1 for a unidirectional RNN). Note that `batch_first=True` affects only the input and output sequences, not the hidden state, so the batch dimension of `last_hidden` still comes second. The first dimension (of size 1) in `last_hidden` is redundant, so we remove it with the `squeeze` method, resulting in a size of (batch_size, hidden_dim).

Similarly, read the [doc](https://pytorch.org/docs/stable/generated/torch.nn.Linear.html?highlight=linear#torch.nn.Linear) of `nn.Linear` to learn its usage.

*More on `last_hidden = last_hidden.squeeze(dim=0)`: the original `last_hidden` has size (1, batch_size, hidden_dim). If we passed it directly to the linear layer, the output `y` of the whole model would have size (1, batch_size, class_num). However, the model output (at least for the task here) must have size (batch_size, class_num). Therefore, we squeeze `last_hidden` in advance. Alternatively, we can squeeze `y` instead of `last_hidden`, which produces the same result.*
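The shapes and the equivalence of the two squeeze placements can be checked with a small sketch. The sizes below are arbitrary values chosen for illustration (they are assumptions, not the tutorial's settings), and random embeddings stand in for the output of the embedding layer.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Illustrative sizes only (assumptions, not the tutorial's settings).
batch_size, text_len, embed_dim, hidden_dim, class_num = 3, 7, 5, 6, 4

rnn = nn.RNN(input_size=embed_dim, hidden_size=hidden_dim, batch_first=True)
linear = nn.Linear(in_features=hidden_dim, out_features=class_num)

# Random stand-in for the embedding layer's output.
embeddings = torch.randn(batch_size, text_len, embed_dim)
outputs, last_hidden = rnn(embeddings)
print(outputs.shape)      # all hidden states: (batch_size, text_len, hidden_dim)
print(last_hidden.shape)  # final hidden state: (1, batch_size, hidden_dim)

# Option 1: squeeze the hidden state, then apply the linear layer.
y_squeeze_first = linear(last_hidden.squeeze(dim=0))
# Option 2: apply the linear layer, then squeeze its output.
y_squeeze_last = linear(last_hidden).squeeze(dim=0)

print(y_squeeze_first.shape)  # (batch_size, class_num)
print(torch.allclose(y_squeeze_first, y_squeeze_last))
```

Both options yield a tensor of size (batch_size, class_num); squeezing first simply keeps the intermediate shapes easier to read.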

In [39]:
def forward(self, x: torch.Tensor) -> torch.Tensor:
    # x: (batch_size, text_len)
    
    embeddings = self.embedding(x)
    # embeddings: (batch_size, text_len, embed_dim)
    
    _, last_hidden = self.rnn(embeddings)
    # last_hidden: (1, batch_size, hidden_dim)
    last_hidden = last_hidden.squeeze(dim=0)
    # last_hidden: (batch_size, hidden_dim)
    
    y = self.linear(last_hidden)
    # y: (batch_size, class_num)

    return y

### 8.4 Complete Implementation for the Model

Let's put the code above together.

In [40]:
class RNNTextClassifier(nn.Module):
    """A text classifier based on RNN.

    :param vocab_len: Size of the vocabulary.
    :param class_num: Number of classes.
    :param embed_dim: Embedding dimension.
    :param hidden_dim: Dimension of the RNN hidden layer.
    """

    def __init__(self, vocab_len: int, class_num: int, embed_dim: int, hidden_dim: int):
        super(RNNTextClassifier, self).__init__()

        self.embed_dim = embed_dim
        self.hidden_dim = hidden_dim

        self.embedding = nn.Embedding(
            num_embeddings=vocab_len,
            embedding_dim=self.embed_dim
        )
        self.rnn = nn.RNN(
            input_size=self.embed_dim,
            hidden_size=self.hidden_dim,
            batch_first=True
        )
        self.linear = nn.Linear(
            in_features=self.hidden_dim,
            out_features=class_num
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        embeddings = self.embedding(x)
        
        _, last_hidden = self.rnn(embeddings)
        last_hidden = last_hidden.squeeze(dim=0)
        
        y = self.linear(last_hidden)

        return y

### 8.5 Creating a Model

Now that the model class has been implemented, let's create a model.

In [41]:
embed_dim = 50
hidden_dim = 50
model = RNNTextClassifier(
    vocab_len=len(vocab),
    class_num=len(class_index_mapping),
    embed_dim=embed_dim,
    hidden_dim=hidden_dim,
)
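To get a feel for the size of such a model, the sketch below counts the trainable parameters of the three layers. The vocabulary size of 100 is a hypothetical value chosen for illustration (the tutorial uses `len(vocab)`), and `count_parameters` is a helper introduced here, not part of PyTorch.

```python
from torch import nn

# Hypothetical vocabulary size for illustration; the tutorial uses len(vocab).
vocab_len, embed_dim, hidden_dim, class_num = 100, 50, 50, 4

layers = {
    "embedding": nn.Embedding(num_embeddings=vocab_len, embedding_dim=embed_dim),
    "rnn": nn.RNN(input_size=embed_dim, hidden_size=hidden_dim, batch_first=True),
    "linear": nn.Linear(in_features=hidden_dim, out_features=class_num),
}


def count_parameters(module: nn.Module) -> int:
    """Return the number of trainable parameters in a module."""
    return sum(p.numel() for p in module.parameters() if p.requires_grad)


param_counts = {name: count_parameters(layer) for name, layer in layers.items()}
for name, count in param_counts.items():
    print(f"{name}: {count}")
print(f"total: {sum(param_counts.values())}")
```

The embedding table alone holds `vocab_len * embed_dim` values, so with a realistic vocabulary it typically dominates the parameter count of this model.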

## 9. Loss Function

### 9.1 Goal of this Section

In this section, we specify a loss function for loss calculation.

### 9.2 Loss Functions Provided by PyTorch

A bunch of loss functions are implemented by PyTorch, including [nn.MSELoss](https://pytorch.org/docs/stable/generated/torch.nn.MSELoss.html#torch.nn.MSELoss) (usually for regression), [nn.CrossEntropyLoss](https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html#torch.nn.CrossEntropyLoss) (usually for classification) and so on. See all the loss functions available [here](https://pytorch.org/docs/stable/nn.html#loss-functions).

### 9.3 Creating a Loss Function (Criterion)

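Creating a loss function (criterion) is straightforward. Note that `nn.CrossEntropyLoss` expects the raw, unnormalized scores (logits) produced by the model together with the class indices of the labels, so the model does not need a softmax layer of its own.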

In [42]:
criterion = nn.CrossEntropyLoss()
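As a rough illustration of what the criterion computes, the sketch below compares `nn.CrossEntropyLoss` with a manual calculation: the mean negative log-probability of the correct class. The logits and labels are made-up values for a batch of two examples and four classes.

```python
import torch
import torch.nn.functional as F
from torch import nn

# Made-up logits for a batch of 2 examples and 4 classes.
logits = torch.tensor([[2.0, -1.0, 0.5, 0.0],
                       [0.1, 0.2, 3.0, -2.0]])
labels = torch.tensor([0, 2])  # class indices, shape (batch_size,)

criterion = nn.CrossEntropyLoss()
loss = criterion(logits, labels)

# Manual computation: mean negative log-probability of the correct class.
log_probs = F.log_softmax(logits, dim=-1)
manual_loss = -log_probs[torch.arange(len(labels)), labels].mean()

print(loss.item(), manual_loss.item())
```

The two numbers agree up to floating-point error, which shows that the softmax is applied inside the criterion.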

## 10. Optimizer

### 10.1 Goal of this Section

In this section, we specify an optimizer for optimizing the model parameters.

### 10.2 Optimizers Provided by PyTorch

Similarly, there are many optimizers provided by PyTorch, such as [SGD](https://pytorch.org/docs/stable/generated/torch.optim.SGD.html#torch.optim.SGD), [Adam](https://pytorch.org/docs/stable/generated/torch.optim.Adam.html#torch.optim.Adam), [RMSProp](https://pytorch.org/docs/stable/generated/torch.optim.RMSprop.html#torch.optim.RMSprop) and so on. See all [here](https://pytorch.org/docs/stable/optim.html).

### 10.3 Creating an Optimizer

The simplest way to create an optimizer is to pass it the parameters to optimize, `model.parameters()`. Here we use the `Adam` optimizer with a learning rate of 1e-3.

In [43]:
from torch.optim import Adam

lr = 1e-3
optimizer = Adam(
    params=model.parameters(),
    lr=lr,
)
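To see what an optimizer does with those parameters, here is a minimal sketch of a single update step. The tiny linear model and the random data are stand-ins invented for illustration; only the `zero_grad` / `backward` / `step` pattern carries over to the tutorial's model.

```python
import torch
from torch import nn
from torch.optim import Adam

torch.manual_seed(0)

# A tiny stand-in model and random data (assumptions for illustration).
toy_model = nn.Linear(in_features=3, out_features=2)
toy_optimizer = Adam(params=toy_model.parameters(), lr=1e-3)
toy_criterion = nn.CrossEntropyLoss()

inputs = torch.randn(8, 3)
targets = torch.randint(low=0, high=2, size=(8,))

weight_before = toy_model.weight.detach().clone()

toy_optimizer.zero_grad()  # clear gradients left over from a previous step
loss = toy_criterion(toy_model(inputs), targets)
loss.backward()            # compute gradients of the loss w.r.t. the parameters
toy_optimizer.step()       # update the parameters using the gradients

weight_after = toy_model.weight.detach().clone()
max_change = (weight_after - weight_before).abs().max().item()
print(max_change)
```

After one step the weights have moved slightly; the size of the change is governed by the learning rate.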

## 11. Data Loader

### 11.1 Goal of this Section

So far, we've prepared many modules for training a deep learning model: a dataset class for storing and obtaining data, a tokenizer and a vocab for processing texts, a label mapping for processing labels, a deep learning model, a loss function and an optimizer. We are almost ready to train the model. However, there's one more thing to consider: loading data.

When training, the model receives a **batch** of training data instead of a single one. Therefore, there are several things we should take into account:
+ How many samples are there in a batch? In other words, what is the batch size?
+ How to sample data from the dataset to form a batch? Should we randomly choose some or sequentially choose some?
+ How do we collate the sampled data (a list of training examples) to form a batch?

In this section, we introduce how to construct batches, i.e., how to load data.

### 11.2 Loading Data with `DataLoader`

We can easily specify how the data are to be loaded with the `DataLoader` class of PyTorch. Again, first read the [doc](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader) to learn its usage. From the doc, we see that there are a bunch of params that we can configure. For now, let's only focus on four of them: `dataset`, `batch_size`, `shuffle` and `collate_fn`.

The `dataset` argument specifies the dataset from which data are loaded. When training the model, we load data from the training set. When we validate the model during training, we load data from the dev set. And when we test the trained model, we load data from the test set. Therefore, we must explicitly tell the loader which dataset to load from.

The `batch_size` argument specifies the number of samples in each batch.

The `shuffle` argument specifies whether the data are drawn in random order. If it is set to `True`, the loader by default uses a [RandomSampler](https://pytorch.org/docs/stable/data.html#torch.utils.data.RandomSampler), which visits the samples in a new random order in every epoch. If it is set to `False`, the loader by default uses a [SequentialSampler](https://pytorch.org/docs/stable/data.html#torch.utils.data.SequentialSampler), which visits the samples in their original order.
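The effect of `batch_size` and `shuffle` can be observed on a toy dataset. In the sketch below, a plain list of ten integers stands in for a real dataset (an assumption for illustration), and no `collate_fn` is given, so the loader's default collation turns each batch into a tensor.

```python
import torch
from torch.utils.data import DataLoader

# A plain list works as a map-style dataset (it has __getitem__ and __len__).
toy_dataset = list(range(10))

sequential_loader = DataLoader(toy_dataset, batch_size=4, shuffle=False)
sequential_batches = [batch.tolist() for batch in sequential_loader]
print(sequential_batches)  # samples appear in their original order

torch.manual_seed(0)
shuffled_loader = DataLoader(toy_dataset, batch_size=4, shuffle=True)
shuffled_batches = [batch.tolist() for batch in shuffled_loader]
print(shuffled_batches)    # same samples, drawn in random order

# The sampler chosen by the loader depends on `shuffle`.
print(type(sequential_loader.sampler).__name__)
print(type(shuffled_loader.sampler).__name__)
```

Note that the last batch is smaller than `batch_size` when the dataset size is not a multiple of it.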

### 11.3 The `collate_fn` Argument

Now there is only one argument left: `collate_fn`. According to the [doc](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader), it is a `Callable` (function) that *merges a list of samples to form a mini-batch of Tensor(s).* In short, two sets of operations are performed in the `collate_fn` function:
1. Processing the data. For the texts, this includes tokenization, converting tokens to ids, and **padding**. For the labels, this includes mapping textual classes to ids.
2. Converting the texts into a tensor and the labels into another tensor, then wrapping the two tensors into a batch.

It should look like:

In [44]:
def collate_func(samples, **kwargs):
    """Collate function for training / validation / testing.

    :param samples: A list of (text, label).
    :param kwargs: Other needed args.
    :return: Collated batch.
    """
    
    ...

#### 11.3.1 Input of `collate_fn`

To make everything intuitive, let's first see what the input of the `collate_fn` looks like (assume that the batch size is 8):

In [45]:
[  
    ('PeopleSoft accepts latest Oracle offer Oracle Corp. #39;s 18-month struggle to gain control of PeopleSoft Inc. drew to a close this morning when PeopleSoft #39;s board agreed to a new \\$10.3 billion takeover offer.', 'Business'),  
    ('The wife, a stranger, the bully and a bullet Marlene Brookes is a shirt-presser, a quiet, conscientious woman who has looked after Bay Street #39;s demanding clients for more than eight years.', 'World'),  
    ('44 believed North Koreans clamber over fence into Canadian Embassy BEIJING (CP) - China said Thursday it wants the Canadian Embassy to hand over 44 people thought to be North Korean asylum-seekers who climbed over a spiked fence onto embassy grounds.', 'World'),  
    ("Carter to Miss Two Games to Fight Lawsuit (AP) AP - Vince Carter will miss the Toronto Raptors' next two preseason games while he fights a lawsuit from a former agent.", 'Sports'),   
    ('More attackers targeting e-commerce and Web apps, says Symantec The total number of virus attacks are down, but malicious codemeisters are getting faster, more sophisticated, and they #39;re beginning to target e-commerce concerns and small businesses.', 'Sci/Tech'),   
    ("UK's EMI Says to Face Music Industry Probe in U.S.  LONDON/NEW YORK (Reuters) - EMI Group PLC, the world's  third-largest music company, on Friday said it and other music  companies faced a New York probe into how music companies  influence what songs are played on the radio.", 'Business'),   
    ('D-Backs get Glaus PHOENIX The Arizona Diamondbacks have worked out a four-year deal with third baseman Troy Glaus. The 2002 World Series MVP hit just .251 with 18 homers and 42 RBI #39;s in 58 games this year, missing much of the season due to shoulder surgery.', 'Sports'),   
    ('Karzai declared winner of Afghan election KABUL, Afghanistan -- Hamid Karzai was officially declared Afghanistan #39;s first-ever popularly elected president today after a weeks-long fraud probe found no reason to overturn his landslide victory.', 'World')  
]


We see that the input is a list in which each element is a tuple of (text, label).

Here's how the input is generated: first, the data loader draws `batch_size` samples from the `dataset`. If `shuffle` is set to `True`, the samples are drawn randomly with the `RandomSampler`; if `shuffle` is set to `False`, they are drawn sequentially with the `SequentialSampler`.

When a sample is drawn from the dataset, the `__getitem__` method of the `dataset` is called. As a reminder, the following code is the `__getitem__` method we implemented earlier. This method returns a tuple of (text, label), which is exactly the form of the elements in the list shown above.

In [46]:
def __getitem__(self, idx: int) -> Tuple[str, str]:
    text = self.texts[idx]
    label = self.labels[idx]

    return text, label

Our task now is to make a batch. This takes several steps, as described in the following sections. To make everything clear, let's assume that the samples are:

In [47]:
samples = [  
    ('PeopleSoft accepts latest Oracle offer Oracle Corp. #39;s 18-month struggle to gain control of PeopleSoft Inc. drew to a close this morning when PeopleSoft #39;s board agreed to a new \\$10.3 billion takeover offer.', 'Business'),  
    ('The wife, a stranger, the bully and a bullet Marlene Brookes is a shirt-presser, a quiet, conscientious woman who has looked after Bay Street #39;s demanding clients for more than eight years.', 'World'),  
    ('44 believed North Koreans clamber over fence into Canadian Embassy BEIJING (CP) - China said Thursday it wants the Canadian Embassy to hand over 44 people thought to be North Korean asylum-seekers who climbed over a spiked fence onto embassy grounds.', 'World'),  
    ("Carter to Miss Two Games to Fight Lawsuit (AP) AP - Vince Carter will miss the Toronto Raptors' next two preseason games while he fights a lawsuit from a former agent.", 'Sports'),   
    ('More attackers targeting e-commerce and Web apps, says Symantec The total number of virus attacks are down, but malicious codemeisters are getting faster, more sophisticated, and they #39;re beginning to target e-commerce concerns and small businesses.', 'Sci/Tech'),   
    ("UK's EMI Says to Face Music Industry Probe in U.S.  LONDON/NEW YORK (Reuters) - EMI Group PLC, the world's  third-largest music company, on Friday said it and other music  companies faced a New York probe into how music companies  influence what songs are played on the radio.", 'Business'),   
    ('D-Backs get Glaus PHOENIX The Arizona Diamondbacks have worked out a four-year deal with third baseman Troy Glaus. The 2002 World Series MVP hit just .251 with 18 homers and 42 RBI #39;s in 58 games this year, missing much of the season due to shoulder surgery.', 'Sports'),   
    ('Karzai declared winner of Afghan election KABUL, Afghanistan -- Hamid Karzai was officially declared Afghanistan #39;s first-ever popularly elected president today after a weeks-long fraud probe found no reason to overturn his landslide victory.', 'World')  
]

#### 11.3.2 Obtain Texts and Labels

First, we should put all texts into a list, and all labels into another. Code:

In [48]:
texts, labels = list(zip(*samples))

# Check.
print(f"texts:\n{texts}\n")
print(f"labels:\n{labels}\n")

texts:
('PeopleSoft accepts latest Oracle offer Oracle Corp. #39;s 18-month struggle to gain control of PeopleSoft Inc. drew to a close this morning when PeopleSoft #39;s board agreed to a new \\$10.3 billion takeover offer.', 'The wife, a stranger, the bully and a bullet Marlene Brookes is a shirt-presser, a quiet, conscientious woman who has looked after Bay Street #39;s demanding clients for more than eight years.', '44 believed North Koreans clamber over fence into Canadian Embassy BEIJING (CP) - China said Thursday it wants the Canadian Embassy to hand over 44 people thought to be North Korean asylum-seekers who climbed over a spiked fence onto embassy grounds.', "Carter to Miss Two Games to Fight Lawsuit (AP) AP - Vince Carter will miss the Toronto Raptors' next two preseason games while he fights a lawsuit from a former agent.", 'More attackers targeting e-commerce and Web apps, says Symantec The total number of virus attacks are down, but malicious codemeisters are getting fast
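The `zip(*samples)` idiom may look cryptic at first. The toy sketch below, with three made-up samples, shows that it regroups a list of (text, label) pairs into one tuple of texts and one tuple of labels.

```python
# Toy samples with the same (text, label) structure as above (made-up data).
toy_samples = [
    ("stocks rise", "Business"),
    ("team wins final", "Sports"),
    ("new chip released", "Sci/Tech"),
]

# zip(*toy_samples) is equivalent to
# zip(toy_samples[0], toy_samples[1], toy_samples[2]):
# it pairs up the first elements, then the second elements.
toy_texts, toy_labels = zip(*toy_samples)

print(toy_texts)   # ('stocks rise', 'team wins final', 'new chip released')
print(toy_labels)  # ('Business', 'Sports', 'Sci/Tech')
```

Both results are tuples, which is why the `texts` printed above are enclosed in parentheses rather than square brackets.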

#### 11.3.3 Process the Texts

Then, we should convert each text into token ids with the tokenizer and the vocab we constructed earlier.

In [49]:
texts = list(map(
    lambda text: vocab(tokenizer(text)),
    texts
))

# Check.
print(texts)

[[367, 5233, 315, 311, 374, 311, 87, 1, 12, 9, 8489, 2581, 4, 1167, 649, 6, 367, 64, 1, 2844, 4, 5, 419, 52, 665, 90, 367, 12, 9, 515, 300, 4, 5, 23, 1729, 1, 220, 139, 696, 374, 1], [2, 1995, 3, 5, 12148, 3, 2, 15809, 8, 5, 9416, 28566, 0, 21, 5, 0, 3, 5, 3682, 3, 0, 1126, 75, 28, 2029, 34, 876, 375, 12, 9, 3258, 3590, 11, 47, 72, 631, 97, 1], [4219, 2212, 267, 6284, 0, 38, 7727, 66, 364, 1422, 848, 13, 1007, 14, 15, 118, 26, 60, 25, 716, 2, 364, 1422, 4, 1196, 38, 4219, 102, 1400, 4, 37, 267, 1259, 25237, 75, 2893, 38, 5, 14638, 7727, 2568, 1422, 6458, 1], [2432, 4, 1015, 49, 217, 4, 557, 974, 13, 31, 14, 31, 15, 4913, 2432, 33, 1015, 2, 733, 3031, 16, 109, 49, 2749, 217, 224, 48, 4352, 5, 974, 29, 5, 140, 2049, 1], [47, 5308, 2973, 5393, 8, 226, 5853, 3, 84, 2552, 2, 1352, 436, 6, 1421, 400, 42, 134, 3, 45, 3889, 0, 42, 883, 1971, 3, 47, 6616, 3, 8, 67, 12, 999, 1623, 4, 776, 5393, 770, 8, 653, 971, 1], [390, 16, 9, 7995, 84, 4, 449, 211, 241, 765, 7, 51, 1, 9, 1, 22892, 73, 13, 27,

Done! Now let's convert them into a tensor.

In [50]:
texts = torch.tensor(texts)

ValueError: expected sequence of length 41 at dim 1 (got 38)

Oops! Something went wrong.

The reason is that the elements in `texts` have inconsistent lengths. Let's see.

In [51]:
for token_ids in texts:
    print(len(token_ids))

41
38
44
35
42
59
52
37


However, to convert `texts` into a tensor, all texts must have the same length. Therefore, we handle the length problem by **padding**. A natural way to pad is to extend each text to the maximum text length in the samples.

In [52]:
# Find the max text length.
max_len = 0
for token_ids in texts:
    max_len = max(len(token_ids), max_len)
print(f"max len: {max_len}")

# Padding:
default_pad_val = 0
texts = list(map(
    lambda token_ids: token_ids + ([default_pad_val] * (max_len - len(token_ids))),
    texts
))
# Check.
print(texts)
for token_ids in texts:
    print(len(token_ids))

max len: 59
[[367, 5233, 315, 311, 374, 311, 87, 1, 12, 9, 8489, 2581, 4, 1167, 649, 6, 367, 64, 1, 2844, 4, 5, 419, 52, 665, 90, 367, 12, 9, 515, 300, 4, 5, 23, 1729, 1, 220, 139, 696, 374, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [2, 1995, 3, 5, 12148, 3, 2, 15809, 8, 5, 9416, 28566, 0, 21, 5, 0, 3, 5, 3682, 3, 0, 1126, 75, 28, 2029, 34, 876, 375, 12, 9, 3258, 3590, 11, 47, 72, 631, 97, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [4219, 2212, 267, 6284, 0, 38, 7727, 66, 364, 1422, 848, 13, 1007, 14, 15, 118, 26, 60, 25, 716, 2, 364, 1422, 4, 1196, 38, 4219, 102, 1400, 4, 37, 267, 1259, 25237, 75, 2893, 38, 5, 14638, 7727, 2568, 1422, 6458, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [2432, 4, 1015, 49, 217, 4, 557, 974, 13, 31, 14, 31, 15, 4913, 2432, 33, 1015, 2, 733, 3031, 16, 109, 49, 2749, 217, 224, 48, 4352, 5, 974, 29, 5, 140, 2049, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [47, 5308, 2973, 5393, 8, 
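As an aside, PyTorch also ships a helper, `torch.nn.utils.rnn.pad_sequence`, that pads a list of tensors to the length of the longest one. The sketch below is an alternative to the manual code above, not what this tutorial uses, and the token ids are made up for illustration.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# Toy token-id lists of different lengths (made-up ids for illustration).
toy_token_ids = [[5, 8, 2], [7, 1], [3, 9, 4, 6]]

padded = pad_sequence(
    [torch.tensor(ids) for ids in toy_token_ids],
    batch_first=True,  # result shape: (batch_size, max_len_in_batch)
    padding_value=0,   # plays the same role as default_pad_val above
)
print(padded)
# tensor([[5, 8, 2, 0],
#         [7, 1, 0, 0],
#         [3, 9, 4, 6]])
```

Like the manual version, this pads to the longest text in the batch and does not truncate.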

A better way to pad is to bring each text to a fixed `max_len` instead of the maximum text length in the samples. The reason is that the longest text in the samples is sometimes very long, so all other texts would be padded to this very large length. Moreover, if a text is too long, the RNN may fail to capture long-distance textual information. Let's modify the code above to pad to a fixed `max_len` instead.

Note that in this case, some texts in the samples will be longer than the specified `max_len`; these should be **truncated** instead of padded.

In [53]:
max_len = 25

# Padding:
default_pad_val = 0
texts = list(map(
    lambda token_ids: (
        token_ids + ([default_pad_val] * (max_len - len(token_ids)))  # Less than max_len: padding.
        if len(token_ids) < max_len  
        else token_ids[:max_len]  # Greater than max_len: truncation.
    ),
    texts
))
# Check.
print(texts)
for token_ids in texts:
    print(len(token_ids))

[[367, 5233, 315, 311, 374, 311, 87, 1, 12, 9, 8489, 2581, 4, 1167, 649, 6, 367, 64, 1, 2844, 4, 5, 419, 52, 665], [2, 1995, 3, 5, 12148, 3, 2, 15809, 8, 5, 9416, 28566, 0, 21, 5, 0, 3, 5, 3682, 3, 0, 1126, 75, 28, 2029], [4219, 2212, 267, 6284, 0, 38, 7727, 66, 364, 1422, 848, 13, 1007, 14, 15, 118, 26, 60, 25, 716, 2, 364, 1422, 4, 1196], [2432, 4, 1015, 49, 217, 4, 557, 974, 13, 31, 14, 31, 15, 4913, 2432, 33, 1015, 2, 733, 3031, 16, 109, 49, 2749, 217], [47, 5308, 2973, 5393, 8, 226, 5853, 3, 84, 2552, 2, 1352, 436, 6, 1421, 400, 42, 134, 3, 45, 3889, 0, 42, 883, 1971], [390, 16, 9, 7995, 84, 4, 449, 211, 241, 765, 7, 51, 1, 9, 1, 22892, 73, 13, 27, 14, 15, 7995, 96, 1632, 3], [14439, 223, 8425, 3033, 2, 1575, 3622, 39, 3828, 59, 5, 3004, 120, 18, 190, 4092, 4527, 8425, 1, 2, 1921, 50, 233, 4208, 230], [1288, 2001, 1292, 6, 718, 244, 1367, 3, 669, 53, 2160, 1288, 35, 2429, 2001, 669, 12, 9, 5806, 12472, 3339, 77, 100, 34, 5]]
25
25
25
25
25
25
25
25


After padding, let's convert `texts` into a tensor.

In [54]:
texts = torch.tensor(texts)

# Check.
texts

tensor([[  367,  5233,   315,   311,   374,   311,    87,     1,    12,     9,
          8489,  2581,     4,  1167,   649,     6,   367,    64,     1,  2844,
             4,     5,   419,    52,   665],
        [    2,  1995,     3,     5, 12148,     3,     2, 15809,     8,     5,
          9416, 28566,     0,    21,     5,     0,     3,     5,  3682,     3,
             0,  1126,    75,    28,  2029],
        [ 4219,  2212,   267,  6284,     0,    38,  7727,    66,   364,  1422,
           848,    13,  1007,    14,    15,   118,    26,    60,    25,   716,
             2,   364,  1422,     4,  1196],
        [ 2432,     4,  1015,    49,   217,     4,   557,   974,    13,    31,
            14,    31,    15,  4913,  2432,    33,  1015,     2,   733,  3031,
            16,   109,    49,  2749,   217],
        [   47,  5308,  2973,  5393,     8,   226,  5853,     3,    84,  2552,
             2,  1352,   436,     6,  1421,   400,    42,   134,     3,    45,
          3889,     0,    42, 

Since all texts in the samples have lengths greater than the `max_len` (25), they are truncated instead of padded.

#### 11.3.4 Process the Labels

The labels should be (1) mapped from textual classes to ids and (2) converted into a tensor.

In [55]:
labels = torch.tensor(list(map(
    lambda label: class_index_mapping[label],
    labels
)))

# Check.
print(labels)

tensor([1, 3, 3, 0, 2, 1, 0, 3])


#### 11.3.5 Make a Batch

Very simple. Just wrap the `texts` and `labels` into a Python dict.

In [56]:
batch = {
    "texts": texts,
    "labels": labels,
}

### 11.4 Complete Implementation for `collate_fn`

In [57]:
from typing import Callable, Dict, List, Tuple

def pad(token_indices: List[int], max_len: int, default_pad_val: int = 0) -> List[int]:
    """Pad the given token indices.

    :param token_indices: Indices of the tokens.
    :param max_len: Max sentence length.
    :param default_pad_val: Default padding value.
    :return: Padded indices.
    """

    if len(token_indices) < max_len:
        return token_indices + ([default_pad_val] * (max_len - len(token_indices)))
    else:
        return token_indices[:max_len]
    
def collate_func(
        samples: List[Tuple[str, str]],
        vocab: torchtext.vocab.Vocab, tokenizer: Callable, max_len: int,
        class_index_mapping: Dict[str, int]
) -> Dict[str, torch.Tensor]:
    """Collate function for training / validation / testing.

    :param samples: A list of (text, label).
    :param vocab: Vocabulary.
    :param tokenizer: Tokenizer.
    :param max_len: Max len of the sentences.
    :param class_index_mapping: Mapping from label string to label index.
    :return: Collated batch.
    """

    texts, labels = list(zip(*samples))

    texts = torch.tensor(list(map(
        lambda text: pad(
            vocab(tokenizer(text)),
            max_len=max_len
        ),
        texts
    )))

    labels = torch.tensor(list(map(
        lambda label: class_index_mapping[label],
        labels
    )))

    return {
        "texts": texts,
        "labels": labels,
    }

Note that in the implementation above, the batch is in the form of a `dict`. But actually, **you can return the `texts` and `labels` directly in `collate_fn` with `return texts, labels`**. The return value format of `collate_fn` is not restricted!

However, you sometimes have to return many things from `collate_fn`, in which case returning a `dict` is preferable because each item can be accessed by name.

### 11.5 Creating Data Loaders for the Training/Development/Test Datasets

Now we've gone through the four args of `DataLoader`. Let's create a data loader for each dataset.

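Note how the extra **arguments** are passed to `collate_func`: the `DataLoader` calls `collate_fn` with the list of samples only, so we wrap `collate_func` in a lambda that fills in the remaining arguments.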

In [58]:
from torch.utils.data import DataLoader

max_len = 25
train_bz = 64
eval_bz = 64

collate_fn = lambda samples: collate_func(
    samples=samples,
    tokenizer=tokenizer,
    vocab=vocab,
    max_len=max_len,
    class_index_mapping=class_index_mapping,
)

train_loader = DataLoader(
    train_dataset,
    batch_size=train_bz,
    collate_fn=collate_fn,
    shuffle=True
)
dev_loader = DataLoader(
    dev_dataset,
    batch_size=eval_bz,
    collate_fn=collate_fn
)
test_loader = DataLoader(
    test_dataset,
    batch_size=eval_bz,
    collate_fn=collate_fn
)
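An alternative to the lambda is `functools.partial`, which binds the extra arguments ahead of time. The sketch below uses a simplified, hypothetical `toy_collate_func` (it returns plain lists rather than tensors and uses `str.split` as a stand-in tokenizer) purely to show the binding pattern.

```python
from functools import partial
from typing import Callable, Dict, List, Tuple


def toy_collate_func(
    samples: List[Tuple[str, str]],
    tokenizer: Callable[[str], List[str]],
    max_len: int,
) -> Dict[str, list]:
    """Simplified stand-in for collate_func: tokenize, truncate and group."""
    texts, labels = zip(*samples)
    return {
        "texts": [tokenizer(text)[:max_len] for text in texts],
        "labels": list(labels),
    }


# Bind the extra arguments once. The result takes only `samples`,
# which is the single argument a DataLoader passes to collate_fn.
toy_collate_fn = partial(toy_collate_func, tokenizer=str.split, max_len=2)

batch = toy_collate_fn([("stocks rise again", "Business"), ("team wins", "Sports")])
print(batch)
# {'texts': [['stocks', 'rise'], ['team', 'wins']], 'labels': ['Business', 'Sports']}
```

Unlike a lambda, a `partial` over a module-level function can usually be pickled, which may matter if the loaders are later given `num_workers > 0` on platforms that start worker processes by spawning.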

## 12. Training, Validation, Testing and Prediction

### 12.1 Goal of this Section

Now everything is ready for training. In this section, we implement the code for training, validation, testing and prediction. Pseudo code:

```python
import copy

best_valid_loss = float("inf")
best_model = None
for i in range(num_epochs):
    train(
        model=model,
        criterion=criterion,
        optimizer=optimizer,
        train_loader=train_loader
    )

    valid_loss = valid(
        model=model,
        criterion=criterion,
        dev_loader=dev_loader
    )
    # Update the best model.
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        best_model = copy.deepcopy(model)

test(
    model=best_model,
    test_loader=test_loader,
    class_index_mapping=class_index_mapping
)

predict(
    model=best_model,
    text="Apple Tops in Customer Satisfaction \"Dell comes in a close second, while Gateway shows improvement, study says.\"",
    tokenizer=tokenizer,
    vocab=vocab,
    class_index_mapping=class_index_mapping,
)
```

### 12.2 Training

See the code below.

+ Notes on **`model.train()`**: According to the [documentation](https://pytorch.org/docs/stable/generated/torch.nn.Module.html?highlight=train#torch.nn.Module.train), it *sets the module in **training mode**. This has any effect **only on certain modules**. See **documentations** of particular modules for details of their behaviors in training/evaluation mode, if they are affected, e.g. Dropout, BatchNorm, etc.* In other words, it **may not** be necessary for the model in this tutorial, which contains no such modules. However, adding this line is good practice, as it matters as soon as such modules are added.

+ `batch` is what is returned by `collate_fn`.

+ `model(texts)` calls the `forward` method of the model. In other words, it is equivalent to calling `model.forward(texts)`.

+ The return value of `model(texts)`, i.e., `predictions`, is a tensor of logits with size (`batch_size`, `class_num`). When you print it, it looks like the tensor below (for a batch size of 8). Each row holds the scores that the model assigns to one training example. For example, the first row \[2.2923e+00, -4.2605e+00,  1.4412e+00, -9.4200e-01\] means that the model gives a score of 2.2923 to the first class (the class whose index in `class_index_mapping` is 0), a score of -4.2605 to the second class (index 1), and so on. These scores are unnormalized logits rather than probabilities: they can be negative and do not sum to 1, but a larger logit means that the model considers the class more likely. Since there are four classes in this tutorial, each row has four values, one per class.

```python
tensor([[ 2.2923e+00, -4.2605e+00,  1.4412e+00, -9.4200e-01],
        [ 1.9423e+00, -3.3877e+00,  2.1381e+00, -2.2040e+00],
        [ 5.5345e-01, -2.1561e+00,  3.7267e-02,  9.9266e-01],
        [-2.1681e+00,  4.1377e+00, -1.4757e+00, -3.7116e-01],
        [ 2.2583e+00, -3.7365e+00,  5.5374e-01,  3.8267e-02],
        [ 7.2302e-01, -2.5620e+00,  1.5886e+00, -1.2569e+00],
        [-2.3920e+00,  3.1881e+00, -1.3345e+00, -1.8628e-01],
        [-9.9738e-01, -1.2172e-01, -1.2999e+00,  2.8284e+00]], grad_fn=<AddmmBackward0>)
```

+ Note that the sizes of `predictions` and `labels` in `loss = criterion(predictions, labels)` are different. `predictions`: (`batch_size`, `class_num`), `labels`: (`batch_size`). This is what `nn.CrossEntropyLoss` expects: one row of logits and one class index per example.
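If actual probabilities are wanted, the logits can be passed through a softmax. The sketch below does this for the first row of the tensor printed above; it is an illustration only, since the training code feeds the logits to the criterion directly.

```python
import torch

# First row of the logits printed above.
logits = torch.tensor([2.2923, -4.2605, 1.4412, -0.9420])

probs = torch.softmax(logits, dim=-1)
print(probs)           # four non-negative values, one per class
print(probs.sum())     # sums to 1 (up to floating-point error)
print(probs.argmax())  # index of the most likely class, same as logits.argmax()
```

Because softmax preserves the ordering of the scores, taking the `argmax` of the logits gives the same predicted class as taking the `argmax` of the probabilities.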



In [59]:
# If gpu (cuda) is available, use gpu for training. Otherwise use cpu.
device = torch.device(
    "cuda"
    if torch.cuda.is_available()
    else "cpu"
)

def train(model, criterion, optimizer, train_loader):
    """Train the model."""

    model.train()

    losses = []
    for batch in tqdm(train_loader):
        texts = batch["texts"].to(device)
        labels = batch["labels"].to(device)

        predictions = model(texts)

        loss = criterion(predictions, labels)
        losses.append(loss.item())

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    train_loss = torch.tensor(losses).mean()
    print(f"Train Loss : {train_loss:.3f}")
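
As noted earlier, `model.train()` only changes the behavior of certain modules. The `RNNTextClassifier` used in this tutorial contains none of them, so the following sketch uses a standalone `nn.Dropout` layer (an illustration only, not part of the tutorial's model) to show the difference between the two modes.

```python
import torch
from torch import nn

torch.manual_seed(0)
dropout = nn.Dropout(p=0.5)
x = torch.ones(1000)

# Training mode: elements are zeroed at random and the rest are scaled by 1 / (1 - p).
dropout.train()
train_out = dropout(x)

# Evaluation mode: dropout leaves its input unchanged.
dropout.eval()
eval_out = dropout(x)

print((train_out == 0).any().item())
print(torch.equal(eval_out, x))
```

For a model that does contain such layers, forgetting `model.eval()` before validation would make the predictions noisy.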

### 12.3 Validation

The code for validation is similar to that for training.

+ `torch.no_grad()`: Disables gradient tracking. When `model(texts)` is executed without `torch.no_grad()`, intermediate results are stored for backpropagation, which takes a lot of CPU/GPU memory. During validation no gradients are computed, so this bookkeeping is not needed. See more details [here](https://pytorch.org/tutorials/beginner/basics/autogradqs_tutorial.html).

+ `predictions.argmax(dim=-1)` returns the most likely label id (the one with the greatest logit) for each text in the batch.
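
Both points can be seen in a minimal sketch. It assumes a small `nn.Linear` layer as a hypothetical stand-in for the model and random input instead of real token ids.

```python
import torch
from torch import nn

layer = nn.Linear(4, 3)   # Hypothetical stand-in for the model: 4 features, 3 classes.
x = torch.randn(2, 4)     # A batch of 2 examples.

tracked = layer(x)
print(tracked.requires_grad)    # True: autograd recorded the computation.

with torch.no_grad():
    untracked = layer(x)
print(untracked.requires_grad)  # False: no graph was recorded.

# argmax over the last dimension picks one class index per example.
predicted_ids = untracked.argmax(dim=-1)
print(predicted_ids.shape)      # torch.Size([2])
```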

In [60]:
def valid(model, criterion, dev_loader):
    """Validate the model performance."""

    model.eval()

    all_labels = []  # Store all labels in the dev data.
    all_predictions = []  # Store all predictions.
    losses = []
    with torch.no_grad():
        for batch in dev_loader:
            texts = batch["texts"].to(device)
            labels = batch["labels"].to(device)  # labels here: labels in a batch.

            predictions = model(texts)  # predictions here: predictions in a batch.

            loss = criterion(predictions, labels)
            losses.append(loss.item())

            all_labels.append(labels)
            all_predictions.append(predictions.argmax(dim=-1))

    all_labels = torch.cat(all_labels)
    all_predictions = torch.cat(all_predictions)

    valid_loss = torch.tensor(losses).mean()
    valid_acc = accuracy_score(
        y_true=all_labels.detach().cpu().numpy(),
        y_pred=all_predictions.detach().cpu().numpy()
    )
    print(f"Valid Loss : {valid_loss:.3f}")
    print(f"Valid Acc  : {valid_acc:.3f}")

    return valid_loss.item()

### 12.4 Testing

In [61]:
def test(model, test_loader, class_index_mapping):
    """Test the model."""

    # Switch to evaluation mode, as in valid().
    model.eval()

    all_labels = []
    all_predictions = []
    with torch.no_grad():
        for batch in test_loader:
            texts = batch["texts"].to(device)
            labels = batch["labels"].to(device)

            predictions = model(texts)

            all_labels.append(labels)
            all_predictions.append(predictions)

        all_predictions = torch.cat(all_predictions)
        all_labels = torch.cat(all_labels)

    all_labels = all_labels.detach().cpu().numpy()
    all_predictions = F.softmax(all_predictions, dim=-1).argmax(dim=-1).detach().cpu().numpy()

    test_acc = accuracy_score(
        y_true=all_labels,
        y_pred=all_predictions
    )
    print(f"Test Acc   : {test_acc:.3f}")

    print("\nClassification Report : ")
    print(classification_report(all_labels, all_predictions, target_names=class_index_mapping.keys()))

    print("\nConfusion Matrix : ")
    print(confusion_matrix(all_labels, all_predictions))
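
`test` applies `F.softmax` before `argmax`, whereas `valid` calls `argmax` on the logits directly. Both give the same label ids, because softmax preserves the ordering of the values within each row. A quick check on random logits (made-up values, not model output) could look like this:

```python
import torch
from torch.nn import functional as F

torch.manual_seed(0)
logits = torch.randn(8, 4)  # (batch_size, class_num)

# Label ids taken from the raw logits and from the softmax probabilities.
from_logits = logits.argmax(dim=-1)
from_probs = F.softmax(logits, dim=-1).argmax(dim=-1)

print(torch.equal(from_logits, from_probs))
```

The softmax is therefore only needed when the probabilities themselves are of interest.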

### 12.5 Prediction

After training/validation/testing, we can use the model to predict the news category of a given piece of news.

Given a piece of news (text), we first tokenize it and convert the tokens to token ids. Then we wrap the ids into a tensor with `torch.tensor`. Note that we put the ids into a list first, because the model expects input of shape (`batch_size`, `text_len`). With the extra list, the resulting tensor has shape (1, `text_len`), where 1 is the batch size. Without it, the tensor would have shape (`text_len`) and lack the batch dimension.

After processing the given text, we feed it to the model and get the `predictions`. Note that `predictions` is of (`batch_size`, `class_num`), as mentioned earlier. Here the batch size is 1, so it is of (1, `class_num`).

We then obtain the most likely label id `prediction_index` with `argmax`. After that, we convert the label id to the original textual class `prediction_class` with the reversed version of `class_index_mapping`.
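
The shape handling and the reversed mapping can be tried out without a trained model. In the sketch below, the token ids, the two-class mapping and the logits are all hypothetical values made up for illustration.

```python
import torch

# Hypothetical values for illustration only.
token_ids = [12, 7, 431, 5]                      # Token ids of one tokenized text.
class_index_mapping = {"World": 0, "Sports": 1}  # Made-up label mapping.

single = torch.tensor(token_ids)     # Shape (text_len,): no batch dimension.
batched = torch.tensor([token_ids])  # Shape (1, text_len): batch size 1.
print(single.shape, batched.shape)

# Pretend the model returned these logits for the single example.
predictions = torch.tensor([[0.3, 1.7]])         # Shape (1, class_num).
prediction_index = predictions[0].argmax(dim=0).item()

# Reverse the mapping to go from label id back to the label string.
index_class_mapping = {index: class_ for class_, index in class_index_mapping.items()}
print(index_class_mapping[prediction_index])     # Sports
```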

In [62]:
def predict(model, text, tokenizer, vocab, class_index_mapping):
    """Predict the label of the given text."""

    model.eval()

    tokens = tokenizer(text)
    token_ids = vocab(tokens)
    # Move the input to the same device as the model.
    texts = torch.tensor([token_ids]).to(device)
    with torch.no_grad():
        predictions = model(texts)

    prediction_index = predictions[0].argmax(dim=0).item()
    prediction_class = {
        index: class_
        for class_, index in class_index_mapping.items()
    }[prediction_index]

    print(f"\ntext: {text}")
    print(f"prediction: {prediction_class}")

## 13. Complete Implementation

Now let's put all code together and run it!

In [63]:
import copy
import csv
import torch
import torchtext.vocab
from collections import OrderedDict, Counter
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from torch import nn
from torch.optim import Adam
from torch.nn import functional as F
from torch.utils.data import Dataset, DataLoader
from typing import Dict, Callable, Tuple, List
from torchtext.data import get_tokenizer
from torchtext.vocab import vocab as vc
from tqdm import tqdm


def collate_func(
        samples: List[Tuple[str, str]],
        vocab: torchtext.vocab.Vocab, tokenizer: Callable, max_len: int,
        class_index_mapping: Dict[str, int]
) -> Dict[str, torch.Tensor]:
    """Collate function for training / validation / testing.

    :param samples: A list of (text, label).
    :param vocab: Vocabulary.
    :param tokenizer: Tokenizer.
    :param max_len: Max len of the sentences.
    :param class_index_mapping: Mapping from label string to label index.
    :return: Collated batch.
    """

    texts, labels = list(zip(*samples))

    texts = torch.tensor(list(map(
        lambda text: pad(
            vocab(tokenizer(text)),
            max_len=max_len
        ),
        texts
    )))

    labels = torch.tensor(list(map(
        lambda label: class_index_mapping[label],
        labels
    )))

    return {
        "texts": texts,
        "labels": labels,
    }


class AGNewsDataset(Dataset):
    """Dataset for AG News.

    :param fp: File path of the data.
    """

    def __init__(self, fp: str):
        self.texts, self.labels = self.read(fp)

    def __len__(self) -> int:
        return len(self.labels)

    def __getitem__(self, idx: int) -> Tuple[str, str]:
        text = self.texts[idx]
        label = self.labels[idx]

        return text, label

    @classmethod
    def read(cls, fp: str) -> Tuple[List[str], List[str]]:
        """Obtain texts and labels from the data file.

        :param fp: File path of the data.
        :return: texts and labels.
        """

        texts = []
        labels = []
        with open(fp, encoding="utf-8") as f:
            csv_reader = csv.reader(f)
            for line in csv_reader:
                label, title, body = line

                # Concatenate the title and the body as text.
                texts.append(f"{title} {body}")
                labels.append(label)

                # TODO: Some operations in collate_func can be placed here.

        return texts, labels


class RNNTextClassifier(nn.Module):
    """A text classifier based on RNN.

    :param vocab_len: Size of the vocabulary.
    :param class_num: Number of classes.
    :param embed_dim: Embedding dimension.
    :param hidden_dim: Dimension of the RNN hidden layers.
    """

    def __init__(self, vocab_len: int, class_num: int, embed_dim: int, hidden_dim: int):
        super(RNNTextClassifier, self).__init__()

        self.embed_dim = embed_dim
        self.hidden_dim = hidden_dim

        self.embedding = nn.Embedding(
            num_embeddings=vocab_len,
            embedding_dim=self.embed_dim
        )
        self.rnn = nn.RNN(
            input_size=self.embed_dim,
            hidden_size=self.hidden_dim,
            batch_first=True
        )
        self.linear = nn.Linear(
            in_features=self.hidden_dim,
            out_features=class_num
        )

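    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch_size, text_len) -> embeddings: (batch_size, text_len, embed_dim)
        embeddings = self.embedding(x)

        # last_hidden: (1, batch_size, hidden_dim); drop the leading layer dimension.
        _, last_hidden = self.rnn(embeddings)
        last_hidden = last_hidden.squeeze(dim=0)

        # y: (batch_size, class_num), the logits of each class.
        y = self.linear(last_hidden)

        return y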


def make_class_index_mapping(labels: List[str]) -> Dict[str, int]:
    """Make a mapping that maps the classes to integral indices.

    :param labels: Label strings.
    :return: Label indices.
    """

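    # Note: the iteration order of a set of strings can differ between runs,
    # so the class-to-index assignment is not guaranteed to be reproducible.
    classes = list(set(labels))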
    class_index_mapping = {
        class_: idx
        for idx, class_ in enumerate(classes)
    }

    return class_index_mapping


def make_vocab(
        texts: List[str], tokenizer: Callable,
        min_freq: int = 1, unk_token: str = "<unk>", unk_idx: int = 0,
) -> torchtext.vocab.Vocab:
    """Make a vocabulary from the specified texts.

    :param texts: Texts for making vocab.
    :param tokenizer: Tokenizer for tokenizing texts.
    :param min_freq: Min frequency of the words to be added to the vocab.
    :param unk_token: Unknown token.
    :param unk_idx: Unknown token index.
    :return: Constructed vocab.
    """

    def make_token_count_mapping() -> Dict[str, int]:
        """Make a mapping {token: count}"""

        all_tokens = [
            token
            for text in texts
            for token in tokenizer(text)
        ]
        counter = Counter(all_tokens)
        counter = sorted(
            counter.items(),
            key=lambda x: x[1],
            reverse=True
        )
        counter = OrderedDict(counter)  # A plain dict also keeps insertion order in Python >= 3.7.

        return counter

    vocab = vc(
        ordered_dict=make_token_count_mapping(),
        min_freq=min_freq,
    )
    vocab.insert_token(
        token=unk_token,
        index=unk_idx
    )
    vocab.set_default_index(index=unk_idx)

    return vocab


def pad(token_indices: List[int], max_len: int, default_pad_val: int = 0) -> List[int]:
    """Pad the given token indices.

    :param token_indices: Indices of the tokens.
    :param max_len: Max sentence length.
    :param default_pad_val: Default padding value.
    :return: Padded indices.
    """

    if len(token_indices) < max_len:
        return token_indices + ([default_pad_val] * (max_len - len(token_indices)))
    else:
        return token_indices[:max_len]


def train(model, criterion, optimizer, train_loader):
    """Train the model."""

    model.train()

    losses = []
    for batch in tqdm(train_loader):
        texts = batch["texts"].to(device)
        labels = batch["labels"].to(device)

        predictions = model(texts)

        loss = criterion(predictions, labels)
        losses.append(loss.item())

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    train_loss = torch.tensor(losses).mean()
    print(f"Train Loss : {train_loss:.3f}")


def valid(model, criterion, dev_loader):
    """Validate the model performance."""

    model.eval()

    all_labels = []  # Store all labels in the dev data.
    all_predictions = []  # Store all predictions.
    losses = []
    with torch.no_grad():
        for batch in dev_loader:
            texts = batch["texts"].to(device)
            labels = batch["labels"].to(device)  # labels here: labels in a batch.

            predictions = model(texts)  # predictions here: predictions in a batch.

            loss = criterion(predictions, labels)
            losses.append(loss.item())

            all_labels.append(labels)
            all_predictions.append(predictions.argmax(dim=-1))

    all_labels = torch.cat(all_labels)
    all_predictions = torch.cat(all_predictions)

    valid_loss = torch.tensor(losses).mean()
    valid_acc = accuracy_score(
        y_true=all_labels.detach().cpu().numpy(),
        y_pred=all_predictions.detach().cpu().numpy()
    )
    print(f"Valid Loss : {valid_loss:.3f}")
    print(f"Valid Acc  : {valid_acc:.3f}")

    return valid_loss.item()


def test(model, test_loader, class_index_mapping):
    """Test the model."""

    # Switch to evaluation mode, as in valid().
    model.eval()

    all_labels = []
    all_predictions = []
    with torch.no_grad():
        for batch in test_loader:
            texts = batch["texts"].to(device)
            labels = batch["labels"].to(device)

            predictions = model(texts)

            all_labels.append(labels)
            all_predictions.append(predictions)

        all_predictions = torch.cat(all_predictions)
        all_labels = torch.cat(all_labels)

    all_labels = all_labels.detach().cpu().numpy()
    all_predictions = F.softmax(all_predictions, dim=-1).argmax(dim=-1).detach().cpu().numpy()

    test_acc = accuracy_score(
        y_true=all_labels,
        y_pred=all_predictions
    )
    print(f"Test Acc   : {test_acc:.3f}")

    print("\nClassification Report : ")
    print(classification_report(all_labels, all_predictions, target_names=class_index_mapping.keys()))

    print("\nConfusion Matrix : ")
    print(confusion_matrix(all_labels, all_predictions))


def predict(model, text, tokenizer, vocab, class_index_mapping):
    """Predict the label of the given text."""

    model.eval()

    tokens = tokenizer(text)
    token_ids = vocab(tokens)
    # Move the input to the same device as the model.
    texts = torch.tensor([token_ids]).to(device)
    with torch.no_grad():
        predictions = model(texts)

    prediction_index = predictions[0].argmax(dim=0).item()
    prediction_class = {
        index: class_
        for class_, index in class_index_mapping.items()
    }[prediction_index]

    print(f"\ntext: {text}")
    print(f"prediction: {prediction_class}")


def main():
    train_dataset = AGNewsDataset(fp="../train.csv")
    dev_dataset = AGNewsDataset(fp="../dev.csv")
    test_dataset = AGNewsDataset(fp="../test.csv")

    tokenizer = get_tokenizer('basic_english')
    # tokenizer = get_tokenizer(word_tokenize)
    vocab = make_vocab(
        texts=train_dataset.texts,
        tokenizer=tokenizer,
        min_freq=min_freq,
    )
    class_index_mapping = make_class_index_mapping(labels=train_dataset.labels)

    model = RNNTextClassifier(
        vocab_len=len(vocab),
        class_num=len(class_index_mapping),
        embed_dim=embed_dim,
        hidden_dim=hidden_dim,
    )
    model.to(device=device)

    criterion = nn.CrossEntropyLoss()
    optimizer = Adam(
        params=model.parameters(),
        lr=lr,
    )

    collate_fn = lambda samples: collate_func(
        samples=samples,
        tokenizer=tokenizer,
        vocab=vocab,
        max_len=max_len,
        class_index_mapping=class_index_mapping,
    )
    train_loader = DataLoader(
        train_dataset,
        batch_size=train_bz,
        collate_fn=collate_fn,
        shuffle=True
    )
    dev_loader = DataLoader(
        dev_dataset,
        batch_size=eval_bz,
        collate_fn=collate_fn
    )
    test_loader = DataLoader(
        test_dataset,
        batch_size=eval_bz,
        collate_fn=collate_fn
    )

    best_valid_loss = float("inf")
    best_model = None
    for i in range(num_epochs):
        print(f"Epoch {i + 1}")

        train(
            model=model,
            criterion=criterion,
            optimizer=optimizer,
            train_loader=train_loader
        )

        valid_loss = valid(
            model=model,
            criterion=criterion,
            dev_loader=dev_loader
        )
        # Update the best model.
        if valid_loss < best_valid_loss:
            best_valid_loss = valid_loss
            best_model = copy.deepcopy(model)

    test(
        model=best_model,
        test_loader=test_loader,
        class_index_mapping=class_index_mapping
    )

    predict(
        model=best_model,
        text="Apple Tops in Customer Satisfaction \"Dell comes in a close second, while Gateway shows improvement, study says.\"",
        tokenizer=tokenizer,
        vocab=vocab,
        class_index_mapping=class_index_mapping,
    )


if __name__ == '__main__':
    # Data args.
    min_freq = 10  # Min frequency of the word added to the vocab.
    max_len = 25  # Max sentence len.
    # Model args.
    embed_dim = 50  # Embedding dimension.
    hidden_dim = 50  # Hidden dimension of the RNN.
    # Training args.
    num_epochs = 10  # Number of training epochs.
    lr = 1e-3  # Learning rate.
    train_bz = 64  # Training batch size.
    eval_bz = 64  # Validation/testing batch size.
    
    device = torch.device(
        "cuda"
        if torch.cuda.is_available()
        else "cpu"
    )

    main()

Epoch 1


100%|██████████████████████████████████████████████████████████████████████████████| 1782/1782 [00:41<00:00, 43.09it/s]


Train Loss : 0.862
Valid Loss : 0.586
Valid Acc  : 0.796
Epoch 2


100%|██████████████████████████████████████████████████████████████████████████████| 1782/1782 [00:29<00:00, 60.84it/s]


Train Loss : 0.479
Valid Loss : 0.461
Valid Acc  : 0.850
Epoch 3


100%|██████████████████████████████████████████████████████████████████████████████| 1782/1782 [00:39<00:00, 45.00it/s]


Train Loss : 0.383
Valid Loss : 0.427
Valid Acc  : 0.866
Epoch 4


100%|██████████████████████████████████████████████████████████████████████████████| 1782/1782 [00:54<00:00, 32.52it/s]


Train Loss : 0.329
Valid Loss : 0.410
Valid Acc  : 0.866
Epoch 5


100%|██████████████████████████████████████████████████████████████████████████████| 1782/1782 [00:54<00:00, 32.76it/s]


Train Loss : 0.290
Valid Loss : 0.413
Valid Acc  : 0.864
Epoch 6


100%|██████████████████████████████████████████████████████████████████████████████| 1782/1782 [00:54<00:00, 32.62it/s]


Train Loss : 0.261
Valid Loss : 0.410
Valid Acc  : 0.875
Epoch 7


100%|██████████████████████████████████████████████████████████████████████████████| 1782/1782 [00:55<00:00, 32.29it/s]


Train Loss : 0.238
Valid Loss : 0.411
Valid Acc  : 0.878
Epoch 8


100%|██████████████████████████████████████████████████████████████████████████████| 1782/1782 [00:53<00:00, 33.31it/s]


Train Loss : 0.215
Valid Loss : 0.418
Valid Acc  : 0.878
Epoch 9


100%|██████████████████████████████████████████████████████████████████████████████| 1782/1782 [00:52<00:00, 33.96it/s]


Train Loss : 0.198
Valid Loss : 0.414
Valid Acc  : 0.878
Epoch 10


100%|██████████████████████████████████████████████████████████████████████████████| 1782/1782 [00:54<00:00, 32.76it/s]


Train Loss : 0.178
Valid Loss : 0.414
Valid Acc  : 0.878
Test Acc   : 0.880

Classification Report : 
              precision    recall  f1-score   support

      Sports       0.94      0.96      0.95      1900
    Business       0.84      0.84      0.84      1900
    Sci/Tech       0.81      0.89      0.85      1900
       World       0.94      0.83      0.88      1900

    accuracy                           0.88      7600
   macro avg       0.88      0.88      0.88      7600
weighted avg       0.88      0.88      0.88      7600


Confusion Matrix : 
[[1816   11   47   26]
 [  14 1600  242   44]
 [  36  141 1686   37]
 [  61  153  101 1585]]

text: Apple Tops in Customer Satisfaction "Dell comes in a close second, while Gateway shows improvement, study says."
prediction: Sci/Tech
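
The numbers in the classification report can be recovered from the confusion matrix, whose rows are the true classes and whose columns are the predicted classes. The following sketch assumes only `numpy` and copies the matrix from the output above.

```python
import numpy as np

# Confusion matrix copied from the output above.
cm = np.array([
    [1816,   11,   47,   26],
    [  14, 1600,  242,   44],
    [  36,  141, 1686,   37],
    [  61,  153,  101, 1585],
])

correct = np.diag(cm)                 # Correctly classified examples per class.
recall = correct / cm.sum(axis=1)     # Divide by the row sums (true classes).
precision = correct / cm.sum(axis=0)  # Divide by the column sums (predicted classes).
accuracy = correct.sum() / cm.sum()

print(np.round(recall, 2))
print(np.round(precision, 2))
print(round(accuracy, 3))
```

Rounded to two decimals, the values agree with the `recall`, `precision` and `accuracy` entries of the report. The largest off-diagonal entry (242) shows that the second class, Business, is most often mistaken for the third, Sci/Tech.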
