### Multi-class Sentiment Analysis using `FastText`.

Up to now we have been classifying text into two outcomes: positive or negative. In this notebook we take it further and classify text into more than two classes.
When we have more than 2 classes, our output must be a $C$ dimensional vector, where $C$ is the number of classes.

We are going to use the TREC dataset, which has **6** classes. It is a dataset of questions, and the task is to classify which category each question belongs to. We do not need to set the ``dtype`` in the ``LABEL`` field: for a ``multi-class`` problem, PyTorch expects the labels to be numericalized **LongTensors**, which is the field's default.
The ``fine_grained`` argument lets us choose between the fine-grained labels (50 classes) and the coarse labels (6 classes). You can change this as you please.

In [1]:
import torch
from torchtext.legacy import data, datasets
import numpy as np
import random

In [2]:
SEED = 42
random.seed(SEED)
torch.manual_seed(SEED)
np.random.seed(SEED)
torch.backends.cudnn.deterministic = True

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')


### FastText.
One of the key concepts in the FastText paper is that they calculate the ``n-grams`` of an input sentence and append them to the end of a sentence. Here, the preprocessing uses ``bi-grams`` (the function's default, ``n=2``), and the demonstration call uses ``tri-grams``.

We are going to create a ``generate_ngrams`` function that takes a sentence that has already been tokenized, calculates the n-grams and appends them to the end of the tokenized list.

We are going to modify our `generate_bigrams` function from [this](https://github.com/CrispenGari/PyTorch-Python/blob/main/09_TorchText/02_Sentiment_Analyisis_Series/03_Faster_Sentiment_Analyisis.ipynb) notebook so that it becomes a generic function that generates `n-grams` for any `n`.

In [7]:
def generate_ngrams(x, n=2):
  n_grams = set(zip(*[x[i:] for i in range(n)]))
  for n_gram in n_grams:
      x.append(' '.join(n_gram))
  return x
generate_ngrams(['This', 'film', 'is', 'terrible'], 3)

['This', 'film', 'is', 'terrible', 'This film is', 'film is terrible']
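
Because the function above collects the n-grams in a `set`, the order in which they are appended is not guaranteed, and the list passed in is modified in place. As a minimal sketch (the name `generate_ngrams_ordered` is hypothetical and not part of the original notebook), an order-preserving variant that leaves its input untouched could look like this:

```python
def generate_ngrams_ordered(tokens, n=2):
    """Return the tokens followed by their unique n-grams, in first-seen order."""
    # zip over n shifted views of the token list to form the n-gram tuples
    n_grams = zip(*[tokens[i:] for i in range(n)])
    # dict.fromkeys removes duplicates while keeping insertion order
    unique = dict.fromkeys(' '.join(n_gram) for n_gram in n_grams)
    return list(tokens) + list(unique)

sentence = ['This', 'film', 'is', 'terrible']
result = generate_ngrams_ordered(sentence, 3)
print(result)
```

Either version can be passed as the `preprocessing` argument, since both return the extended token list.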

In [8]:
TEXT = data.Field(tokenize="spacy", 
                  tokenizer_language="en_core_web_sm",
                  preprocessing = generate_ngrams
                  )
LABEL = data.LabelField()

In [9]:
train_data, test_data = datasets.TREC.splits(TEXT, LABEL, fine_grained=False)

downloading train_5500.label


train_5500.label: 100%|██████████| 336k/336k [00:00<00:00, 1.10MB/s]


downloading TREC_10.label


TREC_10.label: 100%|██████████| 23.4k/23.4k [00:00<00:00, 311kB/s]


In [10]:
random.seed(SEED)  # random.seed() returns None, so seed first and pass the resulting state
validation_data, test_data = test_data.split(random_state=random.getstate())

In [11]:
print(f"TRAINING: \t {len(train_data)}")
print(f"TESTING: \t {len(test_data)}")
print(f"VALIDATION: \t {len(validation_data)}")

TRAINING: 	 5452
TESTING: 	 150
VALIDATION: 	 350


### Let's look at an example from the training set.

In [12]:
print(vars(train_data[1]))

{'text': ['What', 'films', 'featured', 'the', 'character', 'Popeye', 'Doyle', '?', 'Doyle ?', 'the character', 'Popeye Doyle', 'character Popeye', 'What films', 'films featured', 'featured the'], 'label': 'ENTY'}


### Building a vocabulary.


In [13]:
MAX_VOCAB_SIZE = 25_000
TEXT.build_vocab(
    train_data,
    max_size = MAX_VOCAB_SIZE,
    vectors= "glove.6B.100d",
    unk_init = torch.Tensor.normal_
)
LABEL.build_vocab(train_data)

.vector_cache/glove.6B.zip: 862MB [02:42, 5.30MB/s]                           
100%|█████████▉| 399755/400000 [00:21<00:00, 19080.31it/s]

### Checking Labels.


The 6 labels (for the non-fine-grained case) correspond to the 6 types of questions in the dataset:

* HUM for questions about humans
* ENTY for questions about entities
* DESC for questions asking you for a description
* NUM for questions where the answer is numerical
* LOC for questions where the answer is a location
* ABBR for questions asking about abbreviations

In [14]:
LABEL.vocab.stoi

defaultdict(None,
            {'ABBR': 5, 'DESC': 2, 'ENTY': 0, 'HUM': 1, 'LOC': 4, 'NUM': 3})

### Creating iterators.

As usual we want to use the `BucketIterator` to create iterators for all sets.

In [15]:
BATCH_SIZE = 64

train_iterator, validation_iterator, test_iterator = data.BucketIterator.splits(
    (train_data, validation_data, test_data),
    batch_size = BATCH_SIZE,
    device = device
)
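
The idea behind bucketing is to batch together examples of similar length so that each batch needs little padding. The following is only an illustrative sketch of that idea in plain Python, not torchtext's implementation; `bucket_batches` and its parameters are hypothetical names.

```python
def bucket_batches(examples, batch_size, pad_token='<pad>'):
    """Group token lists of similar length and pad each batch to its longest example."""
    # sorting by length puts similar-length examples next to each other
    ordered = sorted(examples, key=len)
    batches = []
    for start in range(0, len(ordered), batch_size):
        batch = ordered[start:start + batch_size]
        max_len = max(len(tokens) for tokens in batch)
        # pad every example in the batch up to the longest one
        batches.append([tokens + [pad_token] * (max_len - len(tokens)) for tokens in batch])
    return batches

examples = [['Who', 'is', 'he', '?'], ['Why', '?'], ['Where', 'is', 'it', '?'], ['When', '?']]
batches = bucket_batches(examples, batch_size=2)
print(batches)
```

With these four made-up questions, the two short ones and the two long ones end up in separate batches, so no padding token is needed at all.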

### Creating a `FastText` model to classify text.


In [16]:
import torch.nn as nn
from torch.nn import  functional as F

In [29]:
class FastText(nn.Module):
  def __init__(self, vocab_size, embedding_dim, output_dim, pad_idx):
    super(FastText, self).__init__()

    self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx = pad_idx)
    self.fc = nn.Linear(embedding_dim, output_dim)

  def forward(self, text):
    # text: [sent len, batch size] -> embedded: [batch size, sent len, emb dim]
    embedded = self.embedding(text).permute(1, 0, 2)
    # average over the sentence dimension -> pooled: [batch size, emb dim]
    pooled = F.avg_pool2d(embedded, (embedded.shape[1], 1)).squeeze(1)
    # logits: [batch size, output dim]
    return self.fc(pooled)
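
The `avg_pool2d` call in `forward` averages the embeddings over the sentence dimension. A small sketch with random tensors (the shapes are arbitrary and chosen only for illustration) suggests that it matches a plain mean over that dimension:

```python
import torch
from torch.nn import functional as F

torch.manual_seed(0)
# illustrative shapes only: 2 sentences, 5 tokens each, 4-dimensional embeddings
embedded = torch.randn(2, 5, 4)  # [batch size, sent len, emb dim]

# a kernel of (sent len, 1) covers the whole sentence axis, one embedding column at a time
pooled = F.avg_pool2d(embedded, (embedded.shape[1], 1)).squeeze(1)  # [batch size, emb dim]
mean = embedded.mean(dim=1)

print(tuple(pooled.shape))
print(torch.allclose(pooled, mean, atol=1e-6))
```

Note that padded positions are part of the sentence axis, so they are included in the average.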

### Hyper parameters

In [30]:
INPUT_DIM = len(TEXT.vocab) # 25002
EMBEDDING_DIM = 100
OUTPUT_DIM = len(LABEL.vocab.stoi)
PAD_IDX = TEXT.vocab.stoi[TEXT.pad_token] # 1

model = FastText(INPUT_DIM, 
            EMBEDDING_DIM, 
            OUTPUT_DIM,  
            PAD_IDX)
model

FastText(
  (embedding): Embedding(25002, 100, padding_idx=1)
  (fc): Linear(in_features=100, out_features=6, bias=True)
)

### Trainable parameters

In [31]:
def count_trainable_params(model):
  return sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f'The model has  {count_trainable_params(model):,} trainable parameters')

The model has  2,500,806 trainable parameters
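
The reported count can be checked by hand from the layer sizes in the model printout: the embedding table holds one vector per vocabulary entry, and the linear layer holds a weight matrix plus a bias vector. A short sketch of that arithmetic:

```python
# sizes taken from the model printout above
vocab_size, embedding_dim, output_dim = 25002, 100, 6

embedding_params = vocab_size * embedding_dim            # one vector per vocabulary entry
linear_params = embedding_dim * output_dim + output_dim  # weights plus biases

total = embedding_params + linear_params
print(f'{total:,}')
```

Almost all of the parameters sit in the embedding table; the classifier itself contributes only 606.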


### Loading the pretrained embeddings

In [32]:
pretrained_embeddings = TEXT.vocab.vectors

In [33]:
model.embedding.weight.data.copy_(pretrained_embeddings)

tensor([[ 1.9269,  1.4873,  0.9007,  ...,  0.1233,  0.3499,  0.6173],
        [ 0.7262,  0.0912, -0.3891,  ...,  0.0821,  0.4440, -0.7240],
        [ 0.1638,  0.6046,  1.0789,  ..., -0.3140,  0.1844,  0.3624],
        ...,
        [-1.0721,  2.5816, -0.4311,  ..., -0.0361,  0.7994,  1.2356],
        [ 0.0943,  0.4924,  1.0734,  ...,  0.0651,  0.5112,  0.5391],
        [-0.0063,  1.0709, -0.4854,  ..., -1.4409, -1.1118, -0.8337]])

### Set the `<unk>` and `<pad>` rows to zeros.
The first two rows of the embedding matrix (indices 0 and 1, for `<unk>` and `<pad>`) should be all zeros.

In [34]:
for i in range(2):
  model.embedding.weight.data[i] = torch.zeros(EMBEDDING_DIM)

model.embedding.weight.data

tensor([[ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
        [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
        [ 0.1638,  0.6046,  1.0789,  ..., -0.3140,  0.1844,  0.3624],
        ...,
        [-1.0721,  2.5816, -0.4311,  ..., -0.0361,  0.7994,  1.2356],
        [ 0.0943,  0.4924,  1.0734,  ...,  0.0651,  0.5112,  0.5391],
        [-0.0063,  1.0709, -0.4854,  ..., -1.4409, -1.1118, -0.8337]])

### Training the Model.

Generally:

``CrossEntropyLoss`` is used when each example belongs to exactly one of $C$ classes.

``BCEWithLogitsLoss`` is used when each example belongs to exactly one of 2 classes (0 and 1), and also when each example can belong to anywhere between 0 and $C$ classes at once (aka multilabel classification).

In this Notebook we are going to use ``CrossEntropyLoss``.
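
For a single example, cross-entropy on raw logits amounts to applying a softmax and taking the negative log of the probability assigned to the true class. The sketch below works this out in plain Python; the logits and targets are made-up values, and `cross_entropy` is a hypothetical helper rather than the PyTorch implementation (which also averages over the batch by default).

```python
import math

def cross_entropy(logits, target):
    """Negative log-softmax probability of the target class for one example."""
    # subtract the max logit for numerical stability; the result is unchanged
    shift = max(logits)
    log_sum_exp = shift + math.log(sum(math.exp(z - shift) for z in logits))
    return log_sum_exp - logits[target]

logits = [2.0, 0.5, -1.0]  # made-up scores for 3 classes
loss_right = cross_entropy(logits, target=0)  # true class has the highest score
loss_wrong = cross_entropy(logits, target=2)  # true class has the lowest score
print(loss_right, loss_wrong)
```

The loss is small when the true class has the highest score and grows as the true class is ranked lower.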

In [35]:
optimizer = torch.optim.Adam(model.parameters())
criterion = nn.CrossEntropyLoss()

### Pushing the model and Criterion to the device.

In [36]:
model = model.to(device)
criterion = criterion.to(device)

### Accuracy On Multi-Class Classification

Before, we had a function that calculated accuracy in the binary label case, where we said that if the value was over ``0.5`` we would assume it is positive. In the case where we have more than 2 classes, our model outputs a $C$ dimensional vector, where the value of each element is the belief that the example belongs to that class.

For example, in our labels we have:

``'ENTY' = 0, 'HUM' = 1, 'DESC' = 2, 'NUM' = 3, 'LOC' = 4 and 'ABBR' = 5``

If the output of our model was something like:

``[5.1, 0.3, 0.1, 2.1, 0.2, 0.6]``

this means that the model strongly believes the example belongs to class ``0``, a question about an entity, and slightly believes the example belongs to class 3, a numerical question.
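
To make "strongly" and "slightly" concrete, the raw scores can be passed through a softmax to turn them into probabilities. This is a plain-Python sketch for the example vector above, not something the notebook's model does explicitly (`CrossEntropyLoss` applies the equivalent step internally):

```python
import math

logits = [5.1, 0.3, 0.1, 2.1, 0.2, 0.6]  # example output from the text above

exps = [math.exp(z) for z in logits]
probs = [e / sum(exps) for e in exps]

for index, p in enumerate(probs):
    print(f'class {index}: {p:.3f}')
```

Class 0 receives most of the probability mass, with class 3 a distant second.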

We calculate the accuracy by performing an argmax to get the index of the maximum value in the prediction for each element in the batch, and then counting how many times this equals the actual label. We then average this across the batch.

In [37]:
def categorical_accuracy(preds, y):
    top_pred = preds.argmax(1, keepdim = True)
    correct = top_pred.eq(y.view_as(top_pred)).sum()
    acc = correct.float() / y.shape[0]
    return acc
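
The argmax-and-compare step can be traced by hand on a tiny batch. The following plain-Python sketch mirrors the logic of `categorical_accuracy` on lists instead of tensors; `categorical_accuracy_lists` is a hypothetical helper and the scores are made up.

```python
def categorical_accuracy_lists(preds, labels):
    """Fraction of rows whose highest-scoring index equals the label."""
    correct = 0
    for scores, label in zip(preds, labels):
        top_pred = max(range(len(scores)), key=scores.__getitem__)  # argmax
        correct += int(top_pred == label)
    return correct / len(labels)

preds = [
    [5.1, 0.3, 0.1, 2.1, 0.2, 0.6],  # argmax is 0
    [0.2, 0.1, 3.0, 0.4, 0.3, 0.1],  # argmax is 2
    [0.1, 0.9, 0.2, 0.3, 1.5, 0.0],  # argmax is 4
]
labels = [0, 2, 1]  # the third prediction is wrong
acc = categorical_accuracy_lists(preds, labels)
print(acc)
```

Two of the three predictions match their labels, so the accuracy for this batch is two thirds.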

The training and evaluation loops are similar to before, without the need to squeeze the model predictions, as ``CrossEntropyLoss`` expects the input to be ``[batch size, n classes]`` and the label to be ``[batch size]``.


In [38]:
def train(model, iterator, optimizer, criterion):
    epoch_loss = 0
    epoch_acc = 0
    model.train()
    for batch in iterator:
        optimizer.zero_grad()
        text = batch.text
        # predictions: [batch size, n classes], no squeeze needed
        predictions = model(text)
        loss = criterion(predictions, batch.label)
        acc = categorical_accuracy(predictions, batch.label)
        loss.backward()
        optimizer.step()
        epoch_loss += loss.item()
        epoch_acc += acc.item()
    # average the per-batch loss and accuracy over the epoch
    return epoch_loss / len(iterator), epoch_acc / len(iterator)

def evaluate(model, iterator, criterion):
    epoch_loss = 0
    epoch_acc = 0
    model.eval()
    # no gradients are needed for evaluation
    with torch.no_grad():
        for batch in iterator:
            text = batch.text
            predictions = model(text)
            loss = criterion(predictions, batch.label)
            acc = categorical_accuracy(predictions, batch.label)
            epoch_loss += loss.item()
            epoch_acc += acc.item()
    return epoch_loss / len(iterator), epoch_acc / len(iterator)

We'll also create a function to tell us how long an epoch takes to compare training times between models.

In [39]:
import time
def epoch_time(start_time, end_time):
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
    return elapsed_mins, elapsed_secs
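
As a quick usage sketch, the helper can be tried on made-up timestamps (in seconds). The function is repeated here only so that the snippet runs on its own:

```python
def epoch_time(start_time, end_time):
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
    return elapsed_mins, elapsed_secs

# made-up timestamps: 125.7 seconds elapsed
mins, secs = epoch_time(1000.0, 1125.7)
print(f'{mins}m {secs}s')

# anything under one second is truncated to zero
print(epoch_time(0.0, 0.4))
```

Because both values are truncated to integers, an epoch that finishes in under a second is reported as `0m 0s`.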

In [42]:
N_EPOCHS = 20
best_valid_loss = float('inf')
for epoch in range(N_EPOCHS):
    start_time = time.time()
    train_loss, train_acc = train(model, train_iterator, optimizer, criterion)
    valid_loss, valid_acc = evaluate(model, validation_iterator, criterion)
    end_time = time.time()
    epoch_mins, epoch_secs = epoch_time(start_time, end_time)
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), 'best-model.pt')
    print(f'Epoch: {epoch+1:02} | Epoch Time: {epoch_mins}m {epoch_secs}s')
    print(f'\tTrain Loss: {train_loss:.3f} | Train Acc: {train_acc*100:.2f}%')
    print(f'\t Val. Loss: {valid_loss:.3f} |  Val. Acc: {valid_acc*100:.2f}%')

Epoch: 01 | Epoch Time: 0m 0s
	Train Loss: 0.226 | Train Acc: 97.04%
	 Val. Loss: 0.435 |  Val. Acc: 84.32%
Epoch: 02 | Epoch Time: 0m 0s
	Train Loss: 0.198 | Train Acc: 97.27%
	 Val. Loss: 0.430 |  Val. Acc: 84.32%
Epoch: 03 | Epoch Time: 0m 0s
	Train Loss: 0.178 | Train Acc: 97.60%
	 Val. Loss: 0.425 |  Val. Acc: 84.84%
Epoch: 04 | Epoch Time: 0m 0s
	Train Loss: 0.153 | Train Acc: 97.89%
	 Val. Loss: 0.416 |  Val. Acc: 85.10%
Epoch: 05 | Epoch Time: 0m 0s
	Train Loss: 0.143 | Train Acc: 98.07%
	 Val. Loss: 0.409 |  Val. Acc: 85.89%
Epoch: 06 | Epoch Time: 0m 0s
	Train Loss: 0.123 | Train Acc: 98.33%
	 Val. Loss: 0.403 |  Val. Acc: 85.63%
Epoch: 07 | Epoch Time: 0m 0s
	Train Loss: 0.120 | Train Acc: 98.58%
	 Val. Loss: 0.404 |  Val. Acc: 86.18%
Epoch: 08 | Epoch Time: 0m 0s
	Train Loss: 0.102 | Train Acc: 98.78%
	 Val. Loss: 0.402 |  Val. Acc: 86.44%
Epoch: 09 | Epoch Time: 0m 0s
	Train Loss: 0.095 | Train Acc: 98.82%
	 Val. Loss: 0.396 |  Val. Acc: 87.52%
Epoch: 10 | Epoch Time: 0m 0
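
The loop above only writes `best-model.pt` when the validation loss improves. Replaying that rule over the nine complete validation losses printed above (a sketch in plain Python, with the save replaced by recording the epoch number) shows which epochs would trigger a checkpoint:

```python
# validation losses for epochs 1 to 9, copied from the training output
valid_losses = [0.435, 0.430, 0.425, 0.416, 0.409, 0.403, 0.404, 0.402, 0.396]

best_valid_loss = float('inf')
saved_epochs = []  # stands in for torch.save(...)
for epoch, valid_loss in enumerate(valid_losses, start=1):
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        saved_epochs.append(epoch)

print(saved_epochs)
print(best_valid_loss)
```

Epoch 7 is skipped because its loss of 0.404 is not below the 0.403 reached at epoch 6.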

### Evaluating the `model`

In [43]:
model.load_state_dict(torch.load('best-model.pt'))

test_loss, test_acc = evaluate(model, test_iterator, criterion)

print(f'Test Loss: {test_loss:.3f} | Test Acc: {test_acc*100:.2f}%')

Test Loss: 0.448 | Test Acc: 87.17%


### User Input.

In [53]:
import spacy
import en_core_web_sm
nlp = en_core_web_sm.load()

def predict_class(model, sentence):
  model.eval()
  # tokenize and append bi-grams, mirroring the preprocessing used for training
  tokenized = generate_ngrams([tok.text for tok in nlp.tokenizer(sentence)])
  indexed = [TEXT.vocab.stoi[t] for t in tokenized]
  tensor = torch.LongTensor(indexed).to(device)
  tensor = tensor.unsqueeze(1)  # [sent len] -> [sent len, batch size = 1]
  with torch.no_grad():
    prediction = torch.argmax(model(tensor), dim=1)  # index of the highest logit
  return prediction.item()
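
Tokens and n-grams that never appeared in the training vocabulary still need an index. In torchtext's legacy `Vocab`, `stoi` falls back to the `<unk>` index for unseen strings; the sketch below imitates that behaviour with a `defaultdict` and a tiny made-up vocabulary (the entries and indices are illustrative assumptions, not the real `TEXT.vocab`).

```python
from collections import defaultdict

# made-up vocabulary: index 0 plays the role of <unk>, index 1 of <pad>
stoi = defaultdict(lambda: 0, {'<unk>': 0, '<pad>': 1, 'What': 2, 'is': 3, 'What is': 4})

tokens = ['What', 'is', 'quokka', 'What is', 'is quokka']
indexed = [stoi[t] for t in tokens]
print(indexed)
```

The unseen token and the unseen bi-gram both map to index 0, so user questions with rare words still produce a valid input tensor.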

In [54]:
labels = dict(LABEL.vocab.stoi)
labels = dict([(v, k) for (k, v) in labels.items()])
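
The two lines above invert the label-to-index mapping so that a predicted index can be turned back into a label name. The same inversion, sketched on a plain dictionary with the values shown in the `LABEL.vocab.stoi` output earlier:

```python
# mapping copied from the LABEL.vocab.stoi output earlier in the notebook
stoi = {'ENTY': 0, 'HUM': 1, 'DESC': 2, 'NUM': 3, 'LOC': 4, 'ABBR': 5}

# invert it: index -> label name
labels = {index: name for name, index in stoi.items()}
print(labels[2], labels[4])
```

The vocabulary object also keeps an index-to-string list, `LABEL.vocab.itos`, which could be used for the same lookup.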

### Getting the user input.

In [55]:
while True:
  qn = input("Enter a question\nOR 'exit' to close:\n")

  if qn.lower() == "exit":
    break
  pred_class = predict_class(model, qn)
  print(f'Predicted class is: {pred_class} = {labels[pred_class]}')

Enter a question
OR 'exit' to close:
What is your name?
Predicted class is: 2 = DESC
Enter a question
OR 'exit' to close:
Where do you live?
Predicted class is: 4 = LOC
Enter a question
OR 'exit' to close:
Who is your father?
Predicted class is: 1 = HUM
Enter a question
OR 'exit' to close:
How are you?
Predicted class is: 2 = DESC
Enter a question
OR 'exit' to close:
exit
