### Multi-class Sentiment Analysis.

Up to now we have been classifying text into two outcomes: positive or otherwise. In this notebook we take it further and classify text into multiple classes.
When we have more than 2 classes, our output must be a $C$-dimensional vector, where $C$ is the number of classes.
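
As a small illustration of what a $C$-dimensional output looks like, the sketch below turns made-up scores for one example with $C = 6$ into probabilities using softmax. The scores and the `softmax` helper are hypothetical and are not part of this notebook's model code; the model itself returns raw scores.

```python
import math

# Hypothetical raw scores (logits) for one example and C = 6 classes.
logits = [1.2, -0.4, 3.3, 0.0, 0.5, -1.1]

def softmax(scores):
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax(logits)
predicted_class = probs.index(max(probs))

print(len(probs))        # one value per class
print(predicted_class)   # index of the highest-scoring class
```

The predicted class is simply the index of the largest element, which is the same index whether we look at the raw scores or the probabilities.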

We are going to use the TREC dataset, a dataset of questions where the task is to classify which category a question belongs to. We will use it with **6** classes. We do not need to set the ``dtype`` of the ``LABEL`` field: for a ``multi-class`` problem, PyTorch expects the labels to be numericalized **LongTensors**, which is already the field's default.
The ``fine_grained`` argument lets us choose between the fine-grained labels (50 classes) and the coarse labels (6 classes). You can change this as you please.

In [1]:
import torch
from torchtext.legacy import data, datasets
import numpy as np
import random

In [2]:
SEED = 42
random.seed(SEED)
torch.manual_seed(SEED)
np.random.seed(SEED)
torch.backends.cudnn.deterministic = True

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

In [3]:
TEXT = data.Field(tokenize="spacy", tokenizer_language="en_core_web_sm")
LABEL = data.LabelField()

In [11]:
train_data, test_data = datasets.TREC.splits(TEXT, LABEL, fine_grained=False)

In [12]:
random.seed(SEED)  # random.seed() returns None, so seed first and pass the RNG state explicitly
validation_data, test_data = test_data.split(random_state=random.getstate())

In [13]:
print(f"TRAINING: \t {len(train_data)}")
print(f"TESTING: \t {len(test_data)}")
print(f"VALIDATION: \t {len(validation_data)}")

TRAINING: 	 5452
TESTING: 	 150
VALIDATION: 	 350


### Let's look at an example from the training set.

In [14]:
print(vars(train_data[1]))

{'text': ['What', 'films', 'featured', 'the', 'character', 'Popeye', 'Doyle', '?'], 'label': 'ENTY'}


### Building a vocabulary.


In [15]:
MAX_VOCAB_SIZE = 25_000
TEXT.build_vocab(
    train_data,
    max_size = MAX_VOCAB_SIZE,
    vectors= "glove.6B.100d",
    unk_init = torch.Tensor.normal_
)
LABEL.build_vocab(train_data)

### Checking Labels.


The 6 labels (for the non-fine-grained case) correspond to the 6 types of questions in the dataset:

* HUM for questions about humans
* ENTY for questions about entities
* DESC for questions asking you for a description
* NUM for questions where the answer is numerical
* LOC for questions where the answer is a location
* ABBR for questions asking about abbreviations

In [16]:
LABEL.vocab.stoi

defaultdict(None,
            {'ABBR': 5, 'DESC': 2, 'ENTY': 0, 'HUM': 1, 'LOC': 4, 'NUM': 3})

### Creating iterators.

As usual we want to use the `BucketIterator` to create iterators for all sets.
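
The reason for bucketing is that batching examples of similar length reduces the amount of padding. The sketch below illustrates the idea in plain Python with made-up questions; it is a simplification for intuition only, not how `BucketIterator` is implemented, and the helper names are hypothetical.

```python
# Hypothetical tokenized questions of varying length.
examples = [
    ["What", "is", "AI", "?"],
    ["Who", "wrote", "Hamlet", "?"],
    ["Where", "is", "the", "tallest", "building", "in", "the", "world", "?"],
    ["Why", "?"],
    ["How", "many", "moons", "does", "Jupiter", "have", "?"],
    ["Name", "a", "river", "."],
]
batch_size = 2

def make_batches(items, size):
    # Split a list into consecutive chunks of at most `size` items.
    return [items[i:i + size] for i in range(0, len(items), size)]

def padding_needed(batches):
    # Number of <pad> tokens needed to pad every example to its batch's longest example.
    return sum(max(len(e) for e in b) * len(b) - sum(len(e) for e in b) for b in batches)

unsorted_batches = make_batches(examples, batch_size)
bucketed_batches = make_batches(sorted(examples, key=len), batch_size)

print(padding_needed(unsorted_batches))  # 10
print(padding_needed(bucketed_batches))  # 4
```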

In [18]:
BATCH_SIZE = 64

train_iterator, validation_iterator, test_iterator = data.BucketIterator.splits(
    (train_data, validation_data, test_data),
    batch_size = BATCH_SIZE,
    device = device
)

### Creating a `CNN` model to classify text.


In [19]:
import torch.nn as nn
from torch.nn import  functional as F

In [51]:
class CNN(nn.Module):
  def __init__(self, vocab_size, embedding_dim, n_filters, filter_sizes, output_dim, dropout, pad_idx):
    super(CNN, self).__init__()
    self.embedding = nn.Embedding(vocab_size, embedding_dim = embedding_dim, padding_idx = pad_idx)
    self.convs = nn.ModuleList([
                                nn.Conv2d(in_channels = 1, 
                                          out_channels = n_filters, 
                                          kernel_size = (fs, embedding_dim)) 
                                for fs in filter_sizes
                                ])
    self.fc = nn.Linear(len(filter_sizes) * n_filters, output_dim)
    self.dropout = nn.Dropout(dropout)

  def forward(self, text):  
    #text = [sent len, batch size]
    text = text.permute(1, 0)
    #text = [batch size, sent len]
    
    embedded = self.embedding(text)    
    #embedded = [batch size, sent len, emb dim]
    embedded = embedded.unsqueeze(1)
    #embedded = [batch size, 1, sent len, emb dim]

    conved = [F.relu(conv(embedded)).squeeze(3) for conv in self.convs]
    #conved_n = [batch size, n_filters, sent len - filter_sizes[n] + 1]

    pooled = [F.max_pool1d(conv, conv.shape[2]).squeeze(2) for conv in conved]
    #pooled_n = [batch size, n_filters]
    cat = self.dropout(torch.cat(pooled, dim = 1))
    #cat = [batch size, n_filters * len(filter_sizes)]  
    return self.fc(cat)
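
To make the shape comments in `forward` concrete, here is a minimal sketch that pushes a random tensor through the same convolution and max-pooling steps. The toy sizes (`batch_size`, `sent_len`, `emb_dim`, `n_filters`) are arbitrary assumptions chosen to keep the shapes easy to read, and the convolutions are freshly initialized rather than taken from the model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical toy sizes, not the notebook's hyperparameters.
batch_size, sent_len, emb_dim, n_filters = 4, 10, 8, 3
filter_sizes = [2, 3, 4]

# Stand-in for the embedded batch after unsqueeze(1): [batch, 1, sent len, emb dim]
embedded = torch.randn(batch_size, 1, sent_len, emb_dim)

conved_shapes, pooled_shapes = [], []
for fs in filter_sizes:
    conv = nn.Conv2d(in_channels=1, out_channels=n_filters, kernel_size=(fs, emb_dim))
    # The kernel spans the full embedding width, so the last dimension becomes 1 and is squeezed.
    conved = F.relu(conv(embedded)).squeeze(3)                   # [batch, n_filters, sent_len - fs + 1]
    # Max over the whole remaining length leaves one value per filter.
    pooled = F.max_pool1d(conved, conved.shape[2]).squeeze(2)    # [batch, n_filters]
    conved_shapes.append(tuple(conved.shape))
    pooled_shapes.append(tuple(pooled.shape))

print(conved_shapes)
print(pooled_shapes)
```

Because pooling takes the maximum over the whole sequence dimension, every filter size yields a `[batch size, n_filters]` tensor regardless of sentence length, which is what allows the results to be concatenated before the linear layer.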



### Hyperparameters

In [52]:
INPUT_DIM = len(TEXT.vocab)
EMBEDDING_DIM = 100
N_FILTERS = 100
FILTER_SIZES = [2,3,4]
OUTPUT_DIM = len(LABEL.vocab)
DROPOUT = 0.5
PAD_IDX = TEXT.vocab.stoi[TEXT.pad_token]

In [53]:

model = CNN(INPUT_DIM, EMBEDDING_DIM, N_FILTERS, FILTER_SIZES, OUTPUT_DIM, DROPOUT, PAD_IDX)
model

CNN(
  (embedding): Embedding(9343, 100, padding_idx=1)
  (convs): ModuleList(
    (0): Conv2d(1, 100, kernel_size=(2, 100), stride=(1, 1))
    (1): Conv2d(1, 100, kernel_size=(3, 100), stride=(1, 1))
    (2): Conv2d(1, 100, kernel_size=(4, 100), stride=(1, 1))
  )
  (fc): Linear(in_features=300, out_features=6, bias=True)
  (dropout): Dropout(p=0.5, inplace=False)
)

### Trainable parameters

In [54]:
def count_trainable_params(model):
  return sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f'The model has  {count_trainable_params(model):,} trainable parameters')

The model has  1,026,406 trainable parameters
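
The reported total can be checked by hand from the layer sizes in the printed model above. The sketch below redoes that arithmetic in plain Python; the variable names are introduced here for illustration only.

```python
# Sizes read off the printed model summary.
vocab_size, embedding_dim = 9343, 100
n_filters, filter_sizes, output_dim = 100, [2, 3, 4], 6

# Embedding: one embedding_dim-sized row per vocabulary entry.
embedding_params = vocab_size * embedding_dim

# Each Conv2d has n_filters kernels of shape (fs, embedding_dim) over 1 input channel, plus one bias per filter.
conv_params = sum((fs * embedding_dim + 1) * n_filters for fs in filter_sizes)

# Linear layer: weights plus one bias per output class. Dropout has no parameters.
fc_params = len(filter_sizes) * n_filters * output_dim + output_dim

total = embedding_params + conv_params + fc_params
print(f"{total:,}")  # 1,026,406
```

Most of the parameters (934,300) sit in the embedding layer.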


### Loading the pretrained embeddings

In [55]:
pretrained_embeddings = TEXT.vocab.vectors

In [56]:
model.embedding.weight.data.copy_(pretrained_embeddings)

tensor([[-0.3118, -1.5756,  0.4242,  ...,  0.1282, -0.0354, -0.6897],
        [-0.3598, -0.3909,  0.4201,  ...,  0.4844,  0.1023,  1.0094],
        [ 0.1638,  0.6046,  1.0789,  ..., -0.3140,  0.1844,  0.3624],
        ...,
        [ 0.0091,  0.2810,  0.7356,  ..., -0.7508,  0.8967, -0.7631],
        [ 0.2906,  0.3217,  0.2419,  ..., -0.9444, -0.3790,  0.6196],
        [ 0.4245, -1.1554,  0.0043,  ..., -0.8200, -2.1399,  0.5137]])

### Zero the `<unk>` and `<pad>` embedding rows.
The first two rows of the embedding matrix (index 0 for `<unk>` and index 1 for `<pad>`) should be all zeros.

In [57]:
for i in range(2):
  model.embedding.weight.data[i] = torch.zeros(EMBEDDING_DIM)

model.embedding.weight.data

tensor([[ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
        [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
        [ 0.1638,  0.6046,  1.0789,  ..., -0.3140,  0.1844,  0.3624],
        ...,
        [ 0.0091,  0.2810,  0.7356,  ..., -0.7508,  0.8967, -0.7631],
        [ 0.2906,  0.3217,  0.2419,  ..., -0.9444, -0.3790,  0.6196],
        [ 0.4245, -1.1554,  0.0043,  ..., -0.8200, -2.1399,  0.5137]])

### Training the Model.

Generally:

``CrossEntropyLoss`` is used when each example belongs to exactly one of $C$ classes.

``BCEWithLogitsLoss`` is used when each example belongs to exactly one of 2 classes (0 and 1). It is also used when each example can belong to anywhere between 0 and $C$ classes (aka multilabel classification).

In this Notebook we are going to use ``CrossEntropyLoss``.
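
The difference between the two losses shows up in the shape and dtype of the targets. The following sketch uses random logits and made-up targets (all values here are hypothetical) to show what each criterion expects.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
batch_size, n_classes = 4, 6

# Raw, unnormalized model outputs: [batch size, n classes]
logits = torch.randn(batch_size, n_classes)

# Multi-class: one class index per example, as a LongTensor of shape [batch size].
class_targets = torch.tensor([0, 3, 5, 1])
ce_loss = nn.CrossEntropyLoss()(logits, class_targets)

# Multi-label: a float 0/1 indicator per class, with the same shape as the logits.
multilabel_targets = torch.tensor([[1., 0., 0., 1., 0., 0.],
                                   [0., 0., 0., 0., 0., 0.],
                                   [0., 1., 1., 0., 0., 1.],
                                   [0., 0., 0., 0., 1., 0.]])
bce_loss = nn.BCEWithLogitsLoss()(logits, multilabel_targets)

print(class_targets.dtype, tuple(class_targets.shape))
print(multilabel_targets.dtype, tuple(multilabel_targets.shape))
print(ce_loss.dim(), bce_loss.dim())  # both losses are scalars
```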

In [58]:
optimizer = torch.optim.Adam(model.parameters())
criterion = nn.CrossEntropyLoss()

### Pushing the model and Criterion to the device.

In [59]:
model = model.to(device)
criterion = criterion.to(device)

### Accuracy On Multi-Class Classification

Before, we had a function that calculated accuracy in the binary case, where a value over ``0.5`` was taken as positive. When we have more than 2 classes, our model outputs a $C$-dimensional vector, where the value of each element is the belief that the example belongs to that class.

For example, in our labels we have:

``'ENTY' = 0, 'HUM' = 1, 'DESC' = 2, 'NUM' = 3, 'LOC' = 4 and 'ABBR' = 5``

If the output of our model was something like:

``[5.1, 0.3, 0.1, 2.1, 0.2, 0.6]``

this means that the model strongly believes the example belongs to class ``0``, a question about an entity, and slightly believes the example belongs to class ``3``, a numerical question.

We calculate the accuracy by performing an argmax to get the index of the maximum value in the prediction for each element in the batch, and then counting how many times this equals the actual label. We then average this across the batch.

In [60]:
def categorical_accuracy(preds, y):
    top_pred = preds.argmax(1, keepdim = True)
    correct = top_pred.eq(y.view_as(top_pred)).sum()
    acc = correct.float() / y.shape[0]
    return acc
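
As a quick check of the function above, the sketch below runs it on a hand-made batch of three predictions. The tensors are hypothetical, and the function is repeated so the snippet runs on its own.

```python
import torch

def categorical_accuracy(preds, y):
    # Same logic as the notebook's function, repeated to keep this snippet self-contained.
    top_pred = preds.argmax(1, keepdim=True)
    correct = top_pred.eq(y.view_as(top_pred)).sum()
    return correct.float() / y.shape[0]

# Hypothetical batch: 3 examples, 6 classes.
preds = torch.tensor([[5.1, 0.3, 0.1, 2.1, 0.2, 0.6],
                      [0.2, 0.1, 0.4, 3.0, 0.3, 0.1],
                      [0.9, 1.5, 0.2, 0.1, 0.3, 0.4]])
y = torch.tensor([0, 3, 2])

acc = categorical_accuracy(preds, y)
print(preds.argmax(1).tolist())  # predicted class per example
print(round(acc.item(), 4))      # fraction of examples predicted correctly
```

The first two predictions match their labels and the third does not, so the accuracy for this batch is two out of three.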

The training and evaluation loops are similar to before, without the need to squeeze the model predictions, as ``CrossEntropyLoss`` expects the input to be ``[batch size, n classes]`` and the label to be ``[batch size]``.


In [63]:
def train(model, iterator, optimizer, criterion):
    epoch_loss = 0
    epoch_acc = 0
    model.train()
    for batch in iterator:
        optimizer.zero_grad()
        text = batch.text
        predictions = model(text)
        loss = criterion(predictions, batch.label)
        acc = categorical_accuracy(predictions, batch.label)
        loss.backward()
        optimizer.step()
        epoch_loss += loss.item()
        epoch_acc += acc.item()
    return epoch_loss / len(iterator), epoch_acc / len(iterator)

def evaluate(model, iterator, criterion):
    epoch_loss = 0
    epoch_acc = 0
    model.eval()
    with torch.no_grad():
        for batch in iterator:
            text = batch.text
            predictions = model(text)
            loss = criterion(predictions, batch.label)
            acc = categorical_accuracy(predictions, batch.label)
            epoch_loss += loss.item()
            epoch_acc += acc.item()
    return epoch_loss / len(iterator), epoch_acc / len(iterator)

We'll also create a function to tell us how long an epoch takes to compare training times between models.

In [64]:
import time
def epoch_time(start_time, end_time):
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
    return elapsed_mins, elapsed_secs
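
For reference, this is how the helper splits an elapsed duration into whole minutes and remaining seconds. The timestamps below are made up for illustration, and the function is repeated so the snippet is self-contained.

```python
def epoch_time(start_time, end_time):
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
    return elapsed_mins, elapsed_secs

# Hypothetical timestamps 125.7 seconds apart.
mins, secs = epoch_time(0.0, 125.7)
print(f"{mins}m {secs}s")  # 2m 5s
```

Fractions of a second are truncated, which is why very short epochs are reported as `0m 0s`.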

In [65]:
N_EPOCHS = 5
best_valid_loss = float('inf')
for epoch in range(N_EPOCHS):
    start_time = time.time()
    train_loss, train_acc = train(model, train_iterator, optimizer, criterion)
    valid_loss, valid_acc = evaluate(model, validation_iterator, criterion)
    end_time = time.time()
    epoch_mins, epoch_secs = epoch_time(start_time, end_time)
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), 'best-model.pt')
    print(f'Epoch: {epoch+1:02} | Epoch Time: {epoch_mins}m {epoch_secs}s')
    print(f'\tTrain Loss: {train_loss:.3f} | Train Acc: {train_acc*100:.2f}%')
    print(f'\t Val. Loss: {valid_loss:.3f} |  Val. Acc: {valid_acc*100:.2f}%')

Epoch: 01 | Epoch Time: 0m 1s
	Train Loss: 1.235 | Train Acc: 52.69%
	 Val. Loss: 0.791 |  Val. Acc: 74.32%
Epoch: 02 | Epoch Time: 0m 0s
	Train Loss: 0.783 | Train Acc: 72.23%
	 Val. Loss: 0.583 |  Val. Acc: 77.74%
Epoch: 03 | Epoch Time: 0m 0s
	Train Loss: 0.568 | Train Acc: 80.70%
	 Val. Loss: 0.421 |  Val. Acc: 84.06%
Epoch: 04 | Epoch Time: 0m 0s
	Train Loss: 0.416 | Train Acc: 85.93%
	 Val. Loss: 0.384 |  Val. Acc: 86.18%
Epoch: 05 | Epoch Time: 0m 0s
	Train Loss: 0.309 | Train Acc: 90.35%
	 Val. Loss: 0.333 |  Val. Acc: 87.29%


### Evaluating the `model`

In [68]:
model.load_state_dict(torch.load('best-model.pt'))

test_loss, test_acc = evaluate(model, test_iterator, criterion)

print(f'Test Loss: {test_loss:.3f} | Test Acc: {test_acc*100:.2f}%')

Test Loss: 0.392 | Test Acc: 86.65%


### User Input.

In [83]:
import spacy
import en_core_web_sm
nlp = en_core_web_sm.load()

def predict_class(model, sentence, min_len = 5):
    model.eval()
    tokenized = [tok.text for tok in nlp.tokenizer(sentence)]
    if len(tokenized) < min_len:
        tokenized += ['<pad>'] * (min_len - len(tokenized))
    indexed = [TEXT.vocab.stoi[t] for t in tokenized]
    tensor = torch.LongTensor(indexed).to(device)
    tensor = tensor.unsqueeze(1)
    prediction = torch.argmax(model(tensor), dim=1)
    return prediction.item()

In [87]:
labels = dict(LABEL.vocab.stoi)
labels = dict([(v, k) for (k, v) in labels.items()])

### Getting the user input.

In [88]:
while True:
  qn = input("Enter a question\nOR 'exit' to close:\n")

  if qn.lower() == "exit":
    break
  pred_class = predict_class(model, qn)
  print(f'Predicted class is: {pred_class} = {labels[pred_class]}')

Enter a question
OR 'exit' to close:
what is your name?
Predicted class is: 2 = DESC
Enter a question
OR 'exit' to close:
Who is Keyser Söze?
Predicted class is: 1 = HUM
Enter a question
OR 'exit' to close:
How many minutes are in six hundred and eighteen hours?
Predicted class is: 3 = NUM
Enter a question
OR 'exit' to close:
What continent is Bulgaria in?
Predicted class is: 4 = LOC
Enter a question
OR 'exit' to close:
What does WYSIWYG stand for?
Predicted class is: 5 = ABBR
Enter a question
OR 'exit' to close:
exit


### Credits

* [bentrevett](https://github.com/bentrevett/pytorch-sentiment-analysis/blob/master/5%20-%20Multi-class%20Sentiment%20Analysis.ipynb)