### Emotions.

In this notebook we are going to build a PyTorch model, using torchtext and our custom dataset, that identifies the emotion expressed in a given sentence.

### Emotions:
````
😞 -> sadness
😨 -> fear
😄 -> joy
😮 -> surprise
😍 -> love
😠 -> anger
````

We are going to use our custom dataset, which we will load from my Google Drive.

### Structure of the data.

We have three files which are:
* test.txt
* train.txt
* val.txt

Each of these files contains one example per line: a text followed by its label, separated by a semicolon. The text in these files looks as follows:

```txt
im feeling quite sad and sorry for myself but ill snap out of it soon;sadness
i feel like i am still looking at a blank canvas blank pieces of paper;sadness
i feel like a faithful servant;love
```

We will process these text files into JSON files, which are easier to work with when creating our own dataset using torchtext. The following files will be created:

* train.json
* test.json
* valid.json
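
The conversion from one raw line to one JSON record can be sketched as follows. This is a minimal illustration rather than the notebook's exact cell: the helper name `line_to_record` is hypothetical, and splitting with `rpartition` (so that a `;` inside the text would not break the split) is an assumption that goes beyond what the sample lines above require.

```python
import json

# Alphabetical label order, matching the `emotions` list used in this notebook.
EMOTIONS = ['anger', 'fear', 'joy', 'love', 'sadness', 'surprise']
EMOTION_TO_ID = {name: i for i, name in enumerate(EMOTIONS)}

def line_to_record(line):
    """Turn one 'text;emotion' line into a dict ready for json.dumps."""
    # rpartition splits on the LAST ';', so only the trailing label is cut off.
    text, _, emotion = line.rpartition(';')
    return {
        'text': text,
        'emotion_text': emotion,
        'emotion': EMOTION_TO_ID.get(emotion),
    }

record = line_to_record('i feel like a faithful servant;love')
print(json.dumps(record))
```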

### Imports

In [1]:
import json
import time
from prettytable import PrettyTable
import numpy as np
from matplotlib import pyplot as plt
import torch, os, random
from torch import nn
import torch.nn.functional as F
import pandas as pd

### Setting seeds

In [2]:
SEED = 42
np.random.seed(SEED)
random.seed(SEED)
torch.manual_seed(SEED)
torch.cuda.manual_seed(SEED)
torch.backends.cudnn.deterministic = True

### Mounting my Google Drive

In [3]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [4]:
data_path = '/content/drive/MyDrive/NLP Data/emotions-nlp'
os.path.exists(data_path)

True

### Loading the lines of each file

In [5]:
with open(os.path.join(data_path, 'test.txt'), 'r') as reader:
  test_data = reader.read().splitlines()
with open(os.path.join(data_path, 'val.txt'), 'r') as reader:
  valid_data = reader.read().splitlines()
with open(os.path.join(data_path, 'train.txt'), 'r') as reader:
  train_data = reader.read().splitlines()

### Creating `.json` files from the loaded lists.

In [6]:
train_data_dicts = []
test_data_dicts = []
valid_data_dicts = []

emotions = ['anger', 'fear', 'joy', 'love', 'sadness', 'surprise' ]
emotions_dict = dict([(v, i) for (i, v) in enumerate(emotions)])

for line in test_data:
  text, emotion = line.split(';')
  test_data_dicts.append({
      'text': text,
      "emotion_text": emotion,
      "emotion": emotions_dict.get(emotion)
  })

for line in train_data:
  text, emotion = line.split(';')
  train_data_dicts.append({
      'text': text,
      "emotion_text": emotion,
      "emotion": emotions_dict.get(emotion)
  })

for line in valid_data:
  text, emotion = line.split(';')
  valid_data_dicts.append({
      'text': text,
      "emotion_text": emotion,
      "emotion": emotions_dict.get(emotion)
  })

In [7]:
test_path = 'test.json'
train_path = 'train.json'
valid_path = 'valid.json'

base_path = '/content/drive/MyDrive/NLP Data/emotions-nlp/json'
os.makedirs(base_path, exist_ok=True)

# Write one JSON object per line for each split.
for file_name, records in [
    (train_path, train_data_dicts),
    (test_path, test_data_dicts),
    (valid_path, valid_data_dicts),
]:
  # The context manager closes the file even if a write fails.
  with open(os.path.join(base_path, file_name), 'w') as file_object:
    for line in records:
      file_object.write(json.dumps(line))
      file_object.write('\n')
  print(f"{file_name} created")

train.json created
test.json created
valid.json created


### Checking how many examples we have in each set.

In [8]:
def tabulate(column_names, data, title):
  table = PrettyTable(column_names)
  table.title = title
  for row in data:
    table.add_row(row)
  print(table)

data_rows =["training", len(train_data_dicts) ], ["testing", len(test_data_dicts) ], ["validation", len(valid_data_dicts) ]
title = "EXAMPLES IN EACH SET"
column_data = "SET", "EXAMPLES"
tabulate(column_data,data_rows, title )

+-----------------------+
|  EXAMPLES IN EACH SET |
+------------+----------+
|    SET     | EXAMPLES |
+------------+----------+
|  training  |  16000   |
|  testing   |   2000   |
| validation |   2000   |
+------------+----------+


### Preparing the fields
Each line in the `.json` files of all the sets now looks as follows:

```json
{"text": "i feel a little mellow today", "emotion_text": "joy", "emotion": 2}
```
We are now ready to create the fields.

In [9]:
from torchtext.legacy import data

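We are going to pass `include_lengths=True` to the text field because we are using packed padded sequences in this notebook; with this flag each batch of text comes back as a tuple of the padded tensor and the true sequence lengths. The label field must yield a LongTensor, because PyTorch expects class indices of that type when doing multiclass classification. `LabelField` already defaults to a long dtype, so no extra argument is needed below.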

In [10]:
TEXT = data.Field(
    tokenize="spacy",
    include_lengths = True,
    tokenizer_language = 'en_core_web_sm'
)
LABEL = data.LabelField()
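
To see why the lengths are needed, here is a rough pure-Python sketch of what packing a padded batch does conceptually. It is an illustration only and does not call PyTorch: the function name `pack_padded` and the toy token ids are made up, and the sequences are assumed to be already sorted from longest to shortest.

```python
def pack_padded(batch, lengths):
    """Conceptual packing: keep only real tokens, time step by time step.

    batch   -- list of equally long (padded) sequences
    lengths -- true length of each sequence, sorted in decreasing order
    """
    data, batch_sizes = [], []
    for t in range(max(lengths)):
        # Tokens of the sequences that are still running at time step t.
        active = [seq[t] for seq, n in zip(batch, lengths) if n > t]
        data.extend(active)
        batch_sizes.append(len(active))
    return data, batch_sizes

PAD = 1  # hypothetical padding id
padded = [
    [11, 12, 13, 14],
    [21, 22, 23, PAD],
    [31, PAD, PAD, PAD],
]
data, batch_sizes = pack_padded(padded, lengths=[4, 3, 1])
print(data)         # [11, 21, 31, 12, 22, 13, 23, 14]
print(batch_sizes)  # [3, 2, 2, 1]
```

The padding id never appears in `data`, which is the point: the recurrent layer only processes real tokens.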

### Creating Field

In [11]:
fields ={
    "emotion_text": ("emotion", LABEL),
    "text": ("text", TEXT)
}

### Now we have to create our datasets.

We are going to use `TabularDataset` to create the training, validation, and test datasets.

In [12]:
train_data, test_data, valid_data = data.TabularDataset.splits(
    path=base_path,
    train=train_path,
    test=test_path,
    validation = valid_path,
    format=train_path.split('.')[-1],
    fields=fields
)

In [13]:
print(vars(train_data.examples[2]))

{'emotion': 'anger', 'text': ['i', 'm', 'grabbing', 'a', 'minute', 'to', 'post', 'i', 'feel', 'greedy', 'wrong']}


### Loading the pretrained word embeddings.

In [14]:
MAX_VOCAB_SIZE = 25_000
TEXT.build_vocab(
    train_data,
    max_size = MAX_VOCAB_SIZE,
    vectors = "glove.6B.100d",
    unk_init = torch.Tensor.normal_
)
LABEL.build_vocab(train_data)
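
Note that the class indices produced by `LABEL.build_vocab` need not match the alphabetical `emotions_dict` built earlier, because the label field reads `emotion_text` and builds its own vocabulary. The sketch below mimics that ordering in plain Python, under the assumption that labels are ranked by descending frequency; the counts are invented for illustration and are not the dataset's real class counts.

```python
from collections import Counter

# Invented label sample; the real class counts come from train.json.
labels = (['joy'] * 6 + ['sadness'] * 5 + ['anger'] * 4 +
          ['fear'] * 3 + ['love'] * 2 + ['surprise'] * 1)

counts = Counter(labels)
# Rank by descending frequency, breaking ties alphabetically.
itos = sorted(counts, key=lambda name: (-counts[name], name))
stoi = {name: index for index, name in enumerate(itos)}

print(itos)
print(stoi['joy'], stoi['anger'])
```

With this ordering `joy` maps to 0 and `anger` to 2, which is the mapping visible in the prediction tables at the end of this notebook, whereas `emotions_dict` maps `anger` to 0.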

### Device

In [15]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

### Now let's create iterators.

For this we are going to use my favourite ``BucketIterator`` to create an iterator for each set.

**Note:** we have to pass a `sort_key` and `sort_within_batch=True` since we are using packed padded sequences, which by default expect each batch to be sorted by decreasing length.

In [16]:
sort_key = lambda x: len(x.text)
BATCH_SIZE = 64
train_iterator, valid_iterator, test_iterator = data.BucketIterator.splits(
    (train_data, valid_data, test_data),
    device = device,
    batch_size = BATCH_SIZE,
    sort_key = sort_key,
    sort_within_batch=True
)
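
The idea behind bucketing can be sketched in plain Python. This is a simplified illustration, not `BucketIterator`'s actual algorithm: the helper `make_batches` is hypothetical, and it sorts the whole dataset once instead of shuffling within buckets. It only shows why grouping by length keeps padding small and how each batch ends up ordered from longest to shortest.

```python
def make_batches(examples, batch_size, pad_token='<pad>'):
    """Group token lists of similar length and pad each group."""
    # Neighbours in length order need little padding.
    ordered = sorted(examples, key=len)
    batches = []
    for start in range(0, len(ordered), batch_size):
        # Longest first inside the batch.
        chunk = sorted(ordered[start:start + batch_size], key=len, reverse=True)
        longest = len(chunk[0])
        padded = [seq + [pad_token] * (longest - len(seq)) for seq in chunk]
        lengths = [len(seq) for seq in chunk]
        batches.append((padded, lengths))
    return batches

examples = [
    'i feel sad'.split(),
    'i feel like a faithful servant'.split(),
    'joy'.split(),
    'i am feeling quite happy today'.split(),
]
batches = make_batches(examples, batch_size=2)
for padded, lengths in batches:
    print(lengths, padded)
```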

### Creating a model.

In [17]:
class EmotionsLSTMRNN(nn.Module):
  def __init__(self, vocab_size, embedding_size,
               hidden_size, output_size, num_layers,
               bidirectional, dropout, pad_index
               ):
    super(EmotionsLSTMRNN, self).__init__()

    self.embedding = nn.Embedding(vocab_size,embedding_size,
                                  padding_idx=pad_index)
    self.lstm = nn.LSTM(embedding_size, hidden_size = hidden_size,
                        bidirectional=bidirectional, num_layers=num_layers,
                        dropout = dropout
                        )
    self.hidden_1 = nn.Linear(hidden_size * 2, out_features=512)
    self.hidden_2 = nn.Linear(512, out_features=256)
    self.output_layer = nn.Linear(256, out_features=output_size)
    self.dropout = nn.Dropout(dropout)

  def forward(self, text, text_lengths):
    embedded = self.dropout(self.embedding(text))
    packed_embedded = nn.utils.rnn.pack_padded_sequence(embedded, text_lengths.to('cpu'), enforce_sorted=False)
    packed_output, (h_0, c_0) = self.lstm(packed_embedded)
    output, output_lengths = nn.utils.rnn.pad_packed_sequence(packed_output)
    h_0 = self.dropout(torch.cat((h_0[-2,:,:], h_0[-1,:,:]), dim = 1))

    out = self.dropout(self.hidden_1(h_0))
    out = self.hidden_2(out)
    return self.output_layer(out)
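
The indexing `h_0[-2]` and `h_0[-1]` in `forward` relies on how PyTorch orders the final hidden states of a stacked bidirectional LSTM: layer by layer, with the forward direction before the backward one. The plain-Python sketch below only labels the slots to show which two are concatenated; the string labels are illustrative stand-ins for the actual tensors.

```python
num_layers, num_directions = 2, 2  # values used by this notebook's model

# First axis of the final hidden state: one slot per (layer, direction).
slots = [
    f"layer{layer}_{'forward' if direction == 0 else 'backward'}"
    for layer in range(num_layers)
    for direction in range(num_directions)
]
print(slots)

# The model keeps only the top layer's two directions.
top_forward, top_backward = slots[-2], slots[-1]
print(top_forward, top_backward)
```

Concatenating those two vectors gives a feature of size `hidden_size * 2`, which is why `hidden_1` takes `hidden_size * 2` inputs.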


### Creating the Model instance

In [18]:
INPUT_DIM = len(TEXT.vocab) # 15167 for this dataset
EMBEDDING_DIM = 100
HIDDEN_DIM = 256
OUTPUT_DIM = 6
N_LAYERS = 2
BIDIRECTIONAL = True
DROPOUT = 0.5
PAD_IDX = TEXT.vocab.stoi[TEXT.pad_token] # 1
emotions_model = EmotionsLSTMRNN(INPUT_DIM, 
            EMBEDDING_DIM, 
            HIDDEN_DIM, 
            OUTPUT_DIM, 
            N_LAYERS, 
            BIDIRECTIONAL, 
            DROPOUT, 
            PAD_IDX).to(device)
emotions_model

EmotionsLSTMRNN(
  (embedding): Embedding(15167, 100, padding_idx=1)
  (lstm): LSTM(100, 256, num_layers=2, dropout=0.5, bidirectional=True)
  (hidden_1): Linear(in_features=512, out_features=512, bias=True)
  (hidden_2): Linear(in_features=512, out_features=256, bias=True)
  (output_layer): Linear(in_features=256, out_features=6, bias=True)
  (dropout): Dropout(p=0.5, inplace=False)
)

### Counting parameters of the model.

In [19]:
def count_trainable_params(model):
  return sum(p.numel() for p in model.parameters()), sum(p.numel() for p in model.parameters() if p.requires_grad)

n_params, trainable_params = count_trainable_params(emotions_model)
print(f"Total number of parameters: {n_params:,}\nTotal trainable parameters: {trainable_params:,}")

Total number of parameters: 4,222,370
Total trainable parameters: 4,222,370
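
The reported total can be checked by hand. The sketch below recomputes it in plain Python from the layer shapes printed above, assuming the standard PyTorch LSTM layout of four gates, each with an input weight, a hidden weight, and two bias vectors per direction.

```python
vocab_size, emb, hidden, out = 15167, 100, 256, 6

def lstm_direction_params(input_size, hidden_size):
    # 4 gates x (input weights + hidden weights + two bias vectors)
    return 4 * (input_size * hidden_size + hidden_size * hidden_size + 2 * hidden_size)

embedding = vocab_size * emb
lstm_layer_1 = 2 * lstm_direction_params(emb, hidden)         # both directions
lstm_layer_2 = 2 * lstm_direction_params(2 * hidden, hidden)  # fed by layer 1's two directions
hidden_1 = (2 * hidden) * 512 + 512
hidden_2 = 512 * 256 + 256
output_layer = 256 * out + out

total = embedding + lstm_layer_1 + lstm_layer_2 + hidden_1 + hidden_2 + output_layer
print(f"{total:,}")  # 4,222,370
```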


### Loading pretrained embeddings

In [20]:
pretrained_embeddings = TEXT.vocab.vectors

In [21]:
emotions_model.embedding.weight.data.copy_(pretrained_embeddings)

tensor([[ 1.9269,  1.4873,  0.9007,  ...,  0.1233,  0.3499,  0.6173],
        [ 0.7262,  0.0912, -0.3891,  ...,  0.0821,  0.4440, -0.7240],
        [-0.0465,  0.6197,  0.5665,  ..., -0.3762, -0.0325,  0.8062],
        ...,
        [-0.1438,  0.8681, -0.7219,  ...,  0.0553, -0.4339,  0.3486],
        [-0.0422, -0.7724, -0.9311,  ..., -0.6228,  0.7262,  0.0521],
        [-0.6644, -0.3045,  0.6151,  ...,  0.1404,  0.5788, -0.0333]],
       device='cuda:0')

### Zeroing the `pad` and `unk` indices

In [22]:
UNK_IDX = TEXT.vocab.stoi[TEXT.unk_token]
emotions_model.embedding.weight.data[UNK_IDX] = torch.zeros(EMBEDDING_DIM)
emotions_model.embedding.weight.data[PAD_IDX] = torch.zeros(EMBEDDING_DIM)
emotions_model.embedding.weight.data

tensor([[ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
        [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
        [-0.0465,  0.6197,  0.5665,  ..., -0.3762, -0.0325,  0.8062],
        ...,
        [-0.1438,  0.8681, -0.7219,  ...,  0.0553, -0.4339,  0.3486],
        [-0.0422, -0.7724, -0.9311,  ..., -0.6228,  0.7262,  0.0521],
        [-0.6644, -0.3045,  0.6151,  ...,  0.1404,  0.5788, -0.0333]],
       device='cuda:0')

### Loss and optimizer.

In [23]:
optimizer = torch.optim.Adam(emotions_model.parameters())
criterion = nn.CrossEntropyLoss().to(device)

### Accuracy function (`categorical_accuracy`).

In [24]:
def categorical_accuracy(preds, y):
    top_pred = preds.argmax(1, keepdim = True)
    correct = top_pred.eq(y.view_as(top_pred)).sum()
    acc = correct.float() / y.shape[0]
    return acc
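
As a small worked example of what this function computes, here is the same calculation on plain Python lists. The scores are made up and the helper name `categorical_accuracy_lists` is hypothetical; the notebook's version does the same thing on tensors.

```python
def categorical_accuracy_lists(scores, targets):
    """Fraction of rows whose highest-scoring class equals the target."""
    correct = 0
    for row, target in zip(scores, targets):
        predicted = max(range(len(row)), key=row.__getitem__)  # argmax
        correct += int(predicted == target)
    return correct / len(targets)

scores = [
    [0.1, 2.3, -0.4],   # argmax 1
    [1.9, 0.2, 0.5],    # argmax 0
    [0.0, 0.3, 0.1],    # argmax 1
    [-1.0, -2.0, 0.4],  # argmax 2
]
targets = [1, 0, 2, 2]
accuracy = categorical_accuracy_lists(scores, targets)
print(accuracy)  # 0.75
```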

### Training and evaluation functions.

In [25]:
def train(model, iterator, optimizer, criterion):
    epoch_loss = 0
    epoch_acc = 0
    model.train()
    for batch in iterator:
        optimizer.zero_grad()
        text, text_lengths = batch.text
        predictions = model(text, text_lengths)
        loss = criterion(predictions, batch.emotion)
        acc = categorical_accuracy(predictions, batch.emotion)
        loss.backward()
        optimizer.step()
        epoch_loss += loss.item()
        epoch_acc += acc.item()
    return epoch_loss / len(iterator), epoch_acc / len(iterator)

def evaluate(model, iterator, criterion):
    epoch_loss = 0
    epoch_acc = 0
    model.eval()
    with torch.no_grad():
        for batch in iterator:
            text, text_lengths = batch.text
            predictions = model(text, text_lengths)
            loss = criterion(predictions, batch.emotion)
            acc = categorical_accuracy(predictions, batch.emotion)
            epoch_loss += loss.item()
            epoch_acc += acc.item()
    return epoch_loss / len(iterator), epoch_acc / len(iterator)

### Training Loop.

We will create a helper function that prints the loss, the accuracy, and the elapsed time (shown in the `ETA` column) for every epoch of the training loop.

In [26]:
def hms_string(sec_elapsed):
    h = int(sec_elapsed / (60 * 60))
    m = int((sec_elapsed % (60 * 60)) / 60)
    s = sec_elapsed % 60
    return "{}:{:>02}:{:>05.2f}".format(h, m, s)

def visualize_training(start, end, train_loss, train_accuracy, val_loss, val_accuracy, title):
  data = [
       ["Training", f'{train_loss:.3f}', f'{train_accuracy:.3f}', f"{hms_string(end - start)}" ],
       ["Validation", f'{val_loss:.3f}', f'{val_accuracy:.3f}', "" ],       
  ]
  table = PrettyTable(["CATEGORY", "LOSS", "ACCURACY", "ETA"])
  table.align["CATEGORY"] = 'l'
  table.align["LOSS"] = 'r'
  table.align["ACCURACY"] = 'r'
  table.align["ETA"] = 'r'
  table.title = title
  for row in data:
    table.add_row(row)
  print(table)
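
For reference, here is how the time formatter behaves on a couple of durations. The function body is copied from the cell above so that the snippet runs on its own; the sample durations are arbitrary.

```python
def hms_string(sec_elapsed):
    h = int(sec_elapsed / (60 * 60))
    m = int((sec_elapsed % (60 * 60)) / 60)
    s = sec_elapsed % 60
    return "{}:{:>02}:{:>05.2f}".format(h, m, s)

print(hms_string(7.95))    # 0:00:07.95
print(hms_string(3725.5))  # 1:02:05.50
```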

In [35]:
N_EPOCHS = 10
best_valid_loss = float('inf')
for epoch in range(N_EPOCHS):
    start = time.time()
    train_loss, train_acc = train(emotions_model, train_iterator, optimizer, criterion)
    valid_loss, valid_acc = evaluate(emotions_model, valid_iterator, criterion)
    title = f"EPOCH: {epoch+1:02}/{N_EPOCHS:02} {'saving best model...' if valid_loss < best_valid_loss else 'not saving...'}"
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(emotions_model.state_dict(), 'best-model.pt')
    end = time.time()
    visualize_training(start, end, train_loss, train_acc, valid_loss, valid_acc, title)

+--------------------------------------------+
|     EPOCH: 01/10 saving best model...      |
+------------+-------+----------+------------+
| CATEGORY   |  LOSS | ACCURACY |        ETA |
+------------+-------+----------+------------+
| Training   | 0.174 |    0.933 | 0:00:07.95 |
| Validation | 0.154 |    0.934 |            |
+------------+-------+----------+------------+
+--------------------------------------------+
|     EPOCH: 02/10 saving best model...      |
+------------+-------+----------+------------+
| CATEGORY   |  LOSS | ACCURACY |        ETA |
+------------+-------+----------+------------+
| Training   | 0.162 |    0.938 | 0:00:07.82 |
| Validation | 0.139 |    0.934 |            |
+------------+-------+----------+------------+
+--------------------------------------------+
|         EPOCH: 03/10 not saving...         |
+------------+-------+----------+------------+
| CATEGORY   |  LOSS | ACCURACY |        ETA |
+------------+-------+----------+------------+
| Training   

### Model Evaluation.

In [36]:
emotions_model.load_state_dict(torch.load('best-model.pt'))

test_loss, test_acc = evaluate(emotions_model, test_iterator, criterion)
print(f'Test Loss: {test_loss:.3f} | Test Acc: {test_acc*100:.2f}%')

Test Loss: 0.123 | Test Acc: 93.99%


### Model Inference
Making predictions.

In [None]:
!pip install emoji
import emoji
emotions_emojis = {
   'anger' : ":angry:", 
   'fear': ":fearful:", 
   'joy' : ":smile:", 
   'love' : ":heart_eyes:", 
   'sadness' : ":disappointed:", 
   'surprise': ":open_mouth:"
}

In [39]:
import spacy
import en_core_web_sm
nlp = en_core_web_sm.load()


def tabulate(column_names, data, title="EMOTION PREDICTIONS TABLE"):
  table = PrettyTable(column_names)
  table.align[column_names[0]] = "l"
  table.align[column_names[1]] = "l"
  for row in data:
    table.add_row(row)
  print(table)

classes = LABEL.vocab.itos 
def predict_emotion(model, sentence, min_len = 5):
    model.eval()
    with torch.no_grad():
      tokenized = [tok.text for tok in nlp.tokenizer(sentence)]
      if len(tokenized) < min_len:
          tokenized += ['<pad>'] * (min_len - len(tokenized))
      indexed = [TEXT.vocab.stoi[t] for t in tokenized]
      length =  [len(indexed)]
      tensor = torch.LongTensor(indexed).to(device)
      tensor = tensor.unsqueeze(1)
      length_tensor = torch.LongTensor(length)
      probabilities = model(tensor, length_tensor)
      # Convert the predicted index to a plain int before using it as a list index.
      prediction = torch.argmax(probabilities, dim=1).item()

      class_name = classes[prediction]
      emoji_text = emoji.emojize(emotions_emojis[class_name], language='en', use_aliases=True)
    
      table_headers =["KEY", "VALUE"]
      table_data = [
          ["PREDICTED CLASS",  prediction],
          ["PREDICTED CLASS NAME",  class_name],
          ["PREDICTED CLASS EMOJI",  emoji_text],     
      ]
      tabulate(table_headers, table_data)
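
The indexing step inside `predict_emotion` relies on the vocabulary mapping unseen tokens to the unknown index. A rough stand-alone sketch of the padding and lookup follows; the toy vocabulary, the whitespace tokenizer, and the helper `encode` are assumptions made for illustration (the notebook itself uses spaCy and `TEXT.vocab.stoi`).

```python
from collections import defaultdict

UNK_IDX, PAD_IDX = 0, 1
# Toy vocabulary: any token not listed falls back to UNK_IDX.
stoi = defaultdict(lambda: UNK_IDX, {'<unk>': UNK_IDX, '<pad>': PAD_IDX,
                                     'i': 2, 'feel': 3, 'happy': 4})

def encode(sentence, min_len=5):
    """Tokenize on whitespace, pad up to min_len, and map tokens to ids."""
    tokens = sentence.split()
    if len(tokens) < min_len:
        tokens += ['<pad>'] * (min_len - len(tokens))
    return [stoi[token] for token in tokens], len(tokens)

indexed, length = encode('i feel ecstatic')
print(indexed, length)  # [2, 3, 0, 1, 1] 5
```

As in the notebook's function, the reported length counts the padding tokens added to reach `min_len`.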

### Sadness

In [41]:
predict_emotion(emotions_model, "im updating my blog because i feel shitty.")

+-----------------------+---------+
| KEY                   | VALUE   |
+-----------------------+---------+
| PREDICTED CLASS       | 1       |
| PREDICTED CLASS NAME  | sadness |
| PREDICTED CLASS EMOJI | 😞      |
+-----------------------+---------+


### Fear

In [46]:
predict_emotion(emotions_model, "i am feeling apprehensive about it but also wildly excited")

+-----------------------+-------+
| KEY                   | VALUE |
+-----------------------+-------+
| PREDICTED CLASS       | 3     |
| PREDICTED CLASS NAME  | fear  |
| PREDICTED CLASS EMOJI | 😨    |
+-----------------------+-------+


### Joy

In [47]:
predict_emotion(emotions_model, "i feel a little mellow today.")

+-----------------------+-------+
| KEY                   | VALUE |
+-----------------------+-------+
| PREDICTED CLASS       | 0     |
| PREDICTED CLASS NAME  | joy   |
| PREDICTED CLASS EMOJI | 😄    |
+-----------------------+-------+


### Surprise

In [48]:
predict_emotion(emotions_model, "i feel shocked and sad at the fact that there are so many sick people.")

+-----------------------+----------+
| KEY                   | VALUE    |
+-----------------------+----------+
| PREDICTED CLASS       | 5        |
| PREDICTED CLASS NAME  | surprise |
| PREDICTED CLASS EMOJI | 😮       |
+-----------------------+----------+


### Love

In [50]:
predict_emotion(emotions_model, "i want each of you to feel my gentle embrace.")

+-----------------------+-------+
| KEY                   | VALUE |
+-----------------------+-------+
| PREDICTED CLASS       | 4     |
| PREDICTED CLASS NAME  | love  |
| PREDICTED CLASS EMOJI | 😍    |
+-----------------------+-------+


### Anger.

In [51]:
predict_emotion(emotions_model, "i feel like my irritable sensitive combination skin has finally met it s match.")

+-----------------------+-------+
| KEY                   | VALUE |
+-----------------------+-------+
| PREDICTED CLASS       | 2     |
| PREDICTED CLASS NAME  | anger |
| PREDICTED CLASS EMOJI | 😠    |
+-----------------------+-------+


> Next, we will clone this repository and use conv nets to predict emotions on the same dataset.