___
**project**: `a simple artificial intelligence (ai) chatbot.`

**date**: `2021/11/20`

**programmer:** `crispen gari`

**description:** `_building a simple ai chatbot, more like a hello world chatbot, using a text classification natural language processing (nlp) approach._`

**framework:**  `pytorch`

**programming language**:  `python`

**main**: `natural language processing(nlp)`.
___

### Simple chatbot

This is a simple chatbot that can hold basic conversations with a human in real time.


This chatbot is based on **text classification approach** using pytorch.



###  Mounting the drive.

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


### Imports

We are going to import the packages that we are going to use in this notebook in the code cell that follows.


In [2]:
import time, os, random, math, json

from prettytable import PrettyTable
import numpy as np

import torch
from torch import nn
import torch.nn.functional as F

from torchtext.legacy import data

torch.__version__

'1.10.0+cu111'

### Device

In [3]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
device

device(type='cuda')

### Data

We load the data from my Google Drive, where the files created in the previous notebook were saved.

In [4]:
base_path = '/content/drive/My Drive/NLP Data/chatbot'
train_path = 'train.csv'
val_path = 'val.csv'
test_path = 'test.csv'

### FastText

We are going to create a modified version of the `FastText` model for text classification. For that we need a helper function that generates bigrams and appends them to the list of tokens.


> I chose FastText for this task because it is efficient: the model can be trained on a CPU in a reasonable time per epoch while still reaching reasonable accuracy.

In [5]:
def generate_bigrams(x):
  x = [i.lower() for i in x]
  n_grams = set(zip(*[x[i: ] for i in range(2)]))
  for n_gram in n_grams:
      x.append(' '.join(n_gram))
  return x
generate_bigrams(['What', 'is', 'the', 'meaning', "of", "OCR", "in", "python"])


['what',
 'is',
 'the',
 'meaning',
 'of',
 'ocr',
 'in',
 'python',
 'in python',
 'is the',
 'the meaning',
 'meaning of',
 'what is',
 'of ocr',
 'ocr in']
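Because `generate_bigrams` iterates over a `set`, the order of the appended bigrams can differ from run to run. This does not matter for a model that averages embeddings, but it can make outputs harder to compare. A minimal sketch of an order-preserving variant follows; the name `generate_bigrams_ordered` is hypothetical and not part of the original notebook.

```python
def generate_bigrams_ordered(tokens):
    """Return lower-cased tokens followed by their bigrams in order of first appearance."""
    tokens = [t.lower() for t in tokens]
    seen = set()
    bigrams = []
    for left, right in zip(tokens, tokens[1:]):
        bigram = f"{left} {right}"
        if bigram not in seen:  # keep each bigram once, like the set-based version
            seen.add(bigram)
            bigrams.append(bigram)
    return tokens + bigrams


result = generate_bigrams_ordered(
    ["What", "is", "the", "meaning", "of", "OCR", "in", "python"]
)
print(result)
```

The returned list contains the same items as the set-based helper, only in a stable order.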

### Fields

We will be using `torchtext` to load our data from the `csv` files stored in my Google Drive.

In [6]:
# version
import torchtext
torchtext.__version__

'0.11.0'

In [8]:
TEXT = data.Field(
    tokenize = "spacy",
    preprocessing = generate_bigrams,
    tokenizer_language = 'en_core_web_sm',
)
LABEL = data.LabelField()

In [9]:
fields = {
    "text": ("text", TEXT),
    "label": ("intent", LABEL),
}

### Dataset

Next we create our datasets with the `TabularDataset.splits()` method, loading the `csv` files for the train, validation and test sets as follows.

In [10]:
train_data, val_data, test_data = data.TabularDataset.splits(
   base_path,
   train=train_path,
   test= test_path,
   validation= val_path,
   format = "csv",
   fields=fields
)
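The keys of the `fields` dictionary (`"text"` and `"label"`) are the column names expected in the header row of each `csv` file. The sketch below shows what such a file might look like and how it parses with the standard `csv` module. The two rows are illustrative assumptions, not the actual contents of `train.csv`.

```python
import csv
import io

# Hypothetical file contents: a header row followed by one example per line.
sample_csv = io.StringIO(
    "text,label\n"
    "see you later,goodbye\n"
    "you are a very clever girl,clever\n"
)

# Each row becomes a (sentence, intent) pair keyed by the header names.
rows = [(row["text"], row["label"]) for row in csv.DictReader(sample_csv)]
for text, label in rows:
    print(f"{label:>8} <- {text}")
```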

### Checking a single example in the train data.

In [11]:
print(vars(train_data.examples[0]))

{'text': ['you', 'are', 'a', 'very', 'clever', 'girl', 'clever girl', 'a very', 'you are', 'very clever', 'are a'], 'intent': 'clever'}


### Checking a single example in the test data

In [12]:
print(vars(test_data.examples[0]))

{'text': ['see', 'you', 'later', 'see you', 'you later'], 'intent': 'goodbye'}


### Checking a single example in the validation data.

In [13]:
print(vars(val_data.examples[0]))

{'text': ['why', 'will', 'you', 'not', 'open', 'the', 'pod', 'bay', 'door', 'pod bay', 'the pod', 'open the', 'bay door', 'will you', 'not open', 'why will', 'you not'], 'intent': 'podbaydoorresponse'}


### Building the vocabulary

We are then going to build the vocabulary on the `train_data`. We use the pretrained `glove.6B.100d` word vectors, which were trained on a corpus of about 6 billion tokens, to help improve our model's performance.


In [14]:
MAX_VOCAB_SIZE = 100_000

TEXT.build_vocab(
     train_data,
     max_size = MAX_VOCAB_SIZE,
     vectors = "glove.6B.100d",
     unk_init = torch.Tensor.normal_
)
LABEL.build_vocab(train_data)

.vector_cache/glove.6B.zip: 862MB [02:40, 5.38MB/s]                           
100%|█████████▉| 399999/400000 [00:22<00:00, 17782.42it/s]
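To make the idea of a vocabulary concrete, here is a simplified stand-in written in plain Python. It is not torchtext's implementation: the function `build_vocab` and its tie-breaking rule are assumptions made for illustration. It only shows the general pattern of reserving special tokens and then assigning indices by frequency.

```python
from collections import Counter


def build_vocab(tokenized_examples, max_size=100_000, specials=("<unk>", "<pad>")):
    """Map each token to an integer index, most frequent tokens first."""
    counts = Counter(tok for example in tokenized_examples for tok in example)
    # Sort by descending frequency; break ties alphabetically (an assumption).
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    itos = list(specials) + [tok for tok, _ in ordered[:max_size]]
    stoi = {tok: idx for idx, tok in enumerate(itos)}
    return stoi, itos


examples = [
    ["see", "you", "later", "see you", "you later"],
    ["you", "are", "a", "very", "clever", "girl"],
]
stoi, itos = build_vocab(examples)
unk_index = stoi["<unk>"]

# Known tokens get their own index; unseen tokens fall back to <unk>.
print(stoi["you"], stoi.get("unseen", unk_index))
```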


### Saving and downloading the words and labels (intents) vocabularies as json files.


In [15]:
words = dict(TEXT.vocab.stoi)
intents = dict(LABEL.vocab.stoi)


words_vocab_path = "words_vocab.json"
intents_vocab_path = "intents_vocab.json"

with open(words_vocab_path, "w") as f:
  json.dump(words, f, indent=2)

with open(intents_vocab_path, "w") as f:
  json.dump(intents, f, indent=2)

print("Done")

Done
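Once the vocabulary is written to JSON it becomes a plain mapping, so any fallback for unknown words has to be handled explicitly when the file is read back. The sketch below assumes the unknown token is stored under the key `"<unk>"`. The tiny `words` dictionary and the temporary path are placeholders, not the notebook's real vocabulary.

```python
import json
import os
import tempfile

# Placeholder vocabulary standing in for dict(TEXT.vocab.stoi).
words = {"<unk>": 0, "<pad>": 1, "hi": 2}

path = os.path.join(tempfile.mkdtemp(), "words_vocab.json")
with open(path, "w") as f:
    json.dump(words, f, indent=2)

with open(path) as f:
    loaded = json.load(f)

# A plain dict raises KeyError for unseen words, so fall back to <unk> explicitly.
unk = loaded["<unk>"]
indices = [loaded.get(token, unk) for token in ["hi", "there"]]
print(indices)
```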


In [17]:
# downloading

from google.colab import files
files.download(intents_vocab_path)
files.download(words_vocab_path)

print("Done")

Done


### Iterators

We then create iterators for the train, validation and test data using `BucketIterator`.

In [18]:
sort_key = lambda x: len(x.text)

BATCH_SIZE = 16
train_iter, val_iter, test_iter = data.BucketIterator.splits(
    (train_data, val_data, test_data),
    device = device,
    batch_size = BATCH_SIZE,
    sort_key = sort_key,
    sort_within_batch=True
)
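The idea behind bucketing is to group examples of similar length so that little padding is needed per batch. The following plain-Python sketch illustrates only that idea. The function `bucket_batches` is hypothetical, and the real `BucketIterator` also handles shuffling and tensor conversion.

```python
def bucket_batches(examples, batch_size, pad_index=1):
    """Group index sequences of similar length and pad each batch to its longest item."""
    ordered = sorted(examples, key=len)  # same role as sort_key = len(x.text)
    batches = []
    for start in range(0, len(ordered), batch_size):
        chunk = ordered[start:start + batch_size]
        longest = max(len(example) for example in chunk)
        batches.append(
            [example + [pad_index] * (longest - len(example)) for example in chunk]
        )
    return batches


examples = [[5, 6, 7, 8], [9], [4, 3], [2, 2, 2], [7, 7]]
batches = bucket_batches(examples, batch_size=2)
for batch in batches:
    print(batch)
```

With similar lengths grouped together, each batch here needs at most one padding value.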

### Creating a model

Based on the `FastText` architecture, our model has a simple embedding layer followed by a linear (fully connected) layer.

In [19]:
class ChatBot(nn.Module):
  def __init__(self,
               vocab_size,
               embedding_size,
               output_dim,
               pad_index,
               dropout=.5
               ):
    super(ChatBot, self).__init__()
    self.embedding = nn.Embedding(
        vocab_size,
        embedding_size,
        padding_idx = pad_index
    )
    self.out = nn.Linear(
        embedding_size,
        out_features = output_dim
    )
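  def forward(self, text):
    # text: [seq_len, batch_size] -> embedded: [batch_size, seq_len, embedding_size]
    embedded = self.embedding(text).permute(1, 0, 2)
    # average the embeddings over the sequence dimension -> [batch_size, embedding_size]
    pooled = F.avg_pool2d(embedded, (embedded.shape[1], 1)
                          ).squeeze(1)
    return self.out(pooled)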

In [20]:
INPUT_DIM = len(TEXT.vocab)
EMBEDDING_DIM = 100
OUTPUT_DIM =  len(LABEL.vocab)
PAD_IDX = TEXT.vocab.stoi[TEXT.pad_token] 

chatbot_model = ChatBot(
            INPUT_DIM, 
            EMBEDDING_DIM, 
            OUTPUT_DIM, 
            pad_index = PAD_IDX
            ).to(device)
chatbot_model

ChatBot(
  (embedding): Embedding(328, 100, padding_idx=1)
  (out): Linear(in_features=100, out_features=22, bias=True)
)
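The pooling step in `forward` averages the token embeddings of each sentence into a single vector. As a small check of that reading, the sketch below compares `F.avg_pool2d` with a plain mean over the sequence dimension on random data. The tensor sizes are arbitrary assumptions.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
seq_len, batch_size, emb_dim = 7, 3, 100

# Stand-in for self.embedding(text).permute(1, 0, 2): [batch, seq_len, emb_dim]
embedded = torch.randn(seq_len, batch_size, emb_dim).permute(1, 0, 2)

# Pooling as written in the model: one window covering the whole sequence.
pooled_avg_pool = F.avg_pool2d(embedded, (embedded.shape[1], 1)).squeeze(1)

# The same average expressed directly.
pooled_mean = embedded.mean(dim=1)

print(pooled_avg_pool.shape)
print(torch.allclose(pooled_avg_pool, pooled_mean, atol=1e-6))
```

Note that padded positions are part of the average in both forms.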

### Initializing weights

In [21]:
def init_weights(m):
  for name, param in m.named_parameters():
    nn.init.normal_(param.data, mean=0, std=0.1)

chatbot_model.apply(init_weights)

ChatBot(
  (embedding): Embedding(328, 100, padding_idx=1)
  (out): Linear(in_features=100, out_features=22, bias=True)
)
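In the cells shown here, the GloVe vectors attached to `TEXT.vocab` are not copied into the model's embedding layer, and `init_weights` re-initialises every parameter. If the intent is to start from the pretrained vectors, one common pattern is to copy them in after initialisation. The sketch below uses a random tensor as a stand-in for `TEXT.vocab.vectors`, and the sizes are assumptions.

```python
import torch
from torch import nn

torch.manual_seed(0)
vocab_size, emb_dim, pad_idx = 6, 4, 1

# Stand-in for TEXT.vocab.vectors (shape: [vocab_size, emb_dim]).
pretrained = torch.randn(vocab_size, emb_dim)

embedding = nn.Embedding(vocab_size, emb_dim, padding_idx=pad_idx)
nn.init.normal_(embedding.weight.data, mean=0, std=0.1)  # as init_weights does

with torch.no_grad():
    embedding.weight.copy_(pretrained)    # overwrite with pretrained vectors
    embedding.weight[pad_idx].zero_()     # keep the padding row at zero

print(embedding.weight[pad_idx])
```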

### Counting model parameters

In [22]:
def count_trainable_params(model):
  return sum(p.numel() for p in model.parameters()), sum(p.numel() for p in model.parameters() if p.requires_grad)

n_params, trainable_params = count_trainable_params(chatbot_model)
print(f"Total number of parameters: {n_params:,}\nTotal trainable parameters: {trainable_params:,}")


Total number of parameters: 35,022
Total trainable parameters: 35,022
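The total can be checked by hand from the layer shapes printed earlier: an embedding of 328 x 100 weights plus a linear layer with 100 x 22 weights and 22 biases.

```python
vocab_size, embedding_dim, output_dim = 328, 100, 22

embedding_params = vocab_size * embedding_dim              # one vector per vocabulary entry
linear_params = embedding_dim * output_dim + output_dim    # weights plus biases

total = embedding_params + linear_params
print(f"{embedding_params:,} + {linear_params:,} = {total:,}")
```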


### Criterion and optimizer

For the optimizer we use `Adam`, and for the loss (criterion) we use `nn.CrossEntropyLoss`, since this is a multi-class text classification task: each sentence belongs to exactly one intent.

In [23]:
optimizer = torch.optim.Adam(chatbot_model.parameters())
criterion = nn.CrossEntropyLoss().to(device)
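For a single example, cross-entropy on raw logits is the negative log of the softmax probability assigned to the true class. The plain-Python sketch below computes that quantity for illustration only. The helper `cross_entropy` and the example logits are assumptions, and `nn.CrossEntropyLoss` additionally averages over the batch by default.

```python
import math


def cross_entropy(logits, target):
    """Negative log-softmax of the target class, computed in a numerically stable way."""
    largest = max(logits)
    log_sum_exp = largest + math.log(sum(math.exp(z - largest) for z in logits))
    return log_sum_exp - logits[target]


logits = [2.0, 0.5, -1.0]                 # hypothetical scores for three intents
loss_correct = cross_entropy(logits, 0)   # true class has the highest score
loss_wrong = cross_entropy(logits, 2)     # true class has the lowest score
print(f"{loss_correct:.3f} {loss_wrong:.3f}")
```

The loss is small when the true class receives the highest score and grows as its score falls behind the others.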

### Categorical accuracy function

This function computes the categorical accuracy, that is, the fraction of predicted labels that match the true labels.

In [24]:
def categorical_accuracy(preds, y):
  top_pred = preds.argmax(1, keepdim = True)
  correct = top_pred.eq(y.view_as(top_pred)).sum()
  acc = correct.float() / y.shape[0]
  return acc
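The same computation can be followed on a tiny batch without tensors. The helper `categorical_accuracy_lists` and the scores below are made up for illustration.

```python
def categorical_accuracy_lists(scores, targets):
    """Fraction of rows whose highest-scoring class equals the target class."""
    predicted = [max(range(len(row)), key=row.__getitem__) for row in scores]
    correct = sum(int(p == t) for p, t in zip(predicted, targets))
    return correct / len(targets)


scores = [
    [0.1, 2.0, -1.0],   # argmax 1
    [1.5, 0.2, 0.3],    # argmax 0
    [0.0, 0.1, 0.9],    # argmax 2
    [0.4, 0.3, 0.2],    # argmax 0
]
targets = [1, 0, 1, 0]
acc = categorical_accuracy_lists(scores, targets)
print(acc)
```

Three of the four predictions match their targets, giving an accuracy of 0.75.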

### Train and Evaluation functions

In [25]:
def train(model, iterator, optimizer, criterion):
    epoch_loss = 0
    epoch_acc = 0
    model.train()
    for batch in iterator:
        optimizer.zero_grad()
        text = batch.text
        predictions = model(text).squeeze(1)
        loss = criterion(predictions, batch.intent)
        acc = categorical_accuracy(predictions, batch.intent)
        loss.backward()
        optimizer.step()
        epoch_loss += loss.item()
        epoch_acc += acc.item()
    return epoch_loss / len(iterator), epoch_acc / len(iterator)


def evaluate(model, iterator, criterion):
    epoch_loss = 0
    epoch_acc = 0
    model.eval()
    with torch.no_grad():
        for batch in iterator:
            text = batch.text
            predictions = model(text).squeeze(1)
            loss = criterion(predictions, batch.intent)
            acc = categorical_accuracy(predictions, batch.intent)
            epoch_loss += loss.item()
            epoch_acc += acc.item()
    return epoch_loss / len(iterator), epoch_acc / len(iterator)

### Visualizing training

To visualize our training, we first need to create some helper functions.


1. time to string function

In [26]:
def hms_string(sec_elapsed):
    h = int(sec_elapsed / (60 * 60))
    m = int((sec_elapsed % (60 * 60)) / 60)
    s = sec_elapsed % 60
    return "{}:{:>02}:{:>05.2f}".format(h, m, s)
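An equivalent formulation splits the elapsed seconds with `divmod`, which some readers may find easier to follow. The name `hms_string_divmod` is hypothetical.

```python
def hms_string_divmod(sec_elapsed):
    """Format elapsed seconds as H:MM:SS.ss using divmod."""
    hours, remainder = divmod(sec_elapsed, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{int(hours)}:{int(minutes):02d}:{seconds:05.2f}"


formatted = hms_string_divmod(3725.5)   # 1 hour, 2 minutes, 5.5 seconds
print(formatted)
```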


2. visualize_training

In [28]:
from prettytable import PrettyTable

In [29]:
def visualize_training(start, end, train_loss, train_accuracy, val_loss, val_accuracy, title):
  data = [
       ["Training", f'{train_loss:.3f}', f'{train_accuracy:.3f}', f"{hms_string(end - start)}" ],
       ["Validation", f'{val_loss:.3f}', f'{val_accuracy:.3f}', "" ],       
  ]
  table = PrettyTable(["CATEGORY", "LOSS", "ACCURACY", "ETA"])
  table.align["CATEGORY"] = 'l'
  table.align["LOSS"] = 'r'
  table.align["ACCURACY"] = 'r'
  table.align["ETA"] = 'r'
  table.title = title
  for row in data:
    table.add_row(row)
  print(table)


### Running the training loop

We train the model for a fixed number of epochs and monitor the validation loss. Whenever the current validation loss is lower than the best validation loss seen so far, we save the model weights using `torch.save`; otherwise we continue training until all the epochs are finished.

In [36]:
N_EPOCHS = 150
MODEL_NAME = "chatbot.pt"
best_valid_loss = float('inf')
for epoch in range(N_EPOCHS):
    start = time.time()
    train_loss, train_acc = train(chatbot_model, 
                                  train_iter, optimizer, criterion)
    valid_loss, valid_acc = evaluate(chatbot_model, 
                                     val_iter, criterion)
    title = f"EPOCH: {epoch+1:02}/{N_EPOCHS:02} {'saving best model...' if valid_loss < best_valid_loss else 'not saving...'}"
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(chatbot_model.state_dict(), MODEL_NAME)
    end = time.time()

    visualize_training(start, end, train_loss, train_acc, valid_loss, valid_acc, title)

+--------------------------------------------+
|     EPOCH: 01/150 saving best model...     |
+------------+-------+----------+------------+
| CATEGORY   |  LOSS | ACCURACY |        ETA |
+------------+-------+----------+------------+
| Training   | 0.025 |    1.000 | 0:00:00.03 |
| Validation | 0.205 |    1.000 |            |
+------------+-------+----------+------------+
+--------------------------------------------+
|     EPOCH: 02/150 saving best model...     |
+------------+-------+----------+------------+
| CATEGORY   |  LOSS | ACCURACY |        ETA |
+------------+-------+----------+------------+
| Training   | 0.026 |    1.000 | 0:00:00.02 |
| Validation | 0.204 |    1.000 |            |
+------------+-------+----------+------------+
+--------------------------------------------+
|     EPOCH: 03/150 saving best model...     |
+------------+-------+----------+------------+
| CATEGORY   |  LOSS | ACCURACY |        ETA |
+------------+-------+----------+------------+
| Training   
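The checkpointing rule in the loop can be isolated from the training code. In this sketch the first two validation losses are the ones printed in the log, while the remaining values are hypothetical. The function `best_epochs` only records at which epochs a save would happen.

```python
def best_epochs(valid_losses):
    """Return the epochs at which the validation loss improved, and the best loss."""
    best = float("inf")
    saved_at = []
    for epoch, loss in enumerate(valid_losses, start=1):
        if loss < best:          # strict improvement triggers a save
            best = loss
            saved_at.append(epoch)
    return saved_at, best


# 0.205 and 0.204 come from the log; the rest are made-up values.
saved_at, best_loss = best_epochs([0.205, 0.204, 0.210, 0.199, 0.199])
print(saved_at, best_loss)
```

Because the comparison is strict, an epoch that merely ties the best loss does not overwrite the saved weights.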

### Evaluating the best model.


In [37]:
# load the weights saved during training (MODEL_NAME = "chatbot.pt")
chatbot_model.load_state_dict(torch.load(MODEL_NAME))

test_loss, test_acc = evaluate(chatbot_model, test_iter, criterion)
print(f'Test Loss: {test_loss:.3f} | Test Acc: {test_acc*100:.2f}%')

Test Loss: 0.253 | Test Acc: 100.00%


### Model inference

We are now ready to predict which `intent` a piece of user input belongs to.

In [38]:
labels = {v:k for k, v in LABEL.vocab.stoi.items() }
labels

{0: 'courtesygreetingresponse',
 1: 'greetingresponse',
 2: 'clever',
 3: 'courtesygreeting',
 4: 'currenthumanquery',
 5: 'greeting',
 6: 'nottalking2u',
 7: 'podbaydoor',
 8: 'podbaydoorresponse',
 9: 'realnamequery',
 10: 'selfaware',
 11: 'shutup',
 12: 'timequery',
 13: 'courtesygoodbye',
 14: 'gossip',
 15: 'jokes',
 16: 'namequery',
 17: 'thanks',
 18: 'understandquery',
 19: 'whoami',
 20: 'goodbye',
 21: 'swearing'}

### Tokenizer function

This function is responsible for tokenizing the text, that is, converting it into a list of words. We use the `en_core_web_sm` model.

In [39]:
import en_core_web_sm
nlp = en_core_web_sm.load()

def tokenize_sent(sent):
  return [tok.text for tok in nlp.tokenizer(sent)]

In [40]:
def predict_language(model, sent):
  model.eval()
  sent = sent.lower()
  with torch.no_grad():
    tokenized = tokenize_sent(sent)
    indexed = [TEXT.vocab.stoi[t] for t in tokenized]
    tensor = torch.LongTensor(indexed).to(device)
    tensor = tensor.unsqueeze(1)
    probabilities = torch.softmax(model(tensor), dim=1)
    prediction = torch.argmax(probabilities, dim=1)
    item = prediction.item()

  return {
      "label": item,
      "lang": labels[item]
  }

predict_language(chatbot_model, "hi")

{'label': 5, 'lang': 'greeting'}

In [41]:
predict_language(chatbot_model, "how are you?")

{'label': 3, 'lang': 'courtesygreeting'}
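During training, the `TEXT` field passes every example through `generate_bigrams`, whereas `predict_language` indexes only the single tokens. One way to keep inference aligned with training is to append the bigrams before looking up indices. The following is a sketch under assumptions: `encode_for_inference` and `toy_stoi` are hypothetical and stand in for the real vocabulary.

```python
def encode_for_inference(tokens, stoi, unk_token="<unk>"):
    """Lower-case tokens, append unique bigrams, and map everything to indices."""
    tokens = [t.lower() for t in tokens]
    # dict.fromkeys drops duplicate bigrams while preserving their order
    bigrams = list(dict.fromkeys(f"{a} {b}" for a, b in zip(tokens, tokens[1:])))
    unk = stoi[unk_token]
    return [stoi.get(t, unk) for t in tokens + bigrams]


toy_stoi = {
    "<unk>": 0, "<pad>": 1,
    "how": 2, "are": 3, "you": 4,
    "how are": 5, "are you": 6,
}
indices = encode_for_inference(["How", "are", "you", "?"], toy_stoi)
print(indices)
```

Tokens and bigrams missing from the vocabulary, such as `"?"` and `"you ?"` here, map to the unknown index.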

### Downloading the model.

We are going to download the saved model weights (`MODEL_NAME`, that is `chatbot.pt`).

In [42]:
files.download(MODEL_NAME)
