# LSTM Bot

## Project Overview

In this project, you will build a chatbot that converses with you at the command line. The chatbot uses a sequence-to-sequence (Seq2Seq) text generation architecture with an LSTM as its memory unit. You will also learn to use pretrained word embeddings to improve the performance of the model. At the conclusion of the project, you will be able to show your chatbot to potential employers.

Using pretrained word embeddings in your model is optional. The starter code below uses Gensim to train Word2Vec embeddings on the NLTK Brown corpus, save them, and load them back. You can compare the performance of your model with these embeddings against a model without them.
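
One way to use such embeddings is to copy each known word's vector into a weight matrix whose rows line up with the model vocabulary. The sketch below is illustrative only: `build_embedding_matrix`, `toy_vectors`, and `toy_itos` are hypothetical names, with a plain dict standing in for the Gensim vectors (`w2v.wv`) and a list standing in for `SRC_TEXT.vocab.itos`.

```python
import numpy as np

def build_embedding_matrix(itos, word_vectors, dim, seed=0):
    """Return a (len(itos), dim) matrix plus the number of tokens found.

    Tokens missing from word_vectors keep a small random initialization.
    """
    rng = np.random.default_rng(seed)
    matrix = rng.normal(0.0, 0.1, size=(len(itos), dim)).astype(np.float32)
    hits = 0
    for idx, token in enumerate(itos):
        if token in word_vectors:
            matrix[idx] = word_vectors[token]
            hits += 1
    return matrix, hits

# Hypothetical stand-ins for the trained vectors and the model vocabulary
toy_vectors = {
    "the": np.ones(4, dtype=np.float32),
    "cat": np.full(4, 2.0, dtype=np.float32),
}
toy_itos = ["<unk>", "<pad>", "the", "cat", "zebra"]

matrix, hits = build_embedding_matrix(toy_itos, toy_vectors, dim=4)
print(matrix.shape, hits)  # (5, 4) 2
```

A matrix built this way could then be handed to `nn.Embedding.from_pretrained(torch.from_numpy(matrix), freeze=False)`, provided the embedding dimension of the model matches the vector size (Gensim's Word2Vec defaults to 100 dimensions).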



---



A sequence-to-sequence (Seq2Seq) model has two components:
- An Encoder consisting of an embedding layer and an LSTM unit.
- A Decoder consisting of an embedding layer, an LSTM unit, and a linear output unit.

The Seq2Seq model feeds an input sequence into the Encoder, which passes its final hidden and cell states to the Decoder. The Decoder starts from that state and outputs a series of token predictions, one token at a time.
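
The data flow described above can be sketched with a few untrained PyTorch layers. This is a minimal illustration, not the project's model: the sizes, the token ids `SOS` and `EOS`, and the variable names are assumptions chosen for the example, and the predictions are meaningless because nothing has been trained.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

VOCAB, HID = 20, 8   # toy sizes, chosen only for illustration
SOS, EOS = 1, 2      # hypothetical special-token ids

enc_embedding = nn.Embedding(VOCAB, HID)
dec_embedding = nn.Embedding(VOCAB, HID)
encoder_lstm = nn.LSTM(HID, HID)
decoder_lstm = nn.LSTM(HID, HID)
to_vocab = nn.Linear(HID, VOCAB)

# One source sequence, shaped (src_len=4, batch=1)
src = torch.tensor([[SOS], [5], [7], [EOS]])

with torch.no_grad():
    # Encoder: discard the per-step outputs, keep the final (hidden, cell) state
    _, (hidden, cell) = encoder_lstm(enc_embedding(src))

    # Decoder: start from <sos> and feed each prediction back in (greedy decoding)
    token = torch.tensor([SOS])
    predicted = []
    for _ in range(5):
        step_in = dec_embedding(token).unsqueeze(0)          # (1, batch, HID)
        out, (hidden, cell) = decoder_lstm(step_in, (hidden, cell))
        token = to_vocab(out.squeeze(0)).argmax(1)           # (batch,)
        predicted.append(token.item())

print(hidden.shape, predicted)
```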

## Dependencies

- PyTorch
- torchtext
- NumPy
- pandas
- NLTK
- gzip (Python standard library)
- Gensim
- scikit-learn
- requests
- tqdm


Please choose a dataset from the torchtext website. We recommend looking at the SQuAD dataset first. The available options are listed here:

- https://pytorch.org/text/stable/datasets.html
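
For orientation, SQuAD 2.0 files are nested JSON: articles contain paragraphs, and paragraphs contain question/answer entries. The sketch below walks that structure on a tiny hand-written sample. The sample content and the helper name `extract_pairs` are made up for illustration; only the key names follow the SQuAD 2.0 layout.

```python
# A made-up record that follows the SQuAD 2.0 key layout
sample = {
    "data": [
        {
            "title": "Example article",
            "paragraphs": [
                {
                    "context": "The sky is blue on a clear day.",
                    "qas": [
                        {
                            "question": "What color is the sky?",
                            "is_impossible": False,
                            "answers": [{"text": "blue", "answer_start": 11}],
                        },
                        {
                            "question": "Who painted the sky?",
                            "is_impossible": True,
                            "answers": [],
                        },
                    ],
                }
            ],
        }
    ]
}

def extract_pairs(squad):
    """Collect (question, first answer) pairs, skipping unanswerable questions."""
    pairs = []
    for article in squad["data"]:
        for paragraph in article["paragraphs"]:
            for qa in paragraph["qas"]:
                if not qa["is_impossible"]:
                    pairs.append((qa["question"], qa["answers"][0]["text"]))
    return pairs

pairs = extract_pairs(sample)
print(pairs)  # [('What color is the sky?', 'blue')]
```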





In [2]:
!pip uninstall torchtext -y

Found existing installation: torchtext 0.14.1
Uninstalling torchtext-0.14.1:
  Successfully uninstalled torchtext-0.14.1


In [3]:
!pip install torchtext==0.9.0 torch==1.8.0 torchvision==0.9.0

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting torchtext==0.9.0
  Downloading torchtext-0.9.0-cp39-cp39-manylinux1_x86_64.whl (7.0 MB)
Collecting torch==1.8.0
  Downloading torch-1.8.0-cp39-cp39-manylinux1_x86_64.whl (735.5 MB)
Collecting torchvision==0.9.0
  Downloading torchvision-0.9.0-cp39-cp39-manylinux1_x86_64.whl (17.3 MB)
Installing collected packages: torch, torchvision, torchtext
  Attempting uninstall: torch
    Found existing installation: torch 1.13.1+cu116
    Uninstalling torch-1.13.1+cu116:
      Successfully uninstalled torch-1.13.1+cu116
  

In [4]:
import gensim
import nltk
import numpy as np
import pandas as pd
import gzip
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader
from nltk.corpus import brown
# Dataset comes from torchtext.legacy; importing torch.utils.data.Dataset as well would be shadowed here
from torchtext.legacy.data import Field, BucketIterator, Dataset
from torchtext.data.utils import get_tokenizer
from sklearn.model_selection import train_test_split
import json
import requests
from torchtext.legacy.data import Example

nltk.download('brown')
nltk.download('punkt')

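# Train Word2Vec embeddings on the Brown corpus, save them to disk, and reload them.
# Note: `model` is reassigned to the Seq2Seq network later; the embeddings stay available as `w2v`.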
model = gensim.models.Word2Vec(brown.sents())
model.save('brown.embedding')
w2v = gensim.models.Word2Vec.load('brown.embedding')


def prepare_text(sentence):
    tokenizer = get_tokenizer('basic_english')
    tokens = tokenizer(sentence)
    return tokens

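def download_file(url, filename):
    # Fail loudly on HTTP errors instead of saving an error page to disk
    response = requests.get(url)
    response.raise_for_status()
    with open(filename, 'wb') as f:
        f.write(response.content)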

download_file("https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v2.0.json", "squad_train.json")
download_file("https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v2.0.json", "squad_dev.json")

def load_squad_data():
    with open('squad_train.json', 'r') as f:
        squad_data = json.load(f)

    src_data = []
    trg_data = []

    for article in squad_data['data']:
        for paragraph in article['paragraphs']:
            for qa in paragraph['qas']:
                # Skip unanswerable questions (SQuAD 2.0) and keep only the first listed answer
                if not qa['is_impossible']:
                    question = qa['question']
                    answer = qa['answers'][0]['text']
                    src_data.append(question)
                    trg_data.append(answer)

    return src_data, trg_data

SRC, TRG = load_squad_data()
SRC_train, SRC_test, TRG_train, TRG_test = train_test_split(SRC, TRG, test_size=0.2, random_state=42)

train_df = pd.DataFrame({'src': SRC_train, 'trg': TRG_train})
test_df = pd.DataFrame({'src': SRC_test, 'trg': TRG_test})

train_df.to_csv('train.csv', index=False)
test_df.to_csv('test.csv', index=False)

SRC_TEXT = Field(tokenize=prepare_text, init_token='<sos>', eos_token='<eos>', lower=True)
TRG_TEXT = Field(tokenize=prepare_text, init_token='<sos>', eos_token='<eos>', lower=True)

data_fields = [('src', SRC_TEXT), ('trg', TRG_TEXT)]

def create_dataset(df, fields):
    examples = []
    for index, row in df.iterrows():
        src_data = row['src']
        trg_data = row['trg']
        examples.append(Example.fromlist([src_data, trg_data], fields))
    return Dataset(examples, fields)

train_data = create_dataset(train_df, data_fields)
test_data = create_dataset(test_df, data_fields)

SRC_TEXT.build_vocab(train_data, min_freq=2)
TRG_TEXT.build_vocab(train_data, min_freq=2)


[nltk_data] Downloading package brown to /root/nltk_data...
[nltk_data]   Unzipping corpora/brown.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
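

The `min_freq=2` argument above means that tokens seen only once in the training data are left out of the vocabulary and later map to `<unk>`. The sketch below imitates that behavior in plain Python. It is a simplified stand-in rather than torchtext's implementation, and `build_vocab`, `numericalize`, and the sample sentences are hypothetical.

```python
from collections import Counter

def build_vocab(tokenized_sentences, min_freq=2,
                specials=("<unk>", "<pad>", "<sos>", "<eos>")):
    """Keep special tokens plus every token seen at least min_freq times."""
    counts = Counter(tok for sent in tokenized_sentences for tok in sent)
    itos = list(specials) + sorted(t for t, c in counts.items() if c >= min_freq)
    stoi = {tok: i for i, tok in enumerate(itos)}
    return itos, stoi

def numericalize(tokens, stoi):
    """Map tokens to ids, falling back to the <unk> id for unknown tokens."""
    unk = stoi["<unk>"]
    return [stoi.get(t, unk) for t in tokens]

sentences = [["who", "wrote", "hamlet"], ["who", "wrote", "ulysses"]]
itos, stoi = build_vocab(sentences)

print(itos)                                            # ['<unk>', '<pad>', '<sos>', '<eos>', 'who', 'wrote']
print(numericalize(["who", "wrote", "hamlet"], stoi))  # [4, 5, 0]
```

Because rare answer words are collapsed into a single id, the decoder can only ever emit `<unk>` for them, which is one plausible reason a chatbot built this way sometimes answers with `<unk>`.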


In [5]:
# Encoder
class Encoder(nn.Module):
    def __init__(self, input_size, hidden_size, num_layers=1, dropout=0):
        super(Encoder, self).__init__()
        self.input_size = input_size
        self.hidden_size = hidden_size
        self.num_layers = num_layers
        self.dropout = dropout

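        # The embedding dimension is tied to hidden_size in this model
        self.embedding = nn.Embedding(input_size, hidden_size)
        # nn.LSTM applies dropout only between stacked layers, so it has no effect when num_layers=1
        self.lstm = nn.LSTM(hidden_size, hidden_size, num_layers, dropout=dropout)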

    def forward(self, src):
        embedded = self.embedding(src)
        output, (hidden, cell) = self.lstm(embedded)
        return hidden, cell

In [6]:
# Decoder
class Decoder(nn.Module):
    def __init__(self, hidden_size, output_size, num_layers=1, dropout=0):
        super(Decoder, self).__init__()
        self.hidden_size = hidden_size
        self.output_size = output_size
        self.num_layers = num_layers
        self.dropout = dropout

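        # The embedding dimension is tied to hidden_size in this model
        self.embedding = nn.Embedding(output_size, hidden_size)
        # nn.LSTM applies dropout only between stacked layers, so it has no effect when num_layers=1
        self.lstm = nn.LSTM(hidden_size, hidden_size, num_layers, dropout=dropout)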
        self.fc_out = nn.Linear(hidden_size, output_size)

    def forward(self, input, hidden, cell):
        input = input.unsqueeze(0)
        embedded = self.embedding(input)
        output, (hidden, cell) = self.lstm(embedded, (hidden, cell))
        prediction = self.fc_out(output.squeeze(0))
        return prediction, hidden, cell

In [7]:
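# Seq2Seq
import random  # used for teacher forcing in forward(); imported here so this cell does not rely on a later one
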
class Seq2Seq(nn.Module):
    def __init__(self, encoder, decoder, device):
        super(Seq2Seq, self).__init__()
        self.encoder = encoder
        self.decoder = decoder
        self.device = device

    def forward(self, src, trg, teacher_forcing_ratio=0.5):
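        # src and trg are shaped (seq_len, batch_size)
        batch_size = trg.shape[1]
        trg_len = trg.shape[0]
        # Read the vocabulary size from the decoder instead of the global TRG_TEXT field
        trg_vocab_size = self.decoder.output_size

        # Row 0 stays zero; it corresponds to <sos> and is skipped when the loss is computed
        outputs = torch.zeros(trg_len, batch_size, trg_vocab_size).to(self.device)

        # The encoder's final hidden and cell states initialize the decoder
        hidden, cell = self.encoder(src)

        # The first decoder input is the <sos> token
        input = trg[0, :]

        for t in range(1, trg_len):
            output, hidden, cell = self.decoder(input, hidden, cell)
            outputs[t] = output
            # With probability teacher_forcing_ratio, feed the ground-truth token;
            # otherwise feed the model's own top prediction
            teacher_force = random.random() < teacher_forcing_ratio
            top1 = output.argmax(1)
            input = trg[t] if teacher_force else top1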

        return outputs

In [8]:
!pip install tqdm

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [9]:
from tqdm import tqdm
import random

# Hyperparameters
INPUT_DIM = len(SRC_TEXT.vocab)
OUTPUT_DIM = len(TRG_TEXT.vocab)
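ENC_EMB_DIM = 256  # unused: Encoder sizes its embedding with HID_DIM
DEC_EMB_DIM = 256  # unused: Decoder sizes its embedding with HID_DIM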
HID_DIM = 512
ENC_DROPOUT = 0.5
DEC_DROPOUT = 0.5
NUM_EPOCHS = 20
LEARNING_RATE = 0.001
# Create iterators
BATCH_SIZE = 128
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

train_iterator, test_iterator = BucketIterator.splits(
    (train_data, test_data),
    batch_size=BATCH_SIZE,
    sort_within_batch=True,
    sort_key=lambda x: len(x.src),
    device=device)


# Initialize models
enc = Encoder(INPUT_DIM, HID_DIM, dropout=ENC_DROPOUT)
dec = Decoder(HID_DIM, OUTPUT_DIM, dropout=DEC_DROPOUT)
model = Seq2Seq(enc, dec, device).to(device)

# Optimizer
optimizer = optim.Adam(model.parameters(), lr=LEARNING_RATE)

# Loss function
TRG_PAD_IDX = TRG_TEXT.vocab.stoi[TRG_TEXT.pad_token]
criterion = nn.CrossEntropyLoss(ignore_index=TRG_PAD_IDX)

# Training loop
for epoch in range(NUM_EPOCHS):
    model.train()
    epoch_loss = 0

    # Wrap train_iterator in tqdm to display a progress bar
    for idx, batch in enumerate(tqdm(train_iterator)):
        src = batch.src
        trg = batch.trg

        optimizer.zero_grad()

        output = model(src, trg)
        output = output[1:].view(-1, output.shape[-1])
        trg = trg[1:].view(-1)

        loss = criterion(output, trg)
        loss.backward()

        optimizer.step()
        epoch_loss += loss.item()

    print(f"Epoch {epoch+1}: Loss = {epoch_loss/len(train_iterator)}")

100%|██████████| 543/543 [14:09<00:00,  1.56s/it]
Epoch 1: Loss = 6.0158536289278315
100%|██████████| 543/543 [15:30<00:00,  1.71s/it]
Epoch 2: Loss = 5.404767302518391
100%|██████████| 543/543 [15:32<00:00,  1.72s/it]
Epoch 3: Loss = 5.058805553513557
100%|██████████| 543/543 [15:33<00:00,  1.72s/it]
Epoch 4: Loss = 4.736509864062455
100%|██████████| 543/543 [15:50<00:00,  1.75s/it]
Epoch 5: Loss = 4.389767016275592
100%|██████████| 543/543 [16:20<00:00,  1.81s/it]
Epoch 6: Loss = 4.033947415553843
100%|██████████| 543/543 [16:13<00:00,  1.79s/it]
Epoch 7: Loss = 3.6694505759146114
100%|██████████| 543/543 [14:31<00:00,  1.60s/it]
Epoch 8: Loss = 3.3013825842469218
100%|██████████| 543/543 [14:36<00:00,  1.61s/it]
Epoch 9: Loss = 2.8908930601994633
100%|██████████| 543/543 [14:34<00:00,  1.61s/it]
Epoch 10: Loss = 2.5242633924958455
100%|██████████| 543/543 [14:26<00:00,  1.60s/it]
Epoch 11: Loss = 2.1248148664365596
100%|██████████| 543/543 [14:27<00:00,  1.60s/it]
Epoch 12: Loss = 1.785863956474248
100%|██████████| 543/543 [13:37<00:00,  1.51s/it]
Epoch 13: Loss = 1.462346934481879
100%|██████████| 543/543 [13:57<00:00,  1.54s/it]
Epoch 14: Loss = 1.1754933632978857
100%|██████████| 543/543 [13:58<00:00,  1.55s/it]
Epoch 15: Loss = 0.9318665676573583
100%|██████████| 543/543 [16:17<00:00,  1.80s/it]
Epoch 16: Loss = 0.7209270115715364
100%|██████████| 543/543 [14:08<00:00,  1.56s/it]
Epoch 17: Loss = 0.540498335168287
100%|██████████| 543/543 [14:02<00:00,  1.55s/it]
Epoch 18: Loss = 0.41438500931688876
100%|██████████| 543/543 [14:34<00:00,  1.61s/it]
Epoch 19: Loss = 0.30131793951077135
100%|██████████| 543/543 [14:33<00:00,  1.61s/it]
Epoch 20: Loss = 0.22775825866348837

In [10]:
def evaluate(model, iterator, criterion):
    model.eval()
    epoch_loss = 0

    with torch.no_grad():
        for batch in iterator:
            src = batch.src
            trg = batch.trg

            # A teacher forcing ratio of 0 makes the decoder consume its own predictions
            output = model(src, trg, 0)
            output = output[1:].view(-1, output.shape[-1])
            trg = trg[1:].view(-1)

            loss = criterion(output, trg)
            epoch_loss += loss.item()

    return epoch_loss / len(iterator)

test_loss = evaluate(model, test_iterator, criterion)
print(f"Test Loss: {test_loss}")


Test Loss: 8.253702857915092
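
One way to read these numbers is to convert the cross-entropy loss (in nats) into perplexity with `exp(loss)`. The sketch below does this for the epoch 20 training loss and the test loss reported above. Treat it as a rough comparison only: the values are averages of per-batch losses, and the training loss was computed with a teacher forcing ratio of 0.5 while the test loss used 0.

```python
import math

# Values copied from the notebook output above
train_loss = 0.22775825866348837   # epoch 20 training loss
test_loss = 8.253702857915092      # test loss without teacher forcing

train_ppl = math.exp(train_loss)
test_ppl = math.exp(test_loss)

print(f"train perplexity ~ {train_ppl:.2f}")
print(f"test perplexity  ~ {test_ppl:.0f}")
```

A gap this large between the two suggests the model has largely memorized the training answers, so comparing runs with and without the word embeddings on test loss is likely to be more informative than comparing training loss.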


In [49]:
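def text_to_tensor(text, field, device):
    # Mirror the Field preprocessing used in training: tokenize, then wrap in <sos>/<eos>
    tokens = [field.init_token] + field.tokenize(text) + [field.eos_token]
    indexes = [field.vocab.stoi[token] for token in tokens]
    tensor = torch.LongTensor(indexes).to(device)
    return indexes, tensor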


def generate_response(model, sentence, src_field, trg_field, device, max_len=50):
    model.eval()
    
    src_indexes, src_tensor = text_to_tensor(sentence, src_field, device)
    src_tensor = src_tensor.unsqueeze(1)

    with torch.no_grad():
        hidden, cell = model.encoder(src_tensor)

    trg_indexes = [trg_field.vocab.stoi[trg_field.init_token]]
    for _ in range(max_len):
        trg_tensor = torch.LongTensor([trg_indexes[-1]]).to(device)  
        with torch.no_grad():
            output, hidden, cell = model.decoder(trg_tensor, hidden, cell)  
        pred_token = output.argmax(1).item()
        trg_indexes.append(pred_token)
        if pred_token == trg_field.vocab.stoi[trg_field.eos_token]:
            break

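    # Drop the leading <sos>, and drop the trailing <eos> only if one was actually generated
    trg_tokens = [trg_field.vocab.itos[i] for i in trg_indexes[1:]]
    if trg_tokens and trg_tokens[-1] == trg_field.eos_token:
        trg_tokens = trg_tokens[:-1]
    return ' '.join(trg_tokens)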


In [56]:
import traceback

def test_bot():
    while True:
        try:
            input_sentence = input("Enter a question or type 'exit' to quit: ")
            if input_sentence.lower() == "exit":
                break
            response = generate_response(model, input_sentence, SRC_TEXT, TRG_TEXT, device)
            print(f"Answer: {response}")
        except Exception as e:
            print(f"Error: {e}")
            traceback.print_exc()  

test_bot()


Enter a question or type 'exit' to quit: Prior to Kennedy v. Louisiana, how many states criminalized child rape?
Answer: three
Enter a question or type 'exit' to quit: In what year did Louis die?
Answer: <unk>
Enter a question or type 'exit' to quit: How many people attended the service in Lviv?
Answer: around 280 , 000
Enter a question or type 'exit' to quit: exit
