# LSTM Bot

## Project Overview

In this project, you will build a chatbot that can converse with you at the command line. The chatbot will use a Sequence to Sequence text generation architecture with an LSTM as it's memory unit. You will also learn to use pretrained word embeddings to improve the performance of the model. At the conclusion of the project, you will be able to show your chatbot to potential employers.

Additionally, you have the option to use pretrained word embeddings in your model. We have loaded Brown Embeddings from Gensim in the starter code below. You can compare the performance of your model with pre-trained embeddings against a model without the embeddings.



---



A sequence to sequence model (Seq2Seq) has two components:
- An Encoder consisting of an embedding layer and LSTM unit.
- A Decoder consisting of an embedding layer, LSTM unit, and linear output unit.

The Seq2Seq model works by accepting an input into the Encoder, passing the hidden state from the Encoder to the Decoder, which the Decoder uses to output a series of token predictions.

## Dependencies

- Pytorch
- Numpy
- Pandas
- NLTK
- Gzip
- Gensim


Please choose a dataset from the Torchtext website. We recommend looking at the Squad dataset first. Here is a link to the website where you can view your options:

- https://pytorch.org/text/stable/datasets.html





In [3]:
import nltk
import pandas as pd
import string
import torch

stemmer = nltk.stem.snowball.SnowballStemmer('english')
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

def getDict(dataPipe):

    data_dict = {
        'Question': [],
        'Answer': []
    }
    
    for _, question, answers, _ in dataPipe:
        data_dict['Question'].append(question)
        data_dict['Answer'].append(answers[0])
        
    return data_dict


def loadDF(path):
    # load data
    train_data, val_data = torchtext.datasets.SQuAD2|(path)
    
    # convert dataPipe to dictionary 
    train_dict, val_dict = getDict(train_data), getDict(val_data)
    
    # convert Dictionaries to Pandas DataFrame
    train_df = pd.DataFrame(train_dict)    
    validation_df = pd.DataFrame(val_dict)    
    
    return train_df.append(validation_df)


def prepare_text(sentence):
    # clean text and tokenize it 
    sentence = ''.join([s.lower() for s in sentence if s not in string.punctuation])
    sentence = ' '.join(stemmer.stem(w) for w in sentence.split())
    tokens = nltk.tokenize.RegexpTokenizer(r'\w+').tokenize(sentence)

    return tokens


def toTensor(vocab, sentence):
    # convert list of words "sentence" to a torch tensor of indices
    indices = [vocab.word2index[word] for word in sentence.split(' ')]
    indices.append(vocab.word2index[''])
    return torch.Tensor(indices).long().to(device).view(-1, 1)


def getPairs(df):
    # convert df to list of pairs
    temp1 = df['Question'].apply(lambda x: " ".join(x) ).to_list()
    temp2 = df['Answer'].apply(lambda x: " ".join(x) ).to_list()
    return [list(i) for i in zip(temp1, temp2)]


def getMaxLen(pairs):
    max_src = 0 
    max_trg = 0
    
    for p in pairs:
        max_src = len(p[0].split()) if len(p[0].split()) > max_src else max_src
        max_trg = len(p[1].split()) if len(p[1].split()) > max_trg else max_trg
        
    return max_src, max_trg



## Arsitektur

In [4]:
import random 
import torch
import torch.nn as nn


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

class Encoder(nn.Module):
    def __init__(self, input_size, hidden_size):
        super(Encoder, self).__init__()
        
        self.input_size = input_size
        self.hidden_size = hidden_size
        
        self.embedding = nn.Embedding(self.input_size, self.hidden_size)
        self.lstm = nn.LSTM(self.hidden_size, self.hidden_size)

    def forward(self, x, hidden, cell_state):
        x = self.embedding(x)
        x = x.view(1, 1, -1)
        x, (hidden, cell_state) = self.lstm(x, (hidden, cell_state))
        return x, hidden, cell_state
        

class Decoder(nn.Module):
    def __init__(self, hidden_size, output_size):
        super(Decoder, self).__init__()
        
        self.hidden_size = hidden_size
        self.output_size = output_size
        
        self.embedding = nn.Embedding(output_size, self.hidden_size)
        self.lstm = nn.LSTM(self.hidden_size, self.hidden_size)
        self.fc = nn.Linear(self.hidden_size, self.output_size)
        self.softmax = nn.LogSoftmax(dim= 1)

    def forward(self, x, hidden, cell_state):
        x = self.embedding(x)
        x = x.view(1, 1, -1)
        x, (hidden, cell_state) = self.lstm(x, (hidden, cell_state))
        x = self.softmax(self.fc(x[0]))
        return x, hidden, cell_state
    
     
class Seq2Seq(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(Seq2Seq, self).__init__()
        
        self.input_size = input_size
        self.hidden_size = hidden_size
        self.output_size = output_size
        
        self.encoder = Encoder(self.input_size, self.hidden_size)
        self.decoder = Decoder(self.hidden_size, self.output_size)
        
    def forward(self, src, trg, src_len, trg_len, teacher_force=1):
        
        output = {
            'decoder_output':[]
        }
        
        encoder_hidden = torch.zeros([1, 1, self.hidden_size]).to(device) # 1 = number of LSTM layers
        cell_state = torch.zeros([1, 1, self.hidden_size]).to(device)  
        
        for i in range(src_len):
            encoder_output, encoder_hidden, cell_state = self.encoder(src[i], encoder_hidden, cell_state)

        decoder_input = torch.Tensor([[0]]).long().to(device) # 0 = SOS_token
        decoder_hidden = encoder_hidden
        
        for i in range(trg_len):
            decoder_output, decoder_hidden, cell_state = self.decoder(decoder_input, decoder_hidden, cell_state)
            output['decoder_output'].append(decoder_output)
            
            if self.training: # Model not in eval mode
                decoder_input = target_tensor[i] if random.random() > teacher_force else decoder_output.argmax(1) # teacher forcing
            else:
                _, top_index = decoder_output.data.topk(1)
                decoder_input = top_index.squeeze().detach()
                
        return output


## Train Function

In [45]:
import torch
import torch.nn as nn
from sklearn.model_selection import KFold

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

def train(source_data, target_data, model, epochs, batch_size, print_every, learning_rate):
    model.to(device)
    total_training_loss = 0
    total_valid_loss = 0
    loss = 0
    
    optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)
    criterion = nn.NLLLoss()

    # use cross validation
    kf = KFold(n_splits=epochs, shuffle=True)

    for e, (train_index, test_index) in enumerate(kf.split(source_data), 1):
        model.train()
        for i in range(0, len(train_index)):

            src = source_data[i]
            trg = target_data[i]

            output = model(src, trg, src.size(0), trg.size(0))

            current_loss = 0
            for (s, t) in zip(output["decoder_output"], trg): 
                current_loss += criterion(s, t)

            loss += current_loss
            total_training_loss += (current_loss.item() / trg.size(0)) # add the iteration loss

            if i % batch_size == 0 or i == (len(train_index)-1):
                loss.backward()
                optimizer.step()
                optimizer.zero_grad()
                loss = 0


        # validation set 
        model.eval()
        for i in range(0, len(test_index)):
            src = source_data[i]
            trg = target_data[i]

            output = model(src, trg, src.size(0), trg.size(0))

            current_loss = 0
            for (s, t) in zip(output["decoder_output"], trg): 
                current_loss += criterion(s, t)

            total_valid_loss += (current_loss.item() / trg.size(0)) # add the iteration loss


        if e % print_every == 0:
            training_loss_average = total_training_loss / (len(train_index)*print_every)
            validation_loss_average = total_valid_loss / (len(test_index)*print_every)
            print("{}/{} Epoch  -  Training Loss = {:.4f}  -  Validation Loss = {:.4f}".format(e, epochs, training_loss_average, validation_loss_average))
            total_training_loss = 0
            total_valid_loss = 0 

def train_wo_valid(source_data, target_data, model, epochs, batch_size, print_every, learning_rate):
    model.to(device)
    total_training_loss = 0
    total_valid_loss = 0
    loss = 0
    
    optimizer = torch.optim.RMSprop(model.parameters(), lr=learning_rate)
    criterion = nn.CrossEntropyLoss()

    # use cross validation
    kf = KFold(n_splits=epochs, shuffle=True)

    for e, (train_index, test_index) in enumerate(kf.split(source_data), 1):
        model.train()
        for i in range(0, len(train_index)):

            src = source_data[i]
            trg = target_data[i]

            output = model(src, trg, src.size(0), trg.size(0))

            current_loss = 0
            for (s, t) in zip(output["decoder_output"], trg): 
                current_loss += criterion(s, t)

            loss += current_loss
            total_training_loss += (current_loss.item() / trg.size(0)) # add the iteration loss

            if i % batch_size == 0 or i == (len(train_index)-1):
                loss.backward()
                optimizer.step()
                optimizer.zero_grad()
                loss = 0

        if e % print_every == 0:
            training_loss_average = total_training_loss / (len(train_index)*print_every)
            validation_loss_average = total_valid_loss / (len(test_index)*print_every)
            print("{}/{} Epoch  -  Training Loss = {:.4f}  -  Validation Loss = {:.4f}".format(e, epochs, training_loss_average, validation_loss_average))
            total_training_loss = 0
            total_valid_loss = 0 


In [6]:
SOS_token = 0
EOS_token = 1

class Vocab:
    def __init__(self):
        self.word2index = {"": SOS_token, "": EOS_token}
        self.index2word = {SOS_token: "", EOS_token: ""}
        self.words_count = len(self.word2index)

    def add_words(self, sentence):
        for word in sentence.split(" "):
            if word not in self.word2index:
                self.word2index[word] = self.words_count
                self.index2word[self.words_count] = word
                self.words_count += 1


In [7]:
import random

In [8]:
learning_rate = 0.0001
hidden_size = 500 # encoder and decoder hidden size
batch_size = 50
epochs = 100

In [9]:
knowledgebase = pd.read_excel('https://raw.githubusercontent.com/AndiAlifs/FLUENT-Chatbot-2023/main/KnowledgeBaseFilkom.xlsx', engine='openpyxl')
knowledgebase.head()

qa_paired = knowledgebase.drop(columns=knowledgebase.columns.drop(['Pertanyaan', 'Jawaban']))
qa_paired.dropna(inplace=True)

In [15]:
# data_df = loadDF('data')
# I will take only the first 5,000 Q&A to avoid CUDA out of memory error due to the large dataset
# data_df = data_df.iloc[:5000, :]
data_df = pd.DataFrame(columns=['Question', 'Answer'])
data_df['Question'] = qa_paired['Pertanyaan']
data_df['Answer'] = qa_paired['Jawaban']

In [16]:
for i in range(0, 25): # first 5 Q&A
    print("> ", data_df.iloc[i,0], "\n< ", data_df.iloc[i,1], "\n") 

>  email Fitra A. Bachtiar 
<   fitra.bachtiar[at]ub.ac.id 

>  NIK/NIP Fitra A. Bachtiar 
<  198406282019031006 

>  nama lengkap Fitra A. Bachtiar 
<  Dr.Eng. Fitra A. Bachtiar 

>  Departemen Fitra A. Bachtiar 
<  Departemen Teknik Informatika 

>  Program Studi Fitra A. Bachtiar 
<  S2 Ilmu Komputer 

>  bidang penelitian Fitra A. Bachtiar 
<  Affective Computing, Affective Engineering, Intelligent System, Data Mining, Educational Data Mining 

>  nama awal Fakultas Ilmu Komputer (FILKOM) 
<  Program Teknologi Informasi dan Ilmu Komputer (PTIIK) 

>  rujukan surat keputusan SK Dikti dibentuk PTIIK 
<   SK Dikti No.163/KEP/DIKTI/2007  

>  surat keputusan SK Rektor bentuk PTIIK 
<  Surat Keputusan Rektor Universitas Brawijaya Nomor: 516/SK/2011 

>  tanggal dibentuk PTIIK 
<  27 Oktober 2011 

>  program studi pembentuk PTIIK 
<  Teknik Perangkat Lunak dari Fakultas Teknik dan Ilmu Komputer dari Fakultas MIPA 

>  tanggal perubahan PTIIK menjadi FILKOM 
<  10 Desember 2014 

>  sura

In [17]:
data_df['Question'] = data_df['Question'].apply(prepare_text)
data_df['Answer'] = data_df['Answer'].apply(prepare_text)

In [18]:
pairs = getPairs(data_df)

In [19]:
max_src, max_trg = getMaxLen(pairs)
max_trg, max_src

(290, 13)

In [20]:
Q_vocab = Vocab()
A_vocab = Vocab()

# build vocabularies for questions "source" and answers "target"
for pair in pairs:
    Q_vocab.add_words(pair[0])
    A_vocab.add_words(pair[1])

In [21]:
source_data = [toTensor(Q_vocab, pair[0]) for pair in pairs]
target_data = [toTensor(A_vocab, pair[1]) for pair in pairs]

## Training

In [46]:
seq2seq = Seq2Seq(Q_vocab.words_count, hidden_size, A_vocab.words_count)

train_wo_valid(source_data = source_data,
      target_data = target_data,
      model = seq2seq,
      print_every = 1,
      epochs = epochs,
      learning_rate = learning_rate,
      batch_size = batch_size)


1/100 Epoch  -  Training Loss = 6.1203  -  Validation Loss = 0.0000
2/100 Epoch  -  Training Loss = 5.4104  -  Validation Loss = 0.0000
3/100 Epoch  -  Training Loss = 5.2192  -  Validation Loss = 0.0000
4/100 Epoch  -  Training Loss = 5.0602  -  Validation Loss = 0.0000
5/100 Epoch  -  Training Loss = 4.9288  -  Validation Loss = 0.0000
6/100 Epoch  -  Training Loss = 4.8133  -  Validation Loss = 0.0000
7/100 Epoch  -  Training Loss = 4.7127  -  Validation Loss = 0.0000
8/100 Epoch  -  Training Loss = 4.6099  -  Validation Loss = 0.0000
9/100 Epoch  -  Training Loss = 4.5155  -  Validation Loss = 0.0000
10/100 Epoch  -  Training Loss = 4.4316  -  Validation Loss = 0.0000
11/100 Epoch  -  Training Loss = 4.3481  -  Validation Loss = 0.0000
12/100 Epoch  -  Training Loss = 4.2624  -  Validation Loss = 0.0000
13/100 Epoch  -  Training Loss = 4.1806  -  Validation Loss = 0.0000
14/100 Epoch  -  Training Loss = 4.1004  -  Validation Loss = 0.0000
15/100 Epoch  -  Training Loss = 4.0295  - 

In [25]:
import torch

model_path = 'seq2seq.pt'

torch.save(seq2seq, model_path)

seq2seq = torch.load(model_path, map_location=torch.device('cuda'))
seq2seq.eval()

Seq2Seq(
  (encoder): Encoder(
    (embedding): Embedding(740, 500)
    (lstm): LSTM(500, 500)
  )
  (decoder): Decoder(
    (embedding): Embedding(2655, 500)
    (lstm): LSTM(500, 500)
    (fc): Linear(in_features=500, out_features=2655, bias=True)
    (softmax): LogSoftmax(dim=1)
  )
)

In [47]:
print("Type 'exit' to finish the chat.\n", "-"*30, '\n')
while (True):
    src = input("> ")
    if src.strip() == "exit":
        break
    evaluate(src, Q_vocab, A_vocab, seq2seq, max_trg)

Type 'exit' to finish the chat.
 ------------------------------ 



NameError: name 'evaluate' is not defined