# Seq2Seq Implementation with PyTorch

This notebook trains a simple chatbot to handle common questions that Amazon customer service agents answer, such as questions about order delays and refunds.

The following resources played a large role in helping me build this model:

[Seq2Seq Tutorial by Ben Trevett](https://github.com/bentrevett/pytorch-seq2seq)

[TorchText and Seq2Seq Tutorial by Adam Wearne](https://medium.com/@adam.wearne/lets-get-sentimental-with-pytorch-dcdd9e1ea4c9)

In [1]:
import os
#from google.colab import drive
#drive.mount('/content/gdrive')
#os.chdir('/content/gdrive/My Drive/Colab Notebooks')

In [2]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import numpy as np
import pandas as pd
from torchtext import data
from torchtext.data import Field, BucketIterator, TabularDataset
import random
from tqdm import tqdm
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

In [3]:
device

device(type='cuda')

### **Using Torchtext to Pre-process Data**

In [4]:
df = pd.read_csv('chatbot_data2.csv')
df.head()

Unnamed: 0,questions,answers
0,amazon fireTVstick,Fire TV Stick https://t.co/2pbG55qJ7h ET
1,3 different people have given 3 different answ...,We'd like to take a further look into this wit...
2,Way to drop the ball on customer service so pi...,I'm sorry we've let you down! Without providin...
3,I want my amazon payments account CLOSED. dm m...,I am unable to affect your account via Twitter...
4,Okay danke f r die Info,Wir haben zu danken. Sch nen Abend noch.


**Model Hyperparameters**

In [5]:
MAX_VOCAB_SIZE = 30_000
MIN_COUNT = 3
MAX_SEQUENCE_LENGTH = 15
BATCH_SIZE = 128

**Create Field object and load csv into tabular dataset**

In [6]:
from nltk.tokenize import TweetTokenizer

In [7]:
def my_tokenizer(text):
    """
    Tokenizes English text from a string into a list of strings (tokens)
    """
    tknzr = TweetTokenizer()
    return tknzr.tokenize(text)


In [8]:
TEXT = data.Field(tokenize=my_tokenizer, lower=True, include_lengths=True, init_token='<sos>', eos_token='<eos>')
# Set up the Field to tokenize with NLTK's TweetTokenizer (my_tokenizer), and include <sos> and <eos> tokens

fields = [('input_sequence', TEXT), ('output_sequence', TEXT)] #use FIELD processing on each column (both are text based)

filepath = 'chatbot_data2.csv'

table_data = data.TabularDataset(path=filepath, format='csv', fields=fields)  #turn data into torchtext's TabularDataset

**Build Vocabulary**

Some tokens appear in the dataset vocabulary but have no vector in the GloVe embedding.

By default, these tokens are assigned all-zero vectors, but training is faster if they are initialized to random values instead. Passing `unk_init=torch.Tensor.normal_` draws those vectors from a normal distribution.
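As a minimal sketch of what this initialization does, the snippet below fills an embedding table by hand. The toy vocabulary, the 4-dimensional vectors, and the `pretrained` dictionary are all hypothetical stand-ins for the real vocabulary and GloVe; only `torch.Tensor.normal_` comes from the notebook.

```python
import torch

torch.manual_seed(0)

# Hypothetical toy setup: 5 vocabulary entries, 4-dimensional vectors.
vocab = ["<unk>", "<pad>", "refund", "order", "firetvstick"]
pretrained = {"refund": torch.ones(4), "order": torch.full((4,), 2.0)}

vectors = torch.zeros(len(vocab), 4)
for i, tok in enumerate(vocab):
    if tok in pretrained:
        # Token has a pretrained vector: copy it.
        vectors[i] = pretrained[tok]
    else:
        # No pretrained vector: fill the row in place with N(0, 1) samples
        # instead of leaving it at zero.
        torch.Tensor.normal_(vectors[i])

print(vectors)
```

Rows for `refund` and `order` keep their pretrained values, while the remaining rows hold random values rather than zeros.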

In [9]:
TEXT.build_vocab(table_data,
                max_size=MAX_VOCAB_SIZE, #30_000
                min_freq=MIN_COUNT, #3 
                vectors='glove.6B.300d',
                unk_init=torch.Tensor.normal_) 

In [10]:
print(f"Unique tokens in vocabulary: {len(TEXT.vocab)}")

Unique tokens in vocabulary: 30004
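The count of 30004 is consistent with `max_size` capping only the regular tokens, with the four special tokens (`<unk>`, `<pad>`, `<sos>`, `<eos>`) added on top. The sketch below illustrates that counting rule in plain Python; `build_vocab` here is a hypothetical helper written for illustration, not torchtext's implementation.

```python
from collections import Counter

def build_vocab(tokenized_texts, max_size, min_freq,
                specials=("<unk>", "<pad>", "<sos>", "<eos>")):
    """Keep the most frequent tokens that meet min_freq, then prepend specials."""
    counts = Counter(tok for text in tokenized_texts for tok in text)
    # most_common() sorts by descending frequency; max_size caps regular tokens only
    kept = [tok for tok, c in counts.most_common() if c >= min_freq][:max_size]
    return list(specials) + kept

texts = [["my", "order", "is", "late"],
         ["my", "order", "arrived"],
         ["my", "refund"]]
vocab = build_vocab(texts, max_size=2, min_freq=2)
print(len(vocab), vocab)
```

With `max_size=2`, the resulting vocabulary has 2 regular tokens plus 4 specials, mirroring 30,000 + 4 above.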


**Train-Test Split and Bucket Iterator**

In [11]:
train_data, test_data = table_data.split()
train_data, valid_data = train_data.split()

In [12]:
train_data[0].__dict__.keys()
train_data[0].input_sequence[:10]

['thanks', 'for', 'this', 'is', 'there', 'anyway', 'to', 'get', 'a', 'refund']

The `BucketIterator` batches together samples of similar length so that each batch needs as little padding as possible. It also converts each batch into tensors that can be fed directly into the neural network.
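To see why grouping by length reduces padding, here is a rough sketch in plain Python. The helpers `bucket_batches` and `padding_needed` are hypothetical and much simpler than torchtext's actual bucketing logic; they only illustrate the idea.

```python
import random

def bucket_batches(sequences, batch_size, seed=0):
    """Group sequences of similar length so each batch needs little padding."""
    ordered = sorted(sequences, key=len)
    batches = [ordered[i:i + batch_size]
               for i in range(0, len(ordered), batch_size)]
    random.Random(seed).shuffle(batches)  # shuffle batch order, not contents
    return batches

def padding_needed(batches):
    """Total number of <pad> tokens required to square off every batch."""
    return sum(max(len(s) for s in b) - len(s) for b in batches for s in b)

# Toy "sentences" of varying length
toy = [["tok"] * n for n in [2, 9, 3, 8, 2, 10, 3, 9]]

naive = [toy[i:i + 4] for i in range(0, len(toy), 4)]
bucketed = bucket_batches(toy, batch_size=4)

print("naive padding:   ", padding_needed(naive))
print("bucketed padding:", padding_needed(bucketed))
```

On this toy data the bucketed batches need far fewer padding tokens than batches taken in the original order.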

In [13]:
train_iterator, valid_iterator, test_iterator = data.BucketIterator.splits(
    (train_data, valid_data, test_data), 
    batch_size = BATCH_SIZE, 
    sort_within_batch = True,
    sort_key = lambda x:len(x.input_sequence),
    device = device)

### **Encoder**

<img src="https://docs.chainer.org/en/stable/_images/seq2seq.png"  width="500" height="350">

In [14]:
class Encoder(nn.Module):
  
    def __init__(self, hidden_dims, embedding_size,
                 embedding, num_layers=2, dropout=0.0):
      
        super(Encoder, self).__init__()
        
        # Basic network params
        self.hidden_dims = hidden_dims  #dimensionality of hidden state
        self.embedding = embedding #embedding layer using GloVe
        self.embedding_size = embedding_size #dimensionality of each word embedding vector (not the vocab size)
        self.num_layers = num_layers #num stacked RNN layers
        self.dropout = dropout

        # Bidirectional GRU
        self.gru = nn.GRU(embedding_size,hidden_dims,
                          num_layers=num_layers,
                          dropout=dropout,
                          bidirectional=True)
        
    def forward(self, input_sequence, input_lengths):
        word_embeddings = self.embedding(input_sequence) #turn tokens into word embeddings
        packed_embeddings = nn.utils.rnn.pack_padded_sequence(word_embeddings, input_lengths) #pack the padded batch so the GRU skips <pad> positions
        outputs, hidden = self.gru(packed_embeddings) #run the packed sequence through the bidirectional GRU
        outputs, _ = nn.utils.rnn.pad_packed_sequence(outputs) #unpack back into a padded tensor (padded positions are zero)
        outputs = outputs[:, :, :self.hidden_dims] + outputs[:, : ,self.hidden_dims:] #sum forward and backward outputs
        return outputs, hidden
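The packing and direction-summing steps in the encoder can be checked on a toy batch. This is a minimal sketch with made-up sizes (2 sequences, embedding size 3, hidden size 4), not the notebook's real dimensions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
hidden_dims, emb_size = 4, 3
gru = nn.GRU(emb_size, hidden_dims, bidirectional=True)

# Shape (max_len=5, batch=2, emb_size); the second sequence has true length 3
x = torch.randn(5, 2, emb_size)
lengths = torch.tensor([5, 3])  # sorted in descending order, as packing expects by default

packed = nn.utils.rnn.pack_padded_sequence(x, lengths)
out, hidden = gru(packed)
out, out_lengths = nn.utils.rnn.pad_packed_sequence(out)

# Forward and backward halves are summed so the output matches hidden_dims
summed = out[:, :, :hidden_dims] + out[:, :, hidden_dims:]

print(out.shape)     # (max_len, batch, 2 * hidden_dims)
print(summed.shape)  # (max_len, batch, hidden_dims)
print(hidden.shape)  # (num_layers * 2 directions, batch, hidden_dims)
```

This is also why the iterator is built with `sort_within_batch = True`: packing with the default settings expects the sequences in each batch sorted by decreasing length.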

### **Attention Class**

In [15]:
class Attention(nn.Module):
    def __init__(self, hidden_dimensions):
        super(Attention, self).__init__() 
        self.hidden_dimensions = hidden_dimensions
        
    def dot_score(self, hidden_state, encoder_states):
        return torch.sum(hidden_state * encoder_states, dim=2) #get dot product of decoder hidden state vs. encoder states
    
    def forward(self, hidden, encoder_outputs, mask):
       
        attn_scores = self.dot_score(hidden, encoder_outputs)
        # Transpose max_length and batch_size dimensions
        attn_scores = attn_scores.t()
        # Apply mask so the network does not attend to <pad> tokens
        attn_scores = attn_scores.masked_fill(mask == 0, -1e10)
        # Return softmax over attention scores
        return F.softmax(attn_scores, dim=1).unsqueeze(1)
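The following sketch replays the attention computation on a toy batch to show the tensor shapes and the effect of the mask. The sizes and the hand-written mask are assumptions chosen for illustration.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
max_len, batch, hidden = 4, 2, 3
decoder_state = torch.randn(1, batch, hidden)          # one decoding step
encoder_states = torch.randn(max_len, batch, hidden)

# 1 marks real tokens, 0 marks <pad>; shape (batch, max_len)
mask = torch.tensor([[1, 1, 1, 1],
                     [1, 1, 0, 0]])

# Dot-product score between the decoder state and every encoder state
scores = torch.sum(decoder_state * encoder_states, dim=2).t()  # (batch, max_len)
scores = scores.masked_fill(mask == 0, -1e10)                  # hide <pad> positions
weights = F.softmax(scores, dim=1).unsqueeze(1)                # (batch, 1, max_len)

# Weighted sum of encoder states gives one context vector per example
context = weights.bmm(encoder_states.transpose(0, 1))          # (batch, 1, hidden)

print(weights)
print(context.shape)
```

Each row of weights sums to 1, and the padded positions of the second example receive effectively zero weight.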

### **Decoder**

In [16]:
class Decoder(nn.Module):
    def __init__(self, embedding, embedding_size,
                 hidden_dims, output_size, n_layers=1, dropout=0.1):
        
        super(Decoder, self).__init__()
        
        # Basic network params
        self.hidden_dims = hidden_dims
        self.output_size = output_size
        self.n_layers = n_layers
        self.dropout = dropout
        self.embedding = embedding
                
        self.gru = nn.GRU(embedding_size, hidden_dims, n_layers, 
                          dropout=dropout)
        
        self.concat = nn.Linear(hidden_dims * 2, hidden_dims)
        self.out = nn.Linear(hidden_dims, output_size)
        self.attn = Attention(hidden_dims)
        
    def forward(self, current_token, hidden_state, encoder_outputs, mask):
      
        # convert current_token to word_embedding
        embedded = self.embedding(current_token) #turn token into word embedding
        
        # Pass through GRU
        rnn_output, hidden_state = self.gru(embedded, hidden_state) #pass word embedding through GRU
        
        # Get attention distribution after passing through softmax layer
        attention_weights = self.attn(rnn_output, encoder_outputs, mask) 

        # Matrix multiply attention weights vs. encoder outputs to get the context vector
        context = attention_weights.bmm(encoder_outputs.transpose(0, 1)) #shape: (batch, 1, hidden_dims)
        
        # Concatenate context vector and GRU output
        rnn_output = rnn_output.squeeze(0)
        context = context.squeeze(1)
        concat_input = torch.cat((rnn_output, context), 1)
        concat_output = torch.tanh(self.concat(concat_input))
        
        # Pass concat_output to final output layer
        output = self.out(concat_output)
        
        # Return output and final hidden state
        return output, hidden_state

In [17]:
class seq2seq(nn.Module):
    def __init__(self, embedding_size, hidden_dims, vocab_size, 
                 device, pad_token, eos_token, sos_token, teacher_forcing_ratio=0.5):
        super(seq2seq, self).__init__()
        
        # Embedding layer shared by encoder and decoder
        self.embedding = nn.Embedding(vocab_size, embedding_size)
        
        # Initialize Encoder network
        self.encoder = Encoder(hidden_dims, 
                               embedding_size, 
                               self.embedding,
                              num_layers=2,
                              dropout=0.5)
        
        # Initialize Decoder network        
        self.decoder = Decoder(self.embedding,
                               embedding_size,
                              hidden_dims,
                              vocab_size,
                              n_layers=2,
                              dropout=0.5)
        
        
        # Indices of special tokens and hardware device 
        self.pad_token = pad_token
        self.eos_token = eos_token
        self.sos_token = sos_token
        self.device = device
        
    def create_mask(self, input_sequence):
        return (input_sequence != self.pad_token).permute(1, 0)
        
        
    def forward(self, input_sequence, output_sequence, teacher_forcing_ratio=0.5):
      
        # Unpack input_sequence tuple
        input_tokens = input_sequence[0]
        input_lengths = input_sequence[1]
      
        # Unpack output_tokens, or create an empty tensor for text generation
        if output_sequence is None:
            inference = True
            output_tokens = torch.zeros((100, input_tokens.shape[1])).long().fill_(self.sos_token).to(self.device)
        else:
            inference = False
            output_tokens = output_sequence[0]

        vocab_size = self.decoder.output_size
        batch_size = len(input_lengths)
        max_seq_len = len(output_tokens)
        
        
        
        #tensor to store decoder outputs
        outputs = torch.zeros(max_seq_len, batch_size, vocab_size).to(self.device)
        
        
        # Pass through the first half of the network
        encoder_outputs, hidden = self.encoder(input_tokens, input_lengths)
        
        # Ensure dim of hidden_state can be fed into Decoder
        hidden =  hidden[:self.decoder.n_layers]
        
        #first input to the decoder is the <sos> tokens
        output = output_tokens[0,:]
        
        # Create mask
        mask = self.create_mask(input_tokens)
        
        
        # Step through the length of the output sequence one token at a time
        # Teacher forcing is used to assist training
        for t in range(1, max_seq_len):
            output = output.unsqueeze(0)
            
            output, hidden = self.decoder(output, hidden, encoder_outputs, mask)
            outputs[t] = output
            teacher_force = random.random() < teacher_forcing_ratio
            top1 = output.max(1)[1]
            output = (output_tokens[t] if teacher_force else top1)
            
            # If we're in inference mode, keep generating until we produce an
            # <eos> token
            if inference and output.item() == self.eos_token:
                return outputs[:t]
        return outputs
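The teacher-forcing decision inside the decoding loop can be isolated as a small function. This is a sketch only: `next_decoder_input` is a hypothetical helper, and string tokens stand in for the tensor indices used in the model.

```python
import random

def next_decoder_input(target_token, predicted_token,
                       teacher_forcing_ratio, rng=random):
    """Return the ground-truth token with probability teacher_forcing_ratio,
    otherwise the model's own prediction."""
    teacher_force = rng.random() < teacher_forcing_ratio
    return target_token if teacher_force else predicted_token

rng = random.Random(0)
picks = [next_decoder_input("truth", "guess", 0.5, rng) for _ in range(1000)]
print(picks.count("truth"), "of 1000 steps used the ground-truth token")

# A ratio of 0 never uses the target, which is what inference relies on
print(next_decoder_input("truth", "guess", 0.0, rng))
```

A ratio of 0.5 feeds the ground-truth token on roughly half of the steps, and a ratio of 0 always feeds the model's own prediction back in.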

In [1]:
pad_token = TEXT.vocab.stoi['<pad>']
eos_token = TEXT.vocab.stoi['<eos>']
sos_token = TEXT.vocab.stoi['<sos>']

embedding_dim = 300
hidden_dim = 512
vocab_size = len(TEXT.vocab)

model = seq2seq(embedding_dim,
                 hidden_dim, 
                 vocab_size, 
                 device, pad_token, eos_token, sos_token).to(device)


NameError: name 'TEXT' is not defined

In [None]:
pretrained_embeddings = TEXT.vocab.vectors
model.embedding.weight.data.copy_(pretrained_embeddings)
UNK_token = TEXT.vocab.stoi[TEXT.unk_token]
model.embedding.weight.data[UNK_token] = torch.zeros(embedding_dim)
model.embedding.weight.data[pad_token] = torch.zeros(embedding_dim)
model.embedding.weight.requires_grad = False

optimizer = optim.Adam([param for param in model.parameters() if param.requires_grad == True]
                       , lr=1.0e-3)

criterion = nn.CrossEntropyLoss(ignore_index = pad_token)
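Setting `ignore_index` means padded target positions contribute nothing to the loss. A minimal check, using a made-up padding index and hand-written logits:

```python
import torch
import torch.nn as nn

pad_idx = 1  # hypothetical padding index for this toy example

logits = torch.tensor([[2.0, 0.0, 0.0],
                       [0.0, 0.0, 2.0],
                       [0.0, 5.0, 0.0]])
targets = torch.tensor([0, 2, pad_idx])  # last position is padding

# Loss that skips the padded position
masked = nn.CrossEntropyLoss(ignore_index=pad_idx)(logits, targets)

# Same loss computed only over the two real positions
manual = nn.CrossEntropyLoss()(logits[:2], targets[:2])

print(masked.item(), manual.item())
```

The two values agree, so the average is taken over the non-padded positions only.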

In [19]:
def evaluate(model, iterator, criterion, optimizer, clip=1.0):
   # Put the model in evaluation mode!
    model.eval()
    
    epoch_loss = 0
    
    with torch.no_grad():
    
        for index, batch in tqdm(enumerate(iterator), total=len(iterator)):

            input_sequence = batch.input_sequence
            output_sequence = batch.output_sequence
            if index == 0:
                print(input_sequence)

            target_tokens = output_sequence[0]

            # Run the batch through our model
            output = model(input_sequence, output_sequence)

            # Throw it through our loss function
            output = output[1:].view(-1, output.shape[-1])
            target_tokens = target_tokens[1:].view(-1)

            loss = criterion(output, target_tokens)

            epoch_loss += loss.item()
        
    return epoch_loss / len(iterator)

In [20]:
def train(model, iterator, criterion, optimizer, clip=1.0):
   # Put the model in training mode!
    model.train()
    
    epoch_loss = 0
    
    for index, batch in tqdm(enumerate(iterator), total=len(iterator)):
        
        input_sequence = batch.input_sequence
        output_sequence = batch.output_sequence
        
        target_tokens = output_sequence[0]
        
        # zero out the gradient for the current batch
        optimizer.zero_grad()
        
        # Run the batch through our model
        output = model(input_sequence, output_sequence)
        
        # Throw it through our loss function
        output = output[1:].view(-1, output.shape[-1])
        target_tokens = target_tokens[1:].view(-1)
        
        loss = criterion(output, target_tokens)
        
        # Perform back-prop and calculate the gradient of our loss function
        loss.backward()
          
        # Clip the gradient if necessary.          
        torch.nn.utils.clip_grad_norm_(model.parameters(), clip)
        
        # Update model parameters
        optimizer.step()
        
        epoch_loss += loss.item()
        
    return epoch_loss / len(iterator)
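Gradient-norm clipping, as used in `train`, rescales all gradients when their combined L2 norm exceeds the threshold. The sketch below sets a gradient by hand on a toy layer so the effect is easy to see; the layer and the values are illustrative assumptions.

```python
import torch
import torch.nn as nn

layer = nn.Linear(3, 1, bias=False)
# Hand-set gradient with L2 norm sqrt(3^2 + 4^2) = 5
layer.weight.grad = torch.tensor([[3.0, 4.0, 0.0]])

# Returns the total norm measured before clipping
norm_before = torch.nn.utils.clip_grad_norm_(layer.parameters(), max_norm=1.0)
norm_after = layer.weight.grad.norm()

print(norm_before.item(), norm_after.item())
```

The gradient keeps its direction but is scaled down to roughly the maximum norm. Gradients whose norm is already below the threshold are left unchanged.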

In [21]:
import time

In [22]:
def epoch_time(start_time, end_time):
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
    return elapsed_mins, elapsed_secs

In [23]:
print('WORKING')

WORKING


In [None]:
N_EPOCHS = 100
CLIP = 50.0
best_valid_loss = float('inf')
best_train_loss = float('inf')

for epoch in range(N_EPOCHS):
    start_time = time.time()
    train_loss = train(model, train_iterator, criterion, optimizer, CLIP)
    valid_loss = evaluate(model, valid_iterator, criterion, optimizer, CLIP)
    end_time = time.time()

    epoch_mins, epoch_secs = epoch_time(start_time, end_time)

    print(f'Epoch: {epoch+1:02} | Time: {epoch_mins}m {epoch_secs}s')
    print(f'\tTrain Loss: {train_loss:.3f}')
    print(f'\t Val. Loss: {valid_loss:.3f}')

    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), 'best-second-model.pt') 
        with open('epoch_curr.txt', 'w') as file:
            file.write(str(epoch))
        
        
    if train_loss < best_train_loss:
        best_train_loss = train_loss
        torch.save(model.state_dict(), 'best-train-model.pt') 
        with open('epoch_curr_train.txt', 'w') as file:
            file.write(str(epoch))
    
    

In [24]:
model.load_state_dict(torch.load('best-model.pt'))

<All keys matched successfully>

In [None]:
file = open('epoch_curr.txt', 'r')
file.read()
file.close()
    

In [None]:
with open('test.txt', 'w') as file:
    file.write(str(50))

In [None]:
with open('test.txt', 'r') as file:
    print(file.read())

In [None]:
print(f'\t Val. Loss: {valid_loss:.3f}')

In [None]:
num=90
file = open('epoch_curr.txt', 'w')
file.write(str(num))
file.close()

In [None]:
file = open('epoch_curr.txt', 'r')
file.read()

In [25]:
import spacy
nlp = spacy.load('en_core_web_sm')
def translate_sentence(model, sentence, nlp):
    model.eval()
    
    tokenized = nlp(sentence) 
    
    tokenized = ['<sos>'] + [t.lower_ for t in tokenized] + ['<eos>']
    numericalized = [TEXT.vocab.stoi[t] for t in tokenized] 
    
    sentence_length = torch.LongTensor([len(numericalized)]).to(model.device) 
    tensor = torch.LongTensor(numericalized).unsqueeze(1).to(model.device) 
    
    translation_tensor_logits = model((tensor, sentence_length), None, 0) 
    
    #print(len(sentence))
    #print("SHAPE:", translation_tensor_logits.squeeze(1).shape)

    translation_tensor = torch.argmax(translation_tensor_logits.squeeze(1), 1) 
    #print(torch.argmax(translation_tensor_logits.squeeze(1), 1)
    translation = [TEXT.vocab.itos[t] for t in translation_tensor]
 
    # Start at the first index.  We don't need to return the <sos> token...
    translation = translation[1:]
    return translation, translation_tensor_logits


In [None]:
import re  # needed by my_capitalize below

first = 0

def cap(match):
    return(match.group().capitalize())

def my_capitalize(cap, s):
    p = re.compile(r'((?<=[\.\?!]\s)(\w+)|(^\w+))')
    result = p.sub(cap, s)
    return re.sub(r'\s([?.!"](?:\s|$))', r'\1', result)


while(1):
    if first==0:
        print('Chatbot: Hi, how can I help you?')
        print(' ')
    first+=1
    input_sent = input('Enter: ')
    
    if input_sent == 'q' or input_sent == 'quit': break

    elif input_sent == 'hello':
        print('Chatbot: Hello!')
    else:
        response, logits = translate_sentence(model, input_sent, nlp)
        print(' ')
        temp = " ".join(response)
        print("Chatbot: " + my_capitalize(cap, temp))
        print(' ')

Chatbot: Hi, how can I help you?
 
Enter: My package was stolen
 
Chatbot: Oh no! I'm sorry for the trouble! ! Please reach us here : https://t.co/haplpmlfhn so we can look into this with you.
 
Enter: my order has been delayed for a week!
 
Chatbot: I'm sorry for the delay! What does the tracking show for the order? You can check here : https://t.co/y5jpi9grhe
 
Enter: it still says on the way!
 
Chatbot: Thanks for confirming. Please reach out to us here so we can look into this with you : https://t.co/haplpmlfhn
 
Enter: ok thank you!
 
Chatbot: You're welcome! Let us know if you need any other questions or concerns.
 


In [38]:
s = "i'm sorry for the delay ! what does the tracking show for the order?"
punctuation = ['!', '?', '.']

sentences = sent_tokenizer.tokenize(s)
sentences = [sent.capitalize() for sent in sentences]
pprint(sentences)

NameError: name 'sent_tokenizer' is not defined

In [49]:
import re
p = re.compile(r'(?<=[\.\?!]\s)(\w+)')

s = "i'm sorry for the delay ! what does the tracking show for the order ? you can check here : https://t.co/y5jpi9grhe"
               
def cap(match):
    return(match.group().capitalize())

def my_capitalize(cap, s):
    p = re.compile(r'((?<=[\.\?!]\s)(\w+)|(^\w+))')
    return p.sub(cap, s)

my_capitalize(cap,s)

"I'm sorry for the delay ! What does the tracking show for the order ? You can check here : https://t.co/y5jpi9grhe"

In [None]:
n = 291

print(df.sample(10, random_state=n)['questions'].iloc[1])
print(df.sample(10, random_state=n)['answers'].iloc[1])

In [None]:
df = pd.read_csv('chatbot_data2.csv')
df.head()

In [None]:
df.head()['questions'][1]

In [None]:
s = 'im so sorry for this!'

def check_sorry(s):
    if 'sorry' in str(s):
        return 1
    else:
        return 0 

check_sorry(s)

In [None]:
df['sorrys'] = df['answers'].apply(check_sorry)


In [None]:
df['sorrys'].value_counts()