In this lab, we will practice different transformer-based techniques to implement chatbots using PyTorch.

Materials are adapted from https://medium.com/geekculture/simple-chatbot-using-bert-and-pytorch-part-1-2735643e0baa.


##PyTorch:
PyTorch is a Python-based scientific computing package that can run computations on graphics processing units (GPUs). Since its initial release in 2016, it has been adopted by a growing number of researchers and has become a go-to library because it makes even very complex neural networks easy to build. It is a strong competitor to TensorFlow, especially in research work.

Some of the key highlights of PyTorch include:

**Simple interface:** It offers an easy-to-use API.

**Pythonic in nature:** This library, being Pythonic, integrates smoothly with the Python data science stack.

**Tensors:** A tensor is similar to a NumPy array, but it can also live on a GPU. To run operations on the GPU, move the tensor to a CUDA device.

**Computational graphs:** PyTorch builds dynamic computational graphs, which are constructed as the code runs.

**Autograd (automatic differentiation):** This engine records operations on tensors and computes derivatives automatically.
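
The tensor and autograd highlights above can be tried out in a few lines. This is a minimal sketch with made-up values; the names `array`, `tensor`, `x`, and `y` are chosen only for illustration and are not used elsewhere in the lab.

```python
import numpy as np
import torch

# Build a tensor from a NumPy array; from_numpy shares memory with the array.
array = np.array([[1.0, 2.0], [3.0, 4.0]], dtype=np.float32)
tensor = torch.from_numpy(array)

# Move the tensor to the GPU when one is available, otherwise stay on the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
tensor = tensor.to(device)

# Autograd: track operations on x, then differentiate y = sum(x^2) w.r.t. x.
x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y = (x ** 2).sum()
y.backward()

print(tensor.device)
print(x.grad)  # dy/dx = 2x
```

Calling `backward()` fills `x.grad` with the derivative of `y` with respect to each element of `x`, which is the same mechanism the training loop later in this lab relies on.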

##Transformer:
Google researchers introduced the transformer architecture in the paper “Attention Is All You Need”. The transformer relies on a self-attention mechanism, which is well suited to language understanding.
Consider the text “I went to the Himalayas this summer. I really enjoyed my time out there”. The last word, “there”, refers to the Himalayas, but resolving that reference requires remembering the earlier part of the text. To achieve this, the attention mechanism decides, at each position of an input sequence, which other positions are important.
The original transformer has an encoder-decoder architecture. Both the encoder and the decoder are composed of stacked modules that contain attention and feed-forward layers.
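
To make the idea of "deciding which other parts of the sequence are important" concrete, here is a minimal NumPy sketch of single-head scaled dot-product self-attention. The sizes and the random projection matrices `w_q`, `w_k`, and `w_v` are assumptions for illustration only; a real transformer learns these matrices and uses several heads.

```python
import numpy as np

def softmax(scores, axis=-1):
    # Subtract the row maximum for numerical stability before exponentiating.
    shifted = scores - scores.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over a sequence x."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    # Each row holds one token's attention weights over all tokens.
    weights = softmax(q @ k.T / np.sqrt(k.shape[-1]))
    return weights @ v, weights

rng = np.random.default_rng(0)
seq_len, d_model = 4, 8  # toy sizes, chosen only for illustration
x = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))

output, weights = self_attention(x, w_q, w_k, w_v)
print(output.shape)
print(weights.sum(axis=1))
```

Each row of `weights` sums to 1, so every output vector is a weighted average of the value vectors of all tokens in the sequence.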


In [None]:
!pip install torch
# transformers: Hugging Face library of state-of-the-art pre-trained NLP models (BERT, GPT-2, RoBERTa, etc.)
!pip install transformers
# To print the model architecture.
!pip install torchinfo


In [None]:
import numpy as np
import pandas as pd
import re
import torch
import random
import torch.nn as nn
import transformers
import matplotlib.pyplot as plt
# use the GPU when one is available, otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

##BERT (Bidirectional Encoder Representations from Transformers):
First, we will build a chatbot using BERT, a transformer-based model for natural language processing pre-training that was created and published in 2018 by Jacob Devlin and his colleagues at Google.
BERT uses bidirectional training, i.e., it reads a sentence from both directions to understand the context of each word.
Note that BERT uses only the encoder component of the transformer. It does not have a decoder.

##The Data
As a first step, we need to set up an intents JSON file that defines the intentions of the chatbot user.
For example, a user may wish to know the name of our chatbot; therefore, we have created an intent called name.
In this chatbot, we use 5 intents: name, help, hobby, greeting, and goodbye. The training set contains utterances belonging to each of these intents. When the user enters any input, the bot recognizes its intent.
Within this intents JSON file, each intent tag comes with a list of responses. Once the intent is recognized, the chatbot selects a response at random from the static set associated with that intent.

In [None]:
# used a dictionary to represent an intents JSON file
intents = {"intents": [
{"tag": "greeting",
 "responses": ["Howdy Partner!", "Hello", "How are you doing?",   "Greetings!", "How do you do?"]},
{"tag": "hobby",
 "responses": ["I'm working on it.", "I should get one. It's all work and no play lately."]},
{"tag": "help",
 "responses": ["Sure. I'd be happy to. What's up?", "I'm glad to help. What can I do for you?"]},
{"tag": "name",
 "responses": ["My name is James", "I'm James", "James"]},
{"tag": "goodbye",
 "responses": ["It was nice speaking to you", "See you later", "Speak soon!"]}
]}

In [None]:
# We have prepared an intent dataset with 5 labels; download the training file
!wget https://raw.github.com/JinfenLi/teaching_material/master/chatbot_intent.xlsx


--2022-03-07 00:28:51--  https://raw.github.com/JinfenLi/teaching_material/master/chatbot_intent.xlsx
Resolving raw.github.com (raw.github.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.github.com (raw.github.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://raw.githubusercontent.com/JinfenLi/teaching_material/master/chatbot_intent.xlsx [following]
--2022-03-07 00:28:51--  https://raw.githubusercontent.com/JinfenLi/teaching_material/master/chatbot_intent.xlsx
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.108.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 13171 (13K) [application/octet-stream]
Saving to: ‘chatbot_intent.xlsx’


2022-03-07 00:28:51 (61.8 MB/s) - ‘chatbot_intent.xlsx’ saved [13171/13171]



In [None]:
# Load Dataset
df = pd.read_excel("chatbot_intent.xlsx")
print(df.head())
df["label"].value_counts()

                text    label
0              BBIAB  goodbye
1    I got to go now  goodbye
2         I gotta go  goodbye
3       I'll be back  goodbye
4  I'll be back soon  goodbye


goodbye     81
greeting    43
help        34
name        10
hobby        7
Name: label, dtype: int64

In [None]:
# Converting the labels into encodings
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
df['label'] = le.fit_transform(df['label'])
df['label']

0      0
1      0
2      0
3      0
4      0
      ..
170    3
171    3
172    3
173    3
174    3
Name: label, Length: 175, dtype: int64

In [None]:
# In this example we have used all the utterances for training purpose
train_text, train_labels = df['text'], df['label']

##Model Preparation
We will build the chatbot with bert-base-uncased as an example and leave the roberta-base model to you.

In [None]:
from transformers import AutoModel, BertTokenizerFast
# Load the BERT tokenizer
bert_tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')
# Import BERT-base pretrained model
bert_model = AutoModel.from_pretrained('bert-base-uncased')

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.weight', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [None]:
# Sample data for BERT tokenizer
text = ["this is a distil bert model.","data is oil"]
# Encode the text: pad shorter texts to the longest one in the batch and truncate any that exceed the model's maximum length
encoded_input = bert_tokenizer(text, padding=True,truncation=True, return_tensors='pt')
"""
In input_ids:
101 - [CLS] token, marks the beginning of the sequence
102 - [SEP] token, marks the end of the sequence
In token_type_ids:
0 - first part of the text
1 - second part of the text
In attention_mask:
1 - Actual token
0 - Padded token
"""
print(encoded_input)

{'input_ids': tensor([[  101,  2023,  2003,  1037,  4487, 16643,  2140, 14324,  2944,  1012,
           102],
        [  101,  2951,  2003,  3514,   102,     0,     0,     0,     0,     0,
             0]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]])}


In [None]:
# get the length (in whitespace-separated words) of every message in the train set
seq_len = [len(i.split()) for i in train_text]
# choose the max length from seq_len. Note: the max length should not exceed the model's limit (512 tokens for bert-base-uncased).
# The tokenizer counts subword tokens plus [CLS] and [SEP], so the longest messages may be truncated at this length.
bert_max_seq_len = max(seq_len)
bert_max_seq_len

8

In [None]:
# tokenize and encode sequences in the training set
bert_tokens_train = bert_tokenizer(
    train_text.tolist(),
    max_length = bert_max_seq_len,
    padding='max_length',  # replaces the deprecated pad_to_max_length=True
    truncation=True,
    return_token_type_ids=False
)
bert_tokens_train



{'input_ids': [[101, 22861, 2401, 2497, 102, 0, 0, 0], [101, 1045, 2288, 2000, 2175, 2085, 102, 0], [101, 1045, 10657, 2175, 102, 0, 0, 0], [101, 1045, 1005, 2222, 2022, 2067, 102, 0], [101, 1045, 1005, 2222, 2022, 2067, 2574, 102], [101, 1045, 1005, 2222, 2131, 2067, 2000, 102], [101, 1045, 1005, 2222, 3335, 2017, 102, 0], [101, 2009, 1005, 1055, 2042, 3835, 2000, 102], [101, 27133, 2891, 102, 0, 0, 0, 0], [101, 10303, 9061, 102, 0, 0, 0, 0], [101, 10303, 2204, 2305, 102, 0, 0, 0], [101, 9120, 1996, 11834, 102, 0, 0, 0], [101, 2004, 2696, 2474, 13005, 102, 0, 0], [101, 2067, 1999, 1037, 2978, 102, 0, 0], [101, 2022, 2067, 1999, 1019, 2781, 102, 0], [101, 2022, 2067, 1999, 1037, 2261, 102, 0], [101, 9061, 102, 0, 0, 0, 0, 0], [101, 9061, 9061, 102, 0, 0, 0, 0], [101, 9061, 9061, 2156, 2017, 102, 0, 0], [101, 9061, 9061, 2156, 2017, 2574, 102, 0], [101, 9061, 9061, 2202, 2729, 102, 0, 0], [101, 9061, 2005, 2085, 102, 0, 0, 0], [101, 9061, 2204, 2305, 102, 0, 0, 0], [101, 9061, 1011, 906

In [None]:
# convert the integer sequences to tensors
bert_train_seq = torch.tensor(bert_tokens_train['input_ids'])
bert_train_mask = torch.tensor(bert_tokens_train['attention_mask'])
train_y = torch.tensor(train_labels.tolist())

In [None]:
from torch.utils.data import TensorDataset, DataLoader, RandomSampler
#define a batch size
batch_size = 16
# wrap tensors
bert_train_data = TensorDataset(bert_train_seq, bert_train_mask, train_y)
# sampler for sampling the data during training
bert_train_sampler = RandomSampler(bert_train_data)
# DataLoader for train set
bert_train_dataloader = DataLoader(bert_train_data, sampler=bert_train_sampler, batch_size=batch_size)

In [None]:
# build a model for intent classification (sequence classification)
class BERT_Arch(nn.Module):
    def __init__(self, bert):
        super(BERT_Arch, self).__init__()
        self.bert = bert

        # dropout layer
        self.dropout = nn.Dropout(0.2)

        # relu activation function
        self.relu = nn.ReLU()
        # dense layers: 768 is the hidden size of bert-base, 5 is the number of intents
        self.fc1 = nn.Linear(768, 512)
        self.fc2 = nn.Linear(512, 256)
        self.fc3 = nn.Linear(256, 5)
        # log-softmax activation function (pairs with nn.NLLLoss)
        self.softmax = nn.LogSoftmax(dim=1)

    # define the forward pass
    def forward(self, sent_id, mask):
        # pass the inputs to the model and keep the hidden state of the first token ([CLS] for BERT)
        cls_hs = self.bert(sent_id, attention_mask=mask)[0][:, 0]

        x = self.fc1(cls_hs)
        x = self.relu(x)
        x = self.dropout(x)

        x = self.fc2(x)
        x = self.relu(x)
        x = self.dropout(x)
        # output layer
        x = self.fc3(x)

        # apply log-softmax activation
        x = self.softmax(x)
        return x

In [None]:
# freeze all the BERT parameters so that only the classification layers on top are updated during fine-tuning
for param in bert_model.parameters():
    param.requires_grad = False
bert_seq_model = BERT_Arch(bert_model)
# push the model to GPU if GPU is available, otherwise CPU
bert_seq_model = bert_seq_model.to(device)
from torchinfo import summary
summary(bert_seq_model)

Layer (type:depth-idx)                                  Param #
BERT_Arch                                               --
├─BertModel: 1-1                                        --
│    └─BertEmbeddings: 2-1                              --
│    │    └─Embedding: 3-1                              (23,440,896)
│    │    └─Embedding: 3-2                              (393,216)
│    │    └─Embedding: 3-3                              (1,536)
│    │    └─LayerNorm: 3-4                              (1,536)
│    │    └─Dropout: 3-5                                --
│    └─BertEncoder: 2-2                                 --
│    │    └─ModuleList: 3-6                             (85,054,464)
│    └─BertPooler: 2-3                                  --
│    │    └─Linear: 3-7                                 (590,592)
│    │    └─Tanh: 3-8                                   --
├─Dropout: 1-2                                          --
├─ReLU: 1-3                                             --
├─Linea

##Finetune Model
###Optimizer
The optimizer uses the gradients computed during backpropagation to update the model parameters so that the loss decreases.

In [None]:
# use AdamW from PyTorch; the AdamW class in transformers is deprecated
from torch.optim import AdamW
# define the optimizer
bert_optimizer = AdamW(bert_seq_model.parameters(), lr = 1e-3)



###Find Class Weights

In [None]:
from sklearn.utils.class_weight import compute_class_weight
#compute the class weights
class_wts = compute_class_weight(
                                        class_weight = "balanced",
                                        classes = np.unique(train_labels),
                                        y = train_labels                                                    
                                    )
class_wts

array([0.43209877, 0.81395349, 1.02941176, 5.        , 3.5       ])
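
The "balanced" weights above can be reproduced by hand, which shows what they mean: each class weight is `n_samples / (n_classes * count)`, so rare intents receive larger weights. The sketch below assumes the label order produced by `LabelEncoder` (alphabetical) and copies the counts from the `value_counts()` output earlier in the lab; `counts` and `manual_wts` are names introduced only for this illustration.

```python
import numpy as np

# Utterances per encoded label (0=goodbye, 1=greeting, 2=help, 3=hobby, 4=name),
# taken from the value_counts() output earlier in the lab.
counts = np.array([81, 43, 34, 7, 10])

n_samples = counts.sum()
n_classes = len(counts)

# "balanced" weighting: n_samples / (n_classes * count of each class)
manual_wts = n_samples / (n_classes * counts)
print(manual_wts)
```

With 175 utterances and 5 intents, the 7 hobby utterances get a weight of 175 / (5 * 7) = 5.0, while the 81 goodbye utterances get about 0.43.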

###Balancing the weights while calculating the error

In [None]:
# convert class weights to tensor
weights= torch.tensor(class_wts,dtype=torch.float)
weights = weights.to(device)
# loss function
cross_entropy = nn.NLLLoss(weight=weights) 
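
The loss is named `cross_entropy` even though it is an `nn.NLLLoss`. That works because the model already applies `LogSoftmax`, and log-softmax followed by negative log-likelihood is the cross-entropy. A small sketch on made-up logits (the batch, the labels, and the rounded weights are assumptions for illustration) checks that both routes give the same weighted loss:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
logits = torch.randn(4, 5)  # toy batch: 4 utterances, 5 intents
labels = torch.tensor([0, 3, 4, 1])
# class weights rounded from the class_wts output shown earlier
weights = torch.tensor([0.4321, 0.8140, 1.0294, 5.0, 3.5])

# Route used in this lab: LogSoftmax inside the model, then NLLLoss.
log_probs = nn.LogSoftmax(dim=1)(logits)
nll = nn.NLLLoss(weight=weights)(log_probs, labels)

# Equivalent single step: CrossEntropyLoss applied to the raw logits.
ce = nn.CrossEntropyLoss(weight=weights)(logits, labels)

print(nll.item(), ce.item())
```

If a model returns raw logits instead of log-probabilities, `nn.CrossEntropyLoss` is the matching choice; applying it on top of `LogSoftmax` outputs would normalize twice.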

###Setting up the epochs

In [None]:
from torch.optim import lr_scheduler
# empty list to store the training loss of each epoch
bert_train_losses=[]
# number of training epochs
epochs = 10
# A learning rate scheduler can also be used to achieve better results
bert_lr_sch = lr_scheduler.StepLR(bert_optimizer, step_size=100, gamma=0.1)
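
To see what `StepLR(step_size=100, gamma=0.1)` would do if it were stepped once per batch, the sketch below drives the scheduler with a dummy parameter. It assumes 110 optimizer steps (10 epochs of 11 batches, which is what 175 utterances with a batch size of 16 give) and uses plain SGD only to keep the example small; none of these names come from the lab's training code.

```python
import torch
from torch.optim import SGD, lr_scheduler

# One dummy parameter stands in for a model; SGD keeps the sketch minimal.
param = torch.nn.Parameter(torch.zeros(1))
optimizer = SGD([param], lr=1e-3)
scheduler = lr_scheduler.StepLR(optimizer, step_size=100, gamma=0.1)

lrs = []
for step in range(110):  # 10 epochs x 11 batches of 16 from 175 rows
    optimizer.step()     # would follow loss.backward() in real training
    scheduler.step()
    lrs.append(scheduler.get_last_lr()[0])

print(lrs[0], lrs[98], lrs[99], lrs[-1])
```

Under these assumptions the learning rate stays at 1e-3 for the first 99 scheduler steps and drops to 1e-4 from the 100th step onward, so the decay would only affect the last few batches of training.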

###Fine-Tune the model

In [None]:
# function to train the model
def train(model, train_dataloader, optimizer, lr_sch):
  
  model.train()
  total_loss = 0
  
  # empty list to save model predictions
  total_preds=[]
  
  # iterate over batches
  for step,batch in enumerate(train_dataloader):

    # progress update after every 50 batches.
    if step % 50 == 0 and not step == 0:
      print('  Batch {:>5,}  of  {:>5,}.'.format(step, len(train_dataloader)))
    # push the batch to gpu
    batch = [r.to(device) for r in batch] 
    sent_id, mask, labels = batch
    # get model predictions for the current batch
    preds = model(sent_id, mask)
    # compute the loss between actual and predicted values
    loss = cross_entropy(preds, labels)
    print(loss)
    # add on to the total loss
    total_loss = total_loss + loss.item()
    # backward pass to calculate the gradients
    loss.backward()
    # clip the gradient norm to 1.0. It helps prevent the exploding gradient problem
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
    # update parameters
    optimizer.step()
    # clear calculated gradients
    optimizer.zero_grad()
  
    # the learning rate scheduler step is disabled here; uncomment the next line to enable it
    # lr_sch.step()
    # model predictions are stored on GPU. So, push it to CPU
    preds=preds.detach().cpu().numpy()
    # append the model predictions
    total_preds.append(preds)
  # compute the training loss of the epoch (use the dataloader passed in, not a global one)
  avg_loss = total_loss / len(train_dataloader)
    
  # predictions are in the form of (no. of batches, size of batch, no. of classes).
  # reshape the predictions in form of (number of samples, no. of classes)
  total_preds  = np.concatenate(total_preds, axis=0)
  #returns the loss and predictions
  return avg_loss, total_preds

###Start Model Training

In [None]:
# make cuDNN deterministic so the experiment is more reproducible; for full reproducibility also fix the random seeds (e.g. torch.manual_seed)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
for epoch in range(epochs):

    print('\n Epoch {:} / {:}'.format(epoch + 1, epochs))

    #train model
    train_loss, _ = train(bert_seq_model, bert_train_dataloader, bert_optimizer, bert_lr_sch)

    # append the training loss of this epoch
    bert_train_losses.append(train_loss)
print(f'\nTraining Loss: {train_loss:.3f}')


 Epoch 1 / 10
tensor(1.6242, grad_fn=<NllLossBackward0>)
tensor(1.6601, grad_fn=<NllLossBackward0>)
tensor(1.5640, grad_fn=<NllLossBackward0>)
tensor(1.5766, grad_fn=<NllLossBackward0>)
tensor(1.5053, grad_fn=<NllLossBackward0>)
tensor(1.4827, grad_fn=<NllLossBackward0>)
tensor(1.3986, grad_fn=<NllLossBackward0>)
tensor(1.6557, grad_fn=<NllLossBackward0>)
tensor(2.0035, grad_fn=<NllLossBackward0>)
tensor(1.9824, grad_fn=<NllLossBackward0>)
tensor(1.2658, grad_fn=<NllLossBackward0>)

 Epoch 2 / 10
tensor(1.5280, grad_fn=<NllLossBackward0>)
tensor(1.6509, grad_fn=<NllLossBackward0>)
tensor(1.2097, grad_fn=<NllLossBackward0>)
tensor(1.4314, grad_fn=<NllLossBackward0>)
tensor(1.3516, grad_fn=<NllLossBackward0>)
tensor(1.4837, grad_fn=<NllLossBackward0>)
tensor(1.2350, grad_fn=<NllLossBackward0>)
tensor(1.2988, grad_fn=<NllLossBackward0>)
tensor(1.2348, grad_fn=<NllLossBackward0>)
tensor(1.2569, grad_fn=<NllLossBackward0>)
tensor(1.1698, grad_fn=<NllLossBackward0>)

 Epoch 3 / 10
tensor(1.

###Get Predictions for Test Data

In [None]:
def get_prediction(text, model, tokenizer, max_seq_len):
  # keep only letters and spaces
  text = re.sub(r'[^a-zA-Z ]+', '', text)
  test_text = [text]
  model.eval()

  tokens_test_data = tokenizer(
      test_text,
      max_length = max_seq_len,
      padding='max_length',
      truncation=True,
      return_token_type_ids=False
  )
  test_seq = torch.tensor(tokens_test_data['input_ids'])
  test_mask = torch.tensor(tokens_test_data['attention_mask'])

  # no gradients are needed for inference
  with torch.no_grad():
    preds = model(test_seq.to(device), test_mask.to(device))
  preds = preds.detach().cpu().numpy()
  # pick the class with the highest log-probability and map it back to its intent name
  preds = np.argmax(preds, axis = 1)
  intent = le.inverse_transform(preds)[0]
  print("Intent Identified: ", intent)
  return intent

def get_response(message, model, tokenizer, max_seq_len):
  intent = get_prediction(message, model, tokenizer, max_seq_len)
  # fallback in case the predicted intent has no entry in the intents dictionary
  result = ""
  for i in intents['intents']:
    if i["tag"] == intent:
      result = random.choice(i["responses"])
      break
  print(f"Response : {result}")
  return "Intent: "+ intent + '\n' + "Response: " + result

###Let's test the model now:

In [None]:
get_response("why dont you introduce yourself", bert_seq_model, bert_tokenizer, bert_max_seq_len)

Intent Identified:  name
Response : I'm James




"Intent: name\nResponse: I'm James"

## Get Your Hands Dirty!
Now it is your turn to build the roberta-base model.

###Roberta
RoBERTa is part of Facebook’s ongoing commitment to advancing the state-of-the-art in self-supervised systems that can be developed with less reliance on time- and resource-intensive data labeling.
The authors of RoBERTa suggest that BERT is significantly undertrained, and they propose the following improvements:
*   More training data (160 GB vs. 16 GB for BERT).
*   A dynamic masking pattern instead of a static masking pattern.
*   Training on full sentences without the next sentence prediction (NSP) objective.
*   Training on longer sequences.

Adapted from https://medium.com/analytics-vidhya/evolving-with-bert-introduction-to-roberta-5174ec0e7c82.




In [None]:
# import the Roberta model and Roberta tokenizer
roberta_model = 
roberta_tokenizer = 

In [None]:
# transformers provides a pretrained model with a sequence classification head, so you don't need to implement your own BERT_Arch
# load that sequence classification model here
roberta_seq_model = 
# push the model to GPU if GPU is available, otherwise CPU
roberta_seq_model = 


In [None]:
# tokenize and encode sequences in the train_text, try to set max_seq_len as 20 this time
roberta_max_seq_len = 
roberta_tokens_train = 

In [None]:
# convert the integer sequences to tensors.
roberta_train_seq = 
roberta_train_mask = 

In [None]:
# Now, try to create dataloaders for the train_data and use sequential sampler this time
from torch.utils.data import SequentialSampler
#define a batch size
batch_size = 16
# wrap tensors
roberta_train_data = 
# sampler for sampling the data during training
roberta_train_sampler = 
# DataLoader for train set
roberta_train_dataloader = 

What are the differences between RandomSampler and SequentialSampler? Please provide your rationale here.

In [None]:
rationale = input()

In [None]:
# try to use Adafactor and AdafactorSchedule this time
from transformers.optimization import Adafactor, AdafactorSchedule
roberta_optimizer = 
roberta_lr_sch = 

In [None]:
# empty list to store the training loss of each epoch
roberta_train_losses = []
# number of training epochs
roberta_epochs = 10

In [None]:
# start Roberta model training
# make cuDNN deterministic so the experiment is more reproducible; for full reproducibility also fix the random seeds (e.g. torch.manual_seed)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
for epoch in range(roberta_epochs):

    print('\n Epoch {:} / {:}'.format(epoch + 1, roberta_epochs))

    #train model
    train_loss, _ = 

    # append the training loss of this epoch
    roberta_train_losses.append(train_loss)
print(f'\nTraining Loss: {train_loss:.3f}')

In [None]:
# Get Predictions for Test Data
get_response("Goodbye", roberta_seq_model, roberta_tokenizer, roberta_max_seq_len)

##GPT-2
GPT-2 is a transformer model pretrained on a very large corpus of English data in a self-supervised fashion. This means it was pretrained on raw text only, with no human labelling (which is why it can use lots of publicly available data); an automatic process generates the inputs and labels from the text. More precisely, it was trained to guess the next word in sentences. Please check
https://huggingface.co/gpt2?text=A+long+time+ago%2C for more.

First, we will implement a GPT-2 chatbot using the previous coding style.

In [None]:
from transformers import GPT2Tokenizer, GPT2Model
gpt_tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
# select a token to use as `pad_token`
gpt_tokenizer.pad_token = gpt_tokenizer.eos_token
gpt_model = GPT2Model.from_pretrained('gpt2')

In [None]:
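# build the sequence classification model here (the hidden size of gpt2 is also 768)
# note: BERT_Arch classifies from the hidden state of the first token; GPT-2 attends only to the left, so that state sees only the first token of the input
gpt_seq_model = BERT_Arch(gpt_model)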
# push the model to GPU if GPU is available, otherwise CPU
gpt_seq_model = gpt_seq_model.to(device)

In [None]:
text = "Replace me by any text you'd like."
# Encode the text
gpt_encoded_input = gpt_tokenizer(text, padding=True,truncation=True, return_tensors='pt')

In [None]:
# tokenize and encode sequences in the train_text, try to set max_seq_len as 10 this time
gpt_max_seq_len = 10
gpt_tokens_train = gpt_tokenizer(
    train_text.tolist(),
    max_length = gpt_max_seq_len,
    padding='max_length',  # replaces the deprecated pad_to_max_length=True
    truncation=True,
    return_token_type_ids=False
)

In [None]:
# convert the integer sequences to tensors.
gpt_train_seq = torch.tensor(gpt_tokens_train['input_ids'])
gpt_train_mask = torch.tensor(gpt_tokens_train['attention_mask'])

In [None]:
# create dataloader
#define a batch size
batch_size = 16
# wrap tensors
gpt_train_data = TensorDataset(gpt_train_seq, gpt_train_mask, train_y)
# sampler for sampling the data during training
gpt_train_sampler = RandomSampler(gpt_train_data)
# DataLoader for train set
gpt_train_dataloader = DataLoader(gpt_train_data, sampler=gpt_train_sampler, batch_size=batch_size)

In [None]:
# create an optimizer and a learning rate scheduler
gpt_optimizer = AdamW(gpt_seq_model.parameters(), lr = 1e-3)
gpt_lr_sch = lr_scheduler.StepLR(gpt_optimizer, step_size=100, gamma=0.1)

In [None]:
# empty list to store the training loss of each epoch
gpt_train_losses = []
# number of training epochs
gpt_epochs = 10

In [None]:
# start GPT model training
# make cuDNN deterministic so the experiment is more reproducible; for full reproducibility also fix the random seeds (e.g. torch.manual_seed)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
for epoch in range(gpt_epochs):

    print('\n Epoch {:} / {:}'.format(epoch + 1, gpt_epochs))

    #train model
    train_loss, _ = train(gpt_seq_model, gpt_train_dataloader, gpt_optimizer, gpt_lr_sch)

    # append the training loss of this epoch
    gpt_train_losses.append(train_loss)
print(f'\nTraining Loss: {train_loss:.3f}')

In [None]:
get_response("Goodbye", gpt_seq_model, gpt_tokenizer, gpt_max_seq_len)

There is a built-in pipeline that generates several response sequences for a single input.

In [None]:
from transformers import pipeline, set_seed
generator = pipeline('text-generation', model='gpt2')
set_seed(42)
generator("Hello, I'm a language model,", max_length=30, num_return_sequences=5)


Now, let's take a look at how to manually implement a gpt-2 chatbot given several chatting rounds.

Initialize a pretrained GPT-2 model with a causal language modeling (CLM) head. Under CLM, the model predicts the next token in a sequence and is only allowed to consider the tokens that occur to its left (unidirectional). This left-to-right nature makes CLM suitable for text generation.
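
The left-to-right restriction can be pictured as a lower-triangular attention mask. The NumPy sketch below is only an illustration with a toy sequence length and uniform scores; it is not how the GPT-2 implementation in `transformers` is written.

```python
import numpy as np

seq_len = 5  # toy sequence length, chosen only for illustration

# causal_mask[i, j] is True when position i is allowed to attend to position j,
# i.e. only when j <= i.
causal_mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))

# Uniform raw scores make the effect of the mask easy to see.
scores = np.zeros((seq_len, seq_len))
scores[~causal_mask] = -np.inf  # block attention to future positions

# Softmax over each row; masked positions end up with zero weight.
weights = np.exp(scores)
weights /= weights.sum(axis=1, keepdims=True)
print(weights.round(2))
```

The first position can only attend to itself, while the last position spreads its attention over the whole sequence, which is why the hidden state at the final token is the one that summarizes the full input in a causal model.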

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
gpt_generate_model = AutoModelForCausalLM.from_pretrained('gpt2')


##One-round Conversation
The basic idea is to acquire an input from the user, encode it, generate a response, and decode the response.

In [None]:
# Encode the user input followed by the end-of-sequence (EOS) token
user_input_ids = gpt_tokenizer.encode(input(">> You:") + gpt_tokenizer.eos_token, return_tensors='pt')
print(user_input_ids)
# Generate a response; max_length=50 caps the total length (input plus response) at 50 tokens
response_ids = gpt_generate_model.generate(user_input_ids, max_length=50, pad_token_id=gpt_tokenizer.eos_token_id)
print(response_ids)
# Print the response. response_ids also contains user_input_ids, so we only decode the chatbot response part
print("GPT: {}".format(gpt_tokenizer.decode(response_ids[0][user_input_ids.shape[-1]:], skip_special_tokens=True)))
  


##DialoGPT
DialoGPT adapts pretraining techniques to response generation using hundreds of gigabytes of colloquial data. Like GPT-2, DialoGPT is formulated as an autoregressive (AR) language model and uses a multi-layer transformer as its model architecture. Unlike GPT-2, which is trained on general text data, DialoGPT draws on 147M multi-turn dialogues extracted from Reddit discussion threads.


In [None]:
# Initialize tokenizer and model
from transformers import AutoModelForCausalLM, AutoTokenizer
dialo_tokenizer = AutoTokenizer.from_pretrained("microsoft/DialoGPT-small")
dialo_model = AutoModelForCausalLM.from_pretrained("microsoft/DialoGPT-small")

##Two-round Conversation
Here we concatenate the chat history of two rounds of conversation and generate a response from it.
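
Because the history grows with every round, it can eventually exceed the length passed to `generate`. One possible way to handle this, shown here as a sketch on made-up token ids, is to keep only the most recent tokens after concatenating. `append_turn` and `max_tokens` are hypothetical names introduced for this illustration; in practice `max_tokens` should stay below the `max_length` used for generation so that there is room left for the response.

```python
import torch

def append_turn(history_ids, new_ids, max_tokens=50):
    """Hypothetical helper: concatenate a new turn and keep the newest tokens."""
    if history_ids is None:
        combined = new_ids
    else:
        combined = torch.cat([history_ids, new_ids], dim=-1)
    # Keep only the most recent max_tokens ids along the sequence dimension.
    return combined[:, -max_tokens:]

# Made-up token ids with shape (batch=1, sequence length); not real tokenizer output.
round_one = torch.arange(0, 30).unsqueeze(0)
round_two = torch.arange(100, 130).unsqueeze(0)

history = append_turn(None, round_one)
history = append_turn(history, round_two)
print(history.shape)
```

Dropping the oldest tokens is a simple choice that loses early context; it is only one option, not necessarily what the original material intends.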

In [None]:
import torch
# number of conversation rounds
chat_round = 2
chat_history_ids = None
# Encode the user input followed by the end-of-sequence (EOS) token
user_input_ids = dialo_tokenizer.encode(input(">> You:") + dialo_tokenizer.eos_token, return_tensors='pt')
# Generate a response; max_length=50 caps the total length (history plus response) at 50 tokens
chat_history_ids = dialo_model.generate(user_input_ids, max_length=50, pad_token_id=dialo_tokenizer.eos_token_id)
# Print the response. chat_history_ids also contains user_input_ids, so we only decode the chatbot response part
print("DialoGPT: {}".format(dialo_tokenizer.decode(chat_history_ids[:, user_input_ids.shape[-1]:][0], skip_special_tokens=True)))

# second round
user_input_ids = dialo_tokenizer.encode(input(">> You:") + dialo_tokenizer.eos_token, return_tensors='pt')
bot_input_ids = torch.cat([chat_history_ids, user_input_ids], dim=-1)
chat_history_ids = dialo_model.generate(bot_input_ids, max_length=50, pad_token_id=dialo_tokenizer.eos_token_id)
print("DialoGPT: {}".format(dialo_tokenizer.decode(chat_history_ids[:, bot_input_ids.shape[-1]:][0], skip_special_tokens=True)))



###Your Turn!
Now it is your turn to build an n-round conversation with DialoGPT, where n should be larger than 4.

In [None]:


def generate_response(tokenizer, model, chat_round, chat_history_ids):
  """
    Generate a response to some user input.
  """
  # Encode user input and End-of-String (EOS) token
  new_input_ids = 

  # Append tokens to chat history
  bot_input_ids = 

  # Generate response given a maximum chat history length of 1000 tokens (DialoGPT accepts at most 1,024 positions)
  chat_history_ids = 

  # Print response
  print()
  
  # Return the chat history ids
  return chat_history_ids


def chat_for_n_rounds(n=5):
  """
  Chat with chatbot for n rounds (n = 5 by default)
  """
  # Initialize history variable
  chat_history_ids = None
  
  # Chat for n rounds
  for chat_round in range(n):
    chat_history_ids = 



In [None]:
chat_for_n_rounds(5)