# **QUESTION ANSWERING MODEL USING GPT-2**

I'll be using the transformers library by Hugging Face to access GPT-2.
The model used in the code is GPT2LMHeadModel.

I decided to fine-tune GPT-2 as a question answering model.



In [None]:
!pip install transformers

Collecting transformers
  Downloading transformers-4.31.0-py3-none-any.whl (7.4 MB)
Collecting huggingface-hub<1.0,>=0.14.1 (from transformers)
  Downloading huggingface_hub-0.16.4-py3-none-any.whl (268 kB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1 (from transformers)
  Downloading tokenizers-0.13.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.8 MB)
Collecting safetensors>=0.3.1 (from transformers)
  Downloading safetensors-0.3.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)

AutoTokenizer automatically loads the tokenizer class that matches the given checkpoint (here, the GPT-2 tokenizer).

## **Tokenizing**

In [None]:
import torch
from transformers import AutoTokenizer, GPT2LMHeadModel

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")


I am using a T4 GPU as my runtime type to execute the code faster.

In [None]:
device = "cuda" if torch.cuda.is_available() else "cpu"


## **PREPROCESSING DATA**

My dataset was a tab-separated .txt file, so I had to preprocess it using pandas.
I am using https://www.kaggle.com/datasets/saurabhprajapat/chatbot-training-dataset?resource=download as my dataset.

In [None]:
import pandas as pd

file_path = '/content/drive/MyDrive/Colab Notebooks/Final_Project/chatbot dataset.txt'

df = pd.read_csv(file_path, sep="\t", lineterminator="\n", header=None)

df.head()


Unnamed: 0,0,1
0,What are your interests,I am interested in all kinds of things. We can...
1,What are your favorite subjects,"My favorite subjects include robotics, compute..."
2,What are your interests,"I am interested in a wide variety of topics, a..."
3,What is your number,I don't have any number
4,What is your number,23 skiddoo!


In [None]:
Q = df[0]

A = df[1]

"<startofstring> "+Q[0]+" <bot>: "+A[0]+" <endofstring>"


'<startofstring> What are your interests <bot>: I am interested in all kinds of things. We can talk about anything! <endofstring>'

In [None]:
X = []
for i in range(len(A)):
  X.append( "<startofstring> "+Q[i]+" <bot>: "+A[i]+" <endofstring>")


I added special tokens, then tokenized every example, truncating or padding each one to a fixed length of 40 tokens.

In [None]:
tokenizer.add_special_tokens({"pad_token": "<pad>",
                                "bos_token": "<startofstring>",
                                "eos_token": "<endofstring>"})
tokenizer.add_tokens(["<bot>:"])

X_encoded = tokenizer(X,max_length=40, truncation=True, padding="max_length", return_tensors="pt")
input_ids = X_encoded['input_ids']
attention_mask = X_encoded['attention_mask']
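
To make the effect of `max_length=40`, `truncation=True`, and `padding="max_length"` more concrete, here is a dependency-free sketch of what happens to a single sequence of token IDs. The helper `pad_or_truncate` and the IDs are hypothetical, and right-side padding is an assumption; the real tokenizer handles all of this internally.

```python
def pad_or_truncate(ids, max_length, pad_id):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    ids = list(ids)[:max_length]                   # truncation: drop tokens past max_length
    n_pad = max_length - len(ids)                  # number of pad tokens still needed
    attention_mask = [1] * len(ids) + [0] * n_pad  # 1 = real token, 0 = padding
    input_ids = ids + [pad_id] * n_pad             # pad on the right (assumed)
    return input_ids, attention_mask


# Hypothetical token ids; 50257 only stands in for the id of "<pad>".
short_ids, short_mask = pad_or_truncate([11, 22, 33], max_length=5, pad_id=50257)
long_ids, long_mask = pad_or_truncate(range(10), max_length=5, pad_id=50257)
print(short_ids, short_mask)
print(long_ids, long_mask)
```

One consequence worth keeping in mind: an example longer than the limit is cut from the right, so it loses its trailing `<endofstring>` token.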


These are some more modules that are used in subsequent cells.

In [None]:
from torch.optim import Adam
from torch.utils.data import DataLoader, Dataset
import tqdm


## **HYPERPARAMETERS**

In [None]:
#Defining Hyperparameters
batch_size = 64
epochs = 12
lr = 1e-3

## **TRAIN FUNCTION**

This is the function that trains the model. I am using the Adam optimizer, and the model's weights are saved after every epoch.

In [None]:
def train(dataloader, model, optim):
    for i in tqdm.tqdm(range(epochs)):
        for batch_input_ids, batch_attention_mask in dataloader:
            batch_input_ids = batch_input_ids.to(device)
            batch_attention_mask = batch_attention_mask.to(device)

            optim.zero_grad()
            loss = model(batch_input_ids, attention_mask=batch_attention_mask, labels=batch_input_ids).loss
            loss.backward()
            optim.step()
        torch.save(model.state_dict(), "model_state.pt")
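
The loop above passes `labels=batch_input_ids`, so padded positions also contribute to the loss. A common refinement, which is not part of the original notebook, is to set the label at every padded position to `-100`, the index that PyTorch's cross-entropy loss ignores by default. The sketch below shows the idea on plain Python lists; `mask_pad_labels` is a hypothetical helper. On tensors the same effect is usually obtained with something like `labels = batch_input_ids.clone()` followed by `labels[batch_attention_mask == 0] = -100`.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's CrossEntropyLoss by default


def mask_pad_labels(input_ids, attention_mask):
    """Copy input_ids, replacing every padded position with IGNORE_INDEX."""
    return [
        [tok if keep == 1 else IGNORE_INDEX for tok, keep in zip(ids, mask)]
        for ids, mask in zip(input_ids, attention_mask)
    ]


# Toy batch: two sequences padded to length 5 with a hypothetical pad id 50257.
batch_ids = [[5, 6, 7, 50257, 50257], [8, 9, 10, 11, 12]]
batch_mask = [[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]]
labels = mask_pad_labels(batch_ids, batch_mask)
print(labels)
```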

## **INFER FUNCTION**

This function wraps the input in the same template used for training, tokenizes it, and generates output tokens from the trained model. Those tokens are then decoded to get the final output.

In [None]:
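def infer(inp):
    # Training left the model in train mode; switch off dropout for generation.
    model.eval()
    # Wrap the question in the same template used for the training examples.
    inp = "<startofstring> "+inp+" <bot>: "
    inp = tokenizer(inp, return_tensors="pt")
    X = inp["input_ids"].to(device)
    a = inp["attention_mask"].to(device)
    # These parameters can be changed to improve the replies of the bot.
    # do_sample=True is needed for top_k and temperature to have any effect.
    output = model.generate(X, attention_mask=a, max_length=50, no_repeat_ngram_size=2,
                            do_sample=True, top_k=60, temperature=0.8)
    output = tokenizer.decode(output[0], skip_special_tokens=True)
    return output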
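
Because `<bot>:` was added as a regular token rather than a special one, `skip_special_tokens=True` leaves it in the decoded string, so `infer` returns the question followed by the reply. If only the reply is wanted, the text after the marker can be split off. `extract_reply` below is a hypothetical helper, not part of the notebook.

```python
BOT_MARKER = "<bot>:"


def extract_reply(decoded):
    """Return the text after the first BOT_MARKER, or the whole string if absent."""
    _, found, reply = decoded.partition(BOT_MARKER)
    return reply.strip() if found else decoded.strip()


sample = " What are your favorite subjects <bot>:  I am not capable of being a computer."
print(extract_reply(sample))
```

It could be applied as `extract_reply(infer(inp))` before the text is shown to the user.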

## **RESIZING**


Resizing the model's token embeddings is necessary whenever new tokens, special or otherwise, are added to the tokenizer's vocabulary. The embedding layer holds one row per token ID, so it has to grow to cover the new IDs before the model is trained on inputs that contain them.

In [None]:
model.resize_token_embeddings(len(tokenizer))
model = model.to(device)
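
As an illustration of why the resize is needed: the pretrained GPT-2 vocabulary has 50,257 entries, and the tokenizer here gains four more (`<pad>`, `<startofstring>`, `<endofstring>`, `<bot>:`), so the new IDs point past the end of the original embedding table. The toy sketch below uses a small list-of-rows table instead of the real model. `resize_embeddings` is a hypothetical stand-in for what `resize_token_embeddings` does conceptually (keep the existing rows, append newly initialized ones), and the initialization scale is an assumption.

```python
import random


def resize_embeddings(table, new_size, dim, seed=0):
    """Keep existing rows and append randomly initialized rows up to new_size."""
    rng = random.Random(seed)
    extra = [[rng.gauss(0.0, 0.02) for _ in range(dim)]
             for _ in range(new_size - len(table))]
    return table + extra


old_vocab, dim = 6, 4                  # toy sizes; GPT-2 small uses 50257 and 768
table = [[0.0] * dim for _ in range(old_vocab)]
new_token_id = old_vocab               # first id handed to a newly added token

try:
    table[new_token_id]                # lookup fails before resizing
    lookup_failed = False
except IndexError:
    lookup_failed = True

table = resize_embeddings(table, new_size=old_vocab + 4, dim=dim)
print(lookup_failed, len(table), len(table[new_token_id]))
```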

## **Dataset Class**

This class wraps the encoded examples in a PyTorch Dataset, which the DataLoader needs in order to split the data into batches.

In [None]:
class ChatDataset(Dataset):
    def __init__(self, input_ids, attention_mask):
        self.input_ids = input_ids
        self.attention_mask = attention_mask

    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, idx):
        return self.input_ids[idx], self.attention_mask[idx]

In [None]:
chat_data = ChatDataset(input_ids, attention_mask)
chat_dataloader = DataLoader(chat_data, batch_size=batch_size, shuffle=True)
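
For intuition, the batching that `DataLoader` performs with `shuffle=True` can be approximated in plain Python: shuffle the indices once per epoch, then slice them into groups of `batch_size`. `iter_batches` is a hypothetical helper for illustration only, and the item count of 150 is made up; the real `DataLoader` additionally collates the selected items into tensors.

```python
import random


def iter_batches(n_items, batch_size, seed=None):
    """Yield lists of shuffled indices, each at most batch_size long."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)       # new order for this pass over the data
    for start in range(0, n_items, batch_size):
        yield indices[start:start + batch_size]


batches = list(iter_batches(n_items=150, batch_size=64, seed=0))
print([len(b) for b in batches])
```

The last batch is smaller than the others whenever the dataset size is not a multiple of `batch_size`, which matches `DataLoader`'s default of `drop_last=False`.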



## **TRAINING**

In [None]:
model.train()
optim = Adam(model.parameters(), lr)

In [None]:
print("training .... ")
train(chat_dataloader, model, optim)

training .... 


100%|██████████| 12/12 [01:36<00:00,  8.04s/it]


In [None]:
inp = input()
print(infer(inp))

What are your favorite subjects


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


 What are your favorite subjects <bot>:  I am not capable of being a computer.  I can be programmed to perform very well. I have been programmed by a software to run through the computer to simulate the stimulus. i


## **APP USING GRADIO**

Importing Gradio to build a web application so the bot can be accessed easily through a UI.

In [None]:
import gradio as gr

In [None]:
def answer(inp):
  return infer(inp)

In [None]:
model_gui = gr.Interface(
  answer,
  gr.Textbox(lines=3,label="Input"),
  gr.Textbox(lines=3, label="Model"),
  title='GPT-2',
)
model_gui.launch(share=True)

Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
Note: opening Chrome Inspector may crash demo inside Colab notebooks.

To create a public link, set `share=True` in `launch()`.



