<a href="https://colab.research.google.com/github/ManasviKundalia/Misc-NLP/blob/master/transformer_chatbot.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Install Requirements

In [21]:
!pip install transformers
!pip install torch

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [22]:
import csv
import os
import random

import numpy as np
import pandas as pd
import torch
import torch.nn.functional as F
import transformers
from torch.optim import AdamW  # transformers.AdamW is deprecated in favor of the PyTorch optimizer
from torch.utils.data import Dataset, DataLoader
from tqdm import tqdm, trange
from transformers import GPT2LMHeadModel, GPT2Tokenizer, get_linear_schedule_with_warmup

# Load the pretrained base GPT-2 checkpoint
model = transformers.GPT2LMHeadModel.from_pretrained('gpt2')


In [23]:
tokenizer = transformers.GPT2Tokenizer.from_pretrained('gpt2')



In [24]:
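def sanity_check(response):
  """Drop near-duplicate consecutive sentences and reject highly repetitive output."""
  sentences = response.split(".")
  result = []
  for sentence in sentences:
    if len(result)==0:
      result.append(sentence)
    else:
      words = set(sentence.split())
      prev_sentence_words = set(result[-1].split())
      denom = max(len(prev_sentence_words), len(words))
      # Skip a sentence that shares more than half of its words with the previous kept one
      if denom > 0 and len(words.intersection(prev_sentence_words)) / denom > 0.5:
        continue
      result.append(sentence)
  final_result = ". ".join(result)
  words = final_result.split()
  # Fall back when the text is empty or fewer than 30% of its words are unique
  if not words or len(set(words)) / len(words) < 0.3:
    return "I am not sure :("
  return final_result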

def generate_response(input_str):
    input_ids = torch.tensor(tokenizer.encode(input_str)).unsqueeze(0)  # Encode the input string as a batch of one
    attention_mask = torch.ones_like(input_ids)  # No padding, so attend to every token
    # GPT-2 defines no pad token, so this is None and generate() falls back to eos_token_id with a warning
    pad_token_id = tokenizer.pad_token_id
    # top_p and top_k only affect sampling; without do_sample=True, generate() decodes greedily
    output = model.generate(input_ids, attention_mask=attention_mask, pad_token_id=pad_token_id, max_length=100, top_p=0.95, top_k=0)  # Generate a response
    return sanity_check(tokenizer.decode(output[0], skip_special_tokens=True))  # Decode, drop special tokens, then filter repeats


while True:
    input_str = input("You: ")
    if input_str == 'exit':
        break
    response = generate_response(input_str)
    print("AI:", response)

You: What is a recommendation system?


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


AI: What is a recommendation system?

A recommendation system is a system that allows you to make a recommendation based on your own experience
You: What is physics?


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


AI: What is physics?

Physicists have long known that the universe is a complex, fluid, and chaotic place
You: Yes, but what is physics?


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


AI: Yes, but what is physics?

The answer is that physics is a very complex system of laws and equations
You: exit
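
The `sanity_check` helper is meant to drop a sentence when it shares most of its words with the previous kept sentence. A minimal standalone sketch of that overlap test follows. The names `word_overlap` and `drop_repeats` are hypothetical and introduced only for illustration; the 0.5 threshold is the one used in the cell above.

```python
def word_overlap(a: str, b: str) -> float:
    """Shared-word count divided by the larger of the two word-set sizes."""
    wa, wb = set(a.split()), set(b.split())
    if not wa or not wb:
        return 0.0
    return len(wa & wb) / max(len(wa), len(wb))


def drop_repeats(text: str, threshold: float = 0.5) -> str:
    """Keep a sentence only if it does not mostly repeat the previous kept one."""
    kept = []
    for sentence in (s.strip() for s in text.split(".")):
        if not sentence:
            continue  # ignore empty fragments left by the split
        if kept and word_overlap(sentence, kept[-1]) > threshold:
            continue  # too similar to the previous kept sentence
        kept.append(sentence)
    return ". ".join(kept)


sample = "The sun is big. The sun is very big. Physics studies matter."
print(drop_repeats(sample))
# The sun is big. Physics studies matter
```

The second sentence shares four of its five distinct words with the first, so its overlap of 0.8 exceeds the threshold and it is dropped.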


### Fine-tune on the WikiQA Dataset

In [25]:
POS_ANS_PATH = "/content/WikiQASent.pos.ans.tsv"

In [26]:
import pandas as pd
df_wikiqa = pd.read_csv(POS_ANS_PATH, delimiter="\t")

In [27]:
df_wikiqa.head()

| Unnamed: 0 | QuestionID | Question | DocumentID | DocumentTitle | SentenceID | Sentence | AnswerPhrase1 | AnswerPhrase2 | AnswerPhrase3 |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Q0 | HOW AFRICAN AMERICANS WERE IMMIGRATED TO THE US | D0 | African immigration to the United States | D0-5 | As such, African immigrants are to be distingu... | | involuntarily brought to the United States by ... | Atlantic slave trade |
| 1 | Q1 | how are glacier caves formed? | D1 | Glacier cave | D1-3 | A glacier cave is a cave formed within the ice... | | within the ice of a glacier | formed within the ice of a glacier |
| 2 | Q4 | how a water pump works | D4 | Pump | D4-4 | Pumps operate by some mechanism (typically rec... | | Pumps operate by some mechanism (typically rec... | mechanism (typically reciprocating or rotary ) |
| 3 | Q11 | how big is bmc software in houston, tx | D11 | BMC Software | D11-3 | Employing over 6,000, BMC is often credited wi... | | Employing over 6,000 | Employing over 6,000 |
| 4 | Q11 | how big is bmc software in houston, tx | D11 | BMC Software | D11-4 | For 2011, the company recorded an annual reven... | | annual revenue of $2.1 billion | annual revenue of $2.1 billion |


In [28]:
conversations = [f"{row['Question']} ? {row['Sentence']}" for _, row in df_wikiqa.iterrows()]

In [29]:
len(conversations), conversations[:5]

(1473,
 ['HOW AFRICAN AMERICANS WERE IMMIGRATED TO THE US ? As such, African immigrants are to be distinguished from African American people, the latter of whom are descendants of mostly West and Central Africans who were involuntarily brought to the United States by means of the historic Atlantic slave trade .',
  'how are glacier caves formed? ? A glacier cave is a cave formed within the ice of a glacier .',
  'how a water pump works ? Pumps operate by some mechanism (typically reciprocating or rotary ), and consume energy to perform mechanical work by moving the fluid.',
  'how big is bmc software in houston, tx ? Employing over 6,000, BMC is often credited with pioneering the BSM concept as a way to help better align IT operations with business needs.',
  'how big is bmc software in houston, tx ? For 2011, the company recorded an annual revenue of $2.1 billion, making it the #20 largest software company in terms of revenue for that year.'])
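
Because some questions already end in a question mark, the `"{Question} ? {Sentence}"` template yields `"? ?"` in entries such as the second one above. One possible way to normalize this is to strip trailing question marks before adding the separator. The sketch below uses a hypothetical `format_pair` helper that is not part of the notebook.

```python
def format_pair(question: str, sentence: str) -> str:
    """Join a question and its answer sentence with exactly one ' ? ' separator."""
    # Remove surrounding whitespace and any trailing question marks from the question
    q = question.strip().rstrip("?").rstrip()
    return f"{q} ? {sentence.strip()}"


print(format_pair("how are glacier caves formed?",
                  "A glacier cave is a cave formed within the ice of a glacier ."))
# how are glacier caves formed ? A glacier cave is a cave formed within the ice of a glacier .
```

Whether this normalization helps the fine-tuned model is untested here; it only makes the separator consistent across training strings.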

In [30]:
from torch.utils.data import Dataset

class WikiQADataset(Dataset):
    """Tokenized question/answer strings built from the global `conversations` list."""

    def __init__(self, control_code, truncate=False, gpt2_type="gpt2", max_length=1024):

        self.tokenizer = transformers.GPT2Tokenizer.from_pretrained(gpt2_type)
        self.conversations = []

        for row in conversations:
            # Note: row[:max_length] truncates by characters, not by tokens
            self.conversations.append(torch.tensor(
                self.tokenizer.encode(f"<|{control_code}|>{row[:max_length]}<|endoftext|>")
            ))
        if truncate:
            # Cap the dataset at 20,000 examples
            self.conversations = self.conversations[:20000]
        self.convo_count = len(self.conversations)

    def __len__(self):
        return self.convo_count

    def __getitem__(self, item):
        return self.conversations[item]

# The control code is wrapped as <|...|>, so "<BOS>" yields the literal prefix "<|<BOS>|>"
train_dataset = WikiQADataset("<BOS>", truncate=True, gpt2_type="gpt2")

In [31]:
import os


# Pack several short examples into one sequence so each forward pass uses more of the context window
def pack_tensor(new_tensor, packed_tensor, max_seq_len):
    if packed_tensor is None:
        return new_tensor, True, None
    if new_tensor.size()[1] + packed_tensor.size()[1] > max_seq_len:
        return packed_tensor, False, new_tensor
    else:
        packed_tensor = torch.cat([new_tensor, packed_tensor[:, 1:]], dim=1)
        return packed_tensor, True, None

def train(
    dataset, model, tokenizer,
    batch_size=32, epochs=5, lr=2e-4,
    max_seq_len=768, warmup_steps=200,
    gpt2_type="gpt2", output_dir=".", output_prefix="wreckgar",
    test_mode=False,save_model_on_epoch=False,
):
#     model = model.cuda()
    model.train()

    train_dataloader = DataLoader(dataset, batch_size=1, shuffle=True)

    optimizer = AdamW(model.parameters(), lr=lr)
    # Upper bound on the number of optimizer steps; packing makes the real count smaller
    total_steps = max(1, epochs * len(train_dataloader) // batch_size)
    scheduler = get_linear_schedule_with_warmup(
        optimizer, num_warmup_steps=warmup_steps, num_training_steps=total_steps
    )

    loss=0
    accumulating_batch_count = 0
    input_tensor = None

    for epoch in range(epochs):

        print(f"Training epoch {epoch}")
        print(loss)  # Loss of the last packed sequence from the previous epoch
        for idx, entry in tqdm(enumerate(train_dataloader)):
            (input_tensor, carry_on, remainder) = pack_tensor(entry, input_tensor, max_seq_len)

            if carry_on and idx != len(train_dataloader) - 1:
                continue

            outputs = model(input_tensor, labels=input_tensor)
            loss = outputs[0]
            loss.backward()

            # Gradient accumulation (since GPT2 is so big): step once every `batch_size` packed sequences
            accumulating_batch_count += 1
            if (accumulating_batch_count % batch_size) == 0:
                optimizer.step()
                scheduler.step()
                optimizer.zero_grad()

            # Start the next pack from the example that did not fit, if any
            input_tensor = remainder
        if save_model_on_epoch:
            torch.save(
                model.state_dict(),
                os.path.join(output_dir, f"{output_prefix}-{epoch}.pt"),
            )
    return model
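
`pack_tensor` greedily concatenates short examples until the next one would exceed the length limit. The same idea can be sketched with plain Python lists in place of tensors. `pack_sequences` is a hypothetical helper, not the notebook's function: unlike `pack_tensor`, it appends in order and keeps every token.

```python
from typing import List


def pack_sequences(seqs: List[List[int]], max_len: int) -> List[List[int]]:
    """Greedily merge consecutive sequences into chunks of at most max_len tokens.

    A single sequence longer than max_len is emitted as its own over-long chunk.
    """
    packed: List[List[int]] = []
    current: List[int] = []
    for seq in seqs:
        if current and len(current) + len(seq) > max_len:
            packed.append(current)  # current chunk is full, start a new one
            current = []
        current = current + seq
    if current:
        packed.append(current)  # flush the final partial chunk
    return packed


examples = [[1, 2, 3], [4, 5], [6, 7, 8, 9], [10]]
print(pack_sequences(examples, max_len=5))
# [[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]]
```

Four short examples become two forward passes here, which is the saving that packing aims for during fine-tuning.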

In [32]:
model = train(train_dataset, model, tokenizer)



Training epoch 0
0


1473it [18:58,  1.29it/s]


Training epoch 1
tensor(4.0089, grad_fn=<NllLossBackward0>)


1473it [18:54,  1.30it/s]


Training epoch 2
tensor(3.3641, grad_fn=<NllLossBackward0>)


1473it [18:58,  1.29it/s]


Training epoch 3
tensor(4.1742, grad_fn=<NllLossBackward0>)


1473it [18:52,  1.30it/s]


Training epoch 4
tensor(3.4433, grad_fn=<NllLossBackward0>)


1473it [18:54,  1.30it/s]
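
The scheduler created in `train` scales the base learning rate by a multiplier that rises linearly during warmup and then decays linearly toward zero at the final step. The sketch below reproduces that multiplier in plain Python, following the behavior described for `get_linear_schedule_with_warmup`. The function name `linear_warmup_multiplier` and the step counts are assumptions chosen for illustration. It shows why the total step count passed to the scheduler matters.

```python
def linear_warmup_multiplier(step: int, warmup_steps: int, total_steps: int) -> float:
    """Learning-rate multiplier: linear ramp to 1.0, then linear decay to 0.0."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))


# Ramp-up and decay with an illustrative total of 1000 steps
for step in (0, 100, 200, 600, 1000):
    print(step, linear_warmup_multiplier(step, warmup_steps=200, total_steps=1000))

# With a non-positive total, the multiplier is 0.0 as soon as warmup ends
print(linear_warmup_multiplier(200, warmup_steps=200, total_steps=-1))
```

With gradient accumulation and packing, the number of optimizer steps per epoch is far smaller than the 1473 iterations shown in the progress bars, so a 200-step warmup may cover most or all of a short run.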


In [None]:
model.eval()  # train() leaves the model in training mode; disable dropout before generating
while True:
  input_str = input("You: ")
  if input_str == 'exit':
    break
  response = generate_response(input_str)
  print("AI:", response)

You: What is a recommendation system?


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


AI: What is a recommendation system?

A recommendation system is a system that allows you to make a decision about whether or not to make a recommendation
You: How big is the sun?


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


AI: How big is the sun?

The sun is the sun's diameter
You: How did African Americans come to the US?


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


AI: How did African Americans come to the US?


The first African American to come to the US was the first African American to be elected president of the United States
You: When was slavery abolished?


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


AI: When was slavery abolished?

The abolition of slavery was a major event in the history of the United States
You: Can you tell me more about the French Revolution


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


AI: Can you tell me more about the French Revolution?

I think it was a very important event
You: French Revolution


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


You: What is the story of Harry Potter


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


AI: What is the story of Harry Potter?

Harry Potter is a character that has been in the works for over the years
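
The `top_p=0.95` argument passed to `model.generate` refers to nucleus sampling, which only takes effect when sampling is enabled (`do_sample=True`); otherwise decoding is greedy. A toy sketch of the nucleus filtering step is shown below, using a hand-written probability table rather than real model logits. The `nucleus_filter` name and the example distribution are made up for illustration and are not the library's implementation.

```python
from typing import Dict


def nucleus_filter(probs: Dict[str, float], top_p: float) -> Dict[str, float]:
    """Keep the most probable tokens until their cumulative mass reaches top_p,
    then renormalize so the kept probabilities sum to 1."""
    kept: Dict[str, float] = {}
    cumulative = 0.0
    for token, p in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[token] = p
        cumulative += p
        if cumulative >= top_p:
            break  # the nucleus is complete; drop the low-probability tail
    total = sum(kept.values())
    return {token: p / total for token, p in kept.items()}


toy = {"physics": 0.5, "science": 0.3, "math": 0.15, "banana": 0.05}
print(nucleus_filter(toy, top_p=0.9))
```

With `top_p=0.9`, the three most likely tokens are kept and the unlikely tail token is excluded before a token would be sampled from the renormalized distribution.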
