# Research and Study

The first step was to understand various architectures used in chatbot development. We explored Seq2Seq, Transformers, and GPT models in this course:

Seq2Seq models encode the input sequence and then generate the output sequence from that encoding. However, they struggle with long-term dependencies because the model tends to forget the earlier parts of a conversation.
Transformers, on the other hand, introduced the self-attention mechanism, which lets the model focus directly on the relevant parts of the input sequence and therefore handle long-range dependencies better (Lewis Tunstall).
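
To make the self-attention idea concrete, here is a minimal sketch of scaled dot-product attention in plain Python. It is a simplification: the function names are ours, and it assumes queries, keys, and values are all the raw token vectors, whereas a real Transformer first applies learned projections (and GPT-style models also mask future positions).

```python
import math

def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors):
    """Scaled dot-product self-attention with queries = keys = values = vectors."""
    dim = len(vectors[0])
    outputs, all_weights = [], []
    for query in vectors:
        # Similarity of this token to every token, scaled by sqrt(dim)
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in vectors]
        weights = softmax(scores)
        # Each output is a weighted mix of all token vectors
        mixed = [sum(w * value[i] for w, value in zip(weights, vectors))
                 for i in range(dim)]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights

# Three toy token vectors; the first and third point in similar directions.
tokens = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
outputs, weights = self_attention(tokens)
print([round(w, 3) for w in weights[0]])
```

The first token gives more weight to the third token than to the second, regardless of how far apart they are in the sequence. That distance-independence is what helps with long-range dependencies.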

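GPT-2 (Generative Pre-trained Transformer 2) is a Transformer-based language model that excels at generating human-like text. It can handle multi-turn dialogues because it attends over everything in its context window (up to 1,024 tokens), so earlier turns remain visible while it generates a reply, and its pre-training gives it broad knowledge of language to build on.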

# Data Collection and Preprocessing 

The Cornell Movie Dialogs Corpus was selected for training. It contains over 220,000 conversational exchanges from movies (README). The preprocessing phase involved tokenizing the dialogues, managing conversation turns, and cleaning the data.

## Preprocessing Code

In [6]:
# Load the Cornell Movie Dialogs Corpus
movie_lines_file = "/Users/bandito2/Documents/FA24/usdjourney/aai520/final-project/archive/movie_lines.txt"
movie_conversations_file = "/Users/bandito2/Documents/FA24/usdjourney/aai520/final-project/archive/movie_conversations.txt"

# Function to load movie lines
def load_movie_lines(file_path):
    lines = {}
    try:
        with open(file_path, 'r', encoding='iso-8859-1') as f:
            for line in f:
                parts = line.strip().split(" +++$+++ ")
                if len(parts) == 5:
                    lines[parts[0]] = parts[4]  # Line ID -> Dialogue text
    except Exception as e:
        print(f"Error reading {file_path}: {e}")
    return lines

import ast

# Function to load conversations
def load_conversations(file_path, lines_dict):
    conversations = []
    try:
        with open(file_path, 'r', encoding='iso-8859-1') as f:
            for line in f:
                parts = line.strip().split(" +++$+++ ")
                if len(parts) == 4:
                    # Parse the string list safely; literal_eval accepts only Python literals, unlike eval
                    utterance_ids = ast.literal_eval(parts[-1])
                    conversation = [lines_dict.get(utterance_id, "") for utterance_id in utterance_ids if utterance_id in lines_dict]
                    if conversation and all(conversation):
                        conversations.append(conversation)
    except Exception as e:
        print(f"Error reading {file_path}: {e}")
    return conversations

# Load movie lines and conversations
lines_dict = load_movie_lines(movie_lines_file)
conversations = load_conversations(movie_conversations_file, lines_dict)

# Example: Print first conversation
print(conversations[0])

['Can we make this quick?  Roxanne Korrine and Andrew Barrett are having an incredibly horrendous public break- up on the quad.  Again.', "Well, I thought we'd start with pronunciation, if that's okay with you.", 'Not the hacking and gagging and spitting part.  Please.', "Okay... then how 'bout we try out some French cuisine.  Saturday?  Night?"]
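
The sample above still contains doubled spaces, and the cleaning step mentioned earlier is not shown in the notebook. A minimal sketch of what such a step might look like follows; `clean_utterance`, `clean_conversations`, and the `min_turns` threshold are hypothetical names and choices, not the project's actual code.

```python
import re

def clean_utterance(text):
    """Collapse runs of whitespace and trim the ends."""
    return re.sub(r"\s+", " ", text).strip()

def clean_conversations(conversations, min_turns=2):
    """Clean every utterance and drop conversations too short to form an exchange."""
    cleaned = []
    for conversation in conversations:
        turns = [clean_utterance(turn) for turn in conversation]
        turns = [turn for turn in turns if turn]  # discard empty utterances
        if len(turns) >= min_turns:
            cleaned.append(turns)
    return cleaned

sample = [
    ["Can we make this quick?  Again.", "  Okay...  then "],
    ["   "],  # nothing left after cleaning, so this conversation is dropped
]
print(clean_conversations(sample))
```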


## Code for Tokenization and Padding 

In [19]:
from transformers import GPT2Tokenizer

# Initialize the GPT-2 tokenizer
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
tokenizer.pad_token = tokenizer.eos_token  # Use EOS token as the padding token

# Tokenize conversations and pad to max length
def tokenize_conversations(conversations, tokenizer, max_length=1024):
    tokenized_conversations = []
    for conversation in conversations:
        for sentence in conversation:
            # Tokenize and truncate sentences exceeding max_length
            tokens = tokenizer.encode(sentence, add_special_tokens=True, truncation=True, max_length=max_length)
            tokenized_conversations.append(tokens)

    # Pad all sequences to max_length
    padded_conversations = tokenizer.pad(
        {"input_ids": tokenized_conversations},
        padding="max_length",    # Pad sequences to max_length
        max_length=max_length,   # Ensure all sequences are exactly max_length
        return_tensors="pt"      # Return as PyTorch tensors
    )
    
    return padded_conversations

# Tokenize and pad the conversations
tokenized_conversations = tokenize_conversations(conversations, tokenizer)

# Check a sample of the tokenized and padded conversations
print(tokenized_conversations['input_ids'][0])


tensor([   39, 50256, 50256,  ..., 50256, 50256, 50256])


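Tokenization: We used the GPT-2 tokenizer to convert text into token IDs that the model understands. GPT-2 has no padding token of its own, so the EOS token (ID 50256) doubles as the padding token, which is why the sample output above consists mostly of 50256.

Handling Multi-Turn Conversations: The loader keeps the utterances of each conversation grouped in order, so that exchanges can be presented to the model together with their preceding turns and it can learn context across multiple dialogue turns (Lewis Tunstall).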
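
One common way to expose multi-turn context to GPT-2 is to join the turns of a conversation into a single string, separated by the end-of-text token, before tokenizing. The sketch below illustrates that idea only; `build_training_texts` and `max_context_turns` are hypothetical, and the tokenization cell above encodes each utterance separately rather than this way.

```python
EOS = "<|endoftext|>"  # GPT-2's end-of-text token string

def build_training_texts(conversation, max_context_turns=3, eos=EOS):
    """Return one training string per reply, prefixed by up to
    max_context_turns earlier turns from the same conversation."""
    texts = []
    for i in range(1, len(conversation)):
        start = max(0, i - max_context_turns)
        window = conversation[start:i + 1]  # context turns plus the reply
        texts.append(eos.join(window) + eos)
    return texts

conversation = ["Can we make this quick?", "Well, I thought we'd start.", "Please."]
for text in build_training_texts(conversation):
    print(text)
```

Each resulting string could then be passed to the tokenizer in place of a single utterance, so the model sees a reply together with the turns that led to it.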

# Model Design and Training

In this phase, we fine-tuned the GPT-2 model on the tokenized data from the Cornell Movie Dialogs Corpus. Fine-tuning the pre-trained model on conversational data helps it adapt to the dialogue structure and style, making it capable of generating contextually relevant responses.

## Prepare Dataset with Labels for GPT-2

In [21]:
import torch
from torch.utils.data import Dataset

# Custom Dataset class for GPT-2 with labels
class ChatbotDataset(Dataset):
    def __init__(self, tokenized_data):
        self.input_ids = tokenized_data['input_ids']
        self.attention_mask = tokenized_data['attention_mask']

        # Labels are the same as input_ids for GPT-2
        self.labels = self.input_ids.clone()

    def __getitem__(self, idx):
        return {
            'input_ids': self.input_ids[idx],
            'attention_mask': self.attention_mask[idx],
            'labels': self.labels[idx]
        }

    def __len__(self):
        return len(self.input_ids)

# Prepare the dataset with the tokenized data
chatbot_dataset = ChatbotDataset(tokenized_conversations)

# Check the first item to ensure it contains input_ids, attention_mask, and labels
print(chatbot_dataset[0])


{'input_ids': tensor([   39, 50256, 50256,  ..., 50256, 50256, 50256]), 'attention_mask': tensor([1, 0, 0,  ..., 0, 0, 0]), 'labels': tensor([   39, 50256, 50256,  ..., 50256, 50256, 50256])}
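
In the output above, the labels contain the padding ID 50256 at every padded position, so the loss is computed over padding as well. A common refinement is to replace those label positions with -100, the value PyTorch's cross-entropy loss ignores by default. The sketch below shows the idea on plain lists; `mask_padding_labels` is a hypothetical helper, not part of the notebook.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default

def mask_padding_labels(input_ids, attention_mask, ignore_index=IGNORE_INDEX):
    """Copy input_ids into labels, hiding positions the attention mask marks as padding."""
    return [token if keep == 1 else ignore_index
            for token, keep in zip(input_ids, attention_mask)]

input_ids = [39, 50256, 50256, 50256]
attention_mask = [1, 0, 0, 0]
labels = mask_padding_labels(input_ids, attention_mask)
print(labels)  # [39, -100, -100, -100]
```

On tensors, the same effect can be had inside the dataset class by cloning input_ids and then assigning -100 wherever attention_mask equals 0. The attention mask is used here rather than the token ID because the padding token and the EOS token share the same ID.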


## Fine-Tuning Code 

In [None]:
from transformers import GPT2LMHeadModel, Trainer, TrainingArguments

# Load the GPT-2 model
model = GPT2LMHeadModel.from_pretrained('gpt2')

# Define the training arguments
training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=1,
    per_device_train_batch_size=2,    # Batch size per device
    logging_dir='./logs',             # Directory for logs
    logging_steps=100,                # Log every 100 steps
    save_steps=500,                   # Save model every 500 steps
    save_total_limit=2,               # Limit the number of saved checkpoints
    learning_rate=5e-5
)

# Define the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=chatbot_dataset,    # Use the custom dataset
)

# Fine-tune the model
trainer.train()

# Save the fine-tuned model and tokenizer
model.save_pretrained('./fine_tuned_gpt2')
tokenizer.save_pretrained('./fine_tuned_gpt2')


The Hugging Face Trainer is used to manage the training loop, handling forward passes, loss calculation, and backpropagation (Lewis Tunstall).

With the training arguments shown above, the GPT-2 model was fine-tuned for one epoch (num_train_epochs=1) with a per-device batch size of 2 and a learning rate of 5e-5. This lets the model adjust its weights to the conversation data and become more proficient at dialogue generation.
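
To get a feel for what the training arguments imply, the number of optimizer steps and checkpoints can be estimated with simple arithmetic. The sketch below assumes a single device and no gradient accumulation; the dataset size of 10,000 examples is a made-up figure for illustration, and `training_schedule` is a hypothetical helper.

```python
import math

def training_schedule(num_examples, batch_size, epochs, save_steps):
    """Estimate steps per epoch, total steps, and checkpoints written."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    total_steps = steps_per_epoch * epochs
    checkpoints_written = total_steps // save_steps
    return steps_per_epoch, total_steps, checkpoints_written

# Hypothetical dataset size; batch size and save_steps taken from the training arguments.
print(training_schedule(num_examples=10_000, batch_size=2, epochs=1, save_steps=500))
# (5000, 5000, 10)
```

With save_total_limit=2, only the two most recent of those checkpoints would remain on disk at the end of training.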

# Evaluation and Testing

After training, the chatbot was evaluated both interactively and with a quantitative metric, perplexity. Perplexity is a common metric for evaluating language models: a lower perplexity means the model assigns higher probability to the tokens that actually come next.

## Perplexity Calculation Code 

In [None]:
import torch
import math
from transformers import GPT2Tokenizer, GPT2LMHeadModel

# Check if MPS (Apple GPU) is available
if torch.backends.mps.is_available():
    device = torch.device("mps")
else:
    device = torch.device("cpu")

# Load the fine-tuned model and tokenizer
model = GPT2LMHeadModel.from_pretrained('./fine_tuned_gpt2')
tokenizer = GPT2Tokenizer.from_pretrained('./fine_tuned_gpt2')

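# Move the model to the selected device (MPS if available, otherwise CPU)
model.to(device)

# Function to calculate perplexity
def calculate_perplexity(text):
    inputs = tokenizer(text, return_tensors='pt').to(device)  # Keep inputs on the same device as the model
    with torch.no_grad():
        # Passing labels makes the model return the mean cross-entropy loss over the predicted tokens
        outputs = model(**inputs, labels=inputs["input_ids"])
        loss = outputs.loss
        perplexity = math.exp(loss.item())  # Perplexity is the exponential of the mean loss
    return perplexity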

# Example text
sample_text = "How have you been doing since we last spoke?"

# Calculate perplexity
perplexity = calculate_perplexity(sample_text)
print(f"Perplexity: {perplexity}")


Perplexity measures how well the model predicts the next token in a sequence. A lower value indicates that the model is more accurate and confident in its predictions (Lewis Tunstall).
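
The relationship between token probabilities and perplexity can be seen in a small standalone example: perplexity is the exponential of the average negative log-probability assigned to the tokens that actually occurred. The helper name and the probabilities below are made up for illustration.

```python
import math

def perplexity_from_probs(token_probs):
    """Exponential of the mean negative log-probability of the observed tokens."""
    negative_log_likelihoods = [-math.log(p) for p in token_probs]
    return math.exp(sum(negative_log_likelihoods) / len(negative_log_likelihoods))

# A model that always gives the correct token probability 0.25 is as uncertain
# as a uniform choice among 4 tokens, so its perplexity is about 4.
print(perplexity_from_probs([0.25, 0.25, 0.25, 0.25]))

# More confident (and correct) predictions give a lower perplexity.
print(perplexity_from_probs([0.9, 0.8, 0.95, 0.85]))
```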

Interactive Testing: We also tested the chatbot interactively by providing various conversational prompts and analyzing how well the model handles context over multiple turns.
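
For multi-turn testing, the model only sees what is placed in its prompt, so earlier turns have to be carried along explicitly. A minimal sketch of such bookkeeping is shown below; the `ChatSession` class, its `max_turns` limit, and the choice of the end-of-text token as turn separator are assumptions for illustration, not the project's actual test harness.

```python
EOS = "<|endoftext|>"  # GPT-2's end-of-text token string, used here as a turn separator

class ChatSession:
    """Keeps the most recent turns and builds the prompt for the next reply."""

    def __init__(self, max_turns=6, eos=EOS):
        self.max_turns = max_turns
        self.eos = eos
        self.history = []

    def build_prompt(self, user_message):
        # Keep only the latest turns so the prompt stays within the context window
        turns = (self.history + [user_message])[-self.max_turns:]
        return self.eos.join(turns) + self.eos

    def record(self, user_message, bot_reply):
        self.history.extend([user_message, bot_reply])
        self.history = self.history[-self.max_turns:]

session = ChatSession(max_turns=3)
print(session.build_prompt("Hi"))
session.record("Hi", "Hello")  # the reply would come from the model in practice
print(session.build_prompt("How are you?"))
```

The prompt string would be tokenized and passed to the model's generation call, and the decoded reply recorded before the next turn.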

# Building a Web Interface

We used Flask, a Python web framework, to build a simple web interface that lets users interact with the chatbot in real time. The web application takes the user's input, sends it to the chatbot, and displays the generated response on the webpage.

The web interface lets users interact with the chatbot in a more accessible and user-friendly environment (Lewis Tunstall).
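
The notebook does not include the Flask code, so the following is only a minimal sketch of how such an endpoint might be wired up. The `/chat` route, the JSON field names, and `generate_reply` are hypothetical; `generate_reply` is a stand-in that echoes the input where the real application would call the fine-tuned model.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

def generate_reply(message):
    """Hypothetical stand-in for the fine-tuned model's generation call."""
    return f"(model reply to: {message})"

@app.route("/chat", methods=["POST"])
def chat():
    # Read the user's message from the JSON body; tolerate missing or invalid JSON
    payload = request.get_json(silent=True) or {}
    message = str(payload.get("message", "")).strip()
    if not message:
        return jsonify({"error": "message is required"}), 400
    return jsonify({"reply": generate_reply(message)})
```

The app can be started with Flask's development server (for example via the `flask run` command), and a page served alongside it would post the user's text to `/chat` and display the returned reply.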

# Conclusion

This project demonstrated how to build a generative chatbot using GPT-2, from data preprocessing through model training and evaluation to deployment. We explored the key steps in creating a multi-turn conversational chatbot, including the challenges of managing context and generating coherent responses. Future improvements could include training on more diverse datasets, refining the chatbot's conversational abilities, and deploying it in a scalable environment such as AWS or Heroku.