# Offline Chat-Reply Recommendation System
</VSCode.Cell>
<VSCode.Cell language="markdown">
### Objective:
Design and implement an offline chat-reply recommendation system using Transformer-based models to predict contextually appropriate responses in a two-person chat scenario.

### Problem Statement:
You are provided with two datasets, each containing long conversational histories between two users. The goal is to build a model that, given a message from User B, predicts User A's next reply, using User A's previous chat history as context.
</VSCode.Cell>
<VSCode.Cell language="markdown">
## 1. Setup and Data Loading
This section imports necessary libraries like `transformers`, `torch`, `pandas`, etc. It also sets up API keys for Hugging Face and Perplexity and loads the conversational datasets from the provided Google Sheets link.
</VSCode.Cell>
<VSCode.Cell language="python">
# Install necessary libraries
!pip install transformers torch pandas numpy scikit-learn matplotlib nltk joblib gdown rouge_score evaluate

# Import libraries
import torch
import pandas as pd
import numpy as np
import gdown
from sklearn.model_selection import train_test_split
from transformers import GPT2Tokenizer, GPT2LMHeadModel, Trainer, TrainingArguments, DataCollatorForLanguageModeling
from torch.utils.data import Dataset
import nltk
from nltk.translate.bleu_score import sentence_bleu
from rouge_score import rouge_scorer
import evaluate
import os

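# API keys: never hard-code real keys in a notebook. Export HUGGING_FACE_HUB_TOKEN and
# PERPLEXITY_API_KEY in your shell before launching Jupyter; this cell only checks for them.
# (The public "gpt2" checkpoint used below downloads without a token.)
for key_name in ("HUGGING_FACE_HUB_TOKEN", "PERPLEXITY_API_KEY"):
    if key_name not in os.environ:
        print(f"Note: {key_name} is not set.")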

print("Libraries installed and imported.")
</VSCode.Cell>
<VSCode.Cell language="markdown">
### Download and Load Data
The dataset is downloaded from a public Google Sheet and loaded into a pandas DataFrame.
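
A Google Sheets link becomes directly downloadable when it ends in `/export?format=csv`, optionally with a `gid` that selects the tab. As a minimal sketch, the helper below assembles such a URL from a spreadsheet ID and tab ID. The function name `sheet_csv_url` is hypothetical; the ID and `gid` are the ones used in this notebook.

```python
from urllib.parse import urlencode

def sheet_csv_url(sheet_id: str, gid: str = "0") -> str:
    """Build a CSV export URL for one tab of a Google Sheet (hypothetical helper)."""
    query = urlencode({"format": "csv", "gid": gid})
    return f"https://docs.google.com/spreadsheets/d/{sheet_id}/export?{query}"

url = sheet_csv_url("1XupCm7fwBdAXS29UoHeFuPDWCbiEE8TX", gid="1688094357")
print(url)
```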
</VSCode.Cell>
<VSCode.Cell language="python">
# Google Sheet URL for download
# The link needs to be in a downloadable format. For Google Sheets, you can change the end of the URL to /export?format=csv
google_sheet_url = "https://docs.google.com/spreadsheets/d/1XupCm7fwBdAXS29UoHeFuPDWCbiEE8TX/export?format=csv&gid=1688094357"
output_path = "chat_data.csv"

# Download the file
gdown.download(google_sheet_url, output_path, quiet=False)

# Load the dataset
try:
    df = pd.read_csv(output_path)
    print("Dataset loaded successfully.")
    print(df.head())
except Exception as e:
    print(f"Error loading dataset: {e}")
</VSCode.Cell>
<VSCode.Cell language="markdown">
## 2. Data Preprocessing and Tokenization
This section turns the raw chat log into context-response pairs, where the context is User B's message and the response is User A's reply to it. The cell below uses only the immediately preceding message as context; cleaning of noise and special characters and the use of User A's longer history are left as extensions. The tokenizer of a pre-trained model (GPT-2) then converts the text into token IDs.
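
A simple cleaning step for chat messages could look like the sketch below: strip URLs and control characters, then collapse whitespace. The function name `clean_message` and the specific rules are assumptions, not the notebook's preprocessing; adjust them to the noise found in the real data.

```python
import re

URL_PATTERN = re.compile(r"https?://\S+")
CONTROL_PATTERN = re.compile(r"[\x00-\x1f\x7f]")  # ASCII control characters, incl. tabs/newlines

def clean_message(text: str) -> str:
    """Remove URLs and control characters, then collapse runs of whitespace."""
    text = URL_PATTERN.sub(" ", text)
    text = CONTROL_PATTERN.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip()

print(clean_message("Hey!!\n\tcheck   this https://example.com/x  out"))
```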
</VSCode.Cell>
<VSCode.Cell language="python">
# For this example, let's assume the CSV has columns 'User' and 'Message'
# We'll simulate a two-person conversation and create pairs of (User B message, User A reply)

# Let's assume User A is 'ankit' and User B is the other user.
# We need to create pairs of (message from B, reply from A)
# This is a simplified approach. A more robust solution would handle multi-turn context.

pairs = []
# This assumes a strictly alternating conversation that starts with User B:
# even rows (0, 2, ...) are User B's messages and odd rows (1, 3, ...) are User A's replies.
# This is a strong assumption; check it against the 'User' column of the actual data.
# A real-world scenario would need more sophisticated session tracking.
for i in range(1, len(df), 2):
    user_b_message = str(df['Message'].iloc[i-1]).strip()
    user_a_reply = str(df['Message'].iloc[i]).strip()
    pairs.append(f"User B: {user_b_message} User A: {user_a_reply}")

# Display a few pairs
print("Sample conversation pairs:")
for pair in pairs[:3]:
    print(pair)

# Split data into training and testing sets
train_texts, test_texts = train_test_split(pairs, test_size=0.1, random_state=42)

print(f"\nTraining samples: {len(train_texts)}")
print(f"Testing samples: {len(test_texts)}")
</VSCode.Cell>
<VSCode.Cell language="markdown">
### Tokenization
We use the GPT-2 tokenizer to convert our text data into tokens that the model can understand. We also add a padding token to handle variable-length sequences.
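
To make the padding step concrete, here is a toy sketch of what `padding='max_length'` with `truncation=True` produces for one sequence. It uses plain lists instead of the real tokenizer: `pad_or_truncate` is a hypothetical helper, the token IDs are made up, and 50256 is GPT-2's end-of-text ID, reused here as the pad ID.

```python
def pad_or_truncate(token_ids, max_length, pad_id):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    kept = token_ids[:max_length]                    # truncation
    n_pad = max_length - len(kept)
    input_ids = kept + [pad_id] * n_pad              # right-padding
    attention_mask = [1] * len(kept) + [0] * n_pad   # 0 marks padding
    return input_ids, attention_mask

EOS_ID = 50256  # GPT-2's <|endoftext|> token ID
ids, mask = pad_or_truncate([11, 22, 33], max_length=5, pad_id=EOS_ID)
print(ids)   # [11, 22, 33, 50256, 50256]
print(mask)  # [1, 1, 1, 0, 0]
```

The attention mask is what lets the model ignore the padded positions, which is why it should be passed along with `input_ids`.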
</VSCode.Cell>
<VSCode.Cell language="python">
# Load tokenizer and model
model_name = "gpt2"
tokenizer = GPT2Tokenizer.from_pretrained(model_name)
model = GPT2LMHeadModel.from_pretrained(model_name)

# GPT-2 has no pad token by default, so we reuse the EOS token for padding.
tokenizer.pad_token = tokenizer.eos_token
model.config.pad_token_id = model.config.eos_token_id

# Custom Dataset Class
class ChatDataset(Dataset):
    def __init__(self, tokenizer, texts, max_length=128):
        self.tokenizer = tokenizer
        self.texts = texts
        self.max_length = max_length

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        text = self.texts[idx]
        encoding = self.tokenizer(text, return_tensors='pt', max_length=self.max_length, padding='max_length', truncation=True)
        # Squeeze to remove the batch dimension
        return {
            'input_ids': encoding['input_ids'].squeeze(),
            'attention_mask': encoding['attention_mask'].squeeze()
        }

train_dataset = ChatDataset(tokenizer, train_texts)
test_dataset = ChatDataset(tokenizer, test_texts)

print("Datasets created and tokenized.")
</VSCode.Cell>
<VSCode.Cell language="markdown">
## 3. Model Fine-Tuning
Here, we select and load a pre-trained Transformer model (GPT-2). We configure the training arguments using `TrainingArguments` and create a `Trainer` instance to fine-tune the model on our preprocessed conversational dataset.
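
With `mlm=False`, the data collator builds the training targets by copying `input_ids` into `labels` and replacing padding positions with -100, which the loss ignores; the model shifts the labels by one position internally. The sketch below mimics that on plain lists. The name `make_causal_lm_labels` is hypothetical and the token IDs are made up. One consequence worth knowing: because the pad token here is the EOS token, any EOS in the text is masked as well.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss

def make_causal_lm_labels(input_ids, pad_id):
    """Copy input_ids to labels, masking every position equal to pad_id."""
    return [IGNORE_INDEX if tok == pad_id else tok for tok in input_ids]

PAD_ID = 50256  # GPT-2's EOS ID, reused as the pad ID
labels = make_causal_lm_labels([11, 22, 33, PAD_ID, PAD_ID], PAD_ID)
print(labels)  # [11, 22, 33, -100, -100]
```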
</VSCode.Cell>
<VSCode.Cell language="python">
# Data collator for language modeling. This will create `labels` for us.
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=False,
)

# Training arguments
training_args = TrainingArguments(
    output_dir="./chat_model",
    overwrite_output_dir=True,
    num_train_epochs=3, # Adjust as needed
    per_device_train_batch_size=4, # Adjust based on your GPU memory
    save_steps=10_000,
    save_total_limit=2,
    logging_dir='./logs',
)

# Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_dataset,
)

# Start fine-tuning
print("Starting model fine-tuning...")
trainer.train()
print("Fine-tuning complete.")

# Save the fine-tuned model
trainer.save_model("./fine_tuned_chat_model")
tokenizer.save_pretrained("./fine_tuned_chat_model")
print("Model saved.")
</VSCode.Cell>
<VSCode.Cell language="markdown">
## 4. Generate Chat Replies
Now we can use our fine-tuned model to generate a reply. We wrap a new message from User B in the training format (`User B: ... User A:`), pass the prompt to the model's `generate` method, and decode the generated token IDs back into human-readable text. The preceding conversation history is not part of the prompt in this version.
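
The generation call in this section samples with `temperature`, `top_k`, and `top_p`. As a rough illustration of what the last two filters do to a next-token distribution, here is a pure-Python sketch on a toy vocabulary. The function name `filter_distribution` and the probabilities are assumptions; the real implementation in `transformers` operates on logits and differs in detail, and temperature scaling is omitted here.

```python
def filter_distribution(probs, top_k, top_p):
    """Keep the top_k most likely tokens, then the smallest prefix of them
    whose cumulative probability reaches top_p, and renormalize."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    kept, cumulative = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}

toy = {"good": 0.5, "fine": 0.3, "ok": 0.15, "zebra": 0.05}
filtered = filter_distribution(toy, top_k=3, top_p=0.7)
print(filtered)
```

Sampling then draws only from the surviving tokens, which removes the unlikely tail while keeping some variety in the replies.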
</VSCode.Cell>
<VSCode.Cell language="python">
# Load the fine-tuned model
model_path = "./fine_tuned_chat_model"
model = GPT2LMHeadModel.from_pretrained(model_path)
tokenizer = GPT2Tokenizer.from_pretrained(model_path)

def generate_reply(user_b_message):
    """
    Generates a reply from User A given a message from User B.
    """
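    prompt = f"User B: {user_b_message} User A:"
    inputs = tokenizer(prompt, return_tensors='pt')
    prompt_length = inputs['input_ids'].shape[1]
    
    # Generate a response (sampling, so the output differs between runs)
    output_sequences = model.generate(
        input_ids=inputs['input_ids'],
        attention_mask=inputs['attention_mask'],
        max_new_tokens=50,  # counts only generated tokens, so long prompts still get a reply
        num_return_sequences=1,
        pad_token_id=tokenizer.eos_token_id,
        do_sample=True,
        top_k=50,
        top_p=0.95,
        temperature=0.9,
    )
    
    # Decode only the newly generated tokens, not the prompt
    generated_text = tokenizer.decode(output_sequences[0][prompt_length:], skip_special_tokens=True)
    
    # Keep User A's reply and drop any further turns the model goes on to write
    reply = generated_text.split("User B:")[0].split("User A:")[0].strip()
    return reply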

# Example usage
test_message = "Hey, how are you doing?"
reply = generate_reply(test_message)
print(f"User B: {test_message}")
print(f"Generated User A Reply: {reply}")
</VSCode.Cell>
<VSCode.Cell language="markdown">
## 5. Evaluate Model Performance
The dataset was split into training and test sets in Section 2. Here we generate a reply for each test prompt and compare it with the actual reply using BLEU and ROUGE. We also compute perplexity, which measures how well the model predicts the test text and serves as a proxy for fluency.

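The perplexity score is given by $PPL(W) = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i \mid w_1 \ldots w_{i-1})}}$, where $N$ is the number of tokens in the sequence $W$ (GPT-2 scores subword tokens rather than whole words). Lower perplexity means the model assigns higher probability to the test text.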
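
As a worked example of the formula, the sketch below computes perplexity for a toy sequence from made-up per-token probabilities. It uses the equivalent form $\exp\left(-\frac{1}{N}\sum_{i=1}^{N} \log P(w_i \mid w_1 \ldots w_{i-1})\right)$, which is numerically safer than multiplying many small numbers, and then checks it against the root-of-product form.

```python
import math

def perplexity_from_probs(token_probs):
    """Perplexity = exp of the average negative log-probability per token."""
    n = len(token_probs)
    avg_neg_log_prob = -sum(math.log(p) for p in token_probs) / n
    return math.exp(avg_neg_log_prob)

# Made-up conditional probabilities P(w_i | w_1 ... w_{i-1}) for a 4-token sequence
probs = [0.25, 0.5, 0.125, 0.5]
ppl = perplexity_from_probs(probs)
print(ppl)

# Same value via the N-th root of the product of inverse probabilities
ppl_root = (1 / math.prod(probs)) ** (1 / len(probs))
print(ppl_root)
```

A model that assigned probability 0.1 to every token would have a perplexity of 10, which is why perplexity is often read as the effective number of choices per token.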
</VSCode.Cell>
<VSCode.Cell language="python">
# Evaluation
references = []
hypotheses = []

# Generate predictions for the test set
for text in test_texts:
    parts = text.split("User A:")
    if len(parts) < 2: continue
    
    prompt = parts[0] + "User A:"
    actual_reply = parts[1].strip()
    
    inputs = tokenizer(prompt, return_tensors='pt')
    # Generate about as many new tokens as the actual reply has (at least one)
    reply_length = max(1, len(tokenizer(actual_reply)['input_ids']))
    
    output_sequences = model.generate(
        input_ids=inputs['input_ids'],
        attention_mask=inputs['attention_mask'],
        max_new_tokens=reply_length,
        num_return_sequences=1,
        pad_token_id=tokenizer.eos_token_id,
    )
    
    generated_text = tokenizer.decode(output_sequences[0], skip_special_tokens=True)
    predicted_reply = generated_text.split("User A:")[-1].strip()
    
    references.append([actual_reply.split()]) # list of lists of words
    hypotheses.append(predicted_reply.split()) # list of words

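# Calculate BLEU score (smoothed, because short chat replies often have no 3- or 4-gram matches)
from nltk.translate.bleu_score import SmoothingFunction
smoothing = SmoothingFunction().method1
bleu_score = np.mean([sentence_bleu(ref, hyp, smoothing_function=smoothing) for ref, hyp in zip(references, hypotheses)])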
print(f"Average BLEU score: {bleu_score:.4f}")

# Calculate ROUGE score
scorer = rouge_scorer.RougeScorer(['rouge1', 'rougeL'], use_stemmer=True)
rouge_scores = [scorer.score(' '.join(ref[0]), ' '.join(hyp)) for ref, hyp in zip(references, hypotheses)]

avg_rouge1 = np.mean([score['rouge1'].fmeasure for score in rouge_scores])
avg_rougeL = np.mean([score['rougeL'].fmeasure for score in rouge_scores])

print(f"Average ROUGE-1 F-measure: {avg_rouge1:.4f}")
print(f"Average ROUGE-L F-measure: {avg_rougeL:.4f}")

# Calculate Perplexity using the evaluate library
perplexity = evaluate.load("perplexity", module_type="metric")
# Score each test text separately; one joined string could exceed GPT-2's 1024-token context.
# The metric variant of this module takes its texts via `predictions`.
results = perplexity.compute(model_id=model_path,
                             add_start_token=False,
                             predictions=test_texts)

print(f"Perplexity: {results['mean_perplexity']:.4f}")
</VSCode.Cell>
<VSCode.Cell language="markdown">
## 6. Justification and Optimization

### Model Selection
We chose **GPT-2 (Generative Pre-trained Transformer 2)** for this task. Here's why:
- **Generative Nature:** GPT-2 is an autoregressive language model, which means it's inherently designed for generating sequential data like text. This makes it a natural fit for a reply generation task.
- **Pre-trained Knowledge:** GPT-2 is pre-trained on a massive corpus of text from the internet. This pre-training captures a vast amount of information about language, grammar, and common sense, which can be leveraged by fine-tuning on our specific conversational dataset.
- **Contextual Understanding:** The Transformer architecture allows GPT-2 to handle long-range dependencies in text, enabling it to understand the context of the conversation and generate more relevant replies.
- **Availability and Ease of Use:** GPT-2 is readily available in the `transformers` library, making it easy to load, fine-tune, and use.

**Alternatives considered:**
- **BERT:** While powerful for understanding context, BERT is primarily an encoder-only model, making it more suitable for tasks like classification or question answering rather than text generation.
- **T5 (Text-to-Text Transfer Transformer):** T5 is another excellent choice as it frames every NLP task as a text-to-text problem. It could have been used here, but GPT-2 is often simpler to set up for straightforward generative tasks.

### Fine-Tuning Strategy
Our strategy was to take the pre-trained GPT-2 model and fine-tune it on our specific chat dataset. This process adjusts the model's weights to adapt to the style and content of the conversations between User A and User B. We formatted the data as `User B: [message] User A: [reply]` to teach the model the desired response format.
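
The training format can be captured in a few small helpers: one to build a training example, one to build the inference prompt, and one to recover the reply from generated text. This is a sketch with hypothetical names (`format_pair`, `build_prompt`, `extract_reply`) that mirrors the string handling used in the notebook's cells; the sample reply is made up.

```python
def format_pair(user_b_message: str, user_a_reply: str) -> str:
    """Training example: a User B message followed by User A's reply."""
    return f"User B: {user_b_message} User A: {user_a_reply}"

def build_prompt(user_b_message: str) -> str:
    """Inference prompt: same format, with the reply left for the model to fill in."""
    return f"User B: {user_b_message} User A:"

def extract_reply(generated_text: str) -> str:
    """Take the text after the last 'User A:' marker."""
    return generated_text.split("User A:")[-1].strip()

example = format_pair("Hey, how are you doing?", "Doing well, thanks!")
print(example)
print(extract_reply(example))
```

Keeping the prompt a strict prefix of the training text matters: the model continues the pattern it saw during fine-tuning.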

### Hyperparameter Choices
- **Epochs:** We used a small number of epochs (3) to avoid overfitting, which is a risk when fine-tuning on a relatively small dataset.
- **Batch Size:** A batch size of 4 was chosen as a balance between computational efficiency and memory constraints. Larger batch sizes can lead to more stable gradients but require more GPU memory.
- **Learning Rate:** We used the default learning rate from the `Trainer`, which is typically a small value suitable for fine-tuning (e.g., 5e-5).
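
To see what these settings imply, the sketch below counts optimizer steps and evaluates a linear learning-rate decay from 5e-5 to zero, which is the default `Trainer` schedule when no warmup is configured. The dataset size of 900 pairs is a made-up number for illustration, and a single device with no gradient accumulation is assumed.

```python
import math

def total_training_steps(num_examples, batch_size, epochs):
    """Optimizer steps on one device without gradient accumulation."""
    return math.ceil(num_examples / batch_size) * epochs

def linear_decay_lr(step, total_steps, base_lr=5e-5):
    """Learning rate after `step` updates under linear decay to zero (no warmup)."""
    return base_lr * max(0.0, (total_steps - step) / total_steps)

steps = total_training_steps(num_examples=900, batch_size=4, epochs=3)
print(steps)                               # 675
print(linear_decay_lr(0, steps))           # 5e-05
print(linear_decay_lr(steps // 2, steps))  # roughly half of the base rate
```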

### Offline Deployment Feasibility
- **Model Size:** The base `gpt2` checkpoint (about 124 million parameters) takes roughly 500 MB on disk in 32-bit precision. While this is manageable for a server, it can be a concern for resource-constrained offline devices. Smaller models such as `distilgpt2` reduce the footprint.
- **Inference Speed:** Generating a reply requires one forward pass per generated token, which can be computationally intensive. On a modern CPU, generating a short reply might take a few hundred milliseconds to a second, which is generally acceptable for a real-time chat application. GPU acceleration would speed this up significantly.
- **Dependencies:** The model requires `torch` and `transformers`, which have a significant size. Packaging these for an offline application needs careful consideration. Tools like PyInstaller can be used, but the final executable will be large.
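
The size figure follows from simple arithmetic: parameter count times bytes per parameter. The sketch below assumes roughly 124 million parameters for base GPT-2 and ignores file-format overhead, so the numbers are estimates rather than measured file sizes.

```python
def model_size_mb(num_params: int, bytes_per_param: int) -> float:
    """Approximate weight storage in megabytes (1 MB = 10**6 bytes)."""
    return num_params * bytes_per_param / 1e6

GPT2_PARAMS = 124_000_000  # approximate parameter count of base GPT-2 (assumption)

for name, nbytes in [("float32", 4), ("float16", 2), ("int8", 1)]:
    print(f"{name}: ~{model_size_mb(GPT2_PARAMS, nbytes):.0f} MB")
```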

**Optimization:**
- **Quantization:** Techniques like dynamic quantization can be applied to the model to reduce its size and speed up inference on CPUs with minimal loss in performance.
- **Pruning:** Model pruning involves removing less important weights from the model, which can also lead to a smaller and faster model.
- **Knowledge Distillation:** A smaller model (the "student") can be trained to mimic the behavior of our larger fine-tuned model (the "teacher"), resulting in a compact model suitable for offline deployment.
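
To illustrate the idea behind quantization, the sketch below applies symmetric 8-bit quantization to a small list of made-up weights and measures the round-trip error. It is a plain-Python illustration of the principle, not PyTorch's dynamic quantization API, and the function names are hypothetical.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale."""
    scale = max(abs(w) for w in weights) / 127 or 1.0  # fall back to 1.0 if all weights are zero
    quantized = [round(w / scale) for w in weights]
    return quantized, scale

def dequantize(quantized, scale):
    """Recover approximate float weights from the stored integers."""
    return [q * scale for q in quantized]

weights = [0.42, -1.27, 0.003, 0.9]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(q, scale, max_error)
```

Each weight is stored in one byte instead of four, and the error per weight is bounded by half the scale; very small weights such as 0.003 round to zero, which is where the accuracy loss comes from.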
</VSCode.Cell>
<VSCode.Cell language="markdown">
## 7. Conclusion
This notebook walked through a complete workflow for building an offline chat-reply recommendation system: data loading and preprocessing, fine-tuning a GPT-2 model, generating replies, and evaluating them with BLEU, ROUGE, and perplexity. Fine-tuning a pre-trained model in this way can produce coherent and contextually relevant chat responses, and the metrics in Section 5 indicate how close the generated replies come to the actual ones on held-out pairs.

**Future Improvements:**
- **Incorporate Multi-Turn Context:** Instead of just the last message, use a longer conversation history as context.
- **Experiment with Different Models:** Try other models like T5 or newer versions of GPT.
- **Hyperparameter Tuning:** Use techniques like grid search or random search to find optimal hyperparameters.
- **Advanced Evaluation:** Use human evaluation to assess the quality of the generated replies more subjectively.
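
The first improvement, multi-turn context, could start from a prompt builder like the sketch below, which keeps the last few turns of the conversation ahead of the new message. The function name `build_multiturn_prompt`, the `(speaker, message)` tuple layout, and the sample history are assumptions for illustration; the speaker labels follow the format used for fine-tuning.

```python
def build_multiturn_prompt(history, new_message, max_turns=4):
    """history: list of (speaker, message) tuples, oldest first.
    Keeps the most recent max_turns turns, then appends User B's new message."""
    recent = history[-max_turns:] if max_turns > 0 else []
    lines = [f"{speaker}: {message}" for speaker, message in recent]
    lines.append(f"User B: {new_message}")
    return " ".join(lines) + " User A:"

history = [
    ("User B", "Hi"),
    ("User A", "Hello"),
    ("User B", "Are you free tonight?"),
    ("User A", "Yes, after seven."),
]
prompt = build_multiturn_prompt(history, "Where should we meet?", max_turns=2)
print(prompt)
```

For this to help, the model would need to be fine-tuned on examples built the same way, and `max_turns` has to be small enough that the prompt fits within the tokenizer's `max_length`.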

