ChatGPT first prompt :
- "i am building a RAG model using gpt-4o-mini as my generator. I want you to give me steps with code on how to incorporate R2AG into my RAG model. "


ChatGPT last prompt :
- "i am getting the following error here : 
TypeError: 'ChatCompletionMessage' object is not subscriptable"


Implementation :

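First, run the following command in the terminal to install langchain, transformers, faiss-cpu, and sentence-transformers for retrieval and embedding operations. The same command also installs the packages used further down: openai and python-dotenv for generation, and pandas, huggingface_hub, scikit-learn, and rouge-score for evaluation.


In [1]:
# pip install langchain transformers faiss-cpu sentence-transformers openai python-dotenv pandas huggingface_hub scikit-learn rouge-score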

In [2]:
import torch
import faiss
from transformers import AutoModel, AutoTokenizer
from sentence_transformers import SentenceTransformer
import numpy as np
from dotenv import find_dotenv, load_dotenv
load_dotenv(find_dotenv())


True

In [3]:
class DocumentRetriever:
    def __init__(self, model_name="sentence-transformers/all-MiniLM-L6-v2"):
        self.model = SentenceTransformer(model_name)
        self.index = None  # FAISS Index

    def build_index(self, documents):
        embeddings = self.model.encode(documents, convert_to_numpy=True)
        self.index = faiss.IndexFlatL2(embeddings.shape[1])
        self.index.add(embeddings)
        self.docs = documents  # Storing original documents

    def retrieve(self, query, top_k=3):
        query_embedding = self.model.encode([query], convert_to_numpy=True)
        distances, indices = self.index.search(query_embedding, top_k)
        return [self.docs[i] for i in indices[0]]
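
As a rough illustration of what `IndexFlatL2` does inside `retrieve`, the sketch below runs the same kind of exhaustive nearest-neighbour search in plain Python. `l2_search` and the toy vectors are hypothetical names used only for illustration and are not part of FAISS; FAISS performs the comparison in optimized native code and, like this sketch, reports squared L2 distances.

```python
def l2_search(doc_vectors, query_vector, top_k=3):
    """Return (squared_distance, index) pairs for the top_k closest vectors."""
    scored = []
    for idx, vec in enumerate(doc_vectors):
        # Squared Euclidean distance between the query and one document vector
        dist = sum((q - v) ** 2 for q, v in zip(query_vector, vec))
        scored.append((dist, idx))
    scored.sort()  # Smallest distance first; ties are broken by index
    return scored[:top_k]


# Toy 2-dimensional "embeddings" standing in for real sentence embeddings
toy_vectors = [[0.0, 0.0], [1.0, 1.0], [3.0, 4.0]]
nearest = l2_search(toy_vectors, [1.0, 0.0], top_k=2)
print(nearest)
```

Because every document vector is compared with the query, the cost grows linearly with the number of documents, which is fine for a handful of documents like the ones indexed here.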


In [4]:
class R2Former(torch.nn.Module):
    def __init__(self, input_dim, hidden_dim):
        super(R2Former, self).__init__()
        self.projection = torch.nn.Linear(input_dim, hidden_dim)  # Projection layer
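        # batch_first=True so inputs are read as (batch, num_docs, hidden_dim) and attention runs across documents
        self.self_attention = torch.nn.MultiheadAttention(embed_dim=hidden_dim, num_heads=8, batch_first=True)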

    def forward(self, retrieval_embeddings):
        projected_embeddings = self.projection(retrieval_embeddings)
        attn_output, _ = self.self_attention(projected_embeddings, projected_embeddings, projected_embeddings)
        return attn_output
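
To make the attention step in `R2Former` more concrete, here is a minimal pure-Python sketch of single-head scaled dot-product self-attention over a list of document vectors. It is a simplification under stated assumptions: `torch.nn.MultiheadAttention` additionally applies learned query, key, and value projections and splits the vectors across several heads, none of which is modelled here. The names `softmax` and `self_attention` are hypothetical helpers for this sketch.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    peak = max(scores)  # Subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(vectors):
    """Single-head scaled dot-product self-attention with no learned weights."""
    dim = len(vectors[0])
    outputs = []
    for query in vectors:
        # Similarity of this document to every document, scaled by sqrt(dim)
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in vectors]
        weights = softmax(scores)
        # Each output is a weighted average of all document vectors
        mixed = [sum(w * vec[d] for w, vec in zip(weights, vectors))
                 for d in range(dim)]
        outputs.append(mixed)
    return outputs


docs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
refined = self_attention(docs)
print(refined)
```

Each output vector mixes information from all retrieved documents, which is the effect the self-attention layer is meant to have on the retrieval embeddings.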


In [5]:
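def transform_retrievals(retriever, query, r2former):
    retrieved_docs = retriever.retrieve(query)
    embeddings = retriever.model.encode(retrieved_docs, convert_to_tensor=True)
    # Keep the embeddings on the same device as the R2Former weights
    device = next(r2former.parameters()).device
    embeddings = embeddings.to(device).unsqueeze(0)  # Shape: (1, num_docs, input_dim)
    with torch.no_grad():  # Inference only; the R2Former is not trained in this notebook
        refined_embeddings = r2former(embeddings)
    return refined_embeddings, retrieved_docs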


In [6]:
from openai import OpenAI

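def generate_response(query, refined_embeddings, retrieved_docs):
    # The Chat Completions API accepts text only, so refined_embeddings cannot be
    # passed to gpt-4o-mini; only the retrieved document text reaches the generator.
    retrieval_context = " ".join(retrieved_docs)
    r2ag_prompt = f"Context: {retrieval_context}\n\nQuery: {query}\nAnswer:"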
    
    client = OpenAI() 
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "system", "content": "You are an assistant."},
                  {"role": "user", "content": r2ag_prompt}]
    )
    return response.choices[0].message.content
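
`generate_response` reads the reply with `response.choices[0].message.content`. In version 1.x of the `openai` package the response is an object rather than a dictionary, which is the usual cause of a `'ChatCompletionMessage' object is not subscriptable` error when dictionary-style code such as `message["content"]` is reused. The sketch below reproduces the difference without calling the API; `FakeMessage` is a hypothetical stand-in, not a class from the `openai` package.

```python
from dataclasses import dataclass


@dataclass
class FakeMessage:
    """Hypothetical stand-in for the message object in a chat completion."""
    role: str
    content: str


message = FakeMessage(role="assistant", content="A dog is a domesticated mammal.")

try:
    text = message["content"]  # Dictionary-style access fails on a plain object
except TypeError as err:
    error_text = str(err)
    text = message.content  # Attribute access works

print(error_text)
print(text)
```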



In [7]:
retriever = DocumentRetriever()
retriever.build_index(["Document 1 text...", "Document 2 text...", "Document 3 text..."])  # Loading documents

r2former = R2Former(input_dim=384, hidden_dim=768)  # input_dim matches all-MiniLM-L6-v2 (384); hidden_dim=768 is an illustrative choice

query = "What is a dog?"
refined_embeddings, retrieved_docs = transform_retrievals(retriever, query, r2former)

response = generate_response(query, refined_embeddings, retrieved_docs)
print(response)


A dog is a domesticated mammal belonging to the species Canis lupus familiaris. It is a subspecies of the gray wolf and has been bred over thousands of years for various traits, making it one of the most diverse species in terms of breeds. Dogs are known for their loyalty, companionship, and ability to perform various tasks, including herding, hunting, guarding, and providing assistance to humans. They often serve as pets and can form strong bonds with their owners, exhibiting a range of emotions and behaviors. Dogs communicate through vocalizations, body language, and facial expressions.


# R2AG Evaluation :

To evaluate this model, we load the MuSiQue dataset with pandas. Each record pairs a question with its ground-truth answer, which we compare against the model's response.

In [8]:
import pandas as pd

splits = {'train': 'musique_ans_v1.0_train.jsonl', 'validation': 'musique_ans_v1.0_dev.jsonl'}
df = pd.read_json("hf://datasets/dgslibisey/MuSiQue/" + splits["train"], lines=True)


In [9]:
# For simplicity, we assume "question" and "answer" fields are strings
df = df[["question", "answer"]].dropna()

In [10]:
df = df.head(100)

In [11]:
df.shape

(100, 2)

### ChatGPT First Prompt :

"I want you to evaluate the above R2AG Model using the MuSiQue dataset i have imported in the code"

### ChatGPT Last Prompt :

"the F1 score and Rouge scores for the above model are low in general, What steps can i take to improve these scores?"

In [12]:
from sklearn.metrics import f1_score
from rouge_score import rouge_scorer

# Preprocessing text data
def preprocess_text(text):
    """Clean and preprocess text (remove non-alphanumeric characters, lowercasing)."""
    text = text.lower()
    text = ''.join(e for e in text if e.isalnum() or e.isspace())  # Removing punctuation
    return text.strip()

In [13]:
# F1 Score Evaluation
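def f1_score_evaluation(predicted_answer, true_answer):
    """
    Return a pair of binary labels for one example: (predicted_match, true_match).
    predicted_match is 1 only when the normalized answers are identical (exact match);
    true_match is always 1. The labels are aggregated into an F1 score later.
    """
    # Normalize both sides so casing and punctuation do not block a match
    predicted_match = 1 if preprocess_text(predicted_answer) == preprocess_text(true_answer) else 0
    true_match = 1
    return predicted_match, true_match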
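
Exact match is a strict criterion: a full-sentence response rarely equals a short ground-truth answer. One possible alternative, which is not what the notebook computes, is a token-level F1 of the kind commonly used in extractive question answering benchmarks, where precision and recall are measured over shared words. The sketch below is a minimal version under that assumption; `normalize` and `token_f1` are hypothetical names, and unlike some benchmark scripts it does not strip articles.

```python
from collections import Counter


def normalize(text):
    """Lowercase, drop punctuation, and split into tokens."""
    text = text.lower()
    text = "".join(ch for ch in text if ch.isalnum() or ch.isspace())
    return text.split()


def token_f1(predicted_answer, true_answer):
    """Token-overlap F1 between a predicted and a ground-truth answer."""
    pred_tokens = normalize(predicted_answer)
    true_tokens = normalize(true_answer)
    if not pred_tokens or not true_tokens:
        # Both empty counts as a match; one empty counts as a miss
        return float(pred_tokens == true_tokens)
    # Multiset intersection counts each shared token at most as often as it appears in both
    overlap = sum((Counter(pred_tokens) & Counter(true_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(true_tokens)
    return 2 * precision * recall / (precision + recall)


score = token_f1("The capital is Paris.", "Paris")
print(round(score, 2))  # 0.4: precision 1/4, recall 1/1
```

Under this metric a verbose but correct response still earns partial credit, whereas exact match would score it as 0.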

In [14]:
# ROUGE Evaluation
def rouge_evaluation(predicted_answer, true_answer):
    """
    Evaluate using ROUGE, which measures unigram (ROUGE-1), bigram (ROUGE-2), and longest-common-subsequence (ROUGE-L) overlap.
    """
    # Defining the ROUGE types we want to compute
    rouge_types = ['rouge1', 'rouge2', 'rougeL']
    
    # Initialize the RougeScorer with the specified rouge_types
    scorer = rouge_scorer.RougeScorer(rouge_types)
    
    # Compute the ROUGE scores
    scores = scorer.score(true_answer, predicted_answer)
    
    return scores
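
To show what the `rougeL` entry measures, the following sketch computes a ROUGE-L style F-measure by hand from the longest common subsequence (LCS) of the two token lists. It is an approximation for illustration: `lcs_length` and `rouge_l_f1` are hypothetical helpers, and the whitespace tokenization used here is simpler than the tokenizer in the `rouge_score` package, so values may differ slightly from `RougeScorer`.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)  # Extend the common subsequence
            else:
                curr.append(max(prev[j], curr[j - 1]))  # Carry the best so far
        prev = curr
    return prev[-1]


def rouge_l_f1(predicted_answer, true_answer):
    """F-measure from LCS-based precision and recall."""
    pred = predicted_answer.lower().split()
    ref = true_answer.lower().split()
    if not pred or not ref:
        return 0.0
    lcs = lcs_length(pred, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(pred)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


score = rouge_l_f1("the cat sat on the mat", "the cat is on the mat")
print(round(score, 4))  # 0.8333: the LCS has 5 tokens out of 6 on each side
```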

In [15]:
def evaluate_model(df, retriever, r2former):
    """
    This function evaluates the R2AG model based on the dataset provided.
    It calculates F1 score and ROUGE score for each example.
    """
    f1_scores = []
    rouge_scores = []
    model_responses = []

    for _, row in df.iterrows():
        # Step 1: Get query, true answer, and retrieve relevant documents
        question = preprocess_text(row["question"])  # Preprocess question
        true_answer = preprocess_text(row["answer"])  # Preprocess true answer
        refined_embeddings, retrieved_docs = transform_retrievals(retriever, question, r2former)

        # Step 2: Generate model response
        model_response = generate_response(question, refined_embeddings, retrieved_docs)

        # Step 3: Perform F1 and ROUGE evaluations
        f1_pred, f1_true = f1_score_evaluation(model_response, true_answer)
        rouge_result = rouge_evaluation(model_response, true_answer)

        f1_scores.append(f1_pred)
        rouge_scores.append(rouge_result['rouge1'].fmeasure)  # Using ROUGE-1 F1 score for simplicity
        model_responses.append(model_response)

    # Step 4: Calculate overall F1 and ROUGE scores
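    # Compare predictions against the true labels (always 1), not against themselves
    overall_f1 = f1_score([1] * len(f1_scores), f1_scores, zero_division=0)  # The F1 Score will be 1 if all predictions are correct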
    overall_rouge = sum(rouge_scores) / len(rouge_scores)  # Average ROUGE F1 score for all examples

    print(f"Overall F1 Score: {overall_f1:.4f}")
    print(f"Overall ROUGE-1 F1 Score: {overall_rouge:.4f}")

    # Step 5: Create DataFrame for results
    results = pd.DataFrame({
        "question": df["question"],
        "true_answer": df["answer"],
        "model_response": model_responses,
        "f1_score": f1_scores,
        "rouge1_f1_score": rouge_scores
    })

    return results