In this notebook, we evaluate our retrieval and RAG mechanisms. <br> <br> <br>
Retrieval evaluations <br>
First, we need a ground truth dataframe. We use only a sample of the prepared data: there are around 500 chunks, so running on all of them would consume a lot of OpenAI credits and, more importantly, breach the rate limit set by OpenAI (https://platform.openai.com/docs/guides/rate-limits). If you are using an offline LLM system, feel free to use the entire dataset. We took a sample of 50 chunks for the evaluations. We then construct a prompt that generates 5 meaningful questions a user could potentially ask the system about each chunk. To keep these questions relevant, the prompt needs additional context about the kind of data being analyzed and the typos, accent-related errors, and inaccuracies to expect, since we are working with translated YouTube transcripts. I used ChatGPT to construct this prompt. I generated 5 questions for each chunk in my sample, parsed the responses with some text processing, and saved the result as the ground truth data. <br><br>
The next step is to perform the retrieval and evaluate it. I'm using the FAISS (Facebook AI Similarity Search) library for vector search and the sentence-transformers library to create embeddings. I have used <b> 2 </b> approaches to evaluate retrieval. The first tests 2 text fields in our dataframe to determine which is better to use for retrieval. 
The second evaluates 3 sentence transformer models and selects the one with the best performance. Both evaluations use hit rate and MRR (mean reciprocal rank) as metrics. <br><br><br>
RAG evaluations <br>
To evaluate the RAG, I have used <b> 2 </b> prompts with slight variations. The first is oriented toward reliability and asks the model to cross-reference the web for additional information; the second is more balanced and leaves more room for intuition. I have <b> also </b> tested whether changing the number of nearest neighbors returned by the retrieval system affects the quality of the generated answer. The quality of the RAG output is evaluated <b> critically </b> by a GPT model using a separate prompt, on a scale of 1 to 5 with 5 being the best. <br><br><br>
<b> Final conclusion </b> (See data below) <br>
Retrieval <br>
Sentence transformer to use -> all-mpnet-base-v2 <br>
Field for searching -> RAG Text <br>
RAG <br>
Use Prompt 2  <br>
Performance is the same for both K values, but use K = 5 for more context  <br>

In [1]:
# Import libraries
import os
import pandas as pd
import numpy as np
import faiss   
from sentence_transformers import SentenceTransformer
from openai import OpenAI
import time
import re
# Optional
import warnings
warnings.filterwarnings("ignore")



In [38]:
# # Read the prepared data
# df = pd.read_excel('prepared_data.xlsx')
# df.head()

Unnamed: 0,Video Series,Video Name,Video Link,Start Time,Original Text,RAG Text
0,FM Nirmala Sitharamans Reply On Union Budget,FM_Nirmala_Sitharamans_reply_on_Union_Budget_f...,https://www.youtube.com/watch?v=kYHWCD7FZgQ,2.959,"No love, one love, Premchandra ji, please, ple...",FM Nirmala Sitharamans Reply On Union Budget |...
1,FM Nirmala Sitharamans Reply On Union Budget,FM_Nirmala_Sitharamans_reply_on_Union_Budget_f...,https://www.youtube.com/watch?v=kYHWCD7FZgQ,301.199,With in the a with in the total expenditure si...,FM Nirmala Sitharamans Reply On Union Budget |...
2,FM Nirmala Sitharamans Reply On Union Budget,FM_Nirmala_Sitharamans_reply_on_Union_Budget_f...,https://www.youtube.com/watch?v=kYHWCD7FZgQ,619.04,What the Budget 2024 25 Tries to bring in a ba...,FM Nirmala Sitharamans Reply On Union Budget |...
3,FM Nirmala Sitharamans Reply On Union Budget,FM_Nirmala_Sitharamans_reply_on_Union_Budget_f...,https://www.youtube.com/watch?v=kYHWCD7FZgQ,943.399,Sir some members of the general public have ra...,FM Nirmala Sitharamans Reply On Union Budget |...
4,FM Nirmala Sitharamans Reply On Union Budget,FM_Nirmala_Sitharamans_reply_on_Union_Budget_f...,https://www.youtube.com/watch?v=kYHWCD7FZgQ,1166.24,sir now I come to the general B in which I go ...,FM Nirmala Sitharamans Reply On Union Budget |...


### Create a ground truth dataset

In [39]:
# # Create sample dataset of 50 records
# df_sample = df.sample(n=50, random_state=4)
# df_sample = df_sample.reset_index(drop=True)
# df_sample

Unnamed: 0,Video Series,Video Name,Video Link,Start Time,Original Text,RAG Text
0,LS Question Hour Budget Session,LS_Question Hour_Budget Session 2024_26 July 2024,https://www.youtube.com/watch?v=SNKvQYusLRs&li...,1006.759,"Gaurav Gog ji, Chairman, I want to express my ...",LS Question Hour Budget Session | Gaurav Gog j...
1,LS Question Hour Budget Session,LS_Question Hour_Budget Session 2024_29 July 2024,https://www.youtube.com/watch?v=c2BEW8kUUQ8&li...,812.639,"Speaker, Sir, I, through you, request the Hono...","LS Question Hour Budget Session | Speaker, Sir..."
2,LS Question Hour Budget Session,LS_Question Hour_Budget Session 2024_08 August...,https://www.youtube.com/watch?v=IKxbiA_jQVc&li...,193.44,Sukant Kumar [Music] Page Sir Vat The current...,LS Question Hour Budget Session | Sukant Kumar...
3,FM Nirmala Sitharamans Reply On Union Budget,FM_Nirmala_Sitharamans_reply_on_Union_Budget_f...,https://www.youtube.com/watch?v=kYHWCD7FZgQ,4825.8,"Sir, before I come to the conclusion one, I wi...",FM Nirmala Sitharamans Reply On Union Budget |...
4,LS Question Hour Budget Session,LS_Question Hour_Budget Session 2024_29 July 2024,https://www.youtube.com/watch?v=c2BEW8kUUQ8&li...,3091.52,Quay No. 85 TA Mahesh Kumar Honorable Member H...,LS Question Hour Budget Session | Quay No. 85 ...
5,LS Question Hour Budget Session,LS_Question Hour_Budget Session 2024_22 July 2024,https://www.youtube.com/watch?v=tKUxL9C2xRM&li...,523.56,speaker sir rebel member re question regarding...,LS Question Hour Budget Session | speaker sir ...
6,LS Question Hour Budget Session,LS_Question Hour_Budget Session 2024_01 August...,https://www.youtube.com/watch?v=JLKf3WlykVU&li...,3257.799,"Mr. Pappu Yadav ji Pappu Yadav, Minister Sir, ...",LS Question Hour Budget Session | Mr. Pappu Ya...
7,LS Question Hour Budget Session,LS_Question Hour_Budget Session 2024_08 August...,https://www.youtube.com/watch?v=IKxbiA_jQVc&li...,3562.4,"End of question. Honorable members, I have rec...",LS Question Hour Budget Session | End of quest...
8,LS Question Hour Budget Session,LS_Question Hour_Budget Session 2024_01 August...,https://www.youtube.com/watch?v=JLKf3WlykVU&li...,3357.4,"Honorable Speaker, since another route has bee...",LS Question Hour Budget Session | Honorable Sp...
9,LS Question Hour Budget Session,LS_Question Hour_Budget Session 2024_01 August...,https://www.youtube.com/watch?v=JLKf3WlykVU&li...,2153.68,"sir, we are honorable members. We can underst...","LS Question Hour Budget Session | sir, we are ..."


In [40]:
# # Save for use later
# # df_sample.to_excel('evaluations_data.xlsx',index=False)
# df_sample = pd.read_excel('evaluations_data.xlsx')
# df_sample.head()

Unnamed: 0,Video Series,Video Name,Video Link,Start Time,Original Text,RAG Text
0,LS Question Hour Budget Session,LS_Question Hour_Budget Session 2024_26 July 2024,https://www.youtube.com/watch?v=SNKvQYusLRs&li...,1006.759,"Gaurav Gog ji, Chairman, I want to express my ...",LS Question Hour Budget Session | Gaurav Gog j...
1,LS Question Hour Budget Session,LS_Question Hour_Budget Session 2024_29 July 2024,https://www.youtube.com/watch?v=c2BEW8kUUQ8&li...,812.639,"Speaker, Sir, I, through you, request the Hono...","LS Question Hour Budget Session | Speaker, Sir..."
2,LS Question Hour Budget Session,LS_Question Hour_Budget Session 2024_08 August...,https://www.youtube.com/watch?v=IKxbiA_jQVc&li...,193.44,Sukant Kumar [Music] Page Sir Vat The current...,LS Question Hour Budget Session | Sukant Kumar...
3,FM Nirmala Sitharamans Reply On Union Budget,FM_Nirmala_Sitharamans_reply_on_Union_Budget_f...,https://www.youtube.com/watch?v=kYHWCD7FZgQ,4825.8,"Sir, before I come to the conclusion one, I wi...",FM Nirmala Sitharamans Reply On Union Budget |...
4,LS Question Hour Budget Session,LS_Question Hour_Budget Session 2024_29 July 2024,https://www.youtube.com/watch?v=c2BEW8kUUQ8&li...,3091.52,Quay No. 85 TA Mahesh Kumar Honorable Member H...,LS Question Hour Budget Session | Quay No. 85 ...


In [7]:
# Use llm to generate questions

In [8]:
# # Authenticate OpenAI 
# # Read key
# with open(os.path.join(os.path.abspath(os.path.join(os.getcwd(),'..')),'OpenAI Key.txt'), 'r') as file:
#     key = file.read()
    
# client = OpenAI(
#     api_key=key,
# )

In [9]:
# # Create function to make llm requests

# # Prompt to create questions - chatgpt generated
# prompt = """You are analyzing transcripts from Indian Parliament discussions, particularly focused on the question hour and budget discussions, including remarks and responses from the ruling party and the opposition. The transcripts are provided in the format: "Video series name | Text". These transcripts contain errors related to grammar, regional accents, incorrect numbers, and out-of-context words. Your task is to create a set of five possible questions that a user might ask based on the corrected version of the text. These questions should consider the context, important details, potential inaccuracies, and overall meaning of the discussion.
    
#     Please respond strictly with the questions. Separate each question with a line breaker.
    
#     Text: """

# def llm_requests(prompt, prompt_text, client, model_name):
    
#     prompt += prompt_text
    
#     # Make the request
#     response = client.chat.completions.create(
#         model=model_name,
#         messages=[
#             {
#                     "role": "user", 
#                     "content": prompt
#             }
#                 ]
#                                             )
    
# #     return the response from OpenAI
#     return response
    

In [10]:
# # Make request for each question 
# df_sample['Questions'] = [llm_requests(prompt, text, client, 'gpt-4o-mini').choices[0].message.content for text in df_sample['RAG Text']]
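Since the rate limit is the main constraint on this step, one option is to wrap each request in a retry with exponential backoff. The sketch below is not part of the original pipeline: `call_with_backoff` and its parameters are hypothetical names, and it catches every exception for simplicity, whereas real use would likely narrow this to the client's rate-limit error.

```python
import time


def call_with_backoff(request_fn, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call request_fn, retrying with exponentially growing waits on failure.

    Hypothetical helper: request_fn is any zero-argument callable, for example
    a lambda wrapping one LLM request.
    """
    for attempt in range(max_attempts):
        try:
            return request_fn()
        except Exception:
            # Give up after the last attempt and surface the original error
            if attempt == max_attempts - 1:
                raise
            # Wait 1x, 2x, 4x, ... the base delay before retrying
            sleep(base_delay * 2 ** attempt)


# Demo with a fake request that fails twice, and a fake sleep that only
# records the delays instead of actually waiting
attempts = {"count": 0}


def flaky_request():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("rate limited")
    return "ok"


recorded_delays = []
result = call_with_backoff(flaky_request, sleep=recorded_delays.append)
print(result, recorded_delays)
```

In the notebook this would wrap the existing call, for example `call_with_backoff(lambda: llm_requests(prompt, text, client, 'gpt-4o-mini'))`.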

In [44]:
# # Save the questions to use later - to avoid overwriting/redoing!
# # df_sample.to_excel('evaluations_data_with_questions.xlsx',index=False)
# df_sample = pd.read_excel('evaluations_data_with_questions.xlsx')
# df_sample.head()

Unnamed: 0,Video Series,Video Name,Video Link,Start Time,Original Text,RAG Text,Questions
0,LS Question Hour Budget Session,LS_Question Hour_Budget Session 2024_26 July 2024,https://www.youtube.com/watch?v=SNKvQYusLRs&li...,1006.759,"Gaurav Gog ji, Chairman, I want to express my ...",LS Question Hour Budget Session | Gaurav Gog j...,What measures is the government implementing t...
1,LS Question Hour Budget Session,LS_Question Hour_Budget Session 2024_29 July 2024,https://www.youtube.com/watch?v=c2BEW8kUUQ8&li...,812.639,"Speaker, Sir, I, through you, request the Hono...","LS Question Hour Budget Session | Speaker, Sir...",What innovations have been introduced by the g...
2,LS Question Hour Budget Session,LS_Question Hour_Budget Session 2024_08 August...,https://www.youtube.com/watch?v=IKxbiA_jQVc&li...,193.44,Sukant Kumar [Music] Page Sir Vat The current...,LS Question Hour Budget Session | Sukant Kumar...,1. What is the current status of the Rural Ent...
3,FM Nirmala Sitharamans Reply On Union Budget,FM_Nirmala_Sitharamans_reply_on_Union_Budget_f...,https://www.youtube.com/watch?v=kYHWCD7FZgQ,4825.8,"Sir, before I come to the conclusion one, I wi...",FM Nirmala Sitharamans Reply On Union Budget |...,1. What challenges did FM Nirmala Sitharaman i...
4,LS Question Hour Budget Session,LS_Question Hour_Budget Session 2024_29 July 2024,https://www.youtube.com/watch?v=c2BEW8kUUQ8&li...,3091.52,Quay No. 85 TA Mahesh Kumar Honorable Member H...,LS Question Hour Budget Session | Quay No. 85 ...,What steps is the Union Government taking to a...


In [12]:
# # Parse and create the ground truth dataset
# df_text = []
# df_question = []

# for text, question in zip(df_sample['RAG Text'], df_sample['Questions']):
#     # Remove special characters created in LLM response
#     question = [i.strip() for i in question.split('\n') if i not in ['','---',' ','  ']]
    
#     for item in question:
#         df_text.append(text)
#         df_question.append(item)
        
# df_ground_truth = pd.DataFrame({'Text' : df_text, 'Question' : df_question})
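Some of the generated questions keep a leading number such as "1." or "4." from the LLM response. As an illustration only, the parsing step could also strip that numbering; `parse_questions` is a hypothetical helper, and the sample response below is made up.

```python
import re


def parse_questions(raw_response):
    """Split an LLM response into clean question strings, one per line."""
    questions = []
    for line in raw_response.split("\n"):
        line = line.strip()
        # Skip blank lines and separator lines such as '---'
        if not line or set(line) <= {"-"}:
            continue
        # Drop a leading list number like '1.' or '2)' if present
        line = re.sub(r"^\d+[.)]\s*", "", line)
        questions.append(line)
    return questions


sample_response = (
    "1. What is the status of the scheme?\n"
    "\n"
    "2) How much was allocated?\n"
    "---\n"
    "Who raised the question?  "
)
print(parse_questions(sample_response))
```

Whether to strip the numbering is a judgment call: it has no effect on which chunk a question belongs to, but it removes tokens that carry no meaning from the text that gets embedded.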

In [41]:
# # Save ground truth data to read later - to avoid redoing!
# # df_ground_truth.to_excel('ground_truth_data.xlsx',index=False)
# df_ground_truth = pd.read_excel('ground_truth_data.xlsx')
# df_ground_truth.head()

Unnamed: 0,Text,Question
0,LS Question Hour Budget Session | Gaurav Gog j...,What measures is the government implementing t...
1,LS Question Hour Budget Session | Gaurav Gog j...,How does the government evaluate and monitor t...
2,LS Question Hour Budget Session | Gaurav Gog j...,Can the Minister provide specific steps that h...
3,LS Question Hour Budget Session | Gaurav Gog j...,What feedback or complaints has the government...
4,LS Question Hour Budget Session | Gaurav Gog j...,Has the government made any changes to its dru...


### A - Retrieval

In [15]:
# # Functions for evaluation - Hit rate and MRR

# # Function to calculate hit rate
# def calculate_hit_rate(match_matrix):
#     return sum([1 if True in i else 0 for i in match_matrix])/len(match_matrix) 

# # Function to calculate MRR (mean reciprocal rank)
# def calculate_mrr(match_matrix):
#     # Accumulate the reciprocal rank of the first match for each question
#     score = 0
#     # Iterate through each list
#     for i in match_matrix:
#         for j in range(len(i)):
#             if i[j] == True:
#                 score += 1/(j+1)
#                 # Only the first match counts towards the reciprocal rank
#                 break
#     return score/len(match_matrix) 
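To make the two metrics concrete, here is a small worked example on a made-up match matrix. Each row is one question, each entry says whether the result at that rank is the chunk the question was generated from. The function names and data are illustrative, not the notebook's own.

```python
def hit_rate(match_matrix):
    """Fraction of questions whose source chunk appears anywhere in the top k."""
    return sum(any(row) for row in match_matrix) / len(match_matrix)


def mean_reciprocal_rank(match_matrix):
    """Average of 1/rank of the first correct result (0 when there is none)."""
    total = 0.0
    for row in match_matrix:
        for rank, is_match in enumerate(row, start=1):
            if is_match:
                total += 1 / rank
                break
    return total / len(match_matrix)


# Toy results for four questions with k = 3
toy_matches = [
    [True, False, False],   # correct chunk at rank 1 -> contributes 1
    [False, True, False],   # correct chunk at rank 2 -> contributes 1/2
    [False, False, False],  # miss                    -> contributes 0
    [False, False, True],   # correct chunk at rank 3 -> contributes 1/3
]

print(hit_rate(toy_matches))              # 3 of 4 questions hit -> 0.75
print(mean_reciprocal_rank(toy_matches))  # (1 + 1/2 + 0 + 1/3) / 4 = 11/24
```

Hit rate only asks whether the right chunk was retrieved at all, while MRR also rewards ranking it near the top, which is why MRR is never higher than hit rate.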

### Option 1 - Evaluate search fields for indexing

In [16]:
# # FAISS supports only vector search so multiple fields cannot be searched 
# # So here we will build the index for 2 fields and test which gives us better results
# # We will use hit rate and mrr to evaluate
# # Build index for each field 
# for field_name in ['Original Text', 'RAG Text']:
#     print(field_name)
#     # Create dictionaries to store the text vector embeddings
#     text_dict = {i:'' for i in df_sample[field_name]}
#     # Load the model - paraphrase-albert-small-v2 - Smallest available
#     model = SentenceTransformer('paraphrase-albert-small-v2')
#     # Create embeddings 
#     for text in text_dict.keys():
#         text_dict[text] = model.encode(text)
#     # Add all vectors together
#     all_embeddings = np.vstack(list(text_dict.values()))
#     # Build index
#     # Print dimensions of the embeddings 
#     print('Index dimensions: ' + str(all_embeddings.shape[1]))
#     d = all_embeddings.shape[1]
#     # Create a FAISS index
#     index = faiss.IndexFlatL2(d)  # L2 distance (Euclidean distance)
#     # Add the combined embeddings to the index
#     index.add(all_embeddings)
#     print('Number of vectors in the Index: ' + str(index.ntotal))
# #     Save the index
#     faiss.write_index(index,os.path.join(os.getcwd(),'indices',field_name + '_index_option_1.index'))    
# #     Create dictionaries to store the question vector embeddings
#     question_dict = {i:'' for i in df_ground_truth['Question']}
#     # Create embeddings 
#     for question in question_dict.keys():
#         question_dict[question] = model.encode([question])
#     # Set K 
#     k = 3
#     # Do the vector search 
#     # Get the indices of the nearest neighbors
#     # Check matches and create matrix to calculate hitrate and mrr
#     t_f_match_matrix = []
#     for question in question_dict.keys():
#         result = index.search(question_dict[question], k)[1][0] 
#         # Get the original matching answer/text to the question
#         original_text = df_ground_truth[df_ground_truth['Question'] == question]['Text'].values[0]
#         matches = [df_sample['RAG Text'][result_index] for result_index in result]
#         t_f_match_matrix.append([original_text == match for match in matches])
#     print('Hit Rate : ' + str(calculate_hit_rate(t_f_match_matrix)))
#     print('MRR : ' + str(calculate_mrr(t_f_match_matrix)))
#     print('\n')
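For intuition about what the flat L2 index is doing in the loop above: `IndexFlatL2` performs an exhaustive search, comparing the query with every stored vector and returning the closest ones by (squared) Euclidean distance. The pure-Python sketch below mimics that on tiny made-up 2-D vectors and then builds one row of the match matrix. The names are hypothetical and this is not how FAISS is implemented internally.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def search_top_k(query, vectors, k):
    """Return the positions of the k vectors closest to query, nearest first."""
    distances = [(squared_l2(query, vec), position) for position, vec in enumerate(vectors)]
    distances.sort()
    return [position for _, position in distances[:k]]


# Made-up 2-D "embeddings" for four chunks and one question
chunk_vectors = [
    [0.0, 0.0],
    [1.0, 0.0],
    [0.0, 2.0],
    [3.0, 3.0],
]
question_vector = [0.9, 0.1]

top_positions = search_top_k(question_vector, chunk_vectors, k=3)
print(top_positions)  # [1, 0, 2]

# Suppose the question was generated from chunk 0: the match row records
# at which rank (if any) that chunk came back
source_chunk = 0
match_row = [position == source_chunk for position in top_positions]
print(match_row)  # [False, True, False]
```

Here the source chunk comes back at rank 2, so this question counts as a hit and contributes 1/2 to the MRR.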

In [42]:
# Similar performance for both fields, which is surprising. I was expecting better from 'RAG Text'
# I'll still use RAG Text as it has additional context that can help with searching/RAG response
print(
"""
Option 1 results : 

Original Text
Index dimensions: 768
Number of vectors in the Index: 50
Hit Rate : 0.744
MRR : 0.6693333333333333


RAG Text
Index dimensions: 768
Number of vectors in the Index: 50
Hit Rate : 0.74
MRR : 0.6626666666666666
"""
)


Option 1 results : 

Original Text
Index dimensions: 768
Number of vectors in the Index: 50
Hit Rate : 0.744
MRR : 0.6693333333333333


RAG Text
Index dimensions: 768
Number of vectors in the Index: 50
Hit Rate : 0.74
MRR : 0.6626666666666666



### Option 2 - Evaluate multiple sentence transformers

In [18]:
# Pretrained models - https://www.sbert.net/docs/sentence_transformer/pretrained_models.html
# Option A - paraphrase-albert-small-v2 - Smallest model 
# Option B - all-mpnet-base-v2 - Best performance
# Option C - paraphrase-MiniLM-L3-v2 - Fastest model 

In [19]:
# # So here we will build the index for 3 models and test which gives us better results
# # We will use hit rate and mrr to evaluate
# # Build index for each model 
# for model_name in ['paraphrase-albert-small-v2','all-mpnet-base-v2','paraphrase-MiniLM-L3-v2']:
#     print(model_name)
#     # Create dictionaries to store the text vector embeddings - Using RAG Text as per results from option 1 evaluation
#     text_dict = {i:'' for i in df_sample['RAG Text']}
#     # Load the model being evaluated in this iteration
#     model = SentenceTransformer(model_name)
#     # Create embeddings 
#     for text in text_dict.keys():
#         text_dict[text] = model.encode(text)
#     # Add all vectors together
#     all_embeddings = np.vstack(list(text_dict.values()))
#     # Build index
#     # Print dimensions of the embeddings 
#     print('Index dimensions: ' + str(all_embeddings.shape[1]))
#     d = all_embeddings.shape[1]
#     # Create a FAISS index
#     index = faiss.IndexFlatL2(d)  # L2 distance (Euclidean distance)
#     # Add the combined embeddings to the index
#     index.add(all_embeddings)
#     print('Number of vectors in the Index: ' + str(index.ntotal))
# #     Save the index
#     faiss.write_index(index,os.path.join(os.getcwd(),'indices',model_name + '_index_option_2.index'))    
# #     Create dictionaries to store the question vector embeddings
#     question_dict = {i:'' for i in df_ground_truth['Question']}
#     # Create embeddings 
#     for question in question_dict.keys():
#         question_dict[question] = model.encode([question])
#     # Set K 
#     k = 3
#     # Do the vector search 
#     # Get the indices of the nearest neighbors
#     # Check matches and create matrix to calculate hitrate and mrr
#     t_f_match_matrix = []
#     for question in question_dict.keys():
#         result = index.search(question_dict[question], k)[1][0] 
#         # Get the original matching answer/text to the question
#         original_text = df_ground_truth[df_ground_truth['Question'] == question]['Text'].values[0]
#         matches = [df_sample['RAG Text'][result_index] for result_index in result]
#         t_f_match_matrix.append([original_text == match for match in matches])
#     print('Hit Rate : ' + str(calculate_hit_rate(t_f_match_matrix)))
#     print('MRR : ' + str(calculate_mrr(t_f_match_matrix)))
#     print('\n')

In [45]:
# Excellent jump in performance with all-mpnet-base-v2
print(
"""
Option 2 results : 

paraphrase-albert-small-v2
Index dimensions: 768
Number of vectors in the Index: 50
Hit Rate : 0.74
MRR : 0.6626666666666666


all-mpnet-base-v2
Index dimensions: 768
Number of vectors in the Index: 50
Hit Rate : 0.924
MRR : 0.8633333333333335


paraphrase-MiniLM-L3-v2
Index dimensions: 384
Number of vectors in the Index: 50
Hit Rate : 0.796
MRR : 0.7306666666666668

"""
)


Option 2 results : 

paraphrase-albert-small-v2
Index dimensions: 768
Number of vectors in the Index: 50
Hit Rate : 0.74
MRR : 0.6626666666666666


all-mpnet-base-v2
Index dimensions: 768
Number of vectors in the Index: 50
Hit Rate : 0.924
MRR : 0.8633333333333335


paraphrase-MiniLM-L3-v2
Index dimensions: 384
Number of vectors in the Index: 50
Hit Rate : 0.796
MRR : 0.7306666666666668




In [21]:
# Conclusion 
# Sentence transformer to use -> all-mpnet-base-v2
# Field for searching -> RAG Text

### B - RAG

In [22]:
# Evaluate which prompt format works better
# Evaluate if increasing K neighbors helps with better answers

In [24]:
# # For RAG evaluation, we will use only 10 questions for each test 
# df_rag_eval = df_ground_truth.sample(n=10, random_state=7)
# df_rag_eval

Unnamed: 0,Text,Question
148,LS Question Hour Budget Session | Like I Menti...,4. What challenges have been faced in ensuring...
219,LS Question Hour Budget Session | Honorable Ch...,How does the number of farmers covered under t...
94,LS Question Hour Budget Session | I also wish ...,What implications could arise from the lack of...
84,Rahul Gandhi Discussion On Union Budget | Hi R...,What preparations did Rahul Gandhi express he ...
3,LS Question Hour Budget Session | Gaurav Gog j...,What feedback or complaints has the government...
121,LS Question Hour Budget Session | Honorable Sp...,What are the key features of the Software Tech...
97,FM Nirmala Sitharamans Reply On Union Budget |...,"3. In her remarks, what discrepancies does FM ..."
236,LS Question Hour Budget Session | Coal but in...,How does the current budget for renewable ener...
138,LS Question Hour Budget Session | Honorable Sp...,4. What are the implications of the member's c...
122,LS Question Hour Budget Session | Honorable Sp...,Can you clarify the relationship between the M...


In [25]:
# Below are 2 prompts I will test
# Prompt 1 is focused more on reliability 
# Prompt 2 is more balanced for intuition

def prompt_1(question, context):
    return """You are answering questions based on transcripts from Indian Parliament discussions, particularly focused on the question hour and budget discussions. The question hour is a critical part of Lok Sabha parliamentary proceedings where Members of Parliament ask questions to the ruling party or ministers, who must respond directly. These questions cover a wide range of topics, from policy matters to the specifics of governance. 

These discussions involve several key figures, including Rahul Gandhi, a prominent opposition leader from the Indian National Congress, and Nirmala Sitharaman, the current Finance Minister of India from the ruling Bharatiya Janata Party (BJP). However, many other ministers and Members of Parliament also contribute to these discussions, raising questions, presenting policies, or defending government decisions.

The provided transcripts still contain errors related to grammar, regional accents, incorrect numbers, and out-of-context words. These questions were generated from corrected versions of the text, but you must now answer them using the provided transcripts, which may still have inaccuracies. Use your understanding to correct and interpret these texts accurately, and cross-reference any statistics or specific claims to ensure your answer is informed and reliable. Please include any relevant web results if available to support your response.

Question: """ + question + """

Text for Context: """ + context



def prompt_2(question, context):
    return """You are tasked with answering questions based on transcripts from Indian Parliament discussions, particularly focused on the question hour and budget sessions. The question hour is an essential mechanism in the Lok Sabha where Members of Parliament question ministers on various topics, particularly on accountability and implementation of policies.

In the context of these discussions, several ministers and Members of Parliament are involved. Key figures include Rahul Gandhi, a leading opposition member from the Indian National Congress, often questioning the government’s policies, and Nirmala Sitharaman, the Finance Minister, responsible for defending the budget and outlining the government’s economic policies. However, many other ministers across various departments are also central to these debates, either presenting or defending specific proposals.

The texts still contain errors related to grammar, regional accents, incorrect numbers, and out-of-context words. Although the questions were generated from corrected text, your answers should be based on the provided transcripts, taking care to verify and correct any statistics or specific details mentioned. Aim to provide a clear, accurate, and well-informed response. Include any relevant web results if they help substantiate your answer.

Question: """ + question + """

Text for Context: """ + context


In [26]:
# Prompt to evaluate RAG response
# Evaluate on a scale of 1 to 5 with 5 being best
def prompt_rag_response_eval(question, answer):
    return f"""You are tasked with critically evaluating the following answer provided by an LLM system in response to a user-generated question. Your rating should be based on how well the answer addresses the question, the clarity of the response, and its general accuracy (without needing the specific context). Be extremely critical in your judgment, focusing on the following criteria:

    - **Relevance**: Does the answer directly address the user’s question?
    - **Clarity**: Is the answer clearly written and easy to understand?
    - **Accuracy**: Does the answer seem factually correct and logically sound, even though the full context is not provided?

    Rate the answer on a scale of 1 to 5:
    - 1: Poor – The answer fails to address the question or is highly unclear and inaccurate.
    - 2: Below Average – The answer has some relevance but lacks clarity or accuracy.
    - 3: Average – The answer addresses the question but may have noticeable issues in clarity or accuracy.
    - 4: Good – The answer is mostly relevant, clear, and accurate with minor issues.
    - 5: Excellent – The answer is fully relevant, clear, and seems accurate.

    Provide only the rating as a single number from 1 to 5.

    User Question: {question}
    LLM System Answer: {answer}
    """
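The judge is asked for a single number, but its reply is still free text, so the rating has to be extracted before the configurations can be compared. The sketch below shows one way to parse the reply and average the ratings per combination of K and prompt version. The helper names and toy records are hypothetical, and skipping replies without a rating is an assumption rather than the notebook's behavior.

```python
import re
from collections import defaultdict


def parse_rating(reply):
    """Return the first standalone digit from 1 to 5 in the reply, or None."""
    match = re.search(r"\b([1-5])\b", str(reply))
    return int(match.group(1)) if match else None


def mean_rating_by_combination(records):
    """Average rating per 'K | Prompt' combination, ignoring unparseable replies."""
    buckets = defaultdict(list)
    for k_value, prompt_version, reply in records:
        rating = parse_rating(reply)
        if rating is not None:
            buckets[f"{k_value} | {prompt_version}"].append(rating)
    return {combo: sum(ratings) / len(ratings) for combo, ratings in buckets.items()}


# Made-up judge replies for two configurations
toy_records = [
    (3, "Prompt 1", "4"),
    (3, "Prompt 1", "Rating: 3"),
    (5, "Prompt 2", 5),                  # already numeric, e.g. after reading from Excel
    (5, "Prompt 2", "no rating given"),  # skipped
]

print(mean_rating_by_combination(toy_records))
```

Converting the reply with `str()` first keeps the parser working whether the rating arrives as text from the API or as an integer after a round trip through a spreadsheet.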

In [27]:
print(prompt_1('Sample Question', 'Sample Context'))
print('\n'*2)
print(prompt_2('Sample Question', 'Sample Context'))
print('\n'*2)
print(prompt_rag_response_eval('Sample Question', 'Sample Response'))

You are answering questions based on transcripts from Indian Parliament discussions, particularly focused on the question hour and budget discussions. The question hour is a critical part of Lok Sabha parliamentary proceedings where Members of Parliament ask questions to the ruling party or ministers, who must respond directly. These questions cover a wide range of topics, from policy matters to the specifics of governance. 

These discussions involve several key figures, including Rahul Gandhi, a prominent opposition leader from the Indian National Congress, and Nirmala Sitharaman, the current Finance Minister of India from the ruling Bharatiya Janata Party (BJP). However, many other ministers and Members of Parliament also contribute to these discussions, raising questions, presenting policies, or defending government decisions.

The provided transcripts still contain errors related to grammar, regional accents, incorrect numbers, and out-of-context words. These questions were genera

In [28]:
# # Perform RAG evaluation for both prompts and 2 values of K i.e. 3 and 5
# # Save results in a dataframe
# df_question = []
# df_answer = []
# df_evaluation_rating = []
# df_prompt_for_answer = []
# df_prompt_to_evaluate = []
# df_number_of_neighbors_in_context = []
# df_prompt_version = []

# # Prepare for search
# # Read the saved index - Best performing
# index = faiss.read_index(os.path.join(os.getcwd(),'indices','all-mpnet-base-v2_index_option_2.index'))
# # load model for sentence transformation
# model = SentenceTransformer('all-mpnet-base-v2')

# # Create dataframe to save results
# # Counter to track progress
# counter = 1

# for question in df_rag_eval['Question']:
#     question_vector = model.encode([question])
#     # Testing k variation
#     for k_value in [3,5]:
#         result = index.search(question_vector, k_value)[1][0] 
#         matches = [df_sample['RAG Text'][result_index] for result_index in result]
#         # Testing prompt variations
#         for prompt in [prompt_1, prompt_2]:
#             print(counter)
#             # The '.' is only a placeholder for prompt_text: llm_requests appends it to the prompt, so the full prompt is passed as the first argument
#             answer = llm_requests(prompt(question, '\n'.join(matches)), '.', client, 'gpt-4o').choices[0].message.content
#             evaluation = llm_requests(prompt_rag_response_eval(question, answer), '.', client, 'gpt-4o').choices[0].message.content            
#             #Update records
#             df_question.append(question)
#             df_answer.append(answer)
#             df_evaluation_rating.append(evaluation)
#             # Store the exact prompts that were sent
#             df_prompt_for_answer.append(prompt(question, '\n'.join(matches)))
#             df_prompt_to_evaluate.append(prompt_rag_response_eval(question, answer))
#             df_number_of_neighbors_in_context.append(k_value)
#             #Update version
#             if prompt == prompt_1:
#                 df_prompt_version.append('Prompt 1')
#             else:
#                 df_prompt_version.append('Prompt 2')                
#             counter += 1
#             time.sleep(3)

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40


In [29]:
# # Create the dataframe
# df_rag_eval_results = pd.DataFrame({
#     'Question' : df_question, 
#     'Answer' : df_answer,
#     'Evaluation Rating' : df_evaluation_rating,
#     'Answer Prompt' : df_prompt_for_answer,
#     'Evaluation Prompt' : df_prompt_to_evaluate,
#     'K Neighbors' : df_number_of_neighbors_in_context,
#     'Prompt Version' : df_prompt_version
# })

In [30]:
# # Save results
# df_rag_eval_results.to_excel('df_rag_evaluation_results.xlsx',index=False)
# df_rag_eval_results.head()

Unnamed: 0,Question,Answer,Evaluation Rating,Answer Prompt,Evaluation Prompt,K Neighbors,Prompt Version
0,4. What challenges have been faced in ensuring...,The challenges faced in ensuring that passenge...,4,You are answering questions based on transcrip...,You are tasked with critically evaluating the ...,3,Prompt 1
1,4. What challenges have been faced in ensuring...,Ensuring that passenger information aligns wit...,4,You are tasked with answering questions based ...,You are tasked with critically evaluating the ...,3,Prompt 2
2,4. What challenges have been faced in ensuring...,Ensuring that passenger information aligns wit...,4,You are answering questions based on transcrip...,You are tasked with critically evaluating the ...,5,Prompt 1
3,4. What challenges have been faced in ensuring...,"According to the provided transcript, several ...",4,You are tasked with answering questions based ...,You are tasked with critically evaluating the ...,5,Prompt 2
4,How does the number of farmers covered under t...,The corrected transcript provides several impo...,3,You are answering questions based on transcrip...,You are tasked with critically evaluating the ...,3,Prompt 1


In [32]:
# # Read RAG evaluation results
# df_rag_eval_results = pd.read_excel('df_rag_evaluation_results.xlsx')
# df_rag_eval_results.head()

Unnamed: 0,Question,Answer,Evaluation Rating,Answer Prompt,Evaluation Prompt,K Neighbors,Prompt Version
0,4. What challenges have been faced in ensuring...,The challenges faced in ensuring that passenge...,4,You are answering questions based on transcrip...,You are tasked with critically evaluating the ...,3,Prompt 1
1,4. What challenges have been faced in ensuring...,Ensuring that passenger information aligns wit...,4,You are tasked with answering questions based ...,You are tasked with critically evaluating the ...,3,Prompt 2
2,4. What challenges have been faced in ensuring...,Ensuring that passenger information aligns wit...,4,You are answering questions based on transcrip...,You are tasked with critically evaluating the ...,5,Prompt 1
3,4. What challenges have been faced in ensuring...,"According to the provided transcript, several ...",4,You are tasked with answering questions based ...,You are tasked with critically evaluating the ...,5,Prompt 2
4,How does the number of farmers covered under t...,The corrected transcript provides several impo...,3,You are answering questions based on transcrip...,You are tasked with critically evaluating the ...,3,Prompt 1


In [43]:
# # It took on average 38 seconds to process each question

In [34]:
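# # Add 'Rating' and 'Combination' columns to help evaluate results
# # str(i) keeps the regex working when 'Evaluation Rating' is read back from Excel as integers
# df_rag_eval_results['Rating'] = [int(re.search(r'\b([1-5])\b', str(i)).group(1)) for i in df_rag_eval_results['Evaluation Rating']]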
# df_rag_eval_results['Combination'] = [str(i) + ' | ' + j for i,j in zip(df_rag_eval_results['K Neighbors'], df_rag_eval_results['Prompt Version'])]
# df_rag_eval_results.to_excel('df_rag_evaluation_results.xlsx',index=False)
# df_rag_eval_results

Unnamed: 0,Question,Answer,Evaluation Rating,Answer Prompt,Evaluation Prompt,K Neighbors,Prompt Version,Rating,Combination
0,4. What challenges have been faced in ensuring...,The challenges faced in ensuring that passenge...,4,You are answering questions based on transcrip...,You are tasked with critically evaluating the ...,3,Prompt 1,4,3 | Prompt 1
1,4. What challenges have been faced in ensuring...,Ensuring that passenger information aligns wit...,4,You are tasked with answering questions based ...,You are tasked with critically evaluating the ...,3,Prompt 2,4,3 | Prompt 2
2,4. What challenges have been faced in ensuring...,Ensuring that passenger information aligns wit...,4,You are answering questions based on transcrip...,You are tasked with critically evaluating the ...,5,Prompt 1,4,5 | Prompt 1
3,4. What challenges have been faced in ensuring...,"According to the provided transcript, several ...",4,You are tasked with answering questions based ...,You are tasked with critically evaluating the ...,5,Prompt 2,4,5 | Prompt 2
4,How does the number of farmers covered under t...,The corrected transcript provides several impo...,3,You are answering questions based on transcrip...,You are tasked with critically evaluating the ...,3,Prompt 1,3,3 | Prompt 1
5,How does the number of farmers covered under t...,"In the current crop insurance scheme, the Indi...",3,You are tasked with answering questions based ...,You are tasked with critically evaluating the ...,3,Prompt 2,3,3 | Prompt 2
6,How does the number of farmers covered under t...,To address your query regarding the comparison...,3,You are answering questions based on transcrip...,You are tasked with critically evaluating the ...,5,Prompt 1,3,5 | Prompt 1
7,How does the number of farmers covered under t...,"Based on the provided transcript, there are di...",2,You are tasked with answering questions based ...,You are tasked with critically evaluating the ...,5,Prompt 2,2,5 | Prompt 2
8,What implications could arise from the lack of...,The implications of not having fixed ceiling p...,4,You are answering questions based on transcrip...,You are tasked with critically evaluating the ...,3,Prompt 1,4,3 | Prompt 1
9,What implications could arise from the lack of...,The implications of not having fixed ceiling p...,4,You are tasked with answering questions based ...,You are tasked with critically evaluating the ...,3,Prompt 2,4,3 | Prompt 2
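
The rating extraction and the combination label used in the cell above can be sketched as two small helpers. This is a minimal sketch, not the notebook's actual code: `parse_rating` and `combination_label` are hypothetical names, and returning `None` when no 1-5 score is found is an assumption (the original list comprehension would raise instead).

```python
import re
from typing import Optional

# Same pattern as in the cell above: a standalone digit between 1 and 5.
RATING_PATTERN = re.compile(r"\b([1-5])\b")


def parse_rating(raw) -> Optional[int]:
    """Return the first standalone 1-5 digit in the evaluator output, or None."""
    # str() covers values that come back from Excel as integers.
    match = RATING_PATTERN.search(str(raw))
    return int(match.group(1)) if match else None


def combination_label(k_neighbors, prompt_version) -> str:
    """Build the 'K | Prompt' label stored in the 'Combination' column."""
    return f"{k_neighbors} | {prompt_version}"
```

Rows where `parse_rating` returns `None` can be inspected by hand before aggregating, so that one malformed evaluator response does not stop the whole run.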


In [35]:
# # Evaluation of results - Check for sum and average for each combination
# df_rag_eval_results = pd.read_excel('df_rag_evaluation_results.xlsx')
# df_rag_eval_results.groupby('Combination')['Rating'].agg(['sum', 'mean'])

Combination,sum,mean
3 | Prompt 1,35,3.5
3 | Prompt 2,35,3.5
5 | Prompt 1,34,3.4
5 | Prompt 2,35,3.5
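
Since several combinations end up with the same mean, it helps to list every combination that shares the top score rather than only the first one. The sketch below assumes pandas is installed and uses only the ten rows displayed earlier in this notebook, not the full results file, so its numbers differ from the table above. The names `sample`, `summary` and `tied_best` are illustrative.

```python
import pandas as pd

# Ratings and combinations copied from the ten rows displayed above
# (a partial sample, NOT the full evaluation results).
sample = pd.DataFrame({
    "Combination": [
        "3 | Prompt 1", "3 | Prompt 2", "5 | Prompt 1", "5 | Prompt 2",
        "3 | Prompt 1", "3 | Prompt 2", "5 | Prompt 1", "5 | Prompt 2",
        "3 | Prompt 1", "3 | Prompt 2",
    ],
    "Rating": [4, 4, 4, 4, 3, 3, 3, 2, 4, 4],
})

# Sum, mean and row count per combination; the count shows how many
# questions back each mean.
summary = sample.groupby("Combination")["Rating"].agg(["sum", "mean", "count"])

# Collect every combination that shares the highest mean, so ties stay visible.
best_mean = summary["mean"].max()
tied_best = summary.index[summary["mean"] == best_mean].tolist()

print(summary)
print("Best mean rating:", round(best_mean, 2), "shared by:", tied_best)
```

Adding `count` next to `sum` and `mean` shows how few questions sit behind each score. With about ten questions per combination, a 0.1 difference in mean corresponds to a single rating point.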


In [None]:
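# Final conclusion
# Use Prompt 2
# Use K = 5
# Note: 3 | Prompt 1, 3 | Prompt 2 and 5 | Prompt 2 tie at a mean rating of 3.5; only 5 | Prompt 1 is lower (3.4)
# Sentence transformer to use -> all-mpnet-base-v2
# Field for searching -> RAG Text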