This notebook prepares the materials used in our final app. <br>
Summaries - First, we create a summary of each video in our knowledge base. The app shows these summaries so that users get better context about the project and can ask better questions of the RAG system. For each video, we split the full transcript into 2 sections, summarise each section, and then combine the two into a final summary. Splitting is necessary because the transcripts are long and can exceed the model's token limit. <br>
Query rewriting function - We rewrite the user's question/query using an LLM. A summary of all the summaries created in section A is included in the prompt so that the rewritten question stays relevant to our RAG knowledge base. <br>
Reranking - We implement a mechanism that reranks the results of the retrieval system to give us better matches, again using the LLM. We retrieve the k=10 nearest neighbors and ask the LLM to rerank them and select the top 5.

In [5]:
# Import libraries
import os
import pandas as pd
import numpy as np
import faiss   
from sentence_transformers import SentenceTransformer
from openai import OpenAI
import time
import re
# Optional
import warnings
warnings.filterwarnings("ignore")

### A - Create summaries of the videos to use in the app

The app will have a Q&A section. However, because the topic of this project is open-ended, users may find it difficult to come up with relevant questions. <br>
To make the app useful and help users ask meaningful questions, we will have a separate page where the user can read the summary of each video included in this project.

In [2]:
# # Read the prepared data
# df = pd.read_excel('prepared_data.xlsx')
# df.sort_values(by=['Start Time'])

Unnamed: 0,Video Series,Video Name,Video Link,Start Time,Original Text,RAG Text
252,LS Question Hour Budget Session,LS_Question Hour_Budget Session 2024_09 August...,https://www.youtube.com/watch?v=4tboiTlhtUk&li...,0.080,Best wishes for future success Question Hour Q...,LS Question Hour Budget Session | Best wishes ...
304,LS Question Hour Budget Session,LS_Question Hour_Budget Session 2024_22 July 2024,https://www.youtube.com/watch?v=tKUxL9C2xRM&li...,0.320,Om Shanti Question Hour Question Number One Sh...,LS Question Hour Budget Session | Om Shanti Qu...
68,LS Question Hour Budget Session,LS_Question Hour_Budget Session 2024_02 August...,https://www.youtube.com/watch?v=Re-I0SntUfs&li...,0.399,I also wish all the best to the team of player...,LS Question Hour Budget Session | I also wish ...
162,LS Question Hour Budget Session,LS_Question Hour_Budget Session 2024_07 August...,https://www.youtube.com/watch?v=42GN_eihmOw&li...,1.560,Honorable Members Honorable Speaker [Applause]...,LS Question Hour Budget Session | Honorable Me...
403,LS Question Hour Budget Session,LS_Question Hour_Budget Session 2024_29 July 2024,https://www.youtube.com/watch?v=c2BEW8kUUQ8&li...,1.599,Question No. 81 Shri Kalyan Banerjee Question ...,LS Question Hour Budget Session | Question No....
...,...,...,...,...,...,...
9,FM Nirmala Sitharamans Reply On Union Budget,FM_Nirmala_Sitharamans_reply_on_Union_Budget_f...,https://www.youtube.com/watch?v=kYHWCD7FZgQ,3643.160,Sir Household Savings Household Savings Year L...,FM Nirmala Sitharamans Reply On Union Budget |...
206,LS Question Hour Budget Session,LS_Question Hour_Budget Session 2024_07 August...,https://www.youtube.com/watch?v=42GN_eihmOw&li...,3704.559,Question Number Over Honorable Members Speake...,LS Question Hour Budget Session | Question Nu...
10,FM Nirmala Sitharamans Reply On Union Budget,FM_Nirmala_Sitharamans_reply_on_Union_Budget_f...,https://www.youtube.com/watch?v=kYHWCD7FZgQ,4238.159,sir Sir this is not something I very much want...,FM Nirmala Sitharamans Reply On Union Budget |...
11,FM Nirmala Sitharamans Reply On Union Budget,FM_Nirmala_Sitharamans_reply_on_Union_Budget_f...,https://www.youtube.com/watch?v=kYHWCD7FZgQ,4825.800,"Sir, before I come to the conclusion one, I wi...",FM Nirmala Sitharamans Reply On Union Budget |...


In [3]:
# # Authenticate OpenAI 
# # Read key
# with open(os.path.join(os.path.abspath(os.path.join(os.getcwd(),'..')),'OpenAI Key.txt'), 'r') as file:
#     key = file.read()
    
# client = OpenAI(
#     api_key=key,
# )

In [4]:
# # Function to make requests
# def llm_requests(prompt, client, model_name):
    
#     # Make the request
#     response = client.chat.completions.create(
#         model=model_name,
#         messages=[
#             {
#                     "role": "user", 
#                     "content": prompt
#             }
#                 ]
#                                             )
# #     return the response from OpenAI
#     return response
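
API calls can fail temporarily (for example on rate limits or network errors), and the summary loop makes several requests per video. The snippet below is a minimal sketch of a retry wrapper with exponential backoff. The name `call_with_retries` and its parameters are hypothetical and not part of the original notebook, and catching a bare `Exception` is a simplification; in practice you would likely narrow it to the client's transient error types.

```python
import time


def call_with_retries(fn, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fn() and retry on failure, doubling the wait after each attempt.

    fn           -- zero-argument callable that performs the request
    max_attempts -- total number of tries before the last error is re-raised
    base_delay   -- wait (in seconds) after the first failure
    sleep        -- injectable sleep function (makes the helper easy to test)
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:
            # Out of attempts: surface the original error to the caller
            if attempt == max_attempts:
                raise
            # Wait base_delay, then 2x, 4x, ... before the next attempt
            sleep(base_delay * 2 ** (attempt - 1))
```

Assuming `llm_requests`, `client` and a prompt are defined as in the cell above, a call could then look like `call_with_retries(lambda: llm_requests(prompt, client, 'gpt-4o-mini'))`.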

In [5]:
# Prompt to summarize the speech - Generated by chatgpt

# Prompt to summarise each section
def discussion_summary_prompt(context):
    return f"""
    You are tasked with summarizing a detailed discussion from the Indian Parliament. The transcript may contain errors related to grammar, regional accents, incorrect numbers, and out-of-context words. Your job is to carefully interpret and correct these errors to ensure the summary is clear, accurate, and reliable.

    Provide a summary in the form of a single bullet-point list that captures the key points of the discussion. The summary should be concise and include only the most relevant information. Correct any inaccuracies and cross-reference data where necessary.

    Transcript of the Discussion:
    {context}
    """
# Prompt to create final summary
def final_summary_prompt(multiple_summaries):
    return f"""
    You are tasked with creating a final, cohesive summary based on several smaller summaries generated from different portions of a larger discussion. Your job is to combine these summaries, ensuring there is no repetition, and that the final output is concise and well-organized. The goal is to capture the most important points without losing any critical information.

    Combine the following summaries into a single, clear, and concise bullet-point list:

    {multiple_summaries}
    """

In [6]:
# # Create dataframe of summaries
# df_video_name = []
# df_video_link = []
# df_summary = []

# # Iterate through each video
# for video in df['Video Name'].unique():
#     # Print to keep track
#     print(video)
#     # Append video name and video link
#     df_video_name.append(video)
#     df_video_link.append(df[df['Video Name'] == video]['Video Link'].unique()[0])
#     # Make sure to sort the dataframe in ascending order of timestamp so that the final text makes sense!
#     temp_df = df[df['Video Name'] == video].sort_values(by=['Start Time'])
#     # Summary 1 and 2
#     summary_1 = llm_requests(discussion_summary_prompt('\n'.join([i for i in temp_df[:int(len(temp_df)/2)]['Original Text']])), client, 'gpt-4o-mini').choices[0].message.content
#     summary_2 = llm_requests(discussion_summary_prompt('\n'.join([i for i in temp_df[int(len(temp_df)/2):]['Original Text']])), client, 'gpt-4o-mini').choices[0].message.content
#     # Final summary
#     summary = llm_requests(final_summary_prompt('\n'.join([summary_1,summary_2])), client, 'gpt-4o-mini').choices[0].message.content
#     df_summary.append(summary)
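
The loop above always splits a transcript into exactly two halves by row count. If some transcripts are much longer than others, two halves may still be too long. The following is a minimal sketch of a more general variant that packs consecutive rows into chunks under a character budget, summarises each chunk, and then combines the partial summaries. The helper names, the `max_chars` budget (a rough stand-in for a token limit) and the two callables are assumptions for illustration, not part of the original notebook.

```python
from typing import Callable, List


def split_into_chunks(texts: List[str], max_chars: int) -> List[str]:
    """Greedily pack consecutive rows into newline-joined chunks of at most max_chars.

    A single row longer than max_chars still becomes its own (oversized) chunk.
    """
    chunks, current, size = [], [], 0
    for text in texts:
        # Length added by this row, including the joining newline if needed
        extra = len(text) + (1 if current else 0)
        if current and size + extra > max_chars:
            # Budget exceeded: close the current chunk and start a new one
            chunks.append("\n".join(current))
            current, size = [], 0
            extra = len(text)
        current.append(text)
        size += extra
    if current:
        chunks.append("\n".join(current))
    return chunks


def summarize_long_text(
    texts: List[str],
    summarize_part: Callable[[str], str],
    combine: Callable[[str], str],
    max_chars: int = 20000,
) -> str:
    """Summarise each chunk, then merge the partial summaries into one."""
    partial = [summarize_part(chunk) for chunk in split_into_chunks(texts, max_chars)]
    # With a single chunk there is nothing to combine
    if len(partial) == 1:
        return partial[0]
    return combine("\n".join(partial))
```

In this notebook, `summarize_part` could wrap `discussion_summary_prompt` and `combine` could wrap `final_summary_prompt`, each passed through `llm_requests`. With a budget that yields two chunks, the behaviour is close to the two-half split used above.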

In [7]:
# # Create dataframe and save

# df_summaries = pd.DataFrame({
#     'Video Name' : df_video_name,
#     'Video Link' : df_video_link,
#     'Summary' : df_summary 
# })

# df_summaries.to_excel('summaries.xlsx',index=False)

In [6]:
# # Create summary of summaries for query rewriting
# df_summaries = pd.read_excel('summaries.xlsx')
# df_summaries

Unnamed: 0,Video Name,Video Link,Summary
0,FM_Nirmala_Sitharamans_reply_on_Union_Budget_f...,https://www.youtube.com/watch?v=kYHWCD7FZgQ,- The Honorable Finance Minister delivered a s...
1,LS_Question Hour_Budget Session 2024_01 August...,https://www.youtube.com/watch?v=JLKf3WlykVU&li...,- The Indian Parliament discussion focused on ...
2,LS_Question Hour_Budget Session 2024_02 August...,https://www.youtube.com/watch?v=Re-I0SntUfs&li...,- A discussion in the Indian Parliament focuse...
3,LS_Question Hour_Budget Session 2024_05 August...,https://www.youtube.com/watch?v=oeUCexzMd3E,- **Agricultural Support and Financial Inclusi...
4,LS_Question Hour_Budget Session 2024_06 August...,https://www.youtube.com/watch?v=0BOY9fx7xZo&li...,- The government is enhancing the Unique Disab...
5,LS_Question Hour_Budget Session 2024_07 August...,https://www.youtube.com/watch?v=42GN_eihmOw&li...,- Focus on developing IT hubs and technology p...
6,LS_Question Hour_Budget Session 2024_08 August...,https://www.youtube.com/watch?v=IKxbiA_jQVc&li...,- The discussion highlighted various agricultu...
7,LS_Question Hour_Budget Session 2024_09 August...,https://www.youtube.com/watch?v=4tboiTlhtUk&li...,- **High Court Benches in Bihar**: There are c...
8,LS_Question Hour_Budget Session 2024_22 July 2024,https://www.youtube.com/watch?v=tKUxL9C2xRM&li...,- **Kendriya Vidyalayas Expansion**: The first...
9,LS_Question Hour_Budget Session 2024_25 July 2024,https://www.youtube.com/watch?v=brvpcLl6Vrk,- Indian Parliament discussions emphasized mai...


In [7]:
# # Add all the summaries together
# summaries = ''
# for i in df_summaries['Summary']:
#     summaries += i + '\n'

# print(summaries)

- The Honorable Finance Minister delivered a speech in the Indian Parliament outlining the budget's key features under Prime Minister Narendra Modi’s leadership, reaffirming the government's commitment to economic growth and development.
- The budget aims for a developed India by 2047, focusing on people-centric policies, social inclusivity, and regional development.
- Total government expenditure is projected to increase to ₹48.2 lakh crore, with capital expenditure set to rise significantly, reaching ₹11.11 lakh crore—almost three times pre-pandemic levels—to support post-COVID economic recovery and infrastructure development.
- The budget shows an upward trend in allocations for agriculture, education, rural and urban development, health, and women's welfare, countering claims of funding reductions in these sectors.
- The Finance Minister highlighted improved fiscal deficit management, projecting a gradual decrease to below 4.5% of GDP by 2025.
- The budget allocates ₹17,000 crore f

In [11]:
# # Function to create a summary of summaries - Chatgpt generated
# def summary_of_summaries_prompt(summaries):
#     return f"""
#     You are tasked with creating a final, consolidated summary based on multiple summaries generated from various discussions or videos. Your goal is to synthesize these summaries into a single, cohesive bullet-point list that captures the most important and relevant points. Be sure to avoid any repetition and only include the key information that adds value.

#     Combine the following summaries into a single, clear, and concise summary:

#     {summaries}
#     """

In [12]:
# # Create an overall summary - Run only once
# summary = llm_requests(summary_of_summaries_prompt(summaries), client, 'gpt-4o-mini').choices[0].message.content

In [9]:
# Print the summary for reference
summary = """### Consolidated Summary of Parliamentary Discussions

- **Budget and Economic Development**: The Finance Minister presented the budget aimed at achieving a developed India by 2047, emphasizing economic growth and welfare through substantial government expenditure, which is projected at ₹48.2 lakh crore, and increased capital expenditure for infrastructure development.

- **Fiscal Management and Sector Allocations**: The budget reflects positive trends in fiscal deficit management, aiming for a reduction below 4.5% of GDP by 2025. Key sector allocations are increased for agriculture, education, health, and women's welfare, debunking claims of funding cuts in these areas.

- **Support for Jammu and Kashmir**: The budget allocates ₹17,000 crore for development in Jammu and Kashmir, showing improvements in employment rates and introducing welfare initiatives for tribal communities to enhance healthcare and education.

- **Youth and Employment Initiatives**: New youth empowerment schemes focus on skill training and job creation, aiming to address unemployment rates, with confidence shown from recent statistics indicating improvements in labor market conditions.

- **Water Supply and Infrastructure Improvements**: Discussions on the Jal Jeevan Mission aimed at ensuring clean water access highlighted regional water supply issues and natural disaster management, emphasizing cooperation between central and state governments.

- **Healthcare Initiatives**: The government focused on drug pricing, affordable medicines, and the Ayushman Bharat scheme’s challenges, with efforts to monitor health expenditures and improve healthcare access, especially for vulnerable populations.

- **Agricultural Support and Financial Inclusion**: Initiatives to empower farmers included a significant increase in funding through credit schemes, reforms in crop insurance, and a new emphasis on sustainability and technology in agriculture.

- **MSME and Economic Growth**: MSMEs were acknowledged for their role in job creation and economic stability, with various government measures announced to enhance financial access and support innovation.

- **Environmental Concerns and Renewable Energy**: The discussions included strides in renewable energy adoption, addressing pollution, and strategies to manage waste, including updates on e-waste management rules and carbon emissions tracking.

- **Judicial and Legislative Reforms**: Issues surrounding judicial proposals, backlogs, and the need for regional language accessibility in legal proceedings were discussed, emphasizing the importance of reform in the legal system.

- **Child Welfare and Education**: Rising concerns about child welfare were raised, highlighting the need for better support systems for marginalized children, while educational discussions pointed to the urgent need for quality improvements and integrity in examination systems.

- **Infrastructure Development**: Various infrastructure projects, including railway expansions and highway improvements, were questioned for their progress, with calls for enhanced project monitoring and timely completion.

- **Parliamentary Conduct and Opposition Comments**: Observations on maintaining parliamentary decorum and effective opposition engagement were made, with Rahul Gandhi criticizing the budget's neglect of various societal needs, emphasizing a focus on marginalized groups and pressing for legislative changes.

- **Future Outlook**: The discussions conveyed a need for strategic investments in various sectors, greater representation for underserved communities, and an emphasis on compassion in governance amidst current socio-political challenges. 

This consolidated summary encapsulates vital outcomes and concerns from the parliamentary discussions, illustrating the government's initiatives and the pressing needs identified by opposition members and representatives from various sectors."""

print(summary)

### Consolidated Summary of Parliamentary Discussions

- **Budget and Economic Development**: The Finance Minister presented the budget aimed at achieving a developed India by 2047, emphasizing economic growth and welfare through substantial government expenditure, which is projected at ₹48.2 lakh crore, and increased capital expenditure for infrastructure development.

- **Fiscal Management and Sector Allocations**: The budget reflects positive trends in fiscal deficit management, aiming for a reduction below 4.5% of GDP by 2025. Key sector allocations are increased for agriculture, education, health, and women's welfare, debunking claims of funding cuts in these areas.

- **Support for Jammu and Kashmir**: The budget allocates ₹17,000 crore for development in Jammu and Kashmir, showing improvements in employment rates and introducing welfare initiatives for tribal communities to enhance healthcare and education.

- **Youth and Employment Initiatives**: New youth empowerment schemes f

### B - Query rewriting

In [15]:
# Prompt to modify the user's question
def modify_question_prompt(summary, question):
    return f"""
    You are tasked with modifying the following question based on the context provided by a summary of discussions. Your goal is to adjust the question without losing its original meaning or semantic value. The modification should incorporate any relevant details or key points from the summary, ensuring the question remains aligned with the context of the discussions.

    Summary:
    {summary}

    Original Question: {question}

    Provide a modified version of the question that takes the summary into account but does not alter the core meaning.
    """

In [16]:
# Function to call for rewriting
def query_rewriting(summary, question, model_name):
    return llm_requests(modify_question_prompt(summary, question), client, model_name).choices[0].message.content

In [17]:
# # Sample rewritten query/question below from an improperly structured question
# query_rewriting(summary,'What reforms in agriculture', 'gpt-4o')

'What specific reforms in agriculture were introduced in the recent budget to empower farmers, increase funding through credit schemes, improve crop insurance, and emphasize sustainability and technology?'
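
The model's reply is free text, so it may come back wrapped in quotes (as in the sample above) or prefixed with a label. The snippet below is a minimal sketch of a clean-up step that also falls back to the user's original question when the reply is empty. The function name `clean_rewritten_question` and the label patterns it removes are assumptions for illustration.

```python
import re


def clean_rewritten_question(raw: str, fallback: str) -> str:
    """Tidy an LLM-rewritten question; return fallback if nothing usable remains."""
    text = (raw or "").strip()
    # Drop a leading label such as "Modified Question:" if the model added one
    text = re.sub(
        r"^(modified|rewritten)\s+question\s*:\s*", "", text, flags=re.IGNORECASE
    )
    # Remove quote characters wrapped around the whole reply
    text = text.strip().strip("\"'").strip()
    return text if text else fallback
```

A possible use is `clean_rewritten_question(query_rewriting(summary, question, 'gpt-4o'), question)`, so that retrieval always receives a non-empty query.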

### C - Reranking with llm

In [14]:
# Prompt for reranking results from the retrieval system - We will return the top 5, as that's what we concluded from the RAG evaluation in part 2
def select_top_relevant_matches_prompt(question, matches, top_n):
    return f"""
    You are tasked with selecting the top {top_n} most relevant passages from the following list, based on the question provided. The passages may contain varying levels of relevance, and your goal is to pick only the ones that are directly related to answering the question.

    The question is: "{question}"

    List of matches:
    {matches}

    Select the top {top_n} most relevant passages and output only those passages.
    """

In [19]:
# # Build index for a test - Select model and text to search using conclusions from part 2
# # Create dictionaries to store the text vector embeddings
# text_dict = {i:'' for i in df['RAG Text']}
# model = SentenceTransformer('all-mpnet-base-v2')
# # Create embeddings 
# for text in text_dict.keys():
#     text_dict[text] = model.encode(text)
# # Stack all vectors into one matrix (wrap in list(): np.vstack expects a sequence, not a dict view)
# all_embeddings = np.vstack(list(text_dict.values()))
# # Build index
# d = all_embeddings.shape[1]
# index = faiss.IndexFlatL2(d)  
# # Add the combined embeddings to the index
# index.add(all_embeddings)
# # Save the index
# faiss.write_index(index,'retrieval_index.index')    
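
To make clear what a flat L2 index computes, here is a small pure-Python sketch of exact nearest-neighbour search: it compares the query against every stored vector and keeps the k smallest squared Euclidean distances. It is an illustration only and not a replacement for FAISS; the function names are hypothetical.

```python
from typing import List, Sequence, Tuple


def squared_l2(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def flat_l2_search(
    vectors: List[Sequence[float]], query: Sequence[float], k: int
) -> List[Tuple[float, int]]:
    """Return the k closest vectors as (squared distance, position) pairs."""
    # Exhaustive scan: score every vector, then sort by distance (ties by position)
    scored = sorted((squared_l2(v, query), i) for i, v in enumerate(vectors))
    return scored[:k]
```

The positions returned refer to the order in which the vectors were added, which is why the texts must be looked up by position in the same order when mapping search results back to passages.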

In [20]:
# Function to rerank - chatgpt generated - Change the k value if you like. Make sure the combined texts do not breach the token limit.
def rerank(df, question, model_name, k_value=10, top_n=5):
    # Encode question
    question_vector = model.encode([question])
    # Retrieve the k nearest neighbors (10 by default) to rerank down to top_n (5 by default)
    result = index.search(question_vector, k_value)
    # Get matches to feed to the LLM - FAISS returns row positions, so look up by position
    matches = [df['RAG Text'].iloc[result_index] for result_index in result[1][0]]
    # Request the LLM to rerank the matches using the prompt we constructed earlier
    return llm_requests(select_top_relevant_matches_prompt(question, matches, top_n), client, model_name).choices[0].message.content.splitlines()
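
Because the reranked passages come back as free text split on line breaks, the result can be hard to map back to the original rows (the model may renumber, quote or slightly alter passages). One alternative, sketched below under the assumption that the prompt is changed to ask for passage numbers only, is to number the matches before sending them and parse the numbers from the reply. Both helper names are hypothetical.

```python
import re
from typing import List


def format_numbered_matches(matches: List[str]) -> str:
    """Render passages as '[1] ...', '[2] ...' so the model can refer to them by number."""
    return "\n".join(f"[{i}] {m}" for i, m in enumerate(matches, start=1))


def parse_selected_indices(response_text: str, n_matches: int, top_n: int) -> List[int]:
    """Extract up to top_n distinct, valid 1-based passage numbers in the order given."""
    selected = []
    for token in re.findall(r"\d+", response_text):
        idx = int(token)
        # Ignore numbers outside the list and repeated selections
        if 1 <= idx <= n_matches and idx not in selected:
            selected.append(idx)
    return selected[:top_n]
```

The selected numbers can then be used to pick the untouched passages from `matches` (for example `matches[i - 1]`), so the text shown to the user is exactly what was retrieved.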

In [24]:
# # Sample response from llm
# question = 'What specific reforms in agriculture were introduced in the recent budget to empower farmers, increase funding through credit schemes, improve crop insurance, and emphasize sustainability and technology?'
# response = rerank(df,question,'gpt-4o')
# response

["1. 'LS Question Hour Budget Session | Honorable Speaker Sir, human life is very precious and if even one unfortunate incident happens, then it is a matter of trouble for us. Anyway it was said, Jeeve Sarada Satam, I want to say that to save the farmers of the country from such unfortunate situations, many schemes have been started under the leadership of Prime Minister Shri Narendra Modi. Honorable Speaker Sir, the income of the farmers should be doubled and for this we have a six-point strategy. One is to increase the production, second is to reduce the cost of production, third is to give fair price for the production, fourth, in case of natural disaster, the government compensates the loss through schemes like crop insurance and gives relief, starting schemes like diversification of crops and natural farming and that is why I say. That the government is fully committed to the welfare of the farmers, planned efforts are being made to increase the income and many efforts have been m