# Document for generating questions for RAG testing ☀️
We need to generate three kinds of questions:
* General questions
* Specific questions targeting specific contexts
* Opinionated questions

# Generate general questions 👩‍🏫
We'll use the SQUAD dataset retrieved from here: https://rajpurkar.github.io/SQuAD-explorer/

First, we'll identify the topics that occur in both the true and the false is_impossible category.


In [11]:
# The script src/ensure_common_general_questions.py opens the JSON file and keeps only the common topics.
# It has already been run, producing filtered_dev-v2.0.json, which only includes topics that occur in both the true and the false is_impossible category.
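
The filtering script itself is not shown in this notebook. As a rough sketch of what such a step could look like, the snippet below keeps only the articles that contain both answerable and unanswerable questions. The function name `filter_common_topics` and the toy data are assumptions for illustration, not the contents of src/ensure_common_general_questions.py; only the SQuAD v2.0 field names (`data`, `paragraphs`, `qas`, `is_impossible`) are taken from the code in this notebook.

```python
def filter_common_topics(squad):
    """Keep only articles that have both answerable and unanswerable questions."""
    kept = []
    for entry in squad["data"]:
        # Collect every is_impossible value that occurs in this article
        flags = {
            qa["is_impossible"]
            for paragraph in entry["paragraphs"]
            for qa in paragraph["qas"]
        }
        if flags == {True, False}:
            kept.append(entry)
    # Copy the top-level keys (e.g. "version") and swap in the filtered articles
    return {**squad, "data": kept}


# Tiny in-memory stand-in for the SQuAD JSON file (hypothetical data)
toy = {
    "version": "v2.0",
    "data": [
        {"title": "Oxygen", "paragraphs": [{"qas": [
            {"question": "q1", "is_impossible": False, "answers": [{"text": "a"}]},
            {"question": "q2", "is_impossible": True, "answers": []},
        ]}]},
        {"title": "Only_answerable", "paragraphs": [{"qas": [
            {"question": "q3", "is_impossible": False, "answers": [{"text": "b"}]},
        ]}]},
    ],
}

filtered = filter_common_topics(toy)
print([entry["title"] for entry in filtered["data"]])
```

On the real file, the same function would be applied to the result of `json.load` and the output written back with `json.dump`.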

In [12]:
import json
import pandas as pd
import random
import collections
from collections import defaultdict

# Make root path
path_to_root = '/work/PernilleHøjlundBrams#8577/NLP_2023_P'

# Load the dataset (only with common topics in true/false impossible category)
with open(f'{path_to_root}/data/filtered_dev-v2.0.json') as f:
    data = json.load(f)

# List of specified common topics
common_topics = [
    "Fresno,_California", "Ctenophora", "Scottish_Parliament",
    "University_of_Chicago", "Steam_engine", "Computational_complexity_theory",
    "Pharmacy", "French_and_Indian_War", "Southern_California",
    "Normans", "Private_school", "Islamism", "Oxygen", "Force"
]

# Organize questions by topic and is_impossible flag
questions_by_topic = defaultdict(lambda: defaultdict(list))
for entry in data['data']:
    if entry['title'] in common_topics:
        for paragraph in entry['paragraphs']:
            for qa in paragraph['qas']:
                is_impossible = qa['is_impossible']
                questions_by_topic[entry['title']][is_impossible].append(qa)

# Function to sample questions with a maximum per topic
def sample_questions(is_impossible, num_samples=30, max_per_topic=3):
    sampled_questions = []
    topic_counters = defaultdict(int)

    topics = list(common_topics)
    random.shuffle(topics)  # Shuffle to ensure random selection of topics

    while len(sampled_questions) < num_samples:
        added_this_pass = False
        for topic in topics:
            if topic_counters[topic] < max_per_topic and questions_by_topic[topic][is_impossible]:
                qa = random.choice(questions_by_topic[topic][is_impossible])
                # Unanswerable questions have an empty 'answers' list
                answer_text = qa['answers'][0]['text'] if qa['answers'] else 'No Answer'
                sampled_questions.append({
                    'title': topic,
                    'question': qa['question'],
                    'answer': answer_text,
                    'category': 'True' if is_impossible else 'False'
                })
                # Remove the question so it cannot be drawn twice
                questions_by_topic[topic][is_impossible].remove(qa)
                topic_counters[topic] += 1
                added_this_pass = True

                if len(sampled_questions) == num_samples:
                    return sampled_questions

        # Stop if a full pass added nothing (all topics capped or exhausted);
        # otherwise the while loop would never terminate
        if not added_this_pass:
            break

    return sampled_questions

# Sample questions for each category
sampled_true = sample_questions(True, 30)
sampled_false = sample_questions(False, 30)

# Combine and create a DataFrame
sampled_questions = sampled_true + sampled_false
df = pd.DataFrame(sampled_questions)
df.reset_index(inplace=True)  # Expose the row number as an explicit 'index' column

In [13]:
# Create sub-DataFrames for each category
df_true = df[df['category'] == 'True']
df_false = df[df['category'] == 'False']

# Function to count topics in a DataFrame
def count_topics(dataframe):
    return dataframe.groupby('title').size()

# Count topics in each sub-DataFrame
topic_counts_true = count_topics(df_true)
topic_counts_false = count_topics(df_false)

# Display the counts
print("Counts of each topic in the True category:")
print(topic_counts_true)
print("\nCounts of each topic in the False category:")
print(topic_counts_false)


Counts of each topic in the True category:
title
Computational_complexity_theory    2
Ctenophora                         2
Force                              2
French_and_Indian_War              2
Fresno,_California                 3
Islamism                           2
Normans                            2
Oxygen                             2
Pharmacy                           2
Private_school                     2
Scottish_Parliament                2
Southern_California                3
Steam_engine                       2
University_of_Chicago              2
dtype: int64

Counts of each topic in the False category:
title
Computational_complexity_theory    2
Ctenophora                         2
Force                              2
French_and_Indian_War              3
Fresno,_California                 2
Islamism                           2
Normans                            2
Oxygen                             2
Pharmacy                           2
Private_school                     2

In [14]:
# Saving
#df.to_csv(f"{path_to_root}/data/questions/GQ.csv", index=False)

# Generate specific questions based on chunks 👩‍🏫💅🏻


In [15]:
# ----- Setup llama_index -----
from llama_index import (
    ServiceContext,
    OpenAIEmbedding,
    PromptHelper,
)

from llama_index.text_splitter import SentenceSplitter

embed_model = OpenAIEmbedding()

text_splitter = SentenceSplitter(chunk_size=1024, chunk_overlap=20)
prompt_helper = PromptHelper(
    context_window=4096,
    num_output = 256,
    chunk_overlap_ratio=0.1,
    chunk_size_limit=None,
)

In [16]:
# ----- Convert data to documents -----

# Get data and drop an unnecessary column
df = pd.read_csv(f"{path_to_root}/data/articles.csv", sep = ";").drop(columns = ["Unnamed: 0"])

# Convert the DataFrame into a list of Document objects that the index can understand
from llama_index import GPTVectorStoreIndex, SimpleDirectoryReader, Document

documents = [Document(text=row['Article Body'],
                      metadata={'title': row['Article Header'],
                                'source': row['Source'],
                                'author': row['Author'],
                                'date': row['Published Date'],
                                'url': row['Url']}) for index, row in df.iterrows()] 

# ----- Parse documents to chunks (nodes) -----
parser = text_splitter # This could just be SentenceSplitter()
nodes = parser.get_nodes_from_documents(documents)

# Take a look
nodes

[TextNode(id_='91ec27f8-bcd5-48ed-bfe3-d9c5311907fb', embedding=None, metadata={'title': 'Learning the language of molecules to predict their properties', 'source': 'MIT News Office', 'author': 'Adam Zewe', 'date': 'July 7, 2023', 'url': 'https://news.mit.edu/2023/learning-language-molecules-predict-properties-0707'}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={<NodeRelationship.SOURCE: '1'>: RelatedNodeInfo(node_id='e6239ae7-7f14-416d-90a7-f31b7532f3ce', node_type=<ObjectType.DOCUMENT: '4'>, metadata={'title': 'Learning the language of molecules to predict their properties', 'source': 'MIT News Office', 'author': 'Adam Zewe', 'date': 'July 7, 2023', 'url': 'https://news.mit.edu/2023/learning-language-molecules-predict-properties-0707'}, hash='15e058cc25ece23ba10f450fbf05778dfcb34e4d8fb21bc379d3e10ef18e36bc'), <NodeRelationship.NEXT: '3'>: RelatedNodeInfo(node_id='ba88c889-6388-4e12-9b03-de03e09cc85d', node_type=<ObjectType.TEXT: '1'>, metadata={}, ha

In [17]:
# ----- Sample 30 random nodes (chunks) to write questions for ----- TODO: ensure balancing
import random
random.seed(10)

# Note: range(0, len(nodes)-1) leaves the last node out of the sampling pool
QA_nodes_index = random.sample(range(0, len(nodes)-1), 30)

# Create a list with all nodes selected for the Q&A dataset
QA_nodes = [nodes[i] for i in QA_nodes_index]

# Make df
QA_nodes_df = pd.DataFrame(QA_nodes)

# ----- Cleaning -----
# Removing the '(text,' in beginning of text column (it's a tuple currently, so we'll make it a string)
QA_nodes_df[7] = QA_nodes_df[7].astype(str).str.replace(r'^\(\'text\', ', '', regex=True)

# ----- Rename cols -----
rename_dict = {
    0: 'id',
    1: 'embedding_status',
    2: 'metadata',
    3: 'excl_embd_meta_keys',
    4: 'excl_llm_meta_keys',
    5: 'node_relations',
    6: 'hash',
    7: 'text',
    8: 'start_char_idx',
    9: 'end_char_idx',
    10: 'txt_template',
    11: 'metadata_template',
    12: 'metadata_separator'
}

QA_nodes_df = QA_nodes_df.rename(columns=rename_dict)

# ----- Select cols we need -----
selection = ['id',
             'text',
             'metadata',
             'node_relations']

QA_nodes_df = QA_nodes_df[selection]

# Take a look
QA_nodes_df


Unnamed: 0,id,text,metadata,node_relations
0,"(id_, dcb4b37d-c4bf-432f-bab7-bd54af88aa7d)","""['Navigating roads less traveled in self-driv...","(metadata, {'title': 'Self-driving cars for co...","(relationships, {'NodeRelationship.SOURCE': no..."
1,"(id_, cfa79dcb-89e9-448d-8c6b-87a31636a391)","""This reaction occurs during the second stage ...","(metadata, {'title': 'Inaugural J-WAFS Grand C...","(relationships, {'NodeRelationship.SOURCE': no..."
2,"(id_, bc14c30b-33a6-4279-819e-157db191c293)","'[\'\', \'Hijacking IP addresses is an increas...","(metadata, {'title': 'Using machine learning t...","(relationships, {'NodeRelationship.SOURCE': no..."
3,"(id_, 40d6e539-b1c4-4995-aeb9-18a2d896218c)","""['In February, the Institute established five...","(metadata, {'title': '3Q: Structuring the MIT ...","(relationships, {'NodeRelationship.SOURCE': no..."
4,"(id_, 368c58ca-2039-4b38-8ec5-e07fb56afab4)","""['If you’re a rock climber, hiker, runner, da...","(metadata, {'title': 'The autonomous “selfie d...","(relationships, {'NodeRelationship.SOURCE': no..."
5,"(id_, e0dd11b1-ec09-491a-a51e-55c492c686d1)",'[\'Socrates once said: “It is not the size of...,"(metadata, {'title': 'MIT researchers make lan...","(relationships, {'NodeRelationship.SOURCE': no..."
6,"(id_, 6908cb79-e679-4880-8640-e7b783e6645f)","""', 'Q: What kinds of challenges do you face i...","(metadata, {'title': 'Q&A: More-sustainable co...","(relationships, {'NodeRelationship.SOURCE': no..."
7,"(id_, 0ef4ec57-e228-4373-ae31-5dce81853de2)","""['MIT researchers have developed a novel “pho...","(metadata, {'title': 'Chip design drastically ...","(relationships, {'NodeRelationship.SOURCE': no..."
8,"(id_, 5057dddb-3a05-48a7-952d-ac82936ece8d)",'[\'MIT engineers have developed a robot that ...,"(metadata, {'title': ''Manus' lends a hand in ...","(relationships, {'NodeRelationship.SOURCE': no..."
9,"(id_, e1f1fe01-c35e-4f56-a096-8f0706562396)","""['As part of the public launch of the Stephen...","(metadata, {'title': 'Computing the future', '...","(relationships, {'NodeRelationship.SOURCE': no..."
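
The tuple-valued cells in the table above (for example `(id_, dcb4b37d-...)`) most likely arise because each node is iterated as (field name, value) pairs when passed straight to `pd.DataFrame`. A possible alternative, sketched here under that assumption, is to build the rows from the node attributes explicitly, which avoids the positional column renaming and the regex cleanup. The helper name `nodes_to_frame` is hypothetical, and the `SimpleNamespace` objects are stand-ins for real nodes so the snippet runs without llama_index; the attribute names `id_`, `text`, `metadata` and `relationships` are the ones visible in the node output earlier in this notebook.

```python
from types import SimpleNamespace

import pandas as pd


def nodes_to_frame(nodes):
    """Build a DataFrame with one row per node from explicit node attributes."""
    rows = [
        {
            "id": node.id_,
            "text": node.text,
            "metadata": node.metadata,
            "node_relations": node.relationships,
        }
        for node in nodes
    ]
    return pd.DataFrame(rows, columns=["id", "text", "metadata", "node_relations"])


# Stand-in objects with the same attribute names as the nodes (hypothetical data)
toy_nodes = [
    SimpleNamespace(id_="a1", text="First chunk", metadata={"title": "T1"}, relationships={}),
    SimpleNamespace(id_="b2", text="Second chunk", metadata={"title": "T2"}, relationships={}),
]

toy_df = nodes_to_frame(toy_nodes)
print(toy_df[["id", "text"]])
```

With real nodes the call would be `nodes_to_frame(QA_nodes)`, and the resulting columns hold plain values rather than tuples.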


In [20]:
# Specific questions and gold responses, generated with GPT-4 and checked by a human.
# TODO: Find a way to balance these questions so that we control for confounding variables.
questions_list = [
    "How does MapLite enable autonomous driving without 3-D maps, and what is its impact on less-mapped areas?",
    "What are the challenges and potential impacts of enhancing RuBisCO for photosynthesis through genetic engineering and synthetic biology?",
    "What are the methods and potential impacts of the new system developed by MIT and UCSD researchers to predict and prevent IP address hijacking incidents?",
    "What are the objectives and challenges faced by the Organizational Structure working group in shaping the new MIT Stephen A. Schwarzman College of Computing?",
    "What are the features and capabilities of the R1, an autonomous video-capturing drone developed by Skydio?",
    "How does the MIT CSAIL team's smaller model compare to larger language models?",
    "What challenges and objectives are involved in the MIT-IBM Watson AI Lab's project on developing more sustainable cement?",
    "How is the novel photonic chip developed by MIT researchers more efficient for handling large neural networks?",
    "What advancements have MIT engineers made with a robot that can learn and record data?",
    "What insights were shared during the fireside chat at MIT's Stephen A. Schwarzman College of Computing?"
]

gold_responses_list = [
    "MapLite, a creation of MIT's CSAIL led by Daniela Rus, innovates autonomous driving by allowing self-driving cars to operate without the dense 3-D maps they traditionally rely on. This technology is crucial because companies like Google have focused on developing these maps only in major cities, leaving vast areas like the unpaved or unlit roads across the U.S., from the Mojave Desert to the White Mountains, untouched and inaccessible to autonomous vehicles. MapLite's capability to navigate without needing these detailed maps expands the potential reach of self-driving cars to these less-mapped, often rural areas. This advancement is significant as it overcomes the previous limitation where the need for extensive 3-D mapping confined self-driving cars to specific, well-documented urban locales. With MapLite, autonomous vehicles can now adapt to a wider variety of environments, potentially transforming transportation in regions previously deemed unsuitable for this technology.",
    "Enhancing RuBisCO through genetic engineering and synthetic biology faces challenges but holds significant potential. RuBisCO's slow processing rate and inability to distinguish between oxygen and carbon dioxide molecules limit photosynthesis efficiency, especially at higher temperatures. Technological obstacles have historically hindered successful RuBisCO engineering in crops. However, recent advancements, including the development of high-throughput engineering techniques and the ability to produce plant RuBisCO in E. coli, are promising. The 2023 J-WAFS Grand Challenge aims to improve RuBisCO using transformative protein engineering, with the EPiC platform evolving and designing better crop RuBisCO. This project includes in vivo mutagenesis and continuous directed evolution campaigns, leveraging artificial intelligence for enzyme engineering. Successful enhancement of RuBisCO could significantly increase crop yields and combat food insecurity, especially important in the context of a warming climate.",
    "The MIT and UCSD researchers' machine-learning system addresses IP address hijacking by identifying 'serial hijackers' based on common qualities observed in past incidents. It was trained using data from network operator mailing lists and historical BGP data, allowing it to flag networks with key characteristics indicative of malicious intent. This system marks a shift from reactive to proactive handling of IP hijacks, potentially reducing the frequency of such cyber-attacks. However, distinguishing between legitimate network activities and actual hijacks remains a challenge, with about 20% false positives in the current model. This innovation could significantly enhance internet security, especially given the vulnerability of the Border Gateway Protocol (BGP) to such exploits, and the historical difficulty in preventing IP hijacking, as highlighted by the U.S. Senate's first-ever cybersecurity hearing in 1998.",
    "The Organizational Structure working group's objectives for the MIT Stephen A. Schwarzman College of Computing include fostering world-class computer science research and teaching, and supporting interdisciplinary research involving computing. A key challenge is overcoming the division between electrical engineering and computer science, moving beyond the either/or categorization. Additionally, they aim to build bridges between computing and other disciplines, incorporating social science as a vital component of future computing research. The group faces the challenge of designing an adaptable organizational structure that keeps pace with rapidly evolving fields like computer science. Their process involves assessing various organizational models from MIT and other institutions, aiming to evaluate strengths and weaknesses to inform their design decisions. This work reflects MIT's commitment to innovation and interdisciplinary collaboration, highlighting the high stakes and dynamic nature of the task.",
    "The R1, developed by Skydio, is an autonomous video-capturing drone with advanced capabilities. It features 13 cameras for omnidirectional video and operates completely autonomously. Users can control it via an app, setting filming and flying conditions or manually directing it. Its perception system uses computer vision and a deep neural network for object identification and individual tracking, recognizing people by characteristics like clothing and size. The motion-planning system predicts subjects' movements and optimizes filming, while the control system executes real-time plans for smooth video capture. The R1 can autonomously maintain a distance of 10 to 30 feet from the subject or be manually controlled up to 300 feet away. It offers several cinematic modes, like 'stadium mode,' and can autonomously land when batteries are low. The R1, priced at about $2,500, is portable and designed for ease of use, comparable to using a camera app.",
    "The MIT CSAIL team's smaller model, leveraging textual entailment, outperforms its larger counterparts in certain language understanding tasks without needing human-generated annotations. It addresses inefficiencies and privacy issues associated with large language models (LLMs). By using entailment for zero-shot adaptation, it adapts to various tasks without extra training. Despite traditionally being less capable, especially in multitasking and weakly supervised tasks, this smaller model demonstrates high performance and robustness. The team enhanced its capabilities with self-training and SimPLE, an algorithm for editing pseudo-labels, making it more effective in understanding language and handling adversarial data. This approach showcases the potential of smaller, more efficient models in the AI landscape.",
    "In the MIT-IBM Watson AI Lab's project on sustainable cement, the team faces challenges like handling noisy data from various sources and addressing substantial missing data issues. This requires significant effort in data organization and employing imputation techniques for building and training machine learning models. The objective is to develop robust cement designs using waste materials to reduce CO2 emissions. This involves creating flexible and adaptable recipes that can shift with feedstock changes, aiming for a scalable solution. The project leverages predictive modeling and data extraction from over 5 million texts and patents, collaborating with IBM to design methods predicting new cements' environmental impact. The goal is to lower emissions in cement production and contribute to carbon emissions mitigation efforts.",
    "The novel photonic chip developed by MIT researchers is more efficient for handling large neural networks due to its use of light instead of electricity, significantly reducing power consumption. This chip addresses the scaling issues of traditional photonic accelerators, which relied on bulky optical components, limiting their application to small neural networks. The new design utilizes compact optical components and advanced optical signal-processing techniques, drastically reducing both power consumption and chip area. This allows the chip to handle neural networks much larger than previously possible, with simulated training suggesting it can process neural networks with energy consumption over 10 million times lower than traditional electrical-based accelerators and 1,000 times below other photonic accelerators. This efficiency is crucial for reducing energy use in data centers and meeting the growing computational demands of large neural networks.",
    "MIT engineers have developed a robot that can 'learn' exercises from a physical therapist, guide patients through them, and record biomechanical data on the patient's condition and progress. This innovation not only assists therapists in repetitive exercises but also quantifies forces and movements, providing objective records of patients' progress. The robot is specifically designed to aid in exercises for the wrist and hand, with clinical trials planned for stroke patients. The robot, named Manus, can record data such as the amount of force applied, velocity of movements, and hand position. It also uses video games to engage patients during exercises, providing visual feedback to improve therapy.",
    "During the fireside chat at MIT's Stephen A. Schwarzman College of Computing, insights and discussions were shared by six MIT professors who have received the A.M. Turing Award, often described as the 'Nobel Prize for computing.' The conversation highlighted their achievements in various areas of computer science, such as AI, cryptography, and databases. They emphasized the serendipity of computer science, where breakthroughs in one area often impact unexpected domains.The panelists discussed the importance of the new college in creating connections between computing and fields like climate change and medical technology. The panelists highlighted MIT's commitment to pursuing basic research for the sake of knowledge rather than solely for financial gain."
]

# Fill the remaining rows (20 of the 30 sampled nodes) with empty strings
n_missing = len(QA_nodes_df) - len(questions_list)
questions_list.extend([""] * n_missing)
gold_responses_list.extend([""] * n_missing)

# Put on the df
QA_nodes_df['question'] = questions_list
QA_nodes_df['gold_response'] = gold_responses_list

# Take a look
QA_nodes_df

Unnamed: 0,id,text,metadata,node_relations,question,gold_response
0,"(id_, dcb4b37d-c4bf-432f-bab7-bd54af88aa7d)","""['Navigating roads less traveled in self-driv...","(metadata, {'title': 'Self-driving cars for co...","(relationships, {'NodeRelationship.SOURCE': no...",How does MapLite enable autonomous driving wit...,"MapLite, a creation of MIT's CSAIL led by Dani..."
1,"(id_, cfa79dcb-89e9-448d-8c6b-87a31636a391)","""This reaction occurs during the second stage ...","(metadata, {'title': 'Inaugural J-WAFS Grand C...","(relationships, {'NodeRelationship.SOURCE': no...",What are the challenges and potential impacts ...,Enhancing RuBisCO through genetic engineering ...
2,"(id_, bc14c30b-33a6-4279-819e-157db191c293)","'[\'\', \'Hijacking IP addresses is an increas...","(metadata, {'title': 'Using machine learning t...","(relationships, {'NodeRelationship.SOURCE': no...",What are the methods and potential impacts of ...,The MIT and UCSD researchers' machine-learning...
3,"(id_, 40d6e539-b1c4-4995-aeb9-18a2d896218c)","""['In February, the Institute established five...","(metadata, {'title': '3Q: Structuring the MIT ...","(relationships, {'NodeRelationship.SOURCE': no...",What are the objectives and challenges faced b...,The Organizational Structure working group's o...
4,"(id_, 368c58ca-2039-4b38-8ec5-e07fb56afab4)","""['If you’re a rock climber, hiker, runner, da...","(metadata, {'title': 'The autonomous “selfie d...","(relationships, {'NodeRelationship.SOURCE': no...",What are the features and capabilities of the ...,"The R1, developed by Skydio, is an autonomous ..."
5,"(id_, e0dd11b1-ec09-491a-a51e-55c492c686d1)",'[\'Socrates once said: “It is not the size of...,"(metadata, {'title': 'MIT researchers make lan...","(relationships, {'NodeRelationship.SOURCE': no...",How does the MIT CSAIL team's smaller model co...,"The MIT CSAIL team's smaller model, leveraging..."
6,"(id_, 6908cb79-e679-4880-8640-e7b783e6645f)","""', 'Q: What kinds of challenges do you face i...","(metadata, {'title': 'Q&A: More-sustainable co...","(relationships, {'NodeRelationship.SOURCE': no...",What challenges and objectives are involved in...,In the MIT-IBM Watson AI Lab's project on sust...
7,"(id_, 0ef4ec57-e228-4373-ae31-5dce81853de2)","""['MIT researchers have developed a novel “pho...","(metadata, {'title': 'Chip design drastically ...","(relationships, {'NodeRelationship.SOURCE': no...",How is the novel photonic chip developed by MI...,The novel photonic chip developed by MIT resea...
8,"(id_, 5057dddb-3a05-48a7-952d-ac82936ece8d)",'[\'MIT engineers have developed a robot that ...,"(metadata, {'title': ''Manus' lends a hand in ...","(relationships, {'NodeRelationship.SOURCE': no...",What advancements have MIT engineers made with...,MIT engineers have developed a robot that can ...
9,"(id_, e1f1fe01-c35e-4f56-a096-8f0706562396)","""['As part of the public launch of the Stephen...","(metadata, {'title': 'Computing the future', '...","(relationships, {'NodeRelationship.SOURCE': no...",What insights were shared during the fireside ...,During the fireside chat at MIT's Stephen A. S...
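
One way the balancing TODOs could be approached is to stratify the node sample by article, so that no single long article dominates the question set. The sketch below is only an illustration of that idea, not the method used in this notebook: it draws indices round-robin across groups until the requested number is reached. The function name `sample_balanced` and the toy titles are assumptions.

```python
import random
from collections import Counter, defaultdict


def sample_balanced(items, key, num_samples, seed=10):
    """Pick indices round-robin across groups so groups are represented evenly."""
    rng = random.Random(seed)

    # Group item indices by their key (e.g. the article title)
    groups = defaultdict(list)
    for index, item in enumerate(items):
        groups[key(item)].append(index)

    # Shuffle within each group and shuffle the order in which groups are visited
    for indices in groups.values():
        rng.shuffle(indices)
    order = list(groups)
    rng.shuffle(order)

    chosen = []
    # Take one index per group per round until enough are chosen or all are used
    while len(chosen) < num_samples and any(groups[g] for g in order):
        for g in order:
            if groups[g] and len(chosen) < num_samples:
                chosen.append(groups[g].pop())
    return chosen


# Toy example: article "A" has many chunks, "B" a few, "C" only one
toy_titles = ["A"] * 10 + ["B"] * 3 + ["C"] * 1
picked = sample_balanced(toy_titles, key=lambda title: title, num_samples=6)
counts = Counter(toy_titles[i] for i in picked)
print(counts)
```

For the nodes in this notebook, the call could look like `sample_balanced(nodes, key=lambda node: node.metadata['title'], num_samples=30)`, which returns indices into `nodes`.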


In [21]:
# Save
# QA_nodes_df.to_csv(f"{path_to_root}/data/questions/SQ.csv", index = True)