GenQuest-RAG is a question-generation system that uses Wikipedia as its knowledge source. It retrieves relevant information from Wikipedia articles and uses it to augment question generation with a fine-tuned language model.
By using Retrieval-Augmented Generation (RAG), the project goes beyond traditional question-generation approaches: the content retrieved from Wikipedia grounds each question in the topic at hand, which is intended to make the generated questions more accurate and contextually relevant across a wide range of topics.
The project aims to support learning, comprehension, and knowledge acquisition by generating questions tailored to specific topics or areas of interest. It can be used in educational settings, content creation, or research to produce questions that stimulate critical thinking and deepen understanding.
GenQuest-RAG is implemented using the following Python packages:
Package | Description |
---|---|
Wikipedia | A Python wrapper for the Wikipedia API, used as the knowledge source for Retrieval-Augmented Generation (RAG) |
PyTorch | An open-source machine learning framework |
Transformers | A Hugging Face package containing state-of-the-art natural language processing models |
Datasets | A Hugging Face package containing popular open-source datasets |
Evaluate | A Hugging Face package containing evaluation metrics such as BLEU, ROUGE, METEOR, and BERTScore |
LangChain | A framework for developing applications powered by language models |
Weaviate | An open-source, cloud-native, vector search engine that allows for semantic search and exploration of structured and unstructured data. |
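To make the retrieve-then-generate flow concrete, here is a minimal sketch of the retrieval step only. It is not the project's implementation: the passages are hard-coded stand-ins for text fetched from Wikipedia, and plain word overlap stands in for the vector search that Weaviate performs. The names `tokenize`, `overlap_score`, and `retrieve` are hypothetical.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def overlap_score(query, passage):
    """Count query tokens that also occur in the passage (toy relevance score)."""
    query_counts = Counter(tokenize(query))
    passage_counts = Counter(tokenize(passage))
    return sum(min(count, passage_counts[token]) for token, count in query_counts.items())


def retrieve(query, passages, k=1):
    """Return the k passages with the highest overlap score."""
    ranked = sorted(passages, key=lambda p: overlap_score(query, p), reverse=True)
    return ranked[:k]


# Hard-coded passages standing in for chunks of Wikipedia articles (assumption).
passages = [
    "Messi joined American club Inter Miami in July 2023.",
    "PyTorch is an open-source machine learning framework.",
    "The Leagues Cup is contested by clubs from MLS and Liga MX.",
]

top = retrieve("Which club did Messi join in 2023?", passages, k=1)
print(top[0])
```

The top-ranked passage is what would then be handed, with an answer highlighted, to the question-generation model described below.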
Demo link: https://genquest-rag.streamlit.app/
Hugging Face Model Card: https://huggingface.co/mohammedaly22/t5-small-squad-qg-v2
- Define some useful functions for highlighting the answer in the paragraph and preparing the instruction prompt that will be fed to the model:
def highlight_answer(context, answer):
    # Wrap every occurrence of the answer in <h> markers.
    return context.replace(answer, f"<h> {answer} <h>")
def prepare_instruction(answer_highlighted_context):
    instruction_prompt = f"""Generate a question whose answer is highlighted by <h> from the context delimited by the triple backticks.
context:
```
{answer_highlighted_context}
```
"""
    return instruction_prompt
- Use the model as a Hugging Face Pipeline:
from transformers import pipeline
pipe = pipeline('text2text-generation', model='mohammedaly22/t5-small-squad-qg')
context = """During the 2011–12 season, he set the La Liga and European records \
for most goals scored in a single season, while establishing himself as Barcelona's \
all-time top scorer. The following two seasons, Messi finished second for the Ballon \
d'Or behind Cristiano Ronaldo (his perceived career rival), before regaining his best \
form during the 2014–15 campaign, becoming the all-time top scorer in La Liga and \
leading Barcelona to a historic second treble, after which he was awarded a fifth \
Ballon d'Or in 2015. Messi assumed captaincy of Barcelona in 2018, and won a record \
sixth Ballon d'Or in 2019. Out of contract, he signed for French club Paris Saint-Germain \
in August 2021, spending two seasons at the club and winning Ligue 1 twice. Messi \
joined American club Inter Miami in July 2023, winning the Leagues Cup in August of that year.
"""
answer_highlighted_context = highlight_answer(context=context, answer='Inter Miami')
prompt = prepare_instruction(answer_highlighted_context)
This will be the final prompt:
Generate a question whose answer is highlighted by <h> from the context delimited by the triple backticks.
context:
```During the 2011–12 season, he set the La Liga and European records
for most goals scored in a single season, while establishing himself as Barcelona's
all-time top scorer. The following two seasons, Messi finished second for the Ballon
d'Or behind Cristiano Ronaldo (his perceived career rival), before regaining his best
form during the 2014–15 campaign, becoming the all-time top scorer in La Liga and
leading Barcelona to a historic second treble, after which he was awarded a fifth
Ballon d'Or in 2015. Messi assumed captaincy of Barcelona in 2018, and won a record
sixth Ballon d'Or in 2019. Out of contract, he signed for French club Paris Saint-Germain
in August 2021, spending two seasons at the club and winning Ligue 1 twice. Messi
joined American club <h> Inter Miami <h> in July 2023, winning the Leagues Cup in August of that year.```
- Use the loaded `pipeline` to generate questions whose answer is `Inter Miami`:
outputs = pipe(prompt, num_return_sequences=3, num_beams=5, num_beam_groups=5, diversity_penalty=1.0)
for output in outputs:
    print(output['generated_text'])
Result:
1. What club did Messi join in the 2023 season?
2. What was Messi's name of the club that won the Leagues Cup on July 20?
3. What club did Messi join in the Leagues Cup in July 2023?
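The call above uses diverse (group) beam search, whose arguments constrain each other. As a rough guide, the sketch below checks the combinations that, to my understanding, the Transformers `generate` method rejects. `check_diverse_beam_args` is a hypothetical helper written for this README, not part of the library.

```python
def check_diverse_beam_args(num_beams, num_beam_groups, num_return_sequences, diversity_penalty):
    """Return a list of problems found in a diverse beam search configuration."""
    problems = []
    if num_beams % num_beam_groups != 0:
        problems.append("num_beams must be divisible by num_beam_groups")
    if num_return_sequences > num_beams:
        problems.append("num_return_sequences must not exceed num_beams")
    if num_beam_groups > 1 and diversity_penalty <= 0.0:
        problems.append("diversity_penalty must be positive when using beam groups")
    return problems


# The settings used in the pipeline call above pass all three checks.
readme_settings = dict(
    num_beams=5, num_beam_groups=5, num_return_sequences=3, diversity_penalty=1.0
)
print(check_diverse_beam_args(**readme_settings))  # prints []
```

With `num_beams=5` and `num_beam_groups=5`, each group holds a single beam, and `diversity_penalty` discourages a group from picking tokens already chosen by earlier groups at the same step, which is why the three returned questions differ in wording.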
The Stanford Question Answering Dataset (SQuAD) is a popular benchmark dataset in the field of natural language processing (NLP) and machine reading comprehension. It was developed by researchers at Stanford University. SQuAD consists of a large collection of real questions posed by crowd workers on a set of Wikipedia articles, where each question is paired with a corresponding passage from the article, and the answer to each question is a segment of text from the corresponding passage.
The goal of SQuAD is to train and evaluate machine learning models to understand and answer questions posed in natural language. It has been widely used as a benchmark for evaluating the performance of various question-answering systems and models, including both rule-based systems and deep learning-based approaches such as neural network models.
- I followed Chan and Fan (2019) by introducing the highlight token `<h>` to take into account an answer $a$ within a context $c$, as below: $x = [ c_1, ..., \lt h\gt , a_1, ..., a_{|a|}, \lt h\gt , ..., c_{|c|} ]$
- Preparing the instruction prompt by following this template:
Generate a question whose answer is highlighted by <h> from the context delimited by the triple backticks.
context:
```{answer_highlighted_context}```
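The two preprocessing steps above can be sketched on a single SQuAD-style record. This is an illustration under assumptions, not the training script: the record is a hand-written example that mirrors the field layout of the Hugging Face `squad` dataset (`context`, `question`, `answers` with `text` and `answer_start`), and `highlight_span`, `build_prompt`, and `to_training_pair` are hypothetical names. Unlike a plain string replacement, the sketch uses `answer_start` so that only the annotated occurrence of the answer is highlighted.

```python
FENCE = "`" * 3  # triple backticks, built programmatically to keep this example easy to embed


def highlight_span(context, answer_text, answer_start):
    """Wrap the answer span that begins at answer_start in <h> markers."""
    answer_end = answer_start + len(answer_text)
    return f"{context[:answer_start]}<h> {answer_text} <h>{context[answer_end:]}"


def build_prompt(highlighted_context):
    """Fill the instruction template with the highlighted context."""
    return (
        "Generate a question whose answer is highlighted by <h> "
        "from the context delimited by the triple backticks.\n"
        f"context:\n{FENCE}\n{highlighted_context}\n{FENCE}\n"
    )


def to_training_pair(record):
    """Map one SQuAD-style record to a (model input, target question) pair."""
    answer_text = record["answers"]["text"][0]
    answer_start = record["answers"]["answer_start"][0]
    highlighted = highlight_span(record["context"], answer_text, answer_start)
    return {"input_text": build_prompt(highlighted), "target_text": record["question"]}


# Hand-written record in the SQuAD field layout (not taken from the dataset).
record = {
    "context": "Messi joined American club Inter Miami in July 2023.",
    "question": "Which American club did Messi join in July 2023?",
    "answers": {"text": ["Inter Miami"], "answer_start": [27]},
}

pair = to_training_pair(record)
print(pair["input_text"])
print(pair["target_text"])
```

The model is then trained to map `input_text` to `target_text`, that is, from a highlighted passage to the question a crowd worker wrote about that span.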
I conducted full fine-tuning on two instances of the `t5-small` model, each with differing hyperparameters. Provided below are the detailed `TrainingArguments` for both versions:
Model | epochs | batch size | warmup steps | weight decay | gradient accumulation steps | learning rate | save total limit | fp16 |
---|---|---|---|---|---|---|---|---|
T5-Small-FFT-V1 | 3 | 32 | 500 | 0.01 | - | - | - | - |
T5-Small-FFT-V2 | 10 | 16 | 1000 | 0.01 | 4 | 5e-5 | 2 | True |
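As a sketch of how the V2 row might translate into keyword arguments for `TrainingArguments`, the dictionary below copies the values from the table. Two things are assumptions: that "batch size" means `per_device_train_batch_size`, and the `output_dir` value, which is a placeholder. The helper `effective_batch_size` is hypothetical and only shows how gradient accumulation changes the number of examples per optimizer step.

```python
# Hyperparameters for T5-Small-FFT-V2, copied from the table above.
# Mapping "batch size" to per_device_train_batch_size is an assumption.
v2_training_kwargs = {
    "output_dir": "t5-small-squad-qg-v2",  # hypothetical output path
    "num_train_epochs": 10,
    "per_device_train_batch_size": 16,
    "warmup_steps": 1000,
    "weight_decay": 0.01,
    "gradient_accumulation_steps": 4,
    "learning_rate": 5e-5,
    "save_total_limit": 2,
    "fp16": True,
}


def effective_batch_size(kwargs, num_devices=1):
    """Examples contributing to each optimizer step."""
    return (
        kwargs["per_device_train_batch_size"]
        * kwargs.get("gradient_accumulation_steps", 1)
        * num_devices
    )


print(effective_batch_size(v2_training_kwargs))  # prints 64
```

On a machine with Transformers installed and a GPU that supports fp16, the dictionary could be unpacked as `TrainingArguments(**v2_training_kwargs)`. Under the single-device assumption, V2 takes one optimizer step per 64 examples even though each forward pass sees only 16.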
Here are the evaluation metrics for the two versions:
Model | BLEU | Rouge1 | Rouge2 | RougeL | RougeLSum | METEOR | BertScore |
---|---|---|---|---|---|---|---|
T5-Small-FFT-V1 | 16.07 | 43.14 | 22.13 | 40.09 | 40.10 | 40.24 | 91.22 |
T5-Small-FFT-V2 | 20.00 | 47.69 | 26.43 | 44.15 | 44.15 | 45.84 | 91.82 |
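To read the table as per-metric gains, the short calculation below subtracts the V1 row from the V2 row. The numbers are copied from the table; nothing is re-evaluated here.

```python
metrics = ["BLEU", "Rouge1", "Rouge2", "RougeL", "RougeLSum", "METEOR", "BertScore"]
v1_scores = [16.07, 43.14, 22.13, 40.09, 40.10, 40.24, 91.22]
v2_scores = [20.00, 47.69, 26.43, 44.15, 44.15, 45.84, 91.82]

# Absolute gain of V2 over V1 for each metric, rounded to two decimals.
gains = {
    name: round(v2 - v1, 2)
    for name, v1, v2 in zip(metrics, v1_scores, v2_scores)
}

for name, gain in gains.items():
    print(f"{name}: +{gain:.2f}")
```

V2 scores higher on every metric, with the largest absolute gain on METEOR and the smallest on BertScore.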