<a href="https://colab.research.google.com/github/Danny2173/RAGproject/blob/main/2_Question_Answer_Generation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

##Imports and Installs

In [None]:
# Install dependencies
%pip install -q transformers openai tqdm

# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

# Imports
import json, time
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from tqdm import tqdm
import openai


Mounted at /content/drive


##Loading Corpus

In [None]:
load_path = '/content/drive/MyDrive/corpus.json'

# Load the corpus
with open(load_path, "r") as f:
    corpus = json.load(f)

print("Corpus loaded")


Corpus loaded

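The cells below read every entry through its `text` key, so a quick structural check can catch a malformed file before the long generation run. This is a minimal sketch, assuming `corpus.json` holds a list of objects; `find_bad_entries` is a hypothetical helper and not part of the original notebook.

```python
def find_bad_entries(corpus):
    """Return the indices of entries that lack a non-empty string under "text"."""
    bad = []
    for i, doc in enumerate(corpus):
        # Anything that is not a dict cannot carry a "text" field
        text = doc.get("text") if isinstance(doc, dict) else None
        if not isinstance(text, str) or not text.strip():
            bad.append(i)
    return bad
```

Calling `find_bad_entries(corpus)` right after loading should return an empty list for a well-formed corpus; any indices it reports can be inspected or dropped before generation.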

In [None]:
# Preview formatted text
for i, doc in enumerate(corpus[:7]):
    print(f"Document {i+1}")
    print(doc["text"])
    print("-" * 100)


##Generating questions and answers based on the context passages

In [None]:
# Load FLAN-T5 model
model = "google/flan-t5-large"
qa_tokenizer = AutoTokenizer.from_pretrained(model)
qa_model = AutoModelForSeq2SeqLM.from_pretrained(model).to("cuda")

# Initialise
synthetic_dataset = []

# Create one pair per chunk, with a progress bar
for chunk in tqdm(corpus, desc="Generating qa pairs"):
    context = chunk["text"]

    # Generate question
    prompt_q = f"You are a medical question writer. Based on the passage below, generate a clear, specific, factual question that could be answered from the passage. Passage: {context}"
    inputs_q = qa_tokenizer(prompt_q, return_tensors="pt", truncation=True, max_length=512).to("cuda")
    outputs_q = qa_model.generate(**inputs_q, max_length=64)
    question = qa_tokenizer.decode(outputs_q[0], skip_special_tokens=True)

    # Generate answer
    prompt_a = f"Based on the text:\n{context}\nAnswer this question: {question}"
    inputs_a = qa_tokenizer(prompt_a, return_tensors="pt", truncation=True, max_length=512).to("cuda")
    outputs_a = qa_model.generate(**inputs_a, max_length=256)
    answer = qa_tokenizer.decode(outputs_a[0], skip_special_tokens=True)

    # Save the qa pair
    synthetic_dataset.append({
        "context": context,
        "question": question,
        "answer": answer
    })

    # time.sleep(0.3)  # prevent GPU overload

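# Keep a local copy in the Colab runtime (the Drive copy is written in a later cell)
with open("synthetic_dataset.json", "w", encoding="utf-8") as f:
    json.dump(synthetic_dataset, f, ensure_ascii=False, indent=2)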

In [None]:
for i, item in enumerate(synthetic_dataset[:10], 1):
    print(f"Example {i}")
    print("Context:", item["context"][:200].strip().replace("\n", " "))
    print("Question:", item["question"])
    print("Answer:", item["answer"])

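The answer prompt in the generation loop places the passage before the question and is tokenized with `truncation=True, max_length=512`. With the tokenizer's default right-side truncation, a passage long enough to push the prompt past 512 tokens would have the question cut off. One possible workaround is to trim only the passage to a token budget. This is a sketch under assumptions: `build_answer_prompt` is a hypothetical helper, `encode` and `decode` stand in for the tokenizer, and the token count is approximate because subword tokenizers may split a concatenated string slightly differently from its parts.

```python
def build_answer_prompt(context, question, encode, decode, max_tokens=512):
    """Trim the context so the whole prompt, including the question, fits the budget."""
    head = "Based on the text:\n"
    tail = f"\nAnswer this question: {question}"
    # Tokens used by everything except the context, plus one for the end-of-sequence token
    overhead = len(encode(head)) + len(encode(tail)) + 1
    budget = max(max_tokens - overhead, 0)
    context_ids = encode(context)
    if len(context_ids) > budget:
        # Keep only the leading part of the passage that fits
        context = decode(context_ids[:budget])
    return f"{head}{context}{tail}"
```

In the notebook this could be called with `encode=lambda s: qa_tokenizer.encode(s, add_special_tokens=False)` and `decode=qa_tokenizer.decode`. Decoding truncated ids may not reproduce the original text character for character.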

##Saving synthetic dataset

In [None]:
save_path = '/content/drive/MyDrive/synthetic_dataset.json'

# Save as JSON
with open(save_path, "w", encoding="utf-8") as f:
    json.dump(synthetic_dataset, f, ensure_ascii=False, indent=2)

print(f"Saved to {save_path}")

Saved to /content/drive/MyDrive/synthetic_dataset.json


In [None]:
load_path = '/content/drive/MyDrive/synthetic_dataset.json'

# Load JSON file
with open(load_path, "r", encoding="utf-8") as f:
    synthetic_dataset = json.load(f)

print("Loaded qa pairs")

Loaded qa pairs


##Expanding the answers to form longer, grammatically correct sentences.

In [None]:
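# Set your OpenAI API key (read from the OPENAI_API_KEY environment variable if set,
# so the key does not need to be pasted into the notebook)
import os
openai.api_key = os.environ.get("OPENAI_API_KEY", "REPLACE_WITH_YOUR_OPENAI_API_KEY")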

In [None]:
# OpenAI client
client = openai.OpenAI(
    api_key=openai.api_key
)

# Expand each answer
expanded_data = []
for i, item in enumerate(tqdm(synthetic_dataset, desc="Expanding answers")):
    question = item["question"]
    short_answer = item["answer"]
    context = item["context"]

    prompt = (
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        f"Short Answer: {short_answer}\n\n"
        f"Rewrite the short answer into a full-sentence answer that directly responds to the question."
    )

    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.4,
    )

    full_answer = response.choices[0].message.content.strip()

    if (i + 1) % 100 == 0 or i < 5:
        print(f"Example {i+1}")
        print("Question:", question)
        print("Short Answer:", short_answer)
        print("Expanded Answer:", full_answer)

    # Store the expanded answer in a copy so the short answers in synthetic_dataset stay intact
    expanded_data.append({**item, "answer": full_answer})

with open("expanded_dataset.json", "w", encoding="utf-8") as f:
    json.dump(expanded_data, f, indent=2, ensure_ascii=False)

Expanding answers:   0%|          | 1/5121 [00:02<3:35:54,  2.53s/it]

Example 1
Question: What is the name for dry, dark patches of skin that usually appear in the armpits, neck or groin?
Short Answer: Acanthosis nigricans
Expanded Answer: Acanthosis nigricans is the name for dry, dark patches of skin that usually appear in the armpits, neck, or groin.


Expanding answers:   0%|          | 2/5121 [00:03<2:36:14,  1.83s/it]

Example 2
Question: What is the most common cause of acanthosis nigricans?
Short Answer: obesity
Expanded Answer: Obesity is the most common cause of acanthosis nigricans.


Expanding answers:   0%|          | 3/5121 [00:05<2:44:48,  1.93s/it]

Example 3
Question: What is the name of the rare disorder of the food pipe?
Short Answer: Achalasia
Expanded Answer: The name of the rare disorder of the food pipe is Achalasia.


Expanding answers:   0%|          | 4/5121 [00:06<2:04:03,  1.45s/it]

Example 4
Question: What is the name of the test used to measure the muscle pressure along the oesophagus?
Short Answer: manometry
Expanded Answer: The test used to measure the muscle pressure along the oesophagus is called manometry.


Expanding answers:   0%|          | 5/5121 [00:07<1:39:01,  1.16s/it]

Example 5
Question: What is the name of the balloon that is inflated to help stretch the ring of muscle that lets food into your stomach?
Short Answer: balloon dilation
Expanded Answer: The balloon that is inflated to help stretch the ring of muscle that lets food into your stomach is called balloon dilation.


Expanding answers:   2%|▏         | 100/5121 [01:20<58:57,  1.42it/s]

Example 100
Question: What is a powerful chemical that can have a wide range of adverse effects on almost every part of your body?
Short Answer: Alcohol
Expanded Answer: Alcohol is a powerful chemical that can have a wide range of adverse effects on almost every part of your body.


Expanding answers:   4%|▍         | 200/5121 [02:32<1:01:45,  1.33it/s]

Example 200
Question: What is the main treatment for anal cancer?
Short Answer: chemoradiation
Expanded Answer: The main treatment for anal cancer is a combination of radiotherapy and chemotherapy, known as chemoradiation.


Expanding answers:   6%|▌         | 300/5121 [04:00<1:46:48,  1.33s/it]

Example 300
Question: What is the name of the type of axial spondyloarthritis where inflammation of the sacroiliac joints can be seen on an X-ray?
Short Answer: Ankylosing spondylitis
Expanded Answer: Ankylosing spondylitis is the type of axial spondyloarthritis where inflammation of the sacroiliac joints can be seen on an X-ray.


Expanding answers:   8%|▊         | 400/5121 [05:19<56:00,  1.40it/s]

Example 400
Question: What is the most common type of brain imaging scan?
Short Answer: magnetic resonance imaging (MRI) scan
Expanded Answer: The most common type of brain imaging scan is a magnetic resonance imaging (MRI) scan.


Expanding answers:  10%|▉         | 500/5121 [06:34<1:10:00,  1.10it/s]

Example 500
Question: What is the best way to prevent schistosomiasis?
Short Answer: avoid exposure to contaminated water
Expanded Answer: The best way to prevent schistosomiasis is to avoid exposure to contaminated water, by refraining from activities such as paddling, swimming, and washing in fresh water in affected areas, and instead opting for swimming in the sea or chlorinated swimming pools.


Expanding answers:  12%|█▏        | 600/5121 [07:52<51:54,  1.45it/s]

Example 600
Question: What is the name of the treatment for Bowen's disease?
Short Answer: cryotherapy
Expanded Answer: The name of the treatment for Bowen's disease is cryotherapy, where liquid nitrogen is sprayed on to the affected skin to freeze it.


Expanding answers:  14%|█▎        | 700/5121 [09:06<48:06,  1.53it/s]

Example 700
Question: What is the name of the section that discusses symptoms of a broken finger or thumb?
Short Answer: Broken finger or thumb
Expanded Answer: The section that discusses symptoms of a broken finger or thumb is called "Broken finger or thumb."


Expanding answers:  16%|█▌        | 800/5121 [10:19<1:01:25,  1.17it/s]

Example 800
Question: What should you do if you think you might have carbon monoxide poisoning?
Short Answer: stop using appliances you think might be making carbon monoxide (such as a boiler, cooker or heater) if you can open any windows and doors to let fresh air in go outside get medical advice as soon as possible - do not go back into the affected building until you have got advice
Expanded Answer: If you think you might have carbon monoxide poisoning, you should immediately stop using appliances you suspect are emitting carbon monoxide, such as a boiler, cooker, or heater. Open windows and doors to let fresh air in, go outside, and seek medical advice as soon as possible. Do not re-enter the affected building until you have received guidance.


Expanding answers:  18%|█▊        | 900/5121 [11:32<47:41,  1.48it/s]

Example 900
Question: What is the first thing a doctor will do if you have cervical spondylosis?
Short Answer: examine your neck and shoulder
Expanded Answer: The first thing a doctor will do if you have cervical spondylosis is examine your neck and shoulder to assess your condition.


Expanding answers:  20%|█▉        | 1000/5121 [12:46<53:55,  1.27it/s]

Example 1000
Question: What is the name of the condition that can be prevented?
Short Answer: COPD
Expanded Answer: Chronic obstructive pulmonary disease (COPD) is the name of the condition that can be prevented.


Expanding answers:  21%|██▏       | 1100/5121 [14:01<54:47,  1.22it/s]  

Example 1100
Question: What is the protein called that is suitable for most people with coeliac disease but may trigger symptoms in a few people?
Short Answer: avenin
Expanded Answer: The protein called avenin is suitable for most people with coeliac disease but may trigger symptoms in a few people.


Expanding answers:  23%|██▎       | 1200/5121 [15:13<48:00,  1.36it/s]

Example 1200
Question: What is the best way to get free rapid lateral flow test kits?
Short Answer: Find a pharmacy that offers free COVID-19 rapid lateral flow tests
Expanded Answer: The best way to get free rapid lateral flow test kits is to find a pharmacy that offers free COVID-19 rapid lateral flow tests.


Expanding answers:  25%|██▌       | 1300/5121 [16:26<43:37,  1.46it/s]

Example 1300
Question: What is the best way to get help if you have hearing loss?
Short Answer: Urgent advice: Ask for an urgent GP appointment or get help from NHS 111
Expanded Answer: The best way to get help if you have hearing loss is to ask for an urgent GP appointment or seek assistance from NHS 111.


Expanding answers:  27%|██▋       | 1400/5121 [17:41<46:08,  1.34it/s]

Example 1400
Question: What are some ways to keep your body healthy?
Short Answer: Eat a healthy, balanced diet and drink plenty of fluids. Exercise regularly.
Expanded Answer: To keep your body healthy, it is important to eat a healthy, balanced diet, drink plenty of fluids, and exercise regularly.


Expanding answers:  29%|██▉       | 1500/5121 [19:00<50:54,  1.19it/s]

Example 1500
Question: What is the name of the type of insulin pump that works with a continuous glucose monitor to automatically give you the right amount of insulin based on your blood glucose levels?
Short Answer: hybrid closed loop system
Expanded Answer: The type of insulin pump that works with a continuous glucose monitor to automatically give you the right amount of insulin based on your blood glucose levels is called a hybrid closed loop system.


Expanding answers:  31%|███       | 1600/5121 [20:10<41:18,  1.42it/s]

Example 1600
Question: What can cause stomach problems or constipation?
Short Answer: NSAIDs (such as ibuprofen) or opioid painkillers
Expanded Answer: NSAIDs (such as ibuprofen) or opioid painkillers (such as codeine) can cause stomach problems or constipation.


Expanding answers:  33%|███▎      | 1700/5121 [21:24<42:12,  1.35it/s]

Example 1700
Question: What can cause tummy pain?
Short Answer: stomach bugs and trapped wind
Expanded Answer: Tummy pain can be caused by stomach bugs and trapped wind.


Expanding answers:  35%|███▌      | 1800/5121 [22:38<36:52,  1.50it/s]

Example 1800
Question: What is the name for a group of rare inherited skin disorders that cause the skin to become very fragile?
Short Answer: Epidermolysis bullosa
Expanded Answer: Epidermolysis bullosa is the name for a group of rare inherited skin disorders that cause the skin to become very fragile.


Expanding answers:  37%|███▋      | 1900/5121 [23:59<1:14:43,  1.39s/it]

Example 1900
Question: What is the name of the small, plastic T-shaped device placed in your womb that slowly releases the progestogen hormone levonorgestrel?
Short Answer: The levonorgestrel intrauterine system
Expanded Answer: The small, plastic T-shaped device placed in your womb that slowly releases the progestogen hormone levonorgestrel is called the levonorgestrel intrauterine system (LNG-IUS).


Expanding answers:  39%|███▉      | 2000/5121 [25:16<33:40,  1.54it/s]

Example 2000
Question: What is the name of the procedure that removes the gallbladder?
Short Answer: laparoscopic cholecystectomy
Expanded Answer: The procedure that removes the gallbladder is called a laparoscopic cholecystectomy.


Expanding answers:  41%|████      | 2100/5121 [26:29<33:32,  1.50it/s]

Example 2100
Question: What is the name of the laser treatment that can be used to open up drainage tubes?
Short Answer: laser trabeculoplasty
Expanded Answer: The laser treatment that can be used to open up drainage tubes is called laser trabeculoplasty.


Expanding answers:  43%|████▎     | 2200/5121 [27:40<33:24,  1.46it/s]

Example 2200
Question: What are the common causes of head lice?
Short Answer: They are not caused by dirty hair and are picked up by head-to-head contact
Expanded Answer: Head lice are commonly picked up through head-to-head contact and are not caused by having dirty hair.


Expanding answers:  45%|████▍     | 2300/5121 [28:51<36:44,  1.28it/s]

Example 2300
Question: What device is implanted into the artery in hospital?
Short Answer: pulmonary artery pressure sensor
Expanded Answer: The device that is implanted into the artery in hospital is a pulmonary artery pressure sensor.


Expanding answers:  47%|████▋     | 2400/5121 [30:04<32:30,  1.40it/s]

Example 2400
Question: What is a sign of diabetic ketoacidosis?
Short Answer: A high level of ketones
Expanded Answer: A sign of diabetic ketoacidosis is indicated by a high level of ketones in the blood or urine.


Expanding answers:  49%|████▉     | 2500/5121 [41:19<31:53,  1.37it/s]

Example 2500
Question: What are the two types of brain scans used to diagnose hydrocephalus?
Short Answer: CT scans and MRI scans
Expanded Answer: The two types of brain scans used to diagnose hydrocephalus are CT scans and MRI scans.


Expanding answers:  51%|█████     | 2600/5121 [42:33<27:52,  1.51it/s]

Example 2600
Question: What can cause total incontinence?
Short Answer: a problem with your bladder
Expanded Answer: Total incontinence can be caused by a problem with your bladder.


Expanding answers:  53%|█████▎    | 2700/5121 [43:48<29:43,  1.36it/s]

Example 2700
Question: What is the most common cause of kernicterus?
Short Answer: newborn jaundice
Expanded Answer: Newborn jaundice is the most common cause of kernicterus.


Expanding answers:  55%|█████▍    | 2800/5121 [44:57<27:02,  1.43it/s]

Example 2800
Question: What is the first thing you should do if you have a meniscus tear?
Short Answer: Urgent advice: Ask for an urgent GP appointment or get help from NHS 111
Expanded Answer: If you have a meniscus tear, the first thing you should do is ask for an urgent GP appointment or seek help from NHS 111.


Expanding answers:  57%|█████▋    | 2900/5121 [46:17<25:09,  1.47it/s]

Example 2900
Question: What is the main treatment for AML?
Short Answer: Chemotherapy
Expanded Answer: The main treatment for AML is chemotherapy, which is used to kill as many leukaemia cells in the body as possible and reduce the risk of the condition coming back.


Expanding answers:  59%|█████▊    | 3000/5121 [47:29<23:03,  1.53it/s]

Example 3000
Question: What is the single biggest risk factor for lung cancer?
Short Answer: Smoking cigarettes
Expanded Answer: Smoking cigarettes is identified as the single biggest risk factor for developing lung cancer.


Expanding answers:  61%|██████    | 3100/5121 [48:44<23:27,  1.44it/s]

Example 3100
Question: What is the most common cause of malnutrition?
Short Answer: long-term health conditions
Expanded Answer: The most common cause of malnutrition is long-term health conditions that affect appetite, weight, and/or how well nutrients are absorbed by the gut, such as Crohn's disease.


Expanding answers:  62%|██████▏   | 3200/5121 [49:58<20:30,  1.56it/s]

Example 3200
Question: What is the name of the condition that can be life-threatening?
Short Answer: Middle East respiratory syndrome
Expanded Answer: The condition that can be life-threatening is called Middle East respiratory syndrome (MERS).


Expanding answers:  64%|██████▍   | 3300/5121 [51:11<21:08,  1.44it/s]

Example 3300
Question: What is the average incubation period for mumps?
Short Answer: 17 days
Expanded Answer: The average incubation period for mumps is around 17 days.


Expanding answers:  66%|██████▋   | 3400/5121 [52:21<18:20,  1.56it/s]

Example 3400
Question: What is the Epworth sleepiness scale?
Short Answer: a questionnaire used to assess how likely it is you'll fall asleep while doing different activities
Expanded Answer: The Epworth sleepiness scale is a questionnaire that is utilized to evaluate the likelihood of an individual falling asleep while engaging in various activities.


Expanding answers:  68%|██████▊   | 3500/5121 [53:35<17:13,  1.57it/s]

Example 3500
Question: What is the name of the condition that can affect people with Noonan syndrome?
Short Answer: lymphoedema
Expanded Answer: Lymphoedema is the name of the condition that can affect people with Noonan syndrome.


Expanding answers:  70%|███████   | 3600/5121 [54:48<16:45,  1.51it/s]

Example 3600
Question: What is the most common injury in people with osteoporosis?
Short Answer: broken wrist
Expanded Answer: The most common injury in people with osteoporosis is a broken wrist.


Expanding answers:  72%|███████▏  | 3700/5121 [56:05<18:02,  1.31it/s]

Example 3700
Question: What is the aim of your treatment if you have advanced pancreatic cancer?
Short Answer: to limit the cancer and its symptoms, and help you live longer
Expanded Answer: The aim of your treatment if you have advanced pancreatic cancer is to limit the cancer and its symptoms, and help you live longer.


Expanding answers:  74%|███████▍  | 3800/5121 [57:19<14:59,  1.47it/s]

Example 3800
Question: What can help with mood swings, low mood and anxiety around the time of the menopause and perimenopause?
Short Answer: Cognitive behavioural therapy
Expanded Answer: Cognitive behavioural therapy (CBT) is a type of talking therapy that can help with mood swings, low mood, and anxiety around the time of menopause and perimenopause.


Expanding answers:  76%|███████▌  | 3900/5121 [58:36<17:48,  1.14it/s]

Example 3900
Question: What is the most common type of poisoning?
Short Answer: accidental poisoning
Expanded Answer: The most common type of poisoning is accidental poisoning, which often occurs at home and poses a higher risk to children under the age of 5.


Expanding answers:  78%|███████▊  | 4000/5121 [59:51<15:32,  1.20it/s]

Example 4000
Question: What is the name of the condition that is caused by contact with strong chemicals?
Short Answer: dyshidrotic eczema
Expanded Answer: The condition that is caused by contact with strong chemicals is called dyshidrotic eczema.


Expanding answers:  80%|████████  | 4100/5121 [1:01:04<11:45,  1.45it/s]

Example 4100
Question: What can be done to stop progressive supranuclear palsy gradually worsening?
Short Answer: There's currently nothing that can be done to stop PSP gradually worsening
Expanded Answer: Unfortunately, there is currently no known treatment or intervention that can effectively halt the gradual worsening of progressive supranuclear palsy (PSP).


Expanding answers:  82%|████████▏ | 4200/5121 [1:02:17<11:08,  1.38it/s]

Example 4200
Question: What is the name of the group that provides information and support for people affected by IPF?
Short Answer: UK Charities Action for Pulmonary Fibrosis
Expanded Answer: The group that provides information and support for people affected by IPF is UK Charities Action for Pulmonary Fibrosis.


Expanding answers:  84%|████████▍ | 4300/5121 [1:03:33<08:45,  1.56it/s]

Example 4300
Question: What is the name of the gene that causes Rett syndrome?
Short Answer: MECP2
Expanded Answer: The gene that causes Rett syndrome is called MECP2.


Expanding answers:  86%|████████▌ | 4400/5121 [1:04:52<10:45,  1.12it/s]

Example 4400
Question: What is the first sign of scarlet fever?
Short Answer: high temperature
Expanded Answer: The first sign of scarlet fever is a high temperature.


Expanding answers:  88%|████████▊ | 4500/5121 [1:06:08<07:04,  1.46it/s]

Example 4500
Question: What is the main test for melanoma?
Short Answer: excision biopsy
Expanded Answer: The main test for melanoma is an excision biopsy, where a specialist cuts out the mole and a small area of surrounding skin to be sent to a lab and checked for cancer.


Expanding answers:  90%|████████▉ | 4600/5121 [1:08:51<05:15,  1.65it/s]

Example 4600
Question: What can cause pain when peeing on your eyes?
Short Answer: blisters and sores
Expanded Answer: Blisters and sores on your eyes can cause pain when peeing.


Expanding answers:  92%|█████████▏| 4700/5121 [1:10:07<07:36,  1.09s/it]

Example 4700
Question: What is the best way to treat sunburn?
Short Answer: get out of the sun as soon as possible
Expanded Answer: The best way to treat sunburn is to get out of the sun as soon as possible.


Expanding answers:  94%|█████████▎| 4800/5121 [1:11:16<03:09,  1.69it/s]

Example 4800
Question: What is the purpose of chemotherapy?
Short Answer: kill cancer cells
Expanded Answer: The purpose of chemotherapy is to use medicines to kill cancer cells.


Expanding answers:  96%|█████████▌| 4900/5121 [1:12:32<02:44,  1.34it/s]

Example 4900
Question: What is the name of the condition that causes your fingers to bend into the palm of your hand?
Short Answer: Dupuytren's contracture
Expanded Answer: The condition that causes one or more fingers to bend into the palm of your hand is called Dupuytren's contracture.


Expanding answers:  98%|█████████▊| 5000/5121 [1:13:49<01:34,  1.29it/s]

Example 5000
Question: What are the main symptoms of vaginal cancer?
Short Answer: a lump in the vagina ulcers and other skin changes in or around the vagina
Expanded Answer: The main symptoms of vaginal cancer include the presence of a lump in the vagina and ulcers and other skin changes in or around the vagina.


Expanding answers: 100%|█████████▉| 5100/5121 [1:15:02<00:13,  1.51it/s]

Example 5100
Question: What can be done to help prevent infecting others?
Short Answer: wash your hands with soap and water after going to the toilet or changing nappies
Expanded Answer: To help prevent infecting others, you should wash your hands with soap and water after going to the toilet or changing nappies.


Expanding answers: 100%|██████████| 5121/5121 [1:15:16<00:00,  1.13it/s]

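The run above makes 5,121 sequential API calls over roughly 75 minutes and writes `expanded_dataset.json` only after the loop finishes, so a single failed request would abort the loop before anything is saved. A retry wrapper is one way to make the loop more tolerant of transient errors. This is a minimal sketch: `call_with_retries` is a hypothetical helper, and catching every `Exception` is a simplification that would likely be narrowed to the client's transient error types in practice.

```python
import random
import time


def call_with_retries(fn, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call fn() and retry with exponential backoff plus jitter if it raises."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:
            # After the final attempt, let the caller see the error
            if attempt == max_attempts:
                raise
            # Wait 1x, 2x, 4x, ... the base delay, with up to 10% random jitter
            delay = base_delay * 2 ** (attempt - 1)
            sleep(delay * (1 + 0.1 * random.random()))
```

Inside the loop, the request could then be wrapped as `call_with_retries(lambda: client.chat.completions.create(model="gpt-3.5-turbo", messages=[{"role": "user", "content": prompt}], temperature=0.4))`.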

##Saving Expanded dataset

In [None]:
save_path = "/content/drive/MyDrive/expanded_dataset.json"

# Save as JSON
with open(save_path, "w", encoding="utf-8") as f:
    json.dump(expanded_data, f, indent=2, ensure_ascii=False)

print(f"Saved to {save_path}")

Saved to /content/drive/MyDrive/expanded_dataset.json
