# GI R4 Top 10 Reading List
In this notebook, we use RadQG to generate 50 multiple-choice questions (MCQs) and 50 short-answer (Anki-style) questions (SAQs) from the RadioGraphics GI R4 Top 10 Reading List. We then save them to Word documents for further editing and review.

In [13]:
%load_ext autoreload
%reload_ext autoreload
%autoreload 2

import datetime
import io
import os
import pickle
import sys

sys.path.append("../../")
from docx import Document
from docx.shared import Inches, Pt
from docx.enum.text import WD_ALIGN_PARAGRAPH
from docx.oxml.ns import qn
from docx.oxml import OxmlElement
import matplotlib.pyplot as plt
from tqdm.notebook import tqdm
from PIL import Image
import radqg.configs as configs
from radqg.generator import Generator
from radqg.llm.openai import embed_fn as openai_embed_fn
from radqg.llm.openai import qa as openai_qa

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


### Extracting the figures and text from HTML files 

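Please download your desired articles from the RadioGraphics website as HTML files and put them in the `toy_data_dir` folder, or in whichever folder `DATA_DIR` points to in the cell below. Do not rename the files and folders when saving them from the website, because each article's figures are stored in a companion `_files` folder next to its HTML file. Five sample articles are already provided.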

In [3]:
# Listing all the HTML files

DATA_DIR = "/research/projects/m221279_Pouria/RadQG/data/html_articles/Top_10_GI_R4"

print("Name of the articles: \n")
for file in os.listdir(DATA_DIR):
    if file.endswith(".html"):
        print(file)

Name of the articles: 

MR Imaging Evaluation of Perianal Fistulas_ Spectrum of Imaging Features _ RadioGraphics.html
Gallbladder Carcinoma and Its Differential Diagnosis at MRI_ What Radiologists Should Know _ RadioGraphics.html
Imaging Features of Premalignant Biliary Lesions and Predisposing Conditions with Pathologic Correlation _ RadioGraphics.html
CT and PET in Stomach Cancer_ Preoperative Staging and Monitoring of Response to Therapy _ RadioGraphics.html
Chronic Pancreatitis or Pancreatic Tumor_ A Problem-solving Approach _ RadioGraphics.html
Gastrointestinal Bleeding at CT Angiography and CT Enterography_ Imaging Atlas and Glossary of Terms _ RadioGraphics.html
Imaging and Surgical Management of Anorectal Vaginal Fistulas _ RadioGraphics.html
Pancreatic Neuroendocrine Neoplasms_ 2020 Update on Pathologic and Imaging Findings and Classification _ RadioGraphics.html
Heterotopic Pancreas_ Histopathologic Features, Imaging Findings, and Complications _ RadioGraphics.html
Liver Meta

### Create a QA generator

Creating the QA generator is the first step in the pipeline. In addition to the path to the directory containing the HTML files, we need to specify an embedding function (e.g., from OpenAI) and the `chunk_size` and `chunk_overlap` values used to split the articles into chunks. The latter two can be changed in the notebook or in the `radqg.configs` module (imported above as `configs`).
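To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of overlapping chunking. It is not RadQG's actual splitter, which is not shown in this notebook: `split_into_chunks` is a hypothetical helper, and it assumes simple character-based windows.

```python
from typing import List


def split_into_chunks(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Split text into character windows of chunk_size that overlap by chunk_overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


print(split_into_chunks("abcdefghij", chunk_size=4, chunk_overlap=2))
# ['abcd', 'cdef', 'efgh', 'ghij']
```

A larger overlap repeats more text across neighboring chunks. This helps keep a sentence that straddles a boundary intact in at least one chunk, but it also means more chunks to embed.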

The next step is to set up the generator. This step returns all the article names, the paths to the figures detected in the articles, their captions, and a Python sampler for selecting random figures as the source for question generation. When setting up the question bank, the user can specify a word or phrase as the `topic` of interest. If a topic is provided, the QA generator will be more inclined to select figures whose captions are related to it. Otherwise, the generator picks figures completely at random.
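The idea of favoring figures whose captions match a topic can be illustrated with weighted sampling without replacement. This is only a sketch under stated assumptions: RadQG's real sampler is not shown here and most likely relies on embedding similarity, whereas the hypothetical `caption_score` below uses plain word overlap as a stand-in, and `sample_figures` is likewise an illustrative name.

```python
import random
from typing import List, Optional


def caption_score(caption: str, topic: Optional[str]) -> float:
    """Toy relevance score: 1 plus the number of topic words present in the caption."""
    if not topic:
        return 1.0  # no topic: every caption is equally likely
    caption_words = set(caption.lower().split())
    return 1.0 + sum(word in caption_words for word in topic.lower().split())


def sample_figures(captions: List[str], topic: Optional[str], k: int, seed: int = 0) -> List[int]:
    """Draw up to k distinct caption indices, favoring higher-scoring captions."""
    rng = random.Random(seed)
    remaining = list(range(len(captions)))
    weights = [caption_score(captions[i], topic) for i in remaining]
    chosen = []
    for _ in range(min(k, len(remaining))):
        pick = rng.choices(range(len(remaining)), weights=weights, k=1)[0]
        chosen.append(remaining.pop(pick))  # remove so it cannot be drawn again
        weights.pop(pick)
    return chosen


captions = ["CT of the pancreas", "MRI of the liver", "pancreas tumor at CT"]
print(sample_figures(captions, topic="pancreas", k=2))
```

Because every caption keeps a base weight of 1, unrelated figures can still be drawn. This matches the description above of the generator being "more inclined" toward the topic rather than restricted to it.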

In [4]:
# Setting up the generator

generator = Generator(
    data_dir=DATA_DIR,
    embed_fn=openai_embed_fn,
    chunk_size=configs.CHUNK_SIZE,
    chunk_overlap=configs.CHUNK_OVERLAP,
)

topic = None
article_names, figpaths, captions, sampler = generator.setup_qbank(topic)

The collection "20231123_123145" has been created with:
    348 figures from 10 articles


### Question Generation Pipeline

In [5]:
# Sampling a certain number of figures

target_question_number = 50
selected_article_names = []
selected_figpaths = []
selected_captions = []

for i in tqdm(range(target_question_number)):
    article_name, figpath, caption = generator.select_figure(
        article_names, figpaths, captions, sampler, reset_memory=False
    )
    selected_article_names.append(article_name)
    selected_figpaths.append(figpath)
    selected_captions.append(caption)

assert (
    len(selected_article_names)
    == len(selected_figpaths)
    == len(selected_captions)
    == target_question_number
)
assert len(set(selected_figpaths)) == target_question_number

  0%|          | 0/50 [00:00<?, ?it/s]

#### Generate Questions from Figures

In [6]:
generated_mcqs = []
generated_saqs = []

for i, (article_name, figpath, caption) in enumerate(
    zip(selected_article_names, selected_figpaths, selected_captions)
):
    while True:
        try:
            # Multiple-choice questions

            print("\n", "*" * 50, f"Generation {i} - MCQ logs:", "*" * 50, "\n")
            print(f"Figure: {figpath}")

            time0 = datetime.datetime.now()
            (
                mcq_qa_json,
                mcq_context,
                mcq_price,
                mcq_conversation,
            ) = generator.generate_qa(
                qa_fn=openai_qa,
                article_name=article_name,
                figpath=figpath,
                caption=caption,
                type_of_question="MCQ",
                complete_return=True,
            )
            mcq_time = datetime.datetime.now() - time0
            mcq_seconds = mcq_time.total_seconds()
            print(f"MCQ: seconds: {mcq_seconds} - price: {mcq_price}")

            # Short-answer questions

            print("\n", "*" * 50, f"Generation {i} - SAQ logs:", "*" * 50, "\n")
            print(f"Figure: {figpath}")

            (
                saq_qa_json,
                saq_context,
                saq_price,
                saq_conversation,
            ) = generator.generate_qa(
                qa_fn=openai_qa,
                article_name=article_name,
                figpath=figpath,
                caption=caption,
                type_of_question="Short-Answer",
                complete_return=True,
            )
            saq_time = datetime.datetime.now() - time0 - mcq_time
            saq_seconds = saq_time.total_seconds()
            print(f"SAQ: seconds: {saq_seconds} - price: {saq_price}")

            # Store both results only after both calls have succeeded, so that
            # a retry triggered by a failed SAQ call cannot append a duplicate
            # MCQ and leave the two lists out of step.
            generated_mcqs.append(
                [mcq_qa_json, mcq_price, mcq_seconds, mcq_conversation]
            )
            generated_saqs.append(
                [saq_qa_json, saq_price, saq_seconds, saq_conversation]
            )
            break
        except Exception as e:
            print(e)
            continue


 ************************************************** Generation 0 - MCQ logs: ************************************************** 

Figure: /research/projects/m221279_Pouria/RadQG/data/html_articles/Top_10_GI_R4/Gallbladder Carcinoma and Its Differential Diagnosis at MRI_ What Radiologists Should Know _ RadioGraphics_files/images_medium_rg.2021200087.fig6b.gif

---------Round 1---------

Radiologist output: Radiologist > Question stem:
A 65-year-old man presents with right upper quadrant pain and weight loss. Based on the findings in Figure 6b, which of the following is the most likely diagnosis?
Options:
{'A': 'Gallbladder adenocarcinoma with liver invasion and extensive lymphadenopathy', 'B': 'Primary lymphoma of the gallbladder', 'C': 'Exophytic hepatocellular carcinoma', 'D': 'Pericholecystic abscess due to perforated cholecystitis', 'E': 'Metastatic liver lesion'}
Answer:
A
Educationist > Status: Fail: The question stem is revealing the figure number, which should not be disclosed.
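
The generation cell above retries inside `while True`, so a persistent failure (for example, an invalid API key) would loop forever. A bounded retry is one possible alternative. The sketch below is an illustration, not part of RadQG: `call_with_retries` and its parameters are hypothetical names.

```python
import time


def call_with_retries(fn, max_attempts=3, delay_seconds=0.0):
    """Call fn() until it succeeds or max_attempts is reached, then re-raise."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception as error:  # broad on purpose, as in the cell above
            last_error = error
            print(f"Attempt {attempt}/{max_attempts} failed: {error}")
            if attempt < max_attempts:
                time.sleep(delay_seconds)  # optional pause before retrying
    raise RuntimeError(f"All {max_attempts} attempts failed") from last_error


# Usage with a stand-in function that fails twice before succeeding.
state = {"calls": 0}

def flaky():
    state["calls"] += 1
    if state["calls"] < 3:
        raise ValueError("temporary failure")
    return "ok"

print(call_with_retries(flaky))
```

In the notebook, the two `generator.generate_qa(...)` calls for a figure could be wrapped in a single function and passed as `fn`, so that the figure is skipped or reported after a fixed number of failures.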

In [7]:
# Saving the generated questions

with open("generated_mcqs.pkl", "wb") as f:
    pickle.dump(generated_mcqs, f)

with open("generated_saqs.pkl", "wb") as f:
    pickle.dump(generated_saqs, f)

with open("selected_article_names.pkl", "wb") as f:
    pickle.dump(selected_article_names, f)

with open("selected_figpaths.pkl", "wb") as f:
    pickle.dump(selected_figpaths, f)

with open("selected_captions.pkl", "wb") as f:
    pickle.dump(selected_captions, f)
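
As an alternative to five separate pickle files, all the artifacts of a run could be bundled into one dictionary so that they are always saved and loaded together. This is a sketch only: `save_run` and `load_run` are hypothetical helpers, and the file name and values are placeholders.

```python
import pickle
import tempfile
from pathlib import Path


def save_run(path, **artifacts):
    """Pickle all keyword arguments together as a single dictionary."""
    with open(path, "wb") as f:
        pickle.dump(artifacts, f)


def load_run(path):
    """Load the dictionary written by save_run (only unpickle trusted files)."""
    with open(path, "rb") as f:
        return pickle.load(f)


# Round trip with placeholder data in a temporary directory.
with tempfile.TemporaryDirectory() as tmp_dir:
    run_path = Path(tmp_dir) / "run.pkl"
    save_run(run_path, generated_mcqs=[["q1", 0.1, 2.0, []]], selected_captions=["caption"])
    run = load_run(run_path)
    print(sorted(run))
```

Keeping everything in one file avoids the situation where, for example, `generated_mcqs.pkl` comes from a newer run than `selected_figpaths.pkl`.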

### Building the word files

In [None]:
with open("generated_mcqs.pkl", "rb") as f:
    generated_mcqs = pickle.load(f)

with open("generated_saqs.pkl", "rb") as f:
    generated_saqs = pickle.load(f)

with open("selected_article_names.pkl", "rb") as f:
    selected_article_names = pickle.load(f)

with open("selected_figpaths.pkl", "rb") as f:
    selected_figpaths = pickle.load(f)

with open("selected_captions.pkl", "rb") as f:
    selected_captions = pickle.load(f)

In [23]:
all_mcq_questions = [x[0] for x in generated_mcqs]
all_mcq_prices = [x[1] for x in generated_mcqs]
all_mcq_seconds = [x[2] for x in generated_mcqs]
all_mcq_len_conversations = [len(x[3]) - 1 for x in generated_mcqs]

print("Number of generated MCQs: ", len(all_mcq_questions))
print("Average price of MCQs: ", sum(all_mcq_prices) / len(all_mcq_prices))
print("Average time of MCQs: ", sum(all_mcq_seconds) / len(all_mcq_seconds))
print("Total price of MCQs: ", sum(all_mcq_prices))
print("Total time of MCQs (in hours): ", sum(all_mcq_seconds) / 3600)
print(
    "Average number of conversations: ",
    sum(all_mcq_len_conversations) / len(all_mcq_len_conversations),
)

all_saq_questions = [x[0] for x in generated_saqs]
all_saq_prices = [x[1] for x in generated_saqs]
all_saq_seconds = [x[2] for x in generated_saqs]
all_saq_len_conversations = [len(x[3]) - 1 for x in generated_saqs]

print("Number of generated SAQs: ", len(all_saq_questions))
print("Average price of SAQs: ", sum(all_saq_prices) / len(all_saq_prices))
print("Average time of SAQs: ", sum(all_saq_seconds) / len(all_saq_seconds))
print("Total price of SAQs: ", sum(all_saq_prices))
print("Total time of SAQs (in hours): ", sum(all_saq_seconds) / 3600)
print(
    "Average number of conversations: ",
    sum(all_saq_len_conversations) / len(all_saq_len_conversations),
)

assert (
    len(all_mcq_questions)
    == len(all_mcq_prices)
    == len(all_mcq_seconds)
    == len(all_saq_questions)
    == len(all_saq_prices)
    == len(all_saq_seconds)
    == target_question_number
)

Number of generated MCQs:  50
Average price of MCQs:  0.3848165999999999
Average time of MCQs:  51.15475182000002
Total price of MCQs:  19.240829999999995
Total time of MCQs (in hours):  0.7104826641666669
Average number of conversations:  4.72
Number of generated SAQs:  50
Average price of SAQs:  0.25719539999999996
Average time of SAQs:  23.130401799999994
Total price of SAQs:  12.859769999999997
Total time of SAQs (in hours):  0.3212555805555555
Average number of conversations:  3.96
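
The MCQ and SAQ statistics above are computed by two nearly identical blocks. They could be factored into one function, as in the sketch below. `summarize` is a hypothetical helper that assumes each record has the layout `[qa_json, price, seconds, conversation]` used when the lists were built, and it keeps the `len(conversation) - 1` convention from the cell above.

```python
from statistics import mean


def summarize(records):
    """Compute count, price, time, and conversation-length statistics for one question type."""
    prices = [record[1] for record in records]
    seconds = [record[2] for record in records]
    rounds = [len(record[3]) - 1 for record in records]
    return {
        "count": len(records),
        "mean_price": mean(prices),
        "total_price": sum(prices),
        "mean_seconds": mean(seconds),
        "total_hours": sum(seconds) / 3600,
        "mean_rounds": mean(rounds),
    }


# Toy records for illustration only; they are not results from this run.
toy_records = [
    [{"question": "q1"}, 0.5, 30.0, ["m1", "m2", "m3"]],
    [{"question": "q2"}, 0.25, 90.0, ["m1", "m2"]],
]
for name, value in summarize(toy_records).items():
    print(f"{name}: {value}")
```

Calling `summarize(generated_mcqs)` and `summarize(generated_saqs)` would then yield the same quantities as the printed output above.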


In [17]:
def create_word_file(article_names, figpaths, captions, questions, save_path, title):
    doc = Document()

    # Add title
    title_para = doc.add_paragraph(title)
    title_para.alignment = WD_ALIGN_PARAGRAPH.CENTER
    run = title_para.runs[0]
    run.font.size = Pt(14)
    run.font.bold = True

    for i, question in enumerate(questions, start=1):
        # Question
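        # Use the style name ("Heading 1"); looking a style up by its id
        # ("Heading1") is deprecated in python-docx and emits a warning.
        doc.add_paragraph(f"Question {i}:", style="Heading 1")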
        doc.add_paragraph(question["question"])

        # Image
        if os.path.exists(figpaths[i - 1]):
            with open(figpaths[i - 1], "rb") as f:
                image_stream = io.BytesIO(f.read())
                image = Image.open(image_stream)
                image_stream = io.BytesIO()
                image.save(image_stream, format="PNG")
                doc.add_picture(image_stream, width=Inches(4))

        # Options
        if "options" in question:
            # doc.add_paragraph('Options:', style='Heading1')
            for option, text in question["options"].items():
                doc.add_paragraph(f"{option}: {text}")

        # Answer
        doc.add_paragraph("Answer:", style="Heading 1")
        doc.add_paragraph(question["answer"])

        # Source and Figure Caption
        article_name = article_names[i - 1].split(" _ RadioGraphics.html")[0]
        if article_name == "":
            article_name = article_names[i - 1].split(" _ RadioGraphics.html")[1]
        doc.add_paragraph("---Source:")
        doc.add_paragraph(article_name)
        doc.add_paragraph("---Original Figure Caption:")
        doc.add_paragraph(captions[i - 1])

        # Horizontal line
        if i < len(questions):  # No line after the last question
            p = doc.add_paragraph()
            p.alignment = WD_ALIGN_PARAGRAPH.CENTER
            run = p.add_run()
            run.add_break()
            horizontal_line = OxmlElement("w:pBdr")
            bottom_border = OxmlElement("w:bottom")
            bottom_border.set(qn("w:val"), "single")
            bottom_border.set(qn("w:sz"), "6")
            bottom_border.set(qn("w:space"), "1")
            bottom_border.set(qn("w:color"), "auto")
            horizontal_line.append(bottom_border)
            p._element.get_or_add_pPr().append(horizontal_line)

    # Save the document
    doc.save(save_path)

In [18]:
# Save MCQ questions to Word file

create_word_file(
    article_names=selected_article_names,
    figpaths=selected_figpaths,
    captions=selected_captions,
    questions=all_mcq_questions,
    save_path="MCQs.docx",
    title="AI-Generated Multiple Choice Questions from\nRadioGraphics Top 10 Reading List\n(Gastrointestinal Imaging - R4)",
)

# Save SAQ questions to Word file
create_word_file(
    article_names=selected_article_names,
    figpaths=selected_figpaths,
    captions=selected_captions,
    questions=all_saq_questions,
    save_path="SAQs.docx",
    title="AI-Generated Short Answer Questions from\nRadioGraphics Top 10 Reading List\n(Gastrointestinal Imaging - R4)",
)

  return self._get_style_id_from_style(self[style_name], style_type)
