# GI Intermediate Top 10 Reading List
In this notebook, we use RadQG to generate 100 MCQs and 100 short-answer (Anki) questions from the RadioGraphics GI Intermediate Top 10 Reading List. We then save these into Word documents for further editing and review.

In [6]:
%load_ext autoreload
%reload_ext autoreload
%autoreload 2

import datetime
import os
import pickle
import sys
from tqdm.notebook import tqdm

sys.path.append("../../")
import matplotlib.pyplot as plt
import skimage.io as io
import radqg.configs as configs
from radqg.generator import Generator
from radqg.llm.openai import embed_fn as openai_embed_fn
from radqg.llm.openai import qa as openai_qa

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


### Extracting the figures and text from HTML files 

Please download your desired articles from the RadioGraphics website as HTML files and put them in the toy_data_dir folder. Do not change the names of the files and folders when saving them from the website. Five sample articles are already provided.

In [8]:
# Listing all the HTML files

DATA_DIR = "/research/projects/m221279_Pouria/RadQG/data/html_articles/Top_10_GI_R4"

print("Name of the articles: \n")
for file in os.listdir(DATA_DIR):
    if file.endswith(".html"):
        print(file)

Name of the articles: 

MR Imaging Evaluation of Perianal Fistulas_ Spectrum of Imaging Features _ RadioGraphics.html
Gallbladder Carcinoma and Its Differential Diagnosis at MRI_ What Radiologists Should Know _ RadioGraphics.html
Imaging Features of Premalignant Biliary Lesions and Predisposing Conditions with Pathologic Correlation _ RadioGraphics.html
CT and PET in Stomach Cancer_ Preoperative Staging and Monitoring of Response to Therapy _ RadioGraphics.html
Chronic Pancreatitis or Pancreatic Tumor_ A Problem-solving Approach _ RadioGraphics.html
Gastrointestinal Bleeding at CT Angiography and CT Enterography_ Imaging Atlas and Glossary of Terms _ RadioGraphics.html
Imaging and Surgical Management of Anorectal Vaginal Fistulas _ RadioGraphics.html
Pancreatic Neuroendocrine Neoplasms_ 2020 Update on Pathologic and Imaging Findings and Classification _ RadioGraphics.html
Heterotopic Pancreas_ Histopathologic Features, Imaging Findings, and Complications _ RadioGraphics.html
Liver Meta

### Create a QA generator

Creating the QA generator is the first step in the pipeline. In addition to the path to the directory containing the HTML files, we need to specify an embedding function (e.g., from OpenAI) and the `chunk_size` and `chunk_overlap` values used for splitting the articles into chunks. The latter two can be changed in the notebook or in the `config.py` file.
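
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of overlapping chunking. It is an illustration only: `split_into_chunks` is a hypothetical helper that works on characters, and RadQG's own splitter may work differently (for example on tokens or sentence boundaries).

```python
from typing import List


def split_into_chunks(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Split text into character windows of chunk_size that share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
    return chunks


example_chunks = split_into_chunks("abcdefghij", chunk_size=4, chunk_overlap=2)
print(example_chunks)  # ['abcd', 'cdef', 'efgh', 'ghij']
```

A larger overlap repeats more text between neighbouring chunks, which helps keep a sentence that straddles a boundary intact in at least one chunk, at the cost of embedding more text.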

The next step is to set up the generator. This step returns all the article names, the paths to the figures detected for the articles, their captions, and a Python sampler for selecting random figures as the source for question generation. The user can specify a word or phrase as the `topic` of interest when setting up the question bank for the generator. If a topic is provided, the QA generator will be more inclined to select figures whose captions are related to the topic as the source for question generation. Otherwise, the generator will pick completely random figures.
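
The idea of a topic-biased sampler can be sketched as weighted sampling without replacement. Everything below is an assumption made for illustration: `caption_score` and `sample_figures` are hypothetical helpers, and plain word overlap stands in for whatever embedding-based similarity RadQG actually uses.

```python
import random
from typing import List, Optional


def caption_score(caption: str, topic: Optional[str]) -> float:
    """Toy relevance weight: 1 plus the number of topic words found in the caption."""
    if not topic:
        return 1.0  # no topic: every figure is equally likely
    caption_words = set(caption.lower().split())
    hits = sum(word in caption_words for word in topic.lower().split())
    return 1.0 + hits


def sample_figures(captions: List[str], topic: Optional[str], k: int, seed: int = 0) -> List[int]:
    """Pick up to k distinct caption indices, favouring captions related to the topic."""
    rng = random.Random(seed)
    remaining = list(range(len(captions)))
    weights = [caption_score(captions[i], topic) for i in remaining]
    picked = []
    for _ in range(min(k, len(remaining))):
        position = rng.choices(range(len(remaining)), weights=weights, k=1)[0]
        picked.append(remaining.pop(position))  # remove so it cannot be drawn again
        weights.pop(position)
    return picked


toy_captions = [
    "Axial CT image of the pancreas",
    "Coronal MR image of a perianal fistula",
    "Axial CT image of the stomach",
]
picked = sample_figures(toy_captions, topic="pancreas", k=2)
print(picked)
```

With `topic=None` all weights are equal, which corresponds to the fully random behaviour described above.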

In [9]:
# Setting up the generator

generator = Generator(
    data_dir=DATA_DIR,
    embed_fn=openai_embed_fn,
    chunk_size=configs.CHUNK_SIZE,
    chunk_overlap=configs.CHUNK_OVERLAP,
)

topic = None
article_names, figpaths, captions, sampler = generator.setup_qbank(topic)

OperationalError: attempt to write a readonly database
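
The error above is raised by SQLite when the database file, or the directory that contains it, is not writable by the current process. Where RadQG keeps its database is not shown in this notebook, so the sketch below only checks write access for a path we supply; `describe_write_access` is a hypothetical helper, and the temporary directory is a placeholder for the real location.

```python
import os
import tempfile


def describe_write_access(path: str) -> str:
    """Report whether a file or directory can be written by this process."""
    if not os.path.exists(path):
        return f"{path}: does not exist"
    if os.access(path, os.W_OK):
        return f"{path}: writable"
    return f"{path}: read-only for this process"


# Placeholder location; replace with the directory that holds the database.
db_dir = tempfile.gettempdir()
print(describe_write_access(db_dir))
print(describe_write_access(os.path.join(db_dir, "radqg_example_missing_file.sqlite3")))
```

SQLite needs write access to both the database file and its parent directory (for journal files), so it is worth checking both.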

### Question Generation Pipeline

In [4]:
# Sampling a certain number of figures

target_question_number = 50
selected_article_names = []
selected_figpaths = []
selected_captions = []

for i in tqdm(range(target_question_number)):
    article_name, figpath, caption = generator.select_figure(
        article_names, figpaths, captions, sampler, reset_memory=False
    )
    selected_article_names.append(article_name)
    selected_figpaths.append(figpath)
    selected_captions.append(caption)

assert (
    len(selected_article_names)
    == len(selected_figpaths)
    == len(selected_captions)
    == target_question_number
)
assert len(set(selected_figpaths)) == target_question_number

  0%|          | 0/50 [00:00<?, ?it/s]

#### Generate Questions from Figures

In [None]:
generated_mcqs = []
generated_saqs = []

for i, (article_name, figpath, caption) in enumerate(
    zip(selected_article_names, selected_figpaths, selected_captions)
):
    # Multiple-choice questions

    print("\n", "*" * 50, f"Generation {i} - MCQ logs:", "*" * 50, "\n")
    print(f"Figure: {figpath}")

    time0 = datetime.datetime.now()
    mcq_qa_json, mcq_context, mcq_price, mcq_conversation = generator.generate_qa(
        qa_fn=openai_qa,
        article_name=article_name,
        figpath=figpath,
        caption=caption,
        type_of_question="MCQ",
        complete_return=True,
    )
    mcq_time = datetime.datetime.now() - time0
    mcq_seconds = mcq_time.total_seconds()
    generated_mcqs.append([mcq_qa_json, mcq_price, mcq_seconds, mcq_conversation])
    print(f"MCQ: seconds: {mcq_seconds} - price: {mcq_price}")

    # Short-answer questions

    print("\n", "*" * 50, f"Generation {i} - SAQ logs:", "*" * 50, "\n")
    print(f"Figure: {figpath}")

    # Use a separate start time so the SAQ timing excludes the MCQ bookkeeping above
    time1 = datetime.datetime.now()
    saq_qa_json, saq_context, saq_price, saq_conversation = generator.generate_qa(
        qa_fn=openai_qa,
        article_name=article_name,
        figpath=figpath,
        caption=caption,
        type_of_question="Short-Answer",
        complete_return=True,
    )
    saq_time = datetime.datetime.now() - time1
    saq_seconds = saq_time.total_seconds()
    generated_saqs.append([saq_qa_json, saq_price, saq_seconds, saq_conversation])
    print(f"SAQ: seconds: {saq_seconds} - price: {saq_price}")

In [None]:
# Saving the generated questions

with open("generated_mcqs2.pkl", "wb") as f:
    pickle.dump(generated_mcqs, f)

with open("generated_saqs2.pkl", "wb") as f:
    pickle.dump(generated_saqs, f)

with open("selected_article_names.pkl", "wb") as f:
    pickle.dump(selected_article_names, f)

with open("selected_figpaths.pkl", "wb") as f:
    pickle.dump(selected_figpaths, f)

with open("selected_captions.pkl", "wb") as f:
    pickle.dump(selected_captions, f)
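
As an alternative to five separate pickle files, the results of one run could be bundled into a single file so that everything saved together is loaded together. This is only a sketch: `save_bundle`, `load_bundle`, the file name, and the toy values are all hypothetical and not part of RadQG.

```python
import os
import pickle
import tempfile


def save_bundle(path, **objects):
    """Pickle all keyword arguments into one file as a dictionary."""
    with open(path, "wb") as f:
        pickle.dump(objects, f)


def load_bundle(path):
    """Load a dictionary written by save_bundle (only open files you trust)."""
    with open(path, "rb") as f:
        return pickle.load(f)


# Toy values standing in for the real lists built above
bundle_path = os.path.join(tempfile.mkdtemp(), "run_bundle.pkl")
save_bundle(
    bundle_path,
    generated_mcqs=[["question json", 0.5, 1.0, []]],
    selected_captions=["Figure caption"],
)
restored = load_bundle(bundle_path)
print(sorted(restored))  # ['generated_mcqs', 'selected_captions']
```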

### Building the Word files

In [None]:
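# Note: the saving cell above writes generated_mcqs2.pkl and generated_saqs2.pkl, while this
# cell reads generated_mcqs.pkl and generated_saqs.pkl; adjust the filenames to load that run.
with open("generated_mcqs.pkl", "rb") as f: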
    generated_mcqs = pickle.load(f)

with open("generated_saqs.pkl", "rb") as f:
    generated_saqs = pickle.load(f)

with open("selected_article_names.pkl", "rb") as f:
    selected_article_names = pickle.load(f)

with open("selected_figpaths.pkl", "rb") as f:
    selected_figpaths = pickle.load(f)

with open("selected_captions.pkl", "rb") as f:
    selected_captions = pickle.load(f)

In [None]:
all_mcq_questions = [x[0] for x in generated_mcqs]
all_mcq_prices = [x[1] for x in generated_mcqs]
all_mcq_seconds = [x[2] for x in generated_mcqs]
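
Once the prices and timings are split out as above, a quick summary of cost and speed is easy to compute. The sketch below assumes that the prices and seconds are plain numbers; `summarize_runs` is a hypothetical helper and the values passed to it are made up.

```python
from statistics import mean


def summarize_runs(prices, seconds):
    """Return totals and means for per-question prices and generation times."""
    return {
        "n_questions": len(prices),
        "total_price": sum(prices),
        "mean_price": mean(prices) if prices else 0.0,
        "total_seconds": sum(seconds),
        "mean_seconds": mean(seconds) if seconds else 0.0,
    }


# Made-up numbers, used only to show the shape of the result
summary = summarize_runs(prices=[0.25, 0.75], seconds=[10.0, 14.0])
print(summary)
```

In the notebook, the call would be `summarize_runs(all_mcq_prices, all_mcq_seconds)`.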

In [None]:
print("An example:")
print(selected_article_names[0])
print(selected_figpaths[0])
print(selected_captions[0])
print(all_mcq_questions[0])
print(all_mcq_prices[0])
print(all_mcq_seconds[0])

An example:
Imaging of Drug-induced Complications in the Gastrointestinal System _ RadioGraphics.html
/research/projects/m221279_Pouria/RadQG/data/html_articles/Top_10_GI_intermediate/Imaging of Drug-induced Complications in the Gastrointestinal System _ RadioGraphics_files/images_medium_rg.2016150132.fig14a.gif
Figure 14a.HCAin a 24-year-old woman taking oral contraceptive pills who presented for imaging follow up. Axial arterial phase(a), portal venous phase(b), and 1-hour delayed phase(c)MR images obtained with hepatocyte-specific gadolinium contrast agent show a liver lesion in the right lobe (arrow) with early arterial enhancement that persists in the portal venous phase. The late persistent enhancement seen at the periphery incis due to a ductular reaction (proliferation of ductular structures from a large duct obstruction). Pathologic analysis of a surgical tissue sample showedHCA, inflammatory subtype.
{'question': 'A 72-year-old woman presents with diffuse abdominal pain and h