# 🧠 Biomedical QA with BioBERT + MedQuAD  
**By Michael Yassa**

This project builds a **medical question answering (QA) system** using:
- **BioBERT**, a transformer model pre-trained on biomedical text
- **MedQuAD**, a real-world medical dataset of over 47,000 QA pairs from trusted sources like NIH and MedlinePlus
- **FAISS**, a fast similarity search library for scalable retrieval

It enables users to ask health-related questions and retrieve relevant, expert-verified answers in real time.

---

## 🎯 Objectives

- Load and parse the MedQuAD biomedical dataset
- Use **Sentence-BERT** to embed questions semantically
- Build a **FAISS similarity index** for fast retrieval
- Integrate with **BioBERT QA model** for advanced medical understanding (optional)
- Provide a simple interface to ask medical questions

---

## 📦 Dataset: MedQuAD

MedQuAD (Medical Question Answering Dataset) contains XML files of Q&A pairs covering:
- Diseases, Symptoms, Treatments
- Genetics, Diagnosis, Prognosis
- Drugs and Medical Tests

It is extracted from:
- MedlinePlus
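- NIH Genetics Home Reference
- National Institute of Neurological Disorders and Stroke (NINDS)
- ...and more

The full dataset lists over 47,000 QA pairs; the parsing step in this notebook loaded ✅ **16,407** pairs that have both a question and a non-empty answer.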

---

## 🔍 Approach Overview

| Step | Description |
|------|-------------|
| 1️⃣   | Load and parse Q&A XML files from MedQuAD |
| 2️⃣   | Store clean QA pairs into a JSON file |
| 3️⃣   | Encode the **questions** using `sentence-transformers` |
| 4️⃣   | Build a **FAISS index** for fast similarity-based search |
| 5️⃣   | Retrieve the most relevant Q&A pairs using user queries |
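
Steps 3 to 5 can be sketched without any external dependencies. The example below is only an illustration under stated assumptions: `embed` is a hypothetical bag-of-words stand-in for the Sentence-BERT encoder, `search` is a brute-force stand-in for the FAISS index, and the three QA pairs are made-up toy data rather than MedQuAD records.

```python
from collections import Counter

# Hypothetical toy data standing in for the parsed MedQuAD pairs.
qa_pairs = [
    {"question": "What are the symptoms of diabetes?", "answer": "Toy answer about diabetes."},
    {"question": "How is asthma treated?", "answer": "Toy answer about asthma."},
    {"question": "What causes anemia?", "answer": "Toy answer about anemia."},
]

def embed(text):
    """Toy embedding: lowercase word counts (NOT a real sentence encoder)."""
    words = "".join(c if c.isalnum() else " " for c in text.lower()).split()
    return Counter(words)

def squared_l2(a, b):
    """Squared Euclidean distance between two sparse count vectors."""
    return sum((a[k] - b[k]) ** 2 for k in set(a) | set(b))

# Step 3: encode the questions. Step 4: the "index" is just the list of vectors.
index = [embed(qa["question"]) for qa in qa_pairs]

def search(query, top_k=2):
    """Step 5: return the top_k (distance, qa_pair) hits, nearest first."""
    q = embed(query)
    scored = sorted((squared_l2(q, vec), i) for i, vec in enumerate(index))
    return [(dist, qa_pairs[i]) for dist, i in scored[:top_k]]

hits = search("symptoms of diabetes")
print(hits[0][1]["question"])
```

The real pipeline swaps the word counts for dense sentence embeddings and the sorted list for a FAISS index, but the data flow (encode once, store vectors, rank by distance at query time) is the same.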

---

## 🛠 Tools Used

- 🤖 **Transformers**: BioBERT + AutoTokenizer for QA
- 💬 **Sentence-BERT**: For semantic question embeddings
- ⚡ **FAISS**: High-speed nearest-neighbor search
- 🧪 **MedQuAD**: Biomedical QA dataset
- 🐍 **Python, JSON, XML**: Data preprocessing and parsing

---

## 🚀 Example

```python
ask_bot("What are the symptoms of diabetes?")
```

# 📦 Step 1: Install Required Packages

Install the required NLP, QA, and similarity search libraries.

```python
!pip -q install transformers torch sentencepiece faiss-cpu sentence-transformers
```


# ✅ Step 2: Load BioBERT QA Model

Import the QA model utilities from Hugging Face and load a BioBERT checkpoint fine-tuned for QA.

```python
from transformers import AutoTokenizer, AutoModelForQuestionAnswering
import torch

# Load a BioBERT model fine-tuned for QA.
# The qa_ prefix keeps these names distinct from the Sentence-BERT `model` created in Step 6.
QA_CHECKPOINT = "ktrapeznikov/biobert_v1.1_pubmed_squad_v2"
qa_tokenizer = AutoTokenizer.from_pretrained(QA_CHECKPOINT)
qa_model = AutoModelForQuestionAnswering.from_pretrained(QA_CHECKPOINT)
```



Some weights of the model checkpoint at ktrapeznikov/biobert_v1.1_pubmed_squad_v2 were not used when initializing BertForQuestionAnswering: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
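
An extractive QA head such as `BertForQuestionAnswering` produces one start logit and one end logit per token, and the predicted answer is the span with the highest combined score. The sketch below shows that selection step on plain Python lists. `best_span` and `max_answer_len` are hypothetical names chosen for illustration, and the tokens and logits are made-up numbers, not output from the BioBERT checkpoint.

```python
def best_span(start_logits, end_logits, max_answer_len=30):
    """Return (start, end, score) maximizing start_logits[s] + end_logits[e]
    subject to s <= e < s + max_answer_len."""
    best = None
    for s, s_score in enumerate(start_logits):
        last = min(len(end_logits), s + max_answer_len)
        for e in range(s, last):
            score = s_score + end_logits[e]
            if best is None or score > best[2]:
                best = (s, e, score)
    return best

# Made-up tokens and logits for illustration only.
tokens = ["diabetes", "causes", "thirst", "and", "fatigue", "."]
start_logits = [0.1, 0.2, 3.0, 0.0, 1.0, -1.0]
end_logits = [0.0, 0.1, 1.5, 0.2, 2.5, -0.5]

start, end, score = best_span(start_logits, end_logits)
answer = " ".join(tokens[start:end + 1])
print(answer)
```

In a retrieve-then-read setup, the retrieved MedQuAD answer would serve as the context passage and the user's question as the query; this notebook stops at retrieval, so that wiring is left as the optional extension named in the objectives.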


# 📁 Step 3: Mount Google Drive and Unzip MedQuAD

```python
from google.colab import drive
drive.mount('/content/drive')
# Quiet unzip: -q hides individual file names.
# On a re-run, adding -n skips files that already exist instead of prompting.
!unzip -q "/content/drive/MyDrive/MedBot/MedQuAD.zip" -d "/content/MedQuAD"
```

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
replace /content/MedQuAD/MedQuAD/.git/config? [y]es, [n]o, [A]ll, [N]one, [r]ename: n
replace /content/MedQuAD/MedQuAD/.git/description? [y]es, [n]o, [A]ll, [N]one, [r]ename: N
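
The `replace ...?` prompts in the output above appear because the archive had already been extracted by an earlier run. One way to make the step safe to re-run is to extract from Python and skip files that already exist. The sketch below uses only the standard library; `extract_once` is a hypothetical helper, and the demo builds a throwaway archive instead of touching the real `MedQuAD.zip`.

```python
import tempfile
import zipfile
from pathlib import Path

def extract_once(zip_path, dest_dir):
    """Extract zip_path into dest_dir, skipping files that already exist.

    Returns the number of files written on this call.
    """
    dest = Path(dest_dir)
    written = 0
    with zipfile.ZipFile(zip_path) as archive:
        for member in archive.infolist():
            target = dest / member.filename
            if member.is_dir() or target.exists():
                continue  # same effect as answering "n" to each prompt
            archive.extract(member, dest)
            written += 1
    return written

# Demo on a throwaway archive so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    zip_path = Path(tmp) / "sample.zip"
    with zipfile.ZipFile(zip_path, "w") as archive:
        archive.writestr("MedQuAD/a.xml", "<Document/>")
        archive.writestr("MedQuAD/b.xml", "<Document/>")
    first = extract_once(zip_path, Path(tmp) / "out")
    second = extract_once(zip_path, Path(tmp) / "out")

print(first, second)
```

The second call writes nothing, so re-running the cell neither prompts nor overwrites earlier files.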


# 📥 Step 4: Parse XML files and Extract Q&A Pairs

```python
import os
import xml.etree.ElementTree as ET

base_path = "/content/MedQuAD/MedQuAD"
qa_pairs = []

# Walk every collection folder and read each XML document.
for root_dir, _, files in os.walk(base_path):
    for file in files:
        if not file.endswith(".xml"):
            continue
        file_path = os.path.join(root_dir, file)
        try:
            root = ET.parse(file_path).getroot()
        except (ET.ParseError, OSError) as e:
            print(f"⚠️ Skipping {file}: {e}")
            continue
        qa_section = root.find("QAPairs")
        if qa_section is None:
            continue
        for pair in qa_section.findall("QAPair"):
            question = pair.findtext("Question")
            answer = pair.findtext("Answer")
            # Keep only pairs that have both a question and an answer.
            if question and answer:
                qa_pairs.append({
                    "question": question.strip(),
                    "answer": answer.strip()
                })

print("✅ Total Q&A pairs loaded:", len(qa_pairs))
```

✅ Total Q&A pairs loaded: 16407
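
The parser above relies on a `QAPairs` / `QAPair` / `Question` / `Answer` element layout and silently drops pairs with an empty answer, which is one plausible reason the loaded count is lower than the dataset's headline size. The sketch below runs the same logic on an inline sample and also counts the dropped pairs. The sample document, its `pid` and `qid` attributes, and the `extract_pairs` helper are hypothetical and only mimic the structure the parsing code expects.

```python
import xml.etree.ElementTree as ET

# Hypothetical document in the layout the parser above expects.
SAMPLE_XML = """
<Document>
  <Focus>Diabetes</Focus>
  <QAPairs>
    <QAPair pid="1">
      <Question qid="1-1">What is diabetes?</Question>
      <Answer>  A disease in which blood glucose is too high.  </Answer>
    </QAPair>
    <QAPair pid="2">
      <Question qid="1-2">What causes diabetes?</Question>
      <Answer></Answer>
    </QAPair>
  </QAPairs>
</Document>
"""

def extract_pairs(xml_text):
    """Return (kept_pairs, skipped_count) for one document."""
    root = ET.fromstring(xml_text.strip())
    kept, skipped = [], 0
    for pair in root.iterfind("QAPairs/QAPair"):
        question = (pair.findtext("Question") or "").strip()
        answer = (pair.findtext("Answer") or "").strip()
        if question and answer:
            kept.append({"question": question, "answer": answer})
        else:
            skipped += 1
    return kept, skipped

kept, skipped = extract_pairs(SAMPLE_XML)
print(len(kept), "kept,", skipped, "skipped")
```

Tracking the skipped count alongside the kept pairs makes it easy to report how much of the dataset is usable for retrieval.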


# 💾 Step 5: Save extracted QA pairs to a JSON file

```python
import json

# An explicit encoding keeps the file platform-independent.
with open("medquad_qa.json", "w", encoding="utf-8") as f:
    json.dump(qa_pairs, f, indent=2)
print("✅ Saved to medquad_qa.json")
```

✅ Saved to medquad_qa.json
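
The saved file is a JSON array of objects with a `question` and an `answer` key. A small round-trip check can confirm that layout before the index is built. The sketch below is an assumption-laden illustration: `save_pairs`, `load_pairs`, and the sample records are hypothetical, and it writes to a temporary directory rather than to `medquad_qa.json`.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical records in the same shape as the extracted pairs.
sample_pairs = [
    {"question": "What is a fever?", "answer": "A body temperature above the normal range."},
    {"question": "What is normal body temperature?", "answer": "About 37 °C."},
]

def save_pairs(pairs, path):
    """Write the pairs as UTF-8 JSON, keeping non-ASCII characters readable."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(pairs, f, indent=2, ensure_ascii=False)

def load_pairs(path):
    """Read the pairs back and verify that every record has both keys."""
    with open(path, "r", encoding="utf-8") as f:
        pairs = json.load(f)
    for i, pair in enumerate(pairs):
        missing = {"question", "answer"} - set(pair)
        if missing:
            raise ValueError(f"record {i} is missing {sorted(missing)}")
    return pairs

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "medquad_qa.json"
    save_pairs(sample_pairs, path)
    restored = load_pairs(path)

print(restored == sample_pairs)
```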


# 🔍 Step 6: Build a FAISS-based Semantic Search Index

```python
import json

from sentence_transformers import SentenceTransformer
import faiss
import numpy as np

# Load QA pairs
with open("medquad_qa.json", "r", encoding="utf-8") as f:
    qa_pairs = json.load(f)

# Embed questions.
# Note: `model` here is the sentence encoder; ask_bot in Step 7 uses this name.
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
questions = [qa["question"] for qa in qa_pairs]
embeddings = model.encode(questions, show_progress_bar=True)

# Build an exact (brute-force) L2 index; FAISS expects float32 vectors.
embeddings = np.asarray(embeddings, dtype="float32")
dimension = embeddings.shape[1]
index = faiss.IndexFlatL2(dimension)
index.add(embeddings)
print("✅ FAISS index built with", len(questions), "questions.")
```

✅ FAISS index built with 16407 questions.
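
`IndexFlatL2` ranks neighbors by squared Euclidean distance. If the embeddings are scaled to unit length first, that ranking coincides with cosine similarity, because for unit vectors the squared distance equals 2 minus twice the dot product. The sketch below checks this identity on made-up 3-dimensional vectors using only the standard library. It does not call FAISS, and whether normalization changes retrieval quality on MedQuAD is not something this notebook measures.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def squared_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Made-up vectors standing in for question embeddings.
corpus = [normalize(v) for v in ([1.0, 0.0, 0.0], [0.6, 0.8, 0.0], [0.0, 0.0, 1.0])]
query = normalize([0.9, 0.3, 0.1])

# Rank by ascending squared L2 distance and by descending cosine similarity.
by_l2 = sorted(range(len(corpus)), key=lambda i: squared_l2(query, corpus[i]))
by_cosine = sorted(range(len(corpus)), key=lambda i: -dot(query, corpus[i]))

# For unit vectors: ||a - b||^2 == 2 - 2 * (a . b)
identity_holds = all(
    math.isclose(squared_l2(query, vec), 2 - 2 * dot(query, vec), abs_tol=1e-12)
    for vec in corpus
)
print(by_l2, by_cosine, identity_holds)
```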


# 🤖 Step 7: Ask the Bot

```python
def ask_bot(question, top_k=3):
    # Embed the query with the same encoder used for the indexed questions.
    query_vec = model.encode([question])
    distances, indices = index.search(np.asarray(query_vec, dtype="float32"), top_k)

    print(f"\n💬 You asked: {question}")
    for i, idx in enumerate(indices[0]):
        print(f"\n🔹 Top {i+1} Answer:")
        print("Q:", qa_pairs[idx]["question"])
        print("A:", qa_pairs[idx]["answer"][:800])  # Limit long answers

# 🧪 Example Usage
ask_bot("What are the symptoms of diabetes?")
```


💬 You asked: What are the symptoms of diabetes?

🔹 Top 1 Answer:
Q: What are the symptoms of Diabetes ?
A: Diabetes is often called a "silent" disease because it can cause serious complications even before you have symptoms. Symptoms can also be so mild that you dont notice them. An estimated 8 million people in the United States have type 2 diabetes and dont know it, according to 2012 estimates by the Centers for Disease Control and Prevention (CDC). Common Signs Some common symptoms of diabetes are: - being very thirsty  - frequent urination  - feeling very hungry or tired  - losing weight without trying  - having sores that heal slowly  - having dry, itchy skin  - loss of feeling or tingling in the feet  - having blurry eyesight. being very thirsty frequent urination feeling very hungry or tired losing weight without trying having sores that heal slowly having dry, itchy skin loss of feeling

🔹 Top 2 Answer:
Q: What are the symptoms of Diabetes ?
A: Many people with diabet