# 🏥 RxReveal - Medical Document Question-Answering Assistant
**Capstone Submission for Gen AI Intensive Course 2025Q1**

> This notebook is a submission for the **Gen AI Intensive Course Capstone 2025Q1** organized by **Google** and **Kaggle**, conducted from **March 31 – April 4, 2025**.
 Team Members:**  Md Haaris Hussain, Venkata Satyanarayana Gudala**

---

## 📌 Overview

This project implements a sophisticated medical assistant that can answer health-related questions by referencing a corpus of medical literature. The system leverages **Retrieval-Augmented Generation (RAG)** in a three-step process:

1. **Retrieve**: Find relevant documents using semantic search  
2. **Rerank**: Identify the most pertinent information  
3. **Generate**: Create coherent, accurate answers based on retrieved context

---

## 🔍 Problem Statement

Medical professionals and patients often struggle to quickly find specific information within vast medical literature. This assistant streamlines the process by enabling:

- Natural language queries about medical conditions  
- Semantic understanding of both questions and documents  
- Context-aware answer generation from trusted sources

---

## 🧠 Technical Implementation

### 🧾 Document Processing Pipeline

```
Text Extraction → Chunking → Embedding Generation → Vector Database Indexing → Retrieval → Answer Generation
```

### 🧩 Core Components

#### 1. Document Collection & Preprocessing
- Extract text from medical PDFs using PyPDF2
- Split documents into manageable chunks
- Clean and normalize medical text data

#### 2. Document Embeddings
- Generate vector representations using Sentence-Transformers
- Transform medical text into high-dimensional embeddings 
- Capture semantic relationships between medical concepts

#### 3. Vector Database (FAISS)
- Build efficient similarity search index
- Enable sub-second query performance
- Support for exact and approximate nearest neighbor search

#### 4. Answer Generation
- Leverage pre-trained language models (FLAN-T5)
- Use retrieved context to generate accurate responses
- Ensure answers are grounded in the source documents

---

## 📊 Sample Query & Output

**Query:** "What are the common symptoms of tuberculosis?"  
**System Output:**  
- Shortness of breath  
- Cough  
- Chest pain  
- Sweating  
- Loss of appetite  
- Fever  
- Fatigue  
- Muscle ache

---

## ✅ Fit Analysis

### 🔍 Use Case & Innovation (Max 5 points)

- **Use Case:** A Medical Q&A assistant that answers user queries based on uploaded medical documents  
- **Impact:** Solves a real-world problem — making medical documents searchable and interactive using GenAI  
- **Creativity:** While medical RAG is a known use case, our implementation is focused, practical, and clearly impactful  

**✔️ Score Potential:** 4.5 / 5 (High relevance and clear application)

---

## 🧠 Gen AI Capabilities Demonstrated

| Gen AI Capability                | Used? | Notes |
|----------------------------------|-------|-------|
| Document Understanding           | ✅    | Medical PDFs processed and parsed |
| Embeddings                       | ✅    | SentenceTransformer used for vectorization |
| Vector Search / RAG              | ✅    | FAISS + top-k document retrieval |
| Retrieval Augmented Generation   | ✅    | Generated responses from relevant documents |
| Grounding                        | ✅    | Answers are grounded in source documents |

✔️ **More than 3 Gen AI capabilities used**

---

## 🛠️ Technologies Used

- **Python** for core implementation  
- **PyPDF2** for document parsing  
- **Sentence-Transformers** for embedding generation  
- **FAISS** for vector search  
- **Hugging Face Transformers (FLAN-T5)** for response generation  
- **Gradio** for interactive user interface  

---

## 🚀 Future Improvements

- Integrate biomedical-specific embedding models  
- Include document metadata for more context  
- Support cross-document reasoning  
- Enhance answer quality using medical knowledge graphs  
- Expand document coverage

---

## 🙌 Acknowledgments

Thanks to **Google** and **Kaggle** for this amazing learning opportunity. Special thanks to our mentor instructors and judges.

---
## Blogpost 
🔗 https://medium.com/@mdhaarishussain/building-a-medical-document-q-a-assistant-with-gen-ai-and-rag-77544b66aaa7
## 📚 Citation

> Addison Howard, Brenda Flynn, Myles O'Neill, Nate, and Polong Lin. *Gen AI Intensive Course Capstone 2025Q1*. [Kaggle](https://kaggle.com/competitions/gen-ai-intensive-course-capstone-2025q1), 2025.

In [1]:
!pip install PyPDF2 sentence-transformers faiss-cpu gradio


Collecting PyPDF2
  Downloading pypdf2-3.0.1-py3-none-any.whl.metadata (6.8 kB)
Collecting faiss-cpu
  Downloading faiss_cpu-1.10.0-cp311-cp311-manylinux_2_28_x86_64.whl.metadata (4.4 kB)
Collecting gradio
  Downloading gradio-5.25.2-py3-none-any.whl.metadata (16 kB)
Collecting fastapi<1.0,>=0.115.2 (from gradio)
  Downloading fastapi-0.115.12-py3-none-any.whl.metadata (27 kB)
Collecting ffmpy (from gradio)
  Downloading ffmpy-0.5.0-py3-none-any.whl.metadata (3.0 kB)
Collecting gradio-client==1.8.0 (from gradio)
  Downloading gradio_client-1.8.0-py3-none-any.whl.metadata (7.1 kB)
Collecting groovy~=0.1 (from gradio)
  Downloading groovy-0.1.2-py3-none-any.whl.metadata (6.1 kB)
Collecting python-multipart>=0.0.18 (from gradio)
  Downloading python_multipart-0.0.20-py3-none-any.whl.metadata (1.8 kB)
Collecting ruff>=0.9.3 (from gradio)
  Downloading ruff-0.11.6-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (25 kB)
Collecting safehttpx<0.2.0,>=0.1.6 (from gradio)
  Down

### 📁 Medical Document Collection and Preprocessing

In [2]:
import os

# List all datasets mounted in /kaggle/input
for dirname, _, filenames in os.walk('/kaggle/input'):
    print(f"Directory: {dirname}")
    for filename in filenames:
        print(f" - {filename}")


Directory: /kaggle/input
Directory: /kaggle/input/synthetic-medical-records-for-100-patients
 - synthetic_medical_records.pdf


In [3]:

import os
from pathlib import Path
from PyPDF2 import PdfReader

# Direct path to your PDF file
pdf_path = Path("/kaggle/input/synthetic-medical-records-for-100-patients/synthetic_medical_records.pdf")

# Read and extract text
reader = PdfReader(str(pdf_path))
documents = []

for page in reader.pages:
    text = page.extract_text()
    if text:
        # Split into separate records per person
        for chunk in text.split("\n\n"):
            if chunk.strip():
                documents.append({"text": chunk.strip()})

# Show preview of the first document
#print(documents[0]["text"][:500])


### 🔬 Document Embeddings with Sentence-Transformers

In [4]:

from sentence_transformers import SentenceTransformer
import numpy as np

# Load sentence-transformer model suitable for medical domain
model = SentenceTransformer('all-MiniLM-L6-v2')  # You can replace with a biomedical-specific model

# Generate embeddings
doc_texts = [doc["text"] for doc in documents]
doc_embeddings = model.encode(doc_texts)

# Confirm shape
print(np.array(doc_embeddings).shape)


2025-04-20 16:40:55.802053: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1745167256.062973      31 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1745167256.133078      31 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered


modules.json:   0%|          | 0.00/349 [00:00<?, ?B/s]

config_sentence_transformers.json:   0%|          | 0.00/116 [00:00<?, ?B/s]

README.md:   0%|          | 0.00/10.5k [00:00<?, ?B/s]

sentence_bert_config.json:   0%|          | 0.00/53.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/612 [00:00<?, ?B/s]

Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`


model.safetensors:   0%|          | 0.00/90.9M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/350 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/112 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/190 [00:00<?, ?B/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

(31, 384)


### 📦 Building FAISS Vector Index

In [5]:
!pip install faiss-cpu

import faiss

# FAISS index initialization
embedding_dim = doc_embeddings[0].shape[0]
index = faiss.IndexFlatL2(embedding_dim)

# Add embeddings to index
index.add(np.array(doc_embeddings).astype('float32'))

# Validate index
print(f"Number of documents indexed: {index.ntotal}")


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Number of documents indexed: 31


### ❓ Querying the Medical Assistant

In [6]:
# Example query
query = "What are the common symptoms of tuberculosis?"

# Encode the query
query_embedding = model.encode([query])

# Search the FAISS index
top_k = 5
distances, indices = index.search(np.array(query_embedding).astype('float32'), top_k)

# Retrieve the most relevant documents
retrieved_docs = [documents[i] for i in indices[0]]

# Filter and print only relevant lines containing 'Tuberculosis' and symptoms
print("Top Results for Query:", query)
print("="*60)

for i, doc in enumerate(retrieved_docs):
    text = doc["text"]
    if "tuberculosis" in text.lower():
        print(f"\n--- Relevant Match {i+1} ---")
        lines = text.split('\n')
        for line in lines:
            if "symptom" in line.lower() or "tuberculosis" in line.lower():
                print(line.strip())


Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Top Results for Query: What are the common symptoms of tuberculosis?

--- Relevant Match 1 ---
Symptoms: Dizziness, Nausea, Shortness of breath, Fatigue
Symptoms: Cough, Muscle ache
Disease: Tuberculosis
Symptoms: Shortness of breath, Fever, Runny nose
Symptoms: Nausea, Cough

--- Relevant Match 2 ---
Disease: Tuberculosis
Symptoms: Shortness of breath, Chest pain, Sweating, Cough
Symptoms: Cough, Loss of appetite
Symptoms: Sore throat, Chest pain

--- Relevant Match 3 ---
Disease: Tuberculosis
Symptoms: Sweating, Loss of appetite, Dizziness, Joint pain
Symptoms: Loss of appetite, Muscle ache, Fever, Nausea, Sweating
Symptoms: Vomiting, Shortness of breath, Sore throat, Cough

--- Relevant Match 4 ---
Symptoms: Muscle ache, Runny nose
Symptoms: Cough, Fatigue, Headache
Disease: Tuberculosis
Symptoms: Cough, Muscle ache, Dizziness

--- Relevant Match 5 ---
Symptoms: Sweating, Muscle ache, Headache, Chest pain, Fever
Symptoms: Chest pain, Loss of appetite, Sweating, Joint pain, Shortness

### 🧠 Answer Generation with Hugging Face Transformers

In [7]:

from transformers import pipeline

# Load QA pipeline (or generative pipeline)
qa_pipeline = pipeline("text2text-generation", model="google/flan-t5-base")

# Combine retrieved context
context = " ".join([doc["text"] for doc in retrieved_docs])

# Generate answer using context and query
input_text = f"Context: {context}\n\nQuestion: {query}"
response = qa_pipeline(input_text, max_length=200, do_sample=False)

print("Answer:", response[0]["generated_text"])


config.json:   0%|          | 0.00/1.40k [00:00<?, ?B/s]

Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`


model.safetensors:   0%|          | 0.00/990M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/147 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/2.54k [00:00<?, ?B/s]

Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`


spiece.model:   0%|          | 0.00/792k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/2.42M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/2.20k [00:00<?, ?B/s]

Device set to use cpu
Token indices sequence length is longer than the specified maximum sequence length for this model (1410 > 512). Running this sequence through the model will result in indexing errors


Answer: Shortness of breath, Chest pain, Sweating, Cough Checkup: Blood Pressure - 115/84 mmHg, Heart Rate - 82 bpm, Temperature - 104.0 °F Treatment: Corticosteroids, Physiotherapy Follow-up: 3 week(s) Patient 21: Name: Jason Fleming Age: 64 Disease: Pneumonia Age: 18 Disease: Tuberculosis Age: 19 Disease: Tuberculosis Symptoms: Shortness of breath, Chest pain, Sweating, Cough Checkup: Blood Pressure - 109/74 mmHg, Heart Rate - 82 bpm, Temperature - 100.4 °F Treatment: Corticosteroids, Physiotherapy Follow-up: 3 week(s) Patient 23: Name: Curtis Carter Age: 44 Disease:
