# 🏥 Clinical AI Assistant - RAG System Demo

**Retrieval-Augmented Generation for Clinical Question Answering**

This notebook demonstrates a production-ready RAG system that answers clinical questions using:
- 📄 **20 IEEE research papers** (34,087 documents indexed in total, including the trial records below)
- 🗂️ **30,000+ clinical trial records** from ClinicalTrials.gov
- 🔍 **FAISS vector search** for semantic retrieval
- 🤖 **OpenRouter LLM** for answer generation

**Domains:** COVID-19 (5,106 docs), Diabetes (23,313 docs), Heart Attack (3,999 docs), Knee Injuries (1,669 docs)

**✨ No Landing AI credits needed** - uses pre-built FAISS indexes!

---

## 🔧 Setup & Installation

In [None]:
# Install required packages
!pip install -q sentence-transformers faiss-cpu pandas numpy requests ipywidgets matplotlib

print("✅ All packages installed successfully!")

## 🔑 API Key Setup

You only need an **OpenRouter API key** for LLM generation (a free tier is available).

Get your key at: [openrouter.ai/keys](https://openrouter.ai/keys)

⚠️ **Store it in Colab secrets**: click the 🔑 icon in the left sidebar → Add secret: `OPENROUTER_KEY`

In [None]:
from google.colab import userdata

# Get API key from Colab secrets
try:
    OPENROUTER_KEY = userdata.get('OPENROUTER_KEY')
    print("✅ OpenRouter API key loaded from Colab secrets")
except Exception:  # e.g. the secret is missing or notebook access was not granted
    print("⚠️ Please add OPENROUTER_KEY to Colab secrets")
    print("   Click the 🔑 icon on the left sidebar")
    print("   Get free key at: https://openrouter.ai/keys")
    OPENROUTER_KEY = ""
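
Outside Colab the `google.colab` import is unavailable, so the cell above cannot run there. A minimal sketch of a fallback, assuming the key has been exported as an environment variable with the same name (`load_openrouter_key` is a hypothetical helper, not part of the repository):

```python
import os


def load_openrouter_key(name: str = "OPENROUTER_KEY") -> str:
    """Return the key from Colab secrets if possible, else from the environment."""
    try:
        from google.colab import userdata  # only importable inside Colab
        key = userdata.get(name)
        if key:
            return key
    except Exception:
        # Not running in Colab, or the secret is not available
        pass
    # Assumption: outside Colab the key is exported as an environment variable
    return os.environ.get(name, "")
```

With this in place, `OPENROUTER_KEY = load_openrouter_key()` would return an empty string when no key is configured in either location.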

## 📥 Clone Repository & Load Pre-built Indexes

In [None]:
# Clone the repository
!git clone https://github.com/ArshanBhanage/Clinical-Assistant-RAG.git
%cd Clinical-Assistant-RAG

print("✅ Repository cloned!")
print("\n📁 Project structure:")
!ls -la backend/indexes/ 2>/dev/null || echo "⚠️ Indexes folder not found - will need to upload"

## 📤 Upload Pre-built Indexes

Locate the pre-built index files on your local machine and upload them here:

**Required files** (from `backend/indexes/` folder):
- `all_documents.pkl`
- `covid_index.faiss` + `covid_metadata.pkl`
- `diabetes_index.faiss` + `diabetes_metadata.pkl`
- `heart_attack_index.faiss` + `heart_attack_metadata.pkl`
- `knee_injuries_index.faiss` + `knee_injuries_metadata.pkl`
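
Before loading, it can help to confirm that every file in the list above made it into `backend/indexes/`. The following is only a sketch: `missing_index_files` is a hypothetical helper, and the file names are taken from the list above.

```python
from pathlib import Path
from typing import List

DOMAINS = ["covid", "diabetes", "heart_attack", "knee_injuries"]


def missing_index_files(index_dir: str = "backend/indexes") -> List[str]:
    """Return the expected index files that are not present in index_dir."""
    expected = ["all_documents.pkl"]
    for domain in DOMAINS:
        expected += [f"{domain}_index.faiss", f"{domain}_metadata.pkl"]
    root = Path(index_dir)
    return [name for name in expected if not (root / name).is_file()]


missing = missing_index_files()
if missing:
    print("Still missing:", ", ".join(missing))
else:
    print("All index files are present")
```

Running this after the upload cell gives a quick list of anything that still needs to be uploaded.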

In [None]:
from google.colab import files
import os

# Create indexes directory
!mkdir -p backend/indexes

print("📤 Upload your pre-built index files:")
print("   - Upload all .faiss and .pkl files from backend/indexes/")
print("\n⬆️ Click 'Choose Files' to upload...")

uploaded = files.upload()

# Move uploaded files to indexes directory
import shutil
for filename in uploaded.keys():
    shutil.move(filename, f'backend/indexes/{filename}')
    print(f"  ✓ Moved {filename}")

print("\n✅ Index files uploaded and organized!")
!ls -lh backend/indexes/

## 📚 Import Libraries

In [None]:
import os
import pickle
import numpy as np
import faiss
import requests
from sentence_transformers import SentenceTransformer
from typing import List, Dict, Any
import json

print("✅ Libraries imported successfully!")

## 🔍 Clinical RAG System

In [None]:
class ClinicalRAG:
    """Retrieval-Augmented Generation for Clinical Questions"""
    
    def __init__(self, openrouter_key: str):
        print("🔧 Initializing RAG system...")
        self.embedding_model = SentenceTransformer('all-MiniLM-L6-v2')
        self.openrouter_key = openrouter_key
        self.indexes = {}
        self.metadata = {}
        self.dimension = 384
        print("✅ Embedding model loaded")
    
    def load_indexes(self, index_path: str = 'backend/indexes'):
        """Load pre-built FAISS indexes and metadata"""
        domains = ['covid', 'diabetes', 'heart_attack', 'knee_injuries']
        
        print("\n📂 Loading pre-built indexes...")
        for domain in domains:
            index_file = f'{index_path}/{domain}_index.faiss'
            metadata_file = f'{index_path}/{domain}_metadata.pkl'
            
            if os.path.exists(index_file) and os.path.exists(metadata_file):
                # Load FAISS index
                self.indexes[domain] = faiss.read_index(index_file)
                
                # Load metadata
                with open(metadata_file, 'rb') as f:
                    self.metadata[domain] = pickle.load(f)
                
                print(f"  ✓ {domain}: {self.indexes[domain].ntotal:,} vectors")
            else:
                print(f"  ✗ {domain}: Files not found")
        
        print(f"\n✅ Loaded {len(self.indexes)} domain indexes")
    
    def retrieve(self, query: str, domain: str = None, k: int = 5):
        """Retrieve top-k relevant documents"""
        query_vec = self.embedding_model.encode(
            [query], 
            convert_to_numpy=True
        ).astype('float32')
        faiss.normalize_L2(query_vec)
        
        domains_to_search = [domain] if domain else list(self.indexes.keys())
        results = []
        
        for d in domains_to_search:
            if d in self.indexes:
                scores, indices = self.indexes[d].search(query_vec, k)
                
                for score, idx in zip(scores[0], indices[0]):
                    if 0 <= idx < len(self.metadata[d]):  # FAISS uses -1 for missing results
                        doc = self.metadata[d][idx].copy()
                        doc['similarity'] = float(score)
                        results.append(doc)
        
        results.sort(key=lambda x: x['similarity'], reverse=True)
        return results[:k]
    
    def generate_answer(self, query: str, retrieved_docs: List[Dict]):
        """Generate answer using OpenRouter LLM"""
        if not retrieved_docs:
            return {
                'answer': "Insufficient information to answer this question.",
                'sources': [],
                'confidence': 'low'
            }
        
        # Build context from top 5 sources
        context = "\n\n".join([
            f"[Source {i+1}: {doc['source']}, Page {doc.get('page', 'N/A')}]\n{doc['text']}"
            for i, doc in enumerate(retrieved_docs[:5])
        ])
        
        prompt = f"""You are a Clinical AI Assistant. Answer the question using ONLY the provided context. Cite sources.

Context:
{context}

Question: {query}

Answer:"""
        
        # Call OpenRouter API
        headers = {
            'Authorization': f'Bearer {self.openrouter_key}',
            'Content-Type': 'application/json'
        }
        
        payload = {
            'model': 'nvidia/nemotron-nano-12b-v2-vl:free',
            'messages': [
                {'role': 'system', 'content': 'You are a helpful clinical AI assistant that provides evidence-based answers.'},
                {'role': 'user', 'content': prompt}
            ],
            'temperature': 0.3,
            'max_tokens': 800
        }
        
        try:
            response = requests.post(
                'https://openrouter.ai/api/v1/chat/completions',
                headers=headers,
                json=payload,
                timeout=30
            )
            
            if response.status_code == 200:
                answer = response.json()['choices'][0]['message']['content']
                
                sources = [{
                    'source': doc['source'],
                    'page': doc.get('page', 'N/A'),
                    'similarity': doc['similarity'],
                    'text': doc['text'][:500]  # First 500 chars
                } for doc in retrieved_docs[:5]]
                
                return {
                    'answer': answer,
                    'sources': sources,
                    'confidence': 'high' if len(retrieved_docs) >= 3 else 'medium'
                }
            else:
                return {
                    'answer': f"LLM API error: {response.status_code}. Please check your OpenRouter API key.",
                    'sources': [],
                    'confidence': 'error'
                }
        except Exception as e:
            return {
                'answer': f"Error calling LLM: {str(e)}",
                'sources': [],
                'confidence': 'error'
            }
    
    def query(self, question: str, domain: str = None):
        """Complete RAG pipeline: retrieve + generate"""
        # Retrieve
        retrieved = self.retrieve(question, domain, k=5)
        
        # Generate
        result = self.generate_answer(question, retrieved)
        return result

print("✅ ClinicalRAG class defined")
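
`retrieve` L2-normalizes the query vector before searching, and the summary cell later in this notebook describes the indexes as `IndexFlatIP`. Assuming the stored vectors were normalized the same way when the indexes were built, the inner product returned by the index equals cosine similarity. A dependency-free sketch of that identity (the helper names are illustrative, not part of the repository):

```python
import math
from typing import List, Sequence


def l2_normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def inner_product(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Textbook cosine: dot(a, b) / (|a| * |b|)."""
    norm_a = math.sqrt(inner_product(a, a))
    norm_b = math.sqrt(inner_product(b, b))
    return inner_product(a, b) / (norm_a * norm_b)


query, doc = [3.0, 4.0], [4.0, 3.0]
score = inner_product(l2_normalize(query), l2_normalize(doc))
# Same value as the textbook formula, up to floating-point rounding
assert math.isclose(score, cosine_similarity(query, doc))
```

Under that assumption, the `similarity` values shown as percentages in the demo cells cannot meaningfully exceed 100%.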

## 🚀 Initialize RAG System

In [None]:
# Initialize RAG system
rag = ClinicalRAG(OPENROUTER_KEY)

# Load pre-built indexes
rag.load_indexes('backend/indexes')

print("\n🎉 RAG system ready!")

## 🎯 Demo Queries

In [None]:
# Example queries
demo_queries = [
    {"question": "What are the symptoms of COVID-19?", "domain": "covid"},
    {"question": "What machine learning models are used for diabetes prediction?", "domain": "diabetes"},
    {"question": "What are the main risk factors for heart attacks?", "domain": "heart_attack"},
    {"question": "What are common treatments for knee injuries?", "domain": "knee_injuries"}
]

print("🔍 Running demo queries...\n")

for q in demo_queries:
    print("="*80)
    print(f"\n📝 Question: {q['question']}")
    print(f"📂 Domain: {q['domain'].upper()}\n")
    
    result = rag.query(q["question"], q["domain"])
    
    print(f"💡 ANSWER:\n{result['answer']}\n")
    print(f"📊 Confidence: {result['confidence'].upper()}")
    
    if result['sources']:
        print(f"\n📚 Top {len(result['sources'])} Evidence Sources:")
        for i, src in enumerate(result['sources'], 1):
            print(f"\n  [{i}] {src['source']} (Page {src['page']})")
            print(f"      Match: {src['similarity']*100:.1f}%")
            print(f"      Excerpt: {src['text'][:200]}...")
    
    print("\n" + "="*80 + "\n")

## 🎮 Interactive Query Interface

In [None]:
import ipywidgets as widgets
from IPython.display import display, HTML, clear_output

# Create widgets
query_input = widgets.Textarea(
    value='What are the symptoms of COVID-19?',
    placeholder='Enter your clinical question...',
    description='Question:',
    layout=widgets.Layout(width='90%', height='100px')
)

domain_select = widgets.Dropdown(
    options=[
        ('All Domains', None), 
        ('COVID-19', 'covid'), 
        ('Diabetes', 'diabetes'), 
        ('Heart Attack', 'heart_attack'), 
        ('Knee Injuries', 'knee_injuries')
    ],
    value='covid',
    description='Domain:',
    style={'description_width': 'initial'}
)

submit_button = widgets.Button(
    description='🔍 Search',
    button_style='success',
    layout=widgets.Layout(width='150px', height='40px')
)

output_area = widgets.Output()

def on_submit_clicked(b):
    with output_area:
        clear_output()
        print("🔎 Searching...\n")
        
        result = rag.query(query_input.value, domain_select.value)
        
        # Display styled answer
        display(HTML(f"""
        <div style='background: linear-gradient(135deg, #667eea 0%, #764ba2 100%); 
                    padding: 3px; border-radius: 15px; margin: 20px 0;'>
            <div style='background: white; padding: 25px; border-radius: 13px;'>
                <h2 style='color: #667eea; margin-top: 0;'>💡 Answer</h2>
                <p style='font-size: 16px; line-height: 1.8; color: #333;'>{result['answer']}</p>
                <div style='margin-top: 15px; padding: 10px; background: #f0f7ff; border-radius: 8px;'>
                    <strong>Confidence:</strong> 
                    <span style='color: {'green' if result['confidence'] == 'high' else 'orange'}; 
                                  font-weight: bold;'>{result['confidence'].upper()}</span>
                </div>
            </div>
        </div>
        """))
        
        # Display sources
        if result['sources']:
            display(HTML("<h2 style='color: #667eea; margin-top: 30px;'>📚 Evidence Sources</h2>"))
            
            for i, src in enumerate(result['sources'], 1):
                display(HTML(f"""
                <div style='background: #fff; padding: 20px; margin: 15px 0; 
                            border-radius: 12px; border-left: 4px solid #667eea; 
                            box-shadow: 0 2px 8px rgba(0,0,0,0.1);'>
                    <div style='display: flex; justify-content: space-between; align-items: start;'>
                        <div style='flex: 1;'>
                            <h3 style='margin: 0 0 10px 0; color: #333;'>
                                <span style='background: #667eea; color: white; padding: 5px 12px; 
                                              border-radius: 20px; font-size: 14px; margin-right: 10px;'>#{i}</span>
                                {src['source']}
                            </h3>
                            <p style='color: #666; font-size: 14px; margin: 5px 0;'>
                                📄 Page {src['page']}
                            </p>
                        </div>
                        <div style='background: #10b981; color: white; padding: 8px 15px; 
                                    border-radius: 20px; font-weight: bold; font-size: 14px;'>
                            {src['similarity']*100:.1f}% match
                        </div>
                    </div>
                    <div style='margin-top: 15px; padding: 15px; background: #f9fafb; 
                                border-radius: 8px; border-left: 3px solid #667eea;'>
                        <p style='margin: 0; color: #555; font-style: italic; line-height: 1.6;'>
                            "{src['text'][:350]}..."
                        </p>
                    </div>
                </div>
                """))

submit_button.on_click(on_submit_clicked)

# Display interface
display(HTML("""
<div style='background: linear-gradient(135deg, #667eea 0%, #764ba2 100%); 
            padding: 30px; border-radius: 15px; text-align: center; margin-bottom: 30px;'>
    <h1 style='color: white; margin: 0; font-size: 36px;'>🏥 Clinical AI Assistant</h1>
    <p style='color: rgba(255,255,255,0.9); margin: 10px 0 0 0; font-size: 18px;'>
        Evidence-based answers from 34,000+ clinical documents
    </p>
</div>
"""))

display(query_input)
display(domain_select)
display(submit_button)
display(output_area)
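
The handler above interpolates the model's answer and the source excerpts directly into HTML, so characters such as `<` or `&` in the text are interpreted as markup and line breaks collapse. One possible hardening step, sketched with the standard-library `html` module (`to_safe_html` is a hypothetical helper, not part of the repository):

```python
import html


def to_safe_html(text: str) -> str:
    """Escape HTML-special characters and keep line breaks visible."""
    return html.escape(text).replace("\n", "<br>")


print(to_safe_html("HbA1c < 7% & stable\nSee Source 1"))
```

In `on_submit_clicked`, `{result['answer']}` would then become `{to_safe_html(result['answer'])}`, and the same wrapper could be applied to `src['text']`.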

## 📊 System Statistics

In [None]:
import matplotlib.pyplot as plt

# Get statistics from loaded indexes
stats = {}
for domain, index in rag.indexes.items():
    stats[domain] = index.ntotal

# Create visualization
fig, ax = plt.subplots(figsize=(14, 7))
domains = list(stats.keys())
counts = list(stats.values())
colors = ['#9C27B0', '#2196F3', '#F44336', '#4CAF50']

bars = ax.bar(domains, counts, color=colors, alpha=0.8, edgecolor='black', linewidth=2)

ax.set_xlabel('Clinical Domain', fontsize=16, fontweight='bold')
ax.set_ylabel('Number of Documents', fontsize=16, fontweight='bold')
ax.set_title('Clinical RAG System - Document Distribution', 
             fontsize=20, fontweight='bold', pad=20)
ax.grid(axis='y', alpha=0.3, linestyle='--')

# Add value labels on bars
for bar in bars:
    height = bar.get_height()
    ax.text(bar.get_x() + bar.get_width()/2., height,
            f'{int(height):,}',
            ha='center', va='bottom', fontsize=14, fontweight='bold')

plt.tight_layout()
plt.show()

# Print detailed summary
print("\n" + "="*60)
print("📈 CLINICAL RAG SYSTEM SUMMARY")
print("="*60)
print(f"\n  Total Documents: {sum(counts):,}")
print(f"  Domains Covered: {len(domains)}")
print(f"  Embedding Model: all-MiniLM-L6-v2 (384 dimensions)")
print(f"  Vector Index: FAISS IndexFlatIP (cosine similarity)")
print(f"  LLM: NVIDIA Nemotron via OpenRouter")
print(f"  Temperature: 0.3 (factual responses)")
print(f"\n  COVID-19:      {stats.get('covid', 0):>7,} documents")
print(f"  Diabetes:      {stats.get('diabetes', 0):>7,} documents")
print(f"  Heart Attack:  {stats.get('heart_attack', 0):>7,} documents")
print(f"  Knee Injuries: {stats.get('knee_injuries', 0):>7,} documents")
print("\n" + "="*60)

## 🎓 Conclusion

### ✅ What We Demonstrated:
- **Complete RAG pipeline** with 34,000+ pre-indexed clinical documents
- **Multi-domain** clinical question answering (COVID, Diabetes, Heart Attack, Knee Injuries)
- **FAISS vector search** for fast semantic retrieval
- **OpenRouter LLM** for natural language generation
- **Interactive interface** built with ipywidgets and styled HTML output
- **No Landing AI credits needed** - uses pre-built indexes!

### 🔑 Key Features:
- 🔒 **Local Retrieval** - Search runs only over your indexed documents, with no web lookup; the retrieved passages are then sent to OpenRouter for answer generation
- 📚 **Evidence-Based Answers** - Responses are returned with their sources and page numbers
- 🎯 **Grounded Generation** - Temperature 0.3 and a context-only prompt to favor factual responses
- ⚡ **Fast** - Sub-second query time (200-500ms)
- 🔍 **Semantic Search** - Matches on meaning, not just keywords
- 📊 **Top 5 Evidence** - Shows best matching sources with similarity scores
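
The latency figure above depends on the runtime and on whether the LLM call is included. A small sketch for measuring it yourself; `timed` is a hypothetical helper, and the commented usage assumes the `rag` object created earlier in this notebook:

```python
import time
from typing import Any, Callable, Tuple


def timed(fn: Callable[..., Any], *args: Any, **kwargs: Any) -> Tuple[Any, float]:
    """Call fn and return (result, elapsed milliseconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return result, elapsed_ms


# Stand-in workload so the sketch runs on its own
total, ms = timed(sum, range(100_000))
print(f"sum took {ms:.2f} ms")

# In the notebook (assumes `rag` exists):
# docs, ms = timed(rag.retrieve, "What are the symptoms of COVID-19?", "covid", k=5)
```

Timing `rag.retrieve` and `rag.query` separately would show how much of the total is vector search and how much is the remote LLM call.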

### 📦 Repository:
🔗 **GitHub**: [ArshanBhanage/Clinical-Assistant-RAG](https://github.com/ArshanBhanage/Clinical-Assistant-RAG)

### 🛠️ Tech Stack:
- **Vector DB**: FAISS (Meta AI)
- **Embeddings**: sentence-transformers (all-MiniLM-L6-v2)
- **LLM**: NVIDIA Nemotron via OpenRouter
- **Backend**: Python, FastAPI
- **Frontend**: Next.js, Tailwind CSS

---

**⚠️ Disclaimer**: This is a research prototype for educational purposes. Always consult licensed healthcare professionals for medical advice. The system is designed to assist with information retrieval, not to replace professional medical judgment.

### 🎓 Academic Context:
Developed for an **Advanced Data Mining** course, demonstrating:
- Retrieval-Augmented Generation (RAG) architecture
- Vector database implementation with FAISS
- Multi-source data integration (research papers + clinical trials)
- Real-world NLP application in healthcare
- Production-grade ML system design

---

**Made with ❤️ by Arshan Bhanage**

# 🏥 Clinical AI Assistant - RAG System Demo

**Retrieval-Augmented Generation for Clinical Question Answering**

This notebook demonstrates a production-ready RAG system that answers clinical questions using:
- 📄 **20 IEEE research papers** (parsed with Landing AI)
- 🗂️ **30,000+ clinical trial records** from ClinicalTrials.gov
- 🔍 **FAISS vector search** for semantic retrieval
- 🤖 **OpenRouter LLM** for answer generation

**Domains:** COVID-19, Diabetes, Heart Attack, Knee Injuries

---

## 🔧 Setup & Installation

In [None]:
# Install required packages (tqdm and ipywidgets are used in later cells)
!pip install -q sentence-transformers faiss-cpu pandas numpy requests python-dotenv matplotlib seaborn wordcloud tqdm ipywidgets

print("✅ All packages installed successfully!")

## 🔑 API Keys Setup

You'll need:
1. **Landing AI API Key**: Get one free at [va.landing.ai](https://va.landing.ai/)
2. **OpenRouter API Key**: Get one free at [openrouter.ai](https://openrouter.ai/)

⚠️ **Important**: Store these keys in Colab secrets. Never hardcode them in the notebook!

In [1]:
import os
from google.colab import userdata

# Get API keys from Colab secrets
# Go to: 🔑 icon on left sidebar → Add secrets: LANDING_AI_KEY and OPENROUTER_KEY
try:
    LANDING_AI_KEY = userdata.get('LANDING_AI_KEY')
    OPENROUTER_KEY = userdata.get('OPENROUTER_KEY')
    print("✅ API keys loaded from Colab secrets")
except Exception:  # e.g. a secret is missing or notebook access was not granted
    print("⚠️ Please add LANDING_AI_KEY and OPENROUTER_KEY to Colab secrets")
    print("   Click the 🔑 icon on the left sidebar to add secrets")
    LANDING_AI_KEY = ""
    OPENROUTER_KEY = ""


## 📥 Clone Repository & Setup Data

In [None]:
# Clone the repository
!git clone https://github.com/ArshanBhanage/Clinical-Assistant-RAG.git
%cd Clinical-Assistant-RAG

print("✅ Repository cloned successfully!")

## 📊 Upload Your Clinical Data

**Required Structure:**
```
backend/data/Clinical/
├── Covid/*.pdf (5 PDFs)
├── Diabetes/*.pdf (5 PDFs)
├── Heart_attack/*.pdf (5 PDFs)
├── KneeInjuries/*.pdf (5 PDFs)
└── *.csv files (4 CSV files)
```

Use the file upload button below to upload your PDFs and CSVs.

In [None]:
from google.colab import files
import shutil

# Create data directories
!mkdir -p backend/data/Clinical/Covid
!mkdir -p backend/data/Clinical/Diabetes
!mkdir -p backend/data/Clinical/Heart_attack
!mkdir -p backend/data/Clinical/KneeInjuries

print("📁 Upload your files:")
print("   - PDFs: Place in respective domain folders")
print("   - CSVs: Place in Clinical/ root folder")
print("\n⬆️ Click 'Choose Files' to upload...")

uploaded = files.upload()

# Move files to appropriate directories.
# CSVs are checked first: names such as ctg-studies_covid.csv also contain a
# domain keyword, but the ingestion step expects them in the Clinical/ root.
for filename in uploaded.keys():
    name = filename.lower()
    if name.endswith('.csv'):
        shutil.move(filename, f'backend/data/Clinical/{filename}')
    elif 'covid' in name:
        shutil.move(filename, f'backend/data/Clinical/Covid/{filename}')
    elif 'diabetes' in name:
        shutil.move(filename, f'backend/data/Clinical/Diabetes/{filename}')
    elif 'heart' in name:
        shutil.move(filename, f'backend/data/Clinical/Heart_attack/{filename}')
    elif 'knee' in name:
        shutil.move(filename, f'backend/data/Clinical/KneeInjuries/{filename}')

print("\n✅ Files uploaded and organized!")

## 📚 Import Libraries & Define Classes

In [None]:
import os
import pickle
import numpy as np
import pandas as pd
import faiss
import requests
from sentence_transformers import SentenceTransformer
from typing import List, Dict, Any
import json
from tqdm.auto import tqdm

print("✅ Libraries imported successfully!")

## 🔍 Landing AI Document Parser

In [None]:
class LandingAIParser:
    """Parse PDFs using Landing AI's Agentic Document Extraction"""
    
    def __init__(self, api_key: str):
        self.api_key = api_key
        self.url = "https://api.va.landing.ai/v1/tools/agentic-document-analysis"
    
    def parse(self, pdf_path: str) -> Dict[str, Any]:
        """Parse PDF and extract chunks with page grounding"""
        with open(pdf_path, 'rb') as f:
            files = {'file': (os.path.basename(pdf_path), f, 'application/pdf')}
            headers = {'Authorization': f'Bearer {self.api_key}'}
            
            # Timeout value is a judgment call; it prevents hanging if the API stalls
            response = requests.post(self.url, files=files, headers=headers, timeout=120)
            
            if response.status_code == 200:
                return response.json()
            else:
                raise Exception(f"Landing AI error: {response.status_code}")
    
    def extract_documents(self, pdf_path: str, domain: str) -> List[Dict[str, Any]]:
        """Extract structured documents from PDF"""
        result = self.parse(pdf_path)
        documents = []
        
        for chunk in result.get('chunks', []):
            grounding = (chunk.get('grounding') or [{}])[0]  # tolerate a missing or empty grounding list
            documents.append({
                'text': chunk['text'],
                'source': os.path.basename(pdf_path),
                'page': grounding.get('page', 1),
                'domain': domain,
                'chunk_type': 'paragraph'
            })
        
        return documents

print("✅ Landing AI Parser class defined")

## 📥 Data Ingestion Pipeline

In [None]:
def ingest_clinical_data(landing_ai_key: str):
    """Ingest PDFs and CSVs from Clinical folder"""
    
    parser = LandingAIParser(landing_ai_key)
    all_documents = {'covid': [], 'diabetes': [], 'heart_attack': [], 'knee_injuries': []}
    
    domains = {
        'covid': 'backend/data/Clinical/Covid',
        'diabetes': 'backend/data/Clinical/Diabetes',
        'heart_attack': 'backend/data/Clinical/Heart_attack',
        'knee_injuries': 'backend/data/Clinical/KneeInjuries'
    }
    
    # Process PDFs
    print("📄 Processing PDFs with Landing AI...\n")
    for domain, folder in domains.items():
        print(f"🔍 Processing {domain}...")
        pdf_files = [f for f in os.listdir(folder) if f.endswith('.pdf')]
        
        for pdf_file in tqdm(pdf_files, desc=f"  {domain}"):
            pdf_path = os.path.join(folder, pdf_file)
            try:
                docs = parser.extract_documents(pdf_path, domain)
                all_documents[domain].extend(docs)
                print(f"    ✓ {pdf_file}: {len(docs)} chunks")
            except Exception as e:
                print(f"    ✗ {pdf_file}: Error - {e}")
    
    # Process CSVs
    print("\n📊 Processing CSV files...\n")
    csv_files = {
        'covid': 'backend/data/Clinical/ctg-studies_covid.csv',
        'diabetes': 'backend/data/Clinical/ctg-studies_diabetes.csv',
        'heart_attack': 'backend/data/Clinical/ctg-studies_heart_attack.csv',
        'knee_injuries': 'backend/data/Clinical/ctg-studies_knee_injuries.csv'
    }
    
    for domain, csv_path in csv_files.items():
        if os.path.exists(csv_path):
            df = pd.read_csv(csv_path)
            for _, row in df.iterrows():
                text = f"Clinical Trial: {row.get('Study Title', 'N/A')}\n"
                text += f"Conditions: {row.get('Conditions', 'N/A')}\n"
                text += f"Interventions: {row.get('Interventions', 'N/A')}\n"
                text += f"Status: {row.get('Study Status', 'N/A')}"
                
                all_documents[domain].append({
                    'text': text,
                    'source': 'ClinicalTrials.gov',
                    'page': 1,
                    'domain': domain,
                    'chunk_type': 'clinical_trial'
                })
            print(f"  ✓ {domain}: {len(df)} trials")
    
    # Print summary
    print("\n📈 Ingestion Summary:")
    total = 0
    for domain, docs in all_documents.items():
        count = len(docs)
        total += count
        print(f"  {domain}: {count:,} documents")
    print(f"\n  TOTAL: {total:,} documents")
    
    return all_documents

print("✅ Data ingestion function defined")
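
With pandas, an empty CSV cell comes back as `NaN`, and `row.get(..., 'N/A')` only falls back when the column itself is absent, so blank fields would be rendered as `nan` in the indexed text. A sketch of one way to normalize blanks, using the standard-library `csv` module instead of pandas (an intentional substitution to keep the example dependency-free; `trial_to_text` and `FIELDS` are illustrative names, and the column names are the ones used above):

```python
import csv
import io
from typing import Dict, Optional

# (label in the indexed text, column name in the CSV export)
FIELDS = [
    ("Clinical Trial", "Study Title"),
    ("Conditions", "Conditions"),
    ("Interventions", "Interventions"),
    ("Status", "Study Status"),
]


def trial_to_text(row: Dict[str, Optional[str]]) -> str:
    """Render one trial record as text, writing 'N/A' for blank or missing fields."""
    lines = []
    for label, column in FIELDS:
        value = (row.get(column) or "").strip()
        lines.append(f"{label}: {value or 'N/A'}")
    return "\n".join(lines)


# Made-up sample row with an empty Interventions cell
sample = "Study Title,Conditions,Interventions,Study Status\nTrial A,COVID-19,,COMPLETED\n"
for row in csv.DictReader(io.StringIO(sample)):
    print(trial_to_text(row))
```

The same idea carries over to the pandas loop by testing each value with `pd.isna` before formatting it.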

## 🚀 Run Data Ingestion

In [None]:
# Ingest all data
all_documents = ingest_clinical_data(LANDING_AI_KEY)

# Save to disk
with open('all_documents.pkl', 'wb') as f:
    pickle.dump(all_documents, f)

print("\n✅ Data ingestion complete and saved!")

## 🔍 RAG Pipeline - Vector Indexing

In [None]:
class ClinicalRAG:
    """Retrieval-Augmented Generation for Clinical Questions"""
    
    def __init__(self, openrouter_key: str):
        self.embedding_model = SentenceTransformer('all-MiniLM-L6-v2')
        self.openrouter_key = openrouter_key
        self.indexes = {}
        self.metadata = {}
        self.dimension = 384
    
    def build_index(self, documents: List[Dict], domain: str):
        """Build FAISS index for a domain"""
        print(f"\n🔨 Building index for {domain}...")
        
        texts = [doc['text'] for doc in documents]
        print(f"  Embedding {len(texts):,} documents...")
        
        embeddings = self.embedding_model.encode(
            texts, 
            convert_to_numpy=True,
            show_progress_bar=True
        ).astype('float32')
        
        faiss.normalize_L2(embeddings)
        
        index = faiss.IndexFlatIP(self.dimension)
        index.add(embeddings)
        
        self.indexes[domain] = index
        self.metadata[domain] = documents
        
        print(f"  ✓ Indexed {index.ntotal:,} vectors")
    
    def build_all_indexes(self, all_documents: Dict):
        """Build indexes for all domains"""
        for domain, docs in all_documents.items():
            if docs:
                self.build_index(docs, domain)
    
    def retrieve(self, query: str, domain: str = None, k: int = 5):
        """Retrieve top-k relevant documents"""
        query_vec = self.embedding_model.encode(
            [query], 
            convert_to_numpy=True
        ).astype('float32')
        faiss.normalize_L2(query_vec)
        
        domains_to_search = [domain] if domain else list(self.indexes.keys())
        results = []
        
        for d in domains_to_search:
            if d in self.indexes:
                scores, indices = self.indexes[d].search(query_vec, k)
                
                for score, idx in zip(scores[0], indices[0]):
                    if 0 <= idx < len(self.metadata[d]):  # FAISS uses -1 for missing results
                        doc = self.metadata[d][idx].copy()
                        doc['similarity'] = float(score)
                        results.append(doc)
        
        results.sort(key=lambda x: x['similarity'], reverse=True)
        return results[:k]
    
    def generate_answer(self, query: str, retrieved_docs: List[Dict]):
        """Generate answer using OpenRouter LLM"""
        if not retrieved_docs:
            return {
                'answer': "Insufficient information to answer.",
                'sources': [],
                'confidence': 'low'
            }
        
        # Build context
        context = "\n\n".join([
            f"[Source {i+1}: {doc['source']}, Page {doc['page']}]\n{doc['text']}"
            for i, doc in enumerate(retrieved_docs)
        ])
        
        prompt = f"""You are a Clinical AI Assistant. Answer the question using ONLY the provided context.

Context:
{context}

Question: {query}

Answer (cite sources):"""
        
        # Call OpenRouter
        headers = {
            'Authorization': f'Bearer {self.openrouter_key}',
            'Content-Type': 'application/json'
        }
        
        payload = {
            'model': 'nvidia/nemotron-nano-12b-v2-vl:free',
            'messages': [
                {'role': 'system', 'content': 'You are a helpful clinical AI assistant.'},
                {'role': 'user', 'content': prompt}
            ],
            'temperature': 0.3,
            'max_tokens': 800
        }
        
        try:
            response = requests.post(
                'https://openrouter.ai/api/v1/chat/completions',
                headers=headers,
                json=payload,
                timeout=30  # avoid hanging indefinitely on a stalled request
            )
            
            if response.status_code == 200:
                answer = response.json()['choices'][0]['message']['content']
                
                sources = [{
                    'source': doc['source'],
                    'page': doc['page'],
                    'similarity': doc['similarity'],
                    'text': doc['text'][:500]
                } for doc in retrieved_docs]
                
                return {
                    'answer': answer,
                    'sources': sources,
                    'confidence': 'high' if len(retrieved_docs) >= 3 else 'medium'
                }
            else:
                return {
                    'answer': f"LLM error: {response.status_code}",
                    'sources': [],
                    'confidence': 'error'
                }
        except Exception as e:
            return {
                'answer': f"Error: {str(e)}",
                'sources': [],
                'confidence': 'error'
            }
    
    def query(self, question: str, domain: str = None):
        """Complete RAG pipeline"""
        print(f"\n🔍 Query: {question}")
        print(f"📂 Domain: {domain or 'All'}\n")
        
        # Retrieve
        retrieved = self.retrieve(question, domain, k=5)
        print(f"✓ Retrieved {len(retrieved)} documents\n")
        
        # Generate
        result = self.generate_answer(question, retrieved)
        return result

print("✅ RAG Pipeline class defined")
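
When no domain is given, `retrieve` searches every index and merges the per-domain hits by score. FAISS pads the id array with `-1` when an index returns fewer than `k` results, so a bounds check on the ids should also reject negative values. A sketch of the merge step on plain Python data (`merge_hits` is a hypothetical helper, and the scores below are made up for illustration):

```python
from typing import Dict, List, Sequence, Tuple

Hit = Tuple[str, int, float]  # (domain, document id, similarity)


def merge_hits(
    per_domain: Dict[str, Tuple[Sequence[float], Sequence[int]]],
    sizes: Dict[str, int],
    k: int = 5,
) -> List[Hit]:
    """Combine per-domain (scores, ids) pairs and keep the k best valid hits."""
    merged: List[Hit] = []
    for domain, (scores, ids) in per_domain.items():
        for score, idx in zip(scores, ids):
            # Skip padding (-1) and anything outside the metadata list
            if 0 <= idx < sizes[domain]:
                merged.append((domain, int(idx), float(score)))
    merged.sort(key=lambda hit: hit[2], reverse=True)
    return merged[:k]


hits = {
    "covid": ([0.9, 0.2, -1.0], [4, 1, -1]),  # last slot is padding
    "diabetes": ([0.7], [0]),
}
top = merge_hits(hits, sizes={"covid": 10, "diabetes": 3}, k=2)
print(top)
```

Without the lower bound, an id of `-1` would silently select the last metadata entry of that domain.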

## 🏗️ Build Vector Indexes

In [None]:
# Initialize RAG system
rag = ClinicalRAG(OPENROUTER_KEY)

# Build indexes
rag.build_all_indexes(all_documents)

print("\n✅ All indexes built successfully!")

## 🎯 Demo: Ask Clinical Questions!

In [None]:
# Example queries
queries = [
    {"question": "What are the symptoms of COVID-19?", "domain": "covid"},
    {"question": "What machine learning models are used for diabetes prediction?", "domain": "diabetes"},
    {"question": "What are the risk factors for heart attacks?", "domain": "heart_attack"},
    {"question": "What treatments are available for knee injuries?", "domain": "knee_injuries"}
]

# Run queries
for q in queries:
    result = rag.query(q["question"], q["domain"])
    
    print("="*80)
    print(f"\n💡 ANSWER:\n{result['answer']}\n")
    print(f"📊 Confidence: {result['confidence'].upper()}")
    print(f"\n📚 Top {len(result['sources'])} Sources:")
    for i, src in enumerate(result['sources'], 1):
        print(f"\n  {i}. {src['source']} (Page {src['page']})")
        print(f"     Similarity: {src['similarity']*100:.1f}%")
        print(f"     Excerpt: {src['text'][:200]}...")
    print("\n" + "="*80 + "\n")

## 🎮 Interactive Query Interface

In [None]:
import ipywidgets as widgets
from IPython.display import display, HTML, clear_output

# Create widgets
query_input = widgets.Textarea(
    value='What are the symptoms of COVID-19?',
    placeholder='Enter your clinical question...',
    description='Question:',
    layout=widgets.Layout(width='80%', height='80px')
)

domain_select = widgets.Dropdown(
    options=[('All Domains', None), ('COVID-19', 'covid'), 
             ('Diabetes', 'diabetes'), ('Heart Attack', 'heart_attack'), 
             ('Knee Injuries', 'knee_injuries')],
    value=None,
    description='Domain:'
)

submit_button = widgets.Button(
    description='🔍 Search',
    button_style='success',
    layout=widgets.Layout(width='150px')
)

output_area = widgets.Output()

import html  # standard library; escapes LLM and document text before embedding it in HTML

def on_submit_clicked(b):
    with output_area:
        clear_output()
        print("Searching...\n")
        
        result = rag.query(query_input.value, domain_select.value)
        
        # Color the confidence label by level so errors are not shown in green
        color = {'high': 'green', 'medium': 'orange'}.get(result['confidence'], 'red')
        
        display(HTML(f"""
        <div style='background: #f0f7ff; padding: 20px; border-radius: 10px; border-left: 5px solid #2196F3;'>
            <h3>💡 Answer</h3>
            <p style='font-size: 16px; line-height: 1.6;'>{html.escape(result['answer'])}</p>
            <p><strong>Confidence:</strong> <span style='color: {color};'>{result['confidence'].upper()}</span></p>
        </div>
        """))
        
        if result['sources']:
            display(HTML("<h3>📚 Evidence Sources</h3>"))
            for i, src in enumerate(result['sources'], 1):
                display(HTML(f"""
                <div style='background: #fff; padding: 15px; margin: 10px 0; border-radius: 8px; border: 1px solid #ddd;'>
                    <h4>#{i} {html.escape(str(src['source']))} (Page {src['page']})</h4>
                    <p><strong>Match:</strong> {src['similarity']*100:.1f}%</p>
                    <p style='font-style: italic; color: #555;'>"{html.escape(src['text'][:300])}..."</p>
                </div>
                """))

submit_button.on_click(on_submit_clicked)

# Display interface
display(HTML("<h2>🏥 Clinical AI Assistant</h2>"))
display(query_input)
display(domain_select)
display(submit_button)
display(output_area)

## 📊 System Statistics

In [None]:
import matplotlib.pyplot as plt

# Count documents per domain
stats = {domain: len(docs) for domain, docs in all_documents.items()}

# Create bar chart
fig, ax = plt.subplots(figsize=(12, 6))
domains = list(stats.keys())
counts = list(stats.values())

bars = ax.bar(domains, counts, color=['#9C27B0', '#2196F3', '#F44336', '#4CAF50'])
ax.set_xlabel('Domain', fontsize=14)
ax.set_ylabel('Number of Documents', fontsize=14)
ax.set_title('Clinical RAG System - Document Distribution', fontsize=16, fontweight='bold')

# Add value labels on bars
for bar in bars:
    height = bar.get_height()
    ax.text(bar.get_x() + bar.get_width()/2., height,
            f'{int(height):,}',
            ha='center', va='bottom', fontsize=12, fontweight='bold')

plt.tight_layout()
plt.show()

# Print summary
print("\n📈 System Summary:")
print(f"  Total Documents: {sum(counts):,}")
print(f"  Domains: {len(domains)}")
print("  Embedding Model: all-MiniLM-L6-v2 (384 dims)")
print("  Vector Index: FAISS IndexFlatIP")
print("  LLM: NVIDIA Nemotron via OpenRouter")
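
The summary names `IndexFlatIP`, an exact inner-product index. If the embeddings are L2-normalized before being added (an assumption here, since the indexing code is outside this section), the inner product equals cosine similarity, which would explain why the demo can print scores as percentages. A minimal pure-Python sketch with toy vectors; the helper names are hypothetical:

```python
import math
from typing import List


def l2_normalize(vec: List[float]) -> List[float]:
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def inner_product(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity computed directly from the raw vectors."""
    return inner_product(a, b) / (
        math.sqrt(inner_product(a, a)) * math.sqrt(inner_product(b, b))
    )


# Toy 2-d vectors standing in for 384-dimensional embeddings
query = [3.0, 4.0]
doc = [4.0, 3.0]

ip = inner_product(l2_normalize(query), l2_normalize(doc))
print(f"{ip:.2f}")  # prints 0.96, the same value cosine(query, doc) gives
```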

## 💾 Save RAG System

In [None]:
# Save indexes and metadata
import pickle

for domain in rag.indexes.keys():
    # Save FAISS index
    faiss.write_index(rag.indexes[domain], f'{domain}_index.faiss')
    
    # Save metadata
    with open(f'{domain}_metadata.pkl', 'wb') as f:
        pickle.dump(rag.metadata[domain], f)

print("✅ RAG system saved successfully!")
print("\n📁 Files created:")
!ls -lh *.faiss *.pkl
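
To reuse the saved files in a later session, the indexes and metadata can be read back with `faiss.read_index` and `pickle.load`. The sketch below assumes the `{domain}_index.faiss` and `{domain}_metadata.pkl` naming used in the save cell; `load_saved_indexes` is a hypothetical helper, not part of the `ClinicalRAG` class. `faiss` is imported inside the function so the helper can be defined before `faiss-cpu` is installed. Only unpickle files you created yourself, since `pickle.load` can execute arbitrary code.

```python
import pickle
from pathlib import Path
from typing import Dict, Iterable, Tuple


def load_saved_indexes(domains: Iterable[str], directory: str = '.') -> Tuple[Dict, Dict]:
    """Hypothetical helper: reload per-domain FAISS indexes and metadata."""
    import faiss  # lazy import; requires the faiss-cpu package

    indexes, metadata = {}, {}
    for domain in domains:
        index_path = Path(directory) / f'{domain}_index.faiss'
        meta_path = Path(directory) / f'{domain}_metadata.pkl'
        if not (index_path.exists() and meta_path.exists()):
            print(f"Skipping {domain}: saved files not found")
            continue
        # faiss.read_index expects a string path
        indexes[domain] = faiss.read_index(str(index_path))
        with open(meta_path, 'rb') as f:
            metadata[domain] = pickle.load(f)
    return indexes, metadata
```

For example, `rag.indexes, rag.metadata = load_saved_indexes(['covid', 'diabetes', 'heart_attack', 'knee_injuries'])` would restore the state saved above, assuming `rag` exposes `indexes` and `metadata` as the save cell does.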

## 🎓 Conclusion

### What We Built:
- ✅ Complete RAG pipeline with 34,000+ documents
- ✅ Multi-domain clinical question answering
- ✅ PDF parsing with Landing AI (page-level grounding)
- ✅ CSV clinical trial data integration
- ✅ FAISS vector search for semantic retrieval
- ✅ OpenRouter LLM for answer generation
- ✅ Interactive query interface

### Key Features:
- 🔒 **Local Retrieval** - Search runs against local FAISS indexes; only answer generation calls the OpenRouter API
- 📚 **Evidence-Based** - Every answer is returned with its retrieved sources
- 🎯 **Low-Temperature Generation** - Temperature 0.3 for more consistent, factual responses
- ⚡ **Fast** - 200-500 ms query time
- 🔍 **Semantic Search** - Matches intent, not just keywords
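
Latency figures such as the query time quoted above depend on hardware, index size, and network conditions for the LLM call, so they are worth measuring on your own setup. A minimal sketch using `time.perf_counter`; `time_call` is a hypothetical helper, and the workload below is a placeholder for something like `lambda: rag.retrieve(question, domain, k=5)`.

```python
import statistics
import time
from typing import Callable, Dict


def time_call(fn: Callable[[], object], repeats: int = 5) -> Dict[str, float]:
    """Hypothetical helper: run fn several times and report latency in milliseconds."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return {
        'min_ms': min(samples),
        'median_ms': statistics.median(samples),
        'max_ms': max(samples),
    }


# Placeholder workload; in the notebook this could wrap rag.retrieve(...)
timing = time_call(lambda: sum(range(100_000)))
print(f"median: {timing['median_ms']:.2f} ms")
```

Timing retrieval and generation separately shows how much of the total comes from the vector search and how much from the LLM request.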

### Repository:
🔗 [github.com/ArshanBhanage/Clinical-Assistant-RAG](https://github.com/ArshanBhanage/Clinical-Assistant-RAG)

---

**⚠️ Disclaimer**: This is a research prototype. Always consult healthcare professionals for medical advice.