# Clinical RAG System with Synthetic Data

This notebook shows how to run the Clinical RAG system on synthetic data, without access to the real MIMIC-IV dataset. The synthetic data mimics the structure and characteristics of MIMIC-IV, so the RAG system can be developed, tested, and demonstrated on it.

## What You'll Learn
- How to generate synthetic medical data
- How to process the data into documents for the RAG system
- How to create vector stores from the synthetic data
- How to query the RAG system with clinical questions
- How to customize the synthetic data generation

This notebook is particularly useful for:
- Users without MIMIC-IV access
- Development and testing
- Educational purposes
- Public demonstrations

## 1. Setup and Configuration

Let's start by importing the necessary libraries and setting up the environment.

In [None]:
# Import standard libraries
import os
import sys
import numpy as np
import pandas as pd
import pickle
from pathlib import Path
import matplotlib.pyplot as plt
import time

# Make sure the project directory is in the path (for imports)
project_root = Path(os.getcwd())
if str(project_root) not in sys.path:
    sys.path.append(str(project_root))

# Import RAG system components
from RAG_chat_pipeline.config.config import model_in_use, llms, LLM_MODEL, model_names
from RAG_chat_pipeline.core.embeddings_manager import load_or_create_vectorstore
from RAG_chat_pipeline.core.clinical_rag import ClinicalRAGBot
from RAG_chat_pipeline.core.main import main as initialize_clinical_rag

# Check if the synthetic data generator exists, otherwise we'll create a simple version
try:
    from synthetic_data.synthetic_data_generator import SyntheticDataGenerator
    print("✅ Synthetic data generator available")
except ImportError:
    print("⚠️ Synthetic data generator not found - will use available data")
    SyntheticDataGenerator = None

# Check if the necessary components are available
print(f"Current working directory: {os.getcwd()}")
print(f"Project root: {project_root}")
print(f"Default embedding model: {model_in_use}")
print(f"Default LLM model: {LLM_MODEL}")
print(f"Available models: {list(model_names.keys())}")
print(f"Available LLMs: {list(llms.keys())}")

## 2. Generate Synthetic Medical Data

Now we'll generate synthetic medical data using the `SyntheticDataGenerator` class. This data will mimic the structure of the MIMIC-IV dataset, including patient admissions, diagnoses, procedures, lab values, and medications.
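
To make the idea concrete, here is a minimal, standard-library sketch of how synthetic admission rows could be produced. It is not the `SyntheticDataGenerator` implementation: the function name `make_synthetic_admissions`, the ID ranges, and the date range are assumptions chosen for illustration. Only the column names follow the admissions data used in this notebook.

```python
import random
from datetime import datetime, timedelta

def make_synthetic_admissions(num_patients, num_admissions, seed=0):
    """Return admission rows (dicts) with MIMIC-IV-like column names.

    Hypothetical helper for illustration; IDs and dates are made up.
    """
    if num_admissions < num_patients:
        raise ValueError("need at least one admission per patient")
    rng = random.Random(seed)
    subject_ids = [10_000_000 + i for i in range(num_patients)]
    # Every patient gets one admission; the remainder are assigned at random,
    # so some patients end up with multiple admissions.
    extra = [rng.choice(subject_ids) for _ in range(num_admissions - num_patients)]
    rows = []
    for i, subject_id in enumerate(subject_ids + extra):
        admit = datetime(2150, 1, 1) + timedelta(days=rng.randint(0, 3650))
        stay = timedelta(hours=rng.randint(6, 24 * 14))
        rows.append({
            "subject_id": subject_id,
            "hadm_id": 20_000_000 + i,  # unique per admission
            "admittime": admit.isoformat(sep=" "),
            "dischtime": (admit + stay).isoformat(sep=" "),
            "admission_type": rng.choice(["EMERGENCY", "ELECTIVE", "URGENT"]),
        })
    return rows

rows = make_synthetic_admissions(num_patients=100, num_admissions=150)
print(len(rows), len({r["subject_id"] for r in rows}))  # → 150 100
```

Seeding the random generator keeps the output reproducible, which is convenient when the same synthetic cohort is needed across test runs.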

In [None]:
# Check if we have synthetic data generator available
if SyntheticDataGenerator:
    # Create a synthetic data generator instance
    synthetic_gen = SyntheticDataGenerator()
    
    # Set parameters for synthetic data generation
    num_patients = 100  # Number of synthetic patients to generate
    num_admissions = 150  # Total number of admissions (some patients have multiple)
    
    print(f"Generating synthetic data for {num_patients} patients with {num_admissions} total admissions...")
    
    # Generate the synthetic data
    # This will create CSV files in the mimic_sample_1000 directory with synthetic data
    synthetic_gen.generate_data(num_patients=num_patients, num_admissions=num_admissions)
    
    # List the generated files
    mimic_dir = Path(project_root) / "mimic_sample_1000"
    synthetic_files = list(mimic_dir.glob("*_sample*.csv"))
    print(f"Generated {len(synthetic_files)} synthetic data files:")
    for file in synthetic_files:
        print(f"  - {file.name}")
else:
    print("Using existing data from mimic_sample_1000 directory...")
    mimic_dir = Path(project_root) / "mimic_sample_1000"
    
    if mimic_dir.exists():
        existing_files = list(mimic_dir.glob("*.csv"))
        print(f"Found {len(existing_files)} existing data files:")
        for file in existing_files[:5]:  # Show first 5 files
            print(f"  - {file.name}")
        if len(existing_files) > 5:
            print(f"  ... and {len(existing_files) - 5} more files")
    else:
        print("❌ No data directory found. Please ensure mimic_sample_1000 directory exists.")

In [None]:
# Let's examine the synthetic admissions data
admissions_path = mimic_dir / "admissions.csv_sample1000.csv"
if admissions_path.exists():
    admissions_df = pd.read_csv(admissions_path)
    print(f"\nSynthetic Admissions Data Sample (First 5 rows):")
    display(admissions_df.head())
    
    print(f"\nTotal admissions: {len(admissions_df)}")
    print(f"Unique patients: {admissions_df['subject_id'].nunique()}")
    print(f"Admission types: {admissions_df['admission_type'].unique()}")
else:
    print("Admissions data file not found. Make sure synthetic data was generated correctly.")

In [None]:
# Let's also look at diagnoses and lab values
diagnoses_path = mimic_dir / "diagnoses_icd.csv_sample1000.csv"
labs_path = mimic_dir / "labevents.csv_sample1000.csv"

if diagnoses_path.exists():
    diagnoses_df = pd.read_csv(diagnoses_path)
    print(f"\nSynthetic Diagnoses Data Sample (First 5 rows):")
    display(diagnoses_df.head())
    print(f"Total diagnoses: {len(diagnoses_df)}")
    print(f"Unique ICD codes: {diagnoses_df['icd_code'].nunique()}")
else:
    print("Diagnoses data file not found.")

if labs_path.exists():
    labs_df = pd.read_csv(labs_path)
    print(f"\nSynthetic Lab Events Data Sample (First 5 rows):")
    display(labs_df.head())
    print(f"Total lab events: {len(labs_df)}")
    print(f"Unique lab items: {labs_df['itemid'].nunique()}")
    
    # Show distribution of abnormal flags
    if 'flag' in labs_df.columns:
        flag_counts = labs_df['flag'].value_counts()
        print("\nDistribution of abnormal flags:")
        display(flag_counts)
else:
    print("Lab events data file not found.")

## 3. Process Data into Documents

Now that we have synthetic data, we need to process it into documents suitable for the RAG pipeline. This involves:
1. Loading the data
2. Organizing it into patient-specific documents
3. Creating semantic chunks optimized for retrieval

The `DataProvider` class handles the abstraction between real and synthetic data sources.
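
The three steps above can be sketched in plain Python. This is an illustrative outline, not the project's actual processing code: the `Chunk` class, the `build_chunks` function, the `max_chars` budget, and the `hadm_id` metadata key are assumptions. The `page_content` and `metadata` attributes and the `section_type` key mirror what the cells below read from each chunk.

```python
from collections import defaultdict
from dataclasses import dataclass, field

@dataclass
class Chunk:
    # Hypothetical stand-in for the document objects stored in chunked_docs.pkl
    page_content: str
    metadata: dict = field(default_factory=dict)

def build_chunks(diagnoses, labs, max_chars=200):
    """Group rows by admission, render each section as text, then split into chunks."""
    sections = defaultdict(lambda: defaultdict(list))
    for row in diagnoses:
        sections[row["hadm_id"]]["diagnoses"].append(f"ICD {row['icd_code']}")
    for row in labs:
        flag = row.get("flag") or "normal"
        sections[row["hadm_id"]]["labs"].append(f"item {row['itemid']} = {row['value']} ({flag})")

    chunks = []
    for hadm_id, by_section in sections.items():
        for section_type, lines in by_section.items():
            meta = {"hadm_id": hadm_id, "section_type": section_type}
            buffer = []
            for line in lines:
                # Start a new chunk when adding this line would exceed the size budget
                if buffer and len("\n".join(buffer + [line])) > max_chars:
                    chunks.append(Chunk("\n".join(buffer), dict(meta)))
                    buffer = []
                buffer.append(line)
            if buffer:
                chunks.append(Chunk("\n".join(buffer), dict(meta)))
    return chunks

diagnoses = [{"hadm_id": 1, "icd_code": "J18.9"}, {"hadm_id": 2, "icd_code": "J44.9"}]
labs = [{"hadm_id": 1, "itemid": 1001, "value": 2.1, "flag": "abnormal"}]
demo_chunks = build_chunks(diagnoses, labs)
for chunk in demo_chunks:
    print(chunk.metadata, "|", chunk.page_content)
```

Keeping each chunk within a single admission and section means the metadata alone is enough to filter retrieval to one admission or one kind of record.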

In [None]:
# Process the data into documents if not already done
chunked_docs_path = mimic_dir / "chunked_docs.pkl"

if chunked_docs_path.exists():
    print("Loading existing document chunks...")
    with open(chunked_docs_path, 'rb') as f:
        chunked_docs = pickle.load(f)
    print(f"✅ Loaded {len(chunked_docs)} document chunks")
else:
    print("❌ Document chunks not found.")
    print("To create document chunks, you would typically need to:")
    print("1. Process the CSV files in mimic_sample_1000/")
    print("2. Convert them into structured documents")
    print("3. Chunk the documents for better retrieval")
    print("4. Save the chunks as chunked_docs.pkl")
    print()
    print("For this demo, we'll fall back to an empty chunked_docs list...")
    
    # No chunks are available, so use an empty list; later cells check for this case
    chunked_docs = []
    
# Display information about the chunks if available
if chunked_docs and len(chunked_docs) > 0:
    print(f"Document chunks available: {len(chunked_docs)}")
    if hasattr(chunked_docs[0], 'metadata'):
        print("Sample metadata keys:", list(chunked_docs[0].metadata.keys()) if chunked_docs[0].metadata else "No metadata")
else:
    print("No document chunks available for processing")

In [None]:
# Let's examine a few document chunks
if chunked_docs:
    print("\nExample Document Chunks:")
    for i, doc in enumerate(chunked_docs[:3]):
        print(f"\n--- Document Chunk {i+1} ---")
        print(f"Metadata: {doc.metadata}")
        print(f"Content Sample: {doc.page_content[:150]}...")
        print("-------------------")
        
    # Count chunks by type
    chunk_types = {}
    for doc in chunked_docs:
        section_type = doc.metadata.get('section_type', 'unknown')
        chunk_types[section_type] = chunk_types.get(section_type, 0) + 1
    
    print("\nDocument Chunks by Type:")
    for chunk_type, count in chunk_types.items():
        print(f"  {chunk_type}: {count} chunks")

## 4. Create Vector Store with Embeddings

Now we'll create a vector store from our document chunks using embeddings. This will allow for semantic search of the clinical text. We'll use the default embedding model specified in the configuration.
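
As a rough illustration of what a vector store does at query time, the sketch below ranks texts by cosine similarity to the query. It is a toy: `ToyVectorStore` and `embed` are hypothetical names, and the bag-of-words "embedding" only matches exact words, whereas a real embedding model captures semantic similarity. The actual system uses `load_or_create_vectorstore`, shown in the next cell.

```python
import math
from collections import Counter

def embed(text):
    """Toy embedding: word counts. A real system would call an embedding model."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

class ToyVectorStore:
    def __init__(self, texts):
        # Embed every text once, at indexing time
        self.entries = [(text, embed(text)) for text in texts]

    def similarity_search(self, query, k=5):
        # Embed the query, then return the k most similar texts
        q = embed(query)
        ranked = sorted(self.entries, key=lambda entry: cosine(q, entry[1]), reverse=True)
        return [text for text, _ in ranked[:k]]

store = ToyVectorStore([
    "diagnosis pneumonia unspecified organism",
    "lab creatinine elevated abnormal",
    "medication morphine for pain",
])
print(store.similarity_search("abnormal lab results", k=1))
# → ['lab creatinine elevated abnormal']
```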

In [None]:
# Check if we have the necessary data to create embeddings
if chunked_docs and len(chunked_docs) > 0:
    print(f"Creating embeddings using model: {model_in_use}")
    
    # Create or load the vector store using the actual system function
    try:
        vector_store, clinical_emb, chunked_docs = load_or_create_vectorstore()
        print("✅ Vector store loaded successfully")
        print(f"Vector store type: {type(vector_store).__name__}")
        print(f"Embedding model type: {type(clinical_emb).__name__}")
        print(f"Number of documents: {len(chunked_docs)}")
        
    except Exception as e:
        print(f"❌ Error loading vector store: {str(e)}")
        print("This might be because:")
        print("- The chunked_docs.pkl file doesn't exist")
        print("- The vector store files are missing")
        print("- The embedding model needs to be downloaded")
        vector_store = None
        clinical_emb = None
        
else:
    print("❌ No document chunks available - cannot create vector store")
    print("Vector store creation requires processed document chunks")
    vector_store = None
    clinical_emb = None

## 5. Initialize RAG Pipeline

With our vector store ready, we can now initialize the RAG pipeline. This will connect our synthetic data with a language model to enable question answering.
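
Conceptually, a RAG pipeline retrieves relevant chunks, places them in a prompt, and asks the language model to answer from that context. The sketch below shows that flow with injected `retrieve` and `generate` callables. It is an assumption-laden outline, not how `ClinicalRAGBot` is implemented: `build_prompt`, `answer`, the prompt wording, and the character budget are all hypothetical.

```python
def build_prompt(question, retrieved_chunks, max_context_chars=2000):
    """Assemble a grounded prompt from retrieved text chunks."""
    context_parts, used = [], 0
    for i, chunk in enumerate(retrieved_chunks, start=1):
        entry = f"[{i}] {chunk}"
        if used + len(entry) > max_context_chars:
            break  # keep the prompt within a rough size budget
        context_parts.append(entry)
        used += len(entry)
    context = "\n".join(context_parts) or "(no documents retrieved)"
    return (
        "Answer the clinical question using only the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )

def answer(question, retrieve, generate, k=5):
    """retrieve(question, k) -> list of str; generate(prompt) -> str. Both are injected."""
    chunks = retrieve(question, k)
    return {"answer": generate(build_prompt(question, chunks)), "source_documents": chunks}

# Stub retriever and generator so the flow can be run without a model
result = answer(
    "What diagnoses were made?",
    retrieve=lambda q, k: ["ICD J18.9 Pneumonia, unspecified organism"][:k],
    generate=lambda prompt: f"(stub LLM saw {len(prompt)} prompt characters)",
)
print(result["answer"])
```

Passing the retriever and generator as arguments keeps the flow testable without a running LLM, which is the same motivation as the mock object in the cell below.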

In [None]:
# Initialize the RAG pipeline
print("Initializing RAG pipeline...")

try:
    if vector_store and clinical_emb and chunked_docs:
        # Create the ClinicalRAGBot instance
        rag = ClinicalRAGBot(
            vectorstore=vector_store,
            clinical_emb=clinical_emb,
            chunked_docs=chunked_docs
        )
        
        print("✅ RAG pipeline initialized successfully.")
        print(f"Using LLM model: {LLM_MODEL}")
        print(f"Using embeddings model: {model_in_use}")
        print(f"Vector store ready with {len(chunked_docs)} documents")
        
        # Initialize conversation history
        conversation_history = []
        
    else:
        raise Exception("Vector store or embeddings not available")
    
except Exception as e:
    print(f"❌ Error initializing RAG pipeline: {str(e)}")
    print("This could be due to:")
    print("- Ollama not running or the LLM model not being available")
    print("- Vector store or embeddings not properly loaded")
    print("- Missing document chunks")
    print()
    print("Please ensure:")
    print(f"1. Ollama is installed and running")
    print(f"2. The model '{LLM_MODEL}' is pulled: ollama pull {LLM_MODEL}")
    print("3. The vector store and embeddings are properly created")
    
    # For demo purposes, we'll create a mock RAG object if there's an error
    print("\n⚠️ Creating mock RAG object for demonstration...")
    class MockRAG:
        def ask_question(self, question, chat_history=None, hadm_id=None, subject_id=None, section=None, k=5):
            return {
                "answer": f"[MOCK RESPONSE] This is a simulated response to: '{question}'. In a real system, this would query the vector store and use the LLM to generate a clinical response.",
                "source_documents": [],
                "search_time": 0.1,
                "documents_found": 0
            }
        
        def chat(self, message, chat_history=None):
            return f"[MOCK CHAT] Response to: '{message}'"
    
    rag = MockRAG()
    conversation_history = []
    print("✅ Mock RAG object created for demonstration")

## 6. Query the System with Clinical Questions

Now we're ready to query our RAG system using clinical questions. We'll demonstrate different types of queries that would be typical in a clinical setting.
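
Queries can be scoped to one admission or run globally across all admissions. One plausible way to implement that scoping is to filter chunks by their metadata before ranking them, as sketched below. The `filter_by_admission` function and the dict-based chunk layout are assumptions for illustration; the real system's filtering may differ.

```python
def filter_by_admission(chunks, hadm_id=None):
    """Return chunks for one admission, or all chunks for a global search."""
    if hadm_id is None:
        return list(chunks)
    # Compare as strings so that 12345678 and "12345678" refer to the same admission
    return [c for c in chunks if str(c["metadata"].get("hadm_id")) == str(hadm_id)]

demo = [
    {"metadata": {"hadm_id": 20000001}, "text": "diagnoses for admission A"},
    {"metadata": {"hadm_id": "20000002"}, "text": "labs for admission B"},
]
print(len(filter_by_admission(demo)))                   # → 2
print(filter_by_admission(demo, "20000001")[0]["text"])  # → diagnoses for admission A
```

The string comparison matters here because the examples below convert the sampled admission ID to `str` before passing it to the helper.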

In [None]:
# Helper function to ask questions and display results
def ask_clinical_question(question, hadm_id=None, k=5):
    print(f"Question: {question}")
    print(f"Admission ID: {hadm_id if hadm_id else 'None (global search)'}")
    print(f"Retrieved documents: {k}")
    print("-" * 50)
    
    start_time = time.time()
    
    # Use the correct method based on available RAG object
    if hasattr(rag, 'ask_question'):
        response = rag.ask_question(
            question=question,
            chat_history=conversation_history,
            hadm_id=hadm_id,
            k=k
        )
    elif hasattr(rag, 'clinical_search'):
        response = rag.clinical_search(
            question=question,
            hadm_id=hadm_id,
            k=k,
            chat_history=conversation_history
        )
    else:
        # Neither query method exists, so fail with a clear message
        raise AttributeError("RAG object has no 'ask_question' or 'clinical_search' method")
    
    end_time = time.time()
    
    # Normalize the answer: the response may be a dict or a plain string
    answer_text = response.get("answer", str(response)) if isinstance(response, dict) else str(response)
    
    # Add to conversation history (for real RAG systems)
    if isinstance(rag, ClinicalRAGBot):
        # The conversation history is managed internally
        pass
    else:
        # For mock systems, manually manage history
        conversation_history.append(("human", question))
        conversation_history.append(("assistant", answer_text))
    
    # Display the results
    print("Answer:")
    print(answer_text)
    print("-" * 50)
    print(f"Total response time: {end_time - start_time:.2f} seconds")
    
    if isinstance(response, dict):
        print(f"Search time: {response.get('search_time', 0):.2f} seconds")
        print(f"Documents found: {response.get('documents_found', 0)}")
        
        # Display source documents if available
        if "source_documents" in response and response["source_documents"]:
            print(f"\nSource Documents ({len(response['source_documents'])}):")
            for i, doc in enumerate(response["source_documents"][:3]):  # Show first 3
                print(f"\n  Document {i+1}:")
                if hasattr(doc, 'metadata'):
                    print(f"    Metadata: {doc.metadata}")
                    # Print truncated content
                    content = doc.page_content if hasattr(doc, 'page_content') else str(doc)
                    if len(content) > 100:
                        content = content[:100] + "..."
                    print(f"    Content: {content}")
                else:
                    print(f"    Content: {str(doc)[:100]}...")
        else:
            print("\nNo source documents returned.")
    
    print("\n" + "="*70)
    return response

In [None]:
# Get a random admission ID from our synthetic data for testing
try:
    admissions_df = pd.read_csv(mimic_dir / "admissions.csv_sample1000.csv")
    sample_admission_id = str(admissions_df['hadm_id'].sample(1).values[0])
    print(f"Selected random admission ID for testing: {sample_admission_id}")
except Exception as e:
    print(f"Error loading sample admission ID: {str(e)}")
    sample_admission_id = "12345678"  # Fallback ID for testing

# Example 1: General question about an admission
print("\n\n--- Example 1: General Admission Information ---")
question1 = f"What was the reason for admission {sample_admission_id}?"
response1 = ask_clinical_question(question1, hadm_id=sample_admission_id)

In [None]:
# Example 2: Specific diagnostic question
print("\n\n--- Example 2: Diagnostic Information ---")
question2 = f"What diagnoses were made for admission {sample_admission_id}?"
response2 = ask_clinical_question(question2, hadm_id=sample_admission_id)

# Example 3: Lab values question
print("\n\n--- Example 3: Laboratory Values ---")
question3 = f"Were there any abnormal lab values for admission {sample_admission_id}?"
response3 = ask_clinical_question(question3, hadm_id=sample_admission_id)

# Example 4: Medication question
print("\n\n--- Example 4: Medications ---")
question4 = f"What medications were prescribed for admission {sample_admission_id}?"
response4 = ask_clinical_question(question4, hadm_id=sample_admission_id)

In [None]:
# Example 5: Follow-up question (using conversation history)
print("\n\n--- Example 5: Follow-up Question ---")
question5 = "Were any of these medications for pain management?"
response5 = ask_clinical_question(question5, hadm_id=sample_admission_id)

# Example 6: Global question (across all patients)
print("\n\n--- Example 6: Global Question ---")
question6 = "How many patients had hypertension as a diagnosis?"
response6 = ask_clinical_question(question6, hadm_id=None)

## 7. Customize Synthetic Data Generation

The synthetic data generation system can be customized to create specific medical scenarios or to increase the variety of conditions. Here we'll demonstrate how to customize the data generation process.
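
To show what settings such as admission type weights and an abnormal lab probability mean in practice, here is a small standard-library sketch of the sampling they imply. The function names `sample_admission_types` and `sample_lab_flags` are hypothetical, and the `"abnormal"` or empty flag values are an assumption about the lab data format. The generator's own sampling logic may differ.

```python
import random
from collections import Counter

def sample_admission_types(weights, n, seed=0):
    """Draw n admission types with probabilities proportional to the given weights."""
    rng = random.Random(seed)
    types = list(weights)
    return rng.choices(types, weights=[weights[t] for t in types], k=n)

def sample_lab_flags(abnormal_probability, n, seed=0):
    """Flag each of n lab results as abnormal with the given probability."""
    rng = random.Random(seed)
    return ["abnormal" if rng.random() < abnormal_probability else "" for _ in range(n)]

weights = {"EMERGENCY": 0.6, "ELECTIVE": 0.3, "URGENT": 0.1}
types = sample_admission_types(weights, n=10_000)
flags = sample_lab_flags(0.4, n=10_000)

# Observed shares should sit close to the configured probabilities
print({t: round(c / len(types), 2) for t, c in Counter(types).items()})
print(round(flags.count("abnormal") / len(flags), 2))
```

With a large enough sample the observed shares approach the configured weights, which gives a quick sanity check after changing generator settings.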

In [None]:
# Create a custom synthetic data generator.
# SyntheticDataGenerator is None when the import in Section 1 failed, so guard against that.
if SyntheticDataGenerator is None:
    print("⚠️ Synthetic data generator not available - skipping customization demo")
else:
    custom_synthetic_gen = SyntheticDataGenerator()

    # Customize the generator by adding specific conditions
    # For example, to increase the likelihood of respiratory conditions:
    custom_synthetic_gen.common_diagnoses.extend([
        {"icd_code": "J44.9", "description": "Chronic obstructive pulmonary disease, unspecified"},
        {"icd_code": "J45.909", "description": "Unspecified asthma, uncomplicated"},
        {"icd_code": "J18.9", "description": "Pneumonia, unspecified organism"}
    ])

    # Customize lab value generation to include more abnormal results
    custom_synthetic_gen.abnormal_lab_probability = 0.4  # Default is typically lower

    # Customize admission types to include more emergency admissions
    custom_synthetic_gen.admission_type_weights = {
        'EMERGENCY': 0.6,  # Increased probability
        'ELECTIVE': 0.3,
        'URGENT': 0.1
    }

    print("Customized synthetic data generator settings:")
    print("- Added specific respiratory conditions")
    print("- Increased abnormal lab probability to 40%")
    print("- Modified admission type distribution: 60% emergency, 30% elective, 10% urgent")

# To generate data with these custom settings:
# custom_synthetic_gen.generate_data(num_patients=100, num_admissions=150, output_dir="custom_synthetic_data")

# Note: Uncommenting the line above would generate new synthetic data with the custom settings
# For this demo, we'll just show the customization options without actually generating new data

## 8. Performance Analysis and Insights

Let's analyze the performance of our RAG system and gather insights about its behavior with the synthetic data.

In [None]:
# Analyze the conversation history and response patterns
if conversation_history:
    print("=== CONVERSATION ANALYSIS ===")
    print(f"Total interactions: {len(conversation_history) // 2}")
    
    # Extract questions and responses
    questions = [msg[1] for msg in conversation_history if msg[0] == 'human']
    responses = [msg[1] for msg in conversation_history if msg[0] == 'assistant']
    
    print(f"Questions asked: {len(questions)}")
    print(f"Responses generated: {len(responses)}")
    
    # Analyze response lengths
    if responses:
        response_lengths = [len(resp) for resp in responses if isinstance(resp, str)]
        if response_lengths:
            print(f"\nResponse Length Statistics:")
            print(f"  Average response length: {np.mean(response_lengths):.1f} characters")
            print(f"  Min response length: {min(response_lengths)} characters")
            print(f"  Max response length: {max(response_lengths)} characters")
    
    # Show question categories
    print(f"\nQuestion Categories:")
    categories = {
        'diagnostic': sum(1 for q in questions if any(word in q.lower() for word in ['diagnos', 'condition', 'disease'])),
        'medication': sum(1 for q in questions if any(word in q.lower() for word in ['medication', 'drug', 'prescr'])),
        'lab': sum(1 for q in questions if any(word in q.lower() for word in ['lab', 'test', 'result'])),
        'general': sum(1 for q in questions if any(word in q.lower() for word in ['admission', 'reason', 'what']))
    }
    
    for category, count in categories.items():
        print(f"  {category.capitalize()}: {count} questions")

else:
    print("No conversation history available for analysis")

# System Performance Summary
print(f"\n=== SYSTEM PERFORMANCE SUMMARY ===")
print(f"✅ Configuration loaded: {model_in_use} embeddings, {LLM_MODEL} LLM")
print(f"✅ Data processing: {'Synthetic' if SyntheticDataGenerator else 'Existing'} data used")
print(f"✅ Vector store: {'Loaded' if vector_store else 'Mock/Unavailable'}")
print(f"✅ RAG pipeline: {'Initialized' if isinstance(rag, ClinicalRAGBot) else 'Mock mode'}")
print(f"✅ Question answering: {'Functional' if conversation_history else 'Ready'}")

# Memory usage (if possible)
try:
    import psutil
    process = psutil.Process(os.getpid())
    memory_mb = process.memory_info().rss / 1024 / 1024
    print(f"Current memory usage: {memory_mb:.1f} MB")
except ImportError:
    print("Memory usage monitoring not available (psutil not installed)")

## 9. Conclusion and Future Directions

This notebook has demonstrated the complete workflow of a Clinical RAG (Retrieval-Augmented Generation) system, from data processing to question answering. Here are the key takeaways and future directions for this research.

In [None]:
# Final Summary and Conclusions
print("="*80)
print("           CLINICAL RAG SYSTEM - COMPREHENSIVE CONCLUSION")
print("="*80)

print("\n🎯 PROJECT ACHIEVEMENTS:")
achievements = [
    "Successfully implemented a full RAG pipeline for clinical data",
    "Created a modular architecture supporting multiple embedding models",
    "Developed synthetic data generation capabilities for privacy-safe testing",
    "Implemented conversation history and context management",
    "Built a flexible question-answering system for clinical queries",
    "Created comprehensive evaluation and benchmarking framework"
]

for i, achievement in enumerate(achievements, 1):
    print(f"{i}. {achievement}")

print("\n🔬 RESEARCH CONTRIBUTIONS:")
contributions = [
    "Novel application of RAG to clinical decision support",
    "Comparative analysis of medical embedding models",
    "Privacy-preserving synthetic medical data generation",
    "Clinical conversation context management techniques",
    "Structured evaluation framework for medical Q&A systems"
]

for i, contribution in enumerate(contributions, 1):
    print(f"{i}. {contribution}")

print("\n📊 SYSTEM CAPABILITIES:")
capabilities = {
    "Multi-scope Queries": "Supports admission-specific, patient-wide, and global queries",
    "Conversational AI": "Maintains context across multiple interactions",
    "Medical Specialization": "Tailored for clinical terminology and workflows",
    "Scalable Architecture": "Modular design supports different models and datasets",
    "Evaluation Framework": "Comprehensive testing and benchmarking capabilities"
}

for capability, description in capabilities.items():
    print(f"✅ {capability}: {description}")

print("\n🚀 FUTURE RESEARCH DIRECTIONS:")
future_directions = [
    "Integration with real-time clinical systems (EHR, monitoring devices)",
    "Development of medical knowledge graph-enhanced retrieval",
    "Multi-modal RAG supporting medical images and time-series data",
    "Federated learning for privacy-preserving model updates",
    "Clinical decision support with uncertainty quantification",
    "Integration with medical ontologies (UMLS, SNOMED CT, ICD)",
    "Real-world clinical validation and user studies"
]

for i, direction in enumerate(future_directions, 1):
    print(f"{i}. {direction}")

print("\n💡 KEY INSIGHTS:")
insights = [
    "RAG systems show promise for clinical applications but require domain expertise",
    "Synthetic data enables development while preserving patient privacy",
    "Conversation context significantly improves clinical query understanding",
    "Evaluation frameworks are crucial for measuring clinical AI performance",
    "Modular architectures support rapid experimentation with different models"
]

for insight in insights:
    print(f"• {insight}")

print("\n⚠️ ETHICAL CONSIDERATIONS:")
ethics = [
    "This system is for research and educational purposes only",
    "Clinical decisions should always involve qualified healthcare professionals",
    "Patient privacy and data security must be paramount in any deployment",
    "Bias in training data can lead to inequitable healthcare recommendations",
    "Transparency and explainability are crucial for clinical AI adoption"
]

for ethical_point in ethics:
    print(f"⚠️ {ethical_point}")

print("\n" + "="*80)
print("Thank you for exploring the Clinical RAG System!")
print("For questions, contributions, or collaboration opportunities,")
print("please refer to the project documentation and repository.")
print("="*80)

# Display system status one final time
print(f"\n📋 FINAL SYSTEM STATUS:")
print(f"   Model Configuration: {model_in_use} (embeddings), {LLM_MODEL} (LLM)")
print(f"   Data Source: {'Synthetic' if SyntheticDataGenerator else 'Sample'} medical data")
print(f"   RAG Pipeline: {'✅ Operational' if isinstance(rag, ClinicalRAGBot) else '⚠️ Demo Mode'}")
print(f"   Vector Store: {'✅ Loaded' if vector_store else '❌ Unavailable'}")
print(f"   Interactions: {len(conversation_history) // 2 if conversation_history else 0} completed")

# Save session summary if desired
session_summary = {
    "timestamp": time.strftime("%Y-%m-%d %H:%M:%S"),
    "model_config": {"embedding": model_in_use, "llm": LLM_MODEL},
    "interactions": len(conversation_history) // 2 if conversation_history else 0,
    "system_status": "operational" if isinstance(rag, ClinicalRAGBot) else "demo",
}

print(f"\n💾 Session Summary: {session_summary}")
print("\nNotebook execution completed successfully! 🎉")