# Clinical RAG System with Synthetic Data

This notebook shows how to run the Clinical RAG system on synthetic data, so no access to the real MIMIC-IV dataset is required. The synthetic data mimics the structure and characteristics of MIMIC-IV, which makes it suitable for development, testing, and demonstration of the RAG system.

## What You'll Learn
- How to generate synthetic medical data
- How to process the data into documents for the RAG system
- How to create vector stores from the synthetic data
- How to query the RAG system with clinical questions
- How to customize the synthetic data generation

This notebook is particularly useful for:
- Users without MIMIC-IV access
- Development and testing
- Educational purposes
- Public demonstrations

## 1. Setup and Configuration

Let's start by importing the necessary libraries and setting up the environment.

In [None]:
# Import standard libraries
import os
import sys
import numpy as np
import pandas as pd
import pickle
from pathlib import Path
import matplotlib.pyplot as plt
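import time
from IPython.display import display  # explicit import so the cells also run outside a notebook kernel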

# Make sure the project directory is in the path (for imports)
project_root = Path(os.getcwd())
if str(project_root) not in sys.path:
    sys.path.append(str(project_root))

# Import RAG system components
from RAG_chat_pipeline.config import model_in_use, llms, LLM_MODEL
from RAG_chat_pipeline.utils.data_provider import DataProvider
from RAG_chat_pipeline.utils.synthetic_data.synthetic_data_generator import SyntheticDataGenerator
from RAG_chat_pipeline.embeddings_manager import initialize_embeddings, load_or_create_vector_stores
from RAG_chat_pipeline.clinical_rag import ClinicalRAG

# Check if the necessary components are available
print(f"Current working directory: {os.getcwd()}")
print(f"Project root: {project_root}")
print(f"Default embedding model: {model_in_use}")
print(f"Default LLM model: {LLM_MODEL['name']}")

## 2. Generate Synthetic Medical Data

Now we'll generate synthetic medical data using the `SyntheticDataGenerator` class. This data will mimic the structure of the MIMIC-IV dataset, including patient admissions, diagnoses, procedures, lab values, and medications.
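
To make the relationship between patients and admissions concrete, here is a minimal sketch of how 100 patients can end up with 150 admissions. It is not the `SyntheticDataGenerator` implementation, which this notebook does not show. The function name `sketch_admissions`, the ID offsets, and the allocation rule (one admission per patient, extras assigned at random) are assumptions made for illustration; only the column names `subject_id`, `hadm_id`, and `admission_type` come from the data examined below.

```python
import random

def sketch_admissions(num_patients, num_admissions, seed=0):
    """Toy allocation: each patient gets one admission, extras go to random patients."""
    if num_admissions < num_patients:
        raise ValueError("need at least one admission per patient")
    rng = random.Random(seed)
    # Hypothetical ID ranges, chosen only so the two ID spaces do not collide
    subject_ids = [10_000_000 + i for i in range(num_patients)]
    extras = [rng.choice(subject_ids) for _ in range(num_admissions - num_patients)]
    rows = []
    for n, subject_id in enumerate(subject_ids + extras):
        rows.append({
            "subject_id": subject_id,
            "hadm_id": 20_000_000 + n,  # unique per admission
            "admission_type": rng.choice(["EMERGENCY", "ELECTIVE", "URGENT"]),
        })
    return rows

rows = sketch_admissions(num_patients=100, num_admissions=150)
print(f"{len(rows)} admissions for {len({r['subject_id'] for r in rows})} patients")
```

Because every patient receives one admission before the extras are assigned, the number of unique `subject_id` values always equals `num_patients`.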

In [None]:
# Create a synthetic data generator instance
synthetic_gen = SyntheticDataGenerator()

# Set parameters for synthetic data generation
num_patients = 100  # Number of synthetic patients to generate
num_admissions = 150  # Total number of admissions (some patients have multiple)

print(f"Generating synthetic data for {num_patients} patients with {num_admissions} total admissions...")

# Generate the synthetic data
# This will create CSV files in the mimic_sample_1000 directory with synthetic data
synthetic_gen.generate_data(num_patients=num_patients, num_admissions=num_admissions)

# List the generated files
mimic_dir = Path(project_root) / "mimic_sample_1000"
synthetic_files = list(mimic_dir.glob("*_sample*.csv"))
print(f"Generated {len(synthetic_files)} synthetic data files:")
for file in synthetic_files:
    print(f"  - {file.name}")

In [None]:
# Let's examine the synthetic admissions data
admissions_path = mimic_dir / "admissions.csv_sample1000.csv"
if admissions_path.exists():
    admissions_df = pd.read_csv(admissions_path)
    print("\nSynthetic Admissions Data Sample (First 5 rows):")
    display(admissions_df.head())
    
    print(f"\nTotal admissions: {len(admissions_df)}")
    print(f"Unique patients: {admissions_df['subject_id'].nunique()}")
    print(f"Admission types: {admissions_df['admission_type'].unique()}")
else:
    print("Admissions data file not found. Make sure synthetic data was generated correctly.")

In [None]:
# Let's also look at diagnoses and lab values
diagnoses_path = mimic_dir / "diagnoses_icd.csv_sample1000.csv"
labs_path = mimic_dir / "labevents.csv_sample1000.csv"

if diagnoses_path.exists():
    diagnoses_df = pd.read_csv(diagnoses_path)
    print("\nSynthetic Diagnoses Data Sample (First 5 rows):")
    display(diagnoses_df.head())
    print(f"Total diagnoses: {len(diagnoses_df)}")
    print(f"Unique ICD codes: {diagnoses_df['icd_code'].nunique()}")
else:
    print("Diagnoses data file not found.")

if labs_path.exists():
    labs_df = pd.read_csv(labs_path)
    print("\nSynthetic Lab Events Data Sample (First 5 rows):")
    display(labs_df.head())
    print(f"Total lab events: {len(labs_df)}")
    print(f"Unique lab items: {labs_df['itemid'].nunique()}")
    
    # Show distribution of abnormal flags
    if 'flag' in labs_df.columns:
        flag_counts = labs_df['flag'].value_counts()
        print("\nDistribution of abnormal flags:")
        display(flag_counts)
else:
    print("Lab events data file not found.")

## 3. Process Data into Documents

Now that we have synthetic data, we need to process it into documents suitable for the RAG pipeline. This involves:
1. Loading the data
2. Organizing it into patient-specific documents
3. Creating semantic chunks optimized for retrieval

The `DataProvider` class handles the abstraction between real and synthetic data sources.
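
The three steps above can be sketched in a few lines. This is an illustrative stand-in, not the `DataProvider` logic: the `Doc` dataclass only mimics the `page_content` and `metadata` attributes used later in this notebook, `build_chunks` is a hypothetical helper, and the sketch assumes that the diagnoses and lab tables carry a `hadm_id` column to join on.

```python
from dataclasses import dataclass, field

@dataclass
class Doc:
    """Minimal stand-in for a document chunk with text and metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)

def build_chunks(admission, diagnoses, labs):
    """Group one admission's rows into one chunk per section."""
    hadm_id = admission["hadm_id"]
    codes = [d["icd_code"] for d in diagnoses if d["hadm_id"] == hadm_id]
    lab_lines = [
        f"item {lab['itemid']} flag={lab.get('flag') or 'normal'}"
        for lab in labs if lab["hadm_id"] == hadm_id
    ]
    sections = {
        "header": f"Admission {hadm_id} ({admission['admission_type']}) "
                  f"for patient {admission['subject_id']}.",
        "diagnoses": "Diagnoses: " + "; ".join(codes),
        "labs": "Labs: " + "; ".join(lab_lines),
    }
    # Each chunk records its admission and section so retrieval can filter on them
    return [
        Doc(text, {"hadm_id": hadm_id, "section_type": name})
        for name, text in sections.items()
    ]

admission = {"subject_id": 1, "hadm_id": 42, "admission_type": "EMERGENCY"}
diagnoses = [{"hadm_id": 42, "icd_code": "J18.9"}, {"hadm_id": 7, "icd_code": "I10"}]
labs = [{"hadm_id": 42, "itemid": 1001, "flag": "abnormal"}]  # itemid is arbitrary
chunks = build_chunks(admission, diagnoses, labs)
for chunk in chunks:
    print(chunk.metadata["section_type"], "->", chunk.page_content)
```

Splitting by section keeps each chunk focused on one kind of information, which is one plausible reading of "semantic chunks"; the real pipeline may chunk differently.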

In [None]:
# Initialize the data provider to work with our data
data_provider = DataProvider(project_root=project_root)

# Check if we're working with synthetic or real data
print(f"Using synthetic data: {data_provider.using_synthetic_data}")

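# Process the data into documents if not already done
# Note: if you regenerated the synthetic data above, delete chunked_docs.pkl first;
# otherwise the cached chunks from the previous run are loaded instead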
chunked_docs_path = mimic_dir / "chunked_docs.pkl"

if chunked_docs_path.exists():
    print("Loading existing document chunks...")
    with open(chunked_docs_path, 'rb') as f:
        chunked_docs = pickle.load(f)
    print(f"Loaded {len(chunked_docs)} document chunks")
else:
    print("Processing data into document chunks...")
    # This would normally be a longer process involving:
    # 1. Loading admission data
    # 2. Joining with diagnoses, procedures, lab values, etc.
    # 3. Creating text documents for each admission
    # 4. Chunking the documents for better retrieval
    
    # For this demo, we'll force the data provider to generate chunks
    chunked_docs = data_provider.get_chunked_docs()
    print(f"Generated {len(chunked_docs)} document chunks")
    
    # Save the chunks for future use
    with open(chunked_docs_path, 'wb') as f:
        pickle.dump(chunked_docs, f)
    print(f"Saved document chunks to {chunked_docs_path}")

In [None]:
# Let's examine a few document chunks
if chunked_docs:
    print("\nExample Document Chunks:")
    for i, doc in enumerate(chunked_docs[:3]):
        print(f"\n--- Document Chunk {i+1} ---")
        print(f"Metadata: {doc.metadata}")
        print(f"Content Sample: {doc.page_content[:150]}...")
        print("-------------------")
        
    # Count chunks by type
    chunk_types = {}
    for doc in chunked_docs:
        section_type = doc.metadata.get('section_type', 'unknown')
        chunk_types[section_type] = chunk_types.get(section_type, 0) + 1
    
    print("\nDocument Chunks by Type:")
    for chunk_type, count in chunk_types.items():
        print(f"  {chunk_type}: {count} chunks")

## 4. Create Vector Store with Embeddings

Now we'll create a vector store from our document chunks using embeddings. This will allow for semantic search of the clinical text. We'll use the default embedding model specified in the configuration.
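
As a rough picture of what semantic search over a vector store involves, the sketch below embeds each chunk, embeds the query, and ranks chunks by cosine similarity. Everything here is a toy stand-in: `embed` uses word counts where a real embedding model returns dense vectors, and `ToyVectorStore` is a hypothetical class, not the store returned by `load_or_create_vector_stores`.

```python
import math
from collections import Counter

def embed(text):
    """Toy embedding: lowercase word counts (real models return dense vectors)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

class ToyVectorStore:
    """Stores (text, vector) pairs and ranks them against a query vector."""

    def __init__(self, texts):
        self.entries = [(text, embed(text)) for text in texts]

    def similarity_search(self, query, k=5):
        query_vec = embed(query)
        ranked = sorted(self.entries, key=lambda e: cosine(query_vec, e[1]), reverse=True)
        return [text for text, _ in ranked[:k]]

store = ToyVectorStore([
    "diagnoses: pneumonia unspecified organism",
    "labs: potassium abnormal",
    "medications: acetaminophen for pain",
])
top = store.similarity_search("abnormal labs potassium", k=1)
print(top)
```

A production store replaces the linear scan with an index built for nearest-neighbour lookup, but the ranking idea is the same.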

In [None]:
# Initialize the embedding model (using the default from config.py)
embeddings = initialize_embeddings(model_name=model_in_use)

print(f"Using embedding model: {model_in_use}")

# Create or load the vector store
vector_store_path = Path(project_root) / "vector_stores" / f"faiss_mimic_sample1000_{model_in_use}"
print(f"Vector store path: {vector_store_path}")

if vector_store_path.exists():
    print("Vector store already exists. Loading existing vector store...")
    # In the actual system, load_or_create_vector_stores would handle this
else:
    print("Creating new vector store from document chunks...")
    
# Get the vector store (this will create it if it doesn't exist)
vector_store = load_or_create_vector_stores(
    docs=chunked_docs,
    model_name=model_in_use,
    embeddings=embeddings
)

print(f"Vector store ready for querying: {type(vector_store).__name__}")

## 5. Initialize RAG Pipeline

With our vector store ready, we can now initialize the RAG pipeline. This will connect our synthetic data with a language model to enable question answering.
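
Conceptually, a RAG pipeline retrieves relevant chunks, places them in a prompt together with the question, and passes that prompt to the language model. The sketch below shows that flow under stated assumptions: `build_prompt` and `answer` are hypothetical helpers, the prompt wording is invented, and `retrieve` and `llm` are plain callables standing in for the vector store and the model. It is not how `ClinicalRAG` is implemented. The response keys and the history format match those used later in this notebook.

```python
def build_prompt(question, contexts, history=None):
    """Assemble instructions, retrieved context, prior turns, and the new question."""
    lines = ["Answer using only the clinical context below.", ""]
    for i, context in enumerate(contexts, start=1):
        lines.append(f"[{i}] {context}")
    lines.append("")
    for turn in history or []:
        lines.append(f"{turn['role']}: {turn['content']}")
    lines.append(f"user: {question}")
    return "\n".join(lines)

def answer(question, retrieve, llm, k=5, history=None):
    """Retrieve k chunks, build the prompt, and ask the model."""
    contexts = retrieve(question, k)
    prompt = build_prompt(question, contexts, history)
    return {"answer": llm(prompt), "source_documents": contexts}

# Stub retriever and stub model so the sketch runs without Ollama
docs = ["Admission 42: pneumonia", "Admission 42: potassium abnormal"]
result = answer(
    "Any abnormal labs?",
    retrieve=lambda query, k: docs[:k],
    llm=lambda prompt: f"(stub answer from a {len(prompt)}-character prompt)",
    k=1,
)
print(result["answer"])
```

Keeping retrieval and generation behind simple callables makes it easy to swap in a mock for either side during testing.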

In [None]:
# Initialize the RAG pipeline
print("Initializing RAG pipeline...")

# Initialize conversation history up front so it exists even if the
# fallback MockRAG below is used
conversation_history = []

try:
    # Create the ClinicalRAG instance
    rag = ClinicalRAG(
        vector_store=vector_store,
        embeddings=embeddings,
        llm_model=LLM_MODEL,
        verbose=True
    )
    
    print("RAG pipeline initialized successfully.")
    print(f"Using LLM model: {LLM_MODEL['name']}")
    print(f"Using embeddings model: {model_in_use}")
    
except Exception as e:
    print(f"Error initializing RAG pipeline: {str(e)}")
    print("This could be due to Ollama not running or the LLM model not being available.")
    print("Please ensure Ollama is installed and the specified model is pulled.")
    print(f"You may need to run: ollama pull {LLM_MODEL['name']}")
    
    # For demo purposes, we'll create a mock RAG object if there's an error
    print("\nCreating mock RAG object for demonstration...")
    class MockRAG:
        def ask_question(self, query, k=5, admission_id=None, conversation_history=None):
            return {
                "answer": f"[MOCK] This is a simulated response to: '{query}'",
                "source_documents": [],
                "search_time": 0.1,
                "total_time": 0.2
            }
    rag = MockRAG()

## 6. Query the System with Clinical Questions

Now we're ready to query our RAG system using clinical questions. We'll demonstrate different types of queries that would be typical in a clinical setting.
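
The examples below pass either a specific `admission_id` or `None` for a global search. One common way to support both modes is to filter chunks on metadata before ranking them. The sketch below illustrates that idea only: `scoped_search` and `keyword_score` are hypothetical, and the notebook does not show how `ClinicalRAG.ask_question` applies the admission filter internally.

```python
def scoped_search(chunks, score, k=5, admission_id=None):
    """Return the k best-scoring chunks, optionally restricted to one admission."""
    if admission_id is not None:
        # Compare as strings because IDs read from CSV are often integers
        chunks = [c for c in chunks if str(c["hadm_id"]) == str(admission_id)]
    return sorted(chunks, key=score, reverse=True)[:k]

def keyword_score(chunk):
    """Stand-in relevance score: occurrences of a fixed keyword."""
    return chunk["text"].count("hypertension")

chunks = [
    {"hadm_id": 111, "text": "hypertension noted"},
    {"hadm_id": 222, "text": "hypertension and diabetes"},
    {"hadm_id": 222, "text": "elective knee surgery"},
]
scoped = scoped_search(chunks, keyword_score, k=2, admission_id="222")
global_hits = scoped_search(chunks, keyword_score, k=2)
print([c["hadm_id"] for c in scoped], [c["hadm_id"] for c in global_hits])
```

With a filter, all `k` slots go to the requested admission; without one, the slots are shared across every admission in the store.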

In [None]:
# Helper function to ask questions and display results
def ask_clinical_question(question, admission_id=None, k=5):
    print(f"Question: {question}")
    print(f"Admission ID: {admission_id if admission_id else 'None (global search)'}")
    print(f"Documents to retrieve (k): {k}")
    print("-" * 50)
    
    start_time = time.time()
    response = rag.ask_question(
        query=question,
        k=k,
        admission_id=admission_id,
        conversation_history=conversation_history
    )
    end_time = time.time()
    
    # Add to conversation history
    conversation_history.append({"role": "user", "content": question})
    conversation_history.append({"role": "assistant", "content": response["answer"]})
    
    # Display the results
    print("Answer:")
    print(response["answer"])
    print("-" * 50)
    print(f"Total response time: {end_time - start_time:.2f} seconds")
    print(f"Search time: {response.get('search_time', 0):.2f} seconds")
    
    # Display source documents
    if "source_documents" in response and response["source_documents"]:
        print("\nSource Documents:")
        for i, doc in enumerate(response["source_documents"]):
            print(f"\nDocument {i+1}:")
            print(f"  Score: {doc.metadata.get('score', 'N/A')}")
            print(f"  Metadata: {doc.metadata}")
            # Print truncated content
            content = doc.page_content
            if len(content) > 100:
                content = content[:100] + "..."
            print(f"  Content: {content}")
    else:
        print("\nNo source documents returned.")
    
    return response

In [None]:
# Get a random admission ID from our synthetic data for testing
try:
    admissions_df = pd.read_csv(mimic_dir / "admissions.csv_sample1000.csv")
    sample_admission_id = str(admissions_df['hadm_id'].sample(1).values[0])
    print(f"Selected random admission ID for testing: {sample_admission_id}")
except Exception as e:
    print(f"Error loading sample admission ID: {str(e)}")
    sample_admission_id = "12345678"  # Fallback ID for testing

# Example 1: General question about an admission
print("\n\n--- Example 1: General Admission Information ---")
question1 = f"What was the reason for admission {sample_admission_id}?"
response1 = ask_clinical_question(question1, admission_id=sample_admission_id)

In [None]:
# Example 2: Specific diagnostic question
print("\n\n--- Example 2: Diagnostic Information ---")
question2 = f"What diagnoses were made for admission {sample_admission_id}?"
response2 = ask_clinical_question(question2, admission_id=sample_admission_id)

# Example 3: Lab values question
print("\n\n--- Example 3: Laboratory Values ---")
question3 = f"Were there any abnormal lab values for admission {sample_admission_id}?"
response3 = ask_clinical_question(question3, admission_id=sample_admission_id)

# Example 4: Medication question
print("\n\n--- Example 4: Medications ---")
question4 = f"What medications were prescribed for admission {sample_admission_id}?"
response4 = ask_clinical_question(question4, admission_id=sample_admission_id)

In [None]:
# Example 5: Follow-up question (using conversation history)
print("\n\n--- Example 5: Follow-up Question ---")
question5 = "Were any of these medications for pain management?"
response5 = ask_clinical_question(question5, admission_id=sample_admission_id)

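# Example 6: Global question (across all patients)
# Note: only the top-k retrieved chunks reach the LLM, so a counting question
# like this is answered from those chunks and is unlikely to be an exact tally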
print("\n\n--- Example 6: Global Question ---")
question6 = "How many patients had hypertension as a diagnosis?"
response6 = ask_clinical_question(question6, admission_id=None)

## 7. Customize Synthetic Data Generation

The synthetic data generation system can be customized to create specific medical scenarios or to increase the variety of conditions. Here we'll demonstrate how to customize the data generation process.
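
To see what settings such as admission type weights and an abnormal lab probability do to generated data, the sketch below samples with Python's standard `random` module. It only illustrates weighted sampling; whether `SyntheticDataGenerator` consumes its attributes this way is an assumption, and the variable names simply mirror the attributes set in the next cell.

```python
import random
from collections import Counter

rng = random.Random(42)  # fixed seed so the sketch is repeatable

admission_type_weights = {"EMERGENCY": 0.6, "ELECTIVE": 0.3, "URGENT": 0.1}
abnormal_lab_probability = 0.4

n = 10_000
# Draw admission types in proportion to their weights
admission_types = rng.choices(
    population=list(admission_type_weights),
    weights=list(admission_type_weights.values()),
    k=n,
)
# Flag each lab result as abnormal with the configured probability
lab_flags = [
    "abnormal" if rng.random() < abnormal_lab_probability else ""
    for _ in range(n)
]

type_shares = {t: count / n for t, count in Counter(admission_types).items()}
abnormal_share = lab_flags.count("abnormal") / n
print(type_shares)
print(f"abnormal share: {abnormal_share:.2f}")
```

With a large sample the observed shares land close to the configured values, which is a quick way to check that a customization took effect.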

In [None]:
# Create a custom synthetic data generator
custom_synthetic_gen = SyntheticDataGenerator()

# Customize the generator by adding specific conditions
# For example, to increase the likelihood of respiratory conditions:
custom_synthetic_gen.common_diagnoses.extend([
    {"icd_code": "J44.9", "description": "Chronic obstructive pulmonary disease, unspecified"},
    {"icd_code": "J45.909", "description": "Unspecified asthma, uncomplicated"},
    {"icd_code": "J18.9", "description": "Pneumonia, unspecified organism"}
])

# Customize lab value generation to include more abnormal results
custom_synthetic_gen.abnormal_lab_probability = 0.4  # Default is typically lower

# Customize admission types to include more emergency admissions
custom_synthetic_gen.admission_type_weights = {
    'EMERGENCY': 0.6,  # Increased probability
    'ELECTIVE': 0.3,
    'URGENT': 0.1
}

print("Customized synthetic data generator settings:")
print("- Added specific respiratory conditions")
print("- Increased abnormal lab probability to 40%")
print("- Modified admission type distribution: 60% emergency, 30% elective, 10% urgent")

# To generate data with these custom settings:
# custom_synthetic_gen.generate_data(num_patients=100, num_admissions=150, output_dir="custom_synthetic_data")

# Note: Uncommenting the line above would generate new synthetic data with the custom settings
# For this demo, we'll just show the customization options without actually generating new data

## Conclusion

In this notebook, we've demonstrated how to use the synthetic data generation system with the Clinical RAG pipeline. The synthetic data option provides several advantages:

1. **No MIMIC-IV Credentials Required**: You can develop and test without needing access to the real dataset
2. **Customizable Scenarios**: You can generate specific medical scenarios for testing or educational purposes
3. **Shareable Data**: The synthetic data can be committed to Git, making it easy to share your work
4. **Seamless Integration**: The system automatically detects and uses synthetic data if real data isn't available

The RAG system works identically with both synthetic and real data, making it easy to transition between them as needed.

### Next Steps

- Try customizing the synthetic data generation for specific scenarios
- Experiment with different embedding models and LLMs
- Use the web interface to interact with the system
- Contribute to improving the synthetic data generation for more realistic scenarios