# GPT-3.5 Discharge Summary Generation

This notebook generates discharge summaries with GPT-3.5 and saves them for comparison against the original clinician-written discharge instructions.

## Analysis Steps:
1. Load EHR discharge notes
2. Extract relevant sections (Chief Complaint, History of Present Illness, Brief Hospital Course)
3. Generate summaries using GPT-3.5
4. Save generated summaries for comparison

## 1. Setup and Imports

In [None]:
import pandas as pd
import re
import os
from openai import OpenAI
from dotenv import load_dotenv

# Import custom data loader
import sys
sys.path.insert(0, '..')
from src.data_loader import load_for_analysis

# Load environment variables from .env before reading the API key
load_dotenv()

# Initialize OpenAI client
api_key = os.getenv('OPENAI_API_KEY')
if not api_key:
    raise ValueError(
        "OPENAI_API_KEY not found in environment variables.\n"
        "Please create a .env file with your API key:\n"
        "  cp .env.example .env\n"
        "  # Edit .env to add your key"
    )

CLIENT = OpenAI(api_key=api_key)
print("✓ OpenAI client initialized")

## 2. Load Data

In [None]:
# Load a sample for GPT summarization
# Note: Full dataset would be expensive to run through GPT-3.5
df = load_for_analysis(
    filepath='../data/sample=8k.csv',  # Using smaller sample
    sample_size=None,
    random_state=42
)

print(f"Loaded {len(df)} records for summarization")
df.head()

## 3. Section Extraction Functions

In [None]:
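def sep_sections(text):
    """
    Extract clinical note sections using regex patterns.

    Returns a dict mapping section names to their text. A section is left
    as an empty string when its heading, or the heading the pattern expects
    to follow it, is not found in the note.
    """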
    sections = {
        "Chief Complaint": "",
        "History of Present Illness": "",
        "Family History": "",
        "Brief Hospital Course": "",
        "Transitional Issues": "",
        "Discharge Instructions": "",
        "Followup Instructions": "",
    }
    
    patterns = {
        "Chief Complaint": r"(?smi)^\s*Chief Complaint(?::)?\n(.*?)^\s*Major Surgical or Invasive Procedure",
        "History of Present Illness": r"(?smi)^\s*History of Present Illness(?::)?\n(.*?)^\s*Past Medical History",
        "Family History": r"(?smi)^\s*Family History(?::)?\n(.*?)^\s*Physical Exam",
        "Brief Hospital Course": r"(?smi)^\s*Brief Hospital Course(?::)?\n(.*?)^\s*TRANSITIONAL ISSUES",
        "Transitional Issues": r"(?smi)^\s*TRANSITIONAL ISSUES(?::)?\n(.*?)^\s*Medications on Admission",
        "Discharge Instructions": r"(?smi)^\s*Discharge Instructions(?::)?\n(.*?)^\s*Followup Instructions",
        "Followup Instructions": r"(?smi)^\s*Followup Instructions(?::)?\n(.*)",
    }
    
    for section, pattern in patterns.items():
        # Flags are set inline via (?smi), so no extra flags argument is needed
        match = re.search(pattern, text)
        if match:
            sections[section] = match.group(1).strip()
    
    return sections
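
To see how these patterns behave, the sketch below applies the History of Present Illness pattern to a synthetic note (made up for illustration, not MIMIC data). It also shows a limitation worth keeping in mind: each pattern ends at one specific following heading, so a note that lacks that heading yields no match and the section stays empty.

```python
import re

# Same pattern shape as the "History of Present Illness" entry above
HPI_PATTERN = r"(?smi)^\s*History of Present Illness(?::)?\n(.*?)^\s*Past Medical History"

# Synthetic note for illustration only (not real patient data)
note = (
    "Chief Complaint:\n"
    "Chest pain\n"
    "\n"
    "Major Surgical or Invasive Procedure:\n"
    "None\n"
    "\n"
    "History of Present Illness:\n"
    "Patient presented with two days of chest pain.\n"
    "\n"
    "Past Medical History:\n"
    "Hypertension\n"
)

match = re.search(HPI_PATTERN, note)
hpi = match.group(1).strip() if match else ""
print(hpi)

# Without the closing "Past Medical History" heading the pattern cannot match
truncated = note.split("Past Medical History")[0]
missing = re.search(HPI_PATTERN, truncated)
print(missing)
```

If many notes come back with empty sections, counting the empty values per section across the sample is a quick way to spot headings that vary from the expected order.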

## 4. GPT-3.5 Summarization Function

In [None]:
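def generate_discharge_summary(text):
    """
    Generate discharge summary using GPT-3.5.

    Sends the instruction prompt, a one-shot example letter, and the
    extracted clinical text as a single user message, and returns the
    model's reply as a string.

    Based on goals-of-care conversation prompt.
    """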
    prompt = (
        "You are a healthcare provider preparing discharge instructions for a patient. "
        "Below is clinical information from the hospital stay. "
        "Please create clear, patient-friendly discharge instructions that include:\n"
        "1. Summary of hospital stay and treatment\n"
        "2. Key medications and instructions\n"
        "3. Follow-up care needed\n"
        "4. Warning signs to watch for\n\n"
    )
    
    # One-shot example showing the expected letter format
    example = (
        "Example:\n"
        "Dear Patient,\n\n"
        "You were admitted for [condition]. During your stay, we [treatment]. "
        "You are now stable and ready to go home.\n\n"
        "Important instructions:\n"
        "- Take medications as prescribed\n"
        "- Follow up with your doctor\n"
        "- Call 911 if symptoms worsen\n\n"
    )
    
    response = CLIENT.chat.completions.create(
        messages=[
            {
                "role": "user",
                "content": prompt + example + "\n\nPatient Information:\n" + text,
            }
        ],
        model="gpt-3.5-turbo-0125",
        response_format={"type": "text"}
    )
    
    return response.choices[0].message.content
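
API calls can fail transiently (rate limits, timeouts), and the loop in the next section drops a record on any error. One option is a small retry wrapper with exponential backoff. The sketch below is a generic helper, not part of this project: `with_retries`, its parameters, and the `flaky` stand-in are hypothetical names used for illustration. It catches every exception for brevity; in practice you would likely narrow that to transient errors. The `openai` v1 client also retries some errors on its own through its `max_retries` setting, so this may only matter for longer runs.

```python
import time

def with_retries(fn, retries=3, base_delay=1.0, sleep=time.sleep):
    """Call fn(); on failure wait base_delay * 2**attempt seconds and retry."""
    for attempt in range(retries):
        try:
            return fn()
        except Exception:
            # Re-raise once the retry budget is spent
            if attempt == retries - 1:
                raise
            sleep(base_delay * 2 ** attempt)

# Demo with a stand-in that fails twice, then succeeds
calls = {"n": 0}

def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("simulated transient error")
    return "ok"

# Record the delays instead of actually sleeping
delays = []
result = with_retries(flaky, retries=3, base_delay=1.0, sleep=delays.append)
print(result, delays)  # ok [1.0, 2.0]
```

In the notebook this could wrap the existing call, for example `with_retries(lambda: generate_discharge_summary(context))`.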

## 5. Generate Summaries

**Note:** This cell can be expensive to run on large datasets. 
Start with a small sample to test.

In [None]:
# Create output directory
os.makedirs('../results/GPTSummaries', exist_ok=True)

# Generate summaries
generated_summaries = []

# IMPORTANT: Limit to small sample for testing
test_sample = df.head(10)  # Start with just 10

print(f"Generating summaries for {len(test_sample)} records...")
print("Estimated cost: ~$0.002 per summary\n")

for i, (_, row) in enumerate(test_sample.iterrows(), start=1):
    # Count by position: the DataFrame index need not start at 0 or be contiguous
    print(f"Processing {i}/{len(test_sample)}...", end=" ")
    
    # Extract sections
    sections = sep_sections(row['text'])
    
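    # Combine relevant sections
    context = "\n\n".join([
        f"{key}:\n{sections[key]}"
        for key in ["Chief Complaint", "History of Present Illness", "Brief Hospital Course"]
        if sections[key]
    ])
    
    # Skip notes where no section matched; an empty prompt would make the model invent content
    if not context:
        print("✗ Skipped: no sections extracted")
        continue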
    
    # Generate summary
    try:
        summary = generate_discharge_summary(context)
        
        # Save to file
        filename = f"{row['subject_id']}_{row['gender']}_{row['race']}.txt".replace("/", "")
        filepath = f"../results/GPTSummaries/{filename}"
        
        with open(filepath, 'w', encoding='utf-8') as f:
            f.write(summary)
        
        generated_summaries.append({
            'subject_id': row['subject_id'],
            'gender': row['gender'],
            'race': row['race'],
            'text': summary
        })
        
        print("✓")
        
    except Exception as e:
        print(f"✗ Error: {e}")

print(f"\nGenerated {len(generated_summaries)} summaries")
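
Because each call is billed, rerunning this cell regenerates (and pays again for) summaries that are already on disk. A possible guard is to check for an existing output file before calling the API. The helper below is a hypothetical sketch, not part of the project; `needs_generation` is an illustrative name.

```python
import os
import tempfile

def needs_generation(filepath):
    """Return True when no non-empty output file exists at filepath."""
    return not (os.path.isfile(filepath) and os.path.getsize(filepath) > 0)

# Demo in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.txt")
    before = needs_generation(path)
    with open(path, "w", encoding="utf-8") as f:
        f.write("generated text")
    after = needs_generation(path)

print(before, after)  # True False
```

To use it in the loop, the output path would have to be built before the API call, with a `continue` when `needs_generation(filepath)` is false. Skipped records would then need to be read back from disk if they should still appear in `generated_summaries`.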

## 6. Save Results to CSV

In [None]:
# Create DataFrame from generated summaries
generated_df = pd.DataFrame(generated_summaries)

# Save to CSV
output_file = '../results/GPTSummaries/generated_summaries.csv'
generated_df.to_csv(output_file, index=False)

print(f"Saved {len(generated_df)} summaries to {output_file}")
generated_df.head()

## 7. Next Steps

With generated summaries, you can:
1. Run Fighting Words analysis on GPT-generated vs original instructions
2. Compare word usage patterns across racial groups
3. Assess whether GPT-3.5 amplifies or reduces bias
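
The project's own Fighting Words implementation is not shown in this notebook. As a rough illustration of the idea behind step 1, the sketch below computes z-scored log-odds ratios with a Dirichlet prior, in the style of Monroe et al.'s "Fightin' Words" method. The uniform prior value, the whitespace tokenization, and the function name `fighting_words_z` are assumptions made for this example.

```python
import math
from collections import Counter

def fighting_words_z(tokens_a, tokens_b, prior=0.01):
    """Z-scored log-odds ratio per word with a uniform Dirichlet prior.

    Positive scores mark words over-represented in tokens_a,
    negative scores words over-represented in tokens_b.
    """
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    vocab = set(counts_a) | set(counts_b)
    if len(vocab) < 2:
        # With a single word type there is nothing to contrast
        return {word: 0.0 for word in vocab}
    n_a, n_b = sum(counts_a.values()), sum(counts_b.values())
    a0 = prior * len(vocab)  # total prior mass

    scores = {}
    for word in vocab:
        y_a, y_b = counts_a[word] + prior, counts_b[word] + prior
        log_odds_a = math.log(y_a / (n_a + a0 - y_a))
        log_odds_b = math.log(y_b / (n_b + a0 - y_b))
        variance = 1.0 / y_a + 1.0 / y_b  # approximate variance of the difference
        scores[word] = (log_odds_a - log_odds_b) / math.sqrt(variance)
    return scores

# Toy corpora for illustration only
tokens_a = "please take your medication and call your doctor".split()
tokens_b = "you must take medication and must follow instructions".split()
scores = fighting_words_z(tokens_a, tokens_b)

for word in ("please", "must", "take"):
    print(word, round(scores[word], 3))
```

Here `tokens_a` and `tokens_b` would be the pooled tokens of the two groups being compared, such as GPT-generated versus original instructions, or summaries for two demographic groups.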

### Cost Estimation:
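- GPT-3.5-turbo: ~$0.001 per 1K tokens (rough blended rate; check current pricing)
- Average discharge note: ~2K tokens, so roughly $0.002 per summary
- For 8,000 summaries: ~$16-$20 (about $16 for the note text, plus the prompt and generated output)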
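
The arithmetic behind these figures can be sketched as below, using only the approximate rates listed above. `estimate_cost` is a hypothetical helper, and it ignores prompt overhead and output tokens, which is why it lands at the low end of the range.

```python
def estimate_cost(n_summaries, tokens_per_note=2_000, usd_per_1k_tokens=0.001):
    """Rough API cost in USD: notes x tokens x price, ignoring output tokens."""
    return n_summaries * tokens_per_note / 1_000 * usd_per_1k_tokens

print(f"10 summaries:    ${estimate_cost(10):.2f}")     # $0.02
print(f"8,000 summaries: ${estimate_cost(8_000):.2f}")  # $16.00
```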

### Ethical Considerations:
- Generated summaries should NOT be used clinically
- This is research only - not medical advice
- MIMIC data use agreement restrictions apply