# Medical NLP with Azure - Complete Analysis

This notebook demonstrates a comprehensive medical Natural Language Processing pipeline with Azure Cognitive Services integration.

## Key Features:
- **Medical Text Preprocessing**: Specialized cleaning and normalization for medical texts
- **Entity Recognition**: Extract medications, conditions, procedures, and vital signs
- **Text Classification**: Classify medical document types (clinical notes, discharge summaries, etc.)
- **Sentiment Analysis**: Analyze patient feedback sentiment
- **Azure Integration**: Leverage Azure Text Analytics for Health (with local fallbacks)
- **Interactive Visualizations**: Comprehensive analysis and reporting

## Healthcare Applications:
- Clinical documentation analysis
- Patient feedback monitoring
- Medical record classification
- Quality improvement initiatives
- Regulatory compliance support

In [None]:
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import warnings
warnings.filterwarnings('ignore')

# Set up plotting
plt.style.use('default')
sns.set_palette("husl")

print("📚 Libraries imported successfully!")

In [None]:
# Add project paths
import sys
from pathlib import Path

# Add src directory to Python path
project_root = Path.cwd()
src_dir = project_root / 'src'
shared_dir = project_root.parent.parent / 'shared'

sys.path.insert(0, str(src_dir))
sys.path.insert(0, str(shared_dir))

print(f"📁 Project root: {project_root}")
print(f"📁 Source directory: {src_dir}")
print(f"📁 Shared directory: {shared_dir}")

## 1. Data Generation and Loading

We'll start by generating synthetic medical text data that mimics real-world clinical documentation. Because every record is synthetic, the dataset contains no real patient information.
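
The internals of `MedicalTextGenerator` are not shown in this notebook. As a rough illustration of how seeded, template-based generation can work, here is a minimal sketch. The templates, filler lists, and the `generate_records` function are all hypothetical and are not part of the project code.

```python
import random

# Hypothetical templates and fillers, for illustration only.
TEMPLATES = {
    "clinical_note": "Patient presents with {condition}. Started on {medication}.",
    "discharge_summary": "DISCHARGE SUMMARY: admitted with {condition}. Discharged on {medication}.",
    "patient_feedback": "The staff was {adjective} during my visit.",
}
FILLERS = {
    "condition": ["hypertension", "chest pain", "acute exacerbation of COPD"],
    "medication": ["lisinopril 10mg", "aspirin 325mg", "bronchodilators"],
    "adjective": ["caring", "professional", "rushed"],
}


def generate_records(total_records, seed=42):
    """Return a list of synthetic records; the same seed gives the same output."""
    rng = random.Random(seed)  # private RNG so global random state is untouched
    records = []
    for _ in range(total_records):
        note_type = rng.choice(sorted(TEMPLATES))
        # Draw one value per slot; str.format ignores slots a template does not use.
        values = {slot: rng.choice(options) for slot, options in FILLERS.items()}
        records.append({"note_type": note_type, "text": TEMPLATES[note_type].format(**values)})
    return records
```

Passing the result to `pd.DataFrame(...)` would give a table with `note_type` and `text` columns, similar in shape to what the cells below expect.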

In [None]:
# Import our custom modules
try:
    from data_generators.medical_text_generator import MedicalTextGenerator
    from nlp_pipeline import MedicalNLPPipeline
    print("✅ Custom modules imported successfully!")
except ImportError as e:
    print(f"⚠️ Import error: {e}")
    print("Installing required packages...")
    
    # Install required packages
    import subprocess
    packages = ['scikit-learn', 'nltk', 'textblob']
    
    for package in packages:
        try:
            subprocess.check_call([sys.executable, '-m', 'pip', 'install', package])
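        except subprocess.CalledProcessError:
            print(f"Failed to install {package}")

    # The failed imports are not retried automatically
    print("🔁 Re-run this cell to retry the import. If it still fails, check that src_dir points at the project sources.")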

In [None]:
# Generate synthetic medical text data
print("🏥 Generating synthetic medical text data...")

# Initialize generator
generator = MedicalTextGenerator(seed=42)

# Generate dataset
medical_data = generator.generate_dataset(total_records=500)

print(f"✅ Generated {len(medical_data)} medical text records")
print(f"📊 Columns: {list(medical_data.columns)}")

# Display basic statistics
print("\n📈 Note Type Distribution:")
print(medical_data['note_type'].value_counts())

In [None]:
# Display sample records
print("📝 Sample Medical Text Records:\n")

for note_type in medical_data['note_type'].unique():
    sample = medical_data[medical_data['note_type'] == note_type].iloc[0]
    print(f"{'='*50}")
    print(f"📋 {note_type.upper().replace('_', ' ')}")
    print(f"{'='*50}")
    print(f"Text: {sample['text'][:200]}...")
    if 'sentiment' in sample and pd.notna(sample['sentiment']):
        print(f"Sentiment: {sample['sentiment']}")
    print()

## 2. Medical NLP Pipeline Initialization

Initialize our comprehensive medical NLP pipeline with all components:
- Text preprocessing
- Entity recognition
- Classification models
- Sentiment analysis

In [None]:
# Initialize the medical NLP pipeline
print("🔧 Initializing Medical NLP Pipeline...")

# Configuration for the pipeline
config = {
    'preserve_medical_terms': True,
    'use_azure': False,  # Set to True if you have Azure credentials
    'max_features': 3000,
    'test_size': 0.2,
    'random_state': 42
}

# Initialize pipeline
pipeline = MedicalNLPPipeline()
pipeline.config.update(config)

print("✅ Pipeline initialized successfully!")
print(f"🔹 Configuration: {pipeline.config}")

## 3. Text Preprocessing Analysis

Demonstrate medical-specific text preprocessing including:
- Medical abbreviation expansion
- Clinical term normalization
- Entity-aware tokenization
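
To make the first and last steps above concrete, here is a minimal sketch of abbreviation expansion and measurement-aware tokenization using only the standard library. The abbreviation table and both function names are assumptions for illustration; the pipeline's own preprocessing may work differently.

```python
import re

# Hypothetical abbreviation table; a real pipeline would use a curated clinical lexicon.
ABBREVIATIONS = {
    "bp": "blood pressure",
    "hr": "heart rate",
    "copd": "chronic obstructive pulmonary disease",
    "ekg": "electrocardiogram",
}

# Word boundaries keep "hr" from matching inside "three" and "bp" inside "bpm".
_ABBR_PATTERN = re.compile(
    r"\b(" + "|".join(map(re.escape, ABBREVIATIONS)) + r")\b",
    flags=re.IGNORECASE,
)


def expand_abbreviations(text):
    """Replace known abbreviations with their expansions, case-insensitively."""
    return _ABBR_PATTERN.sub(lambda m: ABBREVIATIONS[m.group(1).lower()], text)


def tokenize(text):
    """Split text into tokens, keeping values like '160/95', '325mg', '88%' whole."""
    return re.findall(r"\d+(?:[./]\d+)*[a-zA-Z%]*|[A-Za-z]+(?:-[A-Za-z]+)*", text)
```

For example, `tokenize(expand_abbreviations("BP 160/95 mmHg"))` keeps the reading `160/95` as one token instead of splitting it at the slash.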

In [None]:
# Run text preprocessing
print("🔄 Running text preprocessing...")

processed_data, feature_data = pipeline.preprocess_texts(medical_data)

print(f"✅ Preprocessed {len(processed_data)} texts")
print(f"📊 Feature matrix shape: {feature_data.shape}")

# Display preprocessing summary
prep_summary = pipeline.results['preprocessing_summary']
print("\n📋 Preprocessing Summary:")
for key, value in prep_summary.items():
    if isinstance(value, float):
        print(f"  🔹 {key}: {value:.2f}")
    else:
        print(f"  🔹 {key}: {value}")

In [None]:
# Visualize preprocessing results
fig, axes = plt.subplots(2, 2, figsize=(15, 10))
fig.suptitle('Medical Text Preprocessing Analysis', fontsize=16, fontweight='bold')

# Text length distribution
axes[0,0].hist(medical_data['text'].str.len(), bins=30, alpha=0.7, color='skyblue', label='Original')
axes[0,0].hist(processed_data['cleaned_text'].str.len(), bins=30, alpha=0.7, color='lightcoral', label='Cleaned')
axes[0,0].set_xlabel('Text Length (characters)')
axes[0,0].set_ylabel('Frequency')
axes[0,0].set_title('Text Length Distribution')
axes[0,0].legend()

# Token count distribution
axes[0,1].hist(processed_data['token_count'], bins=20, alpha=0.7, color='lightgreen')
axes[0,1].set_xlabel('Token Count')
axes[0,1].set_ylabel('Frequency')
axes[0,1].set_title('Token Count Distribution')

# Feature statistics
feature_cols = ['char_count', 'word_count', 'sentence_count', 'avg_word_length']
feature_stats = feature_data[feature_cols].mean()
axes[1,0].bar(range(len(feature_stats)), feature_stats.values, color='gold')
axes[1,0].set_xticks(range(len(feature_stats)))
axes[1,0].set_xticklabels(feature_stats.index, rotation=45)
axes[1,0].set_ylabel('Average Value')
axes[1,0].set_title('Average Text Features')

# Entity indicators
entity_cols = ['medication_count', 'condition_count', 'measurement_count', 'procedure_count']
entity_stats = feature_data[entity_cols].mean()
axes[1,1].bar(range(len(entity_stats)), entity_stats.values, color='mediumpurple')
axes[1,1].set_xticks(range(len(entity_stats)))
axes[1,1].set_xticklabels(entity_stats.index, rotation=45)
axes[1,1].set_ylabel('Average Count')
axes[1,1].set_title('Average Entity Counts')

plt.tight_layout()
plt.show()

## 4. Medical Entity Recognition

Extract and analyze medical entities including:
- Medications and dosages
- Medical conditions and diagnoses
- Procedures and treatments
- Vital signs and measurements
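
A local fallback for these entity types can be as simple as term lists plus regular expressions. The sketch below is an assumption about what such a fallback might look like, not the pipeline's actual extractor. The term lists, patterns, and the `extract_entities_sketch` name are hypothetical.

```python
import re

# Hypothetical term lists; real systems draw on much larger vocabularies.
MEDICATIONS = ["aspirin", "lisinopril", "nitroglycerin"]
CONDITIONS = ["hypertension", "chest pain", "copd"]

# A number followed by a unit, e.g. "10mg" or "2.5 ml".
DOSAGE_RE = re.compile(r"\b(\d+(?:\.\d+)?)\s?(mg|mcg|g|ml|units)\b", re.IGNORECASE)
# A blood pressure reading introduced by "BP" or "blood pressure".
BP_RE = re.compile(r"\b(?:BP|blood pressure)\s*:?\s*(\d{2,3})/(\d{2,3})\b", re.IGNORECASE)


def find_terms(text, terms, category):
    """Return one entity dict per whole-word, case-insensitive match of each term."""
    found = []
    for term in terms:
        for m in re.finditer(r"\b" + re.escape(term) + r"\b", text, re.IGNORECASE):
            found.append({"text": m.group(0), "category": category, "start": m.start()})
    return found


def extract_entities_sketch(text):
    """Collect medications, conditions, dosages, and blood pressure readings in text order."""
    entities = find_terms(text, MEDICATIONS, "medication")
    entities += find_terms(text, CONDITIONS, "condition")
    for m in DOSAGE_RE.finditer(text):
        entities.append({"text": m.group(0), "category": "dosage", "start": m.start()})
    for m in BP_RE.finditer(text):
        reading = f"{m.group(1)}/{m.group(2)}"
        entities.append({"text": reading, "category": "vital_sign", "start": m.start(1)})
    return sorted(entities, key=lambda e: e["start"])
```

Rule-based matching like this is fast and transparent, but it misses misspellings and unseen terms, which is one reason to prefer a managed service such as Text Analytics for Health when credentials are available.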

In [None]:
# Run entity recognition
print("🔍 Running medical entity recognition...")

entity_data = pipeline.extract_entities(processed_data)

print(f"✅ Extracted entities from {len(entity_data)} texts")

# Display entity statistics
entity_stats = pipeline.results['entity_statistics']
print("\n📊 Entity Recognition Summary:")
for key, value in entity_stats.items():
    if isinstance(value, float):
        print(f"  🔹 {key}: {value:.2f}")
    else:
        print(f"  🔹 {key}: {value}")

In [None]:
# Analyze entity patterns
print("📈 Most Common Extracted Entities:\n")

# Medications
all_medications = []
for meds in entity_data['medications_found']:
    if isinstance(meds, list):
        all_medications.extend(meds)
    elif isinstance(meds, str) and meds:
        all_medications.extend(meds.split(', '))

if all_medications:
    med_counts = pd.Series(all_medications).value_counts().head(10)
    print("💊 Top Medications:")
    for med, count in med_counts.items():
        print(f"  • {med}: {count}")

# Conditions
all_conditions = []
for conds in entity_data['conditions_found']:
    if isinstance(conds, list):
        all_conditions.extend(conds)
    elif isinstance(conds, str) and conds:
        all_conditions.extend(conds.split(', '))

if all_conditions:
    cond_counts = pd.Series(all_conditions).value_counts().head(10)
    print("\n🏥 Top Conditions:")
    for cond, count in cond_counts.items():
        print(f"  • {cond}: {count}")

In [None]:
# Visualize entity recognition results
fig = make_subplots(
    rows=2, cols=2,
    subplot_titles=('Records with Entities by Type', 'Average Token Count by Note Type',
                   'Entity Coverage', 'Total Entities per Record'),
    specs=[[{'type': 'bar'}, {'type': 'bar'}],
           [{'type': 'bar'}, {'type': 'bar'}]]
)

# Entity type distribution
entity_counts = {
    'Medications': entity_stats.get('records_with_medications', 0),
    'Conditions': entity_stats.get('records_with_conditions', 0),
    'Vital Signs': entity_stats.get('records_with_vital_signs', 0),
    'Procedures': entity_stats.get('records_with_procedures', 0)
}

fig.add_trace(
    go.Bar(x=list(entity_counts.keys()), y=list(entity_counts.values()),
           name='Entity Types', marker_color='lightblue'),
    row=1, col=1
)

# Average token count by note type
if 'note_type' in processed_data.columns:
    entity_by_type = processed_data.groupby('note_type')['token_count'].mean()
    fig.add_trace(
        go.Bar(x=entity_by_type.index, y=entity_by_type.values,
               name='Avg Tokens', marker_color='lightcoral'),
        row=1, col=2
    )

# Entity coverage
coverage = {
    'Has Medications': (entity_data['medications_found'].str.len() > 0).sum(),
    'Has Conditions': (entity_data['conditions_found'].str.len() > 0).sum(),
    'Has Vital Signs': (entity_data['vital_signs_found'].str.len() > 0).sum(),
    'Has Procedures': (entity_data['procedures_found'].str.len() > 0).sum()
}

fig.add_trace(
    go.Bar(x=list(coverage.keys()), y=list(coverage.values()),
           name='Coverage', marker_color='lightgreen'),
    row=2, col=1
)

# Total entities per record
total_entities = entity_data['total_entities']
fig.add_trace(
    go.Histogram(x=total_entities, name='Entity Distribution', 
                marker_color='gold', nbinsx=20),
    row=2, col=2
)

fig.update_layout(height=600, title_text="Medical Entity Recognition Analysis", 
                 showlegend=False)
fig.show()

## 5. Text Classification Models

Train and evaluate classification models for:
- Medical document type classification
- Patient feedback sentiment analysis
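
The results printed below include a cross-validation mean and standard deviation for each model. How the pipeline computes them is not shown here, so the following is a library-free sketch of k-fold cross-validation under stated assumptions. The word-count classifier is a deliberately simple stand-in for a real model, the folds are neither shuffled nor stratified, and all function names are hypothetical.

```python
import statistics
from collections import Counter, defaultdict


def train_word_counts(texts, labels):
    """Count word occurrences per label (a toy stand-in for a trained model)."""
    counts = defaultdict(Counter)
    for text, label in zip(texts, labels):
        counts[label].update(text.lower().split())
    return counts


def predict(counts, text):
    """Pick the label whose training word counts overlap the text most."""
    words = text.lower().split()
    scores = {label: sum(c[w] for w in words) for label, c in counts.items()}
    # Iterating over sorted labels makes tie-breaking deterministic.
    return max(sorted(scores), key=scores.get)


def k_fold_accuracy(texts, labels, k=5):
    """Return one accuracy per fold; requires len(texts) >= k."""
    accuracies = []
    for fold in range(k):
        # Strided folds: every k-th sample, so each sample is tested exactly once.
        test_idx = set(range(fold, len(texts), k))
        train = [(t, l) for i, (t, l) in enumerate(zip(texts, labels)) if i not in test_idx]
        counts = train_word_counts(*zip(*train))
        hits = sum(predict(counts, texts[i]) == labels[i] for i in test_idx)
        accuracies.append(hits / len(test_idx))
    return accuracies


# Toy data: two easily separable classes, alternating.
texts = ["patient presents with chest pain", "staff was caring and helpful"] * 5
labels = ["clinical_note", "patient_feedback"] * 5
scores = k_fold_accuracy(texts, labels, k=5)
metrics = {"cv_mean": statistics.mean(scores), "cv_std": statistics.pstdev(scores)}
print(f"CV Mean ± Std: {metrics['cv_mean']:.4f} ± {metrics['cv_std']:.4f}")
```

A small standard deviation across folds suggests the accuracy does not depend heavily on which records land in the test split. On synthetic, template-generated text, near-perfect scores are expected and should not be read as evidence of real-world performance.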

In [None]:
# Train classification models
print("🤖 Training classification models...")

classification_results = pipeline.train_classification_models(processed_data)

print("✅ Model training completed!")
print("\n📊 Classification Results:")

for task, results in classification_results.items():
    print(f"\n🔹 {task.upper()}:")
    for model_name, metrics in results.items():
        print(f"  • {model_name}:")
        print(f"    - Test Accuracy: {metrics['test_accuracy']:.4f}")
        print(f"    - CV Mean ± Std: {metrics['cv_mean']:.4f} ± {metrics['cv_std']:.4f}")

In [None]:
# Visualize classification performance
fig, axes = plt.subplots(1, 2, figsize=(15, 6))
fig.suptitle('Classification Model Performance', fontsize=16, fontweight='bold')

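# Bar width shared by both panels; defined up front so the sentiment panel
# still works when 'note_classification' is missing from the results
width = 0.35

# Note type classification results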
if 'note_classification' in classification_results:
    note_results = classification_results['note_classification']
    models = list(note_results.keys())
    accuracies = [note_results[model]['test_accuracy'] for model in models]
    cv_means = [note_results[model]['cv_mean'] for model in models]
    cv_stds = [note_results[model]['cv_std'] for model in models]
    
    x = np.arange(len(models))
    width = 0.35
    
    axes[0].bar(x - width/2, accuracies, width, label='Test Accuracy', alpha=0.8, color='steelblue')
    axes[0].errorbar(x + width/2, cv_means, yerr=cv_stds, fmt='o', 
                    label='CV Mean ± Std', color='darkred', capsize=5)
    
    axes[0].set_xlabel('Models')
    axes[0].set_ylabel('Accuracy')
    axes[0].set_title('Note Type Classification')
    axes[0].set_xticks(x)
    axes[0].set_xticklabels(models, rotation=45)
    axes[0].legend()
    axes[0].grid(True, alpha=0.3)

# Sentiment analysis results
if 'sentiment_analysis' in classification_results:
    sentiment_results = classification_results['sentiment_analysis']
    models = list(sentiment_results.keys())
    accuracies = [sentiment_results[model]['test_accuracy'] for model in models]
    cv_means = [sentiment_results[model]['cv_mean'] for model in models]
    cv_stds = [sentiment_results[model]['cv_std'] for model in models]
    
    x = np.arange(len(models))
    
    axes[1].bar(x - width/2, accuracies, width, label='Test Accuracy', alpha=0.8, color='forestgreen')
    axes[1].errorbar(x + width/2, cv_means, yerr=cv_stds, fmt='o', 
                    label='CV Mean ± Std', color='darkred', capsize=5)
    
    axes[1].set_xlabel('Models')
    axes[1].set_ylabel('Accuracy')
    axes[1].set_title('Sentiment Analysis')
    axes[1].set_xticks(x)
    axes[1].set_xticklabels(models, rotation=45)
    axes[1].legend()
    axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

## 6. Single Text Analysis Demo

Demonstrate real-time analysis of individual medical texts showing the complete pipeline in action.
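
The demo cell groups entities by category and maps each confidence score to a colored marker (above 0.8, above 0.5, otherwise low). Those two steps can be factored into small helpers. The sketch below uses the same thresholds as the cell; the helper names are hypothetical.

```python
from collections import defaultdict


def confidence_marker(score):
    """Map a confidence score to a marker using the demo cell's thresholds."""
    if score > 0.8:
        return "🟢"
    if score > 0.5:
        return "🟡"
    return "🟠"


def group_by_category(entities):
    """Group entity dicts by their 'category' key, preserving input order."""
    groups = defaultdict(list)
    for entity in entities:
        groups[entity["category"]].append(entity)
    return dict(groups)
```

Both comparisons are strict, so a score of exactly 0.8 gets the middle marker and exactly 0.5 gets the lowest one.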

In [None]:
# Demo single text analysis
demo_texts = [
    "Patient presents with acute chest pain radiating to left arm. BP 160/95 mmHg, HR 92 bpm. EKG shows ST elevation. Started on aspirin 325mg, nitroglycerin sublingual. Cardiology consulted for urgent catheterization.",
    
    "I am extremely satisfied with the care I received during my hospital stay. The nursing staff was incredibly professional and caring. Dr. Smith took the time to explain my condition thoroughly and answered all my questions patiently. The facility was clean and modern. I would definitely recommend this hospital to others.",
    
    "DISCHARGE SUMMARY: 68-year-old female admitted with acute exacerbation of COPD. Treated with bronchodilators, steroids, and antibiotics. Oxygen saturation improved from 88% to 94% on room air. Patient stable for discharge home with home oxygen therapy and pulmonary rehabilitation referral."
]

print("🔍 Single Text Analysis Demonstrations\n")

for i, text in enumerate(demo_texts, 1):
    print(f"{'='*80}")
    print(f"📄 DEMO TEXT {i}")
    print(f"{'='*80}")
    print(f"Original Text: {text[:100]}...")
    print()
    
    # Analyze the text
    result = pipeline.analyze_single_text(text)
    
    # Display preprocessing results
    print("🔄 PREPROCESSING:")
    print(f"  Cleaned: {result['cleaned_text'][:80]}...")
    print(f"  Tokens: {len(result['tokens'])} tokens")
    print()
    
    # Display entities
    print(f"🏥 ENTITIES FOUND ({len(result['entities'])}):") 
    entity_groups = {}
    for entity in result['entities']:
        category = entity['category']
        if category not in entity_groups:
            entity_groups[category] = []
        entity_groups[category].append(entity)
    
    for category, entities in entity_groups.items():
        print(f"  📋 {category.title()}:")
        for entity in entities[:3]:  # Show top 3 per category
            conf_emoji = "🟢" if entity['confidence'] > 0.8 else "🟡" if entity['confidence'] > 0.5 else "🟠"
            print(f"    {conf_emoji} {entity['text']} (confidence: {entity['confidence']:.2f})")
        if len(entities) > 3:
            print(f"    ... and {len(entities) - 3} more")
    print()
    
    # Display predictions
    if result['predictions']:
        print("🤖 PREDICTIONS:")
        for pred_type, pred_data in result['predictions'].items():
            conf_emoji = "🟢" if pred_data['confidence'] > 0.8 else "🟡" if pred_data['confidence'] > 0.5 else "🟠"
            print(f"  {conf_emoji} {pred_type.title()}: {pred_data['prediction']} (confidence: {pred_data['confidence']:.3f})")
            
            if 'all_probabilities' in pred_data:
                print(f"    All probabilities: {', '.join([f'{k}: {v:.3f}' for k, v in pred_data['all_probabilities'].items()])}")
    
    print("\n")

## 7. Comprehensive Analysis Summary

Generate a complete analysis report with insights and visualizations.

In [None]:
# Run complete analysis
print("📊 Running comprehensive analysis...")

complete_results = pipeline.analyze_complete_dataset(medical_data)

# Generate and display report
report = pipeline.generate_report()
print("\n" + "="*80)
print("📋 COMPREHENSIVE ANALYSIS REPORT")
print("="*80)
print(report)

In [None]:
# Create final dashboard visualization
fig = make_subplots(
    rows=3, cols=2,
    subplot_titles=('Note Type Distribution', 'Sentiment Distribution',
                   'Entity Coverage by Note Type', 'Text Length Analysis',
                   'Model Performance Comparison', 'Processing Pipeline Flow'),
    specs=[[{'type': 'pie'}, {'type': 'pie'}],
           [{'type': 'bar'}, {'type': 'histogram'}],
           [{'type': 'bar'}, {'type': 'bar'}]]
)

# Note type distribution
note_dist = medical_data['note_type'].value_counts()
fig.add_trace(
    go.Pie(labels=note_dist.index, values=note_dist.values, name="Note Types"),
    row=1, col=1
)

# Sentiment distribution (for feedback only)
feedback_data = medical_data[medical_data['note_type'] == 'patient_feedback']
if 'sentiment' in feedback_data.columns and len(feedback_data) > 0:
    sentiment_dist = feedback_data['sentiment'].value_counts()
    fig.add_trace(
        go.Pie(labels=sentiment_dist.index, values=sentiment_dist.values, name="Sentiment"),
        row=1, col=2
    )

# Entity coverage by note type
coverage_cols = ['medications_extracted', 'conditions_extracted', 'procedures_extracted']
# Only plot columns that are actually present, to avoid a KeyError
coverage_cols = [col for col in coverage_cols if col in processed_data.columns]
if 'note_type' in processed_data.columns and coverage_cols:
    # Count, per note type, the records with at least one extracted entity
    entity_coverage = processed_data.groupby('note_type')[coverage_cols].agg(
        lambda x: (x.str.len() > 0).sum()
    )

    for i, col in enumerate(coverage_cols):
        fig.add_trace(
            go.Bar(x=entity_coverage.index, y=entity_coverage[col],
                  name=col.replace('_extracted', '').title(),
                  visible=True if i == 0 else 'legendonly'),
            row=2, col=1
        )

# Text length analysis
fig.add_trace(
    go.Histogram(x=medical_data['text'].str.len(), name='Text Length', nbinsx=30),
    row=2, col=2
)

# Model performance comparison
if 'note_classification' in classification_results:
    models = list(classification_results['note_classification'].keys())
    accuracies = [classification_results['note_classification'][m]['test_accuracy'] for m in models]
    
    fig.add_trace(
        go.Bar(x=models, y=accuracies, name='Test Accuracy'),
        row=3, col=1
    )

# Pipeline steps
pipeline_steps = ['Raw Text', 'Preprocessing', 'Entity Recognition', 'Classification', 'Results']
step_counts = [len(medical_data), len(processed_data), len(entity_data), 
              len(processed_data), len(complete_results['processed_data'])]

fig.add_trace(
    go.Bar(x=pipeline_steps, y=step_counts, name='Records Processed'),
    row=3, col=2
)

fig.update_layout(height=900, title_text="Medical NLP Pipeline - Complete Analysis Dashboard", 
                 showlegend=True)
fig.show()

## 8. Azure Integration Demo

Demonstrate how to integrate with Azure Text Analytics for Health (requires Azure credentials).
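
The cell below goes through the pipeline's own wrapper. For reference, here is a sketch of calling the service directly with the `azure-ai-textanalytics` SDK, assuming that package is installed and the same two environment variables are set. The helper names `build_health_client` and `extract_health_entities` are hypothetical, and the output dictionary simply mirrors the keys the notebook prints.

```python
import os


def build_health_client():
    """Return a TextAnalyticsClient, or None when credentials are missing."""
    endpoint = os.getenv("AZURE_TEXT_ANALYTICS_ENDPOINT")
    key = os.getenv("AZURE_TEXT_ANALYTICS_KEY")
    if not (endpoint and key):
        return None
    # Imported lazily so the notebook still runs without the SDK installed.
    from azure.ai.textanalytics import TextAnalyticsClient
    from azure.core.credentials import AzureKeyCredential
    return TextAnalyticsClient(endpoint=endpoint, credential=AzureKeyCredential(key))


def extract_health_entities(client, text):
    """Run healthcare entity analysis on one text and flatten the result."""
    # begin_analyze_healthcare_entities starts a long-running operation;
    # result() blocks until it finishes and yields one result per document.
    poller = client.begin_analyze_healthcare_entities([text])
    entities = []
    for doc in poller.result():
        if doc.is_error:
            continue  # skip documents the service could not process
        for ent in doc.entities:
            entities.append({
                "text": ent.text,
                "category": ent.category,
                "subcategory": ent.subcategory,
                "confidence": ent.confidence_score,
                "normalized_text": ent.normalized_text,
            })
    return entities
```

When `build_health_client()` returns `None`, the caller can fall back to local models, which is the same decision the cell below makes from the environment variables.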

In [None]:
# Azure integration demo
print("☁️ Azure Text Analytics Integration Demo")
print("="*50)

# Check for Azure credentials
import os
azure_endpoint = os.getenv('AZURE_TEXT_ANALYTICS_ENDPOINT')
azure_key = os.getenv('AZURE_TEXT_ANALYTICS_KEY')

if azure_endpoint and azure_key:
    print("✅ Azure credentials found!")
    print(f"🔗 Endpoint: {azure_endpoint[:50]}...")
    
    # Enable Azure in pipeline
    pipeline.config['use_azure'] = True
    pipeline.config['azure_endpoint'] = azure_endpoint
    pipeline.config['azure_key'] = azure_key
    
    # Test with a sample text
    test_text = "Patient prescribed lisinopril 10mg daily for hypertension. Blood pressure 140/90. Follow up in 2 weeks."
    
    print(f"\n🔍 Analyzing with Azure: {test_text}")
    
    try:
        azure_result = pipeline.analyze_single_text(test_text)
        
        print("\n☁️ Azure Analysis Results:")
        for entity in azure_result['entities']:
            print(f"  • {entity['text']} ({entity['category']}) [confidence: {entity['confidence']:.2f}]")
            if entity.get('subcategory'):
                print(f"    Subcategory: {entity['subcategory']}")
            if entity.get('normalized_text'):
                print(f"    Normalized: {entity['normalized_text']}")
    
    except Exception as e:
        print(f"❌ Azure analysis failed: {str(e)}")
        print("💡 This might be due to network issues or Azure service limits.")
        
else:
    print("⚠️ Azure credentials not found in environment variables.")
    print("\n💡 To enable Azure integration:")
    print("   1. Set AZURE_TEXT_ANALYTICS_ENDPOINT environment variable")
    print("   2. Set AZURE_TEXT_ANALYTICS_KEY environment variable")
    print("   3. Restart the notebook kernel so the new variables are picked up")
    print("\n🔧 Using local models instead...")
    
    # Demonstrate local analysis
    test_text = "Patient prescribed lisinopril 10mg daily for hypertension. Blood pressure 140/90. Follow up in 2 weeks."
    local_result = pipeline.analyze_single_text(test_text)
    
    print("\n🏠 Local Analysis Results:")
    for entity in local_result['entities']:
        print(f"  • {entity['text']} ({entity['category']}) [confidence: {entity['confidence']:.2f}]")

## 9. Production Readiness & Next Steps

Summary of capabilities and recommendations for production deployment.

In [None]:
# Save models and results
print("💾 Saving models and results...")

try:
    # Save trained models
    pipeline.save_models('models')
    print("✅ Models saved to 'models/' directory")
    
    # Save analysis results
    pipeline.save_results('outputs')
    print("✅ Results saved to 'outputs/' directory")
    
    print("\n📁 Generated Files:")
    import os
    
    if os.path.exists('models'):
        model_files = os.listdir('models')
        print("  🤖 Models:")
        for file in model_files:
            print(f"    • {file}")
    
    if os.path.exists('outputs'):
        output_files = os.listdir('outputs')
        print("  📊 Outputs:")
        for file in output_files:
            print(f"    • {file}")
            
except Exception as e:
    print(f"⚠️ Error saving files: {str(e)}")

In [None]:
# Production readiness summary
print("🚀 PRODUCTION READINESS SUMMARY")
print("="*50)

capabilities = [
    "✅ Medical text preprocessing with domain-specific normalization",
    "✅ Multi-class document classification (clinical notes, discharge summaries, etc.)",
    "✅ Medical entity recognition (medications, conditions, procedures, vitals)",
    "✅ Sentiment analysis for patient feedback",
    "✅ Azure Text Analytics for Health integration with local fallbacks",
    "✅ Configurable pipeline with YAML/JSON configuration",
    "✅ Model persistence and loading capabilities",
    "✅ Comprehensive analysis reporting",
    "✅ Synthetic data generation that uses no real patient information",
    "✅ Interactive visualization and analysis"
]

print("\n🎯 Key Capabilities:")
for capability in capabilities:
    print(f"  {capability}")

print("\n🏥 Healthcare Use Cases:")
use_cases = [
    "📋 Clinical documentation analysis and quality assurance",
    "💬 Patient feedback sentiment monitoring and analysis",
    "🏷️ Automated medical record classification and routing",
    "🔍 Medical entity extraction for research and analytics",
    "📊 Healthcare quality improvement initiatives",
    "⚡ Real-time clinical decision support systems",
    "📈 Population health analytics and insights",
    "🛡️ Regulatory compliance and audit support"
]

for use_case in use_cases:
    print(f"  {use_case}")

print("\n🔧 Deployment Recommendations:")
recommendations = [
    "🌐 Deploy as REST API using FastAPI or Flask",
    "🐳 Containerize with Docker for scalable deployment",
    "☁️ Use Azure Container Instances or Kubernetes for cloud deployment",
    "🔒 Implement proper authentication and authorization",
    "📊 Add monitoring and logging for production operations",
    "🔄 Set up CI/CD pipeline for model updates",
    "🧪 Implement A/B testing for model improvements",
    "📚 Create comprehensive API documentation"
]

for rec in recommendations:
    print(f"  {rec}")

print("\n" + "="*50)
print("🎉 Medical NLP Pipeline Analysis Complete!")
print("="*50)