# Sentiment Analysis of Discharge Instructions

This notebook performs sentiment analysis on hospital discharge instructions to identify potential differences in tone and sentiment across racial groups.

## Analysis Overview:
1. Load discharge instruction data
2. Apply sentiment analysis using a pre-trained model
3. Compare sentiment distributions across demographic groups
4. Assess statistical significance of differences

## Model Used:
- **DistilBERT** fine-tuned on SST-2 (Stanford Sentiment Treebank)
- Binary classification: Positive/Negative sentiment
- Confidence scores for each prediction

## Key Question:
Do discharge instructions show systematic differences in sentiment/tone across racial groups?

## 1. Setup and Imports

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from tqdm import tqdm

# Sentiment analysis
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline

# Custom data loader
import sys
sys.path.insert(0, '..')
from src.data_loader import load_for_analysis

# Set display options
pd.set_option('display.max_columns', None)
tqdm.pandas()

print("✓ Imports complete")

## 2. Load Data

We'll load discharge instructions using our standardized data loader.

In [None]:
# Load discharge instructions
# Note: Sentiment analysis is computationally expensive
# Start with a sample for testing
df = load_for_analysis(
    filepath='../data/merged_file_sample=100k_section=dischargeinstructions.csv',
    sample_size=1000,  # Start small for testing
    random_state=42
)

print(f"Loaded {len(df)} discharge instructions")
print(f"\nRace distribution:")
print(df['race_simplified'].value_counts())
df.head()

## 3. Load Sentiment Analysis Model

We use **DistilBERT** fine-tuned on SST-2 (Stanford Sentiment Treebank):
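- Smaller and faster than full BERT (about 40% smaller and 60% faster, as reported by the DistilBERT authors)
- Retains about 97% of BERT's language understanding performance on the GLUE benchmark
- Fine-tuned for binary sentiment classification on SST-2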

In [None]:
# Load pre-trained sentiment analysis model
model_name = "distilbert-base-uncased-finetuned-sst-2-english"

print(f"Loading model: {model_name}")
sentiment_pipeline = pipeline(
    "sentiment-analysis",
    model=model_name,
    tokenizer=model_name,
    device=-1  # Use CPU (-1) or GPU (0)
)

print("✓ Model loaded successfully")

# Test the model
test_text = "You are recovering well and ready to go home."
result = sentiment_pipeline(test_text)[0]
print(f"\nTest: '{test_text}'")
print(f"Sentiment: {result['label']}, Score: {result['score']:.3f}")

## 4. Run Sentiment Analysis

**Note:** This is computationally intensive. Processing time depends on:
- Number of texts
- Text length
- Hardware (CPU vs GPU)

Approximate timing:
- 1,000 texts on CPU: ~10 minutes
- 10,000 texts on CPU: ~100 minutes
- With GPU: 5-10x faster

In [None]:
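def analyze_sentiment(text, max_length=512):
    """
    Analyze sentiment of a text.
    
    Args:
        text: Input text
        max_length: Maximum number of tokens passed to the model
            (DistilBERT's limit is 512)
    
    Returns:
        dict with label and score. Empty or non-string input returns a
        'NEUTRAL' placeholder and failures return 'ERROR'; neither is a
        model prediction, so exclude both before computing statistics.
    """
    if not isinstance(text, str) or len(text.strip()) == 0:
        return {'label': 'NEUTRAL', 'score': 0.5}
    
    try:
        # Let the tokenizer truncate to max_length tokens; slicing by
        # character count only approximates the token limit and can
        # still overflow on abbreviation-heavy clinical text
        result = sentiment_pipeline(text, truncation=True, max_length=max_length)[0]
        return result
    except Exception as e:
        print(f"Error analyzing text: {e}")
        return {'label': 'ERROR', 'score': 0.0}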


# Run sentiment analysis
print("Running sentiment analysis...")
print("This may take several minutes...\n")

# Apply to all texts
sentiments = []
for text in tqdm(df['text'], desc="Analyzing sentiment"):
    result = analyze_sentiment(text)
    sentiments.append(result)

# Add results to dataframe
df['sentiment_label'] = [s['label'] for s in sentiments]
df['sentiment_score'] = [s['score'] for s in sentiments]

print(f"\n✓ Analyzed {len(df)} texts")
print(f"\nSentiment distribution:")
print(df['sentiment_label'].value_counts())
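
Truncation discards everything past the token limit, which for long discharge instructions may be most of the document. One possible alternative, sketched below, is to split each text into overlapping windows, score every window, and average the results. This is an illustration under stated assumptions, not the method used in this notebook: `split_into_chunks`, `score_long_text`, and `toy_scorer` are hypothetical names, windows are measured in words as a rough proxy for tokens, and the scorer is a stand-in for the real model.

```python
from typing import Callable, Dict, List, Optional


def split_into_chunks(text: str, max_words: int = 300, overlap: int = 50) -> List[str]:
    """Split text into overlapping word windows (a rough proxy for token windows)."""
    if overlap >= max_words:
        raise ValueError("overlap must be smaller than max_words")
    words = text.split()
    if not words:
        return []
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # the last window already reaches the end of the text
    return chunks


def score_long_text(
    text: str,
    score_fn: Callable[[str], Dict],
    max_words: int = 300,
    overlap: int = 50,
) -> Optional[float]:
    """Length-weighted mean of P(positive) over all chunks; None for empty text."""
    chunks = split_into_chunks(text, max_words, overlap)
    if not chunks:
        return None
    weighted_sum = 0.0
    total_weight = 0
    for chunk in chunks:
        result = score_fn(chunk)
        # Convert (label, confidence) into the probability of the positive class
        if result["label"] == "POSITIVE":
            p_pos = result["score"]
        else:
            p_pos = 1.0 - result["score"]
        weight = len(chunk.split())
        weighted_sum += weight * p_pos
        total_weight += weight
    return weighted_sum / total_weight


def toy_scorer(chunk: str) -> Dict:
    """Stand-in for the real model, used only to make this sketch runnable."""
    if "well" in chunk.lower():
        return {"label": "POSITIVE", "score": 0.9}
    return {"label": "NEGATIVE", "score": 0.8}


demo_text = "You are recovering well. " * 5 + "Return if fever persists. " * 5
print(len(split_into_chunks(demo_text, max_words=10, overlap=2)))
print(score_long_text(demo_text, toy_scorer, max_words=10, overlap=2))
```

In the notebook, `score_fn` could plausibly be something like `lambda t: sentiment_pipeline(t, truncation=True)[0]`. Whether averaging over chunks is appropriate for clinical text is an open question and worth checking against manually reviewed examples.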

## 5. Compare Sentiment Across Racial Groups

Now we examine whether sentiment differs systematically by race.

In [None]:
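# Calculate sentiment statistics by race
# Keep only real model predictions (drop NEUTRAL placeholders and ERROR rows)
scored = df[df['sentiment_label'].isin(['POSITIVE', 'NEGATIVE'])]

# Positive_Pct is a proportion between 0 and 1; the score columns summarize
# the model's confidence in its predicted label, not a signed polarity
sentiment_by_race = scored.groupby('race_simplified').agg(
    Positive_Pct=('sentiment_label', lambda x: (x == 'POSITIVE').mean()),
    Mean_Score=('sentiment_score', 'mean'),
    Std_Score=('sentiment_score', 'std'),
    Count=('sentiment_score', 'count'),
)
sentiment_by_race = sentiment_by_race.sort_values('Positive_Pct', ascending=False)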

print("Sentiment Analysis by Race:")
print(sentiment_by_race)

# Visualize
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Plot 1: Positive percentage
sentiment_by_race['Positive_Pct'].plot(
    kind='bar',
    ax=axes[0],
    color='steelblue'
)
axes[0].set_title('Proportion of Positive Sentiment by Race')
axes[0].set_xlabel('Race')
axes[0].set_ylabel('Proportion Positive')
axes[0].set_ylim(0, 1)
axes[0].grid(axis='y', alpha=0.3)

# Plot 2: Mean sentiment score
sentiment_by_race['Mean_Score'].plot(
    kind='bar',
    ax=axes[1],
    color='coral'
)
axes[1].set_title('Mean Classifier Confidence by Race')
axes[1].set_xlabel('Race')
axes[1].set_ylabel('Mean Confidence Score')
axes[1].set_ylim(0, 1)
axes[1].grid(axis='y', alpha=0.3)

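plt.tight_layout()
# Make sure the output directory exists before saving the figure
import os
os.makedirs('../results', exist_ok=True)
plt.savefig('../results/sentiment_by_race.png', dpi=300, bbox_inches='tight')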
plt.show()
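
Bar heights alone hide how uncertain each group's proportion is, especially for groups with few texts. A minimal sketch of a Wilson score interval for a proportion follows; the function name `wilson_interval` and the group counts are hypothetical and are not results from this notebook.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= successes <= n:
        raise ValueError("successes must be between 0 and n")
    p_hat = successes / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    # Clamp to [0, 1] to guard against floating-point overshoot
    return max(0.0, center - half_width), min(1.0, center + half_width)


# Hypothetical (positive count, total count) pairs, for illustration only
example_counts = {"GROUP_A": (120, 400), "GROUP_B": (9, 40)}
intervals = {g: wilson_interval(k, n) for g, (k, n) in example_counts.items()}

for group, (low, high) in intervals.items():
    k, n = example_counts[group]
    print(f"{group}: {k / n:.3f} (95% CI {low:.3f} to {high:.3f}, n={n})")
```

The smaller group gets a much wider interval, which is the pattern to expect for under-represented groups in a 1,000-text sample. The bounds could be passed to the bar plot as error bars.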

## 6. Statistical Significance Testing

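We use a chi-square test of independence to assess whether the distribution of sentiment labels differs across racial groups, followed by a Mann-Whitney U test as an example pairwise comparison of sentiment scores.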

In [None]:
from scipy.stats import chi2_contingency, mannwhitneyu

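# Create contingency table
# Exclude NEUTRAL placeholders and ERROR rows, which are not model predictions
scored = df[df['sentiment_label'].isin(['POSITIVE', 'NEGATIVE'])]
contingency = pd.crosstab(
    scored['race_simplified'],
    scored['sentiment_label']
)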

print("Contingency Table:")
print(contingency)
print()

# Chi-square test
chi2, p_value, dof, expected = chi2_contingency(contingency)

print(f"Chi-Square Test Results:")
print(f"  χ² = {chi2:.4f}")
print(f"  p-value = {p_value:.6f}")
print(f"  degrees of freedom = {dof}")

if p_value < 0.05:
    print(f"\n✓ Significant difference in sentiment across racial groups (p < 0.05)")
else:
    print(f"\n✗ No significant difference in sentiment across racial groups (p ≥ 0.05)")

# Pairwise comparisons (example: WHITE vs BLACK)
print("\n" + "="*70)
print("Pairwise Comparison: WHITE vs BLACK")
print("="*70)

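# Note: sentiment_score is the model's confidence in its predicted label,
# so this test compares classification confidence rather than signed polarity
white_scores = df[df['race_simplified'] == 'WHITE']['sentiment_score']
black_scores = df[df['race_simplified'] == 'BLACK']['sentiment_score']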

if len(white_scores) > 0 and len(black_scores) > 0:
    stat, p = mannwhitneyu(white_scores, black_scores, alternative='two-sided')
    
    print(f"WHITE: Mean = {white_scores.mean():.3f}, Median = {white_scores.median():.3f}")
    print(f"BLACK: Mean = {black_scores.mean():.3f}, Median = {black_scores.median():.3f}")
    print(f"\nMann-Whitney U test: U = {stat:.1f}, p = {p:.6f}")
    
    if p < 0.05:
        print("✓ Significant difference (p < 0.05)")
    else:
        print("✗ No significant difference (p ≥ 0.05)")
else:
    print("Insufficient data for comparison")
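
The cell above tests a single pair. If every pair of groups is compared, the chance of at least one false positive grows with the number of tests, so the p-values need a multiple-comparison correction. The sketch below applies the Holm-Bonferroni step-down adjustment. The group names and raw p-values are hypothetical placeholders; in practice each raw p-value would come from one pairwise test.

```python
from itertools import combinations


def holm_adjust(p_values):
    """Return Holm-Bonferroni adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The smallest p-value is multiplied by m, the next by m - 1, and so on
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Adjusted values must not decrease as the raw p-values increase
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


groups = ["GROUP_A", "GROUP_B", "GROUP_C"]  # hypothetical group labels
pairs = list(combinations(groups, 2))
raw_p = [0.010, 0.040, 0.030]  # hypothetical p-values, one per pair
adjusted_p = holm_adjust(raw_p)

for (g1, g2), p_raw, p_adj in zip(pairs, raw_p, adjusted_p):
    verdict = "significant" if p_adj < 0.05 else "not significant"
    print(f"{g1} vs {g2}: raw p = {p_raw:.3f}, Holm p = {p_adj:.3f} ({verdict})")
```

Holm's method controls the family-wise error rate and is never less powerful than a plain Bonferroni correction, which makes it a reasonable default for a handful of pairwise comparisons.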

## 7. Save Results

In [None]:
import os
os.makedirs('../results/sentiment_analysis', exist_ok=True)

# Save full results
output_file = '../results/sentiment_analysis/sentiment_results.csv'
df.to_csv(output_file, index=False)
print(f"Saved results to {output_file}")

# Save summary statistics
summary_file = '../results/sentiment_analysis/sentiment_summary_by_race.csv'
sentiment_by_race.to_csv(summary_file)
print(f"Saved summary to {summary_file}")

## 8. Interpretation and Limitations

### What We're Measuring:
- **Sentiment polarity:** Whether discharge instructions use positive vs. negative language
- **Sentiment intensity:** Approximated by the model's confidence in its predicted label (0.5 to 1.0 for a binary classifier), which is not a signed polarity score
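
Because the score is the confidence in whichever label was predicted, a confident NEGATIVE and a confident POSITIVE text both receive values near 1.0. A minimal sketch of converting each prediction into a single probability of the positive class is shown below. It assumes a two-class model whose probabilities sum to 1; `positive_probability` is a hypothetical helper, not part of the notebook.

```python
def positive_probability(label: str, score: float):
    """Map a (label, confidence) pair to P(positive).

    Returns None for anything that is not a model prediction,
    such as the NEUTRAL placeholder or ERROR rows.
    """
    if label == "POSITIVE":
        return score
    if label == "NEGATIVE":
        return 1.0 - score
    return None


# Illustrative inputs in the same shape the pipeline returns
examples = [
    {"label": "POSITIVE", "score": 0.98},
    {"label": "NEGATIVE", "score": 0.98},
    {"label": "NEUTRAL", "score": 0.5},
]
p_pos = [positive_probability(e["label"], e["score"]) for e in examples]

for example, p in zip(examples, p_pos):
    print(example["label"], example["score"], "->", p)
```

On this scale, values near 0 indicate confidently negative text and values near 1 confidently positive text, so group means and rank-based tests compare polarity rather than confidence.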

### Important Caveats:

1. **Medical Context:**
   - General-purpose sentiment models are trained on text such as movie reviews and social media; SST-2, used here, consists of movie review sentences
   - Medical language has different conventions
   - "Negative" sentiment may reflect medical necessity, not bias

2. **Confounding Factors:**
   - Disease severity differs across groups
   - Socioeconomic factors affect health outcomes
   - This analysis cannot separate bias from legitimate medical differences

3. **Model Limitations:**
   - DistilBERT has a 512-token limit (long texts are truncated)
   - Binary classification (positive/negative) is simplistic
   - May miss subtle tonal differences

4. **Statistical Power:**
   - Small sample sizes for some racial groups
   - Effect sizes may be small
   - Need large samples to detect differences
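
With large samples a chi-square test can be significant even when the difference is negligible, so it helps to report an effect size next to the p-value. The sketch below computes Cramér's V from a contingency table. `cramers_v` is a hypothetical helper and the table is made up for illustration. It uses the uncorrected chi-square statistic, so for 2x2 tables it will differ slightly from a value derived from `chi2_contingency`, which applies a continuity correction by default in that case.

```python
import math


def cramers_v(table):
    """Cramér's V for an r x c contingency table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    k = min(len(row_totals), len(col_totals)) - 1
    if k < 1:
        raise ValueError("table needs at least two rows and two columns")
    if n == 0 or min(row_totals) == 0 or min(col_totals) == 0:
        raise ValueError("every row and column needs at least one observation")
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    return math.sqrt(chi2 / (n * k))


# Hypothetical counts: rows are groups, columns are NEGATIVE / POSITIVE
example_table = [
    [300, 100],
    [60, 40],
]
v = cramers_v(example_table)
print(f"Cramér's V = {v:.3f}")
```

V ranges from 0 (no association) to 1 (perfect association). Thresholds for "small" or "large" effects are conventions and depend on the table's dimensions, so they are best treated as rough guides.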

### What This Adds to the Analysis:

- **Complements Fighting Words:** While Fighting Words identifies *which* words differ, sentiment analysis assesses overall *tone*
- **Hypothesis Generation:** Significant differences warrant deeper investigation
- **Policy Relevance:** If systematic differences exist, this suggests a need for communication training

### Next Steps:

1. Run on full dataset (not just sample)
2. Try domain-specific sentiment models (if available)
3. Examine specific examples of high/low sentiment texts
4. Control for disease severity and other clinical factors
5. Qualitative analysis of flagged texts
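
For step 4, one simple way to control for a categorical confounder such as a severity band is to stratify: build a 2x2 table (group by sentiment label) within each stratum and pool them with a Mantel-Haenszel odds ratio. The sketch below is only an illustration of that idea. The function name and all counts are hypothetical, and a regression model would be needed to handle several confounders at once.

```python
def mantel_haenszel_odds_ratio(strata):
    """Pooled odds ratio across 2x2 tables.

    Each stratum is a tuple (a, b, c, d):
        a = group 1 positive, b = group 1 negative,
        c = group 2 positive, d = group 2 negative.
    """
    numerator = 0.0
    denominator = 0.0
    for a, b, c, d in strata:
        n = a + b + c + d
        if n == 0:
            continue  # empty strata carry no information
        numerator += a * d / n
        denominator += b * c / n
    if denominator == 0:
        raise ValueError("odds ratio undefined for these counts")
    return numerator / denominator


# Hypothetical counts for two severity strata (illustration only)
strata = [
    (40, 10, 20, 5),   # low severity: both groups mostly positive
    (5, 20, 10, 40),   # high severity: both groups mostly negative
]

# Crude odds ratio ignores severity by summing the strata cell by cell
a, b, c, d = (sum(cells) for cells in zip(*strata))
crude_or = (a * d) / (b * c)
pooled_or = mantel_haenszel_odds_ratio(strata)

print(f"Crude odds ratio:  {crude_or:.2f}")
print(f"Pooled odds ratio: {pooled_or:.2f}")
```

In this made-up example the crude odds ratio suggests a difference between groups, while the pooled estimate shows none once severity is held fixed, because the groups differ in how they are distributed across strata.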