# Fighting Words Analysis: Racial Disparities in Discharge Instructions

This notebook performs Fighting Words analysis to identify statistically significant differences in word usage across racial groups in hospital discharge instructions.

## Analysis Steps:
1. Load and preprocess discharge instruction data
2. Apply Fighting Words algorithm (Monroe et al., 2008)
3. **Apply FDR correction** for multiple comparisons
4. Visualize and interpret results

## Key Innovation:
We add Benjamini-Hochberg False Discovery Rate (FDR) correction to account for testing thousands of words simultaneously.

## 1. Setup and Imports

In [None]:
import pandas as pd
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
import string
from tqdm import tqdm
import convokit
from convokit import Corpus, FightingWords

# Import our custom modules
import sys
sys.path.insert(0, '..')
from src.data_loader import load_for_analysis
from statistical_analysis import fighting_words_with_correction, report_statistics

# Download NLTK data
nltk.download('punkt', quiet=True)
nltk.download('punkt_tab', quiet=True)
nltk.download('stopwords', quiet=True)
nltk.download('wordnet', quiet=True)
nltk.download('omw-1.4', quiet=True)

tqdm.pandas()

## 2. Load Data

We use the custom data loader which:
- Loads from `data/` directory
- Standardizes race categories
- Provides reproducible sampling

In [None]:
# Load discharge instructions.
# sample_size=None uses every record in the file; pass an integer for a quick subsample.
df = load_for_analysis(
    filepath='../data/merged_file_sample=100k_section=dischargeinstructions.csv',
    sample_size=None,  # use all 100k records
    random_state=42    # seed for reproducible sampling
)

print(f"Loaded {len(df)} discharge instructions")
df.head()

## 3. Text Preprocessing Functions

In [None]:
# Build these once instead of on every call to clean_text
STOP_WORDS = set(stopwords.words('english'))
PUNCT_TABLE = str.maketrans('', '', string.punctuation)
STEMMER = PorterStemmer()


def clean_text(text, stemm=False):
    """
    Tokenize and clean text for Fighting Words analysis.
    
    Steps:
    1. Tokenize into words
    2. Lowercase
    3. Remove punctuation
    4. Remove stopwords
    5. Remove tokens with numbers
    6. Optionally apply stemming
    """
    tokens = [token.lower() for token in nltk.word_tokenize(text)]
    tokens = [token.translate(PUNCT_TABLE) for token in tokens]
    
    tokens = [token for token in tokens if token not in STOP_WORDS]
    tokens = [token for token in tokens if not any(char.isdigit() for char in token)]
    
    if stemm:
        tokens = [STEMMER.stem(token) for token in tokens]
    
    # Drop tokens left empty after punctuation removal
    return [token for token in tokens if token]


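def sep_instruct(text):
    """
    Extract the discharge instructions section from a full clinical note
    and return it as a cleaned, space-joined string of tokens.

    Returns "" when the note has no "Discharge Instructions" heading
    followed later by a "Followup Instructions" heading.
    """
    # The inline (?smi) flags already enable DOTALL, MULTILINE, and IGNORECASE,
    # so no extra flags argument is needed.
    pattern = r"(?smi)^\s*Discharge Instructions(?::)?\n(.*?)^\s*Followup Instructions"
    match = re.search(pattern, text)
    if match:
        return " ".join(clean_text(match.group(1).strip()))
    return ""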


def get_corpus(sample_df):
    """
    Convert DataFrame to ConvoKit corpus for Fighting Words.
    """
    df = sample_df.copy()
    
    df['id'] = df['note_id'].astype(str)
    df['speaker'] = df['subject_id'].astype(str)
    df['conversation_id'] = df.index.astype(str)
    df['reply_to'] = None
    df['timestamp'] = pd.to_datetime(df['charttime'])
    df['meta.race'] = df['race']
    df['meta.gender'] = df['gender']
    
    utterances_df = df[['id', 'timestamp', 'text', 'speaker', 'reply_to', 
                         'conversation_id', 'meta.race', 'meta.gender']]
    
    corpus = Corpus.from_pandas(utterances_df=utterances_df)
    return corpus
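
The section extraction in `sep_instruct` can be checked in isolation. The sketch below applies the same regular expression to a toy note; the note text is invented for illustration and is not taken from the dataset.

```python
import re

# Same pattern as sep_instruct; the inline (?smi) flags enable
# DOTALL, MULTILINE, and IGNORECASE.
PATTERN = r"(?smi)^\s*Discharge Instructions(?::)?\n(.*?)^\s*Followup Instructions"

# Hypothetical note text, invented for illustration only.
toy_note = (
    "Discharge Diagnosis:\n"
    "Pneumonia\n"
    "\n"
    "Discharge Instructions:\n"
    "Please take your antibiotics as prescribed.\n"
    "Call your doctor if you develop a fever.\n"
    "\n"
    "Followup Instructions:\n"
    "___\n"
)

match = re.search(PATTERN, toy_note)
extracted = match.group(1).strip() if match else ""
print(extracted)

# A note without the closing heading produces no match, so sep_instruct
# would return "" for it.
no_followup = "Discharge Instructions:\nRest and drink fluids.\n"
print(re.search(PATTERN, no_followup) is None)
```

Because the pattern requires both headings, notes that lack a "Followup Instructions" heading yield an empty string and are filtered out during preprocessing.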

## 4. Preprocess Text Data

In [None]:
# Extract discharge instructions and clean
print("Preprocessing text...")
df_clean = df.dropna(subset=['text']).copy()
df_clean['text'] = df_clean['text'].progress_apply(sep_instruct)
df_clean = df_clean[df_clean['text'] != ''].reset_index(drop=True)

print(f"After cleaning: {len(df_clean)} records")
print("Race distribution:")
print(df_clean['race'].value_counts())

## 5. Fighting Words Analysis

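Compare word usage between two racial groups using z-scored log-odds ratios with a Dirichlet prior (Monroe et al., 2008). Positive z-scores mark words more associated with the first group, negative z-scores with the second.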
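
For intuition, here is a minimal pure-Python sketch of the statistic, assuming a symmetric (uninformative) Dirichlet prior. The function name `log_odds_z_scores`, the prior value of 0.1, and the toy token lists are illustrative assumptions, not ConvoKit's internals, and the sketch assumes the combined vocabulary has at least two distinct words.

```python
import math
from collections import Counter


def log_odds_z_scores(tokens1, tokens2, prior=0.1):
    """z-scored log-odds ratio per word, with a symmetric Dirichlet prior."""
    counts1, counts2 = Counter(tokens1), Counter(tokens2)
    vocab = sorted(set(counts1) | set(counts2))
    n1, n2 = sum(counts1.values()), sum(counts2.values())
    a0 = prior * len(vocab)  # total prior mass over the vocabulary

    z = {}
    for word in vocab:
        y1, y2 = counts1[word], counts2[word]
        # Smoothed log-odds of the word within each group
        log_odds1 = math.log((y1 + prior) / (n1 + a0 - y1 - prior))
        log_odds2 = math.log((y2 + prior) / (n2 + a0 - y2 - prior))
        delta = log_odds1 - log_odds2
        # Approximate variance of the log-odds difference
        variance = 1.0 / (y1 + prior) + 1.0 / (y2 + prior)
        z[word] = delta / math.sqrt(variance)
    return z


# Hypothetical toy corpora: "rest" dominates group 1, "medication" group 2.
group1 = ["rest"] * 8 + ["medication"] * 2
group2 = ["rest"] * 2 + ["medication"] * 8
z = log_odds_z_scores(group1, group2)
print(z)
```

Dividing by the estimated standard deviation is what keeps rare words, whose log-odds are noisy, from dominating the ranking.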

In [None]:
# Select racial groups to compare
race1 = "WHITE"
race2 = "BLACK"

print(f"Comparing {race1} vs {race2}")

# Create corpus
corpus = get_corpus(df_clean)

# Run Fighting Words
fw = FightingWords(ngram_range=(1,1))
fw.fit(
    corpus,
    class1_func=lambda utt: race1 in utt.meta['race'],
    class2_func=lambda utt: race2 in utt.meta['race']
)

# Get results
results = fw.summarize(corpus, plot=True, class1_name=race1, class2_name=race2)
print(f"\nTotal words tested: {len(results)}")

## 6. Apply FDR Correction (CRITICAL)

**This is the most important statistical fix.**

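When thousands of words are tested at once, some cross the significance threshold by chance alone:
- At p < 0.05 with 2,557 words tested, about 128 words (2,557 × 0.05) would appear significant even if no word truly differed between the groups.
- Benjamini-Hochberg FDR correction keeps the expected share of false positives among the words flagged as significant at or below `alpha` (5% here).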

This addresses the main methodological critique from the code review.
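
The procedure itself is short. The sketch below is a standalone illustration, assuming z-scores are converted to two-sided p-values under a standard normal; whether `fighting_words_with_correction` does exactly this is an assumption. The helper names and the example z-scores are hypothetical.

```python
import math


def two_sided_p(z):
    """Two-sided p-value for a z-score under a standard normal."""
    return math.erfc(abs(z) / math.sqrt(2.0))


def benjamini_hochberg(p_values, alpha=0.05):
    """Return one boolean per p-value: True where the null is rejected."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k such that p_(k) <= (k / m) * alpha
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            cutoff_rank = rank
    # Reject every hypothesis ranked at or below the cutoff
    rejected = [False] * m
    for idx in order[:cutoff_rank]:
        rejected[idx] = True
    return rejected


# Chance hits expected at p < 0.05 if no word truly differed
expected_chance_hits = round(2557 * 0.05)
print(expected_chance_hits)

# Hypothetical z-scores for five words
z_scores = [4.2, -3.5, 2.1, 1.0, -0.3]
p_values = [two_sided_p(z) for z in z_scores]
uncorrected = [p < 0.05 for p in p_values]
corrected = benjamini_hochberg(p_values, alpha=0.05)
print(uncorrected)
print(corrected)
```

In this toy example the word with z = 2.1 passes the uncorrected p < 0.05 threshold but is not flagged after correction, which is the kind of borderline result FDR control is meant to filter.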

In [None]:
# Apply FDR correction
print("Applying Benjamini-Hochberg FDR correction...")
results_corrected = fighting_words_with_correction(
    results,
    z_col='z-score',
    alpha=0.05,
    method='fdr_bh'
)

# Generate statistical report
stats = report_statistics(
    results_corrected,
    comparison_name=f"{race1} vs {race2}",
    class1_name=race1,
    class2_name=race2
)

print("\n" + "="*70)
print("RESULTS SUMMARY")
print("="*70)
print(stats)

## 7. Examine Significant Results

In [None]:
# Top words after FDR correction
significant = results_corrected[results_corrected['significant_fdr']].copy()

print(f"Significant words after FDR correction: {len(significant)}")
print(f"\nTop 10 words associated with {race1}:")
print(significant[significant['z-score'] > 0].sort_values('z-score', ascending=False).head(10))

print(f"\nTop 10 words associated with {race2}:")
print(significant[significant['z-score'] < 0].sort_values('z-score').head(10))

## 8. Save Results

In [None]:
# Save corrected results
import os
os.makedirs('../results/fighting_words', exist_ok=True)

output_file = f'../results/fighting_words/{race1}_vs_{race2}_FDR_corrected.csv'
# Write the index as well: ConvoKit's summarize output stores the n-grams
# in the DataFrame index, so index=False would drop the words themselves.
results_corrected.to_csv(output_file)
print(f"Results saved to {output_file}")

# Save only significant words
significant_file = f'../results/fighting_words/{race1}_vs_{race2}_significant_only.csv'
significant.to_csv(significant_file)
print(f"Significant words saved to {significant_file}")

## 9. Interpretation

### Key Findings:
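- **Before FDR correction:** Every word with p < 0.05 counts as "significant", including those expected by chance alone
- **After FDR correction:** The expected share of false positives among the remaining words is at most 5%; any individual word can still be a false positive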
- **Effect sizes:** Check the `effect_magnitude` column for practical significance

### Caveats:
1. **Correlation ≠ Causation:** Word differences may reflect:
   - Actual bias in clinical communication
   - Differences in disease prevalence across racial groups
   - Socioeconomic factors
   
2. **Single hospital:** Results from Beth Israel Deaconess may not generalize

3. **Binary comparisons:** Intersectional effects (race × gender) not captured

