# PCA Visualization of Word Embeddings by Race

This notebook trains a separate Word2Vec model on the discharge instructions of each racial group, then uses PCA to plot the embeddings of each group's 100 most common words in two dimensions.

## Analysis Steps:
1. Load discharge instruction data
2. Train Word2Vec models for each racial group
3. Extract embeddings for top 100 words
4. Reduce dimensionality with PCA
5. Visualize word distributions

## 1. Setup and Imports

In [None]:
import pandas as pd
import re
import nltk
from nltk.corpus import stopwords
import string
from gensim.models import Word2Vec
from collections import Counter
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

# Import custom data loader
import sys
sys.path.insert(0, '..')
from src.data_loader import load_for_analysis

# Download NLTK data
nltk.download('punkt', quiet=True)
nltk.download('punkt_tab', quiet=True)
nltk.download('stopwords', quiet=True)

## 2. Load Data

In [None]:
# Load discharge instructions
df = load_for_analysis(
    filepath='../data/merged_file_sample=100k_section=dischargeinstructions.csv',
    sample_size=None,
    random_state=42
)

print(f"Loaded {len(df)} records")
df.head()

## 3. Text Preprocessing

In [None]:
# Build the stopword set and punctuation table once, not on every call
STOP_WORDS = set(stopwords.words('english'))
PUNCT_TABLE = str.maketrans('', '', string.punctuation)


def clean_text(text):
    """
    Tokenize and clean text.
    
    FIXED: Previous version had lambda bug that overwrote tokens variable.
    """
    # Tokenize
    tokens = nltk.word_tokenize(text)
    
    # Convert to lowercase and remove \x95 character
    tokens = [token.lower().replace('\x95', '') for token in tokens]
    
    # Remove punctuation
    tokens = [token.translate(PUNCT_TABLE) for token in tokens]
    
    # Remove stopwords, tokens with numbers, and empty tokens
    tokens = [
        token for token in tokens
        if token
        and token not in STOP_WORDS
        and not any(char.isdigit() for char in token)
    ]
    
    return tokens


# Test the function
sample_text = df['text'].iloc[0]
print("Sample tokens:", clean_text(sample_text)[:30])

## 4. Train Word2Vec Models by Race

We train separate Word2Vec models for each racial group to compare word usage patterns.

In [None]:
import os
os.makedirs('../results/PCA', exist_ok=True)

# Skip missing race values: filtering on NaN would yield an empty group
for race in df['race_simplified'].dropna().unique():
    print(f"\n{'='*70}")
    print(f"Processing {race}")
    print(f"{'='*70}")
    
    # Filter by race and clean text
    race_df = df[df['race_simplified'] == race].copy()
    sentences = race_df['text'].apply(clean_text).tolist()
    
    print(f"Number of documents: {len(sentences)}")
    print(f"Sample tokens: {sentences[0][:30]}")
    
    # Train Word2Vec model
    print("Training Word2Vec model...")
    model = Word2Vec(
        sentences,
        vector_size=100,
        window=5,
        min_count=5,
        sg=1,  # Skip-gram
        workers=4
    )
    
    # Save model
    model.save(f'../results/PCA/{race}.wordvectors')
    
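    # Get the 100 most common words that are in the model vocabulary
    # (words seen fewer than min_count times have no embedding)
    word_counts = Counter(word for sentence in sentences for word in sentence)
    top_100_words = [
        word for word, _ in word_counts.most_common()
        if word in model.wv.key_to_index
    ][:100]
    
    # Extract embeddings
    embeddings = [model.wv[word] for word in top_100_words]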
    
    # Reduce dimensionality with PCA
    pca = PCA(n_components=2)
    reduced_embeddings = pca.fit_transform(embeddings)
    
    # Plot
    plt.figure(figsize=(10, 10))
    plt.scatter(reduced_embeddings[:, 0], reduced_embeddings[:, 1], alpha=0.6)
    
    for i, word in enumerate(top_100_words):
        plt.annotate(word, xy=(reduced_embeddings[i, 0], reduced_embeddings[i, 1]), 
                    fontsize=8, alpha=0.7)
    
    plt.title(f'Top 100 Word2Vec Embeddings - {race} Discharge Instructions')
    plt.xlabel('PCA Component 1')
    plt.ylabel('PCA Component 2')
    plt.grid(alpha=0.3)
    plt.tight_layout()
    
    # Save plot
    plt.savefig(f'../results/PCA/{race}_embeddings.png', dpi=300, bbox_inches='tight')
    plt.show()
    
    print(f"Saved visualization to results/PCA/{race}_embeddings.png")
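
The PCA step in the loop above keeps only two components, so it helps to know how much of the embedding variance those two axes retain. The sketch below reads this from scikit-learn's `explained_variance_ratio_` attribute. It uses random vectors (`toy_embeddings`, a hypothetical stand-in) so that it runs on its own; inside the loop, the same attribute could be read from the fitted `pca` object.

```python
import numpy as np
from sklearn.decomposition import PCA

# Stand-in for the (100 words x 100 dimensions) embedding matrix.
# Random data is an assumption, used only so the snippet is self-contained.
rng = np.random.default_rng(42)
toy_embeddings = rng.normal(size=(100, 100))

pca = PCA(n_components=2)
reduced = pca.fit_transform(toy_embeddings)

# Fraction of the total variance captured by each plotted component
ratios = pca.explained_variance_ratio_
print(f"PC1: {ratios[0]:.1%}, PC2: {ratios[1]:.1%}, total: {ratios.sum():.1%}")
```

A low total suggests that distances in the scatter plot should be read with caution.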

## 5. Interpretation

### What do these visualizations show?
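- Words that sit close together in the PCA plot tend to appear in similar contexts, although the 2D projection can distort distances from the full 100-dimensional space
- Clusters may reflect common medical topics and instruction patterns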
- Differences across racial groups may indicate:
  - Variations in medical conditions
  - Different communication styles
  - Potential biases in care delivery
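
Closeness in the plot is only a projection, so an apparent cluster or cross-group difference is worth checking in the full embedding space. The sketch below ranks neighbours by cosine similarity and then measures how much a word's neighbour set overlaps between two groups, which does not require the two spaces to be aligned. The vectors, the helper names `nearest_neighbors` and `neighbor_overlap`, and the choice of Jaccard overlap are illustrative assumptions, not part of the notebook. With the trained models, `model.wv.most_similar(word, topn=k)` would supply the neighbours.

```python
import numpy as np

def nearest_neighbors(word, vectors, k=1):
    """Return the k words with the highest cosine similarity to `word`."""
    target = vectors[word]
    scores = {}
    for other, vec in vectors.items():
        if other == word:
            continue
        cosine = np.dot(target, vec) / (np.linalg.norm(target) * np.linalg.norm(vec))
        scores[other] = float(cosine)
    return sorted(scores, key=scores.get, reverse=True)[:k]

def neighbor_overlap(word, vectors_a, vectors_b, k=1):
    """Jaccard overlap of a word's k nearest neighbours in two embedding spaces."""
    a = set(nearest_neighbors(word, vectors_a, k))
    b = set(nearest_neighbors(word, vectors_b, k))
    return len(a & b) / len(a | b)

# Hypothetical toy vectors standing in for two groups' embeddings
group_a = {
    "pain": np.array([1.0, 0.1, 0.0]),
    "medication": np.array([0.9, 0.2, 0.0]),
    "doctor": np.array([0.0, 1.0, 0.1]),
    "appointment": np.array([0.1, 0.9, 0.2]),
}
group_b = {
    "pain": np.array([0.0, 1.0, 0.0]),
    "medication": np.array([0.6, 0.8, 0.0]),
    "doctor": np.array([1.0, 0.0, 0.2]),
    "appointment": np.array([0.2, 0.1, 1.0]),
}

for word in ("pain", "doctor"):
    print(word, neighbor_overlap(word, group_a, group_b, k=1))
```

An overlap near 1 means the word keeps the same neighbours in both groups; an overlap near 0 flags a word whose usage context may differ and merits a closer look.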

### Limitations:
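- PCA keeps only the two directions of greatest variance in the 100-dimensional embedding space, so much of the structure is not visible in the plots
- Each group's Word2Vec model is trained separately, so the embedding spaces (and the PCA axes fitted to them) are not aligned, and positions cannot be compared directly across plots
- The plots are descriptive only; a Fighting Words analysis is needed to test whether differences in word usage between groups are statistically significant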
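
As a rough illustration of the Fighting Words idea, the sketch below computes a z-scored log-odds ratio for each word between two groups, smoothed with a Dirichlet prior. It assumes a uniform prior (`prior=0.01` per word) and uses hypothetical token lists; the function name `fighting_words_z` is also made up for this example. A real analysis might prefer an informative prior estimated from a background corpus.

```python
import math
from collections import Counter

def fighting_words_z(counts_a, counts_b, prior=0.01):
    """Z-scored log-odds ratio per word between two groups (uniform Dirichlet prior)."""
    vocab = set(counts_a) | set(counts_b)
    n_a = sum(counts_a.values())
    n_b = sum(counts_b.values())
    prior_total = prior * len(vocab)
    z_scores = {}
    for word in vocab:
        # Smoothed counts: observed count plus the prior pseudo-count
        a = counts_a.get(word, 0) + prior
        b = counts_b.get(word, 0) + prior
        log_odds_a = math.log(a / (n_a + prior_total - a))
        log_odds_b = math.log(b / (n_b + prior_total - b))
        # Approximate variance of the log-odds difference
        variance = 1.0 / a + 1.0 / b
        z_scores[word] = (log_odds_a - log_odds_b) / math.sqrt(variance)
    return z_scores

# Hypothetical token lists standing in for two groups' cleaned documents
tokens_a = ["take", "medication", "daily", "call", "doctor", "medication"] * 20
tokens_b = ["call", "doctor", "follow", "appointment", "call", "take"] * 20

z = fighting_words_z(Counter(tokens_a), Counter(tokens_b))
for word, score in sorted(z.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{word:12s} {score:+.2f}")
```

Positive scores mark words relatively more frequent in the first group, negative scores the second. An absolute z-score above about 1.96 is the conventional 5% significance threshold. In the notebook, the `Counter` inputs could be built from each group's `sentences` list.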