# Single Cognitive Pattern UMAP + HDBSCAN Analysis

This notebook runs UMAP dimensionality reduction and HDBSCAN clustering on the activations of a single cognitive pattern.
Key changes from the previous analysis:
- **Clustering within states only**: Separate clustering for positive, negative, and transition states
- **Single cognitive pattern focus**: Analyze one pattern at a time (e.g., 'Executive Fatigue & Avolition')
- **Token sampling options**: All tokens vs last token only
- **Sample size options**: All samples vs single sample

## Analysis Variants:
1. All token positions + All samples
2. Last token only + All samples  
3. All token positions + Single sample
4. Last token only + Single sample
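
The four variants are the cross product of the two boolean options. A minimal sketch of how they map onto the configuration strings built later in `perform_single_pattern_analysis` (the helper `config_name` is an illustrative name, not part of the notebook):

```python
from itertools import product


def config_name(all_tokens, all_samples, single_sample_idx=0):
    """Mirror the notebook's naming scheme for one analysis variant."""
    tokens_str = "AllTokens" if all_tokens else "LastToken"
    samples_str = "AllSamples" if all_samples else f"Sample{single_sample_idx}"
    return f"{tokens_str}_{samples_str}"


# Enumerate in the same order as the numbered list above:
# the sample option varies slowest, the token option fastest.
variant_names = [
    config_name(all_tokens, all_samples)
    for all_samples, all_tokens in product([True, False], repeat=2)
]
print(variant_names)
```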

In [10]:
import torch
import numpy as np
import json
import umap
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import plotly.express as px
from pathlib import Path
import webbrowser
import os
import hdbscan
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score
import warnings
warnings.filterwarnings('ignore')

# Configure Plotly for browser display
import plotly.io as pio
pio.renderers.default = "browser"

In [11]:
print("🔍 Single Cognitive Pattern UMAP + HDBSCAN Analysis")
print("=" * 60)
print("Loading activation data and metadata...")

# Set device and paths (same as original notebook)
device = torch.device('mps' if torch.backends.mps.is_available() else 'cpu')
base_path = Path("/Users/ivanculo/Desktop/Projects/turn_point")
activations_dir = base_path / "activations"

# Load data (same as original notebook)
negative_activations = torch.load(activations_dir / "activations_8ff00d963316212d.pt", map_location=device)
positive_activations = torch.load(activations_dir / "activations_e5ad16e9b3c33c9b.pt", map_location=device)
transition_activations = torch.load(activations_dir / "activations_332f24de2a3f82ff.pt", map_location=device)

with open(base_path / "data" / "final" / "enriched_metadata.json", 'r') as f:
    metadata = json.load(f)

print(f"Loaded data for {len(metadata)} examples")
print(f"Device: {device}")
print(f"Negative activations keys: {list(negative_activations.keys())[:3]}...")
print(f"Positive activations keys: {list(positive_activations.keys())[:3]}...")
print(f"Transition activations keys: {list(transition_activations.keys())[:3]}...")

🔍 Single Cognitive Pattern UMAP + HDBSCAN Analysis
Loading activation data and metadata...
Loaded data for 520 examples
Device: mps
Negative activations keys: ['negative_layer_17', 'negative_layer_21', 'enriched_metadata']...
Positive activations keys: ['positive_layer_17', 'positive_layer_21', 'enriched_metadata']...
Transition activations keys: ['transition_layer_17', 'transition_layer_21', 'enriched_metadata']...


In [18]:
# Create pattern indices mapping
pattern_indices = {}
for i, entry in enumerate(metadata):
    pattern_name = entry['bad_good_narratives_match']['cognitive_pattern_name_from_bad_good']
    if pattern_name not in pattern_indices:
        pattern_indices[pattern_name] = []
    pattern_indices[pattern_name].append(i)

# Set layer to analyze (fixed to layer 17, same as original notebook)
layer = 17

print("\n📊 Available Cognitive Patterns:")
for pattern, indices in pattern_indices.items():
    print(f"  {pattern}: {len(indices)} examples")

# Select pattern to analyze
selected_pattern = 'Conflict-Focused Self-Reflection'  # Change this to analyze different patterns
print(f"\n🎯 Selected pattern: {selected_pattern}")
print(f"   Number of examples: {len(pattern_indices[selected_pattern])}")
print(f"   Analyzing layer: {layer}")


📊 Available Cognitive Patterns:
  Executive Fatigue & Avolition: 40 examples
  Persistent Suicidal Ideation Focus: 40 examples
  Self-Critical Rumination: 40 examples
  Conflict-Focused Self-Reflection: 40 examples
  Disorganized Thought & Derealization: 40 examples
  Somatic–Emotional Self-Monitoring: 40 examples
  Identity-Focused Life Narrative: 40 examples
  Overwhelmed Narrative Processing: 40 examples
  Overload with Entrapment Themes: 40 examples
  Existential Overload & Worthlessness: 40 examples
  Hopelessness-Driven Cognitive Exhaustion: 40 examples
  Suicidal Planning & Rationalization: 40 examples
  Fragmented Overwhelm & Exhaustion: 40 examples

🎯 Selected pattern: Conflict-Focused Self-Reflection
   Number of examples: 40
   Analyzing layer: 17


In [19]:
def perform_single_pattern_analysis(
    pattern_name,
    all_tokens=True,
    all_samples=True,
    single_sample_idx=0,
    min_cluster_size=20
):
    """
    Perform UMAP + HDBSCAN analysis on a single cognitive pattern
    
    Args:
        pattern_name: Name of cognitive pattern to analyze
        all_tokens: If True, use all token positions. If False, use last token only
        all_samples: If True, use all samples. If False, use single sample
        single_sample_idx: Which sample to use if all_samples=False
        min_cluster_size: Minimum cluster size for HDBSCAN
    """
    
    if pattern_name not in pattern_indices:
        print(f"❌ Pattern '{pattern_name}' not found")
        return None
    
    # Configuration string for titles
    tokens_str = "AllTokens" if all_tokens else "LastToken"
    samples_str = "AllSamples" if all_samples else f"Sample{single_sample_idx}"
    config_str = f"{tokens_str}_{samples_str}"
    
    print(f"\n🚀 Analysis Configuration: {config_str}")
    print(f"   Pattern: {pattern_name}")
    print(f"   Layer: {layer} (fixed)")
    print(f"   Token sampling: {'All tokens' if all_tokens else 'Last token only'}")
    print(f"   Sample selection: {'All samples' if all_samples else f'Single sample {single_sample_idx}'}")
    print(f"   Min cluster size: {min_cluster_size}")
    
    # Get indices for this pattern
    indices = pattern_indices[pattern_name]
    if not all_samples:
        if single_sample_idx >= len(indices):
            print(f"❌ Sample index {single_sample_idx} out of range (max: {len(indices)-1})")
            return None
        indices = [indices[single_sample_idx]]
    
    print(f"   Using {len(indices)} sample(s)")
    
    # Load activations for this pattern (layer 17 only)
    neg_data = negative_activations[f'negative_layer_{layer}'][indices]
    pos_data = positive_activations[f'positive_layer_{layer}'][indices] 
    trans_data = transition_activations[f'transition_layer_{layer}'][indices]
    
    print(f"   Data shapes - Neg: {neg_data.shape}, Pos: {pos_data.shape}, Trans: {trans_data.shape}")
    
    # Prepare data based on token sampling strategy
    def prepare_state_data(data, state_name):
        if all_tokens:
            # Use all tokens from all samples
            flat_data = data.reshape(-1, data.shape[-1])
            print(f"     {state_name}: {data.shape} -> {flat_data.shape} (all tokens)")
        else:
            # Use only last token from each sample
            last_tokens = data[:, -1, :]
            flat_data = last_tokens.reshape(-1, last_tokens.shape[-1])
            print(f"     {state_name}: {data.shape} -> {flat_data.shape} (last tokens only)")
        return flat_data.cpu().numpy()  # Move to CPU before converting to numpy
    
    # Prepare data for each state
    print(f"\n📊 Preparing data for clustering...")
    neg_flat = prepare_state_data(neg_data, "Negative")
    pos_flat = prepare_state_data(pos_data, "Positive")
    trans_flat = prepare_state_data(trans_data, "Transition")
    
    return perform_clustering_and_visualization(
        neg_flat, pos_flat, trans_flat, 
        pattern_name, config_str, min_cluster_size
    )

In [20]:
def perform_clustering_and_visualization(
    neg_flat, pos_flat, trans_flat, 
    pattern_name, config_str, min_cluster_size
):
    """
    Perform HDBSCAN clustering within each state separately and create UMAP visualization
    """
    
    print(f"\n🔬 HDBSCAN Clustering Analysis (within states only)")
    print("=" * 60)
    
    states_data = {
        'Negative': neg_flat,
        'Positive': pos_flat,
        'Transition': trans_flat
    }
    
    # Standardize data
    scaler = StandardScaler()
    
    # Perform HDBSCAN clustering on each state separately
    clustering_results = {}
    all_embeddings = []
    all_labels = []
    all_colors = []
    all_detailed_labels = []
    
    # Color schemes for each state
    color_schemes = {
        'Negative': ['#8B0000', '#DC143C', '#B22222', '#FF6B6B', '#FF8E8E'],
        'Positive': ['#006400', '#228B22', '#32CD32', '#7CFC00', '#ADFF2F'], 
        'Transition': ['#4B0082', '#8A2BE2', '#9370DB', '#BA55D3', '#DDA0DD']
    }
    
    for state_name, data in states_data.items():
        print(f"\n🎯 Clustering {state_name} state ({data.shape[0]} samples)")
        
        if data.shape[0] < min_cluster_size:
            print(f"   ⚠️  Skipping {state_name} - insufficient samples ({data.shape[0]} < {min_cluster_size})")
            continue
        
        # Standardize data
        data_scaled = scaler.fit_transform(data)
        
        # HDBSCAN clustering
        clusterer = hdbscan.HDBSCAN(
            min_cluster_size=min_cluster_size,
            min_samples=max(5, min_cluster_size // 4),
            cluster_selection_epsilon=0.0,
            metric='euclidean'
        )
        
        cluster_labels = clusterer.fit_predict(data_scaled)
        
        # Analyze clustering results
        unique_labels, counts = np.unique(cluster_labels, return_counts=True)
        n_clusters = len(unique_labels) - (1 if -1 in unique_labels else 0)
        n_noise = np.sum(cluster_labels == -1)
        
        print(f"   Found {n_clusters} clusters, {n_noise} noise points ({n_noise/len(cluster_labels)*100:.1f}%)")
        
        for label, count in zip(unique_labels, counts):
            pct = count / len(cluster_labels) * 100
            if label == -1:
                print(f"     Noise: {count:4d} samples ({pct:5.1f}%)")
            else:
                print(f"     Cluster {label}: {count:4d} samples ({pct:5.1f}%)")
        
        # Store clustering results
        clustering_results[state_name] = {
            'data': data,
            'data_scaled': data_scaled,
            'cluster_labels': cluster_labels,
            'clusterer': clusterer,
            'n_clusters': n_clusters,
            'n_noise': n_noise
        }
        
        # UMAP embedding for this state
        print(f"   Computing UMAP embedding...")
        n_neighbors = min(15, max(2, len(data) // 10))
        
        umap_reducer = umap.UMAP(
            n_components=2,
            n_neighbors=n_neighbors,
            min_dist=0.1,
            random_state=42,
            n_jobs=1
        )
        
        embedding = umap_reducer.fit_transform(data_scaled)
        all_embeddings.append(embedding)
        
        # Create labels and colors
        state_colors = color_schemes[state_name]
        for i, label in enumerate(cluster_labels):
            if label == -1:
                all_labels.append(f"{state_name}_Noise")
                all_colors.append('#808080')  # Gray for noise
                all_detailed_labels.append(f"{state_name} Noise")
            else:
                all_labels.append(f"{state_name}_Cluster_{label}")
                all_colors.append(state_colors[label % len(state_colors)])
                all_detailed_labels.append(f"{state_name} Cluster {label}")
    
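    # Combine all embeddings. Each state has its own UMAP fit, so positions
    # are comparable only within a state, not across states.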
    if not all_embeddings:
        print("❌ No valid clusterings found")
        return None
    
    combined_embedding = np.vstack(all_embeddings)
    
    print(f"\n🗺️  Creating visualization with {len(combined_embedding)} total points...")
    
    # Create visualization
    results = {
        'embedding': combined_embedding,
        'labels': all_labels,
        'colors': all_colors,
        'detailed_labels': all_detailed_labels,
        'clustering_results': clustering_results,
        'pattern_name': pattern_name,
        'config_str': config_str
    }
    
    create_visualization(results)
    
    return results
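

The cluster and noise counts reported by `perform_clustering_and_visualization` rely on HDBSCAN's convention of labelling noise points as `-1`. A minimal sketch of that bookkeeping on a toy label array (`summarize_labels` and `toy_labels` are illustrative names, not part of the notebook):

```python
import numpy as np


def summarize_labels(cluster_labels):
    """Return (n_clusters, n_noise, sizes) for an HDBSCAN-style label array."""
    cluster_labels = np.asarray(cluster_labels)
    unique_labels, counts = np.unique(cluster_labels, return_counts=True)
    # Label -1 marks noise, so it does not count as a cluster.
    n_noise = int(np.sum(cluster_labels == -1))
    n_clusters = len(unique_labels) - (1 if -1 in unique_labels else 0)
    sizes = {
        int(label): int(count)
        for label, count in zip(unique_labels, counts)
        if label != -1
    }
    return n_clusters, n_noise, sizes


toy_labels = [-1, -1, 0, 0, 0, 1, 1, -1]
n_clusters, n_noise, sizes = summarize_labels(toy_labels)
print(n_clusters, n_noise, sizes)
```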

In [21]:
def create_visualization(results):
    """
    Create interactive UMAP visualization with HDBSCAN clustering results
    """
    
    embedding = results['embedding']
    labels = results['labels']
    colors = results['colors']
    detailed_labels = results['detailed_labels']
    pattern_name = results['pattern_name']
    config_str = results['config_str']
    
    # Create figure
    fig = go.Figure()
    
    # Group points by label for legend organization
    unique_labels = list(set(labels))
    unique_labels.sort()
    
    for label in unique_labels:
        mask = [l == label for l in labels]
        if not any(mask):
            continue
            
        indices = [i for i, m in enumerate(mask) if m]
        
        # Extract state and cluster info for display
        parts = label.split('_')
        state = parts[0]
        
        if 'Noise' in label:
            display_name = f"{state} (Noise)"
            marker_symbol = 'x'
            marker_size = 3
        else:
            cluster_id = parts[-1]
            display_name = f"{state} C{cluster_id}"
            marker_symbol = 'circle'
            marker_size = 4
        
        fig.add_trace(go.Scatter(
            x=embedding[indices, 0],
            y=embedding[indices, 1],
            mode='markers',
            marker=dict(
                color=colors[indices[0]],
                size=marker_size,
                opacity=0.7,
                symbol=marker_symbol,
                line=dict(width=0.5, color='white')
            ),
            name=display_name,
            hovertemplate=f'<b>{display_name}</b><br>UMAP 1: %{{x:.2f}}<br>UMAP 2: %{{y:.2f}}<extra></extra>'
        ))
    
    # Update layout
    title = f'{pattern_name} - {config_str} UMAP + HDBSCAN'
    fig.update_layout(
        title=dict(
            text=title,
            x=0.5,
            font=dict(size=16)
        ),
        xaxis_title='UMAP Dimension 1',
        yaxis_title='UMAP Dimension 2',
        width=1000,
        height=800,
        showlegend=True,
        legend=dict(
            yanchor="top",
            y=0.99,
            xanchor="left",
            x=1.01
        ),
        margin=dict(r=150)
    )
    
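    # Save and display
    safe_pattern = pattern_name.replace(' ', '_').replace('&', 'and')
    # Built-in hash() of a str is salted per process, so use a content digest
    # to get the same filename for the same embedding across sessions.
    import hashlib
    digest = hashlib.md5(embedding.tobytes()).hexdigest()[:8]
    filename = f"single_pattern_{safe_pattern}_{config_str}_{digest}.html"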
    
    fig.write_html(filename, auto_open=False)
    
    print(f"\n📊 Visualization saved: {filename}")
    print(f"   Opening in browser...")
    
    try:
        webbrowser.open(f'file://{os.path.abspath(filename)}', new=2)
    except Exception as e:
        print(f"   Could not open browser: {e}")
    
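    # With pio.renderers.default = "browser", this opens a second tab
    # in addition to the saved HTML file opened above.
    fig.show()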
    
    return fig
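

Each state's palette holds five colors and is indexed with `label % len(state_colors)`, so cluster 5 reuses the color of cluster 0; the all-token run in this notebook already reports six negative clusters. One possible workaround, sketched here with the standard-library `colorsys` module, is to generate as many shades of a hue as there are clusters. `state_palette` and its parameters are hypothetical, not part of the notebook.

```python
import colorsys


def state_palette(base_hue, n_colors, saturation=0.9):
    """Return n_colors hex shades of one hue, ordered from dark to light."""
    palette = []
    for k in range(n_colors):
        # Spread lightness over [0.25, 0.75] so shades stay distinguishable.
        lightness = 0.25 + 0.5 * (k / max(1, n_colors - 1))
        r, g, b = colorsys.hls_to_rgb(base_hue, lightness, saturation)
        palette.append(
            "#{:02X}{:02X}{:02X}".format(round(r * 255), round(g * 255), round(b * 255))
        )
    return palette


negative_palette = state_palette(base_hue=0.0, n_colors=8)  # shades of red
print(negative_palette)
```

Indexing such a palette with the raw cluster label avoids collisions as long as `n_colors` is at least the number of clusters found for that state.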

In [22]:
def print_analysis_summary(results):
    """
    Print summary of clustering analysis results
    """
    if not results:
        return
    
    print(f"\n📋 Analysis Summary: {results['pattern_name']} - {results['config_str']}")
    print("=" * 60)
    
    clustering_results = results['clustering_results']
    total_points = len(results['labels'])
    
    print(f"Total data points analyzed: {total_points}")
    
    for state_name, cluster_info in clustering_results.items():
        n_clusters = cluster_info['n_clusters']
        n_noise = cluster_info['n_noise']
        n_total = len(cluster_info['cluster_labels'])
        
        print(f"\n{state_name} State:")
        print(f"  • {n_total} total points")
        print(f"  • {n_clusters} clusters found")
        print(f"  • {n_noise} noise points ({n_noise/n_total*100:.1f}%)")
        print(f"  • {n_total - n_noise} points in clusters ({(n_total-n_noise)/n_total*100:.1f}%)")
    
    print(f"\n💡 Key Insights:")
    print(f"  • Clustering performed WITHIN each cognitive state separately")
    print(f"  • HDBSCAN infers the number of clusters from density, given min_cluster_size")
    print(f"  • Noise points (label -1) are points HDBSCAN did not assign to any cluster")
    print(f"  • Well-separated clusters suggest, but do not prove, distinct activation patterns")

## Analysis 1: All Token Positions + All Samples

This analysis uses all token positions from all samples of the selected cognitive pattern.
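
Flattening turns a `(n_samples, seq_len, hidden)` tensor into `n_samples * seq_len` rows, which is why 40 samples of 208 tokens give 8320 points in the run below. A small sketch, using toy sizes and NumPy instead of the notebook's tensors, of how a flattened row index maps back to its sample and token position (`flat_to_sample_token` is an illustrative helper, not part of the notebook):

```python
import numpy as np

n_samples, seq_len, hidden = 3, 4, 5  # toy sizes; the run below has (40, 208, 2304)
data = np.arange(n_samples * seq_len * hidden, dtype=np.float32).reshape(
    n_samples, seq_len, hidden
)

# Same operation as data.reshape(-1, data.shape[-1]) in prepare_state_data.
flat = data.reshape(-1, data.shape[-1])


def flat_to_sample_token(flat_idx, seq_len):
    """Map a row of the flattened array back to (sample index, token position)."""
    return divmod(flat_idx, seq_len)


sample_idx, token_idx = flat_to_sample_token(7, seq_len)
print(flat.shape, sample_idx, token_idx)
```

The same mapping applies to cluster labels, since HDBSCAN returns one label per flattened row.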

In [23]:
# Analysis 1: All tokens, all samples
print("\n" + "=" * 80)
print("🔍 ANALYSIS 1: All Token Positions + All Samples")
print("=" * 80)

results_all_tokens_all_samples = perform_single_pattern_analysis(
    pattern_name=selected_pattern,
    all_tokens=True,
    all_samples=True,
    min_cluster_size=30
)

print_analysis_summary(results_all_tokens_all_samples)


🔍 ANALYSIS 1: All Token Positions + All Samples

🚀 Analysis Configuration: AllTokens_AllSamples
   Pattern: Conflict-Focused Self-Reflection
   Layer: 17 (fixed)
   Token sampling: All tokens
   Sample selection: All samples
   Min cluster size: 30
   Using 40 sample(s)
   Data shapes - Neg: torch.Size([40, 208, 2304]), Pos: torch.Size([40, 261, 2304]), Trans: torch.Size([40, 311, 2304])

📊 Preparing data for clustering...
     Negative: torch.Size([40, 208, 2304]) -> torch.Size([8320, 2304]) (all tokens)
     Positive: torch.Size([40, 261, 2304]) -> torch.Size([10440, 2304]) (all tokens)
     Transition: torch.Size([40, 311, 2304]) -> torch.Size([12440, 2304]) (all tokens)

🔬 HDBSCAN Clustering Analysis (within states only)

🎯 Clustering Negative state (8320 samples)
   Found 6 clusters, 2161 noise points (26.0%)
     Noise: 2161 samples ( 26.0%)
     Cluster 0:   37 samples (  0.4%)
     Cluster 1:   42 samples (  0.5%)
     Cluster 2:   58 samples (  0.7%)
     Cluster 3:   40 samp

In [None]:
# 💾 Save and Analyze Analysis 1 Results
print("\n" + "=" * 80)
print("💾 SAVING & ANALYZING ANALYSIS 1 RESULTS")  
print("=" * 80)

# Import the cluster analysis utilities
from cluster_analysis_utils import analyze_clustering_from_notebook_results

# Set up paths for the cluster analyzer
base_path = Path("/Users/ivanculo/Desktop/Projects/turn_point")
activations_dir = base_path / "activations"
metadata_path = base_path / "data" / "final" / "enriched_metadata.json"

# Analyze the clustering results from Analysis 1
print("🔬 Running cluster analysis on Analysis 1 results...")
cluster_analysis = analyze_clustering_from_notebook_results(
    results_all_tokens_all_samples,
    activations_dir=activations_dir,
    metadata_path=metadata_path,
    top_k=3  # Get top 3 highest activating clusters per state
)

print("\n🎯 TOP CLUSTERS BY ACTIVATION MAGNITUDE:")
for state, clusters in cluster_analysis['top_clusters'].items():
    print(f"\n{state} state:")
    for i, cluster in enumerate(clusters):
        print(f"  #{i+1}: Cluster {cluster['cluster_id']} - "
              f"Magnitude: {cluster['activation_magnitude']:.2f}, "
              f"Size: {cluster['size']} points")

print(f"\n📋 ACTIVATION FILTERS CREATED:")
for state, filter_info in cluster_analysis['activation_filter'].items():
    print(f"  {state}: {filter_info['activation_count']} activations from "
          f"{len(filter_info['sample_ids'])} unique samples")

# Save the analysis results for later use
import pickle
analysis_filename = f"cluster_analysis_{selected_pattern.replace(' ', '_').replace('&', 'and')}_Analysis1.pkl"
with open(analysis_filename, 'wb') as f:
    pickle.dump(cluster_analysis, f)
    
print(f"\n💾 Cluster analysis saved to: {analysis_filename}")
print("✅ Analysis 1 results processed and saved!")


💾 SAVING & ANALYZING ANALYSIS 1 RESULTS
🔬 Running cluster analysis on Analysis 1 results...
Loading activation data and metadata...
Loaded data for 520 examples
Found 13 cognitive patterns
Using device: mps

🔬 COMPLETE CLUSTER ANALYSIS PIPELINE

🔍 Analyzing cluster activations for Conflict-Focused Self-Reflection
   Layer: 17

📊 Analyzing Negative state clusters:
   Cluster 0: 37 points, magnitude: 268.09
   Cluster 1: 42 points, magnitude: 246.57
   Cluster 2: 58 points, magnitude: 254.22
   Cluster 3: 40 points, magnitude: 569.35
   Cluster 4: 40 points, magnitude: 3502.02
   Cluster 5: 5942 points, magnitude: 0.51

📊 Analyzing Positive state clusters:
   Cluster 0: 30 points, magnitude: 248.16
   Cluster 1: 34 points, magnitude: 342.39
   Cluster 2: 33 points, magnitude: 265.45
   Cluster 3: 72 points, magnitude: 286.24
   Cluster 4: 45 points, magnitude: 278.84
   Cluster 5: 50 points, magnitude: 268.29
   Cluster 6: 80 points, magnitude: 239.17
   Cluster 7: 31 points, magnitude:
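
The `magnitude` values come from `cluster_analysis_utils`, whose definition is not shown in this notebook. One plausible reading, assumed here purely for illustration, is the mean L2 norm of the activation vectors in each cluster. The function `cluster_magnitudes` and the toy data are hypothetical.

```python
import numpy as np


def cluster_magnitudes(data, cluster_labels):
    """Mean L2 norm per non-noise cluster (an assumed definition, see above)."""
    data = np.asarray(data, dtype=np.float64)
    cluster_labels = np.asarray(cluster_labels)
    norms = np.linalg.norm(data, axis=1)  # one norm per row
    return {
        int(label): float(norms[cluster_labels == label].mean())
        for label in np.unique(cluster_labels)
        if label != -1  # skip HDBSCAN noise
    }


toy_data = np.array([[3.0, 4.0], [6.0, 8.0], [0.0, 1.0], [9.0, 9.0]])
toy_cluster_labels = np.array([0, 0, 1, -1])
magnitudes = cluster_magnitudes(toy_data, toy_cluster_labels)
print(magnitudes)
```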

## Analysis 2: Last Token Only + All Samples

This analysis uses only the last token from each sample (final processing state).
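
The last-token branch takes `data[:, -1, :]`, the final position of the stored tensor. If the sequences were right-padded to a common length (an assumption; the notebook does not show how the activations were stored), that position is a padding slot for every sample shorter than the longest one. A sketch of selecting the last real token instead, given a hypothetical `attention_mask` that is not among the loaded files shown here:

```python
import numpy as np


def last_real_token(data, attention_mask):
    """Pick each sample's last non-padding activation.

    data: (n_samples, seq_len, hidden)
    attention_mask: (n_samples, seq_len) with 1 for real tokens, 0 for padding.
    Assumes right padding and at least one real token per sample.
    """
    data = np.asarray(data)
    lengths = np.asarray(attention_mask).sum(axis=1)
    last_idx = lengths - 1
    return data[np.arange(data.shape[0]), last_idx]


toy = np.arange(12).reshape(2, 3, 2)
toy_mask = np.array([[1, 1, 0], [1, 1, 1]])  # first sample has one padding slot
selected = last_real_token(toy, toy_mask)
naive = toy[:, -1, :]
print(selected.tolist(), naive.tolist())
```

For the first toy sample the two selections differ, which is the case to check for before interpreting last-token clusters.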

In [21]:
# Analysis 2: Last token only, all samples
print("\n" + "=" * 80)
print("🔍 ANALYSIS 2: Last Token Only + All Samples")
print("=" * 80)

results_last_token_all_samples = perform_single_pattern_analysis(
    pattern_name=selected_pattern,
    all_tokens=False,
    all_samples=True,
    min_cluster_size=10  # Smaller cluster size since we have fewer points
)

print_analysis_summary(results_last_token_all_samples)


🔍 ANALYSIS 2: Last Token Only + All Samples

🚀 Analysis Configuration: LastToken_AllSamples
   Pattern: Executive Fatigue & Avolition
   Layer: 17 (fixed)
   Token sampling: Last token only
   Sample selection: All samples
   Min cluster size: 10
   Using 40 sample(s)
   Data shapes - Neg: torch.Size([40, 208, 2304]), Pos: torch.Size([40, 261, 2304]), Trans: torch.Size([40, 311, 2304])

📊 Preparing data for clustering...
     Negative: torch.Size([40, 208, 2304]) -> torch.Size([40, 2304]) (last tokens only)
     Positive: torch.Size([40, 261, 2304]) -> torch.Size([40, 2304]) (last tokens only)
     Transition: torch.Size([40, 311, 2304]) -> torch.Size([40, 2304]) (last tokens only)

🔬 HDBSCAN Clustering Analysis (within states only)

🎯 Clustering Negative state (40 samples)
   Found 0 clusters, 40 noise points (100.0%)
     Noise:   40 samples (100.0%)
   Computing UMAP embedding...

🎯 Clustering Positive state (40 samples)
   Found 0 clusters, 40 noise points (100.0%)
     Noise:   4

## Analysis 3: All Token Positions + Single Sample

This analysis focuses on just one example to see token-level progression within a single narrative.
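
With a single sample, the number of points per state drops to that sample's sequence length, which interacts with two heuristics in the code: the `n_neighbors` rule `min(15, max(2, n // 10))` and the guard that skips a state with fewer points than `min_cluster_size`. A quick sketch of both for a few point counts (the helper names are illustrative, and 20 is the function's default `min_cluster_size`):

```python
def umap_n_neighbors(n_points):
    """The notebook's heuristic: about n/10, clamped to the range [2, 15]."""
    return min(15, max(2, n_points // 10))


def will_be_clustered(n_points, min_cluster_size=20):
    """A state is skipped when it has fewer points than min_cluster_size."""
    return n_points >= min_cluster_size


summary = {
    n_points: (umap_n_neighbors(n_points), will_be_clustered(n_points))
    for n_points in (15, 40, 208, 8320)
}
for n_points, (n_neighbors, clustered) in summary.items():
    print(n_points, n_neighbors, clustered)
```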

In [22]:
def perform_clustering_and_visualization_umap_first(
    neg_flat, pos_flat, trans_flat, 
    pattern_name, config_str, min_cluster_size
):
    """
    Alternative approach: Perform 3D UMAP first, then HDBSCAN clustering on the 3D embeddings
    """
    
    print(f"\n🔬 UMAP-First Analysis: 3D UMAP → HDBSCAN Clustering")
    print("=" * 60)
    print("⚡ NEW APPROACH: Dimensionality reduction first, then clustering in 3D space")
    
    states_data = {
        'Negative': neg_flat,
        'Positive': pos_flat,
        'Transition': trans_flat
    }
    
    # Standardize data
    scaler = StandardScaler()
    
    # Perform 3D UMAP first, then HDBSCAN clustering on 3D embeddings
    clustering_results = {}
    all_embeddings_2d = []  # For final 2D visualization
    all_embeddings_3d = []  # Store 3D embeddings
    all_labels = []
    all_colors = []
    all_detailed_labels = []
    
    # Color schemes for each state
    color_schemes = {
        'Negative': ['#8B0000', '#DC143C', '#B22222', '#FF6B6B', '#FF8E8E'],
        'Positive': ['#006400', '#228B22', '#32CD32', '#7CFC00', '#ADFF2F'], 
        'Transition': ['#4B0082', '#8A2BE2', '#9370DB', '#BA55D3', '#DDA0DD']
    }
    
    for state_name, data in states_data.items():
        print(f"\n🎯 Processing {state_name} state ({data.shape[0]} samples)")
        
        if data.shape[0] < min_cluster_size:
            print(f"   ⚠️  Skipping {state_name} - insufficient samples ({data.shape[0]} < {min_cluster_size})")
            continue
        
        # Step 1: Standardize data
        data_scaled = scaler.fit_transform(data)
        
        # Step 2: UMAP to 3D FIRST
        print(f"   🗺️  Computing 3D UMAP embedding...")
        n_neighbors = min(15, max(2, len(data) // 10))
        
        umap_3d = umap.UMAP(
            n_components=3,  # 3D embedding
            n_neighbors=n_neighbors,
            min_dist=0.1,
            random_state=42,
            n_jobs=1
        )
        
        embedding_3d = umap_3d.fit_transform(data_scaled)
        print(f"   ✅ 3D UMAP: {data_scaled.shape} → {embedding_3d.shape}")
        
        # Step 3: HDBSCAN clustering on 3D embeddings
        print(f"   🔍 HDBSCAN clustering on 3D embedding...")
        clusterer = hdbscan.HDBSCAN(
            min_cluster_size=min_cluster_size,
            min_samples=max(5, min_cluster_size // 4),
            cluster_selection_epsilon=0.0,
            metric='euclidean'
        )
        
        cluster_labels = clusterer.fit_predict(embedding_3d)  # Cluster on 3D UMAP!
        
        # Analyze clustering results
        unique_labels, counts = np.unique(cluster_labels, return_counts=True)
        n_clusters = len(unique_labels) - (1 if -1 in unique_labels else 0)
        n_noise = np.sum(cluster_labels == -1)
        
        print(f"   Found {n_clusters} clusters, {n_noise} noise points ({n_noise/len(cluster_labels)*100:.1f}%)")
        
        for label, count in zip(unique_labels, counts):
            pct = count / len(cluster_labels) * 100
            if label == -1:
                print(f"     Noise: {count:4d} samples ({pct:5.1f}%)")
            else:
                print(f"     Cluster {label}: {count:4d} samples ({pct:5.1f}%)")
        
        # Step 4: Separate 2D UMAP fit for visualization only (cluster labels come from the 3D embedding)
        print(f"   📊 Creating 2D UMAP for visualization...")
        umap_2d = umap.UMAP(
            n_components=2,
            n_neighbors=n_neighbors,
            min_dist=0.1,
            random_state=42,  # Fixed seed for reproducibility; it does not align the 2D and 3D layouts
            n_jobs=1
        )
        
        embedding_2d = umap_2d.fit_transform(data_scaled)
        all_embeddings_2d.append(embedding_2d)
        all_embeddings_3d.append(embedding_3d)
        
        # Store clustering results
        clustering_results[state_name] = {
            'data': data,
            'data_scaled': data_scaled,
            'embedding_3d': embedding_3d,
            'embedding_2d': embedding_2d,
            'cluster_labels': cluster_labels,
            'clusterer': clusterer,
            'n_clusters': n_clusters,
            'n_noise': n_noise,
            'umap_3d': umap_3d,
            'umap_2d': umap_2d
        }
        
        # Create labels and colors based on 3D clustering
        state_colors = color_schemes[state_name]
        for i, label in enumerate(cluster_labels):
            if label == -1:
                all_labels.append(f"{state_name}_Noise")
                all_colors.append('#808080')  # Gray for noise
                all_detailed_labels.append(f"{state_name} Noise")
            else:
                all_labels.append(f"{state_name}_Cluster_{label}")
                all_colors.append(state_colors[label % len(state_colors)])
                all_detailed_labels.append(f"{state_name} Cluster {label}")
    
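    # Combine all embeddings. Each state has its own UMAP fits, so positions
    # are comparable only within a state, not across states.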
    if not all_embeddings_2d:
        print("❌ No valid clusterings found")
        return None
    
    combined_embedding_2d = np.vstack(all_embeddings_2d)
    combined_embedding_3d = np.vstack(all_embeddings_3d)
    
    print(f"\n🗺️  Creating visualizations with {len(combined_embedding_2d)} total points...")
    
    # Create results with both 2D and 3D embeddings
    results = {
        'embedding_2d': combined_embedding_2d,
        'embedding_3d': combined_embedding_3d,
        'labels': all_labels,
        'colors': all_colors,
        'detailed_labels': all_detailed_labels,
        'clustering_results': clustering_results,
        'pattern_name': pattern_name,
        'config_str': config_str + "_UMAPFirst"
    }
    
    # Create both 2D and 3D visualizations
    create_umap_first_visualizations(results)
    
    return results


In [23]:
def create_umap_first_visualizations(results):
    """
    Create both 2D and 3D interactive visualizations for UMAP-first approach
    """
    
    embedding_2d = results['embedding_2d']
    embedding_3d = results['embedding_3d']
    labels = results['labels']
    colors = results['colors']
    detailed_labels = results['detailed_labels']
    pattern_name = results['pattern_name']
    config_str = results['config_str']
    
    # Create subplots: 2D and 3D side by side
    # (make_subplots and go are already imported at the top of the notebook)
    
    fig = make_subplots(
        rows=1, cols=2,
        column_widths=[0.5, 0.5],
        specs=[[{"type": "scatter"}, {"type": "scatter3d"}]],
        subplot_titles=('2D UMAP Visualization', '3D UMAP (Clustering Space)')
    )
    
    # Group points by label for legend organization
    unique_labels = list(set(labels))
    unique_labels.sort()
    
    for i, label in enumerate(unique_labels):
        mask = [l == label for l in labels]
        if not any(mask):
            continue
            
        indices = [idx for idx, m in enumerate(mask) if m]
        
        # Extract state and cluster info for display
        parts = label.split('_')
        state = parts[0]
        
        if 'Noise' in label:
            display_name = f"{state} (Noise)"
            marker_symbol = 'x'
            marker_size_2d = 3
            marker_size_3d = 2
        else:
            cluster_id = parts[-1]
            display_name = f"{state} C{cluster_id}"
            marker_symbol = 'circle'
            marker_size_2d = 4
            marker_size_3d = 3
        
        # Add 2D scatter plot
        fig.add_trace(
            go.Scatter(
                x=embedding_2d[indices, 0],
                y=embedding_2d[indices, 1],
                mode='markers',
                marker=dict(
                    color=colors[indices[0]],
                    size=marker_size_2d,
                    opacity=0.7,
                    symbol=marker_symbol,
                    line=dict(width=0.5, color='white')
                ),
                name=display_name,
                legendgroup=display_name,
                hovertemplate=f'<b>{display_name}</b><br>UMAP 1: %{{x:.2f}}<br>UMAP 2: %{{y:.2f}}<extra></extra>'
            ),
            row=1, col=1
        )
        
        # Add 3D scatter plot  
        fig.add_trace(
            go.Scatter3d(
                x=embedding_3d[indices, 0],
                y=embedding_3d[indices, 1], 
                z=embedding_3d[indices, 2],
                mode='markers',
                marker=dict(
                    color=colors[indices[0]],
                    size=marker_size_3d,
                    opacity=0.7,
                    symbol=marker_symbol,
                    line=dict(width=0.5, color='white')
                ),
                name=display_name,
                legendgroup=display_name,
                showlegend=False,  # Don't duplicate legend
                hovertemplate=f'<b>{display_name}</b><br>UMAP 1: %{{x:.2f}}<br>UMAP 2: %{{y:.2f}}<br>UMAP 3: %{{z:.2f}}<extra></extra>'
            ),
            row=1, col=2
        )
    
    # Update layout
    title = f'{pattern_name} - {config_str}<br><sub>Left: 2D Visualization | Right: 3D Clustering Space</sub>'
    
    fig.update_layout(
        title=dict(
            text=title,
            x=0.5,
            font=dict(size=14)
        ),
        width=1600,
        height=700,
        showlegend=True,
        legend=dict(
            yanchor="top",
            y=0.99,
            xanchor="left", 
            x=1.02
        ),
        margin=dict(r=200)
    )
    
    # Update 2D subplot axes
    fig.update_xaxes(title_text="UMAP Dimension 1", row=1, col=1)
    fig.update_yaxes(title_text="UMAP Dimension 2", row=1, col=1)
    
    # Update 3D subplot axes
    fig.update_scenes(
        xaxis_title="UMAP Dimension 1",
        yaxis_title="UMAP Dimension 2", 
        zaxis_title="UMAP Dimension 3",
        row=1, col=2
    )
    
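    # Save and display
    safe_pattern = pattern_name.replace(' ', '_').replace('&', 'and')
    # Built-in hash() of a str is salted per process, so use a content digest
    # to get the same filename for the same embedding across sessions.
    import hashlib
    digest = hashlib.md5(embedding_2d.tobytes()).hexdigest()[:8]
    filename = f"umap_first_{safe_pattern}_{config_str}_{digest}.html"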
    
    fig.write_html(filename, auto_open=False)
    
    print(f"\n📊 UMAP-First Visualization saved: {filename}")
    print(f"   🎯 Key difference: Clustering was performed on 3D UMAP embeddings, not original high-dim data")
    print(f"   📈 Left plot: 2D visualization | Right plot: 3D clustering space")
    print(f"   Opening in browser...")
    
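    # webbrowser, os and Path are already imported at the top of the notebook.
    try:
        # as_uri() builds a valid file:// URL even if the path contains spaces
        webbrowser.open(Path(filename).resolve().as_uri(), new=2)
    except Exception as e:
        print(f"   Could not open browser: {e}")
    
    # fig.show() is not called here: with the "browser" renderer set above it
    # would open the same figure in a second tab.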
    
    return fig


In [24]:
def perform_single_pattern_analysis_umap_first(
    pattern_name,
    all_tokens=True,
    all_samples=True,
    single_sample_idx=0,
    min_cluster_size=20
):
    """
    Alternative analysis: UMAP first (to 3D), then HDBSCAN clustering on 3D embeddings
    
    Args:
        pattern_name: Name of cognitive pattern to analyze
        all_tokens: If True, use all token positions. If False, use last token only
        all_samples: If True, use all samples. If False, use single sample
        single_sample_idx: Which sample to use if all_samples=False
        min_cluster_size: Minimum cluster size for HDBSCAN
    """
    
    if pattern_name not in pattern_indices:
        print(f"❌ Pattern '{pattern_name}' not found")
        return None
    
    # Configuration string for titles
    tokens_str = "AllTokens" if all_tokens else "LastToken"
    samples_str = "AllSamples" if all_samples else f"Sample{single_sample_idx}"
    config_str = f"{tokens_str}_{samples_str}"
    
    print(f"\n🚀 UMAP-First Analysis Configuration: {config_str}")
    print(f"   🔄 NEW APPROACH: 3D UMAP → HDBSCAN clustering (vs original: HDBSCAN → 2D UMAP)")
    print(f"   Pattern: {pattern_name}")
    print(f"   Layer: {layer} (fixed)")
    print(f"   Token sampling: {'All tokens' if all_tokens else 'Last token only'}")
    print(f"   Sample selection: {'All samples' if all_samples else f'Single sample {single_sample_idx}'}")
    print(f"   Min cluster size: {min_cluster_size}")
    
    # Get indices for this pattern
    indices = pattern_indices[pattern_name]
    if not all_samples:
        if not 0 <= single_sample_idx < len(indices):
            print(f"❌ Sample index {single_sample_idx} out of range (max: {len(indices)-1})")
            return None
        indices = [indices[single_sample_idx]]
    
    print(f"   Using {len(indices)} sample(s)")
    
    # Load activations for this pattern (single fixed layer, given by `layer`)
    neg_data = negative_activations[f'negative_layer_{layer}'][indices]
    pos_data = positive_activations[f'positive_layer_{layer}'][indices] 
    trans_data = transition_activations[f'transition_layer_{layer}'][indices]
    
    print(f"   Data shapes - Neg: {neg_data.shape}, Pos: {pos_data.shape}, Trans: {trans_data.shape}")
    
    # Prepare data based on token sampling strategy
    def prepare_state_data(data, state_name):
        if all_tokens:
            # Use all tokens from all samples
            flat_data = data.reshape(-1, data.shape[-1])
            print(f"     {state_name}: {data.shape} -> {flat_data.shape} (all tokens)")
        else:
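            # Use only the last token position from each sample.
            # NOTE: if sequences are right-padded to a common length, this
            # position may hold a padding token rather than the final real token.
            flat_data = data[:, -1, :]  # already (samples, hidden); no reshape needed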
            print(f"     {state_name}: {data.shape} -> {flat_data.shape} (last tokens only)")
        return flat_data.cpu().numpy()  # Move to CPU before converting to numpy
    
    # Prepare data for each state
    print(f"\n📊 Preparing data for UMAP-first analysis...")
    neg_flat = prepare_state_data(neg_data, "Negative")
    pos_flat = prepare_state_data(pos_data, "Positive")
    trans_flat = prepare_state_data(trans_data, "Transition")
    
    return perform_clustering_and_visualization_umap_first(
        neg_flat, pos_flat, trans_flat, 
        pattern_name, config_str, min_cluster_size
    )


## 🔄 Alternative Approach: UMAP-First Analysis

This section tests a different approach: **3D UMAP first, then HDBSCAN clustering on the 3D embeddings**.

### Key Differences:
- **Original approach**: HDBSCAN clustering in 2304-dimensional space → 2D UMAP for visualization
- **New approach**: 3D UMAP dimensionality reduction → HDBSCAN clustering in 3D space → 2D UMAP for comparison

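### Why This Might Be Better:
1. **Density estimates are more reliable in low dimensions**: in 2304 dimensions pairwise distances concentrate, which makes HDBSCAN's density-based separation harder, while UMAP builds its embedding from local neighborhoods, so nearby points tend to stay nearby
2. **3D clustering space** is more manageable than 2304D but retains more structure than 2D
3. **May reveal cluster patterns** that are obscured by noise in the high-dimensional space
4. **Easier to inspect**: the clusters can be plotted directly in the space where they were computed

One caveat: UMAP does not preserve density or global distances, so clusters found in the embedding can partly reflect UMAP's parameters rather than the activations themselves.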
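
The "high-dimensional noise" argument can be illustrated with a small numerical sketch. It uses synthetic Gaussian points, not the notebook's activations, and the helper `relative_contrast` is a hypothetical name introduced only for this illustration. It measures how much farther the farthest point is than the nearest one, relative to the nearest distance; real activations are more structured than Gaussian noise, so the effect there is likely weaker.

```python
import numpy as np


def relative_contrast(n_points: int, n_dims: int, seed: int = 0) -> float:
    """Return (farthest - nearest) / nearest distance from one query point.

    Values near 0 mean all points look almost equally far away, which
    leaves a density-based method little signal to work with.
    """
    rng = np.random.default_rng(seed)
    points = rng.standard_normal((n_points, n_dims))  # synthetic stand-in data
    query = rng.standard_normal(n_dims)
    dists = np.linalg.norm(points - query, axis=1)
    return float((dists.max() - dists.min()) / dists.min())


contrast_3d = relative_contrast(500, 3)
contrast_2304d = relative_contrast(500, 2304)

print(f"Relative contrast in 3D:    {contrast_3d:.2f}")
print(f"Relative contrast in 2304D: {contrast_2304d:.2f}")
```

The contrast in 2304 dimensions should come out far smaller than in 3 dimensions, which is one reason to try clustering after the reduction rather than before it.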


In [25]:
# Test UMAP-First Approach: All tokens, all samples
print("\n" + "=" * 80)
print("🔄 UMAP-FIRST ANALYSIS: All Token Positions + All Samples")
print("=" * 80)

results_umap_first = perform_single_pattern_analysis_umap_first(
    pattern_name=selected_pattern,
    all_tokens=True,
    all_samples=True,
    min_cluster_size=30
)

print_analysis_summary(results_umap_first)


🔄 UMAP-FIRST ANALYSIS: All Token Positions + All Samples

🚀 UMAP-First Analysis Configuration: AllTokens_AllSamples
   🔄 NEW APPROACH: 3D UMAP → HDBSCAN clustering (vs original: HDBSCAN → 2D UMAP)
   Pattern: Executive Fatigue & Avolition
   Layer: 17 (fixed)
   Token sampling: All tokens
   Sample selection: All samples
   Min cluster size: 30
   Using 40 sample(s)
   Data shapes - Neg: torch.Size([40, 208, 2304]), Pos: torch.Size([40, 261, 2304]), Trans: torch.Size([40, 311, 2304])

📊 Preparing data for UMAP-first analysis...
     Negative: torch.Size([40, 208, 2304]) -> torch.Size([8320, 2304]) (all tokens)
     Positive: torch.Size([40, 261, 2304]) -> torch.Size([10440, 2304]) (all tokens)
     Transition: torch.Size([40, 311, 2304]) -> torch.Size([12440, 2304]) (all tokens)

🔬 UMAP-First Analysis: 3D UMAP → HDBSCAN Clustering
⚡ NEW APPROACH: Dimensionality reduction first, then clustering in 3D space

🎯 Processing Negative state (8320 samples)
   🗺️  Computing 3D UMAP embedding...

In [26]:
def compare_clustering_approaches(original_results, umap_first_results):
    """
    Compare the clustering results between original and UMAP-first approaches
    """
    
    if not original_results or not umap_first_results:
        print("❌ Cannot compare - missing results")
        return
    
    print("\n" + "=" * 80)
    print("📊 CLUSTERING APPROACH COMPARISON")
    print("=" * 80)
    
    print(f"Pattern: {original_results['pattern_name']}")
    print("Configuration: All Tokens + All Samples\n")
    
    # Extract clustering info for both approaches
    orig_clustering = original_results['clustering_results']
    umap_clustering = umap_first_results['clustering_results']
    
    print(f"{'State':<12} {'Approach':<15} {'Clusters':<9} {'Noise %':<8} {'Largest Cluster %':<18}")
    print("-" * 70)
    
    for state in ['Negative', 'Positive', 'Transition']:
        if state in orig_clustering:
            orig_info = orig_clustering[state]
            orig_noise_pct = orig_info['n_noise'] / len(orig_info['cluster_labels']) * 100
            
            # Find largest cluster
            labels, counts = np.unique(orig_info['cluster_labels'], return_counts=True)
            non_noise_counts = counts[labels != -1] if -1 in labels else counts
            largest_orig_pct = max(non_noise_counts) / len(orig_info['cluster_labels']) * 100 if len(non_noise_counts) > 0 else 0
            
            print(f"{state:<12} {'Original':<15} {orig_info['n_clusters']:<9} {orig_noise_pct:<8.1f} {largest_orig_pct:<18.1f}")
        
        if state in umap_clustering:
            umap_info = umap_clustering[state]
            umap_noise_pct = umap_info['n_noise'] / len(umap_info['cluster_labels']) * 100
            
            # Find largest cluster
            labels, counts = np.unique(umap_info['cluster_labels'], return_counts=True)
            non_noise_counts = counts[labels != -1] if -1 in labels else counts
            largest_umap_pct = max(non_noise_counts) / len(umap_info['cluster_labels']) * 100 if len(non_noise_counts) > 0 else 0
            
            print(f"{'':<12} {'UMAP-First':<15} {umap_info['n_clusters']:<9} {umap_noise_pct:<8.1f} {largest_umap_pct:<18.1f}")
        
        print()
    
    print("🔍 Key Insights:")
    print("  • Original: Clusters in full 2304D space, then visualizes with UMAP")
    print("  • UMAP-First: Reduces to 3D with UMAP, then clusters in 3D space")
    print("  • Lower noise % means more points were assigned to a cluster (not necessarily better clusters)")
    print("  • A very large 'Largest Cluster %' suggests the state was not split into distinct groups")
    print("  • Different approaches may reveal different organizational patterns")

# Compare the approaches if both results exist
if 'results_all_tokens_all_samples' in locals() and 'results_umap_first' in locals():
    compare_clustering_approaches(results_all_tokens_all_samples, results_umap_first)


In [1]:
# Analysis 3: All tokens, single sample
print("\n" + "=" * 80)
print("🔍 ANALYSIS 3: All Token Positions + Single Sample")
print("=" * 80)

results_all_tokens_single_sample = perform_single_pattern_analysis(
    pattern_name=selected_pattern,
    all_tokens=True,
    all_samples=False,
    single_sample_idx=0,
    min_cluster_size=5  # Much smaller cluster size for single sample
)

print_analysis_summary(results_all_tokens_single_sample)


🔍 ANALYSIS 3: All Token Positions + Single Sample


NameError: name 'perform_single_pattern_analysis' is not defined

## Analysis 4: Last Token Only + Single Sample

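This analysis examines the last-token activation of each cognitive state (negative, positive, transition) within a single example. With one sample and only the last token, each state contributes a single activation vector, so within-state clustering has nothing to group; treat this variant as a sanity check or a point comparison rather than a clustering result.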
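
The number of points each variant feeds into clustering can be checked with a small shape sketch. It assumes activations shaped `(samples, tokens, hidden)` as in the printed shapes for the negative state, uses a small hidden size in place of 2304 to keep it light, and the helper `select_tokens` is a hypothetical stand-in for the notebook's token-sampling step.

```python
import numpy as np


def select_tokens(data: np.ndarray, all_tokens: bool) -> np.ndarray:
    """Flatten (samples, tokens, hidden) activations into (points, hidden)."""
    if all_tokens:
        # Every token position of every sample becomes one point.
        return data.reshape(-1, data.shape[-1])
    # Only the last position of each sample (assumes it is a real token,
    # not padding).
    return data[:, -1, :]


hidden = 16  # small stand-in for the real hidden size
all_samples = np.zeros((40, 208, hidden), dtype=np.float32)
single_sample = all_samples[[0]]  # keep the leading sample axis

shapes = {
    "All tokens + all samples": select_tokens(all_samples, True).shape,
    "Last token + all samples": select_tokens(all_samples, False).shape,
    "All tokens + single sample": select_tokens(single_sample, True).shape,
    "Last token + single sample": select_tokens(single_sample, False).shape,
}

for name, shape in shapes.items():
    print(f"{name:<28} -> {shape[0]} point(s)")
```

The last variant yields one point per state, which is why `min_cluster_size` has to shrink so sharply for the single-sample runs.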

In [None]:
# Analysis 4: Last token only, single sample
print("\n" + "=" * 80)
print("🔍 ANALYSIS 4: Last Token Only + Single Sample")
print("=" * 80)

results_last_token_single_sample = perform_single_pattern_analysis(
    pattern_name=selected_pattern,
    all_tokens=False,
    all_samples=False,
    single_sample_idx=0,
    min_cluster_size=2  # Very small cluster size for single sample last tokens only
)

print_analysis_summary(results_last_token_single_sample)

## Comparative Analysis

Compare the results across different analysis approaches.
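
The per-state quantities used in these comparisons (cluster count, noise share, largest-cluster share) can all be derived from a label array alone. The sketch below assumes the HDBSCAN convention that `-1` marks noise; `cluster_stats` is a hypothetical helper, not a function defined elsewhere in this notebook.

```python
from collections import Counter


def cluster_stats(labels):
    """Summarize cluster labels where -1 marks noise points."""
    labels = list(labels)
    total = len(labels)
    if total == 0:
        return {"n_clusters": 0, "noise_pct": 0.0, "largest_cluster_pct": 0.0}

    counts = Counter(labels)
    n_noise = counts.pop(-1, 0)  # remove noise so only real clusters remain
    largest = max(counts.values(), default=0)

    return {
        "n_clusters": len(counts),
        "noise_pct": n_noise / total * 100,
        "largest_cluster_pct": largest / total * 100,
    }


# Toy labels: two noise points, cluster 0 with two points, cluster 1 with four
stats = cluster_stats([-1, 0, 0, 1, 1, 1, 1, -1])
print(stats)
```

For NumPy label arrays, passing `labels.tolist()` keeps the keys as plain Python integers.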

In [None]:
def compare_analyses(analysis_results):
    """
    Compare the results from different analysis approaches
    """
    print("\n" + "=" * 80)
    print("📊 COMPARATIVE ANALYSIS SUMMARY")
    print("=" * 80)
    
    comparison_data = []
    
    for name, results in analysis_results.items():
        if results is None:
            continue
            
        clustering_results = results['clustering_results']
        total_points = len(results['labels'])
        
        # Aggregate statistics
        total_clusters = sum(info['n_clusters'] for info in clustering_results.values())
        total_noise = sum(info['n_noise'] for info in clustering_results.values())
        total_clustered = total_points - total_noise
        
        comparison_data.append({
            'analysis': name,
            'total_points': total_points,
            'total_clusters': total_clusters,
            'clustered_points': total_clustered,
            'noise_points': total_noise,
            'cluster_rate': total_clustered / total_points * 100 if total_points > 0 else 0
        })
    
    # Print comparison table
    print(f"{'Analysis':<25} {'Points':<8} {'Clusters':<9} {'Clustered':<10} {'Noise':<8} {'Cluster %':<10}")
    print("-" * 80)
    
    for data in comparison_data:
        print(f"{data['analysis']:<25} {data['total_points']:<8} {data['total_clusters']:<9} "
              f"{data['clustered_points']:<10} {data['noise_points']:<8} {data['cluster_rate']:.1f}%")
    
    print(f"\n💡 Interpretation Guidelines:")
    print(f"  • All Tokens analyses show sequential processing patterns")
    print(f"  • Last Token analyses focus on final cognitive states")
    print(f"  • All Samples analyses reveal population-level patterns")
    print(f"  • Single Sample analyses show individual example dynamics")
    print(f"  • Higher cluster rates suggest more structured activation patterns")
    print(f"  • Noise points indicate unique or transitional activation states")

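# Collect all analysis results for comparison.
# globals().get() returns None for analyses that were not run (or failed),
# and compare_analyses() skips None entries instead of raising a NameError.
all_analyses = {
    'All Tokens + All Samples': globals().get('results_all_tokens_all_samples'),
    'Last Token + All Samples': globals().get('results_last_token_all_samples'),
    'All Tokens + Single Sample': globals().get('results_all_tokens_single_sample'),
    'Last Token + Single Sample': globals().get('results_last_token_single_sample')
}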

compare_analyses(all_analyses)

## Conclusion

This notebook provides a comprehensive analysis framework for single cognitive patterns using UMAP + HDBSCAN.

### Key Improvements:
1. **Proper clustering scope**: Clustering performed within each cognitive state (positive/negative/transition) separately
2. **Single pattern focus**: Analysis limited to one cognitive pattern at a time for cleaner interpretation
3. **Flexible token sampling**: Options for all tokens vs. last token analysis
4. **Sample size control**: Analysis of all samples vs. single sample for different research questions

### Usage Notes:
- Change `selected_pattern` variable to analyze different cognitive patterns
- Adjust `min_cluster_size` parameters based on your data size and desired granularity
- The generated HTML files can be opened in any web browser for interactive exploration
- Each analysis answers different research questions about neural activation patterns