# Semantic Gravity Mapping (SGM)

This notebook implements a system to map LLM semantic structure by generating word association graphs.

**What it does:**
1. **Phase 1 - Seed & Crawl**: Generate association graph via BFS from 100 seed concepts
2. **Phase 2 - Logprob Scoring**: Weight edges using logprob extraction
3. **Phase 3 - Topology Analysis**: Analyze hubs, convergence, islands, and asymmetry

**Expected Runtime:** 45-85 minutes on Colab T4 GPU

---

## Recommended Setup: VS Code/Cursor + Colab Extension

**This notebook works best with the Google Colab VS Code extension** (launched Nov 2025):
- Keep your notebook **file local** (easy Git workflow)
- Run code on **remote Colab GPU/TPU** (free T4 or Pro A100)
- Use your **local IDE** (extensions, debugging, linting)

**Quick Setup:**
1. Install "Google Colab" extension in VS Code/Cursor
2. Open this notebook in VS Code/Cursor
3. Click "Select Kernel" → "Colab" → "New Colab Server"
4. Sign in with your Google account
5. Run the notebook - it will auto-setup the remote environment

**Alternative:** You can also run this in traditional Colab web UI or locally with your own GPU.

---

## Cell 1: Environment Setup

Detects environment (VS Code + Colab, traditional Colab, or local), checks GPU, installs dependencies, and syncs local code to remote runtime.

In [1]:
import os
import sys
from pathlib import Path

# Detect if running in Colab runtime (includes VS Code + Colab extension)
IN_COLAB = 'google.colab' in sys.modules

print(f"Running in: {'Colab Runtime (GPU/TPU)' if IN_COLAB else 'Local Environment (CPU/GPU)'}")

if IN_COLAB:
    print("\n=== Setting up Colab Runtime ===")
    print("(Works with both traditional Colab and VS Code/Cursor + Colab extension)")
    
    # Check GPU
    print("\n1. Checking GPU...")
    !nvidia-smi --query-gpu=name,memory.total --format=csv
    
    # Install dependencies
    print("\n2. Installing dependencies (5-10 minutes)...")
    !pip install -q vllm networkx tqdm matplotlib seaborn
    
    # NOTE: Google Drive mounting (drive.mount) is NOT supported in VS Code + Colab extension
    # See: https://github.com/googlecolab/colab-vscode/issues/256
    # Workaround: Use /content/ directory which persists during the session
    
    print("\n3. Setting up storage...")
    print("   ⚠️  Google Drive mounting not supported in VS Code + Colab extension")
    print("   ✓  Using /content/ directory (persists during session)")
    print("   💡 Download results at end of session")
    
    # Create checkpoint and output directories in /content/
    !mkdir -p /content/sgm_checkpoints
    !mkdir -p /content/sgm_outputs
    
    # Clone/sync repo to Colab runtime
    # NOTE: When using VS Code + Colab extension, your LOCAL files are NOT automatically
    # available to the REMOTE Colab runtime. We need to clone the repo.
    repo_url = "https://github.com/ChuloIva/align_prompts"  # UPDATE THIS
    
    if not Path('/content/align_prompts').exists():
        print(f"\n4. Cloning repository to Colab runtime...")
        print(f"   Repo: {repo_url}")
        !git clone {repo_url} /content/align_prompts
    else:
        print(f"\n4. Repository already exists. Pulling latest changes...")
        !cd /content/align_prompts && git pull
    
    # Add repo to Python path
    sys.path.insert(0, '/content/align_prompts')
    print(f"   Added to sys.path: /content/align_prompts")
    
    print("\n✅ Colab runtime ready!")
    print("\n💡 Important Notes:")
    print("   • Your notebook file is local, but code runs on Colab GPU")
    print("   • Files saved to /content/ persist during your session")
    print("   • Download results before disconnecting (session timeout ~12hrs)")
    
else:
    print("\n=== Local Environment ===")
    print("Make sure you have:")
    print("  - vllm installed: pip install vllm")
    print("  - networkx installed: pip install networkx")
    print("  - tqdm, matplotlib, seaborn: pip install tqdm matplotlib seaborn")
    print("  - vLLM server running on localhost:8000")
    print("\nOr install Colab extension for free GPU: https://marketplace.visualstudio.com/items?itemName=Google.colab")

Running in: Colab Runtime (GPU/TPU)

=== Setting up Colab Runtime ===
(Works with both traditional Colab and VS Code/Cursor + Colab extension)

1. Checking GPU...
name, memory.total [MiB]
NVIDIA A100-SXM4-40GB, 40960 MiB

2. Installing dependencies (5-10 minutes)...

3. Setting up storage...
   ⚠️  Google Drive mounting not supported in VS Code + Colab extension
   ✓  Using /content/ directory (persists during session)
   💡 Download results at end of session

4. Repository already exists. Pulling latest changes...
Already up to date.
   Added to sys.path: /content/align_prompts

✅ Colab runtime ready!

💡 Important Notes:
   • Your notebook file is local, but code runs on Colab GPU
   • Files saved to /content/ persist during your session
   • Download results before disconnecting (session timeout ~12hrs)


## Cell 2: Configuration

Set model, paths, and hyperparameters.

In [None]:
# Configuration
CONFIG = {
    # Model settings
    'model': 'google/gemma-3-4b-it',  # Model to use
    'vllm_base_url': 'http://localhost:8000/v1',  # vLLM server URL
    
    # Graph generation settings
    'max_hops': 3,  # BFS depth (3 = ~15k edges)
    'associations_per_word': 5,  # Associations per word
    
    # Checkpoint settings (using /content/ instead of Google Drive)
    'checkpoint_dir': '/content/sgm_checkpoints' if IN_COLAB else './data/sgm/checkpoints',
    'output_dir': '/content/sgm_outputs' if IN_COLAB else './data/sgm/graphs',
    
    # Optimization settings
    'temperature_associations': 0.7,  # Temperature for Phase 1
    'temperature_scoring': 0.0,  # Temperature for Phase 2 (deterministic)
    'batch_size': 32,  # Concurrent requests
    
    # Resume settings
    'resume': True  # Resume from checkpoint if available
}

print("Configuration:")
for key, value in CONFIG.items():
    print(f"  {key}: {value}")
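
The `~15k edges` estimate next to `max_hops` can be checked with a little arithmetic. The sketch below assumes the worst case, in which every association is a word not seen before, so the frontier grows by a factor of `associations_per_word` at each hop. The function name `max_bfs_edges` is made up for illustration and is not part of the repository; the real graph will usually be smaller because associations repeat.

```python
def max_bfs_edges(num_seeds: int, associations_per_word: int, max_hops: int) -> int:
    """Upper bound on edges when every expanded word yields only unseen words."""
    total = 0
    frontier = num_seeds
    for _ in range(max_hops):
        new_edges = frontier * associations_per_word
        total += new_edges
        frontier = new_edges  # worst case: each association opens a new node
    return total


# 100 seeds, 5 associations per word, 3 hops: 500 + 2,500 + 12,500
print(max_bfs_edges(100, 5, 3))  # 15500
```

Raising `max_hops` to 4 under the same assumption multiplies the last term by five again, which is why depth is the most expensive setting to increase.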

## Cell 2.5: Hugging Face Authentication

Gemma models are gated - you need to authenticate with Hugging Face to download them.

In [None]:
if IN_COLAB:
    print("🔐 Hugging Face Authentication Required")
    print("=" * 50)
    print(f"\nModel '{CONFIG['model']}' is a gated model.")
    print("\n📋 Steps to get access:")
    print("   1. Go to: https://huggingface.co/google/gemma-3-4b-it")
    print("   2. Click 'Agree and access repository'")
    print("   3. Get your token: https://huggingface.co/settings/tokens")
    print("   4. Create a token with 'read' permissions")
    print("\n")
    
    # Try to use huggingface-cli login
    try:
        from huggingface_hub import login
        import getpass
        
        # Check if already logged in
        try:
            from huggingface_hub import HfFolder
            token = HfFolder.get_token()
            if token:
                print("✅ Already logged in to Hugging Face!")
            else:
                raise Exception("Not logged in")
        except Exception:
            # No cached token (or HfFolder unavailable): prompt for one
            print("Please paste your Hugging Face token below:")
            print("(Token will be hidden while typing)")
            hf_token = getpass.getpass("HF Token: ")
            
            # Login with token
            login(token=hf_token, add_to_git_credential=False)
            print("\n✅ Successfully authenticated with Hugging Face!")
    
    except ImportError:
        print("⚠️  huggingface_hub not installed. Installing...")
        !pip install -q huggingface_hub
        print("✅ Installed. Please re-run this cell to authenticate.")
        
else:
    print("Local environment - make sure you're logged in to Hugging Face:")
    print("  huggingface-cli login")
    print("\nOr set the HF_TOKEN environment variable.")

## Cell 3: Start vLLM Server

Starts the vLLM OpenAI-compatible API server in the background and polls its health endpoint until it responds.

In [None]:
if IN_COLAB:
    print("Starting vLLM server in background...")
    print(f"Model: {CONFIG['model']}")
    print("\n⏱️  Note: First run may take 5-10 minutes to download the model (~8GB)")
    print("   Subsequent runs will be much faster (model is cached)\n")
    
    # Kill any existing vLLM servers first
    print("1. Cleaning up any existing vLLM processes...")
    !pkill -f "vllm.entrypoints.openai.api_server" 2>/dev/null || true
    !sleep 2
    
    # Start vLLM server in background with more verbose logging
    print("2. Starting vLLM server...")
    vllm_cmd = f"""
    nohup python -m vllm.entrypoints.openai.api_server \
        --model {CONFIG['model']} \
        --gpu-memory-utilization 0.9 \
        --max-model-len 2048 \
        --port 8000 \
        --trust-remote-code \
        > /tmp/vllm_server.log 2>&1 &
    """
    
    !{vllm_cmd}
    
    # Wait for server to be ready with better monitoring
    import time
    import requests
    
    print("3. Waiting for vLLM server to initialize...")
    print("   (Checking server health every 5 seconds)\n")
    
    max_wait_time = 600  # 10 minutes
    check_interval = 5   # 5 seconds
    max_iterations = max_wait_time // check_interval
    
    for i in range(max_iterations):
        try:
            response = requests.get('http://localhost:8000/health', timeout=2)
            if response.status_code == 200:
                elapsed = i * check_interval
                print(f"\n✅ vLLM server is ready! (took {elapsed}s)")
                
                # Test the server with a simple request
                print("\n4. Testing server with sample request...")
                try:
                    test_response = requests.post(
                        'http://localhost:8000/v1/completions',
                        json={'model': CONFIG['model'], 'prompt': 'Hello', 'max_tokens': 5},
                        timeout=10
                    )
                    if test_response.status_code == 200:
                        print("✅ Server test successful!")
                    else:
                        print(f"⚠️  Server responded but with status {test_response.status_code}")
                except Exception as e:
                    print(f"⚠️  Server test failed: {e}")
                break
        except requests.exceptions.RequestException:
            pass
        
        # Show progress every 20 seconds
        if i > 0 and i % 4 == 0:
            elapsed = i * check_interval
            print(f"   Still initializing... ({elapsed}s elapsed)")
            
            # Show last few lines of log for progress
            print("   Latest log output:")
            !tail -n 3 /tmp/vllm_server.log 2>/dev/null | sed 's/^/     /'
            print()
        
        time.sleep(check_interval)
    else:
        elapsed = max_iterations * check_interval
        print(f"\n⚠️  Server didn't respond after {elapsed}s")
        print("\n📋 Full server log:")
        !cat /tmp/vllm_server.log
        print("\n💡 Troubleshooting:")
        print("   • Check if vLLM process is running: !ps aux | grep vllm")
        print("   • Check GPU memory: !nvidia-smi")
        print("   • Make sure you're authenticated with Hugging Face (run previous cell)")
        print("   • Try restarting the Colab runtime")
        print("   • The model might be too large for the GPU")
        
else:
    print("Local environment - assuming vLLM server is already running.")
    print(f"Make sure vLLM is serving {CONFIG['model']} on {CONFIG['vllm_base_url']}")
    print("\nTo start vLLM locally, run:")
    print(f"  python -m vllm.entrypoints.openai.api_server --model {CONFIG['model']} --port 8000")
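
The readiness loop in the cell above follows a general poll-until-ready pattern. As a minimal sketch, the same idea can be written as a reusable helper that takes the health check as a callable, which makes it testable without a running server. The name `wait_until_ready` and its parameters are illustrative assumptions, not part of vLLM or this repository.

```python
import time
from typing import Callable


def wait_until_ready(
    check: Callable[[], bool],
    max_wait: float = 600.0,
    interval: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll check() until it returns True or the attempt budget is used up."""
    attempts = int(max_wait // interval)
    for _ in range(attempts):
        try:
            if check():
                return True
        except Exception:
            # Any failure (for example, connection refused) means "not ready yet".
            pass
        sleep(interval)
    return False


# Demo with a fake check that succeeds on the third call; sleeping is stubbed out.
calls = {"n": 0}


def fake_check() -> bool:
    calls["n"] += 1
    return calls["n"] >= 3


ready = wait_until_ready(fake_check, max_wait=30, interval=5, sleep=lambda _: None)
print(ready, calls["n"])  # True 3
```

In the notebook, `check` would wrap the `requests.get('http://localhost:8000/health', timeout=2)` call and return whether the status code is 200.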

## Helper: Check vLLM Server Status

Run this cell if the server isn't starting or you want to see what's happening.

In [None]:
# Helper cell to check vLLM server status
if IN_COLAB:
    print("🔍 vLLM Server Diagnostics")
    print("=" * 50)
    
    # Check if process is running
    print("\n1. Checking if vLLM process is running:")
    !ps aux | grep -E "[v]llm.entrypoints" || echo "   ❌ No vLLM process found"
    
    # Check port
    print("\n2. Checking if port 8000 is in use:")
    !lsof -i :8000 || echo "   ❌ Port 8000 not in use"
    
    # Check server health endpoint
    print("\n3. Testing health endpoint:")
    import requests
    try:
        response = requests.get('http://localhost:8000/health', timeout=2)
        print(f"   ✅ Server responding! Status: {response.status_code}")
    except Exception as e:
        print(f"   ❌ Server not responding: {e}")
    
    # Show recent logs
    print("\n4. Recent server logs (last 20 lines):")
    print("-" * 50)
    !tail -n 20 /tmp/vllm_server.log 2>/dev/null || echo "   ❌ No log file found"
    print("-" * 50)
    
    # GPU status
    print("\n5. GPU Memory Status:")
    !nvidia-smi --query-gpu=memory.used,memory.total --format=csv
    
    print("\n💡 To see full logs, run: !cat /tmp/vllm_server.log")
    print("💡 To restart server, re-run the 'Start vLLM Server' cell above")
    
else:
    print("Local environment - check your vLLM server manually")
    print("  ps aux | grep vllm")
    print("  curl http://localhost:8000/health")

In [None]:
if IN_COLAB:
    print("Starting vLLM server in background...")
    print(f"Model: {CONFIG['model']}")
    print("\nThis may take 2-3 minutes to download and load the model.")
    
    # Start vLLM server in background
    vllm_cmd = f"""
    nohup python -m vllm.entrypoints.openai.api_server \
        --model {CONFIG['model']} \
        --gpu-memory-utilization 0.9 \
        --max-model-len 2048 \
        --port 8000 \
        > /tmp/vllm_server.log 2>&1 &
    """
    
    !{vllm_cmd}
    
    # Wait for server to be ready
    import time
    import requests
    
    print("\nWaiting for vLLM server to start...")
    for i in range(60):  # Poll every 2 seconds, up to 120 seconds in total
        try:
            response = requests.get('http://localhost:8000/health', timeout=2)
            if response.status_code == 200:
                print("\n✅ vLLM server is ready!")
                break
        except requests.exceptions.RequestException:
            # Server not accepting connections yet
            pass
        time.sleep(2)
        if i % 10 == 0:
            print(f"  Still waiting... ({(i + 1) * 2}s)")
    else:
        print("\n⚠️ Server didn't respond in time. Check logs: !tail /tmp/vllm_server.log")
        
else:
    print("Local environment - assuming vLLM server is already running.")
    print(f"Make sure vLLM is serving {CONFIG['model']} on {CONFIG['vllm_base_url']}")

## Cell 4: Initialize Components

Create engine, checkpoint manager, and check for existing checkpoints.

In [None]:
from align_test.core.vllm_client import VLLMClient
from align_test.sgm.inference.batch_inference import SGMInferenceEngine
from align_test.sgm.storage.checkpoint_manager import CheckpointManager
from align_test.sgm.models.seed_domains import get_all_seeds, get_domain_names

# Initialize vLLM client
print("Initializing components...\n")

vllm_client = VLLMClient(
    base_url=CONFIG['vllm_base_url'],
    model=CONFIG['model']
)

print(f"✓ VLLMClient: {vllm_client}")

# Initialize inference engine
inference_engine = SGMInferenceEngine(
    client=vllm_client,
    temperature=CONFIG['temperature_associations'],
    batch_size=CONFIG['batch_size']
)

print(f"✓ SGMInferenceEngine: {inference_engine}")

# Initialize checkpoint manager
checkpoint_manager = CheckpointManager(
    checkpoint_dir=CONFIG['checkpoint_dir'],
    config=CONFIG
)

print(f"✓ CheckpointManager: {checkpoint_manager.checkpoint_dir}")

# Check for existing checkpoints
resume_info = checkpoint_manager.get_resume_info()

if resume_info:
    print(f"\n📁 Found checkpoint: Phase {resume_info['phase']}, Iteration {resume_info['iteration']}")
    print(f"   Timestamp: {resume_info['timestamp']}")
    print(f"   Can resume from: {resume_info['checkpoint_file']}")
else:
    print("\n📁 No existing checkpoints found - starting fresh")

# Display seed information
seeds = get_all_seeds()
domains = get_domain_names()

print(f"\n🌱 Seeds: {len(seeds)} concepts across {len(domains)} domains")
print(f"   Domains: {', '.join(domains)}")
print(f"   Sample seeds: {', '.join(seeds[:5])}...")
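
The internals of `CheckpointManager` are not shown in this notebook. As a rough sketch of the save-and-resume idea it relies on, the example below writes JSON checkpoints atomically and picks the most recent one by filename. The file naming scheme and the functions `save_checkpoint` and `latest_checkpoint` are hypothetical and may differ from the repository's implementation.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Optional


def save_checkpoint(directory: str, phase: int, iteration: int, state: dict) -> Path:
    """Write a checkpoint to a temp file, then rename it into place atomically."""
    folder = Path(directory)
    folder.mkdir(parents=True, exist_ok=True)
    path = folder / f"phase{phase}_iter{iteration:06d}.json"
    tmp = path.with_suffix(".tmp")
    tmp.write_text(json.dumps({"phase": phase, "iteration": iteration, "state": state}))
    os.replace(tmp, path)  # a crash mid-write never leaves a half-written .json
    return path


def latest_checkpoint(directory: str) -> Optional[dict]:
    """Return the newest checkpoint (zero-padded names sort chronologically)."""
    files = sorted(Path(directory).glob("phase*_iter*.json"))
    return json.loads(files[-1].read_text()) if files else None


# Demo in a throwaway directory.
with tempfile.TemporaryDirectory() as tmpdir:
    save_checkpoint(tmpdir, phase=1, iteration=500, state={"visited": ["sun"]})
    save_checkpoint(tmpdir, phase=1, iteration=1000, state={"visited": ["sun", "light"]})
    resume = latest_checkpoint(tmpdir)
    print(resume["phase"], resume["iteration"])  # 1 1000
```

The zero-padded iteration number is what lets a plain lexicographic sort find the latest file; this sketch assumes single-digit phase numbers.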

## Cell 5: Phase 1 - Seed & Crawl

Generate association graph via BFS expansion. This will take 30-60 minutes on T4 GPU.

In [None]:
from align_test.sgm.core.graph_builder import GraphBuilder

print("=" * 70)
print("PHASE 1: SEED & CRAWL - BFS Graph Generation")
print("=" * 70)

# Initialize graph builder
graph_builder = GraphBuilder(
    engine=inference_engine,
    checkpoint_manager=checkpoint_manager,
    checkpoint_interval=500
)

# Build graph
import time
start_time = time.time()

raw_graph = graph_builder.build_graph(
    max_hops=CONFIG['max_hops'],
    associations_per_word=CONFIG['associations_per_word'],
    resume=CONFIG['resume']
)

elapsed_time = time.time() - start_time

print(f"\n⏱️  Phase 1 completed in {elapsed_time/60:.1f} minutes")
print(f"\n📊 Final Graph:")
print(f"   Nodes: {raw_graph.number_of_nodes():,}")
print(f"   Edges: {raw_graph.number_of_edges():,}")

# Get statistics
stats = graph_builder.get_statistics()
print(f"\n📈 Statistics:")
print(f"   Visited nodes: {stats['num_visited']:,}")
print(f"   Avg out-degree: {stats['avg_out_degree']:.2f}")
print(f"   Max out-degree: {stats['max_out_degree']}")
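
`GraphBuilder.build_graph` hides the crawl itself. The sketch below shows one way a BFS expansion with a hop limit can work, using a stubbed association function in place of the LLM call. The function `crawl`, the `associate` callable, and the toy word table are assumptions for illustration; the real builder also handles batching and checkpointing, which are omitted here.

```python
from collections import deque
from typing import Callable, Dict, Iterable, List, Tuple


def crawl(
    seeds: Iterable[str],
    associate: Callable[[str], List[str]],
    max_hops: int,
) -> Dict[Tuple[str, str], int]:
    """BFS from the seeds; returns {(source, target): hop at which the edge appeared}."""
    seeds = list(seeds)
    edges: Dict[Tuple[str, str], int] = {}
    visited = set(seeds)
    queue = deque((seed, 0) for seed in seeds)
    while queue:
        word, depth = queue.popleft()
        if depth >= max_hops:
            continue  # words at the hop limit are kept as nodes but not expanded
        for target in associate(word):
            edges.setdefault((word, target), depth + 1)
            if target not in visited:
                visited.add(target)
                queue.append((target, depth + 1))
    return edges


# Toy association table standing in for the model.
table = {"sun": ["light", "heat"], "light": ["sun", "lamp"], "heat": ["fire"]}
edges = crawl(["sun"], lambda w: table.get(w, []), max_hops=2)
print(len(edges))               # 5
print(edges[("light", "lamp")])  # 2
```

Each word is expanded at most once, so an edge back to an already visited word (such as `light` to `sun`) is recorded but does not trigger another expansion.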

## Cell 6: Phase 1 Results Preview

Visualize sample associations and graph structure.

In [None]:
import matplotlib.pyplot as plt
import networkx as nx

print("=" * 70)
print("PHASE 1: RESULTS PREVIEW")
print("=" * 70)

# Show sample paths
print("\n🔍 Sample Association Paths:")
graph_builder.preview_sample_paths(n=5)

# Analyze hop distribution
hop_counts = {}
for _, _, data in raw_graph.edges(data=True):
    hop = data.get('hop', 0)
    hop_counts[hop] = hop_counts.get(hop, 0) + 1

print("\n📊 Edge Distribution by Hop:")
for hop in sorted(hop_counts.keys()):
    count = hop_counts[hop]
    print(f"   Hop {hop}: {count:,} edges ({count/raw_graph.number_of_edges()*100:.1f}%)")

# Visualize hop distribution
plt.figure(figsize=(10, 5))
plt.bar(hop_counts.keys(), hop_counts.values())
plt.xlabel('Hop')
plt.ylabel('Number of Edges')
plt.title('Edge Distribution by BFS Hop')
plt.grid(True, alpha=0.3)
plt.show()

# Show top nodes by degree
degree_dict = dict(raw_graph.in_degree())
top_nodes = sorted(degree_dict.items(), key=lambda x: x[1], reverse=True)[:10]

print("\n🎯 Top 10 Nodes by In-Degree (most associated with):")
for i, (node, degree) in enumerate(top_nodes, 1):
    print(f"   {i:2d}. {node:20s} (degree: {degree})")

## Cell 7: Phase 2 - Logprob Scoring

Score edge weights using logprob extraction. This will take 10-20 minutes.

In [None]:
from align_test.sgm.core.logprob_scorer import LogprobScorer

print("=" * 70)
print("PHASE 2: LOGPROB SCORING - Edge Weight Assignment")
print("=" * 70)

# Update engine temperature for deterministic scoring
inference_engine.temperature = CONFIG['temperature_scoring']

# Initialize scorer
logprob_scorer = LogprobScorer(
    engine=inference_engine,
    checkpoint_manager=checkpoint_manager,
    checkpoint_interval=2000
)

# Score all edges
import time
start_time = time.time()

weighted_graph = logprob_scorer.score_all_edges(
    graph=raw_graph,
    resume=CONFIG['resume'],
    show_progress=True
)

elapsed_time = time.time() - start_time

print(f"\n⏱️  Phase 2 completed in {elapsed_time/60:.1f} minutes")

# Get weight statistics
weight_stats = logprob_scorer.get_weight_statistics(weighted_graph)

print(f"\n📊 Weight Statistics:")
print(f"   Mean: {weight_stats['mean_weight']:.4f}")
print(f"   Median: {weight_stats['median_weight']:.4f}")
print(f"   Min: {weight_stats['min_weight']:.4f}")
print(f"   Max: {weight_stats['max_weight']:.4f}")
print(f"   Std: {weight_stats['std_weight']:.4f}")
print(f"   Scored edges: {weight_stats['num_scored']:,}")
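
How `LogprobScorer` turns log probabilities into edge weights is not shown here. One common choice, assumed in the sketch below, is to sum the per-token log probabilities of the target word given the source prompt and exponentiate, which gives the joint probability of the target's tokens. The function `edge_weight` is hypothetical and the repository may use a different normalization.

```python
import math
from typing import Sequence


def edge_weight(token_logprobs: Sequence[float]) -> float:
    """Joint probability of the target tokens: exp(sum of per-token logprobs)."""
    return math.exp(sum(token_logprobs))


# A two-token target where each token has probability 0.5.
weight = edge_weight([math.log(0.5), math.log(0.5)])
print(round(weight, 6))  # 0.25
```

Under this assumption, weights lie in (0, 1] and targets that split into more tokens tend to score lower, which is worth keeping in mind when comparing edges.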

## Cell 8: Phase 2 Results Preview

Analyze strongest and weakest associations.

In [None]:
print("=" * 70)
print("PHASE 2: RESULTS PREVIEW")
print("=" * 70)

# Show top edges (strongest associations)
top_edges = logprob_scorer.get_top_edges(weighted_graph, n=10, sort_by='weight')

print("\n💪 Top 10 Strongest Associations (by weight):")
for i, (u, v, w) in enumerate(top_edges, 1):
    print(f"   {i:2d}. {u:15s} → {v:15s} (weight: {w:.4f})")

# Show bottom edges (weakest associations)
bottom_edges = logprob_scorer.get_bottom_edges(weighted_graph, n=10, sort_by='weight')

print("\n🔻 10 Weakest Associations (by weight):")
for i, (u, v, w) in enumerate(bottom_edges, 1):
    print(f"   {i:2d}. {u:15s} → {v:15s} (weight: {w:.4f})")

# Visualize weight distribution
print("\n📊 Weight Distribution:")
logprob_scorer.visualize_weight_distribution(weighted_graph)

## Cell 9: Phase 3 - Topology Analysis

Analyze graph topology to find hubs, convergence patterns, islands, and asymmetries.

In [None]:
from align_test.sgm.core.topology_analyzer import TopologyAnalyzer

print("=" * 70)
print("PHASE 3: TOPOLOGY ANALYSIS")
print("=" * 70)

# Initialize analyzer
analyzer = TopologyAnalyzer(weighted_graph)

# Run all analyses
import time
start_time = time.time()

results = analyzer.analyze_all()

elapsed_time = time.time() - start_time

print(f"\n⏱️  Phase 3 completed in {elapsed_time:.1f} seconds")

# Print summary
analyzer.print_summary(results)

# Export results to JSON
output_path = Path(CONFIG['output_dir']) / 'topology_metrics.json'
analyzer.export_results(results, str(output_path))

# Save final graph
print("\n💾 Saving final graph...")
checkpoint_manager.save_graph(
    graph=weighted_graph,
    filename='semantic_graph_final',
    include_metadata=True
)
print(f"   Graph saved to: {CONFIG['output_dir']}/semantic_graph_final.gpickle")
print(f"   Edge list (CSV): {CONFIG['output_dir']}/semantic_graph_final.csv")
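
Of the four analyses, asymmetry is the easiest to illustrate in isolation. The sketch below assumes asymmetry is the absolute difference between the weight of an edge and the weight of its reverse, with a missing reverse edge counted as weight 0. The function `asymmetry_scores` and this definition are assumptions; `TopologyAnalyzer` may compute it differently.

```python
from typing import Dict, List, Tuple


def asymmetry_scores(weights: Dict[Tuple[str, str], float]) -> List[Tuple[str, str, float]]:
    """Return (source, target, |w(u,v) - w(v,u)|), oriented along the stronger direction."""
    seen = set()
    scores = []
    for (u, v), w_uv in weights.items():
        pair = frozenset((u, v))
        if u == v or pair in seen:
            continue  # skip self-loops and pairs already handled in reverse
        seen.add(pair)
        w_vu = weights.get((v, u), 0.0)  # missing reverse edge counts as 0
        source, target = (u, v) if w_uv >= w_vu else (v, u)
        scores.append((source, target, abs(w_uv - w_vu)))
    return sorted(scores, key=lambda item: item[2], reverse=True)


toy_weights = {("smoke", "fire"): 0.25, ("fire", "smoke"): 0.75, ("sun", "light"): 0.125}
ranked = asymmetry_scores(toy_weights)
print(ranked[0])  # ('fire', 'smoke', 0.5)
```

Each unordered pair is reported once, so the output length equals the number of distinct word pairs rather than the number of directed edges.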

## Cell 10: Results Summary & Export

Final summary and file access instructions. Results are written to `CONFIG['output_dir']` (`/content/sgm_outputs/` on the Colab runtime), not to Google Drive.

In [None]:
print("=" * 70)
print("🎉 SEMANTIC GRAVITY MAPPING - COMPLETE!")
print("=" * 70)

# Summary of findings
print("\n📝 Key Findings:")
print("\n1. Semantic Attractors (Hubs):")
top_hubs = results['hubs'][:5]
for hub in top_hubs:
    print(f"   • {hub['word']} (PageRank: {hub['pagerank']:.4f})")

print("\n2. Convergence Analysis:")
conv = results['convergence']
print(f"   • Overall avg hops to hubs: {conv['overall_avg_hops']:.2f}")
fastest_domain = min(conv['by_domain'].items(), key=lambda x: x[1]['avg_hops'])
print(f"   • Fastest converging domain: {fastest_domain[0]} ({fastest_domain[1]['avg_hops']:.2f} hops)")

print("\n3. Isolated Domains (Islands):")
if results['islands']:
    for island in results['islands'][:3]:
        print(f"   • Size {island['size']}: {', '.join(island['words'][:3])}...")
else:
    print("   • No isolated clusters found")

print("\n4. Asymmetric Associations (Narrative Bias):")
for pair in results['asymmetry'][:3]:
    print(f"   • {pair['source']} → {pair['target']}: {pair['asymmetry']:.3f}")

# Output files
print("\n\n📦 Output Files:")
print(f"   • Graph (pickle): {CONFIG['output_dir']}/semantic_graph_final.gpickle")
print(f"   • Edge list (CSV): {CONFIG['output_dir']}/semantic_graph_final.csv")
print(f"   • Metrics (JSON): {CONFIG['output_dir']}/topology_metrics.json")
print(f"   • Checkpoints: {CONFIG['checkpoint_dir']}/")

if IN_COLAB:
    print("\n💡 How to Download Your Results:")
    print("   Files are saved to /content/sgm_outputs/ on the Colab runtime")
    print("\n   Option 1 - Direct Download (Traditional Colab):")
    try:
        from google.colab import files
        download = input("\n   📥 Download results now? (y/n): ")
        if download.strip().lower() == 'y':
            print("   Downloading...")
            files.download(f"{CONFIG['output_dir']}/semantic_graph_final.csv")
            files.download(f"{CONFIG['output_dir']}/topology_metrics.json")
            files.download(f"{CONFIG['output_dir']}/semantic_graph_final.gpickle")
            print("   ✅ Download complete!")
    except Exception:
        print("   ⚠️  Direct download not available in VS Code + Colab extension")
        print("\n   Option 2 - Upload to GitHub:")
        print("   Run these commands to push results to your repo:")
        print(f"   !cd /content/align_prompts && mkdir -p data/results")
        print(f"   !cp {CONFIG['output_dir']}/* /content/align_prompts/data/results/")
        print(f"   !cd /content/align_prompts && git add data/results && git commit -m 'Add SGM results' && git push")
        print("\n   Option 3 - Manual Copy (VS Code):")
        print("   1. Use VS Code file browser to navigate to /content/sgm_outputs/")
        print("   2. Right-click files → Download")
        print("   3. Or upload to a cloud storage service")
    
    print("\n   ⚠️  Remember: /content/ is temporary - download before session ends!")
    
    print("\n   Next Steps:")
    print("   1. Download the CSV and JSON files")
    print("   2. Import CSV to Gephi for network visualization")
    print("   3. Analyze topology_metrics.json for detailed insights")

print("\n" + "=" * 70)
print("Thank you for using Semantic Gravity Mapping!")
print("=" * 70)

## Helper: Download Results (VS Code + Colab)

If you're using VS Code + Colab extension, run this cell to package and download your results.

In [None]:
# Optional: Create a zip file of all results for easy download
if IN_COLAB:
    import shutil
    
    print("📦 Packaging results for download...")
    
    # Create zip archive
    output_zip = '/content/sgm_results'
    shutil.make_archive(output_zip, 'zip', CONFIG['output_dir'])
    
    print(f"✅ Created: {output_zip}.zip")
    print(f"   Size: {Path(f'{output_zip}.zip').stat().st_size / 1024 / 1024:.2f} MB")
    
    # Try to download (works in traditional Colab, not VS Code)
    try:
        from google.colab import files
        print("\n📥 Downloading zip file...")
        files.download(f'{output_zip}.zip')
        print("✅ Download started!")
    except Exception:
        print("\n⚠️  Automatic download not available in VS Code + Colab")
        print(f"\n💡 Alternative: Copy {output_zip}.zip to your GitHub repo:")
        print(f"   !cp {output_zip}.zip /content/align_prompts/sgm_results.zip")
        print(f"   !cd /content/align_prompts && git add sgm_results.zip && git commit -m 'Add results' && git push")
        print("\nThen download from GitHub, or right-click the file in VS Code file browser.")
else:
    print("Local environment - results already saved locally to:")
    print(f"  {CONFIG['output_dir']}")