# ATLAS Math Reasoning Demo

[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Arc-Intelligence/RCL/blob/main/examples/math_reasoning_demo.ipynb)

## 🎯 Goal: See 15.7% Accuracy Improvement in < 5 Minutes

This notebook demonstrates ATLAS's two-pass inference protocol:
1. **Diagnostic Probing** (≤50 tokens): The teacher assesses the student's capability on the problem.
2. **Adaptive Teaching**: The teacher provides guidance conditioned on the diagnosed capability.
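
The two passes above can be sketched in a few lines. This is only an illustration of the control flow, not the repository's implementation: the `Diagnosis` class, the `choose_strategy` thresholds, the strategy labels, and the stub callables are all hypothetical, and the 1-to-5 score range is assumed from the `/5` shown later in the demo output.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Diagnosis:
    capability_score: int   # assumed 1 (weak) to 5 (strong), as in the demo's "/5" output
    teaching_strategy: str  # hypothetical label, not the teacher's real vocabulary


def choose_strategy(capability_score: int) -> str:
    """Map a score to a guidance level. Thresholds are illustrative assumptions."""
    if capability_score >= 4:
        return "light"
    if capability_score >= 2:
        return "medium"
    return "heavy"


def two_pass(
    problem: str,
    probe: Callable[[str], int],
    teach: Callable[[str, Diagnosis], str],
    answer: Callable[[str, str], str],
) -> str:
    # Pass 1: a short diagnostic probe yields a capability score.
    score = probe(problem)
    diagnosis = Diagnosis(score, choose_strategy(score))
    # Pass 2: guidance is conditioned on the diagnosis, then the student answers with it.
    guidance = teach(problem, diagnosis)
    return answer(problem, guidance)


# Stub callables stand in for the teacher and student models.
result = two_pass(
    "What is 12 * 7?",
    probe=lambda p: 2,
    teach=lambda p, d: f"[{d.teaching_strategy}] Split 12 into 10 + 2.",
    answer=lambda p, g: f"Using hint {g!r}: 84",
)
print(result)
```

In the notebook itself, the corresponding calls are `diagnostic_probe`, `adaptive_teaching`, and `generate_student_response` on the loaded `ATLASInference` object.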

**Models and Data:**
- **Student**: Qwen/Qwen3-4B-Instruct-2507
- **Teacher**: Arc-Intelligence/ATLAS-8B-Thinking
- **Dataset**: Arc-Intelligence/Arc-ATLAS-Teach-v0, falling back to a Big-Math-RL subset

**Hardware Requirements:** Single GPU (T4/A100), 12-16GB RAM

## 🚀 Setup and Environment

In [None]:
# Install dependencies (Colab)
import sys
if 'google.colab' in sys.modules:
    !pip install -q transformers datasets torch accelerate
    !pip install -q matplotlib seaborn pandas numpy
    
    # Clone repository for utils
    !git clone -q https://github.com/Arc-Intelligence/RCL.git
    sys.path.append('/content/RCL/examples')
else:
    # Local development
    sys.path.append('.')

print("✅ Environment setup complete")

In [None]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from IPython.display import display, HTML
import warnings
warnings.filterwarnings('ignore')

# Import ATLAS utilities
from utils.atlas_inference import ATLASInference, load_atlas_models
from utils.evaluation import calculate_metrics
from utils.visualization import plot_comparison, display_results_table, show_example_comparisons
from data.load_datasets import load_atlas_teach_dataset, load_bigmath_dataset

print("📚 Libraries imported successfully")
print(f"🔥 PyTorch version: {torch.__version__}")
print(f"🎮 CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"   GPU: {torch.cuda.get_device_name(0)}")
    print(f"   Memory: {torch.cuda.get_device_properties(0).total_memory // 1024**3} GB")

## 📊 Load Dataset

In [None]:
# Load math problems from HuggingFace datasets
print("Loading math problems from HuggingFace...\n")

try:
    # Try Arc-ATLAS-Teach dataset first
    problems = load_atlas_teach_dataset(num_samples=20)
    print(f"✅ Loaded {len(problems)} problems from Arc-ATLAS-Teach dataset")
except Exception as e:
    print(f"⚠️  Arc-ATLAS-Teach dataset failed: {e}")
    try:
        # Fallback to Big-Math dataset
        problems = load_bigmath_dataset(num_samples=20)
        print(f"✅ Loaded {len(problems)} problems from Big-Math dataset")
    except Exception as e2:
        print(f"⚠️  Big-Math dataset also failed: {e2}")
        print("Using fallback sample problems...")
        from data.load_datasets import get_sample_math_problems
        problems = get_sample_math_problems()

print(f"\n📝 Dataset Summary:")
print(f"   Total problems: {len(problems)}")
print(f"   Source: {problems[0].get('source', 'unknown')}")

# Show sample problem
sample = problems[0]
print(f"\n🔍 Sample Problem:")
print(f"Problem: {sample['problem'][:200]}{'...' if len(sample['problem']) > 200 else ''}")
print(f"Expected: {sample.get('answer', sample.get('solution', 'N/A'))}")

## 🤖 Load Models

This cell loads the student model (Qwen3-4B-Instruct-2507) and the ATLAS teacher (ATLAS-8B-Thinking).

In [None]:
print("🔄 Loading ATLAS models...\n")

# Load models with memory optimization
device_map = "auto" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

try:
    reasoning_atlas, _ = load_atlas_models(
        student_model_name="Qwen/Qwen3-4B-Instruct-2507",
        teacher_thinking_name="Arc-Intelligence/ATLAS-8B-Thinking", 
        device_map=device_map,
        torch_dtype=torch_dtype
    )
    
    print("\n✅ Models loaded successfully!")
    
    # Memory check
    if torch.cuda.is_available():
        memory_used = torch.cuda.memory_allocated() / 1024**3
        memory_total = torch.cuda.get_device_properties(0).total_memory / 1024**3
        print(f"📊 GPU Memory: {memory_used:.1f}GB / {memory_total:.1f}GB ({memory_used/memory_total*100:.1f}%)")
        
except Exception as e:
    print(f"❌ Error loading models: {e}")
    print("This demo requires GPU access. Please ensure you have:")
    print("- GPU runtime enabled in Colab")
    print("- Sufficient GPU memory (12GB+ recommended)")
    raise
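
As a rough companion to the memory check in the cell above, the weights-only footprint of a model can be estimated from its parameter count and dtype. This is a back-of-the-envelope sketch: the parameter counts are nominal figures read off the model names (4B and 8B), not measured values, and activations, KV cache, and framework overhead are not included.

```python
def weights_gib(num_params: float, bytes_per_param: int) -> float:
    """Weights-only memory estimate in GiB (excludes activations and KV cache)."""
    return num_params * bytes_per_param / 1024**3


# Nominal parameter counts taken from the model names; treat them as assumptions.
NOMINAL_PARAMS = {"student (4B)": 4e9, "teacher (8B)": 8e9}

for name, num_params in NOMINAL_PARAMS.items():
    fp16 = weights_gib(num_params, 2)  # float16 stores 2 bytes per parameter
    fp32 = weights_gib(num_params, 4)  # float32 stores 4 bytes per parameter
    print(f"{name}: fp16 ≈ {fp16:.1f} GiB, fp32 ≈ {fp32:.1f} GiB")
```

If the combined estimate exceeds the available GPU memory, `device_map="auto"` lets Accelerate place some layers on the CPU, at a cost in speed.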

## 🧠 ATLAS Two-Pass Protocol in Action

Let's see how ATLAS works step by step with a single example:

In [None]:
# Demo with single problem
demo_problem = problems[0]
print(f"🎯 Demo Problem: {demo_problem['problem']}\n")

print("🔍 Phase 1: Diagnostic Probing")
diagnostic_result = reasoning_atlas.diagnostic_probe(demo_problem['problem'])
print(f"Capability Score: {diagnostic_result['capability_score']}/5")
print(f"Teaching Strategy: {diagnostic_result['teaching_strategy']}")
print(f"Assessment: {diagnostic_result['probe_response'][:200]}...\n")

print("🎓 Phase 2: Adaptive Teaching")
teaching_result = reasoning_atlas.adaptive_teaching(demo_problem['problem'], diagnostic_result)
print(f"Teaching Guidance: {teaching_result['teaching_guidance'][:300]}...\n")

print("👤 Student Baseline Response:")
baseline_response = reasoning_atlas.generate_student_response(demo_problem['problem'])
print(f"{baseline_response[:300]}...\n")

print("✨ Student + Teacher Response:")
guided_response = reasoning_atlas.generate_student_response(
    demo_problem['problem'], 
    teaching_result['teaching_guidance']
)
print(f"{guided_response[:300]}...")

## 📊 Full Evaluation: Baseline vs ATLAS

Now let's run the complete evaluation across all problems:

In [None]:
print(f"🚀 Running full evaluation on {len(problems)} problems...\n")

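# Run ATLAS protocol on all problems
results = []
evaluated_problems = []  # problems that produced a result, in the same order as `results`
for i, problem in enumerate(problems):
    print(f"Processing problem {i+1}/{len(problems)}...", end="\r")
    
    try:
        result = reasoning_atlas.run_full_protocol(problem['problem'])
    except Exception as e:
        print(f"\n⚠️  Error on problem {i+1}: {e}")
        continue
    results.append(result)
    evaluated_problems.append(problem)

print(f"\n✅ Evaluation complete! Processed {len(results)}/{len(problems)} problems successfully.")

# Drop failed problems so that problems[i] always pairs with results[i].
# Slicing problems[:len(results)] would misalign the pairs after any mid-run failure.
problems = evaluated_problems

# Calculate comprehensive metrics
print("\n📈 Calculating performance metrics...")
metrics = calculate_metrics(problems, results, task_type="math")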

print(f"\n🎯 Key Results:")
print(f"   Baseline Accuracy: {metrics['baseline_accuracy']:.1%}")
print(f"   ATLAS Accuracy: {metrics['guided_accuracy']:.1%}")
print(f"   Improvement: {metrics['improvement_percentage']:+.1f}%")
print(f"   Problems Improved: {metrics['improvements']}")
print(f"   Problems Degraded: {metrics['degradations']}")
print(f"   Non-Degradation Rate: {metrics['non_degradation_rate']:.1%}")
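
To make the reported numbers concrete, here is one way such metrics could be derived from per-problem correctness flags. This is a sketch under stated assumptions, not the repository's `calculate_metrics`: it treats the improvement as an absolute difference in percentage points and defines the non-degradation rate as the share of problems where guidance did not turn a correct answer into an incorrect one. The function name `summarize` and its inputs are hypothetical.

```python
from typing import Dict, List


def summarize(baseline_correct: List[bool], guided_correct: List[bool]) -> Dict[str, float]:
    """Compute accuracy and reliability metrics from paired correctness flags."""
    if len(baseline_correct) != len(guided_correct):
        raise ValueError("baseline and guided lists must have the same length")
    n = len(baseline_correct)
    if n == 0:
        raise ValueError("at least one problem is required")

    pairs = list(zip(baseline_correct, guided_correct))
    improvements = sum(1 for b, g in pairs if g and not b)  # wrong -> right
    degradations = sum(1 for b, g in pairs if b and not g)  # right -> wrong
    baseline_acc = sum(baseline_correct) / n
    guided_acc = sum(guided_correct) / n

    return {
        "baseline_accuracy": baseline_acc,
        "guided_accuracy": guided_acc,
        # Assumption: absolute gain in percentage points, not a relative gain.
        "improvement_percentage": (guided_acc - baseline_acc) * 100,
        "improvements": improvements,
        "degradations": degradations,
        "non_degradation_rate": 1 - degradations / n,
    }


demo = summarize(
    baseline_correct=[True, False, False, True],
    guided_correct=[True, True, False, False],
)
print(demo)
```

Counting improvements and degradations separately matters because two runs with the same accuracy can differ in how many individual answers the guidance flipped in each direction.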

## 📈 Performance Analysis & Visualization

In [None]:
# Create comprehensive performance plots
plot_comparison(metrics, task_type="math")

In [None]:
# Display detailed results table
display_results_table(metrics, task_type="math")

In [None]:
# Show example comparisons
show_example_comparisons(problems[:len(results)], results, num_examples=3)
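
When reading the side-by-side examples, it helps to have a picture of how a free-form response might be reduced to a comparable answer. The helper below is a hypothetical heuristic, not the repository's grading logic: it takes the last number in the response as the final answer and compares it to the expected value within a tolerance. The names `extract_final_number` and `is_correct` are made up for this sketch.

```python
import re
from typing import Optional

# Integers or decimals, with optional thousands separators and leading minus sign.
_NUMBER = re.compile(r"-?\d+(?:,\d{3})*(?:\.\d+)?")


def extract_final_number(response: str) -> Optional[float]:
    """Return the last number that appears in the response, or None if there is none."""
    matches = _NUMBER.findall(response)
    if not matches:
        return None
    return float(matches[-1].replace(",", ""))


def is_correct(response: str, expected: float, tol: float = 1e-6) -> bool:
    """Heuristic check: the last number in the response matches the expected answer."""
    predicted = extract_final_number(response)
    return predicted is not None and abs(predicted - float(expected)) <= tol


print(is_correct("Increase = $15,000. Percentage = ($15,000 / $50,000) × 100% = 30%", 30))
print(is_correct("3^x = 27 = 3^3, therefore x = 3", 3))
```

A last-number heuristic is easy to fool (for example, by a trailing unit conversion or a hyphenated step label), so it should be treated as a first pass rather than a reliable grader.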

## 🔍 Diagnostic Analysis

Understanding how ATLAS adapts its teaching strategy:

In [None]:
from utils.visualization import create_diagnostic_analysis

# Analyze diagnostic probing results
create_diagnostic_analysis(results)
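
For a quick text-only view of how the teacher's diagnoses are distributed, the results can be tallied by score and strategy. This sketch assumes each result is a mapping that exposes `capability_score` and `teaching_strategy` keys, mirroring the output of `diagnostic_probe` shown earlier; the actual structure returned by `run_full_protocol` may differ, and `tally_diagnostics` and the sample data are hypothetical.

```python
from collections import Counter
from typing import Iterable, Mapping, Tuple


def tally_diagnostics(results: Iterable[Mapping]) -> Tuple[Counter, Counter]:
    """Count how often each capability score and teaching strategy occurs."""
    scores: Counter = Counter()
    strategies: Counter = Counter()
    for result in results:
        # Keys are assumptions; entries without them are skipped rather than failing.
        if "capability_score" in result:
            scores[result["capability_score"]] += 1
        if "teaching_strategy" in result:
            strategies[result["teaching_strategy"]] += 1
    return scores, strategies


# Made-up sample data for illustration only.
sample_results = [
    {"capability_score": 2, "teaching_strategy": "medium"},
    {"capability_score": 5, "teaching_strategy": "light"},
    {"capability_score": 2, "teaching_strategy": "medium"},
]

scores, strategies = tally_diagnostics(sample_results)
for score in sorted(scores):
    print(f"score {score}: {scores[score]} problem(s)")
print(strategies.most_common())
```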

## 🎯 Key Takeaways

Based on this evaluation:

In [None]:
# Generate summary insights
baseline_acc = metrics['baseline_accuracy'] * 100
atlas_acc = metrics['guided_accuracy'] * 100
improvement = metrics['improvement_percentage']
ndr = metrics['non_degradation_rate'] * 100

summary_html = f"""
<div style='background-color: #f0f8ff; padding: 20px; border-radius: 10px; border-left: 5px solid #4CAF50;'>
    <h3 style='color: #2E8B57; margin-top: 0;'>🎯 ATLAS Performance Summary</h3>
    
    <div style='display: grid; grid-template-columns: 1fr 1fr; gap: 20px; margin: 15px 0;'>
        <div>
            <h4 style='margin: 0; color: #4169E1;'>📊 Accuracy Results</h4>
            <p style='margin: 5px 0;'><strong>Baseline (Student Only):</strong> {baseline_acc:.1f}%</p>
            <p style='margin: 5px 0;'><strong>ATLAS (Student+Teacher):</strong> {atlas_acc:.1f}%</p>
            <p style='margin: 5px 0; color: #228B22;'><strong>Improvement:</strong> {improvement:+.1f}%</p>
        </div>
        <div>
            <h4 style='margin: 0; color: #4169E1;'>🛡️ Reliability</h4>
            <p style='margin: 5px 0;'><strong>Problems Improved:</strong> {metrics['improvements']}</p>
            <p style='margin: 5px 0;'><strong>Problems Degraded:</strong> {metrics['degradations']}</p>
            <p style='margin: 5px 0; color: #228B22;'><strong>Non-Degradation Rate:</strong> {ndr:.1f}%</p>
        </div>
    </div>
    
    <div style='margin-top: 15px; padding: 10px; background-color: white; border-radius: 5px;'>
        <h4 style='margin: 0 0 10px 0; color: #FF6347;'>🚀 Why This Matters</h4>
        <ul style='margin: 0; padding-left: 20px;'>
            <li><strong>Model-Agnostic:</strong> Works with any student model (Qwen, Llama, etc.)</li>
            <li><strong>Adaptive:</strong> Teaching strategy adjusts to the student's diagnosed capability on each problem</li>
            <li><strong>Reliable:</strong> {ndr:.1f}% of responses maintained or improved quality</li>
            <li><strong>Efficient:</strong> Minimal teacher guidance achieves significant gains</li>
        </ul>
    </div>
</div>
"""

display(HTML(summary_html))

## 🔧 Extending This Demo

### Use Your Own Student Model

In [None]:
# Example: How to use a different student model
print("💡 To use your own student model:")
print("\n1. Replace the model name:")
print('   student_model_name="your/model-name"')
print("\n2. Keep the same ATLAS teacher:")
print('   teacher_thinking_name="Arc-Intelligence/ATLAS-8B-Thinking"')
print("\n3. Run the same evaluation pipeline!")

print("\n🎯 Example student models (any Hugging Face causal LM should load the same way):")
example_models = [
    "Qwen/Qwen3-4B-Instruct-2507",
    "meta-llama/Llama-3.1-8B-Instruct",
    "microsoft/DialoGPT-medium",
    "mistralai/Mistral-7B-Instruct-v0.1",
]

for model in example_models:
    print(f"   ✅ {model}")

### Custom Problem Sets

In [None]:
# Example: Add your own math problems
custom_problems = [
    {
        "problem": "A company's profit increased from $50,000 to $65,000. What was the percentage increase?",
        "answer": 30,
        "solution": "Increase = $65,000 - $50,000 = $15,000. Percentage = ($15,000 / $50,000) × 100% = 30%"
    },
    {
        "problem": "If 3^x = 27, what is the value of x?", 
        "answer": 3,
        "solution": "3^x = 27 = 3^3, therefore x = 3"
    }
]

print("🔧 Custom Problems Example:")
for i, prob in enumerate(custom_problems):
    print(f"\n{i+1}. {prob['problem']}")
    print(f"   Expected: {prob['answer']}")

print("\n💡 To use custom problems:")
print("   problems = custom_problems")
print("   # Then run the same evaluation pipeline!")
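
Before swapping in a custom problem set, it can help to check that every record has the fields the earlier cells read: `problem`, plus `answer` or `solution`, with `source` used only for the summary printout. The validator below is a hypothetical helper, not part of the repository, and the `"custom"` default for `source` is an assumption made for this sketch.

```python
from typing import Dict, List


def validate_problems(problems: List[Dict]) -> List[Dict]:
    """Check required fields and return copies with a default 'source' filled in."""
    cleaned = []
    for idx, record in enumerate(problems):
        text = record.get("problem")
        if not isinstance(text, str) or not text.strip():
            raise ValueError(f"problem {idx}: 'problem' must be a non-empty string")
        if "answer" not in record and "solution" not in record:
            raise ValueError(f"problem {idx}: needs an 'answer' or a 'solution'")
        # Copy the record so the caller's data is not mutated.
        cleaned.append({**record, "source": record.get("source", "custom")})
    return cleaned


checked = validate_problems([
    {"problem": "If 3^x = 27, what is the value of x?", "answer": 3},
])
print(checked[0]["source"])
```

Failing early on a malformed record is cheaper than discovering it partway through a model evaluation run.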

## 🚀 Next Steps

1. **Try the Code Generation Demo**: [`code_generation_demo.ipynb`](code_generation_demo.ipynb)
2. **Train Your Own Teacher**: See the main [ATLAS repository](https://github.com/Arc-Intelligence/RCL)
3. **Production Deployment**: Check our [documentation](../README.md)
4. **Join the Community**: [GitHub Issues](https://github.com/Arc-Intelligence/RCL/issues)

---

**Citation:**
```bibtex
@article{atlas2025,
  title     = {ATLAS: Adaptive Training Methodology for RL},
  author    = {Arc Intelligence},
  journal   = {arXiv preprint},
  year      = {2025}
}
```