# 🔬 Evaluation: Zero-Shot Reasoning Assessment

This notebook evaluates our trained SLM using **zero-shot analytical prompts**. No examples, no formatting instructions, no chain-of-thought forcing.

**Evaluation Philosophy:**
- Open-ended prompts about system behavior
- Focus on reasoning quality, not benchmark scores
- Both successes and failures are meaningful data

---
## 1. Setup & Model Loading

In [None]:
import torch
import torch.nn.functional as F
from tokenizers import Tokenizer
from pathlib import Path
import json

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

torch.manual_seed(42)
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(42)

In [None]:
TOKENIZER_PATH = Path("../tokenizer/tokenizer.json")
CHECKPOINT_DIR = Path("../checkpoints")

tokenizer = Tokenizer.from_file(str(TOKENIZER_PATH))
print(f"✓ Tokenizer loaded: {tokenizer.get_vocab_size()} tokens")

# List available checkpoints (covers both a missing and an empty directory)
checkpoints = sorted(CHECKPOINT_DIR.glob("*.pt")) if CHECKPOINT_DIR.exists() else []
if checkpoints:
    print("\nAvailable checkpoints:")
    for cp in checkpoints:
        print(f"  - {cp.name}")
else:
    print("No checkpoints found - run training first!")

In [None]:
# Load model architecture (copy from notebook 03 or import from saved module)
# For evaluation, we need the same model definition used during training

def load_model_from_checkpoint(checkpoint_path, model_class, config):
    """Load a trained model from checkpoint."""
    model = model_class(config)
    checkpoint = torch.load(checkpoint_path, map_location=device)
    model.load_state_dict(checkpoint['model_state_dict'])
    model.to(device)
    model.eval()
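    print(f"✓ Loaded checkpoint: {checkpoint_path.name}")
    print(f"  Step: {checkpoint.get('step', 'unknown')}")
    # Format the loss only when present; ':.4f' raises ValueError on a string fallback
    loss = checkpoint.get('loss')
    loss_str = f"{float(loss):.4f}" if loss is not None else "unknown"
    print(f"  Loss: {loss_str}")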
    return model

# NOTE: Uncomment and modify when you have a trained model
# model = load_model_from_checkpoint(
#     CHECKPOINT_DIR / "phase_c_final_step5000.pt",
#     SLM,
#     ModelConfig(use_block_local=True, max_seq_len=4096)
# )

print("Model loading function ready")

---
## 2. Text Generation Utilities

Plain temperature sampling from the full next-token distribution: no beam search and no top-k or top-p filtering, just the basics.

In [None]:
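def generate_text(model, prompt, max_tokens=200, temperature=0.8):
    """
    Generate text from a prompt by sampling one token at a time.
    
    Lower temperature = more focused/deterministic
    Higher temperature = more creative/diverse
    
    Returns the decoded prompt plus continuation. No context cropping is done,
    so prompt length + max_tokens must fit within the model's max_seq_len.
    """
    if temperature <= 0:
        raise ValueError("temperature must be > 0 (logits are divided by it)")
    model.eval()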
    
    # Tokenize prompt
    encoding = tokenizer.encode(prompt)
    tokens = torch.tensor([encoding.ids], dtype=torch.long, device=device)
    
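    # Look up the EOS token ID; token_to_id returns None if "<EOS>" is not in
    # the vocabulary, in which case generation always runs for max_tokens
    eos_id = tokenizer.token_to_id("<EOS>")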
    
    with torch.no_grad():
        for _ in range(max_tokens):
            # Get logits for last position
            logits = model(tokens)[:, -1, :]
            
            # Apply temperature
            logits = logits / temperature
            
            # Sample next token
            probs = F.softmax(logits, dim=-1)
            next_token = torch.multinomial(probs, num_samples=1)
            
            # Stop at EOS
            if next_token.item() == eos_id:
                break
            
            tokens = torch.cat([tokens, next_token], dim=1)
    
    # Decode output
    output_ids = tokens[0].tolist()
    output_text = tokenizer.decode(output_ids)
    
    return output_text

print("Generation function ready")
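
To make the effect of `temperature` concrete, here is a minimal pure-Python sketch of the same divide-then-softmax step applied to a made-up three-token logit vector. The helper name `softmax_with_temperature` and the logit values are illustrative assumptions, not part of the notebook's model.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Divide logits by temperature, then apply a numerically stable softmax."""
    if temperature <= 0:
        raise ValueError("temperature must be > 0")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max so exp() cannot overflow
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for a three-token vocabulary
example_logits = [2.0, 1.0, 0.0]

for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(example_logits, t)
    print(f"T={t}: " + ", ".join(f"{p:.3f}" for p in probs))
```

Lower temperatures concentrate probability on the highest logit, while higher temperatures flatten the distribution toward uniform.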

---
## 3. Evaluation Prompts

Our prompts are designed to test **reasoning about system behavior** without asking for specific conclusions. The model should:
- Identify causal dependencies
- State assumptions explicitly
- Handle uncertainty appropriately

In [None]:
# Evaluation prompts - designed to test reasoning, not recall

EVAL_PROMPTS = {
    "feedback_loop": """
Consider a system where the output of a process influences its own input through 
a delay mechanism. As the delay increases, describe what happens to the system's 
stability and explain the underlying principles.
""".strip(),
    
    "thermal_coupling": """
A reactor vessel contains a fluid whose temperature affects its own heat transfer 
properties. As temperature rises, the fluid becomes less effective at removing heat. 
Analyze the implications for system behavior.
""".strip(),
    
    "control_failure": """
An automated control system is designed to maintain a variable at a setpoint. 
The sensor measuring this variable has a measurement lag. Under what conditions 
might this configuration lead to oscillations or instability?
""".strip(),
    
    "redundancy_analysis": """
A safety system consists of three independent channels, any one of which can 
trigger a protective action. Discuss the factors that would determine whether 
this redundancy provides adequate protection.
""".strip(),
    
    "transient_behavior": """
A physical system is operating at steady state when a step change is applied 
to one of its inputs. The system response initially moves in one direction, 
then reverses. What does this inverse response indicate about the system's 
internal structure?
""".strip(),
}

print(f"Prepared {len(EVAL_PROMPTS)} evaluation prompts:")
for name in EVAL_PROMPTS:
    print(f"  - {name}")

---
## 4. Running Evaluation

Generate responses and save for analysis. We'll compare outputs at different temperatures.

In [None]:
def run_evaluation(model, prompts, temperatures=(0.7, 1.0)):
    """
    Run all evaluation prompts at different temperatures.
    
    Returns a dict of results for analysis.
    """
    results = {}
    
    for name, prompt in prompts.items():
        print(f"\n{'='*60}")
        print(f"Prompt: {name}")
        print(f"{'='*60}")
        print(prompt)
        
        results[name] = {"prompt": prompt, "responses": {}}
        
        for temp in temperatures:
            print(f"\n--- Temperature {temp} ---")
            response = generate_text(model, prompt, max_tokens=300, temperature=temp)
            results[name]["responses"][temp] = response
            print(response)
    
    return results

# NOTE: Uncomment when you have a loaded model
# results = run_evaluation(model, EVAL_PROMPTS)

print("Evaluation function ready")
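
The dict returned by `run_evaluation` maps each prompt name to `{"prompt": ..., "responses": {temperature: text}}`. For manual review it may be handier to walk it as flat rows. A minimal sketch, using a hand-written stand-in for real results (the helper name `flatten_results` and the placeholder strings are hypothetical):

```python
def flatten_results(results):
    """Yield (prompt_name, temperature, response) rows from run_evaluation output."""
    for name, entry in results.items():
        for temp, response in entry["responses"].items():
            yield name, temp, response

# Stand-in with the same shape as run_evaluation's return value (placeholder text)
sample_results = {
    "feedback_loop": {
        "prompt": "[prompt text]",
        "responses": {0.7: "[response at T=0.7]", 1.0: "[response at T=1.0]"},
    },
}

rows = list(flatten_results(sample_results))
for name, temp, response in rows:
    print(f"{name} @ T={temp}: {response}")
```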

---
## 5. Qualitative Assessment Criteria

We evaluate responses on these dimensions (manually, not automatically):

| Criterion | Description |
|-----------|-------------|
| **Causal Depth** | Does the response identify cause-effect relationships? |
| **Conditional Reasoning** | Does it use "if-then" logic appropriately? |
| **Explicit Uncertainty** | Does it acknowledge what it doesn't know? |
| **Consistency** | Are claims internally consistent? |
| **Assumption Clarity** | Are underlying assumptions stated? |

In [None]:
def assess_response(response):
    """
    Manual assessment template - prints prompts for human evaluation.
    """
    print("Response to assess:")
    print("-" * 40)
    print(response)
    print("-" * 40)
    print("\nAssessment criteria (rate 1-5):")
    print("  1. Causal Depth: Does it identify cause-effect relationships?")
    print("  2. Conditional Reasoning: Does it use if-then logic?")
    print("  3. Explicit Uncertainty: Does it acknowledge unknowns?")
    print("  4. Consistency: Are claims internally consistent?")
    print("  5. Assumption Clarity: Are assumptions stated?")
    print("\nNotes:")
    print("  - What reasoning patterns are present?")
    print("  - What's missing or incorrect?")
    print("  - Would this be useful to a human analyst?")

# Example usage (with dummy response)
assess_response("[Model response would appear here]")
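
Once a reviewer has rated a response, the scores need to be stored somewhere. One possible record type is sketched below, assuming the five criteria and the 1-5 scale listed above; the `Assessment` class, its field names, and the example ratings are hypothetical. The mean is only a convenience for sorting, and the notes carry the substance.

```python
from dataclasses import dataclass, asdict

CRITERIA = (
    "causal_depth",
    "conditional_reasoning",
    "explicit_uncertainty",
    "consistency",
    "assumption_clarity",
)

@dataclass
class Assessment:
    """One reviewer's 1-5 ratings for a single response (hypothetical record type)."""
    causal_depth: int
    conditional_reasoning: int
    explicit_uncertainty: int
    consistency: int
    assumption_clarity: int
    notes: str = ""

    def __post_init__(self):
        # Reject ratings outside the 1-5 scale
        for name in CRITERIA:
            value = getattr(self, name)
            if not 1 <= value <= 5:
                raise ValueError(f"{name} must be in 1..5, got {value}")

    def mean_score(self):
        """Average of the five criterion ratings."""
        return sum(getattr(self, name) for name in CRITERIA) / len(CRITERIA)

# Made-up ratings, for illustration only
example = Assessment(3, 4, 2, 4, 2, notes="Links delay to instability; no assumptions listed.")
print(asdict(example))
print(f"Mean score: {example.mean_score():.2f}")
```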

---
## 6. Long-Context Evaluation

Test whether the model maintains coherence over extended contexts. We provide a long system description and check if conclusions remain consistent.

In [None]:
LONG_CONTEXT_PROMPT = """
The following describes a complex industrial control system:

A chemical reactor operates with three primary control loops. The first loop 
regulates temperature by adjusting coolant flow rate. The temperature sensor 
has a 30-second measurement delay. The second loop controls pressure by 
modulating a relief valve. Pressure and temperature are coupled - rising 
temperature increases pressure according to the ideal gas law. The third 
loop maintains reactant concentration by controlling feed rate.

The system has the following characteristics:
- Temperature setpoint: 350°C
- Pressure operating range: 10-15 bar
- Coolant flow capacity: 0-100 L/min
- Feed rate range: 0-50 kg/hr

An operator initiates a production rate increase by raising the feed rate 
setpoint from 30 to 45 kg/hr. The reaction is exothermic.

Analyze the sequence of events that would follow this change, considering 
the interactions between all three control loops. Identify any potential 
stability concerns.
""".strip()

print("Long-context prompt ready")
print(f"Prompt length: {len(tokenizer.encode(LONG_CONTEXT_PROMPT).ids)} tokens")
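
Because `generate_text` appends tokens without cropping, the prompt plus the generated continuation has to fit in the model's context window. A minimal sketch of that budget check, assuming the window equals the `max_seq_len=4096` shown in the commented-out `ModelConfig` earlier (the helpers `fits_context` and `max_new_tokens_allowed` and the token count 310 are hypothetical):

```python
def fits_context(prompt_tokens, max_new_tokens, max_seq_len=4096):
    """Return True if prompt plus continuation stays within the context window."""
    return prompt_tokens + max_new_tokens <= max_seq_len

def max_new_tokens_allowed(prompt_tokens, max_seq_len=4096):
    """Largest continuation length that still fits (0 if the prompt alone overflows)."""
    return max(0, max_seq_len - prompt_tokens)

# Hypothetical prompt length; in the notebook this would come from
# len(tokenizer.encode(LONG_CONTEXT_PROMPT).ids)
prompt_len = 310
print(fits_context(prompt_len, max_new_tokens=400))
print(max_new_tokens_allowed(prompt_len))
```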

In [None]:
def evaluate_long_context(model, prompt, n_runs=3):
    """
    Run the same long-context prompt multiple times.
    
    Check for:
    - Consistency of conclusions across runs
    - Reference to earlier parts of the prompt
    - Absence of contradictions
    """
    responses = []
    
    for i in range(n_runs):
        print(f"\nRun {i+1}/{n_runs}")
        print("-" * 40)
        response = generate_text(model, prompt, max_tokens=400, temperature=0.7)
        responses.append(response)
        print(response[:500] + "..." if len(response) > 500 else response)
    
    return responses

# NOTE: Uncomment when model is loaded
# long_context_results = evaluate_long_context(model, LONG_CONTEXT_PROMPT)

print("Long-context evaluation ready")
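
Consistency across runs is ultimately judged by reading the responses, but a crude lexical overlap can flag runs that diverge sharply. The sketch below computes pairwise Jaccard similarity over lowercase word sets. This metric is an assumption introduced here for illustration, not part of the notebook's method, and it cannot detect contradictions phrased with similar vocabulary.

```python
import re
from itertools import combinations

def word_set(text):
    """Lowercase word set; punctuation is ignored."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))

def jaccard(a, b):
    """Size of the word-set intersection divided by size of the union."""
    set_a, set_b = word_set(a), word_set(b)
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)

def pairwise_overlap(responses):
    """Return {(i, j): similarity} for every pair of responses."""
    return {
        (i, j): jaccard(responses[i], responses[j])
        for i, j in combinations(range(len(responses)), 2)
    }

# Made-up stand-ins for three runs
runs = [
    "Higher feed rate raises heat release, so coolant flow increases.",
    "Raising feed rate increases heat release and coolant flow rises.",
    "Pressure drops because the relief valve opens immediately.",
]
for pair, score in pairwise_overlap(runs).items():
    print(pair, f"{score:.2f}")
```

A low score for one pair is only a prompt to read those two responses side by side.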

---
## 7. Comparative Analysis: 2k vs 5k Context

Compare model behavior at different context lengths to verify the context extension worked.

In [None]:
def compare_context_lengths(model_2k, model_5k, prompt):
    """
    Compare outputs from models trained at different context lengths.
    
    We expect the 5k model to:
    - Reference more of the prompt context
    - Maintain coherence over longer outputs
    - Show similar reasoning patterns on short prompts
    """
    print("2k Context Model:")
    print("-" * 40)
    response_2k = generate_text(model_2k, prompt, max_tokens=300)
    print(response_2k)
    
    print("\n5k Context Model:")
    print("-" * 40)
    response_5k = generate_text(model_5k, prompt, max_tokens=300)
    print(response_5k)
    
    return response_2k, response_5k

# This requires loading both Phase A and Phase C checkpoints
print("Context comparison function ready")

---
## 8. Perplexity Calculation

While we prioritize qualitative assessment, perplexity gives us a rough quantitative measure of model fit.

In [None]:
def calculate_perplexity(model, text, max_length=2048):
    """
    Calculate perplexity on a text sample.
    
    Lower perplexity = model assigns higher probability to the text.
    """
    model.eval()
    
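    encoding = tokenizer.encode(text)
    # Truncate to max_length tokens; at least two are needed to form one prediction target
    ids = encoding.ids[:max_length]
    if len(ids) < 2:
        raise ValueError("Need at least 2 tokens to compute perplexity")
    tokens = torch.tensor([ids], dtype=torch.long, device=device)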
    
    with torch.no_grad():
        logits = model(tokens)
        
        # Shift for next-token prediction
        shift_logits = logits[:, :-1, :].contiguous()
        shift_labels = tokens[:, 1:].contiguous()
        
        # Cross-entropy loss
        loss = F.cross_entropy(
            shift_logits.view(-1, shift_logits.size(-1)),
            shift_labels.view(-1)
        )
        
        perplexity = torch.exp(loss).item()
    
    return perplexity

print("Perplexity calculation ready")
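
Perplexity is the exponential of the mean negative log-likelihood per predicted token, which is what `torch.exp(loss)` computes above since `F.cross_entropy` averages over tokens by default. A minimal pure-Python sketch with hand-picked probabilities (the helper name `perplexity_from_probs` and the probability values are hypothetical):

```python
import math

def perplexity_from_probs(token_probs):
    """exp of the mean negative log-probability assigned to each actual next token."""
    if not token_probs:
        raise ValueError("need at least one predicted token")
    mean_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(mean_nll)

# A model that spreads probability uniformly over 4 choices at every step
uniform = perplexity_from_probs([0.25, 0.25, 0.25, 0.25])
# A model that is confident and correct on most tokens
confident = perplexity_from_probs([0.9, 0.8, 0.9, 0.5])

print(f"uniform over 4 choices: {uniform:.2f}")
print(f"mostly confident:       {confident:.2f}")
```

A uniform guess over k options gives a perplexity of k, so the number can be read as an effective branching factor.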

In [None]:
# Perplexity on different text types
# This helps us understand what the model "knows" well

TEST_TEXTS = {
    "technical": "The transfer function of a closed-loop control system determines its stability characteristics.",
    "narrative": "The engineer walked into the control room and noticed the temperature gauge rising.",
    "mathematical": "Given that f(x) = x^2 + 2x + 1, the derivative f'(x) = 2x + 2.",
}

# NOTE: Uncomment when model is loaded
# print("Perplexity by text type:")
# for name, text in TEST_TEXTS.items():
#     ppl = calculate_perplexity(model, text)
#     print(f"  {name}: {ppl:.2f}")

print("Test texts ready for perplexity evaluation")

---
## 9. Saving Evaluation Results

In [None]:
import datetime

def save_evaluation_results(results, output_dir="../eval_results"):
    """Save evaluation results to JSON for later analysis."""
    output_path = Path(output_dir)
    # parents=True also creates missing intermediate directories
    output_path.mkdir(parents=True, exist_ok=True)
    
    timestamp = datetime.datetime.now().strftime("%Y%m%d_%H%M%S")
    filename = output_path / f"eval_{timestamp}.json"
    
    # Write UTF-8 explicitly so non-ASCII characters in responses survive on any platform
    with open(filename, 'w', encoding='utf-8') as f:
        json.dump(results, f, indent=2, ensure_ascii=False)
    
    print(f"✓ Results saved to: {filename}")
    return filename

print("Results saving function ready")
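
One detail to keep in mind when reloading saved results: JSON object keys are always strings, so the float temperature keys written by `run_evaluation` come back as `"0.7"` rather than `0.7`. A minimal sketch of the round trip, using a temporary directory and placeholder data (the helper `load_results` is hypothetical):

```python
import json
import tempfile
from pathlib import Path

def load_results(path):
    """Load saved results and convert temperature keys back to floats."""
    with open(path, encoding="utf-8") as f:
        raw = json.load(f)
    for entry in raw.values():
        entry["responses"] = {float(t): r for t, r in entry["responses"].items()}
    return raw

# Placeholder data with the same shape as run_evaluation's output
original = {
    "feedback_loop": {
        "prompt": "[prompt text]",
        "responses": {0.7: "[response A]", 1.0: "[response B]"},
    }
}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "eval_example.json"
    with open(path, "w", encoding="utf-8") as f:
        json.dump(original, f, indent=2)  # float keys are written as strings
    saved = json.loads(path.read_text(encoding="utf-8"))
    raw_keys = list(saved["feedback_loop"]["responses"])
    restored = load_results(path)

print(raw_keys)
print(restored == original)
```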

---
## Summary

This notebook provides the evaluation framework for our SLM:

1. **Zero-shot prompts** - Testing reasoning without examples
2. **Qualitative criteria** - Causal depth, consistency, uncertainty handling
3. **Long-context tests** - Verify extended context capability
4. **Perplexity analysis** - Quantitative baseline measure

**Key Principle:** Both successes and failures are valuable. The goal is insight into reasoning emergence, not a leaderboard score.

**To use this notebook:**
1. Run training (notebook 04) to generate checkpoints
2. Load the Phase C checkpoint
3. Run evaluation prompts
4. Manually assess responses using the criteria