# Test Evaluation Framework

This notebook tests the Golden Output Manager, which manages the cache of LLM-generated risk summaries.

**Purpose**: Examine golden output caching without requiring Groq API tokens.

## Setup

In [2]:
# Auto-reload modules when they change
%load_ext autoreload
%autoreload 2

# Configure logging
import logging
logging.basicConfig(
    level=logging.INFO,
    format='%(levelname)s - %(name)s - %(message)s'
)

In [3]:
# Make the project's src directory importable
import sys
from pathlib import Path

# Resolve src relative to the notebooks directory (assumes the notebook runs from there)
notebook_dir = Path.cwd()
src_dir = notebook_dir.parent / "src"
sys.path.insert(0, str(src_dir))

print(f"Notebook directory: {notebook_dir}")
print(f"Source directory: {src_dir}")
print(f"Python path includes: {src_dir in [Path(p) for p in sys.path]}")

Notebook directory: c:\Users\H244746\Documents\reit-risk-summarizer\notebooks
Source directory: c:\Users\H244746\Documents\reit-risk-summarizer\src
Python path includes: True
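
The path setup above assumes the notebook is launched from the `notebooks` directory. A minimal sketch of a more tolerant lookup walks upward from the current directory until a `src` folder is found. `find_src_dir` is a hypothetical helper introduced here for illustration and is not part of the project:

```python
from pathlib import Path
from typing import Optional


def find_src_dir(start: Path, marker: str = "src") -> Optional[Path]:
    """Walk upward from `start` and return the first `<ancestor>/<marker>` directory."""
    start = start.resolve()
    for candidate in (start, *start.parents):
        src = candidate / marker
        if src.is_dir():
            return src
    return None  # no ancestor contains the marker directory
```

In this notebook it could be used as `src_dir = find_src_dir(Path.cwd())`, followed by the same `sys.path.insert` call, with a check for `None` before inserting.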


## Import Required Libraries

In [5]:
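import sys
from pathlib import Path

# Add the project root so the top-level `evaluation` package is importable
sys.path.insert(0, str(Path.cwd().parent))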

from evaluation.golden_output_manager import GoldenOutputManager
from reit_risk_summarizer.services.llm.summarizer import RiskSummary
import pandas as pd
import json

print("✅ Imports successful")

✅ Imports successful


## Initialize Golden Output Manager

In [6]:
# Initialize with default cache directory
manager = GoldenOutputManager()

print(f"Cache directory: {manager.cache_dir}")
print(f"Cache exists: {manager.cache_dir.exists()}")
print(f"Number of cached files: {len(list(manager.cache_dir.glob('*.json')))}")

INFO - evaluation.golden_output_manager - Golden output cache directory: c:\Users\H244746\Documents\reit-risk-summarizer\evaluation\golden_outputs


Cache directory: c:\Users\H244746\Documents\reit-risk-summarizer\evaluation\golden_outputs
Cache exists: True
Number of cached files: 0


## List All Cached Tickers

In [7]:
# Get all cached tickers
cached_tickers = manager.list_cached_tickers()

print(f"Total cached tickers: {len(cached_tickers)}")
print(f"Tickers: {cached_tickers}")

Total cached tickers: 0
Tickers: []


## Create and Save Test Output

Let's create a fake golden output to test the save functionality.

In [9]:
# Create a fake RiskSummary for testing
test_summary = RiskSummary(
    ticker="TEST",
    company_name="Test REIT Company",
    risks=[
        "Interest rate risk from variable-rate debt exposure",
        "Market risk from economic downturn affecting occupancy rates",
        "Regulatory risk from changes in REIT tax requirements",
        "Concentration risk from geographic market exposure",
        "Liquidity risk from refinancing obligations"
    ],
    model="llama-3.3-70b-versatile",
    prompt_version="1.0"
)

# Save to cache
manager.save_output(
    summary=test_summary,
    input_text_length=5000,
    cache_hit=False
)

print(f"✅ Saved test output for {test_summary.ticker}")
print(f"Cache file exists: {manager.has_cached_output('TEST')}")

INFO - evaluation.golden_output_manager - Saved golden output for TEST to c:\Users\H244746\Documents\reit-risk-summarizer\evaluation\golden_outputs\TEST.json


✅ Saved test output for TEST
Cache file exists: True
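
The source of `GoldenOutputManager` is not shown in this notebook. As a rough mental model only, a cache that writes one JSON file per ticker, consistent with the log lines above, could look like the following. `TinyJsonCache` and its method bodies are hypothetical stand-ins, not the project's implementation:

```python
import json
from datetime import datetime, timezone
from pathlib import Path
from typing import Optional


class TinyJsonCache:
    """Hypothetical stand-in: stores one `<TICKER>.json` file per ticker."""

    def __init__(self, cache_dir: Path) -> None:
        self.cache_dir = Path(cache_dir)
        self.cache_dir.mkdir(parents=True, exist_ok=True)

    def get_cache_path(self, ticker: str) -> Path:
        return self.cache_dir / f"{ticker.upper()}.json"

    def save(self, ticker: str, record: dict) -> Path:
        # Stamp the record with its ticker and a UTC timestamp before writing.
        payload = dict(record, ticker=ticker.upper(),
                       generated_at=datetime.now(timezone.utc).isoformat())
        path = self.get_cache_path(ticker)
        path.write_text(json.dumps(payload, indent=2), encoding="utf-8")
        return path

    def load(self, ticker: str) -> Optional[dict]:
        path = self.get_cache_path(ticker)
        if not path.exists():
            return None  # a missing ticker yields None rather than an error
        return json.loads(path.read_text(encoding="utf-8"))

    def list_tickers(self) -> list:
        # Derive ticker names from file stems, in sorted order.
        return sorted(p.stem for p in self.cache_dir.glob("*.json"))
```

Keeping one file per ticker makes each entry easy to inspect, diff, and delete by hand, which suits a small golden set.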


## Load and Inspect Cached Output

In [18]:
# Load the cached output
loaded_summary = manager.load_cached_output("TEST")

if loaded_summary:
    print(f"Ticker: {loaded_summary.ticker}")
    print(f"Company: {loaded_summary.company_name}")
    print(f"Model: {loaded_summary.model}")
    print(f"Prompt Version: {loaded_summary.prompt_version}")
    print(f"\nRisks ({len(loaded_summary.risks)}):")
    for i, risk in enumerate(loaded_summary.risks, 1):
        print(f"  {i}. {risk}")
else:
    print("❌ Failed to load cached output")

INFO - evaluation.golden_output_manager - Loaded cached output for TEST (generated at 2026-01-08T23:56:47.197142Z)


Ticker: TEST
Company: Test REIT Company
Model: llama-3.3-70b-versatile
Prompt Version: 1.0

Risks (5):
  1. Interest rate risk from variable-rate debt exposure
  2. Market risk from economic downturn affecting occupancy rates
  3. Regulatory risk from changes in REIT tax requirements
  4. Concentration risk from geographic market exposure
  5. Liquidity risk from refinancing obligations


## Inspect Raw JSON Cache File

In [19]:
# Read the raw JSON file
cache_path = manager.get_cache_path("TEST")
with open(cache_path, 'r', encoding='utf-8') as f:
    raw_data = json.load(f)

print("Raw cache file content:")
print(json.dumps(raw_data, indent=2))

Raw cache file content:
{
  "ticker": "TEST",
  "company_name": "Test REIT Company",
  "risks": [
    "Interest rate risk from variable-rate debt exposure",
    "Market risk from economic downturn affecting occupancy rates",
    "Regulatory risk from changes in REIT tax requirements",
    "Concentration risk from geographic market exposure",
    "Liquidity risk from refinancing obligations"
  ],
  "model": "llama-3.3-70b-versatile",
  "prompt_version": "1.0",
  "generated_at": "2026-01-08T23:56:47.197142Z",
  "input_text_length": 5000,
  "cache_hit": false
}
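
The fields in this file suggest a simple sanity check before a cached record is trusted. A minimal sketch, assuming the eight keys shown above are all required: `validate_record` is a hypothetical helper, and the key set is inferred from this one file rather than from a documented schema.

```python
from datetime import datetime

# Assumption: every key seen in TEST.json is mandatory.
REQUIRED_KEYS = {
    "ticker", "company_name", "risks", "model",
    "prompt_version", "generated_at", "input_text_length", "cache_hit",
}


def validate_record(record: dict) -> list:
    """Return a list of problems found; an empty list means the record looks sane."""
    problems = [f"missing key: {k}" for k in sorted(REQUIRED_KEYS - set(record))]
    risks = record.get("risks")
    if not isinstance(risks, list) or not all(isinstance(r, str) for r in risks):
        problems.append("risks must be a list of strings")
    stamp = record.get("generated_at", "")
    try:
        # fromisoformat() rejects a trailing "Z" before Python 3.11, so normalise it.
        datetime.fromisoformat(str(stamp).replace("Z", "+00:00"))
    except ValueError:
        problems.append("generated_at is not an ISO 8601 timestamp")
    return problems
```

Calling `validate_record(raw_data)` on the dictionary loaded in the cell above should return an empty list if the record matches these assumptions.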


## Test Multiple Tickers

In [13]:
# Create multiple fake outputs
test_tickers = [
    ("AMT", "American Tower Corporation", [
        "5G infrastructure risk from capital expenditure requirements",
        "Regulatory tower siting risk from zoning restrictions",
        "Tenant concentration risk from wireless carrier dependence",
        "International exposure risk from foreign operations",
        "Technology obsolescence risk from network evolution"
    ]),
    ("PLD", "Prologis Inc", [
        "E-commerce shift risk affecting warehouse demand",
        "Supply chain disruption risk from global uncertainties",
        "Development risk from construction cost inflation",
        "Lease rollover risk from tenant turnover",
        "Market saturation risk in key logistics hubs"
    ]),
    ("EQIX", "Equinix Inc", [
        "Data center energy cost risk from power consumption",
        "Cloud competition risk from hyperscale providers",
        "Customer concentration risk from major tech tenants",
        "Cybersecurity risk from data breach exposure",
        "Capacity expansion risk from deployment timelines"
    ])
]

for ticker, company, risks in test_tickers:
    summary = RiskSummary(
        ticker=ticker,
        company_name=company,
        risks=risks,
        model="llama-3.3-70b-versatile",
        prompt_version="1.0"
    )
    manager.save_output(summary, input_text_length=4500, cache_hit=False)
    print(f"✅ Saved {ticker}")

# List all cached tickers
print(f"\nAll cached tickers: {manager.list_cached_tickers()}")

INFO - evaluation.golden_output_manager - Saved golden output for AMT to c:\Users\H244746\Documents\reit-risk-summarizer\evaluation\golden_outputs\AMT.json
INFO - evaluation.golden_output_manager - Saved golden output for PLD to c:\Users\H244746\Documents\reit-risk-summarizer\evaluation\golden_outputs\PLD.json
INFO - evaluation.golden_output_manager - Saved golden output for EQIX to c:\Users\H244746\Documents\reit-risk-summarizer\evaluation\golden_outputs\EQIX.json


✅ Saved AMT
✅ Saved PLD
✅ Saved EQIX

All cached tickers: ['AMT', 'EQIX', 'PLD', 'TEST']


## Load Golden Dataset

Test loading the golden dataset CSV with expert-labeled risks.

In [17]:
# Load golden dataset
dataset_path = Path.cwd().parent / "evaluation" / "golden_dataset.csv"

if dataset_path.exists():
    df = pd.read_csv(dataset_path)
    print(f"Golden dataset loaded: {len(df)} records")
    print(f"\nColumns: {df.columns.tolist()}")
    print(f"\nFirst few rows:")
    print(df.head())
    
    # Show unique tickers
    print(f"\nUnique tickers in dataset: {df['ticker'].nunique()}")
    print(f"Tickers: {sorted(df['ticker'].unique())}")
else:
    print(f"❌ Dataset not found at {dataset_path}")

Golden dataset loaded: 50 records

Columns: ['ticker', 'company_name', 'sector', 'filing_year', 'risk_rank', 'risk_category', 'risk_title', 'risk_description', 'why_material', 'unique_to_sector']

First few rows:
  ticker company_name                sector  filing_year  risk_rank  \
0    PLD     Prologis  Industrial/Logistics         2023          1   
1    PLD     Prologis  Industrial/Logistics         2023          2   
2    PLD     Prologis  Industrial/Logistics         2023          3   
3    PLD     Prologis  Industrial/Logistics         2023          4   
4    PLD     Prologis  Industrial/Logistics         2023          5   

              risk_category                                   risk_title  \
0  Geographic Concentration                   California Market Exposure   
1    Customer Concentration                    Major Customer Dependency   
2          Foreign Currency  Foreign Currency & International Operations   
3               Development                   Developmen
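
Because each row of the golden dataset carries a `ticker`, a `risk_rank`, and a `risk_title`, a later evaluation step will likely need the expert risks as one ranked list per ticker. A minimal sketch of that reshaping, using only the standard library so it runs without pandas. `ranked_titles_by_ticker` is a hypothetical helper, and the sample holds three rows copied from the output above with the other columns omitted:

```python
import csv
import io
from collections import defaultdict


def ranked_titles_by_ticker(csv_text: str) -> dict:
    """Map each ticker to its risk titles, ordered by the `risk_rank` column."""
    rows_by_ticker = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        rows_by_ticker[row["ticker"]].append((int(row["risk_rank"]), row["risk_title"]))
    return {
        ticker: [title for _, title in sorted(pairs)]
        for ticker, pairs in rows_by_ticker.items()
    }


# Rows are deliberately out of order to show that risk_rank drives the result.
sample = """ticker,risk_rank,risk_title
PLD,2,Major Customer Dependency
PLD,1,California Market Exposure
PLD,3,Foreign Currency & International Operations
"""
print(ranked_titles_by_ticker(sample))
```

With the DataFrame already loaded, `df.sort_values("risk_rank").groupby("ticker")["risk_title"].apply(list)` is one pandas way to build a similar mapping.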

## Delete Cached Output

In [23]:
# Delete a specific cached output
deleted = manager.delete_output("TEST")
print(f"Deleted TEST: {deleted}")
print(f"TEST still cached: {manager.has_cached_output('TEST')}")

# List remaining cached tickers
print(f"Remaining cached tickers: {manager.list_cached_tickers()}")

INFO - evaluation.golden_output_manager - Deleted cached output for TEST


Deleted TEST: True
TEST still cached: False
Remaining cached tickers: ['AMT', 'EQIX', 'PLD']


## Verify Cache Operations

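Check edge cases: loading a ticker that is not cached, `has_cached_output` for cached and uncached tickers, and cache path resolution.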

In [24]:
# Test loading non-existent ticker
non_existent = manager.load_cached_output("NONEXIST")
print(f"Loading non-existent ticker: {non_existent}")

# Test has_cached_output for existing and non-existing
print(f"\nAMT cached: {manager.has_cached_output('AMT')}")
print(f"FAKE cached: {manager.has_cached_output('FAKE')}")

# Get cache paths
print(f"\nCache path for AMT: {manager.get_cache_path('AMT')}")
print(f"Path exists: {manager.get_cache_path('AMT').exists()}")

Loading non-existent ticker: None

AMT cached: True
FAKE cached: False

Cache path for AMT: c:\Users\H244746\Documents\reit-risk-summarizer\evaluation\golden_outputs\AMT.json
Path exists: True


## Clear All Cache (Optional)

**Warning**: This cell deletes all cached outputs. It was executed in this run, as the output below shows; comment out the code to keep the cache.

In [25]:
# Clear all cached outputs (comment out these lines to skip)
count = manager.clear_all()
print(f"Deleted {count} cached outputs")
print(f"Remaining: {manager.list_cached_tickers()}")

INFO - evaluation.golden_output_manager - Cleared 3 cached outputs


Deleted 3 cached outputs
Remaining: []


## Summary

This notebook tested the Golden Output Manager functionality:
- ✅ Initialize manager with default cache directory
- ✅ Save fake RiskSummary objects to cache
- ✅ Load cached outputs and verify data integrity
- ✅ Inspect raw JSON cache files
- ✅ List all cached tickers
- ✅ Delete specific cached outputs
- ✅ Load golden dataset CSV
- ✅ Test edge cases (non-existent tickers, path checks)
- ✅ Clear all cached outputs

**Next Steps**:
1. Run real Groq API calls to generate actual golden outputs (when tokens are available)
2. Build Phase 2: metrics.py with semantic similarity and NDCG scoring
3. Create evaluation reports and sector analysis
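
For the NDCG item in the next steps, one common formulation divides the discounted cumulative gain of the predicted ordering by that of the ideal ordering. A minimal sketch, assuming a graded relevance score is already available for each predicted risk. How `metrics.py` will actually derive relevance (for example, from semantic similarity to the expert-labeled risks) is not specified in this notebook:

```python
import math
from typing import Sequence


def dcg(relevances: Sequence[float]) -> float:
    """Discounted cumulative gain with the log2(position + 1) discount."""
    return sum(rel / math.log2(pos + 1) for pos, rel in enumerate(relevances, start=1))


def ndcg(relevances: Sequence[float]) -> float:
    """DCG of the given ordering divided by the DCG of its best reordering."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0
```

This version normalizes against the best reordering of the same scores, so it measures ranking quality only. Penalizing expert risks that the model missed entirely would require computing the ideal DCG from the full golden list instead.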