## 02 – Evaluation & Testing

🟩 **GOOD:** This notebook demonstrates production-grade evaluation and testing workflows for CodeCraft AI, aligned with Clean Architecture and AWS-native best practices.

### Purpose
- Evaluate model and API responses for accuracy, reliability, and business alignment
- Provide reproducible, testable evaluation experiments for stakeholders and technical reviewers
- Serve as a reference for regression testing, prompt validation, and operational monitoring

### Prerequisites
- The backend API service must be running and accessible at the endpoint specified by `CODECRAFT_API_URL`.
- All secrets and config must be injected via environment variables (never hardcoded).
- This notebook is a client only; all business logic and data processing are handled by the backend.
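
The first prerequisite can be partly checked before any request is sent. The following is a minimal sketch, assuming only the `CODECRAFT_API_URL` variable named above; `check_api_url` is a hypothetical helper, not part of the CodeCraft AI codebase. It validates the shape of the URL only and does not confirm that the backend is reachable.

```python
import os
from typing import List
from urllib.parse import urlparse


def check_api_url(url: str) -> List[str]:
    """Return a list of problems found in the URL (empty if it looks usable)."""
    problems = []
    if not url:
        problems.append("CODECRAFT_API_URL is empty or unset")
        return problems
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):
        problems.append(f"unsupported scheme: {parsed.scheme!r}")
    if not parsed.netloc:
        problems.append("missing host")
    return problems


# Hypothetical usage: report configuration problems before running any cell.
for problem in check_api_url(os.getenv("CODECRAFT_API_URL", "")):
    print(f"CRITICAL: {problem}")
```

An empty list means the URL is well-formed; a connection error can still occur later if the server is not running.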

> 🟦 **NOTE:** For local development, start your FastAPI server with:
> `poetry run uvicorn src.adapters.api.main:app --reload`

### Environment-Aware Configuration

🟦 **NOTE:** All configuration and secrets are read from environment variables for security and portability. No secrets are hardcoded; the only literals in the code are non-sensitive fallbacks, such as the local default URL.

In [11]:
import os
from typing import Optional

import requests

API_URL = os.getenv("CODECRAFT_API_URL", "http://localhost:8000/query")
API_AUTH_TOKEN = os.getenv("CODECRAFT_API_TOKEN", "")
API_AUTH_HEADER = os.getenv("CODECRAFT_API_AUTH_HEADER", "Authorization")
API_AUTH_PREFIX = os.getenv("CODECRAFT_API_AUTH_PREFIX", "Bearer")

def ask_ai(query: str, top_k: int = 3, extra_payload: Optional[dict] = None) -> Optional[dict]:
    """Send a query to the backend API and return the parsed JSON, or None on failure."""
    payload = {"query": query, "top_k": top_k}
    if extra_payload:
        payload.update(extra_payload)
    headers = {}
    if API_AUTH_TOKEN:
        # An empty prefix sends the token as-is (e.g. for raw API keys).
        if API_AUTH_PREFIX:
            headers[API_AUTH_HEADER] = f"{API_AUTH_PREFIX} {API_AUTH_TOKEN}"
        else:
            headers[API_AUTH_HEADER] = API_AUTH_TOKEN
    try:
        response = requests.post(API_URL, json=payload, headers=headers, timeout=30)
        response.raise_for_status()
        return response.json()
    except (requests.RequestException, ValueError) as e:
        # RequestException covers connection, timeout, and HTTP status errors;
        # ValueError covers a response body that is not valid JSON.
        print(f"🟥 CRITICAL: API error: {e}")
        return None
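
The header logic inside `ask_ai` can also be factored into a pure function so that it can be checked without a running backend. This is a sketch under that assumption; `build_auth_headers` is a hypothetical helper name, and its defaults mirror the environment variable defaults above.

```python
from typing import Dict


def build_auth_headers(
    token: str, header: str = "Authorization", prefix: str = "Bearer"
) -> Dict[str, str]:
    """Build auth headers the same way ask_ai does; no token means no header."""
    if not token:
        return {}
    # An empty prefix sends the raw token as the header value.
    value = f"{prefix} {token}" if prefix else token
    return {header: value}


# Placeholder token for illustration only; real tokens come from the environment.
print(build_auth_headers("example-token"))
# {'Authorization': 'Bearer example-token'}
print(build_auth_headers("example-token", header="X-API-Key", prefix=""))
# {'X-API-Key': 'example-token'}
print(build_auth_headers(""))
# {}
```

Keeping this logic free of network calls makes each auth configuration cheap to verify in a unit test.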

### Automated Evaluation Suite

🟩 **GOOD:** Use this section to run regression tests and evaluate model responses against expected outputs.

In [12]:
from pprint import pprint

test_cases = [
    {
        "prompt": "Summarize the main features of CodeCraft AI.",
        "expected_keywords": ["AWS", "vector", "scalable", "secure"]
    },
    {
        "prompt": "How does CodeCraft AI handle data privacy?",
        "expected_keywords": ["encryption", "compliance", "IAM"]
    },
]

def extract_text_from_response(response):
    """
    Extracts concatenated text from all results for robust keyword evaluation.
    """
    if not response or "results" not in response or not response["results"]:
        return ""
    # 🟦 NOTE: Assumes each result is a dict; a result without a 'content' field contributes an empty string.
    return " ".join(str(item.get("content", "")) for item in response["results"])

def evaluate_response(response, expected_keywords):
    """
    Evaluates if all expected keywords are present in the concatenated response text.
    """
    # Lowercase the text once so every keyword comparison is case-insensitive.
    text = extract_text_from_response(response).lower()
    if not text:
        print("🟦 Diagnostics: No content returned in response.")
        return False
    return all(keyword.lower() in text for keyword in expected_keywords)

for i, case in enumerate(test_cases, 1):
    result = ask_ai(case["prompt"])
    passed = evaluate_response(result, case["expected_keywords"])
    print(f"Test {i}: {'PASS' if passed else 'FAIL'}")
    if not passed:
        print("🟦 Diagnostics: Response text was:")
        print(extract_text_from_response(result))
        print("🟦 Expected keywords:", case["expected_keywords"])
    pprint(result)


🟦 Diagnostics: No content returned in response.
Test 1: FAIL
🟦 Diagnostics: Response text was:

🟦 Expected keywords: ['AWS', 'vector', 'scalable', 'secure']
{'results': []}
🟦 Diagnostics: No content returned in response.
Test 2: FAIL
🟦 Diagnostics: Response text was:

🟦 Expected keywords: ['encryption', 'compliance', 'IAM']
{'results': []}
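
Both failures above come from an empty `results` list, which a single PASS/FAIL flag cannot distinguish from a response that has content but lacks some keywords. The following is a minimal sketch of a finer-grained check; `missing_keywords` is a hypothetical helper, and it uses the same case-insensitive substring matching as `evaluate_response`.

```python
from typing import List


def missing_keywords(text: str, expected_keywords: List[str]) -> List[str]:
    """Return the expected keywords that do not occur in text (case-insensitive substring match)."""
    lowered = text.lower()
    return [kw for kw in expected_keywords if kw.lower() not in lowered]


# Illustrative text, not an actual API response.
sample = "Data is protected with encryption at rest and IAM policies."
print(missing_keywords(sample, ["encryption", "compliance", "IAM"]))  # ['compliance']
print(missing_keywords("", ["AWS", "vector"]))  # ['AWS', 'vector']
```

Because this is substring matching, short keywords such as "IAM" can also match inside longer words (for example "diameter"), so a word-boundary match may be preferable for stricter regression tests.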


### Manual Evaluation

🟦 **NOTE:** Use this cell to manually test new prompts or edge cases.

In [13]:
my_prompt = "Describe how CodeCraft AI supports cost optimization in AWS."
my_result = ask_ai(my_prompt)
pprint(my_result)

{'results': []}
