# Evaluation Prototyping

This notebook prototypes and tests the `evaluate_answer` function with various sample student answers.

## Purpose

We test the evaluation system with:
- **Good answers**: Comprehensive, correct responses
- **Mediocre answers**: Partially correct with some gaps
- **Bad answers**: Incorrect or missing key concepts

This helps us understand how well the LLM evaluator performs and identify potential improvements.


In [1]:
# Load environment variables from .env file (if it exists)
# This allows notebooks to use the same configuration as the main app
try:
    from dotenv import load_dotenv
    load_dotenv()  # Loads variables from .env file in project root
    print("Environment variables loaded from .env file")
except ImportError:
    print("python-dotenv not installed. Using system environment variables only.")
except Exception as e:
    print(f"Note: Could not load .env file: {e}")
    print("Using system environment variables only.")


Environment variables loaded from .env file


In [2]:
import sys
from pathlib import Path
import os

# Find the project root: the nearest ancestor that contains a src/ package
current = Path.cwd()
project_root = None

if current.name == 'notebooks':
    # Notebooks live one level below the project root
    project_root = current.parent
else:
    # Walk up the directory tree looking for src/__init__.py
    for candidate in [current, *current.parents]:
        if (candidate / 'src' / '__init__.py').exists():
            project_root = candidate
            break

# Fallback: treat the current directory as the project root
if project_root is None:
    project_root = current

# Change to project root directory so relative paths work correctly
os.chdir(project_root)

# Add project root to path to import src modules
sys.path.insert(0, str(project_root))

from src.data_loader import load_qa_dataset, get_random_question
from src.evaluator import evaluate_answer
from src import config
import pandas as pd

print("Modules imported successfully")
print(f"Current working directory: {os.getcwd()}")


Modules imported successfully
Current working directory: c:\Users\Levin\OneDrive\Desktop\DAI Assignment Part 2


In [3]:
# Load dataset and pick a sample question
# Use absolute path to ensure it works regardless of working directory
data_file = project_root / "data" / "Q&A_db_practice.json"
df = load_qa_dataset(path=data_file)
sample_q = get_random_question(df)

print("Sample Question:")
print(f"ID: {sample_q['id']}")
print(f"Question: {sample_q['question']}")
print(f"\nReference Answer:\n{sample_q['answer']}")

INFO:src.data_loader:Successfully loaded 150 question-answer pairs from c:\Users\Levin\OneDrive\Desktop\DAI Assignment Part 2\data\Q&A_db_practice.json


Sample Question:
ID: 44
Question: Epoch

Reference Answer:
An epoch is a training iteration that constitutes a complete forward and backward pass through the entire labeled dataset, updating model parameters once per batch, and is repeated until convergence criteria are met


## Test Case 1: Good Answer

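A comprehensive, correct answer that covers all key points. This test reuses the reference answer verbatim, so perfect ROUGE scores are expected by construction; it checks the upper bound of the scoring rather than a realistic student response.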


In [None]:
# Good answer (comprehensive and correct)
good_answer = sample_q['answer']  # Using reference as a "good" answer for testing

print("Student Answer (Good):")
print(good_answer)
print("\n" + "="*80)

# Evaluate
result_good = evaluate_answer(
    question_id=sample_q['id'],
    question=sample_q['question'],
    reference_answer=sample_q['answer'],
    student_answer=good_answer,
    language="English"
)

print(f"\nLLM Score: {result_good.llm_score}/100")
print(f"ROUGE-1: {result_good.rouge_1:.3f}")
print(f"ROUGE-L: {result_good.rouge_l:.3f}")
print(f"\nExplanation:\n{result_good.llm_explanation}")


INFO:src.evaluator:Evaluating answer for question ID: 44


Student Answer (Good):
An epoch is a training iteration that constitutes a complete forward and backward pass through the entire labeled dataset, updating model parameters once per batch, and is repeated until convergence criteria are met



INFO:src.llm_interface:Initialized OpenAI client
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
INFO:src.evaluator:Parsed LLM score: 100/100
INFO:src.evaluator:Loading ROUGE metric...
INFO:src.evaluator:ROUGE metric loaded
INFO:absl:Using default tokenizer.
INFO:src.evaluator:ROUGE-1: 1.000, ROUGE-L: 1.000



LLM Score: 100/100
ROUGE-1: 1.000
ROUGE-L: 1.000

Explanation:
**Step 1 - Content Analysis:**
The reference answer includes the following key concepts:
- Definition of an epoch as a training iteration
- Description of a complete forward and backward pass through the entire labeled dataset
- Mention of updating model parameters once per batch
- Indication that this process is repeated until convergence criteria are met

The student's answer contains all these concepts exactly as presented in the reference answer.

**Step 2 - Correctness Assessment:**
- Correct Elements: 
  - The definition of an epoch as a training iteration
  - The description of a complete forward and backward pass through the entire labeled dataset
  - The updating of model parameters once per batch
  - The repetition of this process until convergence criteria are met

- Missing Elements: None; the student included all relevant concepts.

- Errors/Misconceptions: None; the student's answer is accurate.

**Step 3 - S

## Test Case 2: Mediocre Answer

A partially correct answer: on topic but generic, leaving out the details that define the concept.


In [None]:
# Mediocre answer (partially correct)
mediocre_answer = "This is a concept in machine learning. It's used for training models and helps with optimization."

print("Student Answer (Mediocre):")
print(mediocre_answer)
print("\n" + "="*80)

# Evaluate
result_mediocre = evaluate_answer(
    question_id=sample_q['id'],
    question=sample_q['question'],
    reference_answer=sample_q['answer'],
    student_answer=mediocre_answer,
    language="English"
)

print(f"\nLLM Score: {result_mediocre.llm_score}/100")
print(f"ROUGE-1: {result_mediocre.rouge_1:.3f}")
print(f"ROUGE-L: {result_mediocre.rouge_l:.3f}")
print(f"\nExplanation:\n{result_mediocre.llm_explanation}")


INFO:src.evaluator:Evaluating answer for question ID: 44


Student Answer (Mediocre):
This is a concept in machine learning. It's used for training models and helps with optimization.



INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
INFO:src.evaluator:Parsed LLM score: 40/100
INFO:absl:Using default tokenizer.
INFO:src.evaluator:ROUGE-1: 0.160, ROUGE-L: 0.160



LLM Score: 40/100
ROUGE-1: 0.160
ROUGE-L: 0.160

Explanation:
**Step 1 - Content Analysis:**
The reference answer defines an epoch as a complete training iteration that includes both a forward and backward pass through the entire dataset, emphasizing the updating of model parameters per batch and the repetition until convergence. The student's answer acknowledges that an epoch is a concept in machine learning related to training models and optimization, but it lacks the specific details about what constitutes an epoch.

**Step 2 - Correctness Assessment:**
- Correct Elements: The student correctly identifies that an epoch is related to training models in machine learning and mentions its role in optimization.
- Missing Elements: The student does not mention the complete forward and backward pass through the dataset, the updating of model parameters, or the concept of convergence criteria.
- Errors/Misconceptions: There are no outright errors in the student's answer, but it is overly v

## Test Case 3: Bad Answer

An incorrect answer or one that misses key concepts.


In [None]:
# Bad answer (incorrect or missing key points)
bad_answer = "I don't really know much about this topic. Maybe it's related to data science?"

print("Student Answer (Bad):")
print(bad_answer)
print("\n" + "="*80)

# Evaluate
result_bad = evaluate_answer(
    question_id=sample_q['id'],
    question=sample_q['question'],
    reference_answer=sample_q['answer'],
    student_answer=bad_answer,
    language="English"
)

print(f"\nLLM Score: {result_bad.llm_score}/100")
print(f"ROUGE-1: {result_bad.rouge_1:.3f}")
print(f"ROUGE-L: {result_bad.rouge_l:.3f}")
print(f"\nExplanation:\n{result_bad.llm_explanation}")


INFO:src.evaluator:Evaluating answer for question ID: 44


Student Answer (Bad):
I don't really know much about this topic. Maybe it's related to data science?



INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
INFO:src.evaluator:Parsed LLM score: 10/100
INFO:absl:Using default tokenizer.
INFO:src.evaluator:ROUGE-1: 0.000, ROUGE-L: 0.000



LLM Score: 10/100
ROUGE-1: 0.000
ROUGE-L: 0.000

Explanation:
**Step 1 - Content Analysis:**
The reference answer defines an epoch in the context of machine learning as a complete training iteration that includes a forward and backward pass through the entire labeled dataset, updating model parameters once per batch, and repeating until convergence criteria are met. The student's answer does not contain any of these key concepts and instead expresses uncertainty about the topic, suggesting a lack of understanding.

**Step 2 - Correctness Assessment:**
- Correct Elements: None identified; the student did not provide any accurate definitions or explanations related to the concept of an epoch.
- Missing Elements: The entire definition of an epoch, including the concepts of forward and backward passes, updating model parameters, and convergence criteria.
- Errors/Misconceptions: The statement "I don't really know much about this topic" indicates a lack of understanding rather than a misco

## Analysis and Observations

### Strengths of the Evaluation Approach

1. **LLM Explanation**: Provides detailed, contextual feedback that goes beyond simple metrics
2. **ROUGE Metrics**: Offer an objective, quantitative measure of lexical overlap with the reference answer
3. **Combined Approach**: LLM and metrics together provide both qualitative and quantitative assessment
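
To make the ROUGE numbers above easier to interpret, here is a minimal sketch that recomputes ROUGE-1 by hand for the mediocre and bad answers. It assumes the evaluator reports the F1 variant and that tokenization lowercases the text and keeps alphanumeric runs without stemming; both are assumptions about `src.evaluator`, not confirmed by the notebook. The helper names `tokenize` and `rouge1_f1` are illustrative only.

```python
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    # Assumption: lowercase, alphanumeric runs only, no stemming
    return re.findall(r"[a-z0-9]+", text.lower())


def rouge1_f1(reference: str, candidate: str) -> float:
    ref_counts = Counter(tokenize(reference))
    cand_counts = Counter(tokenize(candidate))
    # Clipped unigram overlap: a token counts at most as often as it
    # appears in both texts
    overlap = sum((ref_counts & cand_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


reference = (
    "An epoch is a training iteration that constitutes a complete forward "
    "and backward pass through the entire labeled dataset, updating model "
    "parameters once per batch, and is repeated until convergence criteria "
    "are met"
)
mediocre = (
    "This is a concept in machine learning. It's used for training models "
    "and helps with optimization."
)
bad = "I don't really know much about this topic. Maybe it's related to data science?"

print(f"{rouge1_f1(reference, mediocre):.3f}")  # 0.160
print(f"{rouge1_f1(reference, bad):.3f}")       # 0.000
```

Under these assumptions the sketch reproduces the logged values. The 0.160 for the mediocre answer comes from only four shared tokens ("is", "a", "training", "and"), three of which are function words, so a small non-zero ROUGE score says little about whether the answer is conceptually correct.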

### Potential Improvements

1. **Score Parsing**: Score parsing could be made more robust, for example by requesting a structured output format such as JSON
2. **Consistency**: LLM scores may vary between runs; the low temperature (0.2) reduces this variance but does not eliminate it
3. **Fine-tuning**: A fine-tuned model specifically for evaluation could improve consistency
4. **Judge LLM**: Using a separate, specialized "judge" model could improve evaluation quality
5. **Multi-aspect Scoring**: Break down scores by different aspects (accuracy, completeness, clarity)
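
As an illustration of the score-parsing point, the sketch below tries a structured JSON response first and falls back to a free-text pattern such as `40/100`. The function `parse_score`, the `"score"` field name, and the fallback pattern are all hypothetical; the actual parsing logic in `src.evaluator` is not shown in this notebook and may differ.

```python
import json
import re
from typing import Optional


def parse_score(raw: str) -> Optional[int]:
    """Extract a 0-100 score from an LLM response (hypothetical helper)."""
    # Preferred path: the response is a JSON object with a "score" field
    try:
        payload = json.loads(raw)
        if isinstance(payload, dict) and "score" in payload:
            score = int(payload["score"])
            return score if 0 <= score <= 100 else None
    except (json.JSONDecodeError, TypeError, ValueError):
        pass  # Not valid JSON, or the score is not numeric

    # Fallback: look for a pattern such as "Score: 85/100" in free text
    match = re.search(r"(\d{1,3})\s*/\s*100", raw)
    if match:
        score = int(match.group(1))
        return score if 0 <= score <= 100 else None
    return None


print(parse_score('{"score": 85, "explanation": "Mostly correct"}'))  # 85
print(parse_score("Final Score: 40/100"))                             # 40
print(parse_score("No score given"))                                  # None
```

Returning `None` instead of a default value keeps parsing failures visible, so a malformed response is not silently recorded as a score of zero.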

### Alternative Approaches

- **Pure Metric-based**: Use only ROUGE/BLEU (fast, but blind to paraphrase: a correct answer in different words scores low)
- **Fine-tuned Evaluator**: Train a model specifically for answer evaluation
- **RLHF**: Use reinforcement learning from human feedback to improve the evaluator
- **Ensemble**: Combine multiple LLMs or evaluation methods for robustness
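
The ensemble idea can be sketched as taking the median over several evaluator scores, which limits the influence of a single outlier. Everything in this block is an assumption for illustration: `ensemble_score` is a hypothetical helper, and the three judge functions are placeholder stubs returning fixed values where real LLM calls would go.

```python
from statistics import median
from typing import Callable, Sequence

# An evaluator takes (reference_answer, student_answer) and returns a 0-100 score
Evaluator = Callable[[str, str], float]


def ensemble_score(evaluators: Sequence[Evaluator], reference: str, student: str) -> float:
    """Combine several evaluator scores with the median (hypothetical helper)."""
    if not evaluators:
        raise ValueError("At least one evaluator is required")
    scores = [evaluate(reference, student) for evaluate in evaluators]
    return median(scores)


# Placeholder stubs standing in for real LLM-backed evaluators
def strict_judge(reference: str, student: str) -> float:
    return 35.0


def lenient_judge(reference: str, student: str) -> float:
    return 40.0


def outlier_judge(reference: str, student: str) -> float:
    return 90.0


combined = ensemble_score(
    [strict_judge, lenient_judge, outlier_judge],
    reference="reference text",
    student="student text",
)
print(combined)  # 40.0
```

The same aggregation could be applied to repeated runs of a single evaluator to smooth out run-to-run variation, at the cost of one extra API call per run.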
