In [1]:
import json
import random
import sys
import os

# Add the project root directory to the Python path
project_root = os.path.abspath(os.path.join(os.getcwd(), '..'))
if project_root not in sys.path:
    sys.path.insert(0, project_root)

random.seed(2)

In [2]:
from src.agents.context_qa_correction_agent import ContextQACorrectionAgent

In [3]:
with open("../dataset/raw_model_responses/train/train_rajpurkar_squad.json", "r") as f:
    json_data = json.load(f)

In [4]:
json_data

[{'task_info': {'type': 'Contextual QA', 'dataset': 'rajpurkar_squad'},
  'additional_info': {'model': 'meta-llama/Llama-3.2-1B-Instruct',
   'question': 'What percentage of Egyptians polled support death penalty for those leaving Islam?',
   'context': 'The Pew Forum on Religion & Public Life ranks Egypt as the fifth worst country in the world for religious freedom. The United States Commission on International Religious Freedom, a bipartisan independent agency of the US government, has placed Egypt on its watch list of countries that require close monitoring due to the nature and extent of violations of religious freedom engaged in or tolerated by the government. According to a 2010 Pew Global Attitudes survey, 84% of Egyptians polled supported the death penalty for those who leave Islam; 77% supported whippings and cutting off of hands for theft and robbery; and 82% support stoning a person who commits adultery.',
   'answer': ['84%'],
   'title': 'Egypt'},
  'input': "<|begin_of_te

In [5]:
index = 1
responses = json_data[index]["response"]

question = json_data[index]["additional_info"]["question"]
context = json_data[index]["additional_info"]["context"]
answer = json_data[index]["additional_info"]["answer"]

In [6]:
responses

['The answer is books',
 'According to the context, Ann Arbor ranks 1st among booksellers.',
 'The answer is books',
 'Based on the context, Ann Arbor ranks 1st among books sold.']

In [7]:
# responses[0] = 'According to the text, 84% of Chinese polled supported the death penalty for those leaving Japan.'
# responses[1] = 'According to the text, 84% of Egyptians polled support the death penalty for those leaving Japan.'
# responses[2] = 'The answer is 25%.'
# responses[3] = 'The text does not specify the percentage, thus there is no answer. However, it does mention that 84% of Egyptians polled supported the death penalty for those who leave Islam.'

In [8]:
agent = ContextQACorrectionAgent(estimate_only=False)

state = agent.model.invoke({
    "question": question,
    "context": context,
    "answer": answer,
    "responses": responses,
})

You are a meticulous AI quality assurance expert and diagnostician. Your sole function is to evaluate a given response and provide a structured diagnosis for every flaw you find. You MUST NOT use any external knowledge and base your analysis *only* on the provided context and correct answer.

# Inputs
**Question**: Ann Arbor ranks 1st among what goods sold?
**Context**: '''The Ann Arbor Hands-On Museum is located in a renovated and expanded historic downtown fire station. Multiple art galleries exist in the city, notably in the downtown area and around the University of Michigan campus. Aside from a large restaurant scene in the Main Street, South State Street, and South University Avenue areas, Ann Arbor ranks first among U.S. cities in the number of booksellers and books sold per capita. The Ann Arbor District Library maintains four branch outlets in addition to its main downtown building. The city is also home to the Gerald R. Ford Presidential Library.'''
**Correct Answer**: ['book

2025-07-22 20:30:52.664 | INFO     | src.agents.context_qa_correction_agent:check_for_errors:87 - Errors: [[], [Error(error_description='The response states that Ann Arbor ranks 1st among booksellers, but the context states that Ann Arbor ranks first among U.S. cities in the number of booksellers and books sold per capita. The correct answer is books.', error_location='booksellers', error_correction='Delete "booksellers" and add "books".')], [], [Error(error_description='The response is not concise and contains unnecessary words.', error_location='Based on the context, Ann Arbor ranks 1st among', error_correction="Delete 'Based on the context, '"), Error(error_description="The answer should be 'books' instead of 'books sold'.", error_location='books sold', error_correction="Delete 'sold'")]]
2025-07-22 20:30:52.665 | INFO     | src.agents.context_qa_correction_agent:should_correct:101 - 

You are a Master Editor. Your task is to perform a final, definitive correction on a flawed response using a special token-based editing language. You will act as the final authority, using a junior editor's analysis as guidance but not as a command.

# Core Task
You will be given the `Question`, `Context`, `Correct Answer`, `Incorrect Response`, and `Errors in response`. Your mission is to apply precise edits to the `Incorrect Response` to make it grammatically perfect and semantically identical to the `Correct Answer`.

# Token-Based Editing Language
You have two operations at your disposal: deleting text with special tokens and adding new text. You may use one or both.

1.  **<DEL_W> (Delete Word)**: Deletes the single word (not a token!) immediately preceding it. A word is a string of characters separated by spaces. Punctuation is part of the word (e.g., 'end.').
    *   **Pro Tip**: Deleting a single word can create a grammatical flaw (e.g., a dangling 'and'). Always check the con

2025-07-22 20:30:53.556 | INFO     | src.agents.context_qa_correction_agent:correct_response:155 - Corrected responses: ['Based on the context,<DEL_W> Ann Arbor ranks 1st among books<DEL_W> sold.', 'According to the context, Ann Arbor ranks 1st among booksellers<DEL_W> books.']
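
To make the corrected responses above easier to read, here is a minimal sketch of how the `<DEL_W>` markers could be resolved into plain text. It follows only the rule stated in the prompt (each marker deletes the space-delimited word immediately before it, punctuation included). The helper name `apply_del_w` is hypothetical and is not part of `ContextQACorrectionAgent`; the agent's own implementation may differ, for example in how it treats whitespace or consecutive markers.

```python
DEL_W = "<DEL_W>"


def apply_del_w(edited: str) -> str:
    """Resolve <DEL_W> markers (hypothetical helper, not the agent's code).

    Each marker removes the space-delimited word just before it.
    Assumption: whitespace is normalized to single spaces in the result.
    """
    # Pad every marker with spaces so it becomes its own token after split().
    tokens = edited.replace(DEL_W, f" {DEL_W} ").split()

    kept = []
    for token in tokens:
        if token == DEL_W:
            # Drop the most recently kept word, if there is one.
            if kept:
                kept.pop()
        else:
            kept.append(token)
    return " ".join(kept)


corrected = [
    "Based on the context,<DEL_W> Ann Arbor ranks 1st among books<DEL_W> sold.",
    "According to the context, Ann Arbor ranks 1st among booksellers<DEL_W> books.",
]
for text in corrected:
    print(apply_del_w(text))
```

Under this reading, the first corrected response loses both "context," and "books", which matches the `Generated Answer` shown in the verification prompt and explains why it is verified as `False`, while the second resolves to a sentence ending in "books."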


You are a meticulous AI Verification Engine. Your sole purpose is to determine if a `Generated Answer` is correct by strictly comparing it against a `Correct Answer` and its supporting `Context`.

# Task Inputs
**Question**: Ann Arbor ranks 1st among what goods sold?
**Context**: '''The Ann Arbor Hands-On Museum is located in a renovated and expanded historic downtown fire station. Multiple art galleries exist in the city, notably in the downtown area and around the University of Michigan campus. Aside from a large restaurant scene in the Main Street, South State Street, and South University Avenue areas, Ann Arbor ranks first among U.S. cities in the number of booksellers and books sold per capita. The Ann Arbor District Library maintains four branch outlets in addition to its main downtown building. The city is also home to the Gerald R. Ford Presidential Library.'''
**Correct Answer**: ['books']
**Generated Answer**: Based on the Ann Arbor ranks 1st among sold.

# Evaluation Criteri

2025-07-22 20:30:54.224 | INFO     | src.agents.context_qa_correction_agent:verify_corrected_response:188 - Verified responses: ['False', 'True']


In [9]:
state

{'question': 'Ann Arbor ranks 1st among what goods sold?',
 'context': 'The Ann Arbor Hands-On Museum is located in a renovated and expanded historic downtown fire station. Multiple art galleries exist in the city, notably in the downtown area and around the University of Michigan campus. Aside from a large restaurant scene in the Main Street, South State Street, and South University Avenue areas, Ann Arbor ranks first among U.S. cities in the number of booksellers and books sold per capita. The Ann Arbor District Library maintains four branch outlets in addition to its main downtown building. The city is also home to the Gerald R. Ford Presidential Library.',
 'answer': ['books'],
 'responses': ('Based on the context, Ann Arbor ranks 1st among books sold.',
  'According to the context, Ann Arbor ranks 1st among booksellers.'),
 'errors': [[Error(error_description='The response states that Ann Arbor ranks 1st among booksellers, but the context states that Ann Arbor ranks first among U.