In [1]:
import json
import random
import sys
import os

# Add the project root directory to the Python path
project_root = os.path.abspath(os.path.join(os.getcwd(), '..'))
if project_root not in sys.path:
    sys.path.insert(0, project_root)

random.seed(2)

In [2]:
from src.agents.context_qa_correction_agent import ContextQACorrectionAgent
from src.agents.math_qa_correction_agent import MathQACorrectionAgent
from src.agents.summary_correction_agent import SummarizationCorrectionAgent
from src.agents.qa_correction_agent import QACorrectionAgent

In [3]:
with open("../dataset/raw_model_responses/train/train_HaluEval_2_0.json", "r", encoding="utf-8") as f:
    json_data = json.load(f)

In [4]:
json_data = json_data[:10]
json_data

[{'task_info': {'type': 'QA', 'dataset': 'HaluEval_2_0'},
  'additional_info': {'model': 'meta-llama/Llama-3.2-1B-Instruct',
   'question': 'how is a metallic glass different from your window glass ?',
   'domain': 'Education'},
  'input': '<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nProvide a correct, clear, and short answer to the question.<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nhow is a metallic glass different from your window glass ?<|eot_id|><|start_header_id|>assistant<|end_header_id|>',
  'response': ['A metallic glass, also known as glass-metal or glass-ceramic, is a type of glass that is made by heating a mixture of silicon dioxide (SiO2) and other metal oxides, such as cobalt, nickel, or zinc, to extremely high temperatures, typically above 1400°C. This process, called sintering, creates a unique material with properties that distinguish it from regular glass.\n\nHere are the key differences:\n\n1. **Appearance**: Metallic glass has a metalli

In [5]:
question = json_data[-1]["additional_info"]["question"]
response = json_data[-1]["response"]

batch_size = len(response)

In [6]:
agent = QACorrectionAgent(estimate_only=True)

state = agent.model.invoke({
    "question": question,
    "response": response,
})

You are a highly intelligent query routing assistant. Your primary function is to determine whether you have enough internal knowledge to answer a user's question or if you need to perform a web search to gather more information.
Your task is to analyze the user's question and decide on one of two paths:
1.  **If you are confident you know the answer**:
   *   You will provide the necessary information directly in the `context` field.
   *   The `search_query` field MUST be left empty.
   *   The context should be comprehensive enough to fully answer the user's question.
2.  **If you are NOT confident or do not know the answer**:
   *   You will formulate a single, concise, and effective search query to find the answer.
   *   The `search_query` field will contain this query.
   *   The `context` field MUST be left empty.

**Crucial Instructions**:
   *   You must critically assess your own knowledge. Do not attempt to provide context if you are uncertain. It is better to search than t

2025-07-22 08:02:50.259 | INFO     | src.agents.qa_correction_agent:route_question:118 - Routing decision: context='Lambert of Spoleto was Holy Roman Emperor and King of Italy. He married Ageltrude, who was the daughter of Adelchis I of Spoleto.' search_query=''
2025-07-22 08:02:50.260 | INFO     | src.agents.qa_correction_agent:set_internal_context:157 - Using internal context: 'Lambert of Spoleto was Holy Roman Emperor and King of Italy. He married Ageltrude, who was the daughter of Adelchis I of Spoleto.'
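
The routing prompt above asks for exactly one of `context` or `search_query` to be filled in. A minimal sketch of that contract follows, assuming a plain dataclass; `RoutingDecision` and `needs_search` are hypothetical names for illustration and are not taken from `src.agents.qa_correction_agent`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RoutingDecision:
    """Hypothetical stand-in for the router's structured output."""

    context: str = ""
    search_query: str = ""

    def __post_init__(self):
        has_context = bool(self.context.strip())
        has_query = bool(self.search_query.strip())
        # The prompt requires exactly one of the two fields to be non-empty.
        if has_context == has_query:
            raise ValueError("exactly one of context or search_query must be non-empty")

    @property
    def needs_search(self) -> bool:
        # A non-empty search query means the router was not confident.
        return bool(self.search_query.strip())


decision = RoutingDecision(
    context="Lambert of Spoleto was Holy Roman Emperor and King of Italy."
)
print(decision.needs_search)
```

Rejecting the "both empty" and "both filled" cases up front keeps a malformed routing decision from silently skipping both the internal-context and the web-search branch.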


You are a meticulous AI quality assurance expert. Your sole function is to evaluate a given response based *only* on the provided context and correct answer. You MUST NOT use any external knowledge.

# Inputs
**Question**: Which King and holy roman emperor had a wife from Spoleto?
**Context**: '''Lambert of Spoleto was Holy Roman Emperor and King of Italy. He married Ageltrude, who was the daughter of Adelchis I of Spoleto.'''
**Response to Evaluate**: The correct answer is Pope Gregory VII.

# Task
Your mission is to meticulously analyze the `Response to Evaluate` and identify all its flaws. Compare it against the `Context` to find any discrepancy or poor response quality. Your goal is to create a list of all errors in the response.

Think step-by-step:
1.  Does the response correctly answer the question?
2.  Is all information in the response supported by the `Context`?
3.  Does the response introduce any unhelpful or evasive language? For example, phrases like 'The text does not spe

KeyboardInterrupt: 

In [None]:
state

{'question': 'Do you agree with the claim that "Side effects associated with antidepressants do not represent a mortality risk to postmenopausal women."? Provide factual statements about the claim.',
 'response': ['The claim that "Side effects associated with antidepressants do not represent a mortality risk to postmenopausal women" is not entirely accurate. Here are some factual statements that support the opposite:\n\n1. **Increased risk of mortality**: Studies have consistently shown that antidepressant use in postmenopausal women can be associated with a significant increase in mortality, particularly cardiovascular mortality. A meta-analysis of 17 studies found that antidepressant use was associated with a 20-40% increased risk of all-cause mortality in postmenopausal women. (Source: "Antidepressant use and mortality in postmenopausal women: A systematic review and meta-analysis of 17 studies" (2014))\n\n2. **Increased risk of cardiovascular mortality**: The risk of cardiovascular

In [None]:
# Deliberate ZeroDivisionError, presumably to halt "Run All" at this point.
1/0

ZeroDivisionError: division by zero

In [None]:
from langchain_tavily import TavilySearch

tool = TavilySearch(
    max_results=2,
    topic="general",
    include_answer=False,
    include_raw_content=True,
    # include_images=False,
    # include_image_descriptions=False,
    # search_depth="basic",
    # time_range="day",
    # include_domains=None,
    # exclude_domains=None
)

In [None]:
tool.invoke({"query": "Third type of intelligence"})

{'query': 'Third type of intelligence',
 'follow_up_questions': None,
 'answer': 'The third type of intelligence is practical intelligence, which involves applying skills to everyday situations. It focuses on problem-solving and adaptability. This type is crucial for real-world success.',
 'images': [],
 'results': [{'url': 'https://www.thoughtco.com/triarchic-theory-of-intelligence-4172497',
   'title': 'Understanding the Triarchic Theory of Intelligence - ThoughtCo',
   'content': 'The triarchic theory of intelligence contends that there are three types of intelligence: practical, creative, and analytical.',
   'score': 0.8215175,
   'raw_content': "# Understanding the Triarchic Theory of Intelligence\n\n![Computer artwork of human brain in profile](https://www.thoughtco.com/thmb/8wxU6kJ1vmDryGH26avCQdPrzn8=/1500x0/filters:no_upscale():max_bytes(150000):strip_icc()/computer-artwork-of-human-brain--profile-85757401-5b50a85846e0fb005b73aecf.jpg)\n![Computer artwork of human brain in pr
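
The search output above is a plain dictionary with a `results` list whose items carry `url`, `title`, `content`, and `score`. The sketch below shows one way to reduce such a payload to a ranked list; `summarize_results` and its `min_score` parameter are hypothetical helpers, and the sample dictionary only repeats the fields visible in the output.

```python
def summarize_results(search_output, min_score=0.0):
    """Return (title, url, score) tuples sorted by descending score."""
    rows = []
    for item in search_output.get("results", []):
        score = item.get("score", 0.0)
        if score >= min_score:
            rows.append((item.get("title", ""), item.get("url", ""), score))
    return sorted(rows, key=lambda row: row[2], reverse=True)


# Sample payload built from the fields shown in the cell output.
sample_output = {
    "query": "Third type of intelligence",
    "results": [
        {
            "url": "https://www.thoughtco.com/triarchic-theory-of-intelligence-4172497",
            "title": "Understanding the Triarchic Theory of Intelligence - ThoughtCo",
            "content": "The triarchic theory of intelligence contends that there are "
                       "three types of intelligence: practical, creative, and analytical.",
            "score": 0.8215175,
        }
    ],
}

for title, url, score in summarize_results(sample_output, min_score=0.5):
    print(f"{score:.2f}  {title}  {url}")
```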

In [None]:
# Deliberate ZeroDivisionError, presumably to halt "Run All" at this point.
1/0

ZeroDivisionError: division by zero

In [None]:
from langchain_core.tools import Tool
from langchain_google_community import GoogleSearchAPIWrapper

search = GoogleSearchAPIWrapper()

tool = Tool(
    name="google_search",
    description="Search Google for recent results.",
    func=search.run,
)

In [None]:
with open("../dataset/raw_model_responses/train/train_leaderboard_summaries.json", "r", encoding="utf-8") as f:
    json_data = json.load(f)

In [None]:
json_data = json_data[:4]
json_data

[{'task_info': {'type': 'Summarization', 'dataset': 'leaderboard_summaries'},
  'additional_info': {'model': 'meta-llama/Llama-3.2-1B-Instruct',
   'source': 'The man died in Inverness on 27 October this year. The Police Investigations and Review Commissioner (Pirc), Kate Frame, has been asked to scrutinise the initial police response to the man\'s call. Police Scotland said it was "fully engaging" with the investigation and awaited its findings. A spokesman for Pirc said: "The Crown Office and Procurator Fiscal Service (COPFS) has instructed the Police Investigations and Review Commissioner to undertake an investigation into the initial police response to a call from a 72-year-old man who was later found dead at a sheltered housing complex in Inverness. "A report on the commissioner\'s findings will be submitted to the COPFS in due course."'},
  'input': '<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nGenerate a concise and neutral summary of the following text. Your s

In [None]:
summary = json_data[2]["response"]
batch_size = len(summary)

source_text = json_data[2]["additional_info"]["source"]

In [None]:
agent = SummarizationCorrectionAgent()

state = agent.model.invoke({
    "source_text": source_text,
    "summary": summary,
})

I0000 00:00:1752674891.103013 2776211 config.cc:230] gRPC experiments enabled: call_status_override_on_cancellation, event_engine_dns, event_engine_listener, http2_stats_fix, monitoring_experiment, pick_first_new, trace_record_callops, work_serializer_clears_time_cache
I0000 00:00:1752674891.108828 2776211 check_gcp_environment_no_op.cc:29] ALTS: Platforms other than Linux and Windows are not supported
I0000 00:00:1752674891.111320 2776211 check_gcp_environment_no_op.cc:29] ALTS: Platforms other than Linux and Windows are not supported


You are an expert summarization analyst. Your task is to rigorously evaluate a summary against its source text. You must be meticulous and base your entire evaluation *only* on the provided source text.

# Inputs
**Source Text**: '''A woman who suffered seven miscarriages has spoken of her distress after a family friend stole the money set aside for her babies' grave stone. The bereaved mother, named Dawn, was appearing on The Jeremy Kyle show and accused long-term chum, convicted burglar Jamie, of making off with the cash. When one of Mr Kyle's famous lie detector tests revealed that Jamie, who lives with Dawn and her partner who is also called Jamie, had done the deed, she burst into tears and stormed off. Scroll down for video . Devastated: Dawn, who has had seven miscarriages, was furious to discover her friend had stolen from her . The drama began when Dawn, who has previously appeared on the ITV talk show, discovered that £207 set aside for a £500 grave stone had disappeared. Sus

2025-07-16 17:08:13.859 | INFO     | src.agents.summary_correction_agent:check_for_errors:71 - Errors: [["Missing Key Information: The summary fails to mention the core conflict: a family friend stole money intended for a gravestone for the woman's deceased baby.", 'Missing Key Information: The summary omits the fact that the woman appeared on The Jeremy Kyle Show to confront the friend.', "Faithfulness Error: The summary states 'baby girl' which is not specified in the source text. The source text refers to 'Daniel James'.", 'Lack of Context: The summary is too short and lacks context, making it difficult to understand the situation.'], ["The summary is incomplete and lacks context. It's a fragment that doesn't convey the main points of the article, such as the theft of money for a gravestone, the lie detector test, or the confession."], ['Missing Key Information: The summary fails to mention the theft of the money set aside for the grav

You are an expert summary editor. Your task is to correct an incorrect summary using special deletion tokens to make the absolute minimum changes required.

# Task:
Your goal is to fix the `Incorrect Summary` so that it accurately and faithfully reflects the `Correct Summary`. You must use the special deletion tokens to perform the correction with the fewest edits possible.

# Special Deletion Tokens:
1.  `<DEL_W>`: Deletes the single word immediately preceding the token. A word is a string of characters separated by spaces. Punctuation is part of the word (e.g., 'end.').
2.  `<DEL_S>`: Deletes the entire sentence immediately preceding the token. A sentence is a group of words ending in a period (.), question mark (?), or exclamation mark (!). The token must be placed right after the punctuation.
3.  `<DEL_A>`: Deletes the entire summary that comes before the token. Use this for a complete rewrite when the summary is fundamentally wrong.

# Instructions:
1.  **Analyze the Errors**: Fir

2025-07-16 17:08:16.064 | INFO     | src.agents.summary_correction_agent:correct_response:135 - Corrected summary: ["for months and I trusted him completely,' Dawn said.<DEL_S> The couple had been dating since 2018 and had even taken out a loan together to buy a house.<DEL_S> The news of the theft has left Dawn and her partner devastated, and she has expressed her anger and disappointment at the alleged perpetrator.<DEL_S> The couple has been together for five years and had planned to have a family with two children.<DEL_A> Dawn, a woman who suffered seven miscarriages, was distressed after a family friend named Jamie stole £207 that was set aside for her baby Daniel James' gravestone, which cost £500. Jamie confessed to the crime after a lie detector test on The Jeremy Kyle show but blamed Carol in a Facebook message.", "<DEL_A>Dawn, a woman who had seven miscarriages, appeared on The Jeremy Kyle show after her friend Jamie, a convicted 

In [None]:
state

{'source_text': "A woman who suffered seven miscarriages has spoken of her distress after a family friend stole the money set aside for her babies' grave stone. The bereaved mother, named Dawn, was appearing on The Jeremy Kyle show and accused long-term chum, convicted burglar Jamie, of making off with the cash. When one of Mr Kyle's famous lie detector tests revealed that Jamie, who lives with Dawn and her partner who is also called Jamie, had done the deed, she burst into tears and stormed off. Scroll down for video . Devastated: Dawn, who has had seven miscarriages, was furious to discover her friend had stolen from her . The drama began when Dawn, who has previously appeared on the ITV talk show, discovered that £207 set aside for a £500 grave stone had disappeared. Suspecting house-mate Jamie and his girlfriend Carol, she confronted them but was initially met with a flat denial. 'Only him and Carol knew where it was,' she said during an angry confrontation on the ITV show. 'Me and

In [None]:
# Deliberate ZeroDivisionError, presumably to halt "Run All" at this point.
1/0

ZeroDivisionError: division by zero

In [None]:
with open("../dataset/raw_model_responses/train/train_UMWP.json", "r", encoding="utf-8") as f:
    json_data = json.load(f)

In [None]:
json_data = json_data[:4]
json_data

[{'task_info': {'type': 'QA', 'dataset': 'UMWP'},
  'additional_info': {'model': 'meta-llama/Llama-3.2-1B-Instruct',
   'domain': 'Math',
   'question': 'Because of the decision Sofia asked the students to suggest specific types of food. If 479 students suggested adding mashed potatoes while 489 suggested adding bacon to the menu.How many more students suggested bacon than those that suggested mashed potatoes?',
   'answer': [10.0],
   'answerable': True,
   'source': 'SVAMP'},
  'input': '<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nProvide a correct and concise answer to the following math question. If the question is unanswerable (e.g., it is illogical or missing information), you must clearly state that it cannot be answered and briefly explain why.<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nBecause of the decision Sofia asked the students to suggest specific types of food. If 479 students suggested adding mashed potatoes while 489 suggested adding bacon

In [None]:
response = json_data[-1]["response"]
batch_size = len(response)

question = json_data[-1]["additional_info"]["question"]
is_answerable = json_data[-1]["additional_info"]["answerable"]
answer = json_data[-1]["additional_info"]["answer"]

In [None]:
agent = MathQACorrectionAgent()

state = agent.model.invoke({
    "question": question,
    "answer": answer,
    "is_answerable": is_answerable,
    "response": response,
})

I0000 00:00:1752649486.911765 2198097 config.cc:230] gRPC experiments enabled: call_status_override_on_cancellation, event_engine_dns, event_engine_listener, http2_stats_fix, monitoring_experiment, pick_first_new, trace_record_callops, work_serializer_clears_time_cache
I0000 00:00:1752649486.917897 2198097 check_gcp_environment_no_op.cc:29] ALTS: Platforms other than Linux and Windows are not supported
I0000 00:00:1752649486.919848 2198097 check_gcp_environment_no_op.cc:29] ALTS: Platforms other than Linux and Windows are not supported


You are a meticulous AI quality assurance expert. Your sole function is to evaluate a given response based *only* on the provided context and correct answer. You MUST NOT use any external knowledge.

# Inputs
**Question**: Ellen is baking bread. It takes several hours to rise 1 ball of dough, and then another 2 hours to bake it. If she makes 4 balls of dough one after another and then bakes them one after another when they're done rising, how many hours will it take?
**Correct Answer**: This question cannot be answered.
**Response to Evaluate**: To determine the total time, we need to add the time it takes to rise the dough, the time it takes to bake the dough, and the time it takes to rise the dough again.

1. The time to rise 1 ball of dough is 1 hour.
2. The time to bake 1 ball of dough is 2 hours.
3. Ellen makes 4 balls of dough, so the time to rise all 4 balls is 4 x 1 hour = 4 hours.
4. The time to bake 4 balls of dough is 4 x 2 hours = 8 hours.

Total time = 4 hours + 8 hours = 

2025-07-16 10:04:48.126 | INFO     | src.agents.math_qa_correction_agent:check_for_errors:76 - Errors: [['The response incorrectly assumes the time to rise 1 ball of dough is 1 hour, which is not specified in the context.', 'The response provides a numerical answer when the correct answer is that the question cannot be answered.'], ["The response incorrectly assumes the time it takes for one ball of dough to rise is 1 hour. The question only states that it takes 'several hours'.", 'The response provides a numerical answer when the correct answer is that the question cannot be answered.'], ["The response incorrectly assumes that it takes 1 hour to rise 1 ball of dough. The problem states that it takes 'several hours', which is not a specific amount of time.", 'The response provides a numerical answer when the correct answer is that the question cannot be answered.'], []]
2025-07-16 10:04:48.127 | INFO     | src.ag
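
The `Errors` log above holds one list per response in the batch, and the last entry is empty. Assuming an empty list means that response needs no correction, a small helper can pick out the indices that do; `responses_needing_correction` and the placeholder response strings are hypothetical and not part of the agent code.

```python
def responses_needing_correction(responses, errors_per_response):
    """Return the indices of responses whose error list is non-empty."""
    if len(responses) != len(errors_per_response):
        raise ValueError("expected one error list per response")
    return [index for index, errors in enumerate(errors_per_response) if errors]


# Placeholder responses; the error lists mirror the shape of the logged output.
batch_responses = ["response 0", "response 1", "response 2", "response 3"]
batch_errors = [
    ["The response provides a numerical answer when the correct answer is "
     "that the question cannot be answered."],
    ["The response provides a numerical answer when the correct answer is "
     "that the question cannot be answered."],
    ["The response provides a numerical answer when the correct answer is "
     "that the question cannot be answered."],
    [],  # assumed to mean: no errors found for this response
]

print(responses_needing_correction(batch_responses, batch_errors))
```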

You are an expert editor, a master of precision and conciseness. Your task is to correct an incorrect response using special deletion tokens to make the absolute minimum changes required.

# Task:
Your goal is to fix the `Incorrect Response` so that it matches the `Correct Answer`. You must use the special deletion tokens to perform the correction with the fewest edits possible.
# Special Deletion Tokens:
1.  <DEL_W>: Deletes the single word immediately preceding the token. A word is a string of characters separated by spaces. Punctuation is part of the word (e.g., 'end.').
2.  <DEL_S>: Deletes the entire sentence immediately preceding the token. A sentence is a group of words ending in a period (.), question mark (?), or exclamation mark (!). The token must be placed right after the punctuation.
3.  <DEL_A>: Deletes the entire response that comes before the token. Use this for a complete rewrite when the response is fundamentally wrong.

# Instructions:
1.  **Analyze the Error**: Firs

2025-07-16 10:04:49.509 | INFO     | src.agents.math_qa_correction_agent:correct_response:145 - Corrected response: ['To determine the total time, we need to add the time it takes to rise the dough, the time it takes to bake the dough, and the time it takes to rise the dough again.<DEL_A>This question cannot be answered.', '## Step 1: Determine the time it takes for one ball of dough to rise and bake.<DEL_A>This question cannot be answered.', 'The problem states that it takes 1 hour<DEL_W> several hours to rise 1 ball of dough, and 2 hours to bake 1 ball of dough. So, the total time to rise and bake 4 balls of dough is:\n\n4 x 1 hour<DEL_W> + 4 x 2 hours = 4 + 8 = 12 hours\n\nTherefore, it will take 12 hours to bake 4 balls of dough.<DEL_S> This question cannot be answered.']
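
The corrected responses above still contain the raw `<DEL_W>`, `<DEL_S>`, and `<DEL_A>` tokens. The following is a minimal sketch of how the token rules stated in the prompt could be applied to recover the final text. `apply_deletion_tokens` is a hypothetical helper, not the project's implementation, and it makes two assumptions: whitespace left behind by a deleted word is dropped, and a sentence boundary is a `.`, `?`, or `!` followed by whitespace (so numbered-list markers such as `1.` would also count as boundaries).

```python
import re

TOKEN_PATTERN = re.compile(r"(<DEL_[WSA]>)")
SENTENCE_END = re.compile(r"[.!?](?=\s)")


def apply_deletion_tokens(text: str) -> str:
    """Resolve deletion tokens left to right and return the cleaned text."""
    buffer = ""
    for piece in TOKEN_PATTERN.split(text):
        if piece == "<DEL_A>":
            # Drop everything written so far.
            buffer = ""
        elif piece == "<DEL_S>":
            # Keep the text up to the end of the previous sentence, if any.
            body = buffer.rstrip()
            ends = [match.end() for match in SENTENCE_END.finditer(body)]
            buffer = body[: ends[-1]] if ends else ""
        elif piece == "<DEL_W>":
            # Drop the last whitespace-separated word.
            parts = buffer.rstrip().rsplit(None, 1)
            buffer = parts[0] if len(parts) == 2 else ""
        else:
            buffer += piece
    return buffer.strip()


print(apply_deletion_tokens(
    "Step 1: Determine the time it takes.<DEL_A>This question cannot be answered."
))
```

Processing the tokens in a single left-to-right pass means each token acts on the text as already edited by earlier tokens, which matches how consecutive `<DEL_S>` tokens appear in the summarization output.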


In [None]:
state

{'question': "Ellen is baking bread. It takes several hours to rise 1 ball of dough, and then another 2 hours to bake it. If she makes 4 balls of dough one after another and then bakes them one after another when they're done rising, how many hours will it take?",
 'answer': [0.0],
 'is_answerable': False,
 'response': ('To determine the total time, we need to add the time it takes to rise the dough, the time it takes to bake the dough, and the time it takes to rise the dough again.\n\n1. The time to rise 1 ball of dough is 1 hour.\n2. The time to bake 1 ball of dough is 2 hours.\n3. Ellen makes 4 balls of dough, so the time to rise all 4 balls is 4 x 1 hour = 4 hours.\n4. The time to bake 4 balls of dough is 4 x 2 hours = 8 hours.\n\nTotal time = 4 hours + 8 hours = 12 hours.\n\nThe final answer is 12 hours.',
  '## Step 1: Determine the time it takes for one ball of dough to rise and bake.\nIt takes 1 hour to rise and 2 hours to bake one ball of dough.\n\n## Step 2: Calculate the tot

In [None]:
# Deliberate ZeroDivisionError, presumably to halt "Run All" at this point.
1/0

ZeroDivisionError: division by zero

In [None]:
with open("../dataset/raw_model_responses/train/train_rajpurkar_squad.json", "r", encoding="utf-8") as f:
    json_data = json.load(f)

In [None]:
json_data = json_data[:2]

In [None]:
json_data

[{'task_info': {'type': 'Contextual QA', 'dataset': 'rajpurkar_squad'},
  'additional_info': {'model': 'meta-llama/Llama-3.2-1B-Instruct',
   'question': 'What percentage of Egyptians polled support death penalty for those leaving Islam?',
   'context': 'The Pew Forum on Religion & Public Life ranks Egypt as the fifth worst country in the world for religious freedom. The United States Commission on International Religious Freedom, a bipartisan independent agency of the US government, has placed Egypt on its watch list of countries that require close monitoring due to the nature and extent of violations of religious freedom engaged in or tolerated by the government. According to a 2010 Pew Global Attitudes survey, 84% of Egyptians polled supported the death penalty for those who leave Islam; 77% supported whippings and cutting off of hands for theft and robbery; and 82% support stoning a person who commits adultery.',
   'answer': ['84%'],
   'title': 'Egypt'},
  'input': "<|begin_of_te

In [None]:
from src.models import ErrorList
from src.prompts.gemini_agent_prompts import ContextQAErrorCheckPrompt, ContextQAErrorCorrectionPrompt

from langchain_google_genai import ChatGoogleGenerativeAI
from langchain_core.output_parsers import PydanticOutputParser

In [None]:
llm = ChatGoogleGenerativeAI(model="gemini-2.0-flash", temperature=0.5)

I0000 00:00:1752647722.293223 2155010 config.cc:230] gRPC experiments enabled: call_status_override_on_cancellation, event_engine_dns, event_engine_listener, http2_stats_fix, monitoring_experiment, pick_first_new, trace_record_callops, work_serializer_clears_time_cache
I0000 00:00:1752647722.298318 2155010 check_gcp_environment_no_op.cc:29] ALTS: Platforms other than Linux and Windows are not supported


In [None]:
response = json_data[0]["response"]
batch_size = len(response)

question = json_data[0]["additional_info"]["question"]
context = json_data[0]["additional_info"]["context"]
answer = json_data[0]["additional_info"]["answer"]

In [None]:
response[0] = 'According to the text, 84% of Chinese polled supported the death penalty for those leaving Japan.'
response[1] = 'According to the text, 84% of Egyptians polled support the death penalty for those leaving Japan.'
response[2] = 'The answer is 25%.'
response[3] = 'The text does not specify the percentage, thus there is no answer. However, it does mention that 84% of Egyptians polled supported the death penalty for those who leave Islam.'

In [None]:
from src.agents.context_qa_correction_agent import ContextQACorrectionAgent

agent = ContextQACorrectionAgent()

state = agent.model.invoke({
    "question": question,
    "context": context,
    "answer": answer,
    "response": response,
})

I0000 00:00:1752647722.335815 2155010 check_gcp_environment_no_op.cc:29] ALTS: Platforms other than Linux and Windows are not supported
I0000 00:00:1752647722.336950 2155010 check_gcp_environment_no_op.cc:29] ALTS: Platforms other than Linux and Windows are not supported


You are a meticulous AI quality assurance expert. Your sole function is to evaluate a given response based *only* on the provided context and correct answer. You MUST NOT use any external knowledge.

# Inputs
**Question**: What percentage of Egyptians polled support death penalty for those leaving Islam?
**Context**: '''The Pew Forum on Religion & Public Life ranks Egypt as the fifth worst country in the world for religious freedom. The United States Commission on International Religious Freedom, a bipartisan independent agency of the US government, has placed Egypt on its watch list of countries that require close monitoring due to the nature and extent of violations of religious freedom engaged in or tolerated by the government. According to a 2010 Pew Global Attitudes survey, 84% of Egyptians polled supported the death penalty for those who leave Islam; 77% supported whippings and cutting off of hands for theft and robbery; and 82% support stoning a person who commits adultery.'''
*

2025-07-16 09:35:23.356 | INFO     | src.agents.context_qa_correction_agent:check_for_errors:78 - Errors: [['The response incorrectly states that the poll was of Chinese people, when the context specifies Egyptians.', 'The response incorrectly states that the poll was about leaving Japan, when the context specifies leaving Islam.'], ['The response incorrectly states that the death penalty is for those leaving Japan, instead of Islam.'], ['The response provides an incorrect answer. The correct answer, based on the context, is 84%, but the response states 25%.'], ["The response contains the phrase 'The text does not specify the percentage, thus there is no answer.', which is an unnecessary and evasive statement. The response should directly provide the answer.", "The response contradicts itself by initially stating that the text doesn't specify the percentage, then providing the correct percentage in the same response."]]
2025-07-1

In [None]:
state

{'question': 'What percentage of Egyptians polled support death penalty for those leaving Islam?',
 'context': 'The Pew Forum on Religion & Public Life ranks Egypt as the fifth worst country in the world for religious freedom. The United States Commission on International Religious Freedom, a bipartisan independent agency of the US government, has placed Egypt on its watch list of countries that require close monitoring due to the nature and extent of violations of religious freedom engaged in or tolerated by the government. According to a 2010 Pew Global Attitudes survey, 84% of Egyptians polled supported the death penalty for those who leave Islam; 77% supported whippings and cutting off of hands for theft and robbery; and 82% support stoning a person who commits adultery.',
 'answer': ['84%'],
 'response': ['According to the text, 84% of Chinese polled supported the death penalty for those leaving Japan.',
  'According to the text, 84% of Egyptians polled support the death penalty f