## ExtEval Corrector Research

### Step 1: Extract the human-annotated incorrect summaries from the 1600 summaries in `data.json`

To achieve this, the following script filters `data.json` and writes `data_only_incorrect.json`, which contains every summary in which human annotators flagged at least one error.

In [1]:
import json

input_file = "data/data.json"
output_file = "data/data_only_incorrect.json"

fields_to_check = [
    "incorrect_coref",
    "incomplete_coref",
    "incorrect_discourse",
    "incomplete_discourse",
    "misleading1",
    "misleading2",
]

# Load the input JSON file
with open(input_file, "r", encoding="utf-8") as file:
    data = json.load(file)

# Filter the data based on the specified fields
filtered_data = {
    key: value
    for key, value in data.items()
    if any(value.get(field, "no") == "yes" for field in fields_to_check)
}

# Save the filtered data to the output file
with open(output_file, "w", encoding="utf-8") as file:
    json.dump(filtered_data, file, indent=4, ensure_ascii=False)

# Count the total number of entries in the filtered JSON
total_entries = len(filtered_data)

# Output the total count and the saved file message
print(f"Filtered data was saved in '{output_file}'.")
print(f"Total number of entries in the filtered JSON: {total_entries}")


Filtered data was saved in 'data/data_only_incorrect.json'.
Total number of entries in the filtered JSON: 535
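
The filter keeps an entry as soon as any one annotation flag is `"yes"`, so the total of 535 does not show how the error types are distributed. A minimal sketch of a per-flag count follows; the toy entries are hypothetical and are not taken from `data.json`, and `count_error_types` is an illustrative helper, not part of the original notebook.

```python
from collections import Counter

# Annotation flags checked in Step 1.
FIELDS_TO_CHECK = [
    "incorrect_coref",
    "incomplete_coref",
    "incorrect_discourse",
    "incomplete_discourse",
    "misleading1",
    "misleading2",
]


def count_error_types(entries):
    """Count how many entries have each annotation flag set to "yes"."""
    counts = Counter({field: 0 for field in FIELDS_TO_CHECK})
    for value in entries.values():
        for field in FIELDS_TO_CHECK:
            if value.get(field, "no") == "yes":
                counts[field] += 1
    return counts


# Toy entries (hypothetical keys and values, for illustration only).
toy_entries = {
    "0_a": {"incorrect_coref": "yes", "misleading1": "yes"},
    "1_b": {"incomplete_discourse": "yes"},
    "2_c": {"incorrect_coref": "no"},
}

counts = count_error_types(toy_entries)
for field in FIELDS_TO_CHECK:
    print(f"{field}: {counts[field]}")
```

Because one summary can carry several flags, the per-flag counts can add up to more than the number of filtered entries.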


### Step 2: Run ExtEval on the summaries to see the exact scores it predicts for `data_only_incorrect.json`

These scores will be saved in `data_only_incorrect_exteval.json` and are needed for the prompt to `GPT-4o` and later for the `Evaluation`.

In [2]:
import subprocess

# Define the command and its arguments
command = [
    "python3",
    "exteval/exteval.py",
    "--data_file", "data/data_only_incorrect.json",
    "--output_file", "data/scores/data_only_incorrect_exteval.json"
]

# Execute the command
result = subprocess.run(command, capture_output=True, text=True)

# Print the output and errors
if result.returncode == 0:
    print("Command executed successfully.")
    print(result.stdout)
else:
    print("Error occurred while executing the command.")
    print(result.stderr)


Command executed successfully.
Downloading https://storage.googleapis.com/allennlp-public-mo… ━━━ 100% 0:… 1.3…
                                                                            GB  
927.3197152614594



### Step 2b: Run ExtEval Modified on the summaries to get the scores it predicts for `data_only_incorrect.json`, with extended information about the errors found

These scores will be saved in `data_only_incorrect_exteval_modified.json` and are needed for the prompt to `GPT-4o` and later for the `Evaluation`.
ExtEval Modified reports the specific instances behind each score, which gives `GPT-4o` considerably more information to work with and is intended to lead to better corrections later.

In [1]:
import subprocess

# Define the command and its arguments
command = [
    "python3",
    "exteval/extevalModified.py",
    "--data_file", "data/data_only_incorrect.json",
    "--output_file", "data/scores/data_only_incorrect_exteval_modified.json"
]

# Execute the command
result = subprocess.run(command, capture_output=True, text=True)

# Print the output and errors
if result.returncode == 0:
    print("Command executed successfully.")
    print(result.stdout)
else:
    print("Error occurred while executing the command.")
    print(result.stderr)


Command executed successfully.
1391.9829301834106



### Step 3: Combine the incorrect data (`data_only_incorrect.json`) with their ExtEval scores (`data_only_incorrect_exteval.json`) in preparation for easy access in the prompt

The merged data will be saved in `data_only_incorrect_merged.json` and will be used for the prompt to `GPT-4o` in the next step.

In [3]:
import json

raw_data = "data/data_only_incorrect.json"
exteval_data = "data/scores/data_only_incorrect_exteval.json"
output_path = "data/merged/data_only_incorrect_merged.json"

# read data
with open(raw_data, 'r', encoding='utf-8') as f1:
    data1 = json.load(f1)

with open(exteval_data, 'r', encoding='utf-8') as f2:
    data2 = json.load(f2)

# result dictionary for the new json
merged_data = {}

# Merging of the data
for key, value2 in data2.items():
    if key in data1:  # check if the key is in both files
        value1 = data1[key]
        
        # create new structure
        merged_entry = {
            **value2,  # gets all fields from the second json
            "summary": value1.get("summary"),
            "document": value1.get("document"),
            "summary_for_annotation": value1.get("summary_for_annotation"),
            "document_for_annotation": value1.get("document_for_annotation")
        }
        
        # save inside new json
        merged_data[key] = merged_entry

# save completed new json
with open(output_path, 'w', encoding='utf-8') as output_file:
    json.dump(merged_data, output_file, ensure_ascii=False, indent=4)

print(f"Merged JSON saved to {output_path}")


Merged JSON saved to data/merged/data_only_incorrect_merged.json


### Step 3b: Combine the incorrect data (`data_only_incorrect.json`) with their ExtEval Modified scores (`data_only_incorrect_exteval_modified.json`) in preparation for easy access in the prompt

The merged data will be saved in `data_only_incorrect_merged_modified.json` and will be used for the prompt to `GPT-4o` in the next step.

In [2]:
import json

raw_data = "data/data_only_incorrect.json"
exteval_modified_data = "data/scores/data_only_incorrect_exteval_modified.json"
output_path = "data/merged/data_only_incorrect_merged_modified.json"

# read_data
with open(raw_data, 'r', encoding='utf-8') as f1:
    data1 = json.load(f1)

with open(exteval_modified_data, 'r', encoding='utf-8') as f2:
    data2 = json.load(f2)

# result dictionary for the new json
merged_data = {}

# Merging of the data
for key, value2 in data2.items():
    if key in data1:  # check if the key is in both files
        value1 = data1[key]
        
        # create new structure
        merged_entry = {
            **value2, # gets all fields from the second json
            "summary": value1.get("summary"),
            "document": value1.get("document"),
            "summary_for_annotation": value1.get("summary_for_annotation"),
            "document_for_annotation": value1.get("document_for_annotation")
        }
        
        # save inside new json
        merged_data[key] = merged_entry

# save completed new json
with open(output_path, 'w', encoding='utf-8') as output_file:
    json.dump(merged_data, output_file, ensure_ascii=False, indent=4)

print(f"Merged JSON saved to {output_path}")


Merged JSON saved to data/merged/data_only_incorrect_merged_modified.json
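
The merge cells in Steps 3 and 3b are identical apart from their file paths. As a sketch, the shared logic can be written as one function that works on already-loaded dictionaries; `merge_scores` and the toy inputs are hypothetical and only illustrate the behaviour of the cells above.

```python
# Text fields copied from the raw data into each merged entry.
TEXT_FIELDS = (
    "summary",
    "document",
    "summary_for_annotation",
    "document_for_annotation",
)


def merge_scores(raw, scores):
    """Attach the text fields of `raw` to every scored entry with a matching key."""
    merged = {}
    for key, score_entry in scores.items():
        if key in raw:  # skip scores that have no raw counterpart
            text = {field: raw[key].get(field) for field in TEXT_FIELDS}
            merged[key] = {**score_entry, **text}
    return merged


# Toy inputs (hypothetical keys and values).
raw = {
    "0_x": {"summary": "S", "document": "D"},
    "1_y": {"summary": "T", "document": "E"},
}
scores = {"0_x": {"ExtEval": 0.2}, "9_z": {"ExtEval": 0.5}}

merged = merge_scores(raw, scores)
print(sorted(merged))
```

As in the original cells, keys that appear in only one of the two files are dropped, and a text field missing from the raw entry is stored as `None`.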


### Step 4: Call `GPT-4o` for every entry in the `data_only_incorrect_merged.json` file using the prompt template

Each of the 535 entries is sent to `GPT-4o` to obtain a corrected extractive summary; the results will be saved in a new file called `data_new.json`. The cell below shows a single request with placeholder inputs.

In [9]:
from openai import OpenAI

client = OpenAI(api_key="your_api_key_here")

# Create the prompt
prompt = """
You are an expert AI assistant specializing in extractive summarization and its evaluation using EXTEVAL. 
EXTEVAL is a system designed to assess extractive summaries' faithfulness by identifying issues in four error categories:

1. **Incorrect Coreference (INCORCOREFEVAL):** Incorrect mapping of pronouns or noun phrases to their antecedents.
2. **Incomplete Coreference (INCOMCOREFEVAL):** Missing or unclear antecedents for references in the text.
3. **Incomplete Discourse (INCOMDISCOEVAL):** Missing discourse context, such as a conjunction or connector left without the clause it relates to.
4. **Sentiment Bias (SENTIBIAS):** Misalignment of sentiments between the original text and the summary.

Each category has a corresponding sub-metric (INCORCOREFEVAL, INCOMCOREFEVAL, INCOMDISCOEVAL, SENTIBIAS) that provides a score between 0 and 1. 
Scores closer to 1 indicate the presence of significant errors. The overall EXTEVAL score is a weighted average reflecting these metrics.

### Task
I will provide:
- The **original document**
- The **extractive summary**
- EXTEVAL scores.

Your job:
1. Identify and revise problematic areas in the extractive summary to address the specific EXTEVAL metrics.
2. Ensure the corrected extractive summary is faithful, coherent, and free from bias, fully resolving the identified errors.

### Response Format
Respond in this JSON format:
```json
{{
    "corrected_extractive_summary": "<your revised summary here>",
    "justifications": {{
        "IncorCorefEval": "<details of corrections made>",
        "IncomCorefEval": "<details of corrections made>",
        "IncomDiscoEval": "<details of corrections made>",
        "SentiBias": "<details of corrections made>"
    }}
}}
```

### Inputs
Here is the original document: {original_text}
Here is the extractive summary: {extractive_summary}
Here are the EXTEVAL scores: {exteval_scores}

Use the provided inputs to produce a corrected extractive summary and detailed justifications for each correction. """

# Example data
original_text = "Insert the full original document here..." 
extractive_summary = "Insert the problematic extractive summary here..." 
exteval_scores = { "IncorCorefEval": 0.2, "IncomCorefEval": 0.3, "IncomDiscoEval": 0.1, "SentiBias": 0.15, "OverallScore": 0.2 }

# Create the request
completion = client.chat.completions.create(
    model="gpt-4o",  # the model named in this step
    messages=[
        {"role": "system", "content": "You are a highly knowledgeable assistant trained in extractive summarization and evaluation."},
        {
            "role": "user",
            "content": prompt.format(
                original_text=original_text,
                extractive_summary=extractive_summary,
                exteval_scores=exteval_scores,
            ),
        },
    ],
)

# Print the answer (in openai>=1.0 the message is an object, not a dict)
print(completion.choices[0].message.content)


### Key Points from the prompt:
# 1. **Explicit Definitions**: Incorporated EXTEVAL’s detailed error metrics (coreference, discourse, and sentiment bias) based on the paper.
# 2. **Actionable Guidelines**: Clear instructions for correcting errors aligned with the EXTEVAL framework.
# 3. **Justification Requirement**: Added a section to justify how each specific metric’s issues were resolved, making the output more transparent and evaluative.
# 4. **Standardized JSON Output**: Provides a machine-readable format for integrating corrections into larger pipelines.
# This refined prompt ensures precise alignment with EXTEVAL's framework while maintaining clear and actionable guidance for improving summaries. 



AuthenticationError: Error code: 401 - {'error': {'message': 'Incorrect API key provided: your_api*****here. You can find your API key at https://platform.openai.com/account/api-keys.', 'type': 'invalid_request_error', 'param': None, 'code': 'invalid_api_key'}}
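
The cell above issues one request with placeholder inputs, while the step calls for one request per merged entry. A minimal sketch of that loop follows, under stated assumptions: `correct_all`, `parse_model_reply` and `fake_model` are hypothetical names, the model call is injected as a function so the loop can be tried without an API key, and everything in a merged entry that is not a text field is treated as its ExtEval scores.

```python
import json
import re

FENCE = "`" * 3  # built this way so the example contains no literal fence
TEXT_FIELDS = {
    "summary",
    "document",
    "summary_for_annotation",
    "document_for_annotation",
}


def parse_model_reply(reply):
    """Return the JSON object in a reply, with or without a Markdown code fence."""
    pattern = re.escape(FENCE) + r"(?:json)?\s*(.*?)" + re.escape(FENCE)
    match = re.search(pattern, reply, flags=re.DOTALL)
    return json.loads(match.group(1) if match else reply)


def correct_all(entries, template, call_model):
    """Call the model once per entry and collect the parsed corrections."""
    corrected = {}
    for key, entry in entries.items():
        scores = {k: v for k, v in entry.items() if k not in TEXT_FIELDS}
        user_prompt = template.format(
            original_text=entry["document"],
            extractive_summary=entry["summary"],
            exteval_scores=scores,
        )
        corrected[key] = parse_model_reply(call_model(user_prompt))
    return corrected


def fake_model(user_prompt):
    """Stand-in for the API call; returns a fenced JSON reply."""
    body = '{"corrected_extractive_summary": "fixed", "justifications": {}}'
    return FENCE + "json\n" + body + "\n" + FENCE


# Toy template and entry (hypothetical, much shorter than the real prompt).
template = "Doc: {original_text}\nSummary: {extractive_summary}\nScores: {exteval_scores}"
entries = {"0_x": {"document": "D", "summary": "S", "ExtEval": 0.2}}

result = correct_all(entries, template, fake_model)
print(result["0_x"]["corrected_extractive_summary"])
```

With a valid key, `call_model` would wrap `client.chat.completions.create(...)` and return `completion.choices[0].message.content`; the collected dictionary could then be written to `data_new.json` with `json.dump`.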

### Step 4b: Call `GPT-4o` for every entry in the `data_only_incorrect_merged_modified.json` file using the prompt template extended with the additional information from ExtEval Modified

Each of the 535 entries is sent to `GPT-4o` to obtain a corrected extractive summary; the results will be saved in a new file called `data_new.json`. The cell below shows a single request with placeholder inputs.

In [3]:
from openai import OpenAI

client = OpenAI(api_key="your_api_key_here")

# Create the prompt
prompt = """
You are an expert AI assistant specializing in extractive summarization and evaluation using an advanced EXTEVAL framework. EXTEVAL assesses extractive summaries' faithfulness using the following error categories and metrics:

### EXTEVAL Metrics
1. **Incorrect Coreference (INCORCOREFEVAL):**
   - **Definition:** Refers to incorrect mapping of pronouns or noun phrases to their antecedents.
   - **Details Provided:** Count and list of specific instances, including:
     - Problematic sentence(s)
     - Specific mention(s)
     - Error type (e.g., "ambiguous pronoun").

2. **Incomplete Coreference (INCOMCOREFEVAL):**
   - **Definition:** Indicates missing or unclear antecedents for references in the text.
   - **Details Provided:** Count and list of specific instances, including:
     - Problematic sentence(s)
     - Specific mention(s)
     - Error type (e.g., "missing antecedent").

3. **Incomplete Discourse (INCOMDISCOEVAL):**
   - **Definition:** Highlights summary sentences whose discourse context is missing, such as a conjunction or connector left without the clause it relates to.
   - **Details Provided:** Count and list of specific instances, including:
     - Problematic sentence(s)
     - Specific discourse marker(s).

4. **Sentiment Bias (SENTIBIAS):**
   - **Definition:** Measures misalignment of sentiments between the source document and the summary.
   - **Details Provided:**
     - Absolute difference between document and summary sentiment.
     - Average sentiment of the source document and summary.
     - List of significant deviations with:
       - Problematic sentence(s)
       - Document sentiment
       - Summary sentiment.

5. **Overall EXTEVAL Score (EXTEVAL):**
   - Represents a weighted aggregation of all sub-metrics, with higher values indicating greater issues.

### Task
I will provide:
- The **original document**.
- The **extractive summary**.
- Detailed EXTEVAL scores with specific error details.

Your responsibilities:
1. **Analyze** the provided EXTEVAL scores and error details to identify the problematic areas in the summary.
2. **Revise** the summary to address the identified issues, ensuring it is faithful, coherent, and sentimentally aligned with the source document.

### Response Format
Respond in this JSON format:
```json
{{
    "corrected_extractive_summary": "<your revised summary>",
    "justifications": {{
        "IncorCorefEval": "<summary of changes made>",
        "IncomCorefEval": "<summary of changes made>",
        "IncomDiscoEval": "<summary of changes made>",
        "SentiBias": "<summary of changes made>"
    }}
}}
```

### Inputs
Here is the original document: {original_text}
Here is the extractive summary: {extractive_summary}
Here are the EXTEVAL scores: {exteval_scores}

### Notes
For Incorrect Coreference and Incomplete Coreference, revise pronouns or noun phrases to ensure accurate and clear references.
For Incomplete Discourse, restore the missing discourse context or replace discourse markers to create logical and coherent relationships.
For Sentiment Bias, adjust phrasing or tone to align the summary's sentiment with the source document's sentiment distribution. """

# Example data
original_text = "Insert the full original document here..." 
extractive_summary = "Insert the problematic extractive summary here..." 
exteval_scores = { "0_neusumm": 
                  { "IncorCorefEval": 
                   { "count": 0, "details": [] }, 
                   "IncomCorefEval": { "count": 0, "details": [] }, 
                   "IncomDiscoEval": { "count": 0, "details": [] }, 
                   "SentiBias": { 
                       "absolute_difference": 0.07061501783900892, 
                       "doc_avg_sentiment": 0.6347842197341379, 
                       "summary_avg_sentiment": 0.7053992375731468, 
                       "significant_deviations": 
                       [ 
                           { "sentence_index": 0, "summary_sentiment": 0.9926256537437439, "document_avg_sentiment": 0.6347842197341379 }, 
                           { "sentence_index": 1, "summary_sentiment": 0.9239406585693359, "document_avg_sentiment": 0.6347842197341379 } 
                       ] }, 
                   "ExtEval": 0.07061501783900892 }
                 }


# Create the request
completion = client.chat.completions.create(
    model="gpt-4o",  # the model named in this step
    messages=[
        {"role": "system", "content": "You are a highly knowledgeable assistant trained in extractive summarization and evaluation."},
        {
            "role": "user",
            "content": prompt.format(
                original_text=original_text,
                extractive_summary=extractive_summary,
                exteval_scores=exteval_scores,
            ),
        },
    ],
)

# Print the answer (in openai>=1.0 the message is an object, not a dict)
print(completion.choices[0].message.content)

### Key Changes
# 1. **Enhanced EXTEVAL Details**: Incorporated the extended EXTEVAL scores with **count, details, sentiment deviations**, and metrics for better analysis.
# 2. **Actionable Justifications**: Added more context in the "justifications" section for each metric.
# 3. **Clarity in Sentiment Bias**: Highlighted specific sentences with sentiment deviations, providing sentence-level sentiment values to aid targeted corrections.
# 4. **Structured Error Insights**: Explicitly included `summary_sentence`, `mention`, `type`, and `discourse_marker` in the task instructions, helping the AI pinpoint corrections.
# This refined prompt ensures detailed insights for correction while leveraging all the nuances in the new EXTEVAL data structure.


SyntaxError: invalid decimal literal (389155630.py, line 45)
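
The ExtEval Modified scores are nested, and passing the dictionary straight to `prompt.format` inserts its Python representation (single quotes, one long line). One option, sketched here as an assumption rather than as the notebook's method, is to serialize the scores as indented JSON first; `format_scores` is a hypothetical helper and the values are shortened toy numbers.

```python
import json


def format_scores(scores):
    """Serialize ExtEval scores as indented JSON for insertion into the prompt."""
    return json.dumps(scores, indent=2, ensure_ascii=False)


# Toy scores in the nested layout of the example data (values are illustrative).
scores = {
    "IncorCorefEval": {"count": 0, "details": []},
    "SentiBias": {
        "absolute_difference": 0.07,
        "significant_deviations": [{"sentence_index": 0}],
    },
    "ExtEval": 0.07,
}

# Shortened stand-in for the full prompt template.
template = "Here are the EXTEVAL scores: {exteval_scores}"
print(template.format(exteval_scores=format_scores(scores)))
```

The serialized text is valid JSON, so it matches the JSON response format that the prompt asks the model to use.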

### Step 5: Evaluate the corrected summaries with ExtEval to get new scores

The corrected summaries are first preprocessed with `preprocess.py` and then scored with `exteval.py`. The new scores will be saved to `data_new_exteval.json`.

### Step 5b: Evaluate the summaries corrected with the extra ExtEval Modified information to get new scores

These summaries are likewise preprocessed with `preprocess.py` and then scored with `extevalModified.py`. The new scores will be saved to `data_new_exteval_modified.json`.

### Step 6: Compare the new ExtEval scores with the original ones to measure the improvement

This is the last step; the results will be visualised for the paper.
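
One way to quantify the improvement is the mean change per metric over the entries present in both score files. The sketch below assumes each score file maps an entry key to a flat dictionary of metric values and that the overall score is stored under `"ExtEval"`; both are assumptions about the file layout, and `mean_deltas` is a hypothetical helper. Since higher ExtEval scores indicate more errors, a negative delta means the corrected summaries score better.

```python
from statistics import mean

METRICS = ("IncorCorefEval", "IncomCorefEval", "IncomDiscoEval", "SentiBias", "ExtEval")


def mean_deltas(old_scores, new_scores, metrics=METRICS):
    """Mean (new - old) per metric over the keys present in both score sets."""
    shared = sorted(set(old_scores) & set(new_scores))
    if not shared:
        return {}
    return {
        metric: mean(new_scores[key][metric] - old_scores[key][metric] for key in shared)
        for metric in metrics
    }


# Toy scores (hypothetical values) for two entries and a single metric.
old = {"0_x": {"ExtEval": 0.5}, "1_y": {"ExtEval": 0.25}}
new = {"0_x": {"ExtEval": 0.25}, "1_y": {"ExtEval": 0.25}}

deltas = mean_deltas(old, new, metrics=("ExtEval",))
print(deltas)
```

The resulting per-metric dictionary can be passed directly to a plotting library to produce the visualisation for the paper.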

### Step 6b: Compare the new ExtEval Modified scores with the original ones, and with the scores from the standard ExtEval correction that had no extra information

This is the last step; the results will be visualised for the paper.