## Exteval Modified Corrector Research

This notebook contains my research on the ExtEval-Modified metric.
The metric was modified to output more information about the detected errors.

### Step 1: Run ExtEval-Modified on the summaries in `data_only_incorrect.json` to get the exact scores it predicts, with extended information about the errors found

These scores will be saved in `data_only_incorrect_exteval_modified.json` and are needed for the prompt to `GPT-4o` and later for the `Evaluation`.
ExtEval-Modified provides considerably more information to `GPT-4o`, which is intended to enable a better correction later.

(The `data_only_incorrect.json` was already created by `Step 1` of the `exteval-corrector-research`)

In [1]:
import subprocess

# Define the command and its arguments
command = [
    "python3",
    "exteval/extevalModified.py",
    "--data_file", "data/data_only_incorrect.json",
    "--output_file", "data/scores/data_only_incorrect_exteval_modified.json"
]

# Execute the command
result = subprocess.run(command, capture_output=True, text=True)

# Print the output and errors
if result.returncode == 0:
    print("Command executed successfully.")
    print(result.stdout)
else:
    print("Error occurred while executing the command.")
    print(result.stderr)


Command executed successfully.
837.0383448600769



### Step 2: Combine the incorrect data (`data_only_incorrect.json`) with their ExtEval scores (`data_only_incorrect_exteval_modified.json`) as preparation for easy access in the prompt

The merged data will be saved in `data_only_incorrect_merged_modified.json` and will be used for the prompt to `GPT-4o` in the next step.

In [2]:
import json

raw_data = "data/data_only_incorrect.json"
exteval_modified_data = "data/scores/data_only_incorrect_exteval_modified.json"
output_path = "data/merged/data_only_incorrect_merged_modified.json"

# read_data
with open(raw_data, 'r', encoding='utf-8') as f1:
    data1 = json.load(f1)

with open(exteval_modified_data, 'r', encoding='utf-8') as f2:
    data2 = json.load(f2)

# result dictionary for the new json
merged_data = {}

# Merging of the data
for key, value2 in data2.items():
    if key in data1:  # check if the key is in both files
        value1 = data1[key]
        
        # create new structure
        merged_entry = {
            **value2, # gets all fields from the second json
            "summary": value1.get("summary"),
            "document": value1.get("document"),
            "summary_for_annotation": value1.get("summary_for_annotation"),
            "document_for_annotation": value1.get("document_for_annotation")
        }
        
        # save inside new json
        merged_data[key] = merged_entry

# save completed new json
with open(output_path, 'w', encoding='utf-8') as output_file:
    json.dump(merged_data, output_file, ensure_ascii=False, indent=4)

print(f"Merged JSON saved to {output_path}")


Merged JSON saved to data/merged/data_only_incorrect_merged_modified.json
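
The merge loop above silently skips any key that appears in only one of the two files. The following is a minimal sketch of a coverage check, assuming both files are dictionaries keyed by the same summary IDs. `merge_coverage` is a hypothetical helper and the toy data is made up for illustration; neither is part of the original notebook.

```python
def merge_coverage(raw, scores):
    """Return the keys that exist in only one of the two dicts."""
    raw_keys, score_keys = set(raw), set(scores)
    return {
        "missing_scores": sorted(raw_keys - score_keys),  # in raw data, no scores
        "missing_raw": sorted(score_keys - raw_keys),     # scored, but no raw entry
    }

# Toy data for illustration only (not taken from the real files).
raw_example = {"a": {"summary": "s1"}, "b": {"summary": "s2"}}
scores_example = {"a": {"ExtEval": 1}, "c": {"ExtEval": 0}}

report = merge_coverage(raw_example, scores_example)
print(report)
```

Running the same check on `data1` and `data2` before writing the merged file would show whether any entries were dropped.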


### Step 3: Call `GPT-4o` for every entry in the `data_only_incorrect_merged_modified.json` file, using a prompt template that includes the extended information from ExtEval-Modified

For each of the 484 entries, a call to `GPT-4o` is made to obtain a new extractive summary. The results are saved to a new file called `corrected_data_modified.json`.

In [7]:
from openai import OpenAI
import json
import re

client = OpenAI(api_key="<your_api_key>")

# Create the prompt
prompt = """
You are an expert AI assistant specializing in extractive summarization and evaluation using an advanced EXTEVAL framework. EXTEVAL assesses extractive summaries' faithfulness using the following error categories and metrics:

### EXTEVAL Metrics
1. **Incorrect Coreference (INCORCOREFEVAL):**
   - **Definition:** Refers to incorrect mapping of pronouns or noun phrases to their antecedents.
   - **Details Provided:** Count and list of specific instances, including:
     - Problematic sentence(s)
     - Specific mention(s)
     - Error type (e.g., "ambiguous pronoun").

2. **Incomplete Coreference (INCOMCOREFEVAL):**
   - **Definition:** Indicates missing or unclear antecedents for references in the text.
   - **Details Provided:** Count and list of specific instances, including:
     - Problematic sentence(s)
     - Specific mention(s)
     - Error type (e.g., "missing antecedent").

3. **Incorrect Discourse (INCOMDISCOEVAL):**
   - **Definition:** Highlights faulty or misleading discourse relationships, such as inappropriate use of conjunctions or connectors.
   - **Details Provided:** Count and list of specific instances, including:
     - Problematic sentence(s)
     - Specific discourse marker(s).

4. **Sentiment Bias (SENTIBIAS):**
   - **Definition:** Measures misalignment of sentiments between the source document and the summary.
   - **Details Provided:**
     - Absolute difference between document and summary sentiment.
     - Average sentiment of the source document and summary.
     - List of significant deviations with:
       - Problematic sentence(s)
       - Document sentiment
       - Summary sentiment.

5. **Overall EXTEVAL Score (EXTEVAL):**
   - Represents a weighted aggregation of all sub-metrics, with higher values indicating greater issues.

### Task
I will provide:
- The **original document**.
- The **extractive summary**.
- Detailed EXTEVAL scores with specific error details.

Your responsibilities:
1. **Analyze** the provided EXTEVAL scores and error details to identify the problematic areas in the summary.
2. **Revise** the summary to address the identified issues, ensuring it is faithful, coherent, and sentimentally aligned with the source document.

### Response Format
Respond in this JSON format:
```json
{{
    "corrected_extractive_summary": "<your revised summary>",
    "justifications": {{
        "IncorCorefEval": "<summary of changes made>",
        "IncomCorefEval": "<summary of changes made>",
        "IncomDiscoEval": "<summary of changes made>",
        "SentiBias": "<summary of changes made>"
    }}
}}
```

### Inputs
Here is the original document: {original_text}
Here is the extractive summary: {extractive_summary}
Here are the EXTEVAL scores: {exteval_scores}

### Notes
For Incorrect Coreference and Incomplete Coreference, revise pronouns or noun phrases to ensure accurate and clear references.
For Incorrect Discourse, restructure sentences or replace discourse markers to create logical and coherent relationships.
For Sentiment Bias, adjust phrasing or tone to align the summary's sentiment with the source document's sentiment distribution. """

def load_json(file_path):
    with open(file_path, 'r', encoding='utf-8') as file:
        return json.load(file)

def save_json(data, file_path):
    with open(file_path, 'w', encoding='utf-8') as file:
        json.dump(data, file, ensure_ascii=False, indent=4)


# function for the prompts
def generate_prompts_and_save(input_file, output_file):
    data = load_json(input_file)
    output_data = {}

    for key, entry in data.items():
        document = entry.get("document", "")
        summary = entry.get("summary", "")
        exteval_scores = {
            "IncorCorefEval": entry.get("IncorCorefEval", ""),
            "IncomCorefEval": entry.get("IncomCorefEval", ""),
            "IncomDiscoEval": entry.get("IncomDiscoEval", ""),
            "SentiBias": entry.get("SentiBias", ""),
            "ExtEval": entry.get("ExtEval", ""),
        }
        
        try:
            # api call
            response = client.chat.completions.create(
                model="gpt-4o",
                messages=[
                    {"role": "system", "content": "You are a highly knowledgeable assistant trained in extractive summarization and evaluation."},
                    {"role": "user", "content": prompt.format(
                        original_text=document,
                        extractive_summary=summary,
                        exteval_scores=exteval_scores
                    )}
                ]
            )

            # access the answer
            response_text = response.choices[0].message.content

            # extract only the JSON part from the answer (first "{" to last "}")
            json_match = re.search(r"\{.*\}", response_text, re.DOTALL)
            if json_match:
                json_text = json_match.group(0)
                try:
                    response_json = json.loads(json_text) 
                    corrected_extractive_summary = response_json.get("corrected_extractive_summary", None)
                    justifications = response_json.get("justifications", None)
                except json.JSONDecodeError as e:
                    print(f"JSONDecodeError for {key}: {e}")
                    print(f"Response Text: {response_text}")
                    corrected_extractive_summary = None
                    justifications = None
            else:
                print(f"No valid JSON found in response for {key}. Response Text: {response_text}")
                corrected_extractive_summary = None
                justifications = None

            # save the results
            output_data[key] = {
                "document": document,
                "corrected_extractive_summary": corrected_extractive_summary,
                "justifications": justifications
            }
        except Exception as e:
            print(f"Error while trying to get a response from GPT-4o for {key}: {e}")
            output_data[key] = {
                "document": document,
                "response": f"Error: {e}"
            }

    # save all results to a json
    save_json(output_data, output_file)
    print(f"Results were saved to: {output_file}.")


# Call
input_file_path = "data/merged/data_only_incorrect_merged_modified.json"
output_file_path = "data/corrected/corrected_data_modified.json"
generate_prompts_and_save(input_file_path, output_file_path)

JSONDecodeError for 207_textrank_st: Expecting ',' delimiter: line 6 column 138 (char 849)
Response Text: ```json
{
    "corrected_extractive_summary": "<t> The family respectfully requested that people do not vote for Hillary Clinton in 2016, as stated in Larry Upright's obituary. </t> <t> Marina Shear of Dallas wrote in the online guestbook, 'You have my solemn promise I will not waste a vote on Hillary Clinton.'</t>",
    "justifications": {
        "IncorCorefEval": "No instances of incorrect coreference needed addressing.",
        "IncomCorefEval": "Clarified 'the family' in the first summary sentence by specifying 'The family of Larry Upright' to provide context. Clarified 'the obituary's' in the second summary sentence by explicitly mentioning 'Larry Upright's obituary' to address incomplete antecedents.",
        "IncomDiscoEval": "Removed the discourse marker 'also' and rewrote the sentence to eliminate the misleading progression implied."),
        "SentiBias": "Adjusted the

### Key Points from the prompt:

1. **Enhanced EXTEVAL Details**: Incorporated the extended EXTEVAL scores with **count, details, sentiment deviations**, and metrics for better analysis.
2. **Actionable Justifications**: Added more context in the "justifications" section for each metric.
3. **Clarity in Sentiment Bias**: Highlighted specific sentences with sentiment deviations, providing sentence-level sentiment values to aid targeted corrections.
4. **Structured Error Insights**: Explicitly included `summary_sentence`, `mention`, `type`, and `discourse_marker` in the task instructions, helping the AI pinpoint corrections.

This refined prompt ensures detailed insights for correction while leveraging all the nuances in the new EXTEVAL data structure.
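
One detail worth noting about how the scores reach the model: passing a Python dict straight into `str.format` inserts its `repr` (single quotes, `None`, `True`), not JSON. The sketch below contrasts that with serializing through `json.dumps` first. The score entry is hypothetical: the field names follow the ones listed above, but the values are made up and the real ExtEval-Modified output may be structured differently.

```python
import json

# Hypothetical score entry; field names follow the list above,
# values are invented for illustration.
exteval_scores = {
    "IncomCorefEval": {
        "count": 1,
        "details": [
            {
                "summary_sentence": "He agreed.",
                "mention": "He",
                "type": "missing antecedent",
            }
        ],
    },
    "ExtEval": 1,
}

template = "Here are the EXTEVAL scores: {exteval_scores}"

# Variant 1: dict inserted directly, rendered as a Python repr.
as_repr = template.format(exteval_scores=exteval_scores)

# Variant 2: dict serialized to JSON before insertion.
as_json = template.format(
    exteval_scores=json.dumps(exteval_scores, ensure_ascii=False, indent=2)
)

print(as_repr)
print(as_json)
```

Whether the model responds differently to the two renderings was not measured here; the JSON variant simply matches the format the prompt asks the model to answer in.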

### Error Handling

Prompting `GPT-4o` produced one error (the `JSONDecodeError` for `207_textrank_st` shown above). For this entry, the prompt was created manually and given to `GPT-4o` again. The response was then inserted into `corrected_data_modified.json`.
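
As an alternative to fixing such entries by hand, the request could be repeated automatically until the reply parses. The following is only a sketch under stated assumptions: `parse_json_reply` and `ask_until_valid` are hypothetical helpers, and `fetch_reply` stands in for the actual API call, which is simulated here with canned strings so the example runs offline.

```python
import json
import re

def parse_json_reply(text):
    """Extract the outermost {...} span and parse it; return None on failure."""
    match = re.search(r"\{.*\}", text, re.DOTALL)
    if not match:
        return None
    try:
        return json.loads(match.group(0))
    except json.JSONDecodeError:
        return None

def ask_until_valid(fetch_reply, max_attempts=3):
    """Call fetch_reply() until it yields parseable JSON or attempts run out."""
    for attempt in range(1, max_attempts + 1):
        parsed = parse_json_reply(fetch_reply())
        if parsed is not None:
            return parsed, attempt
    return None, max_attempts

# Stand-in for the API call: the first reply contains a stray ")",
# similar to the failure above; the second reply is valid JSON.
replies = iter([
    '{"corrected_extractive_summary": "x")}',
    '{"corrected_extractive_summary": "x"}',
])
result, attempts = ask_until_valid(lambda: next(replies))
print(result, attempts)
```

In the notebook, `fetch_reply` would wrap the `client.chat.completions.create(...)` call and return `response.choices[0].message.content`. Each retry costs an additional API request, so a small `max_attempts` is advisable.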

### Step 4: Evaluate the new, corrected summaries with ExtEval-Modified to get new scores (for the summaries that were corrected using the extra information from ExtEval-Modified)

For this, the new summaries first need to be preprocessed (`preprocessForCorrected.py`) and then evaluated (`extevalModifiedForCorrected.py`). The new scores for the corrected summaries will be saved to `corrected_data_exteval_modified.json`.

In [None]:
import subprocess

# Define the input and output files
data_file = "data/corrected/corrected_data_modified.json"
output_file = "data/corrected/preprocessed/corrected_data_preprocessed_modified.json"

# Build and execute the command
command = ["python", "exteval/preprocessForCorrected.py", "--data_file", data_file, "--output_file", output_file]

try:
    result = subprocess.run(command, check=True, text=True, capture_output=True)
    print("Script executed successfully:")
    print(result.stdout)
except subprocess.CalledProcessError as e:
    print("Error while executing the script:")
    print(e.stderr)


In [None]:
import subprocess

# Define the input and output files
data_file = "data/corrected/preprocessed/corrected_data_preprocessed_modified.json"
output_file = "data/corrected/scores/corrected_data_exteval_modified.json"

# Build and execute the command
command = ["python", "exteval/extevalModifiedForCorrected.py", "--data_file", data_file, "--output_file", output_file]

try:
    result = subprocess.run(command, check=True, text=True, capture_output=True)
    print("Script executed successfully:")
    print(result.stdout)
except subprocess.CalledProcessError as e:
    print("Error while executing the script:")
    print(e.stderr)


### Step 5: Compare the new ExtEval-Modified scores with the old ones to see the improvement
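
A minimal sketch of how such a comparison could look. It assumes that both score files are dictionaries keyed by the same summary IDs, that each entry exposes a numeric value under the chosen metric name, and that lower values are better (the prompt describes higher ExtEval values as indicating greater issues). `compare_scores` is a hypothetical helper and the toy data is invented; with the extended output of ExtEval-Modified, the numeric value may first need to be pulled out of a nested field.

```python
def compare_scores(old, new, metric="ExtEval"):
    """Count entries whose metric went down, went up, or stayed equal."""
    summary = {"improved": 0, "worse": 0, "unchanged": 0}
    # Only keys present in both score sets can be compared.
    for key in old.keys() & new.keys():
        before, after = old[key][metric], new[key][metric]
        if after < before:
            summary["improved"] += 1
        elif after > before:
            summary["worse"] += 1
        else:
            summary["unchanged"] += 1
    return summary

# Toy data for illustration only (not real results).
old_example = {"a": {"ExtEval": 2}, "b": {"ExtEval": 1}, "c": {"ExtEval": 0}}
new_example = {"a": {"ExtEval": 0}, "b": {"ExtEval": 1}, "c": {"ExtEval": 1}}

comparison = compare_scores(old_example, new_example)
print(comparison)
```

The same function can be called once per sub-metric (for example `metric="SentiBias"`) if those values are numeric as well.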