# Task 3: Summarization

### Welcome to Task 3!

LLMs can be effective at assessing the quality of a summary by comparing it against the original text. This draws on their ability to understand, compare and evaluate textual content.

This approach can be used to assess AI-generated summaries for refinement, compare human-written summaries for quality assurance and provide feedback to improve summarisation tools.

By using structured prompts, examples, and specific criteria, LLMs can serve as powerful tools for evaluating the quality of summaries in a consistent and scalable way.

**In this task you will build an LLM Judge to analyse the quality of summaries provided by an LLM summarizer.**

### Environment and Task Set Up 

Run the following cell. 
If there are no issues, you will get the message 'Root directory set up correctly!'

In [None]:
# Install required packages
!pip install -qq -r ../requirements.txt

REL_PATH_TO_ROOT = "../"

import sys
import os
import json
import pandas as pd
import tqdm

sys.path.insert(0,REL_PATH_TO_ROOT)

from src.utils import get_root_dir, test_root_dir
from local_variables import ROOT_DIR

test_root_dir(REL_PATH_TO_ROOT)

from prompt_manager.manager import PromptManager
from prompt_manager.fetcher import fetch_prompt
from src.api import generate_outputs_openai
from src.image_display import display_image

### Task Background

Below is the initial ask from the Journalist-Mini team as well as an explanation of the dataset they have provided you.

#### The Ask

In [None]:
display_image(f"{get_root_dir()}/task_images/task_3_desc.png")

#### The Data

In [None]:
display_image(f"{get_root_dir()}/task_images/task_3_data.png",max_size=700)

### Load Dataset

The dataset contains 30 CNN news articles along with summaries of the articles.

For each summary, we have provided ground truth labels for 3 measures of summary quality:

- Fluency
- Brevity
- Coverage

Definitions for these can be found in `task_notebooks/summarization_metrics.txt`.

In [None]:
! cat ./summarization_metrics.txt

In [None]:
input_path = os.path.join(REL_PATH_TO_ROOT, "data/summarisation.csv")
df = pd.read_csv(input_path)

In [None]:
# Dataset shape
df.shape

In [None]:
# First few rows
df.head()

### Task: Build LLM as a Judge

For each metric, craft a prompt that captures the evaluation rubric.

The **inputs** to your LLM Judge should be the article and/or the article summary.

The **output** from your LLM Judge can use a scoring system of your choice, but it should produce a final score.

The **ground truths** are binary and should serve as a guide for benchmarking. Unlike previous exercises, you cannot calculate direct agreement unless your metric scale is also binary. This is where your own human review could help you!
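
If your judge scores on a non-binary scale, one way to benchmark against the binary ground truths is to pick a cut-off and compare the thresholded scores to the labels. The sketch below is illustrative only: it assumes the ground truths are encoded as 0/1 and that the judge returns numeric scores on a 1-5 scale. The function names `binarize` and `agreement_rate` and the example values are hypothetical, not part of the workshop code.

```python
def binarize(scores, threshold):
    """Map numeric judge scores to 0/1: 1 when score >= threshold."""
    return [1 if s >= threshold else 0 for s in scores]


def agreement_rate(predicted, ground_truth):
    """Fraction of positions where the two binary label lists match."""
    if len(predicted) != len(ground_truth):
        raise ValueError("label lists must have the same length")
    if not predicted:
        raise ValueError("label lists must not be empty")
    matches = sum(p == g for p, g in zip(predicted, ground_truth))
    return matches / len(predicted)


# Hypothetical example: judge scores on a 1-5 scale, binary ground truths (assumed 0/1)
judge_scores = [5, 2, 4, 1, 3]
ground_truth = [1, 0, 1, 0, 0]

# Try more than one cut-off; the best threshold is a judgment call
for threshold in (3, 4):
    rate = agreement_rate(binarize(judge_scores, threshold), ground_truth)
    print(f"threshold={threshold}: agreement={rate:.2f}")
```

Reviewing the rows where the thresholded score and the label disagree is a good place to apply your own human judgment about whether the prompt or the cut-off needs adjusting.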

#### Fluency

##### Load the prompt

In [None]:
SEQUENCE = ["task_3","fluency"]
prompt_template = fetch_prompt(SEQUENCE,use_latest_version=True)
print(f"Current LLM Judge Prompt:\n------------------------\n{prompt_template}\n------------------------")

##### Apply the prompt to the test dataset

In [None]:
evaluator_responses = []

for _, row in tqdm.tqdm(df.iterrows(), total=len(df)):  # total lets tqdm show progress out of all rows

    # Get inputs and place into dictionary format
    summary = row["journalist_mini_summary"]
    row_inputs = {"SUMMARY" : summary}

    # Initialise prompt to validate and format inputs
    prompt = PromptManager(template=prompt_template,inputs=row_inputs)
    prompt.validate_inputs()
    prompt.format_inputs()

    # Send prompt and collect response
    response = generate_outputs_openai(prompt.prompt)
    evaluator_responses.append(response)

df["evaluator_fluency"] = evaluator_responses
display(df.head(5))

In [None]:
df["journalist_mini_summary"].iloc[1]
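
To turn the judge's responses into a final score, you may need to parse a number out of the response text. The sketch below assumes the responses are strings and that your prompt asks the judge to finish with a line such as `Score: 4`. That output convention, the pattern, and the helper name `extract_score` are assumptions for illustration; adapt them to whatever format your prompt requests.

```python
import re

# Assumed output convention: the judge's reply contains text such as "Score: 4"
SCORE_PATTERN = re.compile(r"score\s*[:=]\s*(\d+(?:\.\d+)?)", re.IGNORECASE)


def extract_score(response):
    """Return the last 'Score: N' value in a judge response, or None if absent."""
    matches = SCORE_PATTERN.findall(str(response))
    return float(matches[-1]) if matches else None


print(extract_score("The summary reads naturally.\nScore: 4"))
print(extract_score("No rating given."))
```

With this in place, something like `df["evaluator_fluency"].apply(extract_score)` would give a numeric column, with `None` marking responses that did not follow the expected format and are worth inspecting by hand.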

#### Brevity

##### Load the prompt

In [None]:
SEQUENCE = ["task_3","brevity"]
prompt_template = fetch_prompt(SEQUENCE,use_latest_version=True)
print(f"Current LLM Judge Prompt:\n------------------------\n{prompt_template}\n------------------------")

##### Apply the prompt to the dataset

In [None]:
evaluator_responses = []

for _, row in tqdm.tqdm(df.iterrows(), total=len(df)):  # total lets tqdm show progress out of all rows

    # Get inputs and place into dictionary format
    summary = row["journalist_mini_summary"]
    row_inputs = {"SUMMARY" : summary}

    # Initialise prompt to validate and format inputs
    prompt = PromptManager(template=prompt_template,inputs=row_inputs)
    prompt.validate_inputs()
    prompt.format_inputs()

    # Send prompt and collect response
    response = generate_outputs_openai(prompt.prompt)
    evaluator_responses.append(response)

df["evaluator_brevity"] = evaluator_responses
display(df.head(5))

#### Coverage

##### Load the prompt

In [None]:
SEQUENCE = ["task_3","coverage"]
prompt_template = fetch_prompt(SEQUENCE,use_latest_version=True)
print(f"Current LLM Judge Prompt:\n------------------------\n{prompt_template}\n------------------------")

##### Apply the prompt to the dataset

In [None]:
evaluator_responses = []

for _, row in tqdm.tqdm(df.iterrows(), total=len(df)):  # total lets tqdm show progress out of all rows

    # Get inputs and place into dictionary format
    summary = row["journalist_mini_summary"]
    article = row["original_article"]
    row_inputs = {"SUMMARY" : summary, "ARTICLE" : article}

    # Initialise prompt to validate and format inputs
    prompt = PromptManager(template=prompt_template,inputs=row_inputs)
    prompt.validate_inputs()
    prompt.format_inputs()

    # Send prompt and collect response
    response = generate_outputs_openai(prompt.prompt)
    evaluator_responses.append(response)

df["evaluator_coverage"] = evaluator_responses
display(df.head(5))
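
The three evaluation loops above share the same shape: build the inputs for a row, send them to the judge, and collect the response. If you plan to add more metrics, a small helper can remove the repetition. This is a minimal sketch, not part of the workshop code: `run_judge` is a hypothetical name, and `fake_judge` is a stand-in that counts words so the example runs without calling an LLM.

```python
def run_judge(rows, build_inputs, judge):
    """Apply `judge` to the inputs built from each row and collect the responses.

    rows:         iterable of dict-like records (one per summary)
    build_inputs: function mapping a row to the prompt inputs dictionary
    judge:        function mapping the inputs dictionary to a response
    """
    return [judge(build_inputs(row)) for row in rows]


# Stand-in judge for illustration only: counts words instead of calling an LLM
def fake_judge(inputs):
    return len(inputs["SUMMARY"].split())


rows = [
    {"journalist_mini_summary": "Short summary.", "original_article": "Long text A"},
    {"journalist_mini_summary": "A slightly longer summary.", "original_article": "Long text B"},
]

responses = run_judge(
    rows,
    build_inputs=lambda row: {"SUMMARY": row["journalist_mini_summary"]},
    judge=fake_judge,
)
print(responses)
```

In the notebook, `judge` could wrap the `PromptManager` and `generate_outputs_openai` steps used in the cells above, and `rows` could come from `df.to_dict("records")`.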

If you have time, consider adding other metrics to your metric suite for summary quality.

For example, you could consider metrics such as:

- Hallucination (can you re-use your previous prompt?)
- Formatting, including dates, special characters, etc.

### End of Task 3