# Task 3: Summarization

#### Welcome to task 3!

In this task you will build an LLM Judge to analyse the quality of summaries provided by an LLM summarizer.

### Environment and Task Set Up 

Run the following cell. 
If there are no issues, you will get the message 'Root directory set up correctly!'

In [None]:
# Install required packages
!pip install -qq -r ../requirements.txt

REL_PATH_TO_ROOT = "../"

import sys
import os
import json
import pandas as pd
import tqdm

sys.path.insert(0,REL_PATH_TO_ROOT)

from src.utils import get_root_dir, test_root_dir
from local_variables import ROOT_DIR

test_root_dir(REL_PATH_TO_ROOT)

from prompt_manager.manager import PromptManager
from prompt_manager.fetcher import fetch_prompt
from src.api import generate_outputs_openai
from src.image_display import display_image

### Task Background

Below is the initial ask from the Journalist-Mini team, along with an explanation of the dataset they have provided.

#### The Ask

In [None]:
display_image(f"{get_root_dir()}/task_images/task_3_desc.png")

#### The Data

In [None]:
display_image(f"{get_root_dir()}/task_images/task_3_data.png",max_size=700)

### Load Dataset

The dataset contains 30 CNN news articles along with summaries produced by an LLM.

For each article summary, we have provided ground truth labels for 3 measures of summary quality:

- Fluency
- Brevity
- Coverage

If you have time, you may also want to consider metrics such as:

- Hallucination
- Formatting (dates, amounts of money, etc.)

Definitions for these can be found in `task_notebooks/summarization_metrics.txt`.

In [None]:
! cat ./summarization_metrics.txt

In [None]:
input_path = os.path.join(REL_PATH_TO_ROOT, "data/summarisation_multi_ground_truth.csv")
df = pd.read_csv(input_path)

In [None]:
# Dataset shape
df.shape

In [None]:
# First few rows
df.head()

### Task: Build LLM as a Judge

For each metric, craft a prompt that aims to capture the evaluation rubric.

The **inputs** to your LLM Judge should be the article and/or the article summary.

The **output** from your LLM Judge is a score, using a scoring system of your choice.
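Whatever scoring system you choose, the judge's free-text reply has to be turned into a label you can compare against the ground truth. The following is a minimal sketch, assuming the judge prompt instructs the model to end its reply with a line such as `SCORE: 1`. Both the `SCORE:` convention and the `parse_judge_score` helper are hypothetical and not part of this repository.

```python
import re
from typing import Optional

# Hypothetical convention: the judge prompt asks the model to finish with "SCORE: <integer>".
SCORE_PATTERN = re.compile(r"SCORE:\s*(-?\d+)", re.IGNORECASE)


def parse_judge_score(judge_response: str) -> Optional[int]:
    """Return the last integer score found in the judge's reply, or None if there is none."""
    matches = SCORE_PATTERN.findall(judge_response)
    return int(matches[-1]) if matches else None
```

Returning `None` for replies without a score lets you count unparseable responses separately instead of silently treating them as disagreements.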

#### Hallucination

In [None]:
# Get prompt
SEQUENCE = ["task_3","hallucination_detector"]
prompt_template = fetch_prompt(SEQUENCE,use_latest_version=True)
print(f"Current LLM Judge Prompt:\n------------------------\n{prompt_template}\n------------------------")

In [None]:
# Apply prompt to dataset
evaluator_responses = []

# Pass total so tqdm can show progress against the number of rows
for _, row in tqdm.tqdm(df.iterrows(), total=len(df)):

    # Get inputs and place into dictionary format
    context = row["article"]
    summary = row["altered_summaries"]
    row_inputs = {"CONTEXT": context, "RESPONSE": summary}

    # Initialise prompt to validate and format inputs
    prompt = PromptManager(template=prompt_template, inputs=row_inputs)
    prompt.validate_inputs()
    prompt.format_inputs()

    # Send prompt and collect the judge's response
    judge_response = generate_outputs_openai(prompt.prompt)
    evaluator_responses.append(judge_response)

df["evaluator_hallucination"] = evaluator_responses
display(df.head(5))

In [None]:
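# Get agreement between the ground truth labels and the LLM Judge's responses
# (the judge's responses were stored in the "evaluator_hallucination" column above)
agreement_counts = [1 if str(row['hallucination']) == str(row['evaluator_hallucination']) else 0 for _, row in df.iterrows()]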
percentage_agreement = sum(agreement_counts)/len(agreement_counts)
print(f"\n Your LLM Judge achieved {round(100 * percentage_agreement, 1)}% agreement!")
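
The same agreement calculation can be reused for each metric you build a judge for. Below is a minimal sketch of a reusable helper. The name `percentage_agreement` is hypothetical. Note also that, unlike the strict string comparison in the cell above, it strips whitespace and lowercases labels before comparing, which is an assumption about how lenient you want the match to be.

```python
from typing import Iterable


def percentage_agreement(truth_labels: Iterable, judge_labels: Iterable) -> float:
    """Percentage of positions where the ground truth and judge labels match as strings."""
    pairs = list(zip(truth_labels, judge_labels))
    if not pairs:
        return 0.0
    # Normalise both labels so that e.g. "1 " and 1 count as a match
    matches = sum(
        str(truth).strip().lower() == str(judge).strip().lower()
        for truth, judge in pairs
    )
    return round(100 * matches / len(pairs), 1)
```

For the hallucination judge above, this could be called as `percentage_agreement(df["hallucination"], df["evaluator_hallucination"])`.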

### End of Exercise