<center>
    <p style="text-align:center">
        <img alt="phoenix logo" src="https://storage.googleapis.com/arize-assets/phoenix/assets/phoenix-logo-light.svg" width="200"/>
        <br>
        <a href="https://docs.arize.com/phoenix/">Docs</a>
        |
        <a href="https://github.com/Arize-ai/phoenix">GitHub</a>
        |
        <a href="https://join.slack.com/t/arize-ai/shared_invite/zt-1px8dcmlf-fmThhDFD_V_48oU7ALan4Q">Community</a>
    </p>
</center>
<h1 align="center">Summarization Classification Evals</h1>

The purpose of this notebook is:

- to evaluate the performance of an LLM-assisted approach to evaluating summarization quality,
- to provide an experimental framework for users to iterate and improve on the default classification template.

## Install Dependencies and Import Libraries

In [None]:
#####################
## N_EVAL_SAMPLE_SIZE
#####################
# Eval sample size determines the run time
# 100 samples: GPT-4 ~80 sec / GPT-3.5 ~40 sec
# 1,000 samples: GPT-4 ~15-17 min / GPT-3.5 ~6-7 min (depending on retries)
# 10,000 samples: GPT-4 ~170 min / GPT-3.5 ~70 min
N_EVAL_SAMPLE_SIZE = 100

In [None]:
!pip install -qq arize-phoenix ipython matplotlib openai pycm scikit-learn

In [None]:
import os
from getpass import getpass

import matplotlib.pyplot as plt
import openai
import pandas as pd
import phoenix.experimental.evals.templates.default_templates as templates
from phoenix.experimental.evals import (
    OpenAIModel,
    download_benchmark_dataset,
    llm_eval_binary,
)
from pycm import ConfusionMatrix
from sklearn.metrics import classification_report

pd.set_option("display.max_colwidth", None)

## Download Benchmark Dataset

We'll evaluate the evaluation system, which consists of an LLM, its settings, and an evaluation prompt template, against a benchmark dataset of documents and summaries with ground-truth summarization labels. We use the CNN / Daily Mail dataset, which is commonly used as a benchmark for text summarization models.

In [None]:
df = download_benchmark_dataset(
    task="summarization-classification", dataset_name="summarization-test"
)
df.head()

## Display Binary Summarization Classification Template

View the default template used to classify summarizations. You can tweak this template and evaluate its performance relative to the default.

In [None]:
print(templates.SUMMARIZATION_PROMPT_TEMPLATE_STR)

Eval template variables:

- **document** : The document text to summarize
- **summary** : The summary of the document
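
As a rough illustration of how these two variables are used, the sketch below fills a template for a single row. It assumes the template uses `str.format`-style `{document}` and `{summary}` placeholders; `EXAMPLE_TEMPLATE`, `fill_template`, and the sample row are hypothetical stand-ins, not the wording or code Phoenix actually uses.

```python
# Hypothetical stand-in template. The real text lives in
# templates.SUMMARIZATION_PROMPT_TEMPLATE_STR and is worded differently.
EXAMPLE_TEMPLATE = (
    "Document: {document}\n"
    "Summary: {summary}\n"
    "Is the summary good or bad?"
)


def fill_template(template: str, row: dict) -> str:
    """Substitute the row's fields into the template's named placeholders."""
    return template.format(document=row["document"], summary=row["summary"])


# Made-up row for illustration only; not taken from the benchmark dataset.
row = {
    "document": "The city council voted on Tuesday to fund a new bridge.",
    "summary": "Council approves bridge funding.",
}
prompt = fill_template(EXAMPLE_TEMPLATE, row)
print(prompt)
```

When you tweak the template, keeping both placeholders intact is what lets each row of the dataframe be rendered into its own prompt.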

## Configure the LLM

Configure your OpenAI API key.

In [None]:
if not (openai_api_key := os.getenv("OPENAI_API_KEY")):
    openai_api_key = getpass("🔑 Enter your OpenAI API key: ")
openai.api_key = openai_api_key
os.environ["OPENAI_API_KEY"] = openai_api_key

## Benchmark Dataset Sample
Sample size determines run time. We recommend iterating on a small sample first (100 rows), then increasing to a larger test set.

In [None]:
df_sample = df.sample(n=N_EVAL_SAMPLE_SIZE).reset_index(drop=True)
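
To get a feel for how long a larger run might take, the sketch below extrapolates linearly from the timings quoted in the `N_EVAL_SAMPLE_SIZE` comments. The per-sample rates are derived from the 10,000-sample figures only, and `estimate_minutes` is a hypothetical helper; actual run time depends on retries, rate limits, and document length.

```python
# Approximate seconds per sample, derived from the 10,000-sample timings
# quoted earlier (~170 min for GPT-4, ~70 min for GPT-3.5). Rough guide only.
SECONDS_PER_SAMPLE = {
    "gpt-4": 170 * 60 / 10_000,
    "gpt-3.5-turbo": 70 * 60 / 10_000,
}


def estimate_minutes(n_samples: int, model_name: str) -> float:
    """Linearly extrapolate the expected run time in minutes."""
    return n_samples * SECONDS_PER_SAMPLE[model_name] / 60


for name in SECONDS_PER_SAMPLE:
    print(f"{name}: ~{estimate_minutes(1_000, name):.0f} min for 1,000 samples")
```

The linear estimate tends to overshoot on very small samples (the quoted 100-sample runs are slightly faster per row), so treat it as an upper-end guess.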


## LLM Evals: Summarization Classifications with GPT-4
Run summarization classifications against a subset of the data.

Instantiate the LLM and set parameters.

In [None]:
model = OpenAIModel(
    model_name="gpt-4",
    temperature=0.0,
)

In [None]:
model("Hello world, this is a test if you are working?")

In [None]:
# The rails constrain the output to the specific values defined by the template.
# Stray text such as ",,," or "..." is stripped from the LLM response,
# so that only the expected binary value is returned.
rails = list(templates.SUMMARIZATION_PROMPT_RAILS_MAP.values())
summarization_classifications = llm_eval_binary(
    dataframe=df_sample,
    template=templates.SUMMARIZATION_PROMPT_TEMPLATE_STR,
    model=model,
    rails=rails,
)
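
The rails behavior described in the comments above can be pictured with a small sketch. This is one plausible way to snap a raw LLM response onto a rail, not Phoenix's actual implementation; `snap_to_rail` and the rail values `"good"` and `"bad"` are assumptions made for illustration.

```python
import string
from typing import List, Optional


def snap_to_rail(raw_output: str, rails: List[str]) -> Optional[str]:
    """Return the single rail found in raw_output, or None if none or several match."""
    # Strip surrounding punctuation such as ",,," or "..." from each word.
    words = [w.strip(string.punctuation).lower() for w in raw_output.split()]
    matches = {rail for rail in rails if rail.lower() in words}
    return matches.pop() if len(matches) == 1 else None


example_rails = ["good", "bad"]  # hypothetical rail values
print(snap_to_rail("Good...", example_rails))
print(snap_to_rail(",,,bad", example_rails))
print(snap_to_rail("It is hard to say.", example_rails))
```

Returning `None` for an ambiguous or missing label keeps unparseable responses visible, so they can be counted separately instead of being silently assigned to a class.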


Evaluate the predictions against human-labeled ground-truth summarization labels.

In [None]:
true_labels = df_sample["user_feedback"].map(templates.SUMMARIZATION_PROMPT_RAILS_MAP).tolist()
summarization_classifications = (
    pd.Series(summarization_classifications)
    .map(lambda x: "unparseable" if x is None else x)
    .tolist()
)
print(classification_report(true_labels, summarization_classifications, labels=rails))
confusion_matrix = ConfusionMatrix(
    actual_vector=true_labels, predict_vector=summarization_classifications, classes=rails
)
confusion_matrix.plot(
    cmap=plt.colormaps["Blues"],
    number_label=True,
    normalized=True,
);
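
To make the report easier to read, the sketch below computes precision, recall, and F1 for one class by hand on a handful of made-up labels. The label values and the helper `precision_recall_f1` are hypothetical; the point is only to show how an `"unparseable"` prediction counts as a miss for the true class even though it is not one of the reported rails.

```python
from typing import List, Tuple


def precision_recall_f1(
    actual: List[str], predicted: List[str], positive: str
) -> Tuple[float, float, float]:
    """Compute precision, recall, and F1 for a single class."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)
    fp = sum(a != positive and p == positive for a, p in pairs)
    fn = sum(a == positive and p != positive for a, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Made-up labels for illustration; not results from this notebook.
actual = ["good", "good", "bad", "bad", "good"]
predicted = ["good", "bad", "bad", "good", "unparseable"]

precision, recall, f1 = precision_recall_f1(actual, predicted, positive="good")
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

A high count of unparseable outputs therefore shows up as reduced recall, which is a useful signal when comparing a modified template against the default.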


## LLM Evals: Summarization Classifications with GPT-3.5
Run summarization classifications against the same subset of the data, this time with GPT-3.5.

In [None]:
model = OpenAIModel(model_name="gpt-3.5-turbo", temperature=0.0, request_timeout=20)

In [None]:
rails = list(templates.SUMMARIZATION_PROMPT_RAILS_MAP.values())
summarization_classifications = llm_eval_binary(
    dataframe=df_sample,
    template=templates.SUMMARIZATION_PROMPT_TEMPLATE_STR,
    model=model,
    rails=rails,
)

In [None]:
true_labels = df_sample["user_feedback"].map(templates.SUMMARIZATION_PROMPT_RAILS_MAP).tolist()
summarization_classifications = (
    pd.Series(summarization_classifications)
    .map(lambda x: "unparseable" if x is None else x)
    .tolist()
)

print(classification_report(true_labels, summarization_classifications, labels=rails))
confusion_matrix = ConfusionMatrix(
    actual_vector=true_labels, predict_vector=summarization_classifications, classes=rails
)
confusion_matrix.plot(
    cmap=plt.colormaps["Blues"],
    number_label=True,
    normalized=True,
);