<center>
    <p style="text-align:center">
        <img alt="phoenix logo" src="https://storage.googleapis.com/arize-assets/phoenix/assets/phoenix-logo-light.svg" width="200"/>
        <br>
        <a href="https://docs.arize.com/phoenix/">Docs</a>
        |
        <a href="https://github.com/Arize-ai/phoenix">GitHub</a>
        |
        <a href="https://join.slack.com/t/arize-ai/shared_invite/zt-1px8dcmlf-fmThhDFD_V_48oU7ALan4Q">Community</a>
    </p>
</center>
<h1 align="center">Text Summarization for LLMs</h1>

Imagine you're responsible for a media company's summarization model, MediaGPT, which condenses daily news into concise summaries. Lately, the model's performance has declined, leading to negative feedback from readers around the globe.

This tutorial shows how Phoenix can quickly identify and troubleshoot the cause of this performance degradation by analyzing prompt-response pairs linked to the documents being summarized. You'll see how examining embedding drift can reveal data issues before they impact performance.

In this tutorial, you will:

- Download curated LLM data for this walkthrough
- Calculate generative text performance metrics
- Launch Phoenix
- Pinpoint a cluster of articles your LLM is struggling to summarize

Let's get started!

Install dependencies.

In [None]:
!pip install -q "arize-phoenix" 'arize[AutoEmbeddings, LLM_Evaluation]'

Import libraries.

In [None]:
from datetime import datetime, timedelta
import uuid
import pandas as pd

from arize.pandas.embeddings import EmbeddingGenerator, UseCases
from arize.pandas.generative.llm_evaluation import sacre_bleu, rouge
import phoenix as px

Download the CNN/Daily Mail dataset, a benchmark dataset for text summarization, and view a few examples of the data.

In [None]:
train_df = pd.read_parquet("http://storage.googleapis.com/arize-assets/phoenix/datasets/unstructured/llm/summarization/llm_summarization_train.parquet?ignoreCache=1")
prod_df = pd.read_parquet("http://storage.googleapis.com/arize-assets/phoenix/datasets/unstructured/llm/summarization/llm_summarization_prod.parquet?ignoreCache=1")
train_df.head()

The columns of the DataFrame are:

- **document:** the news article to be summarized
- **summary:** the LLM-generated summary 
- **reference_summary:** the reference summary written by a human
- **document_vector:** the embedding vector for the news article to be summarized
- **summary_vector:** the embedding vector for the summary
- **rouge1_score:** a score that compares the LLM-generated summary with the human-written reference summary (a high ROUGE score indicates that the LLM-generated summary is similar to the reference summary)
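
To make the `rouge1_score` column more concrete, here is a minimal sketch of a ROUGE-1-style score: the F1 of unigram overlap between a summary and its reference. The function name `rouge1_f1` and the whitespace tokenization are assumptions made for illustration. This is not the implementation behind the pre-computed column, and the tutorial does not state which ROUGE variant (precision, recall, or F1) that column stores.

```python
from collections import Counter


def rouge1_f1(summary: str, reference: str) -> float:
    """Unigram-overlap F1 between a candidate summary and a reference."""
    # Naive lowercase whitespace tokenization (an assumption; library
    # implementations usually add normalization and optional stemming).
    candidate_counts = Counter(summary.lower().split())
    reference_counts = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((candidate_counts & reference_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(candidate_counts.values())
    recall = overlap / sum(reference_counts.values())
    return 2 * precision * recall / (precision + recall)


identical = rouge1_f1("the cat sat on the mat", "the cat sat on the mat")
partial = rouge1_f1("the cat sat", "the cat sat on the mat")
disjoint = rouge1_f1("de kat zit", "the cat sat")
```

A summary that shares no words with its reference scores zero, which is why a summary produced in the wrong language, or one that misses the content of the article, shows up as a low score.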

Todo: Remove or explain remaining columns.

Todo: Include prompt template.

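Run the cells below if you have a GPU and want to compute the embeddings and the sacreBLEU and ROUGE scores from scratch; otherwise, skip this step and use the pre-computed values downloaded with the rest of your data. The cells are commented out and refer to a generic `df`; to use them, uncomment them and substitute `train_df` or `prod_df`.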

TODO: Implement these steps to compute 

In [None]:
# df["sacreBLEU_score"] = sacre_bleu(
#     response_col=df["summary"], references_col=df["reference_summary"]
# )
# rouge_scores = rouge(
#     response_col=df["summary"],
#     references_col=df["reference_summary"],
#     rouge_types=["rouge1", "rouge2", "rougeL", "rougeLsum"],
# )
# for rouge_type, scores in rouge_scores.items():
#     df[f"{rouge_type}_score"] = scores

In [None]:
# generator = EmbeddingGenerator.from_use_case(
#     use_case=UseCases.NLP.SUMMARIZATION,
#     model_name="distilbert-base-uncased",
# )

In [None]:
# df["document_vector"] = generator.generate_embeddings(text_col=df["document"])
# df["summary_vector"] = generator.generate_embeddings(text_col=df["summary"])

Create a Phoenix schema to describe the columns of your DataFrames.

In [None]:
schema = px.Schema(
    timestamp_column_name="prediction_ts",
    tag_column_names=[
        "sacreBLEU_score",
        "rouge1_score",
        "rouge2_score",
        "rougeL_score",
        "rougeLsum_score",
        "reference_summary",
        "language",
    ],
    prompt_column_names=px.EmbeddingColumnNames(
        vector_column_name="document_vector", raw_data_column_name="document"
    ),
    response_column_names=px.EmbeddingColumnNames(
        vector_column_name="summary_vector", raw_data_column_name="summary"
    ),
)
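
Before building the datasets, it can help to confirm that every column named in the schema actually exists in your DataFrames. The sketch below is one way to do that with plain pandas. The helper `missing_columns` and the toy DataFrame are hypothetical and exist only for illustration; the column names are copied from the schema cell above.

```python
import pandas as pd

# Column names copied from the schema cell in this tutorial.
expected_columns = [
    "prediction_ts",
    "sacreBLEU_score",
    "rouge1_score",
    "rouge2_score",
    "rougeL_score",
    "rougeLsum_score",
    "reference_summary",
    "language",
    "document_vector",
    "document",
    "summary_vector",
    "summary",
]


def missing_columns(df: pd.DataFrame, expected: list) -> list:
    """Return the expected column names that are absent from df."""
    return [name for name in expected if name not in df.columns]


# Toy frame standing in for train_df or prod_df (hypothetical data).
toy_df = pd.DataFrame({"document": ["an article"], "summary": ["a summary"]})
missing = missing_columns(toy_df, expected_columns)
```

With the tutorial data you would call `missing_columns(train_df, expected_columns)` and `missing_columns(prod_df, expected_columns)` and expect an empty list from each.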

Create your Phoenix datasets.

In [None]:
train_ds = px.Dataset(train_df, schema)
prod_ds = px.Dataset(prod_df, schema)

Launch Phoenix. Follow the instructions in the cell output to open the Phoenix UI.

In [None]:
session = px.launch_app(prod_ds, train_ds)

Use Phoenix to find a cluster of your data where your LLM is doing a poor job of summarizing the input text.

1. Click on "document_vector" to go to the embeddings view for your prompts (the input news articles).
1. Select a period of high drift.
1. Color your data by the `rouge1_score` dimension. The problematic clusters, which have low ROUGE scores, appear in blue; the well-performing clusters, which have high ROUGE scores, appear in green.
1. Compare the data in the two clusters. Notice that the LLM is doing a good job summarizing the English articles in the green clusters, but is struggling to summarize Dutch articles in the blue cluster.
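
Step 2 above asks you to pick a period of high drift. As a rough illustration of what drift between two sets of embeddings can mean, the sketch below measures the Euclidean distance between the centroid of a primary set of vectors and the centroid of a reference set. The function name `centroid_distance` and the two-dimensional toy vectors are assumptions for illustration; this tutorial does not specify how the drift shown in the Phoenix UI is computed.

```python
import numpy as np


def centroid_distance(primary, reference) -> float:
    """Euclidean distance between the mean vectors of two embedding sets."""
    primary_centroid = np.asarray(primary, dtype=float).mean(axis=0)
    reference_centroid = np.asarray(reference, dtype=float).mean(axis=0)
    return float(np.linalg.norm(primary_centroid - reference_centroid))


# Toy 2-D "embeddings" (hypothetical data).
reference_vectors = np.array([[0.0, 0.0], [2.0, 0.0]])  # centroid (1, 0)
same_vectors = reference_vectors.copy()
shifted_vectors = reference_vectors + np.array([3.0, 4.0])  # centroid (4, 4)

no_drift = centroid_distance(same_vectors, reference_vectors)
drift = centroid_distance(shifted_vectors, reference_vectors)
```

Assuming each row of the `document_vector` column holds a one-dimensional array of the same length, something like `centroid_distance(np.stack(prod_df["document_vector"]), np.stack(train_df["document_vector"]))` would apply the same idea to the tutorial data.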

Congrats! You've discovered that the LLM is struggling to summarize Dutch news articles. You should check out your prompt template to see if you can improve performance for Dutch articles.
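
If you want to double-check this finding outside the UI, one option is to average `rouge1_score` per value of the `language` column, both of which appear in the schema above. The sketch below uses a small toy DataFrame with made-up scores and hypothetical language labels so that it runs on its own; the actual values stored in the `language` column may be encoded differently.

```python
import pandas as pd

# Toy rows standing in for prod_df; scores and labels are made up.
toy_df = pd.DataFrame(
    {
        "language": ["English", "English", "Dutch", "Dutch"],
        "rouge1_score": [0.40, 0.50, 0.10, 0.20],
    }
)

# Mean ROUGE-1 per language, sorted so the weakest language comes first.
mean_by_language = (
    toy_df.groupby("language")["rouge1_score"].mean().sort_values()
)
worst_language = mean_by_language.index[0]
```

Replacing `toy_df` with `prod_df` gives the same breakdown for the production data.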

Close the app when you're done.

In [None]:
px.close_app()