<center>
    <p style="text-align:center">
        <img alt="phoenix logo" src="https://storage.googleapis.com/arize-assets/phoenix/assets/phoenix-logo-light.svg" width="200"/>
        <br>
        <a href="https://docs.arize.com/phoenix/">Docs</a>
        |
        <a href="https://github.com/Arize-ai/phoenix">GitHub</a>
        |
        <a href="https://join.slack.com/t/arize-ai/shared_invite/zt-1px8dcmlf-fmThhDFD_V_48oU7ALan4Q">Community</a>
    </p>
</center>
<h1 align="center">Troubleshooting an LLM Summarization Task</h1>

Imagine you're responsible for your media company's summarization model that condenses daily news into concise summaries. Your model's performance has recently declined, leading to negative feedback from readers around the globe.

Phoenix helps you find the root cause of LLM performance issues by analyzing prompt-response pairs.

In this tutorial, you will:

- Download curated LLM data for this walkthrough
- Compute embeddings for each prompt (article) and response (summary)
- Calculate ROUGE-L scores to evaluate the quality of your LLM-generated summaries against human-written reference summaries
- Use Phoenix to find articles that your LLM is struggling to summarize

⚠️ This tutorial runs faster with a GPU.

Let's get started!

## Install Dependencies and Import Libraries

Install Phoenix and the Arize SDK, which provides convenience methods for extracting embeddings and computing LLM evaluation metrics.

In [None]:
!pip install -q "arize-phoenix" "arize[AutoEmbeddings, LLM_Evaluation]"

Import libraries.

In [None]:
from arize.pandas.embeddings import EmbeddingGenerator, UseCases
from arize.pandas.generative.llm_evaluation import rouge
import pandas as pd
import phoenix as px

## Download the Data

Download your production data. Split it into two dataframes, one containing older baseline data and the other containing the most recent data. Inspect a few rows of your baseline data.

In [None]:
df = pd.read_parquet(
    "http://storage.googleapis.com/arize-assets/phoenix/datasets/unstructured/llm/summarization/llm_summarization.parquet"
)
# Resetting the index is recommended when using EmbeddingGenerator.generate_embeddings
baseline_df = df[:300].reset_index(drop=True)
recent_df = df[300:].reset_index(drop=True)
baseline_df.head()

The columns of the dataframe are:

- **prediction_timestamp:** the timestamp associated with each prediction (LLM-generated summary)
- **article:** the news article to be summarized
- **summary:** the LLM-generated summary created using the prompt template: "Please summarize the following document in English: {article}"
- **reference_summary:** the reference summary written by a human and used to compute the ROUGE-L score

## Compute LLM Evaluation Metrics

Compute ROUGE-L scores to compare each LLM-generated summary with its human-written reference summary. A high ROUGE-L score means that the LLM's summary closely matches the human reference summary.

In [None]:
def compute_rougeL_scores(df: pd.DataFrame) -> pd.Series:
    return rouge(
        response_col=df["summary"],
        references_col=df["reference_summary"],
        rouge_types=["rougeL"],
    )["rougeL"]


baseline_df["rougeL_score"] = compute_rougeL_scores(baseline_df)
recent_df["rougeL_score"] = compute_rougeL_scores(recent_df)
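
To build intuition for what the score measures, here is a minimal sketch of ROUGE-L as an F-measure over the longest common subsequence (LCS) of tokens. It is not the implementation behind `rouge` in the Arize SDK: the helper names `lcs_length` and `rouge_l_f1` are hypothetical, and lowercased whitespace tokenization is a simplifying assumption, so its values may differ from the library's.

```python
from typing import List


def lcs_length(a: List[str], b: List[str]) -> int:
    """Return the length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                # Extend the best subsequence that ends before both tokens.
                curr.append(prev[j - 1] + 1)
            else:
                # Otherwise carry over the better of the two neighbors.
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate: str, reference: str) -> float:
    """ROUGE-L F-measure using lowercased whitespace tokens (an assumption)."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)  # share of candidate tokens in the LCS
    recall = lcs / len(ref)  # share of reference tokens in the LCS
    return 2 * precision * recall / (precision + recall)
```

A summary that shares no tokens with its reference scores 0, while an exact match scores 1, which is why a summary in the wrong language tends to score poorly against an English reference.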

## Compute Embeddings for Prompts and Responses

Compute embeddings for articles and summaries.

In [None]:
generator = EmbeddingGenerator.from_use_case(
    use_case=UseCases.NLP.SUMMARIZATION,
    model_name="distilbert-base-uncased",
)
baseline_df["article_vector"] = generator.generate_embeddings(text_col=baseline_df["article"])
baseline_df["summary_vector"] = generator.generate_embeddings(text_col=baseline_df["summary"])
recent_df["article_vector"] = generator.generate_embeddings(text_col=recent_df["article"])
recent_df["summary_vector"] = generator.generate_embeddings(text_col=recent_df["summary"])

## Launch Phoenix

Define a schema to tell Phoenix what the columns of your dataframe represent (tags, prompts, responses, etc.). See the [docs](https://docs.arize.com/phoenix/) for guides on how to define your own schema and for the API reference on `phoenix.Schema` and `phoenix.EmbeddingColumnNames`.

In [None]:
schema = px.Schema(
    timestamp_column_name="prediction_timestamp",
    tag_column_names=[
        "rougeL_score",
        "reference_summary",
    ],
    prompt_column_names=px.EmbeddingColumnNames(
        vector_column_name="article_vector", raw_data_column_name="article"
    ),
    response_column_names=px.EmbeddingColumnNames(
        vector_column_name="summary_vector", raw_data_column_name="summary"
    ),
)
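
Because the schema refers to dataframe columns by name, a misspelled name is an easy mistake to make. The sketch below is an optional sanity check, not part of Phoenix: `missing_columns` is a hypothetical helper, and `required_columns` simply repeats the names used in the schema above.

```python
from typing import Iterable, List


def missing_columns(required: Iterable[str], available: Iterable[str]) -> List[str]:
    """Return the required column names that are absent, in the order given."""
    available_set = set(available)
    return [name for name in required if name not in available_set]


# Column names referenced by the schema in this tutorial.
required_columns = [
    "prediction_timestamp",
    "rougeL_score",
    "reference_summary",
    "article_vector",
    "article",
    "summary_vector",
    "summary",
]
```

For example, `missing_columns(required_columns, baseline_df.columns)` should return an empty list once the ROUGE-L scores and embeddings have been computed.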

Create Phoenix datasets that wrap your dataframes with schemas that describe them.

In [None]:
baseline_ds = px.Dataset(dataframe=baseline_df, schema=schema, name="baseline")
recent_ds = px.Dataset(dataframe=recent_df, schema=schema, name="recent")

Launch Phoenix. Follow the instructions in the cell output to open the Phoenix UI in the notebook or in a new browser tab.

In [None]:
session = px.launch_app(primary=recent_ds, reference=baseline_ds)

## Find the Root Cause of Your Model Performance Issue

Use Phoenix to find the root cause of your LLM's performance issue.

Click on "article_vector" to go to the embeddings view for your prompts (the input news articles).

![click on article vector](http://storage.googleapis.com/arize-assets/phoenix/assets/docs/notebooks/llm_summarization/click_on_article_vector.png)

Select a period of high drift.

![select period of high drift](http://storage.googleapis.com/arize-assets/phoenix/assets/docs/notebooks/llm_summarization/select_period_of_high_drift.png)
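
As a rough mental model of the drift curve, assume drift is measured as the Euclidean distance between the centroid of the recent (primary) embeddings and the centroid of the baseline (reference) embeddings. The sketch below illustrates that idea with plain Python lists; `centroid` and `centroid_distance` are hypothetical helpers, not Phoenix functions.

```python
import math
from typing import List, Sequence


def centroid(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Return the component-wise mean of a non-empty collection of vectors."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]


def centroid_distance(
    primary: Sequence[Sequence[float]],
    reference: Sequence[Sequence[float]],
) -> float:
    """Euclidean distance between the centroids of two sets of embeddings."""
    return math.dist(centroid(primary), centroid(reference))
```

Under this assumption, a spike in the curve means that the articles in that time window sit, on average, far from the baseline articles in embedding space.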

Color your data by the "rougeL_score" dimension. The problematic clusters, which have low ROUGE-L scores, appear in blue; the well-performing clusters, which have high ROUGE-L scores, appear in green.

![color by rouge score](http://storage.googleapis.com/arize-assets/phoenix/assets/docs/notebooks/llm_summarization/color_by_rouge_score.png)

Use the lasso to select part of your data and inspect the prompt-response pairs.

![select points with lasso](http://storage.googleapis.com/arize-assets/phoenix/assets/docs/notebooks/llm_summarization/select_points_with_lasso.png)

Select each cluster in the left panel and look at the prompt-response pairs. Notice that the LLM is doing a good job summarizing the English articles in the green cluster (high ROUGE-L score), but is struggling to summarize Dutch articles in the blue cluster (low ROUGE-L score).

![select clusters](http://storage.googleapis.com/arize-assets/phoenix/assets/docs/notebooks/llm_summarization/select_clusters.png)

Congrats! You've discovered that your LLM is struggling to summarize Dutch news articles. You should modify your prompt template to see if you can improve your ROUGE-L score for Dutch articles.
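
One way to start experimenting is to keep the original template alongside a candidate variant and build prompts from both. In the sketch below, `ORIGINAL_TEMPLATE` is the template quoted earlier in this tutorial, while `CANDIDATE_TEMPLATE` and `build_prompt` are hypothetical illustrations; whether the candidate wording actually improves ROUGE-L on Dutch articles is something you would need to measure.

```python
# Template used to produce the summaries in this tutorial's dataset.
ORIGINAL_TEMPLATE = "Please summarize the following document in English: {article}"

# Hypothetical variant that tells the model the input may not be in English.
CANDIDATE_TEMPLATE = (
    "The following document may be written in a language other than English. "
    "Please summarize it in English: {article}"
)


def build_prompt(template: str, article: str) -> str:
    """Fill the {article} placeholder of a prompt template."""
    return template.format(article=article)
```

You could then regenerate summaries for the low-scoring cluster with each template and compare the resulting ROUGE-L scores.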