<img src="https://storage.googleapis.com/arize-assets/arize-logo-white.jpg" width="200"/>

# <center>Getting Started with the Arize Platform</center>
## <center>Investigating Text Summarization (LLM Observability)</center>

**In this walkthrough, we are going to ingest data from a Large Language Model (LLM) performing text summarization.** 

In this scenario, you are in charge of maintaining a summarization model for a media company. Your model, MediaGPT, is tasked with condensing daily news into concise summaries. However, after the model is released into production, you notice that its performance and behavior change over time, and your readers around the world start to provide negative feedback.


This notebook shows how Arize can automatically surface the reason for this performance degradation, and help you troubleshoot it, by analyzing the _prompt-response pairs_ associated with each document to be summarized, so that you can take the right action to fine-tune your model. In this example, some documents are written in a different language, and the documents cover a variety of specific news topics.

It is worth noting that, according to our research, inspecting embedding drift can surface problems with your data before they cause performance degradation.

In this tutorial, we will start from scratch. We will:
* Download the LLM Data we have curated for this tutorial
* Compute generative text performance metrics with the Arize SDK
* Automatically generate embeddings using the Arize SDK
* Log the inferences into Arize
* Visually explore embeddings in the Arize Platform

Let's get started! 

# Step 0. Install Dependencies, Import Libraries, Use GPU 📚

To use the automatic embedding generation and LLM evaluation functionality of the Arize SDK, we need
to install the extras `[AutoEmbeddings]` and `[NLP_Metrics]`. Note that LLM evaluation is used to
compute evaluation metrics. If your dataset already contains metrics or embeddings,
you do not need the extra packages.

⚠️ Use a GPU to save time generating embeddings. Click on 'Runtime', select 'Change Runtime Type' and
select 'GPU'. 

In [None]:
!pip install -q 'arize[AutoEmbeddings, NLP_Metrics]' 

from datetime import datetime, timedelta
import uuid
import pandas as pd

from arize.pandas.logger import Client
from arize.utils.types import EmbeddingColumnNames, Environments, Metrics, ModelTypes, Schema

# Step 1. Download the data

For this tutorial, we will be using the CNN / Daily Mail dataset, which is commonly used as a benchmark for text summarization models. Let's download and display the data we have available. Inside the dataset, we have:

*   **document:** news article to be summarized
*   **summary:** AI-generated summary 
*   **reference_summary:** reference summary written by domain experts
*   **user_feedback:** thumbs down (0) or thumbs up (1) for summary feedback 
*   **prompt_template:** Template used to perform summarization task given a document

In [None]:
# Download tutorial dataset
df = pd.read_json("https://storage.googleapis.com/arize-assets/fixtures/Embeddings/arize-demo-models-data/GENERATIVE/summarization/generative_llm_demo_summarization.json")
df.head()

# Step 2. Compute Generative Text Performance Metrics with Arize

Compute SacreBLEU Score and ROUGE Score for each generated summary using the reference summary. Additional information on text-based generative AI metrics can be found [here.](https://arize.com/blog-course/generative-ai-metrics-bleu-score/)
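
To build intuition for what a ROUGE-L score measures, here is a minimal pure-Python sketch. It assumes lowercase whitespace tokenization and the plain F1 form of the metric, so its values may differ from those returned by the Arize SDK; the helper names `lcs_length` and `rouge_l_f1` are illustrative and not part of any library.

```python
from typing import List


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    # Classic dynamic programming, keeping only the previous row.
    previous = [0] * (len(b) + 1)
    for token_a in a:
        current = [0]
        for j, token_b in enumerate(b, start=1):
            if token_a == token_b:
                current.append(previous[j - 1] + 1)
            else:
                current.append(max(previous[j], current[j - 1]))
        previous = current
    return previous[-1]


def rouge_l_f1(candidate: str, reference: str) -> float:
    """Simplified ROUGE-L F1 over lowercase whitespace tokens (an assumption)."""
    cand_tokens = candidate.lower().split()
    ref_tokens = reference.lower().split()
    if not cand_tokens or not ref_tokens:
        return 0.0
    lcs = lcs_length(cand_tokens, ref_tokens)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand_tokens)  # share of the summary found in the reference
    recall = lcs / len(ref_tokens)      # share of the reference covered by the summary
    return 2 * precision * recall / (precision + recall)


score = rouge_l_f1("the cat sat on the mat", "the cat is on the mat")
print(f"{score:.3f}")  # 0.833
```

The longest common subsequence here is "the cat on the mat" (5 of 6 tokens on each side), so precision, recall, and F1 all equal 5/6. A generated summary that shares long in-order word sequences with its reference scores close to 1.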

In [None]:
from arize.pandas.generative.nlp_metrics import sacre_bleu, rouge

In [None]:
df['sacreBLEU_score'] = sacre_bleu( 
    response_col=df["summary"], 
    references_col=df["reference_summary"]
)
rouge_scores = rouge(
    response_col=df["summary"],
    references_col=df["reference_summary"],
    rouge_types=["rougeL"]
)
for rouge_type, scores in rouge_scores.items():
    df[f"{rouge_type}_score"] = scores

In [None]:
df.head()

# Step 3. Generate embedding vectors using Arize

Arize lets you generate embeddings seamlessly using large pre-trained models. In this scenario, we will use the pre-trained DistilBERT language model, `distilbert-base-uncased`.

**NOTE: We recommend utilizing GPUs to optimize embedding generation. In Google Colaboratory, navigate to the 'Runtime' menu and select 'Change runtime type'. If you are interested in accessing even more powerful GPUs, upgrade to Colab Pro for enhanced speed and performance.** 

The language models that Arize's embedding generators use have already been trained on such a large amount of data that the embeddings can capture relevant structure in your data without being fine-tuned.

The first step is to import `EmbeddingGenerator` and `UseCases`.

In [None]:
from arize.pandas.embeddings import EmbeddingGenerator, UseCases

Next, we define our generator, choosing the model `distilbert-base-uncased`.

You can also set the `batch_size`. This allows you to process the data in smaller batches if you are running out of resources. The default `batch_size` is 100.
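
As a rough illustration of what batching means here, the sketch below splits a list of texts into chunks of at most `batch_size` items, so that only one chunk needs to be held in memory at a time. This is a conceptual sketch, not the Arize SDK's internal implementation, and `iter_batches` is a hypothetical helper name.

```python
from typing import Iterator, List, Sequence


def iter_batches(items: Sequence[str], batch_size: int = 100) -> Iterator[List[str]]:
    """Yield consecutive chunks of at most `batch_size` items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be a positive integer")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


# 250 placeholder documents processed 100 at a time
documents = [f"article {i}" for i in range(250)]
batch_sizes = [len(batch) for batch in iter_batches(documents, batch_size=100)]
print(batch_sizes)  # [100, 100, 50]
```

A smaller `batch_size` lowers peak memory use at the cost of more passes through the model, which is why reducing it can help when you run out of resources.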

Arize then downloads the models and tokenizers from the 🤗 HuggingFace Hub.

In [None]:
generator = EmbeddingGenerator.from_use_case(
    use_case=UseCases.NLP.SUMMARIZATION,
    model_name="distilbert-base-uncased",
    tokenizer_max_length=512,
    batch_size=100
)

To generate the embeddings, we pass the dataframe column that contains the text to embed. We do this twice: once for the documents and once for the generated summaries.

In [None]:
df["document_vector"] = generator.generate_embeddings(text_col=df["document"])
df["summary_vector"] = generator.generate_embeddings(text_col=df["summary"])

# Step 4. Prepare your data to be sent to Arize

## Add prediction ids

The Arize platform uses prediction IDs to link a prediction to an actual. Visit the [Arize documentation](https://docs.arize.com/arize/data-ingestion/model-schema/5.-prediction-id?q=prediction_id) for more details.

You can generate prediction IDs as follows:

In [None]:
def add_prediction_id(df):
    return [str(uuid.uuid4()) for _ in range(df.shape[0])]
df['prediction_id'] = add_prediction_id(df)

## Create timestamps
In order to measure drift and performance over time, it is good to have timestamps for each prompt-response pair. We generate a synthetic timestamp for each record within the last 15 days. 

In [None]:
now_dt = datetime.now()
start_dt = now_dt - timedelta(days=15)

In [None]:
df["prediction_ts"] = pd.date_range(
    start=start_dt,
    end=now_dt,
    periods=len(df),
)

# Step 5. Sending Data into Arize 💫


Now that we have our data configured, we are ready to log our dataset into Arize. There, the data will be easily visualized and investigated.

For our model, we are going to log:

*   Prompt text and embeddings
*   Generated summary and embeddings
*   Reference text for the summary
*   ROUGE and SacreBLEU scores we computed
*   Prompt templates

## Import and Setup Arize Client

The first step is to set up our Arize client. After that, we will log the data.

First, use your Arize account credentials to log in. Then, retrieve the Arize `API_KEY` and `SPACE_KEY` from your Space Settings page, shown below, and copy them into the set-up cell. We will also be setting up some metadata to use across all logging.




<img src="https://storage.googleapis.com/arize-assets/fixtures/copy-keys.png" width="700">

In [None]:
SPACE_KEY = "YOUR_SPACE_KEY"
API_KEY = "YOUR_API_KEY"
# Fail fast if the placeholder credentials have not been replaced
if SPACE_KEY == "YOUR_SPACE_KEY" or API_KEY == "YOUR_API_KEY":
    raise ValueError("❌ CHANGE SPACE AND API KEYS")
arize_client = Client(space_key=SPACE_KEY, api_key=API_KEY)
print("✅ Arize client setup done! Now you can start using Arize!")
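
If you prefer not to paste keys directly into the notebook, one option is to read them from environment variables. The sketch below is only an illustration: the variable names `ARIZE_SPACE_KEY` and `ARIZE_API_KEY` and the helper `load_arize_keys` are hypothetical choices for this example, not names defined by the Arize SDK.

```python
import os
from typing import Mapping, Tuple

PLACEHOLDERS = {"YOUR_SPACE_KEY", "YOUR_API_KEY"}


def load_arize_keys(env: Mapping[str, str] = os.environ) -> Tuple[str, str]:
    """Read both keys from a mapping; the variable names are hypothetical."""
    space_key = env.get("ARIZE_SPACE_KEY", "")
    api_key = env.get("ARIZE_API_KEY", "")
    for name, value in (("ARIZE_SPACE_KEY", space_key), ("ARIZE_API_KEY", api_key)):
        if not value or value in PLACEHOLDERS:
            raise ValueError(f"{name} is missing or still a placeholder")
    return space_key, api_key


# Demonstration with an in-memory mapping instead of the real environment
demo_env = {"ARIZE_SPACE_KEY": "space-123", "ARIZE_API_KEY": "api-456"}
space_key, api_key = load_arize_keys(demo_env)
print("Keys loaded")
```

Calling `load_arize_keys()` with no argument reads from the real process environment, and the returned values can then be passed to `Client(space_key=..., api_key=...)` as in the cell above.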

## Define the Schema 

A Schema instance specifies the column names for corresponding data in the dataframe. 

To ingest non-embedding features, it suffices to provide a list of column names that contain the features in our dataframe. Prompt and response pairs, however, are a little bit different since embedding vectors need to be logged into the platform.

Arize allows you to ingest prompt and response pairs directly by providing `prompt_column_names` and `response_column_names` as fields of the Schema. You ingest not only the embedding vector but the raw data associated with that embedding. Therefore, up to 2 columns can be associated with the prompt or response objects:
* Embedding `vector` (required)
* Embedding `data` (optional, but recommended): raw text associated with the embedding vector

Learn more [here](https://docs.arize.com/arize/sending-data/model-schema-reference#8.-embedding-features-unstructured).
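
Before building the Schema, it can help to confirm that every column it will reference exists and that all embedding vectors share the same dimension. The following is a minimal sketch of such a check; `missing_columns` and `inconsistent_vector_rows` are hypothetical helpers written for this example, not Arize SDK functions, and the sample values are made up.

```python
from typing import Iterable, List, Sequence


def missing_columns(available: Iterable[str], required: Iterable[str]) -> List[str]:
    """Return the required column names that are not available."""
    available_set = set(available)
    return [name for name in required if name not in available_set]


def inconsistent_vector_rows(vectors: Sequence[Sequence[float]]) -> List[int]:
    """Return indices of vectors whose length differs from the first vector."""
    if not vectors:
        return []
    expected = len(vectors[0])
    return [i for i, vector in enumerate(vectors) if len(vector) != expected]


# Made-up example: the timestamp column has not been created yet
available = ["document", "document_vector", "summary", "summary_vector", "prediction_id"]
required = ["prediction_id", "prediction_ts", "document_vector", "summary_vector"]
missing = missing_columns(available, required)
print(missing)  # ['prediction_ts']

# Made-up example: the third vector is shorter than the others
vectors = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6], [0.7, 0.8]]
bad_rows = inconsistent_vector_rows(vectors)
print(bad_rows)  # [2]
```

With the tutorial dataframe, you could call `missing_columns(df.columns, [...])` with the column names you plan to put in the Schema, and `inconsistent_vector_rows(df["document_vector"].tolist())` for each embedding column.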


In [None]:
# Declare prompt and response columns
prompt_columns=EmbeddingColumnNames(
    vector_column_name="document_vector",
    data_column_name="document"
)

response_columns=EmbeddingColumnNames(
    vector_column_name="summary_vector",
    data_column_name="summary"
)

In [None]:
schema = Schema(
    prediction_id_column_name="prediction_id",
    timestamp_column_name="prediction_ts",
    tag_column_names=["sacreBLEU_score", "rougeL_score","prompt_template", "user_feedback"],
    prompt_column_names=prompt_columns,
    response_column_names=response_columns,
)

## Log LLM Data into Arize

In [None]:
response = arize_client.log(
    dataframe=df,
    schema=schema,
    model_id="demo-generative-ai-text-summarization-tutorial",
    model_version="1.0",
    model_type=ModelTypes.GENERATIVE_LLM,
    environment=Environments.PRODUCTION
)
if response.status_code == 200:
    print("✅ Successfully logged data to Arize!")
else:
    print(
        f'❌ Logging failed with status code {response.status_code} and message "{response.text}"'
    )

# Step 6. Confirm Data in Arize ✅
Note that the Arize platform takes about 15 minutes to index embedding data. While the model should appear immediately, the data will not show up until the indexing is complete. Feel free to head over to the **Data Ingestion** tab for your model to watch Arize work its magic!🔮

You will be able to see the predictions and actuals that have been sent in the last 30 minutes, last day or last week.

An example view of the Data Ingestion tab from a model, when data is sent continuously over 30 minutes, is shown in the image below.

<img src="https://storage.googleapis.com/arize-assets/fixtures/data-ingestion-tab.png" width="700">

## Check the Embedding Data in Arize
Now, you can see how Arize surfaces the different prompt-response pairs and their embeddings and troubleshoots the degradation in performance, saving you time and effort.

Click on the Embeddings tab to see how your embedding data is drifting over time. The picture below shows the global Euclidean distance between your production set (at different points in time) and the baseline (which we set to be 2 weeks of production data, delayed by 3 days). The distance is remarkably higher towards the end of the graph, which shows that the prompts and responses recently logged into Arize differ qualitatively in content from those of 2 weeks ago. Let's try to find out why the new prompt and response pairs are different using Arize!
 
<img src="https://storage.googleapis.com/arize-assets/fixtures/Embeddings/GENERATIVE/High Drift Summarization.png" width="900">
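
To make the idea of a global Euclidean distance concrete, here is a minimal sketch that measures the distance between the average (centroid) embedding of a baseline set and that of a production set. Treat the centroid-based definition as an assumption made for illustration rather than a description of the platform's internals; the helper names and the toy 2-dimensional vectors are made up.

```python
import math
from typing import List, Sequence


def centroid(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Component-wise mean of a non-empty collection of equal-length vectors."""
    if not vectors:
        raise ValueError("at least one vector is required")
    count = len(vectors)
    return [sum(component) / count for component in zip(*vectors)]


def centroid_distance(
    baseline: Sequence[Sequence[float]],
    production: Sequence[Sequence[float]],
) -> float:
    """Euclidean distance between the centroids of two embedding sets."""
    return math.dist(centroid(baseline), centroid(production))


# Toy 2-dimensional "embeddings"
baseline = [[0.0, 0.0], [2.0, 0.0]]    # centroid (1, 0)
production = [[1.0, 3.0], [1.0, 5.0]]  # centroid (1, 4)
drift = centroid_distance(baseline, production)
print(drift)  # 4.0
```

When production embeddings come from the same kind of content as the baseline, the two centroids stay close and the distance is small. A new group of points far from the baseline, such as documents of a kind not seen before, pulls the production centroid away and raises the distance.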

In addition to the drift tracking plot above, below you can find the UMAP visualization of your data for the selected point in time. Notice that most of the production data and the baseline data are superimposed, which indicates that much of what the model sees in production is similar to the data it saw 2 weeks ago. However, the UMAP also shows several clusters, and one cluster, separated from the main body of points, appears only in the production dataset and has a lower ROUGE score. (The points are currently colored by ROUGE score, as shown in the left panel.)

<img src="https://storage.googleapis.com/arize-assets/fixtures/Embeddings/GENERATIVE/UMAP 1.png" width="900">

For further inspection, you may select the 3D UMAP view and click _Explore UMAP_ to expand the view. With this view, we can interact with our dataset in 3D: we can zoom, rotate, and drag to focus on the areas of the dataset that are most interesting to us. Let's try to analyze why there is a new cluster of data that has a lower summarization score:

<img src="https://storage.googleapis.com/arize-assets/fixtures/Embeddings/GENERATIVE/UMAP 2.png" width="900">

As you can see, the documents in this new cluster are written in a different language, Dutch, which does not appear in our baseline dataset (the data from 2 weeks ago). We have now found the cause of the big jump in embedding drift for our LLM. We can export this problematic cluster by clicking _Download Cluster_ on the right-hand side and then fine-tune our model with Dutch news data.

In the display above, Arize offers many coloring options:
1. **By Dataset:** The points are colored to distinguish production data from baseline data. This is especially useful for detecting drift. In this example, we can see that some recent production data lies far away from any older data, which indicates severe dataset drift. We can identify exactly which data points our baseline is missing so that we can re-train effectively.
2. **By Performance Metric:** This coloring option gives insight into how well our model is generating summaries. By clicking on Tags and selecting one of the performance metrics, you can see which clusters perform worse than the others.
3. **By Feature:** You can identify areas of the space where your model might be underperforming and, by coloring the points by feature, identify patterns at the feature level. In other words, you can identify a slice of your data sharing a common feature (or features) that is causing a problem.

More coloring options will be added to help you understand and debug your model and dataset.

# Wrap Up 🎁
Congratulations, you've now sent your first Large Language Model data to the Arize platform!

Additionally, if you want to remove this example model from your account, just click **Models** -> **demo-generative-ai-text-summarization-tutorial** -> **config** -> **delete**