<img src="https://storage.googleapis.com/arize-assets/arize-logo-white.jpg" width="200"/>

# <center>Getting Started with the Arize Platform</center>

## <center>Prompt Engineering and Retrieval Workflows (LLM Observability) With Token Generation</center>
This guide demonstrates how to use Arize to monitor and debug LLM applications that rely on retrieval-augmented generation (RAG) and prompt engineering, and how to generate the token counts you need to manage your LLM usage. We will use data from a chatbot built on top of the Arize docs (https://docs.arize.com/arize/), containing example queries and retrieved text, to evaluate how well the RAG system is working.

# Step 0. Install Dependencies, Import Libraries 📚

In [None]:
!pip -q install arize

In [None]:
import uuid
import json
import pandas as pd
from arize.pandas.logger import Client
from arize.utils.types import (
    Environments,
    ModelTypes,
    EmbeddingColumnNames,
    Schema,
    PromptTemplateColumnNames,
    LLMConfigColumnNames,
    LLMRunMetadataColumnNames,
    CorpusSchema,
)
from datetime import datetime

# Step 1. Download the data
The data contains queries, retrieved context (from a corpus) used to augment generations, LLM responses, and metadata. We will inspect this data further in the Arize platform to understand the relationship between responses and corpus documents, along with the metadata.

In [None]:
data_url = (
    "https://storage.googleapis.com/arize-assets/fixtures/Embeddings/"
    "arize-demo-models-data/GENERATIVE/prompt-response/"
)
prod_df = pd.read_parquet(
    data_url + "df_queries_for_token_count_generation.parquet"
)

# Step 2. Prepare Your Data

## Add prediction ids

The Arize platform uses prediction IDs to link a prediction to an actual. Visit the [Arize documentation](https://docs.arize.com/arize/data-ingestion/model-schema/5.-prediction-id?q=prediction_id) for more details.

You can generate prediction IDs as follows:


In [None]:
def add_prediction_id(df):
    return [str(uuid.uuid4()) for _ in range(df.shape[0])]


prod_df["prediction_id"] = add_prediction_id(prod_df)
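
As a quick sanity check, the sketch below confirms that IDs generated this way are unique strings. It is a standalone illustration using only the standard library; `make_prediction_ids` is a hypothetical helper name and not part of the Arize SDK.

```python
import uuid


def make_prediction_ids(n_rows):
    """Return one random UUID4 string per row."""
    return [str(uuid.uuid4()) for _ in range(n_rows)]


ids = make_prediction_ids(1000)

# Every ID should be a string, and none should repeat.
all_strings = all(isinstance(i, str) for i in ids)
all_unique = len(set(ids)) == len(ids)
print(all_strings, all_unique)
```

On the actual dataframe, `prod_df["prediction_id"].is_unique` performs the same uniqueness check.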

## Update the timestamps

The data that you are working with was constructed in August of 2023. Hence, we will update the timestamps so they are current at the time that you're sending data to Arize.



In [None]:
last_ts = max(prod_df["prediction_ts"])
now_ts = datetime.timestamp(datetime.now())
delta_ts = now_ts - last_ts

prod_df["prediction_ts"] = (prod_df["prediction_ts"] + delta_ts).astype(float)
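
To make the effect of the shift concrete, here is a minimal sketch on a few made-up timestamps (in seconds). It assumes the same logic as the cell above: the most recent record lands at the current time and the spacing between records is preserved. `shift_to_now` and the toy values are illustrative only.

```python
from datetime import datetime, timezone


def shift_to_now(timestamps, now_ts):
    """Shift all timestamps by the same offset so the latest one equals now_ts."""
    delta = now_ts - max(timestamps)
    return [ts + delta for ts in timestamps]


# Made-up timestamps: a start time, one hour later, and one day later.
toy_ts = [1_690_000_000, 1_690_003_600, 1_690_086_400]
now_ts = int(datetime.now(timezone.utc).timestamp())

shifted = shift_to_now(toy_ts, now_ts)

# The gaps between records are unchanged by the shift.
gaps_before = [b - a for a, b in zip(toy_ts, toy_ts[1:])]
gaps_after = [b - a for a, b in zip(shifted, shifted[1:])]
print(gaps_before == gaps_after)
```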

# Step 3. Generate Token Counts
Arize supports tracking fields that designate LLM token usage as part of the Arize schema. Token counts are important for understanding how efficiently you are using LLMs and the total cost you are incurring from your usage. Defining these fields in Arize will allow you to create a default LLM dashboard from the Dashboard view.

To generate token counts, we will be using [tiktoken](https://github.com/openai/tiktoken), a byte-pair encoding tokeniser library.

In [None]:
!pip install tiktoken

In [None]:
import tiktoken


def num_tokens_from_string(string: str, encoding_name: str) -> int:
    """Returns the number of tokens in a text string."""
    encoding = tiktoken.get_encoding(encoding_name)
    num_tokens = len(encoding.encode(string))
    return num_tokens


encoding_name = "cl100k_base"  # or any other encoding you want to use

prod_df["prompt_token_count"] = prod_df["prompt_text"].apply(
    lambda x: num_tokens_from_string(x, encoding_name)
)
prod_df["response_token_count"] = prod_df["response_text"].apply(
    lambda x: num_tokens_from_string(x, encoding_name)
)
prod_df["total_token_count"] = (
    prod_df["response_token_count"] + prod_df["prompt_token_count"]
)

In [None]:
prod_df.head()
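
Since token counts drive cost, a rough per-call cost estimate can be derived from them. The sketch below is only an illustration: the prices are hypothetical placeholders, `estimate_cost` is not an Arize or tiktoken function, and real pricing depends on your provider and model.

```python
def estimate_cost(
    prompt_tokens, response_tokens, prompt_price_per_1k, response_price_per_1k
):
    """Estimate the cost of one LLM call from its token counts."""
    prompt_cost = (prompt_tokens / 1000) * prompt_price_per_1k
    response_cost = (response_tokens / 1000) * response_price_per_1k
    return prompt_cost + response_cost


# Hypothetical prices in dollars per 1,000 tokens; replace with your provider's rates.
PROMPT_PRICE_PER_1K = 0.001
RESPONSE_PRICE_PER_1K = 0.002

example_cost = estimate_cost(1000, 500, PROMPT_PRICE_PER_1K, RESPONSE_PRICE_PER_1K)
print(f"{example_cost:.4f}")
```

The same function can be applied row by row to `prompt_token_count` and `response_token_count` with `prod_df.apply(..., axis=1)`.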

# Step 4. Sending Data into Arize 💫

## Set up Arize Client

In [None]:
SPACE_ID = "SPACE_ID"
API_KEY = "API_KEY"

# Fail fast if the placeholder credentials above were not replaced.
if SPACE_ID == "SPACE_ID" or API_KEY == "API_KEY":
    raise ValueError("❌ CHANGE SPACE_ID AND/OR API_KEY")

arize_client = Client(space_id=SPACE_ID, api_key=API_KEY)
model_id = "search-and-retrieval-prompt-template-debug-with-token-counts-demo"
model_version = "1.0"
model_type = ModelTypes.GENERATIVE_LLM

print("✅ Arize client setup done! Now you can start using Arize!")

## Define the Schema 

A Schema instance specifies the column names for corresponding data in the dataframe. Arize is built with flexibility in mind - the LLM Schema fields below are optional. The more data provided, the more targeted your debugging flows can be. Learn more about defining Schemas for LLM data [here](https://docs.arize.com/arize/model-types/large-language-models-llm).

For prompt and response pairs, Arize allows you to ingest optional prompt and response values directly by providing `prompt_column_names` and `response_column_names` as fields of the Schema. Each of them can be passed in one of two forms:
- a single string naming the column that contains the raw text
- an `EmbeddingColumnNames` object naming both the embedding vector column and the raw text column associated with that vector. Learn more about unstructured features [here](https://docs.arize.com/arize/sending-data/model-schema-reference#8.-embedding-features-unstructured).

In addition, in this tutorial you will send information about your prompt templates, the LLM used, and the hyperparameters used to configure it. Arize allows you to send this information by providing `prompt_template_column_names` and `llm_config_column_names`. We make use of the following classes:
* `PromptTemplateColumnNames` (optional): Groups together the prompt templates with their version.
    * `template_column_name`: Name of the column containing the prompt template in string format. Variables are written with double curly braces: `{{variable_name}}`.
    * `template_version_column_name`: Name of the column containing the version of the template used. This will allow you to filter by this field in the Arize platform.
* `LLMConfigColumnNames` (optional): Groups together the LLM used and the hyperparameters passed to it.
    * `model_column_name`: Name of the column containing the names of the LLMs used to produce responses to the prompts. Typical examples are `"gpt-3.5-turbo"` or `"gpt-4"`.
    * `params_column_name`: Name of the column containing the hyperparameters used to configure the LLM. The contents of the column must be a well-formatted JSON string. For example: `{"max_tokens": 500, "presence_penalty": 0.66, "temperature": 0.28}`

Learn more about Arize's prompt engineering workflows [here](https://docs.arize.com/arize/llm-large-language-models/prompt-engineering).
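
To illustrate the two formats described above, the following sketch serializes a set of hyperparameters to a JSON string (JSON requires double-quoted keys and strings) and fills a `{{variable_name}}` template. The template text, the variable names, and the `render_template` helper are hypothetical; this is not how Arize renders templates internally, only a way to see what the column values look like.

```python
import json
import re

# Hyperparameters serialized as a JSON string, then parsed back.
params = {"max_tokens": 500, "presence_penalty": 0.66, "temperature": 0.28}
params_json = json.dumps(params)
round_tripped = json.loads(params_json)


def render_template(template, variables):
    """Replace each {{name}} placeholder with the matching value."""
    return re.sub(
        r"\{\{(\w+)\}\}", lambda m: str(variables[m.group(1)]), template
    )


# Hypothetical template and variable names, for illustration only.
template = "Context: {{context}}\nQuestion: {{user_query}}"
rendered = render_template(
    template,
    {"context": "some retrieved text", "user_query": "an example question"},
)
print(params_json)
print(rendered)
```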

In [None]:
# Declare prompt and response columns
prompt_columns = EmbeddingColumnNames(
    vector_column_name="prompt_vector", data_column_name="prompt_text"
)

response_columns = "response_text"

In [None]:
# Declare the columns for the prompt template playground
prompt_template_columns = PromptTemplateColumnNames(
    template_column_name="prompt_template",
    template_version_column_name="prompt_template_name",
)
llm_config_columns = LLMConfigColumnNames(
    model_column_name="llm_config_model_name",
    params_column_name="llm_params",
)

Next, we add the token counts generated above to the Schema, together with the LLM response latency already present in the dataset, so that both are sent to Arize.

* `LLMRunMetadataColumnNames` (optional): Groups together LLM run metadata
    * `prompt_token_count_column_name`: Name of column containing the number of tokens used in the prompt that is sent to the LLM.
    * `response_token_count_column_name`: Name of column containing the number of tokens used in the LLM's response.
    * `total_token_count_column_name`: Name of column containing the total number of tokens used across the prompt and the response.
    * `response_latency_ms_column_name`: Name of column containing the time it took for the LLM to respond (latency) in milliseconds. 

In [None]:
llm_run_metadata_columns = LLMRunMetadataColumnNames(
    total_token_count_column_name="total_token_count",
    prompt_token_count_column_name="prompt_token_count",
    response_token_count_column_name="response_token_count",
    response_latency_ms_column_name="response_latency_ms",
)
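
Before sending these columns, it can be useful to verify that they are internally consistent. The sketch below assumes rows are available as dictionaries (for example via `prod_df.to_dict("records")`) and flags rows where the total does not equal prompt plus response, or where the latency is negative. `find_inconsistent_rows` and the toy records are illustrative, not part of the Arize SDK.

```python
def find_inconsistent_rows(records):
    """Return indices of rows whose token counts do not add up or whose latency is negative."""
    bad = []
    for i, row in enumerate(records):
        expected_total = row["prompt_token_count"] + row["response_token_count"]
        total_ok = row["total_token_count"] == expected_total
        latency_ok = row["response_latency_ms"] >= 0
        if not (total_ok and latency_ok):
            bad.append(i)
    return bad


# Made-up records; the second one has a total that does not match.
toy_records = [
    {
        "prompt_token_count": 120,
        "response_token_count": 30,
        "total_token_count": 150,
        "response_latency_ms": 850.0,
    },
    {
        "prompt_token_count": 200,
        "response_token_count": 50,
        "total_token_count": 240,
        "response_latency_ms": 910.0,
    },
]

bad_rows = find_inconsistent_rows(toy_records)
print(bad_rows)
```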

Now, let's finalize the Schema. Learn more about the other Schema fields [here](https://docs.arize.com/arize/sending-data-guides/model-schema-reference).

In [None]:
tag_columns = [
    "cost_per_call",
    "euclidean_distance_0",
    "euclidean_distance_1",
    "instruction",
    "openai_precision_1",
    "openai_precision_2",
    "openai_relevance_0",
    "openai_relevance_1",
    "prompt_template",
    "prompt_template_name",
    "retrieval_text_0",
    "retrieval_text_1",
    "text_similarity_0",
    "text_similarity_1",
    "user_query",
    "is_hallucination",
    "llm_config_model_name",
]

prod_schema = Schema(
    prediction_id_column_name="prediction_id",
    timestamp_column_name="prediction_ts",
    prediction_label_column_name="pred_label",
    actual_label_column_name="user_feedback",
    tag_column_names=tag_columns,
    prompt_column_names=prompt_columns,
    response_column_names=response_columns,
    prompt_template_column_names=prompt_template_columns,
    llm_config_column_names=llm_config_columns,
    llm_run_metadata_column_names=llm_run_metadata_columns,
)
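
A Schema only names columns, so a typo in a column name typically surfaces when logging. As a hedged pre-flight check, the sketch below compares a list of required column names against the columns that are actually available. `missing_columns` is a hypothetical helper and the toy column lists are made up.

```python
def missing_columns(available, required):
    """Return the required column names that are absent, in their original order."""
    available = set(available)
    return [name for name in required if name not in available]


# Made-up example: two of the required columns are not present.
toy_columns = ["prediction_id", "prediction_ts", "prompt_text", "response_text"]
toy_required = ["prediction_id", "prediction_ts", "pred_label", "user_feedback"]

missing = missing_columns(toy_columns, toy_required)
print(missing)
```

With the real data, this could be called as `missing_columns(prod_df.columns, tag_columns)` and should return an empty list before you log.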

## Send Production Data
Using the production dataset dataframe we prepared and the Schema we just defined, send the data into Arize.

In [None]:
# Parquet files do not support maps with list and non-list values, so they are instead stored as valid json strings.
# Convert these llm params that are stored as valid json strings into python dictionaries.
prod_df["llm_params"] = prod_df["llm_params"].apply(lambda x: json.loads(x))

response = arize_client.log(
    dataframe=prod_df,
    schema=prod_schema,
    model_id=model_id,
    model_version=model_version,
    model_type=model_type,
    environment=Environments.PRODUCTION,
)

if response.status_code == 200:
    print(f"✅ Successfully logged data for model {model_id} to Arize!")
else:
    print(
        f'❌ Logging failed with status code {response.status_code} and message "{response.text}"'
    )
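
Note that `json.loads` raises a `TypeError` when given a dictionary, so re-running the conversion above on an already-converted column fails. A minimal sketch of a conversion that is safe to re-run is shown below; `parse_llm_params` is a hypothetical helper name.

```python
import json


def parse_llm_params(value):
    """Return a dict whether `value` is a JSON string or already a dict."""
    if isinstance(value, dict):
        return value
    return json.loads(value)


first_pass = parse_llm_params('{"max_tokens": 500, "temperature": 0.28}')
second_pass = parse_llm_params(first_pass)  # already a dict, returned unchanged
print(first_pass == second_pass)
```

It can replace the lambda in the cell above: `prod_df["llm_params"].apply(parse_llm_params)`.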

## Send Corpus Data
The "corpus" dataset contains references to retrieved "documents" which were used to augment our LLM's response. First, let's download the corpus dataset and take a quick peek at the data.

In [None]:
corpus_df = pd.read_parquet(data_url + "df_corpus_docs.parquet")
corpus_df.head()

Next, define the Schema required for sending in corpus data. The following fields will be needed:
*   `document_id_column_name` - This maps to the column in the corpus dataframe containing the IDs that will be referenced by the `retrieved_document_ids` from the production dataframe.
*   `document_text_embedding_column_names` - The embedding column names for the Corpus document
*   `document_version_column_name` - The column name for the document version

In [None]:
corpus_schema = CorpusSchema(
    document_id_column_name="document_id",
    document_text_embedding_column_names=EmbeddingColumnNames(
        vector_column_name="text_vector",
        data_column_name="text",
    ),
    document_version_column_name="document_version",
)
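
Because production records reference corpus documents by ID, it is worth checking that every retrieved ID actually exists in the corpus. The sketch below works on plain lists with made-up IDs; it assumes each production row carries a list of retrieved document IDs, and `unknown_document_ids` is a hypothetical helper.

```python
def unknown_document_ids(retrieved_id_lists, corpus_ids):
    """Return the sorted retrieved IDs that do not appear in the corpus."""
    known = set(corpus_ids)
    unknown = set()
    for ids in retrieved_id_lists:
        unknown.update(i for i in ids if i not in known)
    return sorted(unknown)


# Made-up IDs: document 7 is retrieved but missing from the corpus.
toy_corpus_ids = [0, 1, 2, 3]
toy_retrieved = [[0, 2], [3, 7], [1]]

dangling = unknown_document_ids(toy_retrieved, toy_corpus_ids)
print(dangling)
```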

Now, let's send the corpus data to Arize.

In [None]:
response = arize_client.log(
    dataframe=corpus_df,
    schema=corpus_schema,
    model_id=model_id,
    model_type=model_type,
    model_version=model_version,
    environment=Environments.CORPUS,
)

if response.status_code == 200:
    print(f"✅ Successfully logged data for model {model_id} to Arize!")
else:
    print(
        f'❌ Logging failed with status code {response.status_code} and message "{response.text}"'
    )