<img src="https://storage.googleapis.com/arize-assets/arize-logo-white.jpg" width="200"/>

# <center>Getting Started with the Arize Platform</center>
## <center>Optimized Prompt Engineering Workflows (LLM Observability)</center>

# Step 0. Install Dependencies, Import Libraries 📚

In [None]:
!pip install 'arize==7.5.0rc1'

import uuid
import pandas as pd
from arize.pandas.logger import Client
from arize.utils.types import (
    Environments,
    ModelTypes,
    EmbeddingColumnNames,
    Schema,
    PromptTemplateColumnNames,
    LLMConfigColumnNames
)
from datetime import datetime

# Step 1. Download the data

In [None]:
data_url = (
    "https://storage.googleapis.com/arize-assets/fixtures/Embeddings/"
    "arize-demo-models-data/GENERATIVE/prompt-response/"
)
prod_df = pd.read_parquet(data_url+"df_queries_08_25.parquet")
val_df = pd.read_parquet(data_url+"df_documents_08_25.parquet")

In [None]:
prod_df.head()

# Step 2. Prepare Your Data

## Add prediction ids

The Arize platform uses prediction IDs to link a prediction to an actual. Visit the [Arize documentation](https://docs.arize.com/arize/data-ingestion/model-schema/5.-prediction-id?q=prediction_id) for more details.

You can generate prediction IDs as follows:


In [None]:
def add_prediction_id(df):
    return [str(uuid.uuid4()) for _ in range(df.shape[0])]

prod_df['prediction_id'] = add_prediction_id(prod_df)
val_df['prediction_id'] = add_prediction_id(val_df)
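
Since each prediction ID is meant to identify exactly one row, it can be worth confirming uniqueness before uploading. The following is a minimal sketch using only the standard library; `make_prediction_ids` and `has_duplicates` are hypothetical helper names for illustration and are not part of the Arize SDK.

```python
import uuid


def make_prediction_ids(n):
    """Return n random UUID4 strings, one per row (hypothetical helper)."""
    return [str(uuid.uuid4()) for _ in range(n)]


def has_duplicates(ids):
    """Return True if any ID appears more than once."""
    ids = list(ids)
    return len(set(ids)) != len(ids)


ids = make_prediction_ids(5)
print(has_duplicates(ids))
print(has_duplicates(["a", "b", "a"]))
```

With the dataframes above, the same check could be run as `has_duplicates(prod_df['prediction_id'])`.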

## Update the timestamps

The data used here was constructed in August 2023. We therefore shift the production timestamps forward so that the most recent record falls at the moment you send the data to Arize. Every timestamp moves by the same offset, so the spacing between records is preserved.



In [None]:
last_ts = max(prod_df['prediction_ts'])
now_ts = datetime.timestamp(datetime.now())
delta_ts = now_ts - last_ts

prod_df['prediction_ts'] = (prod_df['prediction_ts'] + delta_ts).astype(float)
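
To see what the shift does, here is a small sketch on a plain Python list. The three timestamps are made-up epoch-second values standing in for `prod_df['prediction_ts']`; they are not taken from the dataset.

```python
from datetime import datetime

# Hypothetical epoch-second timestamps, one hour apart
old_ts = [1692921600.0, 1692925200.0, 1692928800.0]

now_ts = datetime.now().timestamp()
delta_ts = now_ts - max(old_ts)

# Shift every timestamp by the same offset
new_ts = [ts + delta_ts for ts in old_ts]

# Gaps between consecutive records, before and after the shift
gaps_before = [b - a for a, b in zip(old_ts, old_ts[1:])]
gaps_after = [b - a for a, b in zip(new_ts, new_ts[1:])]

print(gaps_before)
print([round(g, 3) for g in gaps_after])
```

The latest shifted timestamp lands at (approximately, up to floating-point rounding) the current time, and the one-hour gaps between records are unchanged.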

# Step 3. Sending Data into Arize 💫

## Set up Arize Client

In [None]:
SPACE_KEY = "YOUR_SPACE_KEY"
API_KEY = "YOUR_API_KEY"

# Fail fast if the placeholder keys were not replaced
if SPACE_KEY == "YOUR_SPACE_KEY" or API_KEY == "YOUR_API_KEY":
    raise ValueError("❌ CHANGE SPACE AND API KEYS")

arize_client = Client(space_key=SPACE_KEY, api_key=API_KEY)
model_id = "search-and-retrieval-prompt-template-debug-demo"
model_version = "1.0"
model_type = ModelTypes.GENERATIVE_LLM

print("✅ Arize client setup done! Now you can start using Arize!")

## Define the Schema 

A Schema instance specifies the column names for corresponding data in the dataframe. 

To ingest non-embedding features, it suffices to provide a list of the column names that contain those features in the dataframe. Prompt and response pairs are handled differently because their embedding vectors must also be logged to the platform.

Arize allows you to ingest prompt and response pairs directly by providing `prompt_column_names` and `response_column_names` as fields of the Schema. You ingest not only the embedding vector but also the raw data associated with that embedding. Therefore, up to two columns can be associated with each prompt or response object:
* Embedding `vector` (required)
* Embedding `data` (optional, but recommended): raw text associated with the embedding vector

Learn more about unstructured features [here](https://docs.arize.com/arize/sending-data/model-schema-reference#8.-embedding-features-unstructured).
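
The pairing of a vector with its raw text can be pictured with a small sketch. The rows below are invented for illustration, and `vector_dimensions` is a hypothetical helper. Checking that all vectors in a column share one length is a sanity check assumed to be useful here, not a documented Arize requirement.

```python
def vector_dimensions(vectors):
    """Return the set of distinct lengths found among the vectors."""
    return {len(v) for v in vectors}


# Hypothetical rows: each vector is stored next to the text it was computed from
rows = [
    {"prompt_text": "How do I log embeddings?", "prompt_vector": [0.1, 0.2, 0.3]},
    {"prompt_text": "What is a schema?", "prompt_vector": [0.4, 0.5, 0.6]},
]

dims = vector_dimensions(row["prompt_vector"] for row in rows)
print(dims)  # {3}
```

A result with more than one element would mean the column mixes vectors of different sizes.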

In addition, in this tutorial you will send information about your prompt templates, the LLM used, and the hyperparameters used to configure it. Arize allows you to send this information by providing `prompt_template_column_names` and `llm_config_column_names`. We make use of the following classes:
* `PromptTemplateColumnNames`: Groups together the prompt templates with their version.
    * `template_column_name`: Name of the column containing the prompt template in string format. Variables are written with double curly braces: `{{variable_name}}`.
    * `template_version_column_name`: Name of the column containing the version of the template used. This allows you to filter by this field in the Arize platform.
* `LLMConfigColumnNames`: Groups together the LLM used and the hyperparameters passed to it.
    * `model_column_name`: Name of the column containing the names of the LLMs used to produce responses to the prompts. Typical examples are `"gpt-3.5-turbo"` or `"gpt-4"`.
    * `params_column_name`: Name of the column containing the hyperparameters used to configure the LLM. The contents of the column must be a well-formatted JSON string. For example: `{"max_tokens": 500, "presence_penalty": 0.66, "temperature": 0.28}`
    
Learn more about Arize's prompt engineering workflows [here](https://docs.arize.com/arize/llm-large-language-models/prompt-engineering).
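
As a rough illustration of the two column formats described above, the sketch below fills a `{{variable_name}}` placeholder and serializes a hyperparameter dictionary with `json.dumps`. The `render_template` helper and the template text are hypothetical; how your application actually renders templates may differ.

```python
import json
import re


def render_template(template, variables):
    """Replace each {{name}} placeholder with its value (hypothetical helper)."""
    return re.sub(
        r"\{\{(\w+)\}\}",
        lambda match: str(variables[match.group(1)]),
        template,
    )


template = "Answer the question: {{user_query}}"
prompt = render_template(template, {"user_query": "What is drift?"})
print(prompt)  # Answer the question: What is drift?

# json.dumps emits double-quoted keys, which is what valid JSON requires
params = {"max_tokens": 500, "presence_penalty": 0.66, "temperature": 0.28}
llm_params = json.dumps(params)
print(llm_params)  # {"max_tokens": 500, "presence_penalty": 0.66, "temperature": 0.28}
```

Note that `str(params)` would produce single-quoted keys, which is a Python literal rather than JSON, so `json.dumps` is the safer way to build the params column.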

In [None]:
tag_columns = [
    "cost_per_call",
    "euclidean_distance_0",
    "euclidean_distance_1",
    "instruction",
    "openai_precision_1",
    "openai_precision_2",
    "openai_relevance_0",
    "openai_relevance_1",
    "prompt_template",
    "prompt_template_name",
    "retrieval_text_0",
    "retrieval_text_1",
    "text_similarity_0",
    "text_similarity_1",
    "tokens_used",
    "user_feedback",
    "user_query",
]

In [None]:
# Declare prompt and response columns
prompt_columns = EmbeddingColumnNames(
    vector_column_name="prompt_vector",
    data_column_name="prompt_text",
)

response_columns = EmbeddingColumnNames(
    vector_column_name="response_vector",
    data_column_name="response_text",
)

In [None]:
# Declare the columns for the prompt template playground
prompt_template_columns = PromptTemplateColumnNames(
    template_column_name="prompt_template",
    template_version_column_name="prompt_template_name",
)
llm_config_columns = LLMConfigColumnNames(
    model_column_name="llm_config_model_name",
    params_column_name="llm_params",
)

In [None]:
prod_schema = Schema(
    prediction_id_column_name="prediction_id",
    timestamp_column_name="prediction_ts",
    prediction_label_column_name="pred_label",
    actual_label_column_name="user_feedback",
    tag_column_names=tag_columns,
    prompt_column_names=prompt_columns,
    response_column_names=response_columns,
    prompt_template_column_names=prompt_template_columns,
    llm_config_column_names=llm_config_columns
)
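
Because the Schema refers to columns by name, a misspelled name is an easy mistake to make. One possible pre-flight check is sketched below; `missing_columns` is a hypothetical helper (not part of the Arize SDK), and the column lists are shortened examples.

```python
def missing_columns(required, available):
    """Return the required column names that are absent, sorted."""
    return sorted(set(required) - set(available))


# Shortened, illustrative column lists
available = ["prediction_id", "prediction_ts", "pred_label", "user_feedback"]
required = ["prediction_id", "prediction_ts", "pred_label", "llm_params"]

print(missing_columns(required, available))  # ['llm_params']
```

In this notebook, `available` would be `prod_df.columns` and `required` would collect the names passed to the Schema, such as the entries of `tag_columns`.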

## Send Production Data

In [None]:
response = arize_client.log(
    dataframe=prod_df,
    schema=prod_schema,
    model_id=model_id,
    model_version=model_version,
    model_type=model_type,
    environment=Environments.PRODUCTION,
)
if response.status_code == 200:
    print(f"✅ Successfully logged data for model {model_id} to Arize!")
else:
    print(
        f'❌ Logging failed with status code {response.status_code} and message "{response.text}"'
    )

## Send Validation Data

In [None]:
# Declare prompt and response columns
prompt_columns = EmbeddingColumnNames(
    vector_column_name="text_vector",
    data_column_name="text",
)

response_columns = EmbeddingColumnNames(
    vector_column_name="text_vector",
)

In [None]:
val_schema = Schema(
    prediction_label_column_name="actual_label",
    actual_label_column_name="actual_label",
    prompt_column_names=prompt_columns,
    response_column_names=response_columns,
)

In [None]:
response = arize_client.log(
    dataframe=val_df,
    schema=val_schema,
    model_id=model_id,
    model_version=model_version,
    model_type=model_type,
    batch_id="validation-1",
    environment=Environments.VALIDATION,
#     sync=True
)
if response.status_code == 200:
    print(f"✅ Successfully logged data for model {model_id} to Arize!")
else:
    print(
        f'❌ Logging failed with status code {response.status_code} and message "{response.text}"'
    )