# Embeddings Generation and Persistence for LLMs

This notebook generates embeddings for a given dataset and persists them in watsonx.governance.

#### Contents

1. [Setting up the environment](#setting-up-the-environment) - Pre-requisites: Install Libraries and required dependencies
2. [Input Data](#input-data) - Read the scored data as a pandas DataFrame
3. [User Inputs Section](#user-inputs-section) - Provide Model Details, IBM watsonx.governance Services and their configuration
4. [Generate and persist embeddings](#generate-and-persist-embeddings)
5. [Optional: Configure Drift v2](#optional-configure-drift-v2)
6. [Optional: Store Runtime Records with Embeddings](#optional-store-runtime-records-with-embeddings)
7. [Optional: Evaluate Drift v2 monitor](#optional-evaluate-drift-v2-monitor)

## Setting up the environment

**Installing required packages**

In [None]:
%pip install --upgrade "ibm-metrics-plugin[notebook]~=3.0.0" "ibm-watson-openscale~=3.0.36" "ibm-watsonx-ai~=1.1.6" | tail -n 1

In [1]:
# ----------------------------------------------------------------------------------------------------
# IBM Confidential
# OCO Source Materials
# 5900-A3Q, 5737-H76
# Copyright IBM Corp. 2024
# The source code for this Notebook is not published or other-wise divested of its trade 
# secrets, irrespective of what has been deposited with the U.S.Copyright Office.
# ----------------------------------------------------------------------------------------------------

VERSION = "1.0"

#Version History
#1.0: Initial release

In [2]:
from datetime import datetime

import pandas as pd
from ibm_cloud_sdk_core.authenticators import IAMAuthenticator
from ibm_watson_openscale import APIClient
from ibm_watson_openscale.utils.embeddings_generation_utility import \
    EmbeddingsGenerationUtility
from ibm_watson_openscale.utils.drift_v2_utility import DriftV2Utility


## Input Data

The notebook supports two modes:
1. Fetch the dataset records from watsonx.governance.
2. Read the input scored data as a pandas DataFrame. Although the sample here reads a CSV file into a DataFrame, the data could come from another source, such as a database table.

When local data is used, the input scored data should contain the following columns:
- The feature columns, _aka_ prompt variable columns
- The model output/prediction column, _aka_ generated text column
- Optional: The meta columns
- Optional: The input token count column
- Optional: The output token count column
- Optional: The prediction probability column
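
Before uploading local data, it can help to confirm that the required columns are present. The sketch below is only an illustration: `missing_columns` is a hypothetical helper, not part of any IBM library, and the column names are assumptions borrowed from the sample output shown later in this notebook.

```python
def missing_columns(available, required):
    """Return the required column names that are absent, in the order given."""
    available = set(available)
    return [name for name in required if name not in available]


# Assumed names: replace with your own prompt variables and generated text column.
required = ["question", "context_column_1", "generated_text"]

# With a DataFrame, pass `df.columns` here instead of a hard-coded list.
available = ["question", "generated_text", "input_token_count"]

print(missing_columns(available, required))  # ['context_column_1']
```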

*Note: pandas' `read_csv` method infers the data type of each column. To prevent a column's type from being inferred, pass the `dtype` parameter to `read_csv` in this cell. See the [`read_csv` documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html) for details.*
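
As a small illustration of the note above, the sketch below reads the same CSV text twice, once with inferred types and once with `dtype` set for one column. The inline CSV and its values are made up for the example; only `read_csv` and its `dtype` parameter come from pandas.

```python
import io

import pandas as pd

# Made-up sample data: the identifiers have leading zeros that type inference drops.
csv_text = "scoring_id,question\n00123,What is drift?\n00456,What is an embedding?\n"

# Default behaviour: pandas infers `scoring_id` as an integer column.
inferred = pd.read_csv(io.StringIO(csv_text))

# With `dtype`, the column is kept as text exactly as written in the file.
as_text = pd.read_csv(io.StringIO(csv_text), dtype={"scoring_id": str})

print(inferred["scoring_id"].tolist())  # [123, 456]
print(as_text["scoring_id"].tolist())   # ['00123', '00456']
```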

In [3]:
df = None

# Uncomment these lines if you want to compute and store embeddings for local data
# df = pd.read_csv("TO BE EDITED")
# print(df.info())

## User Inputs Section

##### _1. Provide watsonx.governance parameters_

Provide the watsonx.governance parameters: the API key and the subscription ID.

##### _2. Provide an embedding function_

The embedding function should adhere to the following guidelines.

- The embedding function should accept a `list` as its input.
- The embedding function should return a `list` containing the embeddings for all the inputs.

A few samples of the embedding function have been provided [here](https://github.com/IBM/watson-openscale-samples/wiki/Embedding-Function-Templates-for-unstructured-text-data).
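
To make the contract concrete, here is a minimal sketch of a function with the expected shape: a list goes in, and a list with one vector per input comes out. `toy_embeddings_fn` and its `dim` parameter are hypothetical, and the hash-based vectors carry no semantic meaning, so this is a stand-in for trying the plumbing locally and not a substitute for a real embedding model.

```python
import hashlib


def toy_embeddings_fn(inputs, dim=8):
    """Return one fixed-length vector per input (placeholder, not a real model)."""
    embeddings = []
    for text in inputs:
        digest = hashlib.sha256(str(text).encode("utf-8")).digest()
        # Scale the first `dim` bytes of the hash to floats in [0, 1].
        embeddings.append([byte / 255 for byte in digest[:dim]])
    return embeddings


vectors = toy_embeddings_fn(["What is drift?", "What is an embedding?"])
print(len(vectors), len(vectors[0]))  # 2 8
```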

In [4]:
CLOUD_API_KEY = "TO BE EDITED"

subscription_id = "TO BE EDITED"

def embeddings_fn(inputs):
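    # Imported inside the function so that this APIClient does not shadow the
    # ibm_watson_openscale APIClient imported at the top of the notebook.
    from ibm_watsonx_ai import Credentials, APIClient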
    from ibm_watsonx_ai.foundation_models import Embeddings
    from ibm_watsonx_ai.metanames import EmbedTextParamsMetaNames 
    
    # from time import time
    # start_time = time()

    API_KEY = "TO BE EDITED"
    WX_URL = "https://us-south.ml.cloud.ibm.com"
    PROJECT_ID = "TO BE EDITED"

    credentials = Credentials(
        url = WX_URL,
        api_key = API_KEY
    )

    client = APIClient(credentials, project_id=PROJECT_ID)
    # client.foundation_models.EmbeddingModels.show()
    embedding = Embeddings(
        model_id=client.foundation_models.EmbeddingModels.ALL_MINILM_L12_V2,
        api_client=client,
        params={
            EmbedTextParamsMetaNames.TRUNCATE_INPUT_TOKENS: 128
        }
    )
    result = embedding.embed_documents(texts=inputs)
    # print(f"Got embeddings of {len(inputs)} inputs in {time() - start_time}s.")
    return result

In [5]:
# Initialize the client

authenticator = IAMAuthenticator(apikey=CLOUD_API_KEY, url="https://iam.cloud.ibm.com")
wos_client = APIClient(authenticator=authenticator, service_url="https://aiopenscale.cloud.ibm.com")

## Generate and persist embeddings

Generate the embeddings and persist them in watsonx.governance. Use `embeddings_chunk_size` to control how many records are sent to `embeddings_fn` at a time.

The `compute_and_store_embeddings` method takes the following arguments:
1. `embeddings_fn`: The function used to generate the embeddings.
2. `embeddings_chunk_size`: The maximum number of records passed to the embeddings function in a single call.
3. `scored_data`: The pandas DataFrame containing the scored data, with at least the prompt variables and the generated text. Provide this argument when a DataFrame is to be uploaded along with its embeddings.
4. `start`, `end`: The time interval used to read the payload records.
5. `force`: If `True`, all the records between the `start` and `end` timestamps are read. If `False`, only the payload records that do not already contain embeddings are read.
6. `limit`: The total number of records to read for generating embeddings.
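
The effect of a chunk size can be pictured with the generic sketch below. It is not the utility's implementation, whose internals may differ: `embed_in_chunks` and `fake_embeddings_fn` are hypothetical names used only to show how a list of inputs can be split into batches of at most `chunk_size` items, each passed to the embeddings function in one call.

```python
def embed_in_chunks(values, embeddings_fn, chunk_size):
    """Call `embeddings_fn` on consecutive slices of at most `chunk_size` items."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    embeddings = []
    for start in range(0, len(values), chunk_size):
        chunk = values[start:start + chunk_size]
        embeddings.extend(embeddings_fn(chunk))
    return embeddings


calls = []  # records the size of each batch, for illustration only


def fake_embeddings_fn(inputs):
    calls.append(len(inputs))
    # One-dimensional placeholder "embedding": the length of the text.
    return [[float(len(text))] for text in inputs]


result = embed_in_chunks(["a", "bb", "ccc", "dddd", "eeeee"], fake_embeddings_fn, chunk_size=2)
print(calls)        # [2, 2, 1]
print(len(result))  # 5
```

A smaller chunk size means more calls with less data in each, which can help when the embedding service limits the request size.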

In [6]:
embedding_util = EmbeddingsGenerationUtility(client=wos_client, subscription_id=subscription_id)

baseline_df = embedding_util.compute_and_store_embeddings(start=datetime(2024, 7, 23, 18, 1),
                                            end=datetime(2024, 7, 25, 18, 2),
                                            embeddings_fn=embeddings_fn,
                                            embeddings_chunk_size=500,
                                            limit=1000,
                                            force=True)

# Use this snippet instead if the local data has been read into the dataframe `df`
# baseline_df = embedding_util.compute_and_store_embeddings(scored_data=df,
#                                             embeddings_fn=embeddings_fn,
#                                             embeddings_chunk_size=100)


Reading payload_logging records... :   0%|          | 0/1000 [00:00<?, ?records/s]

Computing embeddings... :   0%|          | 0/7000 [00:00<?, ?values/s]

Storing embeddings... :   0%|          | 0/1000 [00:00<?, ?records/s]

In [7]:
baseline_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 19 columns):
 #   Column                                     Non-Null Count  Dtype  
---  ------                                     --------------  -----  
 0   question                                   1000 non-null   object 
 1   scoring_id                                 1000 non-null   object 
 2   scoring_timestamp                          1000 non-null   object 
 3   context_column_2                           1000 non-null   object 
 4   prediction_probability                     1000 non-null   float64
 5   context_column_1                           1000 non-null   object 
 6   generated_text                             1000 non-null   object 
 7   context_column_3                           1000 non-null   object 
 8   input_token_count                          1000 non-null   int64  
 9   type                                       1000 non-null   object 
 10  generated_token_count    

## Optional: Configure Drift v2

In the cell below, you can configure the Drift v2 monitor using the dataframe with embeddings generated above.

In [8]:
drift_v2_utility = DriftV2Utility(client=wos_client, subscription_id=subscription_id)
drift_v2_utility.configure(scored_data=baseline_df, embeddings_fn=embeddings_fn)

The subscription '64846065-a105-44f2-85ca-0510ca868056' has Drift v2 monitor configured with id 'f4d5eb97-c51c-4b3e-bf2e-2b5c4126f0b7'
The utility will re-configure Drift v2.
Generating Drift v2 Archive...
Baseline archive created at path:  /Users/prempiyush/work/code/notebooks/WatsonX.Governance/Cloud/GenAI/samples/baseline__c3496af4-171b-4b42-8d45-e8922c2cdfde.tar.gz
Generated Drift v2 Archive in 0:00:04.688310...
Uploading Drift v2 Archive...
Uploaded Drift v2 Archive in 0:00:26.598470...
Updating Drift v2 monitor...
Updating Drift v2 monitor. state: active. Time elapsed: 0:00:01.888446...


## Optional: Store Runtime Records with Embeddings

Read another scored data CSV file, to be persisted as runtime data along with its embeddings.

In [9]:
runtime_df = pd.read_csv("TO BE EDITED")
print(runtime_df.info())

embedding_util = EmbeddingsGenerationUtility(
    client=wos_client, subscription_id=subscription_id)
runtime_df = embedding_util.compute_and_store_embeddings(scored_data=runtime_df,
                                                         embeddings_fn=embeddings_fn,
                                                         embeddings_chunk_size=100)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50 entries, 0 to 49
Data columns (total 11 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   context_column_1        50 non-null     object 
 1   context_column_2        50 non-null     object 
 2   context_column_3        50 non-null     object 
 3   question                50 non-null     object 
 4   generated_text          50 non-null     object 
 5   prediction_probability  50 non-null     float64
 6   input_token_count       50 non-null     int64  
 7   generated_token_count   50 non-null     int64  
 8   answer                  50 non-null     object 
 9   type                    50 non-null     object 
 10  level                   50 non-null     object 
dtypes: float64(1), int64(2), object(8)
memory usage: 4.4+ KB
None


Storing records... :   0%|          | 0/50 [00:00<?, ?records/s]

Computing embeddings... :   0%|          | 0/350 [00:00<?, ?values/s]

Storing embeddings... :   0%|          | 0/50 [00:00<?, ?records/s]

## Optional: Evaluate Drift v2 monitor

In [10]:
drift_v2_utility = DriftV2Utility(client=wos_client, subscription_id=subscription_id)
drift_v2_utility.evaluate()

Running Drift v2 monitor...
Running Drift v2 monitor. state: running. Time elapsed: 0:00:01.681804...
Running Drift v2 monitor. state: running. Time elapsed: 0:00:12.594594...
Running Drift v2 monitor. state: running. Time elapsed: 0:00:23.847397...
Running Drift v2 monitor. state: running. Time elapsed: 0:00:34.770098...
Running Drift v2 monitor. state: running. Time elapsed: 0:00:45.655135...
Running Drift v2 monitor. state: running. Time elapsed: 0:00:56.578163...
Running Drift v2 monitor. state: running. Time elapsed: 0:01:07.645539...
Running Drift v2 monitor. state: running. Time elapsed: 0:01:18.518526...
Running Drift v2 monitor. state: running. Time elapsed: 0:01:29.429359...
Running Drift v2 monitor. state: running. Time elapsed: 0:01:40.339988...
Running Drift v2 monitor. state: finished. Time elapsed: 0:01:51.278365...


#### Authors
Developed by [Prem Piyush Goyal](mailto:prempiyush@in.ibm.com)