# Embeddings Generation and Persistence for LLMs

This notebook generates embeddings for a given dataset and persists them in watsonx.governance.

#### Contents

1. [Setting up the environment](#setting-up-the-environment) - Pre-requisites: Install Libraries and required dependencies
2. [Input Data](#input-data) - Read the scored data as a pandas DataFrame
3. [User Inputs Section](#user-inputs-section) - Provide Model Details, IBM watsonx.governance Services and their configuration
4. [Generate and persist embeddings](#generate-and-persist-embeddings)
5. [Optional: Configure Drift v2](#optional-configure-drift-v2)
6. [Optional: Store Runtime Records with Embeddings](#optional-store-runtime-records-with-embeddings)
7. [Optional: Evaluate Drift v2 monitor](#optional-evaluate-drift-v2-monitor)

## Setting up the environment

**Installing required packages**

In [None]:
%pip install --upgrade "ibm-metrics-plugin[notebook]~=5.0.3" "ibm-watson-openscale~=3.0.36" "sentence-transformers" | tail -n 1

In [89]:
# ----------------------------------------------------------------------------------------------------
# IBM Confidential
# OCO Source Materials
# 5900-A3Q, 5737-H76
# Copyright IBM Corp. 2024
# The source code for this Notebook is not published or other-wise divested of its trade 
# secrets, irrespective of what has been deposited with the U.S.Copyright Office.
# ----------------------------------------------------------------------------------------------------

VERSION = "1.0"

#Version History
#1.0: Initial release

In [90]:
from datetime import datetime

import pandas as pd
from ibm_cloud_sdk_core.authenticators import CloudPakForDataAuthenticator
from ibm_watson_openscale import APIClient
from ibm_watson_openscale.utils.embeddings_generation_utility import \
    EmbeddingsGenerationUtility
from ibm_watson_openscale.utils.drift_v2_utility import DriftV2Utility
from sentence_transformers import SentenceTransformer


## Input Data

The notebook supports two modes:
1. Fetch dataset records from watsonx.governance.
2. Read the input scored data as a pandas DataFrame. Although the sample here reads a CSV file into a DataFrame, the data could come from another source, such as a database table.

The input scored data should contain the following columns:
- The feature _aka_ prompt variable columns
- The model output/prediction _aka_ generated text column
- Optional: The meta columns
- Optional: The input token count column
- Optional: The output token count column
- Optional: The prediction probability column

*Note: The pandas `read_csv` method infers a data type for each column. If you do not want a column's type to be inferred, pass the `dtype` parameter to `read_csv` in this cell. See the [`read_csv` documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html) for details.*
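As a small illustration of the note above, the sketch below reads the same CSV text twice, once with inferred types and once with an explicit `dtype`. The column name `record_id` and the inline CSV content are hypothetical and stand in for a real scored data file.

```python
import io

import pandas as pd

# Hypothetical two-row CSV standing in for the scored data file.
csv_text = "record_id,generated_text\n007,hello\n042,world\n"

# Default behaviour: pandas infers an integer type, so leading zeros are lost.
inferred = pd.read_csv(io.StringIO(csv_text))

# With dtype, the column is kept as text exactly as written in the file.
preserved = pd.read_csv(io.StringIO(csv_text), dtype={"record_id": str})

print(inferred["record_id"].tolist())   # [7, 42]
print(preserved["record_id"].tolist())  # ['007', '042']
```

Passing a dictionary to `dtype` overrides inference only for the listed columns; the remaining columns are still inferred.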

In [122]:
df = None

# Uncomment these lines if you want to compute and store embeddings for local data
# df = pd.read_csv("<EDIT THIS>")
# print(df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 110 entries, 0 to 109
Data columns (total 9 columns):
 #   Column                                  Non-Null Count  Dtype  
---  ------                                  --------------  -----  
 0   original_text                           110 non-null    object 
 1   reference_summary                       110 non-null    object 
 2   generated_text                          110 non-null    object 
 3   input_token_count                       110 non-null    int64  
 4   generated_token_count                   110 non-null    int64  
 5   prediction_probability                  110 non-null    float64
 6   wos_feature_original_text_embeddings__  110 non-null    object 
 7   wos_input_embeddings__                  110 non-null    object 
 8   wos_output_embeddings__                 110 non-null    object 
dtypes: float64(1), int64(2), object(6)
memory usage: 7.9+ KB
None


## User Inputs Section

##### _1. Provide watsonx.governance parameters_:

Provide the watsonx.governance parameters: the Cloud Pak for Data URL, username, and password, the service instance ID, and the subscription ID.

##### _2. Provide an embedding function_

The embedding function should adhere to the following guidelines:

- It should accept a `list` as input.
- It should return a `list` containing the embeddings for all the inputs.

A few sample embedding functions are provided [here](https://github.com/IBM/watson-openscale-samples/wiki/Embedding-Function-Templates-for-unstructured-text-data).
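To make the contract above concrete, here is a minimal sketch of a function with the expected shape: it takes a list of texts and returns a list with one vector per text. The name `toy_embeddings_fn` and the hash-based vectors are purely illustrative assumptions; they carry no semantic meaning and are not a substitute for a real embedding model.

```python
import hashlib
from typing import List


def toy_embeddings_fn(texts: List[str], dim: int = 8) -> List[List[float]]:
    """Return one fixed-length vector per input text (illustrative only)."""
    vectors = []
    for text in texts:
        digest = hashlib.sha256(text.encode("utf-8")).digest()
        # Map the first `dim` bytes of the hash to floats in [0, 1].
        vectors.append([byte / 255.0 for byte in digest[:dim]])
    return vectors


vectors = toy_embeddings_fn(["first prompt", "second prompt"])
print(len(vectors), len(vectors[0]))  # 2 8
```

A function of this shape is handy for checking the plumbing end to end before switching to a real model.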

In [123]:
CPD_URL = "<EDIT THIS>"
CPD_USERNAME = "<EDIT THIS>"
CPD_PASSWORD = "<EDIT THIS>"
WOS_SERVICE_INSTANCE_ID = "00000000-0000-0000-0000-000000000000" # If None, the default instance is used
subscription_id = "<EDIT THIS>"

# 1. Load a pretrained Sentence Transformer model
model = SentenceTransformer("all-MiniLM-L12-v2")

# 2. Calculate embeddings by calling model.encode()
embeddings_fn = model.encode

In [124]:
# Initialize the client

authenticator = CloudPakForDataAuthenticator(
    url=CPD_URL,
    username=CPD_USERNAME,
    password=CPD_PASSWORD,
    disable_ssl_verification=True
)
wos_client = APIClient(
    service_url=CPD_URL,
    authenticator=authenticator,
    service_instance_id=WOS_SERVICE_INSTANCE_ID
)

## Generate and persist embeddings

Generate the embeddings and persist them in watsonx.governance. Use `embeddings_chunk_size` to control how many records are sent to `embeddings_fn` at a time.
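The effect of a chunk size can be sketched as follows. This is not the utility's implementation, only an assumed illustration of batching: the helper `embed_in_chunks` and the stand-in `fake_embeddings_fn` are hypothetical names introduced here.

```python
from typing import Callable, List, Sequence


def embed_in_chunks(
    values: Sequence[str],
    embeddings_fn: Callable[[List[str]], List[List[float]]],
    chunk_size: int,
) -> List[List[float]]:
    """Call embeddings_fn on at most chunk_size values at a time."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    results: List[List[float]] = []
    for start in range(0, len(values), chunk_size):
        chunk = list(values[start:start + chunk_size])
        results.extend(embeddings_fn(chunk))
    return results


call_sizes = []


def fake_embeddings_fn(texts: List[str]) -> List[List[float]]:
    # Stand-in embedding function that records how many texts each call gets.
    call_sizes.append(len(texts))
    return [[float(len(text))] for text in texts]


texts = [f"text {i}" for i in range(250)]
embeddings = embed_in_chunks(texts, fake_embeddings_fn, chunk_size=100)
print(call_sizes)  # [100, 100, 50]
```

A smaller chunk size lowers the peak memory needed per call at the cost of more calls to the embedding function.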

The `compute_and_store_embeddings` method takes the following arguments:
1. `embeddings_fn`: The function used to generate embeddings.
2. `embeddings_chunk_size`: The maximum number of records passed to the embeddings function in a single call.
3. `scored_data`: The pandas DataFrame containing the scored data, with at least the prompt variables and the generated text. Provide this when a DataFrame is to be uploaded along with its embeddings.
4. `start`, `end`: The time interval used to read the payload records.
5. `force`: If `force` is set to `True`, all the records within the above interval are read. If `force` is set to `False`, only the payload records that do not contain embeddings are read.
6. `limit`: The maximum total number of records read for generating embeddings.
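The interplay of `start`, `end`, `force`, and `limit` described above can be sketched on in-memory records. This is an assumed illustration, not the utility's code: the function `select_records`, the record layout, and the inclusive interval bounds are all hypothetical.

```python
from datetime import datetime
from typing import Dict, List, Optional


def select_records(
    records: List[Dict],
    start: datetime,
    end: datetime,
    force: bool = False,
    limit: Optional[int] = None,
) -> List[Dict]:
    """Pick the records that would be sent for embedding generation."""
    selected: List[Dict] = []
    for record in records:
        if limit is not None and len(selected) >= limit:
            break
        # Assumption: both interval bounds are inclusive.
        if not (start <= record["timestamp"] <= end):
            continue
        # Without force, skip records that already carry embeddings.
        if not force and record.get("embeddings") is not None:
            continue
        selected.append(record)
    return selected


records = [
    {"id": 1, "timestamp": datetime(2024, 7, 23, 12, 0), "embeddings": None},
    {"id": 2, "timestamp": datetime(2024, 7, 24, 9, 0), "embeddings": [0.1]},
    {"id": 3, "timestamp": datetime(2024, 7, 24, 10, 0), "embeddings": None},
    {"id": 4, "timestamp": datetime(2024, 7, 25, 8, 0), "embeddings": None},
]
start, end = datetime(2024, 7, 23, 18, 1), datetime(2024, 7, 25, 18, 2)

only_missing = [r["id"] for r in select_records(records, start, end)]
everything = [r["id"] for r in select_records(records, start, end, force=True)]
capped = [r["id"] for r in select_records(records, start, end, force=True, limit=2)]
print(only_missing, everything, capped)  # [3, 4] [2, 3, 4] [2, 3]
```

Record 1 falls before the interval and is never selected; record 2 already has embeddings and is selected only when `force=True`.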

In [125]:
embedding_util = EmbeddingsGenerationUtility(client=wos_client, subscription_id=subscription_id)

# baseline_df = embedding_util.compute_and_store_embeddings(start=datetime(2024, 7, 23, 18, 1),
#                                             end=datetime(2024, 7, 25, 18, 2),
#                                             embeddings_fn=embeddings_fn,
#                                             embeddings_chunk_size=500,
#                                             limit=1000,
#                                             force=True)

# Use this snippet if the local data has been read into a dataframe
baseline_df = embedding_util.compute_and_store_embeddings(scored_data=df,
                                            embeddings_fn=embeddings_fn,
                                            embeddings_chunk_size=100)


Storing records... :   0%|          | 0/110 [00:00<?, ?records/s]

Computing embeddings... :   0%|          | 0/330 [00:00<?, ?values/s]

Storing embeddings... :   0%|          | 0/110 [00:00<?, ?records/s]

In [126]:
baseline_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 110 entries, 0 to 109
Data columns (total 10 columns):
 #   Column                                  Non-Null Count  Dtype  
---  ------                                  --------------  -----  
 0   original_text                           110 non-null    object 
 1   reference_summary                       110 non-null    object 
 2   generated_text                          110 non-null    object 
 3   input_token_count                       110 non-null    int64  
 4   generated_token_count                   110 non-null    int64  
 5   prediction_probability                  110 non-null    float64
 6   wos_feature_original_text_embeddings__  110 non-null    object 
 7   wos_input_embeddings__                  110 non-null    object 
 8   wos_output_embeddings__                 110 non-null    object 
 9   scoring_id                              110 non-null    object 
dtypes: float64(1), int64(2), object(7)
memory usage: 8.7+ KB


## Optional: Configure Drift v2

In the cell below, you can configure the Drift v2 monitor using the dataframe with embeddings generated above.

In [127]:
drift_v2_utility = DriftV2Utility(client=wos_client, subscription_id=subscription_id)
drift_v2_utility.configure(scored_data=baseline_df, embeddings_fn=embeddings_fn)

The subscription '25f1dab8-1adf-4a5b-9e73-f4c991554e68' has Drift v2 monitor configured with id 'cf5dd636-a26d-4b02-8b01-36a34458e4f0'
The utility will re-configure Drift v2.
Generating Drift v2 Archive...
Baseline archive created at path:  /Users/soumyajyotibiswas/Desktop/Sample Notebooks/notebooks/WatsonX.Governance/OnPrem/GenAI/2.0/samples/baseline__8d4c9e91-3383-4aae-8702-67073215a7c7.tar.gz
Generated Drift v2 Archive in 0:00:01.417305...
Uploading Drift v2 Archive...
Uploaded Drift v2 Archive in 0:00:09.443958...
Updating Drift v2 monitor...
Updating Drift v2 monitor. state: preparing. Time elapsed: 0:00:00.922608...
Updating Drift v2 monitor. state: active. Time elapsed: 0:00:03.273687...


## Optional: Store Runtime Records with Embeddings

Read another scored data CSV file to be persisted as runtime data.

In [128]:
runtime_df = pd.read_csv("<EDIT THIS>")
print(runtime_df.info())

embedding_util = EmbeddingsGenerationUtility(
    client=wos_client, subscription_id=subscription_id)
runtime_df = embedding_util.compute_and_store_embeddings(scored_data=runtime_df,
                                                         embeddings_fn=embeddings_fn,
                                                         embeddings_chunk_size=100)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 110 entries, 0 to 109
Data columns (total 6 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   original_text           110 non-null    object 
 1   reference_summary       110 non-null    object 
 2   generated_text          110 non-null    object 
 3   input_token_count       110 non-null    int64  
 4   generated_token_count   110 non-null    int64  
 5   prediction_probability  110 non-null    float64
dtypes: float64(1), int64(2), object(3)
memory usage: 5.3+ KB
None


Storing records... :   0%|          | 0/110 [00:00<?, ?records/s]

Computing embeddings... :   0%|          | 0/330 [00:00<?, ?values/s]

Storing embeddings... :   0%|          | 0/110 [00:00<?, ?records/s]

## Optional: Evaluate Drift v2 monitor

In [129]:
drift_v2_utility = DriftV2Utility(client=wos_client, subscription_id=subscription_id)
drift_v2_utility.evaluate()

Running Drift v2 monitor...
Running Drift v2 monitor. state: running. Time elapsed: 0:00:00.649101...
Running Drift v2 monitor. state: running. Time elapsed: 0:00:11.100097...
Running Drift v2 monitor. state: running. Time elapsed: 0:00:21.600146...
Running Drift v2 monitor. state: running. Time elapsed: 0:00:31.974034...
Running Drift v2 monitor. state: running. Time elapsed: 0:00:42.397013...
Running Drift v2 monitor. state: running. Time elapsed: 0:00:52.742539...
Running Drift v2 monitor. state: running. Time elapsed: 0:01:03.087831...
Running Drift v2 monitor. state: running. Time elapsed: 0:01:13.437764...
Running Drift v2 monitor. state: running. Time elapsed: 0:01:23.788788...
Running Drift v2 monitor. state: running. Time elapsed: 0:01:34.212243...
Running Drift v2 monitor. state: running. Time elapsed: 0:01:44.552081...
Running Drift v2 monitor. state: finished. Time elapsed: 0:01:54.905307...


#### Authors
Developed by [Prem Piyush Goyal](mailto:prempiyush@in.ibm.com)