# Embeddings Generation for LLMs

This notebook generates embeddings for a given dataset.

#### Contents

1. [Setting up the environment](#setting-up-the-environment)
2. [Input Data](#input-data)
3. [User Inputs Section](#user-inputs-section)
4. [Generate Embeddings](#generate-embeddings)
5. [Definitions](#definitions)

## Setting up the environment

**Installing required packages**

In [36]:
%pip install --upgrade "ibm-metrics-plugin[notebook]~=5.0.3" "sentence-transformers" | tail -n 1

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Note: you may need to restart the kernel to use updated packages.


In [1]:
# ----------------------------------------------------------------------------------------------------
# IBM Confidential
# OCO Source Materials
# 5900-A3Q, 5737-H76
# Copyright IBM Corp. 2024
# The source code for this Notebook is not published or other-wise divested of its trade 
# secrets, irrespective of what has been deposited with the U.S.Copyright Office.
# ----------------------------------------------------------------------------------------------------

VERSION = "1.0"

#Version History
#1.0: Initial release

In [37]:
import pandas as pd

from ibm_metrics_plugin.common.utils.embeddings_utils import compute_embeddings
from sentence_transformers import SentenceTransformer


## Input Data

Read the input data into a pandas DataFrame. The sample here reads a CSV file, but the data could come from another source, such as a database table.

*Note: pandas' `read_csv` method infers a data type for each column. To keep a column from being reinterpreted, pass the `dtype` parameter to `read_csv` in this cell. See the [`read_csv` documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html) for details.*
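To make the effect of `dtype` concrete, here is a minimal sketch. The `record_id` column and the two-row sample are hypothetical and stand in for a real input file; only `read_csv` and its `dtype` parameter come from pandas.

```python
import io

import pandas as pd

# Hypothetical two-row sample standing in for the real input file.
csv_text = "record_id,original_text\n007,first document\n042,second document\n"

# Default behaviour: pandas infers an integer type and the leading zeros are lost.
inferred = pd.read_csv(io.StringIO(csv_text))

# With dtype, record_id keeps the literal string from the file.
preserved = pd.read_csv(io.StringIO(csv_text), dtype={"record_id": str})

print(inferred["record_id"].tolist())   # [7, 42]
print(preserved["record_id"].tolist())  # ['007', '042']
```

The same `dtype` mapping can be passed to the `read_csv` call in the next cell for any column that must stay textual.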

In [38]:
df = pd.read_csv("<EDIT THIS>")
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 110 entries, 0 to 109
Data columns (total 6 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   original_text           110 non-null    object 
 1   reference_summary       110 non-null    object 
 2   generated_text          110 non-null    object 
 3   input_token_count       110 non-null    int64  
 4   generated_token_count   110 non-null    int64  
 5   prediction_probability  110 non-null    float64
dtypes: float64(1), int64(2), object(3)
memory usage: 5.3+ KB


## User Inputs Section

##### _1. Provide common parameters_

Provide the common parameters, such as the problem type, asset type, and prompt variable columns. They are described in the [Definitions](#definitions) section.

##### _2. Provide an embedding function_

The embedding function must follow these guidelines:

- It must accept a `list` of inputs.
- It must return a `list` containing one embedding for each input.
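The contract above can be illustrated with a toy function. This is only a sketch: `toy_embeddings_fn` and `EMBEDDING_DIM` are hypothetical names, and the vectors come from a hash rather than a language model, so they carry no semantic meaning.

```python
import hashlib

EMBEDDING_DIM = 8  # hypothetical size, chosen only for this sketch


def toy_embeddings_fn(texts: list) -> list:
    """Return one fixed-length vector per input text.

    Not a real embedding model: each vector is derived from a SHA-256
    hash, so it only demonstrates the list-in, list-out contract.
    """
    embeddings = []
    for text in texts:
        digest = hashlib.sha256(str(text).encode("utf-8")).digest()
        # Scale the first EMBEDDING_DIM bytes into the range [0, 1].
        embeddings.append([byte / 255 for byte in digest[:EMBEDDING_DIM]])
    return embeddings


sample_vectors = toy_embeddings_fn(["first text", "second text"])
print(len(sample_vectors), len(sample_vectors[0]))  # 2 8
```

If an embedding function returns a NumPy array rather than a list, a thin wrapper that calls `.tolist()` on the result would satisfy the same contract.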

In [39]:
# See the 'Definitions' section for details on each parameter.

# Supported problem types: classification, extraction, generation,
# question_answering, summarization, and retrieval_augmented_generation.
problem_type = "retrieval_augmented_generation"
asset_type = "prompt"
input_data_type = "unstructured_text"
feature_columns = ["TO BE EDITED", "TO BE EDITED", "TO BE EDITED"]  # Mandatory parameter.
context_columns = ["TO BE EDITED", "TO BE EDITED"]
question_column = "TO BE EDITED"
prediction_column = "generated_text"

configuration = {
    "configuration": {
        "asset_type": asset_type,
        "problem_type": problem_type,
        "input_data_type": input_data_type,
        "feature_columns": feature_columns,
        "prediction_column": prediction_column,
        "context_columns": context_columns,
        "question_column": question_column,
        "drift_v2": {
            "metrics_configuration": {
                "advanced_controls": {
                    "enable_embedding_drift": True
                }
            }
        }
    }
}

In [40]:
# 1. Load a pretrained Sentence Transformer model
model = SentenceTransformer("all-MiniLM-L12-v2")

# 2. Use model.encode as the embedding function; it is called later by compute_embeddings
embeddings_fn = model.encode



## Generate embeddings

Generate the embeddings and save the result as a CSV file. Use `embeddings_chunk_size` to control how many records are sent to `embeddings_fn` at a time.

In [41]:
embeddings_df = compute_embeddings(configuration=configuration,
                                   data=df,
                                   embeddings_fn=embeddings_fn,
                                   embeddings_chunk_size=100)

Computing embeddings... :   0%|          | 0/330 [00:00<?, ?values/s]
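The idea behind `embeddings_chunk_size` can be sketched in plain Python. This illustrates chunked processing in general and is not the internal implementation of `compute_embeddings`; `iter_chunks`, `embed_in_chunks`, and `length_embeddings_fn` are hypothetical names.

```python
def iter_chunks(records: list, chunk_size: int):
    """Yield consecutive slices of at most chunk_size records."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be a positive integer")
    for start in range(0, len(records), chunk_size):
        yield records[start:start + chunk_size]


def embed_in_chunks(records: list, embeddings_fn, chunk_size: int) -> list:
    """Call embeddings_fn once per chunk and concatenate the results."""
    embeddings = []
    for chunk in iter_chunks(records, chunk_size):
        embeddings.extend(embeddings_fn(chunk))
    return embeddings


def length_embeddings_fn(texts: list) -> list:
    """Stand-in embedding function: one single-element vector per text."""
    return [[float(len(text))] for text in texts]


records = [f"record {i}" for i in range(110)]
chunk_sizes = [len(chunk) for chunk in iter_chunks(records, 100)]
print(chunk_sizes)  # [100, 10]

vectors = embed_in_chunks(records, length_embeddings_fn, chunk_size=100)
print(len(vectors))  # 110
```

A smaller chunk size lowers the peak memory each call needs, at the cost of more calls to the embedding function.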

In [42]:
embeddings_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 110 entries, 0 to 109
Data columns (total 9 columns):
 #   Column                                  Non-Null Count  Dtype  
---  ------                                  --------------  -----  
 0   original_text                           110 non-null    object 
 1   reference_summary                       110 non-null    object 
 2   generated_text                          110 non-null    object 
 3   input_token_count                       110 non-null    int64  
 4   generated_token_count                   110 non-null    int64  
 5   prediction_probability                  110 non-null    float64
 6   wos_feature_original_text_embeddings__  110 non-null    object 
 7   wos_input_embeddings__                  110 non-null    object 
 8   wos_output_embeddings__                 110 non-null    object 
dtypes: float64(1), int64(2), object(6)
memory usage: 7.9+ KB


In [43]:
embeddings_df.to_csv("Data with embeddings.csv", index=False)

#### Saving the CSV in Watson Studio

In [None]:
from ibm_watson_studio_lib import access_project_or_space
wslib = access_project_or_space()
wslib.upload_file("Data with embeddings.csv")

## Definitions

### Common Parameters

| Parameter | Description | Default Value | Possible Value(s) |
| :- | :- | :- | :- |
| problem_type | The problem type of the prompt. |  | classification, extraction, generation, question_answering, summarization, retrieval_augmented_generation |
| asset_type | The asset type. | prompt | prompt |
| input_data_type | The type of input data in the dataframe. | unstructured_text | unstructured_text |
| feature_columns | Mandatory. The names of all prompt variable columns. | | |
| context_columns | The names of all context columns. Mandatory if `problem_type` is `retrieval_augmented_generation`. | | |
| question_column | Optional. The name of the question column. | | |
| prediction_column | Optional. The name of the prediction column. | generated_text | |


Example:
```python
problem_type = "classification"
asset_type = "prompt"
input_data_type = "unstructured_text"
feature_columns = ["text"]
prediction_column = "prediction"
```
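The rules in the table can be checked before the configuration is built. The helper below is a hypothetical sketch and not part of `ibm-metrics-plugin`; it covers only the rules stated in this notebook (supported problem types, mandatory `feature_columns`, and `context_columns` for retrieval-augmented generation).

```python
SUPPORTED_PROBLEM_TYPES = {
    "classification",
    "extraction",
    "generation",
    "question_answering",
    "summarization",
    "retrieval_augmented_generation",
}


def check_common_parameters(problem_type, feature_columns, context_columns=None):
    """Return a list of problems found; an empty list means all checks passed."""
    problems = []
    if problem_type not in SUPPORTED_PROBLEM_TYPES:
        problems.append(f"unsupported problem_type: {problem_type!r}")
    if not feature_columns:
        problems.append("feature_columns is mandatory")
    if problem_type == "retrieval_augmented_generation" and not context_columns:
        problems.append(
            "context_columns is mandatory for retrieval_augmented_generation"
        )
    return problems


print(check_common_parameters("classification", ["text"]))  # []
print(check_common_parameters("retrieval_augmented_generation", ["text"]))
```

The second call reports the missing `context_columns`, which mirrors the "Mandatory if" rule in the table.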

#### Authors
Developed by [Prem Piyush Goyal](mailto:prempiyush@in.ibm.com)