# Custom generative AI evaluators using Python function wrappers on watsonx.governance fine-tuned models

This notebook shows how to create custom generative AI evaluators by deploying Python functions in the cloud, using a custom runtime environment built on `runtime-24.1`. The deployed functions invoke fine-tuned RAG models through Watson NLP to compute the requested metrics.

## Learning goals

- Create custom evaluators that use fine-tuned models to compute RAG metrics

## Prerequisites

- IBM Cloud credentials
- ID of the space in which you want to create the Python function deployments

## Contents

- [Step 1 - Setup](#setup)
- [Step 2 - Generative AI Evaluators creation](#evaluator)

## Step 1 - Setup <a id="setup"></a>

### Install the necessary libraries

In [None]:
!pip install ibm-watsonx-ai==1.1.11
!pip install "ibm-watson-openscale>=3.0.43"


**Note**: You may need to restart the kernel to use the updated libraries.

### Configure your credentials

In [2]:
# Cloud credentials
IAM_URL = "https://iam.cloud.ibm.com"
CLOUD_API_KEY = "<EDIT THIS>"  # YOUR_CLOUD_API_KEY
SERVICE_URL = "https://aiopenscale.cloud.ibm.com"

WX_AI_CREDENTIALS = {
    "url": "https://us-south.ml.cloud.ibm.com",
    "apikey": CLOUD_API_KEY,
}
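
Hard-coding an API key in a notebook makes it easy to leak when the notebook is shared. As a minimal sketch, the key could instead be read from an environment variable. The helper name `load_api_key` and the variable name `CLOUD_API_KEY` in the environment are assumptions for illustration and are not part of the SDK.

```python
import os


def load_api_key(env_var: str = "CLOUD_API_KEY", placeholder: str = "<EDIT THIS>") -> str:
    """Return the API key stored in `env_var` (hypothetical helper).

    Raises ValueError if the variable is unset, empty, or still the placeholder.
    """
    api_key = os.environ.get(env_var, "").strip()
    if not api_key or api_key == placeholder:
        raise ValueError(f"Set the {env_var} environment variable to your IBM Cloud API key.")
    return api_key
```

With this in place, the credentials cell could use `CLOUD_API_KEY = load_api_key()` instead of a literal string.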

### Verify client version

In [3]:
import json

from ibm_cloud_sdk_core.authenticators import IAMAuthenticator

from ibm_watson_openscale import *
from ibm_watson_openscale.supporting_classes.enums import *
from ibm_watson_openscale.supporting_classes import *

service_instance_id = None  # Update this to refer to a particular service instance
authenticator = IAMAuthenticator(apikey=CLOUD_API_KEY, url=IAM_URL)
wos_client = APIClient(
    authenticator=authenticator,
    service_url=SERVICE_URL,
    service_instance_id=service_instance_id,
)
print(wos_client.version)

3.0.41.12
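
The printed version can also be checked programmatically. The sketch below compares a dotted version string against a minimum. The helper names `parse_version` and `meets_minimum` are hypothetical, the minimum `3.0.43` is taken from the install cell, and the code assumes that every version component is numeric.

```python
def parse_version(version: str) -> tuple:
    """Turn a dotted numeric version such as '3.0.41.12' into a tuple of ints."""
    return tuple(int(part) for part in version.strip().split("."))


def meets_minimum(version: str, minimum: str = "3.0.43") -> bool:
    """Return True if `version` is at least `minimum` (component-wise comparison)."""
    return parse_version(version) >= parse_version(minimum)
```

In the notebook this could be called as `meets_minimum(wos_client.version)`. A `False` result suggests re-running the install cell and restarting the kernel.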


## Step 2 - Generative AI Evaluators creation <a id="evaluator"></a>

Each custom evaluator is created as a Python function and deployed in a custom runtime environment built on `RT24.1`.

#### Evaluator parameters
| Parameter | Description | Default Value | Possible Value(s) |
|:-|:-|:-|:-|
| `metric_type` | The name of the metric for which the evaluator is created. |  | `retrieval_quality`, `faithfulness`, `answer_relevance` |
| `wx_ai_credentials` | watsonx.ai credentials. |  |  |
| `space_id` | ID of the space in which you want to create the Python function deployment. |  |  |
| `context_columns` [Optional]| The list of context column names in the input data frame. |  |  |
| `question_column` | The name of the question column in the input data frame. |  |  |
| `hardware_spec` [Optional]| Hardware specification for deploying the Python function. A larger hardware specification can improve metric computation performance. | `M` | `M`, `L`, `XL` |
| `metric_parameters` [Optional]| Additional parameters specific to each metric. |  |  |
| `func_name` [Optional]| The name of the Python function. | `<metric_type>_with_nlp` |  |
| `create_integrated_system` [Optional]| Flag that controls whether the associated integrated system is created. | `True` | `True` or `False` |
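
The allowed values in the table can be checked before a deployment is started, which avoids waiting on a deployment that fails because of a typo. This is a minimal sketch: `check_evaluator_args` and `default_func_name` are hypothetical helpers, not SDK functions, and the allowed values are copied from the table.

```python
# Allowed values copied from the evaluator parameters table.
ALLOWED_METRIC_TYPES = {"retrieval_quality", "faithfulness", "answer_relevance"}
ALLOWED_HARDWARE_SPECS = {"M", "L", "XL"}


def check_evaluator_args(metric_type: str, hardware_spec: str = "M") -> None:
    """Raise ValueError if an argument is outside the documented values."""
    if metric_type not in ALLOWED_METRIC_TYPES:
        raise ValueError(f"Unknown metric_type {metric_type!r}; expected one of {sorted(ALLOWED_METRIC_TYPES)}")
    if hardware_spec not in ALLOWED_HARDWARE_SPECS:
        raise ValueError(f"Unknown hardware_spec {hardware_spec!r}; expected one of {sorted(ALLOWED_HARDWARE_SPECS)}")


def default_func_name(metric_type: str) -> str:
    """Build the documented default function name, <metric_type>_with_nlp."""
    return f"{metric_type}_with_nlp"
```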

#### Faithfulness parameters
| Parameter | Description | Default Value | Possible Value(s) |
|:-|:-|:-|:-|
| `attributions_count` [Optional]| Source attributions are computed for each sentence in the generated answer. The source attribution for a sentence is the set of context sentences that contributed to the LLM generating that sentence. This parameter specifies how many context sentences to identify as attributions. For example, a value of 2 returns the top 2 context sentences as source attributions. | `3` |  |
| `ngrams` [Optional]| The number of context sentences to group together when computing the faithfulness score. The grouped sentences are shown in the attributions. A very high value can lower faithfulness scores because the groups become diffuse and include unrelated sentences. A very low value can increase metric computation time and produce attributions that do not capture all aspects of the answer. | `2` |  |
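
To make the `ngrams` trade-off concrete, the sketch below groups context sentences into blocks of `ngrams` consecutive sentences. It is an illustration only: this notebook does not describe how the evaluator splits or groups sentences, so the `group_sentences` helper, the regex-based sentence splitting, and the non-overlapping blocks are all assumptions.

```python
import re


def group_sentences(context: str, ngrams: int = 2) -> list:
    """Split `context` into sentences and join them in blocks of `ngrams` (hypothetical helper)."""
    if ngrams < 1:
        raise ValueError("ngrams must be at least 1")
    # Naive split on whitespace that follows sentence-ending punctuation.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", context) if s.strip()]
    return [" ".join(sentences[i:i + ngrams]) for i in range(0, len(sentences), ngrams)]
```

Under these assumptions, a larger `ngrams` yields fewer but longer groups, so each group is more likely to mix related and unrelated sentences, while a smaller value yields more groups to score.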

#### Context relevance parameters
| Parameter | Description | Default Value | Possible Value(s) |
|:-|:-|:-|:-|
| `ngrams` [Optional]| The number of context sentences to group together when computing the context relevance score. A very high value can lower context relevance scores because the groups become diffuse and include unrelated sentences. | `5` |  |
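
The `metric_parameters` dictionary has a different shape per metric: the evaluator cells in this notebook pass faithfulness settings at the top level and nest the context relevance setting under a `context_relevance` key. The sketch below builds both shapes using the documented defaults. The builder function names are hypothetical.

```python
def faithfulness_parameters(attributions_count: int = 3, ngrams: int = 2) -> dict:
    """Build metric_parameters for the faithfulness evaluator (flat layout)."""
    if attributions_count < 1 or ngrams < 1:
        raise ValueError("attributions_count and ngrams must be positive integers")
    return {"attributions_count": attributions_count, "ngrams": ngrams}


def retrieval_quality_parameters(ngrams: int = 5) -> dict:
    """Build metric_parameters for retrieval quality (nested under context_relevance)."""
    if ngrams < 1:
        raise ValueError("ngrams must be a positive integer")
    return {"context_relevance": {"ngrams": ngrams}}
```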

In [4]:
space_id = "<EDIT_THIS>"  # space in which python function should be deployed

In [5]:
# Edit the values below to match the column names in your input data
context_columns = []  # EDIT_THIS
question_column = "<EDIT_THIS>"
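
Because the evaluators look up these column names in the input data frame, a quick check against the actual columns can catch mismatches early. This is a minimal sketch with a hypothetical helper, `find_missing_columns`, and made-up column names.

```python
def find_missing_columns(available_columns, question_column, context_columns=()):
    """Return the configured column names that are absent from `available_columns`."""
    required = [question_column, *context_columns]
    return [name for name in required if name not in available_columns]


# Example with made-up column names: "context2" is configured but not present.
missing = find_missing_columns(
    available_columns=["question", "context1", "answer"],
    question_column="question",
    context_columns=["context1", "context2"],
)
print(missing)
```

With a pandas data frame `df`, `available_columns` could be `list(df.columns)`.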

### Evaluator creation for retrieval quality

In [None]:
rq_evaluator_dtls = wos_client.llm_metrics.evaluators.add(
    metric_type="retrieval_quality",
    wx_ai_credentials=WX_AI_CREDENTIALS,
    space_id=space_id,
    question_column=question_column,
    context_columns=context_columns,
    hardware_spec="M",
    metric_parameters={
        # Retrieval quality computes these metrics: context_relevance, retrieval_precision,
        # average_precision, reciprocal_rank, hit_rate, normalized_discounted_cumulative_gain.
        # "context_relevance": {
        #     "ngrams": 5
        # },
    },
    func_name="retrieval_quality_with_nlp",
    create_integrated_system=True,
)
print("Retrieval quality evaluator ID: " + rq_evaluator_dtls["evaluator_id"])



######################################################################################

Synchronous deployment creation for id: '4ee6b7ec-41cb-49ab-87cd-1125c57f9868' started

######################################################################################


initializing
Note: online_url and serving_urls are deprecated and will be removed in a future release. Use inference instead.
...........
ready


-----------------------------------------------------------------------------------------------
Successfully finished deployment creation, deployment_id='32b3d59e-5a4d-49f0-83f0-2e35ed89fbb1'
-----------------------------------------------------------------------------------------------


Retrieval quality evaluator ID:2bcf7f9a-c31c-4126-9455-574d74898a88


### Evaluator creation for faithfulness

In [None]:
faith_evaluator_dtls = wos_client.llm_metrics.evaluators.add(
    metric_type="faithfulness",
    wx_ai_credentials=WX_AI_CREDENTIALS,
    space_id=space_id,
    question_column=question_column,
    context_columns=context_columns,
    hardware_spec="XL",
    metric_parameters={
        # "attributions_count": 3,
        # "ngrams": 2,
    },
    func_name="faithfulness_with_nlp",
    create_integrated_system=True,
)
print("Faithfulness evaluator ID: " + faith_evaluator_dtls["evaluator_id"])



######################################################################################

Synchronous deployment creation for id: '859e8355-d5fa-4265-9b9a-170ab0b0f135' started

######################################################################################


initializing
Note: online_url and serving_urls are deprecated and will be removed in a future release. Use inference instead.
..............
ready


-----------------------------------------------------------------------------------------------
Successfully finished deployment creation, deployment_id='9f209cd0-2db6-4879-a790-5d32d3e6976f'
-----------------------------------------------------------------------------------------------


Faithfulness evaluator ID:f8cd71f1-1a0c-4cda-b538-d395287d13e2


### Evaluator creation for answer relevance

In [None]:
ar_evaluator_dtls = wos_client.llm_metrics.evaluators.add(
    metric_type="answer_relevance",
    wx_ai_credentials=WX_AI_CREDENTIALS,
    space_id=space_id,
    question_column=question_column,
    hardware_spec="M",
    metric_parameters={},
    func_name="answer_relevance_with_nlp",
    create_integrated_system=True,
)
print("Answer relevance evaluator ID: " + ar_evaluator_dtls["evaluator_id"])



######################################################################################

Synchronous deployment creation for id: '008a8a18-74d4-43f1-be64-a5bd0e4b6147' started

######################################################################################


initializing
Note: online_url and serving_urls are deprecated and will be removed in a future release. Use inference instead.
............
ready


-----------------------------------------------------------------------------------------------
Successfully finished deployment creation, deployment_id='1411609f-d6fd-4a03-b5ea-8bfedd21dfe8'
-----------------------------------------------------------------------------------------------


Answer relevance evaluator ID:a193964b-fce9-4441-8cf5-5ed7c27871aa


## Next steps
Use the generative AI evaluator IDs created above for each metric when you configure the prompt setup.
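
To keep the IDs together for that later step, they can be gathered into a single dictionary. The sketch below assumes only what the cells above show, namely that each returned details object exposes an `evaluator_id` key. The helper name `collect_evaluator_ids` is hypothetical.

```python
import json


def collect_evaluator_ids(**evaluator_details) -> dict:
    """Map each metric name to the evaluator_id found in its details dictionary."""
    return {metric: details["evaluator_id"] for metric, details in evaluator_details.items()}


# Stand-in details dictionary using an ID from the sample output above.
example_ids = collect_evaluator_ids(
    faithfulness={"evaluator_id": "f8cd71f1-1a0c-4cda-b538-d395287d13e2"},
)
print(json.dumps(example_ids, indent=2))
```

In this notebook the call would be `collect_evaluator_ids(retrieval_quality=rq_evaluator_dtls, faithfulness=faith_evaluator_dtls, answer_relevance=ar_evaluator_dtls)`.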