# Exercise 4a - Introduction to LLM Evaluations With TruLens
In this exercise you'll learn how to assess the performance of a RAG application using the TruLens framework. We'll introduce the key concepts involved in testing an LLM-powered application, run a basic answer relevance test, and view the results in a test dashboard.

In [None]:
%load_ext dotenv
%dotenv ../../.env

First, the RAG application is imported and configured. In this case, it is a [LangChain](https://www.langchain.com/) application that wraps an instance of [Azure Information Assistant](https://github.com/microsoft/PubSec-Info-Assistant).

In [None]:
import os
import json
from pathlib import Path

config_file_name: str = "config-4a.json"
current_dir_path: Path = Path(".")
full_path = current_dir_path / config_file_name

with full_path.open() as f:
    config = json.load(f)


In [None]:
import sys
sys.path.append(str(current_dir_path.resolve().parent.parent))
from llm_application.azure_information_assistant_accelerator.wrapper import RAG_from_scratch

rag_chain = RAG_from_scratch(config_data=config)

### App
Now that the target application has been configured, we'll introduce the code that will be used to evaluate the application's performance.

Our first step is to set up an `App` object, which manages the tests we conduct and tracks the results in a local database.

In [None]:
from kjr_llm.app import App

app = App(app_name="exercise4a", reset_database=True)

### Targets
Next we define the target of our tests. A target is an abstraction that enables the framework to communicate with the application being tested. Preconfigured options exist for LangChain and LlamaIndex applications, but in this case we opt for the more flexible `CustomTarget`.

In [None]:
from kjr_llm.targets import CustomTarget

target = CustomTarget(rag_chain)
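
To illustrate the idea of a target as an adapter between the framework and the application, here is a minimal sketch. It does not reflect how `CustomTarget` is actually implemented; the `Target` protocol, `CallableTarget` class, `invoke` method and `echo_app` function are all hypothetical names chosen for illustration.

```python
from typing import Callable, Protocol


class Target(Protocol):
    """Hypothetical interface: anything that can answer a prompt string."""

    def invoke(self, prompt: str) -> str: ...


class CallableTarget:
    """Hypothetical adapter that wraps any callable taking and returning a string."""

    def __init__(self, fn: Callable[[str], str]) -> None:
        self._fn = fn

    def invoke(self, prompt: str) -> str:
        # Delegate to the wrapped application and normalise the result to str.
        return str(self._fn(prompt))


def echo_app(question: str) -> str:
    # Stand-in for a real RAG application.
    return f"You asked: {question}"


sketch_target = CallableTarget(echo_app)
print(sketch_target.invoke("What is TruLens?"))
```

Because the evaluation code only depends on the `invoke` method in this sketch, any application that can be wrapped in a callable could be tested the same way.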

### Defining Tests
In TruLens, a test consists of two elements: prompts and feedback functions (referred to as metrics in this exercise).

##### Prompts

In the context of evaluating LLM-powered applications, prompts refer to the input queries or instructions given to the model to generate responses or predictions. Prompts are critical for guiding the behavior of the model, as they frame the task or question the model is expected to address. By providing a range of carefully designed prompts, evaluators can test the model’s ability to handle different types of input, ensuring it responds accurately and appropriately.

Effective prompt design is essential for thorough evaluation, as it reveals the model's strengths, limitations, and behavior across different scenarios.

##### Prompt Sets
Our evaluation framework provides the `Prompt` and `PromptSet` classes to facilitate loading and interacting with prompts.

A `Prompt` is a single input provided to the target application. A `Prompt` consists of several fields: an input, an optional expected response from the application, and an optional context.

A `PromptSet` is a set of one or more related `Prompt` objects. A `PromptSet` is defined in JSON format. You can see an example of the JSON format in `./relevance-4a.json`.

In the code snippet below, we import the PromptSet class and load our prompts from the file they are defined in, then iterate over the prompt set and print the individual prompts.

In [None]:
from kjr_llm.prompts import PromptSet

prompts_path = current_dir_path / "relevance-4a.json"
prompts = PromptSet.from_json_file(prompts_path)

for prompt in prompts:
    print(prompt)
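
To make the three `Prompt` fields described above more concrete, here is a small standalone sketch using only the standard library. The JSON layout, the field names (`input`, `expected_response`, `context`) and the `SketchPrompt` class are assumptions made for illustration; refer to `./relevance-4a.json` for the format the framework actually expects.

```python
import json
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SketchPrompt:
    """Hypothetical stand-in for the Prompt class described above."""

    input: str
    expected_response: Optional[str] = None
    context: Optional[str] = None


# Hypothetical JSON layout; the real schema is in ./relevance-4a.json.
raw = """
{
  "name": "example-set",
  "prompts": [
    {"input": "What is the capital of France?", "expected_response": "Paris"},
    {"input": "Summarise the leave policy.", "context": "HR handbook, section 3"}
  ]
}
"""

data = json.loads(raw)
sketch_prompts: List[SketchPrompt] = [SketchPrompt(**item) for item in data["prompts"]]

for p in sketch_prompts:
    # Optional fields that were omitted in the JSON fall back to None.
    print(p.input, "| expected:", p.expected_response, "| context:", p.context)
```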

### Metrics
Metrics evaluate a target model's performance when responding to one or more prompts. A metric assesses the model's performance with regard to a specific category, such as the presence of hate speech or the groundedness of the response in the provided context.

Most metrics are backed by an LLM, known as the provider model. This can be (but doesn't have to be) the same model used by the application being evaluated. The metric is essentially a prompt that asks the provider model to score the target application's response against a rubric.

In some cases, for example when attempting to detect the presence of personal information in a response, it is necessary to use a provider model that can be run locally, such as llama3, to ensure that personal information is not exposed to a proprietary model where it could be retained for training purposes.

In the code below, we import and instantiate the Answer Relevance metric, which evaluates the relevance of the LLM response to the input prompt. Note the `openai` property, which selects the provider model used by the metric.

In [None]:
from kjr_llm.metrics import (
    AnswerRelevance
)
from trulens.core.schema import Select

# Metrics to include in the test set.
# Comment or uncomment entries to choose which metrics are evaluated.
metrics = [
    AnswerRelevance().openai
]
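
The idea that a metric is essentially a prompt asking a provider model to score a response against a rubric can be sketched as follows. This is not the framework's or TruLens's actual implementation: the rubric wording, the 0 to 10 scale, the `score_relevance` function and the `stub_provider` stand-in (used in place of a real LLM call) are all assumptions for illustration.

```python
import re
from typing import Callable

# Hypothetical rubric; real metrics use their own wording and scale.
RUBRIC = (
    "Rate how relevant the RESPONSE is to the PROMPT on a scale of 0 to 10, "
    "where 0 is completely irrelevant and 10 is fully relevant. "
    "Reply with the number only.\n\nPROMPT: {prompt}\n\nRESPONSE: {response}"
)


def score_relevance(prompt: str, response: str, provider: Callable[[str], str]) -> float:
    """Ask the provider model for a 0-10 rating and normalise it to 0.0-1.0."""
    reply = provider(RUBRIC.format(prompt=prompt, response=response))
    match = re.search(r"\d+", reply)
    if match is None:
        raise ValueError(f"No score found in provider reply: {reply!r}")
    # Clamp in case the provider returns a number outside the rubric's range.
    return min(max(int(match.group()), 0), 10) / 10


def stub_provider(judge_prompt: str) -> str:
    # Stand-in for a real LLM call; always answers with the same rating.
    return "Score: 8"


print(score_relevance("What is RAG?", "RAG combines retrieval with generation.", stub_provider))
```

Passing the provider in as a plain callable means the same scoring logic could be pointed at a hosted model or a locally run one, which is the flexibility the discussion of local provider models relies on.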

Now that we have some prompts to feed into the target application and a metric to assess its performance, they can be combined to produce a test set. The `TestSet` object provides an `evaluate` method which will execute the contained tests against our target application.

A default provider can be set when creating the test set; it is used by any metric that does not explicitly specify a provider.

In [None]:
from kjr_llm.tests import TestSet
from kjr_llm.provider import OpenAIProvider

gpt_35_turbo_provider = OpenAIProvider(model_name="gpt-3.5-turbo")

# Define our test set
custom_test = TestSet(prompts, metrics, name="Exercise4-openai", 
                      default_provider=gpt_35_turbo_provider)

# Evaluate our test set
result = custom_test.evaluate(target, "Exercise4a")
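
The default provider fallback described above can be pictured with the following standalone sketch. It assumes, purely for illustration, that every prompt is paired with every metric; the `SketchMetric` class, the `plan_tests` function and the provider names are hypothetical and say nothing about how `TestSet` works internally.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class SketchMetric:
    """Hypothetical metric: a name plus an optional explicitly chosen provider."""

    name: str
    provider: Optional[str] = None


def plan_tests(
    prompt_inputs: List[str],
    sketch_metrics: List[SketchMetric],
    default_provider: str,
) -> List[Tuple[str, str, str]]:
    """Pair every prompt with every metric, falling back to the default provider."""
    plan = []
    for prompt_input in prompt_inputs:
        for metric in sketch_metrics:
            # An explicit provider wins; otherwise use the test set's default.
            provider = metric.provider or default_provider
            plan.append((prompt_input, metric.name, provider))
    return plan


plan = plan_tests(
    ["What is RAG?", "Who wrote the report?"],
    [SketchMetric("answer_relevance", provider="openai"), SketchMetric("groundedness")],
    default_provider="gpt-3.5-turbo",
)
for row in plan:
    print(row)
```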

### Dashboard
Once the tests have been executed, the `App` object can be used to run a local test dashboard and peruse the results. The dashboard uses the `streamlit` library and will attempt to open automatically in your default browser.

In [None]:
app.run_dashboard()