# Example 1

In this notebook we present a simple example of using contextcheck to validate LLM responses.

We will talk about:
- Configuration
- Test Scenario
- Test Steps
- Running the Test Scenario

In [1]:
# TODO: Add optional jinja2 templating section or a remark with a link

## Installation

In [2]:
# %pip install contextcheck

## Imports

In [3]:
from contextcheck import TestScenario
from contextcheck.executors.executor import Executor # NOTE RB: Maybe Executor should be at the most outer layer for import
import rich

### Send default request

Let's start by creating a simple YAML file that we will use to send a dummy request to OpenAI.

*When the config is empty, OpenAI's gpt-4o-mini is used.

In [4]:
%%writefile test_scenario_ex1_progress.yaml
config:

steps:
   - What is the capital of Poland?

Overwriting test_scenario_ex1_progress.yaml
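
The `%%writefile` cell magic only works inside IPython/Jupyter. Outside a notebook, the same scenario file can be written with the standard library. A minimal sketch, assuming the same file name and content as the cell above:

```python
from pathlib import Path

# Same content as the %%writefile cell: an empty config and a single step.
scenario_yaml = """\
config:

steps:
   - What is the capital of Poland?
"""

scenario_path = Path("test_scenario_ex1_progress.yaml")
scenario_path.write_text(scenario_yaml, encoding="utf-8")

# Read the file back to confirm what was written.
print(scenario_path.read_text(encoding="utf-8"))
```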


In [5]:
# Create a test scenario
test_scenario = TestScenario.from_yaml("test_scenario_ex1_progress.yaml")

In [6]:
# visualize test scenario
rich.print(test_scenario)

In [7]:
# create executor that uses test scenario
executor = Executor(test_scenario=test_scenario)

In [8]:
# run all test steps
executor.run_all()

2024-10-21 13:55:07.781 | INFO     | contextcheck.executors.executor:run_all:41 - Running scenario
2024-10-21 13:55:07.783 | INFO     | contextcheck.interfaces.interface:__call__:11 - name='What is the capital of Poland?' request=RequestBase(message='What is the capital of Poland?') response=None asserts=[] result=None
2024-10-21 13:55:07.784 | INFO     | contextcheck.interfaces.interface:__call__:11 - message='What is the capital of Poland?'
2024-10-21 13:55:08.435 | INFO     | contextcheck.interfaces.interface:__call__:11 - message='The capital of Poland is Warsaw.' stats=ResponseStats(tokens_request=14, tokens_response=7, tokens_total=21, conn_start_time=16623.944679749, conn_end_time=16624.593853512, conn_duration=0.6491737629985437) id='chatcmpl-AKlI44qlKg6c3Zh5oCKHvYDzxObKi' choi

True

In [9]:
# Once more visualize the test scenario to see the changes
rich.print(test_scenario)

In [10]:
# Response from llm
test_scenario.steps[0].response.message

'The capital of Poland is Warsaw.'

### Config update

We initially left the config empty, but we can easily populate it with the configuration that best fits our needs.

To define the connection to the LLM or RAG system we use `endpoint_under_test`. For demo purposes we will use one of OpenAI's models, which are supported out of the box. For more information please visit [TODO - Link to config]

In [11]:
%%writefile test_scenario_ex1_progress.yaml
config:
   endpoint_under_test:
      kind: openai
      model: gpt-4o   

steps:
   - What is the capital of Poland?


Overwriting test_scenario_ex1_progress.yaml


In [12]:
# Create a test scenario
test_scenario = TestScenario.from_yaml("test_scenario_ex1_progress.yaml")

In [13]:
# visualize test scenario
# Note the change in config from gpt-4o-mini to gpt-4o
rich.print(test_scenario)

In [14]:
# create executor that uses test scenario
executor = Executor(test_scenario=test_scenario)

In [15]:
executor.run_all()

2024-10-21 13:55:09.195 | INFO     | contextcheck.executors.executor:run_all:41 - Running scenario
2024-10-21 13:55:09.196 | INFO     | contextcheck.interfaces.interface:__call__:11 - name='What is the capital of Poland?' request=RequestBase(message='What is the capital of Poland?') response=None asserts=[] result=None
2024-10-21 13:55:09.197 | INFO     | contextcheck.interfaces.interface:__call__:11 - message='What is the capital of Poland?'
2024-10-21 13:55:09.712 | INFO     | contextcheck.interfaces.interface:__call__:11 - message='The capital of Poland is Warsaw.' stats=ResponseStats(tokens_request=14, tokens_response=7, tokens_total=21, conn_start_time=16625.356580305, conn_end_time=16625.870958239, conn_duration=0.5143779339996399) id='chatcmpl-AKlI5ReiPtlzSw9nsLnnfKOAFOKVd' choi

True

In [16]:
# Response from llm
test_scenario.steps[0].response.message

'The capital of Poland is Warsaw.'

##### Model's Parameters update

In the config we can also set model parameters such as `temperature` and `max_tokens`.

In [17]:
# TODO: Check this after rebase with contextcheck changes
# TODO: I'd add a possibility to transfer parameters through step/request

In [18]:
%%writefile test_scenario_ex1_progress.yaml
config:
   endpoint_under_test:
      kind: openai
      model: gpt-4o-mini
      temperature: 2.0
      max_tokens: 64

steps:
   - Write a poem about LLMs


Overwriting test_scenario_ex1_progress.yaml


In [19]:
# Create a test scenario
test_scenario = TestScenario.from_yaml("test_scenario_ex1_progress.yaml")

In [20]:
# visualize test scenario
rich.print(test_scenario)

In [21]:
# create executor that uses test scenario
executor = Executor(test_scenario=test_scenario)

In [22]:
executor.run_all()

2024-10-21 13:55:10.623 | INFO     | contextcheck.executors.executor:run_all:41 - Running scenario
2024-10-21 13:55:10.624 | INFO     | contextcheck.interfaces.interface:__call__:11 - name='Write a poem about LLMs' request=RequestBase(message='Write a poem about LLMs') response=None asserts=[] result=None
2024-10-21 13:55:10.625 | INFO     | contextcheck.interfaces.interface:__call__:11 - message='Write a poem about LLMs'
2024-10-21 13:55:14.473 | INFO     | contextcheck.interfaces.interface:__call__:11 - message="In the realm where circuits hum and glow,  \nA web of thoughts begins to flow,  \nLines of code and dreams entwined,  \nWhispers of knowledge, redefined.  \n\nLanguage born from data's dance,  \nA tapestry of words and chance,  \nFrom whispers soft to thunderous prose,  \nThe

True

In [23]:
# Response from llm
test_scenario.steps[0].response.message

"In the realm where circuits hum and glow,  \nA web of thoughts begins to flow,  \nLines of code and dreams entwined,  \nWhispers of knowledge, redefined.  \n\nLanguage born from data's dance,  \nA tapestry of words and chance,  \nFrom whispers soft to thunderous prose,  \nThe heart of a machine, it gently grows.  \n\nNot just tools in a sterile space,  \nBut echo chambers of the human grace,  \nImitating voices, past and near,  \nIn pixels bright, they bring us near.  \n\nThey weave the stories, craft the rhyme,  \nIn endless realms, they stretch through time,  \nFrom ancient tomes to futuristic tales,  \nThey navigate the vast, where wonder prevails.  \n\nYet in this poise, a shadow looms,  \nThe weight of thought in quiet rooms,  \nEchoes of the minds they seek to share,  \nWith wisdom's thread, they must beware.  \n\nFor in the quest with lines they trace,  \nThe essence of truth, they must embrace,  \nIn every question, every plea,  \nA mirror held to humanity.  \n\nSo let us guid

### Simple scenario

Let's create a simple test scenario that shows how contextcheck works.
We will use simple asserts based on Python's built-in `eval` function.


This is also a good place to introduce the nomenclature for test steps.

Each step can be defined by its `name` (optional), `request` and `asserts` (optional):
- `name` is the name of the test step
- `request` is the message sent to the LLM
- `asserts` is a list of assertions checked against the LLM response

NOTE: By default, each assert is treated as an `eval` assertion.
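
The logs earlier in this notebook show that a step given as a bare string ends up with that string as both its `name` and its request message, and with an empty `asserts` list. The sketch below mimics that normalization with plain dictionaries. `normalize_step` is a hypothetical helper written for illustration, and the fallback name for unnamed dictionary steps is an assumption; none of this is contextcheck's own code.

```python
def normalize_step(step):
    """Return a dict with `name`, `request` and `asserts` keys (illustrative only)."""
    if isinstance(step, str):
        # Bare string: the text serves as both the name and the request message.
        return {"name": step, "request": {"message": step}, "asserts": []}

    request = step["request"]
    if isinstance(request, str):
        # Short form `request: "text"` becomes `request: {message: "text"}`.
        request = {"message": request}

    return {
        # Assumption: an unnamed step falls back to its request message.
        "name": step.get("name", request["message"]),
        "request": request,
        "asserts": list(step.get("asserts", [])),
    }


short = normalize_step("What is the capital of Poland?")
full = normalize_step(
    {
        "name": "Write success",
        "request": 'Please write only "success" as a response',
        "asserts": ['"success" == response.message'],
    }
)
print(short)
print(full)
```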

In [24]:
%%writefile test_scenario_ex1_progress.yaml

config:
   endpoint_under_test:
      kind: openai
      model: gpt-4o

steps:
   - name: Write sucess
     request: 'Please write only "success" as a response'
     asserts:
        - '"success" == response.message'
        - 'response.stats.conn_duration < 10'


Overwriting test_scenario_ex1_progress.yaml


In [25]:
# Create a test scenario
test_scenario = TestScenario.from_yaml("test_scenario_ex1_progress.yaml")

In [26]:
# visualize test scenario
rich.print(test_scenario)

In [27]:
# create executor that uses test scenario
executor = Executor(test_scenario=test_scenario)

In [28]:
executor.run_all()

2024-10-21 13:55:14.576 | INFO     | contextcheck.executors.executor:run_all:41 - Running scenario
2024-10-21 13:55:14.578 | INFO     | contextcheck.interfaces.interface:__call__:11 - name='Write sucess' request=RequestBase(message='Please write only "success" as a response') response=None asserts=[AssertionEval(result=None, eval='"success" == response.message'), AssertionEval(result=None, eval='response.stats.conn_duration < 10')] result=None
2024-10-21 13:55:14.579 | INFO     | contextcheck.interfaces.interface:__call__:11 - message='Please write only "success" as a response'
2024-10-21 13:55:15.230 | INFO     | contextcheck.interfaces.interface:__call__:11 - message='Success' stats=ResponseStats(tokens_request=16, tokens_response=1, tokens_total=17, conn_start_time=16630.741020135, 

False

In [29]:
rich.print(test_scenario)

In [30]:
# Show the result
test_scenario.show_test_step_results()

------------
Name: Write sucess; Result: False

Assertion: ""success" == response.message", Result: False
Assertion: "response.stats.conn_duration < 10", Result: True
------------
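
The first assertion fails because the comparison is case-sensitive: the log above shows that the model answered `'Success'`, while the assertion expects `'success'`. The step result is `False` even though the second assertion passed, which is consistent with a step passing only when all of its assertions pass. A small sketch of both points in plain Python (the combination rule is inferred from the output above, not taken from contextcheck's source):

```python
message = "Success"  # the model's answer, as shown in the log above

exact_match = "success" == message            # case-sensitive comparison
relaxed_match = "success" == message.lower()  # a more tolerant alternative

# Assumption: a step passes only if every one of its assertions passes.
assertion_results = [exact_match, True]  # second entry: the duration check passed
step_result = all(assertion_results)

print(exact_match, relaxed_match, step_result)
```

Since `eval` assertions are ordinary Python expressions, writing the check as `"success" == response.message.lower()` should make it robust to capitalization.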


### Scenario extension

With the introduction under our belt, we will extend the scenario built so far with new types of assertions and explain the relevant topics in more depth.

#### Explain config

To extend our scenario we need to introduce new config features that some of the assertions require.

In short, the config defines the connection to the LLM (or RAG system). We provide implementations for several popular LLM providers, which lets you be productive from the start. For more information about them please go to [Link here].

There are three components used in config:
1. `endpoint_under_test` - defines the tested endpoint
2. `default_request` - defines the defaults for both the `endpoint_under_test` and `eval_endpoint` (TODO: Please someone confirm that)
3. `eval_endpoint` - defines the endpoint which is used for evaluating the responses from `endpoint_under_test`

For more information about configuration please go to [TODO - INSERT LINK HERE]

TODO: What's the purpose of `default_request` when the same configuration can be given to `endpoint_under_test` or `eval_endpoint`?

In [31]:
# Let's use our new knowledge and define a scenario with LLM evaluation - full explanation later
# In short, `llm_metric` uses another LLM to evaluate the response; `model-grading-qa` in particular uses
# another LLM to check whether the response satisfies an assertion defined by the user.
# TODO: We cannot have multiple assertions under the same llm metric

In [32]:
%%writefile test_scenario_ex1_progress.yaml
config:
   endpoint_under_test:
      kind: openai
      model: gpt-4o-mini
      temperature: 0.2
   eval_endpoint: # Needed for llm_metric assertions
      kind: openai
      model: gpt-4o
      temperature: 0.0

steps:
  - name: Test model grading QA evaluator
    request:
      message: "Please write a 5 line poem about AI."
    asserts:
      - llm_metric: model-grading-qa
        assertion: Text should be a poem about AI.
      - llm_metric: model-grading-qa
        assertion: Text should be a report on taxes. # Misleading assertion for demo purposes

Overwriting test_scenario_ex1_progress.yaml


In [33]:
# Create a test scenario
test_scenario = TestScenario.from_yaml("test_scenario_ex1_progress.yaml")

In [34]:
# visualize test scenario
rich.print(test_scenario)

In [35]:
# create executor that uses test scenario
executor = Executor(test_scenario=test_scenario)

In [36]:
executor.run_all()

2024-10-21 13:55:15.364 | INFO     | contextcheck.executors.executor:run_all:41 - Running scenario
2024-10-21 13:55:15.366 | INFO     | contextcheck.interfaces.interface:__call__:11 - name='Test model grading QA evaluator' request=RequestBase(message='Please write a 5 line poem about AI.') response=None asserts=[AssertionLLM(result=None, llm_metric='model-grading-qa', reference='', assertion='Text should be a poem about AI.'), AssertionLLM(result=None, llm_metric='model-grading-qa', reference='', assertion='Text should be a report on taxes.')] result=None
2024-10-21 13:55:15.367 | INFO     | contextcheck.interfaces.interface:__call__:11 - message='Please write a 5 line poem about AI.'
2024-10-21 13:55:16.228 | INFO     | contextcheck.interfaces.interface:__call__:11 - message='In silen

False

In [37]:
rich.print(test_scenario)

In [38]:
# Show the result of each step
test_scenario.show_test_step_results()

------------
Name: Test model grading QA evaluator; Result: False

Assertion: "Text should be a poem about AI.", Result: True
Assertion: "Text should be a report on taxes.", Result: False
------------


#### Extra: Adding custom endpoint

In [39]:
# Logic or a link for creating and using custom endpoint should be added somewhere here

#### Explain assertions

There are three families of assertions (two of which we have already seen and used):
1. `eval` assertion - evaluates a string as Python code using (you guessed it) `eval`
2. `llm_metric` assertion - uses another LLM, defined in `eval_endpoint`, to assess the performance of the `endpoint_under_test`
3. `deterministic` assertion - performs string checks such as `contains` and `contains-any`

In [40]:
# NOTE RB: Metrics should be easy to extend, i.e. if someone wants to add a metric we should provide a simple way
# to do that, which should not break any functionalities like result summarization or time statistics etc.
# NOTE: How detailed should be the explanations? And should each sub metric like llm_metric-hallucination be mentioned, or should we link the docs instead? 

##### Explain eval assertions

An `eval` assertion uses Python's built-in `eval` function, which evaluates a string as a Python expression. The expression has access to the response model, which in its base form includes the response from the `endpoint_under_test` and the time statistics (see the `ConnectorStats` model).
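
To make the mechanism concrete, here is a minimal sketch of evaluating an assertion string with `eval`. The `SimpleNamespace` objects are stand-ins for the real response model, and `run_eval_assertion` is a hypothetical helper, not contextcheck's implementation.

```python
from types import SimpleNamespace


def run_eval_assertion(expression, response):
    """Evaluate `expression` with `response` in scope and return a bool."""
    # Builtins are emptied to limit what the expression can reach. This is not a
    # real sandbox: only evaluate assertion strings that you trust.
    return bool(eval(expression, {"__builtins__": {}}, {"response": response}))


# Stand-in for a response carrying a message and connection statistics.
response = SimpleNamespace(
    message="The capital of Poland is Warsaw.",
    stats=SimpleNamespace(conn_duration=0.65),
)

print(run_eval_assertion('"Warsaw" in response.message', response))
print(run_eval_assertion("response.stats.conn_duration < 10", response))
print(run_eval_assertion('"success" == response.message', response))
```

Because the assertion is arbitrary Python, scenario files that contain `eval` assertions should be treated like code and loaded only from trusted sources.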

##### Explain llm assertions

An `llm_metric` assertion uses another LLM to assess the response of the `endpoint_under_test`. For this, an `eval_endpoint` must be added to the config section to define the evaluation endpoint. It can be one of the available endpoints (link here) or one created by the user (link here).

There are 5 specific sub-metrics associated with it:
- `hallucination` (available only for RAG systems) - This metric assesses whether the LLM's answer includes information not present in the provided reference data.
- `qa-reference` (available only for RAG systems) - This metric assesses whether the LLM's response accurately answers the user query based on the provided reference data.
- `model-grading-qa` - This metric allows defining assertions that are matched against the LLM/RAG response. Think of it as "regular expressions defined using natural language".
- `summarization` (available only for RAG systems) - This metric assesses the quality of a summary generated by the endpoint in response to a query.
- `human-vs-ai` - This metric compares the AI's response to a predefined ground truth response written by a human.

For more in depth explanations and examples please go to [TODO - Insert link here]
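
The idea behind `model-grading-qa` can be sketched without any network calls: a grader is shown the response together with a natural-language assertion and returns a verdict. Everything below is hypothetical, including the prompt wording, the `grade` helper, the sample text and the keyword-based `fake_grader` that stands in for the LLM behind `eval_endpoint`. contextcheck's actual prompts and parsing may differ.

```python
from typing import Callable

PROMPT_TEMPLATE = (
    "You are grading a response.\n"
    "Response: {response}\n"
    "Assertion: {assertion}\n"
    "Answer with a single word: YES if the assertion holds, otherwise NO."
)


def grade(response: str, assertion: str, grader: Callable[[str], str]) -> bool:
    """Ask `grader` whether `assertion` holds for `response`."""
    prompt = PROMPT_TEMPLATE.format(response=response, assertion=assertion)
    verdict = grader(prompt).strip().upper()
    return verdict.startswith("YES")


def fake_grader(prompt: str) -> str:
    """Offline stand-in for an LLM: says YES only to assertions mentioning a poem."""
    assertion_line = next(
        line for line in prompt.splitlines() if line.startswith("Assertion:")
    )
    return "YES" if "poem" in assertion_line else "NO"


sample_text = "Circuits hum and softly glow."  # made-up single-line response
print(grade(sample_text, "Text should be a poem about AI.", fake_grader))
print(grade(sample_text, "Text should be a report on taxes.", fake_grader))
```

Passing the grader in as a callable keeps the sketch testable offline; in a real setup that callable would wrap a request to the evaluation endpoint.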

##### Explain deterministic assertions

A `deterministic` assertion provides a way to check the content of the response through string comparisons such as `contains` or `contains-any`.
To use a `deterministic` assertion, set the `kind` keyword to the assertion type (see the final example).

For more information please go to [Link here]
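
In plain Python, such checks reduce to substring tests. The sketch below shows what `contains` and `contains-any` might look like; the function names and the case-sensitive behaviour are assumptions made for illustration, not contextcheck's implementation.

```python
def contains(message: str, expected: str) -> bool:
    """True if `expected` occurs somewhere in `message`."""
    return expected in message


def contains_any(message: str, candidates: list[str]) -> bool:
    """True if at least one of `candidates` occurs in `message`."""
    return any(candidate in message for candidate in candidates)


message = "The capital of France is Paris."  # made-up response text
print(contains(message, "Paris"))
print(contains_any(message, ["Lyon", "Paris"]))
print(contains_any(message, ["Berlin", "Rome"]))
```

Unlike `llm_metric` assertions, these checks need no evaluation endpoint and give the same result on every run for a given response.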

## Final scenario

In [41]:
# When the test scenario is finally ready we can load it
test_scenario_file_path = "../tests/scenario_example1.yaml"
test_scenario = TestScenario.from_yaml(file_path=test_scenario_file_path)

In [42]:
# Inspect the structure of test_scenario
rich.print(test_scenario)

In [43]:
# Initiate executor which runs test scenario
executor = Executor(test_scenario=test_scenario)

In [44]:
# Run test scenario
executor.run_all()

2024-10-21 13:55:20.316 | INFO     | contextcheck.executors.executor:run_all:41 - Running scenario
2024-10-21 13:55:20.318 | INFO     | contextcheck.interfaces.interface:__call__:11 - name='Write sucess' request=RequestBase(message='Please write only "success" as a response') response=None asserts=[AssertionEval(result=None, eval='"success" == response.message'), AssertionEval(result=None, eval='response.stats.conn_duration < 10')] result=None
2024-10-21 13:55:20.319 | INFO     | contextcheck.interfaces.interface:__call__:11 - message='Please write only "success" as a response'
2024-10-21 13:55:20.862 | INFO     | contextcheck.interfaces.interface:__call__:11 - message='success' stats=ResponseStats(tokens_request=16, tokens_response=1, tokens_total=17, conn_start_time=16636.481612781, 

False

In [45]:
# NOTE RB: Maybe executor should copy the test scenario
# Inspect updated test_scenario
rich.print(test_scenario)

In [46]:
test_scenario.show_test_step_results()

------------
Name: Write sucess; Result: True

Assertion: ""success" == response.message", Result: True
Assertion: "response.stats.conn_duration < 10", Result: True
------------
Name: Capital of Poland; Result: True

Assertion: ""Warsaw" in response.message", Result: True
------------
Name: Test model grading QA evaluator; Result: False

Assertion: "Text should be a poem about AI.", Result: True
Assertion: "Text should be a report on taxes.", Result: False
------------
Name: Deterministic assertion test; Result: True

Assertion: "Paris", Result: True
------------


### Execute scenario using ccheck command

In [47]:
# We can also run contextcheck from the command line
!ccheck --output-type console --filename ../tests/scenario_example1.yaml

2024-10-21 13:55:25.772 | INFO     | contextcheck.executors.executor:run_all:41 - Running scenario
TestStep(
    name='Write sucess',
    request=RequestBase(message='Please write only "success" as a response'),
    response=None,
    asserts=[
        AssertionEval(result=None, eval='"success" == response.message'),
        AssertionEval(result=None, eval='response.stats.conn_duration < 10')
    ],
    result=None
)
╭──────────────────────────────────────────────────────────────────────────────╮
│ 🎈 Request:                                                                  │
│ RequestBase(message='Pleas