# Example 1

This notebook presents a simple case of using contextcheck to validate LLM responses.

We will talk about:
- Configuration
- Test Scenario
- Test Steps
- Running the Test Scenario

In [1]:
# TODO: Add optional jinja2 templating section or a remark with a link

## Installation

In [2]:
# %pip install contextcheck

## Imports

In [3]:
from contextcheck import TestScenario
from contextcheck.executors.executor import Executor # NOTE RB: Maybe Executor should be at the most outer layer for import
import yaml
import rich

### Helper functions

Mostly used to showcase the results of executed tests.

In [4]:
def show_test_step_results(test_scenario: TestScenario):
    print("-"*12)
    for step in test_scenario.steps:
        print(f"Name: {step.name}; Result: {step.result}\n")
        for assertion in step.asserts:
            assertion_dumped = assertion.model_dump()
            assertion_ = assertion.eval if "eval" in assertion_dumped else assertion.assertion
            print(f'Assertion: "{assertion_}", Result: {assertion.result}')
        print("-"*12)

### Send default request

Let's start by creating a simple YAML document that we will use to send a dummy request to OpenAI.

Note: when `config` is empty, OpenAI's gpt-4o-mini is used.

In [5]:
# Define configuration in yaml - for demonstration purposes it's done in notebook
yaml_string = """
config:

steps:
   - What is the capital of Poland?
"""

yaml_from_string = yaml.safe_load(yaml_string)

In [6]:
# Create a test scenario
test_scenario = TestScenario.model_validate(yaml_from_string)

In [7]:
# visualize test scenario
rich.print(test_scenario)

In [8]:
# create executor that uses test scenario
executor = Executor(test_scenario=test_scenario)

In [9]:
# run all test steps
executor.run_all()

2024-09-26 15:20:23.774 | INFO     | contextcheck.executors.executor:run_all:41 - Running scenario
2024-09-26 15:20:23.775 | INFO     | contextcheck.interfaces.interface:__call__:11 - name='What is the capital of Poland?' request=RequestBase(message='What is the capital of Poland?') response=None asserts=[] result=None
2024-09-26 15:20:23.776 | INFO     | contextcheck.interfaces.interface:__call__:11 - message='What is the capital of Poland?'


2024-09-26 15:20:24.413 | INFO     | contextcheck.interfaces.interface:__call__:11 - message='The capital of Poland is Warsaw.' stats=ResponseStats(tokens_request=14, tokens_response=7, tokens_total=21, conn_start_time=25028.39479685, conn_end_time=25029.029330006, conn_duration=0.6345331560005434) id='chatcmpl-ABihsRTVsu13ak7iVreYxvEilxmGE' choices=[{'finish_reason': 'stop', 'index': 0, 'logprobs': None, 'message': {'content': 'The capital of Poland is Warsaw.', 'role': 'assistant', 'refusal': None}}] created=1727356824 model='gpt-4o-mini-2024-07-18' object='chat.completion' system_fingerprint='fp_1bb46167f9' usage={'completion_tokens': 7, 'prompt_tokens': 14, 'total_tokens': 21, 'completion_tokens_details': {'reasoning_tokens': 0}} config=EndpointConfig(kind='openai', url='', model='gpt-4o-mini', additional_headers={}, provider=None, temperature=None, max_tokens=None, top_k=3, use_ranker=True, collection_name='default')


True

In [10]:
# Once more visualize the test scenario to see the changes
rich.print(test_scenario)

In [11]:
# Response from llm
test_scenario.steps[0].response.message

'The capital of Poland is Warsaw.'

### Config update

We initially left the config empty, but we can easily populate it with configuration that best fits our needs.

To define the connection to the LLM or RAG system we use `endpoint_under_test`. For demo purposes we will use one of OpenAI's models, which are supported by default. For more information please visit [TODO - Link to config]

In [12]:
yaml_string = """
config:
   endpoint_under_test:
      kind: openai
      model: gpt-4o   

steps:
   - What is the capital of Poland?
"""

yaml_from_string = yaml.safe_load(yaml_string)

In [13]:
# Create a test scenario
test_scenario = TestScenario.model_validate(yaml_from_string)

In [14]:
# visualize test scenario
# Note the change in config from gpt-4o-mini to gpt-4o
rich.print(test_scenario)

In [15]:
# create executor that uses test scenario
executor = Executor(test_scenario=test_scenario)

In [16]:
executor.run_all()

2024-09-26 15:20:24.684 | INFO     | contextcheck.executors.executor:run_all:41 - Running scenario
2024-09-26 15:20:24.687 | INFO     | contextcheck.interfaces.interface:__call__:11 - name='What is the capital of Poland?' request=RequestBase(message='What is the capital of Poland?') response=None asserts=[] result=None
2024-09-26 15:20:24.691 | INFO     | contextcheck.interfaces.interface:__call__:11 - message='What is the capital of Poland?'


2024-09-26 15:20:25.398 | INFO     | contextcheck.interfaces.interface:__call__:11 - message='The capital of Poland is Warsaw.' stats=ResponseStats(tokens_request=14, tokens_response=7, tokens_total=21, conn_start_time=25029.309263794, conn_end_time=25030.014328409, conn_duration=0.705064615001902) id='chatcmpl-ABihtG9ggk95ihNvIXqrXJUBtLri1' choices=[{'finish_reason': 'stop', 'index': 0, 'logprobs': None, 'message': {'content': 'The capital of Poland is Warsaw.', 'role': 'assistant', 'refusal': None}}] created=1727356825 model='gpt-4o-2024-05-13' object='chat.completion' system_fingerprint='fp_e375328146' usage={'completion_tokens': 7, 'prompt_tokens': 14, 'total_tokens': 21, 'completion_tokens_details': {'reasoning_tokens': 0}} config=EndpointConfig(kind='openai', url='', model='gpt-4o', additional_headers={}, provider=None, temperature=None, max_tokens=None, top_k=3, use_ranker=True, collection_name='default')


True

In [17]:
# Response from llm
test_scenario.steps[0].response.message

'The capital of Poland is Warsaw.'

### Simple scenario

Let's create a simple test scenario that will help you understand how contextcheck works.
We will use simple asserts that are based on Python's built-in `eval` function.


This is also a good place to introduce the nomenclature for test steps.

Each step can be defined by its `name` (optional), `request` and `asserts` (optional):
- `name` is the name of the test step
- `request` is the message sent to the LLM
- `asserts` is a list of assertions made on the LLM response

NOTE: By default each assert is treated as an `eval` assertion
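The shorthand forms of a step can be pictured as being expanded into one explicit shape. The sketch below is not contextcheck's own code: `normalize_step` is a hypothetical helper working on plain dictionaries, and the expansion rules are inferred from the log output in this notebook (a bare string becomes both the step name and the request message, a string `request` becomes a `message`, and a bare string assert becomes an `eval` assertion).

```python
from typing import Any


def normalize_step(step: Any) -> dict:
    """Expand a step written in shorthand into an explicit mapping (illustrative only)."""
    if isinstance(step, str):
        # A bare string serves as both the step name and the request message.
        return {"name": step, "request": {"message": step}, "asserts": []}

    request = step["request"]
    if isinstance(request, str):
        # A string request is shorthand for a mapping with a `message` key.
        request = {"message": request}

    asserts = [
        # A bare string assert is treated as an `eval` assertion.
        {"eval": item} if isinstance(item, str) else item
        for item in step.get("asserts", [])
    ]
    # Assumption: a missing name is represented as None in this sketch.
    return {"name": step.get("name"), "request": request, "asserts": asserts}


short = normalize_step("What is the capital of Poland?")
full = normalize_step(
    {
        "name": "Write sucess",
        "request": 'Please write only "success" as a response',
        "asserts": ['"success" == response.message'],
    }
)
print(short)
print(full)
```

In the notebook itself this expansion is handled by `TestScenario.model_validate`, so the helper is only meant to make the naming of the three fields concrete.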

In [18]:
# Reuse yaml from previous example and extend it
yaml_string = """
config:
   endpoint_under_test:
      kind: openai
      model: gpt-4o   

steps:
   - name: Write sucess
     request: 'Please write only "success" as a response'
     asserts:
        - '"success" == response.message'
        - 'response.stats.conn_duration < 10'
"""

yaml_from_string = yaml.safe_load(yaml_string)

In [19]:
# Create a test scenario
test_scenario = TestScenario.model_validate(yaml_from_string)

In [20]:
# visualize test scenario
rich.print(test_scenario)

In [21]:
# create executor that uses test scenario
executor = Executor(test_scenario=test_scenario)

In [22]:
executor.run_all()

2024-09-26 15:20:26.760 | INFO     | contextcheck.executors.executor:run_all:41 - Running scenario
2024-09-26 15:20:26.762 | INFO     | contextcheck.interfaces.interface:__call__:11 - name='Write sucess' request=RequestBase(message='Please write only "success" as a response') response=None asserts=[AssertionEval(result=None, eval='"success" == response.message'), AssertionEval(result=None, eval='response.stats.conn_duration < 10')] result=None
2024-09-26 15:20:26.763 | INFO     | contextcheck.interfaces.interface:__call__:11 - message='Please write only "success" as a response'


2024-09-26 15:20:27.261 | INFO     | contextcheck.interfaces.interface:__call__:11 - message='success' stats=ResponseStats(tokens_request=16, tokens_response=1, tokens_total=17, conn_start_time=25031.381730285, conn_end_time=25031.877946775, conn_duration=0.49621648999891477) id='chatcmpl-ABihvJYAhRnRUmB3wWjzOx1D8F2Bv' choices=[{'finish_reason': 'stop', 'index': 0, 'logprobs': None, 'message': {'content': 'success', 'role': 'assistant', 'refusal': None}}] created=1727356827 model='gpt-4o-2024-05-13' object='chat.completion' system_fingerprint='fp_e375328146' usage={'completion_tokens': 1, 'prompt_tokens': 16, 'total_tokens': 17, 'completion_tokens_details': {'reasoning_tokens': 0}} config=EndpointConfig(kind='openai', url='', model='gpt-4o', additional_headers={}, provider=None, temperature=None, max_tokens=None, top_k=3, use_ranker=True, collection_name='default')
2024-09-26 15:20:27.262 | INFO     | contextchec

True

In [23]:
rich.print(test_scenario)

In [24]:
# Show the result
show_test_step_results(test_scenario=test_scenario)

------------
Name: Write sucess; Result: True

Assertion: ""success" == response.message", Result: True
Assertion: "response.stats.conn_duration < 10", Result: True
------------


### Scenario extension

With the introduction under our belt, we will extend the scenario built so far with new types of assertions and explain the relevant topics in more depth.

#### Explain config

To extend our scenario we need to introduce new config features that some of the assertions require.

In short, config defines the connection to an LLM (or RAG system). We provide implementations for several popular LLM providers, which lets you be productive from the start. For more info about them please go to [Link here].

There are three components used in config:
1. `endpoint_under_test` - defines the tested endpoint
2. `default_request` - defines the defaults for both the `endpoint_under_test` and `eval_endpoint` (TODO: Please someone confirm that)
3. `eval_endpoint` - defines the endpoint which is used for evaluating the responses from `endpoint_under_test`

For more information about configuration please go to [TODO - INSERT LINK HERE]
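Because `llm_metric` assertions rely on `eval_endpoint`, it can help to check a parsed scenario for that key before running it. The following is a minimal sketch on the plain dictionary returned by `yaml.safe_load`; `missing_eval_endpoint` is a hypothetical helper and not part of contextcheck, and whether contextcheck performs a similar check itself is not covered here.

```python
def missing_eval_endpoint(scenario: dict) -> bool:
    """Return True if an assert uses `llm_metric` but config has no `eval_endpoint`."""
    # An empty `config:` key parses to None, so fall back to an empty dict.
    config = scenario.get("config") or {}
    uses_llm_metric = any(
        isinstance(item, dict) and "llm_metric" in item
        for step in scenario.get("steps", [])
        if isinstance(step, dict)  # bare string steps carry no asserts
        for item in step.get("asserts", [])
    )
    return uses_llm_metric and "eval_endpoint" not in config


scenario = {
    "config": {"endpoint_under_test": {"kind": "openai", "model": "gpt-4o-mini"}},
    "steps": [
        {
            "name": "Test model grading QA evaluator",
            "request": {"message": "Please write a 5 line poem about AI."},
            "asserts": [
                {"llm_metric": "model-grading-qa", "assertion": "Text should be a poem about AI."}
            ],
        }
    ],
}
print(missing_eval_endpoint(scenario))

# Adding an evaluation endpoint satisfies the check.
scenario["config"]["eval_endpoint"] = {"kind": "openai", "model": "gpt-4o"}
print(missing_eval_endpoint(scenario))
```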

TODO: What's the purpose of `default_request` when the same configuration can be given to `endpoint_under_test` or `eval_endpoint`?

In [25]:
# Let's use our new knowledge and define a scenario with LLM evaluation (full explanation later).
# In short, `llm_metric` uses another LLM to evaluate the response, and `model-grading-qa` in particular
# uses another LLM to check whether the response satisfies an assertion written by the user.
# TODO: We cannot have multiple assertions under the same llm metric
yaml_config_1 = """
config:
   endpoint_under_test:
      kind: openai
      model: gpt-4o-mini
      temperature: 0.2
   eval_endpoint: # Needed for llm_metric assertions
      kind: openai
      model: gpt-4o
      temperature: 0.0

steps:
  - name: Test model grading QA evaluator
    request:
      message: "Please write a 5 line poem about AI."
    asserts:
      - llm_metric: model-grading-qa
        assertion: Text should be a poem about AI.
      - llm_metric: model-grading-qa
        assertion: Text should be a report on taxes. # Misleading assertion for demo purposes
"""

yaml_from_string = yaml.safe_load(yaml_config_1)
yaml_from_string

{'config': {'endpoint_under_test': {'kind': 'openai',
   'model': 'gpt-4o-mini',
   'temperature': 0.2},
  'eval_endpoint': {'kind': 'openai', 'model': 'gpt-4o', 'temperature': 0.0}},
 'steps': [{'name': 'Test model grading QA evaluator',
   'request': {'message': 'Please write a 5 line poem about AI.'},
   'asserts': [{'llm_metric': 'model-grading-qa',
     'assertion': 'Text should be a poem about AI.'},
    {'llm_metric': 'model-grading-qa',
     'assertion': 'Text should be a report on taxes.'}]}]}

In [26]:
# Create a test scenario
test_scenario = TestScenario.model_validate(yaml_from_string)

In [27]:
# visualize test scenario
rich.print(test_scenario)

In [28]:
# create executor that uses test scenario
executor = Executor(test_scenario=test_scenario)

In [29]:
executor.run_all()

2024-09-26 15:20:27.757 | INFO     | contextcheck.executors.executor:run_all:41 - Running scenario
2024-09-26 15:20:27.759 | INFO     | contextcheck.interfaces.interface:__call__:11 - name='Test model grading QA evaluator' request=RequestBase(message='Please write a 5 line poem about AI.') response=None asserts=[AssertionLLM(result=None, llm_metric='model-grading-qa', reference='', assertion='Text should be a poem about AI.'), AssertionLLM(result=None, llm_metric='model-grading-qa', reference='', assertion='Text should be a report on taxes.')] result=None
2024-09-26 15:20:27.761 | INFO     | contextcheck.interfaces.interface:__call__:11 - message='Please write a 5 line poem about AI.'
2024-09-26 15:20:29.085 | INFO     | contextcheck.interfaces.interface:__call__:11 - message='In circu

False

In [30]:
rich.print(test_scenario)

In [31]:
# Show the result of each step
show_test_step_results(test_scenario=test_scenario)

------------
Name: Test model grading QA evaluator; Result: False

Assertion: "Text should be a poem about AI.", Result: True
Assertion: "Text should be a report on taxes.", Result: False
------------


#### Extra: Adding custom endpoint

In [32]:
# Logic or a link for creating and using custom endpoint should be added somewhere here

#### Explain assertions

There are three families of assertions (two of which we have already used):
1. `eval` assertion - converts a string to python code using (you guessed it) eval
2. `llm_metric` assertion - uses another llm defined in `eval_endpoint` to assess the `endpoint_under_test` performance
3. `deterministic` assertion - does string assessments like contains, contains-any etc.
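One way to tell the three families apart in a parsed YAML file is by the keys each assert carries. The sketch below is an illustration on plain dictionaries; `assertion_family` is a hypothetical helper, and the `kind` / `assertion` key layout used for the deterministic example is an assumption based on the descriptions in this notebook rather than a confirmed schema.

```python
def assertion_family(assertion) -> str:
    """Classify a raw assert entry as 'eval', 'llm_metric' or 'deterministic'."""
    # A bare string, or a mapping with an `eval` key, is an eval assertion.
    if isinstance(assertion, str) or "eval" in assertion:
        return "eval"
    if "llm_metric" in assertion:
        return "llm_metric"
    if "kind" in assertion:
        return "deterministic"
    raise ValueError(f"Unrecognised assertion: {assertion!r}")


examples = [
    '"Warsaw" in response.message',
    {"llm_metric": "model-grading-qa", "assertion": "Text should be a poem about AI."},
    {"kind": "contains", "assertion": "Paris"},  # key layout assumed for illustration
]
for example in examples:
    print(assertion_family(example))
```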

In [33]:
# NOTE RB: Metrics should be easily extensible, i.e. if someone wants to add a metric we should provide a simple way
# to do that, which should not break any functionality like result summarization or time statistics etc.
# NOTE: How detailed should the explanations be? And should each sub-metric like llm_metric-hallucination be mentioned, or should we link the docs instead?

##### Explain eval assertions

`eval` assertion uses Python's built-in `eval` function, which evaluates a string as a Python expression. The expression has access to the Response model, which in its base form should include the response from the `endpoint_under_test` and the time statistics (see `ConnectorStats` model).
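As a rough illustration of the mechanism, the sketch below evaluates assertion strings against a stand-in response object. It is not contextcheck's implementation: `run_eval_assertion` is a hypothetical helper, and the stand-in only mimics the `message` and `stats.conn_duration` attributes used in this notebook's assertions.

```python
from types import SimpleNamespace

# Stand-in for a response; only the attributes used below are modelled.
response = SimpleNamespace(
    message="success",
    stats=SimpleNamespace(conn_duration=0.49),
)


def run_eval_assertion(expression: str, response) -> bool:
    """Evaluate an assertion string with `response` as the only local name."""
    # eval executes arbitrary code, so only run scenarios you trust.
    return bool(eval(expression, {}, {"response": response}))


expressions = [
    '"success" == response.message',
    "response.stats.conn_duration < 10",
    '"Warsaw" in response.message',
]
for expression in expressions:
    print(expression, "->", run_eval_assertion(expression, response))
```

Since the assertion is ordinary Python, any expression that evaluates to a truthy or falsy value can serve as a check.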

##### Explain llm assertions

`llm_metric` uses another LLM to assess the response of the `endpoint_under_test`. For this, `eval_endpoint` should be added to the config section to define the evaluation endpoint. It can be one of the available endpoints (link here) or one created by the user (link here).

There are 5 specific sub-metrics associated with it:
- `hallucination` (available only for RAG systems): assesses whether the LLM's answer includes information not present in the provided reference data.
- `qa-reference` (available only for RAG systems): assesses whether the LLM's response accurately answers the user query based on the provided reference data.
- `model-grading-qa`: allows defining assertions that are matched against the LLM/RAG response. Think of it as "regular expressions defined using natural language".
- `summarization` (available only for RAG systems): assesses the quality of a summary generated by the endpoint in response to a query.
- `human-vs-ai`: compares the AI's response to a predefined ground truth response written by a human.

For more in-depth explanations and examples please go to [TODO - Insert link here]

##### Explain deterministic assertions

`deterministic` assertions provide a way to check the content of the response through string comparisons like `contains` or `contains-any`.
To use a `deterministic` assertion, use the keyword `kind` with the assertion type (see final example).

For more information please go to [Link here]
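To give a feel for what such string comparisons amount to, here is a minimal sketch of `contains` and `contains-any` style checks written as plain Python functions. The function names, the case-sensitive matching and the list argument of `contains_any` are assumptions made for illustration; contextcheck's own behaviour may differ.

```python
from typing import Iterable


def contains(message: str, expected: str) -> bool:
    """True if `expected` occurs in `message` (case-sensitive in this sketch)."""
    return expected in message


def contains_any(message: str, candidates: Iterable[str]) -> bool:
    """True if at least one candidate occurs in `message`."""
    return any(candidate in message for candidate in candidates)


message = "The capital of France is Paris."
print(contains(message, "Paris"))
print(contains_any(message, ["Berlin", "Paris"]))
print(contains_any(message, ["Berlin", "Madrid"]))
```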

## Final scenario

In [34]:
# When the test scenario is finally ready we can load it
# TODO: Extend scenario_example1.yaml
test_scenario_file_path = "../tests/scenario_example1.yaml"
test_scenario = TestScenario.from_yaml(file_path=test_scenario_file_path)

In [35]:
# Inspect the structure of test_scenario
rich.print(test_scenario)

In [36]:
# Initiate executor which runs test scenario
executor = Executor(test_scenario=test_scenario)

In [37]:
# Run test scenario
executor.run_all()

2024-09-26 15:20:30.616 | INFO     | contextcheck.executors.executor:run_all:41 - Running scenario
2024-09-26 15:20:30.617 | INFO     | contextcheck.interfaces.interface:__call__:11 - name='Write sucess' request=RequestBase(message='Please write only "success" as a response') response=None asserts=[AssertionEval(result=None, eval='"success" == response.message'), AssertionEval(result=None, eval='response.stats.conn_duration < 10')] result=None
2024-09-26 15:20:30.619 | INFO     | contextcheck.interfaces.interface:__call__:11 - message='Please write only "success" as a response'


2024-09-26 15:20:31.014 | INFO     | contextcheck.interfaces.interface:__call__:11 - message='success' stats=ResponseStats(tokens_request=16, tokens_response=1, tokens_total=17, conn_start_time=25035.239937527, conn_end_time=25035.633646139, conn_duration=0.39370861199859064) id='chatcmpl-ABihznQK0dfBY5UAgeWbjH9p4SsnC' choices=[{'finish_reason': 'stop', 'index': 0, 'logprobs': None, 'message': {'content': 'success', 'role': 'assistant', 'refusal': None}}] created=1727356831 model='gpt-4o-mini-2024-07-18' object='chat.completion' system_fingerprint='fp_3a215618e8' usage={'completion_tokens': 1, 'prompt_tokens': 16, 'total_tokens': 17, 'completion_tokens_details': {'reasoning_tokens': 0}} config=EndpointConfig(kind='openai', url='', model='gpt-4o-mini', additional_headers={}, provider=None, temperature=0.2, max_tokens=None, top_k=3, use_ranker=True, collection_name='default')
2024-09-26 15:20:31.015 | INFO     | co

False

In [40]:
# NOTE RB: Maybe executor should copy the test scenario
# Inspect updated test_scenario
rich.print(test_scenario)

In [41]:
show_test_step_results(test_scenario=test_scenario)

------------
Name: Write sucess; Result: True

Assertion: ""success" == response.message", Result: True
Assertion: "response.stats.conn_duration < 10", Result: True
------------
Name: Capital of Poland; Result: True

Assertion: ""Warsaw" in response.message", Result: True
------------
Name: Test model grading QA evaluator; Result: False

Assertion: "Text should be a poem about AI.", Result: True
Assertion: "Text should be a report on taxes.", Result: False
------------
Name: Deterministic assertion test; Result: True

Assertion: "Paris", Result: True
------------


### Execute scenario using ccheck command - TODO

In [42]:
# We can also run contextcheck from the command line
!ccheck --output-type console --filename ../tests/scenario_example1.yaml

2024-09-26 15:21:10.777 | INFO     | contextcheck.executors.executor:run_all:41 - Running scenario
TestStep(
    name='Write sucess',
    request=RequestBase(message='Please write only "success" as a response'),
    response=None,
    asserts=[
        AssertionEval(result=None, eval='"success" == response.message'),
        AssertionEval(result=None, eval='response.stats.conn_duration < 10')
    ],
    result=None
)
╭──────────────────────────────────────────────────────────────────────────────╮
│ 🎈 Request:                                                                  │
│ RequestBase(message='Pleas