# Example 1

In this notebook we present a simple example of using contextcheck to validate LLM responses.

We will talk about:
- Configuration
- Test Scenario
- Running the Test Scenario


In [1]:
# TODO: Create another test scenario where each type of assertion would be included
# TODO: For this, config should also be updated
# TODO: Talk about sub components separately (step1, step2, assertion1, assertion2 etc.)
# TODO: Add optional jinja2 templating section or a remark with a link

## Installation

In [2]:
# %pip install contextcheck
# %pip install devtools

## Imports

In [3]:
from contextcheck import TestScenario
from contextcheck.executors.executor import Executor # NOTE RB: Maybe Executor should be at the most outer layer for import
from devtools import pprint # Needed for pydantic models pretty formatting
import yaml

## Scenario creation

Note that throughout this notebook we present separate parts of a single scenario. They are all gathered in one proper YAML file, which is loaded once the individual parts of a scenario have been explained.

### Explain config

Config defines the connection to an LLM (or RAG system). We provide support for several popular LLM providers, which lets you be productive from the start.
There are three components used in config:
1. `endpoint_under_test` - defines the tested endpoint
2. `default_request` - defines the defaults for both the `endpoint_under_test` and `eval_endpoint` (TODO: Please someone confirm that)
3. `eval_endpoint` - defines the endpoint which is used for evaluating the responses from `endpoint_under_test`

For more information about configuration please go to [TODO - INSERT LINK HERE]

TODO: What's the purpose of `default_request` when the same configuration can be given to `endpoint_under_test` or `eval_endpoint`?

In [4]:
# Define configuration in yaml - for demonstration purposes it's done in notebook
# TIP: For eval_endpoint try to use SOTA llm if possible
yaml_config_1 = """
config:
   endpoint_under_test:
      kind: openai
      model: gpt-4o-mini
      temperature: 0.2
   eval_endpoint:
      kind: openai
      model: gpt-4o
      temperature: 0.0
"""

yaml_from_string = yaml.safe_load(yaml_config_1)
yaml_from_string

{'config': {'endpoint_under_test': {'kind': 'openai',
   'model': 'gpt-4o-mini',
   'temperature': 0.2},
  'eval_endpoint': {'kind': 'openai', 'model': 'gpt-4o', 'temperature': 0.0}}}
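Since the parsed config is a plain dictionary, it can be sanity-checked before it is handed to contextcheck. The following is a minimal sketch, not part of the library: `check_config` and `KNOWN_COMPONENTS` are hypothetical names, and treating `kind` as a required key for each endpoint is an assumption made for illustration.

```python
# Plain dict mirroring the parsed YAML shown above (no network calls involved).
parsed = {
    "config": {
        "endpoint_under_test": {"kind": "openai", "model": "gpt-4o-mini", "temperature": 0.2},
        "eval_endpoint": {"kind": "openai", "model": "gpt-4o", "temperature": 0.0},
    }
}

# Component names taken from the list above; treating them as the only
# allowed keys is an assumption made for this sketch.
KNOWN_COMPONENTS = {"endpoint_under_test", "default_request", "eval_endpoint"}


def check_config(doc: dict) -> list[str]:
    """Hypothetical helper: return a list of structural problems in the config section."""
    problems = []
    config = doc.get("config", {})
    for name in config:
        if name not in KNOWN_COMPONENTS:
            problems.append(f"unknown component: {name}")
    # Assumption: each endpoint needs a 'kind' so that a connector can be chosen.
    for name in ("endpoint_under_test", "eval_endpoint"):
        if name in config and "kind" not in config[name]:
            problems.append(f"{name} is missing 'kind'")
    return problems


problems = check_config(parsed)
print(problems or "config looks structurally fine")
```

A check like this mainly catches misspelled component names early; it says nothing about whether the endpoint itself is reachable.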

#### Extra: Adding custom endpoint

In [5]:
# Logic or a link for creating and using custom endpoint should be added somewhere here

### Explain steps

Each test scenario consists of at least one testing step.

Each step can be defined by its `name` (optional), `request` and `asserts` (optional):
- `name` is the name of the test step
- `request` is the message sent to the LLM
- `asserts` is a list of assertions checked against the LLM response

NOTE: By default each assert is treated as an `eval` assertion

In [6]:
# TODO: Add other type of asserts
yaml_from_string = yaml.safe_load("""
steps:
   - name: Check capital of Poland
     request: 'What is the capital city of Poland?'
     asserts:
        - '"Warsaw" in response.message'
        - 'response.stats.conn_duration < 3'
   - name: Test hallucination evaluator (hallucinated)
     request:
       message: Where did Mike go? Choose between the home and the park.
     asserts:
        - llm_metric: hallucination
          reference: Mike went to the store.
""")
yaml_from_string

{'steps': [{'name': 'Check capital of Poland',
   'request': 'What is the capital city of Poland?',
   'asserts': ['"Warsaw" in response.message',
    'response.stats.conn_duration < 3']},
  {'name': 'Test hallucination evaluator (hallucinated)',
   'request': {'message': 'Where did Mike go? Choose between the home and the park.'},
   'asserts': [{'llm_metric': 'hallucination',
     'reference': 'Mike went to the store.'}]}]}
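The two steps above write `request` in two ways: once as a bare string and once as a mapping with a `message` key. A rough sketch of how such shorthand could be normalized into one explicit shape is shown below. `normalize_step` is a hypothetical helper written for this notebook, not the library's own parsing code.

```python
def normalize_step(step: dict) -> dict:
    """Hypothetical helper: expand the shorthand forms of a step into one explicit shape."""
    request = step["request"]
    # A bare string is treated as the message itself.
    if isinstance(request, str):
        request = {"message": request}
    return {
        "name": step.get("name"),                  # optional
        "request": request,                        # required
        "asserts": list(step.get("asserts", [])),  # optional
    }


steps = [
    {"name": "Check capital of Poland", "request": "What is the capital city of Poland?"},
    {"request": {"message": "Where did Mike go? Choose between the home and the park."}},
]

for step in steps:
    print(normalize_step(step))
```

After normalization every step carries a `request` mapping with a `message` key, regardless of which form was used in the YAML.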

#### Explain assertions

There are three families of assertions:
1. `eval` assertion - converts a string to python code using (you guessed it) eval
2. `llm_metric` assertion - uses another llm defined in `eval_endpoint` to assess the `endpoint_under_test` performance
3. `deterministic` assertion - does string assessments like contains, contains-any etc.
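To make the distinction concrete, here is a small sketch that guesses the family of a raw `asserts` entry from its shape. It relies only on what this notebook shows: a bare string is treated as `eval`, and a mapping with an `llm_metric` key is an LLM assertion. The schema of `deterministic` assertions is not shown here, so the hypothetical `assertion_family` helper leaves anything else unclassified rather than guessing.

```python
def assertion_family(entry) -> str:
    """Hypothetical helper: guess the assertion family from the shape of a raw entry."""
    if isinstance(entry, str):
        return "eval"  # by default each assert is treated as an eval assertion
    if isinstance(entry, dict) and "llm_metric" in entry:
        return "llm_metric"
    if isinstance(entry, dict) and "eval" in entry:
        return "eval"  # assumption: an explicit mapping form with an 'eval' key
    return "unrecognized"  # deterministic schema is not covered by this sketch


entries = [
    '"Warsaw" in response.message',
    {"llm_metric": "hallucination", "reference": "Mike went to the store."},
]

for entry in entries:
    print(assertion_family(entry), "<-", entry)
```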

In [7]:
# TODO: Add deterministic assertion combining it with the two previous assertions
# NOTE RB: Metrics should be easily extended i.e. if someone wants to add a metric we should provide a simple way
# to do that, which should not break any functionalities like result summarization or time statistics etc.

##### Explain llm assertions

`llm_metric` uses another LLM to assess the response of the `endpoint_under_test`. For this, an `eval_endpoint` must be added to the config section to define the evaluation endpoint. It can be one of the available endpoints (link here) or one created by the user (link here).

In [8]:
# TODO: Add 1-2 examples here and link other options

##### Explain eval assertions

An `eval` assertion uses Python's built-in `eval` function, which evaluates a string as a Python expression. The expression has access to the response model, which in its basic form should include the response from the `endpoint_under_test` and the time statistics (see `ConnectorStats` model).
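The mechanism can be illustrated without calling any endpoint. The sketch below uses `types.SimpleNamespace` as a stand-in for the real response model and a hypothetical `run_eval_assertion` helper. It only demonstrates how a string becomes a boolean check and is not the library's implementation.

```python
from types import SimpleNamespace

# Stand-in for the real response model: only the attributes used by the
# assertions in this notebook are provided.
response = SimpleNamespace(
    message="The capital city of Poland is Warsaw.",
    stats=SimpleNamespace(conn_duration=0.54),
)


def run_eval_assertion(expression: str, response) -> bool:
    """Hypothetical helper: evaluate an assertion string against a response object."""
    # Emptying builtins is a mild precaution only; eval is not a sandbox,
    # so assertion strings should come from trusted scenario files.
    return bool(eval(expression, {"__builtins__": {}}, {"response": response}))


for expression in ['"Warsaw" in response.message', "response.stats.conn_duration < 3"]:
    print(expression, "->", run_eval_assertion(expression, response))
```

Because the string is executed as code, any attribute of the response object can be used in an assertion, which is what makes this family flexible.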

In [9]:
# TODO: Add 1-2 examples of eval here

##### Explain deterministic assertions

`deterministic` assertions provide a way to check the content of the response through string comparisons such as contains or contains-any.
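As a rough illustration of what such comparisons amount to, the two checks named above can be written as plain Python functions. The names `contains` and `contains_any` and the case-sensitive matching are assumptions made for this sketch; the actual contextcheck behavior may differ.

```python
def contains(text: str, needle: str) -> bool:
    """True if the needle occurs in the text (case-sensitive in this sketch)."""
    return needle in text


def contains_any(text: str, needles: list[str]) -> bool:
    """True if at least one of the needles occurs in the text."""
    return any(needle in text for needle in needles)


message = "The capital city of Poland is Warsaw."
print(contains(message, "Warsaw"))
print(contains_any(message, ["Krakow", "Warsaw"]))
print(contains_any(message, ["Krakow", "Gdansk"]))
```

Unlike `llm_metric` assertions, checks of this kind need no second model and give the same result on every run for the same response.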

In [10]:
# TODO: Show 1-2 examples of that and link to other options

## Final scenario

In [11]:
# When the test scenario is finally ready we can load it
# TODO: Extend scenario_example1.yaml
test_scenario_file_path = "../tests/scenario_example1.yaml"
test_scenario = TestScenario.from_yaml(file_path=test_scenario_file_path)

In [12]:
# Inspect the structure of test_scenario
pprint(test_scenario)

TestScenario(
    steps=[
        TestStep(
            name='Check capital of Poland',
            request=RequestBase(
                message='What is the capital city of Poland?',
            ),
            response=None,
            asserts=[
                AssertionEval(
                    result=None,
                    eval='"Warsaw" in response.message',
                ),
                AssertionEval(
                    result=None,
                    eval='response.stats.conn_duration < 3',
                ),
            ],
            result=None,
        ),
        TestStep(
            name='Test hallucination evaluator (hallucinated)',
            request=RequestBase(
                message='Where did Mike go? Choose between the home and the park.',
            ),
            response=None,
            asserts=[
                AssertionLLM(
                    result=None,
                    llm_metric='hallucination',
                    reference='Mike went to

In [13]:
# Initiate executor which runs test scenario
executor = Executor(test_scenario=test_scenario)

In [14]:
# Run test scenario
executor.run_all()

2024-09-24 10:19:49.495 | INFO     | contextcheck.executors.executor:run_all:41 - Running scenario
2024-09-24 10:19:49.496 | INFO     | contextcheck.interfaces.interface:__call__:11 - name='Check capital of Poland' request=RequestBase(message='What is the capital city of Poland?') response=None asserts=[AssertionEval(result=None, eval='"Warsaw" in response.message'), AssertionEval(result=None, eval='response.stats.conn_duration < 3')] result=None
2024-09-24 10:19:49.497 | INFO     | contextcheck.interfaces.interface:__call__:11 - message='What is the capital city of Poland?'
2024-09-24 10:19:50.035 | INFO     | contextcheck.interfaces.interface:__call__:11 - message='The capital city of Poland is Warsaw.' stats=ResponseStats(tokens_request=15, tokens_response=8, tokens_total=23, conn_s

False

In [15]:
# NOTE RB: Maybe executor should copy the test scenario
# Inspect updated test_scenario
pprint(test_scenario) # It shows api key

TestScenario(
    steps=[
        TestStep(
            name='Check capital of Poland',
            request=RequestBase(
                message='What is the capital city of Poland?',
            ),
            response=ResponseModel(
                message='The capital city of Poland is Warsaw.',
                stats=ResponseStats(
                    tokens_request=15,
                    tokens_response=8,
                    tokens_total=23,
                    conn_start_time=7808.719369254,
                    conn_end_time=7809.255086824,
                    conn_duration=0.5357175699991785,
                ),
                id='chatcmpl-AAv3t0s4vtdA72ZXv6GNn9cWHsTYM',
                choices=[
                    {
                        'finish_reason': 'stop',
                        'index': 0,
                        'logprobs': None,
                        'message': {
                            'content': 'The capital city of Poland is Warsaw.',
                    

In [16]:
# We can inspect each test step separately and check its results
for step in test_scenario.steps:
    print(f"Step name: {step.name}, Result: {step.result}")

Step name: Check capital of Poland, Result: True
Step name: Test hallucination evaluator (hallucinated), Result: False
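The per-step results can also be condensed into a small summary. The sketch below uses stand-in objects with the same `name` and `result` attributes printed above, so it runs without an API key; `summarize` is a hypothetical helper, not a contextcheck function. Treating the scenario as passed only when every step passed is an assumption, though it would be consistent with the `False` returned by `executor.run_all()` earlier.

```python
from types import SimpleNamespace

# Stand-ins mirroring the step names and results printed above.
steps = [
    SimpleNamespace(name="Check capital of Poland", result=True),
    SimpleNamespace(name="Test hallucination evaluator (hallucinated)", result=False),
]


def summarize(steps) -> dict:
    """Hypothetical helper: group step names by outcome."""
    passed = [step.name for step in steps if step.result]
    failed = [step.name for step in steps if not step.result]
    # Assumption: the scenario passes only if no step failed.
    return {"passed": passed, "failed": failed, "all_passed": not failed}


summary = summarize(steps)
print(f"{len(summary['passed'])} passed, {len(summary['failed'])} failed")
print("All passed:", summary["all_passed"])
```

With a real scenario, `test_scenario.steps` could be passed to the same function in place of the stand-ins.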


In [17]:
# We can also inspect each assertion for each step separately
for step in test_scenario.steps:
    print(f"Step name: {step.name}:\n")
    for assertion in step.asserts:
        print(assertion)
    print("-"*12)

Step name: Check capital of Poland:

result=True eval='"Warsaw" in response.message'
result=True eval='response.stats.conn_duration < 3'
------------
Step name: Test hallucination evaluator (hallucinated):

result=False llm_metric='hallucination' reference='Mike went to the store.' assertion='' metric_evaluator=LLMMetricEvaluator(eval_endpoint=EndpointOpenAI(connector=ConnectorOpenAI(stats=ConnectorStats(conn_start_time=7809.912424328, conn_end_time=7810.2899517, conn_duration=0.37752737199934927), model='gpt-4o'), config=EndpointConfig(kind='openai', url='', model='gpt-4o', additional_headers={}, provider=None, temperature=0.0, max_tokens=None, top_k=3, use_ranker=True, collection_name='default')), metric=MetricHallucination(prompt_template='\nIn this task, you will be presented with a query, a reference text and an answer. The answer is\ngenerated to the question based on the reference text. The answer may contain false information. You\nmust use the reference text to determine if th

### Execute scenario using ccheck comments - TODO