# Example 1

In this notebook we present a simple case of using contextcheck to validate LLM responses.

We will talk about:
- Configuration
- Test Scenario
- Test Steps
- Running the Test Scenario

## Installation

In [None]:
# %pip install contextcheck

## Imports

In [None]:
from contextcheck import TestScenario, Executor
import rich

### Send default request

Let's start by creating a simple YAML file that we will use to send a dummy request to OpenAI.

NOTE: When `config` is empty, OpenAI's gpt-4o-mini is used.

In [None]:
%%writefile test_scenario_ex1_progress.yaml
config:

steps:
   - What is the capital of Poland?

In [None]:
# Create a test scenario
test_scenario = TestScenario.from_yaml("test_scenario_ex1_progress.yaml")

In [None]:
# visualize test scenario
rich.print(test_scenario)

In [None]:
# create executor that uses test scenario
executor = Executor(test_scenario=test_scenario)

In [None]:
# run all test steps
executor.run_all()

In [None]:
# Once more visualize the test scenario to see the changes
rich.print(test_scenario)

In [None]:
# Response from llm
test_scenario.steps[0].response.message

### Config update

We initially left the config empty, but we can easily populate it with configuration that best fits our needs.

To define the connection to the LLM or RAG system we use `endpoint_under_test`. For demo purposes we will use one of OpenAI's models, which are supported by default. For more information please visit [TODO - Link to config]

In [None]:
%%writefile test_scenario_ex1_progress.yaml
config:
   endpoint_under_test:
      kind: openai
      model: gpt-4o   

steps:
   - What is the capital of Poland?


In [None]:
# Create a test scenario
test_scenario = TestScenario.from_yaml("test_scenario_ex1_progress.yaml")

In [None]:
# visualize test scenario
# Note the change in config from gpt-4o-mini to gpt-4o
rich.print(test_scenario)

In [None]:
# create executor that uses test scenario
executor = Executor(test_scenario=test_scenario)

In [None]:
executor.run_all()

In [None]:
# Response from llm
test_scenario.steps[0].response.message

#### Model parameters update

In the config we can also set model parameters such as `temperature` and `max_tokens`.

In [None]:
# TODO: Check this after rebase with contextcheck changes
# TODO: I'd add a possibility to transfer parameters through step/request

In [None]:
%%writefile test_scenario_ex1_progress.yaml
config:
   endpoint_under_test:
      kind: openai
      model: gpt-4o-mini
      temperature: 2.0
      max_tokens: 64

steps:
   - Write a poem about LLMs


In [None]:
# Create a test scenario
test_scenario = TestScenario.from_yaml("test_scenario_ex1_progress.yaml")

In [None]:
# visualize test scenario
rich.print(test_scenario)

In [None]:
# create executor that uses test scenario
executor = Executor(test_scenario=test_scenario)

In [None]:
executor.run_all()

In [None]:
# Response from llm
test_scenario.steps[0].response.message

### Simple scenario

Let's create a simple test scenario that will help you understand how contextcheck works.
We will use simple asserts, which are based on Python's built-in `eval` function.


We believe it's also a good place to introduce the nomenclature for test steps.

Each step can be defined by its `name` (optional), `request` and `asserts` (optional):
- `name` is the name of the test step
- `request` is the message sent to the LLM
- `asserts` is a list of assertions run against the LLM response

NOTE: By default each assert is treated as an `eval` assertion
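The two step forms used in this notebook (a bare string, and a mapping with `name`, `request` and `asserts`) can be pictured with a small sketch. The `Step` class and `normalize_step` helper below are hypothetical illustrations and are not contextcheck's own models; they only show how a bare string can be read as a request with no name and no asserts.

```python
from dataclasses import dataclass, field
from typing import Optional, Union


@dataclass
class Step:
    """Hypothetical stand-in for a test step; not contextcheck's own model."""
    request: Union[str, dict]
    name: Optional[str] = None
    asserts: list = field(default_factory=list)


def normalize_step(raw):
    """Turn a bare string or a mapping into a Step with all three fields."""
    if isinstance(raw, str):
        # Shorthand form: the string itself is the request.
        return Step(request=raw)
    return Step(
        request=raw["request"],
        name=raw.get("name"),
        asserts=list(raw.get("asserts", [])),
    )


short = normalize_step("What is the capital of Poland?")
full = normalize_step({
    "name": "Write success",
    "request": 'Please write only "success" as a response',
    "asserts": ['"success" == response.message'],
})
print(short)
print(full)
```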

In [None]:
%%writefile test_scenario_ex1_progress.yaml

config:
   endpoint_under_test:
      kind: openai
      model: gpt-4o

steps:
   - name: Write success
     request: 'Please write only "success" as a response'
     asserts:
        - '"success" == response.message'
        - 'response.stats.conn_duration < 10'


In [None]:
# Create a test scenario
test_scenario = TestScenario.from_yaml("test_scenario_ex1_progress.yaml")

In [None]:
# visualize test scenario
rich.print(test_scenario)

In [None]:
# create executor that uses test scenario
executor = Executor(test_scenario=test_scenario)

In [None]:
executor.run_all()

In [None]:
rich.print(test_scenario)

In [None]:
# Show the result
test_scenario.show_test_step_results()

### Scenario extension

With the introduction under our belt, we will extend the scenario built so far with new types of assertions and explain the relevant topics in more depth.

#### Explain config

To extend our scenario we need to introduce new config features that some of the assertions require.

In short, the config defines the connection to the LLM (or RAG system). We provide implementations for several popular LLM providers, which lets you be productive from the start. For more info about them please go to [Link here].

There are three components used in config:
1. `endpoint_under_test` - defines the tested endpoint
2. `default_request` - defines the defaults for both the `endpoint_under_test` and `eval_endpoint` (TODO: Please someone confirm that)
3. `eval_endpoint` - defines the endpoint which is used for evaluating the responses from `endpoint_under_test`

For more information about configuration please go to [TODO - INSERT LINK HERE]

TODO: What's the purpose of `default_request` when the same configuration can be given to `endpoint_under_test` or `eval_endpoint`?

In [None]:
# Let's use our new knowledge and define a scenario with LLM evaluation (full explanation later).
# In short, `llm_metric` uses another LLM to evaluate the response, and `model-grading-qa` in particular
# uses another LLM to check whether the response satisfies an assertion written by the user in natural language.
# TODO: We cannot have multiple assertions under the same llm metric

In [None]:
%%writefile test_scenario_ex1_progress.yaml
config:
   endpoint_under_test:
      kind: openai
      model: gpt-4o-mini
      temperature: 0.2
   eval_endpoint: # Needed for llm_metric assertions
      kind: openai
      model: gpt-4o
      temperature: 0.0

steps:
  - name: Test model grading QA evaluator
    request:
      message: "Please write a 5 line poem about AI."
    asserts:
      - llm_metric: model-grading-qa
        assertion: Text should be a poem about AI.
      - llm_metric: model-grading-qa
        assertion: Text should be a report on taxes. # Misleading assertion for demo purposes

In [None]:
# Create a test scenario
test_scenario = TestScenario.from_yaml("test_scenario_ex1_progress.yaml")

In [None]:
# visualize test scenario
rich.print(test_scenario)

In [None]:
# create executor that uses test scenario
executor = Executor(test_scenario=test_scenario)

In [None]:
executor.run_all()

In [None]:
rich.print(test_scenario)

In [None]:
# Show the result of each step
test_scenario.show_test_step_results()

#### Extra: Adding custom endpoint

In [None]:
# Logic or a link for creating and using custom endpoint should be added somewhere here

#### Explain assertions

There are three families of assertions (two of which we have already used):
1. `eval` assertion - evaluates a string as a Python expression using (you guessed it) `eval`
2. `llm_metric` assertion - uses another LLM, defined in `eval_endpoint`, to assess the performance of the `endpoint_under_test`
3. `deterministic` assertion - performs string checks such as `contains`, `contains-any` etc.

##### Explain eval assertions

The `eval` assertion uses Python's built-in `eval` function, which evaluates a string as a Python expression. Inside the expression the user has access to the `Response` model, which in its basic form includes the response from the `endpoint_under_test` and the time statistics (see the `ConnectorStats` model).
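As a rough illustration of the mechanism, the sketch below evaluates the two assert strings from the simple scenario against a stand-in response object. The `SimpleNamespace` object and the `check_eval_assertion` helper are assumptions made for this example; contextcheck's real `Response` model and its evaluation code may differ.

```python
from types import SimpleNamespace

# Hypothetical response object; the field names mirror the asserts used in this notebook.
response = SimpleNamespace(
    message="success",
    stats=SimpleNamespace(conn_duration=0.8),
)


def check_eval_assertion(expression, response):
    """Evaluate an assertion string with `response` as the only visible name."""
    # An empty __builtins__ narrows what the expression can reach; it is not a sandbox.
    return bool(eval(expression, {"__builtins__": {}}, {"response": response}))


results = {
    expression: check_eval_assertion(expression, response)
    for expression in (
        '"success" == response.message',
        "response.stats.conn_duration < 10",
    )
}
print(results)
```

Because `eval` executes arbitrary expressions, assert strings should only come from scenario files you trust.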

##### Explain llm assertions

`llm_metric` uses another LLM to assess the response of the `endpoint_under_test`. To use it, add `eval_endpoint` to the config section to define the evaluation endpoint. It can be one of the available endpoints (link here) or one created by the user (link here).

There are five sub-metrics associated with it:
- `hallucination` (available only for RAG systems) - This metric assesses whether the LLM's answer includes information not present in the provided reference data.
- `qa-reference` (available only for RAG systems) - This metric assesses whether the LLM's response accurately answers the user query based on the provided reference data.
- `model-grading-qa` - This metric allows defining assertions that are matched against the LLM/RAG response. Think of it as "regular expressions defined using natural language".
- `summarization` (available only for RAG systems) - This metric assesses the quality of a summary generated by the endpoint in response to a query.
- `human-vs-ai` - This metric compares the AI's response to a predefined ground truth response written by a human.

For more in-depth explanations and examples please go to [TODO - Insert link here]
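The idea behind `model-grading-qa` can be sketched as "build a grading prompt, send it to a judge, parse the verdict". Everything below is a hypothetical illustration: the prompt wording, the `grade_with_llm` helper and the `fake_judge` stub (which replaces a real call to the `eval_endpoint` with a keyword check) are assumptions, not contextcheck's implementation.

```python
def grade_with_llm(answer, assertion, judge):
    """Ask a judge callable whether `answer` satisfies `assertion`."""
    prompt = (
        "Does the text satisfy the requirement? Reply YES or NO.\n"
        f"Requirement: {assertion}\n"
        f"Text: {answer}"
    )
    verdict = judge(prompt)
    return verdict.strip().upper().startswith("YES")


def fake_judge(prompt):
    """Stand-in for the evaluation LLM: a keyword check, for demonstration only."""
    requirement = prompt.split("Requirement: ")[1].split("\n")[0]
    return "YES" if "poem" in requirement.lower() else "NO"


poem = "Circuits hum and models learn, line by line."
passes = grade_with_llm(poem, "Text should be a poem about AI.", fake_judge)
fails = grade_with_llm(poem, "Text should be a report on taxes.", fake_judge)
print(passes, fails)
```

In a real run the judge would be the model configured under `eval_endpoint`, so verdicts can vary between runs unless its temperature is kept low.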

##### Explain deterministic assertions

`deterministic` assertions provide a way to check the content of the response through string comparisons such as `contains` or `contains-any`.
To use a `deterministic` assertion, use the keyword `kind` with the assertion type (see the final example).

For more information please go to [Link here]
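A minimal sketch of what such string checks amount to is shown below. The function names, the dispatch table and the case-sensitive matching are assumptions made for illustration; contextcheck's own handling of `contains` and `contains-any` may differ in detail.

```python
def contains(text, expected):
    """True when the expected substring occurs in the text (case-sensitive here)."""
    return expected in text


def contains_any(text, expected):
    """True when at least one of the expected substrings occurs in the text."""
    return any(item in text for item in expected)


# Hypothetical mapping from the `kind` keyword to the check that implements it.
CHECKS = {"contains": contains, "contains-any": contains_any}


def run_deterministic(kind, text, expected):
    """Look up the check for `kind` and apply it to the response text."""
    return CHECKS[kind](text, expected)


answer = "The capital of Poland is Warsaw."
print(run_deterministic("contains", answer, "Warsaw"))
print(run_deterministic("contains-any", answer, ["Krakow", "Warsaw"]))
print(run_deterministic("contains-any", answer, ["Berlin", "Paris"]))
```

Unlike `llm_metric` assertions, these checks need no `eval_endpoint` and give the same result on every run for a given response.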

## Final scenario

In [None]:
# When the test scenario is finally ready we can load it
test_scenario_file_path = "../tests/scenario_example1.yaml"
test_scenario = TestScenario.from_yaml(file_path=test_scenario_file_path)

In [None]:
# Inspect the structure of test_scenario
rich.print(test_scenario)

In [None]:
# Initiate executor which runs test scenario
executor = Executor(test_scenario=test_scenario)

In [None]:
# Run test scenario
executor.run_all()

In [None]:
# Inspect updated test_scenario
rich.print(test_scenario)

In [None]:
test_scenario.show_test_step_results()

### Execute scenario using ccheck command

In [None]:
# We can also run contextcheck from the command line
!ccheck --output-type console --filename ../tests/scenario_example1.yaml