## Defining the Task
In this notebook we perform a named entity recognition (NER) task with the help of an LLM: extracting a structured address from free text. We evaluate different models on this task to see which ones accomplish it more reliably.

Let's first define the Pydantic schema of what we want to extract.

In [1]:
from pydantic import BaseModel

class Address(BaseModel):
    number: int
    street_name: str
    city: str
    country: str
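
To make the role of the schema concrete, here is a minimal sketch of how Pydantic (v2 API, as used in this notebook) accepts a well-formed payload and rejects a malformed one. The JSON strings are made up for illustration, and the `Address` class is repeated only so the snippet runs on its own.

```python
from pydantic import BaseModel, ValidationError


class Address(BaseModel):
    number: int
    street_name: str
    city: str
    country: str


# A well-formed payload parses into a typed object.
valid = Address.model_validate_json(
    '{"number": 123, "street_name": "Main St", "city": "Springfield", "country": "USA"}'
)
print(valid.number, valid.city)

# Illustrative bad payload: non-numeric house number, two fields missing.
try:
    Address.model_validate_json('{"number": "unknown", "street_name": "Main St"}')
    rejected = False
except ValidationError as exc:
    rejected = True
    # Collect the names of the fields that failed validation.
    error_fields = sorted(str(err["loc"][0]) for err in exc.errors())
    print(error_fields)
```

This is the same validation that the extracted output has to pass before it can be compared with the expected value.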

## Data Preparation
Next, we create some data entries and save them to a dataframe that will be used as input to our evaluators. In our case, we run the evaluations on a small set of 3 examples. We also define the expected outputs, which will later be compared with the LLM outputs.

In [2]:
import pandas as pd

entries = pd.DataFrame(
    {
        "input": [
            "Please send the package to 123 Main St, Springfield.",
            "J'ai déménagé récemment à 56 Rue de l'Université, Paris.",
            "A reunião será na Avenida Paulista, 900, São Paulo.",
        ],
        "expected_output": [
            Address(
                number=123, street_name="Main St", city="Springfield", country="USA"
            ).model_dump_json(),
            Address(
                number=56,
                street_name="Rue de l'Université",
                city="Paris",
                country="France",
            ).model_dump_json(),
            Address(
                number=900,
                street_name="Avenida Paulista",
                city="São Paulo",
                country="Brazil",
            ).model_dump_json(),
        ],
    }
)

## Running the Test
Next, we pick the models that we want to evaluate on our task and define the test for it.
Our test runs one evaluation for each combination of data entry and model, thanks to the `@pytest.mark.parametrize` decorator. Because we also use `@pytest.mark.flaky(max_runs=3)`, each test is run up to 3 times and is marked as failed only if all of the tries fail.
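
As a rough illustration of how the test matrix is built, the sketch below uses `itertools.product` on plain lists. The entry strings are placeholders standing in for the dataframe rows; the model names are the ones used in this notebook.

```python
from itertools import product

# Placeholder entries standing in for the dataframe rows.
entries = ["entry A", "entry B", "entry C"]
models = ["gpt-3.5-turbo", "gpt-4-turbo", "groq/llama3-70b-8192"]

# product() varies the last argument fastest, so every model is tried on the
# first entry before moving on to the second entry.
cases = list(product(entries, models))

for position, (entry, model) in enumerate(cases):
    print(position, entry, model)

# Positions in the run order that belong to the third model in the list.
third_model_positions = [i for i, (_, m) in enumerate(cases) if m == models[2]]
print(len(cases), third_model_positions)
```

Three entries times three models gives nine parametrized cases. Since pytest labels a non-primitive parameter by its position in the list of cases, an id such as `entry2` refers to the third case in this order, not to the third row of the dataframe.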

In [3]:
import instructor
from itertools import product
from litellm import completion
import pytest

models = ["gpt-3.5-turbo", "gpt-4-turbo", "groq/llama3-70b-8192"]

client = instructor.from_litellm(completion)

# Create the test function
@pytest.mark.parametrize("entry, model", product(entries.itertuples(), models))
@pytest.mark.flaky(max_runs=3)
def test_extracts_the_right_address(entry, model):
    address = client.chat.completions.create(
        model=model,
        response_model=Address,
        messages=[
            {"role": "user", "content": entry.input},
        ],
        temperature=0.0,
    )

    assert address.model_dump_json() == entry.expected_output
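
The retry behaviour described above can be sketched in plain Python. This is not the `flaky` plugin's implementation, only an approximation of its semantics, assuming a single passing run is enough; `run_with_retries` and the two toy tests are hypothetical names introduced for this example.

```python
from typing import Callable


def run_with_retries(test: Callable[[], None], max_runs: int = 3) -> bool:
    """Return True as soon as one attempt passes, False if every attempt fails."""
    for _ in range(max_runs):
        try:
            test()
            return True
        except AssertionError:
            continue
    return False


# Toy test that fails on its first two attempts and passes on the third.
attempts = {"count": 0}


def sometimes_fails() -> None:
    attempts["count"] += 1
    assert attempts["count"] >= 3


def always_fails() -> None:
    assert False


passed_eventually = run_with_retries(sometimes_fails)
never_passed = run_with_retries(always_fails)
print(passed_eventually, never_passed)
```

With LLM calls, this kind of retry absorbs occasional non-deterministic outputs while still reporting a model that fails on every attempt.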

As we are running our evaluations in a notebook environment, we need to use the `ipytest` library.

In [4]:
import ipytest
ipytest.config.rewrite_asserts = True
ipytest.config.addopts = ["--disable-warnings"]
ipytest.run()

platform darwin -- Python 3.12.3, pytest-8.2.0, pluggy-1.5.0
rootdir: /Users/zhenyabudnyk/DevProjects/langwatch-saas/langevals/notebooks
configfile: pyproject.toml
plugins: flaky-3.8.1, anyio-4.3.0, langevals-0.1.5, xdist-3.6.1
collected 9 items

t_2e5f9ddc0bcc4b878912296c8b6dee42.py ..F..F..F                                               [100%]

[31m[1m___________________ test_extracts_the_right_address[entry2-groq/llama3-70b-8192] ___________________[0m

entry = Pandas(Index=0, input='Please send the package to 123 Main St, Springfield.', expected_output='{"number":123,"street_name":"Main St","city":"Springfield","country":"USA"}')
model = 'groq/llama3-70b-8192'

    [0m[37m@pytest[39;49;00m.mark.parametrize([33m"[39;49;00m[33mentry, model[39;49;00m[33m"[39;49;00m, product(entries.itertuples(), models))[90m[39;49;00m
    [37m@pytest[39;49;00m.mark.

<ExitCode.TESTS_FAILED: 1>
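
When an exact string comparison of two JSON payloads fails, it can help to see which fields differ. The sketch below uses only the standard library. `diff_fields` is a hypothetical helper, and the `actual` payload is invented for illustration; it is not the output of any of the models above.

```python
import json


def diff_fields(expected_json: str, actual_json: str) -> dict:
    """Map each differing field to an (expected, actual) pair."""
    expected = json.loads(expected_json)
    actual = json.loads(actual_json)
    return {
        key: (expected.get(key), actual.get(key))
        for key in sorted(expected.keys() | actual.keys())
        if expected.get(key) != actual.get(key)
    }


expected = '{"number":123,"street_name":"Main St","city":"Springfield","country":"USA"}'
# Invented model output, used only to demonstrate the helper.
actual = '{"number":123,"street_name":"Main Street","city":"Springfield","country":"USA"}'

mismatches = diff_fields(expected, actual)
print(mismatches)
```

A field-level view like this makes it easier to tell a genuine extraction error from a harmless variation in spelling.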

## Other Tests
Let's try another task: imagine you want to classify the language of a message. You can leverage LangEvals here as well!
Note that in this case we also define `@pytest.mark.pass_rate`, the pass rate at which the test is considered a pass. Here we have one model evaluated on 3 data entries, resulting in 3 tests. Only if at least 80% of them pass is the whole test case marked as passed.
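
To get a feel for what a 0.8 pass rate means for small groups of tests, here is a sketch of the arithmetic only. It is not LangEvals' implementation, and it assumes the threshold is inclusive (a rate of exactly 0.8 counts as a pass); both function names are hypothetical.

```python
def meets_pass_rate(passed: int, total: int, pass_rate: float) -> bool:
    """Assumed rule: the share of passing cases must reach the threshold."""
    return passed / total >= pass_rate


def min_passes_needed(total: int, pass_rate: float) -> int:
    """Smallest number of passing cases that satisfies the threshold."""
    return next(k for k in range(total + 1) if meets_pass_rate(k, total, pass_rate))


for total in (3, 9):
    print(total, min_passes_needed(total, 0.8))
```

With only 3 cases, a single failure gives 2/3 (about 0.67), which is below 0.8, so all 3 must pass. With 9 cases, one failure is tolerated.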

In [5]:
import litellm
from litellm import ModelResponse
from langevals_lingua.language_detection import (
    LinguaLanguageDetectionEvaluator,
    LinguaLanguageDetectionSettings,
)
from langevals_ragas.answer_relevancy import RagasAnswerRelevancyEvaluator
from langevals import expect
import pytest

entries = pd.DataFrame(
    {
        "input": [
            "What's the connection between 'breaking the ice' and the Titanic's first voyage?",
            "Comment la bataille de Verdun a-t-elle influencé la cuisine française?",
            "¿Puede el musgo participar en la purificación del aire en espacios cerrados?",
        ],
    }
)


@pytest.mark.parametrize("entry", entries.itertuples())
@pytest.mark.flaky(max_runs=3)
@pytest.mark.pass_rate(0.8)
def test_language_and_relevancy(entry):
    response: ModelResponse = litellm.completion(
        model="gpt-3.5-turbo",
        messages=[
            {
                "role": "system",
                "content": "You answer questions only in English, no matter what language the question was asked in",
            },
            {"role": "user", "content": entry.input},
        ],
        temperature=0.0,
    )  # type: ignore
    answer = response.choices[0].message.content  # type: ignore

    language_checker = LinguaLanguageDetectionEvaluator(
        settings=LinguaLanguageDetectionSettings(
            check_for="output_matches_language",
            expected_language="EN",
        )
    )
    answer_relevancy_checker = RagasAnswerRelevancyEvaluator()

    expect(input=entry.input, output=answer).to_pass(language_checker)
    expect(input=entry.input, output=answer).score(
        answer_relevancy_checker
    ).to_be_greater_than(0.8)

Note how we used the `expect` function, which conveniently attaches the evaluator to the `input` and the generated `output` of the model.

In [7]:
import ipytest
ipytest.config.rewrite_asserts = True
ipytest.config.addopts = ["--disable-warnings"]
ipytest.run()

platform darwin -- Python 3.12.3, pytest-8.2.0, pluggy-1.5.0
rootdir: /Users/zhenyabudnyk/DevProjects/langwatch-saas/langevals/notebooks
configfile: pyproject.toml
plugins: flaky-3.8.1, anyio-4.3.0, langevals-0.1.5, xdist-3.6.1
collected 3 items

t_fa956453ec6941f394dfd2ebc296ab6c.py ...                                                    [100%]

.venv/lib/python3.12/site-packages/_pytest/config/__init__.py:1285
    self._mark_plugins_for_rewrite(hook)

===Flaky Test Report===

test_language_and_relevancy[entry0] passed 1 out of the required 1 times. Success!
test_language_and_relevancy[entry1] passed 1 out of the required 1 times. Success!
test_language_and_relevancy[entry2] passed 1 out of the required 1 times. Success!
test_language_and_relevancy[entry0] passed 1 out of the required 1 times. Success!
test_language_and_relevancy[entry1] passed 1 out of the required 1 times. Success!
test_language_and_relevancy[entry2] passed 1 out of the required 1 times. Success!

===End Flaky Test Report===


<ExitCode.OK: 0>