![logo](https://raw.githubusercontent.com/Pacific-AI-Corp/langtest/main/docs/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Pacific-AI-Corp/langtest/blob/main/demo/tutorials/misc/Evaluation_with_Structured_Outputs.ipynb)

**LangTest** is an open-source Python library designed to help developers deliver safe and effective Natural Language Processing (NLP) models. Whether you are using **John Snow Labs, Hugging Face, spaCy** models or **OpenAI, Cohere, AI21, Hugging Face Inference API and Azure-OpenAI** based LLMs, it has you covered. You can test any Named Entity Recognition (NER), Text Classification, Fill-Mask, or Translation model using the library. We also support testing LLMs for Question Answering, Summarization, and Text Generation tasks on benchmark datasets. The library supports 60+ out-of-the-box tests. For a complete list of supported test categories, please refer to the [documentation](http://langtest.org/docs/pages/docs/test_categories).

Metrics are calculated by comparing the model's extractions on the original list of sentences against its extractions on the noisy list of sentences. The original annotated labels are not used at any point; we are simply comparing the model against itself in two settings.
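The comparison described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not LangTest's implementation: `toy_model` and `robustness_pass_rate` are hypothetical names, the toy model is deliberately case-sensitive, and the perturbation is a plain uppercase conversion.

```python
from typing import Callable, List


def toy_model(text: str) -> str:
    """Hypothetical stand-in for a model: answers 'yes' if the text mentions 'legal'."""
    return "yes" if "legal" in text else "no"


def robustness_pass_rate(
    model: Callable[[str], str],
    originals: List[str],
    perturb: Callable[[str], str],
) -> float:
    """Fraction of inputs whose prediction is unchanged after perturbation."""
    passed = 0
    for text in originals:
        expected = model(text)          # prediction on the original input
        actual = model(perturb(text))   # prediction on the noisy input
        passed += expected == actual    # no ground-truth label is involved
    return passed / len(originals)


sentences = [
    "is the first series 20 euro note still legal tender",
    "can a bull snake kill a small dog",
]
rate = robustness_pass_rate(toy_model, sentences, str.upper)
print(f"pass rate: {rate:.0%}")  # pass rate: 50%
```

The first sentence fails because the toy model no longer finds the lowercase keyword once the input is uppercased, which is exactly the kind of brittleness a robustness test is meant to surface.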

# Getting started with LangTest

In [None]:
%pip install langtest[llms]

## Initial setup

In [None]:
import os

os.environ["OPENAI_API_KEY"] = "OPENAI_API_KEY"  # placeholder: replace with your own key

In [None]:
from pydantic import BaseModel

class Answer(BaseModel):

    class Rationale(BaseModel):
        """Explanation for an answer: why the answer is correct or incorrect, with valid reasons, a score, and a summary."""
        reason: str
        score: float
        summary: str

    answer: bool
    rationale: Rationale

    # __eq__ is needed to compare answers: the model's response is
    # evaluated by comparing two Answer objects on their boolean
    # `answer` field only; the rationale is ignored.
    def __eq__(self, other: 'Answer') -> bool:
        return self.answer == other.answer

issubclass(Answer, BaseModel)

True
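To see what the custom `__eq__` buys, consider two responses that agree on the boolean answer but differ in their rationale. The sketch below uses standard-library dataclasses as a dependency-free stand-in for the Pydantic model above; the class names `RationaleSketch` and `AnswerSketch` and all field values are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class RationaleSketch:
    reason: str
    score: float
    summary: str


@dataclass(eq=False)  # keep the hand-written __eq__ below
class AnswerSketch:
    answer: bool
    rationale: RationaleSketch

    def __eq__(self, other: object) -> bool:
        # Only the boolean answer matters; the free-text rationale is ignored.
        if not isinstance(other, AnswerSketch):
            return NotImplemented
        return self.answer == other.answer


original = AnswerSketch(True, RationaleSketch("The passage says so.", 0.9, "Supported"))
perturbed = AnswerSketch(True, RationaleSketch("STATED IN THE TEXT.", 0.7, "Supported by text"))
flipped = AnswerSketch(False, RationaleSketch("Not mentioned.", 0.4, "Unsupported"))

same_answer = original == perturbed    # rationales differ, answers agree
different_answer = original == flipped  # answers disagree
print(same_answer, different_answer)  # True False
```

Comparing on the answer alone is a design choice: free-text rationales will almost never match word for word between an original and a perturbed prompt, so including them in the comparison would likely make nearly every test fail.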

# Harness and its Parameters

The Harness class is a testing class for Natural Language Processing (NLP) models. It evaluates the performance of an NLP model on a given task using test data and generates a report with the test results. Harness can be imported from the LangTest library with `from langtest import Harness`.

The Harness class provides a blueprint for conducting NLP testing, and its instances can be customized or configured for different testing scenarios or environments.

Here is a list of the different parameters that can be passed to the Harness class:

<br/>



| Parameter     | Description |
| - | - |
| **task**      | Task for which the model is to be evaluated (for example, text-classification, ner, or question-answering) |
| **model**     | Specifies the model(s) to be evaluated. This parameter can be provided as either a dictionary or a list of dictionaries. Each dictionary should contain the following keys: <ul><li>model (mandatory): PipelineModel or path to a saved model or pretrained pipeline/model from hub.</li><li>hub (mandatory): Hub (library) to use in the back end for loading the model from a public models hub or from a path.</li><li>type: whether the model is used for chat or completion.</li><li>output_schema: Pydantic model class describing the structured output expected from the model, as used in this notebook.</li></ul>|
| **data**      | The data to be used for evaluation. A dictionary providing flexibility and options for data sources. It should include the following keys: <ul><li>data_source (mandatory): The source of the data.</li><li>subset (optional): The subset of the data.</li><li>feature_column (optional): The column containing the features.</li><li>target_column (optional): The column containing the target labels.</li><li>split (optional): The data split to be used.</li><li>source (optional): Set to 'huggingface' when loading Hugging Face dataset.</li></ul> |
| **config**    | Configuration for the tests to be performed, specified as a YAML file or, as in this notebook, a dictionary. |
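Since the table mentions a YAML file, here is a sketch of writing the robustness configuration used later in this notebook to disk. The file name `config.yml` is an arbitrary choice, and passing the path as `config=` is an assumption that should be checked against the LangTest documentation.

```python
from pathlib import Path

# YAML equivalent of the robustness configuration used in this notebook.
CONFIG_YAML = """\
tests:
  defaults:
    min_pass_rate: 0.5
  robustness:
    uppercase:
      min_pass_rate: 0.8
    add_ocr_typo:
      min_pass_rate: 0.8
    add_tabs:
      min_pass_rate: 0.8
"""

config_path = Path("config.yml")  # hypothetical file name
config_path.write_text(CONFIG_YAML, encoding="utf-8")

# Assumed usage (not executed here):
# Harness(task=..., model=..., data=..., config=str(config_path))
print(config_path.read_text(encoding="utf-8"))
```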


<br/>
<br/>

In [3]:
from langtest import Harness

harness = Harness(
    task='question-answering',
    model={
        'model': 'llama3.1',
        'hub': 'ollama',
        'type': 'chat',
        'output_schema': Answer,
    },
    data={
        "data_source": "BoolQ",
        "split": "test-tiny",
    },
    config={
        "tests": {
            "defaults": {
                "min_pass_rate": 0.5,
            },
            "robustness": {
                "uppercase": {
                    "min_pass_rate": 0.8,
                },
                "add_ocr_typo": {
                    "min_pass_rate": 0.8,
                },
                "add_tabs": {
                    "min_pass_rate": 0.8,
                }
            }
        }
    }
)

🚨 Your Spark-Healthcare is outdated, installed==5.5.2 but latest version==5.5.0
You can run nlp.install() to update Spark-Healthcare
Test Configuration : 
 {
 "tests": {
  "defaults": {
   "min_pass_rate": 0.5
  },
  "robustness": {
   "uppercase": {
    "min_pass_rate": 0.8
   },
   "add_ocr_typo": {
    "min_pass_rate": 0.8
   },
   "add_tabs": {
    "min_pass_rate": 0.8
   }
  }
 }
}


### Generate the testcases
When several models are evaluated, the generated test cases include an extra `model_name` column that specifies which model each test case is for. With a single model, as here, that column does not appear.

In [4]:
harness.generate()

Generating testcases...: 100%|██████████| 1/1 [00:00<?, ?it/s]




In [5]:
testcases_df = harness.testcases()

testcases_df.head()

| | category | test_type | original_context | original_question | perturbed_context | perturbed_question |
| - | - | - | - | - | - | - |
| 0 | robustness | uppercase | 20 euro note -- Until now there has been only ... | is the first series 20 euro note still legal t... | 20 EURO NOTE -- UNTIL NOW THERE HAS BEEN ONLY ... | IS THE FIRST SERIES 20 EURO NOTE STILL LEGAL T... |
| 1 | robustness | uppercase | 2018–19 UEFA Champions League -- The final wil... | do the champions league winners get automatic ... | 2018–19 UEFA CHAMPIONS LEAGUE -- THE FINAL WIL... | DO THE CHAMPIONS LEAGUE WINNERS GET AUTOMATIC ... |
| 2 | robustness | uppercase | Bullsnake -- Bullsnakes are very powerful cons... | can a bull snake kill a small dog | BULLSNAKE -- BULLSNAKES ARE VERY POWERFUL CONS... | CAN A BULL SNAKE KILL A SMALL DOG |
| 3 | robustness | uppercase | NBA playoffs -- All rounds are best-of-seven s... | are all nba playoff games best of 7 | NBA PLAYOFFS -- ALL ROUNDS ARE BEST-OF-SEVEN S... | ARE ALL NBA PLAYOFF GAMES BEST OF 7 |
| 4 | robustness | uppercase | Manchester station group -- The Manchester sta... | can i use my train ticket on the tram in manch... | MANCHESTER STATION GROUP -- THE MANCHESTER STA... | CAN I USE MY TRAIN TICKET ON THE TRAM IN MANCH... |


The `harness.generate()` method automatically generates the test cases based on the provided configuration, and `harness.testcases()` returns them as a DataFrame.
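As a rough mental model of what test-case generation does for the `uppercase` test, each original sample is paired with a perturbed copy. The sketch below is illustrative only and is not LangTest's code: `make_uppercase_testcase` is a hypothetical helper, the context string is made up, and the field names mirror the columns shown above.

```python
from typing import Dict


def make_uppercase_testcase(context: str, question: str) -> Dict[str, str]:
    """Pair an original sample with its uppercased counterpart."""
    return {
        "category": "robustness",
        "test_type": "uppercase",
        "original_context": context,
        "original_question": question,
        "perturbed_context": context.upper(),
        "perturbed_question": question.upper(),
    }


testcase = make_uppercase_testcase(
    "A short made-up context about bullsnakes.",  # placeholder context
    "can a bull snake kill a small dog",
)
print(testcase["perturbed_question"])  # CAN A BULL SNAKE KILL A SMALL DOG
```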

### Running the tests

In [6]:
harness.run()

Running testcases... : 100%|██████████| 150/150 [03:34<00:00,  1.43s/it]




### Generated Results

In [7]:
resultsdf = harness.generated_results()
resultsdf.head()

| | category | test_type | original_context | original_question | perturbed_context | perturbed_question | expected_result | actual_result | pass |
| - | - | - | - | - | - | - | - | - | - |
| 0 | robustness | uppercase | 20 euro note -- Until now there has been only ... | is the first series 20 euro note still legal t... | 20 EURO NOTE -- UNTIL NOW THERE HAS BEEN ONLY ... | IS THE FIRST SERIES 20 EURO NOTE STILL LEGAL T... | answer=False rationale=Rationale(reason='The t... | answer=False rationale=Rationale(reason='The t... | True |
| 1 | robustness | uppercase | 2018–19 UEFA Champions League -- The final wil... | do the champions league winners get automatic ... | 2018–19 UEFA CHAMPIONS LEAGUE -- THE FINAL WIL... | DO THE CHAMPIONS LEAGUE WINNERS GET AUTOMATIC ... | answer=True rationale=Rationale(reason="Accord... | answer=True rationale=Rationale(reason='The wi... | True |
| 2 | robustness | uppercase | Bullsnake -- Bullsnakes are very powerful cons... | can a bull snake kill a small dog | BULLSNAKE -- BULLSNAKES ARE VERY POWERFUL CONS... | CAN A BULL SNAKE KILL A SMALL DOG | answer=False rationale=Rationale(reason='Bulls... | answer=False rationale=Rationale(reason='Bulls... | True |
| 3 | robustness | uppercase | NBA playoffs -- All rounds are best-of-seven s... | are all nba playoff games best of 7 | NBA PLAYOFFS -- ALL ROUNDS ARE BEST-OF-SEVEN S... | ARE ALL NBA PLAYOFF GAMES BEST OF 7 | answer=True rationale=Rationale(reason='NBA pl... | answer=True rationale=Rationale(reason='The te... | True |
| 4 | robustness | uppercase | Manchester station group -- The Manchester sta... | can i use my train ticket on the tram in manch... | MANCHESTER STATION GROUP -- THE MANCHESTER STA... | CAN I USE MY TRAIN TICKET ON THE TRAM IN MANCH... | answer=True rationale=Rationale(reason='For co... | answer=True rationale=Rationale(reason="The te... | True |


In [8]:
harness.report()

| | category | test_type | fail_count | pass_count | pass_rate | minimum_pass_rate | pass |
| - | - | - | - | - | - | - | - |
| 0 | robustness | uppercase | 3 | 47 | 94% | 80% | True |
| 1 | robustness | add_ocr_typo | 6 | 44 | 88% | 80% | True |
| 2 | robustness | add_tabs | 11 | 39 | 78% | 80% | False |
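The report columns follow from simple arithmetic on the pass and fail counts. The sketch below recomputes them from the counts shown above; `summarize` is a hypothetical helper, and displaying the rate as a whole percent is an assumption about how the report formats it.

```python
from typing import Dict, Tuple

# (fail_count, pass_count) per test, copied from the report above.
counts: Dict[str, Tuple[int, int]] = {
    "uppercase": (3, 47),
    "add_ocr_typo": (6, 44),
    "add_tabs": (11, 39),
}
MIN_PASS_RATE = 0.8  # threshold configured for each robustness test


def summarize(fail_count: int, pass_count: int, minimum: float) -> Tuple[float, bool]:
    """Return the pass rate and whether it meets the minimum."""
    pass_rate = pass_count / (fail_count + pass_count)
    return pass_rate, pass_rate >= minimum


for name, (failed, passed) in counts.items():
    rate, ok = summarize(failed, passed, MIN_PASS_RATE)
    print(f"{name}: {rate:.0%} -> {'pass' if ok else 'fail'}")
```

With 39 of 50 cases passing, `add_tabs` lands at 78%, just under the 80% threshold, which is why it is the only test marked as failed.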
