![LangTest logo](https://raw.githubusercontent.com/Pacific-AI-Corp/langtest/main/docs/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Pacific-AI-Corp/langtest/blob/main/demo/tutorials/llm_notebooks/Legal_Support.ipynb)

**LangTest** is an open-source Python library designed to help developers deliver safe and effective Natural Language Processing (NLP) models. Whether you are using **John Snow Labs, Hugging Face, spaCy** models or **OpenAI, Cohere, AI21, Hugging Face Inference API and Azure-OpenAI** based LLMs, it has got you covered. You can test any Named Entity Recognition (NER), Text Classification, fill-mask, or Translation model using the library. We also support testing LLMs for Question-Answering, Summarization, and text-generation tasks on benchmark datasets. The library supports 60+ out-of-the-box tests. For a complete list of supported test categories, please refer to the [documentation](http://langtest.org/docs/pages/docs/test_categories).

Metrics are calculated by comparing the model's extractions on the original list of sentences against its extractions on the noisy list of sentences. The original annotated labels are not used at any point; we are simply comparing the model against itself in two settings.
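The comparison described above can be sketched in a few lines. This is only an illustration of the idea, not LangTest's implementation: `toy_extract` is a hypothetical stand-in for a model, and the sentences and perturbations are made up for the example.

```python
from typing import Callable, List


def toy_extract(sentence: str) -> List[str]:
    """Hypothetical stand-in for a model: treat capitalized tokens as entities."""
    return [tok.strip(".,!?") for tok in sentence.split() if tok[:1].isupper()]


def consistency_rate(
    extract: Callable[[str], List[str]],
    originals: List[str],
    perturbed: List[str],
) -> float:
    """Fraction of sentence pairs whose extractions match, ignoring case.

    No gold labels are involved: the model is compared against itself.
    """
    matches = 0
    for orig, noisy in zip(originals, perturbed):
        orig_entities = [e.lower() for e in extract(orig)]
        noisy_entities = [e.lower() for e in extract(noisy)]
        if orig_entities == noisy_entities:
            matches += 1
    return matches / len(originals)


# Made-up sentences and their "noisy" counterparts (illustrative only).
originals = ["Alice met Bob in Paris.", "The court sided with Acme."]
perturbed = ["Alice met Bob in Paris!", "THE COURT SIDED WITH ACME."]

rate = consistency_rate(toy_extract, originals, perturbed)
print(f"consistency rate: {rate:.0%}")
```

The punctuation change leaves the toy extractor's output unchanged, while upper-casing makes every token look like an entity, so only one of the two pairs is consistent.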

# Getting started with LangTest

In [None]:
!pip install "langtest[openai]"

In [None]:
import os

os.environ["OPENAI_API_KEY"] = "<ADD OPEN-AI-KEY>"

# Harness and Its Parameters

The Harness class is a testing class for Natural Language Processing (NLP) models. It evaluates the performance of an NLP model on a given task using test data and generates a report with the test results. Harness is imported from the LangTest library with `from langtest import Harness`.

The class provides a blueprint for conducting NLP testing, and instances of Harness can be configured for different testing scenarios or environments.

The following parameters can be passed to the Harness constructor:

<br/>


| Parameter  | Description |  
| - | - |
| **task**      | Task for which the model is to be evaluated (legal, question-answering, summarization) |
| **model**     | Specifies the model(s) to be evaluated. This parameter can be provided as either a dictionary or a list of dictionaries. Each dictionary should contain the following keys: <ul><li>model (mandatory): PipelineModel or path to a saved model or pretrained pipeline/model from hub.</li><li>hub (mandatory): Hub (library) to use in the back end for loading the model from a public models hub or from a path.</li></ul> |
| **data**      | The data to be used for evaluation. A dictionary providing flexibility and options for data sources. It should include the following keys: <ul><li>data_source (mandatory): The source of the data.</li><li>subset (optional): The subset of the data.</li><li>feature_column (optional): The column containing the features.</li><li>target_column (optional): The column containing the target labels.</li><li>split (optional): The data split to be used.</li><li>source (optional): Set to 'huggingface' when loading Hugging Face dataset.</li></ul> |
| **config**    | Configuration for the tests to be performed, specified in the form of a YAML file. |
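
As a small illustration of the `model` row above, the sketch below accepts either a single dictionary or a list of dictionaries and checks the two keys the table marks as mandatory. The helper `normalize_models` is hypothetical and written only for this example; it is not part of LangTest, and the library's own validation may differ.

```python
from typing import Dict, List, Union

ModelSpec = Dict[str, str]

# Both keys are marked "mandatory" in the parameter table.
MANDATORY_MODEL_KEYS = ("model", "hub")


def normalize_models(model: Union[ModelSpec, List[ModelSpec]]) -> List[ModelSpec]:
    """Hypothetical helper: return a list of model specs, checking mandatory keys."""
    specs = [model] if isinstance(model, dict) else list(model)
    for spec in specs:
        missing = [key for key in MANDATORY_MODEL_KEYS if key not in spec]
        if missing:
            raise ValueError(f"model spec {spec!r} is missing keys: {missing}")
    return specs


# The model dictionary used later in this notebook.
single = {"model": "gpt-3.5-turbo-instruct", "hub": "openai"}

print(normalize_models(single))            # one spec, wrapped in a list
print(len(normalize_models([single, single])))  # a list is passed through
```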

<br/>
<br/>

In [1]:
# Import Harness from the LangTest library
from langtest import Harness

# legal 👨‍⚖️⚖️🏢

We have added a new **legal-support** test. The legal-support dataset evaluates fine-grained reverse entailment. Each sample consists of a text passage making a legal claim and two case summaries. Each summary describes a legal conclusion reached by a different court. The task is to determine which case (i.e. legal conclusion) most forcefully and directly supports the legal claim in the passage. The benchmark is built from annotations derived from a legal taxonomy that makes different levels of entailment explicit (e.g. "directly supports" vs "indirectly supports"). As such, the benchmark tests a model's ability to reason about the strength of support a particular case summary provides.

### Supported Dataset: Legal-Support
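To make the sample structure concrete, here is a minimal sketch of one legal-support item and how an answer could be checked against it. The class `LegalSupportSample` and its `is_pass` method are assumptions made for illustration, not LangTest classes, and the field values are placeholders rather than real dataset text.

```python
from dataclasses import dataclass


@dataclass
class LegalSupportSample:
    """Hypothetical container mirroring the columns shown later in this notebook."""

    legal_claim: str
    legal_conclusion_A: str
    legal_conclusion_B: str
    correct_conclusion: str  # "a" or "b"

    def is_pass(self, model_conclusion: str) -> bool:
        """Pass when the model picks the same conclusion as the annotation."""
        return model_conclusion.strip().lower() == self.correct_conclusion.lower()


# Placeholder text only; real samples come from the Legal-Support dataset.
sample = LegalSupportSample(
    legal_claim="<passage making a legal claim>",
    legal_conclusion_A="<summary of the conclusion reached by court A>",
    legal_conclusion_B="<summary of the conclusion reached by court B>",
    correct_conclusion="a",
)

print(sample.is_pass("a"))
print(sample.is_pass("b"))
```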

**Data Splits**

- `test`: contains 100 samples.

### Setup and Configure Harness

In [4]:
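# Model to evaluate and the hub used to load it
model = {"model": "gpt-3.5-turbo-instruct", "hub": "openai"}

# Benchmark dataset and the split to use
data = {"data_source": "Legal-Support", "split": "test"}

# Question-answering task, restricted to the legal category
task = {"task": "question-answering", "category": "legal"}

harness = Harness(task=task, model=model, data=data)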

Test Configuration : 
 {
 "model_parameters": {
  "temperature": 0
 },
 "tests": {
  "defaults": {
   "min_pass_rate": 1.0
  },
  "legal": {
   "legal-support": {
    "min_pass_rate": 0.7
   }
  }
 }
}


We have specified the task as `question-answering` with the category `legal`, the hub as `openai`, and the model as `gpt-3.5-turbo-instruct`.
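The test configuration printed above can also be written as a plain Python dictionary. The sketch below mirrors that configuration and shows one plausible reading of how the `defaults` entry interacts with a per-test threshold. The helper `effective_min_pass_rate` is hypothetical, and the fallback rule it encodes is an assumption, not a statement about LangTest internals.

```python
from typing import Any, Dict

# Same values as the "Test Configuration" output above.
config: Dict[str, Any] = {
    "model_parameters": {"temperature": 0},
    "tests": {
        "defaults": {"min_pass_rate": 1.0},
        "legal": {"legal-support": {"min_pass_rate": 0.7}},
    },
}


def effective_min_pass_rate(cfg: Dict[str, Any], category: str, test_type: str) -> float:
    """Hypothetical helper: per-test threshold, falling back to the defaults entry."""
    tests = cfg["tests"]
    default = tests.get("defaults", {}).get("min_pass_rate", 1.0)
    return tests.get(category, {}).get(test_type, {}).get("min_pass_rate", default)


print(effective_min_pass_rate(config, "legal", "legal-support"))

# LangTest notebooks commonly pass such a dictionary with harness.configure(config);
# that call is left out here because it needs the library and an API key.
```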



### Generating the test cases.

In [5]:
harness.generate()

Generating testcases...: 100%|██████████| 1/1 [00:00<00:00, 5722.11it/s]




The `harness.generate()` method automatically generates the test cases based on the provided configuration.

In [6]:
harness.testcases()

Unnamed: 0,category,test_type,case,legal_claim,legal_conclusion_A,legal_conclusion_B,correct_conlusion
0,legal,legal-support,"O’Leary v. Schweiker, 710 F.2d 1334, 1341 (8th...","""A treating physician's opinion does not deser...","""Because of the interpretive problems inherent...","""A treating physician's checkmarks on an MSS f...",a
1,legal,legal-support,"Causey v. State, 274 Ga. App. 506, 508 (618 SE...","Moreover, in addition to the officer's testimo...","loaded weapon, large quantity of narcotics and...",evidence sufficient where officer testified qu...,a
2,legal,legal-support,"Berman v. Bromberg, 56 Cal.App.4th 936, 948, 6...","Supp. Opp'n, at 17), there is no evidence that...","""It is the outward expression of the agreement...",observing that evidence of subjective intent i...,b
3,legal,legal-support,"Cohen v. Young, 127 F.2d 721, 724 (6th Cir.194...",Other courts agree that shareholders who recei...,treating an objector responding to a trial cou...,holding that appellant who properly filed an o...,a
4,legal,legal-support,"See Ex parte Jones, 163 Tex. 513, 358 S.W.2d 3...","Even before the Francis decision, the supreme ...",holding that judgment based on terms of settle...,holding that agreement for periodic child supp...,a
...,...,...,...,...,...,...,...
95,legal,legal-support,"See Stoops v. One Call Comm., Inc., 141 F.3d 3...","16. Instead, when an employee who is eligible ...","employee need not mention, and may be ignorant...",employee who told employer she was hospitalize...,a
96,legal,legal-support,"See Ott, 827 F.2d at 477 (holding, in a differ...",We recognize that disclosure may not always be...,noting that disclosure of sensitive informatio...,"holding, in a different context, that ""Congres...",b
97,legal,legal-support,"See, e.g., State v. Farmer, 158 N.C. App. 699,...","Subsequently, the court clerk asked every juro...",trial court properly gave jury second verdict ...,evidentiary rule prohibiting juror from testif...,a
98,legal,legal-support,"See also Branning v. CNA Ins. Cos., 729 F.Supp...",When the policy means to refer to defense cost...,finding an insurance policy obligating a prima...,finding policy ambiguous as to whether defense...,b


The `harness.testcases()` method displays the generated test cases in the form of a pandas DataFrame.

### Running the tests

In [7]:
harness.run()


Running testcases... : 100%|██████████| 100/100 [00:27<00:00,  3.58it/s]




`harness.run()` is called after `harness.generate()` and is used to run all the tests. It produces a pass/fail flag for each test case.

In [8]:
harness.generated_results()

Unnamed: 0,category,test_type,case,legal_claim,legal_conclusion_A,legal_conclusion_B,correct_conlusion,model_conclusion,pass
0,legal,legal-support,"O’Leary v. Schweiker, 710 F.2d 1334, 1341 (8th...","""A treating physician's opinion does not deser...","""Because of the interpretive problems inherent...","""A treating physician's checkmarks on an MSS f...",a,b,False
1,legal,legal-support,"Causey v. State, 274 Ga. App. 506, 508 (618 SE...","Moreover, in addition to the officer's testimo...","loaded weapon, large quantity of narcotics and...",evidence sufficient where officer testified qu...,a,b,False
2,legal,legal-support,"Berman v. Bromberg, 56 Cal.App.4th 936, 948, 6...","Supp. Opp'n, at 17), there is no evidence that...","""It is the outward expression of the agreement...",observing that evidence of subjective intent i...,b,b,True
3,legal,legal-support,"Cohen v. Young, 127 F.2d 721, 724 (6th Cir.194...",Other courts agree that shareholders who recei...,treating an objector responding to a trial cou...,holding that appellant who properly filed an o...,a,a,True
4,legal,legal-support,"See Ex parte Jones, 163 Tex. 513, 358 S.W.2d 3...","Even before the Francis decision, the supreme ...",holding that judgment based on terms of settle...,holding that agreement for periodic child supp...,a,a,True
...,...,...,...,...,...,...,...,...,...
95,legal,legal-support,"See Stoops v. One Call Comm., Inc., 141 F.3d 3...","16. Instead, when an employee who is eligible ...","employee need not mention, and may be ignorant...",employee who told employer she was hospitalize...,a,a,True
96,legal,legal-support,"See Ott, 827 F.2d at 477 (holding, in a differ...",We recognize that disclosure may not always be...,noting that disclosure of sensitive informatio...,"holding, in a different context, that ""Congres...",b,b,True
97,legal,legal-support,"See, e.g., State v. Farmer, 158 N.C. App. 699,...","Subsequently, the court clerk asked every juro...",trial court properly gave jury second verdict ...,evidentiary rule prohibiting juror from testif...,a,a,True
98,legal,legal-support,"See also Branning v. CNA Ins. Cos., 729 F.Supp...",When the policy means to refer to defense cost...,finding an insurance policy obligating a prima...,finding policy ambiguous as to whether defense...,b,b,True


This method returns the generated results in the form of a pandas dataframe, which provides a convenient and easy-to-use format for working with the test results. You can use this method to quickly identify the test cases that failed and to determine where fixes are needed.
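As a rough sketch of that kind of inspection, the snippet below filters the failed cases and counts (correct, predicted) pairs. To stay self-contained it uses plain dictionaries built from the first five rows shown above instead of the real DataFrame; the column names, including the `correct_conlusion` spelling, are copied from the output.

```python
from collections import Counter

# First five rows of the output above, reduced to the columns used here.
rows = [
    {"correct_conlusion": "a", "model_conclusion": "b", "pass": False},
    {"correct_conlusion": "a", "model_conclusion": "b", "pass": False},
    {"correct_conlusion": "b", "model_conclusion": "b", "pass": True},
    {"correct_conlusion": "a", "model_conclusion": "a", "pass": True},
    {"correct_conlusion": "a", "model_conclusion": "a", "pass": True},
]

# Keep only the failing test cases.
failed = [row for row in rows if not row["pass"]]

# Count how often each (correct, predicted) pair occurs.
confusion = Counter(
    (row["correct_conlusion"], row["model_conclusion"]) for row in rows
)

print(len(failed), "failed cases")
print(dict(confusion))

# On the DataFrame itself, assuming `pass` is a boolean column, the
# equivalent filter would be: df[~df["pass"]]
```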

### Final Results

We can call `.report()`, which summarizes the results by giving the pass and fail counts, the pass rate, and an overall pass/fail flag for each test.

In [9]:
harness.report()

Unnamed: 0,category,test_type,fail_count,pass_count,pass_rate,minimum_pass_rate,pass
0,legal,legal-support,25,75,75%,70%,True
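
The numbers in the report can be reproduced by hand. The sketch below recomputes the pass rate from the counts and compares it with the minimum pass rate. The function `summarize` is written for this example only, and the use of `>=` for the comparison is an assumption about how the threshold is applied.

```python
from typing import Dict, Union


def summarize(
    pass_count: int, fail_count: int, minimum_pass_rate: float
) -> Dict[str, Union[float, bool]]:
    """Recompute the report columns from raw counts (illustrative only)."""
    total = pass_count + fail_count
    pass_rate = pass_count / total
    return {"pass_rate": pass_rate, "pass": pass_rate >= minimum_pass_rate}


# Counts and threshold taken from the report above.
summary = summarize(pass_count=75, fail_count=25, minimum_pass_rate=0.70)
print(f"pass rate: {summary['pass_rate']:.0%}, passed: {summary['pass']}")
```

With 75 passes out of 100 cases the pass rate is 75%, which clears the 70% threshold set for `legal-support` in the configuration.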
