![LangTest logo](https://raw.githubusercontent.com/Pacific-AI-Corp/langtest/main/docs/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Pacific-AI-Corp/langtest/blob/main/demo/tutorials/llm_notebooks/NER%20Casual%20LLM.ipynb)

**LangTest** is an open-source Python library designed to help developers deliver safe and effective Natural Language Processing (NLP) models. Whether you are using **John Snow Labs, Hugging Face, spaCy** models or **OpenAI, Cohere, AI21, Hugging Face Inference API and Azure-OpenAI** based LLMs, it has you covered. You can use the library to test any Named Entity Recognition (NER), Text Classification, fill-mask, or Translation model. It also supports testing LLMs for Question-Answering, Summarization and text-generation tasks on benchmark datasets. The library supports 60+ out-of-the-box tests. For a complete list of supported test categories, please refer to the [documentation](http://langtest.org/docs/pages/docs/test_categories).

For robustness tests, metrics are calculated by comparing the model's extractions on the original list of sentences against its extractions on the noisy (perturbed) list of sentences. The original annotated labels are not used at any point; we are simply comparing the model against itself in two settings.
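The comparison described above can be sketched in a few lines of plain Python. This is a simplified illustration, not LangTest's internal logic: the function name `robustness_pass_rate` and the matching rule (entity text compared case-insensitively, labels compared exactly) are assumptions made for the example.

```python
def robustness_pass_rate(original_preds, perturbed_preds):
    """Fraction of sentence pairs whose predicted entities agree.

    Each prediction is a list of (entity_text, label) tuples. Entity text is
    compared case-insensitively because the perturbation itself changes case.
    """
    def normalise(entities):
        return sorted((text.lower(), label) for text, label in entities)

    passed = [
        normalise(orig) == normalise(pert)
        for orig, pert in zip(original_preds, perturbed_preds)
    ]
    return sum(passed) / len(passed)


# Model output on the original sentences ...
original = [
    [("FLORIDA", "LOCATION"), ("CINCINNATI", "LOCATION")],
    [("Wigan", "Location"), ("Bradford Bulls", "Organization")],
]
# ... and on the lowercased versions of the same sentences.
perturbed = [
    [("florida", "LOCATION"), ("cincinnati", "LOCATION")],
    [("wigan", "ORG"), ("bradford", "ORG")],
]

rate = robustness_pass_rate(original, perturbed)
print(f"pass rate: {rate:.0%}")
```

No gold annotations appear anywhere in the sketch: the model's output on the original text serves as the expected result.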

# Getting started with LangTest

In [None]:
!pip install "langtest[evaluate,openai]==2.2.0" requests

# Harness and Its Parameters

The Harness class is a testing class for Natural Language Processing (NLP) models. It evaluates the performance of an NLP model on a given task using test data and generates a report with test results. Harness can be imported from the LangTest library in the following way.

In [1]:
#Import Harness from the LangTest library
from langtest import Harness

This imports the Harness class from the `langtest` module. The class provides a blueprint for conducting NLP testing, and its instances can be customized or configured for different testing scenarios or environments.

Here is a list of the parameters that can be passed to the Harness constructor:

<br/>


| Parameter  | Description |  
| - | - | 
|**task**     |Task for which the model is to be evaluated (ner)|
| **model**     | Specifies the model(s) to be evaluated. This parameter can be provided as either a dictionary or a list of dictionaries. Each dictionary should contain the following keys: <ul><li>model (mandatory): 	PipelineModel or path to a saved model or pretrained pipeline/model from hub.</li><li>hub (mandatory): Hub (library) to use in back-end for loading model from public models hub or from path</li></ul>|
| **data**      | The data to be used for evaluation. A dictionary providing flexibility and options for data sources. It should include the following keys: <ul><li>data_source (mandatory): The source of the data.</li><li>subset (optional): The subset of the data.</li><li>feature_column (optional): The column containing the features.</li><li>target_column (optional): The column containing the target labels.</li><li>split (optional): The data split to be used.</li><li>source (optional): Set to 'huggingface' when loading Hugging Face dataset.</li></ul> |
| **config**    | Configuration for the tests to be performed, specified as a YAML file or, as in this notebook, as a dictionary. |

<br/>
<br/>
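As a quick way to internalise which keys the table marks as mandatory, the sketch below checks `model` and `data` arguments before they are handed to `Harness`. The helper `check_harness_args` is hypothetical and written only for this illustration; it is not part of LangTest, and it covers only the mandatory keys listed in the table.

```python
MODEL_REQUIRED = {"model", "hub"}   # mandatory keys of each model dictionary
DATA_REQUIRED = {"data_source"}     # mandatory key of the data dictionary


def check_harness_args(model, data):
    """Return a list of problems found in `model` and `data` (empty if none)."""
    problems = []
    # `model` may be a single dictionary or a list of dictionaries.
    models = model if isinstance(model, list) else [model]
    for i, entry in enumerate(models):
        missing = MODEL_REQUIRED - set(entry)
        if missing:
            problems.append(f"model[{i}] is missing {sorted(missing)}")
    missing = DATA_REQUIRED - set(data)
    if missing:
        problems.append(f"data is missing {sorted(missing)}")
    return problems


ok = check_harness_args(
    model={"model": "gpt-3.5-turbo-instruct", "hub": "openai"},
    data={"data_source": "sample.conll"},
)
bad = check_harness_args(model={"model": "gpt-3.5-turbo-instruct"}, data={})
print(ok)
print(bad)
```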

A system prompt, in the context of a Large Language Model (LLM), is a predefined input that guides the model to generate a specific structured output. This is particularly useful in tasks where the output needs to follow a certain format or structure.

For instance, in Named Entity Recognition (NER), a task in Natural Language Processing (NLP), we might want the model to identify and classify named entities in a text into predefined categories like person names, organizations, locations, medical codes, time expressions, quantities, monetary values, percentages, etc.

In such a case, a system prompt could be a sentence with placeholders for the model to fill. The LLM, upon receiving this prompt, generates an output that fills these placeholders with appropriate entities, thereby producing a structured output.

Using system prompts in this way helps control the output of LLMs, making them more useful in practical applications where structured outputs are required.
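To make the idea concrete, here is a minimal sketch of a prompt template and a parser for the structured answer. The prompt wording, `build_prompt`, and `parse_entities` are hypothetical and are not LangTest's built-in prompt or parser; the `entity: LABEL` answer format is chosen to match the style of the results shown later in this notebook.

```python
PROMPT_TEMPLATE = (
    "Extract the named entities from the sentence below.\n"
    "Answer on one line as: <entity>: <LABEL>, <entity>: <LABEL>\n"
    "Sentence: {sentence}\n"
    "Entities:"
)


def build_prompt(sentence):
    """Fill the sentence placeholder of the template."""
    return PROMPT_TEMPLATE.format(sentence=sentence)


def parse_entities(completion):
    """Turn 'A: X, B: Y' into [('A', 'X'), ('B', 'Y')], skipping malformed parts."""
    entities = []
    for chunk in completion.split(","):
        text, sep, label = chunk.rpartition(":")
        if sep and text.strip() and label.strip():
            entities.append((text.strip(), label.strip()))
    return entities


prompt = build_prompt("FLORIDA AT CINCINNATI")
# A completion in the requested format, written by hand for the example.
entities = parse_entities("FLORIDA: LOCATION, CINCINNATI: LOCATION")
print(entities)
```

Because the expected format is fixed by the prompt, the parser can stay simple and anything that does not fit the format is dropped rather than guessed at.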

# OpenAI Model Testing For NER

In this section, we test OpenAI models on the NER task.

For now, LangTest supports robustness and accuracy tests for LLM testing.

### Set environment for OpenAI

In [2]:
import os

os.environ["OPENAI_API_KEY"] = "<YOUR_API_KEY>"
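If the placeholder above is left unchanged, the failure only shows up later, when the model is first called. A small guard can surface the problem earlier. The helper `require_env` and the variable name `EXAMPLE_API_KEY` below are hypothetical and used purely for illustration.

```python
import os


def require_env(name):
    """Return the value of an environment variable, or raise a clear error."""
    value = os.environ.get(name, "")
    # Treat an empty value or an unfilled "<...>" placeholder as missing.
    if not value or value.startswith("<"):
        raise RuntimeError(f"{name} is not set; provide a real key before running the tests")
    return value


# Demonstration with a dummy variable so the snippet runs anywhere.
os.environ["EXAMPLE_API_KEY"] = "dummy-value"
print(require_env("EXAMPLE_API_KEY"))
```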


In [None]:
# Load CoNLL
!wget https://github.com/JohnSnowLabs/langtest/raw/main/langtest/data/conll/sample.conll

### Setup and Configure Harness

In [3]:
# Create a Harness object
h = Harness(task="ner",
            model={
                "model": "gpt-3.5-turbo-instruct",
                "hub": "openai",
            },
            data={
                "data_source": "../../data/conll03.conll"
            },
            config={
                "model_parameters": {
                    "temperature": 0,
                },
                "tests": {
                    "defaults": {
                        "min_pass_rate": 1.0
                    },
                    "robustness": {
                        "lowercase": {
                            "min_pass_rate": 0.7
                        }
                    },
                    "accuracy": {
                        "min_f1_score": {
                            "min_score": 0.7,
                        },
                    }
                }
            }
            )



Test Configuration : 
 {
 "model_parameters": {
  "temperature": 0
 },
 "tests": {
  "defaults": {
   "min_pass_rate": 1.0
  },
  "robustness": {
   "lowercase": {
    "min_pass_rate": 0.7
   }
  },
  "accuracy": {
   "min_f1_score": {
    "min_score": 0.7
   }
  }
 }
}


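We have specified the task as `ner`, the hub as `openai` and the model as `gpt-3.5-turbo-instruct`.

For the dataset, we used the default CoNLL dataset (`conll03.conll`).

For tests, we used the `lowercase` robustness test and the `min_f1_score` accuracy test. Other available robustness tests for the QA task are: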
* `add_context`
* `add_contraction`
* `add_punctuation`
* `add_typo`
* `add_ocr_typo`
* `american_to_british`
* `british_to_american`
* `lowercase`
* `strip_punctuation`
* `titlecase`
* `uppercase`
* `number_to_word`
* `add_abbreviation`
* `add_speech_to_text_typo`
* `add_slangs`
* `dyslexia_word_swap`
* `multiple_perturbations`
* `adjective_synonym_swap`
* `adjective_antonym_swap`
* `strip_all_punctuation`

Available Bias tests for the QA task are:

* `replace_to_male_pronouns`
* `replace_to_female_pronouns`
* `replace_to_neutral_pronouns`
* `replace_to_high_income_country`
* `replace_to_low_income_country`
* `replace_to_upper_middle_income_country`
* `replace_to_lower_middle_income_country`
* `replace_to_white_firstnames`
* `replace_to_black_firstnames`
* `replace_to_hispanic_firstnames`
* `replace_to_asian_firstnames`
* `replace_to_white_lastnames`
* `replace_to_sikh_names`
* `replace_to_christian_names`
* `replace_to_hindu_names`
* `replace_to_muslim_names`
* `replace_to_inter_racial_lastnames`
* `replace_to_native_american_lastnames`
* `replace_to_asian_lastnames`
* `replace_to_hispanic_lastnames`
* `replace_to_black_lastnames`
* `replace_to_parsi_names`
* `replace_to_jain_names`
* `replace_to_buddhist_names`

Available Representation tests for the QA task are:

* `min_gender_representation_count`
* `min_ethnicity_name_representation_count`
* `min_religion_name_representation_count`
* `min_country_economic_representation_count`
* `min_gender_representation_proportion`
* `min_ethnicity_name_representation_proportion`
* `min_religion_name_representation_proportion`
* `min_country_economic_representation_proportion`


Available Accuracy tests for the QA task are:

* `min_exact_match_score`
* `min_bleu_score`
* `min_rouge1_score`
* `min_rouge2_score`
* `min_rougeL_score`
* `min_rougeLsum_score`


Available Fairness tests for the QA task are:

* `max_gender_rouge1_score`
* `max_gender_rouge2_score`
* `max_gender_rougeL_score`
* `max_gender_rougeLsum_score`
* `min_gender_rouge1_score`
* `min_gender_rouge2_score`
* `min_gender_rougeL_score`
* `min_gender_rougeLsum_score`

You can also set prompts and other model parameters in config. Possible parameters are:
* `user_prompt`: Prompt to be given to the model.
* `temperature`: Temperature of the model.
* `max_tokens`: Maximum number of output tokens allowed for the model.

In [4]:
h.configure({
    "model_parameters": {
        "temperature": 0,
    },
    "tests": {
        "defaults": {
            "min_pass_rate": 1.0
        },
        "robustness": {
            "lowercase": {
                "min_pass_rate": 0.7
            }
        },
        "accuracy": {
            "min_f1_score": {
                "min_score": 0.7,
            },
        }
    }
})

{'model_parameters': {'temperature': 0},
 'tests': {'defaults': {'min_pass_rate': 1.0},
  'robustness': {'lowercase': {'min_pass_rate': 0.7}},
  'accuracy': {'min_f1_score': {'min_score': 0.7}}}}

Here we have configured the harness to perform one robustness test (`lowercase`) and one accuracy test (`min_f1_score`), and defined the minimum pass rate or minimum score for each.
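Unlike the robustness test, an F1-based accuracy test needs gold labels to compare against. The sketch below shows one common way to compute a per-label F1 score from token-level labels and check it against a `min_score` of 0.7. It is an illustrative calculation with made-up labels, not LangTest's implementation, and `f1_per_label` is a hypothetical helper.

```python
def f1_per_label(gold, predicted):
    """Token-level F1 for every label that occurs in either sequence."""
    scores = {}
    for label in sorted(set(gold) | set(predicted)):
        pairs = list(zip(gold, predicted))
        tp = sum(g == label and p == label for g, p in pairs)
        fp = sum(g != label and p == label for g, p in pairs)
        fn = sum(g == label and p != label for g, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        scores[label] = 2 * precision * recall / denom if denom else 0.0
    return scores


# Made-up gold and predicted labels for six tokens.
gold      = ["PER", "O", "ORG", "O", "ORG", "O"]
predicted = ["PER", "O", "O",   "O", "ORG", "ORG"]

scores = f1_per_label(gold, predicted)
min_score = 0.7
passes = {label: score >= min_score for label, score in scores.items()}
print(scores)
print(passes)
```

Each label is evaluated on its own, which is why a report for this kind of test contains one row per entity label.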

➤ You can adjust the level of transformation in the sentence by using the "`prob`" parameter, which controls the proportion of words to be changed during robustness tests.

➤ **NOTE** : "`prob`" defaults to 1.0, which means all words will be transformed.

```
harness.configure({
    'tests': {
        'defaults': {'min_pass_rate': 0.65},
        'robustness': {
            'lowercase': {'min_pass_rate': 0.66, 'prob': 0.50},
            'uppercase': {'min_pass_rate': 0.60, 'prob': 0.70},
        }
    }
})
```
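One way to picture what a `prob` value does is the sketch below, which lowercases each word independently with the given probability. This is an assumption about the general idea, not LangTest's actual perturbation code; `lowercase_with_prob` and its `seed` argument are hypothetical.

```python
import random


def lowercase_with_prob(sentence, prob=1.0, seed=0):
    """Lowercase each whitespace-separated word with probability `prob`."""
    rng = random.Random(seed)  # local generator so results are repeatable
    words = sentence.split()
    return " ".join(w.lower() if rng.random() < prob else w for w in words)


text = "FLORIDA AT CINCINNATI"
print(lowercase_with_prob(text, prob=1.0))  # every word is transformed
print(lowercase_with_prob(text, prob=0.5))  # roughly half, depending on the seed
print(lowercase_with_prob(text, prob=0.0))  # nothing is transformed
```

With per-word sampling, `prob` is the expected proportion of changed words, so short sentences can deviate noticeably from it.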

In [5]:
import random as rnd 

rnd.seed(0)

h.data = rnd.choices(h.data, k=100)
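Note that `random.choices` samples with replacement, so the 100 selected samples may contain repeats. If distinct samples are preferred, `random.sample` is the standard-library alternative; it requires `k` to be no larger than the number of available items. The toy list below stands in for the harness data and is purely illustrative.

```python
import random

records = [f"sentence-{i}" for i in range(10)]  # stand-in for the loaded samples

rng = random.Random(0)
with_replacement = rng.choices(records, k=10)    # items may repeat
without_replacement = rng.sample(records, k=10)  # each item at most once

print(len(set(with_replacement)), "distinct items from choices")
print(len(set(without_replacement)), "distinct items from sample")
```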

### Generating the test cases.

In [6]:
h.generate()

Generating testcases...: 100%|██████████| 2/2 [00:00<?, ?it/s]




The `harness.generate()` method automatically generates the test cases based on the provided configuration.

In [7]:
h.testcases()

| | category | test_type | original | test_case |
| - | - | - | - | - |
| 0 | robustness | lowercase | He won acclaim for the insights that he gave i... | he won acclaim for the insights that he gave i... |
| 1 | robustness | lowercase | FLORIDA AT CINCINNATI | florida at cincinnati |
| 2 | robustness | lowercase | ISSUER : Bay Co Building Authority ST : MI | issuer : bay co building authority st : mi |
| 3 | robustness | lowercase | Chernomyrdin said on Thursday after a meeting ... | chernomyrdin said on thursday after a meeting ... |
| 4 | robustness | lowercase | Wigan 42 Bradford Bulls 36 | wigan 42 bradford bulls 36 |
| ... | ... | ... | ... | ... |
| 100 | accuracy | min_f1_score | - | ORG |
| 101 | accuracy | min_f1_score | - | MISC |
| 102 | accuracy | min_f1_score | - | PER |
| 103 | accuracy | min_f1_score | - | O |


The `harness.testcases()` method displays the generated test cases in the form of a pandas DataFrame.

### Running the tests

In [8]:
h.run()

Running testcases... :  99%|█████████▉| 104/105 [06:20<00:03,  3.66s/it]




`harness.run()` is called after `harness.generate()` and is used to run all the tests. It returns a pass/fail flag for each test.

In [9]:
df = h.generated_results()

This method returns the generated results in the form of a pandas dataframe, which provides a convenient and easy-to-use format for working with the test results. You can use this method to quickly identify the test cases that failed and to determine where fixes are needed.

In [10]:
df

| | category | test_type | original | test_case | expected_result | actual_result | pass |
| - | - | - | - | - | - | - | - |
| 0 | robustness | lowercase | He won acclaim for the insights that he gave i... | he won acclaim for the insights that he gave i... | Europe: Location, Europe: Location, 20th: Date... | he: PERSON, modern: DATE, europe: LOCATION, eu... | False |
| 1 | robustness | lowercase | FLORIDA AT CINCINNATI | florida at cincinnati | FLORIDA: LOCATION, CINCINNATI: LOCATION | florida: LOCATION, cincinnati: LOCATION | True |
| 2 | robustness | lowercase | ISSUER : Bay Co Building Authority ST : MI | issuer : bay co building authority st : mi | Bay Co Building Authority: Organization, ST: L... | bay: issuer, co: issuer, building: issuer, aut... | False |
| 3 | robustness | lowercase | Chernomyrdin said on Thursday after a meeting ... | chernomyrdin said on thursday after a meeting ... | Chernomyrdin: Person, Thursday: Date, Lebed: P... | chernomyrdin: PERSON, thursday: DATE, lebed: P... | False |
| 4 | robustness | lowercase | Wigan 42 Bradford Bulls 36 | wigan 42 bradford bulls 36 | Wigan: Location, 42: Number, Bradford Bulls: O... | wigan: ORG, 42: CARDINAL, bradford: ORG, bulls... | False |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 99 | robustness | lowercase | Indonesian President Suharto has asked busines... | indonesian president suharto has asked busines... | Indonesian: Location, President: Title, Suhart... | indonesian: GPE, president: TITLE, suharto: PE... | False |
| 100 | accuracy | min_f1_score | - | ORG | 0.7 | 0.0 | False |
| 101 | accuracy | min_f1_score | - | PER | - | - | - |
| 102 | accuracy | min_f1_score | - | O | 0.7 | 0.4 | False |
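To pick out the failing cases, the results can be filtered on the `pass` column. The sketch below works on plain dictionaries, assuming the DataFrame has been converted with pandas' `df.to_dict("records")`; only a few non-truncated fields from the output above are reproduced. The `pass` column can also hold `-` for rows without a result, so the filter checks for `False` explicitly.

```python
from collections import Counter

# A few rows from the results above, reduced to the fields used here.
records = [
    {"category": "robustness", "test_type": "lowercase",
     "original": "FLORIDA AT CINCINNATI", "pass": True},
    {"category": "robustness", "test_type": "lowercase",
     "original": "ISSUER : Bay Co Building Authority ST : MI", "pass": False},
    {"category": "robustness", "test_type": "lowercase",
     "original": "Wigan 42 Bradford Bulls 36", "pass": False},
    {"category": "accuracy", "test_type": "min_f1_score",
     "original": "-", "pass": "-"},
]

# Keep only rows that explicitly failed; "-" rows are neither pass nor fail.
failed = [row for row in records if row["pass"] is False]
failures_by_test = Counter((row["category"], row["test_type"]) for row in failed)

for row in failed:
    print(row["test_type"], "|", row["original"])
print(dict(failures_by_test))
```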


### Final Results

We can call `.report()`, which summarizes the results, giving the pass and fail counts and an overall pass/fail flag for each test.

In [11]:
h.report()

| | category | test_type | fail_count | pass_count | pass_rate | minimum_pass_rate | pass |
| - | - | - | - | - | - | - | - |
| 0 | robustness | lowercase | 69 | 31 | 31% | 70% | False |
| 1 | accuracy | min_f1_score | 4 | 0 | 0% | 100% | False |
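The arithmetic behind these rows can be reproduced directly from the counts. The sketch below assumes that a test passes when its pass rate is at least the minimum pass rate; the helper `summarise` is hypothetical and only mirrors the columns of the report.

```python
def summarise(pass_count, fail_count, minimum_pass_rate):
    """Compute the pass rate and compare it with the configured minimum."""
    total = pass_count + fail_count
    pass_rate = pass_count / total if total else 0.0
    return {"pass_rate": pass_rate, "pass": pass_rate >= minimum_pass_rate}


lowercase = summarise(pass_count=31, fail_count=69, minimum_pass_rate=0.70)
min_f1 = summarise(pass_count=0, fail_count=4, minimum_pass_rate=1.00)

print(f"lowercase:    {lowercase['pass_rate']:.0%} -> pass={lowercase['pass']}")
print(f"min_f1_score: {min_f1['pass_rate']:.0%} -> pass={min_f1['pass']}")
```

The 100% minimum for `min_f1_score` comes from the `defaults` entry in the configuration, since the accuracy test sets only `min_score` and no `min_pass_rate` of its own.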
