This notebook provides a basic illustration of how to use different parts of LegalBench. 

In [1]:
from tqdm.auto import tqdm
import datasets

from tasks import TASKS, ISSUE_TASKS
from utils import generate_prompts


In [2]:
# Suppress progress bars which appear every time a task is downloaded
datasets.utils.logging.set_verbosity_error()

### Task organization

`tasks.py` provides data structures which organize all LegalBench tasks. For instance, `TASKS` lists all LegalBench tasks, and `ISSUE_TASKS` lists all tasks in the issue-spotting reasoning category.

In [3]:
print(len(TASKS), TASKS[:10])
print()
print(len(ISSUE_TASKS), ISSUE_TASKS)

162 ['abercrombie', 'canada_tax_court_outcomes', 'citation_prediction_classification', 'citation_prediction_open', 'consumer_contracts_qa', 'contract_nli_confidentiality_of_agreement', 'contract_nli_explicit_identification', 'contract_nli_inclusion_of_verbally_conveyed_information', 'contract_nli_limited_use', 'contract_nli_no_licensing']

17 ['corporate_lobbying', 'learned_hands_benefits', 'learned_hands_business', 'learned_hands_consumer', 'learned_hands_courts', 'learned_hands_crime', 'learned_hands_divorce', 'learned_hands_domestic_violence', 'learned_hands_education', 'learned_hands_employment', 'learned_hands_estates', 'learned_hands_family', 'learned_hands_health', 'learned_hands_housing', 'learned_hands_immigration', 'learned_hands_torts', 'learned_hands_traffic']


### Loading task data

LegalBench can be downloaded from Hugging Face: https://huggingface.co/datasets/nguha/legalbench. Each LegalBench dataset comes with a `train` split and a `test` split.

- The `train` split is small (usually fewer than 10 samples). Following the [RAFT](https://raft.elicit.org/) benchmark, it's intended to provide labeled samples that can be used as few-shot demonstrations for prompts.
- The `test` split is larger, and contains samples to evaluate an LLM on. 

Documentation for each task can be found in the GitHub repository, under the task-specific folder. For instance, the documentation for the `abercrombie` task can be found at <https://github.com/HazyResearch/legalbench/tree/main/tasks/abercrombie>.

In [4]:
dataset = datasets.load_dataset("nguha/legalbench", "abercrombie")
dataset["train"].to_pandas()

| | answer | index | text |
|---|---|---|---|
| 0 | generic | 0 | The mark "Ivory" for a product made of elephan... |
| 1 | descriptive | 1 | The mark "Tasty" for bread. |
| 2 | suggestive | 2 | The mark "Caress" for body soap. |
| 3 | arbitrary | 3 | The mark "Virgin" for wireless communications. |
| 4 | fanciful | 4 | The mark "Aswelly" for a taxi service. |


### Loading and applying prompts

Each task folder also stores prompt templates which can be used with different models. In LegalBench, prompt templates are represented as text files, in which `{{col_name}}` is a placeholder for the value of the dataset column named `col_name`.

For instance:

In [5]:
# Load base prompt
with open("tasks/abercrombie/base_prompt.txt") as in_file:
    prompt_template = in_file.read()
print(prompt_template)

A mark is generic if it is the common name for the product. A mark is descriptive if it describes a purpose, nature, or attribute of the product. A mark is suggestive if it suggests or implies a quality or characteristic of the product. A mark is arbitrary if it is a real English word that has no relation to the product. A mark is fanciful if it is an invented word.

Q: The mark "Ivory" for a product made of elephant tusks. What is the type of mark?
A: generic

Q: The mark "Tasty" for bread. What is the type of mark?
A: descriptive

Q: The mark "Caress" for body soap. What is the type of mark?
A: suggestive

Q: The mark "Virgin" for wireless communications. What is the type of mark?
A: arbitrary

Q: The mark "Aswelly" for a taxi service. What is the type of mark?
A: fanciful

Q: {{text}} What is the type of mark?
A:
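
The demonstrations in this template match the rows of the `train` split shown earlier. As a rough sketch of how a template in this style could be assembled from labeled rows, consider the following. The helper `build_few_shot_prompt` and its parameters are hypothetical and not part of the LegalBench repository, and the short instruction string is a stand-in for the real task description.

```python
# Hypothetical helper (not part of LegalBench): assemble a few-shot prompt
# template from labeled rows, ending with a {{text}} placeholder.
def build_few_shot_prompt(instructions, train_rows, question_suffix):
    """train_rows is an iterable of dicts with "text" and "answer" keys."""
    parts = [instructions, ""]
    for row in train_rows:
        parts.append(f"Q: {row['text']} {question_suffix}")
        parts.append(f"A: {row['answer']}")
        parts.append("")
    # The final question is left as a placeholder to be filled per test sample.
    parts.append("Q: {{text}} " + question_suffix)
    parts.append("A:")
    return "\n".join(parts)


# Two rows copied from the train split shown above.
train_rows = [
    {"text": 'The mark "Tasty" for bread.', "answer": "descriptive"},
    {"text": 'The mark "Caress" for body soap.', "answer": "suggestive"},
]
few_shot_template = build_few_shot_prompt(
    "Classify the mark.",  # stand-in instruction, an assumption
    train_rows,
    "What is the type of mark?",
)
print(few_shot_template)
```

With a real dataset, `train_rows` could plausibly come from `dataset["train"].to_pandas().to_dict("records")`.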


The script `utils.py` provides a simple function for generating prompts for a dataset given a template.

In [6]:
test_df = dataset["test"].to_pandas()
prompts = generate_prompts(prompt_template=prompt_template, data_df=test_df)
print(prompts[0])

A mark is generic if it is the common name for the product. A mark is descriptive if it describes a purpose, nature, or attribute of the product. A mark is suggestive if it suggests or implies a quality or characteristic of the product. A mark is arbitrary if it is a real English word that has no relation to the product. A mark is fanciful if it is an invented word.

Q: The mark "Ivory" for a product made of elephant tusks. What is the type of mark?
A: generic

Q: The mark "Tasty" for bread. What is the type of mark?
A: descriptive

Q: The mark "Caress" for body soap. What is the type of mark?
A: suggestive

Q: The mark "Virgin" for wireless communications. What is the type of mark?
A: arbitrary

Q: The mark "Aswelly" for a taxi service. What is the type of mark?
A: fanciful

Q: The mark “Salt” for packages of sodium chloride. What is the type of mark?
A:
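
To make the placeholder mechanism concrete, here is a minimal sketch of how `{{col_name}}` substitution could work for a single row. It is not the implementation in `utils.py`; the function `fill_template` is a hypothetical name, and the assumption that column names consist only of letters, digits, and underscores is ours.

```python
import re


# Hypothetical stand-in for the per-row step of prompt generation.
def fill_template(template, row):
    """Replace each {{col_name}} placeholder with str(row[col_name])."""
    def substitute(match):
        # match.group(1) is the column name between the double braces.
        return str(row[match.group(1)])

    # Assumes column names match \w+ (letters, digits, underscores).
    return re.sub(r"\{\{(\w+)\}\}", substitute, template)


template = "Q: {{text}} What is the type of mark?\nA:"
row = {"text": 'The mark "Tasty" for bread.', "answer": "descriptive"}
filled = fill_template(template, row)
print(filled)
```

A placeholder naming a column that is missing from `row` raises a `KeyError` in this sketch, which surfaces template typos early.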


### Evaluation

The majority of LegalBench tasks are evaluated using balanced accuracy. A handful of tasks which involve extraction or multilabel classification are evaluated using F1. To simplify evaluation, we provide an evaluation function which scores performance.

In [7]:
from evaluation import evaluate
import numpy as np

# Generate random predictions for abercrombie
classes = ["generic", "descriptive", "suggestive", "arbitrary", "fanciful"]
generations = np.random.choice(classes, len(test_df))

evaluate("abercrombie", generations, test_df["answer"].tolist())

0.23157894736842105
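
For intuition, balanced accuracy is the unweighted mean of per-class recall, so a model that always predicts the majority class is not rewarded for class imbalance. The sketch below illustrates the metric itself; it is not the repository's `evaluate` function, which may normalize model outputs or handle edge cases differently, and the toy labels are made up for illustration.

```python
from collections import defaultdict


def balanced_accuracy(predictions, answers):
    """Mean of per-class recall over the classes present in `answers`."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for pred, gold in zip(predictions, answers):
        total[gold] += 1
        if pred == gold:
            correct[gold] += 1
    recalls = [correct[label] / total[label] for label in total]
    return sum(recalls) / len(recalls)


# Toy example: three "generic" samples, one "fanciful" sample.
answers = ["generic", "generic", "generic", "fanciful"]
predictions = ["generic", "generic", "generic", "generic"]

# Plain accuracy would be 0.75; recall is 1.0 for "generic" and 0.0 for
# "fanciful", so balanced accuracy is 0.5.
score = balanced_accuracy(predictions, answers)
print(score)
```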

### Selecting tasks by license

LegalBench tasks are covered under different licenses. The following code selects the tasks released under a given license.

In [8]:
target_license = "CC BY 4.0"
tasks_with_target_license = []
for task in tqdm(TASKS):
    dataset = datasets.load_dataset("nguha/legalbench", task, split="train")
    if dataset.info.license == target_license:
        tasks_with_target_license.append(task)
print("Tasks with target license:", tasks_with_target_license)

100%|████████████████████████████████████████████████████████████████████████████████████████████| 162/162 [03:00<00:00,  1.11s/it]

Tasks with target license: ['abercrombie', 'citation_prediction_classification', 'citation_prediction_open', 'contract_nli_confidentiality_of_agreement', 'contract_nli_explicit_identification', 'contract_nli_inclusion_of_verbally_conveyed_information', 'contract_nli_limited_use', 'contract_nli_no_licensing', 'contract_nli_notice_on_compelled_disclosure', 'contract_nli_permissible_acquirement_of_similar_information', 'contract_nli_permissible_copy', 'contract_nli_permissible_development_of_similar_information', 'contract_nli_permissible_post-agreement_possession', 'contract_nli_return_of_confidential_information', 'contract_nli_sharing_with_employees', 'contract_nli_sharing_with_third-parties', 'contract_nli_survival_of_obligations', 'contract_qa', 'corporate_lobbying', 'cuad_affiliate_license-licensee', 'cuad_affiliate_license-licensor', 'cuad_anti-assignment', 'cuad_audit_rights', 'cuad_cap_on_liability', 'cuad_change_of_control', 'cuad_competitive_restriction_exception', 'cuad_covena
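
Since the loop above takes a few minutes, it may be worth recording each task's license once and grouping afterwards when more than one license is of interest. The following is a sketch under assumptions: `group_by_license` is a hypothetical helper, and the task names and license strings in `example_licenses` are placeholders rather than real LegalBench metadata. In practice the mapping would be filled from `dataset.info.license` inside the loop.

```python
from collections import defaultdict


# Hypothetical helper: invert a task -> license mapping into license -> tasks.
def group_by_license(task_licenses):
    groups = defaultdict(list)
    for task, license_name in task_licenses.items():
        groups[license_name].append(task)
    return dict(groups)


# Placeholder data for illustration only (not real task licenses).
example_licenses = {
    "task_a": "CC BY 4.0",
    "task_b": "MIT",
    "task_c": "CC BY 4.0",
}
grouped = group_by_license(example_licenses)
print(grouped["CC BY 4.0"])
```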


