This notebook outlines the steps required to make a submission to the LegalBench leaderboard.

### 1. Create a virtual environment and install requirements

From your project directory, run the following three commands:

In [None]:
python -m venv legalbench_venv

source legalbench_venv/bin/activate

pip install -r requirements.txt

### 2. Patch ```weave``` (skip if your ```weave``` version is 0.51.55 or higher)

Due to a deserialization bug in ```weave``` (the observability platform that powers the leaderboard), submissions currently require a simple manual fix to this dependency. To apply it, replace the contents of the following two files:

a. Replace the content of ```legalbench_venv/lib/python3.13/site-packages/weave/flow/casting.py``` with [this file](https://gist.github.com/KensingtonOscupant/edc131bcf1052d319b89dcd70378d976) and save the changes.

b. Replace the content of ```legalbench_venv/lib/python3.13/site-packages/weave/trace/serialization/serialize.py``` with [this file](https://gist.github.com/KensingtonOscupant/2620f3af72a3e7c92dc77b918e684463) and save the changes.

If your virtual environment uses a different Python version, the path differs accordingly (e.g. ```legalbench_venv/lib/python3.12/...``` instead of ```legalbench_venv/lib/python3.13/...```); the fix is applied in the same way.

This step can be skipped as soon as ```weave v0.51.55``` is released (provided the maintainers implement the fix as planned). Without the fix, submitting an evaluation fails with ```TypeError: Unable to cast to Scorer```.
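To decide whether the manual fix is still needed and to find the two files regardless of the Python version in the path, a small check along the following lines may help. This is a minimal sketch: the helper names (```version_tuple```, ```needs_manual_fix```, ```files_to_patch```) are made up for illustration, the version comparison is deliberately simple (pre-release suffixes are ignored), and it assumes the fix ships in 0.51.55 as planned.

```python
import re
from importlib import metadata, util
from pathlib import Path

# First release expected to contain the fix, according to the text above (assumption)
FIXED_VERSION = "0.51.55"


def version_tuple(version: str) -> tuple:
    """Turn '0.51.54' into (0, 51, 54); suffixes such as 'rc1' are ignored."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def needs_manual_fix(installed: str, fixed: str = FIXED_VERSION) -> bool:
    """Return True if the installed version is older than the fixed one."""
    return version_tuple(installed) < version_tuple(fixed)


def files_to_patch(package_dir: Path) -> list:
    """Return the two files named in step 2, relative to the weave package directory."""
    return [
        package_dir / "flow" / "casting.py",
        package_dir / "trace" / "serialization" / "serialize.py",
    ]


if __name__ == "__main__":
    try:
        installed = metadata.version("weave")
    except metadata.PackageNotFoundError:
        print("weave is not installed in this environment")
    else:
        if needs_manual_fix(installed):
            spec = util.find_spec("weave")
            if spec is not None and spec.origin is not None:
                # spec.origin points at weave/__init__.py inside site-packages
                for path in files_to_patch(Path(spec.origin).parent):
                    print("replace:", path)
        else:
            print(f"weave {installed} should not need the manual fix")
```

Run it with the virtual environment activated so that it inspects the ```weave``` installation inside ```legalbench_venv```.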



### 3. Create an account with Weights & Biases

a. Go to [wandb.ai](https://wandb.ai/) and sign up. If you are affiliated with a university, you may select the "Academic account" option.  
b. Go to Settings -> API key to copy your API key.

### 4. Create a submission

Now you are set to create your first submission to the leaderboard! A minimal working example can be found at ```leaderboard/participant_setup/submit_run.py```. Run the file as a module as shown below, for example on the rule-conclusion task ```abercrombie```.  
Replace ```your_name``` with the name you would like to appear on the leaderboard:

In [None]:
python -m leaderboard.participant_setup.submit_run --task abercrombie --model_name your_name

> Note: If you do not see your submission on the leaderboard, try changing the ```model_name```.  
> A model with exactly the same name and configuration as an already listed model will not show up on the leaderboard.

> Note to LegalBench team: The defaults are currently set to the minimal implementation I linked in the PR. As soon as you have set up the full version, we would need to replace the default values for the --team and --project arguments in ```submit_run.py``` and ```submit_run_all_tasks.py```.

You will be prompted to enter your Weights & Biases API key. After a few seconds, your run should be submitted. Congratulations!  
If you head over to the leaderboard [insert URL], you should now see an entry with the name you specified.  
(It will score 0.0% accuracy for now because it just echoes the prompt.)

### 5. Helpful details on how to create a submission

To create your next submission quickly, you only need to understand one line towards the end of the sample script ```leaderboard/participant_setup/submit_run.py```, which is  

```asyncio.run(eval.evaluate(model))```. 

Going more into detail:

a. ```asyncio.run(...)```: The evaluation is an asynchronous process, so it is started via ```asyncio.run```; this has no practical relevance for writing your submission.

b. ```eval```: This is the evaluation object that the run is attached to. The leaderboard only tracks runs that use this exact evaluation object. It contains all the information on how your run will be scored, including the dataset and the scorer. You can reference the object as in the sample script using ```eval = weave.ref(f"{TASK}_evaluation").get()```, where ```TASK``` is the name of the task, e.g. ```abercrombie```.

c. ```.evaluate```: built-in method that starts the evaluation process.

d. ```model```: This is where your work lives. The term 'model' in this context does not refer to the LLM, but rather to all the code that is executed to create the prediction (synonymously: generation). Going through ```MyModel``` from the sample script in more detail:

In [None]:
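import weave
from weave import Model


class MyModel(Model):
    # Template containing a "{{text}}" placeholder; set when the class is instantiated
    prompt_template: str

    @weave.op()
    def predict(self, text: str):
        # Fill the placeholder with the text column of the current row
        prompt = self.prompt_template.replace("{{text}}", text)

        # The sample model simply echoes the prompt instead of calling an LLM
        return {'generation': prompt}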

aa. ```class MyModel(Model):```: The model is a Python class that inherits from ```weave```'s ```Model``` class.  

bb. ```prompt_template: str```: A field that stores the prompt template. In the sample script, the prompt template is loaded from the ```tasks``` directory, where the prompts for each task are stored (see ```with open(f"tasks/{TASK}/base_prompt.txt") as in_file:``` and the lines that follow), and the class is then instantiated with that prompt (```model = MyModel(prompt_template=prompt_template, ...)```). This way, all methods of the class can access ```prompt_template```. The field is not required; the prompt could just as well be supplied differently.

cc. ```@weave.op()```: Adding this decorator to a method causes the method to be traced in Weights & Biases - very helpful for gaining a detailed understanding of your own runs as well as those of others.

dd. ```def predict(self, text: str):```: Your model must have a ```predict()``` method. This method is called on every row of the data, and its output is the prediction that will be scored. As inputs, you can access any column of that task's LegalBench dataset by defining a parameter with the column's name; in our example, ```text``` receives the value of the ```text``` column for the row currently being processed. You can inspect the datasets here [insert link]. Within the ```predict()``` method, you are free to implement any logic you need. Most importantly, it is where you would call your LLM. Example:

In [None]:
@weave.op()
def predict(self, text: str):

    prompt_template = self.prompt_template
    prompt = prompt_template.replace("{{text}}", text)

    # openai_client must be created outside of this method, e.g. near the top of your script
    response = openai_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=1000,
        temperature=0.0,
    )
    generated_text = response.choices[0].message.content
    return {"generation": generated_text}
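The way column names map to ```predict()``` parameters can be pictured with the following plain-Python sketch. It is an illustration of the idea only, not ```weave```'s actual implementation: the helper ```call_with_matching_columns``` is hypothetical, and the example row is made up rather than taken from the dataset.

```python
import inspect


def call_with_matching_columns(predict_fn, row: dict):
    """Call predict_fn with only those row entries whose keys match its parameter names."""
    wanted = inspect.signature(predict_fn).parameters
    kwargs = {name: row[name] for name in wanted if name in row}
    return predict_fn(**kwargs)


def predict(text: str):
    # Only asks for the "text" column; other columns in the row are not passed in
    return {"generation": f"Classify: {text}"}


# Hypothetical row for illustration; the real column names depend on the task
row = {"text": "Salt for packages of sodium chloride", "answer": "generic", "index": 0}

result = call_with_matching_columns(predict, row)
print(result["generation"])
```

Adding a further parameter to ```predict``` that matches another column name would make that column's value available in the same way.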

> Note: If you would like to run an evaluation on all tasks at once, you can do so using the ```submit_run_all_tasks.py``` helper script. It uses the ```data``` dictionary, which is passed to the ```predict()``` method alongside the individual columns, and lets you construct prompts following the same logic as LegalBench's ```generate_prompts``` function from ```utils.py```.
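
Constructing a prompt from the ```data``` dictionary might look roughly like the sketch below. It assumes that ```data``` maps column names to the values of the current row and that templates use ```{{column}}``` placeholders as in the ```{{text}}``` example above; ```build_prompt``` is a hypothetical helper, and the actual helper script and ```generate_prompts``` may differ in detail.

```python
def build_prompt(prompt_template: str, data: dict) -> str:
    """Replace every {{column}} placeholder with the matching value from data."""
    prompt = prompt_template
    for column, value in data.items():
        prompt = prompt.replace("{{" + column + "}}", str(value))
    if "{{" in prompt:
        # A placeholder without a matching column most likely indicates a typo
        raise ValueError(f"Unfilled placeholder in prompt: {prompt!r}")
    return prompt


# Hypothetical template and row for illustration
template = "Question: {{question}}\nText: {{text}}\nAnswer:"
data = {"question": "Is the mark generic?", "text": "Salt", "index": 3}

example_prompt = build_prompt(template, data)
print(example_prompt)
```

Because the loop iterates over all columns, the same function works for tasks whose templates use different placeholders, which is what makes a single script for all tasks possible.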