Argparse Refactor (#103)
* Initial working refactor

This just pulls the argparse stuff into a separate function.

* Do some rearrangement for the refactor

Eval args are required; other params are optional.

The print output is only needed when called from the CLI, and it
assumes that various keys are present (even if None), which is not the
case when calling from Python.

* Move main script to scripts dir, add symlink

Other scripts can't import the main script since it's in the top level.
This moves it into the scripts dir and adds a symlink so it's still
usable at the old location.

* Work on adding example Python harness script

* Add notify script

* Fix arg

* task cleanup

* Add versions to tasks

* Fix typo

* Fix versions

* Read webhook url from env var

* evaluate line-corporation large models (#81)

* compare results between Jsquad prompt with title and without title (#84)

* re-evaluate models with jsquad prompt with title

* update jsquad to include titles into the prompt

* re-evaluate models with jsquad prompt with title

* inherit JSQuAD v1.2 tasks from v1.1 for readability

* re-evaluate models with jsquad prompt with title

* won't need jsquad_v11

* revert result.json and harness.sh in models

* fix format

* Verbose output for more tasks (#92)

* Add output to jaqket v2

* Add details to jsquad

* Add verbose output to xlsum

---------

Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>

* Add gptq support (#87)

* add EleutherAI PR519 autoGPTQ

* add comma

* change type

* change type2

* change path

* Undo README modifications

---------

Co-authored-by: webbigdata-jp <dahara1@webbigdata.jp>

* Add Balanced Accuracy (#95)

* First implementation of balanced accuracy

* Add comment

* Make JNLI a balanced acc task

* Add mcc and balanced f1 scores

---------

Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>
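
For reference, the metrics mentioned above can be sketched with scikit-learn (illustration only; the harness computes them inside its task metrics, and its "balanced F1" may be defined differently than the macro average shown here):

```
# Illustrative sketch, not the harness's implementation.
from sklearn.metrics import balanced_accuracy_score, f1_score, matthews_corrcoef

y_true = [0, 0, 0, 0, 1, 1]  # hypothetical gold labels (imbalanced classes)
y_pred = [0, 0, 0, 1, 1, 0]  # hypothetical model predictions

print("balanced accuracy:", balanced_accuracy_score(y_true, y_pred))
print("mcc:", matthews_corrcoef(y_true, y_pred))
print("macro f1:", f1_score(y_true, y_pred, average="macro"))
```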

* Remove 3.8 version spec from pre-commit config

The version here makes it so that pre-commit can only run in an
environment with python3.8 in the path, but there's no compelling reason
for that. Removing the spec just uses system python.

* Fix Linter Related Issues (#96)

* Change formatting to make the linter happy

This is mostly:

- newlines at end of files
- removing blank lines at end of files
- changing single to double quotes
- black multi-line formatting rules
- other whitespace edits

* Remove codespell

Has a lot of false positives

* boolean style issue

* bare except

These seem harmless enough, so just telling the linter to ignore them

* More linter suggestions

---------

Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>

* Simplify neologdn version

This was pointing to a commit, but the relevant PR has been merged and
released for a while now, so a normal version spec can be used.

* Update xwinograd dataset

The old dataset was deleted.

* won't need llama2/llama2-2.7b due to duplication (#99)

* add gekko (#98)

Co-authored-by: webbigdata-jp <dahara1@webbigdata.jp>

* add llama2 format (#100)

* add llama2 format

* add 0.6 in prompt_templates.md

* make pre-commit pass

* remove debugging line

* fix bug on `mgsm` for prompt version `0.3` (#101)

* Add JCoLA task (#93)

* WIP: need JCoLA

* Update harness.jcola.sh

* update prompt

* update prompt

* update prompt

* update prompt

* Revert "update prompt"

This reverts commit cd9a914.

* WIP: evaluate on JCoLA

* Add new metrics to cola

This modifies cola, since jcola just inherits this part. It's not a
problem to modify the parent task because it just adds some output.

* Linter edits

* evaluate on JCoLA

* need JCoLAWithLlama2

* JCoLA's prompt version should be 0.0

https://github.com/Stability-AI/lm-evaluation-harness/blob/jp-stable/docs/prompt_templates.md

* documentation

jptasks.md and prompt_templates.md

* won't need harness and result for JCoLA

* fix linter related issue

* Delete harness.jcola.sh

---------

Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>
Co-authored-by: mkshing <33302880+mkshing@users.noreply.github.com>

* Linter fixes

* Remove example - script is used instead of function

* Cleanup

* Cleanup / linter fixes

There were some things related to the old shell script usage that
weren't working; this should fix them.

* Add README section describing cluster usage

---------

Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>
Co-authored-by: kumapo <kumapo@users.noreply.github.com>
Co-authored-by: webbigdata-jp <87654083+webbigdata-jp@users.noreply.github.com>
Co-authored-by: webbigdata-jp <dahara1@webbigdata.jp>
Co-authored-by: mkshing <33302880+mkshing@users.noreply.github.com>
6 people committed Nov 6, 2023
1 parent 9b42d41 commit 96b590b
Showing 6 changed files with 387 additions and 124 deletions.
24 changes: 24 additions & 0 deletions README.md
@@ -205,6 +205,30 @@ We support wildcards in task names, for example you can run all of the machine-t

We currently only support one prompt per task, which we strive to make the "standard" as defined by the benchmark's authors. If you would like to study how varying prompts causes changes in the evaluation score, check out the [BigScience fork](https://github.com/bigscience-workshop/lm-evaluation-harness) of this repo. We are currently working on upstreaming this capability to `main`.
## Cluster Usage

The evaluation suite can be called via the Python API, which makes it possible to script jobs with [submitit](https://github.com/facebookincubator/submitit), for example. You can find a detailed example of how this works in `scripts/run_eval.py`.

Running a job via submitit has two steps: preparing the **executor**, which controls cluster options, and preparing the actual **evaluation** options.

First you need to configure the executor. This controls cluster job details, like how many GPUs or nodes to use. For a detailed example, see `build_executor` in `run_eval.py`, but a minimal example looks like this:

```
base_args = {... cluster args ...}
executor = submitit.AutoExecutor(folder="./logs")
executor.update_parameters(**base_args)
```

Once the executor is prepared, you need to actually run the evaluation task. A detailed example of wrapping the API to make this easy is in the `eval_task` function, which mainly just calls out to `main` in `scripts/main_eval.py`. The basic structure is like this:

```
def my_task():
    args = {... eval args ...}
    # this is the function from main_eval.py
    main_eval(args, output_path="./hoge.json")

job = executor.submit(my_task)
```

You can then get output from the job and check that it completed successfully. See `run_job` for an example of how that works.
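
A rough sketch of collecting the output directly with submitit (assuming the `executor` and `my_task` defined above; the actual `run_job` helper in `scripts/run_eval.py` may handle this differently):

```
# Submit the wrapped task and block until the cluster job finishes.
job = executor.submit(my_task)
print("submitted:", job.job_id)
result = job.result()  # returns whatever my_task returned, or raises if the job failed
```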
## Implementing new tasks
To implement a new task in the eval harness, see [this guide](./docs/task_guide.md).
124 changes: 0 additions & 124 deletions main.py

This file was deleted.

1 change: 1 addition & 0 deletions main.py
38 changes: 38 additions & 0 deletions scripts/harness_example.py
@@ -0,0 +1,38 @@
#!/usr/bin/env python
# This script runs eval in the cluster. Use it as a basis for your own harnesses.
from run_eval import build_executor, run_job
from run_eval import JAEVAL8_TASKS, JAEVAL8_FEWSHOT
from main_eval import main as main_eval


def build_task_list(tasks, prompt):
    out = []
    # Some tasks don't have a prompt version
    promptless = ["xwinograd_ja"]
    for task in tasks:
        if task not in promptless:
            out.append(f"{task}-{prompt}")
        else:
            out.append(task)
    return out


def main():
    executor = build_executor("eval", gpus_per_task=8, cpus_per_gpu=12)

    tasks = build_task_list(JAEVAL8_TASKS, "0.3")
    eval_args = {
        "tasks": tasks,
        "num_fewshot": JAEVAL8_FEWSHOT,
        "model": "hf-causal",
        "model_args": "pretrained=rinna/japanese-gpt-1b,use_fast=False",
        "device": "cuda",
        "limit": 100,
        "verbose": True,
    }

    run_job(executor, main_eval, eval_args=eval_args, output_path="./check.json")


if __name__ == "__main__":
    main()
127 changes: 127 additions & 0 deletions scripts/main_eval.py
@@ -0,0 +1,127 @@
import os
import argparse
import json
import logging
import fnmatch

from lm_eval import tasks, evaluator

logging.getLogger("openai").setLevel(logging.WARNING)


class MultiChoice:
    def __init__(self, choices):
        self.choices = choices

    # Simple wildcard support (linux filename patterns)
    def __contains__(self, values):
        for value in values.split(","):
            if len(fnmatch.filter(self.choices, value)) == 0:
                return False

        return True

    def __iter__(self):
        for choice in self.choices:
            yield choice


def parse_args():
    parser = argparse.ArgumentParser()
    parser.add_argument("--model", required=True)
    parser.add_argument("--model_args", default="")
    parser.add_argument("--tasks", default=None, choices=MultiChoice(tasks.ALL_TASKS))
    parser.add_argument("--num_fewshot", type=str, default="0")
    parser.add_argument("--batch_size", type=int, default=None)
    parser.add_argument("--device", type=str, default=None)
    parser.add_argument("--output_path", default=None)
    parser.add_argument("--limit", type=str, default=None)
    parser.add_argument("--no_cache", action="store_true")
    parser.add_argument("--decontamination_ngrams_path", default=None)
    parser.add_argument("--description_dict_path", default=None)
    parser.add_argument("--check_integrity", action="store_true")
    parser.add_argument("--verbose", action="store_true")
    # TODO This is deprecated and throws an error, remove it
    parser.add_argument("--provide_description", action="store_true")

    return parser.parse_args()


def clean_args(args) -> dict:
    """Handle conversion to lists etc. for args"""

    assert not args.provide_description, "provide-description is not implemented"

    if args.limit:
        print(
            "WARNING: --limit SHOULD ONLY BE USED FOR TESTING. REAL METRICS SHOULD NOT BE COMPUTED USING LIMIT."
        )

    if args.tasks is None:
        args.tasks = tasks.ALL_TASKS
    else:
        args.tasks = pattern_match(args.tasks.split(","), tasks.ALL_TASKS)

    print(f"Selected Tasks: {args.tasks}")
    if args.num_fewshot is not None:
        args.num_fewshot = [int(n) for n in args.num_fewshot.split(",")]

    if args.limit is not None:
        args.limit = [
            int(n) if n.isdigit() else float(n) for n in args.limit.split(",")
        ]

    return vars(args)


# Returns a list containing all values of the source_list that
# match at least one of the patterns
def pattern_match(patterns, source_list):
    task_names = []
    for pattern in patterns:
        for matching in fnmatch.filter(source_list, pattern):
            task_names.append(matching)
    return task_names


def main(eval_args: dict, description_dict_path: str = None, output_path: str = None):
    """Run evaluation and optionally save output.
    For a description of eval args, see `simple_evaluate`.
    """
    if description_dict_path:
        with open(description_dict_path, "r") as f:
            eval_args["description_dict"] = json.load(f)

    results = evaluator.simple_evaluate(**eval_args)

    dumped = json.dumps(results, indent=2, ensure_ascii=False)
    print(dumped)

    if output_path:
        os.makedirs(os.path.dirname(output_path), exist_ok=True)
        with open(output_path, "w") as f:
            f.write(dumped)

    return results


if __name__ == "__main__":
    args = parse_args()
    args = clean_args(args)

    # This is not used
    args.pop("provide_description", None)
    # treat non-eval args separately
    description_dict_path = args.get("description_dict_path", None)
    args.pop("description_dict_path", None)
    output_path = args.get("output_path", None)
    args.pop("output_path", None)

    results = main(args, description_dict_path, output_path)

    print(
        f"{args['model']} ({args['model_args']}), limit: {args['limit']}, "
        f"num_fewshot: {args['num_fewshot']}, batch_size: {args['batch_size']}"
    )
    print(evaluator.make_table(results))
29 changes: 29 additions & 0 deletions scripts/notify.py
@@ -0,0 +1,29 @@
#!/usr/bin/env python
# This is an example of sending a slack notification. For more details see
# official docs:
# https://api.slack.com/messaging/webhooks

import requests
import json
import os

# This URL is tied to a single channel. That can be generalized, or you can
# create a new "app" to use another channel.
WEBHOOK = os.environ.get("WEBHOOK_URL")
if WEBHOOK is None:
    print("Webhook URL not found in WEBHOOK_URL env var. Will just print messages.")


def notify(message):
    headers = {"Content-Type": "application/json"}
    data = json.dumps({"text": message})
    if WEBHOOK is None:
        print(message)
    else:
        requests.post(WEBHOOK, data=data, headers=headers)


if __name__ == "__main__":
    print("Please type your message.")
    message = input("message> ")
    notify(message)