Argparse Refactor (#103)
* Initial working refactor

This just pulls the argparse stuff into a separate function.

* Do some rearrangement for the refactor

Eval args are required; other params are optional.

The print output is only needed when called from the CLI, and it
assumes that various keys are present (even if None), which is not the
case when calling from Python.

* Move main script to scripts dir, add symlink

Other scripts can't import the main script since it's in the top level.
This moves it into the scripts dir and adds a symlink so it's still
usable at the old location.

* Work on adding example Python harness script

* Add notify script

* Fix arg

* task cleanup

* Add versions to tasks

* Fix typo

* Fix versions

* Read webhook url from env var

* evaluate line-corporation large models (#81)

* compare results between Jsquad prompt with title and without title (#84)

* re-evaluate models with jsquad prompt with title

* update jsquad to include titles into the prompt

* re-evaluate models with jsquad prompt with title

* inherit JSQuAD v1.2 tasks from v1.1 for readability

* re-evaluate models with jsquad prompt with title

* won't need jsquad_v11

* revert result.json and harness.sh in models

* fix format

* Verbose output for more tasks (#92)

* Add output to jaqket v2

* Add details to jsquad

* Add verbose output to xlsum

---------

Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>

* Add gptq support (#87)

* add EleutherAI PR519 autoGPTQ

* add comma

* change type

* change type2

* change path

* Undo README modifications

---------

Co-authored-by: webbigdata-jp <dahara1@webbigdata.jp>

* Add Balanced Accuracy (#95)

* First implementation of balanced accuracy

* Add comment

* Make JNLI a balanced acc task

* Add mcc and balanced f1 scores

---------

Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>
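
For reference, the metrics mentioned above can be sketched with scikit-learn (illustration only; the harness computes them inside its task metrics, and its "balanced F1" may be defined differently than the macro average shown here):

```
# Illustrative sketch, not the harness's implementation.
from sklearn.metrics import balanced_accuracy_score, f1_score, matthews_corrcoef

y_true = [0, 0, 0, 0, 1, 1]  # hypothetical gold labels (imbalanced classes)
y_pred = [0, 0, 0, 1, 1, 0]  # hypothetical model predictions

print("balanced accuracy:", balanced_accuracy_score(y_true, y_pred))
print("mcc:", matthews_corrcoef(y_true, y_pred))
print("macro f1:", f1_score(y_true, y_pred, average="macro"))
```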

* Remove 3.8 version spec from pre-commit config

The version here makes it so that pre-commit can only run in an
environment with python3.8 in the path, but there's no compelling reason
for that. Removing the spec just uses system python.

* Fix Linter Related Issues (#96)

* Change formatting to make the linter happy

This is mostly:

- newlines at end of files
- removing blank lines at end of files
- changing single to double quotes
- black multi-line formatting rules
- other whitespace edits

* Remove codespell

Has a lot of false positives

* boolean style issue

* bare except

These seem harmless enough, so just telling the linter to ignore them

* More linter suggestions

---------

Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>

* Simplify neologdn version

This was pointing to a commit, but the relevant PR has been merged and
released for a while now, so a normal version spec can be used.

* Update xwinograd dataset

The old dataset was deleted.

* won't need llama2/llama2-2.7b due to duplication (#99)

* add gekko (#98)

Co-authored-by: webbigdata-jp <dahara1@webbigdata.jp>

* add llama2 format (#100)

* add llama2 format

* add 0.6 in prompt_templates.md

* make pre-commit pass

* remove debugging line

* fix bug on `mgsm` for prompt version `0.3` (#101)

* Add JCoLA task (#93)

* WIP: need JCoLA

* Update harness.jcola.sh

* update prompt

* update prompt

* update prompt

* update prompt

* Revert "update prompt"

This reverts commit cd9a914.

* WIP: evaluate on JCoLA

* Add new metrics to cola

This modifies cola, since jcola just inherits this part. It's not a
problem to modify the parent task because it just adds some output.

* Linter edits

* evaluate on JCoLA

* need JCoLAWithLlama2

* JCoLA's prompt version should be 0.0

https://github.com/Stability-AI/lm-evaluation-harness/blob/jp-stable/docs/prompt_templates.md

* documentation

jptasks.md and prompt_templates.md

* won't need harness and result for JCoLA

* fix linter related issue

* Delete harness.jcola.sh

---------

Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>
Co-authored-by: mkshing <33302880+mkshing@users.noreply.github.com>

* Linter fixes

* Remove example - script is used instead of function

* Cleanup

* Cleanup / linter fixes

There were some things related to the old shell script usage that
weren't working; this should fix them.

* Add README section describing cluster usage

---------

Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>
Co-authored-by: kumapo <kumapo@users.noreply.github.com>
Co-authored-by: webbigdata-jp <87654083+webbigdata-jp@users.noreply.github.com>
Co-authored-by: webbigdata-jp <dahara1@webbigdata.jp>
Co-authored-by: mkshing <33302880+mkshing@users.noreply.github.com>
6 people committed Nov 6, 2023
1 parent 9b42d41 commit 96b590b
Showing 6 changed files with 387 additions and 124 deletions.
24 changes: 24 additions & 0 deletions README.md
@@ -205,6 +205,30 @@ We support wildcards in task names, for example you can run all of the machine-t

We currently only support one prompt per task, which we strive to make the "standard" as defined by the benchmark's authors. If you would like to study how varying prompts causes changes in the evaluation score, check out the [BigScience fork](https://github.com/bigscience-workshop/lm-evaluation-harness) of this repo. We are currently working on upstreaming this capability to `main`.
## Cluster Usage

The evaluation suite can be called via the Python API, which makes it possible to script jobs with [submitit](https://github.com/facebookincubator/submitit), for example. You can find a detailed example of how this works in `scripts/run_eval.py`.

Running a job via submitit has two steps: preparing the **executor**, which controls cluster options, and preparing the actual **evaluation** options.

First you need to configure the executor. This controls cluster job details, like how many GPUs or nodes to use. For a detailed example, see `build_executor` in `run_eval.py`, but a minimal example looks like this:

```
base_args = {... cluster args ...}
executor = submitit.AutoExecutor(folder="./logs")
executor.update_parameters(**base_args)
```

Once the executor is prepared, you need to actually run the evaluation task. A detailed example of wrapping the API to make this easy is in the `eval_task` function, which mainly just calls out to `main` in `scripts/main_eval.py`. The basic structure is like this:

```
def my_task():
    args = {... eval args ...}
    # this is the function from main_eval.py
    main_eval(args, output_path="./hoge.json")

job = executor.submit(my_task)
```

You can then get output from the job and check that it completed successfully. See `run_job` for an example of how that works.
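
A rough sketch of collecting the output directly with submitit (assuming the `executor` and `my_task` defined above; the actual `run_job` helper in `scripts/run_eval.py` may handle this differently):

```
# Submit the wrapped task and block until the cluster job finishes.
job = executor.submit(my_task)
print("submitted:", job.job_id)
result = job.result()  # returns whatever my_task returned, or raises if the job failed
```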
## Implementing new tasks
To implement a new task in the eval harness, see [this guide](./docs/task_guide.md).
124 changes: 0 additions & 124 deletions main.py

This file was deleted.

1 change: 1 addition & 0 deletions main.py
38 changes: 38 additions & 0 deletions scripts/harness_example.py
@@ -0,0 +1,38 @@
#!/usr/bin/env python
# This script runs eval in the cluster. Use it as a basis for your own harnesses.
from run_eval import build_executor, run_job
from run_eval import JAEVAL8_TASKS, JAEVAL8_FEWSHOT
from main_eval import main as main_eval


def build_task_list(tasks, prompt):
    out = []
    # Some tasks don't have a prompt version
    promptless = ["xwinograd_ja"]
    for task in tasks:
        if task not in promptless:
            out.append(f"{task}-{prompt}")
        else:
            out.append(task)
    return out


def main():
    executor = build_executor("eval", gpus_per_task=8, cpus_per_gpu=12)

    tasks = build_task_list(JAEVAL8_TASKS, "0.3")
    eval_args = {
        "tasks": tasks,
        "num_fewshot": JAEVAL8_FEWSHOT,
        "model": "hf-causal",
        "model_args": "pretrained=rinna/japanese-gpt-1b,use_fast=False",
        "device": "cuda",
        "limit": 100,
        "verbose": True,
    }

    run_job(executor, main_eval, eval_args=eval_args, output_path="./check.json")


if __name__ == "__main__":
    main()
127 changes: 127 additions & 0 deletions scripts/main_eval.py
@@ -0,0 +1,127 @@
import os
import argparse
import json
import logging
import fnmatch

from lm_eval import tasks, evaluator

logging.getLogger("openai").setLevel(logging.WARNING)


class MultiChoice:
    def __init__(self, choices):
        self.choices = choices

    # Simple wildcard support (linux filename patterns)
    def __contains__(self, values):
        for value in values.split(","):
            if len(fnmatch.filter(self.choices, value)) == 0:
                return False

        return True

    def __iter__(self):
        for choice in self.choices:
            yield choice


def parse_args():
    parser = argparse.ArgumentParser()
    parser.add_argument("--model", required=True)
    parser.add_argument("--model_args", default="")
    parser.add_argument("--tasks", default=None, choices=MultiChoice(tasks.ALL_TASKS))
    parser.add_argument("--num_fewshot", type=str, default="0")
    parser.add_argument("--batch_size", type=int, default=None)
    parser.add_argument("--device", type=str, default=None)
    parser.add_argument("--output_path", default=None)
    parser.add_argument("--limit", type=str, default=None)
    parser.add_argument("--no_cache", action="store_true")
    parser.add_argument("--decontamination_ngrams_path", default=None)
    parser.add_argument("--description_dict_path", default=None)
    parser.add_argument("--check_integrity", action="store_true")
    parser.add_argument("--verbose", action="store_true")
    # TODO This is deprecated and throws an error, remove it
    parser.add_argument("--provide_description", action="store_true")

    return parser.parse_args()


def clean_args(args) -> dict:
    """Handle conversion to lists etc. for args"""

    assert not args.provide_description, "provide-description is not implemented"

    if args.limit:
        print(
            "WARNING: --limit SHOULD ONLY BE USED FOR TESTING. REAL METRICS SHOULD NOT BE COMPUTED USING LIMIT."
        )

    if args.tasks is None:
        args.tasks = tasks.ALL_TASKS
    else:
        args.tasks = pattern_match(args.tasks.split(","), tasks.ALL_TASKS)

    print(f"Selected Tasks: {args.tasks}")
    if args.num_fewshot is not None:
        args.num_fewshot = [int(n) for n in args.num_fewshot.split(",")]

    if args.limit is not None:
        args.limit = [
            int(n) if n.isdigit() else float(n) for n in args.limit.split(",")
        ]

    return vars(args)


# Returns a list containing all values of the source_list that
# match at least one of the patterns
def pattern_match(patterns, source_list):
    task_names = []
    for pattern in patterns:
        for matching in fnmatch.filter(source_list, pattern):
            task_names.append(matching)
    return task_names


def main(eval_args: dict, description_dict_path: str = None, output_path: str = None):
    """Run evaluation and optionally save output.
    For a description of eval args, see `simple_evaluate`.
    """
    if description_dict_path:
        with open(description_dict_path, "r") as f:
            eval_args["description_dict"] = json.load(f)

    results = evaluator.simple_evaluate(**eval_args)

    dumped = json.dumps(results, indent=2, ensure_ascii=False)
    print(dumped)

    if output_path:
        os.makedirs(os.path.dirname(output_path), exist_ok=True)
        with open(output_path, "w") as f:
            f.write(dumped)

    return results


if __name__ == "__main__":
    args = parse_args()
    args = clean_args(args)

    # This is not used
    args.pop("provide_description", None)
    # treat non-eval args separately
    description_dict_path = args.get("description_dict_path", None)
    args.pop("description_dict_path", None)
    output_path = args.get("output_path", None)
    args.pop("output_path", None)

    results = main(args, description_dict_path, output_path)

    print(
        f"{args['model']} ({args['model_args']}), limit: {args['limit']}, "
        f"num_fewshot: {args['num_fewshot']}, batch_size: {args['batch_size']}"
    )
    print(evaluator.make_table(results))
29 changes: 29 additions & 0 deletions scripts/notify.py
@@ -0,0 +1,29 @@
#!/usr/bin/env python
# This is an example of sending a slack notification. For more details see
# official docs:
# https://api.slack.com/messaging/webhooks

import requests
import json
import os

# This URL is tied to a single channel. That can be generalized, or you can
# create a new "app" to use another channel.
WEBHOOK = os.environ.get("WEBHOOK_URL")
if WEBHOOK is None:
    print("Webhook URL not found in WEBHOOK_URL env var. Will just print messages.")


def notify(message):
    headers = {"Content-Type": "application/json"}
    data = json.dumps({"text": message})
    if WEBHOOK is None:
        print(message)
    else:
        requests.post(WEBHOOK, data=data, headers=headers)


if __name__ == "__main__":
    print("Please type your message.")
    message = input("message> ")
    notify(message)