New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

Sign up for GitHub

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Jump to bottom

compare results between Jsquad prompt with title and without title #84

Merged

mkshing merged 8 commits into Stability-AI:jp-stable from kumapo:jsquad-prompt-with-title

Sep 30, 2023

kumapo commented Aug 26, 2023 •

edited

Loading

Model	jsquad w/o title	jsquad w/ title
rinna-japanese-gpt-1b/EM	26.18	30.19
rinna-japanese-gpt-1b/f1	44.68	47.12
rinna-gpt-neox-3.6b-instruction-sft-v2/EM	44.91	47.55
rinna-gpt-neox-3.6b-instruction-sft-v2/f1	59.38	61.63
Llama2-2.7b/EM	58.37	59.93
Llama2-2.7b/f1	69.52	70.82

kumapo force-pushed the jsquad-prompt-with-title branch 3 times, most recently from f843cbf to 7c5d4bd Compare

August 27, 2023 04:49

kumapo changed the title ~~compare results between Jsquad prompt with title and the one without title~~ compare results between Jsquad prompt with title and without title

mkshing self-requested a review

September 5, 2023 00:03

kumapo force-pushed the jsquad-prompt-with-title branch 8 times, most recently from 93b4610 to a73f7a1 Compare

September 12, 2023 07:15

kumapo force-pushed the jsquad-prompt-with-title branch from a73f7a1 to 5dcde92 Compare

September 16, 2023 04:57

kumapo added 2 commits

September 16, 2023 13:59


          re-evaluate models with jsquad prompt with title

8fa63b7


          update jsquad to include titles into the prompt

11d8c89

kumapo force-pushed the jsquad-prompt-with-title branch 3 times, most recently from 18a2996 to 556fc89 Compare

September 16, 2023 11:06


          re-evaluate models with jsquad prompt with title

7e9e1ab

kumapo force-pushed the jsquad-prompt-with-title branch from 556fc89 to 7e9e1ab Compare

September 17, 2023 02:28

kumapo marked this pull request as ready for review

September 17, 2023 02:28

kumapo requested a review from jon-tow as a code owner

September 17, 2023 02:28

Author

kumapo commented Sep 17, 2023

could you check this PR?
if I missed something, please let me know.

mkshing removed the request for review from jon-tow

September 20, 2023 02:43

mkshing reviewed

View reviewed changes

mkshing left a comment

@kumapo thank you for awesome investigation and PR! I left a few comments to organize the code.
Also, regarding the results, I'm thinking if we can skip those changes this time because all models in the leaderboard haven't done yet and the scores in the leaderboard wouldn't match if result.json and harness.sh are changed. I know you kindly evaluated as many models as you could as I said you should before and I feel sorry...

lm_eval/tasks/ja/jsquad_v11.py Outdated

		DYNAMIC_MAX_LENGTH = os.getenv("DYNAMIC_MAX_LENGTH", "true").lower()


		class JSQuADV11(Task):

mkshing Sep 20, 2023

Is it possible to put inside jsquad.py and inherit from the v1 class. Then, we can see the diffs easily.

lm_eval/tasks/ja/jsquad_v11.py Outdated

		return self._squad_metric(predictions=predictions, references=references)[key]


		class JSQuADV11WithFintanPrompt(JSQuADV11):

mkshing Sep 20, 2023

Is it possible to inherit from JSQuADV11 and JSQuADWithFintanPrompt to reduce duplicates?

lm_eval/tasks/ja/jsquad_v11.py Outdated

		)


		class JSQuADV11WithJAAlpacaPrompt(JSQuADV11):

mkshing Sep 20, 2023

Same as JSQuADV11WithFintanPrompt

lm_eval/tasks/ja/jsquad_v11.py Outdated

		return f"### 指示:\n{self.INSTRUCTION}\n\n### 入力:\n{input_text}\n\n### 応答:\n"


		class JSQuADV11WithRinnaInstructionSFT(JSQuADV11):

mkshing Sep 20, 2023

Same as JSQuADV11WithFintanPrompt

lm_eval/tasks/ja/jsquad_v11.py Outdated

		return f"ユーザー: {input_text}{self.SEP}システム: "


		class JSQuADV11WithRinnaBilingualInstructionSFT(JSQuADV11WithRinnaInstructionSFT):

mkshing Sep 20, 2023

Same as JSQuADV11WithFintanPrompt

Author

kumapo commented Sep 21, 2023 •

edited

Loading

Thanks for your comments.
I understand the situation, and let me exclude result.json and harness.sh from this PR.

kumapo added 2 commits

September 23, 2023 16:08


          inherit JSQuAD v1.2 tasks from v1.1 for readability

e920f6d


          re-evaluate models with jsquad prompt with title

6b2fb51

Author

kumapo commented Sep 24, 2023

Will you please check the updated PR?
I've made JSQuAD tasks v1.2 inherit from the original ones (v1.1) because it's easy to see what I changed in v1.2.

Please let me know if I missed something.


          wont need jsquad_v11

e1444bf

mkshing commented Sep 24, 2023

@kumapo Awesome! Thank you!
Maybe you haven't excluded result.json and harness.sh yet? I saw jsquad.py but it looks good to me! So, let me merge this PR after you exclude the changes under models. Thanks.


          revert result.json and harness.sh in models

359b436

Author

kumapo commented Sep 25, 2023 •

edited

Loading

Sorry, I forgot that.
I've updated the PR and result.json and harness.sh updates are excluded now.

mkshing approved these changes

View reviewed changes

mkshing left a comment

LGTM!

mkshing commented Sep 29, 2023

@kumapo sorry but could you try to resolve this error in the action?
https://github.com/Stability-AI/lm-evaluation-harness/actions/runs/6294683571/job/17086866139?pr=84


          fix format

bfb439e

mkshing commented Sep 30, 2023

@kumapo Thank you for trying to resolve it. But, I see the errors come from the entire code rather than your PR. So, please let me merge this PR now!

Thank you so much for your effort again 🙏

mkshing merged commit bca52e7 into Stability-AI:jp-stable

1 check failed

Author

kumapo commented Sep 30, 2023

Actually, I was wondering if it would have been better to fix all the problems.
Thank you for your quick reply.

polm-stability pushed a commit to polm-stability/lm-evaluation-harness that referenced this pull request


          compare results between Jsquad prompt with title and without title (S…

1547fc7

…tability-AI#84)

* re-evaluate models with jsquad prompt with title

* update jsquad to include titles into the prompt

* re-evaluate models with jsquad prompt with title

* inherit JSQuAD v1.2 tasks from v1.1 for readability

* re-evaluate models with jsquad prompt with title

* wont need jsquad_v11

* revert result.json and harness.sh in models

* fix format

polm-stability added a commit that referenced this pull request


          Argparse Refactor (#103)

96b590b

* Initial working refactor

This just pulls the argparse stuff into a separate function.

* Do some rearrangement for the refactor

Eval args are necessary, other params are optional.

The print output is only needed when called from the cli, plus it
assumes that various keys are present (even if None), which is not the
case when calling from Python.

* Move main script to scripts dir, add symlink

Other scripts can't import the main script since it's in the top level.
This moves it into the scripts dir and adds a symlink so it's still
usable at the old location.

* Work on adding example Python harness script

* Add notify script

* Fix arg

* task cleanup

* Add versions to tasks

* Fix typo

* Fix versions

* Read webook url from env var

* evaluate line-corporation large models (#81)

* compare results between Jsquad prompt with title and without title (#84)

* re-evaluate models with jsquad prompt with title

* update jsquad to include titles into the prompt

* re-evaluate models with jsquad prompt with title

* inherit JSQuAD v1.2 tasks from v1.1 for readability

* re-evaluate models with jsquad prompt with title

* wont need jsquad_v11

* revert result.json and harness.sh in models

* fix format

* Verbose output for more tasks (#92)

* Add output to jaqket v2

* Add details to jsquad

* Add versbose output to xlsum

---------

Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>

* Add gptq support (#87)

* add EleutherAI PR519 autoGPTQ

* add comma

* change type

* change type2

* change path

* Undo README modifications

---------

Co-authored-by: webbigdata-jp <dahara1@webbigdata.jp>

* Add Balanced Accuracy (#95)

* First implementation of balanced accuracy

* Add comment

* Make JNLI a balanced acc task

* Add mcc and balanced f1 scores

---------

Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>

* Remove 3.8 version spec from pre-commit config

The version here makes it so that pre-commit can only run in an
environment with python3.8 in the path, but there's no compelling reason
for that. Removing the spec just uses system python.

* Fix Linter Related Issues (#96)

* Change formatting to make the linter happy

This is mostly:

- newlines at end of files
- removing blank lines at end of files
- changing single to double quotes
- black multi-line formatting rules
- other whitespace edits

* Remove codespell

Has a lot of false positives

* boolean style issue

* bare except

These seem harmless enough, so just telling the linter to ignore them

* More linter suggestions

---------

Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>

* Simplify neologdn version

This was pointing to a commit, but the relevant PR has been merged and
released for a while now, so a normal version spec can be used.

* Update xwinograd dataset

The old dataset was deleted.

* won't need llama2/llama2-2.7b due to duplication (#99)

* add gekko (#98)

Co-authored-by: webbigdata-jp <dahara1@webbigdata.jp>

* add llama2 format (#100)

* add llama2 format

* add 0.6 in prompt_templates.md

* make pre-commit pass

* remove debugging line

* fix bug on `mgsm` for prompt version `0.3` (#101)

* Add JCoLA task (#93)

* WIP: need JCoLA

* Update harness.jcola.sh

* update prompt

* update prompt

* update prompt

* update prompt

* Revert "update prompt"

This reverts commit cd9a914.

* WIP: evaluate on JCoLA

* Add new metrics to cola

This modifies cola, since jcola just inherits this part. It's not a
problem to modify the parent task because it just adds some output.

* Linter edits

* evaluate on JCoLA

* need JCoLAWithLlama2

* JCoLA's prompt version should be 0.0

https://github.com/Stability-AI/lm-evaluation-harness/blob/jp-stable/docs/prompt_templates.md

* documentation

jptasks.md and prompt_templates.md

* won't need harness and result for JCoLA

* fix linter related issue

* Delete harness.jcola.sh

---------

Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>
Co-authored-by: mkshing <33302880+mkshing@users.noreply.github.com>

* Linter fixes

* Remove example - script is used instead of function

* Cleanup

* Cleanup / linter fixes

There were some things related to the old shell script usage that
weren't working, this should fix it.

* Add README section describing cluster usage

---------

Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>
Co-authored-by: kumapo <kumapo@users.noreply.github.com>
Co-authored-by: webbigdata-jp <87654083+webbigdata-jp@users.noreply.github.com>
Co-authored-by: webbigdata-jp <dahara1@webbigdata.jp>
Co-authored-by: mkshing <33302880+mkshing@users.noreply.github.com>

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment