Add more semantics, logical & reasoning datasets | multilingual #3122

Open
7 of 10 tasks
echo0x22 opened this issue May 10, 2023 · 0 comments

echo0x22 commented May 10, 2023

I believe such data can help the model give more meaningful answers. At the moment, when tested on semantic tasks (for example, deciding which words in a list are synonyms), the model gives terrible results.

I will gradually develop this idea (and refine this issue), looking for datasets that could help solve this problem and then adapting them to a QnA/instruction format.


Very useful source: https://allenai.org/data


Semantics:

  • [semantics] [multilingual] Tatoeba Q&A Translation Dataset; resolved by Add Tatoeba QnA dataset #3114
  • [semantics] [multilingual] Word similarity dataset; resolved by Add semantics WS dataset #3200
  • [semantics] [russian] RUSSE is a series of workshops on Russian Semantic Evaluation. Each workshop is centered around a shared task on a specific topic related to the semantic processing of the Russian language; https://github.com/nlpub/russe-evaluation (too many entries; the word similarity dataset above already covers Russian, so this is not needed)
  • Find more...

Reasoning & Logic:

Instructions:


@echo0x22 echo0x22 changed the title Add more semantics, logical & reasoning datasets Add more semantics, logical & reasoning datasets | multilingual May 11, 2023
sedthh pushed a commit that referenced this issue May 13, 2023
Part of this: #3122
https://huggingface.co/datasets/0x22almostEvil/reasoning_bg_oa
---
# Dataset Card for Bulgarian QnA reasoning with ~2.7K entries.

### Dataset Summary

Contains a Parquet file with a list of instructions and answers.

Each row consists of

* INSTRUCTION
* RESPONSE
* SOURCE (reasoning_bg)
* METADATA (json with language, url, id).
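
As a minimal sketch of the row layout listed above, the following builds one row from a question/answer pair. The helper name `make_row`, the placeholder values, and the exact key names inside METADATA are assumptions for illustration; the actual conversion script is not shown in this issue.

```python
import json


def make_row(question, answer, language, url, entry_id, source="reasoning_bg"):
    """Build one row with the four columns listed above (hypothetical helper)."""
    # Assumed METADATA keys, taken from the description "json with language, url, id".
    metadata = {"language": language, "url": url, "id": entry_id}
    return {
        "INSTRUCTION": question,
        "RESPONSE": answer,
        "SOURCE": source,
        # METADATA is serialized to a JSON string; ensure_ascii=False keeps
        # non-Latin text (such as Cyrillic) readable instead of escaping it.
        "METADATA": json.dumps(metadata, ensure_ascii=False),
    }


# Placeholder values only, not real dataset content.
row = make_row("Example question?", "Example answer.", "bg", "https://example.com/item", "0")
```

Storing METADATA as a JSON string rather than a nested structure keeps every column a plain string, which is one simple way to keep the Parquet schema flat.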

### Original Dataset is available here:
* https://huggingface.co/datasets/reasoning_bg

---------

Co-authored-by: 0x22almostEvil <0x22almostEvil>
sedthh pushed a commit that referenced this issue May 13, 2023
Resolves two of #3122

---

# Dataset Card for GSM QnA reasoning with ~8.8K entries.

### Dataset Summary

License: MIT. Contains a Parquet file with a list of instructions and answers (English only). Covers reasoning, logic and programming.

Each row consists of

- INSTRUCTION
- RESPONSE
- SOURCE
- METADATA (json with language).
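
A quick sanity check for rows in this layout might look like the sketch below. The function name `is_valid_row` and the `language` key inside METADATA are assumptions based on the column list above, not part of the published dataset tooling.

```python
import json

REQUIRED_COLUMNS = ("INSTRUCTION", "RESPONSE", "SOURCE", "METADATA")


def is_valid_row(row):
    """Return True if the row has all four columns and METADATA is a JSON object with a language."""
    if any(column not in row for column in REQUIRED_COLUMNS):
        return False
    try:
        metadata = json.loads(row["METADATA"])
    except (TypeError, ValueError):
        # TypeError: METADATA is not a string; ValueError: it is not valid JSON.
        return False
    return isinstance(metadata, dict) and "language" in metadata


# Placeholder rows for illustration only.
good_row = {
    "INSTRUCTION": "Example question?",
    "RESPONSE": "Example answer.",
    "SOURCE": "example-source",
    "METADATA": '{"language": "en"}',
}
bad_row = {"INSTRUCTION": "Example question?", "RESPONSE": "Example answer."}
```

Running such a check over every row before upload is one way to catch missing columns or malformed METADATA early.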

### Link:

https://huggingface.co/datasets/0x22almostEvil/reasoning-gsm-qna-oa

### Original Datasets are available here:

- https://huggingface.co/datasets/gsm8k
- https://huggingface.co/datasets/reasoning-machines/gsm-hard

Co-authored-by: 0x22almostEvil <0x22almostEvil>