description, homepage, and citation obtained from HF with datasets.load_dataset_builder #818
Conversation
Overall looks great! Did you manage to do it automatically? Could you somehow get the tags from the README?
For example, in amazon_massive:

```yaml
annotations_creators:
- expert-generated
language_creators:
- found
license:
- cc-by-4.0
multilinguality:
- af-ZA
- am-ET
- ar-SA
- az-AZ
- bn-BD
- ca-ES
- cy-GB
- da-DK
- de-DE
- el-GR
- en-US
- es-ES
- fa-IR
- fi-FI
- fr-FR
- he-IL
- hi-IN
- hu-HU
- hy-AM
- id-ID
- is-IS
- it-IT
- ja-JP
- jv-ID
- ka-GE
- km-KH
- kn-IN
- ko-KR
- lv-LV
- ml-IN
- mn-MN
- ms-MY
- my-MM
- nb-NO
- nl-NL
- pl-PL
- pt-PT
- ro-RO
- ru-RU
- sl-SL
- sq-AL
- sv-SE
- sw-KE
- ta-IN
- te-IN
- th-TH
- tl-PH
- tr-TR
- ur-PK
- vi-VN
- zh-CN
- zh-TW
size_categories:
- 100K<n<1M
source_datasets:
- original
task_categories:
- text-classification
task_ids:
- intent-classification
- multi-class-classification
paperswithcode_id: massive
pretty_name: MASSIVE
language_bcp47:
- af-ZA
- am-ET
- ar-SA
- az-AZ
- bn-BD
- ca-ES
- cy-GB
- da-DK
- de-DE
- el-GR
- en-US
- es-ES
- fa-IR
- fi-FI
- fr-FR
- he-IL
- hi-IN
- hu-HU
- hy-AM
- id-ID
- is-IS
- it-IT
- ja-JP
- jv-ID
- ka-GE
- km-KH
- kn-IN
- ko-KR
- lv-LV
- ml-IN
- mn-MN
- ms-MY
- my-MM
- nb-NO
- nl-NL
- pl-PL
- pt-PT
- ro-RO
- ru-RU
- sl-SL
- sq-AL
- sv-SE
- sw-KE
- ta-IN
- te-IN
- th-TH
- tl-PH
- tr-TR
- ur-PK
- vi-VN
- zh-CN
- zh-TW
tags:
- natural-language-understanding
```

Source: https://huggingface.co/datasets/AmazonScience/massive/blob/main/README.md
You can even access its raw version to scrape with this URL: https://huggingface.co/datasets/AmazonScience/massive/raw/main/README.md
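The raw-README URL above makes that scraping straightforward: the tags live in a flat YAML front-matter block between two `---` markers at the top of the file. A minimal sketch (the function names here are illustrative, not helpers from this PR; the parser hand-rolls only the simple scalar/list shapes seen in dataset cards, so it needs no YAML library):

```python
import urllib.request

def parse_front_matter(readme_text: str) -> dict:
    """Parse the flat 'key:' / '- item' YAML front matter between '---' markers.

    Handles only the simple scalar and list shapes seen in HF dataset cards.
    """
    if not readme_text.startswith("---"):
        return {}
    block = readme_text.split("---", 2)[1]
    meta, current = {}, None
    for line in block.splitlines():
        if line.startswith("- ") and current is not None:
            # List item belonging to the most recent list-valued key.
            meta.setdefault(current, []).append(line[2:].strip())
        elif ":" in line and not line.startswith(" "):
            key, _, value = line.partition(":")
            if value.strip():            # scalar, e.g. "pretty_name: MASSIVE"
                meta[key.strip()] = value.strip()
                current = None
            else:                        # list header; "- ..." items follow
                current = key.strip()
    return meta

def fetch_readme_metadata(repo_id: str) -> dict:
    """Download a dataset's raw README from the Hub and return its metadata."""
    url = f"https://huggingface.co/datasets/{repo_id}/raw/main/README.md"
    with urllib.request.urlopen(url) as resp:
        return parse_front_matter(resp.read().decode("utf-8"))

# Example (requires network access):
# fetch_readme_metadata("AmazonScience/massive")["task_categories"]
```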
Codecov Report: All modified and coverable lines are covered by tests ✅

```
@@           Coverage Diff           @@
##             main     #818   +/-   ##
=======================================
+ Coverage   92.06%   92.13%   +0.06%
=======================================
  Files         104      104
  Lines       10738    10739       +1
=======================================
+ Hits         9886     9894       +8
+ Misses        852      845       -7
=======================================
```

View full report in Codecov by Sentry.
Hi @michal-jacovi and @elronbandel,
I made a pass through the existing cards; for the vast majority of them, the above tool found tags.
@elronbandel, I found a way to force git-commit to accept what it flags as spelling errors, but the automatic tool that checks the PR still insists that some of the language names (e.g. 'som') are spelling errors. Is there a way to overcome this, or should we simply ignore this automatic complaint for this PR?
…ad_dataset_builder (#818)

* description, homepage, and citation obtained with datasets.load_dataset_builder
* forgot dart..
* add a utility to find where, in the prepare_card.py file, to push the tags
* added dataset_info.tags to almost all cards. Extracted and added automatically
* utils to extract info from hf datasest_info and plant into the taskcard generator file
* have a commit with tags only, then continue to description
* added descriptions automatically
* the automatic hacking for scraping info from hf
* Fix pre commit to pass

Signed-off-by: dafnapension <dafnashein@yahoo.com>
Signed-off-by: Elron Bandel <elron.bandel@ibm.com>
Co-authored-by: Elron Bandel <elron.bandel@ibm.com>
Co-authored-by: Elron Bandel <elronbandel@gmail.com>
Hi @michal-jacovi and @elronbandel,
Just to kick off, I tweaked test_card to invoke load_dataset_builder (rather than LoadHF) and printed the description, citation, homepage, and whatever else load_dataset_builder harvested for me. I ran test_preparation to apply the tweak over all cards. Not many of them surrendered descriptions, but at least we have something to begin with.
If the 5 descriptions thus obtained make sense, I will complete the remaining cards and look for more ways to harvest.
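The tweak described above can be sketched roughly as follows, assuming the `datasets` library is installed; `harvest_info` and `harvest_info_from` are hypothetical names for illustration, since the actual change lives inside test_card/test_preparation:

```python
def harvest_info_from(builder) -> dict:
    """Collect the descriptive fields exposed by a dataset builder's info."""
    info = builder.info
    return {
        "description": info.description,
        "homepage": info.homepage,
        "citation": info.citation,
        "license": info.license,
    }

def harvest_info(path: str, name=None) -> dict:
    """Fetch a dataset's metadata via load_dataset_builder.

    Only the builder is constructed; the data files themselves
    are not downloaded.
    """
    from datasets import load_dataset_builder  # imported lazily

    return harvest_info_from(load_dataset_builder(path, name))

# Example (requires network access to the Hugging Face Hub):
# harvest_info("AmazonScience/massive", "en-US")["homepage"]
```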