
description, homepage, and citation obtained from HF with datasets.load_dataset_builder #818

Merged
merged 11 commits into main from add_info_to_cards on May 20, 2024

Conversation

dafnapension (Collaborator):

Hi @michal-jacovi and @elronbandel,
Just to kick off, I tweaked test_card to invoke load_dataset_builder (rather than LoadHF) and printed the description, citation, homepage, and whatever else load_dataset_builder harvested for me. I ran test_preparation to apply the tweak over all the cards. Not too many surrendered descriptions, but at least we have something to begin with.
If the 5 descriptions thus obtained make sense, I will complete the remaining cards and look for more ways to harvest.
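
For context, a minimal sketch of this kind of harvesting with the public datasets API; the "glue"/"cola" path is only an illustrative example, not necessarily one of the tweaked cards:

from datasets import load_dataset_builder

# Build the builder only -- nothing is downloaded; builder.info carries
# the metadata that the hub publishes for the dataset.
builder = load_dataset_builder("glue", "cola")  # example path/config

print(builder.info.description)
print(builder.info.citation)
print(builder.info.homepage)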

@elronbandel (Member) left a comment:

Overall this looks great! Did you manage to do it automatically? Could you somehow get the tags from the README?

For example, in amazon_massive:

annotations_creators:
  - expert-generated
language_creators:
  - found
license:
  - cc-by-4.0
multilinguality:
  - af-ZA
  - am-ET
  - ar-SA
  - az-AZ
  - bn-BD
  - ca-ES
  - cy-GB
  - da-DK
  - de-DE
  - el-GR
  - en-US
  - es-ES
  - fa-IR
  - fi-FI
  - fr-FR
  - he-IL
  - hi-IN
  - hu-HU
  - hy-AM
  - id-ID
  - is-IS
  - it-IT
  - ja-JP
  - jv-ID
  - ka-GE
  - km-KH
  - kn-IN
  - ko-KR
  - lv-LV
  - ml-IN
  - mn-MN
  - ms-MY
  - my-MM
  - nb-NO
  - nl-NL
  - pl-PL
  - pt-PT
  - ro-RO
  - ru-RU
  - sl-SL
  - sq-AL
  - sv-SE
  - sw-KE
  - ta-IN
  - te-IN
  - th-TH
  - tl-PH
  - tr-TR
  - ur-PK
  - vi-VN
  - zh-CN
  - zh-TW
size_categories:
  - 100K<n<1M
source_datasets:
  - original
task_categories:
  - text-classification
task_ids:
  - intent-classification
  - multi-class-classification
paperswithcode_id: massive
pretty_name: MASSIVE
language_bcp47:
  - af-ZA
  - am-ET
  - ar-SA
  - az-AZ
  - bn-BD
  - ca-ES
  - cy-GB
  - da-DK
  - de-DE
  - el-GR
  - en-US
  - es-ES
  - fa-IR
  - fi-FI
  - fr-FR
  - he-IL
  - hi-IN
  - hu-HU
  - hy-AM
  - id-ID
  - is-IS
  - it-IT
  - ja-JP
  - jv-ID
  - ka-GE
  - km-KH
  - kn-IN
  - ko-KR
  - lv-LV
  - ml-IN
  - mn-MN
  - ms-MY
  - my-MM
  - nb-NO
  - nl-NL
  - pl-PL
  - pt-PT
  - ro-RO
  - ru-RU
  - sl-SL
  - sq-AL
  - sv-SE
  - sw-KE
  - ta-IN
  - te-IN
  - th-TH
  - tl-PH
  - tr-TR
  - ur-PK
  - vi-VN
  - zh-CN
  - zh-TW
tags:
  - natural-language-understanding

source: https://huggingface.co/datasets/AmazonScience/massive/blob/main/README.md
You can even access its raw version for scraping at this URL: https://huggingface.co/datasets/AmazonScience/massive/raw/main/README.md
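
For what it's worth, those tags sit in the YAML front matter at the top of the raw README, so a minimal scraping sketch (assuming requests and PyYAML are available) could look like:

import requests
import yaml  # PyYAML

RAW_URL = "https://huggingface.co/datasets/AmazonScience/massive/raw/main/README.md"

text = requests.get(RAW_URL, timeout=30).text

# The dataset card opens with a YAML block delimited by '---' lines;
# everything between the first two markers is the tag metadata shown above.
if text.startswith("---"):
    front_matter = text.split("---", 2)[1]
    tags = yaml.safe_load(front_matter)
    print(tags["task_categories"])  # e.g. ['text-classification']
    print(tags["pretty_name"])      # e.g. 'MASSIVE'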


codecov bot commented May 18, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 92.13%. Comparing base (4064804) to head (a85f9b5).
Report is 1 commit behind head on main.

Current head a85f9b5 differs from the pull request's most recent head a0e97a2.

Please upload reports for the commit a0e97a2 to get more accurate results.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #818      +/-   ##
==========================================
+ Coverage   92.06%   92.13%   +0.06%     
==========================================
  Files         104      104              
  Lines       10738    10739       +1     
==========================================
+ Hits         9886     9894       +8     
+ Misses        852      845       -7     


@dafnapension (Collaborator, Author):

Hi @michal-jacovi and @elronbandel,
I found a way to extract a dataset's tags from a dataset_info object:

from huggingface_hub import dataset_info

# card["path"] is the same path that the card passes to LoadHF
ds_info = dataset_info(repo_id=card["path"])
# then add ds_info.tags into the file that generates the card, e.g. cola.py

I made a pass ('wash') through the existing cards, and for the vast majority of them the above tool found tags.
Now we just need to review by eye the tags that I (so nicely, isn't it? :-) extracted; see the sketch below.
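
As a rough illustration of such a wash (the repo ids below are examples only, not the actual card list):

from huggingface_hub import dataset_info

# Example repo ids; the real pass iterated over every existing card.
card_paths = ["glue", "AmazonScience/massive", "dbpedia_14"]

for path in card_paths:
    info = dataset_info(repo_id=path)
    # info.tags is a flat list of strings such as
    # 'task_categories:text-classification', ready for eyeball review.
    print(path, "->", info.tags)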

@dafnapension (Collaborator, Author):

dafnapension commented May 19, 2024

@elronbandel, I found a way to force git commit to swallow what it flags as spelling errors, but the automatic tool that checks the PR still insists that some of the language names (e.g. 'som') are spelling errors. Is there a way to overcome this, or shall we simply ignore this automatic complaint for this PR?

@dafnapension (Collaborator, Author):

Hi @michal-jacovi and @elronbandel,
I also added descriptions, automatically.., to the vast majority of the cards.
Tests still fail on spelling errors, but I am ignoring them for now, as I found a trick to commit these errors and push anyway, in defiance of Ruff and its wrath.
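
A hypothetical end-to-end sketch combining the two harvesters; the __description__ and __tags__ field names are illustrative assumptions about the card generator files, not necessarily what this PR writes:

from datasets import load_dataset_builder
from huggingface_hub import dataset_info

def harvest(path, name=None):
    # Return a snippet that could be pasted into a card generator file.
    # __description__/__tags__ are assumed names, for illustration only.
    builder = load_dataset_builder(path, name)
    tags = dataset_info(repo_id=path).tags
    return (
        f"__description__ = {builder.info.description!r}\n"
        f"__tags__ = {tags!r}\n"
    )

print(harvest("glue", "cola"))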

dafnapension and others added 3 commits May 20, 2024 10:45
elronbandel enabled auto-merge (squash) May 20, 2024 09:11
elronbandel merged commit 4c376c3 into main May 20, 2024
7 checks passed
elronbandel deleted the add_info_to_cards branch May 20, 2024 09:34
bnayahu pushed a commit that referenced this pull request May 21, 2024
description, homepage, and citation obtained from HF with datasets.load_dataset_builder (#818)

* description, homepage, and citation obtained with datasets.load_dataset_builder

Signed-off-by: dafnapension <dafnashein@yahoo.com>

* forgot dart..

Signed-off-by: dafnapension <dafnashein@yahoo.com>

* add a utility to find where, in the prepare_card.py file, to push the tags

Signed-off-by: dafnapension <dafnashein@yahoo.com>

* added dataset_info.tags to almost all cards. Extracted and added automatically

Signed-off-by: dafnapension <dafnashein@yahoo.com>

* added dataset_info.tags to almost all cards. Extracted and added automatically

Signed-off-by: dafnapension <dafnashein@yahoo.com>

* utils to extract info from hf dataset_info and plant it into the taskcard generator file

Signed-off-by: dafnapension <dafnashein@yahoo.com>

* have a commit with tags only, then continue to description

Signed-off-by: dafnapension <dafnashein@yahoo.com>

* added descriptions automatically

Signed-off-by: dafnapension <dafnashein@yahoo.com>

* the automatic hack for scraping info from hf

Signed-off-by: dafnapension <dafnashein@yahoo.com>

* Fix pre commit to pass

Signed-off-by: Elron Bandel <elron.bandel@ibm.com>

---------

Signed-off-by: dafnapension <dafnashein@yahoo.com>
Signed-off-by: Elron Bandel <elron.bandel@ibm.com>
Co-authored-by: Elron Bandel <elron.bandel@ibm.com>
Co-authored-by: Elron Bandel <elronbandel@gmail.com>