
description, homepage, and citation obtained from HF with datasets.load_dataset_builder #818

Merged
merged 11 commits into main from add_info_to_cards on May 20, 2024

Conversation

dafnapension (Collaborator):

Hi @michal-jacovi and @elronbandel,
Just to kick off, I tweaked test_card to invoke load_dataset_builder (rather than LoadHF) and printed the description, citation, homepage, and whatever else load_dataset_builder harvested for me. I ran test_preparation to apply the tweak over all the cards. Not too many surrendered descriptions, but at least we have something to begin with.
If the 5 descriptions thus obtained make sense, I will complete the remaining cards and look for more ways to harvest.
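
For context, a minimal sketch of this kind of harvesting with the public datasets API; the "glue"/"cola" path is only an illustrative example, not necessarily one of the tweaked cards:

from datasets import load_dataset_builder

# Build the builder only -- nothing is downloaded; builder.info carries
# the metadata that the hub publishes for the dataset.
builder = load_dataset_builder("glue", "cola")  # example path/config

print(builder.info.description)
print(builder.info.citation)
print(builder.info.homepage)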

@elronbandel (Member) left a comment:

Overall this looks great! Did you manage to do it automatically? Could you somehow get the tags from the README?

For example, in amazon_massive:

annotations_creators:
  - expert-generated
language_creators:
  - found
license:
  - cc-by-4.0
multilinguality:
  - af-ZA
  - am-ET
  - ar-SA
  - az-AZ
  - bn-BD
  - ca-ES
  - cy-GB
  - da-DK
  - de-DE
  - el-GR
  - en-US
  - es-ES
  - fa-IR
  - fi-FI
  - fr-FR
  - he-IL
  - hi-IN
  - hu-HU
  - hy-AM
  - id-ID
  - is-IS
  - it-IT
  - ja-JP
  - jv-ID
  - ka-GE
  - km-KH
  - kn-IN
  - ko-KR
  - lv-LV
  - ml-IN
  - mn-MN
  - ms-MY
  - my-MM
  - nb-NO
  - nl-NL
  - pl-PL
  - pt-PT
  - ro-RO
  - ru-RU
  - sl-SL
  - sq-AL
  - sv-SE
  - sw-KE
  - ta-IN
  - te-IN
  - th-TH
  - tl-PH
  - tr-TR
  - ur-PK
  - vi-VN
  - zh-CN
  - zh-TW
size_categories:
  - 100K<n<1M
source_datasets:
  - original
task_categories:
  - text-classification
task_ids:
  - intent-classification
  - multi-class-classification
paperswithcode_id: massive
pretty_name: MASSIVE
language_bcp47:
  - af-ZA
  - am-ET
  - ar-SA
  - az-AZ
  - bn-BD
  - ca-ES
  - cy-GB
  - da-DK
  - de-DE
  - el-GR
  - en-US
  - es-ES
  - fa-IR
  - fi-FI
  - fr-FR
  - he-IL
  - hi-IN
  - hu-HU
  - hy-AM
  - id-ID
  - is-IS
  - it-IT
  - ja-JP
  - jv-ID
  - ka-GE
  - km-KH
  - kn-IN
  - ko-KR
  - lv-LV
  - ml-IN
  - mn-MN
  - ms-MY
  - my-MM
  - nb-NO
  - nl-NL
  - pl-PL
  - pt-PT
  - ro-RO
  - ru-RU
  - sl-SL
  - sq-AL
  - sv-SE
  - sw-KE
  - ta-IN
  - te-IN
  - th-TH
  - tl-PH
  - tr-TR
  - ur-PK
  - vi-VN
  - zh-CN
  - zh-TW
tags:
  - natural-language-understanding

source: https://huggingface.co/datasets/AmazonScience/massive/blob/main/README.md
You can even access its raw version for scraping at this URL: https://huggingface.co/datasets/AmazonScience/massive/raw/main/README.md
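
For what it's worth, those tags sit in the YAML front matter at the top of the raw README, so a minimal scraping sketch (assuming requests and PyYAML are available) could look like:

import requests
import yaml  # PyYAML

RAW_URL = "https://huggingface.co/datasets/AmazonScience/massive/raw/main/README.md"

text = requests.get(RAW_URL, timeout=30).text

# The dataset card opens with a YAML block delimited by '---' lines;
# everything between the first two markers is the tag metadata shown above.
if text.startswith("---"):
    front_matter = text.split("---", 2)[1]
    tags = yaml.safe_load(front_matter)
    print(tags["task_categories"])  # e.g. ['text-classification']
    print(tags["pretty_name"])      # e.g. 'MASSIVE'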


codecov bot commented May 18, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 92.13%. Comparing base (4064804) to head (a85f9b5).
Report is 1 commit behind head on main.

Current head a85f9b5 differs from the pull request's most recent head a0e97a2.

Please upload reports for the commit a0e97a2 to get more accurate results.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #818      +/-   ##
==========================================
+ Coverage   92.06%   92.13%   +0.06%     
==========================================
  Files         104      104              
  Lines       10738    10739       +1     
==========================================
+ Hits         9886     9894       +8     
+ Misses        852      845       -7     


@dafnapension (Collaborator, Author):

Hi @michal-jacovi and @elronbandel,
I found a way to extract a dataset's tags from a dataset_info object:

from huggingface_hub import dataset_info

# card["path"] is the same path that the card passes to LoadHF
ds_info = dataset_info(repo_id=card["path"])
# then add ds_info.tags into the file that generates the card, e.g. cola.py

I made a pass ('wash') through the existing cards, and for the vast majority of them the above tool found tags.
Now we just need to review by eye the tags that I (so nicely, isn't it? :-) extracted; see the sketch below.
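
As a rough illustration of such a wash (the repo ids below are examples only, not the actual card list):

from huggingface_hub import dataset_info

# Example repo ids; the real pass iterated over every existing card.
card_paths = ["glue", "AmazonScience/massive", "dbpedia_14"]

for path in card_paths:
    info = dataset_info(repo_id=path)
    # info.tags is a flat list of strings such as
    # 'task_categories:text-classification', ready for eyeball review.
    print(path, "->", info.tags)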

@dafnapension (Collaborator, Author):

dafnapension commented May 19, 2024

@elronbandel, I found a way to force git commit to swallow what it flags as spelling errors, but the automatic tool that checks the PR still insists that some of the language names (e.g. 'som') are spelling errors. Is there a way to overcome this, or shall we simply ignore this automatic complaint for this PR?

@dafnapension (Collaborator, Author):

Hi @michal-jacovi and @elronbandel,
I also added descriptions, automatically.., to the vast majority of the cards.
Tests still fail on spelling errors, but I am ignoring them for now, as I found a trick to commit these errors and push anyway, in defiance of Ruff and its wrath.
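
A hypothetical end-to-end sketch combining the two harvesters; the __description__ and __tags__ field names are illustrative assumptions about the card generator files, not necessarily what this PR writes:

from datasets import load_dataset_builder
from huggingface_hub import dataset_info

def harvest(path, name=None):
    # Return a snippet that could be pasted into a card generator file.
    # __description__/__tags__ are assumed names, for illustration only.
    builder = load_dataset_builder(path, name)
    tags = dataset_info(repo_id=path).tags
    return (
        f"__description__ = {builder.info.description!r}\n"
        f"__tags__ = {tags!r}\n"
    )

print(harvest("glue", "cola"))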

dafnapension and others added 3 commits May 20, 2024 10:45
elronbandel enabled auto-merge (squash) May 20, 2024 09:11
elronbandel merged commit 4c376c3 into main May 20, 2024
7 checks passed
elronbandel deleted the add_info_to_cards branch May 20, 2024 09:34
bnayahu pushed a commit that referenced this pull request May 21, 2024
description, homepage, and citation obtained from HF with datasets.load_dataset_builder (#818)

* description, homepage, and citation obtained with datasets.load_dataset_builder

Signed-off-by: dafnapension <dafnashein@yahoo.com>

* forgot dart..

Signed-off-by: dafnapension <dafnashein@yahoo.com>

* add a utility to find where, in the prepare_card.py file, to push the tags

Signed-off-by: dafnapension <dafnashein@yahoo.com>

* added dataset_info.tags to almost all cards. Extracted and added automatically

Signed-off-by: dafnapension <dafnashein@yahoo.com>

* added dataset_info.tags to almost all cards. Extracted and added automatically

Signed-off-by: dafnapension <dafnashein@yahoo.com>

* utils to extract info from hf dataset_info and plant it into the taskcard generator file

Signed-off-by: dafnapension <dafnashein@yahoo.com>

* have a commit with tags only, then continue to description

Signed-off-by: dafnapension <dafnashein@yahoo.com>

* added descriptions automatically

Signed-off-by: dafnapension <dafnashein@yahoo.com>

* the automatic hack for scraping info from hf

Signed-off-by: dafnapension <dafnashein@yahoo.com>

* Fix pre commit to pass

Signed-off-by: Elron Bandel <elron.bandel@ibm.com>

---------

Signed-off-by: dafnapension <dafnashein@yahoo.com>
Signed-off-by: Elron Bandel <elron.bandel@ibm.com>
Co-authored-by: Elron Bandel <elron.bandel@ibm.com>
Co-authored-by: Elron Bandel <elronbandel@gmail.com>