add poetry dataset setup #2730

Merged
merged 9 commits into from Apr 20, 2023

@CheckMC (Contributor) commented Apr 19, 2023

Dataset Description
This dataset contains around 14,000 poems from the PoetryFoundation.org site. They are converted to question:response pairs, using the tags as topics.
5% of the dataset is titling requests -- the user provides a poem and asks the assistant to title it.

Languages
English

Dataset Structure
This dataset follows the OA format, which is:

INSTRUCTION (string): The user asks for a poem (from a variety of premade prompts) with topics (tags). If the given poem has no tags, the user asks for a poem on its own.

RESPONSE (string): The assistant replies with the poem and title (from a variety of premade prompts).

SOURCE (string): The source is PoetryFoundation.org and the poet's name.

METADATA (JSON String):
{"author": "author of the original poem",
"title": "title of the poem",
"tags": "tags from poetry foundation."}
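
As an illustration of the structure above, here is a minimal sketch of how one row might be assembled. It is not the code in prepare.py: the function name `build_row`, the prompt wording, the SOURCE separator, and the comma-joined tag string are all assumptions made for the example.

```python
import json


def build_row(title, author, poem, tags):
    """Assemble one OA-format row (hypothetical helper, not from prepare.py)."""
    # Assumed prompt wording; the real script picks from premade prompts.
    if tags:
        instruction = f"Write me a poem about {', '.join(tags)}."
    else:
        # No tags: the user asks for a poem on its own.
        instruction = "Write me a poem."

    # Assumed response template: the reply carries both title and poem.
    response = f'Here is a poem titled "{title}":\n\n{poem}'

    return {
        "INSTRUCTION": instruction,
        "RESPONSE": response,
        # Assumed separator between the site name and the poet's name.
        "SOURCE": f"PoetryFoundation.org - {author}",
        # METADATA is stored as a JSON string, not a nested object.
        "METADATA": json.dumps(
            {"author": author, "title": title, "tags": ",".join(tags)}
        ),
    }


row = build_row(
    "Example Title", "Example Poet", "First line\nSecond line", ["Nature", "Love"]
)
print(row["INSTRUCTION"])
print(row["METADATA"])
```

Because METADATA is a string, a consumer would call `json.loads(row["METADATA"])` to read the author, title, and tags back.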

Preparing the Dataset
The dataset can be created with prepare.py. Make sure to install the required libraries in requirements.txt!

Contributions
Created by Check
Original dataset source - https://www.kaggle.com/datasets/tgdivy/poetry-foundation-poems

You can view it on my huggingface here: https://huggingface.co/datasets/checkai/instruction-poems
(this time i ran pre-commit so it should be good :D )

@github-actions

pre-commit failed.
Please run pre-commit run --all-files locally and commit the changes.
Find more information in the repository's CONTRIBUTING.md

@olliestanley (Collaborator) left a comment

Will need to remove the input CSV from the Git repo and include code in the script to download it from the original source.

@sedthh (Collaborator) left a comment

Added some minor comments, but otherwise this is very well done. You have added all the fixes requested previously.

Please add the dataset to the parent __init__.py and the HF dataset card to the script's README.md and it should be good to go!

data/datasets/poetry_instruction/README.md (resolved)
data/datasets/poetry_instruction/prepare.py (resolved)
data/datasets/poetry_instruction/prepare.py (resolved)
data/datasets/poetry_instruction/prepare.py (resolved)
data/datasets/poetry_instruction/prepare.py (outdated, resolved)
data/datasets/poetry_instruction/prepare.py (resolved)

@sedthh (Collaborator) left a comment

some more comments

data/datasets/poetry_instruction/prepare.py (outdated, resolved)
Added poetry dataset to init

@sedthh sedthh merged commit a700700 into LAION-AI:main Apr 20, 2023
1 check passed
@CheckMC CheckMC deleted the poetry-dataset branch April 20, 2023 17:54