
Gutenberg eBook Crawler + Datasets #1463

Merged
merged 4 commits into from Feb 11, 2023


sedthh (Collaborator) commented Feb 11, 2023

This PR resolves #1110

  • Created a notebook with a crawler class that downloads books from Project Gutenberg, strips the metadata from the body text, and skips books with a copyright header
  • Added a Colab link and a basic requirements.txt for the notebook
  • Created datasets of eBooks, with their metadata included, in the following languages: "en", "de", "fr", "es", "it", "pt", "nl", "hu"
  • Added README.md with a dataset card and an in-depth explanation
  • Added hub.py with a basic loader, following the examples in https://github.com/LAION-AI/Open-Assistant/tree/main/openassistant/datasets. Let me know if anything else is needed.
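
For readers skimming this PR, the metadata stripping and copyright check in the first bullet could look roughly like the sketch below. It is not the code from the notebook: the function name `extract_body`, the marker patterns, and the copyright heuristic are all assumptions chosen for illustration, based on the "*** START OF ..." and "*** END OF ..." lines that Project Gutenberg plain-text files typically carry.

```python
import re
from typing import Optional

# Assumed marker lines: Project Gutenberg plain-text files usually wrap the
# book body in "*** START OF THE/THIS PROJECT GUTENBERG EBOOK ... ***" and a
# matching "*** END OF ..." line.
START_RE = re.compile(
    r"^\*\*\*\s*START OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*$",
    re.IGNORECASE | re.MULTILINE,
)
END_RE = re.compile(
    r"^\*\*\*\s*END OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*$",
    re.IGNORECASE | re.MULTILINE,
)

# Hypothetical heuristic: treat a header mentioning a "copyrighted Project
# Gutenberg eBook" as a signal to skip the book entirely.
COPYRIGHT_RE = re.compile(r"copyrighted project gutenberg ebook", re.IGNORECASE)


def extract_body(raw: str) -> Optional[str]:
    """Return the text between the start and end markers, or None to skip."""
    start = START_RE.search(raw)
    if start is None:
        return None  # no recognizable header, so the body cannot be isolated
    if COPYRIGHT_RE.search(raw[: start.start()]):
        return None  # header carries a copyright notice: skip this book
    end = END_RE.search(raw, start.end())
    if end is None:
        return None  # missing footer marker: treat the file as malformed
    return raw[start.end() : end.start()].strip()
```

Returning None for every skip case keeps the calling loop simple: the crawler can drop any book whose body is None without caring why it was rejected.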

EDIT: reran pre-commit run --all-files

- added configurable Jupyter Notebook with crawler + ability to save to parquet
- added README.md
- added requirements.txt for running in venv / Colab
- eBooks are now available on Hugging Face as datasets
- added hub.py with basic loading
- updated README.md to match the datasets' cards
@github-actions

pre-commit failed.
Please run pre-commit run --all-files locally and commit the changes.
Find more information in the repository's CONTRIBUTING.md

- forgot to add --all-files last time
andrewm4894 (Collaborator) left a comment

Sweet! Can you run the notebook so that someone browsing on GitHub can easily see typical or expected outputs?

- the crawler Notebook now has example outputs
- updated README.md
- reran pre-commit hook
sedthh (Collaborator, Author) commented Feb 11, 2023

Done.

@andrewm4894 andrewm4894 merged commit 55a52fd into LAION-AI:main Feb 11, 2023
Development

Successfully merging this pull request may close these issues.

Download Ebooks from Project Gutenberg
2 participants