EleutherAI/pile-allpoetry

pile_allpoetry

A tool for scraping poems from allpoetry.com

Usage

To scrape the first 100000 poems on the site (the default range, poem IDs 1 to 100000):

python scrape_poems.py

To scrape all poems up to the latest one on the site:

python scrape_poems.py -a

To scrape from poem 500000 to poem 1000000:

python scrape_poems.py --start_id 500000 --latest_id 1000000

All usage options:

usage: scrape_poems.py [-h] [--latest_id LATEST_ID] [--start_id START_ID]
                       [--chunk_size CHUNK_SIZE] [-a] [-v] [-c]

CLI for allpoetry dataset - A tool for scraping poems from allpoetry.com

optional arguments:
  -h, --help            show this help message and exit
  --latest_id LATEST_ID
                        scrape from start_id to latest_id poems (default:
                        100000)
  --start_id START_ID   scrape from start_id to latest_id poems (default: 1)
  --chunk_size CHUNK_SIZE
                        size of multiprocessing chunks (default: 500)
  -a, --all             if this flag is set *all poems* up until the latest
                        poem will be scraped
  -v, --verbose         if this flag is set a poem will be printed out every
                        chunk
  -c, --checkpoint      if this flag is set a the scraper will resume from the
                        poem id in out/checkpoint.txt
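
For readers who want to see how the options above fit together, here is a minimal sketch of a parser with the same flags and defaults, plus one way an ID range might be split into chunks. It is not taken from scrape_poems.py: the helper names `build_parser`, `chunk_ranges`, and `resolve_start` are hypothetical, treating `latest_id` as inclusive is an assumption, and so is the guess that out/checkpoint.txt holds a single integer poem ID.

```python
import argparse
from pathlib import Path


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the flags and defaults listed in the usage text."""
    parser = argparse.ArgumentParser(
        prog="scrape_poems.py",
        description="CLI for allpoetry dataset - "
                    "A tool for scraping poems from allpoetry.com",
    )
    parser.add_argument("--latest_id", type=int, default=100000,
                        help="scrape from start_id to latest_id poems")
    parser.add_argument("--start_id", type=int, default=1,
                        help="scrape from start_id to latest_id poems")
    parser.add_argument("--chunk_size", type=int, default=500,
                        help="size of multiprocessing chunks")
    parser.add_argument("-a", "--all", action="store_true",
                        help="scrape all poems up until the latest poem")
    parser.add_argument("-v", "--verbose", action="store_true",
                        help="print out a poem every chunk")
    parser.add_argument("-c", "--checkpoint", action="store_true",
                        help="resume from the poem id in out/checkpoint.txt")
    return parser


def chunk_ranges(start_id: int, latest_id: int, chunk_size: int):
    """Yield (first, last) poem-ID pairs covering start_id..latest_id.

    Assumption: both ends are inclusive; the real scraper may differ.
    """
    for first in range(start_id, latest_id + 1, chunk_size):
        yield first, min(first + chunk_size - 1, latest_id)


def resolve_start(args: argparse.Namespace,
                  checkpoint_path: Path = Path("out/checkpoint.txt")) -> int:
    """Pick the first poem ID to scrape.

    Assumption: the checkpoint file contains one integer poem ID.
    Falls back to --start_id when -c is unset or the file is missing.
    """
    if args.checkpoint and checkpoint_path.exists():
        return int(checkpoint_path.read_text().strip())
    return args.start_id
```

Under these assumptions, `build_parser().parse_args(["--start_id", "500000", "--latest_id", "1000000"])` followed by `chunk_ranges(args.start_id, args.latest_id, args.chunk_size)` would hand each worker a block of at most 500 consecutive poem IDs.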
