P1 Crawling: Office Civil Rights #3

Daniellappv · 2020-02-25T14:03:01Z

Description: Scrape metadata for https://ocrdata.ed.gov/ (this also seems like a useful page: https://www2.ed.gov/about/offices/list/ocr/data.html?src=rt)

Acceptance criteria

Crawl the site
Refine the crawler to reach as many resources as possible
Integrate with the existing pipeline rules (provide an HTML response for the parser)
Test run with a dummy parser: it should collect datasets and dump them into JSON files
Push the code once it checks all the above criteria
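
To make the "dummy parser" criterion concrete, here is a minimal sketch of what such a test parser could look like. It is not the project's actual parser: the names `LinkCollector`, `dummy_parse`, `dump_datasets`, the `DATA_EXTENSIONS` list, and the JSON field names are all hypothetical, and it uses only the Python standard library rather than whatever crawling framework the pipeline relies on. It takes the HTML text of a crawled page, keeps links that look like downloadable data files, and writes one JSON file per dataset.

```python
import json
from html.parser import HTMLParser
from pathlib import Path
from urllib.parse import urljoin

# Assumption: file extensions that count as dataset resources (hypothetical list).
DATA_EXTENSIONS = (".csv", ".xls", ".xlsx", ".zip")


class LinkCollector(HTMLParser):
    """Collect (href, link text) pairs from the anchor tags of a page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside an anchor with an href.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def dummy_parse(html, page_url):
    """Return one dict per link that points at a data file."""
    collector = LinkCollector()
    collector.feed(html)
    datasets = []
    for href, text in collector.links:
        if href.lower().endswith(DATA_EXTENSIONS):
            datasets.append({
                "source_url": page_url,
                "url": urljoin(page_url, href),  # resolve relative links
                "title": text,
            })
    return datasets


def dump_datasets(datasets, output_dir):
    """Write each dataset dict to its own JSON file and return the paths."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for index, dataset in enumerate(datasets):
        path = out / f"dataset_{index:04d}.json"
        path.write_text(json.dumps(dataset, indent=2), encoding="utf-8")
        paths.append(path)
    return paths
```

In a test run, the crawler would pass each page's HTML and URL to `dummy_parse` and hand the result to `dump_datasets`; checking the output directory then shows whether the crawl reached the expected resources.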

Jira card: https://open-data-ed.atlassian.net/browse/OD-498

nightsh self-assigned this Feb 26, 2020

nightsh mentioned this issue Feb 26, 2020

P1 Parsing: Office Civil Rights #15

Closed

7 tasks

nightsh changed the title ~~P1 Scraping: Office Civil Rights~~ P1 Crawling: Office Civil Rights Mar 2, 2020

nightsh mentioned this issue Mar 2, 2020

Transform the collected JSON datasets to CKAN harvester data.json format #25

Closed

20 tasks

nightsh closed this as completed Mar 11, 2020