P1 Crawling: Office Civil Rights #3

Closed
6 tasks done
Daniellappv opened this issue Feb 25, 2020 · 0 comments

Daniellappv commented Feb 25, 2020

Description: Scrape metadata for https://ocrdata.ed.gov/. This page also looks like a useful starting point: https://www2.ed.gov/about/offices/list/ocr/data.html?src=rt

Acceptance criteria

  • We have a data dump with all the resource metadata we can get from the target site

Task-list:

  • Crawl the site
  • Refine the crawl to reach as many resources as possible
  • Integrate with the existing pipeline rules (provide an HTML response for the parser)
  • Do a test run with a dummy parser; it should collect datasets and dump them into JSON files
  • Push the code once it checks all the above criteria
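
The "dummy parser" step in the task list could look roughly like the sketch below. It is only an illustration, not the project's actual pipeline code: the `DummyParser` class, the `parse_and_dump` function, the `RESOURCE_EXTENSIONS` list, and the `dataset.json` file name are all hypothetical, and it uses only the Python standard library rather than whatever crawling framework the project relies on. It takes the HTML of one crawled page, collects links that look like downloadable resources, and writes them to a JSON file.

```python
import json
from html.parser import HTMLParser
from pathlib import Path
from urllib.parse import urljoin

# Assumption: file extensions that count as downloadable resources.
RESOURCE_EXTENSIONS = (".csv", ".xls", ".xlsx", ".zip", ".pdf")


class DummyParser(HTMLParser):
    """Collects links that look like data resources from one HTML page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.resources = []
        self._href = None  # absolute URL of the resource link currently open
        self._text = []    # text fragments seen inside that link

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href and href.lower().endswith(RESOURCE_EXTENSIONS):
                self._href = urljoin(self.base_url, href)
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            name = "".join(self._text).strip()
            self.resources.append({"url": self._href, "name": name})
            self._href = None


def parse_and_dump(html, source_url, out_dir, file_name="dataset.json"):
    """Parse one HTML response and dump the collected metadata as JSON."""
    parser = DummyParser(source_url)
    parser.feed(html)
    parser.close()

    dataset = {"source_url": source_url, "resources": parser.resources}

    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    (out_path / file_name).write_text(
        json.dumps(dataset, indent=2), encoding="utf-8"
    )
    return dataset
```

In a real run, the crawler would call something like `parse_and_dump(response_body, response_url, "output/")` for each page it fetches; a proper parser would replace `DummyParser` once the crawl itself is reliable.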

Jira card: https://open-data-ed.atlassian.net/browse/OD-498

@nightsh nightsh self-assigned this Feb 26, 2020
@nightsh nightsh changed the title P1 Scraping: Office Civil Rights P1 Crawling: Office Civil Rights Mar 2, 2020
@nightsh nightsh closed this as completed Mar 11, 2020

2 participants