This repository has been archived by the owner on Sep 20, 2024. It is now read-only.

P1 Parsing: Office Civil Rights #15

Closed
7 tasks done
nightsh opened this issue Feb 26, 2020 · 2 comments
Comments

@nightsh
Contributor

nightsh commented Feb 26, 2020

Subtask of #3

As we are now crawling and setting up the pipeline for the scraping sources, we want to start developing the data extraction tools as soon as possible. This task covers a single, fairly isolated part of the pipeline: extracting the data from an HTML structure.

Desired properties for the resulting datasets:

  • source URL (where it was scraped from)
  • title
  • name (usually a unique slug of title)
  • publisher
  • description
  • tags
  • date
  • person of contact (name)
  • person of contact (email)
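
For illustration, the properties above could map onto a plain Python dictionary like the one sketched here. The key names, the `slugify` helper, and every value are assumptions made for this example, not names agreed in this issue. Note that the helper only derives a slug from the title; it does not guarantee uniqueness on its own.

```python
import re
import unicodedata


def slugify(title: str) -> str:
    """Turn a title into a lowercase, hyphen-separated slug (hypothetical helper)."""
    # Fold accented characters to their closest ASCII equivalents.
    ascii_title = (
        unicodedata.normalize("NFKD", title).encode("ascii", "ignore").decode("ascii")
    )
    # Replace every run of non-alphanumeric characters with a single hyphen.
    return re.sub(r"[^a-z0-9]+", "-", ascii_title.lower()).strip("-")


# Hypothetical record: all keys and values are placeholders, not scraped data.
example_dataset = {
    "source_url": "https://example.org/some-page.html",
    "title": "Example Report: 2020 Data",
    "name": slugify("Example Report: 2020 Data"),
    "publisher": "Example Publisher",
    "description": "Placeholder description.",
    "tags": ["example"],
    "date": "2020-02-26",
    "contact_person_name": "Jane Doe",
    "contact_person_email": "jane.doe@example.org",
}
```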

List of pages to get information from:

List of "false positives" that should bear no dataset information:

Tasks:

  • using the list of pages as raw HTML input, write a script that identifies whether a page has resources and, if it does, extracts all the metadata needed to create a dataset from it
  • test the parser script and output the data in a spreadsheet format for all pages in the list
  • integrate the script into the pipeline after the above validation
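
The spreadsheet step in the second task could be as simple as writing one CSV row per parsed page. This is only a sketch: the column names in `FIELDS` and the `write_rows` helper are assumptions for illustration, and joining tags with `;` is an arbitrary choice.

```python
import csv
import io

# Assumed column names, one per desired property.
FIELDS = [
    "source_url", "title", "name", "publisher", "description",
    "tags", "date", "contact_person_name", "contact_person_email",
]


def write_rows(datasets, out) -> int:
    """Write parser results to `out` as CSV and return the number of rows written."""
    writer = csv.DictWriter(out, fieldnames=FIELDS, extrasaction="ignore")
    writer.writeheader()
    written = 0
    for dataset in datasets:
        # Pages without resources come back as None and are skipped.
        if dataset is None:
            continue
        row = dict(dataset)
        # A list of tags does not fit in one cell, so join it into a string.
        row["tags"] = ";".join(dataset.get("tags") or [])
        writer.writerow(row)
        written += 1
    return written


# Usage with an in-memory buffer; a real run would open a .csv file instead.
buffer = io.StringIO()
count = write_rows(
    [{"source_url": "https://example.org/a", "title": "A", "tags": ["x", "y"]}, None],
    buffer,
)
```

Any property missing from a result is left as an empty cell, which makes gaps easy to spot when reviewing the output by hand.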

Acceptance criteria:

  • script accepts raw HTML as an input
  • correctly identifies pages that do or do not have resources
  • produces a Python structure with the properties in the list above
  • returns None for no resources and a Python dictionary with the result otherwise
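
A minimal sketch of a parser with this contract, using only the standard library. It rests on one loud assumption: a "resource" is taken to be a link to a downloadable file with one of the extensions in `RESOURCE_EXTENSIONS`, which this issue does not define. The `parse` function, the key names, and the extension list are all hypothetical, and properties that would need page-specific rules are left as `None`.

```python
from html.parser import HTMLParser
from typing import Optional
from urllib.parse import urlparse

# Assumption: a page "has resources" if it links to a file of one of these types.
RESOURCE_EXTENSIONS = (".csv", ".xls", ".xlsx", ".pdf", ".zip")


class _PageScanner(HTMLParser):
    """Collects the page title, meta description, and resource-like links."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.description = ""
        self.resource_links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and (attributes.get("name") or "").lower() == "description":
            self.description = attributes.get("content") or ""
        elif tag == "a":
            href = attributes.get("href") or ""
            # Compare only the URL path so query strings do not hide the extension.
            if urlparse(href).path.lower().endswith(RESOURCE_EXTENSIONS):
                self.resource_links.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def parse(html: str, source_url: str = "") -> Optional[dict]:
    """Return a dataset dictionary, or None if the page has no resources."""
    scanner = _PageScanner()
    scanner.feed(html)
    scanner.close()
    if not scanner.resource_links:
        return None
    return {
        "source_url": source_url,
        "title": scanner.title.strip(),
        "name": None,  # would be derived from the title as a slug
        "publisher": None,  # needs page-specific rules
        "description": scanner.description,
        "tags": [],
        "date": None,
        "contact_person_name": None,
        "contact_person_email": None,
    }
```

For example, `parse('<a href="/about.html">About</a>')` returns `None` under this assumption, because the only link does not point at a file with a listed extension.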
@osahon-okungbowa
Contributor

Tasks are clear

@nightsh
Contributor Author

nightsh commented Feb 28, 2020

New page parsing rules:
