P1 Parsing: Office Civil Rights #15

nightsh · 2020-02-26T15:29:13Z

Subtask of #3

As we are now crawling and setting up the pipeline for the scraping sources, we want to start developing the data extraction tools as soon as possible. This task covers a single, fairly isolated part of the pipeline: extracting the data from an HTML structure.

Desired properties for the resulting datasets:

source URL (where was it scraped from)
title
name (usually a unique slug of title)
publisher
description
tags
date
person of contact (name)
person of contact (email)
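
As an illustration only, the properties above could be modelled as a typed dictionary. The field names (`source_url`, `contact_name`, and so on) and the `slugify` helper are assumptions made for this sketch and are not fixed anywhere in this issue; the slug rule is one plausible reading of "usually a unique slug of title" and does not guarantee uniqueness by itself.

```python
import re
from typing import List, Optional, TypedDict


class Dataset(TypedDict):
    """Hypothetical shape for one extracted dataset (field names are assumptions)."""

    source_url: str                 # where it was scraped from
    title: str
    name: str                       # usually a slug of the title
    publisher: Optional[str]
    description: Optional[str]
    tags: List[str]
    date: Optional[str]
    contact_name: Optional[str]     # person of contact (name)
    contact_email: Optional[str]    # person of contact (email)


def slugify(title: str) -> str:
    """Lowercase the title and collapse every non-alphanumeric run into one hyphen."""
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
```

Two pages with the same title would produce the same slug, so a real pipeline would still need a disambiguation step to keep `name` unique.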

List of pages to get information from:

List of "false positives" that should bear no dataset information:

Tasks:

using the list of pages as raw HTML input, write a script that identifies whether a page has resources and, if it does, extracts all the metadata needed to create a dataset from it
test the parser script and output the data in a spreadsheet format for all pages in the list
integrate the script into the pipeline once the validation above passes

Acceptance criteria:

script accepts raw HTML as input
correctly identifies pages that do or do not have resources
produces a Python structure with the properties in the list above
returns None for no resources and a Python dictionary with the result otherwise
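
A minimal sketch of the contract described by these criteria, using only the standard library. The function name `parse`, the extra `source_url` argument, the `resources` key, and the extension list are all assumptions for illustration; the issue does not define what counts as a resource or how the source URL reaches the parser.

```python
from html.parser import HTMLParser
from typing import Optional

# Assumption: which file extensions count as a "resource" is not defined here.
RESOURCE_EXTENSIONS = (".csv", ".xls", ".xlsx", ".zip")


class _PageCollector(HTMLParser):
    """Collects the <title> text and every <a href> found on the page."""

    def __init__(self) -> None:
        super().__init__()
        self.title = ""
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def parse(html: str, source_url: str) -> Optional[dict]:
    """Return None when the page has no resources, otherwise a dictionary."""
    collector = _PageCollector()
    collector.feed(html)
    resources = [
        link for link in collector.links
        if link.lower().endswith(RESOURCE_EXTENSIONS)
    ]
    if not resources:
        return None
    return {
        "source_url": source_url,
        "title": collector.title.strip(),
        "resources": resources,  # hypothetical key, not in the property list
    }
```

The remaining properties (publisher, description, tags, date, contact) depend on the markup of each source page, so they are left out of this sketch.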


osahon-okungbowa · 2020-02-26T15:46:02Z

Tasks are clear

nightsh · 2020-02-28T11:32:21Z

New page parsing rules:

only extract a (potential) dataset if it contains at least one data file, i.e. skip items that offer only pdf/doc files
there are pages with multiple datasets
- https://www2.ed.gov/about/offices/list/ocr/data.html?src=rt
- https://ocrdata.ed.gov/StateNationalEstimations/Estimations_2011_12
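
The first rule could be checked roughly as follows. This is a sketch under stated assumptions: the `DATA_EXTENSIONS` set is illustrative, and `keep_datasets` assumes that the page has already been split into candidate datasets, each a dictionary with a hypothetical `resources` list of URLs. How to split a page with multiple datasets is not specified here and is not attempted.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Assumption: illustrative list of data file extensions; pdf/doc are deliberately absent.
DATA_EXTENSIONS = {".csv", ".xls", ".xlsx", ".zip", ".json", ".xml"}


def extension(url: str) -> str:
    """Return the lowercased file extension of a URL path, ignoring the query string."""
    return PurePosixPath(urlparse(url).path).suffix.lower()


def has_data_file(urls) -> bool:
    """True if at least one URL points at a data file."""
    return any(extension(url) in DATA_EXTENSIONS for url in urls)


def keep_datasets(candidates):
    """Drop candidate datasets whose resources are documents only."""
    return [c for c in candidates if has_data_file(c["resources"])]
```

Parsing the URL before taking the suffix keeps query strings such as `?src=rt` from hiding the extension.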

nightsh assigned osahon-okungbowa Feb 26, 2020

nightsh mentioned this issue Mar 2, 2020

Transform the collected JSON datasets to CKAN harvester data.json format #25

Closed

20 tasks

nightsh closed this as completed Mar 11, 2020
