As we are now crawling and setting up the pipeline for the scraping sources, we want to start developing the data extraction tools as soon as possible. This task covers only a single, rather isolated aspect of the entire pipeline: extracting the data from an HTML structure.

Desired properties for the resulting datasets:

List of pages to get information from:

List of "false positives" that should bear no dataset information:
Tasks:

- Using the list of pages as raw HTML input, write a script that identifies whether a page has resources and, if it does, extracts all the metadata needed to create a dataset from it (a minimal sketch follows this list).
- Test the parser script and output the data in a spreadsheet format for all pages in the list.
- Integrate the script into the pipeline after the above validation.
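A minimal sketch of what such a parser could look like, assuming BeautifulSoup is used for parsing; the `a.resource-link` selector and the metadata fields are placeholders, since the real selectors and properties depend on the source pages and the desired-properties list above:

```python
from typing import Optional

from bs4 import BeautifulSoup


def extract_dataset_metadata(raw_html: str) -> Optional[dict]:
    """Return a dictionary of dataset metadata, or None if the page has no resources."""
    soup = BeautifulSoup(raw_html, "html.parser")

    # Placeholder marker for pages that actually expose downloadable resources;
    # the real check must come from inspecting the pages in the crawl list.
    resources = soup.select("a.resource-link")
    if not resources:
        return None

    description_tag = soup.find("meta", attrs={"name": "description"})
    return {
        # Placeholder properties; replace with the agreed list of dataset fields.
        "title": soup.title.get_text(strip=True) if soup.title else None,
        "description": description_tag.get("content") if description_tag else None,
        "resource_urls": [link.get("href") for link in resources],
        "num_resources": len(resources),
    }
```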
Acceptance criteria:
- The script accepts raw HTML as input.
- It correctly identifies pages that do and do not have resources.
- It produces a Python structure with the properties in the list above.
- It returns `None` when a page has no resources and a Python dictionary with the result otherwise (exercised in the validation sketch below).
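To cover the spreadsheet-validation task, a small harness along these lines could run the extractor over the saved pages (including the known false positives) and dump one row per page into a CSV; the directory layout and file names used here are assumptions, not part of the existing pipeline:

```python
import csv
import pathlib


def dump_parser_report(html_dir: str, out_csv: str) -> None:
    """Run extract_dataset_metadata over every saved page and write one row per page."""
    rows = []
    for path in sorted(pathlib.Path(html_dir).glob("*.html")):
        result = extract_dataset_metadata(path.read_text(encoding="utf-8"))
        if result is None:
            # Pages without resources (including the known false positives) stay in
            # the report so the identification step can be checked by hand.
            rows.append({"page": path.name, "has_resources": False})
        else:
            rows.append({"page": path.name, "has_resources": True, **result})

    # Union of all keys so rows with and without metadata share one header.
    fieldnames = sorted({key for row in rows for key in row})
    with open(out_csv, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)


dump_parser_report("pages/", "parser_report.csv")
```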