Parse Breitbart Articles #8

Open
josephpd3 opened this issue Sep 30, 2017 · 1 comment

@josephpd3

Note: If you prefer to not work with this source, please leave it to other contributors. As far as we are concerned, all media is relevant from a research perspective.

Using the WashingtonPost parser as an example, we want to create another parser for this source.
Note: As of now, we only care to grab anchor tag <a> references.

This will involve a few things:

  • You will have to define the parser in its own submodule under crawler/crawler/parsers
  • This parser will have to return a list of reference objects (dicts in Python), given a scrapy response
  • Each reference object must have the following keys:
    • 'href': the link within the anchor tag itself
    • 'text': the text or item which the anchor tag wraps
    • 'context': the cleaned text of the paragraph <p> tag enclosing the given anchor tag
  • Some sites may have various formats depending on article category or article age (see this issue). The parser will have to handle all of them. It is fine if you do not catch this at first. Sometimes older articles will only be referenced by older articles, and that is one crazy rabbit hole to try and go down in the initial stages.
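
To make the expected output concrete, here is a rough sketch of the reference-object shape described above. It is not the WashingtonPost parser and not Breitbart-specific: the names `parse_references` and `_AnchorCollector` are made up for illustration, and the only thing it assumes about the scrapy response is that it exposes the page HTML as `.text`. To stay runnable without scrapy installed, it uses the standard library's `html.parser` instead of scrapy selectors, and a real parser would probably narrow the search to the article body first.

```python
from html.parser import HTMLParser
from types import SimpleNamespace


def _clean(text):
    """Collapse runs of whitespace into single spaces and trim the ends."""
    return " ".join(text.split())


class _AnchorCollector(HTMLParser):
    """Hypothetical helper: collects <a href> tags that sit inside a <p>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.references = []   # finished reference dicts
        self._in_p = False
        self._p_text = []      # text chunks of the current paragraph
        self._p_refs = []      # references found in the current paragraph
        self._a_text = None    # text chunks of the open anchor, or None

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._close_paragraph()  # tolerate a previous unclosed <p>
            self._in_p = True
        elif tag == "a" and self._in_p:
            href = dict(attrs).get("href")
            if href:
                self._p_refs.append({"href": href, "text": ""})
                self._a_text = []

    def handle_endtag(self, tag):
        if tag == "a" and self._a_text is not None:
            self._p_refs[-1]["text"] = _clean("".join(self._a_text))
            self._a_text = None
        elif tag == "p":
            self._close_paragraph()

    def handle_data(self, data):
        if self._in_p:
            self._p_text.append(data)
            if self._a_text is not None:
                self._a_text.append(data)

    def _close_paragraph(self):
        # Attach the cleaned paragraph text to every anchor found in it.
        context = _clean("".join(self._p_text))
        for ref in self._p_refs:
            ref["context"] = context
            self.references.append(ref)
        self._in_p = False
        self._p_text, self._p_refs, self._a_text = [], [], None

    def close(self):
        super().close()
        self._close_paragraph()  # flush a paragraph left open at end of input


def parse_references(response):
    """Return a list of {'href', 'text', 'context'} dicts for a response."""
    collector = _AnchorCollector()
    collector.feed(response.text)
    collector.close()
    return collector.references


# Stand-in for a scrapy response; only the `.text` attribute is used.
sample = SimpleNamespace(
    text='<p>Read  the <a href="https://example.com/report">full\n report</a> here.</p>'
)
print(parse_references(sample))
```

Anchors outside any <p> (navigation links, for example) are skipped here, since they have no enclosing paragraph to use as 'context'. Whether that is the right call for this source is an open question.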

When submitting a PR for this, please include some sample references which you scraped from a source. We can work through cleaning it and getting it right if it comes down to it :)

@brycecf

brycecf commented Oct 2, 2017

@josephpd3 I have this implemented. I'll make a pull request later on today.
