Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

When parsing Washington Post articles, handle old format as well as new format #4

Open
josephpd3 opened this issue Sep 30, 2017 · 0 comments

Comments

@josephpd3
Copy link
Collaborator

While testing the crawler with the Washington Post parser, I noticed that some errors thrown were for not being able to find the article body in articles from the Washington Post domain. This means that a different article format was present and the parser couldn't grab the article body with the given xpath. (See Scrapy docs here for info on xpath selectors)

This article about Herman Cain, for instance, is still present in an older, wordpress-based format.

The way to go about handling this will be to extend the existing parser function to try various xpath or css selectors (see doc link above) to determine what format the article is in. In this instance, the article itself is now in a <div ...> item rather than an <article itemprop="articleBody">:

<div class="wp-column ten margin-right main-content">
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

1 participant