When parsing Washington Post articles, handle old format as well as new format #4

josephpd3 · 2017-09-30T14:13:55Z

While testing the crawler with the Washington Post parser, I noticed that some of the errors thrown came from failing to find the article body in articles from the Washington Post domain. This means a different article format was present and the parser couldn't extract the article body with the given XPath. (See the Scrapy docs here for info on XPath selectors.)

This article about Herman Cain, for instance, is still served in an older, WordPress-based format.

To handle this, extend the existing parser function to try several XPath or CSS selectors (see the doc link above) and determine which format the article is in. In this instance, the article body sits in a <div ...> element rather than an <article itemprop="articleBody">:

<div class="wp-column ten margin-right main-content">
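A minimal sketch of that fallback approach follows. The names `BODY_XPATHS` and `find_article_body` are hypothetical and not taken from the existing parser, and the exact-match XPath on the `class` attribute is an assumption based only on the `<div>` quoted above. The function expects any object with an `.xpath(query)` method that returns a list-like result, which is how a Scrapy response behaves.

```python
# Candidate XPath selectors, tried in order (hypothetical list, not from the repo).
# "new" is the format the parser already expects; "old" is the WordPress-based
# layout quoted in this issue.
BODY_XPATHS = [
    ("new", '//article[@itemprop="articleBody"]'),
    ("old", '//div[@class="wp-column ten margin-right main-content"]'),
]


def find_article_body(response, candidates=BODY_XPATHS):
    """Return (format_name, matches) for the first selector that matches.

    `response` is anything exposing `.xpath(query)` and returning a list-like
    result, e.g. a Scrapy response.
    """
    for format_name, query in candidates:
        matches = response.xpath(query)
        if matches:  # an empty result list is falsy, so move on to the next one
            return format_name, matches
    # No known format matched: fail loudly so the unknown layout gets noticed.
    url = getattr(response, "url", "<unknown url>")
    raise ValueError(f"No known article body selector matched for {url}")
```

Inside a spider callback this could be called as `format_name, body = find_article_body(response)`, with the returned format name deciding how the body text is extracted. Supporting another layout would then only mean appending a pair to the candidate list.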


josephpd3 added the help wanted label Sep 30, 2017

This was referenced Sep 30, 2017

Parse Forbes Articles #6

Open

Parse The Hill Articles #7

Open

Parse Breitbart Articles #8

Open

Parse Business Insider Articles #9

Open

Parse Associated Press Articles #10

Open

Parse Reuters Articles #11

Open

josephpd3 added the enhancement label Sep 30, 2017

This was referenced Sep 30, 2017

Parse Fox News Articles #13

Open

Parse NPR Articles #14

Open
