While testing the crawler with the Washington Post parser, I noticed that some of the errors thrown came from the parser failing to find the article body in articles from the Washington Post domain. This means those articles use a different format, so the parser couldn't extract the article body with the given XPath. (See the Scrapy docs here for info on XPath selectors.)
The way to handle this is to extend the existing parser function to try several XPath or CSS selectors (see the doc link above) and determine which format the article is in. In this instance, the article body is in a `<div ...>` element rather than an `<article itemprop="articleBody">` element:
`<div class="wp-column ten margin-right main-content">`
This article about Herman Cain, for instance, is still present in an older, WordPress-based format.
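The fallback idea described in this issue could look roughly like the sketch below. It is not the project's actual parser: the function name `extract_article_body`, the list `ARTICLE_BODY_XPATHS`, and the exact XPath expressions are assumptions made for illustration, built only from the two element shapes quoted above. It also assumes `response` behaves like a Scrapy response, where `response.xpath(...)` returns a selector list with `.getall()` (older Scrapy versions use `.extract()` instead).

```python
# Hypothetical selector list, tried in order. The first entry targets the
# itemprop-based format, the second the div-based format quoted in this issue.
ARTICLE_BODY_XPATHS = [
    '//article[@itemprop="articleBody"]',
    '//div[@class="wp-column ten margin-right main-content"]',
]


def extract_article_body(response, xpaths=ARTICLE_BODY_XPATHS):
    """Return the text under the first matching container, or None."""
    for xpath in xpaths:
        # Collect every text node below the candidate container.
        texts = response.xpath(xpath + "//text()").getall()
        # Drop whitespace-only nodes and join the rest into one string.
        body = " ".join(t.strip() for t in texts if t.strip())
        if body:
            return body
    # No known format matched; the caller can log the URL for follow-up.
    return None
```

Returning `None` instead of raising lets the spider record which URLs still fall outside the known formats, so further selectors can be appended to the list as new layouts turn up.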