Merge pull request #157 from Lukas0907/dev-docs
Rework documentation
Showing 6 changed files with 226 additions and 54 deletions.
@@ -0,0 +1,153 @@
.. _Development:

Writing a custom spider
=======================

Feeds already supports a number of websites (see :ref:`Supported Websites`),
but adding support for a new website doesn't take much time.

A quick example
---------------

Writing a spider is easy! Take, for example, the spider for
:ref:`spider_indiehackers.com`:

.. code-block:: python

   import scrapy

   from feeds.loaders import FeedEntryItemLoader
   from feeds.spiders import FeedsSpider


   class IndieHackersComSpider(FeedsSpider):
       name = "indiehackers.com"
       allowed_domains = [name]
       start_urls = ["https://www.indiehackers.com/interviews/page/1"]
       _title = "Indie Hackers"

       def parse(self, response):
           interviews = response.css(
               ".interview__link::attr(href), .interview__date::text"
           ).extract()
           for link, date in zip(interviews[::2], interviews[1::2]):
               yield scrapy.Request(
                   response.urljoin(link),
                   self._parse_interview,
                   meta={"updated": date.strip()},
               )

       def _parse_interview(self, response):
           remove_elems = [
               ".shareable-quote",
               ".share-bar",
               # Remove the last two h2s and all paragraphs below.
               ".interview-body > h2:last-of-type ~ p",
               ".interview-body > h2:last-of-type",
               ".interview-body > h2:last-of-type ~ p",
               ".interview-body > h2:last-of-type",
           ]
           il = FeedEntryItemLoader(
               response=response,
               base_url="https://{}".format(self.name),
               remove_elems=remove_elems,
           )
           il.add_value("link", response.url)
           il.add_css("title", "h1::text")
           il.add_css("author_name", "header .user-link__name::text")
           il.add_css("content_html", ".interview-body")
           il.add_value("updated", response.meta["updated"])
           return il.load_item()

First, the URL from the ``start_urls`` list is downloaded and the response is
passed to ``parse()``. There we extract the article links that should be
scraped and yield ``scrapy.Request`` objects in the for loop. The callback
method ``_parse_interview()`` is executed once the download has finished. It
extracts the article from the response HTML document and returns an item that
is placed into the feed automatically.

It's enough to place the spider in the ``spiders`` folder. It doesn't have to
be registered anywhere for Feeds to pick it up.

Reusing an existing feed
------------------------

Websites often provide a feed, but it's not full text. In such cases you
usually just want to augment the original feed with the full article text.

Generic spider
~~~~~~~~~~~~~~

For a lot of feeds (especially those from blogs) it is sufficient to use the
:ref:`spider_generic` spider, which can extract content from any website using
heuristics (see :ref:`spider_generic` for details).

Note that a lot of feeds (e.g. those generated by WordPress) actually contain
the full text, but your feed reader chooses to show a summary instead. In such
cases you can also use the :ref:`spider_generic` spider and add your feed URL
to the ``fulltext_urls`` key in the config. This creates a full text feed from
an existing feed without having to rely on heuristics. A minimal configuration
sketch is shown below.

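The following snippet is only a sketch; the section and key names follow the
:ref:`spider_generic` documentation, and the URL is a placeholder:

.. code-block:: ini

   # Hypothetical config excerpt; replace the URL with your own feed.
   [feeds]
   spiders =
       generic

   [generic]
   fulltext_urls =
       https://example.org/feed.xml
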
Custom extraction
~~~~~~~~~~~~~~~~~

These spiders take an existing RSS feed and inline the article content while
cleaning it up (removing share buttons, etc.); a rough sketch of the pattern
follows the list below:

* :ref:`spider_addendum.org`
* :ref:`spider_arstechnica.com`
* :ref:`spider_derstandard.at`
* :ref:`spider_gnucash.org`
* :ref:`spider_lwn.net`
* :ref:`spider_openwrt.org`
* :ref:`spider_orf.at`

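The pattern behind these spiders is roughly: parse the upstream RSS feed,
request each item's link and load the full article with a
``FeedEntryItemLoader`` that strips unwanted elements. The spider name, feed
URL and CSS selectors below are made up:

.. code-block:: python

   import scrapy

   from feeds.loaders import FeedEntryItemLoader
   from feeds.spiders import FeedsSpider


   class ExampleComSpider(FeedsSpider):
       # Hypothetical spider; name, URLs and selectors are placeholders.
       name = "example.com"
       start_urls = ["https://example.com/feed.xml"]
       _title = "Example"

       def parse(self, response):
           # Follow every item link in the upstream RSS feed.
           for url in response.xpath("//item/link/text()").extract():
               yield scrapy.Request(url, self._parse_article)

       def _parse_article(self, response):
           # Inline the full article and drop share buttons etc.
           il = FeedEntryItemLoader(
               response=response,
               remove_elems=[".social-share", "aside"],
           )
           il.add_value("link", response.url)
           il.add_css("title", "h1::text")
           il.add_css("content_html", "article")
           return il.load_item()
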
Paywalled content
~~~~~~~~~~~~~~~~~

If your website has a feed but some or all articles are behind a paywall or
require a login to read, take a look at the following spiders; a rough login
sketch follows the list:

* :ref:`spider_lwn.net`
* :ref:`spider_nachrichten_at`
* :ref:`spider_uebermedien.de`

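The usual approach is to log in first (e.g. with ``scrapy.FormRequest``) and
only start requesting articles afterwards. This is not taken from any of the
spiders above; the URLs, form fields and credentials are placeholders, and in
practice you would read credentials from the config instead of hard-coding
them:

.. code-block:: python

   import scrapy

   from feeds.spiders import FeedsSpider


   class PaywalledExampleSpider(FeedsSpider):
       # Hypothetical spider; URLs, form fields and credentials are placeholders.
       name = "paywalled.example.com"

       def start_requests(self):
           # Log in first; Scrapy keeps the session cookie for later requests.
           yield scrapy.FormRequest(
               "https://paywalled.example.com/login",
               formdata={"username": "user@example.com", "password": "secret"},
               callback=self._after_login,
           )

       def _after_login(self, response):
           # From here on, scraping proceeds as in the other examples.
           yield scrapy.Request("https://paywalled.example.com/news", self.parse)

       def parse(self, response):
           # Extract article links and content as usual.
           ...
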
Creating a feed from scratch
----------------------------

Some websites don't offer any feed at all. In such cases we have to find an
efficient way to detect new content and extract it.

Utilizing an API
~~~~~~~~~~~~~~~~

Some websites provide a REST API that we can use to fetch the content (a rough
sketch of that pattern follows the list below):

* :ref:`spider_facebook.com`
* :ref:`spider_falter.at`
* :ref:`spider_oe1.orf.at`
* :ref:`spider_tvthek.orf.at`
* :ref:`spider_vice.com`

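A typical shape for such a spider is to request a JSON endpoint and build feed
items directly from the response. The endpoint and JSON fields below are made
up:

.. code-block:: python

   import json

   from feeds.loaders import FeedEntryItemLoader
   from feeds.spiders import FeedsSpider


   class ApiExampleSpider(FeedsSpider):
       # Hypothetical spider; the API endpoint and JSON fields are placeholders.
       name = "api.example.com"
       start_urls = ["https://api.example.com/v1/articles?limit=20"]
       _title = "Example API"

       def parse(self, response):
           for article in json.loads(response.text)["articles"]:
               il = FeedEntryItemLoader()
               il.add_value("title", article["title"])
               il.add_value("link", article["url"])
               il.add_value("updated", article["published_at"])
               il.add_value("content_html", article["body_html"])
               yield il.load_item()
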
Utilizing the sitemap
~~~~~~~~~~~~~~~~~~~~~

Others provide a sitemap_ which we can parse (a rough sketch follows the list
below):

* :ref:`spider_profil.at`

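A sitemap is an XML file that lists a site's URLs, so a few XPath expressions
are enough to walk it. The URL and selectors below are placeholders:

.. code-block:: python

   import scrapy

   from feeds.spiders import FeedsSpider


   class SitemapExampleSpider(FeedsSpider):
       # Hypothetical spider; the sitemap URL and selectors are placeholders.
       name = "sitemap.example.com"
       start_urls = ["https://sitemap.example.com/sitemap.xml"]

       def parse(self, response):
           # Sitemaps use the sitemaps.org XML namespace.
           response.selector.register_namespace(
               "sm", "http://www.sitemaps.org/schemas/sitemap/0.9"
           )
           for url in response.xpath("//sm:url/sm:loc/text()").extract():
               yield scrapy.Request(url, self._parse_article)

       def _parse_article(self, response):
           # Extract the article content as in the other examples.
           ...

Scrapy also ships a ``SitemapSpider`` base class that handles the sitemap
parsing for you.
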
Custom extraction
~~~~~~~~~~~~~~~~~

The last resort is to find a page that lists the newest articles and start
scraping from there, as the :ref:`spider_indiehackers.com` example at the top
of this page does.

* :ref:`spider_ak.ciando.com`
* :ref:`spider_atv.at`
* :ref:`spider_biblioweb.at`
* :ref:`spider_cbird.at`
* :ref:`spider_help.gv.at`
* :ref:`spider_indiehackers.com`
* :ref:`spider_puls4.com`
* :ref:`spider_usenix.org`
* :ref:`spider_verbraucherrecht.at`
* :ref:`spider_wienerlinien.at`
* :ref:`spider_zeit.diebin.at`

For paywalled content, take a look at:

* :ref:`spider_falter.at`
* :ref:`spider_konsument.at`

.. _sitemap: https://en.wikipedia.org/wiki/Site_map
@@ -1,4 +1,4 @@
-.. _zeit.diebin.at:
+.. _spider_zeit.diebin.at:
 
 zeit.diebin.at
 -------------------