Merge pull request #176 from Lukas0907/docs
Document spider API
Lukas0907 committed May 16, 2020
2 parents 5e0b34c + 3338025 commit efa40f3
Showing 4 changed files with 300 additions and 32 deletions.
175 changes: 175 additions & 0 deletions docs/api.rst
.. _API:

API for Spiders
===============
If you want to support a custom website, take a look at
:ref:`Development`.

Spider class
------------
A spider is a class in a module (Python file) in ``feeds.spiders`` that is a
subclass of ``feeds.spiders.FeedsSpider``, ``feeds.spiders.FeedsCrawlSpider``
or ``feeds.spiders.FeedsXMLFeedSpider``.

* ``FeedsXMLFeedSpider`` is used if the spider parses an XML document as its
  starting point. This is useful if the spider should start from an existing
  XML feed or a sitemap (see the sketch below).
* ``FeedsCrawlSpider`` is used if the spider should crawl the site by
  following links found on its pages. Patterns can be given to limit which
  links are followed.
* ``FeedsSpider`` is used in all other cases and is the most common choice.
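
The following is a minimal sketch of a ``FeedsXMLFeedSpider`` that starts from
an existing RSS feed. It assumes that ``FeedsXMLFeedSpider`` behaves like
Scrapy's ``XMLFeedSpider`` (``itertag`` selects the nodes handed to
``parse_node()``, described below); the feed URL and the XPath expressions are
made up, and the ``feeds.loaders`` import path is an assumption:

.. code-block:: python

    from feeds.loaders import FeedEntryItemLoader
    from feeds.spiders import FeedsXMLFeedSpider


    class ExampleComSpider(FeedsXMLFeedSpider):
        name = "example.com"
        start_urls = ["https://www.example.com/rss.xml"]
        feed_title = "Example Website"

        # Inherited from Scrapy's XMLFeedSpider: every <item> node of the
        # feed is passed to parse_node() one by one.
        itertag = "item"

        def parse_node(self, response, node):
            il = FeedEntryItemLoader(response=response)
            il.add_value("link", node.xpath("link/text()").extract_first())
            il.add_value("title", node.xpath("title/text()").extract_first())
            il.add_value("updated", node.xpath("pubDate/text()").extract_first())
            return il.load_item()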

Class variables
^^^^^^^^^^^^^^^
* ``name``: The name of the spider (**mandatory**).
* ``start_urls``: A list of URLs to start from (used if the
  ``start_requests(self)`` method is not overridden).
* ``feed_title``: Title of the feed.
* ``feed_subtitle``: Subtitle of the feed.
* ``feed_link``
* ``author_name``: Author of the feed.
* ``feed_icon``: URL of a site favicon.
* ``feed_logo``: URL of a site logo.

Methods
^^^^^^^

* ``start_requests(self)``: If the start request is more complicated than a
  simple ``GET`` to the URL(s) in the ``start_urls`` list, this method can be
  overridden (see the sketch after this list). It is expected to yield or
  return ``scrapy.Request`` objects. Please note that this method can *only*
  emit ``Request`` objects.
* ``parse(self, response)``: After a URL from ``start_urls`` has been
scraped, the ``parse()`` method is called and the response is given as an
  argument. It is also the default callback method for new
``scrapy.Request`` objects.
* ``parse_node(self, response, node)``: A ``FeedsXMLFeedSpider`` calls
``parse_node()`` instead of ``parse()`` for every node in the XML document
returned by the URL in ``start_urls``.
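
As a sketch of the first two methods (the URL and the extra header are made
up; since no explicit callback is given, Scrapy calls ``parse()`` by default):

.. code-block:: python

    import scrapy

    from feeds.spiders import FeedsSpider


    class ExampleComSpider(FeedsSpider):
        name = "example.com"
        feed_title = "Example Website"

        def start_requests(self):
            # The site needs an extra header, so a plain entry in start_urls
            # is not enough and the request is built manually.
            yield scrapy.Request(
                "https://www.example.com/articles",
                headers={"X-Requested-With": "XMLHttpRequest"},
            )

        def parse(self, response):
            # The overview page is handled here; see the full example in the
            # development chapter for how individual articles are parsed.
            ...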


FeedEntryItemLoader
-------------------
A spider uses a ``FeedEntryItemLoader`` object to extract content from a
response. The following fields are accepted and can be added to a item loader
object:

* ``link``
* ``title``
* ``author_name``
* ``author_email``
* ``content_html``
* ``updated``
* ``category``
* ``path``
* ``enclosure_iri``
* ``enclosure_type``

A value can be added to an item loader with the ``add_value()``, ``add_css()``
or ``add_xpath()`` methods, as in the following example:

.. code-block:: python

    il = FeedEntryItemLoader(response=response)
    il.add_value("link", response.url)
    il.add_css("title", "h1::text")
    il.add_css("author_name", "header .user-link__name::text")
    il.add_css("content_html", ".interview-body")
    il.add_css("updated", ".date::text")
    return il.load_item()

Only the ``link`` field is required; all the other fields can be empty, but it
is usually advisable to fill in as many fields as the original site provides.

If the ``updated`` field is not provided, the date and time of extraction is
used. If caching is enabled, the date and time when the item was first seen is
cached and reused on subsequent runs.

Input processing
----------------
Automatic rules are applied to fields depending on their type.

Default input rules
^^^^^^^^^^^^^^^^^^^

These rules are usually applied to every field.

#. Empty strings and ``None`` are skipped.
#. The content is stripped of leading and trailing whitespace.
#. The content is HTML-unescaped twice, i.e. a double-escaped entity like
   ``&amp;amp;`` is converted to ``&``.

``title``
^^^^^^^^^

#. The default input rules apply.
#. One title: "<title 1>"
#. Two titles: "<title 1>: <title 2>"
#. Three or more titles: "<title 1>: <title 2> - <title 3> - <title n>"
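
For example (a sketch):

.. code-block:: python

    il = FeedEntryItemLoader(response=response)
    il.add_value("title", "Interview")
    il.add_value("title", "Jane Doe")
    # The resulting feed item title is "Interview: Jane Doe".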

``updated``
^^^^^^^^^^^

#. Empty strings and ``None`` are skipped.
#. Unless the date is already a ``datetime`` object, it is parsed using
``dateutil.parser.parse()`` (with the year expected to be first, and the
day *not* expected to be first). If ``dateutil`` can't parse it because
it's a human readable string, ``dateparser`` is used. ``dayfirst``
(default ``False``), ``yearfirst`` (default ``True``) and ``ignoretz``
(default ``False``) can be set in the ``FeedEntryItemLoader``.
#. If the ``datetime`` object is not already timezone aware, the timezone
specified in the ``FeedEntryItemLoader`` is set.
#. The first ``datetime`` object is used.
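
For example, for a site that prints dates as ``02.01.2020`` (day first)
without timezone information, the loader could be configured like this (a
sketch; the CSS selector is made up and the timezone is assumed to be given as
a tz database name):

.. code-block:: python

    il = FeedEntryItemLoader(
        response=response,
        timezone="Europe/Vienna",
        dayfirst=True,
    )
    il.add_css("updated", ".article-date::text")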

``author_name``
^^^^^^^^^^^^^^^

#. The default input rules apply.
#. Multiple author names are joined with ", " (comma and space) as a
separator.

``path``
^^^^^^^^

#. The default input rules apply.
#. Multiple paths are joined with ``os.sep`` (e.g. ``/``) as a separator.

``content_html``
^^^^^^^^^^^^^^^^

#. Empty strings and ``None`` are skipped.
#. ``replace_regex`` in the ``FeedEntryItemLoader`` is a dict with
``pattern`` as a key and ``repl`` as a value. ``pattern`` and ``repl``
are used as parameters for ``re.sub()``. ``pattern`` can be a string or
a pattern object, ``repl`` a string or a function.
#. ``convert_footnotes`` in the ``FeedEntryItemLoader`` is a list of CSS
   selectors which select footnotes or otherwise hidden text. Each selected
   element is replaced with a ``<small>`` element that contains the text of
   the respective footnote in brackets.
#. ``pullup_elems`` in the ``FeedEntryItemLoader`` is a dict with a CSS
selector as a key and a distance as a value. A parent that is a given
distance away from the selected element is replaced with the selected
   element. E.g. a distance of 1 means that the child replaces its parent.
#. ``replace_elems`` in the ``FeedEntryItemLoader`` is a dict with a selector
   as a key and an HTML fragment (a string) as a value. The selected element
   is replaced with that fragment.
#. ``remove_elems`` in the ``FeedEntryItemLoader`` is a list with CSS
selectors of elements that should be removed.
#. ``remove_elems_xpath`` in the ``FeedEntryItemLoader`` is a list with XPath
queries of elements that should be removed.
#. ``change_attribs`` in the ``FeedEntryItemLoader`` is a dict with a CSS
   selector as a key and, as a value, a dict that describes how to change the
   attributes of the selected element: the old attribute name is the key and
   the new attribute name is the value. If the value is ``None``, the
   attribute is removed.
#. ``change_tags`` in the ``FeedEntryItemLoader`` is a dict with a CSS
selector as a key and a new tag name as a value. The tag name of the
selected element is changed to the new tag name.
#. Attributes ``class``, ``id`` and ones that start with ``data-`` are
removed.
#. Iframes are converted to a ``<div>`` that contains a link to the source of
the iframe.
#. Scripts, JavaScript, comments, styles and inline styles are removed.
#. The HTML tree is flattened: elements which have no text and are not
   supposed to be empty are removed. An element is replaced with its child if
   it has exactly one child and the child has the same tag.
#. References in tags like ``<a>`` and ``<img>`` are made absolute.
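
The following sketch shows how several of these options can be combined when
constructing the loader; the site, selectors and patterns are made up for
illustration:

.. code-block:: python

    import re

    il = FeedEntryItemLoader(
        response=response,
        base_url="https://www.example.com",
        remove_elems=[".ad", ".newsletter-box"],
        remove_elems_xpath=['//div[contains(@class, "related")]'],
        replace_regex={re.compile(r"\s*\(Advertisement\)"): ""},
        change_tags={".lead": "strong"},
        pullup_elems={".image-wrapper img": 1},
    )
    il.add_css("content_html", ".article-body")
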
143 changes: 116 additions & 27 deletions docs/development.rst
.. _Development:

Supporting a new Website
========================
Feeds already supports a number of websites (see :ref:`Supported Websites`) but
adding support for a new website doesn't take too much time. All you need to do
is write a so-called spider. A spider is a Python class that is used by Feeds
to extract content from a website.

The feed generation pipeline looks like this:

#. A spider extracts the content (e.g. an article) that should be part of
the feed from a website. The spider also tells Feeds how the content
should be cleaned up, e.g. which HTML elements should be removed.
#. Feeds takes the content, cleans it up with the hints from the spider and
some generic cleanup rules (e.g. ``<script>`` tags are always removed).
#. Feeds writes an Atom feed for that site with the cleaned content to the
file system.

A quick example
---------------
Writing a spider is easy! For simple websites it can be done in only about 30
lines of code.

Consider this example for a fictional website that hosts articles. When a new
article is published, a link to it is added to an overview page. The idea is
now to use that URL as a starting point for the spider and let the spider
extract all the URLs to the articles. In the next step, the spider visits
every article, extracts the article text and meta information (time, author)
and creates a feed item out of it.

The following code shows what such a spider could look like for our example
website:

.. code-block:: python

    import scrapy

    from feeds.loaders import FeedEntryItemLoader
    from feeds.spiders import FeedsSpider


    class ExampleComSpider(FeedsSpider):
        name = "example.com"
        start_urls = ["https://www.example.com/articles"]
        feed_title = "Example Website"

        def parse(self, response):
            article_links = response.css(".article__link::attr(href)").extract()
            for link in article_links:
                yield scrapy.Request(response.urljoin(link), self._parse_article)

        def _parse_article(self, response):
            remove_elems = [".shareable-quote", ".share-bar"]
            il = FeedEntryItemLoader(
                response=response,
                base_url="https://{}".format(self.name),
                remove_elems=remove_elems,
            )
            il.add_value("link", response.url)
            il.add_css("title", "h1::text")
            il.add_css("author_name", "header .user-link__name::text")
            il.add_css("content_html", ".article-body")
            il.add_css("updated", ".article-date::text")
            return il.load_item()

First, the URL from the ``start_urls`` list is downloaded and the response is
given to ``parse()``. From there we extract the article links that should be
scraped and yield ``scrapy.Request`` objects from the for loop. The callback
method ``_parse_article()`` is executed once the download has finished. It
extracts the article from the response HTML document and returns an item that
will be placed into the feed automatically.

It's enough to place the spider in the ``spiders`` folder. It doesn't have to
be registered anywhere for Feeds to pick it up. Now you can run it::

    $ feeds crawl example.com

The resulting feed can be found in ``output/example.com/feed.xml``.

Reusing an existing feed
------------------------
For paywalled content, take a look at:
* :ref:`spider_falter.at`
* :ref:`spider_konsument.at`

Extraction rules
----------------
A great feed transports all the information from the original site but without
the clutter. The reader should never have to leave their feed reader and go to
the original site. The following rules help to reach that goal.

Unwanted content
~~~~~~~~~~~~~~~~
Advertisements, share buttons/links, navigation elements and everything else
that is not part of the content are removed. The output should be similar to
what Firefox Reader View (Readability) outputs, but more polished.

Images
~~~~~~
The HTML tags ``<figure>`` and ``<figcaption>`` are used for figures (if
possible).
Example:

.. code-block:: html

    <figure>
      <div><img src="https://example.com/img.jpg"></div>
      <figcaption>A very interesting image.</figcaption>
    </figure>

Credits for images are removed. Images are included in the highest resolution
available.

Depaginate
~~~~~~~~~~
If content is split in multiple pages, all pages are scraped.

Iframes
~~~~~~~
Iframes are removed if they are unnecessary; otherwise they are left untouched
and are automatically replaced with a link to their source.

Updated field
~~~~~~~~~~~~~
Every feed item has an updated field. If the spider cannot provide such a field
for an item because the original site doesn't expose that information, Feeds
will automatically use the timestamp when it saw the link of the item for the
first time.

Not embeddable content
~~~~~~~~~~~~~~~~~~~~~~
Sometimes external content like videos cannot be included in the feed because
it needs JavaScript. In such cases the container of the external video is
replaced with a note that says that the content is only available on the
original site.

Regular expressions
~~~~~~~~~~~~~~~~~~~
Regular expressions are only used to replace content if using CSS selectors
with ``replace_elems`` is not possible.

Categories
~~~~~~~~~~
A feed item has categories taken from its original feed or from the site.

Headings
~~~~~~~~
``<h*>`` tags are used for headings (i.e. not generic tags like ``<p>`` or
``<div>``). Headings start with ``<h2>``. The title of the content is not part
of the content and is removed.

Author name(s)
~~~~~~~~~~~~~~
The names of all authors are added to the ``author_name`` field. The names are
not part of the content and are removed.

.. _sitemap: https://en.wikipedia.org/wiki/Site_map
11 changes: 7 additions & 4 deletions docs/docker.rst
Docker
==========

If you prefer to run Feeds in a docker container, you can use the official
`PyFeeds image <https://hub.docker.com/r/pyfeeds/pyfeeds/>`_.

A ``docker-compose.yaml`` could look like this:

name: pyfeeds-output
It mounts the ``config`` folder next to the ``docker-compose.yaml`` and uses
the contained ``feeds.cfg`` as config for Feeds. The feeds are stored in a
volume which could be picked up by a webserver:

.. code-block:: yaml
external: true
name: pyfeeds-output

Now any other container in the same docker network (e.g. a ttrss server) could
access the feeds (e.g. http://pyfeeds-server/theoatmeal.com/feed.atom). Add a
port mapping in case you want to allow access from outside the container's
docker network.
3 changes: 2 additions & 1 deletion docs/index.rst
Feeds provides DIY Atom feeds in times of social media and paywall.
quickstart
configure
spiders
development
docker
api
contribute
license

