Merge pull request #176 from Lukas0907/docs
Document spider API
Lukas0907 committed May 16, 2020
2 parents 5e0b34c + 3338025 commit efa40f3
Showing 4 changed files with 300 additions and 32 deletions.
175 changes: 175 additions & 0 deletions docs/api.rst
.. _API:

API for Spiders
===============
If you want to support a custom website, take a look at
:ref:`Development`.

Spider class
------------
A spider is a class in a module (Python file) in ``feeds.spiders`` that is a
subclass of ``feeds.spiders.FeedsSpider``, ``feeds.spiders.FeedsCrawlSpider``
or ``feeds.spiders.FeedsXMLFeedSpider``.

* ``FeedsXMLFeedSpider`` is used if the spider parses an XML document as its
  starting point. This is useful if the spider should start from an existing
  XML feed or a sitemap (see the sketch below).
* ``FeedsCrawlSpider`` is used if the spider should crawl the site by
  following links found on its pages. Patterns can be given to limit which
  links are followed.
* ``FeedsSpider`` is used in all other cases and is the most common choice.
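
The following is a minimal sketch of a ``FeedsXMLFeedSpider`` that starts from
an existing RSS feed. It assumes that ``FeedsXMLFeedSpider`` behaves like
Scrapy's ``XMLFeedSpider`` (``itertag`` selects the nodes handed to
``parse_node()``, described below); the feed URL and the XPath expressions are
made up, and the ``feeds.loaders`` import path is an assumption:

.. code-block:: python

    from feeds.loaders import FeedEntryItemLoader
    from feeds.spiders import FeedsXMLFeedSpider


    class ExampleComSpider(FeedsXMLFeedSpider):
        name = "example.com"
        start_urls = ["https://www.example.com/rss.xml"]
        feed_title = "Example Website"

        # Inherited from Scrapy's XMLFeedSpider: every <item> node of the
        # feed is passed to parse_node() one by one.
        itertag = "item"

        def parse_node(self, response, node):
            il = FeedEntryItemLoader(response=response)
            il.add_value("link", node.xpath("link/text()").extract_first())
            il.add_value("title", node.xpath("title/text()").extract_first())
            il.add_value("updated", node.xpath("pubDate/text()").extract_first())
            return il.load_item()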

Class variables
^^^^^^^^^^^^^^^
* ``name``: The name of the spider (**mandatory**).
* ``start_urls``: A list of URLs to start from (used if the
  ``start_requests(self)`` method is not overridden).
* ``feed_title``: Title of the feed.
* ``feed_subtitle``: Subtitle of the feed.
* ``feed_link``
* ``author_name``: Author of the feed.
* ``feed_icon``: URL of a site favicon.
* ``feed_logo``: URL of a site logo.

Methods
^^^^^^^

* ``start_requests(self)``: If the start request is more complicated than a
  simple ``GET`` to the URL(s) in the ``start_urls`` list, this method can be
  overridden (see the sketch after this list). It is expected to yield or
  return ``scrapy.Request`` objects. Please note that this method can *only*
  emit ``Request`` objects.
* ``parse(self, response)``: After a URL from ``start_urls`` has been
scraped, the ``parse()`` method is called and the response is given as an
  argument. It is also the default callback method for new
``scrapy.Request`` objects.
* ``parse_node(self, response, node)``: A ``FeedsXMLFeedSpider`` calls
``parse_node()`` instead of ``parse()`` for every node in the XML document
returned by the URL in ``start_urls``.
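
As a sketch of the first two methods (the URL and the extra header are made
up; since no explicit callback is given, Scrapy calls ``parse()`` by default):

.. code-block:: python

    import scrapy

    from feeds.spiders import FeedsSpider


    class ExampleComSpider(FeedsSpider):
        name = "example.com"
        feed_title = "Example Website"

        def start_requests(self):
            # The site needs an extra header, so a plain entry in start_urls
            # is not enough and the request is built manually.
            yield scrapy.Request(
                "https://www.example.com/articles",
                headers={"X-Requested-With": "XMLHttpRequest"},
            )

        def parse(self, response):
            # The overview page is handled here; see the full example in the
            # development chapter for how individual articles are parsed.
            ...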


FeedEntryItemLoader
-------------------
A spider uses a ``FeedEntryItemLoader`` object to extract content from a
response. The following fields are accepted and can be added to a item loader
object:

* ``link``
* ``title``
* ``author_name``
* ``author_email``
* ``content_html``
* ``updated``
* ``category``
* ``path``
* ``enclosure_iri``
* ``enclosure_type``

A value can be added to an item loader with the ``add_value()``, ``add_css()``
or ``add_xpath()`` methods, as in the following example:

.. code-block:: python

    il = FeedEntryItemLoader(response=response)
    il.add_value("link", response.url)
    il.add_css("title", "h1::text")
    il.add_css("author_name", "header .user-link__name::text")
    il.add_css("content_html", ".interview-body")
    il.add_css("updated", ".date::text")
    return il.load_item()

Only the ``link`` field is required; all the other fields can be empty, but it
is usually advisable to fill in as many fields as the original site provides.

If the ``updated`` field is not provided, the date and time of extraction is
used. If caching is enabled, the date and time when the item was first seen is
cached and reused on subsequent runs.

Input processing
----------------
Automatic rules are applied to fields depending on their type.

Default input rules
^^^^^^^^^^^^^^^^^^^

These rules are usually applied to every field.

#. Empty strings and ``None`` are skipped.
#. The content is stripped of leading and trailing whitespace.
#. The content is HTML-unescaped twice, i.e. a double-escaped entity like
   ``&amp;amp;`` is converted to ``&``.

``title``
^^^^^^^^^

#. The default input rules apply.
#. One title: "<title 1>"
#. Two titles: "<title 1>: <title 2>"
#. Three or more titles: "<title 1>: <title 2> - <title 3> - <title n>"
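
For example (a sketch):

.. code-block:: python

    il = FeedEntryItemLoader(response=response)
    il.add_value("title", "Interview")
    il.add_value("title", "Jane Doe")
    # The resulting feed item title is "Interview: Jane Doe".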

``updated``
^^^^^^^^^^^

#. Empty strings and ``None`` are skipped.
#. Unless the date is already a ``datetime`` object, it is parsed using
``dateutil.parser.parse()`` (with the year expected to be first, and the
day *not* expected to be first). If ``dateutil`` can't parse it because
it's a human readable string, ``dateparser`` is used. ``dayfirst``
(default ``False``), ``yearfirst`` (default ``True``) and ``ignoretz``
(default ``False``) can be set in the ``FeedEntryItemLoader``.
#. If the ``datetime`` object is not already timezone aware, the timezone
specified in the ``FeedEntryItemLoader`` is set.
#. The first ``datetime`` object is used.
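
For example, for a site that prints dates as ``02.01.2020`` (day first)
without timezone information, the loader could be configured like this (a
sketch; the CSS selector is made up and the timezone is assumed to be given as
a tz database name):

.. code-block:: python

    il = FeedEntryItemLoader(
        response=response,
        timezone="Europe/Vienna",
        dayfirst=True,
    )
    il.add_css("updated", ".article-date::text")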

``author_name``
^^^^^^^^^^^^^^^

#. The default input rules apply.
#. Multiple author names are joined with ", " (comma and space) as a
separator.

``path``
^^^^^^^^

#. The default input rules apply.
#. Multiple paths are joined with ``os.sep`` (e.g. ``/``) as a separator.

``content_html``
^^^^^^^^^^^^^^^^

#. Empty strings and ``None`` are skipped.
#. ``replace_regex`` in the ``FeedEntryItemLoader`` is a dict with
``pattern`` as a key and ``repl`` as a value. ``pattern`` and ``repl``
are used as parameters for ``re.sub()``. ``pattern`` can be a string or
a pattern object, ``repl`` a string or a function.
#. ``convert_footnotes`` in the ``FeedEntryItemLoader`` is a list of CSS
   selectors which select footnotes or otherwise hidden text. Each selected
   element is replaced with a ``<small>`` element that contains the text of
   the respective footnote in brackets.
#. ``pullup_elems`` in the ``FeedEntryItemLoader`` is a dict with a CSS
selector as a key and a distance as a value. A parent that is a given
distance away from the selected element is replaced with the selected
   element. E.g. a distance of 1 means that the child replaces its parent.
#. ``replace_elems`` in the ``FeedEntryItemLoader`` is a dict with a selector
   as a key and an HTML fragment (a string) as a value. The selected element
   is replaced with that fragment.
#. ``remove_elems`` in the ``FeedEntryItemLoader`` is a list with CSS
selectors of elements that should be removed.
#. ``remove_elems_xpath`` in the ``FeedEntryItemLoader`` is a list with XPath
queries of elements that should be removed.
#. ``change_attribs`` in the ``FeedEntryItemLoader`` is a dict with a CSS
   selector as a key and, as a value, a dict that describes how to change the
   attributes of the selected element: the old attribute name is the key and
   the new attribute name is the value. If the value is ``None``, the
   attribute is removed.
#. ``change_tags`` in the ``FeedEntryItemLoader`` is a dict with a CSS
selector as a key and a new tag name as a value. The tag name of the
selected element is changed to the new tag name.
#. Attributes ``class``, ``id`` and ones that start with ``data-`` are
removed.
#. Iframes are converted to a ``<div>`` that contains a link to the source of
the iframe.
#. Scripts, JavaScript, comments, styles and inline styles are removed.
#. The HTML tree is flattened: elements which have no text and are not
   supposed to be empty are removed. An element is replaced with its child if
   it has exactly one child and the child has the same tag.
#. References in tags like ``<a>`` and ``<img>`` are made absolute.
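
The following sketch shows how several of these options can be combined when
constructing the loader; the site, selectors and patterns are made up for
illustration:

.. code-block:: python

    import re

    il = FeedEntryItemLoader(
        response=response,
        base_url="https://www.example.com",
        remove_elems=[".ad", ".newsletter-box"],
        remove_elems_xpath=['//div[contains(@class, "related")]'],
        replace_regex={re.compile(r"\s*\(Advertisement\)"): ""},
        change_tags={".lead": "strong"},
        pullup_elems={".image-wrapper img": 1},
    )
    il.add_css("content_html", ".article-body")
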
143 changes: 116 additions & 27 deletions docs/development.rst
.. _Development:

Supporting a new Website
========================
Feeds already supports a number of websites (see :ref:`Supported Websites`) but
adding support for a new website doesn't take too much time. All you need to do
is write a so-called spider. A spider is a Python class that is used by Feeds
to extract content from a website.

The feed generation pipeline looks like this:

#. A spider extracts the content (e.g. an article) that should be part of
the feed from a website. The spider also tells Feeds how the content
should be cleaned up, e.g. which HTML elements should be removed.
#. Feeds takes the content, cleans it up with the hints from the spider and
some generic cleanup rules (e.g. ``<script>`` tags are always removed).
#. Feeds writes an Atom feed for that site with the cleaned content to the
file system.

A quick example
---------------
Writing a spider is easy! For simple websites it can be done in only about 30
lines of code.

Consider this example for a fictional website that hosts articles. When a new
article is published, a link to it is added to an overview page. The idea is
now to use that URL as a starting point for the spider and let the spider
extract all the URLs to the articles. In the next step, the spider visits
every article, extracts the article text and meta information (time, author)
and creates a feed item out of it.

The following code shows what such a spider could look like for our example
website:

.. code-block:: python

    import scrapy

    from feeds.loaders import FeedEntryItemLoader
    from feeds.spiders import FeedsSpider


    class ExampleComSpider(FeedsSpider):
        name = "example.com"
        start_urls = ["https://www.example.com/articles"]
        feed_title = "Example Website"

        def parse(self, response):
            article_links = response.css(".article__link::attr(href)").extract()
            for link in article_links:
                yield scrapy.Request(response.urljoin(link), self._parse_article)

        def _parse_article(self, response):
            remove_elems = [".shareable-quote", ".share-bar"]
            il = FeedEntryItemLoader(
                response=response,
                base_url="https://{}".format(self.name),
                remove_elems=remove_elems,
            )
            il.add_value("link", response.url)
            il.add_css("title", "h1::text")
            il.add_css("author_name", "header .user-link__name::text")
            il.add_css("content_html", ".article-body")
            il.add_css("updated", ".article-date::text")
            return il.load_item()

First, the URL from the ``start_urls`` list is downloaded and the response is
given to ``parse()``. From there we extract the article links that should be
scraped and yield ``scrapy.Request`` objects from the for loop. The callback
method ``_parse_article()`` is executed once the download has finished. It
extracts the article from the response HTML document and returns an item that
will be placed into the feed automatically.

It's enough to place the spider in the ``spiders`` folder. It doesn't have to
be registered anywhere for Feeds to pick it up. Now you can run it::

    $ feeds crawl example.com

The resulting feed can be found in ``output/example.com/feed.xml``.

Reusing an existing feed
------------------------
For paywalled content, take a look at:
* :ref:`spider_falter.at`
* :ref:`spider_konsument.at`

Extraction rules
----------------
A great feed transports all the information from the original site but without
the clutter. The reader should never have to leave their feed reader and go to
the original site. The following rules help to reach that goal.

Unwanted content
~~~~~~~~~~~~~~~~
Advertisements, share buttons/links, navigation elements and everything else
that is not part of the content are removed. The output should be similar to
what Firefox Reader View (Readability) outputs, but more polished.

Images
~~~~~~
The HTML tags ``<figure>`` and ``<figcaption>`` are used for figures (if
possible).
Example:

.. code-block:: html

    <figure>
      <div><img src="https://example.com/img.jpg"></div>
      <figcaption>A very interesting image.</figcaption>
    </figure>

Credits for images are removed. Images are included in the highest resolution
available.

Depaginate
~~~~~~~~~~
If content is split in multiple pages, all pages are scraped.

Iframes
~~~~~~~
Iframes are removed if they are unnecessary; otherwise they are left untouched
and are automatically replaced with a link to their source.

Updated field
~~~~~~~~~~~~~
Every feed item has an updated field. If the spider cannot provide such a field
for an item because the original site doesn't expose that information, Feeds
will automatically use the timestamp when it saw the link of the item for the
first time.

Not embeddable content
~~~~~~~~~~~~~~~~~~~~~~
Sometimes external content like videos cannot be included in the feed because
it needs JavaScript. In such cases the container of the external video is
replaced with a note that says that the content is only available on the
original site.

Regular expressions
~~~~~~~~~~~~~~~~~~~
Regular expressions are only used to replace content if using CSS selectors
with ``replace_elems`` is not possible.

Categories
~~~~~~~~~~
A feed item has categories taken from its original feed or from the site.

Headings
~~~~~~~~
``<h*>`` tags are used for headings (i.e. not generic tags like ``<p>`` or
``<div>``). Headings start with ``<h2>``. The title of the content is not part
of the content and is removed.

Author name(s)
~~~~~~~~~~~~~~
The names of all authors are added to the ``author_name`` field. The names are
not part of the content and are removed.

.. _sitemap: https://en.wikipedia.org/wiki/Site_map
11 changes: 7 additions & 4 deletions docs/docker.rst
Docker
==========

If you prefer to run Feeds in a docker container, you can use the official
`PyFeeds image <https://hub.docker.com/r/pyfeeds/pyfeeds/>`_.

A ``docker-compose.yaml`` could look like this:

name: pyfeeds-output
It mounts the ``config`` folder next to the ``docker-compose.yaml`` and uses
the contained ``feeds.cfg`` as config for Feeds. The feeds are stored in a
volume which could be picked up by a webserver:

.. code-block:: yaml
external: true
name: pyfeeds-output

Now any other container in the same docker network (e.g. a ttrss server) could
access the feeds (e.g. http://pyfeeds-server/theoatmeal.com/feed.atom). Add a
port mapping in case you want to allow access from outside the container's
docker network.
3 changes: 2 additions & 1 deletion docs/index.rst
Feeds provides DIY Atom feeds in times of social media and paywall.
quickstart
configure
spiders
development
docker
api
contribute
license

