Merge pull request #157 from Lukas0907/dev-docs
Rework documentation
Showing 6 changed files with 226 additions and 54 deletions.
@@ -0,0 +1,153 @@
.. _Development:

Writing a custom spider
=======================

Feeds already supports a number of websites (see :ref:`Supported Websites`),
but adding support for a new website doesn't take much time.

A quick example
---------------

Writing a spider is easy! Take, for example, the spider for
:ref:`spider_indiehackers.com`:

.. code-block:: python

   import scrapy

   from feeds.loaders import FeedEntryItemLoader
   from feeds.spiders import FeedsSpider


   class IndieHackersComSpider(FeedsSpider):
       name = "indiehackers.com"
       allowed_domains = [name]
       start_urls = ["https://www.indiehackers.com/interviews/page/1"]
       _title = "Indie Hackers"

       def parse(self, response):
           interviews = response.css(
               ".interview__link::attr(href), .interview__date::text"
           ).extract()
           for link, date in zip(interviews[::2], interviews[1::2]):
               yield scrapy.Request(
                   response.urljoin(link),
                   self._parse_interview,
                   meta={"updated": date.strip()},
               )

       def _parse_interview(self, response):
           remove_elems = [
               ".shareable-quote",
               ".share-bar",
               # Remove the last two h2s and all paragraphs below.
               ".interview-body > h2:last-of-type ~ p",
               ".interview-body > h2:last-of-type",
               ".interview-body > h2:last-of-type ~ p",
               ".interview-body > h2:last-of-type",
           ]
           il = FeedEntryItemLoader(
               response=response,
               base_url="https://{}".format(self.name),
               remove_elems=remove_elems,
           )
           il.add_value("link", response.url)
           il.add_css("title", "h1::text")
           il.add_css("author_name", "header .user-link__name::text")
           il.add_css("content_html", ".interview-body")
           il.add_value("updated", response.meta["updated"])
           return il.load_item()

First, the URL from the ``start_urls`` list is downloaded and the response is
passed to ``parse()``. There we extract the article links that should be
scraped and yield ``scrapy.Request`` objects in the for loop. The callback
method ``_parse_interview()`` is executed once the download has finished. It
extracts the article from the response HTML document and returns an item that
is placed into the feed automatically.

It's enough to place the spider in the ``spiders`` folder. It doesn't have to
be registered anywhere for Feeds to pick it up.

Reusing an existing feed
------------------------

Websites often provide a feed, but it's not full text. In such cases you
usually just want to augment the original feed with the full article text.

Generic spider
~~~~~~~~~~~~~~

For a lot of feeds (especially those from blogs) it is sufficient to use the
:ref:`spider_generic` spider, which can extract content from any website using
heuristics (see :ref:`spider_generic` for details).

Note that a lot of feeds (e.g. those generated by WordPress) actually contain
the full text, but your feed reader chooses to show a summary instead. In such
cases you can also use the :ref:`spider_generic` spider and add your feed URL
to the ``fulltext_urls`` key in the config. This creates a full text feed from
an existing feed without having to rely on heuristics. A minimal configuration
sketch is shown below.

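The following snippet is only a sketch; the section and key names follow the
:ref:`spider_generic` documentation, and the URL is a placeholder:

.. code-block:: ini

   # Hypothetical config excerpt; replace the URL with your own feed.
   [feeds]
   spiders =
       generic

   [generic]
   fulltext_urls =
       https://example.org/feed.xml
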
Custom extraction
~~~~~~~~~~~~~~~~~

These spiders take an existing RSS feed and inline the article content while
cleaning it up (removing share buttons, etc.); a rough sketch of the pattern
follows the list below:

* :ref:`spider_addendum.org`
* :ref:`spider_arstechnica.com`
* :ref:`spider_derstandard.at`
* :ref:`spider_gnucash.org`
* :ref:`spider_lwn.net`
* :ref:`spider_openwrt.org`
* :ref:`spider_orf.at`

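The pattern behind these spiders is roughly: parse the upstream RSS feed,
request each item's link and load the full article with a
``FeedEntryItemLoader`` that strips unwanted elements. The spider name, feed
URL and CSS selectors below are made up:

.. code-block:: python

   import scrapy

   from feeds.loaders import FeedEntryItemLoader
   from feeds.spiders import FeedsSpider


   class ExampleComSpider(FeedsSpider):
       # Hypothetical spider; name, URLs and selectors are placeholders.
       name = "example.com"
       start_urls = ["https://example.com/feed.xml"]
       _title = "Example"

       def parse(self, response):
           # Follow every item link in the upstream RSS feed.
           for url in response.xpath("//item/link/text()").extract():
               yield scrapy.Request(url, self._parse_article)

       def _parse_article(self, response):
           # Inline the full article and drop share buttons etc.
           il = FeedEntryItemLoader(
               response=response,
               remove_elems=[".social-share", "aside"],
           )
           il.add_value("link", response.url)
           il.add_css("title", "h1::text")
           il.add_css("content_html", "article")
           return il.load_item()
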
Paywalled content
~~~~~~~~~~~~~~~~~

If your website has a feed but some or all articles are behind a paywall or
require a login to read, take a look at the following spiders; a rough login
sketch follows the list:

* :ref:`spider_lwn.net`
* :ref:`spider_nachrichten_at`
* :ref:`spider_uebermedien.de`

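The usual approach is to log in first (e.g. with ``scrapy.FormRequest``) and
only start requesting articles afterwards. This is not taken from any of the
spiders above; the URLs, form fields and credentials are placeholders, and in
practice you would read credentials from the config instead of hard-coding
them:

.. code-block:: python

   import scrapy

   from feeds.spiders import FeedsSpider


   class PaywalledExampleSpider(FeedsSpider):
       # Hypothetical spider; URLs, form fields and credentials are placeholders.
       name = "paywalled.example.com"

       def start_requests(self):
           # Log in first; Scrapy keeps the session cookie for later requests.
           yield scrapy.FormRequest(
               "https://paywalled.example.com/login",
               formdata={"username": "user@example.com", "password": "secret"},
               callback=self._after_login,
           )

       def _after_login(self, response):
           # From here on, scraping proceeds as in the other examples.
           yield scrapy.Request("https://paywalled.example.com/news", self.parse)

       def parse(self, response):
           # Extract article links and content as usual.
           ...
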
Creating a feed from scratch
----------------------------

Some websites don't offer any feed at all. In such cases we have to find an
efficient way to detect new content and extract it.

Utilizing an API
~~~~~~~~~~~~~~~~

Some websites provide a REST API that we can use to fetch the content (a rough
sketch of that pattern follows the list below):

* :ref:`spider_facebook.com`
* :ref:`spider_falter.at`
* :ref:`spider_oe1.orf.at`
* :ref:`spider_tvthek.orf.at`
* :ref:`spider_vice.com`

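A typical shape for such a spider is to request a JSON endpoint and build feed
items directly from the response. The endpoint and JSON fields below are made
up:

.. code-block:: python

   import json

   from feeds.loaders import FeedEntryItemLoader
   from feeds.spiders import FeedsSpider


   class ApiExampleSpider(FeedsSpider):
       # Hypothetical spider; the API endpoint and JSON fields are placeholders.
       name = "api.example.com"
       start_urls = ["https://api.example.com/v1/articles?limit=20"]
       _title = "Example API"

       def parse(self, response):
           for article in json.loads(response.text)["articles"]:
               il = FeedEntryItemLoader()
               il.add_value("title", article["title"])
               il.add_value("link", article["url"])
               il.add_value("updated", article["published_at"])
               il.add_value("content_html", article["body_html"])
               yield il.load_item()
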
Utilizing the sitemap
~~~~~~~~~~~~~~~~~~~~~

Others provide a sitemap_ which we can parse (a rough sketch follows the list
below):

* :ref:`spider_profil.at`

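A sitemap is an XML file that lists a site's URLs, so a few XPath expressions
are enough to walk it. The URL and selectors below are placeholders:

.. code-block:: python

   import scrapy

   from feeds.spiders import FeedsSpider


   class SitemapExampleSpider(FeedsSpider):
       # Hypothetical spider; the sitemap URL and selectors are placeholders.
       name = "sitemap.example.com"
       start_urls = ["https://sitemap.example.com/sitemap.xml"]

       def parse(self, response):
           # Sitemaps use the sitemaps.org XML namespace.
           response.selector.register_namespace(
               "sm", "http://www.sitemaps.org/schemas/sitemap/0.9"
           )
           for url in response.xpath("//sm:url/sm:loc/text()").extract():
               yield scrapy.Request(url, self._parse_article)

       def _parse_article(self, response):
           # Extract the article content as in the other examples.
           ...

Scrapy also ships a ``SitemapSpider`` base class that handles the sitemap
parsing for you.
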
Custom extraction
~~~~~~~~~~~~~~~~~

The last resort is to find a page that lists the newest articles and start
scraping from there, as the :ref:`spider_indiehackers.com` example at the top
of this page does.

* :ref:`spider_ak.ciando.com`
* :ref:`spider_atv.at`
* :ref:`spider_biblioweb.at`
* :ref:`spider_cbird.at`
* :ref:`spider_help.gv.at`
* :ref:`spider_indiehackers.com`
* :ref:`spider_puls4.com`
* :ref:`spider_usenix.org`
* :ref:`spider_verbraucherrecht.at`
* :ref:`spider_wienerlinien.at`
* :ref:`spider_zeit.diebin.at`

For paywalled content, take a look at:

* :ref:`spider_falter.at`
* :ref:`spider_konsument.at`

.. _sitemap: https://en.wikipedia.org/wiki/Site_map
@@ -1,4 +1,4 @@
-.. _zeit.diebin.at:
+.. _spider_zeit.diebin.at:
 
 zeit.diebin.at
 -------------------