Merge pull request #157 from Lukas0907/dev-docs

Rework documentation

Lukas0907 committed Aug 20, 2018
2 parents e344239 + 60a8d23 commit 1855267

Showing 6 changed files with 226 additions and 54 deletions.
2 changes: 1 addition & 1 deletion README.rst
@@ -34,7 +34,7 @@ complete list of `supported websites is available in the documentation
<https://pyfeeds.readthedocs.io/en/latest/spiders.html>`_.

Content behind paywalls
```````````````````````
~~~~~~~~~~~~~~~~~~~~~~~

Some sites (Falter_, Konsument_, LWN_, `Oberösterreichische Nachrichten`_,
Übermedien_) offer articles only behind a paywall. If you have a paid
153 changes: 153 additions & 0 deletions docs/development.rst
@@ -0,0 +1,153 @@
.. _Development:

Writing a custom spider
=======================
Feeds already supports a number of websites (see :ref:`Supported Websites`) but
adding support for a new website doesn't take too much time.

A quick example
---------------
Writing a spider is easy! Take the spider for :ref:`spider_indiehackers.com`
as an example:

.. code-block:: python

    import scrapy

    from feeds.loaders import FeedEntryItemLoader
    from feeds.spiders import FeedsSpider


    class IndieHackersComSpider(FeedsSpider):
        name = "indiehackers.com"
        allowed_domains = [name]
        start_urls = ["https://www.indiehackers.com/interviews/page/1"]
        _title = "Indie Hackers"

        def parse(self, response):
            interviews = response.css(
                ".interview__link::attr(href), .interview__date::text"
            ).extract()
            for link, date in zip(interviews[::2], interviews[1::2]):
                yield scrapy.Request(
                    response.urljoin(link),
                    self._parse_interview,
                    meta={"updated": date.strip()},
                )

        def _parse_interview(self, response):
            remove_elems = [
                ".shareable-quote",
                ".share-bar",
                # Remove the last two h2s and all paragraphs below. The
                # selectors are repeated on purpose: once the last h2 is
                # removed, the previous one becomes the new last.
                ".interview-body > h2:last-of-type ~ p",
                ".interview-body > h2:last-of-type",
                ".interview-body > h2:last-of-type ~ p",
                ".interview-body > h2:last-of-type",
            ]
            il = FeedEntryItemLoader(
                response=response,
                base_url="https://{}".format(self.name),
                remove_elems=remove_elems,
            )
            il.add_value("link", response.url)
            il.add_css("title", "h1::text")
            il.add_css("author_name", "header .user-link__name::text")
            il.add_css("content_html", ".interview-body")
            il.add_value("updated", response.meta["updated"])
            return il.load_item()

First, the URL from the ``start_urls`` list is downloaded and the response is
given to ``parse()``. From there we extract the article links that should be
scraped and yield ``scrapy.Request`` objects from the for loop. The callback method
``_parse_interview()`` is executed once the download has finished. It extracts
the article from the response HTML document and returns an item that will be
placed into the feed automatically.

It's enough to place the spider in the ``spiders`` folder; it doesn't have to
be registered anywhere for Feeds to pick it up.

Reusing an existing feed
------------------------
Often websites provide a feed, but it's not full text. In such cases you
usually only want to augment the original feed with the full article.

Generic spider
~~~~~~~~~~~~~~
For a lot of feeds (especially those from blogs) it is actually sufficient to
use the :ref:`spider_generic` spider, which can extract content from any
website using heuristics (see :ref:`spider_generic` for details).

Note that a lot of feeds (e.g. those generated by WordPress) actually contain
the full text but your feed reader chooses to show a summary instead. In such
cases you can also use the :ref:`spider_generic` spider and add your feed URL
to the ``fulltext_urls`` key in the config. This will create a full text feed
from an existing feed without having to rely on heuristics.
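
As an illustration, the configuration might look like the following sketch.
The ``fulltext_urls`` key is taken from above; the ``[generic]`` section name
and the ``urls`` key for heuristic extraction are assumptions, so refer to
:ref:`spider_generic` and :ref:`Configure Feeds` for the authoritative syntax.

.. code-block:: ini

    [feeds]
    spiders =
        generic

    [generic]
    # Feeds that only contain a teaser; articles are extracted with
    # heuristics (assumed key name).
    urls =
        https://example.com/feed
    # Feeds that already contain the full text; used as-is.
    fulltext_urls =
        https://example.org/feed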

Custom extraction
~~~~~~~~~~~~~~~~~
These spiders take an existing RSS feed and inline the article content while
cleaning it up (removing share buttons, etc.); a minimal sketch of the common
pattern follows the list:

* :ref:`spider_addendum.org`
* :ref:`spider_arstechnica.com`
* :ref:`spider_derstandard.at`
* :ref:`spider_gnucash.org`
* :ref:`spider_lwn.net`
* :ref:`spider_openwrt.org`
* :ref:`spider_orf.at`
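
The common pattern behind these spiders is roughly the following minimal
sketch. It is not one of the spiders above: the site name, feed URL and CSS
selectors are made up for illustration.

.. code-block:: python

    import scrapy

    from feeds.loaders import FeedEntryItemLoader
    from feeds.spiders import FeedsSpider


    class ExampleComSpider(FeedsSpider):
        # All names below are made up for illustration.
        name = "example.com"
        start_urls = ["https://example.com/feed.rss"]
        _title = "Example"

        def parse(self, response):
            # Follow the link of every <item> in the upstream RSS feed.
            for url in response.xpath("//item/link/text()").extract():
                yield scrapy.Request(url, self._parse_article)

        def _parse_article(self, response):
            # Inline the full article and strip clutter like share buttons.
            il = FeedEntryItemLoader(
                response=response,
                remove_elems=[".share-buttons"],
            )
            il.add_value("link", response.url)
            il.add_css("title", "h1::text")
            il.add_css("content_html", ".article-body")
            return il.load_item()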

Paywalled content
~~~~~~~~~~~~~~~~~
If your website has a feed but some or all articles are behind a paywall or
require a login to read, take a look at the following spiders (a sketch of the
credentials configuration follows the list):

* :ref:`spider_lwn.net`
* :ref:`spider_nachrichten_at`
* :ref:`spider_uebermedien.de`
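
See the documentation of these spiders for details. The credentials typically
go into a per-spider section of ``feeds.cfg``; the key names below are a
sketch, so check the respective spider page for the exact syntax.

.. code-block:: ini

    [lwn.net]
    username = myuser
    password = mypassword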

Creating a feed from scratch
----------------------------
Some websites don't offer any feed at all. In such cases we have to find an
efficient way to detect new content and extract it.

Utilizing an API
~~~~~~~~~~~~~~~~
Some websites offer a REST API which we can use to fetch the content; a
minimal sketch follows the list:

* :ref:`spider_facebook.com`
* :ref:`spider_falter.at`
* :ref:`spider_oe1.orf.at`
* :ref:`spider_tvthek.orf.at`
* :ref:`spider_vice.com`
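
The details are API specific, but the shape is usually the same: fetch a JSON
endpoint, iterate over the returned objects and fill the item loader from
JSON fields instead of CSS selectors. A minimal sketch with a hypothetical
endpoint and field names:

.. code-block:: python

    import json

    from feeds.loaders import FeedEntryItemLoader
    from feeds.spiders import FeedsSpider


    class ApiExampleSpider(FeedsSpider):
        # Hypothetical site, endpoint and field names.
        name = "api-example.com"
        start_urls = ["https://api-example.com/v1/articles?limit=20"]
        _title = "API Example"

        def parse(self, response):
            # The endpoint is assumed to return a JSON list of articles.
            for article in json.loads(response.text):
                il = FeedEntryItemLoader()
                il.add_value("link", article["url"])
                il.add_value("title", article["title"])
                il.add_value("updated", article["published_at"])
                il.add_value("content_html", article["body"])
                yield il.load_item()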

Utilizing the sitemap
~~~~~~~~~~~~~~~~~~~~~
Others provide a sitemap_ which we can parse; a sketch follows the list:

* :ref:`spider_profil.at`
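
A sitemap is an XML document that lists the URLs of a site. A minimal sketch
of the pattern (site, selectors and sitemap URL are made up; the sitemap
namespace has to be registered for the XPath queries to match):

.. code-block:: python

    import scrapy

    from feeds.loaders import FeedEntryItemLoader
    from feeds.spiders import FeedsSpider


    class SitemapExampleSpider(FeedsSpider):
        # Hypothetical site and selectors.
        name = "example.net"
        start_urls = ["https://example.net/sitemap.xml"]
        _title = "Example"

        def parse(self, response):
            # Sitemap files use a dedicated XML namespace that has to be
            # registered before the XPath queries match anything.
            response.selector.register_namespace(
                "s", "http://www.sitemaps.org/schemas/sitemap/0.9"
            )
            for url in response.xpath("//s:url/s:loc/text()").extract():
                yield scrapy.Request(url, self._parse_article)

        def _parse_article(self, response):
            # Extract the article as in the quick example above.
            il = FeedEntryItemLoader(response=response)
            il.add_value("link", response.url)
            il.add_css("title", "h1::text")
            il.add_css("content_html", "article")
            return il.load_item()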

Custom extraction
~~~~~~~~~~~~~~~~~
The last resort is to find a page that lists the newest articles and start
scraping from there; this is the pattern used in the quick example above.

* :ref:`spider_ak.ciando.com`
* :ref:`spider_atv.at`
* :ref:`spider_biblioweb.at`
* :ref:`spider_cbird.at`
* :ref:`spider_help.gv.at`
* :ref:`spider_indiehackers.com`
* :ref:`spider_puls4.com`
* :ref:`spider_usenix.org`
* :ref:`spider_verbraucherrecht.at`
* :ref:`spider_wienerlinien.at`
* :ref:`spider_zeit.diebin.at`

For paywalled content, take a look at:

* :ref:`spider_falter.at`
* :ref:`spider_konsument.at`

.. _sitemap: https://en.wikipedia.org/wiki/Site_map
45 changes: 43 additions & 2 deletions docs/index.rst
@@ -3,12 +3,53 @@ Welcome to Feeds
Feeds provides DIY Atom feeds in times of social media and paywall.

.. toctree::
:maxdepth: 2
:maxdepth: 1
:hidden:

introduction
self
get
quickstart
configure
spiders
development
contribute
license

About Feeds
-----------
Once upon a time every website offered an RSS feed to keep readers updated
about new articles/blog posts via the users' feed readers. These times are long
gone. The once iconic orange RSS icon has been replaced by "social share"
buttons.

Feeds aims to bring back the good old reading times. It creates Atom feeds for
websites that don't offer them (anymore). It allows you to read new articles
from your favorite websites in your feed reader (e.g. TinyTinyRSS_) even if
this is not officially supported by the website.

Feeds is based on Scrapy_, a framework for extracting data from websites, and
it already supports a number of websites (see :ref:`Supported Websites`). It's
easy to add support for new websites: just take a look at the existing spiders
in ``feeds/spiders`` and feel free to open a :ref:`pull request <Contribute>`!

Related work
------------
* `morss <https://github.com/pictuga/morss>`_ creates feeds, similar to Feeds
but in "real-time", i.e. on (HTTP) request.
* `Full-Text RSS <https://bitbucket.org/fivefilters/full-text-rss>`_ uses
  heuristics and rules to convert feeds so that they contain the full article
  and not only a teaser. Feeds are converted in "real-time", i.e. on request.
* `f43.me <https://github.com/j0k3r/f43.me>`_ converts feeds to contain the
full article and also improves articles by adding links to the comment
sections of Hacker News and Reddit. Feeds are converted periodically.
* `python-ftr <https://github.com/1flow/python-ftr>`_ is a library to extract
content from pages. A partial reimplementation of Full-Text RSS.

Authors
-------
Feeds is written and maintained by `Florian Preinstorfer <https://nblock.org>`_
and `Lukas Anzinger <https://www.notinventedhere.org>`_ (`@LukasAnzinger`_).

.. _Scrapy: https://www.scrapy.org
.. _TinyTinyRSS: https://tt-rss.org
.. _@LukasAnzinger: https://twitter.com/LukasAnzinger
39 changes: 0 additions & 39 deletions docs/introduction.rst

This file was deleted.

39 changes: 28 additions & 11 deletions docs/spiders.rst
@@ -2,21 +2,38 @@

Supported Websites
==================
Feeds is currently able to create full text Atom feeds for the following websites:
Feeds is currently able to create full text Atom feeds for the websites listed
below. All feeds contain the articles in full text so you never have to leave your
feed reader while reading.

A note on paywalls
------------------
Some sites (:ref:`Falter <spider_falter.at>`, :ref:`Konsument
<spider_konsument.at>`, :ref:`LWN <spider_lwn.net>`) offer articles only
behind a paywall. If you have a paid subscription, you can configure your
username and password in ``feeds.cfg`` (see also :ref:`Configure Feeds`) and
paywalled articles will also be included in full text in the created feed. If
you don't have a subscription and hence the full text cannot be included,
paywalled articles are tagged with ``paywalled`` so they can be filtered, if
desired.

Most popular sites
------------------

.. toctree::
:maxdepth: 1

spiders/arstechnica.com
spiders/facebook.com
spiders/indiehackers.com
spiders/lwn.net
spiders/vice.com

All supported sites
-------------------
.. toctree::
:maxdepth: 1
:glob:

spiders/*

Some sites (:ref:`Falter <spider_falter.at>`, :ref:`Konsument
<spider_konsument.at>`, :ref:`LWN <spider_lwn.net>`) offer articles only
behind a paywall. If you have a paid subscription, you can configure your
username and password in ``feeds.cfg`` and also read paywalled articles from
within your feed reader. For the less fortunate who don't have a
subscription, paywalled articles are tagged with ``paywalled`` so they can be
filtered, if desired.

All feeds contain the articles in full text so you never have to leave your
feed reader while reading.
2 changes: 1 addition & 1 deletion docs/spiders/zeit.diebin.at.rst
@@ -1,4 +1,4 @@
.. _zeit.diebin.at:
.. _spider_zeit.diebin.at:

zeit.diebin.at
-------------------
