Commit

Merge pull request #215 from Lukas0907/next
Spider fixes
Lukas0907 committed May 16, 2020
2 parents fc23b1f + 34eaddd commit 02fc464
Showing 15 changed files with 73 additions and 305 deletions.
4 changes: 1 addition & 3 deletions docs/development.rst
@@ -111,6 +111,7 @@ Utilizing an API
Some use a REST API which we can use to fetch the content.

* :ref:`spider_falter.at`
+ * :ref:`spider_indiehackers.com`
* :ref:`spider_kurier.at`
* :ref:`spider_oe1.orf.at`
* :ref:`spider_tvthek.orf.at`
@@ -120,7 +121,6 @@ Utilizing the sitemap
~~~~~~~~~~~~~~~~~~~~~
Others provide a sitemap_ which we can parse:

- * :ref:`spider_diepresse.com`
* :ref:`spider_profil.at`

Custom extraction
@@ -134,8 +134,6 @@ scraping from there.
* :ref:`spider_cbird.at`
* :ref:`spider_delinski.at`
* :ref:`spider_flimmit.com`
- * :ref:`spider_help.gv.at`
- * :ref:`spider_indiehackers.com`
* :ref:`spider_openwrt.org`
* :ref:`spider_puls4.com`
* :ref:`spider_python-patterns.guide`
30 changes: 0 additions & 30 deletions docs/spiders/diepresse.com.rst

This file was deleted.

16 changes: 0 additions & 16 deletions docs/spiders/help.gv.at.rst

This file was deleted.

5 changes: 5 additions & 0 deletions docs/spiders/spotify.com.rst
@@ -14,6 +14,10 @@ Add ``spotify.com`` to the list of spiders:
spiders =
spotify.com
+ The market you are in (i. e. your country as an ISO 3166-1 alpha-2 country
+ code) has to be specified in the config as well. For example, for Austria
+ specify: ``market = AT``

spotify.com supports different podcasts via the ``show`` parameter (one per
line).

@@ -22,5 +26,6 @@ Example configuration:
.. code-block:: ini
[spotify.com]
+ market = AT
shows =
6u7pI0o0CUBQq0T1fwPgbj
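The `market` and `shows` options documented above could be read and sanity-checked roughly as follows. This is a minimal sketch using Python's standard `configparser`, not the project's own loading code. The `read_spotify_options` helper, the `EXAMPLE_CFG` string and the two-letter validation are assumptions introduced for illustration.

```python
import configparser

# Hypothetical stand-in for the user's config file; the values mirror the
# example configuration from the documentation hunk.
EXAMPLE_CFG = """
[spotify.com]
market = AT
shows =
    6u7pI0o0CUBQq0T1fwPgbj
"""


def read_spotify_options(text):
    """Return (market, shows) from an INI string (illustrative helper)."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    section = parser["spotify.com"]

    # Assumption: treat the market as a two-letter ISO 3166-1 alpha-2 code.
    market = section.get("market", "").strip().upper()
    if len(market) != 2 or not market.isalpha():
        raise ValueError("market must be a two-letter country code, e.g. AT")

    # One show ID per line; split() drops the leading empty line.
    shows = section.get("shows", "").split()
    return market, shows


market, shows = read_spotify_options(EXAMPLE_CFG)
```

Checking the shape of the code early gives a clearer error than a failed API request would, though the project itself may handle a missing `market` differently.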
5 changes: 1 addition & 4 deletions feeds.cfg.dist
@@ -142,10 +142,6 @@ useragent = feeds (+https://github.com/pyfeeds/pyfeeds)
#links =
# /katalog?sortiertnach=neueste

- #[diepresse.com]
- #sections =
- # Meinung/Pizzicato

#[kurier.at]
#channels =
# /chronik/wien
@@ -158,6 +154,7 @@ useragent = feeds (+https://github.com/pyfeeds/pyfeeds)
# barbara.kaufmann

#[spotify.com]
+ #market = AT
#shows =
# 6u7pI0o0CUBQq0T1fwPgbj

4 changes: 2 additions & 2 deletions feeds/spiders/__init__.py
@@ -13,8 +13,8 @@ def feed_headers(self):
link=getattr(self, "feed_link", None),
path=getattr(self, "path", None),
author_name=getattr(self, "author_name", None),
- icon=getattr(self, "icon", None),
- logo=getattr(self, "logo", None),
+ icon=getattr(self, "feed_icon", None),
+ logo=getattr(self, "feed_logo", None),
)

def start_requests(self):
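The renamed lookups above use `getattr` with a `None` default, so a spider that does not define `feed_icon` or `feed_logo` still yields a value for each header field. A small sketch of that pattern follows; the `ExampleSpider` class and the `collect_feed_attrs` helper are hypothetical and not part of this commit.

```python
class ExampleSpider:
    # Hypothetical spider: defines feed_icon but deliberately omits feed_logo.
    name = "example.com"
    feed_icon = "https://example.com/favicon.ico"


def collect_feed_attrs(spider):
    """Look up optional feed attributes, falling back to None when absent."""
    return {
        "icon": getattr(spider, "feed_icon", None),
        "logo": getattr(spider, "feed_logo", None),
    }


attrs = collect_feed_attrs(ExampleSpider())
```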
4 changes: 2 additions & 2 deletions feeds/spiders/derstandard_at.py
@@ -9,12 +9,12 @@

class DerStandardAtSpider(FeedsSpider):
name = "derstandard.at"
custom_settings = {"COOKIES_ENABLED": False}

_users = {}
_titles = {}

def start_requests(self):
- self._ressorts = self.settings.get("FEEDS_SPIDER_DERSTANDARD_AT_RESSORTS")
+ self._ressorts = self.settings.get("FEEDS_SPIDER_DERSTANDARD_AT_RESSORTS", [])
if self._ressorts:
self._ressorts = set(self._ressorts.split())
else:
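The lookup above reads a whitespace-separated setting with an explicit default and then splits it into a set. A rough sketch of that flow is shown below, using a plain `dict` as a stand-in for the Scrapy settings object. The `parse_ressorts` helper and the empty-set fallback are assumptions, since the original `else` branch is not visible in this hunk.

```python
SETTING_NAME = "FEEDS_SPIDER_DERSTANDARD_AT_RESSORTS"


def parse_ressorts(settings):
    """Return the configured ressorts as a set (illustrative helper)."""
    raw = settings.get(SETTING_NAME, [])
    if raw:
        # A configured value is a whitespace-separated string.
        return set(raw.split())
    # Assumption: fall back to an empty set when nothing is configured.
    return set()


configured = parse_ressorts({SETTING_NAME: "Web\nWissenschaft Web"})
unconfigured = parse_ressorts({})
```

Because the default is only ever tested for truthiness, `.split()` is never called on the list itself.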
104 changes: 0 additions & 104 deletions feeds/spiders/diepresse_com.py

This file was deleted.

2 changes: 1 addition & 1 deletion feeds/spiders/economist_com.py
@@ -9,7 +9,7 @@
class EconomistComSpider(FeedsXMLFeedSpider):
name = "economist.com"
# Don't send too many requests to not trigger the bot detection.
- custom_settings = {"COOKIES_ENABLED": False, "DOWNLOAD_DELAY": 5.0}
+ custom_settings = {"DOWNLOAD_DELAY": 5.0}

_titles = {}

2 changes: 1 addition & 1 deletion feeds/spiders/ft_com.py
@@ -7,7 +7,7 @@

class FtComSpider(FeedsXMLFeedSpider):
name = "ft.com"
- custom_settings = {"COOKIES_ENABLED": False, "REFERER_ENABLED": False}
+ custom_settings = {"REFERER_ENABLED": False}

_titles = {}

10 changes: 10 additions & 0 deletions feeds/spiders/generic.py
@@ -73,7 +73,16 @@ def parse(self, response):

def _parse_article(self, response):
feed_entry = response.meta["feed_entry"]

il = FeedEntryItemLoader(parent=response.meta["il"])
+ try:
+     response.text
+ except AttributeError:
+     # Response is not text (e.g. PDF, ...).
+     il.add_value("title", feed_entry.get("title"))
+     il.add_value("content_html", feed_entry.get("summary"))
+     return il.load_item()

doc = Document(response.text, url=response.url)
il.add_value("title", doc.short_title() or feed_entry.get("title"))
summary = feed_entry.get("summary")
@@ -86,4 +95,5 @@ def _parse_article(self, response):
except Unparseable:
content = summary
il.add_value("content_html", content)

return il.load_item()
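The new guard in `_parse_article` relies on `response.text` raising `AttributeError` when the response body is not text, and falls back to the title and summary from the feed entry. The sketch below imitates that behaviour without Scrapy; `BinaryResponse`, `TextualResponse` and `pick_title_and_content` are hypothetical stand-ins, and the text branch simply returns the raw body instead of running the readability extraction used in the spider.

```python
class BinaryResponse:
    """Hypothetical stand-in for a non-text response (e.g. a PDF)."""

    @property
    def text(self):
        raise AttributeError("Response content isn't text")


class TextualResponse:
    """Hypothetical stand-in for an HTML response."""

    def __init__(self, text):
        self.text = text


def pick_title_and_content(response, feed_entry):
    """Use the response body if it is text, else fall back to the feed entry."""
    try:
        body = response.text
    except AttributeError:
        # Not text: reuse what the feed already provides.
        return feed_entry.get("title"), feed_entry.get("summary")
    return feed_entry.get("title"), body


entry = {"title": "Report", "summary": "Short abstract"}
from_pdf = pick_title_and_content(BinaryResponse(), entry)
from_html = pick_title_and_content(TextualResponse("<p>Full text</p>"), entry)
```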
98 changes: 0 additions & 98 deletions feeds/spiders/help_gv_at.py

This file was deleted.
