merge changes

commit e01edf6902dd51fd4a0e0ac0664a286b5f17bae1 2 parents cec7e0a + 96bbf57
@JoeGermuska authored
4 .gitignore
@@ -3,4 +3,8 @@
*~
build/
errors/
+dist/
scrapelib.egg-info/
+.coverage
+cover/
+.tox
9 .travis.yml
@@ -0,0 +1,9 @@
+language: python
+python:
+ - "2.7"
+ - "3.3"
+install: pip install mock nose "requests>=1.0" --use-mirrors --upgrade
+script: nosetests
+notifications:
+ email:
+ - jturk@sunlightfoundation.com
4 LICENSE
@@ -1,4 +1,4 @@
-Copyright (c) 2010, Sunlight Labs
+Copyright (c) 2012, Sunlight Labs
All rights reserved.
@@ -24,4 +24,4 @@ PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
-SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
41 README.rst
@@ -2,30 +2,43 @@
scrapelib
=========
-A Python library for scraping things.
+scrapelib is a library for making requests to websites, particularly those
+that may be less-than-reliable.
-Features include:
+scrapelib originated as part of the `Open States <http://openstates.org/>`_
+project to scrape the websites of all 50 state legislatures, and was
+therefore designed with features desirable when dealing with sites that
+have intermittent errors or require rate-limiting.
- * HTTP, HTTPS, FTP requests via an identical API
- * HTTP caching, compression and cookies
- * redirect following
- * request throttling
- * robots.txt compliance (optional)
- * robust error handling
+As of version 0.7 scrapelib has been retooled to take advantage of the superb
+`requests <http://python-requests.org>`_ library.
-scrapelib is a project of Sunlight Labs (c) 2011.
+Advantages of using scrapelib over alternatives like httplib2 or simply using
+requests as-is:
+
+* All of the power of the superb `requests <http://python-requests.org>`_ library.
+* HTTP, HTTPS, and FTP requests via an identical API
+* support for simple caching with pluggable cache backends
+* request throttling
+* configurable retries for non-permanent site failures
+* optional robots.txt compliance
+
+scrapelib is a project of Sunlight Labs (c) 2013.
All code is released under a BSD-style license, see LICENSE for details.
-Written by Michael Stephens <mstephens@sunlightfoundation.com> and James Turk
-<jturk@sunlightfoundation.com>.
+Written by James Turk <jturk@sunlightfoundation.com>
+
+Contributors:
+ * Michael Stephens - initial urllib2/httplib2 version
+ * Joe Germuska - fix for IPython embedding
+ * Alex Chiang - fix to test suite
Requirements
============
-python >= 2.6
-
-httplib2 optional but highly recommended.
+* python 2.7 or 3.3
+* requests >= 1.0
Installation
============
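
To make the feature list in the README diff above concrete, here is a minimal
usage sketch, assuming the 0.9-era API added elsewhere in this commit; the
target URL and option values are only illustrative::

    import scrapelib

    # throttle to 10 requests/minute, skip robots.txt, retry twice on errors
    s = scrapelib.Scraper(requests_per_minute=10,
                          follow_robots=False,
                          retry_attempts=2)

    # urlopen returns a str subclass; response details live on .response
    page = s.urlopen('http://httpbin.org/get')   # placeholder URL
    print(page.response.code)   # HTTP status code, e.g. 200
    print(page[:60])            # body text, usable like any str
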
2  coverage.sh
@@ -0,0 +1,2 @@
+rm -rf cover/
+nosetests --cover-html --with-coverage --cover-package=scrapelib
95 docs/changelog.rst
@@ -1,10 +1,103 @@
scrapelib changelog
===================
+0.9.0
+-----
+**22 May 2013**
+ * replace FTPSession with FTPAdapter
+ * fixes for latest requests
+
+0.8.0
+-----
+**18 March 2013**
+ * requests 1.0 compatibility
+ * removal of requests pass-throughs
+ * deprecation of setting parameters via constructor
+
+0.7.4
+-----
+**20 December 2012**
+ * bugfix for status_code coming from a cache
+ * bugfix for setting user-agent from headers
+ * fix requests version at <1.0
+
+0.7.3
+-----
+**21 June 2012**
+ * fix for combination of FTP and caching
+ * drop unnecessary ScrapelibSession
+ * bytes fix for scrapeshell
+ * use UTF8 if encoding guess fails
+
+0.7.2
+-----
+**9 May 2012**
+ * bugfix for user-agent check
+ * bugfix for cached content with \r characters
+ * bugfix for requests >= 0.12
+ * cache_dir deprecation is total
+
+0.7.1
+-----
+**27 April 2012**
+ * breaking change: no longer accept URLs without a scheme
+ * deprecation of error_dir & context-manager mode
+ * addition of overridable accept_response hook
+ * bugfix: retry on more requests errors
+ * bugfix: unicode cached content no longer incorrectly encoded
+ * implement various requests enhancements separately for ease of reuse
+ * convert more Scraper parameters to properties
+
+0.7.0
+-----
+**23 April 2012**
+ * rewritten internals to use requests, dropping httplib2
+ * as a result of rewrite, caching behavior no longer attempts to be
+ compliant with the HTTP specification but is much more configurable
+ * added cache_write_only option
+ * deprecation of accept_cookies, use_cache_first, cache_dir parameter
+ * improved tests
+ * improved Python 3 support
+
+0.6.2
+-----
+**20 April 2012**
+ * bugfix for POST-redirects
+ * drastically improved test coverage
+ * add encoding to ResultStr
+
+0.6.1
+-----
+**19 April 2012**
+ * add .bytes attribute to ResultStr
+ * bugfix related to bytes in urlretrieve
+
+0.6.0
+-----
+**19 April 2012**
+ * remove urllib2 fallback for HTTP
+ * rework entire test suite to not rely on Flask
+ * Unicode & Str unification
+ * experimental Python 3.2 support
+
+0.5.8
+-----
+**15 February 2012**
+ * fix to test suite from Alex Chiang
+
+0.5.7
+-----
+**2 February 2012**
+ * -p, --postdata parameter
+ * argv fix for IPython <= 0.10 from Joe Germuska
+ * treat FTP 550 errors as HTTP 404s
+ * use_cache_first improvements
+
0.5.6
-----
-**7 November 2011**
+**9 November 2011**
* scrapeshell fix for IPython >= 0.11
+ * scrapelib.urlopen can take method/body params too
0.5.5
-----
6 docs/conf.py
@@ -41,16 +41,16 @@
# General information about the project.
project = u'scrapelib'
-copyright = u'2011, Michael Stephens and James Turk'
+copyright = u'2013, Sunlight Labs'
# The version info for the project you're documenting, acts as replacement for
# |version| and |release|, also used in various other places throughout the
# built documents.
#
# The short X.Y version.
-version = '0.5'
+version = '0.9'
# The full version, including alpha/beta/rc tags.
-release = '0.5.6'
+release = '0.9.0'
# The language for content autogenerated by Sphinx. Refer to documentation
# for a list of supported languages.
25 docs/index.rst
@@ -3,17 +3,26 @@ scrapelib |release|
Overview
--------
-scrapelib is a library that at its simplest provides a replacement for urllib2's urlopen functionality but can do much more.
+scrapelib is a library for making requests to websites, particularly those
+that may be less-than-reliable.
-Advantages of using scrapelib over urllib2 or httplib2 include:
+scrapelib originated as part of the `Open States <http://openstates.org/>`_
+project to scrape the websites of all 50 state legislatures, and was
+therefore designed with features desirable when dealing with sites that
+have intermittent errors or require rate-limiting.
-* HTTP, HTTPS, FTP requests via an identical API
-* HTTP caching, compression, and cookies
-* intelligent and configurable redirect following
+As of version 0.7 scrapelib has been retooled to take advantage of the superb
+`requests <http://python-requests.org>`_ library.
+
+Advantages of using scrapelib over alternatives like httplib2 or simply using
+requests as-is:
+
+* All of the power of the superb `requests <http://python-requests.org>`_ library.
+* HTTP(S) and FTP requests via an identical API
+* support for simple caching with pluggable cache backends
* request throttling
-* configurable retries for non-permanent failures
-* robots.txt compliance
-* robust error handling
+* configurable retries for non-permanent site failures
+* optional robots.txt compliance
Contents
--------
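
A hypothetical sketch of the throttling, retry, and error-handling options
listed above (values are illustrative; the URL is a placeholder, and
``scrapelib.HTTPError`` is the exception class added in this commit)::

    import scrapelib

    s = scrapelib.Scraper(requests_per_minute=30,   # at most one request every 2 seconds
                          retry_attempts=3,         # retry transient (non-404) failures
                          retry_wait_seconds=10,    # waits 10s, 20s, 40s between tries
                          follow_robots=True)       # honor robots.txt

    try:
        page = s.urlopen('http://example.com/flaky-page')   # placeholder URL
    except scrapelib.HTTPError as e:
        # raised for 4xx/5xx responses when raise_errors is True (the default)
        print('request failed: %s' % e)
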
6 docs/scrapelib.rst
@@ -20,13 +20,11 @@ Response objects
.. autoclass:: Headers
-ResultStr and ResultUnicode
-~~~~~~~~~~~~~~~~~~~~~~~~~~~
+ResultStr
+~~~~~~~~~
.. autoclass:: ResultStr
-.. autoclass:: ResultUnicode
-
Exceptions
----------
3  requirements.txt
@@ -1,2 +1 @@
-# Not strictly required, but you probably want it
-httplib2>=0.6.0
+requests>=1.0
711 scrapelib.py
@@ -1,711 +0,0 @@
-import os
-import sys
-import time
-import socket
-import logging
-import tempfile
-import urllib2
-import urlparse
-import datetime
-import cookielib
-import robotparser
-
-try:
- import json
-except ImportError:
- import simplejson as json
-
-try:
- import httplib2
- USE_HTTPLIB2 = True
-except ImportError:
- USE_HTTPLIB2 = False
-
-__version__ = '0.5.6'
-_user_agent = 'scrapelib %s' % __version__
-
-
-class NullHandler(logging.Handler):
- def emit(self, record):
- pass
-
-_log = logging.getLogger('scrapelib')
-_log.addHandler(NullHandler())
-
-
-class ScrapeError(Exception):
- pass
-
-
-class RobotExclusionError(ScrapeError):
- """
- Raised when an attempt is made to access a page denied by
- the host's robots.txt file.
- """
-
- def __init__(self, message, url, user_agent):
- super(RobotExclusionError, self).__init__(message)
- self.url = url
- self.user_agent = user_agent
-
-
-class HTTPMethodUnavailableError(ScrapeError):
- """
- Raised when the supplied HTTP method is invalid or not supported
- by the HTTP backend.
- """
-
- def __init__(self, message, method):
- super(HTTPMethodUnavailableError, self).__init__(message)
- self.method = method
-
-
-class HTTPError(ScrapeError):
- """
- Raised when urlopen encounters a 4xx or 5xx error code and the
- raise_errors option is true.
- """
-
- def __init__(self, response, body):
- message = '%s while retrieving %s' % (response.code, response.url)
- super(HTTPError, self).__init__(message)
- self.response = response
- self.body = body
-
-
-class ErrorManager(object):
- def __enter__(self):
- return self
-
- def __exit__(self, exc_type, exc_val, exc_tb):
- if exc_type and self._scraper.save_errors:
- self._scraper._save_error(self.response.url, self)
- return False
-
-
-class ResultStr(str, ErrorManager):
- """
- Wrapper for non-unicode responses. Can treat identically to a ``str``
- to get body of response, additional headers, etc. available via ``response``
- attribute (instance of :class:`Response`).
- """
- def __new__(cls, scraper, response, str):
- self = str.__new__(cls, str)
- self._scraper = scraper
- self.response = response
- return self
-
-
-class ResultUnicode(unicode, ErrorManager):
- """
- Wrapper for unicode responses. Can treat identically to a ``unicode``
- string to get body of response, additional headers, etc. available via
- ``response`` attribute (instance of :class:`Response`).
- """
- def __new__(cls, scraper, response, str):
- self = unicode.__new__(cls, str)
- self._scraper = scraper
- self.response = response
- return self
-
-
-class Headers(dict):
- """
- Dictionary-like object for storing response headers in a
- case-insensitive way (keeping with the HTTP spec).
-
- Accessed as the ``headers`` attribute of :class:`Response`.
- """
- def __init__(self, d={}):
- super(Headers, self).__init__()
- for k, v in d.items():
- self[k] = v
-
- def __getitem__(self, key):
- return super(Headers, self).__getitem__(key.lower())
-
- def __setitem__(self, key, value):
- super(Headers, self).__setitem__(key.lower(), value)
-
- def __delitem__(self, key):
- return super(Headers, self).__delitem__(key.lower())
-
- def __contains__(self, key):
- return super(Headers, self).__contains__(key.lower())
-
- def __eq__(self, other):
- if len(self) != len(other):
- return False
-
- for k, v in other.items():
- if self[k] != v:
- return False
-
- return True
-
- def __ne__(self, other):
- return not self.__eq__(other)
-
- def getallmatchingheaders(self, name):
- try:
- header = self[name]
- return [name + ": " + header]
- except KeyError:
- return []
-
- def getheaders(self, name):
- try:
- return [self[name]]
- except KeyError:
- return []
-
-
-class Response(object):
- """
- Details about a server response.
-
- Has the following attributes:
-
- :attr:`url`
- actual URL of the response (after redirects)
- :attr:`requested_url`
- original URL requested
- :attr:`code`
- HTTP response code (not set for FTP requests)
- :attr:`protocol`
- protocol used: http, https, or ftp
- :attr:`fromcache`
- True iff response was retrieved from local cache
- :attr:`headers`
- :class:`Headers` instance containing response headers
- """
-
- def __init__(self, url, requested_url, protocol='http', code=None,
- fromcache=False, headers={}):
- """
- :param url: the actual URL of the response (after following any
- redirects)
- :param requested_url: the original URL requested
- :param code: response code (if HTTP)
- :param fromcache: response was retrieved from local cache
- """
- self.url = url
- self.requested_url = requested_url
- self.protocol = protocol
- self.code = code
- self.fromcache = fromcache
- self.headers = Headers(headers)
-
- def info(self):
- return self.headers
-
-
-class MongoCache(object):
- """
- Implements the httplib2 cache protocol using MongoDB
- (especially useful with capped collection)
- """
-
- def __init__(self, collection):
- """
- :param collection: a pymongo collection obj to store the cache in
- """
- self.collection = collection
-
- def get(self, key):
- ret = self.collection.find_one({'_id': key})
- if ret:
- ret = ret['value']
- return ret
-
- def set(self, key, value):
- self.collection.save({'_id': key, 'value': value})
-
- def delete(self, key):
- self.collection.remove({'_id': key})
-
-
-class Scraper(object):
- """
- Scraper is the most important class provided by scrapelib (and generally
- the only one to be instantiated directly). It provides a large number
- of options allowing for customization.
-
- Usage is generally just creating an instance with the desired options and
- then using the :meth:`urlopen` & :meth:`urlretrieve` methods of that
- instance.
-
- :param user_agent: the value to send as a User-Agent header on
- HTTP requests (default is "scrapelib |release|")
- :param cache_dir: if not None, http caching will be enabled with
- cached pages stored under the supplied path
- :param requests_per_minute: maximum requests per minute (0 for
- unlimited, defaults to 60)
- :param follow_robots: respect robots.txt files (default: True)
- :param error_dir: if not None, store scraped documents for which
- an error was encountered. (TODO: document with blocks)
- :param accept_cookies: set to False to disable HTTP cookie support
- :param disable_compression: set to True to not accept compressed content
- :param use_cache_first: set to True to always make an attempt to use cached
- data, before even making a HEAD request to check if content is stale
- :param raise_errors: set to True to raise a :class:`HTTPError`
- on 4xx or 5xx response
- :param follow_redirects: set to False to disable redirect following
- :param timeout: socket timeout in seconds (default: None)
- :param retry_attempts: number of times to retry if timeout occurs or
- page returns a (non-404) error
- :param retry_wait_seconds: number of seconds to retry after first failure,
- subsequent retries will double this wait
- :param cache_obj: object to use for non-file based cache. scrapelib
- provides :class:`MongoCache` for this purpose
- """
- def __init__(self, user_agent=_user_agent,
- cache_dir=None,
- headers={},
- requests_per_minute=60,
- follow_robots=True,
- error_dir=None,
- accept_cookies=True,
- disable_compression=False,
- use_cache_first=False,
- raise_errors=True,
- follow_redirects=True,
- timeout=None,
- retry_attempts=0,
- retry_wait_seconds=5,
- cache_obj=None,
- **kwargs):
- self.user_agent = user_agent
- self.headers = headers
- # weird quirk between 0/None for timeout, accept either
- if timeout == 0:
- timeout = None
- self.timeout = timeout
-
- self.follow_robots = follow_robots
- self._robot_parsers = {}
-
- self.requests_per_minute = requests_per_minute
-
- if cache_dir and not USE_HTTPLIB2:
- _log.warning("httplib2 not available, HTTP caching "
- "and compression will be disabled.")
-
- self.error_dir = error_dir
- if self.error_dir:
- try:
- os.makedirs(error_dir)
- except OSError, e:
- if e.errno != 17:
- raise
- self.save_errors = True
- else:
- self.save_errors = False
-
- self.accept_cookies = accept_cookies
- self._cookie_jar = cookielib.CookieJar()
-
- self.disable_compression = disable_compression
-
- self.use_cache_first = use_cache_first
- self.raise_errors = raise_errors
-
- if USE_HTTPLIB2:
- self._cache_obj = cache_dir
- if cache_obj:
- self._cache_obj = cache_obj
- self._http = httplib2.Http(self._cache_obj, timeout=timeout)
- else:
- self._http = None
-
- self.follow_redirects = follow_redirects
-
- self.retry_attempts = max(retry_attempts, 0)
- self.retry_wait_seconds = retry_wait_seconds
-
- def _throttle(self):
- now = time.time()
- diff = self._request_frequency - (now - self._last_request)
- if diff > 0:
- _log.debug("sleeping for %fs" % diff)
- time.sleep(diff)
- self._last_request = time.time()
- else:
- self._last_request = now
-
- def _robot_allowed(self, user_agent, parsed_url):
- _log.info("checking robots permission for %s" % parsed_url.geturl())
- robots_url = urlparse.urljoin(parsed_url.scheme + "://" +
- parsed_url.netloc, "robots.txt")
-
- try:
- parser = self._robot_parsers[robots_url]
- _log.info("using cached copy of %s" % robots_url)
- except KeyError:
- _log.info("grabbing %s" % robots_url)
- parser = robotparser.RobotFileParser()
- parser.set_url(robots_url)
- parser.read()
- self._robot_parsers[robots_url] = parser
-
- return parser.can_fetch(user_agent, parsed_url.geturl())
-
- def _make_headers(self, url):
- if callable(self.headers):
- headers = self.headers(url)
- else:
- headers = self.headers
-
- if self.accept_cookies:
- # CookieJar expects a urllib2.Request-like object
- req = urllib2.Request(url, headers=headers)
- self._cookie_jar.add_cookie_header(req)
- headers = req.headers
- headers.update(req.unredirected_hdrs)
-
- headers = Headers(headers)
-
- if 'User-Agent' not in headers:
- headers['User-Agent'] = self.user_agent
-
- if self.disable_compression and 'Accept-Encoding' not in headers:
- headers['Accept-Encoding'] = 'text/*'
-
- return headers
-
- def _wrap_result(self, response, body):
- if self.raise_errors and response.code >= 400:
- raise HTTPError(response, body)
-
- if isinstance(body, unicode):
- return ResultUnicode(self, response, body)
-
- if isinstance(body, str):
- return ResultStr(self, response, body)
-
- raise ValueError('expected body string')
-
- @property
- def follow_redirects(self):
- if self._http:
- return self._http.follow_redirects
- return False
-
- @follow_redirects.setter
- def follow_redirects(self, value):
- if self._http:
- self._http.follow_redirects = value
-
- @property
- def requests_per_minute(self):
- return self._requests_per_minute
-
- @requests_per_minute.setter
- def requests_per_minute(self, value):
- if value > 0:
- self._throttled = True
- self._requests_per_minute = value
- self._request_frequency = 60.0 / value
- self._last_request = 0
- else:
- self._throttled = False
- self._requests_per_minute = 0
- self._request_frequency = 0.0
- self._last_request = 0
-
- def _do_request(self, url, method, body, headers, use_httplib2,
- retry_on_404=False):
-
- # initialization for this request
- if not use_httplib2:
- req = urllib2.Request(url, data=body, headers=headers)
- if self.accept_cookies:
- self._cookie_jar.add_cookie_header(req)
-
- # the retry loop
- tries = 0
- exception_raised = None
-
- while tries <= self.retry_attempts:
- exception_raised = None
-
- if use_httplib2:
- try:
- resp, content = self._http.request(url, method, body=body,
- headers=headers)
- # return on a success/redirect/404
- if resp.status < 400 or (resp.status == 404
- and not retry_on_404):
- return resp, content
- except socket.error, e:
- exception_raised = e
- except AttributeError, e:
- if (str(e) ==
- "'NoneType' object has no attribute 'makefile'"):
- # when this error occurs, re-establish the connection
- self._http = httplib2.Http(self._cache_obj,
- timeout=self.timeout)
- exception_raised = e
- else:
- raise
- else:
- try:
- _log.info("getting %s using urllib2" % url)
- resp = urllib2.urlopen(req, timeout=self.timeout)
- if self.accept_cookies:
- self._cookie_jar.extract_cookies(resp, req)
-
- return resp
- except urllib2.URLError, e:
- exception_raised = e
- if getattr(e, 'code', None) == 404 and not retry_on_404:
- raise e
-
- # if we're going to retry, sleep first
- tries += 1
- if tries <= self.retry_attempts:
- # twice as long each time
- wait = self.retry_wait_seconds * (2 ** (tries - 1))
- _log.debug('sleeping for %s seconds before retry' % wait)
- time.sleep(wait)
-
- if exception_raised:
- raise exception_raised
- else:
- return resp, content
-
- def urlopen(self, url, method='GET', body=None, retry_on_404=False):
- """
- Make an HTTP request and return a
- :class:`ResultStr` or :class:`ResultUnicode` object.
-
- If an error is encountered may raise any of the scrapelib
- `exceptions`_.
-
- :param url: URL for request
- :param method: any valid HTTP method, but generally GET or POST
- :param body: optional body for request, to turn parameters into
- an appropriate string use :func:`urllib.urlencode()`
- :param retry_on_404: if retries are enabled, retry if a 404 is
- encountered, this should only be used on pages known to exist
- if retries are not enabled this parameter does nothing
- (default: False)
- """
- if self._throttled:
- self._throttle()
-
- method = method.upper()
- if method == 'POST' and body is None:
- body = ''
-
- # Default to HTTP requests
- if not "://" in url:
- _log.warning("no URL scheme provided, assuming HTTP")
- url = "http://" + url
-
- parsed_url = urlparse.urlparse(url)
-
- headers = self._make_headers(url)
- user_agent = headers['User-Agent']
-
- if parsed_url.scheme in ['http', 'https']:
- if self.follow_robots and not self._robot_allowed(user_agent,
- parsed_url):
- raise RobotExclusionError(
- "User-Agent '%s' not allowed at '%s'" % (
- user_agent, url), url, user_agent)
-
- if USE_HTTPLIB2:
- _log.info("getting %s using HTTPLIB2" % url)
-
- # make sure POSTs have x-www-form-urlencoded content type
- if method == 'POST' and 'Content-Type' not in headers:
- headers['Content-Type'] = ('application/'
- 'x-www-form-urlencoded')
-
- # optionally try a dummy request to cache only
- if self.use_cache_first and 'Cache-Control' not in headers:
- headers['cache-control'] = 'only-if-cached'
-
- resp, content = self._http.request(url, method,
- body=body,
- headers=headers)
- if resp.status == 504:
- headers.pop('cache-control')
- resp = content = None
- else:
- resp = content = None
-
- # do request if there's no copy in local cache
- if not resp:
- resp, content = self._do_request(url, method, body,
- headers,
- use_httplib2=True,
- retry_on_404=retry_on_404)
-
- our_resp = Response(resp.get('content-location') or url,
- url,
- code=resp.status,
- fromcache=resp.fromcache,
- protocol=parsed_url.scheme,
- headers=resp)
-
- # important to accept cookies before redirect handling
- if self.accept_cookies:
- fake_req = urllib2.Request(url, headers=headers)
- self._cookie_jar.extract_cookies(our_resp, fake_req)
-
- # needed because httplib2 follows the HTTP spec a bit *too*
- # closely and won't issue a GET following a POST (incorrect
- # but expected and often seen behavior)
- if (resp.status in (301, 302, 303, 307) and
- self.follow_redirects):
-
- if resp['location'].startswith('http'):
- redirect = resp['location']
- else:
- redirect = urlparse.urljoin(parsed_url.scheme +
- "://" +
- parsed_url.netloc +
- parsed_url.path,
- resp['location'])
- _log.debug('redirecting to %s' % redirect)
- resp = self.urlopen(redirect)
- resp.response.requested_url = url
- return resp
-
- return self._wrap_result(our_resp, content)
- else:
- # not an HTTP(S) request
- if method != 'GET':
- raise HTTPMethodUnavailableError(
- "non-HTTP(S) requests do not support method '%s'" %
- method, method)
-
- if method not in ['GET', 'POST']:
- raise HTTPMethodUnavailableError(
- "urllib2 does not support '%s' method" % method, method)
-
- resp = self._do_request(url, method, body, headers,
- use_httplib2=False, retry_on_404=retry_on_404)
-
- our_resp = Response(resp.geturl(), url, code=resp.code,
- fromcache=False, protocol=parsed_url.scheme,
- headers=resp.headers)
-
- return self._wrap_result(our_resp, resp.read())
-
- def urlretrieve(self, url, filename=None, method='GET', body=None):
- """
- Save result of a request to a file, similarly to
- :func:`urllib.urlretrieve`.
-
- If an error is encountered may raise any of the scrapelib
- `exceptions`_.
-
- A filename may be provided or :meth:`urlretrieve` will safely create a
- temporary file. Either way it is the responsibility of the caller
- to ensure that the temporary file is deleted when it is no longer
- needed.
-
- :param url: URL for request
- :param filename: optional name for file
- :param method: any valid HTTP method, but generally GET or POST
- :param body: optional body for request, to turn parameters into
- an appropriate string use :func:`urllib.urlencode()`
- :returns filename, response: tuple with filename for saved
- response (will be same as given filename if one was given,
- otherwise will be a temp file in the OS temp directory) and
- a :class:`Response` object that can be used to inspect the
- response headers.
- """
- result = self.urlopen(url, method, body)
-
- if not filename:
- fd, filename = tempfile.mkstemp()
- f = os.fdopen(fd, 'w')
- else:
- f = open(filename, 'w')
-
- f.write(result)
- f.close()
-
- return filename, result.response
-
- def _save_error(self, url, body):
- exception = sys.exc_info()[1]
-
- out = {'exception': repr(exception),
- 'url': url,
- 'body': body,
- 'when': str(datetime.datetime.now())}
-
- base_path = os.path.join(self.error_dir, url.replace('/', ','))
- path = base_path
-
- n = 0
- while os.path.exists(path):
- n += 1
- path = base_path + "-%d" % n
-
- with open(path, 'w') as fp:
- json.dump(out, fp, ensure_ascii=False)
-
-_default_scraper = Scraper(follow_robots=False, requests_per_minute=0)
-
-
-def urlopen(url):
- return _default_scraper.urlopen(url)
-
-
-def scrapeshell():
- try:
- from IPython import embed
- except ImportError:
- try:
- from IPython.Shell import IPShellEmbed
- embed = IPShellEmbed()
- except ImportError:
- print 'scrapeshell requires ipython'
- return
- try:
- import argparse
- except ImportError:
- print 'scrapeshell requires argparse'
- return
- try:
- import lxml.html
- USE_LXML = True
- except ImportError:
- USE_LXML = False
-
- parser = argparse.ArgumentParser(description='interactive python shell for'
- ' scraping')
- parser.add_argument('url', help="url to scrape")
- parser.add_argument('--ua', dest='user_agent', default=_user_agent,
- help='user agent to make requests with')
- parser.add_argument('--robots', dest='robots', action='store_true',
- default=False, help='obey robots.txt')
- parser.add_argument('--noredirect', dest='redirects', action='store_false',
- default=True, help="don't follow redirects")
-
- args = parser.parse_args()
-
- scraper = Scraper(user_agent=args.user_agent,
- follow_robots=args.robots,
- follow_redirects=args.redirects)
- url = args.url
- html = scraper.urlopen(args.url)
-
- if USE_LXML:
- doc = lxml.html.fromstring(html)
-
- print 'local variables'
- print '---------------'
- print 'url: %s' % url
- print 'html: `scrapelib.ResultStr` instance'
- if USE_LXML:
- print 'doc: `lxml HTML element`'
- import sys
- sys.argv = []
- embed()
439 scrapelib/__init__.py
@@ -0,0 +1,439 @@
+import logging
+import os
+import sys
+import tempfile
+import time
+
+import requests
+from .cache import CachingSession, FileCache # noqa
+
+if sys.version_info[0] < 3: # pragma: no cover
+ from urllib2 import urlopen as urllib_urlopen
+ from urllib2 import URLError as urllib_URLError
+ import urlparse
+ import robotparser
+ _str_type = unicode
+else: # pragma: no cover
+ PY3K = True
+ from urllib.request import urlopen as urllib_urlopen
+ from urllib.error import URLError as urllib_URLError
+ from urllib import parse as urlparse
+ from urllib import robotparser
+ _str_type = str
+
+__version__ = '0.9.0'
+_user_agent = ' '.join(('scrapelib', __version__,
+ requests.utils.default_user_agent()))
+
+
+class NullHandler(logging.Handler):
+ def emit(self, record):
+ pass
+
+_log = logging.getLogger('scrapelib')
+_log.addHandler(NullHandler())
+
+
+class RobotExclusionError(requests.RequestException):
+ """
+ Raised when an attempt is made to access a page denied by
+ the host's robots.txt file.
+ """
+
+ def __init__(self, message, url, user_agent):
+ super(RobotExclusionError, self).__init__(message)
+ self.url = url
+ self.user_agent = user_agent
+
+
+class HTTPMethodUnavailableError(requests.RequestException):
+ """
+ Raised when the supplied HTTP method is invalid or not supported
+ by the HTTP backend.
+ """
+
+ def __init__(self, message, method):
+ super(HTTPMethodUnavailableError, self).__init__(message)
+ self.method = method
+
+
+class HTTPError(requests.HTTPError):
+ """
+ Raised when urlopen encounters a 4xx or 5xx error code and the
+ raise_errors option is true.
+ """
+
+ def __init__(self, response, body=None):
+ message = '%s while retrieving %s' % (response.status_code,
+ response.url)
+ super(HTTPError, self).__init__(message)
+ self.response = response
+ self.body = body or self.response.text
+
+
+class FTPError(requests.HTTPError):
+ def __init__(self, url):
+ message = 'error while retrieving %s' % url
+ super(FTPError, self).__init__(message)
+
+
+class ResultStr(_str_type):
+ """
+ Wrapper for responses. Can be treated identically to a ``str``
+ to get the body of the response; additional headers, etc. are
+ available via the ``response`` attribute.
+ """
+ def __new__(cls, scraper, response, requested_url):
+ try:
+ self = _str_type.__new__(cls, response.text)
+ except TypeError:
+ # use UTF8 as a default encoding if one couldn't be guessed
+ response.encoding = 'utf8'
+ self = _str_type.__new__(cls, response.text)
+ self._scraper = scraper
+ self.bytes = response.content
+ self.encoding = response.encoding
+ self.response = response
+ # augment self.response
+ # manually set: requested_url
+ # aliases: code -> status_code
+ self.response.requested_url = requested_url
+ self.response.code = self.response.status_code
+ return self
+
+
+class ThrottledSession(requests.Session):
+ def _throttle(self):
+ now = time.time()
+ diff = self._request_frequency - (now - self._last_request)
+ if diff > 0:
+ _log.debug("sleeping for %fs" % diff)
+ time.sleep(diff)
+ self._last_request = time.time()
+ else:
+ self._last_request = now
+
+ @property
+ def requests_per_minute(self):
+ return self._requests_per_minute
+
+ @requests_per_minute.setter
+ def requests_per_minute(self, value):
+ if value > 0:
+ self._throttled = True
+ self._requests_per_minute = value
+ self._request_frequency = 60.0 / value
+ self._last_request = 0
+ else:
+ self._throttled = False
+ self._requests_per_minute = 0
+ self._request_frequency = 0.0
+ self._last_request = 0
+
+ def request(self, method, url, **kwargs):
+ if self._throttled:
+ self._throttle()
+ return super(ThrottledSession, self).request(method, url, **kwargs)
+
+
+class RobotsTxtSession(requests.Session):
+
+ def __init__(self):
+ super(RobotsTxtSession, self).__init__()
+ self._robot_parsers = {}
+ self.follow_robots = True
+
+ def _robot_allowed(self, user_agent, parsed_url):
+ _log.info("checking robots permission for %s" % parsed_url.geturl())
+ robots_url = urlparse.urljoin(parsed_url.scheme + "://" +
+ parsed_url.netloc, "robots.txt")
+
+ try:
+ parser = self._robot_parsers[robots_url]
+ _log.info("using cached copy of %s" % robots_url)
+ except KeyError:
+ _log.info("grabbing %s" % robots_url)
+ parser = robotparser.RobotFileParser()
+ parser.set_url(robots_url)
+ parser.read()
+ self._robot_parsers[robots_url] = parser
+
+ return parser.can_fetch(user_agent, parsed_url.geturl())
+
+ def request(self, method, url, **kwargs):
+ parsed_url = urlparse.urlparse(url)
+ user_agent = (kwargs.get('headers', {}).get('User-Agent') or
+ self.headers.get('User-Agent'))
+ # robots.txt is http-only
+ if (parsed_url.scheme in ('http', 'https') and
+ self.follow_robots and
+ not self._robot_allowed(user_agent, parsed_url)):
+ raise RobotExclusionError(
+ "User-Agent '%s' not allowed at '%s'" % (
+ user_agent, url), url, user_agent)
+
+ return super(RobotsTxtSession, self).request(method, url, **kwargs)
+
+
+# this object exists because Requests assumes it can call
+# resp.raw._original_response.msg.getheaders() and we need to cope with that
+class DummyObject(object):
+ def getheaders(self, name):
+ return ''
+
+ def get_all(self, name, default):
+ return default
+
+_dummy = DummyObject()
+_dummy._original_response = DummyObject()
+_dummy._original_response.msg = DummyObject()
+
+
+class FTPAdapter(requests.adapters.BaseAdapter):
+
+ def send(self, request, stream=False, timeout=None, verify=False,
+ cert=None, proxies=None):
+ if request.method != 'GET':
+ raise HTTPMethodUnavailableError(
+ "FTP requests do not support method '%s'" % request.method,
+ request.method)
+ try:
+ real_resp = urllib_urlopen(request.url, timeout=timeout)
+ # we're going to fake a requests.Response with this
+ resp = requests.Response()
+ resp.status_code = 200
+ resp.url = request.url
+ resp.headers = {}
+ resp._content = real_resp.read()
+ resp.raw = _dummy
+ return resp
+ except urllib_URLError:
+ raise FTPError(request.url)
+
+
+class RetrySession(requests.Session):
+
+ def __init__(self):
+ super(RetrySession, self).__init__()
+ self._retry_attempts = 0
+ self.retry_wait_seconds = 10
+
+ # retry_attempts is a property so that it can't go negative
+ @property
+ def retry_attempts(self):
+ return self._retry_attempts
+
+ @retry_attempts.setter
+ def retry_attempts(self, value):
+ self._retry_attempts = max(value, 0)
+
+ def accept_response(self, response, **kwargs):
+ return response.status_code < 400
+
+ def request(self, method, url, retry_on_404=False, **kwargs):
+ # the retry loop
+ tries = 0
+ exception_raised = None
+
+ while tries <= self.retry_attempts:
+ exception_raised = None
+
+ try:
+ resp = super(RetrySession, self).request(method, url, **kwargs)
+ # break from loop on an accepted response
+ if self.accept_response(resp) or (resp.status_code == 404
+ and not retry_on_404):
+ break
+
+ except (requests.HTTPError, requests.ConnectionError,
+ requests.Timeout) as e:
+ exception_raised = e
+
+ # if we're going to retry, sleep first
+ tries += 1
+ if tries <= self.retry_attempts:
+ # twice as long each time
+ wait = (self.retry_wait_seconds * (2 ** (tries - 1)))
+ _log.debug('sleeping for %s seconds before retry' % wait)
+ time.sleep(wait)
+
+ # out of the loop, either an exception was raised or we had a success
+ if exception_raised:
+ raise exception_raised
+ else:
+ return resp
+
+
+# compose sessions, order matters
+class Scraper(RobotsTxtSession, # first, check robots.txt
+ CachingSession, # cache responses
+ ThrottledSession, # throttle requests
+ RetrySession, # do retries
+ ):
+ """
+ Scraper is the most important class provided by scrapelib (and generally
+ the only one to be instantiated directly). It provides a large number
+ of options allowing for customization.
+
+ Usage is generally just creating an instance with the desired options and
+ then using the :meth:`urlopen` & :meth:`urlretrieve` methods of that
+ instance.
+
+ :param raise_errors: set to True to raise a :class:`HTTPError`
+ on 4xx or 5xx response
+ :param requests_per_minute: maximum requests per minute (0 for
+ unlimited, defaults to 60)
+ :param follow_robots: respect robots.txt files (default: True)
+ :param retry_attempts: number of times to retry if timeout occurs or
+ page returns a (non-404) error
+ :param retry_wait_seconds: number of seconds to wait after the first
+ failure; subsequent retries will double this wait
+ """
+ def __init__(self,
+ raise_errors=True,
+ requests_per_minute=60,
+ follow_robots=True,
+ retry_attempts=0,
+ retry_wait_seconds=5,
+ header_func=None):
+
+ super(Scraper, self).__init__()
+ self.mount('ftp://', FTPAdapter())
+
+ # added by this class
+ self.raise_errors = raise_errors
+
+ # added by ThrottledSession
+ self.requests_per_minute = requests_per_minute
+
+ # added by RobotsTxtSession
+ self.follow_robots = follow_robots
+
+ # added by RetrySession
+ self.retry_attempts = retry_attempts
+ self.retry_wait_seconds = retry_wait_seconds
+
+ # added by this class
+ self._header_func = header_func
+
+ # added by CachingSession
+ self.cache_storage = None
+ self.cache_write_only = True
+
+ # non-parameter options
+ self.timeout = None
+ self.user_agent = _user_agent
+
+ @property
+ def user_agent(self):
+ return self.headers['User-Agent']
+
+ @user_agent.setter
+ def user_agent(self, value):
+ self.headers['User-Agent'] = value
+
+ @property
+ def disable_compression(self):
+ return self.headers['Accept-Encoding'] == 'text/*'
+
+ @disable_compression.setter
+ def disable_compression(self, value):
+ # disabled: set encoding to text/*
+ if value:
+ self.headers['Accept-Encoding'] = 'text/*'
+ # enabled: if currently text/*, restore defaults; otherwise leave unmodified
+ elif self.headers.get('Accept-Encoding') == 'text/*':
+ self.headers['Accept-Encoding'] = 'gzip, deflate, compress'
+
+ def request(self, method, url, **kwargs):
+ # apply global timeout
+ timeout = kwargs.pop('timeout', self.timeout)
+
+ if self._header_func:
+ headers = requests.structures.CaseInsensitiveDict(
+ self._header_func(url))
+ else:
+ headers = {}
+ try:
+ # requests < 1.2.2
+ headers = requests.sessions.merge_kwargs(headers, self.headers)
+ headers = requests.sessions.merge_kwargs(kwargs.pop('headers', {}),
+ headers)
+ except AttributeError:
+ # requests >= 1.2.2
+ headers = requests.sessions.merge_setting(headers, self.headers)
+ headers = requests.sessions.merge_setting(
+ kwargs.pop('headers', {}), headers)
+
+ return super(Scraper, self).request(method, url, timeout=timeout,
+ headers=headers, **kwargs)
+
+ def urlopen(self, url, method='GET', body=None, retry_on_404=False):
+ """
+ Make an HTTP request and return a :class:`ResultStr` object.
+
+ If an error is encountered may raise any of the scrapelib
+ `exceptions`_.
+
+ :param url: URL for request
+ :param method: any valid HTTP method, but generally GET or POST
+ :param body: optional body for request, to turn parameters into
+ an appropriate string use :func:`urllib.urlencode()`
+ :param retry_on_404: if retries are enabled, retry if a 404 is
+ encountered; this should only be used on pages known to exist.
+ If retries are not enabled this parameter does nothing
+ (default: False)
+ """
+
+ _log.info("{0} - {1}".format(method.upper(), url))
+
+ resp = self.request(method, url, data=body, retry_on_404=retry_on_404)
+
+ if self.raise_errors and not self.accept_response(resp):
+ raise HTTPError(resp)
+ else:
+ return ResultStr(self, resp, url)
+
+ def urlretrieve(self, url, filename=None, method='GET', body=None):
+ """
+ Save result of a request to a file, similarly to
+ :func:`urllib.urlretrieve`.
+
+ If an error is encountered may raise any of the scrapelib
+ `exceptions`_.
+
+ A filename may be provided or :meth:`urlretrieve` will safely create a
+ temporary file. Either way it is the responsibility of the caller
+ to ensure that the temporary file is deleted when it is no longer
+ needed.
+
+ :param url: URL for request
+ :param filename: optional name for file
+ :param method: any valid HTTP method, but generally GET or POST
+ :param body: optional body for request, to turn parameters into
+ an appropriate string use :func:`urllib.urlencode()`
+ :returns filename, response: tuple with filename for saved
+ response (will be same as given filename if one was given,
+ otherwise will be a temp file in the OS temp directory) and
+ a :class:`Response` object that can be used to inspect the
+ response headers.
+ """
+ result = self.urlopen(url, method, body)
+
+ if not filename:
+ fd, filename = tempfile.mkstemp()
+ f = os.fdopen(fd, 'wb')
+ else:
+ f = open(filename, 'wb')
+
+ f.write(result.bytes)
+ f.close()
+
+ return filename, result.response
+
+
+_default_scraper = Scraper(follow_robots=False, requests_per_minute=0)
+
+
+def urlopen(url, method='GET', body=None): # pragma: no cover
+ return _default_scraper.urlopen(url, method, body)
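
A hedged usage sketch of the Scraper defined above, exercising the
``header_func`` hook, the ``timeout`` attribute, and ``urlretrieve``; the URL,
header name, and user agent string are placeholders::

    from scrapelib import Scraper

    def extra_headers(url):
        # called once per request; must return a mapping of header names to values
        return {'X-Example': url}

    s = Scraper(requests_per_minute=0,       # disable throttling
                follow_robots=False,
                header_func=extra_headers)
    s.user_agent = 'example-scraper/1.0'
    s.timeout = 30                           # applied to every request unless overridden

    # urlretrieve saves the body to a (temp) file and returns (filename, response)
    filename, response = s.urlretrieve('http://httpbin.org/get')   # placeholder URL
    print('%s -> %s' % (filename, response.code))
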
57 scrapelib/__main__.py
@@ -0,0 +1,57 @@
+from . import Scraper, _user_agent
+import argparse
+
+
+def scrapeshell(): # pragma: no cover
+ # clear argv for IPython
+ import sys
+ orig_argv = sys.argv[1:]
+ sys.argv = sys.argv[:1]
+
+ try:
+ from IPython import embed
+ except ImportError:
+ print('scrapeshell requires ipython >= 0.11')
+ return
+ try:
+ import lxml.html
+ USE_LXML = True
+ except ImportError:
+ USE_LXML = False
+
+ parser = argparse.ArgumentParser(prog='scrapeshell',
+ description='interactive python shell for'
+ ' scraping')
+ parser.add_argument('url', help="url to scrape")
+ parser.add_argument('--ua', dest='user_agent', default=_user_agent,
+ help='user agent to make requests with')
+ parser.add_argument('--robots', dest='robots', action='store_true',
+ default=False, help='obey robots.txt')
+ parser.add_argument('-p', '--postdata', dest='postdata',
+ default=None,
+ help="POST data (will make a POST instead of GET)")
+ args = parser.parse_args(orig_argv)
+
+ scraper = Scraper(follow_robots=args.robots)
+ scraper.user_agent = args.user_agent
+ url = args.url
+ if args.postdata:
+ html = scraper.urlopen(args.url, 'POST', args.postdata)
+ else:
+ html = scraper.urlopen(args.url)
+
+ if USE_LXML:
+ doc = lxml.html.fromstring(html.bytes) # noqa
+
+ print('local variables')
+ print('---------------')
+ print('url: %s' % url)
+ print('html: `scrapelib.ResultStr` instance')
+ if USE_LXML:
+ print('doc: `lxml HTML element`')
+ else:
+ print('doc not available: lxml not installed')
+ embed()
+
+
+scrapeshell()
157 scrapelib/cache.py
@@ -0,0 +1,157 @@
+"""
+ module providing caching support for requests
+
+ use CachingSession in place of requests.Session to take advantage
+"""
+import re
+import os
+import glob
+import hashlib
+import requests
+
+
+class CachingSession(requests.Session):
+ def __init__(self, cache_storage=None):
+ super(CachingSession, self).__init__()
+ self.cache_storage = cache_storage
+ self.cache_write_only = False
+
+ def key_for_request(self, method, url, **kwargs):
+ """ Return a cache key from a given set of request parameters.
+
+ Default behavior is to return a complete URL for all GET
+ requests, and None otherwise.
+
+ Can be overridden if caching of non-GET requests is desired.
+ """
+ if method != 'get':
+ return None
+
+ return requests.Request(url=url,
+ params=kwargs.get('params', {})).prepare().url
+
+ def should_cache_response(self, response):
+ """ Check if a given Response object should be cached.
+
+ Default behavior is to only cache responses with a 200
+ status code.
+ """
+ return response.status_code == 200
+
+ def request(self, method, url, **kwargs):
+ """ Override, wraps Session.request in caching.
+
+ Cache is only used if key_for_request returns a valid key
+ and should_cache_response was true as well.
+ """
+ # short circuit if cache isn't configured
+ if not self.cache_storage:
+ resp = super(CachingSession, self).request(method, url, **kwargs)
+ resp.fromcache = False
+ return resp
+
+ resp = None
+ method = method.lower()
+
+ request_key = self.key_for_request(method, url, **kwargs)
+
+ if request_key and not self.cache_write_only:
+ resp = self.cache_storage.get(request_key)
+
+ if resp:
+ resp.fromcache = True
+ else:
+ resp = super(CachingSession, self).request(method, url, **kwargs)
+ # save to cache if request and response meet criteria
+ if request_key and self.should_cache_response(resp):
+ self.cache_storage.set(request_key, resp)
+ resp.fromcache = False
+
+ return resp
+
+
+class MemoryCache(object):
+ def __init__(self):
+ self.cache = {}
+
+ def get(self, key):
+ return self.cache.get(key, None)
+
+ def set(self, key, response):
+ self.cache[key] = response
+
+
+class FileCache(object):
+ # file name escaping inspired by httplib2
+ _prefix = re.compile(r'^\w+://')
+ _illegal = re.compile(r'[?/:|]+')
+ _header_re = re.compile(r'([-\w]+): (.*)')
+ _maxlen = 200
+
+ def _clean_key(self, key):
+ # strip scheme
+ md5 = hashlib.md5(key.encode('utf8')).hexdigest()
+ key = self._prefix.sub('', key)
+ key = self._illegal.sub(',', key)
+ return ','.join((key[:self._maxlen], md5))
+
+ def __init__(self, cache_dir):
+ # normalize path
+ self.cache_dir = os.path.join(os.getcwd(), cache_dir)
+ # create directory
+ os.path.isdir(self.cache_dir) or os.makedirs(self.cache_dir)
+
+ def get(self, orig_key):
+ resp = requests.Response()
+
+ key = self._clean_key(orig_key)
+ path = os.path.join(self.cache_dir, key)
+
+ try:
+ with open(path, 'rb') as f:
+ # read lines one at a time
+ while True:
+ line = f.readline().decode('utf8').strip('\r\n')
+ # set headers
+ header = self._header_re.match(line)
+ if header:
+ resp.headers[header.group(1)] = header.group(2)
+ else:
+ break
+ # everything left is the real content
+ resp._content = f.read()
+
+ # status & encoding will be in headers, but are faked
+ # need to split spaces out of status to get code (e.g. '200 OK')
+ resp.status_code = int(resp.headers.pop('status').split(' ')[0])
+ resp.encoding = resp.headers.pop('encoding')
+ resp.url = resp.headers.get('content-location', orig_key)
+ #TODO: resp.request = request
+ return resp
+ except IOError:
+ return None
+
+ def set(self, key, response):
+ key = self._clean_key(key)
+ path = os.path.join(self.cache_dir, key)
+
+ with open(path, 'wb') as f:
+ status_str = 'status: {0}\n'.format(response.status_code)
+ f.write(status_str.encode('utf8'))
+ encoding_str = 'encoding: {0}\n'.format(response.encoding)
+ f.write(encoding_str.encode('utf8'))
+ for h, v in response.headers.items():
+ # header: value\n
+ f.write(h.encode('utf8'))
+ f.write(b': ')
+ f.write(v.encode('utf8'))
+ f.write(b'\n')
+ # one blank line
+ f.write(b'\n')
+ f.write(response.content)
+
+ def clear(self):
+ # only delete things that end w/ a md5, less dangerous this way
+ cache_glob = '*,' + ('[0-9a-f]' * 32)
+ for fname in glob.glob(os.path.join(self.cache_dir, cache_glob)):
+ os.remove(fname)
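
A minimal sketch, assuming the Scraper from ``__init__.py``, of wiring the
``FileCache`` above into a session (Scraper mixes in ``CachingSession``); the
cache directory name and URL are arbitrary placeholders::

    from scrapelib import Scraper, FileCache

    s = Scraper(requests_per_minute=0, follow_robots=False)
    s.cache_storage = FileCache('cache')   # responses stored as files under ./cache
    s.cache_write_only = False             # read from the cache as well as writing to it

    first = s.urlopen('http://httpbin.org/get')
    second = s.urlopen('http://httpbin.org/get')
    print(first.response.fromcache)    # False: fetched from the network
    print(second.response.fromcache)   # True: served from the file cache
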
0  scrapelib/tests/__init__.py
No changes.
116 scrapelib/tests/test_cache.py
@@ -0,0 +1,116 @@
+import sys
+from nose.tools import assert_equal, assert_true
+
+import requests
+from ..cache import CachingSession, MemoryCache, FileCache
+
+DUMMY_URL = 'http://dummy/'
+HTTPBIN = 'http://httpbin.org/'
+
+
+def test_default_key_for_request():
+ cs = CachingSession()
+
+ # non-get methods
+ for method in ('post', 'head', 'put', 'delete', 'patch'):
+ assert_equal(cs.key_for_request(method, DUMMY_URL), None)
+
+ # simple get method
+ assert_equal(cs.key_for_request('get', DUMMY_URL), DUMMY_URL)
+ # now with params
+ assert_equal(cs.key_for_request('get', DUMMY_URL, params={'foo': 'bar'}),
+ DUMMY_URL + '?foo=bar')
+ # params in both places
+ assert_equal(cs.key_for_request('get', DUMMY_URL + '?abc=def',
+ params={'foo': 'bar'}),
+ DUMMY_URL + '?abc=def&foo=bar')
+
+
+def test_default_should_cache_response():
+ cs = CachingSession()
+ resp = requests.Response()
+ # only 200 should return True
+ resp.status_code = 200
+ assert_equal(cs.should_cache_response(resp), True)
+ for code in (203, 301, 302, 400, 403, 404, 500):
+ resp.status_code = code
+ assert_equal(cs.should_cache_response(resp), False)
+
+
+def test_no_cache_request():
+ cs = CachingSession()
+ # call twice, to prime cache (if it were enabled)
+ resp = cs.request('get', HTTPBIN + 'status/200')
+ resp = cs.request('get', HTTPBIN + 'status/200')
+ assert_equal(resp.status_code, 200)
+ assert_equal(resp.fromcache, False)
+
+
+def test_simple_cache_request():
+ cs = CachingSession(cache_storage=MemoryCache())
+ url = HTTPBIN + 'get'
+
+ # first response not from cache
+ resp = cs.request('get', url)
+ assert_equal(resp.fromcache, False)
+
+ assert_true(url in cs.cache_storage.cache)
+
+ # second response comes from cache
+ cached_resp = cs.request('get', url)
+ assert_equal(resp.text, cached_resp.text)
+ assert_equal(cached_resp.fromcache, True)
+
+
+def test_cache_write_only():
+ cs = CachingSession(cache_storage=MemoryCache())
+ cs.cache_write_only = True
+ url = HTTPBIN + 'get'
+
+ # first response not from cache
+ resp = cs.request('get', url)
+ assert_equal(resp.fromcache, False)
+
+ # response was written to cache
+ assert_true(url in cs.cache_storage.cache)
+
+ # but second response doesn't come from cache
+ cached_resp = cs.request('get', url)
+ assert_equal(cached_resp.fromcache, False)
+
+
+# test storages #####
+
+def _test_cache_storage(storage_obj):
+ # unknown key returns None
+ assert_true(storage_obj.get('one') is None)
+
+ _content_as_bytes = b"here's unicode: \xe2\x98\x83"
+ if sys.version_info[0] < 3:
+ _content_as_unicode = unicode("here's unicode: \u2603",
+ 'unicode_escape')
+ else:
+ _content_as_unicode = "here's unicode: \u2603"
+
+ # set 'one'
+ resp = requests.Response()
+ resp.headers['x-num'] = 'one'
+ resp.status_code = 200
+ resp._content = _content_as_bytes
+ storage_obj.set('one', resp)
+ cached_resp = storage_obj.get('one')
+ assert_equal(cached_resp.headers, {'x-num': 'one'})
+ assert_equal(cached_resp.status_code, 200)
+ cached_resp.encoding = 'utf8'
+ assert_equal(cached_resp.text, _content_as_unicode)
+
+
+def test_memory_cache():
+ _test_cache_storage(MemoryCache())
+
+
+def test_file_cache():
+ fc = FileCache('cache')
+ fc.clear()
+ _test_cache_storage(fc)
+ fc.clear()
378 scrapelib/tests/test_scraper.py
@@ -0,0 +1,378 @@
+import os
+import sys
+import glob
+import json
+import tempfile
+from io import BytesIO
+
+if sys.version_info[0] < 3:
+ import robotparser
+else:
+ from urllib import robotparser
+
+import mock
+from nose.tools import assert_equal, assert_raises
+import requests
+from .. import (Scraper, HTTPError, HTTPMethodUnavailableError,
+ RobotExclusionError, urllib_URLError, FTPError)
+from .. import _user_agent as default_user_agent
+from ..cache import MemoryCache
+
+HTTPBIN = 'http://httpbin.org/'
+
+
+class FakeResponse(object):
+ def __init__(self, url, code, content, encoding='utf-8', headers=None):
+ self.url = url
+ self.status_code = code
+ self.content = content
+ self.text = str(content)
+ self.encoding = encoding
+ self.headers = headers or {}
+
+
+def request_200(method, url, *args, **kwargs):
+ return FakeResponse(url, 200, b'ok')
+mock_200 = mock.Mock(wraps=request_200)
+
+
+def test_fields():
+ # timeout=0 means None
+ s = Scraper(requests_per_minute=100,
+ follow_robots=False,
+ raise_errors=False,
+ retry_attempts=-1, # will be 0
+ retry_wait_seconds=100)
+ assert s.requests_per_minute == 100
+ assert s.follow_robots is False
+ assert s.raise_errors is False
+ assert s.retry_attempts == 0 # -1 becomes 0
+ assert s.retry_wait_seconds == 100
+
+
+def test_get():
+ s = Scraper(requests_per_minute=0, follow_robots=False)
+ resp = s.urlopen(HTTPBIN + 'get?woo=woo')
+ assert_equal(resp.response.code, 200)
+ assert_equal(json.loads(resp)['args']['woo'], 'woo')
+
+
+def test_post():
+ s = Scraper(requests_per_minute=0, follow_robots=False)
+ resp = s.urlopen(HTTPBIN + 'post', 'POST', {'woo': 'woo'})
+ assert_equal(resp.response.code, 200)
+ resp_json = json.loads(resp)
+ assert_equal(resp_json['form']['woo'], 'woo')
+ assert_equal(resp_json['headers']['Content-Type'],
+ 'application/x-www-form-urlencoded')
+
+
+def test_request_throttling():
+ s = Scraper(requests_per_minute=30, follow_robots=False)
+ assert_equal(s.requests_per_minute, 30)
+
+ mock_sleep = mock.Mock()
+
+ # check that sleep is called on call 2 & 3
+ with mock.patch('time.sleep', mock_sleep):
+ with mock.patch.object(requests.Session, 'request', mock_200):
+ s.urlopen('http://dummy/')
+ s.urlopen('http://dummy/')
+ s.urlopen('http://dummy/')
+ assert_equal(mock_sleep.call_count, 2)
+ # should have slept for ~2 seconds
+ assert 1.8 <= mock_sleep.call_args[0][0] <= 2.2
+
+ # unthrottled, sleep shouldn't be called
+ s.requests_per_minute = 0
+ mock_sleep.reset_mock()
+
+ with mock.patch('time.sleep', mock_sleep):
+ with mock.patch.object(requests.Session, 'request', mock_200):
+ s.urlopen('http://dummy/')
+ s.urlopen('http://dummy/')
+ s.urlopen('http://dummy/')
+ assert_equal(mock_sleep.call_count, 0)
+
+
+def test_user_agent():
+ s = Scraper(requests_per_minute=0, follow_robots=False)
+ resp = s.urlopen(HTTPBIN + 'user-agent')
+ ua = json.loads(resp)['user-agent']
+ assert_equal(ua, default_user_agent)
+
+ s.user_agent = 'a different agent'
+ resp = s.urlopen(HTTPBIN + 'user-agent')
+ ua = json.loads(resp)['user-agent']
+ assert_equal(ua, 'a different agent')
+
+
+def test_user_agent_from_headers():
+ s = Scraper(requests_per_minute=0, follow_robots=False)
+ s.headers = {'User-Agent': 'from headers'}
+ resp = s.urlopen(HTTPBIN + 'user-agent')
+ ua = json.loads(resp)['user-agent']
+ assert_equal(ua, 'from headers')
+
+
+def test_follow_robots():
+ s = Scraper(requests_per_minute=0, follow_robots=True)
+
+ with mock.patch.object(requests.Session, 'request', mock_200):
+ # check that a robots.txt is created
+ s.urlopen(HTTPBIN)
+ assert HTTPBIN + 'robots.txt' in s._robot_parsers
+
+ # set a fake robots.txt for http://dummy
+ parser = robotparser.RobotFileParser()
+ parser.parse(['User-agent: *', 'Disallow: /private/', 'Allow: /'])
+ s._robot_parsers['http://dummy/robots.txt'] = parser
+
+ # anything behind private fails
+ assert_raises(RobotExclusionError, s.urlopen,
+ "http://dummy/private/secret.html")
+ # but others work
+ assert_equal(200, s.urlopen("http://dummy/").response.code)
+
+ # turn off follow_robots, everything works
+ s.follow_robots = False
+ assert_equal(
+ 200,
+ s.urlopen("http://dummy/private/secret.html").response.code)
+
+
+def test_404():
+ s = Scraper(requests_per_minute=0, follow_robots=False)
+ assert_raises(HTTPError, s.urlopen, HTTPBIN + 'status/404')
+
+ s.raise_errors = False
+ resp = s.urlopen(HTTPBIN + 'status/404')
+ assert_equal(404, resp.response.code)
+
+
+def test_500():
+ s = Scraper(requests_per_minute=0, follow_robots=False)
+
+ assert_raises(HTTPError, s.urlopen, HTTPBIN + 'status/500')
+
+ s.raise_errors = False
+ resp = s.urlopen(HTTPBIN + 'status/500')
+ assert_equal(500, resp.response.code)
+
+
+def test_caching():
+ cache_dir = tempfile.mkdtemp()
+ s = Scraper(requests_per_minute=0, follow_robots=False)
+ s.cache_storage = MemoryCache()
+ s.cache_write_only = False
+
+ resp = s.urlopen(HTTPBIN + 'status/200')
+ assert not resp.response.fromcache
+ resp = s.urlopen(HTTPBIN + 'status/200')
+ assert resp.response.fromcache
+
+ for path in glob.iglob(os.path.join(cache_dir, "*")):
+ os.remove(path)
+ os.rmdir(cache_dir)
+
+
+def test_urlretrieve():
+ s = Scraper(requests_per_minute=0, follow_robots=False)
+
+ with mock.patch.object(requests.Session, 'request', mock_200):
+ fname, resp = s.urlretrieve("http://dummy/")
+ with open(fname) as f:
+ assert_equal(f.read(), 'ok')
+ assert_equal(200, resp.code)
+ os.remove(fname)
+
+ (fh, set_fname) = tempfile.mkstemp()
+ fname, resp = s.urlretrieve("http://dummy/", set_fname)
+ assert_equal(fname, set_fname)
+ with open(set_fname) as f:
+ assert_equal(f.read(), 'ok')
+ assert_equal(200, resp.code)
+ os.remove(set_fname)
+
+## TODO: on these retry tests it'd be nice to ensure that it tries
+## 3 times for 500 and once for 404
+
+
+def test_retry():
+ s = Scraper(retry_attempts=3, retry_wait_seconds=0.001,
+ follow_robots=False, raise_errors=False)
+
+ # On the first call return a 500, then a 200
+ mock_request = mock.Mock(side_effect=[
+ FakeResponse('http://dummy/', 500, 'failure!'),
+ FakeResponse('http://dummy/', 200, 'success!')
+ ])
+
+ with mock.patch.object(requests.Session, 'request', mock_request):
+ resp = s.urlopen('http://dummy/')
+ assert_equal(mock_request.call_count, 2)
+
+ # 500 always
+ mock_request = mock.Mock(return_value=FakeResponse('http://dummy/', 500,
+ 'failure!'))
+
+ with mock.patch.object(requests.Session, 'request', mock_request):
+ resp = s.urlopen('http://dummy/')
+ assert_equal(resp.response.code, 500)
+ assert_equal(mock_request.call_count, 4)
+
+
+def test_retry_404():
+ s = Scraper(retry_attempts=3, retry_wait_seconds=0.001,
+ follow_robots=False, raise_errors=False)
+
+ # On the first call return a 404, then a 200
+ mock_request = mock.Mock(side_effect=[
+ FakeResponse('http://dummy/', 404, 'failure!'),
+ FakeResponse('http://dummy/', 200, 'success!')
+ ])
+
+ with mock.patch.object(requests.Session, 'request', mock_request):
+ resp = s.urlopen('http://dummy/', retry_on_404=True)
+ assert_equal(mock_request.call_count, 2)
+ assert_equal(resp.response.code, 200)
+
+ # 404 always
+ mock_request = mock.Mock(return_value=FakeResponse('http://dummy/', 404,
+ 'failure!'))
+
+ # retry on 404 true, 4 tries
+ with mock.patch.object(requests.Session, 'request', mock_request):
+ resp = s.urlopen('http://dummy/', retry_on_404=True)
+ assert_equal(resp.response.code, 404)
+ assert_equal(mock_request.call_count, 4)
+
+ # retry on 404 false, just one more try
+ with mock.patch.object(requests.Session, 'request', mock_request):
+ resp = s.urlopen('http://dummy/', retry_on_404=False)
+ assert_equal(resp.response.code, 404)
+ assert_equal(mock_request.call_count, 5)
+
+
+def test_timeout():
+ s = Scraper()
+ s.timeout = 0.001
+ s.follow_robots = False
+ with assert_raises(requests.Timeout):
+ s.urlopen(HTTPBIN + 'delay/1')
+
+
+def test_timeout_retry():
+ # TODO: make this work with the other requests exceptions
+ count = []
+
+ # On the first call raise timeout
+ def side_effect(*args, **kwargs):
+ if count:
+ return FakeResponse('http://dummy/', 200, 'success!')
+ count.append(1)
+ raise requests.Timeout('timed out :(')
+
+ mock_request = mock.Mock(side_effect=side_effect)
+
+ s = Scraper(retry_attempts=0, retry_wait_seconds=0.001,
+ follow_robots=False)
+
+ with mock.patch.object(requests.Session, 'request', mock_request):
+ # first, try without retries
+ # try only once, get the error
+ assert_raises(requests.Timeout, s.urlopen, "http://dummy/")
+ assert_equal(mock_request.call_count, 1)
+
+ # reset and try again with retries
+ mock_request.reset_mock()
+ count = []
+ s = Scraper(retry_attempts=2, retry_wait_seconds=0.001,
+ follow_robots=False)
+ with mock.patch.object(requests.Session, 'request', mock_request):
+ resp = s.urlopen("http://dummy/")
+ # get the result; it takes two tries
+ assert_equal(resp, "success!")
+ assert_equal(mock_request.call_count, 2)
+
+
+def test_disable_compression():
+ s = Scraper()
+ s.disable_compression = True
+
+ # compression disabled
+ data = s.urlopen(HTTPBIN + 'headers')
+ assert 'compress' not in json.loads(data)['headers']['Accept-Encoding']
+ assert 'gzip' not in json.loads(data)['headers']['Accept-Encoding']
+
+ # default is restored
+ s.disable_compression = False
+ data = s.urlopen(HTTPBIN + 'headers')
+ assert 'compress' in json.loads(data)['headers']['Accept-Encoding']
+ assert 'gzip' in json.loads(data)['headers']['Accept-Encoding']
+
+ # A supplied Accept-Encoding header overrides the
+ # disable_compression option
+ s.headers['Accept-Encoding'] = 'xyz'
+ data = s.urlopen(HTTPBIN + 'headers')
+ assert 'xyz' in json.loads(data)['headers']['Accept-Encoding']
+
+
+def test_callable_headers():
+ s = Scraper(header_func=lambda url: {'X-Url': url}, follow_robots=False)
+
+ data = s.urlopen(HTTPBIN + 'headers')
+ assert_equal(json.loads(data)['headers']['X-Url'], HTTPBIN + 'headers')
+
+ # Make sure it gets called freshly each time
+ data = s.urlopen(HTTPBIN + 'headers?shh')
+ assert_equal(json.loads(data)['headers']['X-Url'], HTTPBIN + 'headers?shh')
+
+
+def test_ftp_uses_urllib2():
+ s = Scraper(requests_per_minute=0, follow_robots=False)
+ urlopen = mock.Mock(return_value=BytesIO(b"ftp success!"))
+
+ with mock.patch('scrapelib.urllib_urlopen', urlopen):
+ r = s.urlopen('ftp://dummy/')
+ assert r.response.code == 200
+ assert r == "ftp success!"
+
+
+def test_ftp_retries():
+ count = []
+
+ # On the first call raise URLError, then work
+ def side_effect(*args, **kwargs):
+ if count:
+ return BytesIO(b"ftp success!")
+ count.append(1)
+ raise urllib_URLError('ftp failure!')
+
+ mock_urlopen = mock.Mock(side_effect=side_effect)
+
+ # retry on
+ with mock.patch('scrapelib.urllib_urlopen', mock_urlopen):
+ s = Scraper(retry_attempts=2, retry_wait_seconds=0.001,
+ follow_robots=False)
+ r = s.urlopen('ftp://dummy/', retry_on_404=True)
+ assert r == "ftp success!"
+ assert_equal(mock_urlopen.call_count, 2)
+
+ # retry off, retry_on_404 on (shouldn't matter)
+ count = []
+ mock_urlopen.reset_mock()
+ with mock.patch('scrapelib.urllib_urlopen', mock_urlopen):
+ s = Scraper(retry_attempts=0, retry_wait_seconds=0.001,
+ follow_robots=False)
+ assert_raises(FTPError, s.urlopen, 'ftp://dummy/',
+ retry_on_404=True)
+ assert_equal(mock_urlopen.call_count, 1)
+
+
+def test_ftp_method_restrictions():
+ s = Scraper(requests_per_minute=0, follow_robots=False)
+
+ # only http(s) supports non-'GET' requests
+ assert_raises(HTTPMethodUnavailableError, s.urlopen, "ftp://dummy/",
+ method='POST')
14 setup.py
@@ -1,17 +1,18 @@
#!/usr/bin/env python
-from setuptools import setup
-from scrapelib import __version__
+import sys
+from setuptools import setup, find_packages
long_description = open('README.rst').read()
setup(name="scrapelib",
- version=__version__,
+ version='0.9.0',
py_modules=['scrapelib'],
author="James Turk",
author_email='jturk@sunlightfoundation.com',
license="BSD",
url="http://github.com/sunlightlabs/scrapelib",
long_description=long_description,
+ packages=find_packages(),
description="a library for scraping things",
platforms=["any"],
classifiers=["Development Status :: 4 - Beta",
@@ -19,13 +20,14 @@
"License :: OSI Approved :: BSD License",
"Natural Language :: English",
"Operating System :: OS Independent",
- "Programming Language :: Python",
+ "Programming Language :: Python :: 2.7",
+ "Programming Language :: Python :: 3.3",
("Topic :: Software Development :: Libraries :: "
"Python Modules"),
],
- install_requires=["httplib2 >= 0.7.0"],
+ install_requires=['requests>=1.0'],
entry_points="""
[console_scripts]
-scrapeshell = scrapelib:scrapeshell
+scrapeshell = scrapelib.__main__:scrapeshell
"""
)
397 test.py
@@ -1,397 +0,0 @@
-import os
-import sys
-import glob
-import time
-import socket
-import urllib2
-import tempfile
-from multiprocessing import Process
-
-if sys.version_info[1] < 7:
- try:
- import unittest2 as unittest
- except ImportError:
- print 'Test Suite requires Python 2.7 or unittest2'
- sys.exit(1)
-else:
- import unittest
-
-import mock
-import flask
-import httplib2
-import scrapelib
-
-app = flask.Flask(__name__)
-app.config.shaky_fail = False
-app.config.shaky_404_fail = False
-
-@app.route('/')
-def index():
- resp = app.make_response("Hello world!")
- return resp
-
-
-@app.route('/ua')
-def ua():
- resp = app.make_response(flask.request.headers['user-agent'])
- resp.headers['cache-control'] = 'no-cache'
- return resp
-
-
-@app.route('/p/s.html')
-def secret():
- return "secret"
-
-
-@app.route('/redirect')
-def redirect():
- return flask.redirect(flask.url_for('index'))
-
-
-@app.route('/500')
-def fivehundred():
- flask.abort(500)
-
-
-@app.route('/robots.txt')
-def robots():
- return """
- User-agent: *
- Disallow: /p/
- Allow: /
- """
-
-
-@app.route('/shaky')
-def shaky():
- # toggle failure state each time
- app.config.shaky_fail = not app.config.shaky_fail
-
- if app.config.shaky_fail:
- flask.abort(500)
- else:
- return "shaky success!"
-
-@app.route('/shaky404')
-def shaky404():
- # toggle failure state each time
- app.config.shaky_404_fail = not app.config.shaky_404_fail
-
- if app.config.shaky_404_fail:
- flask.abort(404)
- else:
- return "shaky404 success!"
-
-def run_server():
- class NullFile(object):
- def write(self, s):
- pass
-
- sys.stdout = NullFile()
- sys.stderr = NullFile()
-
- app.run()
-
-
-class HeaderTest(unittest.TestCase):
- def test_keys(self):
- h = scrapelib.Headers()
- h['A'] = '1'
-
- self.assertEqual(h['A'], '1')
-
- self.assertEqual(h.getallmatchingheaders('A'), ["A: 1"])
- self.assertEqual(h.getallmatchingheaders('b'), [])
- self.assertEqual(h.getheaders('A'), ['1'])
- self.assertEqual(h.getheaders('b'), [])
-
- # should be case-insensitive
- self.assertEqual(h['a'], '1')
- h['a'] = '2'
- self.assertEqual(h['A'], '2')
-
- self.assert_('a' in h)
- self.assert_('A' in h)
-
- del h['A']
- self.assert_('a' not in h)
- self.assert_('A' not in h)
-
- def test_equality(self):
- h1 = scrapelib.Headers()
- h1['Accept-Encoding'] = '*'
-
- h2 = scrapelib.Headers()
- self.assertNotEqual(h1, h2)
-
- h2['accept-encoding'] = 'not'
- self.assertNotEqual(h1, h2)
-
- h2['accept-encoding'] = '*'
- self.assertEqual(h1, h2)
-
-
-class ScraperTest(unittest.TestCase):
- def setUp(self):
- self.cache_dir = tempfile.mkdtemp()
- self.error_dir = tempfile.mkdtemp()
- self.s = scrapelib.Scraper(requests_per_minute=0,
- error_dir=self.error_dir,
- cache_dir=self.cache_dir,
- use_cache_first=True)
-
- def tearDown(self):
- for path in glob.iglob(os.path.join(self.cache_dir, "*")):
- os.remove(path)
- os.rmdir(self.cache_dir)
- for path in glob.iglob(os.path.join(self.error_dir, "*")):
- os.remove(path)
- os.rmdir(self.error_dir)
-
- def test_get(self):
- self.assertEqual('Hello world!',
- self.s.urlopen("http://localhost:5000/"))
-
- def test_request_throttling(self):
- requests = 0
- s = scrapelib.Scraper(requests_per_minute=30)
- self.assertEqual(s.requests_per_minute, 30)
-
- begin = time.time()
- while time.time() <= (begin + 1):
- s.urlopen("http://localhost:5000/")
- requests += 1
- self.assert_(requests <= 2)
-
- s.requests_per_minute = 500
- requests = 0