merge changes

2 parents cec7e0a + 96bbf57 commit e01edf6902dd51fd4a0e0ac0664a286b5f17bae1 @JoeGermuska committed Nov 4, 2013
Showing with 1,329 additions and 1,148 deletions.
  1. +4 −0 .gitignore
  2. +9 −0 .travis.yml
  3. +2 −2 LICENSE
  4. +27 −14 README.rst
  5. +2 −0 coverage.sh
  6. +94 −1 docs/changelog.rst
  7. +3 −3 docs/conf.py
  8. +17 −8 docs/index.rst
  9. +2 −4 docs/scrapelib.rst
  10. +1 −2 requirements.txt
  11. +0 −711 scrapelib.py
  12. +439 −0 scrapelib/__init__.py
  13. +57 −0 scrapelib/__main__.py
  14. +157 −0 scrapelib/cache.py
  15. 0 scrapelib/tests/__init__.py
  16. +116 −0 scrapelib/tests/test_cache.py
  17. +378 −0 scrapelib/tests/test_scraper.py
  18. +8 −6 setup.py
  19. +0 −397 test.py
  20. +13 −0 tox.ini
.gitignore
@@ -3,4 +3,8 @@
*~
build/
errors/
+dist/
scrapelib.egg-info/
+.coverage
+cover/
+.tox
.travis.yml
@@ -0,0 +1,9 @@
+language: python
+python:
+ - "2.7"
+ - "3.3"
+install: pip install mock nose "requests>=1.0" --use-mirrors --upgrade
+script: nosetests
+notifications:
+ email:
+ - jturk@sunlightfoundation.com
LICENSE
@@ -1,4 +1,4 @@
-Copyright (c) 2010, Sunlight Labs
+Copyright (c) 2012, Sunlight Labs
All rights reserved.
@@ -24,4 +24,4 @@ PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
-SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
README.rst
@@ -2,30 +2,43 @@
scrapelib
=========
-A Python library for scraping things.
+scrapelib is a library for making requests to websites, particularly those
+that may be less-than-reliable.
-Features include:
+scrapelib originated as part of the `Open States <http://openstates.org/>`_
+project to scrape the websites of all 50 state legislatures and as a result
+was therefore designed with features desirable when dealing with sites that
+have intermittent errors or require rate-limiting.
- * HTTP, HTTPS, FTP requests via an identical API
- * HTTP caching, compression and cookies
- * redirect following
- * request throttling
- * robots.txt compliance (optional)
- * robust error handling
+As of version 0.7 scrapelib has been retooled to take advantage of the superb
+`requests <http://python-requests.org>`_ library.
-scrapelib is a project of Sunlight Labs (c) 2011.
+Advantages of using scrapelib over alternatives like httplib2 simply using
+requests as-is:
+
+* All of the power of the suberb `requests <http://python-requests.org>`_ library.
+* HTTP, HTTPS, and FTP requests via an identical API
+* support for simple caching with pluggable cache backends
+* request throttling
+* configurable retries for non-permanent site failures
+* optional robots.txt compliance
+
+scrapelib is a project of Sunlight Labs (c) 2013.
All code is released under a BSD-style license, see LICENSE for details.
-Written by Michael Stephens <mstephens@sunlightfoundation.com> and James Turk
-<jturk@sunlightfoundation.com>.
+Written by James Turk <jturk@sunlightfoundation.com>
+
+Contributors:
+ * Michael Stephens - initial urllib2/httplib2 version
+ * Joe Germuska - fix for IPython embedding
+ * Alex Chiang - fix to test suite
Requirements
============
-python >= 2.6
-
-httplib2 optional but highly recommended.
+* python 2.7 or 3.3
+* requests >= 1.0
Installation
============
coverage.sh
@@ -0,0 +1,2 @@
+rm -rf cover/
+nosetests --cover-html --with-coverage --cover-package=scrapelib
docs/changelog.rst
@@ -1,10 +1,103 @@
scrapelib changelog
===================
+0.9.0
+-----
+**22 May 2013**
+ * replace FTPSession with FTPAdapter
+ * fixes for latest requests
+
+0.8.0
+-----
+**18 March 2013**
+ * requests 1.0 compatibility
+ * removal of requests pass-throughs
+ * deprecation of setting parameters via constructor
+
+0.7.4
+-----
+**20 December 2012**
+ * bugfix for status_code coming from a cache
+ * bugfix for setting user-agent from headers
+ * fix requests version at <1.0
+
+0.7.3
+-----
+**21 June 2012**
+ * fix for combination of FTP and caching
+ * drop unnecessary ScrapelibSession
+ * bytes fix for scrapeshell
+ * use UTF8 if encoding guess fails
+
+0.7.2
+-----
+**9 May 2012**
+ * bugfix for user-agent check
+ * bugfix for cached content with \r characters
+ * bugfix for requests >= 0.12
+ * cache_dir deprecation is total
+
+0.7.1
+-----
+**27 April 2012**
+ * breaking change: no longer accept URLs without a scheme
+ * deprecation of error_dir & context-manager mode
+ * addition of overridable accept_response hook
+ * bugfix: retry on more requests errors
+ * bugfix: unicode cached content no longer incorrectly encoded
+ * implement various requests enhancements separately for ease of reuse
+ * convert more Scraper parameters to properties
+
+0.7.0
+-----
+**23 April 2012**
+ * rewritten internals to use requests, dropping httplib2
+ * as a result of rewrite, caching behavior no longer attempts to be
+ compliant with the HTTP specification but is much more configurable
+ * added cache_write_only option
+ * deprecation of accept_cookies, use_cache_first, cache_dir parameter
+ * improved tests
+ * improved Python 3 support
+
+0.6.2
+-----
+**20 April 2012**
+ * bugfix for POST-redirects
+ * drastically improved test coverage
+ * add encoding to ResultStr
+
+0.6.1
+-----
+**19 April 2012**
+ * add .bytes attribute to ResultStr
+ * bugfix related to bytes in urlretrieve
+
+0.6.0
+-----
+**19 April 2012**
+ * remove urllib2 fallback for HTTP
+ * rework entire test suite to not rely on Flask
+ * Unicode & Str unification
+ * experimental Python 3.2 support
+
+0.5.8
+-----
+**15 February 2012**
+ * fix to test suite from Alex Chiang
+
+0.5.7
+-----
+**2 February 2012**
+ * -p, --postdata parameter
+ * argv fix for IPython <= 0.10 from Joe Germuska
+ * treat FTP 550 errors as HTTP 404s
+ * use_cache_first improvements
+
0.5.6
-----
-**7 November 2011**
+**9 November 2011**
* scrapeshell fix for IPython >= 0.11
+ * scrapelib.urlopen can take method/body params too
0.5.5
-----
docs/conf.py
@@ -41,16 +41,16 @@
# General information about the project.
project = u'scrapelib'
-copyright = u'2011, Michael Stephens and James Turk'
+copyright = u'2013, Sunlight Labs'
# The version info for the project you're documenting, acts as replacement for
# |version| and |release|, also used in various other places throughout the
# built documents.
#
# The short X.Y version.
-version = '0.5'
+version = '0.9'
# The full version, including alpha/beta/rc tags.
-release = '0.5.6'
+release = '0.9.0'
# The language for content autogenerated by Sphinx. Refer to documentation
# for a list of supported languages.
docs/index.rst
@@ -3,17 +3,26 @@ scrapelib |release|
Overview
--------
-scrapelib is a library that at its simplest provides a replacement for urllib2's urlopen functionality but can do much more.
+scrapelib is a library for making requests to websites, particularly those
+that may be less-than-reliable.
-Advantages of using scrapelib over urllib2 or httplib2 include:
+scrapelib originated as part of the `Open States <http://openstates.org/`_
+project to scrape the websites of all 50 state legislatures and as a result
+was therefore designed with features desirable when dealing with sites that
+have intermittent errors or require rate-limiting.
-* HTTP, HTTPS, FTP requests via an identical API
-* HTTP caching, compression, and cookies
-* intelligent and configurable redirect following
+As of version 0.7 scrapelib has been retooled to take advantage of the superb
+`requests <http://python-requests.org>`_ library.
+
+Advantages of using scrapelib over alternatives like httplib2 simply using
+requests as-is:
+
+* All of the power of the suberb `requests <http://python-requests.org>`_ library.
+* HTTP(S) and FTP requests via an identical API
+* support for simple caching with pluggable cache backends
* request throtting
-* configurable retries for non-permanent failures
-* robots.txt compliance
-* robust error handling
+* configurable retries for non-permanent site failures
+* optional robots.txt compliance
Contents
--------
docs/scrapelib.rst
@@ -20,13 +20,11 @@ Response objects
.. autoclass:: Headers
-ResultStr and ResultUnicode
-~~~~~~~~~~~~~~~~~~~~~~~~~~~
+ResultStr
+~~~~~~~~~
.. autoclass:: ResultStr
-.. autoclass:: ResultUnicode
-
Exceptions
----------
requirements.txt
@@ -1,2 +1 @@
-# Not strictly required, but you probably want it
-httplib2>=0.6.0
+requests>=1.0
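
The README.rst and docs/index.rst hunks in this commit list "request throttling" and "configurable retries for non-permanent site failures" among the library's features. The following is a minimal sketch of how those two ideas can combine; it is not scrapelib's implementation. The names `throttled_retry`, `TransientError`, `requests_per_minute`, `retry_attempts`, and `retry_wait_seconds` are hypothetical and chosen only for illustration, and the fetch function is passed in so the sketch depends on nothing beyond the standard library.

```python
import time


class TransientError(Exception):
    """Stand-in for a non-permanent failure (e.g. a timeout or HTTP 5xx)."""


def throttled_retry(fetch, urls, requests_per_minute=60, retry_attempts=2,
                    retry_wait_seconds=1.0, sleep=time.sleep,
                    clock=time.monotonic):
    """Call fetch(url) for each URL, spacing calls out and retrying
    transient failures with exponential backoff.

    All parameter names are hypothetical; they are not scrapelib's API.
    """
    # A rate of 0 disables throttling in this sketch.
    min_interval = 60.0 / requests_per_minute if requests_per_minute else 0.0
    last_call = None
    results = {}
    for url in urls:
        attempt = 0
        while True:
            # Throttle: wait until min_interval has passed since the last call.
            if last_call is not None:
                remaining = min_interval - (clock() - last_call)
                if remaining > 0:
                    sleep(remaining)
            last_call = clock()
            try:
                results[url] = fetch(url)
                break
            except TransientError:
                # Give up once the configured number of retries is used.
                if attempt >= retry_attempts:
                    raise
                # Back off: wait, then 2x wait, then 4x wait, ...
                sleep(retry_wait_seconds * (2 ** attempt))
                attempt += 1
    return results
```

Passing `sleep` and `clock` as parameters is a design choice made here so the behaviour can be exercised without real delays; a caller would normally leave them at their defaults and supply a `fetch` that raises `TransientError` for failures worth retrying.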