fczuardi / abrcrawl

A simple crawler that extracts data from Agência Brasil's free repository of Brazilian news and politics-related images.

the initial csv is now all quoted 
Fabricio Zuardi (author)
Sat Dec 12 10:17:26 -0800 2009
commit  9dd0503ff22d8aebd6f9f877d02f93e4366350f5
tree    aa98d957c39cb74cde022c448c9ef47ea8405393
parent  06aae67c87217757cd0a4712059ec50df28201c2
abrcrawl /

  • .gitignore
  • README.textile: Fri Dec 11 14:10:22 -0800 2009, created a new tool to enhance the generated csv... [Fabricio Zuardi]
  • abrcrawl.py: Sat Dec 12 10:17:26 -0800 2009, the initial csv is now all quoted [Fabricio Zuardi]
  • add_images_info.py: Sat Dec 12 10:16:37 -0800 2009, updated to a more reliable way of checking if t... [Fabricio Zuardi]
  • data/ (directory)

README.textile

ABrCrawl

ABrCrawl is a simple command-line tool that crawls the web pages of Agência Brasil’s free Images Archive and extracts their metadata into a structured table (a CSV or JSON file).
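
As a rough illustration of what a "structured table" of metadata could look like, here is a minimal sketch that writes the same rows as a fully quoted CSV and as JSON. The field names (url, photographer) and the helper names are hypothetical and are not taken from abrcrawl.py; the only detail borrowed from the repository is that its CSV output is all quoted. It is written for Python 3, which is newer than the Python 2.5 baseline listed below.

```python
import csv
import io
import json


def rows_to_csv(rows, fieldnames):
    """Serialize dict rows to CSV text with every field quoted."""
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf,
        fieldnames=fieldnames,
        quoting=csv.QUOTE_ALL,   # quote every value, including the header
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


def rows_to_json(rows):
    """Serialize the same rows as a JSON array of objects."""
    return json.dumps(rows, ensure_ascii=False, indent=2)


# Hypothetical sample record, for illustration only.
sample = [{"url": "http://example.org/a.jpg", "photographer": "Jane Doe/ABr"}]
csv_text = rows_to_csv(sample, ["url", "photographer"])
json_text = rows_to_json(sample)
```

Quoting every field keeps commas and line breaks inside image descriptions from being mistaken for column or row separators.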

Requirements

  • Python >= 2.5
  • simplejson
  • jpeglib
  • PIL
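
Before running the tool, it can help to confirm that the Python-level requirements are importable. The sketch below is an assumption-laden example, not part of ABrCrawl: the helper name missing_modules is made up, it needs Python 3.4 or later (for importlib.util.find_spec), and the import name for PIL depends on the installed version (old PIL 1.1.x releases were usually imported as Image, while current Pillow uses PIL). jpeglib is a C library, so it cannot be checked this way.

```python
import importlib.util


def missing_modules(names):
    """Return the subset of top-level module names that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Module names are assumptions; adjust them to your installation.
    missing = missing_modules(["simplejson", "PIL"])
    if missing:
        print("Missing Python modules: " + ", ".join(missing))
    else:
        print("All Python modules found.")
```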

Mac

Installing jpeglib and PIL on Mac OSX 10.6 (Snow Leopard)

From http://proteus-tech.com/blog/cwt/install-pil-in-snow-leopard/

“Next, I download libjpeg latest version (http://www.ijg.org/files/jpegsrc.v7.tar.gz), untar it, and configure it:
$ tar zxvf jpegsrc.v7.tar.gz
$ cd jpeg-7
$ ./configure --enable-shared --enable-static
$ make
$ sudo make install
Now, you just installed libjpeg into /usr/local/lib. So, it’s time to install PIL. Download the PIL source code from http://effbot.org/downloads/Imaging-1.1.6.tar.gz then untar it:
$ tar zxvf Imaging-1.1.6.tar.gz
$ cd Imaging-1.1.6
This is the trick, you must open setup.py with your prefered text editor then looking for the line JPEG_ROOT = None and then [change] it to JPEG_ROOT = libinclude("/usr/local") then save the file and continue:
$ python setup.py build
If everything fine, you can install PIL to your system library. If the PIL need something that didn’t exist, it will tell you on the error message. You may install the missing library via fink or download and compile it by yourself. Then install the PIL as root:
$ sudo python setup.py install --optimize=1
Done. :)”

Download

The latest version is available at the ABrCrawl Git repository. If you have git installed, clone the repository to your machine:

git clone git://github.com/fczuardi/abrcrawl.git

Or, if you prefer, just download the latest version as a zip file.

Usage

The main script is abrcrawl.py. Call it with the --help argument to list the available options:

python abrcrawl.py --help

Contribute

ABrCrawl is free and open source software. If you find a bug, or have suggestions or patches that would make it better, please use the ABrCrawl GitHub page to file an issue or send a pull request. Alternatively, you can contact me directly.

Software License (BSD)

Copyright © 2009, Fabricio Zuardi
All rights reserved.

Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions
are met:

* Redistributions of source code must retain the above copyright
notice, this list of conditions and the following disclaimer.
* Redistributions in binary form must reproduce the above copyright
notice, this list of conditions and the following disclaimer in
the documentation and/or other materials provided with the
distribution.
* Neither the name of the author nor the names of its contributors
may be used to endorse or promote products derived from this
software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
“AS IS” AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

Content License (CC-Attribution-2.5)

The images and their descriptions produced by Agência Brasil are released under a Creative Commons Atribuição 2.5 Brasil license. Here is the statement from the website (translated from Portuguese):

FREE COVERAGE
Every day, Agência Brasil's team of photojournalists produces an average of 100 images, directly from Brasília, and distributes them free of charge throughout the country. All of this content can be obtained in several resolutions, including high resolution, and used freely, provided that credit is given.

Note

Although most of the images in the image bank are produced by Agência Brasil, some photos come from other sources, so check the rights owner of each individual photo before using it in your projects. One way to tell whether a photo came from Agência Brasil is to look for an “/Abr” appended to the end of the photographer’s name.
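
The credit check described in this note could be sketched as a small filter. The function name is hypothetical and is not part of ABrCrawl, and matching the suffix case-insensitively is an assumption made here so that variants such as "/ABr" and "/Abr" are both accepted. A positive match is only a hint; the rights owner of each photo should still be verified.

```python
def looks_like_agencia_brasil_credit(photographer):
    """Return True if the credit string ends with an '/ABr'-style suffix."""
    # Ignore surrounding whitespace and letter case (an assumption).
    return photographer.strip().lower().endswith("/abr")


# Hypothetical credit strings, for illustration only.
credits = ["Jane Doe/ABr", "John Roe/Some Other Agency", "  Maria Silva/Abr "]
from_abr = [c for c in credits if looks_like_agencia_brasil_credit(c)]
```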
