fczuardi-trash/abrcrawl

ABrCrawl

ABrCrawl is a simple command line tool for crawling the web pages of Agência Brasil’s free Images Archive and extracting their metadata into a structured table (CSV or JSON file).

Important notice

EBC/Agência Brasil discontinued the website version that this crawler relied on at the beginning of 2010. Please check abrcrawl2 if you need to crawl the new (2010) version of their image bank.

EBC/Agência Brasil replaced the old website with a new (and worse) piece of junk that broke several links (people in the Brazilian government are apparently not aware that Cool URIs don’t change), and in the process they also deleted a good part of the Brazilian picture archive.

The rich repository of Creative Commons licensed photos that they used to host, which covered many important and historic moments in Brazilian political news (more than 3 years’ worth of pictures, tens of thousands of good quality images), is now gone forever. The new website contains only pictures from 2010 onward.

So I have recently started a new project, abrcrawl2, to crawl and archive the content of the new website version; you should use that repository from now on. I am keeping this old code for reference purposes only.

Requirements

Mac

Installing jpeglib and PIL on Mac OSX 10.6 (Snow Leopard)

From http://proteus-tech.com/blog/cwt/install-pil-in-snow-leopard/

“Next, I download libjpeg latest version (http://www.ijg.org/files/jpegsrc.v7.tar.gz), untar it, and configure it:
$ tar zxvf jpegsrc.v7.tar.gz
$ cd jpeg-7
$ ./configure --enable-shared --enable-static
$ make
$ sudo make install
Now, you just installed libjpeg into /usr/local/lib. So, it’s time to install PIL. Download the PIL source code from http://effbot.org/downloads/Imaging-1.1.6.tar.gz then untar it:
$ tar zxvf Imaging-1.1.6.tar.gz
$ cd Imaging-1.1.6
This is the trick, you must open setup.py with your preferred text editor, then look for the line JPEG_ROOT = None and change it to JPEG_ROOT = libinclude("/usr/local"), then save the file and continue:
$ python setup.py build
If everything fine, you can install PIL to your system library. If the PIL need something that didn’t exist, it will tell you on the error message. You may install the missing library via fink or download and compile it by yourself. Then install the PIL as root:
$ sudo python setup.py install --optimize=1
Done. :)”

Download

The latest version is available at the ABrCrawl Git repository. If you have git installed, clone the repository to your machine:

git clone git://github.com/fczuardi/abrcrawl.git

Or, if you prefer, just download the latest version as a zip file.

Usage

The main script is abrcrawl.py. Call it with the --help argument to get a list of the available options:

python abrcrawl.py --help

Contribute

ABrCrawl is free and open source software. If you find a bug or have suggestions or patches that would make it better, please use the ABrCrawl GitHub page to file an issue or send a pull request. Alternatively, you can contact me directly.

Software License (BSD)

Copyright © 2009, Fabricio Zuardi
All rights reserved.

Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions
are met:

* Redistributions of source code must retain the above copyright
notice, this list of conditions and the following disclaimer.
* Redistributions in binary form must reproduce the above copyright
notice, this list of conditions and the following disclaimer in
the documentation and/or other materials provided with the
distribution.
* Neither the name of the author nor the names of its contributors
may be used to endorse or promote products derived from this
software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
“AS IS” AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

Content License (CC-Attribution-2.5)

The images and their descriptions produced by Agência Brasil are released under a Creative Commons Atribuição 2.5 Brasil license. Here is the quote from the website (translated from Portuguese):

FREE COVERAGE
Every day, Agência Brasil’s team of photojournalists produces an average of 100 images, directly from Brasília, and distributes them free of charge to the whole country. All of this content can be obtained in several resolutions, including high resolution, and used freely, provided that credit is given.

Note

Although the majority of the images in the image bank are produced by Agência Brasil, some photos come from other sources, so make sure you check the rights owner of each individual photo before using it in your projects. One way to identify whether a photo came from Agência Brasil is to look for “/Abr” appended to the end of the photographer’s name.
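
The credit check described in this note could be automated with a small helper. This is only a sketch: the function name and the sample credit lines are made up for illustration and are not part of abrcrawl.py, and the match is made case-insensitive on the assumption that the suffix may appear as “/Abr” or “/ABr”.

```python
def is_agencia_brasil_credit(credit):
    """Return True if a photographer credit ends with the "/Abr" suffix.

    Hypothetical helper, not part of abrcrawl.py. The comparison ignores
    case and surrounding whitespace (an assumption about how credits
    are written on the site).
    """
    return credit.strip().lower().endswith("/abr")


# Made-up sample credits, for illustration only.
credits = ["Maria Silva/ABr", "John Doe/Reuters", "  Jose Souza/Abr  "]

# Keep only the credits that look like Agência Brasil's own photos.
own = [c for c in credits if is_agencia_brasil_credit(c)]
print(own)
```

A photo that fails this check is not necessarily restricted, it just needs its rights owner checked by hand before reuse.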

About

This project moved to fczuardi/abrcrawl3. I am keeping this repo for reference purposes only; it is inactive.
