some simple dump to list procedure from 28 Nov 2012
grammarware committed Dec 6, 2012
1 parent 1bc566d commit ea030cd
Showing 6 changed files with 1,178,888 additions and 0 deletions.
13 changes: 13 additions & 0 deletions commons/Makefile
@@ -0,0 +1,13 @@
TIMESTAMP = 20121118

all:

extract:
	./filter.py commonswiki-${TIMESTAMP}-pages-articles-multistream-index.txt

get:
	wget http://dumps.wikimedia.org/commonswiki/${TIMESTAMP}/commonswiki-${TIMESTAMP}-pages-articles-multistream-index.txt.bz2
	bzip2 -d commonswiki-${TIMESTAMP}-pages-articles-multistream-index.txt.bz2

clean:
	rm -rf *.bz2 *-index.txt
15 changes: 15 additions & 0 deletions commons/README.txt
@@ -0,0 +1,15 @@
This directory contains the list of JPG, PNG and GIF files found on Wikimedia Commons.
Originally requested by Jeroen van den Bos for his joint research with Tijs van der Storm on parsing formatted binary data, it is free (CC-BY-SA) for anyone else to use as well.

The list is derived by a simple Python script from the data dump available from Wikimedia:
http://dumps.wikimedia.org/commonswiki/20121118/
(If you want a newer one, change TIMESTAMP in the Makefile and run 'make get' followed by 'make extract'.)
The script is so trivial that it is very unlikely for something to break.
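
The script itself (filter.py) is not reproduced here. As a rough illustration only, a filter of this kind might look like the sketch below. It assumes that each line of the multistream index has the form offset:page_id:title, and that matching on the file extension is enough. The function name image_titles and the inclusion of ".jpeg" are assumptions made for this example, not taken from the actual script.

```python
import sys

# Extensions named in this README; treating ".jpeg" as JPG is an assumption.
IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".png", ".gif")


def image_titles(lines):
    """Yield page titles that look like JPG, PNG or GIF file names."""
    for line in lines:
        # Assumed index line format: offset:page_id:title
        # The title may itself contain colons (e.g. "File:Example.jpg"),
        # so split at most twice.
        parts = line.rstrip("\n").split(":", 2)
        if len(parts) != 3:
            continue  # skip lines that do not match the assumed format
        title = parts[2]
        if title.lower().endswith(IMAGE_EXTENSIONS):
            yield title


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python sketch.py commonswiki-20121118-pages-articles-multistream-index.txt
    with open(sys.argv[1], encoding="utf-8") as index:
        for title in image_titles(index):
            print(title)
```

Matching on the extension is case-insensitive here because file names on Commons use both upper-case and lower-case extensions.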

The data itself is distributed under the open licenses GNU FDL and CC-BY-SA:
http://www.gnu.org/copyleft/fdl.html
http://creativecommons.org/licenses/by-sa/3.0/

Yours,
Vadim Zaytsev aka @grammarware,
http://grammarware.net