some simple dump to list procedure from 28 Nov 2012

grammarware · Dec 6, 2012 · ea030cd
1 parent 1bc566d
commit ea030cd
Showing 6 changed files with 1,178,888 additions and 0 deletions.
diff --git a/commons/Makefile b/commons/Makefile
@@ -0,0 +1,13 @@
+TIMESTAMP = 20121118
+
+all:
+
+extract:
+	./filter.py commonswiki-${TIMESTAMP}-pages-articles-multistream-index.txt
+
+get:
+	wget http://dumps.wikimedia.org/commonswiki/${TIMESTAMP}/commonswiki-${TIMESTAMP}-pages-articles-multistream-index.txt.bz2
+	bzip2 -d commonswiki-${TIMESTAMP}-pages-articles-multistream-index.txt.bz2
+
+clean:
+	rm -rf *.bz2 *-index.txt
diff --git a/commons/README.txt b/commons/README.txt
@@ -0,0 +1,15 @@
+This directory contains the list of JPG, PNG and GIF files found on Wikimedia Commons.
+Originally requested by Jeroen van den Bos for his joint research with Tijs van der Storm on parsing formatted binary data, it is free (CC-BY-SA) for anyone else to use as well.
+
+The list is derived by a simple Python script from the data dump available from Wikimedia:
+	http://dumps.wikimedia.org/commonswiki/20121118/
+(If you want a newer one, change TIMESTAMP in the Makefile and run 'make get' and then 'make extract'.)
+The script is simple enough that it is very unlikely to break.
+
+The data itself is distributed under the open licenses GNU FDL and CC-BY-SA:
+	http://www.gnu.org/copyleft/fdl.html
+	http://creativecommons.org/licenses/by-sa/3.0/
+
+Yours,
+	Vadim Zaytsev aka @grammarware,
+	http://grammarware.net
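
The README says the list is derived from the dump index by a simple Python script, but `filter.py` itself is not shown in this part of the diff. The following is only a minimal sketch of what such a filter might look like, not the committed script. It assumes each line of the multistream index has the form `offset:page_id:title` and that image pages carry a `File:` title prefix; the names `image_titles` and `filter_index` are hypothetical.

```python
import re

# Assumption: every index line looks like "offset:page_id:title".
# Assumption: image pages have titles starting with "File:".
IMAGE_TITLE = re.compile(r"^File:.+\.(jpe?g|png|gif)$", re.IGNORECASE)


def image_titles(lines):
    """Yield titles of JPG, PNG and GIF file pages from index lines."""
    for line in lines:
        # Split on the first two colons only, since titles may contain colons.
        parts = line.rstrip("\n").split(":", 2)
        if len(parts) != 3:
            continue  # skip lines that do not match the assumed format
        title = parts[2]
        if IMAGE_TITLE.match(title):
            yield title


def filter_index(path):
    """Stream matching titles from a decompressed index file (hypothetical helper)."""
    with open(path, encoding="utf-8") as index_file:
        yield from image_titles(index_file)
```

Under these assumptions, iterating over `filter_index("commonswiki-20121118-pages-articles-multistream-index.txt")` would yield one matching title at a time without loading the whole index into memory.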