preprocessing of the DBLP dump
grammarware committed Dec 6, 2012
1 parent 909abcb commit fec7193
Showing 5 changed files with 112 additions and 3 deletions.
3 changes: 0 additions & 3 deletions data/Makefile
@@ -1,8 +1,5 @@
all:

dblp:
	wget http://dblp.uni-trier.de/xml/dblp.xml

clean:
	rm -rf *.xml *.json

11 changes: 11 additions & 0 deletions dblp/Makefile
@@ -0,0 +1,11 @@
all:

extract:
	xsltproc --stringparam conf models filter.xslt _everything.xml > models.xml

get:
	wget http://dblp.uni-trier.de/xml/dblp.xml -O _everything.xml

clean:
	rm -rf *.xml
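
Because the `get` target always downloads the full dump again, a guard that skips the download when the file is already present may be useful. The following is only a sketch and not part of this commit: the helper name `fetch_if_missing` is hypothetical, and it assumes that a non-empty `_everything.xml` is good enough to keep.

```shell
#!/bin/sh
# Hypothetical helper (not in the repository): download only when the
# target file is missing or empty, so a large dump is not fetched twice.
fetch_if_missing() {
    target=$1
    url=$2
    if [ -s "$target" ]; then
        echo "$target already exists, skipping download"
        return 0
    fi
    # Same wget invocation shape as the 'get' target above.
    wget "$url" -O "$target"
}

# Usage, with the URL and file name taken from the 'get' target:
# fetch_if_missing _everything.xml http://dblp.uni-trier.de/xml/dblp.xml
```

The check uses `-s` rather than `-e` so that an empty file left behind by an interrupted download is fetched again.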

14 changes: 14 additions & 0 deletions dblp/README.txt
@@ -0,0 +1,14 @@
This directory contains the XML dump taken from DBLP:
http://dblp.uni-trier.de/xml/
The data itself is distributed under the open data license ODC-BY:
http://opendatacommons.org/licenses/by/1.0

The _everything.xml file has the same content as dblp.xml, but is preprocessed to be self-contained: all SGML entities that would otherwise be defined by dblp.dtd are replaced with their Unicode counterparts.

DO NOT run 'make get' on your machine unless you want to sacrifice about 1 GB of network traffic and 4 hours or so on preprocessing. Otherwise, go ahead.

DO NOT run 'make clean' unless you managed to irreparably damage the XML files. Otherwise, go ahead.

Yours,
Vadim Zaytsev aka @grammarware,
http://grammarware.net
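
One way to check that a preprocessed file really is self-contained is to list the named entity references that are still present in it. The sketch below is not part of this commit: the function name `leftover_entities` is hypothetical, and it assumes a grep that supports `-o` (GNU and BSD grep do). The five entities predefined by XML itself are filtered out, since they need no DTD.

```shell
#!/bin/sh
# Hypothetical check (not in the repository): count named entity
# references remaining in a file, ignoring the five XML built-ins.
leftover_entities() {
    grep -o '&[A-Za-z][A-Za-z0-9]*;' "$1" \
        | grep -v -x -e '&amp;' -e '&lt;' -e '&gt;' -e '&quot;' -e '&apos;' \
        | sort | uniq -c | sort -rn
}

# Usage:
# leftover_entities _everything.xml
```

An empty result suggests that no DTD-defined entities remain. Any entity that is listed would need its own substitution before the file can be parsed without dblp.dtd.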
File renamed without changes.
87 changes: 87 additions & 0 deletions dblp/deentitify.sh
@@ -0,0 +1,87 @@
#!/bin/sh

echo 'Fixing umlauts...'
sed -i 's/&auml;/ä/g' $1
sed -i 's/&euml;/ë/g' $1
sed -i 's/&iuml;/ï/g' $1
sed -i 's/&ouml;/ö/g' $1
sed -i 's/&uuml;/ü/g' $1
sed -i 's/&yuml;/ÿ/g' $1
echo 'Fixing umlauts on capitals...'
sed -i 's/&Auml;/Ä/g' $1
sed -i 's/&Euml;/Ë/g' $1
sed -i 's/&Iuml;/Ï/g' $1
sed -i 's/&Ouml;/Ö/g' $1
sed -i 's/&Uuml;/Ü/g' $1
sed -i 's/&Yuml;/Ÿ/g' $1
echo 'Fixing acutes...'
sed -i 's/&aacute;/á/g' $1
sed -i 's/&eacute;/é/g' $1
sed -i 's/&iacute;/í/g' $1
sed -i 's/&oacute;/ó/g' $1
sed -i 's/&uacute;/ú/g' $1
sed -i 's/&yacute;/ý/g' $1
echo 'Fixing acutes on capitals...'
sed -i 's/&Aacute;/Á/g' $1
sed -i 's/&Eacute;/É/g' $1
sed -i 's/&Iacute;/Í/g' $1
sed -i 's/&Oacute;/Ó/g' $1
sed -i 's/&Uacute;/Ú/g' $1
sed -i 's/&Yacute;/Ý/g' $1
echo 'Fixing graves...'
sed -i 's/&agrave;/à/g' $1
sed -i 's/&egrave;/è/g' $1
sed -i 's/&igrave;/ì/g' $1
sed -i 's/&ograve;/ò/g' $1
sed -i 's/&ugrave;/ù/g' $1
sed -i 's/&ygrave;/ỳ/g' $1
echo 'Fixing graves on capitals...'
sed -i 's/&Agrave;/À/g' $1
sed -i 's/&Egrave;/È/g' $1
sed -i 's/&Igrave;/Ì/g' $1
sed -i 's/&Ograve;/Ò/g' $1
sed -i 's/&Ugrave;/Ù/g' $1
sed -i 's/&Ygrave;/Ỳ/g' $1
echo 'Fixing tildes...'
sed -i 's/&atilde;/ã/g' $1
sed -i 's/&otilde;/õ/g' $1
sed -i 's/&ntilde;/ñ/g' $1
echo 'Fixing tildes in capitals...'
sed -i 's/&Atilde;/Ã/g' $1
sed -i 's/&Otilde;/Õ/g' $1
sed -i 's/&Ntilde;/Ñ/g' $1
echo 'Fixing rings and circumflexes...'
sed -i 's/&aring;/å/g' $1
sed -i 's/&acirc;/â/g' $1
sed -i 's/&ccirc;/ĉ/g' $1
sed -i 's/&ecirc;/ê/g' $1
sed -i 's/&icirc;/î/g' $1
sed -i 's/&ocirc;/ô/g' $1
sed -i 's/&ucirc;/û/g' $1
echo 'Fixing rings and circumflexes in capitals...'
sed -i 's/&Aring;/Å/g' $1
sed -i 's/&Acirc;/Â/g' $1
sed -i 's/&Ccirc;/Ĉ/g' $1
sed -i 's/&Ecirc;/Ê/g' $1
sed -i 's/&Icirc;/Î/g' $1
sed -i 's/&Ocirc;/Ô/g' $1
sed -i 's/&Ucirc;/Û/g' $1
echo 'Fixing other diacritics...'
sed -i 's/&ccedil;/ç/g' $1
sed -i 's/&Ccedil;/Ç/g' $1
sed -i 's/&oslash;/ø/g' $1
sed -i 's/&Oslash;/Ø/g' $1
echo 'Fixing fancy letters...'
sed -i 's/&micro;/µ/g' $1
sed -i 's/&szlig;/ß/g' $1
sed -i 's/&aelig;/æ/g' $1
sed -i 's/&AElig;/Æ/g' $1
sed -i 's/&oelig;/œ/g' $1
sed -i 's/&OElig;/Œ/g' $1
sed -i 's/&eth;/ð/g' $1
sed -i 's/&ETH;/Ð/g' $1
sed -i 's/&thorn;/þ/g' $1
sed -i 's/&THORN;/Þ/g' $1
echo 'Fixing miscellaneous signs...'
sed -i 's/&times;/×/g' $1
sed -i 's/&reg;/®/g' $1
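
Each `sed -i` call in the script rewrites the whole file, so a dump of about 1 GB is read and written once per entity. A possible alternative, sketched here and not taken from the commit, is to pass all substitutions to a single `sed` invocation so the data is streamed once. The function name `deentitify_once` is hypothetical, and only five of the substitutions are shown; a full version would carry the same list as the script.

```shell
#!/bin/sh
# Hypothetical single-pass variant (not in the repository): every -e
# expression is applied to each line during one read of the input.
deentitify_once() {
    sed \
        -e 's/&auml;/ä/g' \
        -e 's/&ouml;/ö/g' \
        -e 's/&uuml;/ü/g' \
        -e 's/&eacute;/é/g' \
        -e 's/&szlig;/ß/g'
}

# Reads standard input and writes standard output:
printf '%s\n' '<author>J&uuml;rgen M&ouml;ller</author>' | deentitify_once
# prints: <author>Jürgen Möller</author>
```

Writing to standard output instead of editing in place also avoids the `-i` option, whose syntax differs between GNU and BSD sed. The result can be redirected to a new file and moved over the original once it has been checked.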
