Wikidata filtered-dump import

# the wikidata claim that entities have to match to be in the subset
claim=P31:Q5
# the type that will be passed to the ElasticSearch 'wikidata' index
datatype=humans

./bin/dump_wikidata_subset $claim $datatype
# time for a coffee!

What happens here:

  • we download the latest Wikidata dump
  • pipe it to wikidata-filter to keep only the entities matching the claim P31:Q5, and only the entity attributes a full-text search engine needs: id, labels, aliases, descriptions
  • pipe those filtered entities to the ElasticSearch wikidata index under the type humans, making them searchable from the endpoint http://localhost:9200/wikidata/humans/_search (see the ElasticSearch API doc and the example query below)
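
Once the import is done, you can check the result with a quick query against that endpoint. A minimal sketch, assuming a default ElasticSearch listening on localhost:9200; without a field prefix, the URI search below matches across all indexed fields:

curl -s 'http://localhost:9200/wikidata/humans/_search?q=Lovelace'
# => should return matching entities with their id, labels, aliases and descriptions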

⚠️ You are about to download a whole Wikidata dump, which weighs something like 13GB compressed. Only the filtered output should end up on your disk though.

Import multiple Wikidata subsets into ElasticSearch

The same as above, but saving the Wikidata dump to disk to avoid downloading 13GB several times when once would be enough. This time, you do need the 13GB of disk space, plus the space your subsets will take in ElasticSearch.

alias wdfilter=./node_modules/.bin/wikidata-filter
alias import_to_elastic=./bin/import_to_elasticsearch

curl -s https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.json.gz > wikidata-dump.json.gz

cat wikidata-dump.json.gz | gzip -d | wdfilter --claim P31:Q5 --omit type,sitelinks | import_to_elastic humans
# => will be available at http://localhost:9200/wikidata/humans

cat wikidata-dump.json.gz | gzip -d | wdfilter --claim P31:Q571 --omit type,sitelinks | import_to_elastic books
# => will be available at http://localhost:9200/wikidata/books
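
Once both imports are done, a quick sanity check is to count the documents in each type. A minimal sketch, assuming a default local ElasticSearch and the wikidata index/type layout used above:

curl -s http://localhost:9200/wikidata/humans/_count
curl -s http://localhost:9200/wikidata/books/_count
# => each returns a JSON body with a 'count' field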

Tip: If importing a dump fails at some point, rather than restarting from zero, you can use the start-from command to resume from the latest known line. Example:

cat wikidata-dump.json.gz | gzip -d | start-from '"Q27999075"' | ./node_modules/.bin/wikidata-filter --claim P31:Q5 --omit type,sitelinks | ./bin/import_to_elasticsearch humans
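
If the start-from helper is not at hand, a roughly equivalent resume can be sketched with sed, which prints everything from the first line matching the given entity id onwards (an approximation only; the helper may handle edge cases differently):

cat wikidata-dump.json.gz | gzip -d | sed -n '/"Q27999075"/,$p' | ./node_modules/.bin/wikidata-filter --claim P31:Q5 --omit type,sitelinks | ./bin/import_to_elasticsearch humans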