```sh
# the wikidata claim that entities have to match to be in the subset
claim=P31:Q5
# the type that will be passed to the ElasticSearch 'wikidata' index
datatype=humans
./bin/dump_wikidata_subset $claim $datatype
# time for a coffee!
```
What happens here:
- we download the latest Wikidata dump
- pipe it to `wikidata-filter` to keep only entities matching the claim `P31:Q5`, keeping only the entity attributes required by a full-text search engine, that is: `id`, `labels`, `aliases`, `descriptions`
- pipe those filtered entities to the ElasticSearch `wikidata` index under the datatype `humans`, making those entities searchable from the endpoint `http://localhost:9200/wikidata/humans/_search` (see the ElasticSearch API doc and the example query below)
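
A minimal sketch of such a query, assuming ElasticSearch runs locally on its default port and that the dynamic mapping indexed `labels.en` as a text field (the field name is an assumption about the filtered entity format):

```sh
# full-text search within the 'humans' subset, using the ElasticSearch URI search syntax
curl -s 'http://localhost:9200/wikidata/humans/_search?q=labels.en:Curie'
```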
The same as above, but saving the Wikidata dump to disk to avoid downloading 13GB several times when once would be enough. This time, you do need the 13GB of disk space, plus the space your subsets will take in ElasticSearch:
```sh
alias wdfilter=./node_modules/.bin/wikidata-filter
alias import_to_elastic=./bin/import_to_elasticsearch
curl -s https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.json.gz > wikidata-dump.json.gz
cat wikidata-dump.json.gz | gzip -d | wdfilter --claim P31:Q5 --omit type,sitelinks | import_to_elastic humans
# => will be available at http://localhost:9200/wikidata/humans
cat wikidata-dump.json.gz | gzip -d | wdfilter --claim P31:Q571 --omit type,sitelinks | import_to_elastic books
# => will be available at http://localhost:9200/wikidata/books
```
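
To verify that an import went through, one way is to count the documents in each datatype with the standard ElasticSearch count API; a minimal sketch, assuming the default local setup:

```sh
# number of documents imported under each datatype
curl -s 'http://localhost:9200/wikidata/humans/_count'
curl -s 'http://localhost:9200/wikidata/books/_count'
```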
**Tip**: if importing a dump fails at some point, rather than restarting from 0, you can use the `start-from` command to resume from the latest known line.

Example:
```sh
cat wikidata-dump.json.gz | gzip -d | start-from '"Q27999075"' | ./node_modules/.bin/wikidata-filter --claim P31:Q5 --omit type,sitelinks | ./bin/import_to_elasticsearch humans
```
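
For intuition, a `start-from`-like filter can be approximated with `sed`: skip everything until the first line matching the given pattern, then print the rest. This is only a sketch under that assumption, not the actual `start-from` implementation:

```sh
# rough stand-in for start-from: print from the first line matching the pattern onwards
start_from_sketch () {
  sed -n "/$1/,\$p"
}
# usage: resume the decompressed dump at the line containing "Q27999075"
cat wikidata-dump.json.gz | gzip -d | start_from_sketch '"Q27999075"' | head -n 1
```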