Produce Neo4j import CSVs from Wikipedia database dumps to build a graph of links between Wikipedia pages.
$ pip install wiki2neo
Usage: wiki2neo [OPTIONS] [WIKI_XML_INFILE]
Parse Wikipedia pages-articles-multistream.xml[.bz2] dump into two Neo4j import
CSV files:
Node (Page) import, headers=["title:ID", "id"]
Relationships (Links) import, headers=[":START_ID", ":END_ID"]
Reads from stdin by default, pass [WIKI_XML_INFILE] to read from file.
Options:
-p, --pages-outfile FILENAME Node (Pages) CSV output file [default: pages.csv]
-l, --links-outfile FILENAME Relationships (Links) CSV output file [default: links.csv]
--help Show this message and exit.
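To make the two file layouts concrete, here is a minimal Python sketch that writes CSVs with the headers listed in the help text. It illustrates the format only and is not wiki2neo's actual implementation. The `write_import_csvs` helper and the sample rows are hypothetical. Because the node ID column is `title:ID`, the sketch assumes links refer to pages by title.

```python
import csv
import io

# Headers as listed in the wiki2neo help text.
PAGE_HEADER = ["title:ID", "id"]
LINK_HEADER = [":START_ID", ":END_ID"]


def write_import_csvs(pages, links, pages_out, links_out):
    """Write node and relationship rows in the layout described above.

    `pages` is an iterable of (title, page_id) tuples and `links` an
    iterable of (source_title, target_title) tuples. Both are made-up
    inputs for this sketch.
    """
    page_writer = csv.writer(pages_out)
    page_writer.writerow(PAGE_HEADER)
    page_writer.writerows(pages)

    link_writer = csv.writer(links_out)
    link_writer.writerow(LINK_HEADER)
    link_writer.writerows(links)


# Hypothetical sample data: two pages and one link between them.
sample_pages = [("Graph theory", 12401), ("Leonhard Euler", 17902)]
sample_links = [("Graph theory", "Leonhard Euler")]

pages_buf, links_buf = io.StringIO(), io.StringIO()
write_import_csvs(sample_pages, sample_links, pages_buf, links_buf)
print(pages_buf.getvalue())
print(links_buf.getvalue())
```

The `csv` module quotes titles that contain commas, quotes, or newlines, which is why the import command is usually run with `--multiline-fields`.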
Import the resulting CSVs into Neo4j:
$ neo4j-admin import --nodes:Page pages.csv \
--relationships:LINKS_TO links.csv \
--ignore-duplicate-nodes --ignore-missing-nodes --multiline-fields
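The `--ignore-missing-nodes` flag is relevant because a link may point at a title that has no row in pages.csv. The Python sketch below counts such links before an import. `count_dangling_links` is a hypothetical helper, not part of wiki2neo, and it assumes the CSV layout from the help text, with link endpoints matched against the `title:ID` column.

```python
import csv
import io


def count_dangling_links(pages_file, links_file):
    """Return (total_links, dangling_links) for a pair of import CSVs."""
    page_reader = csv.reader(pages_file)
    next(page_reader)  # skip the "title:ID", "id" header
    titles = {row[0] for row in page_reader if row}

    link_reader = csv.reader(links_file)
    next(link_reader)  # skip the ":START_ID", ":END_ID" header
    total = 0
    dangling = 0
    for row in link_reader:
        if not row:
            continue
        start, end = row
        total += 1
        # A link is dangling if either endpoint has no page row.
        if start not in titles or end not in titles:
            dangling += 1
    return total, dangling


# Tiny in-memory example with made-up rows.
pages_csv = io.StringIO("title:ID,id\nA,1\nB,2\n")
links_csv = io.StringIO(":START_ID,:END_ID\nA,B\nA,C\n")
print(count_dangling_links(pages_csv, links_csv))
```

For real files, open both with `open(path, newline="")` as the `csv` module recommends. Note that this sketch holds every title in memory, which may be impractical for a full dump.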
Wikipedia dumps are distributed as compressed .xml.bz2 files. wiki2neo can
parse either the compressed .bz2 file directly or an uncompressed XML file:
# compressed
$ wiki2neo pages-articles-multistream.xml.bz2
# uncompressed
$ wiki2neo pages-articles-multistream.xml
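One plausible way to accept both inputs is to choose the opener by file extension. The sketch below is an assumption for illustration, not a description of wiki2neo's internals. `open_dump` is a hypothetical helper, and the sample payload is made up.

```python
import bz2
import os
import tempfile


def open_dump(path):
    """Open a dump for binary reading, decompressing .bz2 on the fly."""
    if path.endswith(".bz2"):
        return bz2.open(path, "rb")
    return open(path, "rb")


# Round-trip check with a made-up one-line "dump" in both formats.
payload = b"<mediawiki></mediawiki>\n"
with tempfile.TemporaryDirectory() as tmp:
    plain_path = os.path.join(tmp, "sample.xml")
    bz2_path = os.path.join(tmp, "sample.xml.bz2")
    with open(plain_path, "wb") as f:
        f.write(payload)
    with bz2.open(bz2_path, "wb") as f:
        f.write(payload)

    # Both paths should yield the same bytes through the same helper.
    with open_dump(plain_path) as f:
        plain_data = f.read()
    with open_dump(bz2_path) as f:
        bz2_data = f.read()

print(plain_data == bz2_data == payload)
```

Python's `bz2` module has read multi-stream files since Python 3.3, so the multistream dump can be read this way without decompressing it to disk first.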