This repository builds and analyzes a large graph database (using neo4j) in order to examine the impact of EarthCube on investigator networks. There are two components: one to build the database, and one to analyze the constructed database. See the details below for more information.
- Simon Goring - University of Wisconsin - Madison
I welcome contributions from developers and non-developers alike. Please feel free to raise issues or contribute code and text; contributions should be made as a Pull Request.
The database is built using XML data downloaded from the National Science Foundation. For convenience, the bash script `get_awards.sh` can be used to download all the required files to a new directory at `./data/input/awards`. Each individual zipped file represents all awards for a particular year. To execute the script from the command line simply enter:
```bash
bash ./get_awards.sh
```
This step does not require the use of `neo4j`, and the downloaded data can be the basis of any kind of database you'd like to use.
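For orientation, a minimal sketch of what such a download script can look like is below. The NSF download URL pattern and the year range here are illustrative assumptions; `get_awards.sh` itself is the authoritative version.

```bash
#!/usr/bin/env bash
# Hypothetical sketch of a per-year award download; see get_awards.sh for
# the real script. The URL pattern and year range are assumptions.
mkdir -p ./data/input/awards
for year in $(seq 1959 2019); do
  wget -O "./data/input/awards/${year}.zip" \
    "https://www.nsf.gov/awardsearch/download?DownloadFileName=${year}&All=true"
done
```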
The database can be built by executing the `build_db_xml.sh` bash script. The script requires the `apoc` plugin for Neo4j, so be sure to install the plugin before using the scripts to build the database. The database will be built wherever your `neo4j.conf` file is set up to find it; if you wish to put this database somewhere else on your system, simply edit `/etc/neo4j/neo4j.conf` to point to the proper location.
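For reference, the settings involved look something like the following (a sketch using Neo4j 3.x property names; your paths will differ, and apoc settings may live elsewhere in newer versions):

```
# Where the database files live; edit this line to relocate the database.
dbms.directories.data=/var/lib/neo4j/data
# Commonly needed so apoc can read local files during the build; check the
# apoc documentation for the settings your version requires.
apoc.import.file.enabled=true
dbms.security.procedures.unrestricted=apoc.*
```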
Once you have things set up, execute:
```bash
sudo bash build_db_xml.sh
```
This assumes you use `root` privileges to start and stop the `neo4j` database (the script restarts the service) and also to manage the database itself.
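In broad strokes the script does something like the following (a simplified, serial sketch; read `build_db_xml.sh` itself before running it, and note it assumes `cypher-shell` authentication is already configured):

```bash
# Simplified outline only; the real steps are in build_db_xml.sh.
sudo service neo4j restart
# Wait for the server to accept connections before sending Cypher.
until echo "RETURN 1;" | cypher-shell > /dev/null 2>&1; do sleep 1; done
# Feed each award XML file through the core CQL script.
for f in ./data/input/awards/*.xml; do
  cypher-shell --param "file => '$f'" < cql_files/xml_direct.cql
done
```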
There is probably a lot of work that could be done to optimize the core CQL file, `cql_files/xml_direct.cql`, but I'm still learning Cypher.
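One optimization that pays off regardless of how the query evolves is putting unique constraints on the properties the `MERGE` statements key on, so each `MERGE` is an index lookup rather than a label scan. The labels and properties below are assumptions; the real ones are in `cql_files/xml_direct.cql`:

```bash
# Assumed labels/properties for illustration; check cql_files/xml_direct.cql
# for the actual schema. Constraint syntax matches Neo4j 3.x.
cypher-shell <<'EOF'
CREATE CONSTRAINT ON (a:Award) ASSERT a.id IS UNIQUE;
CREATE CONSTRAINT ON (p:Investigator) ASSERT p.id IS UNIQUE;
EOF
```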
Additionally, although this code brings the processing speed down to about 100ms per XML file, there are hundreds of thousands of files. To optimize this operation I used GNU Parallel¹. You can see how the code for `build_db_xml.sh` changed by looking at the commit history. In particular, we make use of the `parallel` function (install using `apt install parallel` on Linux systems). Because of the large number of files in later years, `parallel` needs to be run using the `--ungroup` flag. This pushes output through immediately, instead of filling memory up waiting for all the jobs to return before dumping the output.
```bash
find ./data/input/awards/ -name "*.xml" | parallel --ungroup --eta "runxml {} >> output.log"
```
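Note that `parallel` spawns subshells, so a bash function like `runxml` has to be exported with `export -f` before the call above will work. A hypothetical sketch of the pattern (the actual definition is in `build_db_xml.sh`):

```bash
# Hypothetical wrapper; the real runxml is defined in build_db_xml.sh.
runxml () {
  # Run the core CQL against a single XML file, passed as a parameter.
  cypher-shell --param "file => '$1'" < cql_files/xml_direct.cql
}
export -f runxml  # without this, parallel's subshells can't see the function
```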
GNU Parallel was a pretty fun discovery and it seems to have sped things up a bit for me. There are some other great options here. The `--eta` flag returns some information about the run, in the format:
```
Computer:jobs running/jobs completed/%of started jobs/Average seconds to complete
ETA: 153s Left: 300 AVG: 0.53s local:300/7700/97%/0.7s
```
One suggestion that has been made is to add a scripted element that builds the graph in memory and then does a periodic commit to neo4j. This would speed up the transactions further, since each file currently requires its own `MATCH`/`CREATE` sequence. Bundling multiple files into a single transaction would lower this overhead, since the duplicates could be dealt with in memory before being committed to the neo4j database.
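A rough sketch of what that batching could look like, assuming APOC's `apoc.load.xml` is used to read each file; the labels, the batch size, and the `file://` URL handling are all illustrative assumptions:

```bash
# Hypothetical batching sketch: 100 files per transaction instead of one
# transaction per file. Paths must be readable by Neo4j for file:// URLs.
find "$PWD/data/input/awards" -name "*.xml" | split -l 100 - batch_
for batch in batch_*; do
  # Turn this batch of paths into a Cypher list literal of file:// URLs.
  files=$(sed 's|.*|"file://&"|' "$batch" | paste -sd, -)
  cypher-shell <<EOF
UNWIND [$files] AS url
CALL apoc.load.xml(url) YIELD value
// MERGE award and investigator nodes here, as in cql_files/xml_direct.cql
RETURN count(*);
EOF
done
```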
I write too much academic research to leave without a concluding statement. The end.
¹ O. Tange (2011): GNU Parallel - The Command-Line Power Tool. *;login: The USENIX Magazine*, February 2011:42-47.
Note: `data/input/awards/7407911.xml` is known to produce an error during processing.