Utility to harvest records from Digital Commons, via OAI-PMH, and index them in Apache Solr.
Instructions for Use:
- Clone dc2Solr repository: git clone http://github.com/WSULib/dc2Solr
- Set the system-specific variables in dc2Solr.py:
  - baseURL = Location of the Solr core that will index the records
  - baseOAI = Digital Commons URL + "do/oai/?" suffix (e.g. http://digitalcommons.wayne.edu/do/oai/?)
  - saxonLocation = Path to the Saxon jar file (likely "Saxon9he.jar"). This utility uses the Saxon Java command line program, which must be downloaded separately, to perform XSL transformations.
- Configure Solr. A rough example schema is located in the /SolrConfig directory; it can likely be optimized further for faceting and memory consumption.
- Change permissions on the "setsXML" and "solrXML" directories so that Python and Saxon can write files to them.
- Finally, run "python dc2Solr.py" with the action to perform:
  - download = Downloads all OAI sets from Digital Commons
  - transform = Transforms OAI XML to Solr-ready XML via the XSLT stylesheet "dc2solr.xsl"
  - index = Indexes all Solr-ready XML documents in "solrXML" into Solr
  - all = Performs all three actions, in order. This can be used to run the utility as a fully automated cron job.
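
The three variables and the three actions above fit together roughly as sketched here. This is a minimal illustration, not the code in dc2Solr.py: the helper function names, the Solr core URL, the Saxon path, the set name, and the file names are all assumptions made for the example, and nothing is downloaded, transformed, or indexed. It only builds the OAI-PMH request URLs, the Saxon command line, and the Solr update URL that each step would plausibly use.

```python
from urllib.parse import urlencode

# Hypothetical values; replace them with the ones for your own system.
baseURL = "http://localhost:8983/solr/digitalcommons"  # assumed Solr core location
baseOAI = "http://digitalcommons.wayne.edu/do/oai/?"   # Digital Commons OAI-PMH endpoint
saxonLocation = "/opt/saxon/Saxon9he.jar"              # assumed path to the Saxon jar


def oai_request_url(verb, **params):
    """Build an OAI-PMH request URL from baseOAI, which already ends in '?'."""
    query = [("verb", verb)] + sorted(params.items())
    return baseOAI + urlencode(query)


def saxon_command(source_xml, output_xml, stylesheet="dc2solr.xsl"):
    """Build the Saxon command line for one XSL transformation (not executed here)."""
    return ["java", "-jar", saxonLocation,
            "-s:" + source_xml, "-xsl:" + stylesheet, "-o:" + output_xml]


def solr_update_url(commit=True):
    """Build the URL of the Solr XML update handler for the configured core."""
    url = baseURL.rstrip("/") + "/update"
    return url + "?commit=true" if commit else url


if __name__ == "__main__":
    # download: list the available sets, then request the records of one set
    print(oai_request_url("ListSets"))
    print(oai_request_url("ListRecords", metadataPrefix="oai_dc", set="example_set"))
    # transform: OAI XML in setsXML becomes Solr-ready XML in solrXML
    print(" ".join(saxon_command("setsXML/example_set.xml", "solrXML/example_set.xml")))
    # index: the Solr-ready XML would be POSTed to this URL
    print(solr_update_url())
```

Keeping the URL and command construction in small functions like these makes it easy to check the configured values before running the full "all" action from cron.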
Wayne State University Libraries, 2013