Indexes text with math in Lucene/Solr-based full-text search engines

MIaS – Math-aware full-text search engine

MIaS (Math Indexer and Searcher) is a math-aware full-text search engine. It is based on Apache Lucene; however, its math-processing capabilities are standalone and can be easily integrated into any Lucene/Solr-based system, as in the EuDML search service.

Usage

Setting up mias.properties

Create a file named mias.properties in a location of your choice, e.g. /home/MIaS/conf/mias.properties, and set the following properties:

  • INDEXDIR – Path to the directory where the index is or will be located.
  • UPDATE – If TRUE, files that are already indexed and are about to be indexed again will be updated. If FALSE, the indexer skips them and only adds new files.
  • MAXRESULTS – The maximum number of results that the system retrieves.
  • DOCLIMIT – The maximum number of documents indexed during one run. -1 means no limit.
  • THREADS – The number of threads used for processing.

The resulting file might have the following content:

INDEXDIR=/home/data/index
UPDATE=false
MAXRESULTS=10000
DOCLIMIT=-1
THREADS=8

On Windows, backslashes in paths need to be escaped, e.g. you would write C:\\MIaS\\index instead of C:\MIaS\index.
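The escaping rule can be seen with a small sketch. It assumes that mias.properties is read with the standard java.util.Properties loader, which the rule suggests but which is not confirmed here; the class and method names are hypothetical and not part of MIaS.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical helper, not part of MIaS: shows how java.util.Properties
// interprets backslashes in a value such as INDEXDIR.
class MiasPropertiesSketch {

    // Parses properties-file text and returns the INDEXDIR value as Java sees it.
    static String readIndexDir(String propertiesText) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(propertiesText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props.getProperty("INDEXDIR");
    }

    public static void main(String[] args) {
        // File content: INDEXDIR=C:\\MIaS\\index (backslashes escaped)
        String escaped = "INDEXDIR=C:\\\\MIaS\\\\index\n";
        // File content: INDEXDIR=C:\MIaS\index (backslashes not escaped)
        String unescaped = "INDEXDIR=C:\\MIaS\\index\n";

        System.out.println(readIndexDir(escaped));   // C:\MIaS\index
        System.out.println(readIndexDir(unescaped)); // C:MIaSindex
    }
}
```

Under that assumption, an unescaped path does not cause a load error: the loader silently drops the single backslashes, so the indexer would receive a path that does not exist.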

Running MIaS

To run MIaS, locate the JAR file of MIaS and run the following command:

java -jar MIaS.jar [OPTIONS]

To see the available options, run the following command:

java -jar MIaS.jar -help

A directory named lib containing the necessary dependencies must be located in the same directory as the JAR file.
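The layout requirement can be checked before launching. The following is a minimal sketch of such a pre-flight check; the class LibDirCheckSketch and its method are hypothetical and not part of MIaS, and the check only confirms that a lib directory exists next to the JAR, not that it contains the right dependencies.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical pre-flight check, not part of MIaS.
class LibDirCheckSketch {

    // Returns true if a directory named "lib" sits next to the given JAR path.
    static boolean hasLibDir(Path jarFile) {
        Path parent = jarFile.toAbsolutePath().getParent();
        return parent != null && Files.isDirectory(parent.resolve("lib"));
    }

    public static void main(String[] args) {
        // Defaults to MIaS.jar in the current directory if no path is given.
        Path jar = Paths.get(args.length > 0 ? args[0] : "MIaS.jar");
        if (hasLibDir(jar)) {
            System.out.println("lib directory found next to " + jar);
        } else {
            System.out.println("lib directory missing next to " + jar);
        }
    }
}
```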

Citing MIaS

Text

SOJKA, Petr and Martin LÍŠKA. The Art of Mathematics Retrieval. In Matthew R. B. Hardy, Frank Wm. Tompa. Proceedings of the 2011 ACM Symposium on Document Engineering. Mountain View, CA, USA: ACM, 2011. p. 57–60. ISBN 978-1-4503-0863-2. doi:10.1145/2034691.2034703.

BibTeX

@inproceedings{doi:10.1145:2034691.2034703,
     author = "Petr Sojka and Martin L{\'\i}{\v s}ka",
      title = "{The Art of Mathematics Retrieval}",
  booktitle = "{Proceedings of the ACM Conference on Document Engineering,
                DocEng 2011}",
  publisher = "{Association for Computing Machinery}",
    address = "{Mountain View, CA}",
       year = 2011,
      month = Sep,
       isbn = "978-1-4503-0863-2",
      pages = "57--60",
        url = {http://doi.acm.org/10.1145/2034691.2034703},
        doi = {10.1145/2034691.2034703},
   abstract = {The design and architecture of MIaS (Math Indexer and Searcher), 
               a system for mathematics retrieval is presented, and design
               decisions are discussed. We argue for an approach based on
               Presentation MathML using a similarity of math subformulae. The
               system was implemented as a math-aware search engine based on
               the state-of-the-art system Apache Lucene. Scalability issues
               were checked against more than 400,000 arXiv documents with 158
               million mathematical formulae. Almost three billion MathML
               subformulae were indexed using a Solr-compatible Lucene.},
}