
AlexanderSavochkin/MolecularLucene

MolecularLucene

Lucene is exceptionally good at text search. Other kinds of queries (date and numeric ranges, geospatial, etc.) are also supported. This project is an attempt to bring chemical structure search into the Lucene world.

This project introduces a special kind of Lucene analyzer for indexing and searching chemical structures.

To be indexed or searched by MolecularLucene, chemical structures must be provided in a text representation (SMILES is the only supported format for now, but I am going to add InChI).

This makes it possible to combine full-text search and similar-structure search in a single query against one index.

For example, suppose a Lucene index contains documents with the fields "description" and "smiles". The "description" field is a free-text description of a chemical compound, and the "smiles" field contains its chemical structure. A query against the index looks like this:

description:"amino acid" AND smiles:c1ccc2c\(c1\)cc\[nH\]2

Note that the characters (, ), [ and ] are escaped because they have special meaning in Lucene query syntax.

Literally this means: show me compounds that have the phrase "amino acid" in the description and a chemical structure similar to indole (smiles:c1ccc2c(c1)cc[nH]2).
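Escaping SMILES strings by hand is error-prone, so here is a minimal sketch of how the query above could be built programmatically. The class `SmilesQueryEscaper` and its methods are hypothetical helpers written for this illustration and are not part of MolecularLucene; the list of special characters is an assumption based on the classic Lucene query syntax. In a real project, Lucene's own `QueryParser.escape` would likely be the better choice.

```java
// Hypothetical helper, not part of MolecularLucene: escapes a SMILES string
// so that it can be embedded in a classic Lucene query string.
final class SmilesQueryEscaper {

    // Characters assumed to be special in the classic Lucene query syntax.
    private static final String SPECIAL = "\\+-!():^[]\"{}~*?|&/";

    // Prefix every special character with a backslash.
    static String escape(String smiles) {
        StringBuilder sb = new StringBuilder(smiles.length() * 2);
        for (int i = 0; i < smiles.length(); i++) {
            char c = smiles.charAt(i);
            if (SPECIAL.indexOf(c) >= 0) {
                sb.append('\\');
            }
            sb.append(c);
        }
        return sb.toString();
    }

    // Combine a free-text phrase with a structure clause, using the field
    // names "description" and "smiles" from the example above.
    // The phrase itself is assumed to contain no double quotes.
    static String buildQuery(String phrase, String smiles) {
        return "description:\"" + phrase + "\" AND smiles:" + escape(smiles);
    }

    public static void main(String[] args) {
        // Indole, as in the example query.
        System.out.println(buildQuery("amino acid", "c1ccc2c(c1)cc[nH]2"));
    }
}
```

Escaping character by character keeps the helper independent of any Lucene version, at the cost of having to maintain the special-character list by hand.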

See the autotests source code for a basic usage example.

References

Post about this project at habrahabr.ru (in Russian).
Demo: chemical Wikipedia search project ChWiSe.Net. Its source code is available on GitHub.

About

Lucene tokenizer for chemical structures indexing/searching
