PUBmatch is my final project at thisismetis.com. It is now hosted live at PUBmatch.co.
This project uses Latent Semantic Indexing to index the entire PubMed corpus (48GB). Once the index is built, it takes a matter of seconds to find the most conceptually similar research articles for a given email, news article, or any other text. No keywords necessary.
For a brief overview of the motivation for such a tool please see: http://www.ryanglambert.com/blog/pubmatchco-a-recommendation-engine-for-pubmed
There are two parts to this project: Model Creation and the Similarity Server hosted at PUBmatch.co.
Model Creation
Corresponding code in https://github.com/Ryanglambert/kojak/blob/master/tfidf_pm.py
- This is done using gensim distributed on a relatively beefy AWS instance with the latest ATLAS BLAS libraries for numpy
- Memory friendly thanks to a generator (see tfidf_pm.py lines 17-50).
- Used the multiprocessing library to speed up generation of the term-document matrix from the PubMed corpus, reducing this step from 6 hours to 1.5 hours.
- TFIDF weighting and Singular Value Decomposition to 300 components take roughly 4 hours.
- Creation of the gensim MatrixSimilarity object takes roughly 2 hours.
- Total LSI model build time is ~7-9 hours on a 32-core, 244GB RAM instance on AWS.
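The generator and multiprocessing bullets above can be sketched roughly like this. This is not the code from tfidf_pm.py, just a minimal standard-library illustration of the idea. It assumes one document per line of input, which is a simplification and not the real PubMed format, and every name in it (`stream_documents`, `to_term_counts`, `term_count_rows`) is hypothetical.

```python
import re
from collections import Counter
from multiprocessing import Pool

TOKEN_RE = re.compile(r"[a-z0-9]+")


def stream_documents(lines):
    """Yield one document per non-empty line.

    `lines` can be an open file handle, so only one line is held in
    memory at a time (assumption: one document per line).
    """
    for line in lines:
        line = line.strip()
        if line:
            yield line


def to_term_counts(text):
    """Tokenize one document and count its terms (one row of the term-document matrix)."""
    return Counter(TOKEN_RE.findall(text.lower()))


def term_count_rows(documents, processes=1, chunksize=1000):
    """Lazily yield one term-count row per document, in input order.

    With processes=1 everything runs serially in this process; otherwise
    the tokenizing work is spread over a pool of worker processes.
    """
    if processes == 1:
        yield from map(to_term_counts, documents)
        return
    with Pool(processes) as pool:
        # imap keeps the input order and hands work out in chunks
        yield from pool.imap(to_term_counts, documents, chunksize=chunksize)
```

On a real file this would be driven by something like `term_count_rows(stream_documents(open_file), processes=8)`, run from under an `if __name__ == "__main__":` guard so the worker processes start cleanly. The rows come out one at a time, so the whole corpus never has to sit in memory.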
Website and Model in Production for querying at PUBmatch.co
Corresponding code in https://github.com/Ryanglambert/kojak/blob/master/similarity_server.py
- Bootstrap + Flask
- Querying the index for 50 results is done in less than a second. (However, getting titles for links needs to be ironed out and is slow right now)
- The LSI model is under 2GB thanks to SVD bringing it down from 48GB.
- To do: use a database instead of having the model sit in memory. Because the model sits in memory, the server needs to run on an 8GB instance; this makes it fast, but hard to scale if a lot of people were to use it at once. It's also a bit expensive for me to run on AWS, so there's some financial motivation as well. :)
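For a feel of what the sub-second query does: gensim's MatrixSimilarity compares the query against every stored document vector using cosine similarity and returns the best scores. The sketch below shows that lookup in plain Python. It is not the code in similarity_server.py and does not use gensim; the function names and the toy vectors are made up for illustration, and a real index would hold 300-component LSI vectors.

```python
import heapq
import math


def unit(vector):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm else list(vector)


def build_index(doc_vectors):
    """Normalize every document vector once, up front."""
    return [unit(v) for v in doc_vectors]


def top_matches(index, query_vector, n=50):
    """Return the n (doc_id, cosine) pairs most similar to the query, best first."""
    q = unit(query_vector)
    scores = (
        (doc_id, sum(a * b for a, b in zip(doc, q)))
        for doc_id, doc in enumerate(index)
    )
    return heapq.nlargest(n, scores, key=lambda pair: pair[1])


if __name__ == "__main__":
    # Toy 2-component vectors standing in for LSI vectors (hypothetical data).
    index = build_index([[1.0, 0.0], [0.0, 2.0], [3.0, 3.0]])
    for doc_id, score in top_matches(index, [1.0, 0.1], n=2):
        print(doc_id, round(score, 3))
```

Normalizing up front is what keeps each query cheap: every lookup is one dot product per document. It is also why the whole index wants to live in memory.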
gensim[distributed] (also see https://github.com/Ryanglambert/kojak/blob/master/provisioning_gensim_and_blas)
Front End Stuff