This was my first foray into information retrieval using Latent Semantic Indexing and AWS's massive multicore servers. yay!

PUBmatch is down for reconstruction.

This project uses Latent Semantic Indexing to index the entire PubMed corpus (48GB) and then finds, in a matter of seconds, the research articles most conceptually similar to a given email, news article, or really any other text. No keywords necessary.

For a brief overview of the motivation for such a tool please see:

There are two parts of this project: Model Creation and The Similarity Server hosted at

Model Creation

Corresponding code in

  1. Built with gensim[distributed] on a relatively beefy AWS instance, with numpy linked against the latest ATLAS BLAS libraries.
  2. Memory friendly: the corpus is streamed through a generator (see lines 17-50).
  3. The multiprocessing library speeds up generation of the term-document matrix from the PubMed corpus, cutting this step from 6 hours to 1.5 hours.
  4. TF-IDF weighting and Singular Value Decomposition to 300 components take roughly 4 hours.
  5. Creating the gensim MatrixSimilarity object takes roughly 2 hours.
  6. Total LSI model build time is ~7-9 hours on a 32-core, 244GB RAM instance on AWS.
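
As a rough illustration of the generator and multiprocessing steps above, here is a minimal standard-library sketch. It is not the project's code: the function names, the one-document-per-line file layout, and the simple regex tokenizer are all assumptions made for the example.

```python
import re
from collections import Counter
from multiprocessing import Pool

# Assumed tokenizer: lowercase alphanumeric runs. The real project may differ.
TOKEN_RE = re.compile(r"[a-z0-9]+")


def tokenize(text):
    """Lowercase a document and split it into alphanumeric tokens."""
    return TOKEN_RE.findall(text.lower())


def count_terms(text):
    """Worker: turn one document into a {term: count} mapping."""
    return Counter(tokenize(text))


def stream_documents(path):
    """Yield one document per line so the corpus never sits in memory at once.

    One document per line is an assumed layout, not the PubMed format.
    """
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield line


def term_counts(documents, processes=1, chunksize=256):
    """Yield per-document term counts, optionally using a worker pool."""
    if processes <= 1:
        # Serial path: plain lazy map over the document stream.
        yield from map(count_terms, documents)
        return
    with Pool(processes) as pool:
        # imap keeps results in document order and yields them as they
        # become ready, instead of building one big list like map().
        yield from pool.imap(count_terms, documents, chunksize=chunksize)
```

On a multicore machine this could be driven with something like `term_counts(stream_documents("corpus.txt"), processes=32)`, where `corpus.txt` is a hypothetical file name. When using more than one process, call it from under an `if __name__ == "__main__":` guard.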

Website and Model in Production for querying at

Corresponding code in (

  1. Bootstrap + Flask.
  2. Querying the index for 50 results takes less than a second. (Fetching titles for the result links is still slow and needs to be ironed out.)
  3. The LSI model is under 2GB, thanks to SVD bringing it down from the 48GB corpus.
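
To give a feel for what a top-50 query involves, here is a small plain-Python sketch of a cosine-similarity lookup over an in-memory index. gensim's MatrixSimilarity does the equivalent work with dense numpy matrices, so this only illustrates the idea; `normalize`, `top_matches`, and the dict-based index are hypothetical names invented for the example.

```python
import heapq
import math


def normalize(vector):
    """Scale a vector to unit length (all-zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm else list(vector)


def top_matches(query, index, k=50):
    """Return the k (doc_id, cosine) pairs most similar to the query vector.

    `index` maps doc_id -> unit-length vector, so a dot product with the
    normalized query is the cosine similarity.
    """
    q = normalize(query)
    scores = (
        (doc_id, sum(a * b for a, b in zip(q, vec)))
        for doc_id, vec in index.items()
    )
    # nlargest avoids sorting the whole index when only k results are needed.
    return heapq.nlargest(k, scores, key=lambda pair: pair[1])
```

In the real server the query text would first be projected into the 300-dimensional LSI space; here the vectors are assumed to be given already.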

Future Improvements

  1. Use a database instead of keeping the model in memory. Because the model sits in memory, the server needs an 8GB instance; this makes queries fast, but hard to scale if a lot of people were to use it at once. It's also a bit expensive for me to run on AWS, so there's some financial motivation as well. :)

Text Processing

gensim[distributed] also see (

  • libatlas-base-dev
  • gfortran
  • numpy
  • scipy


Front End Stuff