Skip to content
The Automatic Biomedical Hypothesis Generation System.
Branch: master
Clone or download
Fetching latest commit…
Cannot retrieve the latest commit at this time.
Type Name Latest commit message Commit time
Failed to load latest commit information.
build_network removes fancy tuple constructor for compatability's sake May 16, 2018
run_query Fixes a normalization error in get_papers Mar 14, 2019
tools updated raw2vis to create_visualization. Updates reflect changes made… May 3, 2018
.gitignore adds pycache to ignore Apr 5, 2018
.gitmodules modules are now linked via http. added a '1' to the networkit build Apr 12, 2018


This repo contains the code described in the publication: MOLIERE: Automatic Biomedical Hypothesis Generation System

Don't have a super computer? Request your query here!

We organize our code into two major sub-projects, build_network and run_query.

As the names imply, you will need to run the former in order to construct a dataset that the latter can use. Further details can be found in each sub-projects own But, in short, running either python script with the -h flag will print our all command-line arguments for each utility. This should get you pointed in the right direction.

Note, please add export MOLIERE_HOME=<location of this git repo> to your bashrc. Both sub-projects are going to expect it.

If you have any questions, feel free to reach out to:

jsybran [at] clemson [dot] edu

Or just leave an issue in this repo.


First, make sure you clone this project with the --recursive flag. This initializes my submodules present in the ./external directory.

Then, make sure you have the following dependencies:

  gcc >= 5.4
  python 3
  java 1.8     (for AutoPhrase)
  scons        (for NetworKit)
  google test  (for NetworKit)
  mpich 3.1.4  (for PLDA)

Ideally, a simple make while in the project directory should do the trick. Often, errors are due to our NetworKit dependency, so please refer to their readme if you have any issues.

Runtime Requirements

If you are constructing the Moliere Knowledge Network, please note that our code is extremely memory intensive if you choose to follow the default construction parameters. After downloading MEDLINE and performing an initial parsing, the script then sends all text through AutoPhrase. This often requires over 1 terabyte of memory to complete. Then, we send the text through fasttext in order to define embeddings for each word. This too is memory intensive, on the order of 100's of Gb.

Note: If you do not have the resources to construct the Moliere Knowledge Network, we keep our latest public release Available Online.

When running a query, we expect the knowledge network data to be stored on a filesystem that allows for parallel reads. If this is not available, you may need to set the environment variable OMP_NUM_THREADS=1 in order to disable our multi-threaded reads. We rely on a parallel file system in order to keep query-time memory usage down by "skimming" our large network files in an efficient manner. This way, we only keep relevant information in memory, which drastically improves runtime.

Note: If you do not have the resources to run a query, we allow for query requests Available Online.


If you use or reference any of this work, please cite the following paper:

 author = {Sybrandt, Justin and Shtutman, Michael and Safro, Ilya},
 title = {MOLIERE: Automatic Biomedical Hypothesis Generation System},
 booktitle = {Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining},
 series = {KDD '17},
 year = {2017},
 isbn = {978-1-4503-4887-4},
 location = {Halifax, NS, Canada},
 pages = {1633--1642},
 numpages = {10},
 url = {},
 doi = {10.1145/3097983.3098057},
 acmid = {3098057},
 publisher = {ACM},
 address = {New York, NY, USA},
 keywords = {hypothesis generation, mining scientific publications, topic modeling, undiscovered public knowledge},
You can’t perform that action at this time.