An Interactive Tool for Natural Language Processing on Clinical Text

NLPReViz: emr-nlp-server

emr-nlp-server provides the backend service for the emr-vis-web project.

Getting Started

To get started, install the prerequisites, get the emr-nlp-server application, and then launch the service as described below:


  1. You must have Java Development Kit (JDK) 1.7 to build or Java Runtime (JRE) 1.7 to run this project. To confirm that you have the right version of JRE installed, run $ java -version and verify that the output is similar to:

    java version "1.7.0_51"
    Java(TM) SE Runtime Environment (build 1.7.0_51-b13)
    Java HotSpot(TM) 64-Bit Server VM (build 24.51-b03, mixed mode)

    If you don't have the JDK installed or have an older one, you may get the latest version from the Oracle Technology Network.

  2. We use the Apache Tomcat server to deploy the app. On a Mac with homebrew you may use $ brew install tomcat to install the server on your machine.

Building the project

  1. Clone the emr-nlp-server repository using git:

    git clone
    cd emr-nlp-server
  2. The project has the following external dependencies, which can be downloaded using Apache Ant:

    • Java Jersey which is dual licensed: COMMON DEVELOPMENT AND DISTRIBUTION LICENSE and GPL 2.
    • Weka licensed under GPL 3.
    • Libsvm with a license compatible with GPL.
    • Stanford CoreNLP licensed under the GNU General Public License (v3 or later; Stanford NLP code is GPL v2+, but the composite with external libraries is v3+).

    To download and resolve these dependencies from their respective repositories use:

    ant resolve
  3. Set the CATALINA_HOME environment variable to the Tomcat directory that contains webapps, then use ant deploy to build and deploy the backend app.

    For example, if your Tomcat's webapps directory is accessible as /usr/local/Cellar/tomcat/8.0.9/libexec/webapps/, then you may use:

    env CATALINA_HOME=/usr/local/Cellar/tomcat/8.0.9/libexec/ ant deploy
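
Before deploying, it can help to confirm that the directory you pass as CATALINA_HOME really contains a webapps directory. The following is a minimal sketch, not part of the repository: check_catalina_home is a hypothetical helper, and the assumption that the build deploys into webapps under CATALINA_HOME is inferred from the example above rather than from build.xml.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of emr-nlp-server): verifies that a candidate
# CATALINA_HOME contains a webapps directory before `ant deploy` is run.
# Assumption: the build deploys into $CATALINA_HOME/webapps.
check_catalina_home() {
  local home="${1%/}"              # drop one trailing slash, if present
  if [ -z "$home" ]; then
    echo "CATALINA_HOME is empty" >&2
    return 1
  fi
  if [ ! -d "$home/webapps" ]; then
    echo "no webapps directory under $home" >&2
    return 1
  fi
  echo "$home"                     # print the normalized path on success
}

# Usage (commented out so that sourcing this file has no side effects):
# CATALINA_HOME="$(check_catalina_home /usr/local/Cellar/tomcat/8.0.9/libexec/)" \
#   && env CATALINA_HOME="$CATALINA_HOME" ant deploy
```

If the check fails, nothing is deployed and the reason is printed to standard error.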

We recommend using the Eclipse IDE for Java EE Developers with the EGit plugin installed for development. The repository contains appropriate project files to be imported into Eclipse.

Running the server

We have included some "dummy" data with our release so that you can run the tool and play with the interface. These are not actual medical records, and models trained on them will not be useful. Contact the devs if you need more information about real datasets.

  1. Download the data directory and copy it inside $CATALINA_BASE. You should be able to figure out this path from the messages printed after launching the server. Example path: /usr/local/Cellar/tomcat/8.0.9/libexec/data.

  2. You need to build libsvm before you can run the server for the first time. To do that, run make inside the data/libsvm directory or follow the instructions in the README file present there.

  3. Start the Tomcat server (e.g., using $ catalina run or # service tomcat start).
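
The three steps above can be collected into a small script. This is only a sketch under stated assumptions: run and prepare_and_start are hypothetical helpers that do not exist in the repository, the data directory is assumed to be already downloaded, and $CATALINA_BASE/data is assumed not to exist yet. Setting DRY_RUN=1 prints the commands instead of executing them.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper around the three steps above; not part of the repository.
# With DRY_RUN=1 each command is printed instead of executed.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

prepare_and_start() {
  local data_src="$1"              # path to the downloaded data directory
  local base="${2%/}"              # value of $CATALINA_BASE, trailing slash dropped
  # Step 1: copy the data directory (assumes $base/data does not exist yet).
  run cp -R "$data_src" "$base/data" || return 1
  # Step 2: build libsvm inside the copied data directory.
  run make -C "$base/data/libsvm" || return 1
  # Step 3: start Tomcat in the foreground.
  run catalina run
}

# Preview the commands without touching the file system:
# DRY_RUN=1 prepare_and_start ./data /usr/local/Cellar/tomcat/8.0.9/libexec
```

Running the preview first is a cheap way to confirm the paths before anything is copied or built.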

Now follow the steps on emr-vis-web to set up the front-end application.

Using your own dataset and defining custom variables

The tool is currently configured to make predictions for 14 colonoscopy quality variables. It also does format-specific parsing of the colonoscopy and pathology reports in the data provided with the release. We have a more generic version of the tool in the general branch of this repository. Check out this experimental branch to try it.

You will still need to download the sample data directory and organize your documents in the same structure, shown below:

|____documentList
| |____initialIDList.xml
| |____testIDList.xml
| |____fullIDList.xml
| |____feedbackIDList.xml
|____docs
| |____0719
| | |____report.txt
| |____0973
| | |____report.txt
| |____0184
| | |____report.txt
| | |____pathology.txt
| |____0726
| | |____report.txt
| | |____pathology.txt
|____labels
| |____class-appendiceal-orifice.csv
| |____class-ileo-cecal-valve.csv
| |____class-informed-consent.csv
| |____class-proc-aborted.csv
| |____class-asa.csv
| |____class-prep-adequateYes.csv
| |____class-any-adenoma.csv
| |____class-cecum.csv
| |____class-withdraw-time.csv
| |____class-indication-type.csv
| |____class-prep-adequateNot.csv
| |____class-biopsy.csv
| |____class-prep-adequateNo.csv
| |____class-nursing-report.csv

docs contains the documents. Each patient or case is represented by a sub-directory named with a four-digit ID. The ID length is hard-coded in the source. Each sub-directory may contain at most two files: report.txt and pathology.txt. pathology.txt is optional. If you have more than two files, you may concatenate them into one report, or extend our code to support them. :)
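
The layout rules for docs can be checked mechanically before loading your own data. The sketch below is an illustration, not code from the repository: check_docs_dir is a hypothetical helper, and treating report.txt as required is an assumption based on pathology.txt being the only file described as optional.

```shell
#!/usr/bin/env bash
# Hypothetical checker (not part of emr-nlp-server) for the docs layout.
# Prints one line per problem and returns non-zero if any problem was found.
check_docs_dir() {
  local docs="${1%/}" status=0 dir id f name
  for dir in "$docs"/*/; do
    [ -d "$dir" ] || continue                # the glob matched nothing
    id="$(basename "$dir")"
    case "$id" in
      [0-9][0-9][0-9][0-9]) ;;               # four-digit ID, as expected
      *) echo "$id: not a four-digit ID"; status=1 ;;
    esac
    # Assumption: report.txt is required, pathology.txt is optional.
    [ -f "$dir/report.txt" ] || { echo "$id: missing report.txt"; status=1; }
    for f in "$dir"*; do
      [ -e "$f" ] || continue                # empty sub-directory
      name="$(basename "$f")"
      case "$name" in
        report.txt|pathology.txt) ;;         # the only two supported files
        *) echo "$id: unexpected file $name"; status=1 ;;
      esac
    done
  done
  return "$status"
}

# Usage:
# check_docs_dir /usr/local/Cellar/tomcat/8.0.9/libexec/data/docs
```

A clean directory produces no output and a zero exit status, so the function can also be used as a guard in a larger script.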

The documentList directory has the following files, which reference the documents described above:

  • initialIDList.xml - Used to train the initial model. This is how we bootstrap the system.
  • feedbackIDList.xml - This is the list of documents you should work on to give feedback and improve the models.
  • testIDList.xml - Held-out test set. There is code to generate evaluation metrics, but it is not exposed to the front-end at this point. Feel free to contribute ;)
  • fullIDList.xml - List of all the IDs. Used to create the global feature vector.

The labels directory contains the gold standard data, which is used to train the initial models and to compute evaluation metrics.

The rest of the files can be reset by pointing your browser to /rest/server/resetDB, for example: http://localhost:8080/emr-nlp-server/rest/server/resetDB. Remember to update emr-vis-web to its general branch.

The easiest way to configure the tool to use your own variables is to map them to the names of your choice in the front end. Remember to update emr-vis-web as described in its README as well.

This project will be updated to make this configuration easier in the near future.


The REST calls to the server are protected with HTTP basic access authentication. The default login credentials are "username" and "password". You are encouraged to change them when running the app on a publicly accessible server.
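
As an illustration of calling a protected endpoint from the command line, the sketch below requests the resetDB path mentioned earlier using curl and basic authentication. reset_url and reset_db are hypothetical helpers, not part of the project. The host and credentials in the usage line are the example URL and defaults quoted in this README, and the use of a plain GET request is an assumption based on the endpoint being reachable from a browser.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not part of emr-nlp-server).
reset_url() {
  # Join the application base URL and the resetDB path without doubling slashes.
  printf '%s/rest/server/resetDB\n' "${1%/}"
}

reset_db() {
  local base_url="$1" user="$2" password="$3"
  # -u sends the credentials using HTTP basic authentication;
  # --fail makes curl exit non-zero on an HTTP error status such as 401.
  curl --fail --silent --show-error -u "$user:$password" "$(reset_url "$base_url")"
}

# Usage (commented out so nothing is sent when this file is sourced):
# reset_db http://localhost:8080/emr-nlp-server username password
```

Basic authentication only encodes the credentials, so on a publicly accessible server it is worth serving the app over HTTPS in addition to changing the defaults.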


This project is released under the GPL 3 license. Take a look at the LICENSE file in the source for more information.