Scientific Discourse Tagger (SciDT)

An LSTM based sequence labeling model for analyzing the structure of scientific discourse in text. We provide a docker image that allows the system to be run out of the box with minimal configuration. Read the paper for more details.

Basic Requirements

  • Docker (tested with v1.12.3)
  • This repository
  • Pretrained word embeddings as a prebuilt Elasticsearch index data directory. Please download this file (http://bmkeg.isi.edu/data/embeddings/es_index_all_data_unique.tar.gz) and unpack it in a directory on the machine where you will run the docker image. The file will expand into a directory called data.

Running Docker

First, build the docker image:

  cd $SCIDT_HOME/docker
  docker build -t scidt .

Then run a container based on this image.

  cd $SCIDT_HOME
  ./start_docker.sh 8888 8787 /path/to/documents /path/to/elasticsearch/data

This will then load a docker command prompt with the following additional functionality.

SciDT-Pipeline functionality.

Preprocessing for SciDT is provided by the https://github.com/BMKEG/sciDT-pipeline library. The docker container installs a fully assembled jar from https://zenodo.org/record/177565/. This provides an efficient multithreaded system for running the necessary, time-consuming Stanford-parser-based segmentation of the text into clauses for subsequent discourse tagging.

An Elastic Search Index for Word Embeddings.

Note that loading word embeddings directly into memory imposes heavy memory requirements, so we use an Elasticsearch index of the word embeddings instead. This speeds up running the tagger, since the entire embedding file does not need to be loaded every time. When you start the docker container, the Elasticsearch index starts automatically.

A Simple SciDT Web Server

From the command prompt in the docker image, run:

  cd $SCIDT_HOME
  python scidt_server.py --use_attention

This starts a simple web server that exposes the SciDT library as a web service and permits rapid processing of article files.

SciDT processing of TSV files

We have added support for running the tagger over directories of tsv files generated by the SciDT-Pipeline code. This makes it easier to curate training data from spreadsheets.

Experiment extraction from Results Section

This is still experimental, but two scripts manage the extraction of experimental statements from the text of research articles:

  python fill_expt_spans_in_tsv.py \
      --inDir /tmp/data/tsv_clause \
      --outDir /tmp/data/tsv_figs \ 
      --ganntChartDir /tmp/data/gannt

This script processes the discourse tags within the tsv files to estimate the location of text pertaining to a specific experiment (e.g., 'Fig 1A'). It adds a column to the tab-separated *.tsv files to generate *_span.tsv files with a fig_spans column that denotes the system's best guess of the linkage between figures and the underlying narrative text.
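To illustrate the kind of figure-reference matching involved, here is a minimal sketch of a regex-based detector for mentions such as 'Fig 1A'. The function name and pattern are ours for illustration; the actual span-estimation heuristics live in fill_expt_spans_in_tsv.py and may differ.

```python
import re

# Hypothetical pattern for figure references like "Fig 1A", "Figure 2B",
# or "Fig. 3"; the real script's matching logic may be more elaborate.
FIG_PATTERN = re.compile(r"Fig(?:ure)?\.?\s*(\d+[A-Z]?)", re.IGNORECASE)

def find_figure_mentions(clause):
    """Return normalized figure labels (e.g. '1A') mentioned in a clause."""
    return [m.group(1).upper() for m in FIG_PATTERN.finditer(clause)]
```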

A second script extracts the text passages pertaining to each figure into separate files:

  python extract_experiments.py \
      --inDir /tmp/data/tsv_figs \
      --outDir /tmp/data/tsv_expts

The output directory (/tmp/data/tsv_expts) then holds one subdirectory per paper, each containing one *.tsv spreadsheet per figure detected in that paper. Thus, the system breaks each paper into small subdocuments, each pertaining to a subfigure.
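The grouping step can be sketched in a few lines. This is an illustrative helper, not the actual logic of extract_experiments.py; we assume rows carry a clause and its fig_spans value, and that clauses without a figure link are dropped.

```python
from collections import defaultdict

def split_by_figure(rows):
    """Group (clause, fig_span) rows into one sub-document per figure,
    mirroring the per-figure *.tsv files written to the output directory.
    Illustrative only: column handling in the real script may differ."""
    subdocs = defaultdict(list)
    for clause, fig_span in rows:
        if fig_span:  # skip clauses with no figure linkage
            subdocs[fig_span].append(clause)
    return dict(subdocs)
```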

Use of Jupyter notebooks from within the container for development and experimentation

The container provides access to a Jupyter endpoint that is accessible from outside the container via the link http://localhost:8888/. This provides a framework for Python experimentation and development within the docker image.

Basic SciDT Function (preserved from original edvisees/sciDT version)

We include these instructions from the original version of SciDT (developed by Pradeep Dasigi under Ed Hovy at CMU).

Basic Python Requirements

  • Theano (tested with v0.8.0)
  • Keras (tested with v0.3.2)
  • Pretrained word embeddings (recommended: http://bio.nlplab.org/#word-vectors): SciDT expects a gzipped embedding file with each line containing a word and its vector (a list of floats), separated by spaces

Input Format

SciDT expects inputs to be lists of clauses with paragraph boundaries identified; i.e., each line in the input file should be a clause, and paragraphs should be separated by blank lines.

If you are training, the file additionally needs labels at the clause level, which can be specified on each line, after the clause, separated by a tab. Please look at the sample train and test files for the expected format.
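The input format described above can be parsed with a short reader. This is a sketch under the stated format assumptions (one clause per line, blank lines between paragraphs, tab-separated labels for training data); the function name is ours, not part of the repository.

```python
def read_passages(lines, is_labeled=False):
    """Return a list of paragraphs; each paragraph is a list of clauses,
    or of (clause, label) pairs when is_labeled is True."""
    paragraphs, current = [], []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():      # a blank line closes the current paragraph
            if current:
                paragraphs.append(current)
                current = []
            continue
        if is_labeled:
            clause, label = line.split("\t")
            current.append((clause, label))
        else:
            current.append(line)
    if current:                   # flush the final paragraph
        paragraphs.append(current)
    return paragraphs
```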

Intended Usage

As mentioned in the paper, the model is intended for tagging discourse elements in experiment narratives in biomedical research papers, and we use the seven label taxonomy described in De Waard and Pander Maat (2012). The taxonomy is defined at the clause level, which is why we assume that each line in the input file is a clause. However, the model itself is more general than this, and can be put to use for tagging other kinds of discourse elements as well, even at the sentence level. If you find other uses for this code, I would love to hear about it!

Training

python nn_passage_tagger.py --repfile REPFILE --train_file TRAINFILE --use_attention

where REPFILE is the embedding file. --use_attention is recommended. Check the help messages of nn_passage_tagger.py for more options.

Trained model

After you train successfully, three new files appear in the directory, with file names containing the chosen values for att, cont and bi:

  • model_att=*_cont=*_bi=*_config.json: The model description
  • model_att=*_cont=*_bi=*_label_ind.json: The label index
  • model_att=*_cont=*_bi=*_weights: Learned model weights
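The naming scheme above can be sketched as follows. This is an illustrative reconstruction; how nn_passage_tagger.py actually serializes the flag values into the names may differ.

```python
def model_file_names(att, cont, bi):
    """Compose the three output file names from the chosen attention,
    context, and bidirectional settings (illustrative only)."""
    prefix = "model_att=%s_cont=%s_bi=%s" % (att, cont, bi)
    suffixes = ("_config.json", "_label_ind.json", "_weights")
    return [prefix + s for s in suffixes]
```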

Testing

You can specify test files during training itself using the --test_files argument. Alternatively, you can test after training is done. In the latter case, nn_passage_tagger assumes the trained model files described above are present in the directory.

python nn_passage_tagger.py REPFILE --test_files TESTFILE1 [TESTFILE2 ..] --use_attention

Make sure you use the same options for attention, context and bidirectional as you used for training.

To run the tagger over a directory of tsv files (see "SciDT processing of TSV files" above):

  python nn_tsv_passage_tagger.py --test_dir /tmp/data/tsv_clause/ --out_dir /tmp/data/disc_temp --use_attention --att_context clause