# Lightning Tour

Introduces the main ways of using Saber.

### Table of contents

1. [Quick Start](#Quick-Start)
    1. [Web-service](#Web-service)
    2. [Pre-trained models](#Pre-trained-models)
        1. [Working with annotations](#Working-with-annotations)
        
2. [Guide to the Saber API](#Guide-to-the-Saber-API)  
    1. [Command line tool](#Command-line-tool)
    2. [Python package](#Python-package)
        1. [Transfer learning](#Transfer-learning)
        2. [Multi-task learning](#Multi-task-learning)
        3. [Saving and loading models](#Saving-and-loading-models)
            1. [Saving a model](#Saving-a-model)
            2. [Loading a model](#Loading-a-model)
3. [Visualizations](#Visualizations)

## Quick Start

If your goal is simply to use Saber to annotate biomedical text, then you can either use the [web-service](#Web-service) or a [pre-trained model](#Pre-trained-models).

### Web-service

To use Saber as a **local** web-service, run:

In [None]:
! python -m saber.app

or, if you prefer, you can pull & run the Saber image from **Docker Hub**:

In [None]:
# Pull Saber image from Docker Hub
! docker pull pathwaycommons/saber
# Run docker (use `-dt` instead of `-it` to run container in background)
! docker run -it --rm -p 5000:5000 --name saber pathwaycommons/saber

> Alternatively, you can clone the GitHub repository and build the container from the `Dockerfile` with `docker build -t saber .`

The web-service is now live, and can be accessed by directing your browser here: [http://127.0.0.1:5000/](http://127.0.0.1:5000/). Although you can run these commands in the notebook, it makes more sense to copy and paste them directly into the shell. Just remember to remove the preceding "!".


There are currently two endpoints, `/annotate/text` and `/annotate/pmid`. Both expect a `POST` request with a JSON payload, e.g.

```json
{
  "text": "The phosphorylation of Hdm2 by MK2 promotes the ubiquitination of p53."
}
```

or


```json
{
  "pmid": 11835401
}
```

For example, with the web-service running locally and using `cURL`:

In [None]:
! curl -X POST 'http://localhost:5000/annotate/text' --data '{"text": "The phosphorylation of Hdm2 by MK2 promotes the ubiquitination of p53."}'
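
The same request can also be built from Python. The following is a minimal sketch using only the standard library; it assumes the web-service is running locally on port 5000 as described above. The helper `build_annotate_request` is hypothetical (it is not part of Saber), and the explicit `Content-Type` header is an assumption that the `cURL` example does not make.

```python
import json
import urllib.request

# Assumption: the web-service is running locally, as started above.
BASE_URL = "http://localhost:5000"


def build_annotate_request(payload, endpoint="text", base_url=BASE_URL):
    """Build a POST request for /annotate/text or /annotate/pmid.

    Hypothetical helper for illustration; not part of Saber.
    """
    if endpoint not in ("text", "pmid"):
        raise ValueError("endpoint must be 'text' or 'pmid'")
    return urllib.request.Request(
        url=f"{base_url}/annotate/{endpoint}",
        data=json.dumps(payload).encode("utf-8"),
        # Assumption: the service accepts an explicit JSON content type.
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_annotate_request(
    {"text": "The phosphorylation of Hdm2 by MK2 promotes the ubiquitination of p53."}
)

# Uncomment to send the request once the web-service is live:
# with urllib.request.urlopen(request) as response:
#     annotation = json.loads(response.read().decode("utf-8"))
```

Passing `{"pmid": 11835401}` with `endpoint="pmid"` builds the equivalent request for the second endpoint.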

Documentation for the Saber web-service API can be found [here](https://baderlab.github.io/saber-api-docs/). We hope to provide a live version of the web-service soon!

### Pre-trained models

First, import `SequenceProcessor`. This class coordinates training, annotation, saving and loading of models and datasets. In short, this is the interface to Saber.

In [None]:
from saber.sequence_processor import SequenceProcessor

To load a pre-trained model, we first create a `SequenceProcessor` object

In [None]:
sp = SequenceProcessor()

and then load the model of our choice

In [None]:
sp.load('PRGE')

You can see all the pre-trained models in the [web-service API docs](https://baderlab.github.io/saber-api-docs/) or, alternatively, by running the following line of code

In [None]:
from saber.constants import ENTITIES

print(list(ENTITIES.keys()))

To annotate text with the model, just call the `annotate()` method

In [None]:
text = "The phosphorylation of Hdm2 by MK2 promotes the ubiquitination of p53."
# text = "Interleukin-6 is a multifaceted cytokine, usually reported as a pro-inflammatory molecule. However, certain anti-inflammatory activities were also attributed to IL-6. The levels of IL-6 in serum as well as in other biological fluids are elevated in an age-dependent manner. Notably, it is consistently reported also as a key feature of the senescence-associated secretory phenotype. In the elderly, this cytokine participates in the initiation of catabolism resulting in, e.g. sarcopenia. It can cross the blood-brain barrier, and so it is in causal association with, e.g. depression, bipolar disorder, schizophrenia, and anorexia. In the cancer patient, IL-6 is produced by cancer and stromal cells and actively participates in their crosstalk. IL-6 supports tumour growth and metastasising in terminal patients, and it significantly engages in cancer cachexia (including anorexia) and depression associated with malignancy. The pharmacological treatment impairing IL-6 signalling represents a potential mechanism of anti-tumour therapy targeting cancer growth, metastatic spread, metabolic deterioration and terminal cachexia in patients."
sp.annotate(text, coref=False, jupyter=True)

#### Coreference Resolution

[**Coreference**](http://www.wikiwand.com/en/Coreference) occurs when two or more expressions in a text refer to the same person or thing, that is, they have the same **referent**. Take the following example:

_"IL-6 supports tumour growth and metastasising in terminal patients, and it significantly engages in cancer cachexia (including anorexia) and depression associated with malignancy."_

Clearly, _"it"_ refers to _"IL-6"_. If we do not resolve this coreference, then _"it"_ will not be labeled as an entity and any relation or event it is mentioned in will not be extracted. Saber uses [NeuralCoref](https://github.com/huggingface/neuralcoref), a state-of-the-art coreference resolution tool based on neural nets and built on top of [spaCy](https://spacy.io). To use it, just supply the argument `coref=True` (which is `False` by default) to the `annotate()` method

In [None]:
text = "IL-6 supports tumour growth and metastasising in terminal patients, and it significantly engages in cancer cachexia (including anorexia) and depression associated with malignancy."
# WITHOUT coreference resolution
sp.annotate(text, coref=False, jupyter=True)
# WITH coreference resolution
sp.annotate(text, coref=True, jupyter=True)

> Note that if you are using the web-service, simply supply `"coref": true` in your `JSON` payload to resolve coreferences.
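
As a small illustration of the note above, the payload can be assembled in Python before it is sent. This sketch only builds the request body; the variable names are illustrative.

```python
import json

# Same payload shape as the /annotate/text example, plus the "coref" flag.
payload = {
    "text": (
        "IL-6 supports tumour growth and metastasising in terminal patients, "
        "and it significantly engages in cancer cachexia (including anorexia) "
        "and depression associated with malignancy."
    ),
    "coref": True,  # Python's True is serialized as JSON true
}

body = json.dumps(payload)
print(body)
```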

Saber currently takes the simplest possible approach: replace all coreference mentions with their referent, and then feed the resolved text to the model that identifies named entities.
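
To make the replacement step concrete, here is a toy sketch of substituting mentions with their referent. It is not Saber's implementation: in Saber the mentions are found by NeuralCoref, whereas here the mention span is supplied by hand, and the function name `replace_mentions` is hypothetical.

```python
def replace_mentions(text, mentions):
    """Replace each (start, end, referent) mention span with its referent.

    Spans are character offsets into `text` and must not overlap.
    """
    # Work from right to left so earlier offsets stay valid after each edit.
    for start, end, referent in sorted(mentions, reverse=True):
        text = text[:start] + referent + text[end:]
    return text


text = "IL-6 supports tumour growth, and it engages in cancer cachexia."
start = text.index(" it ") + 1  # character offset of the pronoun "it"
resolved = replace_mentions(text, [(start, start + 2, "IL-6")])
print(resolved)
```

The resolved string is what would then be passed on to the named entity model.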

#### Working with annotations

The `annotate()` method returns a simple `dict` object

In [None]:
ann = sp.annotate("The phosphorylation of Hdm2 by MK2 promotes the ubiquitination of p53.")

which contains the keys `title`, `text` and `ents`:

- `title`: contains the title of the article, if provided
- `text`: contains the text (which is minimally processed) the model was deployed on
- `ents`: contains a list of entities present in the `text` that were annotated by the model

For example, to see all entities annotated by the model, call

In [None]:
ann['ents']
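
Once you have the list of entities, ordinary Python is enough to work with it. The sketch below groups mentions by label. It uses a hand-written stand-in for the annotation rather than real model output, and it assumes that each entry in `ents` is a `dict` with `text`, `label`, `start` and `end` keys; check the output of `ann['ents']` above for the actual structure.

```python
from collections import defaultdict

text = "The phosphorylation of Hdm2 by MK2 promotes the ubiquitination of p53."


def make_ent(mention, label):
    # Build a hand-written entity; offsets are computed from the text itself.
    start = text.index(mention)
    return {"text": mention, "label": label, "start": start, "end": start + len(mention)}


# Hand-written stand-in for the output of sp.annotate(); not real model output.
sample_ann = {
    "title": "",
    "text": text,
    "ents": [make_ent(m, "PRGE") for m in ("Hdm2", "MK2", "p53")],
}

# Group entity mentions by their label.
by_label = defaultdict(list)
for ent in sample_ann["ents"]:
    by_label[ent["label"]].append(ent["text"])

print(dict(by_label))
```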

##### Converting annotations to JSON

The `annotate()` method returns a `dict` object, which can be converted to a `JSON`-formatted string for ease of use in downstream applications

In [None]:
import json

# Convert to a JSON-formatted string
json_ann = json.dumps(ann)

# Convert back to a Python dictionary
ann = json.loads(json_ann)
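
If you want to keep annotations on disk, the same module can write them to a file. This is a minimal sketch that uses a stand-in annotation with the three documented keys (not real model output) and a temporary directory; the file name is arbitrary.

```python
import json
import os
import tempfile

# Stand-in annotation with the three documented keys; not real model output.
sample_ann = {"title": "", "text": "MK2 phosphorylates Hdm2.", "ents": []}

# Write the annotation to disk as indented JSON, then read it back.
path = os.path.join(tempfile.mkdtemp(), "annotation.json")
with open(path, "w", encoding="utf-8") as f:
    json.dump(sample_ann, f, indent=2)

with open(path, encoding="utf-8") as f:
    restored = json.load(f)
```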

## Guide to the Saber API

You can interact with Saber as a web-service (explained in [Quick Start](#Quick-Start)), command line tool, Python package, or via the Jupyter notebooks. If you created a virtual environment, _remember to activate it first_.

Note: To train your own models, you will need to provide a dataset (or datasets!) and, ideally, pre-trained word embeddings. See [Resources](https://baderlab.github.io/saber/resources/) for help preparing datasets for training.

### Command line tool

Currently, the command line tool simply trains the model. To use it, call

In [None]:
! python -m saber.train

along with any command line arguments.

> Again, while you can run these commands in the notebook, it makes more sense to copy and paste them directly into the shell. Just remember to remove the preceding "!".

For example, to train the model on the [NCBI Disease](https://www.ncbi.nlm.nih.gov/CBBresearch/Dogan/DISEASE/) corpus

In [None]:
! python -m saber.train --dataset_folder NCBI_disease_BIO

Run `python -m saber.train -h` to see all possible arguments.

Of course, supplying arguments at the command line can quickly become cumbersome. Saber also allows you to supply a configuration file, like so

In [None]:
! python -m saber.train --config_filepath path/to/config.ini

Copy the contents of the [default config file](https://github.com/BaderLab/saber/blob/master/saber/config.ini) to a new `*.ini` file in order to get started.
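
Before training, it can help to check that your new `*.ini` file parses as expected. The sketch below uses Python's standard `configparser` on an inline example. The section name `data` and the values shown are assumptions for illustration only; the real sections and options are those in Saber's default config file.

```python
import configparser

# Hypothetical config contents for illustration; the section name "data" and
# the values are assumptions. Start from Saber's default config.ini instead.
example_ini = """
[data]
dataset_folder = path/to/dataset
k_folds = 5
"""

config = configparser.ConfigParser()
config.read_string(example_ini)

# List every section and option so a typo is easy to spot before training.
for section in config.sections():
    for key, value in config[section].items():
        print(f"{section}.{key} = {value}")
```

To inspect a file on disk, replace `config.read_string(example_ini)` with `config.read('path/to/config.ini')`.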

Note that arguments supplied at the command line override those found in the configuration file. For example

In [None]:
! python -m saber.train --dataset_folder path/to/dataset --k_folds 10

would override the values of `dataset_folder` and `k_folds` found in the configuration file.
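
The precedence rule can be illustrated with a short, self-contained sketch. This is not Saber's own argument handling: it is a generic pattern using `argparse` and `configparser`, and the section name `settings` and the example values are assumptions.

```python
import argparse
import configparser

# Values as they might appear in a config file (section name is an assumption).
config = configparser.ConfigParser()
config.read_string("[settings]\ndataset_folder = path/from/config\nk_folds = 5\n")
settings = dict(config["settings"])

# Flags default to None so "not supplied" can be told apart from a real value.
parser = argparse.ArgumentParser()
parser.add_argument("--dataset_folder", default=None)
parser.add_argument("--k_folds", default=None)
cli_args = vars(parser.parse_args(["--k_folds", "10"]))

# Command line values win; anything not supplied falls back to the config file.
settings.update({k: v for k, v in cli_args.items() if v is not None})
print(settings)
```

Here only `--k_folds` is given on the command line, so `k_folds` is replaced while `dataset_folder` keeps the value from the configuration file.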

### Python package

You can also import Saber and interact with it as a Python package. Saber exposes its functionality through the `SequenceProcessor` class. Here is just about everything Saber does in one script:

In [None]:
from saber.sequence_processor import SequenceProcessor

# First, create a SequenceProcessor object, which exposes Saber's functionality
sp = SequenceProcessor()

# Load a dataset and create a model (provide a list of datasets to use multi-task learning!)
sp.load_dataset('path/to/datasets/GENIA')
sp.create_model()

# Train and save a model
sp.fit()
sp.save('pretrained_models/GENIA')

# Load a model
del sp
sp = SequenceProcessor()
sp.load('pretrained_models/GENIA')

# Perform prediction on raw text, get resulting annotation
raw_text = 'The phosphorylation of Hdm2 by MK2 promotes the ubiquitination of p53.'
annotation = sp.annotate(raw_text)

# Use transfer learning to continue training on a new dataset
sp.load_dataset('path/to/datasets/CRAFT')
sp.fit()

#### Transfer learning

Transfer learning is as easy as training, saving, loading, and then continuing training of a model. Here is an example

In [None]:
# Create and train a model on GENIA corpus
sp = SequenceProcessor()
sp.load_dataset('path/to/datasets/GENIA')
sp.create_model()
sp.fit()
sp.save('pretrained_models/GENIA')

# Load that model
del sp
sp = SequenceProcessor()
sp.load('pretrained_models/GENIA')

# Use transfer learning to continue training on a new dataset
sp.load_dataset('path/to/datasets/CRAFT')
sp.fit()

> Note that there is currently no way to easily do this with the command line interface, but I am working on it!

#### Multi-task learning

Multi-task learning is as easy as specifying multiple dataset paths, either in the `config` file, at the command line via the flag `--dataset_folder`, or as an argument to `load_dataset()`. The number of datasets is arbitrary.

Here is an example using the last method

In [None]:
sp = SequenceProcessor()

# Simply pass multiple dataset paths as a list to load_dataset to use multi-task learning.
sp.load_dataset(['path/to/datasets/NCBI-Disease', 'path/to/datasets/Linnaeus'])

sp.create_model()
sp.fit()

#### Saving and loading models

In the following sections we introduce the saving and loading of models.

##### Saving a model

Assuming the model has already been created (see above), we can easily save our model like so

In [None]:
path_to_saved_model = 'path/to/pretrained_models/mymodel'

sp.save(path_to_saved_model)

##### Loading a model

Let's illustrate loading a model with a new `SequenceProcessor` object

In [None]:
# Delete our previous SequenceProcessor object (if it exists)
if 'sp' in locals(): del sp

# Create a new SequenceProcessor object
sp = SequenceProcessor()

# Load a previous model
sp.load(path_to_saved_model)

## Visualizations

_Note: This is less a feature and more a by-product of the fact that the model is implemented in [Keras](https://keras.io)._

We can easily create an image depicting our model, which is useful if you plan on modifying its architecture. First, install the [graphviz graph library](http://www.graphviz.org/) and the [Python interface](https://pypi.python.org/pypi/graphviz).

> More info can be found [here](https://machinelearningmastery.com/visualize-deep-learning-neural-network-model-keras/).

In [None]:
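# Assumes `sp` is an existing `SequenceProcessor` object on which
# `sp.create_model()` has already been called (see the sections above).
# Creating a fresh `SequenceProcessor` here would leave it without a model.

# Set this variable equal to your Keras model object.
model_ = sp.model.model[0]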

We can either create and save an image on our local machine,

In [None]:
from keras.utils import plot_model
plot_model(model_, to_file='model.png', show_shapes=True, show_layer_names=True)

or, visualize it directly in the notebook

In [None]:
from IPython.display import SVG
from keras.utils.vis_utils import model_to_dot

SVG(model_to_dot(model_, show_shapes=True, show_layer_names=True).create(prog='dot', format='svg'))