In the biomedical domain, Named Entity Recognition (NER) is often followed by Relation Detection (RD): linking biomedical entities to one another to find meaningful interactions that can be explored further. Because the field has a large number of named entity classes, the number of possible pairings between entities grows combinatorially. Determining experimentally which of these relationships matter most, which is the current practice in most R&D labs, is therefore too costly and time-consuming. Computational approaches that parse millions of biomedical research articles, however, can identify millions of such associations and assemble them into networks. For instance, identifying protein interactions allows the construction of protein-protein interaction networks. Such relation networks use previously established associations to narrow down unknown and potentially interesting connections for further study. They also provide a global view of different biological entities and their interactions, such as diseases, genes, foods, drugs, side effects, pathways, and toxins, opening new routes of research.


In this short project I demonstrate a NER model that annotates genes, trained on 2000 abstracts reviewed in Sysrev's Gene Hunter project. Gene Hunter was an open online review of 2000 PubMed abstracts in which 15 reviewers highlighted genes in the text. Sysrev data is accessible through the Sysrev Python client, PySysrev. The Gene Hunter project has project_id 3144, which is all we need to fetch its data with the PySysrev.getAnnotations API call.
PySysrev also provides an API call, PySysrev.processAnnotations, that downloads the data in a shape spaCy can handle.

Next, I will use the NER model for relation detection between genes and various other biomedical entities.

In [16]:
#pip install PySysrev


In [15]:
#pip install spacy

In [10]:
import PySysrev, spacy, random
from spacy import displacy

Get spaCy-ready annotations from Gene Hunter as a list of annotated paragraphs:

In [4]:
processed_output = PySysrev.processAnnotations(project_id=3144,label='GENE')
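
For orientation, spaCy's v2 training format pairs each text with a dictionary of character-offset entity spans. The sketch below builds one such example by hand and prints the annotated substrings. It assumes that processed_output uses this same shape; the helper name make_example is hypothetical and is not part of PySysrev or spaCy.

```python
def make_example(text, mentions, label="GENE"):
    """Build one spaCy-v2-style example: (text, {"entities": [(start, end, label)]}).

    `mentions` lists substrings to annotate; only the first occurrence of each
    is marked. Hypothetical helper, shown only to illustrate the assumed shape.
    """
    entities = []
    for mention in mentions:
        start = text.find(mention)
        if start == -1:
            raise ValueError(f"mention {mention!r} not found in text")
        entities.append((start, start + len(mention), label))
    return (text, {"entities": sorted(entities)})


example = make_example(
    "Mutations in daf-2 and age-1 extend lifespan.", ["daf-2", "age-1"]
)
text, annotations = example
for start, end, label in annotations["entities"]:
    # Slicing the text with the offsets should recover each gene mention.
    print(text[start:end], label)
```

Each element of processed_output is expected to look like `example`: item[0] is the text and item[1] is the annotation dictionary.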


Create a model with spacy and name it 'Gene':

In [13]:
nlp = spacy.load("en_core_web_sm")
nlp.meta['name'] = 'Gene'

Create NER and an optimizer for training:

In [None]:
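# spaCy v2 API. en_core_web_sm already ships an 'ner' pipe, so reuse it rather than adding a duplicate.
if 'ner' in nlp.pipe_names:
    ner = nlp.get_pipe('ner')
else:
    ner = nlp.create_pipe('ner')
    nlp.add_pipe(ner)
ner.add_label('GENE')
optimizer = nlp.begin_training()  # initializes model weights and returns the optimizer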

To train the model I repeatedly call nlp.update on the training corpus processed_output. Each pass over the corpus is referred to as a 'cycle', and the model should improve with each one. Internally, spaCy fits a complex model to the ~1000 training instances provided by Sysrev:

In [None]:
cycles = 30
for i in range(cycles):
    random.shuffle(processed_output)                     #shuffle examples 
    text = [item[0] for item in processed_output]        #get training text items
    annotations = [item[1] for item in processed_output] #get training annotations
    nlp.update(text, annotations, sgd=optimizer, drop=0.6)
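
The loop above trains on every example, so it gives no independent measure of whether the model actually improves. One way to check would be to hold out part of processed_output and compare the predicted spans with the reviewers' spans. The following is a minimal sketch of exact-match scoring, assuming spans are (start, end, label) tuples; span_prf is a hypothetical helper and the spans are made up for illustration.

```python
def span_prf(gold, predicted):
    """Exact-match precision, recall and F1 over (start, end, label) spans."""
    gold, predicted = set(gold), set(predicted)
    true_pos = len(gold & predicted)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Illustrative spans only: one prediction matches a gold span, one does not.
gold = [(13, 18, "GENE"), (23, 28, "GENE")]
pred = [(13, 18, "GENE"), (30, 36, "GENE")]
print(span_prf(gold, pred))
```

For a trained pipeline, the predicted spans could be collected as `[(e.start_char, e.end_char, e.label_) for e in nlp(text).ents]`.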

I visualize the model's abilities on a paragraph taken from a random PubMed paper, using spaCy's displaCy visualizer. The model seems to work: it captures the 4 genes in the segment while skipping IIS, which is a pathway entity:

In [None]:
doc = nlp("""Two main classes have been described of lifespan-extension mutants
        in Caenorhabditis elegans. The first consists of genes with activity
        in the mitochondrial electron transport chain, such as clk-1 and isp-1, 
        whose mutation moderately reduces oxidative phosphorylation capacity and prolongs
        life in worms; these mutations established the first link between energy metabolism
        and longevity. The second mutant class is related to hormone mechanisms of the
        insulin/IGF-I signaling (IIS) pathway, such as daf-2 and age-1 mutants,
        which extend lifespan in worms, flies and mice.""")

displacy.serve(doc, style="ent")

Two main classes have been described of lifespan-extension mutants
in Caenorhabditis elegans. The first consists of genes with activity
in the mitochondrial electron transport chain, such as clk-1 GENE and isp-1 GENE,
whose mutation moderately reduces oxidative phosphorylation capacity and prolongs
life in worms; these mutations established the first link between energy metabolism
and longevity. The second mutant class is related to hormone mechanisms of the
insulin/IGF-I signaling (IIS) pathway, such as daf-2 GENE and age-1 GENE mutants,
which extend lifespan in worms, flies and mice.
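
As a pointer toward the relation detection step, recognized entities can be turned into candidate relations by counting which entities appear together in a sentence. This is only a naive co-occurrence baseline under stated assumptions, not the relation detection method of this project: the function name cooccurrence_counts is hypothetical, and the sentence grouping is written by hand from the paragraph above rather than produced by the model.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(sentences):
    """Count unordered entity pairs that appear in the same sentence.

    `sentences` is a list of lists of (entity_text, label) tuples,
    one inner list per sentence.
    """
    counts = Counter()
    for entities in sentences:
        # Sort so that each unordered pair always gets the same key.
        unique = sorted(set(entities))
        for a, b in combinations(unique, 2):
            counts[(a, b)] += 1
    return counts


# Hand-written grouping of the four genes by the sentence they occur in.
sentences = [
    [("clk-1", "GENE"), ("isp-1", "GENE")],
    [("daf-2", "GENE"), ("age-1", "GENE")],
]
counts = cooccurrence_counts(sentences)
for pair, n in counts.items():
    print(pair, n)
```

Aggregated over many abstracts, pairs with high counts would be candidate edges for a relation network; co-occurrence alone says nothing about the type or direction of the relation.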