The task in this notebook is entity annotation and linking using Scispacy.

## Install scispacy
[Scispacy](https://github.com/allenai/scispacy) is a tool for processing biomedical, scientific or clinical text.
It can annotate entities in text and link them to concepts in a knowledge base such as UMLS.




In [None]:
!pip install spacy==2.3.1
!pip install scispacy==0.3.0
# Install the en_core_sci_sm model package (small model) from the scispacy release bucket
# Use en_core_sci_md for the medium model, or en_core_sci_lg for the large one
!pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.3.0/en_core_sci_sm-0.3.0.tar.gz

Collecting https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.3.0/en_core_sci_sm-0.3.0.tar.gz
  Using cached https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.3.0/en_core_sci_sm-0.3.0.tar.gz (33.1 MB)
Building wheels for collected packages: en-core-sci-sm
  Building wheel for en-core-sci-sm (setup.py) ... done
  Created wheel for en-core-sci-sm: filename=en_core_sci_sm-0.3.0-py3-none-any.whl size=33119278 sha256=d79c50e966d35fdffb7a1cb7b651343140c85c3d7225783440a0f58c5357232e
  Stored in directory: /root/.cache/pip/wheels/c1/ac/bc/75799930092270c4efedabe938585fee0abc61d53d15dc6ea6
Successfully built en-core-sci-sm
Installing collected packages: en-core-sci-sm
  Attempting uninstall: en-core-sci-sm
    Found existing installation: en-core-sci-sm 0.4.0
    Uninstalling en-core-sci-sm-0.4.0:
      Successfully uninstalled en-core-sci-sm-0.4.0
Successfully installed en-core-sci-sm-0.3.0
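
The log above shows an existing `en-core-sci-sm 0.4.0` being replaced by `0.3.0`, so it can be worth confirming that the pinned versions are the ones actually installed before going further. The following is a minimal sketch using the standard-library `importlib.metadata` module (Python 3.8+); the helper name `check_versions` and the `EXPECTED` dictionary are illustrative and not part of scispacy.

```python
from importlib import metadata

# Versions pinned in the install cell above (distribution names as pip reports them).
EXPECTED = {"spacy": "2.3.1", "scispacy": "0.3.0", "en-core-sci-sm": "0.3.0"}


def check_versions(expected, get_version=metadata.version):
    """Return {package: installed_version_or_None} for every package whose
    installed version differs from the expected pin.

    `get_version` is injectable so the helper can be exercised without the
    packages being installed (hypothetical convenience, not a scispacy API).
    """
    mismatches = {}
    for package, wanted in expected.items():
        try:
            found = get_version(package)
        except metadata.PackageNotFoundError:
            found = None  # package is not installed at all
        if found != wanted:
            mismatches[package] = found
    return mismatches


# An empty dict means every pin matches the installed version.
print(check_versions(EXPECTED))
```

If the result is not empty, restarting the runtime after reinstalling is usually needed so that the already-imported modules are replaced.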


## Import libraries

In [None]:
import scispacy
import spacy
# Import the small model installed above
import en_core_sci_sm
from scispacy.linking import EntityLinker
from spacy import displacy

## NER
Annotate entities with Scispacy and link them to UMLS concepts.
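
The linker used in the next cell stores its candidates on each entity as `entity._.kb_ents`, a list of `(CUI, score)` pairs. As a minimal sketch of how such a list might be post-processed, the helper below keeps only candidates above a threshold and returns the highest-scoring one. The function name `best_candidate`, the `min_score` parameter, and the scores in `example` are hypothetical and only for illustration; real scores come from the linker.

```python
from typing import List, Optional, Tuple

Candidate = Tuple[str, float]  # (CUI, similarity score)


def best_candidate(kb_ents: List[Candidate],
                   min_score: float = 0.0) -> Optional[Candidate]:
    """Return the highest-scoring (CUI, score) pair with score >= min_score,
    or None when no candidate passes the threshold."""
    kept = [cand for cand in kb_ents if cand[1] >= min_score]
    if not kept:
        return None
    return max(kept, key=lambda cand: cand[1])


# CUIs taken from the UMLS concepts this notebook links to; the scores are
# made up for illustration.
example = [("C1839259", 0.93), ("C0026846", 0.81)]
print(best_candidate(example, min_score=0.85))
```

With a real document, the same call would take `entity._.kb_ents` as its first argument.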


In [None]:
# Load the small scispaCy model
nlp = en_core_sci_sm.load()

# Add the AbbreviationDetector pipe first: resolve_abbreviations=True below
# requires that this pipe has already been added to the pipeline.
from scispacy.abbreviation import AbbreviationDetector
nlp.add_pipe(AbbreviationDetector(nlp))

# Creating the linker takes a while, because we have to download ~1GB of data
# and load a large JSON file (the knowledge base). Be patient!
# Thankfully it should be faster after the first time you use it, because
# the downloads are cached.
# NOTE: The resolve_abbreviations parameter is optional. With the
# AbbreviationDetector pipe in the pipeline and resolve_abbreviations set to
# True, linking is performed only on the long form of abbreviations.
linker = EntityLinker(resolve_abbreviations=True, name="umls")

nlp.add_pipe(linker)
# Text to annotate
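# Adjacent string literals are concatenated, which avoids embedding the
# indentation whitespace that backslash continuations inside a string would add.
sentence = (
    "Spinal and bulbar muscular atrophy (SBMA) is an "
    "inherited motor neuron disease caused by the expansion "
    "of a polyglutamine tract within the androgen receptor (AR). "
    "SBMA can be caused by this easily."
)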

# Annotate entities
doc = nlp(sentence)

# Display text highlighting annotated entities
displacy.render(doc, jupyter=True, style='ent')

# Look at the second entity in the document (index 1)
entity = doc.ents[1]
print("Name: ", entity)

# The EntityLinker links each entity to one or more UMLS concepts,
# stored as (CUI, score) pairs in entity._.kb_ents
for umls_ent in entity._.kb_ents:
    # Show each linked concept with its information
    print(linker.kb.cui_to_entity[umls_ent[0]])




Name:  bulbar muscular atrophy
CUI: C1839259, Name: Bulbo-Spinal Atrophy, X-Linked
Definition: An X-linked recessive form of spinal muscular atrophy. It is due to a mutation of the gene encoding the ANDROGEN RECEPTOR.
TUI(s): T047
Aliases (abbreviated, total: 39): 
	 Bulbospinal Muscular Atrophy, X linked, SMAX1, X Linked Spinal and Bulbar Muscular Atrophy, Bulbospinal muscular atrophy, kennedy's syndrome, Atrophy, Muscular, Spinobulbar, X Linked Bulbo Spinal Atrophy, X-Linked Spinal and Bulbar Muscular Atrophy, Bulbo Spinal Atrophy, X Linked, X-Linked Bulbo-Spinal Atrophy
CUI: C0026846, Name: Muscular Atrophy
Definition: Derangement in size and number of muscle fibers occurring with aging, reduction in blood supply, or following immobilization, prolonged weightlessness, malnutrition, and particularly in denervation.
TUI(s): T046
Aliases (abbreviated, total: 32): 
	 Muscle atrophy, NOS, ATROPHY MUSCLE, amyotrophia, Muscle wasting, NOS, Muscle Atrophy, Muscle wasting disorder, Muscular 