## spaCy's NER and entity linking

[spaCy](https://spacy.io/) is an open-source NLP library. Its components are not state of the art, but they are robust, fast, and easy to use.

We'll demo how to use it for simple tasks and then try a pretrained entity linker that links to Wikidata items. The [spacy-entity-linker](https://pypi.org/project/spacy-entity-linker/) is not great, but it is worth looking at.

You may need to do the following:
 * pip install spacy
 * pip install spacy-entity-linker
 * python -m spacy download en_core_web_md
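Before running the cells below, it can help to check which of the required packages are importable. The following is a minimal sketch using only the standard library. The helper name `missing_packages` is hypothetical, and the module names are assumptions taken from the imports and outputs in this notebook (`spacy`, `en_core_web_md`, `spacy_entity_linker`).

```python
from importlib.util import find_spec


def missing_packages(module_names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in module_names if find_spec(name) is None]


# Module names assumed from the imports used in this notebook.
required = ["spacy", "en_core_web_md", "spacy_entity_linker"]
print("Missing packages:", missing_packages(required))
```

An empty list suggests the install steps above have already been done; any names printed point to the step that is still needed.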

In [2]:
import spacy
from spacy import displacy
from collections import Counter
import en_core_web_md  # medium English pipeline

### Load one of spaCy's language models. This one is a medium-sized pipeline for English

In [3]:
nlp = spacy.load("en_core_web_md")

### Add an entity linker as the last stage in the pipeline, using [one](https://pypi.org/project/spacy-entity-linker/) developed for spaCy that links to Wikidata.

In [4]:
# add pipeline (declared through entry_points in setup.py)
nlp.add_pipe("entityLinker", last=True)

<spacy_entity_linker.EntityLinker.EntityLinker at 0x7f90ac571f10>

### Input text can be a phrase, sentence or short paragraph

In [5]:
text = "Trials involving vaccines, antiviral drugs, immunotherapies, monoclonal antibodies, stem cells, and nitric oxide are summarized in Table 1 ."

### Analyze a simple text string by using the nlp object 

In [6]:
doc = nlp(text)
# look at its properties to see what we can do...
print("doc properties:", dir(doc))  

doc properties: ['_', '__bytes__', '__class__', '__delattr__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__iter__', '__le__', '__len__', '__lt__', '__ne__', '__new__', '__pyx_vtable__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__setstate__', '__sizeof__', '__str__', '__subclasshook__', '__unicode__', '_bulk_merge', '_get_array_attrs', '_py_tokens', '_realloc', '_vector', '_vector_norm', 'cats', 'char_span', 'copy', 'count_by', 'doc', 'ents', 'extend_tensor', 'from_array', 'from_bytes', 'from_dict', 'from_disk', 'from_docs', 'get_extension', 'get_lca_matrix', 'has_annotation', 'has_extension', 'has_unknown_spaces', 'has_vector', 'is_nered', 'is_parsed', 'is_sentenced', 'is_tagged', 'lang', 'lang_', 'mem', 'noun_chunks', 'noun_chunks_iterator', 'remove_extension', 'retokenize', 'sentiment', 'sents', 'set_ents', 'set_extension', 'similarity', 'spans', 'tensor', 'text'
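The raw `dir()` listing is dominated by dunder and private names. A small filter makes it easier to scan. This is a sketch with a hypothetical helper name, `public_attributes`, shown on a stand-in object so that it runs without loading a pipeline.

```python
def public_attributes(obj):
    """List attribute names of obj that do not start with an underscore."""
    return [name for name in dir(obj) if not name.startswith("_")]


# Stand-in object so the sketch runs without a spaCy pipeline.
class Example:
    def __init__(self):
        self.text = "hello"
        self._hidden = 1


print(public_attributes(Example()))  # → ['text']
```

In the notebook, `public_attributes(doc)` would be the corresponding call. Note that the filter also hides `doc._`, the namespace where custom extensions such as the linker's `linkedEntities` live.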

### We can display the dependency diagram in a notebook. Specifying the compact option uses square arcs, which take less space.

In [7]:
displacy.render(doc, style="dep", options={'compact':True})

### Access the strings of the entities and noun_chunks found 

In [8]:
print('Named entities:', doc.ents)
print('Noun chunks:', list(doc.noun_chunks))

Named entities: (1,)
Noun chunks: [Trials, vaccines, antiviral drugs, immunotherapies, monoclonal antibodies, stem cells, nitric oxide, Table]


### Display the text, marking its entities and their types. The default types are the 18 types from [OntoNotes](https://catalog.ldc.upenn.edu/docs/LDC2013T19/OntoNotes-Release-5.0.pdf)

In [None]:
displacy.render(doc, style="ent")

### Get the entity mentions and their types (based on OntoNotes)

In [None]:
[(X.text, X.label_) for X in doc.ents]
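The notebook imports `Counter` but does not use it; one natural use is tallying how often each entity type occurs. The sketch below assumes only that each item has a `label_` attribute, as spaCy entity spans do. The helper name `label_counts` and the stand-in entities are hypothetical, not model output.

```python
from collections import Counter
from types import SimpleNamespace


def label_counts(spans):
    """Count how often each label_ occurs among the given spans."""
    return Counter(span.label_ for span in spans)


# Stand-in entities with hypothetical values, so the sketch runs without a pipeline.
fake_ents = [
    SimpleNamespace(text="1", label_="CARDINAL"),
    SimpleNamespace(text="two weeks", label_="DATE"),
    SimpleNamespace(text="June 1995", label_="DATE"),
]
print(label_counts(fake_ents).most_common())
```

In the notebook, the corresponding call would be `label_counts(doc.ents)`.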

### Noun chunks might correspond to nominal entities

In [None]:
[(X.text, X.label_) for X in doc.noun_chunks]

### Text input can be more than one sentence

In [1]:
text2 = """The wrapping artist Christo took two weeks in June 1995 to wrap the Berlin German Reichstag in lengths of material. Find reports on this artistic event. 
Any information on either its preparation or its execution is relevant, including political debates and decisions and technical preparations in Germany."""

In [None]:
doc2 = nlp(text2)
print("\nentities:")
print([(X.text, X.label_) for X in doc2.ents])
print("\nnoun chunks:")
print([(X.text, X.label_) for X in doc2.noun_chunks])
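With multi-sentence input it is often useful to see which entities fall in which sentence. The sketch below assumes a document object exposing `sents`, where each sentence has `text` and `ents`, as spaCy's `Doc` and `Span` do. The helper name `entities_by_sentence` and the stand-in document are hypothetical.

```python
from types import SimpleNamespace


def entities_by_sentence(doc):
    """Return (sentence text, [(entity text, entity label), ...]) pairs."""
    return [
        (sent.text.strip(), [(ent.text, ent.label_) for ent in sent.ents])
        for sent in doc.sents
    ]


# Stand-in document with hypothetical entities, so the sketch runs without a model.
fake_doc = SimpleNamespace(
    sents=[
        SimpleNamespace(
            text="Christo wrapped the Reichstag in June 1995. ",
            ents=[
                SimpleNamespace(text="Christo", label_="PERSON"),
                SimpleNamespace(text="June 1995", label_="DATE"),
            ],
        ),
        SimpleNamespace(text="Find reports on this artistic event.", ents=[]),
    ]
)
for sentence, ents in entities_by_sentence(fake_doc):
    print(sentence, "->", ents)
```

In the notebook, the corresponding call would be `entities_by_sentence(doc2)`.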

In [None]:
displacy.render(doc2, style="ent")

### A simple entity linker connects entities to Wikidata items

In [None]:
text3 = """Relevant documents will give details of statements made by princess Diana concerning her marriage during her famous BBC interview with Martin Bashir."""

doc3 = nlp(text3)


In [None]:
displacy.render(doc3, style="ent")


In [None]:
print("\nentities:")
print([(X.text, X.label_) for X in doc3.ents])
print("\nnoun chunks:")
print([(X.text, X.label_) for X in doc3.noun_chunks])

In [None]:
# returns all entities in the whole document
all_linked_entities = doc3._.linkedEntities
print(f"Found {len(all_linked_entities)} linked entities in the document")
# iterates over sentences and prints linked entities in each
for sent in doc3.sents:
    sent._.linkedEntities.pretty_print()
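Besides `pretty_print()`, the linked entities can be turned into plain records for further processing. The sketch below assumes each linked entity offers `get_id()`, `get_label()`, `get_description()` and `get_url()`, as described in the spacy-entity-linker README; treat those method names as an assumption to check against your installed version. The helper `linked_entity_rows` and the `FakeEntity` stand-in with its placeholder values are hypothetical.

```python
def linked_entity_rows(entities):
    """Turn linked entities into plain dicts for printing or tabulating."""
    return [
        {
            "id": ent.get_id(),
            "label": ent.get_label(),
            "description": ent.get_description(),
            "url": ent.get_url(),
        }
        for ent in entities
    ]


class FakeEntity:
    """Stand-in with the assumed accessor methods and placeholder values."""

    def __init__(self, identifier, label, description):
        self._identifier = identifier
        self._label = label
        self._description = description

    def get_id(self):
        return self._identifier

    def get_label(self):
        return self._label

    def get_description(self):
        return self._description

    def get_url(self):
        return f"https://www.wikidata.org/wiki/Q{self._identifier}"


# Placeholder entity, not a real lookup result.
for row in linked_entity_rows([FakeEntity(0, "example label", "example description")]):
    print(row)
```

In the notebook, the corresponding call would be `linked_entity_rows(doc3._.linkedEntities)`, assuming the collection is iterable.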

fin

In [None]:
docx3 = nlp("I like music by the Kingston Trio and I love songs by the Beatles.")
displacy.render(docx3, style="ent")

In [None]:
docx4 = nlp(
    "Republicans and Democrats are interested in recruiting people "
    "who belong to Protestant churches. They also want to get support "
    "from Latinos and people of British descent."
)
displacy.render(docx4, style="ent")