## Using Stanza with spaCy

[Stanza](https://stanfordnlp.github.io/stanza/) is the Stanford NLP Group's multilingual NLP library for Python. It provides its own neural pipeline and can also act as a client for CoreNLP. The Stanza NER model may be better than spaCy's, and it's possible to add its coreference model as well.

This notebook shows how Stanza can be used from spaCy via the [spacy-stanza](https://github.com/explosion/spacy-stanza) package.

You may need to do the following:
 * pip install spacy
 * pip install stanza
 * pip install spacy-stanza
 * pip install deplacy
 * python -c "import stanza; stanza.download('en')"
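
Before running the cells below, it can help to confirm that each package is importable. The following is a minimal sketch using only the standard library; the helper name `check_packages` is hypothetical and is not part of spaCy or Stanza.

```python
import importlib.util


def check_packages(names):
    """Return a dict mapping each top-level module name to True if it can be located."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


# Module names imported in this notebook (note the underscore in spacy_stanza).
status = check_packages(["spacy", "stanza", "spacy_stanza"])
for name, found in status.items():
    print(f"{name}: {'installed' if found else 'missing'}")
```

This only checks that the modules can be found; it does not verify that the English Stanza models have been downloaded.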

In [9]:
import stanza
import spacy_stanza
from spacy import displacy

### Load a language model and pipeline

In [2]:
nlp = spacy_stanza.load_pipeline("en")


2021-06-11 20:43:48 INFO: Loading these models for language: en (English):
| Processor | Package   |
-------------------------
| tokenize  | combined  |
| pos       | combined  |
| lemma     | combined  |
| depparse  | combined  |
| sentiment | sstplus   |
| ner       | ontonotes |

2021-06-11 20:43:48 INFO: Use device: cpu
2021-06-11 20:43:48 INFO: Loading: tokenize
2021-06-11 20:43:48 INFO: Loading: pos
2021-06-11 20:43:48 INFO: Loading: lemma
2021-06-11 20:43:48 INFO: Loading: depparse
2021-06-11 20:43:49 INFO: Loading: sentiment
2021-06-11 20:43:49 INFO: Loading: ner
2021-06-11 20:43:50 INFO: Done loading processors!


In [24]:
text = """John F. Kennedy International Airport is an international airport in Queens, \
New York, USA. It is one of the primary airports serving New York City."""

### Analyze a simple text string by using the nlp object 

In [25]:
doc = nlp(text)
# Look at its properties to see what we can do...
print("doc properties:", dir(doc))

doc properties: ['_', '__bytes__', '__class__', '__delattr__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__iter__', '__le__', '__len__', '__lt__', '__ne__', '__new__', '__pyx_vtable__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__setstate__', '__sizeof__', '__str__', '__subclasshook__', '__unicode__', '_bulk_merge', '_get_array_attrs', '_py_tokens', '_realloc', '_vector', '_vector_norm', 'cats', 'char_span', 'copy', 'count_by', 'doc', 'ents', 'extend_tensor', 'from_array', 'from_bytes', 'from_dict', 'from_disk', 'from_docs', 'get_extension', 'get_lca_matrix', 'has_annotation', 'has_extension', 'has_unknown_spaces', 'has_vector', 'is_nered', 'is_parsed', 'is_sentenced', 'is_tagged', 'lang', 'lang_', 'mem', 'noun_chunks', 'noun_chunks_iterator', 'remove_extension', 'retokenize', 'sentiment', 'sents', 'set_ents', 'set_extension', 'similarity', 'spans', 'tensor', 'text'

### Look at the entities found

In [28]:
print(f"Named entities: {doc.ents}")

Named entities: (John F. Kennedy International Airport, Queens, New York, USA, New York City)


### See the named entities and their types in the text. 

In [29]:
displacy.render(doc, style="ent")

### Display a dependency diagram

In [31]:
displacy.render(doc, style="dep", options={"compact": True})

### Get the entity mentions and their types (based on OntoNotes)

In [32]:
[(X.text, X.label_) for X in doc.ents]

[('John F. Kennedy International Airport', 'ORG'),
 ('Queens', 'GPE'),
 ('New York', 'GPE'),
 ('USA', 'GPE'),
 ('New York City', 'GPE')]
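
The `(text, label)` pairs above are easy to regroup by entity type. Here is a minimal sketch that uses the pairs printed above as plain tuples rather than live spaCy spans; `group_by_label` is a hypothetical helper, not a spaCy or Stanza function.

```python
from collections import defaultdict


def group_by_label(pairs):
    """Group (text, label) pairs into a dict of label -> list of mention texts."""
    grouped = defaultdict(list)
    for text, label in pairs:
        grouped[label].append(text)
    return dict(grouped)


# Pairs copied from the output above.
pairs = [
    ("John F. Kennedy International Airport", "ORG"),
    ("Queens", "GPE"),
    ("New York", "GPE"),
    ("USA", "GPE"),
    ("New York City", "GPE"),
]

by_label = group_by_label(pairs)
print(by_label)
```

With a loaded pipeline, the same helper could be called as `group_by_label((ent.text, ent.label_) for ent in doc.ents)`.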

### Noun chunks might correspond to nominal entities, but using Stanza seems to interfere with getting all of the noun chunks that are not named entities.

In [34]:
[(X.text, X.label_) for X in doc.noun_chunks]

[('John', 'NP'),
 ('International Airport', 'NP'),
 ('New York', 'NP'),
 (', USA', 'NP'),
 ('It', 'NP')]
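
To see which noun chunks are not already covered by a named entity, one option is to drop every chunk whose character range overlaps an entity. The sketch below assumes objects with `text`, `start_char`, and `end_char` attributes (which spaCy spans provide); `FakeSpan`, `first_span`, and `chunks_outside_entities` are hypothetical names used only for this illustration, and the demo locates each string at its first occurrence in the text, which is an approximation.

```python
from collections import namedtuple

# Stand-in for a spaCy Span, carrying only the attributes used below.
FakeSpan = namedtuple("FakeSpan", ["text", "start_char", "end_char"])


def chunks_outside_entities(chunks, ents):
    """Keep the chunks whose character range overlaps no entity range."""
    def overlaps(a, b):
        return a.start_char < b.end_char and b.start_char < a.end_char

    return [c for c in chunks if not any(overlaps(c, e) for e in ents)]


text = (
    "John F. Kennedy International Airport is an international airport in Queens, "
    "New York, USA. It is one of the primary airports serving New York City."
)


def first_span(s):
    # Assumption for the demo: place each string at its first occurrence in the text.
    start = text.index(s)
    return FakeSpan(s, start, start + len(s))


# Entity and chunk texts copied from the outputs above.
ents = [first_span(s) for s in [
    "John F. Kennedy International Airport", "Queens", "New York", "USA", "New York City",
]]
chunks = [first_span(s) for s in [
    "John", "International Airport", "New York", ", USA", "It",
]]

remaining = chunks_outside_entities(chunks, ents)
print([c.text for c in remaining])
```

With a loaded pipeline, the function should accept `doc.noun_chunks` and `doc.ents` directly, since only the character offsets are compared.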

### Another text example

In [36]:
text2 = """The wrapping artist Christo took two weeks in June 1995 to wrap the Berlin German Reichstag \
 in lengths of material. Find reports on this artistic event. Any information on either its preparation or \
 its execution is relevant, including political debates and decisions and technical preparations in Germany."""

In [37]:
doc2 = nlp(text2)
print("\nentities:")
print([(X.text, X.label_) for X in doc2.ents])
print("\nnoun chunks:")
print([(X.text, X.label_) for X in doc2.noun_chunks])


entities:
[('Christo', 'PERSON'), ('two weeks', 'DATE'), ('June 1995', 'DATE'), ('the Berlin German Reichstag', 'EVENT'), ('Germany', 'GPE')]

noun chunks:
[('The wrapping artist', 'NP'), ('Christo', 'NP'), ('Any information', 'NP')]


### Berlin German Reichstag (a building) is tagged as an event when it should be a FAC or possibly a LOC or ORG

In [39]:
displacy.render(doc2, style="ent")
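
One lightweight way to handle a known mislabel like this is to post-process the `(text, label)` pairs with a hand-maintained correction table. The sketch below is an assumption about how that might look, not a feature of Stanza or spaCy; `LABEL_OVERRIDES` and `apply_overrides` are hypothetical names, and the choice of `FAC` simply follows the note above.

```python
# Hypothetical correction table: exact mention text -> preferred label.
LABEL_OVERRIDES = {"the Berlin German Reichstag": "FAC"}


def apply_overrides(pairs, overrides):
    """Return (text, label) pairs with labels replaced wherever the text has an override."""
    return [(text, overrides.get(text, label)) for text, label in pairs]


# Pairs copied from the entity output for this example.
entities = [
    ("Christo", "PERSON"),
    ("two weeks", "DATE"),
    ("June 1995", "DATE"),
    ("the Berlin German Reichstag", "EVENT"),
    ("Germany", "GPE"),
]

corrected = apply_overrides(entities, LABEL_OVERRIDES)
print(corrected)
```

This only patches the extracted pairs; it does not change the labels stored on the `doc` object or what `displacy` renders.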

### Another example

In [44]:
text3 = """Relevant documents will give details of statements made by princess Diana concerning \
  her marriage during her famous BBC interview with Martin Bashir."""
doc3 = nlp(text3)

### Render the dependency diagram with the more general deplacy module

In [45]:
import deplacy
deplacy.render(doc3)

Relevant   ADJ   <╗                                 amod
documents  NOUN  ═╝<════════════════════════════╗   nsubj
will       AUX   <════════════════════════════╗ ║   aux
give       VERB  ═══════════════════════════╗═╝═╝═╗ root
details    NOUN  ═════════════════════════╗<╝     ║ obj
of         ADP   <══════════════════════╗ ║       ║ case
statements NOUN  ═════════════════════╗═╝<╝       ║ nmod
made       VERB  ═════╗═════════════╗<╝           ║ acl
by         ADP   <══╗ ║             ║             ║ case
princess   NOUN  ═╗═╝<╝             ║             ║ obl
Diana      PROPN <╝                 ║             ║ flat
concerning VERB  <════════════════╗ ║             ║ case
           SPACE                  ║ ║             ║ 
her        PRON  <══════════════╗ ║ ║             ║ nmod:poss
marriage   NOUN  ═════════════╗═╝═╝<╝             ║ obl
during     ADP   <══════════╗ ║                   ║ case
her        PRON  <════════╗ ║ ║                   ║ nmod:poss
famous     ADJ   <══════╗ ║ ║

### Entities, noun chunks, and token tags for the same text

In [46]:
displacy.render(doc3, style="ent")

In [53]:
print(f"entities: {[(X.text, X.label_) for X in doc3.ents]}\n")

print(f"noun chunks: {[(X.text, X.label_) for X in doc3.noun_chunks]}\n")

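# T.pos is spaCy's integer ID for the coarse POS tag; T.pos_ would give the string label (e.g. 'NOUN')
print(f"token tags: {[(T.text, T.tag_, T.pos) for T in doc3]}\n")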


entities: [('Diana', 'PERSON'), ('BBC', 'ORG'), ('Martin Bashir', 'PERSON')]

noun chunks: [('Relevant documents', 'NP')]

token tags: [('Relevant', 'JJ', 84), ('documents', 'NNS', 92), ('will', 'MD', 87), ('give', 'VB', 100), ('details', 'NNS', 92), ('of', 'IN', 85), ('statements', 'NNS', 92), ('made', 'VBN', 100), ('by', 'IN', 85), ('princess', 'NN', 92), ('Diana', 'NNP', 96), ('concerning', 'VBG', 100), ('  ', '_SP', 103), ('her', 'PRP$', 95), ('marriage', 'NN', 92), ('during', 'IN', 85), ('her', 'PRP$', 95), ('famous', 'JJ', 84), ('BBC', 'NNP', 96), ('interview', 'NN', 92), ('with', 'IN', 85), ('Martin', 'NNP', 96), ('Bashir', 'NNP', 96), ('.', '.', 97)]
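
A quick tally of the fine-grained tags can make output like this easier to scan. Here is a minimal sketch over the triples printed above, copied in as plain tuples; with a loaded pipeline the same `Counter` could be fed `T.tag_` values directly.

```python
from collections import Counter

# (text, fine-grained tag, integer POS id) triples copied from the output above.
token_tags = [
    ("Relevant", "JJ", 84), ("documents", "NNS", 92), ("will", "MD", 87),
    ("give", "VB", 100), ("details", "NNS", 92), ("of", "IN", 85),
    ("statements", "NNS", 92), ("made", "VBN", 100), ("by", "IN", 85),
    ("princess", "NN", 92), ("Diana", "NNP", 96), ("concerning", "VBG", 100),
    ("  ", "_SP", 103), ("her", "PRP$", 95), ("marriage", "NN", 92),
    ("during", "IN", 85), ("her", "PRP$", 95), ("famous", "JJ", 84),
    ("BBC", "NNP", 96), ("interview", "NN", 92), ("with", "IN", 85),
    ("Martin", "NNP", 96), ("Bashir", "NNP", 96), (".", ".", 97),
]

# Count how often each fine-grained tag occurs.
tag_counts = Counter(tag for _, tag, _ in token_tags)
for tag, count in tag_counts.most_common():
    print(f"{tag}: {count}")
```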




In [57]:
docx1 = nlp("I listen to WXPN on the radio.  I also listen to WXPN-FM sometimes. I like the news that NBC has.")
displacy.render(docx1, style="ent")

In [61]:
docx2 = nlp("I have a pet dog and I also like cats.  I do not like mice")
displacy.render(docx2, style="ent")

In [65]:
docx3 = nlp("I like music by the Kingston Trio and I love songs by the Beatles.")
displacy.render(docx3, style="ent")