## Comparing spaCy's models for NER

[spaCy](https://spacy.io/) is an open-source NLP library. Its components are not state of the art (SOTA), but they are robust, easy to use, and fast.

We'll demo how to use it for simple tasks and then try a pretrained entity linker that links to Wikidata items. The [spacy-entity-linker](https://pypi.org/project/spacy-entity-linker/) is not great, but worth looking at.

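You may need to do the following to install spaCy and the three pipelines compared here:
 * pip install spacy
 * python -m spacy download en_core_web_md
 * python -m spacy download en_core_web_lg
 * python -m spacy download en_core_web_trf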
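
Before loading anything, it can help to check which pipelines are already installed. The following is a minimal sketch, assuming each pipeline is installed as an importable Python package under its own name (which is how `spacy download` installs them). `missing_models` is a hypothetical helper introduced here, not part of spaCy.

```python
from importlib.util import find_spec

def missing_models(names):
    """Return the names that cannot be found as installed Python packages."""
    # Assumption: a spaCy pipeline named X is importable as top-level package X
    return [name for name in names if find_spec(name) is None]

wanted = ["en_core_web_md", "en_core_web_lg", "en_core_web_trf"]
for name in missing_models(wanted):
    # Print the download command for each pipeline that is not installed yet
    print(f"python -m spacy download {name}")
```

Nothing is printed when all three pipelines are already available.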

In [13]:
import spacy
from spacy import displacy
from collections import defaultdict

### Load three of spaCy's English pipelines: medium, large, and transformer-based

In [33]:
model1 = "en_core_web_md"
model2 = "en_core_web_lg"
model3 = "en_core_web_trf"

nlp1 = spacy.load(model1)
nlp2 = spacy.load(model2)
nlp3 = spacy.load(model3)

In [39]:
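def ent_summary(doc, model):
    """Print each unique (text, label) entity in doc with its mention count."""
    # Numeric and temporal labels are skipped to focus on named entities
    skip = ['QUANTITY','DATE','ORDINAL','CARDINAL', 'MONEY', 'PERCENT', 'TIME']
    edict = defaultdict(int)
    for ent in [(X.text, X.label_) for X in doc.ents]:
        if ent[1] not in skip:
            edict[ent] += 1
    # Note: len(doc.ents) counts every mention, including the skipped labels
    print(f"Model {model} found {len(edict)} unique entities from {len(doc.ents)} mentions")
    for ent, number in sorted(edict.items()):
        print(ent, number)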
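
Printing is fine for eyeballing results, but comparing models is easier when the counts are returned instead. Here is a sketch of such a variant, assuming only that the input has an `ents` sequence whose items carry `text` and `label_` attributes, as a spaCy `Doc` does. `ent_counts` and `SKIP_LABELS` are names introduced here for illustration.

```python
from collections import Counter
from types import SimpleNamespace

SKIP_LABELS = {"QUANTITY", "DATE", "ORDINAL", "CARDINAL", "MONEY", "PERCENT", "TIME"}

def ent_counts(doc, skip=SKIP_LABELS):
    """Count (text, label) pairs in doc.ents, ignoring labels in skip."""
    return Counter((e.text, e.label_) for e in doc.ents if e.label_ not in skip)

# Stand-in for a spaCy Doc so the sketch runs without a model installed
fake_doc = SimpleNamespace(ents=[
    SimpleNamespace(text="MSU", label_="ORG"),
    SimpleNamespace(text="MSU", label_="ORG"),
    SimpleNamespace(text="2018", label_="DATE"),
])
counts = ent_counts(fake_doc)
print(counts)
```

With a real pipeline, `ent_counts(nlp(text))` would return the same kind of `Counter`, which can then be compared across models.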

### Load text from a topic report and process with the three models

In [55]:
# report_number = "1014"
# report_number = "1023"
report_number = "1024"
with open(f"report_data/{report_number}_report.txt") as f:
    text = f.read()
doc1 = nlp1(text)
doc2 = nlp2(text)
doc3 = nlp3(text)

In [56]:
ent_summary(doc1, model1)

Model en_core_web_md found 57 unique entities from 139 mentions
('Adam Zemke', 'PERSON') 1
('Board of Directors', 'ORG') 1
('Board of Trustees', 'ORG') 1
('Congress', 'ORG') 1
('Dianne Feinstein', 'PERSON') 1
('Dominique Moceanu', 'PERSON') 1
('ESPN', 'ORG') 1
('FBI', 'ORG') 1
('Feinstein', 'PERSON') 1
('House', 'ORG') 3
('Jamie Dantzscher', 'PERSON') 1
('Jessica Howard', 'PERSON') 1
('Juliet Macur', 'PERSON') 1
('Karolyi Ranch', 'GPE') 1
('Kathie Klages', 'PERSON') 1
("Kellogg's", 'ORG') 1
('Klages', 'PERSON') 2
('Larry Nassar', 'PERSON') 1
('Lou Anna Simon', 'PERSON') 1
('MSU', 'ORG') 4
('Mark Dantonio', 'PERSON') 1
('Mark Hollis', 'PERSON') 1
('Mattie Larson', 'PERSON') 1
('Michigan State', 'ORG') 1
('Michigan State University', 'ORG') 3
('Nassar', 'ORG') 19
('Olympian Aly Raisman', 'PERSON') 1
('Outside the Lines', 'ORG') 1
('Paul Ryan', 'PERSON') 1
('Penny', 'PERSON') 1
('Procter & Gamble', 'ORG') 1
('Rachael Denhollander', 'PERSON') 1
('Rick Adams', 'PERSON') 1
('SafeSport', 'ORG

In [57]:
ent_summary(doc2, model2)

Model en_core_web_lg found 58 unique entities from 143 mentions
('AT&T.', 'ORG') 1
('Adam Zemke', 'PERSON') 1
('Aly Raisman', 'PERSON') 1
('Board of Directors', 'ORG') 1
('Board of Trustees', 'ORG') 1
('Congress', 'ORG') 1
('Dianne Feinstein', 'PERSON') 1
('Dominique Moceanu', 'PERSON') 1
('ESPN', 'ORG') 1
('FBI', 'ORG') 1
('Feinstein', 'PERSON') 1
('House', 'ORG') 3
('Jamie Dantzscher', 'PERSON') 1
('Jessica Howard', 'PERSON') 1
('Juliet Macur', 'PERSON') 1
('Karolyi Ranch', 'ORG') 1
('Kathie Klages', 'PERSON') 1
("Kellogg's", 'ORG') 1
('Klages', 'PERSON') 2
('Larry Nassar', 'PERSON') 1
('Lou Anna Simon', 'PERSON') 1
('MSU', 'ORG') 4
('Mark Dantonio', 'PERSON') 1
('Mark Hollis', 'PERSON') 1
('Mattie Larson', 'PERSON') 1
('Michigan State', 'ORG') 1
('Michigan State University', 'ORG') 3
('NGBs', 'ORG') 1
('Nassar', 'PERSON') 19
('Outside the Lines', 'WORK_OF_ART') 1
('Paul Ryan', 'PERSON') 1
('Penny', 'PERSON') 1
('Procter & Gamble', 'ORG') 1
('Rachael Denhollander', 'PERSON') 1
('Rick

In [58]:
ent_summary(doc3, model3)

Model en_core_web_trf found 58 unique entities from 143 mentions
('AT&T.', 'ORG') 1
('Adam Zemke', 'PERSON') 1
('Aly Raisman', 'PERSON') 1
('Board of Directors', 'ORG') 1
('Board of Trustees', 'ORG') 1
('Congress', 'ORG') 1
('Dianne Feinstein', 'PERSON') 1
('Dominique Moceanu', 'PERSON') 1
('ESPN', 'ORG') 1
('FBI', 'ORG') 1
('Feinstein', 'PERSON') 1
('House', 'ORG') 3
('Jamie Dantzscher', 'PERSON') 1
('Jessica Howard', 'PERSON') 1
('Juliet Macur', 'PERSON') 1
('Karolyi Ranch', 'ORG') 1
('Kathie Klages', 'PERSON') 1
("Kellogg's", 'ORG') 1
('Klages', 'PERSON') 2
('Larry Nassar', 'PERSON') 1
('Lou Anna Simon', 'PERSON') 1
('MSU', 'ORG') 4
('Mark Dantonio', 'PERSON') 1
('Mark Hollis', 'PERSON') 1
('Mattie Larson', 'PERSON') 1
('Michigan State', 'ORG') 1
('Michigan State University', 'ORG') 3
('NGBs', 'ORG') 1
('Nassar', 'PERSON') 19
('Outside the Lines', 'WORK_OF_ART') 1
('Paul Ryan', 'PERSON') 1
('Penny', 'PERSON') 1
('Procter & Gamble', 'ORG') 1
('Rachael Denhollander', 'PERSON') 1
('Ric
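
The listings show that the models sometimes agree on a span but not on its label. For example, `Nassar` is `ORG` for `en_core_web_md` and `PERSON` for `en_core_web_lg`. The following is a sketch for surfacing such disagreements, assuming each model's entities are available as a set of `(text, label)` pairs. `label_disagreements` is a hypothetical helper, and the two sample sets are small excerpts from the outputs above.

```python
def label_disagreements(ents_a, ents_b):
    """Map each text found by both models to its two label sets, when they differ."""
    labels_a, labels_b = {}, {}
    for text, label in ents_a:
        labels_a.setdefault(text, set()).add(label)
    for text, label in ents_b:
        labels_b.setdefault(text, set()).add(label)
    # Keep only texts seen by both models whose label sets are not identical
    return {
        text: (labels_a[text], labels_b[text])
        for text in labels_a.keys() & labels_b.keys()
        if labels_a[text] != labels_b[text]
    }

# Excerpts from the en_core_web_md and en_core_web_lg listings
md_ents = {("Nassar", "ORG"), ("Karolyi Ranch", "GPE"), ("MSU", "ORG")}
lg_ents = {("Nassar", "PERSON"), ("Karolyi Ranch", "ORG"), ("MSU", "ORG")}

result = label_disagreements(md_ents, lg_ents)
for text, (a, b) in sorted(result.items()):
    print(text, sorted(a), sorted(b))
```

For the full documents, the input sets could be built with `{(e.text, e.label_) for e in doc1.ents}` and likewise for `doc2` and `doc3`.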