# Testing Flair's NER

As usual, let's just borrow best practices from someone who has done it before:  
https://medium.com/analytics-vidhya/practical-approach-of-state-of-the-art-flair-in-named-entity-recognition-46a837e25e6b

I came across Flair by accident and was surprised to find a "simple" NLP package I hadn't heard of yet. Its published benchmarks against the current SOTA are impressive, so let's give it a shot!

In [2]:
import flair

#import commands for flair NER
from flair.data import Sentence
from flair.models import SequenceTagger

### Boilerplate NER on a single string
(Copied from the link above)

In [None]:
#Load NER Model
tagger = SequenceTagger.load('ner')

In [5]:
# Sample text to run NER on
text = 'Jackson is placed in Microsoft located in Redmond'
# Wrap the text in a Sentence object (this also tokenizes it)
sentence = Sentence(text)

In [6]:
# Run NER on the sentence to identify entities
tagger.predict(sentence)
# Print each entity span the tagger found
for entity in sentence.get_spans('ner'):
    print(entity)

Span [1]: "Jackson"   [− Labels: PER (0.9951)]
Span [5]: "Microsoft"   [− Labels: ORG (0.9908)]
Span [8]: "Redmond"   [− Labels: LOC (0.9586)]


Ooh, I like the confidence score on each label (strictly a 0-to-1 value rather than a percent).
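
Since I mostly end up staring at printed output like the lines above, here is a minimal sketch for turning those lines back into structured records using only the standard library. It assumes the exact print format shown in this notebook, which may differ in other Flair releases. The names `Entity`, `SPAN_LINE`, and `parse_span_line` are made up for illustration and are not part of Flair.

```python
import re
from typing import NamedTuple, Optional, Tuple


class Entity(NamedTuple):
    """One parsed output line (hypothetical helper type, not a Flair class)."""
    positions: Tuple[int, ...]  # token positions from the "Span [..]" prefix
    text: str
    label: str
    score: float


# Pattern for lines like: Span [7,8]: "Derek Chauvin"   [− Labels: PER (0.9999)]
# The "." after the bracket accepts whatever dash character the output uses.
SPAN_LINE = re.compile(
    r'Span \[([\d,]+)\]: "(.*?)"\s+\[.\s*Labels:\s*(\w+) \(([\d.]+)\)\]'
)


def parse_span_line(line: str) -> Optional[Entity]:
    """Return an Entity for a complete output line, or None if it doesn't match."""
    match = SPAN_LINE.search(line)
    if match is None:
        return None
    positions, text, label, score = match.groups()
    return Entity(
        positions=tuple(int(p) for p in positions.split(",")),
        text=text,
        label=label,
        score=float(score),
    )


# The three lines printed by the cell above
printed = [
    'Span [1]: "Jackson"   [− Labels: PER (0.9951)]',
    'Span [5]: "Microsoft"   [− Labels: ORG (0.9908)]',
    'Span [8]: "Redmond"   [− Labels: LOC (0.9586)]',
]

entities = [e for e in map(parse_span_line, printed) if e is not None]
for e in entities:
    # Show the score as a percentage for readability
    print(f"{e.text:<10} {e.label:<4} {e.score:.2%}")
```

Lines that are cut off or otherwise malformed simply return `None`, so they can be skipped rather than guessed at.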

### Great, let's try it on a full article

- Load my usual boilerpy3 article grabber

In [7]:
import boilerpy3
from boilerpy3 import extractors
extractor = extractors.ArticleExtractor()

In [8]:
doc = extractor.get_doc_from_url(r"https://www.washingtonpost.com/nation/2021/04/06/derek-chauvin-trial/")

In [23]:
# Import segtok to split the article text into sentences
from segtok.segmenter import split_single
sentences = [Sentence(sent, use_tokenizer=True) for sent in split_single(doc.content)]



In [None]:
# sentences

In [25]:
# Predict entities for every sentence in the article
tagger.predict(sentences)
# Print each entity span, sentence by sentence
for sent in sentences:
    for entity in sent.get_spans('ner'):
        print(entity)

2021-04-07 00:27:42,581 Ignore 1 sentence(s) with no tokens.
Span [1]: "UTC"   [− Labels: ORG (0.625)]
Span [1]: "MINNEAPOLIS"   [− Labels: ORG (0.5678)]
Span [7,8]: "Derek Chauvin"   [− Labels: PER (0.9999)]
Span [26,27]: "George Floyd"   [− Labels: PER (0.9983)]
Span [32]: "Floyd"   [− Labels: PER (1.0)]
Span [1,2]: "Johnny Mercil"   [− Labels: PER (0.9996)]
Span [7]: "Minneapolis"   [− Labels: LOC (0.9944)]
Span [26]: "Floyd"   [− Labels: PER (0.9995)]
Span [2,3]: "Steve Schleicher"   [− Labels: PER (0.9998)]
Span [5]: "Mercil"   [− Labels: PER (0.9987)]
Span [13]: "Chauvin"   [− Labels: PER (1.0)]
Span [16]: "Floyd"   [− Labels: PER (0.9988)]
Span [22]: "Chauvin"   [− Labels: PER (1.0)]
Span [6]: "Mercil"   [− Labels: PER (0.9987)]
Span [13]: "MPD"   [− Labels: ORG (0.9162)]
Span [14]: "Mercil"   [− Labels: PER (0.9975)]
Span [2,3]: "Eric Nelson"   [− Labels: PER (0.9999)]
Span [5]: "Chauvin"   [− Labels: PER (1.0)]
Span [23]: "Floyd"   [− Labels: PER (0.999)]
Span [3]: "Mercil"   
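
A flat list of spans gets hard to read quickly, so a tally helps. The sketch below counts how often each (text, label) pair shows up and flags any text that received more than one label once case is ignored. The `mentions` list is hand-copied from the complete lines of the output above (the final, cut-off line is left out), and the variable names are just for illustration.

```python
from collections import Counter, defaultdict

# (text, label) pairs copied from the complete output lines above
mentions = [
    ("UTC", "ORG"), ("MINNEAPOLIS", "ORG"), ("Derek Chauvin", "PER"),
    ("George Floyd", "PER"), ("Floyd", "PER"), ("Johnny Mercil", "PER"),
    ("Minneapolis", "LOC"), ("Floyd", "PER"), ("Steve Schleicher", "PER"),
    ("Mercil", "PER"), ("Chauvin", "PER"), ("Floyd", "PER"),
    ("Chauvin", "PER"), ("Mercil", "PER"), ("MPD", "ORG"),
    ("Mercil", "PER"), ("Eric Nelson", "PER"), ("Chauvin", "PER"),
    ("Floyd", "PER"),
]

# How many times each (text, label) pair was tagged
counts = Counter(mentions)

# Collect the set of labels seen for each text, ignoring case
labels_by_text = defaultdict(set)
for text, label in mentions:
    labels_by_text[text.casefold()].add(label)

# Texts that were tagged with more than one label
conflicts = {
    text: sorted(labels)
    for text, labels in labels_by_text.items()
    if len(labels) > 1
}

for (text, label), n in counts.most_common(5):
    print(f"{n:>2}  {label:<4} {text}")
print("Conflicting labels:", conflicts)
```

On this sample, the only conflict is the all-caps dateline "MINNEAPOLIS" (ORG) versus "Minneapolis" in running text (LOC).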

### Let's try another one  

----

In [26]:
doc = extractor.get_doc_from_url(r"https://www.cnn.com/2021/04/06/health/covid-neurological-psychological-lancet-wellness/index.html")

In [30]:
#Import segtok library to split the paragraph into sentences
from segtok.segmenter import split_single
sentences = [Sentence(sent, use_tokenizer=True) for sent in split_single(doc.content)]



In [31]:
# Predict entities for every sentence in the article
tagger.predict(sentences)
# Print each entity span, sentence by sentence
for sent in sentences:
    for entity in sent.get_spans('ner'):
        print(entity)

2021-04-07 00:42:37,826 Ignore 1 sentence(s) with no tokens.
Span [2]: "Birx"   [− Labels: PER (0.6829)]
Span [10]: "Covid-19"   [− Labels: MISC (0.9864)]
Span [6]: "Covid-19"   [− Labels: MISC (0.8169)]
Span [33,34]: "Lancet Psychiatry"   [− Labels: ORG (0.8155)]
Span [16]: "Covid-19"   [− Labels: MISC (0.9149)]
Span [11]: "Covid-19"   [− Labels: MISC (0.7916)]
Span [16,17]: "Maxime Taquet"   [− Labels: PER (0.9999)]
Span [27,28,29]: "University of Oxford"   [− Labels: ORG (0.996)]
Span [16]: "Covid-19"   [− Labels: MISC (0.776)]
Span [15]: "Covid-19"   [− Labels: MISC (0.6785)]
Span [12]: "Taquet"   [− Labels: PER (0.9999)]
Span [20]: "Covid-19"   [− Labels: MISC (0.9739)]
Span [26]: "US"   [− Labels: LOC (0.9974)]
Span [6]: "Covid-19"   [− Labels: MISC (0.9867)]
Span [5]: "Covid-19"   [− Labels: MISC (0.9618)]
Span [3]: "Covid-19"   [− Labels: ORG (0.5987)]
Span [14]: "Taquet"   [− Labels: PER (1.0)]
Span [12]: "Covid-19"   [− Labels: MISC (0.8482)]
Span [27,28]: "Musa Sami"   [− La

### Thoughts

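- Flair's NER seems to work pretty well.
- These news articles are easier than some of the docs I'd want to throw through NER, but they're a decent test.
- You can see **it thought "UTC" was an org** (it's a time zone), and it tagged the all-caps "MINNEAPOLIS" dateline as an ORG too. Both came back with low scores (0.625 and 0.5678).
- The default `ner` model has fewer tag types (just PER, LOC, ORG, and MISC) than you'd find with spaCy.
- **It's weird that some labels come back at 1.0**; seems a little cocky! My guess is that the printed score is simply rounded.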
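
Given the low scores on the questionable tags, one obvious follow-up is a confidence cutoff. This is a minimal sketch on a handful of (text, label, score) tuples copied from the two outputs above. The 0.7 threshold is an arbitrary assumption, and `split_by_confidence` is a hypothetical helper, not something Flair provides.

```python
def split_by_confidence(entities, threshold=0.7):
    """Split (text, label, score) tuples into kept and dropped lists."""
    kept = [e for e in entities if e[2] >= threshold]
    dropped = [e for e in entities if e[2] < threshold]
    return kept, dropped


# A few rows copied from the outputs above
sample = [
    ("UTC", "ORG", 0.625),
    ("MINNEAPOLIS", "ORG", 0.5678),
    ("Derek Chauvin", "PER", 0.9999),
    ("Birx", "PER", 0.6829),
    ("Covid-19", "MISC", 0.6785),
    ("Covid-19", "ORG", 0.5987),
    ("University of Oxford", "ORG", 0.996),
]

kept, dropped = split_by_confidence(sample, threshold=0.7)
print("kept:   ", [text for text, _, _ in kept])
print("dropped:", [(text, label) for text, label, _ in dropped])
```

The cutoff removes the "UTC" and "MINNEAPOLIS" ORG tags and the odd "Covid-19" ORG tag, but it also throws away "Birx" as PER, which looks like a reasonable tag. So a threshold trades recall for precision and is worth tuning per corpus rather than hard-coding.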

I also found this article, which might be useful if you're trying to benchmark NER and semantic annotation platforms:
https://towardsdatascience.com/benchmark-ner-algorithm-d4ab01b2d4c3
