# Flair NER for Detecting Book Titles

Flair, like SpaCy, provides an NER model trained on OntoNotes, so it can detect works of art (the `WORK_OF_ART` entity type).

It seems to perform comparably to SpaCy; a more careful analysis including preprocessing would be needed to choose between them.
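One way to make that comparison more careful would be to score each model's predicted titles against hand-labelled ones. The sketch below is only an illustration: `precision_recall` is a hypothetical helper, exact string matching is an assumed (and strict) matching rule, and the gold titles are labelled by hand here from the first comment printed further down in this notebook.

```python
def precision_recall(predicted, gold):
    """Exact-match precision and recall between two collections of title strings."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# WORK_OF_ART spans Flair returned for the first comment in this notebook
flair_titles = ["Hume", "with the Himalayan", "Hirohito : Behind the Myth"]

# Hand-labelled for illustration only (an assumption, not part of the dataset)
gold_titles = [
    "Existential Rationalism : Handling Hume 's Fork",
    "Living with the Himalayan Masters",
    "The Outsider",
    "Hirohito : Behind the Myth",
]

precision, recall = precision_recall(flair_titles, gold_titles)
print(f"precision={precision:.2f} recall={recall:.2f}")  # precision=0.33 recall=0.25
```

Running the same function over SpaCy's predictions for the same comments would give numbers that can be compared directly, although a partial-match rule may be fairer for truncated spans such as "with the Himalayan".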

In [1]:
import flair
import torch

flair.device = torch.device('cpu')  # run inference on the CPU

import pandas as pd

2022-06-27 21:00:28.518400: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2022-06-27 21:00:28.518449: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.


In [2]:
book_recs = pd.read_csv('../data/02_intermediate/hn_ask_book_recommendations.csv')

In [3]:
from flair.data import Sentence
from flair.models import SequenceTagger

# load tagger
tagger = SequenceTagger.load("flair/ner-english-ontonotes-large")

2022-06-27 21:00:32,178 loading file /home/edward/.flair/models/ner-english-ontonotes-large/2da6c2cdd76e59113033adf670340bfd820f0301ae2e39204d67ba2dc276cc28.ec1bdb304b6c66111532c3b1fc6e522460ae73f1901848a4d0362cdf9760edb1
2022-06-27 21:00:50,599 SequenceTagger predicts: Dictionary with 76 tags: <unk>, O, B-CARDINAL, E-CARDINAL, S-PERSON, S-CARDINAL, S-PRODUCT, B-PRODUCT, I-PRODUCT, E-PRODUCT, B-WORK_OF_ART, I-WORK_OF_ART, E-WORK_OF_ART, B-PERSON, E-PERSON, S-GPE, B-DATE, I-DATE, E-DATE, S-ORDINAL, S-LANGUAGE, I-PERSON, S-EVENT, S-DATE, B-QUANTITY, E-QUANTITY, S-TIME, B-TIME, I-TIME, E-TIME, B-GPE, E-GPE, S-ORG, I-GPE, S-NORP, B-FAC, I-FAC, E-FAC, B-NORP, E-NORP, S-PERCENT, B-ORG, E-ORG, B-LANGUAGE, E-LANGUAGE, I-CARDINAL, I-ORG, S-WORK_OF_ART, I-QUANTITY, B-MONEY


In [4]:
import re
import html

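def clean(text):
    """Convert the HTML markup in a comment to plain text."""
    # Decode HTML entities such as &#x27; and &amp;
    text = html.unescape(text)
    # Render italics as quotation marks
    text = text.replace('<i>', '"')
    text = text.replace('</i>', '"')
    # Paragraph tags become blank lines
    text = text.replace('<p>', '\n\n')
    # Replace each link with its bare URL
    text = re.sub(r'<a href="(.*?)".*?>.*?</a>', r'\1', text)
    return text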
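
As a quick check of what `clean` does, the sketch below runs it on a made-up comment. The sample string and its URL are hypothetical, not taken from the dataset, and the function is repeated so the snippet runs on its own.

```python
import html
import re


def clean(text):
    text = html.unescape(text)
    text = text.replace('<i>', '"')
    text = text.replace('</i>', '"')
    text = text.replace('<p>', '\n\n')
    text = re.sub(r'<a href="(.*?)".*?>.*?</a>', r'\1', text)
    return text


# Hypothetical comment in the HTML-ish form the text column is assumed to use
sample = (
    'I&#x27;d suggest <i>Reaper</i> by Will Wight.'
    '<p>See <a href="https://example.com/reaper" rel="nofollow">'
    'https://example.com/rea...</a>'
)

cleaned = clean(sample)
print(cleaned)
```

The entity is decoded, the italic title ends up in double quotes, the paragraph tag becomes a blank line, and the shortened link text is replaced by the full URL from the `href` attribute.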

In [5]:
books = book_recs.text.head(15).map(clean).to_list()

In [6]:
%%time
sentences = [Sentence(book) for book in books]
for sentence in sentences:
    tagger.predict(sentence)

CPU times: user 25.6 s, sys: 153 ms, total: 25.7 s
Wall time: 6.66 s


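It does pretty well, comparable to SpaCy's model (in some ways better, some worse). In the output below it finds some titles cleanly (for example "Hirohito : Behind the Myth" and "Reaper") but misses or truncates others (for example "The Outsider" and "The Coming of Neo-Feudalism" are not tagged, and "Living with the Himalayan Masters" is cut down to "with the Himalayan").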
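
To go from tagged spans to candidate book titles, one could keep only the `WORK_OF_ART` spans above a confidence threshold. The sketch below is an illustration under assumptions: `likely_titles` and the 0.9 cutoff are hypothetical, and a small named tuple stands in for Flair's spans so the snippet runs without loading the model. Recent Flair versions appear to expose the same `text`, `tag` and `score` attributes on spans, but that is worth checking against the installed release.

```python
from collections import namedtuple

# Stand-in for a Flair span; only the three fields used below are modelled
FakeSpan = namedtuple("FakeSpan", ["text", "tag", "score"])


def likely_titles(spans, min_score=0.9):
    """Return the text of WORK_OF_ART spans scoring at least min_score."""
    return [
        span.text
        for span in spans
        if span.tag == "WORK_OF_ART" and span.score >= min_score
    ]


# Values copied from the first comment's output printed further down
spans = [
    FakeSpan("Hume", "WORK_OF_ART", 0.4963),
    FakeSpan("second", "ORDINAL", 1.0),
    FakeSpan("with the Himalayan", "WORK_OF_ART", 0.7983),
    FakeSpan("Hirohito : Behind the Myth", "WORK_OF_ART", 0.9999),
]

print(likely_titles(spans))  # ['Hirohito : Behind the Myth']
```

A high cutoff drops the two partial matches here, at the cost of discarding anything the model is less sure about; the right threshold would need to be tuned on labelled comments.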

In [7]:
for sentence in sentences:
    print(sentence.text)
    for entity in sentence.get_spans('ner'):
        print(entity)
    print()

- Existential Rationalism : Handling Hume 's Fork ( second edition ) - Living with the Himalayan Masters - The Outsider - Hirohito : Behind the Myth
Span[5:6]: "Hume" → WORK_OF_ART (0.4963)
Span[9:10]: "second" → ORDINAL (1.0)
Span[14:17]: "with the Himalayan" → WORK_OF_ART (0.7983)
Span[22:27]: "Hirohito : Behind the Myth" → WORK_OF_ART (0.9999)

The Coming of Neo-Feudalism by Joel Kotkin
Span[5:7]: "Joel Kotkin" → PERSON (1.0)

Probably " Reaper " , by Will Wight . It ’s not an insightful nonfiction book or a piece of high literature , but the whole Cradle series is very , very fun .
Span[2:3]: "Reaper" → WORK_OF_ART (1.0)
Span[6:8]: "Will Wight" → PERSON (1.0)
Span[26:27]: "Cradle" → WORK_OF_ART (0.9994)

A Gentleman in Moscow by Amor Towles . I spent a lot of the year in isolation , only seeing a few people and this book felt like an appropriate analogy . It was also very heartwarming when I really needed something to lift me up .
Span[0:4]: "A Gentleman in Moscow" → WORK_OF_ART (1