# Extracting Book Titles with spaCy NER

The spaCy English transformer model (`en_core_web_trf`) can often detect `WORK_OF_ART` and `PERSON` entities, which are good candidates for the title and author of a book.

It does have some problems with punctuation; details such as the spacing around quotes and sentence boundaries seem to make a large difference. Some preprocessing may improve this.
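As a rough illustration of the kind of preprocessing meant here, the sketch below tidies the spacing around straight double quotes and after sentence-final punctuation. The helper name `normalize_punctuation` and its rules are assumptions for illustration, not something this notebook has tested against the model.

```python
import re

def _pad_quote(match):
    """Rebuild a quoted span with no spaces inside the quotes and one space outside."""
    left, inner, right = match.groups()
    prefix = left + ' ' if left else ''
    suffix = ' ' + right if right else ''
    return prefix + '"' + inner + '"' + suffix

def normalize_punctuation(text):
    """Hypothetical helper: regularise spacing around quotes and sentence ends."""
    # Separate a quoted span from any word characters touching it,
    # and trim spaces just inside the quotes.
    text = re.sub(r'(\w?)"\s*([^"\n]*?)\s*"(\w?)', _pad_quote, text)
    # Insert a space after . ! ? when a lowercase letter precedes it and a
    # capital follows (this leaves abbreviations such as "U.S." alone).
    text = re.sub(r'(?<=[a-z])([.!?])(?=[A-Z])', r'\1 ', text)
    # Collapse runs of spaces and tabs, but keep newlines.
    text = re.sub(r'[ \t]+', ' ', text)
    return text.strip()

print(normalize_punctuation('I liked"Dune"a lot.It was great'))
```

Whether this actually changes what the model predicts would need to be checked on the real comments.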

In [1]:
import spacy

import pandas as pd



Let's get the posts from `0012-ask-hn` which are responses to Ask HN book recommendations.

In [2]:
book_recs = pd.read_csv('../data/02_intermediate/hn_ask_book_recommendations.csv')

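First, clean the text: unescape HTML entities, turn italic tags into quotes, turn paragraph tags into blank lines, and replace links with their URLs.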

In [3]:
import re
import html

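def clean(text):
    """Convert the HTML of a Hacker News comment into plain text."""
    text = html.unescape(text)
    # Render italic markup as double quotes
    text = text.replace('<i>', '"')
    text = text.replace('</i>', '"')
    # <p> marks a paragraph break
    text = text.replace('<p>', '\n\n')
    # Replace each link with the URL from its href
    text = re.sub(r'<a href="(.*?)".*?>.*?</a>', r'\1', text)
    return text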

In [4]:
nlp = spacy.load('en_core_web_trf')

It's pretty slow on CPU.

In [5]:
%time docs = list(nlp.pipe(book_recs.text.head(20).map(clean).to_list()))

CPU times: user 18.7 s, sys: 281 ms, total: 19 s
Wall time: 5.08 s


## Analyzing Results

In [6]:
from spacy import displacy

It gets two of the four, which is pretty impressive given the lack of context (from the parent).

In [7]:
displacy.render(docs[0], 'ent')
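To work with the predictions as data rather than as rendered HTML, the entities can be read from `doc.ents`. The sketch below groups entity texts under the two labels of interest; the function name `candidate_titles_and_authors` is hypothetical, and the stand-in document uses made-up labels (not the model's actual output) so the example runs without loading a model.

```python
from types import SimpleNamespace

def candidate_titles_and_authors(doc):
    """Group entity texts by the two labels used as title/author candidates."""
    found = {"WORK_OF_ART": [], "PERSON": []}
    for ent in doc.ents:
        if ent.label_ in found:
            found[ent.label_].append(ent.text)
    return found

# Stand-in for a spaCy Doc with invented entity labels, for illustration only.
# With the model loaded, pass docs[0] instead.
fake_doc = SimpleNamespace(ents=[
    SimpleNamespace(text="The Outsider", label_="WORK_OF_ART"),
    SimpleNamespace(text="Hirohito", label_="PERSON"),
    SimpleNamespace(text="second", label_="ORDINAL"),
])

print(candidate_titles_and_authors(fake_doc))
```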

spaCy can't split this into sentences; the whole list comes back as a single sentence.

In [8]:
list(docs[0].sents)

[- Existential Rationalism: Handling Hume's Fork (second edition)
 - Living with the Himalayan Masters
 - The Outsider
 - Hirohito: Behind the Myth]

It just misses the " - Feudalism" part of the title.

In [9]:
displacy.render(docs[1], 'ent')

This gets it right, though Cradle is a series, which we will need to disambiguate.

In [10]:
displacy.render(docs[2], 'ent')

Perfect

In [11]:
displacy.render(docs[3], 'ent')

Perfect

In [12]:
displacy.render(docs[4], 'ent')

It has a bit of trouble with the boundaries here, but it's pretty close.

In [13]:
displacy.render(docs[5], 'ent')

This is perfect; I suspect Beanpole is a movie.

In [14]:
displacy.render(docs[6], 'ent')

It didn't pick this up without the context.

In [15]:
displacy.render(docs[7], 'ent')



Adding some context doesn't help this model.

In [16]:
title = book_recs.iloc[7].title_parent
question = book_recs.iloc[7].text_parent
answer = book_recs.iloc[7].text

in_context = clean(title) + '\n' + clean(question) + clean(answer)

In [17]:
displacy.render(nlp(in_context), 'ent')

Good, but it misses that Aristotle is a person.

In [18]:
displacy.render(docs[8], 'ent')

In [19]:
displacy.render(docs[9], 'ent')

For some reason misses the `learn"` (tokenization issue?). We'll need to do some post-processing to separate books from courses.

In [20]:
displacy.render(docs[10], 'ent')

In [21]:
displacy.render(docs[11], 'ent')

In [22]:
displacy.render(docs[12], 'ent')

Interestingly, if we change the list into separate sentences it finds all the books.

In [23]:
displacy.render(nlp(clean(book_recs.iloc[12].text).replace('\n*', '.')), 'ent')
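The one-off `replace` above could be generalised into a small helper. This is only a sketch: the name `bullets_to_sentences` and the set of bullet markers it handles (`*`, `-`, `•`) are assumptions, and it has only the single example above as evidence that the transformation helps.

```python
import re

# A line break followed by a bullet marker, plus any punctuation just before it
BULLET = re.compile(r'([.:;!?]?)\s*\n\s*[*\-•]\s*')
LEADING_BULLET = re.compile(r'^\s*[*\-•]\s*')

def bullets_to_sentences(text):
    """Hypothetical helper: turn bullet-list items into separate sentences."""
    # End the previous item with a full stop unless it already ends in punctuation
    text = BULLET.sub(lambda m: (m.group(1) or '.') + ' ', text)
    # Drop a bullet marker at the very start of the text
    return LEADING_BULLET.sub('', text)

print(bullets_to_sentences('My picks:\n* Dune\n* Neuromancer'))
```

A line that begins with a hyphen used as a dash would also be rewritten, so the marker set may need narrowing for real comments.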

This is good; no false positives.

In [24]:
displacy.render(docs[13], 'ent')

It misses this one.

In [25]:
displacy.render(docs[14], 'ent')

This does pretty well.

In [26]:
displacy.render(docs[15], 'ent')

## Can we use a smaller model?

Unfortunately, even the large model (`en_core_web_lg`) does not predict any `WORK_OF_ART` entities.

I'm not sure whether this is because the label is not in the model, or because the text is too different from the model's training data.
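One way to settle the first possibility is to look at the labels the pipeline's NER component declares, which spaCy exposes as `nlp.get_pipe("ner").labels`. The sketch below wraps that lookup; `ner_supports` is a hypothetical helper name, and the stand-in pipeline with invented labels exists only so the example runs without a model installed.

```python
from types import SimpleNamespace

def ner_supports(nlp, label):
    """Return True if the pipeline's 'ner' component lists `label`."""
    return label in nlp.get_pipe("ner").labels

class FakePipeline:
    """Stand-in for a loaded spaCy pipeline, with made-up labels for illustration."""
    def get_pipe(self, name):
        if name != "ner":
            raise KeyError(name)
        return SimpleNamespace(labels=("PERSON", "ORG"))

fake_nlp = FakePipeline()
print(ner_supports(fake_nlp, "PERSON"))
print(ner_supports(fake_nlp, "WORK_OF_ART"))
```

With spaCy installed, `ner_supports(spacy.load('en_core_web_lg'), 'WORK_OF_ART')` would answer the question for the real model; if it returns `True`, the mismatch between the text and the training data becomes the more likely explanation.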

In [27]:
nlp = spacy.load('en_core_web_lg')



In [28]:
%time docs = list(nlp.pipe(book_recs.text.head(20).map(clean).to_list()))

CPU times: user 163 ms, sys: 0 ns, total: 163 ms
Wall time: 161 ms


In [29]:
for doc in docs[:10]:
    displacy.render(doc, 'ent')