# Natural Language Processing (NLP) with spaCy
In this notebook, we will use the Natural Language Processing library for Python called [spaCy][SPACY].
The functionality provided by [spaCy][SPACY] allows us to quickly extract parts-of-speech (POS) from text descriptions and to identify entities using [spaCy's][SPACY] named entity recognition (NER). We will first use [spaCy][SPACY]'s rule-based matching to create a subject matcher that we can then apply to both Sinopia's RDF metadata and the associated full-text for a select sample of these resources.


[SPACY]: https://spacy.io/

In [31]:
%reload_ext autoreload
%autoreload 2
%matplotlib inline
%reload_ext lab_black

import datetime
import json
import string

import pathlib
import pandas as pd
import numpy as np

import kglab
import rdflib
import requests
import spacy
import helpers
import widgets

## Creating a FAST pandas DataFrame
Using a CSV file derived from OCLC's [FAST](https://www.oclc.org/research/areas/data-science/fast.html) (Faceted Application of Subject Terminology) topic list, we read the file into a pandas DataFrame.

In [3]:
fast_topics = pd.read_csv("data/topic_uri_label_utf8.csv", names=["URL", "name"])
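
The `names=` argument matters here: when it is given without `header=`, pandas treats the file as having no header line and keeps the first row as data. The sketch below shows this on a two-row in-memory CSV, assuming the FAST file uses the same headerless two-column layout; the rows are copied from the sample shown later in this notebook.

```python
import io

import pandas as pd

# Two rows in the same two-column layout as the FAST file (no header row).
csv_text = (
    "http://id.worldcat.org/fast/997183,Liberalism\n"
    'http://id.worldcat.org/fast/1077452,"Privacy, Right of--Public opinion"\n'
)

# Passing names= without header= tells pandas the file has no header line,
# so the first row is kept as data rather than consumed as column names.
demo = pd.read_csv(io.StringIO(csv_text), names=["URL", "name"])
print(demo.shape)
print(demo.loc[1, "name"])
```

The quoted second row also shows that a comma inside a FAST name survives parsing as long as the field is quoted.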

To get a sense of what data is in the `fast_topics` DataFrame, we can look at its *shape* and *info*, and at a random sample.

In [4]:
print(f"Shape of fast_topics {fast_topics.shape}")
fast_topics.info()

Shape of fast_topics (460110, 2)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 460110 entries, 0 to 460109
Data columns (total 2 columns):
 #   Column  Non-Null Count   Dtype 
---  ------  --------------   ----- 
 0   URL     460110 non-null  object
 1   name    460110 non-null  object
dtypes: object(2)
memory usage: 7.0+ MB


In [5]:
fast_topics.sample(10)

Unnamed: 0,URL,name
245202,http://id.worldcat.org/fast/1077452,"Privacy, Right of--Public opinion"
28981,http://id.worldcat.org/fast/997183,Liberalism
319681,http://id.worldcat.org/fast/1067112,Plecotus townsendii
345441,http://id.worldcat.org/fast/1175518,Wills--Computer programs
120542,http://id.worldcat.org/fast/990231,Labor supply--Statistics--History
20292,http://id.worldcat.org/fast/970779,Industrial accidents--Law and legislation
116156,http://id.worldcat.org/fast/822987,Automobile industry and trade--Finance--Law an...
338929,http://id.worldcat.org/fast/1163594,Vaccines industry--Government policy
297630,http://id.worldcat.org/fast/1101961,Russelia
372172,http://id.worldcat.org/fast/1048881,Ostriches--Eggs--Incubation


From our sample, we see that the `name` has punctuation like **(),--** that will need to be removed later in our workflow.
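
One detail worth keeping in mind for that clean-up: `str.strip(string.punctuation)` only trims punctuation from the two ends of a string, so interior commas and parentheses stay in place. The sketch below contrasts that with a translation table that replaces every ASCII punctuation character with a space. The helper name `normalize_fast_label` is hypothetical and is not part of this notebook's `helpers` module.

```python
import string

# Map every ASCII punctuation character to a space so words stay separated.
_PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def normalize_fast_label(label: str) -> str:
    """Hypothetical helper: replace punctuation with spaces, collapse whitespace."""
    return " ".join(label.translate(_PUNCT_TO_SPACE).split())


label = "Privacy, Right of--Public opinion"
# str.strip only trims the ends, so the interior comma survives.
print(label.strip(string.punctuation).replace("--", " "))
# The translation table removes the comma and the double hyphen alike.
print(normalize_fast_label(label))
```

A trade-off of this approach is that apostrophes and hyphens inside words are also turned into spaces, which may or may not be what we want for a given set of headings.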

## spaCy Phrase Matcher
With the [FAST][FAST] dataframe as our source, we will now create a [spaCy][SPACY] [PhraseMatcher](https://spacy.io/usage/rule-based-matching#phrasematcher) made up of phrases extracted from the dataframe, using the [FAST][FAST] URLs as identifiers.

We start by importing [spaCy][SPACY]'s base English language class and creating an empty natural language processing (nlp) pipeline, whose vocabulary we pass into a new matcher object. Setting `attr="LOWER"` makes the matcher compare lower-cased token text, so the [FAST][FAST] patterns match regardless of case.

[FAST]: https://www.oclc.org/research/areas/data-science/fast.html
[SPACY]: https://spacy.io/

In [6]:
from spacy.lang.en import English
from spacy.matcher import PhraseMatcher

nlp = English()
fast_topic_matcher = PhraseMatcher(nlp.vocab, attr="LOWER")

In [7]:
"African American teenagers--Education".strip(string.punctuation).replace(
    "--", " "
).split()

['African', 'American', 'teenagers', 'Education']

### Creating Patterns
We will iterate through the dataframe. For each [FAST](https://www.oclc.org/research/areas/data-science/fast.html) row, we use the **URL** as the identifier; for the name, we strip leading and trailing punctuation, replace `--` with a space, split the name into a list of words, tokenize each word, and add the resulting patterns to the `fast_topic_matcher`.

In [8]:
start = datetime.datetime.utcnow()
print(f"Started at {start} of FAST Matching")
# Iterate through the FAST Topic DataFrame
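for _, row in fast_topics.iterrows():
    ident = row["URL"]
    # Trim leading/trailing punctuation, turn "--" into a space, split into words
    words = row["name"].strip(string.punctuation).replace("--", " ").split()
    # Each word becomes its own single-token pattern
    patterns = [nlp.make_doc(word) for word in words]
    fast_topic_matcher.add(ident, patterns)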
end = datetime.datetime.utcnow()
print(
    f"Finished adding FAST Topic patterns at {end}, total time {(end-start).seconds / 60.}"
)

Started at 2022-07-04 22:27:28.030998 of FAST Matching
Finished adding FAST Topic patterns at 2022-07-04 22:28:29.616713, total time 1.0166666666666666
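
A small note on the timing arithmetic: `timedelta.seconds` holds only the seconds component of the duration and ignores whole days, whereas `timedelta.total_seconds()` covers the full span. For a run of about a minute the two agree, as the sketch below shows using timestamps rounded from the output above; the one-day example is purely illustrative.

```python
import datetime

# Timestamps rounded to whole seconds from the run above.
start = datetime.datetime(2022, 7, 4, 22, 27, 28)
end = datetime.datetime(2022, 7, 4, 22, 28, 29)
elapsed = end - start

# .seconds is only the seconds component (0-86399) and ignores .days;
# .total_seconds() covers the whole duration.
print(f"{elapsed.seconds / 60:.2f} minutes")
print(f"{elapsed.total_seconds() / 60:.2f} minutes")

# Illustrative long run: the day is dropped by .seconds but not by total_seconds().
long_run = datetime.timedelta(days=1, seconds=30)
print(long_run.seconds, long_run.total_seconds())
```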


Next, we apply the spaCy phrase matcher to titles and summaries from the Sinopia stage knowledge graph.

## Loading Sinopia Stage Knowledge Graph
Just like in the previous Jupyter notebook, we will load the saved knowledge graph that we created at the beginning and then we will query the graph using SPARQL.

In [9]:
stage_kg = kglab.KnowledgeGraph()
stage_kg.load_jsonld("data/stage.json")

<kglab.kglab.KnowledgeGraph at 0x13891ab90>

### RDF Literals Pandas DataFrame
With the Sinopia Stage knowledge graph loaded, we will query the graph for *title*, *label*, and any *summary* triples and apply our `fast_topic_matcher` to the results.


In [10]:
stage_text_nodes = stage_kg.query_as_df(
    """PREFIX bf: <http://id.loc.gov/ontologies/bibframe/>

SELECT ?title ?label ?summary ?url

WHERE {
  OPTIONAL {
      ?title_bnode  bf:mainTitle ?title .
      ?url bf:title ?title_bnode .
  }
  OPTIONAL {
      ?label_bnode rdfs:label ?label .
      ?url bf:title ?label_bnode .
  }
  OPTIONAL {
      ?url bf:summary ?summary_bnode .
      ?summary_bnode rdfs:label ?summary  .
  }
}
"""
)

In [11]:
stage_text_nodes.to_json("data/stage-text-nodes.json")

In [12]:
stage_text_nodes.shape

(2102, 4)

In [13]:
stage_text_nodes.sample(10)

Unnamed: 0,title,url,label,summary
1930,"Purcell, Henry, 1659-1695. Dido and Aeneas",<https://api.stage.sinopia.io/resource/2b57c79...,Dido and Aeneas,
819,Fifty comedies and tragedies,<https://api.stage.sinopia.io/resource/0124bf5...,,
363,Ulysses,<https://api.stage.sinopia.io/resource/c02abeb...,,
1855,The politics of birth.,<https://api.stage.sinopia.io/resource/638fc75...,,Several manuscript and typescript drafts of an...
540,"Ar't asleepe husband : a boulster lecture, sto...",<https://api.stage.sinopia.io/resource/78a91d5...,,
821,Ane detectioun of the duinges of Marie Quene o...,<https://api.stage.sinopia.io/resource/e777db9...,,
1061,COVID-19,<https://api.stage.sinopia.io/resource/397f4a3...,,
1833,Sämtliche Bläserquintette,<https://api.stage.sinopia.io/resource/ea0393c...,,
1001,Taiwania (Taipei),<https://api.stage.sinopia.io/resource/c069a5c...,,
1643,Huai Yang wen hua yan jiu,<https://api.stage.sinopia.io/resource/f93bf1f...,,


In [14]:
stage_text_nodes.tail()

Unnamed: 0,title,url,label,summary
2097,Виде киге,<https://api.stage.sinopia.io/resource/4b31a77...,,
2098,Historisch-theoretisch und practische Untersuc...,<https://api.stage.sinopia.io/resource/c16a32c...,,
2099,Tunnel (Compilation),<https://api.stage.sinopia.io/resource/0432583...,,
2100,Emma (Austen),<https://api.stage.sinopia.io/resource/1bd7b0a...,,
2101,Physica. English,<https://api.stage.sinopia.io/resource/7b03cc3...,,


In [40]:
stage_text_nodes.loc[~stage_text_nodes["summary"].isnull()]

Unnamed: 0,title,url,label,summary
34,Introduction to the sagas of Icelanders,<https://api.stage.sinopia.io/resource/d42bf1f...,,"""Combining an accessible approach with innovat..."
41,Training your boxer,<https://api.stage.sinopia.io/resource/ba38374...,,"""An informative and fully illustrated manual t..."
42,Training your boxer,<https://api.stage.sinopia.io/resource/ba38374...,,"""Noted for affectionate loyalty to their maste..."
47,Validating RDF data,<https://api.stage.sinopia.io/resource/cac41b2...,,Describes two technologies for RDF validation:...
76,The stainless steel rat wants you,<https://api.stage.sinopia.io/resource/4650180...,,"After saving the world, diGriz is called on t..."
...,...,...,...,...
2042,Stainless steel rat for president,<https://api.stage.sinopia.io/resource/2181701...,,The Stainless Steel rat is back! Slippery Jim ...
2056,Little women. English. 1968,<https://api.stage.sinopia.io/resource/e7308cc...,,"The adventures of Meg, Jo, Beth, and Amy as th..."
2063,Idiot. English (Myers),<https://api.stage.sinopia.io/resource/dcc1b50...,,"""The main idea of the novel is to depict the p..."
2064,Idiot. English (Myers),<https://api.stage.sinopia.io/resource/dcc1b50...,,"""Into a compelling real portrait of nineteenth..."


In [23]:
stage_text_nodes.iloc[1855].summary

'Several manuscript and typescript drafts of an unpublished work, with working title "The politics of birth"'

In [42]:
doc1855 = nlp(stage_text_nodes.iloc[1855].summary)

The [spaCy][SPACY] `fast_topic_matcher` is greedy: because every word of every FAST name was added as its own pattern, it matches on any single term contained in the document, as we can see in this example.

[SPACY]: https://spacy.io/

In [43]:
len(fast_topic_matcher(doc1855))

97777

In [44]:
first_match = fast_topic_matcher(doc1855)[0]

In [45]:
first_match

(8446557437550807002, 0, 1)

In [46]:
nlp.vocab.strings[first_match[0]]

'http://id.worldcat.org/fast/936123'

In [22]:
fast_topics[fast_topics["URL"] == "http://id.worldcat.org/fast/936123"].iloc[0]["name"]

'Functions of several complex variables'

### Second Phrase Matcher Iteration 
At this point our `fast_topic_matcher` isn't that useful, because a match on any single term in a FAST name counts as a match. As is often the case when working with data, we iteratively test and change our approach, so we build a second phrase matcher. Instead of extracting each word in the FAST text, we will use the entire text as a phrase.

In [41]:
fast_topic_matcher2 = PhraseMatcher(nlp.vocab, attr="LOWER")

start = datetime.datetime.utcnow()
print(f"Started at {start} of FAST Matching")
# Iterate through the FAST Topic DataFrame
for _, row in fast_topics.iterrows():
    ident = row["URL"]
    # Use the whole heading as one multi-token pattern
    name_list = [
        nlp.make_doc(row["name"].strip(string.punctuation).replace("--", " ")),
    ]
    fast_topic_matcher2.add(ident, name_list)
end = datetime.datetime.utcnow()
print(
    f"Finished adding FAST Topic patterns at {end}, total time {(end-start).seconds / 60.}"
)

Started at 2022-07-04 23:53:58.815774 of FAST Matching
Finished adding FAST Topic patterns at 2022-07-04 23:54:33.162680, total time 0.5666666666666667


In [47]:
len(fast_topic_matcher2(doc1855))

2

In [49]:
doc1855_matches = fast_topic_matcher2(doc1855)

In [50]:
doc1855_matches

[(15031021037593339281, 4, 5), (17604063676832794131, 8, 9)]

In [51]:
for match in doc1855_matches:
    fast_url = nlp.vocab.strings[match[0]]
    label = fast_topics[fast_topics["URL"] == fast_url].iloc[0]["name"]
    print(fast_url, label)

http://id.worldcat.org/fast/897348 Drafts
http://id.worldcat.org/fast/1180188 Work
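
To see why whole-phrase patterns cut the number of matches from 97,777 to 2, we can compare the two strategies on toy data. The sketch below assumes spaCy v3 and uses two made-up identifiers (`toy:1`, `toy:2`) in place of FAST URLs; the headings and the sentence are taken from the examples above. It also shows how each match tuple unpacks into a hashed identifier plus start and end token offsets.

```python
from spacy.lang.en import English
from spacy.matcher import PhraseMatcher

nlp = English()  # tokenizer only; no trained model is needed

# Toy stand-ins for FAST headings; the keys are made-up identifiers.
headings = {
    "toy:1": "Functions of several complex variables",
    "toy:2": "Drafts",
}

word_matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
phrase_matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
for ident, name in headings.items():
    # First iteration: every word becomes its own one-token pattern.
    word_matcher.add(ident, [nlp.make_doc(word) for word in name.split()])
    # Second iteration: the whole heading is a single multi-token pattern.
    phrase_matcher.add(ident, [nlp.make_doc(name)])

doc = nlp("Several manuscript and typescript drafts of an unpublished work")

for kind, matcher in [("word-level", word_matcher), ("phrase-level", phrase_matcher)]:
    for match_id, start, end in matcher(doc):
        # match_id is a hash; the string store maps it back to the key.
        print(kind, nlp.vocab.strings[match_id], doc[start:end].text)
```

The word-level matcher should fire on *Several* and *of* for `toy:1`, mirroring how the first real match pointed at "Functions of several complex variables", while the phrase-level matcher should only report *drafts*.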


## spaCy Named Entity Recognition (NER)
One of the strengths of [spaCy][SPACY] is its pre-trained models for named entity recognition (NER). In NER tasks, entities like people, locations, and time periods are identified and tagged for later analysis.

To demonstrate, we will load a small English language model and perform NER on some of the *titles*, *labels*, and any *summary* triples contained in the `stage_text_nodes` dataframe.

### Download model and Create NLP pipeline
We first need to download the `en_core_web_sm` model for our environment and then create an instance of an NLP pipeline.

[SPACY]: https://spacy.io/

In [52]:
! python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.3.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.3.0/en_core_web_sm-3.3.0-py3-none-any.whl (12.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12.8/12.8 MB 1.6 MB/s eta 0:00:00
Installing collected packages: en-core-web-sm
Successfully installed en-core-web-sm-3.3.0
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [53]:
en_nlp = spacy.load("en_core_web_sm")

### Generate Documents and Display Entities
Using a different sample from the `stage_text_nodes` dataframe, we apply the `en_nlp` pipeline to some of these values to see if we can extract meaningful entities.

In [56]:
sample2 = stage_text_nodes.sample(10)

In [57]:
sample2

Unnamed: 0,title,url,label,summary
833,"Tragedie of Charles Duke of Byron, Marshall of...",<https://api.stage.sinopia.io/resource/e8d4bd8...,,
766,Cultural policy for arts education,<https://api.stage.sinopia.io/resource/203d66b...,,
1380,Warres of Pompey and Caesar,<https://api.stage.sinopia.io/resource/b112474...,,
1873,Bi︠e︡dnye li︠u︡di,<https://api.stage.sinopia.io/resource/b1d0744...,,
2042,Stainless steel rat for president,<https://api.stage.sinopia.io/resource/2181701...,,The Stainless Steel rat is back! Slippery Jim ...
1405,SPECTS 2021,<https://api.stage.sinopia.io/resource/7031994...,,
179,Orange County illustrated: Orange County befor...,<https://api.stage.sinopia.io/resource/a53bf60...,,
1265,Work for narrator and orchestra,<https://api.stage.sinopia.io/resource/93408d3...,,
1043,"Variant2, title",<https://api.stage.sinopia.io/resource/c9960db...,,
1109,Eastward hoe,<https://api.stage.sinopia.io/resource/c24ca29...,,


In [61]:
doc_833 = en_nlp(stage_text_nodes.iloc[833].title)
doc_2042 = en_nlp(stage_text_nodes.iloc[2042].summary)

In [63]:
print("Entities for 833 title")
print(stage_text_nodes.iloc[833].title)
for ent in doc_833.ents:
    print(ent, ent.label_)
print("\n\nEntities for 2042")
print(stage_text_nodes.iloc[2042].summary)
for ent in doc_2042.ents:
    print(ent, ent.label_)

Entities for 833 title
Tragedie of Charles Duke of Byron, Marshall of France
Charles Duke PERSON
Marshall PERSON
France GPE


Entities for 2042
The Stainless Steel rat is back! Slippery Jim diGriz, the future's most lovable, laughable, larcenous conman tumed counterspy, retums for yet another high-tension mission. This time the Special Corps has given the Rat a daring assignment - liberate a backward tourist planet from the clutches of an aging dictator. With his lovely but lethal wife, Angelina, and his two stalwart sons, James and Bolivar, diGriz pits ballots against bullets in the fight for freedom. He's vowed to restore truth, justice, and democracy to the world of Parisio-Aqui, if he has to lie, cheat, and steal to do it.
Stainless Steel ORG
Slippery Jim PERSON
the Special Corps ORG
Angelina GPE
two CARDINAL
James and Bolivar ORG
Parisio-Aqui FAC
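
Grouping the entities by label makes questionable tags easier to spot, for example *Angelina* tagged as **GPE** or *James and Bolivar* tagged as **ORG**. The sketch below does this with plain Python on the (text, label) pairs copied from the output above; with a live document the same pairs could be built from `(ent.text, ent.label_)` for each entity in `doc_2042.ents`.

```python
from collections import defaultdict

# (text, label) pairs copied from the entity output above.
entities = [
    ("Stainless Steel", "ORG"),
    ("Slippery Jim", "PERSON"),
    ("the Special Corps", "ORG"),
    ("Angelina", "GPE"),
    ("two", "CARDINAL"),
    ("James and Bolivar", "ORG"),
    ("Parisio-Aqui", "FAC"),
]

# Collect entity texts under their label, preserving document order.
by_label = defaultdict(list)
for text, label in entities:
    by_label[label].append(text)

for label, texts in sorted(by_label.items()):
    print(f"{label:<9}{texts}")
```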


### spaCy NER and POS Visualizations
The [spaCy][SPACY] project provides the `displacy` module, which has two helpful visualizers. The `dep` style, which is the default and is used in the first cell below, draws the parts-of-speech and how they are related to each other. The `ent` style, used in the second cell, displays the identified entities in the context of the text.

[SPACY]: https://spacy.io/

In [66]:
from spacy import displacy

displacy.render(doc_833, jupyter=True)

In [67]:
displacy.render(doc_2042, jupyter=True, style="ent")
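
Outside of a notebook, or when we want to adjust entities by hand before displaying them, `displacy` can also render from a plain dictionary. The sketch below assumes spaCy v3 and uses the `manual=True` mode with the three entities reported for the first title above; the character offsets are computed from the text rather than taken from a `Doc`.

```python
from spacy import displacy

text = "Tragedie of Charles Duke of Byron, Marshall of France"

# Character offsets for the three entities reported above, in text order.
ents = []
for phrase, label in [("Charles Duke", "PERSON"), ("Marshall", "PERSON"), ("France", "GPE")]:
    start = text.index(phrase)
    ents.append({"start": start, "end": start + len(phrase), "label": label})

# manual=True renders from a plain dict instead of a Doc;
# jupyter=False returns the markup as a string rather than displaying it.
html = displacy.render(
    {"text": text, "ents": ents, "title": None},
    style="ent",
    manual=True,
    jupyter=False,
)
print(html[:80])
```

The returned string can be written to an `.html` file or embedded in another page.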

## Exercise 1
Download and create an `nlp` pipeline using the *en_core_web_sm* model and see what entities are identified using your own sample from the `stage_text_nodes` dataframe.

> **Extra credit** Find a non-English resource in `stage_text_nodes`, find a corresponding 
> [spaCy][SPACY] trained pipeline at https://spacy.io/models and follow the above steps to 
> identify any entities in the *title*, *label*, or *summary* values.

[SPACY]: https://spacy.io/

## Full-Text and Sinopia RDF Metadata
Exciting possibilities for applying machine learning to library workflows arise when we combine linked-data cataloging with the availability of the full-text. For example, we can run NER on the entirety of the full-text and add any identified entities to our RDF.

> Special thanks to Tim Thompson and Greta de Groat for providing examples of cataloged
> resources in Sinopia that have available full-text!


### Example One: *Alarmingly suspicious*
First, we will download the Sinopia record for this Work, https://api.sinopia.io/resource/65a2b059-5ac1-48a6-adbb-870712c3060c, and extract the `data` property to create an RDF graph.

In [69]:
example1_result = requests.get(
    "https://api.sinopia.io/resource/65a2b059-5ac1-48a6-adbb-870712c3060c"
)
example1_graph = rdflib.Graph()
for ns, url in helpers.NAMESPACES.items():
    example1_graph.namespace_manager.bind(ns, url)
example1_graph.parse(
    data=json.dumps(example1_result.json().get("data")), format="json-ld"
)

print(f"Number of triples {len(example1_graph)}")

Number of triples 48


Second, we open the full-text that was retrieved from [HathiTrust](https://babel.hathitrust.org/cgi/pt?id=uiuo.ark:/13960/t14n5kz7q&view=1up&seq=3), read all of the text, and save it to the `example1_text` variable.

In [70]:
with open("data/65a2b059-5ac1-48a6-adbb-870712c3060c.txt") as fo:
    example1_text = fo.read()

print(f"Number of characters in the text is {len(example1_text):,}")

Number of characters in the text is 61,249


Third, we will run the full-text through our [spaCy][SPACY] `en_nlp` pipeline and then examine the identified entities.

[SPACY]: https://spacy.io/

In [71]:
example1_doc = en_nlp(example1_text)
print(f"Number of entities: {len(example1_doc.ents):,}")

Number of entities: 587


Of the 587 entities, we will extract the **PERSON** and **GPE** entities for further analysis. **GPE** stands for geopolitical entity (countries, cities, and states), which we treat here as locations.

In [72]:
people, locations = [], []

for ent in example1_doc.ents:
    if ent.label_.startswith("PERSON"):
        people.append(ent)
    if ent.label_.startswith("GPE"):
        locations.append(ent)
print(f"NER found {len(people):,} people and {len(locations):,} locations")

NER found 281 people and 34 locations


Looking at the `people` list, we notice a number of duplicates. For now we don't care where in the text a person entity was found, so we normalize the whitespace in each name and de-duplicate the list.

In [74]:
print(people[0])

J.
	   Palgrave Simpson


In [75]:
people = set([" ".join(ent.text.split()) for ent in people])
locations = set([" ".join(ent.text.split()) for ent in locations])

In [76]:
len(people), len(locations)

(103, 20)
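
The whitespace clean-up in the cell above relies on `str.split()` with no argument, which splits on any run of spaces, tabs, or newlines. The sketch below shows the effect on the first entity printed above; the second and third strings are illustrative variants and are not taken from the text.

```python
# The first string mirrors the embedded newline and tab seen above;
# the other two are illustrative variants of the same name.
raw_names = [
    "J.\n\t   Palgrave Simpson",
    "J. Palgrave Simpson",
    "J.  Palgrave\nSimpson",
]

# str.split() with no argument splits on any run of whitespace,
# so joining with a single space yields one canonical spelling.
normalized = {" ".join(name.split()) for name in raw_names}
print(normalized)
```

Names that differ in case, punctuation, or initials are still treated as distinct by this approach, which is one reason the count stays above a hundred.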

With these identified names, we can use the VIAF [SRU Search](https://platform.worldcat.org/api-explorer/apis/VIAF/AuthorityCluster/SRUSearch) on each of the `people` entities to see if we can narrow down our list using this authority source. In the helpers module, the `viaf_people_search` function returns a dictionary of VIAF URLs that match the term, each mapped to the first main heading of the record.

We can then aggregate these results and see what remains. 

In [77]:
example1_people = {}
for i, term in enumerate(list(people)):
    # Whitespace was already normalized above; look up each name in VIAF
    viaf_result = helpers.viaf_people_search(term)
    example1_people.update(viaf_result)
    if not i % 10 and i > 0:
        print(".", end="")
    if not i % 25:
        print(f"{i}", end="")

0..25...50..75...100

In [78]:
example1_people

{'http://viaf.org/viaf/8865152744551427850006/': ['Flahault, Emilie de 1819-1895'],
 'http://viaf.org/viaf/6089152988212712790001/': ['Abell, John 1653-1724'],
 'http://viaf.org/viaf/1233159233932703370076/': ['Palmer, W.'],
 'http://viaf.org/viaf/467155284807287060878/': ["Gentleman lately residing at the English ambassador's at Paris"],
 'http://viaf.org/viaf/138389/': ['Sharp, Tony.'],
 'http://viaf.org/viaf/8102151837999320520004/': ['Souligne, De'],
 'http://viaf.org/viaf/6770159477671927990009/': ['Tavčar-Kalcher, Gabrijela'],
 'http://viaf.org/viaf/4210164963959424300006/': ['Peter R. Slater'],
 'http://viaf.org/viaf/947157704199544440008/': ['Bellanger, Augustin, 1868-19..'],
 'http://viaf.org/viaf/9970163710044729700003/': ['Wareham, Pete'],
 'http://viaf.org/viaf/9956149198279174940000/': ['Stair, Henry H.'],
 'http://viaf.org/viaf/9923158491050611920006/': ['Bossley, Pete'],
 'http://viaf.org/viaf/100904454/': ['White, James, 1863-1928.'],
 'http://viaf.org/viaf/60024852/': 

### Example Two: *Improving water supply networks*
The second example is the report, *Improving water supply networks*, with the following Sinopia Resources:

- BIBFRAME Instance https://api.stage.sinopia.io/resource/e865acc1-9b24-401e-a0eb-36ca1ca8b1d6
- BIBFRAME Work https://api.stage.sinopia.io/resource/3ccb6255-a090-4751-b68f-7f7861986452

First, we will retrieve the resources and add them to an RDF graph.

In [None]:
example2_graph = rdflib.Graph()
for ns, url in helpers.NAMESPACES.items():
    example2_graph.namespace_manager.bind(ns, url)

for sinopia_url in [
    "https://api.stage.sinopia.io/resource/ae7a9f2c-a5db-4472-80fe-9a30f311386d",
    "https://api.stage.sinopia.io/resource/3ccb6255-a090-4751-b68f-7f7861986452",
]:
    sinopia_result = requests.get(sinopia_url)
    example2_graph.parse(
        data=json.dumps(sinopia_result.json()["data"]), format="json-ld"
    )
print(f"Total number of triples for example 2 {len(example2_graph):,}")

Second, we will open and read the full-text for *Improving water supply networks*.

In [None]:
with open("data/ae7a9f2c-a5db-4472-80fe-9a30f311386d.txt") as fo:
    example2_text = fo.read()

print(f"Number of characters in the text is {len(example2_text):,}")

Third, we will run the full-text through our [spaCy][SPACY] `en_nlp` pipeline and then examine the identified entities.

[SPACY]: https://spacy.io

In [None]:
example2_doc = en_nlp(example2_text)
print(f"Number of entities: {len(example2_doc.ents):,}")

In [None]:
example2_people, example2_locations = [], []

for ent in example2_doc.ents:
    if ent.label_.startswith("PERSON"):
        example2_people.append(ent)
    if ent.label_.startswith("GPE"):
        example2_locations.append(ent)
print(
    f"NER found {len(example2_people):,} people and {len(example2_locations):,} locations"
)

We now de-duplicate `example2_people` and `example2_locations`.

In [None]:
example2_people = set([" ".join(ent.text.split()) for ent in example2_people])
example2_locations = set([" ".join(ent.text.split()) for ent in example2_locations])

In [None]:
example2_people

In [None]:
example2_locations

## Exercise 2
For the following Sinopia Resources, create a graph and an NLP pipeline to identify any people and locations in the full-text. The full-text is available in the `data/5f8c765d-dc4f-45a0-b2d8-87fe29a38712.txt` file.

- BIBFRAME Work https://api.stage.sinopia.io/resource/5f8c765d-dc4f-45a0-b2d8-87fe29a38712 
- BIBFRAME Instance https://api.stage.sinopia.io/resource/2e041e16-1b4d-431f-b2d0-396f62270efa
- BIBFRAME Item https://api.stage.sinopia.io/resource/53e68480-f82b-4dea-9b9a-68fd5ceacb41