# Natural Language Processing (NLP) with spaCy
In this notebook, we will use [spaCy][SPACY], a Natural Language Processing library for Python.
The functionality provided by [spaCy][SPACY] allows us to quickly extract parts of speech (POS) from text descriptions and to identify entities using its named entity recognition (NER). We will first use [spaCy][SPACY]'s rule-based matching functionality to create a subject matcher that we can then apply to both Sinopia's RDF metadata and the associated full-text for a select sample of these resources.


[SPACY]: https://spacy.io/

In [1]:
%reload_ext autoreload
%autoreload 2
%matplotlib inline
%reload_ext lab_black

import datetime
import json
import string

import pathlib
import pandas as pd

import kglab
import rdflib
import helpers
import requests
import spacy

## Creating a FAST pandas DataFrame
Using a CSV file derived from OCLC's [FAST](https://www.oclc.org/research/areas/data-science/fast.html) (Faceted Application of Subject Terminology) topic list, we read the file into a pandas DataFrame.

In [None]:
fast_topics = pd.read_csv("data/topic_uri_label_utf8.csv", names=["URL", "name"])

To get a sense of what data is in the `fast_topics` DataFrame, we can look at its *shape* and *info*, and then at a random sample.

In [None]:
print(f"Shape of fast_topics {fast_topics.shape}")
fast_topics.info()

In [None]:
fast_topics.sample(10)

From our sample, we see that the `name` column has punctuation such as **(),--** that will need to be removed later in our workflow.

## spaCy Phrase Matcher
With the [FAST][FAST] dataframe as our source, we will now create a [spaCy][SPACY] [PhraseMatcher](https://spacy.io/usage/rule-based-matching#phrasematcher) made up of phrases extracted from the dataframe, using the [FAST][FAST] URLs as identifiers.

We start by importing [spaCy][SPACY]'s base English language class and creating an empty natural language processing (nlp) pipeline. We then pass its vocabulary into a new matcher object that compares the lower-cased form of each token, so matching against the [FAST][FAST] patterns is case-insensitive.

[FAST]: https://www.oclc.org/research/areas/data-science/fast.html
[SPACY]: https://spacy.io/

In [None]:
from spacy.lang.en import English
from spacy.matcher import PhraseMatcher

nlp = English()
fast_topic_matcher = PhraseMatcher(nlp.vocab, attr="LOWER")

In [None]:
"African American teenagers--Education".strip(string.punctuation).replace(
    "--", " "
).split()
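
Note that `str.strip` only removes characters from the two ends of a string, so punctuation inside a heading survives the call above. The sketch below contrasts that behavior with a translate-based alternative. The sample heading and the `normalize_heading` helper are hypothetical, chosen only for illustration, and are not part of this notebook's workflow.

```python
import string

# Hypothetical FAST-style heading with internal punctuation (illustration only)
heading = "Authors, American (Jewish)--Biography"

# Approach used in the cell above: strip() only trims the ends of the string
stripped = heading.strip(string.punctuation).replace("--", " ").split()


def normalize_heading(name):
    """Hypothetical helper: replace every punctuation character with a space."""
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    return name.translate(table).split()


normalized = normalize_heading(heading)
print(stripped)
print(normalized)
```

One trade-off of the translate-based version is that it also splits hyphenated words, which may or may not be desirable for subject headings.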

### Creating Patterns
We will iterate through the dataframe and, for each [FAST](https://www.oclc.org/research/areas/data-science/fast.html) row, use the **URL** as the identifier. For the name, we strip leading and trailing punctuation, replace `--` with a space, split the name into a list of words, tokenize each word, and then add the resulting patterns to the `fast_topic_matcher`.

In [None]:
start = datetime.datetime.utcnow()
print(f"Started adding FAST Topic patterns at {start}")
# Iterate through the FAST Topic DataFrame
for _, row in fast_topics.iterrows():
    ident = row["URL"]
    # Trim punctuation from both ends, treat "--" as a separator, then split into words
    words = row["name"].strip(string.punctuation).replace("--", " ").split()
    # Each word becomes its own single-token pattern registered under the FAST URL
    patterns = [nlp.make_doc(word) for word in words]
    fast_topic_matcher.add(ident, patterns)
end = datetime.datetime.utcnow()
print(
    f"Finished adding FAST Topic patterns at {end}, total time {(end - start).seconds / 60.0:.1f} minutes"
)

Next, we apply the spaCy phrase matcher to titles and summaries from the Sinopia Stage knowledge graph.

## Loading Sinopia Stage Knowledge Graph
Just like in the previous Jupyter notebook, we will load the saved knowledge graph that we created at the beginning and then we will query the graph using SPARQL.

In [None]:
stage_kg = kglab.KnowledgeGraph()
stage_kg.load_jsonld("data/stage.json")

### RDF Literals Pandas DataFrame
With the Sinopia Stage knowledge graph loaded, we will query our Graph for *titles*, *labels*, and any *summary* triples and apply our `fast_topic_matcher` to the results.

> **NOTE** The `stage_text_nodes` dataframe has been saved in the **data** directory 
> and can be loaded for use instead of running this SPARQL query.

In [None]:
stage_text_nodes = stage_kg.query_as_df(
    """PREFIX bf: <http://id.loc.gov/ontologies/bibframe/>

SELECT ?title ?label ?summary ?url

WHERE {
  OPTIONAL {
      ?title_bnode  bf:mainTitle ?title .
      ?url bf:title ?title_bnode .
  }
  OPTIONAL {
      ?label_bnode rdfs:label ?label .
      ?url bf:title ?label_bnode .
  }
  OPTIONAL {
      ?url bf:summary ?summary_bnode .
      ?summary_bnode rdfs:label ?summary  .
  }
}
"""
)

In [None]:
stage_text_nodes.to_json("data/stage-text-nodes.json")
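
The saved file can be read back with `pandas.read_json` instead of re-running the SPARQL query. The following is a minimal round-trip sketch; it uses a small hypothetical DataFrame and a temporary directory rather than the notebook's actual `data/stage-text-nodes.json` file.

```python
import os
import tempfile

import pandas as pd

# Small stand-in for stage_text_nodes (hypothetical rows, illustration only)
frame = pd.DataFrame(
    {
        "title": ["An example title", None],
        "summary": [None, "An example summary"],
        "url": ["https://example.org/1", "https://example.org/2"],
    }
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "text-nodes.json")
    frame.to_json(path)  # same call as in the cell above
    reloaded = pd.read_json(path)  # rebuilds the DataFrame from the saved file

print(reloaded.shape)
```

In this notebook, the equivalent call would presumably be `pd.read_json("data/stage-text-nodes.json")`.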

In [None]:
stage_text_nodes.shape

In [None]:
stage_text_nodes.sample(10)

In [None]:
stage_text_nodes.tail()

In [None]:
doc = nlp(stage_text_nodes.iloc[11484].summary)

In [None]:
len(fast_topic_matcher(doc))

In [None]:
first_match = fast_topic_matcher(doc)[0]

In [None]:
first_match

In [None]:
nlp.vocab.strings[first_match[0]]

In [None]:
fast_topics[fast_topics["URL"] == "http://id.worldcat.org/fast/869075"].iloc[0]["name"]

In [None]:
matches_11484 = helpers.get_matches(
    stage_text_nodes.iloc[11484].title, nlp, fast_topic_matcher, fast_topics, 25
)

In [None]:
matches_11484

The [spaCy][SPACY] `fast_topic_matcher` is greedy: because each FAST heading was added as a list of single-word patterns, it matches on any of those words contained in the document, as we can see in this example.

[SPACY]: https://spacy.io/
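
To make the greedy behavior concrete, the sketch below compares two matchers built from the same heading: one with a pattern per word, as in this notebook, and one with a single multi-word pattern. The key `EXAMPLE_TOPIC`, the sample sentences, and the `matched_spans` helper are hypothetical and used only for illustration.

```python
from spacy.lang.en import English
from spacy.matcher import PhraseMatcher

nlp = English()
words = ["African", "American", "teenagers", "Education"]

# One single-token pattern per word (the approach used in this notebook)
word_matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
word_matcher.add("EXAMPLE_TOPIC", [nlp.make_doc(word) for word in words])

# One multi-token pattern for the main heading (an alternative, shown for contrast)
phrase_matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
phrase_matcher.add("EXAMPLE_TOPIC", [nlp.make_doc("African American teenagers")])


def matched_spans(matcher, text):
    """Hypothetical helper: return the text of every span the matcher finds."""
    doc = nlp(text)
    return [doc[start:end].text for _, start, end in matcher(doc)]


sentence = "A history of education in American cities"
print(matched_spans(word_matcher, sentence))
print(matched_spans(phrase_matcher, sentence))
print(matched_spans(phrase_matcher, "Programs for african american teenagers"))
```

The per-word matcher fires on isolated words, while the multi-word pattern only fires when the whole phrase appears, at the cost of missing partial or reordered mentions.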

In [None]:
matches_11500 = helpers.get_matches(
    stage_text_nodes.iloc[11500].title, nlp, fast_topic_matcher, fast_topics, 25
)

In [None]:
matches_11500

## spaCy Named Entity Recognition (NER)
One of the strengths of [spaCy][SPACY] is its pre-trained models for named entity recognition (NER). In NER tasks, entities such as people, locations, and time periods are identified and tagged for later analysis.

To demonstrate, we will load a small English language model and perform NER on some of the *titles*, *labels*, and any *summary* triples contained in the `stage_text_nodes` dataframe.

### Download model and Create NLP pipeline
We first need to download the `en_core_web_sm` model for our environment and then create an instance of an NLP pipeline.

[SPACY]: https://spacy.io/

In [None]:
! python -m spacy download en_core_web_sm

In [5]:
en_nlp = spacy.load("en_core_web_sm")

### Generate Documents and Display Entities
Using a different sample from the `stage_text_nodes` dataframe, we apply the `en_nlp` pipeline to some of these values to see if we can extract meaningful entities.

In [None]:
sample2 = stage_text_nodes.sample(10)

In [None]:
sample2

In [None]:
doc_10315 = en_nlp(stage_text_nodes.iloc[10315].title)
doc_3087 = en_nlp(stage_text_nodes.iloc[3087].summary)

In [None]:
print("Entities for 10315")
for ent in doc_10315.ents:
    print(ent, ent.label_)
print("\n\nEntities for 3087")
for ent in doc_3087.ents:
    print(ent, ent.label_)
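
The values printed by `ent.label_` are short codes. As a small aid, `spacy.explain` returns a brief description for a label; the list of labels below is just a sample chosen for illustration.

```python
import spacy

# A sample of entity labels to look up
labels = ["PERSON", "GPE", "ORG", "NORP", "FAC"]
descriptions = {label: spacy.explain(label) for label in labels}

for label, description in descriptions.items():
    print(f"{label}: {description}")
```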

### spaCy NER and POS Visualizations
The [spaCy][SPACY] project provides the `displacy` module, which has two helpful visualizers. The first, selected with `style="ent"`, displays the identified entities in the context of the text. The second, the default `style="dep"`, shows the parts of speech and how they are related to each other.

[SPACY]: https://spacy.io/

In [None]:
from spacy import displacy

displacy.render(doc_3087, jupyter=True, style="ent")

In [None]:
displacy.render(doc_10315, jupyter=True)
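
Outside of a notebook, the same visualizer can return its markup as a string. The sketch below assumes a blank English pipeline with hand-labelled entities, so that it runs without a trained model; the sentence and its labels are hypothetical.

```python
import spacy
from spacy import displacy
from spacy.tokens import Span

# Blank pipeline plus manually assigned entities (illustration only, no trained model)
blank_nlp = spacy.blank("en")
doc = blank_nlp("Jane Doe visited Springfield")
doc.ents = [Span(doc, 0, 2, label="PERSON"), Span(doc, 3, 4, label="GPE")]

# With jupyter=False, render() returns the HTML instead of displaying it
html = displacy.render(doc, style="ent", jupyter=False, page=True)
print(len(html))
```

The returned string could then be written to a file and opened in a browser.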

## Exercise 1
Download and create an `nlp` pipeline using the *en_core_web_sm* model and see what entities are identified using your sample from the `stage_text_nodes` dataframe.

> **Extra credit** Find a non-English resource in `stage_text_nodes`, find a corresponding 
> [spaCy][SPACY] trained pipeline at https://spacy.io/models and follow the above steps to 
> identify any entities in the *title*, *label*, or *summary* values.

[SPACY]: https://spacy.io/

## Full-Text and Sinopia RDF Metadata
Exciting possibilities for applying machine learning to library workflows come when we combine linked-data cataloging with the availability of the full-text. For example, we can run NER on the entirety of the full-text and add any identified entities to our RDF.

> Special thanks to Tim Thompson and Greta de Groat for providing examples of cataloged
> resources in Sinopia that have available full-text!


### Example One: *Alarmingly suspicious*
First, we will download the Sinopia record for this Work, https://api.sinopia.io/resource/65a2b059-5ac1-48a6-adbb-870712c3060c, and extract the `data` property to create an RDF graph.

In [None]:
example1_result = requests.get(
    "https://api.sinopia.io/resource/65a2b059-5ac1-48a6-adbb-870712c3060c"
)
example1_graph = rdflib.Graph()
for ns, url in helpers.NAMESPACES.items():
    example1_graph.namespace_manager.bind(ns, url)
example1_graph.parse(
    data=json.dumps(example1_result.json().get("data")), format="json-ld"
)

Second, we open the full-text that was retrieved from [HathiTrust](https://babel.hathitrust.org/cgi/pt?id=uiuo.ark:/13960/t14n5kz7q&view=1up&seq=3), read all of the text, and save it to the `example1_text` variable.

In [None]:
with open("data/65a2b059-5ac1-48a6-adbb-870712c3060c.txt") as fo:
    example1_text = fo.read()

print(f"Number of characters in the text is {len(example1_text):,}")

Third, we will run the full-text through our [spaCy][SPACY] en_nlp pipeline and then examine the identified entities.

[SPACY]: https://spacy.io/

In [None]:
example1_doc = en_nlp(example1_text)
print(f"Number of entities: {len(example1_doc.ents):,}")

Of the 576 entities, we will extract the **PERSON** and **GPE** (location) entities for further analysis. 

In [None]:
people, locations = [], []

for ent in example1_doc.ents:
    if ent.label_.startswith("PERSON"):
        people.append(ent)
    if ent.label_.startswith("GPE"):
        locations.append(ent)
print(f"NER found {len(people):,} people and {len(locations):,} locations")

Looking at the `people` list, we notice a number of duplications (for now we don't care where the person entity was found in the text) so we filter this list.

In [None]:
people = set([" ".join(ent.text.split()) for ent in people])
locations = set([" ".join(ent.text.split()) for ent in locations])

In [None]:
len(people), len(locations)
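
The `" ".join(text.split())` idiom used above collapses newlines, tabs, and repeated spaces into single spaces, so variants of the same name that differ only in whitespace fall together in the set. A minimal sketch with hypothetical entity strings:

```python
# Hypothetical entity strings with inconsistent whitespace (illustration only)
raw_names = ["John\nWallace", "John  Wallace", "John Wallace", "Jack\tBaker"]

# Normalize whitespace, then let the set drop the duplicates
unique_names = {" ".join(name.split()) for name in raw_names}
print(sorted(unique_names))
```

This only merges whitespace variants; different spellings or forms of a name remain separate entries.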

With these identified names, we can use the VIAF [SRU Search](https://platform.worldcat.org/api-explorer/apis/VIAF/AuthorityCluster/SRUSearch) on each of the `people` entities to see if we can narrow down our list using this authority source. In the helpers module, the `viaf_people_search` function returns a dictionary of VIAF URLs that match the term and the first main heading of each record.

We can then aggregate these results and see what remains. 

In [None]:
example1_people = {}
for i, term in enumerate(list(people)):
    # Search VIAF for each de-duplicated person name and merge the results
    viaf_result = helpers.viaf_people_search(term)
    example1_people.update(viaf_result)
    if not i % 10 and i > 0:
        print(".", end="")
    if not i % 25:
        print(f"{i}", end="")

In [None]:
example1_people

### Example Two: *Seismic performance assessment of buildings*
The second example is the report *Seismic performance assessment of buildings*, described by the following Sinopia resources:

- BIBFRAME Work https://api.stage.sinopia.io/resource/2ffc86d1-4850-4e4b-974c-49ded8ce4b3f
- BIBFRAME Instance https://api.stage.sinopia.io/resource/e865acc1-9b24-401e-a0eb-36ca1ca8b1d6


First, we will retrieve these resources and add them to an RDF graph.

In [2]:
example2_graph = rdflib.Graph()
for ns, url in helpers.NAMESPACES.items():
    example2_graph.namespace_manager.bind(ns, url)

for sinopia_url in [
    "https://api.stage.sinopia.io/resource/2ffc86d1-4850-4e4b-974c-49ded8ce4b3f",
    "https://api.stage.sinopia.io/resource/e865acc1-9b24-401e-a0eb-36ca1ca8b1d6",
]:
    sinopia_result = requests.get(sinopia_url)
    example2_graph.parse(
        data=json.dumps(sinopia_result.json()["data"]), format="json-ld"
    )
print(f"Total number of triples for example 2 {len(example2_graph):,}")

Total number of triples for example 2 152


Second, we will open and read the full-text for *Seismic performance assessment of buildings*.

In [3]:
with open("data/e865acc1-9b24-401e-a0eb-36ca1ca8b1d6.txt") as fo:
    example2_text = fo.read()

print(f"Number of characters in the text is {len(example2_text):,}")

Number of characters in the text is 671,016


Third, we will run the full-text through our [spaCy][SPACY] en_nlp pipeline and then examine the identified entities.

[SPACY]: https://spacy.io

In [6]:
example2_doc = en_nlp(example2_text)
print(f"Number of entities: {len(example2_doc.ents):,}")

Number of entities: 10,220


In [30]:
ent_types = {}
for ent in example2_doc.ents:
    if ent.label_ in ent_types:
        ent_types[ent.label_] += 1
    else:
        ent_types[ent.label_] = 1

In [31]:
ent_types

{'ORG': 1820,
 'WORK_OF_ART': 324,
 'PRODUCT': 672,
 'CARDINAL': 4170,
 'GPE': 686,
 'PERSON': 685,
 'FAC': 42,
 'DATE': 533,
 'MONEY': 136,
 'ORDINAL': 138,
 'NORP': 34,
 'QUANTITY': 36,
 'PERCENT': 794,
 'LAW': 61,
 'LOC': 22,
 'TIME': 62,
 'EVENT': 5}
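
The same tally can be written more compactly with `collections.Counter`, which also makes it easy to list the most frequent labels. The `labels` list below is a hypothetical stand-in for `[ent.label_ for ent in example2_doc.ents]`.

```python
from collections import Counter

# Hypothetical labels standing in for the entity labels of a parsed document
labels = ["ORG", "PERSON", "GPE", "ORG", "CARDINAL", "PERSON", "ORG"]

label_counts = Counter(labels)
top_two = label_counts.most_common(2)
print(top_two)
```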

In [19]:
example2_people, example2_locations = [], []

for ent in example2_doc.ents:
    if ent.label_.startswith("PERSON"):
        example2_people.append(ent)
    if ent.label_.startswith("GPE"):
        example2_locations.append(ent)
print(
    f"NER found {len(example2_people):,} people and {len(example2_locations):,} locations"
)

NER found 685 people and 686 locations


In [29]:
example2_people[90].start, example2_people[90].end

(639, 641)

In [None]:
example2_people[90]

We now de-duplicate `example2_people` and `example2_locations`.

In [8]:
example2_people = set([" ".join(ent.text.split()) for ent in example2_people])
example2_locations = set([" ".join(ent.text.split()) for ent in example2_locations])

In [10]:
len(example2_people)

359

In [18]:
type(list(example2_people)[52])

str

In [15]:
for person in list(example2_people)[50:100]:
    print(person)

Russell Larsen
Udit S.
Khalid Mosalam
-1.204
John Wallace
gm β
Yin-Nan Huang
C-3
Christopher Rojahn
Travis Chrupalo
Dan Gramer
Table G-2
Haselton Baker Risk Group
+1
Steven McCabe Morley Builders
Atkinson
Mohamed M. Talaat
Peter J. May
Continuous Distributions
β2 
Harmsen
Pampanin et al.
Appendix D.
K-1 Overturning
Erica Hays
Appendix G
Monte Carlo
David R. Bonneville
Steve Mahin
W
Table H-1
B.1
Farzad Naeim
max min R
Bruce R. Ellingwood
Exterior Walls
John Hooper
Appendix K:
Jennifer Tobin-Gurley
r.
B.
W.
Katherine Wade
i.
Stephen A. Mahin
Gayle Johnson
Jack Baker
William O’Brien
David Bonneville
Appendix E. Population


In [11]:
len(example2_locations)

198

In [14]:
for location in list(example2_locations)[0:50]:
    print(location)

R.O.
Tehran
I-1
Vairo
D.C.
SF
B.7
L.
T.Y.
Bozorgnia
Western United States
Oakland
Medina
D.M.
South Carolina
Yang
Michigan
≤T ≤T
G.
C.A.
lnS −lnθ
7.0E-05
Chesterfield
the United States
S.R.
Normal Distributions
Whittaker
P.G.
Oregon
Mill Valley
MC0085
Virginia
San Mateo
V.V.
Reston
Newport Beach
Stockton
Iran
Carson
Washington
H.K.
Cherry Street
Redwood City
Loads
W.J.
Dublin
Escondido
Γ
Fragilities
Sequential


## Exercise 2
For the following Sinopia resources, create a graph and an NLP pipeline to identify any people and locations in the full-text. The full-text is available in the `data/a90f911e-d03e-427f-a36e-ed1e3d1a9b3b.txt` file.

- BIBFRAME Work https://api.stage.sinopia.io/resource/d1e377d0-9ca6-4377-a9a1-a20303c5fe66
- BIBFRAME Instance https://api.stage.sinopia.io/resource/a90f911e-d03e-427f-a36e-ed1e3d1a9b3b