# Natural Language Processing (NLP) with spaCy
In this notebook, we will use the Natural Language Processing library for Python called [spaCy][SPACY]. 
The functionality provided by [spaCy][SPACY] allows us to quickly extract parts-of-speech (POS) from text descriptions and to identify entities using [spaCy's][SPACY] named entity recognition (NER). We will initially  use [spaCy][SPACY]'s rule-based matching functionality to create a subject matcher that we can then apply to both Sinopia's RDF metadata as well as the associated full-text for a select sample of these resources.


[SPACY]: https://spacy.io/

In [1]:
%reload_ext autoreload
%autoreload 2
%matplotlib inline
%reload_ext lab_black

import datetime
import json
import string

import pathlib
import pandas as pd

import kglab
import rdflib
import helpers
import requests
import spacy

## Creating a FAST Panda DataFrame
Using a csv file derived from OCLC's [FAST](https://www.oclc.org/research/areas/data-science/fast.html)(Faceted Application of Subject Terminology) topic list, read into a Panda's DataFrame.

In [2]:
fast_topics = pd.read_csv("data/topic_uri_label_utf8.csv", names=["URL", "name"])

To get a sense of what data is in the `fast_topics` DataFrame we can look at the *shape*, *info*, and look at a random sample

In [3]:
print(f"Shape of fast_topics {fast_topics.shape}")
fast_topics.info()

Shape of fast_topics (460110, 2)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 460110 entries, 0 to 460109
Data columns (total 2 columns):
 #   Column  Non-Null Count   Dtype 
---  ------  --------------   ----- 
 0   URL     460110 non-null  object
 1   name    460110 non-null  object
dtypes: object(2)
memory usage: 7.0+ MB


In [4]:
fast_topics.sample(10)

Unnamed: 0,URL,name
418315,http://id.worldcat.org/fast/1058658,Personal trainers--Vocational guidance
264449,http://id.worldcat.org/fast/1117830,Shrimps--Marketing
252943,http://id.worldcat.org/fast/1027823,Motor vehicles--Lubrication
323616,http://id.worldcat.org/fast/1190577,"Quartets (Percussion, violoncellos (2))"
333090,http://id.worldcat.org/fast/2002782,Victims of violent crimes--Monuments
82852,http://id.worldcat.org/fast/941570,German language--Idioms
22816,http://id.worldcat.org/fast/969325,Indians--Commerce
7860,http://id.worldcat.org/fast/809780,Annuities--Government policy
224511,http://id.worldcat.org/fast/1106661,Scheherazade (Legendary character) in literature
19482,http://id.worldcat.org/fast/991255,Land degradation--Environmental aspects


From our sample, we see that the `name` has punctuation like **(),--** that will need to be removed later in our workflow.

## spaCy Phrase Matcher
With the [FAST][FAST] dataframe as our source, we now will create a [spaCy][SPACY] [PhraseMatcher](https://spacy.io/usage/rule-based-matching#phrasematcher) made up of phrases extracted from the dataframe using the [FAST][FAST] urls as identifiers.

We start by importing [spaCy][SPACY] base English vocabulary and create an empty natural language processing (nlp) pipeline that we pass into a new matcher object that will lower-case all of the [FAST][FAST] patterns.

[FAST]: https://www.oclc.org/research/areas/data-science/fast.html
[SPACY]: https://spacy.io/

In [5]:
from spacy.lang.en import English
from spacy.matcher import PhraseMatcher

nlp = English()
fast_topic_matcher = PhraseMatcher(nlp.vocab, attr="LOWER")

In [6]:
"Rural population--Economic aspects--U.S. states".strip(string.punctuation).replace(
    "--", " "
).split()

['Rural', 'population', 'Economic', 'aspects', 'U.S.', 'states']

### Creatings Patterns
We will iterate through the dataframe and for each [FAST](https://www.oclc.org/research/areas/data-science/fast.html) series, using the **URL** as the identififer and for the name, remove punctuation, split each name into a list of words, and tokenize each term, and then add to the `fast_topic_matcher`. 

In [7]:
start = datetime.datetime.utcnow()
print(f"Started at {start} of FAST Matching")
# Iterate through the FAST Topic DataFrame
for row in fast_topics.iterrows():
    ident = row[1]["URL"]
    name_list = row[1]["name"].strip(string.punctuation).replace("--", " ").split()
    name_list = [nlp.make_doc(row) for row in name_list]
    fast_topic_matcher.add(ident, name_list)
end = datetime.datetime.utcnow()
print(
    f"Finished adding FAST Topic patterns at {end}, total time {(end-start).seconds / 60.}"
)

Started at 2022-07-10 23:07:21.435007 of FAST Matching
Finished adding FAST Topic patterns at 2022-07-10 23:08:20.368977, total time 0.9666666666666667


Applying the spaCy Phrase matcher to titles and summaries from the Sinopia 

## Loading Sinopia Stage Knowledge Graph
Just like in the previous Jupyter notebook, we will load the saved knowledge graph that we created at the beginning and then we will query the graph using SPARQL.

In [8]:
stage_kg = kglab.KnowledgeGraph()
stage_kg.load_jsonld("data/stage.json")

<kglab.kglab.KnowledgeGraph at 0x13155bd90>

### RDF Literals Pandas DataFrame
With the Sinopia Stage knowledge graph loaded, we will query our Graph for *titles*, *labels*, and any *summary* triples and apply our `fast_topic_matcher` to the results.

> **NOTE** The `stage_text_nodes` dataframe has been saved in the **data** directory 
> and can be loaded for use instead of running this SPARQL query.

In [9]:
stage_text_nodes = stage_kg.query_as_df(
    """PREFIX bf: <http://id.loc.gov/ontologies/bibframe/>

SELECT ?title ?label ?summary ?url

WHERE {
  OPTIONAL {
      ?title_bnode  bf:mainTitle ?title .
      ?url bf:title ?title_bnode .
  }
  OPTIONAL {
      ?label_bnode rdfs:label ?label .
      ?url bf:title ?label_bnode .
  }
  OPTIONAl {
      ?url bf:summary ?summary_bnode .
      ?summary_bnode rdfs:label ?summary  .
  }
}
"""
)

In [10]:
stage_text_nodes.to_json("data/stage-text-nodes.json")

In [11]:
stage_text_nodes.shape

(2108, 4)

In [12]:
stage_text_nodes.sample(10)

Unnamed: 0,title,url,label,summary
896,Wird die katholische Kirche Jan Hus rehabiliti...,<https://api.stage.sinopia.io/resource/084201e...,,
444,Same line twice,<https://api.stage.sinopia.io/resource/2f4f3f3...,,
749,"An Icelandic Primer with grammar, notes, and g...",<https://api.stage.sinopia.io/resource/47adca9...,,
508,Funeral song,<https://api.stage.sinopia.io/resource/5484503...,,
1121,Ṣūrat Miṣr fī kitābāt al-raḥḥālah al-Maghāriba...,<https://api.stage.sinopia.io/resource/66f6221...,,
348,Picietl-tobacco,<https://api.stage.sinopia.io/resource/1075de0...,,
477,Seitsemän koiraveljestä,<https://api.stage.sinopia.io/resource/e714170...,,
1392,En sjøens helt,<https://api.stage.sinopia.io/resource/0a4223a...,,
1466,Swearing in special 2021,<https://api.stage.sinopia.io/resource/6a4ca86...,,
958,"Little women, or, Meg, Jo, Beth, and Amy",<https://api.stage.sinopia.io/resource/54a7dec...,Little women,


In [13]:
stage_text_nodes.tail()

Unnamed: 0,title,url,label,summary
2103,Librerías,<https://api.stage.sinopia.io/resource/214490d...,,
2104,Violence in South Asia,<https://api.stage.sinopia.io/resource/b866f9f...,contemporary perspectives,
2105,Stainless steel rat for president,<https://api.stage.sinopia.io/resource/2181701...,,The Stainless Steel rat is back! Slippery Jim ...
2106,"Birnie, Boyle and Redgwell’s international law...",<https://api.stage.sinopia.io/resource/438076c...,,
2107,Książki z Mysiej,<https://api.stage.sinopia.io/resource/efa4430...,,


In [14]:
doc = nlp(stage_text_nodes.iloc[2105].summary)

In [15]:
len(fast_topic_matcher(doc))

323844

In [16]:
first_match = fast_topic_matcher(doc)[0]

In [17]:
first_match

(11474646060884680703, 0, 1)

In [18]:
nlp.vocab.strings[first_match[0]]

'http://id.worldcat.org/fast/1354598'

In [19]:
fast_topics[fast_topics["URL"] == "http://id.worldcat.org/fast/869075"].iloc[0]["name"]

'Come and Take It Cannon'

In [20]:
stage_text_nodes.iloc[1578].title

'Ground water development - issues and sustainable solutions'

In [21]:
matches_1578 = helpers.get_matches(
    stage_text_nodes.iloc[1578].title, nlp, fast_topic_matcher, fast_topics, 25
)

In [22]:
matches_1578

[('http://id.worldcat.org/fast/1431947', 'Long-tailed ground squirrel'),
 ('http://id.worldcat.org/fast/1087897',
  'Radioactive waste disposal in the ground--Government policy--U.S. states'),
 ('http://id.worldcat.org/fast/1087903',
  'Radioactive waste disposal in the ground--Planning'),
 ('http://id.worldcat.org/fast/948181', 'Ground wetas'),
 ('http://id.worldcat.org/fast/1113920',
  'Sewage disposal in the ground--Research'),
 ('http://id.worldcat.org/fast/948170', 'Ground squirrels--Behavior'),
 ('http://id.worldcat.org/fast/956353',
  'High speed ground transportation--Environmental aspects'),
 ('http://id.worldcat.org/fast/1087915',
  'Radioactive waste disposal in the ground--Technological innovations'),
 ('http://id.worldcat.org/fast/1097545', 'Richardson ground squirrel'),
 ('http://id.worldcat.org/fast/1429795', 'Arctic ground squirrel'),
 ('http://id.worldcat.org/fast/1113915',
  'Sewage disposal in the ground--Evaluation'),
 ('http://id.worldcat.org/fast/948138', 'Ground 

The [spaCy][SPACY] `fast_topic_matcher` is greedy, meaning that it matches on any terms contained in the document. Changing the pattern matching strategy, and instead of parsing out each term in the FAST label, just create a single phrase with the `--` removed, adjusts what is matched.

[SPACY]: https://spacy.io/

In [23]:
strict_fast_topic_matcher = PhraseMatcher(nlp.vocab, attr="LOWER")

In [24]:
start = datetime.datetime.utcnow()
print(f"Started at {start} for Strict FAST Matching")
# Iterate through the FAST Topic DataFrame
for row in fast_topics.iterrows():
    ident = row[1]["URL"]
    name_str = row[1]["name"].strip(string.punctuation).replace("--", " ")
    name_str = [
        nlp.make_doc(name_str),
    ]
    strict_fast_topic_matcher.add(ident, name_list)
end = datetime.datetime.utcnow()
print(
    f"Finished adding strict FAST Topic patterns at {end}, total time {(end-start).seconds / 60.}"
)

Started at 2022-07-10 23:08:27.874389 for Strict FAST Matching
Finished adding strict FAST Topic patterns at 2022-07-10 23:09:01.180846, total time 0.55


In [25]:
strict_matches_1578 = helpers.get_matches(
    stage_text_nodes.iloc[1578].title, nlp, strict_fast_topic_matcher, fast_topics, 25
)

In [26]:
strict_matches_1578

[]

## So the matcher may be too strict!

## spaCy Named Entity Recognition (NER)
One of the strengths of [spaCy][SPACY] is it's pre-existing models for named entity recognition (NER). In NER tasks, existing entities like people, locations, time periods are identified and tagged for later analysis. 

[SPACY]: https://spacy.io/

### Download model and Create NLP pipeline

To demonstrate, we will load a small English language model and perform NER on some of the *titles*, *labels*, and any *summary* triples contained in the `stage_text_nodes` dataframe.

In [27]:
! python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.3.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.3.0/en_core_web_sm-3.3.0-py3-none-any.whl (12.8 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m12.8/12.8 MB[0m [31m2.0 MB/s[0m eta [36m0:00:00[0m00:01[0m00:01[0mm
You should consider upgrading via the '/Users/jpnelson/70-79 Presentations/70.02 RDF-AI Workshop - 02022 LD4 Conference/ld4-env/bin/python -m pip install --upgrade pip' command.[0m[33m
[0m[38;5;2m✔ Download and installation successful[0m
You can now load the package via spacy.load('en_core_web_sm')


In [28]:
en_nlp = spacy.load("en_core_web_sm")

### Generate Documents and Display Entities
Using a different sample from the `stage_text_nodes` dataframe, we apply the `en_nlp` pipeline to some of these values to see if we can extract meaningful entities.

In [29]:
sample2 = stage_text_nodes.sample(10)

In [30]:
sample2

Unnamed: 0,title,url,label,summary
1561,"Mosaico do tempo, 70 anos de fotografia",<https://api.stage.sinopia.io/resource/edb2295...,,
1322,The ball,<https://api.stage.sinopia.io/resource/224547f...,,
1532,The politics of birth.,<https://api.stage.sinopia.io/resource/638fc75...,,Several manuscript and typescript drafts of an...
990,Statistical inference in admixed populations,<https://api.stage.sinopia.io/resource/79ef178...,,
1370,Mother of the Forest big tree exhibit of 1875-...,<https://api.stage.sinopia.io/resource/879212b...,,
756,Al fooles,<https://api.stage.sinopia.io/resource/0f16f9b...,,
215,國立臺灣大學理學院植物學系研究報告,<https://api.stage.sinopia.io/resource/809b600...,Zhi wu ke xue qi kan,
703,Neuland,<https://api.stage.sinopia.io/resource/e065846...,,
865,Analysis and evaluation in the production proc...,<https://api.stage.sinopia.io/resource/577728e...,,
666,Sherlock Holmes,<https://api.stage.sinopia.io/resource/2e27d57...,,


In [31]:
def show_entities(index_number, column):
    print(f"Entities for {index_number}")
    doc = en_nlp(stage_text_nodes.iloc[index_number][column])
    for ent in doc.ents:
        print(f"\t{ent}, {ent.label_}")
    return doc

In [32]:
doc_624_summary = show_entities(624, "summary")

Entities for 624
	Hans, PERSON
	Gretel, PERSON
	one, CARDINAL
	Hans, NORP
	Gretel, PERSON


In [33]:
doc_1222_title = show_entities(1222, "title")

Entities for 1222
	First, ORDINAL
	Americans, NORP


### spaCy NER and POS Visualizations
The [spaCy][SPACY] project provides the `displacy` class that has two helpful visualizers. The first **ner** displays the identified entities in context of the text and the second identifies the parts-of-speech and how they are related to each other in the **dep** option. 

[SPACY]: https://spacy.io/

In [34]:
from spacy import displacy

displacy.render(doc_624_summary, jupyter=True, style="ent")

In [35]:
spacy.explain("NORP")

'Nationalities or religious or political groups'

In [36]:
displacy.render(doc_1222_title, jupyter=True, style="ent")

## Exercise 1
Download and create a `nlp` pipeline using the *en_core_web_sm* model and see what entities are identified using your sample from the `stage_text_nodes` dataframe.

> **Extra credit** Find a non-english resource in `stage_text_nodes`, find a corresponding 
> [spaCy][SPACY] trained pipeline at https://spacy.io/models and follow the above steps to 
> identify any entities in the *title*, *label*, or *summary* values.

[SPACY]: https://spacy.io/

## Full-Text and Sinopia RDF Metadata
Exciting possibilities of applying machine learning to library workflows comes when we combine linked-data cataloging with the availability of the full-text. For example, we can run NER on the entirety of the full-text, and add any identified entities to our RDF. 

> Special thanks to Tim Thompson and Greta de Groat for providing examples of cataloged
> resources in Sinopia that have available full-text!


### Example One: *Alarmingly suspicious*
First we will download the Sinopia record for this Work, https://api.sinopia.io/resource/65a2b059-5ac1-48a6-adbb-870712c3060c, and extract the `data` property to create a RDF graph.

In [37]:
example1_result = requests.get(
    "https://api.sinopia.io/resource/65a2b059-5ac1-48a6-adbb-870712c3060c"
)
example1_graph = rdflib.Graph()
for ns, url in helpers.NAMESPACES.items():
    example1_graph.namespace_manager.bind(ns, url)
example1_graph.parse(
    data=json.dumps(example1_result.json().get("data")), format="json-ld"
)

<Graph identifier=Ne7a37ca72d4141e68db85167006d05b6 (<class 'rdflib.graph.Graph'>)>

Second, we open the full-text that was retrieved from the [Haithtrust](https://babel.hathitrust.org/cgi/pt?id=uiuo.ark:/13960/t14n5kz7q&view=1up&seq=3) and load all of the text and save to the `example1_text` variable. 

In [38]:
with open("data/65a2b059-5ac1-48a6-adbb-870712c3060c.txt") as fo:
    example1_text = fo.read()

print(f"Number of characters in the text is {len(example1_text):,}")

Number of characters in the text is 61,249


Third, we will run the full-text through our [spaCy][SPACY] en_nlp pipeline and then examine the identified entities.

[SPACY]: https://spacy.io/

In [39]:
example1_doc = en_nlp(example1_text)
print(f"Number of entities: {len(example1_doc.ents):,}")

Number of entities: 587


Of the 576 entities, we will extract the **PERSON** and **GPE** (location) entities for further analysis. 

In [40]:
people, locations = [], []

for ent in example1_doc.ents:
    if ent.label_.startswith("PERSON"):
        people.append(ent)
    if ent.label_.startswith("GPE"):
        locations.append(ent)
print(f"NER found {len(people):,} people and {len(locations):,} locations")

NER found 281 people and 34 locations


Looking at the `people` list, we notice a number of duplications (for now we don't care where the person entity was found in the text) so we filter this list.

In [41]:
people = set([" ".join(ent.text.split()) for ent in people])
locations = set([" ".join(ent.text.split()) for ent in locations])

In [42]:
len(people), len(locations)

(103, 20)

With these identified names, we can use the VIAF [SRU Search](https://platform.worldcat.org/api-explorer/apis/VIAF/AuthorityCluster/SRUSearch) on each of the `people` entities to see if we can narrow down our list using this authority source. In the helpers module, the `viaf_people_search` function returns a dictionary VIAF urls that match the term and the first main heading of the record.

We can then aggregate these results and see what remains. 

In [43]:
example1_people = {}
for i, term in enumerate(list(people)):
    # Removes newlines, tabs, and spaces from the entity text
    viaf_result = helpers.viaf_people_search(term)
    example1_people.update(viaf_result)
    if not i % 10 and i > 0:
        print(".", end="")
    if not i % 25:
        print(f"{i}", end="")

0..25...50..75...100

In [44]:
example1_people

{'http://viaf.org/viaf/4819162906466578110009/': ['Ball, John'],
 'http://viaf.org/viaf/9924151247972544270006/': ['Voigt, Emil 1881-'],
 'http://viaf.org/viaf/9874158792833139040004/': ['Johan Georg Frederik Ræder juriste danois'],
 'http://viaf.org/viaf/138389/': ['Sharp, Tony.'],
 'http://viaf.org/viaf/310152742908027731493/': ['Valletort, Lady 19.sc'],
 'http://viaf.org/viaf/272152742908127731486/': ['Valletort, Lord'],
 'http://viaf.org/viaf/26152742893627730540/': ['Popeley, William 1718+'],
 'http://viaf.org/viaf/9994156012397449700001/': ['Schneider, Raimund'],
 'http://viaf.org/viaf/9985152684042123430003/': ['Schneider, Konrad'],
 'http://viaf.org/viaf/9983160668220203560000/': ['Schneider, Wolfgang 1941-'],
 'http://viaf.org/viaf/9995160909265912870007/': ['Look, Emily'],
 'http://viaf.org/viaf/9978165628873142480009/': ['Hyatt, Emily'],
 'http://viaf.org/viaf/9972162669668555500005/': ['La Trobe-Bateman, Emily'],
 'http://viaf.org/viaf/9999160668126703560006/': ['Engel, Wol

### Example 2 - *Seismic performance assessment of buildings*
The second example is the report, *Seismic performance assessment of buildingss*, described by the following Sinopia Resources:

- BIBFRAME Work https://api.stage.sinopia.io/resource/2ffc86d1-4850-4e4b-974c-49ded8ce4b3f
- BIBFRAME Instance https://api.stage.sinopia.io/resource/e865acc1-9b24-401e-a0eb-36ca1ca8b1d6


First we will retrieve these resources and add them to a RDF graph.

In [45]:
example2_graph = rdflib.Graph()
for ns, url in helpers.NAMESPACES.items():
    example2_graph.namespace_manager.bind(ns, url)

for sinopia_url in [
    "https://api.stage.sinopia.io/resource/2ffc86d1-4850-4e4b-974c-49ded8ce4b3f",
    "https://api.stage.sinopia.io/resource/e865acc1-9b24-401e-a0eb-36ca1ca8b1d6",
]:
    sinopia_result = requests.get(sinopia_url)
    example2_graph.parse(
        data=json.dumps(sinopia_result.json()["data"]), format="json-ld"
    )
print(f"Total number of triples for example 2 {len(example2_graph):,}")

Total number of triples for example 2 152


Second, we will open and read the full-text for *Improving water supply networks*

In [46]:
with open("data/e865acc1-9b24-401e-a0eb-36ca1ca8b1d6.txt") as fo:
    example2_text = fo.read()

print(f"Number of characters in the text is {len(example2_text):,}")

Number of characters in the text is 671,016


Third, we will run the full-text through our [spaCy][SPACY] en_nlp pipeline and then examine the identified entities.

[SPACY]: https://spacy.io

In [47]:
example2_doc = en_nlp(example2_text)
print(f"Number of entities: {len(example2_doc.ents):,}")

Number of entities: 10,220


In [48]:
ent_types = {}
for ent in example2_doc.ents:
    if ent.label_ in ent_types:
        ent_types[ent.label_] += 1
    else:
        ent_types[ent.label_] = 1

In [49]:
ent_types

{'ORG': 1820,
 'WORK_OF_ART': 324,
 'PRODUCT': 672,
 'CARDINAL': 4170,
 'GPE': 686,
 'PERSON': 685,
 'FAC': 42,
 'DATE': 533,
 'MONEY': 136,
 'ORDINAL': 138,
 'NORP': 34,
 'QUANTITY': 36,
 'PERCENT': 794,
 'LAW': 61,
 'LOC': 22,
 'TIME': 62,
 'EVENT': 5}

In [50]:
example2_people, example2_locations = [], []

for ent in example2_doc.ents:
    if ent.label_.startswith("PERSON"):
        example2_people.append(ent)
    if ent.label_.startswith("GPE"):
        example2_locations.append(ent)
print(
    f"NER found {len(example2_people):,} people and {len(example2_locations):,} locations"
)

NER found 685 people and 686 locations


In [51]:
example2_people[90].start, example2_people[90].end

(639, 641)

In [52]:
example2_people[90]

Christopher Higgins

We now de-duplicate `example2_people` and `example2_locations`example2_people

In [53]:
example2_people = set([" ".join(ent.text.split()) for ent in example2_people])
example2_locations = set([" ".join(ent.text.split()) for ent in example2_locations])

In [54]:
len(example2_people)

359

In [55]:
type(list(example2_people)[52])

str

In [56]:
for person in list(example2_people)[50:100]:
    print(person)

Vamvatsikos
P. Benson Shing
Christopher Higgins
max min characterization
B10
ASTM
Appendix K
E2022.112 Filing
.................................................................... G-2
Kramer
Richard Behr
Appendix J.
Deborah Beck
Duy Vu
Stojadinovic
Peter Morris
r.
Hussein
Earthquake Spectra
Finishes
Juan Murcia-Delso
Robert P. Kennedy
Xibei Xincun
D. Jared DeBock
Seismic Hazard and Risk Analysis
Baker
Ronald Hamburger
Ross
m.

D402 Wet
Christopher Smith
Keith
Daniel Saldana
Drirfitf
Chi-Chi
Bruce R. Ellingwood
C. Allin Cornell
Nirmal Kumawat Udit S. Tambe
Angie Harris
Toro
William O’Brien
  h  ln(H
Bob Bachman
J.4
John Hooper
Jack Moehle
Part II
Scott Hagie Angie Harris
Masonry Walls


In [57]:
len(example2_locations)

198

In [58]:
for location in list(example2_locations)[0:50]:
    print(location)

Denver
U.S
SF
Deierlein
A.S.
Medina
New York 14223
Bertero
S.L.
Christopoulos
Moehle
San Mateo
FRAGILITY
B.S.J.
La Jolla
South Carolina
Tehran
Vancouver
Reston
N-S
Escondido
W.J.
New Jersey
Illinois
Stanford
Nevada
Vairo
T.Y.
Michigan
Somerville
Whittaker
Corvallis
America
Longmont
−
Hills
Ohio
Hebron
B.7
Tobin
Missouri
∆e
Coef
Cascadia
Carson
Krawinkler
Loads
Youngs
Housner
Benjamin


## Exercise 2
For the following Sinopia Resources, create a graph, and a NLP pipeline to identify any people and locations in the full-text. The full-text is available in `data/a90f911e-d03e-427f-a36e-ed1e3d1a9b3b.txt` file.

- BIBFRAME Work https://api.stage.sinopia.io/resource/d1e377d0-9ca6-4377-a9a1-a20303c5fe66
- BIBFRAME Instance https://api.stage.sinopia.io/resource/a90f911e-d03e-427f-a36e-ed1e3d1a9b3b

# Extra Credit

## Custom NER Ruler
Another approach we can take is to build a custom [Named Entity Ruler](https://spacy.io/usage/rule-based-matching#entityruler) for the FAST Topics.  

In [59]:
nlp = spacy.load("en_core_web_sm")
nlp.disable_pipe("ner")
fast_ruler = nlp.add_pipe("entity_ruler")

In [60]:
start = datetime.datetime.utcnow()
print(f"Started at {start} for FAST Entities Matching")
all_patterns = []
# Iterate through the FAST Topic DataFrame
for row in fast_topics.iterrows():
    ident = row[1]["URL"]
    name_list = row[1]["name"].strip(string.punctuation).split("--")
    fast_patterns = []
    for name in name_list:
        pattern = []
        terms = name.split(" ")
        for term in terms:
            term = term.strip()
            if len(term) < 1:
                continue
            pattern.append({"LOWER": term.lower()})
        fast_patterns.append(pattern)
    all_patterns.append({"label": row[1]["name"],
                         "pattern": fast_patterns})
    current_count = len(all_patterns)
    if not current_count%100 and current_count > 0:
        print(".", end="")
    if not current_count%1000:
        print(f"{current_count:,}", end="")
fast_ruler.add_patterns(all_patterns)
end = datetime.datetime.utcnow()
print(
    f"Finished adding FAST Patterns at {end}, total time {(end-start).seconds / 60.}"
)

Started at 2022-07-10 23:16:47.009660 for FAST Entities Matching
..........1,000..........2,000..........3,000..........4,000..........5,000..........6,000..........7,000..........8,000..........9,000..........10,000..........11,000..........12,000..........13,000..........14,000..........15,000..........16,000..........17,000..........18,000..........19,000..........20,000..........21,000..........22,000..........23,000..........24,000..........25,000..........26,000..........27,000..........28,000..........29,000..........30,000..........31,000..........32,000..........33,000..........34,000..........35,000..........36,000..........37,000..........38,000..........39,000..........40,000..........41,000..........42,000..........43,000..........44,000..........45,000..........46,000..........47,000..........48,000..........49,000..........50,000..........51,000..........52,000..........53,000..........54,000..........55,000..........56,000..........57,000..........58,000..........59,000

ValueError: [E154] One of the attributes or values is not supported for token patterns. Please use the option `validate=True` with the Matcher, PhraseMatcher, or EntityRuler for more details.