# Natural Language Processing (NLP) with spaCy
In this notebook, we will use [spaCy][SPACY], a natural language processing library for Python.
[spaCy][SPACY] lets us quickly extract parts-of-speech (POS) from text descriptions and identify entities using its named entity recognition (NER). We will first use [spaCy][SPACY]'s rule-based matching to create a subject matcher that we can then apply to both Sinopia's RDF metadata and the associated full-text for a select sample of these resources.


[SPACY]: https://spacy.io/

In [79]:
%reload_ext autoreload
%autoreload 2
%matplotlib inline
%reload_ext lab_black

import datetime
import json
import string

import pathlib
import pandas as pd

import kglab
import rdflib
import requests
import spacy
import helpers
import widgets

## Creating a FAST pandas DataFrame
Using a CSV file derived from OCLC's [FAST](https://www.oclc.org/research/areas/data-science/fast.html) (Faceted Application of Subject Terminology) topic list, we read the data into a pandas DataFrame.

In [3]:
fast_topics = pd.read_csv("data/topic_uri_label_utf8.csv", names=["URL", "name"])
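Passing `names=` tells pandas to use the supplied column labels and to treat the first line of the file as data, which is what we want if the CSV has no header row (an assumption here, since the file itself is not shown). A minimal sketch of that behavior, using an in-memory file with two rows copied from the sample output further down rather than the real CSV:

```python
import io

import pandas as pd

# Two header-less rows in the same URL,name layout as the FAST topic file
csv_text = (
    "http://id.worldcat.org/fast/1142396,Tamashek language--Grammar\n"
    "http://id.worldcat.org/fast/1741095,Uga (Dog)\n"
)

# names= supplies the column labels, so both lines are read as data rows
toy_topics = pd.read_csv(io.StringIO(csv_text), names=["URL", "name"])
print(toy_topics.shape)
```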

To get a sense of what data is in the `fast_topics` DataFrame, we can look at its *shape* and *info*, and then at a random sample.

In [4]:
print(f"Shape of fast_topics {fast_topics.shape}")
fast_topics.info()

Shape of fast_topics (460110, 2)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 460110 entries, 0 to 460109
Data columns (total 2 columns):
 #   Column  Non-Null Count   Dtype 
---  ------  --------------   ----- 
 0   URL     460110 non-null  object
 1   name    460110 non-null  object
dtypes: object(2)
memory usage: 7.0+ MB


In [5]:
fast_topics.sample(10)

Unnamed: 0,URL,name
329938,http://id.worldcat.org/fast/1142396,Tamashek language--Grammar
136448,http://id.worldcat.org/fast/824830,BAM P14 (Electronic computer system)
206210,http://id.worldcat.org/fast/1004058,Lutheran Church--Sources
71361,http://id.worldcat.org/fast/970390,Indo-Aryan languages--Social aspects
393693,http://id.worldcat.org/fast/1145126,Technology--History--Sources
318328,http://id.worldcat.org/fast/1195938,"Trio sonatas (Trombones (2), continuo)"
131078,http://id.worldcat.org/fast/995299,"Ledoux, Maggie (Fictitious character)"
292551,http://id.worldcat.org/fast/1017982,Metallurgical plants--Health aspects
97758,http://id.worldcat.org/fast/858677,Choruses (Women's voices)
459298,http://id.worldcat.org/fast/1741095,Uga (Dog)


From our sample, we see that the `name` column contains punctuation such as parentheses, commas, and double hyphens (`()`, `,`, `--`) that will need to be handled later in our workflow.

## spaCy Phrase Matcher
With the [FAST][FAST] dataframe as our source, we will now create a [spaCy][SPACY] [PhraseMatcher](https://spacy.io/usage/rule-based-matching#phrasematcher) made up of phrases extracted from the dataframe, using the [FAST][FAST] URLs as identifiers.

We start by importing [spaCy][SPACY]'s base English language class and creating an empty natural language processing (nlp) pipeline, whose vocabulary we pass into a new matcher object. Setting `attr="LOWER"` makes the matcher compare lower-cased token text, so matching against the [FAST][FAST] patterns is case-insensitive.

[FAST]: https://www.oclc.org/research/areas/data-science/fast.html
[SPACY]: https://spacy.io/

In [16]:
from spacy.lang.en import English
from spacy.matcher import PhraseMatcher

nlp = English()
fast_topic_matcher = PhraseMatcher(nlp.vocab, attr="LOWER")

In [10]:
"African American teenagers--Education".strip(string.punctuation).replace(
    "--", " "
).split()

['African', 'American', 'teenagers', 'Education']
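Note that `str.strip` only removes punctuation from the two ends of the string, so interior characters such as the opening parenthesis in `Uga (Dog)` survive the cell above. If every punctuation character should go, one possible alternative is `str.translate`. This is a sketch, not what the rest of this notebook uses, and the function name `clean_fast_name` is made up for illustration:

```python
import string

# Translation table that deletes every ASCII punctuation character
_DROP_PUNCTUATION = str.maketrans("", "", string.punctuation)


def clean_fast_name(name):
    """Split a FAST heading into words with all punctuation removed."""
    # Replace the "--" subdivision separator first so the parts stay separate words
    spaced = name.replace("--", " ")
    return spaced.translate(_DROP_PUNCTUATION).split()


print(clean_fast_name("Uga (Dog)"))
print(clean_fast_name("African American teenagers--Education"))
```

The trade-off is that hyphens and apostrophes inside words are deleted as well, so `Coarse-haired` becomes `Coarsehaired` and `Women's` becomes `Womens`.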

### Creating Patterns
We will iterate through the dataframe and, for each [FAST](https://www.oclc.org/research/areas/data-science/fast.html) row, use the **URL** as the identifier. For the name, we strip leading and trailing punctuation, replace the `--` separator with a space, split the name into a list of words, tokenize each word, and then add the resulting patterns to the `fast_topic_matcher`.

In [17]:
start = datetime.datetime.utcnow()
print(f"Started at {start} of FAST Matching")
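# Iterate through the FAST Topic DataFrame
for _, row in fast_topics.iterrows():
    ident = row["URL"]
    # Strip leading/trailing punctuation, replace the "--" separator, split into words
    words = row["name"].strip(string.punctuation).replace("--", " ").split()
    # Each word becomes its own single-token pattern under the heading's URL
    patterns = [nlp.make_doc(word) for word in words]
    fast_topic_matcher.add(ident, patterns)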
end = datetime.datetime.utcnow()
print(
    f"Finished adding FAST Topic patterns at {end}, total time {(end-start).seconds / 60.}"
)

Started at 2021-07-18 16:51:19.635053 of FAST Matching
Finished adding FAST Topic patterns at 2021-07-18 16:52:28.180169, total time 1.1333333333333333
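
The elapsed time printed above comes from `timedelta.seconds`, which holds only the whole-seconds component and ignores microseconds (and any whole days). `timedelta.total_seconds()` keeps the full duration. A small check using the two timestamps from the output above:

```python
import datetime

# Start and end timestamps copied from the cell output
start = datetime.datetime(2021, 7, 18, 16, 51, 19, 635053)
end = datetime.datetime(2021, 7, 18, 16, 52, 28, 180169)
elapsed = end - start

# .seconds truncates to whole seconds; total_seconds() keeps the fraction
truncated_minutes = elapsed.seconds / 60
exact_minutes = elapsed.total_seconds() / 60
print(f"{truncated_minutes:.4f} vs {exact_minutes:.4f} minutes")
```

For a run of about a minute the difference is small, but `total_seconds()` is the safer choice if a cell might run for more than a day.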


Next, we apply the spaCy phrase matcher to titles and summaries from the Sinopia Stage knowledge graph.

## Loading Sinopia Stage Knowledge Graph
Just like in the previous Jupyter notebook, we will load the saved knowledge graph that we created at the beginning and then we will query the graph using SPARQL.

In [2]:
stage_kg = kglab.KnowledgeGraph()
stage_kg.load_jsonld("data/stage.json")

<kglab.kglab.KnowledgeGraph at 0x7ff82b9bec40>

### RDF Literals Pandas DataFrame
With the Sinopia Stage knowledge graph loaded, we will query our Graph for *titles*, *labels*, and any *summary* triples and apply our `fast_topic_matcher` to the results.

> **NOTE** The `stage_text_nodes` dataframe has been saved in the **data** directory 
> and can be loaded for use instead of running this SPARQL query.

In [19]:
stage_text_nodes = stage_kg.query_as_df(
    """PREFIX bf: <http://id.loc.gov/ontologies/bibframe/>

SELECT ?title ?label ?summary ?url

WHERE {
  OPTIONAL {
      ?title_bnode  bf:mainTitle ?title .
      ?url bf:title ?title_bnode .
  }
  OPTIONAL {
      ?label_bnode rdfs:label ?label .
      ?url bf:title ?label_bnode .
  }
  OPTIONAL {
      ?url bf:summary ?summary_bnode .
      ?summary_bnode rdfs:label ?summary  .
  }
}
"""
)

In [74]:
stage_text_nodes.to_json("data/stage-text-nodes.json")

In [20]:
stage_text_nodes.shape

(11505, 4)

In [21]:
stage_text_nodes.sample(10)

Unnamed: 0,title,url,label,summary
11484,Pocket full of rye,<https://api.stage.sinopia.io/resource/a43a25a...,,"""It started with a spot of tea. It ended in tr..."
1646,Radical cataloging,<https://api.stage.sinopia.io/resource/a677de9...,,
7539,Thrawn,<https://api.stage.sinopia.io/resource/37f1f83...,,
8601,Annonij monachi Benedictini diserti & veridici...,<https://api.stage.sinopia.io/resource/c14c1a1...,,
3094,Beyond Einstein,<https://api.stage.sinopia.io/resource/8c99154...,,
5901,Caṭṭagrāme Isalāma o aitihya,<https://api.stage.sinopia.io/resource/8bba4df...,,
10968,If I never met you,<https://api.stage.sinopia.io/resource/810ae29...,,
1569,"Hans Brinker, or, The silver skates",<https://api.stage.sinopia.io/resource/2443ec8...,Silver skates,
3124,Art and the Second World War,<https://api.stage.sinopia.io/resource/c9e5bfa...,,"""Art and the Second World War is the first boo..."
2033,Vide kige,<https://api.stage.sinopia.io/resource/be5cc0e...,,


In [51]:
stage_text_nodes.tail()

Unnamed: 0,title,url,label,summary
11500,Advanced examination techniques in orthopaedics,<https://api.stage.sinopia.io/resource/aed065d...,,
11501,Silver skates,<https://api.stage.sinopia.io/resource/fbb4be6...,,When a new friend gives Hans and his sister Gr...
11502,"Go down, Moses",<https://api.stage.sinopia.io/resource/87b9231...,,
11503,American pastoral,<https://api.stage.sinopia.io/resource/e7f97c7...,,
11504,भारतीय लोकरंगमञ्च,<https://api.stage.sinopia.io/resource/1c3762c...,,


In [44]:
doc = nlp(stage_text_nodes.iloc[11484].summary)

In [45]:
len(fast_topic_matcher(doc))

152588

In [46]:
first_match = fast_topic_matcher(doc)[0]

In [47]:
first_match

(681304568864832690, 1, 2)

In [48]:
nlp.vocab.strings[first_match[0]]

'http://id.worldcat.org/fast/869075'

In [49]:
fast_topics[fast_topics["URL"] == "http://id.worldcat.org/fast/869075"].iloc[0]["name"]

'Come and Take It Cannon'
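
The cells above resolve one match by hand: the first element of a match tuple is a hash that `nlp.vocab.strings` turns back into the FAST URL, which is then looked up in `fast_topics`. The `helpers.get_matches` function used next is not shown in this notebook, so the sketch below is only a guess at the general idea. It uses a toy dictionary in place of the DataFrame, and the `resolve_matches` function and `example.org` URLs are hypothetical:

```python
from spacy.lang.en import English
from spacy.matcher import PhraseMatcher

nlp = English()
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")

# Toy stand-in for fast_topics: URL -> heading (illustrative values only)
toy_topics = {
    "http://example.org/fast/1": "Pocket watches",
    "http://example.org/fast/2": "Rye",
}
for url, name in toy_topics.items():
    # Same word-by-word pattern strategy as the notebook
    matcher.add(url, [nlp.make_doc(word) for word in name.split()])


def resolve_matches(text, limit=25):
    """Return up to `limit` unique (URL, heading) pairs matched in `text`."""
    doc = nlp.make_doc(text)
    found = {}
    for match_id, _start, _end in matcher(doc):
        url = nlp.vocab.strings[match_id]  # hash -> original key string
        found.setdefault(url, toy_topics[url])
        if len(found) >= limit:
            break
    return list(found.items())


results = resolve_matches("Pocket full of rye")
print(results)
```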

In [34]:
matches_11484 = helpers.get_matches(
    stage_text_nodes.iloc[11484].title, nlp, fast_topic_matcher, fast_topics, 25
)

In [36]:
matches_11484

[('http://id.worldcat.org/fast/1067601', 'Pocket gophers--Control'),
 ('http://id.worldcat.org/fast/1067613', 'Pocket watches--Prices'),
 ('http://id.worldcat.org/fast/1050035', 'Pacific pocket mouse'),
 ('http://id.worldcat.org/fast/1738717', 'Great Basin pocket mouse'),
 ('http://id.worldcat.org/fast/1067611',
  'Pocket watches--Collectors and collecting'),
 ('http://id.worldcat.org/fast/1067614', 'Pocket watches--Taxation'),
 ('http://id.worldcat.org/fast/1743182', 'Coarse-haired pocket mice'),
 ('http://id.worldcat.org/fast/1067604', 'Pocket gophers--Variation'),
 ('http://id.worldcat.org/fast/1199307', 'Spiny pocket mice'),
 ('http://id.worldcat.org/fast/1894990', 'Pocket Power (Race horse)'),
 ('http://id.worldcat.org/fast/932539', 'Forest spiny pocket mice'),
 ('http://id.worldcat.org/fast/1067612', 'Pocket watches--History'),
 ('http://id.worldcat.org/fast/1067593',
  'Pocket calculators--Study and teaching (Elementary)'),
 ('http://id.worldcat.org/fast/1067610', 'Pocket watche

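The [spaCy][SPACY] `fast_topic_matcher` is greedy: because every word of a FAST heading was added as its own pattern, a heading matches whenever any one of its words appears in the document. In this example, the single word *Pocket* in the title is enough to match headings about pocket gophers, pocket watches, and pocket mice.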

[SPACY]: https://spacy.io/
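
To make the effect of that pattern strategy concrete, the sketch below compares two matchers on the same text: one built word by word, as in this notebook, and one where the whole cleaned heading is a single multi-token pattern. The key string and the phrase-level variant are illustrative and are not part of the notebook's workflow:

```python
from spacy.lang.en import English
from spacy.matcher import PhraseMatcher

nlp = English()
text = nlp.make_doc("A pocket full of rye")
heading = "Pocket gophers Control"
key = "http://example.org/fast/pocket-gophers"  # illustrative key

# Word-level: each word of the heading is its own single-token pattern
word_matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
word_matcher.add(key, [nlp.make_doc(word) for word in heading.split()])

# Phrase-level: the whole heading is one three-token pattern
phrase_matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
phrase_matcher.add(key, [nlp.make_doc(heading)])

word_hits = word_matcher(text)
phrase_hits = phrase_matcher(text)
print(len(word_hits), len(phrase_hits))
```

The word-level matcher fires on *pocket* alone, while the phrase-level matcher requires all three tokens in sequence and finds nothing. Which behavior is preferable depends on whether recall or precision matters more for the subject suggestions.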

In [53]:
matches_11500 = helpers.get_matches(
    stage_text_nodes.iloc[11500].title, nlp, fast_topic_matcher, fast_topics, 25
)

In [55]:
matches_11500

[('http://id.worldcat.org/fast/797422',
  'Advanced placement programs (Education)--Planning'),
 ('http://id.worldcat.org/fast/797425',
  'Advanced very high resolution radiometers'),
 ('http://id.worldcat.org/fast/797420',
  'Advanced placement programs (Education)'),
 ('http://id.worldcat.org/fast/2002149', 'AAT Advanced Diploma in Accounting'),
 ('http://id.worldcat.org/fast/1736474',
  'Certificate in Advanced English (Educational test)'),
 ('http://id.worldcat.org/fast/797418', 'Advanced guard (Military science)'),
 ('http://id.worldcat.org/fast/1920889', 'Advanced Diploma in Insurance'),
 ('http://id.worldcat.org/fast/797416', 'Advanced Photo System'),
 ('http://id.worldcat.org/fast/797417', 'Advanced SEAL Delivery System'),
 ('http://id.worldcat.org/fast/797419',
  'Advanced guard (Military science)--History'),
 ('http://id.worldcat.org/fast/1033160',
  'National Advanced Driving Simulator'),
 ('http://id.worldcat.org/fast/797423',
  'Advanced placement programs (Education)--Sta

## spaCy Named Entity Recognition (NER)
One of the strengths of [spaCy][SPACY] is its pre-trained models for named entity recognition (NER). In NER tasks, entities such as people, locations, and time periods are identified and tagged for later analysis.

To demonstrate, we will load a small English language model and perform NER on some of the *titles*, *labels*, and any *summary* triples contained in the `stage_text_nodes` dataframe.

### Download model and Create NLP pipeline
We first need to download the `en_core_web_sm` model for our environment and then create an instance of an NLP pipeline.

[SPACY]: https://spacy.io/

In [56]:
! python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.0.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.0.0/en_core_web_sm-3.0.0-py3-none-any.whl (13.7 MB)
     |████████████████████████████████| 13.7 MB 4.1 MB/s eta 0:00:01
Installing collected packages: en-core-web-sm
Successfully installed en-core-web-sm-3.0.0
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [58]:
en_nlp = spacy.load("en_core_web_sm")

### Generate Documents and Display Entities
Using a different sample from the `stage_text_nodes` dataframe, we apply the `en_nlp` pipeline to some of these values to see if we can extract meaningful entities.

In [62]:
sample2 = stage_text_nodes.sample(10)

In [63]:
sample2

Unnamed: 0,title,url,label,summary
8921,MacBeth,<https://api.stage.sinopia.io/resource/92821e1...,,
4520,Casa de los espíritus,<https://api.stage.sinopia.io/resource/1b18099...,,
5518,Nashim deti︠a︡m,<https://api.stage.sinopia.io/resource/38ec10d...,Nashim deti︠a︡m,
3612,Nashimʺ di︠e︡ti︠a︡m,<https://api.stage.sinopia.io/resource/feece69...,Nashimʺ di︠e︡ti︠a︡m,
10315,Rest in Power : The Enduring Life of Trayvon M...,<https://api.stage.sinopia.io/resource/efd9b48...,,
8962,Journal of the Japanese Society of Lymphoretic...,<https://api.stage.sinopia.io/resource/4e93b69...,,
9023,The duality of vision,<https://api.stage.sinopia.io/resource/50f9226...,,
3087,Dragon Fire,<https://api.stage.sinopia.io/resource/76245d7...,,"William S. Cohen, former Secretary of Defense,..."
119,Ways of Seeing,<https://api.stage.sinopia.io/resource/7b114a1...,,
10730,Dickens quarterly,<https://api.stage.sinopia.io/resource/2ca15a0...,,


In [68]:
doc_10315 = en_nlp(stage_text_nodes.iloc[10315].title)
doc_3087 = en_nlp(stage_text_nodes.iloc[3087].summary)

In [70]:
print("Entities for 10315")
for ent in doc_10315.ents:
    print(ent, ent.label_)
print("\n\nEntities for 3087")
for ent in doc_3087.ents:
    print(ent, ent.label_)

Entities for 10315
The Enduring Life of Trayvon Martin WORK_OF_ART


Entities for 3087
William S. Cohen PERSON
Defense ORG
US GPE
Dragon Fire PERSON
Defense ORG
Vietnam GPE
Michael Patrick Santini PERSON
the United States GPE
America GPE
hours TIME
Santini PERSON
Democratic NORP
Bill Clinton PERSON
Republican NORP
William S. Cohen PERSON
1997 DATE
20th ORDINAL
Defense ORG
first ORDINAL
U.S. GPE
Cohen PERSON
first ORDINAL
Defense ORG


### spaCy NER and POS Visualizations
The [spaCy][SPACY] project provides the `displacy` module, which has two helpful visualizers. The first, selected with `style="ent"`, displays the identified entities in the context of the text. The second, `style="dep"` (the default), shows each token's part-of-speech and the dependency relations between tokens.

[SPACY]: https://spacy.io/

In [72]:
from spacy import displacy

displacy.render(doc_3087, jupyter=True, style="ent")

In [73]:
displacy.render(doc_10315, jupyter=True)
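
Outside a notebook, `displacy.render` can return the markup as a string instead of displaying it when called with `jupyter=False`. The sketch below builds a tiny document by hand, using a blank English pipeline and one manually assigned entity so that no model download is needed. The sentence is made up, and the entity is set by hand rather than predicted:

```python
from spacy import displacy
from spacy.lang.en import English

nlp = English()
doc = nlp.make_doc("Bill Clinton appointed a new Secretary of Defense.")

# Manually mark characters 0-12 ("Bill Clinton") as a PERSON entity
person = doc.char_span(0, 12, label="PERSON")
doc.ents = [person]

# jupyter=False returns the HTML markup as a string
html = displacy.render(doc, style="ent", jupyter=False)
print(html[:80])
```

The returned string can be written to an `.html` file or embedded in a web page.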

## Exercise 1
Download the *en_core_web_sm* model, create an `nlp` pipeline from it, and see what entities are identified in your own sample from the `stage_text_nodes` dataframe.

> **Extra credit** Find a non-English resource in `stage_text_nodes`, find a corresponding 
> [spaCy][SPACY] trained pipeline at https://spacy.io/models and follow the above steps to 
> identify any entities in the *title*, *label*, or *summary* values.

[SPACY]: https://spacy.io/

## Full-Text and Sinopia RDF Metadata
Exciting possibilities for applying machine learning to library workflows arise when we combine linked-data cataloging with the availability of full-text. For example, we can run NER on the entire full-text and add any identified entities to our RDF.

> Special thanks to Tim Thompson and Greta de Groat for providing examples of cataloged
> resources in Sinopia that have available full-text!


### Example One: *Alarmingly suspicious*
First, we will download the Sinopia record for this Work, https://api.sinopia.io/resource/65a2b059-5ac1-48a6-adbb-870712c3060c, and extract the `data` property to create an RDF graph.

In [81]:
example1_result = requests.get(
    "https://api.sinopia.io/resource/65a2b059-5ac1-48a6-adbb-870712c3060c"
)
example1_graph = rdflib.Graph()
for ns, url in helpers.NAMESPACES.items():
    example1_graph.namespace_manager.bind(ns, url)
example1_graph.parse(
    data=json.dumps(example1_result.json().get("data")), format="json-ld"
)

<Graph identifier=Ne45942fd52754556bbac789e8013ac75 (<class 'rdflib.graph.Graph'>)>

Second, we open the full-text that was retrieved from [HathiTrust](https://babel.hathitrust.org/cgi/pt?id=uiuo.ark:/13960/t14n5kz7q&view=1up&seq=3), read all of the text, and save it to the `example1_text` variable.

In [84]:
with open("data/65a2b059-5ac1-48a6-adbb-870712c3060c.txt") as fo:
    example1_text = fo.read()

print(f"Number of characters in the text is {len(example1_text):,}")

Number of characters in the text is 61,249


Third, we will run the full-text through our [spaCy][SPACY] `en_nlp` pipeline and then examine the identified entities.

[SPACY]: https://spacy.io/

In [87]:
example1_doc = en_nlp(example1_text)
print(f"Number of entities: {len(example1_doc.ents):,}")

Number of entities: 576


Of the 576 entities, we will extract the **PERSON** and **GPE** (location) entities for further analysis. 

In [164]:
people, locations = [], []

for ent in example1_doc.ents:
    if ent.label_.startswith("PERSON"):
        people.append(ent)
    if ent.label_.startswith("GPE"):
        locations.append(ent)
print(f"NER found {len(people):,} people and {len(locations):,} locations")

NER found 327 people and 21 locations


Looking at the `people` list, we notice a number of duplicates (for now we don't care where in the text each person entity was found), so we normalize the whitespace in each name and reduce both lists to sets of unique strings.

In [165]:
people = set([" ".join(ent.text.split()) for ent in people])
locations = set([" ".join(ent.text.split()) for ent in locations])

In [166]:
len(people), len(locations)

(114, 12)
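
The `" ".join(text.split())` idiom used above collapses any run of spaces, tabs, or newlines into a single space, so names that were broken across lines in the full-text count as the same string. A small sketch with made-up variants of two names:

```python
# Illustrative raw entity strings, including a line break and a double space
raw_names = [
    "Nicholas Sharp",
    "Nicholas\nSharp",
    "Nicholas  Sharp",
    "Edmund Calamy",
]

# split() with no arguments splits on any whitespace; join() rebuilds with single spaces
unique_names = {" ".join(name.split()) for name in raw_names}
print(sorted(unique_names))
```

This only merges whitespace variants. Forms such as `Ken` and `Ken Hirschman` remain separate entries and would need additional reconciliation.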

With these identified names, we can run the VIAF [SRU Search](https://platform.worldcat.org/api-explorer/apis/VIAF/AuthorityCluster/SRUSearch) on each of the `people` entities to see if we can narrow down our list using this authority source. In the helpers module, the `viaf_people_search` function returns a dictionary whose keys are the VIAF URLs that match the term and whose values hold the first main heading of each record.

We can then aggregate these results and see what remains. 

In [167]:
example1_people = {}
for i, term in enumerate(list(people)):
    # Search VIAF for each de-duplicated person name and merge the results
    viaf_result = helpers.viaf_people_search(term)
    example1_people.update(viaf_result)
    if not i % 10 and i > 0:
        print(".", end="")
    if not i % 25:
        print(f"{i}", end="")

0..25...50..75...100.

In [169]:
example1_people

{'http://viaf.org/viaf/18046007/': ['Calamy, Edmund, 1671-1732'],
 'http://viaf.org/viaf/6194158792853239040000/': ['Butler, Harry, 1889-1924'],
 'http://viaf.org/viaf/9937154441768335460002/': ['Stankovski, Vlado'],
 'http://viaf.org/viaf/9672160668125703560008/': ['Barnitz, John G'],
 'http://viaf.org/viaf/9611161763145222560005/': ['Boyd, Chuck (Musician)'],
 'http://viaf.org/viaf/18153773422361430930/': ['Sharp, Nicholas (Cricket historian)'],
 'http://viaf.org/viaf/15153773419161430970/': ['Sharp, Nicholas (Lawyer)'],
 'http://viaf.org/viaf/202397991/': ['Sharp, Nicholas Andrew 1944-'],
 'http://viaf.org/viaf/12460530/': ['Sharp, Nicholas J.H.'],
 'http://viaf.org/viaf/1340160668441903560000/': ['Curzon-Siggers, Helen'],
 'http://viaf.org/viaf/722159234869803372966/': ['Fischer, Jean E'],
 'http://viaf.org/viaf/411159234291403371362/': ['Lawson, James Morris'],
 'http://viaf.org/viaf/295800268/': ['Matthews, Gwynda, 1926-'],
 'http://viaf.org/viaf/280473255/': ['Barton, Frank M. (

### Example Two: *Improving water supply networks*
The second example is the report, *Improving water supply networks*, with the following Sinopia Resources:

- BIBFRAME Instance https://api.stage.sinopia.io/resource/ae7a9f2c-a5db-4472-80fe-9a30f311386d
- BIBFRAME Item https://api.stage.sinopia.io/resource/3ccb6255-a090-4751-b68f-7f7861986452

First, we will retrieve both resources and add them to an RDF graph.

In [154]:
example2_graph = rdflib.Graph()
for ns, url in helpers.NAMESPACES.items():
    example2_graph.namespace_manager.bind(ns, url)

for sinopia_url in [
    "https://api.stage.sinopia.io/resource/ae7a9f2c-a5db-4472-80fe-9a30f311386d",
    "https://api.stage.sinopia.io/resource/3ccb6255-a090-4751-b68f-7f7861986452",
]:
    sinopia_result = requests.get(sinopia_url)
    example2_graph.parse(
        data=json.dumps(sinopia_result.json()["data"]), format="json-ld"
    )
print(f"Total number of triples for example 2 {len(example2_graph):,}")

Total number of triples for example 2 62


Second, we will open and read the full-text for *Improving water supply networks*.

In [155]:
with open("data/ae7a9f2c-a5db-4472-80fe-9a30f311386d.txt") as fo:
    example2_text = fo.read()

print(f"Number of characters in the text is {len(example2_text):,}")

Number of characters in the text is 251,321


Third, we will run the full-text through our [spaCy][SPACY] `en_nlp` pipeline and then examine the identified entities.

[SPACY]: https://spacy.io

In [156]:
example2_doc = en_nlp(example2_text)
print(f"Number of entities: {len(example2_doc.ents):,}")

Number of entities: 1,707


In [170]:
example2_people, example2_locations = [], []

for ent in example2_doc.ents:
    if ent.label_.startswith("PERSON"):
        example2_people.append(ent)
    if ent.label_.startswith("GPE"):
        example2_locations.append(ent)
print(
    f"NER found {len(example2_people):,} people and {len(example2_locations):,} locations"
)

NER found 127 people and 64 locations


We now de-duplicate `example2_people` and `example2_locations`.

In [171]:
example2_people = set([" ".join(ent.text.split()) for ent in example2_people])
example2_locations = set([" ".join(ent.text.split()) for ent in example2_locations])

In [172]:
example2_people

{'Anglian',
 'Anglian Water',
 'Austria',
 'B. K. Wankhede',
 'Bambos Charalambous',
 'Billed Metered',
 'Bulk',
 'Bulk Flow',
 'CS6.3 Leak Detector',
 'Calm Networks',
 'Cathy Wright Bare',
 'Continuous',
 'Croatia',
 'Elio Arniella',
 'Frank Nielsen',
 'Gary Wyeth',
 'Gary Wyeth Downloaded',
 'Gram Panchayat',
 'Groundwater',
 'Handheld',
 'Henrik Inancsi',
 'Jeevan Pradhikaran',
 'Jens F. Bastrup',
 'Jens Ravn Knudsen',
 'K. R. Otari',
 'Kamstrup',
 'Kapil Gram Panchayat',
 'Ken',
 'Ken Hirschman',
 'Leak Detector',
 'Meter',
 'Meter Reading',
 'Miya Bahamas',
 'Mr S. K. Bhopale',
 'Nevil Muncaster',
 'Niccolò Machiavelli',
 'PERFORMANCE',
 'PRV',
 'Patents Act',
 'Pressurised',
 'Robin Lindberg',
 'Roland Liemberger',
 'Sarah Walker',
 'Stig Knudsen',
 'Stuart Hamilton',
 'Söderhamn',
 'Water Balance',
 'Whatdoesaholesoundlike',
 'Yorkshire Water',
 'Yorkshire Water’s'}

In [173]:
example2_locations

{'Atlanta',
 'Automatic',
 'Bahamas',
 'Canada',
 'City',
 'CityinAsia',
 'England',
 'Enigma3hyQs',
 'Georgia',
 'India',
 'Kamstrup',
 'Lincolnshire',
 'London',
 'Maharashtra',
 'Malaysia',
 'Mumbai',
 'New Providence',
 'SC',
 'SCADA',
 'Sheffield',
 'Spain',
 'Sweden',
 'Technolog',
 'The City of Sheridan',
 'UK',
 'US',
 'USA',
 'United Kingdom',
 'Wyoming',
 'm3',
 'the City of Sheridan',
 'the North of England'}

## Exercise 2
For the following Sinopia resources, create a graph and an NLP pipeline to identify any people and locations in the full-text. The full-text is available in the `data/5f8c765d-dc4f-45a0-b2d8-87fe29a38712.txt` file.

- BIBFRAME Work https://api.stage.sinopia.io/resource/5f8c765d-dc4f-45a0-b2d8-87fe29a38712 
- BIBFRAME Instance https://api.stage.sinopia.io/resource/2e041e16-1b4d-431f-b2d0-396f62270efa
- BIBFRAME Item https://api.stage.sinopia.io/resource/53e68480-f82b-4dea-9b9a-68fd5ceacb41