# Natural Language Processing (NLP) with spaCy
In this notebook, we use [spaCy][SPACY], a Python library for natural language processing (NLP).
[spaCy][SPACY] lets us quickly tag parts of speech (POS) in text descriptions and identify entities with its named entity recognition (NER).


[SPACY]: https://spacy.io/

In [4]:
%reload_ext autoreload
%autoreload 2
%matplotlib inline
%load_ext lab_black

import pathlib

import kglab
import pandas as pd

import helpers
import widgets

The lab_black extension is already loaded. To reload it, use:
  %reload_ext lab_black


## Loading Sinopia Stage Knowledge Graph
As in the previous Jupyter notebook, we load the saved knowledge graph that we created at the beginning and then query it using SPARQL.

In [7]:
stage_kg = kglab.KnowledgeGraph()
stage_kg.load_jsonld("data/stage.json")

http://desktop.loc.gov/search?view=document&id=Infobasedcrmg0Dash0Dash0Dash247&hl=true&fq=allresources|true# does not look like a valid URI, trying to serialize this will break.
https://api.stage.sinopia.io/resource/this is a test does not look like a valid URI, trying to serialize this will break.
ld4p:RT:bf2:2D graphic material:Item does not look like a valid URI, trying to serialize this will break.
urn:ld4p:qa:gettyaat:Objects__Object_Groupings and Systems does not look like a valid URI, trying to serialize this will break.


<kglab.kglab.KnowledgeGraph at 0x7fe93151a7c0>

### RDF Literals Pandas DataFrame


In [2]:
from spacy.lang.en import English

nlp = English()
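
A blank `English()` pipeline only tokenizes text; it has no tagger or entity recognizer, so POS tags and named entities would need a trained pipeline. The short sketch below illustrates this with a made-up sentence; the sentence and the variable names are assumptions for illustration.

```python
from spacy.lang.en import English

nlp = English()  # blank pipeline: tokenizer only

# Hypothetical sample text
doc = nlp("African American teenagers and education")
tokens = [token.text for token in doc]

# A blank pipeline has no components such as a tagger or NER
components = nlp.pipe_names
```

Here `tokens` holds the five words of the sentence and `components` is an empty list.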

## Creating a FAST Phrase Matcher

In [1]:
import pandas as pd

fast_topics = pd.read_csv("data/topic_uri_label_utf8.csv", names=["URL", "name"])

In [4]:
fast_topics.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 460110 entries, 0 to 460109
Data columns (total 2 columns):
 #   Column  Non-Null Count   Dtype 
---  ------  --------------   ----- 
 0   URL     460110 non-null  object
 1   name    460110 non-null  object
dtypes: object(2)
memory usage: 7.0+ MB


In [5]:
fast_topics.head()

|   | URL | name |
|---|-----|------|
| 0 | http://id.worldcat.org/fast/799409 | African American teenagers--Education |
| 1 | http://id.worldcat.org/fast/912703 | Enthusiasm--Religious aspects--Christianity |
| 2 | http://id.worldcat.org/fast/966912 | Identity (Psychology) in old age |
| 3 | http://id.worldcat.org/fast/817698 | Artists' studios |
| 4 | http://id.worldcat.org/fast/833340 | Bisexual women--Health and hygiene |


In [13]:
"African American teenagers--Education".replace("-", " ").replace("(", "").replace(")", "").split()

['African', 'American', 'teenagers', 'Education']
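
The same cleanup can be wrapped in a small reusable function. This is a sketch using only the standard library; the name `heading_tokens` is hypothetical, and it mirrors the chained `replace` calls above (hyphens become spaces, parentheses are dropped).

```python
# Hyphens become spaces; parentheses are removed entirely
_HEADING_TABLE = str.maketrans({"-": " ", "(": None, ")": None})


def heading_tokens(label):
    """Split a FAST heading into whitespace-separated words."""
    return label.translate(_HEADING_TABLE).split()


education = heading_tokens("African American teenagers--Education")
identity = heading_tokens("Identity (Psychology) in old age")
```

Because `split()` collapses runs of whitespace, the double hyphen in a subdivided heading does not produce empty tokens.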

### Creating Patterns

In [None]:
topic_patterns = []

# Iterate through the FAST Topic DataFrame; iterrows() yields (index, row) pairs
for _, row in fast_topics.iterrows():
    print(row["URL"], row["name"])
    # Placeholder: an empty pattern is appended for each topic
    topic_patterns.append({})

http://id.worldcat.org/fast/799409 African American teenagers--Education
http://id.worldcat.org/fast/912703 Enthusiasm--Religious aspects--Christianity
http://id.worldcat.org/fast/966912 Identity (Psychology) in old age
http://id.worldcat.org/fast/817698 Artists' studios
http://id.worldcat.org/fast/833340 Bisexual women--Health and hygiene
http://id.worldcat.org/fast/924092 Figure sculpture, Croatian
http://id.worldcat.org/fast/880474 Corynebacterium pyogenes
http://id.worldcat.org/fast/837631 Brain--Hemorrhage--Patients
http://id.worldcat.org/fast/803446 Airplanes--Refueling
http://id.worldcat.org/fast/928383 Flutes (4) with chamber orchestra
http://id.worldcat.org/fast/807737 Ammonia
http://id.worldcat.org/fast/857655 Chinese newspapers--Language
http://id.worldcat.org/fast/914131 Epigrams
http://id.worldcat.org/fast/918422 Experimental drama, French--History and criticism
http://id.worldcat.org/fast/961265 Hospitals--Design and construction--Cost effectiveness
http://id.worldcat.org
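
One way the FAST labels might be turned into phrase patterns is with spaCy's `PhraseMatcher`. The sketch below is an assumption about where this section is heading, not the notebook's own implementation. It assumes the list-based `matcher.add(key, docs)` signature used in spaCy v3, uses three rows copied from the output above, and matches against a made-up description.

```python
from spacy.lang.en import English
from spacy.matcher import PhraseMatcher

nlp = English()  # blank pipeline: tokenizer only

# attr="LOWER" compares lowercased tokens, so matching is case-insensitive
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")

# Three rows taken from the FAST output above (URL -> label)
sample_topics = {
    "http://id.worldcat.org/fast/807737": "Ammonia",
    "http://id.worldcat.org/fast/914131": "Epigrams",
    "http://id.worldcat.org/fast/817698": "Artists' studios",
}

for url, label in sample_topics.items():
    # The FAST URL is the match key, so each hit traces back to its topic
    matcher.add(url, [nlp.make_doc(label)])

# Hypothetical description text
doc = nlp("Notes on epigrams and ammonia production.")

# Each match is (key hash, start token, end token); look the key up as a string
matches = [
    (nlp.vocab.strings[match_id], doc[start:end].text)
    for match_id, start, end in matcher(doc)
]
```

Using `nlp.make_doc` only tokenizes each label, which keeps pattern creation cheap across the roughly 460,000 FAST topics.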

## Exercises