Sources:
https://www.analyticsvidhya.com/blog/2020/03/spacy-tutorial-learn-natural-language-processing/


I'm using spaCy's large English pipeline, en_core_web_lg, to complete my task.

In [None]:
!python -m spacy download en_core_web_lg

2023-04-04 00:27:08.270132: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-04-04 00:27:12.265323: E tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:266] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting en-core-web-lg==3.5.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-3.5.0/en_core_web_lg-3.5.0-py3-none-any.whl (587.7 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 587.7/587.7 MB 3.1 MB/s eta 0:00:00
Installing collected packages: en-core-web-lg
Successfully installed en-core-web-lg-3.5.0
✔ Download and in

Instead of creating my own list of random words, I decided to use the same Jane Austen text we used in the previous project.

In [None]:
import tensorflow as tf
# Download the Austen text (Keras caches it locally) and read it as UTF-8
path_to_file = tf.keras.utils.get_file('austen.txt', 'https://raw.githubusercontent.com/byui-cse/cse450-course/master/data/austen/austen.txt')
with open(path_to_file, 'rb') as f:
  text = f.read().decode(encoding='utf-8')
print('Length of text: {} characters'.format(len(text)))

Downloading data from https://raw.githubusercontent.com/byui-cse/cse450-course/master/data/austen/austen.txt
Length of text: 4906787 characters


Using spaCy, I created a Doc object to work with. I kept only the first 500,000 characters so I wouldn't crash my Colab session: spaCy's default nlp.max_length is 1,000,000 characters, and processing more than that would consume too much RAM.

In [None]:
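import spacy
nlp = spacy.load('en_core_web_lg')
# Uncomment to raise spaCy's 1,000,000-character limit to fit the whole text (needs far more RAM)
# nlp.max_length = len(text)
# Process only the first 500,000 characters
doc = nlp(text[0:500000])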
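If the whole text is needed rather than just the first 500,000 characters, one option is to split it into smaller pieces and process each piece separately. The sketch below is only an illustration: split_into_chunks is a hypothetical helper, and it assumes paragraphs are separated by a blank line ("\n\n"), which may not hold for every text file.

```python
def split_into_chunks(text, max_chars=500_000, sep="\n\n"):
    """Split text into pieces of at most max_chars characters.

    Hypothetical helper: cuts at the last paragraph break (sep) inside
    each window when there is one, otherwise cuts at max_chars.
    """
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + max_chars, len(text))
        if end < len(text):
            # Prefer to cut just after the last separator in the window
            cut = text.rfind(sep, start, end)
            if cut > start:
                end = cut + len(sep)
        chunks.append(text[start:end])
        start = end
    return chunks


# Small demonstration with a tiny limit
sample = "First paragraph.\n\nSecond paragraph.\n\nThird."
pieces = split_into_chunks(sample, max_chars=25)
for piece in pieces:
    print(repr(piece))
```

With the real text, the pieces could then be passed to nlp.pipe(split_into_chunks(text)), which yields one Doc per piece. An entity that straddles a cut would be missed, which is why the sketch tries to cut at paragraph breaks.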



The next two cells print the part-of-speech tag of every token, and then every named entity, that the spaCy model identifies.

In [None]:
for token in doc:
  print(token.text, '-->', token.pos_)

Streaming output truncated to the last 5000 lines.
sake --> NOUN
of --> ADP
what --> PRON
is --> AUX
more --> ADV
important --> ADJ
than --> ADP
my --> PRON

 --> SPACE
comfort --> NOUN
, --> PUNCT
a --> DET
habit --> NOUN
of --> ADP
self --> NOUN
- --> PUNCT
command --> NOUN
in --> ADP
you --> PRON
, --> PUNCT
a --> DET
consideration --> NOUN
of --> ADP
what --> PRON
is --> AUX
your --> PRON

 --> SPACE
duty --> NOUN
, --> PUNCT
an --> DET
attention --> NOUN
to --> ADP
propriety --> NOUN
, --> PUNCT
an --> DET
endeavour --> NOUN
to --> PART
avoid --> VERB
the --> DET
suspicions --> NOUN
of --> ADP

 --> SPACE
others --> NOUN
, --> PUNCT
to --> PART
save --> VERB
your --> PRON
health --> NOUN
and --> CCONJ
credit --> NOUN
, --> PUNCT
and --> CCONJ
restore --> VERB
your --> PRON
tranquillity --> NOUN
. --> PUNCT

 --> SPACE
These --> PRON
are --> AUX
the --> DET
motives --> NOUN
which --> PRON
I --> PRON
have --> AUX
been --> AUX
pressing --> VERB
on --> ADP
you --> PRON
.

In [None]:
for ent in doc.ents:
  print(ent.text, '-->', ent.label_)

Emma Woodhouse --> PERSON
nearly twenty-one years --> DATE
vex --> PERSON
two --> CARDINAL
Sixteen years --> DATE
Taylor --> PERSON
Woodhouse --> PERSON
Emma --> PERSON
Taylor --> PERSON
Emma --> PERSON
Taylor --> PERSON
Emma --> PERSON
consciousness.--Miss --> PERSON
Taylor --> PERSON
Taylor --> PERSON
first --> ORDINAL
Emma --> PERSON
first --> ORDINAL
third --> ORDINAL
a long evening --> TIME
Weston --> PERSON
Taylor --> PERSON
every hour --> TIME
sixteen years --> DATE
five years old --> DATE
seven
years --> DATE
Isabella --> PERSON
change?--It --> PERSON
only half a mile --> QUANTITY
Emma --> PERSON
Weston --> PERSON
only half a mile --> QUANTITY
Taylor --> PERSON
Woodhouse --> PERSON
years --> DATE
London --> GPE
only sixteen miles --> QUANTITY
daily --> DATE
October --> DATE
November --> DATE
evening --> TIME
Hartfield --> GPE
Christmas --> DATE
Isabella --> PERSON
Highbury --> GPE
Hartfield --> GPE
Woodhouses --> PERSON
first --> ORDINAL
Taylor --> PERSON
half a day --> DATE
Em
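Because the printed list is long, a tally of labels can be easier to read than the raw output. The following is a minimal sketch: count_labels is a hypothetical helper, and the sample pairs are copied by hand from the first few rows of the output above so the example runs without the model.

```python
from collections import Counter


def count_labels(entities):
    """Count how often each label occurs in an iterable of (text, label) pairs."""
    return Counter(label for _text, label in entities)


# First few rows of the entity output above, copied by hand
sample_entities = [
    ("Emma Woodhouse", "PERSON"),
    ("nearly twenty-one years", "DATE"),
    ("vex", "PERSON"),
    ("two", "CARDINAL"),
    ("Sixteen years", "DATE"),
    ("Taylor", "PERSON"),
]

label_counts = count_labels(sample_entities)
for label, n in label_counts.most_common():
    print(label, '-->', n)
```

On the real Doc, the same helper could be called as count_labels((ent.text, ent.label_) for ent in doc.ents).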

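Next, I create a custom pattern for what I want to find in the text. For this project, I wanted the names of all the people in the text. The pattern matches one token at a time, so a multi-word name such as Emma Woodhouse comes out as separate entries, and some mis-tagged tokens slip through. It's not perfect, but it does the job.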

In [None]:
# import spacy Matcher
from spacy.matcher import Matcher

# initialize the matcher with spacy vocab
matcher = Matcher(nlp.vocab)

# create the pattern to use: any single token that is part of a PERSON entity
pattern = [{'ENT_TYPE': 'PERSON'}]

# add rule to matcher
matcher.add('custom_1', [pattern], on_match=None)

In [None]:
matches = matcher(doc)
names = []
for match_id, start, end in matches:
  matched_span = doc[start:end]
  if matched_span.text not in names:
    names.append(matched_span.text)

for name in names:
  print(name)

Emma
Woodhouse
vex
Taylor
consciousness.--Miss
Weston
Isabella
change?--It
Woodhouses
Taylor!--I
large.--And
James
Hannah
Knightley
Miss
Farmer
Mitchell
’s
Elton
Churchill
Enscombe
Frank
Perry
Bates
Perrys
Donwell
Abbey
Goddard
Smith
Harriet
Martin
Serle
a


Martins
Alderneys
Welch
Nash
Prince
Richardson
Elizabeth
Robert
rather
solemnly
book”--was
good
nothing.--You
John
Henry
Bella
George
,
said--



“
.
Knightley.--“Robert
Woodhouse”--he
stopt
moment--“or
united
Courtship_--a
clever!--Could
Smiths
Eltons
Shakespeare
Michaelmas
Cole
all!--You
charade.--You
Garrick
Catherine
Jane
Fairfax
Stilton
dropt
Wingfield
Woodhouse--“yes
Westons
F.
C.
Knightleys
go.--You
miserable;--I
Bateses
Graham
Knightley.--“It
sir.--We
Perry.--I
at
Hartfield
every
hour
strange!--After
ill
Weston;--Mrs
me?--Absurd
Weston?--Judge
Emma?--what
Smith!--Miss
her;--William
Coxe
William
bore
Hetty
Dixon
Campbell
shewn
Wrapt
question--
Patty
Hawkins
Hawkins--
Elton!--no
Hawkins!--Well
arm.--Mr
:
”--Ford
Woodhouse?)--
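The list above mixes real names with punctuation, fragments such as "consciousness.--Miss", and ordinary words. One possible cleanup, sketched below, keeps only capitalized, purely alphabetic strings. clean_names, NAME_RE, and the small set of titles are all hypothetical choices for illustration, not part of spaCy, and the sample list is copied by hand from the output above.

```python
import re

# Assumption: a usable name is one capitalized word made only of ASCII letters
NAME_RE = re.compile(r"^[A-Z][a-z]+$")


def clean_names(candidates, titles=frozenset({"Miss", "Mr", "Mrs"})):
    """Return unique name-like strings, in first-seen order."""
    seen = set()
    kept = []
    for candidate in candidates:
        candidate = candidate.strip()
        if not NAME_RE.match(candidate):
            continue  # drops punctuation, fragments, and lowercase words
        if candidate in titles or candidate in seen:
            continue  # drops titles and duplicates
        seen.add(candidate)
        kept.append(candidate)
    return kept


# A few entries from the matcher output above, copied by hand
sample_names = ["Emma", "vex", "consciousness.--Miss", "Taylor!--I",
                "Miss", "’s", "Knightley", "Emma", ","]
cleaned = clean_names(sample_names)
print(cleaned)  # ['Emma', 'Knightley']
```

This filter is deliberately strict: it also drops initials such as "F." and "C.", and it still keeps capitalized non-names such as "Wrapt" or "Hartfield", so the result would need a manual check. With the real data it could be called as clean_names(names).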