## NLP with spaCy

In this notebook we will go through the spaCy library, following [Mattingly, William. Introduction to spaCy 3, 2021 (1st ed.)](https://spacy.pythonhumanities.com/).

The following are the sections available in this notebook:
- [The Installation](#install)
- [The First Doc Object](#docobj)
- [Tokens](#token)
- [The Part of Speech and Dependencies](#posdep)
- [Named Entity Recognition](#wrwner)
- [Vectors](#vector)
- [Pipelines](#pipeline)
    - [Rule-Based Pipelines](#rbpipeline)
- [Matchers](#matcher)
- [Custom Components](#cuscomp)
- [Incorporating RegEx](#regex)
- [Using Financial Data](#findata)


<a id = 'install'></a>
### The Installation

In [None]:
## first we need to install the packages required
!pip install spacy
!pip install ruamel-yaml

## and we also need the en library
## which we need to download
!python -m spacy download en_core_web_sm

<a id = 'docobj'></a>
### The First Doc Object

In [7]:
## import the library
import spacy
## and then load the small en model into an object
nlp = spacy.load('en_core_web_sm')

In [10]:
## loading a text file
## and then passing it to the nlp object
with open('C://Users//12145//Documents//GitHub//Python//data//wiki_us.txt', 'r') as tx:
    doc = nlp(tx.read())
## at first glance, it looks similar to
## a normal string object
doc

The United States of America (U.S.A. or USA), commonly known as the United States (U.S. or US) or America, is a country primarily located in North America. It consists of 50 states, a federal district, five major unincorporated territories, 326 Indian reservations, and some minor possessions.[j] At 3.8 million square miles (9.8 million square kilometers), it is the world's third- or fourth-largest country by total area.[d] The United States shares significant land borders with Canada to the north and Mexico to the south, as well as limited maritime borders with the Bahamas, Cuba, and Russia.[22] With a population of more than 331 million people, it is the third most populous country in the world. The national capital is Washington, D.C., and the most populous city is New York.

Paleo-Indians migrated from Siberia to the North American mainland at least 12,000 years ago, and European colonization began in the 16th century. The United States emerged from the thirteen British colonies est

In [17]:
## getting the shape of the doc tensor (one row per token)
doc.tensor.shape

(652, 96)

<a id ='token'></a>
### Working with Tokens

Each token within a doc object has many attributes, such as `.text`, `.left_edge`, `.right_edge`, `.ent_type_`, `.pos_`, and `.dep_`.
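
Some token attributes are purely lexical and are available without any trained model. The following is a minimal sketch, assuming a blank English pipeline (tokenizer only) and a made-up sample sentence; the names `nlp_blank`, `sample`, and `rows` are illustrative and not part of the notebook. Attributes such as `.pos_`, `.dep_`, and `.ent_type_` stay empty in this setup until a trained component fills them in.

```python
import spacy

# A blank pipeline contains only the tokenizer, so no model download is needed.
nlp_blank = spacy.blank("en")
sample = nlp_blank("It consists of 50 states.")

# Lexical attributes come from the tokenizer and vocabulary alone:
# token index, text, alphabetic flag, number-like flag, punctuation flag.
rows = [(t.i, t.text, t.is_alpha, t.like_num, t.is_punct) for t in sample]
for row in rows:
    print(row)
```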

In [48]:
for sent in doc[:50].sents:
    print(sent.sentiment)
## the type of entity
for token in doc[:5]:
    print(token.text,token.ent_type_)
for token in doc[10:20]:
    print(f"For {token.text} lemma_ is {token.lemma_} and morph is {token.morph} and language is {token.lang_}")

0.0
0.0
The GPE
United GPE
States GPE
of GPE
America GPE
For , lemma_ is , and morph is PunctType=Comm and language is en
For commonly lemma_ is commonly and morph is  and language is en
For known lemma_ is know and morph is Aspect=Perf|Tense=Past|VerbForm=Part and language is en
For as lemma_ is as and morph is  and language is en
For the lemma_ is the and morph is Definite=Def|PronType=Art and language is en
For United lemma_ is United and morph is Number=Sing and language is en
For States lemma_ is States and morph is Number=Sing and language is en
For ( lemma_ is ( and morph is PunctSide=Ini|PunctType=Brck and language is en
For U.S. lemma_ is U.S. and morph is Number=Sing and language is en
For or lemma_ is or and morph is ConjType=Cmp and language is en


<a id ='posdep'></a>
### Working with the Part of Speech and Dependencies

In [49]:
sent = "Today is a beautiful day!"
doc2 = nlp(sent)
for token in doc2:
    print(token.text, token.pos_, token.dep_)

Today NOUN nsubj
is AUX ROOT
a DET det
beautiful ADJ amod
day NOUN attr
! PUNCT punct


In [50]:
## and we can actually visualize this relationship
## using the displacy module
from spacy import displacy
displacy.render(doc2, style ='dep')

<a id ='wrwner'></a>
### Working with Named Entity Recognition

In [None]:
for ent in doc.ents:
    print(ent.text, ent.label_)

In [53]:
displacy.render(doc, style='ent')

<a id ='vector'></a>
### Vectors in spaCy

In [None]:
## we will start working with the medium model for this step
## because this model has word vectors
!python -m spacy download en_core_web_md

In [56]:
nlpmd = spacy.load('en_core_web_md')
## and next generating a new doc object
## using the medium model
with open('C://Users//12145//Documents//GitHub//Python//data/wiki_us.txt', 'r') as t:
    docmd = nlpmd(t.read())
sent1 = list(docmd.sents)[0]
sent1

The United States of America (U.S.A. or USA), commonly known as the United States (U.S. or US) or America, is a country primarily located in North America.

In [71]:
## we can calculate the similarity of two objects
doc2 = nlpmd("Today is a beautiful day!")
doc3 = nlpmd("Spring is finally here.")
print(f"{doc2.text} <-> {doc3.text} similarity score is {doc2.similarity(doc3)}")
doc2 = nlpmd("I am hungry.")
doc3 = nlpmd("There's no food in the fridge.")
print(f"{doc2.text} <-> {doc3.text} similarity score is {doc2.similarity(doc3)}")
## and we can get the similarity
## for individual words as well
token1 = doc2[2]
token2 = doc3[3]
print(f"{token1.text} <-> {token2.text} similarity score is {token1.similarity(token2)}")

Today is a beautiful day! <-> Spring is finally here. similarity score is 0.762100153100941
I am hungry. <-> There's no food in the fridge. similarity score is 0.17335815959208917
hungry <-> food similarity score is 0.37080809473991394
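
By default, `similarity` compares the vectors of the two objects with cosine similarity (for a doc, the vector is the average of its token vectors). A minimal sketch of that calculation in plain Python follows; the three-dimensional vectors are toy values invented for illustration and do not come from `en_core_web_md`.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy vectors (real model vectors have hundreds of dimensions).
vec_hungry = [1.0, 2.0, 0.0]
vec_food = [2.0, 4.0, 0.0]    # points in the same direction as vec_hungry
vec_fridge = [0.0, 0.0, 3.0]  # orthogonal to both of the others

print(cosine_similarity(vec_hungry, vec_food))    # close to 1: same direction
print(cosine_similarity(vec_hungry, vec_fridge))  # 0: nothing in common
```

Because only the direction of the vectors matters, a long text and a short one can still score as highly similar.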


<a id ='pipeline'></a>
### Working with Pipelines

In [77]:
## first we need to start with a blank English model
enmodel = spacy.blank('en')
## and then we can add different pipes to it
enmodel.add_pipe('sentencizer')
## the benefit of this over loading a full trained model
## is the time saved when processing
## a large body of text
## and we can also analyze our pipeline
print(enmodel.analyze_pipes())
## and compare it with larger models
print('\nThe small model pipelines:\n')
print(nlp.analyze_pipes())

{'summary': {'sentencizer': {'assigns': ['token.is_sent_start', 'doc.sents'], 'requires': [], 'scores': ['sents_f', 'sents_p', 'sents_r'], 'retokenizes': False}}, 'problems': {'sentencizer': []}, 'attrs': {'token.is_sent_start': {'assigns': ['sentencizer'], 'requires': []}, 'doc.sents': {'assigns': ['sentencizer'], 'requires': []}}}

The small model pipelines:

{'summary': {'tok2vec': {'assigns': ['doc.tensor'], 'requires': [], 'scores': [], 'retokenizes': False}, 'tagger': {'assigns': ['token.tag'], 'requires': [], 'scores': ['tag_acc'], 'retokenizes': False}, 'parser': {'assigns': ['token.dep', 'token.head', 'token.is_sent_start', 'doc.sents'], 'requires': [], 'scores': ['dep_uas', 'dep_las', 'dep_las_per_type', 'sents_p', 'sents_r', 'sents_f'], 'retokenizes': False}, 'attribute_ruler': {'assigns': [], 'requires': [], 'scores': [], 'retokenizes': False}, 'lemmatizer': {'assigns': ['token.lemma'], 'requires': [], 'scores': ['lemma_acc'], 'retokenizes': False}, 'ner': {'assigns': ['doc

<a id ='rbpipeline'></a>
#### Rule-Based Pipelines

In [86]:
## when we have domain knowledge that can be
## expressed as rules, we can add those rules
## to the model as a pipe, which gives
## accurate output for the cases they cover;
## when the rules are too complicated to write down,
## we rely on ML pipes instead
## to extract/analyze our data
## we will use the small model for this case
nlp = spacy.load('en_core_web_sm')
example = "West Chesterfieldville was referenced in Mr. Deeds."
ex_doc = nlp(example)
## and then extracting the entities
for ent in ex_doc.ents:
    print(ent.text, ent.label_)
## let's suppose that we want to label
## these entities in a different way
## we can achieve that by using a ruler
ruler = nlp.add_pipe('entity_ruler')
print(nlp.analyze_pipes())
## and then we can create the patterns,
## which are a list of dictionaries
pattern = [
    {'label':'GPE', 'pattern':'West Chesterfieldville'},
    {'label':'FILM', 'pattern':'Mr. Deeds'}
]
## and then add it to the model
ruler.add_patterns(pattern)
## and then re-create the doc obj
ex_doc = nlp(example)
for ent in ex_doc.ents:
    print(ent.text, ent.label_)
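## checking the result, nothing's changed: the ruler was added
## after 'ner', and by default it does not overwrite
## entities that 'ner' has already set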

West Chesterfieldville PERSON
Deeds PERSON
{'summary': {'tok2vec': {'assigns': ['doc.tensor'], 'requires': [], 'scores': [], 'retokenizes': False}, 'tagger': {'assigns': ['token.tag'], 'requires': [], 'scores': ['tag_acc'], 'retokenizes': False}, 'parser': {'assigns': ['token.dep', 'token.head', 'token.is_sent_start', 'doc.sents'], 'requires': [], 'scores': ['dep_uas', 'dep_las', 'dep_las_per_type', 'sents_p', 'sents_r', 'sents_f'], 'retokenizes': False}, 'attribute_ruler': {'assigns': [], 'requires': [], 'scores': [], 'retokenizes': False}, 'lemmatizer': {'assigns': ['token.lemma'], 'requires': [], 'scores': ['lemma_acc'], 'retokenizes': False}, 'ner': {'assigns': ['doc.ents', 'token.ent_iob', 'token.ent_type'], 'requires': [], 'scores': ['ents_f', 'ents_p', 'ents_r', 'ents_per_type'], 'retokenizes': False}, 'entity_ruler': {'assigns': ['doc.ents', 'token.ent_type', 'token.ent_iob'], 'requires': [], 'scores': ['ents_f', 'ents_p', 'ents_r', 'ents_per_type'], 'retokenizes': False}}, 'pr

In [87]:
## to place our pipe before/after
## a specific pipe, we can pass before=/after=
## when adding the pipe
nlp = spacy.load('en_core_web_sm')
ruler = nlp.add_pipe('entity_ruler', before='ner')
ruler.add_patterns(pattern)
ex_doc = nlp(example)
for ent in ex_doc.ents:
    print(ent.text, ent.label_)

West Chesterfieldville GPE
Mr. Deeds FILM
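
The effect of pipe order can be reproduced without a trained model. The sketch below assumes two `entity_ruler` pipes on a blank English pipeline, standing in for the ruler and `ner`; the pipe names `film_ruler` and `person_ruler` and the helper functions are made up for illustration. Since an entity ruler does not overwrite existing entities by default, whichever pipe runs first decides the label.

```python
import spacy

def build(person_first):
    """Blank pipeline with two rulers whose patterns overlap on 'Deeds'."""
    nlp = spacy.blank("en")
    film = nlp.add_pipe("entity_ruler", name="film_ruler")
    film.add_patterns([{"label": "FILM", "pattern": "Mr. Deeds"}])
    if person_first:
        # before= inserts the new pipe ahead of an existing one
        person = nlp.add_pipe("entity_ruler", name="person_ruler", before="film_ruler")
    else:
        # without before=/after=, the pipe is appended at the end
        person = nlp.add_pipe("entity_ruler", name="person_ruler")
    person.add_patterns([{"label": "PERSON", "pattern": "Deeds"}])
    return nlp

def entities(nlp):
    doc = nlp("West Chesterfieldville was referenced in Mr. Deeds.")
    return [(ent.text, ent.label_) for ent in doc.ents]

film_first = build(person_first=False)
person_first = build(person_first=True)
print(film_first.pipe_names, entities(film_first))
print(person_first.pipe_names, entities(person_first))
```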


<a id ='matcher'></a>
### Working with Matcher

The power of the Matcher lies in combining part-of-speech tags, morphological analysis, dependency labels, lemmas, and token shape to find a specific match.
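
Attributes such as part of speech or lemma need a trained pipeline, but the same pattern syntax also accepts purely lexical attributes. A minimal sketch, assuming a blank English pipeline and an invented sentence and label (`PHONE_MODEL`); each dictionary in the pattern describes one token.

```python
import spacy
from spacy.matcher import Matcher

nlp_blank = spacy.blank("en")
matcher = Matcher(nlp_blank.vocab)

# Token 1: any casing of "iphone"; token 2: a token made only of digits.
pattern = [{"LOWER": "iphone"}, {"IS_DIGIT": True}]
matcher.add("PHONE_MODEL", [pattern])

doc = nlp_blank("I compared the iPhone 13 with the iphone 12 yesterday.")

# Each match is (match_id, start_token, end_token); sort by start position.
matches = sorted(matcher(doc), key=lambda m: m[1])
found = [(nlp_blank.vocab.strings[m_id], doc[start:end].text)
         for m_id, start, end in matches]
print(found)
```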

In [92]:
from spacy.matcher import Matcher
## creating a fresh model
nlp = spacy.load('en_core_web_sm')
## and then creating a matcher from that model's vocab
matcher = Matcher(nlp.vocab)
## then a pattern
pattern = [{'LIKE_EMAIL':True}]
## and pass it to our matcher obj
## together with a label;
## patterns go in as a list of lists of dicts, [[{}]]
matcher.add('EMAIL_ADDRESS', [pattern])
## and then lets create an example
example = "This is an email address: john.doe@random.co"
doc = nlp(example)
## and then use the matcher
## on our doc obj
matches = matcher(doc)
print(matches)
## which is a list of tuples
## with 3 values: the 1st is the match ID (the hash of the label)
## and the last two are the start and end token indices
print(doc[matches[0][1]:matches[0][2]])
## and we can look the match ID up in the vocab to recover the label
print(nlp.vocab[matches[0][0]].text)

[(16571425990740197027, 6, 7)]
john.doe@random.co
EMAIL_ADDRESS


In [102]:
## lets load a new text
with open('C://Users//12145//Documents//GitHub//Python//data//wiki_mlk.txt', 'r') as ml:
    text = ml.read()
## and start a new model
nlp = spacy.load('en_core_web_sm')
matcher = Matcher(nlp.vocab)
doc = nlp(text)
## and let's suppose that we want to use POS
## to extract runs of one or more
## proper nouns ("OP": "+" means one or more)
## and keep the longest match among overlapping ones
## the matches can come back out of document order,
## which we fix by sorting on the start index
pattern = [{'POS':"PROPN", "OP":'+'}]
matcher.add("PROPER_NOUN", [pattern], greedy="LONGEST")
matches = matcher(doc)
matches.sort(key = lambda x:x[1])
for match in matches[:10]:
    print(doc[match[1]:match[2]])

Martin Luther King Jr.
Michael King Jr.
January
April
Baptist
King
Mahatma Gandhi
Martin Luther King Sr.
King
King


In [104]:
## for a more complicated pattern,
## e.g. two or more proper nouns followed by a verb,
## we add one dict per step to the sequence
matcher = Matcher(nlp.vocab)
doc = nlp(text)
pattern = [{'POS':"PROPN", "OP":'{2,}'}, {'POS':"VERB"}]
matcher.add("PROPER_NOUN", [pattern], greedy="LONGEST")
matches = matcher(doc)
matches.sort(key = lambda x:x[1])
for match in matches[:10]:
    print(doc[match[1]:match[2]])

Director J. Edgar Hoover considered
United States beginning


In [129]:
## loading the text of Alice in Wonderland
import json
with open('C://Users//12145//Documents//GitHub//Python//data//alice.json', 'r') as a:
    text = json.load(a)
doc = nlp(text[0][2][0].replace('`', "'"))
## lets work on a more complicated pattern
## trying to extract the quotes from the text
matcher = Matcher(nlp.vocab)
quote_verbs = ["think", "say"]
pattern = [
          {"ORTH":"'"},
          {"IS_ALPHA":True, "OP":"+"},
          {"IS_PUNCT":True, "OP":"*"},
          {"ORTH":"'"},
          {"POS":"VERB", "LEMMA":{"IN":quote_verbs}},
          {"POS":"PROPN", "OP":"+"},
          {"ORTH":"'"},
          {"IS_ALPHA":True, "OP":"+"},
          {"IS_PUNCT":True, "OP":"*"},
          {"ORTH":"'"}
]
matcher.add("QUOTE", [pattern], greedy="LONGEST")
matches = matcher(doc)
matches.sort(key=lambda x:x[1])
for match in matches:
    print(doc[match[1]:match[2]])

'and what is the use of a book,' thought Alice 'without pictures or conversation?'


<a id ='cuscomp'></a>
### Working with Custom Components

In [137]:
## starting fresh with the small model
nlp = spacy.load('en_core_web_sm')
doc = nlp('Britain is a place. Mary is a doctor.')
for ent in doc.ents:
    print(ent.text, ent.label_)
## let's suppose that we want to remove all the GPEs
## we can do that by creating a custom component
## first we need to import the Language class
from spacy.language import Language
@Language.component("remove_gpe")
def remove_gpe(doc):
    original_ents = list(doc.ents)
    for en in doc.ents:
        if en.label_ == "GPE":
            original_ents.remove(en)
    ## and we have to update the doc entities
    doc.ents = original_ents
    return doc
## and now we have to add it to our pipeline
nlp.add_pipe("remove_gpe")
## and now lets test to see if it's working
doc = nlp('Britain is a place. Mary is a doctor.')
for ent in doc.ents:
    print(ent.text, ent.label_)

Britain GPE
Mary PERSON
Mary PERSON
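
The same idea can be tried without a trained model. This is a sketch under stated assumptions: a blank English pipeline whose entities come from an `entity_ruler` with two hand-written patterns, and a component registered under the made-up name `drop_gpe_sketch` so that it does not clash with `remove_gpe` above. It filters the entities with a list comprehension instead of removing items from a copy.

```python
import spacy
from spacy.language import Language

@Language.component("drop_gpe_sketch")  # hypothetical component name
def drop_gpe_sketch(doc):
    # Keep every entity except those labelled GPE.
    doc.ents = [ent for ent in doc.ents if ent.label_ != "GPE"]
    return doc

nlp_rules = spacy.blank("en")
ruler = nlp_rules.add_pipe("entity_ruler")
ruler.add_patterns([
    {"label": "GPE", "pattern": "Britain"},
    {"label": "PERSON", "pattern": "Mary"},
])

text = "Britain is a place. Mary is a doctor."
before = [(ent.text, ent.label_) for ent in nlp_rules(text).ents]

# add_pipe appends by default, so the filter runs after the ruler.
nlp_rules.add_pipe("drop_gpe_sketch")
after = [(ent.text, ent.label_) for ent in nlp_rules(text).ents]

print(before)
print(after)
```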


<a id ='regex'></a>
### Incorporating RegEx

In [148]:
import re
from spacy.tokens import Span
## the Matcher's REGEX operator is applied to one token at a time,
## so it can't match a multi-token span; Python's re module,
## run on the raw text, can still find such patterns,
## and we then map its character offsets back to token spans
text = 'Paul Newman was an American actor, but Paul Hollywood is a British TV Host. The name Paul is a common name.'
re_pattern = r'Paul [A-Z]\w+'
## we can simply get all the matches from the text
re_matches = re.finditer(re_pattern, text)
for match in re_matches:
    print(match)
## next, we need to create a blank English model
nlp = spacy.blank('en')
doc = nlp(text)
## first creating a list of our entities
## which should be empty, given that the model is blank
original_ents = list(doc.ents)
new_ent = []
for match in re.finditer(re_pattern, doc.text):
    start, end = match.span()
    span = doc.char_span(start, end)
    print(span)
    ## now we can add these to our ent list
    if span is not None:
        new_ent.append((span.start, span.end, span.text))
    ## the important note here is that now the start and end
    ## are in token index, as opposed to char, which was the case
    ## with our original spans from regex
print(new_ent)
## next we can inject these into our doc entities
for ent in new_ent:
    start, end, name = ent
    new_span = Span(doc,start=start, end=end, label="PERSON")
    original_ents.append(new_span)
## and then update the doc ents with the new list
doc.ents = original_ents
print(doc.ents)

<re.Match object; span=(0, 11), match='Paul Newman'>
<re.Match object; span=(39, 53), match='Paul Hollywood'>
Paul Newman
Paul Hollywood
[(0, 2, 'Paul Newman'), (8, 10, 'Paul Hollywood')]
(Paul Newman, Paul Hollywood)
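
The `span is not None` check matters because `doc.char_span` returns `None` whenever the character offsets do not line up with token boundaries. A minimal sketch, assuming a blank English pipeline and a regex that deliberately stops in the middle of a token; `alignment_mode="expand"` is one way to snap such a match outward to whole tokens.

```python
import re
import spacy

nlp_blank = spacy.blank("en")
doc = nlp_blank("Paul Newman was an American actor.")

# This pattern ends after "New", i.e. inside the token "Newman".
match = re.search(r"Paul New", doc.text)
start, end = match.span()

strict = doc.char_span(start, end)                             # default: strict alignment
expanded = doc.char_span(start, end, alignment_mode="expand")  # grow to token edges

print(strict)
print(expanded.text, expanded.start, expanded.end)
```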


In [156]:
from spacy.util import filter_spans
## these steps become reusable once we wrap them in a custom component
@Language.component('add_person')
def add_person(doc):
    re_pattern = r'Paul [A-Z]\w+'
    new_ent = []
    for match in re.finditer(re_pattern, doc.text):
        start, end = match.span()
        span = doc.char_span(start, end)
        if span is not None:
            ## only build and store a Span when the regex match
            ## lines up with token boundaries
            new_span = Span(doc, span.start, span.end, "PERSON")
            new_ent.append(new_span)
    ## doc.ents can't hold overlapping entities,
    ## so we use filter_spans to keep only the
    ## longest span among any that overlap
    filtered_ents = filter_spans(new_ent)
    doc.ents = filtered_ents
    return doc
nlp = spacy.blank('en')
# nlp = spacy.load('en_core_web_sm')
nlp.add_pipe('add_person')
text = 'Paul Newman was an American actor, but Paul Hollywood is a British TV Host. The name Paul is a common name.'
doc = nlp(text)
for ents in doc.ents:
    print(ents.text, ents.label_)

Paul Newman PERSON
Paul Hollywood PERSON


<a id ='findata'></a>
### Working on Financial Data

In [172]:
## we need pandas to load our data
import pandas as pd
stock_df = pd.read_csv('C://Users//12145//Documents//GitHub//Python//data//stocks.tsv', sep='\t')
with open('C://Users//12145//Documents//GitHub//Python//data//article.txt', 'r') as a:
    article = a.read()
## storing the symbols and company names into pattern
## this won't work: each entry's 'pattern' must be a single string
## (or a list of token dicts), not the whole list of values
# pattern = [{'label':'STOCK', 'pattern':stock_df.Symbol.tolist()}, {'label':'COMPANY', 'pattern':stock_df.CompanyName.tolist()}]
pattern=[{'label':'STOCK', 'pattern':s} for s in stock_df.Symbol.tolist()]
pattern.extend([{'label':'COMPANY', 'pattern':s} for s in stock_df.CompanyName.tolist()])
## and lets create a blank model
nlp = spacy.blank('en')
entity_r = nlp.add_pipe('entity_ruler')
entity_r.add_patterns(pattern)
doc = nlp(article)
for ent in doc.ents:
    print(ent.text, ent.label_)

Apple COMPANY
Apple COMPANY
Apple COMPANY
Nasdaq COMPANY
two COMPANY
ET STOCK
Nasdaq COMPANY
JD.com COMPANY
Kroger COMPANY
Nasdaq COMPANY
Nasdaq COMPANY


In [176]:
## and to make it easier to see where these were referenced
from spacy import displacy
displacy.render(doc, style='ent')