# 4)Vocabulary and Matching


#    Rule-Based Matching



spaCy's rule-based matcher engine and its components not only help you find the words and phrases you are looking for, they also give you access to the matched tokens within the document and to their relationships.
This means you can also inspect and analyze the tokens surrounding a match.
https://explosion.ai/demos/matcher

In [151]:
#load our model
import spacy
nlp=spacy.load('en_core_web_sm')

In [152]:
#import the Matcher class
from spacy.matcher import Matcher
matcher=Matcher(nlp.vocab) #create a Matcher object that shares the pipeline's vocabulary
#named rules can be added to or removed from the matcher as needed

In [153]:
#Each pattern is a list of dictionaries; each dictionary describes one token and its attributes.
#"hello world" can appear in several forms:
#Hello World  hello World  Hello WORLD
#Hello-World
patterns = [
    [{"LOWER": "hello"}, {"IS_PUNCT": True}, {"LOWER": "world"}],
    [{"LOWER": "hello"}, {"LOWER": "world"}]
]
matcher.add("HelloWorld", patterns)  #LOWER and IS_PUNCT are token attributes

In [154]:
#create a document
doc=nlp("'Hello World' are the first two printed words for most of the programmers, printing 'Hello-World' is the most common for beginners")

In [155]:
doc

'Hello World' are the first two printed words for most of the programmers, printing 'Hello-World' is the most common for beginners

Finding the matches

In [156]:
find_matches=matcher(doc) #pass the doc to the matcher and store the result
print(find_matches) #each match is a tuple (match ID hash, start token index, end token index)

[(15578876784678163569, 1, 3), (15578876784678163569, 18, 21)]


In [157]:
#loop over the matches, look up each rule name and slice out the matched span
for match_id,start,end in find_matches:
  string_id=nlp.vocab.strings[match_id]
  span=doc[start:end]
  print(match_id, string_id,start,end,span.text)

15578876784678163569 HelloWorld 1 3 Hello World
15578876784678163569 HelloWorld 18 21 Hello-World
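
The loop above looks up the rule name and slices the span by hand. As a minimal sketch, assuming spaCy v3 (where calling the matcher accepts `as_spans=True`), the matcher can hand back `Span` objects directly. The blank English pipeline and the short sample sentence are assumptions made here so the snippet runs without downloading a model; they are not part of the original notebook.

```python
import spacy
from spacy.matcher import Matcher

# A blank pipeline only tokenizes, which is enough for LOWER and IS_PUNCT patterns.
nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)
matcher.add("HelloWorld", [
    [{"LOWER": "hello"}, {"IS_PUNCT": True}, {"LOWER": "world"}],
    [{"LOWER": "hello"}, {"LOWER": "world"}],
])

doc = nlp("Hello World and hello-world")
# as_spans=True returns Span objects whose label_ is the rule name.
spans = matcher(doc, as_spans=True)
for span in spans:
    print(span.label_, span.start, span.end, span.text)

# Rules can be checked for and removed by name.
print("HelloWorld" in matcher)
matcher.remove("HelloWorld")
print(len(matcher))
```

Working with spans avoids the manual `nlp.vocab.strings[match_id]` lookup, at the cost of requiring a recent spaCy version.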


Setting pattern options and quantifiers

In [158]:
#redefine the patterns
patterns1 = [
    [{'LOWER':'hello'},{'IS_PUNCT':True,'OP':'*'},{'LOWER':'world'}],
    [{'LOWER':'hello'},{'LOWER':'world'}]
]
#'OP':'*' lets the punctuation token match zero or more times
#note: 'Hello World' is a new rule name, so this registers a second rule alongside 'HelloWorld'
matcher.add('Hello World', patterns1)

In [159]:
doc2=nlp('You can print Hello World or hello world or Hello-World')

In [160]:
find_matches=matcher(doc2)
print(find_matches)

[(15578876784678163569, 3, 5), (8585552006568828647, 3, 5), (15578876784678163569, 6, 8), (8585552006568828647, 6, 8), (15578876784678163569, 9, 12), (8585552006568828647, 9, 12)]
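
Every span appears twice in this output because the same text is matched by two separately named rules. A minimal sketch in plain Python, using the list printed above, groups the matches by span so the duplicates are easy to see. The second rule name in `rule_names` is inferred from the `matcher.add('Hello World', ...)` call and is not shown in any output, so treat it as an assumption.

```python
from collections import defaultdict

# Match tuples copied from the output above: (match ID hash, start, end).
find_matches = [
    (15578876784678163569, 3, 5), (8585552006568828647, 3, 5),
    (15578876784678163569, 6, 8), (8585552006568828647, 6, 8),
    (15578876784678163569, 9, 12), (8585552006568828647, 9, 12),
]
# The first name is confirmed by the earlier loop output; the second is assumed.
rule_names = {
    15578876784678163569: "HelloWorld",
    8585552006568828647: "Hello World",
}

# Group rule names by the (start, end) span they matched.
rules_by_span = defaultdict(list)
for match_id, start, end in find_matches:
    rules_by_span[(start, end)].append(rule_names[match_id])

unique_spans = sorted(rules_by_span)
print(unique_spans)
for span in unique_spans:
    print(span, rules_by_span[span])
```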



# Phrase Matching







In [161]:
#import the PhraseMatcher class
import spacy
nlp=spacy.load('en_core_web_sm')
from spacy.matcher import PhraseMatcher
matcher=PhraseMatcher(nlp.vocab)

In [162]:
phrase_list=['Barack Obama','Angela Merkel','Washington, D.C.']

In [163]:
#convert each phrase to a Doc object using a list comprehension
phrase_patterns=[nlp(text) for text in phrase_list]

In [164]:
phrase_patterns #a list of Doc objects, not strings

[Barack Obama, Angela Merkel, Washington, D.C.]

In [165]:
type(phrase_patterns[0]) 

spacy.tokens.doc.Doc

In [166]:
#pass the list of Doc objects to the matcher (same add(key, patterns) form as used for Matcher above)
matcher.add('TerminologyList',phrase_patterns)

In [167]:
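#note: the two adjacent string literals are joined without a space, so the text contains
#'Obamaconverse' and 'Barack Obama' is not matched below
doc3=nlp('German Chancellor Angela Merkel and US President Barack Obama'
          'converse in the Oval Office inside the White House in Washington,D.C.')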

In [168]:
find_matches=matcher(doc3)
print(find_matches)

[(3766102292120407359, 2, 4), (3766102292120407359, 18, 21)]


In [169]:
#loop over the matches, look up the rule name and slice out the matched span
for match_id, start,end in find_matches:
  string_id=nlp.vocab.strings[match_id]
  span=doc3[start:end] #get matched span
  print(match_id,string_id,start,end ,span.text)

3766102292120407359 TerminologyList 2 4 Angela Merkel
3766102292120407359 TerminologyList 18 21 Washington,D.C.
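
By default the phrase matcher compares the exact token text, so a differently cased phrase would be missed. A minimal sketch, assuming the `attr="LOWER"` option of `PhraseMatcher` and using a blank English pipeline so no model download is needed, shows case-insensitive phrase matching. The sample sentence is made up for illustration.

```python
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.blank("en")
# attr="LOWER" compares the lowercase form of each token instead of the exact text.
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
phrase_list = ["Barack Obama", "Angela Merkel"]
# make_doc only runs the tokenizer, which is all the patterns need.
matcher.add("TerminologyList", [nlp.make_doc(text) for text in phrase_list])

doc = nlp("angela merkel met BARACK OBAMA in Berlin")
matched = sorted((start, end, doc[start:end].text) for _, start, end in matcher(doc))
print(matched)
```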


# 5)Part-of-Speech Tagging

# POS Tagging

In [170]:
s1='Apple is looking at buying U.K. startup for $1 billion'

In [171]:
import spacy
nlp=spacy.load(name='en_core_web_sm')

In [172]:
doc=nlp(s1)
for token in doc:
  print(token.text,token.pos_,token.tag_,spacy.explain(token.tag_)) #tag_: fine-grained tag as a string; spacy.explain returns its description
  #pos_: the token's coarse-grained universal part of speech

Apple PROPN NNP noun, proper singular
is AUX VBZ verb, 3rd person singular present
looking VERB VBG verb, gerund or present participle
at ADP IN conjunction, subordinating or preposition
buying VERB VBG verb, gerund or present participle
U.K. PROPN NNP noun, proper singular
startup NOUN NN noun, singular or mass
for ADP IN conjunction, subordinating or preposition
$ SYM $ symbol, currency
1 NUM CD cardinal number
billion NUM CD cardinal number


In [173]:
for key,val in doc.count_by(spacy.attrs.POS).items():
  print(key,doc.vocab[key].text,val)

96 PROPN 2
87 AUX 1
100 VERB 2
85 ADP 2
92 NOUN 1
99 SYM 1
93 NUM 2
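
`count_by` returns a dictionary keyed by integer IDs, which is why the loop above looks each key up in the vocabulary. A minimal sketch of turning that result into a name-to-count dictionary follows. It assumes spaCy v3, where a `Doc` can be built by hand with a `pos` list; the words and tags are copied from the tagger output above, so no model is needed to run it.

```python
import spacy
from spacy.attrs import POS
from spacy.tokens import Doc

nlp = spacy.blank("en")
# Words and coarse-grained tags copied from the tagger output above.
words = ["Apple", "is", "looking", "at", "buying", "U.K.",
         "startup", "for", "$", "1", "billion"]
pos = ["PROPN", "AUX", "VERB", "ADP", "VERB", "PROPN",
       "NOUN", "ADP", "SYM", "NUM", "NUM"]
doc = Doc(nlp.vocab, words=words, pos=pos)

# Map each integer POS ID back to its string name.
pos_counts = {doc.vocab[key].text: count for key, count in doc.count_by(POS).items()}
print(pos_counts)
```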


In [174]:
from spacy import displacy

In [175]:
displacy.render(docs=doc,style='dep',options={'distance':100},jupyter=True)

# 6)Named Entity Recognition

https://spacy.io/usage/linguistic-features#named-entities

spaCy can identify a variety of named and numeric entities, including companies, locations, organizations and products, by asking the model for a prediction.

In [176]:
s1='Apple is looking at buying U.K. startup for $1 billion'
s2='San Francisco considers banning sidewalk delivery robots'
s3='facebook is hiring a new vice president in U.S.'

In [177]:
import spacy
nlp=spacy.load(name='en_core_web_sm')

In [178]:
doc5=nlp(s1)
doc5.ents  #Named entities are available as the ents property of a Doc

(Apple, U.K., $1 billion)

In [179]:
for ent in doc5.ents:
  print(ent.text,ent.label_,str(spacy.explain(ent.label_)))

Apple ORG Companies, agencies, institutions, etc.
U.K. GPE Countries, cities, states
$1 billion MONEY Monetary values, including unit


In [180]:
doc6=nlp(s2)
doc6.ents

(San Francisco,)

In [181]:
for ent in doc6.ents:
  print(ent.text,ent.label_,str(spacy.explain(ent.label_)))

San Francisco GPE Countries, cities, states


In [182]:
doc7=nlp(s3)
doc7.ents

(U.S.,)

In [183]:
for ent in doc7.ents:
  print(ent.text,ent.label_,str(spacy.explain(ent.label_)))
#this time the model does not recognize facebook as an entity

U.S. GPE Countries, cities, states


In [184]:
from spacy.tokens import Span
ORG=doc7.vocab.strings['ORG'] #hash value of the ORG label
new_ent=Span(doc7,0,1,label=ORG) #the span must be built on doc7, the doc that receives the entity
doc7.ents=list(doc7.ents)+[new_ent]

In [185]:
doc7.ents

(facebook, U.S.)
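
The manual step above can be tried in isolation. The following is a minimal sketch, assuming a blank English pipeline (which has no NER component, so `doc.ents` starts empty) and a string label passed directly to `Span` instead of the hash looked up above. The key point is that the `Span` is created from the same `Doc` it is later assigned to.

```python
import spacy
from spacy.tokens import Span

nlp = spacy.blank("en")  # tokenizer only, so no entities are predicted
doc = nlp("facebook is hiring a new vice president")

# Build the span on the same Doc that will receive the entity.
new_ent = Span(doc, 0, 1, label="ORG")
doc.ents = list(doc.ents) + [new_ent]

entities = [(ent.text, ent.label_) for ent in doc.ents]
print(entities)
```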

In [186]:
from spacy import displacy
displacy.render(docs=doc5,style='ent',jupyter=True)

In [187]:
displacy.render(docs=doc5,style='ent',options={'ents':['ORG']},jupyter=True)
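
Outside a notebook, the same visualization can be captured as markup. This is a minimal sketch, assuming `jupyter=False` makes `displacy.render` return the HTML as a string; the sentence and the hand-set entities are illustrative only and stand in for model predictions.

```python
import spacy
from spacy import displacy
from spacy.tokens import Span

nlp = spacy.blank("en")
doc = nlp("Apple is opening an office in London")
# Entities are set by hand here because the blank pipeline has no NER component.
doc.ents = [Span(doc, 0, 1, label="ORG"), Span(doc, 6, 7, label="GPE")]

# Restrict the highlighting to ORG entities and return the HTML instead of displaying it.
html = displacy.render(doc, style="ent", options={"ents": ["ORG"]}, jupyter=False)
print(len(html) > 0)
```

The returned string can be written to an `.html` file and opened in a browser.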

# 7)Sentence Segmentation 

In [188]:
s4='This is a sentence.This is second sentence. This is last sentence.'
s5='This is a sentence;This is second sentence;This is last sentence.'

In [189]:
import spacy
nlp=spacy.load(name='en_core_web_sm')

In [190]:
doc8=nlp(s4)

In [191]:
for sent in doc8.sents:
  print(sent.text)

This is a sentence.
This is second sentence.
This is last sentence.


In [192]:
s6='This is a sentence.This is second U.K. sentence. This is last sentence.'

In [193]:
doc9=nlp(s6)
for sent in doc9.sents:
  print(sent.text)  #the periods inside U.K. do not trigger a sentence split

This is a sentence.
This is second U.K. sentence.
This is last sentence.


In [194]:
doc10=nlp(s5)
for sent in doc10.sents:
  print(sent.text) #semicolons are not sentence boundaries by default, so the text is not split at ';'

This is a sentence;This is second
sentence;This is last sentence.


In [195]:
def set_custom_boundaries(doc):
  for token in doc[:-1]:
    if token.text==';':
      print(token.i) #index of the semicolon token
      doc[token.i+1].is_sent_start=True #the token after ';' starts a new sentence
  return doc #return the doc once, after all tokens have been checked
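
A function like this only takes effect once it runs as part of the pipeline. The following is a minimal sketch of one way to wire it up, assuming spaCy v3's `@Language.component` decorator and `nlp.add_pipe` with a component name. It uses a blank English pipeline and a sample sentence with a space after each semicolon, both chosen here so that the example runs without a model and the semicolons are tokenized separately.

```python
import spacy
from spacy.language import Language

@Language.component("set_custom_boundaries")
def set_custom_boundaries(doc):
    # Mark the token after each semicolon as the start of a new sentence.
    for token in doc[:-1]:
        if token.text == ";":
            doc[token.i + 1].is_sent_start = True
    return doc

nlp = spacy.blank("en")
nlp.add_pipe("set_custom_boundaries")

doc = nlp("This is a sentence; This is second sentence; This is last sentence.")
sentences = [sent.text for sent in doc.sents]
print(sentences)
```

With a trained pipeline such as `en_core_web_sm`, a component of this kind is typically added before the parser (`nlp.add_pipe("set_custom_boundaries", before="parser")`), because sentence starts cannot be changed after the document has been parsed.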