## Vocabulary and Phrase Matching with spaCy
<p>The spaCy library comes with a Matcher tool that lets you specify custom rules for phrase matching. Using the Matcher is straightforward: first, define the patterns that you want to match; next, add the patterns to the Matcher; finally, apply the Matcher to the document that you want to search. This is best explained with an example.<br>

For rule-based matching, you need to perform the following steps:</p>

# Rule-Based Matching

In [5]:
import spacy
from nltk.tokenize import word_tokenize
nlp = spacy.load('en_core_web_sm')

from spacy.matcher import Matcher
m_tool = Matcher(nlp.vocab)

## Stop words

In [19]:
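# Despite its name, this list holds punctuation marks rather than stop words; set() drops the duplicated apostrophe.
all_stopwords = list(set([",","!",'"','#','$','%','&','\\',"'","(",")",'*','+','-','.','/',':',';','<','=','>','?','@','[',']','^','_','`','{','|','}','~',"'"]))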

text = "Nick likes to play football, however he is not too fond of tennis. ?"
text_tokens = word_tokenize(text)
tokens_without_sw = [word for word in text_tokens if word not in all_stopwords]
separator = " "

print(separator.join(tokens_without_sw))

Nick likes to play football however he is not too fond of tennis
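
The hand-written list above contains the same 32 characters as Python's built-in string.punctuation, so the same filter can be sketched with the standard library alone. The regular-expression tokenizer below is a simplified stand-in for word_tokenize (an assumption made to keep the example free of dependencies), and the names TOKEN_RE and strip_punctuation are hypothetical.

```python
import re
import string

# Simplified tokenizer: runs of word characters, or single non-space symbols.
# This is an illustrative stand-in for nltk's word_tokenize, not a replacement.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")

PUNCTUATION = set(string.punctuation)


def strip_punctuation(text):
    """Return text with standalone punctuation tokens removed."""
    tokens = TOKEN_RE.findall(text)
    return " ".join(tok for tok in tokens if tok not in PUNCTUATION)


text = "Nick likes to play football, however he is not too fond of tennis. ?"
cleaned = strip_punctuation(text)
print(cleaned)
```

For text with contractions or abbreviations, a real tokenizer such as word_tokenize or spaCy's own tokenizer is likely to split more sensibly than this regular expression.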


## How to Use the spaCy Lemmatizer to Reduce a Word to Its Base Form

In [27]:
doc = nlp("I need an overview on invoice")

for token in doc:
    print(token, token.lemma_, token.pos_)

I -PRON- PRON
need need VERB
an an DET
overview overview NOUN
on on ADP
invoice invoice NOUN


## Defining Patterns
<p>The next step is to define the patterns that will be used to filter similar phrases. Suppose we want to find the phrases "quick-brown-fox", "quick brown fox", "quickbrownfox" or "quick brownfox". To do so, we need to create the following four patterns:</p>

- p1 looks for the phrase "quickbrownfox"
- p2 looks for the phrase "quick-brown-fox"
- p3 looks for the phrase "quick brown fox"
- p4 looks for the phrase "quick brownfox"

In [39]:
p1 = [{"LOWER": "quickbrownfox"}]
p2 = [{'LOWER': 'quick'}, {'IS_PUNCT': True}, {'LOWER': 'brown'}, {'IS_PUNCT': True}, {'LOWER': 'fox'}]
p3 = [{'LOWER': 'quick'}, {'LOWER': 'brown'}, {'LOWER': 'fox'}]
p4 = [{'LOWER': 'quick'}, {'LOWER': 'brownfox'}]

<p>The token attribute LOWER matches against the lowercase form of each token's text, which makes the patterns case-insensitive.
<br>
Once the patterns are defined, we need to add them to the Matcher object that we created earlier.</p>
<p>Here "QBF" is the name of our matcher. You can give it any name.</p>

In [40]:
# spaCy v2 signature; spaCy v3 expects m_tool.add('QBF', [p1, p2, p3, p4])
m_tool.add('QBF', None, p1, p2, p3, p4)
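
The call above passes None as the callback followed by one argument per pattern, which is the spaCy v2 form. In spaCy v3, Matcher.add takes all patterns for a key as a single list. The sketch below assumes spaCy v3 and uses spacy.blank('en'), which provides the English tokenizer without a downloaded model; the names nlp_blank, demo_matcher and found are illustrative only.

```python
import spacy
from spacy.matcher import Matcher

# Assumption: spaCy v3. A blank pipeline is enough because the patterns
# only use token-level attributes (LOWER, IS_PUNCT), not tags or lemmas.
nlp_blank = spacy.blank("en")
demo_matcher = Matcher(nlp_blank.vocab)

patterns = [
    [{"LOWER": "quickbrownfox"}],
    [{"LOWER": "quick"}, {"IS_PUNCT": True}, {"LOWER": "brown"},
     {"IS_PUNCT": True}, {"LOWER": "fox"}],
    [{"LOWER": "quick"}, {"LOWER": "brown"}, {"LOWER": "fox"}],
    [{"LOWER": "quick"}, {"LOWER": "brownfox"}],
]
# v3 signature: all patterns for one key are passed as a single list.
demo_matcher.add("QBF", patterns)

demo_doc = nlp_blank(
    "The quick-brown-fox jumps. The Quick Brown Fox eats. The quickbrownfox sleeps."
)
found = sorted(demo_doc[start:end].text for _, start, end in demo_matcher(demo_doc))
print(found)
```

Because LOWER compares lowercase token text, the capitalized "Quick Brown Fox" is matched as well.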

<p>Our matcher is ready. The next step is to apply it to a text document and see whether we get any matches. Let's first create a simple document:</p>

In [44]:
sentence = nlp(u"The quick-brown-fox jumps over the lazy dog. The quick brown fox eats well. The quickbrownfox is dead. The dog misses the quick brownfox")

<p>To apply the matcher, pass the document to the matcher object as an argument. The result is a list of tuples, each holding the ID of the matched pattern name together with the start and end token positions of the match. Execute the following script:</p>

In [45]:
phrase_matches = m_tool(sentence)
print(phrase_matches)

[(12825528024649263697, 1, 6), (12825528024649263697, 13, 16), (12825528024649263697, 20, 21), (12825528024649263697, 28, 30)]


In [47]:
for match_id, start, end in phrase_matches:
    string_id = nlp.vocab.strings[match_id]  # look up the matcher name from its hash
    span = sentence[start:end]               # slice the matched tokens out of the document
    print(match_id, string_id, start, end, '\t', span.text)

12825528024649263697 QBF 1 6 	 quick-brown-fox
12825528024649263697 QBF 13 16 	 quick brown fox
12825528024649263697 QBF 20 21 	 quickbrownfox
12825528024649263697 QBF 28 30 	 quick brownfox


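Let's write a single pattern that can identify phrases such as "quick--brown--fox" or "quick-brown---fox", where the punctuation between the words varies. The OP key sets a quantifier on a token pattern, and '*' means zero or more occurrences. The available pattern keys are described at https://spacy.io/usage/linguistic-features#adding-patterns-attributes
Let's first remove the previous matcher QBF.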

In [52]:
m_tool.remove('QBF')
p1 = [{'LOWER': 'quick'}, {'IS_PUNCT': True, 'OP':'*'}, {'LOWER': 'brown'}, {'IS_PUNCT': True, 'OP':'*'}, {'LOWER': 'fox'}]
m_tool.add('QBF', None, p1)  # spaCy v2 signature; in v3 use m_tool.add('QBF', [p1])
sentence = nlp(u'The quick--brown--fox jumps over the  quick-brown---fox')
phrase_matches = m_tool(sentence)

for match_id, start, end in phrase_matches:
    string_id = nlp.vocab.strings[match_id]
    span = sentence[start:end]
    print(match_id, string_id, start, end, span.text)

12825528024649263697 QBF 1 6 quick--brown--fox
12825528024649263697 QBF 10 15 quick-brown---fox
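
The effect of the quantifier can be checked by comparing '*' (zero or more) with '+' (one or more) on the same text. This is a minimal sketch assuming spaCy v3 and a blank English pipeline; match_texts and qbf_pattern are hypothetical helper names introduced for illustration.

```python
import spacy
from spacy.matcher import Matcher

nlp_blank = spacy.blank("en")  # tokenizer only; no model download needed


def match_texts(pattern, text):
    """Return the set of span texts that a single pattern matches in text."""
    matcher = Matcher(nlp_blank.vocab)
    matcher.add("DEMO", [pattern])  # list-style signature (spaCy v3)
    doc = nlp_blank(text)
    return {doc[start:end].text for _, start, end in matcher(doc)}


def qbf_pattern(op):
    """Build the quick/brown/fox pattern with the given punctuation quantifier."""
    return [
        {"LOWER": "quick"},
        {"IS_PUNCT": True, "OP": op},
        {"LOWER": "brown"},
        {"IS_PUNCT": True, "OP": op},
        {"LOWER": "fox"},
    ]


text = "A quick brown fox and a quick-brown-fox"
print(match_texts(qbf_pattern("*"), text))  # zero or more punctuation tokens
print(match_texts(qbf_pattern("+"), text))  # one or more punctuation tokens
```

With '*', the plain "quick brown fox" is matched too, because zero punctuation tokens satisfy the quantifier; with '+', only the hyphenated form qualifies.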


# Phrase-Based Matching
In the last section, we saw how to define rules that identify phrases in a document. Instead of defining rules, we can also directly specify the exact phrases that we are looking for.
When the goal is to match a fixed list of terms, this is a more efficient approach.

In this section, we will do phrase matching inside a Wikipedia article on artificial intelligence.

Before walking through the phrase-matching steps, load the language model and create a PhraseMatcher object. The article text itself is defined further down as a plain string. Execute the following script:

In [56]:
import spacy
nlp = spacy.load('en_core_web_sm')

from spacy.matcher import PhraseMatcher
phrase_matcher = PhraseMatcher(nlp.vocab)

# Create Phrase List
In the second step, you need to create a list of phrases to match and then convert the list to spaCy NLP documents as shown in the following script:

In [61]:
phrases = ['machine learning', 'robots', 'intelligent agents']
patterns = [nlp(text) for text in phrases]
phrase_matcher.add('AI', None, *patterns)  # spaCy v2 signature; in v3 use phrase_matcher.add('AI', patterns)
print(patterns)

[machine learning, robots, intelligent agents]


The phrase_matcher.add() call in the script above registers the phrase list with the phrase matcher under the name "AI".
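
Two optional refinements are worth sketching here. First, nlp.make_doc only tokenizes the text, which is all the phrase matcher needs for building patterns. Second, passing attr='LOWER' makes the comparison case-insensitive. The example assumes spaCy v3 (list-style add) and a blank English pipeline; ci_matcher, demo_phrases and hits are illustrative names.

```python
import spacy
from spacy.matcher import PhraseMatcher

nlp_blank = spacy.blank("en")

# attr="LOWER" compares lowercase token text, so "Machine Learning" also matches.
ci_matcher = PhraseMatcher(nlp_blank.vocab, attr="LOWER")

demo_phrases = ["machine learning", "robots", "intelligent agents"]
# make_doc only tokenizes, which is sufficient for phrase patterns.
ci_patterns = [nlp_blank.make_doc(text) for text in demo_phrases]
ci_matcher.add("AI", ci_patterns)  # list-style signature (spaCy v3)

demo_doc = nlp_blank("Machine Learning drives many robots.")
hits = [demo_doc[start:end].text for _, start, end in ci_matcher(demo_doc)]
print(hits)
```

Without attr='LOWER', the matcher compares the exact token text, so the capitalized "Machine Learning" would not be found.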
## Applying Matcher to the Document
Like rule-based matching, we again need to apply our phrase matcher to the document. However, the article is stored as a plain Python string, not as a spaCy document. Therefore, we first convert the article into a spaCy document and then apply our phrase matcher to it.

In [62]:
process = "In computer science, artificial intelligence (AI), sometimes called machine intelligence, is intelligence demonstrated by machines, in contrast to the natural intelligence displayed by humans. Leading AI textbooks define the field as the study of intelligent agents: any device that perceives its environment and takes actions that maximize its chance of successfully achieving its goals. Colloquially, the term artificial intelligence is often used to describe machines (or computers) that mimic cognitive functions that humans associate with the human mind, such as learning and problem solving. As machines become increasingly capable, tasks considered to require intelligence are often removed from the definition of AI, a phenomenon known as the AI effect. A quip in Tesler's Theorem says AI is whatever hasn't been done yet. For instance, optical character recognition is frequently excluded from things considered to be AI, having become a routine technology. Modern machine capabilities generally classified as AI include successfully understanding human speech, competing at the highest level in strategic game systems (such as chess and Go), autonomously operating cars, intelligent routing in content delivery networks, and military simulations. Artificial intelligence was founded as an academic discipline in 1956, and in the years since has experienced several waves of optimism, followed by disappointment and the loss of funding (known as an AI winter), followed by new approaches, success and renewed funding. For most of its history, AI research has been divided into subfields that often fail to communicate with each other. These sub-fields are based on technical considerations, such as particular goals (e.g. robotics or machine learning), the use of particular tools (logic or artificial neural networks), or deep philosophical differences. Subfields have also been based on social factors (particular institutions or the work of particular researchers). The traditional problems (or goals) of AI research include reasoning, knowledge representation, planning, learning, natural language processing, perception and the ability to move and manipulate objects. General intelligence is among the field's long-term goals. Approaches include statistical methods, computational intelligence, and traditional symbolic AI. Many tools are used in AI, including versions of search and mathematical optimization, artificial neural networks, and methods based on statistics, probability and economics. The AI field draws upon computer science, information engineering, mathematics, psychology, linguistics, philosophy, and many other fields. The field was founded on the assumption that human intelligence can be so precisely described that a machine can be made to simulate it. This raises philosophical arguments about the nature of the mind and the ethics of creating artificial beings endowed with human-like intelligence. These issues have been explored by myth, fiction and philosophy since antiquity. Some people also consider AI to be a danger to humanity if it progresses unabated. Others believe that AI, unlike previous technological revolutions, will create a risk of mass unemployment. In the twenty-first century, AI techniques have experienced a resurgence following concurrent advances in computer power, large amounts of data, and theoretical understanding; and AI techniques have become an essential part of the technology industry, helping to solve many challenging problems in computer science, software engineering and operations research."

In [64]:
article_doc = nlp(process)  # the matcher expects a spaCy Doc, not a plain string
matched_phrases = phrase_matcher(article_doc)
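
The phrase matcher returns the same (match_id, start, end) tuples as the rule-based matcher, so the results can be listed in the same way. The sketch below assumes spaCy v3 and a blank English pipeline, and it runs on a short excerpt that stands in for the full article; excerpt, demo_matcher and results are illustrative names.

```python
import spacy
from spacy.matcher import PhraseMatcher

nlp_blank = spacy.blank("en")
demo_matcher = PhraseMatcher(nlp_blank.vocab)
demo_matcher.add(
    "AI",
    [nlp_blank.make_doc(p) for p in ["machine learning", "robots", "intelligent agents"]],
)

# Short excerpt standing in for the full article string.
excerpt = (
    "Leading AI textbooks define the field as the study of "
    "intelligent agents. Sub-fields include robotics or machine learning."
)
excerpt_doc = nlp_blank(excerpt)  # the matcher needs a Doc, not a str

results = []
for match_id, start, end in demo_matcher(excerpt_doc):
    label = nlp_blank.vocab.strings[match_id]  # hash -> "AI"
    results.append((label, start, end, excerpt_doc[start:end].text))

# Sort by start position so the rows appear in document order.
for row in sorted(results, key=lambda r: r[1]):
    print(row)
```

Note that "robotics" is not reported: the phrase matcher compares whole tokens, so the pattern "robots" only matches that exact word.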

# Stop Words

In [66]:
import spacy
sp = spacy.load('en_core_web_sm')
print(sp.Defaults.stop_words)

{'ten', 'nobody', 'our', 'none', 'using', 'along', 'fifteen', 'except', 'forty', 'before', 'take', 'are', 'nine', 'while', 'often', 'whenever', 'after', 'his', 'top', 'twenty', 'hundred', 'anyway', 'anyhow', 'whither', 'here', 'done', 'indeed', 'various', 'already', "'m", 'between', 'four', 'against', 'ca', 'across', 'former', 'thereafter', 'fifty', 'which', 'seemed', 'him', 'perhaps', 'whereas', 'something', 'within', 'next', 'such', 'be', 'noone', 'they', 'name', 'both', 'give', 'made', 'us', 'somewhere', 'through', 'one', 'sometimes', 'really', 'yet', 'each', 'himself', 'together', 'their', 'few', 'regarding', 'whereupon', 'in', 'no', 'otherwise', 'sixty', 'third', 'two', 'my', 'down', 'these', 'n’t', 'thereupon', 'up', 'was', 'serious', 'most', 'some', 'how', 'an', 'latter', 'nor', 'around', 'being', 'everyone', 'for', 'is', 'unless', 'than', 'may', 'would', 'where', 'been', 'she', 'front', 'move', 'six', 'somehow', 'towards', 'bottom', 'go', 'did', 'since', 'twelve', 'becoming', '

To add or remove stop words in spaCy, you can use the sp.Defaults.stop_words.add() and sp.Defaults.stop_words.remove() methods respectively. Setting is_stop on the vocabulary entry as well updates the flag on the word's lexeme directly.

In [70]:
sp.Defaults.stop_words.add('wonder')
sp.vocab['wonder'].is_stop = True

In [71]:
sp.vocab['wonder'].is_stop

True
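
Both directions, adding and removing, can be sketched together as a round trip. The example assumes a blank English pipeline (so no model download is needed); the variables added and removed are illustrative.

```python
import spacy

nlp_blank = spacy.blank("en")

# Add: register the word in the defaults and flag its lexeme explicitly.
nlp_blank.Defaults.stop_words.add("wonder")
nlp_blank.vocab["wonder"].is_stop = True
added = nlp_blank.vocab["wonder"].is_stop

# Remove: undo both steps so the word is treated as content again.
nlp_blank.Defaults.stop_words.remove("wonder")
nlp_blank.vocab["wonder"].is_stop = False
removed = nlp_blank.vocab["wonder"].is_stop

print(added, removed)
```

Since Defaults.stop_words is shared by every English pipeline in the same process, undoing a temporary change avoids surprising other code.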

In [14]:
import re

# Check whether a query mentions a past time frame.
sentence = 'week revenue'
search = re.search(r"(?:last|past|ago|previous)", sentence)
is_past = search.group() if search else False  # matched word, or False when nothing matches
print(is_past)

False
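
The pattern above matches substrings, so a word such as "pastry" would also count as a hit on "past". A possible refinement, sketched here with hypothetical names PAST_RE and find_past_marker, is to anchor the alternatives with word boundaries and ignore case.

```python
import re

# \b anchors each alternative to whole words, so "pastry" does not count as "past".
PAST_RE = re.compile(r"\b(?:last|past|ago|previous)\b", re.IGNORECASE)


def find_past_marker(sentence):
    """Return the matched past-time marker, or False when none is present."""
    match = PAST_RE.search(sentence)
    return match.group() if match else False


for s in ("week revenue", "Last week revenue", "pastry revenue"):
    print(s, "->", find_past_marker(s))
```

Returning False for a miss keeps the behavior of the original snippet; returning None would be the more conventional choice if callers only test truthiness.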
