# 6. How to Use the spaCy Matcher


In [1]:
import spacy 
from spacy.matcher import Matcher

## 6.1 Basic Example

In [8]:
nlp = spacy.load("en_core_web_sm")
# print(nlp)
matcher = Matcher(nlp.vocab)
# print(matcher)
pattern = [{"LIKE_EMAIL":True}]
# print(pattern)
matcher.add("EMAIL_ADDRESS", [pattern])
# print(matcher)
doc = nlp("This is email address: manish@outlook.com")
matches = matcher(doc)
print(matches)

[(16571425990740197027, 5, 6)]


The line matcher = Matcher(nlp.vocab) creates a rule-based matcher that shares its vocabulary with the spaCy pipeline. Its two parts are:

* **Matcher**: A spaCy class that matches sequences of tokens against patterns you define. It finds words and phrases in text using rule-based matching.
* **nlp.vocab**: The nlp object is a spaCy Language instance that holds the processing pipeline. Its .vocab attribute stores the vocabulary, including the lexical attributes of each word type.



The output [(16571425990740197027, 5, 6)] is a list of tuples, one per match. Each tuple has three elements:

* 16571425990740197027: The match ID, which is the hash of the rule name passed to matcher.add. Here it is the hash of the string “EMAIL_ADDRESS”.
* 5: The start index of the match, that is, the index of the first matched token in the Doc object.
* 6: The end index of the match. It is exclusive: it is the index of the token right after the last matched token.

Token indices start at 0, so the Matcher found one match that covers only the token at index 5 (the sixth token). For the input text “This is email address: manish@outlook.com”, that token is “manish@outlook.com”, the email address identified by the pattern [{"LIKE_EMAIL":True}].

In [9]:
print(nlp.vocab[matches[0][0]].text)

EMAIL_ADDRESS
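
As a minimal sketch, the match tuple can be unpacked into a rule name and a Span in one loop. A blank English pipeline is assumed here instead of en_core_web_sm, since LIKE_EMAIL is a lexical attribute and should not need a trained model; the results list is only an illustrative name.

```python
import spacy
from spacy.matcher import Matcher

# Assumption: a blank pipeline (tokenizer only) is enough for lexical attributes.
nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)
matcher.add("EMAIL_ADDRESS", [[{"LIKE_EMAIL": True}]])

doc = nlp("This is email address: manish@outlook.com")

results = []  # illustrative container, not part of the spaCy API
for match_id, start, end in matcher(doc):
    rule_name = nlp.vocab.strings[match_id]  # look up the rule name from its hash
    span = doc[start:end]                    # the end index is exclusive
    results.append((rule_name, start, end, span.text))
    print(rule_name, start, end, span.text)
```

In spaCy v3, calling matcher(doc, as_spans=True) should return Span objects labeled with the rule name instead of tuples, which avoids the manual slicing.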


## 6.2. Attributes Taken by Matcher
* ORTH - The exact verbatim text of a token (str)

* TEXT - The exact verbatim text of a token (str)

* LOWER - The lowercase form of the token text (str)

* LENGTH - The length of the token text (int)

* IS_ALPHA

* IS_ASCII

* IS_DIGIT

* IS_LOWER

* IS_UPPER

* IS_TITLE

* IS_PUNCT

* IS_SPACE

* IS_STOP

* IS_SENT_START

* LIKE_NUM

* LIKE_URL

* LIKE_EMAIL

* SPACY

* POS

* TAG

* MORPH

* DEP

* LEMMA

* SHAPE

* ENT_TYPE

* _ - Custom extension attributes (Dict[str, Any])

* OP
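
To illustrate how a few of these attributes combine, here is a small sketch that uses only lexical attributes (LOWER, IS_PUNCT, LIKE_NUM, IS_ALPHA) plus the OP quantifier. The rule names GREETING and QUANTITY, the sample sentence, and the found list are made up for this example, and a blank English pipeline is assumed because none of these attributes needs a trained model.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")  # assumption: tokenizer-only pipeline
matcher = Matcher(nlp.vocab)

# "hello", optionally followed by one punctuation token, then "world" (any casing)
greeting = [{"LOWER": "hello"}, {"IS_PUNCT": True, "OP": "?"}, {"LOWER": "world"}]
# a number-like token followed by a purely alphabetic token
quantity = [{"LIKE_NUM": True}, {"IS_ALPHA": True}]

matcher.add("GREETING", [greeting])
matcher.add("QUANTITY", [quantity])

doc = nlp("Hello, world! Hello world, here are 10 tokens.")

# Collect (start, rule name, text) and sort by position in the document.
found = sorted(
    (start, nlp.vocab.strings[match_id], doc[start:end].text)
    for match_id, start, end in matcher(doc)
)
for item in found:
    print(item)
```

Each dictionary in a pattern describes one token, and all keys inside a dictionary must hold for that token to match.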

## 6.3 Applied Matcher

In [15]:
# The raw string (r"...") keeps backslashes in the Windows path from being treated as escapes.
with open(r"F:\spaCy-master\data\wiki_mlk.txt", "r") as f:
    text = f.read()
    # print(text)
nlp = spacy.load("en_core_web_sm")
# print(nlp)





In [17]:
matcher = Matcher(nlp.vocab)
pattern = [{"POS":"PROPN"}]  # proper noun 
matcher.add("PROPER_NOUNS", [pattern])
doc = nlp(text)
matches = matcher(doc)
# print(len(matches))
for match in matches[:10]:
    print(match, doc[match[1]: match[2]])


(3232560085755078826, 0, 1) Martin
(3232560085755078826, 1, 2) Luther
(3232560085755078826, 2, 3) King
(3232560085755078826, 3, 4) Jr.
(3232560085755078826, 6, 7) Michael
(3232560085755078826, 7, 8) King
(3232560085755078826, 8, 9) Jr.
(3232560085755078826, 10, 11) January
(3232560085755078826, 15, 16) April
(3232560085755078826, 23, 24) Baptist


## 6.4.1 Improving it with Multi-Word Tokens 


In [24]:
matcher = Matcher(nlp.vocab)
pattern = [{"POS":"PROPN", "OP":"+"}]
matcher.add("PROPER_NOUN",[pattern])
doc = nlp(text)
# print(doc)
matches = matcher(doc)
# print(matches)
for match in matches[:10]:
    print(match, doc[match[1]: match[2]])


(451313080118390996, 0, 1) Martin
(451313080118390996, 0, 2) Martin Luther
(451313080118390996, 1, 2) Luther
(451313080118390996, 0, 3) Martin Luther King
(451313080118390996, 1, 3) Luther King
(451313080118390996, 2, 3) King
(451313080118390996, 0, 4) Martin Luther King Jr.
(451313080118390996, 1, 4) Luther King Jr.
(451313080118390996, 2, 4) King Jr.
(451313080118390996, 3, 4) Jr.


* pattern = [{"POS":"PROPN", "OP":"+"}]: Defines the pattern the Matcher looks for: one or more consecutive tokens with the part-of-speech tag PROPN (“proper noun”). The OP key sets the operator, and "+" means “one or more times”. Because no greedy option is set, the Matcher returns every possible match, which is why the output above lists overlapping spans such as “Martin”, “Martin Luther”, and “Luther”.

* matcher.add("PROPER_NOUN",[pattern]): Adds the pattern to the Matcher under the rule name “PROPER_NOUN”. Every match for this pattern carries the hash of that rule name as its match ID.
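
The overlap produced by "+" can be reproduced without a trained model. The sketch below builds a Doc by hand and assigns part-of-speech tags manually; the words and tags are assumptions chosen to mirror the first tokens of the output above, not output from en_core_web_sm.

```python
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Doc

nlp = spacy.blank("en")

# Hand-assigned, hypothetical annotations so the example does not depend on a tagger.
words = ["Martin", "Luther", "King", "Jr.", "was", "born"]
pos = ["PROPN", "PROPN", "PROPN", "PROPN", "AUX", "VERB"]
doc = Doc(nlp.vocab, words=words, pos=pos)

matcher = Matcher(nlp.vocab)
matcher.add("PROPER_NOUN", [[{"POS": "PROPN", "OP": "+"}]])

# Without a greedy setting, every contiguous run of proper nouns is a match.
spans = sorted((start, end) for _, start, end in matcher(doc))
for start, end in spans:
    print((start, end), doc[start:end].text)
print(len(spans), "matches for a run of 4 proper nouns")
```

A run of n proper nouns yields n(n+1)/2 matches this way, so the number of results grows quickly on real text.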

## 6.4.3 Sorting by Appearance

In [25]:
matcher = Matcher(nlp.vocab)
pattern = [{"POS":"PROPN", "OP":"+"}]
matcher.add("PROPER_NOUNS", [pattern], greedy="LONGEST")
doc = nlp(text)
matches = matcher(doc)
matches.sort(key = lambda x: x[1])
print(matches)

[(3232560085755078826, 0, 4), (3232560085755078826, 6, 9), (3232560085755078826, 10, 11), (3232560085755078826, 15, 16), (3232560085755078826, 23, 24), (3232560085755078826, 49, 50), (3232560085755078826, 69, 71), (3232560085755078826, 83, 88), (3232560085755078826, 89, 90), (3232560085755078826, 113, 114), (3232560085755078826, 117, 118), (3232560085755078826, 128, 132), (3232560085755078826, 133, 134), (3232560085755078826, 140, 141), (3232560085755078826, 146, 148), (3232560085755078826, 149, 150), (3232560085755078826, 151, 152), (3232560085755078826, 163, 164), (3232560085755078826, 165, 166), (3232560085755078826, 167, 168), (3232560085755078826, 172, 173), (3232560085755078826, 174, 175), (3232560085755078826, 185, 186), (3232560085755078826, 193, 195), (3232560085755078826, 240, 242), (3232560085755078826, 243, 244), (3232560085755078826, 245, 246), (3232560085755078826, 247, 251), (3232560085755078826, 252, 253), (3232560085755078826, 262, 263), (3232560085755078826, 270, 271)

* matcher.add("PROPER_NOUNS", [pattern], greedy="LONGEST"): This line adds the defined pattern to the Matcher under the rule name “PROPER_NOUNS”. The greedy="LONGEST" parameter tells the Matcher that, when several matches overlap, it should keep only the longest one and drop the shorter overlapping matches.

* matches.sort(key = lambda x: x[1]): This line sorts the matches by their start index in ascending order. This is useful if you want to process or display the matches in the order they appear in the text.
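
A compact way to see both steps together is the sketch below. It again uses a hand-built Doc with manually assigned tags (an assumption made so that no trained model is required); the sentence and the names list are illustrative only.

```python
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Doc

nlp = spacy.blank("en")

# Hypothetical sentence with hand-assigned part-of-speech tags.
words = ["Martin", "Luther", "King", "Jr.", "married", "Coretta", "Scott"]
pos = ["PROPN", "PROPN", "PROPN", "PROPN", "VERB", "PROPN", "PROPN"]
doc = Doc(nlp.vocab, words=words, pos=pos)

matcher = Matcher(nlp.vocab)
matcher.add("PROPER_NOUNS", [[{"POS": "PROPN", "OP": "+"}]], greedy="LONGEST")

matches = matcher(doc)
matches.sort(key=lambda m: m[1])  # order by start index, i.e. by appearance

names = [doc[start:end].text for _, start, end in matches]
print(names)
```

Sorting explicitly keeps the code independent of whatever order the Matcher happens to return the filtered matches in.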

## 6.4.4 Adding in Sequences

In [26]:
matcher = Matcher(nlp.vocab)
pattern = [{"POS": "PROPN", "OP": "+"}, {"POS": "VERB"}]
matcher.add("PROPER_NOUNS", [pattern], greedy='LONGEST')
doc = nlp(text)
matches = matcher(doc)
matches.sort(key = lambda x: x[1])
print (len(matches))
for match in matches[:10]:
    print (match, doc[match[1]:match[2]])

7
(3232560085755078826, 49, 51) King advanced
(3232560085755078826, 89, 91) King participated
(3232560085755078826, 113, 115) King led
(3232560085755078826, 167, 169) King helped
(3232560085755078826, 247, 252) Director J. Edgar Hoover considered
(3232560085755078826, 322, 324) King won
(3232560085755078826, 485, 488) United States beginning


## 6.5. Finding Quotes and Speakers

In [37]:
import json 
with open(r"F:\spaCy-master\data\alice.json", "r") as f:
    data = json.load(f)
    text = data[0][2][0]
    print(text)

Alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to do: once or twice she had peeped into the book her sister was reading, but it had no pictures or conversations in it, `and what is the use of a book,' thought Alice `without pictures or conversation?'




In [38]:
text = text.replace("`", "'")
print(text)

Alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to do: once or twice she had peeped into the book her sister was reading, but it had no pictures or conversations in it, 'and what is the use of a book,' thought Alice 'without pictures or conversation?'


In [39]:
matcher = Matcher(nlp.vocab)
pattern = [{"ORTH":"'"}, {'IS_ALPHA':True, "OP":"+"}, {"IS_PUNCT":True, "OP":"*"}, {"ORTH":"'"}]
matcher.add("PROPER_NOUNS", [pattern], greedy='LONGEST')
doc = nlp(text)
matches = matcher(doc)
matches.sort(key = lambda x: x[1])

for match in matches[:10]:
    print(match, doc[match[1]:match[2]])


(3232560085755078826, 47, 58) 'and what is the use of a book,'
(3232560085755078826, 60, 67) 'without pictures or conversation?'


## 6.5.1. Find Speaker

In [40]:
speak_lemmas = ["think", "say"]
text = data[0][2][0].replace( "`", "'")
matcher = Matcher(nlp.vocab)
pattern1 = [{'ORTH': "'"}, {'IS_ALPHA': True, "OP": "+"}, {'IS_PUNCT': True, "OP": "*"}, {'ORTH': "'"}, {"POS": "VERB", "LEMMA": {"IN": speak_lemmas}}, {"POS": "PROPN", "OP": "+"}, {'ORTH': "'"}, {'IS_ALPHA': True, "OP": "+"}, {'IS_PUNCT': True, "OP": "*"}, {'ORTH': "'"}]
matcher.add("PROPER_NOUNS", [pattern1], greedy='LONGEST')
doc = nlp(text)
matches = matcher(doc)
matches.sort(key = lambda x: x[1])
print (len(matches))
for match in matches[:10]:
    print (match, doc[match[1]:match[2]])

1
(3232560085755078826, 47, 67) 'and what is the use of a book,' thought Alice 'without pictures or conversation?'
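
The match above covers the quote and the speaker as one span. One possible way to separate them, sketched under assumptions, is to locate the speech verb inside the span and split around it. The Doc below is built by hand with manually assigned tags and lemmas (hypothetical annotations, not model output), the pattern is the shorter quote-verb-speaker form, and the rule name QUOTE_THEN_SPEAKER is made up for this example.

```python
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Doc

nlp = spacy.blank("en")

# Hand-built version of: 'Well!' thought Alice
words = ["'", "Well", "!", "'", "thought", "Alice"]
spaces = [False, False, False, True, True, False]
pos = ["PUNCT", "INTJ", "PUNCT", "PUNCT", "VERB", "PROPN"]
lemmas = ["'", "well", "!", "'", "think", "Alice"]
doc = Doc(nlp.vocab, words=words, spaces=spaces, pos=pos, lemmas=lemmas)

speak_lemmas = ["think", "say"]
pattern = [
    {"ORTH": "'"}, {"IS_ALPHA": True, "OP": "+"}, {"IS_PUNCT": True, "OP": "*"}, {"ORTH": "'"},
    {"POS": "VERB", "LEMMA": {"IN": speak_lemmas}},
    {"POS": "PROPN", "OP": "+"},
]
matcher = Matcher(nlp.vocab)
matcher.add("QUOTE_THEN_SPEAKER", [pattern], greedy="LONGEST")

results = []
for _, start, end in matcher(doc):
    span = doc[start:end]
    # The speech verb separates the quote (before it) from the speaker (after it).
    verb = next(t for t in span if t.pos_ == "VERB" and t.lemma_ in speak_lemmas)
    quote = doc[start:verb.i]
    speaker = doc[verb.i + 1:end]
    results.append((quote.text, verb.lemma_, speaker.text))
    print(results[-1])
```

This split only suits patterns where the quote comes first; a speaker-first pattern would need the slices reversed.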


### 6.5.2. Problem with This Approach

In [45]:
for text in data[0][2]:
    text = text.replace("`", "'")
    doc = nlp(text)
    matches = matcher(doc)
    matches.sort(key = lambda x: x[1])
    print (len(matches))
    for match in matches[:10]:
        print (match, doc[match[1]:match[2]])

1
(3232560085755078826, 47, 67) 'and what is the use of a book,' thought Alice 'without pictures or conversation?'
0
0
0
0
0
1
(3232560085755078826, 0, 6) 'Well!' thought Alice
0
0
0
0
0
0
0
1
(3232560085755078826, 57, 68) 'which certainly was not here before,' said Alice
0
0


### 6.5.3. Adding More Patterns

In [42]:
speak_lemmas = ["think", "say"]
text = data[0][2][0].replace( "`", "'")
matcher = Matcher(nlp.vocab)
pattern1 = [{'ORTH': "'"}, {'IS_ALPHA': True, "OP": "+"}, {'IS_PUNCT': True, "OP": "*"}, {'ORTH': "'"}, {"POS": "VERB", "LEMMA": {"IN": speak_lemmas}}, {"POS": "PROPN", "OP": "+"}, {'ORTH': "'"}, {'IS_ALPHA': True, "OP": "+"}, {'IS_PUNCT': True, "OP": "*"}, {'ORTH': "'"}]
pattern2 = [{'ORTH': "'"}, {'IS_ALPHA': True, "OP": "+"}, {'IS_PUNCT': True, "OP": "*"}, {'ORTH': "'"}, {"POS": "VERB", "LEMMA": {"IN": speak_lemmas}}, {"POS": "PROPN", "OP": "+"}]
pattern3 = [{"POS": "PROPN", "OP": "+"},{"POS": "VERB", "LEMMA": {"IN": speak_lemmas}}, {'ORTH': "'"}, {'IS_ALPHA': True, "OP": "+"}, {'IS_PUNCT': True, "OP": "*"}, {'ORTH': "'"}]
matcher.add("PROPER_NOUNS", [pattern1, pattern2, pattern3], greedy='LONGEST')
for text in data[0][2]:
    text = text.replace("`", "'")
    doc = nlp(text)
    matches = matcher(doc)
    matches.sort(key = lambda x: x[1])
    print (len(matches))
    for match in matches[:10]:
        print (match, doc[match[1]:match[2]])

1
(3232560085755078826, 47, 67) 'and what is the use of a book,' thought Alice 'without pictures or conversation?'
0
0
0
0
0
1
(3232560085755078826, 0, 6) 'Well!' thought Alice
0
0
0
0
0
0
0
1
(3232560085755078826, 57, 68) 'which certainly was not here before,' said Alice
0
0
