## Chapter 6: How to use spacy Matcher

In [29]:
import spacy
from spacy.matcher import Matcher

In [30]:
nlp = spacy.load("en_core_web_sm")

In [31]:
matcher = Matcher(nlp.vocab)
pattern = [{"LIKE_EMAIL": True}]
matcher.add("EMAIL_ADDRESSS", [pattern]) # A list withing a list is required

In [32]:
doc = nlp("This is an email address: wmattingly@aol.com")
matches = matcher(doc)

In [33]:
print(matches) # index: 0 = lexeme 1 = Start token 2 = End token

[(7369011462154749965, 6, 7)]


In [34]:
print(nlp.vocab[matches[0][0]].text)

EMAIL_ADDRESSS


In [35]:
with open("data\wiki_srn.txt", "r") as f:
    text = f.read()

In [36]:
print(text)

Srinivasa Ramanujan was an Indian mathematician. Though he had almost no formal training in pure mathematics, he made substantial contributions to mathematical analysis, number theory, infinite series, and continued fractions, including solutions to mathematical problems then considered unsolvable.

Ramanujan initially developed his own mathematical research in isolation. According to Hans Eysenck, "he tried to interest the leading professional mathematicians in his work, but failed for the most part. What he had to show them was too novel, too unfamiliar, and additionally presented in unusual ways; they could not be bothered".[4] Seeking mathematicians who could better understand his work, in 1913 he began a postal correspondence with the English mathematician G. H. Hardy at the University of Cambridge, England. Recognising Ramanujan's work as extraordinary, Hardy arranged for him to travel to Cambridge. In his notes, Hardy commented that Ramanujan had produced groundbreaking new theo

In [37]:
nlp = spacy.load("en_core_web_sm")

In [38]:
# Find proper nouns
matcher = Matcher(nlp.vocab)
pattern = [{"POS":"PROPN","OP":"+"}]
matcher.add("PROPER_NOUN", [pattern])
doc = nlp(text)
matches = matcher(doc)
print(len(matches))
for match in matches[:10]:
    print(match, doc[match[1]:match[2]])

357
(451313080118390996, 0, 1) Srinivasa
(451313080118390996, 0, 2) Srinivasa Ramanujan
(451313080118390996, 1, 2) Ramanujan
(451313080118390996, 46, 47) Ramanujan
(451313080118390996, 58, 59) Hans
(451313080118390996, 58, 60) Hans Eysenck
(451313080118390996, 59, 60) Eysenck
(451313080118390996, 127, 128) G.
(451313080118390996, 127, 129) G. H.
(451313080118390996, 128, 129) H.


In [39]:
matcher = Matcher(nlp.vocab)
pattern = [{"POS":"PROPN","OP":"+"}]
matcher.add("PROPER_NOUN", [pattern], greedy="LONGEST")
doc = nlp(text)
matches = matcher(doc)
print(len(matches))
for match in matches[:10]:
    print(match, doc[match[1]:match[2]])

196
(451313080118390996, 976, 980) Town Higher Secondary School
(451313080118390996, 1252, 1256) Town Higher Secondary School
(451313080118390996, 1843, 1847) Saiva Muthaiah Mudali street
(451313080118390996, 563, 566) Sarangapani Sannidhi Street
(451313080118390996, 592, 595) Tamil Brahmin Iyengar
(451313080118390996, 609, 612) Kuppuswamy Srinivasa Iyengar
(451313080118390996, 650, 653) Sarangapani Sannidhi Street
(451313080118390996, 932, 935) Kangayan Primary School
(451313080118390996, 1024, 1027) S. L. Loney
(451313080118390996, 1182, 1185) G. S. Carr


In [40]:
matcher = Matcher(nlp.vocab)
pattern = [{"POS":"PROPN","OP":"+"}]
matcher.add("PROPER_NOUN", [pattern], greedy="LONGEST")
doc = nlp(text)
matches = matcher(doc)
matches.sort(key = lambda x:x[1])
print(len(matches))
for match in matches[:10]:
    print(match, doc[match[1]:match[2]])

196
(451313080118390996, 0, 2) Srinivasa Ramanujan
(451313080118390996, 46, 47) Ramanujan
(451313080118390996, 58, 60) Hans Eysenck
(451313080118390996, 127, 129) G. H.
(451313080118390996, 132, 133) University
(451313080118390996, 134, 135) Cambridge
(451313080118390996, 136, 137) England
(451313080118390996, 139, 140) Ramanujan
(451313080118390996, 152, 153) Cambridge
(451313080118390996, 158, 159) Hardy


In [41]:
# Proper nouns before verbs
matcher = Matcher(nlp.vocab)
pattern = [{"POS":"PROPN","OP":"+"}, {"POS": "VERB"}]
matcher.add("PROPER_NOUN", [pattern], greedy="LONGEST")
doc = nlp(text)
matches = matcher(doc)
matches.sort(key = lambda x:x[1])
print(len(matches))
for match in matches[:10]:
    print(match, doc[match[1]:match[2]])

32
(451313080118390996, 158, 160) Hardy commented
(451313080118390996, 413, 415) Hardy stated
(451313080118390996, 452, 455) ill healthâ€”now believed
(451313080118390996, 699, 701) Ramanujan contracted
(451313080118390996, 936, 938) Ramanujan performed
(451313080118390996, 974, 976) Ramanujan entered
(451313080118390996, 1162, 1164) Ramanujan obtained
(451313080118390996, 1278, 1280) Iyer introduced
(451313080118390996, 1338, 1340) Ramanujan ran
(451313080118390996, 1406, 1408) Ramanujan failed


In [42]:
import json
with open ("data/alice.json", "r") as f:
    data = json.load(f)

In [43]:
text = data[0][2][0]
print(text)

Alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to do: once or twice she had peeped into the book her sister was reading, but it had no pictures or conversations in it, `and what is the use of a book,' thought Alice `without pictures or conversation?'


In [44]:
text = text.replace("`","'")
print(text)

Alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to do: once or twice she had peeped into the book her sister was reading, but it had no pictures or conversations in it, 'and what is the use of a book,' thought Alice 'without pictures or conversation?'


In [45]:
matcher = Matcher(nlp.vocab)
pattern = [{"ORTH":"'"},
           {"IS_ALPHA":True, "OP" : "+"},
           {"IS_PUNCT":True, "OP" : "*"},
           {"ORTH": "'"}]
matcher.add("PROPER_NOUN", [pattern], greedy="LONGEST")
doc = nlp(text)
matches = matcher(doc)
matches.sort(key = lambda x:x[1])
print(len(matches))
for match in matches[:10]:
    print(match, doc[match[1]:match[2]])

2
(451313080118390996, 47, 58) 'and what is the use of a book,'
(451313080118390996, 60, 67) 'without pictures or conversation?'


In [46]:
# Find out who is speaking
speak_lemmas = ["think","say"]
matcher = Matcher(nlp.vocab)
pattern = [{"ORTH":"'"},
           {"IS_ALPHA":True, "OP" : "+"},
           {"IS_PUNCT":True, "OP" : "*"},
           {"ORTH": "'"},
           {"POS": "VERB", "LEMMA": {"IN": speak_lemmas}},
           {"POS": "PROPN", "OP": "+"},
           {"ORTH":"'"},
           {"IS_ALPHA":True, "OP" : "+"},
           {"IS_PUNCT":True, "OP" : "*"},
           {"ORTH": "'"}
           ]
matcher.add("PROPER_NOUN", [pattern], greedy="LONGEST")
doc = nlp(text)
matches = matcher(doc)
matches.sort(key = lambda x:x[1])
print(len(matches))
for match in matches[:10]:
    print(match, doc[match[1]:match[2]])

1
(451313080118390996, 47, 67) 'and what is the use of a book,' thought Alice 'without pictures or conversation?'


In [47]:
for text in data[0][2]:
    doc = nlp(text)
    matches = matcher(doc)
    print(len(matches))
    matches.sort(key = lambda x:x[1])
    for match in matches[:10]:
        print(match, doc[match[1]:match[2]])

0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0


In [48]:
speak_lemmas = ["think","say"]
text = data[0][2][0].replace("`","'")
matcher = Matcher(nlp.vocab)
pattern1 = [{"ORTH": "'"},
            {"IS_ALPHA":True, "OP": "+"},
            {"IS_PUNCT": True, "OP": "*"},
            {"ORTH": "`"}, 
            {"POS": "VERB", "LEMMA": {"IN": speak_lemmas}},
            {"POS": "PROPN", "OP": "+"}, {'ORTH': "'"}, {'IS_ALPHA': True, "OP": "+"}, {'IS_PUNCT': True, "OP": "*"}, {'ORTH': "'"}
           ]
pattern2 = [{'ORTH': "'"}, {'IS_ALPHA': True, "OP": "+"}, {'IS_PUNCT': True, "OP": "*"}, {'ORTH': "'"}, {"POS": "VERB", "LEMMA": {"IN": speak_lemmas}}, {"POS": "PROPN", "OP": "+"}]

pattern3 = [{"POS": "PROPN", "OP": "+"},{"POS": "VERB", "LEMMA": {"IN": speak_lemmas}}, {'ORTH': "'"}, {'IS_ALPHA': True, "OP": "+"}, {'IS_PUNCT': True, "OP": "*"}, {'ORTH': "'"}]

matcher.add("PROPER_NOUNS", [pattern1, pattern2, pattern3], greedy="LONGEST")
for text in data[0][2]:
    text = text.replace("`", "'")
    doc = nlp(text)
    matches = matcher(doc)
    matches.sort(key = lambda x: x[1])
    print(len(matches))
    for match in matches[:10]:
        print(match, doc[match[1]:match[2]])

1
(3232560085755078826, 47, 60) 'and what is the use of a book,' thought Alice
0
0
0
0
0
1
(3232560085755078826, 0, 6) 'Well!' thought Alice
0
0
0
0
0
0
0
1
(3232560085755078826, 57, 68) 'which certainly was not here before,' said Alice
0
0
