## Phrase Matcher

In [74]:
import spacy

In [75]:
from spacy.matcher import Matcher

In [76]:
nlp = spacy.load('en_core_web_sm')

- pattern names are stored in `nlp.vocab`; matches are not added to `doc.ents`
- the matcher is not looking for an entity in the text, but for a structure within the text
- note that all the matches are token-wise
  e.g. `FS440` is not `IS_DIGIT`, since it is one token that is not made up of digits only, but in `FS 440` the token `440` is `IS_DIGIT`
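
The token-wise behaviour can be checked with a small sketch. It assumes a blank English pipeline (`spacy.blank("en")`) rather than the model loaded above, and the tokens are written out by hand so the result does not depend on tokenizer rules. The rule name `DIGITS` is made up for illustration.

```python
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Doc

# A blank pipeline supplies the vocab and lexical flags without a downloaded model.
nlp = spacy.blank("en")

matcher = Matcher(nlp.vocab)
matcher.add("DIGITS", [[{"IS_DIGIT": True}]])  # "DIGITS" is a hypothetical rule name

# Hand-built docs (an assumption for illustration): one token vs. two tokens.
one_token = Doc(nlp.vocab, words=["FS440"])
two_tokens = Doc(nlp.vocab, words=["FS", "440"])

digits_in_one = [one_token[s:e].text for _, s, e in matcher(one_token)]
digits_in_two = [two_tokens[s:e].text for _, s, e in matcher(two_tokens)]

print(digits_in_one)  # no match: "FS440" is a single, mixed token
print(digits_in_two)  # only the token "440" matches
```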

### When to use over RegEx?

Use the Matcher when linguistic components (such as parts of speech or lemmas) are to be found; use RegEx when you want to extract a complex character pattern that does not depend on parts of speech.
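
For contrast, here is a minimal RegEx sketch for a purely character-level pattern. The product-code format (two capital letters, an optional space, three digits) and the sample sentence are assumptions made up for illustration.

```python
import re

# Hypothetical code format: two capital letters, optional space, three digits.
CODE_RE = re.compile(r"\b[A-Z]{2} ?\d{3}\b")

sample = "Order FS440 shipped, but FS 441 and fs442 are pending."
codes = CODE_RE.findall(sample)

print(codes)
```

The regular expression works on characters, so it finds both `FS440` and `FS 441` regardless of how a tokenizer would split them, but it knows nothing about parts of speech.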

In [77]:
matcher = Matcher(nlp.vocab)
pattern = [{"LIKE_EMAIL": True}]
matcher.add("EMAIL_ADDRESS", [pattern])
doc = nlp("My email address is mazz84002@gmail.com")
matches = matcher(doc) # matches is a list of tuples

In [78]:
print(matches)

[(16571425990740197027, 4, 5)]


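The pattern name was added to `nlp.vocab`, and the first element of each match tuple, here `16571425990740197027`, is its hash. Looking the hash up in `nlp.vocab` returns a lexeme whose text is the pattern name.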

In [79]:
print(nlp.vocab[matches[0][0]])
print(nlp.vocab[matches[0][0]].text)

<spacy.lexeme.Lexeme object at 0x7f89c2cd7980>
EMAIL_ADDRESS
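
The same lookup can also go through the string store directly. The sketch below assumes a blank English pipeline and a hand-tokenized sentence with a placeholder address; `nlp.vocab.strings` maps a hash to its string and a string to its hash.

```python
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Doc

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)
matcher.add("EMAIL_ADDRESS", [[{"LIKE_EMAIL": True}]])

# Hand-tokenized sentence; the address is a placeholder, not real data.
doc = Doc(nlp.vocab, words=["My", "email", "address", "is", "user@example.com"])

match_id, start, end = matcher(doc)[0]
rule_name = nlp.vocab.strings[match_id]         # hash -> string
rule_hash = nlp.vocab.strings["EMAIL_ADDRESS"]  # string -> hash

print(rule_name, doc[start:end].text)
```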


## Extraction Examples

In [80]:
with open('data/wiki_mlk.txt', 'r') as f:
    text = f.read()

In [81]:
print(text)

Martin Luther King Jr. (born Michael King Jr.; January 15, 1929 – April 4, 1968) was an American Baptist minister and activist who became the most visible spokesman and leader in the American civil rights movement from 1955 until his assassination in 1968. King advanced civil rights through nonviolence and civil disobedience, inspired by his Christian beliefs and the nonviolent activism of Mahatma Gandhi. He was the son of early civil rights activist and minister Martin Luther King Sr.

King participated in and led marches for blacks' right to vote, desegregation, labor rights, and other basic civil rights.[1] King led the 1955 Montgomery bus boycott and later became the first president of the Southern Christian Leadership Conference (SCLC). As president of the SCLC, he led the unsuccessful Albany Movement in Albany, Georgia, and helped organize some of the nonviolent 1963 protests in Birmingham, Alabama. King helped organize the 1963 March on Washington, where he delivered his famous 

### Extracting Proper Nouns

In [82]:
nlp = spacy.load('en_core_web_md')

In [83]:
matcher = Matcher(nlp.vocab)
pattern = [{"POS": "PROPN"}]
matcher.add("PROPER_NOUN", [pattern])
doc = nlp(text)
matches = matcher(doc)
print(len(matches))
for match in matches[:20]:
    print(match, doc[match[1]:match[2]])

103
(451313080118390996, 0, 1) Martin
(451313080118390996, 1, 2) Luther
(451313080118390996, 2, 3) King
(451313080118390996, 3, 4) Jr.
(451313080118390996, 6, 7) Michael
(451313080118390996, 7, 8) King
(451313080118390996, 8, 9) Jr.
(451313080118390996, 10, 11) January
(451313080118390996, 15, 16) April
(451313080118390996, 23, 24) Baptist
(451313080118390996, 49, 50) King
(451313080118390996, 69, 70) Mahatma
(451313080118390996, 70, 71) Gandhi
(451313080118390996, 83, 84) Martin
(451313080118390996, 84, 85) Luther
(451313080118390996, 85, 86) King
(451313080118390996, 86, 87) Sr
(451313080118390996, 89, 90) King
(451313080118390996, 113, 114) King
(451313080118390996, 117, 118) Montgomery


Solving the problem of separated tokens (grabbing multi-word spans)

In [84]:
matcher = Matcher(nlp.vocab)
pattern = [{"POS": "PROPN", "OP": "+"}] # OP: + means looking for PROPN occurring 1 or more times
matcher.add("PROPER_NOUN", [pattern])
doc = nlp(text)
matches = matcher(doc)
print(len(matches))
for match in matches[:20]:
    print(match, doc[match[1]:match[2]])

172
(451313080118390996, 0, 1) Martin
(451313080118390996, 0, 2) Martin Luther
(451313080118390996, 1, 2) Luther
(451313080118390996, 0, 3) Martin Luther King
(451313080118390996, 1, 3) Luther King
(451313080118390996, 2, 3) King
(451313080118390996, 0, 4) Martin Luther King Jr.
(451313080118390996, 1, 4) Luther King Jr.
(451313080118390996, 2, 4) King Jr.
(451313080118390996, 3, 4) Jr.
(451313080118390996, 6, 7) Michael
(451313080118390996, 6, 8) Michael King
(451313080118390996, 7, 8) King
(451313080118390996, 6, 9) Michael King Jr.
(451313080118390996, 7, 9) King Jr.
(451313080118390996, 8, 9) Jr.
(451313080118390996, 10, 11) January
(451313080118390996, 15, 16) April
(451313080118390996, 23, 24) Baptist
(451313080118390996, 49, 50) King


Problem - overlapping matches

In [85]:
matcher = Matcher(nlp.vocab)
pattern = [{"POS": "PROPN", "OP": "+"}]
matcher.add("PROPER_NOUN", [pattern], greedy="LONGEST") # greedy="LONGEST" keeps only the longest of overlapping matches
doc = nlp(text)
matches = matcher(doc)
print(len(matches))
for match in matches[:20]:
    print(match, doc[match[1]:match[2]])

63
(451313080118390996, 469, 474) Martin Luther King Jr. Day
(451313080118390996, 536, 541) Martin Luther King Jr. Memorial
(451313080118390996, 0, 4) Martin Luther King Jr.
(451313080118390996, 83, 87) Martin Luther King Sr
(451313080118390996, 128, 132) Southern Christian Leadership Conference
(451313080118390996, 247, 251) Director J. Edgar Hoover
(451313080118390996, 6, 9) Michael King Jr.
(451313080118390996, 325, 328) Nobel Peace Prize
(451313080118390996, 422, 425) James Earl Ray
(451313080118390996, 463, 466) Congressional Gold Medal
(451313080118390996, 503, 506) President Ronald Reagan
(451313080118390996, 69, 71) Mahatma Gandhi
(451313080118390996, 146, 148) Albany Movement
(451313080118390996, 193, 195) Lincoln Memorial
(451313080118390996, 240, 242) Federal Bureau
(451313080118390996, 370, 372) Vietnam War
(451313080118390996, 392, 394) Poor People
(451313080118390996, 455, 457) Presidential Medal
(451313080118390996, 485, 487) United States
(451313080118390996, 528, 530) 
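
The effect of `greedy="LONGEST"` is easier to see on a tiny input. This is a sketch under stated assumptions: a blank English pipeline, and a four-token doc whose POS tags are assigned by hand instead of by a trained tagger.

```python
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Doc

nlp = spacy.blank("en")

# Hand-tagged doc (assumption): three proper nouns followed by a verb.
doc = Doc(
    nlp.vocab,
    words=["Martin", "Luther", "King", "spoke"],
    pos=["PROPN", "PROPN", "PROPN", "VERB"],
)
pattern = [{"POS": "PROPN", "OP": "+"}]

# Without greedy: every contiguous run of proper nouns is reported.
all_matcher = Matcher(nlp.vocab)
all_matcher.add("PROPER_NOUN", [pattern])
all_spans = sorted((s, e) for _, s, e in all_matcher(doc))

# With greedy="LONGEST": overlapping candidates collapse to the longest one.
longest_matcher = Matcher(nlp.vocab)
longest_matcher.add("PROPER_NOUN", [pattern], greedy="LONGEST")
longest_spans = [(s, e) for _, s, e in longest_matcher(doc)]

print(len(all_spans), all_spans)
print(longest_spans)
```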

Problem - the matches are sorted longest to shortest, not in order of text

In [86]:
matcher = Matcher(nlp.vocab)
pattern = [{"POS": "PROPN", "OP": "+"}]
matcher.add("PROPER_NOUN", [pattern], greedy="LONGEST")
doc = nlp(text)
matches = matcher(doc)
matches.sort(key= lambda x: x[1]) # sorting by the start token in the tuples list
print(len(matches))
for match in matches[:20]:
    print(match, doc[match[1]:match[2]])

63
(451313080118390996, 0, 4) Martin Luther King Jr.
(451313080118390996, 6, 9) Michael King Jr.
(451313080118390996, 10, 11) January
(451313080118390996, 15, 16) April
(451313080118390996, 23, 24) Baptist
(451313080118390996, 49, 50) King
(451313080118390996, 69, 71) Mahatma Gandhi
(451313080118390996, 83, 87) Martin Luther King Sr
(451313080118390996, 89, 90) King
(451313080118390996, 113, 114) King
(451313080118390996, 117, 118) Montgomery
(451313080118390996, 128, 132) Southern Christian Leadership Conference
(451313080118390996, 133, 134) SCLC
(451313080118390996, 140, 141) SCLC
(451313080118390996, 146, 148) Albany Movement
(451313080118390996, 149, 150) Albany
(451313080118390996, 151, 152) Georgia
(451313080118390996, 163, 164) Birmingham
(451313080118390996, 165, 166) Alabama
(451313080118390996, 167, 168) King
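
An alternative to sorting tuples is to ask the matcher for `Span` objects with `as_spans=True` and sort those by their start index. The sketch assumes a blank English pipeline and a hand-tagged doc; the sentence is made up for illustration.

```python
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Doc

nlp = spacy.blank("en")

# Hand-tagged doc (assumption): a short name first, a longer name later.
doc = Doc(
    nlp.vocab,
    words=["King", "met", "Mahatma", "Gandhi"],
    pos=["PROPN", "VERB", "PROPN", "PROPN"],
)

matcher = Matcher(nlp.vocab)
matcher.add("PROPER_NOUN", [[{"POS": "PROPN", "OP": "+"}]], greedy="LONGEST")

# as_spans=True returns Span objects labelled with the rule name.
spans = matcher(doc, as_spans=True)
spans_in_text_order = sorted(spans, key=lambda span: span.start)
names = [span.text for span in spans_in_text_order]
labels = {span.label_ for span in spans_in_text_order}

print(names, labels)
```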


### Extracting Proper Nouns followed by Verbs

In [87]:
matcher = Matcher(nlp.vocab)
pattern = [{"POS": "PROPN", "OP": "+"}, {"POS": "VERB"}]
matcher.add("PROPER_NOUN", [pattern], greedy="LONGEST")
doc = nlp(text)
matches = matcher(doc)
matches.sort(key=lambda x: x[1]) # sorting by the start token in the tuples list
print(len(matches))
for match in matches[:20]:
    print(match, doc[match[1]:match[2]])

7
(451313080118390996, 49, 51) King advanced
(451313080118390996, 89, 91) King participated
(451313080118390996, 113, 115) King led
(451313080118390996, 198, 200) SCLC put
(451313080118390996, 247, 252) Director J. Edgar Hoover considered
(451313080118390996, 322, 324) King won
(451313080118390996, 485, 488) United States beginning


In [88]:
matcher = Matcher(nlp.vocab)
pattern = [{"POS": "VERB", "OP": "+"}]
matcher.add("PROPER_NOUN", [pattern], greedy="LONGEST")
doc = nlp(text)
matches = matcher(doc)
matches.sort(key=lambda x: x[1]) # sorting by the start token in the tuples list
print(len(matches))
for match in matches[:20]:
    print(match, doc[match[1]:match[2]])

51
(451313080118390996, 5, 6) born
(451313080118390996, 28, 29) became
(451313080118390996, 50, 51) advanced
(451313080118390996, 59, 60) inspired
(451313080118390996, 90, 91) participated
(451313080118390996, 93, 94) led
(451313080118390996, 100, 101) vote
(451313080118390996, 114, 115) led
(451313080118390996, 122, 123) became
(451313080118390996, 143, 144) led
(451313080118390996, 155, 156) organize
(451313080118390996, 169, 170) organize
(451313080118390996, 178, 179) delivered
(451313080118390996, 183, 184) Have
(451313080118390996, 199, 200) put
(451313080118390996, 212, 213) choosing
(451313080118390996, 221, 222) carried
(451313080118390996, 225, 226) were
(451313080118390996, 237, 238) turned
(451313080118390996, 251, 252) considered


In [89]:
matcher = Matcher(nlp.vocab)
pattern = [{"IS_SENT_START": True}]
matcher.add("PROPER_NOUN", [pattern], greedy="LONGEST")
doc = nlp(text)
matches = matcher(doc)
matches.sort(key=lambda x: x[1]) # sorting by the start token in the tuples list
print(len(matches))
for match in matches[:20]:
    print(match, doc[match[1]:match[2]])

18
(451313080118390996, 0, 1) Martin
(451313080118390996, 49, 50) King
(451313080118390996, 72, 73) He
(451313080118390996, 88, 89) 


(451313080118390996, 136, 137) As
(451313080118390996, 167, 168) King
(451313080118390996, 196, 197) 


(451313080118390996, 224, 225) There
(451313080118390996, 270, 271) FBI
(451313080118390996, 336, 337) In
(451313080118390996, 351, 352) In
(451313080118390996, 373, 374) 


(451313080118390996, 409, 410) His
(451313080118390996, 420, 421) Allegations
(451313080118390996, 450, 451) King
(451313080118390996, 469, 470) Martin
(451313080118390996, 509, 510) Hundreds
(451313080118390996, 535, 536) The


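Note that the next three cells (In [97] to In [99]) were executed after the Alice excerpt was loaded in In [90] to In [92], so `text` here is the Alice passage, not the King article.

In [97]: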
matcher = Matcher(nlp.vocab)
pattern = [{"IS_UPPER": True}]
matcher.add("PROPER_NOUN", [pattern], greedy="LONGEST")
doc = nlp(text)
matches = matcher(doc)
matches.sort(key=lambda x: x[1]) # sorting by the start token in the tuples list
print(len(matches))
for match in matches[:20]:
    print(match, doc[match[1]:match[2]])

0


In [98]:
matcher = Matcher(nlp.vocab)
pattern = [{"IS_LOWER": True}]
matcher.add("PROPER_NOUN", [pattern], greedy="LONGEST")
doc = nlp(text)
matches = matcher(doc)
matches.sort(key=lambda x: x[1]) # sorting by the start token in the tuples list
print(len(matches))
for match in matches[:20]:
    print(match, doc[match[1]:match[2]])

55
(451313080118390996, 1, 2) was
(451313080118390996, 2, 3) beginning
(451313080118390996, 3, 4) to
(451313080118390996, 4, 5) get
(451313080118390996, 5, 6) very
(451313080118390996, 6, 7) tired
(451313080118390996, 7, 8) of
(451313080118390996, 8, 9) sitting
(451313080118390996, 9, 10) by
(451313080118390996, 10, 11) her
(451313080118390996, 11, 12) sister
(451313080118390996, 12, 13) on
(451313080118390996, 13, 14) the
(451313080118390996, 14, 15) bank
(451313080118390996, 16, 17) and
(451313080118390996, 17, 18) of
(451313080118390996, 18, 19) having
(451313080118390996, 19, 20) nothing
(451313080118390996, 20, 21) to
(451313080118390996, 21, 22) do


In [99]:
matcher = Matcher(nlp.vocab)
pattern = [{"IS_SPACE": True}]
matcher.add("PROPER_NOUN", [pattern], greedy="LONGEST")
doc = nlp(text)
matches = matcher(doc)
matches.sort(key=lambda x: x[1]) # sorting by the start token in the tuples list
print(len(matches))
for match in matches[:20]:
    print(match, doc[match[1]:match[2]])

0
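
The lexical flags used in the last few cells can be compared side by side on a small input. This sketch assumes a blank English pipeline and a hand-built token list that contains an all-caps token and a whitespace token; `tokens_with` is a hypothetical helper, not part of spaCy.

```python
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Doc

nlp = spacy.blank("en")

# Hand-built tokens (assumption), including an acronym and a newline token.
doc = Doc(nlp.vocab, words=["The", "FBI", "said", "\n\n", "hello"])


def tokens_with(flag):
    """Return the texts of tokens for which the given boolean flag is True."""
    matcher = Matcher(nlp.vocab)
    matcher.add(flag, [[{flag: True}]])
    starts = sorted(start for _, start, _ in matcher(doc))
    return [doc[i].text for i in starts]


upper = tokens_with("IS_UPPER")  # tokens written entirely in capitals
lower = tokens_with("IS_LOWER")  # tokens written entirely in lowercase
space = tokens_with("IS_SPACE")  # tokens consisting only of whitespace

print(upper, lower, [repr(t) for t in space])
```

A title-case token such as `The` is neither `IS_UPPER` nor `IS_LOWER`, which is consistent with `IS_UPPER` returning no matches on a passage without all-caps tokens.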


In [90]:
import json
with open('data/alice.json', 'r') as f:
    data = json.load(f)

In [91]:
text = data[0][2][0]
print(text)

Alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to do: once or twice she had peeped into the book her sister was reading, but it had no pictures or conversations in it, `and what is the use of a book,' thought Alice `without pictures or conversation?'


In [92]:
text = text.replace("`", "'") # standardise the text
print(text)

Alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to do: once or twice she had peeped into the book her sister was reading, but it had no pictures or conversations in it, 'and what is the use of a book,' thought Alice 'without pictures or conversation?'


### Extracting Quotations

In [93]:
matcher = Matcher(nlp.vocab)
pattern = [{"ORTH":"'"}, # starts with a '
           {"IS_ALPHA": True, "OP": "+"}, # followed by one or more alphabetic tokens (that's why OP:+)
           {"IS_PUNCT": True, "OP": "*"}, # optionally followed by punctuation (that's why OP:*)
           {"ORTH":"'"} # ends with a '
          ]
matcher.add("QUOTATIONS", [pattern], greedy="LONGEST")
doc = nlp(text)
matches = matcher(doc)
matches.sort(key=lambda x: x[1]) # sorting by the start token in the tuples list
print(len(matches))
for match in matches[:20]:
    print(match, doc[match[1]:match[2]])

2
(1846643219120237910, 47, 58) 'and what is the use of a book,'
(1846643219120237910, 60, 67) 'without pictures or conversation?'
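
The quotation pattern can be traced token by token on a short input. This is a sketch, assuming a blank English pipeline and a hand-tokenized sentence so that the quote marks are guaranteed to be separate tokens.

```python
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Doc

nlp = spacy.blank("en")

# Hand-tokenized sentence (assumption): each quote mark is its own token.
words = ["'", "what", "a", "book", ",", "'", "thought", "Alice"]
doc = Doc(nlp.vocab, words=words)

pattern = [
    {"ORTH": "'"},                  # opening quote
    {"IS_ALPHA": True, "OP": "+"},  # one or more alphabetic tokens
    {"IS_PUNCT": True, "OP": "*"},  # zero or more punctuation tokens
    {"ORTH": "'"},                  # closing quote
]
matcher = Matcher(nlp.vocab)
matcher.add("QUOTATIONS", [pattern], greedy="LONGEST")

# Collect the token texts of every matched span.
quotes = [[token.text for token in doc[s:e]] for _, s, e in matcher(doc)]

print(quotes)
```

The second quote mark cannot open a new match here, because no closing quote follows `thought Alice`.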


In [94]:
speak_lemmas = ["think", "say"]
matcher = Matcher(nlp.vocab)
pattern = [{"ORTH":"'"}, # starts with a '
           {"IS_ALPHA": True, "OP": "+"}, # followed by one or more alphabetic tokens (that's why OP:+)
           {"IS_PUNCT": True, "OP": "*"}, # optionally followed by punctuation (that's why OP:*)
           {"ORTH":"'"}, # ends with a '
           {"POS": "VERB", "LEMMA":{"IN": speak_lemmas}} # now, the closing quotation mark is followed by a verb whose lemma is either `think` or `say`
          ]
matcher.add("QUOTATIONS", [pattern], greedy="LONGEST")
doc = nlp(text)
matches = matcher(doc)
matches.sort(key=lambda x: x[1]) # sorting by the start token in the tuples list
print(len(matches))
for match in matches[:20]:
    print(match, doc[match[1]:match[2]])

1
(1846643219120237910, 47, 59) 'and what is the use of a book,' thought


In [95]:
speak_lemmas = ["think", "say"]
matcher = Matcher(nlp.vocab)
pattern = [{"ORTH":"'"}, # starts with a '
           {"IS_ALPHA": True, "OP": "+"}, # followed by one or more alphabetic tokens (that's why OP:+)
           {"IS_PUNCT": True, "OP": "*"}, # optionally followed by punctuation (that's why OP:*)
           {"ORTH":"'"}, # ends with a '
           {"POS": "VERB", "LEMMA":{"IN": speak_lemmas}}, # now, the closing quotation mark is followed by a verb whose lemma is either `think` or `say`
           {"POS": "PROPN", "OP":"+"},
          ]
matcher.add("QUOTATIONS", [pattern], greedy="LONGEST")
doc = nlp(text)
matches = matcher(doc)
matches.sort(key=lambda x: x[1]) # sorting by the start token in the tuples list
print(len(matches))
for match in matches[:20]:
    print(match, doc[match[1]:match[2]])

1
(1846643219120237910, 47, 60) 'and what is the use of a book,' thought Alice
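
The `LEMMA` filter with `IN` can be seen rejecting a verb that is not in the list. This is a sketch under stated assumptions: a blank English pipeline, and a doc whose POS tags and lemmas are assigned by hand. The sentence and the second speaker are made up for illustration.

```python
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Doc

nlp = spacy.blank("en")

# Hand-annotated doc (assumption): two quotes, only the first uses a listed verb.
words = ["'", "Hello", "'", "thought", "Alice", "'", "Bye", "'", "shouted", "Bob"]
pos = ["PUNCT", "INTJ", "PUNCT", "VERB", "PROPN",
       "PUNCT", "INTJ", "PUNCT", "VERB", "PROPN"]
lemmas = ["'", "hello", "'", "think", "Alice", "'", "bye", "'", "shout", "Bob"]
doc = Doc(nlp.vocab, words=words, pos=pos, lemmas=lemmas)

speak_lemmas = ["think", "say"]
pattern = [
    {"ORTH": "'"},
    {"IS_ALPHA": True, "OP": "+"},
    {"IS_PUNCT": True, "OP": "*"},
    {"ORTH": "'"},
    {"POS": "VERB", "LEMMA": {"IN": speak_lemmas}},  # verb lemma must be listed
    {"POS": "PROPN", "OP": "+"},                     # the speaker
]
matcher = Matcher(nlp.vocab)
matcher.add("QUOTATIONS", [pattern], greedy="LONGEST")

attributed = [(s, e) for _, s, e in matcher(doc)]
matched_tokens = [[t.text for t in doc[s:e]] for s, e in attributed]

print(matched_tokens)
```

The second quote is skipped because the lemma `shout` is not in `speak_lemmas`; adding it to the list would let that quote match as well.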


In [96]:
speak_lemmas = ["think", "say"]
matcher = Matcher(nlp.vocab)
pattern = [{"ORTH":"'"}, # starts with a '
           {"IS_ALPHA": True, "OP": "+"}, # followed by one or more alphabetic tokens (that's why OP:+)
           {"IS_PUNCT": True, "OP": "*"}, # optionally followed by punctuation (that's why OP:*)
           {"ORTH":"'"}, # ends with a '
           {"POS": "VERB", "LEMMA":{"IN": speak_lemmas}}, # now, the closing quotation mark is followed by a verb whose lemma is either `think` or `say`
           {"POS": "PROPN", "OP":"+"},
           {"ORTH":"'"}, # the second quotation starts with a '
           {"IS_ALPHA": True, "OP": "+"}, # followed by one or more alphabetic tokens (that's why OP:+)
           {"IS_PUNCT": True, "OP": "*"}, # optionally followed by punctuation (that's why OP:*)
           {"ORTH":"'"}, # ends with a '
          ]
matcher.add("QUOTATIONS", [pattern], greedy="LONGEST")
doc = nlp(text)
matches = matcher(doc)
matches.sort(key=lambda x: x[1]) # sorting by the start token in the tuples list
print(len(matches))
for match in matches[:20]:
    print(match, doc[match[1]:match[2]])

1
(1846643219120237910, 47, 67) 'and what is the use of a book,' thought Alice 'without pictures or conversation?'
