In [1]:
!pip install --upgrade pip --quiet
!pip install spacy --quiet
!python -m spacy download en_core_web_sm --quiet

✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [2]:
import spacy
from spacy.matcher import Matcher

In [3]:
nlp = spacy.load("en_core_web_sm")

In [4]:
# vocab is a storage container for lexical types.
matcher = Matcher(nlp.vocab)
pattern = [
    {"LIKE_EMAIL": True}
]
matcher.add('EMAIL_ADDRESS',[pattern])

In [5]:
doc = nlp("This is an email address: 1680434@park.edu")
matches = matcher(doc)
# each match is a (match_id, start, end) tuple; start and end are token offsets
matches

[(16571425990740197027, 6, 7)]
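To make the tuple concrete, here is a minimal sketch that unpacks it in plain Python. The `tokens` list is a hand-written stand-in for `doc`, chosen to agree with the offsets in the output above; it is an assumption for illustration and is not produced by spaCy here.

```python
# Hand-written stand-in for the tokens of `doc` (assumption for illustration).
tokens = ["This", "is", "an", "email", "address", ":", "1680434@park.edu"]

# The match tuple copied from the output above.
matches = [(16571425990740197027, 6, 7)]

for match_id, start, end in matches:
    # `end` is exclusive, so tokens[start:end] covers exactly the matched tokens.
    span_text = " ".join(tokens[start:end])
    print(match_id, start, end, span_text)
```

With a real `Doc`, `doc[start:end]` returns the matched `Span`, and `nlp.vocab.strings[match_id]` returns the name the pattern was registered under.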

A Lexeme has no string context – it’s a word type, as opposed to a word token. It therefore has no part-of-speech tag, dependency parse, or lemma (if lemmatization depends on the part-of-speech tag).

In [6]:
# look up the pattern name behind the match ID (the first item of the tuple)
nlp.vocab[matches[0][0]].text

'EMAIL_ADDRESS'

In [10]:
with open('data/wiki_mlk.txt', 'r') as f:
    text = f.read()

In [11]:
text

'Martin Luther King Jr. (born Michael King Jr.; January 15, 1929 – April 4, 1968) was an American Baptist minister and activist who became the most visible spokesman and leader in the American civil rights movement from 1955 until his assassination in 1968. King advanced civil rights through nonviolence and civil disobedience, inspired by his Christian beliefs and the nonviolent activism of Mahatma Gandhi. He was the son of early civil rights activist and minister Martin Luther King Sr.\n\nKing participated in and led marches for blacks\' right to vote, desegregation, labor rights, and other basic civil rights.[1] King led the 1955 Montgomery bus boycott and later became the first president of the Southern Christian Leadership Conference (SCLC). As president of the SCLC, he led the unsuccessful Albany Movement in Albany, Georgia, and helped organize some of the nonviolent 1963 protests in Birmingham, Alabama. King helped organize the 1963 March on Washington, where he delivered his f

In [12]:
nlp2 = spacy.load('en_core_web_sm')

In [13]:
matcher = Matcher(nlp.vocab)
pattern = [{"POS": "PROPN"}]
matcher.add("PROPER_NOUNS", [pattern])
doc = nlp2(text)
matches = matcher(doc)

print(len(matches))
for match in matches[:10]:
    print(match, doc[match[1]:match[2]])

103
(3232560085755078826, 0, 1) Martin
(3232560085755078826, 1, 2) Luther
(3232560085755078826, 2, 3) King
(3232560085755078826, 3, 4) Jr.
(3232560085755078826, 6, 7) Michael
(3232560085755078826, 7, 8) King
(3232560085755078826, 8, 9) Jr.
(3232560085755078826, 10, 11) January
(3232560085755078826, 16, 17) April
(3232560085755078826, 23, 24) American


We got all the proper nouns, but each one is matched as a separate single-token span. We would like multi-word spans instead, e.g. identify Martin Luther King Jr. as one span instead of 4.
```
OP - Operator or quantifier to determine how often to match a token pattern.
+  - Require the pattern to match 1 or more times.

```

In [14]:
matcher = Matcher(nlp.vocab)
pattern = [{"POS": "PROPN", "OP":"+"}]
matcher.add("PROPER_NOUNS", [pattern])
doc = nlp2(text)
matches = matcher(doc)
print(len(matches))

for match in matches[:10]:
    print(match, doc[match[1]:match[2]])

173
(3232560085755078826, 0, 1) Martin
(3232560085755078826, 0, 2) Martin Luther
(3232560085755078826, 1, 2) Luther
(3232560085755078826, 0, 3) Martin Luther King
(3232560085755078826, 1, 3) Luther King
(3232560085755078826, 2, 3) King
(3232560085755078826, 0, 4) Martin Luther King Jr.
(3232560085755078826, 1, 4) Luther King Jr.
(3232560085755078826, 2, 4) King Jr.
(3232560085755078826, 3, 4) Jr.


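The matcher now returns multi-word spans, but with a lot of repetition: every contiguous sub-span of a run of proper nouns is reported as its own match. We need to eliminate these overlapping matches by keeping only the longest version.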
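To see where the repetition comes from, note that a run of n consecutive proper nouns satisfies the `+` operator for each of its n(n+1)/2 contiguous sub-spans. The sketch below enumerates them in plain Python, without spaCy; `contiguous_subspans` is a hypothetical helper written for this illustration.

```python
def contiguous_subspans(start, end):
    """Return every (i, j) with start <= i < j <= end, i.e. all non-empty sub-spans."""
    return [(i, j) for j in range(start + 1, end + 1) for i in range(start, j)]


# "Martin Luther King Jr." occupies tokens 0 to 4 (end exclusive).
spans = contiguous_subspans(0, 4)
print(len(spans))  # 10, i.e. 4 * 5 / 2
for span in spans:
    print(span)
```

The ten `(start, end)` pairs are the same ten offsets listed for Martin Luther King Jr. in the output above.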

In [18]:
matcher = Matcher(nlp.vocab)
pattern = [{"POS": "PROPN", "OP":"+"}]
matcher.add("PROPER_NOUNS", [pattern], greedy='LONGEST')
doc = nlp2(text)
matches = matcher(doc)

print(len(matches))
for match in matches[:10]:
    print(match, doc[match[1]:match[2]])

62
(3232560085755078826, 470, 475) Martin Luther King Jr. Day
(3232560085755078826, 537, 542) Martin Luther King Jr. Memorial
(3232560085755078826, 0, 4) Martin Luther King Jr.
(3232560085755078826, 84, 88) Martin Luther King Sr
(3232560085755078826, 129, 133) Southern Christian Leadership Conference
(3232560085755078826, 248, 252) Director J. Edgar Hoover
(3232560085755078826, 6, 9) Michael King Jr.
(3232560085755078826, 326, 329) Nobel Peace Prize
(3232560085755078826, 423, 426) James Earl Ray
(3232560085755078826, 464, 467) Congressional Gold Medal


The spans look right. But if we look closely, `greedy='LONGEST'` does not return the matches in the order in which the proper nouns appear in the document; the longer spans come first. To confirm this, loop over the last 10 matches, i.e. `matches[-10:]`. We can use a lambda function to sort the matches by their `start` index. Remember that each match is `(match_id, start, end)`.
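That ordering is what you would expect from a filter that looks at the longest candidates first and discards anything overlapping a span it has already kept. The following is a plain-Python sketch of such a filter under that assumption; it is not spaCy's actual implementation, and `keep_longest` and `subspans` are hypothetical helpers.

```python
def subspans(start, end):
    """All non-empty contiguous (i, j) sub-spans of the token range [start, end)."""
    return [(i, j) for j in range(start + 1, end + 1) for i in range(start, j)]


def keep_longest(matches):
    """Greedy filter: prefer longer spans, drop any span overlapping one already kept."""
    # Longest first; ties are broken by the earlier start offset.
    ordered = sorted(matches, key=lambda m: (-(m[2] - m[1]), m[1]))
    kept, used = [], set()
    for match_id, start, end in ordered:
        if used.isdisjoint(range(start, end)):
            kept.append((match_id, start, end))
            used.update(range(start, end))
    return kept


PROPER_NOUNS = 3232560085755078826  # match ID copied from the notebook output

# Candidate matches for three proper-noun runs taken from the outputs above.
runs = [(0, 4), (6, 9), (470, 475)]
candidates = [(PROPER_NOUNS, i, j) for run in runs for (i, j) in subspans(*run)]

longest = keep_longest(candidates)
print([m[1:] for m in longest])      # [(470, 475), (0, 4), (6, 9)]

by_position = sorted(longest, key=lambda m: m[1])
print([m[1:] for m in by_position])  # [(0, 4), (6, 9), (470, 475)]
```

Sorting by `start` afterwards restores document order without bringing back the overlapping sub-spans.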

In [19]:
matcher = Matcher(nlp.vocab)
pattern = [{"POS": "PROPN", "OP":"+"}]
matcher.add("PROPER_NOUNS", [pattern], greedy='LONGEST')
doc = nlp2(text)
matches = matcher(doc)
matches.sort(key= lambda x: x[1])

print(len(matches))
for match in matches[:10]:
    print(match, doc[match[1]:match[2]])

62
(3232560085755078826, 0, 4) Martin Luther King Jr.
(3232560085755078826, 6, 9) Michael King Jr.
(3232560085755078826, 10, 11) January
(3232560085755078826, 16, 17) April
(3232560085755078826, 23, 25) American Baptist
(3232560085755078826, 50, 51) King
(3232560085755078826, 70, 72) Mahatma Gandhi
(3232560085755078826, 84, 88) Martin Luther King Sr
(3232560085755078826, 90, 91) King
(3232560085755078826, 114, 115) King


New task: find every place where a proper noun (or a run of proper nouns) is immediately followed by a verb.
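As a rough picture of what such a pattern expresses, the sketch below scans a list of `(word, pos)` pairs for a maximal run of `PROPN` tokens directly followed by a `VERB`. The tagged tokens are hand-written for illustration (an assumption, not tagger output), and `propn_runs_followed_by_verb` is a hypothetical helper, not part of spaCy.

```python
def propn_runs_followed_by_verb(tagged):
    """Return (start, end) spans: a maximal PROPN run plus the VERB right after it."""
    spans = []
    i = 0
    while i < len(tagged):
        if tagged[i][1] == "PROPN":
            run_start = i
            # Consume the whole run of proper nouns.
            while i < len(tagged) and tagged[i][1] == "PROPN":
                i += 1
            # Keep the run only if the next token is a verb.
            if i < len(tagged) and tagged[i][1] == "VERB":
                spans.append((run_start, i + 1))
        else:
            i += 1
    return spans


# Hand-tagged tokens (assumption for illustration only).
tagged = [
    ("Director", "PROPN"), ("J.", "PROPN"), ("Edgar", "PROPN"), ("Hoover", "PROPN"),
    ("considered", "VERB"), ("him", "PRON"), ("a", "DET"), ("radical", "NOUN"),
    (".", "PUNCT"), ("King", "PROPN"), ("won", "VERB"), ("the", "DET"),
    ("Nobel", "PROPN"), ("Peace", "PROPN"), ("Prize", "PROPN"),
]

spans = propn_runs_followed_by_verb(tagged)
phrases = [" ".join(word for word, _ in tagged[s:e]) for s, e in spans]
print(phrases)  # ['Director J. Edgar Hoover considered', 'King won']
```

The trailing "Nobel Peace Prize" run is not reported because no verb follows it.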

In [22]:
matcher = Matcher(nlp.vocab)
pattern = [{"POS": "PROPN", "OP":"+"}, {"POS": "VERB"}]
matcher.add("PROPER_NOUNS", [pattern], greedy='LONGEST')
doc = nlp2(text)
matches = matcher(doc)
matches.sort(key= lambda x: x[1])

print(len(matches))
for match in matches[:10]:
    print(match, doc[match[1]:match[2]])

6
(3232560085755078826, 50, 52) King advanced
(3232560085755078826, 90, 92) King participated
(3232560085755078826, 114, 116) King led
(3232560085755078826, 248, 253) Director J. Edgar Hoover considered
(3232560085755078826, 323, 325) King won
(3232560085755078826, 486, 489) United States beginning


In [40]:
import json
with open('data/alice.json', 'r') as f:
    data = json.load(f)

In [41]:
text = data[0][2][0]
print(text)

Alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to do: once or twice she had peeped into the book her sister was reading, but it had no pictures or conversations in it, `and what is the use of a book,' thought Alice `without pictures or conversation?'


In [42]:
text = text.replace("`", "'")
print(text)

Alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to do: once or twice she had peeped into the book her sister was reading, but it had no pictures or conversations in it, 'and what is the use of a book,' thought Alice 'without pictures or conversation?'


```
ORTH     - The exact verbatim text of a token.
IS_ALPHA - Token text consists of alphabetic characters.
IS_PUNCT - Token text is punctuation.
*        - Allow the pattern to match 0 or more times.
```

In [43]:
matcher = Matcher(nlp.vocab)
pattern = [
    {"ORTH": "'"},
    {"IS_ALPHA": True, "OP":"+"},
    {"IS_PUNCT": True, "OP":"*"},
    {"ORTH": "'"}
    ]
matcher.add("PROPER_NOUNS", [pattern], greedy='LONGEST')
doc = nlp(text)
matches = matcher(doc)
matches.sort(key=lambda x: x[1])

print(len(matches))
for match in matches:
    # print the current match, not the whole list
    print(match, doc[match[1]:match[2]])


2
(3232560085755078826, 47, 58) 'and what is the use of a book,'
(3232560085755078826, 60, 67) 'without pictures or conversation?'
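As a quick cross-check on the two quoted spans, a regular expression can pull out the same text. This swaps in a different technique (plain regex instead of the token-based `Matcher`), and it assumes the text contains no apostrophes other than the quote marks, which holds for this sentence but not in general.

```python
import re

# Tail of the first paragraph, after backticks were replaced with apostrophes.
text = (
    "but it had no pictures or conversations in it, "
    "'and what is the use of a book,' thought Alice "
    "'without pictures or conversation?'"
)

# An opening quote, one or more non-quote characters, then a closing quote.
quotes = re.findall(r"'[^']+'", text)
for quote in quotes:
    print(quote)
```

A contraction such as "don't" would break this regex, which is one reason the token-based pattern is the sturdier choice.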


Next, find out who said those words by extending the pattern with a speech verb followed by a proper noun.

In [44]:
speak_lemmas = ['think','speak']
matcher = Matcher(nlp.vocab)
pattern = [
    {"ORTH": "'"},
    {"IS_ALPHA": True, "OP":"+"},
    {"IS_PUNCT": True, "OP":"*"},
    {"ORTH": "'"},
    {"POS":"VERB", "LEMMA":{"IN": speak_lemmas}},
    {"POS":"PROPN", "OP":"+"}
    ]
matcher.add("PROPER_NOUNS", [pattern], greedy='LONGEST')
doc = nlp(text)
matches = matcher(doc)
matches.sort(key=lambda x: x[1])

print(len(matches))
for match in matches:
    # print the current match, not the whole list
    print(match, doc[match[1]:match[2]])

1
(3232560085755078826, 47, 60) 'and what is the use of a book,' thought Alice


To capture the entire quote, including the part that follows the speaker, append a second quoted segment to the pattern.

In [45]:
speak_lemmas = ['think','speak']
matcher = Matcher(nlp.vocab)
pattern = [
    {"ORTH": "'"},
    {"IS_ALPHA": True, "OP":"+"},
    {"IS_PUNCT": True, "OP":"*"},
    {"ORTH": "'"},
    {"POS":"VERB", "LEMMA":{"IN": speak_lemmas}},
    {"POS":"PROPN", "OP":"+"},
    {"ORTH": "'"},
    {"IS_ALPHA": True, "OP":"+"},
    {"IS_PUNCT": True, "OP":"*"},
    {"ORTH": "'"}
    ]
matcher.add("PROPER_NOUNS", [pattern], greedy='LONGEST')
doc = nlp(text)
matches = matcher(doc)
matches.sort(key=lambda x: x[1])

print(len(matches))
for match in matches:
    # print the current match, not the whole list
    print(match, doc[match[1]:match[2]])

1
(3232560085755078826, 47, 67) 'and what is the use of a book,' thought Alice 'without pictures or conversation?'


Loop over every paragraph in `CHAPTER I` instead of only the first one, which is all we have matched so far.

In [52]:
for text2 in data[0][2]:
    text2 = text2.replace("`", "'")
    doc = nlp(text2)
    matches = matcher(doc)
    matches.sort(key=lambda x: x[1])
    # print(len(matches))
    for match in matches[:10]:
        # print the current match, not the whole list
        print(match, doc[match[1]:match[2]])


(3232560085755078826, 47, 67) 'and what is the use of a book,' thought Alice 'without pictures or conversation?'


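We only get one match for the whole chapter. The problem is that the matcher holds a single pattern, so it recognizes only one layout (quote, speech verb, speaker, quote). It needs several patterns to cover the other layouts, and `speak_lemmas` should also include `say`.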

In [53]:
speak_lemmas = ["think", "say"]
matcher = Matcher(nlp.vocab)
pattern1 = [{'ORTH': "'"}, {'IS_ALPHA': True, "OP": "+"}, {'IS_PUNCT': True, "OP": "*"}, {'ORTH': "'"}, {"POS": "VERB", "LEMMA": {"IN": speak_lemmas}}, {"POS": "PROPN", "OP": "+"}, {'ORTH': "'"}, {'IS_ALPHA': True, "OP": "+"}, {'IS_PUNCT': True, "OP": "*"}, {'ORTH': "'"}]
pattern2 = [{'ORTH': "'"}, {'IS_ALPHA': True, "OP": "+"}, {'IS_PUNCT': True, "OP": "*"}, {'ORTH': "'"}, {"POS": "VERB", "LEMMA": {"IN": speak_lemmas}}, {"POS": "PROPN", "OP": "+"}]
pattern3 = [{"POS": "PROPN", "OP": "+"},{"POS": "VERB", "LEMMA": {"IN": speak_lemmas}}, {'ORTH': "'"}, {'IS_ALPHA': True, "OP": "+"}, {'IS_PUNCT': True, "OP": "*"}, {'ORTH': "'"}]
matcher.add("PROPER_NOUNS", [pattern1, pattern2, pattern3], greedy='LONGEST')
for text3 in data[0][2]:
    text3 = text3.replace("`", "'")
    doc = nlp(text3)
    matches = matcher(doc)
    matches.sort(key = lambda x: x[1])
    print (len(matches))
    for match in matches[:10]:
        print (match, doc[match[1]:match[2]])

1
(3232560085755078826, 47, 67) 'and what is the use of a book,' thought Alice 'without pictures or conversation?'
0
0
0
0
0
1
(3232560085755078826, 0, 6) 'Well!' thought Alice
0
0
0
0
0
0
0
1
(3232560085755078826, 57, 68) 'which certainly was not here before,' said Alice
0
0
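The three patterns above repeat the same fragments, so one way to keep them in sync is to build them from shared pieces. This is a sketch in plain Python; the names `QUOTE`, `SPEECH_VERB` and `SPEAKER` are hypothetical and introduced here only for illustration.

```python
speak_lemmas = ["think", "say"]

# Reusable fragments (hypothetical names, not from the original notebook).
QUOTE = [
    {"ORTH": "'"},
    {"IS_ALPHA": True, "OP": "+"},
    {"IS_PUNCT": True, "OP": "*"},
    {"ORTH": "'"},
]
SPEECH_VERB = [{"POS": "VERB", "LEMMA": {"IN": speak_lemmas}}]
SPEAKER = [{"POS": "PROPN", "OP": "+"}]

# List concatenation produces the same token patterns as the literals above.
pattern1 = QUOTE + SPEECH_VERB + SPEAKER + QUOTE  # 'quote' verb Speaker 'quote'
pattern2 = QUOTE + SPEECH_VERB + SPEAKER          # 'quote' verb Speaker
pattern3 = SPEAKER + SPEECH_VERB + QUOTE          # Speaker verb 'quote'

for name, pattern in [("pattern1", pattern1), ("pattern2", pattern2), ("pattern3", pattern3)]:
    print(name, len(pattern))
```

The resulting lists can be passed to `matcher.add("PROPER_NOUNS", [pattern1, pattern2, pattern3], greedy='LONGEST')` exactly as in the cell above. The concatenated lists share the same dict objects, which is fine as long as nothing mutates them.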
