# Week 06: Dependency Parser and spaCy
This week's assignment is to identify the grammar pattern VERB-PREP-NOUN using two different methods: traversing dependency trees and applying rule-based matching. Along the way you will practice several features of spaCy. 

Data used in this assignment:  
https://drive.google.com/file/d/1OIZPsDezgLaBjw3OX30YFyeFkzegtwP8/view?usp=sharing

* sentences.s2orc.txt

spaCy tutorials:  
https://www.machinelearningplus.com/spacy-tutorial-nlp/#phrasematcher  
https://spacy.io/usage/linguistic-features#entity-linking

## Requirements
* pandas
* spacy



### Installation of spacy

In [None]:
! pip install spacy
! python -m spacy download en_core_web_sm

### Read Data

In [1]:
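import pandas as pd

def loadData(path):
    """Read a tab-separated file and return its second column as a DataFrame of sentences."""
    sents = []
    with open(path, encoding='utf-8') as f:
        for line in f:
            # Each line holds tab-separated fields; the sentence is the second one
            fields = line.rstrip("\n").split("\t")
            sents.append(fields[1])
    return pd.DataFrame({"sentence": sents})

data = loadData("data/sentences.s2orc.txt")
print(data.head())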


                                            sentence
0  Meanwhile, an analysis of the literature shows...
1  Meanwhile, this list can be supplemented with ...
2  At the same time, in many cases, several instr...
3  It is not possible to give a systematic assess...
4  Correlation was calculated for the years, wher...
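
The loader above indexes `line[1]` directly, so a line without a tab would raise an `IndexError`. The following is a minimal sketch of a more defensive reader, assuming the file layout is one `id<TAB>sentence` pair per line (inferred from the code above, not confirmed by the data description). The name `read_sentences` and the sample lines are hypothetical.

```python
import io

def read_sentences(handle):
    """Collect the second tab-separated field of each line, skipping malformed lines."""
    sentences = []
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:  # no sentence column on this line
            continue
        sentences.append(fields[1])
    return sentences

# Hypothetical sample standing in for the real file
sample = io.StringIO("1\tFirst sentence.\nbroken line\n2\tSecond sentence.\n")
sentences = read_sentences(sample)
print(sentences)
```

The returned list can be wrapped in `pd.DataFrame({"sentence": sentences})` to obtain the same shape as `data`.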


In [2]:
import re
import pandas as pd
import spacy
nlp = spacy.load('en_core_web_sm')

### Spacy example
If you have any problems, look up the documentation [here](https://spacy.io/usage/linguistic-features).


In [3]:
example_text = """The economic situation of the country is on edge , as the stock 
market crashed causing loss of millions. Citizens who had their main investment 
in the share-market are facing a great loss. Many companies might lay off 
thousands of people to reduce labor cost.
He began immediately to rant about the gas price .
"""

# Remove newline character
example_text = re.sub("\n", '', example_text)
example_doc = nlp(example_text)

<font color="red">**[ TODO ]**</font> Please print the second sentence of example_text.

In [4]:
sents = list(example_doc.sents)
print(sents[1])

Citizens who had their main investment in the share-market are facing a great loss.


Let's start with some simple linguistic features we have been dealing with.

<font color="red">**[ TODO ]**</font> Please print out the following token features of the first sentence in example_text:  
text,  lemma,  POS

In [5]:
for token in sents[0]:
    print(token.text, token.lemma_, token.pos_)

The the DET
economic economic ADJ
situation situation NOUN
of of ADP
the the DET
country country NOUN
is be AUX
on on ADP
edge edge NOUN
, , PUNCT
as as SCONJ
the the DET
stock stock NOUN
market market NOUN
crashed crash VERB
causing cause VERB
loss loss NOUN
of of ADP
millions million NOUN
. . PUNCT


<font color="red">**[ TODO ]**</font> Data Process 1: Please run the s2orc data through spacy and store the result in data_doc

In [45]:
# nlp.pipe processes the sentences as a batched stream and yields one Doc per sentence
data_doc = list(nlp.pipe(data['sentence']))

### Named Entity Recognition
Named Entity: a real-world object, such as a person, location, organization, product, etc., that can be denoted with a proper name.  

The following is an example of named entity recognition with spaCy:

In [7]:
ner_doc = nlp("Ada Lovelace was born in New York at Thanksgiving.")

# Document level
for e in ner_doc.ents:
    print(e.text, e.label_) 

Ada Lovelace PERSON
New York GPE
Thanksgiving DATE


In [8]:
from spacy import displacy
displacy.render(ner_doc,style='ent',jupyter=True)

<font color="red">**[ TODO ]**</font> Data Process 2: Please replace all named entities in data_doc with their labels.  
For example,  
"Ada Lovelace was born in New York at Thanksgiving." should be adjusted to  
"PERSON was born in GPE at DATE."

In [9]:
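data_doc1 = []
for sentence in data_doc:
    text = sentence.text
    # Substitute each entity's surface text with its label, e.g. "New York" -> "GPE"
    for e in sentence.ents:
        text = text.replace(e.text, e.label_)
    # Re-parse so that tokens, POS tags, and dependencies reflect the replaced text
    data_doc1.append(nlp(text))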
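
One caveat with `str.replace` is that it substitutes every occurrence of the entity string, including occurrences inside longer words, and a label inserted earlier can be hit again by a later replacement. A possible alternative, sketched below under assumptions, is to splice by character offsets, which spaCy spans expose as `start_char` and `end_char`. The helper `replace_spans` and the hard-coded offsets are hypothetical and stand in for values read from `sentence.ents`.

```python
def replace_spans(text, spans):
    """Replace each (start_char, end_char, label) span in text with its label."""
    # Work from the last span to the first so that earlier offsets stay valid
    for start, end, label in sorted(spans, reverse=True):
        text = text[:start] + label + text[end:]
    return text

sentence = "Ada Lovelace was born in New York at Thanksgiving."
# Hypothetical offsets, in the form (e.start_char, e.end_char, e.label_)
spans = [(0, 12, "PERSON"), (25, 33, "GPE"), (37, 49, "DATE")]
replaced = replace_spans(sentence, spans)
print(replaced)
```

In the notebook, the `spans` list could be built as `[(e.start_char, e.end_char, e.label_) for e in sentence.ents]`.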

### Dependency Parser

If you have problems with the dependency labels, look up the documentation [here](https://universaldependencies.org/en/dep/index.html). You can also call `spacy.explain(label)` for a short description of a label. 


In [36]:
# Example of Dependency Parser
print(sents[2])
for token in sents[2]:
    print(token.text, token.dep_)

Many companies might lay off thousands of people to reduce labor cost.
Many amod
companies nsubj
might aux
lay ROOT
off prt
thousands dobj
of prep
people pobj
to aux
reduce advcl
labor compound
cost dobj
. punct


In [20]:
from spacy import displacy

displacy.render(sents[2], style="dep")

To traverse a dependency tree, use the following properties of the token object:  
token.children, token.lefts, token.rights  

If you have any problems, please check [here](https://spacy.io/api/token#children).
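
To see what these three properties return without loading a model, the sketch below builds a `Doc` by hand. The words and dependency labels follow the example output above, while the `heads` indices are a hand-written assumption about the parse rather than actual parser output.

```python
from spacy.tokens import Doc
from spacy.vocab import Vocab

words = ["Many", "companies", "might", "lay", "off", "thousands", "of", "people"]
pos = ["ADJ", "NOUN", "AUX", "VERB", "ADP", "NOUN", "ADP", "NOUN"]
# Assumed head index of each token (the root "lay" points to itself)
heads = [1, 3, 3, 3, 3, 3, 5, 6]
deps = ["amod", "nsubj", "aux", "ROOT", "prt", "dobj", "prep", "pobj"]
doc = Doc(Vocab(), words=words, pos=pos, heads=heads, deps=deps)

verb = doc[3]  # "lay"
lefts = [t.text for t in verb.lefts]        # children that precede the verb
rights = [t.text for t in verb.rights]      # children that follow the verb
children = [t.text for t in verb.children]  # all direct children
print(lefts, rights, children)
```

`token.lefts` and `token.rights` only yield direct children, so reaching "people" from "lay" takes further steps through "thousands" and "of".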

<font color="red">**[ TODO ]**</font> Please identify a VERB-PREP-NOUN grammar structure in sents[2] by traversing the dependency tree.  
Expected output:  
(lay, off, thousands)


In [43]:
result = []
sent = sents[2]
for token in sent:
    if token.pos_ == 'VERB':
        # Case 1: the verb is immediately followed by ADP + NOUN (e.g. "lay off thousands")
        if token.i - sent.start + 2 < len(sent):
            if token.nbor(1).pos_ == 'ADP' and token.nbor(2).pos_ == 'NOUN':
                result.append((token.text, token.nbor(1).text, token.nbor(2).text))
        for right_child in token.rights:
            # Case 2: verb -> preposition -> noun
            if right_child.pos_ == 'ADP':
                for child in right_child.rights:
                    if child.pos_ == 'NOUN':
                        result.append((token.text, right_child.text, child.text))
            # Case 3: verb -> noun, with a preposition attached to the left of the noun
            if right_child.pos_ == 'NOUN':
                for child in right_child.lefts:
                    if child.pos_ == 'ADP':
                        # Keep the (verb, prep, noun) order
                        result.append((token.text, child.text, right_child.text))
print(result)

[('lay', 'off', 'thousands')]


<font color="red">**[ TODO ]**</font>  Please identify all VERB-PREP-NOUN grammar structures in data_doc by traversing the dependency trees, and save the results in a list of tuples named dep_gp.


In [71]:
dep_gp = []
for sentence in data_doc1:
    for token in sentence:
        if token.pos_ == 'VERB':
            # Case 1: the verb is immediately followed by ADP + NOUN
            if token.i + 2 < len(sentence):
                if token.nbor(1).pos_ == 'ADP' and token.nbor(2).pos_ == 'NOUN':
                    dep_gp.append((token.text, token.nbor(1).text, token.nbor(2).text))
            for right_child in token.rights:
                # Case 2: verb -> preposition -> noun
                if right_child.pos_ == 'ADP':
                    for child in right_child.rights:
                        if child.pos_ == 'NOUN':
                            dep_gp.append((token.text, right_child.text, child.text))
                # Case 3: verb -> noun, with a preposition attached to the left of the noun
                if right_child.pos_ == 'NOUN':
                    for child in right_child.lefts:
                        if child.pos_ == 'ADP':
                            dep_gp.append((token.text, child.text, right_child.text))

<font color="red">**[ TODO ]**</font>  Please print out all VERB-PREP-NOUN grammar patterns in dep_gp with the verb "run".


In [108]:
for e in dep_gp:
    if e[0] == 'run':
        print(e)

('run', 'at', 'number')
('run', 'with', 'algorithms')
('run', 'with', 'algorithms')
('run', 'from', 'government')
('run', 'as', 'phase')
('run', 'by', 'family')
('run', 'by', 'members')
('run', 'between', 'power')
('run', 'between', 'cycles')
('run', 'for', 'offices')
('run', 'on', 'step')
('run', 'on', 'parameters')
('run', 'by', 'volunteers')
('run', 'by', 'volunteers')
('run', 'on', 'sets')
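
The output above contains repeated tuples, because the same pattern can occur in several sentences or be found by more than one traversal case. A minimal sketch of counting distinct patterns with `collections.Counter` follows. The list `sample_gp` holds a few tuples copied from the output above and stands in for the full `dep_gp`, and `count_patterns` is a hypothetical helper name.

```python
from collections import Counter

# A few tuples copied from the output above, standing in for the full dep_gp list
sample_gp = [
    ("run", "with", "algorithms"),
    ("run", "with", "algorithms"),
    ("run", "by", "family"),
    ("run", "by", "volunteers"),
    ("run", "by", "volunteers"),
]

def count_patterns(patterns, verb):
    """Count how often each (verb, prep, noun) tuple occurs for the given verb."""
    return Counter(p for p in patterns if p[0] == verb)

counts = count_patterns(sample_gp, "run")
for pattern, n in counts.most_common():
    print(n, pattern)
```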


### Rule Based Methods 
We can also write custom rules for spaCy to match token patterns.  
[Documentation](https://spacy.io/api/matcher)

In [67]:
from spacy.matcher import Matcher 

In [68]:
# Example text
text = """I visited Manali last time . Around same budget trips ? I was visiting Ladakh this summer . I have planned visiting New York and other abroad places for next year. Have you ever visited Kodaikanal? """
text = re.sub('\n', '', text)
match_doc = nlp(text)

In [90]:
# Initialize the matcher
matcher = Matcher(nlp.vocab)

# Write a pattern that matches a form of "visit" + place
my_pattern = [{"LEMMA": "visit"}, {"POS": "PROPN"}]

# Add the pattern to the matcher and apply the matcher to the doc
matcher.add("Visiting_places", [my_pattern])
matches = matcher(match_doc)

# Count the number of matches
print(" matches found:", len(matches))

# Iterate over the matches and print the span text
for match_id, start, end in matches:
    print("Match found:", match_doc[start:end].text)

 matches found: 4
Match found: visited Manali
Match found: visiting Ladakh
Match found: visiting New
Match found: visited Kodaikanal
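
Note that the pattern above matches only one proper noun, so "New York" is cut to "visiting New". One way to capture multi-token names is the `"OP": "+"` operator together with `greedy="LONGEST"` in `matcher.add`, as in the sketch below. The `Doc` is built by hand with assumed POS tags and lemmas so that the example runs without a model, and the key `VISIT_PLACE` is a hypothetical name.

```python
from spacy.matcher import Matcher
from spacy.tokens import Doc
from spacy.vocab import Vocab

vocab = Vocab()
# Hand-annotated tokens (assumed tags, not model output)
doc = Doc(
    vocab,
    words=["I", "planned", "visiting", "New", "York"],
    pos=["PRON", "VERB", "VERB", "PROPN", "PROPN"],
    lemmas=["I", "plan", "visit", "New", "York"],
)

matcher = Matcher(vocab)
# A form of "visit" followed by one or more proper nouns
pattern = [{"LEMMA": "visit"}, {"POS": "PROPN", "OP": "+"}]
# greedy="LONGEST" keeps only the longest of the overlapping matches
matcher.add("VISIT_PLACE", [pattern], greedy="LONGEST")

matched = [doc[start:end].text for _, start, end in matcher(doc)]
print(matched)
```

Without the `greedy` argument, the matcher would report both "visiting New" and "visiting New York".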


<font color="red">**[ TODO ]**</font> Please identify all VERB-PREP-NOUN grammar structures in data_doc by applying a matcher rule, and store the results in a list of tuples named rule_gp. 


In [97]:
rule_gp = []

# Initialize the matcher
matcher = Matcher(nlp.vocab)

pattern = [{"POS": "VERB"}, {"POS": "ADP"}, {"POS": "NOUN"}]

# Add the pattern to the matcher and apply the matcher to the doc
matcher.add("VERB-PREP-NOUN", [pattern])

for sentence in data_doc1:
    matches = matcher(sentence)
    for match_id, start, end in matches:
        # Read the token texts directly instead of splitting the span string on spaces
        span = sentence[start:end]
        rule_gp.append(tuple(token.text for token in span))

<font color="red">**[ TODO ]**</font>  Please print out all VERB-PREP-NOUN grammar patterns in rule_gp with the verb "run".


In [107]:
for e in rule_gp:
    if e[0] == 'run':
        print(e)

('run', 'by', 'family')
('run', 'on', 'step')
('run', 'by', 'volunteers')
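
The matcher rule requires the three tokens to be adjacent, whereas the tree traversal also accepts a noun separated from the preposition by other words (for example a determiner or an adjective), which is why it returns more patterns. The sketch below compares the two "run" outputs shown in this notebook with plain set operations. The tuples are copied from those outputs, and the variable names are hypothetical.

```python
# Distinct tuples copied from the dependency-traversal output for "run"
dep_run = {
    ("run", "at", "number"), ("run", "with", "algorithms"),
    ("run", "from", "government"), ("run", "as", "phase"),
    ("run", "by", "family"), ("run", "by", "members"),
    ("run", "between", "power"), ("run", "between", "cycles"),
    ("run", "for", "offices"), ("run", "on", "step"),
    ("run", "on", "parameters"), ("run", "by", "volunteers"),
    ("run", "on", "sets"),
}
# Tuples copied from the matcher output for "run"
rule_run = {("run", "by", "family"), ("run", "on", "step"), ("run", "by", "volunteers")}

found_by_both = sorted(dep_run & rule_run)
only_dep = sorted(dep_run - rule_run)  # patterns the adjacency rule misses
only_rule = sorted(rule_run - dep_run)
print(len(found_by_both), len(only_dep), len(only_rule))
```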


## TA's Notes

Once you complete the assignment, please use [this link](https://docs.google.com/spreadsheets/d/1OKbXhcv6E3FEQDPnbHEHEeHvpxv01jxugMP7WwnKqKw/edit#gid=258852025) to reserve a demo time.  
The score is only given after the TAs review your implementation, so <u>**make sure you make an appointment with a TA before the deadline**</u>.  <br>After the demo, please upload your assignment to elearn. You only need to hand in this ipynb file, renamed to XXXXXXXXX(Your student ID).ipynb.
<br>Note that **late submissions will not be accepted**.