# Week 06: Dependency Parser and spaCy
The assignment this week is to identify the grammar pattern VERB-PREP-NOUN using two different methods: traversing the dependency tree and applying a rule-based matcher. You will practice several spaCy features in the process.

Data used in this assignment:  
https://drive.google.com/file/d/1OIZPsDezgLaBjw3OX30YFyeFkzegtwP8/view?usp=sharing

* sentences.s2orc.txt

spacy tutorials:  
https://www.machinelearningplus.com/spacy-tutorial-nlp/#phrasematcher  
https://spacy.io/usage/linguistic-features#entity-linking

## Requirements
* pandas
* spacy



### Installation of spacy

In [None]:
! pip install spacy
! python -m spacy download en_core_web_sm

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
2022-10-27 02:00:26.267163: E tensorflow/stream_executor/cuda/cuda_driver.cc:271] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting en-core-web-sm==3.4.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.4.1/en_core_web_sm-3.4.1-py3-none-any.whl (12.8 MB)
     |████████████████████████████████| 12.8 MB 1.6 MB/s 
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


### Read Data

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
import pandas as pd

def loadData(path):
    with open(path) as f:
        sents = []
        for line in f.readlines():
            line = line.strip("\n").split("\t")
            sents.append(line[1])
    return pd.DataFrame({"sentence":sents})
data = loadData("/content/drive/MyDrive/graduate/nlp/week6/sentences.s2orc.txt")
print(data.head())


                                            sentence
0  Meanwhile, an analysis of the literature shows...
1  Meanwhile, this list can be supplemented with ...
2  At the same time, in many cases, several instr...
3  It is not possible to give a systematic assess...
4  Correlation was calculated for the years, wher...
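
`loadData` reads `line[1]`, so it assumes every line holds at least two tab-separated fields and raises an `IndexError` otherwise. A minimal sketch of a more defensive reader is shown here; the function name `load_sentences` and the two-column sample are hypothetical and only stand in for the real file, whose exact layout is not shown in this notebook.

```python
import io

def load_sentences(lines):
    """Collect the second tab-separated field of each line, skipping malformed lines."""
    sents = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # no tab on this line, so there is no sentence column
        sents.append(fields[1])
    return sents

# Hypothetical sample standing in for sentences.s2orc.txt
sample = io.StringIO("id1\tFirst sentence.\nmalformed line\nid2\tSecond sentence.\n")
loaded = load_sentences(sample)
print(loaded)
```

The resulting list can be wrapped in `pd.DataFrame({"sentence": loaded})` in the same way as in `loadData`.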


In [None]:
import re
import spacy
nlp = spacy.load('en_core_web_sm')

### Spacy example
If you have any problems, look up the documentation [here](https://spacy.io/usage/linguistic-features).


In [None]:
example_text = """The economic situation of the country is on edge , as the stock 
market crashed causing loss of millions. Citizens who had their main investment 
in the share-market are facing a great loss. Many companies might lay off 
thousands of people to reduce labor cost.
He began immediately to rant about the gas price .
"""

# Remove newline character
example_text = re.sub("\n", '', example_text)
example_doc = nlp(example_text)
print(example_doc)

The economic situation of the country is on edge , as the stock market crashed causing loss of millions. Citizens who had their main investment in the share-market are facing a great loss. Many companies might lay off thousands of people to reduce labor cost.He began immediately to rant about the gas price .


<font color="red">**[ TODO ]**</font> Please print out the 2nd sentence in the example_text

In [None]:
# doc.sents is a generator, so materialize it as a list to index into it
sents = list(example_doc.sents)
print(sents[1])

Citizens who had their main investment in the share-market are facing a great loss.
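
`doc.sents` is a generator of `Span` objects, which is why it has to be turned into a list before indexing. The sketch below shows the same idea on a toy text; it assumes a blank English pipeline with spaCy's rule-based `sentencizer` instead of `en_core_web_sm`, so it runs without downloading a model, and the names `nlp_blank` and `toy_sents` are illustrative only.

```python
import spacy

# Blank pipeline plus the rule-based sentencizer (no trained model needed)
nlp_blank = spacy.blank("en")
nlp_blank.add_pipe("sentencizer")

toy_doc = nlp_blank("Citizens are facing a great loss. Many companies might lay off people.")

# Materialize the generator so that individual sentences can be indexed
toy_sents = list(toy_doc.sents)
print(toy_sents[1].text)
```

The sentencizer splits on punctuation only, so its boundaries may differ from those produced by the dependency parser in `en_core_web_sm`.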


Let's start with some simple linguistic features we have been dealing with.

<font color="red">**[ TODO ]**</font> Please print out the following token features of the first sentence in example_text:  
text,  lemma,  POS

In [None]:
for token in sents[0]:
    print(token.text, token.lemma_, token.pos_)

The the DET
economic economic ADJ
situation situation NOUN
of of ADP
the the DET
country country NOUN
is be AUX
on on ADP
edge edge NOUN
, , PUNCT
as as SCONJ
the the DET
stock stock NOUN
market market NOUN
crashed crash VERB
causing cause VERB
loss loss NOUN
of of ADP
millions million NOUN
. . PUNCT


<font color="red">**[ TODO ]**</font> Data Process 1: Please run the s2orc data through spacy and store the result in data_doc

In [None]:
# Join all sentences into one string, separated by single spaces
doc_sentences = " ".join(data.sentence)

In [None]:
nlp.max_length = len(doc_sentences) + 100
data_doc = nlp(doc_sentences)
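
Concatenating everything into one string requires raising `nlp.max_length` and lets spaCy re-segment the sentences. An alternative, sketched here and not the approach used in this notebook, is to stream the sentences through `nlp.pipe`, which yields one `Doc` per input text. The sketch uses a blank pipeline (an assumption, to avoid needing the model); in the notebook the loaded `nlp` object would take its place.

```python
import spacy

# Assumption: a blank tokenizer-only pipeline stands in for en_core_web_sm
pipe_nlp = spacy.blank("en")

texts = [
    "Meanwhile, an analysis of the literature shows that the development of indicators of financial stability has not yet been completed.",
    "Meanwhile, this list can be supplemented with instruments of monetary policy, which also have an impact on financial stability.",
    "Correlation was calculated for the years, where the information is available for both indicators.",
]

# nlp.pipe processes the texts in batches and yields one Doc per text
docs = list(pipe_nlp.pipe(texts, batch_size=2))
print(len(docs), docs[0][0].text)
```

With this layout each original sentence stays in its own `Doc`, at the cost of changing the later cells that expect a single `data_doc`.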

In [None]:
data_sents = list(data_doc.sents)
print(data_sents[0])

Meanwhile, an analysis of the literature shows that the development of indicators of financial stability has not yet been completed.


In [None]:
for token in data_sents[0]:
  print(token.text, token.lemma_, token.pos_)

Meanwhile meanwhile ADV
, , PUNCT
an an DET
analysis analysis NOUN
of of ADP
the the DET
literature literature NOUN
shows show VERB
that that SCONJ
the the DET
development development NOUN
of of ADP
indicators indicator NOUN
of of ADP
financial financial ADJ
stability stability NOUN
has have AUX
not not PART
yet yet ADV
been be AUX
completed complete VERB
. . PUNCT


### Named Entity Recognition
Named Entity: a real-world object, such as a person, location, organization, product, etc., that can be denoted with a proper name.  

The following is an example of named entity recognition using spacy

In [None]:
ner_doc = nlp("Ada Lovelace was born in New York at Thanksgiving. Ada Lovelace, who is a nice Ada Lovelace, was born in the US at Thanksgiving.")
# Document level
for e in ner_doc.ents:
  print(e.text,e.label_)

Ada Lovelace PERSON
New York GPE
Thanksgiving DATE
Ada Lovelace PERSON
Ada Lovelace PERSON
US GPE
Thanksgiving DATE


In [None]:
from spacy import displacy
displacy.render(ner_doc,style='ent',jupyter=True)

In [None]:
# Document level
# nlp.add_pipe("merge_entities")

ner_doc_string = str(ner_doc)
# Replace from the last entity to the first so earlier character offsets stay valid
for e in reversed(ner_doc.ents):
    ner_doc_string = ner_doc_string[:e.start_char] + e.label_ + ner_doc_string[e.end_char:]

nlp.max_length = len(ner_doc_string) + 100
ner_doc = nlp(ner_doc_string)
print(ner_doc)

PERSON was born in GPE at DATE. PERSON, who is a nice PERSON, was born in the GPE at DATE.
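
The loop above walks the entities in reverse because every replacement changes the length of the string: editing from the end keeps the character offsets of the earlier entities valid. The idea can be sketched without spaCy; `replace_spans` is a hypothetical helper, and the spans are located with `str.find` here instead of coming from `doc.ents`.

```python
def replace_spans(text, spans):
    """Replace (start, end, label) character spans, working from the end of the text."""
    for start, end, label in sorted(spans, reverse=True):
        text = text[:start] + label + text[end:]
    return text

sentence = "Ada Lovelace was born in New York at Thanksgiving."

# Build (start, end, label) spans by hand; in the notebook these come from e.start_char / e.end_char
spans = []
for surface, label in [("Ada Lovelace", "PERSON"), ("New York", "GPE"), ("Thanksgiving", "DATE")]:
    start = sentence.find(surface)
    spans.append((start, start + len(surface), label))

replaced = replace_spans(sentence, spans)
print(replaced)
```

Replacing front to back with the original offsets would cut the string in the wrong places as soon as a label is shorter or longer than the entity text.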


<font color="red">**[ TODO ]**</font> Data Process 2: Please replace all named entities in data_doc with their labels.  
For example,  
"Ada Lovelace was born in New York at Thanksgiving." should be adjusted to  
"PERSON was born in GPE at DATE."

In [None]:
### Before replace Named Entity 

count = 0
for index in range(len(data_doc)):
  if str(data_doc[index+1]) == ',' :
    print(data_doc[index], end='')
  else:
    print(data_doc[index], end=' ')
  if str(data_doc[index]) == '.':
    print('\n')
    count += 1
  if count == 10:
    break
    

Meanwhile, an analysis of the literature shows that the development of indicators of financial stability has not yet been completed . 

Meanwhile, this list can be supplemented with instruments of monetary policy, which also have an impact on financial stability . 

At the same time, in many cases, several instruments are used to reduce financial instability, which contributes to the achievement of various intermediate goals . 

It is not possible to give a systematic assessment of financial stability and coordinate the use of monetary, macro - prudential and micro - prudential policies in order to reduce systemic risks . 

Correlation was calculated for the years, where the information is available for both indicators . 

Table 4 defines the criteria for market and institutional balance of financial stability, formed for the Russian economy . 

The development of a risk map is necessary in order to determine the objects of regulation . 

Blowing out a bubble has little effect on the a

In [None]:
### Replace Named Entity 

data_doc_string = str(data_doc)
# Replace from the last entity to the first so earlier character offsets stay valid
for e in reversed(data_doc.ents):
    data_doc_string = data_doc_string[:e.start_char] + e.label_ + data_doc_string[e.end_char:]
nlp.max_length = len(data_doc_string) + 100
data_doc = nlp(data_doc_string)

In [None]:
### After replace Named Entity 

count = 0
for index in range(len(data_doc)):
  if str(data_doc[index+1]) == ',':
    print(data_doc[index], end='')
  else:
    print(data_doc[index], end=' ')
  if str(data_doc[index]) == '.':
    print('\n')
    count += 1
  if count == 10:
    break
    

Meanwhile, an analysis of the literature shows that the development of indicators of financial stability has not yet been completed . 

Meanwhile, this list can be supplemented with instruments of monetary policy, which also have an impact on financial stability . 

At the same time, in many cases, several instruments are used to reduce financial instability, which contributes to the achievement of various intermediate goals . 

It is not possible to give a systematic assessment of financial stability and coordinate the use of monetary, macro - prudential and micro - prudential policies in order to reduce systemic risks . 

Correlation was calculated for DATE, where the information is available for both indicators . 

Table CARDINAL defines the criteria for market and institutional balance of financial stability, formed for the NORP economy . 

The development of a risk map is necessary in order to determine the objects of regulation . 

Blowing out a bubble has little effect on the as

### Dependency Parser

If you have questions about the dependency labels, look up the documentation [here](https://universaldependencies.org/en/dep/index.html). 


In [None]:
# Example of Dependency Parser
print(sents[2])
for token in sents[2]:
    print(token.text, token.dep_)

Many companies might lay off thousands of people to reduce labor cost.
Many amod
companies nsubj
might aux
lay ROOT
off prt
thousands dobj
of prep
people pobj
to aux
reduce advcl
labor compound
cost dobj
. punct


In [None]:
from spacy import displacy

displacy.render(sents[2], style="dep",jupyter=True)

To traverse a dependency tree, use the following properties of the Token object:  
token.children, token.lefts, token.rights  

If you have any problems, please check [here](https://spacy.io/api/token#children).
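
To make the three properties concrete, the sketch below builds a `Doc` by hand for the example sentence, so no trained model is needed. The dependency labels are the ones printed above; the `heads` list is an assumption written by hand to be consistent with those labels, not parser output.

```python
from spacy.tokens import Doc
from spacy.vocab import Vocab

words = ["Many", "companies", "might", "lay", "off", "thousands", "of", "people",
         "to", "reduce", "labor", "cost", "."]
deps = ["amod", "nsubj", "aux", "ROOT", "prt", "dobj", "prep", "pobj",
        "aux", "advcl", "compound", "dobj", "punct"]
# Assumed head index for each token (absolute positions in the doc)
heads = [1, 3, 3, 3, 3, 3, 5, 6, 9, 3, 11, 9, 3]

tree_doc = Doc(Vocab(), words=words, heads=heads, deps=deps)
lay = tree_doc[3]

left_children = [t.text for t in lay.lefts]     # children that precede the token
right_children = [t.text for t in lay.rights]   # children that follow the token
all_children = [t.text for t in lay.children]   # both, in document order

print(left_children, right_children)
```

Note that under these heads "thousands" is a child of "lay", not of "off", which matters when deciding how to reach the noun from the verb.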

<font color="red">**[ TODO ]**</font> Please identify a VERB-PREP-NOUN grammar structure in sents[2] by traversing the dependency tree.  
Expected output:  
(lay, off, thousands)


In [None]:
grammar_structure = {}
print(sents[2])
for token in sents[2]:
    if token.pos_ != 'VERB':
        continue
    # Case 1: the verb is directly followed by ADP + NOUN in surface order
    if token.nbor(1).pos_ == 'ADP' and token.nbor(2).pos_ == 'NOUN':
        grammar_structure.setdefault(token.lemma_, []).append((token, token.nbor(1), token.nbor(2)))
        continue
    # Case 2: dependency path verb -> ADP child on the right -> NOUN child on the right
    for prep in token.rights:
        if prep.pos_ != 'ADP':
            continue
        for noun in prep.rights:
            if noun.pos_ == 'NOUN':
                grammar_structure.setdefault(token.lemma_, []).append((token, prep, noun))

print(grammar_structure)
print(grammar_structure['lay'])

Many companies might lay off thousands of people to reduce labor cost.
{'lay': [(lay, off, thousands)]}
[(lay, off, thousands)]
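
A stricter variant is to follow dependency labels instead of POS tags: verb, then a `prep` child, then that preposition's `pobj` child. The sketch below is an assumption-laden illustration, not the expected solution: the function name `verb_prep_noun` is hypothetical, and the `Doc` is annotated by hand (heads, labels, and POS tags are assumptions) so that it runs without a model.

```python
from spacy.tokens import Doc
from spacy.vocab import Vocab

def verb_prep_noun(doc):
    """Return (verb, prep, noun) text triples along the path verb -> prep -> pobj."""
    triples = []
    for verb in doc:
        if verb.pos_ != "VERB":
            continue
        for prep in verb.rights:
            if prep.dep_ != "prep":
                continue
            for noun in prep.rights:
                if noun.dep_ == "pobj" and noun.pos_ == "NOUN":
                    triples.append((verb.text, prep.text, noun.text))
    return triples

# Hand-annotated toy sentence; the noun is not adjacent to the preposition
words = ["It", "contributes", "to", "the", "achievement", "."]
pos = ["PRON", "VERB", "ADP", "DET", "NOUN", "PUNCT"]
deps = ["nsubj", "ROOT", "prep", "det", "pobj", "punct"]
heads = [1, 1, 1, 4, 2, 1]
toy = Doc(Vocab(), words=words, pos=pos, heads=heads, deps=deps)

triples = verb_prep_noun(toy)
print(triples)
```

Because particles such as "off" in "lay off" carry the label `prt` rather than `prep`, this variant would not return (lay, off, thousands); the adjacency check in the cell above is what captures that case.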


<font color="red">**[ TODO ]**</font>  Please identify all VERB-PREP-NOUN grammar structures in data_doc by traversing the dependency trees and save the results as tuples in dep_gp.


In [None]:
dep_gp = {}
structure_count = 0
for sentence in data_sents:
  for token in sentence:
    if token.pos_ != 'VERB':
      continue
    # Case 1: the verb is directly followed by ADP + NOUN in surface order
    if token.nbor(1).pos_ == 'ADP' and token.nbor(2).pos_ == 'NOUN':
      dep_gp.setdefault(token.lemma_, []).append((token, token.nbor(1), token.nbor(2)))
      structure_count += 1
      continue
    # Case 2: dependency path verb -> ADP child on the right -> NOUN child on the right
    for prep in token.rights:
      if prep.pos_ != 'ADP':
        continue
      for noun in prep.rights:
        if noun.pos_ == 'NOUN':
          dep_gp.setdefault(token.lemma_, []).append((token, prep, noun))
          structure_count += 1

print('total keywords:',len(dep_gp))
print('total v-prep-n number:',structure_count)

total keywords: 1098
total v-prep-n number: 7149


In [None]:
dict(list(dep_gp.items())[:3])

{'supplement': [(supplemented, with, instruments)],
 'contribute': [(contributes, to, achievement),
  (contribute, to, harmonics),
  (contribute, to, luminance),
  (contribute, to, cancer),
  (contribute, to, generation),
  (contributes, to, mechanism),
  (contributing, to, flux),
  (contributes, to, stability),
  (contribute, to, carcinogenesis),
  (contribute, to, development),
  (contributing, to, development),
  (contributes, to, block),
  (contributed, to, injury),
  (contributed, to, results),
  (contribute, to, signal),
  (contribute, at, densities),
  (contribute, to, variation),
  (contribute, to, growth),
  (contribute, to, difference),
  (contribute, to, response),
  (contributed, to, magnitude),
  (contributed, to, detection),
  (contribute, to, ontology),
  (contribute, to, standardisation),
  (contribute, to, output),
  (contribute, towards, challenge),
  (contributing, to, differentiation),
  (contribute, to, development),
  (contribute, to, momentum),
  (contribute, to,

<font color="red">**[ TODO ]**</font>  Please print out all VERB-PREP-NOUN grammar patterns in dep_gp with the verb "provide".


In [None]:
dep_gp['provide']

[(provided, by, government),
 (provided, by, companies),
 (provided, for, system),
 (provided, for, variable),
 (provide, into, superconductivity),
 (provide, at, scale),
 (provide, to, reasoning),
 (provide, to, issues),
 (provides, for, expansion),
 (provide, on, models),
 (provides, at, sites),
 (provides, to, surface),
 (provided, by, T),
 (provide, in, sections),
 (provided, by, methods),
 (provide, through, forums),
 (provides, with, source),
 (provide, with, incentives),
 (provided, in, figure),
 (provide, for, wave),
 (providing, to, system),
 (provide, into, importance),
 (provide, with, practice),
 (provided, by, O),
 (provided, for, cultivation),
 (provides, to, students),
 (provide, from, depth),
 (provided, in, pieces),
 (provide, with, suggestions),
 (provided, on, types),
 (provide, with, way),
 (provide, for, systems),
 (provides, to, walls),
 (provided, during, period),
 (providing, with, accuracy),
 (provide, for, hole),
 (provide, to, participants),
 (provide, at, le
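
A long list like this is easier to read once it is summarized. As a small sketch, the prepositions can be tallied with `collections.Counter`. The tuples below are the first six entries of the output above, written as plain strings; in the notebook the tuples hold `Token` objects, so `prep.text` would be needed as the key.

```python
from collections import Counter

# First six (verb, prep, noun) entries from the dep_gp['provide'] output, as strings
provide_patterns = [
    ("provided", "by", "government"),
    ("provided", "by", "companies"),
    ("provided", "for", "system"),
    ("provided", "for", "variable"),
    ("provide", "into", "superconductivity"),
    ("provide", "at", "scale"),
]

# Count how often each preposition follows the verb
prep_counts = Counter(prep for _, prep, _ in provide_patterns)
print(prep_counts.most_common())
```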

### Rule Based Methods 
We can also build custom rules for spaCy to match token patterns.  
[Documentation](https://spacy.io/api/matcher)

In [None]:
from spacy.matcher import Matcher 

In [None]:
# Example text
text = """I visited Manali last time . Around same budget trips ? I was visiting Ladakh this summer . I have planned visiting New York and other abroad places for next year. Have you ever visited Kodaikanal? """
text = re.sub('\n', '', text)
match_doc = nlp(text)

In [None]:
# Initialize the matcher
matcher = Matcher(nlp.vocab)

# Write a pattern that matches a form of "visit" + place
my_pattern = [{"LEMMA": "visit"}, {"POS": "PROPN"}]

# Add the pattern to the matcher and apply the matcher to the doc
matcher.add("Visiting_places", [my_pattern])
matches = matcher(match_doc)

# Counting the no of matches
print(" matches found:", len(matches))

# Iterate over the matches and print the span text
for match_id, start, end in matches:
    print("Match found:", match_doc[start:end].text)

 matches found: 4
Match found: visited Manali
Match found: visiting Ladakh
Match found: visiting New
Match found: visited Kodaikanal
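
The third match stops at "New" because the pattern asks for exactly one `PROPN` token. One way to capture multi-token place names, sketched here under assumptions, is the `"OP": "+"` quantifier combined with `greedy="LONGEST"` so that only the longest overlapping match is kept. The `Doc` is annotated by hand (lemmas and POS tags are assumptions) so the example runs without a trained model.

```python
from spacy.matcher import Matcher
from spacy.tokens import Doc
from spacy.vocab import Vocab

vocab = Vocab()
words = ["I", "have", "planned", "visiting", "New", "York", "."]
lemmas = ["I", "have", "plan", "visit", "New", "York", "."]
pos = ["PRON", "AUX", "VERB", "VERB", "PROPN", "PROPN", "PUNCT"]
toy_doc = Doc(vocab, words=words, lemmas=lemmas, pos=pos)

place_matcher = Matcher(vocab)
# A form of "visit" followed by one or more proper nouns
pattern = [{"LEMMA": "visit"}, {"POS": "PROPN", "OP": "+"}]
# greedy="LONGEST" keeps only the longest of the overlapping matches
place_matcher.add("VISIT_PLACE", [pattern], greedy="LONGEST")

found = [toy_doc[start:end].text for _, start, end in place_matcher(toy_doc)]
print(found)
```

Without the `greedy` argument the matcher would report both "visiting New" and "visiting New York".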


<font color="red">**[ TODO ]**</font> Please identify all VERB-PREP-NOUN grammar structures in data_doc by applying a matcher rule and store the results as tuples in rule_gp. 


In [None]:
rule_gp = {}
# Initialize the matcher
matcher = Matcher(nlp.vocab)

# Write a pattern that matches a verb immediately followed by an adposition and a noun
my_pattern = [{"POS": "VERB"}, {"POS": "ADP"},{"POS": "NOUN"}]

# Add the pattern to the matcher and apply the matcher to the doc
matcher.add("Verb_Prep_Noun", [my_pattern])
matches = matcher(data_doc)

# Counting the no of matches
print(" matches found:", len(matches))

# Iterate over the matches and group the (verb, prep, noun) surface forms by verb
for match_id, start, end in matches:
  verb, prep, noun = (token.text for token in data_doc[start:end])
  rule_gp.setdefault(verb, []).append((verb, prep, noun))
    

 matches found: 1114


In [None]:
dict(list(rule_gp.items())[:3])

{'supplemented': [('supplemented', 'with', 'instruments'),
  ('supplemented', 'with', 'ORG')],
 'made': [('made', 'in', 'order'),
  ('made', 'of', 'particle'),
  ('made', 'by', 'researchers'),
  ('made', 'among', 'ORG'),
  ('made', 'by', 'particle'),
  ('made', 'by', 'PERSON'),
  ('made', 'on', 'DATE'),
  ('made', 'as', 'athletes'),
  ('made', 'on', 'development')],
 'resulting': [('resulting', 'in', 'infertility')]}

<font color="red">**[ TODO ]**</font>  Please print out all VERB-PREP-NOUN grammar patterns in rule_gp with the verb "provide".


In [None]:
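# rule_gp is keyed by the surface form of the verb, so look up "provided" rather than the lemma "provide"
rule_gp['provided']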

[('provided', 'by', 'T'), ('provided', 'by', 'PERSON')]
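
Because `rule_gp` is keyed by surface form, "provided", "provides", and "provide" end up under separate keys. A possible variant, sketched here and not the submitted solution, groups the matches by the verb's lemma so that one lookup with "provide" covers all inflections. The `Doc` is a hand-annotated toy token sequence (not a sentence from the corpus) whose lemmas and POS tags are assumptions, and `rule_by_lemma` is a hypothetical name.

```python
from spacy.matcher import Matcher
from spacy.tokens import Doc
from spacy.vocab import Vocab

vocab = Vocab()
# Toy token sequence containing two VERB-ADP-NOUN runs
words = ["Support", "provided", "by", "methods", "provides", "for", "expansion", "."]
lemmas = ["support", "provide", "by", "method", "provide", "for", "expansion", "."]
pos = ["NOUN", "VERB", "ADP", "NOUN", "VERB", "ADP", "NOUN", "PUNCT"]
toy_doc = Doc(vocab, words=words, lemmas=lemmas, pos=pos)

vpn_matcher = Matcher(vocab)
vpn_matcher.add("Verb_Prep_Noun", [[{"POS": "VERB"}, {"POS": "ADP"}, {"POS": "NOUN"}]])

rule_by_lemma = {}
# Sort by start offset so the grouping follows document order
for _, start, end in sorted(vpn_matcher(toy_doc), key=lambda m: m[1]):
    verb, prep, noun = toy_doc[start:end]
    # Key on the lemma, but keep the surface forms in the stored tuple
    rule_by_lemma.setdefault(verb.lemma_, []).append((verb.text, prep.text, noun.text))

print(rule_by_lemma["provide"])
```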

## TA's Notes

If you complete the assignment, please use [this link](https://docs.google.com/spreadsheets/d/1OKbXhcv6E3FEQDPnbHEHEeHvpxv01jxugMP7WwnKqKw/edit#gid=258852025) to reserve a demo time.  
The score is only given after TAs review your implementation, so <u>**make sure you make an appointment with a TA before the deadline**</u>.  <br>After the demo, please upload your assignment to elearn. You only need to hand in this ipynb file, renamed as XXXXXXXXX(Your student ID).ipynb.
<br>Note that **late submissions will not be accepted**.