## COMP8240: Practical Week 12 (Solutions)

**spaCy** is a Python library for natural language processing tasks such as Information Extraction.

Follow the instructions here: https://spacy.io/usage and install spaCy on your computer.

In this practical session, we will work with spaCy and look at some Information Extraction tasks similar to those introduced in the Week 11 lecture.

**Task 1**: Import spaCy, load the English language model (en_core_web_sm), create a doc object and print the part of speech tags for the sentence: "The craft of the carpetmaker includes traditional professional skills, such as wool processing, gathering of natural vegetable, animal or mineral dyes and yarn making."

In [1]:
# Import spacy.
import spacy

# Load the English language model.
nlp = spacy.load('en_core_web_sm')

# Text
text = "The craft of the carpetmaker includes traditional professional skills, such as wool processing, gathering of natural vegetable, animal or mineral dyes and yarn making."

# Create a doc object. 
doc = nlp(text)

for token in doc:
    print(token.text, '->', token.pos_)

The -> DET
craft -> NOUN
of -> ADP
the -> DET
carpetmaker -> NOUN
includes -> VERB
traditional -> ADJ
professional -> ADJ
skills -> NOUN
, -> PUNCT
such -> ADJ
as -> SCONJ
wool -> NOUN
processing -> NOUN
, -> PUNCT
gathering -> NOUN
of -> ADP
natural -> ADJ
vegetable -> NOUN
, -> PUNCT
animal -> NOUN
or -> CCONJ
mineral -> NOUN
dyes -> NOUN
and -> CCONJ
yarn -> NOUN
making -> NOUN
. -> PUNCT
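
The labels printed above are coarse-grained Universal POS tags. If one of them is unfamiliar, `spacy.explain()` returns a short description from spaCy's built-in glossary. The sketch below looks up a few of the tags from the output; it needs no trained model, and the variable names are chosen here for illustration.

```python
import spacy

# A few of the coarse-grained tags that appear in the output above.
tags = ["DET", "ADP", "SCONJ", "CCONJ", "PUNCT"]

# spacy.explain() looks each label up in spaCy's built-in glossary.
descriptions = {tag: spacy.explain(tag) for tag in tags}

for tag, description in descriptions.items():
    print(tag, '->', description)
```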


**Task 2**: Extract all nouns from the same sentence.

In [2]:
for token in doc:
    # Check POS of token.
    if token.pos_ == 'NOUN':
        # Print token.
        print(token.text)

craft
carpetmaker
skills
wool
processing
gathering
vegetable
animal
mineral
dyes
yarn
making
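
Note that `'NOUN'` covers common nouns only; proper nouns such as "Microsoft" are tagged `PROPN`. The sketch below wraps the filter in a helper that accepts several tags at once. `tokens_with_pos` is a name chosen here, not part of spaCy. To stay independent of a trained model, the example builds a small `Doc` by hand with assumed POS tags, which relies on the spaCy v3 `Doc` constructor.

```python
import spacy
from spacy.tokens import Doc

def tokens_with_pos(doc, pos_tags):
    """Return the text of every token whose coarse POS tag is in pos_tags."""
    return [token.text for token in doc if token.pos_ in pos_tags]

# Hand-annotated example (assumed tags, no trained model needed).
nlp_blank = spacy.blank("en")
example = Doc(nlp_blank.vocab,
              words=["Microsoft", "acquired", "the", "company"],
              pos=["PROPN", "VERB", "DET", "NOUN"])

common_nouns = tokens_with_pos(example, {"NOUN"})
all_nouns = tokens_with_pos(example, {"NOUN", "PROPN"})
print(common_nouns)
print(all_nouns)
```

With a real pipeline, the same helper can be called on the `doc` object from Task 1.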


**Task 3**: Display the entire dependency tree of the same sentence.

In [3]:
from spacy import displacy 

displacy.render(doc, style="dep", jupyter=True, options={"distance":120})
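
`jupyter=True` displays the tree inside a notebook. With `jupyter=False`, `displacy.render()` returns the SVG markup as a string instead, which is useful in a plain script. A minimal sketch, assuming spaCy v3 and a three-token parse whose heads and labels are supplied by hand rather than predicted by a model:

```python
import spacy
from spacy import displacy
from spacy.tokens import Doc

# Hand-annotated parse: heads are absolute token indices, the root points to itself.
blank_nlp = spacy.blank("en")
small_doc = Doc(blank_nlp.vocab,
                words=["craft", "includes", "skills"],
                heads=[1, 1, 1],
                deps=["nsubj", "ROOT", "dobj"])

# With jupyter=False the SVG markup is returned instead of displayed.
svg = displacy.render(small_doc, style="dep", jupyter=False, options={"distance": 120})
print(svg[:60])
```

The string can then be saved to a file, for example with `pathlib.Path("tree.svg").write_text(svg, encoding="utf-8")`, where `tree.svg` is a placeholder file name.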

**Task 4**: Extract the syntactic subject (nsubj) and the direct object (dobj) from the dependency tree.

In [4]:
for token in doc:
    # Extract the nominal subject.
    if token.dep_ == 'nsubj':
        print(token.text)
    # Extract the direct object.
    elif token.dep_ == 'dobj':
        print(token.text)

craft
skills
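
Both labels describe a relation to a head token, here the verb "includes". Returning `token.head` alongside the label makes that relation explicit. The sketch below uses a reduced, hand-annotated version of the sentence, an assumption made so that it runs without a trained model; `labelled_relations` is an illustrative name.

```python
import spacy
from spacy.tokens import Doc

def labelled_relations(doc, labels=("nsubj", "dobj")):
    """Return (label, dependent, head) triples for the requested dependency labels."""
    return [(token.dep_, token.text, token.head.text)
            for token in doc if token.dep_ in labels]

# Hand-annotated parse (assumed): both arguments attach to the verb at index 1.
shared_vocab = spacy.blank("en").vocab
reduced = Doc(shared_vocab,
              words=["craft", "includes", "skills"],
              heads=[1, 1, 1],
              deps=["nsubj", "ROOT", "dobj"])

relations = labelled_relations(reduced)
for label, dependent, head in relations:
    print(label, ':', dependent, '<-', head)
```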


**Task 5**: Import the Matcher of spaCy and define a single pattern that allows you to extract: "The craft of the carpetmaker includes traditional professional skills" as well as "The craft of the carpetmaker includes skills" from Text 1 and Text 2 below.

In [5]:
from spacy.matcher import Matcher 

# Initialize the Matcher with the shared vocabulary.
matcher = Matcher(nlp.vocab)    

# Define the pattern.
pattern = [{'POS':'DET'},  
           {'POS':'NOUN'},
           {'LOWER':'of'},
           {'POS':'DET'},
           {'POS':'NOUN'},
           {'LOWER':'includes'}, 
           {'DEP':'amod', 'OP':"*"},
           {'POS':'NOUN'},
          ] 
    
# Add the pattern to the matcher (spaCy v3 takes a list of patterns).
matcher.add("matching_0", [pattern])

Text 1: "The craft of the carpetmaker includes traditional professional skills, such as wool processing, gathering of natural vegetable, animal or mineral dyes and yarn making."

In [6]:
# Text 1
text_1 = "The craft of the carpetmaker includes traditional professional skills, such as wool processing, gathering of natural vegetable, animal or mineral dyes and yarn making."

# Create a doc object.
doc = nlp(text_1)

# Use the matcher on the doc.
matches = matcher(doc)

for match_id, start, end in matches:
    span = doc[start:end]  # The matched span
    print(start, end, span.text)

0 9 The craft of the carpetmaker includes traditional professional skills


Text 2: "The craft of the carpetmaker includes skills, such as wool processing, gathering of natural vegetable, animal or mineral dyes and yarn making."


In [7]:
# Text 2
text_2 = "The craft of the carpetmaker includes skills, such as wool processing, gathering of natural vegetable, animal or mineral dyes and yarn making."

# Create a doc object. 
doc = nlp(text_2) 

# Use the matcher on the doc.
matches = matcher(doc)

for match_id, start, end in matches:
    span = doc[start:end]  # The matched span
    print(start, end, span.text)


0 7 The craft of the carpetmaker includes skills
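
The single pattern covers both texts because of `'OP': '*'`, which lets the modifier token match zero or more times. The sketch below isolates that behaviour on two hand-tagged token sequences. It is a simplification under stated assumptions: it uses `'POS': 'ADJ'` instead of `'DEP': 'amod'` so that no parser output is needed, it assumes the spaCy v3 form of `Matcher.add()`, which takes a list of patterns, and the names `op_matcher` and `matched_texts` are chosen here.

```python
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Doc

blank = spacy.blank("en")
op_matcher = Matcher(blank.vocab)

# "includes", then zero or more adjectives, then a noun.
op_pattern = [{"LOWER": "includes"},
              {"POS": "ADJ", "OP": "*"},
              {"POS": "NOUN"}]
op_matcher.add("includes_noun", [op_pattern])

def matched_texts(words, pos):
    """Build a hand-tagged Doc and return the text of every match."""
    tagged = Doc(blank.vocab, words=words, pos=pos)
    return [tagged[start:end].text for _, start, end in op_matcher(tagged)]

with_modifiers = matched_texts(["includes", "traditional", "professional", "skills"],
                               ["VERB", "ADJ", "ADJ", "NOUN"])
without_modifiers = matched_texts(["includes", "skills"], ["VERB", "NOUN"])
print(with_modifiers)
print(without_modifiers)
```

Replacing `'*'` with `'+'` would require at least one adjective, so the second sequence would no longer match.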


**Task 6**: Write a function subtree_matcher() that extracts the triple ('Microsoft', 'acquired', 'GitHub') from the passive sentence 'GitHub was recently acquired by Microsoft' as well as from the active sentence 'Microsoft recently acquired GitHub'.

In [8]:
def subtree_matcher(doc):
  subj, verb, obj = '', '', ''

  # A dependency label containing "subjpass" (e.g. nsubjpass) marks a passive sentence.
  is_passive = any('subjpass' in tok.dep_ for tok in doc)

  for tok in doc:
    # The root of the dependency tree is its own head.
    if tok.head == tok:
      verb = tok.text

    if is_passive:
      # Passive: the grammatical subject is the logical object, and the
      # object of the preposition "by" is the logical subject.
      if 'subjpass' in tok.dep_:
        obj = tok.text
      if tok.dep_.endswith('obj'):
        subj = tok.text
    else:
      # Active: subject and object keep their grammatical roles.
      if tok.dep_.endswith('subj'):
        subj = tok.text
      if tok.dep_.endswith('obj'):
        obj = tok.text

  return subj, verb, obj

In [9]:
text = "GitHub was recently acquired by Microsoft." 

doc = nlp(text) 
subtree_matcher(doc)

('Microsoft', 'acquired', 'GitHub')

In [10]:
text = "Microsoft recently acquired GitHub." 

doc = nlp(text) 
subtree_matcher(doc)

('Microsoft', 'acquired', 'GitHub')
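
An alternative sketch, not the solution above: instead of scanning every token, it starts at the root verb and reads the roles off its children, following the `agent` node ("by") to reach the logical subject of a passive. `root_triple` is a name chosen here, and the two parses are annotated by hand on the assumption that they mirror what the parser produces for these sentences.

```python
import spacy
from spacy.tokens import Doc

def root_triple(doc):
    """Return (logical subject, root verb, logical object) from the root's children."""
    root = next(tok for tok in doc if tok.dep_ == "ROOT")
    subj, obj = "", ""
    for child in root.children:
        if child.dep_ == "nsubj":
            subj = child.text
        elif child.dep_ in ("dobj", "nsubjpass"):
            # A passive subject is the logical object.
            obj = child.text
        elif child.dep_ == "agent":
            # The agent's pobj is the logical subject of a passive.
            subj = next((t.text for t in child.children if t.dep_ == "pobj"), subj)
    return subj, root.text, obj

# Hand-annotated parses (assumed); heads are absolute token indices.
vocab = spacy.blank("en").vocab
active = Doc(vocab, words=["Microsoft", "recently", "acquired", "GitHub"],
             heads=[2, 2, 2, 2], deps=["nsubj", "advmod", "ROOT", "dobj"])
passive = Doc(vocab, words=["GitHub", "was", "recently", "acquired", "by", "Microsoft"],
              heads=[3, 3, 3, 3, 3, 4],
              deps=["nsubjpass", "auxpass", "advmod", "ROOT", "agent", "pobj"])

print(root_triple(active))
print(root_triple(passive))
```

Restricting the search to the root's children avoids picking up an unrelated `pobj` or `dobj` from elsewhere in a longer sentence, at the cost of ignoring clauses that are not attached to the root.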