# Syntactic Parsing with spaCy
spaCy provides a full-stack syntactic analysis toolkit. In this exercise, we first look at how to use spaCy for several syntactic analysis tasks, and then use the analysis results to answer some questions.

In [1]:
import spacy
import en_core_web_sm
nlp = en_core_web_sm.load() # this is an object that will process your text

In [2]:
# POS tagger

# make up (or cut and paste from web pages) different sentences here. 
doc = nlp('He went to South Africa for holiday.')
print('word\tPenn\tUniversal')
for ww in doc:
    print('{}\t{}\t{}'.format(ww,ww.tag_,ww.pos_))

word	Penn	Universal
He	PRP	PRON
went	VBD	VERB
to	IN	ADP
South	NNP	PROPN
Africa	NNP	PROPN
for	IN	ADP
holiday	NN	NOUN
.	.	PUNCT


In [3]:
# get help information for pos tags from nltk, which has a convenient help
# function
import nltk

# comment out the next line if you have already downloaded tagsets
nltk.download('tagsets')

[nltk_data] Downloading package tagsets to C:\Users\jathin
[nltk_data]     varma\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping help\tagsets.zip.


True

In [4]:
# the nltk function allows you to print out definitions of each tag above, 
# and to find examples of words with this tag. 
# figure out what the different tags mean. 
# upenn_tagset prints the definition itself and returns None,
# so it does not need to be wrapped in print()
nltk.help.upenn_tagset('NNP')

NNP: noun, proper, singular
    Motown Venneboerger Czestochwa Ranzer Conchita Trumplane Christos
    Oceanside Escobar Kreisler Sawyer Cougar Yvette Ervin ODI Darryl CTCA
    Shannon A.K.C. Meltex Liverpool ...


## Chunking

Try out chunking on many individual sentences and on larger bodies of text from the web. (Save these to .txt files, and load them in as Python strings using the file object's `read` or `readlines` methods.)

Do you agree with the chunks? Can you find cases where the chunking fails? Roughly what is the failure rate?  In a case where the chunking fails, try variations of the sentence to try to understand why the failure might occur. 
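
Loading a saved page into a string could look like the sketch below. The helper names `load_text` and `load_lines` are illustrative assumptions, not part of the exercise.

```python
from pathlib import Path


def load_text(path, encoding="utf-8"):
    """Return the whole file as one string (same result as open(path).read())."""
    return Path(path).read_text(encoding=encoding)


def load_lines(path, encoding="utf-8"):
    """Return the non-empty lines of the file, stripped of surrounding whitespace."""
    with open(path, encoding=encoding) as handle:
        return [line.strip() for line in handle.readlines() if line.strip()]
```

With a file saved as `article.txt` (a hypothetical name), `nlp(load_text('article.txt'))` parses the whole text at once, while calling `nlp` on each item of `load_lines('article.txt')` makes it easier to inspect the chunks one line at a time.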


In [17]:
# chunking
nlp = en_core_web_sm.load()

doc = nlp('A tram in St Peters Square, Manchester. Photograph: Christopher Thomond')
for chunk in doc.noun_chunks:
    print(chunk)

A tram
St Peters Square
Manchester
Photograph
Christopher Thomond


## Named Entity Recognition 
There is a deceptively simple spaCy pipeline for this complex process!

Try named entity recognition on some sentences that you make up: does it work?

Next, cut and paste some text from a newspaper website into a .txt document, and then load it into Python. Run spaCy on it, and see how many of the named entities are correctly identified. Can you find cases where it makes mistakes?

How could you use NER in applications? 

In [36]:
# now using Spacy
# named entity recognition (NER)
nlp = en_core_web_sm.load()
doc = nlp('AI, robotics and other forms of smart automation have the potential to bring great economic benefits, contributing up to $15 trillion to global GDP by 2030 according to PwC analysis. This extra wealth will also generate the demand for many jobs, but there are also concerns that it could displace many existing jobs.')

for i,ent in enumerate(doc.ents):
    print('named entity {}: {}, label {}'.format(i,ent.text,ent.label_))

named entity 0: AI, label ORG
named entity 1: up to $15 trillion, label MONEY
named entity 2: 2030, label DATE
named entity 3: PwC, label WORK_OF_ART
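
One way to put a number on how many named entities are correctly identified is to write down the entities you expect by hand and compare them with the model's output. The sketch below does this with plain Python sets. The function name `score_entities` and the hand labels in `gold` are assumptions made for illustration (PwC is assumed to be an organisation, and AI is assumed not to be an entity).

```python
def score_entities(predicted, gold):
    """Compare predicted and hand-labelled (text, label) pairs.

    Both arguments are iterables of (entity text, label) tuples. Duplicates
    are collapsed, so repeated mentions of the same entity count once.
    """
    predicted, gold = set(predicted), set(gold)
    correct = predicted & gold
    return {
        "precision": len(correct) / len(predicted) if predicted else 0.0,
        "recall": len(correct) / len(gold) if gold else 0.0,
        "spurious": sorted(predicted - gold),  # predicted but not hand-labelled
        "missed": sorted(gold - predicted),    # hand-labelled but not predicted
    }


# Model output copied from the cell above; with a live pipeline this would be
# [(ent.text, ent.label_) for ent in doc.ents].
predicted = [
    ("AI", "ORG"),
    ("up to $15 trillion", "MONEY"),
    ("2030", "DATE"),
    ("PwC", "WORK_OF_ART"),
]
# Hand labels: an assumption for illustration, not a reference annotation.
gold = [("up to $15 trillion", "MONEY"), ("2030", "DATE"), ("PwC", "ORG")]

scores = score_entities(predicted, gold)
print(scores["precision"], scores["recall"])
print("spurious:", scores["spurious"])
print("missed:", scores["missed"])
```

Exact matching is strict: an entity counts as correct only if both its text span and its label agree with the hand label, so a slightly different span (for example `$15 trillion` instead of `up to $15 trillion`) is scored as a miss.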


In [37]:
# visualization of NER results
from spacy import displacy
displacy.render(doc,style='ent')

## Dependency parsing

Use the spaCy dependency parser. Try it out on many different sentences (one sentence at a time!) and try to understand the parse produced.

Does the parse correctly display the structure of the sentence? 

Some sentences to compare. (It is easier to find structural differences when you compare different related sentences with different structures.) 

Compare:

A:  'I shot the elephant wearing my pajamas' 

B:  'I shot the elephant eating my tree'

In A, I am wearing my pajamas, so the phrase 'wearing my pajamas' is connected to the subject of the sentence. In B, the elephant is eating my tree, so the phrase 'eating my tree' depends on the object (the elephant). There is evidently sufficient "background knowledge" somewhere in the parser to distinguish these possibilities. Try to make up other sentences to find out how the parser is detecting the correct (or incorrect) structure.

Compare:

C: 'John loves Mary, Mary loves John'

D: 'John loves Mary and Mary loves John' 

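The parse of C shows two clauses side by side, joined by a ccomp (clausal complement) dependency. Decide for yourself whether that label is right: a clausal complement is normally an argument of the verb, whereas here two independent clauses are simply juxtaposed. You can find out what the dependencies mean at universaldependencies.org, and `spacy.explain('ccomp')` gives a one-line description. Do this! Look them up!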

In D, something goes horribly wrong: the parser misinterprets the sentence. 

Now make up pairs of sentences of your own (similar sentences with different grammatical structure) and see if the parser recognises them. 

Cut and paste sentences from the web (news stories, fiction, tweets, etc.) and see if you agree with spaCy's parsed structures.
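
When comparing related sentences, it can help to print only how one word attaches, rather than reading the full parse. The sketch below assumes `nlp` is the pipeline loaded earlier; the helper name `compare_attachments` is made up for this exercise.

```python
def compare_attachments(nlp, cases):
    """For each (sentence, word) pair, report how `word` attaches in the parse.

    `nlp` is any callable that turns a sentence into an iterable of tokens
    with `.text`, `.dep_` and `.head` attributes (a spaCy pipeline does this).
    Returns a list of (sentence, word, dependency label, head text) tuples;
    the last two entries are None when the word is not found.
    """
    report = []
    for sentence, word in cases:
        doc = nlp(sentence)
        # take the first token whose text matches the word of interest
        match = next((token for token in doc if token.text == word), None)
        if match is None:
            report.append((sentence, word, None, None))
        else:
            report.append((sentence, word, match.dep_, match.head.text))
    return report
```

For sentences A and B, the call would be `compare_attachments(nlp, [('I shot the elephant wearing my pajamas', 'wearing'), ('I shot the elephant eating my tree', 'eating')])`. If the two words come back with different heads, the parser has assigned the two sentences different structures.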


In [8]:
# dependency parsing 

doc = nlp('John loves Mary, Mary loves John')
for token in doc:
    print(token.text, token.dep_, token.head.text, token.head.pos_,
          list(token.children))

# the output below will be a little obscure: 
# the tree visualisation in the next cell is a lot clearer, but perhaps
# only for relatively short sentences. 

John nsubj loves VERB []
loves ccomp loves VERB [John, Mary]
Mary dobj loves VERB []
, punct loves VERB []
Mary nsubj loves VERB []
loves ROOT loves VERB [loves, ,, Mary, John]
John dobj loves VERB []


In [9]:
# visualize the dependency parsing result
# Obviously only do this for one sentence at a time! 
from spacy import displacy
displacy.render(doc,style='dep')

In [10]:
# noun chunks also carry dependency information
# look at the spaCy documentation for noun chunks

doc = nlp('Amazon purchases Whole Foods for $13.4 billion.')
for chunk in doc.noun_chunks:
    print('{}|{}|{}|{}'.format(chunk.text, chunk.root.text, chunk.root.dep_,
            chunk.root.head.text))

Amazon|Amazon|nsubj|purchases
Whole Foods|Foods|dobj|purchases


### A tiny application

The next cell shows one way to use dependency parsing to answer a semantic question: to find instances of one entity purchasing another.  Do you think this is an extendable approach?  

In [11]:
# A simple application of chunking and dependency parsing
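def who_purchases_whom(doc):
    # default to None so the function still returns when no match is found
    subj, obj = None, None
    for chunk in doc.noun_chunks:
        # only consider chunks whose head word is a form of "purchase"
        if 'purchase' not in chunk.root.head.text:
            continue
        if 'subj' in chunk.root.dep_:
            subj = chunk.text
        elif 'obj' in chunk.root.dep_:
            obj = chunk.text
    return subj, obj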

subj, obj = who_purchases_whom(doc)
print('{} bought {}'.format(subj,obj))

Amazon bought Whole Foods
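
One direction for extending the function above is to match on the verb's lemma, so that 'purchased', 'buys' and 'acquired' are all covered, and to return every match instead of only the last one. The sketch below is one such variation under stated assumptions: the name `find_transactions` and the default verb list are invented here, and it relies on the spaCy token attributes `lemma_` and `i`.

```python
def find_transactions(doc, verb_lemmas=("purchase", "buy", "acquire")):
    """Return (subject, verb, object) triples for noun chunks attached to a matching verb.

    `doc` is expected to be a spaCy Doc; anything exposing `noun_chunks`
    whose items have `.text` and `.root` attributes works as well.
    """
    wanted = set(verb_lemmas)
    by_verb = {}  # verb token index -> collected subject and object
    for chunk in doc.noun_chunks:
        head = chunk.root.head
        if head.lemma_.lower() not in wanted:
            continue
        slot = by_verb.setdefault(
            head.i, {"verb": head.text, "subj": None, "obj": None}
        )
        if "subj" in chunk.root.dep_:
            slot["subj"] = chunk.text
        elif "obj" in chunk.root.dep_:
            slot["obj"] = chunk.text
    # keep only verbs for which both a subject and an object were found
    return [
        (slot["subj"], slot["verb"], slot["obj"])
        for slot in by_verb.values()
        if slot["subj"] and slot["obj"]
    ]
```

Calling `find_transactions(doc)` on the Amazon sentence should give a single triple. A passive sentence such as 'Whole Foods was purchased by Amazon' would likely return nothing, because the buyer is then attached to 'by' rather than to the verb; cases like this are a useful test of how far the approach extends.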
