# NLP Core 2 Exercise 2: Sensible PP attachment

#### In this exercise, we will learn about **POS tagging** and **dependency parsing** and study the well-known **PP attachment problem**.

## Introduction and POS tagging

#### First, let's take a look at spaCy's Part-of-Speech (POS) tagging and dependency parsing abilities. Here's how we load a sentence into a spaCy document object and view its dependency parse:

In [1]:
! python -m spacy download en

✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_sm')
✔ Linking successful
/usr/local/lib/python3.6/dist-packages/en_core_web_sm -->
/usr/local/lib/python3.6/dist-packages/spacy/data/en
You can now load the model via spacy.load('en')


In [2]:
import spacy
from spacy import displacy
nlp = spacy.load('en')
test_doc = nlp('I write code.')
displacy.render(test_doc, jupyter = True)

#### spaCy also tokenizes the sentence for you. You can view tokens and their POS tags as follows:

In [3]:
print([(token, token.pos_) for token in test_doc])

[(I, 'PRON'), (write, 'VERB'), (code, 'NOUN'), (., 'PUNCT')]


**Now let's try applying this to a real dataset. NLTK includes an API for accessing many free open textual corpora, including the Project Gutenberg collection of public domain books. We'll load an array of the sentences of Jane Austen's 1811 novel *Sense and Sensibility* for our tests:**

In [4]:
from collections import Counter
import string
import random
from itertools import chain
import pandas as pd
import nltk
from nltk.corpus import stopwords
from nltk.corpus import gutenberg
nltk.download('gutenberg')
nltk.download('stopwords')
nltk.download('punkt')
sentences = gutenberg.sents('austen-sense.txt')

[nltk_data] Downloading package gutenberg to /root/nltk_data...
[nltk_data]   Package gutenberg is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


### Questions:
#### 1. How many sentences are in the novel? 

In [5]:
len(sentences)


4999

#### How many unique tokens?

With punctuation

In [6]:
tokens = [word for item in sentences for word in item]

len(set(tokens))


6828

Without punctuation

In [7]:
# remove punctuation from each word
table = str.maketrans('', '', string.punctuation)
tokens_punct = [word.translate(table) for item in sentences for word in item if word.isalpha()]

len(set(tokens_punct))

6713

Without punctuation and stopwords

In [8]:
stop_words = set([w.lower() for w in stopwords.words('english')])

tokens_stop_punct = [word.translate(table) for item in sentences 
                     for word in item if word.isalpha() 
                     and word not in stop_words]
len(set(tokens_stop_punct))

6579
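
One thing to watch in the cell above: the stopword list is lowercase, but the tokens are compared as they appear in the text, so capitalized stopwords such as "The" are not filtered out. The sketch below shows the effect on a hand-made token list. The tiny `stop_words` set is a stand-in for illustration only, not NLTK's actual list.

```python
# Toy data: a hand-made token list and a stand-in stopword set
# (an assumption for illustration, not NLTK's list).
tokens = ["The", "family", "of", "Dashwood", "had", "long", "been", "the", "same"]
stop_words = {"the", "of", "had", "been"}

# Case-sensitive lookup, as in the cell above: "The" is kept.
case_sensitive = [w for w in tokens if w.isalpha() and w not in stop_words]

# Case-insensitive lookup: lowercase each token before the check.
case_insensitive = [w for w in tokens if w.isalpha() and w.lower() not in stop_words]

print(case_sensitive)
print(case_insensitive)
```

Applying `word.lower()` in the notebook cell would probably lower the unique-token count somewhat, but that has not been run here.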

#### 2. What are the five most common verbs, counting inflections, in the novel? 


**source link:**<br>https://www.nltk.org/book/ch02.html<br>https://realpython.com/natural-language-processing-spacy-python/<br>https://spacy.io/api/language#pipe<br>https://docs.python.org/3.6/library/itertools.html

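First we use the tokens with punctuation and stopwords removed. Note that each token is passed to `nlp.pipe` as its own one-word document, so the tagger sees no sentence context.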

In [9]:
# chain('ABC', 'DEF') --> A B C D E F

pipes = nlp.pipe(chain(tokens_stop_punct), 
                 disable=[ "parser", "tokenizer","ner"],
                 batch_size=50)

words_verb_clean = Counter((ent.text, ent.pos_) for t in pipes 
                     for  ent in t if ent.pos_=='VERB')


In [10]:
words_verb_clean.most_common(5)

[(('could', 'VERB'), 568),
 (('would', 'VERB'), 507),
 (('said', 'VERB'), 397),
 (('must', 'VERB'), 279),
 (('know', 'VERB'), 230)]

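Next we use the original sentences, keeping punctuation and stopwords. `chain.from_iterable` flattens them into single tokens, so each word is again tagged on its own.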

In [11]:
# chain.from_iterable(['ABC', 'DEF']) --> A B C D E F

pipes = nlp.pipe(chain.from_iterable(sentences), 
                 disable=[ "parser", "tokenizer","ner"],
                 batch_size=2000)

words_verb = ((ent.text, ent.pos_) for t in pipes 
                     for  ent in t if ent.pos_=='VERB')


In [12]:
Counter(words_verb).most_common(5)

[(('could', 'VERB'), 568),
 (('would', 'VERB'), 507),
 (('said', 'VERB'), 397),
 (('will', 'VERB'), 354),
 (('can', 'VERB'), 295)]

The first three entries are identical in both runs. The fourth and fifth differ because `will` and `can` are in NLTK's stopword list and were filtered out of the first run, which let `must` and `know` move up.

#### What are the five most common verbal lemmas (base forms of verbs)?

This time we use the full sentences, including punctuation.

In [13]:
pipes = nlp.pipe(chain.from_iterable(sentences), 
                 disable=[ "parser", "tokenizer","ner"],
                 batch_size=2000)

lemma_verb = Counter((ent.lemma_,ent.pos_) for t in pipes 
                     for  ent in t if ent.pos_=='VERB')

In [14]:
lemma_verb.most_common(5)

[(('say', 'VERB'), 609),
 (('could', 'VERB'), 568),
 (('would', 'VERB'), 507),
 (('may', 'VERB'), 384),
 (('know', 'VERB'), 376)]
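
The jump from `said` (397) to `say` (609) is what lemma counting is meant to do: all inflections of a verb are pooled under one base form. Here is a minimal sketch of that bookkeeping with hand-written (form, lemma) pairs. The pairs are made up for illustration and are not taken from the novel.

```python
from collections import Counter

# Hypothetical tagged verbs as (surface form, lemma) pairs.
tagged_verbs = [
    ("said", "say"), ("says", "say"), ("say", "say"),
    ("knew", "know"), ("know", "know"),
    ("said", "say"),
]

# Counting surface forms keeps each inflection separate...
form_counts = Counter(form for form, _ in tagged_verbs)
# ...while counting lemmas pools them under the base form.
lemma_counts = Counter(lemma for _, lemma in tagged_verbs)

print(form_counts.most_common(2))
print(lemma_counts.most_common(2))
```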

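Here is a cross-check using a different approach: we let spaCy process the raw text of the novel, so it does its own tokenization and sentence splitting.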

In [15]:
gutenberg.raw('austen-sense.txt')[:333]

'[Sense and Sensibility by Jane Austen 1811]\n\nCHAPTER 1\n\n\nThe family of Dashwood had long been settled in Sussex.\nTheir estate was large, and their residence was at Norland Park,\nin the centre of their property, where, for many generations,\nthey had lived in so respectable a manner as to engage\nthe general good opinion of their surr'

In [16]:
doc = nlp(gutenberg.raw('austen-sense.txt'))
sentence_spans = list(doc.sents)
lemma_verb = [ent.lemma_ for t in sentence_spans 
                     for  ent in t if ent.pos_=='VERB']
df = pd.DataFrame(lemma_verb)
pd.value_counts(df.values.flatten())[:5]

say      594
could    567
would    511
may      383
know     380
dtype: int64

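We get slightly different counts but the same five lemmas in the same order. The counts presumably differ because here spaCy tokenizes the raw text itself and tags each word in its sentence context, instead of tagging NLTK's pre-split tokens one at a time.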

## Dependency parsing and PP attachment

**As we saw above, spaCy also generates dependency parses that we can plot. These represent the grammatical relations that connect the different words and phrases in a sentence.**

**For the next task, we will consider how verbs and prepositional phrases can be related in sentences. (A *prepositional phrase* or *PP* is a phrase like "in the house", "on the table", "with my friend" which is headed by a preposition like "in", "on", "with" ...).**

### Questions:
  #### 3. What is the difference between the prepositional phrases in the sentences in (A) and those in (B)? Plot their dependency parses with displacy.render and look for a difference in structure.

<b>(A)
  * I eat an apple in my room.
  * We listen to music at the theater.
  * John visited Brazil with his friend.
  
(B)
  * I see a fly in my soup.
  * She knows the man at the store.
  * I photographed a man with a bowtie.</b>

**source link:**<br>https://spacy.io/usage/visualizers<br>https://www.grammarly.com/blog/prepositional-phrase/#:~:text=When%20a%20prepositional%20phrase%20acts%20upon%20a%20verb%2C%20we%20say,is%20called%20an%20adverbial%20phrase.<br>https://www.gingersoftware.com/content/grammar-rules/preposition/prepositional-phrases/

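In spaCy's parses, all sentences in (A) and the third sentence of (B) contain `adverbial` prepositional phrases, which attach to the verb. The first and second sentences of (B) contain `adjectival` prepositional phrases, which attach to a noun (the verb's object). All three sentences in (B) are meant to illustrate noun attachment: in "I photographed a man with a bowtie" the bowtie belongs to the man, so the verb attachment that spaCy produces there is arguably a parsing error.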
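
The structural difference comes down to which word is the head of the preposition. The sketch below hand-codes two simplified parses as (word, head index, relation) triples, so it runs without spaCy. The triples are written by hand for illustration and are not parser output, and `pp_attachment` is a hypothetical helper name.

```python
# Each token is (word, index of its head, dependency label).
# The root points to itself. These parses are hand-written for
# illustration; they are not produced by spaCy.
parse_a = [  # "I eat an apple in my room"
    ("I", 1, "nsubj"), ("eat", 1, "ROOT"), ("an", 3, "det"),
    ("apple", 1, "dobj"), ("in", 1, "prep"), ("my", 6, "poss"),
    ("room", 4, "pobj"),
]
parse_b = [  # "I see a fly in my soup"
    ("I", 1, "nsubj"), ("see", 1, "ROOT"), ("a", 3, "det"),
    ("fly", 1, "dobj"), ("in", 3, "prep"), ("my", 6, "poss"),
    ("soup", 4, "pobj"),
]

def pp_attachment(parse):
    """Return (preposition, head word, head's relation) for each 'prep' token."""
    result = []
    for word, head, dep in parse:
        if dep == "prep":
            head_word, _, head_dep = parse[head]
            result.append((word, head_word, head_dep))
    return result

print(pp_attachment(parse_a))  # preposition headed by the verb
print(pp_attachment(parse_b))  # preposition headed by the object noun
```

In (A) the preposition's head is the verb itself; in (B) it is the verb's direct object. The two extraction functions later in this notebook test for exactly this.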

In [17]:
sent_a ='''I eat an apple in my room.
We listen to music at the theater.
John visited Brazil with his friend'''
doc = nlp(sent_a)
sentence_spans = list(doc.sents)
displacy.render(sentence_spans, 
                style="dep",
                jupyter = True, 
                options={'distance': 120})

In [18]:
sent_b ='''I see a fly in my soup.
She knows the man at the store.
I photographed a man with a bowtie.'''
doc = nlp(sent_b)
sentence_spans = list(doc.sents)
displacy.render(sentence_spans, 
                style="dep",
                jupyter = True, 
                options={'distance': 120})

<b>As you can imagine, it is not simple for the parser to decide where the prepositional phrase should be attached -- this is the **PP attachment problem**. Let's evaluate spaCy's default behavior towards PP attachment on our *Sense and Sensibility* corpus:</b>

### Questions:
####  4. Make an array of all tuples (verb lemma, preposition lemma) for prepositional phrases attached to the verb (like (A) above). <br>Hint: for a spaCy token object *token*, you can get its children with *token*.children and the child's relation to it with *child.dep_*.


**source link:**<br>https://stackoverflow.com/questions/40288323/what-do-spacys-part-of-speech-and-dependency-tags-mean<br>https://stackoverflow.com/questions/36610179/how-to-get-the-dependency-tree-with-spacy<br>https://spacy.io/usage/linguistic-features

In [19]:
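def get_pps_A(doc):
    """Collect (verb lemma, preposition lemma) pairs for PPs attached to a verb.

    `doc` is an iterable of sentences (spaCy spans). The sentence of each
    pair is returned too, so a sentence with two PPs appears twice.
    """
    pps = []
    sents_verb = []
    for sent in doc:
        for token in sent:
            if token.pos_ == 'VERB':
                # a 'prep' child of a verb is a PP attached to that verb
                for child in token.children:
                    if child.dep_ == 'prep':
                        sents_verb.append(sent)
                        pps.append((token.lemma_, child.lemma_))
    return pps, sents_verb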
 

# example on the sentences in (A)
doc = nlp(sent_a)
sentence_spans = list(doc.sents)
pps_v, snts_v = get_pps_A(sentence_spans)
print(pps_v, snts_v)
# example on the sentences in (B)
doc = nlp(sent_b)
sentence_spans = list(doc.sents)
pps_v, snts_v = get_pps_A(sentence_spans)
print(pps_v, snts_v)


[('eat', 'in'), ('listen', 'to'), ('listen', 'at'), ('visit', 'with')] [I eat an apple in my room.
, We listen to music at the theater.
, We listen to music at the theater.
, John visited Brazil with his friend]
[('photograph', 'with')] [I photographed a man with a bowtie.]


In [20]:
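def get_pps_B(doc):
    """Collect (verb lemma, preposition lemma) pairs for PPs attached to a verb's object.

    `doc` is an iterable of sentences (spaCy spans). The sentence of each
    pair is returned too.
    """
    pps = []
    sents_verb = []
    for sent in doc:
        for token in sent:
            if token.dep_ == 'dobj':
                # a 'prep' child of a direct object is a PP attached to the noun
                for child in token.children:
                    if child.dep_ == 'prep':
                        sents_verb.append(sent)
                        # token.head is the verb that governs the object
                        pps.append((token.head.lemma_, child.lemma_))
    return pps, sents_verb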
 

# example on the sentences in (B)
doc = nlp(sent_b)
sentence_spans = list(doc.sents)
pps_v, snts_v = get_pps_B(sentence_spans)
print(pps_v, snts_v)


[('see', 'in'), ('know', 'at')] [I see a fly in my soup.
, She knows the man at the store.
]


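Now let's extract prepositional phrases from a whole novel. Note that the cells below load `austen-emma.txt` (*Emma*), not *Sense and Sensibility*, so the counts that follow describe *Emma*. To analyze *Sense and Sensibility*, load `austen-sense.txt` instead.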

In [21]:
doc = nlp(gutenberg.raw('austen-emma.txt'))
sentence_spans = list(doc.sents)

pps_prep, sent_prep = get_pps_A(sentence_spans)

####  What are five most common (verb, preposition) pairs in this case?

In [22]:
Counter(pps_prep).most_common(5)

[(('think', 'of'), 139),
 (('go', 'to'), 72),
 (('come', 'to'), 66),
 (('speak', 'of'), 55),
 (('talk', 'of'), 54)]

####  5. Do the same where the prepositional phrase is attached to the verb's object (case (B)). 


In [23]:
doc = nlp(gutenberg.raw('austen-emma.txt'))
sentence_spans = list(doc.sents)

pps_dobj, sent_noun = get_pps_B(sentence_spans)

#### What are the five most common (verb, preposition) pairs in this case?

In [24]:
Counter(pps_dobj).most_common(5)

[(('have', 'of'), 183),
 (('give', 'of'), 60),
 (('have', 'for'), 42),
 (('make', 'of'), 40),
 (('have', 'in'), 28)]
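
To compare the two attachment types directly, the counts can be regrouped by preposition. The sketch below uses only the top-five pairs printed above, so the totals are partial and only illustrate the bookkeeping. With the full lists, one would pass `Counter(pps_prep)` and `Counter(pps_dobj)` instead. `by_preposition` is a hypothetical helper name.

```python
from collections import Counter

# Top-five (verb, preposition) counts copied from the outputs above.
verb_attached = Counter({("think", "of"): 139, ("go", "to"): 72,
                         ("come", "to"): 66, ("speak", "of"): 55,
                         ("talk", "of"): 54})
object_attached = Counter({("have", "of"): 183, ("give", "of"): 60,
                           ("have", "for"): 42, ("make", "of"): 40,
                           ("have", "in"): 28})

def by_preposition(pair_counts):
    """Sum pair counts per preposition, ignoring the verb."""
    totals = Counter()
    for (_, prep), n in pair_counts.items():
        totals[prep] += n
    return totals

verb_totals = by_preposition(verb_attached)
object_totals = by_preposition(object_attached)

# One row per preposition: verb-attached total, object-attached total.
for prep in sorted(set(verb_totals) | set(object_totals)):
    print(prep, verb_totals[prep], object_totals[prep])
```

Even in this partial view, `of` is frequent in both attachment types while `to` appears only with verb attachment, which hints at why the preposition alone cannot settle where a PP attaches.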


### Bonus:
#### Look at a few random sentences from the corpus that are parsed as (A) or (B).

Parsed as (A), with the PP attached to the verb:

In [25]:

# pick one sentence and re-parse its text; calling str() on a list
# would feed stray brackets to the parser
sampled = random.choice(sent_prep)
doc = nlp(sampled.text)
sentence_spans = list(doc.sents)
displacy.render(sentence_spans,
                style="dep",
                jupyter = True,
                options={'distance': 80})
sampled.text

'He looked with smiling penetration; and, on receiving\nno answer, added, "_'

Parsed as (B), with the PP attached to the verb's object:

In [26]:

# pick one sentence and re-parse its text; calling str() on a list
# would feed stray brackets to the parser
sampled = random.choice(sent_noun)
doc = nlp(sampled.text)
sentence_spans = list(doc.sents)
displacy.render(sentence_spans,
                style="dep",
                jupyter = True,
                options={'distance': 90})
sampled.text

'When he\nwas here before, we made the best of it; but there was a good deal\nof wet, damp, cheerless weather; there always is in February, you know,\n'
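
Because `random` is not seeded, each run of the two cells above shows a different sentence. Below is a minimal sketch of reproducible sampling with a dedicated `random.Random` instance. The sentence list and the seed value are placeholders chosen for illustration.

```python
import random

# Placeholder strings standing in for sent_prep / sent_noun.
candidate_sentences = ["sentence one", "sentence two",
                       "sentence three", "sentence four"]

rng = random.Random(42)  # fixed seed: the same picks on every run
picked = rng.sample(candidate_sentences, 2)

# Re-creating the generator with the same seed reproduces the sample.
picked_again = random.Random(42).sample(candidate_sentences, 2)
print(picked == picked_again)
```

A separate `Random` instance keeps the seed local to this step, so other code that relies on the global `random` state is not affected.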

#### Do you agree with the given parse? 

Yes, I do.

#### Why or why not?

I am not a trained linguist, but to me the parser seems to identify clearly which word is the verb, which dependencies it has, and where each prepositional phrase should be attached.