# Analyzing Sentence Structure

In [1]:
import nltk, pprint
from nltk.corpus import treebank, ppattach
from collections import defaultdict
from nltk import Tree

### Some Grammatical Dilemmas
##### Linguistic Data and Unlimited Possibilities

We can concoct new sentences that have never been used in a language before, yet can be understood by all speakers of that language. Sentences have a structure that allows them to be extended indefinitely: sentences can be embedded inside larger sentences. (The last two points are really the same observation.)

These structures are represented by a grammar. The purpose of a grammar is to give an explicit description of a language. But is a language a large but finite set of observed utterances and written texts?

We're going to consider a 'language' to be the enormous collection of all grammatical sentences, and a grammar to be a formal notation that can be used for 'generating' the members of this set. Grammars use recursive productions of the form `S -> S and S`.
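The effect of a recursive production can be sketched in plain Python. This is only an illustration: the base sentence *it rains* and the helper name `expand` are assumptions, not part of the notes above. Each extra level of recursion applies `S -> S and S` once more and yields a longer sentence, which is why no finite list can hold the whole language.

```python
def expand(depth):
    """Expand S with S -> S 'and' S, `depth` levels deep.

    At depth 0 the (hypothetical) base production S -> 'it' 'rains' is used.
    """
    if depth == 0:
        return ["it", "rains"]
    return expand(depth - 1) + ["and"] + expand(depth - 1)

# Each level of recursion produces a longer grammatical sentence.
for depth in range(3):
    words = expand(depth)
    print(depth, len(words), " ".join(words))
```

Nothing limits `depth`, so the set of sentences this single production licenses is unbounded.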

##### Ubiquitous Ambiguity

> While hunting in Africa, **I shot an elephant in my pajamas**. How he got into my pajamas, I don't know.

In [7]:
gr_grammar = nltk.CFG.fromstring("""
S -> NP VP
PP -> P NP
NP -> Det N | Det N PP | 'I'
VP -> V NP | VP PP
Det -> 'an' | 'my'
N -> 'elephant' | 'pajamas'
V -> 'shot'
P -> 'in'
""")

In [8]:
sent = ['I', 'shot', 'an', 'elephant', 'in', 'my', 'pajamas']

parser = nltk.ChartParser(gr_grammar)
for tree in parser.parse(sent):
    print(tree)

(S
  (NP I)
  (VP
    (VP (V shot) (NP (Det an) (N elephant)))
    (PP (P in) (NP (Det my) (N pajamas)))))
(S
  (NP I)
  (VP
    (V shot)
    (NP (Det an) (N elephant) (PP (P in) (NP (Det my) (N pajamas))))))


In [32]:
gr_grammar = nltk.CFG.fromstring("""
S -> VP NP
PP -> P NP
NP -> Det N | Det N PP | 'I' | N | P
VP -> V NP | VP PP | V
Det -> 'an' | 'my' |'you'
N -> 'elephant' | 'pajamas' | 'Youtube' |'shape'
V -> 'shot' | 'close' | 'open' | 'play'
P -> 'in' | 'on' | 'of'
""")

In [33]:
sent = ['play', 'Brisingr', 'on', 'Youtube']
parser = nltk.ChartParser(gr_grammar)
# s = parser.parse(sent)
for tree in parser.parse(sent):
    print(tree)

In [8]:
# draw() opens a window and returns None, so there is nothing to print
nltk.ne_chunk(nltk.pos_tag(sent)).draw()


**Preposition**: a word governing, and usually preceding, a noun or pronoun and expressing a relation to another word or element in the clause, as in ‘the man on the platform’, ‘she arrived after dinner’, ‘what did you do it for?’.

There's no ambiguity in the meaning of any individual word here. For example, 'shot' refers to firing a gun in both readings, never to taking a picture with a camera. The ambiguity is purely structural.

Notice that the parser returns two trees for the pajamas sentence above. The first attaches the PP to the verb phrase (using `VP -> VP PP`); the second attaches it to the noun phrase (using `NP -> Det N PP`).

### What's the Use of Syntax?
##### Beyond the n-grams

We can use frequency information in bigrams to generate text that seems perfectly acceptable for short sequences of words but rapidly degenerates into nonsense.



In [13]:
text = nltk.corpus.genesis.words('english-kjv.txt')
bigrams = nltk.bigrams(text)
cfd = nltk.ConditionalFreqDist(bigrams)

def generate_words(cfd, word, num =15):
    for i in range(num):
        print(word, end = ' ')
        word = cfd[word].max()

generate_words(cfd, 'living')

living creature that he said , and the land of the land of the land 

In [14]:
generate_words(cfd, 'love')

love , and the land of the land of the land of the land of 

Examples:
1. He roared with me the pail slip down his back.
2. The worst part and clumsy looking for whoever heard light

You intuitively understand that these sequences are 'word salad', but you may find it hard to pin down what's wrong with them.
So one benefit of studying grammar is that it provides a conceptual framework and vocabulary for spelling out these intuitions. 

*the worst part and clumsy looking* looks like a **coordinate structure**, where two phrases are joined by a coordinating conjunction such as *and*, *but* or *or*.

Coordinate Structure:

If **v1** and **v2** are both phrases of grammatical category X, then **v1 and v2** is also a phrase of category X.

>a.		The book's ending was (NP the worst part and the best part) for me.

>b.		On land they are (AP slow and clumsy looking).

In the first, two NPs (noun phrases) have been conjoined to make an NP, while in the second, two APs (adjective phrases) have been conjoined to make an AP.
 
> What we can't do is conjoin an NP and an AP, which is why *the worst part and clumsy looking* is ungrammatical.
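The coordination rule can be written down as a small check. The sketch below is an assumption-laden illustration: phrases are represented as hypothetical `(category, words)` pairs, and `coordinate` is a made-up helper, not an NLTK function.

```python
def coordinate(left, right, conjunction="and"):
    """Join two (category, words) phrases if their categories match.

    Returns a new phrase of the same category, or None when the
    categories differ (the coordination is then ill-formed).
    """
    left_cat, left_words = left
    right_cat, right_words = right
    if left_cat != right_cat:
        return None
    return (left_cat, left_words + [conjunction] + right_words)

worst = ("NP", ["the", "worst", "part"])
best = ("NP", ["the", "best", "part"])
clumsy = ("AP", ["clumsy", "looking"])

print(coordinate(worst, best))    # NP and NP: accepted, result is an NP
print(coordinate(worst, clumsy))  # NP and AP: rejected, prints None
```

The category labels are supplied by hand here; a real grammar would assign them through parsing.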

Before we can formalize these ideas, we need to understand the concept of **constituent structure**.

Constituent structure is based on the observation that words combine with other words to form units. The evidence that a sequence of words forms such a unit is given by substitutability: that is, a sequence of words in a well-formed sentence can be replaced by a shorter sequence without rendering the sentence ill-formed or changing its meaning.

> The little bear saw the fine fat trout in the brook.

The fact that we can substitute He for *The little bear* indicates that the latter sequence is a unit. By contrast, we cannot replace *little bear saw* in the same way.

>He saw the fine fat trout in the brook.

>*The he the fine fat trout in the brook.
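The substitution test itself is mechanical, even though judging the result is not. As a rough sketch (the helper `substitute` is hypothetical), the two sentences above can be produced by replacing a token span with a pronoun:

```python
def substitute(tokens, start, end, replacement):
    """Return a copy of tokens with tokens[start:end] replaced."""
    return tokens[:start] + replacement + tokens[end:]

sentence = "The little bear saw the fine fat trout in the brook".split()

# "The little bear" (tokens 0..2) is a constituent: the result is well formed.
print(" ".join(substitute(sentence, 0, 3, ["He"])))
# "little bear saw" (tokens 1..3) is not: the result is ill formed.
print(" ".join(substitute(sentence, 1, 4, ["he"])))
```

The code only performs the replacement; deciding whether the output is still a grammatical sentence is left to a speaker's judgment.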

As we will see in the next section, a grammar specifies how the sentence can be subdivided into its immediate constituents, and how these can be further subdivided until we reach the level of individual words.

> As we saw in 1, sentences can have arbitrary length. Consequently, phrase structure trees can have arbitrary depth. The cascaded chunk parsers we saw in 4 can only produce structures of bounded depth, so chunking methods aren't applicable here.


### Context Free Grammar
##### A Simple Grammar

> **In NLTK, context-free grammars are defined in the `nltk.grammar` module.**

![parse_rdparsewindow.png](attachment:parse_rdparsewindow.png)

In [3]:
nltk.app.rdparser()

Since our grammar licenses two trees for this sentence, the sentence is said to be structurally ambiguous. The ambiguity in question is called a prepositional phrase attachment ambiguity, as we saw earlier in this chapter. As you may recall, it is an ambiguity about attachment since the PP in the park needs to be attached to one of two places in the tree: either as a child of VP or else as a child of NP. When the PP is attached to  VP, the intended interpretation is that the seeing event happened in the park. However, if the  PP is attached to NP, then it was the man who was in the park, and the agent of the seeing (the dog) might have been sitting on the balcony of an apartment overlooking the park.

##### Writing Your Own Grammars
Create and edit your grammar in a text file, e.g. 'mygrammar.cfg'. You can then load it into NLTK and parse with it as follows:

In [16]:
%time
grammar = nltk.data.load('file:mygrammar.cfg')
sent = 'Mary saw Bob'.split()
rd_parser = nltk.RecursiveDescentParser(grammar)
for tree in rd_parser.parse(sent):
    print(tree)

Wall time: 0 ns


LookupError: 
**********************************************************************
  Resource C: not found.
  Please use the NLTK Downloader to obtain the resource:

  >>> import nltk
  >>> nltk.download('C:')
  Searched in:
    - ''
**********************************************************************


Make sure that you put a .cfg suffix on the filename, and that there are no spaces in the string `'file:mygrammar.cfg'`. If the command `print(tree)` produces no output, this is probably because your sentence `sent` is not admitted by your grammar. In this case, call the parser with tracing set to be on: `rd_parser = nltk.RecursiveDescentParser(grammar, trace=2)`. You can also check what productions are currently in the grammar with the command `for p in grammar.productions(): print(p)`.

>**When you write CFGs for parsing in NLTK, you cannot combine grammatical categories with lexical items on the righthand side of the same production. Thus, a production such as `PP -> 'of' NP` is disallowed. In addition, you are not permitted to place multi-word lexical items on the righthand side of a production. So rather than writing `NP -> 'New York'`, you have to resort to something like `NP -> 'New_York'` instead.**
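If the grammar uses tokens like `'New_York'`, the input sentence has to be tokenized the same way before parsing. A minimal plain-Python sketch of that preprocessing step follows; `merge_multiword` is a hypothetical helper written for illustration, and it assumes every phrase in the list is non-empty.

```python
def merge_multiword(tokens, phrases):
    """Join known multi-word phrases into single underscore-separated tokens.

    `phrases` is a list of token lists; each is assumed to be non-empty.
    """
    merged, i = [], 0
    while i < len(tokens):
        for phrase in phrases:
            if tokens[i:i + len(phrase)] == phrase:
                merged.append("_".join(phrase))
                i += len(phrase)
                break
        else:
            # No phrase starts at position i: keep the token unchanged.
            merged.append(tokens[i])
            i += 1
    return merged

tokens = "Mary saw New York".split()
print(merge_multiword(tokens, [["New", "York"]]))
```

The merged token list can then be passed to a parser whose grammar contains a production such as `NP -> 'New_York'`.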

##### Recursion in Syntactic Structure

A grammar is recursive if a category occurring on the left-hand side of a production also appears on the right-hand side of a production.
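Recursion can be detected mechanically. The sketch below is an illustration under assumptions: the grammar is a hand-written dictionary rather than an NLTK object, lexical productions are left out, and the helper `recursive_categories` is hypothetical. It also goes one step beyond the definition above by catching indirect recursion, where a category reaches itself through other categories.

```python
def recursive_categories(productions):
    """Return categories that can reach themselves through the productions.

    `productions` maps each left-hand side to a list of right-hand sides,
    each a list of category names (lexical items are left out).
    """
    # Direct reachability: lhs -> every category on any of its right-hand sides.
    reach = {lhs: {sym for rhs in rhss for sym in rhs}
             for lhs, rhss in productions.items()}
    changed = True
    while changed:  # grow the sets until nothing new is reachable
        changed = False
        for lhs in reach:
            extra = set()
            for sym in reach[lhs]:
                extra |= reach.get(sym, set())
            if not extra <= reach[lhs]:
                reach[lhs] |= extra
                changed = True
    return {lhs for lhs in reach if lhs in reach[lhs]}

# Phrase-level productions of the pajamas grammar used earlier.
productions = {
    "S": [["NP", "VP"]],
    "PP": [["P", "NP"]],
    "NP": [["Det", "N"], ["Det", "N", "PP"]],
    "VP": [["V", "NP"], ["VP", "PP"]],
}
print(sorted(recursive_categories(productions)))
```

Here `VP` is directly recursive through `VP -> VP PP`, while `NP` and `PP` are recursive only indirectly, through each other.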

In [6]:
nltk.app.srparser()



In [20]:
grammar1 = nltk.CFG.fromstring("""
S -> NP VP
VP -> V NP | V NP PP
PP -> P NP
V -> 'saw' | 'ate' | 'walked'
NP -> "John" | "Mary" | "Bob" | Det N | Det N PP
Det -> "a" | "an" | "the" | "my"
N -> "man" | "dog" | "cat" | "telescope" | "park"
P -> "in" | "on" | "by" | "with"
""")

In [21]:
sr_parser = nltk.ShiftReduceParser(grammar1)
sent = 'Mary saw a dog'.split()
for tree in sr_parser.parse(sent):
    print(tree)

(S (NP Mary) (VP (V saw) (NP (Det a) (N dog))))


In [22]:
sr_parser = nltk.ShiftReduceParser(grammar1, trace =2)
sent = 'Mary saw a dog'.split()
for tree in sr_parser.parse(sent):
    print(tree)

Parsing 'Mary saw a dog'
    [ * Mary saw a dog]
  S [ 'Mary' * saw a dog]
  R [ NP * saw a dog]
  S [ NP 'saw' * a dog]
  R [ NP V * a dog]
  S [ NP V 'a' * dog]
  R [ NP V Det * dog]
  S [ NP V Det 'dog' * ]
  R [ NP V Det N * ]
  R [ NP V NP * ]
  R [ NP VP * ]
  R [ S * ]
(S (NP Mary) (VP (V saw) (NP (Det a) (N dog))))
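The shift and reduce steps in the trace can be imitated with a few lines of plain Python. This is a simplified sketch, not NLTK's implementation: it keeps only category labels on the stack (no trees), uses a cut-down copy of `grammar1`, and always reduces with the first production whose right-hand side matches the top of the stack.

```python
# A subset of grammar1 as (lhs, rhs) pairs; enough for 'Mary saw a dog'.
GRAMMAR = [
    ("S", ("NP", "VP")),
    ("VP", ("V", "NP")),
    ("NP", ("Det", "N")),
    ("NP", ("Mary",)),
    ("V", ("saw",)),
    ("Det", ("a",)),
    ("N", ("dog",)),
]

def shift_reduce(tokens, grammar=GRAMMAR):
    """Greedy shift-reduce recognizer: reduce whenever possible, else shift."""
    stack, remaining, steps = [], list(tokens), []
    while True:
        for lhs, rhs in grammar:
            if tuple(stack[-len(rhs):]) == rhs:
                stack[-len(rhs):] = [lhs]       # reduce: replace rhs by lhs
                steps.append(("R", list(stack)))
                break
        else:
            if not remaining:
                break                            # nothing to reduce or shift
            stack.append(remaining.pop(0))       # shift the next token
            steps.append(("S", list(stack)))
    return stack, steps

final_stack, steps = shift_reduce("Mary saw a dog".split())
for action, snapshot in steps:
    print(action, snapshot)
```

Because the strategy is greedy and never backtracks, a recognizer like this can fail on sentences that the grammar does admit, which is a known limitation of simple shift-reduce parsing.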


##### Well Formed Substring Table

For every word in the text we can look up in our grammar what category it belongs to.

In [25]:
text = ['I', 'shot', 'an', 'elephant', 'in', 'my', 'pajamas']
gr_grammar.productions(rhs = text[1])

[V -> 'shot']

In [38]:
def init_wfst(tokens, grammar):
    numtokens = len(tokens)
    wfst = [[None for i in range(numtokens + 1)] for j in range(numtokens +1)]
    for i in range(numtokens):
        productions = grammar.productions(rhs = tokens[i])
        wfst[i][i+1] = productions[0].lhs()
    return wfst

def complete_wfst(wfst, tokens, grammar, trace = False):
    index = dict((p.rhs(), p.lhs()) for p in grammar.productions())
    numtokens = len(tokens)
    for span in range(2, numtokens + 1):
        for start in range(numtokens + 1 - span):
            end = start + span
            for mid in range(start + 1, end):
                nt1, nt2 = wfst[start][mid], wfst[mid][end]
                if nt1 and nt2 and (nt1, nt2) in index:
                    wfst[start][end] = index[(nt1, nt2)]
                    if trace:
                        print("[%s] %3s [%s] %3s [%s] ==> [%s] %3s [%s]" % \
                        (start, nt1, mid, nt2, end, start, index[(nt1,nt2)], end))
    return wfst

def display(wfst, tokens):
    print('\nWFST ' + ' '.join(("%-4d" % i) for i in range(1, len(wfst))))
    for i in range(len(wfst) - 1):
        print("%d  " %i, end = " ")
        for j in range(1, len(wfst)):
            print("%-4s"%(wfst[i][j] or '.'), end = " ")
        print()


In [39]:
tokens = 'I shot an elephant in my pajamas'.split()
wfst0 = init_wfst(tokens, gr_grammar)
display(wfst0, tokens)


WFST 1    2    3    4    5    6    7   
0   NP   .    .    .    .    .    .    
1   .    V    .    .    .    .    .    
2   .    .    Det  .    .    .    .    
3   .    .    .    N    .    .    .    
4   .    .    .    .    P    .    .    
5   .    .    .    .    .    Det  .    
6   .    .    .    .    .    .    N    


In [30]:
wfst1 = complete_wfst(wfst0, tokens, gr_grammar)
display(wfst1, tokens)


WFST 1    2    3    4    5    6    7   
0   NP   .    .    S    .    .    S    
1   .    V    .    VP   .    .    VP   
2   .    .    Det  NP   .    .    .    
3   .    .    .    N    .    .    .    
4   .    .    .    .    P    .    PP   
5   .    .    .    .    .    Det  NP   
6   .    .    .    .    .    .    N    


In [40]:
wfst1 = complete_wfst(wfst0, tokens, gr_grammar, trace = True)
display(wfst1, tokens)

[2] Det [3]   N [4] ==> [2]  NP [4]
[5] Det [6]   N [7] ==> [5]  NP [7]
[1]   V [2]  NP [4] ==> [1]  VP [4]
[4]   P [5]  NP [7] ==> [4]  PP [7]
[0]  NP [1]  VP [4] ==> [0]   S [4]
[1]  VP [4]  PP [7] ==> [1]  VP [7]
[0]  NP [1]  VP [7] ==> [0]   S [7]

WFST 1    2    3    4    5    6    7   
0   NP   .    .    S    .    .    S    
1   .    V    .    VP   .    .    VP   
2   .    .    Det  NP   .    .    .    
3   .    .    .    N    .    .    .    
4   .    .    .    .    P    .    PP   
5   .    .    .    .    .    Det  NP   
6   .    .    .    .    .    .    N    


In [4]:
nltk.app.chartparser()

grammar= (
('    ', 'S -> NP VP,')
('    ', 'VP -> VP PP,')
('    ', 'VP -> V NP,')
('    ', 'VP -> V,')
('    ', 'NP -> Det N,')
('    ', 'NP -> NP PP,')
('    ', 'PP -> P NP,')
('    ', "NP -> 'John',")
('    ', "NP -> 'I',")
('    ', "Det -> 'the',")
('    ', "Det -> 'my',")
('    ', "Det -> 'a',")
('    ', "N -> 'dog',")
('    ', "N -> 'cookie',")
('    ', "N -> 'table',")
('    ', "N -> 'cake',")
('    ', "N -> 'fork',")
('    ', "V -> 'ate',")
('    ', "V -> 'saw',")
('    ', "P -> 'on',")
('    ', "P -> 'under',")
('    ', "P -> 'with',")
)
tokens = ['John', 'ate', 'the', 'cake', 'on', 'the', 'table']
Calling "ChartParserApp(grammar, tokens)"...


### Dependencies and Dependency Grammar

Phrase structure grammar is concerned with how words and sequences of words combine to form constituents. Dependency grammar focusses on how words relate to other words. Dependency is a binary asymmetric relation that holds between a head and its dependents.

The head of a sentence is usually taken to be the tensed verb, and every other word is either dependent on the sentence head or connects to it through a path of dependencies.
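The idea of a path of dependencies can be sketched with a dictionary that maps each word to its head. The head assignments below are one possible analysis of *I shot an elephant in my pajamas*, written out by hand as an assumption; `path_to_root` is a hypothetical helper that presumes unique words and no cycles.

```python
def path_to_root(word, head_of):
    """Follow head links from `word` until a word with no head is reached."""
    path = [word]
    while path[-1] in head_of:
        path.append(head_of[path[-1]])
    return path

# dependent -> head; 'shot' has no entry because it is the sentence head.
head_of = {"I": "shot", "elephant": "shot", "an": "elephant",
           "in": "shot", "pajamas": "in", "my": "pajamas"}

print(path_to_root("my", head_of))
print(path_to_root("shot", head_of))
```

Every path ends at *shot*, the tensed verb, which is what it means for that word to be the head of the sentence.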




In [43]:
dep_grammar = nltk.DependencyGrammar.fromstring("""
'shot' -> 'I' | 'elephant' | 'in'
'elephant' -> 'an' | 'in'
'in' -> 'pajamas'
'pajamas' -> 'my'
""")

print(dep_grammar)

Dependency grammar with 7 productions
  'shot' -> 'I'
  'shot' -> 'elephant'
  'shot' -> 'in'
  'elephant' -> 'an'
  'elephant' -> 'in'
  'in' -> 'pajamas'
  'pajamas' -> 'my'


A dependency graph is **projective** if, when all the words are written in linear order, the edges can be drawn above the words without crossing. This is equivalent to saying that a word and all its descendants (dependents and dependents of its dependents, etc.) form a contiguous sequence of words within the sentence. 5.1 is projective, and we can parse many sentences in English using a projective dependency parser. The next example shows how `dep_grammar` provides an alternative approach to capturing the attachment ambiguity that we examined earlier with phrase structure grammar.
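The "no crossing edges" condition can be checked directly. The following sketch assumes arcs are given as hypothetical `(head_position, dependent_position)` pairs and tests only whether two arcs cross; it does not model an artificial root arc, which some definitions of projectivity include.

```python
from itertools import combinations

def is_projective(arcs):
    """Return True if no two arcs cross when drawn above the words.

    `arcs` is a list of (head_position, dependent_position) pairs.
    """
    spans = [tuple(sorted(arc)) for arc in arcs]
    for (l1, r1), (l2, r2) in combinations(spans, 2):
        # Two arcs cross when exactly one endpoint of one lies strictly
        # inside the span of the other.
        if l1 < l2 < r1 < r2 or l2 < l1 < r2 < r1:
            return False
    return True

# I(0) shot(1) an(2) elephant(3) in(4) my(5) pajamas(6),
# with 'in' analysed as a dependent of 'shot'.
pajamas_arcs = [(1, 0), (1, 3), (3, 2), (1, 4), (4, 6), (6, 5)]
print(is_projective(pajamas_arcs))

# A made-up pair of arcs that cross each other.
crossing_arcs = [(0, 2), (1, 3)]
print(is_projective(crossing_arcs))
```

Arcs that merely share an endpoint or nest inside one another do not count as crossing, which is why the strict inequalities are used.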

In [47]:
pdp = nltk.ProjectiveDependencyParser(dep_grammar)
sent = 'I shot an elephant in my pajamas'.split()
trees = pdp.parse(sent)
for tree in trees:
    print(tree)

(shot I (elephant an (in (pajamas my))))
(shot I (elephant an) (in (pajamas my)))


*shot* is the head of the sentence in both trees.

In [50]:
# parse() returns a one-shot iterator, so parse again before drawing
for tree in pdp.parse(sent):
    tree.draw()

Various criteria have been proposed for deciding what is the head H and what is the dependent D in a construction C. Some of the most important are the following:

1. H determines the distribution class of C; or alternatively, the external syntactic properties of C are due to H.
2. H determines the semantic type of C.
3. H is obligatory while D may be optional.
4. H selects D and determines whether it is obligatory or optional.
5. The morphological form of D is determined by H (e.g. agreement or case government).

When we say in a phrase structure grammar that the immediate constituents of a PP are P and NP, we are implicitly appealing to the head / dependent distinction. A prepositional phrase is a phrase whose head is a preposition; moreover, the NP is a dependent of P. The same distinction carries over to the other types of phrase that we have discussed. The key point to note here is that although phrase structure grammars seem very different from dependency grammars, they implicitly embody a recognition of dependency relations. While CFGs are not intended to directly capture dependencies, more recent linguistic frameworks have increasingly adopted formalisms which combine aspects of both approaches.

### Valency and the Lexicon
Let us take a closer look at verbs and their dependents. The grammar in 3.3 correctly generates examples like (15d).

(15)		
a.		The squirrel was frightened.

b.		Chatterer saw the bear.

c.		Chatterer thought Buster was angry.

d.		Joe put the fish on the log.


These possibilities correspond to the following productions:

Table 5.1: VP productions and their lexical heads

| Production | Lexical head |
|---|---|
| VP -> V Adj | was |
| VP -> V NP | saw |
| VP -> V S | thought |
| VP -> V NP PP | put |


That is, was can occur with a following Adj, saw can occur with a following NP, thought can occur with a following S and put can occur with a following NP and PP. The dependents Adj, NP, PP and S are often called complements of the respective verbs and there are strong constraints on what verbs can occur with what complements. By contrast with (15d), the word sequences in (16d) are ill-formed:

(16)		
a.		*The squirrel was Buster was angry.

b.		*Chatterer saw frightened.

c.		*Chatterer thought the bear.

d.		*Joe put on the log.
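The constraints between verbs and their complements can be sketched as a lookup table. The code below is a deliberate simplification: it assumes exactly one complement frame per verb, taken from Table 5.1, whereas real verbs often allow several; `FRAMES` and `licensed` are hypothetical names.

```python
# Complement frames from Table 5.1, keyed by the verb that heads the VP.
FRAMES = {
    "was": ("Adj",),
    "saw": ("NP",),
    "thought": ("S",),
    "put": ("NP", "PP"),
}

def licensed(verb, complements):
    """Return True if the verb's frame matches the given complement categories."""
    return FRAMES.get(verb) == tuple(complements)

print(licensed("put", ["NP", "PP"]))   # Joe put the fish on the log.
print(licensed("put", ["PP"]))         # *Joe put on the log.
print(licensed("thought", ["NP"]))     # *Chatterer thought the bear.
```

A grammar can encode the same information by splitting `V` into subcategories, one for each frame, so that only matching verbs appear in each VP production.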




###  Scaling Up

So far, we have only considered "toy grammars," small grammars that illustrate the key aspects of parsing. But there is an obvious question as to whether the approach can be scaled up to cover large corpora of natural languages. How hard would it be to construct such a set of productions by hand? In general, the answer is: very hard. Even if we allow ourselves to use various formal devices that give much more succinct representations of grammar productions, it is still extremely difficult to keep control of the complex interactions between the many productions required to cover the major constructions of a language. **In other words, it is hard to modularize grammars so that one portion can be developed independently of the other parts. This in turn means that it is difficult to distribute the task of grammar writing across a team of linguists. Another difficulty is that as the grammar expands to cover a wider and wider range of constructions, there is a corresponding increase in the number of analyses which are admitted for any one sentence. In other words, ambiguity increases with coverage.**

### Grammar Development

Parsing builds trees over sentences according to a phrase structure grammar. So far we have used grammars that cover only a small range of short sentences. What should we do if we want to scale up the coverage of our grammar to something more realistic?

##### Treebanks and Grammars

In [53]:
t = treebank.parsed_sents('wsj_0001.mrg')[0]
print(t)

(S
  (NP-SBJ
    (NP (NNP Pierre) (NNP Vinken))
    (, ,)
    (ADJP (NP (CD 61) (NNS years)) (JJ old))
    (, ,))
  (VP
    (MD will)
    (VP
      (VB join)
      (NP (DT the) (NN board))
      (PP-CLR (IN as) (NP (DT a) (JJ nonexecutive) (NN director)))
      (NP-TMP (NNP Nov.) (CD 29))))
  (. .))


In [59]:
def filter(tree):
    """Finds verbs that take sentential complements"""
    child_nodes = [child.label() for child in tree
                   if isinstance(child, nltk.Tree)]
    return (tree.label() == 'VP') and ('S' in child_nodes)


In [74]:
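# Search the treebank for VPs with sentential complements and draw the 11th match.
# draw() opens a window and returns None, so its result is not printed.
matches = [subtree for tree in treebank.parsed_sents() for subtree in tree.subtrees(filter)]
matches[10].draw()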


In [68]:
entries = ppattach.attachments('training')
table = defaultdict(lambda: defaultdict(set))
for entry in entries:
    key = entry.noun1 + '-' +entry.prep + '-' +entry.noun2
    table[key][entry.attachment].add(entry.verb)
    
for key in sorted(table):
     if len(table[key]) > 1:
         print(key, 'N:', sorted(table[key]['N']), 'V:', sorted(table[key]['V']))

%-below-level N: ['left'] V: ['be']
%-from-year N: ['was'] V: ['declined', 'dropped', 'fell', 'grew', 'increased', 'plunged', 'rose', 'was']
%-in-August N: ['was'] V: ['climbed', 'fell', 'leaping', 'rising', 'rose']
%-in-September N: ['increased'] V: ['climbed', 'declined', 'dropped', 'edged', 'fell', 'grew', 'plunged', 'rose', 'slipped']
%-in-week N: ['declined'] V: ['was']
%-to-% N: ['add', 'added', 'backed', 'be', 'cut', 'go', 'grow', 'increased', 'increasing', 'is', 'offer', 'plummet', 'reduce', 'rejected', 'rise', 'risen', 'shaved', 'wants', 'yield', 'zapping'] V: ['fell', 'rise', 'slipped']
%-to-million N: ['declining'] V: ['advanced', 'climbed', 'cutting', 'declined', 'declining', 'dived', 'dropped', 'edged', 'fell', 'gained', 'grew', 'increased', 'jump', 'jumped', 'plunged', 'rising', 'rose', 'slid', 'slipped', 'soared', 'tumbled']
1-to-21 N: ['dropped'] V: ['dropped']
1-to-33 N: ['gained'] V: ['dropped', 'fell', 'jumped']
1-to-4 N: ['added'] V: ['gained']
1-to-47 N: ['jumped']

In [73]:
print(entries[:5])

[PPAttachment(sent='0', verb='join', noun1='board', prep='as', noun2='director', attachment='V'), PPAttachment(sent='1', verb='is', noun1='chairman', prep='of', noun2='N.V.', attachment='N'), PPAttachment(sent='2', verb='named', noun1='director', prep='of', noun2='conglomerate', attachment='N'), PPAttachment(sent='3', verb='caused', noun1='percentage', prep='of', noun2='deaths', attachment='N'), PPAttachment(sent='5', verb='using', noun1='crocidolite', prep='in', noun2='filters', attachment='V')]


Amongst the output lines of this program we find offer-from-group N: ['rejected'] V: ['received'], which indicates that received expects a separate PP complement attached to the VP, while rejected does not. As before, we can use this information to help construct the grammar.

In [75]:
# Shell commands need a leading "!" inside a notebook cell.
!python -m nltk.downloader large_grammars

### Pernicious Ambiguity

**Pernicious:** having a harmful effect, especially in a gradual or subtle way.

As mentioned before, as the coverage of the grammar increases and the length of the sentence grows, the number of parse trees grows rapidly (ambiguity increases).

Let's explore this issue with the help of a simple example. The word fish is both a noun and a verb. We can make up the sentence fish fish fish, meaning fish like to fish for other fish. 

In [80]:
grammar = nltk.CFG.fromstring("""
S -> NP V NP
NP -> NP Sbar
Sbar -> NP V
NP -> 'fish'
V -> 'fish'
""")

Now we can try parsing a longer sentence, fish fish fish fish fish, which amongst other things, means 'fish that other fish fish are in the habit of fishing fish themselves'. We use the NLTK chart parser:

In [81]:
tokens = ['fish'] * 5
cp =nltk.ChartParser(grammar)
for tree in cp.parse(tokens):
    print(tree)

(S (NP fish) (V fish) (NP (NP fish) (Sbar (NP fish) (V fish))))
(S (NP (NP fish) (Sbar (NP fish) (V fish))) (V fish) (NP fish))


As the length of this sentence goes up (3, 5, 7, ...) we get the following numbers of parse trees: 1; 2; 5; 14; 42; 132; 429; 1,430; 4,862; 16,796; 58,786; 208,012; ... (These are the Catalan numbers, which we saw in an exercise in 4). The last of these is for a sentence of length 23, the average length of sentences in the WSJ section of Penn Treebank. For a sentence of length 50 there would be over $10^{12}$ parses, and this is only half the length of the Piglet sentence (1), which young children process effortlessly. No practical NLP system could construct millions of trees for a sentence and choose the appropriate one in the context. It's clear that humans don't do this either!
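The sequence of parse counts can be reproduced from the closed form for the Catalan numbers, $C_n = \binom{2n}{n} / (n + 1)$. The sketch below assumes, as the text states, that the counts follow this sequence, with $n = 1, 2, 3, \ldots$ for the successive sentence lengths; it uses $n = 25$ (a 51-word sentence) as a stand-in for "length 50".

```python
from math import comb

def catalan(n):
    """Return the n-th Catalan number, C(2n, n) / (n + 1)."""
    return comb(2 * n, n) // (n + 1)

# Sentence lengths 3, 5, 7, ... correspond to n = 1, 2, 3, ...
print([catalan(n) for n in range(1, 13)])

# Around 50 words the count already exceeds 10**12.
print(catalan(25) > 10**12)
```

The growth is exponential in sentence length, so enumerating every parse quickly becomes infeasible.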

Note that the problem is not with our choice of example. (Church & Patil, 1982) point out that the syntactic ambiguity of PP attachment in sentences like (18) also grows in proportion to the Catalan numbers.

(18)		Put the block in the box on the table.

So much for structural ambiguity; what about lexical ambiguity? As soon as we try to construct a broad-coverage grammar, we are forced to make lexical entries highly ambiguous for their part of speech. In a toy grammar, a is only a determiner, dog is only a noun, and runs is only a verb. However, in a broad-coverage grammar, a is also a noun (e.g. part a), dog is also a verb (meaning to follow closely), and runs is also a noun (e.g. ski runs). In fact, all words can be referred to by name: e.g. the verb 'ate' is spelled with three letters; in speech we do not need to supply quotation marks. Furthermore, it is possible to verb most nouns. Thus a parser for a broad-coverage grammar will be overwhelmed with ambiguity. Even complete gibberish will often have a reading, e.g. the a are of I. As (Klavans & Resnik, 1996) has pointed out, this is not word salad but a grammatical noun phrase, in which are is a noun meaning a hundredth of a hectare (or 100 sq m), and a and I are nouns designating coordinates.

The solution to these problems is provided by probabilistic parsing, which allows us to rank the parses of an ambiguous sentence on the basis of evidence from corpora.

### Weighted Grammar


In [83]:
def give(t):
    """Match VPs headed by give/gave with an NP followed by a PP-DTV or NP."""
    return t.label() == 'VP' and len(t) > 2 and t[1].label() == 'NP'\
    and (t[2].label() == 'PP-DTV' or t[2].label() == 'NP')\
    and ('give' in t[0].leaves() or 'gave' in t[0].leaves())
    
def sent(t):
    """Join the leaves of t, skipping trace and null tokens."""
    return ' '.join(token for token in t.leaves() if token[0] not in '*-0')

def print_node(t, width):
    output = '%s %s: %s / %s: %s' % (sent(t[0]), t[1].label(),
                                    sent(t[1]), t[2].label(), sent(t[2]))
    if len(output) > width:
        output = output[:width] + "..."  # truncate long lines
    print(output)

In [84]:
for tree in treebank.parsed_sents():
    for t in tree.subtrees(give):
        print_node(t, 72)

**A probabilistic context free grammar (or PCFG)** is a context free grammar that associates a probability with each of its productions. It generates the same set of parses for a text that the corresponding context free grammar does, and assigns a probability to each parse. The probability of a parse generated by a PCFG is simply the product of the probabilities of the productions used to generate it.

In [9]:
grammar = nltk.PCFG.fromstring("""
S -> NP VP        [1.0]
VP -> TV NP       [0.4]
VP -> IV          [0.3]
VP -> DATV NP NP  [0.3]
TV -> 'saw'       [1.0]
IV -> 'ate'       [1.0]
DATV -> 'gave'    [1.0]
NP -> 'telescopes'[0.8]
NP -> 'Jack'      [0.2]
""")

In [10]:
print(grammar)

Grammar with 9 productions (start state = S)
    S -> NP VP [1.0]
    VP -> TV NP [0.4]
    VP -> IV [0.3]
    VP -> DATV NP NP [0.3]
    TV -> 'saw' [1.0]
    IV -> 'ate' [1.0]
    DATV -> 'gave' [1.0]
    NP -> 'telescopes' [0.8]
    NP -> 'Jack' [0.2]


 In order to ensure that the trees generated by the grammar form a probability distribution, PCFG grammars impose the constraint that all productions with a given left-hand side must have probabilities that sum to one. 
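Both properties of a PCFG, the sum-to-one constraint and the product rule for parse probabilities, can be checked by hand. The sketch below copies the probabilities from the PCFG defined above into a plain dictionary (the names `RULES` and `used` are illustrative assumptions) and multiplies out the productions for one parse of *Jack saw telescopes*.

```python
from collections import defaultdict
from math import isclose, prod

# (lhs, rhs) -> probability, copied from the PCFG defined above.
RULES = {
    ("S", ("NP", "VP")): 1.0,
    ("VP", ("TV", "NP")): 0.4,
    ("VP", ("IV",)): 0.3,
    ("VP", ("DATV", "NP", "NP")): 0.3,
    ("TV", ("saw",)): 1.0,
    ("IV", ("ate",)): 1.0,
    ("DATV", ("gave",)): 1.0,
    ("NP", ("telescopes",)): 0.8,
    ("NP", ("Jack",)): 0.2,
}

# Constraint: probabilities sharing a left-hand side sum to one.
totals = defaultdict(float)
for (lhs, _), p in RULES.items():
    totals[lhs] += p
print(all(isclose(total, 1.0) for total in totals.values()))

# Productions used in the parse of "Jack saw telescopes".
used = [
    ("S", ("NP", "VP")),
    ("NP", ("Jack",)),
    ("VP", ("TV", "NP")),
    ("TV", ("saw",)),
    ("NP", ("telescopes",)),
]
parse_probability = prod(RULES[rule] for rule in used)
print(round(parse_probability, 3))
```

`isclose` is used instead of `==` because sums and products of floating-point probabilities are rarely exact.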

In [11]:
viterbi_parser = nltk.ViterbiParser(grammar)
for tree in viterbi_parser.parse(['Jack', 'saw', 'telescopes']):
    tree.draw()  # draw() opens a window and returns None, so do not print it


Now that parse trees are assigned probabilities, it no longer matters that there may be a huge number of possible parses for a given sentence. A parser will be responsible for finding the most likely parses.

In [17]:
example = "Jack saw a telescope".split()
tags = nltk.pos_tag(example)
chunk = nltk.ne_chunk(tags)

In [18]:
chunk.draw()

In [19]:
tags

[('Jack', 'NNP'), ('saw', 'VBD'), ('a', 'DT'), ('telescope', 'NN')]

In [20]:
example = "I cut my Knee".split()
tags = nltk.pos_tag(example)
chunk = nltk.ne_chunk(tags)

In [21]:
tags

[('I', 'PRP'), ('cut', 'VBD'), ('my', 'PRP$'), ('Knee', 'NNP')]

In [22]:
chunk.draw()

In [23]:
example = "I have a cut on my knee".split()
tags = nltk.pos_tag(example)
chunk = nltk.ne_chunk(tags)

In [24]:
tags

[('I', 'PRP'),
 ('have', 'VBP'),
 ('a', 'DT'),
 ('cut', 'NN'),
 ('on', 'IN'),
 ('my', 'PRP$'),
 ('knee', 'NN')]

In [25]:
example = "Show me the weather".split()
tags = nltk.pos_tag(example)
chunk = nltk.ne_chunk(tags)

In [26]:
tags

[('Show', 'VB'), ('me', 'PRP'), ('the', 'DT'), ('weather', 'NN')]

In [27]:
chunk.draw()