# Analyzing Sentence Structure

In [1]:
import nltk, pprint

### Some Grammatical Dilemmas
##### Linguistic Data and Unlimited Possibilities

We can concoct new sentences, ones that have never been used in the language before, which can still be understood by all speakers of the language. Sentences have a structure that allows them to be extended indefinitely. Sentences can be embedded inside larger sentences. (The last two sentences say essentially the same thing.)

These structures are represented by a grammar. The purpose of a grammar is to give an explicit description of a language. Is a language merely a large but finite set of observed utterances and written texts?

We're going to consider a 'language' to be the enormous collection of all grammatical sentences, and a grammar to be a formal notation that can be used for 'generating' the members of this set. Grammars use recursive productions of the form S -> S and S.
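
To see how a single recursive production makes the set of sentences unbounded, here is a minimal sketch in plain Python (it does not use NLTK). The base sentences and the helper `sentences` are assumptions made up for illustration; the function applies S -> S and S up to a chosen nesting depth and collects every distinct string.

```python
from itertools import product

# Hypothetical base sentences standing in for the non-recursive rules of S.
BASE = ("it rains", "we walk")

def sentences(depth):
    """All strings derivable with at most `depth` nested uses of S -> S and S."""
    if depth == 0:
        return set(BASE)
    smaller = sentences(depth - 1)
    # Apply the recursive production to every pair of shorter sentences.
    conjoined = {f"{left} and {right}" for left, right in product(smaller, repeat=2)}
    return smaller | conjoined

# The set keeps growing with the allowed depth.
for depth in range(4):
    print(depth, len(sentences(depth)))
```

Every extra level of nesting enlarges the set, and no depth is the last one, which is why such a language cannot be listed exhaustively.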

##### Ubiquitous Ambiguity

> While hunting in Africa, **I shot an elephant in my pajamas**. How he got into my pajamas, I don't know.

In [5]:
gr_grammar = nltk.CFG.fromstring("""
S -> NP VP
PP -> P NP
NP -> Det N | Det N PP | 'I'
VP -> V NP | VP PP
Det -> 'an' | 'my'
N -> 'elephant' | 'pajamas'
V -> 'shot'
P -> 'in'
""")

In [7]:
sent = ['I', 'shot', 'an', 'elephant', 'in', 'my', 'pajamas']

parser = nltk.ChartParser(gr_grammar)
for tree in parser.parse(sent):
    print(tree)

(S
  (NP I)
  (VP
    (VP (V shot) (NP (Det an) (N elephant)))
    (PP (P in) (NP (Det my) (N pajamas)))))
(S
  (NP I)
  (VP
    (V shot)
    (NP (Det an) (N elephant) (PP (P in) (NP (Det my) (N pajamas))))))


**Preposition**: a word governing, and usually preceding, a noun or pronoun and expressing a relation to another word or element in the clause, as in ‘the man on the platform’, ‘she arrived after dinner’, ‘what did you do it for?’.

The ambiguity does not come from the meaning of any word. For example, 'shot' here refers only to firing a gun, not to taking a picture with a camera.

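Notice that there are two parse trees above for the same sentence. In the first, the PP *in my pajamas* attaches to the verb phrase through VP -> VP PP, so the shooting happened in my pajamas. In the second, the PP attaches to the noun phrase through NP -> Det N PP, so it is the elephant that was in my pajamas.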
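
The number of analyses can also be counted without building any trees. The following is a rough sketch in plain Python, assuming the grammar above is transcribed by hand into a dict; `count` and `count_seq` are hypothetical helpers, and this brute-force recursion is not how `nltk.ChartParser` works internally.

```python
from functools import lru_cache

# The grammar above, transcribed by hand: left-hand side -> right-hand sides.
GRAMMAR = {
    "S": [("NP", "VP")],
    "PP": [("P", "NP")],
    "NP": [("Det", "N"), ("Det", "N", "PP"), ("I",)],
    "VP": [("V", "NP"), ("VP", "PP")],
    "Det": [("an",), ("my",)],
    "N": [("elephant",), ("pajamas",)],
    "V": [("shot",)],
    "P": [("in",)],
}
SENT = ("I", "shot", "an", "elephant", "in", "my", "pajamas")

@lru_cache(maxsize=None)
def count(symbol, i, j):
    """Number of distinct trees rooted at `symbol` that cover SENT[i:j]."""
    if symbol not in GRAMMAR:  # a terminal must match exactly one token
        return int(j - i == 1 and SENT[i] == symbol)
    return sum(count_seq(rhs, i, j) for rhs in GRAMMAR[symbol])

@lru_cache(maxsize=None)
def count_seq(rhs, i, j):
    """Number of ways the symbol sequence `rhs` can cover SENT[i:j]."""
    if len(rhs) == 1:
        return count(rhs[0], i, j)
    # The first symbol covers SENT[i:k]; each remaining symbol needs at least one token.
    return sum(
        count(rhs[0], i, k) * count_seq(rhs[1:], k, j)
        for k in range(i + 1, j - len(rhs) + 2)
    )

print(count("S", 0, len(SENT)))  # 2, one per attachment site of the PP
```

Because every split gives each symbol a non-empty span, the left-recursive rule VP -> VP PP always recurses on a strictly shorter span and the computation terminates.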

### What's the Use of Syntax?
##### Beyond the n-grams

We can use frequency information in bigrams to generate text that seems perfectly acceptable for short sequences of words but rapidly degenerates into nonsense.



In [10]:
text = nltk.corpus.genesis.words('english-kjv.txt')
bigrams = nltk.bigrams(text)
cfd = nltk.ConditionalFreqDist(bigrams)

def generate_words(cfd, word, num=15):
    """Print num words, starting at word and always taking the most frequent successor."""
    for i in range(num):
        print(word, end=' ')
        word = cfd[word].max()

generate_words(cfd, 'living')

living creature that he said , and the land of the land of the land 

In [12]:
generate_words(cfd, 'love')

love , and the land of the land of the land of the land of 
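
Both outputs get stuck repeating *the land of* because `max()` always returns the same successor for a given word, so the sequence cycles as soon as a word recurs. The sketch below reproduces the effect on a toy corpus and contrasts it with sampling successors in proportion to their counts. The corpus, the `successors` table and both function names are assumptions for illustration; it uses only the standard library rather than `nltk.ConditionalFreqDist`.

```python
import random
from collections import Counter, defaultdict

# Toy corpus (an assumption for illustration, not the Genesis text).
tokens = "the land of the land of the sea".split()

# successors[w] counts the words observed immediately after w.
successors = defaultdict(Counter)
for first, second in zip(tokens, tokens[1:]):
    successors[first][second] += 1

def generate_greedy(word, num=6):
    """Always follow the single most frequent successor, like cfd[word].max()."""
    out = []
    for _ in range(num):
        out.append(word)
        if not successors[word]:  # dead end: nothing ever followed this word
            break
        word = successors[word].most_common(1)[0][0]
    return out

def generate_sampled(word, num=6, rng=None):
    """Pick each successor at random, weighted by its bigram count."""
    rng = rng or random.Random(0)
    out = []
    for _ in range(num):
        out.append(word)
        options = successors[word]
        if not options:  # dead end: nothing ever followed this word
            break
        word = rng.choices(list(options), weights=list(options.values()))[0]
    return out

print(" ".join(generate_greedy("the")))   # the land of the land of
print(" ".join(generate_sampled("the")))
```

Sampling breaks the cycle, but the result is still only locally plausible: a bigram model has no notion of sentence structure.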

Examples:
1. He roared with me the pail slip down his back.
2. The worst part and clumsy looking for whoever heard light.

You intuitively understand that these sequences are 'word salad' (in other words, nonsense), but you find it hard to pin down what's wrong with them.
So one benefit of studying grammar is that it provides a conceptual framework and vocabulary for spelling out these intuitions.

*the worst part and clumsy looking* looks like a **coordinate structure**, where two phrases are joined by a coordinating conjunction such as *and*, *but* or *or*.

Coordinate Structure:

If **v1** and **v2** are both phrases of grammatical category X, then **v1 and v2** is also a phrase of category X.

>a.		The book's ending was (NP the worst part and the best part) for me.

>b.		On land they are (AP slow and clumsy looking).

In the first, two NPs (noun phrases) have been conjoined to make an NP, while in the second, two APs (adjective phrases) have been conjoined to make an AP.
 
> What we can't do is conjoin an NP and an AP, which is why *the worst part and clumsy looking* is ungrammatical.
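
The coordination rule can be written down as a small check. This is only a sketch: phrases are represented as hypothetical `(category, text)` pairs, and the `coordinate` helper is made up for illustration.

```python
def coordinate(left, right, conjunction="and"):
    """Conjoin two (category, text) phrases; return None if the categories differ."""
    left_cat, left_text = left
    right_cat, right_text = right
    if left_cat != right_cat:
        return None  # e.g. an NP cannot be conjoined with an AP
    return (left_cat, f"{left_text} {conjunction} {right_text}")

print(coordinate(("NP", "the worst part"), ("NP", "the best part")))
print(coordinate(("AP", "slow"), ("AP", "clumsy looking")))
print(coordinate(("NP", "the worst part"), ("AP", "clumsy looking")))
```

The first two calls return a phrase of the same category as their inputs, while the third returns `None` because the categories do not match.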

Before we can formalize these ideas, we need to understand the concept of **constituent structure**.

Constituent structure is based on the observation that words combine with other words to form units. The evidence that a sequence of words forms such a unit is given by substitutability: a sequence of words in a well-formed sentence can be replaced by a shorter sequence without rendering the sentence ill-formed or changing its meaning.

> The little bear saw the fine fat trout in the brook.

The fact that we can substitute *He* for *The little bear* indicates that the latter sequence is a unit. By contrast, we cannot replace *little bear saw* in the same way.

>He saw the fine fat trout in the brook.

>*The he the fine fat trout in the brook.
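
Mechanically, the substitution test just swaps a span of words for a shorter sequence. The sketch below builds the two strings shown above; the `substitute` helper is hypothetical, and the code cannot judge grammaticality, which remains a speaker's intuition.

```python
def substitute(words, start, end, replacement):
    """Return a copy of `words` with words[start:end] replaced by one word."""
    return words[:start] + [replacement] + words[end:]

sentence = "The little bear saw the fine fat trout in the brook".split()

# Replacing the constituent "The little bear" keeps the sentence well-formed.
print(" ".join(substitute(sentence, 0, 3, "He")))

# Replacing the non-constituent "little bear saw" does not.
print(" ".join(substitute(sentence, 1, 4, "he")))
```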

As we will see in the next section, a grammar specifies how the sentence can be subdivided into its immediate constituents, and how these can be further subdivided until we reach the level of individual words.
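
One way to picture this subdivision is as a nested structure. The sketch below uses plain tuples of the form `(label, child, ...)`; the bracketing is written by hand as an assumption for illustration (it is not parser output), and `leaves` and `immediate_constituents` are hypothetical helpers.

```python
# Hand-written bracketing of the example sentence (an assumption, not parser output).
TREE = (
    "S",
    ("NP", ("Det", "The"), ("Nom", ("Adj", "little"), ("N", "bear"))),
    ("VP",
     ("V", "saw"),
     ("NP", ("Det", "the"),
      ("Nom", ("Adj", "fine"), ("Nom", ("Adj", "fat"), ("N", "trout")))),
     ("PP", ("P", "in"), ("NP", ("Det", "the"), ("N", "brook")))),
)

def leaves(node):
    """Words covered by `node`, in left-to-right order."""
    if isinstance(node, str):
        return [node]
    _, *children = node
    return [word for child in children for word in leaves(child)]

def immediate_constituents(node):
    """(label, text) for each phrasal child of `node`."""
    _, *children = node
    return [
        (child[0], " ".join(leaves(child)))
        for child in children
        if not isinstance(child, str)
    ]

print(immediate_constituents(TREE))     # the two halves of the sentence
print(immediate_constituents(TREE[2]))  # the parts of the verb phrase
```

Calling `immediate_constituents` repeatedly on each child walks down the structure until only single words are left.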

> As we saw in 1, sentences can have arbitrary length. Consequently, phrase structure trees can have arbitrary depth. The cascaded chunk parsers we saw in 4 can only produce structures of bounded depth, so chunking methods aren't applicable here.


### Context Free Grammar
##### A Simple Grammar

![parse_rdparsewindow.png](attachment:parse_rdparsewindow.png)