# Assignment 4 - Linguistics

### Instructions and Hints:

* For this assignment, we will be looking at tokenization, morphology, and syntax.
* This follows the notebook we did in class, though it will be a bit more work.
* Answer each question (or, in some cases, follow the command).
* Follow the instructions on the corresponding assignment Trello card for submitting your assignment.

#### We will be using **[Tamarian](https://www.youtube.com/watch?v=ANvlLcOTy6M)** as our example language: 

In [1]:
sentences = [
    'Sinda his face black his eyes red',
    'Tamak',
    'The river Tamak in winter',
    'Darmok and Jalad at Tanagra',
    'Darmok and Jalad on the ocean',
    'Socath his eyes opened',
#    'The beast of Tanagra Usani his army Jakka when the walls fell', # don't worry about this one
    'Picard and Dathan at Eladrel',
    'Marab with sails unfurled',
    'Timba his arms open',
    'Timba at rest'
]

## 1. Tokenize the sentences

* you will need to make sure everything is lower case
* you will need to represent the sentences as a 2D array of words

In [2]:
sentences = list(map(lambda x: x.lower().split(), sentences))
sentences

[['sinda', 'his', 'face', 'black', 'his', 'eyes', 'red'],
 ['tamak'],
 ['the', 'river', 'tamak', 'in', 'winter'],
 ['darmok', 'and', 'jalad', 'at', 'tanagra'],
 ['darmok', 'and', 'jalad', 'on', 'the', 'ocean'],
 ['socath', 'his', 'eyes', 'opened'],
 ['picard', 'and', 'dathan', 'at', 'eladrel'],
 ['marab', 'with', 'sails', 'unfurled'],
 ['timba', 'his', 'arms', 'open'],
 ['timba', 'at', 'rest']]

## 2. Use a stemmer or lemmatizer

* NLTK has several.
* You will know your stemmer/lemmatizer did its job because plural words will no longer be plural (e.g., 'eyes' -> 'eye') and past-tense words will no longer be past-tense (e.g., 'unfurled' -> 'unfurl').


In [3]:
import nltk
import nltk.stem.snowball as stem

In [4]:
stemmer = stem.EnglishStemmer()  # Snowball stemmer for English
# Stem every token; the name lemma_sentences is kept because later cells use it
lemma_sentences = [[stemmer.stem(t) for t in s] for s in sentences]
lemma_sentences

[['sinda', 'his', 'face', 'black', 'his', 'eye', 'red'],
 ['tamak'],
 ['the', 'river', 'tamak', 'in', 'winter'],
 ['darmok', 'and', 'jalad', 'at', 'tanagra'],
 ['darmok', 'and', 'jalad', 'on', 'the', 'ocean'],
 ['socath', 'his', 'eye', 'open'],
 ['picard', 'and', 'dathan', 'at', 'eladrel'],
 ['marab', 'with', 'sail', 'unfurl'],
 ['timba', 'his', 'arm', 'open'],
 ['timba', 'at', 'rest']]
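
The check described in the bullets above can be written as a few lines of code. This is only a sketch: the word pairs are taken from the outputs shown above, and the names `stemmer`, `expected`, and `stems` are my own and not part of the assignment.

```python
from nltk.stem.snowball import EnglishStemmer

stemmer = EnglishStemmer()

# (inflected form -> stem) pairs taken from the sentences above
expected = {
    'eyes': 'eye',
    'sails': 'sail',
    'arms': 'arm',
    'opened': 'open',
    'unfurled': 'unfurl',
}

# Stem each inflected form and show the result next to the original word
stems = {word: stemmer.stem(word) for word in expected}
for word, stemmed in stems.items():
    print(f'{word} -> {stemmed}')
```

If any plural or past-tense form came back unchanged, the grammar in the next question would need an extra terminal for it.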

## 3. Write a grammar that can parse all of the sentences

* Try to write as few grammar rules as possible
* Use recursion where you can
* Use `S` as the start symbol
* All terminals need to be in quotes


In [5]:
import nltk

tamarian_grammar = nltk.CFG.fromstring("""
 S -> NP
 NP -> NP ADJP | NP PP | NP VP | Det N | Det N N | N | N CC N | N Pr N
 ADJP -> Pr N JJ
 PP -> P NP
 VP -> V
 Det -> 'the'
 N -> 'sinda' | 'face' | 'river' | 'tamak' | 'winter' | 'darmok' | 'jalad' | 'tanagra' | 'socath' | 'ocean' | 'eye' | 'picard' | 'dathan' | 'eladrel' | 'marab' | 'sail' | 'timba' | 'arm' | 'rest'
 V -> 'open' | 'unfurl'
 JJ -> 'black' | 'red'
 P -> 'in' | 'at' | 'on' | 'with'
 CC -> 'and'
 Pr -> 'his'

""")

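Because every terminal is a stemmed, lower-case word, the parser can only handle tokens that went through Questions 1 and 2. A minimal sketch of that check, assuming a trimmed-down grammar (`mini_grammar`) and a helper name (`is_covered`) that I made up for illustration; `check_coverage` is the NLTK method that raises `ValueError` when a token has no terminal rule:

```python
import nltk

# A trimmed-down version of the grammar, just enough for one sentence
mini_grammar = nltk.CFG.fromstring("""
 S -> NP
 NP -> NP VP | N Pr N
 VP -> V
 N -> 'socath' | 'eye'
 V -> 'open'
 Pr -> 'his'
""")

def is_covered(grammar, tokens):
    """Return True if every token appears as a terminal in the grammar."""
    try:
        grammar.check_coverage(tokens)  # raises ValueError on unknown words
        return True
    except ValueError:
        return False

raw = ['socath', 'his', 'eyes', 'opened']    # tokenized only
stemmed = ['socath', 'his', 'eye', 'open']   # tokenized and stemmed

print(is_covered(mini_grammar, raw))
print(is_covered(mini_grammar, stemmed))
```
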
## 4. Show that your grammar parses all of the sentences

* Use a parser that can use a CFG (NLTK has several) 
* Make sure there is a parse tree for each of the sentences

In [6]:
parser = nltk.ChartParser(tamarian_grammar)
for s in lemma_sentences:
    print(s)
    for tree in parser.parse(s):
        print(tree)
    print()
    print()

['sinda', 'his', 'face', 'black', 'his', 'eye', 'red']
(S
  (NP
    (NP (NP (N sinda)) (ADJP (Pr his) (N face) (JJ black)))
    (ADJP (Pr his) (N eye) (JJ red))))


['tamak']
(S (NP (N tamak)))


['the', 'river', 'tamak', 'in', 'winter']
(S
  (NP (NP (Det the) (N river) (N tamak)) (PP (P in) (NP (N winter)))))


['darmok', 'and', 'jalad', 'at', 'tanagra']
(S
  (NP
    (NP (N darmok) (CC and) (N jalad))
    (PP (P at) (NP (N tanagra)))))


['darmok', 'and', 'jalad', 'on', 'the', 'ocean']
(S
  (NP
    (NP (N darmok) (CC and) (N jalad))
    (PP (P on) (NP (Det the) (N ocean)))))


['socath', 'his', 'eye', 'open']
(S (NP (NP (N socath) (Pr his) (N eye)) (VP (V open))))


['picard', 'and', 'dathan', 'at', 'eladrel']
(S
  (NP
    (NP (N picard) (CC and) (N dathan))
    (PP (P at) (NP (N eladrel)))))


['marab', 'with', 'sail', 'unfurl']
(S
  (NP
    (NP (NP (N marab)) (PP (P with) (NP (N sail))))
    (VP (V unfurl))))
(S
  (NP
    (NP (N marab))
    (PP (P with) (NP (NP (N sail)) (VP (V unfu

For questions 5-7, just answer in markdown/raw text. No code necessary.

## 5. Does your parser have full coverage?

Ans: I believe it has full coverage because it was able to parse all of the sentences.

## 6. Does your parser over-generate?

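Ans: Yes. The recursive rules (`NP -> NP ADJP | NP PP | NP VP`) also accept strings that are not among the Tamarian sentences, for example `tamak open unfurl`. Separately, some sentences receive more than one parse tree, which shows ambiguity rather than over-generation.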
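
One way to probe over-generation is to give the parser strings that are not in the sentence list and see whether it accepts them. The sketch below copies the grammar from Question 3; the two probe strings are my own inventions, and I am assuming they are not valid Tamarian.

```python
import nltk

# Grammar copied from Question 3
grammar = nltk.CFG.fromstring("""
 S -> NP
 NP -> NP ADJP | NP PP | NP VP | Det N | Det N N | N | N CC N | N Pr N
 ADJP -> Pr N JJ
 PP -> P NP
 VP -> V
 Det -> 'the'
 N -> 'sinda' | 'face' | 'river' | 'tamak' | 'winter' | 'darmok' | 'jalad' | 'tanagra' | 'socath' | 'ocean' | 'eye' | 'picard' | 'dathan' | 'eladrel' | 'marab' | 'sail' | 'timba' | 'arm' | 'rest'
 V -> 'open' | 'unfurl'
 JJ -> 'black' | 'red'
 P -> 'in' | 'at' | 'on' | 'with'
 CC -> 'and'
 Pr -> 'his'
""")
parser = nltk.ChartParser(grammar)

# Hypothetical probe strings that are NOT in the original sentence list
probes = [
    ['tamak', 'open', 'unfurl'],
    ['the', 'ocean', 'his', 'eye', 'black', 'on', 'darmok'],
]

# A probe is accepted if the parser finds at least one tree for it
accepted = [len(list(parser.parse(tokens))) > 0 for tokens in probes]
for tokens, is_accepted in zip(probes, accepted):
    print(' '.join(tokens), '->', 'accepted' if is_accepted else 'rejected')
```

Any accepted probe is a string the grammar generates even though it was never given as an example sentence.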

## 7. Which sentences are ambiguous? How do you know?

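Ans: Under this grammar, the following sentence is ambiguous:

- ['marab', 'with', 'sail', 'unfurl']

The parser returns two trees for it: in one, the VP `unfurl` attaches to the whole NP `marab with sail`; in the other, it attaches to `sail` inside the PP. When a grammar produces more than one parse for a particular sentence, that sentence is, by definition, syntactically ambiguous. The parser output above shows a single tree for ['sinda', 'his', 'face', 'black', 'his', 'eye', 'red'], so that sentence is not ambiguous under this grammar.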
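
Rather than reading the printed trees by eye, the number of trees per sentence can be counted. This is a sketch that copies the grammar from Question 3 and the stemmed sentences from Question 2; the names `tree_counts` and `ambiguous` are mine.

```python
import nltk

# Grammar copied from Question 3
grammar = nltk.CFG.fromstring("""
 S -> NP
 NP -> NP ADJP | NP PP | NP VP | Det N | Det N N | N | N CC N | N Pr N
 ADJP -> Pr N JJ
 PP -> P NP
 VP -> V
 Det -> 'the'
 N -> 'sinda' | 'face' | 'river' | 'tamak' | 'winter' | 'darmok' | 'jalad' | 'tanagra' | 'socath' | 'ocean' | 'eye' | 'picard' | 'dathan' | 'eladrel' | 'marab' | 'sail' | 'timba' | 'arm' | 'rest'
 V -> 'open' | 'unfurl'
 JJ -> 'black' | 'red'
 P -> 'in' | 'at' | 'on' | 'with'
 CC -> 'and'
 Pr -> 'his'
""")
parser = nltk.ChartParser(grammar)

# Stemmed sentences copied from the output of Question 2
lemma_sentences = [
    ['sinda', 'his', 'face', 'black', 'his', 'eye', 'red'],
    ['tamak'],
    ['the', 'river', 'tamak', 'in', 'winter'],
    ['darmok', 'and', 'jalad', 'at', 'tanagra'],
    ['darmok', 'and', 'jalad', 'on', 'the', 'ocean'],
    ['socath', 'his', 'eye', 'open'],
    ['picard', 'and', 'dathan', 'at', 'eladrel'],
    ['marab', 'with', 'sail', 'unfurl'],
    ['timba', 'his', 'arm', 'open'],
    ['timba', 'at', 'rest'],
]

# 0 trees = not covered, 1 tree = unambiguous, 2+ trees = ambiguous
tree_counts = [len(list(parser.parse(s))) for s in lemma_sentences]
ambiguous = [s for s, n in zip(lemma_sentences, tree_counts) if n > 1]

for s, n in zip(lemma_sentences, tree_counts):
    print(n, ' '.join(s))
```

The same counts also answer the coverage question: a count of zero would mean the grammar misses that sentence.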

## 8. Parse this sentence:

* If you wrote your grammar right, this should be covered. If this isn't covered, then you'll need to go back and change your grammar.

In [7]:
s = ['timba', 'his', 'face', 'red', 'his', 'eye', 'black', 'in', 'winter']

In [8]:
for tree in parser.parse(s):
    print(tree)

(S
  (NP
    (NP
      (NP (NP (N timba)) (ADJP (Pr his) (N face) (JJ red)))
      (ADJP (Pr his) (N eye) (JJ black)))
    (PP (P in) (NP (N winter)))))


## 9. Was your result in Question 8 ambiguous?

* Answer in markdown or raw text, no code necessary

Ans: No. The parser returned a single tree for this sentence, so it is not ambiguous under this grammar.

## 10. How expressive is your language?

* Answer in markdown or raw text, no code necessary

Ans: I think it is very expressive. Because the grammar has recursive rules (`NP -> NP ADJP | NP PP | NP VP`), it can generate, and be used to parse, an infinite number of sentences.
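
The effect of recursion can be illustrated by stacking more and more prepositional phrases onto one noun. This is a sketch with a trimmed-down grammar (`mini_grammar`, my own subset of the rules from Question 3), and the repeated sentences are made up for illustration.

```python
import nltk

# Only the recursive NP/PP part of the grammar from Question 3
mini_grammar = nltk.CFG.fromstring("""
 S -> NP
 NP -> NP PP | N
 PP -> P NP
 N -> 'timba' | 'rest'
 P -> 'at'
""")
parser = nltk.ChartParser(mini_grammar)

# 'timba', 'timba at rest', 'timba at rest at rest', ...
counts = []
for k in range(5):
    tokens = ['timba'] + ['at', 'rest'] * k
    n_trees = len(list(parser.parse(tokens)))
    counts.append(n_trees)
    print(k, 'PP(s):', n_trees, 'tree(s)')
```

Every length is accepted, so there is no longest sentence. The tree counts also grow with length, which suggests that the same recursion that gives expressiveness is a source of ambiguity.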

## 11. Make the grammar more general by treating POS tags as the terminals

In [9]:
tamarian_grammar = nltk.CFG.fromstring("""
   
 S -> NP
 NP -> NP ADJP | NP PP | NP VP | 'Det' 'N' | 'Det' 'N' 'N' | 'N' | 'N' 'CC' 'N' | 'N' 'Pr' 'N'
 ADJP -> 'Pr' 'N' 'JJ'
 PP -> 'P' NP
 VP -> 'V'
""")

## 12. What is your set of POS tags?

* show the list of strings (e.g., ['Adj', ...])



In [10]:
pos_tags = ['Det', 'N', 'V', 'JJ', 'P', 'CC', 'Pr']

## 13. Make a list for the POS tags that correspond to the sentence `s` below:

In [11]:
s = ['timba', 'his', 'face', 'red', 'his', 'eye', 'black', 'in', 'winter']
# p = ['PN',  ??, ... ]
p = ['N', 'Pr', 'N', 'JJ', 'Pr', 'N', 'JJ', 'P', 'N']
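
The tag list above was written by hand. As a sketch, it could also be looked up from the terminal rules of the grammar in Question 3. The names `lexicon`, `word_to_tag`, and `p_auto` are mine, and the lookup assumes every word has exactly one tag, which holds for this grammar but not for natural language in general.

```python
# Lexicon copied from the terminal rules of the grammar in Question 3
lexicon = {
    'Det': ['the'],
    'N': ['sinda', 'face', 'river', 'tamak', 'winter', 'darmok', 'jalad',
          'tanagra', 'socath', 'ocean', 'eye', 'picard', 'dathan',
          'eladrel', 'marab', 'sail', 'timba', 'arm', 'rest'],
    'V': ['open', 'unfurl'],
    'JJ': ['black', 'red'],
    'P': ['in', 'at', 'on', 'with'],
    'CC': ['and'],
    'Pr': ['his'],
}

# Invert the lexicon: word -> tag (assumes one tag per word)
word_to_tag = {word: tag for tag, words in lexicon.items() for word in words}

s = ['timba', 'his', 'face', 'red', 'his', 'eye', 'black', 'in', 'winter']
p_auto = [word_to_tag[w] for w in s]
print(p_auto)
```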

## 14. Parse the sentence (represented as POS tags)

In [12]:
parser = nltk.ChartParser(tamarian_grammar)

for tree in parser.parse(p):
    print(tree)

(S (NP (NP (NP (NP N) (ADJP Pr N JJ)) (ADJP Pr N JJ)) (PP P (NP N))))


## Extra Credit! Do all of the above questions again, but add the sentence:

'The beast of Tanagra Usani his army Jakka when the walls fell'

*Done!*

## Submit

In [15]:
from client.api.notebook import Notebook
ok = Notebook('a4.ok')
ok.auth(inline=True)

Assignment: A4 Linguistics
OK, version v1.18.1

Successfully logged in as MostofanajmusSak@u.boisestate.edu


In [16]:
ok.submit()

Saving notebook... Saved 'A4-linguistics.ipynb'.
Submit... 100% complete
Submission successful for user: MostofanajmusSak@u.boisestate.edu
URL: https://okpy.org/bsu/nlp/sp21/a4/submissions/VQz57o

