# Assignment 4 - Linguistics

## Boise State University NLP - Dr. Kennington

### Instructions and Hints:

* For this assignment, we will be looking at tokenization, morphology, and syntax. 
* This follows the notebook we did in class, though it will be a bit more work.
* Answer each question (or, in some cases, follow the command).
* Follow the instructions on the corresponding assignment Trello card for submitting your assignment.

#### We will be using **[Tamarian](https://www.youtube.com/watch?v=ANvlLcOTy6M)** as our example language: 

In [1]:
sentences = [
    'Sinda his face black his eyes red',
    'Tamak',
    'The river Tamak in winter',
    'Darmok and Jalad at Tanagra',
    'Darmok and Jalad on the ocean',
    'Socath his eyes opened',
#    'The beast of Tanagra Usani his army Jakka when the walls fell', # don't worry about this one
    'Picard and Dathan at Eladrel',
    'Marab with sails unfurled',
    'Timba his arms open',
    'Timba at rest'
]

## 1. Tokenize the sentences

* you will need to make sure everything is lower case
* you will need to represent the sentences as a 2D array of words

In [2]:
# Lower-case each sentence and split on whitespace to get a 2D list of tokens.
sentences = [sentence.lower().split() for sentence in sentences]
sentences

[['sinda', 'his', 'face', 'black', 'his', 'eyes', 'red'],
 ['tamak'],
 ['the', 'river', 'tamak', 'in', 'winter'],
 ['darmok', 'and', 'jalad', 'at', 'tanagra'],
 ['darmok', 'and', 'jalad', 'on', 'the', 'ocean'],
 ['socath', 'his', 'eyes', 'opened'],
 ['picard', 'and', 'dathan', 'at', 'eladrel'],
 ['marab', 'with', 'sails', 'unfurled'],
 ['timba', 'his', 'arms', 'open'],
 ['timba', 'at', 'rest']]
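
Plain `str.split()` is enough here because the sentences contain no punctuation. As a minimal sketch of a slightly more defensive tokenizer, assuming that only letters and apostrophes count as word characters (the `tokenize` helper and the punctuated sample string are hypothetical, not part of the assignment):

```python
import re

def tokenize(sentence):
    """Lower-case a sentence and return its word tokens (hypothetical helper)."""
    # [a-z']+ keeps runs of letters and apostrophes; anything else separates tokens.
    return re.findall(r"[a-z']+", sentence.lower())

# The second string adds punctuation to a sentence from the list above.
raw = ['Darmok and Jalad at Tanagra', 'Timba, at rest.']
tokens = [tokenize(sentence) for sentence in raw]
print(tokens)
```

With this approach the comma and the full stop never reach the token list, so later steps see the same tokens as for the unpunctuated sentence.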

## 2. Use a stemmer or lemmatizer

* (NLTK has several)
* You will know your stemmer/lemmatizer did its job because plural words will no longer be plural (e.g., 'eyes' -> 'eye') and past-tense words will no longer be past-tense (e.g., 'unfurled' -> 'unfurl')


In [3]:
from nltk.stem.snowball import EnglishStemmer

# Snowball is a stemmer, not a lemmatizer; the name `lemmatized` is kept
# because later cells refer to it.
snowball = EnglishStemmer()
lemmatized = [[snowball.stem(word) for word in sentence] for sentence in sentences]
lemmatized

[['sinda', 'his', 'face', 'black', 'his', 'eye', 'red'],
 ['tamak'],
 ['the', 'river', 'tamak', 'in', 'winter'],
 ['darmok', 'and', 'jalad', 'at', 'tanagra'],
 ['darmok', 'and', 'jalad', 'on', 'the', 'ocean'],
 ['socath', 'his', 'eye', 'open'],
 ['picard', 'and', 'dathan', 'at', 'eladrel'],
 ['marab', 'with', 'sail', 'unfurl'],
 ['timba', 'his', 'arm', 'open'],
 ['timba', 'at', 'rest']]
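
To confirm that the stemmer only touched the plural and past-tense forms, the two outputs can be compared token by token. This is a small sketch using four of the sentences, with the lists copied by hand from the outputs of questions 1 and 2; the variable names are illustrative only:

```python
# Token lists copied from the outputs of questions 1 and 2 above.
tokens = [
    ['sinda', 'his', 'face', 'black', 'his', 'eyes', 'red'],
    ['socath', 'his', 'eyes', 'opened'],
    ['marab', 'with', 'sails', 'unfurled'],
    ['timba', 'his', 'arms', 'open'],
]
stems = [
    ['sinda', 'his', 'face', 'black', 'his', 'eye', 'red'],
    ['socath', 'his', 'eye', 'open'],
    ['marab', 'with', 'sail', 'unfurl'],
    ['timba', 'his', 'arm', 'open'],
]

# Record every token whose stem differs from its surface form.
changed = {}
for before, after in zip(tokens, stems):
    for word, stem in zip(before, after):
        if word != stem:
            changed[word] = stem

print(changed)
```

Only inflected forms should appear in `changed`; proper names such as 'sinda' and function words such as 'his' are expected to pass through unchanged.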

## 3. Write a grammar that can parse all of the sentences

* Try to write as few grammar rules as possible
* Use recursion where you can
* Use `S` as the start symbol
* All terminals need to be in quotes


In [4]:
import nltk

tamarian_grammar = nltk.CFG.fromstring("""
 S -> NP
 NP -> N | N NP | N Adj | CC NP | P NP | Det NP | N Adj NP
 Det -> 'his' | 'the'
 N -> 'sinda' | 'face' | 'eye' | 'tamak' | 'river' | 'winter' | 'darmok' | 'jalad' | 'tanagra' | 'ocean' | 'socath' | 'picard' | 'eladrel' | 'dathan' | 'marab' | 'sail' | 'arm' | 'timba' | 'rest' 
 P -> 'in' | 'at' | 'on' | 'with'
 Adj -> 'black' | 'red' | 'open' | 'unfurl'
 CC -> 'and' 

""")
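
Before parsing, it can help to check that every stemmed token appears as a terminal, since a chart parser cannot parse a sentence containing a word the grammar does not list. The sketch below does that check in plain Python; the `lexicon` dictionary is a hand-copied restatement of the terminal rules above, not something produced by NLTK:

```python
# Terminals copied from the grammar above, grouped by pre-terminal symbol.
lexicon = {
    'Det': 'his the',
    'N': ('sinda face eye tamak river winter darmok jalad tanagra ocean '
          'socath picard eladrel dathan marab sail arm timba rest'),
    'P': 'in at on with',
    'Adj': 'black red open unfurl',
    'CC': 'and',
}

# Stemmed sentences copied from the output of question 2.
stemmed = [
    'sinda his face black his eye red',
    'tamak',
    'the river tamak in winter',
    'darmok and jalad at tanagra',
    'darmok and jalad on the ocean',
    'socath his eye open',
    'picard and dathan at eladrel',
    'marab with sail unfurl',
    'timba his arm open',
    'timba at rest',
]

vocabulary = {word for words in lexicon.values() for word in words.split()}
corpus_words = {word for sentence in stemmed for word in sentence.split()}

# Any word listed here would need a new terminal rule in the grammar.
missing = sorted(corpus_words - vocabulary)
print(missing)
```

An empty list means the terminal rules cover the whole stemmed corpus.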

## 4. Show that your grammar parses all of the sentences

* Use a parser that can use a CFG (NLTK has several) 
* Make sure there is a parse tree for each of the sentences

In [5]:
parser = nltk.ChartParser(tamarian_grammar)
for sentence in lemmatized:
    print(sentence)
    for tree in parser.parse(sentence):
        print(tree)
    print('\n')

['sinda', 'his', 'face', 'black', 'his', 'eye', 'red']
(S
  (NP
    (N sinda)
    (NP
      (Det his)
      (NP (N face) (Adj black) (NP (Det his) (NP (N eye) (Adj red)))))))


['tamak']
(S (NP (N tamak)))


['the', 'river', 'tamak', 'in', 'winter']
(S
  (NP
    (Det the)
    (NP (N river) (NP (N tamak) (NP (P in) (NP (N winter)))))))


['darmok', 'and', 'jalad', 'at', 'tanagra']
(S
  (NP
    (N darmok)
    (NP (CC and) (NP (N jalad) (NP (P at) (NP (N tanagra)))))))


['darmok', 'and', 'jalad', 'on', 'the', 'ocean']
(S
  (NP
    (N darmok)
    (NP
      (CC and)
      (NP (N jalad) (NP (P on) (NP (Det the) (NP (N ocean))))))))


['socath', 'his', 'eye', 'open']
(S (NP (N socath) (NP (Det his) (NP (N eye) (Adj open)))))


['picard', 'and', 'dathan', 'at', 'eladrel']
(S
  (NP
    (N picard)
    (NP (CC and) (NP (N dathan) (NP (P at) (NP (N eladrel)))))))


['marab', 'with', 'sail', 'unfurl']
(S (NP (N marab) (NP (P with) (NP (N sail) (Adj unfurl)))))


['timba', 'his', 'arm', 'open']
(S 

For questions 5-7, just answer in markdown/raw text. No code necessary.

## 5. Does your parser have full coverage?

Yes. The parser produces a parse tree for every sentence in the list, so it has full coverage of these sentences.

## 6. Does your parser over-generate?

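Yes, it does. Over-generation means accepting strings that are not sentences of the language, which is a separate question from how many parses each sentence gets. Because `NP -> CC NP`, `NP -> P NP`, and `NP -> Det NP` can stack freely, the grammar also accepts strings such as 'and and tamak' or 'his the at rest', which do not look like Tamarian sentences.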

## 7. Which sentences are ambiguous? How do you know?

None of them. The parser returns exactly one parse tree for each sentence, and an ambiguous sentence would have two or more.
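
The one-parse-per-sentence observation can be double-checked by counting derivations directly. The following is a hedged sketch, not the NLTK parser: `count_np` is a hand-written counter specialised to the seven `NP` rules from question 3, and it assumes every word has exactly one POS tag, which holds for this lexicon.

```python
# Terminals copied from the grammar in question 3.
LEXICON = {
    'Det': 'his the',
    'N': ('sinda face eye tamak river winter darmok jalad tanagra ocean '
          'socath picard eladrel dathan marab sail arm timba rest'),
    'P': 'in at on with',
    'Adj': 'black red open unfurl',
    'CC': 'and',
}
TAG_OF = {word: tag for tag, words in LEXICON.items() for word in words.split()}

def count_np(tags):
    """Count the distinct derivations of a tag sequence from NP."""
    if not tags:
        return 0
    head, rest = tags[0], tags[1:]
    if head in ('CC', 'P', 'Det'):           # NP -> CC NP | P NP | Det NP
        return count_np(rest)
    if head != 'N':
        return 0
    total = 1 if not rest else 0             # NP -> N
    total += count_np(rest)                  # NP -> N NP
    if rest and rest[0] == 'Adj':
        total += 1 if len(rest) == 1 else 0  # NP -> N Adj
        total += count_np(rest[1:])          # NP -> N Adj NP
    return total

sentences = [
    'sinda his face black his eye red', 'tamak', 'the river tamak in winter',
    'darmok and jalad at tanagra', 'darmok and jalad on the ocean',
    'socath his eye open', 'picard and dathan at eladrel',
    'marab with sail unfurl', 'timba his arm open', 'timba at rest',
]

parse_counts = {
    s: count_np(tuple(TAG_OF[w] for w in s.split())) for s in sentences
}
for sentence, count in parse_counts.items():
    print(count, sentence)
```

A count of 0 would indicate a coverage gap and a count above 1 would indicate ambiguity.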

## 8. Parse this sentence:

* If you wrote your grammar right, this should be covered. If this isn't covered, then you'll need to go back and change your grammar.

In [6]:
w = ['timba', 'his', 'face', 'red', 'his', 'eye', 'black', 'in', 'winter']

In [7]:
for tree in parser.parse(w):
    print(tree)

(S
  (NP
    (N timba)
    (NP
      (Det his)
      (NP
        (N face)
        (Adj red)
        (NP
          (Det his)
          (NP (N eye) (Adj black) (NP (P in) (NP (N winter)))))))))


## 9. Was your result in Question 8 ambiguous?

* Answer in markdown or raw text, no code necessary

No. The parser returned only one parse tree for the sentence.

## 10. How expressive is your language?

* Answer in markdown or raw text, no code necessary

The grammar has recursive rules (for example, `NP -> N NP` and `NP -> Det NP`), so it can generate an unbounded number of sentences of arbitrary length.
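
The effect of the recursive rules can be made concrete by counting how many POS-tag sequences of each length the `NP` rules accept. This is an illustrative sketch only: `is_np` is a hand-written recognizer for the seven `NP` rules, not the NLTK parser, and the maximum length of 4 is an arbitrary choice.

```python
from itertools import product

TAGS = ['Det', 'N', 'P', 'Adj', 'CC']

def is_np(tags):
    """Return True if the tag sequence can be derived from NP."""
    if not tags:
        return False
    head, rest = tags[0], tags[1:]
    if head in ('CC', 'P', 'Det'):        # NP -> CC NP | P NP | Det NP
        return is_np(rest)
    if head != 'N':
        return False
    if not rest:                          # NP -> N
        return True
    if rest[0] == 'Adj':                  # NP -> N Adj | N Adj NP
        return len(rest) == 1 or is_np(rest[1:])
    return is_np(rest)                    # NP -> N NP

# Try every possible tag sequence of length 1 to 4 and count the accepted ones.
accepted_per_length = [
    sum(is_np(seq) for seq in product(TAGS, repeat=length))
    for length in range(1, 5)
]
print(accepted_per_length)
```

The count keeps growing as the length increases, which is what an unbounded language looks like when sampled at small sizes.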

## 11. Make the grammar more general by treating POS tags as the terminals

In [8]:
tamarian_grammar = nltk.CFG.fromstring("""
    S   -> NP
    NP  -> 'N' | 'N' NP | 'N' 'Adj' | 'CC' NP | 'P' NP | 'Det' NP | 'N' 'Adj' NP
    
""")

## 12. What is your set of POS tags?

* show the list of strings (e.g., ['Adj', ...])



In [9]:
pos_tags = ['Det', 'N', 'P', 'Adj', 'CC']

## 13. Make a list for the POS tags that correspond to the sentence `s` below:

In [10]:
s = ['timba', 'his', 'face', 'red', 'his', 'eye', 'black', 'in', 'winter']
p = ['N',  'Det', 'N', 'Adj', 'Det', 'N', 'Adj', 'P', 'N']
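
Rather than typing the tags by hand, the same list can be derived from a word-to-tag lookup. A minimal sketch, assuming each word has exactly one tag; the `TAG_OF` dictionary is a hypothetical helper that restates only the terminal rules needed for this sentence:

```python
# Word-to-tag lookup restating the relevant terminal rules from question 3.
TAG_OF = {
    'timba': 'N', 'face': 'N', 'eye': 'N', 'winter': 'N',
    'his': 'Det',
    'red': 'Adj', 'black': 'Adj',
    'in': 'P',
}

s = ['timba', 'his', 'face', 'red', 'his', 'eye', 'black', 'in', 'winter']

# A KeyError here would mean the sentence contains a word with no known tag.
tags = [TAG_OF[word] for word in s]
print(tags)
```

The resulting list matches `p` above and can be passed to the POS-level parser in the same way.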

## 14. Parse the sentence (represented as POS tags)

In [11]:
parser = nltk.ChartParser(tamarian_grammar)

for tree in parser.parse(p):
    print(tree)

(S (NP N (NP Det (NP N Adj (NP Det (NP N Adj (NP P (NP N))))))))


## Extra Credit! Do all of the above questions again, but add the sentence:

'The beast of Tanagra Usani his army Jakka when the walls fell'

*Done!*

## Submit

In [12]:
from client.api.notebook import Notebook
ok = Notebook('a4.ok')
ok.auth(inline=True, force=True)

Assignment: A4 Linguistics
OK, version v1.13.11

Successfully logged in as emanuelhernandez@u.boisestate.edu


In [None]:
ok.submit()
