## Assignment 2

### Part 1 (FSAs)

1.- Define a deterministic finite-state automaton that accepts strings that have an odd number of 0’s and any number of 1’s.


In [1]:
## Uploaded as picture
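
Since the automaton itself was submitted as a picture, the following is only a minimal sketch of one DFA that fits the description, not a transcription of the uploaded diagram. It assumes the alphabet {0, 1} and uses two illustrative state names, `even` (start state) and `odd` (accepting state): reading a 0 flips the state, reading a 1 leaves it unchanged.

```python
# Transition table of a two-state DFA over the alphabet {0, 1}.
# State names are illustrative: "even" is the start state, "odd" is accepting.
TRANSITIONS = {
    ("even", "0"): "odd",
    ("even", "1"): "even",
    ("odd", "0"): "even",
    ("odd", "1"): "odd",
}
START = "even"
ACCEPTING = {"odd"}


def accepts(string):
    """Return True if the DFA accepts the string (odd number of 0's)."""
    state = START
    for symbol in string:
        if (state, symbol) not in TRANSITIONS:
            return False  # symbol outside the alphabet: reject
        state = TRANSITIONS[(state, symbol)]
    return state in ACCEPTING


for s in ["0", "1101", "00", "", "10100"]:
    print(repr(s), accepts(s))
```

Because every (state, symbol) pair has exactly one successor, the table is deterministic, and the 1-transitions are self-loops, which is what allows any number of 1's.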

2.- Implement a regular expression stemmer that can process the following text. 

*Stemming usually refers to a crude heuristic process that chops off the ends of words in the hope of achieving this goal correctly most of the time, and often includes the removal of derivational affixes. Lemmatization usually refers to doing things properly with the use of a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the lemma.*

In [2]:
import nltk
import re

raw =""" Stemming usually refers to a crude heuristic process that chops off the ends of words in the hope of achieving this goal correctly most of the time, 
and often includes the removal of derivational affixes. Lemmatization usually refers to doing things properly with the use of a vocabulary and morphological analysis of words, normally aiming to remove 
inflectional endings only and to return the base or dictionary form of a word, which is known as the lemma."""

tokens = nltk.word_tokenize(raw)

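def stem_word(word):
    """Strip the longest matching suffix from word and return the stem."""
    # The non-greedy group (.*?) takes as few characters as possible,
    # so the optional suffix group captures the longest listed suffix.
    regexp = r'^(.*?)(ing|ly|ed|ious|ies|ive|es|s|ment)?$'
    stem, suffix = re.findall(regexp, word)[0]
    return stem

print([stem_word(t) for t in tokens])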

['Stemm', 'usual', 'refer', 'to', 'a', 'crude', 'heuristic', 'proces', 'that', 'chop', 'off', 'the', 'end', 'of', 'word', 'in', 'the', 'hope', 'of', 'achiev', 'thi', 'goal', 'correct', 'most', 'of', 'the', 'time', ',', 'and', 'often', 'includ', 'the', 'removal', 'of', 'derivational', 'affix', '.', 'Lemmatization', 'usual', 'refer', 'to', 'do', 'thing', 'proper', 'with', 'the', 'use', 'of', 'a', 'vocabulary', 'and', 'morphological', 'analysi', 'of', 'word', ',', 'normal', 'aim', 'to', 'remove', 'inflectional', 'ending', 'on', 'and', 'to', 'return', 'the', 'base', 'or', 'dictionary', 'form', 'of', 'a', 'word', ',', 'which', 'i', 'known', 'a', 'the', 'lemma', '.']
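
Some results above, such as `'this'` becoming `'thi'` and `'only'` becoming `'on'`, follow directly from how the pattern is built. The sketch below reuses the same regular expression and returns both captured groups, so the split is visible; `split_word` is a helper name introduced here for illustration only.

```python
import re

# Same pattern as in stem_word above.
REGEXP = r'^(.*?)(ing|ly|ed|ious|ies|ive|es|s|ment)?$'


def split_word(word):
    """Return (stem, suffix) for a word; suffix is '' when none applies."""
    stem, suffix = re.match(REGEXP, word).groups(default='')
    return stem, suffix


for word in ["processes", "studies", "this", "only", "lemma"]:
    print(word, split_word(word))
```

The non-greedy stem group stops at the shortest prefix whose remainder is one of the listed suffixes, so `processes` splits into `process` + `es` and not `processe` + `s`. The same rule strips the final `s` of `this` and the `ly` of `only`, because the pattern has no notion of which words actually carry a suffix.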


3.- Expand the grammar grammar1.cfg so that it also parses the sentence

*John said to Bob that Mary saw a man with a telescope*


In [3]:

from nltk import CFG, ChartParser

# Rules needed for the new sentence: VP -> V PP CP, CP -> C S, C -> "that",
# plus the words "said" (V) and "to" (P).
g = CFG.fromstring("""
  S -> NP VP
  VP -> V NP | V NP PP | V PP CP
  CP -> C S
  C -> "that"
  PP -> P NP
  V -> "saw" | "ate" | "walked" | "said"
  NP -> "John" | "Mary" | "Bob" | Det N | Det N PP
  Det -> "a" | "an" | "the" | "my"
  N -> "man" | "dog" | "cat" | "telescope" | "park"
  P -> "in" | "on" | "by" | "with" | "to"
  """)
sent = "John said to Bob that Mary saw a man with a telescope"
tokens = sent.split()

# Print and draw every parse tree the grammar licenses for the sentence.
p = ChartParser(g)
for tree in p.parse(tokens):
    print(tree)
    tree.draw()


(S
  (NP John)
  (VP
    (V said)
    (PP (P to) (NP Bob))
    (CP
      (C that)
      (S
        (NP Mary)
        (VP
          (V saw)
          (NP (Det a) (N man))
          (PP (P with) (NP (Det a) (N telescope))))))))
(S
  (NP John)
  (VP
    (V said)
    (PP (P to) (NP Bob))
    (CP
      (C that)
      (S
        (NP Mary)
        (VP
          (V saw)
          (NP
            (Det a)
            (N man)
            (PP (P with) (NP (Det a) (N telescope)))))))))
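
The two trees differ only in where `with a telescope` attaches: to the verb phrase (seeing with a telescope) or to the noun phrase (a man who has a telescope). As a cross-check that does not depend on NLTK, the sketch below counts parses with a small memoized span counter. It is an illustrative re-implementation, not NLTK's chart parser, and it assumes the grammar has no empty productions, so `P` is listed with its five words only.

```python
from functools import lru_cache

# The grammar written as plain Python data.
# Any symbol that is not a key of GRAMMAR is treated as a terminal word.
GRAMMAR = {
    "S": [("NP", "VP")],
    "VP": [("V", "NP"), ("V", "NP", "PP"), ("V", "PP", "CP")],
    "CP": [("C", "S")],
    "C": [("that",)],
    "PP": [("P", "NP")],
    "V": [("saw",), ("ate",), ("walked",), ("said",)],
    "NP": [("John",), ("Mary",), ("Bob",), ("Det", "N"), ("Det", "N", "PP")],
    "Det": [("a",), ("an",), ("the",), ("my",)],
    "N": [("man",), ("dog",), ("cat",), ("telescope",), ("park",)],
    "P": [("in",), ("on",), ("by",), ("with",), ("to",)],
}


def count_parses(sentence, start="S"):
    """Count the distinct parse trees of sentence under GRAMMAR."""
    tokens = tuple(sentence.split())

    @lru_cache(maxsize=None)
    def count_symbol(symbol, i, j):
        # Number of ways symbol derives tokens[i:j].
        if symbol not in GRAMMAR:  # terminal word
            return int(j == i + 1 and tokens[i] == symbol)
        return sum(count_sequence(rhs, i, j) for rhs in GRAMMAR[symbol])

    @lru_cache(maxsize=None)
    def count_sequence(rhs, i, j):
        # Number of ways the symbol sequence rhs derives tokens[i:j].
        if len(rhs) == 1:
            return count_symbol(rhs[0], i, j)
        total = 0
        # Every symbol covers at least one token (no empty productions).
        for k in range(i + 1, j - len(rhs) + 2):
            left = count_symbol(rhs[0], i, k)
            if left:
                total += left * count_sequence(rhs[1:], k, j)
        return total

    return count_symbol(start, 0, len(tokens))


print(count_parses("John said to Bob that Mary saw a man with a telescope"))
```

The count for the assignment sentence is 2, matching the two trees printed above.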


### Part 2 (Wordnet)

In this assignment you will be creating a program for learning languages! We will print a fairy tale, and propose a simple test to check if the language learner knows the target language!

There is a file called `little-red-riding-hood-clean-5lines.txt`, which, as the name suggests, contains the story *Little Red Riding Hood*.

Your job is to do the following:

#### 1st step:

 * Open and load file
 * Read text and remove punctuation (Remember the second *Scientific Programming* class)
 * Tokenize and lemmatize text
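
The first two bullets can be sketched without any external data. The example below works on two made-up sample lines that stand in for the story file, and it leaves lemmatization out because that step needs the WordNet corpus; the function names are illustrative.

```python
import re

# Hypothetical stand-in for the lines of the story file.
sample_lines = [
    "Once upon a time, there was a little girl.",
    '"Good day, Little Red Riding Hood!" said the wolf.',
]


def remove_punctuation(text):
    """Drop everything except word characters, whitespace, hyphens and apostrophes."""
    return re.sub(r"[^\w\s\-']", "", text)


def tokenize(lines):
    """Lower-case each line, strip punctuation and split on whitespace."""
    tokens = []
    for line in lines:
        tokens.extend(remove_punctuation(line.lower()).split())
    return tokens


tokens = tokenize(sample_lines)
print(tokens)
```

Hyphens and apostrophes are kept so that forms like `grandmother's` survive as single tokens.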

The story only contains 5 lines. But each line can contain conversations, which are concatenated together in a single line.

##### Note: If somebody wants to work with a file that contains more lines, you can use the file called `little-red-riding-hood-clean.txt`, which has more lines (conversations were not concatenated together).

#### 2nd step:

Assuming you opened the file with 5 lines, for each paragraph you have to do the following:

 * Get synsets for all words (in English)
 
 * For each word, generate lemmas in a target language (and store them)
 
 * Choose 5 random words (make sure they have a target lemma)
 
 * For each of those random words, ask something that looks like this:
 
    * How would you say the word `RANDOM_WORD` in Bulgarian? (Bulgarian is only an example; this can be any language of your choice.)
    * Then propose one correct lemma (from WordNet) and 4 other random words (it doesn't matter where you get these random words, but they should be different in each test).
    
When you run your experiments, please tell me which target language you are using, so that I can test it with that language.
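
The question format described above can be sketched as follows. This is one possible shape, assuming Danish as the target language and a small hand-picked dictionary in place of real WordNet lookups; `translations` and `make_question` are names introduced here for illustration.

```python
import random

# Hypothetical sample data: English word -> target-language lemmas.
# In the assignment these would come from WordNet lookups.
translations = {
    "day": ["dag", "døgn"],
    "wine": ["vin", "druevin"],
    "open": ["åbne", "åben"],
    "look": ["kigge", "søge"],
}


def make_question(word, translations, rng, n_wrong=4):
    """Return (correct, options): one correct lemma plus n_wrong distractors."""
    correct = rng.choice(translations[word])
    # Distractors are lemmas of the other words, never a valid answer.
    other_lemmas = {lemma for other, lemmas in translations.items()
                    if other != word for lemma in lemmas}
    pool = sorted(other_lemmas - set(translations[word]))
    options = rng.sample(pool, n_wrong) + [correct]
    rng.shuffle(options)
    return correct, options


rng = random.Random()  # pass a seed, e.g. random.Random(0), for repeatable runs
for word in rng.sample(sorted(translations), 2):
    correct, options = make_question(word, translations, rng)
    print(f"How would you say the word '{word}' in Danish?")
    for number, option in enumerate(options, start=1):
        print(f"  {number}. {option}")
```

Drawing the distractors from the lemmas of other story words keeps them in the target language, and sampling them anew for every question makes them differ between tests.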

In [4]:
import string
import nltk
import re
from nltk import WordNetLemmatizer
from nltk.corpus import wordnet as wn
import random
from nltk import pos_tag

def removePunctuation(word):
    # Keep word characters, whitespace, hyphens and apostrophes.
    return re.sub(r"[^\w\s\-']", "", word)

# Read the story, lower-casing each line and stripping punctuation.
with open('little-red-riding-hood-clean-5lines.txt', encoding="utf8") as file:
    rh = [removePunctuation(line.strip('\n').lower()) for line in file]

tokens = " ".join(rh).split() # list of all words in txt

### Create lemmatized form ###
word_lem = WordNetLemmatizer()
# pos_tag returns Penn Treebank tags; map their first letter to a WordNet POS code.
penn_to_wn = {'j': 'a', 'n': 'n', 'v': 'v'}
lem = [word_lem.lemmatize(word, penn_to_wn.get(tag[0].lower(), 'n')) for word, tag in pos_tag(tokens)]

### Get Synsets ###

syn_words = [ wn.synsets(word) for word in lem] # all synsets for the words

print(syn_words[1])

### Get Danish lemma names for the synsets of each word ###
def target_lemmas(synsets, lang='dan'):
    # Collect the lemma names of all synsets of one word in the target language.
    return [name for syn in synsets for name in syn.lemma_names(lang)]

lem_w = [target_lemmas(synsets) for synsets in syn_words] # one list per word, aligned with lem

### Create list that has only words with target lemmas ###
word_with_lemmas = [word for word, names in zip(lem, lem_w) if names] # all the words that have Danish lemmas
#print(word_with_lemmas)


### Take random sample ### 
random_five = random.sample(word_with_lemmas,5) # take random sample of words that have target lemmas

print(random_five)


### get danish translation for the words ###


exp_1 = ['latch', 'suppose', 'take', 'afraid', 'got']

exp_2 = ['saw', 'got', 'anything', 'suppose', 'open']

syns_exp1 = [wn.synsets(x) for x in exp_1]
exp_1_lemmas = []

for i in syns_exp1:
    for j in i:
        exp_1_lemmas.append(j.lemma_names('dan'))


syns_exp2 = [wn.synsets(x) for x in exp_2]
exp2_lemmas =[]

for i in syns_exp2:
    for j in i:
        exp2_lemmas.append(j.lemma_names('dan'))


#print(rlemmas)


### More general code for checking any random 5 ###

syns_rand = [wn.synsets(x) for x in random_five]
rand_lemmas = []
for i in syns_rand:
    for j in i:
        rand_lemmas.append(j.lemma_names('dan'))

rand_lemmas = [x for x in rand_lemmas if x]  # drop synsets without Danish lemmas

print("Random", rand_lemmas)
""" Experiment 1 

     random_five = ['latch', 'suppose', 'take', 'afraid', 'got']
     
     In Danish (from Google Translate): "låsen, formode, tage, bange, fik"
        
     The only lemma match for Experiment 1 was bange.
     
    Experiment 2
    
     random_five = ['saw', 'got', 'anything', 'suppose', 'open']
     
     In Danish: "sav, fik, hvad som helst, formode, åben"

     The lemmas that matched were save for saw, and åben for open.


"""


    




[]
['open', 'day', 'look', 'open', 'wine']
Random [['åbne'], ['åben'], ['åbenlys'], ['dag', 'døgn'], ['udseende'], ['kigge'], ["se_''ud"], ['lede', 'søge'], ['åbne'], ['åben'], ['åbenlys'], ['druevin', 'vin', 'vinsort']]

