***Assignment 2***

*Damiano Pedoni*

The assignment consists in developing, in NLTK, OpenNLP, SketchEngine or GATE/ANNIE, a pipeline that takes a text in a given language as input (English, French, German and Italian are admissible) and outputs the syntactic tree of each sentence: a tree rooted in S (for sentence) whose leaves are the tokens, each labelled with a single part of speech. The tree can be generated with one of the following models:

PURE SYMBOLIC. The tree is generated by an LR analysis based on a context-free LL(2) grammar. Candidates can assume the following:

a) Adjectives in English and German are only prefixed to nouns, whilst in French and Italian they are only suffixed;

b) Verbs are all in the present tense;

c) No pronouns are admitted;

d) Only one adverb is admitted, always placed after the verb (independently of the language and the type of adverb).

Overall, the points above describe a system that could be devised with regular expressions, but a context-free grammar would be simpler to define. Candidates can either define a system by themselves or use a syntactic tree generation system found on GitHub. The same holds for POS tagging, where some of the above-mentioned systems can be customized with existing techniques that are available in several forms (including pre-defined NLTK and OpenNLP libraries for POS tagging and a GATE module for the same purpose). Ambiguity should be resolved by stopping at the first admissible tree.
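The remark that these constraints could be captured with regular expressions can be illustrated with a minimal sketch. It assumes the sentence has already been reduced to a sequence of Universal POS tags (`DET`, `ADJ`, `NOUN`, `VERB`, `ADV`, `PUNCT`). The helper names `sentence_pattern` and `is_admissible` are hypothetical, and the pattern covers only a subject, a verb and an optional adverb or object, not the full grammar used later in this notebook.

```python
import re


def sentence_pattern(language):
    """Return a regex over space-separated POS tags (hypothetical helper)."""
    if language in ("en", "de"):
        # assumption a): the adjective comes before the noun
        np = r"(?:DET )?(?:ADJ )?NOUN"
    elif language in ("fr", "it"):
        # assumption a): the adjective comes after the noun
        np = r"(?:DET )?NOUN(?: ADJ)?"
    else:
        raise ValueError(f"unsupported language: {language}")
    # subject, verb, then at most one adverb or one object, then optional punctuation
    return re.compile(rf"{np} VERB(?: ADV| {np})?(?: PUNCT)?")


def is_admissible(language, tags):
    """Check whether the whole tag sequence matches the pattern."""
    return sentence_pattern(language).fullmatch(" ".join(tags)) is not None


print(is_admissible("en", ["DET", "ADJ", "NOUN", "VERB", "ADV", "PUNCT"]))  # True
print(is_admissible("it", ["DET", "ADJ", "NOUN", "VERB"]))                  # False
```

A regular expression is enough here because the assumptions rule out recursion: each phrase has a bounded shape. It only accepts or rejects a sentence, though, whereas a context-free grammar also yields the tree.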


**Setup**

In [3]:
# importing all the dependencies

import nltk

nltk.download('punkt')

import spacy

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


In [4]:
# downloading spacy tagging models

!python3 -m spacy download it_core_news_sm
!python3 -m spacy download en_core_web_sm
!python3 -m spacy download fr_core_news_sm
!python3 -m spacy download de_core_news_sm

nlp_it = spacy.load("it_core_news_sm")
nlp_en = spacy.load("en_core_web_sm")
nlp_fr = spacy.load("fr_core_news_sm")
nlp_de = spacy.load("de_core_news_sm")

2023-02-08 20:29:37.548314: E tensorflow/stream_executor/cuda/cuda_driver.cc:271] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting it-core-news-sm==3.4.0
  Downloading https://github.com/explosion/spacy-models/releases/download/it_core_news_sm-3.4.0/it_core_news_sm-3.4.0-py3-none-any.whl (13.0 MB)
Installing collected packages: it-core-news-sm
Successfully installed it-core-news-sm-3.4.0
✔ Download and installation successful
You can now load the package via spacy.load('it_core_news_sm')
2023-02-08 20:29:54.284452: E tensorflow/stream_executor/cuda/cuda_driver.cc:271] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev

**Grammar directives**

1. ✅ Adjectives in English and German are only prefixed to nouns, whilst in French and Italian they are only suffixed
2. ✅ Verbs are all in the present tense
3. ✅ No pronouns are admitted
4. ✅ Only one adverb is admitted, always placed after the verb (independently of the language and the type of adverb)
5. ✅ Ambiguity is resolved by stopping at the first admissible tree
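As a rough illustration of what directives 1, 3 and 4 mean for a tagged sentence, the sketch below reports which of them a sentence violates. It is not part of the pipeline. The function name `directive_violations` and the input format (a list of `(word, tag)` pairs with Universal POS tags) are assumptions made for this example. Directive 2 is left out because checking tense needs morphological information that coarse tags do not carry, and "after the verb" is read loosely as "some verb occurs earlier in the sentence".

```python
def directive_violations(language, tagged):
    """List the directives violated by a tagged sentence (hypothetical helper)."""
    tags = [tag for _, tag in tagged]
    problems = []

    # directive 3: no pronouns are admitted
    if "PRON" in tags:
        problems.append("pronoun found")

    # directive 4: at most one adverb, placed after the verb
    adverbs = [i for i, tag in enumerate(tags) if tag == "ADV"]
    if len(adverbs) > 1:
        problems.append("more than one adverb")
    for i in adverbs:
        if "VERB" not in tags[:i]:
            problems.append("adverb before the verb")

    # directive 1: adjective position depends on the language
    for i, tag in enumerate(tags):
        if tag != "ADJ":
            continue
        noun_follows = i + 1 < len(tags) and tags[i + 1] == "NOUN"
        noun_precedes = i > 0 and tags[i - 1] == "NOUN"
        if language in ("en", "de") and not noun_follows:
            problems.append("adjective not prefixed to a noun")
        if language in ("fr", "it") and not noun_precedes:
            problems.append("adjective not suffixed to a noun")

    return problems


sentence = [("The", "DET"), ("black", "ADJ"), ("cat", "NOUN"),
            ("eats", "VERB"), ("quickly", "ADV")]
print(directive_violations("en", sentence))  # []
```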



**Tree generation**

In [5]:
# creating a grammar for every language, following all the grammar directives

grammars = {
    "en": """
      S -> NP VP PUNCT | NP VP
      NP -> NUM ADJ NOUN | DET NOUN | DET ADJ NOUN | ADJ NOUN | DET NOUN PP | PROPN | NOUN
      VP -> VERB NP | VERB | VERB ADVP | VP SCONJ VP | AUX VP | AUX PART VP | VERB VP | VERB PP | VERB NOUN | AUX ADJP
      ADVP -> ADV
      PP -> NP
      ADJP -> ADJ | ADJ ADJP | ADJ PART ADJP
      """,
    "de": """
      S -> NP VP PUNCT | NP VP
      NP -> NUM ADJ NOUN | DET NOUN | DET ADJ NOUN | ADJ NOUN | DET NOUN PP | PROPN | NOUN
      VP -> VERB NP | VERB | VERB ADVP | VP SCONJ VP | AUX VP | AUX PART VP | VERB VP | VERB PP | VERB NOUN | AUX ADJP
      ADVP -> ADV
      PP -> NP
      ADJP -> ADJ | ADJ ADJP | ADJ PART ADJP
      """,
    "it": """
      S -> NP VP PUNCT | NP VP
      NP -> NUM NOUN ADJP | DET NOUN | DET NOUN ADJ | NOUN ADJ | DET NOUN PP | PROPN | NOUN
      VP -> VERB NP | VERB | VERB ADVP | VP SCONJ VP | AUX VP | AUX PART VP | VERB VP | VERB PP | VERB NOUN | AUX ADJP
      ADVP -> ADV
      PP -> NP
      ADJP -> ADJ | ADJ ADJP | ADJ PART ADJP
      """,
    "fr": """
      S -> NP VP PUNCT | NP VP
      NP -> NUM NOUN ADJP | DET NOUN | DET NOUN ADJ | NOUN ADJ | DET NOUN PP | PROPN | NOUN
      VP -> VERB NP | VERB | VERB ADVP | VP SCONJ VP | AUX VP | AUX PART VP | VERB VP | VERB PP | VERB NOUN | AUX ADJP
      ADVP -> ADV
      PP -> NP
      ADJP -> ADJ | ADJ ADJP | ADJ PART ADJP
      """
}
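The four grammars above differ only in the `NP` rule: English and German place the adjective before the noun, Italian and French after it. The sketch below shows one possible way to assemble the same rule sets from shared pieces. The names `SHARED_HEAD`, `SHARED_TAIL`, `NP_ADJ_BEFORE`, `NP_ADJ_AFTER`, `build_grammar` and `grammars_sketch` are hypothetical and not used elsewhere in this notebook.

```python
# Rules shared by all four languages (copied from the grammars above).
SHARED_HEAD = "S -> NP VP PUNCT | NP VP"
SHARED_TAIL = """VP -> VERB NP | VERB | VERB ADVP | VP SCONJ VP | AUX VP | AUX PART VP | VERB VP | VERB PP | VERB NOUN | AUX ADJP
ADVP -> ADV
PP -> NP
ADJP -> ADJ | ADJ ADJP | ADJ PART ADJP"""

# The NP rule is the only part that depends on the adjective position.
NP_ADJ_BEFORE = "NP -> NUM ADJ NOUN | DET NOUN | DET ADJ NOUN | ADJ NOUN | DET NOUN PP | PROPN | NOUN"
NP_ADJ_AFTER = "NP -> NUM NOUN ADJP | DET NOUN | DET NOUN ADJ | NOUN ADJ | DET NOUN PP | PROPN | NOUN"

NP_RULES = {
    "en": NP_ADJ_BEFORE,
    "de": NP_ADJ_BEFORE,
    "it": NP_ADJ_AFTER,
    "fr": NP_ADJ_AFTER,
}


def build_grammar(language):
    """Assemble the grammar string for one language (hypothetical helper)."""
    # the trailing newline lets lexical rules be appended afterwards
    return "\n".join([SHARED_HEAD, NP_RULES[language], SHARED_TAIL]) + "\n"


grammars_sketch = {language: build_grammar(language) for language in NP_RULES}
```

Keeping the shared rules in one place means a change to `VP` or `ADJP` cannot drift between languages, at the cost of slightly less readable per-language grammars.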

In [6]:
# generating the tree using the nltk chart parser

def generate_tree(language, prompt):
  # selecting the spacy model for the requested language
  models = {"en": nlp_en, "it": nlp_it, "fr": nlp_fr, "de": nlp_de}

  if language not in grammars:
    raise ValueError(f"Unsupported language: {language}")

  string_grammar = grammars[language]

  doc = models[language](prompt)

  # creating the lexical rules dynamically: every spacy POS tag becomes a
  # pre-terminal that expands to the words of the prompt carrying that tag

  grammar = {}

  for token in doc:
    # nltk accepts single- or double-quoted terminals, so words containing
    # an apostrophe are wrapped in double quotes
    quote = '"' if "'" in token.text else "'"
    grammar.setdefault(token.pos_, set()).add(quote + token.text + quote)

  formatted_grammar = ""

  for tag, words in grammar.items():
    formatted_grammar += f"{tag} -> {' | '.join(words)}\n"

  current_grammar = string_grammar + formatted_grammar

  print(current_grammar)

  # parsing the spacy tokens (the same tokens the lexical rules were built from,
  # so every token is covered by the grammar) and keeping only the first tree
  # in case of ambiguity

  parser = nltk.ChartParser(nltk.CFG.fromstring(current_grammar))

  tokens = [token.text for token in doc]

  tree = next(parser.parse(tokens), None)

  if tree is not None:
    print(tree)
  else:
    print("No parse tree found for this sentence.")

**Testing**

*Italian*

In [7]:
# testing italian

prompt = "Il gatto mangia il topo."
language = "it"

generate_tree(language, prompt)


      S -> NP VP PUNCT | NP VP
      NP -> NUM NOUN ADJP | DET NOUN | DET NOUN ADJ | NOUN ADJ | DET NOUN PP | PROPN | NOUN
      VP -> VERB NP | VERB | VERB ADVP | VP SCONJ VP | AUX VP | AUX PART VP | VERB VP | VERB PP | VERB NOUN | AUX ADJP
      ADVP -> ADV
      PP -> NP
      ADJP -> ADJ | ADJ ADJP | ADJ PART ADJP
      DET -> 'il' | 'Il'
NOUN -> 'topo' | 'gatto'
VERB -> 'mangia'
PUNCT -> '.'

(S
  (NP (DET Il) (NOUN gatto))
  (VP (VERB mangia) (NP (DET il) (NOUN topo)))
  (PUNCT .))


*English*

In [8]:
# testing english

prompt = "The cat eats the mouse."
language = "en"

generate_tree(language, prompt)


      S -> NP VP PUNCT | NP VP
      NP -> NUM ADJ NOUN | DET NOUN | DET ADJ NOUN | ADJ NOUN | DET NOUN PP | PROPN | NOUN
      VP -> VERB NP | VERB | VERB ADVP | VP SCONJ VP | AUX VP | AUX PART VP | VERB VP | VERB PP | VERB NOUN | AUX ADJP
      ADVP -> ADV
      PP -> NP
      ADJP -> ADJ | ADJ ADJP | ADJ PART ADJP
      DET -> 'The' | 'the'
NOUN -> 'mouse' | 'cat'
VERB -> 'eats'
PUNCT -> '.'

(S
  (NP (DET The) (NOUN cat))
  (VP (VERB eats) (NP (DET the) (NOUN mouse)))
  (PUNCT .))


*French*

In [9]:
# testing french

prompt = "Le chat mange la souris."
language = "fr"

generate_tree(language, prompt)


      S -> NP VP PUNCT | NP VP
      NP -> NUM NOUN ADJP | DET NOUN | DET NOUN ADJ | NOUN ADJ | DET NOUN PP | PROPN | NOUN
      VP -> VERB NP | VERB | VERB ADVP | VP SCONJ VP | AUX VP | AUX PART VP | VERB VP | VERB PP | VERB NOUN | AUX ADJP
      ADVP -> ADV
      PP -> NP
      ADJP -> ADJ | ADJ ADJP | ADJ PART ADJP
      DET -> 'Le' | 'la'
NOUN -> 'souris' | 'chat'
VERB -> 'mange'
PUNCT -> '.'

(S
  (NP (DET Le) (NOUN chat))
  (VP (VERB mange) (NP (DET la) (NOUN souris)))
  (PUNCT .))


*German*

In [10]:
# testing german

prompt = "Die Katze frisst die Maus."
language = "de"

generate_tree(language, prompt)


      S -> NP VP PUNCT | NP VP
      NP -> NUM ADJ NOUN | DET NOUN | DET ADJ NOUN | ADJ NOUN | DET NOUN PP | PROPN | NOUN
      VP -> VERB NP | VERB | VERB ADVP | VP SCONJ VP | AUX VP | AUX PART VP | VERB VP | VERB PP | VERB NOUN | AUX ADJP
      ADVP -> ADV
      PP -> NP
      ADJP -> ADJ | ADJ ADJP | ADJ PART ADJP
      DET -> 'Die' | 'die'
NOUN -> 'Katze' | 'Maus'
VERB -> 'frisst'
PUNCT -> '.'

(S
  (NP (DET Die) (NOUN Katze))
  (VP (VERB frisst) (NP (DET die) (NOUN Maus)))
  (PUNCT .))
