## NLP SECOND ASSIGNMENT

The assignment consists of developing, in NLTK, OpenNLP, SketchEngine or GATE/ANNIE, a pipeline that takes as input a text in a given language (English, French, German and Italian are admissible) and outputs the syntactic tree of each sentence: a tree rooted in S (for sentence) whose leaves are the tokens, each labelled with a single part of speech. The tree can be generated with one of the following models:

1) PURE SYMBOLIC. The tree is generated by an LR analysis based on a context-free LL(2) grammar. Candidates can assume the following:

    a) Adjectives in English and German only precede nouns, whilst in French and Italian they only follow them;

    b) All verbs are in the present tense;

    c) No pronouns are admitted;

    d) Only one adverb is admitted, always placed after the verb (independently of the language and of the type of adverb).

    Overall, the points above describe a system that could be expressed with regular expressions, but a context-free grammar would be
    simpler to define. Candidates can either define a system themselves or use a syntactic tree generation system found on GitHub.
    The same holds for POS tagging, where some of the above-mentioned systems can be customized with existing techniques that are
    available in several forms (including predefined NLTK and OpenNLP libraries for POS tagging and a GATE module for the same purpose).
    Ambiguity should be resolved by stopping at the first admissible tree.

2) PURE ML. Candidates can develop a PLM with one-step Markov chains to forecast the following token, and use it to forecast the
     POS tags to be attributed. The PLM can be built from a corpus obtained online, for instance through the Wikipedia access API
     or other freely available repositories (including those available with SketchEngine). In this approach, candidates should
     never use the forecast to determine outcomes (this would amount to distinguishing EN/non-EN, and then IT/non-IT, FR/non-FR
     or DE/non-DE), but only to identify the POS model in a sequence. In this case, the candidate should output the most
     likely POS tagging, without directly associating the sequence with a tree.
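
The remark in option 1 that constraints a) to d) could be captured with regular expressions can be illustrated with a minimal sketch. It assumes the sentence has already been POS-tagged with Universal POS labels (DET, NOUN, ADJ, VERB, ADV, PUNCT). The noun-phrase shape (optional determiner, one noun, any number of adjectives) is an assumption, and `ORDER`, `sentence_pattern` and `is_admissible` are hypothetical names, not part of any of the toolkits listed above.

```python
import re

# Noun-phrase patterns over space-separated POS tags (assumed shapes).
# English/German: adjectives precede the noun; French/Italian: they follow it.
NP_PREFIX = r"(?:DET )?(?:ADJ )*NOUN"
NP_SUFFIX = r"(?:DET )?NOUN(?: ADJ)*"

ORDER = {"en": NP_PREFIX, "de": NP_PREFIX, "fr": NP_SUFFIX, "it": NP_SUFFIX}


def sentence_pattern(lang):
    """Compile a regex for: NP VERB [NP] [ADV] [PUNCT] in the given language."""
    noun_phrase = ORDER[lang]
    # One optional object NP and at most one adverb after the verb (constraint d)
    return re.compile(
        rf"{noun_phrase} VERB(?: {noun_phrase})?(?: ADV)?(?: PUNCT)?"
    )


def is_admissible(tags, lang):
    """Return True if the tag sequence satisfies the simplified constraints."""
    return sentence_pattern(lang).fullmatch(" ".join(tags)) is not None


print(is_admissible(["DET", "ADJ", "NOUN", "VERB", "ADV", "PUNCT"], "en"))  # True
print(is_admissible(["DET", "NOUN", "ADJ", "VERB", "ADV", "PUNCT"], "en"))  # False
print(is_admissible(["DET", "NOUN", "ADJ", "VERB", "ADV", "PUNCT"], "it"))  # True
```

A regular expression of this kind only accepts or rejects a tag sequence; it does not produce the required tree, which is one reason a context-free grammar is the more convenient formalism here.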

Candidates are free to employ the PURE ML approach to simplify or pre-process the text in order to improve the performance of a PURE SYMBOLIC approach, thereby producing a mixed model.
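
One possible reading of the PURE ML option is a first-order (bigram) hidden Markov model over POS tags, decoded with the Viterbi algorithm. The sketch below is a minimal illustration under stated assumptions: the three-sentence tagged corpus is invented toy data standing in for a real corpus, add-one smoothing is an arbitrary choice, and `train`, `log_prob` and `viterbi` are hypothetical helper names rather than functions from NLTK or any other toolkit.

```python
import math
from collections import Counter, defaultdict

# Toy hand-tagged corpus (invented data, for illustration only)
CORPUS = [
    [("the", "DET"), ("red", "ADJ"), ("car", "NOUN"),
     ("runs", "VERB"), ("quickly", "ADV")],
    [("the", "DET"), ("dog", "NOUN"), ("runs", "VERB")],
    [("the", "DET"), ("red", "ADJ"), ("dog", "NOUN"),
     ("sleeps", "VERB"), ("quietly", "ADV")],
]


def train(corpus):
    """Count tag-to-tag transitions and tag-to-word emissions."""
    trans = defaultdict(Counter)
    emit = defaultdict(Counter)
    for sentence in corpus:
        prev = "<s>"  # sentence-start pseudo-tag
        for word, tag in sentence:
            trans[prev][tag] += 1
            emit[tag][word.lower()] += 1
            prev = tag
    return trans, emit


def log_prob(counter, key, n_outcomes):
    """Add-one smoothed log-probability of `key` under `counter`."""
    return math.log((counter[key] + 1) / (sum(counter.values()) + n_outcomes))


def viterbi(words, trans, emit):
    """Return the most likely tag sequence under the first-order model."""
    tags = list(emit)
    vocab = {w for counts in emit.values() for w in counts}
    n_words = len(vocab) + 1  # one extra slot for unseen words
    n_tags = len(tags)

    # best[i][tag] = (log-score of best path ending in tag, previous tag)
    best = [{}]
    for tag in tags:
        score = (log_prob(trans["<s>"], tag, n_tags)
                 + log_prob(emit[tag], words[0].lower(), n_words))
        best[0][tag] = (score, None)

    for i in range(1, len(words)):
        best.append({})
        for tag in tags:
            emission = log_prob(emit[tag], words[i].lower(), n_words)
            best[i][tag] = max(
                (best[i - 1][p][0] + log_prob(trans[p], tag, n_tags) + emission, p)
                for p in tags
            )

    # Backtrack from the best final tag
    tag = max(tags, key=lambda t: best[-1][t][0])
    path = [tag]
    for i in range(len(words) - 1, 0, -1):
        tag = best[i][tag][1]
        path.append(tag)
    return path[::-1]


trans, emit = train(CORPUS)
print(viterbi("the red dog runs quickly".split(), trans, emit))
# ['DET', 'ADJ', 'NOUN', 'VERB', 'ADV']
```

In line with the requirement above, the model outputs only the most likely tag sequence and does not attach a tree to it; in a mixed model, such a sequence could replace the POS tags fed to the symbolic grammar.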

## Preliminary imports

In [43]:
import nltk
import spacy
from nltk.tree import TreePrettyPrinter

## TREE GENERATOR

In [44]:
def tree_generator(file, base_grammar, nlp):
    for sent in file:
        parsed_sent = nlp(sent)
        for token in parsed_sent:
            print(token.text, token.pos_)
        print(f"{sent}\n")

        # Collect the distinct words seen for each part-of-speech tag,
        # preserving first-seen order
        lexicon = {}
        for token in parsed_sent:
            words = lexicon.setdefault(token.pos_, [])
            terminal = f'"{token.text}"'
            if terminal not in words:
                words.append(terminal)

        # Append the lexical rules to a per-sentence copy of the grammar, so
        # that words from one sentence do not leak into the grammar of the next
        sent_grammar = base_grammar
        for pos, words in lexicon.items():
            sent_grammar += f"{pos} -> {' | '.join(words)}\n"

        nltk_grammar = nltk.CFG.fromstring(sent_grammar)
        parser = nltk.ChartParser(nltk_grammar)

        spacy_tokenized = [token.text for token in parsed_sent]
        trees = list(parser.parse(spacy_tokenized))
        # Ambiguity is resolved by keeping the first admissible tree
        if trees:
            print(trees[0])
            print(TreePrettyPrinter(trees[0]).text())
        else:
            print("No parse found for this sentence.")
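
The lexical rules that `tree_generator` appends to `base_grammar` can be inspected in isolation. The sketch below is a simplified stand-in that needs neither spaCy nor NLTK: the hand-tagged `(word, tag)` pairs are an assumption replacing the tagger output, and `lexical_rules` is a hypothetical helper name.

```python
def lexical_rules(tagged_tokens):
    """Build one 'POS -> "w1" | "w2"' rule per tag, keeping first-seen order."""
    lexicon = {}
    for word, pos in tagged_tokens:
        terminals = lexicon.setdefault(pos, [])
        terminal = f'"{word}"'
        if terminal not in terminals:
            terminals.append(terminal)
    return "".join(
        f"{pos} -> {' | '.join(words)}\n" for pos, words in lexicon.items()
    )


# Hand-tagged tokens standing in for the tagger output (assumed tags)
tagged = [
    ("The", "DET"), ("red", "ADJ"), ("car", "NOUN"), ("hits", "VERB"),
    ("the", "DET"), ("wall", "NOUN"), (".", "PUNCT"),
]
print(lexical_rules(tagged))
# DET -> "The" | "the"
# ADJ -> "red"
# NOUN -> "car" | "wall"
# VERB -> "hits"
# PUNCT -> "."
```

The returned string uses the same rule syntax as the base grammars in the cells below, so it can be concatenated with them before the grammar is compiled.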

## ITALIAN

In [45]:
file = [
    "L'automobile rossa corre velocemente."
]

# Load the spaCy Italian pipeline, used here for POS tagging
nlp = spacy.load("it_core_news_sm")

# Base grammar: adjectives follow the noun (NP -> NP ADJ)
base_grammar = """
S -> NP VP PUNCT | NP VP | PUNCT NP VP PUNCT
NP -> NOUN | NP ADJ | DET NP
VP -> VP NP | VERB | VP ADV
"""

tree_generator(file, base_grammar, nlp)

L' DET
automobile NOUN
rossa ADJ
corre VERB
velocemente ADV
. PUNCT
L'automobile rossa corre velocemente.

(S
  (NP (NP (DET L') (NP (NOUN automobile))) (ADJ rossa))
  (VP (VP (VERB corre)) (ADV velocemente))
  (PUNCT .))
                     S                              
             ________|___________________________    
            NP                  |                |  
      ______|________           |                |   
     NP              |          VP               |  
  ___|______         |      ____|_______         |   
 |          NP       |     VP           |        |  
 |          |        |     |            |        |   
DET        NOUN     ADJ   VERB         ADV     PUNCT
 |          |        |     |            |        |   
 L'     automobile rossa corre     velocemente   .  



## ENGLISH

In [46]:
file = [
    "The red car runs quickly.",
]

# Load the spaCy English pipeline, used here for POS tagging
nlp = spacy.load("en_core_web_sm")

# Base grammar: adjectives precede the noun (NP -> ADJ NP)
base_grammar = """
S -> NP VP PUNCT | NP VP | PUNCT NP VP PUNCT
NP -> NOUN | ADJ NP | DET NP
VP -> VP NP | VERB | VP ADV | VP PUNCT
"""
tree_generator(file, base_grammar, nlp)

The DET
red ADJ
car NOUN
runs VERB
quickly ADV
. PUNCT
The red car runs quickly.

(S
  (NP (DET The) (NP (ADJ red) (NP (NOUN car))))
  (VP (VP (VERB runs)) (ADV quickly))
  (PUNCT .))
             S                         
      _______|______________________    
     NP                |            |  
  ___|___              |            |   
 |       NP            VP           |  
 |    ___|___      ____|_____       |   
 |   |       NP   VP         |      |  
 |   |       |    |          |      |   
DET ADJ     NOUN VERB       ADV   PUNCT
 |   |       |    |          |      |   
The red     car  runs     quickly   .  



## GERMAN

In [47]:
file = [ 
    "Die hohe Palme schwankt sanft."
]

# Load the spaCy German pipeline, used here for POS tagging
nlp = spacy.load("de_core_news_sm")

# Base grammar: adjectives precede the noun (NP -> ADJ NP)
base_grammar = """
S -> NP VP PUNCT | NP VP | PUNCT NP VP PUNCT
NP -> NOUN | ADJ NP | DET NP
VP -> VP NP | VERB | VP ADV | VP PUNCT
"""
tree_generator(file, base_grammar, nlp)

Die DET
hohe ADJ
Palme NOUN
schwankt VERB
sanft ADV
. PUNCT
Die hohe Palme schwankt sanft.

(S
  (NP (DET Die) (NP (ADJ hohe) (NP (NOUN Palme))))
  (VP (VP (VERB schwankt)) (ADV sanft))
  (PUNCT .))
               S                           
      _________|________________________    
     NP                      |          |  
  ___|____                   |          |   
 |        NP                 VP         |  
 |    ____|____        ______|____      |   
 |   |         NP     VP          |     |  
 |   |         |      |           |     |   
DET ADJ       NOUN   VERB        ADV  PUNCT
 |   |         |      |           |     |   
Die hohe     Palme schwankt     sanft   .  



## FRENCH

In [48]:
file = [
    "Le chat noir dormir tranquillement."
]

# Load the spaCy French pipeline, used here for POS tagging
nlp = spacy.load("fr_core_news_sm")

# Base grammar: adjectives follow the noun (NP -> NP ADJ)
base_grammar = """
S -> NP VP PUNCT | NP VP | PUNCT NP VP PUNCT
NP -> NOUN | NP ADJ | DET NP
VP -> VP NP | VERB | VP ADV
"""
tree_generator(file, base_grammar, nlp)

Le DET
chat NOUN
noir ADJ
dormir VERB
tranquillement ADV
. PUNCT
Le chat noir dormir tranquillement.

(S
  (NP (NP (DET Le) (NP (NOUN chat))) (ADJ noir))
  (VP (VP (VERB dormir)) (ADV tranquillement))
  (PUNCT .))
              S                                  
          ____|_______________________________    
         NP               |                   |  
      ___|____            |                   |   
     NP       |           VP                  |  
  ___|___     |      _____|________           |   
 |       NP   |     VP             |          |  
 |       |    |     |              |          |   
DET     NOUN ADJ   VERB           ADV       PUNCT
 |       |    |     |              |          |   
 Le     chat noir dormir     tranquillement   .  

