## Assignment 2

The assignment consists in developing, with NLTK, OpenNLP, SketchEngine or GATE/ANNIE, a pipeline that takes as input a text in a given language (English, French, German and Italian are admissible) and outputs the syntactic tree of the sentence, understood as a tree rooted in S (for sentence) whose leaves are the tokens, each labelled with a single part of speech. The tree can be generated with one of the following models:

### 1) PURE SYMBOLIC
The tree is generated by an LR analysis based on a context-free (CF) LL(2) grammar. Candidates can assume the following:
<ol>
<li>Adjectives in English and German are only prefixed to nouns, whilst in French and Italian they are only suffixed;</li>
<li>Verbs are all in the present tense;</li>
<li>No pronouns are admitted;</li>
<li>Only one adverb is admitted, always placed after the verb (regardless of the language and of the type of adverb);</li>
</ol>

Overall, the points above describe a system that could be expressed with regular expressions, but a context-free grammar would be simpler to
define. Candidates can either define a system themselves or use a syntactic tree generation system found on GitHub.
The same holds for POS tagging, where some of the systems mentioned above can be customized with existing techniques that are available
in several forms (including predefined NLTK and OpenNLP libraries for POS tagging and a GATE module for the same purpose). Ambiguity
should be resolved by taking the first admissible tree.
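
To illustrate the remark about regular expressions, here is a minimal sketch that checks a sequence of POS tags against the assumptions listed above. The tag names (`DET`, `ADJ`, `NOUN`, `VERB`, `ADV`, `PUNCT`), the sentence shape (subject, verb, optional adverb, optional object) and the helper names are assumptions made for this example; the assignment does not prescribe them.

```python
import re

# Hypothetical tag names; the assignment text does not fix a tag set.
NP_PREFIX = r"(?:DET )?(?:ADJ )*NOUN"   # English, German: adjectives before the noun
NP_SUFFIX = r"(?:DET )?NOUN(?: ADJ)*"   # French, Italian: adjectives after the noun

def sentence_pattern(np):
    # Noun phrase, verb, at most one adverb after the verb,
    # optional object noun phrase, optional final punctuation.
    return re.compile(rf"{np} VERB(?: ADV)?(?: {np})?(?: PUNCT)?")

PATTERNS = {
    "EN": sentence_pattern(NP_PREFIX),
    "DE": sentence_pattern(NP_PREFIX),
    "FR": sentence_pattern(NP_SUFFIX),
    "IT": sentence_pattern(NP_SUFFIX),
}

def is_admissible(tags, language):
    """Return True if the whole tag sequence matches the pattern for the language."""
    return PATTERNS[language].fullmatch(" ".join(tags)) is not None

print(is_admissible(["DET", "ADJ", "NOUN", "VERB", "ADV", "PUNCT"], "EN"))
print(is_admissible(["DET", "NOUN", "ADJ", "VERB", "ADV", "PUNCT"], "IT"))
print(is_admissible(["DET", "NOUN", "ADJ", "VERB", "ADV", "PUNCT"], "EN"))
```

A pattern like this only accepts or rejects a sequence. It does not produce the nested tree the assignment asks for, which is one reason a context-free grammar is the more convenient formalism here.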

### 2) PURE ML
Candidates can develop a PLM based on one-step Markov chains that forecasts the following token and is used to forecast the
POS tags to be assigned. The PLM can be built from a corpus obtained online, for instance through
the Wikipedia access API or other freely available repositories (including those available with SketchEngine). In this approach, candidates should
never use the forecast to determine outcomes (this would serve the same purpose as distinguishing EN/non-EN, and
then IT/non-IT, FR/non-FR or DE/non-DE), but only to identify the POS model in a sequence. The candidate should output the most
likely POS tagging, without directly associating the sequence with a tree.
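
As a rough illustration of the one-step Markov idea, the sketch below estimates tag-to-tag transition frequencies from a tiny hand-tagged corpus and forecasts the most likely next tag. The corpus, the tag names and the function names are all assumptions made for this example. It covers only the transition part of the model; a full tagger would also need word-given-tag estimates.

```python
from collections import Counter, defaultdict

# Toy hand-tagged corpus (an assumption for illustration; a real model would be
# estimated from a much larger corpus, e.g. text retrieved from Wikipedia).
tagged_corpus = [
    ["DET", "ADJ", "NOUN", "VERB", "ADV", "PUNCT"],
    ["DET", "NOUN", "VERB", "DET", "NOUN", "PUNCT"],
    ["ADJ", "NOUN", "VERB", "ADV", "PUNCT"],
]

def train_transitions(sequences):
    """Count tag-to-tag transitions, with <s> marking the sentence start."""
    counts = defaultdict(Counter)
    for tags in sequences:
        previous = "<s>"
        for tag in tags:
            counts[previous][tag] += 1
            previous = tag
    return counts

def next_tag_distribution(counts, previous):
    """Relative-frequency estimate of P(next tag | previous tag)."""
    total = sum(counts[previous].values())
    return {tag: n / total for tag, n in counts[previous].items()}

def most_likely_next(counts, previous):
    """Most frequent follower of a tag that was seen during training."""
    return counts[previous].most_common(1)[0][0]

transitions = train_transitions(tagged_corpus)
print(most_likely_next(transitions, "DET"))
print(next_tag_distribution(transitions, "VERB"))
```

Because each forecast depends only on the previous tag, the model stays small enough to estimate from simple counts.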

Candidates are free to employ a PURE ML approach to simplify or pre-process the text in order to improve the performance of a PURE SYMBOLIC approach, thereby producing a mixed model.

### This software uses the first approach

The user can select the language as input.

In [18]:
#Imports
import spacy
import nltk

First, we define some grammar productions that map the POS tagger's tags onto the non-terminals used by the grammar.

In [19]:
spacy_to_nltk_grammar = """
N -> NOUN
V -> VERB
P -> ADP
"""
nlp = None
sentences = []
base_grammar = ""

### Setup info
Each language setup provides:
<ul>
    <li>The POS tagger, loaded from spaCy.</li>
    <li>Some test sentences.</li>
    <li>The base grammar for the language, written according to the specification, to which the productions defined above are appended.</li>
</ul>

Note: not all sentences will parse.
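
One reason a sentence can fail to parse is that the tagger assigns a POS tag that no production of the base grammar mentions, for example a pronoun, which the assignment excludes. The sketch below flags such tokens. The grammar text is an abridged copy of the rules used in this notebook, the tags are assigned by hand rather than produced by a tagger, and `grammar_symbols` is a hypothetical helper.

```python
def grammar_symbols(grammar_text):
    """Collect every symbol that appears in a grammar written in NLTK's CFG notation."""
    symbols = set()
    for line in grammar_text.strip().splitlines():
        for piece in line.replace("->", " ").replace("|", " ").split():
            symbols.add(piece)
    return symbols

# Abridged copy of the rules used in this notebook.
grammar_text = """
S -> NP VP PUNCT | NP VP
NP -> N | ADJP NP | DET NP
VP -> VP NP | V | VP ADVP | AUX ADJP
ADVP -> ADV
ADJP -> ADJ | ADJ ADJP
N -> NOUN
V -> VERB
"""

# Hand-assigned tags (assumed, not produced by running a tagger here).
tagged = [("I", "PRON"), ("love", "VERB"), ("cats", "NOUN"), (".", "PUNCT")]

known = grammar_symbols(grammar_text)
uncovered = [(word, pos) for word, pos in tagged if pos not in known]
print(uncovered)
```

A sentence with an uncovered tag cannot be derived from `S`, so the parser returns no tree for it.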

### ENGLISH SETUP

Note: English and German adjectives are prefixed to nouns

In [20]:
def load_english():
    # LOAD ENGLISH POS TAGGER
    nlp = spacy.load("en_core_web_sm")

    # ENGLISH TEST SENTENCES 
    sentences = ["The ginger cat runs fast.",
    "The gray cat is black.",
    "The cat is running away",
    "I love cats.",
    "Little cats are great.",
    "Fat cats are awesome."]

    # ENGLISH SPECIFIC GRAMMAR  
    base_grammar= """
    S -> NP VP PUNCT | NP VP | PUNCT NP VP PUNCT
    NP -> NUM ADJ N | N | ADJP NP  | DET NP 
    VP -> VP NP | V | VP ADVP | VP SCONJ VP | AUX VP | VP PUNCT | AUX ADJP| AUX ADV 
    ADVP -> ADV 
    ADJP -> ADJ | ADJ ADJP
    PP -> P NP
    """ + spacy_to_nltk_grammar
    print("English setup done")
    return nlp, sentences, base_grammar


### GERMAN SETUP 

In [21]:
def load_german():
    # LOAD GERMAN POS TAGGER
    nlp = spacy.load("de_core_news_sm") 

    # GERMAN TEST SENTENCES 
    sentences = ["Die rote Katze rennt schnell.",
    "Die graue Katze ist schwarz.",
    "Die Katze läuft weg",
    "Ich liebe Katzen.",
    "Kleine Katzen sind toll.",
    "Fette Katzen sind großartig."]

    # GERMAN GRAMMAR  
    base_grammar= """
    S -> NP VP PUNCT | NP VP | PUNCT NP VP PUNCT
    NP -> NUM ADJ N | N | ADJP NP  | DET NP 
    VP -> VP NP | V | VP ADVP | VP SCONJ VP | AUX VP | VP PUNCT | AUX ADJP| AUX ADV 
    ADVP -> ADV 
    ADJP -> ADJ | ADJ ADJP
    PP -> P NP
    """ + spacy_to_nltk_grammar
    print("German setup done")
    return nlp, sentences, base_grammar


### ITALIAN SETUP

Note: Italian and French adjectives are suffixed to nouns

In [22]:
def load_italian():
    # LOAD ITALIAN POS TAGGER
    nlp = spacy.load("it_core_news_sm") 

    # ITALIAN TEST SENTENCES 
    sentences = ["Il gatto rosso corre velocemente.",
    "Il gatto grigio è nero.",
    "Il gatto sta correndo via",
    "Adoro i gatti.",
    "I gatti piccoli sono fantastici.",
    "I gatti grassi sono fantastici."]

    # ITALIAN GRAMMAR  
    base_grammar= """
    S -> NP VP PUNCT | NP VP | PUNCT NP VP PUNCT
    NP -> NUM N ADJ | N | NP ADJP | DET NP 
    VP -> VP NP | V | VP ADVP | VP SCONJ VP | AUX VP | VP PUNCT | AUX ADJP| AUX ADV 
    ADVP -> ADV 
    ADJP -> ADJ | ADJ ADJP
    PP -> P NP
    """ + spacy_to_nltk_grammar
    print("Italian setup done")
    return nlp, sentences, base_grammar

### FRENCH SETUP

In [23]:
def load_french():
    
    # LOAD FRENCH POS TAGGER
    nlp = spacy.load("fr_core_news_sm") 

    # FRENCH TEST SENTENCES 
    sentences = ["Le chat roux court vite.",
    "Le chat gris est noir.",
    "Le chat s'enfuit",
    "J'aime les chats.",
    "Les petits chats sont géniaux.",
    "Les gros chats sont géniaux."]

    # FRENCH GRAMMAR  
    base_grammar= """
    S -> NP VP PUNCT | NP VP | PUNCT NP VP PUNCT
    NP -> NUM N ADJ | N | NP ADJP | DET NP 
    VP -> VP NP | V | VP ADVP | VP SCONJ VP | AUX VP | VP PUNCT | AUX ADJP| AUX ADV 
    ADVP -> ADV 
    ADJP -> ADJ | ADJ ADJP
    PP -> P NP
    """ + spacy_to_nltk_grammar
    print("French setup done")
    return nlp, sentences, base_grammar

### Tree pipeline function
<ol>
    <li>Compute the POS tags of the sentence</li>
    <li>Add the tagged words to the grammar as terminals</li>
    <li>Create a parser with the resulting grammar</li>
    <li>Print the tree</li>
</ol>
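
The second step can be sketched in isolation, without spaCy or NLTK, to show what the generated lexical productions look like. The tags below are copied by hand from the notebook output for an Italian test sentence (final punctuation omitted), and `lexical_rules` is a hypothetical helper, not the function used in the notebook.

```python
def lexical_rules(tagged_tokens):
    """Build one 'POS -> "w1" | "w2"' line per tag, keeping first-seen order."""
    words_by_pos = {}
    for word, pos in tagged_tokens:
        quoted = f'"{word}"'
        words = words_by_pos.setdefault(pos, [])
        if quoted not in words:
            words.append(quoted)
    return "\n".join(
        f"{pos} -> {' | '.join(words)}" for pos, words in words_by_pos.items()
    )

# Tags copied by hand from the tagger output shown in this notebook.
tagged = [("Il", "DET"), ("gatto", "NOUN"), ("grigio", "ADJ"), ("è", "AUX"), ("nero", "ADJ")]
print(lexical_rules(tagged))
```

Appending these lines to the base grammar gives the parser a terminal production for every word of the sentence.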

In [24]:
def generate_tree(nlp, sentences, base_grammar):
    """
    Given a spaCy POS tagger, a list of sentences and a base grammar, this function prints the parse tree of each sentence.
    First, it uses the POS tagger to extract the terminals of the sentence.
    Then it adds grammar rules of the form "POS -> terminals".
    Finally, it builds an NLTK grammar, which the parser uses to parse the sentence and obtain the tree.
    """
    for sentence in sentences: 
        #pos tagging
        sentence_pos = set()
        grammar = {}
        spacy_parsed_sent= nlp(sentence)
        print(f"{sentence}\n")
        for token in spacy_parsed_sent:
            print(f"{token.text } -> {token.pos_}")
            sentence_pos.add(token.pos_)
            if token.pos_ not in grammar:
                grammar[token.pos_] = []
            word = '"' + token.text + '"'
            if word not in grammar[token.pos_]:
                grammar[token.pos_].append(word)

        print("\n")
        #Grammar rules update
        grammar_rules = base_grammar
        for pos_tag in sentence_pos:
            # One production per POS tag: POS -> "word1" | "word2" | ...
            grammar_rules += f"{pos_tag} -> {' | '.join(grammar[pos_tag])}\n"
        
        #Parser SetUp, and sentences parsing
        nltk_grammar = nltk.CFG.fromstring(grammar_rules)
        parser = nltk.ChartParser(nltk_grammar)
        spacy_tokenized = [token.text for token in spacy_parsed_sent]
        trees = list(parser.parse(spacy_tokenized))
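        if trees:
            # Ambiguity is resolved by keeping the first admissible tree
            print(trees[0])
            # pretty_print() draws the tree itself and returns None, so it is not wrapped in print()
            trees[0].pretty_print()
        else:
            print("No tree found")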

        print("\n\n")

### Input and loading for parsing

In [25]:
language_sentence = input("Sentence Language [EN/DE/IT/FR]: ")
#language_sentence = "EN"
language_sentence = language_sentence.upper()

In [26]:
# Map each language code to its setup function
loaders = {"EN": load_english, "DE": load_german, "IT": load_italian, "FR": load_french}
if language_sentence in loaders:
    nlp, sentences, base_grammar = loaders[language_sentence]()
    generate_tree(nlp, sentences, base_grammar)
else:
    print("language not valid")

Italian setup done
Il gatto rosso corre velocemente.

Il -> DET
gatto -> NOUN
rosso -> ADJ
corre -> VERB
velocemente -> ADV
. -> PUNCT


(S
  (NP (NP (DET Il) (NP (N (NOUN gatto)))) (ADJP (ADJ rosso)))
  (VP (VP (V (VERB corre))) (ADVP (ADV velocemente)))
  (PUNCT .))
                S                              
           _____|___________________________    
          NP               |                |  
      ____|_____           |                |   
     NP         |          VP               |  
  ___|____      |      ____|_______         |   
 |        NP    |     VP           |        |  
 |        |     |     |            |        |   
 |        N    ADJP   V           ADVP      |  
 |        |     |     |            |        |   
DET      NOUN  ADJ   VERB         ADV     PUNCT
 |        |     |     |            |        |   
 Il     gatto rosso corre     velocemente   .  

None



Il gatto grigio è nero.

Il -> DET
gatto -> NOUN
grigio -> ADJ
è -> AUX
nero -> ADJ
. -> PUN