# **Natural Language Processing: *Second assignment*** 

# ***Candidate: Martina Toffoli VR446059***

# **Prompt:**
The assignment consists of developing, in NLTK, OpenNLP, SketchEngine or GATE/ANNIE, a pipeline that, starting from an input text in a given language (English, French, German and Italian are admissible), outputs the syntactic tree of each sentence, understood as a tree rooted in S (for sentence) whose leaves are the tokens, each labelled with a single part-of-speech tag. The tree can be generated with one of the following models:

1) **PURE SYMBOLIC.** The tree is generated by a LR analysis with CF LL2 grammar as a base. Candidates can assume the following:

    a) Adjectives in English and German are only placed before nouns, whilst in French and Italian they are only placed after them;

    b) Verbs are all at present tense;

    c) No pronouns are admitted;

    d) Only one adverb is admitted, always placed after the verb (independently of the language and the type of adverb);

  Overall, the points above describe a system that could be devised with regular expressions, but a context-free grammar would be simpler
  to define. Candidates can either define a system by themselves or use a syntactic tree generation system found on GitHub.
  The same holds for POS tagging, where some of the above-mentioned systems can be customized with existing techniques that are available
  in several forms (including the predefined NLTK and OpenNLP libraries for POS tagging and a GATE module for the same purpose).
  Ambiguity should be resolved by taking the first admissible tree.

2) **PURE ML.** Candidates can develop a PLM with one-step Markov chains to forecast the following token, and use it to forecast the
     POS tags to be attributed. In this case the PLM can be built from a corpus obtained online, for instance through
     the Wikipedia access API or other freely available repositories (including those available with SketchEngine). In this approach, candidates
     should never use the forecasting to determine outcomes (for this would serve the identical purpose of distinguishing EN/non-EN, and
     then IT/non-IT, FR/non-FR or DE/non-DE), but only to identify the POS model in a sequence. In this case, the candidate should output the most
     likely POS tagging, without directly associating the sequence with a tree.

Candidates are free to employ a PURE ML approach to simplify or pre-process the text in order to improve the performance of a PURE SYMBOLIC approach, thus producing a mixed model.
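
To make the PURE ML option more concrete, here is a minimal sketch of a one-step Markov chain over POS tags. The toy tag sequences, the `<s>` start symbol and the function names are all assumptions made for illustration; a real model would be estimated from a tagged corpus.

```python
from collections import Counter, defaultdict

# Hypothetical toy corpus of POS-tag sequences (Penn Treebank-style tags).
TAGGED_SENTENCES = [
    ["NNP", "VBZ", "DT", "JJ", "NN", "."],
    ["DT", "NN", "VBZ", "RB", "."],
    ["DT", "JJ", "NN", "VBZ", "DT", "NN", "."],
]


def train_transitions(sequences):
    """Estimate P(next_tag | tag) from bigram counts."""
    counts = defaultdict(Counter)
    for tags in sequences:
        # Pair every tag with its predecessor; "<s>" marks the sentence start.
        for prev, nxt in zip(["<s>"] + tags, tags):
            counts[prev][nxt] += 1
    return {
        prev: {tag: c / sum(nxts.values()) for tag, c in nxts.items()}
        for prev, nxts in counts.items()
    }


def most_likely_next(transitions, tag):
    """Return the most probable tag following `tag`."""
    return max(transitions[tag], key=transitions[tag].get)


transitions = train_transitions(TAGGED_SENTENCES)
print(most_likely_next(transitions, "JJ"))
print(transitions["DT"])
```

The chain only looks one tag back, which is what "one-step" means here; it forecasts the next POS tag and says nothing about tree structure.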

# **Implementation details:**
The model chosen to generate the tree is *PURE ML*, which makes it possible to use the spaCy pipeline.
In this assignment, spaCy is used for tokenization and sentence segmentation.

But *what is spaCy?*

spaCy is a free, open-source library for advanced Natural Language Processing (NLP) in Python. It is designed specifically for production use and helps build applications that process and “understand” large volumes of text. It can be used to build information extraction or natural language understanding systems, or to pre-process text for deep learning.
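
As a rough illustration of the two spaCy steps used in this assignment (tokenization and sentence segmentation), here is a minimal sketch. It assumes a blank English pipeline with the rule-based `sentencizer` component, so that no trained model has to be downloaded; the trained pipelines loaded later in the notebook may segment text differently.

```python
import spacy

# Blank English pipeline: rule-based tokenizer only, no trained model required.
nlp = spacy.blank("en")
# Rule-based sentence splitter that cuts on sentence-final punctuation.
nlp.add_pipe("sentencizer")

doc = nlp("London is a big city. It is in the United Kingdom.")

# One list of token strings per detected sentence.
sentences = [[token.text for token in sent] for sent in doc.sents]
for tokens in sentences:
    print(tokens)
```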

Parsing is done with benepar (Berkeley Neural Parser). It is plugged into the spaCy pipeline because its parser models do not ship with a tokenizer or sentence splitter, and some models may not include a part-of-speech tagger either.

Finally, a simple way to print standard trees is to convert each parse to the NLTK tree format.
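
A minimal sketch of the NLTK tree format: a bracketed string is turned into a `Tree` object, which can then be inspected or printed. The parse string below is a hypothetical toy example written by hand, not parser output.

```python
from nltk.tree import Tree

# Hypothetical bracketed parse string, in the same style as benepar's `_.parse_string`.
parse_string = "(S (NP (DT the) (NN cat)) (VP (VBZ sleeps)))"
tree = Tree.fromstring(parse_string)

print(tree.label())   # label of the root node
print(tree.leaves())  # tokens, left to right
tree.pretty_print()   # ASCII drawing of the tree
```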


Before the code can run, the benepar parser package must be installed with this command:

In [None]:
pip install benepar

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting benepar
  Downloading benepar-0.2.0.tar.gz (33 kB)
  Preparing metadata (setup.py) ... done
Collecting torch-struct>=0.5
  Downloading torch_struct-0.5-py3-none-any.whl (34 kB)
Collecting tokenizers>=0.9.4
  Downloading tokenizers-0.13.2-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.6 MB)
Collecting transformers[tokenizers,torch]>=4.2.2
  Downloading transformers-4.26.1-py3-none-any.whl (6.3 MB)
Collecting sentencepiece>=0.1.91
  Downloading sentencepiece-0.1.97-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)

The code:

In [None]:
import nltk
import benepar, spacy
import spacy.cli
from nltk.corpus.europarl_raw import english, french, german, italian
from nltk.tokenize import sent_tokenize
from nltk.tree import Tree

from spacy.lang.en.examples import sentences as englishSenteces
from spacy.lang.fr.examples import sentences as frenchSenteces
from spacy.lang.de.examples import sentences as germanSenteces
from spacy.lang.it.examples import sentences as italianSenteces

nltk.download('punkt')
nltk.download('europarl_raw')

benepar.download('benepar_en3')
benepar.download('benepar_fr2')
benepar.download('benepar_de2')
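# No Italian model is found in the benepar index: this call fails with
# "Package 'benepar_it3' not found" (see the log below), so the Italian
# section falls back to the French model.
benepar.download('benepar_it3')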

spacy.cli.download('en_core_web_sm')
spacy.cli.download('fr_core_news_sm')
spacy.cli.download('de_core_news_sm')
spacy.cli.download('it_core_news_sm')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package europarl_raw to /root/nltk_data...
[nltk_data]   Package europarl_raw is already up-to-date!
[nltk_data] Downloading package benepar_en3 to /root/nltk_data...
[nltk_data]   Package benepar_en3 is already up-to-date!
[nltk_data] Downloading package benepar_fr2 to /root/nltk_data...
[nltk_data]   Package benepar_fr2 is already up-to-date!
[nltk_data] Downloading package benepar_de2 to /root/nltk_data...
[nltk_data]   Package benepar_de2 is already up-to-date!
[nltk_data] Error loading benepar_it3: Package 'benepar_it3' not found
[nltk_data]     in index


✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
✔ Download and installation successful
You can now load the package via spacy.load('fr_core_news_sm')
✔ Download and installation successful
You can now load the package via spacy.load('de_core_news_sm')
✔ Download and installation successful
You can now load the package via spacy.load('it_core_news_sm')


**ENGLISH + Tree** 

In [None]:
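# Load the English spaCy pipeline (tokenizer, sentence splitter, tagger)
nlp = spacy.load('en_core_web_sm')
# Append the Berkeley Neural Parser as the last pipeline component
nlp.add_pipe('benepar', config={'model': 'benepar_en3'})

# Alternative input: the first sentence of the Europarl corpus
# englishPhrase = sent_tokenize(english.raw())[0]
# englishParsedString = list(nlp(englishPhrase).sents)[0]._.parse_string

# Parse the fourth spaCy example sentence and take its bracketed parse string
englishParsedString = list(nlp(englishSenteces[3]).sents)[0]._.parse_string
# Convert the bracketed string to an NLTK tree and print it
englishTree = Tree.fromstring(englishParsedString)
englishTree.pretty_print()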

You're using a T5TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


                    S                                 
   _________________|_______________________________   
  |                 VP                              | 
  |      ___________|____________                   |  
  |     |       |                PP                 | 
  |     |       |         _______|____              |  
  NP    |       NP       |            NP            | 
  |     |    ___|___     |    ________|_______      |  
 NNP   VBZ  DT  JJ  NN   IN  DT      NNP     NNP    . 
  |     |   |   |   |    |   |        |       |     |  
London  is  a  big city  in the     United Kingdom  . 
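
The assignment asks for a tree rooted in S whose leaves each carry a single part-of-speech tag. Here is a minimal sketch of how this could be checked on the English output. It assumes that the bracketed string below, transcribed by hand from the printed tree, matches what benepar returned.

```python
from nltk.tree import Tree

# Bracketed form transcribed by hand from the tree printed above (an assumption).
english_parse = (
    "(S (NP (NNP London)) "
    "(VP (VBZ is) (NP (DT a) (JJ big) (NN city)) "
    "(PP (IN in) (NP (DT the) (NNP United) (NNP Kingdom)))) (. .))"
)
tree = Tree.fromstring(english_parse)

# (token, POS) pairs read off the preterminal nodes, left to right.
tagged = tree.pos()
print(tagged)

# Preterminals are the subtrees of height 2; each should dominate exactly one token.
preterminals = list(tree.subtrees(lambda t: t.height() == 2))
single_pos = all(len(t) == 1 and isinstance(t[0], str) for t in preterminals)
print(tree.label() == "S", single_pos)
```

The list returned by `tree.pos()` is also the flat POS tagging that the PURE ML option asks for.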



**FRENCH + Tree**

In [None]:
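# Load the French spaCy pipeline and append the French benepar model
nlp = spacy.load('fr_core_news_sm')
nlp.add_pipe('benepar', config={'model': 'benepar_fr2'})

# Alternative input: the second sentence of the Europarl corpus
# frenchPhrase = sent_tokenize(french.raw())[1]
# frenchParsedString = list(nlp(frenchPhrase).sents)[0]._.parse_string

# Parse the second spaCy example sentence and take its bracketed parse string
frenchParsedString = list(nlp(frenchSenteces[1]).sents)[0]._.parse_string
# Convert the bracketed string to an NLTK tree and print it
frenchTree = Tree.fromstring(frenchParsedString)
frenchTree.pretty_print()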

You're using a XLMRobertaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


                                          SENT                                                      
        ___________________________________|____________________________________                     
       |                   |                         NP                         |                   
       |                   |       __________________|___                       |                    
       NP                  |      |        |             PP                     PP                  
  _____|_________          |      |        |          ___|___               ____|___                 
 |     |         AP        VN     |        |         |       NP            |        NP              
 |     |         |         |      |        |         |    ___|______       |     ___|________        
DET    NC       ADJ        V     DET       NC        P  DET         NC     P   DET           NC     
 |     |         |         |      |        |         |   |          |      |    |      

**GERMAN + Tree**

In [None]:
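# Load the German spaCy pipeline and append the German benepar model
nlp = spacy.load('de_core_news_sm')
nlp.add_pipe('benepar', config={'model': 'benepar_de2'})

# Alternative input: the first sentence of the Europarl corpus
# germanPhrase = sent_tokenize(german.raw())[0]
# germanParsedString = list(nlp(germanPhrase).sents)[0]._.parse_string

# Parse the fourth spaCy example sentence and take its bracketed parse string
germanParsedString = list(nlp(germanSenteces[3]).sents)[0]._.parse_string
# Convert the bracketed string to an NLTK tree and print it
germanTree = Tree.fromstring(germanParsedString)
germanTree.pretty_print()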

You're using a XLMRobertaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


                   S                                              
        ___________|______________                                 
       |           |              NP                              
       |           |        ______|________                        
       |           |       |               PP                     
       |           |       |       ________|___________________    
       NN        VVFIN     NN    APPR     ADJA        ADJA     NN 
       |           |       |      |        |           |       |   
Bundesanwaltscha erhebt Anklage gegen mutmaßlichen Schweizer Spion
       ft                                                         



**ITALIAN + Tree**

In [None]:
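# Load the Italian spaCy pipeline
nlp = spacy.load('it_core_news_sm')
# No Italian benepar model was found in the index (see the download log above),
# so the French model is used as a fallback. The tree labels therefore follow
# the French tag set (SENT, VN, NC, ...).
nlp.add_pipe('benepar', config={'model': 'benepar_fr2'})

# Alternative input: the third sentence of the Europarl corpus
# italianPhrase = sent_tokenize(italian.raw())[2]
# italianParsedString = list(nlp(italianPhrase).sents)[0]._.parse_string

# Parse the fourth spaCy example sentence and take its bracketed parse string
italianParsedString = list(nlp(italianSenteces[3]).sents)[0]._.parse_string
# Convert the bracketed string to an NLTK tree and print it
italianTree = Tree.fromstring(italianParsedString)
italianTree.pretty_print()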

You're using a XLMRobertaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


                       SENT                           
   _____________________|__________________________    
  |     |               NP                         |  
  |     |    ___________|_________                 |   
  |     |   |    |      |         PP               |  
  |     |   |    |      |     ____|____            |   
  |     |   |    |      |    |         NP          |  
  |     |   |    |      |    |         |           |   
  NP    VN  |    |      |    |        NPP+         |  
  |     |   |    |      |    |     ____|_____      |   
 NPP    V  DET  ADJ     NC  P+D  NPP        ADJ  PONCT
  |     |   |    |      |    |    |          |     |   
Londra  è  una grande città del Regno      Unito   .  
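
The tree above is rooted in SENT rather than S, because the parser model used here labels sentences that way. If a root labelled S is required, one possible post-processing step is to rename the root. The sketch below assumes that SENT can be treated as equivalent to S and that the bracketed string, transcribed by hand from the printed tree, matches the parser output.

```python
from nltk.tree import Tree

# Bracketed form transcribed by hand from the tree printed above (an assumption).
italian_parse = (
    "(SENT (NP (NPP Londra)) (VN (V è)) "
    "(NP (DET una) (ADJ grande) (NC città) "
    "(PP (P+D del) (NP (NPP+ (NPP Regno) (ADJ Unito))))) (PONCT .))"
)
tree = Tree.fromstring(italian_parse)

# Assumption: the root label SENT plays the same role as S.
if tree.label() == "SENT":
    tree.set_label("S")

print(tree.label())
print(tree.pos())
```

Only the root is renamed; the inner labels (VN, NC, NPP and so on) are left as the parser produced them.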

