# Italian Computational Linguistics: an example case from EVALITA

Applied Linguistics (Dr. Silvia Ballarè)

6 December 2023

Ludovica Pannitto - Laboratorio Sperimentale LILEC

ludovica.pannitto@unibo.it

lilec.lab@unibo.it

![Laboratorio](../imgs/lab.png "Sito Lab")


![Laboratorio](../imgs/mission.png "Mission Lab")

## Aim of this tutorial:

* Get to know the Italian Computational Linguistics (CL) community and their initiatives
* Become familiar with the process of building a CL model
* See some example snippets of code



## Lecture outline:

1. Present the EVALITA campaign
2. Explore GxG task (from EVALITA 2018)
3. Load and manipulate data
4. Outline the main steps to reproduce one of the systems that participated in the competition

# EVALITA


[Evalita](https://www.evalita.it/) is a periodic evaluation campaign of Natural Language Processing (NLP) and speech tools for the Italian language.

EVALITA is an initiative of the [Italian Association for Computational Linguistics (AILC)](https://www.ai-lc.it/en/) and it is endorsed by the [Italian Association for Artificial Intelligence (AI\*IA)](https://aixia.it/en/) and the [Italian Association for Speech Sciences (AISV)](https://www.aisv.it/).


![EVALITA](../imgs/EVALITA.png "EVALITA")


## What does it mean?

Every two years (approx.), the community decides on some _tasks_ that are considered interesting or challenging.

Researchers then develop models to tackle the proposed _shared task_ and compete. A final analysis that highlights the strengths of the different submitted models hopefully allows for a broader discussion on the task itself.



## Why is it important?

The general objective of EVALITA is to **promote the development of language and speech technologies** for the Italian language, providing a **shared framework** where different systems and approaches can be evaluated in a consistent manner.


The diffusion of shared tasks and shared evaluation practices is a crucial step towards the development of resources and tools for NLP and speech sciences.


As a side effect of the evaluation campaign, both training and test data are available to the scientific community as benchmarks for future improvements.

## Example tasks from past years:

| year | tasks list                         |
| ---  | ---                                |
| 2007 | Part of Speech Tagging             |
|      | Named Entity Recognition           |
| 2011 | Anaphora Resolution                |
|      | Frame Labeling over Italian Texts  |
| 2014 | Sentiment Polarity Classification  |
|      | Dependency Parsing                 |
| 2020 | Stance Detection                   |
|      | Multimodal Artefacts Recognition   |
|      | Ghigliottin-AI                     |


... to be continued!


# GxG (from Evalita 2018)

- [**Gender-X-Genre (GxG)**](https://sites.google.com/view/gxg2018) is a task on **author profiling** (in terms of gender) on Italian texts, with a specific focus on **cross-genre** performance.

- The [published paper](https://ceur-ws.org/Vol-2263/paper006.pdf)[<sup>1</sup>](#fn1) presents the task as a type of author profiling:
> **Author profiling is the task of automatically discovering latent user attributes from text**. Gender, which we focus on in this paper, and which is traditionally characterised as a binary feature, is one of such attributes.


---

<sub><span id="fn1">1: Dell’Orletta, Felice, and Malvina Nissim. "Overview of the evalita 2018 cross-genre gender prediction (gxg) task." EVALITA Evaluation of NLP and Speech Tools for Italian 12.1 (2018): 35.</span></sub>

## Motivation:

> ...we have not yet found the actual dataset-independent features that do indeed capture the way females and males might write differently. (And might let us wonder if this is a valid assumption at all.)


> if we can make gender prediction stable across very different genres, then we are more likely to have captured **deeper gender-specific traits** rather than dataset characteristics. As a by product, this task will yield a variety of models for gender prediction in Italian, also shedding light on **which genres favour or discourage in a way gender expression**, by looking at whether they are easier or harder to model.

## Computational modeling: 

> Given a (collection of) text(s) from a specific genre, the gender of the author has to be predicted.


> The task is cast as a **binary classification task**, with gender represented as F (female) or M (male). 


> Evaluation settings were designed bearing in mind the question at the core of this task: are there indicative traits across genres that can be leveraged to model gender in a rather genre-independent way?



## Which genres?

Considered genres:
- tweets
- YouTube comments (from manually selected videos on a few general topics: travel, music, documentaries, politics)
- children's writing ([essays](https://aclanthology.org/L16-1014/) written by Italian L1 learners collected during the first and second year of lower secondary school[<sup>2</sup>](#fn2))
- journalism (single-authored newspaper articles from _La Repubblica_ and _Corriere della Sera_)
- personal diaries (freely available as part of the [_Fondazione Archivio Diaristico Nazionale della Città di Pieve Santo Stefano_](http://archiviodiari.org/index.php/iniziative-e-progetti/brani-di-dirai.html))

---

<sub><span id="fn2">2: Alessia Barbagli, Pietro Lucisano, Felice Dell’Orletta, Simonetta Montemagni, and Giulia Venturi. 2016. CItA: an L1 Italian learners corpus to study the development of writing competence. In _Proceedings of the 10th Conference on Language Resources and Evaluation (LREC 2016)_</span></sub>

## The competition:

> In the cross-genre setting[<sup>3</sup>](#fn3), the only constraint is not using in training any single instance from the genre they are testing on. Other than that, participants were free to combine the other datasets as they wished.



> Participants were also free to use external resources, provided the cross-genre settings were carefully preserved, and everything used was described in detail in their final report.


---

<sub><span id="fn3">3: participants were also encouraged to submit a same-genre model.</span></sub>

## Evaluation Measures:

> average **accuracy** for the two classes, i.e. F and M. 
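
As a minimal sketch of this measure, accuracy is the fraction of documents whose predicted label matches the gold label. The function name `accuracy` and the toy labels below are illustrative assumptions, not the official evaluation script.

```python
def accuracy(gold, predicted):
    """Fraction of items whose predicted label equals the gold label."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)


# Toy labels, invented for illustration only
gold = ["F", "M", "F", "M"]
predicted = ["F", "M", "M", "M"]

print(accuracy(gold, predicted))
print(accuracy(gold, ["M"] * len(gold)))
```

On a gender-balanced set such as the toy one above, a system that always answers `M` gets half of the documents right.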




## Baselines:


> For all settings, given that the datasets are balanced for gender distribution, through random assignment we will have 50% accuracy

# (Partial) Results

(Partial table of) results in terms of Accuracy of the Cross-Genre task


| Model Name           | CH    | DI    | JO    | TW    | YT    |
| ----------           | ---   | ---   | ---   | ---   | ---   |
| CapetownMilanoTirana | 0.535 | 0.635 | 0.515 | 0.555 | 0.503 |
| ItaliaNLP - SVM      | 0.540 | 0.514 | 0.505 | 0.586 | 0.513 |
| ItaliaNLP - STL      | 0.640 | 0.554 | 0.495 | 0.609 | 0.510 |


## Participants 

> **[CapetownMilanoTirana](https://ceur-ws.org/Vol-2263/paper028.pdf)**[<sup>4</sup>](#fn4):  classifier based on Support Vector Machine (SVM) as learning algorithm. They tested different n-gram features extracted at the word level as well as at the character level. In addition, they experimented feature abstraction transforming each word into a list of symbols and computing the length of the obtained word and its frequency.


> **[ItaliaNLP](https://ceur-ws.org/Vol-2263/paper013.pdf)**[<sup>5</sup>](#fn5) tested three different classification models: one based on linear SVM, and two based on Bi-directional Long Short Term Memory (Bi-LSTM). The two deep neural network architectures use 2-layers of Bi-LSTM. The first Bi-LSTM layer encodes each sentence as a token sequence, the second layer encodes the sentence sequence.

---

<sub><span id="fn4">4: Angelo Basile, Gareth Dwyer, and Chiara Rubagotti. 2018. _CapetownMilanoTirana_ for GxG at Evalita2018. Simple n-gram based models perform well for gender prediction. Sometimes. In Tommaso Caselli, Nicole Novielli, Viviana Patti, and Paolo Rosso, editors, _Proceedings of Sixth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2018)_, Turin, Italy.</span></sub>

<sub><span id="fn5">5: Andrea Cimino, Lorenzo De Mattei, and Felice Dell’Orletta. 2018. Multi-task Learning in Deep Neural Networks at EVALITA 2018. In Tommaso Caselli, Nicole Novielli, Viviana Patti, and Paolo Rosso, editors, _Proceedings of Sixth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2018)_, Turin, Italy.</span>
</sub>

# Data

Organizers provided training data in pseudo-XML format, distributed across genres as follows:


| Dataset   | F    | M     | Tokens |
| ---       | ---  | ---   | ---    |
| Children  | 100  | 100   | 65986  |
| Diaries   | 100  | 100   | 82989  |
|Journalism | 100  | 100   | 113437 |
| Twitter   | 3000 | 3000  | 101534 |
| Youtube   | 2200 | 2200  | 90639  | 

## Examples from data:

```
<doc id="3140" genre="youtube" gender="M">
Ti sei spiegata benissimo complimenti
</doc>
<doc id="5744" genre="youtube" gender="M">
Salvini e' veramente uno schifoso. Nonostante tutto, un sacco di POLLI lo hanno votato.
</doc>
<doc id="9766" genre="youtube" gender="F">
favi fai un video dove assaggiate cibi strani del Giappone! dai zio che esce una figata pazzesca! spolliciate per farglielo leggere
</doc>
```

## Training - Test - Gold

Along with the training set (shown in previous slide), organizers also provide:

* Test set: same as training, but the attribute `gender` is unknown  
  ```
  <doc id="27618" genre="twitter" gender="?">
    @zioburp pregava gli scossoni del treno non le facessero scappare le maglie.
  </doc>
  ```


* Gold labels: the files, released after the competition, contain the _solution_ to test set items
    ```
    2757	M
    2758	M
    2759	F
    2760	M
    2761	M
    2762	M
    ```
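
A minimal sketch of how these files could be loaded, assuming every document follows exactly the `<doc id=... genre=... gender=...>` layout shown above and that gold files contain one `id label` pair per line. The function names are illustrative; a regular expression is used instead of an XML parser because the format is only pseudo-XML.

```python
import re

# Assumed layout: one <doc ...> element per document, attributes in this order
DOC_PATTERN = re.compile(
    r'<doc id="(?P<id>\d+)" genre="(?P<genre>\w+)" gender="(?P<gender>[FM?])">'
    r"\s*(?P<text>.*?)\s*</doc>",
    re.DOTALL,
)


def parse_documents(raw):
    """Return one dict (id, genre, gender, text) per <doc> element."""
    return [match.groupdict() for match in DOC_PATTERN.finditer(raw)]


def parse_gold(raw):
    """Map each document id to its gold label (one 'id label' pair per line)."""
    gold = {}
    for line in raw.splitlines():
        if line.strip():
            doc_id, label = line.split()
            gold[doc_id] = label
    return gold


sample = '''<doc id="3140" genre="youtube" gender="M">
Ti sei spiegata benissimo complimenti
</doc>
<doc id="27618" genre="twitter" gender="?">
  @zioburp pregava gli scossoni del treno non le facessero scappare le maglie.
</doc>'''

docs = parse_documents(sample)
print(docs[0]["gender"], docs[0]["text"])
print(parse_gold("2757\tM\n2759\tF"))
```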

# The pipeline

Both participating teams follow pretty much the same approach:


> Each document is represented as a vector (list of features). Its label (M or F) is represented as a binary value (i.e., 0 for M and 1 for F).


> A classifier (see Support Vector Machine slide) is trained with cross-validation on the training set.



> Test documents are then represented with the same set of features and the model is used to assign a class (0 or 1) to each document.



> Accuracy is calculated on test set
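
The four steps above can be sketched end to end with `scikit-learn`. This is only an illustration under strong assumptions: the documents are invented placeholder strings, the features are plain word counts, and the classifier is a linear SVM with default settings; none of this reproduces the participants' actual configurations.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Invented placeholder documents: class 0 uses "alfa/beta", class 1 uses "gamma/delta"
train_texts = [
    "alfa beta alfa", "beta alfa", "alfa alfa beta", "beta beta alfa",
    "gamma delta", "delta gamma gamma", "gamma gamma delta", "delta delta gamma",
]
train_labels = [0, 0, 0, 0, 1, 1, 1, 1]
test_texts = ["alfa beta", "delta gamma"]
test_labels = [0, 1]

# Represent each document as a vector of word counts, then train a linear SVM
model = make_pipeline(CountVectorizer(), LinearSVC())

# Cross-validation on the training set only
cv_scores = cross_val_score(model, train_texts, train_labels, cv=2)
print("Cross-validation accuracy per fold:", cv_scores)

# Fit on all training data, then assign a class to the unseen test documents
model.fit(train_texts, train_labels)
predictions = model.predict(test_texts)
print("Predicted classes:", predictions)

# Accuracy on the test set
print("Test accuracy:", model.score(test_texts, test_labels))
```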

## The classifier: Support Vector Machine

![Support Vector Machine](../imgs/SVM.png "Support Vector Machine")


## Steps:

1. Process data
2. Represent data in a suitable format
3. Build classifier
4. Classify new (`test`) data

# Step 1: Data Pre-processing

Before performing any linguistic annotation, data needs to be checked and _cleaned_. More specifically, concerning our data, we see:

- mentions
- links + emails
- hashtags


## How?
A common method is employing **regular expressions**: we need to look at training data and come up with some reasonable transformation. We will then apply the same transformation to test data.


We could do more: lowercase words that are entirely upper-cased, remove emoticons, normalize cases of letter repetitions, etc.
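
Two of these extra normalizations can be sketched with Python's `re` module. The function names and the choice of keeping at most two repeated characters are assumptions made for illustration.

```python
import re


def lowercase_shouting(text):
    """Lowercase words of two or more characters written entirely in upper case."""
    def fix(match):
        word = match.group(0)
        return word.lower() if word.isupper() else word
    return re.sub(r"\b\w{2,}\b", fix, text)


def squeeze_repetitions(text, max_repeat=2):
    """Reduce runs of three or more identical word characters to max_repeat."""
    return re.sub(r"(\w)\1{2,}", lambda m: m.group(1) * max_repeat, text)


print(lowercase_shouting("un sacco di POLLI lo hanno votato"))
print(squeeze_repetitions("Ahahah grande lassss"))
```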

## Preliminary step (1): mentions `@`

`Mi è piaciuto un video di @YouTube da @carodexter`

`@disinformatico comincia a sembrare Internet delle Co@@ionate... :-D`

`@Reggiace neppure ad essi era concesso sbirciare le grazie di Sua Maestà Carolina d'Asburgo @CasertaReggia @CarolCarditello`

## Preliminary step (2): links and emails

`Ahahah grande lassss — Essì https:// l.ask.fm/igoto/45DKECPW 7B667HQMHN2IG6NM56EDDIA3SGTODHVVBSS2J7FI2AT5AJKQFZEM6QRTKG3PYFAEIFXTKFAU34N7A55YPPSKNTBAUYOADFKYCASCGMX5GWID6KPLQSVKBSDZSKQY2H2QZTGX7K4GJSNBWTLK67BSL4ET2LHV77K3QK236UUDLFOUVQHX5FR2P3XGTIUKEOQF7U====== …`

`Riassumila facendo riferimento a questo documento che è talmente autorevole da essere stato censurato dalla tv italiana: https://www.youtube.com/watch?v=6ZUdDj8rv4E`

`più info sul sito ufficiale www.newbikeproducts.com/it`

`ciao Gabri,ho visto il tuo filmato torino islanda,sono anch' io una ciclista di 68 anni,certo non faccio quello che fai tu mi sono divertita molto vederdi
ciao.lucia.foieni@gmail.com`

`Qui le Nostre Offerte di lavoro. diego.lugato@swegon.it TRE INTERESSANTI POSIZIONI APERTE https:// lnkd.in/datkNAk`



## Preliminary step (3): hashtags

`Chissà come farà #Salvini a stare nella solita coalizione con chi era al governo insieme alla fornero. #Passera `

`@alessiarotta tu non sei degna nemmeno di nominarlo buffona #mangiapaneatradimento `

`#heartbeatoftheday cosa succederebbe se invece di boss avessimo coach? #leadership #lovemarketing #coaching pic.twitter.com/MLE3R2syep `     

`#Ventura :"Ho escluso #Baselli e #Zappacosta perché devono capire quale strada seguire per diventare protagonisti in SerieA". Dalla panchina? `

## Proposal (from easier to harder) - Hashtags:

- replace with generic `_HASHTAG_` token
- some of them are useful for understanding the sentence, so we could just remove the `#` symbol
- could we keep track of how many substitutions we're performing?

### `#([^ ]+)` -> `$1 `

- `#` matches the character `#` literally (character code 35 in decimal, 23 in hexadecimal, 43 in octal; case sensitive)
- 1st Capturing Group `([^ ]+)`
    - Match a single character not present in the list below `[^ ]`
    - `+` matches the previous token between one and unlimited times, as many times as possible
    - ` ` matches the character [space]
- ` ` matches the character [space]
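
In Python, the same idea could look like the sketch below. Note that Python's `re` module writes the captured group as `\1` rather than `$1`; `re.subn` also returns the number of substitutions, which addresses the question of keeping track of them. The function name is illustrative, and the pattern here is used without a trailing space so that hashtags at the end of a text are matched too.

```python
import re

HASHTAG = re.compile(r"#([^ ]+)")


def strip_hashtag_symbols(text):
    """Remove the '#' symbol, keep the tag text, and count the substitutions."""
    return HASHTAG.subn(r"\1", text)


tweet = ("#heartbeatoftheday cosa succederebbe se invece di boss avessimo coach? "
         "#leadership #lovemarketing #coaching")

cleaned, n_substitutions = strip_hashtag_symbols(tweet)
print(cleaned)
print(n_substitutions)

# Easier alternative: replace every hashtag with a generic token
print(HASHTAG.sub("_HASHTAG_", tweet))
```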

## Proposal (from easier to harder) - Mentions:  



- every @something is transformed into `user` 
- some represent known cases (e.g., `@YouTube`, mentions of newspapers etc.). Do we want to keep those?
- can we differentiate based on the context?

### `@YouTube` -> `YouTube`, `@[^ ]+ ` -> `user `


- `@` matches the character `@` 
- Match a single character not present in the list below `[^ ]`
    - `+` matches the previous token between one and unlimited times, as many times as possible
    - ` ` matches the character [space]
- ` ` matches the character [space]
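
A possible Python version is sketched below. It departs slightly from the pattern above: a lookbehind (an assumption added here) requires that `@` does not follow a word character or another `@`, so that email addresses and words like `Co@@ionate` are left alone. `KNOWN_ACCOUNTS` is a hypothetical whitelist.

```python
import re

# Hypothetical whitelist of accounts whose name is kept in the text
KNOWN_ACCOUNTS = {"YouTube"}

# '@' must not follow a word character or another '@'
MENTION = re.compile(r"(?<![\w@])@(\w+)")


def replace_mentions(text):
    """Keep the name of known accounts, replace every other mention with 'user'."""
    def substitute(match):
        name = match.group(1)
        return name if name in KNOWN_ACCOUNTS else "user"
    return MENTION.sub(substitute, text)


print(replace_mentions("Mi è piaciuto un video di @YouTube da @carodexter"))
print(replace_mentions("@disinformatico comincia a sembrare Internet delle Co@@ionate... :-D"))
```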

## Proposal (from easier to harder) - Links and emails:        

- emails are easily searchable and can be transformed into `this address` for instance
- links are hard to find: try our best with a complex regular expression and substitute them with `link`

### `\b[A-z0-9\.]+@[A-z0-9\.]+\.[a-z]+\b` -> `this address`

- `\b` assert position at a word boundary
- Match a single character present in the list below `[A-z0-9\.]`
    - `+` matches the previous token between one and unlimited times, as many times as possible
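    - `A-z` matches a single character in the range between A and z (this ASCII range also includes `[`, `\`, `]`, `^`, `_` and the backtick; `A-Za-z` restricts the match to letters)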
    - `0-9` matches a single character in the range between 0 and 9
    - `\.` matches the character `.` 
- `@` matches the character `@` 
- Match a single character present in the list below `[A-z0-9\.]`
    - `+` matches the previous token between one and unlimited times, as many times as possible
    - `A-z` matches a single character in the range between A and z
    - `0-9` matches a single character in the range between 0 and 9
    - `\.` matches the character `.` 
- `\.` matches the character `.`
- Match a single character present in the list below `[a-z]`
    - `+` matches the previous token between one and unlimited times, as many times as possible
    - `a-z` matches a single character in the range between a and z 
- `\b` assert position at a word boundary
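
The email substitution could be sketched in Python as follows. The sketch swaps `A-z` for the explicit ranges `A-Za-z`, which is a deliberate change from the pattern above; the function name and placeholder argument are illustrative.

```python
import re

# Same structure as the pattern above, with explicit letter ranges
EMAIL = re.compile(r"\b[A-Za-z0-9.]+@[A-Za-z0-9.]+\.[a-z]+\b")


def replace_emails(text, placeholder="this address"):
    """Replace every email-like string with a fixed placeholder."""
    return EMAIL.sub(placeholder, text)


print(replace_emails("Qui le Nostre Offerte di lavoro. diego.lugato@swegon.it TRE INTERESSANTI POSIZIONI APERTE"))

# `A-z` also covers the ASCII characters that sit between 'Z' and 'a'
print(bool(re.fullmatch(r"[A-z]", "_")))
print(bool(re.fullmatch(r"[A-Za-z]", "_")))
```

Because `.` is allowed before the `@`, a string such as `ciao.lucia.foieni@gmail.com` is replaced as a whole, greeting included.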
   

### `\b([A-z0-9]+[\/\.][A-z0-9]+[\/\.]?)+[A-z0-9]*\b` -> `link`
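
As a simpler starting point than the expression above, the sketch below only looks for strings that begin with `http://`, `https://` or `www.`. This pattern is an assumption introduced for illustration, not the expression proposed in this section, and it misses cases such as `pic.twitter.com/MLE3R2syep` or links broken by a space (`https:// lnkd.in/datkNAk`).

```python
import re

# Simplified, assumed pattern: anything starting with http://, https:// or www.
LINK = re.compile(r"(?:https?://|www\.)\S+")


def replace_links(text, placeholder="link"):
    """Replace every link-like string with a fixed placeholder."""
    return LINK.sub(placeholder, text)


print(replace_links("più info sul sito ufficiale www.newbikeproducts.com/it"))
print(replace_links("censurato dalla tv italiana: https://www.youtube.com/watch?v=6ZUdDj8rv4E"))
```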

# Step 2: Linguistic Pipeline

The text provided by organizers is raw. If we want to extract linguistic information from it, we might need some annotation on top of the raw text.
This can be done by hand by human annotators, or with automatic tools.

We will take a look at some automatic tools: specifically [`spaCy`](https://spacy.io/), a Python library for natural language processing.

`spaCy` also provides a model for Italian trained on written news and media text.


But **first things first**: anyone familiar with Python?

In [None]:
# Python 101

## There are three fundamental types of entities: numbers, strings and booleans

print(3+7)

print("Hello world")

print(True)

In [None]:
## We can assign values to variables

n = 13
name = "Ludovica"
is_italian = True

## and perform operations on them

print(n+3)
print(name[0]+".")

In [None]:
## Things can get more complicated

if is_italian:
    print("Ciao", name, ", a casa tutto bene?")
else:
    print("Hey there", name, ", where are you from?")

## Using `spaCy`

Let's turn to linguistic processing now.

Let's see how to:
1. load a model in spacy
2. process text
   - split text into sentences
   - lemma and Part-of-Speech tagging
   - morphological analysis
   - dependency parsing 

In [None]:
text = "Sei anni dopo una riforma che fu definita epocale, "\
        "la scuola superiore cambia volto. Crolla il liceo "\
        "classico e cambia pelle lo scientifico, che diventa sempre più light."

print(text)

In [None]:
# Import SpaCy and parse text

import spacy

nlp_pipeline = spacy.load("it_core_news_sm")

parsed_text = nlp_pipeline(text)

print(parsed_text)

In [None]:
# Print sentences one by one

sentences = list(parsed_text.sents)

for sentence in sentences:
    print("SENTENCE:", sentence)

In [None]:
# Print tokens

i = 0

for sentence in sentences:
    print("SENTENCE N.", i)
    i = i+1
    
    for token in sentence:
        print(token,"\t", token.lemma_,"\t", token.pos_)

    print()

In [None]:
# Print morphological analysis

# enumerate restarts the sentence counter from 0 in this cell
for i, sentence in enumerate(sentences):
    print("SENTENCE N.", i)

    for token in sentence:
        print(token, "\t", token.lemma_, "\t", token.morph)

    print()


In [None]:
# Print dependency parsing

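# enumerate restarts the sentence counter from 0 in this cell
for i, sentence in enumerate(sentences):
    print("SENTENCE N.", i)

    for token in sentence:
        # token.i and token.head.i are positions within the whole text
        print(token.i, "\t", token, "\t", token.head.i, "\t", token.dep_)

    print()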


In [None]:
# Save syntactic trees to a file

from spacy import displacy
from pathlib import Path

# render each sentence separately and save it as an SVG file
for i, sent in enumerate(sentences):
    svg = displacy.render(sent, style="dep", jupyter=False)

    filepath = Path(f"dependency_plots_{i}.svg")
    # write_text opens, writes and closes the file
    filepath.write_text(svg, encoding="utf-8")

# Step 3: Represent Documents

We now have both training and test data in a new format:

```
### 1	journalism	M
0	E	e	CCONJ		cc	4
1	i	il	DET	Definite=Def|Gender=Masc|Number=Plur|PronType=Art	det	2
2	giovani	giovane	NOUN	Number=Plur	nsubj	4
3	italiani	italiano	ADJ	Gender=Masc|Number=Plur	amod	2
4	vivono	vivere	VERB	Mood=Ind|Number=Plur|Person=3|Tense=Pres|VerbForm=Fin	ROOT	4
5	ancora	ancora	ADV		advmod	4
6	con	con	ADP		case	8
7	i	il	DET	Definite=Def|Gender=Masc|Number=Plur|PronType=Art	det	8
8	genitori	genitore	NOUN	Gender=Masc|Number=Plur	obl	4
9	.	.	PUNCT		punct	4
```
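
A minimal sketch of a reader for this format, assuming each document starts with a `###` header (id, genre, gender) followed by one tab-separated line per token with seven fields, as in the example above. The function name and dictionary keys are illustrative.

```python
def read_parsed_documents(lines):
    """Group token lines under the '### id genre gender' header that precedes them."""
    documents = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("###"):
            _, doc_id, genre, gender = line.split()
            documents.append({"id": doc_id, "genre": genre, "gender": gender, "tokens": []})
        elif line.strip() and documents:
            # the morphology field may be empty, so split on tabs only
            index, form, lemma, pos, morph, dep, head = line.split("\t")
            documents[-1]["tokens"].append(
                {"index": int(index), "form": form, "lemma": lemma, "pos": pos,
                 "morph": morph, "dep": dep, "head": int(head)}
            )
    return documents


sample = [
    "### 1\tjournalism\tM",
    "0\tE\te\tCCONJ\t\tcc\t4",
    "1\ti\til\tDET\tDefinite=Def|Gender=Masc|Number=Plur|PronType=Art\tdet\t2",
    "2\tgiovani\tgiovane\tNOUN\tNumber=Plur\tnsubj\t4",
]

parsed = read_parsed_documents(sample)
print(parsed[0]["genre"], parsed[0]["gender"])
print([token["lemma"] for token in parsed[0]["tokens"]])
```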

## Types of features

- **ItaliaNLP model**:
    * Raw and Lexical Text Features <sub>(number of tokens, character n-grams, word n-grams, lemma n-grams, repetition of n-grams chars, number of mentions, number of hashtags, punctuation.)</sub>
    * Morpho-syntactic Features <sub>(coarse grained Part-Of-Speech n-grams, Fine grained Part-Of-Speech n-grams, Coarse grained Part-Of-Speech distribution)</sub>
    * Lexicon features <sub>(Emoticons Presence, Lemma sentiment polarity n-grams, Polarity modifier, PMI score, sentiment polarity distribution, Most frequent sentiment polarity, Sentiment polarity in text sections, Word embeddings combination.)</sub>


- **CapetownMilanoTirana model**:
    * n-grams extracted at the word level as well as at the character level (3-10 n-grams and binary TF-IDF)
    * experiment with feature abstraction following the [bleaching approach](https://arxiv.org/pdf/1805.03122.pdf)[<sup>6</sup>](#fn6)
 
---
<sub><span id="fn6">6: Rob van der Goot, Nikola Ljubesic, Ian Matroos, Malvina Nissim, and Barbara Plank. 2018. Bleaching text: Abstract features for cross-lingual gender prediction. In _Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)_</span>
</sub>

## Bleaching process

|   | SHAPE | FREQ | LEN | ALPHA | 
|---|---    | ---  | --- | ---   |
| Questo  | Cvvccv | 46 | 06 | True |
| è | v | 650 | 01 | True |
| solo | cvcv | 116 | 04 | True |
| un | vc | 1 | 02 | True |
| esempio | vcvccvv | 1 | 07 | True |
| . | . | 60 | 01 | False |
| 😃 | 😃 | 0 | 1 | False |
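
The columns of the table can be approximated with a few lines of Python. This is a sketch under assumptions: the vowel inventory is a hand-picked set, the frequency is looked up in a hypothetical `frequencies` dictionary, and details may differ from the original bleaching implementation.

```python
# Assumed vowel inventory, including common Italian accented vowels
VOWELS = set("aeiouàèéìíòóùú")


def shape(token):
    """Map vowels to 'v' and consonants to 'c' (upper-cased for capital letters)."""
    symbols = []
    for char in token:
        if char.isalpha():
            symbol = "v" if char.lower() in VOWELS else "c"
            symbols.append(symbol.upper() if char.isupper() else symbol)
        else:
            # punctuation, digits and emoji are kept as they are
            symbols.append(char)
    return "".join(symbols)


def bleach(token, frequencies):
    """Return the abstract representation of a token."""
    return {
        "shape": shape(token),
        "freq": frequencies.get(token, 0),  # 0 for tokens never seen before
        "len": f"{len(token):02d}",
        "alpha": token.isalpha(),
    }


# Hypothetical frequency counts, copied from the table for illustration
frequencies = {"Questo": 46, "solo": 116}

for token in ["Questo", "solo", "."]:
    print(token, bleach(token, frequencies))
```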


Different features:
- some are just "scores" assigned to each document (e.g., `number of tokens`, `emoticons presence`)
  * `Questo è solo un esempio . 😃` -> `7`
- others transform the document into a series of elements (e.g., `freq`, `alpha`)
  * `Questo è solo un esempio . 😃` -> `[1,1,1,1,1,0,0]`
- others need a "vocabulary" (e.g., `ngrams`)
  * `Questo è solo un esempio . 😃` -> `[1, 1, 0, 0, 0, 1, 1, 1, 0, ...]` where positions in this list correspond to the presence of a specific n-gram
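
The vocabulary-based representation can be sketched in plain Python. The helper names and the two toy training documents are invented for illustration; the sketch uses word bigrams, while the participating systems used a wider range of n-grams.

```python
def word_ngrams(tokens, n):
    """Return all contiguous sequences of n tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def build_vocabulary(documents, n):
    """Assign a fixed position to every n-gram seen in the training documents."""
    ngrams = sorted({gram for doc in documents for gram in word_ngrams(doc, n)})
    return {gram: position for position, gram in enumerate(ngrams)}


def presence_vector(tokens, vocabulary, n):
    """1 at the position of each known n-gram found in the document, 0 elsewhere."""
    vector = [0] * len(vocabulary)
    for gram in word_ngrams(tokens, n):
        if gram in vocabulary:
            vector[vocabulary[gram]] = 1
    return vector


# Toy training documents, invented for illustration
training = [
    ["Questo", "è", "solo", "un", "esempio"],
    ["Questo", "è", "un", "altro", "esempio"],
]
vocabulary = build_vocabulary(training, n=2)

new_document = ["Questo", "è", "solo", "un", "test"]
vector = presence_vector(new_document, vocabulary, n=2)
print(vector)
```

N-grams that never occurred in training (here `un test`) have no position in the vocabulary and are simply ignored.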

In [None]:
text = []
with open("example_tweet.conll", encoding="utf-8") as fin:
    fin.readline()  # skip the first line of the file
    for line in fin:
        if len(line) > 1:  # skip empty lines
            text.append(line.strip().split("\t"))

for line in text:
    print(line)

In [None]:
## number of tokens
number_of_tokens = len(text)

## emoticons presence
emoticons_list = [":-)", ":)", ":(", ":D"]
emoticons_presence = False
for token in text:
    lemma = token[2]
    if lemma in emoticons_list:
        emoticons_presence = True

In [None]:
## Alpha

repr_alpha = []
for token in text:
    lemma = token[2]
    if all (c.isalpha() for c in lemma):
        repr_alpha.append(1)
    else:
        repr_alpha.append(0)

print(repr_alpha)

In [None]:
# Compute vocabulary and token frequencies

vocabulary = {}

with open("TransformedData/training_parsed.txt", encoding="utf-8") as fin:

    for line in fin:
        if not line.startswith("###"):
            line = line.strip().split("\t")
            if len(line) > 1:
                lemma = line[2]
                # count every occurrence, including the first one
                vocabulary[lemma] = vocabulary.get(lemma, 0) + 1
            

In [None]:
sorted_vocabulary = sorted(vocabulary.items(), key = lambda x: -x[1])

print(sorted_vocabulary[:10])

In [None]:
# Frequencies

frequencies_list = []
for token in text:
    lemma = token[2]
    # lemmas never seen in the training data get frequency 0
    frequencies_list.append(vocabulary.get(lemma, 0))

print(frequencies_list)

In [None]:
# Bag of words: map each lemma of the document to its position in the vocabulary

vocabulary_list = dict(zip(vocabulary.keys(), range(len(vocabulary))))

bag_of_words = []
for token in text:
    lemma = token[2]
    if lemma in vocabulary_list:
        bag_of_words.append(vocabulary_list[lemma])

print(bag_of_words)

# Step 4: Build Classifier

Building a (basic) Support Vector Machine Classifier is extremely easy in Python.


Let's take a quick look at the [`scikit-learn`](https://scikit-learn.org) library and the [SVM](https://scikit-learn.org/stable/modules/svm.html) module implemented there.

All we need is a set of training data, associated with a set of `0/1` labels representing their class.

In [None]:
from sklearn import svm

training_data = [[0, 0], [1, 1]]
labels = [0, 1]
clf = svm.SVC()
clf.fit(training_data, labels)
print(clf)

In [None]:
new_datapoint = [2, 4]

print(clf.predict([new_datapoint]))

In [None]:
new_datapoint = [-2, -1]
print(clf.predict([new_datapoint]))

In [None]:
import matplotlib.pyplot as plt

from sklearn import svm
from sklearn.datasets import make_blobs
from sklearn.inspection import DecisionBoundaryDisplay

# we create 10 separable points
X, y = make_blobs(n_samples=10, centers=2, random_state=6)

for point, label in zip(X, y):
    print("POINT:", point)
    print("Label:", label)
    print()


In [None]:
plt.scatter(X[:, 0], X[:, 1], c=y, s=30, cmap=plt.cm.Paired)
plt.show()

In [None]:
# fit a linear SVM (default regularization strength, C=1.0)
clf = svm.SVC(kernel="linear")
clf.fit(X, y)

In [None]:
plt.scatter(X[:, 0], X[:, 1], c=y, s=30, cmap=plt.cm.Paired)

# plot the decision function
ax = plt.gca()
DecisionBoundaryDisplay.from_estimator(
    clf,
    X,
    plot_method="contour",
    colors="k",
    levels=[-1, 0, 1],
    alpha=0.5,
    linestyles=["--", "-", "--"],
    ax=ax,
)
# plot support vectors
ax.scatter(
    clf.support_vectors_[:, 0],
    clf.support_vectors_[:, 1],
    s=100,
    linewidth=1,
    facecolors="none",
    edgecolors="k",
)
plt.show()