## Goal of the presentation

- Reconstruct the research pipeline of Scivetti & Schneider
- Read through their code and identify the key steps of the pipeline
- Reproduce the results
- Identify the critical points with a view to integrating Italian data

## The publication

[Link](https://aclanthology.org/2025.conll-1.24/)

- Paper + Software

- Paper: 
  - Introduction
  - The NPN Construction
  - **Dataset**
    - **Corpus Gathering and Cleaning**
    - Near Minimal Pairs
    - **Train/Test Split**
  - **Experiment 1: Constructions vs. Distractors**
    - **Methodology**
    - Results
  - **Experiment 2: Perturbing Word Order**
    - Results
    - Analysis
  - **Experiment 3: Semantic Disambiguation**
    - NtoN Subtypes
    - **Methodology**
    - Results
  - Related Work
  - Conclusion
  - Limitations

## The pipeline

Create a dataset **>** 
Annotate the dataset **>** 
Extract contextual vectors from BERT **>** 
Train a classifier on one portion of the dataset **>** 
Test the classifier on a different portion of the data **>**
Analyze the errors

### What do we expect?

- Dataset
  - In what format?
  - What information?
- Vectors
  - What type of file?
  - How many vectors?
- Classifier
  - Input? Output?
- Predictions
  - In what format?
  - How do we evaluate them?

## 1. Dataset

### (3.1) Corpus Gathering and Cleaning

> First, we use a simple **pattern matching** query to extract instances of the sequence 
Noun + “to” + Noun from COCA. We extract the examples from the corpus in a **fixed window of 
+/- 50 tokens from the construction**
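
The extraction step quoted above could be sketched as follows. This is a minimal illustration, not the authors' code: it assumes the corpus is already available as `(word, POS)` pairs with a `NOUN` tag, and the names `find_nton` and `same_noun` are hypothetical. Whether the two nouns must be identical is not stated in the quoted passage, so it is left as an option.

```python
from typing import Iterator, List, Tuple

Token = Tuple[str, str]  # (word form, coarse POS tag); the tag names are assumed


def find_nton(tokens: List[Token], window: int = 50,
              same_noun: bool = False) -> Iterator[Tuple[int, List[str]]]:
    """Yield (index of "to", context words) for each Noun + "to" + Noun match."""
    for i in range(1, len(tokens) - 1):
        word, _ = tokens[i]
        left_word, left_pos = tokens[i - 1]
        right_word, right_pos = tokens[i + 1]
        if word.lower() != "to":
            continue
        if left_pos != "NOUN" or right_pos != "NOUN":
            continue
        if same_noun and left_word.lower() != right_word.lower():
            continue
        # Fixed window of +/- `window` tokens around the three-token match.
        start = max(0, i - 1 - window)
        end = min(len(tokens), i + 2 + window)
        yield i, [w for w, _ in tokens[start:end]]


tagged = [("They", "PRON"), ("went", "VERB"), ("door", "NOUN"),
          ("to", "ADP"), ("door", "NOUN"), ("yesterday", "NOUN")]
matches = list(find_nton(tagged, window=1))
print(matches)
```

The window is measured from the edges of the three-token match, which is one possible reading of "+/- 50 tokens from the construction".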

> then used Stanza (Qi et al., 2020) to **segment the results into sentences** and extract the sentences which contained NtoNs.

> We automatically **exclude sentences which contained “from”** preceding the construction

> we then manually clean the data, removing sentences that were either **too short (<5 tokens)** or 
contained **too many typos**

> We **annotate** all instances of the construction for their semantic subtype

#### More info

> **double annotate** roughly 25% of the dataset, achieving an agreement of 84% and a **Cohen’s kappa** 
value of .754 between the two annotators, indicating strong agreement. 
The final dataset has 6599 instances of NtoN, of which 1885 were double annotated.
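
For reference, Cohen's kappa corrects the observed agreement $p_o$ for the agreement $p_e$ expected by chance: $\kappa = (p_o - p_e) / (1 - p_e)$. A minimal sketch of the computation is shown below; the labels `"A"` and `"B"` are placeholders, not the paper's subtype names.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohen_kappa(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two annotators who labelled the same items."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("both annotators must label the same non-empty item list")
    n = len(a)
    # Observed agreement: share of items with identical labels.
    p_o = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement: product of the two annotators' label proportions.
    count_a, count_b = Counter(a), Counter(b)
    p_e = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used a single, identical label
    return (p_o - p_e) / (1 - p_e)


annotator_1 = ["A", "A", "B", "B"]
annotator_2 = ["A", "B", "B", "B"]
kappa = cohen_kappa(annotator_1, annotator_2)
print(kappa)
```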

> In total, we collect 456 total instances of NtoN distractors from COCA.

### (3.3) Train/Test Split

> we artificially shrink the dataset by **randomly sampling 20 sentences** for each noun lemma 
which occurs more than 20 times

> we generate **random train/test splits based on lemma of the noun** in the NtoN, meaning
that there are no lemmas that are seen in both the training set and the testing set.

> We take **80 percent of the NtoN distractor patterns for training and withhold twenty percent**. 
We take a **similar number of NtoN constructions** for training and then test on the remainder, 
ensuring **training sets are balanced between constructions and distractors**.
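
The capping and lemma-based split described above could look like the sketch below. It is a simplification under stated assumptions: each example is a hypothetical dict with a `"lemma"` key, the function names are invented, and the split takes a fixed fraction of lemmas rather than reproducing the paper's balancing of constructions against distractors.

```python
import random
from collections import defaultdict
from typing import Dict, List, Tuple

Example = Dict[str, str]  # hypothetical record with at least a "lemma" key


def cap_per_lemma(examples: List[Example], cap: int = 20,
                  seed: int = 0) -> List[Example]:
    """Randomly keep at most `cap` examples for each noun lemma."""
    rng = random.Random(seed)
    by_lemma = defaultdict(list)
    for ex in examples:
        by_lemma[ex["lemma"]].append(ex)
    capped: List[Example] = []
    for lemma in sorted(by_lemma):
        group = by_lemma[lemma]
        capped.extend(rng.sample(group, cap) if len(group) > cap else group)
    return capped


def split_by_lemma(examples: List[Example], train_fraction: float = 0.8,
                   seed: int = 0) -> Tuple[List[Example], List[Example]]:
    """Split so that no lemma appears in both the training and the test set."""
    rng = random.Random(seed)
    lemmas = sorted({ex["lemma"] for ex in examples})
    rng.shuffle(lemmas)
    n_train = int(round(train_fraction * len(lemmas)))
    train_lemmas = set(lemmas[:n_train])
    train = [ex for ex in examples if ex["lemma"] in train_lemmas]
    test = [ex for ex in examples if ex["lemma"] not in train_lemmas]
    return train, test


counts = {"day": 30, "door": 5, "hand": 3, "face": 2, "wall": 1}
data = [{"lemma": lemma, "id": f"{lemma}-{i}"}
        for lemma, n in counts.items() for i in range(n)]
capped = cap_per_lemma(data)
train, test = split_by_lemma(capped)
print(len(capped), len(train), len(test))
```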

## Experiment 1

> We probe the ability for BERT to distinguish natural instances of the NtoN construction from natural
examples of the NtoN distractor pattern

>  providing two baseline systems which give perspective on performance based on lexical cues: 
a control classifier and a non-contextual baseline based on GloVe embeddings
	- training a linear classifier on GloVe embeddings for the nouns in the construction as input.

> we train a separate probe based on embeddings from each layer of BERT and track
performance across layers. We use the BERT-base-cased model, available through the Huggingface
transformers library, and choose logistic regression as our linear classification architecture

> For all experiments and data settings, we run probes with 5 random seeds and report the
average results.
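
A layer-wise probe of this kind could be sketched with scikit-learn as below. This is an illustration on synthetic vectors, not the authors' implementation: the function name `probe_layers`, the array layout `(examples, layers, dimensions)`, and the use of the seed to drive a lemma-grouped split via `GroupShuffleSplit` are all assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GroupShuffleSplit


def probe_layers(X, y, lemmas, seeds=(0, 1, 2, 3, 4), test_size=0.2):
    """Train one logistic-regression probe per layer; return mean accuracy per layer.

    X has shape (n_examples, n_layers, dim); each seed draws a new split
    in which no lemma is shared between training and test examples.
    """
    n_layers = X.shape[1]
    acc = np.zeros((len(seeds), n_layers))
    for s, seed in enumerate(seeds):
        splitter = GroupShuffleSplit(n_splits=1, test_size=test_size,
                                     random_state=seed)
        train_idx, test_idx = next(splitter.split(X[:, 0, :], y, groups=lemmas))
        for layer in range(n_layers):
            clf = LogisticRegression(max_iter=1000)
            clf.fit(X[train_idx, layer, :], y[train_idx])
            acc[s, layer] = clf.score(X[test_idx, layer, :], y[test_idx])
    return acc.mean(axis=0)


# Synthetic stand-in for real embeddings: only the last "layer" carries signal.
rng = np.random.default_rng(0)
n_examples, n_layers, dim = 200, 3, 8
y = rng.integers(0, 2, n_examples)
lemmas = rng.integers(0, 20, n_examples)
X = rng.normal(size=(n_examples, n_layers, dim))
X[:, 2, 0] += 3 * y
mean_acc = probe_layers(X, y, lemmas)
print(mean_acc)
```

With real data, `X` would hold the 12 BERT layer vectors per example, and the GloVe baseline would be a single extra call on one set of non-contextual vectors.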


## Experiment 2

> we manipulate the test set of the probe by creating 4 perturbed orderings
of each test example sentence: PNN, PN, NNP, NP. 

> Crucially, we do not retrain the linear probe on this perturbed data
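
One way to generate the perturbed test sentences is sketched below. The reading of the four labels is an assumption: each name is taken to give the order of the preposition (P) and noun(s) (N) that replaces the original noun-preposition-noun span, with the two-letter variants dropping the second noun. The function name `perturb_nton` is hypothetical.

```python
from typing import Dict, List


def perturb_nton(tokens: List[str], noun1_index: int) -> Dict[str, List[str]]:
    """Return four reordered variants of a sentence whose N-to-N span
    starts at `noun1_index` (first noun, then "to", then second noun)."""
    n1, p, n2 = tokens[noun1_index:noun1_index + 3]
    before = tokens[:noun1_index]
    after = tokens[noun1_index + 3:]
    orders = {
        "PNN": [p, n1, n2],
        "PN": [p, n1],
        "NNP": [n1, n2, p],
        "NP": [n1, p],
    }
    return {name: before + span + after for name, span in orders.items()}


sentence = "We went door to door today".split()
variants = perturb_nton(sentence, noun1_index=2)
for name, toks in variants.items():
    print(name, " ".join(toks))
```

Because the probe is not retrained, these variants would only replace the test inputs; the training data keeps the original order.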

## Experiment 3

> we train a classifier to distinguish semantic subtypes of NtoN. [...] We also 
include examples of the NtoN distractor patterns which are not examples of the construction.

> we train control classifiers with a random label assigned to each lemma.
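
The control setting can be illustrated with a short sketch: every lemma receives one random label, and all of its instances share it, so a probe can only do well by memorising lexical identity. The function name `control_labels` and the placeholder label names are assumptions.

```python
import random
from typing import List


def control_labels(lemmas: List[str], label_set: List[str],
                   seed: int = 0) -> List[str]:
    """Assign one random label per lemma and return it for every instance."""
    rng = random.Random(seed)
    assignment = {lemma: rng.choice(label_set)
                  for lemma in sorted(set(lemmas))}
    return [assignment[lemma] for lemma in lemmas]


instance_lemmas = ["day", "door", "day", "hand", "door", "day"]
labels = control_labels(instance_lemmas, ["type_1", "type_2", "distractor"])
print(list(zip(instance_lemmas, labels)))
```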

## The pipeline -- after reading the paper

Select contexts from COCA 

**>** 

Segment into sentences 

**>** 

Filter the sentences to keep only instances that contain an NtoN, are not preceded by `from`, and are at least 5 tokens long 

**>** 

Annotate each instance in the dataset as `distractor` or as one of the semantic subtypes of the construction 

**>** 

For each item in the dataset, extract from BERT-base-cased 12 vectors (one per layer) corresponding to the head
of the construction (the preposition), plus one GloVe vector (corresponding to the lemma of the NOUN) 

**>** 

Repeat 5 times, each time choosing a random portion of the dataset:
	- train a linear classifier on the GloVe vectors
	- for each BERT layer, train a linear classifier

**>** 

Compute the mean accuracy for each embedding type

**>** 

Plot the results
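
The vector-extraction step of the pipeline above could be sketched with the Hugging Face `transformers` library as follows. This is a sketch under assumptions, not the authors' code: the function names are hypothetical, the vector is taken at the first subword of the preposition, and the embedding-layer output is skipped so that 12 vectors remain.

```python
from typing import List, Optional


def first_subword_index(word_ids: List[Optional[int]], word_index: int) -> int:
    """Position of the first subword token that belongs to the given word."""
    for position, wid in enumerate(word_ids):
        if wid == word_index:
            return position
    raise ValueError(f"word {word_index} not found (was the input truncated?)")


def layer_vectors(words: List[str], prep_index: int,
                  model_name: str = "bert-base-cased"):
    """Return one vector per Transformer layer for the preposition token."""
    # Imported lazily so the helper above stays usable without these packages.
    import torch
    from transformers import AutoModel, AutoTokenizer

    # For a whole dataset, load the tokenizer and model once outside the loop.
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name, output_hidden_states=True)
    model.eval()
    encoded = tokenizer(words, is_split_into_words=True, return_tensors="pt")
    position = first_subword_index(encoded.word_ids(0), prep_index)
    with torch.no_grad():
        hidden_states = model(**encoded).hidden_states
    # hidden_states[0] is the embedding output; keep the layers after it.
    return torch.stack([layer[0, position] for layer in hidden_states[1:]])


# Word ids as a fast tokenizer might return them for four words,
# the third of which is split into two subwords.
example_word_ids = [None, 0, 1, 2, 2, 3, None]
prep_position = first_subword_index(example_word_ids, 1)
print(prep_position)
```

A call such as `layer_vectors("They went door to door".split(), prep_index=3)` would download the model on first use and return a tensor with one row per layer.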