### `Idrissa Dicko & Tyler Marino & Simon Khan`

In [1]:
! python -m spacy download en_core_web_sm
! python -m spacy download fr_core_news_sm
#! pip install nltk pandas scikit-learn

Collecting en-core-web-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
Collecting fr-core-news-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/fr_core_news_sm-3.8.0/fr_core_news_sm-3.8.0-py3-none-any.whl (16.3 MB)
Installing collected packages: fr-core-news-sm
Successfully installed fr-core-news-sm-3.8.0
✔ Download and installation successful
You can now load the package via spacy.load('fr_core_news_sm')


# Exercise 1 : Lemmatization

In this exercise, the objective is to create your own lemmatizer for the French language. We will test different lemmatization approaches:
* Based on a dictionary
* Based on a machine-learning approach (you can use sklearn, or define your own architecture with PyTorch)
* With and without the PoS tag given as input

In all cases, you should compare your results and report the performance of the proposed algorithm against the [spacy](https://spacy.io/models/fr) lemmatizer (in its different configurations).

You are free to use any machine-learning algorithm/model, whether or not it takes the sentence context into account, such as [LinearRegression](https://scikit-learn.org/1.5/modules/generated/sklearn.linear_model.LinearRegression.html) or your own [RNN with pytorch](https://pytorch.org/docs/stable/generated/torch.nn.RNN.html). 
However, you must always motivate your choices and compare the results of the different configurations.

Send the report in PDF format, named as follows, to *thomas.gerald@universite-paris-saclay.fr*, together with the code (a notebook with the output of the two exercises, in a zip archive):


**report_[firstname]_[lastname].pdf**

The report for the two exercises must not exceed three pages!


## Dataset
To train or build your lemmatizer you have three files in *tab-separated values* format:
* [training-set.tsv](https://thomas-gerald.fr/TMC/resources/data/training-set.tsv) that you can use to train/build your dictionary/model
* [testing-set.tsv](https://thomas-gerald.fr/TMC/resources/data/testing-set.tsv) used to evaluate the different approaches
* [testing-gallica.tsv](https://thomas-gerald.fr/TMC/resources/data/testing-gallica.tsv) used as a gold standard to evaluate performance, see [GitHub (in French)](https://github.com/Gallicorpora/Lemmatisation)

In our case we have two possibilities for a lemma:
* (a) A sequence of characters, meaning that "to rule" and "a rule" share the same lemma
* (b) A tuple made of a sequence of characters and a PoS tag, meaning that "to rule" is represented by the tuple ("rule", "V") while "a rule" is represented by the tuple ("rule", "N")
In the (a) case the size of the vocabulary (output) will be 
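
The two definitions can be illustrated on a few toy rows. This is only a sketch: the rows below are hypothetical and are not taken from the dataset.

```python
# Hypothetical toy rows (token, lemma, pos); not taken from the dataset.
rows = [
    ("rules", "rule", "V"),  # as in "he rules"
    ("rules", "rule", "N"),  # as in "the rules"
    ("ruled", "rule", "V"),
]

# (a) a lemma is a plain string
lemmas_a = {lemma for _, lemma, _ in rows}
# (b) a lemma is a (string, PoS) pair
lemmas_b = {(lemma, pos) for _, lemma, pos in rows}

print(len(lemmas_a))  # 1
print(len(lemmas_b))  # 2
```

Under definition (b) the output vocabulary can only be as large as or larger than under definition (a), since each string lemma may be split across several PoS tags.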
## spaCy

Below is a small example using spaCy lemmatization:
```python
import spacy
nlp = spacy.load("en_core_web_sm")
text_a = "He is thirty years old"
text_b = "We still are champions"
print(f'Lemmatization A : {[(w.lemma_, w.pos_) for w in nlp(text_a)]}')
print(f'Lemmatization B : {[(w.lemma_, w.pos_) for w in nlp(text_b)]}')
```

In [6]:
import spacy

# English pipeline (loaded for reference) and French pipeline (used below)
nlp_en = spacy.load("en_core_web_sm")
nlp_fr = spacy.load("fr_core_news_sm")

text_a = "On est toujours champions"
text_b = "Il a trente ans"
print(f"Lemmatisation A: {[(w.lemma_, w.pos_) for w in nlp_fr(text_a)]}")
print(f"Lemmatisation B: {[(w.lemma_, w.pos_) for w in nlp_fr(text_b)]}")

Lemmatisation A: [('on', 'PRON'), ('être', 'AUX'), ('toujours', 'ADV'), ('champion', 'NOUN')]
Lemmatisation B: [('il', 'PRON'), ('avoir', 'AUX'), ('trente', 'NUM'), ('an', 'NOUN')]


### Reading data

You can use pandas to read the data, with a tab as the separator, as follows:

In [None]:
import pandas as pd
train_file = "data/training-set.tsv"
pd.read_csv(train_file, sep='\t', names=["token", "lemma", "pos"])


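| | token | lemma | pos |
|------|------|------|------|
| 0 | Certes | certes | ADV |
| 1 | , | , | PONCT |
| 2 | rien | rien | PRO |
| 3 | ne | ne | ADV |
| 4 | dit | dire | V |
| ... | ... | ... | ... |
| 261384 | effet | effet | N |
| 261385 | positif | positif | A |
| 261386 | . | . | PONCT |
| 261387 | tenir | tenir | V |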

In [None]:

# Placeholder entry reserved for out-of-vocabulary words
w_vocabulary = {'unknown_word'}
l_vocabulary = set()   # lemmas as plain strings
lp_vocabulary = set()  # lemmas as (lemma, PoS) pairs

with open(train_file, 'r', encoding='utf-8') as f:
    for line in f:
        try:
            word, lemma, pos = line.split()
        except ValueError:
            # Skip lines that do not split into exactly three fields
            continue
        w_vocabulary.add(word)
        l_vocabulary.add(lemma)
        lp_vocabulary.add((lemma, pos))

print(f'The input vocabulary contains : {len(w_vocabulary)} words' )
print(f'The number of str lemma is :  {len(l_vocabulary)}')
print(f'The number of lemma (considering PoS) is :  {len(lp_vocabulary)}')  # "marche" (the noun) differs from "marche" (the verb)

The input vocabulary contains : 23271 words
The number of str lemma is :  15194
The number of lemma (considering PoS) is :  16144


In [11]:
import pandas as pd
from collections import Counter, defaultdict
from sklearn.metrics import accuracy_score
import spacy

#reading TSV files

def read_tsv(path):
    df = pd.read_csv(path, sep="\t", names=["token", "lemma", "pos"], keep_default_na=False)
    # normalize empties
    df["pos"] = df["pos"].replace("", "X")
    df["token"] = df["token"].astype(str)
    df["lemma"] = df["lemma"].astype(str)
    return df

train_path = "data/training-set.tsv"
test_path  = "data/testing-set.tsv"

train_df = read_tsv(train_path)
test_df  = read_tsv(test_path)



In [12]:
#TRAINING
#model A: token -> most frequent lemma
def train_mfl_token(df):
    counts = defaultdict(Counter)
    for tok, lem in zip(df["token"], df["lemma"]):
        counts[tok][lem] += 1
    model = {}
    for tok, c in counts.items():
        model[tok] = c.most_common(1)[0][0]
    return model

#model B: (token,pos) -> most frequent lemma
def train_mfl_token_pos(df):
    counts = defaultdict(Counter)
    for tok, pos, lem in zip(df["token"], df["pos"], df["lemma"]):
        counts[(tok, pos)][lem] += 1
    model = {}
    for key, c in counts.items():
        model[key] = c.most_common(1)[0][0]
    return model

mfl_tok = train_mfl_token(train_df)
mfl_tok_pos = train_mfl_token_pos(train_df)

#spacy lemma function (token-by-token, no sentence context here)
def spacy_lemma(token):
    doc = nlp_fr(token)
    return doc[0].lemma_ if len(doc) else ""



In [13]:
#INFERENCE
# ---- Predict functions (with optional fallback)
def predict_mfl_token(tokens, fallback="spacy"):
    preds = []
    for t in tokens:
        if t in mfl_tok:
            preds.append(mfl_tok[t])
        else:
            if fallback == "spacy":
                preds.append(spacy_lemma(t))
            elif fallback == "lower":
                preds.append(t.lower())
            else:
                preds.append(t)  # identity
    return preds

def predict_mfl_token_pos(tokens, poses, fallback="spacy"):
    preds = []
    for t, p in zip(tokens, poses):
        key = (t, p)
        if key in mfl_tok_pos:
            preds.append(mfl_tok_pos[key])
        else:
            if fallback == "spacy":
                preds.append(spacy_lemma(t))
            elif fallback == "lower":
                preds.append(t.lower())
            else:
                preds.append(t)
    return preds

def evaluate(y_true, y_pred, known_mask=None, label=""):
    acc = accuracy_score(y_true, y_pred)
    print(f"{label} accuracy = {acc:.4f}")
    if known_mask is not None:
        acc_known = accuracy_score(y_true[known_mask], pd.Series(y_pred)[known_mask])
        acc_unk   = accuracy_score(y_true[~known_mask], pd.Series(y_pred)[~known_mask])
        print(f"  known accuracy = {acc_known:.4f}")
        print(f"  unk   accuracy = {acc_unk:.4f}")

y_true = test_df["lemma"]
tokens = test_df["token"]
poses  = test_df["pos"]

known_mask_tok = tokens.isin(set(mfl_tok.keys()))

#evaluate model A (no pos)
pred_a = predict_mfl_token(tokens, fallback="spacy")
evaluate(y_true, pred_a, known_mask_tok, label="MFL token->lemma (fallback spaCy)")

#evaluate model B (with pos)
known_mask_tokpos = pd.Series(list(zip(tokens, poses))).isin(set(mfl_tok_pos.keys()))
pred_b = predict_mfl_token_pos(tokens, poses, fallback="spacy")
evaluate(y_true, pred_b, known_mask_tokpos, label="MFL (token,pos)->lemma (fallback spaCy)")

#spacy
pred_spacy = [spacy_lemma(t) for t in tokens]
evaluate(y_true, pred_spacy, None, label="spaCy token lemma baseline")

MFL token->lemma (fallback spaCy) accuracy = 0.9531
  known accuracy = 0.9694
  unk   accuracy = 0.6988




MFL (token,pos)->lemma (fallback spaCy) accuracy = 0.9653
  known accuracy = 0.9855
  unk   accuracy = 0.6828




spaCy token lemma baseline accuracy = 0.8150


## Results and Comparison with spaCy

We evaluate three lemmatization approaches on the testing set:

- **MFL token → lemma**: a most-frequent-lemma baseline learned from the training data, without PoS information.
- **MFL (token, PoS) → lemma**: the same approach, but incorporating the part-of-speech tag to reduce lexical ambiguity.
- **spaCy baseline**: lemmatization produced by the `fr_core_news_sm` model, used as a general-purpose reference.
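
The difference between the first two approaches can be illustrated on a toy example. This is a minimal sketch that mirrors the counting logic of the training cell; the rows and the helper name `most_frequent_lemma` are hypothetical and chosen only for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical toy training rows (token, lemma, pos); not taken from the dataset.
train_rows = [
    ("marche", "marcher", "V"),
    ("marche", "marcher", "V"),
    ("marche", "marche", "N"),
    ("chats", "chat", "N"),
]

def most_frequent_lemma(pairs):
    """Map each key to the lemma it was most often paired with."""
    counts = defaultdict(Counter)
    for key, lemma in pairs:
        counts[key][lemma] += 1
    return {key: c.most_common(1)[0][0] for key, c in counts.items()}

# Key = token only
by_token = most_frequent_lemma((tok, lem) for tok, lem, _ in train_rows)
# Key = (token, PoS)
by_token_pos = most_frequent_lemma(((tok, pos), lem) for tok, lem, pos in train_rows)

print(by_token["marche"])             # marcher (majority lemma, whatever the PoS)
print(by_token_pos[("marche", "N")])  # marche
print(by_token_pos[("marche", "V")])  # marcher
```

With the token as the only key, the noun reading of "marche" is always mapped to the majority lemma "marcher"; keying on the (token, PoS) pair keeps both readings apart.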

### Overall Accuracy

| Model | Accuracy |
|------|----------|
| MFL token → lemma | 0.9531 |
| MFL (token, PoS) → lemma | **0.9653** |
| spaCy baseline | 0.8150 |

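Both proposed approaches clearly outperform the spaCy lemmatizer, by about **13.8 accuracy points** (token only) and **15.0 accuracy points** (token and PoS). This result is expected: the proposed models are built from training data that shares the annotation scheme and domain of the evaluation data, while spaCy is a generic lemmatizer and is applied here token by token, without sentence context. Note also that both MFL models fall back to spaCy for unseen entries, so the comparison measures the gain brought by the learned dictionary rather than the performance of a fully independent system.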
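
As a quick check, the gap to spaCy can be recomputed from the accuracies reported in the table. The dictionary keys are only labels chosen for this snippet.

```python
# Accuracies copied from the overall accuracy table.
accuracy = {
    "MFL token -> lemma": 0.9531,
    "MFL (token, PoS) -> lemma": 0.9653,
    "spaCy baseline": 0.8150,
}

baseline = accuracy["spaCy baseline"]
# Gain over spaCy, expressed in accuracy points (percentage points).
gains = {
    name: round((acc - baseline) * 100, 2)
    for name, acc in accuracy.items()
    if name != "spaCy baseline"
}

for name, gain in gains.items():
    print(f"{name}: +{gain:.2f} points over spaCy")
# MFL token -> lemma: +13.81 points over spaCy
# MFL (token, PoS) -> lemma: +15.03 points over spaCy
```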

### Known vs Unknown Words

To better understand the behavior of the models, we distinguish between test items whose lookup key was seen during training (*known words*: the token for the first model, the (token, PoS) pair for the second) and all other items (*unknown words*).

| Model | Known accuracy | Unknown accuracy |
|------|---------------|------------------|
| MFL token → lemma | 0.9694 | 0.6988 |
| MFL (token, PoS) → lemma | **0.9855** | 0.6828 |
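
The share of known items in the test set is not printed by the notebook, but it can be estimated from the reported figures, since the overall accuracy is the weighted average of the known and unknown accuracies. The sketch below solves that relation for the share; the function name `implied_known_share` is hypothetical, and the result is approximate because the reported accuracies are rounded to four decimals.

```python
def implied_known_share(overall, known, unknown):
    """Solve overall = p * known + (1 - p) * unknown for p."""
    return (overall - unknown) / (known - unknown)

# Figures copied from the two accuracy tables.
share_tok = implied_known_share(0.9531, 0.9694, 0.6988)
share_tok_pos = implied_known_share(0.9653, 0.9855, 0.6828)

print(f"token model:        {share_tok:.1%} of test items are known")
print(f"(token, PoS) model: {share_tok_pos:.1%} of test items are known")
```

Under these figures, roughly 94% of test tokens and 93% of test (token, PoS) pairs appear in the training data, which is why the overall accuracy stays close to the known accuracy.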

For known words, the accuracy is extremely high, reaching **98.6%** when PoS information is used. This shows that lemmatization is almost deterministic for observed lexical forms and that PoS tags effectively reduce ambiguity for homographic tokens (e.g., noun vs verb forms).

For unknown words, accuracy drops to roughly **68 to 70%**. In this setup every unknown word is handled by the spaCy fallback, so these figures reflect spaCy's token-level accuracy on out-of-vocabulary items rather than any generalization by the MFL models themselves. The slightly lower figure obtained with PoS information (0.6828 vs 0.6988) should be read with care: the two unknown subsets are not the same, because the (token, PoS) model also treats as unknown any token that was seen in training only with a different PoS tag.
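
The two notions of "known" used by the evaluation code can be contrasted on a toy example. This is only a sketch with hypothetical data, written with plain Python lists instead of the pandas masks used in the notebook.

```python
# Hypothetical toy data as (token, pos) pairs; not taken from the dataset.
train_pairs = {("marche", "V"), ("chat", "N")}
train_tokens = {tok for tok, _ in train_pairs}

test_pairs = [("marche", "V"), ("marche", "N"), ("chien", "N")]

# Token model: a test item is known if its token was seen in training.
known_tok = [tok in train_tokens for tok, _ in test_pairs]
# (token, PoS) model: known only if the exact pair was seen in training.
known_tok_pos = [pair in train_pairs for pair in test_pairs]

print(known_tok)      # [True, True, False]
print(known_tok_pos)  # [True, False, False]
```

The item ("marche", "N") is known to the token model but unknown to the (token, PoS) model, so the two "unknown accuracy" figures are computed on different subsets of the test set.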

### Discussion

Despite its simplicity, the most-frequent-lemma approach proves to be a very strong baseline for lemmatization. The inclusion of PoS tags provides a clear benefit for known words, while the main limitation of the approach lies in its handling of out-of-vocabulary tokens. Overall, the proposed methods demonstrate that simple supervised lexical strategies can outperform more complex general-purpose lemmatizers when the domain and annotation scheme are well matched.
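
One possible direction for the out-of-vocabulary limitation, which is not implemented or evaluated in this notebook, is to learn suffix rewrite rules from the training pairs and apply them to unseen tokens instead of calling spaCy. The sketch below is an assumption-laden illustration: the function names, the `max_suffix` parameter and the toy pairs are all hypothetical.

```python
from collections import Counter, defaultdict

def suffix_rule(token, lemma):
    """Return (token_suffix, lemma_suffix) left after the longest common prefix."""
    i = 0
    while i < min(len(token), len(lemma)) and token[i] == lemma[i]:
        i += 1
    return token[i:], lemma[i:]

def train_suffix_rules(pairs, max_suffix=4):
    """For each token ending (1..max_suffix chars), keep its most frequent rule."""
    counts = defaultdict(Counter)
    for token, lemma in pairs:
        rule = suffix_rule(token, lemma)
        for n in range(1, min(max_suffix, len(token)) + 1):
            # Index the rule under an ending only if the ending covers
            # the part of the token that the rule removes.
            if len(rule[0]) <= n:
                counts[token[-n:]][rule] += 1
    return {ending: c.most_common(1)[0][0] for ending, c in counts.items()}

def lemmatize_unknown(token, rules, max_suffix=4):
    """Apply the rule of the longest known ending; otherwise return the token."""
    for n in range(min(max_suffix, len(token)), 0, -1):
        rule = rules.get(token[-n:])
        if rule is not None:
            old, new = rule
            return (token[:-len(old)] if old else token) + new
    return token

# Hypothetical toy training pairs (token, lemma); not taken from the dataset.
train_pairs = [
    ("parlons", "parler"),
    ("chantons", "chanter"),
    ("mangeons", "manger"),
    ("chats", "chat"),
    ("chiens", "chien"),
]
rules = train_suffix_rules(train_pairs)

print(lemmatize_unknown("dansons", rules))  # danser
print(lemmatize_unknown("lapins", rules))   # lapin
print(lemmatize_unknown("vite", rules))     # vite (no known ending, identity)
```

Such a fallback would have to be compared against the spaCy and identity fallbacks on the unknown subset before drawing any conclusion about its usefulness.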