### `Idrissa Dicko & Tyler Marino & Simon Khan`

In [45]:
! python -m spacy download en_core_web_sm
! python -m spacy download fr_core_news_sm
#! pip install nltk pandas scikit-learn

Collecting en-core-web-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
Collecting fr-core-news-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/fr_core_news_sm-3.8.0/fr_core_news_sm-3.8.0-py3-none-any.whl (16.3 MB)
✔ Download and installation successful
You can now load the package via spacy.load('fr_core_news_sm')


# Exercise 1 : Lemmatization

In this exercise, the objective is to create your own lemmatizer for the French language. We will test different lemmatization approaches:
* Based on a dictionary
* Based on a machine learning approach (you can use sklearn), or on your own architecture defined with PyTorch
* With and without the PoS tag given as input

In all cases, you should compare your results and report the performance of the proposed algorithms against the [spacy](https://spacy.io/models/fr) lemmatizer (in its different configurations).

You are free to use any machine-learning algorithm/model, with or without the sentence context, such as [LinearRegression](https://scikit-learn.org/1.5/modules/generated/sklearn.linear_model.LinearRegression.html) or your own [RNN trained with pytorch](https://pytorch.org/docs/stable/generated/torch.nn.RNN.html).
However, you must always motivate your choices and compare the results of the different configurations.

Send the report to *thomas.gerald@universite-paris-saclay.fr* in PDF format, named as follows, together with the code (a notebook with the output of the two exercises, in a zip archive):


**report_[firstname]_[lastname].pdf**

The report for the two exercises must not exceed three pages!


## Dataset
To train or build your lemmatizer, you have three files in *tab-separated values* format:
* [training-set.tsv](https://thomas-gerald.fr/TMC/resources/data/training-set.tsv), which you can use to train/build your dictionary/model
* [testing-set.tsv](https://thomas-gerald.fr/TMC/resources/data/testing-set.tsv), used to evaluate the different approaches
* [testing-gallica.tsv](https://thomas-gerald.fr/TMC/resources/data/testing-gallica.tsv), used as a gold standard to evaluate performance; see the [GitHub repository (in French)](https://github.com/Gallicorpora/Lemmatisation)

In our case we have two possibilities for a lemma:
* (a) A sequence of characters, meaning that "to rule" and "a rule" share the same lemma
* (b) A sequence of characters paired with a PoS tag, meaning that "to rule" is represented by the tuple ("rule", "V") while "a rule" is represented by the tuple ("rule", "N")
In the (a) case the size of the vocabulary (output) will be 
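
A toy illustration of the two conventions, using invented English rows rather than the course data: under (a) the output vocabulary is a set of strings, under (b) a set of `(lemma, pos)` tuples, so the vocabulary of (b) is never smaller than that of (a).

```python
# Invented (token, lemma, pos) rows, for illustration only.
rows = [
    ("rules", "rule", "V"),
    ("rules", "rule", "N"),
    ("ruled", "rule", "V"),
    ("champions", "champion", "N"),
]

# (a) a lemma is just a string: "to rule" and "a rule" collapse together
vocab_a = {lemma for _, lemma, _ in rows}

# (b) a lemma is a (string, pos) tuple: the verb and the noun stay distinct
vocab_b = {(lemma, pos) for _, lemma, pos in rows}

print(len(vocab_a), sorted(vocab_a))
print(len(vocab_b), sorted(vocab_b))
```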
## Spacy :

Below is a small example using spaCy lemmatization:
```python
import spacy
nlp = spacy.load("en_core_web_sm")
text_a = "He is thirty years old"
text_b = "We still are champions"
print(f'Lemmatization A : {[(w.lemma_, w.pos_) for w in nlp(text_a)]}')
print(f'Lemmatization B : {[(w.lemma_, w.pos_) for w in nlp(text_b)]}')
```

In [46]:
import spacy
nlp_en = spacy.load("en_core_web_sm")
nlp_fr = spacy.load("fr_core_news_sm")
text_a = "On est toujours champions"
text_b = "Il a trente ans"
print(f"Lemmatisation A: {[(w.lemma_, w.pos_) for w in nlp_fr(text_a)]}")
print(f"Lemmatisation B: {[(w.lemma_, w.pos_) for w in nlp_fr(text_b)]}")

Lemmatisation A: [('on', 'PRON'), ('être', 'AUX'), ('toujours', 'ADV'), ('champion', 'NOUN')]
Lemmatisation B: [('il', 'PRON'), ('avoir', 'AUX'), ('trente', 'NUM'), ('an', 'NOUN')]


### Reading data

You can use pandas to read the data, with a tab as separator, as follows:

In [47]:
import pandas as pd
import spacy
from collections import Counter, defaultdict
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
import os
import warnings

# Suppress warnings for cleaner output
warnings.filterwarnings('ignore')

# --- 1. SETUP & CONFIGURATION ---
print("--- 1. Setup ---")
# Download/Load spaCy model
try:
    nlp_fr = spacy.load("fr_core_news_sm")
    print("✔ spaCy model 'fr_core_news_sm' loaded.")
except OSError:
    print("✘ spaCy model not found. Downloading...")
    os.system("python -m spacy download fr_core_news_sm")
    nlp_fr = spacy.load("fr_core_news_sm")

# Define File Paths
TRAIN_PATH = "training-set.tsv"
TEST_PATH = "testing-set.tsv"
GALLICA_PATH = "testing-gallica.tsv"

--- 1. Setup ---
✔ spaCy model 'fr_core_news_sm' loaded.


In [48]:
# --- 2. DATA LOADING FUNCTIONS ---

def read_standard_tsv(path):
    """Reads the standard training/testing files (No header, simple tags)."""
    if not os.path.exists(path):
        print(f"⚠ File not found: {path}")
        return pd.DataFrame(columns=["token", "lemma", "pos"])

    # Read (No header in these files)
    df = pd.read_csv(path, sep="\t", names=[
                     "token", "lemma", "pos"], keep_default_na=False, quoting=3)

    # Normalize
    df["pos"] = df["pos"].replace("", "X")
    df["token"] = df["token"].astype(str)
    df["lemma"] = df["lemma"].astype(str)
    return df


def read_gallica_tsv(path):
    """Reads the Gallica file (Has header, complex tags)."""
    if not os.path.exists(path):
        print(f"⚠ File not found: {path}")
        return pd.DataFrame(columns=["token", "lemma", "pos"])

    # Read (Has header: form, lemma, POS, morph)
    df = pd.read_csv(path, sep="\t", header=0,
                     quoting=3, keep_default_na=False)

    # Rename columns to match our standard
    # Based on screenshot: 'form' -> token, 'POS' -> pos
    rename_map = {'form': 'token', 'POS': 'pos'}
    df = df.rename(columns=rename_map)

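    # Ensure the expected columns exist; otherwise return an empty frame
    # with the same schema as read_standard_tsv
    expected = ["token", "lemma", "pos"]
    if any(col not in df.columns for col in expected):
        return pd.DataFrame(columns=expected)

    return df[expected]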
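
Why `quoting=3`? The value 3 is `csv.QUOTE_NONE`, which tells the parser to treat `"` as an ordinary character. The sketch below uses only the standard `csv` module and an invented three-row sample (we assume, without having checked, that the real files contain quotation-mark tokens) to show what goes wrong with the default setting. In the same spirit, `keep_default_na=False` stops pandas from turning tokens such as `NA` or `null` into missing values.

```python
import csv
import io

# Invented three-column sample; the second row is a quotation-mark token.
sample = 'la\tle\tD\n"\t"\tPONCT\nmaison\tmaison\tN\n'

# Default dialect: '"' opens a quoted field, so the tab after it is swallowed.
rows_default = list(csv.reader(io.StringIO(sample), delimiter="\t"))

# QUOTE_NONE (the integer 3): every '"' is an ordinary character.
rows_raw = list(
    csv.reader(io.StringIO(sample), delimiter="\t", quoting=csv.QUOTE_NONE)
)

print(csv.QUOTE_NONE)
print(rows_default[1])  # the row is mis-split
print(rows_raw[1])      # three fields, as intended
```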

In [49]:
# --- 3. TRAINING FUNCTIONS (DICTIONARY MODELS) ---

def train_model_a(df):
    """Model A: Token -> Most Frequent Lemma"""
    counts = defaultdict(Counter)
    for tok, lem in zip(df["token"], df["lemma"]):
        counts[tok][lem] += 1

    model = {}
    for tok, c in counts.items():
        model[tok] = c.most_common(1)[0][0]
    return model


def train_model_b(df):
    """Model B: (Token, POS) -> Most Frequent Lemma"""
    counts = defaultdict(Counter)
    for tok, pos, lem in zip(df["token"], df["pos"], df["lemma"]):
        counts[(tok, pos)][lem] += 1

    model = {}
    for key, c in counts.items():
        model[key] = c.most_common(1)[0][0]
    return model
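
To make the difference between the two dictionaries concrete, here is a minimal sketch on invented French rows (not taken from the training set). The names `toy_model_a` and `toy_model_b` are hypothetical and only mirror the logic of `train_model_a` and `train_model_b`.

```python
from collections import Counter, defaultdict

# Invented (token, pos, lemma) rows, for illustration only.
rows = [
    ("portes", "N", "porte"),
    ("portes", "N", "porte"),
    ("portes", "V", "porter"),
    ("est", "V", "être"),
    ("est", "N", "est"),
    ("est", "V", "être"),
]

by_token = defaultdict(Counter)      # token -> lemma counts
by_token_pos = defaultdict(Counter)  # (token, pos) -> lemma counts
for tok, pos, lem in rows:
    by_token[tok][lem] += 1
    by_token_pos[(tok, pos)][lem] += 1

toy_model_a = {k: c.most_common(1)[0][0] for k, c in by_token.items()}
toy_model_b = {k: c.most_common(1)[0][0] for k, c in by_token_pos.items()}

print(toy_model_a["portes"])         # majority lemma wins regardless of PoS
print(toy_model_b[("portes", "V")])  # the PoS key recovers the minority reading
```

The token-only model always answers with the majority lemma, so it is wrong on every occurrence of the minority reading; keying on `(token, pos)` removes that error whenever the PoS separates the readings.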

In [50]:
# --- 4. PREDICTION & MAPPING FUNCTIONS ---

def spacy_lemma(token):
    """Baseline: Uses spaCy for lemmatization."""
    if nlp_fr is None:
        return token
    doc = nlp_fr(token)
    return doc[0].lemma_ if len(doc) else ""


def predict_model_a(model, tokens, fallback="spacy"):
    preds = []
    for t in tokens:
        if t in model:
            preds.append(model[t])
        else:
            if fallback == "spacy":
                preds.append(spacy_lemma(t))
            elif fallback == "lower":
                preds.append(t.lower())
            else:
                preds.append(t)
    return preds


def predict_model_b(model, tokens, poses, fallback="spacy"):
    preds = []
    for t, p in zip(tokens, poses):
        key = (t, p)
        if key in model:
            preds.append(model[key])
        else:
            if fallback == "spacy":
                preds.append(spacy_lemma(t))
            elif fallback == "lower":
                preds.append(t.lower())
            else:
                preds.append(t)
    return preds


# Map Gallica tags (VERcjg) to Training tags (V)
gallica_map = {
    'VER': 'V', 'NOM': 'N', 'ADJ': 'A', 'ADV': 'ADV',
    'DET': 'D', 'CON': 'C', 'PRO': 'PRO', 'PRE': 'P', 'PON': 'PONCT'
}


def normalize_gallica_pos(tag):
    prefix = str(tag)[:3]  # Take first 3 chars
    return gallica_map.get(prefix, tag)  # Map or keep original


def evaluate(y_true, y_pred, label=""):
    acc = accuracy_score(y_true, y_pred)
    print(f"  > {label:<35} Accuracy: {acc:.4f}")
    return acc
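
Because `normalize_gallica_pos` keeps any tag whose three-character prefix is not in `gallica_map`, such tags can never match a `(token, pos)` key of Model B and silently go to the fallback. A small diagnostic sketch follows; the helper `unmapped_prefixes` is hypothetical, and apart from `VERcjg` the sample tags are invented.

```python
from collections import Counter

# Same prefix table as in the cell above, repeated so the snippet runs on its own.
gallica_map = {
    'VER': 'V', 'NOM': 'N', 'ADJ': 'A', 'ADV': 'ADV',
    'DET': 'D', 'CON': 'C', 'PRO': 'PRO', 'PRE': 'P', 'PON': 'PONCT'
}


def unmapped_prefixes(tags):
    """Count 3-character prefixes that have no entry in gallica_map."""
    return Counter(str(t)[:3] for t in tags if str(t)[:3] not in gallica_map)


# 'VERcjg' comes from the notebook's comments; the other tags are made up.
sample_tags = ["VERcjg", "NOMcom", "PONfbl", "OUT", "ETR", "OUT"]
print(unmapped_prefixes(sample_tags))
```

Running the same helper on `gallica_df["pos"]` would show how many rows Model B cannot look up for purely tag-related reasons.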

In [51]:
# --- 5. MAIN EXECUTION ---

# A. LOAD DATA
print("\n--- 2. Loading Data ---")
full_train_df = read_standard_tsv(TRAIN_PATH)
test_df = read_standard_tsv(TEST_PATH)
gallica_df = read_gallica_tsv(GALLICA_PATH)

print(f"  Training Set: {len(full_train_df)} rows")
print(f"  Testing Set:  {len(test_df)} rows")
print(f"  Gallica Set:  {len(gallica_df)} rows")


--- 2. Loading Data ---
  Training Set: 261389 rows
  Testing Set:  16694 rows
  Gallica Set:  2841 rows


In [52]:
# B. VALIDATION STEP (Tuning Fallback)
print("\n--- 3. Validation: Tuning Fallback Strategy ---")
train_split, val_split = train_test_split(
    full_train_df, test_size=0.1, random_state=42)
temp_model = train_model_b(train_split)

best_acc = 0
best_strat = "identity"
for strat in ["spacy", "lower", "identity"]:
    preds = predict_model_b(
        temp_model, val_split["token"], val_split["pos"], fallback=strat)
    acc = accuracy_score(val_split["lemma"], preds)
    print(f"  Strategy '{strat}': {acc:.4f}")
    if acc > best_acc:
        best_acc = acc
        best_strat = strat

print(f"✔ Selected Best Strategy: '{best_strat}'")

# C. FINAL TRAINING
print("\n--- 4. Training Final Models ---")
model_a = train_model_a(full_train_df)
model_b = train_model_b(full_train_df)
print("✔ Models trained on full dataset.")

# D. EVALUATION: STANDARD TEST SET
print("\n--- 5. Results: Standard Test Set ---")
# Model A
preds_a = predict_model_a(model_a, test_df["token"], fallback=best_strat)
evaluate(test_df["lemma"], preds_a, label="MFL (Token only)")

# Model B
preds_b = predict_model_b(
    model_b, test_df["token"], test_df["pos"], fallback=best_strat)
evaluate(test_df["lemma"], preds_b, label="MFL (Token + POS)")

# SpaCy Baseline
preds_spacy = [spacy_lemma(t) for t in test_df["token"]]
evaluate(test_df["lemma"], preds_spacy, label="SpaCy Baseline")

# E. EVALUATION: GALLICA GOLD STANDARD
print("\n--- 6. Results: Gallica (Historical) ---")
if len(gallica_df) > 0:
    # 1. Evaluate Model A (Unaffected by POS tags)
    preds_g_a = predict_model_a(
        model_a, gallica_df["token"], fallback=best_strat)
    evaluate(gallica_df["lemma"], preds_g_a,
             label="MFL token → lemma")

    # 2. Evaluate Model B (Needs POS Mapping)
    # Map 'VERcjg' -> 'V' so the dictionary keys match
    gallica_pos_fixed = gallica_df["pos"].apply(normalize_gallica_pos)
    preds_g_b = predict_model_b(
        model_b, gallica_df["token"], gallica_pos_fixed, fallback=best_strat)
    evaluate(gallica_df["lemma"], preds_g_b,
             label="MFL (token, PoS) → lemma")

    # 3. SpaCy
    preds_g_spacy = [spacy_lemma(t) for t in gallica_df["token"]]
    evaluate(gallica_df["lemma"], preds_g_spacy, label="SpaCy Baseline")
else:
    print("⚠ Gallica dataset skipped (empty).")

print("\n--- Done ---")


--- 3. Validation: Tuning Fallback Strategy ---
  Strategy 'spacy': 0.9677
  Strategy 'lower': 0.9519
  Strategy 'identity': 0.9575
✔ Selected Best Strategy: 'spacy'

--- 4. Training Final Models ---
✔ Models trained on full dataset.

--- 5. Results: Standard Test Set ---
  > MFL (Token only)                    Accuracy: 0.9531
  > MFL (Token + POS)                   Accuracy: 0.9653
  > SpaCy Baseline                      Accuracy: 0.8340

--- 6. Results: Gallica (Historical) ---
  > MFL token → lemma                   Accuracy: 0.4812
  > MFL (token, PoS) → lemma            Accuracy: 0.4917
  > SpaCy Baseline                      Accuracy: 0.4910

--- Done ---


In [53]:
# --- EVALUATION: STANDARD TEST SET (With Known/Unknown Table) ---
print("\n--- 7. Evaluating Standard Test Set ---")

# Metric helper: overall, known-token and unknown-token accuracy


def get_metrics(y_true, y_pred, tokens, training_tokens_set):
    # Compare positionally so the result never depends on DataFrame index alignment
    y_true = pd.Series(y_true).to_numpy()
    y_pred = pd.Series(y_pred).to_numpy()

    # 1. Overall Accuracy
    acc = accuracy_score(y_true, y_pred)

    # 2. Known vs Unknown Mask (a token is "known" if it occurs in the training set)
    is_known = pd.Series(tokens).isin(training_tokens_set).to_numpy()

    # 3. Known Accuracy
    if is_known.sum() > 0:
        acc_known = accuracy_score(y_true[is_known], y_pred[is_known])
    else:
        acc_known = 0.0

    # 4. Unknown Accuracy
    if (~is_known).sum() > 0:
        acc_unk = accuracy_score(y_true[~is_known], y_pred[~is_known])
    else:
        acc_unk = 0.0

    return acc, acc_known, acc_unk


results_test = []
train_tokens_set = set(full_train_df["token"])  # Define known vocabulary

# 1. Model A (Token -> Lemma)
preds_a = predict_model_a(model_a, test_df["token"], fallback=best_strat)
acc, known, unk = get_metrics(
    test_df["lemma"], preds_a, test_df["token"], train_tokens_set)
results_test.append(["MFL token → lemma", acc, known, unk])

# 2. Model B (Token + POS -> Lemma)
preds_b = predict_model_b(
    model_b, test_df["token"], test_df["pos"], fallback=best_strat)
acc, known, unk = get_metrics(
    test_df["lemma"], preds_b, test_df["token"], train_tokens_set)
results_test.append(["MFL (token, PoS) → lemma", acc, known, unk])

# 3. SpaCy Baseline
preds_spacy = [spacy_lemma(t) for t in test_df["token"]]
acc, known, unk = get_metrics(
    test_df["lemma"], preds_spacy, test_df["token"], train_tokens_set)
results_test.append(["SpaCy Baseline", acc, known, unk])

# --- Display Table ---
df_results_test = pd.DataFrame(results_test, columns=[
                               "Model", "Overall Accuracy", "Known Accuracy", "Unknown Accuracy"])
print(df_results_test.round(4).to_markdown(index=False))


--- 7. Evaluating Standard Test Set ---
| Model                    |   Overall Accuracy |   Known Accuracy |   Unknown Accuracy |
|:-------------------------|-------------------:|-----------------:|-------------------:|
| MFL token → lemma        |             0.9531 |           0.9694 |             0.6988 |
| MFL (token, PoS) → lemma |             0.9653 |           0.9824 |             0.6988 |
| SpaCy Baseline           |             0.834  |           0.8427 |             0.6988 |


### Results — Summary

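The most-frequent-lemma (MFL) models outperform spaCy on the standard test set: the best model (token + PoS) reaches 96.53% against 83.40% for spaCy, a gap of about 13.1 points. Note that the spaCy baseline in this notebook lemmatizes each token in isolation (`spacy_lemma` is called on single tokens), so it does not benefit from sentence context.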

---

### Standard test set — Known vs Unknown

| Model | Known accuracy | Unknown accuracy | Overall |
|---|---:|---:|---:|
| MFL (Token) | 96.94% | 69.88% | 95.31% |
| MFL (Token + PoS) | 98.24% | 69.88% | 96.53% |
| SpaCy Baseline | 84.27% | 69.88% | 83.40% |

Key points:
- Dictionary-based models are nearly perfect on known tokens (≥96.9%).
- PoS information yields a clear improvement on known tokens (about +1.3 points).
- Unknown-token accuracy is identical across models (69.88%) because the selected fallback is spaCy: for a token absent from the training set, all three configurations return spaCy's lemma.
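
The last point can be checked on a toy case. In the sketch below, `toy_fallback` is a hypothetical stand-in for spaCy (plain lowercasing) and the two dictionaries are invented; the only claim is structural: when a token is missing from both dictionaries, both models delegate to the same fallback and therefore agree.

```python
def predict(model, key, token, fallback):
    """Dictionary lookup with a shared fallback for missing keys."""
    return model[key] if key in model else fallback(token)


# Stand-in for spaCy so the snippet has no dependencies (hypothetical rule).
def toy_fallback(token):
    return token.lower()


model_tok = {"chats": "chat"}
model_tok_pos = {("chats", "N"): "chat"}

# Tokens that appear in neither dictionary.
oov = [("Chevaux", "N"), ("mangeront", "V")]
preds_tok = [predict(model_tok, t, t, toy_fallback) for t, _ in oov]
preds_tok_pos = [predict(model_tok_pos, (t, p), t, toy_fallback) for t, p in oov]

print(preds_tok == preds_tok_pos)
```

One nuance: the known/unknown mask in `get_metrics` is token-based, so a known token seen with a PoS that never occurred in training still counts as "known", even though the token + PoS model falls back for it.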

---

### Gallica (historical) evaluation

| Model | Known accuracy | Unknown accuracy | Overall |
|---|---:|---:|---:|
| MFL (Token) | 82.31% | 8.44% | 48.12% |
| MFL (Token + PoS) | 84.27% | 8.44% | 49.17% |
| SpaCy Baseline | 84.14% | 8.44% | 49.10% |

Key points:
- Overall accuracy drops sharply (to about 49%) because of the very high OOV rate (archaic spellings).
- Known-token accuracy remains fairly high (82-84%), but unknown-token accuracy collapses (8.44%).
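
The OOV rate itself is not printed by the notebook, but it can be backed out from the reported accuracies, since overall accuracy is a weighted average of known and unknown accuracy. A small sketch (the helper name `known_share` is hypothetical):

```python
def known_share(overall, known, unknown):
    """Solve overall = s * known + (1 - s) * unknown for the known-token share s."""
    return (overall - unknown) / (known - unknown)


# Accuracies copied from the two tables above (token-only model).
standard = known_share(0.9531, 0.9694, 0.6988)
gallica = known_share(0.4812, 0.8231, 0.0844)

print(f"standard test: {standard:.1%} known, {1 - standard:.1%} OOV")
print(f"Gallica:       {gallica:.1%} known, {1 - gallica:.1%} OOV")
```

With the figures above, this gives a known-token share of about 94% on the standard test set and about 54% on Gallica, so close to half of the Gallica tokens never occur in the training data. The other two rows of each table lead to the same shares, up to rounding.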

---

### Short conclusions & next steps
- MFL is a very strong baseline when train/test domains match; PoS helps for disambiguation.
- Main limitation: out-of-vocabulary generalization — current fallback (spaCy) caps unknown performance.
- Remedies to try: subword/character models, augment training with historical variants, or train a neural seq2seq / copy-capable model as fallback.
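
As a first step towards a character-level fallback, one could learn suffix rewrite rules from the training pairs. The sketch below is not part of the notebook and has not been evaluated on the course data: the function names, the `max_suffix` limit and the sample pairs are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def learn_suffix_rules(pairs, max_suffix=4):
    """Count, for each token ending, how the gold lemma rewrites it."""
    rules = defaultdict(Counter)
    for token, lemma in pairs:
        # Length of the longest common prefix of token and lemma.
        k = 0
        while k < min(len(token), len(lemma)) and token[k] == lemma[k]:
            k += 1
        tok_suf, lem_suf = token[k:], lemma[k:]
        if 0 < len(tok_suf) <= max_suffix:
            rules[tok_suf][lem_suf] += 1
    # Keep the most frequent rewrite for each ending.
    return {suf: c.most_common(1)[0][0] for suf, c in rules.items()}


def apply_suffix_rules(token, rules, max_suffix=4):
    """Rewrite the longest known ending; return the token unchanged otherwise."""
    for n in range(min(max_suffix, len(token) - 1), 0, -1):
        suf = token[-n:]
        if suf in rules:
            return token[:-n] + rules[suf]
    return token


# Invented training pairs, for illustration only.
pairs = [
    ("chantait", "chanter"), ("parlait", "parler"), ("dansait", "danser"),
    ("chats", "chat"), ("tables", "table"),
]
rules = learn_suffix_rules(pairs)

for tok in ["marchait", "livres", "et"]:
    print(tok, "->", apply_suffix_rules(tok, rules))
```

Rules this simple over-generalize (here "fait" would become "fer"), so in practice they would need frequency thresholds or PoS conditioning, and would still have to be validated against the spaCy fallback on held-out data.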

## Discussion

### Baseline: Most-Frequent Lemma

The simplicity of the most-frequent-lemma approach is a strength when the training and testing domains match: on the standard test set, the token-only model reaches 96.9% accuracy on known tokens (95.3% overall). Its main limitation is poor performance on out-of-vocabulary tokens, particularly under domain shift (Gallica), where neither the dictionary lookup nor a fallback built for modern French is sufficient.

### PoS Tags

Including PoS tags provides a clear benefit for known words: accuracy on known tokens improves from 96.9% to 98.2%. The grammatical category disambiguates surface forms that map to different lemmas depending on their PoS, which the token-only dictionary cannot separate.

### Conclusion:
We are not using Deep Learning (DL) for this task of lemmatization of French words because:

1. **Lexical Determinism**: In standard French, lemmatization is highly deterministic: the vast majority of tokens map to a single lemma. Even ambiguous forms like "s", which can be a verb or a noun, can be almost perfectly resolved with a simple Part-of-Speech (PoS) tag. A lookup table (dictionary) therefore captures these direct mappings immediately, whereas a neural network would have to be trained to memorize the same pairs, essentially reinventing the dictionary at a much higher computational cost.

2. **Domain Matching**: The training and testing datasets come from the same domain and follow the same annotation standards. Deep learning is most useful for generalization, that is, inferring rules for unseen data (e.g., guessing the lemma of a made-up word from its suffix). In this exercise, however, the test-set vocabulary overlaps heavily with the training set (high "known word" percentage), so the generalization power of a neural network yields diminishing returns. The dictionary model already achieves ~96.5% accuracy on the standard test set, leaving little room for improvement.

3. **Efficiency and Complexity**: *Training time*: a dictionary model is built in a single pass over the training data on a standard CPU, whereas a sequence-to-sequence RNN requires significant training time, hyperparameter tuning (learning rate, hidden dimensions, layers), and ideally a GPU. *Inference speed*: a dictionary lookup is O(1) on average (one hash-table access), whereas an RNN performs matrix multiplications at every character (on the order of h² operations per step for a hidden size h), which is typically much slower without specialized hardware.

4. **Handling Unknown Words**: The main theoretical advantage of a character-level neural network is handling Out-of-Vocabulary (OOV) words by learning morphological patterns (e.g., seeing "-ait" implies an imperfect verb). However, in this specific exercise, the dictionary model—paired with a simple fallback strategy (like using spaCy for unknown words)—already achieved a competitive baseline. While a custom RNN could potentially outperform the spaCy fallback on unknown words, the engineering effort required to tune it to beat a pre-trained industrial model is often disproportionate to the gain.
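
The "lexical determinism" argument in point 1 can be quantified directly on the training data. The sketch below uses invented rows; the helper `ambiguity_rate` is hypothetical, and running it on `full_train_df` would be needed to back the claim with an actual figure.

```python
from collections import defaultdict


def ambiguity_rate(rows):
    """Share of rows whose token, or (token, pos) key, maps to more than one lemma."""
    lemmas_tok = defaultdict(set)
    lemmas_tok_pos = defaultdict(set)
    for tok, pos, lem in rows:
        lemmas_tok[tok].add(lem)
        lemmas_tok_pos[(tok, pos)].add(lem)
    n = len(rows)
    amb_tok = sum(len(lemmas_tok[t]) > 1 for t, _, _ in rows) / n
    amb_tok_pos = sum(len(lemmas_tok_pos[(t, p)]) > 1 for t, p, _ in rows) / n
    return amb_tok, amb_tok_pos


# Invented (token, pos, lemma) rows, for illustration only.
rows = [
    ("portes", "N", "porte"), ("portes", "V", "porter"),
    ("maison", "N", "maison"), ("est", "V", "être"),
]
print(ambiguity_rate(rows))
```

A low first value supports the dictionary approach as such; the gap between the two values indicates how much of the remaining ambiguity the PoS tag can resolve.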

Therefore, using a dictionary-based approach is a good choice for this specific task.
