
# NLP Project: Evaluating an OpenNMT Neural Machine Translation Engine  
**Course:** Natural Language Processing (NLP)  
**Instructor:** Nasredine SEMMAR  
**Academic year:** 2024-2025  
**Authors:** Mark Salloum, Galust Buniatyan

This notebook presents the full processing pipeline built for the project. It is divided into three parts:

1. **Part I: Data preparation and toy-ende example**  
   (Checking the installation and running a small toy corpus end to end)

2. **Part II: Training and evaluation on inflected-form corpora**  
   (Using non-lemmatized Europarl and EMEA, cleaned with Moses)

3. **Part III: Lemmatization, retraining, and evaluation on lemmatized corpora**  
   (Preprocessing with WordNetLemmatizer and FrenchLefffLemmatizer, retraining, and BLEU evaluation)


## Part I: Data preparation and toy-ende example

### Environment check and dependency installation

In [1]:
import sys
import torch
print(f"Python version: {sys.version}")
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"CUDA device: {torch.cuda.get_device_name(0)}")

Python version: 3.11.11 (main, Dec  4 2024, 08:55:07) [GCC 11.4.0]
PyTorch version: 2.2.2+cu121
CUDA available: True
CUDA device: Tesla T4


In [3]:
!pip install --upgrade pip
!pip install --upgrade gensim
!pip install OpenNMT-py
!pip install numpy



In [4]:
import onmt
print(f"OpenNMT-py version: {onmt.__version__}")

OpenNMT-py version: 3.5.1


### Toy-Ende example  
We download and extract the toy-ende corpus to verify the OpenNMT installation.


In [12]:
!wget https://s3.amazonaws.com/opennmt-trainingdata/toy-ende.tar.gz
!tar xf toy-ende.tar.gz

--2025-03-06 08:33:38--  https://s3.amazonaws.com/opennmt-trainingdata/toy-ende.tar.gz
Resolving s3.amazonaws.com (s3.amazonaws.com)... 16.182.73.64, 54.231.226.128, 54.231.134.0, ...
Connecting to s3.amazonaws.com (s3.amazonaws.com)|16.182.73.64|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1662081 (1.6M) [application/x-gzip]
Saving to: ‘toy-ende.tar.gz’


2025-03-06 08:33:39 (3.76 MB/s) - ‘toy-ende.tar.gz’ saved [1662081/1662081]



### Building the vocabulary and training on Toy-Ende

In [14]:
%%writefile toy_en_de.yaml
# toy_en_de.yaml

## Where the samples will be written
save_data: toy-ende/run/example
## Where the vocab(s) will be written
src_vocab: toy-ende/run/example.vocab.src
tgt_vocab: toy-ende/run/example.vocab.tgt
# Prevent overwriting existing files in the folder
overwrite: False

# Corpus opts:
data:
    corpus_1:
        path_src: toy-ende/src-train.txt
        path_tgt: toy-ende/tgt-train.txt
    valid:
        path_src: toy-ende/src-val.txt
        path_tgt: toy-ende/tgt-val.txt


# The vocabulary files declared above (src_vocab, tgt_vocab) are reused for training

# Train on a single GPU
world_size: 1
#gpu_ranks: [0]

# Where to save the checkpoints
save_model: toy-ende/run/model
save_checkpoint_steps: 500
train_steps: 1000
valid_steps: 500

Writing toy_en_de.yaml


In [17]:
!onmt_build_vocab -config  'toy_en_de.yaml' -n_sample 10000

Corpus corpus_1's weight should be given. We default it to 1 for you.
[2025-03-06 08:37:05,021 INFO] Counter vocab from 10000 samples.
[2025-03-06 08:37:05,021 INFO] Build vocab on 10000 transformed examples/corpus.
[2025-03-06 08:37:05,409 INFO] Counters src: 24995
[2025-03-06 08:37:05,409 INFO] Counters tgt: 35816


In [18]:
!CUDA_VISIBLE_DEVICES=0 onmt_train \
    -config 'toy_en_de.yaml' \
    -gpu_ranks 0

[2025-03-06 08:37:15,630 INFO] Missing transforms field for corpus_1 data, set to default: [].
[2025-03-06 08:37:15,630 INFO] Missing transforms field for valid data, set to default: [].
[2025-03-06 08:37:15,630 INFO] Parsed 2 corpora from -data.
[2025-03-06 08:37:15,630 INFO] Get special vocabs from Transforms: {'src': [], 'tgt': []}.
[2025-03-06 08:37:15,783 INFO] The first 10 tokens of the vocabs are:['<unk>', '<blank>', '<s>', '</s>', 'the', ',', '.', 'of', 'and', 'to']
[2025-03-06 08:37:15,784 INFO] The decoder start token is: <s>
[2025-03-06 08:37:15,784 INFO] Building model...
[2025-03-06 08:37:19,182 INFO] Switching model to float32 for amp/apex_amp
[2025-03-06 08:37:19,183 INFO] Non quantized layer compute is fp32
[2025-03-06 08:37:19,503 INFO] NMTModel(
  (encoder): RNNEncoder(
    (embeddings): Embeddings(
      (make_embedding): Sequential(
        (emb_luts): Elementwise(
          (0): Embedding(25000, 500, padding_idx=1)
        )
      )
      (dropout): Dropout(p=0.3, 

### Translation and evaluation on Toy-Ende

In [19]:
!onmt_translate -model toy-ende/run/model_step_1000.pt -src toy-ende/src-test.txt -output toy-ende/pred_1000.txt -gpu 0 -verbose

Streaming output truncated to the last 5000 lines.
PRED 1738: <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> .
PRED SCORE: -1.4055

[2025-03-06 08:44:46,021 INFO] 
SENT 1739: ['Pope', 'Francis', 'will', 'create', 'new', '<unk>', 'of', 'the', 'Catholic', 'Church', 'for', 'his', 'first', 'time', 'on', 'February', '22', ',', 'the', '<unk>', 'announced', 'Thursday', '.']
PRED 1739: <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> .
PRED SCORE: -1.3048

[2025-03-06 08:44:46,021 INFO] 
SENT 1740: ['<unk>', 'are', 'the', '<unk>', 'clergy', 'in', 'the', 'Catholic', 'Church', 'below', 'the', '<unk>', ',', 'and', 'they', '&apos;re', 'the', 'ones', 'who', '<unk>', '<unk>', ',', 'so', 'Francis', 'will', 'be', 'appointing', 'his', 'first', 'group', 'of', 'men', 'who', 'will', 'ultimately', 'help', 'choose', 'his', 'successor', '.']
PRED 1740: <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> 

In [21]:
!pip install sacrebleu
!sacrebleu toy-ende/tgt-test.txt \
    -i toy-ende/pred_1000.txt -m bleu -b --force

0.0
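
A BLEU of 0.0 after only 1,000 training steps is consistent with the verbose output above, where the predictions consist almost entirely of `<unk>`. The sketch below measures that share of unknown tokens. The helper name `unk_rate` is hypothetical, and the sample line is copied from `PRED 1738`; to measure the whole file, one could pass `open("toy-ende/pred_1000.txt", encoding="utf-8")` instead.

```python
def unk_rate(lines, unk_token="<unk>"):
    """Return the fraction of whitespace-separated tokens equal to unk_token."""
    total = 0
    unknown = 0
    for line in lines:
        tokens = line.split()
        total += len(tokens)
        unknown += sum(1 for tok in tokens if tok == unk_token)
    # Avoid a division by zero on empty input
    return unknown / total if total else 0.0


# Sample prediction copied from the verbose output (PRED 1738)
sample = ["<unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> ."]
print(f"Share of <unk> tokens: {unk_rate(sample):.2f}")
```

A rate close to 1 suggests the model has not yet learned to emit real target words, which is expected for this quick installation check.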

## Part II: Training and evaluation on inflected-form corpora

### Execution instructions

This notebook supports two execution modes, depending on your environment:

1. **Running with Google Drive mounted**  
   If you would rather not wait for the data to be downloaded and cleaned, or retrain the model (which can take about an hour), mount your Google Drive.  
   In that case, set the `MOUNT_DRIVE` variable to **True**.  
   The data files, the cleaning code (Moses), and the trained models are already on the Drive, so you can skip those steps and go straight to training (or straight to evaluation if you prefer to use the existing models).

2. **Running locally**  
   If your Drive is not set up, or if you prefer to download and clean the data yourself, leave `MOUNT_DRIVE` set to **False**.  
   In this mode, the notebook creates the local directory `Projet_OpenNMT` and runs the download and cleaning cells before training the model.

In short, setting `MOUNT_DRIVE` to **True** avoids the download and training waiting times, because the notebook then uses the files and models already saved on your Drive.


In [33]:
# Choose the execution mode
MOUNT_DRIVE = False  # True to use Google Drive, False to use local files

if MOUNT_DRIVE:
    from google.colab import drive
    drive.mount('/content/drive')
    # Path to the project folder on Drive (adapt if needed)
    BASE_PATH = "/content/drive/My Drive/Projet_OpenNMT"
else:
    # Create the local folder if it does not exist and set the local path
    !mkdir -p /content/Projet_OpenNMT
    BASE_PATH = "/content/Projet_OpenNMT"

print("BASE_PATH =", BASE_PATH)

# Move into the project directory
%cd $BASE_PATH



BASE_PATH = /content/Projet_OpenNMT
/content/Projet_OpenNMT


In [34]:
!mkdir -p data/Europarl data/EMEA

# Download Europarl
!wget -O data/Europarl/en-fr.txt.zip https://object.pouta.csc.fi/OPUS-Europarl/v8/moses/en-fr.txt.zip
!unzip data/Europarl/en-fr.txt.zip -d data/Europarl

# Download EMEA
!wget -O data/EMEA/en-fr.txt.zip https://object.pouta.csc.fi/OPUS-EMEA/v3/moses/en-fr.txt.zip
!unzip data/EMEA/en-fr.txt.zip -d data/EMEA

--2025-03-06 09:02:38--  https://object.pouta.csc.fi/OPUS-Europarl/v8/moses/en-fr.txt.zip
Resolving object.pouta.csc.fi (object.pouta.csc.fi)... 86.50.254.18, 86.50.254.19
Connecting to object.pouta.csc.fi (object.pouta.csc.fi)|86.50.254.18|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 216449599 (206M) [application/zip]
Saving to: ‘data/Europarl/en-fr.txt.zip’


2025-03-06 09:02:54 (13.6 MB/s) - ‘data/Europarl/en-fr.txt.zip’ saved [216449599/216449599]

Archive:  data/Europarl/en-fr.txt.zip
replace data/Europarl/README? [y]es, [n]o, [A]ll, [N]one, [r]ename:
--2025-03-06 09:03:15--  https://object.pouta.csc.fi/OPUS-EMEA/v3/moses/en-fr.txt.zip
Resolving object.pouta.csc.fi (object.pouta.csc.fi)... 86.50.254.18, 86.50.254.19
Connecting to object.pouta.csc.fi (object.pouta.csc.fi)|86.50.254.18|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 36819447 (35M) [application/zip]
Saving to: ‘data/EMEA/en-fr.txt.zip’


2025-03-06 09:03:20 (9.18

In [35]:
# -p creates missing parents and does not fail if the directory already exists
!mkdir -p corpus_splitFinal/train corpus_splitFinal/test corpus_splitFinal/dev corpus_splitFinal/run

# EUROPARL - Train (100K), Dev (3.75K), Test (500)
!head -100000 data/Europarl/Europarl.en-fr.en > corpus_splitFinal/train/Europarl_train_100k.en
!head -100000 data/Europarl/Europarl.en-fr.fr > corpus_splitFinal/train/Europarl_train_100k.fr

!tail -n +100001 data/Europarl/Europarl.en-fr.en | head -3750 > corpus_splitFinal/dev/Europarl_dev_3750.en
!tail -n +100001 data/Europarl/Europarl.en-fr.fr | head -3750 > corpus_splitFinal/dev/Europarl_dev_3750.fr

!tail -n +103751 data/Europarl/Europarl.en-fr.en | head -500 > corpus_splitFinal/test/Europarl_test_500.en
!tail -n +103751 data/Europarl/Europarl.en-fr.fr | head -500 > corpus_splitFinal/test/Europarl_test_500.fr

# EMEA - Train (10K), Test (500); lines 10,001 to 13,750 are skipped
!head -10000 data/EMEA/EMEA.en-fr.en > corpus_splitFinal/train/Emea_train_10k.en
!head -10000 data/EMEA/EMEA.en-fr.fr > corpus_splitFinal/train/Emea_train_10k.fr

!tail -n +13751 data/EMEA/EMEA.en-fr.en | head -500 > corpus_splitFinal/test/Emea_test_500.en
!tail -n +13751 data/EMEA/EMEA.en-fr.fr | head -500 > corpus_splitFinal/test/Emea_test_500.fr

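
The `head` / `tail -n +K | head -N` pattern used for the split selects contiguous, non-overlapping line ranges. As a rough illustration, the same selection can be written in Python. The helper `take_lines` is a hypothetical name and is not part of the notebook.

```python
def take_lines(lines, start, count):
    """Return `count` lines starting at 1-based line `start`.

    Mirrors `tail -n +start FILE | head -count`.
    """
    return lines[start - 1 : start - 1 + count]


# Small stand-in corpus: 20 numbered "sentences"
corpus = [f"sentence {i}" for i in range(1, 21)]

train = take_lines(corpus, 1, 10)   # like: head -10
dev = take_lines(corpus, 11, 5)     # like: tail -n +11 | head -5
test = take_lines(corpus, 16, 3)    # like: tail -n +16 | head -3

# The three splits do not share any line
assert not (set(train) & set(dev)) and not (set(dev) & set(test))
print(train[-1], "|", dev[0], "|", test[0])
```

With the offsets used in the cell, Europarl lines 1 to 100,000 go to train, 100,001 to 103,750 to dev, and 103,751 to 104,250 to test.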


### Cleaning the corpora with Moses  
Here we use the Moses cleaning script (`clean-corpus-n.perl`) to prepare the Europarl and EMEA corpora, keeping sentence pairs of 1 to 80 tokens.  
Paths are relative to the `corpus_splitFinal` folder under `BASE_PATH`.

In [41]:
import os

In [64]:
# Install Moses
!git clone https://github.com/moses-smt/mosesdecoder.git
# Full path to the mosesdecoder folder
MOSES_HOME = os.path.join(BASE_PATH, "mosesdecoder")
print("MOSES_HOME =", MOSES_HOME)

# 1) Europarl TRAIN (100K)
!perl "$MOSES_HOME/scripts/training/clean-corpus-n.perl" \
  "$BASE_PATH/corpus_splitFinal/train/Europarl_train_100k" \
  fr en \
  "$BASE_PATH/corpus_splitFinal/train/Europarl_train_100k.clean" \
  1 80


# 2) Europarl DEV (3750)
!perl "$MOSES_HOME/scripts/training/clean-corpus-n.perl" \
  "$BASE_PATH/corpus_splitFinal/dev/Europarl_dev_3750" \
  fr en \
  "$BASE_PATH/corpus_splitFinal/dev/Europarl_dev_3750.clean" \
  1 80

# 3) Europarl TEST (500)
!perl "$MOSES_HOME/scripts/training/clean-corpus-n.perl" \
  "$BASE_PATH/corpus_splitFinal/test/Europarl_test_500" \
  fr en \
  "$BASE_PATH/corpus_splitFinal/test/Europarl_test_500.clean" \
  1 80

# 4) Emea TRAIN (10K)
!perl "$MOSES_HOME/scripts/training/clean-corpus-n.perl" \
  "$BASE_PATH/corpus_splitFinal/train/Emea_train_10k" \
  fr en \
  "$BASE_PATH/corpus_splitFinal/train/Emea_train_10k.clean" \
  1 80

# 5) Emea TEST (500)
!perl "$MOSES_HOME/scripts/training/clean-corpus-n.perl" \
  "$BASE_PATH/corpus_splitFinal/test/Emea_test_500" \
  fr en \
  "$BASE_PATH/corpus_splitFinal/test/Emea_test_500.clean" \
  1 80

fatal: destination path 'mosesdecoder' already exists and is not an empty directory.
MOSES_HOME = /content/Projet_OpenNMT/mosesdecoder
clean-corpus.perl: processing /content/Projet_OpenNMT/corpus_splitFinal/train/Europarl_train_100k.fr & .en to /content/Projet_OpenNMT/corpus_splitFinal/train/Europarl_train_100k.clean, cutoff 1-80, ratio 9
..........(100000)
Input sentences: 100000  Output sentences:  98965
clean-corpus.perl: processing /content/Projet_OpenNMT/corpus_splitFinal/dev/Europarl_dev_3750.fr & .en to /content/Projet_OpenNMT/corpus_splitFinal/dev/Europarl_dev_3750.clean, cutoff 1-80, ratio 9

Input sentences: 3750  Output sentences:  3698
clean-corpus.perl: processing /content/Projet_OpenNMT/corpus_splitFinal/test/Europarl_test_500.fr & .en to /content/Projet_OpenNMT/corpus_splitFinal/test/Europarl_test_500.clean, cutoff 1-80, ratio 9

Input sentences: 500  Output sentences:  493
clean-corpus.perl: processing /content/Projet_OpenNMT/corpus_splitFinal/train/Emea_train_10k.fr & 
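
To make the `1 80` arguments and the `ratio 9` shown in the log more concrete, here is a simplified Python approximation of the filtering rule. It is only a sketch under stated assumptions: the function name `keep_pair` is hypothetical, tokens are taken to be whitespace-separated, and the real Perl script performs additional normalization that is not reproduced here.

```python
def keep_pair(src, tgt, min_len=1, max_len=80, max_ratio=9.0):
    """Approximate the length and ratio filter applied to one sentence pair."""
    n_src, n_tgt = len(src.split()), len(tgt.split())
    # Drop pairs where either side is empty, too short, or too long
    if min(n_src, n_tgt) == 0:
        return False
    if not (min_len <= n_src <= max_len and min_len <= n_tgt <= max_len):
        return False
    # Drop pairs whose lengths differ by more than the allowed ratio
    return max(n_src, n_tgt) / min(n_src, n_tgt) <= max_ratio


pairs = [
    ("the house is small", "la maison est petite"),  # kept
    ("", "phrase sans source"),                      # dropped: empty side
    (" ".join(["word"] * 81), "trop long"),          # dropped: more than 80 tokens
    (" ".join(["word"] * 10), "court"),              # dropped: ratio 10 > 9
]
kept = [pair for pair in pairs if keep_pair(*pair)]
print(f"Input sentences: {len(pairs)}  Output sentences: {len(kept)}")
```

Filtering both sides together keeps the source and target files aligned line by line, which is why the cleaned `.en` and `.fr` files have the same number of sentences.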

### Training the model on inflected-form corpora with Europarl





In [50]:
# Path to the YAML configuration file (adapt to your environment)
config_path = BASE_PATH+'/europarl_config.yaml'

if not os.path.exists(config_path):
    with open(config_path, 'w', encoding='utf-8') as f:
        f.write("""# Configuration for Europarl training
save_data: corpus_splitFinal/run/europarl
## Vocabulary
src_vocab: corpus_splitFinal/run/europarl.vocab.src
tgt_vocab: corpus_splitFinal/run/europarl.vocab.tgt
# Allow existing files to be overwritten
overwrite: True

# Corpus:
data:
    corpus_1:
        path_src: corpus_splitFinal/train/Europarl_train_100k.clean.en
        path_tgt: corpus_splitFinal/train/Europarl_train_100k.clean.fr
    valid:
        path_src: corpus_splitFinal/dev/Europarl_dev_3750.clean.en
        path_tgt: corpus_splitFinal/dev/Europarl_dev_3750.clean.fr

# Where to save the checkpoints
save_model: corpus_splitFinal/run/model
save_checkpoint_steps: 500
train_steps: 10000
valid_steps: 500

# GPU configuration
world_size: 1
gpu_ranks: [0]  # Train on GPU 0
""")
    print(f"File {config_path} created.")
else:
    print(f"File {config_path} already exists; not recreating it.")

File /content/Projet_OpenNMT/europarl_config.yaml created.


In [51]:
!onmt_build_vocab \
    -config 'europarl_config.yaml' \
    -n_sample 10000

Corpus corpus_1's weight should be given. We default it to 1 for you.
[2025-03-06 09:12:58,777 INFO] Counter vocab from 10000 samples.
[2025-03-06 09:12:58,777 INFO] Build vocab on 10000 transformed examples/corpus.
[2025-03-06 09:12:59,153 INFO] Counters src: 18409
[2025-03-06 09:12:59,153 INFO] Counters tgt: 23490


In [None]:
# Train the model
!onmt_train \
    -config 'europarl_config.yaml'

[2025-03-05 00:34:00,910 INFO] Missing transforms field for corpus_1 data, set to default: [].
[2025-03-05 00:34:00,910 INFO] Missing transforms field for valid data, set to default: [].
[2025-03-05 00:34:00,911 INFO] Parsed 2 corpora from -data.
[2025-03-05 00:34:00,911 INFO] Get special vocabs from Transforms: {'src': [], 'tgt': []}.
[2025-03-05 00:34:00,981 INFO] The first 10 tokens of the vocabs are:['<unk>', '<blank>', '<s>', '</s>', 'the', 'of', 'to', 'and', 'in', 'is']
[2025-03-05 00:34:00,981 INFO] The decoder start token is: <s>
[2025-03-05 00:34:00,981 INFO] Building model...
[2025-03-05 00:34:04,013 INFO] Switching model to float32 for amp/apex_amp
[2025-03-05 00:34:04,013 INFO] Non quantized layer compute is fp32
[2025-03-05 00:34:04,461 INFO] NMTModel(
  (encoder): RNNEncoder(
    (embeddings): Embeddings(
      (make_embedding): Sequential(
        (emb_luts): Elementwise(
          (0): Embedding(18416, 500, padding_idx=1)
        )
      )
      (dropout): Dropout(p=0.3
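
The embedding sizes in the training logs are slightly larger than the token counts reported by `onmt_build_vocab` (18409 counted tokens versus `Embedding(18416, ...)` here, and 24995 versus 25000 for toy-ende). These numbers are consistent with adding the four special tokens listed in the log (`<unk>`, `<blank>`, `<s>`, `</s>`) and rounding up to a multiple of 8. This is an inference from the logs, not something confirmed against the OpenNMT-py source, and the helper `padded_vocab_size` is hypothetical.

```python
def padded_vocab_size(n_tokens, n_specials=4, multiple=8):
    """Add the special tokens, then round up to the next multiple.

    Assumption: 4 special tokens and a size multiple of 8, inferred from the logs.
    """
    size = n_tokens + n_specials
    # Ceiling division without floating point
    return -(-size // multiple) * multiple


for counted in (24995, 18409):
    print(counted, "->", padded_vocab_size(counted))
```

If the assumption holds, the few extra embedding rows are padding entries that are never selected during training.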

### Translation and evaluation on inflected-form corpora with Europarl

In [None]:
!onmt_translate \
    -model "$BASE_PATH/corpus_splitFinal/run/model_step_10000.pt" \
    -src "$BASE_PATH/corpus_splitFinal/test/Europarl_test_500.clean.en" \
    -output "$BASE_PATH/corpus_splitFinal/test/Europarl_test_pred_run1.fr"



[2025-03-05 00:59:58,753 INFO] Loading checkpoint from /content/drive/My Drive/Projet_OpenNMT/corpus_splitFinal/run/model_step_10000.pt
[2025-03-05 00:59:59,504 INFO] Loading data into the model
[2025-03-05 01:00:52,273 INFO] PRED SCORE: -0.7734, PRED PPL: 2.17 NB SENTENCES: 493
Time w/o python interpreter load/terminate:  53.55417513847351
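
The two figures in the log line are related: `PRED PPL` is the exponential of the negated `PRED SCORE`, where the score is read as an average log-probability per predicted token. The short check below reproduces the reported value. The function name `perplexity` is only illustrative.

```python
import math


def perplexity(avg_log_prob):
    """Convert an average per-token log-probability into a perplexity."""
    return math.exp(-avg_log_prob)


pred_score = -0.7734  # PRED SCORE from the log above
print(f"PRED PPL: {perplexity(pred_score):.2f}")  # matches the 2.17 in the log
```

A lower perplexity means the model was more confident in its own output. It does not measure agreement with the reference translation, which is what BLEU is for.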


In [None]:
!sacrebleu "$BASE_PATH/corpus_splitFinal/test/Europarl_test_500.clean.fr" \
    -i "$BASE_PATH/corpus_splitFinal/test/Europarl_test_pred_run1.fr" \
    -m bleu \
    -b \
    --force

15.7

### Training the model on inflected-form corpora with Europarl + EMEA

In [52]:
# Path to the YAML configuration file (adapt to your environment)
config_path = BASE_PATH+'/europarl_emea_config.yaml'

if not os.path.exists(config_path):
    with open(config_path, 'w', encoding='utf-8') as f:
        f.write("""# Configuration for Europarl + EMEA training
save_data: corpus_splitFinal/run/europarl_emea
## Vocabulary
src_vocab: corpus_splitFinal/run/europarl_emea.vocab.src
tgt_vocab: corpus_splitFinal/run/europarl_emea.vocab.tgt
overwrite: True

data:
  corpus_1:
    path_src: corpus_splitFinal/train/Europarl_train_100k.clean.en
    path_tgt: corpus_splitFinal/train/Europarl_train_100k.clean.fr
  corpus_2:
    path_src: corpus_splitFinal/train/Emea_train_10k.clean.en
    path_tgt: corpus_splitFinal/train/Emea_train_10k.clean.fr
  valid:
    path_src: corpus_splitFinal/dev/Europarl_dev_3750.clean.en
    path_tgt: corpus_splitFinal/dev/Europarl_dev_3750.clean.fr

save_model: corpus_splitFinal/run/model_europarl_emea
save_checkpoint_steps: 500
train_steps: 10000
valid_steps: 500

# GPU configuration
world_size: 1
gpu_ranks: [0]
""")
    print(f"File {config_path} created.")
else:
    print(f"File {config_path} already exists; not recreating it.")

File /content/Projet_OpenNMT/europarl_emea_config.yaml created.


In [57]:
!onmt_build_vocab \
    -config 'europarl_emea_config.yaml' \
    -n_sample 10000

Corpus corpus_1's weight should be given. We default it to 1 for you.
Corpus corpus_2's weight should be given. We default it to 1 for you.
[2025-03-06 09:20:40,382 INFO] Counter vocab from 10000 samples.
[2025-03-06 09:20:40,382 INFO] Build vocab on 10000 transformed examples/corpus.
[2025-03-06 09:20:40,993 INFO] Counters src: 21834
[2025-03-06 09:20:40,993 INFO] Counters tgt: 27422


In [None]:
# Train the model
!onmt_train \
    -config 'europarl_emea_config.yaml'

[2025-03-05 10:05:11,709 INFO] Missing transforms field for corpus_1 data, set to default: [].
[2025-03-05 10:05:11,709 INFO] Missing transforms field for corpus_2 data, set to default: [].
[2025-03-05 10:05:11,710 INFO] Missing transforms field for valid data, set to default: [].
[2025-03-05 10:05:11,710 INFO] Parsed 3 corpora from -data.
[2025-03-05 10:05:11,711 INFO] Get special vocabs from Transforms: {'src': [], 'tgt': []}.
[2025-03-05 10:05:11,793 INFO] The first 10 tokens of the vocabs are:['<unk>', '<blank>', '<s>', '</s>', 'the', 'of', 'to', 'and', 'in', 'is']
[2025-03-05 10:05:11,793 INFO] The decoder start token is: <s>
[2025-03-05 10:05:11,793 INFO] Building model...
[2025-03-05 10:05:14,982 INFO] Switching model to float32 for amp/apex_amp
[2025-03-05 10:05:14,982 INFO] Non quantized layer compute is fp32
[2025-03-05 10:05:15,486 INFO] NMTModel(
  (encoder): RNNEncoder(
    (embeddings): Embeddings(
      (make_embedding): Sequential(
        (emb_luts): Elementwise(
     

### Translation and evaluation on inflected-form corpora with Europarl + EMEA

In [None]:
!onmt_translate \
    -model "$BASE_PATH/corpus_splitFinal/run/model_europarl_emea_step_10000.pt" \
    -src "$BASE_PATH/corpus_splitFinal/test/Europarl_test_500.clean.en" \
    -output "$BASE_PATH/corpus_splitFinal/test/Europarl_test_pred_run2.fr"



[2025-03-05 10:31:54,028 INFO] Loading checkpoint from /content/drive/My Drive/Projet_OpenNMT/corpus_splitFinal/run/model_europarl_emea_step_10000.pt
[2025-03-05 10:31:54,715 INFO] Loading data into the model
[2025-03-05 10:33:07,467 INFO] PRED SCORE: -0.6972, PRED PPL: 2.01 NB SENTENCES: 493
Time w/o python interpreter load/terminate:  73.47135281562805


In [None]:
!sacrebleu "$BASE_PATH/corpus_splitFinal/test/Europarl_test_500.clean.fr" \
    -i "$BASE_PATH/corpus_splitFinal/test/Europarl_test_pred_run2.fr" \
    -m bleu \
    -b \
    --force

17.5

In [None]:
!onmt_translate \
    -model "$BASE_PATH/corpus_splitFinal/run/model_europarl_emea_step_10000.pt" \
    -src "$BASE_PATH/corpus_splitFinal/test/Emea_test_500.clean.en" \
    -output "$BASE_PATH/corpus_splitFinal/test/Emea_test_pred_run2.fr"



[2025-03-05 10:35:11,025 INFO] Loading checkpoint from /content/drive/My Drive/Projet_OpenNMT/corpus_splitFinal/run/model_europarl_emea_step_10000.pt
[2025-03-05 10:35:11,702 INFO] Loading data into the model
[2025-03-05 10:35:38,709 INFO] PRED SCORE: -0.7852, PRED PPL: 2.19 NB SENTENCES: 500
Time w/o python interpreter load/terminate:  27.719173908233643


In [None]:
!sacrebleu "$BASE_PATH/corpus_splitFinal/test/Emea_test_500.clean.fr" \
    -i "$BASE_PATH/corpus_splitFinal/test/Emea_test_pred_run2.fr" \
    -m bleu \
    -b \
    --force

5.8

## Part III: Lemmatization, retraining, and evaluation on lemmatized corpora

### Installing and configuring the lemmatizers

#### For English (WordNetLemmatizer)


In [55]:
import nltk
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer

# Initialize the lemmatizer for English
lemmatizer = WordNetLemmatizer()


[nltk_data] Downloading package wordnet to /root/nltk_data...


#### For French (FrenchLefffLemmatizer)

In [58]:
!pip install git+https://github.com/ClaudeCoulombe/FrenchLefffLemmatizer.git


Collecting git+https://github.com/ClaudeCoulombe/FrenchLefffLemmatizer.git
  Cloning https://github.com/ClaudeCoulombe/FrenchLefffLemmatizer.git to /tmp/pip-req-build-7bmds0o7
  Running command git clone --filter=blob:none --quiet https://github.com/ClaudeCoulombe/FrenchLefffLemmatizer.git /tmp/pip-req-build-7bmds0o7
  Resolved https://github.com/ClaudeCoulombe/FrenchLefffLemmatizer.git to commit bc0ebd0135a6cc78f48ddf184069b4c0b9c017d8
  Preparing metadata (setup.py) ... done


In [59]:
from french_lefff_lemmatizer.french_lefff_lemmatizer import FrenchLefffLemmatizer

# Initialize the lemmatizer for French
french_lemmatizer = FrenchLefffLemmatizer()

### Lemmatization functions

In [60]:
def lemmatize_english_line(line):
    # Split the line into whitespace-separated tokens
    tokens = line.strip().split()
    # Lemmatize each token (WordNetLemmatizer treats tokens as nouns by default)
    lemmatized_tokens = [lemmatizer.lemmatize(token) for token in tokens]
    # Rebuild the sentence
    return ' '.join(lemmatized_tokens)


In [61]:
def lemmatize_french_line(line):
    tokens = line.strip().split()
    lemmatized_tokens = [french_lemmatizer.lemmatize(token) for token in tokens]
    return ' '.join(lemmatized_tokens)


In [62]:
def lemmatize_file_english(input_path, output_path):
    with open(input_path, 'r', encoding='utf-8') as fin, open(output_path, 'w', encoding='utf-8') as fout:
        for line in fin:
            fout.write(lemmatize_english_line(line) + '\n')

def lemmatize_file_french(input_path, output_path):
    with open(input_path, 'r', encoding='utf-8') as fin, open(output_path, 'w', encoding='utf-8') as fout:
        for line in fin:
            fout.write(lemmatize_french_line(line) + '\n')
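
The English and French helpers above differ only in which lemmatizer they call. One possible refactoring, shown here as a sketch and not as the notebook's actual code, passes the lemmatizing function as a parameter. The names `lemmatize_line` and `toy_lemmatize` are hypothetical, and the toy dictionary stands in for `lemmatizer.lemmatize` or `french_lemmatizer.lemmatize` so that the example runs without NLTK data.

```python
def lemmatize_line(line, lemmatize):
    """Lemmatize each whitespace-separated token of `line` with `lemmatize`."""
    return " ".join(lemmatize(token) for token in line.strip().split())


# Toy stand-in for a real lemmatizer: unknown tokens are returned unchanged
TOY_LEMMAS = {"cats": "cat", "houses": "house", "chevaux": "cheval"}


def toy_lemmatize(token):
    return TOY_LEMMAS.get(token, token)


print(lemmatize_line("the cats  sleep in houses\n", toy_lemmatize))
```

In the notebook, the same function could be called as `lemmatize_line(line, lemmatizer.lemmatize)` for English and `lemmatize_line(line, french_lemmatizer.lemmatize)` for French. Note that `WordNetLemmatizer.lemmatize` uses the noun part of speech unless a `pos` argument is given, so most inflected verbs are left unchanged by the English pass.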


### Applying lemmatization to the corpora

In [65]:
# Europarl train (100K)
lemmatize_file_english('corpus_splitFinal/train/Europarl_train_100k.clean.en',
                       'corpus_splitFinal/train/Europarl_train_100k.lem.en')
lemmatize_file_french('corpus_splitFinal/train/Europarl_train_100k.clean.fr',
                      'corpus_splitFinal/train/Europarl_train_100k.lem.fr')

# Europarl dev (3750)
lemmatize_file_english('corpus_splitFinal/dev/Europarl_dev_3750.clean.en',
                       'corpus_splitFinal/dev/Europarl_dev_3750.lem.en')
lemmatize_file_french('corpus_splitFinal/dev/Europarl_dev_3750.clean.fr',
                      'corpus_splitFinal/dev/Europarl_dev_3750.lem.fr')

# Europarl test (500)
lemmatize_file_english('corpus_splitFinal/test/Europarl_test_500.clean.en',
                       'corpus_splitFinal/test/Europarl_test_500.lem.en')
lemmatize_file_french('corpus_splitFinal/test/Europarl_test_500.clean.fr',
                      'corpus_splitFinal/test/Europarl_test_500.lem.fr')

# EMEA train (10K)
lemmatize_file_english('corpus_splitFinal/train/Emea_train_10k.clean.en',
                       'corpus_splitFinal/train/Emea_train_10k.lem.en')
lemmatize_file_french('corpus_splitFinal/train/Emea_train_10k.clean.fr',
                      'corpus_splitFinal/train/Emea_train_10k.lem.fr')

# EMEA test (500)
lemmatize_file_english('corpus_splitFinal/test/Emea_test_500.clean.en',
                       'corpus_splitFinal/test/Emea_test_500.lem.en')
lemmatize_file_french('corpus_splitFinal/test/Emea_test_500.clean.fr',
                      'corpus_splitFinal/test/Emea_test_500.lem.fr')


### Training on lemmatized corpora with Europarl




In [54]:
# Path to the YAML configuration file (adapt to your environment)
config_path = BASE_PATH+'/europarl_lemm_config.yaml'

if not os.path.exists(config_path):
    with open(config_path, 'w', encoding='utf-8') as f:
        f.write("""# Configuration for training on lemmatized corpora (Europarl)
save_data: corpus_splitFinal/run/europarl_lemmatized
## Vocabulary
src_vocab: corpus_splitFinal/run/europarl_lemmatized.vocab.src
tgt_vocab: corpus_splitFinal/run/europarl_lemmatized.vocab.tgt
overwrite: True

data:
  corpus_1:
    # Use the lemmatized files
    path_src: corpus_splitFinal/train/Europarl_train_100k.lem.en
    path_tgt: corpus_splitFinal/train/Europarl_train_100k.lem.fr
  valid:
    path_src: corpus_splitFinal/dev/Europarl_dev_3750.lem.en
    path_tgt: corpus_splitFinal/dev/Europarl_dev_3750.lem.fr

save_model: corpus_splitFinal/run/model_europarl_lemmatized
save_checkpoint_steps: 2000
train_steps: 10000
valid_steps: 500

# GPU configuration
world_size: 1
gpu_ranks: [0]
""")
    print(f"File {config_path} created.")
else:
    print(f"File {config_path} already exists; not recreating it.")


File /content/Projet_OpenNMT/europarl_lemm_config.yaml created.


In [67]:
!onmt_build_vocab \
    -config 'europarl_lemm_config.yaml' \
    -n_sample 10000

Corpus corpus_1's weight should be given. We default it to 1 for you.
[2025-03-06 09:24:17,711 INFO] Counter vocab from 10000 samples.
[2025-03-06 09:24:17,711 INFO] Build vocab on 10000 transformed examples/corpus.
[2025-03-06 09:24:18,081 INFO] Counters src: 17463
[2025-03-06 09:24:18,082 INFO] Counters tgt: 21814
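
Comparing these counters with the ones obtained on the inflected Europarl corpus (src: 18409, tgt: 23490) gives a rough idea of how much lemmatization shrinks the vocabulary. The snippet below computes the relative reduction. Keep in mind that both counts come from a 10,000-sentence sample (`-n_sample 10000`), so the figures are indicative only. The helper name `relative_reduction` is hypothetical.

```python
def relative_reduction(before, after):
    """Return the fraction by which `after` is smaller than `before`."""
    return (before - after) / before


# (inflected, lemmatized) token counts from the onmt_build_vocab logs
counts = {
    "src (en)": (18409, 17463),
    "tgt (fr)": (23490, 21814),
}
for side, (inflected, lemmatized) in counts.items():
    print(f"{side}: {relative_reduction(inflected, lemmatized):.1%} fewer types")
```

The reduction is modest on both sides, a few percent on English and slightly more on French.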


In [None]:
# Train the model
!onmt_train \
    -config 'europarl_lemm_config.yaml'

[2025-03-05 12:03:18,845 INFO] Missing transforms field for corpus_1 data, set to default: [].
[2025-03-05 12:03:18,846 INFO] Missing transforms field for valid data, set to default: [].
[2025-03-05 12:03:18,846 INFO] Parsed 2 corpora from -data.
[2025-03-05 12:03:18,847 INFO] Get special vocabs from Transforms: {'src': [], 'tgt': []}.
[2025-03-05 12:03:18,951 INFO] The first 10 tokens of the vocabs are:['<unk>', '<blank>', '<s>', '</s>', 'the', 'of', 'to', 'and', 'a', 'in']
[2025-03-05 12:03:18,951 INFO] The decoder start token is: <s>
[2025-03-05 12:03:18,951 INFO] Building model...
[2025-03-05 12:03:20,138 INFO] Switching model to float32 for amp/apex_amp
[2025-03-05 12:03:20,139 INFO] Non quantized layer compute is fp32
[2025-03-05 12:03:20,599 INFO] NMTModel(
  (encoder): RNNEncoder(
    (embeddings): Embeddings(
      (make_embedding): Sequential(
        (emb_luts): Elementwise(
          (0): Embedding(17472, 500, padding_idx=1)
        )
      )
      (dropout): Dropout(p=0.3,

### Translation and evaluation on lemmatized corpora with Europarl

In [None]:
!onmt_translate \
    -model "$BASE_PATH/corpus_splitFinal/run/model_europarl_lemmatized_step_10000.pt" \
    -src "$BASE_PATH/corpus_splitFinal/test/Europarl_test_500.lem.en" \
    -output "$BASE_PATH/corpus_splitFinal/test/Europarl_test_pred_run4.fr"




[2025-03-05 12:25:41,155 INFO] Loading checkpoint from /content/drive/My Drive/Projet_OpenNMT/corpus_splitFinal/run/model_europarl_lemmatized_step_10000.pt
[2025-03-05 12:25:41,820 INFO] Loading data into the model
[2025-03-05 12:26:43,062 INFO] PRED SCORE: -0.7007, PRED PPL: 2.02 NB SENTENCES: 493
Time w/o python interpreter load/terminate:  61.936150550842285


In [None]:
!sacrebleu "$BASE_PATH/corpus_splitFinal/test/Europarl_test_500.lem.fr" \
    -i "$BASE_PATH/corpus_splitFinal/test/Europarl_test_pred_run4.fr" \
    -m bleu \
    -b \
    --force

16.1

In [None]:
!onmt_translate \
    -model "$BASE_PATH/corpus_splitFinal/run/model_europarl_lemmatized_step_10000.pt" \
    -src "$BASE_PATH/corpus_splitFinal/test/Emea_test_500.lem.en" \
    -output "$BASE_PATH/corpus_splitFinal/test/Emea_test_pred_run4.fr"




[2025-03-05 12:27:11,416 INFO] Loading checkpoint from /content/drive/My Drive/Projet_OpenNMT/corpus_splitFinal/run/model_europarl_lemmatized_step_10000.pt
[2025-03-05 12:27:12,221 INFO] Loading data into the model
[2025-03-05 12:27:38,632 INFO] PRED SCORE: -0.5928, PRED PPL: 1.81 NB SENTENCES: 500
Time w/o python interpreter load/terminate:  27.245006322860718


In [None]:
!sacrebleu "$BASE_PATH/corpus_splitFinal/test/Emea_test_500.lem.fr" \
    -i "$BASE_PATH/corpus_splitFinal/test/Emea_test_pred_run4.fr" \
    -m bleu \
    -b \
    --force

1.6

### Training on lemmatized corpora with Europarl + EMEA




In [69]:
# Path to the YAML configuration file (adapt to your environment)
config_path = BASE_PATH+'/europarl_emea_lemm_config.yaml'

if not os.path.exists(config_path):
    with open(config_path, 'w', encoding='utf-8') as f:
        f.write("""# Configuration for training on lemmatized corpora (Europarl + EMEA)
save_data: corpus_splitFinal/run/europarl_emea_lemmatized
## Vocabulary
src_vocab: corpus_splitFinal/run/europarl_emea_lemmatized.vocab.src
tgt_vocab: corpus_splitFinal/run/europarl_emea_lemmatized.vocab.tgt
overwrite: True

data:
  corpus_1:
    # Use the lemmatized files
    path_src: corpus_splitFinal/train/Europarl_train_100k.lem.en
    path_tgt: corpus_splitFinal/train/Europarl_train_100k.lem.fr
  corpus_2:
    path_src: corpus_splitFinal/train/Emea_train_10k.lem.en
    path_tgt: corpus_splitFinal/train/Emea_train_10k.lem.fr
  valid:
    path_src: corpus_splitFinal/dev/Europarl_dev_3750.lem.en
    path_tgt: corpus_splitFinal/dev/Europarl_dev_3750.lem.fr

save_model: corpus_splitFinal/run/model_europarl_emea_lemmatized
save_checkpoint_steps: 500
train_steps: 10000
valid_steps: 500

# GPU configuration
world_size: 1
gpu_ranks: [0]

""")
    print(f"File {config_path} created.")
else:
    print(f"File {config_path} already exists; not recreating it.")


File /content/Projet_OpenNMT/europarl_emea_lemm_config.yaml created.


In [70]:
!onmt_build_vocab \
    -config 'europarl_emea_lemm_config.yaml' \
    -n_sample 10000

Corpus corpus_1's weight should be given. We default it to 1 for you.
Corpus corpus_2's weight should be given. We default it to 1 for you.
[2025-03-06 09:26:49,732 INFO] Counter vocab from 10000 samples.
[2025-03-06 09:26:49,733 INFO] Build vocab on 10000 transformed examples/corpus.
[2025-03-06 09:26:50,439 INFO] Counters src: 20802
[2025-03-06 09:26:50,439 INFO] Counters tgt: 25381


In [None]:
# Train the model
!onmt_train \
    -config 'europarl_emea_lemm_config.yaml'

[2025-03-05 10:55:45,167 INFO] Missing transforms field for corpus_1 data, set to default: [].
[2025-03-05 10:55:45,167 INFO] Missing transforms field for corpus_2 data, set to default: [].
[2025-03-05 10:55:45,168 INFO] Missing transforms field for valid data, set to default: [].
[2025-03-05 10:55:45,168 INFO] Parsed 3 corpora from -data.
[2025-03-05 10:55:45,169 INFO] Get special vocabs from Transforms: {'src': [], 'tgt': []}.
[2025-03-05 10:55:45,247 INFO] The first 10 tokens of the vocabs are:['<unk>', '<blank>', '<s>', '</s>', 'the', 'of', 'to', 'and', 'in', 'a']
[2025-03-05 10:55:45,247 INFO] The decoder start token is: <s>
[2025-03-05 10:55:45,247 INFO] Building model...
[2025-03-05 10:55:46,011 INFO] Switching model to float32 for amp/apex_amp
[2025-03-05 10:55:46,012 INFO] Non quantized layer compute is fp32
[2025-03-05 10:55:46,432 INFO] NMTModel(
  (encoder): RNNEncoder(
    (embeddings): Embeddings(
      (make_embedding): Sequential(
        (emb_luts): Elementwise(
      

### Translation and evaluation on lemmatized corpora with Europarl + EMEA

In [None]:
!onmt_translate \
    -model "$BASE_PATH/corpus_splitFinal/run/model_europarl_emea_lemmatized_step_10000.pt" \
    -src "$BASE_PATH/corpus_splitFinal/test/Europarl_test_500.lem.en" \
    -output "$BASE_PATH/corpus_splitFinal/test/Europarl_test_pred_run3.fr"




[2025-03-05 11:24:25,674 INFO] Loading checkpoint from /content/drive/My Drive/Projet_OpenNMT/corpus_splitFinal/run/model_europarl_emea_lemmatized_step_10000.pt
[2025-03-05 11:24:26,337 INFO] Loading data into the model
[2025-03-05 11:25:25,041 INFO] PRED SCORE: -0.6431, PRED PPL: 1.90 NB SENTENCES: 493
Time w/o python interpreter load/terminate:  59.41123652458191


In [None]:
!sacrebleu "$BASE_PATH/corpus_splitFinal/test/Europarl_test_500.lem.fr" \
    -i "$BASE_PATH/corpus_splitFinal/test/Europarl_test_pred_run3.fr" \
    -m bleu \
    -b \
    --force

15.8

In [None]:
!onmt_translate \
    -model "$BASE_PATH/corpus_splitFinal/run/model_europarl_emea_lemmatized_step_10000.pt" \
    -src "$BASE_PATH/corpus_splitFinal/test/Emea_test_500.lem.en" \
    -output "$BASE_PATH/corpus_splitFinal/test/Emea_test_pred_run3.fr"




[2025-03-05 11:26:17,801 INFO] Loading checkpoint from /content/drive/My Drive/Projet_OpenNMT/corpus_splitFinal/run/model_europarl_emea_lemmatized_step_10000.pt
[2025-03-05 11:26:18,750 INFO] Loading data into the model
[2025-03-05 11:26:41,471 INFO] PRED SCORE: -0.6312, PRED PPL: 1.88 NB SENTENCES: 500
Time w/o python interpreter load/terminate:  23.70695972442627


In [None]:
!sacrebleu "$BASE_PATH/corpus_splitFinal/test/Emea_test_500.lem.fr" \
    -i "$BASE_PATH/corpus_splitFinal/test/Emea_test_pred_run3.fr" \
    -m bleu \
    -b \
    --force

7.7
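
To read the results side by side, the BLEU scores reported by the `sacrebleu` cells in Parts II and III can be gathered into one small table. The snippet below is only a convenience sketch: the structure `RESULTS` and the helper `format_table` are hypothetical, and the numbers are copied from the cell outputs. The lemmatized models are scored against lemmatized references, so their scores are not directly comparable with the inflected-form scores.

```python
# BLEU scores copied from the sacrebleu outputs in this notebook.
# Keys: (corpus form, training data); values: BLEU per test set (None = not evaluated).
RESULTS = {
    ("inflected", "Europarl"): {"Europarl": 15.7, "EMEA": None},
    ("inflected", "Europarl+EMEA"): {"Europarl": 17.5, "EMEA": 5.8},
    ("lemmatized", "Europarl"): {"Europarl": 16.1, "EMEA": 1.6},
    ("lemmatized", "Europarl+EMEA"): {"Europarl": 15.8, "EMEA": 7.7},
}


def format_table(results):
    """Render the results as fixed-width text rows."""
    rows = [f"{'Form':<12}{'Training data':<16}{'Europarl':>10}{'EMEA':>8}"]
    for (form, train), scores in results.items():
        cells = [
            "-" if scores[name] is None else f"{scores[name]:.1f}"
            for name in ("Europarl", "EMEA")
        ]
        rows.append(f"{form:<12}{train:<16}{cells[0]:>10}{cells[1]:>8}")
    return "\n".join(rows)


print(format_table(RESULTS))
```

In both settings, adding the 10K EMEA sentences to the training data raises BLEU on the EMEA test set (5.8 for inflected forms, and from 1.6 to 7.7 for lemmas), while the Europarl scores stay between 15.7 and 17.5.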