# Data Quality Evaluation

In this notebook, we evaluate the quality of the translations with a quality estimation (QE) metric called CometKiwi22.
The model was trained on data released as part of the WMT 2022 Shared Task and supports many languages, including French.

The model was developed by Unbabel; the model card is available at https://huggingface.co/Unbabel/wmt22-cometkiwi-da.

We ran all the code on a single A100 GPU. Running this notebook took about 10 minutes.
The corresponding results are reported in Chapter 4 of the paper, in Table 2: Translation quality per sentence category estimated with COMETKIWI22.

In [None]:
import json
import os
import numpy as np  # used below for the mean and standard deviation of the scores
import pandas as pd
from tqdm import tqdm

In [None]:
!ls "data_fr"

classification	generation  moral_stories_full.jsonl


# Load the data

In [None]:
french_data_dir = "data_fr"
english_data_dir = "data_en"

In [None]:
# Each file is in JSON Lines format: one JSON object (one story) per line.
records_fr = []
with open(os.path.join(french_data_dir, "moral_stories_full.jsonl"), 'r', encoding='utf-8') as f:
    for line in f:
        records_fr.append(json.loads(line))
french_df = pd.DataFrame(records_fr)

records_en = []
with open(os.path.join(english_data_dir, "moral_stories_full.jsonl"), 'r', encoding='utf-8') as f:
    for line in f:
        records_en.append(json.loads(line))
english_df = pd.DataFrame(records_en)
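
The two loading loops follow the same pattern, so they could be folded into a single helper. A minimal sketch, assuming a hypothetical function name `read_jsonl` (not part of the original notebook) and assuming that blank lines should be skipped; the demo file and its contents are made up:

```python
import json
import os
import tempfile


def read_jsonl(path):
    """Read a JSON Lines file into a list of dicts (hypothetical helper)."""
    records = []
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # assumption: blank lines carry no record and are skipped
                records.append(json.loads(line))
    return records


# Small self-contained demo on a temporary file with made-up content.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.jsonl")
    with open(demo_path, "w", encoding="utf-8") as f:
        f.write('{"ID": "a1", "norm": "Be kind."}\n\n{"ID": "b2", "norm": "Be fair."}\n')
    demo_records = read_jsonl(demo_path)

print(demo_records)
```

With such a helper, each DataFrame could be built as `pd.DataFrame(read_jsonl(path))`. pandas also offers `pd.read_json(path, lines=True)`, though it may infer column types differently from the explicit loop.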

In [None]:
english_df.shape

(12000, 8)

In [None]:
english_df.columns

Index(['ID', 'norm', 'situation', 'intention', 'moral_action',
       'moral_consequence', 'immoral_action', 'immoral_consequence'],
      dtype='object')

In [None]:
# Pair every English field (source) with its French translation (MT output).
data = []
for index, row_en in tqdm(english_df.iterrows()):
    en_id = row_en['ID']
    # Take the first French row with the same ID; assumes every ID has a translation.
    row_fr = french_df[french_df['ID'] == en_id].iloc[0]
    for column in english_df.columns:
        if column != 'ID':
            data.append({"src": row_en[column], "mt": row_fr[column]})

12000it [00:23, 516.05it/s]
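
The pairing loop filters `french_df` once per English row, so its cost grows quadratically with the number of stories. A possible alternative, sketched here on plain lists of dicts such as `records_en` and `records_fr`, is to index the French records by `ID` first. The function name `pair_records` and the toy records are illustrative assumptions, not part of the original notebook:

```python
def pair_records(records_en, records_fr, id_key="ID"):
    """Pair every non-ID field of an English record with its French counterpart."""
    # Index French records by ID. If an ID repeats, keep the first record,
    # which mirrors taking the first match in a filtered DataFrame.
    fr_by_id = {}
    for rec in records_fr:
        fr_by_id.setdefault(rec[id_key], rec)

    pairs = []
    for rec_en in records_en:
        rec_fr = fr_by_id[rec_en[id_key]]  # raises KeyError if a translation is missing
        for column, text in rec_en.items():
            if column != id_key:
                pairs.append({"src": text, "mt": rec_fr[column]})
    return pairs


# Toy data, made up for illustration.
toy_en = [{"ID": "a1", "norm": "It is good to help.", "situation": "Sam sees a lost dog."}]
toy_fr = [{"ID": "a1", "norm": "Il est bien d'aider.", "situation": "Sam voit un chien perdu."}]
toy_pairs = pair_records(toy_en, toy_fr)
print(toy_pairs)
```

Each dictionary lookup takes constant time on average, so the pairing becomes linear in the number of records.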


# Load model for QE

Note that before loading the model you have to log in to Hugging Face and accept the license to be granted access: https://huggingface.co/Unbabel/wmt22-cometkiwi-da.

In [None]:
HUGGINGFACE_TOKEN = "KEY"  # To be copied from https://huggingface.co/settings/tokens. A READ access token is enough to run this notebook.

In [None]:
# !pip install --upgrade pip  # ensures that pip is current
!pip install "unbabel-comet>=2.0.0" -q
!huggingface-cli login --token $HUGGINGFACE_TOKEN
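
Hard-coding the token in a cell risks leaking it when the notebook is shared. One option, sketched here under the assumption that the token has been exported in an environment variable named `HF_TOKEN` (a name chosen for illustration, as is the helper `get_hf_token`), is to read it at runtime:

```python
import os


def get_hf_token(env=os.environ, var_name="HF_TOKEN"):
    """Return the access token stored in `var_name`, or fail with a clear message."""
    token = env.get(var_name, "").strip()
    if not token:
        raise RuntimeError(f"Set the {var_name} environment variable to a READ access token.")
    return token


# Demo with a stand-in mapping instead of the real environment.
demo_token = get_hf_token(env={"HF_TOKEN": "hf_dummy_value"})
print(demo_token)
```

The returned value could then be assigned to `HUGGINGFACE_TOKEN` in place of the literal string.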

In [None]:
from comet import download_model, load_from_checkpoint

In [None]:
model_path = download_model("Unbabel/wmt22-cometkiwi-da")
model = load_from_checkpoint(model_path)

In [None]:
# Example of how the model can be called: a list of dicts with "src" and "mt" keys
data = [{
    "src": "It is wrong to worry your grandmother",
    "mt": "Il est mal de s'inquiéter pour sa grand-mère.",
}]

In [None]:
model.predict(data, batch_size=8, gpus=1)

INFO:pytorch_lightning.utilities.rank_zero:GPU available: True (cuda), used: True
INFO:pytorch_lightning.utilities.rank_zero:TPU available: False, using: 0 TPU cores
INFO:pytorch_lightning.utilities.rank_zero:IPU available: False, using: 0 IPUs
INFO:pytorch_lightning.utilities.rank_zero:HPU available: False, using: 0 HPUs
INFO:pytorch_lightning.accelerators.cuda:LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
Predicting DataLoader 0: 100%|██████████| 1/1 [00:01<00:00,  1.30s/it]


Prediction([('scores', [0.6662602424621582]),
            ('system_score', 0.6662602424621582)])

For each input sentence pair, the model outputs a score, as shown above: `scores` holds one value per pair, and `system_score` is their average over the whole input.

In [None]:
# Group the (src, mt) pairs by column so that each sentence category can be scored separately.
data_all = dict()
for index, row_en in tqdm(english_df.iterrows()):
    en_id = row_en['ID']
    row_fr = french_df[french_df['ID'] == en_id].iloc[0]  # first French row with the same ID
    for column in english_df.columns:
        if column != 'ID':
            data_all.setdefault(column, []).append({"src": row_en[column], "mt": row_fr[column]})

12000it [00:20, 572.29it/s]


In [None]:
for k in data_all:
    print(k)

norm
situation
intention
moral_action
moral_consequence
immoral_action
immoral_consequence


# Run evaluation on paired translations

In [None]:
# Score each sentence category separately and keep the full prediction output per category.
scores_all = dict()
for k, v in data_all.items():
    print(k)
    scores_all[k] = model.predict(v, batch_size=8, gpus=1)

INFO:pytorch_lightning.utilities.rank_zero:GPU available: True (cuda), used: True
INFO:pytorch_lightning.utilities.rank_zero:TPU available: False, using: 0 TPU cores
INFO:pytorch_lightning.utilities.rank_zero:IPU available: False, using: 0 IPUs
INFO:pytorch_lightning.utilities.rank_zero:HPU available: False, using: 0 HPUs
INFO:pytorch_lightning.accelerators.cuda:LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]


norm


Predicting DataLoader 0: 100%|██████████| 1500/1500 [01:36<00:00, 15.51it/s]
INFO:pytorch_lightning.utilities.rank_zero:GPU available: True (cuda), used: True
INFO:pytorch_lightning.utilities.rank_zero:TPU available: False, using: 0 TPU cores
INFO:pytorch_lightning.utilities.rank_zero:IPU available: False, using: 0 IPUs
INFO:pytorch_lightning.utilities.rank_zero:HPU available: False, using: 0 HPUs
INFO:pytorch_lightning.accelerators.cuda:LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]


situation


Predicting DataLoader 0: 100%|██████████| 1500/1500 [02:25<00:00, 10.30it/s]
INFO:pytorch_lightning.utilities.rank_zero:GPU available: True (cuda), used: True
INFO:pytorch_lightning.utilities.rank_zero:TPU available: False, using: 0 TPU cores
INFO:pytorch_lightning.utilities.rank_zero:IPU available: False, using: 0 IPUs
INFO:pytorch_lightning.utilities.rank_zero:HPU available: False, using: 0 HPUs
INFO:pytorch_lightning.accelerators.cuda:LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]


intention


Predicting DataLoader 0: 100%|██████████| 1500/1500 [01:25<00:00, 17.49it/s]
INFO:pytorch_lightning.utilities.rank_zero:GPU available: True (cuda), used: True
INFO:pytorch_lightning.utilities.rank_zero:TPU available: False, using: 0 TPU cores
INFO:pytorch_lightning.utilities.rank_zero:IPU available: False, using: 0 IPUs
INFO:pytorch_lightning.utilities.rank_zero:HPU available: False, using: 0 HPUs
INFO:pytorch_lightning.accelerators.cuda:LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]


moral_action


Predicting DataLoader 0: 100%|██████████| 1500/1500 [02:19<00:00, 10.78it/s]
INFO:pytorch_lightning.utilities.rank_zero:GPU available: True (cuda), used: True
INFO:pytorch_lightning.utilities.rank_zero:TPU available: False, using: 0 TPU cores
INFO:pytorch_lightning.utilities.rank_zero:IPU available: False, using: 0 IPUs
INFO:pytorch_lightning.utilities.rank_zero:HPU available: False, using: 0 HPUs
INFO:pytorch_lightning.accelerators.cuda:LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]


moral_consequence


Predicting DataLoader 0: 100%|██████████| 1500/1500 [02:12<00:00, 11.32it/s]


immoral_action


INFO:pytorch_lightning.utilities.rank_zero:GPU available: True (cuda), used: True
INFO:pytorch_lightning.utilities.rank_zero:TPU available: False, using: 0 TPU cores
INFO:pytorch_lightning.utilities.rank_zero:IPU available: False, using: 0 IPUs
INFO:pytorch_lightning.utilities.rank_zero:HPU available: False, using: 0 HPUs
INFO:pytorch_lightning.accelerators.cuda:LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
Predicting DataLoader 0: 100%|██████████| 1500/1500 [02:23<00:00, 10.46it/s]
INFO:pytorch_lightning.utilities.rank_zero:GPU available: True (cuda), used: True
INFO:pytorch_lightning.utilities.rank_zero:TPU available: False, using: 0 TPU cores
INFO:pytorch_lightning.utilities.rank_zero:IPU available: False, using: 0 IPUs
INFO:pytorch_lightning.utilities.rank_zero:HPU available: False, using: 0 HPUs
INFO:pytorch_lightning.accelerators.cuda:LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]


immoral_consequence


Predicting DataLoader 0: 100%|██████████| 1500/1500 [02:19<00:00, 10.73it/s]


In [None]:
# Mean score per sentence category, printed as "name & value" table rows
for k, v in scores_all.items():
    value_ = np.mean(v['scores'])
    string_ = str(k) + " & " + str(round(value_, 3))
    print(string_)

norm & 0.858
situation & 0.85
intention & 0.854
moral_action & 0.844
moral_consequence & 0.848
immoral_action & 0.832
immoral_consequence & 0.841


In [None]:
# Standard deviation of the scores per sentence category, in the same row format
for k, v in scores_all.items():
    value_ = np.std(v['scores'])
    string_ = str(k) + " & " + str(round(value_, 3))
    print(string_)

norm & 0.057
situation & 0.043
intention & 0.049
moral_action & 0.046
moral_consequence & 0.045
immoral_action & 0.054
immoral_consequence & 0.052
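
To report the mean and the standard deviation in a single table row per category, the two loops could be merged. A minimal sketch using only the standard library and made-up scores; the function name `latex_row`, the `$\pm$` layout, and the example values are assumptions for illustration. `statistics.pstdev` is the population standard deviation, which corresponds to the default behaviour of `np.std`:

```python
import statistics


def latex_row(name, scores, digits=3):
    """Format one LaTeX table row: category & mean +/- population std."""
    mean = statistics.fmean(scores)
    std = statistics.pstdev(scores)  # population std, like np.std with default ddof=0
    safe_name = name.replace("_", "\\_")  # underscores must be escaped in LaTeX text
    return f"{safe_name} & {mean:.{digits}f} $\\pm$ {std:.{digits}f} \\\\"


# Made-up scores, only to show the output format.
toy_scores = {"norm": [0.8, 0.9], "moral_action": [0.85, 0.85]}
rows = [latex_row(k, v) for k, v in toy_scores.items()]
for row in rows:
    print(row)
```

Fixed-width formatting (`:.3f`) keeps trailing zeros, so a mean of 0.85 is printed as `0.850` and the table columns stay aligned.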
