# Evaluation of Fine-Tuned Model vs. Base MarianMT Model

In this notebook we compare translations from the fine-tuned model and the base MarianMT model on the English-Spanish Kaggle dataset.

In [89]:
import pandas as pd
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader
import tqdm
from evaluate import load
from transformers import MarianMTModel, MarianTokenizer, MarianConfig

In [90]:
data_file = pd.read_csv('../data/English-Spanish-Kaggle.csv')

Remove rows with duplicate English phrases, keeping the first occurrence of each.

In [91]:
df_clean = data_file.drop_duplicates(subset=['english'], keep='first', ignore_index=True)

In [92]:
df_clean

| Unnamed: 0 | english | spanish |
|---|---|---|
| 0 | Go. | Ve. |
| 1 | Hi. | Hola. |
| 2 | Run! | ¡Corre! |
| 3 | Run. | Corred. |
| 4 | Who? | ¿Quién? |
| ... | ... | ... |
| 102899 | There are four main causes of alcohol-related ... | Hay cuatro causas principales de muertes relac... |
| 102900 | There are mothers and fathers who will lie awa... | Hay madres y padres que se quedan despiertos d... |
| 102901 | A carbon footprint is the amount of carbon dio... | Una huella de carbono es la cantidad de contam... |
| 102902 | Since there are usually multiple websites on a... | Como suele haber varias páginas web sobre cual... |
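
To make the effect of deduplicating on the `english` column concrete, here is a small sketch on a made-up three-row frame (the rows are illustrative, not taken from the Kaggle file). Because only `english` is compared, a phrase with several valid Spanish translations keeps just the first one.

```python
import pandas as pd

# Toy data (hypothetical rows): "Run!" appears twice with different translations.
toy = pd.DataFrame({
    'english': ['Run!', 'Run!', 'Hi.'],
    'spanish': ['¡Corre!', '¡Corran!', 'Hola.'],
})

# Same call as in the notebook: compare on 'english' only, keep the first
# match, and renumber the index from 0.
toy_clean = toy.drop_duplicates(subset=['english'], keep='first', ignore_index=True)
print(toy_clean)
```

The alternative translation `¡Corran!` is dropped, so each English phrase is later scored against a single reference.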


## Import Models

In [93]:
base_model_name = 'Helsinki-NLP/opus-mt-en-es'
base_model = MarianMTModel.from_pretrained(base_model_name)
tokenizer = MarianTokenizer.from_pretrained(base_model_name)



In [94]:
state_dict = torch.load('fine_tuned_en_es.bin', map_location=torch.device('cpu'))
config = MarianConfig.from_json_file('config.json')
ft_model = MarianMTModel(config=config)
ft_model.load_state_dict(state_dict)

<All keys matched successfully>

## Evaluation

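Here we will translate every English phrase with each model and score the translations against the Spanish references using BERTScore.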

In [95]:
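def eval_model(model, device, df, bertscore):
    """Translate every English phrase in `df` with `model` and score the results.

    Uses the global `tokenizer`. `device` is only passed to BERTScore;
    generation runs on whatever device `model` already lives on.
    Returns the BERTScore result dict (per-sentence precision, recall, f1).
    """
    model.eval()
    all_predictions = []
    all_references = []

    with torch.no_grad():
        for row in tqdm.tqdm(range(len(df))):

            # Grab the English source and the Spanish reference for this row
            en_phrase = df['english'][row]
            es_ref = df['spanish'][row]

            # Tokenize the English phrase
            inputs = tokenizer([en_phrase], return_tensors='pt')

            # Generate the translation token ids from the English input
            trans_ids = model.generate(**inputs)

            # Decode the generated ids back into plain text
            model_trans = tokenizer.batch_decode(trans_ids, skip_special_tokens=True)[0]

            all_predictions.append(model_trans)
            all_references.append(es_ref)

    # Score all predictions against their references in one call
    return bertscore.compute(predictions=all_predictions, references=all_references, device=device, lang='es')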
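
The loop above translates one phrase per `generate` call. As a rough sketch of an alternative, the phrases could be grouped into batches first. The helper names `chunked` and `translate_all`, the batch size, and the stand-in `fake_translate` are all assumptions for illustration and not part of the notebook; the stand-in lets the sketch run without loading a model.

```python
from typing import Callable, Iterator, List, Sequence


def chunked(items: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


def translate_all(phrases: Sequence[str],
                  translate_batch: Callable[[List[str]], List[str]],
                  batch_size: int = 32) -> List[str]:
    """Translate `phrases` batch by batch, preserving the input order."""
    translations: List[str] = []
    for batch in chunked(phrases, batch_size):
        translations.extend(translate_batch(batch))
    return translations


# Stand-in for a model call (hypothetical). With the notebook's objects,
# `translate_batch` could instead wrap something like:
#   inputs = tokenizer(batch, return_tensors='pt', padding=True)
#   ids = model.generate(**inputs)
#   return tokenizer.batch_decode(ids, skip_special_tokens=True)
def fake_translate(batch: List[str]) -> List[str]:
    return [phrase.upper() for phrase in batch]


result = translate_all(["go", "hi", "run"], fake_translate, batch_size=2)
print(result)
```

Passing the translation step in as a callable keeps the batching logic independent of any particular model, so it can be checked on toy data before a long run.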

In [96]:
bertscore = load("bertscore")
ft_model_eval = eval_model(ft_model, torch.device('mps'), df_clean, bertscore=bertscore)
base_model_eval = eval_model(base_model, torch.device('mps'), df_clean, bertscore=bertscore)


100%|██████████| 102904/102904 [6:15:09<00:00,  4.57it/s]     
100%|██████████| 102904/102904 [7:12:15<00:00,  3.97it/s]     
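
For intuition about what the three BERTScore numbers measure: each candidate token embedding is matched to its most similar reference token (precision), each reference token to its most similar candidate token (recall), and F1 is their harmonic mean. The sketch below applies that greedy matching to hand-made 2-D unit vectors. The real metric uses contextual embeddings from a pretrained model and optional IDF weighting, both omitted here, and the function name `greedy_match_scores` is an assumption for illustration.

```python
def dot(u, v):
    """Dot product; equals cosine similarity when both vectors have unit length."""
    return sum(a * b for a, b in zip(u, v))


def greedy_match_scores(cand, ref):
    """Precision, recall, and F1 from greedy matching of unit vectors."""
    # Precision: best reference match for each candidate token, averaged.
    precision = sum(max(dot(c, r) for r in ref) for c in cand) / len(cand)
    # Recall: best candidate match for each reference token, averaged.
    recall = sum(max(dot(r, c) for c in cand) for r in ref) / len(ref)
    # Harmonic mean (assumes precision + recall is nonzero, true for this toy data).
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up unit vectors standing in for token embeddings.
candidate = [(1.0, 0.0), (0.0, 1.0)]
reference = [(1.0, 0.0), (0.6, 0.8), (0.0, 1.0)]

p, r, f1 = greedy_match_scores(candidate, reference)
print(p, r, f1)
```

Here every candidate token has an exact match in the reference, while one reference token is only partially covered, so recall comes out lower than precision.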


In [115]:
ft_recall = np.mean(ft_model_eval['recall']).round(6)
base_recall = np.mean(base_model_eval['recall']).round(6)

In [116]:
ft_f1 = np.mean(ft_model_eval['f1']).round(6)
base_f1 = np.mean(base_model_eval['f1']).round(6)

In [117]:
ft_precision = np.mean(ft_model_eval['precision']).round(6)
base_precision = np.mean(base_model_eval['precision']).round(6)

In [125]:
print(f'Fine-Tuned F1: {ft_f1}')
print(f'Base Model F1: {base_f1}')
print(f'F1 Difference (base - fine-tuned): {(base_f1 - ft_f1).round(6)}')
print('')

print(f'Fine-Tuned Precision: {ft_precision}')
print(f'Base Model Precision: {base_precision}')
print(f'Precision Difference (base - fine-tuned): {(base_precision - ft_precision).round(6)}')
print('')

print(f'Fine-Tuned Recall: {ft_recall}')
print(f'Base Model Recall: {base_recall}')
print(f'Recall Difference (base - fine-tuned): {(base_recall - ft_recall).round(6)}')

Fine-Tuned F1: 0.938918
Base Model F1: 0.938914
F1 Difference (base - fine-tuned): -4e-06

Fine-Tuned Precision: 0.940694
Base Model Precision: 0.940692
Precision Difference (base - fine-tuned): -2e-06

Fine-Tuned Recall: 0.937361
Base Model Recall: 0.937357
Recall Difference (base - fine-tuned): -4e-06
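
Averages alone can hide how the two models differ sentence by sentence. Since `bertscore.compute` returns one score per sentence, a paired comparison is possible. The sketch below uses made-up scores and a hypothetical helper `paired_summary`; with the notebook's results, the inputs would be `ft_model_eval['f1']` and `base_model_eval['f1']`. Note that it reports fine-tuned minus base, the opposite sign convention from the differences printed above.

```python
def paired_summary(ft_scores, base_scores, tol=1e-9):
    """Summarize per-sentence differences (fine-tuned minus base)."""
    if len(ft_scores) != len(base_scores):
        raise ValueError("score lists must have the same length")
    diffs = [f - b for f, b in zip(ft_scores, base_scores)]
    return {
        'mean_diff': sum(diffs) / len(diffs),
        # Differences within `tol` of zero are counted as ties.
        'ft_higher': sum(d > tol for d in diffs),
        'base_higher': sum(d < -tol for d in diffs),
        'tied': sum(abs(d) <= tol for d in diffs),
    }


# Made-up per-sentence F1 scores, not results from the notebook.
toy_ft = [0.90, 0.95, 0.80, 0.85]
toy_base = [0.90, 0.93, 0.82, 0.85]

summary = paired_summary(toy_ft, toy_base)
print(summary)
```

In this toy case the mean difference is essentially zero even though half of the sentences received different scores, which is the kind of detail a single average cannot show.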
