Evaluate the performance of a trained detoxification model.
Given a file of model-generated text and a file of reference text (one record per line), compute and report the following three metrics:
    1. BERTScore
    2. ROUGE scores (ROUGE-1, ROUGE-2, ROUGE-L)
    3. Fraction of exact matches (generated text identical to the reference)

Sources:
    1. BERTScore: https://github.com/Tiiiger/bert_score and https://pypi.org/project/bert-score/
    2. ROUGE: https://pypi.org/project/rouge/
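
To make the last two metrics concrete, here is a minimal sketch in plain Python. The helper names `exact_match_fraction` and `unigram_f1` are hypothetical, and `unigram_f1` is a simplified whitespace-token F1 in the spirit of ROUGE-1. It is not the `rouge` package's implementation, so its values may differ from the package's.

```python
from collections import Counter


def exact_match_fraction(hyps, refs):
    """Fraction of positions where the hypothesis equals the reference exactly."""
    if len(hyps) != len(refs):
        raise ValueError("hyps and refs must have the same length")
    if not hyps:
        return 0.0
    return sum(h == r for h, r in zip(hyps, refs)) / len(hyps)


def unigram_f1(hyp, ref):
    """Simplified ROUGE-1-style F1: whitespace tokens, clipped unigram counts."""
    hyp_counts = Counter(hyp.split())
    ref_counts = Counter(ref.split())
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((hyp_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(hyp_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)
```

Exact match rewards only identical strings, while the overlap score gives partial credit when most tokens agree, which is why the two can diverge widely on the same data.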

In [1]:
!pip install bert_score
!git clone https://github.com/Tiiiger/bert_score.git
    
!pip install rouge

fatal: destination path 'bert_score' already exists and is not an empty directory.


In [2]:
import time
import pandas as pd
import numpy as np
import json
from bert_score import score
from rouge import Rouge, FilesRouge

In [3]:
# Process the input CSV into two text files (references and hypotheses), one record per line

In [4]:
file = pd.read_csv("sample_output.csv")
file

Unnamed: 0,input_text,target_text,generated_text
0,Yeah that gun ran around and shot people.....\...,Yeah that gun ran around and shot people.....\...,Yeah that gun ran around and <c> people.....\n...
1,She forgot how to walk. Was waiting for the $$...,She forgot how to walk. Was waiting for the $$...,She forgot how to walk. Was waiting for the $$...
2,Melania is the FLOTUS. Hillary is a retired/fa...,Melania is the FLOTUS. Hillary is a retired/fa...,Melania is the FLOTUS.\nBoth are seriously out...
3,"Um, what? No one is sure....yet. \n\n\nAnd hel...","Um, what? No one is sure....yet. \n\n\nAnd hel...","Um, what? No one is sure....yet. \n\nAnd held ..."
4,A fool and his money are soon separated.😆,A <c> and his money are soon <c>.😆,A <c> and his money are soon separated.😆
...,...,...,...
95,I'm impressed. You've upped your game from na...,I'm impressed. You've upped your game from na...,I'm impressed. You've upped your game from na...
96,McCain. What a loser he turned out to be. My...,McCain. What a <c> he turned out to be. My a...,McCain. What a <c> he turned out to be. <c>....
97,"baloney. Obviously, either your memory is shot...","baloney. Obviously, either your memory is shot...","<c>. Obviously, either your memory is <c>, or ..."
98,I have a solution: KILL OFF those insipid SB h...,I have a solution: <c> OFF those insipid SB ha...,I have a solution: <c> OFF those insipid SB ha...


In [5]:
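# Escape newlines and non-ASCII characters so that each record stays on one line.
# Decode back to str; calling str() on the bytes would write a literal b'...' wrapper.
ref_text = [t.encode("unicode_escape").decode("ascii") for t in file["target_text"]]
hyp_text = [t.encode("unicode_escape").decode("ascii") for t in file["generated_text"]]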

with open("ref.txt", "w") as f:
    for t in ref_text:
        f.write(t)
        f.write("\n")
with open("hyp.txt", "w") as f:
    for t in hyp_text:
        f.write(t)
        f.write("\n")
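
The escaping step is what guarantees one record per line, since several comments contain embedded newlines. A minimal sketch of the round trip follows; the helper names `escape_record` and `unescape_record` are hypothetical and not part of the notebook. Note that `str()` applied to a bytes object returns its repr, including the `b'...'` wrapper, whereas `.decode("ascii")` returns only the escaped text.

```python
def escape_record(text):
    """Escape newlines and non-ASCII characters into a single ASCII line."""
    return text.encode("unicode_escape").decode("ascii")


def unescape_record(line):
    """Invert escape_record, restoring the original text."""
    return line.encode("ascii").decode("unicode_escape")


sample = "No one is sure....yet.\n\nAnd more 😆"
escaped = escape_record(sample)

# The escaped form contains no real newline, so it is safe to write as one line.
assert "\n" not in escaped
# The round trip restores the original text, including the emoji.
assert unescape_record(escaped) == sample
```

Unescaping is only needed when the original text has to be displayed again; the metrics in this notebook are computed on the escaped lines.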

In [6]:
# Paths to the hypothesis (generated) and reference files
hyp_path = "hyp.txt"
ref_path = "ref.txt"

In [8]:
def cal_eval_metrics(hyp_path, ref_path):
    """Compute BERTScore F1, ROUGE-1/2/L F1, and the exact-match fraction.

    Both files must hold one record per line, aligned by line number.
    """
    # Read both files; record i of hyp is scored against record i of ref.
    with open(hyp_path) as f:
        hyp = [line.strip() for line in f]
    with open(ref_path) as f:
        ref = [line.strip() for line in f]
    print("# of records in hypo file:", len(hyp))
    print("# of records in reference file:", len(ref))

    # BERTScore: score() returns per-record precision, recall, and F1 tensors;
    # report the mean F1.
    P, R, F1 = score(hyp, ref, lang="en", verbose=True)
    bertscore = float(F1.mean())

    # ROUGE: average over all line pairs and keep the F-measure of each variant.
    file_rouge = FilesRouge()
    rouge_scores = file_rouge.get_scores(hyp_path, ref_path, avg=True)
    rouge_1 = rouge_scores["rouge-1"]["f"]
    rouge_2 = rouge_scores["rouge-2"]["f"]
    rouge_l = rouge_scores["rouge-l"]["f"]

    # Exact match: fraction of records where the hypothesis equals the reference.
    num_match = np.count_nonzero(np.array(hyp) == np.array(ref))
    percent_match = num_match / len(hyp)

    return {"BERTscore": bertscore, "ROUGE-1": rouge_1, "ROUGE-2": rouge_2,
            "ROUGE-L": rouge_l, "Exact Match": percent_match}
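
All three metrics pair records by line position, so a missing or extra line in either file silently misaligns every later pair. The sketch below shows one way to check alignment before scoring. The helper `read_aligned` is hypothetical, and reading the files as UTF-8 is an assumption.

```python
def read_aligned(hyp_path, ref_path):
    """Read two one-record-per-line files and verify that they line up."""
    with open(hyp_path, encoding="utf-8") as f:
        hyp = [line.strip() for line in f]
    with open(ref_path, encoding="utf-8") as f:
        ref = [line.strip() for line in f]
    # Fail early rather than scoring misaligned pairs.
    if len(hyp) != len(ref):
        raise ValueError(
            f"record count mismatch: {len(hyp)} hypotheses vs {len(ref)} references"
        )
    if not hyp:
        raise ValueError("both files are empty")
    return hyp, ref
```

Calling `read_aligned("hyp.txt", "ref.txt")` before scoring turns a misalignment into an immediate error instead of a quietly wrong score.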

In [9]:
scores = cal_eval_metrics(hyp_path, ref_path)
score_df = pd.DataFrame.from_dict(scores, orient="index", columns=["score"])
score_df

# of records in hypo file: 100
# of records in reference file: 100


Some weights of the model checkpoint at roberta-large were not used when initializing RobertaModel: ['lm_head.layer_norm.weight', 'lm_head.dense.weight', 'lm_head.bias', 'lm_head.dense.bias', 'lm_head.decoder.weight', 'lm_head.layer_norm.bias']
- This IS expected if you are initializing RobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


calculating scores...
computing bert embedding.


computing greedy matching.


done in 129.52 seconds, 0.77 sentences/sec


metric,score
BERTscore,0.951883
ROUGE-1,0.7592
ROUGE-2,0.675615
ROUGE-L,0.75795
Exact Match,0.17
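
An exact-match score of 0.17 over 100 records means that 17 generated texts are identical to their references and 83 differ. To inspect the differing pairs, a minimal sketch such as the following can list their positions. The helper `mismatched_indices` is hypothetical and not part of the notebook.

```python
def mismatched_indices(hyps, refs):
    """Return the positions where the hypothesis and reference differ."""
    if len(hyps) != len(refs):
        raise ValueError("hyps and refs must have the same length")
    return [i for i, (h, r) in enumerate(zip(hyps, refs)) if h != r]
```

The returned indices can be used to pull the corresponding rows for manual review.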


In [None]:
sentence1 = "<c>"
sentence2 = "<c>"

# score() expects lists of candidates and references, so wrap each sentence in a list.
print(score([sentence1], [sentence2], lang="en", verbose=True))