<a href="https://colab.research.google.com/github/jeanlucjackson/w266_final_project/blob/main/code/inference/evaluate_inferences.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Imports & Google Drive Mounting

In [1]:
from os import listdir
from os.path import isfile, join

import csv
import pprint

import pandas as pd

In [2]:
# This cell will authenticate you and mount your Drive in the Colab.
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
inference_root = "/content/drive/MyDrive/w266 NLP Final Project/Predictions/"

In [36]:
inference_files = listdir(inference_root)
pprint.pprint(inference_files)

['t5_simple_transformers_preds.csv',
 'predictions.T5_base_pt.squad.quac.csv',
 'predictions.T5_base_pt.squad.squad.csv',
 'predictions.T5_base_pt.quac.squad.csv',
 'predictions.T5_base_pt.quac.quac.csv',
 'predictions.T5_base_pt.squad.nq.csv',
 'predictions.T5_base_pt.quac.nq.csv',
 'predictions.T5_base_pt_long.squad.triviaqa.csv',
 'predictions.T5_base_pt_long.quac.triviaqa.csv',
 'predictions.bart_base_pt.squad.squad.csv']


# Evaluations
## Load Data

Inferences will be saved into the `inference_dict` nested dictionary, whose format is:
- keys: CSV filenames
- values:
  - `target`: list of target values
  - `prediction`: list of prediction values

In [37]:
inference_dict = {}

for id, inf_file in enumerate(inference_files):

  # Load CSV file containing predictions
  filename = join(inference_root, inf_file)
  
  # If the file exists, load it into pandas
  if isfile(filename):
    print(f"Opening file {id + 1} of {len(inference_files)}: {inf_file}\n")

    df = pd.read_csv(filename)
    
    # If the CSV does not have correct column names, warn user and skip file
    if 'target' not in df.columns and 'prediction' not in df.columns:
      print("WARNING: Columns `target` and `prediction` not found in CSV. Skipping CSV.")
      print(f"Check file: {filename}")
      # continue

    # Columns exist, so continue
    else:
      targets = df['target']
      predictions = df['prediction']

      print('CSV loaded.')
      print(f"Length of targets:      {len(targets)}")
      print(f"Length of predictions:  {len(predictions)}")
      
      # Save lists into prediction dictionary under file's name
      inference_dict.update(
          {inf_file: {'target': targets,
                      'prediction': predictions}
          }
      )
      print('\nTargets and predictions saved.')
    
    print('________________________________________\n')


print(f"\nTotal of {len(inference_dict.keys())} datasets loaded:")
for dataset in inference_dict.keys():
  print('    ' + dataset)

Opening file 1 of 10: t5_simple_transformers_preds.csv

Check file: /content/drive/MyDrive/w266 NLP Final Project/Predictions/t5_simple_transformers_preds.csv
________________________________________

Opening file 2 of 10: predictions.T5_base_pt.squad.quac.csv

CSV loaded.
Length of targets:      5868
Length of predictions:  5868

Targets and predictions saved.
________________________________________

Opening file 3 of 10: predictions.T5_base_pt.squad.squad.csv

CSV loaded.
Length of targets:      10570
Length of predictions:  10570

Targets and predictions saved.
________________________________________

Opening file 4 of 10: predictions.T5_base_pt.quac.squad.csv

CSV loaded.
Length of targets:      10570
Length of predictions:  10570

Targets and predictions saved.
________________________________________

Opening file 5 of 10: predictions.T5_base_pt.quac.quac.csv

CSV loaded.
Length of targets:      5868
Length of predictions:  5868

Targets and predictions saved.
_________________

We'll be using:
- ROUGE
- BLEU-RT
- BERTScore
- METEOR
- USE

And storing evaluations in `evaluation_dict` formatted as:
- keys: CSV filenames
- values:
  - metric_name: metric_value

## Load Evaluation Metrics

In [38]:
!pip install -q evaluate
import evaluate

### ROUGE
🤗 [ROUGE page](https://huggingface.co/spaces/evaluate-metric/rouge)

In [39]:
!pip install -q rouge_score

rouge = evaluate.load('rouge')

### BLEU-RT
- No fine-tuning yet.
- Using `BLEURT-20` checkpoint per Google's recommendation (see [BLEURT GitHub page](https://github.com/google-research/bleurt/blob/master/checkpoints.md#the-recommended-checkpoint-bleurt-20))

In [40]:
!pip install git+https://github.com/google-research/bleurt.git

bleurt = evaluate.load('bleurt', module_type='metric', checkpoint='BLEURT-20-D3')

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting git+https://github.com/google-research/bleurt.git
  Cloning https://github.com/google-research/bleurt.git to /tmp/pip-req-build-jz5bxvkv
  Running command git clone -q https://github.com/google-research/bleurt.git /tmp/pip-req-build-jz5bxvkv




### BERTScore
🤗 [BERTScore page](https://huggingface.co/spaces/evaluate-metric/bertscore)
- Using `distilbert-base-uncased` per 🤗 recommendation because the default model (`roberta-large`) is over 1.4GB

In [41]:
!pip install bert_score

bertscore = evaluate.load('bertscore')

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


### METEOR
🤗 [METEOR page](https://huggingface.co/spaces/evaluate-metric/meteor)

In [42]:
meteor = evaluate.load('meteor')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


### Universal Sentence Encoder (USE) – PENDING

## Calculate Metrics on Each Dataset

Metrics are calculated for each `target` – `prediction` pair. These pairs are averaged for each dataset to have a single value to compare between models and datasets.

In [24]:
# Evaluations will be stored in dictionary

evaluation_dict = {}

### ROUGE

In [43]:
for id, dataset in enumerate(inference_dict.keys()):
  print(f"Evaluating ROUGE on {dataset}...")

  targets = inference_dict[dataset]['target'].tolist()
  predictions = inference_dict[dataset]['prediction'].tolist()

  # ROUGE scores
  # The use_aggregator argument takes the average for us
  rouge_results = rouge.compute(predictions=predictions,
                                references=targets,
                                use_aggregator=True)
  
  for metric in rouge_results:
    
    # If this dataset hasn't been added to dict, add it and metric
    if not evaluation_dict.get(dataset):
      evaluation_dict.update(
            {
                dataset: {metric: rouge_results[metric]}
            }
        )
      
    # This dataset already exists as a key, so add this metric
    else:
      evaluation_dict[dataset].update(
          {
              metric: rouge_results[metric]
          }
      )

Evaluating ROUGE on predictions.T5_base_pt.squad.quac.csv...
Evaluating ROUGE on predictions.T5_base_pt.squad.squad.csv...
Evaluating ROUGE on predictions.T5_base_pt.quac.squad.csv...
Evaluating ROUGE on predictions.T5_base_pt.quac.quac.csv...
Evaluating ROUGE on predictions.T5_base_pt.squad.nq.csv...
Evaluating ROUGE on predictions.T5_base_pt.quac.nq.csv...
Evaluating ROUGE on predictions.T5_base_pt_long.squad.triviaqa.csv...
Evaluating ROUGE on predictions.T5_base_pt_long.quac.triviaqa.csv...
Evaluating ROUGE on predictions.bart_base_pt.squad.squad.csv...


In [44]:
# test out on one of the datasets
dataset = list(inference_dict.keys())[0]
print(dataset)

targets = inference_dict[dataset]['target'].tolist()
predictions = inference_dict[dataset]['prediction'].tolist()
print(len(targets), len(predictions))

print('ROUGE')
rouge_results = rouge.compute(predictions=predictions,
                              references=targets,
                              use_aggregator=False)

predictions.T5_base_pt.squad.quac.csv
5868 5868
ROUGE


In [45]:
print(type(rouge_results))
print(len(rouge_results))
print(rouge_results.keys())
print(len(rouge_results['rouge1']))

print(f"Averages:")
for k in rouge_results.keys():
  print(k + ': ', end='')
  print(sum(rouge_results[k])/len(rouge_results[k]))


<class 'dict'>
4
dict_keys(['rouge1', 'rouge2', 'rougeL', 'rougeLsum'])
5868
Averages:
rouge1: 0.18249635750457507
rouge2: 0.03966958090769743
rougeL: 0.17632388612567854
rougeLsum: 0.17632388612567854


In [46]:
for d in evaluation_dict.keys():
  print(d)
  pprint.pprint(evaluation_dict[d])

predictions.T5_base_pt.squad.quac.csv
{'meteor': 0.15452861614981653,
 'rouge1': 0.18246685757067002,
 'rouge2': 0.039596756029108796,
 'rougeL': 0.17629216493034694,
 'rougeLsum': 0.1763863380861867}
predictions.T5_base_pt.squad.squad.csv
{'meteor': 0.4430975371865821,
 'rouge1': 0.4505391054758221,
 'rouge2': 0.23630356484624362,
 'rougeL': 0.41706970371613455,
 'rougeLsum': 0.4169109103737757}
predictions.T5_base_pt.quac.squad.csv
{'meteor': 0.32373301147608025,
 'rouge1': 0.2995440075780461,
 'rouge2': 0.10090232093578526,
 'rougeL': 0.27689684166147077,
 'rougeLsum': 0.27709570427605157}
predictions.T5_base_pt.quac.quac.csv
{'meteor': 0.2247521536626319,
 'rouge1': 0.21764427384319768,
 'rouge2': 0.0718138922285132,
 'rougeL': 0.21306161056965345,
 'rougeLsum': 0.21318144470237674}
predictions.T5_base_pt.squad.nq.csv
{'meteor': 0.2694306204049121,
 'rouge1': 0.34958469648362983,
 'rouge2': 0.1639298816200002,
 'rougeL': 0.3276778292684154,
 'rougeLsum': 0.32779565382724274}
predic

### BLEU-RT

In [None]:
for id, dataset in enumerate(inference_dict.keys()):
  print(f"Evaluating BLEU-RT on {dataset}...")

  targets = inference_dict[dataset]['target'].tolist()
  predictions = inference_dict[dataset]['prediction'].tolist()

  # BLEU-RT scores
  bleurt_results = bleurt.compute(predictions=predictions,
                                  references=targets)
  
  # Average over scores
  bleurt_scores_list = bleurt_results['scores']
  avg_bleurt = sum(bleurt_scores_list) / len(bleurt_scores_list)
    
  # If this dataset hasn't been added to dict, add it and metric
  if not evaluation_dict.get(dataset):
    evaluation_dict.update(
          {
              dataset: {'bleurt': avg_bleurt}
          }
      )
    
  # This dataset already exists as a key, so add this metric
  else:
    evaluation_dict[dataset].update(
        {
            'bleurt': avg_bleurt
        }
    )


In [None]:
pprint.pprint(evaluation_dict)

### BERTScore

In [None]:
for id, dataset in enumerate(inference_dict.keys()):
  print(f"Evaluating BERTScore on {dataset}...")

  targets = inference_dict[dataset]['target'].tolist()
  predictions = inference_dict[dataset]['prediction'].tolist()

  # BERT Scores
  bertscore_results = bertscore.compute(predictions=predictions,
                                        references=targets,
                                        model_type='distilbert-base-uncased')
  
  # Average over scores
  bertscore_precision_list = bertscore_results['precision']
  bertscore_recall_list = bertscore_results['recall']
  bertscore_f1_list = bertscore_results['f1']

  avg_precision = sum(bertscore_precision_list) / len(bertscore_precision_list)
  avg_recall = sum(bertscore_recall_list) / len(bertscore_recall_list)
  avg_f1 = sum(bertscore_f1_list) / len(bertscore_f1_list)
    
  # If this dataset hasn't been added to dict, add it and metric
  if not evaluation_dict.get(dataset):
    evaluation_dict.update(
          {
              dataset: {'bertscore-precision': avg_precision,
                        'bertscore-recall': avg_recall,
                        'bertscore-f1': avg_f1}
          }
      )
    
  # This dataset already exists as a key, so add this metric
  else:
    evaluation_dict[dataset].update(
        {
            'bertscore-precision': avg_precision,
            'bertscore-recall': avg_recall,
            'bertscore-f1': avg_f1
        }
    )


### METEOR

In [None]:
for id, dataset in enumerate(inference_dict.keys()):
  print(f"Evaluating METEOR on {dataset}...")

  targets = inference_dict[dataset]['target'].tolist()
  predictions = inference_dict[dataset]['prediction'].tolist()

  meteor_results_list = []

  # METEOR takes a pair of inputs at a time
  for pair in zip(targets, predictions):
    
    # Calculate METEOR scores
    results = meteor.compute(predictions=[pair[0]],
                             references=[pair[1]])
    
    meteor_results_list.append(results['meteor'])
  
  avg_meteor = sum(meteor_results_list) / len(meteor_results_list)

  # Add METEOR to dictionary
  # If this dataset hasn't been added to dict, add it and metric
  if not evaluation_dict.get(dataset):
    evaluation_dict.update(
        {
            dataset: {'meteor': avg_meteor}
        }
    )
  
  # This dataset already exists as a key, so add this metric
  else:
      evaluation_dict[dataset].update(
          {
              'meteor': avg_meteor
          }
      )


Evaluating METEOR on predictions.T5_base_pt.squad.quac.csv...
Evaluating METEOR on predictions.T5_base_pt.squad.squad.csv...
Evaluating METEOR on predictions.T5_base_pt.quac.squad.csv...


In [None]:
evaluation_dict

# Archive

⚠️ This cell takes some time. ⚠️

In [None]:
evaluation_dict = {}

for id, dataset in enumerate(inference_dict.keys()):

  # Get this dataset's `target` and `prediction` values
  targets = inference_dict[dataset]['target'].tolist()
  predictions = inference_dict[dataset]['prediction'].tolist()

  
  # Evaluations

  # ROUGE
  rouge_results = rouge.compute(predictions=predictions,
                                references=targets,
                                use_aggregator=False)
  evaluation_dict.update(
      {
          dataset: {'rouge': rouge_results}
      }
  )

  # BLEU-RT
  bleurt_results = bleurt.compute(predictions=predictions,
                                  references=targets)
  evaluation_dict.update(
      {
          dataset: {'bleurt': bleurt_results}
      }
  )

  # BERTScore
  bertscore_results = bertscore.compute(predictions=predictions,
                                        references=targets,
                                        model_type='distilbert-base-uncased')
  evaluation_dict.update(
      {
          dataset: {'bertscore': bleurt_results}
      }
  )

  print(f"Dataset {dataset} evaluated.")

KeyboardInterrupt: ignored