<a href="https://colab.research.google.com/github/jeanlucjackson/w266_final_project/blob/main/code/inference/evaluate_inferences.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Imports & Google Drive Mounting

In [1]:
from os import listdir
from os.path import isfile, join

import csv
import json
import pprint

import pandas as pd

In [2]:
# This cell will authenticate you and mount your Drive in the Colab.
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
inference_root = "/content/drive/MyDrive/w266 NLP Final Project/Predictions/"

In [15]:
inference_files = listdir(inference_root)
print(len(inference_files), 'files')
pprint.pprint(inference_files)

33 files
['t5_simple_transformers_preds.csv',
 'predictions.bart_base_pt.squad.squad.csv',
 'predictions.T5_base_pt_long.nq.nq.csv',
 'predictions.T5_base_pt_long.nq.squad.csv',
 'predictions.T5_base_pt_long.triviaqa.squad.csv',
 'T5 short model prediction archive',
 'predictions.T5_base_pt_long.nq.quac.csv',
 'predictions.T5_base_pt_long.nq.triviaqa.csv',
 'predictions.T5_base_pt_long.triviaqa.quac.csv',
 'predictions.T5_base_pt_long.triviaqa.nq.csv',
 'predictions.T5_base_pt_long.squad.squad.csv',
 'predictions.T5_base_pt_long.squad.nq.csv',
 'predictions.T5_base_pt_long.squad.triviaqa.csv',
 'predictions.bart_base_pt.squad.quac.csv',
 'predictions.T5_base_pt_long.quac.squad.csv',
 'predictions.T5_base_pt_long.quac.nq.csv',
 'predictions.T5_base_pt_long.squad.quac.csv',
 'predictions.T5_base_pt_long.quac.quac.csv',
 'predictions.T5_base_pt_long.quac.triviaqa.csv',
 'predictions.T5_base_pt_long.triviaqa.triviaqa.csv',
 'predictions.bart_base_pt_long.squad.nq.csv',
 'predictions.bart_b

# Evaluations


## Load Data

Inferences are saved into the nested dictionary `inference_dict`, whose format is:
- keys: CSV filenames
- values:
  - `target`: pandas Series of target values
  - `prediction`: pandas Series of prediction values

In [5]:
inference_dict = {}

for id, inf_file in enumerate(inference_files):

  # Load CSV file containing predictions
  filename = join(inference_root, inf_file)
  
  # If the file exists, load it into pandas
  if isfile(filename):
    print(f"Opening file {id + 1} of {len(inference_files)}: {inf_file}\n")

    df = pd.read_csv(filename)
    
    # If the CSV is missing either required column, warn user and skip file
    if 'target' not in df.columns or 'prediction' not in df.columns:
      print("WARNING: Columns `target` and `prediction` not both found in CSV. Skipping CSV.")
      print(f"Check file: {filename}")
      # continue

    # Columns exist, so continue
    else:
      targets = df['target']
      predictions = df['prediction']

      print('CSV loaded.')
      print(f"Length of targets:      {len(targets)}")
      print(f"Length of predictions:  {len(predictions)}")
      
      # Save lists into prediction dictionary under file's name
      inference_dict.update(
          {inf_file: {'target': targets,
                      'prediction': predictions}
          }
      )
      print('\nTargets and predictions saved.')
    
    print('________________________________________\n')


print(f"\nTotal of {len(inference_dict.keys())} datasets loaded:")
for dataset in inference_dict.keys():
  print('    ' + dataset)

Opening file 1 of 33: t5_simple_transformers_preds.csv

Check file: /content/drive/MyDrive/w266 NLP Final Project/Predictions/t5_simple_transformers_preds.csv
________________________________________

Opening file 2 of 33: predictions.bart_base_pt.squad.squad.csv

CSV loaded.
Length of targets:      10570
Length of predictions:  10570

Targets and predictions saved.
________________________________________

Opening file 3 of 33: predictions.T5_base_pt_long.nq.nq.csv

CSV loaded.
Length of targets:      2356
Length of predictions:  2356

Targets and predictions saved.
________________________________________

Opening file 4 of 33: predictions.T5_base_pt_long.nq.squad.csv

CSV loaded.
Length of targets:      10570
Length of predictions:  10570

Targets and predictions saved.
________________________________________

Opening file 5 of 33: predictions.T5_base_pt_long.triviaqa.squad.csv

CSV loaded.
Length of targets:      10570
Length of predictions:  10570

Targets and predictions saved.


We'll be using:
- ROUGE
- BLEURT
- BERTScore
- METEOR
- USE

And storing evaluations in `evaluation_dict` formatted as:
- keys: CSV filenames
- values:
  - metric_name: metric_value
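The nested layout described above can be filled in one metric at a time. The following is a minimal sketch, assuming a hypothetical helper named `record_metrics` (not part of the notebook) and made-up scores, of how an entry is created on first use and extended afterwards:

```python
def record_metrics(evaluation_dict, dataset, metrics):
    """Merge `metrics` into the entry for `dataset`, creating the entry if needed."""
    # setdefault returns the existing inner dict, or inserts and returns a new one
    evaluation_dict.setdefault(dataset, {}).update(metrics)
    return evaluation_dict


# Illustrative values only; the filename and scores are invented for this sketch
evaluation_dict = {}
record_metrics(evaluation_dict, "example.csv", {"rouge1": 0.5})
record_metrics(evaluation_dict, "example.csv", {"meteor": 0.25})
print(evaluation_dict)
```

Using `setdefault` avoids the separate "dataset not yet in dict" and "dataset already in dict" branches.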

## Load Evaluation Metrics

In [6]:
!pip install -q evaluate
import evaluate


### ROUGE
🤗 [ROUGE page](https://huggingface.co/spaces/evaluate-metric/rouge)

In [7]:
!pip install -q rouge_score

rouge = evaluate.load('rouge')

  Building wheel for rouge-score (setup.py) ... done


Downloading builder script:   0%|          | 0.00/6.27k [00:00<?, ?B/s]

### BLEURT
- No fine-tuning yet.
- Using the `BLEURT-20-D3` checkpoint, a smaller distilled variant of the `BLEURT-20` checkpoint that Google recommends (see [BLEURT GitHub page](https://github.com/google-research/bleurt/blob/master/checkpoints.md#the-recommended-checkpoint-bleurt-20))

In [8]:
!pip install git+https://github.com/google-research/bleurt.git

bleurt = evaluate.load('bleurt', module_type='metric', checkpoint='BLEURT-20-D3')

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting git+https://github.com/google-research/bleurt.git
  Cloning https://github.com/google-research/bleurt.git to /tmp/pip-req-build-cbpzethh
  Running command git clone -q https://github.com/google-research/bleurt.git /tmp/pip-req-build-cbpzethh
Collecting tf-slim>=1.1
  Downloading tf_slim-1.1.0-py2.py3-none-any.whl (352 kB)
Collecting sentencepiece
  Downloading sentencepiece-0.1.97-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
Building wheels for collected packages: BLEURT
  Building wheel for BLEURT (setup.py) ... done
  Created wheel for BLEURT: filename=BLEURT-0.0.2-py3-none-any.whl size=16456783 sha256=68e4c17a8f54ad2802e73b751f3fda362c4ed07647cced58cc9e3113266316ac
  Stored in directory: /tmp/pip-ephem-wheel-cache-b

Downloading builder script:   0%|          | 0.00/5.20k [00:00<?, ?B/s]



Downloading data:   0%|          | 0.00/405M [00:00<?, ?B/s]

### BERTScore
🤗 [BERTScore page](https://huggingface.co/spaces/evaluate-metric/bertscore)
- Using `distilbert-base-uncased` per 🤗 recommendation because the default model (`roberta-large`) is over 1.4GB

In [9]:
!pip install bert_score

bertscore = evaluate.load('bertscore')

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting bert_score
  Downloading bert_score-0.3.12-py3-none-any.whl (60 kB)
Collecting transformers>=3.0.0
  Downloading transformers-4.24.0-py3-none-any.whl (5.5 MB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1
  Downloading tokenizers-0.13.1-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.6 MB)
Installing collected packages: tokenizers, transformers, bert-score
Successfully installed bert-score-0.3.12 tokenizers-0.13.1 transformers-4.24.0


Downloading builder script:   0%|          | 0.00/7.95k [00:00<?, ?B/s]

### METEOR
🤗 [METEOR page](https://huggingface.co/spaces/evaluate-metric/meteor)

In [10]:
meteor = evaluate.load('meteor')

Downloading builder script:   0%|          | 0.00/6.81k [00:00<?, ?B/s]

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...


### Universal Sentence Encoder (USE) – PENDING

## Calculate Metrics on Each Dataset

Metrics are calculated for each `target`/`prediction` pair. The per-pair scores are then averaged over each dataset, giving a single value per metric that can be compared across models and datasets.
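As a small illustration of the averaging step, here is a sketch assuming a hypothetical helper `average_score` and invented per-pair scores. It mirrors the `sum(...) / len(...)` pattern used in the cells below and adds a guard for an empty list:

```python
def average_score(scores):
    """Return the arithmetic mean of a list of per-pair scores."""
    if not scores:
        # An empty dataset would otherwise raise ZeroDivisionError
        raise ValueError("cannot average an empty list of scores")
    return sum(scores) / len(scores)


# Invented scores for three target/prediction pairs
pair_scores = [0.5, 1.0, 0.0]
dataset_score = average_score(pair_scores)
print(dataset_score)
```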

In [11]:
# Evaluations will be stored in dictionary
filename = join(inference_root, "evaluation_dict.json")

if isfile(filename):
  print('A previously saved `evaluation_dict` exists!')
  print('Consider if you would like to load it or work with an empty dictionary.')

A previously saved `evaluation_dict` exists!
Consider if you would like to load it or work with an empty dictionary.
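The choice between the two options below could also be wrapped in one function. This is only a sketch: `load_evaluation_dict` and its `use_saved` flag are hypothetical names, not part of the notebook.

```python
import json
from os.path import isfile


def load_evaluation_dict(path, use_saved=True):
    """Return the saved dictionary at `path` if wanted and present, else an empty dict."""
    if use_saved and isfile(path):
        with open(path) as json_file:
            return json.load(json_file)
    # No saved file (or the caller asked to start fresh)
    return {}
```

Calling `load_evaluation_dict(filename)` would correspond to Option 1 when the file exists, and `load_evaluation_dict(filename, use_saved=False)` to Option 2.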


### Option 1: Load dict JSON

In [13]:
filename = join(inference_root, "evaluation_dict.json")

with open(filename) as json_file:
  evaluation_dict = json.load(json_file)
  pprint.pprint(evaluation_dict.keys())
  print(len(evaluation_dict.keys()))

dict_keys(['predictions.T5_base_pt_long.quac.triviaqa.csv', 'predictions.bart_base_pt.squad.squad.csv', 'predictions.T5_base_pt_long.nq.nq.csv', 'predictions.T5_base_pt_long.triviaqa.triviaqa.csv', 'predictions.T5_base_pt_long.nq.squad.csv', 'predictions.T5_base_pt_long.triviaqa.squad.csv', 'predictions.T5_base_pt_long.nq.quac.csv', 'predictions.T5_base_pt_long.nq.triviaqa.csv', 'predictions.T5_base_pt_long.triviaqa.quac.csv', 'predictions.T5_base_pt_long.triviaqa.nq.csv', 'predictions.T5_base_pt_long.squad.squad.csv', 'predictions.T5_base_pt_long.squad.nq.csv', 'predictions.T5_base_pt_long.squad.triviaqa.csv', 'predictions.bart_base_pt.squad.quac.csv', 'predictions.T5_base_pt_long.quac.squad.csv', 'predictions.T5_base_pt_long.quac.nq.csv', 'predictions.T5_base_pt_long.squad.quac.csv', 'predictions.T5_base_pt_long.quac.quac.csv', 'predictions.bart_base_pt_long.squad.nq.csv', 'predictions.bart_base_pt_long.squad.quac.csv', 'predictions.bart_base_pt_long.squad.squad.csv', 'predictions.ba

### Option 2: Create empty dict

In [None]:
# evaluation_dict = {}

### ROUGE

In [14]:
for id, dataset in enumerate(inference_dict.keys()):
  print(f"Evaluating ROUGE on {dataset}...")

  targets = inference_dict[dataset]['target'].tolist()
  predictions = inference_dict[dataset]['prediction'].tolist()

  # ROUGE scores
  # The use_aggregator argument takes the average for us
  rouge_results = rouge.compute(predictions=predictions,
                                references=targets,
                                use_aggregator=True)
  
  for metric in rouge_results:
    
    # If this dataset hasn't been added to dict, add it and metric
    if not evaluation_dict.get(dataset):
      evaluation_dict.update(
            {
                dataset: {metric: rouge_results[metric]}
            }
        )
      
    # This dataset already exists as a key, so add this metric
    else:
      evaluation_dict[dataset].update(
          {
              metric: rouge_results[metric]
          }
      )


Evaluating ROUGE on predictions.bart_base_pt.squad.squad.csv...
Evaluating ROUGE on predictions.T5_base_pt_long.nq.nq.csv...
Evaluating ROUGE on predictions.T5_base_pt_long.nq.squad.csv...
Evaluating ROUGE on predictions.T5_base_pt_long.triviaqa.squad.csv...
Evaluating ROUGE on predictions.T5_base_pt_long.nq.quac.csv...
Evaluating ROUGE on predictions.T5_base_pt_long.nq.triviaqa.csv...
Evaluating ROUGE on predictions.T5_base_pt_long.triviaqa.quac.csv...
Evaluating ROUGE on predictions.T5_base_pt_long.triviaqa.nq.csv...
Evaluating ROUGE on predictions.T5_base_pt_long.squad.squad.csv...
Evaluating ROUGE on predictions.T5_base_pt_long.squad.nq.csv...
Evaluating ROUGE on predictions.T5_base_pt_long.squad.triviaqa.csv...
Evaluating ROUGE on predictions.bart_base_pt.squad.quac.csv...
Evaluating ROUGE on predictions.T5_base_pt_long.quac.squad.csv...
Evaluating ROUGE on predictions.T5_base_pt_long.quac.nq.csv...
Evaluating ROUGE on predictions.T5_base_pt_long.squad.quac.csv...
Evaluating ROUGE

In [None]:
# `evaluation_dict` as JSON
filename = join(inference_root, "evaluation_dict.json")
with open(filename, "w") as outfile:
  json.dump(evaluation_dict, outfile)
  print(f'JSON saved at {filename}')

JSON saved at /content/drive/MyDrive/w266 NLP Final Project/Predictions/evaluation_dict.json


### BLEURT

In [None]:
# BLEURT takes longer, so here's an option to pick up where
# we left off if Colab terminated the session

print('All predictions to evaluate:')
for dataset in inference_dict.keys():
  print(f'    {dataset}')

print()
print('Predictions with BLEU-RT evaluations:')
for name in evaluation_dict.keys():
  # Membership test, so datasets without a BLEURT score don't raise KeyError
  if 'bleurt' in evaluation_dict[name]:
    print('    ' + name)

All predictions to evaluate:
    predictions.T5_base_pt_long.quac.triviaqa.csv
    predictions.bart_base_pt.squad.squad.csv
    predictions.T5_base_pt_long.nq.nq.csv
    predictions.T5_base_pt_long.triviaqa.triviaqa.csv
    predictions.T5_base_pt_long.nq.squad.csv
    predictions.T5_base_pt_long.triviaqa.squad.csv
    predictions.T5_base_pt_long.nq.quac.csv
    predictions.T5_base_pt_long.nq.triviaqa.csv
    predictions.T5_base_pt_long.triviaqa.quac.csv
    predictions.T5_base_pt_long.triviaqa.nq.csv
    predictions.T5_base_pt_long.squad.squad.csv
    predictions.T5_base_pt_long.squad.nq.csv
    predictions.T5_base_pt_long.squad.triviaqa.csv
    predictions.bart_base_pt.squad.quac.csv
    predictions.T5_base_pt_long.quac.squad.csv
    predictions.T5_base_pt_long.quac.nq.csv
    predictions.T5_base_pt_long.squad.quac.csv
    predictions.T5_base_pt_long.quac.quac.csv

Predictions with BLEU-RT evaluations:


In [None]:
# Copy a dataset name string from above and paste below
# or set to None to evaluate all of them.
begin_at = 'predictions.bart_base_pt.squad.squad.csv'

In [None]:
# Used to control which dataset to resume evaluation at
play = False

for id, dataset in enumerate(inference_dict.keys()):

  # Do not evaluate until we reach the `begin_at` dataset
  if dataset == begin_at or begin_at is None:
    play = True
  
  # If `play` is True, we've reached the dataset and can resume evaluation
  if play:
    print(f"Evaluating BLEU-RT on {dataset}...")

    targets = inference_dict[dataset]['target'].tolist()
    predictions = inference_dict[dataset]['prediction'].tolist()

    # BLEU-RT scores
    bleurt_results = bleurt.compute(predictions=predictions,
                                    references=targets)
    
    # Average over scores
    bleurt_scores_list = bleurt_results['scores']
    avg_bleurt = sum(bleurt_scores_list) / len(bleurt_scores_list)
      
    # If this dataset hasn't been added to dict, add it and metric
    if not evaluation_dict.get(dataset):
      evaluation_dict.update(
            {
                dataset: {'bleurt': avg_bleurt}
            }
        )
      
    # This dataset already exists as a key, so add this metric
    else:
      evaluation_dict[dataset].update(
          {
              'bleurt': avg_bleurt
          }
      )

    # Save this version of `evaluation_dict` as JSON in case Colab dies out
    filename = join(inference_root, "evaluation_dict.json")
    with open(filename, "w") as outfile:
      json.dump(evaluation_dict, outfile)
      print(f'JSON saved at {filename}')

  # If `play` is False, we have not reached the `begin_at` dataset yet
  else:
    print(f"Skipping {dataset}.")


Evaluating BLEU-RT on predictions.T5_base_pt_long.quac.triviaqa.csv...
JSON saved at /content/drive/MyDrive/w266 NLP Final Project/Predictions/evaluation_dict.json
Evaluating BLEU-RT on predictions.bart_base_pt.squad.squad.csv...
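An alternative to picking `begin_at` by hand is to skip every dataset that already has a score for the metric in the saved dictionary. The sketch below assumes a hypothetical helper `pending_datasets` and uses invented filenames and scores:

```python
def pending_datasets(datasets, evaluation_dict, metric):
    """Return the datasets that do not yet have `metric` recorded."""
    return [name for name in datasets
            if metric not in evaluation_dict.get(name, {})]


# Invented state: only "a.csv" has a BLEURT score so far
evaluation_dict = {"a.csv": {"bleurt": 0.4, "rouge1": 0.3},
                   "b.csv": {"rouge1": 0.2}}
todo = pending_datasets(["a.csv", "b.csv", "c.csv"], evaluation_dict, "bleurt")
print(todo)
```

Because the dictionary is saved to JSON after each dataset, reloading it and looping over the pending list would resume after the last completed dataset, regardless of ordering.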


In [None]:
pprint.pprint(evaluation_dict)

### BERTScore

In [16]:
for id, dataset in enumerate(inference_dict.keys()):
  print(f"Evaluating BERTScore on {dataset}...")

  targets = inference_dict[dataset]['target'].tolist()
  predictions = inference_dict[dataset]['prediction'].tolist()

  # BERT Scores
  bertscore_results = bertscore.compute(predictions=predictions,
                                        references=targets,
                                        model_type='distilbert-base-uncased')
  
  # Average over scores
  bertscore_precision_list = bertscore_results['precision']
  bertscore_recall_list = bertscore_results['recall']
  bertscore_f1_list = bertscore_results['f1']

  avg_precision = sum(bertscore_precision_list) / len(bertscore_precision_list)
  avg_recall = sum(bertscore_recall_list) / len(bertscore_recall_list)
  avg_f1 = sum(bertscore_f1_list) / len(bertscore_f1_list)
    
  # If this dataset hasn't been added to dict, add it and metric
  if not evaluation_dict.get(dataset):
    evaluation_dict.update(
          {
              dataset: {'bertscore-precision': avg_precision,
                        'bertscore-recall': avg_recall,
                        'bertscore-f1': avg_f1}
          }
      )
    
  # This dataset already exists as a key, so add this metric
  else:
    evaluation_dict[dataset].update(
        {
            'bertscore-precision': avg_precision,
            'bertscore-recall': avg_recall,
            'bertscore-f1': avg_f1
        }
    )

  # Save this version of `evaluation_dict` as JSON in case Colab dies out
  filename = join(inference_root, "evaluation_dict.json")
  with open(filename, "w") as outfile:
    json.dump(evaluation_dict, outfile)
    print(f'JSON saved at {filename}')


Evaluating BERTScore on predictions.bart_base_pt.squad.squad.csv...


Downloading:   0%|          | 0.00/28.0 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/483 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/232k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/268M [00:00<?, ?B/s]

JSON saved at /content/drive/MyDrive/w266 NLP Final Project/Predictions/evaluation_dict.json
Evaluating BERTScore on predictions.T5_base_pt_long.nq.nq.csv...
JSON saved at /content/drive/MyDrive/w266 NLP Final Project/Predictions/evaluation_dict.json
Evaluating BERTScore on predictions.T5_base_pt_long.nq.squad.csv...
JSON saved at /content/drive/MyDrive/w266 NLP Final Project/Predictions/evaluation_dict.json
Evaluating BERTScore on predictions.T5_base_pt_long.triviaqa.squad.csv...
JSON saved at /content/drive/MyDrive/w266 NLP Final Project/Predictions/evaluation_dict.json
Evaluating BERTScore on predictions.T5_base_pt_long.nq.quac.csv...
JSON saved at /content/drive/MyDrive/w266 NLP Final Project/Predictions/evaluation_dict.json
Evaluating BERTScore on predictions.T5_base_pt_long.nq.triviaqa.csv...
JSON saved at /content/drive/MyDrive/w266 NLP Final Project/Predictions/evaluation_dict.json
Evaluating BERTScore on predictions.T5_base_pt_long.triviaqa.quac.csv...
JSON saved at /content/d

### METEOR

In [17]:
for id, dataset in enumerate(inference_dict.keys()):
  print(f"Evaluating METEOR on {dataset}...")

  targets = inference_dict[dataset]['target'].tolist()
  predictions = inference_dict[dataset]['prediction'].tolist()

  meteor_results_list = []

  # Score one pair at a time so the per-pair scores can be averaged below
  for target, prediction in zip(targets, predictions):
    
    # Calculate METEOR scores
    results = meteor.compute(predictions=[prediction],
                             references=[target])
    
    meteor_results_list.append(results['meteor'])
  
  avg_meteor = sum(meteor_results_list) / len(meteor_results_list)

  # Add METEOR to dictionary
  # If this dataset hasn't been added to dict, add it and metric
  if not evaluation_dict.get(dataset):
    evaluation_dict.update(
        {
            dataset: {'meteor': avg_meteor}
        }
    )
  
  # This dataset already exists as a key, so add this metric
  else:
      evaluation_dict[dataset].update(
          {
              'meteor': avg_meteor
          }
      )

  # Save this version of `evaluation_dict` as JSON in case Colab dies out
  filename = join(inference_root, "evaluation_dict.json")
  with open(filename, "w") as outfile:
    json.dump(evaluation_dict, outfile)
    print(f'JSON saved at {filename}')


Evaluating METEOR on predictions.bart_base_pt.squad.squad.csv...
JSON saved at /content/drive/MyDrive/w266 NLP Final Project/Predictions/evaluation_dict.json
Evaluating METEOR on predictions.T5_base_pt_long.nq.nq.csv...
JSON saved at /content/drive/MyDrive/w266 NLP Final Project/Predictions/evaluation_dict.json
Evaluating METEOR on predictions.T5_base_pt_long.nq.squad.csv...
JSON saved at /content/drive/MyDrive/w266 NLP Final Project/Predictions/evaluation_dict.json
Evaluating METEOR on predictions.T5_base_pt_long.triviaqa.squad.csv...


ValueError: ignored
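The cause of the `ValueError` above is not shown in the output. One possibility, stated here purely as an assumption, is that `pd.read_csv` parsed an empty cell as `NaN` (a float), which a metric expecting strings may reject. A sketch of filtering such pairs before scoring, using a hypothetical helper `keep_string_pairs` and invented data:

```python
def keep_string_pairs(targets, predictions):
    """Keep only the pairs where both target and prediction are strings."""
    kept = [(t, p) for t, p in zip(targets, predictions)
            if isinstance(t, str) and isinstance(p, str)]
    dropped = len(targets) - len(kept)
    kept_targets = [t for t, _ in kept]
    kept_predictions = [p for _, p in kept]
    return kept_targets, kept_predictions, dropped


# Invented example: NaN stands in for an empty CSV cell
targets = ["who wrote it?", float("nan"), "when was it built?"]
predictions = ["who wrote it", "what is it?", float("nan")]
clean_targets, clean_predictions, dropped = keep_string_pairs(targets, predictions)
print(clean_targets, clean_predictions, dropped)
```

Dropping pairs changes the denominator of the average, so the number of dropped pairs is returned and should be reported alongside the score.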

In [None]:
evaluation_dict

# Results

To do: loop through `evaluation_dict` and, for each dataset, add a row to a pandas DataFrame.
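A minimal sketch of that step, assuming `evaluation_dict` has the flat `{filename: {metric_name: metric_value}}` layout described earlier; the helper name `evaluation_rows` and the example values are hypothetical:

```python
def evaluation_rows(evaluation_dict):
    """Flatten the nested dictionary into one row (a dict) per dataset."""
    rows = []
    for dataset, metrics in evaluation_dict.items():
        row = {"dataset": dataset}
        row.update(metrics)
        rows.append(row)
    return rows


# Invented scores for illustration
example = {"a.csv": {"rouge1": 0.5, "meteor": 0.25},
           "b.csv": {"rouge1": 0.4}}
rows = evaluation_rows(example)
print(rows)
```

Passing `rows` to `pd.DataFrame(rows)` would then give one row per dataset and one column per metric, with `NaN` wherever a metric has not been computed yet.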

# Archive

⚠️ This cell takes some time. ⚠️

In [None]:
evaluation_dict = {}

for dataset in inference_dict.keys():

  # Get this dataset's `target` and `prediction` values
  targets = inference_dict[dataset]['target'].tolist()
  predictions = inference_dict[dataset]['prediction'].tolist()

  
  # Evaluations

  # ROUGE
  rouge_results = rouge.compute(predictions=predictions,
                                references=targets,
                                use_aggregator=False)

  # BLEURT
  bleurt_results = bleurt.compute(predictions=predictions,
                                  references=targets)

  # BERTScore
  bertscore_results = bertscore.compute(predictions=predictions,
                                        references=targets,
                                        model_type='distilbert-base-uncased')

  # Store all three results in one entry so that later metrics
  # do not overwrite earlier ones for the same dataset
  evaluation_dict[dataset] = {'rouge': rouge_results,
                              'bleurt': bleurt_results,
                              'bertscore': bertscore_results}

  print(f"Dataset {dataset} evaluated.")

KeyboardInterrupt: ignored