<a href="https://colab.research.google.com/github/jeanlucjackson/w266_final_project/blob/main/code/inference/evaluate_inferences.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Imports & Google Drive Mounting

In [None]:
from os import listdir
from os.path import isfile, join

import csv
import pprint

import pandas as pd

In [None]:
# This cell will authenticate you and mount your Drive in the Colab.
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
inference_root = "/content/drive/MyDrive/w266 NLP Final Project/Predictions/"

In [None]:
inference_files = listdir(inference_root)
print(inference_files)

['t5_simple_transformers_preds.csv', 'predictions.T5_base_pt.squad.quac.csv', 'predictions.T5_base_pt.squad.squad.csv', 'predictions.T5_base_pt.quac.squad.csv', 'predictions.T5_base_pt.quac.quac.csv', 'predictions.T5_base_pt.squad.nq.csv', 'predictions.T5_base_pt.quac.nq.csv', 'predictions.T5_base_pt_long.squad.triviaqa.csv', 'predictions.T5_base_pt_long.quac.triviaqa.csv']


# Evaluations
## Load Data

Inferences will be saved into the `inference_dict` nested dictionary, whose format is:
- keys: CSV filenames
- values:
  - `target`: list of target values
  - `prediction`: list of prediction values
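
As a minimal sketch of that layout, the snippet below builds a toy dictionary with a made-up filename and placeholder strings, then pairs each target with its prediction. The name `example_inference_dict` and its contents are hypothetical. The loading cell in this notebook stores pandas Series rather than plain lists, but the nesting is the same.

```python
# Toy illustration of the nested layout; the filename and strings are made up.
example_inference_dict = {
    "predictions.example_a.csv": {
        "target": ["first reference text", "second reference text"],
        "prediction": ["first generated text", "second generated text"],
    },
}

# Walk the dictionary and pair each target with its prediction.
example_pairs = {}
for filename, columns in example_inference_dict.items():
    example_pairs[filename] = list(zip(columns["target"], columns["prediction"]))
    print(filename, len(example_pairs[filename]))
```

Keeping both columns under the filename means a later evaluation loop only needs the dictionary key to fetch everything for one CSV.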

In [None]:
inference_dict = {}

for idx, inf_file in enumerate(inference_files):

  # Load CSV file containing predictions
  filename = join(inference_root, inf_file)

  # If the file exists, load it into pandas
  if isfile(filename):
    print(f"Opening file {idx + 1} of {len(inference_files)}: {inf_file}\n")

    df = pd.read_csv(filename)

    # If the CSV is missing either required column, warn user and skip file
    if 'target' not in df.columns or 'prediction' not in df.columns:
      print("WARNING: Columns `target` and `prediction` not found in CSV. Skipping CSV.")
      print(f"Check file: {filename}")

    # Columns exist, so continue
    else:
      targets = df['target']
      predictions = df['prediction']

      print('CSV loaded.')
      print(f"Length of targets:      {len(targets)}")
      print(f"Length of predictions:  {len(predictions)}")
      
      # Save lists into prediction dictionary under file's name
      inference_dict.update(
          {inf_file: {'target': targets,
                      'prediction': predictions}
          }
      )
      print('\nTargets and predictions saved.')
    
    print('________________________________________\n')


print(f"\nTotal of {len(inference_dict.keys())} datasets loaded:")
for dataset in inference_dict.keys():
  print('    ' + dataset)

Opening file 1 of 9: t5_simple_transformers_preds.csv

Check file: /content/drive/MyDrive/w266 NLP Final Project/Predictions/t5_simple_transformers_preds.csv
________________________________________

Opening file 2 of 9: predictions.T5_base_pt.squad.quac.csv

CSV loaded.
Length of targets:      5868
Length of predictions:  5868

Targets and predictions saved.
________________________________________

Opening file 3 of 9: predictions.T5_base_pt.squad.squad.csv

CSV loaded.
Length of targets:      10570
Length of predictions:  10570

Targets and predictions saved.
________________________________________

Opening file 4 of 9: predictions.T5_base_pt.quac.squad.csv

CSV loaded.
Length of targets:      10570
Length of predictions:  10570

Targets and predictions saved.
________________________________________

Opening file 5 of 9: predictions.T5_base_pt.quac.quac.csv

CSV loaded.
Length of targets:      5868
Length of predictions:  5868

Targets and predictions saved.
______________________

## Evaluate Predictions

We'll be using:
- ROUGE
- BLEURT
- BERTScore

Evaluations are stored in `evaluation_dict`, formatted as:
- keys: CSV filenames
- values:
  - metric_name: metric_value
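
To illustrate how this layout supports comparison across files, here is a small sketch with hypothetical filenames and scores. The helper `rank_by_metric` is not part of the notebook; it is an assumed convenience function that sorts files by one metric.

```python
# Hypothetical filenames and scores, for illustration only.
example_evaluation_dict = {
    "predictions.example_a.csv": {"rouge1": 0.25, "rougeL": 0.20},
    "predictions.example_b.csv": {"rouge1": 0.40, "rougeL": 0.35},
}


def rank_by_metric(evaluations, metric):
    """Return (filename, score) pairs sorted from highest to lowest score."""
    scored = [
        (name, metrics[metric])
        for name, metrics in evaluations.items()
        if metric in metrics  # skip files that lack this metric
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)


example_ranking = rank_by_metric(example_evaluation_dict, "rouge1")
for name, score in example_ranking:
    print(f"{name}: {score:.2f}")
```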

### Load Evaluation Metrics

In [None]:
!pip install -q evaluate
import evaluate

#### ROUGE
🤗 [ROUGE page](https://huggingface.co/spaces/evaluate-metric/rouge)

In [None]:
!pip install -q rouge_score

rouge = evaluate.load('rouge')

  Building wheel for rouge-score (setup.py) ... done

#### BLEURT
- No fine-tuning yet.
- Using `BLEURT-20` checkpoint per Google's recommendation (see [BLEURT GitHub page](https://github.com/google-research/bleurt/blob/master/checkpoints.md#the-recommended-checkpoint-bleurt-20))

In [None]:
!pip install git+https://github.com/google-research/bleurt.git

bleurt = evaluate.load('bleurt', module_type='metric', checkpoint='BLEURT-20')

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting git+https://github.com/google-research/bleurt.git
  Cloning https://github.com/google-research/bleurt.git to /tmp/pip-req-build-_agi1ll8
  Running command git clone -q https://github.com/google-research/bleurt.git /tmp/pip-req-build-_agi1ll8
Collecting tf-slim>=1.1
  Downloading tf_slim-1.1.0-py2.py3-none-any.whl (352 kB)
Collecting sentencepiece
  Downloading sentencepiece-0.1.97-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
Building wheels for collected packages: BLEURT
  Building wheel for BLEURT (setup.py) ... done
  Created wheel for BLEURT: filename=BLEURT-0.0.2-py3-none-any.whl size=16456783 sha256=8c08d48215b2c350532157754a6527ce320b10bdc8bd3a0680d703af9303ea5f
  Stored in directory: /tmp/pip-ephem-wheel-cache-yp

#### BERTScore
🤗 [BERTScore page](https://huggingface.co/spaces/evaluate-metric/bertscore)
- Using `distilbert-base-uncased` per 🤗 recommendation because the default model (`roberta-large`) is over 1.4GB

In [None]:
!pip install bert_score

bertscore = evaluate.load('bertscore')

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting bert_score
  Downloading bert_score-0.3.12-py3-none-any.whl (60 kB)
Collecting transformers>=3.0.0
  Downloading transformers-4.24.0-py3-none-any.whl (5.5 MB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1
  Downloading tokenizers-0.13.1-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.6 MB)
Installing collected packages: tokenizers, transformers, bert-score
Successfully installed bert-score-0.3.12 tokenizers-0.13.1 transformers-4.24.0

#### Universal Sentence Encoder (USE) – PENDING

### Calculate Metrics on Each Dataset

Each metric is computed for every `target`/`prediction` pair. The per-pair scores are then averaged within each dataset, giving a single value per metric for comparing models and datasets.
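
The score-then-average pattern can be sketched with a deliberately simple stand-in metric. The function `unigram_f1` below is a simplified unigram-overlap F1 written for illustration; it is not the `rouge_score` implementation and its numbers will not match the ROUGE values reported in this notebook. All names and strings here are hypothetical.

```python
from collections import Counter


def unigram_f1(target, prediction):
    """Simplified per-pair score: F1 over whitespace-split, lowercased tokens."""
    target_tokens = target.lower().split()
    prediction_tokens = prediction.lower().split()
    if not target_tokens or not prediction_tokens:
        return 0.0
    # Multiset intersection counts each shared token at most as often as it
    # appears in both strings.
    overlap = sum((Counter(target_tokens) & Counter(prediction_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(prediction_tokens)
    recall = overlap / len(target_tokens)
    return 2 * precision * recall / (precision + recall)


def dataset_score(targets, predictions):
    """Score every pair, then average to one value for the dataset."""
    scores = [unigram_f1(t, p) for t, p in zip(targets, predictions)]
    return sum(scores) / len(scores) if scores else 0.0


example_targets = ["the cat sat", "a dog barked"]
example_predictions = ["the cat sat", "a bird sang"]
print(dataset_score(example_targets, example_predictions))
```

Averaging per-pair scores weights every example equally, so datasets of different sizes can be compared on the same scale.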

In [None]:
# Evaluations will be stored in dictionary

evaluation_dict = {}

##### ROUGE

In [None]:
for dataset in inference_dict:
  print(f"Evaluating ROUGE on {dataset}...")

  targets = inference_dict[dataset]['target'].tolist()
  predictions = inference_dict[dataset]['prediction'].tolist()

  # With use_aggregator=True, ROUGE returns one aggregated score per variant
  rouge_results = rouge.compute(predictions=predictions,
                                references=targets,
                                use_aggregator=True)

  # Create this dataset's entry if it doesn't exist yet, then add each ROUGE metric
  evaluation_dict.setdefault(dataset, {}).update(rouge_results)

Evaluating ROUGE on predictions.T5_base_pt.squad.quac.csv...
Evaluating ROUGE on predictions.T5_base_pt.squad.squad.csv...
Evaluating ROUGE on predictions.T5_base_pt.quac.squad.csv...
Evaluating ROUGE on predictions.T5_base_pt.quac.quac.csv...
Evaluating ROUGE on predictions.T5_base_pt.squad.nq.csv...
Evaluating ROUGE on predictions.T5_base_pt.quac.nq.csv...
Evaluating ROUGE on predictions.T5_base_pt_long.squad.triviaqa.csv...
Evaluating ROUGE on predictions.T5_base_pt_long.quac.triviaqa.csv...


In [None]:
# test out on one of the datasets
dataset = list(inference_dict.keys())[0]
print(dataset)

targets = inference_dict[dataset]['target'].tolist()
predictions = inference_dict[dataset]['prediction'].tolist()
print(len(targets), len(predictions))

print('ROUGE')
rouge_results = rouge.compute(predictions=predictions,
                              references=targets,
                              use_aggregator=False)

predictions.T5_base_pt.squad.quac.csv
5868 5868
ROUGE


In [None]:
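print(type(rouge_results))
print(len(rouge_results))
print(rouge_results.keys())
print(len(rouge_results['rouge1']))

# Mean of the per-pair scores for each ROUGE variant
print("Averages:")
for k in rouge_results.keys():
  print(k + ': ', end='')
  print(sum(rouge_results[k])/len(rouge_results[k]))

# Highest per-pair score for each ROUGE variant
print("Maxs:")
for k in rouge_results.keys():
  print(k + ': ', end='')
  print(max(rouge_results[k]))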


<class 'dict'>
4
dict_keys(['rouge1', 'rouge2', 'rougeL', 'rougeLsum'])
5868
Averages:
rouge1: 0.18249635750457507
rouge2: 0.03966958090769743
rougeL: 0.17632388612567854
rougeLsum: 0.17632388612567854
Maxs:
rouge1: 1.0
rouge2: 1.0
rougeL: 1.0
rougeLsum: 1.0


##### WORKING ON THIS CELL

⚠️ This cell takes some time. ⚠️

In [None]:
evaluation_dict = {}

for dataset in inference_dict:

  # Get this dataset's `target` and `prediction` values
  targets = inference_dict[dataset]['target'].tolist()
  predictions = inference_dict[dataset]['prediction'].tolist()

  # Collect every metric for this dataset in one dictionary so that
  # later metrics do not overwrite earlier ones
  dataset_results = {}

  # ROUGE (per-pair scores, not aggregated)
  dataset_results['rouge'] = rouge.compute(predictions=predictions,
                                           references=targets,
                                           use_aggregator=False)

  # BLEURT
  dataset_results['bleurt'] = bleurt.compute(predictions=predictions,
                                             references=targets)

  # BERTScore
  dataset_results['bertscore'] = bertscore.compute(predictions=predictions,
                                                   references=targets,
                                                   model_type='distilbert-base-uncased')

  evaluation_dict[dataset] = dataset_results

  print(f"Dataset {dataset} evaluated.")

KeyboardInterrupt: ignored