# Evaluate original models and fine-tuned summarizers with ROUGE and BERTScore

Fine-tuned models saved on the Hugging Face Hub:

Short-form models:

- Chung-Fan/bart-pubmed-20k (original: facebook/bart-large-cnn)
- Chung-Fan/distilbart-pubmed-20k (original: philschmid/distilbart-cnn-12-6-samsum)
- Chung-Fan/pegasus-pubmed-20k (original: tuner007/pegasus_summarizer)

Long-form models:

- Chung-Fan/primera-pubmed-20k (original: allenai/PRIMERA)
- Chung-Fan/led-pubmed-20k (original: pszemraj/led-base-book-summary)
- Chung-Fan/longformer-pubmed-20k (original: hyesunyun/update-summarization-bart-large-longformer)

Fine-tuned on the bottom-truncated dataset:

- Chung-Fan/bart-pubmed-20k-bottom-tokens
- Chung-Fan/distilbart-pubmed-20k-bottom-tokens
- Chung-Fan/pegasus-pubmed-20k-bottom-tokens
---
All datasets, generated summaries, and evaluation results are available in my shared Google Drive folder: https://drive.google.com/drive/folders/1sNoJxaShjifrt_AqyG5_sZYGxHknqfOM?usp=sharing

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
from huggingface_hub import notebook_login
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
import torch
import transformers
import datasets
import pandas as pd
from datasets import load_dataset, Dataset, DatasetDict

scaledown = 20000
summary_max_length = 400
my_num_beams = 3  # reduced from 8 to 3 to make generation less computationally intensive

# To evaluate a different model, replace model_index, my_model, and og_model with the matching names from the list in the introduction
model_index = "distilbart-pubmed-20k"
my_model = "Chung-Fan/distilbart-pubmed-20k"
og_model = "philschmid/distilbart-cnn-12-6-samsum"

notebook_login()


In [None]:
!pip install datasets py7zr rouge_score bert_score

In [None]:
from datasets import load_metric
from tqdm import tqdm

rouge_metric = load_metric("rouge")
rouge_names = ["rouge1", "rouge2", "rougeL", "rougeLsum"]
bertscore_metric = load_metric("bertscore")
bertscore_names = ["precision", "recall", "f1"]

metrics = [rouge_metric, bertscore_metric]
metric_names = [str(rouge_metric.name), str(bertscore_metric.name)]


# Load processed dataset

In [None]:
# Load the CSV file into a DataFrame
token_df = pd.read_csv('/content/drive/My Drive/pubmed/token_df.csv')
token_df_test = pd.read_csv('/content/drive/My Drive/pubmed/token_df_test.csv')
token_df_val = pd.read_csv('/content/drive/My Drive/pubmed/token_df_val.csv')

token1024_df = pd.read_csv('/content/drive/My Drive/pubmed/token1024_df.csv')
token1024_df_test = pd.read_csv('/content/drive/My Drive/pubmed/token1024_df_test.csv')
token1024_df_val = pd.read_csv('/content/drive/My Drive/pubmed/token1024_df_val.csv')

In [None]:
# Convert DataFrames to Hugging Face Datasets
dataset_train = Dataset.from_pandas(token_df)
dataset_test = Dataset.from_pandas(token_df_test)
dataset_val = Dataset.from_pandas(token_df_val)

# Create DatasetDict
dataset_dict = DatasetDict({
    'train': dataset_train,
    'test': dataset_test,
    'validation': dataset_val
})

dataset_med = dataset_dict
dataset_med

DatasetDict({
    train: Dataset({
        features: ['Unnamed: 0', 'article_len', 'abstract_len', 'article', 'abstract'],
        num_rows: 10700
    })
    test: Dataset({
        features: ['Unnamed: 0', 'article_len', 'abstract_len', 'article', 'abstract'],
        num_rows: 1125
    })
    validation: Dataset({
        features: ['Unnamed: 0', 'article', 'abstract'],
        num_rows: 1107
    })
})

In [None]:
# Convert DataFrames to Hugging Face Datasets
dataset1024_train = Dataset.from_pandas(token1024_df)
dataset1024_test = Dataset.from_pandas(token1024_df_test)
dataset1024_val = Dataset.from_pandas(token1024_df_val)

# Create DatasetDict
dataset1024_dict = DatasetDict({
    'train': dataset1024_train,
    'test': dataset1024_test,
    'validation': dataset1024_val
})

dataset1024_med = dataset1024_dict
dataset1024_med

DatasetDict({
    train: Dataset({
        features: ['Unnamed: 0', 'article', 'abstract'],
        num_rows: 642
    })
    test: Dataset({
        features: ['Unnamed: 0', 'article', 'abstract'],
        num_rows: 74
    })
    validation: Dataset({
        features: ['Unnamed: 0', 'article', 'abstract'],
        num_rows: 56
    })
})

# Evaluate fine-tuned model

In [None]:
!nvidia-smi

Fri Apr 12 11:27:28 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  Tesla V100-SXM2-16GB           Off | 00000000:00:04.0 Off |                    0 |
| N/A   31C    P0              24W / 300W |      0MiB / 16384MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                    

In [None]:
device = 'cuda' if torch.cuda.is_available() else 'cpu'

In [None]:
tokenizer = AutoTokenizer.from_pretrained(my_model)
model = AutoModelForSeq2SeqLM.from_pretrained(my_model).to(device)


In [None]:
def chunks(list_of_elements, batch_size):
    """Yield successive batch-sized chunks from list_of_elements."""
    for i in range(0, len(list_of_elements), batch_size):
        yield list_of_elements[i : i + batch_size]
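
As a quick check of how `chunks` splits its input, the sketch below repeats the function on toy data. The integer lists are stand-ins for the article and abstract columns, not real data.

```python
import math

def chunks(list_of_elements, batch_size):
    """Yield successive batch-sized chunks from list_of_elements."""
    for i in range(0, len(list_of_elements), batch_size):
        yield list_of_elements[i : i + batch_size]

# Toy input: ten integers split into batches of four; the last batch is shorter
batches = list(chunks(list(range(10)), 4))
print(batches)  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]

# A 1125-row split with batch_size=8 gives ceil(1125 / 8) batches
n_batches = len(list(chunks(list(range(1125)), 8)))
print(n_batches, math.ceil(1125 / 8))
```

With the 1125-row test split and `batch_size=8`, this yields 141 batches, the last of which holds only five rows.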

In [None]:
def evaluate_summaries(dataset, metrics, model, tokenizer, generated_summaries=None,
                       batch_size=16, device=device,
                       column_text="article",
                       column_summary="highlights"):
    '''Generate summaries in batches and score them against the references with each metric.'''
    # Collect the decoded summaries in the caller's list when one is passed in
    if generated_summaries is None:
        generated_summaries = []

    article_batches = list(chunks(dataset[column_text], batch_size))  # article batches
    target_batches = list(chunks(dataset[column_summary], batch_size))  # reference summary batches

    for article_batch, target_batch in tqdm(zip(article_batches, target_batches), total=len(article_batches)):
        # Encode the articles, truncating or padding each one to 1024 tokens
        inputs = tokenizer(article_batch, max_length=1024, truncation=True,
                           padding="max_length", return_tensors="pt")
        # Generate summaries with beam search
        summaries = model.generate(input_ids=inputs["input_ids"].to(device),
                                   attention_mask=inputs["attention_mask"].to(device),
                                   length_penalty=0.8, num_beams=my_num_beams, max_length=summary_max_length)
        decoded_summaries = [tokenizer.decode(s, skip_special_tokens=True,
                                              clean_up_tokenization_spaces=True) for s in summaries]

        # Replace the "<n>" newline marker (emitted by PEGASUS-style tokenizers) with a space
        decoded_summaries = [d.replace("<n>", " ") for d in decoded_summaries]
        generated_summaries.extend(decoded_summaries)

        for metric in metrics:
            metric.add_batch(predictions=decoded_summaries, references=target_batch)  # accumulate this batch

    scores = dict()
    for metric in metrics:
        if str(metric.name) == 'bert_score':
            scores['bert_score'] = metric.compute(lang="en")  # BERTScore needs the language to be specified
        else:
            scores[str(metric.name)] = metric.compute()  # final aggregated score

    return scores
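
For intuition about what the ROUGE scores accumulated by `evaluate_summaries` measure, here is a simplified unigram-overlap F-measure in the spirit of ROUGE-1. It is only a sketch under stated assumptions: the `rouge` metric used in this notebook applies its own tokenization and aggregation, and `rouge1_f` is a hypothetical helper written for illustration.

```python
from collections import Counter

def rouge1_f(prediction, reference):
    """Simplified ROUGE-1 F-measure: whitespace tokens, clipped unigram overlap."""
    pred_counts = Counter(prediction.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count of each shared token
    overlap = sum((pred_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)

# Toy sentences, not taken from the PubMed data
score = rouge1_f("the cat sat on the mat", "the cat lay on the mat")
print(round(score, 3))
```

Five of the six tokens match in both directions here, so precision, recall, and the F-measure all equal 5/6.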

In [None]:
generated_summaries = []
scores_fineTuned = evaluate_summaries(dataset_med["test"], metrics , model, tokenizer, generated_summaries, column_text="article", column_summary="abstract", batch_size=8)
rouge_dict_fineTuned = dict((rn, scores_fineTuned['rouge'][rn].mid.fmeasure) for rn in rouge_names)
bertscore_dict_fineTuned = dict((rn, sum(scores_fineTuned['bert_score'][rn])/len(scores_fineTuned['bert_score'][rn])) for rn in bertscore_names)


100%|██████████| 141/141 [19:02<00:00,  8.10s/it]



In [None]:
pd.DataFrame.from_records(rouge_dict_fineTuned, index=[model_index])

| model | rouge1 | rouge2 | rougeL | rougeLsum |
|---|---|---|---|---|
| distilbart-pubmed-20k | 0.39833 | 0.162007 | 0.237544 | 0.348125 |


In [None]:
pd.DataFrame.from_records(bertscore_dict_fineTuned, index=[model_index])

| model | f1 | precision | recall |
|---|---|---|---|
| distilbart-pubmed-20k | 0.852918 | 0.850995 | 0.855384 |
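
The dictionaries behind the results above come from two different result shapes: each ROUGE entry is an aggregate from which `mid.fmeasure` is read, while BERTScore returns one value per summary, which is then averaged. The sketch below mimics those shapes with stand-in namedtuples and made-up numbers, so nothing in it is real model output and the type names are assumptions.

```python
from collections import namedtuple

# Stand-ins for the score objects returned by the ROUGE metric (assumed shape)
Score = namedtuple("Score", ["precision", "recall", "fmeasure"])
AggregateScore = namedtuple("AggregateScore", ["low", "mid", "high"])

# Made-up results for illustration only
fake_scores = {
    "rouge": {
        "rouge1": AggregateScore(Score(0.3, 0.3, 0.30), Score(0.4, 0.4, 0.40), Score(0.5, 0.5, 0.50)),
    },
    "bert_score": {"precision": [0.8, 0.9], "recall": [0.7, 0.9], "f1": [0.75, 0.9]},
}

rouge_names = ["rouge1"]
bertscore_names = ["precision", "recall", "f1"]

# ROUGE: take the F-measure of the middle (mid) aggregate
rouge_dict = {rn: fake_scores["rouge"][rn].mid.fmeasure for rn in rouge_names}
# BERTScore: average the per-summary values
bertscore_dict = {rn: sum(fake_scores["bert_score"][rn]) / len(fake_scores["bert_score"][rn])
                  for rn in bertscore_names}

print(rouge_dict)
print(bertscore_dict)
```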


# Evaluate original model

In [None]:
del model  # drop the fine-tuned model so its GPU memory can be released
torch.cuda.empty_cache()  # release cached GPU memory before loading the original model

In [None]:
model_ckpt = og_model
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
model = AutoModelForSeq2SeqLM.from_pretrained(model_ckpt).to(device)


In [None]:
generated_summaries_original = []
scores_original = evaluate_summaries(dataset_med["test"], metrics, model, tokenizer, generated_summaries_original, column_text="article", column_summary="abstract", batch_size=8)
rouge_dict_original = dict((rn, scores_original['rouge'][rn].mid.fmeasure) for rn in rouge_names)
bertscore_dict_original = dict((rn, sum(scores_original['bert_score'][rn])/len(scores_original['bert_score'][rn])) for rn in bertscore_names)

100%|██████████| 141/141 [04:55<00:00,  2.09s/it]


In [None]:
pd.DataFrame.from_records(rouge_dict_original, index=["model without finetuning"])

| model | rouge1 | rouge2 | rougeL | rougeLsum |
|---|---|---|---|---|
| model without finetuning | 0.270585 | 0.097951 | 0.17736 | 0.227989 |


In [None]:
pd.DataFrame.from_records(bertscore_dict_original, index=["model without finetuning"])

| model | f1 | precision | recall |
|---|---|---|---|
| model without finetuning | 0.831853 | 0.859441 | 0.806432 |
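
To read the two sets of results side by side, the differences can be computed directly. The sketch below hard-codes the values reported above for distilbart-pubmed-20k and its original checkpoint; the dictionary names and key labels are illustrative.

```python
# Scores copied from the result tables above (full test split)
fine_tuned = {
    "rouge1": 0.39833, "rouge2": 0.162007, "rougeL": 0.237544, "rougeLsum": 0.348125,
    "bertscore_f1": 0.852918, "bertscore_precision": 0.850995, "bertscore_recall": 0.855384,
}
original = {
    "rouge1": 0.270585, "rouge2": 0.097951, "rougeL": 0.17736, "rougeLsum": 0.227989,
    "bertscore_f1": 0.831853, "bertscore_precision": 0.859441, "bertscore_recall": 0.806432,
}

# Absolute change after fine-tuning, rounded to the precision of the inputs
delta = {name: round(fine_tuned[name] - original[name], 6) for name in fine_tuned}

for name, change in delta.items():
    print(f"{name:20s} {change:+.6f}")
```

On these numbers every ROUGE variant, BERTScore F1, and BERTScore recall improve after fine-tuning, while BERTScore precision drops slightly (from 0.859441 to 0.850995).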


# Performance evaluation on shorter text
Next, we evaluate the models on the shorter-text test set (1024 tokens).

In [None]:
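# `model` and `tokenizer` now hold the original checkpoint, so load the fine-tuned one under separate names
tokenizer_ft = AutoTokenizer.from_pretrained(my_model)
model_ft = AutoModelForSeq2SeqLM.from_pretrained(my_model).to(device)
generated_summaries_1024 = []
scores_fineTuned_1024 = evaluate_summaries(dataset1024_med["test"], metrics, model_ft, tokenizer_ft, generated_summaries_1024, column_text="article", column_summary="abstract", batch_size=8)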
rouge_dict_fineTuned_1024 = dict((rn, scores_fineTuned_1024['rouge'][rn].mid.fmeasure) for rn in rouge_names)
bertscore_dict_fineTuned_1024 = dict((rn, sum(scores_fineTuned_1024['bert_score'][rn])/len(scores_fineTuned_1024['bert_score'][rn])) for rn in bertscore_names)

pd.DataFrame.from_records(rouge_dict_fineTuned_1024, index=[model_index])

100%|██████████| 10/10 [00:19<00:00,  1.94s/it]


Unnamed: 0,rouge1,rouge2,rougeL,rougeLsum
distilbart-pubmed-20k,0.325677,0.133961,0.217255,0.271944


In [None]:
pd.DataFrame.from_records(bertscore_dict_fineTuned_1024, index=[model_index])

Unnamed: 0,f1,precision,recall
distilbart-pubmed-20k,0.843664,0.862857,0.825671
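The BERTScore row above is a plain average of per-example scores. The sketch below shows that reduction in isolation, assuming the metric returns one list of floats per score name. The names in `example_bertscore_names` and the numbers are illustrative assumptions, not values from this notebook.

```python
from statistics import fmean

# Assumed score names; the notebook's own bertscore_names may be ordered differently.
example_bertscore_names = ["precision", "recall", "f1"]

# Made-up per-example scores for three summaries.
example_bert_scores = {
    "precision": [0.5, 0.75, 1.0],
    "recall": [0.25, 0.5, 0.75],
    "f1": [0.5, 0.5, 1.0],
}


def mean_bertscores(scores, names):
    """Average each per-example score list into a single number."""
    # fmean raises StatisticsError on an empty list, which surfaces an empty test set early.
    return {name: fmean(scores[name]) for name in names}


example_means = mean_bertscores(example_bert_scores, example_bertscore_names)
print(example_means)
```

Using `fmean` instead of `sum(...) / len(...)` is only a stylistic choice here; both give the arithmetic mean.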


In [None]:
torch.cuda.empty_cache()  # Free GPU Memory to test on original model.

Original pretrained model on shorter text (1024 tokens)

In [None]:
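# NOTE: `model` and `tokenizer` must be loaded from `og_model` before this cell runs;
# otherwise the fine-tuned model is scored a second time.
scores_original_1024 = evaluate_summaries(dataset1024_med["test"], metrics, model, tokenizer, column_text="article", column_summary="abstract", batch_size=8)
# ROUGE: keep the mid F-measure of each variant.
rouge_dict_original_1024 = {rn: scores_original_1024['rouge'][rn].mid.fmeasure for rn in rouge_names}
# BERTScore: average the per-example values of each score.
bertscore_dict_original_1024 = {rn: sum(scores_original_1024['bert_score'][rn]) / len(scores_original_1024['bert_score'][rn]) for rn in bertscore_names}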

pd.DataFrame.from_records(rouge_dict_original_1024, index=["model without finetuning (1024 tokens)"])

100%|██████████| 10/10 [00:19<00:00,  1.92s/it]


Unnamed: 0,rouge1,rouge2,rougeL,rougeLsum
model without finetuning (1024 tokens),0.325677,0.133961,0.217255,0.271944


In [None]:
pd.DataFrame.from_records(bertscore_dict_original_1024, index=["model without finetuning (1024 tokens)"])

Unnamed: 0,f1,precision,recall
model without finetuning (1024 tokens),0.843664,0.862857,0.825671
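The ROUGE values reported for the original model on the 1024-token test set match the fine-tuned model's values to six decimal places. One possible explanation is that the same checkpoint was evaluated in both runs. The sketch below is a small sanity check that flags two score dictionaries as identical. The function name and tolerance are assumptions for illustration; the numbers are the ROUGE values from the two tables.

```python
import math


def identical_scores(scores_a, scores_b, tol=1e-9):
    """Return True when both dicts share the same keys and near-equal values."""
    if scores_a.keys() != scores_b.keys():
        return False
    return all(math.isclose(scores_a[k], scores_b[k], abs_tol=tol) for k in scores_a)


# ROUGE values as reported in the two 1024-token tables.
reported_fine_tuned_1024 = {"rouge1": 0.325677, "rouge2": 0.133961, "rougeL": 0.217255, "rougeLsum": 0.271944}
reported_original_1024 = {"rouge1": 0.325677, "rouge2": 0.133961, "rougeL": 0.217255, "rougeLsum": 0.271944}

scores_match = identical_scores(reported_fine_tuned_1024, reported_original_1024)
if scores_match:
    print("Warning: both runs produced identical scores; check which checkpoint was loaded.")
```

If the check fires, reloading the model and tokenizer from the intended checkpoint before re-running the evaluation cell would be the first thing to try.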


In [None]:
torch.cuda.empty_cache()  # Free GPU memory before loading the summarization pipeline.

# Examine the generated abstracts

We read a sample of generated abstracts next to the human-written ones to check that the model output is sensible.
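When reading pairs by eye, a crude lexical overlap figure can help spot outputs that drift far from the reference. The helper below is a hypothetical aid, not ROUGE and not part of the evaluation in this notebook. It reports the share of distinct words in the generated text that also appear in the human-written text.

```python
def word_overlap(reference, generated):
    """Fraction of distinct generated words that also occur in the reference."""
    ref_words = set(reference.lower().split())
    gen_words = set(generated.lower().split())
    if not gen_words:
        return 0.0  # An empty generation shares nothing with the reference.
    return len(ref_words & gen_words) / len(gen_words)


def format_pair(index, reference, generated):
    """Build a readable block showing one human/generated pair and its overlap."""
    overlap = word_overlap(reference, generated)
    return (
        f"[{index}] word overlap: {overlap:.2f}\n"
        f"human summary:\n {reference}\n"
        f"generated summary:\n {generated}\n"
    )


# Toy strings, for illustration only.
example_block = format_pair(0, "the cat sat", "the cat ran")
print(example_block)
```

A low value does not prove a summary is wrong, since paraphrases share few words; it only marks pairs worth a closer look.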

In [None]:
from transformers import pipeline

pipe = pipeline("summarization", model=my_model)

In [None]:
for i in range(10):
  custom_article = str(token_df_test['article'][i])
  # print(pipe(custom_article, **gen_kwargs)[0]["summary_text"])
  print("human summary:\n", token_df_test['abstract'][i])
  # Only the first 1200 characters of the article are passed to the pipeline.
  print("generated summary:\n", pipe(custom_article[:1200])[0]["summary_text"])

  print(i, ' : ---------------------------')

human summary:
 research on the implications of anxiety in parkinson 's disease ( pd ) has been neglected despite its prevalence in nearly 50% of patients and its negative impact on quality of life . 
 previous reports have noted that neuropsychiatric symptoms impair cognitive performance in pd patients ; however , to date , no study has directly compared pd patients with and without anxiety to examine the impact of anxiety on cognitive impairments in pd . 
 this study compared cognitive performance across 50 pd participants with and without anxiety ( 17 pda+ ; 33 pda ) , who underwent neurological and neuropsychological assessment . 
 group performance was compared across the following cognitive domains : simple attention / visuomotor processing speed , executive function ( e.g. , set - shifting ) , working memory , language , and memory / new verbal learning . 
 results showed that pda+ performed significantly worse on the digit span forward and backward test and part b of the trail 