In [2]:
import pandas as pd
import evaluate
from bert_score import score as bertscore
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

In [3]:
deepseek_r1_df = pd.read_csv("deepseek_sample_summaries.csv")
reference_df = pd.read_csv("deepseek_bhc_matched_reference_summaries.csv")

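# Inner join on note_id: only notes present in both files are kept.
df = pd.merge(deepseek_r1_df, reference_df, on="note_id")
# Replace missing text with empty strings so every metric receives a str.
generated = df['summary'].fillna("").tolist()
reference = df['target'].fillna("").tolist()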

In [5]:
### 1. ROUGE Score
rouge = evaluate.load("rouge")
rouge_result = rouge.compute(predictions=generated, references=reference)

print("\n📊 ROUGE Scores:")
for k, v in rouge_result.items():
    print(f"{k.upper()}: {v:.4f}")


📊 ROUGE Scores:
ROUGE1: 0.3221
ROUGE2: 0.0740
ROUGEL: 0.1417
ROUGELSUM: 0.2082
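
To make the ROUGE-1 number above easier to interpret, here is a minimal sketch of a unigram-overlap F1 on one toy sentence pair. It is a simplification and not the implementation behind `evaluate.load("rouge")`: it assumes lowercased whitespace tokens and no stemming, and `rouge1_f1`, `toy_prediction`, and `toy_reference` are hypothetical names introduced only for this illustration.

```python
from collections import Counter


def rouge1_f1(prediction: str, reference: str) -> float:
    """Simplified ROUGE-1 F1 on lowercased whitespace tokens (assumption)."""
    pred_counts = Counter(prediction.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Clipped overlap: a token counts at most as often as it occurs in both texts.
    overlap = sum((pred_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


# Toy data, not taken from the evaluated CSV files.
toy_prediction = "patient admitted with chest pain"
toy_reference = "the patient was admitted for chest pain"

# 4 shared unigrams, 5 predicted tokens, 7 reference tokens.
print(f"ROUGE-1 F1 (toy): {rouge1_f1(toy_prediction, toy_reference):.4f}")
```

With 4 shared tokens, precision is 4/5 and recall is 4/7, so the F1 is 2/3. ROUGE-2 applies the same idea to bigrams, which is one reason it is typically much lower than ROUGE-1 for free-form summaries.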


In [6]:
### 2. BLEU (average over all rows)
smoothie = SmoothingFunction().method4
bleu_scores = [
    sentence_bleu([ref.split()], pred.split(), smoothing_function=smoothie)
    for ref, pred in zip(reference, generated)
]
bleu_avg = sum(bleu_scores) / len(bleu_scores)

print(f"\n📊 BLEU Score (average over {len(bleu_scores)} rows):")
print(f"BLEU: {bleu_avg:.4f}")


📊 BLEU Score (average over 1000 rows):
BLEU: 0.0235
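
The cell above relies on `SmoothingFunction().method4`. The sketch below shows, on one toy pair, the two ingredients that BLEU combines (clipped n-gram precision and a brevity penalty) and why unsmoothed sentence-level BLEU tends to collapse to zero on short or loosely matching text. It is a stdlib-only illustration and not NLTK's implementation; `ngram_counts`, `modified_precision`, `brevity_penalty`, and the toy sentences are hypothetical names and data.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(ref_tokens, pred_tokens, n):
    """Clipped n-gram precision of the prediction against one reference."""
    pred_counts = ngram_counts(pred_tokens, n)
    total = sum(pred_counts.values())
    if total == 0:
        return 0.0
    clipped = sum((pred_counts & ngram_counts(ref_tokens, n)).values())
    return clipped / total


def brevity_penalty(ref_len, pred_len):
    """Penalize predictions that are shorter than the reference."""
    if pred_len == 0:
        return 0.0
    return 1.0 if pred_len > ref_len else math.exp(1 - ref_len / pred_len)


# Toy data, not taken from the evaluated CSV files.
toy_ref = "the patient was admitted for chest pain".split()
toy_pred = "patient admitted with chest pain".split()

precisions = [modified_precision(toy_ref, toy_pred, n) for n in range(1, 5)]
bp = brevity_penalty(len(toy_ref), len(toy_pred))
print("n-gram precisions (n=1..4):", precisions)
print(f"brevity penalty: {bp:.4f}")
```

Here the 3-gram and 4-gram precisions are both zero, so the geometric mean of the four precisions is zero unless some smoothing is applied. Averaged sentence-level BLEU on long clinical summaries can therefore be very low even when unigram overlap is moderate.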


In [7]:
### 3. BERTScore
P, R, F1 = bertscore(generated, reference, lang="en", verbose=True)
bert_avg = {
    "precision": P.mean().item(),
    "recall": R.mean().item(),
    "f1": F1.mean().item()
}

print("\n📊 BERTScore:")
print(f"Precision: {bert_avg['precision']:.4f}")
print(f"Recall:    {bert_avg['recall']:.4f}")
print(f"F1 Score:  {bert_avg['f1']:.4f}")

To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development
Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`
Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


calculating scores...
computing bert embedding.


100%|██████████| 32/32 [00:37<00:00,  1.18s/it]


computing greedy matching.


100%|██████████| 16/16 [00:00<00:00, 18.95it/s]

done in 38.58 seconds, 25.92 sentences/sec

📊 BERTScore:
Precision: 0.8152
Recall:    0.8194
F1 Score:  0.8172
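
The log above mentions a "greedy matching" step. As a rough sketch of what that step does, the code below matches each token vector to its most similar counterpart by cosine similarity and averages the results into precision, recall, and F1. It is a simplification under stated assumptions: the 2-dimensional vectors are made up (the real run uses contextual embeddings from `roberta-large`), there is no IDF weighting or baseline rescaling, and `cosine` and `greedy_match_scores` are hypothetical helper names, not part of `bert_score`.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def greedy_match_scores(cand_vecs, ref_vecs):
    """Greedy-matching precision, recall, and F1 over token vectors."""
    # Precision: every candidate token takes its best match in the reference.
    precision = sum(
        max(cosine(c, r) for r in ref_vecs) for c in cand_vecs
    ) / len(cand_vecs)
    # Recall: every reference token takes its best match in the candidate.
    recall = sum(
        max(cosine(r, c) for c in cand_vecs) for r in ref_vecs
    ) / len(ref_vecs)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up token embeddings for illustration only.
toy_cand = [[1.0, 0.0], [0.0, 1.0]]
toy_ref = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]

p, r, f1 = greedy_match_scores(toy_cand, toy_ref)
print(f"Precision: {p:.4f}  Recall: {r:.4f}  F1: {f1:.4f}")
```

Because matching is based on embedding similarity rather than exact token overlap, scores from this family of metrics tend to sit in a much narrower and higher range than ROUGE or BLEU, so they are best compared across systems rather than read as absolute quality.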



