# Introduction

This notebook demonstrates a custom BERTScore implementation, using the **HerBERT** model as an example. In practice, any Polish language model can be used, as can models for other languages when paired with an appropriate dataset. The main features include:

- **Sentence similarity scoring**: Compute BERTScore metrics (accuracy, which plays the role of BERTScore precision; recall; and F1) between pairs of sentences.
- **IDF weighting**: Support for term importance through precomputed or dynamically calculated Inverse Document Frequency (IDF) scores.
- **Baseline rescaling**: Efficient computation and application of baseline values (`b_acc` and `b_recall`) to normalize scores across different datasets.
- **Persistent storage**: Easy saving and loading of precomputed IDF and baseline values to avoid repeated, time-consuming calculations.
- **Flexible input handling**: Compatible with individual sentence pairs, batches of sentences, and large datasets.

This notebook will guide you through:

1. Initializing the model and tokenizer.
2. Generating or loading IDF and baseline values.
3. Computing BERTScore metrics for sentences or datasets.
4. Saving and reusing precomputed parameters to speed up future evaluations.


In [1]:
from herBERTscore.HerBERTScore import HerBERTScore

"""
Note: The base dataset used for training and computing IDF in HerBERTScore
comes from the National Corpus of Polish (NKJP), available at:
https://nkjp.pl/

The dataset comprises the publicly available Polish texts in NKJP and serves
as the basis for computing IDF for our model. The IDF values are therefore
specific to the Polish language and to the dataset provided in this repository.

Custom dataset usage note:
--------------------------
HerBERTScore computes IDF and baseline metrics from a corpus of text. If you want to use your own dataset:
1. Save it as a plain UTF-8 text file (e.g., "my_corpus.txt").
2. Include sentences or paragraphs in natural language. No JSON or CSV formatting is required.
3. Keep it simple: raw text is sufficient; tokenization is handled internally.
4. Longer texts are fine; HerBERTScore will batch them efficiently for IDF computation.
5. Proper sentence boundaries improve metric quality, but the format is flexible.
"""

# Initialize HerBERTScore with a text file path
hbs = HerBERTScore(file_path_to_texts="data/NKJP_test_texts.txt")

# Build the IDF map from the dataset
hbs.make_idf()


New idf has been set.
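
For intuition, the IDF weighting described in the original BERTScore paper can be sketched in a few lines. This is an illustrative sketch, not HerBERTScore's internal code: the function name `compute_idf` is hypothetical, a whitespace split stands in for the HerBERT tokenizer, and the plus-one smoothing follows the paper rather than anything confirmed about this library.

```python
import math
from collections import Counter


def compute_idf(sentences):
    """Return a token -> IDF map with plus-one smoothing (hypothetical helper)."""
    num_docs = len(sentences)
    doc_freq = Counter()
    for sentence in sentences:
        # Count each token at most once per sentence (document frequency).
        # Assumption: whitespace tokens stand in for real subword tokens.
        doc_freq.update(set(sentence.lower().split()))
    return {
        token: math.log((num_docs + 1) / (freq + 1))
        for token, freq in doc_freq.items()
    }


corpus = ["Ala ma kota", "Ala ma psa", "Kotki są urocze"]
idf = compute_idf(corpus)

# Tokens that appear in more sentences receive a lower weight.
print(idf["ala"], idf["kota"])
```

Frequent tokens such as "ala" end up with a smaller weight than rare ones such as "kota", which is the effect IDF weighting is meant to have on the final score.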


Now we compute the **baseline rescaling factor `b`**.
Rescaling spreads scores over a more interpretable range, typically `[0, 1]`; pairs that are less similar than the baseline can still score below 0.

The original BERTScore paper estimates the baseline from random sentence pairs drawn from Common Crawl.
Here we instead approximate `b` using all possible sentence pairs from our dataset.

⚖️ **Trade-off:**
- ✅ Faster and deterministic.
- ❌ Slightly less universal than the paper's baseline.

💡 **Tip:** If you encounter *out-of-memory (OOM)* errors on the GPU during this step,
try reducing the `batch_size` parameter (default: `batch_size=100`).

In [2]:
hbs.compute_b(batch_size=100)

Computing baseline: 100%|██████████| 121/121 [00:11<00:00, 10.50it/s]

New baseline has been set.
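
The rescaling used in the original BERTScore paper is a simple linear map, sketched below. Whether HerBERTScore applies exactly this formula is an assumption; the function name `rescale` and the baseline value `0.8` are made up for illustration.

```python
def rescale(score, baseline):
    """Linearly rescale a raw similarity score against a baseline value.

    Follows the formula from the BERTScore paper:
    (score - baseline) / (1 - baseline).
    """
    return (score - baseline) / (1.0 - baseline)


b = 0.8  # hypothetical baseline, not a value computed by HerBERTScore

print(rescale(0.90, b))  # raw score above the baseline -> positive
print(rescale(0.80, b))  # raw score equal to the baseline -> 0
print(rescale(0.75, b))  # raw score below the baseline -> negative
```

A raw score of 1 stays at 1 and a raw score equal to the baseline maps to 0, so anything less similar than the baseline becomes negative.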





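Here we compare pairs of sentences.
The scorer takes a list of candidates and a list of references and returns a score for every candidate-reference pair, indexed as `[i, j]`.
HerBERTScore computes three similarity metrics:
- **Accuracy** (precision-like score: how much of the candidate is matched in the reference).
- **Recall** (recall-like score: how much of the reference is covered by the candidate).
- **F1** (harmonic mean of the above).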
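
The three metrics above can be illustrated with the greedy matching scheme from the BERTScore paper. The sketch below is a simplification under stated assumptions: the token vectors are tiny hand-made lists rather than HerBERT embeddings, IDF weighting and baseline rescaling are left out, and `cosine` and `greedy_scores` are hypothetical helper names.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def greedy_scores(cand_vecs, ref_vecs):
    """Return (accuracy, recall, f1) for one candidate-reference pair."""
    sim = [[cosine(c, r) for r in ref_vecs] for c in cand_vecs]
    # Accuracy (precision-like): best reference match for each candidate token.
    accuracy = sum(max(row) for row in sim) / len(cand_vecs)
    # Recall: best candidate match for each reference token.
    recall = sum(
        max(sim[i][j] for i in range(len(cand_vecs)))
        for j in range(len(ref_vecs))
    ) / len(ref_vecs)
    total = accuracy + recall
    f1 = 2 * accuracy * recall / total if total else 0.0
    return accuracy, recall, f1


# Toy "embeddings": two candidate tokens, one reference token.
cand = [[1.0, 0.0], [0.0, 1.0]]
ref = [[1.0, 0.0]]
acc, rec, f1 = greedy_scores(cand, ref)
print(acc, rec, f1)
```

In this toy case the single reference token is fully covered, so recall is high, while only one of the two candidate tokens finds a match, so accuracy is lower.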

In [4]:
candidate = [
    "Ala ma kota",
    "Kotki są urocze",
    "Być albo nie być, oto jest pytanie",
    "Pogoda dzisiaj jest piękna",
    "Lubię czytać książki w wolnym czasie",
    "Programowanie w Pythonie jest przyjemne",
    "Samochody elektryczne stają się coraz popularniejsze",
    "Spacer po lesie uspokaja umysł",
    "Kawa o poranku dodaje energii",
    "Malarstwo to wyraz emocji artysty"
]

reference = [
    "Ala ma kocurka",
    "Kotki są kochane",
    "Testujemy herBERTScore",
    "Czy istnieje życie po śmierci?",
    "Dzisiejsza pogoda jest słoneczna i ciepła",
    "W wolnym czasie często czytam powieści",
    "Python jest świetny do szybkiego prototypowania",
    "Elektryczne auta zyskują popularność na świecie",
    "Spacerowanie w lesie poprawia nastrój",
    "Poranna kawa pomaga się obudzić",
    "Sztuka malarska oddaje uczucia twórcy"
]

# Score every candidate against every reference
x = hbs(candidate, reference)
# Pick one pair: candidate i versus reference j
i, j = 0, 0
print(f'Sentence: "{candidate[i]}" and sentence: "{reference[j]}" accuracy = {x["accuracy"][i, j]}, recall = {x["recall"][i, j]}, and f1_score = {x["f1"][i, j]}')

Sentence: "Ala ma kota" and sentence: "Ala ma kocurka" accuracy = 0.4337066947040838, recall = 0.764933061538449, and f1_score = 0.5535551245683671
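
Because the scores are indexed by `[i, j]`, a common follow-up is to find the best-scoring reference for each candidate. The sketch below works on a plain nested list; the helper name `best_matches` and the numbers in `f1_matrix` are made up for illustration. With real results, something like `x["f1"].tolist()` could supply the input, assuming the returned object supports `.tolist()` as NumPy arrays and PyTorch tensors do.

```python
def best_matches(score_matrix):
    """For each row (candidate), return (best column index, best score)."""
    result = []
    for row in score_matrix:
        # Index of the highest score in this row.
        j = max(range(len(row)), key=row.__getitem__)
        result.append((j, row[j]))
    return result


# Hypothetical F1 matrix: 2 candidates x 3 references.
f1_matrix = [
    [0.55, 0.10, -0.07],
    [0.20, 0.61, 0.05],
]
matches = best_matches(f1_matrix)
print(matches)
```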


Computing IDF and baseline can take a long time.
For efficiency, we provide methods to **save** and **reload** them later.

This makes experiments reproducible and avoids recomputation.

In [5]:
# Save both the IDF map and the baseline
folder_path = "herBERTScoreStateExample"
hbs.save_state(folder_path=folder_path)

IDF saved to herBERTScoreStateExample\idf.json
Baseline saved to herBERTScoreStateExample\baseline.pt


In [6]:
# Initialize a new HerBERTScore instance
hbs_loaded = HerBERTScore()
# Load the saved IDF map and baseline without recomputation
hbs_loaded.load_state(folder_path=folder_path)


IDF loaded from herBERTScoreStateExample\idf.json
Baseline loaded from herBERTScoreStateExample\baseline.pt
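
The save and load steps can be combined into a "load if present, otherwise compute and save" helper. This is a sketch, not part of HerBERTScore: `load_or_build` is a hypothetical name, and it assumes the state folder contains `idf.json` and `baseline.pt`, as the messages printed above suggest. The scorer class is passed in as an argument so the helper only relies on the methods used earlier in this notebook.

```python
from pathlib import Path


def load_or_build(scorer_cls, corpus_path, state_dir, batch_size=100):
    """Reuse a saved state if it exists; otherwise compute and save it."""
    state = Path(state_dir)
    # Assumption: these are the two files written by save_state.
    has_state = (state / "idf.json").exists() and (state / "baseline.pt").exists()
    if has_state:
        scorer = scorer_cls()
        scorer.load_state(folder_path=str(state))
    else:
        scorer = scorer_cls(file_path_to_texts=corpus_path)
        scorer.make_idf()
        scorer.compute_b(batch_size=batch_size)
        scorer.save_state(folder_path=str(state))
    return scorer


# Example call (requires the herBERTscore package and the corpus file):
# hbs = load_or_build(HerBERTScore, "data/NKJP_test_texts.txt", "herBERTScoreStateExample")
```

The first call pays the full cost of computing the IDF map and baseline; later calls with the same folder only load the saved files.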


In [7]:
# Score with the reloaded instance to confirm that the saved state is reused
x2 = hbs_loaded(candidate, reference)
i, j = 3, 2
print(f'Sentence: "{candidate[i]}" and sentence: "{reference[j]}" accuracy = {x2["accuracy"][i, j]}, recall = {x2["recall"][i, j]}, and f1_score = {x2["f1"][i, j]}')

Sentence: "Pogoda dzisiaj jest piękna" and sentence: "Testujemy herBERTScore" accuracy = -0.04372621470501321, recall = -0.2406226212579045, and f1_score = -0.07400428677245109


# Summary

In this notebook we:
- Initialized **HerBERTScore** with a text dataset.
- Computed the IDF map and the baseline rescaling factor.
- Compared example sentences.
- Demonstrated saving and loading state.

HerBERTScore is flexible: while we use **HerBERT + NKJP** as the default Polish setup,
you can adapt it to **any Hugging Face transformer model** and **any text dataset**. 🚀