<a href="https://colab.research.google.com/github/GabrielLoiseau/tau-eval/blob/main/examples/Custom_Models_and_Metrics_Example.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!git clone https://github.com/GabrielLoiseau/tau-eval.git

Cloning into 'tau-eval'...
remote: Enumerating objects: 66, done.
remote: Counting objects: 100% (66/66), done.
remote: Compressing objects: 100% (47/47), done.
remote: Total 66 (delta 23), reused 56 (delta 16), pack-reused 0 (from 0)
Receiving objects: 100% (66/66), 44.44 KiB | 722.00 KiB/s, done.
Resolving deltas: 100% (23/23), done.


In [None]:
!cd tau-eval && pip install .

Processing /content/tau-eval
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting evaluate>=0.4.1 (from tau-eval==0.1.0)
  Downloading evaluate-0.4.3-py3-none-any.whl.metadata (9.2 kB)
Collecting tasknet>=1.57.0 (from tau-eval==0.1.0)
  Downloading tasknet-1.57.0-py3-none-any.whl.metadata (4.9 kB)
Collecting tasksource>=0.0.47 (from tau-eval==0.1.0)
  Downloading tasksource-0.0.47-py3-none-any.whl.metadata (5.4 kB)
Collecting ipywidgets>=8.1.5 (from tau-eval==0.1.0)
  Downloading ipywidgets-8.1.7-py3-none-any.whl.metadata (2.4 kB)
Collecting ipykernel>=6.29.5 (from tau-eval==0.1.0)
  Downloading ipykernel-6.29.5-py3-none-any.whl.metadata (6.3 kB)
Collecting rich>=14.0.0 (from tau-eval==0.1.0)
  Downloading rich-14.0.0-py3-none-any.whl.metadata (18 kB)
Collecting bert-score>=0.3.13 (from tau-eval==0.1.0)
  Downloading bert_score-0.3.13-py3-none-any.whl.metadat

In [None]:
!pip install -U datasets huggingface_hub fsspec

Collecting datasets
  Downloading datasets-3.6.0-py3-none-any.whl.metadata (19 kB)
Collecting fsspec
  Downloading fsspec-2025.5.1-py3-none-any.whl.metadata (11 kB)
  Downloading fsspec-2025.3.0-py3-none-any.whl.metadata (11 kB)
Downloading datasets-3.6.0-py3-none-any.whl (491 kB)
Downloading fsspec-2025.3.0-py3-none-any.whl (193 kB)
Installing collected packages: fsspec, datasets
  Attempting uninstall: fsspec
    Found existing installation: fsspec 2025.3.2
    Uninstalling fsspec-2025.3.2:
      Successfully uninstalled fsspec-2025.3.2
  Attempting uninstall: datasets
    Found existing installation: datasets 2.14.4
    Uninstalling datasets-2.14.4:
      Successfully uninstalled datasets-2.14.4
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency co

# Implementing Custom Models and Metrics in Tau-Eval

This notebook guides users in creating their own anonymization models and evaluation metrics within the Tau-Eval framework. Tau-Eval is designed to be extensible, allowing researchers and developers to easily integrate and test custom components.


## Section 1: Custom Anonymization Models

Custom anonymization models in Tau-Eval should inherit from the `tau_eval.models.Anonymizer` base class. This class provides a standard interface for models, ensuring they can be seamlessly integrated into the evaluation pipeline.


In [None]:
from tau_eval.models import Anonymizer
import random
import string
import wandb
wandb.init(mode="disabled")

# Define a simple custom model class
class NoisyTextModel(Anonymizer):
  def __init__(self, name: str = "NoisyText", noise_level: float = 0.1):
    self.name = name
    self.noise_level = noise_level

  def anonymize(self, text: str) -> str:
    """Adds random noise to text (deletion, case changes, insertion, swapping)."""
    result, i = [], 0
    while i < len(text):
        char = text[i]

        # Deletion: skip character
        if random.random() < self.noise_level * 0.3:
            i += 1
            continue

        # Case change for letters
        if char.isalpha() and random.random() < self.noise_level * 0.3:
            char = char.swapcase()

        # Swapping: emit the next character before the current one
        if i < len(text) - 1 and random.random() < self.noise_level * 0.2:
            result.extend([text[i + 1], char])
            i += 2
        else:
            result.append(char)
            i += 1

        # Insertion: add random letter
        if random.random() < self.noise_level * 0.2:
            result.append(random.choice(string.ascii_letters))

    return ''.join(result)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...



In [None]:
# Instantiate the custom model
noisy_model = NoisyTextModel()

# Test the model's anonymize method
sample_text = "Hello, world!"
anonymized_text = noisy_model.anonymize(sample_text)
print(f"Original: {sample_text}")
print(f"Anonymized: {anonymized_text}")

Original: Hello, world!
Anonymized: Hello, world!
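
At a low `noise_level`, many short inputs come back unchanged, as in the output above, and the result differs from run to run because the model draws from Python's global `random` module. The sketch below shows one way to make such a model repeatable by seeding the generator before each call. Note that `flip_case_noise` and `anonymize_with_seed` are hypothetical names, and `flip_case_noise` is only a simplified stand-in for `NoisyTextModel.anonymize` so that the example runs without Tau-Eval installed.

```python
import random


def flip_case_noise(text: str, noise_level: float = 0.3) -> str:
    """Hypothetical stand-in for NoisyTextModel.anonymize: randomly flips letter case."""
    return "".join(
        c.swapcase() if c.isalpha() and random.random() < noise_level else c
        for c in text
    )


def anonymize_with_seed(text: str, seed: int) -> str:
    """Seed the global RNG, then apply the noise so the result is repeatable."""
    random.seed(seed)
    return flip_case_noise(text)


first = anonymize_with_seed("Hello, world!", seed=42)
second = anonymize_with_seed("Hello, world!", seed=42)
print(first)
print(first == second)  # same seed, same random draws, same output
```

The same idea should carry over to `NoisyTextModel`: calling `random.seed(...)` before `anonymize` fixes the sequence of draws, which helps when comparing metrics across runs.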


## Section 2: Custom Evaluation Metrics

Custom evaluation metrics in Tau-Eval are Python callables (e.g., functions) that adhere to a specific signature: `Callable[[str | list[str], str | list[str]], dict]`.

The callable takes two arguments:
1. `input_texts`: The original text(s) before anonymization.
2. `output_texts`: The anonymized text(s) produced by a model.

Both arguments can be a single string or a list of strings. The function should return a dictionary whose keys are metric names (strings) and whose values are the corresponding numerical scores.

In [None]:
import numpy as np

# Define a simple custom metric function
def jaccard_similarity_metric(input_texts: str | list[str],
                              output_texts: str | list[str]) -> dict[str, float]:
    if isinstance(input_texts, str):
      input_texts = [input_texts]
    if isinstance(output_texts, str):
      output_texts = [output_texts]

    scores = []
    for input_text, output_text in zip(input_texts, output_texts):
      # Simple word tokenization
      words_input = set(input_text.lower().split())
      words_output = set(output_text.lower().split())

      intersection = len(words_input.intersection(words_output))
      union = len(words_input.union(words_output))
      score = intersection / union if union > 0 else 0.0

      scores.append(score)

    return {'jaccard': np.mean(scores)}


# Test the custom metric
original_text = "This is the original text."
alternative_text = "This is the processed text with some changes."
anonymized_text = noisy_model.anonymize(original_text)

print(f"Jaccard Similarity: {jaccard_similarity_metric(original_text, alternative_text)}")

original_texts_list = [original_text, original_text]
processed_texts_list = [alternative_text, anonymized_text]
jaccard_score_dict_list = jaccard_similarity_metric(original_texts_list, processed_texts_list)
print(f"Jaccard Similarity for lists: {jaccard_score_dict_list}")

Jaccard Similarity: {'jaccard': np.float64(0.3)}
Jaccard Similarity for lists: {'jaccard': np.float64(0.36428571428571427)}


## Section 3: Using Custom Components in an Experiment

Once you have your custom model and metric, you can integrate them into a Tau-Eval `Experiment`. This involves:
1. Importing necessary classes: `Experiment`, `ExperimentConfig`.
2. Defining or importing tasks. This demonstration uses two: a sentiment analysis task (`tweet_eval/sentiment`, loaded with `tasknet.AutoTask`) and Tau-Eval's `IMDBAuthorshipClassification` task.
3. Instantiating your custom model.
4. Creating lists of models and metrics, including your custom ones. Custom metric functions are passed directly as callables, alongside metrics referenced by name (such as `"meteor"`).
5. Configuring the experiment with `ExperimentConfig`.
6. Running the experiment and observing the results.

In [None]:
from tasknet import AutoTask
from tau_eval.tasks import IMDBAuthorshipClassification

sent = AutoTask("tweet_eval/sentiment", max_rows=1000, max_rows_eval=1000)
imdb = IMDBAuthorshipClassification(n_authors=10, max_rows=1000, max_rows_eval=1000)


In [None]:
from tau_eval import Experiment, ExperimentConfig

# Define the experiment configuration
models = [NoisyTextModel(name="NoisyText0.1",noise_level=0.1),
          NoisyTextModel(name="NoisyText0.3",noise_level=0.3),
          NoisyTextModel(name="NoisyText0.5",noise_level=0.5)]

metrics = [jaccard_similarity_metric, "meteor"]

tasks = [imdb, sent]

config = ExperimentConfig("full","answerdotai/ModernBERT-base",True,False)

e = Experiment(models, metrics, tasks, config)
e.run(output_dir="results.json")


DEBUG:tau_eval.logger:Saved generated dataset


INFO:tau_eval.logger:Results saved


Finally, display the results. The run above saves them to the path passed as `output_dir`, and `e.summary()` can be indexed by task key to show one results table per task, as in the cells below.

In [None]:
e.summary()['imdb_authorship_10_authors_0']

|   | Model Name | Accuracy | F1 | meteor | jaccard |
|---|---|---|---|---|---|
| 0 | Original | 0.973 | 0.9719 | - | - |
| 1 | NoisyText0.1 | 0.684 | 0.640728 | 0.7355 | 0.5659 |
| 2 | NoisyText0.3 | 0.341 | 0.196522 | 0.3954 | 0.2687 |
| 3 | NoisyText0.5 | 0.263 | 0.143626 | 0.2402 | 0.1616 |


In [None]:
e.summary()['tweet_eval/sentiment_1']

|   | Model Name | Accuracy | F1 | meteor | jaccard |
|---|---|---|---|---|---|
| 0 | Original | 0.697 | 0.6878 | - | - |
| 1 | NoisyText0.1 | 0.639 | 0.610885 | 0.7737 | 0.597 |
| 2 | NoisyText0.3 | 0.548 | 0.439357 | 0.4656 | 0.2806 |
| 3 | NoisyText0.5 | 0.504 | 0.327534 | 0.2996 | 0.1571 |


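Unsurprisingly, more noise leads to better privacy and worse utility: as the noise level rises from 0.1 to 0.5, authorship classification accuracy on IMDB falls from 0.973 on the original text to 0.263, while sentiment accuracy on `tweet_eval` falls from 0.697 to 0.504.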
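
To make the trade-off concrete, the short sketch below computes each model's accuracy drop relative to the original text, using the accuracy values from the two tables above. Reading the drop on the authorship task as a privacy gain and the drop on the sentiment task as a utility loss is an interpretation made for this example, and `accuracy_drops` is a hypothetical helper rather than part of Tau-Eval.

```python
# Accuracy values copied from the two result tables above
imdb_accuracy = {
    "Original": 0.973,
    "NoisyText0.1": 0.684,
    "NoisyText0.3": 0.341,
    "NoisyText0.5": 0.263,
}
sentiment_accuracy = {
    "Original": 0.697,
    "NoisyText0.1": 0.639,
    "NoisyText0.3": 0.548,
    "NoisyText0.5": 0.504,
}


def accuracy_drops(accuracy: dict[str, float]) -> dict[str, float]:
    """Accuracy lost by each model relative to the 'Original' row."""
    baseline = accuracy["Original"]
    return {
        name: round(baseline - value, 3)
        for name, value in accuracy.items()
        if name != "Original"
    }


privacy_gain = accuracy_drops(imdb_accuracy)       # harder authorship attribution
utility_loss = accuracy_drops(sentiment_accuracy)  # weaker sentiment classification

for name in privacy_gain:
    print(f"{name}: privacy gain {privacy_gain[name]:.3f}, "
          f"utility loss {utility_loss[name]:.3f}")
```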