# Implementing Custom Tasks in Tau-Eval

This notebook guides users in creating their own evaluation tasks within the Tau-Eval framework. While Tau-Eval provides several built-in tasks, users might have specific datasets or evaluation methodologies that require a custom task implementation.

## Section 1: Understanding CustomTasks

Tau-Eval offers a base class, `tau_eval.tasks.CustomTask`, to facilitate the creation of new tasks, especially tasks that do not rely on model fine-tuning expressible through `tasksource`. A class that inherits from `CustomTask` will typically wrap a Hugging Face Dataset containing at least your original texts. The most crucial part is that **you need to implement the `evaluate` method**. This method defines how the outputs of an anonymization model (applied to your task's input texts) are evaluated to produce metrics.
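
To make the contract concrete, here is a minimal sketch of a toy task. It is not part of Tau-Eval: the `CustomTask` class below is a local stand-in so the snippet runs on its own, `ChangedTextRatio` and its `changed_ratio` metric are hypothetical, and a plain list replaces the Hugging Face Dataset. Whether the framework expects further attributes beyond those shown is not covered here.

```python
# Stand-in for tau_eval.tasks.CustomTask so this sketch runs without Tau-Eval
# installed; in a real notebook, import the base class instead.
class CustomTask:
    pass


class ChangedTextRatio(CustomTask):
    """Hypothetical toy task: reports the share of texts the model altered."""

    def __init__(self, texts: list[str]):
        self.name = "changed-text-ratio"
        # A plain list stands in for the dataset of original texts.
        self.original_texts = texts

    def evaluate(self, new_texts: list[str]) -> dict:
        # new_texts is assumed to be aligned with the originals by position.
        changed = sum(o != n for o, n in zip(self.original_texts, new_texts))
        total = len(self.original_texts)
        return {"changed_ratio": changed / total if total else 0.0}


task = ChangedTextRatio(["Alice met Bob.", "No names here."])
scores = task.evaluate(["[PER] met [PER].", "No names here."])
print(scores)
```

The only part that matters for the framework is the shape of `evaluate`: it receives the anonymized texts and returns a dictionary of named metrics.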

## Section 2: Creating a CustomTask

[Pilán et al. (2022)](https://aclanthology.org/2022.cl-4.19/#) introduced the Text Anonymization Benchmark (TAB), a dataset of 1,268 English-language court cases from the European Court of Human Rights (ECHR), enriched with comprehensive annotations about the personal information appearing in each document, including semantic category, identifier type, confidential attributes, and co-reference relations. It is designed to enable a more granular, span-based evaluation of anonymization. Let's implement this example task within Tau-Eval.

In [None]:
from tau_eval.tasks import CustomTask
from datasets import load_dataset
import difflib

class TextAnonymizationBenchmark(CustomTask):
  def __init__(self):
    self.name = "text-anonymization-benchmark"
    self.dataset = load_dataset("ildpil/text-anonymization-benchmark")["test"]
    self.gold_spans: dict[str, set[tuple[int, int]]] = {}
    self.original_texts: dict[str, str] = {}
    self.doc_ids: list[str] = []

    # Pre-process the dataset to extract necessary information
    for sample in self.dataset:
        doc_id = sample['doc_id']
        self.doc_ids.append(doc_id)
        self.original_texts[doc_id] = sample['text']

        # Filter mentions that need to be masked and store their spans
        spans_to_mask = set()
        for mention in sample['entity_mentions']:
            if mention['identifier_type'] in ['DIRECT', 'QUASI']:
                spans_to_mask.add((mention['start_offset'], mention['end_offset']))
        self.gold_spans[doc_id] = spans_to_mask

  def _get_masked_spans_from_diff(self, original_text: str, anonymized_text: str) -> set[tuple[int, int]]:
    """
    Compares original and anonymized texts to find masked spans.
    A masked span is any part of the original text that was deleted or replaced.
    """
    matcher = difflib.SequenceMatcher(a=original_text, b=anonymized_text, autojunk=False)
    spans = set()
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ['delete', 'replace']:
            spans.add((i1, i2))
    return spans

  def evaluate(self, new_texts: list[str]) -> dict:
      """
      Evaluates the anonymization performance of the provided texts.

      Args:
          new_texts: A list of anonymized strings. The order must correspond
              to the order of the documents in the initial dataset.

      Returns:
          A dictionary containing evaluation metrics.
      """

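      # Initialize counters for metrics.
      # Note: only the mention-level counters are updated in this example;
      # the token-level counters stay at zero and are not reported.
      total_tp_mention, total_fp_mention, total_fn_mention = 0, 0, 0
      total_tp_token, total_fp_token, total_fn_token = 0, 0, 0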

      for i, anonymized_text in enumerate(new_texts):
          doc_id = self.doc_ids[i]
          original_text = self.original_texts[doc_id]

          # 1. Get system-generated spans and gold-standard spans
          system_spans = self._get_masked_spans_from_diff(original_text, anonymized_text)
          gold_spans = self.gold_spans[doc_id]

          # 2. Calculate Strict Mention-Level Metrics
          # These are exact matches between system spans and gold spans.
          tp_mention = len(system_spans.intersection(gold_spans))
          fp_mention = len(system_spans - gold_spans)
          fn_mention = len(gold_spans - system_spans)

          total_tp_mention += tp_mention
          total_fp_mention += fp_mention
          total_fn_mention += fn_mention


      # 3. Calculate final Precision, Recall, and F1 scores
      def calculate_metrics(tp, fp, fn):
          precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
          recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
          f1 = 2 * (precision * recall) / (precision + recall) if (precision + recall) > 0 else 0.0
          return precision, recall, f1

      mention_precision, mention_recall, mention_f1 = calculate_metrics(total_tp_mention, total_fp_mention, total_fn_mention)
      token_precision, token_recall, token_f1 = calculate_metrics(total_tp_token, total_fp_token, total_fn_token)

      results = {
          "entity_precision": mention_precision,
          "entity_recall": mention_recall,
          "entity_f1": mention_f1,
      }

      return results
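
The span extraction and strict matching used above can be tried on a single hand-made record. The sketch below is only an illustration: the record is invented (it is not real TAB data) and simply mirrors the fields read in `__init__`, and `masked_spans` is a standalone copy of the diff logic rather than the class method itself.

```python
import difflib

# Hand-made record mirroring the fields read in __init__; not real TAB data.
sample = {
    "doc_id": "toy-001",
    "text": "Mr Smith was born in 1980.",
    "entity_mentions": [
        {"start_offset": 3, "end_offset": 8, "identifier_type": "DIRECT"},
        {"start_offset": 21, "end_offset": 25, "identifier_type": "QUASI"},
        # Any other identifier type is skipped by the filter below.
        {"start_offset": 0, "end_offset": 2, "identifier_type": "NO_MASK"},
    ],
}

# Gold spans: only mentions marked DIRECT or QUASI must be masked.
gold = {
    (m["start_offset"], m["end_offset"])
    for m in sample["entity_mentions"]
    if m["identifier_type"] in ("DIRECT", "QUASI")
}


def masked_spans(original: str, anonymized: str) -> set[tuple[int, int]]:
    """Return the character spans of `original` that were deleted or replaced."""
    matcher = difflib.SequenceMatcher(a=original, b=anonymized, autojunk=False)
    return {
        (i1, i2)
        for tag, i1, i2, _, _ in matcher.get_opcodes()
        if tag in ("delete", "replace")
    }


def strict_counts(system: set, gold: set) -> tuple[int, int, int]:
    """Exact-match true positives, false positives, and false negatives."""
    return len(system & gold), len(system - gold), len(gold - system)


# Case 1: both sensitive spans are replaced by placeholders.
full = masked_spans(sample["text"], "Mr [NAME] was born in [DATE].")
print(sorted(full), strict_counts(full, gold))    # [(3, 8), (21, 25)] (2, 0, 0)

# Case 2: only the name is deleted, so the date span is a false negative.
partial = masked_spans(sample["text"], "Mr  was born in 1980.")
print(sorted(partial), strict_counts(partial, gold))  # [(3, 8)] (1, 0, 1)
```

Because the diff works at character level, a replacement that happens to share characters with the original text can split one masked span into several smaller ones, which strict matching then counts as misses. This is worth keeping in mind when interpreting the exact-match scores.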

## Section 3: Using the `CustomTask` in an Experiment

Integrating your `CustomTask` into a Tau-Eval `Experiment` is straightforward. You'll pass an instance of your custom task class to the `Experiment` constructor.


In [None]:
from tau_eval import Experiment
# We will test our task with the Presidio-based EntityDeletion model
from tau_eval.models.presidio import EntityDeletion

# Instantiate the model
model = EntityDeletion()

# Instantiate the task
task = TextAnonymizationBenchmark()

In [None]:
experiment = Experiment([model], ["rouge"], [task])
experiment.run()

INFO:tau_eval.logger:Running experiment...


INFO:tau_eval.logger:Running task: 0


DEBUG:tau_eval.logger:Evaluating model 0


Map:   0%|          | 0/555 [00:00<?, ? examples/s]

INFO:tau_eval.logger:Results saved


## Section 4: Displaying Results

Results are stored in a JSON file and inside the `Experiment` object. You can view them with `Experiment.summary()`:

In [None]:
experiment.summary()["text-anonymization-benchmark"]