# Under the Hood of 🍬 SuiteEval

Now let's look at the building blocks of 🍬 SuiteEval and how to extend it to other benchmarks or tweak its behaviour.

## Setup

Set up a new environment if you're working locally! Otherwise, you can just run all cells.

In [None]:
%pip install suiteeval pyterrier-pisa

In [None]:
from os.path import join

import pandas as pd
from pyterrier_pisa import PisaIndex

from suiteeval import DatasetContext

In [None]:
def bm25(context: DatasetContext):
    """Index the dataset's corpus with PISA and return a (BM25 retriever, name) pair."""
    # context.path is a temporary directory
    index = PisaIndex(join(context.path, "index.pisa"))
    # get_corpus_iter gives us a PyTerrier-compatible generator of {docno, text} records
    index.index(context.get_corpus_iter())

    return index.bm25(), "BM25"
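
To make the record shape mentioned in the comment above concrete, here is a minimal standard-library sketch of a generator that yields `{docno, text}` dictionaries. The function `toy_corpus_iter` and its two documents are hypothetical stand-ins for illustration only; they are not part of SuiteEval, and the real iterator comes from `context.get_corpus_iter()`.

```python
from typing import Dict, Iterator


def toy_corpus_iter() -> Iterator[Dict[str, str]]:
    """Hypothetical stand-in for context.get_corpus_iter()."""
    # Made-up documents, used only to show the record shape.
    toy_docs = [
        ("d1", "first toy document"),
        ("d2", "second toy document"),
    ]
    for docno, text in toy_docs:
        # One dict per document, with string-valued docno and text fields.
        yield {"docno": docno, "text": text}


records = list(toy_corpus_iter())
print(records[0])
```

Because the records are produced lazily, an indexer can consume the corpus one document at a time instead of holding it all in memory.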

## Making a New Suite

If we want to make a new suite it can be as simple as registering the datasets and measures we want to use!

Let's say we are primarily interested in multi-lingual IR. We can make a single suite that runs all of our datasets, so the cross-lingual and mono-lingual experiments happen in one go. We'll also add a broad set of measures to explore.

In [None]:
datasets = [
    "wikiclir/en-simple",
    "wikiclir/es",
    "wikiclir/fr",
    "wikir/en59k/test",
    "wikir/es13k/test",
    "wikir/fr14k/test"
]
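
The dataset IDs above fall into two families that can be told apart by their first path component. The following standard-library sketch groups them that way; the `families` dictionary is a name introduced here for illustration and is not something SuiteEval requires.

```python
from collections import defaultdict

datasets = [
    "wikiclir/en-simple",
    "wikiclir/es",
    "wikiclir/fr",
    "wikir/en59k/test",
    "wikir/es13k/test",
    "wikir/fr14k/test",
]

# Group dataset IDs by the text before the first "/".
families = defaultdict(list)
for dataset_id in datasets:
    family = dataset_id.split("/", 1)[0]
    families[family].append(dataset_id)

for family, members in families.items():
    print(family, len(members))
```

Note that `"wikiclir"` does not start with `"wikir"`, so a plain prefix check is enough to keep the two families apart.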

In [None]:
from ir_measures import nDCG, P, R

from suiteeval import Suite

MLIR = Suite.register('MultiLingualIR', datasets=datasets, metadata={
    'description': 'A multilingual information retrieval benchmark combining WikiCLIR and WikiR datasets.',
    'official_measures': [nDCG@10, P@10, R@1000],
})

That's it! We can now run our new suite.

In [None]:
MLIR(bm25)

## Getting Fancy

Now suppose we want to see the overall performance of our models and baselines on multi-lingual versus cross-lingual IR separately. We can subclass Suite and override `compute_overall_mean` to implement this behaviour.

In [None]:
from suiteeval.utility import geometric_mean

class _MLIR(Suite):
    def compute_overall_mean(self, results: pd.DataFrame, eval_metrics=None) -> pd.DataFrame:
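        # NOTE: inside this subclass `self.__default_measures` is name-mangled to
        # `self._MLIR__default_measures`. The lookup below ASSUMES the base class stores
        # its defaults under the mangled name `_Suite__default_measures`.
        defaults = getattr(self, "_Suite__default_measures", [])
        measure_cols = [str(m) for m in (eval_metrics or defaults) if str(m) in results.columns]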
        if not measure_cols:
            return results

        # 1) Per-(dataset, name) geometric means (no relabel yet)
        gmean_rows = []
        for (dataset, name), group in results.groupby(["dataset", "name"], dropna=False):
            row = {"dataset": dataset, "name": name}
            for col in measure_cols:
                vals = pd.to_numeric(group[col], errors="coerce").dropna().values
                if vals.size:
                    row[col] = geometric_mean(vals)
            gmean_rows.append(row)
        per_ds_df = pd.DataFrame(gmean_rows)

        # 2) Multi-Lingual and Cross-Lingual overalls computed from per-dataset means
        def _overall_for_prefix(prefix: str, label: str) -> pd.DataFrame:
            subset = per_ds_df[per_ds_df["dataset"].astype(str).str.startswith(prefix, na=False)]
            if subset.empty:
                return pd.DataFrame(columns=["dataset", "name"] + measure_cols)
            rows = []
            for name, grp in subset.groupby("name", dropna=False):
                row = {"dataset": label, "name": name}
                for col in measure_cols:
                    vals = pd.to_numeric(grp[col], errors="coerce").dropna().values
                    if vals.size:
                        row[col] = geometric_mean(vals)
                rows.append(row)
            return pd.DataFrame(rows)

        mlir_df = _overall_for_prefix("wikir", "Overall (Multi-Lingual)")
        clir_df = _overall_for_prefix("wikiclir", "Overall (Cross-Lingual)")

        # 3) Keep the default "Overall" rows (per-dataset means relabelled as "Overall")
        gmean_df = per_ds_df.copy()
        gmean_df["dataset"] = "Overall"

        # 4) Concatenate and return
        return pd.concat([results, gmean_df, mlir_df, clir_df], ignore_index=True)

MLIR = _MLIR()

Now we not only run all of our evaluation, we also get separate geometric-mean averages for each of our task families!
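
To see what a per-family geometric mean looks like in isolation, here is a small sketch under stated assumptions. It uses Python's `statistics.geometric_mean` rather than `suiteeval.utility.geometric_mean`, whose exact behaviour is not shown in this notebook, and the scores are made-up illustrative values, not real results. The helper `overall_for_prefix` is hypothetical and only mirrors the prefix-based grouping used in the subclass.

```python
from statistics import geometric_mean, mean

# Made-up per-dataset nDCG@10 scores for a single system (illustration only).
scores = {
    "wikir/en59k/test": 0.40,
    "wikir/es13k/test": 0.10,
    "wikiclir/es": 0.20,
    "wikiclir/fr": 0.20,
}


def overall_for_prefix(scores, prefix):
    """Geometric mean of the scores whose dataset ID starts with `prefix`."""
    vals = [v for dataset_id, v in scores.items() if dataset_id.startswith(prefix)]
    return geometric_mean(vals) if vals else None


wikir_overall = overall_for_prefix(scores, "wikir")
wikiclir_overall = overall_for_prefix(scores, "wikiclir")

# Arithmetic mean of the same wikir scores, for comparison.
wikir_arithmetic = mean(v for d, v in scores.items() if d.startswith("wikir"))

print(round(wikir_overall, 3), round(wikir_arithmetic, 3), round(wikiclir_overall, 3))
```

In this toy case the geometric mean of the two `wikir` scores is lower than their arithmetic mean, which shows how it penalises a system that does well on one dataset but poorly on another.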

In [None]:
MLIR(bm25)

## Try It Yourself!

Group your favourite datasets or head over to [ir-datasets](https://ir-datasets.com) and choose a custom collection.

In [None]:
from ir_measures import *

datasets = []
measures = []

In [None]:
MySuite = Suite.register('MySuite', datasets=datasets, metadata={
    'description': 'My custom IR benchmark suite.',
    'official_measures': measures,
})