# Introduction to PyTerrier

_IN4325: Information retrieval lecture, TU Delft_

**Part 5: Transformers**

This notebook introduces PyTerrier _transformers_ (not to be confused with [neural transformer models](<https://en.wikipedia.org/wiki/Transformer_(deep_learning_architecture)>)). We'll learn about the different types of data frames that PyTerrier uses and how the transformers operate on them.

In order to run everything in this notebook, you'll need PyTerrier and [pyspellchecker](https://github.com/barrust/pyspellchecker) installed:


In [None]:
pip install python-terrier pyspellchecker

Because we're going to use RM3 in this notebook, we'll need to load the `terrier-prf` plugin:


In [None]:
import pyterrier as pt

if not pt.started():
    pt.init(
        tqdm="notebook", boot_packages=["com.github.terrierteam:terrier-prf:-SNAPSHOT"]
    )

In the following, we'll illustrate the different kinds of transformers and data frames using examples. Note that we're only scratching the surface here, so **make sure to have a look at the [documentation](https://pyterrier.readthedocs.io/)**!

We'll use the `nfcorpus` dataset:


In [None]:
dataset = pt.get_dataset("irds:nfcorpus/test")

For this notebook we'll need an index with blocks (i.e., positional information), so we need to create a new one. Since memory indexes do not support blocks at the moment, we'll create one on disk:


In [None]:
from pathlib import Path

idx_path = Path("nfcorpus_index_with_blocks").absolute()

We index the corpus with `blocks=True`:


In [None]:
pt.index.IterDictIndexer(
    str(idx_path),
    blocks=True,
).index(dataset.get_corpus_iter(), fields=["title", "abstract", "url"])

## Data format

Recall that the queries (topics) of a dataset can be accessed using the `get_topics` method. For this dataset, there are multiple variants; we choose `title`:


In [None]:
queries = dataset.get_topics(variant="title")
queries

In general, PyTerrier represents all data as `pandas.DataFrame` objects.

The method above outputs a data frame with two columns, `qid` and `query`. In PyTerrier, data frames of this format are referred to as _data type_ `Q`, and they essentially represent a set of queries, each of which has a unique identifier. In fact, we have already constructed our own `Q` data frames in the scaffolding project.

There are some other data types, and we will introduce them throughout the rest of this series. You can find an overview [here](https://pyterrier.readthedocs.io/en/latest/datamodel.html).
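
As a minimal sketch, a `Q` data frame can also be built by hand with pandas. The query IDs and texts below are made up for illustration; only the column names `qid` and `query` come from the data type described above.

```python
import pandas as pd

# Hypothetical queries; only the column names follow the Q data type.
my_queries = pd.DataFrame(
    [
        {"qid": "q1", "query": "vitamin d deficiency"},
        {"qid": "q2", "query": "green tea cancer"},
    ]
)

print(my_queries.columns.tolist())
```

A frame like this should be usable wherever the notebook passes `queries`, provided the query IDs are unique.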

## Transformers

_Transformers_ directly operate on these data frames; in other words, a transformer takes as input a data frame of some type and outputs another data frame (of the same or another type). We'll take a look at several pre-implemented transformers in this notebook.

### Retrieval transformers

Retrievers are the most common transformers, and we have already used them plenty throughout this introductory series. For example, let's take a BM25 model as before:


In [None]:
index = pt.IndexFactory.of(str(idx_path))
bm25 = pt.BatchRetrieve(index, wmodel="BM25")

This transformer consumes data of type `Q` and returns data of type `R` (i.e., columns `qid`, `docno`, `score`, `rank`), which corresponds to a ranking. The transformation can be invoked by calling the `transform` method:


In [None]:
bm25.transform(queries)

Note that our result is actually a superset of `R` (we have an additional column, `query`). In general, data frames may have more columns than a specific transformer requires, but they can still be used.
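
The "superset" idea can be sketched as a simple column check. The `REQUIRED` mapping and the `satisfies` helper below are hypothetical and not part of PyTerrier; they only illustrate that extra columns do not get in the way.

```python
# Column sets for the two data types introduced so far in this notebook.
REQUIRED = {
    "Q": {"qid", "query"},
    "R": {"qid", "docno", "score", "rank"},
}


def satisfies(columns, data_type):
    """Return True if `columns` contains every column that `data_type` requires."""
    return REQUIRED[data_type] <= set(columns)


# An R frame with an extra `query` column, similar to the BM25 output.
result_columns = ["qid", "docno", "score", "rank", "query"]
print(satisfies(result_columns, "R"))
print(satisfies(result_columns, "Q"))
print(satisfies(["qid", "query"], "R"))
```

Under this view, a ranking that still carries its `query` column counts as both `R` and `Q`, which is what allows the output of one retriever to be fed into another stage that needs the query text.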

Alternatively, the transformer can be called directly, as we have done so far, which gives the same result:


In [None]:
bm25(queries)

Furthermore, transformers implement the `search` method, which processes a single query:


In [None]:
bm25.search("what is the meaning of life")

### Query rewriting transformers

You have already experimented with query rewriting in the scaffolding project. PyTerrier implements several transformers that rewrite queries.

One of them is the _sequential dependence model_ (SDM):


In [None]:
sdm = pt.rewrite.SequentialDependence()

SDM requires positional information in the index (which is why we indexed with `blocks=True`). More information about SDM can be found [here](https://pyterrier.readthedocs.io/en/latest/rewrite.html#sequentialdependence).

It operates solely on the queries themselves; in other words, both input and output are data frames of type `Q`:


In [None]:
sdm(queries)

In this case, the `query` column contains the new (rewritten) queries, while the original queries are retained in the `query_0` column.

#### Query expansion

_Query expansion_ differs from standard query rewriting in that it operates on queries **and** a set of documents that are assumed to be relevant to them (typically the top-ranked results of an initial retrieval step using the original queries). This approach is also known as _pseudo-relevance feedback_ (PRF). A popular PRF model is _RM3_:


In [None]:
rm3 = pt.rewrite.RM3(index)

Since RM3 requires a set of documents for each query, its input type needs to be `R`. Consequently, we can use the result of our retriever as an input for the PRF model:


In [None]:
rm3(bm25(queries))
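
To make the feedback idea concrete, here is a deliberately simplified sketch. It is **not** RM3: RM3 estimates a relevance model over the feedback documents and interpolates it with the original query, whereas this toy version just appends the most frequent new terms. The function `expand_query` and the example documents are made up for illustration.

```python
from collections import Counter


def expand_query(query, feedback_docs, num_terms=2):
    """Append the most frequent terms of the feedback documents to the query."""
    query_terms = query.lower().split()
    # Count every term in the feedback documents that is not already in the query.
    counts = Counter(
        term
        for doc in feedback_docs
        for term in doc.lower().split()
        if term not in query_terms
    )
    new_terms = [term for term, _ in counts.most_common(num_terms)]
    return " ".join(query_terms + new_terms)


# Hypothetical top-ranked documents for the query "green tea".
top_docs = [
    "green tea contains catechins and antioxidants",
    "catechins in green tea are antioxidants",
    "tea catechins may lower cholesterol",
]
print(expand_query("green tea", top_docs))
```

The sketch shows why the input has to be a ranking rather than a plain set of queries: without retrieved documents there is nothing to draw expansion terms from.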

## Pipelines

You probably noticed that the transformers we've seen so far are mostly designed to work in sequence; for example, reformulating queries alone is pointless without an actual retrieval step afterwards.

This is where _pipelines_ come into play. PyTerrier implements the `>>` operator to build sequences of transformers. Let's build a simple pipeline that applies SDM and then retrieves documents using BM25:


In [None]:
pl_sdm = sdm >> bm25

We can now use this pipeline like any other transformer:


In [None]:
pl_sdm(queries)
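
The `>>` syntax relies on Python's operator overloading. The following sketch shows the general mechanism with a hypothetical `ToyTransformer` class that works on lists of dictionaries; it is not how PyTerrier implements its pipelines internally.

```python
class ToyTransformer:
    """A stand-in for a transformer: wraps a function from rows to rows."""

    def __init__(self, fn):
        self.fn = fn

    def __call__(self, rows):
        return self.fn(rows)

    def __rshift__(self, other):
        # `a >> b` builds a new transformer that runs `a` first, then `b`.
        return ToyTransformer(lambda rows: other(self(rows)))


# Two toy query rewriters operating on lists of {"qid", "query"} dicts.
lowercase = ToyTransformer(
    lambda rows: [{**r, "query": r["query"].lower()} for r in rows]
)
add_term = ToyTransformer(
    lambda rows: [{**r, "query": r["query"] + " health"} for r in rows]
)

pipeline = lowercase >> add_term
print(pipeline([{"qid": "q1", "query": "Green Tea"}]))
```

Because the composition is itself a transformer, it can be chained further or passed around like any single stage.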

Let's compare SDM and RM3 in terms of retrieval effectiveness.

First, we create a pipeline for RM3:


In [None]:
pl_rm3 = bm25 >> rm3 >> bm25

Now we can run an experiment to evaluate and compare both of these pipelines. We'll also include standalone BM25:


In [None]:
from pyterrier.measures import MAP, nDCG

pt.Experiment(
    [bm25, pl_sdm, pl_rm3],
    queries,
    dataset.get_qrels(),
    names=["BM25", "SDM >> BM25", "BM25 >> RM3 >> BM25"],
    eval_metrics=[MAP, nDCG @ 10],
)

### Operators

There are a number of _operators_ that can be applied to transformers within pipelines. We've already seen the `>>` operator. Here, we'll look at a few more selected operators. You can find the complete list [here](https://pyterrier.readthedocs.io/en/latest/operators.html).

#### Rank cutoff

The `%` operator limits how many documents per query are kept (the lowest-scoring ones are removed). For example, we may want to consider only a single document for RM3:


In [None]:
pl_rm3_1doc = (bm25 % 1) >> rm3 >> bm25

pt.Experiment(
    [pl_rm3, pl_rm3_1doc],
    queries,
    dataset.get_qrels(),
    names=["RM3", "RM3 (1 document)"],
    eval_metrics=[MAP, nDCG @ 10],
)
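
The effect of a rank cutoff can be sketched in plain pandas. The ranking below is made up, and the `cutoff` helper is only an illustration of the semantics described above, not PyTerrier's implementation.

```python
import pandas as pd

# A hypothetical ranking (data type R without the `rank` column, for brevity).
ranking = pd.DataFrame(
    {
        "qid": ["q1", "q1", "q1", "q2", "q2"],
        "docno": ["d1", "d2", "d3", "d4", "d5"],
        "score": [2.0, 5.0, 3.0, 1.0, 4.0],
    }
)


def cutoff(df, k):
    """Keep the k highest-scoring rows for each query."""
    ordered = df.sort_values(["qid", "score"], ascending=[True, False])
    return ordered.groupby("qid").head(k).reset_index(drop=True)


print(cutoff(ranking, 1))
```

With `k=1`, each query keeps only its best document, which is the input RM3 receives in the `(bm25 % 1) >> rm3 >> bm25` pipeline.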

#### Caching

The `~` operator can be used to automatically cache the output of a transformer. Let's time our BM25 retriever without caching first:


In [None]:
%timeit bm25(queries)

Now we enable caching. Once the results have been cached, repeated calls should be much faster:


In [None]:
%timeit (~bm25)(queries)

**Important**: When you use caching, make sure to clear the cache when you make changes to the transformers you cached. Otherwise, you might get unexpected results. The default location of the cache is `~/.pyterrier/transformer_cache/` (on Linux and macOS systems).
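
Why stale caches are a problem can be shown with a small in-memory analogy. The `CachedScorer` class is hypothetical and much simpler than PyTerrier's on-disk cache; it only demonstrates that cached results outlive a change to the thing that produced them.

```python
class CachedScorer:
    """Toy cache: remembers the result for each query string."""

    def __init__(self, fn):
        self.fn = fn
        self.cache = {}
        self.calls = 0  # how often the wrapped function actually ran

    def __call__(self, query):
        if query not in self.cache:
            self.calls += 1
            self.cache[query] = self.fn(query)
        return self.cache[query]


cached = CachedScorer(lambda q: len(q.split()))
first = cached("green tea")
second = cached("green tea")  # served from the cache, the function is not called

# Changing the wrapped function does not invalidate the old entries.
cached.fn = lambda q: 0
stale = cached("green tea")
print(first, second, stale, cached.calls)
```

The third call still returns the old value even though the underlying function now behaves differently, which is the kind of unexpected result the warning refers to.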


#### Combining rankings

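The `+` operator adds up the scores of two transformers that output rankings (data type `R`), and the `*` operator multiplies the scores of a transformer by a scalar. Together, they can be used to linearly combine rankings. For example, we can use two different retrievers and combine them as follows: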


In [None]:
tf_idf = pt.BatchRetrieve(index, wmodel="TF_IDF")

pt.Experiment(
    [tf_idf, bm25, 2 * tf_idf + bm25],
    queries,
    dataset.get_qrels(),
    names=["TF-IDF", "BM25", "2 * TF-IDF + BM25"],
    eval_metrics=[MAP, nDCG @ 10],
)

Note that the operations are applied to the scores computed by the retrievers. If a document is missing for one of the retrievers, a score of `0` is used.
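
For a single query, the combination can be sketched with plain dictionaries. The `combine` function and the scores are made up for illustration; the only behavior taken from the text above is that a missing document contributes a score of `0`.

```python
def combine(scores_a, scores_b, weight_a=1.0, weight_b=1.0):
    """Weighted sum of two {docno: score} mappings for a single query."""
    docnos = set(scores_a) | set(scores_b)
    return {
        d: weight_a * scores_a.get(d, 0.0) + weight_b * scores_b.get(d, 0.0)
        for d in docnos
    }


# Hypothetical scores for one query; d3 is missing from the first ranking
# and d2 is missing from the second one.
tf_idf_scores = {"d1": 1.5, "d2": 0.5}
bm25_scores = {"d1": 2.0, "d3": 4.0}

# Corresponds to the shape of `2 * tf_idf + bm25`.
combined = combine(tf_idf_scores, bm25_scores, weight_a=2.0)
print(sorted(combined.items(), key=lambda item: item[1], reverse=True))
```

Note that scores from different retrieval models are not necessarily on the same scale, so the weights have a direct influence on which model dominates the combined ranking.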

### Compiling pipelines

Pipelines can be compiled, which allows PyTerrier to rewrite them into an equivalent form. Whether this improves efficiency depends on the operations involved. For example, consider the following:


In [None]:
pl = bm25 % 3
pl_compiled = pl.compile()

Let's time them both:


In [None]:
%timeit pl(queries)

In [None]:
%timeit pl_compiled(queries)

## Custom transformers

PyTerrier makes it easy for you to implement your own custom transformers. In fact, we've used a custom query transformer under the hood for the scaffolding project.

### `apply` functions

[`pyterrier.apply`](https://pyterrier.readthedocs.io/en/latest/apply.html) allows for applying a custom function to each row of a data frame. There are many `apply` functions, each of which focuses on different data types. An overview can be found [here](https://pyterrier.readthedocs.io/en/latest/apply.html#module-pyterrier.apply).

Let's implement one that reformulates the query to sound a bit nicer:


In [None]:
ask_nicely = pt.apply.query(
    lambda row: "please find some information about " + row["query"]
)
ask_nicely(queries)
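
Conceptually, a row-wise rewrite like this corresponds to `DataFrame.apply` with `axis=1`. The sketch below uses plain pandas on made-up queries and mirrors the `query_0` convention of the rewriters shown earlier; it is an approximation of the idea, not the code behind `pt.apply.query`.

```python
import pandas as pd

# Hypothetical queries in the Q format.
toy_queries = pd.DataFrame(
    {"qid": ["q1", "q2"], "query": ["green tea", "vitamin d"]}
)


def rewrite(row):
    return "please find some information about " + row["query"]


rewritten = toy_queries.copy()
rewritten["query_0"] = rewritten["query"]  # keep the original formulation
rewritten["query"] = toy_queries.apply(rewrite, axis=1)  # one call per row
print(rewritten)
```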

### Extending `pyterrier.Transformer`

More complex transformers can be implemented by extending the base class directly.

Here, we implement a transformer that naively corrects supposed spelling mistakes using a spell-checking library. In order to do this, we only need to implement the `transform` method. We adopt the behavior of the other query rewriters and retain the original formulation in the `query_0` column:


In [None]:
import pandas as pd
from spellchecker import SpellChecker


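class CorrectQuerySpelling(pt.Transformer):
    """Rewrite queries by replacing unknown words with suggested corrections."""

    def __init__(self):
        self.spellchecker = SpellChecker()
        super().__init__()

    def _correct_spelling(self, query: str) -> str:
        result = []
        for word in query.split(" "):
            # `unknown` returns the subset of words missing from the dictionary.
            if len(self.spellchecker.unknown([word])) > 0:
                # `correction` may return None; fall back to the original word.
                result.append(self.spellchecker.correction(word) or word)
            else:
                result.append(word)
        return " ".join(result)

    def transform(self, df: pd.DataFrame) -> pd.DataFrame:
        # Work on a copy so that the input data frame is left untouched.
        df_new = df.copy()
        # Keep the original formulation, as the built-in rewriters do.
        df_new["query_0"] = df_new["query"]
        df_new["query"] = df_new["query_0"].map(self._correct_spelling)
        return df_new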

Let's give it a try:


In [None]:
correct_query_spelling = CorrectQuerySpelling()
correct_query_spelling(queries)
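
The word-by-word correction logic can also be tried without PyTerrier or pyspellchecker. The sketch below swaps in `difflib` from the standard library and a tiny made-up vocabulary, so its suggestions will differ from those of a real spell checker; the function names are hypothetical.

```python
import difflib

# Hypothetical vocabulary; a real spell checker ships a full word list.
VOCABULARY = ["information", "retrieval", "model", "query"]


def correct_word(word):
    """Return the closest vocabulary word, or the word itself if none is close."""
    if word in VOCABULARY:
        return word
    matches = difflib.get_close_matches(word, VOCABULARY, n=1, cutoff=0.8)
    return matches[0] if matches else word


def correct_query(query):
    """Correct each whitespace-separated word independently."""
    return " ".join(correct_word(w) for w in query.split(" "))


print(correct_query("information retreival xyzzy"))
```

Like the transformer above, this approach is naive: it looks at each word in isolation, and domain-specific terms that are absent from the vocabulary may be left alone or changed to something unintended.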

## Further reading

Check out the sections about the [data model](https://pyterrier.readthedocs.io/en/latest/datamodel.html) and [transformers](https://pyterrier.readthedocs.io/en/latest/transformer.html) in the PyTerrier documentation.
