# Introduction to PyTerrier

_IN4325: Information retrieval lecture, TU Delft_

**Part 2: Indexing & retrieval**

In this notebook we'll learn how to

- create a simple searchable index of a document corpus in PyTerrier and
- retrieve documents based on a query from that index (_ad-hoc retrieval_).


In [None]:
%pip install python-terrier

In [None]:
import pyterrier as pt

if not pt.started():
    pt.init(tqdm="notebook")

## The data

For our simple example, we'll use a collection of works by William Shakespeare as our document corpus. They are available, collated in a single text file, [here](https://ocw.mit.edu/ans7870/6/6.006/s08/lecturenotes/files/t8.shakespeare.txt). We can download this file directly:


In [None]:
from urllib import request

with request.urlopen(
    "https://ocw.mit.edu/ans7870/6/6.006/s08/lecturenotes/files/t8.shakespeare.txt"
) as u:
    shakespeare_complete = u.read().decode("utf-8")
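Re-running the cell above downloads the file again every time. A minimal sketch of caching it locally could look like the following. The helper `fetch_cached` and the cache file name are hypothetical; they are not part of PyTerrier or of the original notebook.

```python
from pathlib import Path
from urllib import request


def fetch_cached(url: str, cache_path: str, encoding: str = "utf-8") -> str:
    """Return the text at `url`, downloading it only if `cache_path` is missing.

    Hypothetical helper for illustration; not part of PyTerrier.
    """
    path = Path(cache_path)
    if not path.exists():
        # First call: download the raw bytes and store them unchanged.
        with request.urlopen(url) as response:
            path.write_bytes(response.read())
    # Later calls read the local copy instead of hitting the network.
    return path.read_text(encoding=encoding)
```

Calling `fetch_cached` with the URL above and a local path such as `"t8.shakespeare.txt"` should then give the same string as `shakespeare_complete`.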

Before we can index the works, we need to split the file into individual documents. Let's take a look at the beginning of the file:


In [None]:
print(shakespeare_complete[:15000])

You can see that each work starts with

```
<YEAR>

<TITLE>

by William Shakespeare
```

and ends with `THE END`.

We can use a regular expression to extract each individual work (including year and title). We'll package the whole thing as a generator that yields one dictionary per document, containing

- a document ID,
- the year,
- the title,
- the document text.

The unique document ID lets us identify documents in the index later. Note that PyTerrier also assigns its own internal IDs (the `docid` column in the search results).


In [None]:
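import re

# One work = year, title, the byline, and everything up to "THE END".
# re.DOTALL lets "." span line breaks; the lazy ".*?" stops at the first "THE END".
WORK_PATTERN = re.compile(
    r"((\d{4})\s*?([A-Z ]+)\s*?by William Shakespeare.*?THE END)",
    re.DOTALL,
)


def shakespeare_generator():
    # Group 1 is the full text, group 2 the year, group 3 the title.
    for i, item in enumerate(WORK_PATTERN.finditer(shakespeare_complete)):
        yield {
            "docno": f"D{i}",
            "year": item.group(2),
            "title": item.group(3),
            "text": item.group(1),
        }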
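To see what the three capture groups pick up, it can help to run the same pattern on a tiny made-up string in the layout described above. The sample text here is invented for illustration and is not taken from the real file.

```python
import re

# Same pattern as in `shakespeare_generator`.
pattern = re.compile(
    r"((\d{4})\s*?([A-Z ]+)\s*?by William Shakespeare.*?THE END)",
    re.DOTALL,
)

# Made-up miniature "corpus" with two works.
sample = (
    "1609\n\nTHE SONNETS\n\nby William Shakespeare\n\nSome text\n\nTHE END\n\n"
    "1599\n\nAS YOU LIKE IT\n\nby William Shakespeare\n\nMore text\n\nTHE END\n"
)

matches = list(pattern.finditer(sample))
years = [m.group(2) for m in matches]
titles = [m.group(3) for m in matches]
print(years)
print(titles)

# The title group only allows capital letters and spaces, so a title that
# contains punctuation (here an apostrophe) is not matched at all.
punctuated = "1596\n\nA NIGHT'S DREAM\n\nby William Shakespeare\n\nText\n\nTHE END\n"
print(pattern.search(punctuated))
```

If the real file has titles with apostrophes, commas, or semicolons, those works would be skipped by this pattern. Comparing the number of yielded documents with the number of works you expect is a quick sanity check.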

Let's give it a spin and print the first document:


In [None]:
from pprint import pprint

for x in shakespeare_generator():
    pprint(x)
    break

## Indexing

We can use this generator to index our collection. `pyterrier.IterDictIndexer` will consume our iterator and build the index. We just need to give it a path for the index (`shakespeare_index`) and the metadata fields we want to store, each with the maximum length of its values.

Note that we also pass the arguments `stemmer="porter"` and `stopwords="terrier"`; this is optional, as PyTerrier applies Porter stemming and stopword removal by default, but these arguments can be used to customize that behaviour.


In [None]:
from pathlib import Path

indexer = pt.IterDictIndexer(
    str(Path("shakespeare_index").absolute()),
    meta={
        "docno": 4,
        "year": 4,
        "title": 32,
        "text": 131072,
    },
    stemmer="porter",
    stopwords="terrier",
)
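The lengths in `meta` are upper bounds chosen by hand. One way to check them against the data, rather than guessing, is to measure the longest value per field. This sketch assumes that lengths are counted in characters, and `max_meta_lengths` is a hypothetical helper, not a PyTerrier function.

```python
from typing import Dict, Iterable


def max_meta_lengths(docs: Iterable[Dict[str, str]]) -> Dict[str, int]:
    """Return the longest value length (in characters) seen for each key."""
    lengths: Dict[str, int] = {}
    for doc in docs:
        for key, value in doc.items():
            lengths[key] = max(lengths.get(key, 0), len(value))
    return lengths


# Tiny made-up documents in the same shape as those from `shakespeare_generator`.
example_docs = [
    {"docno": "D0", "year": "1609", "title": "THE SONNETS", "text": "From fairest creatures"},
    {"docno": "D10", "year": "1599", "title": "AS YOU LIKE IT", "text": "DRAMATIS PERSONAE"},
]
print(max_meta_lengths(example_docs))
```

Running `max_meta_lengths(shakespeare_generator())` would show whether the limits above (for instance 131072 for `text`) cover every work. Values that exceed a declared length may be truncated or rejected, depending on the PyTerrier version.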

Now we can index our collection. By default, only the field `text` will be indexed. Since the text contains both year and title in our case, we'll keep the default. To change this behavior, you can set, for example, `fields=("text", "some_other_field")` if you want `some_other_field` to be searchable as well.

This method returns a _reference_ to our newly created index:


In [None]:
index_ref = indexer.index(shakespeare_generator())

PyTerrier offers several other indexers. For a complete list, see the [indexer classes](https://pyterrier.readthedocs.io/en/latest/terrier-indexing.html#indexer-classes) in the documentation.
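Conceptually, the indexing step above builds an inverted index: a mapping from each term to the documents that contain it. The toy sketch below illustrates the idea in plain Python. It is not how Terrier implements its index; the stopword list is a tiny stand-in and stemming is left out entirely.

```python
from collections import defaultdict
from typing import Dict, List

# Tiny illustrative stopword list; Terrier ships a much longer one.
STOPWORDS = {"a", "the", "of", "and", "to"}


def tokenize(text: str) -> List[str]:
    """Lowercase, split on whitespace, and drop stopwords (no stemming here)."""
    return [t for t in text.lower().split() if t not in STOPWORDS]


def build_inverted_index(docs: Dict[str, str]) -> Dict[str, Dict[str, int]]:
    """Map each term to {docno: term frequency}."""
    index: Dict[str, Dict[str, int]] = defaultdict(dict)
    for docno, text in docs.items():
        for term in tokenize(text):
            index[term][docno] = index[term].get(docno, 0) + 1
    return dict(index)


toy_docs = {
    "D0": "the tragedy of the king",
    "D1": "a comedy and a tragedy",
}
toy_index = build_inverted_index(toy_docs)
print(toy_index["tragedy"])
```

Looking up a query term in such a mapping immediately gives the candidate documents, which is why retrieval does not need to scan the full text of every document.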

## Retrieval

To search our index, we use `pyterrier.BatchRetrieve`. [Terrier supports many weighting models](http://terrier.org/docs/current/javadoc/org/terrier/matching/models/package-summary.html), and we can select one with the `wmodel` parameter. For now, we'll use simple TF-IDF.

By setting the `metadata` argument, we can tell PyTerrier to retrieve any metadata that we added earlier (such as the titles) along with the document IDs.


In [None]:
tf_idf = pt.BatchRetrieve(
    index_ref, wmodel="TF_IDF", num_results=10, metadata=["docno", "title"]
)
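To get an intuition for what a TF-IDF weighting model computes, here is a textbook version on three made-up documents: each query term contributes its term frequency times `log(N / df)`. This is a simplified sketch; Terrier's `TF_IDF` model uses its own normalisation, so the scores it returns will differ.

```python
import math
from collections import Counter
from typing import Dict


def tf_idf_scores(query: str, docs: Dict[str, str]) -> Dict[str, float]:
    """Score each document with a textbook tf * log(N / df) sum over query terms."""
    tokenized = {docno: text.lower().split() for docno, text in docs.items()}
    n_docs = len(tokenized)
    scores: Dict[str, float] = {}
    for docno, tokens in tokenized.items():
        counts = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            # Document frequency: number of documents containing the term.
            df = sum(1 for other in tokenized.values() if term in other)
            if df:  # skip terms that occur in no document
                score += counts[term] * math.log(n_docs / df)
        scores[docno] = score
    return scores


toy_docs = {
    "D0": "king henry speaks to the king",
    "D1": "a public place",
    "D2": "the king enters",
}
scores = tf_idf_scores("king henry", toy_docs)
ranking = sorted(scores.items(), key=lambda kv: -kv[1])
print(ranking)
```

The rarer term ("henry") gets the larger weight, and the document that contains both query terms ends up on top.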

This model can be used directly to search. The result is a `pandas.DataFrame`:


In [None]:
tf_idf.search("tragedy")

As the name suggests, `BatchRetrieve` can also process a batch of queries at once. The queries must be passed as a `pandas.DataFrame` with the columns `qid` and `query`:


In [None]:
import pandas as pd

tf_idf(
    pd.DataFrame(
        [
            ["Q1", "a public place"],
            ["Q2", "king henry"],
        ],
        columns=["qid", "query"],
    )
)
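When the queries come as a plain list of strings, the `qid`/`query` rows can be generated instead of typed by hand. A small sketch, where `make_query_rows` is a hypothetical helper and the `Q1`, `Q2`, ... naming simply mirrors the example above:

```python
from typing import Iterable, List


def make_query_rows(queries: Iterable[str], prefix: str = "Q") -> List[List[str]]:
    """Pair each query string with a generated ID such as Q1, Q2, ..."""
    return [[f"{prefix}{i}", query] for i, query in enumerate(queries, start=1)]


rows = make_query_rows(["a public place", "king henry"])
print(rows)
```

The resulting `rows` can be passed to `pd.DataFrame(rows, columns=["qid", "query"])` to obtain the same frame as in the cell above.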

## Loading an index

Once you have created your index on disk, you can always load it rather than re-indexing the collection every time. Let's delete our index reference and access the index directly from disk:


In [None]:
del index_ref

pt.BatchRetrieve(
    str(Path("shakespeare_index").absolute()),
    wmodel="TF_IDF",
    num_results=10,
    metadata=["docno", "title"],
).search("tragedy")
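To decide whether to load an index or build it first, a notebook can check whether the index directory already holds a finished index. The sketch below assumes that a completed Terrier index directory contains a file named `data.properties`. Treat that file name as an assumption to verify for your version; `index_exists` is a hypothetical helper.

```python
from pathlib import Path


def index_exists(index_dir: str) -> bool:
    """Heuristic: assume a finished index directory contains `data.properties`."""
    return (Path(index_dir) / "data.properties").is_file()


index_dir = str(Path("shakespeare_index").absolute())
print(index_exists(index_dir))
```

If the check succeeds, the path can be passed to `pt.BatchRetrieve` as shown above; otherwise the indexing cell needs to run first.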

Note that when several models share one index, the best practice is to load the index once and pass the resulting object to each model, rather than passing a path or reference each time:


In [None]:
index = pt.IndexFactory.of(str(Path("shakespeare_index").absolute()))
tf_idf = pt.BatchRetrieve(index, wmodel="TF_IDF")
dirichlet_lm = pt.BatchRetrieve(index, wmodel="DirichletLM")

## Memory indexes

The first index we created is saved to and loaded from disk. An alternative that can be useful for small corpora is a _memory index_. Such an index is kept entirely in main memory and is therefore faster to access.

We can create a memory index by specifying `type=pyterrier.index.IndexingType.MEMORY`. Note that the index path must still be valid, even though it will be ignored. Hence, we can simply pass the current working directory:


In [None]:
memory_index = pt.index.IterDictIndexer(
    str(Path.cwd()),  # this will be ignored
    meta={
        "docno": 4,
        "year": 4,
        "title": 32,
        "text": 131072,
    },
    type=pt.index.IndexingType.MEMORY,
).index(shakespeare_generator())

Now we can use the index just as before:


In [None]:
pt.BatchRetrieve(memory_index, wmodel="TF_IDF").search("tragedy")

## Further reading

Check out the [indexing guide](https://pyterrier.readthedocs.io/en/latest/terrier-indexing.html) in the official documentation.
