# Introduction to PyTerrier

_IN4325: Information retrieval lecture, TU Delft_

**Part 3: Datasets**

This notebook focuses on IR datasets and pre-made indexes that can be loaded automatically in PyTerrier.


In [None]:
pip install python-terrier

In [None]:
import pyterrier as pt

if not pt.started():
    pt.init(tqdm="notebook")

## Importing datasets

PyTerrier comes with a large number of datasets that can be loaded directly. This is convenient because the parsing is already taken care of and any required files are downloaded automatically.

A list of available datasets can be found [here](https://pyterrier.readthedocs.io/en/latest/datasets.html#available-datasets) or by calling the following function:


In [None]:
pt.datasets.list_datasets()

A dataset can provide the following components:

- corpus (the documents),
- index (pre-made, ready to use),
- topics (queries or topic descriptions, grouped in folds or splits),
- qrels (relevance judgments for the queries; we'll use these for evaluation in an upcoming notebook).

Note that, for many datasets, some of these components are missing. Furthermore, the prefix `irds:` denotes that the corresponding dataset is loaded from the [`ir_datasets`](https://ir-datasets.com/) library, which seamlessly integrates with PyTerrier.
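
As a small illustration of this naming convention, the sketch below separates native dataset names from those backed by `ir_datasets`. The helper `split_by_source` is hypothetical and the names are hard-coded for illustration; in the notebook, they could instead be taken from the listing returned by `pt.datasets.list_datasets()` (assuming it exposes the names in a `dataset` column).

```python
def split_by_source(names):
    """Split dataset names into (native, ir_datasets-backed) lists."""
    native = [name for name in names if not name.startswith("irds:")]
    irds = [name for name in names if name.startswith("irds:")]
    return native, irds


# Illustrative names only; not read from PyTerrier here
names = ["vaswani", "irds:vaswani", "irds:msmarco-passage"]
native, irds = split_by_source(names)
print("native:", native)
print("ir_datasets:", irds)
```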

Let's start by loading the `vaswani` dataset:


In [None]:
dataset = pt.get_dataset("vaswani")

For this dataset, there are pre-made indexes available that we can load. To do so, we need to select a _variant_. The variants differ slightly, for example in their pre-processing. An overview of the indexes and variants can be found in the [Terrier data repository](http://data.terrier.org/).

We'll use the standard variant, `terrier_stemmed`, to create a BM25 model:


In [None]:
index = dataset.get_index(variant="terrier_stemmed")
bm25 = pt.BatchRetrieve(index, wmodel="BM25")
bm25.search("computer")

We can also create a retriever directly from the dataset like so:


In [None]:
bm25 = pt.BatchRetrieve.from_dataset(dataset, variant="terrier_stemmed", wmodel="BM25")

We can also browse the corpus:


In [None]:
for doc in dataset.get_corpus_iter():
    print(doc)
    break

Similarly, the topics (queries) can be accessed as a `pandas.DataFrame`, so we can pass them directly to the retriever:


In [None]:
bm25(dataset.get_topics())

Note that some datasets require a variant when calling `get_topics`, such as `variant="train"`.
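
Because the topics are a plain `pandas.DataFrame`, we can also write our own queries in the same shape. This is a minimal sketch, assuming the `qid` and `query` column names that PyTerrier uses for topics; the query IDs and query strings are made up for illustration.

```python
import pandas as pd

# Hand-written topics: one row per query, with string query IDs.
# The IDs and query texts are illustrative, not taken from a dataset.
topics = pd.DataFrame(
    [
        {"qid": "q1", "query": "computer memory"},
        {"qid": "q2", "query": "information retrieval"},
    ]
)
print(topics)
```

A frame like this can be passed to the retriever in the same way as the dataset's topics, for example `bm25(topics)`.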

Since the corpus iterator already yields the documents in the correct format (see part 2: indexing), we can use it directly to create our own index if we wish:


In [None]:
from pathlib import Path

index = pt.IterDictIndexer(
    str(Path.cwd()),  # index path; ignored because the index is kept in memory
    type=pt.index.IndexingType.MEMORY,
).index(dataset.get_corpus_iter())
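
The indexer is not tied to the dataset's own iterator. As a rough sketch, any iterable of dictionaries in the same shape should work. The `docno` and `text` keys are assumed to match what the corpus iterator prints for `vaswani`, and the function `my_corpus_iter` and its two documents are made up for illustration.

```python
def my_corpus_iter():
    """Yield documents as dicts with a 'docno' and a 'text' field."""
    # Hypothetical documents standing in for a real collection
    docs = [
        ("d1", "computers store data in memory"),
        ("d2", "retrieval systems rank documents for a query"),
    ]
    for docno, text in docs:
        yield {"docno": docno, "text": text}


# Peek at the generated documents
for doc in my_corpus_iter():
    print(doc)
```

Passing `my_corpus_iter()` to `.index(...)` in place of `dataset.get_corpus_iter()` would then index these documents instead.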

## Further reading

Check out the [datasets section](https://pyterrier.readthedocs.io/en/latest/datasets.html) in the documentation.
