# Instructors
- Sarawoot Kongyoung
- Piyawat Chuangkrud

References:

[PyTerrier’s documentation](https://pyterrier.readthedocs.io/en/latest/)

[Notebooks](https://github.com/terrier-org/pyterrier/blob/master/examples/notebooks.md)

# Prerequisites
You will need PyTerrier installed. PyTerrier also needs Java to be installed, and will find most installations.
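
Before installing anything, it can help to confirm that a Java runtime is actually visible from Python. The snippet below is a minimal sketch using only the standard library; `find_java` is a hypothetical helper written for this lab, not part of PyTerrier, and it assumes a Linux/macOS layout where the executable lives at `$JAVA_HOME/bin/java`.

```python
import os
import shutil


def find_java():
    """Return the path of a `java` executable, or None if none is found."""
    # Prefer an explicit JAVA_HOME if it points at a real installation.
    java_home = os.environ.get("JAVA_HOME")
    if java_home:
        candidate = os.path.join(java_home, "bin", "java")
        if os.path.isfile(candidate):
            return candidate
    # Otherwise fall back to whatever `java` is on the PATH.
    return shutil.which("java")


java_path = find_java()
print(java_path or "No Java found; install a JDK or set JAVA_HOME")
```

If this prints the fallback message, install a JDK or set `JAVA_HOME` before starting PyTerrier.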


In [1]:
!pip install python-terrier
!pip install datasets
!pip install lextoplus
!pip install pythainlp



# Import Libraries

In [2]:
import os
# Point PyTerrier at the JDK; adjust this path to match your own installation
os.environ['JAVA_HOME'] = '/usr/lib/jvm/java-19-openjdk-amd64'

In [3]:
import pyterrier as pt
from pyterrier.measures import *
from datasets import load_dataset
from lextoplus import LexToPlus
import pandas as pd
import re

You must call `pt.init()` before using any other PyTerrier functions or classes.

In [5]:
if not pt.started():
    pt.init()

# Load Dataset
We're going to use an IR test collection called [Mr. TyDi](https://huggingface.co/datasets/castorini/mr-tydi). It is a multilingual benchmark dataset built on TyDi, covering eleven typologically diverse languages. It is designed for monolingual retrieval, specifically to evaluate ranking with learned dense representations. In this lab we use its Thai subset.

## Load Corpus for Indexing

In [6]:
corpus = load_dataset('castorini/mr-tydi-corpus', 'thai')

You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this dataset from the next major release of `datasets`.


In [7]:
corpus

DatasetDict({
    train: Dataset({
        features: ['docid', 'title', 'text'],
        num_rows: 568855
    })
})

## Load Dataset to Create Topics & Qrels

In [8]:
dataset = load_dataset('castorini/mr-tydi', 'thai')

In [9]:
dataset

DatasetDict({
    train: Dataset({
        features: ['query_id', 'query', 'positive_passages', 'negative_passages'],
        num_rows: 3319
    })
    dev: Dataset({
        features: ['query_id', 'query', 'positive_passages', 'negative_passages'],
        num_rows: 807
    })
    test: Dataset({
        features: ['query_id', 'query', 'positive_passages', 'negative_passages'],
        num_rows: 1190
    })
})

# Indexing a Corpus

In [None]:
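# word_tokenize comes from PyThaiNLP (requires `pip install pythainlp`)
from pythainlp.tokenize import word_tokenize

def iterdf():
    # Yield one dict per passage, with title and body merged into one text field
    for data in corpus['train']:
        docno = data['docid']
        title = data['title']
        text = data['text']
        text = title + ' ' + text
        # Thai has no spaces between words, so segment the text before indexing
        yield {'docno': docno, 'text': ' '.join(word_tokenize(text, engine="newmm"))}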

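Use this indexer if you wish to index an iterator of dicts (possibly with multiple fields). This version is optimized by using multiple threads and POSIX FIFOs to transfer data, which ends up being much faster. In the cell that follows, stemming and stopword removal are disabled because Terrier's defaults target English, and `UTFTokeniser` is selected so that Thai characters are kept as tokens.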
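
To make the "iterator of dicts" input concrete, here is a small sketch of a generator that keeps the title and body as separate fields instead of merging them. The records are made-up toy data, and `iter_docs` is a hypothetical helper, not a PyTerrier function. Only the shape of the yielded dicts matters: one dict per document, with a string `docno` plus one key per text field.

```python
def iter_docs(records):
    """Yield one dict per document, keeping title and body as separate fields."""
    for rec in records:
        yield {
            "docno": str(rec["docid"]),  # document ids are passed as strings
            "title": rec["title"],
            "text": rec["text"],
        }


# Made-up records that mimic the corpus columns ('docid', 'title', 'text').
toy_records = [
    {"docid": "1#0", "title": "Scotland", "text": "Scotland is a country in the United Kingdom."},
    {"docid": "2#0", "title": "Bangkok", "text": "Bangkok is the capital of Thailand."},
]

toy_docs = list(iter_docs(toy_records))
print(toy_docs[0])
```

Because the generator is lazy, documents are produced one at a time, so the whole corpus never has to sit in memory while it is being indexed.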

In [None]:
index_path = './tydi-index-corpus'
indexer = pt.IterDictIndexer(index_path,
                        overwrite=True, stemmer=None, stopwords=None, tokeniser="UTFTokeniser")

In [None]:
%%time
itix = indexer.index(iterdf())

# Retrieval
BatchRetrieve is one of the most commonly used PyTerrier objects. It represents a retrieval transformation, in which queries are mapped to retrieved documents. BatchRetrieve uses a pre-existing Terrier index data structure, typically saved on disk.
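
To illustrate what "mapping queries to retrieved documents" means, the following is a toy sketch in plain Python. It is not Terrier's implementation: `toy_retrieve` is a hypothetical function that scores an in-memory, pre-tokenised collection with a simplified TF-IDF (term frequency times `log(N / df)`). The point is the shape of the result: one row per retrieved document with `qid`, `docno`, `rank` and `score`.

```python
import math
from collections import Counter

# Toy, pre-tokenised collection (made-up data for illustration only).
toy_collection = {
    "d1": "scotland is a country in the united kingdom".split(),
    "d2": "bangkok is the capital of thailand".split(),
    "d3": "edinburgh is the capital of scotland".split(),
}


def toy_retrieve(topics, collection, k=10):
    """Map each query to a ranked list of documents using a simplified TF-IDF."""
    n_docs = len(collection)
    # Document frequency: number of documents in which each term occurs.
    df = Counter(term for tokens in collection.values() for term in set(tokens))
    results = []
    for topic in topics:
        scores = {}
        for docno, tokens in collection.items():
            tf = Counter(tokens)
            score = sum(
                tf[t] * math.log(n_docs / df[t])
                for t in topic["query"].split()
                if t in tf
            )
            if score > 0:
                scores[docno] = score
        # Sort by descending score; break ties by docno for a stable order.
        ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:k]
        for rank, (docno, score) in enumerate(ranked):
            results.append({"qid": topic["qid"], "docno": docno, "rank": rank, "score": score})
    return results


toy_topics = [{"qid": "q1", "query": "capital scotland"}]
for row in toy_retrieve(toy_topics, toy_collection):
    print(row)
```

Here `d3` ranks first because it matches both query terms, while `d1` and `d2` each match one. A real retriever does the same job against an inverted index on disk rather than looping over every document.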

In [None]:
def es_preprocess(text):
    text = ' '.join(word_tokenize(text))
    return text

In [None]:
%%time
tfidf_nostem = pt.apply.query(
    lambda row: es_preprocess(row.query)
    ) >> pt.BatchRetrieve(itix, wmodel='TF_IDF')

In [None]:
def cleanstr(text):
    # Drop question marks, then anything that is not Thai, a Latin letter, a digit, or a space
    text = text.replace('?', '')
    text = re.sub(r'[^\u0E00-\u0E7Fa-zA-Z0-9 ]', '', text)
    return text

## Create Topics

In [None]:
topics = []
for data in dataset['test']:
    query = data['query']
    qid = data['query_id']
    tmp = {'qid':qid, 'query':cleanstr(query)}
    topics.append(tmp)

In [None]:
topics = pd.DataFrame(topics)

In [None]:
topics

## Create Qrels

In [None]:
qrels = []
for row in dataset['test']:
    for p in row['positive_passages']:
        qrels.append({
            'qid' : row['query_id'],
            'docno' : p['docid'],
            'label' : 1,
            'iter' : 1
        })

In [None]:
qrels = pd.DataFrame(qrels)

In [None]:
qrels

# Evaluation

In [None]:
%%time
pt.Experiment(
    [tfidf_nostem],
    topics,
    qrels,
    ['map_cut_10', 'recip_rank', nDCG@5],
    names=['TFIDF'],
    round=4
)
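
To show what two of these metrics measure, here is a minimal worked sketch for a single query, assuming binary relevance labels as in the qrels above. The ranking and the relevant set are made-up, and the helper functions are hypothetical illustrations rather than the code the evaluation library runs. The experiment reports the mean of such per-query values over all topics.

```python
import math


def reciprocal_rank(ranked_docnos, relevant):
    """1 / position of the first relevant document (positions start at 1), else 0."""
    for position, docno in enumerate(ranked_docnos, start=1):
        if docno in relevant:
            return 1.0 / position
    return 0.0


def ndcg_at_k(ranked_docnos, relevant, k=5):
    """nDCG@k for binary labels: gain 1, discount log2(position + 1)."""
    dcg = sum(
        1.0 / math.log2(position + 1)
        for position, docno in enumerate(ranked_docnos[:k], start=1)
        if docno in relevant
    )
    # Ideal ranking: every relevant document placed at the top.
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(position + 1) for position in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


toy_ranking = ["d7", "d2", "d9", "d4", "d1"]  # system output, best first
toy_relevant = {"d2"}                         # documents judged relevant

print(reciprocal_rank(toy_ranking, toy_relevant))      # 0.5
print(round(ndcg_at_k(toy_ranking, toy_relevant), 4))  # 0.6309
```

The only relevant document sits at position 2, so the reciprocal rank is 1/2, and nDCG@5 is `1 / log2(3)` divided by the ideal value of 1.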

# Search

In [None]:
df = corpus['train'].to_pandas()

In [None]:
dataset['test'][1]

In [None]:
query =' '.join(word_tokenize(cleanstr('สกอตแลนด​์อยู่ที่ไหน')))

In [None]:
query

In [None]:
out = tfidf_nostem.search(query)[:10]
out

In [None]:
for row in out['docno']:
    print(row)
    print(df[df['docid'] == row].text.tolist()[0])
    print('-'*10)

# Assignment

Implement indexing and retrieval systems using additional weighting models and evaluation metrics beyond those used in this lab, then compare the results.