Step 1: install dependencies




In [1]:
!pip3 install tira snorkel wows-eval textdistance rank-bm25

Collecting tira
  Using cached tira-0.0.157-py3-none-any.whl.metadata (4.8 kB)
Collecting wows-eval
  Using cached wows_eval-0.0.6-py3-none-any.whl.metadata (1.3 kB)
Using cached tira-0.0.157-py3-none-any.whl (1.3 MB)
Using cached wows_eval-0.0.6-py3-none-any.whl (3.9 kB)
Installing collected packages: tira, wows-eval
Successfully installed tira-0.0.157 wows-eval-0.0.6


In [2]:
!pip3 uninstall -y tira wows-eval

Found existing installation: tira 0.0.157
Uninstalling tira-0.0.157:
  Successfully uninstalled tira-0.0.157
Found existing installation: wows-eval 0.0.6
Uninstalling wows-eval-0.0.6:
  Successfully uninstalled wows-eval-0.0.6


In [3]:
# Quote the requirement so the shell does not treat ">=0.0.6" as an output redirection
!pip3 install "wows-eval>=0.0.6"

In [4]:
!pip3 install python-terrier==0.10.0



In [12]:
!pip3 uninstall numpy -y
!pip3 install numpy --upgrade --force-reinstall

Found existing installation: numpy 1.26.4
Uninstalling numpy-1.26.4:
  Successfully uninstalled numpy-1.26.4
Collecting numpy
  Using cached numpy-2.2.4-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (62 kB)
Using cached numpy-2.2.4-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (16.4 MB)
Installing collected packages: numpy
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
tira 0.0.157 requires numpy==1.*, but you have numpy 2.2.4 which is incompatible.
tensorflow 2.18.0 requires numpy<2.1.0,>=1.26.0, but you have numpy 2.2.4 which is incompatible.
numba 0.60.0 requires numpy<2.1,>=1.22, but you have numpy 2.2.4 which is incompatible.
Successfully installed numpy-2.2.4


In [None]:
!pip3 install numpy==1.26.4 --force-reinstall
# Kill the current process so the runtime restarts and picks up the reinstalled numpy
import os
os.kill(os.getpid(), 9)

Collecting numpy==1.26.4
  Using cached numpy-1.26.4-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (61 kB)
Using cached numpy-1.26.4-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (18.3 MB)
Installing collected packages: numpy
  Attempting uninstall: numpy
    Found existing installation: numpy 2.2.4
    Uninstalling numpy-2.2.4:
      Successfully uninstalled numpy-2.2.4
Successfully installed numpy-1.26.4


In [1]:
import pyterrier as pt
from tqdm import tqdm
if not pt.started():
    pt.init()

from tira.rest_api_client import Client
from wows_eval import evaluate as wows_evaluate
import pandas as pd
from jnius import autoclass
import numpy as np

# For measuring consumed resources (e.g., GPU, CPU, RAM, etc.)
from tirex_tracker import tracking, ExportFormat

pd.set_option('display.max_colwidth', None)

# Dataset IDs visible at https://archive.tira.io/datasets?query=wows-eval
DATASET_ID = 'wows-eval/pairwise-20250309-test'

# Alternative dataset for smoke tests: 'wows-eval/pairwise-smoke-test-20250210-training'
tira = Client()
input_data = tira.pd.inputs(DATASET_ID)

PyTerrier 0.10.0 has loaded Terrier 5.11 (built by craig.macdonald on 2025-01-13 21:29) and terrier-helper 0.0.8



In [2]:
!pip3 install rapidfuzz



Step 2: load all the libraries

In [3]:
import re

import numpy as np
import pandas as pd
import textdistance
from rank_bm25 import BM25Okapi
from rapidfuzz import fuzz
from sentence_transformers import SentenceTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from snorkel.labeling import labeling_function, PandasLFApplier
from snorkel.labeling.model.label_model import LabelModel
from wows_eval import evaluate as wows_evaluate

# Shared TF-IDF vectorizer, refit inside the TF-IDF labeling function
vectorizer = TfidfVectorizer()

In [4]:
import nltk
from nltk.corpus import stopwords, wordnet
from nltk.stem import WordNetLemmatizer
# Download stopwords if not already present
nltk.download("stopwords")
stop_words = set(stopwords.words("english"))
nltk.download("wordnet")
lemmatizer = WordNetLemmatizer()

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


Step 3: define global variables

In [5]:
# Global Variables
min_bm25 = None
max_bm25 = None
bm25 = None


In [6]:
# Global Min/Max Values (These should be computed dynamically)
GLOBAL_MIN = -1
GLOBAL_MAX = 1
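
The comment in the cell above notes that the bounds should be computed dynamically. One way this might look is sketched below, assuming the raw score differences are available as a plain list of floats. `min_max_normalize` and `raw_diffs` are hypothetical names for illustration and are not part of the notebook.

```python
def min_max_normalize(values):
    """Scale values linearly to [0, 1] using the observed minimum and maximum."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical, so there is no spread; map everything to 0.5.
        return [0.5 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical raw differences (score of the relevant doc minus score of the unknown doc).
raw_diffs = [-0.4, 0.0, 0.1, 0.6]
normalized = min_max_normalize(raw_diffs)
print(normalized)
```

Computing the bounds from the data avoids hard-coding a range such as -1 to 1, which only fits scores that are already bounded (for example cosine differences) and not unbounded ones such as BM25 differences.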


Step 4: define helper functions

In [7]:

def tokenize(text):
    """Tokenizes text into lowercase words without punctuation."""
    return re.findall(r'\w+', text.lower())

def tokenize_and_lemmatize(text):
    """Tokenizes text into lowercase words without punctuation and lemmatizes words."""
    tokens = re.findall(r'\w+', text.lower())  # Tokenization
    return {lemmatizer.lemmatize(word) for word in tokens} - stop_words

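def get_ngrams(text, n=2):
    """Generate the set of word n-grams from text, preserving token order."""
    # Lemmatize in reading order; tokenize_and_lemmatize returns a set, which would scramble word order
    tokens = [lemmatizer.lemmatize(word) for word in tokenize(text)]
    tokens = [token for token in tokens if token not in stop_words]
    return set(zip(*[tokens[i:] for i in range(n)])) if len(tokens) >= n else set()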
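
To illustrate what the helpers compute, here is a small self-contained sketch. It is a simplification under stated assumptions: it uses a tiny hand-written stop-word list (`STOP_WORDS`, hypothetical) instead of the NLTK list, skips lemmatization, and `ordered_ngrams` is an illustrative name rather than a function from the notebook.

```python
import re

# Tiny illustrative stop-word list; the notebook uses NLTK's English list instead.
STOP_WORDS = {"the", "a", "of", "is"}


def tokenize(text):
    """Lowercase the text and split it into word tokens, dropping punctuation."""
    return re.findall(r"\w+", text.lower())


def ordered_ngrams(text, n=2):
    """Return the set of n-grams over the stop-word-filtered tokens, in reading order."""
    tokens = [t for t in tokenize(text) if t not in STOP_WORDS]
    if len(tokens) < n:
        return set()
    # zip over n shifted copies of the token list yields consecutive n-tuples.
    return set(zip(*[tokens[i:] for i in range(n)]))


tokens = tokenize("The capital of France is Paris.")
bigrams = ordered_ngrams("The capital of France is Paris.")
print(tokens)
print(bigrams)
```

The n-grams are built from a list, not a set, because consecutive-token pairs are only meaningful when the original word order is kept.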

Step 5: create Snorkel labeling functions

In [8]:
from snorkel.labeling import labeling_function
from rank_bm25 import BM25Okapi

@labeling_function()
def bm25_prob_diff(input_data):
    query = input_data["query"]
    relevant_doc = input_data["relevant"]
    unknown_doc = input_data["unknown"]

    # Tokenize query and documents
    tokenized_query = tokenize(query)
    tokenized_relevant = tokenize(relevant_doc)
    tokenized_unknown = tokenize(unknown_doc)

    # Compute BM25 scores separately for "relevant" and "unknown"
    bm25_relevant_model = BM25Okapi([tokenized_relevant])  # BM25 model with relevant doc
    bm25_unknown_model = BM25Okapi([tokenized_unknown])  # BM25 model with unknown doc

    bm25_relevant = bm25_relevant_model.get_scores(tokenized_query)[0]  # Score for relevant
    bm25_unknown = bm25_unknown_model.get_scores(tokenized_query)[0]  # Score for unknown

    # Compute difference without normalization
    diff = bm25_relevant - bm25_unknown
    return diff  # Return raw difference



@labeling_function()
def jaccard_similarity_prob_diff(input_data):
    query = input_data["query"]
    relevant_doc = input_data["relevant"]
    unknown_doc = input_data["unknown"]

    # Concatenate Query + Document
    query_rel = f"{query} {relevant_doc}"
    query_unk = f"{query} {unknown_doc}"

    # Compute Jaccard Similarity
    def jaccard_sim(text1, text2):
        set1, set2 = set(text1.split()), set(text2.split())
        intersection = len(set1 & set2)
        union = len(set1 | set2)
        return intersection / union if union != 0 else 0

    jaccard_relevant = jaccard_sim(query, query_rel)
    jaccard_unknown = jaccard_sim(query, query_unk)

    diff = jaccard_relevant - jaccard_unknown
    return diff  # Return raw difference


@labeling_function()
def word_level_levenshtein_prob_diff(input_data):
    query = input_data["query"]
    relevant_doc = input_data["relevant"]
    unknown_doc = input_data["unknown"]

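    # Compute string similarity with rapidfuzz; fuzz.ratio compares characters (not words)
    # and returns a value from 0 to 100, rescaled here to 0-1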
    lev_relevant = fuzz.ratio(query, relevant_doc) / 100
    lev_unknown = fuzz.ratio(query, unknown_doc) / 100

    # Compute difference
    diff = lev_relevant - lev_unknown
    return diff  # Return raw difference


@labeling_function()
def tfidf_cosine_similarity_prob_diff(input_data):
    query = input_data["query"]
    relevant_doc = input_data["relevant"]
    unknown_doc = input_data["unknown"]

    # Concatenate Query + Document
    query_rel = f"{query} {relevant_doc}"
    query_unk = f"{query} {unknown_doc}"

    # Compute TF-IDF Cosine Similarity
    docs = [query, query_rel, query_unk]
    tfidf_matrix = vectorizer.fit_transform(docs)

    tfidf_relevant = cosine_similarity(tfidf_matrix[0], tfidf_matrix[1])[0][0]
    tfidf_unknown = cosine_similarity(tfidf_matrix[0], tfidf_matrix[2])[0][0]

    diff = tfidf_relevant - tfidf_unknown
    return diff  # Return raw difference

bert_model = SentenceTransformer('all-MiniLM-L6-v2')


@labeling_function()
def bert_cosine_similarity_prob_diff(input_data):
    query = input_data["query"]
    relevant_doc = input_data["relevant"]
    unknown_doc = input_data["unknown"]

    # Concatenate Query + Document
    query_rel = f"{query} [SEP] {relevant_doc}"
    query_unk = f"{query} [SEP] {unknown_doc}"

    # Compute BERT Embeddings
    embeddings = bert_model.encode([query, query_rel, query_unk])

    bert_relevant = np.dot(embeddings[0], embeddings[1]) / (np.linalg.norm(embeddings[0]) * np.linalg.norm(embeddings[1]))
    bert_unknown = np.dot(embeddings[0], embeddings[2]) / (np.linalg.norm(embeddings[0]) * np.linalg.norm(embeddings[2]))

    diff = bert_relevant - bert_unknown
    return diff  # Return raw difference



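
Snorkel labeling functions conventionally return integer class labels, with -1 meaning "abstain", whereas the functions above return raw float differences. A minimal sketch of how a difference could be mapped to such a label follows. The threshold value, the label names, and `diff_to_label` are assumptions made for illustration; they are not taken from the notebook.

```python
ABSTAIN = -1           # Snorkel's convention for "this labeling function does not vote"
CLOSER_TO_UNKNOWN = 0  # hypothetical label: the unknown doc scores higher for the query
CLOSER_TO_RELEVANT = 1 # hypothetical label: the relevant doc scores higher for the query


def diff_to_label(diff, threshold=0.05):
    """Map a raw score difference (relevant minus unknown) to an integer label."""
    if diff > threshold:
        return CLOSER_TO_RELEVANT
    if diff < -threshold:
        return CLOSER_TO_UNKNOWN
    # Differences near zero carry little signal, so the function abstains.
    return ABSTAIN


labels = [diff_to_label(d) for d in (0.3, -0.3, 0.01)]
print(labels)
```

A labeling function could wrap its existing computation and end with `return diff_to_label(diff)`, so that the label matrix holds discrete votes in the range expected by the label model's `cardinality`.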


Step 6: train Snorkel

In [9]:
lfs = [
    bm25_prob_diff,
    word_level_levenshtein_prob_diff,
    tfidf_cosine_similarity_prob_diff,
    bert_cosine_similarity_prob_diff,
    jaccard_similarity_prob_diff,
]

applier = PandasLFApplier(lfs=lfs)
L_train = applier.apply(input_data)

label_model = LabelModel(cardinality=4, verbose=True)
label_model.fit(L_train, n_epochs=100, log_freq=10)

train_prob = label_model.predict_proba(L_train)
train_prob_relevant = train_prob[:, 1]

input_data["probability_relevant"] = train_prob_relevant


100%|██████████| 15080/15080 [1:43:34<00:00,  2.43it/s]
100%|██████████| 100/100 [00:00<00:00, 423.04epoch/s]


In [10]:
WMODEL = "Snorkel"

!rm -Rf run


with tracking(export_file_path='run/.metadata.yml', export_format=ExportFormat.IR_METADATA) as tracked:
    queries = set(input_data['query'].unique())

    def unknown_documents(query):
        docs = set(input_data[input_data['query'] == query]['unknown'].unique())
        return {f'{i[0]}-unkn': i[1] for i in zip(range(len(docs)), docs)}

    results = {}

    for query in tqdm(queries):
        unk = unknown_documents(query)

        # Build index
        docs = [{'docno': k, 'text': v} for k, v in unk.items()]
        indexer = pt.IterDictIndexer("/tmp/index", overwrite=True, meta={'docno': 100, 'text': 20480})
        index_ref = indexer.index(docs)
        retriever = pt.BatchRetrieve(index_ref, wmodel=WMODEL)
        rels = input_data[input_data['query'] == query][['unknown', 'probability_relevant']]
        rel_dict = dict(zip(rels['unknown'], rels['probability_relevant']))

        results[query] = rel_dict

    predictions = []
    for _, i in input_data.iterrows():
        res = results[i['query']]
        predictions.append({
            'id': i['id'],
            'probability_relevant': res.get(i['unknown'], -1)
        })
    predictions = pd.DataFrame(predictions)

predictions

100%|██████████| 13/13 [00:14<00:00,  1.11s/it]


Unnamed: 0,id,probability_relevant
0,ce293ca6-f54c-4499-a7c8-8546029af919,0.183267
1,e035e319-db74-4d31-aa01-13bf60c595ae,0.183267
2,2e58ffb0-3815-4033-a363-f0d9f3dab89f,0.183267
3,43463fe4-63de-4ff3-ab26-43d002c18c72,0.183267
4,141436c9-2d5a-4fdc-abd5-bb8de9952592,0.183267
...,...,...
15075,ee0341bd-8296-4a2b-abf5-1ac7e55be097,0.242772
15076,b0b2d6f8-c0c9-4b52-be87-ece222b219f8,0.242772
15077,0f8bf575-6760-46b3-94ff-c228de6793b5,0.242772
15078,13d3f42f-5057-4563-9c9a-23107f8ed4ab,0.242772
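
In the cell above the retriever is constructed but never queried; the predictions come from the per-query dictionary that maps each unknown document to its probability. That lookup can be sketched on its own with plain dictionaries. The rows below are made-up sample data, and `predict` is a hypothetical helper name.

```python
# Made-up sample rows mirroring the columns used in the notebook.
rows = [
    {"id": "a", "query": "q1", "unknown": "doc one", "probability_relevant": 0.2},
    {"id": "b", "query": "q1", "unknown": "doc two", "probability_relevant": 0.7},
    {"id": "c", "query": "q2", "unknown": "doc one", "probability_relevant": 0.4},
]

# Build {query: {unknown_doc: probability}}.
lookup = {}
for row in rows:
    lookup.setdefault(row["query"], {})[row["unknown"]] = row["probability_relevant"]


def predict(query, unknown, default=-1):
    """Return the stored probability, or the default when the pair is missing."""
    return lookup.get(query, {}).get(unknown, default)


predictions = [
    {"id": row["id"], "probability_relevant": predict(row["query"], row["unknown"])}
    for row in rows
]
print(predictions)
```

Note that when the same unknown document appears more than once for a query, only the last probability is kept, because dictionary keys are unique.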


In [11]:
wows_evaluate(
    predictions,
    DATASET_ID,
    tracking_results=tracked,
    upload=True,
    system_name="Snorkel",
    system_description='We combine five labeling functions (BM25, Levenshtein, TF-IDF cosine, sentence-embedding cosine, and Jaccard similarity differences between the relevant and the unknown document) with a Snorkel LabelModel. The probability that an unknown document is relevant is the label model probability for class 1.'
)

Run uploaded to TIRA. Claim ownership via: https://www.tira.io/claim-submission/c1baecf2-84dd-46ae-bda8-995cf4dc3724
No truth data is available yet. The evaluation is possible after the deadline when the truth data was published.
