# Information Retrieval and Recommender Systems a.y. 2024 - 2025


Authors: Emirhan Kayar ID:513222,  Nicolò Cappa ID:513241, Alessandro Longato ID:512876

# Disclaimer

Due to time and hardware limitations, the tests in this notebook were conducted using manually selected configurations. While these provided valuable insights, future work could focus on conducting a more detailed parameter search to identify optimal weightings for improved performance and ranking quality.


# GOAL
The goal is to design a fully transparent and customizable search engine. Given a user profile, a natural language query, and the chosen retrieval method, the system processes the input and returns the top 10 most relevant documents from the corpus based on the selected technique. Users have complete control over the retrieval methods employed, encouraging trust and empowering them to tailor their experience.

In [None]:
!gdown 1HhgXzyEpsZNcenU9XhJuOYyDUKEzUse4

Downloading...
From (original): https://drive.google.com/uc?id=1HhgXzyEpsZNcenU9XhJuOYyDUKEzUse4
From (redirected): https://drive.google.com/uc?id=1HhgXzyEpsZNcenU9XhJuOYyDUKEzUse4&confirm=t&uuid=62b7a997-e43c-412c-b9d3-5b64c4707a03
To: /content/pir_data.zip
100% 3.30G/3.30G [01:00<00:00, 54.3MB/s]


In [None]:
!unzip pir_data.zip

Archive:  pir_data.zip
   creating: PIR_data/
  inflating: PIR_data/tags.csv       
  inflating: PIR_data/questions_with_answer.csv  
  inflating: PIR_data/questions.csv  
  inflating: PIR_data/comments.csv   
  inflating: PIR_data/users.csv      
  inflating: PIR_data/answers.csv    
   creating: PIR_data/answer_retrieval/
   creating: PIR_data/answer_retrieval/val/
  inflating: PIR_data/answer_retrieval/val/subset_data.jsonl  
  inflating: PIR_data/answer_retrieval/val/qrels.json  
   creating: PIR_data/answer_retrieval/train/
  inflating: PIR_data/answer_retrieval/train/subset_data.jsonl  
  inflating: PIR_data/answer_retrieval/train/qrels.json  
   creating: PIR_data/answer_retrieval/test/
  inflating: PIR_data/answer_retrieval/test/subset_data.jsonl  
  inflating: PIR_data/answer_retrieval/test/qrels.json  
  inflating: PIR_data/answer_retrieval/answers.csv  
  inflating: PIR_data/answer_retrieval/subset_answers.json  
  inflating: PIR_data/postlinks.csv  


In [None]:
!pip install --upgrade -q python-terrier==0.12.1


In [None]:
#Initiate PyTerrier
import pyterrier as pt
if not pt.started():
    pt.init(tqdm='notebook')



terrier-assemblies 5.11 jar-with-dependencies not found, downloading to /root/.pyterrier...
Done
terrier-python-helper 0.0.8 jar not found, downloading to /root/.pyterrier...
Done


Java started and loaded: pyterrier.java, pyterrier.terrier.java [version=5.11 (build: craig.macdonald 2025-01-13 21:29), helper_version=0.0.8]
The following code will have the same effect:
pt.utils.set_tqdm('notebook')
pt.java.init() # optional, forces java initialisation


In [None]:
import pandas as pd

corpus_df = pd.read_json('PIR_data/answer_retrieval/subset_answers.json', orient='index')
corpus_df = corpus_df.reset_index()
corpus_df.columns = ['docno', 'text']

corpus_df.head(20)

Unnamed: 0,docno,text
0,writers_2010,TL;DRIf you're going to do present tense do it...
1,writers_2018,"Your writing style is stream-of-consciousness,..."
2,writers_2023,Place emphasis on uncomfortable things. Depend...
3,writers_2026,The answer to this depends a lot on what you'r...
4,writers_2095,Short Answer: Read a book on writing stand up ...
5,writers_2518,I'm not entirely sure of the context: do you w...
6,writers_2542,"First of all, what you're describing are not r..."
7,writers_2548,Use as many characters you need. Don't add ext...
8,writers_2582,Your two examples are from very different peop...
9,writers_2600,You almost certainly want to avoid a shopping ...


# Text Preprocessing

### We create two different corpora:
- one with simple text preprocessing
- one with stemming and stop word removal

This is done because models like BERT usually handle semantic representation well even without this type of preprocessing. We will keep the better-performing variant.

In [None]:
import re

def clean_text(text):
    """Lowercases text, removes emojis, backslashes, special characters, punctuation, and extra whitespace.

    Args:
        text: The input text string.

    Returns:
        The cleaned text string.
    """
    # Lowercase the text
    text = text.lower()

    # Remove emojis
    text = re.sub(r'[\U00010000-\U0010FFFF]', '', text, flags=re.UNICODE)

    # Remove backslashes
    text = re.sub(r"\\", "", text)

    # Remove special characters and punctuation
    text = re.sub(r"[^a-zA-Z0-9\s]", "", text)

    # Remove extra whitespace
    text = re.sub(r"\s+", " ", text).strip()

    return text
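
As a quick sanity check, the sketch below applies the same four cleaning steps to a made-up string. The function name `clean_text_demo` and the sample text are illustrative and not part of the notebook. Note that stripping punctuation fuses contractions, so "you're" becomes "youre".

```python
import re

def clean_text_demo(text: str) -> str:
    """Illustrative copy of the cleaning steps: lowercase, strip emojis,
    backslashes and punctuation, then collapse whitespace."""
    text = text.lower()
    text = re.sub(r"[\U00010000-\U0010FFFF]", "", text)  # supplementary-plane characters (most emojis)
    text = re.sub(r"\\", "", text)                       # backslashes
    text = re.sub(r"[^a-z0-9\s]", "", text)              # anything that is not a letter, digit or whitespace
    return re.sub(r"\s+", " ", text).strip()             # collapse runs of whitespace

sample = "TL;DR If you're going\\ to WRITE...  do it!"
print(clean_text_demo(sample))  # tldr if youre going to write do it
```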

In [None]:
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize


nltk.download('punkt')
nltk.download('stopwords')
nltk.download('punkt_tab')

# Stop words
stop_words = set(stopwords.words('english'))

# Initialize the stemmer
stemmer = PorterStemmer()

# Function to remove stop words and apply stemming
def preprocess_text(text):
    # Tokenize
    tokens = word_tokenize(text)

    # Remove stop words and apply stemming
    stemmed_tokens = [stemmer.stem(word) for word in tokens if word.lower() not in stop_words]
    return " ".join(stemmed_tokens)

corpus_df_stemmed_no_stopwords = corpus_df.copy()
corpus_df_stemmed_no_stopwords['text'] = corpus_df_stemmed_no_stopwords['text'].apply(clean_text)
corpus_df_stemmed_no_stopwords['text'] = corpus_df_stemmed_no_stopwords['text'].apply(preprocess_text)


print(corpus_df_stemmed_no_stopwords)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


                docno                                               text
0        writers_2010  tldrif your go present tens good reason mitig ...
1        writers_2018  write style streamofconsci hard digest interne...
2        writers_2023  place emphasi uncomfort thing depend level rea...
3        writers_2026  answer depend lot your set achievein experi cl...
4        writers_2095  short answer read book write stand comedi they...
...               ...                                                ...
9393  academia_138970  gener time set asid academ nearli museum acade...
9394  academia_139396                              took break comic back
9395  academia_143753  provid enough context frame issu way make ques...
9396  academia_148936  ive edit journal without hold phd past dont as...
9397  academia_185179  georgetown univers name villag georgetown clos...

[9398 rows x 2 columns]


In [None]:
corpus_df['text'] = corpus_df['text'].apply(clean_text)

# Initialize two indexes, one per corpus

In [None]:
import os
import pyterrier as pt


# Paths for two indexes
preprocessed_index_path = './se-pqa-processed'
original_index_path = './se-pqa-original'

# Create index for stemming and stop word removal
if not os.path.exists(preprocessed_index_path + '/data.properties'):
    indexer = pt.index.IterDictIndexer(preprocessed_index_path, overwrite=True)
    index_ref_preprocessed = indexer.index(
        corpus_df_stemmed_no_stopwords.to_dict(orient='records'),  # stemmed corpus without stop words
        fields={'text': 2096},
        meta={'docno': 20, 'text': 2096}
    )
else:
    index_ref_preprocessed = pt.IndexRef.of(preprocessed_index_path + '/data.properties')

stemmed_and_no_stop_word_index = pt.IndexFactory.of(index_ref_preprocessed)

# Create index for simple preprocessing
if not os.path.exists(original_index_path + '/data.properties'):
    indexer = pt.index.IterDictIndexer(original_index_path, overwrite=True)
    index_ref_original = indexer.index(
        corpus_df.to_dict(orient='records'),  # corpus with simple cleaning only
        fields={'text': 2096},
        meta={'docno': 20, 'text': 2096}
    )
else:
    index_ref_original = pt.IndexRef.of(original_index_path + '/data.properties')

simple_preprocessing_index = pt.IndexFactory.of(index_ref_original)


18:51:50.003 [ForkJoinPool-1-worker-3] WARN org.terrier.structures.indexing.Indexer -- Indexed 2 empty documents




18:52:00.226 [ForkJoinPool-2-worker-3] WARN org.terrier.structures.indexing.Indexer -- Indexed 2 empty documents


In [None]:
print(simple_preprocessing_index.getCollectionStatistics().toString())

Number of documents: 9398
Number of terms: 99151
Number of postings: 717355
Number of fields: 1
Number of tokens: 1065591
Field names: [text]
Positions:   false



In [None]:
print(stemmed_and_no_stop_word_index.getCollectionStatistics().toString())

Number of documents: 9398
Number of terms: 98437
Number of postings: 717162
Number of fields: 1
Number of tokens: 1056264
Field names: [text]
Positions:   false



# Prepare validation and test datasets
As with the corpus, the queries are preprocessed in two ways:
- One set of validation and test queries with simple preprocessing
- One set with stemming and stop word removal

In [None]:
val_queries = pd.read_json('PIR_data/answer_retrieval/val/subset_data.jsonl', lines=True)
test_queries = pd.read_json('PIR_data/answer_retrieval/test/subset_data.jsonl', lines=True)

val_queries = val_queries[['id','title','user_id', 'user_questions', 'user_answers', 'tags', 'timestamp']]
val_queries.columns = ['qid', 'query', 'user_id','user_questions', 'user_answers', 'tags', 'timestamp']

val_queries['query'] = val_queries['query'].apply(clean_text)

val_queries_stemmed_no_stopwords = val_queries.copy()
val_queries_stemmed_no_stopwords['query'] = val_queries_stemmed_no_stopwords['query'].apply(preprocess_text)

test_queries = test_queries[['id','title','user_id','user_questions', 'user_answers', 'tags', 'timestamp']]
test_queries.columns = ['qid', 'query', 'user_id','user_questions', 'user_answers', 'tags', 'timestamp']
test_queries['query'] = test_queries['query'].apply(clean_text)

test_queries_stemmed_no_stopwords = test_queries.copy()
test_queries_stemmed_no_stopwords['query'] = test_queries_stemmed_no_stopwords['query'].apply(preprocess_text)

test_queries.head(2)

Unnamed: 0,qid,query,user_id,user_questions,user_answers,tags,timestamp
0,academia_185177,after what george was georgetown university named,1532620,"[writers_27613, writers_29562, writers_43973, ...","[vegetarianism_1871, skeptics_39944, skeptics_...","[academic-history, history]",2022-05-12 21:27:50
1,anime_67047,can someone explain why garou made saitama do ...,59256,"[sports_14060, sports_14107, sports_14133, spo...","[sports_14061, sports_14062, sports_14064, spo...",[one-punch-man],2022-07-21 12:04:38


In [None]:
val_qrels = pd.read_json('PIR_data/answer_retrieval/val/qrels.json', orient='index').reset_index()
val_qrels.columns = ['qid', 'docno']
val_qrels['label'] = 1

test_qrels = pd.read_json('PIR_data/answer_retrieval/test/qrels.json', orient='index').reset_index()
test_qrels.columns = ['qid', 'docno']
test_qrels['label'] = 1

# Run simple BM25 and TF-IDF retrieval as a baseline for both corpora

Simple preprocessing

In [None]:
BM25_br = pt.BatchRetrieve(simple_preprocessing_index, wmodel='BM25') % 100
tfidf_br = pt.BatchRetrieve(simple_preprocessing_index, wmodel='TF_IDF') % 100



In [None]:
from pyterrier.measures import R, P, AP, nDCG #recall, precision, map, nDCG
metrics = [R@100, R@5, P@1, AP@100, nDCG@10, nDCG@3]

pt.Experiment(
    [
        BM25_br,
        tfidf_br
    ],
    val_queries,
    val_qrels,
    names=[
        'BM25','TF-IDF'
      ],
    eval_metrics=metrics
)

Unnamed: 0,name,R@100,R@5,P@1,AP@100,nDCG@10,nDCG@3
0,BM25,0.897959,0.693878,0.530612,0.614691,0.65193,0.605197
1,TF-IDF,0.887755,0.693878,0.540816,0.62002,0.656265,0.608963


Stemming and stop word removal

In [None]:
BM25_br = pt.BatchRetrieve(stemmed_and_no_stop_word_index, wmodel='BM25') % 100
tfidf_br = pt.BatchRetrieve(stemmed_and_no_stop_word_index, wmodel='TF_IDF') % 100



In [None]:
from pyterrier.measures import R, P, AP, nDCG #recall, precision, map, nDCG
metrics = [R@100, R@5, P@1, AP@100, nDCG@10, nDCG@3]

pt.Experiment(
    [
        BM25_br,
        tfidf_br
    ],
    val_queries_stemmed_no_stopwords,
    val_qrels,
    names=[
        'BM25','TF-IDF'
      ],
    eval_metrics=metrics
)

Unnamed: 0,name,R@100,R@5,P@1,AP@100,nDCG@10,nDCG@3
0,BM25,0.908163,0.693878,0.55102,0.627118,0.658293,0.619167
1,TF-IDF,0.908163,0.693878,0.55102,0.626879,0.658171,0.619167


## Neural Ranking with Cross-Encoders and Bi-Encoders

Neural re-rankers like **bi-encoders** and **cross-encoders** enhance ranking quality by refining results from traditional methods (e.g., BM25).

### Bi-Encoders
Bi-encoders encode queries and documents separately, calculating similarity between embeddings.  
- **Efficient**: Precompute document embeddings for fast retrieval.  
- **Use Case**: Large-scale ranking.

### Cross-Encoders
Cross-encoders process queries and documents jointly, modeling interactions for higher accuracy.  
- **Precise**: Context-rich encoding improves ranking quality.  
- **Use Case**: Precision-focused tasks.

### Workflow
1. Use BM25 to retrieve initial candidates.
2. Apply bi-encoder for fast re-ranking or cross-encoder for fine-grained ranking.
3. Return the final ranked results.

In this notebook both re-rankers are applied to BM25's top-100 candidates, so they can change the order of the results but not which documents are retrieved.
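
The retrieve-then-rerank workflow can be sketched without any model at all. In the toy example below, a term-overlap count stands in for BM25 and cosine similarity over made-up 2-d vectors stands in for a neural scorer. The corpus, the vectors, and the helper names are all hypothetical and only illustrate the control flow.

```python
from math import sqrt

# Toy corpus: docno -> (text, made-up 2-d "embedding"); all values are illustrative.
DOCS = {
    "d1": ("present tense in fiction", (0.9, 0.1)),
    "d2": ("tense muscles after running", (0.1, 0.9)),
    "d3": ("writing dialogue in fiction", (0.7, 0.3)),
    "d4": ("choosing a journal for submission", (0.5, 0.5)),
}

def lexical_score(query: str, text: str) -> int:
    """First-stage stand-in for BM25: number of query terms present in the text."""
    terms = set(text.split())
    return sum(1 for t in query.split() if t in terms)

def cosine(a, b) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

def retrieve_then_rerank(query: str, query_vec, k: int = 3):
    # Step 1: cheap lexical retrieval keeps only the top-k candidates.
    candidates = sorted(DOCS, key=lambda d: lexical_score(query, DOCS[d][0]), reverse=True)[:k]
    # Step 2: the more expensive scorer reorders those candidates only.
    return sorted(candidates, key=lambda d: cosine(query_vec, DOCS[d][1]), reverse=True)

ranking = retrieve_then_rerank("tense fiction", (1.0, 0.0), k=3)
print(ranking)  # ['d1', 'd3', 'd2']
```

The second stage swaps `d2` and `d3`, but `d4` can never appear because it did not survive the first stage.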


In [None]:
!pip install -q sentence_transformers ipdb

In [None]:
from sentence_transformers import CrossEncoder, SentenceTransformer
crossmodel = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2", max_length=512)



# Precompute document embeddings to avoid redundant computation

Create a folder to store the precomputed embeddings

In [None]:
import os
import pickle
import numpy as np

# Define the path to save document embeddings
embeddings_dir = './se-pqa/bi_encoder_embeddings'
os.makedirs(embeddings_dir, exist_ok=True)

doc_emb_path = os.path.join(embeddings_dir, 'bi_encoder_doc_embeddings.pkl')

Precompute document embeddings

In [None]:
from sentence_transformers import SentenceTransformer
import torch

# Initialize the bi-encoder model and move it to GPU if available
device = 'cuda' if torch.cuda.is_available() else 'cpu'
biencoder_model = SentenceTransformer('sentence-transformers/msmarco-MiniLM-L-6-v3', device=device)

# Check if embeddings already exist to avoid redundant computation
if not os.path.exists(doc_emb_path):
    print("Precomputing document embeddings...")

    # Encode all documents in batches
    doc_embeddings = biencoder_model.encode(
        corpus_df['text'].values,
        batch_size=64,
        show_progress_bar=True,
        device=device,
        convert_to_numpy=True,
        normalize_embeddings=False  # Keep raw embeddings; cos_sim normalizes them at scoring time
    )

    # Save the embeddings to disk
    with open(doc_emb_path, 'wb') as f:
        pickle.dump(doc_embeddings, f)

    print(f"Document embeddings saved to {doc_emb_path}.")
else:
    print(f"Document embeddings already exist at {doc_emb_path}. Loading embeddings...")

    # Load the precomputed embeddings
    with open(doc_emb_path, 'rb') as f:
        doc_embeddings = pickle.load(f)

    print("Document embeddings loaded successfully.")


Precomputing document embeddings...



Document embeddings saved to ./se-pqa/bi_encoder_embeddings/bi_encoder_doc_embeddings.pkl.


## Bi-encoder and cross-encoder methods

In [None]:
from sentence_transformers.util import cos_sim
# Create a mapping from docno to index
docno_to_index = {docno: idx for idx, docno in enumerate(corpus_df['docno'])}

def _biencoder_apply(df):
    """
    Computes cosine similarity scores between query embeddings and precomputed document embeddings.

    Args:
        df (pd.DataFrame): DataFrame containing 'query' and 'docno' columns.

    Returns:
        np.ndarray: Array of similarity scores corresponding to each (query, docno) pair.
    """
    queries = df['query'].tolist()
    docnos = df['docno'].values

    # Encode the query of every row in batches
    query_embs = biencoder_model.encode(
        queries,
        batch_size=64,
        show_progress_bar=False,
        device=device,
        convert_to_numpy=True,
        normalize_embeddings=False  # cos_sim normalizes internally
    )

    # Retrieve document indices based on 'docno'
    try:
        doc_indices = [docno_to_index[docno] for docno in docnos]
    except KeyError as e:
        print(f"Error: {e}. Some docnos are missing in the corpus.")
        # Assign zero scores to the whole batch if any document is missing
        return np.zeros(len(df))

    # Retrieve the corresponding document embeddings
    selected_doc_embs = doc_embeddings[doc_indices]

    # cos_sim returns the full (rows x rows) similarity matrix; its diagonal
    # holds the score of each row's own (query, docno) pair
    scores = cos_sim(query_embs, selected_doc_embs)
    return scores.diagonal().numpy()


bi_encT = pt.apply.doc_score(_biencoder_apply, batch_size=64)
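
`pt.apply.doc_score` expects exactly one score per input row, whereas `cos_sim(a, b)` from `sentence_transformers.util` returns the full matrix of similarities between every row of `a` and every row of `b`. The per-pair scores are the diagonal of that matrix. The dependency-free sketch below shows the difference on made-up 2-d vectors; the helper names and values are illustrative only.

```python
from math import sqrt

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

def pairwise_scores(query_vecs, doc_vecs):
    """One score per row: similarity of the i-th query with the i-th document."""
    return [cosine(q, d) for q, d in zip(query_vecs, doc_vecs)]

def first_matrix_row(query_vecs, doc_vecs):
    """Similarity of the FIRST query with every document (row 0 of the full matrix)."""
    return [cosine(query_vecs[0], d) for d in doc_vecs]

# Two rows of a hypothetical result frame that belong to two different queries.
query_vecs = [(1.0, 0.0), (0.0, 1.0)]
doc_vecs = [(1.0, 0.0), (0.0, 2.0)]

print(pairwise_scores(query_vecs, doc_vecs))   # [1.0, 1.0]
print(first_matrix_row(query_vecs, doc_vecs))  # [1.0, 0.0]
```

Only the row-wise version scores the second document against its own query, which matters whenever a batch mixes rows from several queries.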

In [None]:
from transformers import AutoTokenizer
from functools import partial

# Initialize tokenizer for precise truncation control
tokenizer = AutoTokenizer.from_pretrained('cross-encoder/ms-marco-MiniLM-L-6-v2')

def truncate_pair(query, text, max_length=512):
    """
    Truncate the combined query and text to fit within max_length tokens.
    """
    encoded = tokenizer.encode_plus(query, text, truncation=True, max_length=max_length, return_tensors=None)
    return encoded['input_ids']

def crossencoder_apply(df, query_col='query', text_col='text'):
    """
    Applies the CrossEncoder to compute scores for query-text pairs.

    Args:
        df (pd.DataFrame): DataFrame containing query and text columns.
        query_col (str): Name of the query column.
        text_col (str): Name of the text column.

    Returns:
        list: List of scores for each query-text pair.
    """
    queries = df[query_col].tolist()
    texts = df[text_col].tolist()

    # Form query-text pairs
    pairs = list(zip(queries, texts))
    try:
        scores = crossmodel.predict(pairs, batch_size=64, show_progress_bar=False)
    except Exception as e:
        print(f"Error during CrossEncoder prediction: {e}")
        scores = [0.0] * len(df)  # Default score in case of error

    return scores


crossencoder_apply_partial = partial(crossencoder_apply, query_col='query', text_col='text')

cross_encT = pt.apply.doc_score(crossencoder_apply_partial, batch_size=64)



## Combining BM25 with cross-encoder and bi-encoder

In [None]:
import pyterrier as pt
from pyterrier.measures import R, P, AP, nDCG #recall, precision, map, nDCG

bm25 = pt.BatchRetrieve(simple_preprocessing_index, wmodel='BM25') % 100  # Retrieve top-100 documents


# Bi-encoder Pipeline
bi_pipeline = (
    bm25
    >> pt.text.get_text(simple_preprocessing_index, 'text')  # Retrieve the document text
    >> bi_encT
)

# Cross-encoder Pipeline
cross_pipeline = (
    bm25
    >> pt.text.get_text(simple_preprocessing_index, 'text')  # Retrieve the document text
    >> cross_encT
)

# Normalize the scores
normalized_br = bm25 >> pt.pipelines.PerQueryMaxMinScoreTransformer()
normalized_bi = bi_pipeline >> pt.pipelines.PerQueryMaxMinScoreTransformer()
normalized_cross = cross_pipeline >> pt.pipelines.PerQueryMaxMinScoreTransformer()







In [None]:
# Weight combinations with BM25
with_bm25_weights = [(0.3, 0.4, 0.3), (0.2, 0.4, 0.4), (0.1, 0.5, 0.4)]
# Weight combinations without BM25
without_bm25_weights = [(0.7, 0.3), (0.6, 0.4), (0.8, 0.2)]

pipelines = []
pipeline_names = []

# Pipelines with BM25
for bm25_weight, bi_weight, cross_weight in with_bm25_weights:
    combined = (
        bm25_weight * normalized_br +
        bi_weight * normalized_bi +
        cross_weight * normalized_cross
    )
    pipelines.append(combined)
    pipeline_names.append(f"bm25_{bm25_weight}_bi_{bi_weight}_cross_{cross_weight}")

# Pipelines without BM25
for bi_weight, cross_weight in without_bm25_weights:
    combined = (
        bi_weight * normalized_bi +
        cross_weight * normalized_cross
    )
    pipelines.append(combined)
    pipeline_names.append(f"bi_{bi_weight}_cross_{cross_weight}")
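
Conceptually, each combined pipeline rescales every system's scores to [0, 1] per query and then takes a weighted sum. The plain-Python sketch below mimics that for a single query. The helper names `min_max` and `fuse` and all score values are hypothetical; this is not PyTerrier's implementation, only an illustration of the arithmetic.

```python
def min_max(scores: dict) -> dict:
    """Rescale one query's scores to [0, 1]; a constant run maps to 0."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 0.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}

def fuse(weighted_runs):
    """Weighted sum of already-normalised runs: [(weight, {docno: score}), ...]."""
    fused = {}
    for weight, run in weighted_runs:
        for doc, score in run.items():
            fused[doc] = fused.get(doc, 0.0) + weight * score
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)

# Made-up scores for one query on three candidates (values are illustrative only).
bm25_scores = {"d1": 12.0, "d2": 9.0, "d3": 6.0}
bi_scores = {"d1": 0.2, "d2": 0.8, "d3": 0.5}
cross_scores = {"d1": -3.0, "d2": 4.0, "d3": 0.5}

ranking = fuse([
    (0.1, min_max(bm25_scores)),
    (0.5, min_max(bi_scores)),
    (0.4, min_max(cross_scores)),
])
print([doc for doc, _ in ranking])  # ['d2', 'd3', 'd1']
```

Without the normalisation step, the raw BM25 scores (here 6 to 12) would dominate the cosine similarities regardless of the chosen weights.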


In [None]:
metrics = [R@100, R@5, P@1, AP@100, nDCG@10, nDCG@3]

results = pt.Experiment(
    pipelines,
    val_queries,
    qrels=val_qrels,
    names=pipeline_names,
    eval_metrics = metrics,
    verbose= True
)
print(results)

pt.Experiment:   0%|          | 0/6 [00:00<?, ?system/s]

                        name     R@100       R@5       P@1    AP@100  \
0  bm25_0.3_bi_0.4_cross_0.3  0.897959  0.816327  0.540816  0.662553   
1  bm25_0.2_bi_0.4_cross_0.4  0.897959  0.826531  0.551020  0.674045   
2  bm25_0.1_bi_0.5_cross_0.4  0.897959  0.836735  0.551020  0.677193   
3           bi_0.7_cross_0.3  0.897959  0.836735  0.530612  0.665617   
4           bi_0.6_cross_0.4  0.897959  0.836735  0.551020  0.678541   
5           bi_0.8_cross_0.2  0.897959  0.816327  0.530612  0.652008   

    nDCG@10    nDCG@3  
0  0.703862  0.670671  
1  0.712948  0.692416  
2  0.715985  0.698854  
3  0.707746  0.686220  
4  0.717321  0.700190  
5  0.693997  0.665569  


# Experiment Conclusion: Comments on Results

The experiment evaluated six configurations and compared their performance against the baseline models **BM25** and **TF-IDF**. Here are the key observations:

- **Identical Recall Across Configurations:** All configurations achieved the same **R@100** (0.897959) as the BM25 baseline. This is expected: the neural models only re-order BM25's top-100 candidates, so they cannot add or remove relevant documents at depth 100.

- **Gains in Ranking Metrics:** Configurations incorporating neural re-rankers (bi-encoder and cross-encoder) improved over the BM25 baseline on **R@5**, **AP@100**, **nDCG@10**, and **nDCG@3**, while **P@1** rose only marginally or not at all (from 0.530612 to at most 0.551020). For example, `bi_0.6_cross_0.4` achieved the highest **nDCG@10** (0.717321) against the BM25 baseline's 0.651930, and **R@5** rose from 0.693878 to as much as 0.836735.

- **Best-Performing Configurations:**
  - `bi_0.6_cross_0.4` had the highest **AP@100** (0.678541), **nDCG@10** (0.717321), and **nDCG@3** (0.700190), with `bm25_0.1_bi_0.5_cross_0.4` a close second on all three.
  - Both give the bi-encoder and cross-encoder comparable weight, which suggests that balancing the two matters more than keeping BM25 in the final score.

- **Effect of Skewed Weights:** The configuration that leans most heavily on the bi-encoder (`bi_0.8_cross_0.2`) had the lowest **AP@100**, **nDCG@10**, and **nDCG@3** of the six, and its **P@1** (0.530612) only matched the BM25 baseline, which shows that the weight allocation matters.

- **Baseline Performance:**
  - BM25 provided a solid baseline for recall, but its **nDCG@10** (0.651930) and **AP@100** (0.614691) were clearly below those of the best configuration (0.717321 and 0.678541).
  - TF-IDF showed marginally lower recall (**R@100** = 0.887755) and slightly better **P@1** (0.540816) than BM25, but its **nDCG@10** (0.656265) remained below that of every re-ranked configuration.

### Summary
All six configurations improved **AP@100** and **nDCG** over the baseline models, while **P@1** changed little, and the differences among the configurations themselves were small. The best results came from giving the bi-encoder and cross-encoder comparable weights. Since the weights were chosen by hand, a systematic search might still find better ones.
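
To see why re-ranking can move nDCG and P@1 while leaving R@100 untouched, consider a query with exactly one relevant answer, which is what the qrels construction above suggests for this dataset (an assumption of this sketch). The function name and the example ranks below are hypothetical.

```python
from math import log2

def metrics_single_relevant(rank, k_recall=100, k_ndcg=10):
    """Per-query metrics when exactly one document is relevant and it sits at
    1-based position `rank` (None if it was not retrieved at all)."""
    if rank is None:
        return {"R": 0.0, "P@1": 0.0, "nDCG": 0.0}
    return {
        "R": 1.0 if rank <= k_recall else 0.0,
        "P@1": 1.0 if rank == 1 else 0.0,
        # The ideal DCG is 1 (relevant doc at rank 1), so nDCG reduces to the discount.
        "nDCG": 1.0 / log2(1 + rank) if rank <= k_ndcg else 0.0,
    }

before = metrics_single_relevant(rank=7)  # hypothetical first-stage position
after = metrics_single_relevant(rank=2)   # hypothetical position after re-ranking
print(before)
print(after)
```

Moving the relevant answer from rank 7 to rank 2 raises nDCG@10 but leaves R@100 at 1.0 and P@1 at 0.0, which mirrors the pattern in the results table: large gains in nDCG and R@5, small ones in P@1, none in R@100.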


----





# Query Refinement with LLM

- We aim to directly integrate user context, derived from their historical activity, into the query itself.
- This approach will allow us to evaluate whether modifying the query text using an LLM can improve performance compared to the original queries or if it introduces challenges, such as irrelevant or noisy information.


In [None]:
import pandas as pd
import re
import pyterrier as pt
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
if not pt.java.started():
    pt.init(tqdm='notebook')
import os
from pyterrier.measures import *
import json

In [None]:
# Load the pre-trained model
torch.random.manual_seed(0)
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3-mini-4k-instruct",
    device_map="cuda",
    torch_dtype="auto",
    trust_remote_code=True,
)

tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct")


## Define model configuration

In [None]:
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
)

generation_args = {
    "max_new_tokens": 500,
    "return_full_text": False,
    "temperature": 0.0,
    "do_sample": False,
}

Device set to use cuda


### Only use tags from the user's preceding queries, since users do not know the tags of their own query
- We search for earlier queries only within the validation dataset, both to prevent information leakage and because of the memory cost of loading the full `questions.csv` file.
- As a result, some users have no previous queries, which in effect emulates a cold-start problem for those users.
- If a user has no historical data available, we simply transform the query with a generic LLM prompt.
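
The cold-start fallback described above could look like the following sketch. The helper name `build_messages` and the wording of the generic prompt are assumptions made for illustration; only the personalised prompt text follows the one used in this notebook.

```python
def build_messages(query, prev_tag_lists):
    """Build chat messages for the LLM. Falls back to a generic prompt when the
    user has no earlier tagged queries (cold start)."""
    # Flatten and de-duplicate the tags while keeping first-seen order.
    keywords = list(dict.fromkeys(tag for tags in prev_tag_lists for tag in tags))
    if keywords:
        instruction = (
            "Given the query and the user interests given as comma separated values, "
            "provide an expanded version of the query with personalized interests. "
            f"Do not add anything else. Query: {query}, Keywords: {', '.join(keywords)}"
        )
    else:
        # Hypothetical generic prompt for users without history.
        instruction = (
            "Provide an expanded version of the query. "
            f"Do not add anything else. Query: {query}"
        )
    return [
        {"role": "system", "content": "You are a helpful AI assistant."},
        {"role": "user", "content": instruction},
    ]

warm = build_messages("why did kanon reincarnate in another race",
                      [["my-hero-academia"], ["anime", "my-hero-academia"]])
cold = build_messages("how do i disallow screen sharing for messages", [])
print(warm[1]["content"])
print(cold[1]["content"])
```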

In [None]:
def get_prev_tags(df):
    """For every row, collect the queries and tags of up to 3 earlier queries by the same user.

    Results are returned in the original row order of `df` (index labels must be unique).
    """
    # Walk the queries from oldest to newest so the history never contains future queries
    ordered = df.sort_values('timestamp', ascending=True)
    prev_map = {}
    prev_tags_by_row = {}
    prev_queries_by_row = {}

    for idx, row in ordered.iterrows():
        history = prev_map.setdefault(row['user_id'], [])

        # The 3 most recent earlier entries of the same user (newest first)
        prev_entries = history[:3]
        prev_tags_by_row[idx] = [tags for _, tags in prev_entries]
        prev_queries_by_row[idx] = [query for query, _ in prev_entries]

        # Add the current query text and tags to the front of the user's history
        history.insert(0, (row['query'], row['tags']))

    # Re-align with the original row order so the lists can be assigned as columns
    prev_queries_col = [prev_queries_by_row[idx] for idx in df.index]
    prev_tags_col = [prev_tags_by_row[idx] for idx in df.index]
    return prev_queries_col, prev_tags_col

In [None]:
# Compute previous queries and tags, and assign them to new columns
prev_queries, prev_tags = get_prev_tags(val_queries)

# Work on a copy so that val_queries is not cluttered with redundant columns
val_queries_to_expand = val_queries.copy()

val_queries_to_expand['prev_tags'] = prev_tags


In [None]:
val_queries_to_expand.head()

Unnamed: 0,qid,query,user_id,user_questions,user_answers,tags,timestamp,prev_tags
0,academia_143743,on answering a question that no one has asked,1582241,"[travel_149904, travel_151531, sound_39671, so...","[politics_13376, politics_37453, politics_3793...","[publications, publishability, history]",2020-02-03 05:38:49,[]
1,academia_148899,how much domain expertise and network does a s...,935589,"[writers_9077, workplace_6846, workplace_9638,...","[writers_27741, writers_43414, writers_44312, ...","[publications, editors, special-issue]",2020-05-09 09:10:08,[]
2,anime_56513,does overhaul need to touch with his hands to ...,59256,"[sports_14060, sports_14107, sports_14133, spo...","[sports_14061, sports_14062, sports_14064, spo...",[my-hero-academia],2020-01-18 18:01:25,"[[citations, chapters]]"
3,anime_59459,why did kanon reincarnate in another race,59256,"[sports_14060, sports_14107, sports_14133, spo...","[sports_14061, sports_14062, sports_14064, spo...",[mao-gakuin-no-futekigosha],2020-08-31 12:37:37,"[[catholicism, mass, liturgy-of-the-hours], [c..."
4,apple_408963,how do i disallow screen sharing for messages,331923,"[travel_4961, travel_46275, travel_46638, trav...","[travel_4647, travel_90661, skeptics_6876, ske...","[security, screen-sharing]",2020-12-15 20:46:16,[]

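As a sanity check on the history logic, the following minimal sketch rebuilds a "previous tags" column on a toy list of rows. It assumes the rows are already in chronological order; the helper name `previous_tags_per_row` and the toy data are illustrative and not part of the notebook.

```python
from collections import defaultdict

def previous_tags_per_row(rows, k=3):
    """For each row, return the tag lists of the same user's k most recent earlier rows."""
    history = defaultdict(list)  # user_id -> tag lists, most recent first
    out = []
    for row in rows:
        past = history[row["user_id"]]
        # Read the history before adding the current row to it
        out.append([list(tags) for tags in past[:k]])
        past.insert(0, row["tags"])
    return out

rows = [
    {"user_id": 1, "tags": ["citations", "chapters"]},
    {"user_id": 2, "tags": ["security"]},
    {"user_id": 1, "tags": ["my-hero-academia"]},
    {"user_id": 1, "tags": ["fruit"]},
]
prev = previous_tags_per_row(rows)
for row, p in zip(rows, prev):
    print(row["user_id"], row["tags"], "->", p)
```

A user's first query always gets an empty history, which matches the `[]` entries in the table above.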

In [None]:
#tags = val_queries['tags'].tolist()
queries = val_queries['query'].tolist()
p_queries = prev_queries
p_tags = val_queries_to_expand['prev_tags']

### Demo of the function

In [None]:
%%time
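i = 15
prev_tag_lists = p_tags[i]  # named to avoid shadowing the PyTerrier alias `pt`
q = val_queries['query'].tolist()[i]

# Merge the keywords from the user's previous queries, dropping duplicates
set_of_keywords = []
for tag_list in prev_tag_lists:
    set_of_keywords.extend(tag_list)
set_of_keywords = list(set(set_of_keywords))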


print(set_of_keywords)

# Instructions
messages = [
    {"role": "system", "content": "You are a helpful AI assistant."},

    {"role": "user", "content": f"Given the query and the user interests given as comma separated values, provide an expanded version of the query with personalized interests. Do not add anything else. Query: {q}, Keywords: {', '.join(set_of_keywords)}"}
]

# Results
output = pipe(messages, **generation_args)
print(f"\nOriginal Query: {q}, Tags: {set_of_keywords}\n")
print(f"EXPANDED Query:{output[0]['generated_text']}")



['mass', 'citations', 'fruit', 'chapters', 'liturgy-of-the-hours', 'indian-cuisine', 'catholicism']

Original Query: i found a tamarind that was mostly powder how do i avoid buying these, Tags: ['mass', 'citations', 'fruit', 'chapters', 'liturgy-of-the-hours', 'indian-cuisine', 'catholicism']

EXPANDED Query: i found a tamarind that was mostly powder, and I'm interested in mass, citations, fruit, chapters, liturgy-of-the-hours, Indian cuisine, and Catholicism. How can I avoid purchasing these tamarind powders?
CPU times: user 4.36 s, sys: 93.4 ms, total: 4.46 s
Wall time: 5.96 s

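The prompt construction can be separated from the model call so that it can be inspected or unit-tested on its own. The sketch below is one possible way to do that: the function name `build_expansion_messages` and the `max_keywords` parameter are hypothetical, while the prompt strings are copied from this notebook.

```python
def build_expansion_messages(query, keywords, max_keywords=2):
    """Build the chat messages for query expansion, with or without user keywords."""
    system = {"role": "system", "content": "You are a helpful AI assistant."}
    # Drop duplicate keywords while keeping their order, then cap the list
    unique = list(dict.fromkeys(keywords))[:max_keywords]
    if unique:
        content = (
            "Given the query and the user interests given as comma separated values, "
            "provide an expanded version of the query with personalized interests. "
            f"Do not add anything else. Query: {query}, Keywords: {', '.join(unique)}"
        )
    else:
        content = (
            "Given the query, provide a generalized expanded version of the query. "
            "Keep the new query with similar length to the original. "
            f"Do not add anything else. Query: {query}"
        )
    return [system, {"role": "user", "content": content}]

with_tags = build_expansion_messages("why did kanon reincarnate", ["mass", "fruit", "mass", "chapters"])
without_tags = build_expansion_messages("why did kanon reincarnate", [])
print(with_tags[1]["content"])
print(without_tags[1]["content"])
```

Capping the keyword list keeps the prompt short; which keywords to keep is a design choice that this sketch does not try to settle.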

## Pipeline that creates a new DataFrame *val_queries_expanded* with LLM-expanded queries

- For efficiency, this step is done offline even though it requires the user query. This way we can test different configurations without recomputing the enriched query every time.

- In the final pipeline, this step is performed online whenever the user selects query enrichment.

In [None]:
# Load necessary data
tags = val_queries['tags'].tolist()
queries = val_queries['query'].tolist()

prev_queries, prev_tags = get_prev_tags(val_queries)


val_queries_expanded = val_queries.copy()

expanded_queries = []

# This loop feeds every user query to the LLM, together with up to two
# tags from the user's history (most recent entries first).
#
#   - If the user has historical tags, submit the query plus at most 2 tags.
#   - If there are no tags, submit only the user query.
#
# Each generated query is then cleaned of a possible leading "Query:" label.
for i in range(len(val_queries)):
    q = queries[i]
    # Flatten the per-query tag lists into a single list.
    # Named `hist_tags` (not `pt`) to avoid shadowing the PyTerrier alias.
    hist_tags = [item for sublist in prev_tags[i] for item in sublist]

    if not hist_tags:
        hist_tags = ["None"]
    elif len(hist_tags) > 2:
        # Keep only the first 2 keywords, otherwise the query becomes too long
        hist_tags = hist_tags[:2]

    if hist_tags == ["None"]:
        messages = [
            {"role": "system", "content": "You are a helpful AI assistant."},
            {"role": "user", "content": f"Given the query, provide a generalized expanded version of the query. Keep the new query with similar length to the original. Do not add anything else. Query: {q}"}
        ]
    else:
        messages = [
            {"role": "system", "content": "You are a helpful AI assistant."},
            {"role": "user", "content": f"Given the query and the user interests given as comma separated values, provide an expanded version of the query with personalized interests. Do not add anything else. Query: {q}, Keywords: {', '.join(hist_tags)}"}
        ]

    # Get the model output
    output = pipe(messages, **generation_args)
    cleaned_text = output[0]['generated_text'].strip()

    # Remove a leading "Query:" label
    cleaned_text = re.sub(r'(?i)^query:\s*', '', cleaned_text)
    expanded_queries.append(cleaned_text)

    # Print the original and expanded queries
    print(f"\n\nOriginal Query {i + 1}: {q}, Previous Tags: {hist_tags}\n")
    print(f"EXPANDED Query {i + 1}: {cleaned_text}")

val_queries_expanded['query'] = expanded_queries
# apply text pre processing to expanded queries
val_queries_expanded['query'] = val_queries_expanded['query'].apply(clean_text)







Original Query 1: on answering a question that no one has asked, Previous Tags: ['None']

EXPANDED Query 1: Respond to a unique inquiry that hasn't been posed before


Original Query 2: how much domain expertise and network does a special issue guest editor need, Previous Tags: ['None']

EXPANDED Query 2: What level of domain expertise and professional network is required for a special issue guest editor?


Original Query 3: does overhaul need to touch with his hands to activate his quirk, Previous Tags: ['citations', 'chapters']

EXPANDED Query 3: Does the overhaul need to physically interact with his hands to activate his quirk, considering the citations and chapters mentioned in the document?


Original Query 4: why did kanon reincarnate in another race, Previous Tags: ['catholicism', 'mass']

EXPANDED Query 4: Why did Kanon reincarnate in another race, considering her strong Catholic beliefs and the significance of the mass in her life?


Original Query 5: how do i disallow scree

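The cleanup step in the loop removes a leading "Query:" label, should the model echo one. A small sketch isolating that step is shown here; `strip_query_prefix` is an illustrative name, and the regular expression is the one used in the loop.

```python
import re

def strip_query_prefix(text):
    """Remove surrounding whitespace and a leading, case-insensitive 'Query:' label."""
    return re.sub(r"(?i)^query:\s*", "", text.strip())

print(strip_query_prefix("  Query: What level of domain expertise is required?"))
print(strip_query_prefix("why did kanon reincarnate in another race"))
```

Because the pattern is anchored at the start of the string, the word "query" appearing later in the text is left untouched.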
In [None]:
val_queries_expanded.head()

Unnamed: 0,qid,query,user_id,user_questions,user_answers,tags,timestamp
0,academia_143743,respond to a unique inquiry that hasnt been po...,1582241,"[travel_149904, travel_151531, sound_39671, so...","[politics_13376, politics_37453, politics_3793...","[publications, publishability, history]",2020-02-03 05:38:49
1,academia_148899,what level of domain expertise and professiona...,935589,"[writers_9077, workplace_6846, workplace_9638,...","[writers_27741, writers_43414, writers_44312, ...","[publications, editors, special-issue]",2020-05-09 09:10:08
2,anime_56513,does the overhaul need to physically interact ...,59256,"[sports_14060, sports_14107, sports_14133, spo...","[sports_14061, sports_14062, sports_14064, spo...",[my-hero-academia],2020-01-18 18:01:25
3,anime_59459,why did kanon reincarnate in another race cons...,59256,"[sports_14060, sports_14107, sports_14133, spo...","[sports_14061, sports_14062, sports_14064, spo...",[mao-gakuin-no-futekigosha],2020-08-31 12:37:37
4,apple_408963,how can i prevent screen sharing during messag...,331923,"[travel_4961, travel_46275, travel_46638, trav...","[travel_4647, travel_90661, skeptics_6876, ske...","[security, screen-sharing]",2020-12-15 20:46:16


## We test the expanded queries with the same pipelines as the original ones

In [None]:
import pyterrier as pt
from pyterrier.measures import R, P, AP, nDCG #recall, precision, map, nDCG

bm25 = pt.BatchRetrieve(simple_preprocessing_index, wmodel='BM25') % 100  # Retrieve top-100 documents
tfidf = pt.BatchRetrieve(simple_preprocessing_index, wmodel='TF_IDF') % 100  # Retrieve top-100 documents


# Bi-encoder Pipeline
bi_pipeline = (
    bm25
    >> pt.text.get_text(simple_preprocessing_index, 'text')  # Retrieve the document text
    >> bi_encT
)

# Cross-encoder Pipeline
cross_pipeline = (
    bm25
    >> pt.text.get_text(simple_preprocessing_index, 'text')  # Retrieve the document text
    >> cross_encT
)

# Normalize the scores
normalized_br = bm25 >> pt.pipelines.PerQueryMaxMinScoreTransformer()
normalized_bi = bi_pipeline >> pt.pipelines.PerQueryMaxMinScoreTransformer()
normalized_cross = cross_pipeline >> pt.pipelines.PerQueryMaxMinScoreTransformer()







In [None]:
# Weight combinations with BM25
with_bm25_weights = [(0.3, 0.4, 0.3), (0.2, 0.4, 0.4), (0.1, 0.5, 0.4)]
# Weight combinations without BM25
without_bm25_weights = [(0.7, 0.3), (0.6, 0.4), (0.8, 0.2)]

pipelines = []
pipeline_names = []

# Pipelines with BM25
for bm25_weight, bi_weight, cross_weight in with_bm25_weights:
    combined = (
        bm25_weight * normalized_br +
        bi_weight * normalized_bi +
        cross_weight * normalized_cross
    )
    pipelines.append(combined)
    pipeline_names.append(f"bm25_{bm25_weight}_bi_{bi_weight}_cross_{cross_weight}")

# Pipelines without BM25
for bi_weight, cross_weight in without_bm25_weights:
    combined = (
        bi_weight * normalized_bi +
        cross_weight * normalized_cross
    )
    pipelines.append(combined)
    pipeline_names.append(f"bi_{bi_weight}_cross_{cross_weight}")

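The `*` and `+` operators above build a linear combination of the normalized pipeline scores. To make the arithmetic concrete, here is a dependency-free sketch of a weighted sum of min-max-normalized scores for a single query. It is not PyTerrier's implementation: the helper names are illustrative, and mapping a constant run to 0 and treating a document missing from a run as contributing 0 are assumptions of this sketch.

```python
def min_max(scores):
    """Rescale a {docno: score} dict to [0, 1]; a constant run maps to 0.0 (sketch choice)."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {docno: 0.0 for docno in scores}
    return {docno: (s - lo) / (hi - lo) for docno, s in scores.items()}

def weighted_fusion(runs, weights):
    """Sum the weighted, normalized scores per document and sort by the fused score."""
    fused = {}
    for run, weight in zip(runs, weights):
        for docno, s in min_max(run).items():
            fused[docno] = fused.get(docno, 0.0) + weight * s
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)

# Toy scores on very different scales, as produced by different rankers
bm25_run = {"d1": 12.0, "d2": 9.0, "d3": 6.0}
bi_run = {"d1": 0.2, "d2": 0.8, "d3": 0.5}
cross_run = {"d1": -1.0, "d2": 3.0, "d3": 1.0}

ranking = weighted_fusion([bm25_run, bi_run, cross_run], [0.2, 0.4, 0.4])
print(ranking)
```

Normalizing first matters because the raw scores live on different scales; without it, the ranker with the largest raw values would dominate regardless of the weights.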

In [None]:
metrics = [R@100, R@5, P@1, AP@100, nDCG@10, nDCG@3]

results = pt.Experiment(
    pipelines,
    val_queries_expanded,
    qrels=val_qrels,
    names=pipeline_names,
    eval_metrics = metrics,
    verbose= True
)
print(results)

In [None]:
print(results)

                        name     R@100       R@5       P@1    AP@100  \
0  bm25_0.3_bi_0.4_cross_0.3  0.826531  0.744898  0.438776  0.580305   
1  bm25_0.2_bi_0.4_cross_0.4  0.826531  0.775510  0.479592  0.606129   
2  bm25_0.1_bi_0.5_cross_0.4  0.826531  0.765306  0.438776  0.582585   
3           bi_0.7_cross_0.3  0.826531  0.724490  0.438776  0.568709   
4           bi_0.6_cross_0.4  0.826531  0.755102  0.438776  0.581994   
5           bi_0.8_cross_0.2  0.826531  0.663265  0.438776  0.540615   

    nDCG@10    nDCG@3  
0  0.632593  0.599485  
1  0.652634  0.620987  
2  0.635027  0.604587  
3  0.612511  0.581507  
4  0.631817  0.605923  
5  0.585165  0.536440  


## Evaluation:

The observed performance decline suggests that query expansion negatively impacts statistical retrieval models like BM25, because it often adds information that does not align with the current query's intent. In the log above, for instance, a question about an anime character (Query 4) is expanded with references to Catholicism and the mass, tags taken from the same user's earlier, unrelated questions. To incorporate personalized information more effectively without directly altering the original query, the next section explores more recommender-oriented techniques. This approach aims to integrate personalization while maintaining the integrity of the original query.


---

# Personalization

- We will compute and normalize personalized and content-based scores, specifically:
  - **Quality (Score):** Non-personalized metric based on the intrinsic value of the document.
  - **Popularity:** Non-personalized metric derived from user engagement (e.g., comment counts).
  - **Content-Based:** Personalized metric leveraging user context for tailored recommendations.

- While the first two methods are non-personalized, the content-based approach incorporates user context to provide personalized recommendations.

- These scores will be normalized and combined with retrieval scores to re-rank the final set of documents.

- We will explore different approaches for reranking:
  - Direct reranking based solely on recommender metrics.
  - Score combination by integrating retrieval and recommender scores.

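The two reranking strategies listed above differ only in the final scoring function. A minimal sketch on a toy candidate list follows, where `retrieval` and `feature` stand for already-normalized scores in [0, 1]; the function names, the toy values, and the default weights are illustrative.

```python
def rerank_by_feature(candidates):
    """Direct reranking: order candidates by the recommender feature alone."""
    return sorted(candidates, key=lambda c: c["feature"], reverse=True)

def rerank_by_combination(candidates, w_retrieval=0.6, w_feature=0.4):
    """Score combination: order candidates by a weighted sum of both signals."""
    return sorted(
        candidates,
        key=lambda c: w_retrieval * c["retrieval"] + w_feature * c["feature"],
        reverse=True,
    )

candidates = [
    {"docno": "a1", "retrieval": 1.0, "feature": 0.1},
    {"docno": "a2", "retrieval": 0.7, "feature": 0.3},
    {"docno": "a3", "retrieval": 0.1, "feature": 1.0},
]

direct = [c["docno"] for c in rerank_by_feature(candidates)]
combined = [c["docno"] for c in rerank_by_combination(candidates)]
print("direct:  ", direct)
print("combined:", combined)
```

Direct reranking discards the retrieval signal entirely, so a document that matches the query only weakly (`a3` here) can rise to the top; the combination keeps the retrieval order unless the feature difference is large enough to outweigh it.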

In [None]:
users = pd.read_csv('/content/PIR_data/users.csv', encoding='ISO-8859-1', engine='python', usecols=['AccountId', 'DisplayName', 'Reputation'])
users.dropna(inplace=True)
users['AccountId'] = pd.to_numeric(users['AccountId'], downcast='integer')
users.info()

<class 'pandas.core.frame.DataFrame'>
Index: 2782402 entries, 0 to 2782731
Data columns (total 3 columns):
 #   Column       Dtype 
---  ------       ----- 
 0   Reputation   int64 
 1   DisplayName  object
 2   AccountId    int32 
dtypes: int32(1), int64(1), object(1)
memory usage: 74.3+ MB


In [None]:
users.head()

Unnamed: 0,Reputation,DisplayName,AccountId
0,1,Community,-1
1,101,Geoff Dalgas,2
2,101,Jarrod Dixon,3
3,101,Emmett,1998
4,101,Jin,21721


In [None]:
answers = pd.read_csv('/content/PIR_data/answers.csv', encoding='ISO-8859-1', engine='python', usecols=['ParentId', 'AccountId', 'Score', 'CommentCount', 'Id'])
answers.dropna(inplace=True)
answers['AccountId'] = pd.to_numeric(answers['AccountId'], downcast='integer')
answers.info()

<class 'pandas.core.frame.DataFrame'>
Index: 2103045 entries, 0 to 2173519
Data columns (total 5 columns):
 #   Column        Dtype  
---  ------        -----  
 0   Id            object 
 1   Score         float64
 2   CommentCount  float64
 3   ParentId      object 
 4   AccountId     int32  
dtypes: float64(2), int32(1), object(2)
memory usage: 88.2+ MB


In [None]:
answers.head()

Unnamed: 0,Id,Score,CommentCount,ParentId,AccountId
0,writers_8,9.0,0.0,writers_1,225829
1,writers_10,9.0,0.0,writers_5,66721
2,writers_14,5.0,0.0,writers_7,198391
3,writers_15,18.0,3.0,writers_1,94574
4,writers_16,15.0,9.0,writers_2,66721


In [None]:
tags = pd.read_csv('/content/PIR_data/tags.csv', encoding='ISO-8859-1', engine='python', usecols=['TagName'])
#tags['Count'] = pd.to_numeric(tags['Count'], downcast='integer', errors='coerce')
#tags.dropna(inplace=True)
tags.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 29443 entries, 0 to 29442
Data columns (total 1 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   TagName  29342 non-null  object
dtypes: object(1)
memory usage: 230.2+ KB


In [None]:
tags.head()

Unnamed: 0,TagName
0,third-person
1,publishing
2,brainstorming
3,short-story
4,fiction


In [None]:
questions = pd.read_csv('/content/PIR_data/questions.csv', encoding='ISO-8859-1', engine='python', usecols=['Id', 'Tags', 'AccountId', 'Community', 'Title'])
questions.dropna(inplace=True)
questions['AccountId'] = pd.to_numeric(questions['AccountId'], downcast='integer')
questions.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1089169 entries, 0 to 1126456
Data columns (total 5 columns):
 #   Column     Non-Null Count    Dtype 
---  ------     --------------    ----- 
 0   Id         1089169 non-null  object
 1   Title      1089169 non-null  object
 2   Tags       1089169 non-null  object
 3   AccountId  1089169 non-null  int32 
 4   Community  1089169 non-null  object
dtypes: int32(1), object(4)
memory usage: 45.7+ MB


In [None]:
questions.head()

Unnamed: 0,Id,Title,Tags,AccountId,Community
0,writers_1,What are some online guides for starting writers?,<resources><first-time-author>,1335,writers
1,writers_2,What is the difference between writing in the ...,<fiction><grammatical-person><third-person>,1335,writers
2,writers_3,How do I find an agent?,<publishing><novel><agent>,496961,writers
3,writers_5,Decide on a theme/overarching meaning before w...,<plot><short-story><planning><brainstorming>,174535,writers
4,writers_7,What is Literary Fiction?,<fiction><genre><categories>,496961,writers


### Quality Measure
First, we use the `Score` column of the `answers` table as a proxy for answer quality. For each document retrieved by the information retrieval system, the code fetches the corresponding score. All scores are then min-max normalized to the [0, 1] range, where 0 corresponds to the lowest score and 1 to the highest, and used in the weighted reranking of documents.

In [None]:
def get_normalized_scores(answer_ids, answers):
    """
    Retrieve and normalize scores for a given list of answer IDs.

    Args:
        answer_ids (list): List of answer IDs.
        answers (pd.DataFrame): DataFrame containing answer data, including 'Id' and 'Score' columns.

    Returns:
        pd.Series: A Series containing the normalized scores for the given answer IDs.
    """
    # Filter the answers DataFrame for the specified answer IDs
    # (.copy() avoids pandas' SettingWithCopyWarning when a column is added below)
    answer_scores = answers[answers['Id'].isin(answer_ids)][['Id', 'Score']].copy()

    # Check if there are scores to normalize
    if answer_scores.empty:
        print("No scores found for the given answer IDs.")
        return pd.Series(dtype=float)

    # Normalize the scores (Min-Max Normalization)
    min_score = answer_scores['Score'].min()
    max_score = answer_scores['Score'].max()

    if min_score == max_score:
        # Avoid division by zero if all scores are the same
        answer_scores['NormalizedScore'] = 1.0
    else:
        answer_scores['NormalizedScore'] = (answer_scores['Score'] - min_score) / (max_score - min_score)

    # Return the normalized scores as a Series, indexed by answer ID
    return answer_scores.set_index('Id')['NormalizedScore']

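As a worked example of the normalization, the sketch below applies the same min-max rule to the five scores shown in `answers.head()`. It uses plain dicts instead of a DataFrame to stay dependency-free; `min_max_normalize` is an illustrative name, and the tie handling (all values map to 1.0) mirrors the function above.

```python
def min_max_normalize(values):
    """Min-max normalize a {id: value} dict; if all values are equal, return 1.0 for each."""
    lo, hi = min(values.values()), max(values.values())
    if lo == hi:
        return {key: 1.0 for key in values}
    return {key: (v - lo) / (hi - lo) for key, v in values.items()}

scores = {
    "writers_8": 9.0,
    "writers_10": 9.0,
    "writers_14": 5.0,
    "writers_15": 18.0,
    "writers_16": 15.0,
}
normalized = min_max_normalize(scores)
for answer_id, value in normalized.items():
    print(f"{answer_id}: {value:.3f}")
```

Note that min-max scaling is relative to the set of IDs passed in: a single very high score compresses all the others toward 0, so the same answer can receive different normalized values in different candidate sets.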
### Popularity Measure
In the same fashion, we used the ‘CommentCount’ as a proxy for answer popularity. Normalized comment counts are used together with the other scores in the reranking of documents.

In [None]:
def get_normalized_popularity(answer_ids, answers):
    """
    Retrieve and normalize popularity for a given list of answer IDs.

    Args:
        answer_ids (list): List of answer IDs.
        answers (pd.DataFrame): DataFrame containing answer data, including 'Id' and 'CommentCount' columns.

    Returns:
        pd.Series: A Series containing the normalized popularity for the given answer IDs.
    """
    # Filter the answers DataFrame for the specified answer IDs
    # (.copy() avoids pandas' SettingWithCopyWarning when a column is added below)
    answer_scores = answers[answers['Id'].isin(answer_ids)][['Id', 'CommentCount']].copy()

    # Check if there are scores to normalize
    if answer_scores.empty:
        print("No scores found for the given answer IDs.")
        return pd.Series(dtype=float)

    # Normalize the scores (Min-Max Normalization)
    min_score = answer_scores['CommentCount'].min()
    max_score = answer_scores['CommentCount'].max()

    if min_score == max_score:
        # Avoid division by zero if all scores are the same
        answer_scores['NormalizedPopularity'] = 1.0
    else:
        answer_scores['NormalizedPopularity'] = (answer_scores['CommentCount'] - min_score) / (max_score - min_score)

    # Return the normalized scores as a Series, indexed by answer ID
    return answer_scores.set_index('Id')['NormalizedPopularity']

### Content-based
First, the querying user's profile is built by extracting all the tags associated with their other authored questions and answers (for an answer, the tags of its parent question). These tags serve as the features for our model. A binary vectorizer is then fitted using this tag set as its vocabulary, so the user's profile becomes a binary vector with each entry set to 1 for tags the user has interacted with.

Afterwards, for each document under evaluation, its associated tags are retrieved and transformed into a binary vector using the same vectorizer fitted to the user's profile, ensuring both vectors share the same feature space. The cosine similarity between the user's profile vector and the document vector is then computed.

To address the issue of low support in the evaluated document vector, the similarity score is rescaled by multiplying it by the smaller of the two vector magnitudes, and the result is clipped at 1. This rescaling keeps the score from staying close to zero, where it would have a negligible influence on the weighted re-ranking of the evaluated documents.

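Because both vectors are binary and the document vector only keeps tags that occur in the user's vocabulary, the rescaled score has a simple closed form: multiplying the cosine by the smaller norm leaves the number of shared tags divided by the larger of the two norms. The sketch below checks this on toy tag sets with the standard library only; `adjusted_similarity` is an illustrative name, and each tag is treated as one whole token.

```python
import math

def adjusted_similarity(user_tags, doc_tags):
    """Cosine similarity of binary tag vectors, times the smaller norm, clipped at 1."""
    user_tags = set(user_tags)
    shared = user_tags & set(doc_tags)  # document tags outside the user vocabulary are dropped
    if not user_tags or not shared:
        return 0.0
    u_norm = math.sqrt(len(user_tags))  # norm of the all-ones user vector
    d_norm = math.sqrt(len(shared))     # norm of the document vector in the user's space
    cosine = len(shared) / (u_norm * d_norm)
    return min(cosine * min(u_norm, d_norm), 1.0)

user = ["fiction", "plot", "novel", "genre"]
one_shared = adjusted_similarity(user, ["fiction"])
two_shared = adjusted_similarity(user, ["fiction", "plot", "poetry"])
all_shared = adjusted_similarity(user, ["fiction", "plot", "novel", "genre"])
none_shared = adjusted_similarity(user, ["poetry"])
print(one_shared, two_shared, all_shared, none_shared)
```

With a four-tag profile, one shared tag gives 1/2 and two shared tags already give 2/2, so the clip at 1 comes into play quickly for small profiles.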
In [None]:
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import CountVectorizer

def content_based(user_questions, user_answers, questions, answers, evaluate_answers):
    """
    Computes the cosine similarity between a user profile and a list of answers to evaluate.

    Args:
        user_questions (list): List of question IDs authored by the user.
        user_answers (list): List of answer IDs authored by the user.
        questions (pd.DataFrame): DataFrame containing questions, including 'Id' and 'Tags'.
        answers (pd.DataFrame): DataFrame containing answers, including 'Id' and 'ParentId'.
        evaluate_answers (list): List of answer IDs to evaluate.

    Returns:
        list: A list of tuples, where each tuple contains the answer ID and its similarity score.
    """
    # Step 1: Build the user profile as a set of tags
    question_tags = questions[questions['Id'].isin(user_questions)]['Tags'].dropna()
    question_tags_list = []
    for tags in question_tags:
        question_tags_list.extend(tags.strip('<>').split('><'))

    parent_question_ids = answers[answers['Id'].isin(user_answers)]['ParentId']
    parent_question_tags = questions[questions['Id'].isin(parent_question_ids)]['Tags'].dropna()
    answer_tags_list = []
    for tags in parent_question_tags:
        answer_tags_list.extend(tags.strip('<>').split('><'))

    user_profile = set(question_tags_list + answer_tags_list)

    # Step 2: Prepare tag-based representations
    user_profile_str = ' '.join(user_profile)

    # Initialize list to store similarity scores
    similarities_norm = []

    vectorizer = CountVectorizer(binary=True)
    user_profile_vector = vectorizer.fit_transform([user_profile_str])

    # Step 3: Compute cosine similarity for each answer to evaluate
    for answer_id in evaluate_answers:
        # Get parent question ID for the answer
        parent_id = answers.loc[answers['Id'] == answer_id, 'ParentId']
        if parent_id.empty:
            continue
        parent_id = parent_id.values[0]

        # Get tags of the parent question
        parent_question_tags = questions.loc[questions['Id'] == parent_id, 'Tags']
        if parent_question_tags.empty:
            continue
        ans_tags = parent_question_tags.values[0].strip('<>').split('><')
        ans_tags_str = ' '.join(ans_tags)

        # Vectorize and compute cosine similarity
        ans_tags_vector = vectorizer.transform([ans_tags_str])
        sim_score = cosine_similarity(user_profile_vector, ans_tags_vector)[0][0]
        min_magnitude = min(np.linalg.norm(user_profile_vector.toarray()) , np.linalg.norm(ans_tags_vector.toarray()))
        normalized_sim_score = sim_score * min_magnitude
        # Clip the rescaled score at 1
        similarities_norm.append((answer_id, min(normalized_sim_score, 1.0)))

    return similarities_norm

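One detail of the vectorizer step is worth checking. With its default settings, `CountVectorizer` lowercases the text and tokenizes it with the pattern `(?u)\b\w\w+\b`, so hyphenated tags such as `indian-cuisine` are split into separate tokens and single-character tokens are dropped. The sketch below reproduces that default pattern with the standard library; the whitespace-based alternative is a suggestion of this sketch, not what the notebook does.

```python
import re

DEFAULT_TOKEN_PATTERN = r"(?u)\b\w\w+\b"  # scikit-learn's default for CountVectorizer
WHOLE_TAG_PATTERN = r"\S+"                # alternative: one token per space-separated tag

def tokenize(text, pattern):
    """Lowercase the text and extract tokens with the given regular expression."""
    return re.findall(pattern, text.lower())

tags = "indian-cuisine liturgy-of-the-hours fruit c"
default_tokens = tokenize(tags, DEFAULT_TOKEN_PATTERN)
whole_tokens = tokenize(tags, WHOLE_TAG_PATTERN)
print(default_tokens)
print(whole_tokens)

# Passing token_pattern=r"\S+" to CountVectorizer(binary=True, ...) would keep
# each tag intact; whether that improves the ranking is untested here.
```

With the default pattern, two unrelated tags that share a word (for example "of" or "the") contribute to the similarity, which may or may not be desirable.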
# Custom PyTerrier Transformers

To enhance document ranking, we will implement custom PyTerrier transformers for the following scores:

- **Answer Quality (Score):**  
  A metric to evaluate the intrinsic quality of an answer based on its `Score` field, normalized to ensure comparability across documents.

- **Popularity Score:**  
  A metric derived from user engagement data, here the answer's comment count, to reflect the popularity of each document. This score is also normalized.

- **Content-Based Score:**  
  A personalized metric leveraging user context (e.g., historical activity, tags) to compute the similarity between user preferences and document content.

Each transformer will compute these scores and integrate them into the PyTerrier pipeline, enabling flexible and efficient re-ranking of documents based on combined metrics.


In [None]:
import pyterrier as pt
import pandas as pd

class ContentBasedTransformer(pt.Transformer):
    def __init__(self, questions, answers, top_n=5, rerank= True):
        """
        Initializes the ContentBasedTransformer with questions, answers, and a limit on how many documents (top_n) to process.

        Args:
            questions (pd.DataFrame): DataFrame containing question data (e.g., tags, IDs).
            answers (pd.DataFrame): DataFrame containing answer data (e.g., IDs).
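            top_n (int): Number of top documents to process for each user. Defaults to 5.
            rerank (bool): If True, sort the input by score and evaluate only the top_n documents per user. Defaults to True.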
        """
        self.rerank = rerank
        self.questions = questions
        self.answers = answers
        self.top_n = top_n

    def transform(self, inp: pd.DataFrame) -> pd.DataFrame:
        """
        Compute content-based similarity scores for each user's top_n documents.

        Args:
            inp (pd.DataFrame): PyTerrier DataFrame containing 'user_id', 'user_questions', 'user_answers', 'docno', and 'score'.

        Returns:
            pd.DataFrame: Input DataFrame with an additional 'content_similarity' column.
        """
        print('ALIVE')
        if self.rerank:
            # 1) Sort by retrieval score in descending order
            inp = inp.sort_values(by='score', ascending=False)

        # 2) Prepare a list to collect per-user similarity results
        all_results = []

        # 3) Group by user_id and process each user's data
        for user_id, user_data in inp.groupby('user_id'):

            # Retrieve up to 10 questions/answers from the user's profile
            user_questions = user_data['user_questions'].iloc[0][:10]
            user_answers   = user_data['user_answers'].iloc[0][:10]

            # Select the documents to evaluate: with rerank=True, the top_n
            # highest-scoring rows for this user; otherwise the first 10 rows in input order
            if self.rerank:
                evaluate_answers = user_data['docno'].tolist()[:self.top_n]
            else:
                evaluate_answers = user_data['docno'].tolist()[:10]


            similarities = content_based(
                user_questions=user_questions,
                user_answers=user_answers,
                questions=self.questions,
                answers=self.answers,
                evaluate_answers=evaluate_answers
            )

            # Convert the similarities to a DataFrame
            sim_df = pd.DataFrame(similarities, columns=['docno', 'content_similarity'])

            # Attach the user_id back so we can merge properly
            sim_df['user_id'] = user_id
            all_results.append(sim_df)

        # 4) Concatenate the per-user data into one DataFrame
        all_results_df = pd.concat(all_results, ignore_index=True)

        # 5) Merge the new similarities back onto the original DataFrame
        inp = inp.merge(all_results_df, on=['docno', 'user_id'], how='left')

        # 6) Fill any missing similarity scores with 0
        inp['content_similarity'] = inp['content_similarity'].fillna(0)

        # Return the final DataFrame with 'content_similarity'
        return inp

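The transformer groups rows by `user_id`, so when several queries from the same user arrive in one input frame, the `top_n` cut is taken over that user's pooled results. If a per-query cut is intended, one option is to select the candidates per `qid` instead. The sketch below uses plain dicts in place of DataFrame rows; `top_n_per_query` and the toy scores are illustrative and not part of the notebook.

```python
from collections import defaultdict

def top_n_per_query(rows, top_n=5):
    """Return, for each qid, the docnos of its top_n rows by descending score."""
    by_query = defaultdict(list)
    for row in rows:
        by_query[row["qid"]].append(row)
    selected = {}
    for qid, group in by_query.items():
        ranked = sorted(group, key=lambda r: r["score"], reverse=True)
        selected[qid] = [r["docno"] for r in ranked[:top_n]]
    return selected

# Two queries issued by the same user, as for user 59256 in the validation set
rows = [
    {"qid": "anime_56513", "user_id": 59256, "docno": "a1", "score": 0.9},
    {"qid": "anime_56513", "user_id": 59256, "docno": "a2", "score": 0.4},
    {"qid": "anime_59459", "user_id": 59256, "docno": "b1", "score": 0.3},
    {"qid": "anime_59459", "user_id": 59256, "docno": "b2", "score": 0.2},
]
per_query = top_n_per_query(rows, top_n=1)
print(per_query)
```

With a per-user cut and `top_n=1`, only `a1` would be evaluated here, and the second query would receive no content-based scores at all.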

In [None]:
class NormalizedScoresTransformer(pt.Transformer):
    def __init__(self, answers):
        """
        Initializes the NormalizedScoresTransformer with answer data.

        Args:
            answers (pd.DataFrame): DataFrame containing answer data (e.g., 'Id' and 'Score').
        """
        self.answers = answers

    def transform(self, inp):
        """
        Add normalized scores to the DataFrame based on the provided answer IDs.

        Args:
            inp (pd.DataFrame): PyTerrier DataFrame containing 'docno' (answer IDs).

        Returns:
            pd.DataFrame: Input DataFrame with an additional 'normalized_score' column.
        """
        # Extract the list of document IDs (answer IDs)
        answer_ids = inp['docno'].tolist()



        # Compute normalized scores using the provided function
        #print(f"Calculating normalized scores for {len(answer_ids)} answers.")
        normalized_scores = get_normalized_scores(answer_ids, self.answers)

        # Convert normalized scores to a DataFrame for merging
        scores_df = normalized_scores.reset_index()
        scores_df.columns = ['docno', 'normalized_score']

        # Merge normalized scores back into the input DataFrame
        inp = inp.merge(scores_df, on='docno', how='left')

        # Fill missing normalized scores with 0
        inp['normalized_score'] = inp['normalized_score'].fillna(0)


        return inp


In [None]:
class NormalizedPopularityTransformer(pt.Transformer):
    def __init__(self, answers):
        """
        Initializes the NormalizedPopularityTransformer with answer data.

        Args:
            answers (pd.DataFrame): DataFrame containing answer data (e.g., 'Id' and 'CommentCount').
        """
        self.answers = answers

    def transform(self, inp):
        """
        Add normalized popularity scores to the DataFrame based on the provided answer IDs.

        Args:
            inp (pd.DataFrame): PyTerrier DataFrame containing 'docno' (answer IDs).

        Returns:
            pd.DataFrame: Input DataFrame with an additional 'normalized_popularity' column.
        """
        # Extract the list of document IDs (answer IDs)
        answer_ids = inp['docno'].tolist()

        # Compute normalized popularity scores
        #print(f"Calculating normalized popularity for {len(answer_ids)} answers.")

        normalized_popularity = get_normalized_popularity(answer_ids, self.answers)


        # Convert normalized popularity scores to a DataFrame for merging
        popularity_df = normalized_popularity.reset_index()
        popularity_df.columns = ['docno', 'normalized_popularity']

        # Merge normalized popularity scores back into the input DataFrame
        inp = inp.merge(popularity_df, on='docno', how='left')

        # Fill missing popularity scores with 0
        inp['normalized_popularity'] = inp['normalized_popularity'].fillna(0)


        return inp


In [None]:
content_transformer = ContentBasedTransformer(questions, answers)
score_transformer = NormalizedScoresTransformer(answers)
popularity_transformer = NormalizedPopularityTransformer(answers)

### Run baseline personalization configurations to determine the weights for combining with the best configurations

In [None]:
metrics = [R@100, R@5, P@1, AP@100, nDCG@10, nDCG@3]
# Define the combined pipeline
normalized_bm25 = bm25 >> pt.pipelines.PerQueryMaxMinScoreTransformer()
combined_pipeline_with_scores = (
    normalized_bm25
    >> score_transformer  # Adds 'normalized_score'
    >> pt.apply.doc_score(
        lambda row: (
            0.6 * row['score'] +  # BM25 score
            0.4 * row['normalized_score']  # Normalized score
        )
    )
)

combined_pipeline_with_popularity = (
    normalized_bm25
    >> popularity_transformer  # Adds 'normalized_popularity'
    >> pt.apply.doc_score(
        lambda row: (
            0.6 * row['score'] +  # Normalized BM25 score
            0.4 * row['normalized_popularity']  # Normalized popularity
        )
    )
)

combined_pipeline_with_content_base = (
    normalized_bm25
    >> content_transformer  # Adds 'content_similarity'
    >> pt.apply.doc_score(
        lambda row: (
            0.6 * row['score'] +  # Normalized BM25 score
            0.4 * row['content_similarity']  # Content-based similarity
        )
    )
)

# Run an experiment to compare the BM25 baseline with the combined pipelines
results = pt.Experiment(
    [bm25, combined_pipeline_with_content_base, combined_pipeline_with_scores, combined_pipeline_with_popularity],
    val_queries,  # Validation queries
    val_qrels,  # Ground truth relevance
    eval_metrics=metrics,  # Evaluation metrics
    names=['BM25', "BM25 + Content Base Combined", "BM25 + Scores Combined", "BM25 + Popularity Combined"],  # Pipeline names
    verbose=True
)

# Display the results
print(results)



ALIVE
                           name     R@100       R@5       P@1    AP@100  \
0                          BM25  0.897959  0.693878  0.530612  0.614691   
1  BM25 + Content Base Combined  0.897959  0.693878  0.540816  0.619793   
2        BM25 + Scores Combined  0.897959  0.693878  0.520408  0.610679   
3    BM25 + Popularity Combined  0.897959  0.704082  0.530612  0.613799   

    nDCG@10    nDCG@3  
0  0.651930  0.605197  
1  0.655696  0.608963  
2  0.649075  0.606533  
3  0.648986  0.610299  


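We notice only a very slight improvement over the simple baseline: the content-based combination raises P@1 from 0.531 to 0.541 and nDCG@10 from 0.652 to 0.656, while the score and popularity combinations are essentially on par with BM25. We will explore more complex weight combinations and strategies to leverage this additional information.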

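One way to organize such an exploration is to enumerate weight tuples on a coarse grid that sum to 1 and to evaluate one pipeline per tuple. The sketch below only generates the candidate weights; the step size and the helper name `weight_grid` are assumptions, and wiring each tuple into a pipeline would follow the cells above.

```python
from itertools import product

def weight_grid(n_components, step=0.2):
    """Enumerate all weight tuples on a grid of the given step that sum to 1."""
    steps = round(1 / step)
    grid = []
    # Work with integer multiples of `step` to avoid floating-point drift in the sum
    for combo in product(range(steps + 1), repeat=n_components):
        if sum(combo) == steps:
            grid.append(tuple(round(c * step, 10) for c in combo))
    return grid

two_way = weight_grid(2, step=0.2)    # e.g. retrieval score vs. one feature
three_way = weight_grid(3, step=0.5)  # e.g. BM25, bi-encoder, cross-encoder
print(two_way)
print(len(three_way))
```

The number of tuples grows quickly with the number of components and with finer steps, which is the time and hardware constraint mentioned in the disclaimer.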
## Instead of combining weights, we rerank solely according to the personalization features

In [None]:
import pyterrier as pt
from pyterrier.measures import R, P, AP, nDCG #recall, precision, map, nDCG
# Define evaluation metrics
metrics = [R@100, R@5, P@1, AP@100, nDCG@10, nDCG@3]

bm25 = pt.BatchRetrieve(simple_preprocessing_index, wmodel='BM25') % 100  # Retrieve top-100 documents
# Initialize the BM25 pipeline with score normalization
normalized_bm25 = bm25 >> pt.pipelines.PerQueryMaxMinScoreTransformer()

# Define reranking pipeline using the score transformer
rerank_pipeline_with_scores = (
    normalized_bm25
    >> score_transformer  # Adds 'normalized_score'
    >> pt.apply.doc_score(
        lambda row: row['normalized_score']  # Use only the transformer's normalized score for reranking
    )
)

# Define reranking pipeline using the popularity transformer
rerank_pipeline_with_popularity = (
    normalized_bm25
    >> popularity_transformer  # Adds 'normalized_popularity'
    >> pt.apply.doc_score(
        lambda row: row['normalized_popularity']  # Use only the transformer's normalized popularity score for reranking
    )
)

# Define reranking pipeline using the content-based transformer
rerank_pipeline_with_content_base = (
    normalized_bm25
    >> content_transformer  # Adds 'content_similarity'
    >> pt.apply.doc_score(
        lambda row: row['content_similarity']  # Use only the transformer's content similarity score for reranking
    )
)

# Run an experiment to evaluate the pipelines
results = pt.Experiment(
    retr_systems=[
        bm25,  # BM25 baseline
        #rerank_pipeline_with_content_base,  # BM25 + Content-Based Reranking
        rerank_pipeline_with_scores,         # BM25 + Score-Based Reranking
        rerank_pipeline_with_popularity       # BM25 + Popularity-Based Reranking
    ],
    topics=val_queries,      # Validation queries
    qrels=val_qrels,          # Ground truth relevance judgments
    eval_metrics=metrics,    # Evaluation metrics
    names=[
        'BM25',
        #'BM25 + Content-Based Reranking',
        'BM25 + Score-Based Reranking',
        'BM25 + Popularity-Based Reranking'
    ],
    verbose=True
)

# Display the results
print(results)




pt.Experiment:   0%|          | 0/3 [00:00<?, ?system/s]

                                name     R@100       R@5       P@1    AP@100  \
0                               BM25  0.897959  0.693878  0.530612  0.614691   
1       BM25 + Score-Based Reranking  0.897959  0.051020  0.000000  0.040336   
2  BM25 + Popularity-Based Reranking  0.897959  0.051020  0.000000  0.043876   

    nDCG@10    nDCG@3  
0  0.651930  0.605197  
1  0.046444  0.016642  
2  0.035991  0.029518  


This clearly does not work: by reranking on the personalization feature alone, we completely disrupt the work of BM25 and disregard topicality and relevance altogether (P@1 falls to 0).

# What if we rerank only the top 5 documents among themselves?
- We overwrite the BM25 scores, losing their interpretability, but it is a quick and dirty way of reranking the top-n docs without affecting the rest of the ranking
- This ensures that no document in the top-n can leave it, and no document outside it can enter
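The invariant described in the bullets can be sketched on a plain list. This is a simplified illustration under assumed toy data; `rerank_top_n` and the `popularity` values are hypothetical and independent of the PyTerrier transformer defined in the notebook.

```python
def rerank_top_n(ranked, signal, n=5):
    """Reorder only the first n items of `ranked` by `signal` (descending)."""
    head = sorted(ranked[:n], key=lambda d: signal.get(d, 0.0), reverse=True)
    return head + ranked[n:]  # the tail keeps its original order

bm25_order = ["d1", "d2", "d3", "d4", "d5", "d6", "d7"]
# d6 is the most popular document, but it sits outside the top 5
popularity = {"d1": 0.2, "d2": 0.9, "d3": 0.1, "d4": 0.5, "d5": 0.4, "d6": 1.0}

reranked = rerank_top_n(bm25_order, popularity, n=5)
print(reranked)
```

Even though `d6` has the highest popularity, it cannot enter the top 5, so recall-oriented metrics at cutoff 5 or beyond are unaffected and only the order inside the top 5 changes.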

In [None]:
import pyterrier as pt
import pandas as pd

metrics = [R@100, R@5, P@1, AP@100, nDCG@10, nDCG@3]

bm25 = pt.BatchRetrieve(simple_preprocessing_index, wmodel="BM25") % 100
normalized_bm25 = bm25 >> pt.pipelines.PerQueryMaxMinScoreTransformer()

def partial_rerank_overwrite_scores(df, n=5, personalization_col="normalized_popularity"):
    """
    Partially rerank top-n docs by 'personalization_col' and overwrite BM25 'score'
    with descending scores to preserve the new order.

    Args:
        df (pd.DataFrame): DataFrame containing retrieval results for a single query.
        n (int): Number of top documents to rerank.
        personalization_col (str): Column name to sort the top-n documents.

    Returns:
        pd.DataFrame: DataFrame with updated 'score' for top-n documents.
    """
    # Sort by BM25 score descending to identify top-n
    df_sorted = df.sort_values("score", ascending=False).reset_index(drop=True)
    top_n = df_sorted.iloc[:n].copy()

    # Sort the top-n subset by the personalization column descending
    top_n_sorted = top_n.sort_values(personalization_col, ascending=False).reset_index(drop=True)

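    # Assign new scores: higher personalization gets a higher score.
    # Scores run from n down to 1, which keeps the top-n at or above the remaining
    # documents here because those keep min-max normalized BM25 scores (at most 1.0).
    top_n_sorted["score"] = list(range(n, n - len(top_n_sorted), -1))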

    # Keep the bottom documents (rank n+ to 100) in BM25 order and scores
    bottom = df_sorted.iloc[n:].copy()

    # Re-combine them
    combined = pd.concat([top_n_sorted, bottom], ignore_index=True)

    # Ensure the combined DataFrame is sorted by the new 'score' descending
    combined = combined.sort_values("score", ascending=False).reset_index(drop=True)

    return combined

def rerank_top_n_transformer(n=5, personalization_col="normalized_popularity"):
    """
    PyTerrier transformer that applies partial reranking on a per-query basis.

    Args:
        n (int): Number of top documents to rerank.
        personalization_col (str): Column name to sort the top-n documents.

    Returns:
        pt.Transformer: A PyTerrier transformer.
    """
    def _transform(df):
        return df.groupby("qid", group_keys=False).apply(
            lambda group: partial_rerank_overwrite_scores(group, n, personalization_col)
        ).reset_index(drop=True)
    return pt.apply.generic(_transform)

rerank_pipeline_content = (
    normalized_bm25
    >> content_transformer
    >> rerank_top_n_transformer(n=5, personalization_col="content_similarity")
)

rerank_pipeline_pop = (
    normalized_bm25
    >> popularity_transformer
    >> rerank_top_n_transformer(n=5, personalization_col="normalized_popularity")
)

# Rerank top-5 by normalized_score
rerank_pipeline_score = (
    normalized_bm25
    >> score_transformer
    >> rerank_top_n_transformer(n=5, personalization_col="normalized_score")
)

# Compare systems in an Experiment
results = pt.Experiment(
    retr_systems=[
        normalized_bm25,
        rerank_pipeline_pop,
        rerank_pipeline_score,
        rerank_pipeline_content
    ],
    topics=val_queries,
    qrels=val_qrels,
    eval_metrics=metrics,
    names=["BM25 (Top-100)", "Top-100 + Partial Rerank (By Popularity)","Top-100 + Partial Rerank (By Score)","Top-100 + Partial Rerank (By Content)" ],
    verbose=True
)

print(results)




pt.Experiment:   0%|          | 0/4 [00:00<?, ?system/s]



                                       name     R@100       R@5       P@1  \
0                            BM25 (Top-100)  0.897959  0.693878  0.530612   
1  Top-100 + Partial Rerank (By Popularity)  0.897959  0.693878  0.122449   
2       Top-100 + Partial Rerank (By Score)  0.897959  0.693878  0.112245   
3     Top-100 + Partial Rerank (By Content)  0.897959  0.693878  0.510204   

     AP@100   nDCG@10    nDCG@3  
0  0.614691  0.651930  0.605197  
1  0.338500  0.443755  0.314865  
2  0.294283  0.408053  0.228740  
3  0.604487  0.644398  0.597665  




### Tradeoffs
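- Given the nature of the labels, reordering the top documents by personalization signals lowers the measured performance: P@1 drops from 0.53 to about 0.12 for popularity and 0.11 for score, while content-based reranking stays close to the baseline (0.51)
- The results may be ranked in a more personalized way, but the metrics show clearly that this approach is still not good enough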

# Let's test different weightings with the best configurations we have found so far:
- bm25_0.2_bi_0.4_cross_0.4 (Configuration A)
- bm25_0.1_bi_0.5_cross_0.4 (Configuration B)

They are very similar, so we can shift the weights to see in which configuration personalization has the most beneficial impact

## Custom bi-encoder and cross-encoder apply functions that return a new column in the df instead of overwriting the rank score, so we can compare different weights in the configurations later in the pipeline
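The idea of adding a score column rather than overwriting `score` can be sketched with NumPy alone. This is an assumption-laden toy: `add_pairwise_cosine` is a hypothetical helper, the embeddings are hand-written 2-d vectors instead of model outputs, and the row-wise dot product is one possible way to get per-pair similarities without building a full similarity matrix.

```python
import numpy as np
import pandas as pd

def add_pairwise_cosine(df, query_embs, doc_embs, out_col="biencoder_score"):
    """Return a copy of df with the row-wise cosine similarity stored in `out_col`."""
    q = query_embs / np.linalg.norm(query_embs, axis=1, keepdims=True)
    d = doc_embs / np.linalg.norm(doc_embs, axis=1, keepdims=True)
    out = df.copy()
    out[out_col] = np.einsum("ij,ij->i", q, d)  # one dot product per (query, doc) row
    return out

results = pd.DataFrame({"qid": ["q1", "q1"], "docno": ["d1", "d2"], "score": [0.8, 0.3]})
query_embs = np.array([[1.0, 0.0], [0.0, 2.0]])  # one embedding per result row
doc_embs = np.array([[2.0, 0.0], [3.0, 0.0]])

scored = add_pairwise_cosine(results, query_embs, doc_embs)
print(scored)
```

Because the original `score` column is left untouched, later stages can weight the retrieval score and the new column independently.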

In [None]:
from sentence_transformers.util import cos_sim
# Create a mapping from docno to index
docno_to_index = {docno: idx for idx, docno in enumerate(corpus_df['docno'])}

def _biencoder_apply(df):
    """
    Computes cosine similarity scores between query embeddings and precomputed document embeddings.

    Args:
        df (pd.DataFrame): DataFrame containing 'query' and 'docno' columns.

    Returns:
        pd.DataFrame: The input DataFrame with an added 'biencoder_score' column
        holding the similarity of each (query, docno) pair.
    """
    queries = df['query'].values  # Query strings; the 'text' column holds the document text
    docnos = df['docno'].values

    # Encode queries in batches
    query_embs = biencoder_model.encode(
        queries,
        batch_size=64,
        show_progress_bar=False,
        device=device,
        convert_to_numpy=True,
        normalize_embeddings=False  # cos_sim normalizes internally, so no explicit normalization is needed
    )

    # Retrieve document indices based on 'docno'
    try:
        doc_indices = [docno_to_index[docno] for docno in docnos]
    except KeyError as e:
        print(f"Error: {e}. Some docnos are missing in the corpus.")
        # Fall back to zero scores, still returning a DataFrame as pt.apply.generic expects
        df['biencoder_score'] = 0.0
        return df

    # Retrieve the corresponding document embeddings
    selected_doc_embs = doc_embeddings[doc_indices]

    # Compute cosine similarity between query embeddings and document embeddings
    similarity_matrix = cos_sim(query_embs, selected_doc_embs)
    # Extract the diagonal elements to get per-pair similarities
    scores = np.diag(similarity_matrix)


    # Add the new scores as a column
    df['biencoder_score'] = scores

    return df


bi_encT_generic = pt.apply.generic(_biencoder_apply, batch_size=64)

In [None]:
from transformers import AutoTokenizer
from functools import partial

# Initialize tokenizer for precise truncation control
# (named cross_tokenizer so it does not overwrite the LLM `tokenizer` reused later)
cross_tokenizer = AutoTokenizer.from_pretrained('ncbi/MedCPT-Cross-Encoder')

def truncate_pair(query, text, max_length=512):
    """
    Truncate the combined query and text to fit within max_length tokens.
    """
    encoded = cross_tokenizer.encode_plus(query, text, truncation=True, max_length=max_length, return_tensors=None)
    return encoded['input_ids']

def _crossencoder_apply(df, query_col='query', text_col='text'):
    """
    Applies the CrossEncoder to compute scores for query-text pairs.

    Args:
        df (pd.DataFrame): DataFrame containing query and text columns.
        query_col (str): Name of the query column.
        text_col (str): Name of the text column.

    Returns:
        pd.DataFrame: The input DataFrame with an added 'cross_score' column
        holding the score of each query-text pair.
    """
    queries = df[query_col].tolist()
    texts = df[text_col].tolist()

    # Form query-text pairs
    pairs = list(zip(queries, texts))
    try:
        scores = crossmodel.predict(pairs, batch_size=64, show_progress_bar=False)
    except Exception as e:
        print(f"Error during CrossEncoder prediction: {e}")
        scores = [0.0] * len(df)  # Default score in case of error



    df['cross_score'] = scores

    return df
#return crossmodel.predict(list(zip(df['query'].values, df[column].values)))


crossencoder_apply_partial = partial(_crossencoder_apply, query_col='query', text_col='text')

cross_encT_generic = pt.apply.generic(crossencoder_apply_partial, batch_size=64)




In [None]:
import pyterrier as pt
from pyterrier.measures import R, P, AP, nDCG #recall, precision, map, nDCG

bm25 = pt.BatchRetrieve(simple_preprocessing_index, wmodel='BM25') % 100  # Retrieve top-100 documents
normalized_br = bm25 >> pt.pipelines.PerQueryMaxMinScoreTransformer()

content_trasformer_no_rerank = ContentBasedTransformer(questions, answers, rerank=False)

# -----------------------------------
# Define Neural Rerankers Pipelines
# -----------------------------------

# Bi-encoder Pipeline
bi_pipeline = (
    bm25
    >> pt.text.get_text(simple_preprocessing_index, 'text')  # Retrieve the document text
    >> bi_encT_generic
)

# Cross-encoder Pipeline
cross_pipeline = (
    bm25
    >> pt.text.get_text(simple_preprocessing_index, 'text')  # Retrieve the document text
    >> cross_encT_generic
)

combined_pipeline = (
    normalized_br
    >> pt.text.get_text(simple_preprocessing_index, 'text')  # Retrieve the document text
    >> bi_encT_generic
    >> cross_encT_generic
)

# Normalize the scores

normalized_bi = bi_pipeline >> pt.pipelines.PerQueryMaxMinScoreTransformer()
normalized_cross = cross_pipeline >> pt.pipelines.PerQueryMaxMinScoreTransformer()
normalized_combined = combined_pipeline >> pt.pipelines.PerQueryMaxMinScoreTransformer()







In [None]:
metrics = [R@100, R@5, P@1, AP@100, nDCG@10, nDCG@3]


score_weights = [(0.1, 0.1, 0.4, 0.4),  # BM25 weight , score weight, biencoder weight, crossencoder weight
                 (0.2, 0.1, 0.3, 0.4),
                 (0.1, 0.3, 0.3, 0.4)]

popularity_weights = [(0.1, 0.1, 0.4, 0.4),  # BM25 weight , popularity weight, biencoder weight, crossencoder weight
                 (0.2, 0.1, 0.3, 0.4),
                 (0.1, 0.3, 0.3, 0.4)]

content_weights = [(0.1, 0.1, 0.4, 0.4),  # BM25 weight , content weight, biencoder weight, crossencoder weight
                 (0.2, 0.1, 0.3, 0.4),
                 (0.1, 0.3, 0.3, 0.4)]


pipelines = []
pipeline_names = []

for weights in score_weights:
    bm25_weight, score_weight, biencoder_weight, crossencoder_weight = weights

    # BM25 + quality score + bi-encoder + cross-encoder.
    # The weights are bound as lambda defaults so each pipeline keeps its own values;
    # a plain closure would read the loop variables only when the pipeline runs.
    combined_pipeline_with_scores = (
        normalized_combined
        >> score_transformer  # Adds 'normalized_score'
        >> pt.apply.doc_score(
            lambda row, w_bm25=bm25_weight, w_score=score_weight,
                   w_bi=biencoder_weight, w_cross=crossencoder_weight: (
                w_bm25 * row['score'] +  # BM25 score
                w_score * row['normalized_score'] +  # Normalized score
                w_bi * row['biencoder_score'] +
                w_cross * row['cross_score']
            )
        )
    )
    pipelines.append(combined_pipeline_with_scores)
    pipeline_names.append(f"bm25_{bm25_weight}_bi_{biencoder_weight}_cross_{crossencoder_weight}_score_{score_weight}")

for weights in popularity_weights:
    # popularity_weight (singular) avoids overwriting the popularity_weights list
    bm25_weight, popularity_weight, biencoder_weight, crossencoder_weight = weights

    # BM25 + popularity + bi-encoder + cross-encoder (weights bound as lambda defaults)
    combined_pipeline_with_popularity = (
        normalized_combined
        >> popularity_transformer  # Adds 'normalized_popularity'
        >> pt.apply.doc_score(
            lambda row, w_bm25=bm25_weight, w_pop=popularity_weight,
                   w_bi=biencoder_weight, w_cross=crossencoder_weight: (
                w_bm25 * row['score'] +  # BM25 score
                w_pop * row['normalized_popularity'] +  # Normalized popularity
                w_bi * row['biencoder_score'] +
                w_cross * row['cross_score']
            )
        )
    )
    pipelines.append(combined_pipeline_with_popularity)
    pipeline_names.append(f"bm25_{bm25_weight}_bi_{biencoder_weight}_cross_{crossencoder_weight}_popularity_{popularity_weight}")


for weights in content_weights:
    # content_weight (singular) avoids overwriting the content_weights list
    bm25_weight, content_weight, biencoder_weight, crossencoder_weight = weights

    # BM25 + content similarity + bi-encoder + cross-encoder (weights bound as lambda defaults)
    combined_pipeline_with_content = (
        normalized_combined
        >> content_trasformer_no_rerank  # Adds 'content_similarity'
        >> pt.apply.doc_score(
            lambda row, w_bm25=bm25_weight, w_content=content_weight,
                   w_bi=biencoder_weight, w_cross=crossencoder_weight: (
                w_bm25 * row['score'] +  # BM25 score
                w_content * row['content_similarity'] +  # Content similarity
                w_bi * row['biencoder_score'] +
                w_cross * row['cross_score']
            )
        )
    )
    pipelines.append(combined_pipeline_with_content)
    pipeline_names.append(f"bm25_{bm25_weight}_bi_{biencoder_weight}_cross_{crossencoder_weight}_content_{content_weight}")





In [None]:
# Run an experiment to evaluate the pipelines
results = pt.Experiment(
    pipelines,  # All weighted personalization pipelines built above
    val_queries,  # Validation queries
    val_qrels,  # Ground truth relevance
    eval_metrics=metrics,  # Evaluation metrics
    names=pipeline_names,  # Pipeline names
    verbose=True
)

# Display the results
print(results)


pt.Experiment:   0%|          | 0/9 [00:00<?, ?system/s]

ALIVE
ALIVE
ALIVE
                                       name     R@100       R@5      P@1  \
0       bm25_0.1_bi_0.4_cross_0.4_score_0.1  0.897959  0.826531  0.72449   
1       bm25_0.2_bi_0.3_cross_0.4_score_0.1  0.897959  0.826531  0.72449   
2       bm25_0.1_bi_0.3_cross_0.4_score_0.3  0.897959  0.826531  0.72449   
3  bm25_0.1_bi_0.4_cross_0.4_popularity_0.1  0.897959  0.826531  0.72449   
4  bm25_0.2_bi_0.3_cross_0.4_popularity_0.1  0.897959  0.826531  0.72449   
5  bm25_0.1_bi_0.3_cross_0.4_popularity_0.3  0.897959  0.826531  0.72449   
6     bm25_0.1_bi_0.4_cross_0.4_content_0.1  0.897959  0.826531  0.72449   
7     bm25_0.2_bi_0.3_cross_0.4_content_0.1  0.897959  0.826531  0.72449   
8     bm25_0.1_bi_0.3_cross_0.4_content_0.3  0.897959  0.826531  0.72449   

     AP@100   nDCG@10   nDCG@3  
0  0.771087  0.790765  0.76822  
1  0.771087  0.790765  0.76822  
2  0.771087  0.790765  0.76822  
3  0.771084  0.790765  0.76822  
4  0.771084  0.790765  0.76822  
5  0.771084  0.790765  

Although every configuration improves markedly over the pure retrieval approach (P@1 rises from 0.53 to 0.72 on the validation set), the metrics do not differentiate among them. This suggests that the overall improvement is quite similar across all configurations. We are unable to identify a clearly superior one, so we will retain all three scores: quality, popularity, and content-based.

The inability to differentiate may be due to each system producing very similar scores, resulting in comparable outputs. This could be caused by the normalization steps we applied.
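Independently of normalization, one Python-level effect is worth ruling out whenever scoring lambdas are created inside a loop: a lambda reads the enclosing variables when it is called, not when it is defined, so functions built in successive iterations can all end up using the last values. The toy sketch below (the names `late` and `bound` are illustrative, not part of the notebook) shows the effect and the usual remedy of binding the value as a default argument.

```python
weights = [1, 2, 3]

# Late binding: every lambda looks up `w` at call time and sees the last value
late = [lambda x: w * x for w in weights]

# Early binding: the current `w` is stored as a default argument of each lambda
bound = [lambda x, w=w: w * x for w in weights]

late_out = [f(10) for f in late]
bound_out = [f(10) for f in bound]
print(late_out, bound_out)
```

If the scoring functions behave like `late`, pipelines that were meant to use different weights would be evaluated with identical ones and produce identical metrics.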

---

# Test best configurations on test set
- llm enrichment
- personalization
- baseline
- neural ranking

Baseline:
- bm25

LLM:
- bm25_0.2_bi_0.4_cross_0.4


Personalization:
- bm25_0.1_bi_0.4_cross_0.4_score_0.1
- bm25_0.1_bi_0.4_cross_0.4_popularity_0.1
- bm25_0.1_bi_0.4_cross_0.4_content_0.1

Neural Ranking:
- bm25_0.2_bi_0.4_cross_0.4

-----

## Run test

## Normalization

In [None]:
class CrossEncoderNormalizer(pt.Transformer):
    def __init__(self, normalization_method="min-max"):
        """
        Custom transformer to normalize cross-encoder scores before combining.

        Args:
            normalization_method: Type of normalization. Options: "min-max", "z-score".
        """
        self.normalization_method = normalization_method

    def transform(self, inp: pd.DataFrame) -> pd.DataFrame:
        if self.normalization_method == "min-max":
            return inp.groupby("qid", group_keys=False).apply(self._min_max_normalize)
        elif self.normalization_method == "z-score":
            return inp.groupby("qid", group_keys=False).apply(self._z_score_normalize)
        else:
            raise ValueError(f"Unknown normalization method: {self.normalization_method}")

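    def _min_max_normalize(self, df):
        df = df.copy()  # Avoid mutating the group handed over by groupby
        min_score = df["cross_score"].min()
        max_score = df["cross_score"].max()
        if max_score - min_score > 0:
            df["cross_score"] = (df["cross_score"] - min_score) / (max_score - min_score)
        else:
            df["cross_score"] = 0.5  # Assign a constant if all scores are the same
        return df

    def _z_score_normalize(self, df):
        # Minimal counterpart for the "z-score" option referenced in transform()
        df = df.copy()
        mean_score = df["cross_score"].mean()
        std_score = df["cross_score"].std(ddof=0)
        if std_score > 0:
            df["cross_score"] = (df["cross_score"] - mean_score) / std_score
        else:
            df["cross_score"] = 0.0  # All scores identical: centre them at zero
        return df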




In [None]:
import pyterrier as pt
from pyterrier.measures import R, P, AP, nDCG #recall, precision, map, nDCG

bm25 = pt.BatchRetrieve(simple_preprocessing_index, wmodel='BM25') % 100  # Retrieve top-100 documents
normalized_br = bm25 >> pt.pipelines.PerQueryMaxMinScoreTransformer()

content_trasformer_no_rerank = ContentBasedTransformer(questions, answers, rerank=False)


# Bi-encoder Pipeline
bi_pipeline_generic = (
    bm25
    >> pt.text.get_text(simple_preprocessing_index, 'text')  # Retrieve the document text
    >> bi_encT_generic
)

# Cross-encoder Pipeline
cross_pipeline_generic = (
    bm25
    >> pt.text.get_text(simple_preprocessing_index, 'text')  # Retrieve the document text
    >> cross_encT_generic

)

combined_pipeline_generic = (
    normalized_br
    >> pt.text.get_text(simple_preprocessing_index, 'text')  # Retrieve the document text
    >> bi_encT_generic
    >> cross_encT_generic
)

# Normalize the scores

normalized_bi_generic = bi_pipeline_generic >> pt.pipelines.PerQueryMaxMinScoreTransformer()
normalized_cross_generic = cross_pipeline_generic >> pt.pipelines.PerQueryMaxMinScoreTransformer()
normalized_combined_generic = combined_pipeline_generic >> pt.pipelines.PerQueryMaxMinScoreTransformer()








In [None]:
import pyterrier as pt
from pyterrier.measures import R, P, AP, nDCG #recall, precision, map, nDCG

bm25 = pt.BatchRetrieve(simple_preprocessing_index, wmodel='BM25') % 100  # Retrieve top-100 documents




# Bi-encoder Pipeline
bi_pipeline = (
    bm25
    >> pt.text.get_text(simple_preprocessing_index, 'text')  # Retrieve the document text
    >> bi_encT
)

# Cross-encoder Pipeline
cross_pipeline = (
    bm25
    >> pt.text.get_text(simple_preprocessing_index, 'text')  # Retrieve the document text
    >> cross_encT
)

# Normalize the scores
normalized_br = bm25 >> pt.pipelines.PerQueryMaxMinScoreTransformer()
normalized_bi = bi_pipeline >> pt.pipelines.PerQueryMaxMinScoreTransformer()
normalized_cross = cross_pipeline >> pt.pipelines.PerQueryMaxMinScoreTransformer()







In [None]:

score_weights = [(0.1, 0.1, 0.4, 0.4)]  # BM25 weight , score weight, biencoder weight, crossencoder weight
popularity_weights = [(0.1, 0.1, 0.4, 0.4)]  # BM25 weight , popularity weight, biencoder weight, crossencoder weight
content_weights = [(0.1, 0.1, 0.4, 0.4),]  # BM25 weight , content weight, biencoder weight, crossencoder weight
with_bm25_weights = [ (0.2, 0.4, 0.4)] #BM25, biencoder, cross encoder


pipelines = []
pipeline_names = []
pipelines.append(normalized_br)
pipeline_names.append("bm25")


# Pipelines with BM25
for bm25_weight, bi_weight, cross_weight in with_bm25_weights:
    combined = (
        bm25_weight * normalized_br +
        bi_weight * normalized_bi +
        cross_weight * normalized_cross
    )
    pipelines.append(combined)
    pipeline_names.append(f"bm25_{bm25_weight}_bi_{bi_weight}_cross_{cross_weight}")

for weights in score_weights:
    bm25_weight, score_weight, biencoder_weight, crossencoder_weight = weights

    # BM25 + quality score + bi-encoder + cross-encoder (weights bound as lambda defaults)
    combined_pipeline_with_scores = (
        normalized_combined_generic
        >> score_transformer  # Adds 'normalized_score'
        >> CrossEncoderNormalizer(normalization_method="min-max")  # Normalize cross-encoder scores
        >> pt.apply.doc_score(
            lambda row, w_bm25=bm25_weight, w_score=score_weight,
                   w_bi=biencoder_weight, w_cross=crossencoder_weight: (
                w_bm25 * row['score'] +  # BM25 score
                w_score * row['normalized_score'] +  # Normalized score
                w_bi * row['biencoder_score'] +
                w_cross * row['cross_score']
            )
        )
    )
    pipelines.append(combined_pipeline_with_scores)
    pipeline_names.append(f"bm25_{bm25_weight}_bi_{biencoder_weight}_cross_{crossencoder_weight}_score_{score_weight}")

for weights in popularity_weights:
    # popularity_weight (singular) avoids overwriting the popularity_weights list
    bm25_weight, popularity_weight, biencoder_weight, crossencoder_weight = weights

    # BM25 + popularity + bi-encoder + cross-encoder (weights bound as lambda defaults)
    combined_pipeline_with_popularity = (
        normalized_combined_generic
        >> popularity_transformer  # Adds 'normalized_popularity'
        >> CrossEncoderNormalizer(normalization_method="min-max")  # Normalize cross-encoder scores
        >> pt.apply.doc_score(
            lambda row, w_bm25=bm25_weight, w_pop=popularity_weight,
                   w_bi=biencoder_weight, w_cross=crossencoder_weight: (
                w_bm25 * row['score'] +  # BM25 score
                w_pop * row['normalized_popularity'] +  # Normalized popularity
                w_bi * row['biencoder_score'] +
                w_cross * row['cross_score']
            )
        )
    )
    pipelines.append(combined_pipeline_with_popularity)
    pipeline_names.append(f"bm25_{bm25_weight}_bi_{biencoder_weight}_cross_{crossencoder_weight}_popularity_{popularity_weight}")


for weights in content_weights:
    # content_weight (singular) avoids overwriting the content_weights list
    bm25_weight, content_weight, biencoder_weight, crossencoder_weight = weights

    # BM25 + content similarity + bi-encoder + cross-encoder (weights bound as lambda defaults)
    combined_pipeline_with_content = (
        normalized_combined_generic
        >> content_trasformer_no_rerank  # Adds the normalized content-based scores
        >> CrossEncoderNormalizer(normalization_method="min-max")  # Normalize cross-encoder scores
        >> pt.apply.doc_score(
            lambda row, w_bm25=bm25_weight, w_content=content_weight,
                   w_bi=biencoder_weight, w_cross=crossencoder_weight: (
                w_bm25 * row['score'] +  # BM25 score
                w_content * row['content_similarity'] +  # Content similarity
                w_bi * row['biencoder_score'] +
                w_cross * row['cross_score']
            )
        )
    )
    pipelines.append(combined_pipeline_with_content)
    pipeline_names.append(f"bm25_{bm25_weight}_bi_{biencoder_weight}_cross_{crossencoder_weight}_content_{content_weight}")





In [None]:
metrics = [R@100, R@5, P@1, AP@100, nDCG@10, nDCG@3]

# Run an experiment to test the pipelines
results = pt.Experiment(
    pipelines,  # BM25 baseline, neural ranking, and the three personalized pipelines
    test_queries,  # Test queries
    test_qrels,  # Ground truth relevance
    eval_metrics=metrics,  # Evaluation metrics
    names=pipeline_names,  # Pipeline names
    verbose=True
)

# Display the results
print(results)

pt.Experiment:   0%|          | 0/5 [00:00<?, ?system/s]



ALIVE




                                       name     R@100       R@5       P@1  \
0                                      bm25  0.928571  0.826531  0.714286   
1                 bm25_0.2_bi_0.4_cross_0.4  0.928571  0.897959  0.744898   
2       bm25_0.1_bi_0.4_cross_0.4_score_0.1  0.928571  0.908163  0.775510   
3  bm25_0.1_bi_0.4_cross_0.4_popularity_0.1  0.928571  0.918367  0.775510   
4     bm25_0.1_bi_0.4_cross_0.4_content_0.1  0.928571  0.908163  0.775510   

     AP@100   nDCG@10    nDCG@3  
0  0.763667  0.789195  0.760446  
1  0.808658  0.832735  0.816811  
2  0.822925  0.846119  0.817905  
3  0.823258  0.846431  0.817905  
4  0.823443  0.846566  0.817905  


- The personalized pipelines also perform better on the test set (P@1 of 0.776 versus 0.745 for neural ranking alone and 0.714 for BM25)
- We did not combine the personalization scores with each other since they perform similarly, so there is no apparent advantage in doing so; future studies may still consider this option
- We can now use these pipelines in the final search engine

# Final Pipeline
- Set a **user id** to emulate a user profile
- Provide a natural language query
- Choose a valid retrieval method
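Choosing a retrieval method by name can be sketched as a dictionary lookup. This is only an illustration of the selection step: `resolve_pipeline` and the stand-in callables in `registry` are hypothetical and replace the real PyTerrier pipelines with simple functions.

```python
def resolve_pipeline(retrieval_type, registry):
    """Return the pipeline registered under `retrieval_type` or raise a clear error."""
    try:
        return registry[retrieval_type]
    except KeyError:
        valid = ", ".join(sorted(registry))
        raise ValueError(
            f"Unknown retrieval type: {retrieval_type!r}. Valid options: {valid}"
        ) from None

# Stand-in callables; in the notebook these would be the tested pipelines
registry = {
    "baseline": lambda query: f"bm25({query})",
    "neural_rerank": lambda query: f"neural({query})",
}

selected = resolve_pipeline("baseline", registry)
print(selected("how to brew tea"))
```

Listing the valid options in the error message keeps the method choice transparent to the user, in line with the goal of the search engine.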


### Set of optimal pipelines

In [None]:
pipelines

[(TerrierRetr(BM25) >> RankCutoff(100) >> <pyterrier.pipelines.PerQueryMaxMinScoreTransformer object at 0x7d9ec17307d0>),
 <pyterrier._ops.Sum at 0x7d9ec27a7990>,
 (TerrierRetr(BM25) >> RankCutoff(100) >> <pyterrier.pipelines.PerQueryMaxMinScoreTransformer object at 0x7d9ec17307d0> >> <pyterrier.terrier._text_loader.TerrierTextLoader object at 0x7d9ec27c91d0> >> pt.apply.generic() >> pt.apply.generic() >> <pyterrier.pipelines.PerQueryMaxMinScoreTransformer object at 0x7d9ec2774ad0> >> <__main__.NormalizedScoresTransformer object at 0x7d9ec3368dd0> >> pt.apply.doc_score()),
 (TerrierRetr(BM25) >> RankCutoff(100) >> <pyterrier.pipelines.PerQueryMaxMinScoreTransformer object at 0x7d9ec17307d0> >> <pyterrier.terrier._text_loader.TerrierTextLoader object at 0x7d9ec27c91d0> >> pt.apply.generic() >> pt.apply.generic() >> <pyterrier.pipelines.PerQueryMaxMinScoreTransformer object at 0x7d9ec2774ad0> >> <__main__.NormalizedPopularityTransformer object at 0x7d9ec27a5010> >> pt.apply.doc_score()),

Save the pipelines we tested in less ambiguous variables to perform live retrieval

In [None]:
# 0 is the bm25 baseline
# 1 is neural ranking (bm25 + bi-encoder + cross-encoder)
# 2 is score based
# 3 is popularity based
# 4 is content based

baseline_bm25 = pipelines[0]
neural_ranking = pipelines[1]
score_based = pipelines[2]
popularity_based = pipelines[3]
content_based_pipeline = pipelines[4]




Reload the validation data, since we altered the original df

In [None]:
# load val data
val_queries_pipeline = pd.read_json('PIR_data/answer_retrieval/val/subset_data.jsonl', lines=True)

## LLM expansion
(We reuse some variables from the LLM section to avoid too much redundancy)

In [None]:
def get_prev_tags_for_user(df, user_id):
    """
    Retrieve the history of tags and queries for a single user ID from the given DataFrame.

    Args:
        df (pd.DataFrame): DataFrame containing 'user_id', 'query', 'tags', and 'timestamp' columns.
        user_id (int): The user ID for whom the history of tags and queries is retrieved.

    Returns:
        tuple: Two lists - previous queries and previous tags for the given user.
    """
    # Filter the DataFrame for the given user ID
    user_df = df[df['user_id'] == user_id]

    # Sort the user's entries by timestamp (newest to oldest)
    user_df = user_df.sort_values('timestamp', ascending=False)

    # Collect the last 3 queries and tags for the user
    prev_queries = user_df['query'].iloc[:3].tolist() if not user_df.empty else []
    prev_tags = user_df['tags'].iloc[:3].tolist() if not user_df.empty else []

    return prev_queries, prev_tags


In [None]:
import re

def enrich_query(query, prev_tags, pipe, generation_args):
    """
    Enrich a single query based on historical tags.

    Args:
        query (str): The user's original query.
        prev_tags (list): List of historical tags associated with the user.
        pipe (callable): Text generation pipeline (e.g., Hugging Face pipeline).
        generation_args (dict): Arguments for the text generation pipeline.

    Returns:
        str: The enriched query.
    """
    # Flatten historical tags
    flattened_tags = [item for sublist in prev_tags for item in sublist]

    # Handle empty or overly long tag lists
    if not flattened_tags:
        flattened_tags = ["None"]
    elif len(flattened_tags) > 2:
        flattened_tags = flattened_tags[:2]  # Keep only the first 2 tags (the most recent, given newest-first ordering) to keep the query manageable

    # Construct the system message
    if flattened_tags == ["None"]:
        messages = [
            {"role": "system", "content": "You are a helpful AI assistant."},
            {"role": "user", "content": f"Given the query, provide a generalized expanded version of the query. Keep the new query with similar length to the original. Do not add anything else. Query: {query}"}
        ]
    else:
        messages = [
            {"role": "system", "content": "You are a helpful AI assistant."},
            {"role": "user", "content": f"Given the query and the user interests given as comma separated values, provide an expanded version of the query with personalized interests. Do not add anything else. Query: {query}, Keywords: {', '.join(flattened_tags)}"}
        ]

    # Get the model output
    output = pipe(messages, **generation_args)
    enriched_query = output[0]['generated_text'].strip()

    # Remove any prefixes like "Query:"
    enriched_query = re.sub(r'(?i)^query:\s*', '', enriched_query)
    print(f'enriched query: {enriched_query}')

    return enriched_query


In [None]:
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
)

generation_args = {
    "max_new_tokens": 500,
    "return_full_text": False,
    "temperature": 0.0,
    "do_sample": False,
}

Device set to use cuda


## Build **input_df** to be able to integrate different types of information

In [None]:
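def build_input_df(user_id, query_id, query, dataframe):
    """Build a single-row DataFrame holding the query and the user's history for the pipelines."""
    user_data = dataframe[dataframe['user_id'] == user_id]

    # Fall back to empty histories when the user is unknown
    user_questions = user_data['user_questions'].iloc[0] if not user_data.empty else []
    user_answers = user_data['user_answers'].iloc[0] if not user_data.empty else []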



    input_df = pd.DataFrame(({
        "qid": query_id,
        "query": query,
        "user_id": user_id,
        "user_questions": [user_questions],
        "user_answers": [user_answers]
    }), index = [0])

    return input_df

# Test pipelines

In [None]:
import time  # used to measure retrieval latency

# Final function to run the customizable search pipeline
def search_pipeline(input, retrieval_type, llm_enriched=False, top_k=10):
    """
    Run a customizable search pipeline.

    Args:
        input (pd.DataFrame): A DataFrame containing at least the following columns:
            - 'query' (str): The query to be processed.
            - 'user_id' (int): The user ID for fetching query history (required if llm_enriched=True).
        retrieval_type (str): The type of retrieval to apply. Must be one of:
            - "baseline": Standard BM25-based retrieval.
            - "popularity": Combines BM25 with document popularity scores.
            - "score": Combines BM25 and additional score-based weighting.
            - "neural_rerank": Applies neural re-ranking models (e.g., bi-encoder or cross-encoder).
            - "content_based": Content-based filtering and ranking.
        llm_enriched (bool, optional): If True, the query is enriched using historical tags and
            an LLM before processing. Defaults to False.
        top_k (int, optional): The number of top results to return. Defaults to 10.

    Returns:
        tuple: (results, elapsed), where results is a DataFrame with the top_k rows and
            elapsed is the retrieval time in seconds (enrichment and cleaning are not timed).

    Raises:
        ValueError: If retrieval_type is not one of the supported values.
    """
    if llm_enriched:
        prev_queries, prev_tags = get_prev_tags_for_user(val_queries, input['user_id'].iloc[0])
        # Pass the query string rather than the whole Series, so the prompt contains only the query text
        input['query'] = enrich_query(input['query'].iloc[0], prev_tags, pipe, generation_args)

    # Clean query and text
    input['query'] = input['query'].apply(clean_text)

    # Select the retrieval pipeline; a local name avoids shadowing transformers.pipeline
    if retrieval_type == "baseline":
        retriever = baseline_bm25
    elif retrieval_type == "popularity":
        retriever = popularity_based
    elif retrieval_type == "score":
        retriever = score_based
    elif retrieval_type == "neural_rerank":
        retriever = neural_ranking
    elif retrieval_type == "content_based":
        retriever = content_based_pipeline
    else:
        raise ValueError(f"Unknown retrieval type: {retrieval_type}")

    # Time only the retrieval step
    start_time = time.perf_counter()
    results = retriever.transform(input).head(top_k)
    end_time = time.perf_counter()
    return results, end_time - start_time
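
The docstring above describes the "popularity" and "score" options as combining BM25 with an additional signal. The snippet below is a minimal sketch of one common way to do that, assuming per-query min-max normalization followed by a weighted sum. The column names `bm25` and `popularity`, the weight `alpha`, and both helper functions are hypothetical and are not taken from the pipelines defined in this notebook.

```python
import pandas as pd

def min_max_per_query(df: pd.DataFrame, col: str) -> pd.Series:
    """Scale `col` to [0, 1] independently within each qid."""
    grouped = df.groupby("qid")[col]
    lo = grouped.transform("min")
    hi = grouped.transform("max")
    # When all values in a query are equal, divide by 1 instead of 0
    span = (hi - lo).replace(0, 1.0)
    return (df[col] - lo) / span

def combine_scores(df: pd.DataFrame, alpha: float = 0.8) -> pd.DataFrame:
    """Weighted sum of normalized BM25 and popularity (alpha is a hypothetical weight)."""
    out = df.copy()
    out["score"] = (
        alpha * min_max_per_query(out, "bm25")
        + (1 - alpha) * min_max_per_query(out, "popularity")
    )
    return out.sort_values(["qid", "score"], ascending=[True, False])

# Toy candidates for a single query (made-up numbers, for illustration only)
candidates = pd.DataFrame({
    "qid": ["q1", "q1", "q1"],
    "docno": ["d1", "d2", "d3"],
    "bm25": [12.0, 10.0, 8.0],
    "popularity": [1.0, 50.0, 25.0],
})

ranked = combine_scores(candidates, alpha=0.5)
print(ranked[["docno", "score"]])
```

Normalizing within each query keeps the two signals on the same scale before they are mixed, so `alpha` can be read directly as the share of the final score that comes from BM25.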




# Check whether different users get slightly different rankings with content-based retrieval

- Take user **935589** (A) and user **30494** (B), both from the validation data


In [None]:
# Choose inputs

query = "Interesting book to read while bored ?"
user_id = 935589
query_id = "prova_1"

input = build_input_df(user_id=user_id, query_id = query_id, query = query, dataframe = val_queries_pipeline)

# Perform a search
results_A, time_A = search_pipeline(
    input,
    retrieval_type = "content_based",
    top_k=10,
)

# sort results by score
results_A = results_A.sort_values(by='score', ascending=False)
output_A = results_A[['docno', 'score']]



In [None]:
# Choose inputs

query = "Interesting book to read while bored ?"
user_id = 30494
query_id = "prova_1"

input = build_input_df(user_id=user_id, query_id = query_id, query = query, dataframe = val_queries_pipeline)

# Perform a search
results_B, time_B = search_pipeline(
    input,
    retrieval_type = "content_based",
    top_k=10,
)

# sort results by score
results_B = results_B.sort_values(by='score', ascending=False)
output_B = results_B[['docno', 'score']]




In [None]:
print(output_A.sort_values(by='score', ascending=False))
print(f'Query execution time: {round(time_A, 2)} seconds')

                docno     score
1       writers_10171  0.855181
0       writers_18427  0.778719
3        writers_7788  0.778289
7        scifi_105312  0.760634
5     parenting_11797  0.706440
6        scifi_194554  0.647856
9    philosophy_67645  0.561249
2           rpg_60069  0.538583
8     philosophy_3766  0.512863
4  hermeneutics_18948  0.475426
Query execution time: 12.14 seconds


In [None]:
print(output_B.sort_values(by='score', ascending=False))
print(f'Query execution time: {round(time_B, 2)} seconds')

                docno     score
1       writers_10171  0.855181
3        writers_7788  0.764286
7        scifi_105312  0.760634
0       writers_18427  0.736711
5     parenting_11797  0.718205
6        scifi_194554  0.647856
9    philosophy_67645  0.578896
2           rpg_60069  0.538583
8     philosophy_3766  0.530510
4  hermeneutics_18948  0.475426
Query execution time: 2.28 seconds


We can notice that the scores **vary between the two users: writers_7788 is ranked 3rd for user A and 2nd for user B, while writers_18427 drops from 2nd place for user A to 4th place for user B**.

Even though the weights change only slightly across users, each user receives documents reranked according to their own context, resulting in a more personalized experience.
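
To make this kind of comparison less error-prone than reading the two tables by eye, the ranks can be lined up programmatically. The snippet below is a small sketch: the two lists are the top-5 docnos copied from the printed outputs for user A and user B, and the helper `compare_rankings` is a hypothetical name introduced here for illustration.

```python
import pandas as pd

# Top-5 docnos in ranked order, copied from the printed outputs for users A and B
ranking_a = ["writers_10171", "writers_18427", "writers_7788",
             "scifi_105312", "parenting_11797"]
ranking_b = ["writers_10171", "writers_7788", "scifi_105312",
             "writers_18427", "parenting_11797"]

def compare_rankings(first, second):
    """Return one row per shared docno with its 1-based rank in each list."""
    rank_first = {doc: pos for pos, doc in enumerate(first, start=1)}
    rank_second = {doc: pos for pos, doc in enumerate(second, start=1)}
    shared = [doc for doc in first if doc in rank_second]
    table = pd.DataFrame({
        "docno": shared,
        "rank_A": [rank_first[d] for d in shared],
        "rank_B": [rank_second[d] for d in shared],
    })
    # Positive values mean the document is ranked higher for user B
    table["rank_change"] = table["rank_A"] - table["rank_B"]
    return table

comparison = compare_rankings(ranking_a, ranking_b)

# Show only the documents whose position differs between the two users
print(comparison[comparison["rank_change"] != 0])
```

In the notebook itself the same helper could be called with `output_A['docno'].tolist()` and `output_B['docno'].tolist()` instead of the hard-coded lists.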

# Run popularity and score retrieval

In [None]:
# Choose inputs

query = "Interesting book to read while bored ?"
user_id = 30494
query_id = "prova_1"

input = build_input_df(user_id=user_id, query_id = query_id, query = query, dataframe = val_queries_pipeline)

# Perform a search
results_C, time_C = search_pipeline(
    input,
    retrieval_type = "popularity",
    top_k=10,
)

# sort results by score
results_C = results_C.sort_values(by='score', ascending=False)
output_C = results_C[['docno', 'score']]




In [None]:
# Choose inputs

query = "Interesting book to read while bored ?"
user_id = 30494
query_id = "prova_1"

input = build_input_df(user_id=user_id, query_id = query_id, query = query, dataframe = val_queries_pipeline)

# Perform a search
results_D, time_D = search_pipeline(
    input,
    retrieval_type = "score",
    top_k=10,
)

# sort results by score
results_D = results_D.sort_values(by='score', ascending=False)
output_D = results_D[['docno', 'score']]




In [None]:
print(output_C.sort_values(by='score', ascending=False))
print(f'Query execution time: {round(time_C, 2)} seconds')

                docno     score
1       writers_10171  0.855181
3        writers_7788  0.764286
7        scifi_105312  0.760634
0       writers_18427  0.736711
5     parenting_11797  0.718205
6        scifi_194554  0.647856
9    philosophy_67645  0.578896
2           rpg_60069  0.538583
8     philosophy_3766  0.530510
4  hermeneutics_18948  0.475426
Query execution time: 3.96 seconds


In [None]:
print(output_D.sort_values(by='score', ascending=False))
print(f'Query execution time: {round(time_D, 2)} seconds')

                docno     score
1       writers_10171  0.855181
3        writers_7788  0.777273
7        scifi_105312  0.772322
0       writers_18427  0.739308
5     parenting_11797  0.720726
6        scifi_194554  0.658246
9    philosophy_67645  0.570340
2           rpg_60069  0.538583
8     philosophy_3766  0.514161
4  hermeneutics_18948  0.491010
Query execution time: 3.5 seconds


# Neural ranking Pipeline

In [None]:
# Choose inputs

query = "Interesting book to read while bored ?"
user_id = 30494
query_id = "prova_1"

input = build_input_df(user_id=user_id, query_id = query_id, query = query, dataframe = val_queries_pipeline)

# Perform a search
results_E, time_E = search_pipeline(
    input,
    retrieval_type = "neural_rerank",
    top_k=10,
)

# sort results by score
results_E = results_E.sort_values(by='score', ascending=False)
output_E = results_E[['docno', 'score']]


In [None]:
print(output_E.sort_values(by='score', ascending=False))
print(f'Query execution time: {round(time_E, 2)} seconds')

                docno     score
2      academia_28724  0.248460
7  christianity_17799  0.217622
9  christianity_63712  0.185049
1       academia_2593  0.170499
8  christianity_59815  0.151260
0      academia_21421  0.098295
3         anime_37173  0.085324
5        apple_241980  0.078947
6       buddhism_1859  0.058347
4          anime_9795  0.007058
Query execution time: 3.32 seconds


# LLM enrichment pipeline

In [None]:
# Choose inputs

query = "Interesting book to read while bored ?"
user_id = 30494
query_id = "prova_1"

input = build_input_df(user_id=user_id, query_id = query_id, query = query, dataframe = val_queries_pipeline)

# Perform a search
results_F, time_F = search_pipeline(
    input,
    retrieval_type = "neural_rerank",
    top_k=10, llm_enriched = True
)

# sort results by score
results_F = results_F.sort_values(by='score', ascending=False)
output_F = results_F[['docno', 'score']]


enriched query: What's an interesting book to read about the Pali Canon while exploring mindfulness?


In [None]:
print(output_F.sort_values(by='score', ascending=False))
print(f'Query execution time: {round(time_F, 2)} seconds')

              docno     score
4  boardgames_33340  0.616216
2        anime_4030  0.455176
9     buddhism_1859  0.446735
8    buddhism_10754  0.341367
6    buddhism_10128  0.276856
0    academia_28724  0.146353
1       anime_37173  0.087344
3       anime_47766  0.057944
5  boardgames_33343  0.045858
7    buddhism_10518  0.033839
Query execution time: 2.4 seconds


# Inspect the text of a document given its docno

In [None]:
import textwrap
docno_to_retrieve = "buddhism_2433"

# Load the original documents without preprocessing to improve readability
original_corpus = pd.read_json('PIR_data/answer_retrieval/subset_answers.json', orient='index')
original_corpus.reset_index(inplace=True)
original_corpus.columns = ['docno', 'text']


top_document = original_corpus[original_corpus["docno"] == docno_to_retrieve]

# Check if the document exists
if not top_document.empty:
    # Extract the text
    document_text = top_document["text"].iloc[0]

    # Wrap the text for better readability
    wrapped_text = textwrap.fill(document_text, width=80)

    # Print the document beautifully
    print(f"Document with docno '{docno_to_retrieve}':\n")
    print(wrapped_text)
else:
    print(f"No document found with docno '{docno_to_retrieve}'.")


Document with docno 'buddhism_2433':

1. Is the Pali Canon the only canon in modern Buddhism?No. Additionally, as far
as I know, there are also many versions of the Chinese Canon which contain the
Āgamas (parallels of the Nikāyas from the Suttapiṭaka of the Pāli Canon),
Mahāyāna sūtras & Vajrayāna sūtras.There is also the Tibetan Canon which
contains the Kangyur (material which is similar to the Chinese Canon) and the
Tengyur (Commentarial works, Abhidharma and additional treatises).2. So would
the term canon be completely synonymous with Pali Canon in the context of
Buddhism?Bearing in mind what was said above, no.3. What is the status of texts
such as Bodhicaryāvatāra which aren't in the Pali canon?That text is present in
the Chinese Canon1 & Tibetan Canon2.


## Conclusion

The project successfully built a personalized search engine that can rerank results based on user profiles. We faced challenges like understanding the PyTerrier framework and hardware limitations but managed by dividing tasks efficiently.

Future work could include tuning the configuration weights, fine-tuning the models, testing on larger datasets, and optimizing the content-based scores to improve the system further.
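
As a rough illustration of what tuning the configuration weights could look like, the sketch below runs a small grid search over a single interpolation weight and keeps the value with the best mean reciprocal rank. Everything in it is an assumption made for the example: the toy candidates, the `bm25` and `profile` columns, the `qrels` dictionary, and the helper names do not come from the notebook's pipelines, and a real search would be run on the validation split rather than the test split.

```python
import pandas as pd

def reciprocal_rank(ranked_docnos, relevant):
    """Return 1/rank of the first relevant document, or 0.0 if none is found."""
    for position, doc in enumerate(ranked_docnos, start=1):
        if doc in relevant:
            return 1.0 / position
    return 0.0

def evaluate(alpha, candidates, qrels):
    """Mean reciprocal rank when scores are mixed with weight alpha (hypothetical setup)."""
    scored = candidates.assign(
        score=alpha * candidates["bm25"] + (1 - alpha) * candidates["profile"]
    )
    values = []
    for qid, group in scored.groupby("qid"):
        ranked = group.sort_values("score", ascending=False)["docno"].tolist()
        values.append(reciprocal_rank(ranked, qrels.get(qid, set())))
    return sum(values) / len(values)

# Toy data: already-normalized scores for two queries (made-up numbers)
candidates = pd.DataFrame({
    "qid": ["q1", "q1", "q1", "q2", "q2", "q2"],
    "docno": ["a", "b", "c", "d", "e", "f"],
    "bm25": [1.0, 0.6, 0.0, 1.0, 0.5, 0.0],
    "profile": [0.0, 1.0, 0.5, 0.2, 0.0, 1.0],
})
qrels = {"q1": {"b"}, "q2": {"d"}}

# Evaluate every weight on the grid and keep the best one
grid = [0.0, 0.25, 0.5, 0.75, 1.0]
results = {alpha: evaluate(alpha, candidates, qrels) for alpha in grid}
best_alpha = max(results, key=results.get)

print(results)
print(f"best alpha: {best_alpha}")
```

With more than one weight to tune, the same loop extends to a product of grids, at the cost of one full evaluation per combination.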
