This repository contains the implementation for my BSc thesis, completed as part of the TU Delft 2024 Research Project; the accompanying paper can be found here. We explore methods to improve the ranking quality of long and complex queries across several ad-hoc retrieval tasks, using the Fast-Forward index framework. The methods explored include query reduction using large language models and re-ranking with multiple semantic models.
Install all necessary dependencies:
pip install -r requirements.txt
In order to run this code, each dataset requires both a sparse and a dense index. Sparse indexing was done with PyTerrier using the script pt_index.py. All datasets used are provided by ir_datasets and can be accessed through PyTerrier. Dense indexing was done using the Fast-Forward index framework; the scripts for all three dense encoders are available in the fast_forward_indexing directory.
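At retrieval time, the Fast-Forward framework re-ranks the sparse candidate list by interpolating sparse and dense scores, s(q, d) = α · s_sparse(q, d) + (1 − α) · s_dense(q, d). The sketch below illustrates only this interpolation step in plain Python; the function name and the toy scores are illustrative, not taken from the project's scripts:

```python
def interpolate(sparse_scores, dense_scores, alpha=0.5):
    """Fast-Forward-style interpolation: alpha * sparse + (1 - alpha) * dense.

    Only documents retrieved by the sparse stage are re-ranked; a document
    missing a dense score contributes 0.0 for the dense component.
    """
    return {
        doc_id: alpha * s + (1 - alpha) * dense_scores.get(doc_id, 0.0)
        for doc_id, s in sparse_scores.items()
    }


# Toy scores for one query (illustrative values only).
sparse = {"d1": 12.0, "d2": 9.5, "d3": 8.0}   # e.g. BM25
dense = {"d1": 0.2, "d2": 0.9, "d3": 0.7}     # e.g. dual-encoder similarity

# Re-rank the sparse candidates by the interpolated score.
reranked = sorted(interpolate(sparse, dense, alpha=0.1).items(),
                  key=lambda kv: kv[1], reverse=True)
```

Note that because sparse and dense scores live on different scales, the choice of α matters a great deal, which is why it is tuned on a development set (see scifact_alpha_tuning below).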
Note: due to their large storage size, the indexes cannot be uploaded to this repository. The indexing process is very resource-intensive and was primarily conducted at the Delft High Performance Computing Centre. As this may be a limitation for some users, the indexes are also available upon request.
The repository is organized as follows:
- fast_forward_indexing - all scripts related to dense indexing. /fast_forward_indexing/script_pt.sh contains the bash script used when indexing on the DelftBlue supercomputer.
- length_experiments - scripts that measure the retrieval quality of each individual query in a dataset and plot it against query length.
- multi_rerank - all experiments related to using multiple dense re-rankers in the Fast-Forward framework.
  - generate_scores - generates the final ranking scores before interpolation.
  - multi_rank - experiments that compare ranking performance for various numbers of dense re-rankers.
  - scifact_alpha_tuning - script that tunes the alpha values to their optimal values on the development set.
- query_reduction - all experiments related to query reduction using LLMs.
  - llama3_reduce.py - script that generates the reductions using the Meta-Llama-3-8B-Instruct model.
  - reduced_queries - directory that stores the generated reduced queries in CSV format.
  - system_prompts.txt - system prompts used for each dataset.
  - eval_reduction_* - scripts that compare ranking quality between the original and reduced queries.
- sparse_indexing - all scripts related to sparse indexing.
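To illustrate the kind of tuning scifact_alpha_tuning performs, the sketch below grid-searches the interpolation weight α on a development set and keeps the value with the best mean metric. Everything here is a toy stand-in (the reciprocal-rank metric, the query dicts, and the helper names are assumptions for illustration, not the project's actual tuning code):

```python
def interpolate(sparse, dense, alpha):
    # alpha * sparse score + (1 - alpha) * dense score per candidate document
    return {d: alpha * s + (1 - alpha) * dense.get(d, 0.0) for d, s in sparse.items()}


def reciprocal_rank(scores, relevant):
    """Toy effectiveness metric: 1 / rank of the first relevant document."""
    ranking = sorted(scores, key=scores.get, reverse=True)
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


def tune_alpha(dev_queries, grid):
    """Return the alpha from `grid` with the best mean metric on the dev set."""
    best_alpha, best_score = None, -1.0
    for alpha in grid:
        mean = sum(
            reciprocal_rank(interpolate(q["sparse"], q["dense"], alpha), q["relevant"])
            for q in dev_queries
        ) / len(dev_queries)
        if mean > best_score:
            best_alpha, best_score = alpha, mean
    return best_alpha, best_score


# Toy development set (illustrative scores and judgments only).
dev = [
    {"sparse": {"d1": 10.0, "d2": 4.0}, "dense": {"d1": 0.1, "d2": 0.9}, "relevant": {"d2"}},
    {"sparse": {"d3": 6.0, "d4": 5.0}, "dense": {"d3": 0.2, "d4": 0.8}, "relevant": {"d4"}},
]
alpha, score = tune_alpha(dev, grid=[i / 10 for i in range(11)])
```

In the real experiments the metric would be computed with an IR evaluation toolkit against the dataset's qrels rather than a hand-written reciprocal rank.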