Code for my BSc thesis "Exploring methods to improve effectiveness of ad-hoc retrieval systems for long and complex queries"

Erhan1706/fast-forward-long-query-effectiveness

Exploring methods to improve effectiveness of ad-hoc retrieval systems for long and complex queries

This repository contains the implementation for my final BSc thesis, written as part of the TU Delft 2024 Research Project; the paper can be found here. We explore methods to improve the ranking quality of long and complex queries across ad-hoc retrieval tasks, using the Fast-Forward index framework. The methods explored include query reduction using large language models and re-ranking with multiple semantic models.

Installation & Usage

Install all necessary dependencies:

pip install -r requirements.txt

Running this code requires both a sparse and a dense index for each dataset. Sparse indexing was done with PyTerrier, using the script pt_index.py. All datasets used are provided by ir_datasets and can be accessed through PyTerrier. Dense indexing was done with the Fast-Forward index framework; the scripts for all 3 dense encoders are available in the fast_forward_indexing directory.

Note: because of their large size, the indexes cannot be uploaded to this repository. The indexing process is very resource-intensive and was primarily conducted at the Delft High Performance Computing Centre. As this may be a limitation for some users, the indexes are also available upon request.

Overview

The repository is organized as follows:

  • fast_forward_indexing - includes all the scripts related to dense indexing. /fast_forward_indexing/script_pt.sh contains the bash script used for indexing on the DelftBlue supercomputer.
  • length_experiments - collection of scripts that measure the retrieval quality for each individual query of a dataset and plot it against query length.
  • multi_rerank - contains all the experiments related to utilising multiple dense re-rankers in the Fast-Forward framework.
    • generate_scores - generate the final ranking scores before interpolation
    • multi_rank - experiments that compare the ranking performance for various numbers of dense re-rankers.
    • scifact_alpha_tuning - script that tunes the alpha values on the development set.
  • query_reduction - contains all the experiments related to query reduction using LLMs.
    • llama3_reduce.py - script that generates the reductions using the Meta-Llama-3-8B-Instruct model.
    • reduced_queries - directory that stores the generated reduced queries in CSV format.
    • system_prompts.txt - system prompts utilised for each dataset.
    • eval_reduction_* - scripts that compare ranking quality between the original and reduced queries.
  • sparse_indexing - includes all the scripts related to sparse indexing.
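
To illustrate the interpolation step that the multi_rerank experiments build on, here is a minimal sketch in plain Python. With a single dense re-ranker, Fast-Forward combines scores as alpha * sparse + (1 - alpha) * dense. The extension to several dense re-rankers shown here (one weight per score source, with weights summing to 1) is an assumption made for illustration and may differ from what the scripts in this repository do. The function names `interpolate` and `rank`, the document IDs, and all score values are hypothetical.

```python
from typing import Dict, List, Sequence


def interpolate(
    sparse: Dict[str, float],
    dense_runs: Sequence[Dict[str, float]],
    alphas: Sequence[float],
) -> Dict[str, float]:
    """Combine one sparse score and several dense scores per document.

    alphas[0] weights the sparse score and alphas[1:] weight the dense
    runs, in order. The weights must sum to 1 (assumed convention).
    """
    if len(alphas) != len(dense_runs) + 1:
        raise ValueError("need one weight for the sparse run and one per dense run")
    if abs(sum(alphas) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")

    combined: Dict[str, float] = {}
    for doc_id, sparse_score in sparse.items():
        score = alphas[0] * sparse_score
        # Every candidate from the sparse stage is assumed to have a
        # score in each dense run.
        for weight, run in zip(alphas[1:], dense_runs):
            score += weight * run[doc_id]
        combined[doc_id] = score
    return combined


def rank(scores: Dict[str, float]) -> List[str]:
    """Return document IDs sorted by descending score."""
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical scores for one query and three candidate documents.
sparse = {"d1": 10.0, "d2": 8.0, "d3": 6.0}
dense_a = {"d1": 0.2, "d2": 0.9, "d3": 0.5}
dense_b = {"d1": 0.1, "d2": 0.8, "d3": 0.9}

single = rank(interpolate(sparse, [dense_a], [0.5, 0.5]))
multi = rank(interpolate(sparse, [dense_a, dense_b], [0.1, 0.45, 0.45]))
print(single, multi)
```

In this toy example, adding a second dense run and lowering the sparse weight changes the top-ranked document, which is the kind of effect that tuning the alpha values on a development set is meant to control.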