reRankDocs

Psuedo Relevance Feedback using Query Expansion and Reranking

To call the RM1 model -

Call lm_rerank.sh with the apt arguments. The arguments expected by this file (in order) are called as bash lm_rerank.sh [query-file] [top-100-file] [collection-dir] [output-file] [expansions-file]
The script calls top_rerank_rm1.py, and passes the same arguments as above to it. This file is the flow-control of the entire task.
top_rerank_rm1.py calls a series of healper files for reading some of the data. These are :-
- read_csv.py :- Houses functionality to read the metadata.csv in the [collection-dir]
- read_qfile.py :- Houses functionality to read and parse the [query-file]
- read_top100.py :- Houses functionality to read and parse the [top-100-file]
- read_tjson.py :- To read and parse a json file (pmc_json/pdf_json)
rm1.py is where the algorithmic implementations of the RM1 model are housed. These include computing the per-document LM, the global-collection LM, and functionality for calculating the query-document score. This score can now be used to re-rank the documents for a given query.
top_rerank.py iterates over the queries. For each query, it computes a score for the top100 documents that have been retrieved . Arranging these is a descending order, these results are written to the [output-file].

Call w2v_rerank.sh with the apt arguments. The arguments expected by this file (in order) are called as bash w2v_rerank.sh [query-file] [top-100-file] [collection-dir] [output-file] [expansion-file]
The script calls top_rerank_w2v.py, and passes the same arguments as above to it. This file is the flow-control of the entire task.
top_rerank_w2v.py calls a series of healper files for reading some of the data. These are :-
- read_csv.py :- Houses functionality to read the metadata.csv in the [collection-dir]
- read_qfile.py :- Houses functionality to read and parse the [query-file]
- read_top100.py :- Houses functionality to read and parse the [top-100-file]
- read_tjson.py :- To read and parse a json file (pmc_json/pdf_json)
word2vec trains on the intm_data/i.txt for the ith query, and yields a intm_data/vector_i.bin, from which the embedding matrix U can be extracted for this query. This is then used to calculate a per-term score for terms in the vocabulary via U* U^T * q, and the top-k terms thus appearing are selected as expansion terms.
These expansion terms are appended to the original query terms, and the RM1 model above is then applied for re-ranking the scores.
The re-ranked results are written to the [output-file] and the expansions to [expansions-file]

Copy the [output-file] and the [t40-qrels.txt] (the relevance scores file) into trec_eval-9.0.7/.
Ensure that the trec_eval-9.0.7/ has been compiled. Else run make.
Run ./trec_eval -m ndcg -m ndcg_cut.5,10,50 [t40-qrels.txt] [output-file] to obtain the nDCG values!

Name		Name	Last commit message	Last commit date
Latest commit History 29 Commits
intm_data		intm_data
trec_eval-9.0.7		trec_eval-9.0.7
AlgorithmAndResults.pdf		AlgorithmAndResults.pdf
README.md		README.md
build.sh		build.sh
covid19-topics.xml		covid19-topics.xml
distance		distance
distance.c		distance.c
distance.py		distance.py
expansions.txt		expansions.txt
getndcg.sh		getndcg.sh
lm_rerank.sh		lm_rerank.sh
out.txt		out.txt
out_rm1.txt		out_rm1.txt
out_w2v.txt		out_w2v.txt
read_csv.py		read_csv.py
read_qfile.py		read_qfile.py
read_tjson.py		read_tjson.py
read_top100.py		read_top100.py
rm1.py		rm1.py
t40-qrels.txt		t40-qrels.txt
t40-top-100.txt		t40-top-100.txt
top_rerank_rm1.py		top_rerank_rm1.py
top_rerank_w2v.py		top_rerank_w2v.py
w2v_rerank.sh		w2v_rerank.sh
word2vec		word2vec
word2vec.c		word2vec.c