COL764-Information-Retrieval/A2 at main · ChinmayMittal/COL764-Information-Retrieval

History

Name		Name	Last commit message	Last commit date
parent directory ..
Images		Images
col764-ass2-release		col764-ass2-release
relevance-corpus		relevance-corpus
trec_eval-9.0.7		trec_eval-9.0.7
vectors		vectors
word2vec		word2vec
.DS_Store		.DS_Store
2020CS10336.pdf		2020CS10336.pdf
A2.pdf		A2.pdf
README.md		README.md
build.sh		build.sh
constants.py		constants.py
embeddings.sh		embeddings.sh
expansions.txt		expansions.txt
generate.py		generate.py
lm.py		lm.py
lm_rerank.py		lm_rerank.py
lm_rerank.sh		lm_rerank.sh
plot.py		plot.py
score.py		score.py
temp-plot.py		temp-plot.py
tokenizer.py		tokenizer.py
utils.py		utils.py
w2v_rerank.py		w2v_rerank.py
w2v_rerank.sh		w2v_rerank.sh

README.md

Build Script (build.sh) This file will be used to build the source code for word2vec and TREC eval tool

bash build.sh

Running Task-1 (Pseudo Relevance Language Modelling)

bash lm_rerank.sh [query-file] [top-100-file] [collection-dir] [output-file]

Eg.

bash lm_rerank.sh ./col764-ass2-release/covid19-topics.xml ./col764-ass2-release/t40-top-100.txt /home/baadalvm/COL764/2020-07-16/ ./output-lm.txt

Running TREC Eval

./trec_eval-9.0.7/trec_eval -m ndcg -m ndcg_cut.5,10,50 <Query-Results-File> <Output-To-Be-Tested>

eg.

./trec_eval-9.0.7/trec_eval -m ndcg -m ndcg_cut.5,10,50 ./col764-ass2-release/t40-qrels.txt ./output-lm.txt

Running Custom Eval (score.py)

python score.py --gold_query_relevances_path <path> --results_path <path>

eg.

python score.py --gold_query_relevances_path ./col764-ass2-release/t40-qrels.txt --results_path ./output-lm.txt

Running Local Embedding based Query Expansion Reranking

bash w2v_rerank.sh [query-file] [top-100-file] [collection-dir] [output-file] [expansions-file]

Eg.

bash w2v_rerank.sh ./col764-ass2-release/covid19-topics.xml ./col764-ass2-release/t40-top-100.txt /home/baadalvm/COL764/2020-07-16/ ./output.w2v.txt ./expansions.txt

Note: That during the execution of this task, Two folders are created:

vectors (This stores the embeddings learnt for each Query)
relevance-corpus (This stores the text files for each query on which word2vec is trained)

Other Files:

constants.py: contains hyperparameters used for both the tasks tuned to their best values
utils.py: basic utility funcitons, such as those for processing text, parsing PDFs, JSON etc.
lm.py: Contains implementations of basic Uni-gram Language models with Dirichilet Smoothing
tokenizer.py: Simple Tokenizer
lm_rerank.py: Main Python script used by Task-1
w2v_rerank.py: Main Python script used by Task-2
generate.py: Used in Task-2 to generate training corpus for Word2Vec
./word2vec: Source code from Google for Word2Vec Training
trec_eval-9.0.7: Source code for the TREC eval tool
embeddings.sh: Used to Train Word2Vec for each Query on the Relevance Corpus Generated

Running Individual Python Files

python lm_rerank.py --query_file_path ./col764-ass2-release/covid19-topics.xml --original_rankings_path ./col764-ass2-release/t40-top-100.txt --corpus_path /Users/chinmaymittal/Downloads/2020-07-16 --output_path ./output-lm.txt

Running Word Embeddings Based Reranking

python w2v_rerank.py  --query_file_path ./col764-ass2-release/covid19-topics.xml --original_rankings_path ./col764-ass2-release/t40-top-100.txt --corpus_path /Users/chinmaymittal/Downloads/2020-07-16 --output_path ./output-w2v.txt --expansions_path ./expansions.txt --vectors_dir ./vectors

Generate Relevance Corpus for Each Query to Train Word2Vec

python generate.py --query_file_path ./col764-ass2-release/covid19-topics.xml --original_rankings_path ./col764-ass2-release/t40-top-100.txt --corpus_path /Users/chinmaymittal/Downloads/2020-07-16 --output_dir ./relevance-corpus

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

A2

A2

README.md

Files

A2

Directory actions

More options

Directory actions

More options

Latest commit

History

A2

Folders and files

parent directory

README.md