IR-TermProject

Information Retrieval Term Project, Department of CSE, IIT Kharagpur

Group Number: 22

Group Members:

  1. Arnab Kumar Mallick - 18CH10011
  2. Arghyadeep Bandyopadhyay - 18EE10012
  3. Subham Karmakar - 18EE10067
  4. Sankalp Srivastava - 18EE10069

Data: https://drive.google.com/drive/folders/1ELskVvmr4snskq8vUfPh0gtlDBBgPhQ9?usp=sharing

Task 1A: Generates the pickle file and the indexed-documents file. The pickle file stores doc-ids instead of the actual document paths; a doc-id is the index of that document in the file indexed_docs.txt. The pickle file contains a dictionary with terms as keys and, for each term, a postings list of [doc-id, term-frequency] pairs as its value.
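A minimal sketch of the index structure described above (not the actual Task 1A script); the function name and the output file name are illustrative assumptions.

import pickle
from collections import defaultdict

def build_inverted_index(indexed_docs):
    """indexed_docs: list of token lists; the doc-id is the list index."""
    inverted_index = defaultdict(list)            # term -> [[doc-id, tf], ...]
    for doc_id, tokens in enumerate(indexed_docs):
        counts = {}
        for term in tokens:
            counts[term] = counts.get(term, 0) + 1
        for term, tf in counts.items():
            inverted_index[term].append([doc_id, tf])
    return dict(inverted_index)

# Example: two tiny documents; doc-ids 0 and 1 correspond to their positions
# in indexed_docs.txt.
index = build_inverted_index([["india", "cricket", "india"], ["cricket", "news"]])
# index == {'india': [[0, 2]], 'cricket': [[0, 1], [1, 1]], 'news': [[1, 1]]}
with open("inverted_index.pth", "wb") as f:       # hypothetical output path
    pickle.dump(index, f)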

Task 1B: Generates queries file.

Task 1C: Logic and Algorithm:

  1. Considered the tokens from each query as if they are joined by 'AND' logic.
  2. Used the trivial merge algorithm (for 'AND' logic) for boolean retrieval discussed in class to merge the postings lists of the specific tokens obtained from the queries above (see the sketch below).
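A minimal sketch of the two-pointer AND merge, assuming each postings list is sorted by doc-id as stored in the Task 1A index; the helper names are assumptions, not the actual Task 1C script.

def intersect(list_a, list_b):
    """Two-pointer merge of two sorted doc-id lists ('AND' logic)."""
    i, j, result = 0, 0, []
    while i < len(list_a) and j < len(list_b):
        if list_a[i] == list_b[j]:
            result.append(list_a[i])
            i += 1
            j += 1
        elif list_a[i] < list_b[j]:
            i += 1
        else:
            j += 1
    return result

def boolean_and_query(tokens, inverted_index):
    """AND-merge the postings of all query tokens; postings hold [doc-id, tf]."""
    postings = [[p[0] for p in inverted_index.get(t, [])] for t in tokens]
    if not postings:
        return []
    postings.sort(key=len)                 # merge the shortest lists first
    result = postings[0]
    for plist in postings[1:]:
        result = intersect(result, plist)
    return result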

Assumption: indexed_docs.txt file must be present in the root directory of the project.

Task 2A (TF-IDF Vectorization)

Performed by Arghyadeep Bandyopadhyay (Roll No: 18EE10012)

Steps used:

  1. The paths and file names of all the documents are extracted from the ‘en_BDNews24’ folder.
  2. The inverted index file is read, and the document frequencies (df) along with the vocabulary are stored.
  3. For each document, the document text is stored as a list of terms using the inverted index.
  4. The text file containing the queries is read, and each query is stored as a list of the terms contained in that query.
  5. A |V|-dimensional TF-IDF vector is obtained for each query with the given weighting scheme.
  6. For each document, the |V|-dimensional TF-IDF vector is first obtained with the given weighting scheme. Then, for each query vector, the cosine similarity (with normalization) between the query vector and the current document vector is computed. The process is repeated for all the documents, and the cosine similarity value for each query-document pair is stored (see the sketch after this list).
  7. For each query, the cosine similarity scores are sorted in descending order, and the top 50 documents are stored in a 2-column CSV file.
  8. Steps 5 to 7 are performed for three ddd.qqq schemes, namely scheme ‘A’ (lnc.ltc), scheme ‘B’ (Lnc.Lpc) and scheme ‘C’ (anc.apc).
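A minimal sketch of steps 5 and 6 for scheme ‘A’ (lnc.ltc); the function names, the dense |V|-dimensional NumPy vectors and the vocab_index/df structures are illustrative assumptions, not the actual PAT2_22_ranker.py implementation.

import math
import numpy as np

def lnc_vector(term_counts, vocab_index):
    """Document vector for scheme A: log tf, no idf, cosine normalization."""
    vec = np.zeros(len(vocab_index))
    for term, tf in term_counts.items():
        if term in vocab_index and tf > 0:
            vec[vocab_index[term]] = 1.0 + math.log10(tf)
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec

def ltc_vector(term_counts, vocab_index, df, n_docs):
    """Query vector for scheme A: log tf, idf, cosine normalization."""
    vec = np.zeros(len(vocab_index))
    for term, tf in term_counts.items():
        if term in vocab_index and tf > 0 and df.get(term, 0) > 0:
            idf = math.log10(n_docs / df[term])
            vec[vocab_index[term]] = (1.0 + math.log10(tf)) * idf
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec

# Both vectors are already unit length, so the normalized cosine similarity of
# a query-document pair reduces to a dot product:
# score = float(np.dot(doc_vec, query_vec))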

Assumptions / Changes:

The term frequencies are not computed separately; they are assumed to be stored in the inverted index itself.
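Under this assumption, the document frequency of a term is simply the length of its postings list. A minimal sketch of reading these off the pickled index follows; treating the command-line .pth argument as the Task 1A inverted index is itself an assumption.

import pickle
import sys

# Assumption: the .pth file passed on the command line is the pickled
# inverted index from Task 1A (term -> [[doc-id, tf], ...]).
with open(sys.argv[1], "rb") as f:
    inverted_index = pickle.load(f)

df = {term: len(postings) for term, postings in inverted_index.items()}
vocab = sorted(inverted_index.keys())                  # the |V| vocabulary terms
vocab_index = {term: i for i, term in enumerate(vocab)}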

Extra input / parameters:

The path to the “queries_<GROUP_NO>.txt” file is to be given along with the path to the inverted index file and the “en_BDNews24” folder. Here, GROUP_NO is 22.

To run Task 2A:

$>> python3 PAT2_22_ranker.py <path_to_model_queries_22.pth>

Python version used: 3.6.8

Library Requirements:

  1. os
  2. sys
  3. pickle
  4. numpy
  5. math
  6. csv
  7. collections

Task 2B: Logic:

  1. Parsed the generated ranked lists A, B and C, the gold-standard ranked lists and the queries into arrays.
  2. Maintained a mapping from each query id to its relevant documents and their relevance scores.
  3. We then compared which documents in our ranked list are present in the gold-standard list and calculated Precision@K and the Average Precision.
  4. The relevance scores of the retrieved documents are then stored in an array and compared to the sorted array of relevance scores; this is how we calculate the NDCG (see the sketch after this list).
  5. We then average over all the queries to obtain the averaged metrics.
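A minimal sketch of the three metrics described above, using their standard definitions; the function names and the convention that a relevance score greater than 0 means "relevant" are assumptions, not the actual PAT2_22_evaluator.py code.

import numpy as np

def precision_at_k(relevances, k):
    """Fraction of the top-k retrieved documents that are relevant (> 0)."""
    return sum(1 for r in relevances[:k] if r > 0) / k

def average_precision(relevances, k):
    """Mean of Precision@i taken at each relevant position in the top-k."""
    hits, precisions = 0, []
    for i, r in enumerate(relevances[:k], start=1):
        if r > 0:
            hits += 1
            precisions.append(hits / i)
    return float(np.mean(precisions)) if precisions else 0.0

def ndcg_at_k(relevances, k):
    """DCG of the ranked list divided by the DCG of the ideally sorted list."""
    def dcg(scores):
        return sum(s / np.log2(i + 1) for i, s in enumerate(scores, start=1))
    ideal = dcg(sorted(relevances, reverse=True)[:k])
    return dcg(relevances[:k]) / ideal if ideal > 0 else 0.0

# Example: relevance scores of the top-5 retrieved documents for one query.
rels = [2, 0, 1, 0, 2]
print(precision_at_k(rels, 5), average_precision(rels, 5), ndcg_at_k(rels, 5))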

Assumptions:

  1. The Data folder contains rankedRelevantDocList.csv
  2. queries_22.txt must be present in the root directory of the project.
  3. PAT2_22_ranked_list.csv must be present in the root directory of the project.

To run Task 2B:
$>> python PAT2_22_evaluator.py ./Data/rankedRelevantDocList.csv PAT2_22_ranked_list_A.csv
$>> python PAT2_22_evaluator.py ./Data/rankedRelevantDocList.csv PAT2_22_ranked_list_B.csv
$>> python PAT2_22_evaluator.py ./Data/rankedRelevantDocList.csv PAT2_22_ranked_list_C.csv

The output will be generated in the root directory:
PAT2_22_metrics_A.csv
PAT2_22_metrics_B.csv
PAT2_22_metrics_C.csv

Python version used: 3.9.7

Library Requirements:

  1. sys
  2. csv
  3. numpy
  4. tabulate
  5. pickle
  6. bs4
  7. re
  8. os
