Create an index of the downloaded corpus. The documents are found within the ap89_collection folder in the data .zip file. You will need to write a program to parse the documents and send them to your elasticsearch instance.

The corpus files are in a standard format used by TREC. Each file contains multiple documents. The format is similar to XML, but standard XML and HTML parsers will not work correctly. Instead, read the file one line at a time with the following rules:

Each document begins with a line containing <DOC> and ends with a line containing </DOC>.
The first several lines of a document’s record contain various metadata. You should read the <DOCNO> field and use it as the ID of the document.
The document contents are between lines containing <TEXT> and </TEXT>.
All other file contents can be ignored.
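
The line-by-line rules above can be sketched as a small generator. This is only an illustration: the function name `parse_trec_lines` and the sample record are made up for the example, and the sketch assumes each tag sits at the start of its own line.

```python
def parse_trec_lines(lines):
    """Yield (docno, text) pairs from an iterable of lines in TREC format."""
    docno, text_lines, in_text = None, [], False
    for raw in lines:
        line = raw.strip()
        if line.startswith("<DOC>"):
            # A new document starts: reset the per-document state.
            docno, text_lines, in_text = None, [], False
        elif line.startswith("<DOCNO>"):
            docno = line.replace("<DOCNO>", "").replace("</DOCNO>", "").strip()
        elif line.startswith("<TEXT>"):
            in_text = True
        elif line.startswith("</TEXT>"):
            in_text = False
        elif line.startswith("</DOC>"):
            if docno is not None:
                yield docno, " ".join(text_lines)
        elif in_text and line:
            # Only lines between <TEXT> and </TEXT> are kept; metadata is ignored.
            text_lines.append(line)


# Hypothetical record, shaped like the format described above.
sample = [
    "<DOC>",
    "<DOCNO> AP890101-0001 </DOCNO>",
    "<FILEID>AP-NR-01-01-89</FILEID>",
    "<TEXT>",
    "First line.",
    "Second line.",
    "</TEXT>",
    "</DOC>",
]
parsed = list(parse_trec_lines(sample))
```

Because text lines accumulate until `</DOC>`, a document with several `<TEXT>` blocks keeps all of them.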
Make sure to index term positions: you will need them later. You are also free to add any other fields to your index for later use. This might be the easiest way to get particular values used in the scoring functions. However, if a value is provided by elasticsearch then we encourage you to retrieve it using the elasticsearch API rather than calculating and storing it yourself.
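
One possible way to ask for term positions is an explicit mapping that sets `term_vector` on the text field. The sketch below only builds the request body as a plain dictionary. The helper name `build_index_body` is hypothetical, and the type-keyed layout assumes an Elasticsearch version that still accepts mapping types, as the `doc_type` argument used in this notebook suggests. `fielddata` is enabled on the assumption that aggregations will be run on the text field.

```python
def build_index_body(doc_type="document", field="text"):
    """Return an index-creation body that stores term vectors with positions."""
    return {
        "mappings": {
            doc_type: {
                "properties": {
                    field: {
                        "type": "text",
                        # Keep per-document term vectors including positions.
                        "term_vector": "with_positions",
                        # Assumption: needed for aggregations on a text field.
                        "fielddata": True,
                    }
                }
            }
        }
    }


index_body = build_index_body()
# Assumed usage, before any document is indexed:
# es.indices.create(index="ap_dataset", body=index_body)
```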

In [2]:
import os 
import nltk
import math 
import re
import time
import pandas as pd
import numpy as np
from elasticsearch import Elasticsearch 
from nltk.stem import PorterStemmer
from collections import OrderedDict

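## Module-level names filled in by later cells. A `global` statement has no
## effect at module scope, so the names are only listed here for reference:
## average_doc_length, total_doc, vocab_size, dict_ids,
## df_term_freq, df_ttf_freq, df_doc_freq, len_of_doc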


In [3]:
## Elasticsearch instance
es = Elasticsearch()

## Initialize stemmer
ps = PorterStemmer()

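## Get all the docs from the file location
file_all = os.path.join(os.getcwd(), "HW1", "AP_DATA", "ap89_collection")

## Remove the readme file (assumed to be named "readme"). os.listdir returns
## names in arbitrary order, so filter by name instead of dropping the last entry
list_of_all = [f for f in os.listdir(file_all) if not f.lower().startswith("readme")]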

In [4]:
## Read the data from all sources

total_doc = 0
dict_temp = []
doc_id_list_mapping = {}
for doc in list_of_all:
    with open(os.path.join(file_all, doc), "r") as file_data:
        file_data_read = file_data.read().split("<DOC>")
    for each_file in file_data_read:
        ## str.strip("...") removes a set of characters, not a tag, so use regexes instead
        doc_id_match = re.search(r"<DOCNO>(.*?)</DOCNO>", each_file, re.DOTALL)
        if doc_id_match:
            total_doc += 1
            doc_id = doc_id_match.group(1).strip()
            doc_id_list_mapping[total_doc] = doc_id
            ## a document may contain more than one <TEXT> block; keep them all
            text_blocks = re.findall(r"<TEXT>(.*?)</TEXT>", each_file, re.DOTALL)
            doc_content = "\n".join(block.strip() for block in text_blocks)
            dict_to_iterate = {"docno": doc_id, "text": doc_content}
            dict_temp.append(dict_to_iterate)


In [None]:
## Save the data in elastic search

i = 1
for es_doc in dict_temp:
    es.index(index = "ap_dataset", doc_type = "document", id = i, body = es_doc)
    i += 1
    if i % 1000 == 0:
        print ("AT i: ", i)


In [5]:
## Read queries

with open(r"C:\Users\mm199\IR-hw\HW1_data\AP_DATA\query_desc.51-100.short.txt", "r") as query:
    query_list = query.readlines()
    
## define the common words from queries
common_words = ['Document', 'discuss', 'exist','determine', 'current','pay','even','taken','type', 'report', 'describe', 'cite', 'include', 'identify', 'make', 'one', 'must', 'second', 'use', 'side', 'take', 'predict']

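## get all the words in a query, keyed by query number
## (split on the first "." so a three-digit number such as 100 is not cut to "10")
all_query_in_a_list = {}
for line in query_list:
    number, _, text = line.partition(".")
    if number.strip():
        all_query_in_a_list[number.strip()] = nltk.word_tokenize(text)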

## get stop words from nltk
stop_words = nltk.corpus.stopwords.words("english")

## words of dictionary which are not in stop words or common words
words_list = {}
for key in all_query_in_a_list:
    if all_query_in_a_list[key] != []:
        words_list[key] = [ps.stem(i) for i in all_query_in_a_list[key] if i not in stop_words and i not in common_words and re.match(r"^\w+", i)]
## filter again after stemming, since a stem can itself be a stop word or common word
for key in words_list:
    words_list[key] = [i for i in words_list[key] if i not in stop_words and i not in common_words and re.match(r"^\w+", i)]

all_unique_words = list(set([word for query_words in words_list.values() for word in query_words]))


alleg measur corrupt public offici government jurisdict worldwid


147

Write a program to run the queries in the file query_desc.51-100.short.txt, included in the data .zip file. You should run all queries (omitting the leading number) using each of the retrieval models listed below, and output the top 1000 results for each query to an output file. If a particular query has fewer than 1000 documents with a nonzero matching score, then just list whichever documents have nonzero scores.
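
The "top 1000, nonzero scores only" rule can be sketched as a small ranking helper. The name `rank_documents` and the sample scores are hypothetical. The sketch assumes scores arrive as a dictionary from document ID to score.

```python
def rank_documents(scores, k=1000):
    """Return up to k (doc, score) pairs with nonzero score, best first."""
    nonzero = [(doc, score) for doc, score in scores.items() if score != 0]
    nonzero.sort(key=lambda pair: pair[1], reverse=True)
    return nonzero[:k]


# Made-up scores: document "b" is dropped because it scores zero.
top = rank_documents({"a": 0.5, "b": 0.0, "c": 2.0}, k=2)
```

For the language models, scores are sums of log-probabilities, so they are negative rather than zero and every candidate document is kept.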

You should write precisely one output file per retrieval model. Each line of an output file should specify one retrieved document, in the following format:

<query-number> Q0 <docno> <rank> <score> Exp
Where:

<query-number> is the number preceding the query in the query list
<docno> is the document number, from the <DOCNO> field (which we asked you to index)
<rank> is the document rank: an integer from 1-1000
<score> is the retrieval model’s matching score for the document
Q0 and Exp are entered literally
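
As a quick illustration of the output format, one line can be built like this. The helper name `format_trec_line` and the sample values are made up for the example.

```python
def format_trec_line(query_number, docno, rank, score):
    """Build one result line: <query-number> Q0 <docno> <rank> <score> Exp."""
    return f"{query_number} Q0 {docno} {rank} {score} Exp"


line = format_trec_line("85", "AP890101-0001", 1, 3.5)
```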
Your program will run queries against elasticsearch. Instead of using their built in query engine, we will be retrieving information such as TF and DF scores from elasticsearch and implementing our own document ranking. It will be helpful if you write a method which takes a term as a parameter and retrieves the postings for that term from elasticsearch. You can then easily reuse this method to implement the retrieval models.
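
A reusable lookup of this kind might be built around the term vectors response. The sketch below does not contact Elasticsearch. It only shows how TF, DF and TTF could be pulled out of a response dictionary that has the shape returned when `term_statistics` is requested. The helper name `extract_term_stats` and the sample response are hypothetical.

```python
def extract_term_stats(response, term, field="text"):
    """Return term_freq, doc_freq and ttf of `term` from a term vectors response."""
    field_vector = response.get("term_vectors", {}).get(field, {})
    entry = field_vector.get("terms", {}).get(term)
    if entry is None:
        # The term does not occur in this document.
        return {"term_freq": 0, "doc_freq": 0, "ttf": 0}
    return {
        "term_freq": entry["term_freq"],
        "doc_freq": entry.get("doc_freq", 0),
        "ttf": entry.get("ttf", 0),
    }


# Made-up response fragment with the same nesting as a real one.
fake_response = {
    "term_vectors": {
        "text": {
            "field_statistics": {"sum_ttf": 1000, "doc_count": 10},
            "terms": {"corrupt": {"term_freq": 2, "doc_freq": 4, "ttf": 9}},
        }
    }
}
stats = extract_term_stats(fake_response, "corrupt")
missing = extract_term_stats(fake_response, "absent")
```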

Implement the following retrieval models, using TF and DF scores from your elasticsearch index, as needed
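
The model list itself is not reproduced here. As a sketch only, the per-term forms of Okapi TF, TF-IDF and BM25 that the model cell of this notebook computes can be written as pure functions, which makes them easy to check by hand. The function names are hypothetical, and the defaults for `k1`, `k2` and `b` are the values used in that cell.

```python
import math


def okapi_tf(tf, doc_len, avg_doc_len):
    """Okapi TF: term frequency damped by relative document length."""
    return tf / (tf + 0.5 + 1.5 * (doc_len / avg_doc_len))


def tf_idf(tf, doc_len, avg_doc_len, total_docs, doc_freq):
    """Okapi TF weighted by log(D / df); zero when the term is in no document."""
    if doc_freq == 0:
        return 0.0
    return okapi_tf(tf, doc_len, avg_doc_len) * math.log(total_docs / doc_freq)


def bm25_term(tf, doc_len, avg_doc_len, total_docs, doc_freq,
              query_tf=1, k1=1.2, k2=500, b=0.75):
    """Contribution of one query term to the BM25 score of one document."""
    idf = math.log((total_docs + 0.5) / (doc_freq + 0.5))
    doc_part = (tf + k1 * tf) / (tf + k1 * ((1 - b) + b * (doc_len / avg_doc_len)))
    query_part = (query_tf + k2 * query_tf) / (query_tf + k2)
    return idf * doc_part * query_part


# Hand check: tf = 2 in a document of average length gives 2 / (2 + 0.5 + 1.5).
half = okapi_tf(2, 100, 100)
```

A document's score under each model is the sum of these per-term values over the query terms.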



In [None]:
## Get total vocab size

body = {
  "size": 0, 
  "aggs": {
    "vocabSize": {
      "cardinality": {
        "field": "text"
      }
    }
  }
}
vocab_res = es.search(index = "ap_dataset", doc_type = "document", body = body)
vocab_size = vocab_res["aggregations"]['vocabSize']['value']


In [None]:
## Get average doc length for any doc id 

doc_id = 1
dict_all = es.termvectors(index = "ap_dataset", doc_type = "document", id = doc_id, term_statistics = True, fields = ["text"])
average_doc_length = dict_all["term_vectors"]["text"]["field_statistics"]["sum_ttf"]/dict_all["term_vectors"]["text"]["field_statistics"]["doc_count"]


In [None]:
## Get all the data and save it in globally defined data frames

def get_terms(word, doc_id):
    """Record term_freq and ttf of `word` in `doc_id`, plus the document length."""
    dict_all = es.termvectors(index = "ap_dataset", doc_type = "document", id = doc_id, term_statistics = True, fields = ["text"])
    if dict_all["term_vectors"] == {}:
        return
    temp_dict = dict_all["term_vectors"]["text"]["terms"]
    ## document length = sum of the term frequencies of every term in the document
    len_of_doc.at[doc_id, "length"] = sum(temp_dict[every_word]["term_freq"] for every_word in temp_dict)
    ## a term absent from the document has no entry, so both counts default to 0
    term_entry = temp_dict.get(word, {})
    ## .at writes straight into the frame; chained indexing may write to a copy
    df_term_freq.at[doc_id, word] = term_entry.get("term_freq", 0)
    df_ttf_freq.at[doc_id, word] = term_entry.get("ttf", 0)
    

In [None]:
## Define and initialize pandas dataframe to save term freq, doc freq and ttf

list_of_values = list(words_list.values())
list_single_item = list(set([i for item in list_of_values for i in item]))

df_term_freq = pd.DataFrame(data = 0, index = range(1,total_doc+1), columns = list_single_item)
df_ttf_freq = pd.DataFrame(data = 0, index = range(1,total_doc+1), columns = list_single_item)
df_doc_freq = pd.DataFrame(data = 0, index = ["doc_freq"], columns = list_single_item)
len_of_doc = pd.DataFrame(data = 0, index = range(1,total_doc+1), columns = ["length"])


In [None]:
## Search es for retrieving doc for each term in a query

dict_ids = {}
for key in words_list:
    list_id = []
    for word in words_list[key]:
        res = es.search(index = "ap_dataset", doc_type = "document",size = 1000, body = {"query" : {"match" : {"text" : word}}})
        df_doc_freq.at["doc_freq", word] = res["hits"]["total"]
        id_list = [int(each_hit["_id"]) for each_hit in res['hits']["hits"]]
        list_id.extend(id_list)
    dict_ids[key] = list(set(list_id))
    

In [55]:
## Initialize pandas dataframe with actual term frequency values

for key in words_list:
    t1 = time.time()
    for word in set(words_list[key]):
        for doc_id in dict_ids[key]:
            get_terms(word,doc_id)
    print ("Word completed:", key," time taken : ", (time.time() - t1), " for ids ", len(dict_ids[key]))

Word completed: 85  time taken :  663.5449593067169  for ids  7151
Word completed: 59  time taken :  446.7133719921112  for ids  5805
Word completed: 56  time taken :  270.5866074562073  for ids  4640
Word completed: 71  time taken :  1339.8888409137726  for ids  9692
Word completed: 64  time taken :  282.0133726596832  for ids  4892
Word completed: 62  time taken :  374.0085747241974  for ids  4604
Word completed: 93  time taken :  334.16528725624084  for ids  4900
Word completed: 99  time taken :  100.2273941040039  for ids  2854
Word completed: 58  time taken :  157.2018280029297  for ids  3226
Word completed: 77  time taken :  91.32353138923645  for ids  1829
Word completed: 54  time taken :  811.0812311172485  for ids  7980
Word completed: 87  time taken :  530.8166315555573  for ids  6677
Word completed: 94  time taken :  148.1805226802826  for ids  3155
Word completed: 10  time taken :  1312.5897421836853  for ids  9272
Word completed: 89  time taken :  412.30075788497925  for i

In [None]:
## Function for each model per query

def call_each_model(model_name):
    store_result_for_each_query = {}
    for key in words_list:
        result_dict = {}
        for doc_id in dict_ids[key]:
            result_dict[doc_id] = model_name(words_list[key], doc_id)
        store_result_for_each_query[key] = sorted(result_dict.items(), key = lambda x:x[1], reverse = True)[:1000]
        print ("Query completed: ", key)
    return store_result_for_each_query


In [None]:
## Define all model 

def okapi_model (word_term_freq_list, doc_id):
    okapi = 0.0
    for key in word_term_freq_list:
        tfwd = df_term_freq[key][doc_id]
        okapi += round((tfwd / (tfwd + 0.5 + 1.5 * (len_of_doc["length"][doc_id] / average_doc_length))),4)
    return okapi
    
def tfidf_model(word_term_freq_list, doc_id):  
    tfidf = 0.0
    for key in word_term_freq_list:
        tfwd = df_term_freq[key][doc_id]
        doc_freq = df_doc_freq[key]["doc_freq"]
        if doc_freq == 0:
            tfidf_val = 0
        else:
            tfidf_val = math.log(total_doc / doc_freq)
        tfidf += round(((tfwd / (tfwd + 0.5 + 1.5 * (len_of_doc["length"][doc_id] / average_doc_length))) * tfidf_val),4)
    return tfidf

def okapi_bm25(word_term_freq_list, doc_id):  
    bm25 = 0.0
    tfwq = {}
    k1 = 1.2
    k2 = 500
    b = 0.75
    for key in word_term_freq_list:
        tfwd = df_term_freq[key][doc_id]
        tfwq[key] = word_term_freq_list.count(key)
        bm25 += round(((math.log((total_doc + 0.5) / (df_doc_freq[key]["doc_freq"] + 0.5)))*((tfwd + k1 * tfwd) / (tfwd + k1 * ((1 - b) + b * (len_of_doc["length"][doc_id] / average_doc_length))))*((tfwq[key] + k2 * tfwq[key]) / (tfwq[key] + k2))), 5)
    return bm25 

def laplace_model(word_term_freq_list, doc_id):  
    laplace = 0.0
    for key in word_term_freq_list:
        tfwd = df_term_freq[key][doc_id]
        prob = ((tfwd + 1) / (len_of_doc["length"][doc_id] + vocab_size))
        laplace += math.log(prob)
    return laplace

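def jm_model(word_term_freq_list, doc_id):
    jm = 0.0
    epsilon = 0.00000001  # floor that keeps math.log defined for unseen terms
    lamda = 0.3
    ## total number of tokens in the collection (sum of all document lengths)
    corpus_length = average_doc_length * total_doc
    for key in word_term_freq_list:
        tfwd = df_term_freq[key][doc_id]
        ## ttf is a collection-wide count, but get_terms only records it for
        ## documents that contain the term, so take the column maximum
        ttf = df_ttf_freq[key].max()
        ## the background model divides by the collection length, not the vocabulary size
        prob_lm = lamda * (tfwd / len_of_doc["length"][doc_id]) + (1 - lamda) * (ttf / corpus_length)
        jm += math.log(max(prob_lm, epsilon))
    return jm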
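
For a hand check of the two language models, the per-term log-probabilities can be written as pure functions. This is a sketch with hypothetical function names. It assumes the usual Jelinek-Mercer background model, where the collection-wide count `ttf` is divided by the total number of tokens in the collection (`corpus_len`).

```python
import math


def laplace_log_prob(tf, doc_len, vocab_size):
    """Log of the add-one smoothed probability of a term in a document."""
    return math.log((tf + 1) / (doc_len + vocab_size))


def jelinek_mercer_log_prob(tf, doc_len, ttf, corpus_len, lam=0.3):
    """Log of a mixture of the document model and the collection model."""
    foreground = tf / doc_len if doc_len else 0.0
    background = ttf / corpus_len
    return math.log(lam * foreground + (1 - lam) * background)


# Hand checks: (0 + 1) / (9 + 1) = 0.1, and 0.5 * 0.1 + 0.5 * 0.1 = 0.1.
laplace_value = laplace_log_prob(0, 9, 1)
jm_value = jelinek_mercer_log_prob(1, 10, 100, 1000, lam=0.5)
```

A query's score is the sum of these values over its terms, so a term missing from a document lowers the score instead of making it undefined, as long as the term occurs somewhere in the collection.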


In [60]:
## Call each model

okapi_result = call_each_model(okapi_model)
tfidf_result = call_each_model(tfidf_model)
okapi_bm25_result = call_each_model(okapi_bm25)
laplace_result = call_each_model(laplace_model)
jm_result = call_each_model(jm_model)


Query completed:  85
Query completed:  59
Query completed:  56
Query completed:  71
Query completed:  64
Query completed:  62
Query completed:  93
Query completed:  99
Query completed:  58
Query completed:  77
Query completed:  54
Query completed:  87
Query completed:  94
Query completed:  10
Query completed:  89
Query completed:  61
Query completed:  95
Query completed:  68
Query completed:  57
Query completed:  97
Query completed:  98
Query completed:  60
Query completed:  80
Query completed:  63
Query completed:  91
Query completed:  85
Query completed:  59
Query completed:  56
Query completed:  71
Query completed:  64
Query completed:  62
Query completed:  93
Query completed:  99
Query completed:  58
Query completed:  77
Query completed:  54
Query completed:  87
Query completed:  94
Query completed:  10
Query completed:  89
Query completed:  61
Query completed:  95
Query completed:  68
Query completed:  57
Query completed:  97
Query completed:  98
Query completed:  60
Query complet

In [None]:
## Function to save the result in a file

# <query-number> Q0 <docno> <rank> <score> Exp

def save_result (store_result_for_each_query, file_name):
    with open(file_name,"w") as file:
        for query_num in store_result_for_each_query:
            rank = 1
            for docid, score in store_result_for_each_query[query_num]:
                temp_str = query_num + " Q0 " + str(doc_id_list_mapping[docid]) + " " + str(rank) + " " +  str(score) + " " + "Exp"
                file.write(temp_str)
                file.write("\n")
                rank += 1
       

In [61]:
## Call to save each model's result

save_result (okapi_result, "okapi")
save_result (tfidf_result, "okapi_tfidf")
save_result (okapi_bm25_result, "okapi_bm25")
save_result (laplace_result, "laplace")
save_result (jm_result, "jm")