## Re-ranking problem

### Reranking the MS MARCO and CAR collections

This notebook is heavily inspired by Bjarte's solution to [A5](https://github.com/uis-dat640-fall2020/bjabot12-assignments/blob/master/A5/A5.ipynb). The only difference is that it is adapted to work on the MS MARCO and CAR collections, where only the body field is available.

We use pointwise learning to rank with a scikit-learn regressor (a RandomForestRegressor and an SVR are both instantiated below).

We also compute the run file to re-rank with [OpenMatch](https://github.com/thunlp/OpenMatch). We retrieve the documents for each query using the basic Elasticsearch `es.search()` score, and OpenMatch then re-ranks them.

In [1]:
import elasticsearch
import math
import numpy as np
import os
import pytest
import random
import operator

from collections import Counter
from collections import defaultdict
from elasticsearch import Elasticsearch

In [2]:
es = Elasticsearch(http_auth=("elastic", "tiff,caloric,snuggle"))
es.info()

{'name': 'DESKTOP-3ATBQKL',
 'cluster_name': 'elasticsearch',
 'cluster_uuid': 'EaNcWp_wRPCfBAQQnph00w',
 'version': {'number': '7.9.2',
  'build_flavor': 'default',
  'build_type': 'zip',
  'build_hash': 'd34da0ea4a966c4e49417f2da2f244e3e97b4e6e',
  'build_date': '2020-09-23T00:45:33.626720Z',
  'build_snapshot': False,
  'lucene_version': '8.6.2',
  'minimum_wire_compatibility_version': '6.8.0',
  'minimum_index_compatibility_version': '6.0.0-beta1'},
 'tagline': 'You Know, for Search'}

In [3]:
def analyze_query(es, query, field="body", index='toy_index'):
    """Analyzes a query with respect to the relevant index. 
    
    Arguments:
        es: Elasticsearch object instance.
        query: String of query terms.
        field: The field with respect to which the query is analyzed. 
        index: Name of the index with respect to which the query is analyzed.  
    
    Returns:
        A list of query terms that exist in the specified field among the documents in the index. 
    """
    tokens = es.indices.analyze(index=index, body={'text': query})['tokens']
    query_terms = []
    for t in sorted(tokens, key=lambda x: x['position']):
        ## Use a boolean query to find at least one document that contains the term.
        hits = es.search(index=index, body={'query': {'match': {field: t['token']}}}, 
                                   _source=False, size=1).get('hits', {}).get('hits', {})
        doc_id = hits[0]['_id'] if len(hits) > 0 else None
        if doc_id is None:
            continue
        query_terms.append(t['token'])
    return query_terms

### Extracting features

We extract the following features for the LTR algorithm, plus the BM25 score of each document:
```
FEATURES_QUERY = ['query_length', 'query_sum_idf', 'query_max_idf', 'query_avg_idf']
FEATURES_DOC = ['doc_length_body']
FEATURES_QUERY_DOC = ['unique_query_terms_in_body', 'sum_TF_body', 'max_TF_body', 'avg_TF_body']
```

In [4]:
def get_doc_term_freqs(es, doc_id, field="body", index='toy_index'):
    """Gets the term frequencies of a field of an indexed document. 
    
    Arguments:
        es: Elasticsearch object instance.
        doc_id: Document identifier with which the document is indexed. 
        field: Field of document to consider for term frequencies.
        index: Name of the index where document is indexed.  
    
    Returns:
        Dictionary of terms and their respective term frequencies in the field and document.  
    """
    tv = es.termvectors(index=index, id=doc_id, fields=field, term_statistics=True)
    if tv['_id'] != doc_id:
        return None
    if field not in tv['term_vectors']:
        return None
    term_freqs = {}
    for term, term_stat in tv['term_vectors'][field]['terms'].items():
        term_freqs[term] = term_stat['term_freq']
    return term_freqs
        
def get_query_term_freqs(es, query_terms):
    """Gets the term frequencies of a list of query terms. 
    
    Arguments:
        es: Elasticsearch object instance.
        query_terms: List of query terms, analyzed using `analyze_query` with respect to some relevant index. 
    
    Returns:
        A dictionary mapping each query term to its frequency in the query. 
    """
    c = Counter()
    for term in query_terms:
        c[term] += 1
    return dict(c)

In [5]:
def extract_query_features(query_terms, es, index='toy_index'):
    """Extracts features of a query.
    
        Arguments:
            query_terms: List of analyzed query terms.
            es: Elasticsearch object instance.
            index: Name of relevant index on the running Elasticsearch service. 
        Returns:
            Dictionary with keys 'query_length', 'query_sum_idf', 'query_max_idf', and 'query_avg_idf'.
    """
    q_features = {}
    q_features["query_length"] = len(query_terms)
    
    idf_list = []
    temp_max = 0
    field = "body"
    # Number of documents in the index; fetched once instead of once per term.
    N = int(es.cat.count(index, params={"format": "json"})[0]['count'])
    for term in query_terms:
        # Find one document containing the term to read the term's document frequency.
        hits = es.search(index=index, body={'query': {'match': {field: term}}}, _source=False, size=1).get('hits', {}).get('hits', {})
        doc_id = hits[0]['_id'] if len(hits) > 0 else None
        if doc_id is not None:
            tv = es.termvectors(index=index, id=doc_id, fields=field, term_statistics=True)
            if term not in tv["term_vectors"][field]["terms"]:
                continue
            doc_freq = tv["term_vectors"][field]["terms"][term]["doc_freq"]
            idf = math.log(N/doc_freq)
            
            idf_list.append(idf)
            if idf > temp_max:
                temp_max = idf
    if len(idf_list) > 0:
        q_features['query_avg_idf'] = sum(idf_list)/len(idf_list)
        q_features["query_sum_idf"] = sum(idf_list)
        q_features["query_max_idf"] = temp_max
    else:
        q_features['query_avg_idf'] = 0
        q_features["query_sum_idf"] = 0
        q_features["query_max_idf"] = 0
    
    return q_features
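
The IDF features can be checked offline, without a running Elasticsearch service. The following is a minimal sketch that applies the same `log(N / doc_freq)` formula as above; `query_idf_features`, `doc_freqs`, and `num_docs` are hypothetical names that stand in for the statistics the notebook reads from term vectors.

```python
import math

def query_idf_features(query_terms, doc_freqs, num_docs):
    """Computes the query features from precomputed document frequencies.

    `doc_freqs` and `num_docs` are hypothetical stand-ins for the statistics
    that the notebook reads from Elasticsearch.
    """
    # Terms that do not occur in the collection are skipped.
    idfs = [math.log(num_docs / doc_freqs[t])
            for t in query_terms if doc_freqs.get(t, 0) > 0]
    return {
        "query_length": len(query_terms),
        "query_sum_idf": sum(idfs),
        "query_max_idf": max(idfs, default=0),
        "query_avg_idf": sum(idfs) / len(idfs) if idfs else 0,
    }

# Toy statistics: "zzz" does not occur in the collection.
toy_features = query_idf_features(
    ["throat", "cancer", "zzz"], {"throat": 10, "cancer": 100}, 1000)
print(toy_features)
```

Rarer terms get a higher IDF, so here `query_max_idf` comes from "throat", which occurs in 10 of the 1000 toy documents.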

In [6]:
def extract_doc_features(doc_id, es, index='toy_index'):
    """Extracts features of a document.
    
        Arguments:
            doc_id: Document identifier of indexed document.
            es: Elasticsearch object instance.
            index: Name of relevant index on the running Elasticsearch service. 

        Returns:
            Dictionary with key 'doc_length_body'.
    """
    # Document length is the sum of the term frequencies in the body field.
    terms = get_doc_term_freqs(es, doc_id, index=index)
    doc_length = sum(terms.values()) if terms is not None else 0
    return {"doc_length_body": doc_length}

In [7]:
def extract_query_doc_features(query_terms, doc_id, es, index='toy_index'):
    """Extracts features of a query and document pair.
    
        Arguments:
            query_terms: List of analyzed query terms.
            doc_id: Document identifier of indexed document.
            es: Elasticsearch object instance.
            index: Name of relevant index on the running Elasticsearch service. 
            
        Returns:
            Dictionary with keys 'unique_query_terms_in_body',
            'sum_TF_body', 'max_TF_body', 'avg_TF_body'. 
    """
    q_doc_features = {}

    count = 0
    sum_tf = 0
    max_tf = 0
    terms = get_doc_term_freqs(es, doc_id, index=index)
    for term in query_terms:
        if terms is not None:
            if term in terms:
                count = count + 1
                tf = terms[term]
            else:
                tf = 0
            if tf > max_tf:
                max_tf = tf
            sum_tf += tf


    if len(query_terms) > 0:
        q_doc_features["sum_TF_body"] = sum_tf
        q_doc_features["max_TF_body"] = max_tf
        q_doc_features["avg_TF_body"] = sum_tf/len(query_terms) 
        q_doc_features["unique_query_terms_in_body"] = count
    else:
        q_doc_features["sum_TF_body"] = 0
        q_doc_features["max_TF_body"] = 0
        q_doc_features["avg_TF_body"] = 0
        q_doc_features["unique_query_terms_in_body"] = 0
        
    return q_doc_features
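
The query-document features can be illustrated the same way on a toy term-frequency dictionary. This is a sketch under assumptions: `query_doc_tf_features` is a hypothetical helper, and `doc_term_freqs` stands in for the output of `get_doc_term_freqs`.

```python
def query_doc_tf_features(query_terms, doc_term_freqs):
    """Computes the query-document features from a term-frequency dictionary.

    `doc_term_freqs` is a hypothetical stand-in for the dictionary returned
    by `get_doc_term_freqs`.
    """
    # Term frequency of each query term in the document (0 if absent).
    tfs = [doc_term_freqs.get(t, 0) for t in query_terms]
    n = len(query_terms)
    return {
        "unique_query_terms_in_body": sum(1 for tf in tfs if tf > 0),
        "sum_TF_body": sum(tfs),
        "max_TF_body": max(tfs, default=0),
        "avg_TF_body": sum(tfs) / n if n else 0,
    }

toy_qd = query_doc_tf_features(
    ["throat", "cancer", "cure"], {"throat": 4, "cancer": 2, "pharynx": 1})
print(toy_qd)
```

Note that the average is taken over all query terms, including those missing from the document, which matches the division by `len(query_terms)` above.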

In [92]:
FEATURES_QUERY = ['query_length', 'query_sum_idf', 'query_max_idf', 'query_avg_idf']
FEATURES_DOC = ['doc_length_body']
FEATURES_QUERY_DOC = ['unique_query_terms_in_body', 'sum_TF_body', 'max_TF_body', 'avg_TF_body']

def extract_features(query_terms, doc_id, es, index='toy_index', norm_bm25score=0, checkBM25=False):
    """Extracts query features, document features and query-document features of a query and document pair.
    
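        Arguments:
            query_terms: List of analyzed query terms.
            doc_id: Document identifier of indexed document.
            es: Elasticsearch object instance.
            index: Name of relevant index on the running Elasticsearch service. 
            norm_bm25score: Precomputed normalized BM25 score, appended as the last feature when checkBM25 is False.
            checkBM25: If True, compute the normalized BM25 score from a first-pass search instead.
            
        Returns:
            List of extracted feature values in a fixed order. When checkBM25 is True, a tuple of
            (feature list, max_score, min_score) is returned instead.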
    """
    feature_vect = []
    
    query_features = extract_query_features(query_terms, es, index=index)
    for f in FEATURES_QUERY:
        feature_vect.append(query_features[f])
    
    doc_features = extract_doc_features(doc_id, es, index=index)
    for f in FEATURES_DOC:
        feature_vect.append(doc_features[f])

    query_doc_features = extract_query_doc_features(query_terms, doc_id, es, index=index)
    for f in FEATURES_QUERY_DOC:
        feature_vect.append(query_doc_features[f])
    
    # Add the min-max normalized BM25 score as the last feature. Running a first-pass
    # search for every pair makes this slow, but it was the best approach I could think of.
    if checkBM25:
        bm25docs = es.search(index=index, q=' '.join(query_terms), _source=True, size=100)['hits']['hits']
        scores = {doc["_id"]: doc["_score"] for doc in bm25docs}
        max_score = max(scores.values(), default=0)
        min_score = min(scores.values(), default=0)
        score_range = max_score - min_score
        if doc_id in scores and score_range > 0:
            feature_vect.append((scores[doc_id] - min_score) / score_range)
        else:
            # Zero if the doc ID is not in the top 100 docs or all scores are equal.
            feature_vect.append(0)
        return feature_vect, max_score, min_score
    else:
        feature_vect.append(norm_bm25score)
    return feature_vect
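
The BM25 feature relies on min-max normalization of the first-pass scores. A minimal sketch of that step follows; `min_max_normalize` is a hypothetical helper, and returning zeros when all scores are equal is an assumption made here to avoid dividing by zero.

```python
def min_max_normalize(scores):
    """Maps scores to [0, 1]; returns all zeros when every score is equal."""
    if not scores:
        return []
    lo, hi = min(scores), max(scores)
    if hi == lo:
        # Assumption: a constant score list carries no ranking signal.
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]

toy_scores = min_max_normalize([25.5, 20.0, 14.5])
print(toy_scores)
```

The top-scoring hit maps to 1 and the lowest-scoring hit to 0, so the feature is comparable across queries whose raw BM25 scores differ in scale.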

### Load qrels

In [10]:
def load_qrels(filepath):
    """Loads query relevance judgments from a file.
    
        Arguments:
            filepath: String (constructed using os.path) of the filepath to a file with queries.  
            
        Returns:
            A dictionary with query IDs and a corresponding list of document IDs for documents judged 
            relevant to the query. 
    """
    qrels = defaultdict(list)

    with open(filepath, "r") as f:
        for line in f:
            # Columns are tab-separated: query ID first, passage ID third.
            fields = line.split("\t")
            q_id = fields[0]
            pas_id = "MARCO_" + fields[2]
            qrels[q_id].append(pas_id)
            
    return qrels

In [11]:
def load_qrels_trec(filepath):
    """Loads query relevance judgments from a file.
    
        Arguments:
            filepath: String (constructed using os.path) of the filepath to a file with queries.  
            
        Returns:
            A dictionary with query IDs and a corresponding list of document IDs for documents judged 
            relevant to the query. 
    """
    qrels = defaultdict(list)

    with open(filepath, "r") as f:
        for line in f:
            q_id = line.split(" ")[0]
            pas_id = line.split(" ")[2]
            qrels[q_id].append(pas_id)
            
    return qrels
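
The loader above keeps every document listed for a query, whatever the remaining columns contain. If the qrels file follows the common four-column TREC layout `query_id iteration doc_id grade` (an assumption, not verified for this file), a variant that filters on the grade could look like the sketch below. `parse_trec_qrels` and `min_grade` are hypothetical names.

```python
from collections import defaultdict

def parse_trec_qrels(lines, min_grade=1):
    """Parses TREC-style qrels lines of the form `query_id iteration doc_id grade`.

    `min_grade` is a hypothetical threshold: only documents judged at or
    above it are kept as relevant.
    """
    qrels = defaultdict(list)
    for line in lines:
        parts = line.split()
        if len(parts) < 4:
            continue  # Skip blank or malformed lines.
        q_id, _, doc_id, grade = parts[:4]
        if int(grade) >= min_grade:
            qrels[q_id].append(doc_id)
    return qrels

# Made-up judgments for illustration only.
toy_qrels = parse_trec_qrels([
    "31_1 0 CAR_aaa 2",
    "31_1 0 MARCO_123 0",
    "31_2 0 MARCO_456 1",
])
print(dict(toy_qrels))
```

Whether grade-0 entries should count as relevant changes the reciprocal rank computed later, so the threshold is worth checking against the actual file.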

In [12]:
# For training data

cwd = os.getcwd()

# Path to ms_marco qrels
qrels_file = 'OpenMatch/data/qrels.train_marco.tsv'
qrels_filepath = os.path.join(cwd, qrels_file)
qrels_marco = load_qrels(qrels_filepath)

In [13]:
# For testing data

cwd = os.getcwd()

# Path to trec_eval qrels
qrels_file_trec = './trec_data/eval_topics.qrel'
qrels_filepath = os.path.join(cwd, qrels_file_trec)
qrels_trec = load_qrels_trec(qrels_filepath)
qrels_trec

defaultdict(list,
            {'31_1': ['CAR_116d829c4c800c2fc70f11692fec5e8c7e975250',
              'CAR_1463f964653c5c9f614a0a88d26b175e4a8120f1',
              'CAR_172e16e89ea3d5546e53384a27c3be299bcfe968',
              'CAR_1c93ef499a0c2856c4a857b0cb4720c380dda476',
              'CAR_2174ad0aa50712ff24035c23f59a3c2b43267650',
              'CAR_25a576af9caa6422f55c2acf945dc79b423fb41e',
              'CAR_2dc597ac2fc10917a752552bc335e6ac1aedc3f0',
              'CAR_3249e5618575a849152c02b05f4fda924f10326f',
              'CAR_393cb4e18a9d30018e843c4d37c564272ec5fa6f',
              'CAR_40c64256e988c8103550008f4e9b7ce436d9536d',
              'CAR_41ab200340a8fcae748b366ff0b0506e6b87cefa',
              'CAR_41b7dce4f8a72ee34d78c2b5c363272a54997f27',
              'CAR_462db9a569840533b644a7eb3e2557a23ae8204b',
              'CAR_55f68f0a33f49f6015035f70d8685e293389c9d6',
              'CAR_5b50a00d9e1bd1d4ac3a973a656c1b20ab96ec1e',
              'CAR_683d2822b14dca56ef1ffd93c

### Load queries

In [14]:
def load_queries(filepath):
    """Loads queries from a file.
    
        Arguments:
            filepath: String (constructed using os.path) of the filepath to a file with queries.  
            
        Returns:
            A dictionary with query IDs and corresponding query strings, and a list of query IDs in file order. 
    """
    queries = {}
    q_ids = []

    with open(filepath, "r", encoding="utf-8") as f:
        for line in f:
            fields = line.split("\t")
            if len(fields) < 2:
                continue  # Skip lines without a tab-separated query string.
            _id = fields[0]
            query = fields[1].replace("\n", "")
            
            queries[_id] = query
            q_ids.append(_id)
    return queries, q_ids

In [15]:
# For training data

cwd = os.getcwd()

# Path to ms_marco passage queries
query_file = 'OpenMatch/data/queries.tar'
query_filepath = os.path.join(cwd, query_file)
queries, q_ids = load_queries(query_filepath)


0
10000
20000
...
1010000


In [16]:
# For testing data

cwd = os.getcwd()

# Path to trec_eval queries
query_trec_topics = "./trec_data/eval_topics_resolved.tsv"
query_filepath = os.path.join(cwd, query_trec_topics)
queries_trec, q_ids_trec = load_queries(query_filepath)

0


In [17]:
len(queries_trec)

479

### Prepare ltr training data

In [112]:
def prepare_ltr_training_data(query_ids, queries, qrels, es, index='trec2019_stem'):
    """Prepares feature vectors and labels for query and document pairs found in the training data.
    
        Arguments:
            query_ids: List of query IDs.
            queries: Dictionary mapping query IDs to query strings.
            qrels: Dictionary mapping query IDs to lists of relevant document IDs.
            es: Elasticsearch object instance.
            index: Name of relevant index on the running Elasticsearch service. 
            
        Returns:
            X: List of feature vectors extracted for each pair of query and retrieved or relevant document. 
            y: List of corresponding labels.
    """
    X = []
    y = []

    for i, query_id in enumerate(query_ids):
        seen = set()  # Document IDs already added for this query.
        print(i)
        query = analyze_query(es, queries[query_id], "body", index)

        # Positive examples: documents judged relevant in the qrels.
        for doc_id in qrels[query_id]:
            if doc_id in seen:
                continue
            features, max_score, min_score = extract_features(query, doc_id, es, index, checkBM25=True)
            X.append(features)
            y.append(1)
            seen.add(doc_id)

        # Negative examples: top-5 first-pass hits that are not judged relevant.
        irrelevant = es.search(index=index, q=' '.join(query), _source=True, size=5)['hits']
        for doc in irrelevant["hits"]:
            if doc["_id"] in seen:
                continue
            # The BM25 feature is set to 0 for negative examples.
            norm_bm25score = 0
            X.append(extract_features(query, doc["_id"], es, index, norm_bm25score=norm_bm25score))
            y.append(0)
            seen.add(doc["_id"])
    return X, y

In [114]:
%%time
X_train, y_train = prepare_ltr_training_data(q_ids[:5000], queries, qrels_marco, es, index='trec2019_stem')

Wall time: 6h 27min 23s


In [113]:
X_train_5, y_train_5 = prepare_ltr_training_data(q_ids[:100], queries, qrels_marco, es, index='trec2019_stem')

0
1
2
...
99


In [57]:
X_train_5[-1]

[3,
 13.660693234398419,
 8.621966279360105,
 6.8303466171992095,
 10,
 1,
 1,
 1,
 0.3333333333333333,
 23.093998]

In [189]:
import pickle

In [190]:
def save_training_data(X, y):
    """ Writes training set to files
    
    Arguments:
        X: Training feature vectors.
        y: Training labels
    """
    with open('./trec_data/X_train', 'wb') as fp:
        pickle.dump(X, fp)
    with open('./trec_data/y_train', 'wb') as fp:
        pickle.dump(y, fp)

In [191]:
def load_training_data():
    """ Loads the datasets from a pickle-file
    
    Returns:
        X: Training feature vectors
        y: Training labels
    """
    with open ('./trec_data/X_train', 'rb') as fp:
        X = pickle.load(fp)
    with open ('./trec_data/y_train', 'rb') as fp:
        y = pickle.load(fp)
    return X, y

In [192]:
save_training_data(X_train, y_train)

In [197]:
X1, y1 = load_training_data()

In [199]:
X1[:2]

[[5,
  24.671317172279792,
  7.771230777339052,
  6.167829293069948,
  17,
  3,
  6,
  3,
  1.2],
 [5,
  24.671317172279792,
  7.771230777339052,
  6.167829293069948,
  66,
  3,
  18,
  9,
  3.6]]

In [22]:
class PointWiseLTRModel(object):
    def __init__(self, regressor):
        """
        Arguments:
            regressor: An instance of scikit-learn regressor.
        """
        self.regressor = regressor
        self.model = None  # Set by _train.

    def _train(self, X, y):
        """Trains an LTR model.
        
        Arguments:
            X: Features of training instances.
            y: Relevance assessments of training instances.
        """
        assert self.regressor is not None
        self.model = self.regressor.fit(X, y)

    def rank(self, ft, doc_ids):
        """Predicts relevance labels and rank documents for a given query.
        
        Arguments:
            ft: A list of feature vectors for query-document pairs.
            doc_ids: A list of document ids.
        Returns:
            List of tuples, each consisting of document ID and predicted relevance label.
        """
        assert self.model is not None
        rel_labels = self.model.predict(ft)
        sort_indices = np.argsort(rel_labels)[::-1]

        results = []
        for i in sort_indices:
            results.append((doc_ids[i], rel_labels[i]))
        return results
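
The sorting step in `rank` can be illustrated without a trained model. The sketch below is a pure-Python stand-in with made-up scores; `rank_by_score` is a hypothetical helper that orders documents the same way as the reversed `np.argsort`, except possibly for the order of tied scores.

```python
def rank_by_score(scores, doc_ids):
    """Returns (doc_id, score) tuples sorted by descending score."""
    # Indices of the scores, highest score first.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [(doc_ids[i], scores[i]) for i in order]

toy_ranking = rank_by_score([0.2, 0.9, 0.5], ["d1", "d2", "d3"])
print(toy_ranking)
```

In the pointwise setting each document is scored independently, and the ranking is simply this sort over the predicted scores.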

In [23]:
from sklearn.ensemble import RandomForestRegressor

In [130]:
# Instantiate a scikit-learn regression model, `clf`.
clf = RandomForestRegressor(max_depth=3, random_state=0)

# Instantiate PointWiseLTRModel. Note that this wraps `regr` (the SVR from cell In [129]), not `clf`.
ltr = PointWiseLTRModel(regr)

In [115]:
from sklearn import svm

In [129]:
regr = svm.SVR(kernel='poly', C=100, gamma='auto', degree=3, epsilon=.1,coef0=1)

In [131]:
ltr._train(X_train_5, y_train_5)

In [120]:
def get_rankings(ltr, query_ids, queries, es, index='trec2019_stem', rerank=False):
    """Generate rankings for each of the test query IDs.
    
        Arguments:
            ltr: A trained PointWiseLTRModel instance.
            query_ids: List of query IDs.
            queries: Dictionary mapping query IDs to query strings.
            es: Elasticsearch object instance.
            index: Name of relevant index on the running Elasticsearch service. 
            rerank: Boolean flag indicating whether the first-pass retrieval results should be reranked using the LTR model.
            
        Returns:
            A dictionary of rankings for each test query ID. 
    """

    test_rankings = {}
    for i, query_id in enumerate(query_ids):
        print('Processing query {}/{} ID {}'.format(i + 1, len(query_ids), query_id))
        # First-pass retrieval
        query_terms = analyze_query(es, queries[query_id], 'body', index=index)
        if len(query_terms) == 0:
            print('WARNING: query {} is empty after analysis; ignoring'.format(query_id))
            continue
        hits = es.search(index=index, q=' '.join(query_terms), _source=True, size=100)['hits']['hits']        
        test_rankings[query_id] = [hit['_id'] for hit in hits]
        
        # Rerank the first-pass result set using the LTR model.
        if rerank and hits:
            doc_ids = test_rankings[query_id]
            scores = [hit["_score"] for hit in hits]
            min_score = min(scores)
            max_score = max(scores)
            score_range = max_score - min_score
            feature_vectors = []
            for hit in hits:
                # Min-max normalize the first-pass score; use 0 when all scores are equal.
                norm_bm25score = (hit["_score"] - min_score) / score_range if score_range > 0 else 0
                feature_vectors.append(extract_features(query_terms, hit["_id"], es, index, norm_bm25score=norm_bm25score))
            test_rankings[query_id] = [doc_id for doc_id, _ in ltr.rank(feature_vectors, doc_ids)]
    return test_rankings

In [132]:
%%time
rankings_ltr_trec = get_rankings(ltr, q_ids_trec, queries_trec, es, index='trec2019_stem', rerank=True)

Processing query 1/479 ID 31_1
Processing query 2/479 ID 31_2
Processing query 3/479 ID 31_3
...

In [77]:
%%time
rankings_ltr_trec_firstpass = get_rankings(ltr, q_ids_trec, queries_trec, es, index='trec2019_stem')

Processing query 1/479 ID 31_1
Processing query 2/479 ID 31_2
Processing query 3/479 ID 31_3
...

In [30]:
def get_reciprocal_rank(system_ranking, ground_truth):
    """Computes Reciprocal Rank (RR).
    
    Args:
        system_ranking: Ranked list of document IDs.
        ground_truth: Set of relevant document IDs.
    
    Returns:
        RR (float).
    """
    for i, doc_id in enumerate(system_ranking):
        if doc_id in ground_truth:
            return 1 / (i + 1)
    return 0
    
def get_mean_eval_measure(system_rankings, ground_truth, eval_function):
    """Computes a mean of any evaluation measure over a set of queries.
    
    Args:
        system_rankings: Dict with query ID as key and a ranked list of document IDs as value.
        ground_truth: Dict with query ID as key and a set of relevant document IDs as value.
        eval_function: Callback function for the evaluation measure that mean is computed over.
    
    Returns:
        Mean evaluation measure (float).
    """
    sum_score = 0
    for query_id, system_ranking in system_rankings.items():
        sum_score += eval_function(system_ranking, ground_truth[query_id])
    return sum_score / len(system_rankings)
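
A small worked example makes the measure concrete. The rankings and judgments below are made up for illustration, and `reciprocal_rank` is a standalone re-statement of the function above so that the block runs on its own.

```python
def reciprocal_rank(ranking, relevant):
    """Returns 1 / rank of the first relevant document, or 0 if none is found."""
    for position, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            return 1 / position
    return 0

# Made-up rankings and ground truth for three queries.
toy_rankings = {"q1": ["d3", "d1", "d2"], "q2": ["d5", "d4"], "q3": ["d9"]}
toy_truth = {"q1": {"d1"}, "q2": {"d5"}, "q3": {"d7"}}

# q1 contributes 1/2, q2 contributes 1, q3 contributes 0.
toy_mrr = sum(reciprocal_rank(r, toy_truth[q])
              for q, r in toy_rankings.items()) / len(toy_rankings)
print(toy_mrr)
```

Queries with no relevant document in the ranking still count in the denominator, which pulls the mean down.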

In [135]:
get_mean_eval_measure(rankings_ltr_trec, qrels_trec, get_reciprocal_rank)

0.1773999962395326

In [188]:
es.search(index="trec2019_stem", q=' '.join(["what", "is", "throat", "cancer"]), _source=True, size=100)['hits']['hits'] 

[{'_index': 'trec2019_stem',
  '_type': '_doc',
  '_id': 'MARCO_789620',
  '_score': 25.53606,
  '_source': {'body': 'What is throat cancer? Throat cancer is any cancer that forms in the throat. The throat, also called the pharynx, is a 5-inch-long tube that runs from your nose to your neck. The larynx (voice box) and pharynx are the two main places throat cancer forms. Throat cancer is a type of head and neck cancer, which includes cancer of the mouth, tonsils, nose, sinuses, salivary glands and neck lymph nodes. Learn about the types of throat cancer.\n'}},
 {'_index': 'trec2019_stem',
  '_type': '_doc',
  '_id': 'MARCO_3878347',
  '_score': 25.491285,
  '_source': {'body': 'What is throat cancer? Throat cancer is any cancer that forms in the throat. The throat, also called the pharynx, is a 5-inch-long tube that runs from your nose to your neck. The larynx (voice box) and pharynx are the two main places throat cancer forms. Throat cancer is a type of head and neck cancer, which incl

In [201]:
rankings_ltr_trec

{'31_1': ['MARCO_7208607',
  'MARCO_2231510',
  'MARCO_3878345',
  'MARCO_4339742',
  'MARCO_7632622',
  'MARCO_8719994',
  'MARCO_3990603',
  'MARCO_2231507',
  'MARCO_5289694',
  'MARCO_3990599',
  'MARCO_3090847',
  'MARCO_5488479',
  'MARCO_8046971',
  'CAR_b407590614db7d62fb035409bc0ec44908783c25',
  'MARCO_6914824',
  'MARCO_4976182',
  'MARCO_5234420',
  'MARCO_1425161',
  'CAR_2174ad0aa50712ff24035c23f59a3c2b43267650',
  'MARCO_6430441',
  'MARCO_3553109',
  'MARCO_1842740',
  'CAR_a7df3734d664e1d2cd16bbb2e42727992d2b61be',
  'MARCO_2314856',
  'MARCO_5990560',
  'MARCO_5295408',
  'MARCO_3426729',
  'MARCO_1373522',
  'MARCO_3878347',
  'MARCO_291003',
  'MARCO_8149031',
  'MARCO_7208611',
  'MARCO_7364331',
  'MARCO_2925873',
  'MARCO_3705887',
  'MARCO_191056',
  'MARCO_191050',
  'MARCO_3705888',
  'MARCO_7364330',
  'MARCO_291004',
  'MARCO_8610842',
  'MARCO_8610846',
  'MARCO_2925879',
  'MARCO_5990561',
  'MARCO_7100885',
  'MARCO_3878346',
  'MARCO_3878348',
  'MARCO_1

In [202]:
rankings_ltr_trec_firstpass

{'31_1': ['MARCO_789620',
  'MARCO_3878347',
  'MARCO_291003',
  'MARCO_8149031',
  'MARCO_7208611',
  'MARCO_7364331',
  'MARCO_2925873',
  'MARCO_3705887',
  'MARCO_191056',
  'MARCO_191050',
  'MARCO_3705888',
  'MARCO_7364330',
  'MARCO_291004',
  'MARCO_8610842',
  'MARCO_8610846',
  'MARCO_2925879',
  'MARCO_5990561',
  'MARCO_7100885',
  'MARCO_3878346',
  'MARCO_3878348',
  'MARCO_1905581',
  'MARCO_1462317',
  'MARCO_1373522',
  'MARCO_3426729',
  'MARCO_5295408',
  'MARCO_5990560',
  'MARCO_3878345',
  'MARCO_4339742',
  'MARCO_7632622',
  'MARCO_8719994',
  'MARCO_3990603',
  'MARCO_2231507',
  'MARCO_5289694',
  'MARCO_3990599',
  'MARCO_3090847',
  'MARCO_5488479',
  'MARCO_2231510',
  'MARCO_8046971',
  'MARCO_6914824',
  'MARCO_4976182',
  'MARCO_5234420',
  'MARCO_1425161',
  'CAR_2174ad0aa50712ff24035c23f59a3c2b43267650',
  'MARCO_6430441',
  'MARCO_3553109',
  'MARCO_1842740',
  'CAR_a7df3734d664e1d2cd16bbb2e42727992d2b61be',
  'MARCO_2314856',
  'MARCO_2954451',
  'M

### Creating run-file for OpenMatch to re-rank

We create a run file from the top-100 results of the standard es.search(); OpenMatch then re-ranks these using bert-base.

In [205]:
def create_run_file(rankings):
    """Writes rankings to a TREC-format run file, so that the MRR score can be calculated with trec_eval.
    
    Each line has the form `query_id Q0 doc_id rank score run_tag`.
    
    Args:
        rankings: Dict with query ID as key and a ranked list of document IDs as value.
    
    """
    res = []
    for q_id, doc_ids in rankings.items():
        for i, doc_id in enumerate(doc_ids):
            inp = "{} {} {} {} {} {}".format(q_id, "Q0", str(doc_id), str(i), str(1/(i+1)), "LTR")
            res.append(inp)

    res = "\n".join(res)
    with open("./trec_data/LTR_results.txt", 'w') as f:
        f.write(res)
    return
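
The layout of the run-file lines can be checked in memory before writing anything to disk. The sketch below uses a hypothetical helper, `format_run_lines`, and differs from the function above in one assumed detail: ranks start at 1 rather than 0.

```python
def format_run_lines(rankings, run_tag="LTR"):
    """Formats rankings as `query_id Q0 doc_id rank score run_tag` lines."""
    lines = []
    for q_id, doc_ids in rankings.items():
        for rank, doc_id in enumerate(doc_ids, start=1):
            # Assumption: 1-based ranks, with 1/rank as a decreasing score.
            lines.append("{} Q0 {} {} {} {}".format(
                q_id, doc_id, rank, 1 / rank, run_tag))
    return lines

toy_run = format_run_lines({"31_1": ["MARCO_789620", "MARCO_3878347"]})
print("\n".join(toy_run))
```

The score column only needs to decrease with rank so that a tool sorting by score reproduces the intended order.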

In [206]:
create_run_file(rankings_ltr_trec)