<h1>Introduction to Information Retrieval - <i>Machado de Assis Collection</i></h1> 



Disclaimer: this tutorial follows the ideas presented in [Stanford CS124 class Week 4](https://www.youtube.com/channel/UC_48v322owNVtORXuMeRmpA) and in Chris Manning's book [An Introduction to Information Retrieval](https://nlp.stanford.edu/IR-book/pdf/irbookonlinereading.pdf). Also, some code was borrowed from this [GitHub repo](https://github.com/williamscott701/Information-Retrieval/blob/master/2.%20TF-IDF%20Ranking%20-%20Cosine%20Similarity%2C%20Matching%20Score/TF-IDF.ipynb).

In this tutorial we'll cover the basic concepts of Information Retrieval (IR) and focus on the Boolean and TF-IDF Ranked Retrieval models. At the end we present ways to evaluate an IR system using a benchmark dataset and a ranking algorithm shipped with modern Lucene-based search engines (e.g. Elasticsearch and Solr).

For our demo, we'll use a collection of [Machado de Assis](https://pt.wikipedia.org/wiki/Machado_de_Assis)'s books and articles. He is a famous Brazilian writer. 

In [None]:
## Library imports
import numpy as np 
import pandas as pd

from sklearn.feature_extraction.text import CountVectorizer

import os, glob, re, sys, random, unicodedata, collections
from tqdm import tqdm
from functools import reduce
import nltk
from collections import Counter

from nltk.corpus import stopwords
from nltk.stem import RSLPStemmer
from nltk.tokenize import sent_tokenize , word_tokenize

STOP_WORDS = set(stopwords.words('portuguese'))

# 1. Boolean Retrieval Model

We'll start this tutorial with the boolean retrieval model.

## 1.1 Basic Concepts

[Information Retrieval](https://en.wikipedia.org/wiki/Information_retrieval) can be defined as the field of study concerned with finding relevant material of an unstructured nature (usually text) in a collection of data (e.g. files stored in your file system) to satisfy a user's information need. 

Users try to translate their needs into a *query*. A *search engine* processes this *query* over the *collection* and retrieves matching *results*. Users then evaluate the relevance of these *results* and refine their *query* iteratively.


## 1.2 Term-document frequency

An important processing step of a *search engine* is to build a term-document frequency table, which counts how many times each term (or word) occurs in each document (or file). A boolean version only records whether the term occurs or not.

We illustrate this in the piece of code below. For simplicity, we use scikit-learn's built-in [CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) class to achieve this.


In [None]:
random.seed(42) ## make the sampling reproducible

files = []
# Read only the txt version of files and discard the pdf files of the collection
for dirname, _, filenames in os.walk('/kaggle/input/machado-de-assis/raw/txt/'):
    for filename in filenames:
        files.append(os.path.join(dirname, filename))

print('There are a total of {} files'.format(len(files)), '\n')
        
# select and read 10 random files 
sample_books = random.sample(files,10)

docs = []
for fname in sample_books:
    with open(fname , "r") as file:
        text = file.read()
    docs.append(text)

# count term frequency using CountVectorizer from scikit-learn
## limiting number of words just for illustrating the concept
## (stop_words must be a list; get_feature_names_out() requires scikit-learn >= 1.0)
vec = CountVectorizer(max_features=10, stop_words=list(STOP_WORDS)) 
X = vec.fit_transform(docs)
df = pd.DataFrame(X.toarray(), columns=vec.get_feature_names_out())
books_names = [book.split('/')[-1] for book in sample_books]
df['book'] = books_names
df = df.set_index('book')

print(df)

The table above shows the occurrences of 10 words in 10 books sampled from the 116 available. From this table, we can search for documents that contain specific terms using set operations (AND, OR, NOT). 

For example, for the table above, if our *query* is the *term* "helena", documents number 2, 3 and 9 match our criteria. If our *query* is "estácio", only document 9 matches. If our *query* is "helena AND estácio", the engine will compute (2, 3, 9) AND (9) and return 9. Also, with a *query* "helena OR estácio", it will compute (2, 3, 9) OR (9) and return (2, 3, 9) as results. (Remember: OR is Union and AND is Intersection).
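
The set operations described above can be sketched with Python's built-in `set` type. This is only an illustration: the posting sets are copied from the example in the text, and `all_docs` assumes the 10 sampled documents are numbered 0 to 9.

```python
# Posting sets taken from the example in the text (document numbers)
helena = {2, 3, 9}
estacio = {9}
all_docs = set(range(10))  # assumption: the 10 sampled documents are numbered 0-9

print(sorted(helena & estacio))   # AND -> intersection
print(sorted(helena | estacio))   # OR  -> union
print(sorted(all_docs - helena))  # NOT -> complement with respect to the collection
```

NOT needs the full set of document ids, which is why it is usually combined with another operator instead of being evaluated on its own.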

One problem with this approach is scale: a big collection usually has hundreds of thousands of distinct words and possibly hundreds of thousands of documents, so this table (or matrix) would have approximately 10^10 elements, or 10 billion. Even with an efficient data structure for sparse data, such a matrix is hard to build. 

One possible solution is to store the information in an inverted manner: an <b>Inverted Index</b>.
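
As a rough sketch of the idea, the snippet below maps each term to the set of documents that contain it, so nothing is stored for the (term, document) pairs that never occur. The toy collection and the names `toy_docs` and `toy_index` are made up for illustration.

```python
from collections import defaultdict

# Hypothetical toy collection: doc_id -> already tokenized text
toy_docs = {
    0: ["helena", "estacio", "casa"],
    1: ["capitu", "casa"],
    2: ["helena", "rua"],
}

# term -> set of ids of the documents that contain it
toy_index = defaultdict(set)
for doc_id, tokens in toy_docs.items():
    for token in tokens:
        toy_index[token].add(doc_id)

print(sorted(toy_index["helena"]))                      # documents containing "helena"
print(sorted(toy_index["helena"] & toy_index["casa"]))  # "helena AND casa"
```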


## 1.3 Inverted Index

### 1.3.1 Cleaning up and tokenizing text

In this section we'll tokenize a piece of text (split it into terms or tokens) and clean it: remove accents and punctuation and convert all letters to lower case. We also remove stop words, the most common words such as articles, prepositions, etc.
These steps improve the results of our search engine: a user can search for "Capitú", "capitu", "Capitu" or "capitú", and we want all these forms to match the character from Machado de Assis's book "Dom Casmurro". So we process both the text of the books and, later, the user queries with the same functions.

In [None]:
WORD_MIN_LENGTH = 2 ## we'll drop all tokens shorter than this

def strip_accents(text):
    """Strip accents and punctuation from text. 
    For instance: strip_accents("João e Maria, não entrem!") 
    will return "Joao e Maria  nao entrem "

    Parameters:
    text (str): Input text

    Returns:
    str: text without accents and punctuation

   """    
    nfkd = unicodedata.normalize('NFKD', text)
    newText = u"".join([c for c in nfkd if not unicodedata.combining(c)])
    return re.sub('[^a-zA-Z0-9 \\\']', ' ', newText)

def tokenize_text(text):
    """Make all necessary preprocessing of text: strip accents and punctuation, 
    remove line breaks, tokenize our text, convert to lower case, remove stop words and 
    words with less than 2 chars.

    Parameters:
    text (str): Input text

    Returns:
    list of str: cleaned tokens

   """        
    text = strip_accents(text)
    text = re.sub(re.compile('\n'),' ',text)
    words = word_tokenize(text)
    words = [word.lower() for word in words]
    words = [word for word in words if word not in STOP_WORDS and len(word) >= WORD_MIN_LENGTH]
    return words
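
As a quick, standalone check of the normalization idea, the sketch below shows that the four spellings of "Capitu" mentioned earlier collapse into a single token. `normalize_term` is a hypothetical helper written only for this example; it covers accent removal and lowercasing, not the full pipeline.

```python
import unicodedata

def normalize_term(term):
    """Hypothetical helper: drop accents and lowercase a single term."""
    decomposed = unicodedata.normalize("NFKD", term)
    without_accents = "".join(c for c in decomposed if not unicodedata.combining(c))
    return without_accents.lower()

variants = ["Capitú", "capitu", "Capitu", "capitú"]
print({normalize_term(v) for v in variants})  # a single normalized form
```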

### 1.3.2 Building Inverted index

To build our Inverted Index, we'll use two auxiliary functions. The first one processes the tokenized text of one document (book) to build a local inverted index. The second merges this information into our Inverted Index, adding the document id of each document. The final structure of our index will have the form:

<pre>
{'term1': 
    {doc_id0: [pos0, pos1, pos2], #positions of term1 found in doc_id0
     doc_id1: [pos0, pos1, pos2]  #positions of term1 found in doc_id1
    }, 
 'term2': 
    {doc_id0: [pos0, pos1, pos2], #positions of term2 found in doc_id0
     doc_id3: [pos0, pos1, pos2]  #positions of term2 found in doc_id3
    },
 ...
}
</pre>

In [None]:
def inverted_index(words):
    """Create a inverted index of words (tokens or terms) from a list of terms

    Parameters:
    words (list of str): tokenized document text

    Returns:
    Inverted index of document (dict)

   """       
    inverted = {}
    for index, word in enumerate(words):
        locations = inverted.setdefault(word, [])
        locations.append(index)
    return inverted

def inverted_index_add(inverted, doc_id, doc_index):
    """Insert document id into Inverted Index

    Parameters:
    inverted (dict): Inverted Index
    doc_id (int): Id of the document being added
    doc_index (dict): Inverted Index of a specific document.

    Returns:
    Updated Inverted Index (dict)

   """        
    for word in doc_index.keys():
        locations = doc_index[word]
        indices = inverted.setdefault(word, {})
        indices[doc_id] = locations
    return inverted
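
To make the target structure concrete, here is a compact sketch that builds the same kind of positional index in a single pass over two toy documents. `build_positional_index` and the sample tokens are hypothetical and only mirror the two functions above; they are not a replacement for them.

```python
def build_positional_index(tokenized_docs):
    """Hypothetical one-pass variant: term -> {doc_id: [positions]}."""
    index = {}
    for doc_id, tokens in enumerate(tokenized_docs):
        for position, token in enumerate(tokens):
            index.setdefault(token, {}).setdefault(doc_id, []).append(position)
    return index

sample_docs = [
    ["capitu", "olhos", "ressaca", "capitu"],
    ["olhos", "rua"],
]
positional_index = build_positional_index(sample_docs)
print(positional_index["capitu"])  # {0: [0, 3]}
print(positional_index["olhos"])   # {0: [1], 1: [0]}
```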

Now we iterate over our collection of books and build our Inverted Index.

In [None]:
inverted_doc_indexes = {}
files_with_index = []
files_with_tokens = {}
doc_id=0
for fname in tqdm(files):
    with open(fname , "r") as file:
        text = file.read()
    #Clean and Tokenize text of each document
    words = tokenize_text(text)
    #Store tokens
    files_with_tokens[doc_id] = words

    doc_index = inverted_index(words)
    inverted_index_add(inverted_doc_indexes, doc_id, doc_index)
    files_with_index.append(os.path.basename(fname))
    doc_id = doc_id+1

In [None]:
## Check that the "capitu" token is found in the Dom Casmurro book:
capitu_docs = inverted_doc_indexes['capitu']
for idx in capitu_docs.keys():
    print(files_with_index[idx])

## 1.4 Running boolean search
Now we're ready to run a boolean search on our inverted index. This index and the query processing together make up what we call a *search engine*. To simplify, again, our boolean search function will use only the AND operator, retrieving only documents that contain all terms in the user query.

In [None]:
## Using AND as logical operator
def boolean_search(inverted, file_names, query):
    """Run a boolean search with AND operator between terms over 
    the inverted index.

    Parameters:
    inverted (dict): Inverted Index
    file_names (list): List with names of files (books)
    query (str): Query text

    Returns:
    Names of books that match the query.

   """      
    # preprocess the user query with the same function used to build the Inverted Index;
    # query terms absent from the index are ignored
    words = [word for word in tokenize_text(query) if word in inverted]
    # one set of matching document ids per query term
    results = [set(inverted[word].keys()) for word in words]
    # AND operator. Replace & with | to get OR behavior.
    docs = reduce(lambda x, y: x & y, results) if results else set()
    return [file_names[doc] for doc in docs]

Now we'll test some passages and character names from the author's well-known books.

In [None]:
# Passage from "Quincas borba"
print(boolean_search(inverted_doc_indexes, files_with_index, 
                     "Ao vencido, ódio ou compaixão; ao vencedor, as batatas"))
print(boolean_search(inverted_doc_indexes, files_with_index, 
                     "Quincas borba"))

In [None]:
# Passage from "Dom Casmurro"
print(boolean_search(inverted_doc_indexes, files_with_index, 
                     "Capitu, apesar daqueles olhos que o diabo lhe deu"))
# Character name from "Dom Casmurro"
print(boolean_search(inverted_doc_indexes, files_with_index, 
                     "Capitu"))

In [None]:
## Passage from "Memórias Póstumas de Brás Cubas"
print(boolean_search(inverted_doc_indexes, files_with_index, 
                     "Sandice criar amor às casas alheias, de modo que, \
                     apenas senhora de uma, dificilmente lha farão despejar"))
print(boolean_search(inverted_doc_indexes, files_with_index, 
                     "Sandice"))

As we can see above, our boolean search works pretty well. But this model has a few problems:

* The user must have some knowledge of the collection to use the logical operators (OR, AND, NOT...) properly. We oversimplified here by using only AND.
* We don't have a notion of rank here. For example, if we search for a couple of common words that are not in STOP_WORDS, it'll probably return all documents, but in no order of importance. 
* Larger documents are more likely to be returned by any query, since they contain more terms. 

We'll try to address these issues with another retrieval model called Ranked TF-IDF.

PS: Phrase queries raise several additional issues that we won't deal with here. Please refer to Chapter 2 of Chris Manning's [book](https://nlp.stanford.edu/IR-book/pdf/irbookonlinereading.pdf).

In [None]:
## Example of a two-term query matched by several books, but there's 
## no relevance order to tell us which result to look at first. 
print(boolean_search(inverted_doc_indexes, files_with_index, "árvore rua"))

# 2. Ranked TF-IDF Retrieval Model

TF-IDF stands for term frequency–inverse document frequency. In this retrieval model, each pair of term *i* and document *d* receives a weight given by the formulas below:

* $ \mbox{Tf-Idf}_{i,d} = \mbox{Tf}_{i,d} \cdot \mbox{Idf}_{i} $

* $ \mbox{Tf}_{i,d} = 1 + \log(f_{i,d}) $, where $f_{i,d}$ is how many times term $i$ occurs in document $d$

* $ \mbox{Idf}_{i} = \log(N / n_{i}) $, where $N$ is the number of documents in the collection and $n_{i}$ is the number of documents in which term $i$ occurs


Then we compute the relevance of a document to a query by adding up the weights of the query terms that occur in that document:

* $ \mbox{Score}_{q,d} = \sum_{i \in q \cap d} \mbox{Tf-Idf}_{i,d} $


See https://en.wikipedia.org/wiki/Tf%E2%80%93idf
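
A small numeric sketch may help to get a feel for these formulas. All the numbers below are hypothetical and chosen only for illustration; the logarithm is the natural one.

```python
import math

# Hypothetical numbers, chosen only to illustrate the formulas
N = 100      # documents in the collection
n_i = 10     # documents that contain term i
f_id = 5     # occurrences of term i in document d

tf = 1 + math.log(f_id)    # log-scaled term frequency
idf = math.log(N / n_i)    # inverse document frequency
weight = tf * idf          # Tf-Idf weight of term i in document d

# A term that occurs in every document carries no information: its Idf is 0
idf_common = math.log(N / N)

print(round(tf, 3), round(idf, 3), round(weight, 3), idf_common)
```

Rare terms get a high Idf and dominate the score, while repeating a term inside a document helps only logarithmically.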

To implement this model here, we'll create a dictionary in which the key is the pair (doc_id, term) and the value is the Tf-Idf weight. First we calculate the document frequency (the number of documents in which each term is present).

In [None]:
## Number of documents in which each term occurs (document frequency)
DF = {}
for word in inverted_doc_indexes.keys():
    DF[word] = len(inverted_doc_indexes[word])

total_vocab_size = len(DF)
print(total_vocab_size)

In [None]:
tf_idf = {} # Our data structure to store Tf-Idf weights

N = len(files_with_tokens)

for doc_id, tokens in tqdm(files_with_tokens.items()):
    
    counter = Counter(tokens) # maps each term to its count in this document
    
    for token, count in counter.items():
        
        # Calculate Tf (log-scaled term frequency)
        tf = 1+np.log(count)
        
        # Calculate Idf, smoothed with +1 to avoid division by zero
        df = DF.get(token, 0)
        idf = np.log((N+1)/(df+1))
        
        # Calculate Tf-idf        
        tf_idf[doc_id, token] = tf*idf

## 2.1 Running Ranked search

As with boolean search, here we define a function that processes a query and returns the documents that match it, now sorted by score.

In [None]:
def ranked_search(k, tf_idf_index, file_names, query):
    """Run ranked query search using tf-idf model.

    Parameters:
    k (int): number of results to return
    tf_idf_index (dict): Data Structure storing Tf-Idf weights to each 
                        pair of (term,doc_id) 
    file_names (list): List with names of files (books)
    query (str): Query text

    Returns:
    Top-k names of books that match the query.

   """   
    tokens = set(tokenize_text(query))
    query_weights = {}
    # add up the weights of every query term found in each document
    for doc_id, token in tf_idf_index:
        if token in tokens:
            query_weights[doc_id] = query_weights.get(doc_id, 0) + tf_idf_index[doc_id, token]
    
    # sort documents by decreasing score and keep the top k
    query_weights = sorted(query_weights.items(), key=lambda x: x[1], reverse=True)
    return [file_names[doc_id] for doc_id, _ in query_weights[:k]]
    

Again, we'll test the same passages and character names we used previously.

In [None]:
# Passage from "Quincas borba"
print(ranked_search(10, tf_idf, files_with_index, 
                    "Ao vencido, ódio ou compaixão; ao vencedor, as batatas"))
print(ranked_search(10, tf_idf, files_with_index, "Quincas borba"))

In [None]:
# Passage from "Dom Casmurro"
print(ranked_search(10, tf_idf, files_with_index, 
                    "Capitu, apesar daqueles olhos que o diabo lhe deu"))
print(ranked_search(10, tf_idf, files_with_index, "Capitu"))

In [None]:
## Passage from "Memórias Póstumas de Brás Cubas"
print(ranked_search(10, tf_idf, files_with_index, 
                    "Sandice criar amor às casas alheias, de modo que, \
                    apenas senhora de uma, dificilmente lha farão despejar"))
print(ranked_search(10, tf_idf, files_with_index, "Sandice"))

In [None]:
## Here we reproduce the previous query with common words, 
## but now we have a score to sort results.
print(ranked_search(10, tf_idf, files_with_index, "árvore rua"))

# 3. Evaluating IR Systems

In this section we'll discuss how to evaluate an Information Retrieval system. To assess our design decisions (kind of data structure, preprocessing steps, type of term weighting, etc.) we need to set up a benchmark.

Several benchmarks are available on the Internet. Probably the most important IR dataset nowadays is the [MS Marco Document retrieval dataset](https://microsoft.github.io/msmarco/). It contains more than 3 million documents and 300k queries. Besides documents and queries, an IR benchmark must include a relevance mapping between them. This way we can evaluate whether a document returned by our system should have been returned or not according to this mapping.

There are several metrics to evaluate IR systems. The evaluation process consists of "firing" a set of queries "against" the IR system and comparing the returned documents with the answers annotated in the relevance mapping. Some metrics take the order of the returned documents into account, but others don't. In some metrics we define a cutoff on the number of documents returned (e.g. top 10 documents only). An extensive list of metrics can be found [here](https://en.wikipedia.org/wiki/Evaluation_measures_(information_retrieval)).

In this tutorial we'll use [MRR@10 (Mean Reciprocal Rank)](https://en.wikipedia.org/wiki/Mean_reciprocal_rank), a metric that takes into account only the position of the first relevant document among the first 10 documents returned for each query. The MS Marco benchmark also uses this metric, but calculated over the first 100 returned results.


The code below receives a list of boolean relevance vectors, one per query, and returns the MRR@10.

In [None]:
##Source: https://gist.github.com/bwhite/3726239
def mean_reciprocal_rank(bool_results, k=10):
    """Score is reciprocal of the rank of the first relevant item
    First element is 'rank 1'.  Relevance is binary (nonzero is relevant).
    Example from http://en.wikipedia.org/wiki/Mean_reciprocal_rank
    >>> rs = [[0, 0, 1], [0, 1, 0], [1, 0, 0]]
    >>> mean_reciprocal_rank(rs)
    0.61111111111111105

    Args:
        bool_results: Iterator of relevance scores (list or numpy) in rank order
            (first element is the first item)
    Returns:
        Mean reciprocal rank
    """
    bool_results = (np.atleast_1d(r[:k]).nonzero()[0] for r in bool_results)
    return np.mean([1. / (r[0] + 1) if r.size else 0. for r in bool_results])

mean_reciprocal_rank([[0, 0, 1], [0, 1, 0], [1, 0, 0]])
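
To see what the cutoff does, here is a plain-Python sketch of the reciprocal rank for a single query. `reciprocal_rank_at_k` is a hypothetical helper written for this illustration; it is not part of the gist used above.

```python
def reciprocal_rank_at_k(relevance, k=10):
    """Hypothetical helper: 1/rank of the first relevant item within the top k, else 0."""
    for rank, is_relevant in enumerate(relevance[:k], start=1):
        if is_relevant:
            return 1.0 / rank
    return 0.0

first_hit_at_3 = [0, 0, 1] + [0] * 9
first_hit_at_11 = [0] * 10 + [1]

print(reciprocal_rank_at_k(first_hit_at_3))         # first relevant result at rank 3
print(reciprocal_rank_at_k(first_hit_at_11))        # outside the top 10, so it counts as 0
print(reciprocal_rank_at_k(first_hit_at_11, k=20))  # a larger cutoff rewards it again
```

The mean of these per-query values over all queries is the MRR@k.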

In [None]:
!pip install rank_bm25
from rank_bm25 import BM25Okapi

## 3.1 Load and process CISI dataset

Instead of using a huge dataset, for didactic purposes we'll use a much smaller one, known as the [CISI collection](http://ir.dcs.gla.ac.uk/resources/test_collections/cisi/), from Glasgow University. The code below loads the original dataset into dictionaries for easy access.

In [None]:
### Processing DOCUMENTS
doc_set = {}
doc_id = ""
doc_text = ""
with open('/kaggle/input/cisi-a-dataset-for-information-retrieval/CISI.ALL') as f:
    lines = ""
    for l in f.readlines():
        lines += "\n" + l.strip() if l.startswith(".") else " " + l.strip()
    lines = lines.lstrip("\n").split("\n")
doc_count = 0
for l in lines:
    if l.startswith(".I"):
        doc_id = int(l.split(" ")[1].strip())-1
    elif l.startswith(".X"):
        doc_set[doc_id] = doc_text.lstrip(" ")
        doc_id = ""
        doc_text = ""
    else:
        doc_text += l.strip()[3:] + " " # The first 3 characters of a line can be ignored.    

        
### Processing QUERIES
with open('/kaggle/input/cisi-a-dataset-for-information-retrieval/CISI.QRY') as f:
    lines = ""
    for l in f.readlines():
        lines += "\n" + l.strip() if l.startswith(".") else " " + l.strip()
    lines = lines.lstrip("\n").split("\n")
    
qry_set = {}
qry_id = ""
for l in lines:
    if l.startswith(".I"):
        qry_id = int(l.split(" ")[1].strip()) -1
    elif l.startswith(".W"):
        qry_set[qry_id] = l.strip()[3:]
        qry_id = ""

### Processing QRELS
rel_set = {}
with open('/kaggle/input/cisi-a-dataset-for-information-retrieval/CISI.REL') as f:
    for l in f.readlines():
        qry_id = int(l.lstrip(" ").strip("\n").split("\t")[0].split(" ")[0]) -1
        doc_id = int(l.lstrip(" ").strip("\n").split("\t")[0].split(" ")[-1])-1
        if qry_id in rel_set:
            rel_set[qry_id].append(doc_id)
        else:
            rel_set[qry_id] = []
            rel_set[qry_id].append(doc_id)

In [None]:
## Here we check some statistics and info of CISI dataset

print('Read %s documents, %s queries and %s mappings from CISI dataset' % 
      (len(doc_set), len(qry_set), len(rel_set)))

number_of_rel_docs = [len(value) for key, value in rel_set.items()]
print('Average %.2f and %d min number of relevant documents by query ' % 
      (np.mean(number_of_rel_docs), np.min(number_of_rel_docs)))

print('Queries without relevant documents: ', 
      np.setdiff1d(list(qry_set.keys()),list(rel_set.keys())))

Below is a sample query paired with one of the documents relevant to it in the dataset.

In [None]:
random.seed(42)
idx = random.sample(list(rel_set.keys()),1)[0] # random.sample needs a sequence, not a dict view

print('Query ID %s ==>' % idx, qry_set[idx])
rel_docs = rel_set[idx]
print('Documents relevant to Query ID %s' % idx, rel_docs)
sample_document_idx = random.sample(rel_docs,1)[0]
print('Document ID %s ==>' % sample_document_idx, doc_set[sample_document_idx])

## 3.2 Index CISI dataset using BM25

To evaluate our benchmark, instead of using the Tf-Idf model we built previously, we'll use a library that implements a ranking function called [Okapi BM25](https://en.wikipedia.org/wiki/Okapi_BM25). It's the standard ranking function built into [Apache Lucene](https://lucene.apache.org/), [Elastic](https://www.elastic.co/) and [Apache Solr](https://solr.apache.org/), leading solutions for indexing and searching documents.

BM25 is very similar to the Tf-Idf concept we discussed earlier in this tutorial.
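
To show where the similarity lies, the sketch below computes the contribution of a single query term to a document's score using the textbook BM25 formula. It is an illustration, not the exact code of `rank_bm25` or Lucene: the function name, the toy numbers and the defaults `k1=1.5` and `b=0.75` are assumptions.

```python
import math

def bm25_term_score(tf, df, n_docs, doc_len, avg_doc_len, k1=1.5, b=0.75):
    """Textbook BM25 contribution of one query term to one document (sketch)."""
    # Idf part: rare terms weigh more (the +1 keeps it positive)
    idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1)
    # Length normalization: documents longer than average are penalized
    length_norm = 1 - b + b * doc_len / avg_doc_len
    # Tf part: grows with tf but saturates instead of growing without bound
    return idf * tf * (k1 + 1) / (tf + k1 * length_norm)

for tf in (1, 2, 5, 20):
    print(tf, round(bm25_term_score(tf, df=10, n_docs=100, doc_len=100, avg_doc_len=100), 3))
```

Like Tf-Idf, the score multiplies a term-frequency part by an Idf part. The differences are that the term-frequency part saturates and that document length is taken into account, which addresses the bias towards larger documents mentioned earlier.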

In the code below we index each document from CISI without any preprocessing and get scores for one random query.

In [None]:
query = qry_set[idx] #get query text
rel_docs = rel_set[idx] #get relevant documents

# Index all documents using BM25
corpus = list(doc_set.values())
tokenized_corpus = [doc.split(" ") for doc in corpus]
bm25 = BM25Okapi(tokenized_corpus)

# Process query and get scores for each indexed document using BM25
tokenized_query = query.split(" ")
print('Query ==> ', query, '\nRelevant documents IDs: ==> ', rel_docs)
scores = bm25.get_scores(tokenized_query)
print(scores, len(scores), len(doc_set))

Finally we sort documents by score, compare with hand annotated relevant documents from dataset and create a boolean mask of the results. With this boolean array we can calculate MRR@10.

In [None]:
## argsort returns indexes in increasing order of value, so we negate the scores to rank from highest to lowest
most_relevant_documents = np.argsort(-scores)

print(most_relevant_documents[:20]) # printing first 20 most relevant results

## Mask relevant documents with 0's and 1's according to query <-> document annotation
masked_relevance_results = np.zeros(most_relevant_documents.shape)
masked_relevance_results[rel_docs] = 1
sorted_masked_relevance_results = np.take(masked_relevance_results, most_relevant_documents)

print(sorted_masked_relevance_results[:20]) #printing first 20 results: 1 is relevant 0 isn't

# Calculate MRR@10
print(mean_reciprocal_rank([sorted_masked_relevance_results]))
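
The sorting and masking steps are easier to follow on a tiny, made-up example. The sketch below uses plain Python lists instead of NumPy, and the scores and relevant ids are hypothetical.

```python
scores = [0.1, 2.0, 0.5, 1.5]   # hypothetical scores for documents 0..3
rel_docs = {2, 3}               # hypothetical ids of the relevant documents

# Document ids sorted by decreasing score (the role of np.argsort(-scores))
ranking = sorted(range(len(scores)), key=lambda doc_id: -scores[doc_id])
# 1 where the document at that rank is relevant, 0 otherwise
relevance_in_rank_order = [1 if doc_id in rel_docs else 0 for doc_id in ranking]

print(ranking)                   # [1, 3, 2, 0]
print(relevance_in_rank_order)   # [0, 1, 1, 0]
```

Here the first relevant document appears at rank 2, so the reciprocal rank for this query would be 1/2.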

Now we're ready to reproduce scores through all queries in dataset. First we'll create a function to return the masked results.

In [None]:
def results_from_query(qry_id, bm25):
    """Return an ordered array of relevant documents returned by query_id

    Args:
        qry_id (int): id of query on dataset
        bm25 (object): indexed corpus

    Returns:
        boolean sorted relevance array of documents
    """    
    query = qry_set[qry_id]
    rel_docs = []
    if qry_id in rel_set:
        rel_docs = rel_set[qry_id]
    tokenized_query = query.split(" ")
    scores = bm25.get_scores(tokenized_query)
    most_relevant_documents = np.argsort(-scores)
    masked_relevance_results = np.zeros(most_relevant_documents.shape)
  
    masked_relevance_results[rel_docs] = 1
    sorted_masked_relevance_results = np.take(masked_relevance_results, most_relevant_documents)
    
    return sorted_masked_relevance_results


results = [results_from_query(qry_id, bm25) for qry_id in list(qry_set.keys())]
print('MRR@10 %.4f' % mean_reciprocal_rank(results))

## 3.3 Trying to improve results

In this section we'll try to improve the results by preprocessing our corpus and queries: stemming, lowercasing and removing stop words.

In [None]:
# Instantiate objects from NLTK
stemmer = nltk.stem.PorterStemmer()
stop_words = set(nltk.corpus.stopwords.words('english')) # a set gives fast membership tests

In [None]:
def preprocess_string(txt, remove_stop=True, do_stem=True, to_lower=True):
    """
    Return a preprocessed tokenized text.
    
    Args:
        txt (str): original text to process
        remove_stop (boolean): whether to remove stop words (common words)
        do_stem (boolean): whether to apply stemming (suffix removal)
        to_lower (boolean): whether to convert the text to lower case
        
    Returns:
        Return a preprocessed tokenized text.
    """      
    if to_lower:
        txt = txt.lower()
    tokens = nltk.tokenize.word_tokenize(txt)
    
    if remove_stop:
        tokens = [tk for tk in tokens if tk not in stop_words]
    if do_stem:
        tokens = [stemmer.stem(tk) for tk in tokens]
    return tokens

In [None]:
corpus = list(doc_set.values())
# You may experiment with this trying to improve MRR@10
remove_stop = True
do_stem = True
to_lower = True

tokenized_corpus = [preprocess_string(doc, remove_stop, do_stem, to_lower) for doc in corpus]

bm25 = BM25Okapi(tokenized_corpus)

def results_from_query_new(qry_id, bm25):
    query = qry_set[qry_id]
    rel_docs = []
    if qry_id in rel_set:
        rel_docs = rel_set[qry_id]
    tokenized_query = preprocess_string(query, remove_stop, do_stem, to_lower)
    scores = bm25.get_scores(tokenized_query)
    most_relevant_documents = np.argsort(-scores)
    masked_relevance_results = np.zeros(most_relevant_documents.shape)
    masked_relevance_results[rel_docs] = 1
    sorted_masked_relevance_results = np.take(masked_relevance_results, most_relevant_documents)
    return sorted_masked_relevance_results


results = [results_from_query_new(qry_id, bm25) for qry_id in list(qry_set.keys())]
print('MRR@10 %.4f' % mean_reciprocal_rank(results))

As we can see, these 3 preprocessing steps bring a huge improvement (~35%) in MRR@10!