#### Information retrieval - Part1

Information retrieval (IR) is the process of obtaining relevant information from a large repository, such as a database or the internet. It involves the use of algorithms and techniques to search, filter, and rank data to provide users with the most pertinent results. With the exponential growth of digital information, effective information retrieval systems have become essential for accessing and managing vast amounts of data efficiently. 
In these series of posts, I will start from the formative concepts and techniques of information retrieval, exploring its foundational principles and gradually advancing to more complex topics and applications.

I will go through boolean retrieval methods with boolean matrix and inverted index for the given documents

In [None]:
## Boolean Retrieval Model

import re
import os
import numpy as np

path = "../ir"
shakes = ["jc.txt", "kinglear.txt", "macbeth.txt", "othello.txt", "merchvenice.txt"]

tokens = []

# simple function to preprocess text (remove punctuation, leading underscores and lowercase all words)
def preprocess(text):
    cleaned_text = re.sub(r'\b_|\W+|\d', ' ', text)
    cleaned_text = re.sub(r'\s+', ' ', cleaned_text).strip()
    words = cleaned_text.lower().split()
    return words

for shake in shakes:
    with open(os.path.join(path, shake), "r", encoding="utf-8") as file:
        text = file.read()
        words = preprocess(text)
        tokens.append(words)

tokens = [word for sublist in tokens for word in sublist]

# get all unique words in the plays, sorted alphabetically
unique_tokens = sorted(list(set(tokens)))
print("Total number of words in all the plays: ", len(tokens))
print("Unique words in all the plays: ", len(unique_tokens))
print(unique_tokens[:20])

Total number of words in all the plays:  136282
Unique words in all the plays:  8990
['a', 'abandon', 'abate', 'abated', 'abatement', 'abed', 'abhor', 'abhorr', 'abhorred', 'abide', 'abilities', 'ability', 'abject', 'abjure', 'able', 'abler', 'aboard', 'abode', 'abominable', 'abound']


In [54]:
# This code initializes the boolean matrix with zeros and then iterates through each unique word and each play. 
# If the word appears in the play, it sets the corresponding entry in the boolean matrix to one.
# This is not a very efficient way to do this, but it is simple and easy to understand.  
# Limited to first 20 unique words.
num_tokens = 1000
boolean_matrix = np.zeros((len(unique_tokens[: num_tokens]), len(shakes)))
for i, token in enumerate(unique_tokens[:num_tokens]):
    for j, shake in enumerate(shakes):
        with open(os.path.join(path, shake), "r", encoding="utf-8") as file:
            text = file.read()
            words = preprocess(text)
            boolean_matrix[i, j] = token in words

In [56]:
print(unique_tokens[:num_tokens])

['a', 'abandon', 'abate', 'abated', 'abatement', 'abed', 'abhor', 'abhorr', 'abhorred', 'abide', 'abilities', 'ability', 'abject', 'abjure', 'able', 'abler', 'aboard', 'abode', 'abominable', 'abound', 'about', 'above', 'abram', 'abridg', 'abroad', 'absence', 'absent', 'absolute', 'abundance', 'abus', 'abuse', 'abused', 'abuser', 'abuses', 'accent', 'accents', 'accept', 'acceptance', 'accepted', 'accepting', 'access', 'accessed', 'accessible', 'accident', 'accidental', 'accidents', 'accommodate', 'accommodation', 'accompany', 'accomplished', 'accordance', 'according', 'accordingly', 'account', 'accountant', 'accounted', 'accoutered', 'accoutred', 'accumulate', 'accurs', 'accursed', 'accus', 'accuse', 'accuser', 'accustomed', 'acerb', 'ache', 'acheron', 'aches', 'achiev', 'acknowledg', 'acknowledge', 'acknowledged', 'acknown', 'acquaint', 'acquaintance', 'acquainted', 'acquitted', 'acre', 'across', 'act', 'acted', 'acting', 'action', 'actions', 'active', 'actors', 'acts', 'actual', 'adag

In [60]:
# Get the size of the boolean array in bytes
size_in_kb = boolean_matrix.nbytes
print(f"Size of the boolean array in kilobytes: {size_in_kb / 1024}")

total_elements = boolean_matrix.size
num_zeros = np.count_nonzero(boolean_matrix == 0)
percentage_zeros = (num_zeros / total_elements) * 100

print(f"Percentage of zeros in the boolean matrix: {percentage_zeros:.2f}%")

# Example query with and and or operators
# This function takes a query string as input and returns a list of plays that contain the words in the query.
def query_words(query):
    query = query.lower()
    if " and " in query:
        words = query.split(" and ")
        result_matrix = np.ones(len(shakes), dtype=bool)
        for word in words:
            if word in unique_tokens[:num_tokens]:
                word_index = unique_tokens[:num_tokens].index(word)
                result_matrix = np.logical_and(result_matrix, boolean_matrix[word_index])
            else:
                return []
    elif " or " in query:
        words = query.split(" or ")
        result_matrix = np.zeros(len(shakes), dtype=bool)
        for word in words:
            if word in unique_tokens[:num_tokens]:
                word_index = unique_tokens[:num_tokens].index(word)
                result_matrix = np.logical_or(result_matrix, boolean_matrix[word_index])
    else:
        if query in unique_tokens[:num_tokens]:
            word_index = unique_tokens[:num_tokens].index(query)
            result_matrix = boolean_matrix[word_index]
        else:
            return []

    plays_with_words = [shakes[j] for j in range(len(shakes)) if result_matrix[j]]
    return plays_with_words

# Example queries
and_query = "abandon and lear"
or_query = "abandon or lear"

plays_and = query_words(and_query)
plays_or = query_words(or_query)

print(f"The plays containing both words in and_query': {plays_and}")
print(f"The plays containing either words in or_query : {plays_or}")

Size of the boolean array in kilobytes: 39.0625
Percentage of zeros in the boolean matrix: 57.84%
The plays containing both words in and_query': []
The plays containing either words in or_query : ['othello.txt']


Boolean retrieval is a simple and efficient way to retrieve documents that contain specific words or phrases. But it has limitations such as:

* the inability to rank documents based on relevance.
* Only the presence or absence of words is considered, and there is no notion of term frequency or document frequency.
* For just the first 1000 words, the boolean matrix takes up 39.06 KB of memory. This would grow linearly with the number of unique words.
* The percentage of zeros in the boolean matrix is 58% , which is expected since most words will not appear in most plays. The boolean matrix is not very efficient for large datasets, as it requires a lot of memory and most of the entries are zeros.


In [77]:
# Inverted index implementation for the boolean retrieval model
# it's a simple implementation of an inverted index that stores the frequency of each word and the list of documents in which it appears.
from collections import defaultdict
  

class InvertedIndex:
    def __init__(self):
        self.index = defaultdict(lambda: {"frequency": 0, "posting_list": []})
    
    def preprocess(self, text):
        cleaned_text = re.sub(r'\b_|\W+|\d', ' ', text)
        cleaned_text = re.sub(r'\s+', ' ', cleaned_text).strip()
        words = cleaned_text.lower().split()
        return words
    
    def index_document(self, document, doc_id):
        words = self.preprocess(document)
        for word in words:
            self.index[word]["frequency"] += words.count(word)
            if doc_id not in self.index[word]["posting_list"]:
                self.index[word]["posting_list"].append(doc_id)

    def query(self, word):
            word = word.lower()
            if word in self.index:
                return self.index[word]
            else:
                return {"frequency": 0, "posting_list": []}
    
# Example usage
path = "../ir"
shakes = ["jc.txt", "kinglear.txt", "macbeth.txt", "othello.txt", "merchvenice.txt"]

inverted_index = InvertedIndex()

for doc_id, shake in enumerate(shakes):
    print(f"Indexing document {shake} with id {doc_id}")
    with open(os.path.join(path, shake), "r", encoding="utf-8") as file:
        text = file.read()
        inverted_index.index_document(text, doc_id)


Indexing document jc.txt with id 0
Indexing document kinglear.txt with id 1
Indexing document macbeth.txt with id 2
Indexing document othello.txt with id 3
Indexing document merchvenice.txt with id 4


In [79]:
print("Size of the inverted index in kilo bytes: ", inverted_index.index.__sizeof__() / 1024)
word_to_query = "king"
result = inverted_index.query(word_to_query)
print(f"Frequency of '{word_to_query}': {result['frequency']}")
print(f"Posting list for '{word_to_query}': {[shakes[i] for i in result['posting_list']]}")

Size of the inverted index in kilo bytes:  288.078125
Frequency of 'king': 7791
Posting list for 'king': ['jc.txt', 'kinglear.txt', 'macbeth.txt', 'othello.txt', 'merchvenice.txt']


#### Benefits of using an inverted index for the boolean retrieval model

* Inverted index is faster to construct and more efficient than the boolean matrix for large collections of documents. Just took 1 minute 30 seconds to build for complete vocabulary
* The inverted index is more efficient in terms of memory usage and query performance.

Note :The query function can be modified to support boolean queries by combining the posting lists of the words in the query.

Reference and further reading
-----------------------------

* "Introduction to Information Retrieval" by Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze