# Developing an Information Retrieval System with Advanced Boolean Search

<div style="line-height: 2;">This project aims to develop an Information Retrieval (IR) system that supports both standard Boolean queries and proximity queries. The system is designed to handle a collection of text documents, building an Inverted Index and a Positional Index to facilitate efficient document retrieval.</div>

<div style="line-height: 2;">The first step in developing this system is reading the prepared document files. Before that, we need to import some necessary libraries:
    <ul>
        <li>
            <b>NLTK</b>: Natural Language ToolKit: A leading platform for building Python programs to work with human language data
        </li>
        <li>
            <b>OS</b>: A built-in module that provides a wide range of functions for interacting with the operating system. It allows you to work with file systems, directories, paths, environment variables, and more.
        </li>
        <li>
            <b>String</b>: A built-in module that provides string constants (such as <code>string.punctuation</code>) and string helper functions
        </li>
    </ul>

We also import two specific parts of NLTK, which are described later:
<ul>
    <li>
        <b>Stopwords</b>
    </li>
    <li>
        <b>Regexp_tokenize</b>
    </li>
</ul>
</div>

In [1]:
import nltk
import os
import string

In the current directory, we have a folder named **data** which contains our text files.

In [2]:
# Define path of data files here
folder_path = "data"

With `os.listdir()` we make a list of all files and directories in the `folder_path`.

In [3]:
# make a list of all files in "data" directory which are our text files
file_list = os.listdir(folder_path)



Because the folder might contain directories or files that are not text files, we use the code below to keep only the `.txt` files.


In [4]:
# text_files stores text files' names
text_files = [file for file in file_list if file.endswith('.txt')]

# print text files titles
print("document titles:\n" + "-"*40)
for text_file in text_files:
    print(text_file)
print( "-"*40)

document titles:
----------------------------------------
Jerry Decided To Buy a Gun.txt
Rentals at the Oceanside Community.txt
Gasoline Prices Hit Record High.txt
Cloning Pets.txt
Crazy Housing Prices.txt
Man Injured at Fast Food Place.txt
A Festival of Books.txt
Food Fight Erupted in Prison.txt
Better To Be Unlucky.txt
Sara Went Shopping.txt
Freeway Chase Ends at Newsstand.txt
Trees Are a Threat.txt
A Murder-Suicide.txt
Happy and Unhappy Renters.txt
Pulling Out Nine Tons of Trash.txt
----------------------------------------



<div style="line-height: 2;">
As we mentioned above, some specific functions of NLTK package are needed.

- **regexp_tokenize**: we use this function for tokenization, specifically for tokenizing text based on a regular expression pattern.
- **stopwords**: This module contains lists of common stop words for multiple languages; in this project we use the English list. Some examples of English stopwords are: "the", "and", "or", "is", "was", ... .
</div>


In [5]:
from nltk.tokenize import regexp_tokenize
from nltk.corpus import stopwords

We also need to download the stopwords used by the preprocessing step.

In [6]:
# nltk.download('punkt')  # not needed for regexp_tokenize
# Download the English stopword list (only needed once per environment)
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/Fatemeh/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

## Preprocessing

<div style="line-height: 2;">
In this step, we preprocess the text files with a single function. The following steps have to be implemented:
    <ol>
        <li>
            <b>Tokenization</b>: we use the `regexp_tokenize()` function from the `NLTK` library, which requires a regular expression describing what counts as a token. Our documents contain words, numbers of up to three digits, longer numbers written with thousands separators (for example "75,000"), and floating-point numbers, so the pattern has to handle all of these at once. We use `r'\d{1,3}(?:,\d{3})*(?:\.\d+)?|\w+'`, which breaks down as follows: <br>
            <ul>
                <li>
                    \d{1,3}: This part matches one to three-digit numbers like "4", "32", "879", not "1234" or "1,000".
                </li>
                <li>
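                (?:,\d{3})*: This part matches zero or more groups of a comma followed by three digits, which covers numbers with thousands separators like "75,000".
                </li>
                <li>
                (?:\.\d+)?: This optional part matches a decimal point followed by digits, which covers floating-point numbers.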
                </li>
                <li>
                |: This symbol acts as an OR operator, allowing the regular expression to match either a number (as described above) or its next part.
                </li>
                <li>
                \w+: This part matches words, which consist of one or more word characters (letters, digits, and underscores). It's not limited to numbers and can match words like "apple," "word123," and "my_variable."
                </li>
            </ul>
        </li>
      <li>
          <b>Lowercase tokens</b>: we convert every token to lowercase. Queries go through the same preprocessing, so matching does not depend on capitalization.
      </li>
        <li>
         <b>Delete stopwords</b>: We delete stopwords. These are words that are used frequently in documents and are not necessary for Information Retrieval. A prepared list of English stopwords exists, and we downloaded it above. Retrieval becomes faster because there are fewer words to check.
        </li>
        <li>
           <b>Delete punctuation</b>: Punctuation is not needed when we want to retrieve information, so we remove all punctuation from all documents.
        </li>
    </ol>
</div>
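To make the pattern concrete, here is a minimal sketch that applies it to a made-up sentence. It uses the standard-library `re.findall` as a stand-in for `regexp_tokenize`, on the assumption that both return the same matches for a pattern like this one, which has no capturing groups. The sample text is an illustration and does not come from the document collection.

```python
import re

# Same pattern as in the text; re.findall stands in for regexp_tokenize here
pattern = r'\d{1,3}(?:,\d{3})*(?:\.\d+)?|\w+'

# Made-up sample sentence (not taken from the data folder)
sample = "Jerry paid $75,000 and 3.5 percent for 4 guns."
tokens = re.findall(pattern, sample)
print(tokens)
# ['Jerry', 'paid', '75,000', 'and', '3.5', 'percent', 'for', '4', 'guns']

# Caveat: a number longer than three digits without separators is split
print(re.findall(pattern, "1234"))
# ['123', '4']
```

The second call shows a limitation worth knowing: because the number alternative is tried first, a plain four-digit number such as "1234" is cut after three digits instead of being kept as one token.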

In [7]:
def preprocessing(text):
    '''
    Tokenize, lowercase, and filter the text of one document.

    The tokenization pattern matches numbers of up to three digits, longer
    numbers written with thousands separators, floating-point numbers, and words.
    '''
    pattern = r'\d{1,3}(?:,\d{3})*(?:\.\d+)?|\w+'
    
    # Tokenize the text using the regular expression pattern    
    tokens = regexp_tokenize(text, pattern)
    
    # Lowercase the tokens
    tokens = [word.lower() for word in tokens]
    
    # Get the set of English stopwords
    stop_words = set(stopwords.words('english'))
    
    # Remove stopwords
    tokens = [word for word in tokens if word not in stop_words]
    
    # Remove punctuation
    tokens = [word for word in tokens if word not in string.punctuation]
    
    # Join tokens to return a string for each document
    return ' '.join(tokens)


## Trie tree

We define a trie node and a trie so that the trie can serve as the permuterm index alongside the inverted index.
(The trie implementation was copied from ChatGPT, with some additions, because implementing it is not required in this course.)

In [9]:
class TrieNode:
    def __init__(self, char):
        self.char = char
        self.children = {}
        self.is_end_of_term = False
        self.postingList = {}
        
class Trie:
    def __init__(self):
        self.root = TrieNode("")
        self.result = []
    
    # Function for inserting a new word in trie
    def insert(self, term, docID, index):
        node = self.root
        for char in term:
            if char not in node.children:
                node.children[char] = TrieNode(char)
            node = node.children[char]
        node.is_end_of_term = True
        # Append the position instead of overwriting earlier ones for this docID
        node.postingList.setdefault(docID, []).append(index)
        
    # Function for searching a word in trie
    def search(self, term):
        self.result = []
        node = self.root
        for char in term:
            if char not in node.children:
                # No stored term starts with this prefix
                return []
            node = node.children[char]
        self.dfs(node, term[:-1])
        return self.result
    
    # This function uses dfs algorithm to find children of a node
    def dfs(self, node, prefix):
        if node.is_end_of_term:
            self.result.append(prefix + node.char)
        for child in node.children.values():
            self.dfs(child, prefix + node.char)

In [10]:
# Make a trie tree
trie_inverted_index = Trie()
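As a rough illustration of why a trie suits prefix lookups, here is a compact sketch built on nested dictionaries rather than the `TrieNode` class above. The helper names `insert` and `with_prefix`, the `"$end"` marker key, and the three sample words are assumptions made for this example only.

```python
def insert(trie, word):
    """Walk or create one nested dict per character, then mark the word end."""
    node = trie
    for char in word:
        node = node.setdefault(char, {})
    node["$end"] = True  # multi-character marker key, cannot clash with a single char

def with_prefix(trie, prefix):
    """Return all stored words that start with prefix, in sorted order."""
    node = trie
    for char in prefix:
        if char not in node:
            return []
        node = node[char]
    found = []
    stack = [(node, prefix)]
    while stack:  # iterative depth-first traversal below the prefix node
        current, text = stack.pop()
        for key, child in current.items():
            if key == "$end":
                found.append(text)
            else:
                stack.append((child, text + key))
    return sorted(found)

trie = {}
for word in ["gas", "gasoline", "gun"]:
    insert(trie, word)

print(with_prefix(trie, "ga"))  # ['gas', 'gasoline']
print(with_prefix(trie, "x"))   # []
```

The lookup cost depends on the length of the prefix and the size of the matching subtree, not on the total number of stored words.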

We also import `defaultdict` from the `collections` module. It creates dictionaries that supply a default value for newly inserted keys.

In [8]:
from collections import defaultdict

## Inverted index

<div style="line-height: 2;">Now, after preprocessing, it is time to create our inverted index.<br>
    An inverted index is a dictionary whose keys are the distinct words across all document files. The value for each word is another dictionary that maps the docID of every document containing the word to the list of positions at which the word appears in that document.<br>
    <span style="color:red;">Note: </span>I used try-except when reading the files because reading all of them with UTF-8 encoding raised the error "UnicodeDecodeError: 'utf-8' codec can't decode byte 0x92 in position 394: invalid start byte". This means at least one file contains bytes that are not valid UTF-8, so I fall back to latin-1 encoding for such files.<br>
    <span style="color:blue;">Warning: </span>The code on lines 53 through 55 should be run only once, because the first run creates the processed files and writes their content. Running it again would duplicate the content of each file.
</div>

The function ```generate_permuterm``` creates all permuterms (rotations) of a term, inserts them into the trie, and returns them as a list.


In [69]:
def generate_permuterm(term, docID, index):
    # Append "$" to mark the end of the term
    term = term + "$"
    permuterms = [term]
    # Rotate the term one character at a time; len(term) - 1 rotations
    # give every distinct rotation without repeating the original
    for _ in range(len(term) - 1):
        term = term[-1] + term[0:-1]
        permuterms.append(term)
    # Store every rotation in the trie
    for permuterm in permuterms:
        trie_inverted_index.insert(permuterm, docID, index)
    return permuterms
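For a quick look at what the permuterms of a term are, the following sketch lists the rotations of a single word without touching the trie. The helper name `rotations` is hypothetical and only serves this illustration.

```python
def rotations(term):
    """Return every rotation of term + '$' (hypothetical helper, no trie involved)."""
    marked = term + "$"
    return [marked[i:] + marked[:i] for i in range(len(marked))]

print(rotations("gas"))
# ['gas$', 'as$g', 's$ga', '$gas']
```

A term of length n has n + 1 rotations, so the permuterm index is several times larger than the vocabulary itself.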

In [12]:
# Create the inverted index using defaultdict() for simplicity
inverted_index = defaultdict(dict)

#Create permuterm index which has all words' permuterms as keys and all base words as values
permuterm_index = defaultdict(dict)

# Create an empty list which will contain docIDs
docIDs = []

# Define a variable for allocating docIDs to the files
id = 1

# Start to read document files
for text_file in text_files:
    
    # Define the docID for the current file we want to read
    file_id = id 
    
    # Append the docID to the corresponding list
    docIDs.append(id)
    
    # Join file_path and file name to make a full file path
    file_path = os.path.join(folder_path, text_file)
    
    # Open files, read them, and store their content in file_content variable
    try:
        with open(file_path, 'r', encoding='utf-8') as file:
            file_content = file.read()
    except UnicodeDecodeError:
        with open(file_path, 'r', encoding='latin-1') as file:
            file_content = file.read()

    # Preprocess each file's content using the preprocessing
    # function which is defined above
    processed_text = preprocessing(file_content)
    
    # Create inverted index: check if each word exists in processed document or not
    for index, token in enumerate(processed_text.split()):

        # Generate all token's permuterms and add them in trie_inverted_index
        permuterms = generate_permuterm(token, file_id, index)
        
        # Expand permuterm_index dict.
        for permuterm in permuterms:
            if permuterm not in permuterm_index:
                permuterm_index[permuterm][token] = [index]
            else:
                permuterm_index[permuterm][token].append(index)
        
        # Expand inverted_index dict.
        if token in inverted_index:
            
            # If file_id of the token is in the corresponding list of the 
            # token in dictionary, just add its index to the corresponding list
            if file_id in inverted_index[token]:
                inverted_index[token][file_id].append(index)
         
            # Else, add the file_id and the token's index in the file to the dictionary
            else:
                inverted_index[token][file_id] = [index]

        # If token is not in the dictionary, add token, its file id,
        # and its index in the corresponding file to the dictionary
        else:
            inverted_index[token][file_id] = [index]
            
    # Prepare id variable for next file
    id = id + 1    
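The shape of the resulting structure is easier to see on a tiny input. The sketch below builds the same kind of positional inverted index for two made-up, already preprocessed documents; the variable names `docs` and `index` are assumptions for this example and are separate from the notebook's own variables.

```python
from collections import defaultdict

# Two made-up, already preprocessed documents (docID -> text)
docs = {1: "jerry bought gun jerry", 2: "gun prices hit record"}

# token -> {docID -> [positions of the token in that document]}
index = defaultdict(dict)
for doc_id, text in docs.items():
    for position, token in enumerate(text.split()):
        index[token].setdefault(doc_id, []).append(position)

print(index["jerry"])  # {1: [0, 3]}
print(index["gun"])    # {1: [2], 2: [0]}
```

Using `setdefault` collapses the three branches of the loop above into one line, which may be a simpler way to express the same update.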

## Spell Correction

To correct the spelling of a query, we use the Levenshtein distance as the criterion.

In [14]:
import numpy as np

def levenshtein_distance(s1, s2):
    m, n = len(s1), len(s2)
    
    # Create a matrix to store the distances between prefixes of s1 and s2
    dp = np.zeros((m + 1, n + 1), dtype=int)

    # Initialize the first row and first column
    for i in range(m + 1):
        dp[i, 0] = i
    for j in range(n + 1):
        dp[0, j] = j

    # Fill in the matrix
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            dp[i, j] = min(
                dp[i - 1, j] + 1,  # Deletion
                dp[i, j - 1] + 1,  # Insertion
                dp[i - 1, j - 1] + cost  # Substitution
            )

    return dp[m, n]
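A few worked values can help check the intuition behind the distance. The sketch below is a pure-Python variant that keeps only two rows of the matrix instead of the full NumPy array; the name `edit_distance` and the sample words are assumptions for this example.

```python
def edit_distance(a, b):
    """Levenshtein distance using two rows instead of the full matrix."""
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]

print(edit_distance("kitten", "sitting"))    # 3
print(edit_distance("gasolin", "gasoline"))  # 1
print(edit_distance("jerry", "jerry"))       # 0
```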

The function ```spell_correction``` replaces each term of the query with the nearest word in ```inverted_index```, measured by Levenshtein distance.

In [100]:
def spell_correction(query):
    # Initialize an empty list to store corrected words
    corrected_query = []
    
    # Split the input query into individual terms
    for term in query.split():
        corrected_word = None  # Initialize the corrected word as None
        min_distance = float('inf')  # Set the minimum distance to positive infinity
        
        # Iterate through the inverted index
        for word in inverted_index:
            # Calculate the Levenshtein distance between the query term and each word in the index
            distance = levenshtein_distance(term, word)
            
            # If the calculated distance is smaller than the minimum distance found so far
            if distance < min_distance:
                min_distance = distance  # Update the minimum distance
                corrected_word = word  # Update the corrected word to the closest match
        
        corrected_query.append(corrected_word)  # Append the closest matched word to the corrected query
    
    # Join the corrected words into a new query string
    new_query = " ".join(corrected_query)
    return new_query  # Return the new corrected query


## Boolean/Proximity Query

In [16]:
# Merge 2 sorted lists of docIDs into one sorted list without duplicates (set union)
def merge(docID1, docID2):
    result = []
    i = 0
    j = 0
    while i < len(docID1) and j < len(docID2):
        if docID1[i] == docID2[j]:
            result.append(docID1[i])
            i += 1
            j += 1
        elif docID1[i] < docID2[j]:
            result.append(docID1[i])
            i += 1
        else:
            result.append(docID2[j])
            j += 1
    if i == len(docID1):
        if j < len(docID2):
            for k in range(j, len(docID2)):
                result.append(docID2[k])
    if j == len(docID2):
        if i < len(docID1):
            for k in range(i, len(docID1)):
                result.append(docID1[k])
    return result

<div style="line-height: 2;">
In this part, a while loop reads queries from the user until the user enters "0" to stop the search engine.<br>
    Since the queries are guaranteed to have one of the forms below, I only handle these cases.
    <ul>
        <li>
            x OR y
        </li>
        <li>
            x AND y
        </li>
        <li>
            NOT x
        </li>
        <li>
            x NEAR/n y
        </li>
    </ul>
    For this job, I first define two lists, docIDs_1 and docIDs_2, that store the docIDs of the documents containing each query token (that is, x and y in the list above, not OR, NOT, or AND). Another list, "terms", contains the tokens of the query after splitting. The last list, "result", contains the docIDs printed as the response to each query. <br>
    If the query contains "NOT", we preprocess the single word after "NOT" and store it in the term1 variable. Then we append the docIDs stored for that word in the dictionary to docIDs_1. To implement "NOT", we search the docIDs list for the IDs that are not in docIDs_1 and return them as the result.<br>
    If the query contains "OR", "AND", or "NEAR/", it has two words, and both need processing. The part common to all three cases is that we separate the two words and store them in the term1 and term2 variables. Then we append the docIDs of each term to its own list (docIDs_1 for term1 and docIDs_2 for term2), so each list holds the documents in which that word appears. If the query contains "OR", we take the union of docIDs_1 and docIDs_2 and output the result. If it contains "AND" or "NEAR/", we intersect docIDs_1 and docIDs_2. For "AND", the intersection is the result. <br>
    For "NEAR/", there is more to do after intersecting. We go through the intersected documents and look up the positions at which the two words appear. If the number of words between them is less than or equal to the number after / in "NEAR/", we add the document to the result. If not, we move on to the next document.
    
</div>
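The four operations described above can be sketched on small, made-up posting lists. This is an illustration under stated assumptions, not the notebook's implementation: the docID lists, the position lists, and the helper name `within` are all invented for the example.

```python
# Made-up docID lists for two terms x and y in a 4-document collection
all_docs = [1, 2, 3, 4]
docs_x = [1, 3]
docs_y = [3, 4]

print(sorted(set(docs_x) | set(docs_y)))           # x OR y  -> [1, 3, 4]
print([d for d in docs_x if d in docs_y])          # x AND y -> [3]
print([d for d in all_docs if d not in docs_x])    # NOT x   -> [2, 4]

def within(positions1, positions2, n):
    """True if some pair of positions has at most n words between them."""
    return any(0 <= abs(p1 - p2) - 1 <= n
               for p1 in positions1 for p2 in positions2)

# x occurs at positions 0 and 7, y at position 3, inside the same document
print(within([0, 7], [3], 2))  # True: two words lie between positions 0 and 3
print(within([0, 7], [3], 1))  # False
```

Returning a single boolean per document also avoids adding the same docID to the result once for every matching pair of positions.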

In [17]:
def boolean_proximity_query(query):
    docIDs_1 = []
    docIDs_2 = []
    result = []
    terms = query.split()

    # "NOT x": the operator is the first token
    if terms[0] == "NOT":
        term1 = preprocessing(terms[1])
        term1 = spell_correction(term1)
        # .get() avoids adding an empty entry to the defaultdict for unknown terms
        for key in inverted_index.get(term1, {}):
            docIDs_1.append(key)
        result = [docID for docID in docIDs if docID not in docIDs_1]

    else:
        # "x OP y": the operator is the middle token
        operator = terms[1]
        term1 = preprocessing(terms[0])
        term2 = preprocessing(terms[2])
        term1 = spell_correction(term1)
        term2 = spell_correction(term2)

        for key in inverted_index.get(term1, {}):
            docIDs_1.append(key)
        for key in inverted_index.get(term2, {}):
            docIDs_2.append(key)

        if operator == "OR":
            result = merge(docIDs_1, docIDs_2)

        else:
            intersection = list(filter(lambda value: value in docIDs_2, docIDs_1))

            if operator.startswith("NEAR/"):
                # Maximum number of words allowed between the two terms
                # (read all digits after "NEAR/", not only the first one)
                near = int(operator[5:])

                # Keep each document once if some pair of positions is close enough
                for docID in intersection:
                    positions1 = inverted_index[term1][docID]
                    positions2 = inverted_index[term2][docID]
                    if any(0 <= abs(index1 - index2) - 1 <= near
                           for index1 in positions1 for index2 in positions2):
                        result.append(docID)
            else:
                result = intersection

    return result

## Wildcard query

In this section I handle wildcard queries. When the user chooses this query type, the functions below run and find all indexed words that match the structure of the input query.

The ```find_prefix``` function normalizes the wildcard query by rotating it until its ```*``` is at the end of the string.

In [13]:
def find_prefix(term):
    while term[-1] != "*":
        term = term[-1] + term[0:-1]
    return term
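To see why rotating the query turns a wildcard into a prefix search, the sketch below matches single-star queries against a small made-up vocabulary. It uses plain `str.startswith` over the rotations of each word instead of the trie, and the helper names `rotations`, `rotate_to_prefix`, and `matches` are assumptions for this example.

```python
def rotations(term):
    """Every rotation of term + '$'."""
    marked = term + "$"
    return [marked[i:] + marked[:i] for i in range(len(marked))]

def rotate_to_prefix(wildcard):
    """Rotate wildcard + '$' until '*' is last, then drop the '*'."""
    marked = wildcard + "$"
    star = marked.index("*")
    return marked[star + 1:] + marked[:star]

def matches(wildcard, term):
    """True if some rotation of the term starts with the rotated query."""
    prefix = rotate_to_prefix(wildcard)
    return any(rotation.startswith(prefix) for rotation in rotations(term))

vocabulary = ["gas", "gasoline", "gun", "goes"]  # made-up sample words

print(rotate_to_prefix("g*s"))                                # s$g
print([t for t in vocabulary if matches("g*s", t)])           # ['gas', 'goes']
print([t for t in vocabulary if matches("gasoli*", t)])       # ['gasoline']
```

In the notebook the same prefix test is answered by the trie, which avoids scanning every rotation of every word.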

The ```wildcard_query``` function checks whether the wildcard has 1 or 2 stars. If there are 2 stars, it builds a new prefix of the form suffix + $ + prefix, searches for it in ```trie_inverted_index```, and then keeps only the terms that contain the middle part. If there is only 1 star, it searches ```trie_inverted_index``` for the normalized query produced by the ```find_prefix``` function.

In [None]:
def wildcard_query(query):
    # Map a list of matching permuterm rotations back to their base terms
    def base_terms(rotations):
        found = set()
        for rotation in rotations:
            if rotation in permuterm_index:
                found.update(permuterm_index[rotation])
        return list(found)

    stars = query.count("*")
    result = []
    if stars == 2:
        if query[0] == '*' and query[-1] == '*':
            # "*x*": any rotation that starts with x belongs to a term containing x
            children = trie_inverted_index.search(query[1:-1]) or []
            result = base_terms(children)
        else:
            # "a*b*c": look up the prefix c$a, then keep terms with b in the middle
            prefix, mid, suffix = query.split("*")
            children = trie_inverted_index.search(suffix + "$" + prefix) or []
            for term in base_terms(children):
                if mid in term[len(prefix):len(term) - len(suffix)]:
                    result.append(term)
    elif stars == 1:
        # "a*c": rotate a*c$ until "*" is last, then search for the prefix c$a
        normal_query = find_prefix(query + "$")
        children = trie_inverted_index.search(normal_query[0:-1]) or []
        result = base_terms(children)
    return result

## Boolean Wildcard

If the user chooses to enter a Boolean wildcard query, this part runs. It checks which kind of Boolean query was entered, determines the type of each component (wildcard or plain term), and searches for each one in the appropriate way.

In [97]:
def boolean_wildcard_query(query):
    # Resolve one operand to (matching index terms, sorted docIDs containing any of them)
    def resolve(term):
        if "*" in term:
            words = wildcard_query(term)
        else:
            words = [spell_correction(preprocessing(term))]
        ids = set()
        for word in words:
            ids.update(inverted_index.get(word, {}))
        return words, sorted(ids)

    result = []
    terms = query.split()

    # "NOT x": the operator is the first token
    if terms[0] == "NOT":
        _, docIDs_1 = resolve(terms[1])
        return [docID for docID in docIDs if docID not in docIDs_1]

    # "x OP y": the operator is the middle token
    operator = terms[1]
    wildcard_1, docIDs_1 = resolve(terms[0])
    wildcard_2, docIDs_2 = resolve(terms[2])

    if operator == "OR":
        result = merge(docIDs_1, docIDs_2)
    else:
        intersection = [docID for docID in docIDs_1 if docID in docIDs_2]
        if operator.startswith("NEAR/"):
            # Maximum number of words allowed between the two terms
            near = int(operator[5:])
            for docID in intersection:
                for word1 in wildcard_1:
                    for word2 in wildcard_2:
                        # .get() skips words that do not occur in this document
                        for index1 in inverted_index.get(word1, {}).get(docID, []):
                            for index2 in inverted_index.get(word2, {}).get(docID, []):
                                diff = abs(index1 - index2) - 1
                                if 0 <= diff <= near:
                                    result.append(docID)
        else:
            result = intersection

    # Remove duplicates and return the docIDs in ascending order
    return sorted(set(result))

## Query

Now it is time to take a query and try to retrieve what the user wants!

First, the user is asked to choose one of the 4 possible query types. Then, for each type, the corresponding function runs and performs the search.
<ol>
    <li>Boolean/Proximity query: Queries like x AND y, x OR y, NOT x, and x NEAR/n y</li>
    <li>Wildcard: Queries like a*b*c</li>
    <li>Spell correction: Queries which have spelling mistakes</li>
    <li>Boolean Wildcard: Queries like x AND a*b*c</li>
</ol>

In [99]:
query = None
choice = None
while True:
    print("Choose query type:\n1. Boolean/Proximity\n2. Wildcard\n3. Spell Correction\n4. Bolean Wildcard\n0. Quit")
    choice = int(input())
    result3 = []
    if choice == 0:
        print("Quit")
        break
    else:
        query = input()
        if choice == 1:
            result3 = boolean_proximity_query(query)
        elif choice == 2:
            result3 = wildcard_query(query)
        elif choice == 3:
            result3 = spell_correction(query)
        else:
            result3 = boolean_wildcard_query(query)

        print("Output: ", result3)

Choose query type:
1. Boolean/Proximity
2. Wildcard
3. Spell Correction
4. Bolean Wildcard
0. Quit
1
jerry AND 30
Output:  [1]
Choose query type:
1. Boolean/Proximity
2. Wildcard
3. Spell Correction
4. Bolean Wildcard
0. Quit
2
n*b*y
Output:  ['nearby', 'nobody']
Choose query type:
1. Boolean/Proximity
2. Wildcard
3. Spell Correction
4. Bolean Wildcard
0. Quit
3
festival founders
Output:  festival founders
Choose query type:
1. Boolean/Proximity
2. Wildcard
3. Spell Correction
4. Bolean Wildcard
0. Quit
4
gas OR JerRy
Output:  [1, 3]
Choose query type:
1. Boolean/Proximity
2. Wildcard
3. Spell Correction
4. Bolean Wildcard
0. Quit
4
gas NEAR/5 gasoli*
[3]
Output:  [3]
Choose query type:
1. Boolean/Proximity
2. Wildcard
3. Spell Correction
4. Bolean Wildcard
0. Quit
0
Quit
