# Developing an Information Retrieval System with Advanced Boolean Search

<div style="line-height: 2;">This project aims to develop an Information Retrieval (IR) system that supports both standard Boolean queries and proximity queries. The system is designed to handle a collection of text documents, building an Inverted Index and a Positional Index to facilitate efficient document retrieval.</div>

<div style="line-height: 2;">The first step in developing this system is reading the prepared document files. Before that, we need to import some necessary libraries:
    <ul>
        <li>
            <b>NLTK</b>: Natural Language ToolKit: A leading platform for building Python programs to work with human language data
        </li>
        <li>
            <b>OS</b>: A built-in module that provides a wide range of functions for interacting with the operating system. It allows you to work with file systems, directories, paths, environment variables, and more.
        </li>
        <li>
            <b>String</b>: A built-in module that provides string constants (such as string.punctuation) and string helper utilities
        </li>
    </ul>

Also we import some specific parts of NLTK which are described later:
<ul>
    <li>
        <b>Stopwords</b>
    </li>
    <li>
        <b>Regexp_tokenize</b>
    </li>
</ul>
</div>

In [37]:
pip install nltk

Note: you may need to restart the kernel to use updated packages.


In [38]:
import nltk
import os
import string

In the current directory, we have a directory named **data** which contains our text files.

In [39]:
# Define path of data files here
folder_path = "data"

With `os.listdir()` we make a list of all files and directories in the `folder_path`.

In [40]:
# make a list of all files in "data" directory which are our text files
file_list = os.listdir(folder_path)



Since the directory might contain files that are not text files, or subdirectories, we use the code below to keep only the `.txt` files.


In [41]:
# text_files stores text files' names
text_files = [file for file in file_list if file.endswith('.txt')]

# print text files titles
print("document titles:\n" + "-"*40)
for text_file in text_files:
    print(text_file)
print( "-"*40)

document titles:
----------------------------------------
Jerry Decided To Buy a Gun.txt
Rentals at the Oceanside Community.txt
Gasoline Prices Hit Record High.txt
Cloning Pets.txt
Crazy Housing Prices.txt
Man Injured at Fast Food Place.txt
A Festival of Books.txt
Food Fight Erupted in Prison.txt
Better To Be Unlucky.txt
Sara Went Shopping.txt
Freeway Chase Ends at Newsstand.txt
Trees Are a Threat.txt
A Murder-Suicide.txt
Happy and Unhappy Renters.txt
Pulling Out Nine Tons of Trash.txt
----------------------------------------
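
`os.listdir()` returns names in arbitrary order, and the docIDs assigned later depend on each file's position in the list. The sketch below is one possible variant that makes the order reproducible. It is not the code used in this notebook, and `list_text_files` is a hypothetical helper name.

```python
import os

def list_text_files(folder_path):
    """Return the .txt file names in folder_path, sorted by name."""
    names = []
    for name in os.listdir(folder_path):
        full_path = os.path.join(folder_path, name)
        # Keep regular files only, so a directory named "notes.txt" is skipped
        if os.path.isfile(full_path) and name.endswith(".txt"):
            names.append(name)
    # Sorting makes the position of each file (and so its docID) reproducible
    return sorted(names)
```

With this variant the docIDs would follow alphabetical order, so they would differ from the IDs shown in the outputs of this notebook.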



<div style="line-height: 2;">
As mentioned above, some specific parts of the NLTK package are needed.

- **regexp_tokenize**: we use this function for tokenization; it splits text into tokens based on a regular expression pattern.
- **stopwords**: this module contains lists of common stop words for multiple languages; in this project we use the English list. Some examples of English stopwords are "the", "and", "or", "is", and "was".
</div>


In [42]:
from nltk.tokenize import regexp_tokenize
from nltk.corpus import stopwords

We also need to download the stopword list that is used during preprocessing.

In [43]:
# nltk.download('punkt')
# download stopwords here
nltk.download('stopwords')

[nltk_data] Error loading stopwords: <urlopen error [SSL:
[nltk_data]     CERTIFICATE_VERIFY_FAILED] certificate verify failed:
[nltk_data]     unable to get local issuer certificate (_ssl.c:992)>


False

## Preprocessing

<div style="line-height: 2;">
In this step, we preprocess the text files with a single function. The following steps are implemented:
    <ol>
        <li>
            <b>Tokenization</b>: done with the `regexp_tokenize()` function from the `NLTK` library. This requires a pattern that describes the tokens in our documents. The documents contain words, numbers with at most 3 digits, larger numbers whose digits are separated into groups of 3 by commas, and floating-point numbers, so the pattern has to handle all of these types at once. We used `r'\d{1,3}(?:,\d{3})*(?:\.\d+)?|\w+'`, which is broken down below: <br>
            <ul>
                <li>
                    \d{1,3}: This part matches a run of one to three digits, such as "4", "32", or "879". On its own it does not cover "1234" or "1,000".
                </li>
                <li>
                (?:,\d{3})*: This part matches zero or more comma-separated groups of three digits, so that numbers like "75,000" are kept as one token.
                </li>
                <li>
                (?:\.\d+)?: This part matches an optional fractional part, which handles floating-point numbers.
                </li>
                <li>
                |: This symbol acts as an OR operator, allowing the regular expression to match either a number (as described above) or the part that follows it.
                </li>
                <li>
                \w+: This part matches words, which consist of one or more word characters (letters, digits, and underscores). It is not limited to numbers and can match words like "apple", "word123", and "my_variable". Note that the number alternative is tried first, so an unseparated run of more than three digits such as "1234" is split into "123" and "4" rather than being matched here.
                </li>
            </ul>
        </li>
      <li>
          <b>Lowercase tokens</b>: all tokens are converted to lowercase. Queries are preprocessed in the same way, so matching does not depend on capitalization.
      </li>
        <li>
         <b>Delete stopwords</b>: Stopwords are words that are used frequently in documents and are not necessary for information retrieval. NLTK provides a list of English stopwords, which we downloaded above. Removing them makes retrieval faster because there are fewer words to check.
        </li>
      <li>
          <b>Delete punctuation</b>: Punctuation is not needed when we want to retrieve information, so it is removed from all documents.
      </li>
    </ol>
</div>
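
To see what the pattern does on a concrete sentence, the sketch below applies it with the standard-library `re.findall`. This is used here as a stand-in for `regexp_tokenize` so that the example runs without NLTK, and I expect it to return the same tokens for this pattern. The sample sentence and `ALT_PATTERN` are illustrative assumptions, not part of the project.

```python
import re

# Pattern used in the notebook
PATTERN = r'\d{1,3}(?:,\d{3})*(?:\.\d+)?|\w+'

# Hypothetical alternative (an assumption, not the notebook's pattern):
# comma-grouped numbers first, then plain numbers of any length, then words
ALT_PATTERN = r'\d{1,3}(?:,\d{3})+(?:\.\d+)?|\d+(?:\.\d+)?|\w+'

sample = "It cost 75,000 dollars, or 3.5 times 1234."

tokens = re.findall(PATTERN, sample)
alt_tokens = re.findall(ALT_PATTERN, sample)

print(tokens)
# ['It', 'cost', '75,000', 'dollars', 'or', '3.5', 'times', '123', '4']
print(alt_tokens)
# ['It', 'cost', '75,000', 'dollars', 'or', '3.5', 'times', '1234']
```

The only difference is the unseparated number "1234": the notebook's pattern splits it after three digits, while the alternative keeps it as one token.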

In [44]:
def preprocessing(text):
    '''
    Tokenize the text with a regular expression pattern that matches words,
    numbers of up to 3 digits, larger numbers with thousands separators,
    and floating-point numbers; then lowercase the tokens and remove
    stopwords and punctuation.
    '''
    pattern = r'\d{1,3}(?:,\d{3})*(?:\.\d+)?|\w+'
    
    # Tokenize the text using the regular expression pattern    
    tokens = regexp_tokenize(text, pattern)
    
    # Lowercase the tokens
    tokens = [word.lower() for word in tokens]
    
    # Get the set of English stopwords
    stop_words = set(stopwords.words('english'))
    
    # Remove stopwords
    tokens = [word for word in tokens if word not in stop_words]
    
    # Remove punctuation
    tokens = [word for word in tokens if word not in string.punctuation]
    
    # Join tokens to return a string for each document
    return ' '.join(tokens)


We also need to import `defaultdict` from the `collections` module. It creates dictionaries that supply a default value for keys that have not been entered yet.

In [45]:
from collections import defaultdict
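
The short sketch below illustrates how `defaultdict(dict)` behaves, including one side effect worth knowing: reading a missing key also inserts it. The keys and values are illustrative only.

```python
from collections import defaultdict

index = defaultdict(dict)

# Assigning through a missing key creates the inner dict automatically
index["pizza"][1] = [6]

# Merely reading a missing key also inserts it, with an empty dict as value
_ = index["unknown"]
has_unknown = "unknown" in index   # True

# .get() looks a key up without inserting anything
missing = index.get("absent", {})
has_absent = "absent" in index     # False

print(dict(index))  # {'pizza': {1: [6]}, 'unknown': {}}
```

For lookups that should leave the index unchanged, such as query terms that may not occur in any document, `.get()` is the safer choice.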

## Inverted index

<div style="line-height: 2;">Now, after preprocessing, it is time to create our inverted index.<br>
    An inverted index is a data structure, here a dictionary, whose keys are the distinct words of all document files and whose values hold the docIDs of the documents that contain each word. In addition, for each docID we store a list with the positions (indices) at which the word appears in that document, which makes the structure a positional index.<br>
    <span style="color:red;">Note: </span>I used try-except because, when I tried to read all files with UTF-8 encoding, I ran into the error "UnicodeDecodeError: 'utf-8' codec can't decode byte 0x92 in position 394: invalid start byte". This means that at least one file contains bytes that are not valid UTF-8. So, I used latin-1 encoding for these files.<br>
    <span style="color:blue;">Warning: </span>the commented-out code at the end of the loop, which writes the processed files, should be run only once. The first run creates the processed files and adds their content; because the files are opened in append mode, every further run appends the same content again and leaves duplicated text in each file.
</div>
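
The encoding fallback described in the note can also be written as a small helper. This is a sketch under one assumption: byte 0x92 is the right single quotation mark in Windows-1252, which suggests the affected file may have been saved in that encoding. latin-1 decodes the byte without error but maps it to a control character, so the sketch tries `cp1252` before falling back to latin-1. `read_text` is a hypothetical name and is not used elsewhere in this notebook.

```python
def read_text(path, encodings=("utf-8", "cp1252", "latin-1")):
    """Return the contents of path decoded with the first encoding that works."""
    # Read raw bytes once, then try each encoding in turn
    with open(path, "rb") as handle:
        raw = handle.read()
    for encoding in encodings:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Not reached with the default tuple, because latin-1 accepts every byte
    raise ValueError(f"none of {encodings} could decode {path}")
```

Because the tokenizer only keeps word characters and numbers, the choice between the two fallbacks should rarely change the resulting tokens; it mainly matters if the raw text is displayed.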

In [46]:
# Create the inverted index: token -> {docID: [positions]}
inverted_index = defaultdict(dict)

# docIDs holds every document ID; it is used later to evaluate NOT queries
docIDs = []

# Read the document files, assigning docIDs 1, 2, 3, ... in list order
for file_id, text_file in enumerate(text_files, start=1):

    # Record the docID of the current file
    docIDs.append(file_id)

    # Join the folder path and the file name to make a full file path
    file_path = os.path.join(folder_path, text_file)

    # Read the file as UTF-8, falling back to latin-1 if decoding fails
    try:
        with open(file_path, 'r', encoding='utf-8') as file:
            file_content = file.read()
    except UnicodeDecodeError:
        with open(file_path, 'r', encoding='latin-1') as file:
            file_content = file.read()

    # Preprocess the file content using the preprocessing
    # function which is defined above
    processed_text = preprocessing(file_content)

    # Record the position of every token occurrence in this document.
    # setdefault creates the position list the first time a docID
    # is seen for a token; afterwards the position is appended to it.
    for index, token in enumerate(processed_text.split()):
        inverted_index[token].setdefault(file_id, []).append(index)

    # To store the processed texts, create new .txt
    # files and append the processed_text to them
    # Note: run this part of the code only once, because mode "a" appends.
    #with open("data/processed/processed%i.txt" % file_id, "a") as processed_file:
    #    # Write text to the file
    #    processed_file.write(processed_text)

# Print the inverted index:
for key in inverted_index:
    print(key)
    print(inverted_index[key])

jerry
{1: [0, 19, 37, 44, 73, 85, 130, 133]}
baldwin
{1: [1]}
30
{1: [2], 3: [17], 10: [6]}
years
{1: [3, 96], 2: [48], 3: [46], 5: [46, 163], 7: [136], 9: [16], 10: [20, 25], 12: [12, 26], 13: [24, 72, 98], 15: [32]}
old
{1: [4, 50], 6: [2], 9: [17, 127], 10: [26], 11: [2], 12: [140], 13: [12, 18], 15: [135]}
manager
{1: [5], 2: [32], 3: [35], 6: [53]}
pizza
{1: [6]}
restaurant
{1: [7, 13], 6: [15, 52, 66]}
lived
{1: [8], 10: [7], 13: [68]}
apartment
{1: [9]}
one
{1: [10, 24], 2: [126], 5: [20, 78], 6: [61, 65], 7: [2, 54, 113, 154, 158], 9: [77], 11: [12, 34, 73, 112], 12: [56, 124], 13: [48, 54], 14: [6, 95], 15: [29, 127]}
mile
{1: [11], 10: [45], 15: [110]}
north
{1: [12], 11: [90, 130]}
walked
{1: [14]}
work
{1: [15], 2: [95, 134], 3: [131], 12: [47], 15: [0, 77, 106]}
raining
{1: [16]}
took
{1: [17]}
bus
{1: [18], 11: [93, 98, 100, 115]}
loved
{1: [20, 60]}
gangster
{1: [21, 48]}
movies
{1: [22], 7: [12]}
new
{1: [23, 30, 52, 118], 4: [42, 44, 72], 5: [76], 6: [9, 71], 14: [23]}
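
The structure printed above, token to docID to list of positions, can be reproduced on a toy collection. The sketch below uses two made-up, already preprocessed documents instead of the files in the data folder; `build_positional_index` is a hypothetical helper name.

```python
from collections import defaultdict

def build_positional_index(docs):
    """Map each token to {docID: [positions]} for already-preprocessed docs."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for position, token in enumerate(text.split()):
            # Create the position list on first sight, then append to it
            index[token].setdefault(doc_id, []).append(position)
    return index

# Toy documents (illustrative, not taken from the data folder)
docs = {
    1: "jerry loved gangster movies jerry",
    2: "new movies new restaurant",
}
index = build_positional_index(docs)
print(index["jerry"])   # {1: [0, 4]}
print(index["movies"])  # {1: [3], 2: [1]}
print(index["new"])     # {2: [0, 2]}
```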

## Query Processing

<div style="line-height: 2;">This is the last section, which covers query processing.<br>
In this part, I defined a while loop that reads queries from the user until the keyword "Quit" is entered, which stops the search engine.<br>
    As it is guaranteed that each query has one of the forms below, I only handle these cases.
    <ul>
        <li>
            x OR y
        </li>
        <li>
            x AND y
        </li>
        <li>
            NOT x
        </li>
        <li>
            x NEAR/k y
        </li>
    </ul>
    For this job, I first defined 2 lists, docIDs_1 and docIDs_2, which store the docIDs of the files that contain each query token (that is, x and y in the list above, not OR, NOT, AND, or NEAR/k). Another list, "terms", contains the tokens of the query after splitting. The last list, "result", contains the docIDs that are printed as the response to each query. <br>
    If the query is a "NOT" query, preprocess the single word after "NOT" and store it in the term1 variable. Then, add the docIDs stored for that word in the dictionary to docIDs_1. To implement "NOT", collect the IDs in the docIDs list that are not in docIDs_1 and print them as the result.<br>
    If the query contains "OR", "AND", or "NEAR/", it has two words, and both need to be processed. The part common to all 3 cases is that we separate the 2 words and store them in the term1 and term2 variables. Then, the docIDs of the documents in which each term appears are added to the corresponding list (docIDs_1 for term1 and docIDs_2 for term2). If the operator is "OR", we take the union of docIDs_1 and docIDs_2 and output the result. If it is "AND" or "NEAR/", we intersect docIDs_1 and docIDs_2. For "AND", this intersection is the result. <br>
    For "NEAR/", more work is needed after intersecting. We go through the intersected documents and look up the positions at which the two words appear. If, for some pair of positions, the number of words between the two terms is at most the number after / in "NEAR/", the document is added to the result. If not, we move on to the next document.
    
</div>
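
The proximity test for one document can be sketched in isolation. The version below is a linear-time alternative to comparing every pair of positions, not the code used in this notebook. It assumes both position lists are sorted in ascending order (which holds when positions are appended while reading a document) and that the two terms are either different words or the same word with identical lists. `within_distance` is a hypothetical name.

```python
def within_distance(positions_1, positions_2, k):
    """Return True if some pair of positions has at most k words between them.

    Both lists must be sorted in ascending order.
    """
    i = j = 0
    while i < len(positions_1) and j < len(positions_2):
        # Number of words strictly between the two occurrences
        gap = abs(positions_1[i] - positions_2[j]) - 1
        if 0 <= gap <= k:
            return True
        # Advance the pointer at the smaller position to try a closer pair
        if positions_1[i] < positions_2[j]:
            i += 1
        else:
            j += 1
    return False

print(within_distance([3, 20], [10, 22], 1))  # True  (20 and 22: one word between)
print(within_distance([3, 20], [10, 22], 0))  # False (no adjacent pair)
```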

Also, there is a merge function which computes the union of 2 sorted lists of docIDs.

In [47]:
# Merge 2 sorted lists of docIDs into one sorted list without duplicates
def merge(docID1, docID2):
    result = []
    i = 0
    j = 0
    while i < len(docID1) and j < len(docID2):
        if docID1[i] == docID2[j]:
            result.append(docID1[i])
            i += 1
            j += 1
        elif docID1[i] < docID2[j]:
            result.append(docID1[i])
            i += 1
        else:
            result.append(docID2[j])
            j += 1
    # At most one list still has elements left; append them unchanged
    result.extend(docID1[i:])
    result.extend(docID2[j:])
    return result
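
The same two-pointer idea also gives the intersection needed for "AND" queries. The sketch below is a counterpart to `merge`, not the code used in the query loop; `intersect` is a hypothetical name, and the example lists are the docIDs printed above for "manager" and "restaurant".

```python
def intersect(doc_ids_1, doc_ids_2):
    """Return the docIDs present in both sorted lists, in ascending order."""
    result = []
    i = j = 0
    while i < len(doc_ids_1) and j < len(doc_ids_2):
        if doc_ids_1[i] == doc_ids_2[j]:
            # Common docID: keep it and advance both pointers
            result.append(doc_ids_1[i])
            i += 1
            j += 1
        elif doc_ids_1[i] < doc_ids_2[j]:
            i += 1
        else:
            j += 1
    return result

print(intersect([1, 2, 3, 6], [1, 6]))  # [1, 6]
```

Like `merge`, this visits each list once, which matters more as the posting lists grow.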

# Query processing

In [48]:
query = ""

# Loop until the user enters "Quit"
while True:
    docIDs_1 = []
    docIDs_2 = []
    terms = []
    result = []
    query = input()

    if query == "Quit":
        break

    else:
        terms = query.split()

        # Compare whole tokens so that a term such as "ORANGE" is not mistaken for OR
        if terms[0] == "NOT":
            term1 = preprocessing(terms[1])
            # .get avoids adding unknown terms to the defaultdict
            docIDs_1 = list(inverted_index.get(term1, {}))
            result = [docID for docID in docIDs if docID not in docIDs_1]

        else:
            operator = terms[1]
            term1 = preprocessing(terms[0])
            term2 = preprocessing(terms[2])

            docIDs_1 = list(inverted_index.get(term1, {}))
            docIDs_2 = list(inverted_index.get(term2, {}))

            if operator == "OR":
                result = merge(docIDs_1, docIDs_2)

            else:
                intersection = [docID for docID in docIDs_1 if docID in docIDs_2]

                if operator.startswith("NEAR/"):
                    # Maximum number of words allowed between the two terms
                    near = int(operator[5:])

                    # Keep each document once if some pair of positions has
                    # at most "near" words between terms[0] and terms[2]
                    for docID in intersection:
                        positions_1 = inverted_index[term1][docID]
                        positions_2 = inverted_index[term2][docID]
                        if any(0 <= abs(index1 - index2) - 1 <= near
                               for index1 in positions_1
                               for index2 in positions_2):
                            result.append(docID)
                else:
                    result = intersection

        # Print result docIDs
        print(result)

        # Print result documents' titles
        for docID in result:
            print(text_files[docID - 1])

[1, 6]
Jerry Decided To Buy a Gun.txt
Man Injured at Fast Food Place.txt
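
For experimenting without the interactive loop, the Boolean part of the query logic can be sketched as a plain function. This is a simplified illustration under several assumptions: it uses Python sets instead of `merge`, lowercasing stands in for the full preprocessing step, NEAR/k is left out, and the six-document collection is made up. The docID lists in `toy_index` are copied from the index printed earlier; `evaluate` is a hypothetical name.

```python
def evaluate(query, index, all_doc_ids):
    """Evaluate 'NOT x', 'x AND y', or 'x OR y' against a term -> docIDs mapping."""
    terms = query.split()
    # Operators are matched as whole tokens, never as substrings of a term
    if terms[0] == "NOT":
        excluded = set(index.get(terms[1].lower(), ()))
        return sorted(set(all_doc_ids) - excluded)
    left = set(index.get(terms[0].lower(), ()))
    right = set(index.get(terms[2].lower(), ()))
    if terms[1] == "OR":
        return sorted(left | right)
    if terms[1] == "AND":
        return sorted(left & right)
    raise ValueError(f"unsupported operator: {terms[1]}")

toy_index = {"manager": [1, 2, 3, 6], "restaurant": [1, 6], "pizza": [1]}
all_ids = range(1, 7)  # assumption for the example: a six-document collection

print(evaluate("restaurant AND manager", toy_index, all_ids))  # [1, 6]
print(evaluate("pizza OR restaurant", toy_index, all_ids))     # [1, 6]
print(evaluate("NOT manager", toy_index, all_ids))             # [4, 5]
```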
