# Boolean Search in Documents

## Objective
Expand the simple term search functionality to include Boolean search capabilities. This will allow users to perform more complex queries by combining multiple search terms using Boolean operators.

## Problem Description
You must enhance the existing search engine from the previous exercise to support Boolean operators: AND, OR, and NOT. This will enable the retrieval of documents based on the logical relationships between multiple terms.

## Requirements

### Step 1: Update Data Preparation
Load and preprocess the documents as in the previous task. The data should be clean and ready for more complex queries.

### Step 2: Create an Inverted Index

Create an inverted index from the documents. This index maps each word to the set of document IDs in which that word appears, so finding the documents for a term takes a single lookup instead of a scan of every document.
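
As a minimal sketch of this step, the function below builds such an index from an in-memory dictionary of documents. The name `build_inverted_index` and the toy corpus are assumptions made for illustration; the notebook itself loads a prebuilt index from a file.

```python
import re
from collections import defaultdict

def build_inverted_index(documents):
    """Map each lowercase word to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in re.findall(r"\w+", text.lower()):
            index[word].add(doc_id)
    return dict(index)

# Hypothetical toy corpus, for illustration only.
docs = {
    "doc1.txt": "Python is a programming language",
    "doc2.txt": "A python is a large snake",
    "doc3.txt": "Java programming basics",
}
index = build_inverted_index(docs)
print(sorted(index["python"]))       # ['doc1.txt', 'doc2.txt']
print(sorted(index["programming"]))  # ['doc1.txt', 'doc3.txt']
```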

### Step 3: Implement Boolean Search
- **Enhance Input Query**: Modify the function to accept complex queries that can include the Boolean operators AND, OR, and NOT.
- **Implement Boolean Logic**:
  - **AND**: The document must contain all the terms. For example, `python AND programming` should return documents containing both "python" and "programming".
  - **OR**: The document can contain any of the terms. For example, `python OR programming` should return documents containing either "python", "programming", or both.
  - **NOT**: The document must not contain the term following NOT. For example, `python NOT snake` should return documents that contain "python" but not "snake".
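
Because the index stores a set of document IDs per term, each operator can be expressed as a Python set operation. The sketch below uses hypothetical postings to show the mapping; the document IDs are made up for illustration.

```python
# Hypothetical postings: term -> set of document IDs.
postings = {
    "python": {"doc1", "doc2"},
    "programming": {"doc1", "doc3"},
    "snake": {"doc2"},
}

both = postings["python"] & postings["programming"]    # AND -> intersection
either = postings["python"] | postings["programming"]  # OR  -> union
without = postings["python"] - postings["snake"]       # NOT -> difference

print(sorted(both))     # ['doc1']
print(sorted(either))   # ['doc1', 'doc2', 'doc3']
print(sorted(without))  # ['doc1']
```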

### Step 4: Query Processing
- **Parse the Query**: Implement a function to parse the input query to identify the terms and operators.
- **Search Documents**: Based on the parsed query, implement the logic to retrieve the documents that satisfy the Boolean expression.
- **Handling Case Sensitivity and Partial Matches**: Optionally, make the search case-insensitive and support partial matches to refine the search results.
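
One possible way to handle the parsing step is to label every token as either an operator or a search term. The sketch below assumes operators are matched case-insensitively and terms are lowercased; `parse_query` and `OPERATORS` are hypothetical names, not part of the notebook.

```python
import re

OPERATORS = {"AND", "OR", "NOT"}

def parse_query(query):
    """Split a query into ('OP', name) and ('TERM', word) pairs."""
    parsed = []
    for token in re.findall(r"\w+", query):
        if token.upper() in OPERATORS:
            parsed.append(("OP", token.upper()))
        else:
            parsed.append(("TERM", token.lower()))
    return parsed

print(parse_query("Python AND programming NOT snake"))
```

A side effect of this design is that the words "and", "or", and "not" can never be used as search terms, which is usually acceptable because they are common stop words.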

### Step 5: Displaying Results
- **Output the Results**: Display the documents that match the query criteria. Handle queries that match no documents, for example by printing a clear message.

## Evaluation Criteria
- **Correctness**: The Boolean search implementation should correctly interpret and process the queries according to the Boolean logic.
- **Efficiency**: Consider the efficiency of your search process, especially as the complexity of queries increases.
- **User Experience**: Ensure that the interface for inputting queries and viewing results is user-friendly.

## Additional Challenges (Optional)
- **Nested Boolean Queries**: Allow for nested queries using parentheses, such as `(python OR java) AND programming`.
- **Phrase Searching**: Implement the ability to search for exact phrases enclosed in quotes.
- **Proximity Searching**: Extend the search to find terms that are within a specific distance from one another.
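
For the nested-query challenge, one option is a small recursive-descent evaluator. The sketch below assumes an index that maps terms to sets of document IDs, treats `NOT` as a binary "and not" as in the example in Step 3, and applies operators from left to right with equal precedence. The function name `evaluate` and the sample index are hypothetical.

```python
import re

def evaluate(query, index):
    """Evaluate a Boolean query with parentheses against a term -> doc-ID-set index."""
    tokens = re.findall(r"\(|\)|\w+", query.lower())
    pos = 0

    def parse_operand():
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("unexpected end of query")
        token = tokens[pos]
        pos += 1
        if token == "(":
            result = parse_expression()
            if pos >= len(tokens) or tokens[pos] != ")":
                raise ValueError("missing closing parenthesis")
            pos += 1
            return result
        if token in (")", "and", "or", "not"):
            raise ValueError(f"unexpected token: {token!r}")
        return set(index.get(token, set()))  # copy so the index is never mutated

    def parse_expression():
        nonlocal pos
        result = parse_operand()
        while pos < len(tokens) and tokens[pos] in ("and", "or", "not"):
            op = tokens[pos]
            pos += 1
            right = parse_operand()
            if op == "and":
                result &= right
            elif op == "or":
                result |= right
            else:  # "not" removes the right-hand documents
                result -= right
        return result

    result = parse_expression()
    if pos != len(tokens):
        raise ValueError(f"unexpected token: {tokens[pos]!r}")
    return result

# Hypothetical index, for illustration only.
sample_index = {
    "python": {"d1", "d2"},
    "java": {"d3"},
    "programming": {"d1", "d3"},
    "snake": {"d2"},
}
print(sorted(evaluate("(python OR java) AND programming", sample_index)))  # ['d1', 'd3']
```

Phrase and proximity searching would additionally need word positions in the index, which this sketch does not cover.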

This exercise will deepen your understanding of how search engines process and respond to complex user queries. By incorporating Boolean search, you not only enhance the functionality of your search engine but also mimic more closely how real-world information retrieval systems operate.


In [1]:
import os
import re

In [2]:
def load_inverted_index(input_file):
    """Load an index saved as one "word: {'doc1', 'doc2'}" entry per line."""
    inverted_index = {}
    with open(input_file, 'r', encoding='utf-8') as f:
        for line in f:
            line = line.strip()
            if not line:  # skip blank lines
                continue
            word, occurrences = line.split(': ', 1)  # split only at the first ': '
            # remove the curly braces and quotes, then split into document IDs
            occurrences = occurrences.strip('{}').replace("'", "").split(', ')
            inverted_index[word] = set(occurrences)
    return inverted_index

In [3]:
set_file = '../words_index/inverted_index.txt' #path to the inverted index file
inverted_index = load_inverted_index(set_file) #load the inverted index

In [4]:
def search_word(word):  #function to search for a word in the inverted index
    return inverted_index.get(word, set())  #return the word's documents, or an empty set if the word is not indexed

In [5]:
def tokenize(text):
    return re.findall(r'\b\w+\b', text.lower()) #return a list of words in the text

In [6]:
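operators = ['and', 'or', 'not'] #list of operators (queries are lowercased by tokenize)
def search_query(query): #evaluate a flat Boolean query from left to right, without operator precedence
    words = tokenize(query) #tokenize the query
    result = None      #documents matched so far
    operator = 'and'   #terms without an explicit operator are combined with AND
    for word in words:      #loop through the words in the query
        if word in operators:
            operator = word
            continue
        matches = search_word(word)   #documents containing the current term
        if result is None:     #first term of the query
            if operator == 'not':   #leading NOT: start from every indexed document
                result = set().union(*inverted_index.values()) - matches
            else:
                result = set(matches)   #copy so the index is never mutated
        elif operator == 'and':   #keep documents that contain both
            result = result & matches
        elif operator == 'or':  #keep documents that contain either
            result = result | matches
        elif operator == 'not': #drop documents that contain the term
            result = result - matches
    return result if result is not None else set() #an empty query matches nothing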

In [13]:
query = 'tall or program'
words = tokenize(query)
print("Words in the query:", words)
result = search_query(query)

if result:
    print("Results",result)
    print('Results found:', len(result))
else:
    print('No results found')

Words in the query: ['tall', 'or', 'program']
Results {'pg408.txt', 'pg2554.txt', 'pg5197.txt', 'pg1184.txt', 'pg43.txt', 'pg73447.txt', 'pg45848.txt', 'pg1727.txt', 'pg5200.txt', 'pg6593.txt', 'pg16.txt', 'pg55.txt', 'pg145.txt', 'pg76.txt', 'pg46.txt', 'pg50038.txt', 'pg120.txt', 'pg768.txt', 'pg4085.txt', 'pg244.txt', 'pg996.txt', 'pg45.txt', 'pg18893.txt', 'pg48191.txt', 'pg37106.txt', 'pg74.txt', 'pg2641.txt', 'pg2814.txt', 'pg52882.txt', 'pg28054.txt', 'pg345.txt', 'pg98.txt', 'pg73448.txt', 'pg12582.txt', 'pg1260.txt', 'pg844.txt', 'pg1661.txt', 'pg600.txt', 'pg219.txt', 'pg30254.txt', 'pg2701.txt', 'pg67979.txt', 'pg41070.txt', 'pg100.txt', 'pg61419.txt', 'pg59468.txt', 'pg2852.txt', 'pg21700.txt', 'pg59469.txt', 'pg2160.txt', 'pg47948.txt', 'pg10676.txt', 'pg16389.txt', 'pg62119.txt', 'pg6761.txt', 'pg41445.txt', 'pg64317.txt', 'pg26073.txt', 'pg394.txt', 'pg1998.txt', 'pg73444.txt', 'pg47312.txt', 'pg2591.txt', 'pg174.txt', 'pg2600.txt', 'pg25344.txt', 'pg84.txt', 'pg10907.tx