# Basic Boolean Search in Documents

## Objective
Expand the simple term search functionality to include Boolean search capabilities. This will allow users to perform more complex queries by combining multiple search terms using Boolean operators.

## Problem Description
You must enhance the existing search engine from the previous exercise to support Boolean operators: AND, OR, and NOT. This will enable the retrieval of documents based on the logical relationships between multiple terms.

## Requirements

### Step 1: Update Data Preparation
Ensure that the documents are still loaded and preprocessed from the previous task. The data should be clean and ready for advanced querying.

### Step 2: Create an Inverted Index

Create an inverted index from the documents. This index maps each word to the set of document IDs in which that word appears. This facilitates word lookup in the search process.
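
As a minimal sketch of the structure described above, assuming documents are already tokenized and keyed by an integer ID (the `docs` corpus and the `build_inverted_index` name are illustrative and do not come from the exercise):

```python
from collections import defaultdict

# Hypothetical toy corpus: document ID -> already-normalized tokens.
docs = {
    1: ["the", "whale", "sank", "the", "ship"],
    2: ["the", "ship", "sailed", "home"],
    3: ["a", "whale", "song"],
}


def build_inverted_index(docs: dict[int, list[str]]) -> dict[str, set[int]]:
    """Map each token to the set of document IDs that contain it."""
    index: dict[str, set[int]] = defaultdict(set)
    for doc_id, doc_tokens in docs.items():
        for token in doc_tokens:
            index[token].add(doc_id)
    return dict(index)


inverted_index = build_inverted_index(docs)
print(sorted(inverted_index["whale"]))  # [1, 3]
print(sorted(inverted_index["ship"]))   # [1, 2]
```

Storing sets rather than lists means a repeated word adds its document ID only once, and the Boolean operators later map directly onto set intersection, union, and difference.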

### Step 3: Query Processing
- **Parse the Query**: Implement a function to parse the input query to identify the terms and operators.
- **Search Documents**: Based on the parsed query, implement the logic to retrieve and rank the documents according to the Boolean expressions.
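
One possible way to approach both bullets is sketched below. It assumes uppercase operators, no parentheses, and the precedence NOT > AND > OR; none of these choices is prescribed by the exercise, and `parse_query`, `evaluate`, and the toy `index` are illustrative names. Ranking is left out of this sketch.

```python
OPERATORS = {"AND", "OR", "NOT"}


def parse_query(query: str) -> list[str]:
    """Split a query on whitespace; operators stay uppercase, terms are lowercased."""
    return [word if word in OPERATORS else word.lower() for word in query.split()]


def evaluate(tokens: list[str], index: dict[str, set[int]], all_docs: set[int]) -> set[int]:
    """Evaluate parsed tokens with precedence NOT > AND > OR (assumed, no parentheses)."""
    pos = 0

    def parse_not() -> set[int]:
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("query ended where a term was expected")
        token = tokens[pos]
        pos += 1
        if token == "NOT":
            # NOT is the complement with respect to the whole collection.
            return all_docs - parse_not()
        if token in OPERATORS:
            raise ValueError(f"unexpected operator {token!r}")
        return index.get(token, set())

    def parse_and() -> set[int]:
        nonlocal pos
        result = parse_not()
        while pos < len(tokens) and tokens[pos] == "AND":
            pos += 1
            result = result & parse_not()
        return result

    def parse_or() -> set[int]:
        nonlocal pos
        result = parse_and()
        while pos < len(tokens) and tokens[pos] == "OR":
            pos += 1
            result = result | parse_and()
        return result

    result = parse_or()
    if pos != len(tokens):
        raise ValueError(f"unexpected token {tokens[pos]!r}")
    return result


# Hypothetical index over three documents.
index = {"whale": {1, 3}, "ship": {1, 2}, "song": {3}}
all_docs = {1, 2, 3}

print(sorted(evaluate(parse_query("whale AND ship"), index, all_docs)))          # [1]
print(sorted(evaluate(parse_query("whale AND NOT ship"), index, all_docs)))      # [3]
print(sorted(evaluate(parse_query("song OR whale AND ship"), index, all_docs)))  # [1, 3]
```

Two adjacent terms without an operator raise a `ValueError` here; treating them as an implicit AND would be an equally reasonable design choice.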

### Step 4: Displaying Results
- **Output the Results**: Display the documents that match the query criteria. Include functionalities to handle queries that result in no matching documents.
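
A small sketch of how the output step might handle both cases, assuming the search returns a set of document IDs and that a `titles` mapping from ID to display name is available (both are assumptions for illustration):

```python
def format_results(query: str, matches: set[int], titles: dict[int, str]) -> str:
    """Return a printable summary, with an explicit message for empty results."""
    if not matches:
        return f"No documents match {query!r}."
    lines = [f"{len(matches)} document(s) match {query!r}:"]
    lines += [f"  - {titles[doc_id]}" for doc_id in sorted(matches)]
    return "\n".join(lines)


# Hypothetical titles for a three-document collection.
titles = {1: "book1", 2: "book2", 3: "book3"}

print(format_results("whale AND ship", {1}, titles))
print(format_results("whale AND dragon", set(), titles))
```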

## Evaluation Criteria
- **Correctness**: The Boolean search implementation should correctly interpret and process the queries according to the Boolean logic.
- **Efficiency**: Consider the efficiency of your search process, especially as the complexity of queries increases.
- **User Experience**: Ensure that the interface for inputting queries and viewing results is user-friendly.
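
On the efficiency point, one common tactic for queries with several AND-ed terms is to intersect the smallest posting sets first, so intermediate results stay small. A minimal sketch, with `intersect_all` as an illustrative name:

```python
def intersect_all(posting_sets: list[set[int]]) -> set[int]:
    """AND together several posting sets, starting with the smallest."""
    if not posting_sets:
        return set()
    ordered = sorted(posting_sets, key=len)
    result = set(ordered[0])  # copy so the index itself is never mutated
    for postings in ordered[1:]:
        result &= postings
        if not result:
            break  # nothing can match any more, so stop early
    return result


print(sorted(intersect_all([{1, 2, 3, 4, 5}, {2, 5}, {2, 3, 5, 8}])))  # [2, 5]
```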

This exercise will deepen your understanding of how search engines process and respond to user queries.

# Exercise

In [169]:
import os
import re

Count the book files in the data directory

In [170]:
DIR = "../data"
total_books = len([name for name in os.listdir(DIR) if os.path.isfile(os.path.join(DIR, name))])
print(total_books)

100


Helper function to flatten arrays

In [171]:
def flatten(xss):
    return [x for xs in xss for x in xs]

Helper function to read the book content

In [172]:
def get_book_content(book_path: str) -> str:
    with open(f"{DIR}/{book_path}") as file:
        book_content = file.read()
        return book_content


Regex to filter out punctuation and special characters

Empty list to store the content of all the books

In [173]:
punctuation_regex = r"[^\w\s]"
books = []

Function to remove all non-alphanumeric characters and convert to lower case

In [174]:
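def normalize_tokens(tokens: list[str]) -> list[str]:
    # Lowercase each token, then strip punctuation and special characters
    # (the loop variable is named token so it does not shadow the built-in str)
    normalized_tokens = [re.sub(punctuation_regex, "", token.lower()) for token in tokens]
    return normalized_tokens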

Function to tokenize the book's content

In [175]:
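def tokenize_input(text: str) -> list[str]:
    # Split on whitespace, then normalize each piece
    # (the parameter is named text so it does not shadow the built-in input)
    tokens = normalize_tokens(text.split())
    return tokens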
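
One edge case worth noting: a word made only of punctuation, such as "--", is reduced to an empty string by the normalization step, and that empty string is kept as a token. The sketch below is a hypothetical variant (the `tokenize` name and the sample sentence are illustrative) that drops such empty tokens:

```python
import re

PUNCTUATION = r"[^\w\s]"  # same pattern as punctuation_regex in the notebook


def tokenize(text: str) -> list[str]:
    """Lowercase, split on whitespace, strip punctuation, drop empty tokens."""
    cleaned = (re.sub(PUNCTUATION, "", word.lower()) for word in text.split())
    return [token for token in cleaned if token]


print(tokenize("Call me Ishmael -- some years ago..."))
# ['call', 'me', 'ishmael', 'some', 'years', 'ago']
```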


Function to tokenize all books

In [176]:
def tokenize_books() -> list[list[str]]:
    book_tokens = []
    for i in range(1, total_books + 1):
        # get_book_content already prepends DIR, so pass only the file name
        book_path = f"book{i}.txt"
        book_content = get_book_content(book_path).lower()
        books.append(book_content)
        book_tokens.append(tokenize_input(book_content))

    return book_tokens

We fill a rows list with the name of each book. The list starts with an empty string, so it holds one more entry than there are books

In [177]:
rows = [""]
for i in range(1, total_books + 1):
    rows.append(f"book{i}")

Then we call the tokenize_books function to store the tokens of every book in a "tokens" list

In [178]:
tokens = tokenize_books()

We keep only the unique tokens: the flatten function merges the per-book token lists into a single list, and set removes the duplicates

In [179]:
unique_tokens = set(flatten(tokens))

We initialize the columns list. Each entry is a dictionary holding a token and a list of 0s and 1s, one per book, indicating whether the token was found in that book. The first entry is a placeholder that only describes this shape

In [180]:
columns = [{"token": str, "appereances": list[int]}]

With this code block we construct the index.
- First we loop through the unique tokens to populate the columns list
- For each token, we go through the books and check whether the token occurs in each one
- If the token is found, we append a 1 to that token's "appereances" list
- If it is not found, we append a 0


In [181]:
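# Exact membership: one token set per book (a substring test on the raw text
# would match "cat" inside "category" and miss "dont" derived from "don't")
book_token_sets = [set(book_tokens) for book_tokens in tokens]

for token in unique_tokens:
    # 1 if the token occurs in the book, 0 otherwise, in book order.
    # Building the list before appending avoids writing to columns[index],
    # which is off by one because columns starts with a placeholder entry.
    appereances = [1 if token in token_set else 0 for token_set in book_token_sets]
    columns.append({"token": token, "appereances": appereances})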

regular_index = {"books": rows, "tokens": columns}

We print the results

In [182]:
print(len(regular_index["books"]))
print(regular_index["books"])

101
['', 'book1', 'book2', 'book3', 'book4', 'book5', 'book6', 'book7', 'book8', 'book9', 'book10', 'book11', 'book12', 'book13', 'book14', 'book15', 'book16', 'book17', 'book18', 'book19', 'book20', 'book21', 'book22', 'book23', 'book24', 'book25', 'book26', 'book27', 'book28', 'book29', 'book30', 'book31', 'book32', 'book33', 'book34', 'book35', 'book36', 'book37', 'book38', 'book39', 'book40', 'book41', 'book42', 'book43', 'book44', 'book45', 'book46', 'book47', 'book48', 'book49', 'book50', 'book51', 'book52', 'book53', 'book54', 'book55', 'book56', 'book57', 'book58', 'book59', 'book60', 'book61', 'book62', 'book63', 'book64', 'book65', 'book66', 'book67', 'book68', 'book69', 'book70', 'book71', 'book72', 'book73', 'book74', 'book75', 'book76', 'book77', 'book78', 'book79', 'book80', 'book81', 'book82', 'book83', 'book84', 'book85', 'book86', 'book87', 'book88', 'book89', 'book90', 'book91', 'book92', 'book93', 'book94', 'book95', 'book96', 'book97', 'book98', 'book99', 'book100']

In [183]:
print(len(regular_index["tokens"]))

329087


In [None]:
print(regular_index["tokens"])
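
As a hedged sketch of how an index in this shape could answer Boolean queries, the 0/1 lists can be combined element-wise: AND and OR pair up the bits of two tokens, and NOT flips them. The `toy_index` data and the helper names below are illustrative assumptions, not part of the notebook:

```python
# Toy index in the same shape as regular_index (hypothetical data).
toy_index = {
    "books": ["", "book1", "book2", "book3"],
    "tokens": [
        {"token": "whale", "appereances": [1, 0, 1]},
        {"token": "ship", "appereances": [1, 1, 0]},
    ],
}

# Build a token -> bit list lookup once instead of scanning the list per term.
bits_by_token = {entry["token"]: entry["appereances"] for entry in toy_index["tokens"]}
book_names = toy_index["books"][1:]  # skip the leading empty string


def bits(token: str) -> list[int]:
    """Bit list for a token; an unknown token matches no book."""
    return bits_by_token.get(token, [0] * len(book_names))


def op_and(a: list[int], b: list[int]) -> list[int]:
    return [x & y for x, y in zip(a, b)]


def op_or(a: list[int], b: list[int]) -> list[int]:
    return [x | y for x, y in zip(a, b)]


def op_not(a: list[int]) -> list[int]:
    return [1 - x for x in a]


def matching_books(result: list[int]) -> list[str]:
    return [name for name, bit in zip(book_names, result) if bit]


print(matching_books(op_and(bits("whale"), bits("ship"))))          # ['book1']
print(matching_books(op_or(bits("whale"), bits("ship"))))           # ['book1', 'book2', 'book3']
print(matching_books(op_and(bits("whale"), op_not(bits("ship")))))  # ['book3']
```

With several hundred thousand tokens, the dictionary lookup matters: finding a token by scanning the "tokens" list would cost a full pass per query term.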