# Basic Boolean Search in Documents

## Objective
Extend the simple term search with Boolean search, so that users can build more complex queries by combining multiple search terms with Boolean operators.

## Problem Description
You must enhance the existing search engine from the previous exercise to support Boolean operators: AND, OR, and NOT. This will enable the retrieval of documents based on the logical relationships between multiple terms.

## Requirements

### Step 1: Update Data Preparation
Ensure that the documents are loaded and preprocessed as in the previous task. The data should be clean and ready for Boolean querying.

### Step 2: Create an Inverted Index

Create an inverted index from the documents. This index maps each word to the set of document IDs in which that word appears, so that looking up a word during a search does not require rescanning the documents.
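
As a minimal sketch of the structure described above, the snippet below builds an inverted index from a small in-memory collection. The `docs` dictionary and the `build_index` helper are illustrative assumptions and are not part of the exercise's own code.

```python
import re

def build_index(docs):
    """Map each lowercase word to the set of document IDs that contain it."""
    index = {}
    for doc_id, text in docs.items():
        # Strip punctuation and lowercase before splitting on whitespace
        for word in re.sub(r'[^\w\s]', '', text).lower().split():
            index.setdefault(word, set()).add(doc_id)
    return index

# Hypothetical toy collection, used only for illustration
docs = {
    'd1': 'The cat sat on the mat.',
    'd2': 'The dog chased the cat!',
    'd3': 'Dogs and cats are pets.',
}
index = build_index(docs)
print(sorted(index['cat']))  # ['d1', 'd2']
print(sorted(index['the']))  # ['d1', 'd2']
```

Because no stemming is applied, `cats` and `cat` are indexed as different words in this sketch.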

### Step 3: Query Processing
- **Parse the Query**: Implement a function to parse the input query to identify the terms and operators.
- **Search Documents**: Based on the parsed query, implement the logic to retrieve and rank the documents according to the Boolean expressions.
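
One possible way to approach the two bullets above is a small recursive-descent evaluator over Python sets. This is a sketch under stated assumptions, not a reference solution: operators must be written in uppercase, precedence is assumed to be NOT, then AND, then OR, parentheses are supported, and the names `tokenize`, `evaluate`, `index`, and `all_docs` are hypothetical.

```python
import re

OPERATORS = {'AND', 'OR', 'NOT'}

def tokenize(query):
    """Split a query into operators, parentheses, and normalized terms."""
    tokens = []
    for tok in re.findall(r'\(|\)|[^\s()]+', query):
        if tok in OPERATORS or tok in ('(', ')'):
            tokens.append(tok)
        else:
            # Normalize terms the same way the index was built
            term = re.sub(r'[^\w\s]', '', tok).lower()
            if term:
                tokens.append(term)
    return tokens

def evaluate(tokens, index, all_docs):
    """Evaluate tokens against the index; returns a set of document IDs."""
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        pos += 1

    def parse_or():
        result = parse_and()
        while peek() == 'OR':
            advance()
            result = result | parse_and()  # OR is set union
        return result

    def parse_and():
        result = parse_not()
        while peek() == 'AND':
            advance()
            result = result & parse_not()  # AND is set intersection
        return result

    def parse_not():
        if peek() == 'NOT':
            advance()
            return all_docs - parse_not()  # NOT is complement within the collection
        return parse_atom()

    def parse_atom():
        tok = peek()
        if tok is None or tok in OPERATORS or tok == ')':
            raise ValueError(f'Unexpected token: {tok!r}')
        advance()
        if tok == '(':
            result = parse_or()
            if peek() != ')':
                raise ValueError('Missing closing parenthesis')
            advance()
            return result
        # Unknown terms match no documents
        return set(index.get(tok, set()))

    result = parse_or()
    if peek() is not None:
        raise ValueError(f'Unexpected token: {peek()!r}')
    return result

# Hypothetical toy index, used only for illustration
index = {'cat': {'d1', 'd2'}, 'dog': {'d2', 'd3'}, 'fish': {'d3'}}
all_docs = {'d1', 'd2', 'd3'}
print(sorted(evaluate(tokenize('cat AND dog'), index, all_docs)))      # ['d2']
print(sorted(evaluate(tokenize('cat AND NOT dog'), index, all_docs)))  # ['d1']
```

NOT needs the full set of document IDs (`all_docs`) because the complement of a posting set is only defined relative to the whole collection. A plain Boolean result is an unordered set, so any ranking would have to be layered on top of it.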

### Step 4: Displaying Results
- **Output the Results**: Display the documents that match the query criteria, and handle queries that return no matching documents with a clear message.
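
A minimal sketch of this step, assuming the search returns a set of document IDs, could look like the following. The `format_results` helper and the file names in the usage lines are hypothetical.

```python
def format_results(query, matches):
    """Return a human-readable summary of the documents matching a query."""
    if not matches:
        return f"No documents match the query '{query}'."
    lines = [f"{len(matches)} document(s) match the query '{query}':"]
    # Sort for a stable, reproducible listing
    lines.extend(f"  - {doc_id}" for doc_id in sorted(matches))
    return '\n'.join(lines)

# Hypothetical results, used only for illustration
print(format_results('cat AND dog', {'book2.txt', 'book1.txt'}))
print(format_results('unicorn', set()))
```

Returning a string instead of printing inside the function keeps the formatting easy to test and to reuse.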

## Evaluation Criteria
- **Correctness**: The Boolean search implementation should correctly interpret and process the queries according to the Boolean logic.
- **Efficiency**: Consider the efficiency of your search process, especially as the complexity of queries increases.
- **User Experience**: Ensure that the interface for inputting queries and viewing results is user-friendly.
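
On the efficiency criterion, one common heuristic for queries that AND several terms together is to intersect the smallest posting sets first, so intermediate results stay small. The sketch below illustrates the idea. The `intersect_all` helper is a hypothetical name and not something the exercise requires.

```python
def intersect_all(posting_sets):
    """Intersect several posting sets, starting with the smallest."""
    if not posting_sets:
        return set()
    # Process sets in order of increasing size
    ordered = sorted(posting_sets, key=len)
    result = set(ordered[0])
    for postings in ordered[1:]:
        result &= postings
        if not result:
            break  # an empty intersection cannot grow again
    return result

print(intersect_all([{1, 2, 3, 4}, {2, 3}, {3, 5}]))  # {3}
```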

This exercise will deepen your understanding of how search engines process and respond to user queries.

In [93]:
import os
import re

In [94]:
def remove_symbols(palabra):
    # Remove punctuation and convert to lowercase
    return re.sub(r'[^\w\s]', '', palabra).lower()

In [95]:
def create_inverter_index(path_files):
    # Create a dictionary to store the inverted index
    inverted_index = {}
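    # Iterate over the files in the directory (sorted for a reproducible order)
    for name_book in sorted(os.listdir(path_files)):
        book = os.path.join(path_files, name_book)
        # Skip subdirectories; only regular files can be read
        if not os.path.isfile(book):
            continue
        # Read the file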
        with open(book, 'r', encoding='utf-8') as f:
            for line in f:
                # Remove punctuation and convert to lowercase
                line = remove_symbols(line)
                # Split the line into words
                words = line.split()
                # Iterate over the words
                for word in words:
                    # If the word is not in the inverted index, add it
                    if word not in inverted_index:
                        inverted_index[word] = set()  # use a set to avoid duplicates
                    # Add the book to the inverted index for this word
                    inverted_index[word].add(name_book)
    return inverted_index

In [96]:
def save_inverted_index(inverted_index, output_file):
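    # Create the output directory if it does not exist yet
    output_dir = os.path.dirname(output_file)
    if output_dir:
        os.makedirs(output_dir, exist_ok=True)
    # Save the inverted index to a file, one "word: {documents}" entry per line
    with open(output_file, 'w', encoding='utf-8') as f: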
        for word, occurrences in inverted_index.items():
            f.write(f"{word}: {occurrences}\n")

In [97]:
def search_word(word, inverted_index_path):
    # Read the inverted index
    with open(inverted_index_path, 'r', encoding='utf-8') as f:
        # Iterate over the lines in the file
        for line in f:
            # Split on the first colon only, into word and occurrences
            word_inverted_index, occurrences = line.split(':', 1)
            # If the word is the one we are looking for, return the occurrences
            if word_inverted_index == word:
                # Drop the leading space and trailing newline left by the file format
                return occurrences.strip()
    return None
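
Boolean operators need posting sets rather than the raw text that `search_word` returns. A minimal sketch of reading the saved file back into sets follows, assuming each line has the `word: {'file.txt', ...}` shape written by `save_inverted_index`. The `load_inverted_index` helper is a hypothetical addition.

```python
import ast

def load_inverted_index(lines):
    """Rebuild {word: set_of_doc_ids} from lines shaped like "word: {'a.txt'}"."""
    index = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # Split only on the first colon so the set literal stays intact
        word, _, postings = line.partition(':')
        # literal_eval safely parses the set literal without executing code
        index[word] = ast.literal_eval(postings.strip())
    return index

# Hypothetical file content, used only for illustration
sample = ["cat: {'book1.txt', 'book2.txt'}\n", "dog: {'book2.txt'}\n"]
index = load_inverted_index(sample)
print(sorted(index['cat'] & index['dog']))  # ['book2.txt']
```

Passing an open file object works the same way, since the function only iterates over lines.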

In [98]:
path_files = '../data'
path_inverted_index = '../words_index/inverted_index.txt'  # Output file for the inverted index
inverted_index = create_inverter_index(path_files)
save_inverted_index(inverted_index, path_inverted_index)

In [101]:
word = input('Type the word to search: ')
word_clean = remove_symbols(word)
result = search_word(word_clean, path_inverted_index)
if result:
    print(f"The word '{word}' appears in the following books: {result}")
else:
    print(f"The word '{word}' does not appear in any book")

The word 'https://www.gutenberg.org/browse/scores/top#books-last1' appears in the following books:  {'datasource.txt'}
 
