# Basic Boolean Search in Documents

## Objective
Expand the simple term search functionality to include Boolean search capabilities. This will allow users to perform more complex queries by combining multiple search terms using Boolean operators.

## Problem Description
You must enhance the existing search engine from the previous exercise to support Boolean operators: AND, OR, and NOT. This will enable the retrieval of documents based on the logical relationships between multiple terms.

## Requirements

### Step 1: Update Data Preparation
Ensure that the documents are still loaded and preprocessed from the previous task. The data should be clean and ready for advanced querying.



In [1]:
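import re
import os

# Common words excluded from the index; a set gives constant-time membership tests.
stop_words = {"the", "is", "and", "of", "a", "in", "to", "it", "that", "this"}

def load_file(file_path):
    # Read the whole file before tokenizing.
    with open(file_path, 'r') as file:
        content = file.read()
    # Keep purely alphabetic tokens, lowercase them, and drop stop words.
    words = re.findall(r'\b[a-zA-Z]+\b', content)
    lowered = (word.lower() for word in words)
    return [word for word in lowered if word not in stop_words]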

The load_file function reads the file at the given path, extracts purely alphabetic words with a regular expression, converts them to lowercase, and drops any word found in the predefined stop words. It returns the remaining words as a list, in their original order and with duplicates kept. Because "and" is one of the stop words, it never appears as a term in the index.
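To see the preprocessing without touching the file system, the same steps can be applied to a plain string. This is only an illustrative sketch: `tokenize` and `STOP_WORDS` are hypothetical names that do not appear in the exercise code, and the sample sentence is made up.

```python
import re

# Same stop words as in the cell above (the name STOP_WORDS is hypothetical).
STOP_WORDS = {"the", "is", "and", "of", "a", "in", "to", "it", "that", "this"}

def tokenize(text):
    """Hypothetical helper: apply load_file's word extraction to a string."""
    words = re.findall(r'\b[a-zA-Z]+\b', text)
    return [w.lower() for w in words if w.lower() not in STOP_WORDS]

sample = "The pride of the Bennets is well-known in 1813."
print(tokenize(sample))  # ['pride', 'bennets', 'well', 'known']
```

The hyphenated word is split into two tokens and the number is dropped, because the pattern only matches runs of ASCII letters.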

### Step 2: Create an Inverted Index

Create an inverted index from the documents. This index maps each word to the set of document IDs in which that word appears. This facilitates word lookup in the search process.



In [2]:
def create_inverted_index(directory):
    # Maps each word to the set of file names that contain it.
    inverted_index = {}
    for file_name in os.listdir(directory):
        # Only plain-text files are indexed.
        if file_name.endswith('.txt'):
            words = load_file(os.path.join(directory, file_name))
            for word in words:
                # Create the set on first sight of the word, then record the file.
                inverted_index.setdefault(word, set()).add(file_name)
    return inverted_index

This code defines a function called create_inverted_index that scans through a specified directory for text files. For each text file found, it extracts the words using the load_file function, then creates an inverted index where words are keys and the set of filenames where each word appears is the value. Finally, it returns this inverted index dictionary.
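The structure is easier to inspect on a toy corpus. The sketch below builds the same kind of word-to-documents mapping from an in-memory dictionary instead of a directory; `build_index` and the three tiny "documents" are assumptions made for illustration only.

```python
from collections import defaultdict

def build_index(docs):
    """docs: hypothetical dict of document name -> list of preprocessed words."""
    index = defaultdict(set)
    for name, words in docs.items():
        for word in words:
            index[word].add(name)
    return dict(index)

docs = {
    "a.txt": ["pride", "prejudice"],
    "b.txt": ["pride", "ship"],
    "c.txt": ["ship", "island"],
}
index = build_index(docs)
print(sorted(index["pride"]))  # ['a.txt', 'b.txt']
print(sorted(index["ship"]))   # ['b.txt', 'c.txt']
```

Storing sets rather than lists means a word that occurs many times in one file still contributes that file only once.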

In [3]:

directory = '../data/'
inverted_index = create_inverted_index(directory)

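We build the index from the directory that holds the text files (the cell above), then display its first three entries (the cell below). The sets are converted to lists first because JSON cannot serialize a set.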

In [4]:
import json
first_three = {k: inverted_index[k] for k in list(inverted_index)[:3]}
first_three_as_lists = {k: list(v) for k, v in first_three.items()}
print(json.dumps(first_three_as_lists, indent=4))


{
    "works": [
        "War and Peace.txt",
        "The Pleasures of the Table.txt",
        "The Expedition of Humphry Clinker.txt",
        "Treasure Island.txt",
        "A Christmas Carol in Prose; Being a Ghost Story of Christmas.txt",
        "The Romance of Lust: A classic Victorian erotic novel.txt",
        "Mo.txt",
        "A Study in Scarlet.txt",
        "My Life \u2014 Volume 1.txt",
        "The Importance of Being Earnest: A Trivial Comedy for Serious People.txt",
        "Childe Harold's Pilgrimage.txt",
        "Adventures of Huckleberry Finn.txt",
        "The Strange Case of Dr. Jekyll and Mr. Hyde.txt",
        "Don Quijote.txt",
        "A Doll's House : a play.txt",
        "Chronicles of London Bridge.txt",
        "Thus Spake Zarathustra: A Book for All and None.txt",
        "The Hound of the Baskervilles.txt",
        "The Adventures of Ferdinand Count Fathom \u2014 Complete.txt",
        "Twenty years after.txt",
        "The Works of the Rev. Hugh Binnin

### Step 3: Query Processing
- **Parse the Query**: Implement a function to parse the input query to identify the terms and operators.
- **Search Documents**: Based on the parsed query, implement the logic to retrieve and rank the documents according to the Boolean expressions.
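The parsing step could look roughly like the sketch below. It is one possible approach, not the exercise's reference solution, and it rests on assumptions that are not in the original: operators must be written in upper case (`AND`, `OR`, `NOT`), every other token is treated as a search term and lowercased, and `parse_query` and `OPERATORS` are hypothetical names.

```python
# Assumption: operators are recognized only in upper case.
OPERATORS = {"AND", "OR", "NOT"}

def parse_query(query):
    """Split a query into (kind, value) pairs; kind is 'OP' or 'TERM'."""
    tokens = []
    for raw in query.split():
        if raw in OPERATORS:
            tokens.append(("OP", raw))
        else:
            # Terms are lowercased to match the keys of the inverted index.
            tokens.append(("TERM", raw.lower()))
    return tokens

print(parse_query("Pride AND prejudice NOT ship"))
```

Requiring upper-case operators keeps them distinct from ordinary words, but it also means the query must be parsed before it is lowercased.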



In [6]:
def search_documents(query, inverted_index):
    # Single-term lookup: the whole query string is treated as one index key.
    # Returns the matching file names in alphabetical order, or [] if none match.
    return sorted(inverted_index.get(query, set()))


The search_documents function takes a query (a single word) and an inverted index. If the query is a key in the index, it retrieves the set of documents containing that word and returns their names sorted alphabetically; otherwise it returns an empty list. As written, it does not interpret Boolean operators: a multi-word query such as `pride AND prejudice` is looked up as one key, matches nothing, and returns an empty list.
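One way to extend this lookup to AND, OR, and NOT is to map each operator onto a set operation: intersection, union, and difference from the set of all documents. The following is a minimal sketch under stated assumptions, not a definitive implementation: `boolean_search` and `all_documents` are hypothetical names, operators must be upper case, the query is evaluated strictly left to right with no precedence or parentheses, and two adjacent terms with no operator between them are combined with AND.

```python
def boolean_search(query, inverted_index, all_documents):
    """Evaluate a flat query such as 'pride AND NOT ship' from left to right."""
    result = None      # running set of matching documents
    pending_op = None  # binary operator waiting for its right operand
    negate = False     # True when the next term is preceded by NOT
    for token in query.split():
        if token in ("AND", "OR"):
            pending_op = token
        elif token == "NOT":
            negate = not negate
        else:
            docs = inverted_index.get(token.lower(), set())
            if negate:
                # NOT term: every document that does not contain the term.
                docs = all_documents - docs
                negate = False
            if result is None:
                result = set(docs)
            elif pending_op == "OR":
                result = result | docs
            else:
                # AND, also the assumed default when no operator is given.
                result = result & docs
            pending_op = None
    return sorted(result or [])

# Toy index for illustration only.
toy_index = {
    "pride": {"a.txt", "b.txt"},
    "ship": {"b.txt", "c.txt"},
    "island": {"c.txt"},
}
toy_docs = set().union(*toy_index.values())

print(boolean_search("pride AND ship", toy_index, toy_docs))      # ['b.txt']
print(boolean_search("pride OR island", toy_index, toy_docs))     # ['a.txt', 'b.txt', 'c.txt']
print(boolean_search("pride AND NOT ship", toy_index, toy_docs))  # ['a.txt']
```

Here the set of all documents is derived from the index itself, so a file that contains only stop words would be missing from it; collecting the file names while building the index would avoid that gap.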


### Step 4: Displaying Results
- **Output the Results**: Display the documents that match the query criteria. Include functionalities to handle queries that result in no matching documents.

In [8]:
print("Welcome to Document Search Engine!")
print("")

while True:
    query = input("Enter your query (or type 'exit_program' to quit): ").strip().lower()
    if query == 'exit_program':
        print("Exiting...")
        break

    matching_documents = search_documents(query, inverted_index)
    if matching_documents:
        print(f"Matching documents found for '{query}':")
        for document in matching_documents:
            print("---", document)
    else:
        print("No matching documents found.")

    # Two blank lines, then a separator after each search.
    print("\n\n**************End of search")

Welcome to Document Search Engine!

Matching documents found for 'pride':
--- A Christmas Carol in Prose; Being a Ghost Story of Christmas.txt
--- A Doll's House : a play.txt
--- A Modest Proposal.txt
--- A Room with a View.txt
--- A Short History of English Agriculture.txt
--- A Smaller History of Rome.txt
--- A Study in Scarlet.txt
--- A Tale of Two Cities.txt
--- Adventures of Huckleberry Finn.txt
--- Anne of Green Gables.txt
--- Beyond Good and Evil.txt
--- Biographical Anecdotes of William Hogarth, With a Catalogue of His Works.txt
--- Cambridge Papers.txt
--- Childe Harold's Pilgrimage.txt
--- Christopher Columbus and How He Received and Imparted the Spirit of Discovery.txt
--- Chronicles of London Bridge.txt
--- Cranford.txt
--- Crime and Punishment.txt
--- Don Juan.txt
--- Don Quixote.txt
--- Dracula.txt
--- Dubliners.txt
--- Fan Fare May 1953.txt
--- Frankenstein; Or, The Modern Prometheus.txt
--- Great Expectations.txt
--- Grimms' Fairy Tales.txt
--- Heart of Darkness.txt
---



## Evaluation Criteria
- **Correctness**: The Boolean search implementation should correctly interpret and process the queries according to the Boolean logic.
- **Efficiency**: Consider the efficiency of your search process, especially as the complexity of queries increases.
- **User Experience**: Ensure that the interface for inputting queries and viewing results is user-friendly.
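On the efficiency criterion, one common idea for queries that AND several terms together is to start from the term with the fewest documents, so that intermediate results stay small and the loop can stop as soon as the result is empty. The sketch below illustrates that idea only; `intersect_terms` is a hypothetical helper and the toy index is made up.

```python
def intersect_terms(terms, inverted_index):
    """AND together several terms, starting with the rarest one."""
    postings = [inverted_index.get(term, set()) for term in terms]
    if not postings:
        return set()
    # Smallest document sets first keeps every intermediate result small.
    postings.sort(key=len)
    result = set(postings[0])
    for docs in postings[1:]:
        if not result:
            break  # nothing left to match, so skip the remaining terms
        result &= docs
    return result

# Toy index for illustration only.
small_index = {
    "pride": {"a.txt", "b.txt"},
    "ship": {"b.txt", "c.txt"},
    "island": {"c.txt"},
}
print(sorted(intersect_terms(["pride", "ship"], small_index)))  # ['b.txt']
```

A term that is missing from the index contributes an empty set, which sorts first and makes the function return an empty result immediately.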

This exercise will deepen your understanding of how search engines process and respond to user queries.