# Basic Boolean Search in Documents

## Objective
Expand the simple term search functionality to include Boolean search capabilities. This will allow users to perform more complex queries by combining multiple search terms using Boolean operators.

## Problem Description
You must enhance the existing search engine from the previous exercise to support Boolean operators: AND, OR, and NOT. This will enable the retrieval of documents based on the logical relationships between multiple terms.
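
As a rough illustration of the logic involved, each operator can be read as a set operation on document IDs. The sketch below uses a small hand-made index; `toy_index`, `all_docs`, and the file names are invented for illustration and are not part of the exercise data.

```python
# Hypothetical toy index: term -> set of document names (invented for illustration).
toy_index = {
    "whale": {"a.txt", "b.txt"},
    "ship": {"b.txt", "c.txt"},
}
all_docs = {"a.txt", "b.txt", "c.txt"}

# AND keeps documents containing both terms (set intersection).
whale_and_ship = toy_index["whale"] & toy_index["ship"]
# OR keeps documents containing either term (set union).
whale_or_ship = toy_index["whale"] | toy_index["ship"]
# NOT keeps documents that lack the term (difference from the full collection).
not_whale = all_docs - toy_index["whale"]

print(sorted(whale_and_ship))  # ['b.txt']
print(sorted(whale_or_ship))   # ['a.txt', 'b.txt', 'c.txt']
print(sorted(not_whale))       # ['c.txt']
```

Note that NOT needs the full set of document names, which the index alone does not store directly.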

## Requirements

### Step 1: Update Data Preparation
Load and preprocess the documents as in the previous task: read each file, extract the alphabetic words, and lowercase them so that queries match regardless of case.



In [8]:
import re
import os

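def load_file(file_path):
    """Read a text file and return its words, lowercased."""
    # Explicit encoding avoids platform-dependent defaults (assumes UTF-8 files).
    with open(file_path, 'r', encoding='utf-8') as file:
        content = file.read()
    # Keep alphabetic tokens only; digits and punctuation are dropped.
    words = re.findall(r'\b[a-zA-Z]+\b', content)
    return [word.lower() for word in words]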

### Step 2: Create an Inverted Index

Create an inverted index from the documents. The index maps each word to the set of document IDs (here, file names) in which that word appears, so looking up a term is a single dictionary access rather than a scan over every document.



In [9]:
def create_inverted_index(directory):
    """Map each word to the set of file names in which it appears."""
    inverted_index = {}
    for file_name in os.listdir(directory):
        # Only plain-text files are indexed; the file name is the document ID.
        if file_name.endswith('.txt'):
            words = load_file(os.path.join(directory, file_name))
            for word in words:
                # setdefault creates the set the first time a word is seen.
                inverted_index.setdefault(word, set()).add(file_name)
    return inverted_index
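
To see the shape of the index without reading files from disk, here is a minimal in-memory sketch. `build_index_from_texts` and the sample documents are hypothetical names introduced for illustration; the function applies the same tokenization as `load_file`.

```python
import re

def build_index_from_texts(texts):
    """Build term -> set of document names from a dict of name -> raw text.

    Hypothetical in-memory variant of create_inverted_index, for illustration.
    """
    index = {}
    for name, text in texts.items():
        for word in re.findall(r'\b[a-zA-Z]+\b', text):
            index.setdefault(word.lower(), set()).add(name)
    return index

# Invented sample documents, not part of the exercise data.
sample = {
    "doc1.txt": "The cat sat.",
    "doc2.txt": "The dog barked.",
}
index = build_index_from_texts(sample)
print(sorted(index["the"]))  # ['doc1.txt', 'doc2.txt']
print(sorted(index["cat"]))  # ['doc1.txt']
```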

In [10]:

directory = '../data/'
inverted_index = create_inverted_index(directory)

In [12]:
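import json
# Preview the first three index entries. Sets are converted to lists
# because json.dumps cannot serialize sets.
first_three = {k: inverted_index[k] for k in list(inverted_index)[:3]}
first_three_as_lists = {k: list(v) for k, v in first_three.items()}
print(json.dumps(first_three_as_lists, indent=4))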


{
    "the": [
        "War and Peace.txt",
        "Dubliners.txt",
        "The Romance of Lust: A classic Victorian erotic novel.txt",
        "Pygmalion.txt",
        "Beyond Good and Evil.txt",
        "History of Tom Jones, a Foundling.txt",
        "Memoirs of a London doll.txt",
        "The Great Gats.txt",
        "Cambridge Papers.txt",
        "Crime and Punishment.txt",
        "Christopher Columbus and How He Received and Imparted the Spirit of Discovery.txt",
        "Pride and Prejudice.txt",
        "The Odyssey.txt",
        "The Philippines a Century Hence.txt",
        "The Adventures of Sherlock Holmes.txt",
        "Standard Selections (327).txt",
        "The Adventures of Tom Sawyer, Complete.txt",
        "The Enchanted April.txt",
        "Leviathan.txt",
        "Ulysses.txt",
        "The Iliad.txt",
        "Don Quixote.txt",
        "Little Women; Or, Meg, Jo, Beth, and Amy.txt",
        "The Yellow Wallpaper.txt",
        "My Life \u2014 Volume 1.txt",
  

In [7]:
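# Free the index when it is no longer needed. Rebuild it with
# create_inverted_index(directory) before running the search cells below.
del inverted_index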

### Step 3: Query Processing
- **Parse the Query**: Implement a function to parse the input query to identify the terms and operators.
- **Search Documents**: Based on the parsed query, implement the logic to retrieve and rank the documents according to the Boolean expressions.
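
One possible shape for the parsing step is sketched below. `parse_query`, `OPERATORS`, and the `(kind, value)` token format are assumptions made for this sketch, not something the exercise prescribes. Operators are matched case-insensitively because the interactive loop in Step 4 lowercases the whole query.

```python
OPERATORS = {"AND", "OR", "NOT"}

def parse_query(query):
    """Split a query into (kind, value) tokens.

    kind is "OP" for AND/OR/NOT (matched case-insensitively) and "TERM"
    for everything else; terms are lowercased to match the index.
    """
    tokens = []
    for raw in query.split():
        if raw.upper() in OPERATORS:
            tokens.append(("OP", raw.upper()))
        else:
            tokens.append(("TERM", raw.lower()))
    return tokens

print(parse_query("Whale AND not ship"))
# [('TERM', 'whale'), ('OP', 'AND'), ('OP', 'NOT'), ('TERM', 'ship')]
```

A consequence of this design is that the words "and", "or", and "not" can never be searched as ordinary terms.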



In [17]:
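def search_documents(query, inverted_index):
    """Return the sorted documents containing a single term.

    Boolean operators are not interpreted here; the whole query is
    looked up as one term.
    """
    # .get() yields an empty set for unknown terms, so the result is [].
    return sorted(inverted_index.get(query, set()))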
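
The `search_documents` function above looks up a single term. A minimal sketch of a Boolean evaluator follows. The name `boolean_search`, the extra `all_documents` parameter, the precedence (NOT, then AND, then OR), and the lack of parentheses support are all assumptions of this sketch rather than requirements of the exercise.

```python
def boolean_search(query, inverted_index, all_documents):
    """Evaluate a flat Boolean query (no parentheses) against the index.

    Assumed precedence: NOT binds tightest, then AND, then OR.
    Returns a sorted list of matching document names.
    """
    tokens = query.split()
    pos = 0

    def parse_operand():
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("query ends where a term was expected")
        token = tokens[pos]
        pos += 1
        if token.upper() == "NOT":
            # NOT is the complement relative to the whole collection.
            return all_documents - parse_operand()
        if token.upper() in ("AND", "OR"):
            raise ValueError(f"unexpected operator {token!r}")
        # Copy the posting set so later in-place updates never touch the index.
        return set(inverted_index.get(token.lower(), set()))

    def parse_and():
        nonlocal pos
        result = parse_operand()
        while pos < len(tokens) and tokens[pos].upper() == "AND":
            pos += 1
            result &= parse_operand()
        return result

    def parse_or():
        nonlocal pos
        result = parse_and()
        while pos < len(tokens) and tokens[pos].upper() == "OR":
            pos += 1
            result |= parse_and()
        return result

    matches = parse_or()
    if pos != len(tokens):
        raise ValueError(f"unexpected token {tokens[pos]!r}")
    return sorted(matches)

# Invented toy data, for illustration only.
toy_index = {
    "whale": {"a.txt", "b.txt"},
    "ship": {"b.txt", "c.txt"},
    "storm": {"c.txt"},
}
toy_docs = {"a.txt", "b.txt", "c.txt"}
print(boolean_search("whale AND ship", toy_index, toy_docs))      # ['b.txt']
print(boolean_search("whale AND NOT ship", toy_index, toy_docs))  # ['a.txt']
```

For the real index, the full collection could be derived with `set().union(*inverted_index.values())`, which covers every file that contributed at least one word.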



### Step 4: Displaying Results
- **Output the Results**: Display the documents that match the query. When a query matches no documents, say so explicitly instead of printing nothing.
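
One way to keep the display logic testable is to build the output lines in a pure function and print them separately. `format_results` is a hypothetical helper name, and the match count in the header is an addition of this sketch.

```python
def format_results(query, documents):
    """Return the lines to print for a query and its matching documents."""
    if not documents:
        return [f"No matching documents found for '{query}'."]
    lines = [f"{len(documents)} matching document(s) found for '{query}':"]
    lines.extend(f"--- {name}" for name in documents)
    return lines

# Invented file names, for illustration only.
print("\n".join(format_results("whale", ["a.txt", "b.txt"])))
print("\n".join(format_results("zzz", [])))
```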

In [27]:
print("Welcome to Document Search Engine!")
print("")

while True:
    query = input("Enter your query (or type 'exit_program' to quit): ").strip().lower()
    if query == 'exit_program':
        print("Exiting...")
        break

    matching_documents = search_documents(query, inverted_index)
    if matching_documents:
        print(f"Matching documents found for '{query}':")
        for document in matching_documents:
            print("---", document)
    else:
        print("No matching documents found.")

    print("") 

Welcome to Document Search Engine!

Matching documents found for 'confused':
--- A Christmas Carol in Prose; Being a Ghost Story of Christmas.txt
--- A Room with a View.txt
--- A Study in Scarlet.txt
--- A Tale of Two Cities.txt
--- Alice's Adventures in Wonderland.txt
--- Beyond Good and Evil.txt
--- Biographical Anecdotes of William Hogarth, With a Catalogue of His Works.txt
--- Christopher Columbus and How He Received and Imparted the Spirit of Discovery.txt
--- Chronicles of London Bridge.txt
--- Cranford.txt
--- Crime and Punishment.txt
--- Don Juan.txt
--- Don Quixote.txt
--- Dracula.txt
--- Dubliners.txt
--- Fan Fare May 1953.txt
--- Frankenstein; Or, The Modern Prometheus.txt
--- Great Expectations.txt
--- Heart of Darkness.txt
--- History of Tom Jones, a Foundling.txt
--- History of Woman Suffrage, Volume III (590).txt
--- Jane Eyre: An Autobiography.txt
--- John Dewey's logical theory.txt
--- Kentucky in American Letters, 1784-1912. Vol. 2 of 2.txt
--- Leviathan.txt
--- Littl



## Evaluation Criteria
- **Correctness**: The Boolean search implementation should correctly interpret and process the queries according to the Boolean logic.
- **Efficiency**: Consider the efficiency of your search process, especially as the complexity of queries increases.
- **User Experience**: Ensure that the interface for inputting queries and viewing results is user-friendly.
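
On the efficiency point, one common tactic for multi-term AND queries is to intersect the smallest posting sets first, so intermediate results stay small. The sketch below assumes a hypothetical helper `intersect_all` that takes a list of already-lowercased terms.

```python
def intersect_all(terms, inverted_index):
    """Return documents containing every term, intersecting smallest sets first."""
    postings = [inverted_index.get(term, set()) for term in terms]
    if not postings:
        return set()
    # Sorting by size keeps the running intersection as small as possible.
    postings.sort(key=len)
    result = set(postings[0])  # Copy so the index itself is never modified.
    for posting in postings[1:]:
        if not result:
            break  # Nothing can match once the running intersection is empty.
        result &= posting
    return result

# Invented toy data, for illustration only.
toy_index = {
    "the": {"a.txt", "b.txt", "c.txt"},
    "whale": {"a.txt", "b.txt"},
    "storm": {"b.txt"},
}
print(sorted(intersect_all(["the", "whale", "storm"], toy_index)))  # ['b.txt']
```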

This exercise will deepen your understanding of how search engines process and respond to user queries.