# Basic Boolean Search in Documents

## Objective
Expand the simple term search functionality to include Boolean search capabilities. This will allow users to perform more complex queries by combining multiple search terms using Boolean operators.

## Problem Description
You must enhance the existing search engine from the previous exercise to support Boolean operators: AND, OR, and NOT. This will enable the retrieval of documents based on the logical relationships between multiple terms.

## Requirements

### Step 1: Update Data Preparation
Ensure that the documents are still loaded and preprocessed from the previous task. The data should be clean and ready for advanced querying.
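
As one illustration of what "clean" could mean here, the sketch below lowercases the text and keeps only runs of ASCII letters and digits, allowing apostrophes inside a word. The helper name `tokenize` and the pattern are assumptions made for this example, not part of the previous exercise. Accented characters would need a broader pattern.

```python
import re

# Hypothetical pattern: ASCII letters/digits, with optional internal apostrophes.
TOKEN_PATTERN = re.compile(r"[a-z0-9]+(?:'[a-z0-9]+)*")

def tokenize(text):
    """Return the normalized tokens found in `text`, in order of appearance."""
    return TOKEN_PATTERN.findall(text.lower())

print(tokenize('"Stop!" cried Alice; don\'t go.'))
# ['stop', 'cried', 'alice', "don't", 'go']
```

Normalizing this way keeps tokens that are only punctuation, such as `!`, out of the vocabulary.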

### Step 2: Create an Inverted Index

Create an inverted index from the documents. This index maps each word to the set of document IDs in which that word appears, so looking up a term is a single dictionary access instead of a scan over every document.
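
A minimal sketch of such an index, assuming the documents are available as a dictionary of `{doc_id: text}`. The function name `build_inverted_index` and the toy documents are made up for illustration.

```python
from collections import defaultdict

def build_inverted_index(documents):
    """Map each term to the set of document IDs whose text contains it.

    `documents` is assumed to be a dict of {doc_id: text}.
    """
    index = defaultdict(set)
    for doc_id, text in documents.items():
        # A set removes repeated words so each term is handled once per document
        for term in set(text.lower().split()):
            index[term].add(doc_id)
    return dict(index)

# Toy documents standing in for the real books
docs = {
    "a.txt": "the whale and the sea",
    "b.txt": "the ghost of christmas",
    "c.txt": "a whale of a tale",
}
index = build_inverted_index(docs)
print(sorted(index["whale"]))  # ['a.txt', 'c.txt']
```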

### Step 3: Query Processing
- **Parse the Query**: Implement a function to parse the input query to identify the terms and operators.
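- **Search Documents**: Based on the parsed query, implement the logic to retrieve the documents that satisfy the Boolean expression. Pure Boolean matching yields an unranked set of documents, so any ranking requires an additional criterion of your choice.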
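
One possible way to approach both bullets is a small recursive-descent evaluator. The sketch below assumes operators are written in upper case, that precedence is NOT over AND over OR, and that parentheses are allowed. All names (`parse_query`, `evaluate`, `OPERATORS`) and the toy index are hypothetical.

```python
OPERATORS = {"AND", "OR", "NOT", "(", ")"}

def parse_query(query):
    """Split a query into tokens; operators stay upper-case, terms are lower-cased."""
    spaced = query.replace("(", " ( ").replace(")", " ) ")
    return [tok if tok in OPERATORS else tok.lower() for tok in spaced.split()]

def evaluate(tokens, index, all_docs):
    """Return the set of doc IDs matching the tokens (NOT > AND > OR)."""
    pos = 0

    def parse_or():
        nonlocal pos
        result = parse_and()
        while pos < len(tokens) and tokens[pos] == "OR":
            pos += 1
            result = result | parse_and()
        return result

    def parse_and():
        nonlocal pos
        result = parse_not()
        while pos < len(tokens) and tokens[pos] == "AND":
            pos += 1
            result = result & parse_not()
        return result

    def parse_not():
        nonlocal pos
        if pos < len(tokens) and tokens[pos] == "NOT":
            pos += 1
            # NOT is the complement with respect to the whole collection
            return all_docs - parse_not()
        return parse_atom()

    def parse_atom():
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("unexpected end of query")
        token = tokens[pos]
        pos += 1
        if token == "(":
            result = parse_or()
            if pos >= len(tokens) or tokens[pos] != ")":
                raise ValueError("missing closing parenthesis")
            pos += 1
            return result
        if token in OPERATORS:
            raise ValueError(f"unexpected operator: {token}")
        # Unknown terms match no documents
        return set(index.get(token, set()))

    result = parse_or()
    if pos != len(tokens):
        raise ValueError(f"unexpected token: {tokens[pos]}")
    return result

# Toy index: term -> set of document IDs
toy_index = {"whale": {"a", "c"}, "sea": {"a"}, "ghost": {"b"}}
toy_docs = {"a", "b", "c"}
print(evaluate(parse_query("whale AND NOT sea"), toy_index, toy_docs))  # {'c'}
```

Because the result is a plain set, this sketch retrieves matching documents without ordering them.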

### Step 4: Displaying Results
- **Output the Results**: Display the documents that match the query criteria. Handle queries that match no documents explicitly, for example by printing a clear message instead of an empty list.
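
A small sketch of this step, assuming the search returns a set of file names. The function name `format_results` and the message wording are illustrative choices only.

```python
def format_results(query, matches):
    """Build a printable report for `matches`, a set of document names."""
    if not matches:
        return f"No documents match the query '{query}'."
    lines = [f"{len(matches)} document(s) match the query '{query}':"]
    # Sort the names so the output is stable between runs
    lines.extend(f"  - {name}" for name in sorted(matches))
    return "\n".join(lines)

print(format_results("whale AND sea", {"Moby Dick.txt"}))
print(format_results("unicorn", set()))
```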

## Evaluation Criteria
- **Correctness**: The Boolean search implementation should correctly interpret and process the queries according to the Boolean logic.
- **Efficiency**: Consider the efficiency of your search process, especially as the complexity of queries increases.
- **User Experience**: Ensure that the interface for inputting queries and viewing results is user-friendly.
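
On the efficiency criterion, one common idea for queries that AND several terms together is to intersect the smallest posting sets first, so intermediate results stay small. The sketch below illustrates it with a hypothetical helper `intersect_all` and is not a required part of the solution.

```python
def intersect_all(posting_sets):
    """Intersect several sets of document IDs, starting with the smallest."""
    ordered = sorted(posting_sets, key=len)
    if not ordered:
        return set()
    result = set(ordered[0])
    for postings in ordered[1:]:
        result &= postings
        if not result:
            # Nothing can match any more, so stop early
            break
    return result

common = intersect_all([{"a", "b", "c", "d"}, {"b", "c"}, {"c", "d", "e"}])
print(common)  # {'c'}
```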

This exercise will deepen your understanding of how search engines process and respond to user queries.

In [50]:
import os

# Path to the folder containing the .txt files
folder_path = '../data/'

# Initialize variables
book_names = []
unique_words = set()

# Process each .txt file
for file_name in os.listdir(folder_path):
    if file_name.endswith('.txt'):
        file_path = os.path.join(folder_path, file_name)
        # Explicit encoding so the result does not depend on the platform default
        with open(file_path, 'r', encoding='utf-8') as file:
            words_list = set(file.read().split())
        book_names.append(file_name)
        unique_words.update(words_list)

# Sort unique words for consistency
unique_words_list = sorted(unique_words)

In [51]:
len(unique_words_list)

618452

In [52]:
import pandas as pd
import os

# List to store book names and word data
books = []
word_data = []

# Path to the folder containing .txt files
folder = '../data/'

# Iterate through the .txt files in the folder
for file_name in os.listdir(folder):
    # Skip non-text files, matching the filter used when building the vocabulary
    if not file_name.endswith('.txt'):
        continue
    file_path = os.path.join(folder, file_name)
    with open(file_path, 'r', encoding='utf-8') as file:
        # Read the content of the file
        content = file.read()
    books.append(file_name)

    # Create a set of words for the current book
    words_book = set(content.split())

    # Create a list of boolean values indicating the presence of each word in the current book
    word_presence = [word in words_book for word in unique_words_list]
    word_data.append(word_presence)

# Create a DataFrame from the word data
df = pd.DataFrame(word_data, columns=unique_words_list)


# Add the book name as the first column of the DataFrame
df.insert(0, 'Bookss', books)

# Set the "Bookss" column as the index of the DataFrame
df.set_index('Bookss', inplace=True)

# Transpose the DataFrame so that words are rows and books are columns
df = df.transpose()

# Save the DataFrame to a CSV file
# df.to_csv('words.csv')  # Keep the index: after the transpose it holds the words


In [53]:
df

Bookss,A Room with a View.txt,Chronicles of London Bridge.txt,Winnie-the-Pooh.txt,The Enchanted April.txt,Moby Dick.txt,A Doll's House.txt,A Christmas Carol in Prose Being a Ghost Story of Christmas.txt,Ulysses.txt,The Brothers Karamazov.txt,Jane Eyre- An Autobiography.txt,...,A Smaller History of Rome.txt,The Adventures of Ferdinand Count Fathom — Complete.txt,The Count of Monte Cristo.txt,John Dewey's logical theory.txt,Christopher Columbus and How He Received and Imparted the Spirit of Discovery.txt,The Hound of the Baskervilles.txt,Romeo and Juliet.txt,The Blue Castle- a novel.txt,The Metamorphoses of Ovid.txt,The History of Woman Suffrage.txt
!,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
!Mal,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
"""",False,False,True,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,True
"""!!!!!!""",False,False,True,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
"""!"";",False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
☜,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
☞,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
☞SENT,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
✠.,False,False,False,False,False,False,False,True,False,False,...,False,False,False,False,False,False,False,False,False,False


In [56]:
word = input('Enter the word you want to search for: ')

In [57]:
# Look up the word in each column (book) of the table
libros_con_palabra = []
if word in df.index:
    for libro in df.columns:
        if df.loc[word, libro]:
            libros_con_palabra.append(libro)

# Show the books in which the word appears
if libros_con_palabra:
    print(f"The word '{word}' appears in the following books:")
    for libro in libros_con_palabra:
        print(libro)
else:
    print(f"The word '{word}' does not appear in any book.")

The word 'coop' appears in the following books:
The Iliad.txt
Little Women; Or, Meg, Jo, Beth, and Amy.txt
Little Women.txt
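
The same word-by-book table can also answer Boolean queries, since each row is a boolean vector over the books. The sketch below uses a tiny made-up table with the same layout (words as rows, books as columns). The helpers `row` and `matching_books` and the toy values are assumptions for illustration, not results from the real data.

```python
import pandas as pd

# Toy stand-in for the word-by-book table: rows are words, columns are books
toy_df = pd.DataFrame(
    {
        "Moby Dick.txt": [True, True, False],
        "Little Women.txt": [False, True, True],
        "Ulysses.txt": [True, False, True],
    },
    index=["whale", "coop", "sea"],
)

def row(frame, word):
    """Return the boolean row for `word`, or an all-False row if it is absent."""
    if word in frame.index:
        return frame.loc[word].astype(bool)
    return pd.Series(False, index=frame.columns)

def matching_books(mask):
    """List the books whose entry in the boolean mask is True."""
    return mask[mask].index.tolist()

# AND, OR, and NOT map to the element-wise operators &, |, and ~
both = matching_books(row(toy_df, "whale") & row(toy_df, "coop"))
either = matching_books(row(toy_df, "whale") | row(toy_df, "sea"))
without = matching_books(row(toy_df, "coop") & ~row(toy_df, "sea"))

print(both)     # ['Moby Dick.txt']
print(without)  # ['Moby Dick.txt']
```

With the real table, `row(df, word)` would replace the per-column loop, at the cost of keeping the full dense matrix in memory.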
