# Basic Boolean Search in Documents

## Objective
Expand the simple term search functionality to include Boolean search capabilities. This will allow users to perform more complex queries by combining multiple search terms using Boolean operators.

## Problem Description
You must enhance the existing search engine from the previous exercise to support Boolean operators: AND, OR, and NOT. This will enable the retrieval of documents based on the logical relationships between multiple terms.
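
Each of the three operators corresponds to a set operation over the documents that contain a term. The following is a minimal sketch of that correspondence; the document names and posting sets are hypothetical and do not come from the exercise data.

```python
# Hypothetical posting sets: the documents in which each term appears.
all_docs = {"doc1.txt", "doc2.txt", "doc3.txt"}
cat_docs = {"doc1.txt", "doc2.txt"}
dog_docs = {"doc2.txt", "doc3.txt"}

both = cat_docs & dog_docs          # cat AND dog: intersection
either = cat_docs | dog_docs        # cat OR dog: union
without_cat = all_docs - cat_docs   # NOT cat: complement against the whole collection

print(sorted(both))         # ['doc2.txt']
print(sorted(either))       # ['doc1.txt', 'doc2.txt', 'doc3.txt']
print(sorted(without_cat))  # ['doc3.txt']
```

Note that NOT needs the full set of document IDs, since the complement is only defined relative to the collection.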

## Requirements

### Step 1: Update Data Preparation
Ensure that the documents are still loaded and preprocessed from the previous task. The data should be clean and ready for advanced querying.

In [2]:
import os
import pandas as pd
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import string
from nltk.stem import SnowballStemmer

# Download the stopwords corpus and the punkt tokenizer if they are not already available
import nltk
nltk.download('stopwords')
nltk.download('punkt')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\erick\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\erick\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt.zip.


True

### Step 2: Create an Inverted Index

Create an inverted index from the documents. This index maps each word to the set of document IDs in which that word appears. This facilitates word lookup in the search process.

In [3]:
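def create_inverted_index(directory):
    index = {}
    stemmer = SnowballStemmer('english')
    # Build the stop-word set once; calling stopwords.words() for every token is very slow
    stop_words = set(stopwords.words('english'))

    for root, dirs, files in os.walk(directory):
        for file in files[:100]:  # Take only the first 100 files in each directory
            file_path = os.path.join(root, file)
            # 'utf-8-sig' drops a leading byte-order mark so it does not end up in the first token
            with open(file_path, 'r', encoding='utf-8-sig') as f:
                content = f.read().lower()  # Convert to lowercase
            tokens = word_tokenize(content)
            # Remove stop words and punctuation, then stem what remains
            tokens = [stemmer.stem(token) for token in tokens
                      if token not in stop_words and token not in string.punctuation]
            for word in tokens:
                # The path is appended once per occurrence; search_query de-duplicates later
                index.setdefault(word, []).append(file_path)
    return index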

In [4]:
# Directory where the documents are located
directory = "../../data"

inverted_index = create_inverted_index(directory)
print("Inverted index created.")

Inverted index created.


In [5]:
# Convert the inverted index into a dataframe
df_index = pd.DataFrame([(word, files) for word, files in inverted_index.items()], columns=['Word', 'Files'])

# Save the inverted index as a CSV file
df_index.to_csv('inverted_index.csv', index=False)

print("Inverted index saved as 'inverted_index.csv'.")

Inverted index saved as 'inverted_index.csv'.


### Step 3: Query Processing
- **Parse the Query**: Implement a function to parse the input query to identify the terms and operators.
- **Search Documents**: Based on the parsed query, implement the logic to retrieve and rank the documents according to the Boolean expressions.
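
The two steps above can be sketched as a small tokenizer plus a recursive-descent evaluator. This is only one possible approach, under several assumptions: operators are written in upper case, precedence is NOT over AND over OR, parentheses are not supported, and the index is a plain in-memory dictionary of sets. The names `parse_query`, `evaluate`, `index`, and `all_docs` are hypothetical. Terms are only lowercased here; in the notebook they would also need the same stemming that was applied when building the index.

```python
OPERATORS = {"AND", "OR", "NOT"}

def parse_query(query):
    """Split a query on whitespace; operators stay upper case, terms are lowercased."""
    return [tok if tok in OPERATORS else tok.lower() for tok in query.split()]

def evaluate(tokens, index, all_docs):
    """Evaluate tokens with precedence NOT > AND > OR (no parentheses)."""
    pos = 0

    def parse_or():
        nonlocal pos
        result = parse_and()
        while pos < len(tokens) and tokens[pos] == "OR":
            pos += 1
            result = result | parse_and()
        return result

    def parse_and():
        nonlocal pos
        result = parse_not()
        while pos < len(tokens) and tokens[pos] == "AND":
            pos += 1
            result = result & parse_not()
        return result

    def parse_not():
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("query ends where a term was expected")
        token = tokens[pos]
        pos += 1
        if token == "NOT":
            # Complement relative to the whole collection
            return all_docs - parse_not()
        if token in OPERATORS:
            raise ValueError(f"unexpected operator {token!r}")
        # Unknown terms match no documents
        return set(index.get(token, set()))

    result = parse_or()
    if pos != len(tokens):
        raise ValueError(f"unexpected token {tokens[pos]!r}")
    return result

# Hypothetical toy index for illustration only
index = {"cat": {"d1", "d2"}, "dog": {"d2", "d3"}, "fish": {"d3"}}
all_docs = {"d1", "d2", "d3"}

print(sorted(evaluate(parse_query("cat AND NOT dog"), index, all_docs)))      # ['d1']
print(sorted(evaluate(parse_query("cat AND dog OR fish"), index, all_docs)))  # ['d2', 'd3']
```

Malformed queries, such as two terms with no operator between them, raise a `ValueError` in this sketch rather than being silently reinterpreted.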

In [14]:
# Load the generated CSV file
df_index = pd.read_csv('inverted_index.csv')

# Show the loaded DataFrame
print(df_index.head())

        Word                                              Files
0        the  ['../../data\\pg100.txt', '../../data\\pg100.t...
1    project  ['../../data\\pg100.txt', '../../data\\pg100.t...
2  gutenberg  ['../../data\\pg100.txt', '../../data\\pg100.t...
3      ebook  ['../../data\\pg100.txt', '../../data\\pg100.t...
4    complet  ['../../data\\pg100.txt', '../../data\\pg100.t...


In [43]:
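import ast

def search_query(query, df_index):
    # Normalize the term the same way the index was built (lowercase + stem)
    term = SnowballStemmer('english').stem(query.lower())
    rows = df_index[df_index['Word'] == term]['Files'].tolist()
    # The CSV stores each file list as a string; literal_eval parses it without executing code
    return set(ast.literal_eval(rows[0])) if rows else set()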

### Step 4: Displaying Results
- **Output the Results**: Display the documents that match the query criteria. Include functionalities to handle queries that result in no matching documents.

In [None]:
# Example query
query = "bug"

# Run the search
result = search_query(query, df_index)

# Format and display the results
if result:
    print("Documents found:")
    formatted_results = [os.path.basename(file_path) for file_path in result]
    for file_name in formatted_results:
        print(file_name)
else:
    print("No documents matched the query.")

## Evaluation Criteria
- **Correctness**: The Boolean search implementation should correctly interpret and process the queries according to the Boolean logic.
- **Efficiency**: Consider the efficiency of your search process, especially as the complexity of queries increases.
- **User Experience**: Ensure that the interface for inputting queries and viewing results is user-friendly.
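
For the efficiency criterion, one commonly used idea is to intersect posting sets in order of increasing size when a query chains several AND terms, so that intermediate results stay small. The sketch below illustrates this; the helper name `intersect_all` and the sample data are hypothetical.

```python
def intersect_all(posting_sets):
    """Intersect several posting sets, starting with the smallest one."""
    if not posting_sets:
        return set()
    ordered = sorted(posting_sets, key=len)
    result = set(ordered[0])
    for postings in ordered[1:]:
        result &= postings
        if not result:  # nothing can match any more, so stop early
            break
    return result

# Hypothetical posting sets for a three-term AND query
postings = [{"d1", "d2", "d3", "d4"}, {"d2"}, {"d2", "d3"}]
print(intersect_all(postings))  # {'d2'}
```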

This exercise will deepen your understanding of how search engines process and respond to user queries.