# Basic Boolean Search in Documents

## Objective
Extend the simple term search from the previous exercise with Boolean search, so that users can build more complex queries by combining multiple search terms with Boolean operators.

## Problem Description
You must enhance the existing search engine from the previous exercise to support Boolean operators: AND, OR, and NOT. This will enable the retrieval of documents based on the logical relationships between multiple terms.
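
To make the three operators concrete, here is a minimal sketch of how they can map onto Python set operations. The terms, document IDs, and variable names are made up for illustration and are not part of the exercise data.

```python
# Toy posting sets: each term maps to the IDs of the documents containing it.
# These terms and IDs are invented for illustration only.
all_docs = {0, 1, 2, 3}
postings = {
    "police": {0, 2, 3},
    "london": {1, 2},
}

both = postings["police"] & postings["london"]    # police AND london
either = postings["police"] | postings["london"]  # police OR london
not_police = all_docs - postings["police"]        # NOT police

print(sorted(both))        # [2]
print(sorted(either))      # [0, 1, 2, 3]
print(sorted(not_police))  # [1]
```

Note that NOT needs the full set of document IDs, since it returns every document that does not contain the term.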

## Requirements

### Step 1: Update Data Preparation
Load and preprocess the documents as in the previous task. The data should be clean and ready for Boolean querying.

In [1]:
import os
import re

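def doc_reader(carpeta):
    """Read every .txt file in `carpeta`; return (texts, filenames)."""
    documentos = []
    nom_documentos = []
    # os.listdir returns entries in arbitrary order, so sort the listing
    # to keep document IDs stable across runs and platforms.
    for filename in sorted(os.listdir(carpeta)):
        if filename.endswith(".txt"):
            with open(os.path.join(carpeta, filename), 'r', encoding='utf-8') as file:
                nom_documentos.append(filename)
                documentos.append(file.read())
    return documentos, nom_documentos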

### Step 2: Create an Inverted Index

Create an inverted index from the documents. The index maps each word to the list of IDs of the documents in which that word appears, with no duplicate IDs. Looking up a term then takes a single dictionary access instead of a scan over every document.

In [2]:
def indice_invertido(documentos):
    indice = {}
    for doc_id, doc_texto in enumerate(documentos):
        # Tokenize on word characters and apostrophes, ignoring case.
        palabras = re.findall(r"\b[\w']+\b", doc_texto.lower())
        for palabra in palabras:
            if palabra not in indice:
                indice[palabra] = []
            if doc_id not in indice[palabra]:  # Avoid duplicates
                indice[palabra].append(doc_id)
    return indice
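
Because Boolean operators combine posting lists, it can be convenient to store the postings as sets rather than lists. The following is a sketch of such a variant, not the function used in the cells of this notebook. The name `build_set_index` and the sample documents are hypothetical.

```python
import re
from collections import defaultdict

def build_set_index(documents):
    """Map each lowercase word to the set of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(documents):
        for word in re.findall(r"\b[\w']+\b", text.lower()):
            index[word].add(doc_id)  # a set ignores repeated insertions
    return dict(index)

# Made-up sample documents, for illustration only.
sample_docs = ["The police arrived.", "A quiet night in London.", "London police"]
sample_index = build_set_index(sample_docs)

print(sorted(sample_index["police"]))  # [0, 2]
print(sorted(sample_index["london"]))  # [1, 2]
```

With sets, the duplicate check becomes unnecessary and membership tests take constant time on average, at the cost of losing the insertion order of the IDs.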

### Step 3: Query Processing
- **Parse the Query**: Implement a function to parse the input query to identify the terms and operators.
- **Search Documents**: Based on the parsed query, implement the logic to retrieve and rank the documents according to the Boolean expressions.
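
One possible way to approach the two bullets above is a small recursive-descent parser that evaluates the query while parsing it. This is a sketch under stated assumptions, not a reference solution: operators must be written in uppercase (`AND`, `OR`, `NOT`), precedence is NOT over AND over OR, parentheses are supported, and the function names (`tokenize`, `evaluate`) and the toy index are hypothetical.

```python
import re

def tokenize(query):
    """Split a query into parentheses, operators, and terms."""
    return re.findall(r"\(|\)|[\w']+", query)

def evaluate(query, index, all_docs):
    """Evaluate a Boolean query; `index` maps term -> iterable of doc IDs."""
    tokens = tokenize(query)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        token = peek()
        pos += 1
        return token

    def parse_or():  # lowest precedence
        result = parse_and()
        while peek() == "OR":
            advance()
            result = result | parse_and()
        return result

    def parse_and():
        result = parse_not()
        while peek() == "AND":
            advance()
            result = result & parse_not()
        return result

    def parse_not():  # highest precedence
        if peek() == "NOT":
            advance()
            return all_docs - parse_not()
        return parse_atom()

    def parse_atom():
        token = advance()
        if token is None or token in {"AND", "OR", "NOT", ")"}:
            raise ValueError(f"Unexpected token in query: {token!r}")
        if token == "(":
            result = parse_or()
            if advance() != ")":
                raise ValueError("Missing closing parenthesis")
            return result
        # A term that is not in the index matches no documents.
        return set(index.get(token.lower(), ()))

    result = parse_or()
    if peek() is not None:
        raise ValueError(f"Unexpected token in query: {peek()!r}")
    return result

# Toy index with invented terms and IDs, for illustration only.
toy_index = {"police": [0, 2, 3], "london": [1, 2], "river": [3]}
toy_docs = {0, 1, 2, 3}

print(sorted(evaluate("police AND london", toy_index, toy_docs)))      # [2]
print(sorted(evaluate("police AND NOT london", toy_index, toy_docs)))  # [0, 3]
print(sorted(evaluate("(london OR river) AND NOT police", toy_index, toy_docs)))  # [1]
```

Since each posting list is converted with `set(...)`, the sketch accepts an index whose values are lists, such as the one returned by `indice_invertido`. In that case `all_docs` would be `set(range(len(documentos)))`.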

In [3]:
def word_search(palabra, indice_invertido, nombres_documentos):
    palabra = palabra.lower()  # Lowercase the search term to match the index
    if palabra in indice_invertido:
        documentos_coincidentes = indice_invertido[palabra]
        documentos_encontrados = []
        print(f"The word '{palabra}' appears in the following documents:")
        for doc_id in documentos_coincidentes:
            documentos_encontrados.append(nombres_documentos[doc_id])
            print(f" - {nombres_documentos[doc_id]}")
        print(f"Number of documents found: {len(documentos_encontrados)}")
    else:
        print(f"The word '{palabra}' does not appear in any document.")

### Step 4: Displaying Results
- **Output the Results**: Display the documents that match the query criteria, and handle queries that match no documents.

In [4]:
doc_folder = "data folder"
documentos, nombres_documentos = doc_reader(doc_folder)
ind_invertido = indice_invertido(documentos)


palabra_a_buscar = input("Enter the word you want to search for: ")
word_search(palabra_a_buscar, ind_invertido, nombres_documentos)

Enter the word you want to search for:  police


The word 'police' appears in the following documents:
 - pg10676.txt
 - pg10681.txt
 - pg1184.txt
 - pg1259.txt
 - pg1400.txt
 - pg145.txt
 - pg1661.txt
 - pg1727.txt
 - pg174.txt
 - pg19694.txt
 - pg19926.txt
 - pg2160.txt
 - pg244.txt
 - pg2554.txt
 - pg2581.txt
 - pg2600.txt
 - pg2641.txt
 - pg2701.txt
 - pg27827.txt
 - pg28054.txt
 - pg2814.txt
 - pg2852.txt
 - pg28556.txt
 - pg29870.txt
 - pg30254.txt
 - pg345.txt
 - pg37106.txt
 - pg3825.txt
 - pg39407.txt
 - pg40438.txt
 - pg408.txt
 - pg43.txt
 - pg4300.txt
 - pg4363.txt
 - pg514.txt
 - pg5197.txt
 - pg52862.txt
 - pg600.txt
 - pg64317.txt
 - pg6761.txt
 - pg67979.txt
 - pg73442.txt
 - pg76.txt
 - pg844.txt
 - pg98.txt
Number of documents found: 45
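
For Boolean queries, the display step could take a set of matching document IDs instead of a single word. The sketch below builds the output lines and covers the case where nothing matches. The function name `format_results`, the message wording, and the sample file list are assumptions made for illustration.

```python
def format_results(doc_ids, doc_names):
    """Return display lines for a collection of matching document IDs."""
    if not doc_ids:
        return ["No documents match the query."]
    # Sort the IDs so the listing order is stable from run to run.
    lines = [f" - {doc_names[doc_id]}" for doc_id in sorted(doc_ids)]
    count = len(lines)
    lines.append(f"Number of documents found: {count}")
    return lines

# Sample file names, for illustration only.
names = ["pg43.txt", "pg76.txt", "pg98.txt"]

print("\n".join(format_results({2, 0}, names)))
print("\n".join(format_results(set(), names)))
```

Returning the lines rather than printing them inside the function keeps the formatting easy to test and leaves the caller free to decide where the output goes.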


## Evaluation Criteria
- **Correctness**: The Boolean search implementation should correctly interpret and process the queries according to the Boolean logic.
- **Efficiency**: Consider the efficiency of your search process, especially as the complexity of queries increases.
- **User Experience**: Ensure that the interface for inputting queries and viewing results is user-friendly.
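
On the efficiency criterion: the posting lists built by `indice_invertido` are already in increasing document ID order, because `enumerate` visits the documents in order. Two such lists can therefore be intersected in a single linear pass instead of repeated membership checks. The sketch below shows that merge, plus the common heuristic of processing the shortest lists first for multi-term AND queries. The function names and sample lists are hypothetical.

```python
def intersect_sorted(a, b):
    """Intersect two sorted lists of doc IDs in O(len(a) + len(b)) time."""
    i = j = 0
    result = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1  # a[i] cannot appear in b at or after position j
        else:
            j += 1  # b[j] cannot appear in a at or after position i
    return result

def intersect_many(posting_lists):
    """AND several sorted posting lists, starting with the shortest."""
    if not posting_lists:
        return []
    ordered = sorted(posting_lists, key=len)
    result = list(ordered[0])
    for postings in ordered[1:]:
        if not result:
            break  # an empty intermediate result stays empty
        result = intersect_sorted(result, postings)
    return result

# Invented posting lists, for illustration only.
print(intersect_sorted([0, 2, 3, 5, 8], [1, 2, 5, 9]))          # [2, 5]
print(intersect_many([[0, 2, 3, 5, 8], [2, 5], [1, 2, 5, 9]]))  # [2, 5]
```

Starting with the shortest list keeps the intermediate result small, so later intersections have less work to do.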

This exercise will deepen your understanding of how search engines process and respond to user queries.