# Boolean Search in Documents
## Objective

Expand the simple term search functionality to include Boolean search capabilities. This will allow users to perform more complex queries by combining multiple search terms using Boolean operators.

## Problem Description

You must enhance the existing search engine from the previous exercise to support Boolean operators: AND, OR, and NOT. This will enable the retrieval of documents based on the logical relationships between multiple terms.

In [1]:
import os
import re
import inflect

## Requirements
### Step 1: Update Data Preparation

Ensure that the documents are still loaded and preprocessed from the previous task. The data should be clean and ready for advanced querying.

### Step 2: Create an Inverted Index

Create an inverted index from the documents. The index maps each word to the set of IDs of the documents in which that word appears, so looking up a term does not require rescanning the documents.

In [2]:
def crear_matriz(carpeta):
    fila_palabras = []
    palabras_vistas = set()  # same words as fila_palabras, for fast membership tests
    matriz = []
    columna_documentos = []

    # Create the inflect engine and the punctuation table once, not per file
    inflect_engine = inflect.engine()
    eliminar = str.maketrans('', '', '.!?\'“”,:;_-"—[]{}‘’()')

    for archivo in os.listdir(carpeta):
        if archivo.endswith(".txt"):
            columna_documentos.append(archivo)
            print(archivo)

            fila_palabras_temporal = set()
            ruta_archivo = os.path.join(carpeta, archivo)

            with open(ruta_archivo, "r", encoding="utf-8") as file:
                contenido = file.read().lower()

                # Split the content into words and strip unwanted characters
                palabras = [palabra.translate(eliminar) for palabra in contenido.split()]

                # Reduce each non-empty word to its singular form when inflect finds one
                for palabra in palabras:
                    if palabra:
                        palabra_singular = inflect_engine.singular_noun(palabra) or palabra
                        fila_palabras_temporal.add(palabra_singular)

            # Add words not seen in earlier documents to the global vocabulary
            for palabra in fila_palabras_temporal:
                if palabra not in palabras_vistas:
                    palabras_vistas.add(palabra)
                    fila_palabras.append(palabra)

            # Build the matrix row for the current document
            fila_actual = [1 if palabra in fila_palabras_temporal else 0 for palabra in fila_palabras]
            matriz.append(fila_actual)

    # Pad earlier rows with zeros so every row has the same length
    for fila in matriz:
        while len(fila) < len(fila_palabras):
            fila.append(0)

    return columna_documentos, fila_palabras, matriz
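
The `crear_matriz` cell above stores the collection as a binary term-document matrix. For comparison, a minimal sketch of the inverted index that Step 2 describes might look like the following. The names `build_inverted_index` and `sample_docs` are illustrative assumptions, not part of the exercise code, and the tokenizer is a plain regular expression rather than the `inflect`-based normalization used above.

```python
import re
from collections import defaultdict


def build_inverted_index(documents):
    """Map each lowercase word to the set of document IDs that contain it.

    `documents` is a hypothetical dict of {doc_id: text}.
    """
    index = defaultdict(set)
    for doc_id, text in documents.items():
        # \w+ keeps letters, digits and underscores; punctuation is dropped.
        for word in re.findall(r"\w+", text.lower()):
            index[word].add(doc_id)
    return dict(index)


# Hypothetical in-memory documents standing in for the .txt files.
sample_docs = {
    "a.txt": "Python is a programming language.",
    "b.txt": "The python is a snake.",
    "c.txt": "Java is a programming language too.",
}
inverted_index = build_inverted_index(sample_docs)
print(sorted(inverted_index["python"]))  # ['a.txt', 'b.txt']
```

With this layout, the documents for a term come from one dictionary access instead of a scan over a matrix column.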


### Step 3: Implementing Boolean Search

- Enhance Input Query: Modify the function to accept complex queries that can include the Boolean operators AND, OR, and NOT.
- Implement Boolean Logic:
  - AND: The document must contain all the terms. For example, `python AND programming` should return documents containing both "python" and "programming".
  - OR: The document can contain any of the terms. For example, `python OR programming` should return documents containing either "python", "programming", or both.
  - NOT: The document must not contain the term following NOT. For example, `python NOT snake` should return documents that contain "python" but not "snake".
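
The three operators described above correspond to set operations on the document sets of each term. The following is a small sketch under the assumption that the index is a dict from term to a set of document IDs; the names `index`, `all_docs`, and `postings` are hypothetical and the data is made up for illustration.

```python
# Hypothetical toy index: term -> set of document IDs.
index = {
    "python": {"a.txt", "b.txt"},
    "programming": {"a.txt", "c.txt"},
    "snake": {"b.txt"},
}
all_docs = {"a.txt", "b.txt", "c.txt"}


def postings(term):
    """Return the documents containing `term` (empty set if unknown)."""
    return index.get(term.lower(), set())


both = postings("python") & postings("programming")    # python AND programming
either = postings("python") | postings("programming")  # python OR programming
without = postings("python") - postings("snake")       # python NOT snake
not_python = all_docs - postings("python")             # NOT python on its own

print(sorted(both))        # ['a.txt']
print(sorted(either))      # ['a.txt', 'b.txt', 'c.txt']
print(sorted(without))     # ['a.txt']
print(sorted(not_python))  # ['c.txt']
```

A standalone NOT needs the full set of document IDs, which is why `all_docs` is kept alongside the index.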


In [3]:
def buscar_palabra(ejey_documentos, ejex_palabras, matriz, palabra_a_buscar, carpeta):
    # `carpeta` is not used by the lookup; it is kept so existing calls still work
    archivos_con_palabra = []

    # The vocabulary is stored in lowercase, so normalize the query the same way
    palabra_a_buscar = palabra_a_buscar.lower()

    if palabra_a_buscar in ejex_palabras:
        # Column of the word in the binary matrix
        indice_palabra = ejex_palabras.index(palabra_a_buscar)

        # Collect every document whose row has a 1 in that column
        for i, documento in enumerate(ejey_documentos):
            if matriz[i][indice_palabra] == 1:
                archivos_con_palabra.append(documento)

    return archivos_con_palabra

In [4]:
# Folder that contains the txt files
carpeta_data = "data"

In [5]:
# Build the document list, the vocabulary, and the binary matrix
columnas_doc, fila_palabras, matriz_binaria = crear_matriz(carpeta_data)

pg100.txt
pg10676.txt
pg10681.txt
pg1080.txt
pg11.txt
pg1184.txt
pg120.txt
pg1232.txt
pg1259.txt
pg1260.txt
pg1342.txt
pg1400.txt
pg145.txt
pg1513.txt
pg16.txt
pg16389.txt
pg16594.txt
pg1661.txt
pg1727.txt
pg174.txt
pg1952.txt
pg19694.txt
pg19926.txt
pg1998.txt
pg2000.txt
pg20228.txt
pg205.txt
pg2160.txt
pg21700.txt
pg21765.txt
pg219.txt
pg23.txt
pg24238.txt
pg244.txt
pg25344.txt
pg2542.txt
pg2554.txt
pg2581.txt
pg2591.txt
pg2600.txt
pg2641.txt
pg2701.txt
pg27827.txt
pg28054.txt
pg2814.txt
pg2852.txt
pg28556.txt
pg29870.txt
pg30254.txt
pg3207.txt
pg33283.txt
pg345.txt
pg35899.txt
pg37106.txt
pg38141.txt
pg3825.txt
pg394.txt
pg39407.txt
pg40438.txt
pg408.txt
pg4085.txt
pg41445.txt
pg42059.txt
pg43.txt
pg4300.txt
pg4363.txt
pg45.txt
pg46.txt
pg47475.txt
pg47629.txt
pg5131.txt
pg514.txt
pg5197.txt
pg5200.txt
pg52862.txt
pg54023.txt
pg55.txt
pg5740.txt
pg600.txt
pg6130.txt
pg62091.txt
pg62354.txt
pg64317.txt
pg6593.txt
pg67098.txt
pg6761.txt
pg67979.txt
pg73441.txt
pg73442.txt
pg73444.txt


### Step 4: Query Processing

- Parse the Query: Implement a function to parse the input query to identify the terms and operators.
- Search Documents: Based on the parsed query, implement the logic to retrieve and rank the documents according to the Boolean expressions.
- Handling Case Sensitivity and Partial Matches: Optionally, make the search case-insensitive and support partial matches to refine the search results.
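
One possible way to parse and evaluate a flat query (no parentheses) is sketched below. It assumes operators are written in upper case, that terms and operators strictly alternate, and that the query is evaluated left to right with NOT meaning "and not", as in `python NOT snake`. The functions `parse_query` and `evaluate` and the toy `index` are hypothetical names introduced for illustration.

```python
OPERATORS = {"AND", "OR", "NOT"}


def parse_query(query):
    """Split a flat query into alternating terms and operators.

    Returns a list such as ['python', 'NOT', 'snake']. Operators must be
    upper case; every other token is lowercased and treated as a term.
    """
    tokens = [t if t in OPERATORS else t.lower() for t in query.split()]
    if not tokens:
        raise ValueError("empty query")
    for position, token in enumerate(tokens):
        expects_operator = position % 2 == 1
        if (token in OPERATORS) != expects_operator:
            raise ValueError(f"unexpected token {token!r} at position {position}")
    if tokens[-1] in OPERATORS:
        raise ValueError("query cannot end with an operator")
    return tokens


def evaluate(tokens, index):
    """Evaluate parsed tokens left to right against a term -> doc-set index."""
    result = set(index.get(tokens[0], set()))
    for operator, term in zip(tokens[1::2], tokens[2::2]):
        docs = index.get(term, set())
        if operator == "AND":
            result &= docs
        elif operator == "OR":
            result |= docs
        else:  # NOT: remove documents that contain the term
            result -= docs
    return result


# Hypothetical toy index for demonstration.
index = {
    "python": {"a.txt", "b.txt"},
    "programming": {"a.txt", "c.txt"},
    "snake": {"b.txt"},
    "java": {"c.txt"},
}
print(sorted(evaluate(parse_query("python NOT snake"), index)))  # ['a.txt']
```

Left-to-right evaluation is the simplest design choice; it ignores operator precedence, so `java OR python NOT snake` is read as `(java OR python) NOT snake`.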


In [6]:
# Ask the user for the word to search for
palabra_a_buscar = input("Enter the word you want to search for: ")

Enter the word you want to search for: Rose


In [7]:
# Run the search
archivos_encontrados = buscar_palabra(columnas_doc, fila_palabras, matriz_binaria, palabra_a_buscar, carpeta_data)

### Step 5: Displaying Results

- Output the Results: Display the documents that match the query criteria. Handle queries that match no documents, for example by printing an explicit message.
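
A small sketch of result formatting that covers the no-match case could look like this. The helper `format_results` is a hypothetical name, and it assumes the search returns a set (or list) of file names.

```python
def format_results(query, matches):
    """Return a printable report for a query and its matching documents."""
    if not matches:
        return f"No documents match the query: {query}"
    lines = [f"{len(matches)} document(s) match the query: {query}"]
    # Sort so the output is stable regardless of set ordering.
    lines.extend(sorted(matches))
    return "\n".join(lines)


print(format_results("python NOT snake", {"a.txt"}))
print(format_results("cobol", set()))
```

Returning a string instead of printing inside the helper keeps the formatting easy to test and reuse.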


In [8]:
# Check that the search returned at least one file
if archivos_encontrados:
    print("")
    print("That word appears in the following files:")
    for archivo in archivos_encontrados:
        print(archivo)
else:
    print("The word was not found in any file.")


That word appears in the following files:
pg100.txt
pg10676.txt
pg10681.txt
pg11.txt
pg1184.txt
pg120.txt
pg1232.txt
pg1259.txt
pg1260.txt
pg1342.txt
pg1400.txt
pg145.txt
pg1513.txt
pg16.txt
pg16389.txt
pg16594.txt
pg1661.txt
pg1727.txt
pg174.txt
pg1952.txt
pg19694.txt
pg19926.txt
pg1998.txt
pg205.txt
pg2160.txt
pg21700.txt
pg21765.txt
pg219.txt
pg23.txt
pg24238.txt
pg244.txt
pg25344.txt
pg2542.txt
pg2554.txt
pg2581.txt
pg2591.txt
pg2600.txt
pg2641.txt
pg2701.txt
pg27827.txt
pg28054.txt
pg2814.txt
pg2852.txt
pg28556.txt
pg29870.txt
pg30254.txt
pg3207.txt
pg345.txt
pg37106.txt
pg38141.txt
pg3825.txt
pg394.txt
pg39407.txt
pg40438.txt
pg408.txt
pg4085.txt
pg41445.txt
pg42059.txt
pg43.txt
pg4300.txt
pg4363.txt
pg45.txt
pg46.txt
pg47475.txt
pg5131.txt
pg514.txt
pg5197.txt
pg52862.txt
pg54023.txt
pg55.txt
pg5740.txt
pg6130.txt
pg62091.txt
pg62354.txt
pg64317.txt
pg6593.txt
pg6761.txt
pg67979.txt
pg73441.txt
pg73442.txt
pg73444.txt
pg73447.txt
pg74.txt
pg76.txt
pg768.txt
pg84.txt
pg844

### Evaluation Criteria

- Correctness: The Boolean search implementation should correctly interpret and process the queries according to the Boolean logic.
- Efficiency: Consider the efficiency of your search process, especially as the complexity of queries increases.
- User Experience: Ensure that the interface for inputting queries and viewing results is user-friendly.
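
On the efficiency criterion, one common idea for multi-term AND queries is to intersect the rarest terms first so intermediate results stay small and the loop can stop early. The sketch below assumes a term -> document-set index; `intersect_all` and the toy `index` are hypothetical names.

```python
def intersect_all(index, terms):
    """Intersect the document sets of all terms, smallest set first."""
    postings = [index.get(term, set()) for term in terms]
    if not postings:
        return set()
    postings.sort(key=len)
    result = set(postings[0])
    for docs in postings[1:]:
        if not result:
            break  # nothing can match any more, so skip the remaining terms
        result &= docs
    return result


# Hypothetical toy index for demonstration.
index = {
    "the": {"a.txt", "b.txt", "c.txt", "d.txt"},
    "python": {"a.txt", "b.txt"},
    "snake": {"b.txt"},
}
print(sorted(intersect_all(index, ["the", "python", "snake"])))  # ['b.txt']
```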

### Additional Challenges (Optional)

- Nested Boolean Queries: Allow for nested queries using parentheses, such as (python OR java) AND programming.
- Phrase Searching: Implement the ability to search for exact phrases enclosed in quotes.
- Proximity Searching: Extend the search to find terms that are within a specific distance from one another.
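
For the nested-query challenge, one possible approach is a small recursive-descent parser. The sketch below is an illustration under stated assumptions, not a reference solution: operators are upper case, OR binds loosest, AND and binary NOT bind tighter, a leading NOT is treated as the complement against all documents, and the names `tokenize`, `BooleanQueryParser`, `index`, and `all_docs` are hypothetical.

```python
import re


def tokenize(query):
    # Parentheses become their own tokens; other tokens split on whitespace.
    return re.findall(r"\(|\)|[^\s()]+", query)


class BooleanQueryParser:
    def __init__(self, index, all_docs):
        self.index = index        # term -> set of document IDs
        self.all_docs = all_docs  # needed for unary NOT

    def search(self, query):
        self.tokens = tokenize(query)
        self.pos = 0
        result = self._or_expr()
        if self.pos != len(self.tokens):
            raise ValueError(f"unexpected token {self.tokens[self.pos]!r}")
        return result

    def _peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def _next(self):
        token = self._peek()
        if token is None:
            raise ValueError("unexpected end of query")
        self.pos += 1
        return token

    def _or_expr(self):
        result = self._and_expr()
        while self._peek() == "OR":
            self._next()
            result = result | self._and_expr()
        return result

    def _and_expr(self):
        result = self._factor()
        while self._peek() in ("AND", "NOT"):
            if self._next() == "AND":
                result = result & self._factor()
            else:  # binary NOT, as in "python NOT snake"
                result = result - self._factor()
        return result

    def _factor(self):
        token = self._next()
        if token == "NOT":  # unary NOT: complement against all documents
            return self.all_docs - self._factor()
        if token == "(":
            result = self._or_expr()
            if self._next() != ")":
                raise ValueError("expected ')'")
            return result
        if token in (")", "AND", "OR"):
            raise ValueError(f"unexpected token {token!r}")
        return set(self.index.get(token.lower(), set()))


# Hypothetical toy index for demonstration.
index = {
    "python": {"a.txt", "b.txt"},
    "programming": {"a.txt", "c.txt"},
    "snake": {"b.txt"},
    "java": {"c.txt"},
}
all_docs = {"a.txt", "b.txt", "c.txt"}
parser = BooleanQueryParser(index, all_docs)
print(sorted(parser.search("(python OR java) AND programming")))  # ['a.txt', 'c.txt']
```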

This exercise will deepen your understanding of how search engines process and respond to complex user queries. By incorporating Boolean search, you not only enhance the functionality of your search engine but also mimic more closely how real-world information retrieval systems operate.