# Boolean Search in Documents

## Objective
Expand the simple term search functionality to include Boolean search capabilities. This will allow users to perform more complex queries by combining multiple search terms using Boolean operators.

## Problem Description
You must enhance the existing search engine from the previous exercise to support Boolean operators: AND, OR, and NOT. This will enable the retrieval of documents based on the logical relationships between multiple terms.

## Requirements

### Step 1: Update Data Preparation
Ensure that the documents are still loaded and preprocessed from the previous task. The data should be clean and ready for advanced querying.

### Step 2: Implementing Boolean Search
- **Enhance Input Query**: Modify the function to accept complex queries that can include the Boolean operators AND, OR, and NOT.
- **Implement Boolean Logic**:
  - **AND**: The document must contain all the terms. For example, `python AND programming` should return documents containing both "python" and "programming".
  - **OR**: The document can contain any of the terms. For example, `python OR programming` should return documents containing either "python", "programming", or both.
  - **NOT**: The document must not contain the term following NOT. For example, `python NOT snake` should return documents that contain "python" but not "snake".
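
The three operators above correspond to set operations over the documents that contain each term. The following is a minimal sketch under stated assumptions: `toy_index` and `docs_with` are hypothetical names, and the document identifiers are made up for illustration; this is not the index built later in this notebook.

```python
# Hypothetical toy index: each term maps to the set of documents containing it.
toy_index = {
    "python": {"doc1", "doc2", "doc3"},
    "programming": {"doc2", "doc3", "doc4"},
    "snake": {"doc1"},
}


def docs_with(term):
    """Return the set of documents containing `term` (empty if the term is unknown)."""
    return toy_index.get(term, set())


# python AND programming: intersection of both sets
and_result = docs_with("python") & docs_with("programming")
# python OR programming: union of both sets
or_result = docs_with("python") | docs_with("programming")
# python NOT snake: documents with "python" minus those with "snake"
not_result = docs_with("python") - docs_with("snake")

print(sorted(and_result))  # ['doc2', 'doc3']
print(sorted(or_result))   # ['doc1', 'doc2', 'doc3', 'doc4']
print(sorted(not_result))  # ['doc2', 'doc3']
```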

### Step 3: Query Processing
- **Parse the Query**: Implement a function to parse the input query to identify the terms and operators.
- **Search Documents**: Based on the parsed query, implement the logic to retrieve and rank the documents according to the Boolean expressions.
- **Handling Case Sensitivity and Partial Matches**: Optionally, normalize letter case and support partial matches to refine the search results.
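
One possible way to approach the parsing step is to first split the query into labelled tokens. The sketch below assumes that operators are recognized regardless of letter case (so the words "and", "or" and "not" cannot themselves be searched) and that terms are lowercased; `tokenize` and `OPERATORS` are hypothetical names introduced only for this example.

```python
import re

# Assumption: these three words are always treated as operators.
OPERATORS = {"AND", "OR", "NOT"}


def tokenize(query):
    """Split a query into (kind, value) pairs: OP, PAREN or TERM."""
    tokens = []
    # Parentheses are tokens on their own; everything else is split on whitespace.
    for raw in re.findall(r"\(|\)|[^\s()]+", query):
        if raw in ("(", ")"):
            tokens.append(("PAREN", raw))
        elif raw.upper() in OPERATORS:
            tokens.append(("OP", raw.upper()))
        else:
            # Lowercase terms so the search is case-insensitive.
            tokens.append(("TERM", raw.lower()))
    return tokens


print(tokenize("Python AND programming"))
# [('TERM', 'python'), ('OP', 'AND'), ('TERM', 'programming')]
```

Keeping tokenization separate from evaluation makes it easier to add parentheses or quoted phrases later without touching the search logic.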

### Step 4: Displaying Results
- **Output the Results**: Display the documents that match the query criteria. Handle queries that match no documents, for example by printing an explicit message.

## Evaluation Criteria
- **Correctness**: The Boolean search implementation should correctly interpret and process the queries according to the Boolean logic.
- **Efficiency**: Consider the efficiency of your search process, especially as the complexity of queries increases.
- **User Experience**: Ensure that the interface for inputting queries and viewing results is user-friendly.
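
On the efficiency criterion, a common alternative to a dense word-by-document table is an inverted index that stores, for each word, only the documents that contain it. The sketch below is an illustration under assumptions: `build_inverted_index` and `search_all` are hypothetical helpers, and the sample documents are invented.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Map each word to the set of titles whose text contains it."""
    index = defaultdict(set)
    for title, text in documents.items():
        for word in text.lower().split():
            index[word].add(title)
    return index


def search_all(index, terms):
    """AND query: intersect the sets, smallest first, and stop once empty."""
    postings = sorted((index.get(term, set()) for term in terms), key=len)
    if not postings:
        return set()
    result = set(postings[0])
    for posting in postings[1:]:
        result &= posting
        if not result:
            break  # No document can match any more.
    return result


# Invented sample data for illustration only.
sample_docs = {
    "A": "python programming guide",
    "B": "python snake care",
    "C": "java programming",
}
sample_index = build_inverted_index(sample_docs)
print(search_all(sample_index, ["python", "programming"]))  # {'A'}
```

Starting with the smallest set keeps intermediate results small, which matters more as queries combine many terms.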

## Additional Challenges (Optional)
- **Nested Boolean Queries**: Allow for nested queries using parentheses, such as `(python OR java) AND programming`.
- **Phrase Searching**: Implement the ability to search for exact phrases enclosed in quotes.
- **Proximity Searching**: Extend the search to find terms that are within a specific distance from one another.
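
For the nested-query challenge, one option is a small recursive-descent evaluator. The sketch below is not the notebook's implementation: `evaluate`, `toy_index` and `all_docs` are hypothetical, NOT is treated as a unary operator (so `python NOT snake` would be written `python AND NOT snake`), and AND is assumed to bind more tightly than OR.

```python
import re


def evaluate(query, index, all_docs):
    """Evaluate a Boolean query with parentheses and return a set of documents.

    Grammar, from lowest to highest precedence:
        expr   := conj ("or" conj)*
        conj   := factor ("and" factor)*
        factor := "not" factor | "(" expr ")" | WORD
    """
    tokens = re.findall(r"\(|\)|[^\s()]+", query.lower())
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        token = peek()
        pos += 1
        return token

    def parse_or():
        result = parse_and()
        while peek() == "or":
            advance()
            result = result | parse_and()
        return result

    def parse_and():
        result = parse_factor()
        while peek() == "and":
            advance()
            result = result & parse_factor()
        return result

    def parse_factor():
        token = advance()
        if token == "not":
            # Complement relative to the whole collection.
            return all_docs - parse_factor()
        if token == "(":
            result = parse_or()
            if advance() != ")":
                raise ValueError("missing closing parenthesis")
            return result
        if token is None or token in ("and", "or", ")"):
            raise ValueError(f"unexpected token: {token!r}")
        return set(index.get(token, set()))

    result = parse_or()
    if peek() is not None:
        raise ValueError(f"unexpected token: {peek()!r}")
    return result


# Invented data for illustration only.
toy_index = {
    "python": {"d1", "d2"},
    "java": {"d3"},
    "programming": {"d2", "d3", "d4"},
}
all_docs = {"d1", "d2", "d3", "d4"}

print(sorted(evaluate("(python OR java) AND programming", toy_index, all_docs)))
# ['d2', 'd3']
print(sorted(evaluate("python OR java AND programming", toy_index, all_docs)))
# ['d1', 'd2', 'd3']
```

The two calls differ only in the parentheses, which shows how grouping changes the result when AND has higher precedence than OR.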

This exercise will deepen your understanding of how search engines process and respond to complex user queries. By incorporating Boolean search, you not only enhance the functionality of your search engine but also mimic more closely how real-world information retrieval systems operate.


In [13]:
import os
import re


def limpiar_texto(texto):
    # Remove every character that is not an ASCII letter or whitespace, then lowercase.
    # Removed characters are deleted rather than replaced by a space, so words
    # joined only by punctuation (such as a dash) are fused into a single token.
    texto_limpio = re.sub(r'[^a-zA-Z\s]', '', texto).lower()
    # Split on whitespace; split() also discards repeated spaces
    palabras = texto_limpio.split()
    return palabras


def leer_archivo_txt(archivo):
    # The title is the file name without its extension
    title = os.path.splitext(os.path.basename(archivo))[0]
    print(title)
    with open(archivo, 'r') as f:
        texto = f.read()  # Read the whole file into a single string

    # Clean the text and keep each distinct word only once
    palabras_unicas = set(limpiar_texto(texto))

    diccionario = {
        'title': title,
        'words': {palabra: True for palabra in palabras_unicas}  # "words" key holds the unique words
    }

    return diccionario


directorio = '../data/'
lista_diccionarios = []

for archivo in os.listdir(directorio):
    if archivo.endswith('.txt'):
        ruta_archivo = os.path.join(directorio, archivo)
        diccionario_archivo = leer_archivo_txt(ruta_archivo)
        lista_diccionarios.append(diccionario_archivo)


The Works of the Rev. Hugh Binning
Jane Eyre: An Autobiography
Middlemarch
Narrative of the Life of Frederick Douglass, an American Slave
The Count of Monte Cristo
The Complete Works of William Shakespeare
Fan Fare May 1953
Little Women; Or, Meg, Jo, Beth, and Amy
The Adventures of Tom Sawyer, Complete
The Great Gats
Memoirs of a London doll
The Modern Regime, Volume 1
A Doll's House : a play
Treasure Island
Cranford
War and Peace
Pride and Prejudice
History of Tom Jones, a Foundling
Cambridge Papers
Plato and the Other Companions of Sokrates, 3rd ed. Volume 4
Twenty years after
The History of Woman Suffrage, Volume IV (430)
The Expedition of Humphry Clinker
Don Quixote
The Adventures of Roderick Random
A Modest Proposal
The Scarlet Letter
A Room with a View
Walden, and On The Duty Of Civil Disobedience
Dracula
The Pleasures of the Table
John Dewey's logical theory
Ang "Filibusterismo" (Karugtóng ng Noli Me Tangere)
Don Juan
Wuthering Heights
The Prince
Pygmalion
The Odyssey
The star-s

In [14]:
print(lista_diccionarios[0])



In [15]:
import pandas as pd

# Map each word to the titles of the documents that contain it
data_dict = {}
for diccionario in lista_diccionarios:
    title = diccionario['title']
    for word in diccionario['words']:
        data_dict.setdefault(word, {})[title] = True

# Rows are words and columns are titles; a missing (NaN) cell means the word is absent.
# notna() turns this into a plain boolean matrix and avoids the warning that
# fillna(False) raised here.
df = pd.DataFrame(data_dict).T.notna()
df



Unnamed: 0,The Works of the Rev. Hugh Binning,Jane Eyre: An Autobiography,Middlemarch,"Narrative of the Life of Frederick Douglass, an American Slave",The Count of Monte Cristo,The Complete Works of William Shakespeare,Fan Fare May 1953,"Little Women; Or, Meg, Jo, Beth, and Amy","The Adventures of Tom Sawyer, Complete",The Great Gats,...,The Yellow Wallpaper,Thus Spake Zarathustra: A Book for All and None,Don Quijote,Leviathan,Metamorphosis,Peter Pan,The Philippines a Century Hence,Beyond Good and Evil,Childe Harold's Pilgrimage,Second Treatise of Government
minute,True,True,True,True,True,True,True,True,True,True,...,False,False,False,False,False,False,False,False,False,False
inscription,True,True,True,False,True,True,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
departments,True,False,False,False,True,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
you,True,True,True,True,True,True,True,True,True,True,...,True,True,True,True,True,True,True,True,True,True
assimilation,True,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,True,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
fogyea,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
pathhe,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
cerebellum,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
sterned,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False


In [16]:
class Nodo:
    def __init__(self, valor):
        self.valor = valor
        self.izquierda = None
        self.derecha = None


def construir_arbol(terminos):
    """Build a binary expression tree from a flat list of lowercase tokens."""
    if not terminos:
        return None

    # Split on the loosest-binding operator first so it ends up nearest the root:
    # OR binds more loosely than AND. The rightmost occurrence is used so that
    # chains such as "a and b and c" group from left to right.
    for operador in ("or", "and"):
        if operador in terminos:
            indice = len(terminos) - 1 - terminos[::-1].index(operador)
            nodo = Nodo(operador)
            nodo.izquierda = construir_arbol(terminos[:indice])
            nodo.derecha = construir_arbol(terminos[indice + 1:])
            return nodo

    # NOT is unary: its single operand is stored in the left child
    if terminos[0] == "not":
        nodo = Nodo("not")
        nodo.izquierda = construir_arbol(terminos[1:])
        return nodo

    return Nodo(terminos[0])


def imprimir_arbol(arbol, nivel=0):
    """Print the tree sideways: right subtree on top, left subtree at the bottom."""
    if arbol is None:
        return
    imprimir_arbol(arbol.derecha, nivel + 1)
    print("   " * nivel + str(arbol.valor))
    imprimir_arbol(arbol.izquierda, nivel + 1)


def procesar_arbol(arbol, df):
    """Evaluate the tree and return a boolean Series indexed by document title."""
    if arbol is None:
        raise ValueError("Malformed query: an operator is missing an operand")
    if arbol.valor in ("and", "or"):
        izquierda = procesar_arbol(arbol.izquierda, df)
        derecha = procesar_arbol(arbol.derecha, df)
        return izquierda & derecha if arbol.valor == "and" else izquierda | derecha
    if arbol.valor == "not":
        return ~procesar_arbol(arbol.izquierda, df)
    # Words are the row index of df (it was transposed), so they must be looked up
    # with .loc; the earlier df[arbol.valor] searched the columns and raised
    # KeyError: 'python'.
    if arbol.valor in df.index:
        return df.loc[arbol.valor].astype(bool)
    # A word that appears in no document matches nothing
    return pd.Series(False, index=df.columns)


def query_processing(query, df):
    terminos = []
    for token in query.lower().split():
        # "a not b" is shorthand for "a and not b"
        if token == "not" and terminos and terminos[-1] not in ("and", "or", "not"):
            terminos.append("and")
        terminos.append(token)

    arbol = construir_arbol(terminos)
    imprimir_arbol(arbol)
    return procesar_arbol(arbol, df)


# Example usage: this query parses as (python and programming) or dota
resultado = query_processing("python and programming or dota", df)


   dota
or
      programming
   and
      python

In [52]:
def find_word(df, word):
    """Return the titles of the documents whose text contains `word`."""
    word = word.strip().lower()
    print("Searching for word:", word)
    if word in df.index:
        fila_palabra = df.loc[word]
        # Keep only the columns (titles) whose value is True
        return fila_palabra[fila_palabra == True].index.tolist()
    return []

In [37]:
def print_results(results, query):
    if not results:
        print(f'No documents match the query "{query}".')
        return
    print(f'Documents containing the term "{query}":')
    for book in results:
        print(f"\t{book} ")

In [38]:
def query_procesing(query):
    """Run a query with at most one operator (and, or, not) between two words."""
    query = query.lower()
    operaciones = {
        "and": set.intersection,
        "or": set.union,
        "not": set.difference,  # "a not b": documents with a but without b
    }
    for operador, combinar in operaciones.items():
        # Match operators only as whole words surrounded by spaces, so that terms
        # such as "sand" or "word" are not mistaken for "and" / "or"
        separador = f" {operador} "
        if separador in query:
            firstword, secondword = query.split(separador, 1)
            print(f"Words before '{operador}':", firstword)
            print(f"Words after '{operador}':", secondword)
            is_firstword = find_word(df, firstword)
            is_secondword = find_word(df, secondword)
            response = combinar(set(is_firstword), set(is_secondword))
            print_results(response, query)
            return
    # No operator found: plain single-term search
    print_results(find_word(df, query), query)

In [53]:
query_procesing("hola or mundo")

Words before 'or': hola
Words after 'or': mundo
Searching for word: hola
Searching for word: mundo
Documents containing the term "hola or mundo":
	Ulysses 
	Don Quijote 
	Ang "Filibusterismo" (Karugtóng ng Noli Me Tangere) 
	Noli Me Tangere 
	Christopher Columbus and How He Received and Imparted the Spirit of Discovery 


In [59]:
# Assume the DataFrame df already exists; select the row for the word "hola"
fila_palabra_hola = df.loc['hola']

# Convert the row (a Series) into a one-row DataFrame
df_fila_hola = pd.DataFrame(fila_palabra_hola).transpose()

# Display the one-row DataFrame for the word "hola"
df_fila_hola


Unnamed: 0,The Works of the Rev. Hugh Binning,Jane Eyre: An Autobiography,Middlemarch,"Narrative of the Life of Frederick Douglass, an American Slave",The Count of Monte Cristo,The Complete Works of William Shakespeare,Fan Fare May 1953,"Little Women; Or, Meg, Jo, Beth, and Amy","The Adventures of Tom Sawyer, Complete",The Great Gats,...,The Yellow Wallpaper,Thus Spake Zarathustra: A Book for All and None,Don Quijote,Leviathan,Metamorphosis,Peter Pan,The Philippines a Century Hence,Beyond Good and Evil,Childe Harold's Pilgrimage,Second Treatise of Government
hola,False,False,False,False,False,False,False,False,False,False,...,False,False,True,False,False,False,False,False,False,False


In [60]:
valores_fila_hola = df_fila_hola.values.tolist()
print("Values in the row for the word 'hola':")
for valor in valores_fila_hola[0]:
    print(valor)

Values in the row for the word 'hola':
False
False
False
False
False
False
False
False
False
False
False
False
False
False
False
False
False
False
False
False
False
False
False
False
False
False
False
False
False
False
False
False
False
False
False
False
False
False
False
False
False
False
False
False
True
False
False
False
False
False
False
False
False
False
False
False
False
False
False
False
False
False
False
False
False
False
False
False
False
False
False
False
False
False
False
False
False
False
False
False
False
False
False
False
False
False
False
False
False
True
False
False
False
False
False
False
False


In [61]:
# Show the columns whose value is True in the row for the word "hola"
columnas_true = [columna for columna, valor in fila_palabra_hola.items() if valor == True]
print("Columns with the value True in the row for the word 'hola':", columnas_true)


Columns with the value True in the row for the word 'hola': ['Ulysses', 'Don Quijote']
