# Basic Boolean Search in Documents

## Objective
Expand the simple term search functionality to include Boolean search capabilities. This will allow users to perform more complex queries by combining multiple search terms using Boolean operators.

## Problem Description
You must enhance the existing search engine from the previous exercise to support Boolean operators: AND, OR, and NOT. This will enable the retrieval of documents based on the logical relationships between multiple terms.

## Requirements

```
pandas
numpy
scikit-learn
matplotlib
jupyter
requests
bs4
wget
```

### Step 1: Update Data Preparation
Ensure that the documents are still loaded and preprocessed from the previous task. The data should be clean and ready for advanced querying.



In [1]:
import requests
from bs4 import BeautifulSoup
import wget
import os
import re


# Create the folder for the downloaded files if it does not exist
download_folder = "../../week01/data"
os.makedirs(download_folder, exist_ok=True)


# URL of the Project Gutenberg top-100 books page
url = "https://www.gutenberg.org/browse/scores/top#books-last1"

# Download the page
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")

# Find all links whose href contains "/ebooks/"
links = soup.find_all("a", href=lambda href: href and "/ebooks/" in href)


# Keep the first 100 links (a few are navigation links without a book ID)
top_100_links = links[:100]


for link in top_100_links:
    # Extract the book's ID number from the link
    match = re.search(r'/(\d+)/?$', link["href"])
    if match:
        book_id = match.group(1)
        download_link = f"https://www.gutenberg.org/cache/epub/{book_id}/pg{book_id}.txt"
        file_name = f"pg{book_id}.txt"
        file_path = os.path.join(download_folder, file_name)  # Full path of the file to save
        print("Downloading:", download_link)
        try:
            wget.download(download_link, out=file_path)
            print("Downloaded and saved to:", file_path)
        except Exception as e:
            print("Error downloading the file:", str(e))
    else:
        print("No ID number in link:", link["href"])

No ID number in link: /ebooks/
No ID number in link: /ebooks/
No ID number in link: /ebooks/bookshelf/
No ID number in link: /ebooks/offline_catalogs.html
Downloading: https://www.gutenberg.org/cache/epub/84/pg84.txt
Downloaded and saved to: ../../week01/data/pg84.txt
Downloading: https://www.gutenberg.org/cache/epub/1342/pg1342.txt
Downloaded and saved to: ../../week01/data/pg1342.txt
Downloading: https://www.gutenberg.org/cache/epub/2701/pg2701.txt
Downloaded and saved to: ../../week01/data/pg2701.txt
Downloading: https://www.gutenberg.org/cache/epub/1513/pg1513.txt
Downloaded and saved to: ../../week01/data/pg1513.txt
Downloading: https://www.gutenberg.org/cache/epub/145/pg145.txt
Downloaded and saved to: ../../week01/data/pg145.txt
Downloading: https://www.gutenberg.org/cache/epub/844/pg844.txt
Downloaded and saved to: ../../week01/data/pg844.txt
Downloading: https://www.gutenberg.org/cache/ep

### Step 2: Create an Inverted Index

Create an inverted index from the documents. This index maps each word to the set of document IDs in which that word appears. This facilitates word lookup in the search process.



In [13]:
import os

def create_inverted_index(directory):
    """Map each whitespace-separated token to the list of documents containing it."""
    inverted_index = {}
    for filename in os.listdir(directory):
        filepath = os.path.join(directory, filename)
        if os.path.isfile(filepath) and filename.endswith('.txt'):
            with open(filepath, 'r', encoding='utf-8') as file:
                document_id = filename[:-4]  # Strip the ".txt" extension
                # Deduplicate tokens within the document so each ID is appended only once
                for word in set(file.read().split()):
                    inverted_index.setdefault(word, []).append(document_id)
    return inverted_index

In [14]:
directory = "../../week01/data"
inverted_index = create_inverted_index(directory)

In [15]:
num_words = len(inverted_index)
print("Total number of words in the inverted index:", num_words)


total_documents = sum(len(document_ids) for document_ids in inverted_index.values())
average_documents_per_word = total_documents / num_words
print("Average number of documents per word:", average_documents_per_word)

Total number of words in the inverted index: 572780
Average number of documents per word: 2.909195502636265
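
Because `split()` separates only on whitespace, tokens such as `Dark,` and `dark` are indexed as different words, which likely inflates the vocabulary count above. The following is a minimal sketch of a normalized index, assuming lowercase alphanumeric tokens are enough for this exercise. The names `tokenize`, `build_index`, and `sample_docs` are hypothetical and are not part of the original notebook.

```python
import re
from collections import defaultdict

# Assumption: a token is a run of lowercase ASCII letters or digits.
TOKEN_PATTERN = re.compile(r"[a-z0-9]+")

def tokenize(text):
    """Lowercase the text and return its alphanumeric tokens."""
    return TOKEN_PATTERN.findall(text.lower())

def build_index(documents):
    """Map each normalized token to the set of document IDs containing it.

    `documents` is a dict of {document_id: text}; this in-memory input is a
    simplification of the file-based loader used in the notebook.
    """
    index = defaultdict(set)
    for document_id, text in documents.items():
        for token in set(tokenize(text)):
            index[token].add(document_id)
    return dict(index)

# Toy documents, invented for illustration only.
sample_docs = {
    "doc1": "The night was Dark, and the road was long.",
    "doc2": "A dark horse won the race.",
    "doc3": "Bright mornings follow the longest night.",
}
sample_index = build_index(sample_docs)
print(sorted(sample_index["dark"]))   # ['doc1', 'doc2']
print(sorted(sample_index["night"]))  # ['doc1', 'doc3']
```

If the index is normalized this way, query terms should be passed through the same `tokenize` step before lookup so that both sides agree.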


### Step 3: Query Processing
- **Parse the Query**: Implement a function to parse the input query to identify the terms and operators.
- **Search Documents**: Based on the parsed query, implement the logic to retrieve and rank the documents according to the Boolean expressions.
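
The two bullets above can be sketched as a small recursive-descent evaluator. This is one possible approach under stated assumptions, not the required solution: operators must be written in uppercase (`AND`, `OR`, `NOT`), precedence is NOT > AND > OR, parentheses are supported, and terms are looked up exactly as typed. The function names `tokenize_query` and `evaluate_query` and the toy data are hypothetical. The result is an unranked set of document IDs; ranking is not covered here.

```python
import re

def tokenize_query(query):
    """Split a query into terms, operators, and parentheses."""
    return re.findall(r"\(|\)|[^\s()]+", query)

def evaluate_query(query, index, all_docs):
    """Evaluate a Boolean query and return the set of matching document IDs.

    `index` maps a term to an iterable of document IDs; `all_docs` is the
    set of every document ID and is needed to compute NOT.
    """
    tokens = tokenize_query(query)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        token = peek()
        pos += 1
        return token

    def parse_or():
        result = parse_and()
        while peek() == "OR":
            advance()
            result = result | parse_and()
        return result

    def parse_and():
        result = parse_not()
        while peek() == "AND":
            advance()
            result = result & parse_not()
        return result

    def parse_not():
        if peek() == "NOT":
            advance()
            return all_docs - parse_not()
        return parse_atom()

    def parse_atom():
        token = advance()
        if token is None or token in ("AND", "OR", ")"):
            raise ValueError(f"Unexpected token in query: {token!r}")
        if token == "(":
            result = parse_or()
            if advance() != ")":
                raise ValueError("Missing closing parenthesis")
            return result
        return set(index.get(token, ()))  # Unknown terms match nothing

    result = parse_or()
    if peek() is not None:
        raise ValueError(f"Unexpected token in query: {peek()!r}")
    return result

# Toy index and document IDs, invented for illustration only.
toy_index = {
    "dark": ["d1", "d2", "d4"],
    "night": ["d1", "d2", "d3"],
    "whale": ["d4"],
}
toy_docs = {"d1", "d2", "d3", "d4"}

print(sorted(evaluate_query("dark AND night", toy_index, toy_docs)))
# ['d1', 'd2']
print(sorted(evaluate_query("dark AND NOT whale", toy_index, toy_docs)))
# ['d1', 'd2']
print(sorted(evaluate_query("whale OR (night AND NOT dark)", toy_index, toy_docs)))
# ['d3', 'd4']
```

With the notebook's index, `all_docs` could be collected as the union of all posting lists, for example `set().union(*inverted_index.values())`.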



In [16]:
def get_query_term():
    query_term = input("Enter your query term: ")
    return query_term


query = get_query_term()

### Step 4: Displaying Results
- **Output the Results**: Display the documents that match the query criteria. Include functionalities to handle queries that result in no matching documents.



In [17]:
print("Query term entered:", query)
matching_books = inverted_index.get(query, [])  # Empty list if the term is not in the index
print(f"Books containing the word {query}:", matching_books, f"Number of books: {len(matching_books)}", sep="\n")

Query term entered: dark
Books containing the word dark:
['pg1400', 'pg4300', 'pg1998', 'pg345', 'pg3825', 'pg43240', 'pg45', 'pg4363', 'pg600', 'pg25344', 'pg105', 'pg244', 'pg1250', 'pg67979', 'pg67098', 'pg1260', 'pg6130', 'pg6593', 'pg1184', 'pg10676', 'pg2591', 'pg768', 'pg40156', 'pg28054', 'pg5200', 'pg5197', 'pg41095', 'pg35899', 'pg394', 'pg84', 'pg514', 'pg27827', 'pg829', 'pg98', 'pg408', 'pg73490', 'pg2814', 'pg55', 'pg73488', 'pg73487', 'pg2600', 'pg58585', 'pg26', 'pg120', 'pg2160', 'pg2641', 'pg46722', 'pg4085', 'pg36', 'pg2701', 'pg996', 'pg1342', 'pg6761', 'pg145', 'pg1661', 'pg11', 'pg2852', 'pg73489', 'pg1513', 'pg43', 'pg2554', 'pg100', 'pg23', 'pg16', 'pg73491', 'pg39225', 'pg219', 'pg1952', 'pg76', 'pg73492', 'pg23958', 'pg1727', 'pg46', 'pg3207', 'pg1259', 'pg73484', 'pg30254', 'pg8800', 'pg74', 'pg64317', 'pg205', 'pg2542', 'pg37106', 'pg56463', 'pg174', 'pg16389']
Number of books: 86
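
Step 4 also asks for handling of queries that match nothing. A minimal sketch of a display helper that covers both cases follows; the name `format_results` and the sample inputs are hypothetical, and the output wording is only a suggestion.

```python
def format_results(query, matches):
    """Return a printable summary of the documents matching a query."""
    if not matches:
        return f"No documents match the query: {query}"
    lines = [f"Query: {query}", f"Number of books: {len(matches)}"]
    # Sort the IDs so the listing is stable from run to run
    lines.extend(f"  {document_id}" for document_id in sorted(matches))
    return "\n".join(lines)

# Sample inputs, invented for illustration only.
print(format_results("dark AND night", {"d2", "d1"}))
print(format_results("unicorn", set()))
```

Returning a string instead of printing directly keeps the helper easy to test and to reuse in another interface.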


## Evaluation Criteria
- **Correctness**: The Boolean search implementation should correctly interpret and process the queries according to the Boolean logic.
- **Efficiency**: Consider the efficiency of your search process, especially as the complexity of queries increases.
- **User Experience**: Ensure that the interface for inputting queries and viewing results is user-friendly.
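
On the efficiency criterion, one common heuristic for multi-term AND queries is to intersect the smallest posting set first, so that intermediate results stay small and the loop can stop as soon as the result is empty. The sketch below assumes posting lists are available as sets or lists; `intersect_all` is a hypothetical helper, not part of the original notebook.

```python
def intersect_all(posting_sets):
    """Intersect several posting collections, starting from the smallest."""
    if not posting_sets:
        return set()
    ordered = sorted(posting_sets, key=len)
    result = set(ordered[0])
    for postings in ordered[1:]:
        result.intersection_update(postings)
        if not result:
            break  # No document can match once the result is empty
    return result

# Toy posting sets, invented for illustration only.
common = {"d1", "d2", "d3", "d4"}
medium = {"d2", "d4"}
rare = {"d4"}
print(intersect_all([common, medium, rare]))  # {'d4'}
```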

This exercise will deepen your understanding of how search engines process and respond to user queries.