# Term Search in Documents

## Objective
The goal of this exercise is to develop a simple information retrieval system that allows the user to search for a specific term across a set of text documents. This will introduce you to the basics of text processing and searching algorithms in the context of information retrieval.

## Problem Description
You are provided with a set of text documents. Your task is to implement a search function that:
- Takes a term entered by the user as the query.
- Searches for this term across all the provided documents.
- Returns a list of documents where the term appears.




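## Requirements
The environment for this exercise lists the following Python packages; the code below imports only `requests`, `bs4`, and `wget` from them.
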
```
pandas
numpy
scikit-learn
matplotlib
jupyter
requests
bs4
wget
```


### Step 1: Preparing the Data
- **Load the Documents**: You will start by loading the text documents into your program. These documents can be in plain text format stored in a directory.
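
The download cell in this notebook builds each file URL from a numeric book ID found in a link. As a minimal sketch of that one step, assuming book links have the form `/ebooks/<number>`, the helper below keeps only links that carry an ID and drops duplicates. The names `BOOK_HREF` and `extract_book_ids` are hypothetical and not part of the notebook.

```python
import re

# Assumption: book pages look like "/ebooks/84"; navigation links such as
# "/ebooks/" or "/ebooks/bookshelf/" carry no numeric ID and do not match.
BOOK_HREF = re.compile(r"^/ebooks/(\d+)/?$")


def extract_book_ids(hrefs, limit=100):
    """Return up to `limit` unique book IDs, in order of first appearance."""
    seen = set()
    book_ids = []
    for href in hrefs:
        if len(book_ids) >= limit:
            break
        match = BOOK_HREF.match(href)
        if not match:
            continue  # not a book page
        book_id = match.group(1)
        if book_id not in seen:
            seen.add(book_id)
            book_ids.append(book_id)
    return book_ids


sample_hrefs = ["/ebooks/", "/ebooks/bookshelf/", "/ebooks/84", "/ebooks/1342", "/ebooks/84"]
print(extract_book_ids(sample_hrefs))  # ['84', '1342']
```

Filtering before slicing means the limit counts actual books rather than every link that happens to contain `/ebooks/`.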




In [9]:
import requests
from bs4 import BeautifulSoup
import wget
import os
import re


# Create the folder for the downloaded files if it does not exist
download_folder = "../../week01/data"
os.makedirs(download_folder, exist_ok=True)


# URL of the Project Gutenberg page listing the top 100 books
url = "https://www.gutenberg.org/browse/scores/top#books-last1"

# Download the page
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")

# Find all links that contain "/ebooks/"
links = soup.find_all("a", href=lambda href: href and "/ebooks/" in href)


# Take the first 100 links; this slice also includes navigation links
# without a book ID, which are reported and skipped below
top_100_links = links[:100]


for link in top_100_links:
    # Extract the book's ID number from the link
    match = re.search(r'/(\d+)/?$', link["href"])
    if match:
        book_id = match.group(1)
        download_link = f"https://www.gutenberg.org/cache/epub/{book_id}/pg{book_id}.txt"
        file_name = f"pg{book_id}.txt"
        file_path = os.path.join(download_folder, file_name)  # Full path of the file to save
        print("Downloading:", download_link)
        try:
            wget.download(download_link, out=file_path)
            print("Downloaded and saved to:", file_path)
        except Exception as e:
            print("Error downloading the file:", str(e))
    else:
        print("No ID number in link:", link["href"])

No ID number in link: /ebooks/
No ID number in link: /ebooks/
No ID number in link: /ebooks/bookshelf/
No ID number in link: /ebooks/offline_catalogs.html
Downloading: https://www.gutenberg.org/cache/epub/84/pg84.txt
Downloaded and saved to: ../../week01/data/pg84.txt
Downloading: https://www.gutenberg.org/cache/epub/1342/pg1342.txt
Downloaded and saved to: ../../week01/data/pg1342.txt
Downloading: https://www.gutenberg.org/cache/epub/41287/pg41287.txt
Downloaded and saved to: ../../week01/data/pg41287.txt
Downloading: https://www.gutenberg.org/cache/epub/2701/pg2701.txt
Downloaded and saved to: ../../week01/data/pg2701.txt
Downloading: https://www.gutenberg.org/cache/epub/1513/pg1513.txt
Downloaded and saved to: ../../week01/data/pg1513.txt
Downloading: https://www.gutenberg.org/cache/epub/61419/pg61419.txt
Downloaded and saved to: ../../week01/data/pg61419.txt
Downloading: https://www.gutenberg.

- **Read Each Document**: Implement a function to read each document and store its contents in a data structure of your choice (e.g., a list).
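
One possible shape for such a reading function is sketched below. It assumes the documents are `.txt` files encoded as UTF-8 and stores them in a dictionary keyed by file name, so that search results can later refer to real file names. The function name `load_documents` and the temporary demo files are hypothetical.

```python
import tempfile
from pathlib import Path


def load_documents(directory):
    """Map each .txt file name in `directory` to its text content."""
    documents = {}
    # sorted() gives a stable order regardless of the file system
    for path in sorted(Path(directory).glob("*.txt")):
        # errors="replace" keeps going if a file is not valid UTF-8
        documents[path.name] = path.read_text(encoding="utf-8", errors="replace")
    return documents


# Demo on a throwaway directory (stand-in for the real data folder)
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "b.txt").write_text("second document", encoding="utf-8")
    Path(tmp, "a.txt").write_text("first document", encoding="utf-8")
    Path(tmp, "notes.md").write_text("ignored", encoding="utf-8")
    docs = load_documents(tmp)

print(list(docs))  # ['a.txt', 'b.txt']
```

A list, as suggested in the exercise, works equally well; the dictionary only trades positional indices for file names.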

In [10]:
# Data structure for storing the contents of the documents
document_contents = []


for file_name in os.listdir(download_folder):
    file_path = os.path.join(download_folder, file_name)
    if file_name.endswith(".txt"):
        with open(file_path, "r", encoding="utf-8") as file:
            content = file.read()
            document_contents.append(content)

# List the books
for index, content in enumerate(document_contents, start=1):
    print(f"Document {index}:")
    print("-" * 50)


Document 1:
--------------------------------------------------
Document 2:
--------------------------------------------------
Document 3:
--------------------------------------------------
Document 4:
--------------------------------------------------
Document 5:
--------------------------------------------------
Document 6:
--------------------------------------------------
Document 7:
--------------------------------------------------
Document 8:
--------------------------------------------------
Document 9:
--------------------------------------------------
Document 10:
--------------------------------------------------
Document 11:
--------------------------------------------------
Document 12:
--------------------------------------------------
Document 13:
--------------------------------------------------
Document 14:
--------------------------------------------------
Document 15:
--------------------------------------------------
Document 16:
--------------------

### Step 2: Implementing the Search
- **Input Query**: Implement a function to accept a query term from the user.
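
A query function can also tidy what the user types before the search runs. The sketch below is one hedged way to do that: it trims whitespace and asks again when the input is empty. The names `normalize_query` and `read_query`, and the `read` parameter used to script the input, are hypothetical.

```python
def normalize_query(raw):
    """Trim surrounding whitespace and collapse internal runs of whitespace."""
    return " ".join(raw.split())


def read_query(prompt="Enter your query term: ", read=input):
    """Keep asking until the user supplies a non-empty term."""
    while True:
        query = normalize_query(read(prompt))
        if query:
            return query
        print("The query cannot be empty.")


# Scripted answers stand in for keyboard input so the example runs unattended
answers = iter(["   ", "  Python   myth "])
print(read_query(read=lambda prompt: next(answers)))  # Python myth
```

Passing the reader as a parameter keeps the function testable; in the notebook the default `input` would be used.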

In [13]:
def get_query_term():
    query_term = input("Enter your query term: ")
    return query_term


query = get_query_term()
print("Query term entered:", query)


Query term entered: python


- **Search Function**: Create a function that:
  - Iterates through each document.
  - Checks if the query term appears in the document.
  - You may choose to implement case-insensitive search to improve user experience.
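
The membership check described above can be done in more than one way, and the choice changes the results. The sketch below contrasts a case-insensitive substring test with a whole-word test built on a regular expression. Both helper names are hypothetical, and the whole-word version assumes the query starts and ends with a letter or digit, since `\b` only marks boundaries next to word characters.

```python
import re


def contains_substring(query, text):
    """Case-insensitive substring test: also matches inside longer words."""
    return query.casefold() in text.casefold()


def contains_word(query, text):
    """Case-insensitive whole-word test using word boundaries."""
    pattern = r"\b" + re.escape(query) + r"\b"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None


text = "The Pythoness of Delphi spoke."
print(contains_substring("python", text))  # True
print(contains_word("python", text))       # False
print(contains_word("delphi", text))       # True
```

Substring matching is simpler and more forgiving, while whole-word matching avoids hits inside unrelated longer words.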

In [22]:
def search_term_in_documents(query, document_contents):
    """Return (identifier, title) pairs for the documents that contain the query."""
    matching_documents = []
    query_lower = query.lower()  # lowercase once for case-insensitive matching

    for index, content in enumerate(document_contents, start=1):
        if query_lower in content.lower():
            # Use the first "Title:" line of the Project Gutenberg header, if any
            title_line = None
            for line in content.split("\n"):
                if "Title:" in line:
                    title_line = line
                    break
            if title_line:
                title = title_line.split(":", 1)[1].strip()
            else:
                title = "Title not found"
            matching_documents.append((f"Document {index}", title))

    return matching_documents



- **Return Results**: The function should return the names or identifiers of the documents where the term is found.

In [23]:
matching_documents = search_term_in_documents(query, document_contents)


### Step 3: Displaying Results
- **Output the Results**: For each search query, output the results in a user-friendly format, listing the documents where the term was found, or a message indicating that the term does not appear in any document.


In [24]:
# Report when nothing matched, as the step above requires
if not matching_documents:
    print(f"The term '{query}' does not appear in any document.")
for document_info in matching_documents:
    print(f"File name: {document_info[0]} | Document title: {document_info[1]}")

File name: Document 22 | Document title: The Christ Myth
File name: Document 27 | Document title: The Metamorphoses of Ovid, Books VIII-XV
File name: Document 81 | Document title: The Count of Monte Cristo
File name: Document 91 | Document title: Leviathan


## Evaluation Criteria
- **Correctness**: The search function should accurately identify documents containing the term.
- **Efficiency**: Efficiency is not critical for a small dataset, but consider how your search algorithm would scale as the number and size of the documents grow.
- **Usability**: The interface for inputting search terms and viewing results should be clear and easy to use.
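
On the efficiency criterion, a common alternative to scanning every document for each query is an inverted index, which is built once and then answers lookups directly. The sketch below is a minimal illustration, not part of the exercise's required solution. The names `TOKEN`, `build_inverted_index`, and `lookup` and the two sample documents are hypothetical, and tokens are assumed to be runs of word characters.

```python
import re
from collections import defaultdict

# Assumption: a token is any run of letters, digits, or underscores
TOKEN = re.compile(r"\w+")


def build_inverted_index(documents):
    """Map each lowercased token to the set of document names containing it."""
    index = defaultdict(set)
    for name, text in documents.items():
        for token in TOKEN.findall(text.lower()):
            index[token].add(name)
    return index


def lookup(index, term):
    """Return the sorted names of the documents that contain `term`."""
    return sorted(index.get(term.lower(), set()))


documents = {
    "a.txt": "The python coiled around the tree.",
    "b.txt": "A tree grows in the garden.",
}
index = build_inverted_index(documents)
print(lookup(index, "Tree"))    # ['a.txt', 'b.txt']
print(lookup(index, "python"))  # ['a.txt']
print(lookup(index, "river"))   # []
```

Unlike the substring search, this index matches whole tokens only, so it trades some flexibility for lookups that do not depend on the total amount of text.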

## Additional Challenges (Optional)
- **Enhance the search functionality**: Allow for more complex queries, such as phrases or multiple terms.
- **Improve the search with regular expressions**: Use regex for pattern matching to enhance the flexibility of the search.
- **Implement a simple ranking system**: Rank the documents based on the frequency of the term within each document.
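
The three challenges above can be combined in a small sketch: the query is split into terms, each term is counted with a whole-word regular expression, and documents containing every term are ranked by total frequency. This is one possible design under those assumptions, and the names `count_term` and `rank_documents` and the sample documents are hypothetical.

```python
import re


def count_term(term, text):
    """Count case-insensitive whole-word occurrences of `term` in `text`."""
    pattern = r"\b" + re.escape(term) + r"\b"
    return len(re.findall(pattern, text, flags=re.IGNORECASE))


def rank_documents(query, documents):
    """Rank documents containing every query term by total term frequency."""
    terms = query.split()
    ranked = []
    for name, text in documents.items():
        counts = [count_term(term, text) for term in terms]
        # all(counts) is False as soon as one term has zero occurrences
        if terms and all(counts):
            ranked.append((name, sum(counts)))
    # Highest frequency first; ties are broken by document name
    ranked.sort(key=lambda item: (-item[1], item[0]))
    return ranked


documents = {
    "a.txt": "Python myth. The python guarded the myth.",
    "b.txt": "A myth about a river.",
    "c.txt": "The python slept; the myth grew.",
}
print(rank_documents("python myth", documents))  # [('a.txt', 4), ('c.txt', 2)]
```

Raw frequency favors long documents, so a fuller ranking scheme would normalize by document length; that refinement is left out here.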

This exercise will help you understand the fundamental mechanisms behind storing and retrieving data in the field of information retrieval. By the end of this task, you will have a basic prototype that mimics core functions of larger, more complex search engines.