
# Term Search in Documents

## Objective
The goal of this exercise is to develop a simple information retrieval system that allows the user to search for a specific term across a set of text documents. This will introduce you to the basics of text processing and searching algorithms in the context of information retrieval.

## Problem Description
You are provided with a set of text documents. Your task is to implement a search function that:
- Takes a user-inputted term as the query.
- Searches for this term across all the provided documents.
- Returns a list of documents where the term appears.

## Requirements

### Step 1: Preparing the Data
- **Load the Documents**: You will start by loading the text documents into your program. These documents can be in plain text format stored in a directory.
- **Read Each Document**: Implement a function to read each document and store its contents in a data structure of your choice (e.g., a list).

### Step 2: Implementing the Search
- **Input Query**: Implement a function to accept a query term from the user.
- **Search Function**: Create a function that:
  - Iterates through each document.
  - Checks if the query term appears in the document.
  - Optionally ignores case, so that a query such as "alice" also matches "Alice".
- **Return Results**: The function should return the names or identifiers of the documents where the term is found.

### Step 3: Displaying Results
- **Output the Results**: For each search query, output the results in a user-friendly format, listing the documents where the term was found, or a message indicating that the term does not appear in any document.

## Evaluation Criteria
- **Correctness**: The search function should accurately identify documents containing the term.
- **Efficiency**: Efficiency is not critical for small datasets, but consider how the cost of your search grows with the number and size of the documents.
- **Usability**: The interface for inputting search terms and viewing results should be clear and easy to use.
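
As an illustration of the efficiency criterion, here is a minimal sketch of an inverted index, a common way to avoid rescanning every document for each query. It is not part of the exercise requirements. The names `build_inverted_index`, `lookup`, and `sample_docs` are hypothetical, and the tokenization (lowercased `\w+` runs) is an assumption. Unlike a plain substring search, this matches whole words only.

```python
import re
from collections import defaultdict


def build_inverted_index(documents):
    """Map each lowercase word to the set of document names containing it."""
    index = defaultdict(set)
    for doc_name, content in documents.items():
        # \w+ keeps runs of letters, digits, and underscores as tokens.
        for token in re.findall(r"\w+", content.lower()):
            index[token].add(doc_name)
    return index


def lookup(index, term):
    """Return the sorted names of documents containing `term` as a whole word."""
    return sorted(index.get(term.lower(), set()))


# Tiny in-memory corpus, invented for illustration only.
sample_docs = {
    "a.txt": "Alice fell down the rabbit hole.",
    "b.txt": "The rabbit was late.",
    "c.txt": "Pride and Prejudice",
}
index = build_inverted_index(sample_docs)
print(lookup(index, "Rabbit"))  # ['a.txt', 'b.txt']
```

Building the index costs one pass over the corpus. After that, each lookup is a dictionary access instead of a scan of every document.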



In [29]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [30]:
corpus = "/content/drive/MyDrive/ri_2024a/week1/data"

In [31]:
import os

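def load_documents_from_drive(corpus):
    """Read every file in the corpus directory into a {filename: content} dict."""
    documents = {}
    for filename in os.listdir(corpus):
        path = os.path.join(corpus, filename)
        if not os.path.isfile(path):
            continue  # Skip subdirectories
        # An explicit encoding avoids platform-dependent defaults; undecodable bytes are replaced
        with open(path, 'r', encoding='utf-8', errors='replace') as file:
            documents[filename] = file.read()
    return documents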

In [32]:
def print_document_names(file_names):
    print("Names of the documents:")
    for filename in file_names:
        print(filename)

# Load the documents; the dictionary keys are the file names
document_names = load_documents_from_drive(corpus)

# Print the document names
print_document_names(document_names)

Names of the documents:
pg1342.txt
pg120.txt
pg11.txt
pg1080.txt
pg10676.txt
datasource.txt
pg67979.txt
pg6761.txt
pg1400.txt
pg5200.txt
pg50038.txt
pg67098.txt
pg64317.txt
pg600.txt
pg55.txt
pg52281.txt
pg47629.txt
pg45540.txt
pg45.txt
pg44388.txt
pg43.txt
pg46.txt
pg1260.txt
pg45848.txt
pg1259.txt
pg42933.txt
pg41445.txt
pg408.txt
pg394.txt
pg35899.txt
pg12582.txt
pg345.txt
pg29728.txt
pg2852.txt
pg2814.txt
pg27827.txt
pg1184.txt
pg2641.txt
pg26073.txt
pg2591.txt
pg2542.txt
pg25344.txt
pg244.txt
pg10907.txt
pg219.txt
pg21700.txt
pg2160.txt
pg205.txt
pg1998.txt
pg1952.txt
pg174.txt
pg1727.txt
pg1661.txt
pg100.txt
pg16.txt
pg1232.txt
pg16389.txt
pg1513.txt
pg37106.txt
pg4085.txt
pg3207.txt
pg30254.txt
pg28054.txt
pg2701.txt
pg2600.txt
pg2554.txt
pg21012.txt
pg20228.txt
pg2000.txt
pg8800.txt
pg84.txt
pg768.txt
pg98.txt
pg844.txt
pg18893.txt
pg7370.txt
pg73448.txt
pg73447.txt
pg76.txt
pg74.txt
pg73444.txt
pg73442.txt
pg145.txt
pg59468.txt
pg52882.txt
pg5197.txt
pg514.txt
pg48191.txt
pg47

In [33]:
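def search_term(documents, term):
    """Return the names of the documents that contain `term`, ignoring case."""
    term_lower = term.lower()  # Lowercase the query once, not once per document
    results = []
    for doc_name, content in documents.items():
        if term_lower in content.lower():  # Case-insensitive substring search
            results.append(doc_name)
    return results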

In [34]:
def display_results(results):
    if results:
        print("Term found in the following documents:")
        for doc in results:
            print("-", doc)
    else:
        print("Term not found in any document.")

In [35]:
# Load the documents
documents = load_documents_from_drive(corpus)

# Search for each term entered and display the results
simple_query = input("Enter the search term (type 'exit' to quit): ")
while simple_query.lower() != 'exit':
    search_results = search_term(documents, simple_query)
    display_results(search_results)
    simple_query = input("Enter the search term (type 'exit' to quit): ")

Enter the search term (type 'exit' to quit): Jekyll
Term found in the following documents:
- pg43.txt
- pg44837.txt
Enter the search term (type 'exit' to quit): Wonderland
Term found in the following documents:
- pg11.txt
- pg45.txt
- pg45848.txt
- pg29728.txt
- pg5197.txt
Enter the search term (type 'exit' to quit): Pride
Term found in the following documents:
- pg1342.txt
- pg120.txt
- pg1080.txt
- pg10676.txt
- pg67979.txt
- pg6761.txt
- pg1400.txt
- pg5200.txt
- pg50038.txt
- pg64317.txt
- pg600.txt
- pg45540.txt
- pg45.txt
- pg46.txt
- pg1260.txt
- pg45848.txt
- pg1259.txt
- pg42933.txt
- pg41445.txt
- pg408.txt
- pg394.txt
- pg35899.txt
- pg12582.txt
- pg345.txt
- pg29728.txt
- pg2852.txt
- pg2814.txt
- pg27827.txt
- pg1184.txt
- pg2641.txt
- pg26073.txt
- pg2591.txt
- pg2542.txt
- pg25344.txt
- pg244.txt
- pg10907.txt
- pg219.txt
- pg21700.txt
- pg2160.txt
- pg205.txt
- pg1998.txt
- pg174.txt
- pg1727.txt
- pg1661.txt
- pg100.txt
- pg16.txt
- pg16389.txt
- pg1513.txt
- pg37106.t

## Additional Challenges (Optional)
- **Enhance the search functionality**: Allow for more complex queries, such as phrases or multiple terms.
- **Improve the search with regular expressions**: Use regex for pattern matching to enhance the flexibility of the search.
- **Implement a simple ranking system**: Rank the documents based on the frequency of the term within each document.

This exercise will help you understand the fundamental mechanisms behind storing and retrieving data in the field of information retrieval. By the end of this task, you will have a basic prototype that mimics core functions of larger, more complex search engines.
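
One possible approach to the first optional challenge (phrases and multiple terms) is sketched below. The function names `search_all_terms` and `search_phrase` and the sample corpus are hypothetical. Treating a multi-term query as "all terms must appear" is an assumption, since the exercise does not say how multiple terms should be combined.

```python
def search_all_terms(documents, query):
    """Return names of documents that contain every whitespace-separated term."""
    terms = query.lower().split()
    if not terms:
        return []
    results = []
    for doc_name, content in documents.items():
        text = content.lower()
        if all(term in text for term in terms):
            results.append(doc_name)
    return results


def search_phrase(documents, phrase):
    """Return names of documents containing the phrase, ignoring case and whitespace runs."""
    normalized_phrase = " ".join(phrase.lower().split())
    if not normalized_phrase:
        return []
    results = []
    for doc_name, content in documents.items():
        # Collapse line breaks and repeated spaces so a phrase wrapped
        # across lines in the source text still matches.
        normalized_text = " ".join(content.lower().split())
        if normalized_phrase in normalized_text:
            results.append(doc_name)
    return results


# Tiny in-memory corpus, invented for illustration only.
sample_docs = {
    "a.txt": "Dr. Jekyll and\nMr. Hyde",
    "b.txt": "Mr. Hyde alone",
}
print(search_all_terms(sample_docs, "hyde jekyll"))  # ['a.txt']
print(search_phrase(sample_docs, "jekyll and mr"))   # ['a.txt']
```

Both functions accept the same `{filename: content}` dictionary that the basic search uses, so either could be called from the same input loop.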

In [40]:
# Improve the search with regular expressions and rank documents by term frequency
import re

def search_with_regex(query, documents):
    results = []
    for doc_name, content in documents.items():
        if re.search(query, content, re.IGNORECASE):
            results.append(doc_name)
    return results

def rank_documents(query, documents):
    rankings = {}
    for doc_name, content in documents.items():
        frequency = content.lower().count(query.lower())
        rankings[doc_name] = frequency
    sorted_rankings = sorted(rankings.items(), key=lambda x: x[1], reverse=True)
    return sorted_rankings
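
Two refinements to the regex search are worth sketching, both as assumptions rather than requirements of the exercise. First, a substring search for a short term also matches longer words that contain it; wrapping the escaped term in `\b` word boundaries restricts matches to whole words. Second, `re.search` raises `re.error` when the user types an invalid pattern, so compiling the pattern defensively keeps an interactive loop from crashing. The names `search_whole_word` and `safe_compile` are hypothetical.

```python
import re


def search_whole_word(documents, term):
    """Return names of documents containing `term` as a whole word, ignoring case.

    Assumes the term starts and ends with a word character; otherwise the
    \\b boundaries may not match where expected.
    """
    pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
    return [name for name, content in documents.items() if pattern.search(content)]


def safe_compile(pattern_text):
    """Compile a user-supplied regex, returning None if it is invalid."""
    try:
        return re.compile(pattern_text, re.IGNORECASE)
    except re.error:
        return None


# Tiny in-memory corpus, invented for illustration only.
sample_docs = {
    "a.txt": "Pride goes before a fall.",
    "b.txt": "He was a prideful man.",
}
print(search_whole_word(sample_docs, "pride"))  # ['a.txt']
print(safe_compile("(unclosed") is None)        # True
```

A caller could check the result of `safe_compile` and ask the user for a new pattern when it is `None`.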


In [42]:
query = input("Enter the search term (type 'exit' to quit): ")
while query.lower() != 'exit':
    # Run the basic search
    search_results = search_term(documents, query)
    display_results(search_results)

    # Run the regular-expression search
    regex_query = input("Enter the regex search term (type 'exit' to quit): ")
    if regex_query.lower() == 'exit':
        break
    regex_results = search_with_regex(regex_query, documents)
    display_results(regex_results)

    # Rank the documents by term frequency
    ranked_results = rank_documents(query, documents)
    print("Ranked Results:")
    for doc, rank in ranked_results:
        print(f"{doc}: {rank}")

    query = input("Enter the search term (type 'exit' to quit): ")

Enter the search term (type 'exit' to quit): Wonderland
Term found in the following documents:
- pg11.txt
- pg45.txt
- pg45848.txt
- pg29728.txt
- pg5197.txt
Enter the regex search term (type 'exit' to quit): Wonderland
Term found in the following documents:
- pg11.txt
- pg45.txt
- pg45848.txt
- pg29728.txt
- pg5197.txt
Ranked Results:
pg11.txt: 7
pg45.txt: 1
pg45848.txt: 1
pg29728.txt: 1
pg5197.txt: 1
pg1342.txt: 0
pg120.txt: 0
pg1080.txt: 0
pg10676.txt: 0
datasource.txt: 0
pg67979.txt: 0
pg6761.txt: 0
pg1400.txt: 0
pg5200.txt: 0
pg50038.txt: 0
pg67098.txt: 0
pg64317.txt: 0
pg600.txt: 0
pg55.txt: 0
pg52281.txt: 0
pg47629.txt: 0
pg45540.txt: 0
pg44388.txt: 0
pg43.txt: 0
pg46.txt: 0
pg1260.txt: 0
pg1259.txt: 0
pg42933.txt: 0
pg41445.txt: 0
pg408.txt: 0
pg394.txt: 0
pg35899.txt: 0
pg12582.txt: 0
pg345.txt: 0
pg2852.txt: 0
pg2814.txt: 0
pg27827.txt: 0
pg1184.txt: 0
pg2641.txt: 0
pg26073.txt: 0
pg2591.txt: 0
pg2542.txt: 0
pg25344.txt: 0
pg244.txt: 0
pg10907.txt: 0
pg219.txt: 0
pg21700.txt: