<a href="https://colab.research.google.com/github/PaolaMaribel18/RI_2024a/blob/main/week01/notebooks/02_boolean.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Boolean Search in Documents

## Objective
Expand the simple term search functionality to include Boolean search capabilities. This will allow users to perform more complex queries by combining multiple search terms using Boolean operators.

## Problem Description
You must enhance the existing search engine from the previous exercise to support Boolean operators: AND, OR, and NOT. This will enable the retrieval of documents based on the logical relationships between multiple terms.

## Requirements

### Step 1: Update Data Preparation
Ensure that the documents are still loaded and preprocessed from the previous task. The data should be clean and ready for advanced querying.


In [50]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [51]:
data_directory = '/content/drive/MyDrive/ri_2024a/week1/data'


In [52]:
import os

def load_documents_from_drive(data_directory):
    documents = {}
    for filename in os.listdir(data_directory):
        with open(os.path.join(data_directory, filename), 'r') as file:
            content = file.read()
            # Perform preprocessing
            preprocessed_content = preprocess_content(content)
            documents[filename] = preprocessed_content
    return documents

def preprocess_content(content):
    content = content.lower()  # Convert text to lowercase
    content = content.replace('\n', ' ')  # Remove newline characters
    return content


In [53]:
# Cargar los documentos
documents = load_documents_from_drive(data_directory)

In [54]:
primer_documento = next(iter(documents.values()))
print(primer_documento)



In [68]:
def create_index(documents):
    index = {}
    for doc_name, content in documents.items():
        for term in content.split():
            original_term = term  # Conserva la versión original del término
            term = term.lower()  # Convertir término a minúsculas
            if term not in index:
                index[term] = [0] * len(documents)  # Inicializar lista de ceros para cada término
            index[term][list(documents.keys()).index(doc_name)] = 1  # Marcar documento como 1 si contiene el término
            # Almacena también la versión original del término
            if original_term not in index:
                index[original_term] = [0] * len(documents)
            index[original_term][list(documents.keys()).index(doc_name)] = 1
    return index


def print_index(index):
    count = 0
    for term, appearances in index.items():
        print(term, appearances)
        count += 1
        if count >= 5:
            break

index_matrix = create_index(documents)

# Llamar a la función print_index con el índice creado
print_index(index_matrix)

﻿the [1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
project [1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
gutenberg [1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
ebook [1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1


### Step 2: Implementing Boolean Search
- **Enhance Input Query**: Modify the function to accept complex queries that can include the Boolean operators AND, OR, and NOT.
- **Implement Boolean Logic**:
  - **AND**: The document must contain all the terms. For example, `python AND programming` should return documents containing both "python" and "programming".
  - **OR**: The document can contain any of the terms. For example, `python OR programming` should return documents containing either "python", "programming", or both.
  - **NOT**: The document must not contain the term following NOT. For example, `python NOT snake` should return documents that contain "python" but not "snake".


In [56]:
def search_boolean(index, terms, operators, case_sensitive=False):
    # Initialize the result set
    result_set = set(index.keys())

    for term in terms:
        # Convert the term to lowercase if case_sensitive is False
        if not case_sensitive:
            term_lower = term.lower()
            term_upper = term.upper()
        else:
            term_lower = term_upper = term

        # Get the documents that contain the term
        term_documents_lower = set([doc_name for doc_name, appearances in index.items() if appearances.count(term_lower) > 0])
        term_documents_upper = set([doc_name for doc_name, appearances in index.items() if appearances.count(term_upper) > 0])

        # Perform Boolean operations based on the operators
        if operators:
            operator = operators[0]
            if operator == "AND":
                result_set = result_set.intersection(term_documents_lower).union(term_documents_upper)
            elif operator == "OR":
                result_set = result_set.union(term_documents_lower).union(term_documents_upper)
            elif operator == "NOT":
                result_set = result_set.difference(term_documents_lower).difference(term_documents_upper)

            # Remove the processed operator from the list
            operators = operators[1:]

    return list(result_set)


### Step 3: Query Processing
- **Parse the Query**: Implement a function to parse the input query to identify the terms and operators.
- **Search Documents**: Based on the parsed query, implement the logic to retrieve and rank the documents according to the Boolean expressions.
- **Handling Case Sensitivity and Partial Matches**: Optionally, you can handle cases and partial matches to refine the search results.


In [57]:
def parse_query(query, case_sensitive=False, partial_match=False):
    # Parse the query to identify terms and operators
    # Split the query into terms and operators
    tokens = query.split()

    # Initialize lists to store terms and operators
    terms = []
    operators = []

    # Iterate through tokens and classify them as terms or operators
    for token in tokens:
        if token.upper() in ["AND", "OR", "NOT"]:
            operators.append(token.upper())
        else:
            term = token
            if not case_sensitive:
                term = term.lower()  # Convert to lowercase if case insensitive
            if partial_match:
                term = '*' + term + '*'  # Add asterisks for partial matching
            terms.append(term)

    return terms, operators


In [58]:
def search_documents(documents, terms, operators):
    # Initialize a dictionary to store document scores
    document_scores = {doc_id: 0 for doc_id in documents.keys()}

    # Iterate through each term and operator
    for i, term in enumerate(terms):
        # If the term is a wildcard search term, skip it
        if '*' in term:
            continue

        # Search for exact matches of the term in each document
        for doc_id, content in documents.items():
            if term in content:
                # Increment the score of the document by 1 for each match
                document_scores[doc_id] += 1

    # Apply boolean operators to the document scores
    for i, operator in enumerate(operators):
        if operator == "AND":
            # Only keep documents that have scores equal to the number of terms
            document_scores = {doc_id: score for doc_id, score in document_scores.items() if score == len(terms)}
        elif operator == "OR":
            # Keep all documents that have a score greater than 0
            document_scores = {doc_id: score for doc_id, score in document_scores.items() if score > 0}
        elif operator == "NOT":
            # Remove documents that have a score greater than 0
            document_scores = {doc_id: score for doc_id, score in document_scores.items() if score == 0}

    # Rank documents by score (descending order)
    ranked_documents = sorted(document_scores.items(), key=lambda x: x[1], reverse=True)

    return ranked_documents



### Step 4: Displaying Results
- **Output the Results**: Display the documents that match the query criteria. Include functionalities to handle queries that result in no matching documents.




In [59]:
def display_results(results, documents):
    if not results:
        print("No matching documents found.")
    else:
        print("Matching documents:")
        for doc_index in results:
            doc_name = list(documents.keys())[doc_index]  # Obtener el nombre del documento usando el índice
            print(doc_name)


In [60]:
def main():
    documents = load_documents_from_drive(data_directory)
    index_matrix = create_index(documents)

    while True:
        # Solicitar al usuario que ingrese la consulta
        query = input("Enter the search query: ")
        if query.lower() == 'exit':
            break

        # Parse the query
        terms, operators = parse_query(query)

        # Procesar la consulta y obtener los resultados
        search_results = search_boolean(index_matrix, terms, operators, case_sensitive=False)

        # Mostrar los resultados
        display_results(search_results, documents)

if __name__ == "__main__":
    main()

Enter the search query: pride and prejudice
No matching documents found.
Enter the search query: pride
Matching documents:


TypeError: list indices must be integers or slices, not str

## Evaluation Criteria
- **Correctness**: The Boolean search implementation should correctly interpret and process the queries according to the Boolean logic.
- **Efficiency**: Consider the efficiency of your search process, especially as the complexity of queries increases.
- **User Experience**: Ensure that the interface for inputting queries and viewing results is user-friendly.



## Additional Challenges (Optional)
- **Nested Boolean Queries**: Allow for nested queries using parentheses, such as `(python OR java) AND programming`.
- **Phrase Searching**: Implement the ability to search for exact phrases enclosed in quotes.
- **Proximity Searching**: Extend the search to find terms that are within a specific distance from one another.

This exercise will deepen your understanding of how search engines process and respond to complex user queries. By incorporating Boolean search, you not only enhance the functionality of your search engine but also mimic more closely how real-world information retrieval systems operate.