<a href="https://colab.research.google.com/github/PaolaMaribel18/RI_2024a/blob/main/week01/notebooks/02_boolean.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Boolean Search in Documents

## Objective
Expand the simple term search functionality to include Boolean search capabilities. This will allow users to perform more complex queries by combining multiple search terms using Boolean operators.

## Problem Description
You must enhance the existing search engine from the previous exercise to support Boolean operators: AND, OR, and NOT. This will enable the retrieval of documents based on the logical relationships between multiple terms.

## Requirements

### Step 1: Update Data Preparation
Ensure that the documents are still loaded and preprocessed from the previous task. The data should be clean and ready for advanced querying.


In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
data_directory = '/content/drive/MyDrive/ri_2024a/week1/data'


In [3]:
import os

def load_documents(data_directory):
    documents = {}
    for filename in os.listdir(data_directory):
        with open(os.path.join(data_directory, filename), 'r') as file:
            content = file.read()
            # Perform preprocessing
            preprocessed_content = preprocess_content(content)
            documents[filename] = preprocessed_content
    return documents

def preprocess_content(content):
    content = content.lower()  # Convert text to lowercase
    content = content.replace('\n', ' ')  # Replace newline characters with spaces
    return content

In [4]:
documents = load_documents(data_directory)
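
The preprocessing above only lowercases the text and replaces newlines, so a later `split()` keeps punctuation attached to words (for example, `jekyll,` and `jekyll` become different tokens). A minimal sketch of a stricter tokenizer follows; the name `tokenize` and the choice to keep only ASCII letters and digits are assumptions for illustration, not part of the original exercise.

```python
import re

def tokenize(text):
    """Lowercase the text and return its runs of ASCII letters and digits.

    Hypothetical helper: punctuation and whitespace act as separators.
    """
    return re.findall(r"[a-z0-9]+", text.lower())

print(tokenize("Dr. Jekyll, and Mr. Hyde!"))
# ['dr', 'jekyll', 'and', 'mr', 'hyde']
```

If a tokenizer like this is used when building the index, the same function should also be applied to the query terms so that both sides are normalized identically.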


### Step 2: Create an Inverted Index

Create an inverted index from the documents. This index maps each word to the set of document IDs in which that word appears. This facilitates word lookup in the search process.


In [28]:
def create_inverted_index(documents):
    inverted_index = {}
    for doc_id, content in documents.items():
        # Tokenize the content into words
        words = content.split()
        # Iterate over each word
        for word in words:
            # Check if the word already exists in the inverted index
            if word in inverted_index:
                # If yes, add the document ID to the set of IDs for that word
                inverted_index[word].add(doc_id)
            else:
                # If not, create a new entry with the document ID
                inverted_index[word] = {doc_id}
    return inverted_index

inverted_index = create_inverted_index(documents)
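
To make the structure concrete, here is a small sketch of the same idea on two toy documents. The document IDs, their texts, and the name `build_index` are invented for illustration; `collections.defaultdict` is used only to shorten the "create the entry if missing" branch.

```python
from collections import defaultdict

def build_index(docs):
    """Map each whitespace-separated word to the set of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.split():
            index[word].add(doc_id)
    return index

# Toy corpus (hypothetical data, already lowercased)
toy_docs = {
    "d1": "python programming guide",
    "d2": "python snake facts",
}
toy_index = build_index(toy_docs)

print(sorted(toy_index["python"]))  # ['d1', 'd2']
print(sorted(toy_index["snake"]))   # ['d2']
```

Looking up a word is then a single dictionary access, regardless of how many documents the collection holds.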


### Step 3: Implementing Boolean Search
- **Enhance Input Query**: Modify the function to accept complex queries that can include the Boolean operators AND, OR, and NOT.
- **Implement Boolean Logic**:
  - **AND**: The document must contain all the terms. For example, python AND programming should return documents containing both "python" and "programming".
  - **OR**: The document can contain any of the terms. For example, python OR programming should return documents containing either "python", "programming", or both.
  - **NOT**: The document must not contain the term following NOT. For example, python NOT snake should return documents that contain "python" but not "snake".
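
Because the inverted index stores a set of document IDs per word, each operator corresponds directly to a Python set operation. The sketch below uses invented posting sets (the document IDs are assumptions) to show the three cases described above.

```python
# Hypothetical posting sets: the documents in which each word appears
python_docs = {"d1", "d2", "d3"}
programming_docs = {"d1", "d3", "d4"}
snake_docs = {"d2"}

# python AND programming: intersection
and_result = python_docs & programming_docs
# python OR programming: union
or_result = python_docs | programming_docs
# python NOT snake: difference
not_result = python_docs - snake_docs

print(sorted(and_result))  # ['d1', 'd3']
print(sorted(or_result))   # ['d1', 'd2', 'd3', 'd4']
print(sorted(not_result))  # ['d1', 'd3']
```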

In [68]:
def boolean_search(query_terms, operators, inverted_index):
    # An empty query matches no documents
    if not query_terms:
        return set()

    # Start from the documents containing the first term
    # (copy the set so the index itself is never modified)
    result_set = set(inverted_index.get(query_terms[0], set()))

    # Evaluate left to right: each operator combines the running result
    # with the term that follows it. zip stops at the shorter list, so a
    # trailing term without an operator is ignored.
    for operator, term in zip(operators, query_terms[1:]):
        postings = inverted_index.get(term, set())
        if operator == "and":
            # AND: keep only the documents that also contain the term
            result_set &= postings
        elif operator == "or":
            # OR: add the documents that contain the term
            result_set |= postings
        elif operator == "not":
            # NOT: remove the documents that contain the term
            result_set -= postings

    return result_set



### Step 4: Query Processing
- **Parse the Query**: Implement a function to parse the input query to identify the terms and operators.
- **Search Documents**: Based on the parsed query, implement the logic to retrieve and rank the documents according to the Boolean expressions.



In [56]:
def parse_query(query):
    # Split the query into individual terms
    terms = query.split()
    # Initialize lists to store terms and operators separately
    query_terms = []
    operators = []
    # Iterate over each term in the query
    for term in terms:
        # Check if the term is an operator
        if term.lower() in ["and", "or", "not"]:
            # If it's an operator, add it to the operators list
            operators.append(term.lower())
        else:
            # If it's not an operator, add it to the query terms list
            query_terms.append(term.lower())
    return query_terms, operators
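
A left-to-right evaluation expects exactly one operator between each pair of consecutive terms. The sketch below is a hypothetical sanity check (the name `validate_query` is an assumption) that could be run on the parser's output before searching. It rejects inputs such as `python AND NOT snake`, which yields two terms and two operators, but it cannot tell whether a word like "and" was meant as an operator or as part of a title.

```python
def validate_query(query_terms, operators):
    """Return True when there is exactly one operator between consecutive terms."""
    if not query_terms:
        return False
    return len(operators) == len(query_terms) - 1

print(validate_query(["pride", "prejudice"], ["and"]))      # True
print(validate_query(["python", "snake"], ["and", "not"]))  # False
print(validate_query([], []))                               # False
```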


In [57]:
query = "Pride and Prejudice"
parsed_query = parse_query(query)
print("Original query:", query)
print("Parsed query:", parsed_query)

Original query: Pride and Prejudice
Parsed query: (['pride', 'prejudice'], ['and'])


### Step 5: Displaying Results
- **Output the Results**: Display the documents that match the query criteria. Include functionalities to handle queries that result in no matching documents.



In [70]:
def display_results(results):
    if results:
        print("Query found in the following documents:")
        for doc in results:
            print("-", doc)
    else:
        print("Query not found in any document.")

In [71]:
simple_query = input("Enter the search query (type 'exit' to quit): ")

while simple_query.lower() != 'exit':
    if not simple_query:
        print("Please enter a search query.")
    else:
        # Parse the query to obtain the terms and operators
        query_terms, operators = parse_query(simple_query)
        print(query_terms, operators)

        # Search the documents based on the query
        search_results = boolean_search(query_terms, operators, inverted_index)

        # Display the search results
        display_results(search_results)

    # Prompt the user for a new query
    simple_query = input("Enter the search query (type 'exit' to quit): ")


Enter the search query (type 'exit' to quit): Jekyll
['jekyll'] []
Query found in the following documents:
- pg43.txt
Enter the search query (type 'exit' to quit): Wonderland
['wonderland'] []
Query found in the following documents:
- pg5197.txt
- pg11.txt
- pg29728.txt
Enter the search query (type 'exit' to quit): Jekyll and Wonderland
['jekyll', 'wonderland'] ['and']
Query not found in any document.
Enter the search query (type 'exit' to quit): exit
Enter the search query (type 'exit' to quit): exit


## Evaluation Criteria
- **Correctness**: The Boolean search implementation should correctly interpret and process the queries according to the Boolean logic.
- **Efficiency**: Consider the efficiency of your search process, especially as the complexity of queries increases.
- **User Experience**: Ensure that the interface for inputting queries and viewing results is user-friendly.
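
On the efficiency point, one common heuristic for queries that combine several terms with AND is to intersect the smallest posting sets first, so the intermediate result shrinks as early as possible. The following is a minimal sketch under that assumption; `intersect_all` and the sample posting sets are hypothetical and not part of the exercise code.

```python
def intersect_all(postings):
    """Intersect a list of sets, processing them from smallest to largest."""
    if not postings:
        return set()
    ordered = sorted(postings, key=len)
    result = set(ordered[0])  # copy so the input sets are left untouched
    for other in ordered[1:]:
        result &= other
        if not result:
            # Nothing can match any more, so stop early
            break
    return result

# Hypothetical posting sets for three AND-ed terms
postings = [{"d1", "d2", "d3", "d4"}, {"d2"}, {"d2", "d3"}]
print(intersect_all(postings))  # {'d2'}
```

The result is the same as intersecting in query order; only the amount of work changes.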

This exercise will deepen your understanding of how search engines process and respond to user queries.
