# Term Search in Documents

## Objective
The goal of this exercise is to develop a simple information retrieval system that allows the user to search for a specific term across a set of text documents. This will introduce you to the basics of text processing and searching algorithms in the context of information retrieval.

## Problem Description
You are provided with a set of text documents. Your task is to implement a search function that:
- Takes a term entered by the user as the query.
- Searches for this term across all the provided documents.
- Returns a list of documents where the term appears.

## Requirements

### Step 1: Preparing the Data
- **Load the Documents**: Start by loading the text documents into your program. The documents can be plain-text files stored in a directory.
- **Read Each Document**: Implement a function to read each document and store its contents in a data structure of your choice (e.g., a list).

### Step 2: Implementing the Search
- **Input Query**: Implement a function to accept a query term from the user.
- **Search Function**: Create a function that:
  - Iterates through each document.
  - Checks if the query term appears in the document.
  - Optionally ignores case when matching, so that "Search" and "search" are treated as the same term.
- **Return Results**: The function should return the names or identifiers of the documents where the term is found.
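
The steps above can be sketched without any pattern-matching library. This is a minimal illustration, not the required solution: the function name `find_matching_documents` and the in-memory `sample_documents` dictionary are hypothetical, and the documents are assumed to be held as a mapping from name to full text.

```python
def find_matching_documents(documents, term):
    """Return the names of documents whose text contains `term`, ignoring case.

    `documents` is assumed to map a document name to its full text.
    """
    needle = term.casefold()
    if not needle:
        return []  # an empty query would otherwise match every document
    return [name for name, text in documents.items() if needle in text.casefold()]


# Hypothetical in-memory documents standing in for files loaded from disk.
sample_documents = {
    "a.txt": "Searching is the core of information retrieval.",
    "b.txt": "This document talks about cooking.",
    "c.txt": "A SEARCH engine ranks documents.",
}
print(find_matching_documents(sample_documents, "search"))  # ['a.txt', 'c.txt']
```

Note that this is a substring test, so the query "search" also matches "Searching". Whether that is desirable depends on how you define a "term".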

### Step 3: Displaying Results
- **Output the Results**: For each search query, output the results in a user-friendly format, listing the documents where the term was found, or a message indicating that the term does not appear in any document.

## Evaluation Criteria
- **Correctness**: The search function should accurately identify documents containing the term.
- **Efficiency**: Scanning every document for every query is acceptable for small datasets, but consider how the cost of your search grows with the number and size of the documents.
- **Usability**: The interface for inputting search terms and viewing results should be clear and easy to use.
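
One common way to address the efficiency criterion is an inverted index, which maps each word to the set of documents that contain it, so a query becomes a dictionary lookup rather than a scan of every document. The sketch below is one possible approach under stated assumptions: the names `build_inverted_index` and `lookup` are hypothetical, words are taken to be runs of letters, digits, and underscores, and matching is on whole words rather than substrings.

```python
import re
from collections import defaultdict

# Assumption: a "word" is any run of letters, digits, or underscores.
TOKEN_PATTERN = re.compile(r"\w+")


def build_inverted_index(documents):
    """Map each lowercased word to the set of document names containing it."""
    index = defaultdict(set)
    for name, text in documents.items():
        for token in TOKEN_PATTERN.findall(text.lower()):
            index[token].add(name)
    return index


def lookup(index, term):
    """Return the sorted names of documents containing `term` as a whole word."""
    return sorted(index.get(term.lower(), set()))


# Hypothetical documents used only for illustration.
docs = {
    "a.txt": "The search began at dawn.",
    "b.txt": "Research takes time.",
    "c.txt": "Search again, and search well.",
}
index = build_inverted_index(docs)
print(lookup(index, "Search"))  # ['a.txt', 'c.txt']
```

The index is built once, at a cost proportional to the total text size, and each later query is answered without rereading the documents. The trade-off is extra memory and the loss of substring matching: here "search" no longer matches "Research".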

## Additional Challenges (Optional)
- **Enhance the search functionality**: Allow for more complex queries, such as phrases or multiple terms.
- **Improve the search with regular expressions**: Use regex for pattern matching to enhance the flexibility of the search.
- **Implement a simple ranking system**: Rank the documents based on the frequency of the term within each document.
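
The three challenges above can be combined in a single function. The following is a rough sketch rather than a reference solution: `rank_documents` is a hypothetical name, the query is assumed to be a whitespace-separated list of word-like terms, a document qualifies only if it contains every term, and the score is simply the total number of whole-word occurrences.

```python
import re


def rank_documents(documents, query):
    """Rank documents that contain every query term by total occurrences.

    Returns (name, score) pairs, highest score first; ties are broken by name.
    """
    terms = query.split()
    if not terms:
        return []
    # \b restricts each match to whole words; re.escape keeps terms literal.
    patterns = [
        re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
        for term in terms
    ]
    ranked = []
    for name, text in documents.items():
        counts = [len(pattern.findall(text)) for pattern in patterns]
        if all(counts):  # every term must occur at least once
            ranked.append((name, sum(counts)))
    ranked.sort(key=lambda item: (-item[1], item[0]))
    return ranked


# Hypothetical documents used only for illustration.
docs = {
    "a.txt": "The search for the ship was a long search.",
    "b.txt": "A ship sailed.",
    "c.txt": "Research on the ship search.",
}
print(rank_documents(docs, "search ship"))  # [('a.txt', 3), ('c.txt', 2)]
```

Raw frequency favors long documents, so a more careful ranking might normalize the score by document length.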

This exercise will help you understand the fundamental mechanisms behind storing and retrieving data in the field of information retrieval. By the end of this task, you will have a basic prototype that mimics core functions of larger, more complex search engines.


In [1]:
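import os
import time

data_directory = '../data/'

def load_data():
    """Read every .txt file in data_directory into a dict of filename -> list of lines."""
    files = {}
    for filename in os.listdir(data_directory):
        if filename.endswith('.txt'):
            # Assumption: the files are UTF-8; an explicit encoding avoids platform-dependent defaults.
            with open(os.path.join(data_directory, filename), 'r', encoding='utf-8') as file:
                files[filename] = file.readlines()
    return files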


In [2]:
files = load_data()

In [3]:
def get_query_term():
    return input("Enter the query term: ").strip()


In [4]:
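import re

def search_documents(files, query):
    """Return (filename, match_count) pairs sorted by descending match count."""
    if not query:
        return []  # an empty pattern would match at every position of every document
    # Escape the query so it is matched literally, ignoring case.
    # This is a substring match: "search" also counts "research" and "searching".
    query_regex = re.compile(re.escape(query), re.IGNORECASE)
    results = {}
    for filename, contents in files.items():
        # Lines from readlines() keep their trailing newline, so join without a separator.
        content_string = ''.join(contents)
        matches = query_regex.findall(content_string)
        if matches:
            results[filename] = len(matches)
    sorted_results = sorted(results.items(), key=lambda item: item[1], reverse=True)
    return sorted_results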


In [5]:
query_term = get_query_term()

matching_files = search_documents(files, query_term)
if matching_files:
    print(f'Documents containing the term "{query_term}":')
    # Each result is a (filename, match_count) tuple, most matches first.
    for match in matching_files:
        print('\t', match)
else:
    print("No documents found containing the term.")

Documents containing the term "search":
	 ('The Works of the Rev. Hugh Binning.txt', 176)
	 ('The Count of Monte Cristo.txt', 74)
	 ('Christopher Columbus and How He Received and Imparted the Spirit of Discovery.txt', 73)
	 ('The Complete Works of William Shakespeare.txt', 71)
	 ('Don Quixote.txt', 65)
	 ('War and Peace.txt', 46)
	 ('Dracula.txt', 45)
	 ('Twenty years after.txt', 41)
	 ('Plato and the Other Companions of Sokrates, 3rd ed. Volume 4.txt', 38)
	 ('History of Tom Jones, a Foundling.txt', 34)
	 ('The Adventures of Tom Sawyer, Complete.txt', 30)
	 ('Crime and Punishment.txt', 29)
	 ('Ulysses.txt', 27)
	 ('History of Woman Suffrage, Volume III (590).txt', 27)
	 ('Kentucky in American Letters, 1784-1912. Vol. 2 of 2.txt', 24)
	 ('The divine comedy.txt', 24)
	 ('Middlemarch.txt', 24)
	 ("Roget's Thesaurus of English Words and Phrases.txt", 24)
	 ("Grimms' Fairy Tales.txt", 24)
	 ('The Adventures of Sherlock Holmes.txt', 24)
	 ('The Brothers Karamazov.txt', 22)
	 ('The Adventure