Inverted index.
An inverted index is a data structure used to implement full-text search. Given a set of text files, this program builds an inverted index and provides a simple user interface for searching it: a query returns the list of files that contain the query term or terms. The implementation appears below, after a description of how it works.
To run this program, you need to install the python-docx library: pip install python-docx

Creating an inverted index involves processing a collection of text files and mapping each unique word (term) to the list of files that contain it.
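
The term-to-files mapping described above can be illustrated with a minimal sketch that uses in-memory strings instead of real files. The `documents` dictionary and its file names are made up for illustration; the notebook's own implementation reads files from disk.

```python
from collections import defaultdict

# Hypothetical "files": names and contents are invented for this example.
documents = {
    "a.txt": "the quick brown fox",
    "b.txt": "the lazy dog",
    "c.txt": "quick thinking",
}

# Map each unique whitespace-separated word to the set of names containing it.
index = defaultdict(set)
for name, text in documents.items():
    for word in set(text.split()):
        index[word].add(name)

print(sorted(index["quick"]))           # ['a.txt', 'c.txt']
print(sorted(index["the"]))             # ['a.txt', 'b.txt']
print(sorted(index.get("cat", set())))  # []
```

Looking up an unknown term with `index.get(term, set())` avoids adding an empty entry to the `defaultdict`.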

Here’s how the program works:
It reads a set of text files from a specified directory.
It builds an inverted index mapping each term to the filenames in which that term appears.
It provides a simple command-line interface for users to search terms.

Code step by step:
InvertedIndex class: This class contains methods to add files and search terms.
The add_file() method reads a file, splits its content into words on whitespace, and adds each unique word to the index. Plain .txt files are read directly, .docx files are read through the read_docx() method, and any other file type is reported as unsupported and skipped.
Reading .docx files: The read_docx() method uses the python-docx library to extract the paragraph text from a .docx file.
The search() method takes a query (a string or a list of strings) and returns the set of filenames that contain at least one of the terms.
build_inverted_index() function: This function takes a directory path, iterates over the .txt and .docx files in that directory, and adds each one to the inverted index.
main() function: Prompts the user for the directory containing the files, builds the inverted index from that directory, then lets the user enter search terms and prints the files that contain them.
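
For a multi-term query, the search() method in the code below unions the results, so a file matches if it contains any one of the terms. The sketch below contrasts that behaviour with a stricter variant that requires every term. Both function names (`search_any`, `search_all`) and the sample index are hypothetical and not part of the notebook's code.

```python
from collections import defaultdict

def search_any(index, terms):
    """Union: files containing at least one term (the behaviour of search())."""
    result = set()
    for term in terms:
        result |= index.get(term, set())
    return result

def search_all(index, terms):
    """Intersection: files containing every term (a hypothetical variant)."""
    postings = [index.get(term, set()) for term in terms]
    if not postings:
        return set()
    return set.intersection(*postings)

# Made-up index contents for illustration.
index = defaultdict(set, {"apple": {"a.txt", "b.txt"}, "banana": {"b.txt"}})

print(sorted(search_any(index, ["apple", "banana"])))  # ['a.txt', 'b.txt']
print(sorted(search_all(index, ["apple", "banana"])))  # ['b.txt']
```

Which behaviour is preferable depends on the use case: the union is more forgiving, while the intersection narrows results as terms are added.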

Running the Program:
Create a directory and populate it with some text files (.txt or .docx). Note that the older .doc format is not supported.
Run the program. When prompted, enter the path to the directory containing the text files.
You can then enter search terms to find which files contain those terms.
Type exit to quit the program.
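
To try the steps above without preparing files by hand, a small throwaway directory can be generated first. The sketch below is only an illustration under stated assumptions: the file names and contents are made up, only .txt files are created, and the indexing loop is a simplified stand-in for the notebook's build_inverted_index() rather than a call to it.

```python
import os
import tempfile
from collections import defaultdict

# Hypothetical sample files; names and contents are invented.
samples = {
    "notes.txt": "inverted index maps terms to files",
    "todo.txt": "build the index then search",
}

with tempfile.TemporaryDirectory() as directory:
    # Write the sample files into a temporary directory.
    for name, text in samples.items():
        with open(os.path.join(directory, name), "w", encoding="utf-8") as f:
            f.write(text)

    # Index every .txt file in the directory, word by word.
    index = defaultdict(set)
    for filename in sorted(os.listdir(directory)):
        if filename.endswith(".txt"):
            with open(os.path.join(directory, filename), encoding="utf-8") as f:
                for word in set(f.read().split()):
                    index[word].add(filename)

print(sorted(index["index"]))   # ['notes.txt', 'todo.txt']
print(sorted(index["search"]))  # ['todo.txt']
```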

Important:
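Make sure the specified directory contains some .txt or .docx files; otherwise the index will be empty.
This implementation is case-sensitive: "Apple" and "apple" are treated as different terms. Because words are split on whitespace only, punctuation attached to a word is kept as part of the term.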
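
If case-insensitive matching is wanted, one possible approach is to normalize words before they are indexed and to apply the same normalization to each query. The `tokenize` helper below is a hypothetical addition, not part of the notebook's code; it lowercases the text and keeps only runs of word characters, which also drops attached punctuation.

```python
import re

def tokenize(text):
    """Lowercase the text and return its unique word tokens (hypothetical helper)."""
    return set(re.findall(r"\w+", text.lower()))

print(sorted(tokenize("Python, python and PYTHON!")))  # ['and', 'python']
```

With such a helper, add_file() would index `tokenize(content)` instead of `set(content.split())`, and search() would have to tokenize the query the same way for lookups to match.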


In [None]:
import os
from collections import defaultdict
from docx import Document

class InvertedIndex:
    def __init__(self):
        self.index = defaultdict(set)

    def add_file(self, file_path):
        """Add a .txt or .docx file to the inverted index."""
        if file_path.endswith('.txt'):
            with open(file_path, 'r', encoding='utf-8') as file:
                content = file.read()
        elif file_path.endswith('.docx'):
            content = self.read_docx(file_path)
        else:
            print(f"Unsupported file type: {file_path}")
            return

        words = set(content.split())  # Unique words in the file
        for word in words:
            self.index[word].add(os.path.basename(file_path))  # Map word to the filename

    def read_docx(self, file_path):
        """Return the paragraph text of a .docx file, one paragraph per line."""
        doc = Document(file_path)
        content = []
        for paragraph in doc.paragraphs:
            content.append(paragraph.text)
        return '\n'.join(content)

    def search(self, query):
        """Return the set of filenames that contain at least one of the query terms."""
        if isinstance(query, str):
            query = [query]  # Make it a list if a single string is provided
        result_files = set()
        for term in query:
            result_files.update(self.index.get(term, set()))
        return result_files

def build_inverted_index(directory):
    """Build the inverted index from the .txt and .docx files in the given directory."""
    inverted_index = InvertedIndex()
    for filename in os.listdir(directory):
        if filename.endswith('.txt') or filename.endswith('.docx'):  # Process .txt and .docx files
            file_path = os.path.join(directory, filename)
            inverted_index.add_file(file_path)
            print(f"Added file: {filename}")
    return inverted_index

def main():
    directory = input("Enter the directory containing text files: ")
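    # isdir() also rejects paths that exist but point to a regular file,
    # which would otherwise make os.listdir() raise NotADirectoryError.
    if not os.path.isdir(directory):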
        print("Directory not found!")
        return

    inverted_index = build_inverted_index(directory)
    print("Inverted index created.")

    while True:
        query = input("Enter search terms (or 'exit' to quit): ")
        if query.strip().lower() == 'exit':  # Ignore surrounding whitespace
            break
        results = inverted_index.search(query.split())

        if results:
            # Sort the filenames so the output order is deterministic
            print(f"Files containing the term(s): {', '.join(sorted(results))}")
        else:
            print("No files found containing the term(s).")

if __name__ == "__main__":
    main()