2. Implement a program for retrieval of documents using inverted files.

In [None]:
# Import libraries
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Download required resources
# (recent NLTK releases may also need nltk.download('punkt_tab') for word_tokenize)
nltk.download('stopwords')
nltk.download('punkt')

# Open and read file (the with block closes the file automatically)
with open('file.txt', 'r', encoding='utf8') as file:
    lines = file.readlines()

# Preprocessing: build inverted index
inverted_index = {}
stop_words = set(stopwords.words('english'))

for i, line in enumerate(lines):
    # Tokenize and clean
    words = word_tokenize(line.lower())
    words = [w for w in words if w.isalnum() and w not in stop_words]
    
    # Build inverted index
    for word in words:
        if word not in inverted_index:
            inverted_index[word] = [i + 1]
        elif (i + 1) not in inverted_index[word]:
            inverted_index[word].append(i + 1)

# Display the inverted index
print("\nInverted Index:")
for word, doc_list in inverted_index.items():
    print(f"{word} : {doc_list}")

# Search for a word
search_word = input("\nEnter a word to search: ").lower()
if search_word in inverted_index:
    print(f"The word '{search_word}' is found in lines: {inverted_index[search_word]}")
else:
    print(f"The word '{search_word}' is not found in the document.")


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\sayal\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\sayal\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!



Inverted Index:
first : [1]
word : [1]
second : [2]
text : [2]
hello : [2]
third : [3]


Purpose of the Practical
The purpose of this practical is to build a simple search engine.

A forward index maps each document to the words it contains. Searching it is slow: to find the word "database," you have to scan every single document.

An "inverted file" (or inverted index) does the opposite. It builds a map (like an index at the back of a book) that links each word to a list of documents (or lines) that contain it.

This makes lookups fast. To find "database," you look it up in the index once and get back the list of all documents that contain it, for example [3, 42, 77].
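The contrast between the two index types can be sketched in a few lines. This is a minimal illustration, not the practical's code: the toy documents and the helper names scan_forward and lookup_inverted are assumptions made up for the example.

```python
# Hypothetical toy collection: document ID -> text (not the practical's file.txt)
docs = {
    1: "the database stores records",
    2: "a search engine ranks pages",
    3: "every database needs an index",
}

# Forward index: document ID -> words it contains
forward_index = {doc_id: text.split() for doc_id, text in docs.items()}

def scan_forward(term):
    """Search the forward index by checking every document in turn."""
    return [doc_id for doc_id, words in forward_index.items() if term in words]

# Inverted index: word -> list of document IDs that contain it
inverted_index = {}
for doc_id, words in forward_index.items():
    for word in set(words):  # set() so each document is added once per word
        inverted_index.setdefault(word, []).append(doc_id)

def lookup_inverted(term):
    """Search the inverted index with a single dictionary lookup."""
    return inverted_index.get(term, [])

print(scan_forward("database"))     # [1, 3]
print(lookup_inverted("database"))  # [1, 3]
```

Both functions return the same answer; the difference is that the forward search touches every document, while the inverted search touches one dictionary entry.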

🧠 Core Theory (How it Works)
This entire practical is built on a Python Dictionary ({}).

Keys: The unique, pre-processed words (e.g., "secret", "elara", "book").

Values: A list of document IDs (in this case, line numbers) where that word appears. This list is often called a "postings list."

The process is:

Scan: Read every line (document) one by one.

Process: Clean the line (tokenize, lowercase, remove stopwords) just like in Practical 1.

Invert & Add: For each clean word, add the current line number to its "postings list" in the dictionary.

If the word is new, you create a new list: "secret" : [1]

If the word exists, you append to its list: "secret" : [1, 4]

The final result is your inverted index. The "retrieval" part is just looking up a word (key) in your dictionary and returning its list (value).
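The three steps above can be sketched without NLTK so the example runs anywhere. The stop-word set, the sample lines, and the preprocess helper are assumptions for illustration only; the practical itself uses word_tokenize and NLTK's English stop-word list.

```python
# Hypothetical stand-ins: a tiny stop-word set and three in-memory lines
STOP_WORDS = {"the", "a", "is", "in", "of"}
lines = [
    "The secret is in the book",
    "Elara opened a book",
    "The secret of Elara",
]

def preprocess(line):
    """Lowercase, split on whitespace, keep alphanumeric non-stop-words."""
    return [w for w in line.lower().split() if w.isalnum() and w not in STOP_WORDS]

index = {}
for line_no, line in enumerate(lines, start=1):   # Scan
    for word in preprocess(line):                 # Process
        if word not in index:                     # Invert & Add: new word
            index[word] = [line_no]
        elif line_no not in index[word]:          # Invert & Add: known word
            index[word].append(line_no)

print(index["secret"])  # [1, 3]
print(index["book"])    # [1, 2]
```

Retrieval is then the lookup index["secret"], which returns the postings list for that word.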

📋 Step-by-Step Code Explanation
Import & Download NLTK: You import nltk, stopwords, and word_tokenize. You also download 'punkt' and 'stopwords' so the tools will work.

Read File: The code opens file.txt (which must be in the current working directory, usually the notebook's folder) and reads all lines into a list called lines.

Initialize: It creates an empty dictionary inverted_index = {} and loads the stopwords into a set for fast lookups.

Build the Index (The Core Loop):

It uses enumerate(lines) to loop through the file, getting both the line number (i) and the text of the line (line).

It tokenizes and cleans the line, storing the clean words in a list called words.

It then loops through each word in that words list.

if word not in inverted_index: Checks whether this is the first time the word has been seen. If so, it adds the word as a new key to the dictionary and sets its value to a new list containing the line number ([i + 1]). The code uses i + 1 because enumerate starts at 0, while line numbers conventionally start at 1.

elif (i + 1) not in inverted_index[word]: Handles words seen before. It checks whether the current line number is already in the list, which prevents adding the same line number twice when a word appears several times on the same line.

If the line number is new for that word, the code adds it with .append(i + 1) to the word's existing list.
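As a possible variation on the loop described above, the duplicate check can be delegated to a set, and enumerate can be told to start counting at 1. This is a sketch of an alternative, not the practical's code; the pre-tokenized sample lines are hypothetical stand-ins for the output of word_tokenize and the stop-word filter.

```python
from collections import defaultdict

# Hypothetical pre-tokenized lines; in the practical these come from word_tokenize
tokenized_lines = [
    ["hello", "world", "hello"],
    ["world", "index"],
]

postings = defaultdict(set)
for line_no, words in enumerate(tokenized_lines, start=1):  # start=1 replaces i + 1
    for word in words:
        postings[word].add(line_no)  # a set silently ignores repeated line numbers

# Convert to sorted lists so the result has the same shape as the practical's index
inverted_index = {word: sorted(nums) for word, nums in postings.items()}
print(inverted_index)  # {'hello': [1], 'world': [1, 2], 'index': [2]}
```

The set avoids the linear "not in list" check on every word, which matters only once postings lists grow long; for a small file the original if/elif version is just as good.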

Print Index: After the loop finishes, it just prints out the inverted_index dictionary to show you what it built.

Search (Retrieval):

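search_word = input(...): Asks the user to type in a word and lowercases it. The query is not otherwise pre-processed, so a stop word such as "the" is reported as not found because it was never indexed.

if search_word in inverted_index: This is the actual "retrieval." It checks whether the word exists as a key in the dictionary, which takes roughly constant time on average regardless of how many words are indexed.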

If it exists, it prints the value (the list of line numbers) for that key.

If not, it prints a "not found" message.
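The retrieval step can also be wrapped in a small function so it is reusable and testable without input(). This is a minimal sketch: the function name search is hypothetical, and the index literal is copied from the sample output shown earlier.

```python
# Index copied from the sample output above
inverted_index = {
    "first": [1], "word": [1],
    "second": [2], "text": [2], "hello": [2],
    "third": [3],
}

def search(term, index):
    """Return the postings list for term, or an empty list if it is absent."""
    # Normalize the query the same way the practical does (lowercase),
    # and also trim surrounding whitespace
    return index.get(term.strip().lower(), [])

print(search("Hello ", inverted_index))    # [2]
print(search("database", inverted_index))  # []
```

Using dict.get with a default merges the "found" and "not found" branches into one expression; the caller can test for an empty list to print the "not found" message.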

🛠️ Key Libraries & Functions
nltk: Used for tokenizing (word_tokenize) and stop word removal (stopwords).

open('file.txt', 'r'): The standard Python function to open and read a text file.

enumerate(lines): A key function for this practical. It lets you loop over a list and get the index (i) and the item (line) at the same time.

Python Dictionary ({}): The most important part. The entire index is stored in this data structure.

inverted_index[word] = [i + 1]: Creates a new key-value pair.

inverted_index[word].append(i + 1): Adds an item to an existing key's list.

if search_word in inverted_index: The fast way to check whether a key (the word) exists in the index.