# IR Practical No.-2
                           

Implement a program for retrieval of documents using inverted files.

In [1]:
documents = [
    "Information retrieval is the process of obtaining information from a large collection.",
    "Inverted file is an index structure used to retrieve documents efficiently.",
    "Information retrieval systems use inverted file for fast searching of documents.",
    "The inverted index stores mapping from content to its locations in the database."
]

In [2]:
inverted_index = {}

In [3]:
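# Build the index: map each normalized term to the IDs of the documents containing it.
for doc_id, text in enumerate(documents):
    for word in text.lower().split():      # case-fold, then split on whitespace
        word = word.strip('.,')            # drop leading/trailing periods and commas
        if word not in inverted_index:
            inverted_index[word] = []      # first occurrence: start a posting list
        if doc_id not in inverted_index[word]:
            inverted_index[word].append(doc_id)  # record each document only once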


In [4]:
print("Inverted File / Index:\n")
for word in sorted(inverted_index.keys()):
    print(f"{word:15} -> {inverted_index[word]}")

Inverted File / Index:

a               -> [0]
an              -> [1]
collection      -> [0]
content         -> [3]
database        -> [3]
documents       -> [1, 2]
efficiently     -> [1]
fast            -> [2]
file            -> [1, 2]
for             -> [2]
from            -> [0, 3]
in              -> [3]
index           -> [1, 3]
information     -> [0, 2]
inverted        -> [1, 2, 3]
is              -> [0, 1]
its             -> [3]
large           -> [0]
locations       -> [3]
mapping         -> [3]
obtaining       -> [0]
of              -> [0, 2]
process         -> [0]
retrieval       -> [0, 2]
retrieve        -> [1]
searching       -> [2]
stores          -> [3]
structure       -> [1]
systems         -> [2]
the             -> [0, 3]
to              -> [1, 3]
use             -> [2]
used            -> [1]


In [5]:
query = input("\nEnter a term to search: ").lower()
if query in inverted_index:
    print(f"\nDocuments containing '{query}':")
    for doc_id in inverted_index[query]:
        print(f"Doc {doc_id+1}: {documents[doc_id]}")
else:
    print("\nTerm not found in any document.")


Enter a term to search:  the



Documents containing 'the':
Doc 1: Information retrieval is the process of obtaining information from a large collection.
Doc 4: The inverted index stores mapping from content to its locations in the database.


In [6]:
# The code above implements a **document retrieval program using an inverted file (or inverted index)**, 
# which is a fundamental data structure in Information Retrieval (IR).

# The notes below explain each cell and list questions that may come up in the practical exam.

# ---

# ## 💡 Information Retrieval (IR) Practical Explanation

# The goal of the code is to efficiently find which documents contain a specific search term. It does this by building and using an **Inverted Index**.

# ### 1. The Document Collection

# The code starts with the document collection, which is a list of strings (Code Cell 1).

# * **Code:** `documents = [...]`
# * **Concept:** This is the **corpus** or the set of documents that the search system indexes and queries against. 
# In this example, there are four documents (indexed 0 to 3).
#     * Doc 0: "Information retrieval is the process..."
#     * Doc 1: "Inverted file is an index structure..."
#     * ...and so on.
# * **Potential Questions:**
#     * *What is this list?* It's the collection of documents (the corpus) we are indexing.
#     * *How many documents are there?* Four, indexed from 0 to 3.

# ### 2. Initializing the Inverted Index

# A dictionary is initialized to store the index (Code Cell 2).

# * **Code:** `inverted_index = {}`
# * **Concept:** This dictionary is the core of the inverted file.
#     * **Key:** A **term** (a unique word from the documents).
#     * **Value:** A **posting list**, which is a list of the **document IDs** where that term appears.
# * **Potential Questions:**
#     * *What data structure did you use for the inverted index?* A Python **dictionary** (hash map/table).
#     * *Why a dictionary?* Dictionaries provide **fast (average O(1)) lookups** of the words (keys), which is crucial for quick searching.
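
# The key/value layout described above can be sketched with a small hand-built index. The name `toy_index` 
# and its three entries are hypothetical and exist only to show the shape of the data; they are not part of the practical's code.

```python
# Hypothetical index over two imaginary documents (IDs 0 and 1).
toy_index = {
    "inverted": [0, 1],   # term -> posting list of document IDs
    "file": [0],
    "index": [1],
}

# Dictionary lookup by key is O(1) on average.
# .get() returns a default (here an empty list) when the term is absent.
postings = toy_index.get("inverted", [])
missing = toy_index.get("database", [])

print(postings)  # [0, 1]
print(missing)   # []
```

# Using `.get()` with an empty-list default is one way to avoid a separate "is the term present?" branch at query time.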

# ---

# ### 3. Building the Inverted Index (The Indexing Process)

# This is the main logic where the index is constructed (Code Cell 3).

# | Line of Code | Explanation |
# | :--- | :--- |
# | `for doc_id, text in enumerate(documents):` | Iterates through the document list, getting both the **index** (the `doc_id`: 0, 1, 2, 3) and the document **text**. |
# | `for word in text.lower().split():` | **Case-folding and tokenization:** Converts the text to **lowercase** for consistency, then splits it on whitespace into individual words. |
# | `word = word.strip('.,')` | **Text Normalization:** Removes periods and commas from the beginning and end of the word so that, for example, "collection." and "collection" are treated as the same term. Other punctuation is left in place. |
# | `if word not in inverted_index: ...` | If the word is encountered for the first time, creates a new entry (key) in the dictionary with an empty list (`[]`) as its posting list (value). |
# | `if doc_id not in inverted_index[word]: ...` | **Crucial Step:** Checks whether the current document ID is *already* in the word's posting list. |
# | `inverted_index[word].append(doc_id)` | If the `doc_id` is new, it is added to the posting list. This ensures each document ID appears **only once** per term, regardless of how many times the term appears in that document. |
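
# Because `strip('.,')` only removes periods and commas at the ends of a word, other punctuation stays attached to the token. 
# The sketch below compares that approach with a regular-expression tokenizer. The `tokenize` helper and the sample sentence 
# are assumptions made for illustration, not part of the practical's code.

```python
import re

def tokenize(text):
    """Lowercase the text and return each run of ASCII letters/digits as a token."""
    return re.findall(r"[a-z0-9]+", text.lower())

sample = 'The "inverted" index (see above) is fast!'

# Approach used in the practical: split on whitespace, strip '.' and ','.
split_tokens = [w.strip('.,') for w in sample.lower().split()]
# Alternative: keep only alphanumeric runs.
regex_tokens = tokenize(sample)

print(split_tokens)  # ['the', '"inverted"', 'index', '(see', 'above)', 'is', 'fast!']
print(regex_tokens)  # ['the', 'inverted', 'index', 'see', 'above', 'is', 'fast']
```

# The trade-off is that this pattern also breaks hyphenated words and contractions into separate pieces, 
# so whether it is an improvement depends on the corpus.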

# * **Potential Questions:**
#     * *What is **tokenization**?* It is the process of breaking a stream of text into meaningful units, called tokens (words). 
#     Here, `split()` is used.
#     * *Why do you use `.lower()` and `.strip()`?* To perform **text normalization** (or pre-processing). 
#     Lowercasing ensures case-insensitivity ("The" and "the" are the same term). `strip('.,')` removes periods and commas attached to the start or end of a word.
#     * *Why the `if doc_id not in inverted_index[word]:` check?* To ensure the index is a **Boolean Inverted Index**, 
#     where the posting list only contains unique document IDs. If we were building a **Ranked Retrieval** index, 
#     we might record the term frequency instead.
#     * *Show an example of an entry in the index.* The term **`inverted`** appears in document IDs 1, 2, and 3 (shown to the user as Doc 2, Doc 3, and Doc 4), so its entry is: `inverted -> [1, 2, 3]`.
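
# As a rough illustration of the term-frequency idea mentioned above, the sketch below stores a count per document 
# instead of a bare document ID. The names `docs` and `tf_index` and the two shortened sentences are assumptions for this example only.

```python
from collections import defaultdict

# Hypothetical two-document corpus, shortened from the practical's sentences.
docs = [
    "information retrieval is the process of obtaining information",
    "inverted file is an index structure",
]

# term -> {doc_id: number of times the term occurs in that document}
tf_index = defaultdict(dict)
for doc_id, text in enumerate(docs):
    for word in text.lower().split():
        word = word.strip('.,')
        tf_index[word][doc_id] = tf_index[word].get(doc_id, 0) + 1

print(tf_index["information"])  # {0: 2}
print(tf_index["is"])           # {0: 1, 1: 1}
```

# The keys of each inner dictionary still form the Boolean posting list, so this layout can answer the same 
# "which documents contain the term?" question while also keeping the counts a ranking function would need.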

# ### 4. Displaying the Inverted Index

# This part (Code Cell 4) simply prints the generated index in a sorted, readable format. It's for visualization and verification.

# * **Concept:** The output confirms that for every unique word, you have a list of document IDs where it appears.

# ---

# ### 5. Document Retrieval (The Query Process)

# This section handles a user query and returns the matching documents (Code Cell 5).

# | Line of Code | Explanation |
# | :--- | :--- |
# | `query = input(...).lower()` | Takes user input and converts it to lowercase, mirroring the case-folding done during index building. Punctuation and surrounding whitespace are not stripped from the query, so it must be a single bare term. |
# | `if query in inverted_index:` | **The main search step:** Looks up the search term directly in the inverted index dictionary, an **average O(1)** operation. |
# | `for doc_id in inverted_index[query]:` | Iterates through the posting list (the document IDs) associated with the query term. |
# | `print(f"Doc {doc_id+1}: {documents[doc_id]}")` | Retrieves and prints the original document text using the stored `doc_id`. Note that it adds **`+1`** for user readability (displaying Doc 1, Doc 2, etc., instead of Doc 0, Doc 1, etc.). |

# * **Potential Questions:**
#     * *How does the inverted index speed up retrieval?* Instead of scanning every word in every document (a full-text scan) 
#     for the query term, we only perform a **single, fast dictionary lookup** to get the list of relevant document IDs.
#     * *What is the time complexity difference between using an index and not using one?* Without an index, search time is roughly proportional 
#     to the total size of the corpus ($\mathbf{O(N)}$, where $N$ is total text length). With an index, it's a very fast lookup ($\mathbf{O(1)}$ average) 
#     followed by reading the posting list ($\mathbf{O(L)}$, where $L$ is posting list length), making it vastly faster for large collections.
#     * *What happens if the term isn't found?* The `else` block executes, printing "Term not found in any document."
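
# The query cell handles one term at a time. A possible extension, sketched below, is a Boolean AND query that intersects 
# the posting lists of several terms. The helpers `build_index` and `search_all` are hypothetical names introduced for this example; 
# the corpus is the same four sentences used in the practical.

```python
def build_index(docs):
    """Map each normalized term to a sorted list of unique document IDs."""
    index = {}
    for doc_id, text in enumerate(docs):
        for word in text.lower().split():
            word = word.strip('.,')
            postings = index.setdefault(word, [])
            # Doc IDs arrive in increasing order, so checking the last entry is enough.
            if not postings or postings[-1] != doc_id:
                postings.append(doc_id)
    return index

def search_all(index, query):
    """Return sorted IDs of documents that contain every term in the query."""
    terms = [t.strip('.,') for t in query.lower().split()]
    if not terms:
        return []
    result = set(index.get(terms[0], []))
    for term in terms[1:]:
        result &= set(index.get(term, []))  # Boolean AND = set intersection
    return sorted(result)

docs = [
    "Information retrieval is the process of obtaining information from a large collection.",
    "Inverted file is an index structure used to retrieve documents efficiently.",
    "Information retrieval systems use inverted file for fast searching of documents.",
    "The inverted index stores mapping from content to its locations in the database.",
]
index = build_index(docs)

print(search_all(index, "inverted file"))    # [1, 2]
print(search_all(index, "inverted index"))   # [1, 3]
print(search_all(index, "inverted banana"))  # []
```

# The results are 0-based document IDs, so `[1, 2]` corresponds to Doc 2 and Doc 3 in the practical's display format. 
# Replacing `&=` with `|=` would turn the same loop into a Boolean OR query.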

# ---

# ## 🚀 Key Takeaways for the Exam

# 1.  **Definition:** An **Inverted Index** maps terms to the documents they appear in.
# 2.  **Implementation:** The index is implemented as a **dictionary** in Python, where keys are terms and values are lists of document IDs.
# 3.  **Indexing Steps:** The core steps are **Tokenization**, **Normalization** (lowercasing, punctuation removal), and building the 
# **Posting List** (ensuring unique document IDs per term).
# 4.  **Retrieval Efficiency:** The inverted index finds the relevant document IDs with a single **average O(1) dictionary lookup** plus a read of the posting list, 
# avoiding slow linear scans of the entire document set.
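
# To make the last takeaway concrete, the sketch below answers the same query in two ways: a linear scan that re-tokenizes 
# every document at query time, and a lookup in a prebuilt index. The names `tokens`, `linear_scan`, and `index`, and the 
# shortened three-sentence corpus, are assumptions for illustration.

```python
# Hypothetical corpus, shortened from the practical's sentences.
docs = [
    "Information retrieval is the process of obtaining information.",
    "Inverted file is an index structure.",
    "Information retrieval systems use inverted file.",
]

def tokens(text):
    """Same normalization as the indexing loop: lowercase, split, strip '.,'."""
    return [w.strip('.,') for w in text.lower().split()]

def linear_scan(docs, term):
    """Tokenize every document at query time; the work grows with corpus size."""
    return [doc_id for doc_id, text in enumerate(docs) if term in tokens(text)]

# One-time indexing pass; afterwards each query is a dictionary lookup.
index = {}
for doc_id, text in enumerate(docs):
    for word in set(tokens(text)):  # set() so each doc ID is added once per term
        index.setdefault(word, []).append(doc_id)

print(linear_scan(docs, "retrieval"))  # [0, 2]
print(index.get("retrieval", []))      # [0, 2]
```

# Both approaches return the same document IDs; the difference is that the scan repeats the tokenization work for every query, 
# while the index pays that cost once when it is built.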
