# Indexing Techniques in Information Retrieval

In this lab, we demonstrate two important indexing techniques:

1. **Inverted Index** – maps terms to the list of documents in which they appear.  
2. **Positional Inverted Index** – maps terms to documents and positions within the documents.  

We use four short example sentences and preprocess them (lowercasing, punctuation removal, whitespace tokenization) before indexing.



In [1]:
import re

# Sample documents
docs = [
    "the player hit the ball into the field",
    "the bowler delivered a fast ball",
    "the player ran after the ball",
    "batsmen and bowlers play cricket together"
]

# Preprocessing: lowercase, remove punctuation, tokenize
def preprocess(text):
    text = text.lower()
    text = re.sub(r'[^\w\s]', '', text)
    return text.split()

tokenized_docs = [preprocess(doc) for doc in docs]
print("Tokenized Documents:")
for i, tokens in enumerate(tokenized_docs):
    print(f"Doc {i}: {tokens}")


Tokenized Documents:
Doc 0: ['the', 'player', 'hit', 'the', 'ball', 'into', 'the', 'field']
Doc 1: ['the', 'bowler', 'delivered', 'a', 'fast', 'ball']
Doc 2: ['the', 'player', 'ran', 'after', 'the', 'ball']
Doc 3: ['batsmen', 'and', 'bowlers', 'play', 'cricket', 'together']
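
The sample sentences contain no punctuation or uppercase letters, so the lowercasing and regex steps are not actually exercised above. As a minimal sketch, the same `preprocess` function is applied below to a made-up input (the sentence is hypothetical and not part of the lab data) to show what the pattern `[^\w\s]` removes:

```python
import re

def preprocess(text):
    # Same steps as the lab's preprocess(): lowercase, strip punctuation, split
    text = text.lower()
    text = re.sub(r'[^\w\s]', '', text)
    return text.split()

# Hypothetical input, chosen only to exercise the punctuation rule
sample = "The bowler's fast-paced ball!"
tokens = preprocess(sample)
print(tokens)  # ['the', 'bowlers', 'fastpaced', 'ball']
```

Apostrophes and hyphens are deleted rather than replaced by spaces, so `bowler's` becomes `bowlers` and `fast-paced` becomes `fastpaced`. Whether that is desirable depends on the collection, and replacing punctuation with a space is a common alternative.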


## Inverted Index

An **inverted index** maps each term to the list of documents in which it appears (its postings list).  
Finding the documents that contain a given term is then a single dictionary lookup instead of a scan over every document.


In [2]:
from collections import defaultdict

# Create inverted index
inv_index = defaultdict(list)

for doc_id, tokens in enumerate(tokenized_docs):
    for token in set(tokens):  # set(): add each doc_id once per term; term order in the output may vary between runs
        inv_index[token].append(doc_id)

# Convert to regular dict for readability
inv_index = dict(inv_index)

print("Inverted Index:")
for term, doc_ids in inv_index.items():
    print(f"{term}: {doc_ids}")


Inverted Index:
hit: [0]
the: [0, 1, 2]
field: [0]
ball: [0, 1, 2]
player: [0, 2]
into: [0]
a: [1]
delivered: [1]
bowler: [1]
fast: [1]
after: [2]
ran: [2]
together: [3]
cricket: [3]
play: [3]
and: [3]
batsmen: [3]
bowlers: [3]
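
To illustrate how an index like this might be queried, here is a minimal sketch of a Boolean AND query that intersects postings lists. The helper names `build_inverted_index` and `boolean_and` are hypothetical and not part of the lab code. The corpus is repeated so the snippet runs on its own.

```python
from collections import defaultdict

# Same four documents as in the lab, tokenized by whitespace
corpus = [
    "the player hit the ball into the field".split(),
    "the bowler delivered a fast ball".split(),
    "the player ran after the ball".split(),
    "batsmen and bowlers play cricket together".split(),
]

def build_inverted_index(tokenized):
    """Map each term to the ascending list of doc IDs containing it."""
    index = defaultdict(list)
    for doc_id, tokens in enumerate(tokenized):
        for token in set(tokens):
            index[token].append(doc_id)
    return dict(index)

def boolean_and(index, terms):
    """Return the sorted doc IDs that contain every term in `terms`."""
    if not terms:
        return []
    # A term missing from the index has an empty postings list
    postings = [set(index.get(term, ())) for term in terms]
    return sorted(set.intersection(*postings))

index = build_inverted_index(corpus)
print(boolean_and(index, ["player", "ball"]))   # [0, 2]
print(boolean_and(index, ["bowler", "ball"]))   # [1]
print(boolean_and(index, ["cricket", "ball"]))  # []
```

Note that `bowler` and `bowlers` are separate terms here because the preprocessing does no stemming. Set intersection keeps the sketch short. Real systems typically merge sorted postings lists instead.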


## Positional Inverted Index

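A **positional inverted index** stores, for each term, the documents it appears in  
and the positions of the term within each document (token offsets, counted from 0).  

This supports phrase and proximity queries, which need to know whether terms occur next to or near each other, not just in the same document.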


In [3]:
# Create positional inverted index
pos_index = defaultdict(lambda: defaultdict(list))

for doc_id, tokens in enumerate(tokenized_docs):
    for pos, token in enumerate(tokens):
        pos_index[token][doc_id].append(pos)

# Convert nested defaultdict to dict
pos_index = {term: dict(doc_pos) for term, doc_pos in pos_index.items()}

print("Positional Inverted Index:")
for term, doc_pos in pos_index.items():
    print(f"{term}: {doc_pos}")


Positional Inverted Index:
the: {0: [0, 3, 6], 1: [0], 2: [0, 4]}
player: {0: [1], 2: [1]}
hit: {0: [2]}
ball: {0: [4], 1: [5], 2: [5]}
into: {0: [5]}
field: {0: [7]}
bowler: {1: [1]}
delivered: {1: [2]}
a: {1: [3]}
fast: {1: [4]}
ran: {2: [2]}
after: {2: [3]}
batsmen: {3: [0]}
and: {3: [1]}
bowlers: {3: [2]}
play: {3: [3]}
cricket: {3: [4]}
together: {3: [5]}
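
As a rough sketch of how the stored positions enable phrase search, the snippet below checks whether the terms of a phrase occur at consecutive positions in a document. The function names `build_positional_index` and `phrase_search` are hypothetical. This is one straightforward way to do it, not the only one.

```python
from collections import defaultdict

# Same four documents as in the lab, tokenized by whitespace
corpus = [
    "the player hit the ball into the field".split(),
    "the bowler delivered a fast ball".split(),
    "the player ran after the ball".split(),
    "batsmen and bowlers play cricket together".split(),
]

def build_positional_index(tokenized):
    """Map term -> {doc_id: [positions]}, as in the lab code."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, tokens in enumerate(tokenized):
        for pos, token in enumerate(tokens):
            index[token][doc_id].append(pos)
    return {term: dict(postings) for term, postings in index.items()}

def phrase_search(index, phrase):
    """Return sorted doc IDs in which the words of `phrase` appear consecutively."""
    terms = phrase.lower().split()
    if not terms or any(term not in index for term in terms):
        return []
    # Only documents containing every term can match
    candidates = set(index[terms[0]])
    for term in terms[1:]:
        candidates &= set(index[term])
    matches = []
    for doc_id in sorted(candidates):
        later = [set(index[term][doc_id]) for term in terms[1:]]
        # A start position p matches if the k-th following term sits at p + k
        if any(
            all(p + k in positions for k, positions in enumerate(later, start=1))
            for p in index[terms[0]][doc_id]
        ):
            matches.append(doc_id)
    return matches

pos_idx = build_positional_index(corpus)
print(phrase_search(pos_idx, "the ball"))   # [0, 2]
print(phrase_search(pos_idx, "fast ball"))  # [1]
print(phrase_search(pos_idx, "ball the"))   # []
```

Document 1 contains both `the` and `ball` but not adjacently, so a plain inverted index would return it for this query, while the positional check excludes it.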
