<a href="https://colab.research.google.com/github/ShivamThapa243/Information-Retrieval/blob/main/positional_index.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **POSITIONAL INVERTED INDEX**

# Building a Positional Inverted Index

In [1]:
# mount Google Drive to access the preprocessed dataset

from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


**Structure of Positional Inverted Index:**

    {
      token_1: [
          {
            'document_id' : x, 'positions' : [position_1, position_2, position_3]
          },
          {
            'document_id' : y, 'positions' : [position_4, position_5, position_6]
          },
          {
            'document_id' : z, 'positions' : [position_7, position_8, position_9]
          }
      ],
      "caesar": [
          {
            'document_id' : 51, 'positions' : [0, 7, 20]
          },
          {
            'document_id' : 79, 'positions' : [67, 81]
          },
          {
            'document_id' : 103, 'positions' : [4, 12, 51]
          }
      ], ...
    }
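
To make the layout concrete, the following is a minimal sketch of how one might look up where a term occurs in a given document. The helper name `get_positions` and the `toy_index` data are hypothetical and only for illustration; the sketch assumes the `positions` key that the builder function in this notebook writes.

```python
# Hypothetical toy index following the structure above (illustrative data only).
toy_index = {
    "caesar": [
        {"document_id": 51, "positions": [0, 7, 20]},
        {"document_id": 79, "positions": [67, 81]},
    ],
}

def get_positions(index, term, document_id):
    """Return the positions of `term` in `document_id`, or [] if absent."""
    # Each term maps to a list of per-document entries, so scan that list.
    for entry in index.get(term, []):
        if entry["document_id"] == document_id:
            return entry["positions"]
    return []

print(get_positions(toy_index, "caesar", 79))  # [67, 81]
```

A term or document that is missing from the index simply yields an empty list.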

**Function to build a Positional Inverted Index**

In [24]:
import os

def positional_inverted_index_builder(dataset_directory):
  """Build a positional inverted index from the .txt files in dataset_directory."""

  positional_inverted_index = {}

  # sorted() makes the document order deterministic; os.listdir() returns entries in arbitrary order
  for filename in sorted(os.listdir(dataset_directory)):
    if not filename.endswith(".txt"):
      continue

    # read each .txt file in the directory and split it into whitespace-delimited tokens
    file_path = os.path.join(dataset_directory, filename)
    with open(file_path, 'r', encoding='utf-8') as file:
      tokens = file.read().split()

    for position, token in enumerate(tokens):
      postings = positional_inverted_index.setdefault(token, [])
      # Files are processed one at a time, so an entry for the current
      # document, if it exists, is always the last one in the list.
      if postings and postings[-1]['document_id'] == filename:
        postings[-1]['positions'].append(position)
      else:
        postings.append({'document_id': filename, 'positions': [position]})

  return positional_inverted_index
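
As a quick sanity check of the indexing logic, here is a self-contained sketch that builds an index over two tiny files in a temporary directory. The function `build_index_sketch`, the file names, and the sample text are all hypothetical. It is not the notebook's builder: it is a compact variant that groups positions in a nested dictionary first and then converts to the same list-of-entries layout.

```python
import os
import tempfile
from collections import defaultdict

def build_index_sketch(dataset_directory):
    # term -> {document_id -> [positions]}; the nested dict avoids
    # scanning a term's entries to find the current document.
    nested = defaultdict(dict)
    for filename in sorted(os.listdir(dataset_directory)):
        if not filename.endswith(".txt"):
            continue
        path = os.path.join(dataset_directory, filename)
        with open(path, "r", encoding="utf-8") as f:
            for position, token in enumerate(f.read().split()):
                nested[token].setdefault(filename, []).append(position)
    # Convert to the list-of-entries layout used in this notebook.
    return {
        term: [{"document_id": doc, "positions": pos} for doc, pos in docs.items()]
        for term, docs in nested.items()
    }

# Hypothetical sample documents, written to a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "doc1.txt"), "w", encoding="utf-8") as f:
        f.write("to be or not to be")
    with open(os.path.join(tmp, "doc2.txt"), "w", encoding="utf-8") as f:
        f.write("be brave")
    demo_index = build_index_sketch(tmp)

print(demo_index["be"])
# [{'document_id': 'doc1.txt', 'positions': [1, 5]}, {'document_id': 'doc2.txt', 'positions': [0]}]
```

Positions are zero-based token offsets, so the two occurrences of "be" in `doc1.txt` land at 1 and 5.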

**Invoking the positional_inverted_index_builder function to build the positional inverted index**

In [30]:
# directory containing the preprocessed data files
dataset_directory = "/content/drive/MyDrive/Information Retrieval/preprocessed_data"

# build the index by calling positional_inverted_index_builder
positional_inverted_index = positional_inverted_index_builder(dataset_directory)

directory_path = "/content/drive/MyDrive/Information Retrieval"
text_file_name = "positional_inverted_index.txt"

# write the positional inverted index to a text file, one block per term
with open(os.path.join(directory_path, text_file_name), 'w') as file:
    for term in positional_inverted_index:
        file.write(f"{term} : \n")
        for entry in positional_inverted_index[term]:
            file.write(f"\tDocument ID : {entry['document_id']}")
            file.write(f"\tPositions : {entry['positions']}\n")
        file.write("\n")

print("Positional Inverted index created")

Positional Inverted index created
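
Storing positions is what makes phrase queries possible: two terms form a phrase in a document when the second appears exactly one position after the first. The sketch below illustrates this on a toy index in the same layout. The function `phrase_match` and the sample data are hypothetical and not part of the notebook.

```python
def phrase_match(index, first, second):
    """Return document IDs where `second` occurs immediately after `first`."""
    # Map each document containing `first` to its positions for quick lookup.
    first_postings = {e["document_id"]: e["positions"] for e in index.get(first, [])}
    matches = []
    for entry in index.get(second, []):
        first_positions = first_postings.get(entry["document_id"])
        if first_positions is None:
            continue  # `first` does not occur in this document
        second_positions = set(entry["positions"])
        # The phrase matches if some occurrence of `first` is followed by `second`.
        if any(p + 1 in second_positions for p in first_positions):
            matches.append(entry["document_id"])
    return matches

# Hypothetical toy index (illustrative data only).
toy_index = {
    "julius": [{"document_id": "a.txt", "positions": [0, 5]},
               {"document_id": "b.txt", "positions": [3]}],
    "caesar": [{"document_id": "a.txt", "positions": [1]},
               {"document_id": "b.txt", "positions": [7]}],
}

print(phrase_match(toy_index, "julius", "caesar"))  # ['a.txt']
```

Both documents contain both terms, but only `a.txt` has them adjacent, which a non-positional index could not distinguish.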
