<a href="https://colab.research.google.com/github/ShivamThapa243/Information-Retrieval/blob/main/unigram_inverted_index.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# UNIGRAM INVERTED INDEX





# **1. Building the Unigram Inverted Index**

---





In [10]:
# importing files from drive

from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


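Function to build a unigram inverted index

Structure of the inverted index:
*   word1: {count: x, documents: [doc1, doc2, doc3...]}
*   word2: {count: y, documents: [doc3, doc4, doc5...]}
*   ...

Here `count` is the number of documents that contain the word (its document frequency), since each document contributes only its set of unique tokens.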


    

In [11]:
import os

def unigram_inverted_index_builder(dataset_directory):
  unigram_inverted_index = {}
  list_file = os.listdir(dataset_directory)

  # Iterating through each file present in the directory
  for filename in list_file:
    if filename.endswith(".txt"):
      # Reading the content of the file
      file_path = os.path.join(dataset_directory, filename)
      with open(file_path, 'r') as file:
        content = file.read()

      # tokenizing the document to get unique tokens
      content_list = content.split()
      unique_content = set(content_list)

      # Updating the unigram inverted index
      for token in unique_content:
        # if the token is already in the inverted_index
        if token in unigram_inverted_index:
          unigram_inverted_index[token]['count'] += 1
          if filename not in unigram_inverted_index[token]['documents']:
            unigram_inverted_index[token]['documents'].append(filename)
        else:
          unigram_inverted_index[token] = {'count' : 1, 'documents': [filename]}
  return unigram_inverted_index
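
To see the resulting structure without Google Drive, the same idea can be tried on a tiny corpus. The sketch below is an assumption-laden stand-in: `build_index` is a hypothetical compact variant (not the notebook's function), and the two documents are made up for illustration.

```python
import os
import tempfile
from collections import defaultdict

def build_index(directory):
    """Map each token to the .txt files that contain it (hypothetical helper)."""
    postings = defaultdict(set)
    for filename in sorted(os.listdir(directory)):
        if not filename.endswith(".txt"):
            continue
        with open(os.path.join(directory, filename), encoding="utf-8") as fh:
            # set() keeps one entry per token per file
            for token in set(fh.read().split()):
                postings[token].add(filename)
    # 'count' is the document frequency: how many files contain the token
    return {t: {"count": len(d), "documents": sorted(d)} for t, d in postings.items()}

# Made-up two-document corpus written to a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    docs = {"doc1.txt": "red apple red", "doc2.txt": "green apple"}
    for name, text in docs.items():
        with open(os.path.join(tmp, name), "w", encoding="utf-8") as fh:
            fh.write(text)
    toy_index = build_index(tmp)

print(toy_index["apple"])  # {'count': 2, 'documents': ['doc1.txt', 'doc2.txt']}
print(toy_index["red"])    # {'count': 1, 'documents': ['doc1.txt']}
```

Note that "red" occurs twice in doc1.txt but its count is 1, because the count tracks documents, not occurrences.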


Invoking the unigram_inverted_index_builder function to build the inverted index

In [14]:
# location of preprocessed data set which is stored in the google drive
dataset_directory = "/content/drive/MyDrive/Information Retrieval/preprocessed_data"

# calling the builder function by passing the data set location
unigram_inverted_index = unigram_inverted_index_builder(dataset_directory)

# storing the newly generated unigram inverted index into a new text file
directory_name = "/content/drive/MyDrive/Information Retrieval"
text_file_name = "unigram_inverted_index_dataset.txt"
text_file = os.path.join(directory_name, text_file_name)

with open(text_file, 'w') as file:
  for term, info in unigram_inverted_index.items():
    file.write(f"{term}: {info}\n")

print("Unigram inverted index created")

Unigram inverted index created


Sorting the inverted index

In [15]:
# sorting the earlier generated unigram inverted index
sorted_items = sorted(unigram_inverted_index.items(), key = lambda x : x[0])
sorted_inverted_index = dict(sorted_items)

# storing the sorted inverted index into a new text file
sorted_text_file_name = "sorted_unigram_inverted_index_dataset.txt"
sorted_text_file = os.path.join(directory_name, sorted_text_file_name)

with open(sorted_text_file, 'w') as file:
  for term, info in sorted_inverted_index.items():
    file.write(f"{term} : {info}\n")

print("Sorted unigram inverted index created.")

Sorted unigram inverted index created.
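
A small sketch of what the sort does, using made-up terms: `sorted` compares strings by code point, so digits come before uppercase letters and uppercase before lowercase, and `dict` (Python 3.7+) keeps the sorted insertion order.

```python
# Made-up index entries, deliberately out of order
toy_index = {
    "banana": {"count": 1, "documents": ["doc3.txt"]},
    "apple": {"count": 2, "documents": ["doc1.txt", "doc2.txt"]},
    "Zebra": {"count": 1, "documents": ["doc2.txt"]},
    "10": {"count": 1, "documents": ["doc1.txt"]},
}

# Sort (term, info) pairs by term, then rebuild a dict in that order
sorted_toy = dict(sorted(toy_index.items(), key=lambda item: item[0]))

print(list(sorted_toy))  # ['10', 'Zebra', 'apple', 'banana']
```

Since the preprocessed data is presumably lowercased already, the uppercase case may never arise here; it is shown only to make the ordering rule visible.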


# **2. Pickling the Unigram Inverted Index**

---



In [16]:
import pickle

pkl_file_name = "sorted_unigram_inverted_index.pkl"
pkl_file_path = os.path.join(directory_name, pkl_file_name)

with open(pkl_file_path, 'wb') as file:
  pickle.dump(sorted_inverted_index, file)

print("Sorted unigram inverted index pickled.")

Sorted unigram inverted index pickled.
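
A quick round-trip check can confirm that the pickled file restores the same index. This is a minimal sketch on a made-up index written to a temporary directory, not the file on Drive.

```python
import os
import pickle
import tempfile

# Made-up index used only for the round-trip check
toy_index = {
    "apple": {"count": 2, "documents": ["doc1.txt", "doc2.txt"]},
    "red": {"count": 1, "documents": ["doc1.txt"]},
}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_index.pkl")
    with open(path, "wb") as fh:   # pickle files are binary, hence 'wb'
        pickle.dump(toy_index, fh)
    with open(path, "rb") as fh:
        restored = pickle.load(fh)

print(restored == toy_index)  # True
print(list(restored))         # ['apple', 'red']
```

Only unpickle files you created yourself, since `pickle.load` can execute arbitrary code from an untrusted file.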


# **3. Providing support for the query operations:**

---



1.   T1 **AND** T2
2.   T1 **OR** T2
3.   T1 **AND NOT** T2
4.   T1 **OR NOT** T2
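
Each of the four operations maps onto a set operation over the posting lists. The sketch below is one possible reading, assuming that NOT means "every document in the collection that lacks the term"; the helper names (`postings`, `evaluate`) and the toy index are hypothetical.

```python
def postings(index, term):
    """Return the set of documents containing term (empty if the term is unknown)."""
    return set(index.get(term, {}).get("documents", []))

def evaluate(index, all_docs, t1, op, t2):
    """Evaluate a single 'T1 <op> T2' query over the inverted index."""
    a, b = postings(index, t1), postings(index, t2)
    if op == "AND":
        return a & b                 # intersection
    if op == "OR":
        return a | b                 # union
    if op == "AND NOT":
        return a - b                 # difference
    if op == "OR NOT":
        return a | (all_docs - b)    # union with the complement of T2
    raise ValueError(f"unknown operation: {op}")

# Made-up index over three documents
index = {
    "apple": {"count": 2, "documents": ["doc1.txt", "doc2.txt"]},
    "red": {"count": 1, "documents": ["doc1.txt"]},
    "banana": {"count": 1, "documents": ["doc3.txt"]},
}
all_docs = {"doc1.txt", "doc2.txt", "doc3.txt"}

print(sorted(evaluate(index, all_docs, "apple", "AND", "red")))      # ['doc1.txt']
print(sorted(evaluate(index, all_docs, "apple", "AND NOT", "red")))  # ['doc2.txt']
print(sorted(evaluate(index, all_docs, "red", "OR NOT", "apple")))   # ['doc1.txt', 'doc3.txt']
```

OR NOT is the only operation that needs the full document set, because the complement of a posting list cannot be computed from the list alone.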



Function to preprocess the query and return its list of cleaned tokens

In [43]:
import string
import nltk

from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

nltk.download('punkt')
nltk.download('stopwords')

def preprocessing(query):
  # lower case the query
  query = query.lower()

  # tokenizing
  tokens = word_tokenize(query)

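  # punctuation removal: drop tokens made up entirely of punctuation characters
  # (a plain `word not in string.punctuation` is a substring test and misses tokens like "..." )
  tokens = [word for word in tokens if not all(ch in string.punctuation for ch in word)]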

  # stopwords removal
  stop_words = set(stopwords.words('english'))
  tokens = [word for word in tokens if word not in stop_words]

  return tokens

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
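
A note on the punctuation step: `string.punctuation` is a single string, so `token in string.punctuation` is a substring test rather than a per-character check. The sketch below compares the two filters on a hand-written token list (an assumption standing in for tokenizer output, so it runs without NLTK).

```python
import string

# Hand-written tokens, including multi-character punctuation tokens
tokens = ["hello", ",", "world", "!", "''", "..."]

# Substring test: removes "," and "!" but keeps "''" and "..."
substring_filter = [t for t in tokens if t not in string.punctuation]

# Per-character test: removes every token made only of punctuation
per_char_filter = [
    t for t in tokens if not all(ch in string.punctuation for ch in t)
]

print(substring_filter)  # ['hello', 'world', "''", '...']
print(per_char_filter)   # ['hello', 'world']
```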


In [42]:
# loading the pickled file
with open(pkl_file_path, 'rb') as file:
  inverted_index = pickle.load(file)

# input query
print("Enter input sequence: ")
query = input()

# input operations
print("Enter operations (AND, OR, AND NOT, OR NOT) separated by commas:")
operations = input()

# passing the query for preprocessing
preprocessed_query = preprocessing(query)
print(preprocessed_query)

Enter input sequence: 
Hello, WORlD!
Enter operations (AND, OR, AND NOT, OR NOT) separated by commas:
and
['hello', 'world']
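
The notebook stops after preprocessing, so the following is only a sketch of how the comma-separated operations might then be applied to the tokens. It assumes left-to-right evaluation with one operation between each pair of adjacent tokens; `parse_operations`, `run_query`, and the toy index are hypothetical names and data.

```python
def parse_operations(raw):
    """Turn 'and, or  not' into ['AND', 'OR NOT'] (case and spacing normalized)."""
    return [" ".join(part.split()).upper() for part in raw.split(",") if part.strip()]

def run_query(index, all_docs, tokens, operations):
    """Fold the operations over the tokens from left to right."""
    if len(operations) != len(tokens) - 1:
        raise ValueError("need exactly one operation between each pair of tokens")
    result = set(index.get(tokens[0], {}).get("documents", []))
    for op, token in zip(operations, tokens[1:]):
        docs = set(index.get(token, {}).get("documents", []))
        if op == "AND":
            result &= docs
        elif op == "OR":
            result |= docs
        elif op == "AND NOT":
            result -= docs
        elif op == "OR NOT":
            result |= all_docs - docs
        else:
            raise ValueError(f"unknown operation: {op}")
    return result

# Made-up index over three documents
index = {
    "hello": {"count": 2, "documents": ["doc1.txt", "doc2.txt"]},
    "world": {"count": 2, "documents": ["doc2.txt", "doc3.txt"]},
}
all_docs = {"doc1.txt", "doc2.txt", "doc3.txt"}

ops = parse_operations("and")
print(ops)                                                        # ['AND']
print(sorted(run_query(index, all_docs, ["hello", "world"], ops)))  # ['doc2.txt']
```

Checking the operation count up front matters because stopword removal can shrink the token list, leaving it out of step with the operations the user typed.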
