# Aarib Ahmed Vahidy 22K-4004 BAI-6A

## Information Retrieval Assignment # 1

### Inverted Index, Positional Index, Boolean Model 

**Assignment Objective**

The objective of this assignment is to help you understand how different indexes work when
retrieving results for different queries over a collection. You will create an inverted index and a positional
index for a collection of documents to support the Boolean Model of IR. Inverted files and positional
files are the primary data structures for efficiently determining which documents
contain specified terms and in what proximity. You will also learn to process simple Boolean
expression queries through this assignment.

**Datasets**

You are given a collection of abstracts (file name: Abstracts.zip) for implementing the inverted index
and the positional index. This zip file contains 448 abstracts from a computer science journal. All
of these documents are in English. You also need to implement a pre-processing pipeline.
It is recommended to first review the given text files before indexing. You need to treat each file
as a unique document. Reviewing the files offers many clues for your pipeline implementation
and feature extraction.

**Query Processing**

In this assignment, you need to implement an information retrieval model called the Boolean
Information Retrieval Model, with some simplifying assumptions. You need to treat each abstract
(file) as a document and index its content separately. You need to implement a
simplified Boolean user query that can only be formed by joining up to three terms (t1, t2 and t3) with
the Boolean operators AND, OR, and NOT. For example, a user query may be of the form (t1 AND
t2 AND t3). For positional queries, the query text contains “/” along with a number k, and is intended to return
all documents that contain t1 and t2, k words apart on either side of the text.

**Basic Assumptions for the Boolean Retrieval Model**

1. An index term (word) is either present (1) or absent (0) in the document. A dictionary
   contains all index terms.
2. All index terms provide equal evidence with respect to information needs. (No frequency
   count is necessary, but it may be in the next assignment.)
3. Queries are Boolean combinations of index terms (at most 3).
4. Boolean operators (AND, OR and NOT) are allowed. For example:
   - X AND Y: represents documents that contain both X and Y
   - X OR Y: represents documents that contain either X or Y
   - NOT X: represents documents that do not contain X
5. Queries of the type X Y / 3 represent documents that contain both X and Y, 3 words apart.

As discussed during the lectures, we will implement a Boolean Model by creating a posting
list of all the terms present in the documents. You are free to implement the posting list with the
data structures of your choice. Preprocessing of the document text is limited to
tokenization, with case folding, stop-word removal, and stemming. The stop-word
list is also provided in the assignment files. Your query processing routine must handle
query parsing, evaluation of the cost, and execution to fetch the required list of
documents. A simple command line interface is required to demonstrate the working model. You
are also provided with a set of 10 queries for evaluating your implementation.

Coding can be done in Java, Python, C/C++ or C#. There are
additional marks for an intuitive GUI that demonstrates the working Boolean Model along with
phrase query search.

**Files Provided with this Assignment**

1. Abstracts
2. Stop-words list as a single file
3. Queries result set (gold standard, 10 example queries)

**Evaluation / Grading Criteria**

Grading will be based on the implementation, the query responses, and how well they match
the gold standard (provided query set).

Grading criteria:

1. Preprocessing (2 marks)
2. Formation of inverted and positional indexes (1 mark for code complexity, 1 mark for saving and
   loading the indexes)
3. Simple Boolean queries (2 marks)
4. Complex Boolean queries (2 marks)
5. Proximity queries (2 marks)
6. Bonus: GUI (1 mark for making the GUI, 1 mark for a user-friendly GUI)

Clean, well-commented code will earn 2 more marks.

<The End>

## Loading the documents

In [117]:
#Importing necessary libraries
import nltk
import pandas as pd
import numpy as np
import os
import json
import re
print("All libraries have been imported successfully")

All libraries have been imported successfully


In [54]:
import nltk
nltk.download('punkt_tab')
nltk.download('punkt')
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\aarib\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\aarib\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [55]:
#Setting dataset path and checking the abstracts.rar files
dataset_path = r"C:\Users\aarib\6thSemester\IRAssignment\Dataset\Abstracts"

print(os.listdir(dataset_path)[:5])

['1.txt', '10.txt', '100.txt', '101.txt', '102.txt']


In [56]:
sample_file = os.listdir(dataset_path)[0]  #Picking the first file
sample_path = os.path.join(dataset_path, sample_file)

with open(sample_path, "r", encoding="utf-8") as f:
    content = f.read()

print(content[:100])  #Only printing first 100 characters of first file to just check and avoid clutter

Ensemble Statistical and Heuristic Models for Unsupervised Word Alignment

statistical word alignmen


In [57]:
#List to store document contents which were provided in the abstracts.rar file.
documents = []

#Reading each .txt file in the folder
for filename in os.listdir(dataset_path): #dataset_path defined above
    if filename.endswith(".txt"):
        file_path = os.path.join(dataset_path, filename)
        with open(file_path, "r", encoding="windows-1252") as file: 
            #encoding = "utf-8" did not work so tried "latin-1" which worked but "windows-1252" worked best according to the documents given. 
            documents.append(file.read())

#Checking if loading was a success and the number of documents loaded
print(f"Total documents loaded: {len(documents)}")
print("First Sample document:\n", documents[0][:1500])

Total documents loaded: 448
First Sample document:
 Ensemble Statistical and Heuristic Models for Unsupervised Word Alignment

statistical word alignment, ensemble learning, heuristic word alignment

Statistical word alignment models need large amount of training data while they are weak in small-size corpora. This paper proposes a new approach of unsupervised hybrid word alignment technique using ensemble learning method. This algorithm uses three base alignment models in several rounds to generate alignments. The ensemble algorithm uses a weighed scheme for resampling training data and a voting score to consider aggregated alignments. The underlying alignment algorithms used in this study include IBM Model 1, 2 and a heuristic method based on Dice measurement. Our experimental results show that by this approach, the alignment error rate could be improved by at least %15 for the base alignment models.


## Preprocessing the Documents

1. **Tokenization** (Splitting text into words)
2. **Special Character Handling** (Replacing special characters such as /,_ with a space ' ' and storing hyphenated words both separately and in base form)
3. **Case Folding** (Converting text into lowercase)  
4. **Stop word removal** (Removing common words)  
   Stop words list provided: *a, is, the, of, all, and, to, can, be, as, once, for, at, am, are, has, have, had, up, his, her, in, on, no, we, do*  
5. **Stemming** (Reducing words to their base form)
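
The five steps above can be sketched on a single sentence using only the standard library. This is a rough illustration, not the notebook's pipeline: the regex tokenizer and the `toy_stem` suffix stripper are hypothetical stand-ins for NLTK's `word_tokenize` and `PorterStemmer`, so punctuation is dropped here (the notebook keeps punctuation tokens), hyphens are only split rather than kept in both forms, and the stems will differ from Porter's.

```python
import re

# Stop-word list as given in the notebook.
STOPWORDS = {
    "a", "is", "the", "of", "all", "and", "to", "can", "be", "as", "once",
    "for", "at", "am", "are", "has", "have", "had", "up", "his", "her",
    "in", "on", "no", "we", "do",
}

def toy_stem(word):
    """Crude suffix stripper (assumption: stand-in for PorterStemmer, not equivalent to it)."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    text = re.sub(r"[/_-]", " ", text)                   # special characters become spaces
    tokens = re.findall(r"[A-Za-z0-9]+", text)           # simple regex tokenizer (assumption)
    tokens = [t.lower() for t in tokens]                 # case folding
    tokens = [t for t in tokens if t not in STOPWORDS]   # stop-word removal
    return [toy_stem(t) for t in tokens]                 # stemming

print(preprocess("Clustering/Classification of the Time-Series patterns"))
# ['cluster', 'classification', 'time', 'seri', 'pattern']
```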

In [113]:
#Custom stop words as provided by Stopword-list.txt
my_stopwords = set([
    "a", "is", "the", "of", "all", "and", "to", "can", "be", "as", "once", 
    "for", "at", "am", "are", "has", "have", "had", "up", "his", "her", 
    "in", "on", "no", "we", "do"
])

#Initializing Stemmer
stemmer = PorterStemmer()

#Dictionary to store the processed documents
processed_docs = {} #It will store filename as key and preprocessed word list as the value.

#Reading and processing each document
for filename in os.listdir(dataset_path):
    if filename.endswith(".txt"):
        file_path = os.path.join(dataset_path, filename)
        with open(file_path, "r", encoding="windows-1252") as file:
            text = file.read()

        #Handling hyphenated words:
        #A copy of the text with hyphens replaced by spaces (e.g., "time series") comes first,
        #followed by the original text (e.g., "time-series"), so both variants are indexed.
        #Note: this duplicates the whole document, so every term is indexed twice and the
        #positions in the second copy are offset by the length of the first copy.
        #This preprocessing is necessary so that my results match those of the golden set. e.g in query 4
        text = text.replace("-", " ") + " " + text  #Ensures both variants are indexed.

        #Handling special characters (/, _, etc.). More can be added to the regex if needed, but these two are enough here.
        #Replacing '/', '_' with spaces to split words properly
        #This preprocessing is necessary so that my results match those of the golden set. e.g in query 9
        text = re.sub(r'[/_]', ' ', text)  #Converts "classification/clustering" → "classification clustering"
        
        #Tokenization (Splits the text into individual words)
        tokens = word_tokenize(text)
        
        #Case Folding (Converts all words to their lowercase equivalent.)
        tokens = [word.lower() for word in tokens]
        
        #Stop-word Removal (Only keeps words that are not in our stop-word list)
        filtered_tokens = [word for word in tokens if word not in my_stopwords]
        
        #Stemming (Stems words to their root forms for example running becomes run)
        stemmed_tokens = [stemmer.stem(word) for word in filtered_tokens]
        
        #Store processed document as key value pair where filename is key and list of processed words in value.
        processed_docs[filename] = stemmed_tokens
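
The effect of the hyphen-handling line is easiest to see on a tiny string. The sketch below repeats that one step, with `str.split()` as an assumed stand-in for `word_tokenize` and a hypothetical helper name. Because the original text is appended after the hyphen-free copy, every token occurs twice, which is why each term in the positional index further down has positions in two halves of the document.

```python
def index_both_variants(text):
    """Repeat the notebook's hyphen step; str.split() stands in for word_tokenize."""
    combined = text.replace("-", " ") + " " + text
    return combined.split()

tokens = index_both_variants("time-series data")
print(tokens)
# ['time', 'series', 'data', 'time-series', 'data']

# "data" is now indexed at two positions, one per copy of the text.
positions_of_data = [i for i, tok in enumerate(tokens) if tok == "data"]
print(positions_of_data)
# [2, 4]
```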

### Checking if my preprocessing was successful

In [59]:
#Displaying a sample to check
print(f"Processed {len(processed_docs)} documents successfully!")
sample_doc = list(processed_docs.keys())[0]
print(f"Sample processed document ({sample_doc}):\n", processed_docs[sample_doc][:1500]) #Displaying the first 1500 tokens to check; can be compared with the raw sample printed earlier.

Processed 448 documents successfully!
Sample processed document (1.txt):
 ['ensembl', 'statist', 'heurist', 'model', 'unsupervis', 'word', 'align', 'statist', 'word', 'align', ',', 'ensembl', 'learn', ',', 'heurist', 'word', 'align', 'statist', 'word', 'align', 'model', 'need', 'larg', 'amount', 'train', 'data', 'while', 'they', 'weak', 'small', 'size', 'corpora', '.', 'thi', 'paper', 'propos', 'new', 'approach', 'unsupervis', 'hybrid', 'word', 'align', 'techniqu', 'use', 'ensembl', 'learn', 'method', '.', 'thi', 'algorithm', 'use', 'three', 'base', 'align', 'model', 'sever', 'round', 'gener', 'align', '.', 'ensembl', 'algorithm', 'use', 'weigh', 'scheme', 'resampl', 'train', 'data', 'vote', 'score', 'consid', 'aggreg', 'align', '.', 'underli', 'align', 'algorithm', 'use', 'thi', 'studi', 'includ', 'ibm', 'model', '1', ',', '2', 'heurist', 'method', 'base', 'dice', 'measur', '.', 'our', 'experiment', 'result', 'show', 'that', 'by', 'thi', 'approach', ',', 'align', 'error', 'rate', 'cou

In [60]:
#Checking random documents to make sure the preprocessing worked.
import random
sample_doc = random.choice(list(processed_docs.keys()))
print(f"Sample processed document ({sample_doc}):\n", processed_docs[sample_doc][:100])

Sample processed document (223.txt):
 ['neural', 'network', 'base', 'kidney', 'segment', 'from', 'mr', 'imag', ':', 'preliminari', 'result', 'kidney', 'segment', ',', 'neural', 'network', ',', 'mr', 'imag', 'autom', 'robust', 'kidney', 'segment', 'from', 'medic', 'imag', 'sequenc', 'veri', 'difficult', 'task', 'particularli', 'becaus', 'gray', 'level', 'similar', 'adjac', 'organ', ',', 'partial', 'volum', 'effect', 'inject', 'contrast', 'media', '.', 'addit', 'these', 'difficulti', ',', 'variat', 'kidney', 'shape', ',', 'posit', 'gray', 'level', 'make', 'autom', 'identif', 'segment', 'kidney', 'harder', '.', 'also', ',', 'differ', 'imag', 'characterist', 'with', 'differ', 'scanner', 'much', 'more', 'increas', 'difficulti', 'segment', 'task', '.', 'therefor', ',', 'thi', 'paper', ',', 'present', 'an', 'autom', 'kidney', 'segment', 'method', 'by', 'use', 'multi', 'layer', 'perceptron', 'base', 'approach', 'that', 'adapt', 'paramet', 'accord']


## Building the Postings List (Inverted Index)

Data structure used: dictionary of lists (hash map), with average-case O(1) time complexity for insertion and lookup of a term

The keys are unique words (terms from documents)

The values are lists of document filenames where the word appears
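
As a minimal sketch of this structure, here is the same idea on two made-up documents. The helper name `build_inverted_index` and the toy token lists are assumptions for illustration; `defaultdict(set)` is used only to shorten the "create the set if the term is new" step.

```python
from collections import defaultdict

def build_inverted_index(processed_docs):
    """Map each term to the sorted list of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, tokens in processed_docs.items():
        for token in tokens:
            index[token].add(doc_id)   # a set ignores repeated occurrences
    return {term: sorted(doc_ids) for term, doc_ids in index.items()}

# Toy data, not taken from the Abstracts collection.
toy_docs = {
    "1.txt": ["statist", "word", "align"],
    "2.txt": ["neural", "network", "word"],
}
toy_index = build_inverted_index(toy_docs)
print(toy_index["word"])     # ['1.txt', '2.txt']
print(toy_index["neural"])   # ['2.txt']
```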

In [61]:
#Dictionary to store the inverted index/posting list
posting_list = {}

#Constructing the posting list
for filename, words in processed_docs.items():
    doc_id = filename  #Using the given filenames as the documentID
    for word in set(words):  # Using a set to avoid duplicate entries
        if word not in posting_list:
            posting_list[word] = set()  #Using a set to store unique doc IDs, will only make new posting list if word is new, not encountered before.
        posting_list[word].add(doc_id) #Adding the document ID where word appeared to the posting list for that word.

#Converting sets to sorted lists (JSON cannot serialize sets, and sorting keeps the output consistent)
for word in posting_list:
    posting_list[word] = sorted(posting_list[word])  #Sorted lexicographically by filename, so '102.txt' comes before '14.txt'

In [63]:
#Checking for a specific word, e.g., "statist" (which was "statistical" before preprocessing)
sample_term = "statist"
if sample_term in posting_list:
    print(f"Sample Posting List Entry: '{sample_term}' appears in: {posting_list[sample_term]}")
else:
    print(f"The term '{sample_term}' is not found in the posting list.")

Sample Posting List Entry: 'statist' appears in: ['1.txt', '102.txt', '112.txt', '115.txt', '116.txt', '121.txt', '128.txt', '14.txt', '145.txt', '147.txt', '15.txt', '156.txt', '158.txt', '17.txt', '170.txt', '190.txt', '193.txt', '194.txt', '202.txt', '204.txt', '208.txt', '228.txt', '255.txt', '269.txt', '283.txt', '319.txt', '320.txt', '336.txt', '343.txt', '355.txt', '368.txt', '370.txt', '385.txt', '393.txt', '405.txt', '41.txt', '42.txt', '429.txt', '43.txt', '430.txt', '438.txt', '445.txt', '447.txt', '448.txt', '60.txt', '71.txt', '72.txt', '92.txt']
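
The output above is ordered lexicographically because the document IDs are filename strings. If numeric order were wanted inside the posting lists, one option is a sort key that strips the extension. This is a sketch with a hypothetical helper, assuming every name has the form `<number>.txt`.

```python
import os

def numeric_doc_key(filename):
    """Sort key: '102.txt' -> 102 (assumes names of the form '<number>.txt')."""
    return int(os.path.splitext(filename)[0])

postings = ["1.txt", "102.txt", "14.txt", "15.txt"]
print(sorted(postings))                        # ['1.txt', '102.txt', '14.txt', '15.txt']
print(sorted(postings, key=numeric_doc_key))   # ['1.txt', '14.txt', '15.txt', '102.txt']
```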


In [64]:
#Saving the inverted index/posting list
with open("inverted_index.json", "w") as file:
    json.dump(posting_list, file, indent=4)
print("The inverted Index was saved successfully")

The inverted Index was saved successfully


In [65]:
#Loading the previously saved inverted index
with open("inverted_index.json", "r") as file:
    loaded_inverted_index = json.load(file)

print("The Inverted Index was loaded successfully!")

The Inverted Index was loaded successfully!


## Building the Positional Index 

Data structure used: nested dictionary of lists (hash map), with average-case O(1) time complexity for insertion and lookup of a term

The outer dictionary's key is a unique word

The inner dictionary's key is the document ID (filename.txt)

The inner dictionary's value is a list of positions where the word appears in that document
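
A small sketch of the nested layout on made-up data follows. The helper name and the toy token lists are assumptions; a nested `defaultdict` replaces the two explicit "create if missing" checks.

```python
from collections import defaultdict

def build_positional_index(processed_docs):
    """Map term -> {doc_id -> [positions]} with 0-based token positions."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, tokens in processed_docs.items():
        for position, token in enumerate(tokens):
            index[token][doc_id].append(position)
    # Convert to plain dicts so the result prints and serializes cleanly.
    return {term: dict(docs) for term, docs in index.items()}

# Toy data, not taken from the Abstracts collection.
toy_docs = {
    "1.txt": ["statist", "word", "align", "statist"],
    "2.txt": ["word", "align"],
}
toy_positional = build_positional_index(toy_docs)
print(toy_positional["statist"])   # {'1.txt': [0, 3]}
print(toy_positional["word"])      # {'1.txt': [1], '2.txt': [0]}
```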

In [66]:
#Dictionary to store the positional index
positional_index = {}

#Constructing the positional index
for filename, words in processed_docs.items():
    doc_id = filename #Using the given filename as the documentID
    for position, word in enumerate(words): #Tracking the position of each word
        if word not in positional_index:
            positional_index[word] = {} #Creating a new dictionary for each word

        if doc_id not in positional_index[word]:
            positional_index[word][doc_id] = [] #Creating a list for positions
            
        positional_index[word][doc_id].append(position) #Storing the position

In [68]:
#Checking for specific word for example statist (statistical before processing)
sample_term = "statist"
if sample_term in positional_index:
    print(f"\nPositional Index Entry for '{sample_term}':")
    for doc, positions in positional_index[sample_term].items():
        print(f"Document {doc}: Positions {positions}")
else:
    print(f"\nThe term '{sample_term}' is not found in the positional index.")


Positional Index Entry for 'statist':
Document 1.txt: Positions [1, 7, 17, 115, 121, 131]
Document 102.txt: Positions [16, 100, 108, 141, 225, 233]
Document 112.txt: Positions [140, 337]
Document 115.txt: Positions [0, 87]
Document 116.txt: Positions [0, 5, 146, 151]
Document 121.txt: Positions [131, 192, 355, 416]
Document 128.txt: Positions [116, 144, 276, 304]
Document 14.txt: Positions [56, 165]
Document 145.txt: Positions [84, 272]
Document 147.txt: Positions [8, 191]
Document 15.txt: Positions [2, 4, 21, 96, 114, 116, 133, 206]
Document 156.txt: Positions [0, 17, 60, 244, 261, 304]
Document 158.txt: Positions [42, 219]
Document 17.txt: Positions [60, 173]
Document 170.txt: Positions [149, 357]
Document 190.txt: Positions [49, 266]
Document 193.txt: Positions [176, 406]
Document 194.txt: Positions [51, 71, 209, 229]
Document 202.txt: Positions [100, 226]
Document 204.txt: Positions [0, 17, 96, 129, 145, 162, 241, 270]
Document 208.txt: Positions [115, 187, 310, 377]
Document 228.

In [69]:
#Saving the Positional Index
with open("positional_index.json", "w") as file:
    json.dump(positional_index, file, indent=4)
print("The Positional Index was saved successfully")

The Positional Index was saved successfully


In [70]:
#Loading the previously saved positional index
with open("positional_index.json", "r") as file:
    loaded_positional_index = json.load(file)
print("The Positional Index was loaded successfully!")

The Positional Index was loaded successfully!
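
A quick way to confirm that saving and loading preserve an index is a round-trip comparison. The sketch below uses a toy index and a temporary directory so that it does not touch the real JSON files. JSON object keys are always strings and JSON has no set type, which suits the indexes here because document IDs are filename strings and posting lists are stored as lists.

```python
import json
import os
import tempfile

# Toy positional index with the same shape as the notebook's.
index = {"statist": {"1.txt": [1, 7], "102.txt": [16]}}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "positional_index.json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump(index, f)
    with open(path, "r", encoding="utf-8") as f:
        reloaded = json.load(f)

print(reloaded == index)   # True
```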


## Implementing Simple Boolean Queries

In [71]:
def boolean_query(term1, operator1=None, term2=None, operator2=None, term3=None):
    """Processes Boolean queries with up to three terms using the posting list.
        Query terms are stemmed so that they match the processed terms in the posting list.
        Supports single-term, two-term, and three-term queries.
        Three-term queries are evaluated left to right: operator1 gives an interim result,
        then operator2 is applied to that result and term3.
        Returns a sorted list of integer document IDs (the '.txt' extension is stripped).
    """

    #Stemming the received terms so they match the postings format.
    term1 = stemmer.stem(term1)
    term2 = stemmer.stem(term2) if term2 else None
    term3 = stemmer.stem(term3) if term3 else None  

    #Universal Set of all document IDs
    all_docs = set(doc for docs in posting_list.values() for doc in docs)

    #Fetching the posting lists for the terms
    docs_term1 = set(posting_list.get(term1, []))
    docs_term2 = set(posting_list.get(term2, [])) if term2 else set() #Treating none case as an empty set
    docs_term3 = set(posting_list.get(term3, [])) if term3 else set()

    #Handling a single-term query
    if not operator1:
        return sorted([int(doc.replace('.txt', '')) for doc in docs_term1])

    #Performing the first operation (between term1 and term2) and getting an interim result
    if operator1 == "AND":
        result1 = docs_term1 & docs_term2 #Boolean AND (Intersection)
    elif operator1 == "OR":
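        result1 = docs_term1 | docs_term2 #Boolean OR (Union)
    elif operator1 == "NOT":
        #With no term2, NOT complements term1 using the universal set; with term2, it excludes term2 (same meaning as operator2 below)
        result1 = docs_term1 - docs_term2 if term2 else all_docs - docs_term1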
    else:
        print("The entered operator1 is not valid boolean, use 'AND'/'OR'/'NOT'")
        return []

    #If only two terms were provided, we will directly return result1
    if not operator2:
        return sorted([int(doc.replace('.txt', '')) for doc in result1])

    #Performing the second operation(between result1 which was computed and term3 which was passed to the function) and getting the final result
    if operator2 == "AND":
        final_result = result1 & docs_term3 #Boolean AND (Intersection)
    elif operator2 == "OR":
        final_result = result1 | docs_term3 #Boolean OR (Union)
    elif operator2 == "NOT":
        final_result = result1 - docs_term3 #Boolean NOT (Exclusion)
    else:
        print("The entered operator2 is not valid boolean, use 'AND'/'OR'/'NOT'")
        return []

    return sorted([int(doc.replace('.txt', '')) for doc in final_result])
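
The assignment also asks for query parsing, while `boolean_query` takes its terms and operators as separate arguments. A minimal sketch of a parser for query strings is shown below. `parse_boolean_query` is a hypothetical helper, not part of the notebook, and it assumes the fixed shape "term", "term OP term" or "term OP term OP term".

```python
OPERATORS = {"AND", "OR", "NOT"}

def parse_boolean_query(query):
    """Split 't1 OP t2 [OP t3]' into a flat argument list (hypothetical helper)."""
    tokens = query.split()
    # Valid shapes: 1 token (term), 3 tokens (term op term), 5 tokens (term op term op term).
    if len(tokens) not in (1, 3, 5):
        raise ValueError(f"Expected 1 to 3 terms joined by operators, got: {query!r}")
    args = []
    for i, token in enumerate(tokens):
        if i % 2 == 0:
            args.append(token)            # even slots hold terms
        else:
            operator = token.upper()      # odd slots hold operators
            if operator not in OPERATORS:
                raise ValueError(f"Unknown operator: {token!r}")
            args.append(operator)
    return args

print(parse_boolean_query("time AND series or classification"))
# ['time', 'AND', 'series', 'OR', 'classification']
```

With the notebook's function in scope, the parsed list could be unpacked directly, for example `boolean_query(*parse_boolean_query("time AND series"))`.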

### Checking the function

In [72]:
print(boolean_query("statist", "AND", "word", "AND", "align")) #Performing the stemming myself
print(boolean_query("statistical", "AND", "word", "AND", "alignment")) #Sending the natural words to check if function stems and outputs match.

[1]
[1]


In [73]:
print(boolean_query("mixture", "OR", "gamma", "OR", "svm"))

[9, 12, 35, 46, 67, 84, 107, 117, 122, 123, 124, 125, 126, 128, 137, 155, 161, 165, 167, 169, 171, 187, 198, 216, 237, 240, 280, 283, 285, 289, 291, 292, 305, 319, 321, 368, 380, 400, 427, 428, 439, 443, 447]


In [74]:
print(boolean_query("dimensionality", "AND", "reduction", "NOT", "security"))

[15, 17, 42, 106, 136, 240, 268, 287, 353, 354]


In [75]:
print(boolean_query("housing", "AND", "pricing", "OR", "svm"))

[9, 12, 67, 84, 107, 122, 123, 124, 125, 126, 128, 137, 155, 161, 167, 169, 171, 237, 280, 283, 285, 291, 292, 305, 319, 321, 324, 368, 400, 427, 428, 437, 439, 447]
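
Mixed queries such as the one above are evaluated strictly left to right, so "A AND B OR C" is read as "(A AND B) OR C". The order matters when OR comes first: "A OR B AND C" becomes "(A OR B) AND C", which can differ from the conventional reading in which AND binds tighter. The toy sketch below shows the difference on small made-up sets. The helper is hypothetical and handles only AND and OR.

```python
def evaluate_left_to_right(first, op1, second, op2, third):
    """Apply op1, then op2, to three sets of document IDs (AND/OR only)."""
    combine = {
        "AND": lambda x, y: x & y,
        "OR": lambda x, y: x | y,
    }
    return combine[op2](combine[op1](first, second), third)

# Toy posting sets, not taken from the Abstracts collection.
a, b, c = {1}, {2}, {2, 3}

print(evaluate_left_to_right(a, "OR", b, "AND", c))   # {2}     -> (a OR b) AND c
print(a | (b & c))                                     # {1, 2}  -> a OR (b AND c)
```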


## Checking Golden Queries for Simple Boolean Queries (Query Number 1 to 10)

In [76]:
#1.
print(f"Result Set: {boolean_query('image', 'AND', 'restoration')}") #Matching

Result Set: [359, 375]


In [77]:
#2.
print(f"Result Set: {boolean_query('deep', 'AND', 'learning')}") #Not matching: "learning" is stemmed to "learn", which matches more documents.
#Document 176 contains both "deep" and "learning" and is returned here, but it is not in the golden set; the reason is unclear.

Result Set: [23, 24, 174, 175, 176, 177, 213, 245, 247, 250, 254, 258, 267, 272, 273, 278, 279, 281, 325, 333, 345, 346, 347, 348, 352, 357, 358, 360, 362, 371, 373, 374, 375, 380, 381, 382, 396, 397, 401, 404, 405, 415, 421, 432, 444]


In [78]:
#3.
print(f"Result Set: {boolean_query('autoencoders')}") #Matching

Result Set: [187, 273, 279, 325, 333, 405]


In [79]:
#4.
print(f"Result Set: {boolean_query('temporal', 'AND', 'deep', 'AND', 'learning')}") #Matching

Result Set: [279, 358, 373, 405]


In [80]:
#5.
print(f"Result Set: {boolean_query('time', 'AND', 'series')}") #Matching, required handling hyphenated words

Result Set: [40, 54, 110, 111, 112, 113, 158, 163, 173, 180, 181, 202, 220, 237, 238, 239, 240, 258, 277, 283, 295, 305, 350, 405, 421, 437, 438, 445]


In [81]:
#6.
print(f"Result Set: {boolean_query('time', 'AND', 'series', 'AND', 'classification')}") #Matching

Result Set: [40, 237, 283, 445]


In [82]:
#7.
print(f"Result Set: {boolean_query('time', 'AND', 'series', 'OR', 'classification')}") #Matching

Result Set: [4, 6, 9, 10, 16, 22, 24, 33, 34, 38, 40, 43, 45, 46, 49, 51, 54, 55, 56, 58, 59, 60, 63, 64, 66, 67, 71, 73, 75, 76, 77, 80, 84, 85, 94, 95, 98, 99, 106, 107, 110, 111, 112, 113, 120, 121, 122, 123, 125, 126, 128, 140, 143, 147, 158, 163, 164, 165, 167, 168, 169, 171, 173, 174, 175, 176, 177, 180, 181, 182, 187, 193, 197, 198, 202, 208, 210, 213, 215, 220, 228, 229, 234, 235, 236, 237, 238, 239, 240, 245, 247, 248, 249, 252, 256, 258, 259, 261, 265, 268, 272, 273, 277, 280, 283, 286, 287, 289, 295, 299, 302, 303, 305, 310, 313, 316, 317, 321, 327, 328, 334, 338, 341, 345, 348, 350, 352, 353, 354, 357, 363, 369, 371, 375, 377, 378, 382, 384, 385, 386, 387, 395, 397, 404, 405, 408, 420, 421, 424, 425, 427, 432, 437, 438, 439, 442, 445]


In [83]:
#8.
print(f"Result Set: {boolean_query('pattern')}") #Matching

Result Set: [9, 10, 18, 21, 23, 26, 30, 34, 40, 50, 73, 118, 126, 127, 139, 145, 148, 155, 180, 186, 189, 194, 201, 209, 214, 216, 230, 231, 234, 238, 279, 280, 288, 326, 343, 350, 351, 368, 369, 383, 394, 406, 412, 413, 424, 425, 429, 446, 447]


In [84]:
#9.
print(f"Result Set: {boolean_query('pattern', 'AND', 'clustering')}")
#Document 40 was initially missing because it contains "Classification/Clustering", which was treated as a single token, so "clustering" was not indexed.
#After preprocessing was updated to split on '/', document 40 matches.

Result Set: [40, 73, 180, 216, 326, 350, 351, 413, 446]


In [87]:
#10.
print(f"Result Set: {boolean_query('pattern', 'AND', 'clustering', 'AND', 'heart')}") #Matching.

Result Set: [73]


## Implementing Proximity Queries

In [109]:
def proximity_query(term1, term2, k, positional_index):
    """Finds documents where term1 and term2 occur with at most k words between them,
    i.e., their positions differ by at most k+1 (in either order).
    Queries of the type X Y / 3 represent documents that contain both X and Y, 3 words apart."""
    
    #Stemming terms to match stored format
    term1 = stemmer.stem(term1)
    term2 = stemmer.stem(term2)

    #Get document position lists for each term from positional index that we have created previously
    if term1 not in positional_index or term2 not in positional_index:
        return []  #If either term is missing matching is not possible

    docs_term1 = positional_index[term1]
    docs_term2 = positional_index[term2]

    matching_docs = []

    #Check for common documents containing both terms so that we check for positions in those documents only which contain both terms.
    common_docs = set(docs_term1.keys()) & set(docs_term2.keys())

    for doc in common_docs:
        positions1 = docs_term1[doc]  #Positions of term1
        positions2 = docs_term2[doc]  #Positions of term2

        #Two-pointer technique to check if any position satisfies the proximity condition
        i, j = 0, 0
        while i < len(positions1) and j < len(positions2):
            if abs(positions1[i] - positions2[j]) <= k+1:
                matching_docs.append(int(doc.replace('.txt', ''))) #Add document to list as terms appear within k of each other.
                break  #If one match is found, we don't need to check further
            elif positions1[i] < positions2[j]: #if i is behind move i forward or else move j forward
                i += 1
            else:
                j += 1

    return sorted(matching_docs)
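
The two-pointer scan at the core of this function can be tried in isolation on toy position lists. The sketch below uses a hypothetical helper that takes the maximum allowed position difference directly (the notebook passes `k+1`). It relies on both lists being sorted in ascending order, which holds for the positional index because positions are appended in reading order.

```python
def within_distance(positions1, positions2, max_gap):
    """Return True if some pair of positions differs by at most max_gap."""
    i = j = 0
    while i < len(positions1) and j < len(positions2):
        if abs(positions1[i] - positions2[j]) <= max_gap:
            return True
        # Advance the pointer at the smaller position; moving the larger one
        # could only widen the gap to the current element of the other list.
        if positions1[i] < positions2[j]:
            i += 1
        else:
            j += 1
    return False

print(within_distance([1, 40], [10, 43], 4))   # True  (40 and 43 are 3 apart)
print(within_distance([1, 40], [10, 50], 4))   # False (closest pair is 9 apart)
```

Each step advances one pointer, so the scan is linear in the combined length of the two position lists rather than quadratic.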

In [110]:
#Checking if it is working
result = proximity_query("time", "series", 3, positional_index)
print("Result Set:", result)

Result Set: [40, 54, 110, 111, 112, 113, 158, 163, 173, 180, 181, 202, 220, 237, 238, 239, 240, 258, 277, 283, 295, 305, 350, 405, 421, 437, 438]


## Checking Golden Queries for Proximity Queries (Query Number 11 and 12)

In [111]:
#11.
result = proximity_query("neural", "information", 2, positional_index)
print("Result Set:", result) #Matching

Result Set: [26]


In [112]:
#12.
result = proximity_query("feature", "track", 5, positional_index)
print("Result Set:", result) #Matching

Result Set: [13, 212]
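
The queries above are passed as separate arguments, whereas the assignment writes them in the form "X Y / k". A minimal sketch of a parser for that form is given below. `parse_proximity_query` and its regular expression are hypothetical and not part of the notebook.

```python
import re

# Two terms, a slash, then a non-negative integer; whitespace around "/" is optional.
PROXIMITY_PATTERN = re.compile(r"^\s*([^\s/]+)\s+([^\s/]+)\s*/\s*(\d+)\s*$")

def parse_proximity_query(query):
    """Turn 'term1 term2 / k' into (term1, term2, k) (hypothetical helper)."""
    match = PROXIMITY_PATTERN.match(query)
    if match is None:
        raise ValueError(f"Expected 'term1 term2 / k', got: {query!r}")
    term1, term2, k = match.groups()
    return term1, term2, int(k)

print(parse_proximity_query("neural information / 2"))   # ('neural', 'information', 2)
print(parse_proximity_query("feature track /5"))         # ('feature', 'track', 5)
```

With the notebook's objects in scope, the tuple could be unpacked as `proximity_query(*parse_proximity_query("feature track / 5"), positional_index)`.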


## Working on the GUI 

The GUI has been implemented in separate .py files. Run it from the command prompt using 'streamlit run app.py'.
Make sure the files '4004Assignment1.ipynb', 'app.py', 'inverted_index.json', 'positional_index.json' and 'search_engine.py'
are present in the same directory. Use the cd command to navigate to the directory that contains these files,
and then run 'streamlit run app.py'.