# Aarib Ahmed Vahidy 22K-4004 BAI-6A

## Information Retrieval Assignment # 2

### Vector Space Model (VSM) for information retrieval.

Assignment Objective

This assignment focuses on the Vector Space Model (VSM) for information retrieval. You will be
implementing and testing a set of queries using VSM for information retrieval. You need to build
a vector space of features using some specified feature selection techniques. The space will be
R^n (one dimension per feature); the query is also represented in the same feature space. Cosine
similarity is used to compute the similarity between documents and queries. A given threshold can
be used to filter the results for a given query. The threshold should be fixed for a given set of queries.

Datasets

You are given a collection of abstracts (file name: Abstracts.zip) for implementing an inverted index
and a positional index. This zip file contains 448 abstracts from some computer science journal. The
language of all these documents is English. You also need to implement a pre-processing
pipeline. It is recommended to first review the given text files for indexing. You need to treat each
document as a unique document. This observation offers you many clues for your pipeline
implementation and feature extraction. You will be implementing term feature selection based on
Term Frequency (tf) and Inverse Document Frequency (idf) scoring. The parameters of tf and idf
can be set while creating a specific space of representation. The set of queries is also provided
for this assignment. You need to place the queries in the same space and compute the score based
on cosine similarity. The weighting scheme used for VSM will be tf*idf, which is a combination
of both tf (term frequency of term t in a document) and idf (inverse document frequency, computed
as (log(df)/ N)).
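
To make the tf*idf weighting concrete, here is a minimal sketch on a toy corpus. The toy documents and the helper name `tfidf_weights` are made up for illustration, and the idf is taken as log(N / df), which is the form the `compute_tfidf_vector` function in this notebook uses.

```python
import math
from collections import Counter

# Toy corpus of already-tokenized documents (illustrative data, not from Abstracts.zip)
toy_docs = {
    "d1": ["deep", "learning", "model"],
    "d2": ["deep", "deep", "network"],
    "d3": ["markov", "model"],
}

N = len(toy_docs)

# Document frequency: in how many documents each term appears
df = Counter(term for tokens in toy_docs.values() for term in set(tokens))

def tfidf_weights(tokens):
    """Return {term: tf * log(N / df)} for one token list (hypothetical helper)."""
    tf = Counter(tokens)
    return {term: count * math.log(N / df[term])
            for term, count in tf.items() if term in df}

weights_d2 = tfidf_weights(toy_docs["d2"])
print(weights_d2)
```

In this toy corpus "deep" appears in two of the three documents and "network" in only one, so "network" gets the larger idf even though "deep" has the higher raw count in d2.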

Query Processing

The query processing of VSM is quite tricky; you need to optimize every aspect of the computation.
The high-dimensional vector product and the similarity values of the query (q) and documents (d)
need to be optimized.
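
One common way to cut the cost of the high-dimensional product is to store only the non-zero weights of each vector. The sketch below assumes vectors are kept as `{term: weight}` dicts; the name `sparse_cosine` is hypothetical and this is not the representation used in the cells further down, which use dense lists.

```python
import math

def sparse_cosine(query_weights, doc_weights):
    """Cosine similarity of two sparse vectors stored as {term: weight} dicts."""
    # Iterate over the smaller dict so the dot product touches as few terms as possible
    if len(query_weights) > len(doc_weights):
        query_weights, doc_weights = doc_weights, query_weights

    dot = sum(w * doc_weights.get(term, 0.0) for term, w in query_weights.items())
    norm_q = math.sqrt(sum(w * w for w in query_weights.values()))
    norm_d = math.sqrt(sum(w * w for w in doc_weights.values()))

    # An all-zero (or empty) vector has no direction, so report similarity 0
    if norm_q == 0 or norm_d == 0:
        return 0.0
    return dot / (norm_q * norm_d)

print(sparse_cosine({"a": 1.0, "b": 1.0}, {"a": 1.0}))
```

With this layout the work per document depends on the number of distinct terms in the query or the document rather than on the vocabulary size.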

Basic Assumption for Vector Space Model (VSM) Retrieval Model

1. Simple model based on linear algebra. Terms are considered as features using a weighting
scheme.

2. Allows partial matching of documents with the queries. Hence, it is able to produce good intuitive
scoring, with continuous scores between queries and documents.

3. Ranking of documents is possible using the relevance score between document and query.

As we discussed during the lectures, we will implement a VSM by selecting features from
the documents by specifying tf and idf values. You are free to implement a posting list with your
choice of data structures; you are only allowed to preprocess the text from the documents in terms
of tokenization, in which you can do case folding, stop-word removal and lemmatization. The stop
word list is also provided to you with the assignment files. Your query processing routine must
address query parsing, evaluation of the cost, and execution of the query to fetch the required list
of documents. The list of documents should be filtered with an alpha value (say alpha = 0.05). A
command line interface is simply required to demonstrate the working model. You are also
provided with a set of 10 queries for evaluating your implementation.

Coding can be done in either Java, C/C++, Python, or C# programming language. There are
additional marks for an intuitive GUI for demonstrating the working VSM along with free text query
search.
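
The command line interface mentioned above could be as small as a read-query, print-results loop. This is only a sketch: `run_cli` and its parameters are hypothetical names, and `search_fn` stands for any function that maps a query string to a list of `(doc_id, score)` pairs.

```python
def run_cli(search_fn, read_line=input, write_line=print, alpha=0.05):
    """Read queries until a blank line or EOF and print the hits scoring at least alpha."""
    while True:
        try:
            query = read_line("query> ").strip()
        except EOFError:
            break
        if not query:
            break

        # Keep only documents above the threshold, best score first
        hits = [(doc_id, score) for doc_id, score in search_fn(query) if score >= alpha]
        hits.sort(key=lambda pair: pair[1], reverse=True)

        if not hits:
            write_line("no documents above threshold")
        for doc_id, score in hits:
            write_line(f"{doc_id}: {score:.4f}")
```

In this notebook the loop could be started with something like `run_cli(lambda q: process_query(q, alpha=0.05))` once `process_query` has been defined. Passing `read_line` and `write_line` as parameters keeps the loop easy to test without a terminal.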

Files Provided with this Assignment:

1. Abstracts

2. Stop-words list as a single file

3. Queries Result-set (Gold Standard- 15 example queries)

Evaluation/ Grading Criteria

The grading will be done as per the scheme of implementations, query responses and matching
with a gold standard (provided query set).

Grading Criteria:

Preprocessing (3 marks)

Formation of Index (1 mark for code complexity 1 mark for saving and loading the indexes)

Vector Space Model (2 marks)

Query processing (2 marks)

Code Clarity (1 mark)

Bonus: GUI (1 mark for making the GUI, 1 mark for a good-looking GUI)

Clean, well-commented code will get 5% more marks.

In [1]:
#Importing necessary libraries
import nltk
import pandas as pd
import numpy as np
import os
import json
import re
from collections import defaultdict, Counter
import pickle
import math
print("All libraries have been installed successfully")

All libraries have been installed successfully


In [2]:
import nltk
nltk.download('punkt_tab')
nltk.download('punkt')
nltk.download('wordnet')
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer

[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\aarib\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\aarib\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\aarib\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [3]:
#Setting dataset path and checking the abstracts.rar files
dataset_path = r"C:\Users\aarib\6thSemester\IR\IRAssignment\Assignment2\Abstracts"

print(os.listdir(dataset_path)[:5])

['1.txt', '10.txt', '100.txt', '101.txt', '102.txt']
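
The listing above is not in numeric order ('10.txt' comes before '2.txt'), and `os.listdir` does not guarantee any particular order. If documents should be visited in numeric order, a sort key on the integer part of the name is enough. This is a small sketch that assumes every file is named `<integer>.txt`; `numeric_sort` is a hypothetical helper name.

```python
def numeric_sort(filenames):
    """Sort names like '10.txt' by their integer stem instead of as strings."""
    return sorted(filenames, key=lambda name: int(name.rsplit(".", 1)[0]))

print(numeric_sort(["1.txt", "10.txt", "100.txt", "101.txt", "102.txt", "2.txt"]))
```

The results are unaffected either way, since each document is stored under its filename, but a numeric order makes printed samples easier to follow.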


In [4]:
sample_file = os.listdir(dataset_path)[0]  #Picking the first file
sample_path = os.path.join(dataset_path, sample_file)

with open(sample_path, "r", encoding="utf-8") as f:
    content = f.read()

print(content[:100])  #Only printing first 100 characters of first file to just check and avoid clutter

Ensemble Statistical and Heuristic Models for Unsupervised Word Alignment

statistical word alignmen


In [5]:
#List to store document contents which were provided in the abstracts.rar file.
documents = []

#Reading each .txt file in the folder
for filename in os.listdir(dataset_path): #dataset_path defined above
    if filename.endswith(".txt"):
        file_path = os.path.join(dataset_path, filename)
        with open(file_path, "r", encoding="windows-1252") as file: 
            #encoding = "utf-8" did not work so tried "latin-1" which worked but "windows-1252" worked best according to the documents given. 
            documents.append(file.read())

#Checking if loading was a success and the number of documents loaded
print(f"Total documents loaded: {len(documents)}")
print("First Sample document:\n", documents[0][:1500])

Total documents loaded: 448
First Sample document:
 Ensemble Statistical and Heuristic Models for Unsupervised Word Alignment

statistical word alignment, ensemble learning, heuristic word alignment

Statistical word alignment models need large amount of training data while they are weak in small-size corpora. This paper proposes a new approach of unsupervised hybrid word alignment technique using ensemble learning method. This algorithm uses three base alignment models in several rounds to generate alignments. The ensemble algorithm uses a weighed scheme for resampling training data and a voting score to consider aggregated alignments. The underlying alignment algorithms used in this study include IBM Model 1, 2 and a heuristic method based on Dice measurement. Our experimental results show that by this approach, the alignment error rate could be improved by at least %15 for the base alignment models.


## Preprocessing the Documents

1. **Tokenization** (Splitting text into words)
2. **Special Character Handling** (Replacing special characters such as `/` and `_` with a space ' ' and storing hyphenated words both split into parts and in their original form; this step is currently commented out in `preprocess_text`)
3. **Case Folding** (Converting text into lowercase)  
4. **Stop word removal** (Removing common words)  
   Stop words list provided: *a, is, the, of, all, and, to, can, be, as, once, for, at, am, are, has, have, had, up, his, her, in, on, no, we, do*  
5. **Lemmatization** (Reducing words to their base form)
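
The special character step (step 2) can be sketched on its own, before tokenization. This is one possible reading of that step, not the exact code of this notebook: `expand_special_chars` is a hypothetical helper that replaces `/` and `_` with spaces and keeps each hyphenated word both whole and split.

```python
import re

def expand_special_chars(text):
    """Replace '/' and '_' with spaces; keep hyphenated words whole and also add their parts."""
    text = re.sub(r"[/_]", " ", text)
    pieces = []
    for word in text.split():
        pieces.append(word)  # original form, e.g. "k-means"
        if "-" in word:
            # Split form, e.g. "k" and "means"; empty parts from stray hyphens are skipped
            pieces.extend(part for part in word.split("-") if part)
    return " ".join(pieces)

print(expand_special_chars("k-means and/or small_size corpora"))
```

Keeping both forms raises the term counts of hyphenated words, so it changes tf values and can shift the cosine scores.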

In [6]:
#Custom stop words as provided by Stopword-list.txt
my_stopwords = set([
    "a", "is", "the", "of", "all", "and", "to", "can", "be", "as", "once", 
    "for", "at", "am", "are", "has", "have", "had", "up", "his", "her", 
    "in", "on", "no", "we", "do"
])

# Initialize the lemmatizer
lemmatizer = WordNetLemmatizer()

In [7]:
#Reusable function for preprocessing a single text string (for docs or queries)
def preprocess_text(text, for_query=False):
    #Note: for_query is currently unused, documents and queries go through the same steps.
    #Special character handling is disabled for now (turned off while checking the k-means query):
    #   text = text.replace("-", " ") + " " + text   #documents: keep split and original forms
    #   text = text.replace("-", " ")                #queries: split form only
    #   text = re.sub(r'[/_]', ' ', text)            #replace "/" and "_" with spaces

    #Tokenization
    tokens = word_tokenize(text)

    #Lowercasing
    tokens = [word.lower() for word in tokens]

    #Stopword removal
    filtered_tokens = [word for word in tokens if word not in my_stopwords]

    #Simple lemmatization - treats all words as nouns by default
    lemmatized_tokens = [lemmatizer.lemmatize(word) for word in filtered_tokens]
    return lemmatized_tokens


#Function to preprocess all .txt files in the given folder
def preprocess_documents(dataset_path):
    processed_docs = {}  #Dictionary to store preprocessed tokens per document

    for filename in os.listdir(dataset_path):
        if filename.endswith(".txt"):
            file_path = os.path.join(dataset_path, filename)
            with open(file_path, "r", encoding="windows-1252") as file:
                text = file.read()

            #Preprocess the content using function
            tokens = preprocess_text(text)
            processed_docs[filename] = tokens

    return processed_docs

In [40]:
processed_docs = preprocess_documents(dataset_path)

query = "images processing techniques"
processed_query = preprocess_text(query, for_query = True)
print(processed_query)

['image', 'processing', 'technique']


In [9]:
processed_docs = preprocess_documents(dataset_path)

query = "humans"
processed_query = preprocess_text(query, for_query = True)
print(processed_query)

['human']


In [10]:
processed_docs = preprocess_documents(dataset_path)

query = "supervised kernel k-means cluster"
processed_query = preprocess_text(query, for_query = True)
print(processed_query)

['supervised', 'kernel', 'k-means', 'cluster']


### VSM
### Vocabulary, TF and DF

In [11]:
#Initializing an empty set to store the unique vocabulary(terms) across all documents
vocab = set()

#Dictionary to store term frequency(TF) for each document
tf_dict = {}

#Counter to store document frequency(DF) for each term(in how many documents a term appears)
df_counter = Counter()

#Looping through each document and its preprocessed list of tokens
for doc, tokens in processed_docs.items():
    
    #Counting the frequency of each term in the current document
    tf = Counter(tokens)
    
    #saving the term frequency dictionary for the current document using its filename as the key 
    tf_dict[doc] = tf

    #For each term in the document, increment the DF count (once per document)
    for term in tf:
        df_counter[term] += 1
    
    #Add the tokens from this document to the global vocabulary; as it is a set, only unique terms will be added
    vocab.update(tokens)

#Sort the vocabulary to create a consistent order for vector representation
vocab = sorted(vocab)

#Calculate the total number of documents(used later for IDF calculation)
N = len(processed_docs)

In [12]:
#Dictionary to store TF-IDF vectors for each document
doc_vectors = {}

#Function to compute the TF-IDF vector for a single document
def compute_tfidf_vector(tf, df_counter, N):
    vector = []  #This will hold the tf-idf values for all terms in the vocabulary(in order)

    #Iterating over the global vocabulary to ensure consistent term ordering
    for term in vocab:
        #Getting the term frequency of 'term' in the current document
        tf_val = tf.get(term, 0)  #If the term is not in the document, return 0

        #Getting the document frequency of the term (in how many documents it appears)
        df_val = df_counter.get(term, 1)  #To avoid division by zero, default to 1 if term appears in no documents

        #Computing tf-idf: tf * log(N / df)
        tfidf = tf_val * math.log(N / df_val)

        #Appending the tf-idf value for this term to the document vector
        vector.append(tfidf)

    return vector  #Return the full tf-idf vector for the document

#Compute the TF-IDF vector for each document in the corpus
for doc in processed_docs:
    #tf_dict[doc] contains the term frequencies for the current document
    doc_vectors[doc] = compute_tfidf_vector(tf_dict[doc], df_counter, N)

#### Building the Indexes, Saving and Loading

In [13]:
from collections import defaultdict

#Inverted Index     (term -> set of document IDs)
inverted_index = defaultdict(set)

for doc_id, tokens in processed_docs.items():
    for token in set(tokens):  #using set to avoid duplicate doc_ids
        inverted_index[token].add(doc_id)

#converting sets to sorted lists for consistent output
for term in inverted_index:
    inverted_index[term] = sorted(list(inverted_index[term]))

In [14]:
# Positional Index   (term -> dict of doc_id -> list of positions)
positional_index = defaultdict(lambda: defaultdict(list))

for doc_id, tokens in processed_docs.items():
    for pos, token in enumerate(tokens):
        positional_index[token][doc_id].append(pos)
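
The positional index is not used by the VSM ranking below, but it is what makes phrase queries possible. As a hedged sketch of how it could be queried (the helper names and the toy documents are made up for illustration):

```python
from collections import defaultdict

def build_positional_index(docs):
    """term -> {doc_id -> [positions]}, same layout as the positional index above."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, tokens in docs.items():
        for pos, token in enumerate(tokens):
            index[token][doc_id].append(pos)
    return index

def phrase_docs(index, phrase_tokens):
    """Return the sorted doc ids in which phrase_tokens occur consecutively."""
    if not phrase_tokens or any(term not in index for term in phrase_tokens):
        return []
    first, rest = phrase_tokens[0], phrase_tokens[1:]
    matches = []
    for doc_id, starts in index[first].items():
        # Positions of each following term in this document (empty if the term is absent)
        later = [set(index[term].get(doc_id, ())) for term in rest]
        if any(all(start + i + 1 in positions for i, positions in enumerate(later))
               for start in starts):
            matches.append(doc_id)
    return sorted(matches)

toy_docs = {
    "1.txt": ["word", "alignment", "model"],
    "2.txt": ["alignment", "word", "model"],
}
toy_index = build_positional_index(toy_docs)
print(phrase_docs(toy_index, ["word", "alignment"]))
```

The phrase tokens need to go through the same preprocessing as the documents, otherwise the positions will not line up.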

In [15]:
import pickle

#Saving
with open("inverted_index.pkl", "wb") as f:
    pickle.dump(dict(inverted_index), f)

with open("positional_index.pkl", "wb") as f:
    pickle.dump(dict(positional_index), f)

#Loading
with open("inverted_index.pkl", "rb") as f:
    inverted_index = pickle.load(f)

with open("positional_index.pkl", "rb") as f:
    positional_index = pickle.load(f)
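
Saving and loading pays off when the index is rebuilt only if no saved copy exists. A minimal sketch of that pattern follows; `load_or_build` and `build_fn` are hypothetical names, and the file name is whatever path you choose.

```python
import os
import pickle

def load_or_build(path, build_fn):
    """Load a pickled index from path if it exists, otherwise build it and save it."""
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)

    index = build_fn()
    with open(path, "wb") as f:
        pickle.dump(index, f)
    return index
```

For example, `load_or_build("inverted_index.pkl", lambda: dict(inverted_index))` would reuse the saved file on later runs. Pickle files should only be loaded from a trusted source, and a stale file has to be deleted by hand if the preprocessing changes.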

In [16]:
#Function to calculate cosine similarity between two vectors
def cosine_similarity(vec1, vec2):
    #Compute the dot product of the two vectors
    #Dot product = sum of element-wise multiplication
    #The zip() function takes two or more iterables and pairs up elements at the same index from each iterable.
    dot = sum(a * b for a, b in zip(vec1, vec2))
    
    #Compute the L2 norm (magnitude) of vec1
    norm1 = math.sqrt(sum(a * a for a in vec1))
    
    #Compute the L2 norm (magnitude) of vec2
    norm2 = math.sqrt(sum(b * b for b in vec2))
    
    #Return the cosine similarity: dot product divided by product of norms
    #The tiny constant (1e-10) prevents division by zero when either vector is all zeros
    return dot / (norm1 * norm2 + 1e-10)

In [17]:
#Function to process a user query using the Vector Space Model (VSM)
def process_query(query, alpha=0.001, top_k=40):
    #Step 1: Preprocess the query using the same preprocessing as the documents
    query_tokens = preprocess_text(query, for_query=True)
    
    #Step 2: Compute term frequency for the query
    tf_q = Counter(query_tokens)
    
    #Step 3: Convert query term frequencies to a TF-IDF vector using the same vocabulary and df values
    query_vec = compute_tfidf_vector(tf_q, df_counter, N)
    
    #Step 4: Initialize list to hold similarity scores
    similarities = []
    
    #Step 5: Compute cosine similarity between query vector and each document vector
    for doc, vec in doc_vectors.items():
        sim = cosine_similarity(query_vec, vec)
        
        #Step 6: Only include documents whose similarity score is above the threshold alpha
        if sim >= alpha:
            doc_num = int(doc.replace(".txt", "")) # Extract numeric doc ID
            similarities.append((doc_num, sim))
    
    #Step 7: Sort the documents by similarity score in descending order and return the top k
    return sorted(similarities, key=lambda x: x[1], reverse=True)[:top_k]
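
A document that shares no term with the query has a zero dot product, so its cosine similarity is 0 and it can never pass a positive alpha. The inverted index can therefore be used to score only the candidate documents instead of all 448. A small sketch, with `candidate_docs` as a hypothetical helper and a toy index for illustration:

```python
def candidate_docs(inverted_index, query_terms):
    """Union of posting lists: documents containing at least one query term."""
    candidates = set()
    for term in query_terms:
        # Terms that are not in the index contribute nothing
        candidates.update(inverted_index.get(term, ()))
    return candidates

toy_index = {"deep": ["1.txt", "3.txt"], "markov": ["2.txt"]}
print(sorted(candidate_docs(toy_index, ["deep", "unknown"])))
```

Inside `process_query`, the loop over `doc_vectors.items()` could then be limited to the documents returned by `candidate_docs(inverted_index, query_tokens)`, which should give the same ranked list for any alpha above zero.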

### Checking Golden Queries 

In [18]:
#1. deep
query = "deep"  #Query string
results = process_query(query, top_k = 39) #Getting ranked results using VSM
results_sorted = sorted(results, key=lambda x: x[0])

#Printing the top results with similarity scores sorted by document number in ascending order
for doc, score in results_sorted:
    print(f"{doc}: {score:.4f}")  # Display document name and its cosine similarity with the query

#258 coming extra in result rest all matching

21: 0.0654
24: 0.1113
174: 0.1945
175: 0.0527
176: 0.1109
177: 0.1439
213: 0.0690
245: 0.0658
246: 0.0842
247: 0.1251
250: 0.1188
254: 0.0839
258: 0.0501
267: 0.2177
273: 0.0969
278: 0.1885
279: 0.2852
280: 0.1381
281: 0.1197
325: 0.1075
345: 0.1312
346: 0.1482
347: 0.1328
348: 0.1486
352: 0.1264
358: 0.0507
360: 0.0802
362: 0.0697
374: 0.0846
376: 0.0908
380: 0.0814
396: 0.2540
397: 0.2404
398: 0.0885
401: 0.0983
405: 0.0649
415: 0.1574
421: 0.0817
432: 0.1072


In [19]:
#2.
query = "weak heuristic"  #Query string
results = process_query(query, alpha = 0.001, top_k = 12) #Getting ranked results using VSM
results_sorted = sorted(results, key=lambda x: x[0])

#Printing the top results with similarity scores sorted by document number in ascending order
for doc, score in results_sorted:
    print(f"{doc}: {score:.4f}")  #Displaying document name and its cosine similarity with the query

1: 0.1925
35: 0.0503
93: 0.0321
101: 0.0730
172: 0.0711
174: 0.0488
213: 0.0433
257: 0.0394
413: 0.0394
435: 0.0488


In [20]:
#CHECKING WHY 391 GOT MISSED IN RESULTS
#Preprocess the query text
query = "weak heuristic"

#Step 1: Preprocess the query (like how documents are preprocessed)
query_tokens = preprocess_text(query, for_query=True)

#Step 2: Compute the term frequency for the query
tf_q = Counter(query_tokens)

#Step 3: Convert the query's term frequencies to a TF-IDF vector
query_vec = compute_tfidf_vector(tf_q, df_counter, N)

#Step 4: Retrieve the TF-IDF vector for Document 391
doc_391_vec = doc_vectors.get("391.txt")

#Step 5: Compute cosine similarity between the query vector and document 391's vector
cosine_sim = cosine_similarity(query_vec, doc_391_vec)

#Step 6: Print the similarity score
print(f"Cosine Similarity between query and Document 391: {cosine_sim:.4f}")

Cosine Similarity between query and Document 391: 0.0000


In [21]:
#3.
query = "principle component analysis"  #Query string
results = process_query(query, alpha = 0.05) #Getting ranked results using VSM
results_sorted = sorted(results, key=lambda x: x[0])

#Printing the top results with similarity scores sorted by document number in ascending order
for doc, score in results_sorted:
    print(f"{doc}: {score:.4f}")  #Displaying document name and its cosine similarity with the query

6: 0.0690
45: 0.0909
53: 0.1440
66: 0.0739
101: 0.0638
277: 0.0557
310: 0.1262
311: 0.1081
364: 0.1975
426: 0.1162
445: 0.0799


In [22]:
#4.
query = "human interaction"  #Query string
results = process_query(query, alpha = 0.001, top_k = 100) #Getting ranked results using VSM
results_sorted = sorted(results, key=lambda x: x[0])

#Printing the top results with similarity scores sorted by document number in ascending order
for doc, score in results_sorted:
    print(f"{doc}: {score:.4f}")  #Displaying document name and its cosine similarity with the query

7: 0.0403
10: 0.0773
21: 0.0576
22: 0.0373
23: 0.0899
26: 0.1659
30: 0.1460
63: 0.0146
83: 0.0280
98: 0.0202
101: 0.0266
127: 0.0745
145: 0.0655
162: 0.0227
164: 0.0245
171: 0.0375
174: 0.0229
186: 0.0950
187: 0.0267
191: 0.1149
194: 0.0392
203: 0.0197
230: 0.0189
247: 0.0184
249: 0.0566
250: 0.5117
255: 0.0249
256: 0.0208
265: 0.0214
273: 0.0285
289: 0.0198
345: 0.0578
359: 0.0142
369: 0.0314
383: 0.1497
391: 0.0212
395: 0.0543
403: 0.0243
426: 0.0915
428: 0.1046
436: 0.0244
444: 0.0612


In [23]:
#CHECKING WHY 203 GOT MISSED IN RESULTS, It got missed due to a lower value of top_k
#Preprocess the query text
query = "human interaction"

#Step 1: Preprocess the query (like how documents are preprocessed)
query_tokens = preprocess_text(query, for_query=True)

#Step 2: Compute the term frequency for the query
tf_q = Counter(query_tokens)

#Step 3: Convert the query's term frequencies to a TF-IDF vector
query_vec = compute_tfidf_vector(tf_q, df_counter, N)

#Step 4: Retrieve the TF-IDF vector for Document 203
doc_203_vec = doc_vectors.get("203.txt")

#Step 5: Compute cosine similarity between the query vector and document 203's vector
cosine_sim = cosine_similarity(query_vec, doc_203_vec)

#Step 6: Print the similarity score
print(f"Cosine Similarity between query and Document 203: {cosine_sim:.4f}")

Cosine Similarity between query and Document 203: 0.0197


In [24]:
#5.
query = "supervised kernel k-means cluster"  #Query string
results = process_query(query, alpha = 0.001, top_k = 26) #Getting ranked results using VSM
results_sorted = sorted(results, key=lambda x: x[0])

#Printing the top results with similarity scores sorted by document number in ascending order
for doc, score in results_sorted:
    print(f"{doc}: {score:.4f}")  #Displaying document name and its cosine similarity with the query

31: 0.1413
79: 0.1922
106: 0.1019
122: 0.2590
123: 0.0859
125: 0.0931
158: 0.1222
173: 0.1326
198: 0.1085
235: 0.1641
242: 0.1819
243: 0.1060
244: 0.0894
258: 0.0938
275: 0.1790
277: 0.0873
280: 0.1222
304: 0.1483
321: 0.1606
326: 0.0953
342: 0.2135
349: 0.3669
351: 0.1770
383: 0.1522
401: 0.0920
446: 0.1280


In [25]:
#Checking my mismatch.......................:(
gold_docs = set([31, 53, 122, 123, 124, 125, 158, 167, 173, 177, 241, 242, 243, 244, 
                 245, 264, 275, 280, 281, 291, 334, 368, 383, 427, 430, 447])

your_docs = set([doc for doc, _ in results])

print("Missing Docs:", gold_docs - your_docs)
print("Extra Docs:", your_docs - gold_docs)
print("Correct Matches:", gold_docs & your_docs)

Missing Docs: {291, 167, 264, 427, 430, 334, 368, 177, 241, 53, 245, 281, 124, 447}
Extra Docs: {321, 258, 198, 326, 106, 235, 79, 304, 401, 277, 342, 349, 446, 351}
Correct Matches: {383, 173, 242, 275, 243, 244, 280, 122, 123, 125, 158, 31}
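
The missing/extra/correct sets can be summarised as precision, recall and F1, which makes it easier to compare queries. A minimal sketch with a hypothetical helper name and toy sets:

```python
def precision_recall_f1(retrieved, relevant):
    """Set-based precision, recall and F1 of a retrieved list against a gold standard."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)

    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1

# Toy example: 2 of the 4 retrieved documents are among the 3 relevant ones
precision, recall, f1 = precision_recall_f1([1, 2, 3, 4], [2, 4, 5])
print(f"precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
```

For the query above, 12 of the 26 retrieved documents are in the 26-document gold set, so precision and recall would both be 12/26, roughly 0.46.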


In [26]:
#6.
query = "patients depression anxiety"  #Query string
results = process_query(query, alpha = 0.001, top_k = 26) #Getting ranked results using VSM
results_sorted = sorted(results, key=lambda x: x[0])

#Printing the top results with similarity scores sorted by document number in ascending order
for doc, score in results_sorted:
    print(f"{doc}: {score:.4f}")  #Displaying document name and its cosine similarity with the query

37: 0.1598
40: 0.0171
62: 0.0236
72: 0.0868
80: 0.0425
168: 0.0489
225: 0.0174
259: 0.0746
263: 0.0210
328: 0.0668
332: 0.1316
333: 0.0179
355: 0.0570
368: 0.0270
391: 0.0212
400: 0.0630
433: 0.1095
447: 0.0408
448: 0.4611


In [27]:
#Checking my mismatch.......................:(     NOO MISMATCH!!
gold_docs = set([37 , 40 , 62 , 72 , 80 , 168 , 225 , 259 , 263 , 328 , 332 , 333 , 355 , 368 , 391 , 
400 , 433 , 447 , 448])

your_docs = set([doc for doc, _ in results])

print("Missing Docs:", gold_docs - your_docs)
print("Extra Docs:", your_docs - gold_docs)
print("Correct Matches:", gold_docs & your_docs)

Missing Docs: set()
Extra Docs: set()
Correct Matches: {259, 263, 391, 400, 37, 168, 40, 433, 62, 447, 448, 72, 328, 332, 333, 80, 225, 355, 368}


In [28]:
#7.
query = "local global clusters"  #Query string
results = process_query(query, alpha = 0.001, top_k = 35) #Getting ranked results using VSM
results_sorted = sorted(results, key=lambda x: x[0])

#Printing the top results with similarity scores sorted by document number in ascending order
for doc, score in results_sorted:
    print(f"{doc}: {score:.4f}")  #Displaying document name and its cosine similarity with the query

3: 0.0662
21: 0.0496
30: 0.0377
31: 0.0708
74: 0.0662
79: 0.1088
106: 0.0938
113: 0.1215
125: 0.1703
126: 0.0626
134: 0.2418
168: 0.0384
179: 0.0527
184: 0.0357
196: 0.0790
198: 0.1240
200: 0.0412
215: 0.0692
235: 0.0569
266: 0.2336
271: 0.1693
274: 0.0767
275: 0.0621
293: 0.0631
326: 0.1089
331: 0.0612
342: 0.3553
350: 0.0869
351: 0.2023
377: 0.0557
379: 0.0634
394: 0.0412
407: 0.1065
446: 0.1464
448: 0.0647


In [29]:
#Checking my mismatch.......................:(
gold_docs = set([19 , 21 , 23 , 26 , 30 , 38 , 54 , 76 , 113 , 125 , 126 , 134 , 136 , 156 , 158 , 168 , 
179 , 196 , 211 , 215 , 242 , 257 , 266 , 271 , 295 , 331 , 335 , 336 , 342 , 361 , 377 , 
394 , 407 , 423])

your_docs = set([doc for doc, _ in results])

print("Missing Docs:", gold_docs - your_docs)
print("Extra Docs:", your_docs - gold_docs)
print("Correct Matches:", gold_docs & your_docs)

Missing Docs: {257, 38, 295, 136, 423, 361, 76, 335, 336, 242, 19, 211, 54, 23, 26, 156, 158}
Extra Docs: {448, 350, 3, 293, 198, 326, 200, 351, 74, 106, 235, 79, 274, 275, 184, 379, 446, 31}
Correct Matches: {126, 196, 134, 168, 266, 394, 331, 271, 113, 179, 21, 342, 407, 377, 215, 125, 30}


In [30]:
#8.
query = "synergy analysis"  #Query string
results = process_query(query, alpha = 0.001, top_k = 7) #Getting ranked results using VSM
results_sorted = sorted(results, key=lambda x: x[0])

#Printing the top results with similarity scores sorted by document number in ascending order
for doc, score in results_sorted:
    print(f"{doc}: {score:.4f}")  #Displaying document name and its cosine similarity with the query

38: 0.0816
92: 0.0231
136: 0.0221
257: 0.0199
420: 0.0324
428: 0.0180
446: 0.0189


In [31]:
#9.
query = "github mashup apis"  #Query string
results = process_query(query, alpha = 0.001, top_k = 2) #Getting ranked results using VSM
results_sorted = sorted(results, key=lambda x: x[0])

#Printing the top results with similarity scores sorted by document number in ascending order
for doc, score in results_sorted:
    print(f"{doc}: {score:.4f}")  #Displaying document name and its cosine similarity with the query

178: 0.1355
362: 0.3238


In [32]:
#10.
query = "Bayesian nonparametric"  #Query string
results = process_query(query, alpha = 0.001, top_k = 23) #Getting ranked results using VSM
results_sorted = sorted(results, key=lambda x: x[0])

#Printing the top results with similarity scores sorted by document number in ascending order
for doc, score in results_sorted:
    print(f"{doc}: {score:.4f}")  #Displaying document name and its cosine similarity with the query

16: 0.3008
35: 0.0497
39: 0.2033
62: 0.0666
65: 0.0791
93: 0.0424
117: 0.1422
118: 0.2130
119: 0.4109
155: 0.0319
196: 0.0517
243: 0.0883
244: 0.0744
255: 0.0703
271: 0.0707
290: 0.0485
324: 0.0249
332: 0.0680
370: 0.0227
440: 0.1192
442: 0.0659
448: 0.1418


In [33]:
#11.
query = "diabetes and obesity"  #Query string
results = process_query(query, alpha = 0.001, top_k = 3) #Getting ranked results using VSM
results_sorted = sorted(results, key=lambda x: x[0])

#Printing the top results with similarity scores sorted by document number in ascending order
for doc, score in results_sorted:
    print(f"{doc}: {score:.4f}")  #Displaying document name and its cosine similarity with the query

72: 0.1071
148: 0.1429
391: 0.0522


In [34]:
#12.
query = "bootstrap"  #Query string
results = process_query(query, alpha = 0.001, top_k = 3) #Getting ranked results using VSM
results_sorted = sorted(results, key=lambda x: x[0])

#Printing the top results with similarity scores sorted by document number in ascending order
for doc, score in results_sorted:
    print(f"{doc}: {score:.4f}")  #Displaying document name and its cosine similarity with the query

181: 0.3273
193: 0.3763
379: 0.0655


In [35]:
#13.
query = "ensemble"  #Query string
results = process_query(query, alpha = 0.001, top_k = 23) #Getting ranked results using VSM
results_sorted = sorted(results, key=lambda x: x[0])

#Printing the top results with similarity scores sorted by document number in ascending order
for doc, score in results_sorted:
    print(f"{doc}: {score:.4f}")  #Displaying document name and its cosine similarity with the query

1: 0.1969
2: 0.2002
3: 0.3087
32: 0.0970
52: 0.0323
89: 0.3205
105: 0.1038
120: 0.1455
171: 0.0876
198: 0.2313
229: 0.1751
256: 0.2913
262: 0.0299
268: 0.0468
284: 0.3337
310: 0.0493
311: 0.1553
327: 0.1343
352: 0.1737
378: 0.1739
386: 0.0581
425: 0.0529


In [36]:
#14.
query = "markov"  #Query string
results = process_query(query, alpha = 0.001, top_k = 23) #Getting ranked results using VSM
results_sorted = sorted(results, key=lambda x: x[0])

#Printing the top results with similarity scores sorted by document number in ascending order
for doc, score in results_sorted:
    print(f"{doc}: {score:.4f}")  #Displaying document name and its cosine similarity with the query


11: 0.0801
16: 0.2345
22: 0.0338
69: 0.1673
110: 0.0526
129: 0.0512
149: 0.2765
197: 0.0952
230: 0.0456
251: 0.0555
257: 0.1336
260: 0.0998
289: 0.0954
305: 0.0569
312: 0.0677
323: 0.0782
335: 0.1596
381: 0.0377
439: 0.0463
445: 0.0479


In [37]:
#15.
query = "prioritize and critical correlate"  #Query string
results = process_query(query, alpha = 0.001, top_k = 23) #Getting ranked results using VSM
results_sorted = sorted(results, key=lambda x: x[0])

#Printing the top results with similarity scores sorted by document number in ascending order
for doc, score in results_sorted:
    print(f"{doc}: {score:.4f}")  #Displaying document name and its cosine similarity with the query

44: 0.0149
101: 0.0194
112: 0.0128
118: 0.0122
138: 0.0129
140: 0.0170
166: 0.0338
195: 0.0570
208: 0.0160
218: 0.0164
227: 0.0155
230: 0.1764
239: 0.0157
250: 0.0169
257: 0.0134
281: 0.0128
283: 0.0138
298: 0.0221
318: 0.0147
354: 0.0193
422: 0.0184
426: 0.0125
436: 0.0177


In [42]:
#CAN RUN YOUR OWN CHANGED QUERY HERE
query = "robust against noise"  #Query string
results = process_query(query, alpha = 0.2, top_k = 20) #Getting ranked results using VSM
results_sorted = sorted(results, key=lambda x: x[0])

#Printing the top results with similarity scores sorted by document number in ascending order
for doc, score in results_sorted:
    print(f"{doc}: {score:.4f}")  #Displaying document name and its cosine similarity with the query

357: 0.2795
398: 0.2137


In [43]:
#CAN RUN YOUR OWN CHANGED QUERY HERE
query = "grammar induction"  #Query string
results = process_query(query, alpha = 0.5, top_k = 20) #Getting ranked results using VSM
results_sorted = sorted(results, key=lambda x: x[0])

#Printing the top results with similarity scores sorted by document number in ascending order
for doc, score in results_sorted:
    print(f"{doc}: {score:.4f}")  #Displaying document name and its cosine similarity with the query

238: 0.5247


In [44]:
#CAN RUN YOUR OWN CHANGED QUERY HERE
query = "lifelong chess patterns"  #Query string
results = process_query(query, alpha = 0.1, top_k = 20) #Getting ranked results using VSM
results_sorted = sorted(results, key=lambda x: x[0])

#Printing the top results with similarity scores sorted by document number in ascending order
for doc, score in results_sorted:
    print(f"{doc}: {score:.4f}")  #Displaying document name and its cosine similarity with the query

18: 0.2565
444: 0.3563


In [None]:
# CAN RUN YOUR OWN CHANGED QUERY HERE
query = ""  #Query string
results = process_query(query, alpha = 0.1, top_k = 20) #Getting ranked results using VSM
results_sorted = sorted(results, key=lambda x: x[0])

#Printing the top results with similarity scores sorted by document number in ascending order
for doc, score in results_sorted:
    print(f"{doc}: {score:.4f}")  #Displaying document name and its cosine similarity with the query