# Lab 2: Boolean Queries, TF-IDF, and Cosine Similarity

### Objective
To create a small document collection, evaluate Boolean queries using posting lists, compute TF, DF, and TF-IDF values, represent documents as vectors, and use cosine similarity to rank documents by relevance to a query.


In [None]:
documents = {
    "D1": "Data mining is an important part of AI and analytics.",
    "D2": "AI is used in agriculture for better crop yield.",
    "D3": "Robotics is transforming the manufacturing industry.",
    "D4": "Data science and AI can improve healthcare outcomes.",
    "D5": "Agriculture and robotics can work together to improve productivity."
}

# Display documents
for doc_id, text in documents.items():
    print(f"{doc_id}: {text}")


D1: Data mining is an important part of AI and analytics.
D2: AI is used in agriculture for better crop yield.
D3: Robotics is transforming the manufacturing industry.
D4: Data science and AI can improve healthcare outcomes.
D5: Agriculture and robotics can work together to improve productivity.


### 1. Boolean Queries

Boolean queries use logical operators:

- **AND** → Intersection of posting lists
- **OR** → Union of posting lists
- **NOT** → Complement of the posting list (all documents that do not contain the term)

Example Queries:

1. `(data AND mining)`  
2. `(ai OR agriculture) AND NOT robotics`
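When posting lists are kept sorted, AND can be evaluated with a linear merge instead of building sets. The following is a minimal sketch of that idea, not the method used in this lab: `intersect_sorted` is a hypothetical helper name, and the posting lists are copied by hand from the five-document collection above.

```python
def intersect_sorted(p1, p2):
    """Intersect two sorted posting lists with a linear merge (AND)."""
    i, j, out = 0, 0, []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            out.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return out


# Hand-copied posting lists for the terms used in the example queries
postings = {
    "data": ["D1", "D4"],
    "mining": ["D1"],
    "ai": ["D1", "D2", "D4"],
    "agriculture": ["D2", "D5"],
    "robotics": ["D3", "D5"],
}
all_docs = ["D1", "D2", "D3", "D4", "D5"]

# Query 1: data AND mining
q1 = intersect_sorted(postings["data"], postings["mining"])

# Query 2: (ai OR agriculture) AND NOT robotics
ai_or_agri = sorted(set(postings["ai"]) | set(postings["agriculture"]))   # OR = union
not_robotics = [d for d in all_docs if d not in postings["robotics"]]     # NOT = complement
q2 = intersect_sorted(ai_or_agri, not_robotics)

print(q1)  # ['D1']
print(q2)  # ['D1', 'D2', 'D4']
```

Here NOT is computed as an explicit complement against the full document list; subtracting the `robotics` posting list from the OR result gives the same answer.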


In [2]:
from collections import defaultdict
import re
import pandas as pd

# Step 1: Create posting lists
posting_lists = defaultdict(list)
for doc_id, text in documents.items():
    terms = re.findall(r'\w+', text.lower())
    for term in set(terms):
        posting_lists[term].append(doc_id)

# Step 2: Convert posting lists to a well-structured table
posting_df = pd.DataFrame({
    'Term': list(posting_lists.keys()),
    'Posting List': [', '.join(posting_lists[term]) for term in posting_lists.keys()]
})

# Sort alphabetically by term and reset index
posting_df = posting_df.sort_values(by='Term').reset_index(drop=True)

# Display the table
posting_df

| | Term | Posting List |
|---|---|---|
| 0 | agriculture | D2, D5 |
| 1 | ai | D1, D2, D4 |
| 2 | an | D1 |
| 3 | analytics | D1 |
| 4 | and | D1, D4, D5 |
| 5 | better | D2 |
| 6 | can | D4, D5 |
| 7 | crop | D2 |
| 8 | data | D1, D4 |
| 9 | for | D2 |


In [4]:
# Boolean Query 1: data AND mining
term1 = 'data'
term2 = 'mining'

pl1 = set(posting_lists[term1])
pl2 = set(posting_lists[term2])
and_result = pl1 & pl2

# Prepare table for Query 1
query1_table = pd.DataFrame({
    'Step': ['Posting list for data', 'Posting list for mining', 'AND Operation (data AND mining)'],
    'Documents': [', '.join(pl1), ', '.join(pl2), ', '.join(and_result)]
})

# Boolean Query 2: (ai OR agriculture) AND NOT robotics
term3 = 'ai'
term4 = 'agriculture'
term5 = 'robotics'

pl3 = set(posting_lists[term3])
pl4 = set(posting_lists[term4])
pl5 = set(posting_lists[term5])

or_result = pl3 | pl4
final_result = or_result - pl5

# Prepare table for Query 2
query2_table = pd.DataFrame({
    'Step': [
        'Posting list for ai', 
        'Posting list for agriculture', 
        'Posting list for robotics',
        'OR Operation (ai OR agriculture)', 
        'Final result ((ai OR agriculture) AND NOT robotics)'
    ],
    'Documents': [
        ', '.join(pl3),
        ', '.join(pl4),
        ', '.join(pl5),
        ', '.join(or_result),
        ', '.join(final_result)
    ]
})

# Display both tables
print("Boolean Query 1: data AND mining\n")
display(query1_table)

print("\nBoolean Query 2: (ai OR agriculture) AND NOT robotics\n")
display(query2_table)


Boolean Query 1: data AND mining



| | Step | Documents |
|---|---|---|
| 0 | Posting list for data | D4, D1 |
| 1 | Posting list for mining | D1 |
| 2 | AND Operation (data AND mining) | D1 |



Boolean Query 2: (ai OR agriculture) AND NOT robotics



| | Step | Documents |
|---|---|---|
| 0 | Posting list for ai | D4, D2, D1 |
| 1 | Posting list for agriculture | D2, D5 |
| 2 | Posting list for robotics | D3, D5 |
| 3 | OR Operation (ai OR agriculture) | D4, D2, D1, D5 |
| 4 | Final result ((ai OR agriculture) AND NOT robo... | D4, D2, D1 |


### 2. TF-IDF Representation
Documents are transformed into numerical vectors that reflect the importance of each term.

**Term Frequency (TF):** how often term $t$ appears in document $d$, normalized by the document's length.

$$TF_{t,d} = \frac{\text{count}(t, d)}{\text{total terms in } d}$$

**Inverse Document Frequency (IDF):** measures how rare a term is across the corpus (down-weights common words).

$$IDF_t = \ln \left( \frac{N}{DF_t} \right)$$

Where:

- $N$ = total number of documents
- $DF_t$ = number of documents containing $t$

The natural logarithm is used here to match the code (`math.log`); a different base such as 10 only rescales every weight by the same constant.

**TF-IDF Weight:**

$$W_{t,d} = TF_{t,d} \times IDF_t$$
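As a worked example of the formulas above, consider the term "mining" in D1. D1 has 10 tokens once lowercased and split on word characters, "mining" occurs once, and only one of the five documents contains it. The variable names below are illustrative. The sketch computes the weight with both the natural logarithm and base 10, since the choice of base only rescales all weights by a constant.

```python
import math

N = 5              # documents in the collection
count_t_d = 1      # "mining" occurs once in D1
total_terms = 10   # tokens in D1
df_t = 1           # only D1 contains "mining"

tf = count_t_d / total_terms

# IDF with two different logarithm bases
idf_natural = math.log(N / df_t)
idf_base10 = math.log10(N / df_t)

w_natural = round(tf * idf_natural, 3)
w_base10 = round(tf * idf_base10, 3)

print(w_natural)  # 0.161
print(w_base10)   # 0.07
```

The value 0.161 is the one that appears in the TF-IDF output of this lab, which indicates that the code works with the natural logarithm.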

In [None]:
import math
from collections import Counter

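# Document frequency (DF): number of documents in each term's posting list
df = {term: len(posting_lists[term]) for term in posting_lists}
N = len(documents)
# Inverse document frequency (IDF); math.log is the natural logarithm
idf = {term: math.log(N/df_val) for term, df_val in df.items()}

# Top 5 rarest terms (smallest DF). sorted() is stable, so ties keep the
# insertion order of posting_lists, which depends on set iteration order
# and can differ between runs.
rarest_terms = sorted(df, key=lambda x: df[x])[:5]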

# Compute TF-IDF for top 5 rarest terms
tf_idf_selected = {}
for doc_id, text in documents.items():
    terms_in_doc = re.findall(r'\w+', text.lower())
    term_count = Counter(terms_in_doc)
    total_terms = len(terms_in_doc)
    tf_idf_selected[doc_id] = {term: round((term_count.get(term,0)/total_terms) * idf[term],3) for term in rarest_terms}

# Prepare table: DF + TF-IDF
table = pd.DataFrame({'DF':[df[term] for term in rarest_terms]}, index=rarest_terms)
for doc_id in documents:
    table[f'TF-IDF_{doc_id}'] = [tf_idf_selected[doc_id][term] for term in rarest_terms]

print("\nDF and TF-IDF for Top 5 Rarest Terms:")
display(table)

# TF-IDF matrix: documents as rows, top 5 rarest terms as columns
tf_idf_matrix = pd.DataFrame(index=documents.keys(), columns=rarest_terms)
for doc_id in documents:
    for term in rarest_terms:
        tf_idf_matrix.loc[doc_id, term] = tf_idf_selected[doc_id][term]
tf_idf_matrix = tf_idf_matrix.astype(float)

print("\nTF-IDF Matrix (Top 5 Terms):")
display(tf_idf_matrix)



DF and TF-IDF for Top 5 Rarest Terms:


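| | DF | TF-IDF_D1 | TF-IDF_D2 | TF-IDF_D3 | TF-IDF_D4 | TF-IDF_D5 |
|---|---|---|---|---|---|---|
| mining | 1 | 0.161 | 0.0 | 0.0 | 0.0 | 0.0 |
| of | 1 | 0.161 | 0.0 | 0.0 | 0.0 | 0.0 |
| analytics | 1 | 0.161 | 0.0 | 0.0 | 0.0 | 0.0 |
| part | 1 | 0.161 | 0.0 | 0.0 | 0.0 | 0.0 |
| important | 1 | 0.161 | 0.0 | 0.0 | 0.0 | 0.0 |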



TF-IDF Matrix (Top 5 Terms):


| | mining | of | analytics | part | important |
|---|---|---|---|---|---|
| D1 | 0.161 | 0.161 | 0.161 | 0.161 | 0.161 |
| D2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| D3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| D4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| D5 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |


### 3. Cosine Similarity & Ranking

Cosine similarity ranks documents by relevance to a query in the Vector Space Model. It measures the cosine of the angle $\theta$ between the query vector $\vec{q}$ and the document vector $\vec{d}$.

$$ \text{Similarity}(\vec{q}, \vec{d}) = \cos(\theta) = \frac{\vec{q} \cdot \vec{d}}{\|\vec{q}\| \times \|\vec{d}\|} = \frac{\sum_i (q_i \times d_i)}{\sqrt{\sum_i q_i^2} \times \sqrt{\sum_i d_i^2}} $$

- **Interpretation**:
  - $1.0$: Vectors point in the same direction (perfect match).
  - $0.0$: Vectors are orthogonal (no shared terms).

Documents with higher scores are considered more relevant to the query.
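The formula and its two boundary cases can be checked on small hand-made vectors. The sketch below uses only the standard library; the function name `cosine` and the example vectors are illustrative assumptions, not part of the lab code.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention: a zero vector matches nothing
    return dot / (norm_u * norm_v)


# Same direction, different length: similarity is 1 (up to rounding)
same_direction = cosine([1.0, 2.0], [2.0, 4.0])

# No shared non-zero dimension: similarity is 0
orthogonal = cosine([1.0, 0.0], [0.0, 1.0])

# Weight on one of five dimensions vs. equal weight on all five: 1 / sqrt(5)
one_of_five = cosine([1.0, 0.0, 0.0, 0.0, 0.0], [1.0] * 5)

print(round(same_direction, 3))  # 1.0
print(orthogonal)                # 0.0
print(round(one_of_five, 3))     # 0.447
```

The first case shows that the score depends on direction only, so a long document is not favored over a short one that uses terms in the same proportions.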

In [8]:
import numpy as np

query = "data mining"
query_terms = re.findall(r'\w+', query.lower())
query_count = Counter(query_terms)
total_query_terms = len(query_terms)

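# Query vector over the same 5 rarest terms as the document matrix.
# "data" has DF = 2, so it is not among them and only "mining" can contribute.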
query_vector = np.array([ (query_count.get(term,0)/total_query_terms) * idf.get(term,0) for term in rarest_terms ])

# Compute cosine similarity
def cosine_similarity(vec1, vec2):
    if np.linalg.norm(vec1)==0 or np.linalg.norm(vec2)==0:
        return 0
    return np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))

similarity_scores = []
for doc_id in documents:
    vec = tf_idf_matrix.loc[doc_id].values
    score = cosine_similarity(query_vector, vec)
    similarity_scores.append((doc_id, round(score,3), documents[doc_id]))

# Rank documents
similarity_scores.sort(key=lambda x: x[1], reverse=True)
ranking_table = pd.DataFrame(similarity_scores, columns=['Document ID', 'Cosine Similarity', 'Content'])

print("\nDocument Ranking based on Cosine Similarity with Query 'data mining':")
display(ranking_table)


Document Ranking based on Cosine Similarity with Query 'data mining':


| | Document ID | Cosine Similarity | Content |
|---|---|---|---|
| 0 | D1 | 0.447 | Data mining is an important part of AI and ana... |
| 1 | D2 | 0.0 | AI is used in agriculture for better crop yield. |
| 2 | D3 | 0.0 | Robotics is transforming the manufacturing ind... |
| 3 | D4 | 0.0 | Data science and AI can improve healthcare out... |
| 4 | D5 | 0.0 | Agriculture and robotics can work together to ... |
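In the ranking above, D4 scores 0 even though it contains "data", because the vectors are restricted to the five rarest terms and "data" is not one of them. The following is a sketch of the same ranking over the full vocabulary, assuming the same tokenization, natural-log IDF, and length-normalized TF as the lab code. The helper names (`tokenize`, `tfidf_vector`, `cosine`) are illustrative.

```python
import math
import re
from collections import Counter

documents = {
    "D1": "Data mining is an important part of AI and analytics.",
    "D2": "AI is used in agriculture for better crop yield.",
    "D3": "Robotics is transforming the manufacturing industry.",
    "D4": "Data science and AI can improve healthcare outcomes.",
    "D5": "Agriculture and robotics can work together to improve productivity.",
}


def tokenize(text):
    return re.findall(r"\w+", text.lower())


tokens = {doc_id: tokenize(text) for doc_id, text in documents.items()}
N = len(documents)

# DF counts each term once per document; IDF uses the natural logarithm
df = Counter(term for toks in tokens.values() for term in set(toks))
idf = {term: math.log(N / n) for term, n in df.items()}
vocab = sorted(df)  # fixed term order for every vector


def tfidf_vector(toks):
    """TF-IDF vector over the full vocabulary; unseen terms get weight 0."""
    counts = Counter(toks)
    return [counts[term] / len(toks) * idf[term] for term in vocab]


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 0.0 if norm_u == 0 or norm_v == 0 else dot / (norm_u * norm_v)


query_vector = tfidf_vector(tokenize("data mining"))
scores = {d: cosine(query_vector, tfidf_vector(toks)) for d, toks in tokens.items()}
ranking = sorted(scores, key=scores.get, reverse=True)

for doc_id in ranking:
    print(doc_id, round(scores[doc_id], 3))
```

With the full vocabulary, D1 still ranks first, D4 receives a non-zero score for sharing "data" with the query, and the remaining documents stay at 0.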
