![](./lab%20header%20image.png)

<div style="text-align: center;">
    <h3>Assignment No. 03</h3>
</div>

<img src="./Student%20Information.png" style="width: 100%;" alt="Student Information">

<div style="border: 1px solid #ccc; padding: 8px; background-color: #f0f0f0; text-align: start;">
    <strong>Q. How can we implement and evaluate different text retrieval models, including Boolean Retrieval, Vector Space Model (TF-IDF), and BM25, from scratch, and compare their performance using common evaluation metrics such as precision, recall, and F1-score?</strong>
</div>

Text retrieval models find the documents in a collection that are relevant to a user query. Each model defines relevance differently: by exact term matching, by weighted vector similarity, or by a probabilistic scoring function. In this assignment, we focus on three models: Boolean Retrieval, Vector Space Model (TF-IDF), and BM25. We also evaluate their performance using precision, recall, and F1-score.

##### 1. Boolean Retrieval Model (BRM):
The Boolean Retrieval Model is the simplest text retrieval model, where documents are retrieved based on exact matches with the query terms using Boolean logic (AND, OR, NOT). This model treats terms as binary (either present or absent) without considering term frequency or relevance.

- Pros: Fast and easy to implement.
- Cons: No ranking, only exact matches; no consideration of term importance.
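
The three Boolean operators map directly onto set operations over an inverted index. The following is a minimal sketch under simplifying assumptions: the helper `build_index`, the whitespace tokenization, and the three sample strings are illustrative and not part of the assignment code.

```python
from collections import defaultdict

def build_index(docs):
    """Map each lowercase, whitespace-separated term to the set of doc IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

# Hypothetical mini-collection, used only for this illustration
docs = ["boolean retrieval model", "vector space model", "probabilistic retrieval"]
index = build_index(docs)
all_ids = set(range(len(docs)))

# AND: documents containing both terms (set intersection)
both = index["retrieval"] & index["model"]
# OR: documents containing at least one of the terms (set union)
either = index["retrieval"] | index["model"]
# NOT: documents that do not contain the term (complement against all doc IDs)
without_model = all_ids - index["model"]

print(both, either, without_model)
```

Note that NOT needs the full set of document IDs, since an inverted index only records where a term occurs.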

##### 2. Vector Space Model (TF-IDF):
The Vector Space Model represents documents and queries as vectors in a high-dimensional space where each dimension corresponds to a unique term. The TF-IDF (Term Frequency-Inverse Document Frequency) weighting scheme is used to assign importance to terms.

- Term Frequency (TF): Measures how frequently a term appears in a document.
- Inverse Document Frequency (IDF): Reduces the weight of terms that appear in many documents, giving more importance to rare terms.

Documents are ranked by calculating the cosine similarity between the query vector and document vectors.

- Pros: Weights terms by importance and ranks documents by their similarity to the query.
- Cons: Treats terms as independent, so word order and relationships between terms are ignored; term weight grows linearly with term frequency.
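
Written out, the weighting and ranking described above take roughly the following form. This is one common variant, chosen to match the length-normalized TF and the `+1`-smoothed IDF used in the implementation section of this assignment; the symbols $f_{t,d}$ (count of term $t$ in document $d$), $|d|$ (number of tokens in $d$), $N$ (number of documents), and $\mathrm{df}(t)$ (number of documents containing $t$) are notation introduced here.

```latex
\mathrm{tf}(t,d) = \frac{f_{t,d}}{|d|}, \qquad
\mathrm{idf}(t) = \ln\frac{N}{\mathrm{df}(t) + 1}, \qquad
w_{t,d} = \mathrm{tf}(t,d)\cdot\mathrm{idf}(t)

\cos(q,d) = \frac{\sum_{t} w_{t,q}\, w_{t,d}}
                 {\sqrt{\sum_{t} w_{t,q}^{2}}\;\sqrt{\sum_{t} w_{t,d}^{2}}}
```

With this IDF, a term that occurs in 3 of 4 documents gets $\ln(4/4) = 0$ and contributes nothing to the score, while a term that occurs in every document gets a negative weight.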

##### 3. BM25:
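BM25 (Best Matching 25) is a probabilistic model that extends TF-IDF by normalizing term frequency using document length and adding tunable parameters. Its term-frequency component saturates: each additional occurrence of a term adds less to the score than the previous one. The parameter `k1` controls how quickly this saturation sets in, and `b` controls how strongly scores are normalized by document length.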

- Pros: Often more accurate than TF-IDF due to better handling of term saturation and document length normalization.
- Cons: More complex; requires parameter tuning.
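
For reference, the BM25 score of a document $d$ for a query $q$ is usually written as below. The notation is introduced here: $f_{t,d}$ is the count of term $t$ in $d$, $|d|$ is the document length in tokens, and $\mathrm{avgdl}$ is the average document length in the collection. The exact IDF term varies between implementations, so $\mathrm{idf}(t)$ is left abstract.

```latex
\mathrm{score}(q,d) = \sum_{t \in q} \mathrm{idf}(t)\cdot
\frac{f_{t,d}\,(k_1 + 1)}
     {f_{t,d} + k_1\left(1 - b + b\,\dfrac{|d|}{\mathrm{avgdl}}\right)}
```

As $f_{t,d}$ grows, the fraction approaches $k_1 + 1$, which is the saturation effect. Setting $b = 0$ disables length normalization, and $b = 1$ applies it fully.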

##### 4. Evaluation Metrics:
To evaluate the retrieval models, we use the following metrics:

- Precision: The proportion of retrieved documents that are relevant.
- Recall: The proportion of relevant documents that were retrieved.
- F1-Score: The harmonic mean of precision and recall, balancing both metrics.
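
In set notation, with $\mathit{Ret}$ the set of retrieved documents and $\mathit{Rel}$ the set of relevant documents (symbols introduced here for illustration), the three metrics are:

```latex
P = \frac{|\mathit{Ret} \cap \mathit{Rel}|}{|\mathit{Ret}|}, \qquad
R = \frac{|\mathit{Ret} \cap \mathit{Rel}|}{|\mathit{Rel}|}, \qquad
F_1 = \frac{2PR}{P + R}
```

As a worked example, if $\mathit{Ret} = \{0, 1, 3\}$ and $\mathit{Rel} = \{0, 2\}$, only document 0 is in both sets, so $P = 1/3$, $R = 1/2$, and $F_1 = 2 \cdot \tfrac{1}{6} / \tfrac{5}{6} = 0.4$.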


#### Implementation: 

##### 1. Tokenization Function

In [1]:
import re
from collections import defaultdict
import math

def tokenize(text):
    # Lowercase the text, then extract runs of word characters (letters, digits, underscore)
    text = text.lower()
    tokens = re.findall(r'\b\w+\b', text)
    return tokens


##### 2. Boolean Retrieval Model (BRM)

In [2]:
def build_boolean_index(documents):
    inverted_index = defaultdict(set)
    for doc_id, document in enumerate(documents):
        tokens = set(tokenize(document))
        for token in tokens:
            inverted_index[token].add(doc_id)
    return inverted_index

def boolean_search(query, boolean_index):
    # OR semantics: return every document that contains at least one query term
    query_tokens = tokenize(query)
    result_set = set()
    
    for token in query_tokens:
        if token in boolean_index:
            result_set.update(boolean_index[token])
    
    return result_set


##### 3. Vector Space Model (TF-IDF)

In [3]:
def compute_tf(documents):
    tf = []
    for document in documents:
        doc_tokens = tokenize(document)
        doc_tf = defaultdict(float)
        for token in doc_tokens:
            doc_tf[token] += 1
        for token in doc_tf:
            doc_tf[token] /= len(doc_tokens)
        tf.append(doc_tf)
    return tf

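def compute_idf(documents):
    # IDF with +1 smoothing in the denominator: log(N / (df + 1)).
    # The value is 0 when df == N - 1 and negative when df == N.
    N = len(documents)
    df = defaultdict(int)
    for document in documents:
        # Tokenize each document once and count each term once per document
        for token in set(tokenize(document)):
            df[token] += 1
    idf = defaultdict(float)
    for token, count in df.items():
        idf[token] = math.log(N / (count + 1))
    return idf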

def compute_tf_idf(tf, idf):
    tf_idf = []
    for doc_tf in tf:
        doc_tf_idf = {}
        for token, tf_value in doc_tf.items():
            doc_tf_idf[token] = tf_value * idf.get(token, 0)
        tf_idf.append(doc_tf_idf)
    return tf_idf

def cosine_similarity(doc_vector, query_vector):
    dot_product = sum(doc_vector.get(token, 0) * query_vector.get(token, 0) for token in query_vector)
    doc_norm = math.sqrt(sum(value**2 for value in doc_vector.values()))
    query_norm = math.sqrt(sum(value**2 for value in query_vector.values()))
    if doc_norm * query_norm == 0:
        return 0
    return dot_product / (doc_norm * query_norm)

def tf_idf_search(query, tf_idf, idf):
    query_tokens = tokenize(query)
    query_vector = defaultdict(float)
    for token in query_tokens:
        query_vector[token] += 1
    for token in query_vector:
        query_vector[token] *= idf.get(token, 0)
    
    results = []
    for doc_id, doc_vector in enumerate(tf_idf):
        similarity = cosine_similarity(doc_vector, query_vector)
        results.append((doc_id, similarity))
    
    return sorted(results, key=lambda x: x[1], reverse=True)


##### 4. BM25

In [4]:
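def bm25(documents, query, k1=1.5, b=0.75):
    # k1 controls term-frequency saturation; b controls document-length normalization
    N = len(documents)
    avg_doc_len = sum(len(tokenize(doc)) for doc in documents) / N
    # Uses the log(N / (df + 1)) IDF from compute_idf, not a BM25-specific IDF formula
    idf = compute_idf(documents)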
    
    results = []
    query_tokens = tokenize(query)
    
    for doc_id, document in enumerate(documents):
        score = 0
        doc_tokens = tokenize(document)
        doc_len = len(doc_tokens)
        doc_tf = defaultdict(float)
        for token in doc_tokens:
            doc_tf[token] += 1
        
        for token in query_tokens:
            if token in doc_tf:
                tf = doc_tf[token]
                idf_value = idf.get(token, 0)
                term_score = idf_value * ((tf * (k1 + 1)) / (tf + k1 * (1 - b + b * (doc_len / avg_doc_len))))
                score += term_score
        
        results.append((doc_id, score))
    
    return sorted(results, key=lambda x: x[1], reverse=True)


##### 5. Evaluation Metrics

In [5]:
def precision_recall_f1(retrieved_docs, relevant_docs):
    retrieved_set = set(retrieved_docs)
    relevant_set = set(relevant_docs)

    true_positives = len(retrieved_set.intersection(relevant_set))
    precision = true_positives / len(retrieved_set) if retrieved_set else 0
    recall = true_positives / len(relevant_set) if relevant_set else 0
    f1_score = 2 * (precision * recall) / (precision + recall) if precision + recall > 0 else 0

    return precision, recall, f1_score

In [6]:
# Sample documents
documents = [
    "Information retrieval models compare different approaches.",
    "We implement a Boolean retrieval model.",
    "BM25 is an extension of TF-IDF.",
    "Text retrieval systems are evaluated using metrics."
]

# Relevant documents for a query
relevant_docs = [0, 2]

# Query
query = "retrieval model"

# Boolean Retrieval
boolean_index = build_boolean_index(documents)
boolean_results = boolean_search(query, boolean_index)

# Evaluate Boolean Model
precision, recall, f1 = precision_recall_f1(boolean_results, relevant_docs)
print(f"Boolean Model - Precision: {precision}, Recall: {recall}, F1-score: {f1}")

# TF-IDF Retrieval
tf = compute_tf(documents)
idf = compute_idf(documents)
tf_idf = compute_tf_idf(tf, idf)
tfidf_results = tf_idf_search(query, tf_idf, idf)

# Document IDs in ranked order; no cutoff is applied, so every document is included (even with score 0)
retrieved_tf_idf = [doc_id for doc_id, _ in tfidf_results]

# Evaluate TF-IDF Model
precision, recall, f1 = precision_recall_f1(retrieved_tf_idf, relevant_docs)
print(f"TF-IDF Model - Precision: {precision}, Recall: {recall}, F1-score: {f1}")

# BM25 Retrieval
bm25_results = bm25(documents, query)

# Document IDs in ranked order; no cutoff is applied, so every document is included (even with score 0)
retrieved_bm25 = [doc_id for doc_id, _ in bm25_results]

# Evaluate BM25 Model
precision, recall, f1 = precision_recall_f1(retrieved_bm25, relevant_docs)
print(f"BM25 Model - Precision: {precision}, Recall: {recall}, F1-score: {f1}")

Boolean Model - Precision: 0.3333333333333333, Recall: 0.5, F1-score: 0.4
TF-IDF Model - Precision: 0.5, Recall: 1.0, F1-score: 0.6666666666666666
BM25 Model - Precision: 0.5, Recall: 1.0, F1-score: 0.6666666666666666
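
The TF-IDF and BM25 runs above pass all four document IDs to `precision_recall_f1`, so both report precision 0.5 (2 relevant out of 4 documents) and recall 1.0 regardless of how well the documents are ranked. One possible way to make the comparison sensitive to ranking is to score only the top `k` results and to drop documents whose score is not positive. The sketch below assumes this approach; the function name `evaluate_at_k`, the cutoff rule, and the ranked list are hypothetical and not part of the assignment code.

```python
def evaluate_at_k(ranked, relevant, k):
    """Precision, recall, and F1 over the top-k (doc_id, score) pairs with a positive score."""
    # Keep at most k results and ignore documents that did not match the query at all
    top = {doc_id for doc_id, score in ranked[:k] if score > 0}
    relevant = set(relevant)
    hits = len(top & relevant)
    precision = hits / len(top) if top else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall > 0 else 0.0
    return precision, recall, f1

# Hypothetical ranked list, already sorted by descending score
ranked = [(1, 0.9), (0, 0.4), (3, 0.2), (2, 0.0)]
relevant = [0, 2]

print(evaluate_at_k(ranked, relevant, k=2))
print(evaluate_at_k(ranked, relevant, k=4))
```

With this hypothetical list, `k=2` keeps documents 1 and 0, and `k=4` keeps documents 1, 0, and 3 because document 2 has a score of 0.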


<div style="float: right; border: 1px solid black; display: inline-block; padding: 10px; text-align: center">
    <br>
    <br>
    <span style="font-weight: bold;">Signature of Lab Incharge</span>
    <br>
    <span>(Prof. Rupali Sharma)</span> 
</div>