![](./lab%20header%20image.png)

<div style="text-align: center;">
    <h3>Experiment No. 08</h3>
</div>

<img src="./Student%20Information.png" style="width: 100%;" alt="Student Information">

<div style="border: 1px solid #ccc; padding: 8px; background-color: #f0f0f0; text-align: center;">
    <strong>AIM</strong>
</div>

**Study Assignment of Information Retrieval Techniques**

<div style="border: 1px solid #ccc; padding: 8px; background-color: #f0f0f0; text-align: center;">
    <strong>Theory/Procedure/Algorithm</strong>
</div>

Information retrieval (IR) is the process of obtaining relevant information from a large repository of unstructured data. The primary goal of IR is to return results that satisfy a user's search query as accurately and quickly as possible. In this experiment, we explore key IR techniques, implement a subset of them, and evaluate their effectiveness using standard metrics. IR plays a significant role in search engines, recommendation systems, and document management applications.

There are several core techniques used in information retrieval systems:

1. **Boolean Retrieval Model**: This is one of the earliest and simplest retrieval models, in which queries are expressed using Boolean operators (AND, OR, NOT). A document either satisfies the query or it does not, so the result is an unranked set of matching documents. This rigidity means a query often returns either too many or too few results.

2. **Vector Space Model (VSM)**: In this model, documents and queries are represented as vectors in a multi-dimensional space. The relevance between a document and a query is calculated based on the cosine similarity between the vectors. VSM handles partial matches and ranks documents based on their relevance to the query.

3. **Probabilistic Retrieval Model**: This model estimates the probability of a document being relevant to a query. The most well-known example is the BM25 ranking function, which improves relevance scoring by considering term frequency, inverse document frequency, and document length.

4. **Latent Semantic Indexing (LSI)**: LSI deals with the limitations of traditional keyword-based models by discovering latent relationships between terms. It applies singular value decomposition (SVD) to a term-document matrix to reduce its dimensionality, making it easier to retrieve conceptually related documents.

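5. **TF-IDF (Term Frequency-Inverse Document Frequency)**: This technique weights each term in a document by how often it occurs in that document and how rare it is across the collection. Documents are then ranked by the combined weights of the query terms they contain. The weight is the product of two factors: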

    - **Term Frequency (TF)**: The number of times a term appears in a document.
    - **Inverse Document Frequency (IDF)**: It reflects how rare a term is across all documents. The less frequent a term, the higher its significance in identifying relevant documents.
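
The TF-IDF weighting and the cosine similarity used by the vector space model can be sketched in a few lines of plain Python. This is a minimal illustration, not the experiment's implementation: the toy corpus, the helper names (`idf`, `tfidf_vector`, `cosine`), and the unsmoothed IDF formula `log(N / df)` are assumptions chosen for clarity. Scikit-learn's `TfidfVectorizer` uses a smoothed IDF and L2 normalization by default, so its numbers will differ.

```python
import math
from collections import Counter

# Toy corpus (hypothetical); each document is already tokenized.
docs = [
    ["space", "shuttle", "launch"],
    ["car", "engine", "repair"],
    ["space", "engine", "design"],
]

def idf(term, corpus):
    """Unsmoothed IDF: log(N / df). Assumes the term occurs in at least one document."""
    df = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / df)

def tfidf_vector(tokens, corpus, vocabulary):
    """Raw term count times IDF for every vocabulary term."""
    counts = Counter(tokens)
    return [counts[t] * idf(t, corpus) for t in vocabulary]

def cosine(u, v):
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

vocabulary = sorted({t for doc in docs for t in doc})
doc_vectors = [tfidf_vector(doc, docs, vocabulary) for doc in docs]
query_vector = tfidf_vector(["space", "launch"], docs, vocabulary)

# Score every document against the query and rank by descending similarity.
scores = [cosine(query_vector, dv) for dv in doc_vectors]
ranking = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
print(ranking)  # [0, 2, 1]
```

Document 0 shares both query terms and ranks first, document 2 shares only the common term "space" and ranks second, and document 1 shares nothing and scores zero. This is the partial matching that the Boolean model cannot express.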

##### Steps:

1. **Dataset**: The 20 Newsgroups dataset is freely available and can be loaded through Scikit-learn's `datasets` module with `fetch_20newsgroups`.

2. **Implementation**: I will use the following IR models:

    - **TF-IDF Vectorizer** (for the vector space model)
    - **BM25** (for the probabilistic model)
    - **Boolean Retrieval** (with binary vectors)

3. **Performance Metrics**: Precision, Recall, and F1 Score are calculated from the top-10 retrieval results. MAP and MRR are the usual rank-aware complements when several queries are evaluated.
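
Boolean retrieval with binary vectors can be illustrated without any library. The sketch below is an assumption about how such a model might look, not part of the experiment's code: the toy documents and the helper names (`vec`, `bool_and`, `bool_or`, `bool_not`, `matching_docs`) are hypothetical. Each term gets one 0/1 flag per document, and a query is evaluated by combining those vectors element by element.

```python
# Toy corpus (hypothetical documents, not taken from 20 Newsgroups).
docs = [
    "the shuttle launch was delayed",
    "engine repair for an old car",
    "new engine design for the shuttle",
]

# Binary term-document incidence: term -> one 0/1 flag per document.
tokenized = [set(doc.lower().split()) for doc in docs]
vocabulary = sorted(set().union(*tokenized))
incidence = {t: [int(t in doc) for doc in tokenized] for t in vocabulary}

def vec(term):
    """Binary vector for a term; all zeros if the term is not in the vocabulary."""
    return incidence.get(term, [0] * len(docs))

def bool_and(u, v):
    return [a & b for a, b in zip(u, v)]

def bool_or(u, v):
    return [a | b for a, b in zip(u, v)]

def bool_not(u):
    return [1 - a for a in u]

def matching_docs(bits):
    """Indices of the documents whose flag is set."""
    return [i for i, bit in enumerate(bits) if bit]

q1 = matching_docs(bool_and(vec("engine"), vec("shuttle")))        # engine AND shuttle
q2 = matching_docs(bool_and(vec("engine"), bool_not(vec("car"))))  # engine AND NOT car
q3 = matching_docs(bool_or(vec("launch"), vec("car")))             # launch OR car
print(q1, q2, q3)  # [2] [2] [0, 1]
```

The output is a set of matching document indices with no ordering among them, which is why Boolean retrieval does not fit rank-based metrics without an additional scoring step.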

```python
# Import necessary libraries
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.metrics import precision_score, recall_score, f1_score
from sklearn.model_selection import train_test_split
import numpy as np
from rank_bm25 import BM25Okapi

# Load four categories of the 20 Newsgroups dataset
categories = ['sci.space', 'comp.graphics', 'rec.autos', 'talk.politics.guns']
newsgroups = fetch_20newsgroups(subset='all', categories=categories)
data = newsgroups.data
target = newsgroups.target

# Split the dataset into train and test
train_data, test_data, train_target, test_target = train_test_split(
    data, target, test_size=0.2, random_state=42
)

# Vectorize the train and test data using TF-IDF
tfidf_vectorizer = TfidfVectorizer(stop_words='english', max_df=0.7)
tfidf_train = tfidf_vectorizer.fit_transform(train_data)
tfidf_test = tfidf_vectorizer.transform(test_data)

# Use the first test document as the sample query
query = test_data[0]
query_vector = tfidf_vectorizer.transform([query])

# Cosine similarity between the query and every training document (VSM)
cosine_sim = cosine_similarity(query_vector, tfidf_train).flatten()

# BM25 implementation with rank_bm25 (whitespace tokenization, no stop-word removal)
tokenized_corpus = [doc.split() for doc in train_data]
bm25 = BM25Okapi(tokenized_corpus)
query_tokenized = query.split()
bm25_scores = bm25.get_scores(query_tokenized)

# Rank the training documents by score and keep the 10 best for each model
top_n_vsm = np.argsort(cosine_sim)[::-1][:10]    # Top 10 docs for VSM
top_n_bm25 = np.argsort(bm25_scores)[::-1][:10]  # Top 10 docs for BM25

# Binary relevance: a retrieved document is relevant (1) if its category
# matches the category of the query document, otherwise irrelevant (0)
true_category = test_target[0]

# Relevance of the top 10 documents for VSM
vsm_pred_categories = [train_target[idx] for idx in top_n_vsm]
vsm_relevant = [1 if cat == true_category else 0 for cat in vsm_pred_categories]

# Relevance of the top 10 documents for BM25
bm25_pred_categories = [train_target[idx] for idx in top_n_bm25]
bm25_relevant = [1 if cat == true_category else 0 for cat in bm25_pred_categories]

# NOTE: y_true below is a list of ones, so no retrieved document can be a
# false positive. precision_score is therefore 1.0 whenever at least one
# retrieved document is relevant, and recall_score equals the fraction of the
# top 10 that is relevant (precision@10 in standard IR terms).

# Precision, Recall, F1 for VSM
precision_vsm = precision_score([1] * len(vsm_relevant), vsm_relevant, zero_division=0)
recall_vsm = recall_score([1] * len(vsm_relevant), vsm_relevant, zero_division=0)
f1_vsm = f1_score([1] * len(vsm_relevant), vsm_relevant, zero_division=0)

# Precision, Recall, F1 for BM25
precision_bm25 = precision_score([1] * len(bm25_relevant), bm25_relevant, zero_division=0)
recall_bm25 = recall_score([1] * len(bm25_relevant), bm25_relevant, zero_division=0)
f1_bm25 = f1_score([1] * len(bm25_relevant), bm25_relevant, zero_division=0)

# Collect and print the results
result = {
    "precision_vsm": precision_vsm,
    "recall_vsm": recall_vsm,
    "f1_vsm": f1_vsm,
    "precision_bm25": precision_bm25,
    "recall_bm25": recall_bm25,
    "f1_bm25": f1_bm25
}

print(result)
```

```
{'precision_vsm': 1.0, 'recall_vsm': 0.5, 'f1_vsm': 0.6666666666666666, 'precision_bm25': 1.0, 'recall_bm25': 0.5, 'f1_bm25': 0.6666666666666666}
```


The results are identical for the Vector Space Model (VSM) and BM25 on this single query. The three numbers need careful reading because of how the metric functions were called: `y_true` is a list of ten ones, and `y_pred` is the relevance list for the top 10 retrieved documents.

1. **Precision**:
The reported precision of 1.0 is an artifact of this setup. With every entry of `y_true` set to 1, no retrieved document can count as a false positive, so the value is 1.0 whenever at least one retrieved document matches the query's category. It does not show that all ten retrieved documents were relevant.

2. **Recall**:
The reported recall of 0.5 is the fraction of the top 10 retrieved documents whose category matches the query's category, which is precision@10 in standard IR terms. Five of the ten documents returned by each model belong to the correct category. True recall, the share of all relevant training documents that were retrieved, is not measured here.

3. **F1 Score**:
The F1 score of 0.67 is the harmonic mean of the two values above (2 × 1.0 × 0.5 / 1.5). Because the precision term is trivially 1.0, this F1 score is not a meaningful summary of retrieval quality.

**Summary**:
- Half of the top 10 documents (precision@10 = 0.5) match the query's category for both VSM and BM25.
- Recall over the full set of relevant documents was not computed, so nothing can be concluded about coverage.
- Both VSM and BM25 performed identically on this one query; separating them requires more queries and rank-aware metrics such as MAP and MRR.
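
Rank-aware metrics can be computed directly from a binary relevance list such as `vsm_relevant` or `bm25_relevant`. The sketch below is a hedged illustration with hypothetical helper names and toy relevance lists; it is not taken from the experiment. It assumes the caller supplies `total_relevant`, the number of relevant documents in the whole collection. In this experiment that could be the count of training documents sharing the query's category, for example `int((train_target == true_category).sum())`.

```python
def precision_at_k(relevance, k):
    """Fraction of the top-k results that are relevant."""
    return sum(relevance[:k]) / k if k else 0.0

def recall_at_k(relevance, k, total_relevant):
    """Fraction of all relevant documents that appear in the top-k results."""
    return sum(relevance[:k]) / total_relevant if total_relevant else 0.0

def average_precision(relevance, total_relevant):
    """Sum of precision@i at each rank i holding a relevant document,
    divided by the total number of relevant documents."""
    hits, running = 0, 0.0
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            running += hits / rank
    return running / total_relevant if total_relevant else 0.0

def reciprocal_rank(relevance):
    """1 / rank of the first relevant result, or 0.0 if there is none."""
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            return 1.0 / rank
    return 0.0

# Two toy queries (hypothetical): ranked relevance list and total relevant count.
queries = [
    ([1, 0, 1, 0, 0], 4),
    ([0, 1, 0, 0, 0], 1),
]

# MAP and MRR are the means of the per-query values.
map_score = sum(average_precision(r, n) for r, n in queries) / len(queries)
mrr_score = sum(reciprocal_rank(r) for r, _ in queries) / len(queries)

print(precision_at_k(queries[0][0], 5))   # 0.4
print(recall_at_k(queries[0][0], 5, 4))   # 0.5
print(mrr_score)                          # 0.75
```

Unlike the `precision_score` and `recall_score` calls with an all-ones `y_true`, these functions keep precision and recall distinct and take the rank position of each relevant document into account.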

<div style="border: 1px solid #ccc; padding: 8px; background-color: #f0f0f0; text-align: center;">
    <strong>CONCLUSION</strong>
</div>

This experiment compared the Vector Space Model (VSM), implemented with TF-IDF weights and cosine similarity, against the BM25 ranking function on a four-category subset of the 20 Newsgroups dataset. For the single query tested, both models produced identical scores, so the experiment by itself does not separate them. In general, VSM is simpler to implement and accounts for document length only through vector normalization, while BM25 adds explicit document length normalization and term frequency saturation and may therefore outperform VSM, especially when document lengths vary widely. The choice between these models depends on specific requirements, dataset characteristics, and query types. Because the experiment used a single query and a subset of the data, it provides a baseline rather than a conclusive comparison of the two algorithms for document retrieval. Future work could involve testing with larger datasets, multiple queries, and other retrieval models to gain a more complete picture of their effectiveness in various information retrieval scenarios.

<div style="border: 1px solid #ccc; padding: 8px; background-color: #f0f0f0; text-align: center;">
    <strong>ASSESSMENT</strong>
</div>

<img src="./marks_distribution.png" style="width: 100%;" alt="marks_distribution">