# Information Retrieval and Text Analytics Project

## Introduction

**Overview of Information Retrieval (IR) Systems:** An information retrieval system is software designed to search for and retrieve relevant information from a large collection of unstructured or semi-structured data, such as text documents, images, or multimedia content, based on user queries. These systems are essential for handling vast amounts of information, such as that found on the internet, in digital libraries, or within enterprise data systems.

Examples of information retrieval systems:
1.   **Google:** A search engine that retrieves relevant web pages based on keywords entered by the user.
2.   **Amazon:** An e-commerce platform where users search for products by entering keywords or browsing categories.
3.   **PubMed:** A database for medical research articles, allowing users to search for academic papers related to health and science.



**Background:** Data is everywhere around us. With the proliferation of digital devices, vast amounts of data are being generated and shared. Valuable information can be found in news, books, papers, documentation, and wikis, but finding it means wading through a sea of redundant and sometimes useless data. There is therefore a growing need for ways to quickly obtain and sift through this data to extract valuable insights. Many information retrieval models have been developed to address this problem, with varying performance, which raises the question of which model performs best for a given retrieval task.



**Objective:** This project aims to develop a system that retrieves relevant information from text datasets using three different information retrieval models and compares their performance to determine which works best. The system also improves the understanding and organization of the data through text analytics. This includes leveraging preprocessing techniques to enhance text representation, applying robust retrieval methods to ensure accuracy, and using visualizations to provide actionable insights and evaluate performance effectively.

**Scope:** The scope includes implementing preprocessing techniques such as tokenization, case standardization, stop word removal, stemming, and TF-IDF to improve text representation. It also covers the Vector Space Model, the Boolean Retrieval Model, and the BM25 retrieval algorithm for ranking and identifying relevant information based on user queries, and compares the performance of these algorithms. To enhance usability, the project will also incorporate visualization tools such as word clouds for top keywords in documents, frequency distributions of words, document-query similarity scores (e.g., bar charts), and topic clustering using LDA (Latent Dirichlet Allocation). The project is limited to textual data and does not cover multimedia or non-textual information retrieval.

## Data Collection

In [147]:
# Install dependencies, import libraries, and download the required NLTK resources
!pip install kagglehub
!pip install rank_bm25
from rank_bm25 import BM25Okapi
!pip install nltk
!pip install scikit-learn
import kagglehub
import nltk
import os
import re
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
nltk.download('punkt_tab')
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
# path = kagglehub.dataset_download("crawford/20-newsgroups")
# The dataset will be downloaded to /home/<user>/.cache/kagglehub




[nltk_data] Downloading package punkt_tab to
[nltk_data]     /home/k1ng0a21r/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package punkt to /home/k1ng0a21r/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /home/k1ng0a21r/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /home/k1ng0a21r/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

## Data Inspection

In [111]:
# Specify the file path
file_path = "./20-newsgroups/alt.atheism.txt"  # Replace with the actual path to your file

# Check if the file exists
if os.path.exists(file_path):
    # Open the file and read its contents
    with open(file_path, 'r', encoding='latin-1') as file:  # Use 'latin-1' encoding
        file_contents = file.read()

    # Print or process the file contents
    print(file_contents[:100])  # Print the first 100 characters of the file contents
else:
    print(f"Error: File not found at '{file_path}'")

From: mathew <mathew@mantis.co.uk>
Subject: Alt.Atheism FAQ: Atheist Resources

Archive-name: atheis


# Data Preprocessing

#### Implementing tokenization, lowercasing, stopwords removal, stemming and lemmatization.

In [113]:
def preprocess_text(text):
    # Tokenization
    tokens = nltk.word_tokenize(text)

    # Lowercasing
    tokens = [token.lower() for token in tokens]

    # Stopword Removal
    stop_words = set(stopwords.words('english'))
    tokens = [token for token in tokens if token not in stop_words and token.isalnum()]

    # Stemming
    stemmer = PorterStemmer()
    tokens = [stemmer.stem(token) for token in tokens]

    # Lemmatization
    lemmatizer = WordNetLemmatizer()
    tokens = [lemmatizer.lemmatize(token) for token in tokens]

    return ' '.join(tokens)

#### Implementing Vectorization (BoW/TF-IDF)

In [114]:
def vectorize_texts(texts, method='bow'):
    """
    Vectorize texts using Bag of Words (BOW) or TF-IDF.

    Parameters:
    texts (list of str): List of texts to vectorize.
    method (str): Method of vectorization ('bow' or 'tfidf').

    Returns:
    X (sparse matrix): Vectorized text data.
    vectorizer (Vectorizer object): Fitted vectorizer.
    """
    if method == 'bow':
        vectorizer = CountVectorizer()
    elif method == 'tfidf':
        vectorizer = TfidfVectorizer()
    else:
        raise ValueError("Method must be 'bow' or 'tfidf'")

    X = vectorizer.fit_transform(texts)
    return X, vectorizer

In [115]:
def read_file(file_path):
    if os.path.exists(file_path):
        with open(file_path, 'r', encoding='latin-1') as file:
            text = file.read()
        return text
    else:
        raise FileNotFoundError(f"Error: File not found at '{file_path}'")
    
text = read_file('./20-newsgroups/alt.atheism.txt')
tokens = preprocess_text(text)

##### BoW Vectorization

In [116]:
X_bow, vectorizer_bow = vectorize_texts([tokens], method='bow')
print(X_bow)

<Compressed Sparse Row sparse matrix of dtype 'int64'
	with 10031 stored elements and shape (1, 10031)>
  Coords	Values
  (0, 6020)	528
  (0, 8734)	2805
  (0, 4027)	303
  (0, 1773)	2022
  (0, 7802)	57
  (0, 12)	30
  (0, 3169)	9
  (0, 133)	24
  (0, 9618)	207
  (0, 1347)	105
  (0, 6740)	162
  (0, 9529)	96
  (0, 4284)	225
  (0, 7731)	1389
  (0, 4253)	51
  (0, 3128)	45
  (0, 4132)	75
  (0, 2292)	6
  (0, 8648)	9
  (0, 1753)	21
  (0, 6882)	6
  (0, 1824)	93
  (0, 9941)	3015
  (0, 4091)	6
  (0, 2192)	99
  :	:
  (0, 1135)	2
  (0, 1136)	2
  (0, 1137)	2
  (0, 1138)	2
  (0, 1139)	2
  (0, 1140)	2
  (0, 1141)	2
  (0, 1142)	2
  (0, 1143)	2
  (0, 1144)	2
  (0, 1145)	2
  (0, 1146)	2
  (0, 1147)	2
  (0, 1148)	2
  (0, 1149)	2
  (0, 1150)	2
  (0, 1151)	2
  (0, 1152)	2
  (0, 1153)	2
  (0, 1154)	2
  (0, 1155)	2
  (0, 1156)	2
  (0, 1157)	2
  (0, 1158)	2
  (0, 1159)	2


##### TF-IDF Vectorization

In [117]:
X_tfidf, vectorizer_tfidf = vectorize_texts([tokens], method='tfidf')
print(X_tfidf)

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 10031 stored elements and shape (1, 10031)>
  Coords	Values
  (0, 6020)	0.03652461797590051
  (0, 8734)	0.19403703299697145
  (0, 4027)	0.020960150088442906
  (0, 1773)	0.139872684748619
  (0, 7802)	0.003942998531489259
  (0, 12)	0.0020752623849943468
  (0, 3169)	0.0006225787154983041
  (0, 133)	0.0016602099079954776
  (0, 9618)	0.014319310456460994
  (0, 1347)	0.0072634183474802145
  (0, 6740)	0.011206416878969474
  (0, 9529)	0.00664083963198191
  (0, 4284)	0.015564467887457602
  (0, 7731)	0.09608464842523827
  (0, 4253)	0.0035279460544903898
  (0, 3128)	0.0031128935774915206
  (0, 4132)	0.005188155962485868
  (0, 2292)	0.0004150524769988694
  (0, 8648)	0.0006225787154983041
  (0, 1753)	0.0014526836694960428
  (0, 6882)	0.0004150524769988694
  (0, 1824)	0.006433313393482475
  (0, 9941)	0.20856386969193186
  (0, 4091)	0.0004150524769988694
  (0, 2192)	0.006848365870481345
  :	:
  (0, 1135)	0.0001383508256662898
  (0, 1136)	0
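
Both matrices above are fitted on a single document, which is why the TF-IDF values are proportional to the BoW counts (for example, 0.0365 / 528 ≈ 0.1940 / 2805). A minimal sketch of why this happens, assuming scikit-learn's default `TfidfVectorizer` settings (smoothed IDF and L2 normalization); the helper name and the toy counts are hypothetical:

```python
import math

def tfidf_rows(doc_counts):
    """TF-IDF rows for a list of {term: count} dicts (smoothed IDF, L2 norm)."""
    n_docs = len(doc_counts)

    # Document frequency: number of documents containing each term
    df = {}
    for counts in doc_counts:
        for term in counts:
            df[term] = df.get(term, 0) + 1

    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in df.items()}

    rows = []
    for counts in doc_counts:
        weights = {t: c * idf[t] for t, c in counts.items()}
        # L2-normalize each row; fall back to 1.0 for an empty document
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        rows.append({t: w / norm for t, w in weights.items()})
    return rows

# With one document, every term has IDF = ln(2/2) + 1 = 1,
# so the result is just the L2-normalized count vector.
single = tfidf_rows([{"god": 3, "exist": 4}])[0]
print(single)
```

Under these assumptions, IDF only starts to differentiate terms once the vectorizer is fitted on several documents, as `apply_vsm` does later on.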

#### Read and Process all the documents

In [120]:
def read_and_preprocess_files(directory_path='./20-newsgroups'):
    preprocessed_texts = []
    filenames = []
    for filename in os.listdir(directory_path):
        file_path = os.path.join(directory_path, filename)
        if os.path.isfile(file_path):
            with open(file_path, 'r', encoding='latin-1') as file:
                text = file.read()
                preprocessed_text = preprocess_text(text)
                preprocessed_texts.append(preprocessed_text)
                filenames.append(filename)
    return preprocessed_texts, filenames

In [121]:
preprocessed_texts, filenames = read_and_preprocess_files()
preprocessed_texts[0][:50]  # First 50 characters of the first preprocessed document

'newsgroup 70337 kedz john kedziora subject motorcy'

# Retrieval Models

##### Set the query

In [222]:
query = "god who created the universe"

## Implementing BM25 (Best Matching 25)

In [223]:
def apply_bm25(query=query, preprocessed_texts=preprocessed_texts, filenames=filenames):
    """
    Apply BM25 algorithm to retrieve and rank documents based on a query.

    Parameters:
    query (str): The search query.
    preprocessed_texts (list of str): List of preprocessed texts.
    filenames (list of str): List of filenames corresponding to the texts.

    Returns:
    list: List of retrieved document filenames.
    dict: Dictionary of filenames and their BM25 scores.
    """
    # Tokenize the preprocessed texts
    tokenized_texts = [text.split() for text in preprocessed_texts]

    # Initialize BM25 with the tokenized corpus
    bm25 = BM25Okapi(tokenized_texts)

    # Preprocess and tokenize the query
    tokenized_query = preprocess_text(query).split()

    # Get BM25 scores for the query
    scores = bm25.get_scores(tokenized_query)

    # Rank documents based on the scores
    top_n = 5
    top_n_indices = scores.argsort()[-top_n:][::-1]

    # Retrieve the filenames of the top-ranked documents
    retrieved_docs = [filenames[i] for i in top_n_indices]

    # Create a dictionary of filenames and their scores
    ranked_documents = {filenames[i]: scores[i] for i in top_n_indices}

    return retrieved_docs, ranked_documents

In [224]:
bm25_retrieved_docs, bm25_ranked_documents = apply_bm25(query=query)

In [225]:
bm25_ranked_documents

{'alt.atheism.txt': np.float64(4.0251973789462125),
 'talk.religion.misc.txt': np.float64(4.024080265336261),
 'soc.religion.christian.txt': np.float64(4.01984885336562),
 'talk.politics.mideast.txt': np.float64(4.003892292409839),
 'talk.politics.misc.txt': np.float64(4.003560019654076)}
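
For reference, BM25 combines an IDF weight with a term-frequency component that saturates and is normalized by document length. A minimal pure-Python sketch of the textbook formula, assuming `k1 = 1.5`, `b = 0.75`, and a non-negative IDF variant; `rank_bm25` may differ in details such as how it treats terms that occur in most documents. The function name and toy corpus are hypothetical:

```python
import math
from collections import Counter

def bm25_scores(query_tokens, corpus_tokens, k1=1.5, b=0.75):
    """Score every tokenized document in a non-empty corpus against the query."""
    n_docs = len(corpus_tokens)
    avgdl = sum(len(doc) for doc in corpus_tokens) / n_docs

    # Document frequency of each term
    df = Counter(term for doc in corpus_tokens for term in set(doc))

    scores = []
    for doc in corpus_tokens:
        tf = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue
            # Non-negative IDF variant (an assumption of this sketch)
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturation with length normalization
            denom = tf[term] + k1 * (1 - b + b * len(doc) / avgdl)
            score += idf * tf[term] * (k1 + 1) / denom
        scores.append(score)
    return scores

corpus = [
    ["god", "creat", "univers"],
    ["motorcycl", "engin", "oil"],
    ["god", "exist", "debat"],
]
scores = bm25_scores(["god", "univers"], corpus)
print(scores)
```

In this toy corpus the first document matches both query terms, the third matches one, and the second matches none, so the scores are ordered accordingly.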

## Implementing Vector Space Model (VSM)

In [226]:
def apply_vsm(query=query, preprocessed_texts=preprocessed_texts, vectorizer=vectorizer_bow):
    # Vectorize the preprocessed texts
    X = vectorizer.fit_transform(preprocessed_texts)

    # Vectorize the query
    # Note: the raw query is used here; unlike apply_bm25, it is not passed
    # through preprocess_text, so stemmed document terms may not match it
    query_vector = vectorizer.transform([query])

    # Calculate cosine similarity between the query and the documents
    similarities = cosine_similarity(query_vector, X).flatten()

    # Rank documents by similarity score and keep the top 5
    ranked_indices = similarities.argsort()[::-1][:5]

    # Map each top-ranked filename to its similarity score
    ranked_documents = {filenames[i]: similarities[i] for i in ranked_indices}

    retrieved_documents = [filenames[i] for i in ranked_indices]
    return retrieved_documents, ranked_documents

In [227]:
vsm_bow_retrieved_documents, vsm_bow_ranked_documents = apply_vsm(vectorizer=vectorizer_bow)

In [228]:
vsm_bow_ranked_documents 

{'soc.religion.christian.txt': np.float64(0.2563118508119431),
 'talk.religion.misc.txt': np.float64(0.19650725528315457),
 'alt.atheism.txt': np.float64(0.16728725998947921),
 'talk.politics.mideast.txt': np.float64(0.011509771787109557),
 'rec.motorcycles.txt': np.float64(0.009676559291175625)}
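
The scores above are cosine similarities between the query's term vector and each document's term vector. A minimal sketch over sparse term-weight dictionaries, with hypothetical toy vectors; it illustrates the formula and is not how scikit-learn implements `cosine_similarity` internally:

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as {term: weight}."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # An empty vector is treated as similar to nothing
    return dot / (norm_u * norm_v)

query_vec = {"god": 1, "univers": 1}
doc_vec = {"god": 2, "exist": 1, "univers": 2}
similarity = cosine(query_vec, doc_vec)  # 4 / (sqrt(2) * 3)
print(similarity)
```

Because the measure divides by both vector lengths, a long document is not favored simply for containing more words.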

In [229]:
vsm_tfidf_retrieved_documents, vsm_tfidf_ranked_documents = apply_vsm(vectorizer=vectorizer_tfidf)

In [230]:
vsm_tfidf_ranked_documents

{'soc.religion.christian.txt': np.float64(0.11632702756982942),
 'talk.religion.misc.txt': np.float64(0.08837955958448962),
 'alt.atheism.txt': np.float64(0.07484612332476384),
 'rec.motorcycles.txt': np.float64(0.005874924039153561),
 'talk.politics.mideast.txt': np.float64(0.003823562618387289)}
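
For the Boolean retrieval model named in the Scope, a minimal sketch of AND/OR/NOT retrieval over an inverted index might look like the following. It assumes the text has already been preprocessed; the function names and toy documents are hypothetical:

```python
def build_inverted_index(docs):
    """Map each term to the set of document names that contain it."""
    index = {}
    for name, text in docs.items():
        for term in set(text.split()):
            index.setdefault(term, set()).add(name)
    return index

def boolean_and(index, terms):
    """Documents containing every term."""
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings) if postings else set()

def boolean_or(index, terms):
    """Documents containing at least one term."""
    return set().union(*(index.get(term, set()) for term in terms))

def boolean_not(index, term, all_docs):
    """Documents that do not contain the term."""
    return set(all_docs) - index.get(term, set())

docs = {
    "a.txt": "god creat univers",
    "b.txt": "god exist debat",
    "c.txt": "motorcycl engin oil",
}
index = build_inverted_index(docs)
print(boolean_and(index, ["god", "univers"]))
print(boolean_or(index, ["univers", "oil"]))
print(boolean_not(index, "god", docs))
```

Unlike BM25 and the VSM, this model returns an unranked set, so comparing it with the other two would need set-based measures such as precision and recall.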

# Evaluation

#### Decide the relevant documents

In [132]:
def get_documents_names(directory_path='./20-newsgroups'):
    # Create a dictionary of filenames with keys from 1 to 20
    filenames_dict = {i+1: filename for i, filename in enumerate(os.listdir(directory_path)[:20])}

    # Print the dictionary for user reference
    for key, value in filenames_dict.items():
        print(f"{key}: {value}")

    return filenames_dict

filenames_dict = get_documents_names()

1: misc.forsale.txt
2: alt.atheism.txt
3: rec.motorcycles.txt
4: comp.windows.x.txt
5: soc.religion.christian.txt
6: comp.graphics.txt
7: sci.crypt.txt
8: talk.politics.guns.txt
9: comp.sys.ibm.pc.hardware.txt
10: sci.space.txt
11: rec.sport.baseball.txt
12: talk.religion.misc.txt
13: sci.electronics.txt
14: comp.sys.mac.hardware.txt
15: rec.autos.txt
16: comp.os.ms-windows.misc.txt
17: sci.med.txt
18: talk.politics.misc.txt
19: rec.sport.hockey.txt
20: talk.politics.mideast.txt


In [231]:
def get_relevant_documents(filenames_dict):
    # Get user input for relevant document numbers
    relevant_numbers = input("Enter the numbers related to the relevant documents, separated by commas: ")
    relevant_numbers = [int(num.strip()) for num in relevant_numbers.split(',')]

    # Save the corresponding filenames in a list
    relevant_docs = [filenames_dict[num] for num in relevant_numbers if num in filenames_dict]

    return relevant_docs

# Example usage
relevant_docs = get_relevant_documents(filenames_dict)
print(f"Relevant documents: {relevant_docs}")

Relevant documents: ['talk.politics.guns.txt', 'talk.politics.mideast.txt', 'alt.atheism.txt']
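
`get_relevant_documents` raises a `ValueError` if the input contains an empty or non-numeric entry. A more forgiving parser could be sketched as follows; `parse_selection` is a hypothetical helper, not part of the notebook:

```python
def parse_selection(raw, valid_keys):
    """Parse a comma-separated string of numbers, skipping invalid entries."""
    selected = []
    for part in raw.split(","):
        part = part.strip()
        # Keep only decimal entries that are valid keys and not yet selected
        if part.isdecimal() and int(part) in valid_keys and int(part) not in selected:
            selected.append(int(part))
    return selected

# Blank, non-numeric, duplicate, and out-of-range entries are ignored
choices = parse_selection("8, 20, x, , 2, 2, 99", range(1, 21))
print(choices)
```

The returned numbers could then be mapped to filenames through `filenames_dict`, as the original function does.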


### Precision

In [232]:
def calculate_precision(retrieved_docs, relevant_docs):
    """
    Calculate precision.

    Parameters:
    retrieved_docs (list): List of retrieved document filenames.
    relevant_docs (list): List of relevant document filenames.

    Returns:
    float: Precision value.
    """
    # Convert lists to sets for easier calculation
    retrieved_set = set(retrieved_docs)
    relevant_set = set(relevant_docs)

    # Calculate the number of relevant documents retrieved
    relevant_retrieved = retrieved_set.intersection(relevant_set)

    # Calculate precision
    precision = len(relevant_retrieved) / len(retrieved_set) if retrieved_set else 0

    return precision

In [233]:
# Calculate precision for BM25
bm25_precision = calculate_precision(bm25_retrieved_docs, relevant_docs)
# Calculate precision for VSM with BoW
vsm_bow_precision = calculate_precision(vsm_bow_retrieved_documents, relevant_docs)
# Calculate precision for VSM with TF-IDF
vsm_tfidf_precision = calculate_precision(vsm_tfidf_retrieved_documents, relevant_docs)

print(f"BM25 Precision: {bm25_precision}")
print(f"VSM with BoW Precision: {vsm_bow_precision}")
print(f"VSM with TF-IDF Precision: {vsm_tfidf_precision}")

BM25 Precision: 0.4
VSM with BoW Precision: 0.4
VSM with TF-IDF Precision: 0.4


### Recall

In [234]:
def calculate_recall(retrieved_docs, relevant_docs):
    """
    Calculate recall.

    Parameters:
    retrieved_docs (list): List of retrieved document filenames.
    relevant_docs (list): List of relevant document filenames.

    Returns:
    float: Recall value.
    """
    # Convert lists to sets for easier calculation
    retrieved_set = set(retrieved_docs)
    relevant_set = set(relevant_docs)

    # Calculate the number of relevant documents retrieved
    relevant_retrieved = retrieved_set.intersection(relevant_set)

    # Calculate recall
    recall = len(relevant_retrieved) / len(relevant_set) if relevant_set else 0

    return recall

In [235]:
# Calculate recall for BM25
bm25_recall = calculate_recall(bm25_retrieved_docs, relevant_docs)
# Calculate recall for VSM with BoW
vsm_bow_recall = calculate_recall(vsm_bow_retrieved_documents, relevant_docs)
# Calculate recall for VSM with TF-IDF
vsm_tfidf_recall = calculate_recall(vsm_tfidf_retrieved_documents, relevant_docs)

print(f"BM25 Recall: {bm25_recall}")
print(f"VSM with BoW Recall: {vsm_bow_recall}")
print(f"VSM with TF-IDF Recall: {vsm_tfidf_recall}")

BM25 Recall: 0.6666666666666666
VSM with BoW Recall: 0.6666666666666666
VSM with TF-IDF Recall: 0.6666666666666666
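
These values follow directly from the definitions. Let $R$ be the set of retrieved documents and $G$ the set of documents marked relevant. Taking the BM25 results as the example, two of the five retrieved files (`alt.atheism.txt` and `talk.politics.mideast.txt`) are among the three relevant ones:

$$
\text{Precision} = \frac{|R \cap G|}{|R|} = \frac{2}{5} = 0.4,
\qquad
\text{Recall} = \frac{|R \cap G|}{|G|} = \frac{2}{3} \approx 0.667
$$

The two VSM variants retrieve the same two relevant files within their top five, which is why all three models report identical precision and recall for this query.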


### Mean Average Precision (MAP)

In [236]:
def calculate_mean_average_precision(retrieved_docs_list, relevant_docs_list):
    """
    Calculate Mean Average Precision (MAP).

    Parameters:
    retrieved_docs_list (list of lists): List of retrieved document filenames for each query.
    relevant_docs_list (list of lists): List of relevant document filenames for each query.

    Returns:
    float: Mean Average Precision value.
    """
    def calculate_average_precision(retrieved_docs, relevant_docs):
        """
        Calculate the average precision for a single query.


        Parameters:
        retrieved_docs (list): List of retrieved document filenames.
        relevant_docs (list): List of relevant document filenames.

        Returns:
        float: Average precision value.
        """
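        # Note: sets are unordered, so iterating over retrieved_set below
        # discards the ranking; average precision is normally computed
        # over the ranked retrieved_docs list instead
        retrieved_set = set(retrieved_docs)
        relevant_set = set(relevant_docs)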
        
        if not relevant_set:
            return 0.0

        relevant_retrieved = 0
        precision_sum = 0.0

        for i, doc in enumerate(retrieved_set):
            if doc in relevant_set:
                relevant_retrieved += 1
                precision_sum += relevant_retrieved / (i + 1)

        return precision_sum / len(relevant_set)

    average_precisions = [
        calculate_average_precision(retrieved_docs, relevant_docs)
        for retrieved_docs, relevant_docs in zip(retrieved_docs_list, relevant_docs_list)
    ]

    return sum(average_precisions) / len(average_precisions) if average_precisions else 0.0

In [237]:
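# Calculate Mean Average Precision
# Note: the mean here is taken over the three models for a single query,
# whereas MAP is usually averaged over multiple queries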
retrieved_docs_list = [bm25_retrieved_docs, vsm_bow_retrieved_documents, vsm_tfidf_retrieved_documents]
relevant_docs_list = [relevant_docs] * 3
mean_average_precision = calculate_mean_average_precision(retrieved_docs_list, relevant_docs_list)
print(f"Mean Average Precision: {mean_average_precision}")

Mean Average Precision: 0.3333333333333333
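
Average precision is usually computed over the ranked list, so that relevant documents appearing earlier contribute more. A minimal rank-aware sketch, using the BM25 ranking and the relevance judgments shown earlier as input; `average_precision_ranked` is a hypothetical name, and its result is not expected to match the figure printed above:

```python
def average_precision_ranked(ranked_docs, relevant_docs):
    """Average precision over a ranked list (best match first)."""
    relevant = set(relevant_docs)
    if not relevant:
        return 0.0

    hits = 0
    precision_sum = 0.0
    for rank, doc in enumerate(ranked_docs, start=1):
        if doc in relevant:
            hits += 1
            # Precision at the rank where this relevant document appears
            precision_sum += hits / rank
    return precision_sum / len(relevant)

bm25_ranking = [
    "alt.atheism.txt",
    "talk.religion.misc.txt",
    "soc.religion.christian.txt",
    "talk.politics.mideast.txt",
    "talk.politics.misc.txt",
]
relevant = ["talk.politics.guns.txt", "talk.politics.mideast.txt", "alt.atheism.txt"]

# Relevant hits at ranks 1 and 4: (1/1 + 2/4) / 3 = 0.5
ap = average_precision_ranked(bm25_ranking, relevant)
print(ap)
```

Keeping the list order is the key design choice here: converting the ranking to a set would make the result depend on an arbitrary iteration order.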


# Visualization

In [238]:
def count_word_in_files(directory_path='./20-newsgroups', word="army"):
    """
    Count the occurrences of a word in all files within a directory.

    Parameters:
    directory_path (str): Path to the directory containing text files.
    word (str): Word to count occurrences of.

    Returns:
    dict: A dictionary with filenames as keys and word counts as values, sorted by word counts.
    """
    word_count = {}
    word = word.lower()  # Convert the word to lowercase for case-insensitive matching

    for filename in os.listdir(directory_path):
        file_path = os.path.join(directory_path, filename)
        if os.path.isfile(file_path):
            with open(file_path, 'r', encoding='latin-1') as file:
                text = file.read()

                # Preprocess the text: remove punctuation, convert to lowercase
                text = re.sub(r'[^\w\s]', '', text)
                text = text.lower()

                # Count occurrences of the word
                count = text.split().count(word)
                word_count[filename] = count

    # Sort the dictionary by word counts in descending order
    sorted_word_count = dict(sorted(word_count.items(), key=lambda item: item[1], reverse=True))

    return sorted_word_count

word_counts = count_word_in_files()
print(word_counts)

{'talk.politics.mideast.txt': 430, 'talk.politics.guns.txt': 114, 'talk.politics.misc.txt': 38, 'sci.space.txt': 36, 'comp.graphics.txt': 35, 'soc.religion.christian.txt': 32, 'talk.religion.misc.txt': 20, 'sci.med.txt': 14, 'misc.forsale.txt': 12, 'sci.crypt.txt': 8, 'rec.autos.txt': 8, 'rec.sport.hockey.txt': 8, 'sci.electronics.txt': 6, 'alt.atheism.txt': 3, 'rec.motorcycles.txt': 2, 'comp.sys.ibm.pc.hardware.txt': 2, 'comp.windows.x.txt': 0, 'rec.sport.baseball.txt': 0, 'comp.sys.mac.hardware.txt': 0, 'comp.os.ms-windows.misc.txt': 0}
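
The Scope mentions frequency distributions and bar charts. As a dependency-free sketch, counts like those above could be rendered as a text bar chart; `text_bar_chart` is a hypothetical helper, and a plotting library could be used instead:

```python
def text_bar_chart(counts, top_n=5, width=40):
    """Render the top_n largest counts as horizontal text bars."""
    top = sorted(counts.items(), key=lambda item: item[1], reverse=True)[:top_n]
    if not top:
        return ""

    # Scale bars relative to the largest count (avoid dividing by zero)
    max_count = max(count for _, count in top) or 1
    label_width = max(len(name) for name, _ in top)

    lines = []
    for name, count in top:
        bar = "#" * round(width * count / max_count)
        lines.append(f"{name.ljust(label_width)} | {bar} {count}")
    return "\n".join(lines)

# A few of the counts printed above, used as sample input
sample_counts = {
    "talk.politics.mideast.txt": 430,
    "talk.politics.guns.txt": 114,
    "talk.politics.misc.txt": 38,
}
chart = text_bar_chart(sample_counts)
print(chart)
```

Passing the full `word_counts` dictionary instead of the sample would chart the five files with the most occurrences of the chosen word.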
