# Information Retrieval and Text Analytics Project

## Introduction

**Overview of Information Retrieval (IR) Systems:** An information retrieval system is a system or software designed to search for and retrieve relevant information from a large collection of unstructured or semi-structured data, such as text documents, images, or multimedia content, based on user queries. These systems are essential in handling vast amounts of information, such as those found on the internet, in digital libraries, or within enterprise data systems.

Examples of information retrieval systems:
1.   **Google:** A search engine that retrieves relevant web pages based on keywords entered by the user.
2.   **Amazon:** An e-commerce platform where users search for products by entering keywords or browsing categories.
3.   **PubMed:** A database for medical research articles, allowing users to search for academic papers related to health and science.



**Background:** Data can be found everywhere around us. With the proliferation of digital devices, vast amounts of data are being generated and shared. We can find valuable information in news, books, papers, documentations and wikis. Though to find this valuable information, one must swim through a sea of redundant and sometimes useless data. Therefore, there is a growing need for a way to quickly obtain and sift through this data to extract valuable insights. Many information retrieval models were developed to handle said issue, albeit with varying performances. This creates the issue where we want to know which model performs better in the realm of information retrieval.



**Objective:** This project aims to develop a system that retrieves relevant information from text datasets using 3 different information retrieval models and comparing between their performances to know which one is better. The system also allows us to improve the understanding and organization of the data through text analytics. This includes leveraging preprocessing techniques to enhance text representation, applying robust retrieval methods to ensure accuracy, and utilizing visualizations to provide actionable insights and evaluate performance effectively.

**Scope:** The scope includes implementing advanced preprocessing techniques such as tokenization, case standardization, stopwords removal, stemming, and TF-IDF to improve text representation. Additionally, it encompasses the integration of Vector Space Model, Boolean Retrieval Model, and BM25 retrieval algorithms for ranking and identifying relevant information based on user queries. We compare between the performances of these algorithms. To enhance usability, the project will also incorporate visualization tools such as word clouds for top keywords in documents, Frequency distribution of words, document-query similarity scores (e.g., bar charts), and clustering topics using LDA (Latent Dirichlet Allocation). The project will be limited to textual data and will not cover multimedia or non-textual information retrieval.

## Data Collection

In [127]:
# Downloading required dataset(s)
!pip install kagglehub
import kagglehub
# path = kagglehub.dataset_download("crawford/20-newsgroups")
# The dataset will be download in /home/<user>/.cache/kagglehub
import os



## Data Inspection

In [109]:
# Specify the file path
file_path = "./20-newsgroups/versions/1/alt.atheism.txt"  # Replace with the actual path to your file

# Check if the file exists
if os.path.exists(file_path):
    # Open the file and read its contents
    with open(file_path, 'r', encoding='latin-1') as file:  # Use 'latin-1' encoding
        file_contents = file.read()

    # Print or process the file contents
    print(file_contents)
else:
    print(f"Error: File not found at '{file_path}'")

From: mathew <mathew@mantis.co.uk>
Subject: Alt.Atheism FAQ: Atheist Resources

Archive-name: atheism/resources
Alt-atheism-archive-name: resources
Last-modified: 11 December 1992
Version: 1.0

                              Atheist Resources

                      Addresses of Atheist Organizations

                                     USA

FREEDOM FROM RELIGION FOUNDATION

Darwin fish bumper stickers and assorted other atheist paraphernalia are
available from the Freedom From Religion Foundation in the US.

Write to:  FFRF, P.O. Box 750, Madison, WI 53701.
Telephone: (608) 256-8900

EVOLUTION DESIGNS

Evolution Designs sell the "Darwin fish".  It's a fish symbol, like the ones
Christians stick on their cars, but with feet and the word "Darwin" written
inside.  The deluxe moulded 3D plastic fish is $4.95 postpaid in the US.

Write to:  Evolution Designs, 7119 Laurel Canyon #4, North Hollywood,
           CA 91605.

People in the San Francisco Bay area can get Darwin Fish from Lynn Gold

# Data Preprocessing

In [110]:
!pip install nltk
!pip install scikit-learn
import nltk
import re
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
nltk.download('punkt_tab')
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')



[nltk_data] Downloading package punkt_tab to
[nltk_data]     /home/k1ng0a21r/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package punkt to /home/k1ng0a21r/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /home/k1ng0a21r/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /home/k1ng0a21r/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

#### Implementing tokenization, lowercasing, stopwords removal, stemming and lemmatization.

In [111]:
def preprocess_text(text):
    # Tokenization
    tokens = nltk.word_tokenize(text)

    # Lowercasing
    tokens = [token.lower() for token in tokens]

    # Stopword Removal
    stop_words = set(stopwords.words('english'))
    tokens = [token for token in tokens if token not in stop_words and token.isalnum()]

    # Stemming
    stemmer = PorterStemmer()
    tokens = [stemmer.stem(token) for token in tokens]

    # Lemmatization
    lemmatizer = WordNetLemmatizer()
    tokens = [lemmatizer.lemmatize(token) for token in tokens]

    return ' '.join(tokens)

#### Implementing Vectorization (BoW/TF-IDF)

In [112]:
def vectorize_texts(texts, method='bow'):
    """
    Vectorize texts using Bag of Words (BOW) or TF-IDF.

    Parameters:
    texts (list of str): List of texts to vectorize.
    method (str): Method of vectorization ('bow' or 'tfidf').

    Returns:
    X (sparse matrix): Vectorized text data.
    vectorizer (Vectorizer object): Fitted vectorizer.
    """
    if method == 'bow':
        vectorizer = CountVectorizer()
    elif method == 'tfidf':
        vectorizer = TfidfVectorizer()
    else:
        raise ValueError("Method must be 'bow' or 'tfidf'")

    X = vectorizer.fit_transform(texts)
    return X, vectorizer

In [113]:
def read_file(file_path):
    if os.path.exists(file_path):
        with open(file_path, 'r', encoding='latin-1') as file:
            text = file.read()
        return text
    else:
        raise FileNotFoundError(f"Error: File not found at '{file_path}'")
    
text = read_file('./20-newsgroups/versions/1/alt.atheism.txt')
tokens = preprocess_text(text)

##### BoW Vectorization

In [114]:
X_bow, vectorizer_bow = vectorize_texts([tokens], method='bow')
print(X_bow)

<Compressed Sparse Row sparse matrix of dtype 'int64'
	with 10031 stored elements and shape (1, 10031)>
  Coords	Values
  (0, 6020)	528
  (0, 8734)	2805
  (0, 4027)	303
  (0, 1773)	2022
  (0, 7802)	57
  (0, 12)	30
  (0, 3169)	9
  (0, 133)	24
  (0, 9618)	207
  (0, 1347)	105
  (0, 6740)	162
  (0, 9529)	96
  (0, 4284)	225
  (0, 7731)	1389
  (0, 4253)	51
  (0, 3128)	45
  (0, 4132)	75
  (0, 2292)	6
  (0, 8648)	9
  (0, 1753)	21
  (0, 6882)	6
  (0, 1824)	93
  (0, 9941)	3015
  (0, 4091)	6
  (0, 2192)	99
  :	:
  (0, 1135)	2
  (0, 1136)	2
  (0, 1137)	2
  (0, 1138)	2
  (0, 1139)	2
  (0, 1140)	2
  (0, 1141)	2
  (0, 1142)	2
  (0, 1143)	2
  (0, 1144)	2
  (0, 1145)	2
  (0, 1146)	2
  (0, 1147)	2
  (0, 1148)	2
  (0, 1149)	2
  (0, 1150)	2
  (0, 1151)	2
  (0, 1152)	2
  (0, 1153)	2
  (0, 1154)	2
  (0, 1155)	2
  (0, 1156)	2
  (0, 1157)	2
  (0, 1158)	2
  (0, 1159)	2


##### TF-IDF Vectorization

In [115]:
X_tfidf, vectorizer_tfidf = vectorize_texts([tokens], method='tfidf')
print(X_tfidf)

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 10031 stored elements and shape (1, 10031)>
  Coords	Values
  (0, 6020)	0.03652461797590051
  (0, 8734)	0.19403703299697145
  (0, 4027)	0.020960150088442906
  (0, 1773)	0.139872684748619
  (0, 7802)	0.003942998531489259
  (0, 12)	0.0020752623849943468
  (0, 3169)	0.0006225787154983041
  (0, 133)	0.0016602099079954776
  (0, 9618)	0.014319310456460994
  (0, 1347)	0.0072634183474802145
  (0, 6740)	0.011206416878969474
  (0, 9529)	0.00664083963198191
  (0, 4284)	0.015564467887457602
  (0, 7731)	0.09608464842523827
  (0, 4253)	0.0035279460544903898
  (0, 3128)	0.0031128935774915206
  (0, 4132)	0.005188155962485868
  (0, 2292)	0.0004150524769988694
  (0, 8648)	0.0006225787154983041
  (0, 1753)	0.0014526836694960428
  (0, 6882)	0.0004150524769988694
  (0, 1824)	0.006433313393482475
  (0, 9941)	0.20856386969193186
  (0, 4091)	0.0004150524769988694
  (0, 2192)	0.006848365870481345
  :	:
  (0, 1135)	0.0001383508256662898
  (0, 1136)	0

In [116]:
def print_highest_token(vectorizer, matrix, document_index=0):
    """Prints the token with the highest TF-IDF or BoW value."""

    # Get feature names (tokens)
    feature_names = vectorizer.get_feature_names_out()

    # Get the vector for the specified document
    document_vector = matrix[document_index].toarray()[0]

    # Find the index of the highest value
    highest_index = document_vector.argmax()

    # Print the token and its value
    print(f"Highest token: {feature_names[highest_index]}")
    print(f"Value: {document_vector[highest_index]}")

print_highest_token(vectorizer_bow, X_bow)

Highest token: god
Value: 3411


#### Read and Process all the documents

In [117]:
!pip install rank_bm25
from rank_bm25 import BM25Okapi



In [118]:
def read_and_preprocess_files(directory_path='./20-newsgroups/versions/1'):
    preprocessed_texts = []
    filenames = []
    for filename in os.listdir(directory_path):
        file_path = os.path.join(directory_path, filename)
        if os.path.isfile(file_path):
            with open(file_path, 'r', encoding='latin-1') as file:
                text = file.read()
                preprocessed_text = preprocess_text(text)
                preprocessed_texts.append(preprocessed_text)
                filenames.append(filename)
    return preprocessed_texts, filenames

In [119]:
preprocessed_texts, filenames = read_and_preprocess_files()
preprocessed_texts[:][0]

'newsgroup 70337 kedz john kedziora subject motorcycl want sender kedz distribut ne organ worcest polytechn institut keyword look inexpens motorcycl noth fanci abl maintin self look 400 rang help great plea repli newsgroup 74150 myoakam micah r yoakam subject boat sale boat sale 1989 23 imperi fisherman featur walkaround cuddi cabin 305 v8 volvo duo prop outdriv cassett stereo vhf radio 4x6 hummingbird fishfind safti equip cover much 18000 lb capac includ storag trailer hardli use less 100 hr ask best offer inform contact gerald 419 mansfield oh newsgroup 74720 gt1706a maureen eagl subject want brother say interest buy one littl ca afford new one anybodi tire maureen gt1706a maureen eagl georgia institut technolog atlanta georgia 30332 uucp decvax hplab ncar purdu rutger gatech prism gt1706a internet gt1706a newsgroup 74721 mike diack subject make talk elev complet standalon system comput requir burn sound file eprom consist apollo eprom programm design specif job wont anyth el microph

## Implementing BM25 (Best Matching 25)

In [120]:
def apply_bm25(query, preprocessed_texts, filenames):
    # Initialize BM25 with the preprocessed corpus
    bm25 = BM25Okapi(preprocessed_texts)

    # Preprocess the query
    tokenized_query = preprocess_text(query)

    # Get BM25 scores for the query
    scores = bm25.get_scores(tokenized_query)

    # Print the scores
    # print(scores)

    # Rank documents based on the scores
    top_n = 20
    top_n_indices = bm25.get_top_n(tokenized_query, preprocessed_texts, n=top_n)

    # Print the ranked documents with filenames
    for i, doc in enumerate(top_n_indices):
        print(f"Rank {i+1} Document (Filename: {filenames[i]}), rank: {scores[i]}")

In [121]:
apply_bm25("god", preprocessed_texts, filenames)

Rank 1 Document (Filename: misc.forsale.txt), rank: -2.8447420598088877
Rank 2 Document (Filename: alt.atheism.txt), rank: -2.8447644993032997
Rank 3 Document (Filename: rec.motorcycles.txt), rank: -2.844755717099875
Rank 4 Document (Filename: comp.windows.x.txt), rank: -2.8447572534158603
Rank 5 Document (Filename: soc.religion.christian.txt), rank: -2.8447670553781954
Rank 6 Document (Filename: comp.graphics.txt), rank: -2.844769468815188
Rank 7 Document (Filename: sci.crypt.txt), rank: -2.8447618972574107
Rank 8 Document (Filename: talk.politics.guns.txt), rank: -2.84476458238129
Rank 9 Document (Filename: comp.sys.ibm.pc.hardware.txt), rank: -2.8447487362884534
Rank 10 Document (Filename: sci.space.txt), rank: -2.8447601381521825
Rank 11 Document (Filename: rec.sport.baseball.txt), rank: -2.8447574885934097
Rank 12 Document (Filename: talk.religion.misc.txt), rank: -2.844753593882407
Rank 13 Document (Filename: sci.electronics.txt), rank: -2.844753140645686
Rank 14 Document (Filena

## Implementing Vector Space Model (VSM)

In [122]:
from sklearn.metrics.pairwise import cosine_similarity

In [123]:
def apply_vsm(query, preprocessed_texts, vectorizer):
    # Vectorize the preprocessed texts
    X = vectorizer.fit_transform(preprocessed_texts)

    # Vectorize the query
    query_vector = vectorizer.transform([query])

    # Calculate cosine similarity between the query and the documents
    similarities = cosine_similarity(query_vector, X).flatten()

    # Rank documents based on similarity scores
    ranked_indices = similarities.argsort()[::-1]

    # Print the ranked documents with filenames and similarity scores
    for i in ranked_indices[:20]:  # Display top 20 results
        print(f"Document: {filenames[i]}, Similarity: {similarities[i]}")

In [124]:
apply_vsm("god", preprocessed_texts, vectorizer_tfidf)

Document: soc.religion.christian.txt, Similarity: 0.33038056501938123
Document: talk.religion.misc.txt, Similarity: 0.2510069193864692
Document: alt.atheism.txt, Similarity: 0.20867990336177794
Document: talk.politics.mideast.txt, Similarity: 0.010859305912303812
Document: talk.politics.misc.txt, Similarity: 0.010442857745026967
Document: rec.motorcycles.txt, Similarity: 0.010248346926407133
Document: talk.politics.guns.txt, Similarity: 0.008935004844217696
Document: sci.med.txt, Similarity: 0.007899189182740667
Document: rec.sport.hockey.txt, Similarity: 0.004980021917364271
Document: rec.sport.baseball.txt, Similarity: 0.004787641404817123
Document: misc.forsale.txt, Similarity: 0.00475620509499657
Document: sci.space.txt, Similarity: 0.004655556784613342
Document: comp.graphics.txt, Similarity: 0.0030682504539728725
Document: sci.electronics.txt, Similarity: 0.0023080980802023692
Document: comp.sys.ibm.pc.hardware.txt, Similarity: 0.002160970031673822
Document: rec.autos.txt, Simila

In [125]:
import os
import re

def count_word_in_files(directory_path, word):
    """
    Count the occurrences of a word in all files within a directory.

    Parameters:
    directory_path (str): Path to the directory containing text files.
    word (str): Word to count occurrences of.

    Returns:
    dict: A dictionary with filenames as keys and word counts as values, sorted by word counts.
    """
    word_count = {}
    word = word.lower()  # Convert the word to lowercase for case-insensitive matching

    for filename in os.listdir(directory_path):
        file_path = os.path.join(directory_path, filename)
        if os.path.isfile(file_path):
            with open(file_path, 'r', encoding='latin-1') as file:
                text = file.read()

                # Preprocess the text: remove punctuation, convert to lowercase
                text = re.sub(r'[^\w\s]', '', text)
                text = text.lower()

                # Count occurrences of the word
                count = text.split().count(word)
                word_count[filename] = count

    # Sort the dictionary by word counts in descending order
    sorted_word_count = dict(sorted(word_count.items(), key=lambda item: item[1], reverse=True))

    return sorted_word_count

# Example usage
directory_path = './20-newsgroups/versions/1'
qurey = 'god'
word_counts = count_word_in_files(directory_path, qurey)
print(word_counts)

{'soc.religion.christian.txt': 4092, 'alt.atheism.txt': 2898, 'talk.religion.misc.txt': 1746, 'talk.politics.mideast.txt': 228, 'talk.politics.misc.txt': 106, 'talk.politics.guns.txt': 86, 'rec.motorcycles.txt': 78, 'sci.med.txt': 74, 'rec.sport.baseball.txt': 42, 'rec.sport.hockey.txt': 42, 'comp.graphics.txt': 40, 'sci.space.txt': 40, 'misc.forsale.txt': 20, 'sci.crypt.txt': 20, 'comp.sys.ibm.pc.hardware.txt': 18, 'sci.electronics.txt': 16, 'comp.sys.mac.hardware.txt': 14, 'rec.autos.txt': 12, 'comp.windows.x.txt': 10, 'comp.os.ms-windows.misc.txt': 10}


# Evaluation

#### Precesion

#### Recall

#### Evaluate BM25

#### Evaluate VSM