# BIM (Binary Independence Model) Implementation

You have been given a dataset (D1) containing 1000 documents, distributed across 10 folders:
https://www.kaggle.com/datasets/jensenbaxter/10dataset-text-document-classification

Task: Implement the BIM on these 1000 documents.
Skeleton code is provided, and a few methods are already implemented.

The driver code statements are also given, in sequence.
Run your code after completing the implementation.

## Do not clear the outputs.

In [81]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [82]:
import os
import re
from collections import defaultdict
from math import log
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from copy import deepcopy

In [83]:
import nltk
nltk.download('all')

[nltk_data] Downloading collection 'all'
[nltk_data]    | 
[nltk_data]    | Downloading package abc to /root/nltk_data...
[nltk_data]    |   Package abc is already up-to-date!
[nltk_data]    | Downloading package alpino to /root/nltk_data...
[nltk_data]    |   Package alpino is already up-to-date!
[nltk_data]    | Downloading package averaged_perceptron_tagger to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Package averaged_perceptron_tagger is already up-
[nltk_data]    |       to-date!
[nltk_data]    | Downloading package averaged_perceptron_tagger_eng to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Package averaged_perceptron_tagger_eng is already
[nltk_data]    |       up-to-date!
[nltk_data]    | Downloading package averaged_perceptron_tagger_ru to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Package averaged_perceptron_tagger_ru is already
[nltk_data]    |       up-to-date!
[nltk_data]    | Downloading package averaged_perceptron_tagger_r

True

In [84]:
dataset_path = '/content/drive/MyDrive/CSE 419/Lab Assign-4/1000_documents'

In [85]:
file_names = os.listdir(dataset_path)
len(file_names)

1000

In [86]:
file_names

['business_11.txt',
 'business_19.txt',
 'business_13.txt',
 'business_17.txt',
 'business_14.txt',
 'business_16.txt',
 'business_15.txt',
 'business_2.txt',
 'business_10.txt',
 'business_1.txt',
 'business_18.txt',
 'business_12.txt',
 'business_100.txt',
 'business_32.txt',
 'business_24.txt',
 'business_30.txt',
 'business_44.txt',
 'business_36.txt',
 'business_20.txt',
 'business_22.txt',
 'business_4.txt',
 'business_29.txt',
 'business_26.txt',
 'business_27.txt',
 'business_39.txt',
 'business_28.txt',
 'business_33.txt',
 'business_35.txt',
 'business_37.txt',
 'business_31.txt',
 'business_23.txt',
 'business_3.txt',
 'business_34.txt',
 'business_43.txt',
 'business_45.txt',
 'business_25.txt',
 'business_21.txt',
 'business_42.txt',
 'business_41.txt',
 'business_40.txt',
 'business_38.txt',
 'business_50.txt',
 'business_67.txt',
 'business_68.txt',
 'business_51.txt',
 'business_55.txt',
 'business_69.txt',
 'business_46.txt',
 'business_53.txt',
 'business_57.txt',
 'b

In [87]:
def import_dataset():
    """
    This function imports every document in the dataset folder,
    returning a list of lists where each sub-list contains the
    lowercased tokens of one document, in file_names order.
    """
    corpus = []
    for file in file_names:
        with open(os.path.join(dataset_path, file), 'r') as f:
            content = f.read().lower()  # read file and convert to lowercase
            tokens = word_tokenize(content)  # tokenize words
            corpus.append(tokens)
    return corpus


In [88]:
from nltk.corpus import stopwords

In [89]:
def remove_stop_words(corpus):
    '''
    This function removes stop words from the corpus using NLTK's stopwords.
    '''
    stop_words = set(stopwords.words('english'))
    clean_corpus = []
    for doc in corpus:
        clean_doc = [word for word in doc if word not in stop_words and re.match(r'\w+', word)]
        clean_corpus.append(clean_doc)
    return clean_corpus
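
The filter in `remove_stop_words` keeps any token that is not a stop word and that starts with a word character, because `re.match` only anchors at the beginning of the string. The sketch below illustrates that behaviour on a few tokens. The tiny `STOP_WORDS` set and the helper `keep_token` are hypothetical stand-ins so the example runs without NLTK data.

```python
import re

# Hypothetical miniature stop-word list, standing in for stopwords.words('english').
STOP_WORDS = {"the", "of", "a", "is"}

def keep_token(token):
    """Apply the same test as remove_stop_words to a single token."""
    return token not in STOP_WORDS and re.match(r"\w+", token) is not None

tokens = ["the", "battle", "of", "jutland", ",", "long-deferred", "3.1.", "...", "'s"]
kept = [t for t in tokens if keep_token(t)]
print(kept)  # ['battle', 'jutland', 'long-deferred', '3.1.']
```

Pure punctuation is dropped, while tokens such as `long-deferred` and `3.1.` survive because only their first character is checked.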

In [90]:
def make_inverted_index(corpus):
    """
    This function builds an inverted index as a hash table (dictionary)
    where the keys are the terms and the values are lists of the docIDs
    containing the term, in ascending order.
    """
    inverted_index = defaultdict(list)
    for doc_id, doc in enumerate(corpus):
        for word in set(doc):  # Using 'set' to avoid duplicates
            inverted_index[word].append(doc_id)
    return inverted_index
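
To make the structure concrete, here is a small sketch on a toy corpus. The function `build_index` repeats the logic of `make_inverted_index` so the example runs on its own, and `toy_corpus` is made up for illustration.

```python
from collections import defaultdict

def build_index(corpus):
    """Same logic as make_inverted_index, repeated so this example is standalone."""
    index = defaultdict(list)
    for doc_id, doc in enumerate(corpus):
        for word in set(doc):  # each document contributes a term at most once
            index[word].append(doc_id)
    return index

# Hypothetical three-document corpus.
toy_corpus = [
    ["war", "front", "war"],
    ["heart", "attack"],
    ["war", "attack"],
]
toy_index = build_index(toy_corpus)
print(toy_index["war"])     # [0, 2]
print(toy_index["attack"])  # [1, 2]
```

The length of a posting list is the document frequency of its term, which is what `DF` relies on.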

In [91]:
def posting_lists_union(pl1, pl2):
    """
    Returns a new posting list resulting from the union of two lists passed as arguments.
    """
    return sorted(set(pl1) | set(pl2))
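
The set-based union above is short and works for any input. When both posting lists are already sorted and free of duplicates, as the lists built by `make_inverted_index` are, a linear merge is a common alternative. The following is a sketch of that alternative under those assumptions, and `merge_union` is a hypothetical name.

```python
def merge_union(pl1, pl2):
    """Union of two sorted, duplicate-free posting lists in O(len(pl1) + len(pl2))."""
    merged = []
    i = j = 0
    while i < len(pl1) and j < len(pl2):
        if pl1[i] == pl2[j]:
            # Same docID in both lists: keep one copy and advance both pointers.
            merged.append(pl1[i])
            i += 1
            j += 1
        elif pl1[i] < pl2[j]:
            merged.append(pl1[i])
            i += 1
        else:
            merged.append(pl2[j])
            j += 1
    # At most one of the two lists still has elements left.
    merged.extend(pl1[i:])
    merged.extend(pl2[j:])
    return merged

print(merge_union([0, 2, 5], [1, 2, 7, 9]))  # [0, 1, 2, 5, 7, 9]
```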


In [92]:
def DF(term, index):
    """
    Compute Document Frequency for a term.
    """
    return len(index[term]) if term in index else 0


In [93]:
def IDF(term, index, corpus):
    '''
    Function computing Inverse Document Frequency for a term.
    '''
    N = len(corpus)
    df = DF(term, index)
    if df == 0:
        return 0
    return log((N + 1) / (df + 1))  # Smoothed IDF
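
As a quick check of how the smoothed IDF behaves, the sketch below evaluates the same formula for a few document frequencies with N = 1000. The function `smoothed_idf` is a hypothetical standalone restatement that takes `df` directly instead of looking it up in an index.

```python
from math import log

def smoothed_idf(df, n_docs):
    """log((N + 1) / (df + 1)), with 0 for unseen terms, as in IDF above."""
    if df == 0:
        return 0
    return log((n_docs + 1) / (df + 1))

N = 1000
for df in (1, 10, 100, 1000):
    # Rarer terms get a larger value; a term present in every document gets 0.
    print(df, round(smoothed_idf(df, N), 3))
```

Note that a term absent from the index and a term present in every document both end up with an IDF of 0.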

In [94]:
def RSV_weights(corpus,index):
    '''
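    This function precomputes the Retrieval Status Value weight
    of each term in the index.
    '''
    N = len(corpus)
    w = {}
    for term in index.keys():
        # p: fraction of documents containing the term; the 0.5 keeps p below 1
        p = DF(term, index)/(N+0.5)
        # Smoothed IDF plus the log-odds of p. The sum is roughly log(N / (N - df)),
        # so the weight grows with document frequency.
        w[term] = IDF(term, index, corpus) + log(p/(1-p))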
    return w
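
For comparison, the textbook BIM weight used when no relevance information is available is log((N - df + 0.5) / (df + 0.5)), which rewards rare terms. The sketch below puts it side by side with the weight computed by `RSV_weights`. Both helper names are hypothetical, and this is an illustration of the difference rather than a replacement for the notebook's code.

```python
from math import log

def rsj_weight(df, n_docs):
    """Textbook BIM weight without relevance information."""
    return log((n_docs - df + 0.5) / (df + 0.5))

def notebook_weight(df, n_docs):
    """The weight computed by RSV_weights above, restated for a single df >= 1."""
    p = df / (n_docs + 0.5)
    return log((n_docs + 1) / (df + 1)) + log(p / (1 - p))

N = 1000
for df in (1, 10, 100, 500):
    # Columns: document frequency, textbook weight, notebook weight.
    print(df, round(rsj_weight(df, N), 3), round(notebook_weight(df, N), 3))
```

With these formulas a term that occurs in a single document gets the largest textbook weight but a negative notebook weight, so the two schemes can rank the same documents very differently. The textbook weight itself turns negative once a term appears in more than about half of the documents.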

In [95]:
class BIM():
    '''
    Binary Independence Model class
    '''

    def __init__(self, corpus):
        self.original_corpus = deepcopy(corpus)
        self.articles = corpus
        self.index = make_inverted_index(self.articles)
        self.weights = RSV_weights(self.articles, self.index)
        self.ranked = []
        self.query_text = ''
        self.N_retrieved = 0



    def RSV_doc_query(self, doc_id, query):
        '''
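        This function computes the Retrieval Status Value for a (document, query) pair
        using the precomputed weights. The document is stored as a token list, so a query
        term that occurs several times in it adds its weight once per occurrence.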
        '''
        score = 0
        doc = self.articles[doc_id]
        for term in doc:
            if term in query:
                score += self.weights[term]
        return score



    def ranking(self, query):
        '''
        Auxiliary function for answer_query. Computes the score only for documents
        that are in the posting list of at least one term in the query.

        Step 01: find all docs that are in the posting list of at least one term in the query.
        Step 02: calculate the score of every doc retrieved in step 01, using RSV_doc_query().
        Step 03: sort the docs by RSV score, highest first, and return the ranking.
        '''
        docs_to_score = set()

        # Find all docs that are in the posting list of at least one term in the query
        for term in query:
            if term in self.index:
                docs_to_score.update(self.index[term])

        # Calculate score for all the docs retrieved
        scores = []
        for doc_id in docs_to_score:
            score = self.RSV_doc_query(doc_id, query)
            scores.append((doc_id, score))

        # Sort the docs based on the RSV score and return the rankings
        ranked_docs = sorted(scores, key=lambda x: x[1], reverse=True)
        return ranked_docs







    def answer_query(self, query_text):
        '''
        Function to answer a free-text query. Prints the first 20 words of the
        5 most relevant documents (fewer if fewer documents match).
        Pseudo relevance feedback (k = 5) is not implemented here.
        '''

        self.query_text = query_text
        query = query_text.lower().split()  # Split query into words
        ranking = self.ranking(query)

        self.N_retrieved = min(5, len(ranking))  # Handle if fewer than 5 docs are ranked

        # Print retrieved documents
        for i in range(0, self.N_retrieved):
            doc_id = ranking[i][0]
            article = self.original_corpus[doc_id]

            # Print the first 20 words of the article; slicing already handles shorter articles
            first_20_words = ' '.join(article[:20])
            print(f"Document {doc_id}: {first_20_words}...")
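
In a strictly binary model a term is either present in a document or absent, so its weight would be added at most once per document. The sketch below shows one way to score that way, by intersecting the distinct terms of the document and the query. The function `binary_rsv` and the toy weights are hypothetical and are not part of the class above.

```python
def binary_rsv(doc_terms, query_terms, weights):
    """Add each weight once for every distinct term shared by document and query."""
    shared = set(doc_terms) & set(query_terms)
    return sum(weights.get(term, 0.0) for term in shared)

# Hypothetical weights, for illustration only.
toy_weights = {"war": 1.5, "impact": 2.0}
doc = ["war", "war", "war", "front"]
print(binary_rsv(doc, ["impact", "of", "war"], toy_weights))  # 1.5
```

Looping over the token list of the same document, as `RSV_doc_query` does, would add the weight of `war` three times instead.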

In [96]:
articles = import_dataset()

In [97]:
cleaned_articles = remove_stop_words(articles)

In [98]:
bim = BIM(cleaned_articles)
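
Pseudo relevance feedback is not part of the `BIM` class above. One round of it could be sketched as follows: treat the top k ranked documents as relevant, re-estimate each term's weight from them, and rank again with the new weights. The estimates use the usual 0.5 smoothing. The function `feedback_weights` and the toy index are hypothetical, and the sketch assumes the index maps each term to its posting list.

```python
from math import log

def feedback_weights(index, n_docs, top_docs):
    """Re-estimate BIM term weights, treating the documents in top_docs as relevant."""
    relevant = set(top_docs)
    weights = {}
    for term, postings in index.items():
        df = len(postings)
        vr_t = len(relevant.intersection(postings))  # relevant docs containing the term
        # Smoothed probability of the term in relevant documents.
        p = (vr_t + 0.5) / (len(relevant) + 1)
        # Smoothed probability of the term in the remaining documents.
        u = (df - vr_t + 0.5) / (n_docs - len(relevant) + 1)
        weights[term] = log(p / (1 - p)) + log((1 - u) / u)
    return weights

# Hypothetical index over 6 documents; docs 0 and 1 play the role of the top-k results.
toy_index = {"war": [0, 1, 2], "front": [0, 1], "peace": [3]}
new_weights = feedback_weights(toy_index, n_docs=6, top_docs=[0, 1])
for term, weight in sorted(new_weights.items(), key=lambda kv: kv[1], reverse=True):
    print(term, round(weight, 3))
```

Terms concentrated in the assumed-relevant documents (`front` here) gain weight, while terms that never occur in them (`peace`) are pushed down.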

In [99]:
# Query 01:-
bim.answer_query("symptoms of heart attack")

Document 418: war west 1914 german invasion smooth working plan invasion france germans preliminarily reduce ring fortress liège commanded route prescribed 1st...
Document 590: cut hicnet medical newsletter page 13 volume 6 number 11 april 25 1993 food drug administration news fda approves depo...
Document 442: major developments 1916 western front 1916 1914 centre gravity world war western front 1915 shifted eastern 1916 moved back france...
Document 453: battle jutland summer 1916 saw long-deferred confrontation germany high seas fleet great britain grand fleet battle jutland—history biggest naval battle...
Document 410: eastern fronts 1914 learn war east 1914 eastern front greater distances quite considerable differences equipment quality opposing armies ensured fluidity...


In [100]:
bim.answer_query("Impact of war")

Document 459: historiography although opening archives ministry foreign affairs 30-year lock-up enabled new historical research war including jean-charles jauffret book la guerre...
Document 486: america first world war jennifer keene explores events led united states america joining first world war describes effect participation war...
Document 498: china great war professor xu guoqi provides overview china involvement first world war including role chinese labour corps clc western...
Document 504: prisoners war reality prisoners war world war one dr heather jones looks beyond propaganda consider facts around prisoner mistreatment labour...
Document 403: world war world war often abbreviated wwi ww1 also known first world war great war global war originating europe lasted...


In [101]:
bim.answer_query("3D animations")

Document 298: actually trying write something like encounter problems amongst drawing 3d wireframe view quadric/quartic requires explicit equation quadric/quartic x z functions...
Document 299: hi interested writing program generate sird picture know stereogram cross eyes picture becomes 3d anyone one know get one please...
Document 308: currently looking 3d graphics library runs ms windows 3.1. libraries visuallib must run vga require add-on graphics cards visuallib run...
Document 948: cheaper chip mobiles mobile phone chip combines modem computer processor one bit silicon instead two could make phones cheaper powerful...
Document 701: archive-name space/math last-modified date 93/04/01 14:39:12 references frequently recommended net fundamentals astrodynamics roger bate donald mueller jerry white 1971 dover...


In [102]:
bim.answer_query("Health benefits")

Document 564: cut volume 6 number 10 april 20 1993 health info-com network medical newsletter editor david dodell d.m.d 10250 north 92nd...
Document 569: cut volume 6 number 11 april 25 1993 health info-com network medical newsletter editor david dodell d.m.d 10250 north 92nd...
Document 553: cut limits azt efficacy suggest using drug either sequentially drugs kind aids treatment cocktail combining number drugs fight virus treating...
Document 590: cut hicnet medical newsletter page 13 volume 6 number 11 april 25 1993 food drug administration news fda approves depo...
Document 562: cut university arizona tucson arizona suggested reading tan sl royston p campbell jacobs hs betts j mason b edwards rg...


In [103]:
bim.answer_query("top movies")

Document 991: apple laptop gadget' apple powerbook 100 chosen greatest gadget time us magazine mobile pc 1991 laptop chosen one first lightweight...
Document 174: de niro film leads us box office film star robert de niro returned top north american box office film hide...
Document 195: boogeyman takes box office lead low-budget horror film boogeyman knocked robert de niro thriller hide seek top spot uk box...
Document 71: japanese mogul arrested fraud one japan best-known businessmen arrested thursday charges falsifying shareholder information selling shares based false data yoshiaki...
Document 100: holmes wins top tv moment' sprinter kelly holmes olympic victory named top television moment 2004 bbc poll holmes 800m gold...


In [104]:
bim.answer_query("Key factors for startup success")

Document 714: archive-name space/acronyms edition 8 acronym list sci.astro sci.space sci.space.shuttle edition 8 1992 dec 7 last posted 1992 aug 27 list...
Document 491: first world war ended professor david stevenson explains war came end germany accepted harsh terms armistice understand first world war...
Document 162: bening makes awards breakthrough film actress annette bening oscar starring role award-winning film julia bening born texas 1958 gained prominence...
Document 432: western eastern fronts 1915 western front 1915 repeated french attacks february–march 1915 germans trench barrier champagne 500 yards 460 metres...
Document 973: wi-fi web reaches farmers peru network community computer centres linked wireless technology providing helping hand poor farmers peru pilot scheme...
