# Binary Independence Model (BIM) for probabilistic information retrieval

## Select the BIM Model
I have chosen the Binary Independence Model (BIM) as the probabilistic retrieval model for this information retrieval task.


## Preprocessing
Both the documents and the query are lowercased, tokenized on whitespace, and stripped of stop words. Stemming is not applied in this function. Here is the preprocessing function:

In [1]:
def preprocess(text):
    # Tokenize
    tokens = text.lower().split()
    # Remove stop words
    stop_words = {'the', 'is', 'at', 'of', 'on', 'and', 'a', 'to'}
    tokens = [t for t in tokens if t not in stop_words]
    return tokens
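Because `split()` only breaks on whitespace, punctuation stays attached to tokens, so `"dog."` and `"dog"` would count as different terms. A minimal sketch of a stricter tokenizer follows, assuming a regular-expression split is acceptable. The name `preprocess_strict` is hypothetical and not part of the notebook.

```python
import re

# Same stop-word list as preprocess() above
STOP_WORDS = {'the', 'is', 'at', 'of', 'on', 'and', 'a', 'to'}

def preprocess_strict(text):
    # Hypothetical variant: keep only runs of letters and digits,
    # so punctuation is dropped before stop-word removal
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

print(preprocess_strict("The quick, brown fox!"))  # ['quick', 'brown', 'fox']
```

For the punctuation-free example documents used later, this variant should produce the same tokens as `preprocess`.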


## Term Weighting
For each document and the query, create a binary vector where each term is marked as 1 if present and 0 otherwise.

In [2]:
def create_binary_vector(terms, vocab):
    term_set = set(terms)  # set membership tests are O(1) on average
    vector = [1 if term in term_set else 0 for term in vocab]
    return vector

In [3]:
def term_weighting(documents):
    # Preprocess documents and build vocabulary
    processed_docs = [preprocess(doc) for doc in documents]
    vocab = sorted(set([term for doc in processed_docs for term in doc]))  # Unique terms in all documents
    
    # Create binary vectors for each document
    doc_vectors = [create_binary_vector(doc, vocab) for doc in processed_docs]
    return doc_vectors, vocab
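To make the representation concrete, here is a small worked example of the vocabulary and the binary vectors. The two tokenized documents are made up for illustration and skip the preprocessing step.

```python
# Two already-tokenized toy documents (made up for illustration)
processed_docs = [["quick", "fox"], ["lazy", "dog", "fox"]]

# A sorted vocabulary gives every term a fixed position in the vector
vocab = sorted({term for doc in processed_docs for term in doc})

# One 0/1 entry per vocabulary term
doc_vectors = [[1 if term in doc else 0 for term in vocab] for doc in processed_docs]

print(vocab)        # ['dog', 'fox', 'lazy', 'quick']
print(doc_vectors)  # [[0, 1, 0, 1], [1, 1, 1, 0]]
```

Only presence is recorded: a term that occurs several times in a document still contributes a single 1.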

## Query Representation
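Convert the user’s query into a binary vector over the same vocabulary (the unique terms across all documents). Query terms that do not occur in any document have no position in the vector and are ignored.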

In [4]:
def query_representation(query, vocab):
    query_terms = preprocess(query)
    return create_binary_vector(query_terms, vocab)


## Document Scoring
The `scipy.spatial.distance` module includes a function, `dice`, that computes the Dice dissimilarity between two boolean 1-D arrays. Subtracting this dissimilarity from 1 gives the Dice similarity coefficient, which is used here to score each document against the query.

[dice — SciPy v1.14.1 Manual. (n.d.). https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.dice.html](https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.dice.html)

In [5]:
from scipy.spatial.distance import dice

def calculate_dice_coefficient(query_vector, doc_vector):
    return 1 - dice(query_vector, doc_vector)
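For two term sets A and B, the Dice coefficient is 2|A ∩ B| / (|A| + |B|). The sketch below computes it directly from two 0/1 vectors without SciPy, which can help when checking scores by hand. The function name `dice_similarity` and the choice to return 0.0 for two all-zero vectors are assumptions made here, not part of the notebook.

```python
def dice_similarity(u, v):
    # u and v are equal-length lists of 0/1 values
    both = sum(1 for a, b in zip(u, v) if a and b)  # terms present in both vectors
    total = sum(u) + sum(v)                         # |A| + |B|
    # Assumption: two all-zero vectors are given a similarity of 0.0
    return 2 * both / total if total else 0.0

query_vector = [0, 1, 0, 1]  # 2 terms
doc_vector = [1, 1, 1, 1]    # 4 terms, 2 of them shared with the query
print(round(dice_similarity(query_vector, doc_vector), 4))  # 0.6667
```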

## Ranking
Rank the documents based on their Dice coefficient scores.

In [6]:
def rank_documents(query_vector, doc_vectors):
    scores = [(i, calculate_dice_coefficient(query_vector, doc_vec)) for i, doc_vec in enumerate(doc_vectors)]
    ranked_docs = sorted(scores, key=lambda x: x[1], reverse=True)  # Higher Dice score means more similarity
    return ranked_docs


## Retrieve Top-K Documents
Return the top-K most similar documents. If the user does not provide K, the first five ranked documents are returned.

In [7]:
def retrieve_top_k_documents(ranked_docs, K=5):
    return ranked_docs[:K]
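Slicing works because the list is already fully sorted. When only a few results are needed from a large collection, the top K can also be selected without sorting everything. This is a minimal sketch using `heapq.nlargest` from the standard library; the helper name `top_k` is hypothetical.

```python
import heapq

def top_k(scores, k=5):
    # scores is a list of (doc_id, score) pairs;
    # returns the k highest-scoring pairs, best first
    return heapq.nlargest(k, scores, key=lambda pair: pair[1])

scores = [(0, 0.44), (1, 0.0), (2, 0.67)]
print(top_k(scores, k=2))  # [(2, 0.67), (0, 0.44)]
```

As with slicing, asking for more results than there are documents simply returns all of them.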

In [8]:
def binary_independence_model(documents, query, K=5):
    # Preprocess and generate term-weighted document vectors
    doc_vectors, vocab = term_weighting(documents)
    
    # Generate query vector
    query_vector = query_representation(query, vocab)
    
    # Rank documents based on Dice coefficient scores
    ranked_docs = rank_documents(query_vector, doc_vectors)
    
    # Retrieve top K documents
    top_k_docs = retrieve_top_k_documents(ranked_docs, K)
    
    return top_k_docs


## Presentation of Results
The ranked documents are printed in order of score, each with its document ID and its score expressed as a percentage.

In [9]:
documents = [
    "The quick brown fox jumps over the lazy dog",
    "Never jump over the lazy dog quickly",
    "A fox is quick and a dog is lazy",
]

query = "quick fox"

top_k_results = binary_independence_model(documents, query, K=3)
for doc_id, score in top_k_results:
    print(f"Document ID: {doc_id}, Score: {round(score * 100, 2)}% ")

Document ID: 2, Score: 66.67% 
Document ID: 0, Score: 44.44% 
Document ID: 1, Score: 0.0% 
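The loop above prints only IDs and scores. Since each ID is an index into `documents`, the text can be shown alongside the score. A minimal sketch follows; `format_result` is a hypothetical helper, and the scores are hard-coded from the output above so that the snippet runs on its own.

```python
documents = [
    "The quick brown fox jumps over the lazy dog",
    "Never jump over the lazy dog quickly",
    "A fox is quick and a dog is lazy",
]

# (doc_id, score) pairs in the shape returned by binary_independence_model;
# values copied from the output above
top_k_results = [(2, 2 / 3), (0, 4 / 9), (1, 0.0)]

def format_result(doc_id, score, documents):
    # Hypothetical helper: one line with ID, percentage score, and document text
    return f"Document ID: {doc_id}, Score: {score * 100:.2f}%, Text: {documents[doc_id]}"

for doc_id, score in top_k_results:
    print(format_result(doc_id, score, documents))
```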


# Non-Overlapped List Model

## Define the Linked List Data Structure

In [10]:
# Linked List class
class LinkList:

    # Nested Node class (a single element of the list)
    class Node:
        def __init__(self, data):
            self.data = data
            self.next = None
        
    def __init__(self):
        self.head = None
    
    # Method to add a new node to the end of the list
    def append(self, data):
        new_node = self.Node(data)
        if not self.head:
            self.head = new_node
            return
        last = self.head
        while last.next:
            last = last.next
        last.next = new_node
    
    # Method to convert linked list to a set for set operations
    def to_set(self):
        elements = set()
        current = self.head
        while current:
            elements.add(current.data)
            current = current.next
        return elements
    
    # Method to print the linked list
    def display(self):
        current = self.head
        while current:
            print(current.data, end=" -> ")
            current = current.next
        print("None")

## Identify Terms of Interest
I am interested in the terms "machine learning" and "data visualization."

## Retrieve Documents per Term

In [11]:
# Create linked lists for each term
docs_machine_learning = LinkList()
docs_machine_learning.append("Introduction to machine learning and its applications.")
docs_machine_learning.append("Machine learning models and data science.")
docs_machine_learning.append("Combining machine learning and data visualization.")

docs_data_visualization = LinkList()
docs_data_visualization.append("Data visualization techniques and tools.")
docs_data_visualization.append("Effective data visualization methods.")
docs_data_visualization.append("Combining machine learning and data visualization.")


## Combine Lists for Non-Overlapping Results
The `to_set` method of the `LinkList` class converts each linked list to a set, so the two collections can be combined with a set union. The union contains every document that mentions either term, and a document that appears in both lists is included only once.

In [12]:
set1 = docs_machine_learning.to_set()
set2 = docs_data_visualization.to_set()
non_overlap_set = set1.union(set2)
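A set does not keep the order in which documents were appended. If that order matters, duplicates can be removed while preserving it. This is a minimal sketch with plain lists and placeholder document names, not the notebook's `LinkList` objects.

```python
list1 = ["doc A", "doc B", "doc C"]
list2 = ["doc D", "doc E", "doc C"]  # "doc C" appears in both lists

# dict keys are unique and keep insertion order (Python 3.7+)
non_overlap_list = list(dict.fromkeys(list1 + list2))

print(non_overlap_list)  # ['doc A', 'doc B', 'doc C', 'doc D', 'doc E']
```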

## Present Results

In [13]:
non_overlap_set

{'Combining machine learning and data visualization.',
 'Data visualization techniques and tools.',
 'Effective data visualization methods.',
 'Introduction to machine learning and its applications.',
 'Machine learning models and data science.'}

# Proximal Nodes Model

## Define the Graph Data Structure
I have used a dictionary of adjacency lists to represent the graph. Each node is a key in the dictionary, and its value is a list of connected nodes.

In [14]:
# Graph class to represent the network of documents and entities
class Graph:
    def __init__(self):
        # Initialize the graph with an empty adjacency list
        self.graph = {}

    # Add a node to the graph
    def add_node(self, node):
        if node not in self.graph:
            self.graph[node] = []

    # Add an undirected edge between two nodes (ignored if either node is missing)
    def add_edge(self, node1, node2):
        if node1 in self.graph and node2 in self.graph:
            self.graph[node1].append(node2)
            self.graph[node2].append(node1)

    # Retrieve connected nodes (documents) to the given node
    def get_connected_nodes(self, node):
        return self.graph.get(node, [])

    # Display the graph (for debugging purposes)
    def display(self):
        for node in self.graph:
            print(f"{node}: {self.graph[node]}")


## Add Nodes and Edges
I have created nodes representing documents and added edges to represent the relationships between them.

In [15]:
# Create a graph instance
document_graph = Graph()

# Adding nodes (documents/entities)
document_graph.add_node("NASA")
document_graph.add_node("astronauts")
document_graph.add_node("space missions")
document_graph.add_node("moon landing")
document_graph.add_node("Mars exploration")
document_graph.add_node("space exploration")
document_graph.add_node("space telescopes")

# Adding edges (relationships)
document_graph.add_edge("NASA", "astronauts")
document_graph.add_edge("NASA", "space missions")
document_graph.add_edge("astronauts", "moon landing")
document_graph.add_edge("space missions", "Mars exploration")
document_graph.add_edge("NASA", "space telescopes")
document_graph.add_edge("space exploration", "NASA")
document_graph.add_edge("space exploration", "Mars exploration")


## Display graph
![Graph](./Documents_03/graph.png)

## Define Proximal Nodes
Suppose the user is interested in space exploration. This example uses the "moon landing" node as the proximal node.

In [16]:
# Identify proximal nodes based on interest
proximal_nodes = ["moon landing"]


## Explore Network Relationships
We'll look up every node that is directly connected to one of these proximal nodes.

In [17]:
# Function to explore network relationships and find connected documents
def retrieve_documents(graph, proximal_nodes):
    connected_documents = set()
    for node in proximal_nodes:
        # Retrieve all nodes directly connected to each proximal node
        connected_nodes = graph.get_connected_nodes(node)
        # Add the connected nodes to the result set
        for connected_node in connected_nodes:
            connected_documents.add(connected_node)
    return connected_documents
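`retrieve_documents` returns direct neighbours only. To reach nodes a few steps further away, a breadth-first search with a hop limit is one option. The sketch below assumes a plain adjacency dictionary built from the same edges as `document_graph`; the function name `nodes_within` and the `max_hops` parameter are hypothetical.

```python
from collections import deque

def nodes_within(adjacency, start, max_hops):
    # Breadth-first search; returns {node: hop count} for nodes
    # reachable from start in 1..max_hops steps
    hops = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if hops[node] == max_hops:
            continue  # do not expand past the hop limit
        for neighbor in adjacency.get(node, []):
            if neighbor not in hops:
                hops[neighbor] = hops[node] + 1
                queue.append(neighbor)
    del hops[start]  # the start node itself is not a result
    return hops

# Same undirected edges as document_graph above, as a plain dict
edges = [
    ("NASA", "astronauts"), ("NASA", "space missions"),
    ("astronauts", "moon landing"), ("space missions", "Mars exploration"),
    ("NASA", "space telescopes"), ("space exploration", "NASA"),
    ("space exploration", "Mars exploration"),
]
adjacency = {}
for a, b in edges:
    adjacency.setdefault(a, []).append(b)
    adjacency.setdefault(b, []).append(a)

print(nodes_within(adjacency, "moon landing", max_hops=2))
# {'astronauts': 1, 'NASA': 2}
```

With `max_hops=1` this reduces to the direct-neighbour lookup used in `retrieve_documents`.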

## Retrieve Connected Documents

In [18]:
# Retrieve the connected documents
relevant_documents = retrieve_documents(document_graph, proximal_nodes)

## Present Results

In [19]:
# Present the results
print("Relevant documents based on proximal nodes:")
for doc in relevant_documents:
    print(doc)


Relevant documents based on proximal nodes:
astronauts
