<h1> <center> Assignment 06: Dynamic Indexing Algorithm </center>  </h1>

#### The following Python code was generated by a GenAI tool.

In [1]:
import os
from collections import defaultdict

class IndexSegment:
    def __init__(self):
        self.index = defaultdict(list)
    
    def add_document(self, doc_id, content):
        terms = content.lower().split() 
        for position, term in enumerate(terms):
            self.index[term].append((doc_id, position))
    
    def merge(self, other):
        for term, postings in other.index.items():
            self.index[term].extend(postings)
        return self
    
    
class DynamicIndex:
    def __init__(self, merge_threshold=2, buffer_size=5):
        self.segments = []
        self.merge_threshold = merge_threshold
        self.buffer_size = buffer_size
        self.buffer = IndexSegment()
        self.buffer_count = 0
    
    def add_document(self, doc_id, content):
        self.buffer.add_document(doc_id, content)
        self.buffer_count += 1
        
        if self.buffer_count >= self.buffer_size:
            self._flush_buffer()
    
    def _flush_buffer(self):
        if self.buffer_count > 0:
            self.segments.append(self.buffer)
            self.buffer = IndexSegment()
            self.buffer_count = 0
            self._merge_segments()
    
    def _merge_segments(self):
        i = 0
        while i < len(self.segments) - 1:
            if len(self.segments) - i >= self.merge_threshold:
                merged = self.segments[i]
                for j in range(1, self.merge_threshold):
                    merged = merged.merge(self.segments[i + j])
                self.segments[i] = merged
                del self.segments[i + 1 : i + self.merge_threshold]
            i += 1
    
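    def search(self, query):
        term = query.lower()  # the index stores lowercased terms
        results = []
        for segment in self.segments:
            results.extend(segment.index.get(term, []))
        # Also check documents that are still in the unflushed buffer.
        results.extend(self.buffer.index.get(term, []))
        return results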
    
    def update_document(self, doc_id, new_content):
        # Remove the old postings for doc_id from every segment and from the
        # buffer, dropping terms whose postings list becomes empty.
        for segment in self.segments + [self.buffer]:
            filtered = defaultdict(list)
            for term, postings in segment.index.items():
                kept = [(d, p) for d, p in postings if d != doc_id]
                if kept:
                    filtered[term] = kept
            segment.index = filtered
        # Re-index the document with its new content.
        self.add_document(doc_id, new_content)

### 1. Go through the code and write driver code that runs the dynamic indexing.
[Use dummy data; do not read files from secondary storage.]
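
One possible starting point for such a driver is sketched here. So that it runs on its own, it repeats a condensed copy of the two classes (without `update_document`); in the notebook, the definitions from the cell above would normally be reused. The documents in `dummy_docs` and the parameter values `merge_threshold=2, buffer_size=2` are made up for illustration.

```python
from collections import defaultdict

# Condensed copy of the classes above so this cell runs on its own.
class IndexSegment:
    def __init__(self):
        self.index = defaultdict(list)

    def add_document(self, doc_id, content):
        for position, term in enumerate(content.lower().split()):
            self.index[term].append((doc_id, position))

    def merge(self, other):
        for term, postings in other.index.items():
            self.index[term].extend(postings)
        return self


class DynamicIndex:
    def __init__(self, merge_threshold=2, buffer_size=5):
        self.segments = []
        self.merge_threshold = merge_threshold
        self.buffer_size = buffer_size
        self.buffer = IndexSegment()
        self.buffer_count = 0

    def add_document(self, doc_id, content):
        self.buffer.add_document(doc_id, content)
        self.buffer_count += 1
        if self.buffer_count >= self.buffer_size:
            self._flush_buffer()

    def _flush_buffer(self):
        if self.buffer_count > 0:
            self.segments.append(self.buffer)
            self.buffer = IndexSegment()
            self.buffer_count = 0
            self._merge_segments()

    def _merge_segments(self):
        i = 0
        while i < len(self.segments) - 1:
            if len(self.segments) - i >= self.merge_threshold:
                merged = self.segments[i]
                for j in range(1, self.merge_threshold):
                    merged = merged.merge(self.segments[i + j])
                self.segments[i] = merged
                del self.segments[i + 1 : i + self.merge_threshold]
            i += 1

    def search(self, query):
        term = query.lower()
        results = []
        for segment in self.segments + [self.buffer]:
            results.extend(segment.index.get(term, []))
        return results


# Hypothetical dummy documents (not taken from any dataset).
dummy_docs = {
    1: "Dynamic indexing keeps the index fresh",
    2: "A buffer collects new documents",
    3: "Segments are merged when the threshold is reached",
    4: "The buffer is flushed into a segment",
    5: "Search scans every segment and the buffer",
}

index = DynamicIndex(merge_threshold=2, buffer_size=2)
for doc_id, text in dummy_docs.items():
    index.add_document(doc_id, text)

print("segments:", len(index.segments), "| buffered docs:", index.buffer_count)
print("postings for 'buffer':", index.search("buffer"))
```

With these values, documents 1 to 4 end up in a single merged segment and document 5 is still waiting in the buffer, yet `search` finds it because the buffer is scanned as well.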

### 2. Write an explanation for each method in both classes.

### 3. Explain `merge_threshold` and `buffer_size` in the context of this code.
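
To see how the two parameters interact, it can help to track only the number of documents per segment. The sketch below is a simplified stand-in, not the classes themselves: the hypothetical helper `simulate` mirrors the flush condition in `add_document` and the loop in `_merge_segments`, but stores segment sizes as integers instead of real postings.

```python
def simulate(num_docs, buffer_size, merge_threshold):
    """Return (segment_sizes, docs_left_in_buffer) after adding num_docs documents."""
    segments = []   # each entry is the number of documents in one segment
    buffered = 0
    for _ in range(num_docs):
        buffered += 1
        if buffered >= buffer_size:
            # Flush: the buffer becomes a new segment.
            segments.append(buffered)
            buffered = 0
            # Same control flow as _merge_segments, applied to sizes.
            i = 0
            while i < len(segments) - 1:
                if len(segments) - i >= merge_threshold:
                    window = segments[i:i + merge_threshold]
                    segments[i:i + merge_threshold] = [sum(window)]
                i += 1
    return segments, buffered


print(simulate(12, buffer_size=5, merge_threshold=2))
print(simulate(12, buffer_size=2, merge_threshold=3))
```

In this model, `buffer_size` controls how many documents accumulate before a new segment is created, and `merge_threshold` controls how many segments must exist before they are combined into one.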

### 4. Next, execute the code on the 1000-document dataset: https://www.kaggle.com/datasets/jensenbaxter/10dataset-text-document-classification
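
A minimal sketch of a loader for this step follows. It assumes the dataset has been downloaded and unpacked locally as a directory tree of `.txt` files (for example, one sub-folder per category); that layout, the helper name `iter_text_documents`, and the use of a running integer as `doc_id` are all assumptions, not something the assignment specifies.

```python
import os

def iter_text_documents(root_dir):
    """Yield (doc_id, text) for every .txt file found under root_dir.

    doc_id is a running integer assigned in traversal order.
    """
    doc_id = 0
    for dirpath, dirnames, filenames in os.walk(root_dir):
        dirnames.sort()  # sort in place so sub-folders are visited in a stable order
        for name in sorted(filenames):
            if not name.endswith(".txt"):
                continue  # skip anything that is not a plain-text document
            path = os.path.join(dirpath, name)
            # errors="ignore" skips bytes that are not valid UTF-8
            with open(path, encoding="utf-8", errors="ignore") as handle:
                yield doc_id, handle.read()
            doc_id += 1
```

Each yielded pair can then be passed to `DynamicIndex.add_document(doc_id, text)` in a loop, with the root path replaced by wherever the dataset was unpacked.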

### 5. Write driver code that runs the document search functionality. [Execute 2-3 examples]
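
Since `search` returns a flat list of `(doc_id, position)` pairs, a small helper that groups hits per document can make the output of each example easier to read. The helper name `group_postings` and the sample postings are hypothetical; the sample only imitates the shape of a `search` result.

```python
from collections import defaultdict

def group_postings(postings):
    """Group (doc_id, position) pairs into {doc_id: [positions, ...]}."""
    grouped = defaultdict(list)
    for doc_id, position in postings:
        grouped[doc_id].append(position)
    # Plain dict, ordered by doc_id, with positions in ascending order.
    return {doc_id: sorted(positions) for doc_id, positions in sorted(grouped.items())}


# Hypothetical postings in the shape returned by DynamicIndex.search().
sample = [(2, 1), (4, 1), (5, 6), (2, 7)]
for doc_id, positions in group_postings(sample).items():
    print(f"doc {doc_id}: {len(positions)} hit(s) at positions {positions}")
```

In a real run, `sample` would be replaced by the result of a call such as `index.search("some_term")` for each query term being tried.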

