# Assignment 4: Lightweight DBMS with B+ Tree Index

**Name:** Nishit Prajapati
**ID:** 24120036 
**Email:** nishit.prajapati@iitgn.ac.in<br>

**Name:** Mitansh Patel
**ID:** 24120033
**Email:** mitansh.patel@iitgn.ac.in<br>

**Name:** Chinteshwar Dhakate
**ID:** 24120024
**Email:** chinteshwar.dhakate@iitgn.ac.in<br>



## 1. Introduction

Efficient data storage and retrieval are hard problems for disk-based systems and large in-memory datasets: linear structures such as unsorted lists need $O(n)$ time to search for or update a record. This project addresses that limitation by implementing a lightweight DBMS on top of a B+ Tree index, which brings search, insertion, and deletion down to $O(\log n)$ and keeps the number of node accesses (disk I/O, in a disk-based setting) small.

**Key advantages of B+ Trees over brute-force approaches**

**Structural efficiency:**
* **Balanced hierarchy:** Automatic node splitting and merging keep the height at $h = O(\log n)$ regardless of insertion order.
* **Linked leaf nodes:** Range queries run in $O(\log n + k)$, where $k$ is the number of matching records: one descent finds the first leaf, then the scan follows the links between leaves.
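
To make the linked-leaf idea concrete, here is a minimal sketch of the sequential scan. The `Leaf` class, `link_leaves`, and `range_scan` are hypothetical names chosen for illustration and are not taken from the project's code. The sketch also starts from the leftmost leaf; in a full tree, the starting leaf would be located by the $O(\log n)$ descent from the root.

```python
from bisect import bisect_left


class Leaf:
    """Hypothetical leaf node: sorted keys, parallel values, link to the next leaf."""

    def __init__(self, keys, values):
        self.keys = keys
        self.values = values
        self.next_leaf = None


def link_leaves(leaves):
    """Chain leaves left to right and return the leftmost one."""
    for left, right in zip(leaves, leaves[1:]):
        left.next_leaf = right
    return leaves[0]


def range_scan(start_leaf, low, high):
    """Collect (key, value) pairs with low <= key <= high, starting at start_leaf."""
    results = []
    leaf = start_leaf
    while leaf is not None:
        # Binary search skips keys below `low` inside the current leaf.
        i = bisect_left(leaf.keys, low)
        for key, value in zip(leaf.keys[i:], leaf.values[i:]):
            if key > high:
                return results  # keys are sorted, so nothing further can match
            results.append((key, value))
        leaf = leaf.next_leaf
    return results


head = link_leaves([
    Leaf([5, 7], ["a", "b"]),
    Leaf([10, 12], ["c", "d"]),
    Leaf([15, 20, 25], ["e", "f", "g"]),
])
print(range_scan(head, 7, 15))
```

The scan never climbs back to a parent node: once the first leaf is known, the cost is proportional to the number of leaves touched, which is what gives the $k$ term in $O(\log n + k)$.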
  
**Performance comparison:**

| Operation       | B+ Tree Complexity | Brute-Force List Complexity |
| --------------- | :----------------: | :-------------------------: |
| Insertion       | $O(\log n)$        | $O(1)$ (append)             |
| Search          | $O(\log n)$        | $O(n)$                      |
| Range Query     | $O(\log n + k)$    | $O(n)$                      |
| Memory Overhead | $O(n)$, plus per-node pointer overhead | $O(n)$  |
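
As a rough illustration of why the logarithmic bound matters, the sketch below counts how many levels a tree needs to address a given number of keys. It is a simplified model that assumes every node is completely full; real B+ Tree nodes sit between half full and full, so actual heights can be somewhat larger. The function name `levels_needed` is hypothetical.

```python
def levels_needed(num_keys, fanout):
    """Smallest number of levels h such that fanout ** h >= num_keys."""
    if num_keys < 1 or fanout < 2:
        raise ValueError("need num_keys >= 1 and fanout >= 2")
    levels, capacity = 1, fanout
    while capacity < num_keys:
        capacity *= fanout
        levels += 1
    return levels


for fanout in (4, 10, 50):
    print(f"fanout {fanout}: {levels_needed(5000, fanout)} levels for 5000 keys")
```

Under this model a search touches a handful of nodes even for thousands of keys, whereas a linear scan may have to inspect every record.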


The system implemented for this assignment provides persistent storage through Python's `pickle` module and automated benchmarking against a `BruteForceDB` baseline. By relying on B+ Tree properties such as a configurable fan-out (default order of 4) and linked leaves, the DBMS achieves:
* Faster range queries than the baseline in tests with 4,000 or more records.
* Fewer node visits (disk seeks, in a disk-based setting) than a linear scan.
* No data corruption during split and merge operations, enforced through parent-pointer validation.
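
The `pickle`-based persistence mentioned above could look roughly like the following sketch. The helper names `save_database` and `load_database`, the file name, and the dictionary layout are assumptions made for illustration, not the project's actual API. Writing to a temporary file and renaming it is one common way to avoid leaving a half-written file behind if the process stops mid-save.

```python
import os
import pickle
import tempfile


def save_database(tables, path):
    """Pickle `tables` to a temporary file, then swap it into place."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as handle:
            pickle.dump(tables, handle, protocol=pickle.HIGHEST_PROTOCOL)
        os.replace(tmp_path, path)  # replaces the old file in a single step
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def load_database(path):
    """Return the pickled tables, or an empty dict if no file exists yet."""
    if not os.path.exists(path):
        return {}
    with open(path, "rb") as handle:
        return pickle.load(handle)


with tempfile.TemporaryDirectory() as workdir:
    db_path = os.path.join(workdir, "example_db.pkl")
    save_database({"students": {1: "Asha", 2: "Ravi"}}, db_path)
    restored = load_database(db_path)
    print(restored["students"][2])
```

Note that `pickle.load` can execute arbitrary code from the file it reads, so a database file should only be loaded if it comes from a trusted source.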

This architecture is particularly effective for database applications that require ACID-like properties and efficient access to secondary storage.

## 2. Implementation Details

Describe the key aspects of your implementation:
*   **B+ Tree:** Explain the structure of your `BPlusTreeNode` (leaf vs. internal differentiation, keys, values, children, next_leaf pointers). Detail the logic for:
    *   Insertion (finding leaf, inserting, splitting logic - copy-up vs. push-up, root split).
    *   Deletion (finding key, removing, underflow handling - borrowing logic, merging logic, root changes).
    *   Search (traversal to leaf).
    *   Range Query (finding start leaf, following `next_leaf` pointers).
*   **Table Class:** How does the `Table` class use the `BPlusTree`?
*   **Database Manager:** How does `Database` manage multiple `Table` instances?
*   **Persistence:** Explain how you implemented database persistence (e.g., using `pickle`). What data is saved?
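
As a reference point for the B+ Tree items in the list above, here is a compact sketch of search and insertion with both split styles. It is not the project's `BPlusTreeNode` or `BPlusTree`: the class names `Node` and `TinyBPlusTree`, the convention that `order` is the maximum number of children, and the choice to overwrite duplicate keys are all assumptions made for illustration. Deletion (borrowing and merging) is left out to keep the sketch short.

```python
from bisect import bisect_left, bisect_right


class Node:
    """Hypothetical node layout; the project's BPlusTreeNode may differ."""

    def __init__(self, is_leaf):
        self.is_leaf = is_leaf
        self.keys = []
        self.values = []       # used by leaves only
        self.children = []     # used by internal nodes only
        self.next_leaf = None  # used by leaves only


class TinyBPlusTree:
    def __init__(self, order=4):
        self.order = order  # assumed meaning: max children per internal node
        self.root = Node(is_leaf=True)

    def _find_leaf(self, key):
        node = self.root
        while not node.is_leaf:
            # Keys equal to a separator live in the right subtree.
            node = node.children[bisect_right(node.keys, key)]
        return node

    def search(self, key):
        leaf = self._find_leaf(key)
        i = bisect_left(leaf.keys, key)
        if i < len(leaf.keys) and leaf.keys[i] == key:
            return leaf.values[i]
        return None

    def insert(self, key, value):
        split = self._insert(self.root, key, value)
        if split is not None:
            # Root split: the tree grows one level taller.
            separator, right = split
            new_root = Node(is_leaf=False)
            new_root.keys = [separator]
            new_root.children = [self.root, right]
            self.root = new_root

    def _insert(self, node, key, value):
        if node.is_leaf:
            i = bisect_left(node.keys, key)
            if i < len(node.keys) and node.keys[i] == key:
                node.values[i] = value  # overwrite an existing key
                return None
            node.keys.insert(i, key)
            node.values.insert(i, value)
            return self._split_leaf(node) if len(node.keys) >= self.order else None
        i = bisect_right(node.keys, key)
        split = self._insert(node.children[i], key, value)
        if split is None:
            return None
        separator, right = split
        node.keys.insert(i, separator)
        node.children.insert(i + 1, right)
        return self._split_internal(node) if len(node.children) > self.order else None

    def _split_leaf(self, leaf):
        mid = len(leaf.keys) // 2
        right = Node(is_leaf=True)
        right.keys, leaf.keys = leaf.keys[mid:], leaf.keys[:mid]
        right.values, leaf.values = leaf.values[mid:], leaf.values[:mid]
        right.next_leaf, leaf.next_leaf = leaf.next_leaf, right
        return right.keys[0], right  # copy-up: the separator also stays in the leaf

    def _split_internal(self, node):
        mid = len(node.keys) // 2
        separator = node.keys[mid]
        right = Node(is_leaf=False)
        right.keys, node.keys = node.keys[mid + 1:], node.keys[:mid]
        right.children, node.children = node.children[mid + 1:], node.children[:mid + 1]
        return separator, right  # push-up: the separator leaves this node


tree = TinyBPlusTree(order=3)
for k in [10, 20, 5, 15, 25, 7, 12]:
    tree.insert(k, f"value_{k}")
print(tree.search(12), tree.root.keys)
```

The difference between the two split helpers is the point of the sketch: a leaf split copies its separator upward because every key must remain reachable in a leaf, while an internal split moves the separator upward because internal keys serve only as routing information.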

In [None]:
# Optional: Include key code snippets here if useful
# Example: How you represent a node or handle splitting

# Setup code: Import necessary libraries and your custom modules
import sys
import os
import time
import random
import matplotlib.pyplot as plt
import pickle
import numpy as np

# Add the project directory to the Python path so the `database` package can be imported.
# Change '.' to '..' if the notebook sits one level below the project root.
module_path = os.path.abspath(os.path.join('.'))
if module_path not in sys.path:
    sys.path.append(module_path)

from database.db_manager import Database
from database.bplustree import BPlusTree # For direct tree testing if needed
from database.bruteforce import BruteForceDB

# --- Performance Analyzer Class (or functions) --- 
# You'll implement Task 2 logic here or import it
class PerformanceAnalyzer:
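    def __init__(self):
        # Filled in by run_benchmarks(); initialised here so attribute access never fails
        self.results = {}
        self.memory_usage = {}
        self.data_sizes = []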

    def time_operation(self, data_structure, operation, keys, values=None):
        """Return the wall-clock seconds taken to apply `operation` to every item in `keys`."""
        start_time = time.perf_counter()
        if operation == 'insert':
            for i, key in enumerate(keys):
                # Fall back to the key itself when no values are supplied
                data_structure.insert(key, values[i] if values else key)
        elif operation == 'search':
            for key in keys:
                data_structure.search(key)
        elif operation == 'delete':
            for key in keys:
                data_structure.delete(key)
        elif operation == 'range_query':
            # For this operation, `keys` holds (start, end) tuples
            for start, end in keys:
                data_structure.range_query(start, end)
        else:
            # Add other operations like 'update' or 'get_all' above if needed
            raise ValueError(f"Unknown operation: {operation}")
        end_time = time.perf_counter()
        return end_time - start_time

    def run_benchmarks(self, data_sizes, tree_order=50):
        # data_sizes e.g., [100, 1000, 10000, 50000, 100000]
        results = {'bplus': {'insert': [], 'search': [], 'delete': [], 'range': []}, 
                   'brute': {'insert': [], 'search': [], 'delete': [], 'range': []}}
        memory_usage = {'bplus': [], 'brute': []}
        
        for size in data_sizes:
            print(f"\nBenchmarking for data size: {size}...")
            keys = random.sample(range(size * 5), size) # Generate unique random keys
            values = [f"value_{k}" for k in keys] # Simple string values
            
            # B+ Tree
            bplus_tree = BPlusTree(order=tree_order)
            insert_time = self.time_operation(bplus_tree, 'insert', keys, values)
            results['bplus']['insert'].append(insert_time)
            print(f"  B+ Tree Insert Time: {insert_time:.4f}s")
            # memory_usage['bplus'].append(bplus_tree.get_memory_usage()) # Need to implement this

            # Brute Force
            brute_db = BruteForceDB()
            insert_time_brute = self.time_operation(brute_db, 'insert', keys, values)
            results['brute']['insert'].append(insert_time_brute)
            print(f"  Brute Force Insert Time: {insert_time_brute:.4f}s")
            memory_usage['brute'].append(brute_db.get_memory_usage())

            # --- Search Benchmark ---
            search_keys = random.sample(keys, min(len(keys), 1000)) # Search up to 1000 existing keys
            search_time = self.time_operation(bplus_tree, 'search', search_keys)
            results['bplus']['search'].append(search_time)
            print(f"  B+ Tree Search Time: {search_time:.4f}s")
            search_time_brute = self.time_operation(brute_db, 'search', search_keys)
            results['brute']['search'].append(search_time_brute)
            print(f"  Brute Force Search Time: {search_time_brute:.4f}s")
            
            # --- Range Query Benchmark ---
            range_queries = []
            for _ in range(100): # Perform 100 range queries
                start = random.randint(0, size*4)
                end = start + random.randint(size // 10, size // 5) # Query range size related to data size
                range_queries.append((start, end))
            range_time = self.time_operation(bplus_tree, 'range_query', range_queries)
            results['bplus']['range'].append(range_time)
            print(f"  B+ Tree Range Query Time: {range_time:.4f}s")
            range_time_brute = self.time_operation(brute_db, 'range_query', range_queries)
            results['brute']['range'].append(range_time_brute)
            print(f"  Brute Force Range Query Time: {range_time_brute:.4f}s")

            # --- Deletion Benchmark ---
            delete_keys = random.sample(keys, min(len(keys), 1000)) # Delete up to 1000 existing keys
            delete_time = self.time_operation(bplus_tree, 'delete', delete_keys)
            results['bplus']['delete'].append(delete_time)
            print(f"  B+ Tree Delete Time: {delete_time:.4f}s")
            # Note: Deleting from bruteforce list while iterating needs care, but list comprehension is fine
            delete_time_brute = self.time_operation(brute_db, 'delete', delete_keys)
            results['brute']['delete'].append(delete_time_brute)
            print(f"  Brute Force Delete Time: {delete_time_brute:.4f}s")
            
        self.results = results
        self.memory_usage = memory_usage
        self.data_sizes = data_sizes
        return results, memory_usage

    def plot_results(self):
        if not self.results:
            print("No results to plot. Run benchmarks first.")
            return
        
        operations = ['insert', 'search', 'delete', 'range']
        fig, axes = plt.subplots(len(operations), 1, figsize=(10, 6 * len(operations)), sharex=True)
        fig.suptitle('Performance Comparison: B+ Tree vs Brute Force')

        for i, op in enumerate(operations):
            ax = axes[i]
            ax.plot(self.data_sizes, self.results['bplus'][op], marker='o', label=f'B+ Tree ({op})')
            ax.plot(self.data_sizes, self.results['brute'][op], marker='x', label=f'Brute Force ({op})')
            ax.set_ylabel('Time (seconds)')
            ax.set_title(f'{op.capitalize()} Performance')
            ax.legend()
            ax.grid(True)
            # ax.set_yscale('log') # Use log scale if differences are very large

        axes[-1].set_xlabel('Number of Keys')
        plt.tight_layout(rect=[0, 0.03, 1, 0.97]) # Adjust layout to prevent title overlap
        plt.show()
        
        # Plot Memory Usage (if collected)
        # if self.memory_usage and self.memory_usage['brute']:
        #     plt.figure(figsize=(10, 5))
        #     # plt.plot(self.data_sizes, self.memory_usage['bplus'], marker='o', label='B+ Tree Memory')
        #     plt.plot(self.data_sizes, self.memory_usage['brute'], marker='x', label='Brute Force Memory')
        #     plt.xlabel('Number of Keys')
        #     plt.ylabel('Memory Usage (bytes - approximate)')
        #     plt.title('Memory Usage Comparison')
        #     plt.legend()
        #     plt.grid(True)
        #     plt.show()
        

## 3. Performance Analysis

Present the benchmarking results (Task 4). 
*   Use tables and graphs (`matplotlib`) to show the time taken for insertion, search, deletion, and range queries for both the B+ Tree and BruteForceDB across different data sizes.
*   Discuss the findings. Does the B+ Tree show the expected logarithmic performance compared to the linear performance of the brute-force approach? Analyze the results for each operation type.
*   Discuss memory usage if you measured it.
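
For the memory discussion, one option is to measure peak allocations with the standard-library `tracemalloc` module rather than relying on a hand-written size estimate. The sketch below is an assumption about how this could be done, not part of the project: `peak_memory_bytes` is a hypothetical helper, and `build_pair_list` is only a stand-in for building a brute-force store.

```python
import tracemalloc


def peak_memory_bytes(build):
    """Run build() and return (result, peak bytes allocated while it ran)."""
    tracemalloc.start()
    try:
        result = build()
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak


def build_pair_list(n):
    # Stand-in for a brute-force store: a flat list of (key, value) tuples.
    return [(k, f"value_{k}") for k in range(n)]


for size in (100, 1000, 5000):
    _, peak = peak_memory_bytes(lambda: build_pair_list(size))
    print(f"{size:>5} records: about {peak} bytes at peak")
```

Passing a callable that builds a B+ Tree of the same size would give a like-for-like comparison, since both structures would be measured by the same allocator-level counter.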

In [None]:
# --- Run Benchmarks ---
analyzer = PerformanceAnalyzer()
# data_sizes_to_test = [100, 1000, 5000, 10000, 20000] # Adjust sizes as needed
data_sizes_to_test = [100, 500, 1000, 2500, 5000] # Smaller sizes for quicker testing
bplus_order = 10 # Smaller order makes splits/merges happen sooner

results, memory = analyzer.run_benchmarks(data_sizes_to_test, tree_order=bplus_order)

# You can display results in tables here using pandas or simple printing
print("\n--- Benchmark Results Summary ---")
# Example: Print insert times
print("Data Sizes:", analyzer.data_sizes)
print("B+ Insert Times:", [f"{t:.4f}" for t in results['bplus']['insert']])
print("Brute Insert Times:", [f"{t:.4f}" for t in results['brute']['insert']])
# ... print other results ...

# --- Plot Results ---
analyzer.plot_results()

## 4. Visualization

Include examples of the B+ Tree visualizations generated using `graphviz` (Task 3). 
*   Show the tree structure for a small number of insertions.
*   Potentially show the tree before and after a split or merge operation if feasible.
*   Ensure the visualization clearly shows internal nodes, leaf nodes, keys, parent-child relationships, and the leaf node linked list.
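
One way to meet the requirements in the list above is to emit Graphviz DOT text directly. The sketch below does that without importing the `graphviz` package; the nested-dictionary tree layout and the function name `tree_to_dot` are hypothetical and would need to be adapted to the project's node class.

```python
def tree_to_dot(root):
    """Return Graphviz DOT text for a tree of dicts.

    Each node is {"keys": [...], "children": [...]}; a node without
    children is treated as a leaf. This layout is hypothetical.
    """
    lines = ["digraph BPlusTree {", "  node [shape=record];"]
    leaves = []
    counter = 0

    def visit(node):
        nonlocal counter
        name = f"n{counter}"
        counter += 1
        label = "|".join(str(k) for k in node["keys"])
        children = node.get("children", [])
        # Leaves are shaded so they stand out from internal nodes.
        style = "" if children else ", style=filled, fillcolor=lightgrey"
        lines.append(f'  {name} [label="{label}"{style}];')
        if not children:
            leaves.append(name)
        for child in children:
            child_name = visit(child)
            lines.append(f"  {name} -> {child_name};")
        return name

    visit(root)
    # Dashed edges draw the leaf-level linked list; constraint=false keeps
    # them from disturbing the top-down layout.
    for left, right in zip(leaves, leaves[1:]):
        lines.append(f"  {left} -> {right} [style=dashed, constraint=false];")
    # Keep all leaves on one rank so the linked list reads left to right.
    lines.append("  { rank=same; " + "; ".join(leaves) + "; }")
    lines.append("}")
    return "\n".join(lines)


example_tree = {
    "keys": [15],
    "children": [
        {"keys": [10], "children": [{"keys": [5, 7]}, {"keys": [10, 12]}]},
        {"keys": [20], "children": [{"keys": [15]}, {"keys": [20, 25]}]},
    ],
}
dot = tree_to_dot(example_tree)
print(dot)
```

The printed text can be saved to a file such as `tree.gv` and rendered with the Graphviz command-line tool, for example `dot -Tpng tree.gv -o tree.png`.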

In [None]:
# --- Generate Visualizations ---
print("\n--- Generating Visualization Example ---")
db_vis = Database("visualization_db.pkl") # Use a temp DB
vis_table = db_vis.create_table("sample_table", order=3) # Use a small order for easy splits

keys_to_insert = [10, 20, 5, 15, 25, 7, 12] # Example insertion sequence
for i, k in enumerate(keys_to_insert):
    print(f"Inserting {k}...")
    vis_table.insert(k, f"value_{k}")
    # Visualize after each insertion or specific milestones
    vis_table.visualize(filename=f"sample_table_step_{i+1}.gv")
    
# Display the final visualization (or specific steps)
# You might need to render the .gv.png file and display it here
from IPython.display import Image, display
try:
    final_png = f"sample_table_step_{len(keys_to_insert)}.gv.png"
    if os.path.exists(final_png):
        print(f"\nDisplaying final tree structure ({final_png}):")
        display(Image(filename=final_png))
    else:
        print(f"\nGraphviz PNG ({final_png}) not found. Ensure Graphviz is installed and rendered the file.")
except Exception as e:
    print(f"Could not display image: {e}")

# Clean up visualization files
# import glob
# for f in glob.glob("sample_table_step_*.gv*") + glob.glob("visualization_db.pkl"):
#     try:
#         os.remove(f)
#     except OSError as e:
#         print(f"Error removing file {f}: {e}")

## 5. Conclusion

Summarize the project findings:
*   Reiterate the performance benefits observed for the B+ Tree, especially for larger datasets and specific operations (like range queries).
*   Discuss any challenges faced during implementation (e.g., complexity of deletion logic, persistence issues, visualization layout).
*   Suggest potential future improvements (e.g., adding transaction support, more complex queries, different node implementations, concurrency control, more robust persistence).

## 6. Bonus: UI (If Applicable)

If you implemented the Bonus UI, describe its features and how to use it. Include screenshots if possible.