## Indexing in Databases

To efficiently find the value for a particular key in the database, we need a
different data structure: **an index**. The general idea is to keep some
additional metadata on the side, which acts as a signpost and helps you locate the data you want.

An index is an additional structure that is derived from the primary data. Many
databases allow you to add and remove indexes, and this doesn’t affect the contents of the
database; it only affects the performance of queries. Maintaining additional structures
incurs overhead, especially on writes. For writes, it’s hard to beat the performance of
simply appending to a file, because that’s the simplest possible write operation. Any
kind of index usually slows down writes, because the index also needs to be updated
every time data is written.
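
To make the write-side trade-off concrete, here is a minimal sketch of a store with no index at all: writes are a single append, but every read has to scan the whole file. The function names `log_set` and `log_get` and the `key,value` line format are assumptions made for illustration (keys must not contain commas, values must not contain newlines).

```python
import os
import tempfile


def log_set(path, key, value):
    """Append one 'key,value' record; this is the cheapest possible write."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(f"{key},{value}\n")


def log_get(path, key):
    """Scan the whole file; the last matching record wins."""
    result = None
    with open(path, encoding="utf-8") as f:
        for line in f:
            k, _, v = line.rstrip("\n").partition(",")
            if k == key:
                result = v  # keep going: a later record may overwrite this one
    return result


# Hypothetical data file in a temporary directory
path = os.path.join(tempfile.mkdtemp(), "log.txt")
log_set(path, "42", "San Francisco")
log_set(path, "7", "London")
log_set(path, "42", "Berlin")
print(log_get(path, "42"))  # Berlin
```

The lookup cost grows linearly with the size of the file, which is the problem an index is meant to solve.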



## Hash Indexes

A hash index is a simple key-value index. Let’s say our data storage consists only of appending records to a file. Then the simplest possible indexing strategy is this: keep an in-memory
hash map where every key is mapped to a byte offset in the data file—the location at
which the value can be found. Whenever you append a
new key-value pair to the file, you also update the hash map to reflect the offset of the
data you just wrote (this works both for inserting new keys and for updating existing
keys). When you want to look up a value, use the hash map to find the offset in the
data file, seek to that location, and read the value.


**Tombstone:** If you want to delete a key and its associated value, you have to append a special
deletion record to the data file (sometimes called a tombstone). In practice the log is usually
split into segment files that are compacted and merged in the background. When log segments
are merged, the tombstone tells the merging process to discard any previous values for the deleted key.
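
The merge step can be sketched as follows. This is a simplified in-memory model, not an on-disk implementation: each segment is assumed to be a list of `(key, value)` records, and `TOMBSTONE` and `merge_segments` are hypothetical names introduced for this example.

```python
# Hypothetical sentinel standing in for an on-disk deletion record
TOMBSTONE = object()


def merge_segments(segments):
    """Merge segments given oldest first and return a compacted record list."""
    latest = {}
    for segment in segments:
        for key, value in segment:
            latest[key] = value  # a later record replaces any earlier one
    # Keys whose most recent record is a tombstone are dropped entirely
    return [(k, v) for k, v in latest.items() if v is not TOMBSTONE]


old = [("a", 1), ("b", 2), ("a", 3)]
new = [("b", TOMBSTONE), ("c", 4)]
merged = merge_segments([old, new])
print(merged)  # [('a', 3), ('c', 4)]
```

Dropping the tombstone itself is only safe here because the merge covers every segment that could still hold an older value for the key. A partial merge would have to keep the tombstone in its output.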

**Crash Recovery:**
If the database is restarted, the in-memory hash maps are lost. In principle, you
can restore each segment’s hash map by reading the entire segment file from
beginning to end and noting the offset of the most recent value for every key as
you go along. However, that might take a long time if the segment files are large,
which would make server restarts painful. To avoid this, we can store a snapshot of each segment’s hash map on disk, which can be loaded into memory more quickly.
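
Both recovery paths can be sketched in a few lines. The record format (one JSON object per line), the file names, and all function names below are assumptions chosen for illustration. A scannable format is needed because a full rebuild has to know where each record ends.

```python
import json
import os
import tempfile


def append_record(path, key, value):
    """Append one JSON record per line (assumed record format)."""
    with open(path, "ab") as f:
        f.write(json.dumps({"k": key, "v": value}).encode("utf-8") + b"\n")


def rebuild_index(path):
    """Slow path: scan the segment, keeping the last byte offset seen per key."""
    index = {}
    offset = 0
    with open(path, "rb") as f:
        for line in f:
            index[json.loads(line)["k"]] = offset
            offset += len(line)
    return index


def save_snapshot(index, snapshot_path):
    """Persist the hash map so a restart can skip the full scan."""
    with open(snapshot_path, "w", encoding="utf-8") as f:
        json.dump(index, f)


def load_snapshot(snapshot_path):
    """Fast path: load the saved hash map directly."""
    with open(snapshot_path, encoding="utf-8") as f:
        return json.load(f)


def read_at(path, offset):
    """Read the value of the record that starts at the given byte offset."""
    with open(path, "rb") as f:
        f.seek(offset)
        return json.loads(f.readline())["v"]


workdir = tempfile.mkdtemp()
segment = os.path.join(workdir, "segment.log")          # hypothetical file name
snapshot = os.path.join(workdir, "segment.index.json")  # hypothetical file name

append_record(segment, "a", "first")
append_record(segment, "b", "second")
append_record(segment, "a", "third")

index = rebuild_index(segment)
save_snapshot(index, snapshot)
restored = load_snapshot(snapshot)
print(read_at(segment, restored["a"]))  # third
```

A snapshot is only valid if the segment has not changed since it was taken. Records appended after the snapshot would still need to be replayed.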

In [1]:
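import os


class HashIndex:
    def __init__(self, name):
        # Path of the append-only data file
        self.name = f"data/{name}.index"
        # Make sure the data directory exists before the first append
        os.makedirs(os.path.dirname(self.name), exist_ok=True)
        # In-memory hash map: key -> location of the latest record for that key
        self.index = {}
        # TODO: rebuild the in-memory index from the data file if the file already exists

    def set_val(self, key, value):
        # Append the record to the file and remember where it starts
        with open(self.name, "a", encoding="utf-8") as f:
            f.seek(0, 2)  # move to the end of the file
            end_offset = f.tell()  # offset at which the new record starts
            f.write(f"{key}{value}")
        # Point the key at the record just written; sizes are counted in characters
        self.index[key] = {
            "offset": end_offset,
            "pk_size": len(f"{key}"),
            "data_size": len(f"{value}"),
            # Kept only to make the index easy to inspect;
            # a real index would not hold the value in memory
            "data": value,
        }

    def get_val(self, key):
        # Look up the record's location, then read it back from the file
        try:
            key_index = self.index[key]
        except KeyError:
            return "<NAN>"  # sentinel for a missing or deleted key
        with open(self.name, encoding="utf-8") as f:
            f.seek(key_index["offset"])
            # Read key + value as characters, then strip the key prefix
            record = f.read(key_index["pk_size"] + key_index["data_size"])
            return record[key_index["pk_size"]:]

    def del_val(self, key):
        if key not in self.index:
            return  # nothing to delete
        # Append a tombstone record; append mode always writes at the end of the file
        with open(self.name, "a", encoding="utf-8") as f:
            f.write(f"{key}💀")
        self.index.pop(key, None)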
        


In [2]:
import json

infile = HashIndex("hash_test")
infile.set_val(1, {"name":"bheem"}) # Dict
infile.set_val(2, "😎") # Emoji
infile.set_val(3, "Some Stupid String") # String
infile.set_val(4, 120293292939393) # Big Number
print(json.dumps(infile.index, sort_keys=True, indent=4))

{
    "1": {
        "data": {
            "name": "bheem"
        },
        "data_size": 17,
        "offset": 0,
        "pk_size": 1
    },
    "2": {
        "data": "\ud83d\ude0e",
        "data_size": 1,
        "offset": 18,
        "pk_size": 1
    },
    "3": {
        "data": "Some Stupid String",
        "data_size": 18,
        "offset": 23,
        "pk_size": 1
    },
    "4": {
        "data": 120293292939393,
        "data_size": 15,
        "offset": 42,
        "pk_size": 1
    }
}


### Tests

In [3]:
# Read back the values created above
assert infile.get_val(1) == "{'name': 'bheem'}"
assert infile.get_val(2) == '😎'
assert infile.get_val(3) == 'Some Stupid String'
assert infile.get_val(4) == '120293292939393'
# Update
infile.set_val(4, 'update 4th index') # Overwrite key 4 with a string
assert infile.get_val(4) == 'update 4th index'
# Delete
infile.del_val(4)
assert infile.get_val(4) == '<NAN>'

## Log-Structured Indexes

## B-Trees

## Multidimensional Indexes (GIS)

## Full-Text Search & Fuzzy Indexes