# Consistent Hashing

* used to distribute data across a cluster of servers

## Naive: Simple Modulo Hashing

* as you scale up, you might need to distribute your data across multiple databases instead of keeping it in one database
    - this is called sharding
    - this causes an issue: how do you know which database has which data?
* naive approach: Simple Modulo Hashing
    1. hash(event_id) = hash_number
    2. hash_number % num_databases
* this would tell us which database has what data
    - Event #1234 → hash(1234) % 3 = 1 → Database 1
    - Event #5678 → hash(5678) % 3 = 0 → Database 0
    - Event #9012 → hash(9012) % 3 = 2 → Database 2
* but what's the issue?
    - if you add another database, the formula becomes: hash_number % (original_num_databases + 1)
    - if you remove a database, it becomes: hash_number % (original_num_databases - 1)
    - __this changes the result for most ids, forcing a huge movement of data between databases that could slow your app down considerably or crash it altogether__
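
To make the cost concrete, here is a minimal sketch that counts how many keys land on a different database when a cluster grows from 3 to 4 databases. The helper names (`stable_hash`, `shard_for`, `fraction_moved`) and the `event-N` key format are made up for illustration, and SHA-256 is just one convenient stable hash, not a recommendation.

```python
import hashlib

def stable_hash(key: str) -> int:
    """Deterministic hash; Python's built-in hash() is salted per process for strings."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

def shard_for(key: str, num_databases: int) -> int:
    """Simple modulo hashing: hash the key, then take it modulo the database count."""
    return stable_hash(key) % num_databases

def fraction_moved(keys, old_count: int, new_count: int) -> float:
    """Fraction of keys whose database changes when the database count changes."""
    moved = sum(1 for k in keys if shard_for(k, old_count) != shard_for(k, new_count))
    return moved / len(keys)

# Hypothetical keys, purely for illustration.
keys = [f"event-{i}" for i in range(10_000)]
moved = fraction_moved(keys, 3, 4)
print(f"{moved:.0%} of keys change database when going from 3 to 4 databases")
```

With a uniform hash, a key keeps its database only when `h % 3 == h % 4`, which holds for 3 of every 12 hash values, so roughly three quarters of the keys should move.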

## Consistent Hashing

* solves the problem of data redistribution when adding/deleting an instance in a distributed system
* uses a "hash ring" to evenly distribute data
    - create a ring with a fixed number of points around it, e.g. 100 points numbered [0, 99]
        * normally, this would be the range: [0, $2^{32} - 1$], but it can also be bigger like [0, $2^{64} - 1$]
    - then we place the databases evenly around the hash ring
        * if we have 4 databases, we'd put them at point 0, 25, 50, 75
    - to find out which database has our data, we put the id through a hash function
    - the resulting number gives us a point on the ring, and we walk "clockwise" from there until we reach a database
        * e.g. if hash(1234) = 16, we walk clockwise until we reach a database at 25
* now when we add/delete a database, we only need to care about the hashes in a certain range
    - e.g. if we add a database at point 90, only the hashes in (75, 90] move: they go from the database at point 0 to the new one
    - e.g. if we delete the database at point 25, only the hashes in (0, 25] move: they go to the next database clockwise, at point 50
    - this dramatically decreases the amount of data that needs to be redistributed
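
The ring walk above can be sketched with a sorted list and a binary search. This is a toy version under stated assumptions: the ring is taken as 100 points (0 to 99), keys are passed in as ring positions directly instead of being hashed, and the `HashRing` class and `db@N` names are hypothetical.

```python
import bisect

RING_SIZE = 100  # toy ring of 100 points (0..99); real rings use something like 2**32

class HashRing:
    def __init__(self):
        self._positions = []  # sorted ring positions that hold a database
        self._owners = {}     # position -> database name

    def add(self, position: int, name: str) -> None:
        bisect.insort(self._positions, position)
        self._owners[position] = name

    def remove(self, position: int) -> None:
        self._positions.remove(position)
        del self._owners[position]

    def lookup(self, key_position: int) -> str:
        """Walk clockwise: first database at or after the key, wrapping past the end."""
        i = bisect.bisect_left(self._positions, key_position % RING_SIZE)
        if i == len(self._positions):
            i = 0  # walked past the last database, so wrap around to the first
        return self._owners[self._positions[i]]

ring = HashRing()
for pos in (0, 25, 50, 75):
    ring.add(pos, f"db@{pos}")

print(ring.lookup(16))  # db@25

# Add a database at 90 and record which ring positions changed owner.
before = {p: ring.lookup(p) for p in range(RING_SIZE)}
ring.add(90, "db@90")
moved = [p for p in range(RING_SIZE) if ring.lookup(p) != before[p]]
print(moved[0], moved[-1])  # 76 90
```

Only positions 76 through 90 change owner; every other position still resolves to the same database as before.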
    

## Virtual Nodes

* there is still one problem left: if we remove a database, all the data from its range gets packed into the next database on the ring
    - so that one database ends up with significantly more data than the others
* to solve this, we create virtual nodes associated with a database
    - these virtual nodes occupy different parts of the hash ring instead of just being at one spot
    - so we could have multiple virtual nodes of database 1 distributed across the hash ring
* to get an even distribution of data:
    - use a lot of virtual nodes for each database
    - derive each virtual node's position by hashing a distinct name, so the positions scatter pseudo-randomly around the ring instead of clustering
* e.g. instead of hashing "DB1", we hash "DB1-vn1", "DB1-vn2", etc.
    - so if DB1 fails, its data is spread across the virtual nodes of all the other databases instead of landing on one database at one point
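
A minimal sketch of the idea, assuming a $2^{32}$-point ring, SHA-256 as the hash, 200 virtual nodes per database, and the "DB1-vn1" naming from above. The function names (`ring_hash`, `build_ring`, `lookup`) and the `event-N` keys are hypothetical. It removes DB1 and counts where the keys it used to own end up.

```python
import bisect
import hashlib
from collections import Counter

def ring_hash(value: str) -> int:
    """Map a string to a point on a 2**32-point ring (first 4 bytes of SHA-256)."""
    digest = hashlib.sha256(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")

def build_ring(databases, vnodes_per_db: int):
    """Place vnodes_per_db virtual nodes per database; return sorted positions and owners."""
    points = sorted(
        (ring_hash(f"{db}-vn{i}"), db)
        for db in databases
        for i in range(1, vnodes_per_db + 1)
    )
    positions = [pos for pos, _ in points]
    owners = [db for _, db in points]
    return positions, owners

def lookup(ring, key: str) -> str:
    """Walk clockwise from the key's hash to the next virtual node, wrapping at the end."""
    positions, owners = ring
    i = bisect.bisect_left(positions, ring_hash(key))
    return owners[i % len(owners)]

databases = ["DB1", "DB2", "DB3", "DB4"]
keys = [f"event-{i}" for i in range(20_000)]  # made-up keys for illustration

full = build_ring(databases, vnodes_per_db=200)
without_db1 = build_ring([db for db in databases if db != "DB1"], vnodes_per_db=200)

# Keys that lived on DB1, and where each one lands once DB1 is gone.
orphaned = [k for k in keys if lookup(full, k) == "DB1"]
new_homes = Counter(lookup(without_db1, k) for k in orphaned)
print(new_homes)
```

Because DB1's virtual nodes sit next to virtual nodes of every other database, the orphaned keys are shared among DB2, DB3, and DB4, while keys that never lived on DB1 do not move at all.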

## Consistent Hashing in the Real World

* consistent hashing can be used to distribute data evenly across a cluster of servers, not just for databases
    - the cluster could be made up of caches, message brokers, or application servers
* Examples:
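    1. Redis Cluster: uses a closely related scheme rather than a classic hash ring: every key maps to one of 16384 fixed hash slots, and slots (not individual keys) are reassigned when nodes join or leave
    2. Apache Cassandra: uses consistent hashing to distribute data across its token ring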
    3. Amazon's DynamoDB: uses consistent hashing under the hood
    4. Content Delivery Networks (CDNs): use consistent hashing to determine which edge server should cache specific content

## When to Use Consistent Hashing in an Interview

* most modern distributed systems handle sharding and data distribution for you
    - when you use Redis, DynamoDB, or Cassandra in your system, you can just mention that these systems use consistent hashing (or a close variant, such as Redis Cluster's hash slots) under the hood to handle scaling
* consistent hashing is crucial for infrastructure-focused interviews where you're asked to design distributed systems from scratch:
    1. distributed database
    2. distributed cache
    3. distributed message broker
* in those interviews, you should be able to explain:
    - why consistent hashing has advantages over simple modulo-based sharding
    - how virtual nodes help achieve better load distribution across the system
    - strategies for handling node failures and additions to the cluster
    - how to address hot spots and implement effective data rebalancing strategies