# Redis JSON and Indexing Tutorial

This notebook demonstrates how to work with JSON data in Redis and how to create effective indexing strategies to optimize queries.

## Setup

First, let's install the necessary packages and set up our Redis connection.

In [None]:
%%capture
%pip install redis rich tqdm

In [None]:
import redis
from rich.pretty import pprint

In [None]:
# Creating Redis cluster connection
# Make sure Jupyter server is connected to Redis network
# Run this in your terminal: `docker network connect redis_default jupyter-jupyter-1`
r = redis.RedisCluster(host='master', port=6379)

In [None]:
# Remove index keys left over from a previous run (plain and hash-tagged)
for pattern in ("sample_bicycle_index:*", "{sample_bicycle_index}:*"):
    for k in r.scan_iter(pattern):
        r.delete(k)

## Redis Commands Reference

For a complete reference of Redis commands, see the [official Redis documentation](https://redis.io/docs/latest/commands/).

## Working with JSON Data in Redis

Redis supports JSON documents through the RedisJSON module, which lets us store, retrieve, and manipulate them directly in Redis.

### Retrieving JSON Data

Let's start by retrieving a bicycle record from our sample data:

In [None]:
# Retrieve a specific bicycle using JSON.GET command
# The Python client method r.json().get() maps to JSON.GET in Redis
bicycle = r.json().get("sample_bicycle:1001")
pprint(bicycle)

### Finding All Bicycle Keys

We can use pattern matching to find all the bicycle keys in our database:

In [None]:
# Iterate over all bicycle keys using SCAN
# The pattern "sample_bicycle:*" will match all keys that start with "sample_bicycle:"
bicycle_keys = list(r.scan_iter("sample_bicycle:*"))
print(f"Found {len(bicycle_keys)} bicycle records")
print("Sample keys:")
pprint(bicycle_keys[:5])

## Creating Indexes for Efficient Querying

In Redis, we can create our own indexes for efficient querying. Let's create indexes for bicycle brands and types.

### Building Brand and Type Sets

We'll use Redis Sets to store unique brands and types, making it easy to find all distinct values:

In [None]:
# Iterate over all bicycle records to extract brands and types
# We'll use Redis Sets (SADD command) to store unique values
for key in r.scan_iter("sample_bicycle:*"):
    bicycle = r.json().get(key)
    # Use SADD to add values to sets - duplicates are automatically handled
    r.sadd("sample_bicycle_index:brands", bicycle["brand"])
    r.sadd("sample_bicycle_index:types", bicycle["type"])

print("Indexing complete!")

### Retrieving Unique Bicycle Types

Now we can easily get all unique bicycle types from our index:

In [None]:
# Get all unique bicycle types using SMEMBERS command
types = r.smembers("sample_bicycle_index:types")
print(f"Found {len(types)} unique bicycle types:")
pprint(types)

### Retrieving Unique Bicycle Brands

`SMEMBERS` returns the whole set in a single reply, which can block the server when the set is large. For larger sets like brands, we iterate with `SSCAN` instead:

In [None]:
# Get unique brands using SSCAN for better performance with large sets
# The sscan_iter method provides a cursor-based iteration through the set
brands = list(r.sscan_iter("sample_bicycle_index:brands"))
print(f"Found {len(brands)} unique bicycle brands")
print("Sample of brands (first 20):")
pprint(brands[:20])
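
The cursor-based iteration that `sscan_iter` wraps can also be written by hand. The helper below is a minimal sketch, not the client's actual implementation: `iter_set_members` is a hypothetical name, and it only assumes that `client.sscan(key, cursor=..., count=...)` returns a `(next_cursor, members)` pair, as redis-py's `sscan` does.

```python
def iter_set_members(client, key, count=100):
    """Yield the members of a Redis set one SSCAN page at a time.

    `client` is assumed to expose redis-py's `sscan(name, cursor, count=...)`,
    which returns a `(next_cursor, members)` pair.
    """
    cursor = 0
    while True:
        cursor, members = client.sscan(key, cursor=cursor, count=count)
        yield from members
        # A cursor of 0 signals that the server has completed a full pass.
        if int(cursor) == 0:
            break
```

`SSCAN` may return the same member more than once if the set changes during iteration, so callers that need distinct values can wrap the result in `set(...)`.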

## Creating Advanced Indexes for Query Optimization

Now let's create more sophisticated indexes that will allow us to:
1. Find all bicycles of a specific brand
2. Find all bicycles of a specific type
3. Count bicycles by brand and type

In [None]:
# Reset existing counters
r.delete("sample_bicycle_index:brand_count")
r.delete("sample_bicycle_index:type_count")

# Build comprehensive indexes for brands and types
for key in r.scan_iter("sample_bicycle:*"):
    bicycle = r.json().get(key)
    
    # Create sets of bicycle keys for each brand and type
    # This allows us to quickly look up all bicycles of a specific brand or type
    r.sadd("sample_bicycle_index:brand:" + bicycle["brand"], key)
    r.sadd("sample_bicycle_index:type:" + bicycle["type"], key)
    
    # Use Hash data structure (HINCRBY command) to count occurrences of each brand and type
    # This gives us a quick way to get counts without having to retrieve the full sets
    r.hincrby("sample_bicycle_index:brand_count", bicycle["brand"], 1)
    r.hincrby("sample_bicycle_index:type_count", bicycle["type"], 1)

print("Advanced indexing complete!")
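
To make the resulting data layout concrete, the same indexing pass can be modelled with plain Python containers. This is only an in-memory sketch of the structures the loop above builds in Redis; the three records and the helper name `build_indexes` are hypothetical and not taken from the sample dataset.

```python
from collections import Counter, defaultdict


def build_indexes(records):
    """Mirror the Redis index layout with dicts, sets, and counters.

    `records` maps a document key to a dict with "brand" and "type" fields.
    """
    keys_by_brand = defaultdict(set)  # plays the role of the ...:brand:<brand> sets
    keys_by_type = defaultdict(set)   # plays the role of the ...:type:<type> sets
    brand_count = Counter()           # plays the role of the brand_count hash
    type_count = Counter()            # plays the role of the type_count hash

    for key, doc in records.items():
        keys_by_brand[doc["brand"]].add(key)  # SADD
        keys_by_type[doc["type"]].add(key)    # SADD
        brand_count[doc["brand"]] += 1        # HINCRBY ... 1
        type_count[doc["type"]] += 1          # HINCRBY ... 1
    return keys_by_brand, keys_by_type, brand_count, type_count


# Hypothetical records, for illustration only.
records = {
    "sample_bicycle:1": {"brand": "Tiny Trekkers", "type": "Kids"},
    "sample_bicycle:2": {"brand": "Tiny Trekkers", "type": "Gravel"},
    "sample_bicycle:3": {"brand": "Other Brand", "type": "Kids"},
}
by_brand, by_type, brand_count, type_count = build_indexes(records)
print(sorted(by_type["Kids"]))
print(brand_count["Tiny Trekkers"])
```

The model also shows why the counters are deleted before the loop runs: adding a key to a set twice has no effect, but incrementing a counter twice doubles it.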

### Finding Bicycles by Type

Now we can efficiently find all bicycles of a specific type, such as "Gravel":

In [None]:
# Get all Gravel bicycles using our type index
gravel_bikes = r.smembers("sample_bicycle_index:type:Gravel")
print(f"Found {len(gravel_bikes)} Gravel bicycles")
print("Sample of Gravel bicycle keys:")
pprint(list(gravel_bikes)[:5])

# Let's also get the first gravel bike's details
if gravel_bikes:
    first_gravel_bike = r.json().get(list(gravel_bikes)[0])
    print("\nDetails of a sample Gravel bicycle:")
    pprint(first_gravel_bike)

### Complex Queries Using Set Operations

One of Redis's powerful features is the ability to perform set operations for complex queries. Let's find all Kids bicycles made by the brand "Tiny Trekkers":

In [None]:
# Find bicycles that are both Kids type AND Tiny Trekkers brand
# We use the SINTER command to perform a set intersection
try:
    kids_tiny_trekkers = r.sinter(["sample_bicycle_index:type:Kids", "sample_bicycle_index:brand:Tiny Trekkers"])
    print(f"Found {len(kids_tiny_trekkers)} Kids bicycles made by Tiny Trekkers")
    print("Bicycle keys:")
    pprint(kids_tiny_trekkers)
    
    # If we found any matching bicycles, let's retrieve the details of the first one
    if kids_tiny_trekkers:
        first_bike = r.json().get(list(kids_tiny_trekkers)[0])
        print("\nDetails of a sample Kids Tiny Trekkers bicycle:")
        pprint(first_bike)
except Exception as e:
    print("Error performing set intersection:", e)
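
When the client is connected to a cluster, the `SINTER` call above can be rejected if the two sets hash to different slots. One possible workaround, sketched below under that assumption, is to fetch each set with its own `SMEMBERS` call and intersect the results in Python. The helper name `client_side_sinter` is hypothetical, and the approach transfers every member of every set to the client, so it is only reasonable for small sets.

```python
def client_side_sinter(client, keys):
    """Intersect Redis sets client-side, one SMEMBERS call per key.

    `client` is assumed to expose redis-py's `smembers(name)`.
    Each call touches a single key, so the keys may live in different slots.
    """
    result = None
    for key in keys:
        members = set(client.smembers(key))
        result = members if result is None else result & members
        if not result:
            # An empty intermediate result cannot grow again; stop early.
            break
    return result if result is not None else set()
```

With a connected client `r`, a call such as `client_side_sinter(r, ["sample_bicycle_index:type:Kids", "sample_bicycle_index:brand:Tiny Trekkers"])` would return the keys present in both sets.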

### Enforce Co-location

In Redis Cluster, a multi-key command such as `SINTER` only works when all of its keys map to the same hash slot. Residing on the same shard is not enough, because a shard serves many slots. In our example, the keys are `sample_bicycle_index:type:Kids` and `sample_bicycle_index:brand:Tiny Trekkers`.

More generally, all keys of the form `sample_bicycle_index:brand:*` and `sample_bicycle_index:type:*` must be co-located.

To ensure co-location, we use Redis hash tags. When a key contains a `{...}` section, only the text inside the braces is hashed to choose the slot, so keys that share the same tag land in the same slot. This lets us run operations such as set intersections without errors caused by keys being spread across slots.
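
The effect of a hash tag on key placement can be checked offline. Redis Cluster maps a key to one of 16384 slots by taking CRC16 of the key modulo 16384, and when the key has a non-empty `{...}` section only that substring is hashed. The sketch below reproduces that rule in plain Python; `key_slot` is an illustrative helper name, and it relies on `binascii.crc_hqx` with an initial value of 0 matching the CRC16 variant Redis uses.

```python
from binascii import crc_hqx

HASH_SLOTS = 16384  # number of slots in a Redis Cluster


def key_slot(key: str) -> int:
    """Return the cluster hash slot for `key`, honouring hash tags."""
    data = key.encode("utf-8")
    start = data.find(b"{")
    if start != -1:
        end = data.find(b"}", start + 1)
        # Only a non-empty tag between the first "{" and the next "}" counts.
        if end != -1 and end != start + 1:
            data = data[start + 1:end]
    return crc_hqx(data, 0) % HASH_SLOTS


untagged = ["sample_bicycle_index:type:Kids",
            "sample_bicycle_index:brand:Tiny Trekkers"]
tagged = ["{sample_bicycle_index}:type:Kids",
          "{sample_bicycle_index}:brand:Tiny Trekkers"]

print([key_slot(k) for k in untagged])  # slots computed from the full key names
print([key_slot(k) for k in tagged])    # both computed from "sample_bicycle_index"
```

On a live cluster, `CLUSTER KEYSLOT <key>` reports the slot the server itself computes, which is a convenient cross-check for this sketch.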


In [None]:
# Remove the previous, untagged index keys
for pattern in ("sample_bicycle_index:type:*", "sample_bicycle_index:brand:*"):
    for k in r.scan_iter(pattern):
        r.delete(k)

# Re-build comprehensive indexes for brands and types using {sample_bicycle_index} tag
for key in r.scan_iter("sample_bicycle:*"):
    bicycle = r.json().get(key)
    
    # Create sets of bicycle keys for each brand and type
    # This allows us to quickly look up all bicycles of a specific brand or type
    r.sadd("{sample_bicycle_index}:brand:" + bicycle["brand"], key)
    r.sadd("{sample_bicycle_index}:type:" + bicycle["type"], key)

print("Advanced indexing complete!")
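
String concatenation makes it easy to mistype the braces in one place and silently break co-location. A small helper that builds the tagged key names in a single spot is one way to reduce that risk; `index_key`, `INDEX_TAG`, and the parameter names are hypothetical and introduced here only for illustration.

```python
INDEX_TAG = "sample_bicycle_index"  # hash tag shared by all index keys


def index_key(kind: str, value: str, tag: str = INDEX_TAG) -> str:
    """Build a hash-tagged index key such as {sample_bicycle_index}:type:Kids."""
    if "{" in tag or "}" in tag:
        raise ValueError("the tag itself must not contain braces")
    return f"{{{tag}}}:{kind}:{value}"


print(index_key("type", "Kids"))            # {sample_bicycle_index}:type:Kids
print(index_key("brand", "Tiny Trekkers"))  # {sample_bicycle_index}:brand:Tiny Trekkers
```

The indexing loop could then call `r.sadd(index_key("brand", bicycle["brand"]), key)` instead of concatenating strings inline.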

### Perform the Complex Query Again

Now that our sets are co-located, the node that owns their slot can perform the intersection locally, without needing keys stored elsewhere.

Because they share the hash tag `{sample_bicycle_index}`, all relevant keys, such as `{sample_bicycle_index}:type:Kids` and `{sample_bicycle_index}:brand:Tiny Trekkers`, are stored together.

In [None]:
# Find bicycles that are both Kids type AND Tiny Trekkers brand using updated keys
try:
    kids_tiny_trekkers = r.sinter(["{sample_bicycle_index}:type:Kids", "{sample_bicycle_index}:brand:Tiny Trekkers"])
    print(f"Found {len(kids_tiny_trekkers)} Kids bicycles made by Tiny Trekkers")
    print("Bicycle keys:")
    pprint(kids_tiny_trekkers)
    
    # If we found any matching bicycles, let's retrieve the details of the first one
    if kids_tiny_trekkers:
        first_bike = r.json().get(list(kids_tiny_trekkers)[0])
        print("\nDetails of a sample Kids Tiny Trekkers bicycle:")
        pprint(first_bike)
except Exception as e:
    print("Error performing set intersection:", e)

### Co-location Considerations

Indexes are relatively hot keys. They provide fast access, but co-locating them concentrates their traffic on the single node that stores them. This calls for caution, because the goal of sharding is to distribute load, not just to spread keys uniformly across nodes. Uniform key distribution helps only if the load per key is also uniform, which is often not the case for indexes.

This problem can be addressed in two ways:

1. **Rebalance the shards**: Identify the shard that manages the hottest indexes and reduce the number of slots assigned to it. Those slots can then be distributed evenly among the other shards in the cluster, relieving the overloaded shard.

2. **Add read replicas**: Increase the number of replicas for the shard that manages the indexes. Since indexes are mostly read to find relevant keys, replicas can absorb part of that load and improve read performance, provided the client is configured to read from replicas.

By carefully considering these strategies, we can optimize the performance of our Redis setup while maintaining the benefits of co-location for efficient data access.


## Bonus: Analyzing Bicycle Statistics

Let's use our indexes to get some statistics about our bicycle inventory:

In [None]:
# Get counts of bicycles by type
type_counts = r.hgetall("sample_bicycle_index:type_count")
print("Bicycle counts by type:")
pprint({type_name.decode('utf-8'): int(count) for type_name, count in type_counts.items()})

# Get top 5 bicycle brands by count
brand_counts = r.hgetall("sample_bicycle_index:brand_count")
brand_count_dict = {brand.decode('utf-8'): int(count) for brand, count in brand_counts.items()}
top_brands = sorted(brand_count_dict.items(), key=lambda x: x[1], reverse=True)[:5]

print("\nTop 5 bicycle brands:")
for brand, count in top_brands:
    print(f"{brand}: {count} bicycles")
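
The decoding and ranking steps can also be folded into one reusable function. The sketch below assumes the raw mapping has `bytes` keys and values, as `hgetall` returns them when the client is created without `decode_responses=True`; `top_counts` is a hypothetical helper name, and the sample hash contents are made up.

```python
from collections import Counter


def top_counts(raw_counts, n=5):
    """Decode an HGETALL-style mapping and return the n largest (name, count) pairs."""
    counts = Counter({name.decode("utf-8"): int(value)
                      for name, value in raw_counts.items()})
    return counts.most_common(n)


# Hypothetical hash contents, shaped like the reply from hgetall.
sample = {b"Kids": b"3", b"Gravel": b"7", b"Road": b"5"}
print(top_counts(sample, n=2))  # [('Gravel', 7), ('Road', 5)]
```

With a connected client, `top_counts(r.hgetall("sample_bicycle_index:brand_count"))` would produce the same ranking as the cell above.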

## Conclusion

In this tutorial, we've learned how to:
1. Create efficient indexes using Redis Sets
2. Perform complex queries using set operations
3. Use Redis Hashes to store count statistics

These techniques let a query read only the index entries it needs instead of scanning every record, which helps keep lookups fast as the dataset grows.

For more information on Redis commands and data structures, refer to the [official Redis documentation](https://redis.io/docs/latest/commands/).