# T4 – Caching and Data Acceleration (Redis + MongoDB)

## 1. Project Overview
This project implements a caching layer using **Redis** to accelerate reads from a **MongoDB** database (the `sample_mflix` dataset).

### Objectives Met
1.  **Architecture:** Hybrid storage using MongoDB (persistent) and Redis (in-memory).
2.  **Strategies:** Implemented **Cache-Aside** (read) and **Write-Through** (update).
3.  **Data Structures:**
    * **Hashes:** For storing movie objects.
    * **Sorted Sets:** For a "Trending Movies" leaderboard.
    * **Strings:** For system status/config.
4.  **Performance:** Benchmarking latency and Hit/Miss ratios.

### Tech Stack
* **Database:** MongoDB Atlas (Cloud)
* **Cache:** FakeRedis (in-process, in-memory simulation of Redis, used for portability)
* **Language:** Python 3.x

In [8]:
%pip install pymongo fakeredis redis pandas dnspython


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m25.2[0m[39;49m -> [0m[32;49m25.3[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip3 install --upgrade pip[0m
Note: you may need to restart the kernel to use updated packages.


In [9]:
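import os
import time
import json
import random
import pandas as pd
import fakeredis
import redis
from pymongo import MongoClient

# --- CONFIGURATION ---
# Read the connection string from the environment rather than hard-coding
# credentials in the notebook, e.g. "mongodb+srv://<user>:<password>@<cluster-host>/"
MONGO_URI = os.environ["MONGO_URI"]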

try:
    mongo_client = MongoClient(MONGO_URI)
    db = mongo_client.sample_mflix
    movies_col = db.movies
    
    # Quick connectivity test
    count = movies_col.count_documents({})
    print(f" Connected to MongoDB! Total Movies available: {count}")
    
except Exception as e:
    print(f" MongoDB Connection Error: {e}")

# 2. Connect to Redis (Simulated)
# decode_responses=True ensures we get Strings back, not Bytes
r = fakeredis.FakeStrictRedis(decode_responses=True)
print(" Connected to Redis Cache!")

 Connected to MongoDB! Total Movies available: 21349
 Connected to Redis Cache!


## 2. Architecture & Data Flow

We implement the **Cache-Aside** pattern for reads: the application checks Redis first and queries the slower database (MongoDB) only on a cache miss, then writes the result back to Redis with a 60-second TTL.

### Data Flow Diagram
```mermaid
graph TD
    User[User / API Request] --> App[Application Layer]
    
    subgraph "Caching Layer"
    App -- 1. GET 'movie:TheGodfather' --> Redis{Check Redis}
    Redis -- HIT (Return Data) --> App
    end
    
    subgraph "Persistent Layer"
    Redis -- MISS (Null) --> Mongo[Query MongoDB Atlas]
    Mongo -- Return JSON --> App
    end
    
    App -- 2. SET 'movie:TheGodfather' (TTL 60s) --> Redis
    App -- Return Response --> User
```

In [10]:
# --- HELPER: Fetch from DB (The "Slow" Path) ---
def fetch_from_mongo(title):
    """Simulates fetching from the persistent storage."""
    # We fetch specific fields to keep the cache object lightweight
    movie = movies_col.find_one(
        {"title": title},
        {"title": 1, "year": 1, "plot": 1, "imdb.rating": 1, "_id": 0}
    )
    
    if movie:
        # Flatten structure for Redis Hash compatibility
        # Redis Hashes are best for flat Key-Value pairs, not nested JSON
        return {
            "title": movie.get("title"),
            "year": str(movie.get("year", "N/A")),
            "plot": movie.get("plot", "No plot available."),
            "rating": str(movie.get("imdb", {}).get("rating", "N/A"))
        }
    return None

# --- STRATEGY 1: CACHE-ASIDE (Read Operation) ---
def get_movie(title):
    """
    1. Check Redis (Fast).
    2. If missing, Check Mongo (Slow).
    3. Update Redis with new data + TTL.
    """
    cache_key = f"movie:{title}"
    
    # A. Check Cache
    cached_data = r.hgetall(cache_key)
    
    if cached_data:
        # [Requirement: Hit]
        return cached_data, "HIT"
    
    # B. Check Database
    db_data = fetch_from_mongo(title)
    
    if db_data:
        # [Requirement: Miss & Populate]
        # Use Redis Hash (HSET) for object storage
        r.hset(cache_key, mapping=db_data)
        
        # [Requirement: TTL / Eviction]
        # Expire the key after 60 seconds so a stale entry is eventually
        # dropped and reloaded from MongoDB on the next read
        r.expire(cache_key, 60)
        
        return db_data, "MISS"
    
    return None, "NOT FOUND"

# --- STRATEGY 2: WRITE-THROUGH (Update Operation) ---
def update_movie_rating(title, new_rating):
    """
    Updates the database, then refreshes the cached copy if one exists,
    so reads served from the cache reflect the new rating.
    """
    print(f" Updating rating for '{title}' to {new_rating}...")
    
    # 1. Update Persistent DB (MongoDB)
    movies_col.update_one(
        {"title": title}, 
        {"$set": {"imdb.rating": new_rating}}
    )
    
    # 2. Update Cache (Redis)
    cache_key = f"movie:{title}"
    
    # Only refresh Redis if the key is already cached.
    # If it is not, Cache-Aside will load the new value on the next read.
    if r.exists(cache_key):
        # Store the rating as a string, matching the format used by fetch_from_mongo
        r.hset(cache_key, key="rating", value=str(new_rating))
        # Reset the TTL so the refreshed entry lives a full 60 seconds
        r.expire(cache_key, 60)
        print("Updated MongoDB and Refreshed Redis Cache.")
    else:
        print("Updated MongoDB only (Key was not in cache).")

print(" Caching Logic Implemented.")

 Caching Logic Implemented.
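
The read path above can also be exercised without MongoDB or Redis. The following is a minimal sketch, not the notebook's implementation: `TTLCache`, `get_with_cache_aside`, and the `database` dict are hypothetical stand-ins (one dict plays the role of Redis, another plays MongoDB), and the clock is injected so that the 60-second expiry can be shown without waiting.

```python
import time


class TTLCache:
    """Dict-backed stand-in for Redis keys with a per-key TTL (hypothetical helper)."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._entries = {}  # key -> (expires_at, value)

    def get(self, key):
        entry = self._entries.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            # Lazy expiry: a stale entry is dropped when a read finds it
            del self._entries[key]
            return None
        return value

    def set(self, key, value, ttl):
        self._entries[key] = (self._clock() + ttl, value)


def get_with_cache_aside(cache, load, key, ttl=60):
    """Return (value, status): check the cache, fall back to load(), then populate."""
    value = cache.get(key)
    if value is not None:
        return value, "HIT"
    value = load(key)
    if value is None:
        return None, "NOT FOUND"
    cache.set(key, value, ttl)
    return value, "MISS"


# Fake clock (seconds) so expiry is deterministic; a dict stands in for MongoDB
now = [0.0]
cache = TTLCache(clock=lambda: now[0])
database = {"The Saphead": {"title": "The Saphead", "rating": "3.6"}}

statuses = [get_with_cache_aside(cache, database.get, "The Saphead")[1]]      # first read
statuses.append(get_with_cache_aside(cache, database.get, "The Saphead")[1])  # within TTL
now[0] = 61.0                                                                 # TTL elapsed
statuses.append(get_with_cache_aside(cache, database.get, "The Saphead")[1])
statuses.append(get_with_cache_aside(cache, database.get, "Unknown Title")[1])
print(statuses)  # ['MISS', 'HIT', 'MISS', 'NOT FOUND']
```

The third read is a MISS again because the entry has outlived its TTL. This is the same mechanism that bounds how long a stale value can be served in the Redis-backed version.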


In [11]:
def record_view(title):
    """
    Tracks how many times a movie is viewed using Redis Sorted Sets.
    Score = View Count
    Member = Movie Title
    """
    # ZINCRBY increments the score of the member by 1
    r.zincrby("leaderboard:views", 1, title)

def get_trending_movies(top_n=5):
    """
    Retrieves the top N movies with highest scores (views).
    """
    # ZREVRANGE gets the range sorted from high to low
    return r.zrevrange("leaderboard:views", 0, top_n-1, withscores=True)

print(" Analytics Logic Implemented.")

 Analytics Logic Implemented.
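
To make the sorted-set semantics concrete, here is a small pure-Python model of the two commands used above. It is an illustration only: `zincrby` and `zrevrange` below are hypothetical local functions that mimic the Redis commands on a dict, including the Redis rule that members with equal scores are ordered lexicographically (and therefore in reverse lexicographic order in a descending range). The view counts are made up.

```python
def zincrby(board, amount, member):
    """Mimic ZINCRBY: add amount to the member's score, creating it at 0 if absent."""
    board[member] = board.get(member, 0.0) + amount
    return board[member]


def zrevrange(board, start, stop, withscores=False):
    """Mimic ZREVRANGE for an inclusive, non-negative stop index."""
    # Sort by (score, member) descending: ties fall back to reverse lexicographic order
    ranked = sorted(board.items(), key=lambda item: (item[1], item[0]), reverse=True)
    window = ranked[start:stop + 1]
    return window if withscores else [member for member, _ in window]


views = {}
for title in ["Traffic in Souls", "Salomè", "Salomè",
              "High and Dizzy", "Salomè", "High and Dizzy"]:
    zincrby(views, 1, title)

top_two = zrevrange(views, 0, 1, withscores=True)
print(top_two)  # [('Salomè', 3.0), ('High and Dizzy', 2.0)]
```

Scores come back as floats, which is why the report cell casts them with `int(views)` before printing.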


In [12]:
# 1. Prepare Test Data
# We take the first 100 titles returned by MongoDB to simulate user queries
# (limit(100) is not a random sample; a $sample aggregation stage could provide one)
print(" Fetching test data from MongoDB...")
all_movies = list(movies_col.find({}, {"title": 1, "_id": 0}).limit(100))
movie_titles = [m['title'] for m in all_movies]

# Define "hot" keys: the first 20 titles (20% of the pool) receive most of the traffic
hot_movies = movie_titles[:20] 

# 2. Run Simulation
TOTAL_REQUESTS = 500
results = []

print(f" Starting Benchmark ({TOTAL_REQUESTS} requests)...")

start_time_global = time.time()

for i in range(TOTAL_REQUESTS):
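    # Skewed traffic approximating the 80/20 rule: 80% of requests go to the
    # hot set, the remaining 20% pick uniformly from all titles
    if random.random() < 0.8:
        target = random.choice(hot_movies)
    else:
        target = random.choice(movie_titles)
    
    # Measure latency with a monotonic, high-resolution timer
    req_start = time.perf_counter()
    data, status = get_movie(target)
    req_end = time.perf_counter()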
    
    # Track Analytics
    if status != "NOT FOUND":
        record_view(target)
    
    latency_ms = (req_end - req_start) * 1000
    results.append({"status": status, "latency": latency_ms})

print("Benchmark Complete.")

 Fetching test data from MongoDB...
 Starting Benchmark (500 requests)...
Benchmark Complete.
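
The traffic model in the loop above allows a rough prediction of the miss count. As a back-of-the-envelope sketch, assume the 100 titles are distinct and that no key expires during the run (the 60-second TTL is long relative to the benchmark). Each title then misses exactly once, on its first request, so the expected number of misses equals the expected number of distinct titles requested. The function name `expected_distinct_titles` is introduced here for illustration.

```python
def expected_distinct_titles(requests, pool=100, hot=20, hot_share=0.8):
    """Expected number of distinct titles requested under the benchmark's traffic model."""
    # A hot title can be drawn from the hot branch or from the uniform branch
    p_hot = hot_share / hot + (1 - hot_share) / pool
    # A cold title can only be drawn from the uniform branch
    p_cold = (1 - hot_share) / pool
    # P(title requested at least once) = 1 - P(never requested in any request)
    seen_hot = hot * (1 - (1 - p_hot) ** requests)
    seen_cold = (pool - hot) * (1 - (1 - p_cold) ** requests)
    return seen_hot + seen_cold


expected_misses = expected_distinct_titles(500)
expected_hit_ratio = 1 - expected_misses / 500
print(f"expected misses: {expected_misses:.1f}")        # about 70.6
print(f"expected hit ratio: {expected_hit_ratio:.1%}")  # about 85.9%
```

Under these assumptions the estimate is close to the 74 misses and 85.20% hit ratio reported in the run below. Duplicate titles among the 100 would lower the miss count, since they share a cache key.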


In [13]:
# Create DataFrame
df = pd.DataFrame(results)

# 1. Hit/Miss Ratio
hits = len(df[df['status'] == 'HIT'])
misses = len(df[df['status'] == 'MISS'])
total = hits + misses
hit_ratio = (hits / total) * 100

# 2. Latency Analysis
avg_hit_time = df[df['status'] == 'HIT']['latency'].mean()
avg_miss_time = df[df['status'] == 'MISS']['latency'].mean()

print("="*40)
print(" FINAL PROJECT REPORT")
print("="*40)

print(f"\n1. CACHE EFFICIENCY")
print(f"   Total Requests: {TOTAL_REQUESTS}")
print(f"   Hits (Redis):   {hits}")
print(f"   Misses (Mongo): {misses}")
print(f"    Hit Ratio:   {hit_ratio:.2f}%")

print(f"\n2. PERFORMANCE GAINS")
print(f"   Avg Time (Cache HIT):  {avg_hit_time:.4f} ms ")
print(f"   Avg Time (Cache MISS): {avg_miss_time:.4f} ms ")
print(f"    Speedup Factor:     {avg_miss_time / avg_hit_time:.1f}x Faster")

print(f"\n3. POPULARITY LEADERBOARD (Redis Sorted Sets)")
trending = get_trending_movies(5)
for rank, (name, views) in enumerate(trending, 1):
    print(f"   #{rank}: {name} ({int(views)} views)")

 FINAL PROJECT REPORT

1. CACHE EFFICIENCY
   Total Requests: 500
   Hits (Redis):   426
   Misses (Mongo): 74
    Hit Ratio:   85.20%

2. PERFORMANCE GAINS
   Avg Time (Cache HIT):  0.0744 ms 
   Avg Time (Cache MISS): 41.5272 ms 
    Speedup Factor:     558.4x Faster

3. POPULARITY LEADERBOARD (Redis Sorted Sets)
   #1: Salomè (32 views)
   #2: High and Dizzy (28 views)
   #3: The Perils of Pauline (27 views)
   #4: From Hand to Mouth (25 views)
   #5: Traffic in Souls (23 views)
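
The speedup factor above compares a single hit with a single miss. A complementary figure is the average latency over the whole workload, since misses still pay the MongoDB round trip. The arithmetic below is a sketch that plugs in the numbers printed in this report; the variable names are illustrative. Because the cache is FakeRedis running in-process, the hit latency does not include a network round trip to a Redis server.

```python
hits, misses = 426, 74              # counts from the report
hit_ms, miss_ms = 0.0744, 41.5272   # average latencies from the report

total = hits + misses
# Weighted average: every request pays either the hit cost or the miss cost
effective_ms = (hits * hit_ms + misses * miss_ms) / total
# Baseline without a cache: every request pays roughly the miss cost
overall_speedup = miss_ms / effective_ms

print(f"effective latency: {effective_ms:.2f} ms")   # about 6.21 ms
print(f"overall speedup:   {overall_speedup:.1f}x")  # about 6.7x
```

At an 85% hit ratio the remaining misses dominate the average, so raising the hit ratio matters more for the overall figure than making hits faster.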


In [14]:
print("\n---  Testing Write-Through Strategy ---")

# Pick a movie
test_movie = hot_movies[0]

# 1. Read current rating (Should be a HIT now)
data, _ = get_movie(test_movie)
print(f"Current Rating: {data['rating']}")

# 2. Update rating
new_rating = round(random.uniform(1.0, 10.0), 1)
update_movie_rating(test_movie, new_rating)

# 3. Read again (Should immediately reflect new rating)
data_new, status = get_movie(test_movie)
print(f"New Rating:     {data_new['rating']} (Source: {status})")

if data_new['rating'] == str(new_rating):
    print(" SUCCESS: Data is consistent across DB and Cache.")
else:
    print("FAILURE: Data mismatch.")


---  Testing Write-Through Strategy ---
Current Rating: 3.6
 Updating rating for 'The Saphead' to 1.0...
Updated MongoDB and Refreshed Redis Cache.
New Rating:     1.0 (Source: HIT)
 SUCCESS: Data is consistent across DB and Cache.
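
One edge case in the update path is worth illustrating. `HSET` creates the hash when the key does not exist, so if the cached entry expires between the `exists` check and the `hset` call, the cache ends up holding a hash with only a `rating` field, and the next read returns it as a HIT. The sketch below models this with plain dicts (the `hset` and `invalidate` helpers are hypothetical stand-ins, not redis-py calls) and contrasts it with deleting the key on write, a common alternative that leaves repopulation to the Cache-Aside read path.

```python
def hset(cache, key, field, value):
    """Mimic HSET: set one field, creating the hash if the key is missing."""
    cache.setdefault(key, {})[field] = value


def invalidate(cache, key):
    """Delete-on-write: drop the cached entry so the next read reloads it."""
    cache.pop(key, None)


key = "movie:The Saphead"

# Case 1: the entry expired just after the exists() check, then the field is written
cache = {}
hset(cache, key, "rating", "1.0")
partial_entry = cache[key]
print(partial_entry)  # {'rating': '1.0'}  (no title, year, or plot)

# Case 2: invalidate instead of updating in place; nothing partial is left behind
cache = {key: {"title": "The Saphead", "rating": "3.6"}}
invalidate(cache, key)
print(key in cache)  # False
```

Invalidation costs one extra miss after each update, in exchange for never serving a partially populated entry.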
