# MapReduce Implementation with Redis Backend

This notebook demonstrates a **distributed MapReduce implementation** using Redis as the backend for job coordination and data storage.

## Learning Objectives
- Understand the fundamentals of the MapReduce paradigm
- Implement distributed processing with Redis
- Compare sequential vs parallel execution
- Analyze performance improvements

## Use Case: Word Frequency Analysis
We'll analyze Dante's Divine Comedy to count word frequencies across the three canticles (Inferno, Purgatorio, Paradiso).

---

## 1. Environment Setup

First, let's install required dependencies and establish our Redis connection.

In [None]:
%%capture
# Install required packages
%pip install redis rich tqdm beautifulsoup4 requests

In [None]:
# Core imports for our MapReduce implementation
import redis
import json
import time
from multiprocessing import Process

# Data fetching and processing
import requests
from bs4 import BeautifulSoup

# Utilities for better output
from rich.pretty import pprint
from rich.console import Console
from tqdm import tqdm

console = Console()

In [None]:
# Redis Cluster Connection
# Note: Make sure your Jupyter server is connected to the Redis network
# Command: docker network connect redis_default jupyter-jupyter-1

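try:
    r = redis.RedisCluster(host='master', port=6379)
    console.print("✅ Redis cluster connection established", style="green")
except Exception as e:
    console.print(f"❌ Redis cluster connection failed: {e}", style="red")
    # Fall back to a local single-node Redis if the cluster is not available
    r = redis.Redis(host='localhost', port=6379, decode_responses=False)
    r.ping()  # redis.Redis connects lazily, so fail here if localhost is unreachable too
    console.print("↪️ Using local Redis at localhost:6379", style="yellow")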

## 2. Data Loading and Preparation

Let's fetch Dante's Divine Comedy from the web and prepare it for processing.

In [None]:
# Helper functions for Redis operations
def store(key, record):
    """Add a record to the Redis set at `key`, serialized as a JSON string"""
    r.sadd(key, json.dumps(record))

def fetch(key):
    """Pop one arbitrary record from the Redis set at `key`; return None when it is empty"""
    raw = r.spop(key)
    return json.loads(raw) if raw is not None else None

def cleanup_keys(pattern):
    """Delete all Redis keys matching a glob pattern and return how many were found"""
    keys_to_delete = list(r.scan_iter(pattern))
    for key in keys_to_delete:
        r.delete(key)
    return len(keys_to_delete)
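Because `store()` and `fetch()` sit on top of a Redis *set*, two properties are worth keeping in mind: identical JSON strings are stored only once, and `fetch()` hands records back in no particular order. The sketch below illustrates both without a running server. `InMemorySets` is a hypothetical stand-in written for this example (it is not part of redis-py) and only mimics the `SADD`, `SPOP`, and `SCARD` behaviour used here.

```python
import json
import random


class InMemorySets:
    """Hypothetical stand-in that mimics Redis SADD/SPOP/SCARD with Python sets."""

    def __init__(self):
        self._sets = {}

    def sadd(self, key, member):
        members = self._sets.setdefault(key, set())
        if member in members:
            return 0  # already present: nothing added
        members.add(member)
        return 1

    def spop(self, key):
        members = self._sets.get(key)
        if not members:
            return None  # missing or empty set
        member = random.choice(list(members))  # arbitrary member, like SPOP
        members.remove(member)
        return member

    def scard(self, key):
        return len(self._sets.get(key, ()))


fake = InMemorySets()


def store(key, record):
    fake.sadd(key, json.dumps(record))


def fetch(key):
    raw = fake.spop(key)
    return json.loads(raw) if raw is not None else None


record = {"cantica": "inferno", "text": "sample line"}  # made-up record
store("demo:cantos", record)
store("demo:cantos", record)  # identical JSON string, so the set keeps one copy

stored = fake.scard("demo:cantos")
first = fetch("demo:cantos")
second = fetch("demo:cantos")
print(stored, first, second)
```

If two paragraphs of the source text were byte-for-byte identical within the same cantica, they would collapse into a single record in the same way, which is one reason the stored set can be smaller than the number of `store()` calls.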

In [None]:
def load_divine_comedy():
    """
    Load Dante's Divine Comedy from online source
    Splits text into individual cantos for processing
    """
    console.print("🧹 Cleaning up previous data...")
    
    # Clean up any existing data
    r.delete("dante_comedy:cantos")
    cleanup_keys("dante_comedy:max_count:step_*")
    
    console.print("📚 Loading Dante's Divine Comedy...")
    
    # URLs for the three canticles
    urls = {
        "inferno": "https://www.liberliber.eu/mediateca/libri/a/alighieri/la_divina_commedia/html/testo_01.htm",
        "purgatorio": "https://www.liberliber.eu/mediateca/libri/a/alighieri/la_divina_commedia/html/testo_02.htm",
        "paradiso": "https://www.liberliber.eu/mediateca/libri/a/alighieri/la_divina_commedia/html/testo_03.htm"
    }
    
    total_cantos = 0
    
    for cantica_name, url in urls.items():
        try:
            response = requests.get(url, timeout=30)
            response.raise_for_status()  # treat HTTP error pages as load failures
            soup = BeautifulSoup(response.text, 'html.parser')
            
            # Extract text from paragraphs with class 'rientrato'
            cantos = soup.select("p.rientrato")
            
            for canto in cantos:
                if canto.text.strip():  # Only store non-empty cantos
                    store("dante_comedy:cantos", {
                        "cantica": cantica_name,
                        "text": canto.text.strip()
                    })
                    total_cantos += 1
                    
        except Exception as e:
            console.print(f"❌ Error loading {cantica_name}: {e}", style="red")
    
    console.print(f"✅ Data loaded successfully! Total cantos: {total_cantos}")
    return total_cantos

# Load the data
canto_count = load_divine_comedy()

## 3. MapReduce Functions

Now let's define our **Map** and **Reduce** functions for word frequency analysis.

In [None]:
def map_words(record):
    """
    MAP FUNCTION: Extract words from text and emit (word-cantica, 1) pairs
    
    Input: {'cantica': 'inferno', 'text': 'Nel mezzo del cammin...'}
    Output: Generator of (key, value) pairs
    """
    text = record["text"]
    cantica = record["cantica"]
    
    # Clean text: remove punctuation
    punctuation = ",.«»!?:;\"'()"
    clean_text = "".join(c for c in text if c not in punctuation)
    
    # Emit (word-cantica, 1) for each word
    for word in clean_text.split():
        if word.strip():  # Skip empty strings
            yield f"{word.lower()}-{cantica}", 1

def count_word(key, records):
    """
    REDUCE FUNCTION: Sum up counts for each word
    
    Input: key='nel-inferno', records=[1, 1, 1, ...]
    Output: Generator of (word, total_count) pairs
    """
    # Split on the last hyphen: cantica names contain none, but a word might
    word, cantica = key.rsplit("-", 1)
    total_count = sum(records)
    yield word, total_count

def max_count(key, records):
    """
    REDUCE FUNCTION: Find max and total counts across canticles
    
    Input: key='nel', records=[45, 32, 28]
    Output: Generator of (word, {max, total}) pairs
    """
    yield key, {
        "max": max(records),
        "total": sum(records)
    }
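To make the map output concrete, here is a small sketch that runs the map step on a single made-up record and then groups the emitted pairs by key (the "shuffle" that sits between map and reduce). `map_words` is restated in compact form so the cell runs on its own; the sample text is invented for illustration and is not a real record from the dataset.

```python
def map_words(record):
    # Same logic as the map function above, restated so this cell is self-contained
    punctuation = ",.«»!?:;\"'()"
    clean_text = "".join(c for c in record["text"] if c not in punctuation)
    for word in clean_text.split():
        yield f"{word.lower()}-{record['cantica']}", 1


# Made-up sample record (assumption: real records have the same two fields)
sample = {"cantica": "inferno", "text": "Nel mezzo del cammin di nostra vita, nel mezzo."}

# Map: one (word-cantica, 1) pair per word occurrence
pairs = list(map_words(sample))

# Shuffle: collect all values emitted for the same key
grouped = {}
for key, value in pairs:
    grouped.setdefault(key, []).append(value)

print(pairs[:3])
print(grouped["nel-inferno"])
```

Each key in `grouped` is exactly what a reduce function such as `count_word` receives: a key plus the list of values emitted for it.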

### 3.1. Simple Word Counting in Python

Before implementing the distributed version, let's create a simple Python-only word count to understand the data flow. This will help you appreciate the benefits of the distributed approach we'll build later.

**Exercise**: Use the provided `map_words()`, `count_word()`, and `max_count()` functions to implement a basic word count algorithm.

In [None]:
# Load all cantos from Redis into memory for sequential processing
# This simulates having all data available locally (non-distributed approach)
cantos = [json.loads(canto) for canto in r.smembers("dante_comedy:cantos")]

print(f"📚 Loaded {len(cantos)} cantos for sequential processing")
if cantos:  # avoid an IndexError when the load step stored nothing
    print(f"📄 Sample canto: {cantos[0]['cantica']} - {cantos[0]['text'][:100]}...")

In [None]:
# Step 1: Apply the map function and group the emitted 1s by word-cantica key
step1 = {}
for canto in cantos:
    for key, value in map_words(canto):
        step1.setdefault(key, []).append(value)

print(f"🔍 Step 1 complete: {len(step1)} unique word-cantica pairs")
print("📊 Sample mappings:")
# Show a few examples
for key, values in list(step1.items())[:3]:
    print(f"   {key}: {values[:5]}{'...' if len(values) > 5 else ''}")

# Step 2: Reduce to get word counts per cantica
step2 = {}
# Your code here: use count_word function to aggregate counts

# Step 3: Reduce to get max/total counts per word  
step3 = {}
# Your code here: use max_count function to find max and total

# Step 4: Display results
# Your code here: show top 10 most frequent words
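If you get stuck, the sketch below shows one possible shape for steps 2 to 4, run on a tiny made-up stand-in for `step1` rather than the real data. `reduce_step` and the `toy_*` names are hypothetical helpers introduced for this example, and the two reducers are compact restatements of the functions defined earlier. This version splits keys on the last hyphen so that hyphenated words stay intact, which is a design choice of the sketch.

```python
def count_word(key, records):
    # Drop the cantica suffix and sum the 1s emitted by the map step
    word, _cantica = key.rsplit("-", 1)
    yield word, sum(records)


def max_count(key, records):
    # records holds one count per cantica in which the word appears
    yield key, {"max": max(records), "total": sum(records)}


def reduce_step(grouped, reducer):
    """Apply reducer to every (key, values) group and regroup what it emits."""
    out = {}
    for key, values in grouped.items():
        for new_key, new_value in reducer(key, values):
            out.setdefault(new_key, []).append(new_value)
    return out


# Toy stand-in for step1 (made-up counts, not real results)
toy_step1 = {
    "amor-inferno": [1, 1],
    "amor-paradiso": [1, 1, 1],
    "selva-inferno": [1],
}

# Step 2: per-cantica counts, regrouped by word -> {'amor': [2, 3], 'selva': [1]}
toy_step2 = reduce_step(toy_step1, count_word)

# Step 3: max_count emits exactly one value per word, so unwrap the list
toy_step3 = {word: stats[0] for word, stats in reduce_step(toy_step2, max_count).items()}

# Step 4: show the most frequent words, highest total first
top = sorted(toy_step3.items(), key=lambda kv: kv[1]["total"], reverse=True)[:10]
for word, stats in top:
    print(f"{word:<10} total={stats['total']:>3} max={stats['max']:>3}")
```

The same three calls should apply to the real `step1` dictionary; only the input changes.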