# Exercise 3: Approximate Methods (HyperLogLog)

## Learning Objectives

In this exercise, you will:
- Learn about approximate distinct counting with HyperLogLog
- Understand when approximate methods are useful
- Compare approximate vs exact methods

## Overview

**HyperLogLog** estimates the number of distinct elements in a dataset using a small, fixed amount of memory, in exchange for a small, tunable error. It is well suited to large datasets where an exact distinct count is expensive.
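
To make the idea concrete, here is a minimal pure-Python sketch of the classic HyperLogLog estimator. It is a toy for illustration only, not the implementation used by Spark or by the `hyperloglog_distinct_count` helper; the class name `TinyHyperLogLog`, the choice of SHA-1 as the hash, and the sample email values are all assumptions made for this example. Each value is hashed, the first `p` bits select one of `2**p` registers, and each register remembers the longest run of leading zeros seen in the remaining bits.

```python
import hashlib
import math


class TinyHyperLogLog:
    """Toy HyperLogLog with 2**p registers (hypothetical, for illustration)."""

    def __init__(self, p=12):
        self.p = p
        self.m = 1 << p                      # number of registers
        self.registers = [0] * self.m
        # Bias-correction constant from the original algorithm (valid for m >= 128)
        self.alpha = 0.7213 / (1 + 1.079 / self.m)

    def add(self, value):
        # Derive a 64-bit hash from the value (SHA-1 is an arbitrary choice here)
        digest = hashlib.sha1(str(value).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        idx = h >> (64 - self.p)                   # first p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)      # remaining 64 - p bits
        # Rank = 1-based position of the leftmost 1-bit in the remaining bits
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def estimate(self):
        # Harmonic mean of 2**register across all registers, scaled by alpha * m**2
        z = sum(2.0 ** -r for r in self.registers)
        raw = self.alpha * self.m * self.m / z
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small-range correction: fall back to linear counting
            return self.m * math.log(self.m / zeros)
        return raw


# 200,000 rows containing 50,000 distinct (made-up) email addresses
hll = TinyHyperLogLog(p=12)
for i in range(200_000):
    hll.add(f"user{i % 50_000}@example.com")

estimate = hll.estimate()
print(f"Estimated distinct: {estimate:,.0f} (true value: 50,000)")
print(f"Registers used: {len(hll.registers):,}")
```

The sketch keeps only 4,096 small integers no matter how many rows it sees, which is why the memory cost stays flat as the data grows. Duplicates hash to the same register and rank, so they never change the estimate.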

In [None]:
from bloom_filter_hyperloglog import create_spark_session, hyperloglog_distinct_count
import time

spark = create_spark_session("Exercise3_ApproximateMethods")
print("✓ Spark session created")

In [None]:
# Load data
df = spark.read.csv("data/redundant_data.csv", header=True, inferSchema=True)

# Exact count
print("Calculating exact distinct count...")
start_time = time.time()
exact_count = df.select("email").distinct().count()
exact_time = time.time() - start_time
print(f"Exact distinct emails: {exact_count:,} ({exact_time:.2f}s)")

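# HyperLogLog approximate count
print("\nCalculating approximate distinct count with HyperLogLog...")
start_time = time.time()
# rsd = maximum relative standard deviation allowed for the estimate (0.05 = 5%)
result = hyperloglog_distinct_count(df, column='email', rsd=0.05)
helper_time = time.time() - start_time

if result:
    print(f"\nApproximate: {result['approx_distinct']:,}")
    print(f"Exact: {result['exact_distinct']:,}")
    print(f"Error: {result['error_percent']:.2f}%")
    # This covers the whole helper call, which may also compute its own exact count
    print(f"Helper call time: {helper_time:.2f}s")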

## Questions to Answer

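1. How accurate is the approximation, and how does the observed error compare with the requested `rsd` of 0.05?
2. How much faster is HyperLogLog than the exact distinct count?
3. When would approximate methods be useful, and when would you still need an exact count?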
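
As a reference point for question 1, the classic HyperLogLog analysis gives a standard error of roughly `1.04 / sqrt(m)` for `m` registers. The sketch below inverts that formula to show how many registers a given `rsd` implies. The function name `registers_for_rsd` is hypothetical, and Spark's own implementation may choose its precision differently, so treat the numbers as an approximation of the trade-off rather than a description of Spark internals.

```python
import math


def registers_for_rsd(rsd):
    """Smallest power-of-two register count m with 1.04 / sqrt(m) <= rsd."""
    m_min = (1.04 / rsd) ** 2            # solve 1.04 / sqrt(m) = rsd for m
    p = math.ceil(math.log2(m_min))      # round up to the next power of two
    return 1 << p


for rsd in (0.10, 0.05, 0.01):
    m = registers_for_rsd(rsd)
    print(f"rsd={rsd:.2f}: m={m:,} registers, "
          f"expected error ~{1.04 / math.sqrt(m):.2%}")
```

Under this formula, halving the target error roughly quadruples the number of registers, so tighter accuracy costs more memory, though still far less than tracking every distinct value.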

In [None]:
spark.stop()
print("✓ Spark session stopped")