# Exercise 2: Compare Deduplication Methods

## Learning Objectives

In this exercise, you will:
- Compare different deduplication strategies
- Understand when to use each method
- Analyze performance differences

## Overview

Each deduplication method suits a different scenario:
- **exact**: Fastest; removes only rows that are exact duplicates
- **normalized**: Also catches duplicates that differ only in case or whitespace
- **spark_hash**: Hash-based deduplication; very fast
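
To make the distinction concrete, here is a minimal pure-Python sketch of the three ideas. It is not the implementation in `deduplicate_spark`: the helper names (`dedup_exact`, `dedup_normalized`, `dedup_hash`) and the sample rows are hypothetical, and the hash variant assumes that "hash-based" means comparing fixed-size digests of each row rather than the rows themselves.

```python
import hashlib

# Hypothetical sample rows, made up for illustration
rows = ["Alice Smith", "alice smith", "Alice  Smith ", "Bob Jones", "Bob Jones"]

def dedup_exact(items):
    """Keep the first occurrence of each identical string."""
    return list(dict.fromkeys(items))

def normalize(text):
    """Lowercase the text and collapse runs of whitespace."""
    return " ".join(text.lower().split())

def dedup_normalized(items):
    """Keep the first row seen for each normalized form."""
    seen = set()
    kept = []
    for item in items:
        key = normalize(item)
        if key not in seen:
            seen.add(key)
            kept.append(item)
    return kept

def dedup_hash(items):
    """Compare fixed-size SHA-256 digests instead of the full rows."""
    seen = set()
    kept = []
    for item in items:
        digest = hashlib.sha256(item.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(item)
    return kept

print(len(dedup_exact(rows)))       # 4
print(len(dedup_normalized(rows)))  # 2
print(len(dedup_hash(rows)))        # 4
```

On this sample the hash variant keeps the same rows as the exact variant, because it hashes the raw text. Only the normalized variant treats the three spellings of "Alice Smith" as one record.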

In [None]:
from deduplicate_spark import create_spark_session, process_file_spark
import time

spark = create_spark_session("Exercise2_CompareMethods")
print("✓ Spark session created")

In [None]:
# Methods to test
methods = ['exact', 'normalized', 'spark_hash']
results = []

for method in methods:
    print(f"\n{'='*70}")
    print(f"Testing method: {method}")
    print('='*70)
    
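    # perf_counter() is a monotonic clock intended for measuring elapsed time
    start_time = time.perf_counter()
    stats = process_file_spark(spark, "data/redundant_data.csv", method=method)
    elapsed_time = time.perf_counter() - start_time
    
    if stats:
        results.append({
            'method': method,
            'deduplication_rate': stats['deduplication_rate'],
            'time_seconds': elapsed_time
        })
        print(f"✓ Completed in {elapsed_time:.2f} seconds")
    else:
        # A falsy return value is left out of the comparison, so say so explicitly
        print(f"✗ No statistics returned for method: {method}")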

# Display comparison
import pandas as pd
if results:
    df_results = pd.DataFrame(results)
    print("\n" + df_results.to_string(index=False))
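
A single wall-clock measurement can be noisy, and the first Spark job in a session usually also pays one-time startup costs. As a rough sketch, each method could be timed several times and summarized by the median. The helper `time_call` and its `repeats` parameter are hypothetical and not part of `deduplicate_spark`.

```python
import statistics
import time

def time_call(fn, repeats=3):
    """Call fn() `repeats` times and return the median elapsed seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return statistics.median(durations)

# Stand-in workload; in the exercise this would wrap the process_file_spark call
median_seconds = time_call(lambda: sum(range(100_000)))
print(f"median: {median_seconds:.4f} seconds")
```

In the comparison loop, the call would be wrapped as `time_call(lambda: process_file_spark(spark, "data/redundant_data.csv", method=method))`, at the cost of running each method several times.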

## Questions to Answer

1. Which method removed the most duplicates?
2. Which method was fastest?
3. When would you use each method?
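
Questions 1 and 2 can be answered directly from the `results` list built earlier. The sketch below assumes each entry has the `method`, `deduplication_rate`, and `time_seconds` keys used in the comparison loop, and that a higher `deduplication_rate` means more duplicates were removed. The function name `summarize` is hypothetical, and the numbers are placeholders, not measured results.

```python
# Placeholder values for illustration only; pass your own `results` list instead
sample_results = [
    {"method": "exact", "deduplication_rate": 10.0, "time_seconds": 1.0},
    {"method": "normalized", "deduplication_rate": 20.0, "time_seconds": 3.0},
    {"method": "spark_hash", "deduplication_rate": 10.0, "time_seconds": 2.0},
]

def summarize(results):
    """Return (method with the highest rate, method with the lowest time)."""
    most_removed = max(results, key=lambda r: r["deduplication_rate"])
    fastest = min(results, key=lambda r: r["time_seconds"])
    return most_removed["method"], fastest["method"]

most_removed, fastest = summarize(sample_results)
print(f"Most duplicates removed: {most_removed}")
print(f"Fastest: {fastest}")
```

If two methods tie, `max` and `min` return the first one in list order, so check the full table before drawing conclusions.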

In [None]:
spark.stop()
print("✓ Spark session stopped")