# Exercise 1: Basic Deduplication

## Learning Objectives

In this exercise, you will:
- Learn how to remove exact duplicates from a dataset
- Understand the `exact` deduplication method
- Analyze deduplication results
- See where results are saved

## Overview

The `exact` method is the simplest and fastest deduplication technique. It treats records as duplicates when their values match exactly in the specified columns (typically `name` and `email`) and removes the extra copies.
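
To make the idea concrete, here is a minimal pure-Python sketch of exact matching on key columns. It is an illustration only, not the implementation inside `process_file_spark`; the function name `exact_dedup`, the sample records, and the choice to keep the first occurrence are assumptions made for this example.

```python
def exact_dedup(records, key_columns=("name", "email")):
    """Keep the first record seen for each exact combination of key values."""
    seen = set()
    unique = []
    for record in records:
        # The key is the tuple of values in the chosen columns, compared as-is
        key = tuple(record[col] for col in key_columns)
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


# Hypothetical sample data for illustration
sample = [
    {"name": "Ann Lee", "email": "ann@example.com"},
    {"name": "Ann Lee", "email": "ann@example.com"},  # exact duplicate: dropped
    {"name": "ann lee", "email": "ann@example.com"},  # differs in case: kept
]

unique = exact_dedup(sample)
print(f"{len(sample)} records in, {len(unique)} records out")
```

In PySpark, the built-in `DataFrame.dropDuplicates` called with a list of column names has a comparable effect; whether `process_file_spark` relies on it is not stated in this exercise.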

## Step 1: Create a Spark Session

In [None]:
from deduplicate_spark import create_spark_session, process_file_spark

# Create Spark session
spark = create_spark_session("Exercise1_BasicDeduplication")
print("✓ Spark session created")

## Step 2: Generate or Load Data

The cell below generates sample data if `data/exercise1.csv` does not exist yet. Otherwise, it uses the existing file.

In [None]:
import os
import subprocess
import sys

# Generate data if it doesn't exist
if not os.path.exists("data/exercise1.csv"):
    # Create the data directory if it is missing
    os.makedirs("data", exist_ok=True)

    print("Generating sample data...")
    result = subprocess.run(
        # sys.executable runs the generator with the same interpreter as this notebook
        [sys.executable, "generate_dataset.py", "1000", "data/exercise1.csv"],
        capture_output=True,
        text=True
    )
    if result.returncode == 0:
        print("✓ Data generated successfully")
    else:
        print(f"✗ Error: {result.stderr}")
else:
    print("✓ Using existing data file")
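
Before deduplicating, it can help to look at the first few rows of the file. The sketch below uses only the standard library; the helper name `preview_csv` is hypothetical, and it assumes the generated file is a UTF-8 CSV whose first row is a header, which this exercise does not confirm.

```python
import csv
import os
from itertools import islice


def preview_csv(path, n=5):
    """Return the header row and the first n data rows of a CSV file."""
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        header = next(reader, [])       # assumed: first row holds column names
        rows = list(islice(reader, n))  # read at most n further rows
    return header, rows


# Only preview if the data file from this step is present
if os.path.exists("data/exercise1.csv"):
    header, rows = preview_csv("data/exercise1.csv")
    print(header)
    for row in rows:
        print(row)
```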

## Step 3: Run Deduplication

Now let's run the exact deduplication method.

In [None]:
# Run exact deduplication
stats = process_file_spark(
    spark,
    "data/exercise1.csv",
    output_dir=None,  # Uses /tmp/results in Cloudera, data/ locally
    method='exact'
)

if stats:
    print(f"\nOriginal records: {stats['original_count']:,}")
    print(f"Unique records: {stats['unique_count']:,}")
    print(f"Duplicates removed: {stats['duplicates_removed']:,}")
    print(f"Deduplication rate: {stats['deduplication_rate']:.2f}%")

## Questions to Answer

1. How many duplicates were found?
2. What percentage of records were duplicates?
3. Where are the results saved?
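
For questions 1 and 2, it may help to see how the reported statistics plausibly relate to each other. The sketch below assumes that `duplicates_removed` is the original count minus the unique count and that `deduplication_rate` is that difference as a percentage of the original count; the exercise does not spell out these definitions. The function name `summarize` and the input numbers are hypothetical, not actual results.

```python
def summarize(original_count, unique_count):
    """Derive duplicate statistics from two record counts (assumed definitions)."""
    duplicates_removed = original_count - unique_count
    # Guard against an empty input file to avoid dividing by zero
    if original_count:
        rate = 100.0 * duplicates_removed / original_count
    else:
        rate = 0.0
    return {
        "original_count": original_count,
        "unique_count": unique_count,
        "duplicates_removed": duplicates_removed,
        "deduplication_rate": rate,
    }


# Illustrative numbers only
example = summarize(original_count=1000, unique_count=850)
print(f"Duplicates removed: {example['duplicates_removed']:,}")
print(f"Deduplication rate: {example['deduplication_rate']:.2f}%")
```

Comparing the values in your own `stats` dictionary against these formulas is a quick way to check that you are reading the output correctly.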

## Cleanup

In [None]:
# Cleanup
spark.stop()
print("✓ Spark session stopped")