# Exercise 1: Basic Deduplication

## Learning Objectives

In this exercise, you will:
- Learn how to remove exact duplicates from a dataset
- Understand the `exact` deduplication method
- Analyze deduplication results
- See where results are saved

## Overview

The `exact` method is the simplest and fastest deduplication technique. It removes records whose values are identical in the specified columns (typically `name` and `email`).
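
To make the idea concrete, here is a minimal plain-Python sketch of exact matching on key columns. It is only an illustration and not the implementation inside `deduplicate_spark.py`; the helper name `dedupe_exact` and the sample records are hypothetical, and keeping the first occurrence of each key is an assumption.

```python
def dedupe_exact(records, key_columns=("name", "email")):
    """Keep the first record for each distinct combination of key column values."""
    seen = set()
    unique = []
    for record in records:
        # Build a hashable key from the columns that define a duplicate
        key = tuple(record[col] for col in key_columns)
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


# Hypothetical sample data: the second row repeats the first exactly,
# while the third differs in letter case in the name column
sample = [
    {"name": "Ada Lovelace", "email": "ada@example.com"},
    {"name": "Ada Lovelace", "email": "ada@example.com"},
    {"name": "ada lovelace", "email": "ada@example.com"},
]

unique = dedupe_exact(sample)
print(f"Original records: {len(sample)}")
print(f"Unique records: {len(unique)}")
```

In this sketch the third record survives because its `name` differs in case, which shows why exact matching is fast but strict: only byte-for-byte identical key values collapse into one record.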

## Step 1: Set Up the Environment

In [None]:
# Setup: Add project root to Python path
import sys
import os

# Find project root
current_dir = os.getcwd()
if 'notebooks' in current_dir:
    project_root = os.path.dirname(current_dir)
elif os.path.exists(os.path.join(current_dir, 'deduplicate_spark.py')):
    project_root = current_dir
else:
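    # Search up to five parent directories for deduplicate_spark.py
    project_root = None  # Defined up front so the fallback below cannot raise NameError
    test_dir = current_dir
    for _ in range(5):
        if os.path.exists(os.path.join(test_dir, 'deduplicate_spark.py')):
            project_root = test_dir
            break
        parent = os.path.dirname(test_dir)
        if parent == test_dir:
            # Reached the filesystem root
            break
        test_dir = parent
    # Fall back to the current directory if the search found nothing
    project_root = project_root or current_dir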

if project_root not in sys.path:
    sys.path.insert(0, project_root)
    print(f"✓ Added to Python path: {project_root}")

# Change to project root for file operations
os.chdir(project_root)
print(f"✓ Changed working directory to: {project_root}")


In [None]:
from deduplicate_spark import create_spark_session, process_file_spark

# Create Spark session
spark = create_spark_session("Exercise1_BasicDeduplication")
print("✓ Spark session created")

## Step 2: Generate or Load Data

The cell below generates sample data only if `data/exercise1.csv` does not exist yet. Otherwise, it reuses the existing file.

In [None]:
import subprocess
import os

# Generate data if it doesn't exist
data_file = os.path.join(project_root, "data", "exercise1.csv")
data_dir = os.path.join(project_root, "data")

if not os.path.exists(data_file):
    # Create data directory if needed
    if not os.path.exists(data_dir):
        os.makedirs(data_dir)
        print(f"✓ Created data directory: {data_dir}")
    
    print("Generating sample data...")
    script_path = os.path.join(project_root, "generate_dataset.py")
    result = subprocess.run(
        [sys.executable, script_path, "1000", data_file],  # Same interpreter as the notebook (sys imported in the setup cell)
        cwd=project_root,
        capture_output=True,
        text=True
    )
    if result.returncode == 0:
        print("✓ Data generated successfully")
    else:
        print(f"✗ Error: {result.stderr}")
        print(f"  Script path: {script_path}")
        print(f"  Script exists: {os.path.exists(script_path)}")
else:
    print("✓ Using existing data file")

## Step 3: Run Deduplication

Now let's run the exact deduplication method.

In [None]:
# Run exact deduplication
stats = process_file_spark(
    spark,
    os.path.join(project_root, "data", "exercise1.csv"),
    output_dir=None,  # Uses /tmp/results in Cloudera, data/ locally
    method='exact'
)

if stats:
    print(f"\nOriginal records: {stats['original_count']:,}")
    print(f"Unique records: {stats['unique_count']:,}")
    print(f"Duplicates removed: {stats['duplicates_removed']:,}")
    print(f"Deduplication rate: {stats['deduplication_rate']:.2f}%")

## Questions to Answer

1. How many duplicates were found?
2. What percentage of records were duplicates?
3. Where are the results saved?
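
For questions 1 and 2, the printed figures can be cross-checked by hand. The sketch below assumes that `duplicates_removed` is the difference between the original and unique counts, and that `deduplication_rate` is that difference as a percentage of the original count. The source does not show how `process_file_spark` computes these values, so treat the formulas as assumptions; the helper name `summarize_counts` and the example numbers are hypothetical.

```python
def summarize_counts(original_count, unique_count):
    """Derive duplicate statistics from two record counts (assumed formulas)."""
    duplicates_removed = original_count - unique_count
    # Guard against an empty input file to avoid dividing by zero
    if original_count:
        rate = duplicates_removed * 100 / original_count
    else:
        rate = 0.0
    return {
        "original_count": original_count,
        "unique_count": unique_count,
        "duplicates_removed": duplicates_removed,
        "deduplication_rate": rate,
    }


# Hypothetical counts, not results from the exercise data
example = summarize_counts(1000, 850)
print(f"Duplicates removed: {example['duplicates_removed']:,}")
print(f"Deduplication rate: {example['deduplication_rate']:.2f}%")
```

If the numbers you compute this way disagree with the values in `stats`, the function likely defines the rate differently, which is worth noting in your answer.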

In [None]:
# Cleanup
spark.stop()
print("✓ Spark session stopped")