# Data Deduplication Lab - Getting Started

Welcome to the Data Deduplication Lab! This notebook will help you get started with the lab exercises.

## Learning Objectives

By the end of this lab, you will:
- Understand different deduplication strategies
- Be able to process large datasets efficiently with Spark
- Know how to use approximate methods for memory-efficient operations
- Understand file-level vs record-level deduplication

## Prerequisites

- Access to Cloudera AI Workbench
- Basic Python knowledge
- Familiarity with data processing concepts

## Setup

First, let's verify that the necessary files are accessible and that Spark is available.

**Important**: Make sure you have uploaded the following files to your Cloudera AI Workbench project:
- `deduplicate_spark.py` (in the project root)
- `generate_dataset.py` (in the project root)
- `bloom_filter_hyperloglog.py` (in the project root)
- `bloom_filter_file_deduplication.py` (in the project root)

The notebook will automatically add the project root to the Python path so these modules can be imported.


In [None]:
# Add project root to Python path so we can import modules
import sys
import os

# Try multiple methods to find the project root
project_root = None

# Method 1: Check if we're in a notebooks subdirectory
current_dir = os.getcwd()
if os.path.basename(current_dir) == 'notebooks':
    # Go up one level from notebooks/
    project_root = os.path.dirname(current_dir)
    print(f"Method 1: Found project root (from notebooks/): {project_root}")
else:
    # Method 2: Look for deduplicate_spark.py in current directory
    if os.path.exists('deduplicate_spark.py'):
        project_root = current_dir
        print(f"Method 2: Found project root (deduplicate_spark.py in current dir): {project_root}")
    else:
        # Method 3: Try going up directories to find deduplicate_spark.py
        test_dir = current_dir
        for _ in range(3):  # Try up to 3 levels up
            if os.path.exists(os.path.join(test_dir, 'deduplicate_spark.py')):
                project_root = test_dir
                print(f"Method 3: Found project root (searched up directories): {project_root}")
                break
            test_dir = os.path.dirname(test_dir)
        
        # Method 4: Use current directory as fallback
        if project_root is None:
            project_root = current_dir
            print(f"Method 4: Using current directory as project root: {project_root}")

# Add project root to path if not already there
if project_root and project_root not in sys.path:
    sys.path.insert(0, project_root)
    print(f"✓ Added to Python path: {project_root}")

# Check if Spark is available
try:
    from pyspark.sql import SparkSession
    print("✓ PySpark is available")
except ImportError:
    print("✗ PySpark not found. Please ensure Spark is installed.")

# Check if our deduplication module is available
try:
    from deduplicate_spark import create_spark_session
    print("✓ deduplicate_spark module is available")
except ImportError as e:
    print("✗ deduplicate_spark module not found.")
    print(f"  Current working directory: {os.getcwd()}")
    print(f"  Project root used: {project_root}")
    print(f"  Python path (first 3): {sys.path[:3]}")
    print(f"  Error: {e}")
    print("\n  Troubleshooting:")
    print("  1. Ensure deduplicate_spark.py is in the project root")
    print("  2. Check that the file is uploaded to Cloudera AI Workbench")
    print("  3. If this notebook is in notebooks/, verify the file is one level above it")

# Check Python version
print(f"\n✓ Python version: {sys.version}")


## Create Spark Session

The `create_spark_session()` function automatically detects whether you're running in Cloudera AI Workbench and configures Spark accordingly.
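
The actual detection logic lives in `deduplicate_spark.py` and is not reproduced here. As a rough sketch of how this kind of detection could work, the hypothetical helper below checks for an environment variable and chooses Spark settings from it. The variable name `CDSW_PROJECT_ID`, the helper names `pick_spark_config` and `build_session`, and the chosen settings are all assumptions for illustration, not the lab's implementation.

```python
import os


def pick_spark_config(app_name, environ=None):
    """Choose Spark settings from the environment (hypothetical helper)."""
    environ = os.environ if environ is None else environ

    # Assumption: the workbench exposes a variable such as CDSW_PROJECT_ID.
    on_workbench = "CDSW_PROJECT_ID" in environ

    config = {"spark.app.name": app_name}
    if not on_workbench:
        # Assumption: outside the workbench, run Spark locally on all cores.
        config["spark.master"] = "local[*]"
    return config


def build_session(config):
    """Apply the chosen settings to a SparkSession builder."""
    from pyspark.sql import SparkSession  # imported lazily so the helper above works without Spark

    builder = SparkSession.builder
    for key, value in config.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()


# Inspect the settings without starting Spark
print(pick_spark_config("DeduplicationLab", environ={}))
```

Keeping the decision (`pick_spark_config`) separate from the session creation (`build_session`) makes the detection easy to check without a running cluster.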


In [None]:
from deduplicate_spark import create_spark_session

# Create Spark session (automatically configured for Cloudera)
spark = create_spark_session("DeduplicationLab")

print(f"Spark version: {spark.version}")
print(f"Spark master: {spark.sparkContext.master}")
print("✓ Spark session created successfully!")


## Generate Sample Data

Let's generate a sample dataset with duplicates to work with. We'll create a CSV file with 1000 records.
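
To give a feel for what a dataset with duplicates looks like, here is a minimal sketch that builds rows in memory and repeats some of them. It is not the logic of `generate_dataset.py`: the function names, the `duplicate_fraction` parameter, and the placeholder values are assumptions. Only the `name` and `email` columns are taken from this notebook.

```python
import csv
import io
import random


def make_rows(n_records, duplicate_fraction=0.3, seed=42):
    """Build (name, email) rows where a fraction are repeats of earlier rows."""
    rng = random.Random(seed)
    n_duplicates = round(n_records * duplicate_fraction)
    n_unique = max(1, n_records - n_duplicates)

    # Distinct base rows with placeholder values
    unique = [(f"User {i}", f"user{i}@example.com") for i in range(n_unique)]

    # Fill the remainder by re-sampling rows that already exist
    repeats = [rng.choice(unique) for _ in range(n_records - n_unique)]

    rows = unique + repeats
    rng.shuffle(rows)
    return rows


def to_csv(rows):
    """Serialize rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["name", "email"])
    writer.writerows(rows)
    return buffer.getvalue()


rows = make_rows(1000)
print(f"Total rows: {len(rows)}, distinct rows: {len(set(rows))}")
print(to_csv(rows[:3]))
```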


In [None]:
import subprocess
import sys
import os

# Create the output directory (no error if it already exists)
os.makedirs("data", exist_ok=True)

# Generate dataset with 1000 records, using the same Python interpreter as this kernel
result = subprocess.run(
    [sys.executable, "generate_dataset.py", "1000", "data/redundant_data.csv"],
    capture_output=True,
    text=True
)

if result.returncode == 0:
    print("✓ Sample data generated successfully!")
    print(result.stdout)
else:
    print("✗ Error generating data:")
    print(result.stderr)

# Check if file was created
if os.path.exists("data/redundant_data.csv"):
    file_size = os.path.getsize("data/redundant_data.csv")
    print(f"\nFile created: data/redundant_data.csv ({file_size:,} bytes)")


## Load and Inspect Data

Let's load the data into a Spark DataFrame and take a look at it.


In [None]:
# Read the CSV file
df = spark.read.csv("data/redundant_data.csv", header=True, inferSchema=True)

# Show basic information
print(f"Total records: {df.count():,}")
print(f"Columns: {', '.join(df.columns)}")
print("\nFirst 10 records:")
df.show(10, truncate=False)

# Show schema
print("\nSchema:")
df.printSchema()


## Check for Duplicates

Let's see how many duplicates we have in our dataset.
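
The arithmetic used in the next cell can be checked by hand on a tiny example first. The sketch below uses plain Python and made-up records (not rows from the generated file) to show what counts as a duplicate when the key is `name` plus `email`.

```python
# Made-up (name, email) records for illustration only
records = [
    ("Alice", "alice@example.com"),
    ("Bob", "bob@example.com"),
    ("Alice", "alice@example.com"),   # exact repeat of the first row
    ("Alice", "alice@work.example"),  # same name, different email: a distinct key
    ("Bob", "bob@example.com"),       # exact repeat of the second row
]

total_count = len(records)

# A set keeps one copy of each (name, email) pair, like select(...).distinct()
unique_count = len(set(records))

# Surplus copies beyond the first occurrence of each key
duplicates = total_count - unique_count
duplicate_rate = (duplicates / total_count * 100) if total_count > 0 else 0

print(f"Total: {total_count}, unique: {unique_count}, duplicates: {duplicates}")
print(f"Duplicate rate: {duplicate_rate:.2f}%")
```

Note that `duplicates` counts only the extra copies: a key that appears twice contributes one duplicate, not two.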


In [None]:
from pyspark.sql import functions as F

# Count total records
total_count = df.count()

# Count unique records based on name and email
unique_count = df.select("name", "email").distinct().count()

# Calculate duplicates
duplicates = total_count - unique_count
duplicate_rate = (duplicates / total_count * 100) if total_count > 0 else 0

print(f"Total records: {total_count:,}")
print(f"Unique records (by name+email): {unique_count:,}")
print(f"Duplicate records: {duplicates:,}")
print(f"Duplicate rate: {duplicate_rate:.2f}%")


## Next Steps

Now that you have your data loaded, you're ready to start the lab exercises:

1. **Exercise 1**: Basic Deduplication - `01_Basic_Deduplication.ipynb`
2. **Exercise 2**: Compare Methods - `02_Compare_Methods.ipynb`
3. **Exercise 3**: Approximate Methods - `03_Approximate_Methods.ipynb`
4. **Exercise 4**: File-Level Deduplication - `04_File_Level_Deduplication.ipynb`

## Cleanup

When you're done, stop the Spark session to release its resources:


In [None]:
# Stop Spark session when done
spark.stop()
print("✓ Spark session stopped")
