# Data Deduplication Lab - Getting Started

Welcome to the Data Deduplication Lab! This notebook checks your environment, generates a sample dataset with duplicates, and loads it into Spark so that you are ready for the lab exercises.

## Learning Objectives

By the end of this lab, you will:
- Understand different deduplication strategies
- Be able to process large datasets efficiently with Spark
- Know how to use approximate methods for memory-efficient operations
- Understand file-level vs record-level deduplication
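
To make the last objective concrete, here is a minimal plain-Python sketch of the two levels. It is not the lab's implementation: the helper names (`dedupe_files`, `dedupe_records`) and the toy inputs are hypothetical, and SHA-256 is just one reasonable choice of content hash. File-level deduplication compares whole files by their bytes, while record-level deduplication compares individual rows by a chosen key.

```python
import hashlib


def dedupe_files(files):
    """File-level: keep the first file seen for each distinct content hash."""
    seen = set()
    kept = {}
    for name, content in files.items():
        digest = hashlib.sha256(content).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept[name] = content
    return kept


def dedupe_records(records, key_fields):
    """Record-level: keep the first record seen for each key tuple."""
    seen = set()
    kept = []
    for record in records:
        key = tuple(record[field] for field in key_fields)
        if key not in seen:
            seen.add(key)
            kept.append(record)
    return kept


# Hypothetical toy inputs, for illustration only
files = {
    "a.csv": b"name,email\nAda,ada@example.com\n",
    "b.csv": b"name,email\nAda,ada@example.com\n",  # byte-identical to a.csv
    "c.csv": b"name,email\nBob,bob@example.com\n",
}
records = [
    {"name": "Ada", "email": "ada@example.com", "city": "London"},
    {"name": "Ada", "email": "ada@example.com", "city": "Paris"},
    {"name": "Bob", "email": "bob@example.com", "city": "Oslo"},
]

unique_files = dedupe_files(files)
unique_records = dedupe_records(records, ["name", "email"])
print(sorted(unique_files))  # ['a.csv', 'c.csv']
print(len(unique_records))   # 2
```

Note that the two Ada records differ in `city` but still collapse into one, because only the key fields are compared. Which fields form the key is a design decision that the exercises explore.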

## Prerequisites

- Access to Cloudera AI Workbench
- Basic Python knowledge
- Familiarity with data processing concepts

## Setup

First, let's verify that PySpark and the lab's `deduplicate_spark` module can be imported, and check which Python version the kernel is running.


In [None]:
# Check if Spark is available
try:
    from pyspark.sql import SparkSession
    print("✓ PySpark is available")
except ImportError:
    print("✗ PySpark not found. Please ensure Spark is installed.")

# Check if our deduplication module is available
try:
    from deduplicate_spark import create_spark_session
    print("✓ deduplicate_spark module is available")
except ImportError:
    print("✗ deduplicate_spark module not found. Please upload deduplicate_spark.py to your project.")

# Check Python version
import sys
print(f"✓ Python version: {sys.version}")


## Create Spark Session

The `create_spark_session()` function detects whether you're running in Cloudera AI Workbench and configures Spark accordingly, so the same call works both there and elsewhere.


In [None]:
from deduplicate_spark import create_spark_session

# Create Spark session (automatically configured for Cloudera)
spark = create_spark_session("DeduplicationLab")

print(f"Spark version: {spark.version}")
print(f"Spark master: {spark.sparkContext.master}")
print("✓ Spark session created successfully!")
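
The detection logic lives in `deduplicate_spark.py` and is not shown in this notebook. As a rough sketch of how such a check could work, the code below looks for platform environment variables and picks Spark options from the result. Everything here is an assumption: the marker variable names, the helper names, and the choice to fall back to `local[*]` are illustrative, and the real module may do something different.

```python
import os

# Assumed marker variables; the real module may check something else.
ASSUMED_MARKER_VARS = ("CDSW_PROJECT_ID", "CDSW_ENGINE_ID")


def running_in_workbench(env=None, marker_vars=ASSUMED_MARKER_VARS):
    """Return True if any marker variable is set to a non-empty value."""
    env = os.environ if env is None else env
    return any(env.get(name) for name in marker_vars)


def choose_spark_options(env=None):
    """Return a dict of Spark options for the detected environment (hypothetical)."""
    if running_in_workbench(env):
        # On the platform, set nothing and let its own Spark defaults apply.
        return {}
    # Elsewhere, run Spark locally on all available cores.
    return {"spark.master": "local[*]"}


# Passing a plain dict instead of os.environ makes the behaviour easy to check.
print(choose_spark_options({"CDSW_PROJECT_ID": "abc"}))  # {}
print(choose_spark_options({}))  # {'spark.master': 'local[*]'}
```

Accepting the environment as a parameter keeps the check testable without changing real environment variables.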


## Generate Sample Data

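Let's generate a sample dataset with duplicates to work with. The cell below runs `generate_dataset.py`, which must be in your project's working directory, to create a CSV file with 1000 records at `data/redundant_data.csv`.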


In [None]:
import os
import subprocess
import sys

# Make sure the output directory exists
os.makedirs("data", exist_ok=True)

# Generate a dataset with 1000 records, using the same interpreter as this kernel
result = subprocess.run(
    [sys.executable, "generate_dataset.py", "1000", "data/redundant_data.csv"],
    capture_output=True,
    text=True
)

if result.returncode == 0:
    print("✓ Sample data generated successfully!")
    print(result.stdout)
else:
    print("✗ Error generating data:")
    print(result.stderr)

# Check if file was created
if os.path.exists("data/redundant_data.csv"):
    file_size = os.path.getsize("data/redundant_data.csv")
    print(f"\nFile created: data/redundant_data.csv ({file_size:,} bytes)")


## Load and Inspect Data

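Let's load the data into a Spark DataFrame and take a look at it. With `inferSchema=True`, Spark makes an extra pass over the file to guess column types, which is fine for a dataset of this size.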


In [None]:
# Read the CSV file
df = spark.read.csv("data/redundant_data.csv", header=True, inferSchema=True)

# Show basic information
print(f"Total records: {df.count():,}")
print(f"Columns: {', '.join(df.columns)}")
print("\nFirst 10 records:")
df.show(10, truncate=False)

# Show schema
print("\nSchema:")
df.printSchema()


## Check for Duplicates

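Let's see how many duplicates we have in our dataset. Here, two records count as duplicates when they share the same `name` and `email`; all other columns are ignored. The duplicate count is the total number of records minus the number of distinct `name` and `email` pairs.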


In [None]:
# Count total records
total_count = df.count()

# Count unique records based on name and email
unique_count = df.select("name", "email").distinct().count()

# Calculate duplicates
duplicates = total_count - unique_count
duplicate_rate = (duplicates / total_count * 100) if total_count > 0 else 0

print(f"Total records: {total_count:,}")
print(f"Unique records (by name+email): {unique_count:,}")
print(f"Duplicate records: {duplicates:,}")
print(f"Duplicate rate: {duplicate_rate:.2f}%")
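
The same arithmetic can be traced on a handful of rows in plain Python. This is only an illustration with made-up keys, and `duplicate_stats` is a hypothetical helper, not part of the lab code. It also shows a distinction that is easy to miss: the "duplicate records" figure counts only the surplus copies, which is smaller than the number of rows that belong to a duplicated group.

```python
from collections import Counter


def duplicate_stats(keys):
    """Summarise duplicates in a list of hashable keys, e.g. (name, email) tuples."""
    counts = Counter(keys)
    total = len(keys)
    unique = len(counts)
    surplus = total - unique  # copies beyond the first of each key
    in_dup_groups = sum(c for c in counts.values() if c > 1)
    rate = 100 * surplus / total if total else 0.0
    return {
        "total": total,
        "unique": unique,
        "surplus": surplus,
        "in_dup_groups": in_dup_groups,
        "rate": rate,
    }


# Made-up keys: Ada appears three times, Bob and Cy once each
keys = [
    ("Ada", "ada@example.com"),
    ("Ada", "ada@example.com"),
    ("Ada", "ada@example.com"),
    ("Bob", "bob@example.com"),
    ("Cy", "cy@example.com"),
]
stats = duplicate_stats(keys)
print(stats)
```

For these five rows there are 3 unique keys, so 2 surplus copies and a 40% duplicate rate, while 3 rows sit in a duplicated group.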


## Next Steps

Now that you have your data loaded, you're ready to start the lab exercises:

1. **Exercise 1**: Basic Deduplication - `01_Basic_Deduplication.ipynb`
2. **Exercise 2**: Compare Methods - `02_Compare_Methods.ipynb`
3. **Exercise 3**: Approximate Methods - `03_Approximate_Methods.ipynb`
4. **Exercise 4**: File-Level Deduplication - `04_File_Level_Deduplication.ipynb`

## Cleanup

When you're done, remember to stop the Spark session so that its resources are released:


In [None]:
# Stop Spark session when done
spark.stop()
print("✓ Spark session stopped")
