# Exercise 4: File-Level Deduplication

## Learning Objectives

In this exercise, you will:
- Learn how to detect duplicate files by content hash
- Understand file-level vs record-level deduplication
- Calculate potential space savings

## Overview

**File-level deduplication** finds duplicate files by comparing their content, not their names. It is useful in storage systems where the same file has been saved under several names or paths and the extra copies waste space.
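
As a minimal sketch of the idea, independent of Spark, the snippet below groups files by the SHA-256 digest of their bytes. The helper names `file_digest`, `group_by_content`, and `duplicate_groups` are hypothetical and are not part of `deduplicate_spark`; how `deduplicate_files` hashes content internally is not shown in this notebook, so treat this as an illustration under that assumption.

```python
import hashlib
import os
import tempfile
from collections import defaultdict


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def group_by_content(paths):
    """Map each digest to the sorted list of paths that share that content."""
    groups = defaultdict(list)
    for p in paths:
        groups[file_digest(p)].append(p)
    return {digest: sorted(members) for digest, members in groups.items()}


def duplicate_groups(paths):
    """Return only the groups that contain more than one file."""
    return [m for m in group_by_content(paths).values() if len(m) > 1]


# Demo on throwaway files: a.txt and b.txt share content, c.txt differs.
demo_dir = tempfile.mkdtemp()
demo_paths = []
for name, data in [("a.txt", b"hello"), ("b.txt", b"hello"), ("c.txt", b"world")]:
    path = os.path.join(demo_dir, name)
    with open(path, "wb") as f:
        f.write(data)
    demo_paths.append(path)

dupes = duplicate_groups(demo_paths)
print([[os.path.basename(p) for p in group] for group in dupes])
# [['a.txt', 'b.txt']]
```

The file names play no part in the comparison: only the bytes are hashed, which is why `a.txt` and `b.txt` end up in the same group.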

In [None]:
# Setup: Add project root to Python path
import sys
import os

# Find project root
current_dir = os.getcwd()
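# Match only when the current directory itself is named 'notebooks',
# not when 'notebooks' merely appears somewhere in the path
if os.path.basename(current_dir) == 'notebooks':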
    project_root = os.path.dirname(current_dir)
elif os.path.exists(os.path.join(current_dir, 'deduplicate_spark.py')):
    project_root = current_dir
else:
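    # Search up to five parent directories for the project marker file
    project_root = None  # initialize so the fallback below cannot raise NameError
    test_dir = current_dir
    for _ in range(5):
        if os.path.exists(os.path.join(test_dir, 'deduplicate_spark.py')):
            project_root = test_dir
            break
        parent = os.path.dirname(test_dir)
        if parent == test_dir:
            # Reached the filesystem root
            break
        test_dir = parent
    # Fall back to the current directory if the marker file was not found
    project_root = project_root or current_dir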

if project_root not in sys.path:
    sys.path.insert(0, project_root)
    print(f"✓ Added to Python path: {project_root}")

# Change to project root for file operations
os.chdir(project_root)
print(f"✓ Changed working directory to: {project_root}")


In [None]:
from deduplicate_spark import create_spark_session, deduplicate_files
import os
import glob
import subprocess

spark = create_spark_session("Exercise4_FileDeduplication")
print("✓ Spark session created")

In [None]:
# Generate or find duplicate files
duplicate_files_dir = os.path.join(project_root, "data", "duplicatefiles")
data_dir = os.path.join(project_root, "data")

if not os.path.exists(duplicate_files_dir) or len(glob.glob(os.path.join(duplicate_files_dir, "*"))) == 0:
    print("Generating duplicate files...")
    # Create data directory if needed
    if not os.path.exists(data_dir):
        os.makedirs(data_dir)
        print(f"✓ Created data directory: {data_dir}")
    
    script_path = os.path.join(project_root, "generate_duplicate_files.py")
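    # Use the notebook's own interpreter (sys is imported in the setup cell)
    # so the script runs in the same environment as this kernel
    result = subprocess.run(
        [sys.executable, script_path, "25", "0.9", duplicate_files_dir],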
        cwd=project_root,
        capture_output=True,
        text=True
    )
    if result.returncode == 0:
        print("✓ Files generated")
    else:
        print(f"✗ Error generating files: {result.stderr}")
else:
    print("✓ Using existing files")

# Get all files
file_paths = glob.glob(os.path.join(duplicate_files_dir, "*"))
file_paths = [f for f in file_paths if os.path.isfile(f)]
print(f"\nFound {len(file_paths)} files to analyze")

In [None]:
# Run file-level deduplication
print("Running file-level deduplication...")
deduplicate_files(spark, file_paths, output_dir=None)
print("\n✓ Deduplication complete!")
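
To cross-check whatever totals the run reports, the potential savings can also be computed by hand: within each group of identical files, keep one copy and count the sizes of the rest as reclaimable. The sketch below uses a hypothetical helper, `potential_savings`, that is not part of `deduplicate_spark`; whether `deduplicate_files` defines savings the same way is an assumption.

```python
import hashlib
import os
import tempfile
from collections import defaultdict


def potential_savings(paths):
    """Return (duplicate_file_count, reclaimable_bytes) for the given paths."""
    groups = defaultdict(list)
    for p in paths:
        # Reads each file fully; fine for small exercise files
        with open(p, "rb") as f:
            digest = hashlib.sha256(f.read()).hexdigest()
        groups[digest].append(p)

    dup_count = 0
    saved = 0
    for members in groups.values():
        extras = members[1:]  # keep the first copy, count the rest
        dup_count += len(extras)
        saved += sum(os.path.getsize(p) for p in extras)
    return dup_count, saved


# Demo on throwaway files: three 10-byte copies and one unique 4-byte file.
demo_dir = tempfile.mkdtemp()
demo_paths = []
for name, data in [("a.bin", b"0123456789"), ("b.bin", b"0123456789"),
                   ("c.bin", b"0123456789"), ("d.bin", b"abcd")]:
    path = os.path.join(demo_dir, name)
    with open(path, "wb") as f:
        f.write(data)
    demo_paths.append(path)

dup_count, saved_bytes = potential_savings(demo_paths)
print(dup_count, saved_bytes)  # 2 20
```

In this notebook, calling `potential_savings(file_paths)` would apply the same calculation to the files collected earlier.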

## Questions to Answer

1. How many duplicate files were found?
2. What is the total space that could be saved?
3. How does file deduplication differ from record deduplication?
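
As a starting point for question 3, the sketch below contrasts the two views on a pair of hypothetical in-memory CSV payloads (the data is invented for illustration). File-level deduplication compares whole byte streams, while record-level deduplication compares the individual rows inside them.

```python
import hashlib


def digest(data: bytes) -> str:
    """Return the SHA-256 hex digest of a byte string."""
    return hashlib.sha256(data).hexdigest()


# Two hypothetical CSV payloads holding the same rows in a different order.
file_a = b"1,alice\n2,bob\n"
file_b = b"2,bob\n1,alice\n"

# File-level view: the byte streams differ, so the files are not duplicates.
same_file = digest(file_a) == digest(file_b)

# Record-level view: every row of file_b already appears in file_a.
records_a = set(file_a.splitlines())
records_b = set(file_b.splitlines())
shared_records = records_a & records_b

print(same_file)            # False
print(len(shared_records))  # 2
```

Two files can therefore be unique at the file level while containing nothing but duplicate records, and the two techniques answer different questions about the same data.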

In [None]:
spark.stop()
print("✓ Spark session stopped")