# Exercise 4: File-Level Deduplication

## Learning Objectives

In this exercise, you will:
- Learn how to detect duplicate files by content hash
- Understand file-level vs record-level deduplication
- Calculate potential space savings

## Overview

**File-level deduplication** finds duplicate files by comparing their content, not their names. This is useful for locating redundant copies of the same file in storage systems.
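
To make the idea of a content hash concrete, here is a minimal pure-Python sketch. It is not the implementation behind `deduplicate_files`, which is not shown in this exercise; the helper names `file_sha256`, `group_by_content`, and `demo_duplicate_groups` are hypothetical, and SHA-256 is an assumed choice of hash function.

```python
import hashlib
import os
import tempfile
from collections import defaultdict


def file_sha256(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading in chunks to bound memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def group_by_content(paths):
    """Map each content digest to the sorted list of paths that share that content."""
    groups = defaultdict(list)
    for path in paths:
        groups[file_sha256(path)].append(path)
    return {digest: sorted(members) for digest, members in groups.items()}


def demo_duplicate_groups():
    """Write three small files (two identical) and return the duplicate groups by base name."""
    with tempfile.TemporaryDirectory() as tmp:
        contents = {"a.txt": b"hello", "copy_of_a.txt": b"hello", "b.txt": b"world"}
        paths = []
        for name, data in contents.items():
            path = os.path.join(tmp, name)
            with open(path, "wb") as handle:
                handle.write(data)
            paths.append(path)
        groups = group_by_content(paths)
        # Keep only digests shared by more than one file
        return sorted(
            [os.path.basename(p) for p in members]
            for members in groups.values()
            if len(members) > 1
        )


print(demo_duplicate_groups())
```

Files with different names but identical bytes end up under the same digest, which is why the file name plays no role in the comparison.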

In [None]:
from deduplicate_spark import create_spark_session, deduplicate_files
import os
import glob
import subprocess

spark = create_spark_session("Exercise4_FileDeduplication")
print("✓ Spark session created")

In [None]:
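# Generate or find duplicate files
import sys

duplicate_files_dir = "data/duplicatefiles"

# Regenerate when the directory is missing or contains no entries
if not os.path.exists(duplicate_files_dir) or not glob.glob(os.path.join(duplicate_files_dir, "*")):
    print("Generating duplicate files...")
    os.makedirs("data", exist_ok=True)
    # The positional arguments are defined by generate_duplicate_files.py.
    # sys.executable runs the generator with the same interpreter as this notebook;
    # check=True raises CalledProcessError if the generator fails.
    subprocess.run(
        [sys.executable, "generate_duplicate_files.py", "25", "0.9", duplicate_files_dir],
        check=True,
    )
    print("✓ Files generated")
else:
    print("✓ Using existing files")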

# Get all files
file_paths = glob.glob(os.path.join(duplicate_files_dir, "*"))
file_paths = [f for f in file_paths if os.path.isfile(f)]
print(f"\nFound {len(file_paths)} files to analyze")

In [None]:
# Run file-level deduplication
print("Running file-level deduplication...")
deduplicate_files(spark, file_paths, output_dir=None)
print("\n✓ Deduplication complete!")
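
The potential space savings can also be estimated by hand. The sketch below assumes the duplicates are available as lists of paths, one list per group of identical files; that input shape is an assumption, since the output format of `deduplicate_files` is not shown here. The names `potential_savings` and `demo_savings` are hypothetical.

```python
import os
import tempfile


def potential_savings(duplicate_groups):
    """Return the bytes reclaimable if exactly one copy is kept per group of identical files."""
    saved = 0
    for paths in duplicate_groups:
        if len(paths) < 2:
            continue  # a single file has no redundant copies
        # Files in a group have identical content, so they all have the same size
        size = os.path.getsize(paths[0])
        saved += size * (len(paths) - 1)
    return saved


def demo_savings():
    """Create two groups of duplicates plus one unique file and compute the savings."""
    with tempfile.TemporaryDirectory() as tmp:

        def write(name, data):
            path = os.path.join(tmp, name)
            with open(path, "wb") as handle:
                handle.write(data)
            return path

        group_a = [write(f"a{i}.bin", b"x" * 100) for i in range(3)]  # 3 copies of 100 bytes
        group_b = [write(f"b{i}.bin", b"y" * 40) for i in range(2)]   # 2 copies of 40 bytes
        unique = [write("c.bin", b"z" * 10)]                          # no duplicates
        return potential_savings([group_a, group_b, unique])


print(demo_savings())  # 2 * 100 + 1 * 40 = 240
```

Each group contributes its file size multiplied by the number of redundant copies, because one copy must always be kept.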

## Questions to Answer

1. How many duplicate files were found?
2. What is the total space that could be saved?
3. How does file deduplication differ from record deduplication?
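
As a starting point for question 3, the following sketch contrasts the two approaches on in-memory data. The two CSV payloads are made up for illustration, and `content_hash` and `records` are hypothetical helpers; this is a sketch of the general distinction, not of how `deduplicate_spark` handles either case.

```python
import hashlib

# Two hypothetical CSV files holding the same records in a different order
file_a = b"id,name\n1,Ada\n2,Grace\n"
file_b = b"id,name\n2,Grace\n1,Ada\n"


def content_hash(data):
    """Hash the raw bytes, as file-level deduplication does."""
    return hashlib.sha256(data).hexdigest()


def records(data):
    """Return the set of data rows (header skipped), as record-level deduplication would compare."""
    lines = data.decode("utf-8").splitlines()
    return set(lines[1:])


# File level: any byte difference produces a different hash
same_file = content_hash(file_a) == content_hash(file_b)

# Record level: individual rows are compared regardless of their position
shared_records = records(file_a) & records(file_b)

print(same_file)               # False
print(sorted(shared_records))  # ['1,Ada', '2,Grace']
```

The files are not duplicates at the file level, yet every record in one also appears in the other.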

In [None]:
spark.stop()
print("✓ Spark session stopped")