# 🎯 Interview Coding Challenge 2: Data Deduplication with Business Rules

**Difficulty:** Medium  
**Time:** 25 minutes  
**Skills:** Window functions, ranking, complex filtering, business logic

## Problem Statement

Deduplicate customer records using prioritized business rules. The input contains duplicates (several rows sharing a `customer_id`); keep exactly one row per customer, chosen by applying these rules in order:

1. **Priority 1:** Prefer `status = 'active'` over `status = 'inactive'`
2. **Priority 2:** For same status, keep most recent `updated_at`
3. **Priority 3:** For same timestamp, keep highest `confidence_score`
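
Because the rules are applied strictly in order, they behave like a lexicographic sort key: a later rule only matters when every earlier rule ties. The following is a minimal plain-Python sketch of that ordering, independent of Spark. The three records and the `rule_key` helper are made up for illustration and are not part of the challenge data.

```python
from datetime import datetime


def rule_key(record):
    """Build a tuple that compares records rule by rule; the largest key wins."""
    return (
        record["status"] == "active",  # Priority 1: True sorts above False
        record["updated_at"],          # Priority 2: later timestamps sort higher
        record["confidence_score"],    # Priority 3: higher scores sort higher
    )


# Hypothetical records for a single customer (not the challenge's sample data).
records = [
    {"status": "inactive", "updated_at": datetime(2023, 1, 2),
     "confidence_score": 0.99, "source_system": "web"},
    {"status": "active", "updated_at": datetime(2023, 1, 1),
     "confidence_score": 0.50, "source_system": "crm"},
    {"status": "active", "updated_at": datetime(2023, 1, 1),
     "confidence_score": 0.70, "source_system": "api"},
]

best = max(records, key=rule_key)
print(best["source_system"])  # api
```

The inactive record is both newer and higher-scored, yet it still loses, because status is compared first. The two active records tie on status and timestamp, so confidence decides between them.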

## Input Schema
```python
customer_schema = StructType([
    StructField("customer_id", StringType(), True),
    StructField("email", StringType(), True),
    StructField("status", StringType(), True),  # 'active', 'inactive'
    StructField("updated_at", TimestampType(), True),
    StructField("confidence_score", DoubleType(), True),
    StructField("source_system", StringType(), True)
])
```

## Expected Output
- Single deduplicated record per customer
- Audit trail showing which records were removed
- Clear explanation of business rule application

## Evaluation Criteria
- Correct business rule implementation
- Proper ranking and window function usage
- Efficient deduplication logic
- Clean, maintainable code

---

**Start coding your solution below!** 🚀

```python
# Initialize Spark Session
from pyspark.sql import SparkSession
# Note: the wildcard import below shadows Python built-ins such as sum, min and max
from pyspark.sql.functions import *
from pyspark.sql.types import *
from pyspark.sql.window import Window

spark = SparkSession.builder \
    .appName("Interview Challenge 2") \
    .getOrCreate()

print("Spark initialized successfully!")
```

```python
# Create sample customer data with duplicates
customer_data = [
    ("C001", "john@example.com", "active", "2023-01-01 10:00:00", 0.9, "crm"),
    ("C001", "john@example.com", "inactive", "2023-01-01 09:00:00", 0.8, "web"),
    ("C001", "john@example.com", "active", "2023-01-01 08:00:00", 0.7, "mobile"),
    ("C002", "jane@example.com", "active", "2023-01-01 11:00:00", 0.95, "crm"),
    ("C002", "jane@example.com", "active", "2023-01-01 10:00:00", 0.85, "web"),
    ("C002", "jane@example.com", "inactive", "2023-01-01 09:00:00", 0.6, "mobile"),
    ("C003", "bob@example.com", "active", "2023-01-01 12:00:00", 0.88, "crm"),
    ("C003", "bob@example.com", "active", "2023-01-01 12:00:00", 0.92, "api"),  # Same timestamp, higher confidence
]

# Target schema (matches the Input Schema section above)
customer_schema = StructType([
    StructField("customer_id", StringType(), True),
    StructField("email", StringType(), True),
    StructField("status", StringType(), True),
    StructField("updated_at", TimestampType(), True),
    StructField("confidence_score", DoubleType(), True),
    StructField("source_system", StringType(), True)
])

# Create DataFrame. The sample rows carry updated_at as strings, so only the
# column names are taken from the schema and updated_at is cast afterwards.
customers_df = spark.createDataFrame(customer_data, customer_schema.fieldNames())
customers_df = customers_df.withColumn("updated_at", to_timestamp("updated_at"))

print("Original customer data with duplicates:")
customers_df.orderBy("customer_id", "updated_at").show()
```

## Your Solution

**Implement the deduplication logic with business rules:**

1. Create ranking logic: active > inactive, then by recency, then by confidence
2. Apply window function to rank records within each customer group
3. Keep only the top-ranked (best) record per customer
4. Create audit trail showing what was removed

**Business Rules Priority:**
- Status: 'active' (priority 1) > 'inactive' (priority 2)
- Within same status: Most recent updated_at
- Within same timestamp: Highest confidence_score

**Hints:**
- Use `when().otherwise()` for status priority mapping
- Use `Window.partitionBy().orderBy()` with multiple criteria
- Use `row_number()` to assign ranks
- Create audit trail by filtering rank > 1

```python
# YOUR SOLUTION HERE
# Implement the deduplication functions

def deduplicate_customers(df):
    """
    Remove duplicates based on business rules:
    1. Prefer 'active' over 'inactive' status
    2. Then by most recent updated_at
    3. Then by highest confidence_score
    
    Args:
        df: DataFrame with customer data
    
    Returns:
        DataFrame: Deduplicated customers (one per customer_id)
    """
    
    # Step 1: Define ranking logic
    # Hint: Map status to priority numbers, then order by multiple criteria
    
    # YOUR CODE HERE
    
    # Step 2: Create window specification for ranking
    # Hint: Partition by customer_id, order by status_priority, updated_at desc, confidence_score desc
    
    # YOUR CODE HERE
    
    # Step 3: Apply ranking and filter
    # Hint: Use row_number() and filter for rank = 1
    
    # YOUR CODE HERE
    
    return deduplicated_df

def create_audit_trail(original_df, deduplicated_df):
    """
    Create audit trail showing which records were removed
    
    Args:
        original_df: Original DataFrame with duplicates
        deduplicated_df: Deduplicated DataFrame
    
    Returns:
        DataFrame: Audit trail of removed records
    """
    
    # Step 1: Find records that were removed
    # Hint: Use left_anti join or exceptAll
    
    # YOUR CODE HERE
    
    # Step 2: Group by customer to show summary
    # Hint: Aggregate to show count of removed records per customer
    
    # YOUR CODE HERE
    
    return audit_summary

# Test your solution
try:
    deduplicated = deduplicate_customers(customers_df)
    audit_trail = create_audit_trail(customers_df, deduplicated)
    
    print("✅ Deduplicated customers:")
    deduplicated.orderBy("customer_id").show()
    
    print("\n✅ Audit trail (records removed):")
    audit_trail.show()
    
    print(f"\n📊 Summary: {customers_df.count()} original records → {deduplicated.count()} deduplicated records")
    
except Exception as e:
    print(f"❌ Error in your solution: {e}")
    print("Keep working on your implementation!")
```

## Expected Solution Output

**Deduplicated customers:**
```
+-----------+----------------+------+-------------------+----------------+-------------+
|customer_id|email           |status|updated_at         |confidence_score|source_system|
+-----------+----------------+------+-------------------+----------------+-------------+
|C001       |john@example.com|active|2023-01-01 10:00:00|0.9             |crm          |
|C002       |jane@example.com|active|2023-01-01 11:00:00|0.95            |crm          |
|C003       |bob@example.com |active|2023-01-01 12:00:00|0.92            |api          |
+-----------+----------------+------+-------------------+----------------+-------------+
```

**Audit trail:**
```
+-----------+---------------+-----------------+
|customer_id|records_removed|removal_reason   |
+-----------+---------------+-----------------+
|C001       |2              |Duplicate removal|
|C002       |2              |Duplicate removal|
|C003       |1              |Duplicate removal|
+-----------+---------------+-----------------+
```

---

**Excellent work on this deduplication challenge!** 🎉

**Key concepts tested:**
- Complex business rule implementation
- Window functions and ranking
- Data quality and deduplication logic
- Audit trail creation for compliance

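**Business rules correctly applied:**
- C001: Kept the most recent active record (10:00, `crm`); the inactive record loses on status and the older active record loses on recency
- C002: Kept the most recent active record (11:00, `crm`)
- C003: Kept the higher-confidence record (0.92, `api`), since both records share the same status and timestamp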
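
The per-customer reasoning above can also be derived mechanically: for each removed record, the first rule on which it differs from the kept record is the rule it lost on. The following plain-Python sketch illustrates this with values taken from the sample data. The `removal_reason` helper and its labels are hypothetical, and it assumes `kept` really is the winning record.

```python
def removal_reason(kept, removed):
    """Name the first rule on which `removed` differs from the winning `kept` record."""
    if kept["status"] != removed["status"]:
        return "status"
    if kept["updated_at"] != removed["updated_at"]:
        return "recency"
    if kept["confidence_score"] != removed["confidence_score"]:
        return "confidence"
    return "exact tie"


# Kept and removed records for C001 and C003, copied from the sample data.
kept_c001 = {"status": "active", "updated_at": "2023-01-01 10:00:00", "confidence_score": 0.9}
removed_c001_web = {"status": "inactive", "updated_at": "2023-01-01 09:00:00", "confidence_score": 0.8}
removed_c001_mobile = {"status": "active", "updated_at": "2023-01-01 08:00:00", "confidence_score": 0.7}

kept_c003 = {"status": "active", "updated_at": "2023-01-01 12:00:00", "confidence_score": 0.92}
removed_c003_crm = {"status": "active", "updated_at": "2023-01-01 12:00:00", "confidence_score": 0.88}

print(removal_reason(kept_c001, removed_c001_web))     # status
print(removal_reason(kept_c001, removed_c001_mobile))  # recency
print(removal_reason(kept_c003, removed_c003_crm))     # confidence
```

A label like this could replace the generic `Duplicate removal` text in an audit trail if a per-record explanation is wanted.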

**Ready for the next challenge?** 🚀