# Interview Challenge 13: Multi-Source Data Pipeline & Cleaning

## Problem Statement

Build a comprehensive ETL pipeline that reads data from multiple sources (CSV, JSON Lines, TSV, and a custom delimited log format), performs extensive data cleaning and transformation, and produces a unified, clean dataset ready for analytics.

## Data Sources

**Source 1 - Customer CSV:**
- Format: CSV with headers
- Issues: Missing values, inconsistent formats, duplicates
- Fields: customer_id, name, email, phone, address, signup_date
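
As a rough illustration of the cleaning this source calls for, the plain-Python sketch below parses CSV lines with the standard `csv` module, maps empty strings to `None`, drops exact duplicate rows, and flags phone numbers that do not match an expected pattern. The names `clean_customer_rows` and `phone_valid`, and the `NNN-NNNN` phone pattern, are assumptions made for this example and are not part of the challenge. In Spark the same ideas would more likely be expressed with `dropDuplicates` and column expressions.

```python
import csv
import re

# Assumed phone pattern for illustration only; it matches values such as "555-1234".
PHONE_RE = re.compile(r"^\d{3}-\d{4}$")


def _normalize(value):
    """Strip whitespace and map empty or missing values to None."""
    if isinstance(value, str):
        value = value.strip()
        return value or None
    return None


def clean_customer_rows(lines):
    """Parse CSV lines (header first), drop exact duplicates, flag bad phones."""
    cleaned, seen = [], set()
    for row in csv.DictReader(lines):
        record = {key: _normalize(value) for key, value in row.items()}
        fingerprint = tuple(record.items())
        if fingerprint in seen:
            continue  # exact duplicate of an earlier row
        seen.add(fingerprint)
        phone = record.get("phone")
        record["phone_valid"] = bool(phone and PHONE_RE.match(phone))
        cleaned.append(record)
    return cleaned


sample = [
    "customer_id,name,email,phone,address,signup_date",
    "CUST001,John Doe,john@email.com,555-1234,123 Main St,2023-01-15",
    "CUST002,Jane Smith,,555-5678,456 Oak Ave,2023-02-20",
    "CUST003,Bob Johnson,bob@email.com,INVALID,789 Pine St,2023-03-10",
    "CUST001,John Doe,john@email.com,555-1234,123 Main St,2023-01-15",
]
customers = clean_customer_rows(sample)
```

Flagging invalid phones instead of dropping the row keeps the record available for the quality report; whether to drop, repair, or flag is a business decision the challenge leaves open.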

**Source 2 - Orders JSON:**
- Format: JSON Lines (one JSON object per line)
- Issues: Nested structures, missing fields, invalid dates
- Fields: order_id, customer_id, items[], order_date, total_amount, status
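
One possible way to handle the nested `items[]` array, missing fields, and invalid dates is sketched below in plain Python. The helper names `parse_date` and `flatten_order`, the `YYYY-MM-DD` date format, and the choice to keep item-less orders as a single row are assumptions for illustration. With Spark DataFrames, `spark.read.json` combined with `explode_outer` would be a closer fit.

```python
import json
from datetime import datetime


def parse_date(value, fmt="%Y-%m-%d"):
    """Return a date for a well-formed string, or None for missing/invalid input."""
    if not isinstance(value, str):
        return None
    try:
        return datetime.strptime(value, fmt).date()
    except ValueError:
        return None


def flatten_order(line):
    """Turn one JSON Lines record into flat rows, one per item.

    Returns (rows, error). rows is empty and error is set when the line
    cannot be parsed into a JSON object.
    """
    try:
        order = json.loads(line)
    except json.JSONDecodeError as exc:
        return [], f"bad JSON: {exc.msg}"
    if not isinstance(order, dict):
        return [], "not a JSON object"
    base = {
        "order_id": order.get("order_id"),
        "customer_id": order.get("customer_id"),
        "order_date": parse_date(order.get("order_date")),
        "total_amount": order.get("total_amount"),
        "status": order.get("status"),
    }
    items = order.get("items")
    if not isinstance(items, list) or not items:
        items = [None]  # keep orders without items as a single row
    rows = []
    for item in items:
        item = item if isinstance(item, dict) else {}
        rows.append({**base,
                     "product_id": item.get("product_id"),
                     "quantity": item.get("quantity")})
    return rows, None


rows, error = flatten_order(
    '{"order_id": "ORD004", "customer_id": "CUST001", '
    '"items": [{"product_id": "PROD003", "quantity": 1}], "total_amount": 199.99}'
)
bad_rows, bad_error = flatten_order("not json at all")
```

Returning an error string alongside the rows lets the caller route unparseable lines to a reject set instead of failing the whole load.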

**Source 3 - Product Database Export (TSV):**
- Format: Tab-separated values
- Issues: Special characters, encoding problems, malformed rows
- Fields: product_id, name, category, price, description, created_date
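
The sketch below shows one way to guard against malformed rows and undecodable bytes when parsing tab-separated lines by hand. It assumes UTF-8 input and the six columns listed above; `decode_line` and `parse_product_line` are hypothetical helper names. When reading files with Spark, the CSV reader's `sep` and `mode` options cover similar ground.

```python
EXPECTED_COLUMNS = ["product_id", "name", "category", "price",
                    "description", "created_date"]


def decode_line(raw, encoding="utf-8"):
    """Decode bytes, replacing undecodable bytes instead of raising."""
    if isinstance(raw, bytes):
        return raw.decode(encoding, errors="replace")
    return raw


def parse_product_line(raw):
    """Return (record, None) for a well-formed row, or (None, reason) otherwise."""
    line = decode_line(raw).rstrip("\r\n")
    fields = line.split("\t")
    if len(fields) != len(EXPECTED_COLUMNS):
        return None, f"expected {len(EXPECTED_COLUMNS)} fields, got {len(fields)}"
    record = dict(zip(EXPECTED_COLUMNS, (field.strip() for field in fields)))
    try:
        record["price"] = float(record["price"])
    except ValueError:
        return None, f"invalid price: {record['price']!r}"
    return record, None


good, good_err = parse_product_line(
    "PROD004\tCoffee Mug\tKitchen\t12.99\tCeramic coffee mug\t2023-02-15"
)
short, short_err = parse_product_line("PROD005\tBroken row")
garbled, garbled_err = parse_product_line(
    b"PROD006\tCaf\xff\tKitchen\t3.50\tSmall cup\t2023-03-01"
)
```

Replacing bad bytes keeps the row but leaves a visible U+FFFD marker in the text, which a later quality check can count.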

**Source 4 - User Activity Logs (Custom Format):**
- Format: Custom delimited log format
- Issues: Irregular delimiters, mixed data types, corrupted entries
- Fields: timestamp, user_id, action, page, session_id, metadata
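
A line parser for this source might look like the sketch below. It assumes the pipe-delimited layout and JSON metadata used by the sample log entries in this challenge; `parse_log_line` is a hypothetical name, and returning `None` for corrupted entries is one convention among several.

```python
import json
from datetime import datetime

LOG_FIELDS = ["timestamp", "user_id", "action", "page", "session_id", "metadata"]


def parse_log_line(line):
    """Parse one pipe-delimited log line; return None when the entry is corrupted."""
    # Split on the first five pipes only, so pipes inside the metadata survive.
    parts = line.strip().split("|", 5)
    if len(parts) != len(LOG_FIELDS):
        return None
    record = dict(zip(LOG_FIELDS, parts))
    try:
        record["timestamp"] = datetime.strptime(record["timestamp"],
                                                "%Y-%m-%d %H:%M:%S")
        record["metadata"] = json.loads(record["metadata"])
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return None
    return record


parsed = parse_log_line(
    '2023-01-15 10:30:45|CUST001|VIEW|home|sess_abc123|{"page_load": 2.1}'
)
rejected = parse_log_line("invalid log entry with wrong format")
```

A function of this shape can be applied to an RDD of raw lines with `map`, followed by a `filter` that separates parsed records from rejected ones.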

## Tasks

1. **Multi-Source Data Reading**
   - Read data from CSV, JSON, TSV, and custom formats
   - Handle different schemas and data types
   - Implement robust error handling for malformed data
   - Use appropriate Spark readers for each format

2. **Data Cleaning & Standardization**
   - Remove duplicates and handle missing values
   - Standardize formats (dates, phone numbers, addresses)
   - Validate data integrity and business rules
   - Handle encoding issues and special characters

3. **Data Transformation**
   - Flatten nested JSON structures
   - Normalize and denormalize data as needed
   - Create derived features and calculated fields
   - Implement complex business logic transformations

4. **Data Quality Validation**
   - Implement comprehensive data quality checks
   - Generate data quality reports
   - Flag suspicious or outlier data
   - Create data profiling summaries

5. **Unified Output**
   - Merge data from all sources into a consistent schema
   - Create master tables with proper relationships
   - Optimize output format for downstream analytics
   - Generate data dictionary and documentation
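
The profiling summaries mentioned under Data Quality Validation can start as simple per-column null counts. The sketch below works on a list of dictionaries and treats both `None` and the empty string as missing; `profile_records` and the report keys are hypothetical names chosen for this example.

```python
def profile_records(records, columns):
    """Return per-column null counts and completeness ratios for a list of dicts."""
    total = len(records)
    report = {}
    for column in columns:
        nulls = sum(1 for record in records if record.get(column) in (None, ""))
        report[column] = {
            "null_count": nulls,
            # Completeness is the share of rows with a usable value.
            "completeness": (total - nulls) / total if total else 0.0,
        }
    return report


records = [
    {"customer_id": "CUST002", "email": "", "address": "456 Oak Ave"},
    {"customer_id": "CUST004", "email": "alice@email.com", "address": None},
]
report = profile_records(records, ["customer_id", "email", "address"])
```

Running the same profile before and after cleaning gives a quick measure of how much each step improved the data.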

## Technical Requirements
- Handle multiple file formats and schemas
- Implement robust error handling and recovery
- Use appropriate data validation and cleaning techniques
- Optimize for performance with large datasets
- Create reusable and maintainable pipeline code
- Include comprehensive logging and monitoring

## 🚀 Try It Yourself

Build a production-ready ETL pipeline that handles multiple data sources and extensive cleaning. Start by reading each data source, then implement cleaning and transformation logic.

**Steps to follow:**
1. Set up readers for each data source format
2. Implement data cleaning functions for each source
3. Create transformation logic to standardize data
4. Merge and validate the unified dataset
5. Generate quality reports and output optimized results
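
For step 4, one useful validation before merging is a referential integrity check: orders that reference a customer missing from the customer table should be set aside, not silently dropped by an inner join. The sketch below is a plain-Python illustration with a hypothetical helper name, `split_orphans`; in Spark a `left_anti` join expresses the same idea.

```python
def split_orphans(orders, customers):
    """Separate orders whose customer_id has no matching customer record."""
    known_ids = {customer["customer_id"] for customer in customers}
    matched, orphans = [], []
    for order in orders:
        if order.get("customer_id") in known_ids:
            matched.append(order)
        else:
            orphans.append(order)
    return matched, orphans


customers = [{"customer_id": "CUST001"}, {"customer_id": "CUST002"}]
orders = [
    {"order_id": "ORD001", "customer_id": "CUST001"},
    {"order_id": "ORD002", "customer_id": "CUST002"},
    {"order_id": "ORD003", "customer_id": "CUST003"},  # no such customer above
]
matched, orphans = split_orphans(orders, customers)
```

The orphan list can feed directly into the quality report from step 5.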

**Tip:** Focus on error handling: real-world data is messy and requires robust processing.

In [None]:
# Import required libraries
from pyspark.sql import SparkSession
# Note: this wildcard import shadows Python built-ins such as sum, min, and max
from pyspark.sql.functions import *
from pyspark.sql.types import *
import json
import re
from datetime import datetime

# Create Spark session
spark = SparkSession.builder \
    .appName("MultiSourceDataPipeline") \
    .config("spark.sql.adaptive.enabled", "true") \
    .getOrCreate()

# Sample data (in production, these would be file paths)
customers_csv_data = [
    "customer_id,name,email,phone,address,signup_date",
    "CUST001,John Doe,john@email.com,555-1234,123 Main St,2023-01-15",
    "CUST002,Jane Smith,,555-5678,456 Oak Ave,2023-02-20",
    "CUST003,Bob Johnson,bob@email.com,INVALID,789 Pine St,2023-03-10",
    "CUST001,John Doe,john@email.com,555-1234,123 Main St,2023-01-15",  # Duplicate
    "CUST004,Alice Brown,alice@email.com,555-9012,,2023-04-05"  # Missing address
]

orders_json_data = [
    '{"order_id": "ORD001", "customer_id": "CUST001", "items": [{"product_id": "PROD001", "quantity": 2}], "order_date": "2023-01-15", "total_amount": 299.98, "status": "completed"}',
    '{"order_id": "ORD002", "customer_id": "CUST002", "items": [{"product_id": "PROD002", "quantity": 1}], "order_date": "2023-01-20", "total_amount": 149.99, "status": "pending"}',
    '{"order_id": "ORD003", "customer_id": "CUST003", "items": [], "order_date": "2023-01-25", "total_amount": 0, "status": "cancelled"}',  # Empty items
    '{"order_id": "ORD004", "customer_id": "CUST001", "items": [{"product_id": "PROD003", "quantity": 1}], "total_amount": 199.99}'  # Missing fields
]

products_tsv_data = [
    "product_id\tname\tcategory\tprice\tdescription\tcreated_date",
    "PROD001\tLaptop Pro\tElectronics\t299.99\tHigh-performance laptop\t2023-01-01",
    "PROD002\tWireless Headphones\tElectronics\t149.99\tNoise-cancelling wireless\t2023-01-15",
    "PROD003\tOffice Chair\tFurniture\t199.99\tErgonomic office chair\t2023-02-01",
    "PROD004\tCoffee Mug\tKitchen\t12.99\tCeramic coffee mug\t2023-02-15"
]

activity_logs_data = [
    "2023-01-15 10:30:45|CUST001|VIEW|home|sess_abc123|{\"page_load\": 2.1}",
    "2023-01-15 10:31:12|CUST001|CLICK|products|sess_abc123|{\"element\": \"add_to_cart\"}",
    "2023-01-15 10:32:01|CUST002|LOGIN|login|sess_def456|{\"login_method\": \"email\"}",
    "2023-01-15 10:32:30|CUST001|PURCHASE|checkout|sess_abc123|{\"payment\": \"credit_card\"}",
    "invalid log entry with wrong format",
    "2023-01-15 10:33:15|CUST003|SEARCH|products|sess_ghi789|{\"query\": \"laptop\"}"
]

# Helper function to create RDD from sample data
def create_sample_rdd(data_list):
    """Create RDD from sample data list"""
    return spark.sparkContext.parallelize(data_list)

print("Data sources prepared!")
print(f"Customers CSV: {len(customers_csv_data)} lines")
print(f"Orders JSON: {len(orders_json_data)} records")
print(f"Products TSV: {len(products_tsv_data)} lines")
print(f"Activity logs: {len(activity_logs_data)} entries")

# === YOUR SOLUTION GOES HERE ===
# Implement multi-source ETL pipeline

# Task 1: Read data from multiple sources
# 1a. Read CSV data with proper options
# 1b. Parse JSON data with error handling
# 1c. Read TSV data with custom separator
# 1d. Parse custom log format

# Task 2: Data Cleaning & Standardization
# 2a. Clean customer data (duplicates, missing values, format validation)
# 2b. Clean orders data (nested JSON, missing fields, validation)
# 2c. Clean products data (encoding, special characters)
# 2d. Clean activity logs (parsing, validation)

# Task 3: Data Transformation
# 3a. Flatten JSON structures
# 3b. Normalize data formats
# 3c. Create derived features
# 3d. Implement business logic transformations

# Task 4: Data Quality Validation
# 4a. Implement quality checks
# 4b. Generate quality reports
# 4c. Flag data quality issues

# Task 5: Create Unified Output
# 5a. Merge all sources into master tables
# 5b. Create relationships and foreign keys
# 5c. Optimize output for analytics

print("Implement your multi-source ETL pipeline above!")
