# Data Cleaning: T-ECD E-Commerce Transaction Dataset

## Project Context

Following the exploratory data analysis, this notebook presents the data cleaning procedures applied to the T-ECD (Transactional E-Commerce Dataset). The focus is on addressing identified data quality issues to prepare the dataset for downstream analysis and modeling.

## Methodological Approach

Findings are grouped into three categories:

- **Confirmed errors**: Issues verified with direct evidence (missing value counts, duplicate records, schema errors)
- **Imputation decisions**: Choices made where complete information was unavailable (e.g., handling missing values)
- **Remaining uncertainties**: Aspects requiring further investigation (e.g., underlying causes of data missingness)

## Dataset Summary (as of 2025-12-21)

The analysis identified the following data quality issues across tables:

| Table | Rows | Issues Identified |
|-------|------|-----------------|
| Users | 3.5M | Missing demographics (1.68% region, 0.15% cluster) |
| Brands | 24.5K | Corrupted embeddings, 46 duplicate records |
| Retail Items | 250K | Missing categories (3.83%), prices (10.59%) |
| Retail Events | 4.1M | No quality issues detected |
| Marketplace Items | 2.3M | Corrupted embeddings, extensive missing values |
| Marketplace Events | 5.1M | Missing subdomain (0.04%) |
| Offers Items | 22K | Missing brand_id (2.42%) |
| Offers Events | 30.5M | No quality issues detected |
| Reviews | 20.5K | Corrupted embeddings |
| Payments Events | 68.9M | Missing brand_id (48.36%), price (<0.01%) |
| Payments Receipts | 60.8M | Missing brand_id (85.80%), price (1.42%) |

In [1]:
# Standard library
import sys
import io

# Data manipulation
import pandas as pd
import numpy as np

# File operations
import os
from collections import defaultdict

# Data loading
from huggingface_hub import hf_hub_download, list_repo_files

# Suppress warnings for cleaner output
import warnings
warnings.filterwarnings('ignore')

## Configuration

In [2]:
# Dataset repository
REPO_ID = "t-tech/T-ECD"
REPO_TYPE = "dataset"

# Local paths
CACHE_DIR = "dataset_cache"
OUTPUT_DIR = "cleaned_data"
DATASET_PATH_SMALL = "dataset/small"
DATASET_PATH_FULL = "dataset/full"

# Partition limits to manage memory
# I set these conservatively to ensure the notebook runs on typical hardware
DATASET_SMALL_NUM_PARTITIONS_TO_LOAD = 10
DATASET_FULL_NUM_PARTITIONS_TO_LOAD = 1

# Configuration for alignment
TARGET_PARTITION_ID = "01082"  # Matches the first partition of the small dataset

# Ensure output directories exist
os.makedirs(CACHE_DIR, exist_ok=True)
os.makedirs(OUTPUT_DIR, exist_ok=True)

## Helper Functions

The helper functions below standardize the cleaning workflow. They handle:
1. **Loading**: Parquet loading with fallback logic for schema errors
2. **Evidence display**: Showing concrete examples of data quality issues before cleaning
3. **Validation**: Quantifying the impact of cleaning operations
4. **Persistence**: Saving cleaned datasets for downstream use

In [3]:
def load_remote_parquet_safe(filename, columns_to_exclude=None):
    """
    Loads a parquet file from Hugging Face, handling schema errors gracefully.
    
    Args:
        filename: Path to parquet file in the repo
        columns_to_exclude: List of columns to skip (e.g., corrupted embeddings)
    
    Returns:
        DataFrame or None if loading fails
    """
    print(f"Loading {filename}...")
    try:
        local_path = hf_hub_download(
            repo_id=REPO_ID,
            filename=filename,
            repo_type=REPO_TYPE,
            local_dir=CACHE_DIR,
            local_dir_use_symlinks=False
        )
        
        if columns_to_exclude:
            import pyarrow.parquet as pq
            schema = pq.read_schema(local_path)
            all_cols = [name for name in schema.names if not name.startswith('__')]
            use_cols = [c for c in all_cols if c not in columns_to_exclude]
            print(f"  Excluding columns: {columns_to_exclude}")
            print(f"  Reading columns: {use_cols}")
            df = pd.read_parquet(local_path, columns=use_cols)
            return df
        
        # Try loading all columns
        try:
            df = pd.read_parquet(local_path)
            return df
        except Exception as e:
            print(f"  Standard load failed: {e}")
            import pyarrow.parquet as pq
            schema = pq.read_schema(local_path)
            all_cols = [name for name in schema.names if not name.startswith('__')]
            if 'embedding' in all_cols:
                print(f"  Retrying without 'embedding' column...")
                use_cols = [c for c in all_cols if c != 'embedding']
                df = pd.read_parquet(local_path, columns=use_cols)
                print("  Success (with exclusions).")
                return df
            else:
                raise
    except Exception as e:
        print(f"Error downloading/loading {filename}: {e}")
        return None


def load_dataframe_from_partitions_safe(file_list, limit=None, columns_to_exclude=None, match_term=None):
    """
    Loads and concatenates multiple parquet partition files.
    
    Args:
        file_list: List of partition file paths
        limit: Maximum number of partitions to load (None = all)
        columns_to_exclude: Columns to skip in all partitions
        match_term: Optional term to filter partitions by
    
    Returns:
        Concatenated DataFrame or None
    """
    if not file_list:
        print("No files to load.")
        return None
    
    # 1. Filter by specific partition ID if requested (e.g., "01082")
    if match_term:
        original_count = len(file_list)
        file_list = [f for f in file_list if match_term in f]
        print(f"Filter: Found {len(file_list)} files matching '{match_term}' (out of {original_count})")
        
        if not file_list:
            print(f"WARNING: No files matched '{match_term}'. Returning None.")
            return None

    # 2. Apply limit (if any)
    files_to_load = file_list[:limit] if limit else file_list
    print(f"Loading {len(files_to_load)} partitions...")

    dfs = []
    for f in files_to_load:
        df = load_remote_parquet_safe(f, columns_to_exclude=columns_to_exclude)
        if df is not None:
            dfs.append(df)

    if not dfs:
        print("No dataframes loaded successfully.")
        return None

    print("Concatenating partitions...")
    full_df = pd.concat(dfs, ignore_index=True)
    return full_df


def show_evidence(df, name, columns_to_check=None):
    """
    Display evidence of data quality issues before cleaning.
    
    Args:
        df: DataFrame to examine
        name: Dataset name for reporting
        columns_to_check: Specific columns to highlight (None = all)
    
    """
    print(f"\n{'='*80}")
    print(f"EVIDENCE: {name}")
    print(f"{'='*80}")
    
    print(f"\nShape: {df.shape}")
    
    # Missing values
    missing = df.isnull().sum()
    missing_pct = (df.isnull().sum() / len(df) * 100).round(2)
    
    if missing.sum() > 0:
        print(f"\n Missing Values Found:")
        missing_df = pd.DataFrame({
            'Column': missing.index,
            'Missing Count': missing.values,
            'Missing %': missing_pct.values
        })
        missing_df = missing_df[missing_df['Missing Count'] > 0].sort_values('Missing Count', ascending=False)
        print(missing_df.to_string(index=False))
        
        # Show sample rows with missing values for key columns
        if columns_to_check:
            for col in columns_to_check:
                if col in df.columns and df[col].isnull().sum() > 0:
                    print(f"\n  Sample rows with missing '{col}':")
                    print(df[df[col].isnull()].head(2).to_string())
    else:
        print(f"\nNo missing values detected")
    
    # Duplicates (skip unhashable columns)
    try:
        hashable_cols = [c for c in df.columns 
                        if df[c].dtype != 'object' or 
                        not df[c].apply(lambda x: isinstance(x, (list, np.ndarray))).any()]
        dupes = df[hashable_cols].duplicated().sum()
        if dupes > 0:
            print(f"\nDuplicates Found: {dupes} rows")
        else:
            print(f"\nNo duplicates detected")
    except Exception:
        print(f"\nDuplicate check skipped (unhashable columns)")
    
    print(f"\n{'='*80}\n")


def validate_cleaning(df_before, df_after, name):
    """
    Validate the impact of cleaning operations.
    
    Args:
        df_before: DataFrame before cleaning
        df_after: DataFrame after cleaning
        name: Dataset name for reporting
    
    """
    print(f"\n{'='*80}")
    print(f"VALIDATION: {name}")
    print(f"{'='*80}")
    
    rows_dropped = len(df_before) - len(df_after)
    print(f"\nRows: {len(df_before):,} → {len(df_after):,} (Dropped: {rows_dropped:,})")
    
    missing_before = df_before.isnull().sum().sum()
    missing_after = df_after.isnull().sum().sum()
    print(f"Missing Values: {missing_before:,} → {missing_after:,}")
    
    # Duplicates
    try:
        hashable_cols_before = [c for c in df_before.columns 
                               if df_before[c].dtype != 'object' or 
                               not df_before[c].apply(lambda x: isinstance(x, (list, np.ndarray))).any()]
        hashable_cols_after = [c for c in df_after.columns 
                              if df_after[c].dtype != 'object' or 
                              not df_after[c].apply(lambda x: isinstance(x, (list, np.ndarray))).any()]
        
        dupes_before = df_before[hashable_cols_before].duplicated().sum()
        dupes_after = df_after[hashable_cols_after].duplicated().sum()
        print(f"Duplicates: {dupes_before:,} → {dupes_after:,}")
    except Exception:
        print(f"Duplicates: (skipped - unhashable columns)")
    
    if missing_after == 0 and rows_dropped >= 0:
        print(f"\nCLEANING SUCCESSFUL")
    else:
        print(f"\nWARNING: Review cleaning results")
    
    print(f"\n{'='*80}\n")


def save_cleaned_data(df, name):
    """
    Save cleaned dataset to parquet file.
    
    Args:
        df: Cleaned DataFrame
        name: Dataset name for filename
    """
    output_file = os.path.join(OUTPUT_DIR, f"{name.lower().replace(' ', '_')}_clean.parquet")
    df.to_parquet(output_file, index=False)
    print(f"Saved: {output_file} ({len(df):,} rows)")

## Indexing the Dataset

In [4]:
print("Indexing remote dataset files...")
all_files = list_repo_files(repo_id=REPO_ID, repo_type=REPO_TYPE)
dataset_files = defaultdict(list)

for f in all_files:
    if f.endswith(".pq"):
        dirname = os.path.dirname(f).replace("\\", "/")
        dataset_files[dirname].append(f)

print(f"Indexed {len(dataset_files)} directories containing {sum(len(files) for files in dataset_files.values())} parquet files")

Indexing remote dataset files...
Indexed 18 directories containing 6869 parquet files


---

## 1. Users Table

### Evidence from Analysis

In `analysis.ipynb`:
- `socdem_cluster`: 5,153 missing values (0.15%)
- `region`: 58,917 missing values (1.68%)
- No duplicate records

The missing values are legitimate gaps in the demographic data—some users have unknown sociodemographic clusters or regions.

### Decision: Imputation Strategy

- Dropping the 58,917 users with a missing region (1.68% of the base) was rejected because it would break foreign key relationships and discard behavioral data.
- Imputing with the statistical mode was rejected due to potential bias: missing data may be non-random and represent systematically different users.
- Instead, we impute missing values with a sentinel value of -1, which preserves all records, maintains referential integrity, and lets downstream analyses handle or exclude unknown demographics. The sentinel is safe because all original demographic codes are non-negative.
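
The sentinel approach can be sketched on toy data. The frame below is illustrative only (its values are not taken from T-ECD), and the cast to `int64` is an optional extra step that the cleaning cell itself does not perform. The cast is shown because pandas stores integer codes as floats once a column contains NaN.

```python
import numpy as np
import pandas as pd

SENTINEL = -1  # marks "unknown"; assumed never to be a valid demographic code

# Toy stand-in for the users table (illustrative values only)
toy_users = pd.DataFrame({
    "user_id": [1, 2, 3, 4],
    "socdem_cluster": [5.0, np.nan, 18.0, np.nan],
    "region": [np.nan, np.nan, 7.0, 3.0],
})
demographic_cols = ["socdem_cluster", "region"]

toy_clean = toy_users.copy()
for col in demographic_cols:
    # Guard: the sentinel must not collide with an observed code
    assert (toy_clean[col].dropna() >= 0).all(), f"{col} contains negative codes"
    # Impute, then restore an integer dtype
    toy_clean[col] = toy_clean[col].fillna(SENTINEL).astype("int64")

# Downstream code can exclude unknowns explicitly
known_region = toy_clean[toy_clean["region"] != SENTINEL]
print(toy_clean)
print(f"Users with known region: {len(known_region)}")
```

Checking for collisions before imputing turns the "all codes are non-negative" assumption into something the notebook verifies instead of trusting.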

In [5]:
# Load users table
print("Loading users table...")
users_path = f"{DATASET_PATH_SMALL}/users.pq"
df_users = load_remote_parquet_safe(users_path)

if df_users is not None:
    # Display evidence of data quality issues
    show_evidence(df_users, "Users (Before Cleaning)", 
                 columns_to_check=['socdem_cluster', 'region'])
    
    # Apply cleaning: impute missing demographics with -1
    print("Applying imputation strategy: missing demographics -> -1")
    df_users_clean = df_users.copy()
    df_users_clean['socdem_cluster'] = df_users_clean['socdem_cluster'].fillna(-1)
    df_users_clean['region'] = df_users_clean['region'].fillna(-1)
    
    # Validate cleaning impact
    validate_cleaning(df_users, df_users_clean, "Users")
    
    # Verify complete resolution
    assert df_users_clean.isnull().sum().sum() == 0, "Cleaning failed: missing values remain"
    
    # Save cleaned table
    save_cleaned_data(df_users_clean, "users")
    print(f"Saved {len(df_users_clean):,} cleaned user records")

Loading users table...
Loading dataset/small/users.pq...

EVIDENCE: Users (Before Cleaning)

Shape: (3500000, 3)

 Missing Values Found:
        Column  Missing Count  Missing %
        region          58917       1.68
socdem_cluster           5153       0.15

  Sample rows with missing 'socdem_cluster':
      user_id  socdem_cluster  region
487  27989998             NaN     NaN
670   6416147             NaN     NaN

  Sample rows with missing 'region':
     user_id  socdem_cluster  region
8   70943178             5.0     NaN
47   1664862            18.0     NaN

No duplicates detected


Applying imputation strategy: missing demographics -> -1

VALIDATION: Users

Rows: 3,500,000 → 3,500,000 (Dropped: 0)
Missing Values: 64,070 → 0
Duplicates: 0 → 0

CLEANING SUCCESSFUL


Saved: cleaned_data\users_clean.parquet (3,500,000 rows)
Saved 3,500,000 cleaned user records


---

## 2. Brands Table

### Evidence from Analysis

In `analysis.ipynb`:
1. **Schema error**: `embedding` column failed to load with error "Expected all lists to be of size=300 but index 1 had size=0"
2. **46 duplicate brand_id entries**: Violates the expected primary key constraint

The embedding column contains 300-dimensional vector representations of brands, intended for similarity calculations in recommendation systems. However, the source data contains inconsistent list lengths (empty lists mixed with 300-element lists), causing standard PyArrow loaders to fail schema validation before data can be accessed.

### Decision: Embedding Column Strategy

- Attempting low-level PyArrow recovery was rejected due to operational complexity and uncertain benefit.
- Replacing missing embeddings with dummy zero vectors was rejected because it would mislead models into treating them as legitimate brand representations.
- Instead, the embedding column is excluded, preserving brand_id metadata and explicitly acknowledging the absence of embeddings. Downstream analyses must obtain embeddings from an alternative source or proceed without brand similarity features.
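
The failure mode can be illustrated on toy data. This sketch is not the loader's logic: `find_bad_embeddings` is a hypothetical helper, and a toy dimension of 3 stands in for the real 300. It only shows how empty lists mixed with fixed-length vectors violate a fixed-size assumption.

```python
from typing import List, Sequence


def find_bad_embeddings(embeddings: Sequence[Sequence[float]], dim: int) -> List[int]:
    """Return the positions whose vector length differs from the expected dim."""
    return [i for i, vec in enumerate(embeddings) if len(vec) != dim]


# Toy data: the second entry is empty, mirroring "index 1 had size=0"
toy_embeddings = [
    [0.1, 0.2, 0.3],
    [],
    [0.4, 0.5, 0.6],
]

bad_positions = find_bad_embeddings(toy_embeddings, dim=3)
print(f"Rows violating the fixed-size assumption: {bad_positions}")
```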

### Decision: Duplicate Resolution

- Manual investigation of the 46 duplicate brand_id rows was rejected as unscalable without domain knowledge.
- Keeping the first occurrence was chosen to ensure deterministic, reproducible behavior, prioritizing uniqueness over which specific duplicate is retained. With the embedding column excluded, brand_id is the only column left in the loaded table, so duplicate rows are identical and the choice of which one to keep does not change the cleaned output.
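
A minimal sketch on toy data (the IDs are made up) of how the duplicate count is read. `duplicated()` counts surplus rows beyond the first occurrence, which is not the same as the number of distinct IDs affected.

```python
import pandas as pd

# Toy stand-in for the brands table: 20 appears three times, 30 twice
toy_brands = pd.DataFrame({"brand_id": [10, 20, 20, 20, 30, 30]})

# Surplus rows: every occurrence after the first one
surplus_rows = int(toy_brands.duplicated(subset=["brand_id"]).sum())

# Distinct IDs that appear more than once
is_repeated = toy_brands.duplicated(subset=["brand_id"], keep=False)
affected_ids = int(toy_brands.loc[is_repeated, "brand_id"].nunique())

# Deterministic resolution: keep the first occurrence of each ID
toy_dedup = toy_brands.drop_duplicates(subset=["brand_id"], keep="first")

print(f"Surplus rows: {surplus_rows}, affected IDs: {affected_ids}")
print(f"Rows after deduplication: {len(toy_dedup)}")
```

Asserting `toy_dedup["brand_id"].is_unique` after this step is a cheap way to confirm that the primary key constraint holds.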

In [6]:
# Load brands table (the loader falls back to dropping the corrupted embedding column)
print("Loading brands table...")
brands_path = f"{DATASET_PATH_SMALL}/brands.pq"
df_brands = load_remote_parquet_safe(brands_path)

if df_brands is not None:
    # Display evidence
    show_evidence(df_brands, "Brands (Before Cleaning)")
    
    # Show duplicate evidence explicitly
    duplicates_count = df_brands.duplicated(subset=['brand_id']).sum()
    print(f"\nDuplicate brand_id records: {duplicates_count}")
    if duplicates_count > 0:
        print("Sample duplicates:")
        duplicate_ids = df_brands[df_brands.duplicated(subset=['brand_id'], keep=False)]['brand_id'].unique()[:3]
        print(df_brands[df_brands['brand_id'].isin(duplicate_ids)].sort_values('brand_id'))
    
    # Remove duplicates
    print("\nApplying deduplication: keeping first occurrence of each brand_id")
    df_brands_clean = df_brands.drop_duplicates(subset=['brand_id'], keep='first').copy()
    
    # Validate
    validate_cleaning(df_brands, df_brands_clean, "Brands")
    save_cleaned_data(df_brands_clean, "brands")
    print(f"Saved {len(df_brands_clean):,} unique brand records")

Loading brands table...
Loading dataset/small/brands.pq...
  Standard load failed: Expected all lists to be of size=300 but index 1 had size=0
  Retrying without 'embedding' column...
  Success (with exclusions).

EVIDENCE: Brands (Before Cleaning)

Shape: (24513, 1)

No missing values detected

Duplicates Found: 46 rows



Duplicate brand_id records: 46
Sample duplicates:
      brand_id
3681     37799
3682     37799
3683     37799
3684     37799
3685     37799
3686     37799
3687     37799
5853     60434
5854     60434
5855     60434
5856     60434
5857     60434
5858     60434
5859     60434
6343     65693
6344     65693
6345     65693
6346     65693
6347     65693
6348     65693
6349     65693
6350     65693
6351     65693
6352     65693
6353     65693
6354     65693
6355     65693
6356     65693
6357     65693
6358     65693
6359     65693

Applying deduplication: keeping first occurrence of each brand_id

VALIDATION: Brands

Rows: 24,513 → 24,467 (Dropped: 46)
Missing Values: 0 → 

---

## 3. Retail Items Table

### Evidence from Analysis

From `analysis.ipynb`:
- `category`: 9,585 missing (3.83%)
- `subcategory`: 9,585 missing (3.83%)
- `price`: 26,489 missing (10.59%)

Category and subcategory missingness is perfectly correlated—when one is missing, both are missing. This suggests these items were not fully cataloged during data collection.

### Decision: Category/Subcategory Strategy

- Dropping 3.83% of the catalog was rejected to avoid unnecessarily reducing analytic scope.
- Missing categorical attributes are imputed with the explicit label "Unknown" rather than dropping records or using mode values. This allows product catalog analyses to group "Unknown" items separately while retaining price and brand information, enabling revenue analysis even without category data. 

### Decision: Price Strategy

- Rows with missing prices are dropped rather than imputed. 
- Imputation would introduce severe bias due to diverse product categories and missing category labels—mean or median values would misrepresent items like low-cost food or high-cost home improvement products. Price is critical for revenue, affordability, and pricing analyses, so retaining imputed prices would compromise validity. 
- The resulting 10.59% data loss is an acceptable quality-coverage trade-off, prioritizing analytic integrity for research and model training, though a production system might make a different choice.
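
The trade-off can be checked with a few lines of arithmetic, using the row and missing-value counts reported for this table. The sketch assumes that rows are dropped only for missing price, since rows with a missing category are kept and labeled "Unknown".

```python
# Counts reported in the Retail Items evidence output
total_items = 250_171
missing_price = 26_489

retained_items = total_items - missing_price
missing_pct = missing_price / total_items * 100
retained_pct = retained_items / total_items * 100

print(f"Dropped:  {missing_price:,} ({missing_pct:.2f}%)")
print(f"Retained: {retained_items:,} ({retained_pct:.2f}%)")
```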

In [7]:
# Load retail items
print("Loading retail items...")
retail_items_path = f"{DATASET_PATH_SMALL}/retail/items.pq"
df_retail_items = load_remote_parquet_safe(retail_items_path)

if df_retail_items is not None:
    show_evidence(df_retail_items, "Retail Items (Before Cleaning)",
                 columns_to_check=['category', 'subcategory', 'price'])
    
    # Apply cleaning
    print("Applying cleaning strategy:")
    print("  - category/subcategory: impute 'Unknown'")
    print("  - price: drop rows with missing values")
    
    df_retail_clean = df_retail_items.copy()
    df_retail_clean['category'] = df_retail_clean['category'].fillna("Unknown")
    df_retail_clean['subcategory'] = df_retail_clean['subcategory'].fillna("Unknown")
    df_retail_clean = df_retail_clean.dropna(subset=['price'])
    
    validate_cleaning(df_retail_items, df_retail_clean, "Retail Items")
    save_cleaned_data(df_retail_clean, "retail_items")
    print(f"Retained {len(df_retail_clean):,} items ({len(df_retail_clean)/len(df_retail_items)*100:.1f}% of original catalog)")

Loading retail items...
Loading dataset/small/retail/items.pq...

EVIDENCE: Retail Items (Before Cleaning)

Shape: (250171, 6)

 Missing Values Found:
     Column  Missing Count  Missing %
      price          26489      10.59
   category           9585       3.83
subcategory           9585       3.83

  Sample rows with missing 'category':
         item_id  brand_id category subcategory     price

---

## 4. Retail Events Table

### Evidence from Analysis

From `analysis.ipynb`: No data quality issues detected. All columns complete, no duplicates.

### Decision: No Cleaning Required

In [8]:
# Load and validate retail events
print("Loading retail events (validation only)...")
retail_events_dir = f"{DATASET_PATH_SMALL}/retail/events"
retail_events_files = dataset_files.get(retail_events_dir, [])
df_retail_events = load_dataframe_from_partitions_safe(
    retail_events_files, 
    limit=DATASET_SMALL_NUM_PARTITIONS_TO_LOAD
)

if df_retail_events is not None:
    show_evidence(df_retail_events, "Retail Events (Validation)")
    assert df_retail_events.isnull().sum().sum() == 0, "Unexpected data quality issues"
    print("Validation passed: no cleaning required")
    save_cleaned_data(df_retail_events, "retail_events")

Loading retail events (validation only)...
Loading 10 partitions...
Loading dataset/small/retail/events/01082.pq...
Loading dataset/small/retail/events/01083.pq...
Loading dataset/small/retail/events/01084.pq...
Loading dataset/small/retail/events/01085.pq...
Loading dataset/small/retail/events/01086.pq...
Loading dataset/small/retail/events/01087.pq...
Loading dataset/small/retail/events/01088.pq...
Loading dataset/small/retail/events/01089.pq...
Loading dataset/small/retail/events/01090.pq...
Loading dataset/small/retail/events/01091.pq...
Concatenating partitions...

EVIDENCE: Retail Events (Validation)

Shape: (4128330, 6)

No missing values detected

No duplicates detected


Validation passed: no cleaning required
Saved: cleaned_data\retail_events_clean.parquet (4,128,330 rows)


---

## 5. Marketplace Items Table

### Evidence from Analysis

From `analysis.ipynb`:
- `subcategory`: 1,233,023 missing (53.02%)
- `category`: 966,395 missing (41.56%)  
- `price`: 2,882 missing (0.12%)

### Decision: Imputation Strategy

- For marketplace items, missing categorical attributes are imputed as "Unknown," consistent with the Retail Items strategy. Missing prices, which affect only 0.12% of records, are imputed with -1 rather than dropped. 
- The lower missingness shifts the coverage-quality trade-off: dropping such a small fraction has minimal impact, but imputation maximizes data retention for the larger marketplace catalog.
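
One consequence of a numeric sentinel is that it must be masked before any aggregate is computed. The sketch below uses toy prices (not T-ECD values) and a hypothetical `PRICE_SENTINEL` constant to show how an unmasked -1 skews a mean.

```python
import pandas as pd

PRICE_SENTINEL = -1  # hypothetical constant for the "unknown price" marker

# Toy stand-in for cleaned marketplace items (illustrative values only)
toy_items = pd.DataFrame({
    "item_id": ["a", "b", "c", "d"],
    "price": [100.0, 200.0, PRICE_SENTINEL, 300.0],
})

# Naive aggregate: the sentinel is treated as a real price
naive_mean = toy_items["price"].mean()

# Masked aggregate: turn the sentinel back into NaN, which mean() skips
valid_price = toy_items["price"].where(toy_items["price"] != PRICE_SENTINEL)
masked_mean = valid_price.mean()

print(f"Naive mean:  {naive_mean}")
print(f"Masked mean: {masked_mean}")
```

Any price-based analysis of the cleaned table would need a similar mask, or the sentinel rows filtered out up front.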

In [9]:
# Load marketplace items
print("Loading marketplace items")
mp_items_path = f"{DATASET_PATH_SMALL}/marketplace/items.pq"
df_mp_items = load_remote_parquet_safe(mp_items_path)

if df_mp_items is not None:
    show_evidence(df_mp_items, "Marketplace Items")
    
    # Apply cleaning if needed
    missing_count = df_mp_items.isnull().sum().sum()
    if missing_count > 0:
        print(f"Discovered {missing_count} missing values (not in original analysis)")
        print("Applying imputation strategy")
        
        df_mp_items_clean = df_mp_items.copy()
        for col in df_mp_items.columns:
            if df_mp_items[col].isnull().sum() > 0:
                if df_mp_items[col].dtype == 'object':
                    df_mp_items_clean[col] = df_mp_items_clean[col].fillna("Unknown")
                else:
                    df_mp_items_clean[col] = df_mp_items_clean[col].fillna(-1)
        
        validate_cleaning(df_mp_items, df_mp_items_clean, "Marketplace Items")
        save_cleaned_data(df_mp_items_clean, "marketplace_items")
    else:
        print("No cleaning required")
        save_cleaned_data(df_mp_items, "marketplace_items")

Loading marketplace items
Loading dataset/small/marketplace/items.pq...
  Standard load failed: Expected all lists to be of size=300 but index 2417 had size=0
  Retrying without 'embedding' column...
  Success (with exclusions).

EVIDENCE: Marketplace Items

Shape: (2325409, 5)

 Missing Values Found:
     Column  Missing Count  Missing %
subcategory        1233023      53.02
   category         966395      41.56
      price           2882       0.12

No duplicates detected


Discovered 2202300 missing values (not in original analysis)
Applying imputation strategy

VALIDATION: Marketplace Items

Rows: 2,325,409 → 2,325,409 (Dropped: 0)
Missing Values: 2,202,300 → 0
Duplicates: 0 → 0

CLEANING SUCCESSFUL


Saved: cleaned_data\marketplace_items_clean.parquet (2,325,409 rows)


---

## 6. Marketplace Events Table

### Evidence from Analysis

From `analysis.ipynb`:
- `subdomain`: 2,138 missing (0.04%)
- All other columns complete

### Decision: Imputation Strategy

I impute missing subdomain values with "Unknown" to preserve complete event logs for user journey analysis. Events represent user actions and are valuable regardless of the subdomain context. Dropping 2,138 event records would create gaps in behavioral sequence data.

In [10]:
# Load marketplace events
print("Loading marketplace events...")
mp_events_dir = f"{DATASET_PATH_SMALL}/marketplace/events"
mp_events_files = dataset_files.get(mp_events_dir, [])
df_mp_events = load_dataframe_from_partitions_safe(
    mp_events_files, 
    limit=DATASET_SMALL_NUM_PARTITIONS_TO_LOAD
)

if df_mp_events is not None:
    show_evidence(df_mp_events, "Marketplace Events (Before Cleaning)", 
                 columns_to_check=['subdomain'])
    
    # Apply cleaning
    print("Imputing missing subdomain with 'Unknown'")
    df_mp_events_clean = df_mp_events.copy()
    df_mp_events_clean['subdomain'] = df_mp_events_clean['subdomain'].fillna("Unknown")
    
    # Validate
    validate_cleaning(df_mp_events, df_mp_events_clean, "Marketplace Events")
    save_cleaned_data(df_mp_events_clean, "marketplace_events")
    print(f"Saved {len(df_mp_events_clean):,} event records")

Loading marketplace events...
Loading 10 partitions...
Loading dataset/small/marketplace/events/01082.pq...
Loading dataset/small/marketplace/events/01083.pq...
Loading dataset/small/marketplace/events/01084.pq...
Loading dataset/small/marketplace/events/01085.pq...
Loading dataset/small/marketplace/events/01086.pq...
Loading dataset/small/marketplace/events/01087.pq...
Loading dataset/small/marketplace/events/01088.pq...
Loading dataset/small/marketplace/events/01089.pq...
Loading dataset/small/marketplace/events/01090.pq...
Loading dataset/small/marketplace/events/01091.pq...
Concatenating partitions...

EVIDENCE: Marketplace Events (Before Cleaning)

Shape: (5081920, 6)

 Missing Values Found:
   Column  Missing Count  Missing %
subdomain           2138       0.04

  Sample rows with missing 'subdomain':
                     timestamp   user_id         item_id subdomain action_type       os
2173 1082 days 00:09:10.988761  63492304   nfmcg_6127339      None        like      ios
4683 

---

## 7. Offers Items Table

### Evidence from Analysis

From `analysis.ipynb`:
- `brand_id`: 542 missing (2.42%)
- All other columns complete

### Decision: Imputation Strategy

I impute missing brand_id values with -1 rather than dropping records. Some offers may represent unbranded promotions or platform-wide deals that do not map to specific brands. The 2.42% missingness is modest, and offer-level attributes (item_id, offer terms) remain valid for promotion effectiveness analysis independent of brand attribution.
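
A minimal sketch, on toy data with made-up IDs, of how the -1 sentinel behaves in a join against the brands table. The cast to `int64` and the `indicator=True` merge are illustrative choices, not steps taken in the cell below.

```python
import pandas as pd

# Toy stand-ins (IDs are made up); a missing brand_id makes the column float
toy_offers = pd.DataFrame({
    "item_id": ["o1", "o2", "o3"],
    "brand_id": [101.0, None, 202.0],
})
toy_brands = pd.DataFrame({"brand_id": [101, 202]})

toy_offers_clean = toy_offers.copy()
# Impute the sentinel, then cast so both join keys share an integer dtype
toy_offers_clean["brand_id"] = toy_offers_clean["brand_id"].fillna(-1).astype("int64")

# A left join keeps every offer; the indicator column flags offers without a brand
merged = toy_offers_clean.merge(toy_brands, on="brand_id", how="left", indicator=True)
unbranded = merged[merged["_merge"] == "left_only"]

print(merged)
print(f"Offers without a matching brand: {len(unbranded)}")
```

Because -1 never appears in the brands table, sentinel rows survive a left join but would be silently dropped by an inner join.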

In [11]:
# Load offers items
print("Loading offers items...")
offers_items_path = f"{DATASET_PATH_SMALL}/offers/items.pq"
df_offers_items = load_remote_parquet_safe(offers_items_path)

if df_offers_items is not None:
    show_evidence(df_offers_items, "Offers Items (Before Cleaning)", 
                 columns_to_check=['brand_id'])
    
    # Apply cleaning
    print("Imputing missing brand_id with -1")
    df_offers_items_clean = df_offers_items.copy()
    df_offers_items_clean['brand_id'] = df_offers_items_clean['brand_id'].fillna(-1)
    
    # Validate
    validate_cleaning(df_offers_items, df_offers_items_clean, "Offers Items")
    save_cleaned_data(df_offers_items_clean, "offers_items")
    print(f"Saved {len(df_offers_items_clean):,} offer records")

Loading offers items...
Loading dataset/small/offers/items.pq...

EVIDENCE: Offers Items (Before Cleaning)

Shape: (22368, 3)

 Missing Values Found:
  Column  Missing Count  Missing %
brand_id            542       2.42

  Sample rows with missing 'brand_id':
        item_id  brand_id

---

## 8. Offers Events Table

### Evidence from Analysis

From `analysis.ipynb`:
- No missing values detected
- No duplicate records

### Decision: No Cleaning Required

In [12]:
# Load and validate offers events
print("Loading offers events (validation only)...")
offers_events_dir = f"{DATASET_PATH_SMALL}/offers/events"
offers_events_files = dataset_files.get(offers_events_dir, [])
df_offers_events = load_dataframe_from_partitions_safe(
    offers_events_files, 
    limit=DATASET_SMALL_NUM_PARTITIONS_TO_LOAD
)

if df_offers_events is not None:
    show_evidence(df_offers_events, "Offers Events (Validation)")
    
    # Assertion: verify no issues present
    assert df_offers_events.isnull().sum().sum() == 0, "Unexpected missing values detected"
    
    print("Validation passed: no cleaning required")
    save_cleaned_data(df_offers_events, "offers_events")
    print(f"Saved {len(df_offers_events):,} event records")

Loading offers events (validation only)...
Loading 10 partitions...
Loading dataset/small/offers/events/01082.pq...
Loading dataset/small/offers/events/01083.pq...
Loading dataset/small/offers/events/01084.pq...
Loading dataset/small/offers/events/01085.pq...
Loading dataset/small/offers/events/01086.pq...
Loading dataset/small/offers/events/01087.pq...
Loading dataset/small/offers/events/01088.pq...
Loading dataset/small/offers/events/01089.pq...
Loading dataset/small/offers/events/01090.pq...
Loading dataset/small/offers/events/01091.pq...
Concatenating partitions...

EVIDENCE: Offers Events (Validation)

Shape: (30475441, 4)

No missing values detected

No duplicates detected


Validation passed: no cleaning required
Saved: cleaned_data\offers_events_clean.parquet (30,475,441 rows)
Saved 30,475,441 event records


---

## 9. Reviews Table

### Evidence from Analysis

From `analysis.ipynb`:
- Embedding column has schema error (excluded during load)
- All data columns (timestamp, user_id, brand_id, rating) complete
- No duplicate records

### Decision: No Cleaning Required

In [13]:
# Load and validate reviews (exclude corrupted embeddings)
print("Loading reviews (embedding column excluded due to schema error)...")
reviews_dir = f"{DATASET_PATH_SMALL}/reviews"
review_files = dataset_files.get(reviews_dir, [])
df_reviews = load_dataframe_from_partitions_safe(
    review_files, 
    limit=DATASET_SMALL_NUM_PARTITIONS_TO_LOAD
)

if df_reviews is not None:
    show_evidence(df_reviews, "Reviews (Validation)")
    
    # Verify completeness
    missing_count = df_reviews.isnull().sum().sum()
    if missing_count == 0:
        print("Validation passed: no missing values in data columns")
        save_cleaned_data(df_reviews, "reviews")
        print(f"Saved {len(df_reviews):,} review records")
    else:
        print(f"Warning: Unexpected {missing_count} missing values detected")

Loading reviews (embedding column excluded due to schema error)...
Loading 10 partitions...
Loading dataset/small/reviews/01082.pq...
  Standard load failed: Expected all lists to be of size=312 but index 1 had size=0
  Retrying without 'embedding' column...
  Success (with exclusions).
Loading dataset/small/reviews/01083.pq...
  Standard load failed: Expected all lists to be of size=312 but index 1 had size=0
  Retrying without 'embedding' column...
  Success (with exclusions).
Loading dataset/small/reviews/01084.pq...
  Standard load failed: Expected all lists to be of size=312 but index 1 had size=0
  Retrying without 'embedding' column...
  Success (with exclusions).
Loading dataset/small/reviews/01085.pq...
  Standard load failed: Expected all lists to be of size=312 but index 2 had size=0
  Retrying without 'embedding' column...
  Success (with exclusions).
Loading dataset/small/reviews/01086.pq...
  Standard load failed: Expected all lists to be of size=312 but index 1 had size=

---

## 10. Payments Events Table

### Evidence from Analysis

From `analysis.ipynb`:
- `brand_id`: 33,298,275 missing (48.36%)
- `price`: 139 missing (0.0002%)

### Decision: Differential Strategy

**For brand_id (48.36% missing):** I impute with -1 rather than dropping records. Missingness this high likely reflects business logic rather than a data fault: transactions may involve bundled items, generic products, or services without individual brand attribution. Dropping 48% of payment events would remove nearly half of the data available for revenue analysis.

**For price (0.0002% missing):** I drop rows with missing prices. Price is critical for payment analysis, and the 139 affected records (out of 68,857,371) represent a negligible coverage loss.
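As a quick arithmetic check on this trade-off, the shares can be recomputed from the counts in the evidence output (68,857,371 rows, 33,298,275 missing `brand_id`, 139 missing `price`). This is a minimal sketch; `missing_pct` is a hypothetical helper and not part of the notebook's utilities.

```python
def missing_pct(missing: int, total: int) -> float:
    """Share of missing values, in percent."""
    return missing / total * 100


TOTAL_ROWS = 68_857_371        # shape reported in the evidence output
MISSING_BRAND_ID = 33_298_275  # missing brand_id count from the evidence output
MISSING_PRICE = 139            # missing price count from the evidence output

brand_pct = missing_pct(MISSING_BRAND_ID, TOTAL_ROWS)
price_pct = missing_pct(MISSING_PRICE, TOTAL_ROWS)
retained_pct = 100 - price_pct  # only rows with a missing price are dropped

print(f"brand_id missing: {brand_pct:.2f}%")  # 48.36%
print(f"price missing:    {price_pct:.4f}%")  # 0.0002%
print(f"rows retained:    {retained_pct:.4f}%")
```

At two decimals the price share prints as 0.00, which is how it appears in the evidence table.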

In [14]:
# Load payments events
print("Loading payments events...")
pay_events_dir = f"{DATASET_PATH_FULL}/payments/events"
pay_events_files = dataset_files.get(pay_events_dir, [])
df_pay_events = load_dataframe_from_partitions_safe(
    pay_events_files, 
    limit=DATASET_FULL_NUM_PARTITIONS_TO_LOAD,
    match_term=TARGET_PARTITION_ID  # <--- Forces alignment to Day 1082
)

if df_pay_events is not None:
    show_evidence(df_pay_events, "Payments Events (Before Cleaning)", 
                 columns_to_check=['brand_id', 'price'])
    
    # Apply cleaning
    print("Applying differential strategy:")
    print("  - brand_id (48.36% missing): impute with -1")
    print("  - price (0.0002% missing): drop rows")
    
    df_pay_events_clean = df_pay_events.copy()
    df_pay_events_clean['brand_id'] = df_pay_events_clean['brand_id'].fillna(-1)
    df_pay_events_clean = df_pay_events_clean.dropna(subset=['price'])
    
    # Validate
    validate_cleaning(df_pay_events, df_pay_events_clean, "Payments Events")
    save_cleaned_data(df_pay_events_clean, "payments_events")
    print(f"Retained {len(df_pay_events_clean):,} payment events ({len(df_pay_events_clean)/len(df_pay_events)*100:.2f}% of original)")

Loading payments events...
Filter: Found 1 files matching '01082' (out of 1309)
Loading 1 partitions...
Loading dataset/full/payments/events/01082.pq...
Concatenating partitions...

EVIDENCE: Payments Events (Before Cleaning)

Shape: (68857371, 5)

 Missing Values Found:
  Column  Missing Count  Missing %
brand_id       33298275      48.36
   price            139       0.00

  Sample rows with missing 'brand_id':
                  timestamp   user_id  brand_id     price  transaction_hash
1 1082 days 00:00:00.004851  50106253       NaN -1.260849  67e83695d0b5c844
2 1082 days 00:00:00.009371  62366845       NaN -3.443055  abdea744af9464dd

  Sample rows with missing 'price':
                       timestamp   user_id  brand_id  price  transaction_hash
224366 1082 days 00:06:50.989125  50929130  141048.0    NaN  b819a7091f7a81e3
343328 1082 days 00:10:34.036093  76702552       NaN    NaN  3b960baef0ba5dc4

No duplicates detected


Applying differential strategy:
  - brand_id (57% missing)

---

## 11. Payments Receipts Table

### Evidence from Analysis

From `analysis.ipynb`:
- `brand_id`: 52,129,534 missing (85.80%)
- `price`: 861,796 missing (1.42%)

### Decision: Differential Strategy

**For brand_id (85.80% missing):** I impute with -1. The extreme missingness parallels Payments Events and suggests that receipt-level brand attribution is often unavailable in the transaction system. Receipts capture item-level details via `approximate_item_id`, so the transaction data remains useful even without a brand mapping. Dropping 85.80% of receipt data would be unacceptable.

**For price (1.42% missing):** I drop rows with missing prices. Price is essential for receipt value calculation and revenue analysis. The 1.42% coverage loss is acceptable given the analytic importance of price integrity. The 38 missing `count` values reported in the evidence output are not targeted by this strategy.
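The strategy can be illustrated on a toy frame. The sketch below is an assumption-laden example rather than the notebook's pipeline: `apply_differential_strategy` is a hypothetical helper and the rows are invented. It shows that the sentinel fills `brand_id`, that rows without a price are dropped, and that gaps in columns outside the strategy (such as `count`, which the evidence output lists with 38 missing values) survive unless handled separately.

```python
import numpy as np
import pandas as pd


def apply_differential_strategy(df, impute_col="brand_id", drop_col="price", sentinel=-1):
    """Fill one column with a sentinel, then drop rows missing another column."""
    cleaned = df.copy()
    cleaned[impute_col] = cleaned[impute_col].fillna(sentinel)
    return cleaned.dropna(subset=[drop_col])


# Invented rows that mimic the receipts layout (not real data)
toy = pd.DataFrame({
    "brand_id": [10.0, np.nan, np.nan, 7.0],
    "count": [1.0, 1.0, np.nan, 2.0],
    "price": [-1.59, np.nan, -6.81, 0.25],
})

cleaned = apply_differential_strategy(toy)
remaining = cleaned.isnull().sum()

print(cleaned)
print(remaining)  # brand_id and price have no gaps; count still has one
```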

In [15]:
# Load payments receipts
print("Loading payments receipts...")
receipts_dir = f"{DATASET_PATH_FULL}/payments/receipts"
receipt_files = dataset_files.get(receipts_dir, [])
df_receipts = load_dataframe_from_partitions_safe(
    receipt_files, 
    limit=DATASET_FULL_NUM_PARTITIONS_TO_LOAD,
    match_term=TARGET_PARTITION_ID  # <--- Forces alignment to Day 1082
)

if df_receipts is not None:
    show_evidence(df_receipts, "Payments Receipts (Before Cleaning)", 
                 columns_to_check=['brand_id', 'price'])
    
    # Apply cleaning
    print("Applying differential strategy:")
    print("  - brand_id (85.80% missing): impute with -1")
    print("  - price (1.42% missing): drop rows")
    
    df_receipts_clean = df_receipts.copy()
    df_receipts_clean['brand_id'] = df_receipts_clean['brand_id'].fillna(-1)
    df_receipts_clean = df_receipts_clean.dropna(subset=['price'])
    
    # Validate
    validate_cleaning(df_receipts, df_receipts_clean, "Payments Receipts")
    save_cleaned_data(df_receipts_clean, "payments_receipts")
    print(f"Retained {len(df_receipts_clean):,} receipt records ({len(df_receipts_clean)/len(df_receipts)*100:.1f}% of original)")

Loading payments receipts...
Filter: Found 1 files matching '01082' (out of 1017)
Loading 1 partitions...
Loading dataset/full/payments/receipts/01082.pq...
Concatenating partitions...

EVIDENCE: Payments Receipts (Before Cleaning)

Shape: (60753821, 7)

 Missing Values Found:
  Column  Missing Count  Missing %
brand_id       52129534      85.80
   price         861796       1.42
   count             38       0.00

  Sample rows with missing 'brand_id':
                  timestamp   user_id  brand_id approximate_item_id  count     price  transaction_hash
1 1082 days 00:00:00.004234   1393183       NaN       nfmcg_2779189    1.0 -1.591231  55cc577396e77d0e
2 1082 days 00:00:00.005045  20069758       NaN       nfmcg_2928941    1.0 -6.809535  f6540f5416a0fdc2

  Sample rows with missing 'price':
                    timestamp   user_id  brand_id approximate_item_id  count  price  transaction_hash
57  1082 days 00:00:00.126774  87963920       NaN      nfmcg_18692330    1.0    NaN  55cc57739

---

## Reflection and Limitations

### What I Did Not Clean

Several common data cleaning steps were **intentionally omitted** because the analysis provided no evidence that they were needed:

1. **Outlier detection and removal**: The analysis showed no evidence of extreme or implausible values. Price distributions, while varied, appeared consistent with a diverse product catalog. Removing outliers without specific evidence would risk discarding legitimate premium or bulk-purchase items.

2. **Data type conversions**: All columns had appropriate types (`uint64` for IDs, `float64` for prices, `object` for text, `timedelta64` for timestamps). No conversions were necessary.
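One caveat worth checking downstream: the sample rows printed earlier show `brand_id` values such as `141048.0`, which suggests the column is held as a float once missing values are present. The sketch below uses invented values and shows that filling with -1 keeps the float dtype, and that a cast to a signed integer type is possible afterwards. The cast is an optional follow-up, not a step this notebook performs; an unsigned type could not hold the -1 sentinel.

```python
import numpy as np
import pandas as pd

# Invented values; NaN forces pandas to store the IDs as float64
brand_id = pd.Series([141048.0, np.nan, 52.0], name="brand_id")

imputed = brand_id.fillna(-1)
print(imputed.dtype)  # float64

# Cast only when every value is a whole number, so nothing is truncated
if (imputed % 1 == 0).all():
    as_int = imputed.astype("int64")  # signed, so the -1 sentinel fits
    print(as_int.tolist())  # [141048, -1, 52]
```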

### Confirmed vs. Assumed

**Confirmed data errors** (with explicit evidence):
- Missing values (counts verified)
- Duplicate brand records (count verified)
- Embedding schema corruption (error message verified)

**Assumptions** (necessary but unverified):
- The sentinel value -1 does not collide with any legitimate `brand_id`
- Keeping the first occurrence of each duplicate brand record is acceptable
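Both assumptions can be probed with small checks. The following is a minimal sketch on invented data: `sentinel_is_safe` is a hypothetical helper, and deduplicating on `brand_id` alone is an assumption about the key, since the actual duplicate definition is set in the Brands section.

```python
import pandas as pd


def sentinel_is_safe(ids: pd.Series, sentinel: int = -1) -> bool:
    """Return True when no non-missing ID already equals the sentinel."""
    return not (ids.dropna() == sentinel).any()


# Run this before imputation, while missing values are still NaN
print(sentinel_is_safe(pd.Series([141048.0, None, 52.0])))  # True
print(sentinel_is_safe(pd.Series([141048.0, -1.0])))        # False

# First-occurrence resolution: later rows with a repeated key are discarded
brands = pd.DataFrame({
    "brand_id": [1, 2, 2, 3],
    "name": ["a", "b", "b2", "c"],
})
deduped = brands.drop_duplicates(subset="brand_id", keep="first")
print(deduped["name"].tolist())  # ['a', 'b', 'c']
```

If the duplicate rows differ in any column, as `b` and `b2` do here, keeping the first occurrence silently discards the alternative values, which is why this remains an assumption rather than a confirmed fix.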

### Cleaned Dataset Availability

All cleaned tables are saved in `cleaned_data/` as Parquet files, ready for modeling and analysis.