# Ecommerce Events - Data Cleaning Pipeline

## What this notebook does

This notebook takes raw ecommerce event logs and turns them into a clean, consistent dataset that you can trust for analysis or dashboards.

**Input files**

- `2019-Oct.csv`
- `2019-Nov.csv`

Each row is an event with:

- `event_time` - when the event happened
- `event_type` - view / cart / remove_from_cart / purchase
- `product_id` - product identifier
- `category_id`, `category_code` - product category info
- `brand` - product brand
- `price` - price at the time of the event
- `user_id` - user identifier
- `user_session` - session identifier

The raw files contain missing IDs, non-positive prices, inconsistently cased or padded text, extra event types, and scattered nulls. Anything built directly on them inherits those defects, so metrics come out inconsistent and hard to debug.
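
To see how much of each problem a file actually contains before cleaning, a small profiling pass over a sample can help. The sketch below is one way to do it; `profile_raw` is a hypothetical helper, not part of the notebook's script, and it assumes the column names listed above.

```python
import pandas as pd

# Event types the pipeline treats as valid (from the column description above)
VALID_EVENTS = ["view", "cart", "remove_from_cart", "purchase"]

def profile_raw(df):
    """Count rows affected by each problem class in a raw sample (hypothetical helper)."""
    return {
        "rows": len(df),
        # rows where either ID is missing
        "missing_ids": int(df[["user_id", "product_id"]].isna().any(axis=1).sum()),
        # rows with zero or negative price
        "non_positive_price": int((df["price"] <= 0).sum()),
        # rows whose event_type is outside the four funnel steps
        "unexpected_event_type": int((~df["event_type"].isin(VALID_EVENTS)).sum()),
        # rows that exactly repeat an earlier row in the sample
        "exact_duplicates": int(df.duplicated().sum()),
        # null count for every column
        "nulls_per_column": df.isna().sum().to_dict(),
    }

# Possible usage on a sample of a raw file:
# sample = pd.read_csv("2019-Oct.csv", nrows=100_000)
# print(profile_raw(sample))
```

Because it only looks at a sample, the counts are indicative rather than exact for the whole file.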

## Goal

Build a simple, repeatable cleaning step that:

- enforces basic data integrity  
- removes obvious garbage  
- standardizes key fields  
- produces "analysis-ready" CSVs for BI or modeling

## What the script enforces

In the code below, the pipeline:

- **Keeps only valid events**
  - Drops rows without `user_id` or `product_id`
  - Removes exact duplicate rows within each chunk

- **Standardizes core columns**
  - Parses `event_time` to a proper datetime
  - Lowercases and trims `brand` and `category_code`

- **Applies clear business rules**
  - Keeps only rows with `price > 0`
  - Restricts `event_type` to the main funnel steps:
    - `view`, `cart`, `remove_from_cart`, `purchase`

- **Handles missing values consistently**
  - Replaces remaining nulls with a single sentinel value (`"-"`)

- **Scales to large files**
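  - Reads the CSVs in chunks of 500,000 rows, applies the same cleaning to each chunk, then concatenates the cleaned chunks and writes one output file per input (the cleaned chunks are held in memory until that final write)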
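
The effect of these rules is easiest to see on a toy frame. The rows below are made up for illustration, and `apply_rules` is a hypothetical condensed version of the rules (it omits the `event_time` and `category_code` steps), not the notebook's own function.

```python
import pandas as pd

VALID_EVENTS = ["view", "cart", "remove_from_cart", "purchase"]

# Toy rows, invented for this example
toy = pd.DataFrame({
    "event_type": ["view", "view", "purchase", "wishlist", "cart", "view"],
    "product_id": [1, 1, 2, 3, None, 4],
    "brand": [" NIKE", " NIKE", "Apple ", "acme", "acme", None],
    "price": [10.0, 10.0, 0.0, 5.0, 7.0, 3.5],
    "user_id": [100, 100, 101, 102, 103, 104],
})

def apply_rules(df):
    """Condensed sketch of the rules listed above (hypothetical helper)."""
    # drop rows without IDs, then exact duplicates
    df = df.dropna(subset=["user_id", "product_id"]).drop_duplicates().copy()
    # normalize brand text
    df["brand"] = df["brand"].str.lower().str.strip()
    # business rules: positive price and a known event type
    df = df[(df["price"] > 0) & df["event_type"].isin(VALID_EVENTS)]
    # mark remaining gaps with the sentinel
    return df.fillna("-")

result = apply_rules(toy)
print(result)
```

Of the six toy rows, two survive: the first `view` (brand normalized to `nike`) and the last `view` (missing brand replaced by `-`). The others are dropped as a duplicate, a zero price, an unknown event type, and a missing `product_id`.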

The next cell contains the full cleaning script. No manual tweaking is required: point it at the raw files, run it, and you get cleaned Oct and Nov event logs.

In [None]:
#import libraries
import pandas as pd
import os
import time
from pathlib import Path

#rows per chunk and the columns to load
CHUNK_SIZE = 500_000
COLUMNS = [
    'event_time', 'event_type', 'product_id',
    'category_id', 'category_code', 'brand',
    'price', 'user_id', 'user_session'
]

#valid event types to keep
VALID_EVENTS = ['view', 'cart', 'remove_from_cart', 'purchase']

#clean a single chunk of data
def clean_chunk(df):
    #remove rows where user_id or product_id is missing
    df = df.dropna(subset=['user_id', 'product_id'])

    #remove exact duplicate rows within this chunk
    #copy so the column assignments below act on an independent frame
    df = df.drop_duplicates().copy()

    #convert event_time to proper datetime format
    df['event_time'] = pd.to_datetime(df['event_time'], errors='coerce')

    #standardize brand and category_code (lowercase, leading/trailing whitespace trimmed)
    df['brand'] = df['brand'].str.lower().str.strip()
    df['category_code'] = df['category_code'].str.lower().str.strip()

    #remove rows with price zero or negative
    df = df[df['price'] > 0]

    #keep the 4 event types
    df = df[df['event_type'].isin(VALID_EVENTS)]

    #replace any leftover nulls (including unparseable event_time values) with a dash
    df = df.fillna('-')

    return df

#check if cleaned data actually makes sense
def validate_cleaned(df, filename):
    issues = []

    #check for any leftover nulls
    if df.isnull().sum().sum() > 0:
        issues.append("Null values remain")

    #check for invalid prices
    if df['price'].le(0).any():
        issues.append("Invalid prices found")

    #check if event_type has anything unexpected
    if not df['event_type'].isin(VALID_EVENTS).all():
        issues.append("Unexpected event types present")

    #check if the DataFrame is empty after cleaning
    if df.shape[0] == 0:
        issues.append("Resulting DataFrame is empty")

    #print issues if found, otherwise confirm it's clean
    if issues:
        print(f"[VALIDATION FAILED] {filename}:")
        for issue in issues:
            print(f" - {issue}")
    else:
        print(f"[VALIDATION PASSED] {filename}")

#load, clean, and save one file
def process_file(path_to_csv):
    file_start = time.time()
    name = Path(path_to_csv).stem
    output_path = f"cleaned_{name}.csv"
    processed_chunks = []

    print(f"[INFO] Processing {path_to_csv}...")

    #load file in chunks and clean each one
    for chunk in pd.read_csv(path_to_csv, usecols=COLUMNS, chunksize=CHUNK_SIZE):
        cleaned = clean_chunk(chunk)
        processed_chunks.append(cleaned)

    #combine all cleaned chunks
    final_df = pd.concat(processed_chunks)

    #save to cleaned CSV
    final_df.to_csv(output_path, index=False)

    print(f"[DONE] Saved cleaned data to {output_path}")
    print(f"[ROWS] {len(final_df)} rows written")

    #run validation check on final output
    validate_cleaned(final_df, output_path)

    file_end = time.time()
    print(f"[TIME] {round(file_end - file_start, 2)} seconds\n")

#process both csv files in one go
def main():
    files = ["2019-Oct.csv", "2019-Nov.csv"]
    for file in files:
        if os.path.exists(file):
            process_file(file)
        else:
            print(f"[ERROR] File not found: {file}")

#run script
if __name__ == "__main__":
    main()

[INFO] Processing 2019-Oct.csv...
[DONE] Saved cleaned data to cleaned_2019-Oct.csv
[ROWS] 42349875 rows written
[VALIDATION PASSED] cleaned_2019-Oct.csv
[TIME] 648.21 seconds

[INFO] Processing 2019-Nov.csv...
[DONE] Saved cleaned data to cleaned_2019-Nov.csv
[ROWS] 67213478 rows written
[VALIDATION PASSED] cleaned_2019-Nov.csv
[TIME] 1114.77 seconds
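
The run above holds every cleaned chunk in memory until `pd.concat`. If memory becomes a constraint, one possible variant appends each cleaned chunk to the output file as soon as it is ready. This is a sketch under that assumption, not the notebook's method; `clean_to_csv` and its `clean_fn` parameter are hypothetical names.

```python
import pandas as pd

def clean_to_csv(src, dst, clean_fn, chunksize=500_000, **read_kwargs):
    """Clean `src` chunk by chunk and append each cleaned chunk to `dst`.

    `clean_fn` is any function that takes and returns a DataFrame.
    Extra keyword arguments are passed through to pd.read_csv.
    Returns the number of rows written.
    """
    rows = 0
    first = True
    for chunk in pd.read_csv(src, chunksize=chunksize, **read_kwargs):
        cleaned = clean_fn(chunk)
        # first chunk creates the file with a header; later chunks append without one
        cleaned.to_csv(dst, mode="w" if first else "a", header=first, index=False)
        rows += len(cleaned)
        first = False
    return rows

# Possible usage with the notebook's own names:
# clean_to_csv("2019-Oct.csv", "cleaned_2019-Oct.csv", clean_chunk, usecols=COLUMNS)
```

The trade-off is that there is no combined DataFrame left to validate at the end, so any checks would have to run per chunk or against the written file.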



# What we get after cleaning

After running the script, each raw file (for example `2019-Oct.csv`) becomes a cleaned file (`cleaned_2019-Oct.csv`) with the same columns but a much tighter definition of what a "valid event" is.

## How the data changed

The cleaned files now guarantee:

- **Every event has a user and a product ID**
  - No rows with missing `user_id` or `product_id`

- **No obvious double-counting**
  - Exact duplicate rows are removed within each 500,000-row chunk; duplicates that land in different chunks are not detected

- **Consistent, usable fields**
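  - `event_time` is parsed to a datetime before writing, so timestamps in the CSV share one format; values that fail to parse end up as the `"-"` sentinel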
  - `brand` and `category_code` are normalized (lowercase, trimmed) so "NIKE", "nike " and "Nike" are the same thing

- **Prices and events are sane**
  - No non-positive prices
  - `event_type` is limited to the main funnel actions:
    - `view`, `cart`, `remove_from_cart`, `purchase`

- **Missing data is explicit, not hidden**
  - Remaining gaps are marked with `"-"` instead of silent nulls

- **Basic validation is done for you**
  - The script reports if:
    - any nulls slipped through  
    - any invalid prices remain  
    - any unexpected event types exist  
    - the cleaned file ended up empty
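
The script's validation runs on the in-memory frame after nulls have been filled, so its null check passes by construction. To re-check a cleaned file from disk, one option is to read it back with the sentinel mapped to NaN again. The sketch below assumes the cleaned column names and the `"-"` sentinel described above; `check_cleaned_file` is a hypothetical helper.

```python
import pandas as pd

VALID_EVENTS = ["view", "cart", "remove_from_cart", "purchase"]

def check_cleaned_file(path, chunksize=500_000):
    """Re-check a cleaned CSV chunk by chunk and return a sorted list of issues."""
    issues = set()
    rows = 0
    # na_values=["-"] turns the sentinel back into NaN on load
    for chunk in pd.read_csv(path, chunksize=chunksize, na_values=["-"]):
        rows += len(chunk)
        if chunk[["user_id", "product_id"]].isna().any().any():
            issues.add("missing user_id or product_id")
        if not (chunk["price"] > 0).all():
            issues.add("non-positive or missing price")
        if not chunk["event_type"].isin(VALID_EVENTS).all():
            issues.add("unexpected event type")
    if rows == 0:
        issues.add("file is empty")
    return sorted(issues)

# Possible usage:
# print(check_cleaned_file("cleaned_2019-Oct.csv"))  # an empty list means no issues found
```

Reading with `na_values=["-"]` is also a convenient way to load the cleaned files for analysis when real nulls are preferred over the sentinel.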

## Why this matters

With these constraints in place, you can safely:

- Build funnels from view to cart to purchase by user, session, or product  
- Analyze behavior by brand or category without fighting messy labels  
- Feed the data into dashboards or models that assume "one row = one valid event"  
- Re-run the same cleaning step on future months without touching the logic
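
As a small illustration of the first point, event counts per funnel step can be accumulated from a cleaned file without loading it whole. This is a minimal sketch that counts events rather than distinct users or sessions; `funnel_counts` and `conversion` are hypothetical helpers.

```python
from collections import Counter

import pandas as pd

def funnel_counts(path, chunksize=500_000):
    """Count events per event_type in a cleaned CSV, one chunk at a time."""
    counts = Counter()
    for chunk in pd.read_csv(path, usecols=["event_type"], chunksize=chunksize):
        # add this chunk's per-type counts to the running totals
        counts.update(chunk["event_type"].value_counts().to_dict())
    return counts

def conversion(counts, frm, to):
    """Ratio of `to` events to `frm` events; None when there are no `frm` events."""
    return counts[to] / counts[frm] if counts[frm] else None

# Possible usage:
# counts = funnel_counts("cleaned_2019-Oct.csv")
# print(conversion(counts, "view", "cart"), conversion(counts, "cart", "purchase"))
```

A user-level or session-level funnel would need grouping on `user_id` or `user_session` instead, which this sketch does not attempt.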

In short: raw logs go in, and deterministic, analysis-ready event tables come out. This notebook documents both the rules and the code that enforces them.