# Data Cleaning and Preprocessing

## Overview
This notebook performs data quality checks and cleaning on the UCI Online Retail dataset.

## Data Source
- **Dataset**: Online Retail Dataset from UCI Machine Learning Repository
- **Period**: December 2010 - December 2011
- **Business**: UK-based online gift retailer (B2B and B2C)

## Cleaning Steps
1. Handle missing values (specifically CustomerID - required for customer analysis)
2. Remove duplicate transactions
3. Correct data types (dates, IDs)
4. Filter invalid records (non-product entries)
5. Document business rules for data anomalies
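
The steps above can be sketched end to end on a handful of rows. This is a minimal illustration, not the notebook's actual pipeline: the `toy` frame, the `clean` helper, and the `removed` log are hypothetical names, and the rows are invented to trigger each rule once.

```python
import pandas as pd

# Toy rows invented for illustration; column names follow the dataset.
toy = pd.DataFrame({
    "InvoiceNo": ["536365", "536365", "536366", "C536379", "536370"],
    "StockCode": ["85123A", "85123A", "71053", "D", "POST"],
    "Quantity": [6, 6, 6, -1, 3],
    "InvoiceDate": ["12/1/2010 8:26", "12/1/2010 8:26", "12/1/2010 8:28",
                    "12/1/2010 9:41", "12/1/2010 8:45"],
    "CustomerID": [17850.0, 17850.0, None, 14527.0, 12583.0],
})

VALID_STOCKCODE = r"^\d{5}[a-zA-Z]*$"


def clean(df):
    """Apply the cleaning steps in order; return the result and rows removed per step."""
    removed = {}

    # Step 1: keep only rows that can be tied to a customer
    step = df.dropna(subset=["CustomerID"])
    removed["null CustomerID"] = len(df) - len(step)

    # Step 2: drop exact duplicate rows
    deduped = step.drop_duplicates()
    removed["duplicates"] = len(step) - len(deduped)

    # Step 3: correct data types
    typed = deduped.assign(
        InvoiceDate=pd.to_datetime(deduped["InvoiceDate"], format="%m/%d/%Y %H:%M"),
        CustomerID=deduped["CustomerID"].astype(int),
    )

    # Step 4: keep only product StockCodes
    products = typed[typed["StockCode"].astype(str).str.match(VALID_STOCKCODE)]
    removed["invalid StockCode"] = len(typed) - len(products)
    return products, removed


cleaned, removed = clean(toy)
print(removed)
```

Logging the row count after every step makes it easy to see which rule removed what, instead of only comparing the first and last counts.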

## Key Business Rules Discovered
- **InvoiceNo starting with 'C'**: Indicates a **cancellation/return** (negative quantities)
- **StockCode = POST, D, M, C2, etc.**: Non-product entries (postage, discounts, manual adjustments)
- **Valid StockCode pattern**: 5 digits optionally followed by letters (e.g., 85123A)

In [None]:
import pandas as pd
import numpy as np
import re
import warnings
warnings.filterwarnings('ignore')

## 1. Load and Inspect Raw Data

In [None]:
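# Load data as Latin-1 so accented characters in Description decode correctly
# ('unicode_escape' also works here but additionally interprets backslash sequences)
df_raw = pd.read_csv("data.csv", encoding='ISO-8859-1')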

print("=== Raw Data Overview ===")
print(f"Shape: {df_raw.shape[0]:,} rows x {df_raw.shape[1]} columns")
print(f"\nColumns: {list(df_raw.columns)}")

In [None]:
df_raw.info()

In [None]:
df_raw.head(10)

## 2. Missing Value Analysis

### Why we only drop rows where CustomerID is null:
- **CustomerID is essential** for customer-level analysis (CLV, segmentation, RFM)
- **Description nulls** can often be inferred from StockCode
- Dropping all nulls blindly loses valuable transaction data
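
The difference between the two strategies can be seen on a toy frame. This is a sketch with invented rows; the `lookup` fill at the end only illustrates how a missing Description might be inferred from StockCode and is not a step this notebook performs.

```python
import pandas as pd

# Toy rows invented for illustration
toy = pd.DataFrame({
    "StockCode": ["85123A", "85123A", "71053"],
    "Description": ["WHITE HANGING HEART T-LIGHT HOLDER", None, "WHITE METAL LANTERN"],
    "CustomerID": [17850.0, 17850.0, None],
})

# Dropping every row with any null also discards the row missing only Description
drop_any = toy.dropna()

# Dropping on CustomerID alone keeps that row
drop_customer = toy.dropna(subset=["CustomerID"])

print(f"dropna(): {len(drop_any)} row(s), dropna(subset=['CustomerID']): {len(drop_customer)} row(s)")

# Hypothetical fill: map each StockCode to a known Description from other rows
lookup = (
    toy.dropna(subset=["Description"])
       .drop_duplicates("StockCode")
       .set_index("StockCode")["Description"]
)
filled = drop_customer["Description"].fillna(drop_customer["StockCode"].map(lookup))
print(filled.tolist())
```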

In [None]:
# Analyze missing values
null_summary = pd.DataFrame({
    'Null Count': df_raw.isnull().sum(),
    'Null %': (df_raw.isnull().sum() / len(df_raw) * 100).round(2)
})
null_summary = null_summary[null_summary['Null Count'] > 0]

print("=== Missing Value Summary ===")
print(null_summary)
print(f"\nTotal rows with any null: {df_raw.isnull().any(axis=1).sum():,}")

In [None]:
# Examine rows with null CustomerID
null_customer_sample = df_raw[df_raw['CustomerID'].isnull()].head(10)
print("Sample of transactions with missing CustomerID:")
null_customer_sample

In [None]:
# Drop only rows where CustomerID is null (required for customer analysis)
df = df_raw.dropna(subset=['CustomerID']).copy()

print(f"Rows before: {len(df_raw):,}")
print(f"Rows after dropping null CustomerID: {len(df):,}")
print(f"Rows removed: {len(df_raw) - len(df):,} ({(len(df_raw) - len(df)) / len(df_raw) * 100:.1f}%)")

## 3. Remove Duplicates

In [None]:
# Check for duplicates
duplicate_count = df.duplicated().sum()
print(f"Duplicate rows found: {duplicate_count:,}")

# Remove duplicates
df = df.drop_duplicates()
print(f"Rows after deduplication: {len(df):,}")
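
A note on what the count above measures: with its default `keep='first'`, `duplicated()` flags only the second and later copies of a row, so the count equals the number of rows `drop_duplicates()` removes. A small sketch with invented rows:

```python
import pandas as pd

# Toy rows invented for illustration: the first three are identical
toy = pd.DataFrame({
    "InvoiceNo": ["536409", "536409", "536409", "536412"],
    "StockCode": ["21866", "21866", "21866", "22327"],
    "Quantity": [1, 1, 1, 1],
})

flags = toy.duplicated()              # keep='first': the first copy is not flagged
deduped = toy.drop_duplicates()
rows_removed = len(toy) - len(deduped)

# keep=False flags every member of a duplicate group, including the first
all_copies = toy.duplicated(keep=False).sum()

print(flags.tolist())
print(f"flagged: {flags.sum()}, removed: {rows_removed}, rows in duplicate groups: {all_copies}")
```

Identical rows within one invoice may be genuine repeat line items rather than data errors; the notebook treats them as duplicates, and this sketch does not settle that question.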

## 4. Data Type Corrections

In [None]:
# Convert InvoiceDate to datetime
# Format: MM/DD/YYYY HH:MM (US format)
df['InvoiceDate'] = pd.to_datetime(df['InvoiceDate'], format='%m/%d/%Y %H:%M')

# Convert CustomerID from float to integer
# (Was float due to NaN values in original data)
df['CustomerID'] = df['CustomerID'].astype(int)

print("Data types after conversion:")
print(df.dtypes)
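
To illustrate why CustomerID arrives as float and why the conversion is only safe after the null rows are gone, here is a short sketch on invented values. The nullable `Int64` dtype is shown as an alternative for cases where missing IDs must be kept; the notebook itself uses plain `int`.

```python
import pandas as pd

# A single missing value forces an integer column to float64
ids = pd.Series([17850, None, 13047])
print(ids.dtype)

# Plain int works only once the missing values are removed
as_int = ids.dropna().astype(int)

# Nullable integer dtype keeps the missing entry as <NA>
nullable = ids.astype("Int64")

# Month-first parsing with an explicit format
dates = pd.to_datetime(
    pd.Series(["12/1/2010 8:26", "12/9/2011 12:50"]),
    format="%m/%d/%Y %H:%M",
)
print(dates.dt.day.tolist())
```

Passing an explicit `format` avoids any ambiguity about whether `12/1/2010` means 1 December or 12 January.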

## 5. Understand Invoice Patterns

### InvoiceNo Convention
- **Numeric only** (e.g., 536365): Standard sale
- **Starts with 'C'** (e.g., C536379): **Cancellation/Return** - negative quantity
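
The convention can be checked on a toy frame. This sketch assumes InvoiceNo may hold a mix of integers and strings after loading, which is why the values are cast with `astype(str)` before the prefix test; the rows are invented.

```python
import pandas as pd

# Toy rows invented for illustration; InvoiceNo deliberately mixes int and str
toy = pd.DataFrame({
    "InvoiceNo": [536365, "536366", "C536379"],
    "Quantity": [6, 6, -1],
})

# Cast to str first so the prefix test is defined for every row
is_cancellation = toy["InvoiceNo"].astype(str).str.startswith("C")

# Rows where the invoice prefix and the quantity sign disagree
mismatch = toy[is_cancellation != (toy["Quantity"] < 0)]

print(is_cancellation.tolist())
print(f"Rows violating the convention: {len(mismatch)}")
```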

In [None]:
# Analyze invoice patterns
df['IsCancellation'] = df['InvoiceNo'].astype(str).str.startswith('C')

cancellation_summary = df.groupby('IsCancellation').agg({
    'InvoiceNo': 'count',
    'Quantity': ['sum', 'mean']
}).round(2)
cancellation_summary.columns = ['Transaction Count', 'Total Quantity', 'Avg Quantity']

print("=== Invoice Type Analysis ===")
print("IsCancellation = True means invoice starts with 'C' (return/cancellation)")
print(cancellation_summary)

# Verify cancellations have negative quantities
cancellations = df[df['IsCancellation']]
print(f"\nCancellations with negative quantity: {(cancellations['Quantity'] < 0).sum():,} / {len(cancellations):,}")

## 6. Filter Invalid StockCodes

### StockCode Pattern Analysis
- **Valid pattern**: `^\d{5}[a-zA-Z]*$`
  - Exactly 5 digits, followed by zero or more letters
  - Examples: 85123, 85123A, 84029G
- **Invalid codes to remove**:
  - `POST` - Postage charges
  - `D` - Discount
  - `M` - Manual adjustment
  - `C2` - Carriage
  - `DOT` - Dotcom postage
  - `BANK CHARGES` - Bank fees

### Regex Explanation: `^\d{5}[a-zA-Z]*$`
- `^` - Start of string
- `\d{5}` - Exactly 5 digits (0-9)
- `[a-zA-Z]*` - Zero or more letters
- `$` - End of string
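
As a quick check, the pattern can be run against the example codes listed above with Python's `re` module. The two extra codes with 4 and 6 digits are invented to show the length constraint.

```python
import re

VALID = re.compile(r"^\d{5}[a-zA-Z]*$")

codes = [
    "85123", "85123A", "84029G",                       # product codes
    "POST", "D", "M", "C2", "DOT", "BANK CHARGES",     # non-product entries
    "8512", "851234",                                  # invented: wrong digit count
]

results = {code: bool(VALID.match(code)) for code in codes}
for code, ok in results.items():
    print(f"{code!r:>15} -> {'valid' if ok else 'invalid'}")
```

Only the first three codes match; everything else fails either at the first character or at the digit count.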

In [None]:
# Identify invalid StockCodes
VALID_STOCKCODE_PATTERN = r'^\d{5}[a-zA-Z]*$'

invalid_stockcodes = df[~df['StockCode'].astype(str).str.match(VALID_STOCKCODE_PATTERN)]

print("=== Invalid StockCode Analysis ===")
print(f"Invalid records: {len(invalid_stockcodes):,} ({len(invalid_stockcodes)/len(df)*100:.2f}%)")
print("\nInvalid StockCode breakdown:")
print(invalid_stockcodes.groupby(['StockCode', 'Description']).size().sort_values(ascending=False).head(10))

In [None]:
# Remove invalid StockCodes (non-product entries)
rows_before = len(df)
df = df[df['StockCode'].astype(str).str.match(VALID_STOCKCODE_PATTERN)]

print(f"Rows removed (invalid StockCode): {rows_before - len(df):,}")
print(f"Rows remaining: {len(df):,}")

## 7. Final Validation

In [None]:
# Drop the temporary IsCancellation column (will recreate in feature engineering if needed)
df = df.drop(columns=['IsCancellation'])

# Final data quality check
print("=== Final Data Quality Report ===")
print(f"\nShape: {df.shape[0]:,} rows x {df.shape[1]} columns")
print(f"\nDate range: {df['InvoiceDate'].min()} to {df['InvoiceDate'].max()}")
print("\nUnique values:")
print(f"  - Customers: {df['CustomerID'].nunique():,}")
print(f"  - Products: {df['StockCode'].nunique():,}")
print(f"  - Invoices: {df['InvoiceNo'].nunique():,}")
print(f"  - Countries: {df['Country'].nunique()}")

print("\nData types:")
print(df.dtypes)

In [None]:
# Verify no remaining nulls in critical columns
print("\nNull check (should all be 0):")
print(df.isnull().sum())

In [None]:
df.head()
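
The printed checks in this section could also be expressed as a function that returns a list of problems, so a failed check is hard to overlook. The sketch below is one possible shape; `validate_clean` and its messages are hypothetical, and the frames are invented.

```python
import pandas as pd

VALID_STOCKCODE_PATTERN = r"^\d{5}[a-zA-Z]*$"


def validate_clean(df):
    """Return a list of data quality problems; an empty list means all checks passed."""
    problems = []
    if df["CustomerID"].isnull().any():
        problems.append("null CustomerID present")
    if not pd.api.types.is_integer_dtype(df["CustomerID"]):
        problems.append("CustomerID is not an integer dtype")
    if not pd.api.types.is_datetime64_any_dtype(df["InvoiceDate"]):
        problems.append("InvoiceDate is not datetime")
    if df.duplicated().any():
        problems.append("duplicate rows present")
    if not df["StockCode"].astype(str).str.match(VALID_STOCKCODE_PATTERN).all():
        problems.append("non-product StockCode present")
    return problems


# Toy frames invented for illustration
good = pd.DataFrame({
    "InvoiceNo": ["536365", "536366"],
    "StockCode": ["85123A", "22633"],
    "InvoiceDate": pd.to_datetime(["2010-12-01 08:26", "2010-12-01 08:28"]),
    "CustomerID": [17850, 17850],
})
bad = good.assign(StockCode=["85123A", "POST"])

print(validate_clean(good))
print(validate_clean(bad))
```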

## 8. Export Cleaned Data

In [None]:
# Save cleaned data
df.to_csv('clean_data.csv', index=False)

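print("Cleaned data exported to clean_data.csv")
print("\n=== Cleaning Summary ===")
print(f"Original rows: {len(df_raw):,}")
print(f"Final rows: {len(df):,}")
print(f"Rows removed: {len(df_raw) - len(df):,} ({(len(df_raw) - len(df)) / len(df_raw) * 100:.1f}%)")

# Per-step counts are computed from earlier cells instead of being hardcoded
null_customer_rows = df_raw['CustomerID'].isnull().sum()
invalid_stockcode_rows = rows_before - len(df)  # rows_before was set just before the StockCode filter
print("\nReasons for removal:")
print(f"  - Null CustomerID: {null_customer_rows:,} rows")
print(f"  - Duplicates: {duplicate_count:,} rows")
print(f"  - Invalid StockCodes: {invalid_stockcode_rows:,} rows")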
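
One caveat when reloading the exported file: CSV stores no type information, so the datetime column comes back as text and all-digit codes may be inferred as integers unless the reader is told otherwise. The sketch below round-trips an invented frame through an in-memory buffer; the `dtype` and `parse_dates` arguments are a suggestion for downstream notebooks, not something this notebook does.

```python
import io

import pandas as pd

# Toy cleaned frame invented for illustration
clean = pd.DataFrame({
    "InvoiceNo": ["536365", "536366"],
    "StockCode": ["22633", "22632"],
    "InvoiceDate": pd.to_datetime(["2010-12-01 08:26", "2010-12-01 08:28"]),
    "CustomerID": [17850, 17850],
})

buffer = io.StringIO()
clean.to_csv(buffer, index=False)

# Naive reload: types are re-inferred from the text
buffer.seek(0)
naive = pd.read_csv(buffer)

# Explicit reload: keep codes as strings and parse the date column
buffer.seek(0)
restored = pd.read_csv(
    buffer,
    dtype={"InvoiceNo": str, "StockCode": str},
    parse_dates=["InvoiceDate"],
)

print(naive.dtypes)
print(restored.dtypes)
```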