# Lab 05 — Missing Data, Duplicates, & Type Normalization

**Focus Area:** Turning messy raw data into consistent, join-ready datasets by handling missing data, duplicates, and type normalization: `dropna`, `fillna`, `astype`, `to_datetime`, `drop_duplicates`, `Series.apply`

---

## Outcomes

By the end of this lab, you will be able to:

1. Inspect and quantify missingness with `isna`, `info`, and `value_counts(dropna=False)`.
2. Impute or remove nulls using `fillna`, `ffill`/`bfill`, group-wise imputation, and `dropna` with `subset`.
3. Normalize types: parse currency/percent strings to numerics (`str.replace` + `pd.to_numeric`/`astype`), and normalize dates with `pd.to_datetime(..., errors='coerce')` and (optionally) timezone to UTC.
4. Identify and remove duplicates with `duplicated`/`drop_duplicates`, including composite keys and "keep" strategies.
5. Apply `Series.apply` for targeted cleanups and know when to prefer vectorized ops.

## Setup - Generate Synthetic Messy Dataset

We'll create a synthetic dataset with various data quality issues including missing values, duplicates, inconsistent formats, and type issues.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pathlib import Path

# Set random seed for reproducibility
rng = np.random.default_rng(123)

# Generate base dataset
n = 1500
users = pd.DataFrame({
    'user_id': np.arange(n),
    'email': [f'user{i}@example.com' if rng.random() > 0.02 else None for i in range(n)],
    'age': rng.integers(16, 80, size=n).astype('float'),  # float so NaN could be stored; none are injected in this setup
    'country': rng.choice(['US','U.S.A.','usa','SG','DE','BR','IN', None], size=n, p=[.25,.05,.05,.15,.15,.2,.1,.05]),
    'signup_date': rng.choice(['2025-01-05','01/06/2025','06-01-2025','2025/01/07', None], size=n, p=[.25,.25,.25,.2,.05]),
    'spend': rng.choice(['$12,345.60','$0.00','$99','1,234.50','€45,00','', None], size=n, p=[.15,.15,.25,.2,.15,.05,.05]),
    'is_marketing_opt_in': rng.choice([True, False, None], size=n, p=[.45,.5,.05])
})

# Inject duplicates (same user_id appearing twice with slight diffs)
dup_ids = rng.choice(users['user_id'], size=60, replace=False)
users = pd.concat([users, users.loc[users['user_id'].isin(dup_ids)].assign(spend='$0.00')], ignore_index=True)

print(f"Dataset shape: {users.shape}")
users.sample(5, random_state=7)

## Part A — Inspect & Plan Missing‑Data Strategy

### A1. Quick profile of missingness

We'll inspect the dataset to understand the extent and distribution of missing values.

In [None]:
# Get overall info about the dataset
users.info()

In [None]:
# Calculate the fraction of nulls per column
null_fraction = users.isna().mean().sort_values(ascending=False).to_frame('null_frac')
print("\nNull Fraction by Column:")
print(null_fraction)

In [None]:
# Examine country values including nulls
print("\nCountry value counts (including nulls):")
users['country'].value_counts(dropna=False).head(10)
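
The profiling calls above can be folded into one summary table. The following is a minimal sketch on a small made-up frame; `demo`, `profile`, and the values in them are hypothetical and are not part of the lab data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame, not the lab's `users` data
demo = pd.DataFrame({
    'email': ['a@example.com', None, 'c@example.com', None],
    'age': [31.0, np.nan, 45.0, 52.0],
    'country': ['US', 'usa', None, 'DE'],
})

# One row per column: absolute null count and fraction of rows that are null
profile = pd.DataFrame({
    'null_count': demo.isna().sum(),
    'null_frac': demo.isna().mean(),
}).sort_values('null_frac', ascending=False)
print(profile)

# value_counts(dropna=False) keeps the null bucket visible in the tally
country_counts = demo['country'].value_counts(dropna=False)
print(country_counts)
```

Seeing counts next to fractions helps when a column looks mostly complete but still hides thousands of nulls in a large frame.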

**Checkpoint Analysis:**
- **Must be present:** `user_id` and `email` are critical business keys - cannot tolerate NaN
- **Can impute:** `spend`, `age`, `is_marketing_opt_in` - can be estimated or given safe defaults
- **Can normalize:** `country` has inconsistent formats (US, U.S.A., usa) - needs standardization
- **Must parse:** `signup_date` has multiple date formats - needs normalization

### A2. Drop rows only when necessary

We'll use `dropna` with `subset` to only drop rows where critical fields are missing.

In [None]:
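# Require a user_id and email; tolerate other nulls for now.
# .copy() makes clean1 an independent frame so later column assignments do not trigger SettingWithCopyWarning.
clean1 = users.dropna(subset=['user_id','email']).copy()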

print(f"Original rows: {len(users)}")
print(f"After dropping missing user_id/email: {len(clean1)}")
print(f"Rows dropped: {len(users) - len(clean1)}")
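
To see why `subset` matters, it may help to compare it with the other `dropna` modes on a tiny example. This is a sketch on a hypothetical four-row frame (`demo` is invented for illustration); only the `dropna` parameters themselves come from pandas.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with nulls spread across columns
demo = pd.DataFrame({
    'user_id': [1, 2, 3, 4],
    'email': ['a@example.com', None, 'c@example.com', 'd@example.com'],
    'age': [30.0, 41.0, np.nan, np.nan],
    'country': ['US', 'DE', None, 'BR'],
})

# Default: drop every row that has at least one null anywhere
drop_any = demo.dropna()

# subset: only nulls in the key columns cause a drop
drop_keys = demo.dropna(subset=['user_id', 'email'])

# thresh: keep rows that have at least 3 non-null values
drop_sparse = demo.dropna(thresh=3)

print(len(drop_any), len(drop_keys), len(drop_sparse))
```

Here the default mode keeps one row out of four, while the key-based drop keeps three, which is the trade-off this step is designed to avoid.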

## Part B — Imputation (`fillna`, `ffill`/`bfill`, group-wise) & Type Fixes

### B1. Normalize country category with simple mapping + `fillna`

In [None]:
# Create mapping for country normalization
country_map = {'U.S.A.':'USA','usa':'USA','US':'USA'}

# Apply mapping and fill nulls
clean1.loc[:, 'country_norm'] = clean1['country'].map(country_map).fillna(clean1['country']).fillna('UNKNOWN')

print("Normalized country value counts:")
clean1['country_norm'].value_counts().head()
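
A hand-written map needs one entry per spelling. One possible refinement is to canonicalize case, whitespace, and punctuation first so the alias table stays short. The sketch below assumes that approach; the sample values and the `aliases` table are hypothetical.

```python
import pandas as pd

# Hypothetical raw values, including spellings the fixed map would miss
country = pd.Series(['US', 'U.S.A.', 'usa', ' sg', 'DE', None, 'Usa'])

# Canonicalize: trim whitespace, uppercase, drop periods
canon = (country
         .str.strip()
         .str.upper()
         .str.replace('.', '', regex=False))

# Hypothetical alias table; extend it as new spellings appear
aliases = {'US': 'USA'}

# Exact-value replacement, then a sentinel for missing countries
country_norm = canon.replace(aliases).fillna('UNKNOWN')
print(country_norm.value_counts())
```

With this ordering, `'Usa'` and `' sg'` are handled without adding map entries for them.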

### B2. Currency parsing → numeric (`spend`)

We'll use vectorized string operations to parse currency values, then apply group-wise median imputation.

In [None]:
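# Strip currency symbols and spaces first; commas are handled afterwards
# because a comma can be either a thousands separator or a European decimal mark
s = clean1['spend'].astype('string')
s1 = s.str.replace('[ $€]', '', regex=True)

# Convert comma-decimal like '45,00' -> '45.00' (digits, one comma, 1-2 digits, no dot)
is_comma_decimal = s1.str.contains(r'^\d+,\d{1,2}$', regex=True, na=False)
s1 = s1.mask(is_comma_decimal, s1.str.replace(',', '.', regex=False))

# Any remaining commas are thousands separators
s1 = s1.str.replace(',', '', regex=False)

# Convert to numeric (note: euro amounts are parsed but not currency-converted)
clean1.loc[:, 'spend_usd'] = pd.to_numeric(s1, errors='coerce')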

print("Before imputation:")
print(clean1['spend_usd'].describe())
print(f"\nNull values: {clean1['spend_usd'].isna().sum()}")

In [None]:
# Impute missing spend with group median by country
med = clean1.groupby('country_norm')['spend_usd'].transform('median')
clean1.loc[:, 'spend_usd'] = clean1['spend_usd'].fillna(med).fillna(0.0)

print("After imputation:")
print(clean1['spend_usd'].describe())
print(f"\nNull values: {clean1['spend_usd'].isna().sum()}")
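
The outcomes also mention `ffill`/`bfill`. These suit ordered data, where a neighboring value is a plausible stand-in, which the user table here is not. The sketch below therefore uses a hypothetical frame of daily readings per device; all names and values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical daily readings for two devices, ordered by day within device
readings = pd.DataFrame({
    'device': ['a', 'a', 'a', 'a', 'b', 'b', 'b'],
    'day': pd.to_datetime(['2025-01-01', '2025-01-02', '2025-01-03', '2025-01-04',
                           '2025-01-01', '2025-01-02', '2025-01-03']),
    'value': [1.0, np.nan, np.nan, 4.0, np.nan, 7.0, np.nan],
}).sort_values(['device', 'day'])

# Forward-fill within each device so values never leak across devices;
# limit=1 caps how far a stale value is carried
readings['value_ffill'] = readings.groupby('device')['value'].ffill(limit=1)

# Back-fill covers what remains, including leading gaps with no earlier value
readings['value_filled'] = readings.groupby('device')['value_ffill'].bfill()
print(readings)
```

Grouping before filling is the important design choice: without it, the last value of device `a` would be carried into the first gap of device `b`.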

### B3. Dates to `datetime64[ns]` with parsing variations

In [None]:
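# Parse signup_date. The column mixes several formats, so on pandas >= 2.0 pass format='mixed';
# without it, values that differ from the first inferred format are coerced to NaT.
# dayfirst=False reads the ambiguous '06-01-2025' as June 1, not January 6.
clean1.loc[:, 'signup_dt'] = pd.to_datetime(clean1['signup_date'], format='mixed', errors='coerce', dayfirst=False)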

print("Date parsing comparison:")
print(clean1[['signup_date','signup_dt']].head(8))
print(f"\nNull dates: {clean1['signup_dt'].isna().sum()}")
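
When the set of input formats is known, an alternative is to try each format explicitly, which avoids guessing on ambiguous strings. The following is a sketch under that assumption; the `formats` list and sample values are hypothetical, and the final step assumes the naive timestamps already represent UTC.

```python
import pandas as pd

# Hypothetical raw strings in three known formats, plus junk and a null
raw = pd.Series(['2025-01-05', '01/06/2025', '2025/01/07', 'not a date', None])

# Try each known format in priority order; the first successful parse wins
formats = ['%Y-%m-%d', '%m/%d/%Y', '%Y/%m/%d']
parsed = pd.to_datetime(raw, format=formats[0], errors='coerce')
for fmt in formats[1:]:
    parsed = parsed.fillna(pd.to_datetime(raw, format=fmt, errors='coerce'))

# Assumption: the naive timestamps already represent UTC wall-clock time
parsed_utc = parsed.dt.tz_localize('UTC')
print(parsed_utc)
```

Because the priority order is explicit, a string such as `06-01-2025` is interpreted however the format list says, rather than by a parser default.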

### B4. Coerce numeric/boolean types with `astype`

In [None]:
# Age may have NaN -> use pandas nullable types if you need to keep nulls.
# Assign the column directly: `.loc[:, col] = ...` writes into the existing column and can keep its old dtype.
clean1['age'] = clean1['age'].astype('Float64')

# Normalize boolean with fillna then astype
clean1['is_marketing_opt_in'] = clean1['is_marketing_opt_in'].fillna(False).astype('bool')

print("Updated data types:")
print(clean1.dtypes)
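
To make the difference between nullable and strict dtypes concrete, here is a small sketch on hypothetical series (`age` and `opt_in` below are standalone examples, not the lab columns).

```python
import numpy as np
import pandas as pd

# Hypothetical ages with one gap
age = pd.Series([34.0, np.nan, 51.0])

age_float = age.astype('Float64')  # nullable float; the gap is shown as <NA>
age_int = age.astype('Int64')      # whole numbers stay integers despite the gap

# Hypothetical opt-in flags with one unknown
opt_in = pd.Series([True, None, False], dtype='object')

opt_in_nullable = opt_in.astype('boolean')                    # True / False / <NA>
opt_in_strict = opt_in_nullable.fillna(False).astype('bool')  # True / False only

print(age_float.dtype, age_int.dtype, opt_in_nullable.dtype, opt_in_strict.dtype)
```

The nullable `boolean` step keeps "unknown" as its own state until a default is chosen deliberately.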

## Part C — Duplicates & De‑duplication Strategies

### C1. Detect duplicates by key

In [None]:
# A user should be unique by user_id (business key)
dup_mask = clean1.duplicated(subset=['user_id'], keep=False)
clean_dups = clean1.loc[dup_mask].sort_values(['user_id','signup_dt'])

print(f"Number of duplicate user_ids: {clean_dups['user_id'].nunique()}")
print(f"Total duplicate rows: {len(clean_dups)}")
print("\nSample of duplicates:")
clean_dups.head(8)

### C2. Resolve duplicates: pick the "best" record per key

Policy: Prefer newest `signup_dt`, then higher `spend_usd`

In [None]:
# 1) Prefer newest signup_dt, then higher spend
resolved = (clean1
            .sort_values(['user_id','signup_dt','spend_usd'], ascending=[True, False, False])
            .drop_duplicates(subset=['user_id'], keep='first'))

print(f"Before deduplication: {len(clean1)} rows")
print(f"After deduplication: {len(resolved)} rows")
print(f"Duplicates removed: {len(clean1) - len(resolved)}")
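
The `keep` argument decides which row survives, and sorting beforehand is what turns `keep='first'` into a policy. A minimal sketch on a hypothetical frame:

```python
import pandas as pd

# Hypothetical rows: user 1 appears twice, user 3 three times
demo = pd.DataFrame({
    'user_id': [1, 1, 2, 3, 3, 3],
    'spend_usd': [10.0, 0.0, 5.0, 7.0, 9.0, 0.0],
})

first = demo.drop_duplicates(subset=['user_id'], keep='first')        # first row seen
last = demo.drop_duplicates(subset=['user_id'], keep='last')          # last row seen
only_unique = demo.drop_duplicates(subset=['user_id'], keep=False)    # drop all repeated keys

# Sorting first turns keep='first' into "keep the highest spend per user"
top_spend = (demo
             .sort_values(['user_id', 'spend_usd'], ascending=[True, False])
             .drop_duplicates(subset=['user_id'], keep='first'))
print(top_spend)
```

Without the sort, `keep='first'` depends on whatever order the rows happened to arrive in.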

### C3. Alternative: custom reducer via `groupby().agg`

In [None]:
def coalesce(series):
    """Return first non-null value"""
    return series.dropna().iloc[0] if series.notna().any() else pd.NA

best = (clean1
        .sort_values(['signup_dt'], ascending=False)
        .groupby('user_id')
        .agg(email=('email','first'),
             country_norm=('country_norm', 'first'),
             signup_dt=('signup_dt','first'),
             spend_usd=('spend_usd','max'),
             age=('age', coalesce),
             is_marketing_opt_in=('is_marketing_opt_in','max'))
        .reset_index())

print(f"Rows after groupby aggregation: {len(best)}")
print("\nSample of deduplicated data:")
best.head()
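
The built-in `first` aggregation skips nulls per column, much like the `coalesce` helper above, so an aggregated record can combine fields from different source rows. The sketch below contrasts that with picking one whole row; the frame is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical duplicates where each row is missing a different field
demo = pd.DataFrame({
    'user_id': [1, 1, 2, 2],
    'email': [None, 'a@example.com', 'b@example.com', None],
    'age': [40.0, np.nan, np.nan, 33.0],
})

# Column-wise: first non-null value per column within each group
merged = demo.groupby('user_id').first()

# Row-wise: keep the first row per key, nulls included
row_pick = demo.drop_duplicates(subset=['user_id'], keep='first').set_index('user_id')

print(merged)
print(row_pick)
```

Whether stitching fields across rows is acceptable is a policy decision: it fills more gaps, but the result may describe a record that never existed as a single row.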

## Part D — `Series.apply` vs Vectorized Ops

### D1. When to use `apply`

We'll compare the performance of `apply` vs vectorized operations for parsing currency values.

In [None]:
def parse_spend(x):
    """Parse currency value to float"""
    if x is None or (isinstance(x, float) and pd.isna(x)):
        return None
    s = str(x).strip()
    s = s.replace('€','').replace(' ','')
    if s.count(',') == 1 and s.count('.') == 0:
        s = s.replace(',','.')
    s = s.replace('$','').replace(',','')
    try:
        return float(s)
    except ValueError:
        return None

# Test with apply
print("Testing apply method:")
%timeit clean1['spend'].apply(parse_spend)

In [None]:
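# Test with vectorized operations
# Note: unlike parse_spend, this one-liner does not convert comma decimals such as '€45,00',
# so the two timings cover slightly different work
print("Testing vectorized method:")
%timeit pd.to_numeric(clean1['spend'].astype('string').str.replace('[ $,]','',regex=True).str.replace('€','',regex=False), errors='coerce')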

**Guideline:** Prefer vectorized operations for large frames; reserve `apply` for edge cases, then cache results.

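The size of the gain depends on the operation. Numeric vectorized code often beats `apply` by one to two orders of magnitude, while `.str` methods on object columns still process elements one by one internally and usually show a smaller speedup. Rely on the `%timeit` output above rather than a fixed multiplier.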
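
Outside a notebook, the same comparison can be run with the standard `timeit` module. This is a sketch on a hypothetical sample of simple dollar strings; `parse_one` and `parse_vectorized` are illustrative helpers, and no particular timing result is implied.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical sample: simple dollar strings repeated to get a measurable size
spend = pd.Series(['$12,345.60', '$0.00', '$99', '1,234.50', None] * 2000)

def parse_one(x):
    """Element-wise parser used with Series.apply."""
    if x is None or (isinstance(x, float) and np.isnan(x)):
        return np.nan
    return float(str(x).replace('$', '').replace(',', ''))

def parse_vectorized(s):
    """Same cleanup expressed with .str methods and pd.to_numeric."""
    cleaned = s.str.replace('[$,]', '', regex=True)
    return pd.to_numeric(cleaned, errors='coerce').astype('float64')

via_apply = spend.apply(parse_one)
via_vector = parse_vectorized(spend)

# Both paths must agree before a timing comparison means anything
assert np.allclose(via_apply, via_vector, equal_nan=True)

t_apply = timeit.timeit(lambda: spend.apply(parse_one), number=5)
t_vector = timeit.timeit(lambda: parse_vectorized(spend), number=5)
print(f"apply: {t_apply:.4f}s  vectorized: {t_vector:.4f}s")
```

Checking equality first guards against comparing a fast but wrong pipeline with a slow but correct one.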

## Part E — Bonus (Optional) — Apply to orders artifacts

### E1. Load data

In [None]:
# Try to load orders data from Lab 03 artifacts
from pathlib import Path

p_orders = Path('artifacts/parquet/orders')
joined_path = Path('artifacts/parquet/orders_joined.parquet')

orders = None

if joined_path.exists():
    orders = pd.read_parquet(joined_path)
    print(f"Loaded {len(orders)} rows from {joined_path}")
elif p_orders.exists():
    files = sorted(p_orders.glob('shipcountry=*.parquet'))
    if files:
        orders = pd.concat([pd.read_parquet(f) for f in files], ignore_index=True)
        print(f"Loaded {len(orders)} rows from {len(files)} parquet files")
    else:
        print("No parquet files found in orders directory")
else:
    print("Orders data not available - skipping bonus section")
    print("Run Lab 03 first to generate the required artifacts")

if orders is not None:
    print("\nOrders data preview:")
    display(orders.head())
    print("\nData types:")
    print(orders.dtypes)

### E2. Missingness & types

In [None]:
if orders is not None:
    print("Missing data analysis:")
    print(orders.isna().mean().sort_values(ascending=False).head(10))
    
    # Normalize date and numeric columns if they exist
    if 'OrderDate' in orders.columns:
        orders['OrderDate'] = pd.to_datetime(orders['OrderDate'], errors='coerce')
        print(f"\nOrderDate converted to datetime. Nulls: {orders['OrderDate'].isna().sum()}")
    
    if 'Freight' in orders.columns:
        orders['Freight'] = pd.to_numeric(orders['Freight'], errors='coerce')
        print(f"Freight converted to numeric. Nulls: {orders['Freight'].isna().sum()}")

### E3. Deduplicate line items (composite key)

In [None]:
if orders is not None:
    # If you have line items with duplicate OrderID/ProductID rows, keep the one with max UnitPrice*Quantity
    if {'ProductID','Quantity','UnitPrice'}.issubset(orders.columns):
        orders['ext_price'] = orders['Quantity'] * orders['UnitPrice']
        dedup = (orders
                 .sort_values(['OrderID','ProductID','ext_price'], ascending=[True,True,False])
                 .drop_duplicates(subset=['OrderID','ProductID'], keep='first'))
        print("Deduplication by OrderID/ProductID composite key")
    else:
        # Fallback: dedupe by OrderID only (newest date wins)
        if 'OrderDate' in orders.columns:
            dedup = (orders
                     .sort_values(['OrderID','OrderDate'], ascending=[True,False])
                     .drop_duplicates(subset=['OrderID'], keep='first'))
            print("Deduplication by OrderID only")
        else:
            dedup = orders.drop_duplicates(subset=['OrderID'], keep='first')
            print("Simple deduplication by OrderID")
    
    print(f"\nBefore deduplication: {len(orders)} rows")
    print(f"After deduplication: {len(dedup)} rows")
    print(f"Duplicates removed: {len(orders) - len(dedup)}")
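
Before deduplicating on a composite key, it can help to confirm which key actually defines uniqueness. The sketch below uses hypothetical line items that reuse the column names from this section; the values are invented.

```python
import pandas as pd

# Hypothetical line items; column names follow the bonus section
items = pd.DataFrame({
    'OrderID':   [100, 100, 100, 101, 101],
    'ProductID': [1,   1,   2,   1,   1],
    'Quantity':  [2,   5,   1,   3,   3],
    'UnitPrice': [4.0, 4.0, 9.5, 4.0, 4.0],
})

# OrderID alone repeats legitimately (one order, many products);
# only the (OrderID, ProductID) pair is expected to be unique
by_order = items.duplicated(subset=['OrderID'], keep=False).sum()
by_pair = items.duplicated(subset=['OrderID', 'ProductID'], keep=False).sum()

# Count rows per composite key to see which pairs collide
collisions = (items.groupby(['OrderID', 'ProductID']).size()
              .loc[lambda s: s > 1])
print(by_order, by_pair)
print(collisions)
```

Deduplicating on `OrderID` alone here would discard a legitimate line item, which is why the composite key is checked first.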

### E4. Export cleaned orders

In [None]:
if orders is not None:
    import pyarrow as pa
    import pyarrow.parquet as pq
    
    out = Path('artifacts/clean')
    out.mkdir(parents=True, exist_ok=True)
    
    output_path = out / 'orders_clean.parquet'
    pq.write_table(pa.Table.from_pandas(dedup, preserve_index=False), output_path)
    
    print(f"Cleaned orders saved to: {output_path}")
    print(f"File size: {output_path.stat().st_size:,} bytes")

## Part F — Wrap‑Up

### Summary and Analysis

### Answers to Key Questions:

#### 1. Which columns did you drop vs impute, and why?

**Dropped:**
- Rows with missing `user_id` or `email` were dropped because these are critical business keys needed for identification, deduplication, and joins. Without these, the row has no referential integrity.

**Imputed:**
- `spend_usd`: Used group-wise median by `country_norm`, then filled remaining nulls with 0.0. Rationale: spending patterns may vary by country, so group-wise imputation preserves geographic differences.
- `age`: Kept as nullable Float64 to preserve the information that some ages were missing. Could alternatively impute with group median.
- `is_marketing_opt_in`: Filled with `False` as the safe default (opt-out by default is safer for compliance).
- `country`: Normalized inconsistent formats and filled nulls with 'UNKNOWN' to enable grouping and analysis.

#### 2. What deduplication policy did you choose? How does it handle ties or null dates?

**Policy:** We used two approaches:

1. **Sort-based deduplication**: Sort by `user_id` (ascending), `signup_dt` (descending), and `spend_usd` (descending), then `keep='first'`. This prefers:
   - Newest signup date
   - If dates are tied (or all null for that user), the highest spend
   - Null dates sort to the bottom (least preferred), so any row with a date beats a row without one

2. **Aggregation-based**: Group by `user_id` and aggregate:
   - `email`, `country_norm`, `signup_dt`: take first (after sorting by date desc)
   - `spend_usd`: take maximum
   - `age`: use custom coalesce to get first non-null
   - `is_marketing_opt_in`: take max (True > False)

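**Handling ties/nulls:** Ties on date fall through to `spend_usd`. Rows that tie on every sort key have no explicit winner, so add another tiebreaker column if the choice between them matters. Null dates are least preferred because `sort_values` defaults to `na_position='last'`, which applies even when sorting in descending order.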
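
The null-ordering behavior is easy to verify on a tiny example. This sketch uses a hypothetical three-row frame for a single user:

```python
import pandas as pd

# Hypothetical duplicates for one user; the null-dated row has the highest spend
demo = pd.DataFrame({
    'user_id': [1, 1, 1],
    'signup_dt': pd.to_datetime([None, '2025-01-05', '2025-01-07']),
    'spend_usd': [500.0, 20.0, 10.0],
})

# na_position defaults to 'last' regardless of the ascending flag
newest_first = demo.sort_values('signup_dt', ascending=False)
oldest_first = demo.sort_values('signup_dt', ascending=True)

# The newest dated row wins even though the null-dated row has more spend
winner = newest_first.drop_duplicates(subset=['user_id'], keep='first')

# Opt in to "nulls win" explicitly if the policy ever calls for it
nulls_first = demo.sort_values('signup_dt', ascending=False, na_position='first')
print(winner)
```

The null-dated row lands at the bottom in both sort directions, so it can only win when every row for that user lacks a date.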

#### 3. Show a before/after `dtypes` table and explain at least two `astype`/`to_datetime` choices

In [None]:
# Compare dtypes before and after cleaning
print("BEFORE CLEANING:")
print(users.dtypes)
print("\n" + "="*50 + "\n")
print("AFTER CLEANING:")
print(resolved.dtypes)

**Key type conversion choices:**

1. **`pd.to_datetime(signup_date, errors='coerce')`:**
   - Converts the mixed date strings (YYYY-MM-DD, MM/DD/YYYY, YYYY/MM/DD, and the ambiguous `06-01-2025`, which is read month-first because `dayfirst=False`) into `datetime64[ns]`
   - `errors='coerce'` ensures unparseable values become NaT instead of raising errors
   - On pandas 2.0+, a column that mixes formats needs `format='mixed'`; the older `infer_datetime_format=True` flag is deprecated and no longer has any effect
   - Essential for time-based operations, sorting, and sequencing

2. **`age.astype('Float64')`:**
   - Uses the pandas nullable `Float64` extension type instead of NumPy `float64`
   - Represents missing values as `pd.NA` rather than relying on the float `NaN` sentinel
   - Better for data that should distinguish "missing" from "zero"
   - `Float64` (capital F) is a nullable type; `float64` (lowercase) is not

3. **`pd.to_numeric(spend_string, errors='coerce')`:**
   - Converts currency strings to float after cleaning
   - More robust than `astype(float)`, which raises on the first string it cannot parse
   - `errors='coerce'` turns unparseable values into NaN for safe imputation

4. **`is_marketing_opt_in.fillna(False).astype('bool')`:**
   - First fills nulls with safe default
   - Then converts to strict boolean type (not nullable)
   - Ensures downstream logic can rely on True/False without null checks

### Export Final Cleaned Dataset

In [None]:
import pyarrow.parquet as pq
import pyarrow as pa
from pathlib import Path

# Create output directory
out = Path('artifacts/clean')
out.mkdir(parents=True, exist_ok=True)

# Choose which cleaned dataset to export (resolved or best)
final = resolved  # or `best` depending on policy preference

# Write to parquet
output_path = out / 'users_clean.parquet'
pq.write_table(pa.Table.from_pandas(final, preserve_index=False), output_path)

print(f"Cleaned users data exported to: {output_path}")
print(f"Final row count: {len(final)}")
print(f"File size: {output_path.stat().st_size:,} bytes")
print(f"\nFinal data shape: {final.shape}")
print("\nSample of cleaned data:")
final.head()

### Data Quality Summary

In [None]:
print("DATA QUALITY METRICS")
print("=" * 60)
print(f"Original rows: {len(users)}")
print(f"Rows after dropping missing keys: {len(clean1)}")
print(f"Final rows after deduplication: {len(final)}")
print(f"Total rows removed: {len(users) - len(final)} ({100*(len(users)-len(final))/len(users):.1f}%)")
print("\n" + "=" * 60)

print("\nFINAL MISSING DATA:")
print(final.isna().sum())

print("\n" + "=" * 60)
print("\nDATA TYPE SUMMARY:")
print(final.dtypes)

print("\n" + "=" * 60)
print("\nSPEND STATISTICS:")
print(final['spend_usd'].describe())

print("\n" + "=" * 60)
print("\nCOUNTRY DISTRIBUTION:")
print(final['country_norm'].value_counts())

### Common Pitfalls Avoided

1. **Over-dropping with `dropna()`**: We used `subset=['user_id','email']` instead of dropping all rows with any nulls, which keeps every row except the roughly 2% generated without an email.

2. **`astype(float)` on non-numeric strings**: We used `pd.to_numeric(..., errors='coerce')` which safely handles unparseable values.

3. **Inconsistent duplicate resolution**: We sorted by multiple columns before `drop_duplicates()` to ensure deterministic results.

4. **Mixing object, string, and nullable dtypes**: We were explicit with types like `Float64` for nullable numerics and proper `bool` after imputation.

5. **Ignoring group-wise patterns**: We used country-specific median imputation for spend rather than global median, preserving geographic patterns.

## Conclusion

In this lab, we:

✅ Inspected and quantified missing data across multiple dimensions  
✅ Applied strategic imputation using group-wise statistics  
✅ Normalized inconsistent data formats (currency, dates, categories)  
✅ Converted data types appropriately using nullable types where needed  
✅ Identified and removed duplicates using composite key strategies  
✅ Compared `apply` vs vectorized operations for performance  
✅ Exported clean, analysis-ready data in Parquet format  

The cleaned dataset is now ready for downstream analytics, machine learning, or LLM pipeline integration.