# FRE 521D: Data Analytics in Climate, Food and Environment
## Lecture 9: Data Cleaning I - Formats, Types, Keys, and Deduplication

**Date:** Monday, February 2, 2026  
**Instructor:** Asif Ahmed Neloy  
**Program:** UBC Master of Food and Resource Economics

---

### Today's Agenda

1. Understanding Dirty Data
2. Data Type Standardization
3. Handling European Number Formats
4. Date and Time Parsing
5. String Cleaning and Normalization
6. Key Generation for Unique Identifiers
7. Deduplication Strategies
8. Building Reusable Cleaning Functions

---

## Learning Objectives

By the end of this lecture, you will be able to:

1. Identify common data quality issues in real-world datasets
2. Convert European number formats (comma decimals) to standard floats
3. Parse dates and times from various string formats
4. Clean and normalize string data
5. Generate unique keys for records that lack them
6. Detect and remove duplicate records using multiple strategies
7. Build reusable cleaning functions for production pipelines

---

## Setting Up

In [None]:
# Standard imports
import pandas as pd
import numpy as np
import re
import hashlib
from datetime import datetime, timedelta
import warnings
warnings.filterwarnings('ignore')

# Display settings
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.max_rows', 60)
pd.set_option('display.max_colwidth', 50)

print(f"Pandas version: {pd.__version__}")
print(f"NumPy version: {np.__version__}")
print(f"Current time: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
print("Ready for Data Cleaning I!")

The imports above include `re` for regular expressions (pattern matching in strings) and `hashlib` for generating hash-based unique keys. These are essential tools for data cleaning.

---

## 1. Understanding Dirty Data

### What is "Dirty" Data?

Dirty data is data that contains errors or inconsistencies, or is otherwise unsuitable for analysis. A commonly cited estimate is that data scientists spend 60-80% of their time cleaning data.

### Common Data Quality Issues

```
┌─────────────────────────────────────────────────────────────────┐
│                   TYPES OF DIRTY DATA                          │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  STRUCTURAL ISSUES          CONTENT ISSUES                     │
│  ├── Wrong data types       ├── Missing values                 │
│  ├── Inconsistent formats   ├── Outliers                       │
│  ├── Mixed encodings        ├── Typos and misspellings         │
│  └── Nested structures      └── Invalid values                 │
│                                                                 │
│  CONSISTENCY ISSUES         COMPLETENESS ISSUES                │
│  ├── Duplicates             ├── Missing records                │
│  ├── Contradictions         ├── Truncated data                 │
│  ├── Referential breaks     ├── Partial records                │
│  └── Unit mismatches        └── Missing columns                │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
```

### The Cost of Dirty Data

Dirty data leads to:
- **Wrong conclusions**: Flawed analysis from flawed inputs
- **Failed pipelines**: ETL jobs crash on unexpected values
- **Wasted time**: Hours debugging issues that stem from data quality
- **Lost trust**: Stakeholders question results when errors surface

### Our Datasets

Today we will work with two datasets that have real-world quality issues:

1. **AirQualityUCI.csv**: European dataset with comma decimals and coded missing values
2. **GlobalWeatherRepository.csv**: Large dataset with potential duplicates and inconsistencies

---

In [None]:
# Load the AirQuality dataset
# Note: This file uses semicolons as delimiters (common in European CSVs)

air_quality_path = '../../Datasets/AirQualityUCI.csv'

# First, let's look at the raw file to understand its structure
print("First 3 lines of AirQualityUCI.csv:")
print("-" * 80)
with open(air_quality_path, 'r', encoding='utf-8') as f:
    for i, line in enumerate(f):
        if i < 3:
            print(line.strip()[:100] + "...")
        else:
            break

Notice several issues in the raw data:
1. **Semicolon delimiter** instead of comma (European standard)
2. **Comma as decimal separator** (e.g., "2,6" instead of "2.6")
3. **Extra empty columns** at the end

These are classic signs of a European-formatted CSV file.

In [None]:
# Load with semicolon delimiter
df_air = pd.read_csv(air_quality_path, sep=';', encoding='utf-8')

print(f"Shape: {df_air.shape}")
print(f"\nColumns ({len(df_air.columns)}):")
for i, col in enumerate(df_air.columns):
    print(f"  {i+1:2d}. '{col}'")

print("\nFirst 5 rows:")
df_air.head()

The data is loaded, but there are issues to fix:
- Extra unnamed columns at the end
- Numeric columns are stored as strings because of comma decimals
- The value -200 appears frequently (this is the missing value code)

In [None]:
# Examine data types
print("Data Types:")
print(df_air.dtypes)

Most columns that should be numeric (like CO(GT), C6H6(GT)) are showing as `object` type, which means they contain strings. This is because pandas does not treat a comma as a decimal separator unless told to (for example, through the `decimal` argument of `pd.read_csv`).
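
As an alternative to converting after loading, the parser itself can be told about the decimal mark. The sketch below uses a tiny inline sample with made-up values (not the real file) and illustrative column names, so treat it as a demonstration of the `decimal` argument rather than the lecture's own loading code.

```python
import io
import pandas as pd

# Hypothetical two-row sample in the same style as the file (values are made up)
raw = "Date;CO(GT);T\n10/03/2004;2,6;13,6\n11/03/2004;-200;11,9\n"

# decimal=',' tells the parser to read the comma as the decimal mark
df_sample = pd.read_csv(io.StringIO(raw), sep=';', decimal=',')

print(df_sample.dtypes)
print(df_sample)
```

The manual conversion shown later in this lecture is still worth knowing, because it also works on data that did not come through `read_csv`.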

---

## 2. Data Type Standardization

### Why Types Matter

Wrong data types cause problems:

| Issue | Example | Consequence |
|-------|---------|-------------|
| Math fails | `"2.5" + "3.0"` | Returns `"2.53.0"` (string concat) |
| Sorting breaks | `["10", "2", "1"]` | Sorts as `["1", "10", "2"]` |
| Aggregations wrong | `mean("2", "3")` | Error or NaN |
| Memory waste | Numbers stored as strings | Can use far more memory than numeric dtypes |
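
The first two rows of the table can be reproduced in a few lines. This is a minimal sketch with toy values; the variable names are only for illustration.

```python
import pandas as pd

s = pd.Series(["10", "2", "1"])

# Strings concatenate instead of adding
concatenated = "2.5" + "3.0"
print(concatenated)

# Strings sort character by character; numbers sort by magnitude
string_sorted = sorted(s)
numeric_sorted = sorted(pd.to_numeric(s))
print(string_sorted)
print(numeric_sorted)
```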

### Type Conversion Strategy

```
┌─────────────────────────────────────────────────────────────────┐
│                TYPE CONVERSION WORKFLOW                         │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│   1. IDENTIFY  ───►  Which columns need conversion?             │
│                      Look at dtypes and sample values           │
│                                                                 │
│   2. CLEAN     ───►  Remove/replace characters that block       │
│                      conversion (commas, currency symbols)      │
│                                                                 │
│   3. CONVERT   ───►  Apply type conversion with error handling  │
│                      Use pd.to_numeric() or astype()            │
│                                                                 │
│   4. VERIFY    ───►  Check results and handle failures          │
│                      Look for NaN from failed conversions       │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
```

In [None]:
# First, let's drop the empty columns at the end
# These are artifacts from the CSV export

# Find columns that are entirely empty
empty_cols = [col for col in df_air.columns if df_air[col].isna().all()]
print(f"Empty columns to drop: {empty_cols}")

# Also drop unnamed columns
unnamed_cols = [col for col in df_air.columns if 'Unnamed' in str(col)]
print(f"Unnamed columns to drop: {unnamed_cols}")

# Drop both
cols_to_drop = list(set(empty_cols + unnamed_cols))
df_air_clean = df_air.drop(columns=cols_to_drop)

print(f"\nShape after dropping: {df_air_clean.shape}")

We removed the empty columns that were artifacts of the European CSV export. Now we have a cleaner dataset to work with.

---

## 3. Handling European Number Formats

### The Problem

In many European countries, the decimal separator is a comma, not a period:

| Region | Number Format | Example |
|--------|--------------|----------|
| US/UK | Period decimal | 1,234.56 |
| Europe (many) | Comma decimal | 1.234,56 |

When data is exported from European systems, pandas sees "2,6" and treats it as a string because it doesn't recognize the comma as a decimal point.

### Solution: Replace and Convert

In [None]:
def convert_european_decimal(series):
    """
    Convert a series with European decimal format (comma) to numeric.
    
    Parameters:
    -----------
    series : pd.Series
        Series with European-formatted numbers (e.g., "2,6")
    
    Returns:
    --------
    pd.Series : Numeric series with proper float values
    
    Example:
    --------
    >>> s = pd.Series(['2,6', '3,14', '-1,5'])
    >>> convert_european_decimal(s)
    0    2.60
    1    3.14
    2   -1.50
    """
    # Step 1: Convert to string (handles mixed types)
    str_series = series.astype(str)
    
    # Step 2: Replace comma with period
    str_series = str_series.str.replace(',', '.', regex=False)
    
    # Step 3: Convert to numeric, coercing errors to NaN
    numeric_series = pd.to_numeric(str_series, errors='coerce')
    
    return numeric_series


# Test the function
test_values = pd.Series(['2,6', '3,14', '-1,5', 'N/A', '100'])
print("Before conversion:")
print(test_values)
print(f"Type: {test_values.dtype}")

print("\nAfter conversion:")
converted = convert_european_decimal(test_values)
print(converted)
print(f"Type: {converted.dtype}")

The `convert_european_decimal` function handles the conversion in three steps:

1. **Convert to string**: Ensures consistent input type
2. **Replace comma with period**: Transforms European to US format
3. **Convert to numeric**: Uses `pd.to_numeric` with error handling

The `errors='coerce'` parameter is important: it converts invalid values to NaN instead of raising an error, which allows the pipeline to continue.
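
The function above only swaps the decimal comma, so a value with a period as thousands separator (such as "1.234,56" from the table earlier) would be coerced to NaN. A possible variant is sketched below. The name `convert_european_number` is hypothetical, and the approach assumes the source is known to use periods only as thousands separators: a US-style "2.6" would wrongly become 26.

```python
import pandas as pd

def convert_european_number(series):
    """Hypothetical variant: drop '.' thousands separators, then swap the decimal comma."""
    text = series.astype(str).str.strip()
    text = text.str.replace('.', '', regex=False)   # 1.234,56 -> 1234,56
    text = text.str.replace(',', '.', regex=False)  # 1234,56  -> 1234.56
    return pd.to_numeric(text, errors='coerce')

values = pd.Series(['1.234,56', '2,6', '-1.000', 'N/A'])
parsed = convert_european_number(values)
print(parsed)

# Always check how many values failed to convert
print(f"Unparsed values: {parsed.isna().sum()}")
```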

In [None]:
# Identify columns that need European decimal conversion
# These are columns with object dtype that should be numeric

# Check which columns have comma in their values
def check_european_format(df):
    """
    Identify columns that appear to have European decimal format.
    
    Returns a list of column names where values contain commas
    that look like decimal separators.
    """
    european_cols = []
    
    for col in df.columns:
        if df[col].dtype == 'object':
            # Sample non-null values
            sample = df[col].dropna().head(100).astype(str)
            
            # Check if values match pattern like "2,6" or "-1,23"
            comma_count = sample.str.contains(r'^-?\d+,\d+$', regex=True).sum()
            
            if len(sample) > 0 and comma_count / len(sample) > 0.10:  # More than 10% of sampled values match
                european_cols.append(col)
    
    return european_cols


european_columns = check_european_format(df_air_clean)
print(f"Columns with European decimal format ({len(european_columns)}):")
for col in european_columns:
    sample_val = df_air_clean[col].dropna().iloc[0] if df_air_clean[col].notna().any() else 'N/A'
    print(f"  - {col}: example value = '{sample_val}'")

We've identified which columns need conversion. The detection function looks for the pattern digits, comma, digits, which is characteristic of European decimals.

In [None]:
# Convert all European decimal columns
print("Converting European decimal columns...\n")

for col in european_columns:
    original_dtype = df_air_clean[col].dtype
    original_sample = df_air_clean[col].iloc[0]
    
    df_air_clean[col] = convert_european_decimal(df_air_clean[col])
    
    new_dtype = df_air_clean[col].dtype
    new_sample = df_air_clean[col].iloc[0]
    
    print(f"{col}:")
    print(f"  Before: {original_sample} ({original_dtype})")
    print(f"  After:  {new_sample} ({new_dtype})")

print("\nConversion complete!")

All the European decimal columns have been converted to proper float types. The example values show the transformation: "2,6" became 2.6.
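
Once the columns are numeric, the -200 missing-value code noted earlier can be turned into real missing values. The sketch below runs on a toy frame with made-up readings; `MISSING_CODE` is an illustrative name, not something defined elsewhere in the lecture.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for converted sensor columns (values are made up)
toy = pd.DataFrame({'CO(GT)': [2.6, -200.0, 2.2], 'T': [13.6, 13.3, -200.0]})

MISSING_CODE = -200  # sentinel used for missing readings in this dataset

# Replace the sentinel with NaN so that means and sums ignore it
toy_clean = toy.replace(MISSING_CODE, np.nan)

print(toy_clean)
print(toy_clean.isna().sum())
```

Leaving the sentinel in place would silently drag down any average computed on these columns.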

---

## 4. Date and Time Parsing

### The Challenge with Dates

Date formats vary widely across sources:

| Format | Example | Region/Standard |
|--------|---------|------------------|
| DD/MM/YYYY | 25/12/2025 | Europe, most of world |
| MM/DD/YYYY | 12/25/2025 | United States |
| YYYY-MM-DD | 2025-12-25 | ISO 8601 (databases) |
| DD-Mon-YY | 25-Dec-25 | Legacy systems |

### Ambiguous Dates

The date "01/02/2025" could mean:
- January 2, 2025 (US format)
- February 1, 2025 (European format)

You must know your data source to parse correctly!
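
The ambiguity is easy to see by parsing the same string with two explicit formats. This is a minimal sketch; the variable names are illustrative.

```python
import pandas as pd

ambiguous = "01/02/2025"

# Same text, two different dates depending on the declared format
as_european = pd.to_datetime(ambiguous, format="%d/%m/%Y")
as_us = pd.to_datetime(ambiguous, format="%m/%d/%Y")

print(as_european.date())
print(as_us.date())
```

Both calls succeed without any warning, which is why the format has to come from knowledge of the source rather than from trial and error.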

In [None]:
# Examine the date column in our air quality data
print("Date column sample:")
print(df_air_clean['Date'].head(10))
print(f"\nCurrent dtype: {df_air_clean['Date'].dtype}")

The date format is DD/MM/YYYY (European format). We need to specify this when parsing.

In [None]:
def parse_date_smart(series, formats=None):
    """
    Parse date strings by trying multiple formats.
    
    Parameters:
    -----------
    series : pd.Series
        Series with date strings
    formats : list
        List of date formats to try, in order of preference
        Default: common European and US formats
    
    Returns:
    --------
    pd.Series : Datetime series
    """
    if formats is None:
        formats = [
            '%d/%m/%Y',    # European: 25/12/2025
            '%Y-%m-%d',    # ISO: 2025-12-25
            '%m/%d/%Y',    # US: 12/25/2025
            '%d-%m-%Y',    # European with dashes
            '%Y/%m/%d',    # Alternative ISO
        ]
    
    # Try each explicit format in order; accept the first that parses most values
    for fmt in formats:
        try:
            result = pd.to_datetime(series, format=fmt, errors='coerce')
            
            # Check if most values parsed successfully
            success_rate = result.notna().sum() / len(result)
            
            if success_rate > 0.9:  # 90% success threshold
                print(f"  Parsed with format '{fmt}' (success rate: {success_rate:.1%})")
                return result
        except (ValueError, TypeError):
            continue
    
    # Fallback to pandas inference
    print("  Using pandas automatic inference")
    return pd.to_datetime(series, errors='coerce', dayfirst=True)


# Parse the Date column
print("Parsing Date column:")
df_air_clean['Date'] = parse_date_smart(df_air_clean['Date'])

print(f"\nNew dtype: {df_air_clean['Date'].dtype}")
print(f"Date range: {df_air_clean['Date'].min()} to {df_air_clean['Date'].max()}")

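The `parse_date_smart` function tries the date formats in order and uses the first one that parses more than 90% of the values. This is more robust than assuming a single format, but the order of the list matters: put the format you expect from the source first.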

Key features:
1. **Multiple format attempts**: Tries common formats in order of likelihood
2. **Success rate check**: Ensures the format actually works for most data
3. **Fallback option**: Uses pandas inference if explicit formats fail

In [None]:
# Parse the Time column
# Time is in format HH.MM.SS (dots instead of colons - European style)

print("Time column sample:")
print(df_air_clean['Time'].head())

# Replace dots with colons for standard time format
df_air_clean['Time'] = df_air_clean['Time'].astype(str).str.replace('.', ':', regex=False)

print("\nAfter standardization:")
print(df_air_clean['Time'].head())

The time column used periods as separators (18.00.00) instead of colons (18:00:00). A simple string replacement standardizes this.

In [None]:
# Combine Date and Time into a single datetime column

df_air_clean['datetime'] = pd.to_datetime(
    df_air_clean['Date'].astype(str) + ' ' + df_air_clean['Time'],
    errors='coerce'
)

print("Combined datetime column:")
print(df_air_clean[['Date', 'Time', 'datetime']].head())

# Now we can do time-based analysis
print(f"\nDatetime range:")
print(f"  Start: {df_air_clean['datetime'].min()}")
print(f"  End:   {df_air_clean['datetime'].max()}")
print(f"  Duration: {df_air_clean['datetime'].max() - df_air_clean['datetime'].min()}")

Having a proper datetime column enables:
- Time-series analysis
- Grouping by hour, day, month
- Filtering by date ranges
- Calculating time differences
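
A short sketch of those operations on a toy frame follows. The readings are made up and the column name `temp` is only for illustration.

```python
import pandas as pd

# Toy hourly readings (made-up values) with a proper datetime column
toy = pd.DataFrame({
    'datetime': pd.to_datetime(['2004-03-10 18:00:00', '2004-03-10 19:00:00',
                                '2004-03-11 18:00:00', '2004-03-11 19:00:00']),
    'temp': [13.6, 13.3, 11.9, 11.0],
})

# Group by hour of day
hourly_mean = toy.groupby(toy['datetime'].dt.hour)['temp'].mean()

# Filter by date range
one_day = toy[(toy['datetime'] >= '2004-03-11') & (toy['datetime'] < '2004-03-12')]

# Time difference between consecutive readings
gaps = toy['datetime'].diff()

print(hourly_mean)
print(one_day)
print(gaps)
```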

---

## 5. String Cleaning and Normalization

### Common String Problems

```
┌─────────────────────────────────────────────────────────────────┐
│                   STRING QUALITY ISSUES                         │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  WHITESPACE           CASE                    ENCODING          │
│  " Canada "           "CANADA"                "Caf\xe9"         │
│  "Canada  "           "canada"                "Caf\u00e9"       │
│  "  Canada"           "Canada"                                  │
│                       "CaNaDa"                                  │
│                                                                 │
│  SPECIAL CHARS        VARIATIONS             ABBREVIATIONS     │
│  "Canada!"            "United States"        "US" vs "USA"     │
│  "Canada\n"           "United States of..."  "UK" vs "U.K."    │
│  "Canada\t"           "U.S.A."                                  │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
```

### Normalization Strategy

Normalization means transforming strings to a consistent format:
1. Strip leading/trailing whitespace
2. Normalize internal whitespace (multiple spaces to one)
3. Standardize case (usually lowercase or title case)
4. Remove or replace special characters
5. Apply domain-specific mappings (e.g., country name variations)
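
Step 5 is usually a lookup table. The sketch below assumes a small hand-written mapping; `COUNTRY_ALIASES` and its entries are hypothetical and would need to be built from the variants that actually occur in your data.

```python
import pandas as pd

# Hypothetical mapping from known variants to one canonical name
COUNTRY_ALIASES = {
    'us': 'United States',
    'usa': 'United States',
    'u.s.a.': 'United States',
    'united states of america': 'United States',
    'uk': 'United Kingdom',
    'u.k.': 'United Kingdom',
}

raw = pd.Series(['USA', ' U.S.A. ', 'uk', 'Canada'])

# Normalize first so the lookup keys are predictable
lookup = raw.str.strip().str.lower()

# map() yields a missing value for unknown keys; fall back to the trimmed original
canonical = lookup.map(COUNTRY_ALIASES).fillna(raw.str.strip())
print(canonical.tolist())
```

Applying the mapping after steps 1 to 3 keeps the dictionary small, because it only needs one entry per lowercase variant.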

In [None]:
def normalize_string(series, case='lower', strip=True, collapse_whitespace=True):
    """
    Normalize string values in a series.
    
    Parameters:
    -----------
    series : pd.Series
        Series with string values
    case : str
        'lower', 'upper', 'title', or None for no change
    strip : bool
        Remove leading/trailing whitespace
    collapse_whitespace : bool
        Replace multiple spaces with single space
    
    Returns:
    --------
    pd.Series : Normalized string series
    """
    # Convert to string type
    result = series.astype(str)
    
    # Strip whitespace
    if strip:
        result = result.str.strip()
    
    # Collapse multiple whitespace
    if collapse_whitespace:
        result = result.str.replace(r'\s+', ' ', regex=True)
    
    # Apply case transformation
    if case == 'lower':
        result = result.str.lower()
    elif case == 'upper':
        result = result.str.upper()
    elif case == 'title':
        result = result.str.title()
    
    return result


# Demonstrate with messy data
messy_data = pd.Series([
    '  Canada  ',
    'UNITED STATES',
    'united   kingdom',
    'Brazil',
    '   Germany\t',
])

print("Original (showing repr to see whitespace):")
for val in messy_data:
    print(f"  {val!r}")

print("\nNormalized (lowercase):")
normalized = normalize_string(messy_data, case='lower')
for val in normalized:
    print(f"  '{val}'")

print("\nNormalized (title case):")
normalized_title = normalize_string(messy_data, case='title')
for val in normalized_title:
    print(f"  '{val}'")

The `normalize_string` function handles the most common string cleaning tasks:

1. **strip()**: Removes leading and trailing whitespace
2. **Regex replace**: `\s+` matches one or more whitespace characters, replaced with single space
3. **Case transformation**: Standardizes capitalization

After normalization, "  Canada  " and "CANADA" both become "canada" (or "Canada" in title case).
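
One caveat with `astype(str)`: depending on the pandas version, it can turn a missing value into literal text such as "nan" or "None", which then looks like a real country name. A possible workaround, sketched here as an assumption rather than as part of `normalize_string`, is to cast to the nullable `'string'` dtype so missing entries stay missing.

```python
import pandas as pd

with_missing = pd.Series(['  Canada  ', None, 'BRAZIL'])

# Plain str cast: the missing entry may become literal text (version dependent)
as_str = with_missing.astype(str).str.strip().str.lower()

# Nullable string dtype: the missing entry stays missing
as_string = with_missing.astype('string').str.strip().str.lower()

print(as_str.tolist())
print(as_string.tolist())
print(as_string.isna().tolist())
```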

In [None]:
# Load the weather data to demonstrate country name normalization
weather_path = '../../Datasets/GlobalWeatherRepository.csv'
df_weather = pd.read_csv(weather_path)

print(f"Weather data shape: {df_weather.shape}")
print(f"\nUnique countries: {df_weather['country'].nunique()}")
print("\nSample country names:")
print(df_weather['country'].value_counts().head(10))

The country names appear consistent in this dataset, but let's check for any anomalies.

In [None]:
# Check for potential country name issues

def analyze_string_column(series, column_name):
    """
    Analyze a string column for potential quality issues.
    """
    print(f"Analysis of '{column_name}':")
    print("-" * 50)
    
    # Convert to string
    str_series = series.astype(str)
    
    # Check for leading/trailing whitespace
    has_leading_space = (str_series != str_series.str.lstrip()).sum()
    has_trailing_space = (str_series != str_series.str.rstrip()).sum()
    print(f"  Values with leading whitespace: {has_leading_space}")
    print(f"  Values with trailing whitespace: {has_trailing_space}")
    
    # Check for multiple internal spaces
    has_multiple_spaces = str_series.str.contains(r'\s{2,}', regex=True).sum()
    print(f"  Values with multiple internal spaces: {has_multiple_spaces}")
    
    # Check case distribution
    all_upper = (str_series == str_series.str.upper()).sum()
    all_lower = (str_series == str_series.str.lower()).sum()
    title_case = (str_series == str_series.str.title()).sum()
    print(f"  All uppercase: {all_upper}")
    print(f"  All lowercase: {all_lower}")
    print(f"  Title case: {title_case}")
    
    # Check for special characters
    has_special = str_series.str.contains(r'[^a-zA-Z0-9\s\-\']', regex=True).sum()
    print(f"  Values with special characters: {has_special}")
    
    return


analyze_string_column(df_weather['country'], 'country')

The analysis function checks for common string quality issues. This helps identify what cleaning is needed.

---

## 6. Key Generation for Unique Identifiers

### Why Keys Matter

Every record should have a unique identifier for:
- **Deduplication**: Identifying duplicate records
- **Joins**: Linking related tables
- **Updates**: Knowing which record to modify
- **Tracking**: Following data lineage

### Types of Keys

```
┌─────────────────────────────────────────────────────────────────┐
│                      KEY TYPES                                  │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  NATURAL KEYS              SURROGATE KEYS                       │
│  ├── Exist in the data     ├── Generated/synthetic             │
│  ├── Have business meaning ├── No business meaning             │
│  ├── May change            ├── Never change                    │
│  └── e.g., email, SSN      └── e.g., auto-increment, UUID      │
│                                                                 │
│  COMPOSITE KEYS            HASH KEYS                            │
│  ├── Multiple columns      ├── Hash of column values           │
│  ├── Combined uniqueness   ├── Deterministic                   │
│  └── e.g., (date, city)    └── e.g., MD5(date+city+temp)       │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
```

In [None]:
def generate_row_hash(df, columns, hash_name='row_hash'):
    """
    Generate a hash-based unique key from multiple columns.
    
    Parameters:
    -----------
    df : pd.DataFrame
        DataFrame to add hash to
    columns : list
        Column names to include in hash
    hash_name : str
        Name for the new hash column
    
    Returns:
    --------
    pd.DataFrame : DataFrame with new hash column
    """
    # Create a copy to avoid modifying original
    result = df.copy()
    
    # Concatenate column values into a string
    combined = result[columns].astype(str).agg('|'.join, axis=1)
    
    # Generate MD5 hash for each row
    result[hash_name] = combined.apply(
        lambda x: hashlib.md5(x.encode()).hexdigest()
    )
    
    return result


# Generate a unique key for air quality records
# A record is unique by date + time (one measurement per hour)
key_columns = ['Date', 'Time']

df_air_clean = generate_row_hash(df_air_clean, key_columns, 'record_id')

print("Generated record IDs:")
print(df_air_clean[['Date', 'Time', 'record_id']].head(10))

The `generate_row_hash` function creates a deterministic unique identifier by:

1. **Concatenating column values**: Joins selected columns with a delimiter
2. **Hashing the result**: MD5 produces a 32-character hex string

Benefits of hash-based keys:
- **Deterministic**: Same input always produces same hash
- **Compact**: Fixed length regardless of input size
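- **Low accidental collision risk**: Different inputs produce different hashes with very high probability. MD5 is not secure against deliberately crafted collisions, so use it for record keys only, not for security purposes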
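
Two of these properties, plus the reason for the `|` delimiter, can be checked directly. The helper `row_key` below is an illustrative stand-in written for this sketch, not the lecture's `generate_row_hash`.

```python
import hashlib

def row_key(values, sep='|'):
    """Illustrative helper: MD5 hex digest of the joined values."""
    joined = sep.join(str(v) for v in values)
    return hashlib.md5(joined.encode('utf-8')).hexdigest()

# Deterministic: the same inputs always give the same key
key_a = row_key(['2004-03-10', '18:00:00'])
key_b = row_key(['2004-03-10', '18:00:00'])

# With a separator, ('ab', 'c') and ('a', 'bc') stay distinct.
# Without one, both join to 'abc' and share a key.
with_sep = (row_key(['ab', 'c']), row_key(['a', 'bc']))
no_sep = (row_key(['ab', 'c'], sep=''), row_key(['a', 'bc'], sep=''))

print(key_a == key_b, len(key_a))
print(with_sep[0] == with_sep[1], no_sep[0] == no_sep[1])
```

The separator should be a character that cannot appear in the column values themselves, otherwise the same ambiguity returns.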

In [None]:
# Alternative: Create a composite key for weather data
# Using country + location + timestamp

def create_composite_key(df, columns, key_name='composite_key', separator='_'):
    """
    Create a human-readable composite key from multiple columns.
    
    Parameters:
    -----------
    df : pd.DataFrame
        DataFrame to add key to
    columns : list
        Column names to combine
    key_name : str
        Name for new key column
    separator : str
        Character to separate values
    
    Returns:
    --------
    pd.DataFrame : DataFrame with new key column
    """
    result = df.copy()
    
    # Clean and combine values
    parts = []
    for col in columns:
        # Convert to string and clean
        clean_col = result[col].astype(str).str.lower()
        clean_col = clean_col.str.replace(r'[^a-z0-9]', '', regex=True)
        parts.append(clean_col)
    
    # Join with separator
    result[key_name] = parts[0]
    for part in parts[1:]:
        result[key_name] = result[key_name] + separator + part
    
    return result


# Create composite key for weather data
df_weather_keyed = create_composite_key(
    df_weather.head(1000),  # Sample for demo
    ['country', 'location_name', 'last_updated'],
    'weather_key'
)

print("Composite keys:")
print(df_weather_keyed[['country', 'location_name', 'last_updated', 'weather_key']].head())

Composite keys are human-readable but can be long. They're useful when you need to inspect the key to understand which record it represents.
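
A key is only useful if it is actually unique, so it is worth checking before relying on it. The sketch below builds a simple key on a toy frame with made-up rows that mimic the weather columns.

```python
import pandas as pd

# Toy rows (made up); the first two are the same observation
toy = pd.DataFrame({
    'country': ['Canada', 'Canada', 'Brazil'],
    'location_name': ['Ottawa', 'Ottawa', 'Brasilia'],
    'last_updated': ['2024-05-16 09:00', '2024-05-16 09:00', '2024-05-16 10:00'],
})

key = toy['country'] + '_' + toy['location_name'] + '_' + toy['last_updated']

# is_unique is False as soon as any key value repeats
print(key.is_unique)

# Show every row that shares its key with another row
repeated = key[key.duplicated(keep=False)]
print(repeated)
```

If the check fails, either the rows are true duplicates or the chosen columns do not identify a record and another column is needed.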

---

## 7. Deduplication Strategies

### Types of Duplicates

```
┌─────────────────────────────────────────────────────────────────┐
│                    DUPLICATE TYPES                              │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  EXACT DUPLICATES           FUZZY DUPLICATES                    │
│  ├── All columns match      ├── Similar but not identical      │
│  ├── Easy to detect         ├── Require similarity metrics     │
│  └── Drop one copy          └── Need rules for merging         │
│                                                                 │
│  Example:                   Example:                            │
│  "John Smith, 123 Main"     "John Smith, 123 Main St"          │
│  "John Smith, 123 Main"     "J. Smith, 123 Main Street"        │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
```

### Deduplication Approaches

1. **Exact match**: Use `drop_duplicates()` on all or selected columns
2. **Key-based**: Deduplicate on generated unique key
3. **Fuzzy matching**: Use similarity algorithms (Levenshtein, Jaro-Winkler)
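
For a first look at fuzzy matching without extra packages, Python's standard library offers `difflib.SequenceMatcher`. Note that its ratio is a different similarity measure from Levenshtein or Jaro-Winkler; it is used here only as a stand-in, and the 0.8 threshold is an assumed value that would need tuning on real data.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Similarity ratio in [0, 1] after trimming and lowercasing both strings."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()

exact = similarity("John Smith, 123 Main", "John Smith, 123 Main")
close = similarity("John Smith, 123 Main St", "J. Smith, 123 Main Street")
far = similarity("John Smith, 123 Main St", "Eve Adams, 9 Oak Ave")

THRESHOLD = 0.8  # assumed cutoff for flagging a likely duplicate

for label, score in [("exact", exact), ("close", close), ("far", far)]:
    print(f"{label}: {score:.2f} likely duplicate: {score >= THRESHOLD}")
```

Pairs above the threshold are candidates for review, not automatic merges; fuzzy matches still need a rule for which version of the record to keep.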

In [None]:
# Check for duplicates in air quality data

print("Duplicate Analysis for Air Quality Data:")
print("=" * 50)

# Check exact duplicates (all columns)
exact_dups = df_air_clean.duplicated().sum()
print(f"\n1. Exact duplicates (all columns): {exact_dups}")

# Check duplicates on key columns
key_dups = df_air_clean.duplicated(subset=['Date', 'Time']).sum()
print(f"2. Duplicate timestamps (Date + Time): {key_dups}")

# If duplicates exist, show them
if key_dups > 0:
    print("\nSample duplicate timestamps:")
    dup_mask = df_air_clean.duplicated(subset=['Date', 'Time'], keep=False)
    print(df_air_clean[dup_mask].head(10))

The check shows whether our data has duplicate records. The `duplicated()` function marks rows that are duplicates of previous rows.

In [None]:
def deduplicate_dataframe(df, subset=None, keep='first', report=True):
    """
    Remove duplicate rows with detailed reporting.
    
    Parameters:
    -----------
    df : pd.DataFrame
        DataFrame to deduplicate
    subset : list
        Column names to check for duplicates (None = all columns)
    keep : str
        'first' keeps first occurrence, 'last' keeps last, False drops all
    report : bool
        Print deduplication report
    
    Returns:
    --------
    pd.DataFrame : Deduplicated DataFrame
    """
    original_count = len(df)
    
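    # Find every row that belongs to a duplicate group.
    # keep=False flags all copies, so this count can exceed the number of rows removed.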
    dup_mask = df.duplicated(subset=subset, keep=False)
    dup_count = dup_mask.sum()
    
    # Drop duplicates
    df_clean = df.drop_duplicates(subset=subset, keep=keep)
    final_count = len(df_clean)
    
    if report:
        print("Deduplication Report:")
        print("-" * 40)
        print(f"  Original rows:     {original_count:,}")
        print(f"  Duplicate rows:    {dup_count:,}")
        print(f"  Rows removed:      {original_count - final_count:,}")
        print(f"  Final rows:        {final_count:,}")
        print(f"  Duplicate rate:    {(original_count - final_count) / original_count * 100:.2f}%")
    
    return df_clean


# Deduplicate air quality data
df_air_deduped = deduplicate_dataframe(
    df_air_clean,
    subset=['Date', 'Time'],  # Unique by timestamp
    keep='first'
)

The `deduplicate_dataframe` function provides a comprehensive report showing exactly how many duplicates were found and removed. The `keep='first'` parameter retains the first occurrence when duplicates are found.

In [None]:
# Create a dataset with intentional duplicates to demonstrate

sample_data = pd.DataFrame({
    'id': [1, 2, 2, 3, 4, 4, 4, 5],
    'name': ['Alice', 'Bob', 'Bob', 'Charlie', 'David', 'David', 'David', 'Eve'],
    'value': [100, 200, 200, 300, 400, 401, 400, 500]  # Note: David has different values
})

print("Original data with duplicates:")
print(sample_data)

print("\n" + "="*50)
print("Strategy 1: Keep first occurrence")
print("="*50)
result1 = sample_data.drop_duplicates(subset=['id'], keep='first')
print(result1)

print("\n" + "="*50)
print("Strategy 2: Keep last occurrence")
print("="*50)
result2 = sample_data.drop_duplicates(subset=['id'], keep='last')
print(result2)

print("\n" + "="*50)
print("Strategy 3: Drop all duplicates")
print("="*50)
result3 = sample_data.drop_duplicates(subset=['id'], keep=False)
print(result3)

The three strategies show different outcomes:

1. **keep='first'**: Keeps the first row per `id`: index 1 (Bob, 200) and index 4 (David, 400)
2. **keep='last'**: Keeps the last row per `id`: index 2 (Bob, 200) and index 6 (David, 400)
3. **keep=False**: Drops every row whose `id` is repeated, leaving only Alice, Charlie, and Eve

The choice depends on your data semantics. If duplicates have different values (like David, whose 401 reading at index 5 is discarded by both `first` and `last`), you need a rule for which value to keep.
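
Two possible rules for conflicting duplicates are sketched below on a reduced copy of the sample data. Both rules are assumptions chosen for illustration; the right one depends on what the values mean.

```python
import pandas as pd

sample = pd.DataFrame({
    'id': [4, 4, 4, 5],
    'name': ['David', 'David', 'David', 'Eve'],
    'value': [400, 401, 400, 500],
})

# Rule A (assumed): keep the row with the largest value per id
keep_max = (sample.sort_values('value', ascending=False)
                  .drop_duplicates(subset=['id'], keep='first')
                  .sort_values('id'))

# Rule B (assumed): keep the most frequent value per id
keep_mode = sample.groupby('id')['value'].agg(lambda v: v.mode().iloc[0])

print(keep_max)
print(keep_mode)
```

Sorting before `drop_duplicates` is a general pattern: order the rows so the preferred record comes first, then keep the first row per key.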

---

## 8. Building Reusable Cleaning Functions

### The Data Cleaning Pipeline Pattern

Rather than writing one-off cleaning code, build a reusable pipeline:

```
┌─────────────────────────────────────────────────────────────────┐
│                  DATA CLEANING PIPELINE                         │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Raw Data ───► Clean Types ───► Standardize ───► Dedupe        │
│       │              │              │              │            │
│       ▼              ▼              ▼              ▼            │
│  [Audit Log]   [Audit Log]   [Audit Log]    [Audit Log]        │
│                                                                 │
│                        │                                        │
│                        ▼                                        │
│                  Clean Data + Quality Report                    │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
```

In [None]:
class DataCleaner:
    """
    A reusable data cleaning pipeline.
    
    This class encapsulates common cleaning operations with
    automatic logging and quality reporting.
    """
    
    def __init__(self, df, name='dataset'):
        """
        Initialize the cleaner with a DataFrame.
        
        Parameters:
        -----------
        df : pd.DataFrame
            Raw data to clean
        name : str
            Name for logging purposes
        """
        self.df = df.copy()
        self.name = name
        self.log = []  # Track all operations
        self.original_shape = df.shape
        
        self._log(f"Initialized with {df.shape[0]:,} rows, {df.shape[1]} columns")
    
    def _log(self, message):
        """Add entry to operation log."""
        entry = {
            'timestamp': datetime.now().isoformat(),
            'operation': message,
            'rows': len(self.df),
            'columns': len(self.df.columns)
        }
        self.log.append(entry)
        print(f"  [{entry['rows']:,} rows] {message}")
    
    def drop_columns(self, columns):
        """Drop specified columns."""
        existing = [c for c in columns if c in self.df.columns]
        self.df = self.df.drop(columns=existing)
        self._log(f"Dropped columns: {existing}")
        return self
    
    def drop_empty_columns(self):
        """Drop columns that are entirely null."""
        empty_cols = [c for c in self.df.columns if self.df[c].isna().all()]
        if empty_cols:
            self.df = self.df.drop(columns=empty_cols)
            self._log(f"Dropped {len(empty_cols)} empty columns")
        return self
    
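    def convert_european_decimals(self, columns):
        """Convert comma-decimal strings (e.g. '2,6') to numeric.

        Values that cannot be parsed become NaN. Thousands separators
        (e.g. '1.234,56') are not handled here.
        """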
        for col in columns:
            if col in self.df.columns:
                self.df[col] = self.df[col].astype(str).str.replace(',', '.', regex=False)
                self.df[col] = pd.to_numeric(self.df[col], errors='coerce')
        self._log(f"Converted European decimals in {len(columns)} columns")
        return self
    
    def parse_dates(self, column, format=None):
        """Parse date column."""
        self.df[column] = pd.to_datetime(self.df[column], format=format, errors='coerce')
        self._log(f"Parsed dates in '{column}'")
        return self
    
    def normalize_strings(self, columns, case='lower'):
        """Normalize string columns (trim, collapse whitespace, set case)."""
        for col in columns:
            if col in self.df.columns:
                s = self.df[col].astype(object)
                # Convert only non-null values so NaN is not turned into the text 'nan'
                s = s.where(s.isna(), s.astype(str)).str.strip()
                s = s.str.replace(r'\s+', ' ', regex=True)
                if case == 'lower':
                    s = s.str.lower()
                elif case == 'upper':
                    s = s.str.upper()
                elif case == 'title':
                    s = s.str.title()
                self.df[col] = s
        self._log(f"Normalized {len(columns)} string columns")
        return self
    
    def replace_values(self, column, mapping):
        """Replace values according to mapping."""
        self.df[column] = self.df[column].replace(mapping)
        self._log(f"Replaced values in '{column}'")
        return self
    
    def replace_missing_codes(self, columns, codes, replacement=np.nan):
        """Replace coded missing values with NaN."""
        for col in columns:
            if col in self.df.columns:
                self.df[col] = self.df[col].replace(codes, replacement)
        self._log(f"Replaced missing codes {codes} in {len(columns)} columns")
        return self
    
    def add_hash_key(self, columns, key_name='row_hash'):
        """Generate hash key from columns."""
        combined = self.df[columns].astype(str).agg('|'.join, axis=1)
        self.df[key_name] = combined.apply(lambda x: hashlib.md5(x.encode()).hexdigest())
        self._log(f"Added hash key '{key_name}' from {columns}")
        return self
    
    def deduplicate(self, subset=None, keep='first'):
        """Remove duplicate rows."""
        before = len(self.df)
        self.df = self.df.drop_duplicates(subset=subset, keep=keep)
        removed = before - len(self.df)
        self._log(f"Removed {removed:,} duplicate rows")
        return self
    
    def get_result(self):
        """Return the cleaned DataFrame."""
        return self.df
    
    def get_report(self):
        """Generate cleaning report."""
        report = {
            'dataset': self.name,
            'original_shape': self.original_shape,
            'final_shape': self.df.shape,
            'rows_removed': self.original_shape[0] - len(self.df),
            'columns_removed': self.original_shape[1] - len(self.df.columns),
            'operations': len(self.log),
            'log': self.log
        }
        return report

The `DataCleaner` class provides a fluent interface for data cleaning. Key design features:

1. **Method chaining**: Each method returns `self`, allowing `cleaner.drop_empty_columns().convert_european_decimals(cols).deduplicate()`
2. **Automatic logging**: Each operation is recorded with a timestamp and the row and column counts after it ran
3. **Non-destructive**: The input DataFrame is copied in `__init__`, so the caller's data is never modified
4. **Reporting**: Get a summary of all operations performed
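
The same chaining idea can also be expressed with plain functions and pandas' built-in `DataFrame.pipe`, which may be enough when logging is not needed. This is a minimal sketch, not a replacement for the class above; the helper names `strip_whitespace` and `drop_exact_duplicates` and the sample values are hypothetical.

```python
import pandas as pd

def strip_whitespace(df, columns):
    """Return a copy with leading/trailing whitespace removed from the given columns."""
    out = df.copy()
    for col in columns:
        out[col] = out[col].str.strip()
    return out

def drop_exact_duplicates(df, subset=None):
    """Return a copy without duplicate rows (first occurrence kept)."""
    return df.drop_duplicates(subset=subset, keep='first').reset_index(drop=True)

# Made-up data: the first two rows differ only by stray whitespace
raw = pd.DataFrame({
    'country': [' Canada', 'Canada ', 'Brazil'],
    'crop': ['wheat', 'wheat', 'soy'],
})

cleaned = (
    raw
    .pipe(strip_whitespace, columns=['country'])
    .pipe(drop_exact_duplicates, subset=['country', 'crop'])
)
print(cleaned)
```

Each function returns a new DataFrame, so `raw` stays unchanged, mirroring the non-destructive design of `DataCleaner`.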

In [None]:
# Use the DataCleaner on our air quality data

# Reload raw data for a fresh start
df_air_raw = pd.read_csv(air_quality_path, sep=';', encoding='utf-8')

# Define columns with European decimals
numeric_cols = ['CO(GT)', 'PT08.S1(CO)', 'C6H6(GT)', 'PT08.S2(NMHC)', 
                'NOx(GT)', 'PT08.S3(NOx)', 'NO2(GT)', 'PT08.S4(NO2)',
                'PT08.S5(O3)', 'T', 'RH', 'AH']

print("Cleaning Air Quality Dataset")
print("=" * 50)

# Create cleaner and run pipeline
cleaner = DataCleaner(df_air_raw, 'AirQualityUCI')

df_cleaned = (
    cleaner
    .drop_empty_columns()
    .convert_european_decimals(numeric_cols)
    .parse_dates('Date', format='%d/%m/%Y')
    .replace_missing_codes(numeric_cols, [-200, -200.0])
    .add_hash_key(['Date', 'Time'], 'record_id')
    .deduplicate(subset=['Date', 'Time'])
    .get_result()
)

print("\n" + "=" * 50)
print("Cleaning Complete!")

The pipeline ran all cleaning steps in sequence. The log shows what happened at each step with row counts.
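
Because each log entry is a dictionary with the same keys, the log can be loaded into a DataFrame for inspection. The sketch below uses made-up entries in the shape that `_log` appends (with a real run you would pass `cleaner.log` instead); the `rows_delta` column is an illustrative addition.

```python
import pandas as pd

# Made-up entries with the same keys that DataCleaner._log writes
example_log = [
    {'timestamp': '2026-02-02T10:00:00', 'operation': 'Initialized',
     'rows': 1000, 'columns': 5},
    {'timestamp': '2026-02-02T10:00:01', 'operation': 'Dropped 1 empty columns',
     'rows': 1000, 'columns': 4},
    {'timestamp': '2026-02-02T10:00:02', 'operation': 'Removed 20 duplicate rows',
     'rows': 980, 'columns': 4},
]

log_df = pd.DataFrame(example_log)

# Change in row count caused by each step (0 for the first entry)
log_df['rows_delta'] = log_df['rows'].diff().fillna(0).astype(int)

print(log_df[['operation', 'rows', 'columns', 'rows_delta']])
```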

In [None]:
# Get and display the cleaning report
report = cleaner.get_report()

print("\nCleaning Report")
print("=" * 50)
print(f"Dataset: {report['dataset']}")
print(f"Original shape: {report['original_shape']}")
print(f"Final shape: {report['final_shape']}")
print(f"Rows removed: {report['rows_removed']}")
print(f"Columns removed: {report['columns_removed']}")
print(f"Total operations: {report['operations']}")

The report provides a summary that can be saved for audit purposes. This documentation is essential for reproducibility.
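
One simple way to save such a report is as JSON. The sketch below uses a made-up dictionary in the shape that `get_report()` returns; the file name `cleaning_report.json` and the temporary directory are assumptions for illustration.

```python
import json
import os
import tempfile

# Made-up report with the same keys that DataCleaner.get_report() returns
example_report = {
    'dataset': 'example',
    'original_shape': (1000, 5),
    'final_shape': (980, 4),
    'rows_removed': 20,
    'columns_removed': 1,
    'operations': 1,
    'log': [
        {'timestamp': '2026-02-02T10:00:00', 'operation': 'Initialized',
         'rows': 1000, 'columns': 5},
    ],
}

# Hypothetical output location; a temp directory keeps the example side-effect free
report_path = os.path.join(tempfile.mkdtemp(), 'cleaning_report.json')

with open(report_path, 'w', encoding='utf-8') as f:
    # default=str is a fallback for values json cannot serialize natively
    json.dump(example_report, f, indent=2, default=str)

with open(report_path, encoding='utf-8') as f:
    loaded = json.load(f)

print(loaded['final_shape'])  # shape tuples are stored as JSON lists
```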

In [None]:
# Verify the cleaned data
print("Cleaned Data Quality Check")
print("=" * 50)

print(f"\n1. Shape: {df_cleaned.shape}")

print(f"\n2. Data Types:")
for col in df_cleaned.columns[:10]:  # First 10 columns
    print(f"   {col}: {df_cleaned[col].dtype}")

print(f"\n3. Missing Values:")
missing = df_cleaned.isnull().sum()
missing_cols = missing[missing > 0]
if len(missing_cols) > 0:
    for col, count in missing_cols.items():
        pct = count / len(df_cleaned) * 100
        print(f"   {col}: {count:,} ({pct:.1f}%)")
else:
    print("   No missing values!")

print(f"\n4. Sample Data:")
print(df_cleaned.head())

The quality check shows the resulting shape, the converted data types, and the remaining missing values. Replacing -200 with NaN increases the missing-value counts; this is correct because -200 was a code for missing data, not a real measurement.
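
The effect of the -200 code can be verified with explicit assertions rather than by eye. This is a minimal sketch on a tiny made-up frame: the column names mirror the air quality data, but the values are invented.

```python
import numpy as np
import pandas as pd

# Made-up readings where -200 marks a missing measurement
readings = pd.DataFrame({
    'CO(GT)': [2.6, -200.0, 2.2],
    'T': [13.6, 13.3, -200.0],
})

# Count sentinels per column before cleaning
sentinel_counts = (readings == -200).sum()
cleaned = readings.replace(-200, np.nan)

# After cleaning, no sentinel should remain and each one should now be NaN
assert not (cleaned == -200).any().any()
assert (cleaned.isna().sum() == sentinel_counts).all()

# Summary statistics now skip the missing readings instead of averaging in -200
print(readings['CO(GT)'].mean(), cleaned['CO(GT)'].mean())
```

Leaving the sentinel in place would drag the mean far below any plausible reading, which is why the conversion to NaN matters before any analysis.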

---

## Summary: Key Takeaways

### 1. Understanding Dirty Data
- Data quality issues are pervasive and costly
- Always inspect raw data before processing
- Common issues: wrong types, inconsistent formats, duplicates, missing values

### 2. Type Standardization
- Explicit type conversion prevents silent errors
- Use `pd.to_numeric()` with `errors='coerce'` for safe conversion
- Define schemas explicitly for consistency

### 3. European Number Formats
- Many European countries use a comma as the decimal separator
- Replace comma with period before numeric conversion
- Look for semicolon delimiters as a sign of European CSV

### 4. Date Parsing
- Date formats vary by region and system
- Specify format explicitly when possible
- Combine Date + Time for full timestamp

### 5. String Normalization
- Strip whitespace, normalize case
- Use regex for pattern-based cleaning
- Build mapping tables for known variations

### 6. Key Generation
- Every record needs a unique identifier
- Hash keys are deterministic and compact
- Composite keys are human-readable

### 7. Deduplication
- Check for duplicates before analysis
- Choose strategy based on data semantics (first/last/drop all)
- Document deduplication decisions

### 8. Reusable Pipelines
- Encapsulate cleaning logic in classes/functions
- Log all operations for reproducibility
- Generate reports for audit trails

---

## References

### Books and Articles
- McKinney, W. (2022). *Python for Data Analysis* (3rd ed.). O'Reilly Media.
  - Chapter 7: Data Cleaning and Preparation
- McCallum, Q. E. (2012). *Bad Data Handbook*. O'Reilly Media.
- Wickham, H. (2014). "Tidy Data." *Journal of Statistical Software*, 59(10), 1-23.

### Documentation
- [pandas String Methods](https://pandas.pydata.org/docs/user_guide/text.html)
- [pandas Working with Missing Data](https://pandas.pydata.org/docs/user_guide/missing_data.html)
- [pandas Duplicated](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.duplicated.html)

### Best Practices
- [Google Data Preparation Guidelines](https://developers.google.com/machine-learning/data-prep)

---

## Practice Exercises

### Exercise 1: Clean the Food Data
Load `FOOD-DATA-GROUP2.csv` and clean it using the `DataCleaner` class. This dataset has unnamed columns and potential duplicates.

### Exercise 2: Date Format Detective
Write a function that examines a date column and determines whether it's in US (MM/DD/YYYY) or European (DD/MM/YYYY) format based on value ranges.

### Exercise 3: Fuzzy Deduplication
Research the `thefuzz` library (formerly `fuzzywuzzy`) or `rapidfuzz`, and implement a function that finds "near-duplicate" rows based on string similarity.

### Exercise 4: Data Quality Dashboard
Create a function that generates a comprehensive data quality report including:
- Missing value rates per column
- Duplicate rates
- Type analysis
- Value range checks

---

## Next Class: Data Cleaning II

In Lecture 10, we will cover:
- Outlier detection methods (IQR, Z-score, domain-based)
- Advanced missing value handling
- Validation rules and constraints
- Referential integrity checks
- Data quality scoring frameworks

We will continue building on the cleaning pipeline we developed today.

---

*End of Lecture 9*