# FRE 521D: Data Analytics in Climate, Food and Environment
## Lecture 10: Data Cleaning II - Outliers, Validation, and Quality Scoring

**Date:** Wednesday, February 4, 2026  
**Instructor:** Asif Ahmed Neloy  
**Program:** UBC Master of Food and Resource Economics

---

### Today's Agenda

1. Understanding Outliers
2. Statistical Outlier Detection (IQR Method)
3. Z-Score Based Detection
4. Domain-Based Outlier Rules
5. Missing Value Patterns and Handling
6. Validation Rules and Constraints
7. Referential Integrity Checks
8. Data Quality Scoring Framework
9. Automated Quality Reports

---

## Learning Objectives

By the end of this lecture, you will be able to:

1. Distinguish between legitimate extreme values and data errors
2. Apply IQR and Z-score methods to detect statistical outliers
3. Create domain-specific validation rules based on business logic
4. Identify missing value patterns (MCAR, MAR, MNAR)
5. Implement referential integrity checks between tables
6. Build a data quality scoring system
7. Generate automated quality reports for production pipelines

---

## Setting Up

In [1]:
# Standard imports
import pandas as pd
import numpy as np
from scipy import stats
from datetime import datetime
import warnings
warnings.filterwarnings('ignore')

# Display settings
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.max_rows', 60)

print(f"Pandas version: {pd.__version__}")
print(f"NumPy version: {np.__version__}")
print(f"Current time: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
print("Ready for Data Cleaning II!")

Pandas version: 2.3.1
NumPy version: 2.1.3
Current time: 2026-02-02 09:57:59
Ready for Data Cleaning II!


We import `scipy.stats` for statistical calculations like Z-scores. SciPy is a fundamental library for scientific computing in Python.

---

## Load Our Datasets

In [2]:
# Load the GlobalWeatherRepository dataset
weather_path = '../../Datasets/GlobalWeatherRepository.csv'
df_weather = pd.read_csv(weather_path)

print(f"Weather Data: {df_weather.shape[0]:,} rows, {df_weather.shape[1]} columns")
print(f"\nNumeric columns:")
numeric_cols = df_weather.select_dtypes(include=[np.number]).columns.tolist()
for col in numeric_cols[:10]:
    print(f"  - {col}")
print(f"  ... and {len(numeric_cols) - 10} more")

Weather Data: 43,884 rows, 41 columns

Numeric columns:
  - latitude
  - longitude
  - last_updated_epoch
  - temperature_celsius
  - temperature_fahrenheit
  - wind_mph
  - wind_kph
  - wind_degree
  - pressure_mb
  - pressure_in
  ... and 20 more


In [3]:
# Load the Food Nutrition dataset
food_path = '../../Datasets/FOOD-DATA-GROUP1.csv'
df_food = pd.read_csv(food_path)

print(f"Food Data: {df_food.shape[0]:,} rows, {df_food.shape[1]} columns")
print(f"\nSample columns:")
print(df_food.columns[:10].tolist())

Food Data: 551 rows, 35 columns

Sample columns:
['food', 'Caloric Value', 'Fat', 'Saturated Fats', 'Monounsaturated Fats', 'Polyunsaturated Fats', 'Carbohydrates', 'Sugars', 'Protein', 'Dietary Fiber']


---

## 1. Understanding Outliers

### What is an Outlier?

An **outlier** is a data point that differs significantly from other observations. However, not all outliers are errors.

```
┌─────────────────────────────────────────────────────────────────┐
│                      TYPES OF OUTLIERS                          │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  LEGITIMATE OUTLIERS              ERROR OUTLIERS                │
│  (Keep or investigate)            (Fix or remove)               │
│  ├── Rare but real events         ├── Data entry errors         │
│  ├── Exceptional performance      ├── Measurement errors        │
│  ├── Black swan events            ├── Processing bugs           │
│  └── Natural variation            └── Unit/scale mistakes       │
│                                                                 │
│  Example: Record-high temp        Example: Temp = 999°C         │
│  (Climate: 56.7°C in Death        (Clearly a sensor error       │
│   Valley is real)                  or placeholder)              │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
```

### The Outlier Decision Framework

Before removing any outlier, ask:

1. **Is it possible?** Does this value make physical/logical sense?
2. **Is it plausible?** Could this happen in the real world?
3. **Is it documented?** Is there an explanation for this value?
4. **What's the impact?** How does it affect analysis?

---

## 2. Statistical Outlier Detection: IQR Method

### The Interquartile Range (IQR) Method

The IQR method is robust because it's based on quartiles, not mean/standard deviation (which are sensitive to outliers themselves).

```
┌─────────────────────────────────────────────────────────────────┐
│                    IQR OUTLIER DETECTION                        │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│      Q1           Median         Q3                             │
│      (25%)         (50%)        (75%)                           │
│       │             │            │                              │
│  ─────┼─────────────┼────────────┼─────                         │
│       │             │            │                              │
│       │◄────────────┴────────────►│                             │
│       │           IQR            │                              │
│                                                                 │
│  Lower Fence = Q1 - 1.5 × IQR                                   │
│  Upper Fence = Q3 + 1.5 × IQR                                   │
│                                                                 │
│  Values outside fences = Outliers                               │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
```

The multiplier 1.5 is the conventional choice (Tukey's rule). A multiplier of 3.0 flags only extreme outliers.
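As a quick illustration of how the multiplier changes what gets flagged, the sketch below applies both fences to a small made-up series. The values and the helper name `iqr_fences` are illustrative assumptions and are not taken from the lecture datasets.

```python
import pandas as pd

# Made-up values for illustration only
values = pd.Series([10, 12, 12, 13, 14, 15, 16, 17, 18, 30, 60], dtype=float)

def iqr_fences(series, multiplier):
    """Return (lower, upper) fences for the given IQR multiplier."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - multiplier * iqr, q3 + multiplier * iqr

for multiplier in (1.5, 3.0):
    lower, upper = iqr_fences(values, multiplier)
    flagged = values[(values < lower) | (values > upper)].tolist()
    print(f"multiplier={multiplier}: fences=({lower}, {upper}), flagged={flagged}")
```

With the 1.5 multiplier the fences are (5.0, 25.0) and both 30 and 60 are flagged. With 3.0 the fences widen to (-2.5, 32.5) and only 60 remains flagged.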

In [4]:
def detect_outliers_iqr(series, multiplier=1.5):
    """
    Detect outliers using the IQR (Interquartile Range) method.
    
    Parameters:
    -----------
    series : pd.Series
        Numeric series to check for outliers
    multiplier : float
        IQR multiplier for fence calculation (default: 1.5)
        Use 3.0 for extreme outliers only
    
    Returns:
    --------
    dict : Contains outlier mask, bounds, and statistics
    """
    # Calculate quartiles
    Q1 = series.quantile(0.25)
    Q3 = series.quantile(0.75)
    IQR = Q3 - Q1
    
    # Calculate fences
    lower_fence = Q1 - multiplier * IQR
    upper_fence = Q3 + multiplier * IQR
    
    # Identify outliers
    outlier_mask = (series < lower_fence) | (series > upper_fence)
    
    return {
        'outlier_mask': outlier_mask,
        'outlier_count': outlier_mask.sum(),
        'outlier_pct': outlier_mask.sum() / len(series) * 100,
        'Q1': Q1,
        'Q3': Q3,
        'IQR': IQR,
        'lower_fence': lower_fence,
        'upper_fence': upper_fence,
        'multiplier': multiplier
    }


# Apply to temperature data
temp_outliers = detect_outliers_iqr(df_weather['temperature_celsius'])

print("Temperature Outlier Analysis (IQR Method)")
print("=" * 50)
print(f"Q1 (25th percentile): {temp_outliers['Q1']:.1f}°C")
print(f"Q3 (75th percentile): {temp_outliers['Q3']:.1f}°C")
print(f"IQR: {temp_outliers['IQR']:.1f}°C")
print(f"Lower fence: {temp_outliers['lower_fence']:.1f}°C")
print(f"Upper fence: {temp_outliers['upper_fence']:.1f}°C")
print(f"\nOutliers found: {temp_outliers['outlier_count']:,} ({temp_outliers['outlier_pct']:.2f}%)")

Temperature Outlier Analysis (IQR Method)
Q1 (25th percentile): 19.2°C
Q3 (75th percentile): 29.1°C
IQR: 9.9°C
Lower fence: 4.3°C
Upper fence: 44.0°C

Outliers found: 1,669 (3.80%)


The IQR method flags 1,669 readings (3.80%) that fall below 4.3°C or above 44.0°C. Let's examine these outliers to determine whether they are errors or legitimate values.

In [5]:
# Examine the outliers
outlier_temps = df_weather[temp_outliers['outlier_mask']]['temperature_celsius']

print("Distribution of Temperature Outliers:")
print(f"  Minimum: {outlier_temps.min():.1f}°C")
print(f"  Maximum: {outlier_temps.max():.1f}°C")
print(f"  Mean: {outlier_temps.mean():.1f}°C")

print("\nSample outlier records:")
outlier_records = df_weather[temp_outliers['outlier_mask']][['country', 'location_name', 'temperature_celsius', 'condition_text']].head(10)
print(outlier_records.to_string(index=False))

Distribution of Temperature Outliers:
  Minimum: -24.2°C
  Maximum: 49.2°C
  Mean: 4.8°C

Sample outlier records:
  country location_name  temperature_celsius condition_text
    Chile      Santiago                  1.0          Clear
    Chile      Santiago                  2.0          Sunny
Australia      Canberra                 -1.0          Clear
    Chile      Santiago                  3.0           Mist
  Iceland     Grindavik                  4.0     Light rain
    Sudan      Khartoum                 44.1          Sunny
Australia      Canberra                  3.0          Clear
     Chad     N'djamena                 44.0          Sunny
     Chad     N'djamena                 45.0          Sunny
Australia      Canberra                  1.0          Clear


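Looking at the outliers, we can assess whether they are reasonable for their locations. The sampled records are: near-freezing readings in Santiago and Canberra are consistent with winter conditions, and readings of 44°C to 45°C in Khartoum and N'Djamena are consistent with hot, arid climates. Even the most extreme flagged values (-24.2°C and 49.2°C) are within the range observed on Earth, since sub-zero temperatures are normal for northern countries in winter and temperatures well above 35°C are common in desert regions. These look like legitimate outliers rather than errors.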
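One way to make "reasonable for their locations" concrete is to compute the fences per location instead of over the whole dataset. The sketch below uses made-up readings and hypothetical column names (`location`, `temp`). It illustrates the idea and is not the method used elsewhere in this lecture.

```python
import pandas as pd

# Made-up readings: a cold site (A) and a hot site (B), each with one odd value
df = pd.DataFrame({
    "location": ["A"] * 6 + ["B"] * 6,
    "temp": [1.0, 2.0, 3.0, 2.0, 1.0, 30.0, 40.0, 41.0, 42.0, 41.0, 40.0, 5.0],
})

def iqr_flags(values, q1, q3, multiplier=1.5):
    """Flag values outside the IQR fences defined by q1 and q3."""
    iqr = q3 - q1
    return (values < q1 - multiplier * iqr) | (values > q3 + multiplier * iqr)

# Global fences: one Q1/Q3 pair for the whole column
df["global_flag"] = iqr_flags(
    df["temp"], df["temp"].quantile(0.25), df["temp"].quantile(0.75)
)

# Per-location fences: each group's Q1/Q3 is broadcast back to its rows
grouped = df.groupby("location")["temp"]
q1_local = grouped.transform(lambda s: s.quantile(0.25))
q3_local = grouped.transform(lambda s: s.quantile(0.75))
df["local_flag"] = iqr_flags(df["temp"], q1_local, q3_local)

print(df[df["global_flag"] | df["local_flag"]])
```

In this toy data the global fences flag nothing, because the two sites together span a wide range. The per-location fences flag the 30°C reading at the cold site and the 5°C reading at the hot site.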

---

## 3. Z-Score Based Detection

### What is a Z-Score?

A Z-score measures how many standard deviations a value is from the mean:

$$Z = \frac{x - \mu}{\sigma}$$

Where:
- $x$ = the value
- $\mu$ = mean of the data
- $\sigma$ = standard deviation

**Common thresholds:**
- |Z| > 2: Unusual (about 4.6% of normally distributed data)
- |Z| > 3: Very unusual (about 0.27% of data)
- |Z| > 4: Extremely unusual (about 0.006% of data)
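These percentages follow from the two-sided tail of the standard normal distribution, P(|Z| > z) = erfc(z / √2). A minimal check using only the standard library (the helper name `two_sided_tail` is an illustrative assumption):

```python
import math

def two_sided_tail(z):
    """P(|Z| > z) for a standard normal variable."""
    return math.erfc(z / math.sqrt(2))

for z in (2, 3, 4):
    pct = two_sided_tail(z) * 100
    print(f"|Z| > {z}: {pct:.4f}% of normally distributed data")
```

The results are roughly 4.55%, 0.27%, and 0.006%. They only apply when the data are approximately normal; for skewed data the share beyond a given threshold can be very different.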

In [6]:
def detect_outliers_zscore(series, threshold=3.0):
    """
    Detect outliers using Z-score method.
    
    Parameters:
    -----------
    series : pd.Series
        Numeric series to check
    threshold : float
        Z-score threshold (default: 3.0)
    
    Returns:
    --------
    dict : Contains outlier information and z-scores
    """
    # Calculate mean and standard deviation
    mean = series.mean()
    std = series.std()
    
    # Calculate Z-scores
    z_scores = (series - mean) / std
    
    # Identify outliers
    outlier_mask = np.abs(z_scores) > threshold
    
    return {
        'outlier_mask': outlier_mask,
        'outlier_count': outlier_mask.sum(),
        'outlier_pct': outlier_mask.sum() / len(series) * 100,
        'z_scores': z_scores,
        'mean': mean,
        'std': std,
        'threshold': threshold
    }


# Apply Z-score detection to temperature
zscore_outliers = detect_outliers_zscore(df_weather['temperature_celsius'], threshold=3.0)

print("Temperature Outlier Analysis (Z-Score Method)")
print("=" * 50)
print(f"Mean: {zscore_outliers['mean']:.1f}°C")
print(f"Standard Deviation: {zscore_outliers['std']:.1f}°C")
print(f"Threshold: |Z| > {zscore_outliers['threshold']}")
print(f"\nOutliers found: {zscore_outliers['outlier_count']:,} ({zscore_outliers['outlier_pct']:.2f}%)")

Temperature Outlier Analysis (Z-Score Method)
Mean: 23.7°C
Standard Deviation: 8.7°C
Threshold: |Z| > 3.0

Outliers found: 230 (0.52%)


With a threshold of 3 standard deviations, the Z-score method flags only 230 temperature readings (0.52%), far fewer than the 1,669 flagged by the IQR method.

In [7]:
# Compare IQR and Z-score methods

print("Comparison of Outlier Detection Methods")
print("=" * 50)

# Test on multiple columns
test_columns = ['temperature_celsius', 'humidity', 'wind_kph', 'pressure_mb']

comparison_data = []
for col in test_columns:
    if col in df_weather.columns:
        iqr_result = detect_outliers_iqr(df_weather[col].dropna())
        zscore_result = detect_outliers_zscore(df_weather[col].dropna())
        
        comparison_data.append({
            'Column': col,
            'IQR Outliers': iqr_result['outlier_count'],
            'IQR %': f"{iqr_result['outlier_pct']:.2f}%",
            'Z-Score Outliers': zscore_result['outlier_count'],
            'Z-Score %': f"{zscore_result['outlier_pct']:.2f}%"
        })

comparison_df = pd.DataFrame(comparison_data)
print(comparison_df.to_string(index=False))

Comparison of Outlier Detection Methods
             Column  IQR Outliers IQR %  Z-Score Outliers Z-Score %
temperature_celsius          1669 3.80%               230     0.52%
           humidity             0 0.00%                 0     0.00%
           wind_kph           554 1.26%                18     0.04%
        pressure_mb          2718 6.19%               369     0.84%


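The comparison shows that the two methods can disagree substantially. With these defaults, IQR is the more sensitive method (it flags more points), for two reasons. First, for normally distributed data the 1.5 × IQR fences sit at roughly ±2.7σ, inside the ±3σ Z-score threshold. Second, extreme values inflate the mean and standard deviation that the Z-score relies on, which can hide the very outliers we are looking for, while quartiles are barely affected.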

---

## 4. Domain-Based Outlier Rules

### The Importance of Domain Knowledge

Statistical methods are blind to context. Domain knowledge tells us what's actually possible:

```
┌─────────────────────────────────────────────────────────────────┐
│               DOMAIN-BASED VALIDATION RULES                     │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  PHYSICAL CONSTRAINTS           BUSINESS RULES                  │
│  ├── Temperature: -90 to 60°C   ├── Price: must be positive    │
│  ├── Humidity: 0 to 100%        ├── Age: 0 to 150 years        │
│  ├── Latitude: -90 to 90        ├── Quantity: must be integer  │
│  └── Pressure: 870 to 1085 mb   └── Date: not in future        │
│                                                                 │
│  RELATIONAL CONSTRAINTS         TEMPORAL CONSTRAINTS            │
│  ├── Start date < End date      ├── Year: 1900 to current      │
│  ├── Min value < Max value      ├── Month: 1 to 12             │
│  └── Part of whole ≤ whole      └── Hour: 0 to 23              │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
```

In [8]:
class DomainValidator:
    """
    Domain-based validation rules for weather data.
    
    This class encapsulates physical constraints that values must satisfy
    regardless of statistical distribution.
    """
    
    # Define valid ranges based on physical constraints
    WEATHER_RULES = {
        'temperature_celsius': {'min': -90, 'max': 60, 'description': 'Earth temperature range'},
        'temperature_fahrenheit': {'min': -130, 'max': 140, 'description': 'Earth temperature range'},
        'humidity': {'min': 0, 'max': 100, 'description': 'Percentage'},
        'pressure_mb': {'min': 870, 'max': 1085, 'description': 'Atmospheric pressure'},
        'wind_kph': {'min': 0, 'max': 410, 'description': 'Wind speed (max tornado)'},
        'visibility_km': {'min': 0, 'max': 100, 'description': 'Visibility range'},
        'uv_index': {'min': 0, 'max': 15, 'description': 'UV Index scale'},
        'cloud': {'min': 0, 'max': 100, 'description': 'Cloud cover percentage'},
        'latitude': {'min': -90, 'max': 90, 'description': 'Geographic latitude'},
        'longitude': {'min': -180, 'max': 180, 'description': 'Geographic longitude'},
    }
    
    def __init__(self, custom_rules=None):
        """
        Initialize with optional custom rules.
        
        Parameters:
        -----------
        custom_rules : dict
            Additional rules to merge with defaults
        """
        self.rules = self.WEATHER_RULES.copy()
        if custom_rules:
            self.rules.update(custom_rules)
    
    def validate_column(self, series, column_name):
        """
        Validate a column against domain rules.
        
        Returns:
        --------
        dict : Validation results
        """
        if column_name not in self.rules:
            return {'status': 'no_rule', 'message': f'No rule defined for {column_name}'}
        
        rule = self.rules[column_name]
        min_val = rule['min']
        max_val = rule['max']
        
        # Find violations
        below_min = series < min_val
        above_max = series > max_val
        violations = below_min | above_max
        
        return {
            'status': 'validated',
            'column': column_name,
            'rule': rule['description'],
            'valid_range': f"[{min_val}, {max_val}]",
            'violations_mask': violations,
            'violation_count': violations.sum(),
            'violation_pct': violations.sum() / len(series) * 100,
            'below_min_count': below_min.sum(),
            'above_max_count': above_max.sum(),
            'actual_min': series.min(),
            'actual_max': series.max()
        }
    
    def validate_dataframe(self, df):
        """
        Validate all applicable columns in a DataFrame.
        
        Returns:
        --------
        list : Validation results for each column
        """
        results = []
        for col in df.columns:
            if col in self.rules:
                result = self.validate_column(df[col], col)
                results.append(result)
        return results


# Create validator and run checks
validator = DomainValidator()
validation_results = validator.validate_dataframe(df_weather)

print("Domain Validation Results")
print("=" * 70)

for result in validation_results:
    if result['status'] == 'validated':
        print(f"\n{result['column']}:")
        print(f"  Rule: {result['rule']}")
        print(f"  Valid range: {result['valid_range']}")
        print(f"  Actual range: [{result['actual_min']:.1f}, {result['actual_max']:.1f}]")
        print(f"  Violations: {result['violation_count']:,} ({result['violation_pct']:.2f}%)")

Domain Validation Results

latitude:
  Rule: Geographic latitude
  Valid range: [-90, 90]
  Actual range: [-41.3, 64.2]
  Violations: 0 (0.00%)

longitude:
  Rule: Geographic longitude
  Valid range: [-180, 180]
  Actual range: [-175.2, 179.2]
  Violations: 0 (0.00%)

temperature_celsius:
  Rule: Earth temperature range
  Valid range: [-90, 60]
  Actual range: [-24.2, 49.2]
  Violations: 0 (0.00%)

temperature_fahrenheit:
  Rule: Earth temperature range
  Valid range: [-130, 140]
  Actual range: [-11.6, 120.6]
  Violations: 0 (0.00%)

wind_kph:
  Rule: Wind speed (max tornado)
  Valid range: [0, 410]
  Actual range: [3.6, 2963.2]
  Violations: 1 (0.00%)

pressure_mb:
  Rule: Atmospheric pressure
  Valid range: [870, 1085]
  Actual range: [971.0, 1080.0]
  Violations: 0 (0.00%)

humidity:
  Rule: Percentage
  Valid range: [0, 100]
  Actual range: [2.0, 100.0]
  Violations: 0 (0.00%)

cloud:
  Rule: Cloud cover percentage
  Valid range: [0, 100]
  Actual range: [0.0, 100.0]
  Violations:

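The domain validator checks each column against physical constraints, independent of how the values are distributed. Here it flags one record: a `wind_kph` value of 2963.2, far above the 410 km/h limit. Domain rules also catch errors that statistical methods can miss: a humidity of 105% lies close to the bulk of the data and may pass both the IQR and Z-score checks, yet it is physically impossible.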
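Flagging a violation and deciding what to do with it are separate steps. One common option, shown here as a sketch and not as a prescribed treatment, is to replace physically impossible values with `NaN` so they pass through the same missing-value handling as other gaps. The series below is made up, apart from 2963.2, which mirrors the out-of-range maximum reported above.

```python
import pandas as pd

# Made-up wind speeds (km/h); 2963.2 mirrors the out-of-range maximum reported above
wind = pd.Series([3.6, 15.1, 2963.2, 20.5], name="wind_kph")

# Keep values inside the domain range [0, 410]; everything else becomes NaN
in_range = wind.between(0, 410)
wind_clean = wind.where(in_range)

print(wind_clean)
print(f"Values set to NaN: {int((~in_range).sum())}")
```

Keeping the boolean mask (`in_range`) alongside the cleaned column preserves a record of which values were changed, which is useful for later quality reporting.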

---

## 5. Missing Value Patterns and Handling

### Types of Missingness

Understanding WHY data is missing is crucial for choosing the right handling strategy:

```
┌─────────────────────────────────────────────────────────────────┐
│               MISSING DATA MECHANISMS                           │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  MCAR - Missing Completely At Random                            │
│  ├── Missingness is random, unrelated to any data               │
│  ├── Example: Sensor randomly fails                             │
│  └── Safe to drop or impute with mean/median                    │
│                                                                 │
│  MAR - Missing At Random                                        │
│  ├── Missingness depends on observed data                       │
│  ├── Example: Older equipment more likely to fail               │
│  └── Use regression or group-based imputation                   │
│                                                                 │
│  MNAR - Missing Not At Random                                   │
│  ├── Missingness depends on the missing value itself            │
│  ├── Example: High-income people refuse to report income        │
│  └── Most problematic - may need domain expertise               │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
```
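A simple diagnostic for the MCAR versus MAR distinction is to compare the missing rate across groups of an observed variable. The sketch below uses a made-up sensor log with hypothetical column names (`equipment`, `reading`), following the equipment example in the diagram.

```python
import numpy as np
import pandas as pd

# Made-up sensor log: readings from old equipment are missing more often
df = pd.DataFrame({
    "equipment": ["old"] * 5 + ["new"] * 5,
    "reading": [np.nan, 20.1, np.nan, np.nan, 19.8, 21.0, 20.4, 20.9, 21.3, 20.7],
})

# Share of missing readings within each equipment group
missing_rate = df["reading"].isna().groupby(df["equipment"]).mean()
print(missing_rate)
```

A large gap between groups (here 60% for old equipment versus 0% for new) is evidence against MCAR and is consistent with MAR. It cannot rule out MNAR, because that depends on the values we did not observe.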

In [9]:
def analyze_missing_values(df):
    """
    Comprehensive missing value analysis.
    
    Parameters:
    -----------
    df : pd.DataFrame
        DataFrame to analyze
    
    Returns:
    --------
    pd.DataFrame : Missing value statistics per column
    """
    # Calculate missing stats
    missing_count = df.isnull().sum()
    missing_pct = (df.isnull().sum() / len(df) * 100).round(2)
    
    # Create summary DataFrame
    missing_df = pd.DataFrame({
        'column': df.columns,
        'missing_count': missing_count.values,
        'missing_pct': missing_pct.values,
        'dtype': df.dtypes.values
    })
    
    # Sort by missing percentage
    missing_df = missing_df.sort_values('missing_pct', ascending=False)
    
    # Add category
    def categorize_missing(pct):
        if pct == 0:
            return 'Complete'
        elif pct < 5:
            return 'Low (<5%)'
        elif pct < 20:
            return 'Moderate (5-20%)'
        elif pct < 50:
            return 'High (20-50%)'
        else:
            return 'Very High (>50%)'
    
    missing_df['category'] = missing_df['missing_pct'].apply(categorize_missing)
    
    return missing_df


# Analyze missing values in weather data
missing_analysis = analyze_missing_values(df_weather)

print("Missing Value Analysis - Weather Data")
print("=" * 60)

# Show columns with missing values
has_missing = missing_analysis[missing_analysis['missing_count'] > 0]

if len(has_missing) > 0:
    print(f"\nColumns with missing values: {len(has_missing)}")
    print(has_missing[['column', 'missing_count', 'missing_pct', 'category']].to_string(index=False))
else:
    print("\nNo missing values found!")

# Summary by category
print("\nSummary by Category:")
print(missing_analysis['category'].value_counts().to_string())

Missing Value Analysis - Weather Data

No missing values found!

Summary by Category:
category
Complete    41


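The analysis categorizes columns by their missing-value rate, which helps prioritize which columns need attention and which strategy to use. In the weather data all 41 columns are complete, so the handling strategies below are demonstrated on a small sample with missing values.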

In [10]:
def handle_missing_values(df, strategy='auto', fill_value=None):
    """
    Handle missing values with various strategies.
    
    Parameters:
    -----------
    df : pd.DataFrame
        DataFrame with missing values
    strategy : str
        'drop_rows' - Remove rows with any missing
        'drop_cols' - Remove columns with any missing
        'fill_mean' - Fill numeric with mean
        'fill_median' - Fill numeric with median
        'fill_mode' - Fill all with mode
        'fill_value' - Fill with specified value
        'auto' - Use appropriate strategy per column type
    fill_value : any
        Value to use with 'fill_value' strategy
    
    Returns:
    --------
    pd.DataFrame : DataFrame with missing values handled
    """
    result = df.copy()
    
    if strategy == 'drop_rows':
        result = result.dropna()
        
    elif strategy == 'drop_cols':
        result = result.dropna(axis=1)
        
    elif strategy == 'fill_mean':
        numeric_cols = result.select_dtypes(include=[np.number]).columns
        result[numeric_cols] = result[numeric_cols].fillna(result[numeric_cols].mean())
        
    elif strategy == 'fill_median':
        numeric_cols = result.select_dtypes(include=[np.number]).columns
        result[numeric_cols] = result[numeric_cols].fillna(result[numeric_cols].median())
        
    elif strategy == 'fill_mode':
        for col in result.columns:
            # Compute the mode once; skip columns that are entirely missing
            mode_val = result[col].mode()
            if len(mode_val) > 0:
                result[col] = result[col].fillna(mode_val.iloc[0])
            
    elif strategy == 'fill_value':
        result = result.fillna(fill_value)
        
    elif strategy == 'auto':
        # Numeric: fill with median (robust to outliers)
        numeric_cols = result.select_dtypes(include=[np.number]).columns
        result[numeric_cols] = result[numeric_cols].fillna(result[numeric_cols].median())
        
        # Categorical/string: fill with mode
        non_numeric_cols = result.select_dtypes(exclude=[np.number]).columns
        for col in non_numeric_cols:
            mode_val = result[col].mode()
            if len(mode_val) > 0:
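                result[col] = result[col].fillna(mode_val.iloc[0])
    
    else:
        # Fail loudly on a misspelled strategy instead of silently returning a copy
        raise ValueError(f"Unknown strategy: {strategy!r}")
    
    return result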


# Demonstrate different strategies on a sample with missing values
sample_df = pd.DataFrame({
    'A': [1, 2, np.nan, 4, 5],
    'B': [10, np.nan, 30, np.nan, 50],
    'C': ['x', 'y', np.nan, 'x', 'y']
})

print("Original data with missing values:")
print(sample_df)

print("\nStrategy: fill_median (numeric only)")
print(handle_missing_values(sample_df, strategy='fill_median'))

print("\nStrategy: auto (median for numeric, mode for categorical)")
print(handle_missing_values(sample_df, strategy='auto'))

Original data with missing values:
     A     B    C
0  1.0  10.0    x
1  2.0   NaN    y
2  NaN  30.0  NaN
3  4.0   NaN    x
4  5.0  50.0    y

Strategy: fill_median (numeric only)
     A     B    C
0  1.0  10.0    x
1  2.0  30.0    y
2  3.0  30.0  NaN
3  4.0  30.0    x
4  5.0  50.0    y

Strategy: auto (median for numeric, mode for categorical)
     A     B  C
0  1.0  10.0  x
1  2.0  30.0  y
2  3.0  30.0  x
3  4.0  30.0  x
4  5.0  50.0  y


The `handle_missing_values` function provides multiple strategies. The 'auto' strategy is recommended for mixed-type DataFrames because it:
- Uses **median** for numeric columns (robust to outliers)
- Uses **mode** for categorical columns (most frequent value)
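When missingness depends on an observed variable (the MAR case), a single global median can be misleading. A group-based variant fills each gap with the median of its own group. The sketch below uses made-up data and hypothetical column names (`region`, `temp`); it is one possible approach and is not part of `handle_missing_values`.

```python
import numpy as np
import pandas as pd

# Made-up temperatures for two regions with very different climates
df = pd.DataFrame({
    "region": ["north", "north", "north", "south", "south", "south"],
    "temp": [0.0, 2.0, np.nan, 30.0, np.nan, 34.0],
})

# Median of each region, broadcast back to the rows of that region
region_median = df.groupby("region")["temp"].transform("median")

# Fill each gap with the median of its own region
df["temp_filled"] = df["temp"].fillna(region_median)
print(df)
```

The missing northern value is filled with 1.0 and the missing southern value with 32.0. The global median of the four observed values would be 16.0, which is plausible for neither region.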

---

## 6. Validation Rules and Constraints

### Building a Validation Framework

Production data pipelines need systematic validation. We'll build a framework that checks:

1. **Schema validation**: Are all required columns present?
2. **Type validation**: Are data types correct?
3. **Range validation**: Are values within acceptable bounds?
4. **Format validation**: Do strings match expected patterns?
5. **Uniqueness validation**: Are key columns unique?

In [11]:
class DataValidator:
    """
    Comprehensive data validation framework.
    
    This class runs multiple validation checks and generates
    a detailed report of any issues found.
    """
    
    def __init__(self, df):
        """
        Initialize validator with DataFrame to check.
        """
        self.df = df
        self.issues = []
        self.checks_passed = 0
        self.checks_failed = 0
    
    def _log_issue(self, check_name, severity, message, details=None):
        """Record a validation issue."""
        self.issues.append({
            'check': check_name,
            'severity': severity,  # 'error', 'warning', 'info'
            'message': message,
            'details': details
        })
        if severity == 'error':
            self.checks_failed += 1
    
    def _log_pass(self, check_name):
        """Record a passed check."""
        self.checks_passed += 1
    
    def check_required_columns(self, required_cols):
        """
        Verify all required columns exist.
        """
        missing = [c for c in required_cols if c not in self.df.columns]
        
        if missing:
            self._log_issue(
                'required_columns',
                'error',
                f"Missing {len(missing)} required columns",
                missing
            )
        else:
            self._log_pass('required_columns')
        
        return self
    
    def check_no_nulls(self, columns):
        """
        Verify specified columns have no null values.
        """
        for col in columns:
            if col in self.df.columns:
                null_count = self.df[col].isnull().sum()
                if null_count > 0:
                    self._log_issue(
                        'no_nulls',
                        'error',
                        f"Column '{col}' has {null_count:,} null values",
                        {'column': col, 'null_count': null_count}
                    )
                else:
                    self._log_pass('no_nulls')
        return self
    
    def check_unique(self, columns):
        """
        Verify specified columns have unique values.
        """
        for col in columns:
            if col in self.df.columns:
                dup_count = self.df[col].duplicated().sum()
                if dup_count > 0:
                    self._log_issue(
                        'unique_values',
                        'error',
                        f"Column '{col}' has {dup_count:,} duplicate values",
                        {'column': col, 'duplicate_count': dup_count}
                    )
                else:
                    self._log_pass('unique_values')
        return self
    
    def check_range(self, column, min_val=None, max_val=None):
        """
        Verify column values are within specified range.
        """
        if column not in self.df.columns:
            return self
        
        violations = 0
        if min_val is not None:
            violations += (self.df[column] < min_val).sum()
        if max_val is not None:
            violations += (self.df[column] > max_val).sum()
        
        if violations > 0:
            self._log_issue(
                'range_check',
                'warning',
                f"Column '{column}' has {violations:,} values outside [{min_val}, {max_val}]",
                {'column': column, 'violations': violations}
            )
        else:
            self._log_pass('range_check')
        
        return self
    
    def check_data_type(self, column, expected_type):
        """
        Verify column has expected data type.
        """
        if column not in self.df.columns:
            return self
        
        actual_type = str(self.df[column].dtype)
        
        if expected_type not in actual_type:
            self._log_issue(
                'data_type',
                'warning',
                f"Column '{column}' type is {actual_type}, expected {expected_type}",
                {'column': column, 'actual': actual_type, 'expected': expected_type}
            )
        else:
            self._log_pass('data_type')
        
        return self
    
    def check_row_count(self, min_rows=1, max_rows=None):
        """
        Verify row count is within expected range.
        """
        row_count = len(self.df)
        
        if row_count < min_rows:
            self._log_issue(
                'row_count',
                'error',
                f"Too few rows: {row_count:,} (minimum: {min_rows:,})"
            )
        elif max_rows is not None and row_count > max_rows:
            self._log_issue(
                'row_count',
                'warning',
                f"Too many rows: {row_count:,} (maximum: {max_rows:,})"
            )
        else:
            self._log_pass('row_count')
        
        return self
    
    def get_report(self):
        """
        Generate validation report.
        """
        total_checks = self.checks_passed + self.checks_failed
        
        report = {
            'summary': {
                'total_checks': total_checks,
                'passed': self.checks_passed,
                'failed': self.checks_failed,
                'pass_rate': self.checks_passed / total_checks * 100 if total_checks > 0 else 0
            },
            'issues': self.issues,
            'status': 'PASS' if self.checks_failed == 0 else 'FAIL'
        }
        
        return report


# Run validation on weather data
validator = DataValidator(df_weather)

report = (
    validator
    .check_required_columns(['country', 'location_name', 'temperature_celsius'])
    .check_no_nulls(['country', 'location_name'])
    .check_range('temperature_celsius', min_val=-90, max_val=60)
    .check_range('humidity', min_val=0, max_val=100)
    .check_data_type('temperature_celsius', 'float')
    .check_row_count(min_rows=1000)
    .get_report()
)

print("Validation Report")
print("=" * 50)
print(f"Status: {report['status']}")
print(f"\nSummary:")
print(f"  Total checks: {report['summary']['total_checks']}")
print(f"  Passed: {report['summary']['passed']}")
print(f"  Failed: {report['summary']['failed']}")
print(f"  Pass rate: {report['summary']['pass_rate']:.1f}%")

if report['issues']:
    print(f"\nIssues Found ({len(report['issues'])}):") 
    for issue in report['issues']:
        print(f"  [{issue['severity'].upper()}] {issue['message']}")

Validation Report
Status: PASS

Summary:
  Total checks: 7
  Passed: 7
  Failed: 0
  Pass rate: 100.0%


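The `DataValidator` class provides a fluent interface: every check method returns `self`, so checks can be chained in a single expression that ends with `get_report()`. The report lists the number of checks passed and failed, each issue with its severity level (`warning` or `error`), and an overall `PASS`/`FAIL` status.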
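In a pipeline, the report can act as a gate that stops processing when blocking issues are found. The sketch below is one possible way to consume the report, not part of the `DataValidator` class: `enforce_validation` is a hypothetical helper, and `sample_report` is hand-built with illustrative values to mimic the dictionary shape returned by `get_report()`.

```python
def enforce_validation(report, fail_on=('error',)):
    """
    Raise ValueError if the report contains issues at a blocking severity.

    Hypothetical helper: `report` is assumed to have the shape returned
    by DataValidator.get_report() ('summary', 'issues', 'status').
    """
    blocking = [
        issue for issue in report['issues']
        if issue['severity'] in fail_on
    ]
    if blocking:
        messages = "; ".join(issue['message'] for issue in blocking)
        raise ValueError(f"Validation failed: {messages}")
    return report


# Hand-built example report (illustrative values, not real output)
sample_report = {
    'summary': {'total_checks': 2, 'passed': 1, 'failed': 1, 'pass_rate': 50.0},
    'issues': [
        {'severity': 'warning',
         'message': "Column 'humidity' has 3 values outside [0, 100]"}
    ],
    'status': 'FAIL'
}

# By default only 'error' issues block, so a warning passes through
enforce_validation(sample_report)

# A stricter pipeline can treat warnings as blocking too
try:
    enforce_validation(sample_report, fail_on=('error', 'warning'))
except ValueError as exc:
    print(exc)
```

Separating "which severities block" from the checks themselves lets the same validator serve both exploratory work (warnings tolerated) and production loads (warnings rejected).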

---

## 7. Referential Integrity Checks

### What is Referential Integrity?

Referential integrity ensures that relationships between tables remain consistent:

```
┌─────────────────────────────────────────────────────────────────┐
│               REFERENTIAL INTEGRITY                             │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│   COUNTRIES TABLE              WEATHER TABLE                    │
│   ┌───────────────┐            ┌───────────────┐               │
│   │ country_code  │◄───────────│ country       │               │
│   │ country_name  │            │ temperature   │               │
│   │ region        │            │ humidity      │               │
│   └───────────────┘            └───────────────┘               │
│                                                                 │
│   Every country in WEATHER must exist in COUNTRIES             │
│   (Foreign key constraint)                                     │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
```

In [12]:
def check_referential_integrity(child_df, child_col, parent_df, parent_col):
    """
    Check that all values in child column exist in parent column.
    
    Parameters:
    -----------
    child_df : pd.DataFrame
        DataFrame with foreign key column
    child_col : str
        Column name of the foreign key
    parent_df : pd.DataFrame
        DataFrame with primary key column
    parent_col : str
        Column name of the primary key
    
    Returns:
    --------
    dict : Integrity check results
    """
    # Get unique values from each
    child_values = set(child_df[child_col].dropna().unique())
    parent_values = set(parent_df[parent_col].dropna().unique())
    
    # Find orphans (in child but not in parent)
    orphans = child_values - parent_values
    
    # Find unused (in parent but not in child)
    unused = parent_values - child_values
    
    # Count orphan rows
    orphan_rows = child_df[child_df[child_col].isin(orphans)]
    
    return {
        'child_column': child_col,
        'parent_column': parent_col,
        'child_unique_values': len(child_values),
        'parent_unique_values': len(parent_values),
        'orphan_values': len(orphans),
        'orphan_rows': len(orphan_rows),
        'orphan_list': sorted(orphans, key=str)[:10],  # First 10, sorted for stable output
        'unused_parent_values': len(unused),
        'integrity_valid': len(orphans) == 0
    }


# Create a reference table for valid countries
# In practice, this would come from a master data source
valid_countries = pd.DataFrame({
    'country_name': df_weather['country'].unique()
})

# Simulate an integrity check
# (In this case, all countries match because we derived the reference from the data)
integrity_result = check_referential_integrity(
    df_weather, 'country',
    valid_countries, 'country_name'
)

print("Referential Integrity Check")
print("=" * 50)
print(f"Child column: {integrity_result['child_column']}")
print(f"Parent column: {integrity_result['parent_column']}")
print(f"\nChild unique values: {integrity_result['child_unique_values']}")
print(f"Parent unique values: {integrity_result['parent_unique_values']}")
print(f"\nOrphan values: {integrity_result['orphan_values']}")
print(f"Orphan rows: {integrity_result['orphan_rows']}")
print(f"\nIntegrity valid: {integrity_result['integrity_valid']}")

Referential Integrity Check
Child column: country
Parent column: country_name

Child unique values: 210
Parent unique values: 210

Orphan values: 0
Orphan rows: 0

Integrity valid: True


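Referential integrity checks ensure consistency across related datasets. Here the check passes trivially because the reference table was built from the weather data itself; against an independent master list, mismatches such as spelling variants would surface as orphans. Orphan records (values in the child that don't exist in the parent) often indicate data quality issues or incomplete data loads.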
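To see what a failing check looks like, the following sketch uses a small hand-made parent and child table (hypothetical values, not drawn from the weather dataset) and finds orphan rows with a left merge and `indicator=True`. This is an alternative to the set-based function above, shown here because it keeps the full orphan rows available for inspection.

```python
import pandas as pd

# Illustrative data (hypothetical values, not from the weather dataset)
countries = pd.DataFrame({'country_name': ['Canada', 'Kenya', 'Peru']})
readings = pd.DataFrame({
    'country': ['Canada', 'Kenya', 'Kenia', 'Canada'],
    'temperature_celsius': [-5.0, 24.0, 25.0, -3.5]
})

# Left-join child onto parent; the _merge column marks rows
# that found no matching parent key as 'left_only'
merged = readings.merge(
    countries,
    left_on='country',
    right_on='country_name',
    how='left',
    indicator=True
)

orphan_rows = merged[merged['_merge'] == 'left_only']
print(orphan_rows[['country', 'temperature_celsius']])
```

The misspelled `'Kenia'` row is the only orphan. Note that a left merge duplicates child rows when the parent key column contains duplicates, so the parent keys should be de-duplicated first.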

---

## 8. Data Quality Scoring Framework

### Why Score Data Quality?

A single quality score helps:
- **Track trends**: Is quality improving or declining?
- **Set thresholds**: Define minimum acceptable quality
- **Compare datasets**: Which source is more reliable?
- **Prioritize fixes**: Focus on biggest quality gaps
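As a minimal sketch of the first two points, the snippet below tracks an overall score across pipeline runs and flags runs that fall below a minimum. The scores, dates, and the `MIN_ACCEPTABLE` threshold are all hypothetical values chosen for illustration.

```python
import pandas as pd

# Hypothetical overall scores from four pipeline runs (illustrative only)
history = pd.DataFrame({
    'run_date': pd.to_datetime(
        ['2026-01-05', '2026-01-12', '2026-01-19', '2026-01-26']
    ),
    'overall_score': [97.2, 96.8, 91.5, 88.0]
})

MIN_ACCEPTABLE = 90.0  # assumed threshold; set per project

# Run-over-run change shows whether quality is improving or declining
history['change'] = history['overall_score'].diff()

# Flag runs that fall below the minimum acceptable quality
history['below_threshold'] = history['overall_score'] < MIN_ACCEPTABLE

print(history)
```

The `change` column makes a decline visible before the threshold is crossed, which is often the more useful early warning.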

### Quality Dimensions

```
┌─────────────────────────────────────────────────────────────────┐
│                  DATA QUALITY DIMENSIONS                        │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  COMPLETENESS (25%)           VALIDITY (25%)                    │
│  ├── Missing value rate       ├── Values within valid range    │
│  └── Required fields present  └── Correct data types           │
│                                                                 │
│  UNIQUENESS (25%)             CONSISTENCY (25%)                 │
│  ├── No duplicate records     ├── Referential integrity        │
│  └── Key uniqueness           └── Cross-column logic           │
│                                                                 │
│           OVERALL SCORE = Weighted Average                      │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
```

In [13]:
class DataQualityScorer:
    """
    Calculate data quality scores across multiple dimensions.
    
    Dimensions:
    - Completeness: % of non-null values
    - Validity: % of values within valid ranges
    - Uniqueness: % of unique rows (vs duplicates)
    - Consistency: % passing consistency rules
    """
    
    def __init__(self, df):
        self.df = df
        self.scores = {}
        self.details = {}
    
    def score_completeness(self, required_columns=None):
        """
        Score based on missing value rate.
        """
        if required_columns:
            cols_to_check = [c for c in required_columns if c in self.df.columns]
        else:
            cols_to_check = self.df.columns.tolist()
        
        if not cols_to_check or len(self.df) == 0:
            self.scores['completeness'] = 100.0
            return self
        
        total_cells = len(self.df) * len(cols_to_check)
        missing_cells = self.df[cols_to_check].isnull().sum().sum()
        
        score = (1 - missing_cells / total_cells) * 100
        
        self.scores['completeness'] = round(score, 2)
        self.details['completeness'] = {
            'total_cells': total_cells,
            'missing_cells': missing_cells,
            'columns_checked': len(cols_to_check)
        }
        
        return self
    
    def score_validity(self, rules):
        """
        Score based on values passing validation rules.
        
        Parameters:
        -----------
        rules : dict
            {column: {'min': x, 'max': y}} format
        """
        total_values = 0
        valid_values = 0
        
        for col, rule in rules.items():
            if col not in self.df.columns:
                continue
            
            col_data = self.df[col].dropna()
            total_values += len(col_data)
            
            valid_mask = pd.Series(True, index=col_data.index)
            
            if 'min' in rule:
                valid_mask &= (col_data >= rule['min'])
            if 'max' in rule:
                valid_mask &= (col_data <= rule['max'])
            
            valid_values += valid_mask.sum()
        
        score = (valid_values / total_values * 100) if total_values > 0 else 100
        
        self.scores['validity'] = round(score, 2)
        self.details['validity'] = {
            'total_values': total_values,
            'valid_values': valid_values,
            'rules_applied': len(rules)
        }
        
        return self
    
    def score_uniqueness(self, key_columns=None):
        """
        Score based on duplicate rate.
        """
        if key_columns:
            dup_count = self.df.duplicated(subset=key_columns).sum()
        else:
            dup_count = self.df.duplicated().sum()
        
        # Guard against an empty DataFrame (avoid division by zero)
        score = (1 - dup_count / len(self.df)) * 100 if len(self.df) > 0 else 100.0
        
        self.scores['uniqueness'] = round(score, 2)
        self.details['uniqueness'] = {
            'total_rows': len(self.df),
            'duplicate_rows': dup_count,
            'key_columns': key_columns
        }
        
        return self
    
    def score_consistency(self, rules):
        """
        Score based on cross-column consistency rules.
        
        Parameters:
        -----------
        rules : list of dict
            [{'check': lambda df: condition, 'name': 'rule_name'}, ...]
        """
        if not rules:
            self.scores['consistency'] = 100.0
            return self
        
        total_checks = len(rules) * len(self.df)
        passed_checks = 0
        
        rule_results = []
        for rule in rules:
            try:
                passed = rule['check'](self.df).sum()
                passed_checks += passed
                rule_results.append({
                    'name': rule['name'],
                    'passed': passed,
                    'total': len(self.df),
                    'pass_rate': passed / len(self.df) * 100
                })
            except Exception as e:
                rule_results.append({'name': rule['name'], 'error': str(e)})
        
        score = (passed_checks / total_checks * 100) if total_checks > 0 else 100
        
        self.scores['consistency'] = round(score, 2)
        self.details['consistency'] = {'rule_results': rule_results}
        
        return self
    
    def calculate_overall(self, weights=None):
        """
        Calculate weighted overall score.
        """
        if weights is None:
            weights = {
                'completeness': 0.25,
                'validity': 0.25,
                'uniqueness': 0.25,
                'consistency': 0.25
            }
        
        overall = 0
        total_weight = 0
        
        for dimension, weight in weights.items():
            if dimension in self.scores:
                overall += self.scores[dimension] * weight
                total_weight += weight
        
        if total_weight > 0:
            # Normalize by the weights actually used so unscored dimensions do not lower the result
            self.scores['overall'] = round(overall / total_weight, 2)
        else:
            self.scores['overall'] = 0
        
        return self
    
    def get_report(self):
        """
        Get complete quality report.
        """
        return {
            'scores': self.scores,
            'details': self.details,
            'grade': self._score_to_grade(self.scores.get('overall', 0))
        }
    
    def _score_to_grade(self, score):
        """Convert numeric score to letter grade."""
        if score >= 95:
            return 'A+'
        elif score >= 90:
            return 'A'
        elif score >= 85:
            return 'B+'
        elif score >= 80:
            return 'B'
        elif score >= 75:
            return 'C+'
        elif score >= 70:
            return 'C'
        elif score >= 60:
            return 'D'
        else:
            return 'F'

The `DataQualityScorer` class calculates scores across four dimensions. Each dimension contributes equally by default, but weights can be customized based on business priorities.
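To illustrate how custom weights change the result, here is a standalone sketch of a weighted average over dimension scores. The function `weighted_overall`, the example scores, and the `validity_heavy` weights are hypothetical; the sketch renormalizes over the dimensions that were actually scored, which is one reasonable design choice among several.

```python
def weighted_overall(scores, weights):
    """
    Weighted average over the dimensions present in both dicts.

    Weights are renormalized so that a dimension missing from
    `scores` does not lower the result.
    """
    common = [d for d in weights if d in scores]
    total_weight = sum(weights[d] for d in common)
    if total_weight == 0:
        return 0.0
    weighted_sum = sum(scores[d] * weights[d] for d in common)
    return round(weighted_sum / total_weight, 2)


# Illustrative dimension scores (hypothetical values)
example_scores = {
    'completeness': 98.0, 'validity': 90.0,
    'uniqueness': 100.0, 'consistency': 80.0
}

equal = {'completeness': 0.25, 'validity': 0.25,
         'uniqueness': 0.25, 'consistency': 0.25}

# Example: a team that cares most about physically valid readings
validity_heavy = {'completeness': 0.2, 'validity': 0.5,
                  'uniqueness': 0.1, 'consistency': 0.2}

print(weighted_overall(example_scores, equal))           # 92.0
print(weighted_overall(example_scores, validity_heavy))  # 90.6
```

Shifting weight toward validity lowers the overall score here because validity (90.0) sits below the equal-weight average, so the choice of weights should be agreed on and documented before scores are compared across datasets.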

In [14]:
# Score the weather data quality

# Define validation rules
validity_rules = {
    'temperature_celsius': {'min': -90, 'max': 60},
    'humidity': {'min': 0, 'max': 100},
    'pressure_mb': {'min': 870, 'max': 1085},
    'wind_kph': {'min': 0, 'max': 410}
}

# Define consistency rules
consistency_rules = [
    {
        'name': 'temp_celsius_fahrenheit_match',
        'check': lambda df: abs(df['temperature_fahrenheit'] - (df['temperature_celsius'] * 9/5 + 32)) < 1
    },
    {
        'name': 'feels_like_reasonable',
        'check': lambda df: abs(df['feels_like_celsius'] - df['temperature_celsius']) < 20
    }
]

# Run scoring
scorer = DataQualityScorer(df_weather)

quality_report = (
    scorer
    .score_completeness(required_columns=['country', 'temperature_celsius', 'humidity'])
    .score_validity(validity_rules)
    .score_uniqueness(key_columns=['country', 'location_name', 'last_updated'])
    .score_consistency(consistency_rules)
    .calculate_overall()
    .get_report()
)

# Display report
print("Data Quality Report - Weather Data")
print("=" * 50)
print(f"\nOverall Grade: {quality_report['grade']}")
print(f"Overall Score: {quality_report['scores']['overall']:.1f}/100")

print("\nDimension Scores:")
for dimension in ['completeness', 'validity', 'uniqueness', 'consistency']:
    if dimension in quality_report['scores']:
        score = quality_report['scores'][dimension]
        bar = '█' * int(score / 5) + '░' * (20 - int(score / 5))
        print(f"  {dimension.capitalize():15} {bar} {score:.1f}%")

Data Quality Report - Weather Data

Overall Grade: A+
Overall Score: 100.0/100

Dimension Scores:
  Completeness    ████████████████████ 100.0%
  Validity        ████████████████████ 100.0%
  Uniqueness      ████████████████████ 100.0%
  Consistency     ████████████████████ 100.0%


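The quality report gives a compact visual summary. Each bar has 20 blocks, so one filled block represents 5 percentage points, which makes it easy to see which dimensions need the most attention. Keep in mind that a score of 100% means only that the data passed the rules we defined, not that it is free of errors.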
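When every bar is full there is nothing to prioritize, so the sketch below uses hypothetical scores to show how dimensions below a threshold could be listed programmatically, worst first. The helper `dimensions_needing_attention` and the 95% default threshold are assumptions for illustration.

```python
def dimensions_needing_attention(scores, threshold=95.0):
    """Return (dimension, score) pairs below threshold, lowest first."""
    flagged = [
        (dim, score) for dim, score in scores.items()
        if dim != 'overall' and score < threshold
    ]
    return sorted(flagged, key=lambda pair: pair[1])


# Illustrative scores (hypothetical, not the weather results)
example_scores = {
    'completeness': 99.1, 'validity': 92.4,
    'uniqueness': 100.0, 'consistency': 87.3, 'overall': 94.7
}

for dim, score in dimensions_needing_attention(example_scores):
    print(f"{dim}: {score:.1f}%")
```

The input has the same shape as the `scores` dictionary in the quality report, so the same function could be applied to `quality_report['scores']`.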

---

## 9. Automated Quality Reports

### Putting It All Together

Let's create a comprehensive quality report that can be generated automatically for any dataset.

In [15]:
def generate_quality_report(df, dataset_name, output_file=None):
    """
    Generate a comprehensive data quality report.
    
    Parameters:
    -----------
    df : pd.DataFrame
        Dataset to analyze
    dataset_name : str
        Name for the report
    output_file : str
        Optional path to save report
    
    Returns:
    --------
    str : Formatted report text
    """
    report_lines = []
    
    def add_line(text=""):
        report_lines.append(text)
    
    # Header
    add_line("=" * 70)
    add_line(f"DATA QUALITY REPORT: {dataset_name}")
    add_line("=" * 70)
    add_line(f"Generated: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
    add_line()
    
    # Basic Statistics
    add_line("DATASET OVERVIEW")
    add_line("-" * 40)
    add_line(f"Rows: {len(df):,}")
    add_line(f"Columns: {len(df.columns)}")
    add_line(f"Memory usage: {df.memory_usage(deep=True).sum() / 1024 / 1024:.2f} MB")
    add_line()
    
    # Column Types
    add_line("COLUMN TYPES")
    add_line("-" * 40)
    type_counts = df.dtypes.value_counts()
    for dtype, count in type_counts.items():
        add_line(f"  {dtype}: {count} columns")
    add_line()
    
    # Missing Values
    add_line("MISSING VALUES")
    add_line("-" * 40)
    missing = df.isnull().sum()
    missing_cols = missing[missing > 0].sort_values(ascending=False)
    if len(missing_cols) > 0:
        for col in missing_cols.head(10).index:
            pct = missing_cols[col] / len(df) * 100
            add_line(f"  {col}: {missing_cols[col]:,} ({pct:.1f}%)")
        if len(missing_cols) > 10:
            add_line(f"  ... and {len(missing_cols) - 10} more columns with missing values")
    else:
        add_line("  No missing values found!")
    add_line()
    
    # Numeric Column Statistics
    add_line("NUMERIC COLUMN STATISTICS")
    add_line("-" * 40)
    numeric_cols = df.select_dtypes(include=[np.number]).columns[:5]  # First 5
    for col in numeric_cols:
        add_line(f"  {col}:")
        add_line(f"    Min: {df[col].min():.2f}, Max: {df[col].max():.2f}")
        add_line(f"    Mean: {df[col].mean():.2f}, Median: {df[col].median():.2f}")
        add_line(f"    Std: {df[col].std():.2f}")
    add_line()
    
    # Duplicate Analysis
    add_line("DUPLICATE ANALYSIS")
    add_line("-" * 40)
    exact_dups = df.duplicated().sum()
    add_line(f"  Exact duplicate rows: {exact_dups:,} ({exact_dups/len(df)*100:.2f}%)")
    add_line()
    
    # Quality Score Summary
    add_line("QUALITY SCORES")
    add_line("-" * 40)
    
    # Calculate scores
    completeness = (1 - df.isnull().sum().sum() / (len(df) * len(df.columns))) * 100
    uniqueness = (1 - exact_dups / len(df)) * 100
    
    add_line(f"  Completeness: {completeness:.1f}%")
    add_line(f"  Uniqueness: {uniqueness:.1f}%")
    add_line(f"  Overall: {(completeness + uniqueness) / 2:.1f}%")
    add_line()
    
    # Footer
    add_line("=" * 70)
    add_line("END OF REPORT")
    add_line("=" * 70)
    
    report_text = "\n".join(report_lines)
    
    # Save if output file specified
    if output_file:
        with open(output_file, 'w', encoding='utf-8') as f:
            f.write(report_text)
    
    return report_text


# Generate report for weather data
report = generate_quality_report(df_weather, 'GlobalWeatherRepository')
print(report)

DATA QUALITY REPORT: GlobalWeatherRepository
Generated: 2026-02-02 09:58:00

DATASET OVERVIEW
----------------------------------------
Rows: 43,884
Columns: 41
Memory usage: 40.69 MB

COLUMN TYPES
----------------------------------------
  float64: 23 columns
  object: 11 columns
  int64: 7 columns

MISSING VALUES
----------------------------------------
  No missing values found!

NUMERIC COLUMN STATISTICS
----------------------------------------
  latitude:
    Min: -41.30, Max: 64.15
    Mean: 19.14, Median: 17.25
    Std: 24.48
  longitude:
    Min: -175.20, Max: 179.22
    Mean: 22.14, Median: 23.32
    Std: 65.81
  last_updated_epoch:
    Min: 1715849100.00, Max: 1735385400.00
    Mean: 1725624383.47, Median: 1725710400.00
    Std: 5694807.07
  temperature_celsius:
    Min: -24.20, Max: 49.20
    Mean: 23.69, Median: 25.80
    Std: 8.68
  temperature_fahrenheit:
    Min: -11.60, Max: 120.60
    Mean: 74.65, Median: 78.50
    Std: 15.63

DUPLICATE ANALYSIS
------------------------

The automated report provides a comprehensive overview that can be generated for any dataset. This is valuable for:
- **Initial data exploration**: Quick understanding of data quality
- **Pipeline monitoring**: Run after each ETL to track quality
- **Documentation**: Archive reports for audit trails
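For the audit-trail use case, one option is to write each report to a timestamped file so that earlier reports are never overwritten. The sketch below is an assumption about how archiving might be done: `archive_report` is a hypothetical helper, the file naming scheme is arbitrary, and a temporary directory stands in for a real archive location.

```python
import tempfile
from datetime import datetime
from pathlib import Path


def archive_report(report_text, dataset_name, archive_dir='quality_reports'):
    """
    Write report text to a timestamped file and return its path.

    Hypothetical helper; the directory name and file naming
    convention are assumptions.
    """
    out_dir = Path(archive_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    stamp = datetime.now().strftime('%Y%m%d_%H%M%S')
    out_path = out_dir / f"{dataset_name}_{stamp}.txt"
    out_path.write_text(report_text, encoding='utf-8')
    return out_path


# Demonstrate with placeholder text in a temporary directory; in the
# notebook the text would be the string from generate_quality_report(...)
with tempfile.TemporaryDirectory() as tmp_dir:
    saved = archive_report("DATA QUALITY REPORT: demo\n", "demo", archive_dir=tmp_dir)
    print(f"Saved report as {saved.name}")
```

Because the timestamp is part of the file name, reports sort chronologically in a directory listing, which makes it straightforward to compare quality before and after an ETL change.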

---

## Summary: Key Takeaways

### 1. Outlier Detection
- IQR method is robust (doesn't assume normality)
- Z-score method works well for normal distributions
- Domain rules are essential for physical constraints

### 2. Missing Values
- Understand the mechanism (MCAR, MAR, MNAR)
- Choose imputation strategy based on data type
- Document all handling decisions

### 3. Validation Rules
- Schema validation: required columns, types
- Range validation: min/max bounds
- Uniqueness: key constraints

### 4. Referential Integrity
- Check foreign key relationships
- Identify orphan records
- Verify cross-table consistency

### 5. Quality Scoring
- Score across multiple dimensions
- Weight by business importance
- Track trends over time

### 6. Automation
- Build reusable validation classes
- Generate reports automatically
- Integrate into ETL pipelines

---

## References

### Books
- McKinney, W. (2022). *Python for Data Analysis* (3rd ed.). O'Reilly Media.
  - Chapter 7: Data Cleaning and Preparation
- Maydanchik, A. (2007). *Data Quality Assessment*. Technics Publications.
- Dasu, T., & Johnson, T. (2003). *Exploratory Data Mining and Data Cleaning*. Wiley.
- Tukey, J. W. (1977). *Exploratory Data Analysis*. Addison-Wesley.

### Documentation
- [pandas Missing Data](https://pandas.pydata.org/docs/user_guide/missing_data.html)
- [scipy.stats](https://docs.scipy.org/doc/scipy/reference/stats.html)

### Academic Papers
- Rubin, D. B. (1976). "Inference and Missing Data." *Biometrika*, 63(3), 581-592.

---

## Practice Exercises

### Exercise 1: Food Data Outlier Analysis
Apply IQR and Z-score outlier detection to the Food Nutrition dataset. Compare the results for caloric value and protein content.

### Exercise 2: Custom Validator
Extend the `DataValidator` class with a `check_pattern` method that validates string columns against a regex pattern (e.g., email format, phone numbers).

### Exercise 3: Missing Value Imputation
Implement a group-based imputation strategy that fills missing values with the group mean (e.g., fill missing temperature with average for that country).

### Exercise 4: Quality Dashboard
Create a function that outputs the quality report as HTML with color-coded scores (green for good, red for poor).

---

## Next Class: Visualization I

In Lecture 11, we will cover:
- Matplotlib fundamentals
- Plotly interactive charts
- Time series visualization
- Storytelling with figures
- Accessibility and color choices

We will visualize our cleaned weather and climate data to communicate insights effectively.

---

*End of Lecture 10*