# Module 08: Data Quality and Validation

**Estimated Time:** 45-60 minutes

## Learning Objectives

By the end of this module, you will:
- Understand data quality dimensions
- Implement data validation strategies
- Use validation frameworks (Pandera, Great Expectations)
- Design data quality checks
- Test data pipelines
- Understand data contracts

---

## 1. Data Quality Dimensions

### The Six Dimensions of Data Quality

1. **Accuracy**: Is the data correct?
   - Values match reality
   - No errors or mistakes

2. **Completeness**: Is all required data present?
   - No missing values where required
   - All records are captured

3. **Consistency**: Does data agree across systems?
   - Same data, same value everywhere
   - No contradictions

4. **Timeliness**: Is data up-to-date?
   - Available when needed
   - Fresh and current

5. **Validity**: Does data conform to rules?
   - Follows format requirements
   - Within acceptable ranges

6. **Uniqueness**: Is the data free of unwanted duplicates?
   - Each record appears once
   - Primary keys are unique
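
The dimensions that can be measured from a single table (completeness, uniqueness, validity) can be expressed as ratios between 0 and 1. The following is a minimal sketch rather than a standard API: the helper name `quality_scores`, the sample columns, and the 0-120 age rule are illustrative assumptions. Accuracy, consistency, and timeliness usually need an external reference (a source of truth, a second system, or load timestamps), so they are left out here.

```python
import pandas as pd


def quality_scores(frame, key, required, valid_rules):
    """Return 0-1 scores for three measurable quality dimensions.

    Hypothetical helper for illustration:
      key:         column expected to be unique
      required:    columns that must not be null
      valid_rules: {column: function returning a boolean Series}
    """
    if len(frame) == 0:
        raise ValueError("cannot score an empty DataFrame")

    # Completeness: share of non-null cells among the required columns
    completeness = float(frame[required].notna().to_numpy().mean())

    # Uniqueness: share of rows whose key has not appeared earlier
    uniqueness = float((~frame[key].duplicated()).mean())

    # Validity: share of individual rule evaluations that pass
    passed = [rule(frame[col]) for col, rule in valid_rules.items()]
    validity = float(pd.concat(passed).mean()) if passed else 1.0

    return {
        "completeness": completeness,
        "uniqueness": uniqueness,
        "validity": validity,
    }


sample = pd.DataFrame(
    {
        "customer_id": [1, 2, 3, 3],  # one repeated key
        "name": ["Alice", None, "Carol", "Dan"],  # one missing value
        "age": [25, 30, -5, 41],  # one value out of range
    }
)

scores = quality_scores(
    sample,
    key="customer_id",
    required=["customer_id", "name"],
    valid_rules={"age": lambda s: s.between(0, 120)},
)
print(scores)
```

Scores like these are easier to track over time and to compare against a threshold than a raw pass/fail flag.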

### Impact of Poor Data Quality

- [FAIL] Bad business decisions
- [FAIL] Wasted resources
- [FAIL] Loss of customer trust
- [FAIL] Regulatory compliance issues
- [FAIL] Failed ML models

In [None]:
import pandas as pd
import numpy as np
from datetime import datetime, timedelta

print("[OK] Libraries loaded")

---

## 2. Basic Data Validation

In [None]:
# Create sample data with quality issues
data = {
    "customer_id": [1, 2, 3, 4, 5, 5],  # Duplicate
    "name": ["Alice", "Bob", None, "David", "Eve", "Frank"],  # Missing value
    "email": [
        "alice@ex.com",
        "invalid-email",
        "carol@ex.com",
        "david@ex.com",
        "eve@ex.com",
        "frank@ex.com",
    ],  # Invalid format
    "age": [25, 30, -5, 200, 35, 28],  # Invalid values
    "revenue": [1000.0, 1500.0, 2000.0, None, 3000.0, 1200.0],  # Missing value
    "signup_date": [
        "2024-01-01",
        "2024-02-01",
        "2024-03-01",
        "2024-04-01",
        "2024-05-01",
        "2024-06-01",
    ],
}

df = pd.DataFrame(data)
print("Sample Data:")
df

In [None]:
# Basic validation checks
def basic_data_validation(df):
    """
    Perform basic data quality checks
    """
    validation_results = []

    # Check 1: Missing values
    missing_counts = df.isnull().sum()
    for col, count in missing_counts.items():
        validation_results.append(
            {
                "check": "Missing Values",
                "column": col,
                "passed": count == 0,
                "details": f"{count} missing values",
            }
        )

    # Check 2: Duplicate primary keys
    if "customer_id" in df.columns:
        duplicates = df["customer_id"].duplicated().sum()
        validation_results.append(
            {
                "check": "Unique IDs",
                "column": "customer_id",
                "passed": duplicates == 0,
                "details": f"{duplicates} duplicates found",
            }
        )

    # Check 3: Value ranges
    if "age" in df.columns:
        invalid_ages = ((df["age"] < 0) | (df["age"] > 120)).sum()
        validation_results.append(
            {
                "check": "Age Range",
                "column": "age",
                "passed": invalid_ages == 0,
                "details": f"{invalid_ages} values outside 0-120 range",
            }
        )

    # Check 4: Email format
    if "email" in df.columns:
        email_pattern = r"^[\w\.-]+@[\w\.-]+\.\w+$"
        valid_emails = df["email"].str.match(email_pattern, na=False)
        invalid_count = (~valid_emails).sum()
        validation_results.append(
            {
                "check": "Email Format",
                "column": "email",
                "passed": invalid_count == 0,
                "details": f"{invalid_count} invalid email formats",
            }
        )

    return pd.DataFrame(validation_results)


# Run validation
results = basic_data_validation(df)
print("\nValidation Results:")
print("=" * 80)
for _, row in results.iterrows():
    status = "[OK] PASS" if row["passed"] else "[FAIL] FAIL"
    print(f"{status} | {row['check']:15} | {row['column']:15} | {row['details']}")

print("\n" + "=" * 80)
passed = results["passed"].sum()
total = len(results)
print(f"Overall: {passed}/{total} checks passed ({(passed/total*100):.1f}%)")

---

## 3. Using Pandera for Schema Validation

In [None]:
# Install pandera if needed
# pip install pandera

try:
    import pandera as pa
    from pandera import Column, Check, DataFrameSchema

    print("[OK] Pandera loaded")
except ImportError:
    print("[WARNING] Pandera not installed. Install with: pip install pandera")
    print("   Continuing with conceptual examples...")

In [None]:
# Define a data schema with Pandera (conceptual example)
"""
schema = DataFrameSchema({
    "customer_id": Column(int, checks=[
        Check.greater_than(0),
        # Series-level check: `not` on the scalar result is clearer than bitwise `~`
        Check(lambda s: not s.duplicated().any(), error="Duplicate IDs found")
    ]),
    "name": Column(str, nullable=False),
    "email": Column(str, checks=[
        Check(lambda s: s.str.contains("@").all(), error="Invalid email format")
    ]),
    "age": Column(int, checks=[
        Check.in_range(min_value=0, max_value=120)
    ]),
    "revenue": Column(float, checks=[
        Check.greater_than_or_equal_to(0)
    ], nullable=False),
    "signup_date": Column(str)  # In practice, would be datetime
})

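# Validate DataFrame against schema.
# By default validate() raises on the first failing check; validate(df, lazy=True)
# collects all failures and raises pa.errors.SchemaErrors instead.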
try:
    validated_df = schema.validate(df)
    print("[OK] Data validation passed!")
except pa.errors.SchemaError as e:
    print("[FAIL] Data validation failed:")
    print(e)
"""

print("Schema definition example (conceptual)")

---

## 4. Custom Validation Framework

In [None]:
class DataValidator:
    """
    Custom data validation framework
    """

    def __init__(self, df):
        self.df = df
        self.results = []

    def check_not_null(self, column, error_message=None):
        """Check column has no null values"""
        null_count = self.df[column].isnull().sum()
        passed = null_count == 0

        self.results.append(
            {
                "check": "Not Null",
                "column": column,
                "passed": passed,
                "message": error_message or f"{null_count} null values found",
            }
        )
        return self

    def check_unique(self, column, error_message=None):
        """Check column has unique values"""
        dup_count = self.df[column].duplicated().sum()
        passed = dup_count == 0

        self.results.append(
            {
                "check": "Unique",
                "column": column,
                "passed": passed,
                "message": error_message or f"{dup_count} duplicates found",
            }
        )
        return self

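    def check_range(self, column, min_val=None, max_val=None, error_message=None):
        """Check column values are within range (null values fail any bound that is set)"""
        # Build the mask on the DataFrame's own index so it stays aligned
        # even when the index is not the default 0..n-1
        mask = pd.Series(True, index=self.df.index)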

        if min_val is not None:
            mask &= self.df[column] >= min_val
        if max_val is not None:
            mask &= self.df[column] <= max_val

        invalid_count = (~mask).sum()
        passed = invalid_count == 0

        self.results.append(
            {
                "check": "Range",
                "column": column,
                "passed": passed,
                "message": error_message
                or f"{invalid_count} values outside range [{min_val}, {max_val}]",
            }
        )
        return self

    def check_custom(self, column, condition_func, check_name="Custom", error_message=None):
        """Check custom condition"""
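        try:
            # Cast to a plain bool so NumPy booleans behave like Python ones
            passed = bool(condition_func(self.df[column]))
            # Show the error message only when the check actually fails
            message = "Passed" if passed else (error_message or "Failed")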
        except Exception as e:
            passed = False
            message = str(e)

        self.results.append(
            {"check": check_name, "column": column, "passed": passed, "message": message}
        )
        return self

    def get_report(self):
        """Get validation report"""
        df_results = pd.DataFrame(self.results)

        print("\nData Validation Report")
        print("=" * 80)

        for _, row in df_results.iterrows():
            status = "[OK]" if row["passed"] else "[FAIL]"
            print(f"{status} {row['check']:15} | {row['column']:15} | {row['message']}")

        print("=" * 80)
        passed = df_results["passed"].sum()
        total = len(df_results)
        print(f"Summary: {passed}/{total} checks passed ({(passed/total*100):.1f}%)")

        return df_results

    def validate(self, raise_on_error=False):
        """Validate and optionally raise exception"""
        report = self.get_report()
        all_passed = report["passed"].all()

        if raise_on_error and not all_passed:
            failed = report[~report["passed"]]
            raise ValueError(f"Validation failed:\n{failed}")

        return all_passed


print("[OK] Custom validation framework created")

In [None]:
# Use the custom validator
validator = DataValidator(df)

# Chain validation checks
(
    validator.check_not_null("name")
    .check_unique("customer_id")
    .check_range("age", min_val=0, max_val=120)
    .check_not_null("revenue")
    .check_custom(
        "email",
        lambda s: s.str.contains("@", na=False).all(),  # missing emails count as invalid
        check_name="Email Format",
        error_message="All emails must contain @",
    )
    .validate(raise_on_error=False)
)

---

## 5. Data Profiling

In [None]:
def profile_data(df):
    """
    Generate data quality profile
    """
    profile = {
        "total_rows": len(df),
        "total_columns": len(df.columns),
        "memory_usage_mb": df.memory_usage(deep=True).sum() / 1024 / 1024,
        "columns": {},
    }

    for col in df.columns:
        col_profile = {
            "dtype": str(df[col].dtype),
            "null_count": int(df[col].isnull().sum()),
            "null_percentage": float(df[col].isnull().sum() / len(df) * 100),
            "unique_count": int(df[col].nunique()),
            "duplicate_count": int(df[col].duplicated().sum()),
        }

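        # Numeric statistics (any numeric dtype except bool, not only int64/float64)
        if pd.api.types.is_numeric_dtype(df[col]) and not pd.api.types.is_bool_dtype(df[col]):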
            col_profile.update(
                {
                    "min": float(df[col].min()) if not df[col].isnull().all() else None,
                    "max": float(df[col].max()) if not df[col].isnull().all() else None,
                    "mean": float(df[col].mean()) if not df[col].isnull().all() else None,
                    "median": float(df[col].median()) if not df[col].isnull().all() else None,
                }
            )

        profile["columns"][col] = col_profile

    return profile


# Generate profile
import json

profile = profile_data(df)

print("Data Quality Profile:")
print("=" * 80)
print(f"Total Rows: {profile['total_rows']:,}")
print(f"Total Columns: {profile['total_columns']}")
print(f"Memory Usage: {profile['memory_usage_mb']:.2f} MB")
print("\nColumn Details:")
print(json.dumps(profile["columns"], indent=2))

---

## 6. Data Contracts

A **Data Contract** is an agreement between data producers and consumers about:

1. **Schema**: Column names, types, constraints
2. **Freshness**: How often data is updated and how much delay is tolerated
3. **Quality**: Acceptable data quality thresholds
4. **SLAs**: Service-level guarantees such as availability and response time

### Example Data Contract

```yaml
# customer_data_contract.yaml
dataset: customers
owner: data_team
consumers:
  - analytics_team
  - ml_team

schema:
  customer_id:
    type: integer
    nullable: false
    unique: true
  name:
    type: string
    nullable: false
  email:
    type: string
    nullable: false
    format: email
  age:
    type: integer
    min: 0
    max: 120

quality:
  completeness: 0.95  # 95% of required fields must be filled
  uniqueness: 1.0     # 100% unique IDs
  validity: 0.99      # 99% valid emails

freshness:
  update_frequency: daily
  max_delay_hours: 2

sla:
  availability: 0.999  # 99.9% uptime
  response_time_ms: 100
```
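
A contract is only useful if something checks it. The sketch below shows one possible way to compare measured metrics against the `quality` and `freshness` sections of the contract above. It assumes the YAML has already been parsed into a dictionary (for example with PyYAML) and that the metrics were measured elsewhere. The names `check_contract`, `measured`, and `last_updated` are hypothetical, and reading `daily` plus `max_delay_hours: 2` as "stale after 26 hours" is an interpretation, not something the contract format defines.

```python
from datetime import datetime, timedelta, timezone

# The quality and freshness sections of the contract, as a parsed dict
contract = {
    "quality": {"completeness": 0.95, "uniqueness": 1.0, "validity": 0.99},
    "freshness": {"update_frequency": "daily", "max_delay_hours": 2},
}

# Assumed mapping from frequency name to hours; extend as needed
PERIOD_HOURS = {"daily": 24}


def check_contract(contract, measured, last_updated, now=None):
    """Return a list of violation messages (an empty list means compliant)."""
    now = now or datetime.now(timezone.utc)
    violations = []

    # Each quality metric must meet or exceed its contracted threshold
    for metric, threshold in contract["quality"].items():
        value = measured.get(metric)
        if value is None:
            violations.append(f"{metric}: no measurement supplied")
        elif value < threshold:
            violations.append(f"{metric}: {value:.3f} is below {threshold:.3f}")

    # Freshness: one update period plus the tolerated delay (assumption)
    freshness = contract["freshness"]
    allowed = timedelta(
        hours=PERIOD_HOURS[freshness["update_frequency"]] + freshness["max_delay_hours"]
    )
    age = now - last_updated
    if age > allowed:
        violations.append(f"freshness: last update was {age} ago, allowed {allowed}")

    return violations


# Fixed timestamps keep the example deterministic
now = datetime(2024, 6, 2, 12, 0, tzinfo=timezone.utc)
violations = check_contract(
    contract,
    measured={"completeness": 0.97, "uniqueness": 1.0, "validity": 0.95},
    last_updated=datetime(2024, 6, 1, 8, 0, tzinfo=timezone.utc),
    now=now,
)
for violation in violations:
    print("[FAIL]", violation)
```

Returning a list of violations instead of raising on the first one lets the producer report every breach to consumers in a single run.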

---

## 7. Testing Data Pipelines

In [None]:
# Unit tests for data transformations
def test_data_transformation():
    """
    Test data transformation functions
    """
    # Sample input
    input_data = pd.DataFrame({"value": [10, 20, 30]})

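    # Apply transformation. assign() returns a new DataFrame, so the input
    # is not mutated and the comparisons below are meaningful.
    def transform(df):
        return df.assign(doubled=df["value"] * 2)

    result = transform(input_data)

    # Assertions
    assert "doubled" in result.columns, "Missing 'doubled' column"
    assert result["doubled"].tolist() == [20, 40, 60], "Incorrect transformation"
    assert len(result) == len(input_data), "Row count changed"
    assert "doubled" not in input_data.columns, "Input DataFrame was mutated"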

    print("[OK] All transformation tests passed")


test_data_transformation()

In [None]:
# Integration test for full pipeline
def test_pipeline_integration():
    """
    Test complete ETL pipeline
    """

    # Simulate pipeline
    def run_pipeline():
        # Extract
        df = pd.DataFrame({"id": [1, 2, 3], "value": [10, 20, 30]})

        # Transform
        df["value"] = df["value"] * 2

        # Validate
        assert df["value"].min() >= 0, "Negative values found"

        return df

    result = run_pipeline()

    # Validate output
    assert len(result) > 0, "Empty result"
    assert "id" in result.columns, "Missing ID column"
    assert "value" in result.columns, "Missing value column"

    print("[OK] Pipeline integration test passed")


test_pipeline_integration()

---

## 8. Key Takeaways

[OK] **Quality Dimensions**: Accuracy, completeness, consistency, timeliness, validity, uniqueness

[OK] **Validation**: Check data at every stage (extract, transform, load)

[OK] **Frameworks**: Pandera, Great Expectations, custom validators

[OK] **Profiling**: Understand data characteristics before building pipelines

[OK] **Data Contracts**: Agreements between producers and consumers

[OK] **Testing**: Unit tests for transformations, integration tests for pipelines

### Best Practices

1. **Validate early**: Check data as soon as it's extracted
2. **Fail fast**: Stop processing bad data immediately
3. **Log issues**: Record all quality problems
4. **Monitor metrics**: Track quality over time
5. **Automate checks**: Build quality checks into pipelines
6. **Document rules**: Make validation rules explicit
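
Practices 2 and 3 work well together: log every failed check, then stop the pipeline only when a blocking check fails. The following is a minimal sketch under assumptions: `enforce`, `DataQualityError`, and the choice of blocking checks are hypothetical, and the result dictionaries simply mirror the shape produced by `basic_data_validation` earlier in this module.

```python
import logging

logging.basicConfig(level=logging.INFO, format="%(levelname)s %(message)s")
logger = logging.getLogger("data_quality")


class DataQualityError(Exception):
    """Raised when a blocking quality check fails (hypothetical exception type)."""


def enforce(results, blocking_checks):
    """Log every failed check, then raise if any blocking check failed.

    results:         list of dicts with "check", "column", "passed", "details"
    blocking_checks: names of checks that must stop the pipeline
    Returns the number of failed checks when none of them is blocking.
    """
    failed = [r for r in results if not r["passed"]]

    # Log issues: every failure is recorded, blocking or not
    for r in failed:
        logger.warning("%s failed on %s: %s", r["check"], r["column"], r["details"])

    # Fail fast: stop before bad data reaches downstream steps
    blocking = [r for r in failed if r["check"] in blocking_checks]
    if blocking:
        names = ", ".join(f"{r['check']} ({r['column']})" for r in blocking)
        raise DataQualityError(f"blocking checks failed: {names}")

    return len(failed)


results = [
    {"check": "Unique IDs", "column": "customer_id", "passed": False,
     "details": "1 duplicates found"},
    {"check": "Age Range", "column": "age", "passed": False,
     "details": "2 values outside 0-120 range"},
    {"check": "Email Format", "column": "email", "passed": True,
     "details": "0 invalid email formats"},
]

try:
    enforce(results, blocking_checks={"Unique IDs"})
except DataQualityError as exc:
    print(f"[FAIL] Pipeline stopped: {exc}")
```

Separating blocking checks from advisory ones keeps the pipeline strict about problems that corrupt downstream data (such as duplicate keys) while still recording the rest for monitoring.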

---

## Next Steps

In **Module 09: End-to-End Pipeline Project**, we'll:
- Build a complete production-ready data pipeline
- Apply all concepts from previous modules
- Include extraction, transformation, loading
- Add validation, logging, and error handling
- Deploy and test the pipeline

---

**Ready for the capstone project?** Open `09_end_to_end_pipeline_project.ipynb`!