# Data Quality Issue Generator

Use this notebook to create synthetic data quality issues for the ingestion DAG demo. It loads a base dataset, then writes seven mutated copies, each introducing one issue type the DAG can detect.

In [None]:
import pandas as pd
import numpy as np
from pathlib import Path

BASE_DATA = Path('../Data/raw_dataset.csv')
OUTPUT_DIR = Path('../Data/raw')
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

assert BASE_DATA.exists(), f'Base dataset missing: {BASE_DATA}'
base_df = pd.read_csv(BASE_DATA)
base_df.head()

## Issue recipes
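We introduce seven issue types, each written to its own CSV file:
1. **missing_column**: drop a required feature (`Balance`).
2. **missing_value**: insert NaNs into a required field (`CreditScore`).
3. **unknown_category**: set a categorical field to an unexpected value (the cell below corrupts `Geography`; `Gender` can be treated the same way).
4. **out_of_bounds**: push a numeric value outside its allowed range (`Age`).
5. **non_numeric**: place strings in a numeric column (`Tenure`).
6. **invalid_boolean**: use non-binary values in a flag column (the cell below corrupts `HasCrCard`; `IsActiveMember` can be treated the same way).
7. **duplicate_row**: duplicate existing records.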
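
Most recipes in the next cell share one pattern: `np.where(d.index % n == 0, bad_value, d.column)` overwrites every n-th row by index label and leaves the rest untouched. The sketch below shows the pattern on a toy frame. The values and the choice of `n = 4` are illustrative only, and it assumes the default `RangeIndex` that `pd.read_csv` produces.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the base dataset (values are made up for illustration).
toy = pd.DataFrame({'Age': [42, 41, 39, 43, 44, 50, 29, 27]})

# Overwrite rows whose index is a multiple of 4; all other rows keep their value.
corrupted = toy.assign(Age=np.where(toy.index % 4 == 0, 999, toy.Age))

# Index labels of the rows that were overwritten, and their share of the frame.
affected = corrupted.index[corrupted.Age == 999].tolist()
print(affected)
print(len(affected) / len(toy))
```

With a modulus of `n`, roughly one row in `n` is corrupted, so a smaller modulus makes the issue denser in the output file.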

In [None]:
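# Columns a valid file must contain; listed for reference (the recipes below do not read this list).
REQUIRED_COLUMNS = [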
    'RowNumber','CustomerId','Surname','CreditScore','Geography','Gender','Age',
    'Tenure','Balance','NumOfProducts','HasCrCard','IsActiveMember','EstimatedSalary'
]

def make_copy(df, name, mutator):
    """Apply `mutator` to a deep copy of `df` and write it to OUTPUT_DIR/<name>.csv."""
    mutated = mutator(df.copy(deep=True))
    path = OUTPUT_DIR / f'{name}.csv'
    mutated.to_csv(path, index=False)
    print(f'Wrote {path} ({len(mutated)} rows)')

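# 1. missing_column: drop the required Balance column.
make_copy(base_df, 'missing_column', lambda d: d.drop(columns=['Balance']))

# 2. missing_value: set CreditScore to NaN in every 5th row (the column becomes float).
make_copy(base_df, 'missing_value', lambda d: d.assign(CreditScore=np.where(d.index % 5 == 0, np.nan, d.CreditScore)))

# 3. unknown_category: set Geography to 'Atlantis' in every 7th row.
make_copy(base_df, 'unknown_category', lambda d: d.assign(Geography=np.where(d.index % 7 == 0, 'Atlantis', d.Geography)))

# 4. out_of_bounds: set Age to 999 in every 4th row.
make_copy(base_df, 'out_of_bounds', lambda d: d.assign(Age=np.where(d.index % 4 == 0, 999, d.Age)))

# 5. non_numeric: put the string 'ten' into Tenure in every 6th row.
make_copy(base_df, 'non_numeric', lambda d: d.assign(Tenure=np.where(d.index % 6 == 0, 'ten', d.Tenure)))

# 6. invalid_boolean: set HasCrCard to 2 in every 3rd row.
make_copy(base_df, 'invalid_boolean', lambda d: d.assign(HasCrCard=np.where(d.index % 3 == 0, 2, d.HasCrCard)))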

# 7. duplicate_row: append a second copy of the first 10 rows.
def add_duplicates(df):
    dupes = df.head(10)
    return pd.concat([df, dupes], ignore_index=True)

make_copy(base_df, 'duplicate_row', add_duplicates)

Run the cell above to refresh the synthetic error files, then trigger the ingestion DAG to validate them.
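
Before triggering the DAG, it can help to confirm that a generated file really contains the issue it is named for. The DAG's own validation logic is not shown in this notebook, so the helper below is only a rough, hypothetical sketch: `summarize_issues` and its parameters are names invented here, and it covers only the checks that need no extra configuration. Out-of-bounds and unknown-category checks are left out because the allowed ranges and categories are not defined in this notebook.

```python
import pandas as pd

def summarize_issues(df, required_columns, numeric_columns, boolean_columns):
    """Return rough counts of issue types found in `df` (hypothetical helper)."""
    report = {
        'missing_columns': [c for c in required_columns if c not in df.columns],
        'duplicate_rows': int(df.duplicated().sum()),
        'missing_values': {},
        'non_numeric': {},
        'invalid_boolean': {},
    }
    for col in df.columns:
        n_missing = int(df[col].isna().sum())
        if n_missing:
            report['missing_values'][col] = n_missing
    for col in numeric_columns:
        if col in df.columns:
            # Count values that are present but cannot be parsed as numbers.
            parsed = pd.to_numeric(df[col], errors='coerce')
            bad = int((parsed.isna() & df[col].notna()).sum())
            if bad:
                report['non_numeric'][col] = bad
    for col in boolean_columns:
        if col in df.columns:
            # Count values that are present but are neither 0 nor 1.
            bad = int((~df[col].isin([0, 1]) & df[col].notna()).sum())
            if bad:
                report['invalid_boolean'][col] = bad
    return report

# Small made-up sample that mixes several issue types at once.
sample = pd.DataFrame({
    'Tenure': [2, 'ten', 8, 2],
    'HasCrCard': [1, 2, 0, 1],
    'CreditScore': [619.0, None, 502.0, 619.0],
})
report = summarize_issues(
    sample,
    required_columns=['Tenure', 'HasCrCard', 'CreditScore', 'Balance'],
    numeric_columns=['Tenure', 'CreditScore'],
    boolean_columns=['HasCrCard'],
)
print(report)
```

To apply it to a generated file, load that file with `pd.read_csv(OUTPUT_DIR / 'missing_value.csv')` and pass `REQUIRED_COLUMNS` as `required_columns`.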