# 1. Understanding Data Quality Issues

## What Are Data Quality Issues?

Data quality issues are problems that make data unreliable, inaccurate, or unsuitable for analysis. Understanding these issues is the first step in effective data cleaning.

## Types of Data Quality Problems

### 1. **Missing Values**
- **What**: Absent or null data entries
- **Why it happens**: Data entry errors, system failures, optional fields
- **Impact**: Reduced sample size, biased analysis, model errors
- **Examples**: NaN, None, NULL, empty strings, placeholder values (999, -1)
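
Placeholder values are easy to overlook because `isnull()` does not count them. The sketch below assumes a hypothetical `reading` column in which 999 and -1 are known sentinels, and a `note` column where blank strings mean "not recorded"; both assumptions would need to be confirmed against the real data source.

```python
import numpy as np
import pandas as pd

# Hypothetical data mixing true nulls with sentinel placeholders
raw = pd.DataFrame({
    "reading": [12.5, 999, -1, np.nan, 8.0],
    "note": ["ok", "", None, "  ", "ok"],
})

# isnull() only sees true nulls, not placeholders
before = raw.isnull().sum()

cleaned = raw.copy()
# Assumption: 999 and -1 are sentinels here, not real measurements
cleaned["reading"] = cleaned["reading"].replace([999, -1], np.nan)
# Treat empty or whitespace-only strings as missing
cleaned["note"] = cleaned["note"].replace(r"^\s*$", np.nan, regex=True)

after = cleaned.isnull().sum()
print(pd.DataFrame({"before": before, "after": after}))
```

Converting placeholders to NaN first means every later missing-value count and imputation step sees the same set of gaps.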

### 2. **Duplicates**
- **What**: Repeated records in a dataset
- **Why it happens**: Data integration, human error, system glitches
- **Impact**: Inflated statistics, skewed analysis, wasted storage
- **Examples**: Exact duplicates, fuzzy duplicates (John vs. Jon)
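
Exact and near duplicates need different checks. As a rough sketch with made-up records: `duplicated()` finds exact matches, a normalized key catches differences in case and whitespace, and a string-similarity score (here from the standard library's `difflib`) hints at spelling variants such as John vs. Jon. Any similarity threshold is a judgment call for the dataset at hand.

```python
from difflib import SequenceMatcher

import pandas as pd

people = pd.DataFrame({
    "name": ["Jane Smith", "jane smith ", "Jane Smith", "Bob Johnson"],
    "email": ["jane@email.com", "jane@email.com", "jane@email.com", "bob@email.com"],
})

# Exact duplicates: every column must match character for character
exact_dupes = people.duplicated().sum()

# Near duplicates: compare on a normalized key (trimmed, lowercased)
normalized_name = people["name"].str.strip().str.lower()
key = normalized_name + "|" + people["email"].str.lower()
near_dupes = key.duplicated().sum()

print(f"Exact duplicates: {exact_dupes}")
print(f"Duplicates after normalization: {near_dupes}")

# Fuzzy duplicates: a similarity ratio between 0 and 1
similarity = SequenceMatcher(None, "john", "jon").ratio()
print(f"Similarity of 'john' and 'jon': {similarity:.2f}")
```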

### 3. **Inconsistent Data**
- **What**: Same information represented differently
- **Why it happens**: Multiple data sources, lack of standards, manual entry
- **Impact**: Fragmented analysis, incorrect grouping, aggregation errors
- **Examples**: USA vs US vs United States, 01/02/2020 vs 2020-01-02
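
One way to reconcile such variants is a lookup table for categories and explicit format strings for dates. The mapping below is hypothetical, and reading `01/02/2020` as month/day/year is an assumption; the source system has to confirm which convention it used.

```python
import pandas as pd

country = pd.Series(["USA", "US", "United States", "UK", "usa", "Canada"])

# Hypothetical lookup table; a real project would maintain one per dataset
canonical = {"usa": "USA", "us": "USA", "united states": "USA"}

normalized = country.str.strip().str.lower().map(canonical)
# Keep the original value where no mapping exists instead of dropping it
normalized = normalized.fillna(country)
print(normalized.value_counts())

# Mixed date formats: parse each known format explicitly instead of guessing
dates = pd.Series(["01/02/2020", "2020-01-02"])
us_style = pd.to_datetime(dates, format="%m/%d/%Y", errors="coerce")
iso_style = pd.to_datetime(dates, format="%Y-%m-%d", errors="coerce")
parsed = us_style.fillna(iso_style)
print(parsed)
```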

### 4. **Incorrect Data Types**
- **What**: Data stored with the wrong data type
- **Why it happens**: Poor schema design, automatic type inference errors
- **Impact**: Calculation errors, sorting issues, analysis failures
- **Examples**: Numbers as strings ('123'), dates as strings
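
The sorting problem and a typical fix can be sketched as follows, using invented order data. `errors="coerce"` converts unparseable entries to NaN or NaT rather than raising, which keeps the conversion running but creates new missing values that need follow-up.

```python
import pandas as pd

orders = pd.DataFrame({
    "amount": ["123", "45.6", "n/a"],
    "ordered_on": ["2020-01-02", "2020-02-30", "2020-03-15"],  # Feb 30 does not exist
})

# Strings sort lexicographically, so "10" comes before "9"
print(sorted(["9", "10", "100"]))

# Convert to proper types; bad entries become NaN / NaT
orders["amount"] = pd.to_numeric(orders["amount"], errors="coerce")
orders["ordered_on"] = pd.to_datetime(
    orders["ordered_on"], format="%Y-%m-%d", errors="coerce"
)

print(orders.dtypes)
print(orders)
```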

### 5. **Outliers & Anomalies**
- **What**: Values significantly different from others
- **Why it happens**: Measurement errors, data entry mistakes, genuine extreme values
- **Impact**: Skewed statistics, distorted models, misleading visualizations
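- **Examples**: Age = 999, a purchase amount of 9,999,999 among orders in the hundreds. Logically impossible values such as negative prices or impossible dates overlap with Invalid Values below.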
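
A small illustration of how a single extreme value skews statistics, and how the interquartile range (IQR) rule flags it. The prices are invented, and the 1.5 × IQR multiplier is a common convention rather than a fixed rule.

```python
import pandas as pd

prices = pd.Series([20, 22, 19, 25, 21, 23, 500], name="price")

# One extreme value pulls the mean far from the median
print(f"Mean: {prices.mean():.1f}, Median: {prices.median():.1f}")

# IQR rule: flag values more than 1.5 * IQR outside the middle 50%
q1, q3 = prices.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

is_outlier = (prices < lower) | (prices > upper)
print(prices[is_outlier])
```

Flagging is usually safer than deleting at this stage: whether 500 is a typo or a genuine premium item is a domain question.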

### 6. **Invalid Values**
- **What**: Values that don't make logical sense
- **Why it happens**: Human error, system bugs, data corruption
- **Impact**: Analysis failures, incorrect conclusions
- **Examples**: Age = -5, a birth date in the future, negative quantities
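
Invalid values are typically caught with explicit rules. The sketch below expresses each rule as a boolean column on hypothetical records; the 0 to 120 age range is an illustrative assumption, not a universal standard.

```python
import pandas as pd

records = pd.DataFrame({
    "age": [34, -5, 150, 28],
    "birth_date": pd.to_datetime(
        ["1990-05-01", "2001-03-12", "1950-01-01", "2100-01-01"]
    ),
    "quantity": [2, 1, -3, 4],
})

today = pd.Timestamp.today().normalize()

# Each rule yields a boolean Series: True where the row violates the rule
violations = pd.DataFrame({
    "age_out_of_range": ~records["age"].between(0, 120),
    "birth_date_in_future": records["birth_date"] > today,
    "negative_quantity": records["quantity"] < 0,
})

print(violations.sum())
print(records[violations.any(axis=1)])
```

Keeping one column per rule makes it easy to report which check failed, not just that a row is bad.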

### 7. **Formatting Issues**
- **What**: Inconsistent formatting within data
- **Why it happens**: Multiple data sources, lack of validation
- **Impact**: Matching problems, aggregation errors
- **Examples**: Extra whitespace, mixed case, special characters
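
These formatting problems can often be handled with a short chain of pandas string methods. The names below are made up, and the punctuation rule (keep letters, digits, spaces, and hyphens) is an assumption that would be too aggressive for some data.

```python
import pandas as pd

names = pd.Series(["  Alice  ", "jane smith", "JOHN   DOE", "Bob-Johnson!"])

cleaned = (
    names.str.strip()                               # remove leading/trailing whitespace
         .str.replace(r"\s+", " ", regex=True)      # collapse repeated internal spaces
         .str.replace(r"[^\w\s-]", "", regex=True)  # drop punctuation except hyphens
         .str.title()                               # consistent capitalization
)
print(cleaned.tolist())
```

Note that `str.title()` is a blunt tool: it would turn "McDonald" into "Mcdonald", so names with internal capitals may need a gentler rule.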

### 8. **Encoding Problems**
- **What**: Character encoding mismatches
- **Why it happens**: Different systems use different encodings
- **Impact**: Garbled text, data loss, processing errors
- **Examples**: Ã© instead of é, â€™ instead of ’ (a curly apostrophe)
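
Both examples above are what UTF-8 bytes look like when decoded as Windows-1252. The best fix is to read the file with the correct encoding in the first place (for instance via the `encoding` argument of `pd.read_csv`). When only the garbled text is available, reversing the wrong decode sometimes works. `repair_mojibake` is a hypothetical helper that handles only this one kind of damage.

```python
def repair_mojibake(text: str) -> str:
    """Undo UTF-8 text that was mistakenly decoded as Windows-1252."""
    try:
        return text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not this kind of damage; return the input unchanged
        return text


print(repair_mojibake("cafÃ©"))
print(repair_mojibake("itâ€™s"))
print(repair_mojibake("plain ascii"))
```

Text that is already correct falls through the `except` branch and is returned as-is, so the helper can be applied to a mixed column without damaging clean rows.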

## When to Address Each Issue

| Issue | When to Address | Priority |
|-------|----------------|----------|
| Missing Values | Before analysis/modeling | HIGH |
| Duplicates | Immediately after loading data | HIGH |
| Incorrect Types | Before any operations | HIGH |
| Inconsistent Data | Before grouping/aggregation | MEDIUM |
| Outliers | After understanding domain | MEDIUM |
| Invalid Values | During validation phase | HIGH |
| Formatting | Before string operations | LOW |
| Encoding | At data loading stage | HIGH |

## Why Understanding Is Important

- **Choose the right cleaning method**: Different problems need different solutions
- **Prioritize efforts**: Fix high-impact issues first
- **Prevent future issues**: Understand root causes
- **Communicate with stakeholders**: Explain data quality concerns
- **Validate solutions**: Know when cleaning is successful

In [None]:
import pandas as pd
import numpy as np

# Example: Dataset with multiple quality issues
dirty_data = {
    'customer_id': [1, 2, 2, 3, 4, 5, None, 7],
    'name': ['John Doe', 'jane smith', 'Jane Smith', 'Bob Johnson', None, '  Alice  ', 'Charlie', 'David'],
    'age': ['25', '30', '30', 'invalid', '150', '28', '35', '-5'],
    'email': ['john@email.com', 'jane@email.com', 'jane@email.com', 'bob@', None, 'alice@email.com', 'charlie@email.com', 'david@email.com'],
    'purchase_amount': ['100.50', '200', '200.00', '50.75', '9999999', None, '75.25', '30'],
    'country': ['USA', 'US', 'United States', 'UK', 'usa', 'USA', 'Canada', 'US']
}

df = pd.DataFrame(dirty_data)

print("=" * 60)
print("EXAMPLE: Dataset with Multiple Data Quality Issues")
print("=" * 60)
print("\nRaw Data:")
print(df)

print("\n" + "=" * 60)
print("DATA QUALITY ASSESSMENT")
print("=" * 60)

# 1. Missing Values Analysis
print("\n1. MISSING VALUES:")
missing = df.isnull().sum()
missing_pct = (missing / len(df) * 100).round(2)
missing_report = pd.DataFrame({
    'Missing Count': missing,
    'Percentage': missing_pct
})
print(missing_report[missing_report['Missing Count'] > 0])

# 2. Duplicate Detection
print("\n2. DUPLICATES:")
duplicates = df.duplicated()
print(f"   Total duplicate rows: {duplicates.sum()}")
print(f"   Duplicate customer_ids: {df['customer_id'].duplicated().sum()}")

# 3. Data Type Issues
print("\n3. DATA TYPE ISSUES:")
print("   Current types:")
print(df.dtypes)
print("\n   Issues:")
print("   - 'age' should be numeric, currently: object")
print("   - 'purchase_amount' should be numeric, currently: object")

# 4. Inconsistent Values
print("\n4. INCONSISTENT VALUES:")
print("   Country variations:")
print(df['country'].value_counts())

# 5. Invalid Values
print("\n5. INVALID VALUES:")
print("   Detected issues:")
print("   - age contains 'invalid', '150', '-5'")
print("   - email contains 'bob@' (invalid format)")
print("   - purchase_amount contains '9999999' (potential outlier)")

# 6. Formatting Issues
print("\n6. FORMATTING ISSUES:")
print("   - Mixed case in 'name': 'John Doe' vs 'jane smith'")
print("   - Extra whitespace: '  Alice  '")
print("   - Mixed case in 'country': 'USA' vs 'usa'")

print("\n" + "=" * 60)
print("✓ Data quality issues identified!")
print("✓ Next step: Apply appropriate cleaning techniques")
print("=" * 60)

# 2. Data Profiling & Initial Assessment

## What is Data Profiling?

Data profiling is the systematic examination of data to understand its structure, content, quality, and relationships. It's the essential first step before any cleaning activity.

## Why Data Profiling?

### 1. **Understand Your Data**
- Know what you're working with before making changes
- Identify patterns and anomalies
- Understand data distribution and characteristics

### 2. **Plan Cleaning Strategy**
- Prioritize cleaning tasks based on severity
- Choose appropriate cleaning techniques
- Estimate time and resources needed

### 3. **Set Quality Benchmarks**
- Establish baseline metrics
- Define acceptable quality levels
- Track improvement over time

### 4. **Communicate with Stakeholders**
- Document data quality issues
- Justify cleaning decisions
- Set expectations for data improvements

## When to Profile Data

- **Immediately after loading**: First look at raw data
- **Before analysis**: Ensure data is suitable for intended use
- **After cleaning**: Verify cleaning effectiveness
- **Regular intervals**: Monitor ongoing data quality
- **After data updates**: Check for new issues

## Key Profiling Activities

### 1. **Structural Profiling**
- Number of rows and columns
- Column names and data types
- Memory usage
- Index structure

### 2. **Content Profiling**
- Unique values count
- Value distributions
- Min, max, mean, median
- Null/missing counts

### 3. **Quality Profiling**
- Missing value percentages
- Duplicate counts
- Outlier detection
- Invalid value identification

### 4. **Relationship Profiling**
- Correlations between columns
- Dependencies and constraints
- Cross-field validation
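
Relationship profiling looks across columns rather than within one. A minimal sketch on a hypothetical orders table, checking a correlation, a derived-column dependency, and a cross-field date rule:

```python
import pandas as pd

orders = pd.DataFrame({
    "quantity":   [1, 2, 3, 4, 5],
    "unit_price": [10.0, 10.0, 10.0, 10.0, 10.0],
    "total":      [10.0, 20.0, 30.0, 45.0, 50.0],  # index 3 breaks quantity * unit_price
    "order_date": pd.to_datetime(["2024-01-01"] * 5),
    "ship_date":  pd.to_datetime(["2024-01-03", "2024-01-02", "2023-12-30",
                                  "2024-01-05", "2024-01-04"]),
})

# Correlation between numeric columns
print(orders[["quantity", "total"]].corr())

# Dependency: total should equal quantity * unit_price (within rounding)
expected_total = orders["quantity"] * orders["unit_price"]
total_mismatch = (orders["total"] - expected_total).abs() > 0.01

# Cross-field rule: an order cannot ship before it is placed
ships_too_early = orders["ship_date"] < orders["order_date"]

print(f"Rows with inconsistent total: {total_mismatch.sum()}")
print(f"Rows shipped before ordering: {ships_too_early.sum()}")
```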

## Profiling Tools

- **Manual**: pandas describe(), info(), value_counts()
- **Automated**: ydata-profiling (formerly pandas-profiling), sweetviz
- **Visual**: matplotlib, seaborn for distribution plots
- **Custom**: Build your own profiling functions

In [None]:
# Comprehensive Data Profiling Example

# Create sample dataset
np.random.seed(42)
n = 1000

sample_data = pd.DataFrame({
    'transaction_id': range(1, n+1),
    'customer_id': np.random.randint(1, 100, n),
    'product': np.random.choice(['Laptop', 'Phone', 'Tablet', 'Headphones', None], n),
    'quantity': np.random.randint(1, 10, n),
    'price': np.random.uniform(10, 1000, n),
    # Use np.nan (not None) so the column stays numeric instead of becoming object dtype
    'discount': np.random.choice([0, 5, 10, 15, 20, np.nan], n),
    'payment_method': np.random.choice(['Credit Card', 'Debit Card', 'PayPal', 'Cash'], n),
    'status': np.random.choice(['Completed', 'Pending', 'Cancelled'], n, p=[0.7, 0.2, 0.1])
})

# Add some duplicates
sample_data = pd.concat([sample_data, sample_data.sample(50)], ignore_index=True)

# Add some missing values
sample_data.loc[np.random.choice(sample_data.index, 100, replace=False), 'price'] = np.nan

print("=" * 80)
print("COMPREHENSIVE DATA PROFILING REPORT")
print("=" * 80)

# 1. STRUCTURAL PROFILING
print("\n1. STRUCTURAL PROFILING")
print("-" * 80)
print(f"Dataset Shape: {sample_data.shape[0]:,} rows × {sample_data.shape[1]} columns")
print(f"Memory Usage: {sample_data.memory_usage(deep=True).sum() / 1024**2:.2f} MB")

print("\nColumn Information:")
print(sample_data.dtypes)

# 2. CONTENT PROFILING
print("\n2. CONTENT PROFILING")
print("-" * 80)

print("\nUnique Values per Column:")
for col in sample_data.columns:
    unique_count = sample_data[col].nunique()
    unique_pct = (unique_count / len(sample_data) * 100)
    print(f"   {col:20s}: {unique_count:5d} unique ({unique_pct:5.1f}%)")

print("\nNumerical Statistics:")
print(sample_data.describe())

print("\nCategorical Value Distributions:")
for col in ['product', 'payment_method', 'status']:
    print(f"\n{col}:")
    print(sample_data[col].value_counts())

# 3. QUALITY PROFILING
print("\n3. QUALITY PROFILING")
print("-" * 80)

print("\nMissing Values Analysis:")
missing_summary = pd.DataFrame({
    'Column': sample_data.columns,
    'Missing_Count': sample_data.isnull().sum().values,
    'Missing_Pct': (sample_data.isnull().sum() / len(sample_data) * 100).values
})
print(missing_summary[missing_summary['Missing_Count'] > 0].to_string(index=False))

print("\nDuplicate Analysis:")
dup_count = sample_data.duplicated().sum()
dup_pct = (dup_count / len(sample_data) * 100)
print(f"   Total duplicates: {dup_count} ({dup_pct:.2f}%)")

print("\nOutlier Detection (Numerical Columns):")
for col in ['quantity', 'price', 'discount']:
    # is_numeric_dtype covers all integer/float widths (e.g. int32 on Windows)
    if pd.api.types.is_numeric_dtype(sample_data[col]):
        Q1 = sample_data[col].quantile(0.25)
        Q3 = sample_data[col].quantile(0.75)
        IQR = Q3 - Q1
        lower = Q1 - 1.5 * IQR
        upper = Q3 + 1.5 * IQR
        outliers = sample_data[(sample_data[col] < lower) | (sample_data[col] > upper)]
        print(f"   {col:15s}: {len(outliers):4d} outliers ({len(outliers)/len(sample_data)*100:.2f}%)")

# 4. CUSTOM QUALITY CHECKS
print("\n4. CUSTOM QUALITY CHECKS")
print("-" * 80)

# Business rule validations
print("\nBusiness Rule Validations:")
invalid_qty = (sample_data['quantity'] <= 0).sum()
invalid_price = (sample_data['price'] <= 0).sum()
invalid_discount = (sample_data['discount'] > 100).sum()

print(f"   Invalid quantity (≤0): {invalid_qty}")
print(f"   Invalid price (≤0): {invalid_price}")
print(f"   Invalid discount (>100): {invalid_discount}")

# 5. DATA QUALITY SCORE
print("\n5. DATA QUALITY SCORE")
print("-" * 80)

total_cells = sample_data.shape[0] * sample_data.shape[1]
missing_cells = sample_data.isnull().sum().sum()
completeness = ((total_cells - missing_cells) / total_cells * 100)

# Share of rows that are exact duplicates of an earlier row
duplicate_pct = (sample_data.duplicated().sum() / len(sample_data) * 100)
uniqueness = 100 - duplicate_pct

quality_score = (completeness + uniqueness) / 2

print(f"   Completeness: {completeness:.2f}%")
print(f"   Uniqueness: {uniqueness:.2f}%")
print(f"   Overall Quality Score: {quality_score:.2f}%")

print("\n" + "=" * 80)
print("✓ Profiling complete!")
print("✓ Use insights to plan cleaning strategy")
print("=" * 80)

# 3. Handling Missing Values

## What are Missing Values?

Missing values are absent or undefined data entries in a dataset. They appear as NaN (Not a Number), None, NULL, or empty cells.

## Why Missing Values Occur

1. **Data Entry Errors**: Human mistakes during manual entry
2. **System Failures**: Technical issues during data collection
3. **Optional Fields**: Survey questions not answered
4. **Data Integration**: Merging datasets with different schemas
5. **Privacy**: Sensitive information intentionally omitted
6. **Not Applicable**: Value doesn't apply to certain records

## When to Handle Missing Values

- **Before Analysis**: Missing data affects statistics and models
- **Before Modeling**: Most ML algorithms can't handle NaN
- **During Validation**: Check if missingness is acceptable
- **After Understanding**: Know WHY data is missing before fixing

## Types of Missingness

### 1. **MCAR (Missing Completely At Random)**
- No pattern to missing data
- Deleting rows does not bias results (it only reduces sample size), and simple imputation is usually acceptable
- Example: Random survey non-response

### 2. **MAR (Missing At Random)**
- Missing related to observed data
- Can be predicted from other variables
- Example: Income missing for younger respondents

### 3. **MNAR (Missing Not At Random)**
- Missing related to the missing value itself
- Most problematic type
- Example: High earners don't report income
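
The practical difference between the three mechanisms shows up in what happens to a simple statistic. The simulation below is a sketch under invented assumptions (income rising linearly with age, arbitrary missingness probabilities); it compares the observed mean of `income` under each mechanism with the true mean.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 10_000
age = rng.integers(18, 80, n)
# Hypothetical relationship: income rises with age, plus noise
income = 20_000 + 800 * age + rng.normal(0, 10_000, n)
df = pd.DataFrame({"age": age, "income": income})
true_mean = df["income"].mean()

# MCAR: every value has the same 30% chance of being missing
mcar = df["income"].mask(rng.random(n) < 0.3)

# MAR: missingness depends on an observed column (younger respondents skip more)
mar = df["income"].mask(rng.random(n) < np.where(df["age"] < 35, 0.6, 0.1))

# MNAR: missingness depends on the unobserved value itself (high earners skip more)
high = df["income"] > df["income"].quantile(0.75)
mnar = df["income"].mask(rng.random(n) < np.where(high, 0.7, 0.1))

print(f"True mean:          {true_mean:,.0f}")
print(f"MCAR observed mean: {mcar.mean():,.0f}")
print(f"MAR observed mean:  {mar.mean():,.0f}")
print(f"MNAR observed mean: {mnar.mean():,.0f}")
```

In this setup the MCAR mean stays close to the truth, the MAR mean drifts upward because low-income young respondents are underrepresented, and the MNAR mean drifts downward. The MAR bias can be reduced by conditioning on `age`; the MNAR bias cannot be corrected from the observed columns alone.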

## Handling Strategies

### 1. **Deletion**
**When to use**:
- Less than about 5% of values are missing (a common rule of thumb, not a strict cutoff)
- MCAR missingness
- Large dataset with plenty of data

**Methods**:
- Listwise deletion (drop rows)
- Column deletion (drop columns)

- **Pros**: Simple to apply; unbiased when data are MCAR
- **Cons**: Loss of data; potential bias when missingness is not MCAR

### 2. **Imputation (Filling)**
**When to use**:
- More than about 5% of values are missing
- Can't afford to lose data
- Pattern exists in missingness

**Methods**:
- Mean/Median/Mode
- Forward/Backward fill (time series)
- Interpolation
- Predictive modeling
- Multiple imputation

- **Pros**: Retains data size
- **Cons**: May introduce bias, assumes patterns

### 3. **Flagging**
**When to use**:
- Missingness is informative
- Want to preserve that information

- **Method**: Create an indicator variable
- **Pros**: Preserves information about missingness
- **Cons**: Increases dimensionality

## Why Each Method Works

| Method | Why It Works | When to Use |
|--------|-------------|-------------|
| Mean | Preserves central tendency | Normally distributed data |
| Median | Robust to outliers | Skewed distributions |
| Mode | Most common value | Categorical data |
| Forward Fill | Assumes continuity | Time series, ordered data |
| Interpolation | Smooth transitions | Temporal or spatial data |
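| KNN | Borrows values from the most similar complete records | Missing values that can be predicted from other columns |
| MICE | Models each incomplete column from the others, repeating until estimates stabilize | Several columns with related missingness |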
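
To make the KNN row concrete, here is a deliberately simplified sketch written with NumPy and pandas only. `knn_impute` is a hypothetical helper, not a library function, and it assumes the feature columns themselves have no missing values. Libraries such as scikit-learn provide more complete implementations (for example `KNNImputer`).

```python
import numpy as np
import pandas as pd


def knn_impute(df, target, features, k=3):
    """Fill missing `target` values with the mean of the k nearest complete rows.

    Simplified sketch: assumes `features` contain no missing values.
    """
    # Standardize features so no single column dominates the distance
    X = ((df[features] - df[features].mean()) / df[features].std()).to_numpy()
    known = df[target].notna().to_numpy()
    known_values = df.loc[known, target].to_numpy()
    filled = df[target].copy()

    for i in np.flatnonzero(~known):
        distances = np.linalg.norm(X[known] - X[i], axis=1)
        nearest = np.argsort(distances)[:k]
        filled.iloc[i] = known_values[nearest].mean()
    return filled


people = pd.DataFrame({
    "age":            [25, 27, 26, 55, 57, 56],
    "years_employed": [2, 4, 3, 30, 32, 31],
    "income":         [30_000.0, 34_000.0, np.nan, 90_000.0, 94_000.0, np.nan],
})

people["income_knn"] = knn_impute(people, "income", ["age", "years_employed"], k=2)
people["income_mean"] = people["income"].fillna(people["income"].mean())
print(people)
```

Comparing the two new columns shows the trade-off from the table: mean imputation gives both incomplete rows the same value, while KNN fills each one from rows with similar age and tenure.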

## Impact of Missing Data

- **Statistical Power**: Reduced sample size
- **Bias**: Non-random missingness introduces bias
- **Model Performance**: Lower accuracy, wrong conclusions
- **Computation**: Some algorithms fail with NaN

In [None]:
# Comprehensive Missing Value Handling Examples

# Create dataset with missing values
np.random.seed(42)
n = 100

data = pd.DataFrame({
    'age': np.random.randint(18, 80, n),
    'income': np.random.normal(50000, 20000, n),
    'credit_score': np.random.randint(300, 850, n),
    'years_employed': np.random.randint(0, 40, n),
    'city': np.random.choice(['NYC', 'LA', 'Chicago', 'Houston'], n)
})

# Introduce missing values
data.loc[np.random.choice(data.index, 15, replace=False), 'income'] = np.nan
data.loc[np.random.choice(data.index, 10, replace=False), 'credit_score'] = np.nan
data.loc[np.random.choice(data.index, 5, replace=False), 'years_employed'] = np.nan
data.loc[np.random.choice(data.index, 8, replace=False), 'city'] = np.nan

print("=" * 80)
print("COMPREHENSIVE MISSING VALUE HANDLING")
print("=" * 80)

# 1. DETECT MISSING VALUES
print("\n1. MISSING VALUE DETECTION")
print("-" * 80)

print("\nMissing Count by Column:")
print(data.isnull().sum())

print("\nMissing Percentage by Column:")
missing_pct = (data.isnull().sum() / len(data) * 100).round(2)
print(missing_pct[missing_pct > 0])

rows_with_missing = data.isnull().any(axis=1).sum()
complete_rows = len(data) - rows_with_missing
print(f"\nTotal rows with any missing: {rows_with_missing} ({rows_with_missing/len(data)*100:.1f}%)")
print(f"Complete rows (no missing): {complete_rows} ({complete_rows/len(data)*100:.1f}%)")

# 2. DELETION STRATEGIES
print("\n2. DELETION STRATEGIES")
print("-" * 80)

# Drop rows with any missing
data_dropna = data.dropna()
print(f"\nListwise deletion (drop any NaN): {len(data)} → {len(data_dropna)} rows ({len(data_dropna)/len(data)*100:.1f}% retained)")

# Drop rows with missing in specific columns
data_drop_subset = data.dropna(subset=['income', 'credit_score'])
print(f"Drop if income OR credit_score missing: {len(data)} → {len(data_drop_subset)} rows")

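# Drop columns with too many missing
threshold = 0.3  # Drop if >30% missing
# thresh is the minimum number of non-missing values a column needs to be kept (must be an int)
min_non_missing = round(len(data) * (1 - threshold))
data_drop_cols = data.dropna(thresh=min_non_missing, axis=1)
print(f"Drop columns with >{threshold:.0%} missing: {data.shape[1]} → {data_drop_cols.shape[1]} columns")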

# 3. IMPUTATION STRATEGIES
print("\n3. IMPUTATION STRATEGIES")
print("-" * 80)

data_imputed = data.copy()

# Mean imputation (for income)
mean_income = data_imputed['income'].mean()
# Assign the result back: fillna(inplace=True) on a selected column is
# chained assignment, which newer pandas versions warn about or ignore
data_imputed['income'] = data_imputed['income'].fillna(mean_income)
print(f"\nIncome: Mean imputation ({mean_income:.2f})")

# Median imputation (for credit_score - more robust to outliers)
median_credit = data_imputed['credit_score'].median()
data_imputed['credit_score'] = data_imputed['credit_score'].fillna(median_credit)
print(f"Credit Score: Median imputation ({median_credit:.0f})")

# Mode imputation (for categorical - city)
mode_city = data_imputed['city'].mode()[0]
data_imputed['city'] = data_imputed['city'].fillna(mode_city)
print(f"City: Mode imputation ('{mode_city}')")

# Constant value (for years_employed); assumes missing means no employment history
data_imputed['years_employed'] = data_imputed['years_employed'].fillna(0)
print("Years Employed: Constant value (0)")

print(f"\nRemaining missing values: {data_imputed.isnull().sum().sum()}")

# 4. ADVANCED IMPUTATION
print("\n4. ADVANCED IMPUTATION TECHNIQUES")
print("-" * 80)

# Forward fill (for time series - example)
# Note: this sample data has no natural order, so the fill is only illustrative.
# fillna(method='ffill') is deprecated; use .ffill() / .bfill() instead
data_ffill = data.copy()
data_ffill = data_ffill.ffill()
print(f"\nForward Fill: {data.isnull().sum().sum()} → {data_ffill.isnull().sum().sum()} missing")

# Backward fill
data_bfill = data.copy()
data_bfill = data_bfill.bfill()
print(f"Backward Fill: {data.isnull().sum().sum()} → {data_bfill.isnull().sum().sum()} missing")

# Interpolation (for numerical columns)
data_interp = data.copy()
data_interp[['income', 'credit_score']] = data_interp[['income', 'credit_score']].interpolate(method='linear')
print(f"Interpolation: {data['income'].isnull().sum()} → {data_interp['income'].isnull().sum()} missing in income")

# 5. MISSING VALUE INDICATOR
print("\n5. MISSING VALUE INDICATOR (FLAGGING)")
print("-" * 80)

data_flagged = data.copy()

# Create indicator columns
data_flagged['income_missing'] = data_flagged['income'].isnull().astype(int)
data_flagged['credit_missing'] = data_flagged['credit_score'].isnull().astype(int)

# Then fill the actual values (assign back instead of using inplace=True on a column)
data_flagged['income'] = data_flagged['income'].fillna(data_flagged['income'].mean())
data_flagged['credit_score'] = data_flagged['credit_score'].fillna(data_flagged['credit_score'].median())

print("\nCreated indicator columns:")
print(f"   income_missing: {data_flagged['income_missing'].sum()} cases flagged")
print(f"   credit_missing: {data_flagged['credit_missing'].sum()} cases flagged")

# 6. COMPARISON OF METHODS
print("\n6. METHOD COMPARISON")
print("-" * 80)

comparison = pd.DataFrame({
    'Method': ['Original', 'Deletion', 'Mean/Median', 'Forward Fill', 'Interpolation', 'Flagged'],
    'Rows': [len(data), len(data_dropna), len(data_imputed), len(data_ffill), len(data_interp), len(data_flagged)],
    'Columns': [data.shape[1], data_dropna.shape[1], data_imputed.shape[1], data_ffill.shape[1], data_interp.shape[1], data_flagged.shape[1]],
    'Missing': [data.isnull().sum().sum(), data_dropna.isnull().sum().sum(), data_imputed.isnull().sum().sum(), 
                data_ffill.isnull().sum().sum(), data_interp.isnull().sum().sum(), data_flagged[data.columns].isnull().sum().sum()]
})

print(comparison.to_string(index=False))

print("\n" + "=" * 80)
print("RECOMMENDATIONS:")
print("-" * 80)
print("• Use DELETION when <5% missing and MCAR")
print("• Use MEAN for normal distributions")
print("• Use MEDIAN for skewed data or presence of outliers")
print("• Use MODE for categorical variables")
print("• Use FORWARD/BACKWARD FILL for time series")
print("• Use INTERPOLATION for continuous numerical data")
print("• Use FLAGGING when missingness is informative")
print("• Always document which method you used and why")
print("=" * 80)