# Module 04: Data Collection and Management

**Difficulty**: ⭐⭐ (Intermediate)

**Estimated Time**: 60 minutes

**Prerequisites**: [Module 03: Research Design and Hypothesis Testing](03_research_design_hypothesis_testing.ipynb)

## Learning Objectives

By the end of this notebook, you will be able to:

1. **Design data collection strategies** by choosing between primary and secondary data sources
2. **Assess data quality** using completeness, validity, and reliability metrics
3. **Create data documentation** including data dictionaries and lineage tracking
4. **Implement data lineage tracking** to document transformations and versions
5. **Follow metadata standards** for reproducible and transparent research

## Setup

Let's import the libraries we'll use in this notebook.

In [None]:
# Standard data science libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
import json
from datetime import datetime

# Configuration for better visualizations
%matplotlib inline
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")

# Set random seeds for reproducibility
np.random.seed(42)

print("✓ Libraries imported successfully!")

## 1. Data Acquisition Strategies

### Primary vs. Secondary Data

When designing your research, the first decision is **where your data comes from**. Understanding the trade-offs between primary and secondary data is crucial for efficient research design.

| Aspect | Primary Data | Secondary Data |
|--------|-------------|----------------|
| **Definition** | Data you collect yourself | Data collected by others, already available |
| **Sources** | Surveys, interviews, experiments, sensors | Existing databases, published reports, public datasets |
| **Advantages** | Tailored to research question, fresh and current | Cost-effective, faster, large-scale available |
| **Disadvantages** | Time-consuming, expensive, biased by design choices | May not fit research question, quality unknown |
| **Examples** | Customer surveys, lab measurements | Census data, social media, government statistics |
| **Quality Control** | You design validation protocols | Limited control, must assess existing quality |

### Sampling Methods for Primary Data

When collecting primary data, the **sampling method** critically affects validity:

**1. Random Sampling**
- Every population member has equal selection probability
- Use when: Population is homogeneous and accessible
- Advantage: Unbiased, theoretically sound
- Disadvantage: May miss rare subgroups

**2. Stratified Sampling**
- Divide population into strata (subgroups), sample each
- Use when: Important subgroups exist (age, income, region)
- Advantage: Ensures all subgroups represented
- Disadvantage: More complex, requires population structure knowledge

**3. Cluster Sampling**
- Divide population into clusters, randomly select clusters
- Use when: Natural groupings exist and population is dispersed
- Advantage: Efficient for geographically dispersed populations
- Disadvantage: Less precise than simple random sampling when members of the same cluster resemble one another

**4. Convenience Sampling**
- Sample from readily available population members
- Use when: Quick, preliminary insights needed
- Advantage: Fast and inexpensive
- Disadvantage: Highly biased, results not generalizable
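
The first three methods above can be sketched with pandas. This is a minimal illustration, assuming a made-up population table with hypothetical `person_id` and `region` columns; the sample sizes and fractions are arbitrary choices, not recommendations.

```python
import numpy as np
import pandas as pd

# Hypothetical population: 1,000 people spread over four regions of unequal size
rng = np.random.default_rng(42)
population = pd.DataFrame({
    "person_id": np.arange(1000),
    "region": rng.choice(
        ["North", "South", "East", "West"], size=1000, p=[0.5, 0.3, 0.15, 0.05]
    ),
})

# 1. Simple random sampling: every row has the same chance of selection
random_sample = population.sample(n=100, random_state=42)

# 2. Stratified sampling: draw 10% from each region so small regions are not missed
stratified_sample = population.groupby("region").sample(frac=0.10, random_state=42)

# 3. Cluster sampling: randomly pick two whole regions and keep all of their members
regions = population["region"].unique().tolist()
chosen_regions = rng.choice(regions, size=2, replace=False).tolist()
cluster_sample = population[population["region"].isin(chosen_regions)]

print(random_sample["region"].value_counts())
print(stratified_sample["region"].value_counts())
print(f"Clusters selected: {sorted(chosen_regions)}")
```

Comparing the two `value_counts()` outputs shows the trade-off: the stratified draw guarantees every region appears in proportion, while the simple random draw leaves the representation of the smallest region to chance.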

### Data Validation at Collection

**Validation at the point of collection** prevents errors from accumulating:

- **Range checks**: Verify values fall within expected bounds
- **Type validation**: Ensure correct data types (numeric, date, categorical)
- **Referential integrity**: Foreign keys reference existing records
- **Uniqueness constraints**: Prevent duplicate entries where required
- **Mandatory fields**: Ensure required data is present
- **Format validation**: Phone numbers, emails, dates match expected patterns
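
Several of these checks can also be run on a whole batch at once. The sketch below assumes a small hypothetical `batch` table and a deliberately simple email pattern (an illustration, not a complete email validator); it covers the mandatory-field, format, and uniqueness checks from the list above.

```python
import pandas as pd

# Hypothetical batch of collected records
batch = pd.DataFrame({
    "customer_id": [1, 2, 2, 4],
    "email": ["a@example.com", "b@example", None, "d@example.org"],
})

# Deliberately simple pattern (an assumption for illustration): text@text.text
EMAIL_PATTERN = r"^[^@\s]+@[^@\s]+\.[^@\s]+$"

checks = pd.DataFrame({
    # Mandatory field: email must be present
    "email_present": batch["email"].notna(),
    # Format validation: email must match the pattern (missing values fail)
    "email_format_ok": batch["email"].str.match(EMAIL_PATTERN, na=False),
    # Uniqueness constraint: customer_id must not repeat (all copies are flagged)
    "id_unique": ~batch["customer_id"].duplicated(keep=False),
})

print(checks)
print(f"Records passing every check: {checks.all(axis=1).sum()} of {len(batch)}")
```

Keeping one boolean column per rule makes it easy to see which rule a record failed, rather than only whether it failed.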

In [None]:
# Example: Building a data validation function at collection time

class DataValidator:
    """Validates data according to specified rules."""
    
    def __init__(self, validation_rules):
        """
        Initialize validator with rules.
        
        Parameters:
        -----------
        validation_rules : dict
            Dictionary mapping column names to validation functions
        """
        self.validation_rules = validation_rules
        self.validation_report = {}
    
    def validate_record(self, record):
        """Validate a single record against all rules."""
        errors = []
        
        for field, rule in self.validation_rules.items():
            if field not in record:
                errors.append(f"Missing required field: {field}")
                continue
            
            value = record[field]
            try:
                is_valid = rule(value)
                if not is_valid:
                    errors.append(f"Invalid value for {field}: {value}")
            except Exception as e:
                errors.append(f"Validation error in {field}: {str(e)}")
        
        return len(errors) == 0, errors
    
    def validate_dataset(self, records):
        """Validate multiple records and generate report."""
        valid_count = 0
        invalid_records = []
        
        for idx, record in enumerate(records):
            is_valid, errors = self.validate_record(record)
            
            if is_valid:
                valid_count += 1
            else:
                invalid_records.append({
                    'record_index': idx,
                    'record': record,
                    'errors': errors
                })
        
        # Generate validation report
        total_records = len(records)
        validation_rate = (valid_count / total_records * 100) if total_records > 0 else 0
        
        self.validation_report = {
            'total_records': total_records,
            'valid_records': valid_count,
            'invalid_records': len(invalid_records),
            'validation_rate': validation_rate,
            'invalid_details': invalid_records
        }
        
        return self.validation_report
    
    def print_report(self):
        """Print validation report in readable format."""
        print("\nDATA VALIDATION REPORT")
        print("=" * 60)
        print(f"Total Records: {self.validation_report['total_records']}")
        print(f"Valid Records: {self.validation_report['valid_records']}")
        print(f"Invalid Records: {self.validation_report['invalid_records']}")
        print(f"Validation Rate: {self.validation_report['validation_rate']:.1f}%")
        
        if self.validation_report['invalid_details']:
            print(f"\nInvalid Record Details:")
            for detail in self.validation_report['invalid_details']:
                print(f"  Record {detail['record_index']}:")
                for error in detail['errors']:
                    print(f"    - {error}")

# Define validation rules for customer survey data
validation_rules = {
    'customer_id': lambda x: isinstance(x, int) and x > 0,
    'age': lambda x: isinstance(x, int) and 18 <= x <= 120,
    'email': lambda x: isinstance(x, str) and '@' in x,
    'satisfaction': lambda x: isinstance(x, int) and 1 <= x <= 5,
    'product_category': lambda x: x in ['Electronics', 'Clothing', 'Home', 'Books']
}

# Sample survey data (with some invalid records)
survey_records = [
    {'customer_id': 1, 'age': 32, 'email': 'john@example.com', 'satisfaction': 4, 'product_category': 'Electronics'},
    {'customer_id': 2, 'age': 150, 'email': 'jane@example.com', 'satisfaction': 5, 'product_category': 'Clothing'},  # Invalid age
    {'customer_id': 3, 'age': 45, 'email': 'no-at-sign', 'satisfaction': 3, 'product_category': 'Home'},  # Invalid email
    {'customer_id': 4, 'age': 28, 'email': 'bob@example.com', 'satisfaction': 6, 'product_category': 'Books'},  # Invalid satisfaction
    {'customer_id': 5, 'age': 35, 'email': 'alice@example.com', 'satisfaction': 4, 'product_category': 'Electronics'},
]

# Validate data
validator = DataValidator(validation_rules)
report = validator.validate_dataset(survey_records)
validator.print_report()

## 2. Data Quality Assessment

### Understanding Data Quality Dimensions

High-quality data is the foundation of rigorous research. The major dimensions of data quality are:

**1. Completeness**
- What percentage of required data is present?
- Missing values can introduce bias and reduce statistical power
- Calculate: `missing_percentage = (missing_values / total_records) * 100`

**2. Validity**
- Are values of the correct type and within expected ranges?
- Invalid data cannot be reliably analyzed
- Examples: Age as negative number, date in impossible format

**3. Reliability**
- Are measurements consistent across time and conditions?
- Unreliable data shows high variability from measurement error
- Measured through test-retest correlation, inter-rater agreement

**4. Bias Detection**
- Are there systematic errors or non-random patterns?
- Bias can invalidate conclusions even if data is otherwise complete
- Examples: Selection bias (only certain populations represented), measurement bias (systematic over/underestimation)
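
The two reliability measures mentioned above can be computed directly with NumPy. This is a rough sketch on synthetic data: the measurement-error size and the rater labels are invented for illustration, and Cohen's kappa is used here as one common way to quantify inter-rater agreement.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical test-retest data: the same 50 people measured twice
true_score = rng.normal(loc=50, scale=10, size=50)
first_measurement = true_score + rng.normal(scale=3, size=50)   # measurement error
second_measurement = true_score + rng.normal(scale=3, size=50)

# Test-retest reliability: Pearson correlation between the two occasions
test_retest_r = np.corrcoef(first_measurement, second_measurement)[0, 1]

# Hypothetical inter-rater data: two raters label the same 10 items
rater_a = np.array(["pass", "pass", "fail", "pass", "fail",
                    "pass", "pass", "fail", "pass", "pass"])
rater_b = np.array(["pass", "fail", "fail", "pass", "fail",
                    "pass", "pass", "pass", "pass", "pass"])

# Observed agreement: share of items on which the raters agree
observed = np.mean(rater_a == rater_b)

# Agreement expected by chance, from each rater's label frequencies
labels = np.unique(np.concatenate([rater_a, rater_b]))
expected = sum(np.mean(rater_a == label) * np.mean(rater_b == label) for label in labels)

# Cohen's kappa corrects observed agreement for chance agreement
kappa = (observed - expected) / (1 - expected)

print(f"Test-retest r: {test_retest_r:.2f}")
print(f"Observed agreement: {observed:.2f}, Cohen's kappa: {kappa:.2f}")
```

Here the raters agree on 8 of 10 items, but because both label most items "pass", much of that agreement would occur by chance, which is why kappa is noticeably lower than the raw agreement rate.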

### Critical Rule: Never Trust Aggregate Statistics Alone

Anscombe's quartet shows why you must visualize data: four different datasets have nearly identical summary statistics (means, standard deviations, and correlations agree to about two decimal places) but completely different structures!

In [None]:
# Anscombe's Quartet: Identical summary stats, different distributions

# Create Anscombe's four datasets
anscombe_data = pd.DataFrame({
    'dataset_1_x': [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5],
    'dataset_1_y': [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68],
    'dataset_2_x': [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5],
    'dataset_2_y': [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74],
    'dataset_3_x': [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5],
    'dataset_3_y': [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73],
    'dataset_4_x': [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
    'dataset_4_y': [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]
})

# Calculate summary statistics for each dataset
print("\nSUMMARY STATISTICS FOR ANSCOMBE'S QUARTET")
print("=" * 70)

for i in range(1, 5):
    x = anscombe_data[f'dataset_{i}_x']
    y = anscombe_data[f'dataset_{i}_y']
    correlation = x.corr(y)
    
    print(f"\nDataset {i}:")
    print(f"  X mean: {x.mean():.2f},  Y mean: {y.mean():.2f}")
    print(f"  X std:  {x.std():.2f},  Y std:  {y.std():.2f}")
    print(f"  Correlation (r): {correlation:.3f}")

print("\n" + "=" * 70)
print("Notice: All four datasets have nearly IDENTICAL summary statistics!")
print("This is why visualization is critical for data quality assessment.")

# Visualize the datasets
fig, axes = plt.subplots(2, 2, figsize=(12, 10))
axes = axes.ravel()

for i in range(1, 5):
    x = anscombe_data[f'dataset_{i}_x']
    y = anscombe_data[f'dataset_{i}_y']
    
    axes[i-1].scatter(x, y, s=100, alpha=0.6, edgecolors='black')
    axes[i-1].set_xlim(3, 20)
    axes[i-1].set_ylim(3, 14)
    axes[i-1].set_xlabel('X', fontsize=12)
    axes[i-1].set_ylabel('Y', fontsize=12)
    axes[i-1].set_title(f'Dataset {i}', fontsize=12, fontweight='bold')
    axes[i-1].grid(True, alpha=0.3)
    
    # Add trend line
    z = np.polyfit(x, y, 1)
    p = np.poly1d(z)
    axes[i-1].plot(x, p(x), "r--", alpha=0.8, linewidth=2)

# Add the overall title first so tight_layout() can reserve space for it
fig.suptitle("Anscombe's Quartet: Different Data, Same Statistics!",
             fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

In [None]:
# Comprehensive Data Quality Assessment Function

class DataQualityAssessment:
    """Assess data quality across multiple dimensions."""
    
    def __init__(self, dataframe):
        """Initialize with a pandas DataFrame."""
        self.df = dataframe
        self.quality_report = {}
    
    def assess_completeness(self):
        """Calculate missing data percentage by column."""
        completeness = {}
        
        for column in self.df.columns:
            total = len(self.df)
            missing = self.df[column].isna().sum()
            missing_pct = (missing / total) * 100
            completeness[column] = {
                'missing_count': missing,
                'missing_percentage': missing_pct,
                'complete_percentage': 100 - missing_pct
            }
        
        return completeness
    
    def assess_validity(self):
        """Check for invalid data types and ranges."""
        validity = {}
        
        for column in self.df.columns:
            validity[column] = {
                'dtype': str(self.df[column].dtype),
                'unique_values': self.df[column].nunique(),
                'min_value': self.df[column].min() if pd.api.types.is_numeric_dtype(self.df[column]) else 'N/A',
                'max_value': self.df[column].max() if pd.api.types.is_numeric_dtype(self.df[column]) else 'N/A',
                'sample_values': self.df[column].head(3).tolist()
            }
        
        return validity
    
    def assess_duplicates(self):
        """Identify duplicate records."""
        total_records = len(self.df)
        duplicate_rows = self.df.duplicated().sum()
        duplicate_percentage = (duplicate_rows / total_records) * 100
        
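        # Also check for duplicate values in ID columns ('id' or names ending in '_id').
        # A plain substring test would also match unrelated names such as 'valid' or 'width'.
        id_duplicates = {}
        for col in self.df.columns:
            name = str(col).lower()
            if name == 'id' or name.endswith('_id'):
                duplicate_ids = self.df[col].duplicated().sum()
                id_duplicates[col] = duplicate_ids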
        
        return {
            'duplicate_rows': duplicate_rows,
            'duplicate_percentage': duplicate_percentage,
            'duplicate_ids': id_duplicates
        }
    
    def generate_quality_report(self):
        """Generate comprehensive quality report."""
        self.quality_report = {
            'dataset_shape': self.df.shape,
            'completeness': self.assess_completeness(),
            'validity': self.assess_validity(),
            'duplicates': self.assess_duplicates()
        }
        
        return self.quality_report
    
    def print_report(self):
        """Print quality report in readable format."""
        if not self.quality_report:
            self.generate_quality_report()
        
        print("\nDATA QUALITY ASSESSMENT REPORT")
        print("=" * 70)
        
        shape = self.quality_report['dataset_shape']
        print(f"Dataset Size: {shape[0]} rows × {shape[1]} columns")
        
        print("\n" + "COMPLETENESS ANALYSIS" + "\n" + "-" * 70)
        completeness_df = pd.DataFrame(self.quality_report['completeness']).T
        print(completeness_df.to_string())
        
        print("\n" + "DUPLICATE ANALYSIS" + "\n" + "-" * 70)
        dup = self.quality_report['duplicates']
        print(f"Duplicate Rows: {dup['duplicate_rows']} ({dup['duplicate_percentage']:.1f}%)")
        if dup['duplicate_ids']:
            print("Duplicate IDs by column:")
            for col, count in dup['duplicate_ids'].items():
                print(f"  {col}: {count} duplicates")

# Create sample dataset with quality issues
np.random.seed(42)
sample_data = pd.DataFrame({
    'customer_id': [1, 2, 3, 4, 5, 5, 6, 7, np.nan, 9],  # Has duplicate and missing
    'age': [25, 32, 28, 45, 35, 35, 52, 41, 38, 29],  # All valid
    'income': [50000, 75000, np.nan, 120000, 65000, 65000, 95000, 85000, 72000, 58000],  # Missing
    'email': ['john@example.com', 'jane@example.com', 'invalid-email', 'bob@example.com',
              'alice@example.com', 'alice@example.com', 'charlie@example.com', 
              'david@example.com', 'eve@example.com', 'frank@example.com'],
    'satisfaction': [4, 5, 3, 4, 4, 4, 5, 3, 2, 4]  # Valid ratings
})

# Run quality assessment
qa = DataQualityAssessment(sample_data)
qa.print_report()

In [None]:
# Visualize missing data patterns

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Left plot: Missing data heatmap
missing_matrix = sample_data.isna().astype(int)
sns.heatmap(missing_matrix.T, cbar=True, cmap='RdYlGn_r', ax=axes[0], 
            yticklabels=True, xticklabels=False)
axes[0].set_title('Missing Data Pattern\n(Red = Missing, Green = Present)', 
                   fontsize=12, fontweight='bold')
axes[0].set_xlabel('Record Index', fontsize=11)
axes[0].set_ylabel('Column', fontsize=11)

# Right plot: Missing percentage by column
missing_pct = sample_data.isna().sum() / len(sample_data) * 100
colors = ['red' if x > 0 else 'green' for x in missing_pct]
missing_pct.plot(kind='barh', ax=axes[1], color=colors, edgecolor='black')
axes[1].set_title('Missing Data Percentage by Column', fontsize=12, fontweight='bold')
axes[1].set_xlabel('Missing Percentage (%)', fontsize=11)
axes[1].set_xlim(0, 25)

# Add percentage labels on bars
for i, v in enumerate(missing_pct):
    if v > 0:
        axes[1].text(v + 0.5, i, f'{v:.1f}%', va='center', fontweight='bold')

plt.tight_layout()
plt.show()

print("Quality Insight: Missing data patterns can reveal systematic collection issues.")
print("For example, if an entire column is missing, it may indicate a collection failure.")

## 3. Data Documentation and Metadata Standards

### Why Data Documentation Matters

**Data documentation** serves as the "instruction manual" for your dataset. Without it:
- Future you won't remember what column `X15` means
- Collaborators can't understand your data structure
- Errors go undetected because assumptions are implicit
- Research becomes irreproducible

### Essential Documentation Components

**1. Data Dictionary**
- Lists every column, its type, valid values, and meaning
- Includes units (kg, days, USD) and measurement scales
- Documents any transformations applied

**2. Data Lineage**
- Tracks where data came from (source)
- Documents all transformations applied
- Records versions and timestamps
- Enables reproducibility and auditing

**3. README File**
- High-level overview of the dataset
- How to use the data
- Known limitations and caveats
- Contact information for questions

**4. Metadata Standards**
- Follow established standards (e.g., DDI, ISO 19115, Dublin Core)
- Enable machine-readable metadata
- Facilitate data discovery and integration
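
As a rough illustration of machine-readable metadata, a dataset-level record can be kept as JSON next to the data file. The field names and values below are placeholders chosen for this sketch and are not taken from any particular standard; a real project would map them onto whichever standard it adopts.

```python
import json
from datetime import date

# Hypothetical dataset-level metadata; field names and values are illustrative only
metadata = {
    "title": "Customer Survey Dataset",
    "description": "Responses to a customer satisfaction survey.",
    "creator": "Research team (placeholder)",
    "date_created": date(2024, 1, 15).isoformat(),  # placeholder date
    "version": "1.0",
    "license": "To be specified",
    "variables": [
        {"name": "customer_id", "type": "integer",
         "description": "Unique customer identifier"},
        {"name": "satisfaction", "type": "integer",
         "description": "Rating from 1 to 5"},
    ],
    "known_limitations": ["Placeholder: describe sampling caveats here"],
}

# Serialize to JSON so both people and programs can read it
metadata_json = json.dumps(metadata, indent=2)
print(metadata_json)

# Round-trip check: the JSON text parses back to the same structure
restored = json.loads(metadata_json)
print(restored == metadata)
```

Storing this record under version control alongside the README keeps the human-readable and machine-readable descriptions in step.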

In [None]:
# Create a comprehensive Data Dictionary template and generator

class DataDictionary:
    """Create and manage data dictionaries for datasets."""
    
    def __init__(self, dataframe, dataset_name):
        """Initialize with a DataFrame and dataset name."""
        self.df = dataframe
        self.dataset_name = dataset_name
        self.dictionary = {}
    
    def add_column_definition(self, column_name, description, data_type, 
                             units='', valid_values=None, constraints=''):
        """Add definition for a column."""
        self.dictionary[column_name] = {
            'description': description,
            'data_type': data_type,
            'units': units,
            'valid_values': valid_values,
            'constraints': constraints,
            'column_type': str(self.df[column_name].dtype) if column_name in self.df.columns else 'N/A'
        }
    
    def generate_from_dataframe(self):
        """Auto-generate basic data dictionary from DataFrame."""
        for column in self.df.columns:
            dtype = str(self.df[column].dtype)
            unique_count = self.df[column].nunique()
            missing_count = self.df[column].isna().sum()
            
            self.dictionary[column] = {
                'description': f'[Description needed for {column}]',
                'data_type': dtype,
                'units': '',
                'valid_values': f'Unique values: {unique_count}',
                'constraints': f'Missing: {missing_count}',
                'sample_values': self.df[column].dropna().head(2).tolist()
            }
    
    def to_json(self):
        """Export dictionary as JSON."""
        return json.dumps(self.dictionary, indent=2, default=str)
    
    def to_dataframe(self):
        """Convert dictionary to DataFrame for display."""
        dict_list = []
        for col_name, col_info in self.dictionary.items():
            row = {'Column Name': col_name}
            row.update(col_info)
            dict_list.append(row)
        return pd.DataFrame(dict_list)
    
    def print_dictionary(self):
        """Print formatted data dictionary."""
        print(f"\nDATA DICTIONARY: {self.dataset_name.upper()}")
        print("=" * 80)
        
        for column, info in self.dictionary.items():
            print(f"\n📋 {column}")
            print(f"   Description: {info.get('description', 'N/A')}")
            print(f"   Data Type: {info.get('data_type', 'N/A')}")
            print(f"   Units: {info.get('units', 'None')}")
            print(f"   Valid Values: {info.get('valid_values', 'N/A')}")
            print(f"   Constraints: {info.get('constraints', 'None')}")
            if 'sample_values' in info:
                print(f"   Sample Values: {info.get('sample_values')}")

# Create and populate data dictionary
dd = DataDictionary(sample_data, 'Customer Survey Dataset')

# Manually add detailed descriptions
dd.add_column_definition(
    'customer_id',
    'Unique identifier for each customer',
    'integer',
    '',
    'Positive integers only',
    'Primary key, must be unique'
)

dd.add_column_definition(
    'age',
    'Age of customer in years',
    'integer',
    'years',
    '18-120',
    'Must be >= 18 and <= 120'
)

dd.add_column_definition(
    'income',
    'Annual household income',
    'float',
    'USD',
    '> 0',
    'Must be positive, in dollars'
)

dd.add_column_definition(
    'email',
    'Customer email address',
    'string',
    '',
    'Valid email format',
    'Must contain @ symbol, unique per customer'
)

dd.add_column_definition(
    'satisfaction',
    'Customer satisfaction rating',
    'integer',
    'Likert scale (1-5)',
    '1 (very dissatisfied) to 5 (very satisfied)',
    'Must be integer 1-5'
)

# Print the dictionary
dd.print_dictionary()

print("\n" + "=" * 80)
print("\n📊 Data Dictionary as Table:")
print(dd.to_dataframe().to_string(index=False))

In [None]:
# Data Lineage Tracking

class DataLineage:
    """Track data transformations and versions for reproducibility."""
    
    def __init__(self, source_name):
        """Initialize lineage tracking."""
        self.source_name = source_name
        self.lineage_log = []
        
        # Log the source
        self.log_step(
            step_type='SOURCE',
            description=f'Data source: {source_name}',
            input_records=None,
            output_records=None,
            transformation='Initial data load'
        )
    
    def log_step(self, step_type, description, input_records=None, 
                 output_records=None, transformation=''):
        """Log a transformation step."""
        log_entry = {
            'timestamp': datetime.now().isoformat(),
            'step_type': step_type,
            'description': description,
            'input_records': input_records,
            'output_records': output_records,
            'transformation': transformation
        }
        self.lineage_log.append(log_entry)
    
    def print_lineage(self):
        """Print data lineage in readable format."""
        print(f"\nDATA LINEAGE: {self.source_name}")
        print("=" * 80)
        
        for i, entry in enumerate(self.lineage_log, 1):
            print(f"\nStep {i}: {entry['step_type']}")
            print(f"  Timestamp: {entry['timestamp']}")
            print(f"  Description: {entry['description']}")
            # Compare against None so that a legitimate count of 0 is still printed
            if entry['input_records'] is not None:
                print(f"  Input Records: {entry['input_records']}")
            if entry['output_records'] is not None:
                print(f"  Output Records: {entry['output_records']}")
            if entry['transformation']:
                print(f"  Transformation: {entry['transformation']}")
    
    def to_json(self):
        """Export lineage as JSON for version control."""
        return json.dumps(self.lineage_log, indent=2, default=str)

# Create lineage for sample dataset
lineage = DataLineage('Customer Survey Data v1.0')

# Log each transformation step
lineage.log_step(
    step_type='VALIDATION',
    description='Validated email formats and age ranges',
    input_records=10,
    output_records=10,
    transformation='Applied regex validation to email column, range check on age column'
)

lineage.log_step(
    step_type='CLEANING',
    description='Removed duplicate records',
    input_records=10,
    output_records=9,
    transformation='Dropped repeated rows, keeping the first occurrence of each customer_id'
)

lineage.log_step(
    step_type='IMPUTATION',
    description='Handled missing income values',
    input_records=9,
    output_records=9,
    transformation='Applied median imputation to income column (missing: 1 value)'
)

lineage.log_step(
    step_type='ENCODING',
    description='Encoded categorical variables',
    input_records=9,
    output_records=9,
    transformation='No categorical variables in this dataset'
)

lineage.log_step(
    step_type='EXPORT',
    description='Final processed dataset',
    input_records=9,
    output_records=9,
    transformation='Exported to CSV format for analysis'
)

# Print lineage
lineage.print_lineage()
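
Record counts and timestamps describe what happened, but they do not pin down the exact contents of a version. One possible complement, sketched below, is to store a checksum of the data at each step. The helper name `dataframe_checksum` and the choice of hashing the CSV representation are assumptions made for this illustration.

```python
import hashlib
import pandas as pd

def dataframe_checksum(df: pd.DataFrame) -> str:
    """Return a SHA-256 hex digest of the DataFrame's CSV representation."""
    csv_bytes = df.to_csv(index=False).encode("utf-8")
    return hashlib.sha256(csv_bytes).hexdigest()

# Hypothetical before/after snapshots of a cleaning step
raw = pd.DataFrame({"customer_id": [1, 2, 2], "income": [50000, 75000, 75000]})
cleaned = raw.drop_duplicates()

raw_hash = dataframe_checksum(raw)
cleaned_hash = dataframe_checksum(cleaned)

print(f"raw:     {len(raw)} rows, sha256={raw_hash[:12]}...")
print(f"cleaned: {len(cleaned)} rows, sha256={cleaned_hash[:12]}...")
```

Logging the digest with each lineage entry lets a later reader confirm that the file they hold is the one the log describes: any change to the data produces a different digest.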

## 4. Data Preprocessing Workflow

### The 7-Step Preprocessing Pipeline

A robust preprocessing workflow ensures data quality and reproducibility:

1. **Data Loading and Initial Inspection**
   - Load data with proper encoding
   - Check shape, dtypes, and first few rows
   - Document source and version

2. **Exploratory Data Analysis (EDA)**
   - Understand distributions and relationships
   - Identify outliers and anomalies
   - Check for missing patterns

3. **Data Validation**
   - Check data types are correct
   - Verify values fall within expected ranges
   - Identify invalid or suspicious records

4. **Handling Missing Data**
   - Document why data is missing (MCAR, MAR, MNAR)
   - Decide on strategy: drop, impute, or flag
   - Fit any imputation on the training data only, then apply the same fitted values to the test data

5. **Outlier Detection and Handling**
   - Use statistical methods (z-score, IQR) or domain knowledge
   - Decide: remove, transform, or flag
   - Document decisions for transparency

6. **Feature Engineering and Transformation**
   - Create new features from existing ones
   - Normalize/standardize numeric features
   - Encode categorical variables
   - **CRITICAL**: Fit transformers ONLY on training data

7. **Quality Check and Export**
   - Verify preprocessing hasn't introduced errors
   - Check data quality metrics again
   - Export with versioning and metadata
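
The two statistical rules named in step 5 can be sketched as follows. The scores are hypothetical, and the cutoffs (1.5 × IQR and |z| > 3) are common conventions rather than fixed requirements.

```python
import pandas as pd

# Hypothetical test scores; 1000 is a likely data-entry error
scores = pd.Series([85, 92, 78, 88, 95, 1000, 76, 82], name="test_score")

# IQR rule: flag values more than 1.5 * IQR outside the quartiles
q1, q3 = scores.quantile([0.25, 0.75])
iqr = q3 - q1
iqr_outlier = (scores < q1 - 1.5 * iqr) | (scores > q3 + 1.5 * iqr)

# Z-score rule: flag values more than 3 standard deviations from the mean
z_scores = (scores - scores.mean()) / scores.std()
z_outlier = z_scores.abs() > 3

print(pd.DataFrame({
    "test_score": scores,
    "z_score": z_scores.round(2),
    "iqr_outlier": iqr_outlier,
    "z_outlier": z_outlier,
}))
```

In this small sample the IQR rule flags the value 1000 but the z-score rule does not: the outlier inflates the mean and standard deviation it is judged against, so its z-score stays below 3. This is one reason to combine statistical rules with domain knowledge (a test score cannot exceed 100).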

### THE CRITICAL RULE: Fit Only On Training Data

Fitting transformers on the full dataset is one of the most common sources of data leakage in ML projects:

**WRONG** (causes leakage):
```python
from sklearn.preprocessing import StandardScaler

# Scale entire dataset first
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_all)  # Uses ALL data
X_train = X_scaled[:split_idx]
X_test = X_scaled[split_idx:]  # Test data influenced by entire dataset
```

**CORRECT** (no leakage):
```python
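from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Split FIRST
X_train, X_test = train_test_split(X, test_size=0.2, random_state=42)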
# Then fit on training data only
scaler = StandardScaler()
scaler.fit(X_train)  # Uses ONLY training data
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)  # Consistent transformation
```
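
One way to make the correct pattern hard to get wrong is to wrap the scaler and the model in a scikit-learn pipeline, so that cross-validation re-fits the scaler inside each fold. The sketch below uses synthetic data and a logistic regression model purely as assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical synthetic data: two features on very different scales
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 2)) * [1, 100]
y = (X[:, 0] + X[:, 1] / 100 > 0).astype(int)

# The pipeline re-fits the scaler on the training portion of every fold,
# so the held-out portion never influences the scaling parameters
model = make_pipeline(StandardScaler(), LogisticRegression())
scores = cross_val_score(model, X, y, cv=5)

print(f"Cross-validated accuracy: {scores.mean():.2f}")
```

Because the scaler lives inside the pipeline, there is no separate `fit_transform` call on the full dataset that could leak test information.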

In [None]:
# Demonstration of correct preprocessing with proper data handling

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

print("\n" + "="*70)
print("CORRECT PREPROCESSING WORKFLOW: No Data Leakage")
print("="*70)

# Create synthetic dataset
print("\n1️⃣  STEP 1: Load and Inspect Data")
np.random.seed(42)
X = np.random.randn(100, 3) * [10, 50, 100]  # Different scales
y = (X[:, 0] + X[:, 1] * 0.5 + X[:, 2] * 0.1 + np.random.randn(100) * 5 > 50).astype(int)

print(f"Dataset shape: {X.shape}")
print(f"Feature means: {X.mean(axis=0).round(2)}")
print(f"Feature stds: {X.std(axis=0).round(2)}")
print("   ⚠️ Notice: Features have very different scales!")

print("\n2️⃣  STEP 2: Train-Test Split (BEFORE ANY FITTING)")
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(f"Training set size: {X_train.shape[0]}")
print(f"Test set size: {X_test.shape[0]}")
print("   ✓ Data split BEFORE any transformations - no leakage possible")

print("\n3️⃣  STEP 3: Fit Scaler ONLY on Training Data")
scaler = StandardScaler()
scaler.fit(X_train)  # Fit ONLY on training data
print(f"Scaler parameters computed from training data:")
print(f"   Training mean: {scaler.mean_.round(2)}")
print(f"   Training std: {scaler.scale_.round(2)}")
print("   ✓ Scaler parameters based ONLY on training data")

print("\n4️⃣  STEP 4: Transform Both Sets Using Training Parameters")
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(f"Training data after scaling:")
print(f"   Mean: {X_train_scaled.mean(axis=0).round(3)}")
print(f"   Std: {X_train_scaled.std(axis=0).round(3)}")
print(f"\nTest data after scaling:")
print(f"   Mean: {X_test_scaled.mean(axis=0).round(3)}")
print(f"   Std: {X_test_scaled.std(axis=0).round(3)}")
print("   ✓ Test data mean/std ≠ 0/1 because it wasn't used in fitting")
print("   ✓ This is CORRECT - test data should be scaled using training parameters")

print("\n" + "="*70)
print("KEY PRINCIPLE: Fit all transformers on training data only!")
print("This includes: scaling, imputation, encoding, feature selection")
print("="*70)
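
The same fit-then-transform pattern applies to imputation. As a minimal sketch with made-up income values, the fill value is learned from the training split and then reused unchanged on the test split:

```python
import numpy as np
import pandas as pd

# Hypothetical income data with missing values in both splits
train = pd.DataFrame({"income": [50000.0, 75000.0, np.nan, 120000.0, 65000.0]})
test = pd.DataFrame({"income": [np.nan, 58000.0, 300000.0]})

# "Fit": learn the fill value from the training split only
train_median = train["income"].median()

# "Transform": apply that same value to both splits
train_imputed = train.fillna({"income": train_median})
test_imputed = test.fillna({"income": train_median})

print(f"Median learned from training data: {train_median:,.0f}")
print(test_imputed)
```

The test split contains an unusually high income, but it has no effect on the fill value because the median was computed before the test data was touched.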

## Exercise 1: Assessing Data Quality

You've been given a new dataset for analysis. Your task is to:

1. **Load the data** and perform initial inspection
2. **Assess completeness**: Calculate missing percentage for each column
3. **Assess validity**: Check for out-of-range values
4. **Identify duplicates**: Find any duplicate records
5. **Document findings** in a short report

Below is a dataset with intentional quality issues. Complete the assessment:


In [None]:
# Exercise 1: Data Quality Assessment

# Create a problematic dataset (intentionally with quality issues)
exercise_data = pd.DataFrame({
    'student_id': [101, 102, 103, 104, 105, 105, 107, 108, np.nan, 110],
    'test_score': [85, 92, np.nan, 78, 88, 88, 95, 1000, 76, 82],  # 1000 is invalid
    'study_hours': [5, 6.5, 4, np.nan, 7, 7, 8, 3, 4.5, 6],
    'grade': ['A', 'A', 'B', 'C', 'B', 'B', 'A', 'F', 'C', 'B'],
    'attendance_pct': [95, 100, 85, 70, 92, 92, 98, 150, 88, 91]  # 150% is invalid
})

print("EXERCISE 1: Data Quality Assessment")
print("="*70)
print("\nDataset Overview:")
print(exercise_data)

print("\nTODO: Complete the following assessment:")
print("-" * 70)

# TODO 1: Calculate completeness for each column
print("\n1. COMPLETENESS ANALYSIS")
print("   Calculate missing percentage for each column")
print("   Hint: (missing_count / total_count) * 100")

# YOUR CODE HERE
# completeness = ...
# print(completeness)

print("\n2. VALIDITY ANALYSIS")
print("   Find invalid values:")
print("   - test_score should be 0-100")
print("   - attendance_pct should be 0-100")

# YOUR CODE HERE
# invalid_scores = ...
# invalid_attendance = ...

print("\n3. DUPLICATE ANALYSIS")
print("   Find duplicate student_id values")

# YOUR CODE HERE
# duplicates = ...

print("\n4. WRITE A QUALITY REPORT")
print("   Summarize findings and recommend actions")

# YOUR CODE HERE: Write your findings below

## Exercise 2: Creating a Data Dictionary

Create a comprehensive data dictionary for the student dataset from Exercise 1.

Include:
1. **Description** of each column
2. **Data type** (integer, float, string, date, etc.)
3. **Valid range** or allowed values
4. **Units** if applicable
5. **Special constraints** or business rules

In [None]:
# Exercise 2: Create Data Dictionary

print("EXERCISE 2: Create Data Dictionary for Student Dataset")
print("="*70)

# Create data dictionary
student_dictionary = DataDictionary(exercise_data, 'Student Performance Dataset')

# TODO: Add definitions for each column
# Use the add_column_definition method with appropriate parameters

# Example (complete this):
student_dictionary.add_column_definition(
    'student_id',
    'Unique identifier for each student',  # Description
    'integer',  # Data type
    '',  # Units
    '1-999',  # Valid values
    'Primary key, no duplicates allowed'  # Constraints
)

# TODO: Add definitions for remaining columns:
# - test_score
# - study_hours  
# - grade
# - attendance_pct

print("\nYour Data Dictionary:")
print("-" * 70)

# Print your completed dictionary
student_dictionary.print_dictionary()

## Exercise 3: Avoiding Data Leakage

**Scenario**: You're building a model to predict student test scores from study hours and attendance.

Below are two preprocessing approaches. Identify which one causes data leakage and explain why.

- **Approach A**: Fit the scaler on the entire dataset, then split into training and test sets
- **Approach B**: Split first, fit the scaler on the training set only, then apply it to both sets

Why does this matter for research validity?

In [None]:
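# Exercise 3: Data Leakage Detection

# Imports repeated here so this cell runs on its own
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler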

print("EXERCISE 3: Detecting Data Leakage in Preprocessing")
print("=" * 70)

# Prepare data for modeling
features = ['study_hours', 'attendance_pct']
target = 'test_score'

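# Remove rows with missing target values for this example
model_data = exercise_data.dropna(subset=[target]).copy()
# Simplification: mean-imputing on the full dataset before splitting is itself a
# mild form of leakage; it is kept here so the comparison isolates the scaling step
X = model_data[features].fillna(model_data[features].mean())
y = model_data[target]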

print(f"\nDataset size: {len(X)} samples")
print(f"Features: {features}")
print(f"Target: {target}")
print(f"\nFeature statistics (original data):")
print(f"  Study hours - mean: {X['study_hours'].mean():.2f}, std: {X['study_hours'].std():.2f}")
print(f"  Attendance - mean: {X['attendance_pct'].mean():.2f}, std: {X['attendance_pct'].std():.2f}")

print("\n" + "="*70)
print("APPROACH A: Scale first, then split (⚠️ CAUSES LEAKAGE)")
print("="*70)

scaler_a = StandardScaler()
X_scaled_all = scaler_a.fit_transform(X)  # Fit on ALL data
X_train_a, X_test_a = train_test_split(X_scaled_all, test_size=0.3, random_state=42)

print(f"\nScaling parameters computed from: ALL {len(X)} samples")
print(f"  Scaler mean: {scaler_a.mean_.round(2)}")
print(f"  Scaler std: {scaler_a.scale_.round(2)}")
print(f"\nApproach A result:")
print(f"  Training mean: {X_train_a.mean(axis=0).round(3)}")
print(f"  Test mean: {X_test_a.mean(axis=0).round(3)}")
print(f"\n❌ PROBLEM: Scaling parameters were computed using test data!")
print(f"   The scaler learned from the test data distribution - LEAKAGE!")

print("\n" + "="*70)
print("APPROACH B: Split first, then scale (✓ CORRECT)")
print("="*70)

X_train_b, X_test_b = train_test_split(X.values, test_size=0.3, random_state=42)
scaler_b = StandardScaler()
scaler_b.fit(X_train_b)  # Fit ONLY on training data
X_train_b_scaled = scaler_b.transform(X_train_b)
X_test_b_scaled = scaler_b.transform(X_test_b)

print(f"\nScaling parameters computed from: {len(X_train_b)} training samples")
print(f"  Scaler mean: {scaler_b.mean_.round(2)}")
print(f"  Scaler std: {scaler_b.scale_.round(2)}")
print(f"\nApproach B result:")
print(f"  Training mean: {X_train_b_scaled.mean(axis=0).round(3)}")
print(f"  Test mean: {X_test_b_scaled.mean(axis=0).round(3)}")
print(f"\n✓ CORRECT: Scaling parameters NOT influenced by test data")
print(f"   The scaler learned ONLY from training data - NO LEAKAGE!")

print("\n" + "="*70)
print("LEARNING OUTCOME:")
print("Always split data BEFORE fitting any transformers.")
print("This is critical for unbiased model evaluation.")
print("="*70)

# TODO: Answer this question
print("\nQUESTION: Why is Approach B more valid for research?")
print("-" * 70)
print("Your answer:")
print("""
1. In Approach A, when we evaluate the model's performance on test data,
   we're not truly measuring how it would perform on NEW, unseen data.
   
2. The test set was already used to compute scaling parameters,
   so the model has effectively "seen" test data statistics.
   
3. This violates the fundamental research principle of train-test separation.

4. The reported test performance will tend to be optimistically biased,
   making the model appear better than it actually is.

5. The evaluation no longer mirrors real use, where scaling parameters
   can only be computed from data available at training time.
""")

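One common way to enforce the "split first, fit on training data only" rule is to wrap preprocessing and the model in a scikit-learn `Pipeline`. The sketch below assumes synthetic data generated only for illustration (not the exercise dataset) and an arbitrary choice of linear regression as the model; it also moves mean imputation inside the pipeline so that step is fitted on training rows as well.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data for illustration only (not the exercise dataset)
rng = np.random.default_rng(0)
n = 200
X = pd.DataFrame({
    "study_hours": rng.uniform(1, 10, n),
    "attendance_pct": rng.uniform(60, 100, n),
})
y = 40 + 4 * X["study_hours"] + 0.2 * X["attendance_pct"] + rng.normal(0, 3, n)
X.loc[::25, "study_hours"] = np.nan  # inject a few missing feature values

# Split BEFORE any transformer sees the data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

model = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", StandardScaler()),
    ("regress", LinearRegression()),
])
model.fit(X_train, y_train)            # imputer and scaler are fitted on training rows only
r2_test = model.score(X_test, y_test)  # test rows are transformed, never used for fitting

scaler = model.named_steps["scale"]
print("Scaler fitted on", int(scaler.n_samples_seen_), "rows")
print("Test R^2:", round(r2_test, 3))
```

Because every fitted step lives inside the pipeline, the same object can be passed to cross-validation utilities, which refit the preprocessing within each training fold.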
## Summary

### Key Takeaways

✅ **Data Acquisition**: Choose between primary and secondary data based on research needs, use appropriate sampling methods, and validate at collection point

✅ **Data Quality Assessment**: Always assess completeness, validity, reliability, and potential biases - visualize patterns, don't rely on summary statistics alone

✅ **Data Documentation**: Create data dictionaries, track lineage, write README files, and follow metadata standards for reproducibility

✅ **7-Step Preprocessing**: Load → EDA → Validate → Handle Missing → Handle Outliers → Transform → Quality Check

✅ **Prevent Data Leakage**: ALWAYS split train/test BEFORE fitting any transformers - this is the #1 rule for valid research

### What's Next?

In **Module 05: Exploratory Data Analysis and Visualization**, you'll learn:
- Statistical summaries and distributions
- Correlation analysis
- Effective data visualization for research
- Communicating data stories

### Additional Resources

- **Paper**: "Leakage in Data Mining" (Kaufman et al., 2012)
- **Standard**: DDI (Data Documentation Initiative)
- **Guide**: "Data Quality: Concepts, Methodologies and Techniques" (Batini & Scannapieco, 2006)
- **Tool**: Great Expectations (automated data validation)


## Self-Assessment

Before moving to Module 05, ensure you can:

- [ ] Explain the difference between primary and secondary data
- [ ] Choose appropriate sampling methods for different research scenarios
- [ ] Calculate and interpret data quality metrics (completeness, validity)
- [ ] Create a comprehensive data dictionary
- [ ] Document data lineage and transformations
- [ ] Implement the 7-step preprocessing workflow
- [ ] Explain why fitting transformers only on training data prevents data leakage
- [ ] Identify common data quality issues in real datasets
- [ ] Visualize missing data patterns
- [ ] Apply validation rules at data collection time

If you can confidently check all boxes, you're ready for Module 05! 🎉