# 📘 Chapter 2: Collecting, Labeling, and Validating Data

## Why Data Matters

> “Data is the hardest part of ML and the most important piece to get right.”  
> — ML Practitioner at Uber

> “No other activity in the ML lifecycle has a higher ROI than improving the data.”  
> — ML Practitioner at Gojek

In production, data is **critical**, and it is often **messier** than in academic settings. This chapter covers collecting the right data, labeling it effectively, and validating it before it is used in ML systems.


## Important Considerations in Data Collection

- Data must **represent the feature space** of the application you're building.
- Data issues can arise from:
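  - Mismatched measurement systems (e.g. different units for the same quantity)
  - Inconsistent encodings (e.g. the same field stored as float in one source and int in another)
  - Missing or misinterpreted values (e.g. an elevation of `0` that may be a real value or a placeholder for a missing one)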

You also need to ensure:
- Early detection of issues (monitoring)
- High **predictive signal** from features (via feature engineering and selection)
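
The checks above can be sketched with a small audit pass over raw records. This is only an illustration: the field names (`elevation_m`, `zip_code`), the sample values, and the helper `audit_records` are hypothetical, and a real pipeline would likely run a similar check inside its validation tooling.

```python
from collections import defaultdict

# Hypothetical records; field names and values are illustrative only.
records = [
    {"elevation_m": 1203.5, "zip_code": 94107},
    {"elevation_m": 0, "zip_code": "94110"},
    {"elevation_m": None, "zip_code": 10001},
    {"elevation_m": 87, "zip_code": 60614},
]


def audit_records(rows, zero_suspect_fields=()):
    """Summarize per-field type mix, missing values, and suspicious zeros."""
    types_seen = defaultdict(set)
    missing = defaultdict(int)
    zeros = defaultdict(int)
    for row in rows:
        for field, value in row.items():
            if value is None:
                missing[field] += 1
                continue
            types_seen[field].add(type(value).__name__)
            # A zero in these fields may be a placeholder rather than a measurement.
            if field in zero_suspect_fields and value == 0:
                zeros[field] += 1
    mixed = {f: sorted(t) for f, t in types_seen.items() if len(t) > 1}
    return {
        "mixed_types": mixed,
        "missing": dict(missing),
        "suspect_zeros": dict(zeros),
    }


report = audit_records(records, zero_suspect_fields=("elevation_m",))
print(report)
```

Whether a flagged zero is a real measurement or a missing value cannot be decided by code alone; the report only points a reviewer at the fields worth checking.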


## Responsible Data Collection

Ethical and secure data collection involves:
- **Data security**: protecting user data from unauthorized access
- **Data privacy**: controlling how personal data is stored, shared, and deleted
- **Fairness**: avoiding harms such as:
  - Representational harm (e.g. reinforcing stereotypes)
  - Opportunity denial
  - Disproportionate product failure
  - Harm by disadvantage


## Labeling Data

### Types of Labeling

1. **Direct Labeling**  
   - Labels derived automatically from system logs (e.g. click-through events)  
   - Tools: Logstash, Fluentd

2. **Human Labeling**  
   - Generalists or subject matter experts  
   - Often needed for complex or medical data  
   - Challenges: bias, inconsistency, cost, speed
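
As a rough illustration of direct labeling, the sketch below joins two log streams to derive a click label for each impression. The event shapes, the `impression_id` key, and the helper `label_impressions` are assumptions made for this example; they are not the format produced by Logstash or Fluentd.

```python
# Hypothetical log events, as they might look after being collected by a log shipper.
impressions = [
    {"impression_id": "a1", "item": "shoes"},
    {"impression_id": "a2", "item": "hat"},
    {"impression_id": "a3", "item": "shoes"},
]
clicks = [{"impression_id": "a1"}, {"impression_id": "a3"}]


def label_impressions(impressions, clicks):
    """Attach a binary 'clicked' label to each impression by joining on its ID."""
    clicked_ids = {event["impression_id"] for event in clicks}
    return [
        {**imp, "clicked": int(imp["impression_id"] in clicked_ids)}
        for imp in impressions
    ]


labeled = label_impressions(impressions, clicks)
click_through_rate = sum(row["clicked"] for row in labeled) / len(labeled)
print(labeled)
print(f"CTR: {click_through_rate:.2f}")
```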

### Labeling Best Practices
- Ensure diversity in rater pool
- Monitor label freshness and quality
- Consider incentive structures
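
One simple way to monitor label quality is to have several raters label each item and track how often they agree. The sketch below assumes three raters per item and uses majority vote. The data, the helper `consolidate`, and the "anything below full agreement goes to review" rule are illustrative choices, not a recommended standard.

```python
from collections import Counter

# Hypothetical ratings: each item was labeled by three raters.
ratings = {
    "msg_1": ["spam", "spam", "spam"],
    "msg_2": ["ham", "spam", "ham"],
    "msg_3": ["ham", "ham", "ham"],
    "msg_4": ["spam", "ham", "ham"],
}


def consolidate(labels):
    """Return the majority label and the fraction of raters who chose it."""
    top_label, votes = Counter(labels).most_common(1)[0]
    return top_label, votes / len(labels)


consolidated = {item: consolidate(labels) for item, labels in ratings.items()}

# Items without unanimous agreement are candidates for expert review.
needs_review = [
    item for item, (_, agreement) in consolidated.items() if agreement < 1.0
]
print(consolidated)
print(needs_review)
```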


## Data Changes and Drift

### Gradual Changes
- Happen over months/years
- Due to trends, seasonality, changing business processes

### Sudden Changes
- Sensor failures, logging errors, system updates
- May lead to **data drift** or **concept drift**:
  - **Data drift**: input distribution changes
  - **Concept drift**: relationship between input and output changes
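
The difference between the two kinds of drift can be seen in a toy example. The sketch below uses a frozen, hypothetical one-feature model and three hand-made windows of `(x, y)` pairs. Under data drift the inputs move while the model stays accurate; under concept drift the inputs look the same while accuracy drops.

```python
from statistics import mean


def model(x):
    """A frozen, hypothetical model: predicts 1 when the feature exceeds 5."""
    return int(x > 5)


def summarize(window):
    """Return (mean input value, model accuracy) for a list of (x, y) pairs."""
    xs = [x for x, _ in window]
    accuracy = mean([int(model(x) == y) for x, y in window])
    return mean(xs), accuracy


reference = [(x, int(x > 5)) for x in [2, 4, 6, 8]]
data_drift = [(x, int(x > 5)) for x in [6, 8, 10, 12]]   # inputs moved, rule unchanged
concept_drift = [(x, int(x > 7)) for x in [2, 4, 6, 8]]  # inputs unchanged, rule moved

for name, window in [
    ("reference", reference),
    ("data drift", data_drift),
    ("concept drift", concept_drift),
]:
    input_mean, accuracy = summarize(window)
    print(f"{name}: mean input={input_mean}, accuracy={accuracy}")
```

Note that measuring accuracy requires fresh labels, which is one reason concept drift is usually harder to detect than data drift.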

### Retraining Strategy Based on Drift Speed

| Drift Speed     | Strategy                                  |
|------------------|--------------------------------------------|
| Months/Years   | Curated datasets, standard retraining       |
| Weeks          | Feedback-based retraining                   |
| Hours/Minutes  | Continuous monitoring, fast feedback loops  |


## Validating Data

### Common Data Issues
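- **Data drift**: feature distributions change over time
- **Concept drift**: the relationship between features and labels changes over time
- **Schema skew**: training and serving data do not follow the same schema
- **Feature skew**: the same feature has different values in training and serving
- **Distribution skew**: feature value distributions differ between training and serving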
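
To make schema skew concrete, here is a minimal sketch that compares the fields and Python types seen in training rows against those seen at serving time. The helpers `infer_types` and `schema_skew` and the sample rows are hypothetical. A dedicated validation library does the same job against a persisted schema.

```python
def infer_types(rows):
    """Map each field to the set of Python type names observed for it."""
    observed = {}
    for row in rows:
        for field, value in row.items():
            observed.setdefault(field, set()).add(type(value).__name__)
    return observed


def schema_skew(training_rows, serving_rows):
    """List fields that are missing, unexpected, or typed differently at serving time."""
    train, serve = infer_types(training_rows), infer_types(serving_rows)
    shared = train.keys() & serve.keys()
    return {
        "missing_in_serving": sorted(train.keys() - serve.keys()),
        "unexpected_in_serving": sorted(serve.keys() - train.keys()),
        "type_mismatch": sorted(f for f in shared if train[f] != serve[f]),
    }


# Hypothetical rows: 'age' arrives as a string at serving time.
training = [{"age": 34, "country": "DE", "income": 52000.0}]
serving = [{"age": "34", "country": "DE", "session": "x9"}]

skew = schema_skew(training, serving)
print(skew)
```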


## Using TensorFlow Data Validation (TFDV)

### Capabilities of TFDV
- Generate statistics (counts, missing values, etc.)
- Visualize and compare datasets
- Infer schema
- Detect:
  - Anomalies
  - Drift
  - Training-serving skew


```python
# Install TFDV (notebook shell command)
!pip install tensorflow-data-validation
```


```python
# Generate statistics from a CSV file
import tensorflow_data_validation as tfdv

stats = tfdv.generate_statistics_from_csv(
    data_location='your_data.csv',
    delimiter=','
)
```


```python
# Generate statistics from TFRecord format
stats = tfdv.generate_statistics_from_tfrecord(
    data_location='your_data.tfrecord'
)

# Visualize statistics
tfdv.visualize_statistics(stats)
```



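```python
# Print histogram of string features (e.g. for label distribution).
# This assumes the first feature in the statistics is a string feature,
# such as the label column; adjust the index for your dataset.
print(stats.datasets[0].features[0].string_stats.rank_histogram)

# Example output:
# buckets {
#   label: "ham"
#   sample_count: 4827.0
# }
# buckets {
#   label: "spam"
#   sample_count: 747.0
# }
```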
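
Once statistics exist, the remaining capabilities listed above (schema inference, anomaly detection, and training-serving skew) can be combined roughly as follows. This is a sketch under assumptions: the wrapper `validate_serving_data`, its `skew_feature` argument, and the `0.01` threshold are hypothetical, and the L-infinity skew comparator shown here applies to categorical features.

```python
def validate_serving_data(train_stats, serving_stats, skew_feature, threshold=0.01):
    """Infer a schema from training statistics and check serving statistics against it."""
    # Imported lazily so this sketch can be defined without TFDV installed.
    import tensorflow_data_validation as tfdv

    # Derive an initial schema (types, domains, presence) from the training statistics.
    schema = tfdv.infer_schema(statistics=train_stats)

    # Flag the chosen categorical feature when the L-infinity distance between
    # its training and serving distributions exceeds the threshold.
    feature = tfdv.get_feature(schema, skew_feature)
    feature.skew_comparator.infinity_norm.threshold = threshold

    # Compare training statistics to the schema and to the serving statistics.
    anomalies = tfdv.validate_statistics(
        statistics=train_stats,
        schema=schema,
        serving_statistics=serving_stats,
    )
    return schema, anomalies
```

In a notebook, the returned values can be inspected with `tfdv.display_schema(schema)` and `tfdv.display_anomalies(anomalies)`. An inferred schema is a starting point and usually needs manual review before it is treated as the source of truth.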


## Types of Skew Detected by TFDV

1. **Schema skew**: training and serving data differ in data types or formats
2. **Feature skew**: the same feature has different values in training and serving
3. **Distribution skew**: feature value distributions differ between training and serving
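
For categorical features, TFDV's skew and drift comparators are based on the L-infinity distance between two distributions, that is, the largest difference in relative frequency for any single category. The sketch below computes that distance by hand. The training counts come from the example histogram output earlier in this chapter, while the serving counts and the helper names are made up for illustration.

```python
def normalize(counts):
    """Convert raw category counts into relative frequencies."""
    total = sum(counts.values())
    return {category: n / total for category, n in counts.items()}


def l_infinity_distance(counts_a, counts_b):
    """Largest absolute difference in relative frequency across all categories."""
    freq_a, freq_b = normalize(counts_a), normalize(counts_b)
    categories = freq_a.keys() | freq_b.keys()
    return max(abs(freq_a.get(c, 0.0) - freq_b.get(c, 0.0)) for c in categories)


training_counts = {"ham": 4827, "spam": 747}  # from the example histogram output
serving_counts = {"ham": 700, "spam": 300}    # hypothetical serving traffic

distance = l_infinity_distance(training_counts, serving_counts)
print(f"L-infinity distance: {distance:.3f}")
```

A distance above the configured threshold would be reported as distribution skew; choosing that threshold is a judgment call that depends on the feature and the application.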


## Alternatives to TensorFlow Data Validation

While TFDV is powerful, other tools exist that may better suit your tech stack:

- **Great Expectations**  
  Started as an open-source project; a commercial cloud offering is now also available.  
  Connects easily with many data sources, including in-memory databases.

- **Evidently**  
  Focuses on dataset monitoring and drift detection.  
  Supports unstructured text data.

These tools offer different trade-offs for teams not using TensorFlow-based workflows.


## Conclusion

In this chapter, we discussed the many things to consider when collecting and labeling the data used to train ML models.

Given:
- The importance of data to system health
- The complexity of labeling
- The potential for data drift and sudden change

...it's crucial to build systems that **manage**, **monitor**, and **validate** your data throughout the lifecycle of your machine learning pipeline.
