# 📘 Chapter 5: Advanced Labeling, Augmentation, and Data Preprocessing


---
## 1. Introduction
This chapter focuses on:
- Techniques to reduce labeling cost
- Methods to expand labeled datasets via augmentation
- Domain-specific preprocessing, especially for time series data

---
## 2. Advanced Labeling
Labeling is often the most expensive step in building a supervised learning system.
Advanced labeling techniques allow us to label more data with fewer resources.

---
### 2.1 Semi-Supervised Labeling
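Semi-supervised labeling combines a small set of labeled data with a large set of unlabeled data.
It assumes that points close together in feature space tend to belong to the same class, so known labels can be spread to nearby unlabeled points.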


In [None]:

from sklearn import datasets
from sklearn.semi_supervised import LabelPropagation
import matplotlib.pyplot as plt
import numpy as np

# Generate a synthetic two-class dataset (fixed seed for reproducibility)
X, y = datasets.make_classification(
    n_samples=200, n_features=2, n_classes=2, n_redundant=0, random_state=42
)

# Hide roughly 80% of the labels; scikit-learn treats -1 as "unlabeled"
rng = np.random.RandomState(42)
y_missing = y.copy()
unlabeled_indices = rng.rand(len(y)) < 0.8
y_missing[unlabeled_indices] = -1

# Apply label propagation
lp_model = LabelPropagation()
lp_model.fit(X, y_missing)
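
To see what a fitted model inferred, one option is to read the propagated labels from the `transduction_` attribute and compare them with the labels that were hidden. The sketch below is self-contained and rebuilds a similar dataset; the names `X_demo`, `y_demo`, `hidden`, and the seed values are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.semi_supervised import LabelPropagation

# Illustrative dataset and seeds (assumptions for this sketch)
X_demo, y_demo = make_classification(
    n_samples=200, n_features=2, n_classes=2, n_redundant=0, random_state=0
)
rng_demo = np.random.RandomState(42)
hidden = rng_demo.rand(len(y_demo)) < 0.8   # True where the label is hidden
y_partial = y_demo.copy()
y_partial[hidden] = -1                      # -1 marks an unlabeled point

lp = LabelPropagation()
lp.fit(X_demo, y_partial)

# transduction_ holds one inferred label per training point
inferred = lp.transduction_
hidden_accuracy = (inferred[hidden] == y_demo[hidden]).mean()
print(f"Accuracy on the hidden labels: {hidden_accuracy:.2f}")
```

The accuracy depends on the data and on the kernel settings, so it is best treated as a sanity check rather than a benchmark.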

---
### 2.2 Active Learning

Active learning uses a sampling strategy to pick the most informative unlabeled examples and sends those for labeling first.
It is most useful when the labeling budget is limited.

---
#### Margin Sampling (Illustration)


In [None]:
from sklearn.svm import SVC

# Fit a model on the first 10 points that kept their labels
# (assumes both classes appear among them)
initial_idx = np.where(y_missing != -1)[0][:10]
model = SVC(kernel="linear", probability=True)
model.fit(X[initial_idx], y[initial_idx])

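# Margin = gap between the two class probabilities; a small margin means the model is unsure
probs = model.predict_proba(X)
margins = np.abs(probs[:, 0] - probs[:, 1])
# The 10 smallest margins mark the points most worth labeling next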
uncertain_idx = np.argsort(margins)[:10]
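
A single margin computation can be extended into a full query loop: train, pick the smallest-margin points from the unlabeled pool, obtain their labels, and repeat. The following is a minimal sketch under stated assumptions: it uses `LogisticRegression` rather than an SVC purely to keep the example fast, the helper `margin_sampling_loop` and its parameters are hypothetical, and the true labels (`y_oracle`) stand in for a human annotator.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

def margin_sampling_loop(X, y_oracle, labeled_idx, n_rounds=5, batch_size=5):
    """Grow the labeled set by repeatedly querying the smallest-margin points."""
    labeled = list(labeled_idx)
    for _ in range(n_rounds):
        # Pool = every point that has not been labeled yet
        pool = np.setdiff1d(np.arange(len(X)), labeled)
        clf = LogisticRegression().fit(X[labeled], y_oracle[labeled])
        # Margin = top probability minus second-highest probability
        probs = np.sort(clf.predict_proba(X[pool]), axis=1)
        margins = probs[:, -1] - probs[:, -2]
        query = pool[np.argsort(margins)[:batch_size]]
        # "Ask the annotator": here the oracle labels are simply looked up later
        labeled.extend(query.tolist())
    return np.array(labeled)

X_al, y_al = make_classification(
    n_samples=200, n_features=2, n_classes=2, n_redundant=0, random_state=0
)
# Seed set with 5 examples per class so the first model sees both classes
seed_idx = np.concatenate([np.where(y_al == c)[0][:5] for c in (0, 1)])
final_idx = margin_sampling_loop(X_al, y_al, seed_idx)
print(len(final_idx), "points labeled after 5 rounds")
```

Restricting each query to the unlabeled pool avoids spending budget on points that already have labels.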

---
### 2.3 Weak Supervision with Snorkel

Weak supervision uses labeling functions (heuristics) to assign noisy labels.

A generative label model then combines the functions' votes and denoises them.

In [None]:

from snorkel.labeling import labeling_function

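# Each labeling function votes for a class (0 or 1) or returns -1 to abstain.
@labeling_function()
def lf_contains_my(x):
    # Vote for class 1 when the text contains "my" (substring match)
    return 1 if "my" in x.text.lower() else -1

@labeling_function()
def lf_short_comment(x):
    # Vote for class 0 when the comment has fewer than 5 words
    return 0 if len(x.text.split()) < 5 else -1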

These labeling functions would be combined in Snorkel to generate weak labels.
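
To make the idea of combining votes concrete without depending on Snorkel, the sketch below applies two plain-Python versions of the heuristics above and merges them with a simple majority vote. This is a deliberately simplified stand-in: Snorkel's label model weighs functions by estimated accuracy, which a majority vote does not. The example comments, the `majority_vote` helper, and the undecorated function names are all assumptions for illustration.

```python
from collections import Counter
from types import SimpleNamespace

ABSTAIN = -1

def contains_my(x):
    # Plain-Python counterpart of lf_contains_my
    return 1 if "my" in x.text.lower() else ABSTAIN

def short_comment(x):
    # Plain-Python counterpart of lf_short_comment
    return 0 if len(x.text.split()) < 5 else ABSTAIN

def majority_vote(votes):
    """Return the most common non-abstain vote, or ABSTAIN on no votes or a tie."""
    counted = Counter(v for v in votes if v != ABSTAIN)
    if not counted:
        return ABSTAIN
    ranked = counted.most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN
    return ranked[0][0]

# Hypothetical comments, wrapped so that x.text works as in the cell above
comments = [SimpleNamespace(text=t) for t in [
    "Check out my channel for more videos like this",
    "Great song",
    "my video",
    "This tutorial explained the topic very clearly today",
]]
lfs = [contains_my, short_comment]
weak_labels = [majority_vote([lf(c) for lf in lfs]) for c in comments]
print(weak_labels)  # [1, 0, -1, -1]
```

The third comment receives conflicting votes and the fourth receives none, so both stay unlabeled; handling such cases well is where a learned label model earns its keep.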

---
## 3. Data Augmentation
Data augmentation increases dataset size and diversity by generating valid, label-preserving variants of existing data points.

---
### Example: CIFAR-10 Image Augmentation

In [None]:
import tensorflow as tf

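def augment(x, height, width, num_channels):
    # Zero-pad the image by 4 pixels on each side
    x = tf.image.resize_with_crop_or_pad(x, height + 8, width + 8)
    # Crop back to the original size at a random offset (a random shift)
    x = tf.image.random_crop(x, [height, width, num_channels])
    # Mirror the image horizontally at random
    x = tf.image.random_flip_left_right(x)
    return x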

Example call (assuming `image` is a 32x32x3 tensor):
image_aug = augment(image, 32, 32, 3)
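
For readers who want to see the pad, crop, and flip steps spelled out, here is a rough NumPy analogue of the function above. It is a sketch, not a drop-in replacement for the TensorFlow ops: `augment_np`, its `pad` and `rng` parameters, and the random test image are assumptions introduced for illustration.

```python
import numpy as np

def augment_np(image, pad=4, rng=None):
    """Zero-pad, take a random crop of the original size, then maybe flip."""
    rng = np.random.default_rng() if rng is None else rng
    height, width, _ = image.shape
    # Pad height and width by `pad` pixels on each side; channels are untouched
    padded = np.pad(image, ((pad, pad), (pad, pad), (0, 0)), mode="constant")
    # Choose the top-left corner of the crop uniformly among valid offsets
    top = rng.integers(0, 2 * pad + 1)
    left = rng.integers(0, 2 * pad + 1)
    cropped = padded[top:top + height, left:left + width, :]
    # Flip left-right with probability 0.5
    if rng.random() < 0.5:
        cropped = cropped[:, ::-1, :]
    return cropped

# Stand-in for a 32x32x3 CIFAR-10 image
image = np.random.default_rng(0).random((32, 32, 3))
image_aug = augment_np(image, rng=np.random.default_rng(1))
print(image_aug.shape)  # (32, 32, 3)
```

Because the crop window moves by at most 4 pixels in each direction, the object stays in frame, which is what keeps the transformation label-preserving.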

---
Other augmentation strategies:
- Semi-supervised augmentation
- Unsupervised Data Augmentation (UDA)
- Policy-based augmentation (e.g., AutoAugment)

---
## 4. Time Series Preprocessing
Time series data is common in forecasting problems such as temperature prediction.
It needs special handling to preserve time order and to choose suitable windowing and sampling schemes.

---
### Seasonality Visualization

In [None]:

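# Synthetic seasonal signal: mean 15, amplitude 3, one cycle per time unit
# (named t and temperature so the earlier label array y is not overwritten)
t = np.linspace(0, 10, 200)
temperature = 3 * np.sin(2 * np.pi * t) + 15
plt.plot(t, temperature)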
plt.title("Seasonality in Time Series")
plt.xlabel("Time")
plt.ylabel("Temperature")
plt.show()


---
### Windowing Strategy
Windowing turns a series into input-output pairs: a fixed-length slice of past data is the input and the value that follows is the target (e.g., 6-hour history -> 1-hour forecast).

---
#### Windowing illustration
t = [t0, t1, ..., t5] -> predict t6
or t = [t0 ... t23] -> predict t24
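
The illustration above can be sketched in NumPy as a sliding window over a one-dimensional series. The helper `make_windows` and its `history` and `horizon` parameters are hypothetical names chosen for this example, and the toy series is just the integers 0 to 9 so the pairs are easy to read.

```python
import numpy as np

def make_windows(series, history, horizon=1):
    """Split a 1-D series into (past window, future target) pairs."""
    series = np.asarray(series)
    n_pairs = len(series) - history - horizon + 1
    if n_pairs <= 0:
        raise ValueError("series is too short for the requested window")
    # Inputs: `history` consecutive values starting at each position
    inputs = np.stack([series[i:i + history] for i in range(n_pairs)])
    # Targets: the `horizon` values that immediately follow each input window
    targets = np.stack(
        [series[i + history:i + history + horizon] for i in range(n_pairs)]
    )
    return inputs, targets

hourly = np.arange(10)                      # stand-in for hourly readings
X_win, y_win = make_windows(hourly, history=6)
print(X_win[0], "->", y_win[0])             # [0 1 2 3 4 5] -> [6]
print(X_win.shape, y_win.shape)             # (4, 6) (4, 1)
```

Windows are built in time order, so any train/validation split should also be made along time rather than by shuffling pairs, to avoid leaking future values into training.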

---
### Sampling Strategy
Reduce the size of the training data by sampling or aggregating observations at a suitable interval.

---
For example, sampling once per hour instead of every 10 minutes reduces feature vector size from 792 to 126.
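
As a rough sketch of both options, the helper below either keeps every sixth reading (sampling) or averages each block of six readings (aggregating), assuming a series recorded every 10 minutes. The function `downsample`, its parameters, and the series length of 720 (five days of 10-minute readings) are illustrative assumptions and are not meant to reproduce the feature-vector sizes quoted above.

```python
import numpy as np

def downsample(series, step, aggregate=False):
    """Keep every `step`-th value, or average consecutive blocks of `step` values."""
    series = np.asarray(series, dtype=float)
    if aggregate:
        # Drop any trailing values that do not fill a complete block
        usable = (len(series) // step) * step
        return series[:usable].reshape(-1, step).mean(axis=1)
    return series[::step]

ten_minute = np.arange(720, dtype=float)    # stand-in for 10-minute readings
hourly_sampled = downsample(ten_minute, step=6)
hourly_mean = downsample(ten_minute, step=6, aggregate=True)
print(len(ten_minute), "->", len(hourly_sampled))   # 720 -> 120
```

Sampling is cheaper and keeps real observations, while aggregating smooths out noise within each hour; which one suits a task depends on how quickly the signal changes.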

---
## 5. Conclusion
Chapter 5 covered:
- Labeling strategies: Semi-supervised, Active Learning, Weak Supervision
- Data augmentation: Valid, label-preserving transformations
- Time series preprocessing: Windowing, sampling, and seasonal patterns

These techniques can improve data efficiency and model generalization.


---
## 🔑 Keywords
- Label Propagation
- Semi-Supervised Labeling
- Active Learning
- Margin Sampling
- Weak Supervision
- Snorkel
- Data Augmentation
- CIFAR-10
- Seasonality
- Windowing
- Sampling
- Time Series Forecasting