## 2. INSUFFICIENT LABELED DATA PROBLEM

### Definition
**Not having enough labeled training data** is one of the biggest challenges in supervised learning.

### The Challenge

#### How Much Data Do You Need?
```
Rough rules of thumb (actual needs vary with task, feature count, and label noise):
Simple models (Linear Regression):      100 - 1,000 samples
Medium models (Decision Trees):         1,000 - 10,000 samples
Complex models (Neural Networks):       10,000 - 1,000,000 samples
Deep Learning from scratch (Images, Text, Audio):  1,000,000+ samples
```
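
The ranges above are only heuristics. A more reliable way to judge whether more data would help is to plot a learning curve on your own task. The sketch below is a minimal illustration using scikit-learn's `learning_curve`; the digits dataset and the logistic regression model are stand-ins chosen for this example, not part of the original material.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

# Stand-in dataset and model (assumptions for illustration only)
X, y = load_digits(return_X_y=True)

# Train on 10% .. 100% of the available training folds, with 5-fold CV
train_sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=2000),
    X, y,
    train_sizes=np.linspace(0.1, 1.0, 5),
    cv=5,
    shuffle=True,
    random_state=0,
)

# If validation accuracy is still climbing at the largest size,
# collecting more labeled data is likely to pay off
for n, score in zip(train_sizes, val_scores.mean(axis=1)):
    print(f"{int(n):5d} training samples -> validation accuracy {score:.3f}")
```

If the curve has flattened, more labels of the same kind will probably add little, and effort may be better spent on features or label quality.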

#### The Chicken-and-Egg Problem:
- Need data to train models
- Need models to label data
- Labeling is expensive and time-consuming
- Manual labeling: $0.10 - $10+ per sample (domain-dependent)

### Rough Cost Estimates:
```
Labeling Cost Examples:
- Image classification: $1-5 per image
- Medical imaging: $10-50 per image
- Text annotation: $0.10-1 per sample
- Video labeling: $50+ per hour

Dataset Size Costs:
- 1,000 samples: $100 - $50,000
- 10,000 samples: $1,000 - $500,000
- 100,000 samples: $10,000 - $5,000,000
- 1,000,000 samples: Millions of dollars
```
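
The dataset-size costs follow directly from multiplying the sample count by the per-sample price. The helper below is a small sketch of that arithmetic; `labeling_budget` and its `labels_per_sample` parameter (for redundant annotation by several annotators) are hypothetical names introduced for this example.

```python
def labeling_budget(n_samples, cost_per_label, labels_per_sample=1):
    """Total labeling cost; labels_per_sample > 1 models redundant annotation."""
    return n_samples * cost_per_label * labels_per_sample

# 10,000 samples at the cheapest and most expensive per-sample rates listed above
low = labeling_budget(10_000, 0.10)
high = labeling_budget(10_000, 50.0)

# Same cheap task, but every sample labeled by 3 annotators for quality control
triple = labeling_budget(10_000, 0.10, labels_per_sample=3)

print(f"Single annotation: ${low:,.0f} - ${high:,.0f}")
print(f"Triple annotation (cheap task): ${triple:,.0f}")
```

Redundant annotation multiplies the budget, which is worth remembering when reading per-sample prices.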

### 2.1 Manual Labeling Challenges

#### Inconsistencies Between Annotators


In [None]:
# Example: Inter-rater disagreement
import numpy as np
from sklearn.metrics import confusion_matrix, cohen_kappa_score

# Annotator 1's labels
annotator1 = np.array([0, 1, 1, 0, 1, 0, 1, 1, 0, 1])

# Annotator 2's labels (same samples)
annotator2 = np.array([0, 1, 0, 0, 1, 1, 1, 1, 0, 0])

# Measure disagreement
disagreement = np.sum(annotator1 != annotator2) / len(annotator1)
print(f"Disagreement rate: {disagreement*100:.1f}%")  # 30%!

# Calculate Cohen's Kappa (agreement beyond chance)
kappa = cohen_kappa_score(annotator1, annotator2)
print(f"Cohen's Kappa: {kappa:.2f}")  # 0.40 (fair-to-moderate agreement)

# Confusion between annotators
cm = confusion_matrix(annotator1, annotator2)
print("Confusion Matrix:")
print(cm)
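
To see what Cohen's kappa measures, it can help to compute it by hand: kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e is the agreement expected by chance from each annotator's label frequencies. The sketch below reuses the same two label arrays and should match the value returned by `cohen_kappa_score`.

```python
import numpy as np

annotator1 = np.array([0, 1, 1, 0, 1, 0, 1, 1, 0, 1])
annotator2 = np.array([0, 1, 0, 0, 1, 1, 1, 1, 0, 0])

# Observed agreement: fraction of samples with identical labels
p_o = np.mean(annotator1 == annotator2)

# Chance agreement: probability both pick the same class independently
classes = np.union1d(annotator1, annotator2)
p_e = sum(np.mean(annotator1 == k) * np.mean(annotator2 == k) for k in classes)

kappa = (p_o - p_e) / (1 - p_e)
print(f"p_o = {p_o:.2f}, p_e = {p_e:.2f}, kappa = {kappa:.2f}")
# p_o = 0.70, p_e = 0.50, kappa = 0.40
```

A 70% raw agreement looks decent, but half of it is expected by chance for two roughly balanced annotators, which is why kappa is much lower.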


#### Ambiguous Cases


In [None]:
# Example: Ambiguous sentiment classification
texts = [
    "This movie is not bad",           # Negation: mildly positive or neutral?
    "I love this... not",              # Sarcasm: literal words are positive, intent is negative
    "It's okay, I guess",              # Slightly positive or negative?
    "This product broke after 1 day",  # Clear negative
    "Great quality! Expensive though",  # Mixed positive/negative
]

# Different annotators might label differently
# Text 1: A says negative, B says positive → Disagreement!
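
One practical way to deal with ambiguous items is to collect several labels per text and route the ones with low agreement to an expert or to a guideline review. The sketch below assumes three annotators per text; the votes are made up for illustration and `agreement` is a hypothetical helper.

```python
from collections import Counter

# Hypothetical votes from three annotators per text (illustrative only)
annotations = {
    "This movie is not bad": ["pos", "neg", "pos"],
    "This product broke after 1 day": ["neg", "neg", "neg"],
    "Great quality! Expensive though": ["pos", "neg", "neutral"],
}

def agreement(votes):
    """Fraction of annotators who chose the most common label."""
    top_count = Counter(votes).most_common(1)[0][1]
    return top_count / len(votes)

# Anything short of unanimous agreement is flagged for review
needs_review = [text for text, votes in annotations.items() if agreement(votes) < 1.0]

for text in needs_review:
    print(f"Review: {text!r} (agreement {agreement(annotations[text]):.2f})")
```

The threshold is a design choice: with more annotators per item, a cutoff such as two thirds may be more appropriate than requiring unanimity.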


### 2.2 Solutions to Insufficient Data

#### 1. Data Augmentation


In [None]:
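# For Images
# Note: ImageDataGenerator is deprecated in recent TensorFlow releases; Keras
# preprocessing layers (e.g. RandomFlip, RandomRotation) are the suggested replacement
from tensorflow.keras.preprocessing.image import ImageDataGenerator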

# Artificial variations of existing images
augment = ImageDataGenerator(
    rotation_range=20,
    width_shift_range=0.2,
    height_shift_range=0.2,
    horizontal_flip=True,
    zoom_range=0.2,
    shear_range=0.2,
    fill_mode='nearest'
)

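# For Text
import nlpaug.augmenter.word as naw

# Synonym replacement (requires the NLTK WordNet corpus)
text = "The movie is great"
synonym_aug = naw.SynonymAug(aug_src='wordnet')
augmented = synonym_aug.augment(text)
# Might produce: "The film is wonderful"
# (recent nlpaug versions may return a list of strings)

# Contextual word substitution with a masked language model
# (ContextualWordEmbsAug is part of the word-level module, not the sentence module)
contextual_aug = naw.ContextualWordEmbsAug(
    model_path='bert-base-uncased',
    action="substitute"
)
augmented_contextual = contextual_aug.augment(text)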

# For Structured Data
import numpy as np

def augment_numerical_data(X, noise_level=0.1):
    """Return a copy of X with Gaussian noise (std = noise_level) added."""
    noise = np.random.normal(0, noise_level, X.shape)
    return X + noise

# Example: 100 samples → 1000 noisy copies through augmentation
# (choose noise_level relative to each feature's scale)
original_data = np.random.rand(100, 20)
augmented_data = []

for _ in range(10):  # 10x augmentation
    augmented_data.append(augment_numerical_data(original_data))

augmented_data = np.vstack(augmented_data)
print(f"Original: {original_data.shape}, Augmented: {augmented_data.shape}")
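
Two details are easy to miss with noise-based augmentation: labels have to be repeated alongside the features, and augmented copies should be created only from the training split so that near-duplicates of test samples do not leak into training. The sketch below shows one way to do this, assuming random stand-in data and a binary label vector `y` that is not in the original example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.random((100, 20))          # stand-in features
y = rng.integers(0, 2, size=100)   # stand-in binary labels (assumption)

# Split first, then augment the training portion only
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

n_copies = 10
noise_level = 0.1
X_aug = np.vstack([
    X_train + rng.normal(0, noise_level, X_train.shape)
    for _ in range(n_copies)
])
y_aug = np.tile(y_train, n_copies)  # each noisy copy keeps its original label

X_train_full = np.vstack([X_train, X_aug])
y_train_full = np.concatenate([y_train, y_aug])

print(f"Train: {X_train.shape} -> {X_train_full.shape}, Test (untouched): {X_test.shape}")
```

The test set stays at its original size and contains only real samples, so evaluation scores remain meaningful.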


#### 2. Transfer Learning


In [None]:
# Use pre-trained models instead of training from scratch
from tensorflow.keras.applications import MobileNetV2
from tensorflow.keras.layers import Dense, Flatten
from tensorflow.keras.models import Sequential

# Pre-trained on millions of images
base_model = MobileNetV2(
    input_shape=(224, 224, 3),
    include_top=False,
    weights='imagenet'  # Pre-trained weights
)

# Freeze base model weights (don't retrain)
base_model.trainable = False

# Add small custom layer on top
model = Sequential([
    base_model,
    Flatten(),
    Dense(256, activation='relu'),
    Dense(10, activation='softmax')  # Your classes
])

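# Train only your custom layers (requires less data!)
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
# X_train / y_train are placeholders for your small labeled set:
# images of shape (n, 224, 224, 3), scaled with
# tensorflow.keras.applications.mobilenet_v2.preprocess_input,
# and one-hot labels of shape (n, 10)
model.fit(X_train, y_train, epochs=10)

# With transfer learning (typical, task-dependent):
# - Hundreds to a few thousand samples often suffice instead of hundreds of thousands
# - Training is much faster because only the small head is updated
# - Usually better accuracy with limited data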
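
How much data the new head needs depends heavily on how many parameters it has. The arithmetic below is a rough sketch comparing a head that flattens the backbone's 7x7x1280 output (MobileNetV2 at 224x224 with `include_top=False`) against one that first averages over the spatial grid, as a `GlobalAveragePooling2D` layer would. `dense_params` is a hypothetical helper for this example.

```python
def dense_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out

# MobileNetV2 feature map for 224x224 input, include_top=False
height, width, channels = 7, 7, 1280

flat_features = height * width * channels   # after Flatten
pooled_features = channels                  # after global average pooling

# Head: Dense(256) followed by Dense(10), as in the model above
head_flatten = dense_params(flat_features, 256) + dense_params(256, 10)
head_pooled = dense_params(pooled_features, 256) + dense_params(256, 10)

print(f"Flatten head:        {head_flatten:,} trainable parameters")
print(f"Global-pooling head: {head_pooled:,} trainable parameters")
# Flatten head:        16,059,146 trainable parameters
# Global-pooling head: 330,506 trainable parameters
```

With only a few hundred labeled images, the smaller pooled head is usually less prone to overfitting, so it is worth considering when data is scarce.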


#### 3. Semi-Supervised Learning


In [None]:
# Use both labeled and unlabeled data
from sklearn.semi_supervised import LabelSpreading
import numpy as np

# Labeled data (expensive)
X_labeled = np.array([[1, 0], [0, 1], [1, 1]])
y_labeled = np.array([0, 1, 1])

# Unlabeled data (cheap, abundant)
X_unlabeled = np.random.rand(97, 2)
X_combined = np.vstack([X_labeled, X_unlabeled])

# Initialize unlabeled as -1 (unknown)
y_combined = np.hstack([
    y_labeled,
    np.full(97, -1)  # -1 means unlabeled
])

# Label spreading: propagates the 3 known labels to similar unlabeled points
model = LabelSpreading()
model.fit(X_combined, y_combined)

# Model learned from 3 labeled + 97 unlabeled samples!
# (labels inferred during fitting are also available as model.transduction_)
predictions = model.predict(X_unlabeled)
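
Whether unlabeled data actually helps depends on the structure of the problem, so it is worth checking against a purely supervised baseline. The sketch below does this on the synthetic two-moons dataset with 10 labeled points; the dataset, the 1-nearest-neighbor baseline, and the `knn` kernel settings are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.neighbors import KNeighborsClassifier
from sklearn.semi_supervised import LabelSpreading

# Synthetic data with clear cluster structure (stand-in for a real task)
X, y_true = make_moons(n_samples=300, noise=0.1, random_state=0)

# Reveal only 5 labels per class; mark everything else as -1 (unlabeled)
rng = np.random.default_rng(0)
labeled_idx = np.concatenate([
    rng.choice(np.where(y_true == c)[0], size=5, replace=False)
    for c in (0, 1)
])
y_partial = np.full(len(y_true), -1)
y_partial[labeled_idx] = y_true[labeled_idx]
unlabeled = y_partial == -1

# Baseline: supervised model that sees only the 10 labeled points
baseline = KNeighborsClassifier(n_neighbors=1)
baseline.fit(X[labeled_idx], y_true[labeled_idx])
baseline_acc = baseline.score(X[unlabeled], y_true[unlabeled])

# Semi-supervised: label spreading over a k-nearest-neighbor graph
semi = LabelSpreading(kernel="knn", n_neighbors=7)
semi.fit(X, y_partial)
semi_acc = np.mean(semi.transduction_[unlabeled] == y_true[unlabeled])

print(f"Supervised baseline accuracy: {baseline_acc:.3f}")
print(f"Label spreading accuracy:     {semi_acc:.3f}")
```

On real data the true labels of the unlabeled points are not available, so a small held-out labeled set is needed for this comparison.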


#### 4. Weak Supervision


In [None]:
# Use noisy/weak labels instead of manual annotation
# Heuristics, rules, or cheap proxies for labels

import re

def weak_label_sentiment(text):
    """Weak labeling using simple rules"""
    positive_words = {'great', 'amazing', 'wonderful', 'excellent', 'good'}
    negative_words = {'bad', 'terrible', 'awful', 'horrible', 'poor'}

    # Match whole words so that "good" does not fire on "goodbye"
    words = set(re.findall(r"[a-z']+", text.lower()))

    pos_count = len(words & positive_words)
    neg_count = len(words & negative_words)

    if pos_count > neg_count:
        return 1  # Positive
    elif neg_count > pos_count:
        return 0  # Negative
    else:
        return -1  # Uncertain

# Automatically label huge dataset
texts = ["This is great!", "I hate it", "It's okay"]
weak_labels = [weak_label_sentiment(t) for t in texts]
# → [1, -1, -1]: "hate" is not in the lexicon, so the second text stays uncertain
# Fast and cheap! But noisy...

# Combine many noisy labeling functions with Snorkel
from snorkel.labeling import LabelingFunction
from snorkel.labeling.model import LabelModel

# Snorkel labeling functions receive a data point (e.g. a DataFrame row),
# so the rule is wrapped; the "text" field name is an assumption
lfs = [  # Multiple weak labeling functions
    LabelingFunction(name="lexicon_sentiment", f=lambda x: weak_label_sentiment(x.text)),
    # Add more heuristics...
]
# A LabelModel fitted on the label matrix from these functions estimates
# their accuracies and aggregates their votes (it typically needs 3+ functions)
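
The core idea behind aggregating weak labels can be shown without any framework: several heuristics vote, each may abstain, and conflicts are left unlabeled. The sketch below uses a plain majority vote as a simplified stand-in for what a learned label model does; the three labeling functions and their word lists are hypothetical.

```python
import re
from collections import Counter

ABSTAIN, NEGATIVE, POSITIVE = -1, 0, 1

def tokens(text):
    """Lowercased whole-word tokens."""
    return set(re.findall(r"[a-z']+", text.lower()))

def lf_positive_words(text):
    return POSITIVE if tokens(text) & {"great", "excellent", "love"} else ABSTAIN

def lf_negative_words(text):
    return NEGATIVE if tokens(text) & {"bad", "terrible", "hate", "broke"} else ABSTAIN

def lf_exclamation(text):
    # Deliberately weak heuristic: trailing "!" is assumed to lean positive
    return POSITIVE if text.strip().endswith("!") else ABSTAIN

def majority_vote(text, lfs):
    """Most common non-abstain vote; abstain when nobody votes or on a tie."""
    votes = [v for v in (lf(text) for lf in lfs) if v != ABSTAIN]
    if not votes:
        return ABSTAIN
    ranked = Counter(votes).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN
    return ranked[0][0]

lfs = [lf_positive_words, lf_negative_words, lf_exclamation]
texts = ["This is great!", "I hate it", "It's okay", "Terrible service!"]
labels = [majority_vote(t, lfs) for t in texts]
print(labels)  # [1, 0, -1, -1]
```

The last text shows why conflicts matter: the negative-word rule and the exclamation rule disagree, so the sample is left unlabeled rather than guessed.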


#### 5. Active Learning


In [None]:
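# Intelligently select which samples to label
# Focus labeling effort on most uncertain/informative samples

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

class ActiveLearner:
    def __init__(self, base_model, X_pool):
        self.model = base_model
        self.X_labeled = []
        self.y_labeled = []
        self.X_unlabeled = np.asarray(X_pool)  # Pool of samples without labels

    def select_most_uncertain(self, n=10):
        """Return pool indices of the n most uncertain samples"""
        probabilities = self.model.predict_proba(self.X_unlabeled)

        # Least-confidence uncertainty: 1 - highest class probability
        # (for two classes this is largest when the prediction is near 0.5)
        uncertainty = 1 - np.max(probabilities, axis=1)

        # Indices of the n largest uncertainties
        return np.argsort(uncertainty)[-n:]

    def add_labeled(self, indices, labels):
        """Move newly labeled samples from the pool to the training set"""
        self.X_labeled.extend(self.X_unlabeled[indices])
        self.y_labeled.extend(labels)
        self.X_unlabeled = np.delete(self.X_unlabeled, indices, axis=0)

    def retrain(self):
        """Retrain with all labeled data collected so far"""
        self.model.fit(np.array(self.X_labeled), np.array(self.y_labeled))

# Synthetic stand-in data so the loop runs end to end
X, y = make_classification(n_samples=1200, n_features=20, random_state=0)
X_pool, X_test, y_pool, y_test = train_test_split(
    X, y, test_size=200, random_state=0
)

def human_label_this(indices):
    """Stand-in for a human annotator: looks up the held-back true labels"""
    global y_pool
    labels = y_pool[indices]
    y_pool = np.delete(y_pool, indices)  # Keep y_pool aligned with the pool
    return labels

# Active Learning Loop
learner = ActiveLearner(RandomForestClassifier(random_state=0), X_pool)

# Start with 10 random labeled samples (replace=False avoids duplicates)
rng = np.random.default_rng(0)
initial_indices = rng.choice(len(X_pool), 10, replace=False)
learner.add_labeled(initial_indices, human_label_this(initial_indices))

for iteration in range(10):  # 10 iterations
    learner.retrain()
    print(f"Iteration {iteration}: {learner.model.score(X_test, y_test):.3f}")

    # Select 10 most informative unlabeled samples
    uncertain_indices = learner.select_most_uncertain(n=10)

    # Human labels these (expensive!)
    human_labels = human_label_this(uncertain_indices)

    # Add to training data and remove from the pool
    learner.add_labeled(uncertain_indices, human_labels)

# With active learning:
# - Here about 100 labels are requested from a pool of 1,000
# - Often reaches a given accuracy with fewer labels than random sampling
# - The saving is task-dependent; compare against a random-sampling baseline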
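
Least confidence is only one way to score uncertainty. Two other common choices are the margin between the top two class probabilities and the entropy of the predicted distribution, and with more than two classes they can rank samples differently. The sketch below compares all three on a small hand-written probability matrix (the numbers are illustrative).

```python
import numpy as np

def least_confidence(probs):
    """1 - highest class probability; larger means more uncertain."""
    return 1 - probs.max(axis=1)

def margin(probs):
    """Gap between the two most likely classes; smaller means more uncertain."""
    ordered = np.sort(probs, axis=1)
    return ordered[:, -1] - ordered[:, -2]

def entropy(probs):
    """Shannon entropy of the prediction; larger means more uncertain."""
    return -np.sum(probs * np.log(np.clip(probs, 1e-12, 1.0)), axis=1)

# Predicted class probabilities for three samples (illustrative values)
probs = np.array([
    [0.90, 0.05, 0.05],   # confident
    [0.40, 0.35, 0.25],   # spread over all classes
    [0.50, 0.49, 0.01],   # torn between two classes
])

print("Least confidence picks sample", int(np.argmax(least_confidence(probs))))  # 1
print("Margin picks sample", int(np.argmin(margin(probs))))                      # 2
print("Entropy picks sample", int(np.argmax(entropy(probs))))                    # 1
```

Margin sampling favors the sample sitting on a decision boundary between two classes, while entropy favors the one where the model is unsure across all classes.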


#### 6. Crowdsourcing


In [None]:
# Use many cheap annotators instead of few expensive ones
import numpy as np
from scipy.stats import mode

# 100 crowdworkers label same samples (cheap but noisy)
# Aggregate their votes
crowdworker_labels = [
    [0, 1, 0, 1, 0],  # Worker 1's labels
    [0, 1, 1, 1, 0],  # Worker 2's labels
    [1, 1, 0, 1, 0],  # Worker 3's labels
    # ... 97 more workers
]

# Simple aggregation: majority vote
aggregated = mode(crowdworker_labels, axis=0)[0].flatten()
# Result: [0, 1, 0, 1, 0]

# More sophisticated: weight by worker reliability
# Workers who agree more often get more weight

worker_reliability = np.array([
    0.9,  # Worker 1 is 90% reliable
    0.8,  # Worker 2 is 80% reliable
    0.85, # Worker 3 is 85% reliable
])

weighted_votes = []
for i in range(5):  # For each sample
    votes = [
        crowdworker_labels[j][i]
        for j in range(3)
    ]
    weights = worker_reliability
    
    # Weighted average of the votes, thresholded at 0.5 to give 0 or 1
    final_label = 1 if np.average(votes, weights=weights) > 0.5 else 0
    weighted_votes.append(final_label)

print(f"Majority vote: {aggregated}")
print(f"Weighted votes: {weighted_votes}")
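
Reliability scores are rarely known in advance. A simple way to estimate them, sketched below, is to measure how often each worker agrees with the majority vote and then use those agreement rates as weights. This is a one-step simplification of iterative schemes and assumes binary labels and an odd number of workers, so ties cannot occur.

```python
import numpy as np

# Rows = workers, columns = samples (same three workers as above)
labels = np.array([
    [0, 1, 0, 1, 0],
    [0, 1, 1, 1, 0],
    [1, 1, 0, 1, 0],
])

# Step 1: unweighted majority vote per sample
majority = (labels.mean(axis=0) > 0.5).astype(int)

# Step 2: reliability = fraction of samples where a worker matches the majority
reliability = (labels == majority).mean(axis=1)

# Step 3: re-aggregate with the estimated reliabilities as weights
weighted = (np.average(labels, axis=0, weights=reliability) > 0.5).astype(int)

print(f"Majority vote:         {majority}")      # [0 1 0 1 0]
print(f"Estimated reliability: {reliability}")   # [1.  0.8 0.8]
print(f"Weighted vote:         {weighted}")      # [0 1 0 1 0]
```

With only five samples the estimates are very coarse; in practice they are computed over many items, often together with a few gold-standard questions.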


### Cost-Benefit Analysis:


In [None]:
# Compare different data collection strategies
# (all figures below are illustrative placeholders, not measurements)
strategies = {
    'Manual Annotation': {
        'cost_per_sample': 1.0,
        'quality': 0.95,
        'time_weeks': 50  # 10000 samples
    },
    'Crowdsourcing': {
        'cost_per_sample': 0.1,
        'quality': 0.85,
        'time_weeks': 2  # Faster
    },
    'Data Augmentation': {
        'cost_per_sample': 0.01,
        'quality': 0.70,
        'time_weeks': 1  # Very fast
    },
    'Transfer Learning': {
        'cost_per_sample': 0.05,
        'quality': 0.88,
        'time_weeks': 1
    }
}

n_samples_needed = 10000

for strategy, specs in strategies.items():
    cost = specs['cost_per_sample'] * n_samples_needed
    quality = specs['quality']
    time = specs['time_weeks']
    
    print(f"\n{strategy}:")
    print(f"  Total Cost: ${cost:,.0f}")
    print(f"  Quality: {quality*100:.0f}%")
    print(f"  Time: {time} weeks")
    print(f"  Cost/Quality ratio: {cost/quality:.0f}")


---
