## 10. COST CONSIDERATIONS

### Definition
**Building ML systems is expensive**, often prohibitively so for small organizations. The cost is spread across data, computing, personnel, and infrastructure.

### Cost Breakdown:

```
ML Project Budget = Data + Computing + Personnel + Infrastructure

1. DATA COLLECTION & LABELING
   ├─ Data acquisition: $1K - $100K
   ├─ Manual labeling: $10K - $1M
   │  └─ Depends on: Volume, Complexity, Expertise needed
   ├─ Crowdsourcing: $1K - $100K
   └─ Annotation tools: $100 - $10K/month

2. COMPUTING RESOURCES
   ├─ GPU/TPU for training: $1K - $100K/month
   ├─ Storage: $100 - $10K/month
   ├─ Cloud platform: $1K - $50K/month
   │  └─ AWS, GCP, Azure compute costs
   └─ Infrastructure: $5K - $100K (setup)

3. PERSONNEL (salary per person)
   ├─ Data scientists: $80K - $200K/year each (1-3 people)
   ├─ ML engineers: $100K - $250K/year each (1-2 people)
   ├─ Data engineers: $90K - $220K/year each (1-2 people)
   └─ ML Ops: $80K - $200K/year (1 person)

4. TOTAL FIRST YEAR: $200K - $2M+

5. ONGOING (PER YEAR): $150K - $1M
   ├─ Salaries
   ├─ Computing
   ├─ Data updates
   └─ Maintenance
```
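
The breakdown above mixes one-time, monthly, and yearly figures, so a first-year total only makes sense once everything is put on the same basis. Below is a minimal sketch of that arithmetic. The line items are hypothetical picks from inside the ranges above, and `first_year_total` is an illustrative helper, not a standard function.

```python
# Hypothetical line items; amounts are illustrative picks from the ranges above.
ONE_TIME = {
    "data_acquisition": 20_000,
    "manual_labeling": 50_000,
    "infrastructure_setup": 10_000,
}
MONTHLY = {
    "gpu_training": 3_000,
    "storage": 500,
    "annotation_tools": 500,
}
ANNUAL_SALARIES = {
    "data_scientist": 120_000,
    "ml_engineer": 150_000,
}


def first_year_total(one_time, monthly, salaries, months=12):
    """Sum one-time costs, monthly costs over `months`, and prorated salaries."""
    recurring = sum(monthly.values()) * months
    personnel = sum(salaries.values()) * months / 12
    return sum(one_time.values()) + recurring + personnel


total = first_year_total(ONE_TIME, MONTHLY, ANNUAL_SALARIES)
print(f"First-year estimate: ${total:,.0f}")
```

With these assumed inputs, personnel is the largest term even for a two-person team, which is why the low end of the first-year range is hard to reach with a full team.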

### Real-World Cost Examples:


In [None]:
# Example 1: Image Classification for E-commerce
# Objective: Categorize 1M product images

# Cost Breakdown:
costs = {
    'Manual annotation': 1_000_000 * 0.50,      # $500K (0.50 per image)
    'AWS S3 storage': (1_000_000 * 5e-6 * 365),  # $1.8K/year
    'GPU training (1 month)': 30 * 24 * 2.4,     # $1.7K
    'Data scientist (6 months)': 150_000 * 0.5,  # $75K
    'ML engineer (6 months)': 200_000 * 0.5,     # $100K
}

total = sum(costs.values())
print(f"Total cost: ${total:,.0f}")
# → ~$679K for 1M images

# Example 2: NLP Model for Customer Support
# Objective: Classify 100K support tickets

costs_nlp = {
    'Crowdsourced labeling': 100_000 * 0.10,    # $10K
    'Compute (training)': 5 * 3600 * 0.30,      # $5.4K
    'Data scientist (3 months)': 150_000 * 0.25, # $37.5K
}

total_nlp = sum(costs_nlp.values())
print(f"Total cost: ${total_nlp:,.0f}")
# → ~$52.9K for 100K tickets

# Example 3: Autonomous Vehicle ML System
# Objective: Detect pedestrians, cars, signs in video

costs_av = {
    'Video data collection': 100_000,            # $100K
    'Manual annotation': 1_000_000 * 5,          # $5M (expensive)
    'GPU cluster (1 year)': 100 * 24 * 365 * 0.30,  # $262K
    'Team (5 people, 1 year)': (150 + 200 + 90 + 100 + 80) * 1000,  # $620K
}

total_av = sum(costs_av.values())
print(f"Total cost: ${total_av:,.0f}")
# → ~$5.98M per year!
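
In all three examples, labeling and salaries outweigh compute. The sketch below makes that visible for Example 1 by converting the cost dictionary into shares of the total. The figures are copied from Example 1; the `cost_shares` helper is hypothetical and introduced only for illustration.

```python
# Figures copied from Example 1 above.
costs_ecommerce = {
    "Manual annotation": 1_000_000 * 0.50,
    "AWS S3 storage": 1_000_000 * 5e-6 * 365,
    "GPU training (1 month)": 30 * 24 * 2.4,
    "Data scientist (6 months)": 150_000 * 0.5,
    "ML engineer (6 months)": 200_000 * 0.5,
}


def cost_shares(costs):
    """Return each item's fraction of the total, largest first."""
    total = sum(costs.values())
    shares = {name: amount / total for name, amount in costs.items()}
    return dict(sorted(shares.items(), key=lambda kv: kv[1], reverse=True))


shares = cost_shares(costs_ecommerce)
for name, share in shares.items():
    print(f"{name:<28} {share:6.1%}")
```

Annotation accounts for roughly 74% of the total here, so a cheaper labeling strategy would likely move the budget far more than a cheaper GPU.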


### Cost Optimization Strategies:


In [None]:
# Strategy 1: Use existing models (Transfer Learning)
# Instead of: Train from scratch ($500K)
# Do: Fine-tune pre-trained model ($50K)
# Savings: $450K

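from transformers import (
    AutoModelForSequenceClassification,
    Trainer,
    TrainingArguments,
)

# Pre-trained BERT (the pre-training bill was already paid by others)
model = AutoModelForSequenceClassification.from_pretrained(
    'bert-base-uncased',
    num_labels=2,  # assumption: binary task; set to your number of classes
)

# Fine-tune on your specific task. Hugging Face PyTorch models have no
# .fit() method, so use Trainer (or your own training loop).
# `train_dataset` is a placeholder for your tokenized, labeled dataset.
trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir='bert-finetuned', num_train_epochs=5),
    train_dataset=train_dataset,
)
trainer.train()

# Cost: Only fine-tuning, not pre-training!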

# Strategy 2: Data augmentation instead of collecting more
# Instead of: Collect 100K images ($50K)
# Do: Augment 10K images ($5K)
# Savings: $45K

from imgaug import augmenters as iaa

aug = iaa.Sequential([
    iaa.Fliplr(0.5),
    iaa.Affine(rotate=(-25, 25)),
    iaa.Multiply((0.8, 1.2)),
    iaa.GaussianBlur(sigma=(0, 0.5))
])

# Create 10 augmented versions of each image
augmented_images = []
for img in original_images:
    for _ in range(10):
        augmented_images.append(aug(image=img))

# 10K → 100K images cheaply!

# Strategy 3: Automated labeling + human validation
# Instead of: Manual label all 100K samples ($10K)
# Do: Auto-label + human verify 10% ($1K labeling + $1K review)
# Savings: $8K

def auto_label_with_heuristics(data):
    """Quick automated labeling using rules"""
    labels = []
    for item in data:
        if 'negative_word' in item.text:
            labels.append('negative')
        elif 'positive_word' in item.text:
            labels.append('positive')
        else:
            labels.append('uncertain')
    return labels

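# Auto-label all (`data` is a placeholder: items with a .text attribute)
auto_labels = auto_label_with_heuristics(data)

# Humans label only the uncertain ones (ideally a small set).
# `human_review` is a placeholder for your annotation workflow.
uncertain_indices = [i for i, l in enumerate(auto_labels) if l == 'uncertain']
human_verified = human_review(uncertain_indices)

# Low cost, decent quality, provided the heuristics are spot-checked for accuracy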

# Strategy 4: Prioritize high-impact use cases
# Focus on:
# - High revenue impact ($)
# - Easy to implement (low cost)
# - High success probability (ROI)

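# Not all ML projects are worthwhile!
import pandas as pd

impact_assessment = pd.DataFrame({
    'Use Case': ['Recommendation', 'Price Optimization', 'Churn Prediction'],
    'Annual Impact': [1_000_000, 500_000, 200_000],
    'Development Cost': [300_000, 100_000, 50_000],
})

# ROI = (benefit - cost) / cost, the same definition used in the
# "When NOT to Build" section (e.g. 4.0 means 400%)
impact_assessment['ROI'] = (
    impact_assessment['Annual Impact'] - impact_assessment['Development Cost']
) / impact_assessment['Development Cost']

# Sort by ROI, focus on highest
impact_assessment = impact_assessment.sort_values('ROI', ascending=False)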

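# Strategy 5: Use AutoML to reduce data scientist costs
import h2o
from h2o.automl import H2OAutoML

# Instead of hiring an expensive data scientist,
# use automated ML for:
# - Feature engineering
# - Model selection
# - Hyperparameter tuning

# Cost: $50/month for AutoML platform
# vs. $150K/year for data scientist

h2o.init()
# `train` is a placeholder H2OFrame; `feature_cols` and `target_col` are
# placeholders for your predictor column names and response column name.
aml = H2OAutoML(max_models=20, seed=1)
aml.train(x=feature_cols, y=target_col, training_frame=train)

# Saves up to ~$149K/year on paper; someone still has to frame the
# problem, validate the data, and review the resulting models.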
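
Strategy 3 above leaves the keyword rules and the review step as placeholders. The following is a self-contained sketch of the same idea under stated assumptions: the keyword sets are made up for illustration, conflicting matches are treated as uncertain, and a 10% random spot check of the confident labels is added so that rule accuracy can be estimated. `auto_label` and `plan_review` are hypothetical helper names.

```python
import random
import re

# Hypothetical keyword lists; a real project would derive these from the domain.
NEGATIVE_WORDS = {"refund", "broken", "terrible"}
POSITIVE_WORDS = {"thanks", "great", "love"}


def auto_label(text):
    """Label by keyword match; return 'uncertain' when the rules don't decide."""
    words = set(re.findall(r"[a-z']+", text.lower()))
    has_neg = bool(words & NEGATIVE_WORDS)
    has_pos = bool(words & POSITIVE_WORDS)
    if has_neg == has_pos:  # no match at all, or conflicting matches
        return "uncertain"
    return "negative" if has_neg else "positive"


def plan_review(texts, spot_check_rate=0.10, seed=0):
    """Return (labels, sorted indices to send to human reviewers)."""
    labels = [auto_label(t) for t in texts]
    uncertain = [i for i, label in enumerate(labels) if label == "uncertain"]
    confident = [i for i, label in enumerate(labels) if label != "uncertain"]
    # Spot-check a random sample of confident labels to estimate rule accuracy.
    rng = random.Random(seed)
    n_spot = round(len(confident) * spot_check_rate)
    spot_check = rng.sample(confident, n_spot)
    return labels, sorted(uncertain + spot_check)


tickets = [
    "Thanks, great support",
    "My order arrived broken",
    "Where is my invoice",
    "Love the product but want a refund",
]
labels, to_review = plan_review(tickets)
print(labels)
print(to_review)
```

The human labeling bill then scales with `len(to_review)` rather than with the full dataset, which is where the savings come from. If most items come back uncertain, the rules are too weak and the savings largely disappear.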


### When NOT to Build ML Models:


In [None]:
# ❌ Don't build if:

# 1. Simple rules work better
# Instead of ML: if age > 50 and income > 100K: approve
# Rule: $0, Accuracy: 90%

# 2. Data doesn't exist
# Can't build recommendation without user history
# Collect data first

# 3. ROI negative
# Cost: $500K, Expected benefit: $100K
# ROI = (100K - 500K) / 500K = -80%, DON'T BUILD

# 4. Regulation/ethics issues
# Discriminatory models
# Privacy violations
# Regulatory non-compliance

# 5. Real-time predictions not needed
# Batch scoring, business rules, or simpler models may be enough
# Save $300K on real-time serving infrastructure

# 6. Data quality too poor
# 80% missing values, 50% errors
# Fix data first (cheaper)

# ✅ Do build if:
# 1. High ROI (>300%)
# 2. Data readily available
# 3. Complex patterns to learn
# 4. Real-time predictions needed
# 5. Ethical/regulatory OK
# 6. Team has expertise
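
The negative-ROI case in item 3 and the ">300%" threshold in the build list both rest on ROI = (benefit - cost) / cost. Here is a small sketch of that check as code; the `should_build` rule is a hypothetical simplification that looks only at ROI and ignores the other criteria in the lists above.

```python
def roi(cost, expected_benefit):
    """Net return on investment: (benefit - cost) / cost."""
    if cost <= 0:
        raise ValueError("cost must be positive")
    return (expected_benefit - cost) / cost


def should_build(cost, expected_benefit, min_roi=3.0):
    """Hypothetical go/no-go rule using the >300% ROI threshold (3.0)."""
    return roi(cost, expected_benefit) > min_roi


# The negative-ROI case from the list: $500K cost, $100K expected benefit.
print(f"ROI: {roi(500_000, 100_000):.0%}")
print("Build?", should_build(500_000, 100_000))

# A hypothetical project that clears the threshold: $100K cost, $500K benefit.
print(f"ROI: {roi(100_000, 500_000):.0%}")
print("Build?", should_build(100_000, 500_000))
```

In practice the expected benefit is itself an estimate, so it is worth running the check with a pessimistic figure as well as the expected one.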


### Cost Estimation Framework:


In [None]:
def estimate_ml_project_cost(
    dataset_size_samples,
    annotation_complexity,  # 'simple', 'moderate', 'complex'
    model_complexity,       # 'simple', 'moderate', 'complex'
    deployment_scale        # 'small', 'medium', 'large'
):
    """Estimate ML project costs"""
    
    # Data costs
    annotation_costs = {
        'simple': 0.10,
        'moderate': 0.50,
        'complex': 5.00
    }
    data_cost = dataset_size_samples * annotation_costs[annotation_complexity]
    
    # Computing costs (6 months)
    compute_costs = {
        'simple': 5_000,
        'moderate': 30_000,
        'complex': 100_000
    }
    compute_cost = compute_costs[model_complexity]
    
    # Personnel (6 months), team size driven by model complexity
    team_sizes = {
        'simple': 1,        # 1 data scientist
        'moderate': 2,      # 1 DS + 1 engineer
        'complex': 4        # Full team
    }
    personnel_cost = team_sizes[model_complexity] * 75_000  # ~$150K/year per person, for 6 months
    
    # Infrastructure
    infra_costs = {
        'small': 10_000,
        'medium': 50_000,
        'large': 200_000
    }
    infra_cost = infra_costs[deployment_scale]
    
    # Total
    total = data_cost + compute_cost + personnel_cost + infra_cost
    
    return {
        'data': data_cost,
        'compute': compute_cost,
        'personnel': personnel_cost,
        'infrastructure': infra_cost,
        'total': total
    }

# Example: Medium complexity project
costs = estimate_ml_project_cost(
    dataset_size_samples=50_000,
    annotation_complexity='moderate',
    model_complexity='moderate',
    deployment_scale='medium'
)

print("Cost Breakdown:")
for category, amount in costs.items():
    print(f"  {category.capitalize()}: ${amount:,.0f}")
# → Total: $255,000 (data $25K + compute $30K + personnel $150K + infrastructure $50K)
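
In the framework above, data cost grows linearly with dataset size while personnel cost is fixed per complexity tier, so there is a dataset size beyond which annotation becomes the larger of the two. The sketch below computes that break-even point. The rates and team sizes are copied from `estimate_ml_project_cost`; `annotation_break_even` is a hypothetical helper added for illustration.

```python
# Rates and team sizes copied from estimate_ml_project_cost above.
ANNOTATION_COST_PER_SAMPLE = {"simple": 0.10, "moderate": 0.50, "complex": 5.00}
TEAM_SIZE = {"simple": 1, "moderate": 2, "complex": 4}
COST_PER_PERSON_6_MONTHS = 75_000


def annotation_break_even(annotation_complexity, model_complexity):
    """Dataset size at which annotation cost equals 6-month personnel cost."""
    personnel = TEAM_SIZE[model_complexity] * COST_PER_PERSON_6_MONTHS
    return personnel / ANNOTATION_COST_PER_SAMPLE[annotation_complexity]


for level in ("simple", "moderate", "complex"):
    samples = annotation_break_even(level, level)
    print(f"{level:<9} {samples:>12,.0f} samples")
```

For the moderate example above (50,000 samples), the dataset is well below the 300,000-sample break-even, so personnel dominates that estimate. For complex annotation the crossover arrives at only 60,000 samples.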


---

## SUMMARY TABLE: ML CHALLENGES

| Challenge | Problem | Impact | Solution |
|-----------|---------|--------|----------|
| **Data Collection** | Scraping legal/technical issues, API limits | Limited training data | Use APIs, buy datasets, crowdsource |
| **Insufficient Data** | Can't afford labeling, annotation errors | Poor model performance | Transfer learning, data augmentation, semi-supervised learning |
| **Non-Representative Data** | Sampling bias; training data differs from the target population | Poor generalization, unfair predictions | Stratified sampling, reweighting, collect representative data |
| **Poor Quality** | Missing values, outliers, duplicates | Model fails or performs poorly | Data cleaning, validation, quality checks |
| **Irrelevant Features** | Noise introduced, overfitting | Worse performance | Feature selection, domain expertise |
| **Overfitting** | Model memorizes training data | High training accuracy, poor test accuracy | Regularization, early stopping, more data |
| **Underfitting** | Model too simple | Poor performance everywhere | More complex model, feature engineering |
| **Integration** | Env mismatch, deployment challenges | Model fails in production | Containerization, CI/CD, monitoring |
| **Offline Learning** | Concept drift, model becomes stale | Predictions degrade over time | Online learning, scheduled retraining |
| **Cost** | High data/compute/personnel costs | Project infeasible | Use transfer learning, AutoML, cost-benefit analysis |

---

## PRACTICAL CHECKLIST: Avoiding ML Challenges

### Before Starting:
- [ ] Understand business requirements and ROI
- [ ] Assess data availability and quality
- [ ] Check if ML is necessary (rules might work)
- [ ] Estimate costs (data + compute + team)
- [ ] Understand ethical/regulatory implications

### Data Collection:
- [ ] Use APIs when available (legal, reliable)
- [ ] Respect ToS and robots.txt
- [ ] Ensure data is representative
- [ ] Validate data quality early
- [ ] Plan for continuous data collection

### Model Development:
- [ ] Start with simple baselines
- [ ] Monitor for overfitting/underfitting
- [ ] Use cross-validation
- [ ] Perform feature selection
- [ ] Document everything

### Deployment:
- [ ] Containerize model (Docker)
- [ ] Set up logging and monitoring
- [ ] Plan for model updates
- [ ] Implement A/B testing
- [ ] Have rollback plan

### Production:
- [ ] Monitor data drift
- [ ] Track model performance
- [ ] Retrain on schedule
- [ ] Handle edge cases
- [ ] Maintain documentation

---

**Use these notes as a reference for identifying and solving real-world ML challenges!**
