# Lesson 3: The ML System Lifecycle

**Module 1: Foundations & Background**  
**Estimated Time**: 3-4 hours  
**Difficulty**: Intermediate

---

## 🎯 Learning Objectives

By the end of this lesson, you will:

✅ Understand the complete ML system lifecycle  
✅ Know each stage: Data → Train → Deploy → Monitor  
✅ Identify challenges at each stage  
✅ Build a simple end-to-end ML pipeline  
✅ Answer lifecycle questions in interviews  

---

## 📚 What You'll Learn

1. [The Complete ML Lifecycle](#1-the-complete-ml-lifecycle)
2. [Stage 1: Data Collection & Processing](#2-stage-1-data-collection--processing)
3. [Stage 2: Model Training](#3-stage-2-model-training)
4. [Stage 3: Model Deployment](#4-stage-3-model-deployment)
5. [Stage 4: Monitoring & Maintenance](#5-stage-4-monitoring--maintenance)
6. [Hands-On: End-to-End Pipeline](#6-hands-on-end-to-end-pipeline)
7. [Interview Preparation](#7-interview-preparation)

---

## 1. The Complete ML Lifecycle

### Overview

The ML system lifecycle extends far beyond just training a model. It encompasses the entire journey from raw data to a production system that delivers value.

### The Four Main Stages

```
DATA → TRAIN → DEPLOY → MONITOR
 ↑                         ↓
 └─────── Feedback ────────┘
```

1. **Data**: Collection, validation, cleaning, feature engineering.
2. **Train**: Model selection, training, hyperparameter tuning, evaluation.
3. **Deploy**: Packaging, testing, serving infrastructure, rollout.
4. **Monitor**: Tracking performance, drift detection, retraining.

### Time Distribution

In production work, training is usually the smaller share of the effort: most of your time goes to **Data** and **Monitoring/Infrastructure**.

## 2. Stage 1: Data Collection & Processing

**"Garbage in, garbage out."**

### Key Activities
- **Collection**: APIs, databases, logs, scraping.
- **Validation**: Checking schema, types, ranges, missing values.
- **Cleaning**: Handling nulls, outliers, duplicates.
- **Feature Engineering**: Creating predictive features.
- **Versioning**: Using DVC to track dataset versions.
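
The validation and cleaning steps above can be sketched with plain pandas. This is a minimal illustration, not a replacement for a dedicated validation library; the `validate` helper, the `SCHEMA` and `RANGES` dictionaries, and the toy columns (`age`, `amount`) are all hypothetical names chosen for the example.

```python
import numpy as np
import pandas as pd


def validate(df: pd.DataFrame, schema: dict, ranges: dict) -> list:
    """Return a list of human-readable problems (empty if the frame is clean)."""
    problems = []
    # Schema and type checks
    for col, dtype in schema.items():
        if col not in df.columns:
            problems.append(f"missing column: {col}")
        elif str(df[col].dtype) != dtype:
            problems.append(f"{col}: expected {dtype}, got {df[col].dtype}")
    # Range checks (missing values are reported separately below)
    for col, (low, high) in ranges.items():
        if col in df.columns:
            out_of_range = ~df[col].between(low, high) & df[col].notna()
            if out_of_range.any():
                problems.append(f"{col}: {int(out_of_range.sum())} value(s) outside [{low}, {high}]")
    # Missing-value checks
    nulls = df.isnull().sum()
    for col, n in nulls[nulls > 0].items():
        problems.append(f"{col}: {int(n)} missing value(s)")
    return problems


# Hypothetical raw data with one null and one implausible age
raw = pd.DataFrame({
    "age": [34.0, 51.0, np.nan, 230.0],
    "amount": [12.5, 80.0, 43.1, 9.9],
})
SCHEMA = {"age": "float64", "amount": "float64"}
RANGES = {"age": (0, 120), "amount": (0, 10_000)}

issues = validate(raw, SCHEMA, RANGES)
for issue in issues:
    print("VALIDATION:", issue)

# Cleaning: drop duplicates, nulls, and out-of-range rows
clean = raw.drop_duplicates().dropna()
clean = clean[clean["age"].between(*RANGES["age"])]
print(f"Rows before: {len(raw)}, after cleaning: {len(clean)}")
```

Whether to drop, impute, or cap bad values depends on the problem; dropping is used here only because it is the shortest option to show.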

### Common Challenges
- **Data Leakage**: Using information during training that would not be available at prediction time (for example, future values).
- **Training-Serving Skew**: Differences between offline training data and live production data.
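
One common way both problems creep in is through preprocessing: statistics computed on the full dataset leak test information into training, and preprocessing reimplemented separately at serving time can drift from the training version. A hedged sketch of one mitigation, assuming scikit-learn and a synthetic dataset, is to split first and bundle preprocessing with the model in a single `Pipeline`:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for real data (assumption for illustration)
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Split BEFORE computing any statistics, so the test set stays unseen
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# The scaler is fit only on the training rows passed to .fit(); the same
# fitted object is then reused at prediction time, which avoids a second,
# possibly different, preprocessing implementation in the serving path.
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipeline.fit(X_train, y_train)

test_acc = pipeline.score(X_test, y_test)
print(f"Held-out accuracy: {test_acc:.3f}")
```

This addresses only preprocessing-related leakage and skew; time-based leakage still requires splitting by time rather than at random.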

## 3. Stage 2: Model Training

### Key Activities
- **Baseline**: Start with a simple model (dummy, logistic regression).
- **Experimentation**: Try different algorithms (Trees, NNs).
- **Hyperparameter Tuning**: Grid search, random search, Bayesian opt.
- **Evaluation**: Accuracy, Precision, Recall, F1, ROC-AUC.
- **Tracking**: Use MLflow or Weights & Biases to log experiments.
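
As a rough sketch of the baseline, experimentation, and tuning steps above, the snippet below compares a majority-class dummy, a logistic regression, and a grid-searched random forest on the same validation split. The dataset is synthetic and the grid values are hypothetical; experiment tracking is left out to keep the example dependency-free.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for a real dataset (assumption for illustration)
X, y = make_classification(n_samples=600, n_features=10, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Baselines: a majority-class predictor and a simple linear model
models = {
    "dummy": DummyClassifier(strategy="most_frequent").fit(X_train, y_train),
    "logistic_regression": LogisticRegression(max_iter=1000).fit(X_train, y_train),
}

# Candidate: a random forest tuned over a small, hypothetical grid
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [25, 50], "max_depth": [3, None]},
    cv=3,
    scoring="accuracy",
)
search.fit(X_train, y_train)
models["random_forest_tuned"] = search.best_estimator_

# Evaluate every model on the same validation split
scores = {
    name: accuracy_score(y_val, model.predict(X_val))
    for name, model in models.items()
}
for name, score in scores.items():
    print(f"{name:>20}: {score:.3f}")
print("Best forest params:", search.best_params_)
```

If a tuned model barely beats the dummy or the linear baseline, that is usually a signal to revisit the data or features before adding model complexity.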

### Validation Strategy
Always use a **Holdout Set** or **Cross-Validation** to estimate generalization performance.
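
A minimal cross-validation sketch, assuming scikit-learn and synthetic data, looks like this. Reporting the spread across folds alongside the mean gives a sense of how stable the estimate is.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for a real dataset (assumption for illustration)
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Stratified folds keep the class balance similar in every split
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
fold_scores = cross_val_score(
    LogisticRegression(max_iter=1000), X, y, cv=cv, scoring="accuracy"
)

print("Per-fold accuracy:", fold_scores.round(3))
print(f"Mean: {fold_scores.mean():.3f} (std {fold_scores.std():.3f})")
```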

## 4. Stage 3: Model Deployment

### Key Activities
- **Packaging**: Serialize model (pickle, ONNX) and dependencies (Docker).
- **Serving**: Expose as API (FastAPI) or Batch Job.
- **Infrastructure**: Kubernetes, AWS SageMaker, etc.
- **Rollout**: Canary deployment (gradual), Blue-Green deployment.
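
To make the canary idea concrete, here is a small sketch of deterministic traffic splitting. In practice this routing usually lives in a load balancer or service mesh rather than application code; the `route` function and the `req-…` identifiers are hypothetical and only illustrate the mechanism.

```python
import hashlib


def route(request_id: str, canary_percent: int) -> str:
    """Deterministically send roughly `canary_percent`% of requests to the canary.

    Hashing the request (or user) ID means the same ID always lands on the
    same model version, which keeps the experience consistent during rollout.
    """
    digest = hashlib.sha256(request_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100  # stable bucket in [0, 99]
    return "canary" if bucket < canary_percent else "stable"


# Simulate 10,000 requests with 10% of traffic on the new model version
counts = {"stable": 0, "canary": 0}
for i in range(10_000):
    counts[route(f"req-{i}", canary_percent=10)] += 1

print(counts)
```

Raising `canary_percent` in steps while watching error rates and latency gives the gradual rollout; setting it to 0 is the rollback.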

### Challenges
- **Latency**: Serving predictions in milliseconds.
- **Throughput**: Handling thousands of requests per second.

## 5. Stage 4: Monitoring & Maintenance

### Key Activities
- **System Monitoring**: CPU, memory, latency, errors.
- **Model Monitoring**: Prediction distribution, null outputs.
- **Drift Detection**: Checking if input data statistics have changed.
- **Retraining**: Automating updates when performance drops.
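
One way to check whether input statistics have changed is to compare each feature's distribution in production against the training data. The sketch below uses the Population Stability Index (PSI) as the comparison; this is one option among several (statistical tests such as Kolmogorov-Smirnov are another), and the 0.2 alert threshold is a commonly quoted rule of thumb, treated here as an assumption to tune.

```python
import numpy as np


def psi(reference, current, bins=10):
    """Population Stability Index between two 1-D samples.

    Bin edges come from quantiles of the reference sample, so each
    reference bin holds roughly the same share of the data.
    """
    edges = np.quantile(reference, np.linspace(0, 1, bins + 1))
    inner = edges[1:-1]  # values beyond the ends fall into the outer bins
    ref_idx = np.searchsorted(inner, reference, side="right")
    cur_idx = np.searchsorted(inner, current, side="right")
    ref_frac = np.bincount(ref_idx, minlength=bins) / len(reference)
    cur_frac = np.bincount(cur_idx, minlength=bins) / len(current)
    # Avoid log(0) and division by zero for empty bins
    ref_frac = np.clip(ref_frac, 1e-6, None)
    cur_frac = np.clip(cur_frac, 1e-6, None)
    return float(np.sum((cur_frac - ref_frac) * np.log(cur_frac / ref_frac)))


# Simulated data: feature 2 shifts by one standard deviation in production
rng = np.random.default_rng(0)
train = rng.normal(0.0, 1.0, size=(5000, 3))
prod = rng.normal(0.0, 1.0, size=(1000, 3))
prod[:, 2] += 1.0

PSI_ALERT = 0.2  # assumed threshold, tune per feature and use case
drift_scores = [psi(train[:, j], prod[:, j]) for j in range(train.shape[1])]
for j, score in enumerate(drift_scores):
    status = "DRIFT" if score > PSI_ALERT else "ok"
    print(f"feature {j}: PSI={score:.3f} [{status}]")
```

Checking features one at a time catches shifts that a single aggregate statistic would average away.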

### Feedback Loop
Label new examples drawn from production traffic and use them to retrain the model, creating a virtuous cycle.

## 6. Hands-On: End-to-End Pipeline

Let's implement a simplified version of this lifecycle in Python.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
import joblib
import os

# ==========================================
# 1. DATA STAGE
# ==========================================
print("📊 STAGE 1: DATA COLLECTION & PREP")
# Simulate collecting data
X, y = make_classification(n_samples=2000, n_features=20, random_state=42)
df = pd.DataFrame(X, columns=[f'feat_{i}' for i in range(20)])
df['target'] = y

# Data Validation (Simplified)
assert df.isnull().sum().sum() == 0, "Data contains nulls!"
print(f"   Data shape: {df.shape}")
print(f"   Target distribution:\n{df['target'].value_counts()}")

# Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# ==========================================
# 2. TRAINING STAGE
# ==========================================
print("\n🤖 STAGE 2: MODEL TRAINING")
# Experiment: Random Forest
model = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=42)
model.fit(X_train, y_train)

# Evaluation
y_pred = model.predict(X_test)
acc = accuracy_score(y_test, y_pred)
print(f"   Model Accuracy: {acc:.4f}")
print("   Classification Report:")
print(classification_report(y_test, y_pred))

# ==========================================
# 3. DEPLOYMENT STAGE (Simulation)
# ==========================================
print("\n🚀 STAGE 3: DEPLOYMENT")
# Serialize model
os.makedirs('models', exist_ok=True)
model_path = 'models/rf_model_v1.pkl'
joblib.dump(model, model_path)
print(f"   Model saved to {model_path}")

# Simulate API
class ModelService:
    def __init__(self, path):
        self.model = joblib.load(path)

    def predict(self, features):
        # Expects features as list or array
        return self.model.predict([features])[0]

service = ModelService(model_path)
print("   ModelService initialized and ready.")

# ==========================================
# 4. MONITORING STAGE (Simulation)
# ==========================================
print("\n📈 STAGE 4: MONITORING")
# Simulate incoming production traffic
sample_request = X_test[0]
prediction = service.predict(sample_request)
print(f"   Incoming request: {sample_request[:5]}...")
print(f"   Prediction: {prediction}")

# Simple Drift Check (Simulation)
prod_batch = X_test[:100]  # Simulate 100 requests
train_mean = X_train.mean()
prod_mean = prod_batch.mean()
print(f"   Training Mean: {train_mean:.3f}")
print(f"   Production Mean: {prod_mean:.3f}")
if abs(train_mean - prod_mean) > 0.1:
    print("   ⚠️ ALERT: Possible Data Drift Detected!")
else:
    print("   ✅ Data statistics look stable.")
```

```
📊 STAGE 1: DATA COLLECTION & PREP
   Data shape: (2000, 21)
   Target distribution:
target
0    1002
1     998
Name: count, dtype: int64

🤖 STAGE 2: MODEL TRAINING
   Model Accuracy: 0.9350
   Classification Report:
              precision    recall  f1-score   support

           0       0.94      0.93      0.94       204
           1       0.93      0.94      0.93       196

    accuracy                           0.94       400
   macro avg       0.93      0.94      0.93       400
weighted avg       0.94      0.94      0.94       400


🚀 STAGE 3: DEPLOYMENT
   Model saved to models/rf_model_v1.pkl
   ModelService initialized and ready.

📈 STAGE 4: MONITORING
   Incoming request: [-2.56758512 -0.26884104 -0.53058036  0.2715822   0.29151146]...
   Prediction: 1
   Training Mean: -0.004
   Production Mean: -0.002
   ✅ Data statistics look stable.
```


## 7. Interview Preparation

### Top Questions

#### 1. "Walk me through the lifecycle of an ML system you built."
**Answer Framework (STAR)**:
- **Situation**: We needed to reduce fraud.
- **Task**: Build a real-time detection system.
- **Action**:
  - Collected transaction logs (Data).
  - Trained an XGBoost model, tuned hyperparameters (Train).
  - Deployed as a FastAPI service on Kubernetes (Deploy).
  - Set up Prometheus to track latency and drift (Monitor).
- **Result**: Caught 20% more fraud, latency < 50ms.

#### 2. "What happens after you deploy a model?"
**Key Answer**: Monitoring and Feedback Loops.
- Monitor technical metrics (latency, errors).
- Monitor functional metrics (accuracy, drift).
- Choose a retraining strategy (scheduled vs. triggered).

#### 3. "How do you know when to retrain?"
**Key Answer**: 
- **Performance Degradation**: Accuracy drops below threshold.
- **Data Drift**: Input distribution changes significantly.
- **New Data**: Significant volume of new labeled data available.
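
If the interviewer asks how you would automate this, the three triggers above can be sketched as a simple policy check. The `RetrainPolicy` class, the `retrain_reasons` function, and every threshold value are hypothetical; real thresholds come from the business cost of errors and from historical metric variance.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RetrainPolicy:
    """Hypothetical thresholds; tune these for the actual system."""
    min_accuracy: float = 0.90
    max_drift_score: float = 0.2
    min_new_labels: int = 10_000


def retrain_reasons(accuracy, drift_score, new_labels, policy=RetrainPolicy()):
    """Return the list of triggers that fired (empty means no retrain needed)."""
    reasons = []
    if accuracy < policy.min_accuracy:
        reasons.append("performance degradation")
    if drift_score > policy.max_drift_score:
        reasons.append("data drift")
    if new_labels >= policy.min_new_labels:
        reasons.append("new data")
    return reasons


# Healthy model: nothing fires
print(retrain_reasons(accuracy=0.95, drift_score=0.05, new_labels=1_200))

# Degraded model with drifted inputs
print(retrain_reasons(accuracy=0.86, drift_score=0.31, new_labels=1_200))
```

Returning the reasons rather than a bare boolean makes the decision easy to log and explain.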