A structured learning path toward MLOps mastery, covering tools, platforms, and core concepts across the full ML lifecycle.

## Foundations

- ML lifecycle stages (data, training, evaluation, deployment, monitoring)
- Technical debt in ML systems
- Reproducibility and experiment tracking (a seeding sketch follows this list)
- ML system design patterns
- Offline vs online evaluation
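
Reproducibility starts with pinning every source of randomness. A minimal sketch of a seed helper (framework-specific seeds, e.g. PyTorch's `torch.manual_seed`, would be added the same way):

```python
import random

import numpy as np

def set_seed(seed: int = 42) -> None:
    """Pin the common sources of randomness so runs are repeatable."""
    random.seed(seed)      # Python stdlib RNG
    np.random.seed(seed)   # NumPy global RNG
    # Frameworks add their own, e.g. torch.manual_seed(seed) for PyTorch.
    # Note: PYTHONHASHSEED must be set in the environment *before* Python
    # starts; setting it from inside a running process has no effect.

set_seed()
print(np.random.rand(3))  # identical output on every run
```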

## Data Engineering for ML

- Data versioning and lineage
- Feature engineering pipelines
- Feature stores (online vs offline)
- Data validation and schema enforcement (a hand-rolled sketch follows this list)
- Handling data drift and skew
- ETL/ELT for ML workloads
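
Schema enforcement can start as a plain dtype-and-range check before data enters the pipeline; dedicated tools such as Great Expectations generalize this. A hand-rolled sketch with pandas (the schema and columns are hypothetical):

```python
import pandas as pd

# Expected schema: column name -> (dtype, nullable)
SCHEMA = {
    "user_id": ("int64", False),
    "age": ("int64", False),
    "country": ("object", True),
}

def validate(df: pd.DataFrame) -> list[str]:
    """Return a list of schema violations; an empty list means the frame passes."""
    errors = []
    for col, (dtype, nullable) in SCHEMA.items():
        if col not in df.columns:
            errors.append(f"missing column: {col}")
            continue
        if str(df[col].dtype) != dtype:
            errors.append(f"{col}: expected {dtype}, got {df[col].dtype}")
        if not nullable and df[col].isna().any():
            errors.append(f"{col}: contains nulls")
    if "age" in df.columns and (df["age"] < 0).any():
        errors.append("age: negative values")
    return errors

df = pd.DataFrame({"user_id": [1, 2], "age": [34, 29], "country": ["DE", None]})
assert validate(df) == []
```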

## Model Training & Experimentation

- Experiment tracking and metadata management
- Hyperparameter optimization strategies (a random-search sketch follows this list)
- Distributed training patterns
- GPU/TPU resource management
- Model registry and artifact management
- Transfer learning and fine-tuning workflows
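
Random search is a common first hyperparameter strategy. A minimal sketch using scikit-learn's `RandomizedSearchCV` (the estimator, data, and search space are arbitrary choices for illustration):

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=500, random_state=0)

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions={
        "n_estimators": randint(50, 300),  # sampled fresh per trial
        "max_depth": randint(2, 12),
    },
    n_iter=10,       # number of sampled configurations
    cv=3,            # 3-fold cross-validation per trial
    random_state=0,  # reproducible sampling
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```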

## Model Serving & Deployment

- Batch vs real-time inference
- Model serialization formats (ONNX, TorchScript, SavedModel); an export sketch follows this list
- Blue/green and canary deployments for models
- Shadow deployments and A/B testing
- Model compression and quantization
- Edge deployment considerations
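
Exporting to a portable format decouples training code from the serving runtime. A sketch of a PyTorch-to-ONNX export (the two-layer model is a stand-in):

```python
import torch
import torch.nn as nn

# Stand-in model; any traceable nn.Module exports the same way.
model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))
model.eval()

dummy_input = torch.randn(1, 4)  # an example input fixes the graph's shapes
torch.onnx.export(
    model,
    dummy_input,
    "model.onnx",
    input_names=["features"],
    output_names=["logits"],
    dynamic_axes={"features": {0: "batch"}},  # allow variable batch size
)
```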

## CI/CD for ML

- Continuous training pipelines
- ML-specific testing (data tests, model tests, integration tests)
- Automated retraining triggers (a threshold-based sketch follows this list)
- Pipeline orchestration patterns
- Infrastructure as Code for ML
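
A retraining trigger can be a simple policy: retrain when a monitored metric breaches a floor, or when the model exceeds a maximum age. A hand-rolled sketch (the thresholds, metric source, and retrain hook are hypothetical):

```python
from datetime import datetime, timedelta, timezone

MAX_MODEL_AGE = timedelta(days=30)
ACCURACY_FLOOR = 0.85

def should_retrain(live_accuracy: float, trained_at: datetime) -> bool:
    """Fire on metric degradation or model staleness, whichever comes first."""
    degraded = live_accuracy < ACCURACY_FLOOR
    stale = datetime.now(timezone.utc) - trained_at > MAX_MODEL_AGE
    return degraded or stale

# In production this runs on a schedule: pull the live metric from the
# monitoring store, then call the training pipeline's API when it fires.
if should_retrain(live_accuracy=0.81,
                  trained_at=datetime(2024, 1, 1, tzinfo=timezone.utc)):
    print("triggering retraining pipeline")
```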

## Monitoring & Observability

- Data drift detection (a two-sample-test sketch follows this list)
- Model performance degradation tracking
- Prediction logging and auditing
- Alerting strategies for ML systems
- Feedback loops and ground truth collection
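
For a single numeric feature, drift can be flagged with a two-sample Kolmogorov–Smirnov test comparing training data against recent production data. A minimal sketch using SciPy (the "production" sample here is synthetic and deliberately shifted):

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
reference = rng.normal(loc=0.0, scale=1.0, size=5_000)   # training distribution
production = rng.normal(loc=0.4, scale=1.0, size=1_000)  # shifted live traffic

stat, p_value = ks_2samp(reference, production)
if p_value < 0.01:  # the threshold is a policy choice, tuned against alert fatigue
    print(f"drift detected: KS={stat:.3f}, p={p_value:.2e}")
```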

## Governance & Reliability

- Model explainability and interpretability
- Fairness and bias auditing
- Regulatory compliance (GDPR, model cards); a model-card sketch follows this list
- Cost optimization for ML infrastructure
- Disaster recovery for ML systems
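
A model card is ultimately structured metadata, so generating one next to each trained artifact keeps it from going stale. A minimal hand-rolled sketch (every field and value below is a placeholder):

```python
import json
from datetime import date

model_card = {
    "model_name": "churn-classifier",  # hypothetical model
    "version": "1.3.0",
    "date": date.today().isoformat(),
    "intended_use": "Rank accounts by churn risk for retention outreach.",
    "out_of_scope": "Automated account termination decisions.",
    "training_data": "CRM snapshots, 2022-2024, EU region only.",
    "metrics": {"roc_auc": 0.91, "recall_at_top_decile": 0.63},  # placeholders
    "known_limitations": "Underperforms on accounts younger than 30 days.",
}

# Write the card next to the model artifact so the two are versioned together.
with open("model_card.json", "w") as f:
    json.dump(model_card, f, indent=2)
```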

## Tools & Platforms

### Experiment Tracking

- MLflow (a logging sketch follows this list)
- Weights & Biases (W&B)
- Neptune.ai
- CometML
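
MLflow's tracking API revolves around runs: open one, then log params, metrics, and artifacts into it. A minimal sketch (assumes a local MLflow install; the experiment, names, and values are arbitrary):

```python
import mlflow

mlflow.set_experiment("churn-classifier")  # created on first use

with mlflow.start_run(run_name="baseline"):
    mlflow.log_param("n_estimators", 100)
    mlflow.log_param("max_depth", 8)
    mlflow.log_metric("roc_auc", 0.91)
    # mlflow.log_artifact("model_card.json")  # attach files to the run

# Browse logged runs locally with: mlflow ui
```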

### Data & Feature Management

- DVC (Data Version Control); a versioned-read sketch follows this list
- Feast (Feature Store)
- Great Expectations (Data Validation)
- Delta Lake / Lakehouse formats
- Apache Spark for ML
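
DVC's Python API can read a specific Git revision of a tracked file, which keeps training code pinned to exact data versions. A sketch using `dvc.api.read` (the repo URL, path, and tag are hypothetical):

```python
import dvc.api

# Read a DVC-tracked file as it existed at a given Git revision.
data = dvc.api.read(
    "data/train.csv",                          # path inside the repo
    repo="https://github.com/org/ml-project",  # hypothetical repo
    rev="v1.2",                                # Git tag, branch, or commit
)
print(data[:80])
```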

### Pipeline Orchestration

- Apache Airflow (a DAG sketch follows this list)
- Kubeflow Pipelines
- Prefect
- Dagster
- ZenML
- Metaflow
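
Airflow expresses pipelines as DAGs of tasks with explicit dependencies. A minimal two-step retraining DAG sketch (Airflow 2.x, where `schedule` replaced the older `schedule_interval`; task bodies are stubs):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract_features():
    print("building feature table")  # stub

def train_model():
    print("fitting model")           # stub

with DAG(
    dag_id="weekly_retrain",
    start_date=datetime(2024, 1, 1),
    schedule="@weekly",   # cron expressions also work here
    catchup=False,        # do not backfill missed runs
) as dag:
    features = PythonOperator(task_id="extract_features",
                              python_callable=extract_features)
    train = PythonOperator(task_id="train_model",
                           python_callable=train_model)
    features >> train     # train runs only after features succeeds
```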

### Model Serving

- TensorFlow Serving
- TorchServe
- Triton Inference Server
- BentoML
- Seldon Core
- Ray Serve
- vLLM (LLM serving)

### Infrastructure

- Docker
- Kubernetes
- Helm charts for ML workloads
- Terraform / Pulumi
- KServe

### CI/CD Tooling

- GitHub Actions for ML
- GitLab CI/CD for ML
- CML (Continuous Machine Learning)
- DVC Pipelines

### Monitoring

- Evidently AI
- Prometheus + Grafana (an instrumentation sketch follows this list)
- Arize AI
- WhyLabs
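
Prometheus scrapes metrics that a service exposes over HTTP; `prometheus_client` is the standard Python library for that. A sketch instrumenting a prediction function (the metric names, port, and inference stub are hypothetical):

```python
from prometheus_client import Counter, Histogram, start_http_server

PREDICTIONS = Counter("model_predictions_total", "Predictions served",
                      ["model_version"])
LATENCY = Histogram("model_inference_seconds", "Inference latency")

@LATENCY.time()  # records each call's duration into the histogram
def predict(features):
    PREDICTIONS.labels(model_version="1.3.0").inc()
    return 0.42  # stand-in for real inference

start_http_server(9100)  # metrics exposed at :9100/metrics for Prometheus
predict([1.0, 2.0])
```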

### Cloud Platforms

- AWS SageMaker
- Google Vertex AI
- Azure Machine Learning
- Databricks MLflow

## LLMOps

- LangChain / LangSmith
- LlamaIndex
- Prompt engineering and management
- RAG (Retrieval-Augmented Generation) pipelines; a minimal retrieval sketch follows this list
- Fine-tuning LLMs (LoRA, QLoRA)
- LLM evaluation frameworks (RAGAS, DeepEval)
- Guardrails and content filtering
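
The retrieval half of a RAG pipeline is: vectorize documents, vectorize the query, rank by similarity, and put the best match into the prompt. A framework-free sketch using bag-of-words vectors as a stand-in for a real embedding model (the documents and query are toy data):

```python
import numpy as np

def bow_vector(text: str, vocab: dict[str, int]) -> np.ndarray:
    """Unit-normalized bag-of-words counts; a real system uses learned embeddings."""
    v = np.zeros(len(vocab))
    for tok in text.lower().split():
        tok = tok.strip(".,?")
        if tok in vocab:
            v[vocab[tok]] += 1
    n = np.linalg.norm(v)
    return v / n if n else v

docs = [
    "Feature stores serve precomputed features at low latency.",
    "Canary deployments route a small slice of traffic to the new model.",
    "LoRA fine-tunes a small number of injected low-rank matrices.",
]
vocab = {tok: i for i, tok in enumerate(
    sorted({t.strip(".,?") for d in docs for t in d.lower().split()}))}

doc_vecs = np.stack([bow_vector(d, vocab) for d in docs])
query = "how does a canary deployment route traffic?"
scores = doc_vecs @ bow_vector(query, vocab)  # cosine similarity (unit vectors)
context = docs[int(np.argmax(scores))]        # best-matching document

prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer from the context only."
print(prompt)
```

Swapping `bow_vector` for a real embedding model and the list for a vector database gives the usual production shape of this pipeline.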

## Progress

| Area | Status |
|---|---|
| Foundations | Not Started |
| Data Engineering for ML | Not Started |
| Model Training & Experimentation | Not Started |
| Model Serving & Deployment | Not Started |
| CI/CD for ML | Not Started |
| Monitoring & Observability | Not Started |
| Governance & Reliability | Not Started |
| Tools & Platforms | Not Started |
| LLMOps | Not Started |