

## MLOps Cycle and Vertex AI Features

The MLOps cycle generally consists of four stages: Data Preparation, Model Development (Training & Evaluation), Model Deployment, and Model Monitoring & Governance.

Vertex AI is a unified platform for building, deploying, and managing machine learning models, and it provides dedicated tools for each of these stages.

**Notice:** The 🧑‍🎓 emoji marks topics that are covered in more detail later in this training.

Here's how Vertex AI's key features align with the MLOps cycle as of mid-2025:


### 1. Data Preparation and Management

**Goal:** To collect, clean, transform, and store data for machine learning. This includes managing features for reusability and consistency.

* **📁 Vertex AI Feature Store:**
    * **Purpose:** A centralized repository for organizing, storing, and serving machine learning features. It helps in reusing features across multiple models and teams, ensuring consistency between training and serving, and reducing feature engineering efforts.
    * **Key Capabilities (as of 2025):**
        * **Online Serving:** Provides low-latency access to the latest feature values for real-time predictions, which matters for applications such as fraud detection or personalized recommendations. Of the two online serving types, Optimized online serving is recommended for most scenarios because it offers lower latencies than Bigtable online serving and supports embeddings management.
        * **Offline Serving:** Enables batch retrieval of historical feature data for model training and batch predictions, ensuring point-in-time correctness to prevent data leakage.
        * **Integration with BigQuery:** Feature data lives in BigQuery tables or views, which act as the offline store. Vertex AI Feature Store acts as a metadata layer over these BigQuery data sources.
        * **Feature Monitoring:** Capabilities to monitor feature values for drift and anomalies, which can indicate issues with data pipelines or changes in data distribution that might impact model performance.
        * **Embedding Management:** Supports storing and serving embeddings (vector representations of data) directly, crucial for vector similarity search and retrieval-augmented generation (RAG) applications.
        * **Feature Registry:** For setting up and managing metadata about your features, including feature groups and feature views.

* **📁 Cloud Storage:**
    * **Purpose:** Highly scalable and durable object storage.
    * **Role in MLOps:** Used for storing various ML artifacts, including raw datasets, preprocessed data, model artifacts (trained models), and pipeline outputs.

  
* **📁 - 🧑‍🎓 Vertex AI Datastore (specifically for Generative AI Applications/Vertex AI Search):**
    * **Purpose:** Not a general-purpose ML data store like BigQuery, but a core component of **Vertex AI Search** and **Vertex AI Conversation** (both now part of **Vertex AI Agent Builder**). It ingests and manages several types of data (website, structured, unstructured, healthcare FHIR) that are used to ground generative AI models, so that responses are accurate and context-aware.
    * **Key Capabilities:**
        * **Website Data:** Indexes data from public or private websites for search.
        * **Structured Data:** For semantic search or recommendations over structured data, imported from BigQuery or Cloud Storage.
        * **Unstructured Data:** Supports ingesting and searching over various document types (PDFs, text files).
        * **Healthcare FHIR Data:** Specialized data store for healthcare data.
        * **Blended Search:** Allows combining multiple data sources for comprehensive search results.
        * **Grounding:** Provides the factual basis for generative AI models, reducing hallucinations by linking model responses to specific data.
    * **Example:** https://console.cloud.google.com/vertex-ai/locations/europe-west1/datasets/2393654405854396416/analyze?inv=1&invt=Ab38cA&project=poc-eurobet-ds

* **📁⚙️ - 🧑‍🎓  BigQuery:**
    * **Purpose:** Google Cloud's fully managed, petabyte-scale data warehouse. It serves as the foundational data storage for most ML workflows on Vertex AI.
    * **Role in MLOps:** Used for storing raw data, transformed data, and features that feed into the Feature Store. Its scalability and analytical capabilities make it ideal for large-scale data preparation.
  
* **⚙️ - 🧑‍🎓  Dataproc:**
    * **Purpose:** A managed service for running Apache Spark, Hadoop, Flink, and Presto clusters.
    * **Role in MLOps:** For large-scale data processing, transformations, and feature engineering tasks, especially when complex batch processing is required.

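The point-in-time correctness mentioned under Offline Serving can be illustrated with a minimal sketch. It does not use the Feature Store API: the `FEATURE_HISTORY` dictionary, the entity id, and the `point_in_time_value` helper are hypothetical names chosen for illustration. The idea is that a training example observed at time `as_of` may only see feature values recorded at or before that time.

```python
from bisect import bisect_right
from datetime import datetime

# Hypothetical feature history: entity id -> (timestamp, value) rows sorted by timestamp.
FEATURE_HISTORY = {
    "user_1": [
        (datetime(2025, 1, 1), 10.0),
        (datetime(2025, 2, 1), 25.0),
        (datetime(2025, 3, 1), 40.0),
    ],
}


def point_in_time_value(history, entity_id, as_of):
    """Return the latest feature value recorded at or before `as_of`, or None."""
    rows = history.get(entity_id, [])
    timestamps = [ts for ts, _ in rows]
    # bisect_right returns the index of the first row strictly after `as_of`.
    idx = bisect_right(timestamps, as_of)
    if idx == 0:
        return None  # No value existed yet at that time.
    return rows[idx - 1][1]


# A label observed on 15 Feb must see the 1 Feb value, not the later 1 Mar value.
print(point_in_time_value(FEATURE_HISTORY, "user_1", datetime(2025, 2, 15)))  # prints 25.0
```

Using the 1 Mar value for a 15 Feb label would leak future information into training, which is the data leakage that point-in-time lookups are meant to prevent.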

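Embedding management and grounding both rely on vector similarity search. The sketch below shows the underlying idea with a brute-force cosine-similarity ranking. It is a conceptual illustration, not how Vertex AI implements retrieval: the document ids, the 3-dimensional vectors, and the `top_k` helper are made up for the example, and production systems typically use approximate nearest-neighbor indexes.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query, documents, k=2):
    """Return the ids of the k documents whose embeddings are most similar to the query."""
    scored = [(cosine_similarity(query, emb), doc_id) for doc_id, emb in documents.items()]
    scored.sort(reverse=True)
    return [doc_id for _, doc_id in scored[:k]]


# Hypothetical document embeddings; real embeddings have hundreds of dimensions.
DOCS = {
    "refund_policy": [0.9, 0.1, 0.0],
    "shipping_times": [0.1, 0.9, 0.0],
    "privacy_notice": [0.0, 0.1, 0.9],
}

print(top_k([0.8, 0.2, 0.0], DOCS, k=2))  # prints ['refund_policy', 'shipping_times']
```

In a RAG setup, the text of the top-ranked documents would be passed to the generative model as context, which is what links a response back to specific data.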
### 2. Model Development (Training & Evaluation)

**Goal:** To develop, train, and evaluate machine learning models, iterating on different architectures and hyperparameters.

* **🧑‍🎓 - Vertex AI Workbench:**
    * **Purpose:** A fully managed, unified development environment for data scientists and ML engineers. It provides managed Jupyter notebooks.
    * **Role in MLOps:** Serves as the primary interface for interactive development, experimentation, data exploration, and model prototyping. It integrates seamlessly with other Vertex AI services.
* **🧑‍🎓 - Vertex AI Training:**
    * **Purpose:** Offers managed training jobs for custom models and AutoML capabilities.
    * **Key Capabilities:**
        * **Custom Training:** Runs training jobs from custom code (TensorFlow, PyTorch, scikit-learn, XGBoost, etc.) on scalable infrastructure with GPUs and TPUs, including distributed training. Ironwood is the TPU generation Google announced in 2025; check availability for your region and workload.
        * **AutoML:** For users with limited ML expertise, AutoML automates model selection, hyperparameter tuning, and architecture search for various data types (tabular, image, text, video).
        * **Hyperparameter Tuning (Vertex AI Vizier):** A black-box optimization service that automates the process of finding the best hyperparameters for a model, significantly accelerating experimentation.
        * **Generative AI Fine-tuning:** Supports fine-tuning large language models (LLMs) and other generative models (Gemini, Gemma, Qwen, Llama, etc.) with parameter-efficient fine-tuning (PEFT) methods and with tooling such as Axolotl, so that models can adapt to domain-specific data.
* **Vertex AI Experiments:**
    * **Purpose:** To track, compare, and analyze different model architectures, hyperparameters, and training environments.
    * **Role in MLOps:** Essential for experiment management, ensuring reproducibility and helping identify the best performing models.
* **Vertex ML Metadata:**
    * **Purpose:** To record metadata about ML artifacts (datasets, models, metrics, parameters) and track their lineage across ML workflows.
    * **Role in MLOps:** Provides critical visibility into the entire ML lifecycle, enabling auditing, debugging, and understanding the dependencies between different components of an ML system.
* **Generative AI Evaluation Service:**
    * **Purpose:** Allows users to evaluate the quality and performance of generative models and applications against custom criteria.
    * **Role in MLOps:** Crucial for validating the output of generative models (e.g., assessing factual accuracy, relevance, safety) before deployment and continuously monitoring their performance in production.

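Hyperparameter tuning treats training as a black box: propose parameters, observe a metric, keep the best. The sketch below uses plain random search to show that loop. This is an assumption made for illustration and is not Vizier's algorithm; `validation_loss` is a hypothetical stand-in for a real training run, and the search ranges are arbitrary.

```python
import random


def validation_loss(learning_rate, num_layers):
    """Hypothetical stand-in for a training run; lowest at learning_rate=0.01, num_layers=3."""
    return (learning_rate - 0.01) ** 2 * 1e4 + (num_layers - 3) ** 2


def random_search(objective, num_trials=50, seed=0):
    """Sample hyperparameters at random and return the best (params, loss) pair seen."""
    rng = random.Random(seed)
    best_params, best_loss = None, float("inf")
    for _ in range(num_trials):
        params = {
            # Sample the learning rate on a log scale between 1e-4 and 1e-1.
            "learning_rate": 10 ** rng.uniform(-4, -1),
            "num_layers": rng.randint(1, 6),
        }
        loss = objective(**params)
        if loss < best_loss:
            best_params, best_loss = params, loss
    return best_params, best_loss


best_params, best_loss = random_search(validation_loss, num_trials=50, seed=0)
print(best_params, best_loss)
```

A managed tuning service replaces the random sampling step with a smarter suggestion strategy and runs trials in parallel, but the propose, evaluate, and compare loop stays the same.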

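Lineage tracking answers the question "what did this model depend on?". The following sketch models lineage as a simple dependency graph and walks it upstream. It is a conceptual illustration rather than the Vertex ML Metadata API; the artifact names and the `LINEAGE` dictionary are hypothetical.

```python
# Hypothetical lineage records: each artifact maps to the artifacts it was derived from.
LINEAGE = {
    "raw_events": [],
    "clean_events": ["raw_events"],
    "features_v1": ["clean_events"],
    "train_config": [],
    "model_v1": ["features_v1", "train_config"],
    "eval_metrics": ["model_v1", "features_v1"],
}


def upstream(artifact, lineage):
    """Return every artifact that `artifact` depends on, directly or indirectly."""
    seen = set()
    stack = list(lineage.get(artifact, []))
    while stack:
        current = stack.pop()
        if current in seen:
            continue  # Already visited through another path.
        seen.add(current)
        stack.extend(lineage.get(current, []))
    return seen


print(sorted(upstream("model_v1", LINEAGE)))
# prints ['clean_events', 'features_v1', 'raw_events', 'train_config']
```

This kind of traversal is what makes auditing and debugging possible: if `raw_events` turns out to be corrupted, the same graph walked in the other direction shows which models need retraining.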

### 3. Model Deployment and Serving

**Goal:** To deploy trained models into production environments for inference, ensuring high availability, low latency, and scalability.

* **🧑‍🎓 - Vertex AI Endpoints:**
    * **Purpose:** A fully managed service for deploying models for online (real-time) predictions.
    * **Key Capabilities:**
        * **Managed Prediction Service:** Handles infrastructure, scaling (auto-scaling), and updates, so users don't have to manage servers.
        * **Low-latency Serving:** Optimized for real-time inference requests.
        * **Model Versioning:** Allows deploying multiple versions of a model to the same endpoint for A/B testing or gradual rollouts.
        * **Traffic Splitting:** Enables routing traffic to different model versions for A/B testing, canary deployments, or blue/green deployments.
        * **Private Endpoints:** For secure, private access to models within your VPC network.
* **🧑‍🎓 - Vertex AI Batch Prediction:**
    * **Purpose:** For making predictions on large datasets asynchronously, where latency is not a critical factor.
    * **Role in MLOps:** Ideal for generating insights on a schedule or for scenarios like reporting and large-scale data enrichment.
* **🧑‍🎓 - Model Registry:**
    * **Purpose:** A centralized repository for managing and versioning trained ML models.
    * **Role in MLOps:** Stores model metadata, supports model versioning and approval workflows, and eases the transition from training to deployment. From the Model Registry, you can evaluate models, deploy them to an endpoint, and run batch inference jobs.
* **AI Hypercomputer:**
    * **Purpose:** Google Cloud's integrated supercomputing system, leveraging Ironwood TPUs and NVIDIA GPUs.
    * **Role in MLOps:** While primarily for training, its advanced compute capabilities are also leveraged for highly demanding inference workloads, especially for large generative models, ensuring efficient and fast serving.

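Traffic splitting can be pictured as weighted random routing between deployed model versions. The sketch below simulates a canary rollout under that assumption. It does not call the Vertex AI Endpoints API: the version names, the 90/10 split, and the `pick_version` helper are hypothetical.

```python
import random
from collections import Counter


def pick_version(traffic_split, rng):
    """Choose a model version with probability proportional to its traffic percentage."""
    if sum(traffic_split.values()) != 100:
        raise ValueError("traffic percentages must sum to 100")
    versions = list(traffic_split)
    weights = [traffic_split[v] for v in versions]
    return rng.choices(versions, weights=weights, k=1)[0]


# Hypothetical canary rollout: 90% of requests to the current model, 10% to the candidate.
SPLIT = {"model_v1": 90, "model_v2": 10}

rng = random.Random(42)
counts = Counter(pick_version(SPLIT, rng) for _ in range(10_000))
print(counts)
```

A gradual rollout then amounts to changing the percentages over time (for example 90/10, then 50/50, then 0/100) while watching the candidate's error rate and latency, and a rollback is a single change back to the previous split.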


### 4. Model Monitoring & Governance

**Goal:** To continuously monitor model performance in production, detect issues like data drift and concept drift, ensure fairness, and manage the model lifecycle.

* **Vertex AI Model Monitoring:**
    * **Purpose:** Monitors deployed models for training-serving skew, inference drift, performance degradation, and anomalous feature distributions.
    * **Key Capabilities:**
        * **Drift Detection:** Automatically detects statistical differences between training data and serving data (skew), or between historical and current inference data (drift), either of which can indicate model decay.
        * **Alerting:** Configurable alerts notify users of detected drift or performance issues, enabling timely intervention.
        * **Feature Attribution Monitoring:** Helps understand which features are driving model predictions and if their importance changes over time.
        * **Bias Detection:** Monitors for potential biases in model predictions.
* **Vertex AI Pipelines:**
    * **Purpose:** Automates, monitors, and governs your ML workflows in a serverless manner. It allows you to define ML workflows as a series of interconnected steps (pipeline tasks).
    * **Key Capabilities (as of 2025):**
        * **Workflow Orchestration:** Defines the entire MLOps lifecycle, from data ingestion and preprocessing to training, evaluation, deployment, and even retraining.
        * **Reproducibility:** Ensures that ML workflows are repeatable and consistent by defining them as code.
        * **Versioning:** Supports versioning of pipelines and components, enabling tracking of changes and rollbacks.
        * **Scheduled Runs:** Automates pipeline execution on a schedule or in response to triggers (e.g., new data arrival).
        * **Lineage Tracking:** Automatically tracks the lineage of artifacts within a pipeline run using Vertex ML Metadata, providing full traceability.
        * **Multi-Modal Pipelines:** Enhanced support for orchestrating pipelines that process and generate multi-modal data (vision, text, audio, video) leveraging the latest Gemini models and other generative AI capabilities.
        * **Enterprise Integration:** Seamless integration with BigQuery, Workspace, and other GCP services within pipeline components.
* **Vertex AI Agent Engine and Agent Garden:**
    * **Purpose:** While primarily for agent development, they play a crucial role in operationalizing and governing complex AI systems that might incorporate multiple models and decision-making logic. Agent Engine provides a managed runtime for deploying and securely managing AI agents, complete with memory management and evaluation tools.
    * **Role in MLOps:** Enables the deployment and monitoring of intelligent agents that can automate complex tasks, potentially leveraging multiple ML models and decision points. This moves beyond just model serving to serving entire intelligent systems.
* **Cloud Logging & Cloud Monitoring:**
    * **Purpose:** General Google Cloud services for collecting logs and metrics from your applications and infrastructure.
    * **Role in MLOps:** Provide foundational capabilities for monitoring the health and performance of your Vertex AI services, tracking resource utilization, and debugging issues.
* **Identity and Access Management (IAM):**
    * **Purpose:** Controls who can do what within your Google Cloud project.
    * **Role in MLOps:** Essential for securing your ML pipelines, models, and data, ensuring proper access control and compliance.
 
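Drift detection comes down to comparing two feature distributions and alerting when their distance exceeds a threshold. The sketch below uses Jensen-Shannon divergence over histograms, one common choice for numerical features. Treat the metric, the bin edges, the sample values, and the 0.1 threshold as assumptions for illustration, not as the configuration of Vertex AI Model Monitoring.

```python
import math


def histogram(values, edges):
    """Return the fraction of values per bin; values outside the edges are ignored."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        if v == edges[-1]:
            counts[-1] += 1  # Put the right-most edge in the last bin.
            continue
        for i in range(len(counts)):
            if edges[i] <= v < edges[i + 1]:
                counts[i] += 1
                break
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * len(counts)


def kl_divergence(p, q):
    """Kullback-Leibler divergence in bits; bins where p is zero contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits: 0 for identical, 1 for disjoint distributions."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def drift_detected(training, serving, edges, threshold):
    """Return (flag, score), where flag is True when the divergence exceeds the threshold."""
    score = js_divergence(histogram(training, edges), histogram(serving, edges))
    return score > threshold, score


# Hypothetical feature values: the serving data has shifted towards larger values.
TRAIN = [1, 2, 2, 3, 3, 3, 4, 4, 5]
SERVE = [3, 4, 4, 5, 5, 5, 5, 6, 6]
EDGES = [0, 2, 4, 6]

print(drift_detected(TRAIN, SERVE, EDGES, threshold=0.1))
```

Comparing training data with serving data in this way corresponds to skew detection, while comparing two serving windows corresponds to drift detection; the alerting step is the threshold check.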