In [None]:
# Copyright 2025 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# MetaMath with Vertex AI Open Source Model Tuning

<table align="left">
  <td style="text-align: center">
    <a href="https://colab.research.google.com/github/GoogleCloudPlatform/generative-ai/blob/main/open-models/fine-tuning/get_started_with_oss_tuning_on_vertexai.ipynb">
      <img width="32px" src="https://www.gstatic.com/pantheon/images/bigquery/welcome_page/colab-logo.svg" alt="Google Colaboratory logo"><br> Open in Colab
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://console.cloud.google.com/vertex-ai/colab/import/https:%2F%2Fraw.githubusercontent.com%2FGoogleCloudPlatform%2Fgenerative-ai%2Fmain%2Fopen-models%2Ffine-tuning%2Fget_started_with_oss_tuning_on_vertexai.ipynb">
      <img width="32px" src="https://lh3.googleusercontent.com/JmcxdQi-qOpctIvWKgPtrzZdJJK-J3sWE1RsfjZNwshCFgE_9fULcNpuXYTilIR2hjwN" alt="Google Cloud Colab Enterprise logo"><br> Open in Colab Enterprise
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://console.cloud.google.com/vertex-ai/workbench/deploy-notebook?download_url=https://raw.githubusercontent.com/GoogleCloudPlatform/generative-ai/main/open-models/fine-tuning/get_started_with_oss_tuning_on_vertexai.ipynb">
      <img src="https://www.gstatic.com/images/branding/gcpiconscolors/vertexai/v1/32px.svg" alt="Vertex AI logo"><br> Open in Vertex AI Workbench
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://github.com/GoogleCloudPlatform/generative-ai/blob/main/open-models/fine-tuning/get_started_with_oss_tuning_on_vertexai.ipynb">
      <img width="32px" src="https://www.svgrepo.com/download/217753/github.svg" alt="GitHub logo"><br> View on GitHub
    </a>
  </td>
</table>

<div style="clear: both;"></div>

<b>Share to:</b>

<a href="https://www.linkedin.com/sharing/share-offsite/?url=https%3A//github.com/GoogleCloudPlatform/generative-ai/blob/main/open-models/fine-tuning/get_started_with_oss_tuning_on_vertexai.ipynb" target="_blank">
  <img width="20px" src="https://upload.wikimedia.org/wikipedia/commons/8/81/LinkedIn_icon.svg" alt="LinkedIn logo">
</a>

<a href="https://bsky.app/intent/compose?text=https%3A//github.com/GoogleCloudPlatform/generative-ai/blob/main/open-models/fine-tuning/get_started_with_oss_tuning_on_vertexai.ipynb" target="_blank">
  <img width="20px" src="https://upload.wikimedia.org/wikipedia/commons/7/7a/Bluesky_Logo.svg" alt="Bluesky logo">
</a>

<a href="https://twitter.com/intent/tweet?url=https%3A//github.com/GoogleCloudPlatform/generative-ai/blob/main/open-models/fine-tuning/get_started_with_oss_tuning_on_vertexai.ipynb" target="_blank">
  <img width="20px" src="https://upload.wikimedia.org/wikipedia/commons/5/5a/X_icon_2.svg" alt="X logo">
</a>

<a href="https://reddit.com/submit?url=https%3A//github.com/GoogleCloudPlatform/generative-ai/blob/main/open-models/fine-tuning/get_started_with_oss_tuning_on_vertexai.ipynb" target="_blank">
  <img width="20px" src="https://redditinc.com/hubfs/Reddit%20Inc/Brand/Reddit_Logo.png" alt="Reddit logo">
</a>

<a href="https://www.facebook.com/sharer/sharer.php?u=https%3A//github.com/GoogleCloudPlatform/generative-ai/blob/main/open-models/fine-tuning/get_started_with_oss_tuning_on_vertexai.ipynb" target="_blank">
  <img width="20px" src="https://upload.wikimedia.org/wikipedia/commons/5/51/Facebook_f_logo_%282019%29.svg" alt="Facebook logo">
</a>

| Author(s) |
| --- |
| [Ivan Nardini](https://github.com/inardini) |

## Overview

This notebook demonstrates how to reproduce the core ideas of the **MetaMath** paper by fine-tuning a Llama model on the `MetaMathQA` dataset using Vertex AI's managed service for open-source models.

### Objective

The goal is to leverage the `MetaMathQA` dataset to enhance the mathematical reasoning capabilities of a base Llama model. Specifically, this notebook focuses on reproducing the results for the **7B model** variant discussed in the paper, using the comparable **Llama 3.1 8B model** available on Vertex AI.

We will cover the following steps:

1.  **Prepare the Dataset**: Download the `MetaMathQA` dataset and convert it to the required JSON Lines (JSONL) format for Vertex AI.
2.  **Fine-Tune the Model**: Configure and launch a managed fine-tuning job on Vertex AI using a Llama 3.1 8B model.
3.  **Deploy & Evaluate**: Deploy the newly tuned model to a Vertex AI Endpoint and test its mathematical reasoning.
4.  **Compare (Optional)**: Compare our model's output with the official pre-trained MetaMath model from Hugging Face.
5.  **Run Official Evaluation (Advanced)**: Download our tuned model and run the official evaluation scripts from the MetaMath repository.


### Citation

```
@article{yu2023metamath,
  title={MetaMath: Bootstrap Your Own Mathematical Questions for Large Language Models},
  author={Yu, Longhui and Jiang, Weisen and Shi, Han and Yu, Jincheng and Liu, Zhengying and Zhang, Yu and Kwok, James T and Li, Zhenguo and Weller, Adrian and Liu, Weiyang},
  journal={arXiv preprint arXiv:2309.12284},
  year={2023}
}
```

## Get started

This initial section handles all the necessary setup, including authenticating your account, installing libraries, and configuring your Google Cloud environment.

### (Optional) Choose your runtime.

This tutorial includes optional sections that compare your tuned model with the official MetaMath model by running sample prompts and the evaluation benchmarks.

**Only** those optional sections need more resources than the free Colab runtime provides. To run them, consider creating a Workbench instance on Vertex AI using `a2-highgpu-2g (Accelerator Optimized: 2 NVIDIA Tesla A100 GPUs, 24 vCPUs, 170GB RAM)`.

Optionally, you can install `jupyterlab-nvdashboard` to visualize GPU usage metrics within your notebook environment.

### Install Vertex AI SDK and other required packages

Install the Python libraries needed for this tutorial.

  * `google-cloud-aiplatform`: The official SDK for interacting with Vertex AI services like model tuning and deployment.
  * `datasets`: A library from Hugging Face that makes it easy to download and manipulate datasets.
  * `transformers`: A Hugging Face library used for downloading and running the official pre-trained MetaMath model for comparison.


In [None]:
# Quote the requirement so the shell does not treat ">=" as an output redirection.
%pip install --upgrade --quiet --force-reinstall "google-cloud-aiplatform>=1.105.0"
%pip install --upgrade --quiet --force-reinstall datasets transformers torch sentencepiece accelerate bitsandbytes vllm hf_transfer fraction tqdm numpy fire openai scipy jsonlines pandas pydantic crcmod

### Authenticate your notebook environment (Colab only)

If you're running this notebook on Google Colab, run the cell below to authenticate your environment.

This gives the notebook permission to access your Google Cloud resources on your behalf.


In [None]:
import sys

# Only runs on Google Colab; on other runtimes this cell is a no-op.
if "google.colab" in sys.modules:
    from google.colab import auth

    auth.authenticate_user()

### Set Google Cloud project information

To get started using Vertex AI, you must have an existing Google Cloud project and [enable the Vertex AI API](https://console.cloud.google.com/flows/enableapi?apiid=aiplatform.googleapis.com).

Learn more about [setting up a project and a development environment](https://cloud.google.com/vertex-ai/docs/start/cloud-environment).

Then, you define essential configuration variables. All resources you create (like models and storage buckets) will be associated with your specified Google Cloud project and region.

  * `PROJECT_ID`: Your unique Google Cloud project identifier.
  * `LOCATION`: The geographic location where your resources will be created (e.g., `us-central1`).
  * `BUCKET_URI`: A unique Google Cloud Storage (GCS) bucket. This will be our central location for storing the training dataset and the resulting model artifacts.
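
Leaving the bucket placeholder unchanged makes the bucket creation step fail. As a minimal pre-flight sketch, the check below applies a simplified subset of the Cloud Storage naming rules. The helper name `looks_like_bucket_name` and the sample bucket name are hypothetical and not part of any SDK, and the regular expression is an assumption that does not cover every rule.

```python
import re

# Simplified subset of the Cloud Storage bucket naming rules (an assumption for
# illustration): 3-63 characters, lowercase letters, digits, dashes, underscores
# and dots, starting and ending with a letter or digit.
_BUCKET_NAME_RE = re.compile(r"^[a-z0-9][a-z0-9._-]{1,61}[a-z0-9]$")


def looks_like_bucket_name(name: str) -> bool:
    """Return True if `name` passes the simplified naming check above."""
    return bool(_BUCKET_NAME_RE.fullmatch(name))


# The untouched placeholder fails the check; a hypothetical real name passes.
for candidate in ["[your-bucket-name]", "metamath-tuning-demo"]:
    print(candidate, looks_like_bucket_name(candidate))
```

Running a check like this before `gsutil mb` surfaces a forgotten placeholder early, rather than after the dataset has been prepared.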


In [None]:
# Use the environment variable if the user doesn't provide Project ID.
import os

import vertexai

PROJECT_ID = "[your-project-id]"  # @param {type: "string", placeholder: "[your-project-id]", isTemplate: true}

if not PROJECT_ID or PROJECT_ID == "[your-project-id]":
    PROJECT_ID = str(os.environ.get("GOOGLE_CLOUD_PROJECT"))

LOCATION = "us-central1"

BUCKET_NAME = "[your-bucket-name]"  # @param {type: "string", placeholder: "[your-bucket-name]", isTemplate: true}
BUCKET_URI = f"gs://{BUCKET_NAME}"

# Create the GCS bucket if it doesn't exist
! gsutil mb -p {PROJECT_ID} -l {LOCATION} {BUCKET_URI}

# Initialize the Vertex AI SDK. This authenticates our session and sets the default project and location.
vertexai.init(project=PROJECT_ID, location=LOCATION, staging_bucket=BUCKET_URI)

### Import libraries

In [None]:
import gc
import json
import os
import time
import uuid

import torch
import vertexai
from datasets import load_dataset
from google.cloud import aiplatform
from google.cloud.aiplatform_v1beta1.types import JobState
from pydantic import BaseModel, Field
from transformers import pipeline
from vertexai.preview import model_garden
from vertexai.preview.tuning import sft
from vertexai.preview.tuning._tuning import SourceModel

### Helpers

In [None]:
def save_to_jsonl(dataset, output_path):
    """Save dataset in JSONL format required by Vertex AI."""
    with open(output_path, "w") as f:
        for example in dataset:
            json.dump(example, f)
            f.write("\n")
    print(f"Saved {len(dataset)} examples to {output_path}")

## Prepare the MetaMathQA Dataset

The core of the MetaMath paper is its unique dataset, `MetaMathQA`. Here, we'll download it, process it into the required format, and upload it to our cloud storage bucket so the Vertex AI tuning service can access it.


### Download and Format the Dataset

The Vertex AI tuning service expects the training data to be in a **JSON Lines (JSONL)** format, where each line is a separate JSON object. For chat models, each object should contain a `"messages"` field with a list of conversation turns. We'll download the dataset from Hugging Face and map its `query` and `response` columns to this required format.
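
To make the target shape concrete, here is a minimal sketch of one JSONL record and a structural check for it. The helper names `to_chat_record` and `is_valid_chat_line` are hypothetical, the sample question is invented rather than taken from MetaMathQA, and the check assumes only the `user` and `assistant` roles used in this notebook.

```python
import json


def to_chat_record(query: str, response: str) -> dict:
    """Wrap one question/answer pair in a chat-style "messages" record."""
    return {
        "messages": [
            {"role": "user", "content": query},
            {"role": "assistant", "content": response},
        ]
    }


def is_valid_chat_line(line: str) -> bool:
    """Check that one JSONL line parses and has the expected shape."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError:
        return False
    if not isinstance(record, dict):
        return False
    messages = record.get("messages")
    if not isinstance(messages, list) or not messages:
        return False
    # Assumption: only the two roles used in this notebook are accepted.
    return all(
        isinstance(m, dict)
        and m.get("role") in {"user", "assistant"}
        and isinstance(m.get("content"), str)
        for m in messages
    )


# Hypothetical sample, not taken from MetaMathQA.
line = json.dumps(to_chat_record("What is 2 + 3?", "2 + 3 = 5. The answer is: 5"))
print(line)
print(is_valid_chat_line(line))
```

Spot-checking a few lines of the generated file this way can catch a malformed record before the tuning job is submitted.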

In [None]:
# Load the MetaMathQA dataset from the Hugging Face Hub.
# No dataset configuration is selected here; this loads the full 'train' split.
dataset = load_dataset("meta-math/MetaMathQA")["train"]

In [None]:
# Use the .train_test_split() method to create an 80/20 split.
# 80% of the data will be for training, 20% for validation.
# This easily satisfies the <25% requirement for the validation set.
split_dataset = dataset.train_test_split(test_size=0.2, seed=42)

# The result is a dictionary containing the two new splits.
train_split = split_dataset["train"]
validation_split = split_dataset["test"]

# Limit validation dataset to less than 5000 rows for Vertex AI requirement
if len(validation_split) > 5000:
    validation_split = validation_split.shuffle(
        seed=42
    )  # Use a seed for reproducibility
    validation_split = validation_split.select(range(4999))
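
The effect of combining a fractional split with a row cap is easy to miss, so the sketch below works through the arithmetic. The helper `split_sizes` is hypothetical, the row count of 395,000 is an assumed approximate size for MetaMathQA, and the rounding is an approximation of what the library does.

```python
def split_sizes(n_total: int, test_fraction: float, cap: int) -> dict:
    """Approximate row counts after a fractional split and a validation cap."""
    n_test = round(n_total * test_fraction)
    n_train = n_total - n_test
    n_validation = min(n_test, cap)
    return {
        "train": n_train,
        "validation": n_validation,
        # Held-out rows that end up in neither file.
        "unused": n_test - n_validation,
    }


# 395_000 is an assumed, approximate size for MetaMathQA.
print(split_sizes(395_000, 0.2, 4_999))
```

Under these assumptions most of the held-out 20% is discarded. If you would rather train on those rows, `train_test_split` also accepts an integer `test_size`, which gives an absolute number of validation rows.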

In [None]:
# MetaMath's instruction template
METAMATH_TEMPLATE = """Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
{instruction}

### Response:"""

# Define a function to transform each example into the {"messages": [...]} chat structure.
def format_for_tuning(example):

    query = example["query"]
    response = example["response"]

    instruction = METAMATH_TEMPLATE.format(instruction=query)

    # Important: Add space before response for proper tokenization
    return {
        "messages": [
            {"role": "user", "content": instruction},
            {"role": "assistant", "content": f" {response}"},
        ]
    }


# Apply the formatting function to the entire dataset.
train_formatted_dataset = train_split.map(
    format_for_tuning, remove_columns=train_split.column_names, num_proc=os.cpu_count()
)
validation_formatted_dataset = validation_split.map(
    format_for_tuning,
    remove_columns=validation_split.column_names,
    num_proc=os.cpu_count(),
)

In [None]:
train_file_path = "metamath_gsm8k_train.jsonl"
validation_file_path = "metamath_gsm8k_validation.jsonl"

# Write the formatted training and validation data to local JSONL files.
save_to_jsonl(train_formatted_dataset, train_file_path)
save_to_jsonl(validation_formatted_dataset, validation_file_path)

### Upload Dataset to GCS

The Vertex AI tuning service runs on Google's infrastructure and cannot directly access files in this notebook's local environment. Therefore, we must upload both formatted JSONL files to our GCS bucket.

In [None]:
# Define the destination path in your GCS bucket.
train_file_uri = f"{BUCKET_URI}/datasets/metamath_gsm8k_train.jsonl"
validation_file_uri = f"{BUCKET_URI}/datasets/metamath_gsm8k_validation.jsonl"

# Use the gsutil command-line tool to copy the local file to GCS.
! gsutil -o GSUtil:parallel_composite_upload_threshold=150M -m cp {train_file_path} {train_file_uri}
! gsutil -o GSUtil:parallel_composite_upload_threshold=150M -m cp {validation_file_path} {validation_file_uri}

## Configure and Launch the Fine-Tuning Job

We'll define all the parameters according to the official MetaMath README for our fine-tuning job and then launch it.


### Define configuration

Set tuning parameters.

In [None]:
# This class groups all hyperparameters and provides documentation and default values.
class MetaMathTuningConfig(BaseModel):
    """Configuration settings for the MetaMath fine-tuning job."""

    base_model: str = Field(
        default="meta/llama3_1@llama-3.1-8b",
        description="The base model to fine-tune, corresponding to the 7B model in the paper.",
    )
    tuning_mode: str = Field(
        default="FULL",
        description="The tuning mode. We use 'FULL' to replicate the paper's method for the 7B model.",
    )
    epochs: int = Field(
        default=3,
        description="Number of training epochs, as specified in the MetaMath paper.",
    )
    learning_rate: float = Field(
        default=2e-5,
        description="The learning rate for the optimizer, matching the paper's value for full fine-tuning.",
    )


# Create an instance of the configuration class.
config = MetaMathTuningConfig()

# Dynamically create paths that depend on runtime variables.
output_uri = f"{BUCKET_URI}/tuning-output/{uuid.uuid4()}"
model_artifacts_gcs_uri = os.path.join(
    output_uri, "postprocess/node-0/checkpoints/final"
)

### Launch the full fine-tuning job

This function sends our configuration to the Vertex AI service, which will provision machines and run the training.


In [None]:
source_model = SourceModel(base_model=config.base_model)

sft_tuning_job = sft.preview_train(
    source_model=source_model,
    tuning_mode=config.tuning_mode,
    epochs=config.epochs,
    learning_rate=config.learning_rate,
    train_dataset=train_file_uri,
    validation_dataset=validation_file_uri,
    output_uri=output_uri,
)

### Monitor the Job

The tuning job runs remotely on Google Cloud. The following code provides a convenient way to check the job's status from within the notebook without having to manually refresh the console. It will print an update every 10 minutes.
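
The polling pattern can also be sketched independently of the SDK, with a timeout so the loop cannot run forever. In this sketch the helper `wait_for_terminal_state` is hypothetical, the state strings are placeholders standing in for the `JobState` enum values, and the job is simulated so no real waiting happens.

```python
import time
from typing import Callable

# Placeholder state names; the notebook itself compares against JobState values.
TERMINAL_STATES = {"SUCCEEDED", "FAILED", "CANCELLED"}


def wait_for_terminal_state(
    get_state: Callable[[], str],
    poll_seconds: float = 600.0,
    timeout_seconds: float = 24 * 3600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll `get_state` until it returns a terminal state or the timeout elapses."""
    waited = 0.0
    state = get_state()
    while state not in TERMINAL_STATES:
        if waited >= timeout_seconds:
            raise TimeoutError(f"Still in state {state} after {waited:.0f}s")
        sleep(poll_seconds)
        waited += poll_seconds
        state = get_state()
    return state


# Simulated job that succeeds on the third check; the sleep is replaced by a no-op.
fake_states = iter(["PENDING", "RUNNING", "SUCCEEDED"])
final_state = wait_for_terminal_state(lambda: next(fake_states), sleep=lambda s: None)
print(final_state)
```

Passing `sleep` as a parameter keeps the helper testable; with a real job, `get_state` would refresh the job object and return its current state name.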

In [None]:
print(
    "Monitoring job... This will take several hours. You can safely close this notebook and come back later."
)

while sft_tuning_job.state not in [
    JobState.JOB_STATE_CANCELLED,
    JobState.JOB_STATE_FAILED,
    JobState.JOB_STATE_SUCCEEDED,
]:
    time.sleep(600)  # Check status every 10 minutes
    sft_tuning_job.refresh()
    print(f"Current job state: {sft_tuning_job.state.name}")

print(f"Job finished with state: {sft_tuning_job.state.name}")

## Deploy the Tuned Model

Once the tuning job is complete, the new model "lives" as a set of files in your GCS bucket. To use it for inference, we must **deploy** it. This process loads the model onto a server with a GPU and creates a unique API endpoint that we can send prediction requests to.

In [None]:
# Define the hardware for our deployment. An L4 GPU is a cost-effective choice for a model of this size.
machine_type = "g2-standard-12"
accelerator_type = "NVIDIA_L4"
accelerator_count = 1

# Create a CustomModel object that points to our tuned model artifacts in GCS.
tuned_model = model_garden.CustomModel(gcs_uri=model_artifacts_gcs_uri)

# Deploy the model. This step provisions the hardware and can take 15-30 minutes.
endpoint = tuned_model.deploy(
    machine_type=machine_type,
    accelerator_type=accelerator_type,
    accelerator_count=accelerator_count,
)

## Evaluate and Compare

Now, the exciting part! Let's test our newly tuned model and compare its performance to the official version.

### Evaluate Tuned Model

We'll send a sample math problem to our deployed endpoint. The prompt format is critical; we use the exact template specified in the official MetaMath repository to ensure the model responds as expected.
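
As a rough sketch of how the prompt is assembled and how a final answer could be read back, the snippet below builds the prompt and parses a completion. It assumes that a MetaMath-style response closes with the phrase `The answer is: `; the helper `extract_final_answer` is hypothetical, and the sample completion is hand-written rather than real model output.

```python
from typing import Optional

# Same wording as the inference prompt used in this notebook.
PROMPT_TEMPLATE = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response: Let's think step by step."
)

# Assumed closing phrase of a MetaMath-style response.
ANSWER_MARKER = "The answer is: "


def extract_final_answer(completion: str) -> Optional[str]:
    """Return the text after the last answer marker, or None if it is absent."""
    if ANSWER_MARKER not in completion:
        return None
    return completion.rsplit(ANSWER_MARKER, 1)[1].strip()


question = (
    "James buys 5 packs of beef that are 4 pounds each. "
    "The price of beef is $5.50 per pound. How much did he pay?"
)
prompt = PROMPT_TEMPLATE.format(instruction=question)

# Hand-written completion for illustration; not real model output.
sample_completion = (
    " He bought 5 * 4 = 20 pounds of beef, so he paid 20 * 5.50 = $110. "
    "The answer is: 110"
)
expected = 5 * 4 * 5.50  # packs * pounds per pack * price per pound
print(extract_final_answer(sample_completion), expected)
```

Comparing the extracted string with the directly computed value gives a quick sanity check on a single response before running a full benchmark.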

In [None]:
# The inference prompt for MetaMath models.
prompt_template = "Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instruction:\n{instruction}\n\n### Response: Let's think step by step."
instruction = "James buys 5 packs of beef that are 4 pounds each. The price of beef is $5.50 per pound. How much did he pay?"

# Define the prediction request payload.
# We set a low temperature for more factual, less "creative" output.
instances = [
    {
        "prompt": prompt_template.format(instruction=instruction),
        "max_tokens": 250,
        "temperature": 0.2,
        "top_p": 1.0,
        "top_k": 1,
        "raw_response": True,
    }
]

# Send the request to our endpoint.
response = endpoint.predict(instances=instances, use_dedicated_endpoint=True)

print("Response from tuned model")
for prediction in response.predictions:
    print(prediction)

### (Optional) Compare with the official MetaMath Model


To see how well our reproduction worked, we can compare its output to the official `MetaMath-7B-V1.0` model released by the authors on Hugging Face. This provides a valuable benchmark.

**Note**: This step runs a large model locally on the notebook's machine and may require significant RAM, GPU and time to download and generate predictions.


In [None]:
# Enable hf_transfer for parallel downloads
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"

# Load the official MetaMath 7B model from Hugging Face.
official_pipe = pipeline(
    "text-generation", model="meta-math/MetaMath-7B-V1.0", device_map="auto"
)

# Use the same prompt and instruction for a fair comparison.
official_response = official_pipe(
    prompt_template.format(instruction=instruction), max_new_tokens=250, do_sample=False
)

print("Response from the MetaMath Model")
print(official_response[0]["generated_text"])

### Clean runtime

To clear GPU memory we explicitly delete the pipeline, free up unused memory from PyTorch cache and any memory occupied by objects that are no longer referenced.

In [None]:
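del official_pipe
gc.collect()  # Collect unreferenced Python objects first so their GPU tensors are released.
torch.cuda.empty_cache()  # Then release the cached, now-unused GPU memory.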

### (Optional) Run the Official Evaluation Scripts

The single-prompt tests above are good for a qualitative check. To get the official `pass@1` benchmark scores reported in the paper, you must run the evaluation scripts from the MetaMath GitHub repository against the full test dataset. This is how academic results are formally measured.
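
With a single greedy sample per question, `pass@1` reduces to the fraction of questions whose extracted answer matches the reference. The sketch below illustrates that calculation only; it is not the logic of the official scripts, the helpers `to_number` and `accuracy` are hypothetical, and the predictions and references are invented.

```python
from typing import Optional, Sequence


def to_number(text: Optional[str]) -> Optional[float]:
    """Parse a numeric answer such as "1,250" or "$18"; return None on failure."""
    if text is None:
        return None
    cleaned = text.replace(",", "").replace("$", "").strip().rstrip(".")
    try:
        return float(cleaned)
    except ValueError:
        return None


def accuracy(
    predictions: Sequence[Optional[str]],
    references: Sequence[str],
    tol: float = 1e-4,
) -> float:
    """Fraction of predictions that match their reference within `tol`."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    correct = 0
    for pred, ref in zip(predictions, references):
        p, r = to_number(pred), to_number(ref)
        # A missing or unparsable answer counts as incorrect.
        if p is not None and r is not None and abs(p - r) <= tol:
            correct += 1
    return correct / len(references)


# Hypothetical predictions and references, for illustration only.
preds = ["110", "18.", None, "7"]
refs = ["110", "18", "42", "8"]
print(f"{accuracy(preds, refs):.2f}")
```

Treating unparsable answers as wrong keeps the score conservative, which matters when a model stops generating before it states a final answer.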

#### Clone the MetaMath Repository
This command downloads the evaluation scripts (`eval_gsm8k.py`) and the test data files.

In [None]:
!git clone https://github.com/meta-math/MetaMath.git

#### Download Your Tuned Model from GCS
The evaluation script runs locally, so it needs the model files on the notebook's machine. We'll copy them from our GCS output directory.


In [None]:
# Create a local directory to store the model.
LOCAL_MODEL_PATH = "./my_tuned_metamath_model"
!mkdir -p {LOCAL_MODEL_PATH}

# Copy the model files from GCS to the local path. This can take several minutes.
!gsutil -m cp -r {model_artifacts_gcs_uri}/* {LOCAL_MODEL_PATH}/

#### Run the GSM8K Evaluation
Finally, execute the official evaluation script, pointing it to your locally downloaded model. This will run the model on every question in the GSM8K test set.

In [None]:
!python MetaMath/eval_gsm8k.py \
    --model {LOCAL_MODEL_PATH} \
    --data_file ./MetaMath/data/test/GSM8K_test.jsonl \
    --tensor_parallel_size 2 \
    --batch_size 32

After it finishes, the script prints the final `pass@1` accuracy score. You can compare this number directly with the results table in the MetaMath paper to see how well your model performed. If you are interested, you can follow the same process to measure accuracy on the MATH benchmark.

## Cleaning up

To avoid incurring ongoing charges for the deployed model and stored data, you must undeploy the endpoint and delete your GCS artifacts.

In [None]:
delete_experiments = True
delete_endpoint = True
delete_bucket = True

# Delete the first experiment listed for the project, if any exists.
if delete_experiments:
    experiments = aiplatform.Experiment.list()
    if experiments:
        experiments[0].delete()

# Delete the endpoint created in the deployment step, rather than an arbitrary
# endpoint from the project. `force=True` undeploys its models first.
if delete_endpoint:
    endpoint.delete(force=True)

# To fully clean up, you should also delete the model artifacts and dataset from your GCS bucket.
# You can do this via the command line or the Google Cloud Console.
if delete_bucket:
    !gsutil -m rm -r {BUCKET_URI}