<a href="https://colab.research.google.com/github/Amirosimani/deepseek_vertexai/blob/main/deepseek_on_vertex.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# Copyright 2024 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.


|||
|----------|-------------|
| Author(s)   | amirimani@ |
| Last updated | 10/02/2025 |
<br><br>


This notebook shows how to deploy DeepSeek R1 Distill Qwen 7B from the Hugging Face Hub on Vertex AI using Vertex AI Model Garden. It also shows how to prototype and deploy a ReAct agent with a Google Search tool using LangChain.

### Install Vertex AI SDK and other required packages


In [None]:
# !pip install --quiet google-cloud-aiplatform
# !pip install --quiet langchain langchain_community
# !pip install --quiet langchain_google_genai langchain_google_community
# !pip install --quiet tiktoken

In [None]:
# Restart runtime
import IPython

app = IPython.Application.instance()
app.kernel.do_shutdown(True)

{'status': 'ok', 'restart': True}

### Authenticate your notebook environment (Colab only)

In [None]:
import sys

if "google.colab" in sys.modules:
    from google.colab import auth

    auth.authenticate_user()

In [None]:
import os
import vertexai
from huggingface_hub import get_token

from google.cloud import aiplatform
from google.colab import userdata

In [None]:
PROJECT_ID = "your-project-id"  # Replace with your Google Cloud project ID
LOCATION = os.environ.get("GOOGLE_CLOUD_REGION", "us-central1")

BUCKET_NAME = "deepseek-amir"
BUCKET_URI = f"gs://{BUCKET_NAME}"

vertexai.init(project=PROJECT_ID, location=LOCATION, staging_bucket=BUCKET_URI)
! gsutil mb -p $PROJECT_ID -l $LOCATION $BUCKET_URI


Set the model ID from the Hugging Face Hub. This notebook uses DeepSeek-R1-Distill-Qwen-7B, a dense model distilled from DeepSeek-R1 that performs well on math tasks.

In [None]:
MODEL_ID = "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B"

#### Register and Deploy DeepSeek model on Vertex AI

Note that you only need to register and deploy the model on Vertex AI once.

In [None]:
deepseek_model = aiplatform.Model.upload(
    display_name=MODEL_ID.replace("/", "--").lower(),
    serving_container_image_uri="us-docker.pkg.dev/deeplearning-platform-release/vertex-model-garden/vllm-inference.cu121.0-6.ubuntu2204.py310",
    serving_container_args=[
        "python",
        "-m",
        "vllm.entrypoints.api_server",
        "--host=0.0.0.0",
        "--port=8080",
        f"--model={MODEL_ID}",
        "--tensor-parallel-size=1",
        "--max-model-len=16384",
        "--enforce-eager",
    ],
    serving_container_ports=[8080],
    serving_container_predict_route="/generate",
    serving_container_health_route="/ping",
    serving_container_environment_variables={
        "HF_TOKEN": get_token(),
        "DEPLOY_SOURCE": "notebook",
    },
)
deepseek_model.wait()


After the model is registered on Vertex AI, you can deploy the model to an endpoint. This can take around 20 minutes.



In [None]:
deepseek_endpoint = aiplatform.Endpoint.create(
    display_name=MODEL_ID.replace("/", "--").lower() + "-endpoint"
)

deployed_deepseek_model = deepseek_model.deploy(
    endpoint=deepseek_endpoint,
    machine_type="g2-standard-12",
    accelerator_type="NVIDIA_L4",
    accelerator_count=1,
    sync=False,
)

### Generate predictions using DeepSeek

Either get the endpoint name from the console or call `aiplatform.Endpoint.list()` to list the available endpoints.

In [None]:
endpoint = aiplatform.Endpoint(endpoint_name="YOUR ENDPOINT NAME. something like 5128549095566770688")

In [None]:
prediction_request = {
    "instances": [
        {
            "@requestFormat": "textGeneration",
            "prompt":"Is Hawaiian cuisine vegan friendly?",
            "max_tokens": 2048,
            "temperature": 0.7,
        }
    ]
}

In [None]:
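output = endpoint.predict(instances=prediction_request["instances"])
# With the "textGeneration" request format, each entry in `predictions` is
# treated as a plain string, which is also how `CustomLLM._call` below uses it.
for prediction in output.predictions:
    print("------- DeepSeek prediction -------")
    print(prediction)
    print("---------------------------------\n")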
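DeepSeek-R1 distilled models usually write their reasoning between `<think>` and `</think>` tags before the final answer. The following is a minimal sketch, not part of the original notebook, of separating the two parts of a prediction. The helper name `split_reasoning` is hypothetical, the sample string is made up for illustration, and the sketch assumes each prediction is a plain string.

```python
from typing import Tuple

THINK_OPEN = "<think>"
THINK_CLOSE = "</think>"


def split_reasoning(text: str) -> Tuple[str, str]:
    """Return (reasoning, answer) from a raw completion string.

    If no closing tag is present, the whole text is treated as the answer.
    """
    if THINK_CLOSE not in text:
        return "", text.strip()
    reasoning, answer = text.split(THINK_CLOSE, 1)
    # The opening tag can be missing when it was already part of the prompt.
    reasoning = reasoning.replace(THINK_OPEN, "", 1)
    return reasoning.strip(), answer.strip()


# Made-up completion, only to show the shape of the result.
raw = "<think>step one, then step two</think>Final text shown to the user."
reasoning, answer = split_reasoning(raw)
print("reasoning:", reasoning)
print("answer:", answer)
```

Keeping the reasoning separate makes it easier to log it for debugging while passing only the answer to downstream code.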

# ReAct Agent with tool calling using DeepSeek

In [None]:
import sys

if "google.colab" in sys.modules:
    from google.colab import auth

    auth.authenticate_user()

In [None]:
import os
import time
from pydantic import Field
from typing import List, Dict, Any, Optional, Union

from google.cloud import aiplatform
from google.colab import userdata  # needed for the userdata.get() calls below

from langchain import hub
from langchain.llms.base import LLM
from langchain.chat_models.base import BaseChatModel
from langchain.schema import AgentAction, AgentFinish, AIMessage, LLMResult
from langchain.agents import AgentExecutor, Tool, create_react_agent
from langchain_community.utilities import GoogleSearchAPIWrapper
from langchain.callbacks.base import BaseCallbackHandler

In [None]:
# --- Configuration ---
MAX_ITERATIONS = 5
MAX_EXECUTION_TIME = 45

GENERATION_CONFIG = {
    "temperature": 0.8,
    "max_tokens": 2048,
}


GOOGLE_API_KEY = userdata.get('GOOGLE-API')
CSE_ID = userdata.get("CSE-ID")

In [None]:
# Initialize the Vertex AI client
PROJECT_ID = "amir-genai-bb"
LOCATION = os.environ.get("GOOGLE_CLOUD_REGION", "us-central1")
aiplatform.init(project=PROJECT_ID, location=LOCATION)
endpoint = aiplatform.Endpoint(endpoint_name="YOUR ENDPOINT NAME. something like 5128549095566770688")

This code implements a ReAct agent executor that connects a custom large language model (LLM) to an external tool, Google Search. The `CustomLLM` class acts as a bridge, adapting the Vertex AI endpoint to the LangChain LLM interface. The `ReActAgentExecutor` class orchestrates the agent: it sets up the LLM, the search tool, and the ReAct agent itself. Given an input query, the agent alternates between calling the search tool and the LLM to produce a response. A `TokenCountingCallbackHandler` is also attached to the executor as a place to track token usage for cost analysis and performance monitoring; as written, it only stores and resets a counter and does not yet increment it.
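To make the agent loop more concrete: a ReAct prompt asks the model to answer in a text format with `Thought:`, `Action:`, and `Action Input:` lines, the executor then runs the named tool and appends an `Observation:`, and the loop ends when the model writes `Final Answer:`. The sketch below is a simplified stand-in for that parsing step, not LangChain's actual implementation. The helper names `truncate_at_stop` and `parse_react_step` are hypothetical. Note that `CustomLLM._call` in this notebook accepts a `stop` argument but does not apply it, so truncation like this is one possible way to keep the model from inventing its own observations; whether the endpoint enforces stop sequences itself is not shown in the notebook.

```python
import re
from typing import Dict, List, Optional


def truncate_at_stop(text: str, stop: Optional[List[str]]) -> str:
    """Cut the text at the earliest stop sequence, if any is present."""
    if not stop:
        return text
    positions = [text.find(s) for s in stop if s and s in text]
    return text[: min(positions)] if positions else text


def parse_react_step(text: str) -> Dict[str, str]:
    """Classify one model turn as a final answer, a tool call, or unparseable."""
    if "Final Answer:" in text:
        return {"type": "finish", "output": text.split("Final Answer:", 1)[1].strip()}
    match = re.search(
        r"Action\s*:\s*(.*?)\s*\n\s*Action Input\s*:\s*(.*)", text, re.DOTALL
    )
    if match:
        return {
            "type": "action",
            "tool": match.group(1).strip(),
            "tool_input": match.group(2).strip(),
        }
    return {"type": "error", "output": text}


# Illustrative model turn that continues past the point where it should stop.
turn = (
    "Thought: I should look this up.\n"
    "Action: Google Search\n"
    "Action Input: Starbucks passion tea ingredients\n"
    "Observation: (text the model invented)"
)
step = parse_react_step(truncate_at_stop(turn, ["\nObservation"]))
print(step)
```

A turn that matches neither pattern is what an executor would treat as a parsing error; repeated parsing errors are one way an agent can run into its iteration limit.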

In [None]:
class CustomLLM(LLM):
    """A wrapper for the custom Vertex AI LLM that conforms to LangChain's LLM API."""

    model_name: str = Field(default="vertex-ai")  # Renamed `model` to `model_name` to avoid conflicts
    generation_config: Dict[str, Any] = Field(default_factory=dict)

    @property
    def _llm_type(self) -> str:
        return "custom-vertex-ai"

    def _call(self, prompt: str, stop: Optional[List[str]] = None) -> str:
        """Generate text using the Vertex AI LLM."""
        prediction_request = {
            "instances": [
                {
                    "@requestFormat": "textGeneration",
                    "prompt": prompt,
                    "max_tokens": self.generation_config.get("max_tokens", 2048),
                    "temperature": self.generation_config.get("temperature", 0.8),
                }
            ]
        }

        response = endpoint.predict(instances=prediction_request["instances"])
        return response.predictions[0] if response.predictions else ""

class ReActAgentExecutor:
    """
    A class to run the ReAct agent with specified configurations and tools.
    """
    def __init__(
        self,
        model: str,
        generation_config: Dict,
        max_iterations: int,
        max_execution_time: int,
        google_api_key: str=GOOGLE_API_KEY,
        cse_id: str=CSE_ID,
    ):
        self.model = model
        self.generation_config = generation_config
        self.max_iterations = max_iterations
        self.max_execution_time = max_execution_time
        self.google_api_key = google_api_key
        self.cse_id = cse_id
        self.llm = None
        self.tools = None
        self.agent = None
        self.agent_executor = None
        self.token_callback = None

        self._setup_llm()
        self._setup_tools()
        self._setup_agent()

    def _setup_llm(self):
        """Initializes the custom LLM."""
        self.llm = CustomLLM(model=self.model, generation_config=self.generation_config)

    def _setup_tools(self):
        """Sets up the tools for the agent."""
        search = GoogleSearchAPIWrapper(
            google_api_key=self.google_api_key, google_cse_id=self.cse_id
        )
        self.tools = [
            Tool(
                name="Google Search",
                func=search.run,
                description="Useful for finding information on current events, comparisons, or diverse perspectives.",
            ),
        ]

    def _setup_agent(self):
        """Sets up the ReAct agent and executor."""
        prompt = hub.pull("hwchase17/react")
        system_instruction = "Once you are done finding the answer, only return Yes or No"
        prompt.template = system_instruction + "\n" + prompt.template

        self.agent = create_react_agent(self.llm, self.tools, prompt)
        self.token_callback = TokenCountingCallbackHandler(self.model)
        self.agent_executor = AgentExecutor(
            agent=self.agent,
            tools=self.tools,
            verbose=False,
            handle_parsing_errors=True,
            max_iterations=self.max_iterations,
            max_execution_time=self.max_execution_time,
            callbacks=[self.token_callback],
        )

    def run(self, input_data: Union[Dict, str]) -> Dict:
        """
        Runs the agent with the given input data.
        """
        if isinstance(input_data, str):
            input_data = {"input": input_data}

        start_time = time.time()
        try:
            result = self.agent_executor.invoke(input_data)
            result["total_token"] = self.token_callback.total_token
            self.token_callback.reset()
        except Exception as e:
            print(f"An error occurred: {e}")
            result = {"error": str(e)}
        end_time = time.time()
        result["wall_time"] = end_time - start_time

        return result

class TokenCountingCallbackHandler(BaseCallbackHandler):
    """Callback handler for counting tokens used by the language model."""
    def __init__(self, model_name: str):
        self.model_name = model_name
        self.total_token = 0

    def reset(self):
        """Reset the counters for the next chain run."""
        self.total_token = 0
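As written, `TokenCountingCallbackHandler` only stores and resets `total_token`; nothing increments it, which is consistent with the `'total_token': 0` in the final result. The sketch below shows one way counting could be added, under stated assumptions: it relies on LangChain's `on_llm_start` and `on_llm_end` callback hooks, and it approximates tokens by whitespace splitting through a hypothetical `approx_token_count` helper, so the numbers will differ from what the model's own tokenizer would report. The class name `CountingHandler` is also hypothetical, and the fallback to `object` exists only so the sketch runs without LangChain installed.

```python
from types import SimpleNamespace
from typing import Any, Dict, List

try:
    from langchain.callbacks.base import BaseCallbackHandler
except ImportError:  # fallback so the sketch also runs without LangChain
    BaseCallbackHandler = object


def approx_token_count(text: str) -> int:
    """Very rough token estimate: number of whitespace-separated pieces."""
    return len(text.split())


class CountingHandler(BaseCallbackHandler):
    """Accumulates an approximate token count over prompts and completions."""

    def __init__(self) -> None:
        self.total_token = 0

    def on_llm_start(
        self, serialized: Dict[str, Any], prompts: List[str], **kwargs: Any
    ) -> None:
        # Called with the prompt strings before each LLM call.
        self.total_token += sum(approx_token_count(p) for p in prompts)

    def on_llm_end(self, response: Any, **kwargs: Any) -> None:
        # response.generations holds one list of generations per prompt.
        for generations in response.generations:
            for generation in generations:
                self.total_token += approx_token_count(generation.text)

    def reset(self) -> None:
        """Reset the counter for the next run."""
        self.total_token = 0


# Stand-in objects shaped like an LLM result, only to exercise the handler.
handler = CountingHandler()
handler.on_llm_start({}, ["Does tea contain ginger?"])
handler.on_llm_end(
    SimpleNamespace(generations=[[SimpleNamespace(text="Final Answer: No")]])
)
print(handler.total_token)
```

Counting both prompts and completions matters for a ReAct agent, because the prompt grows with every tool observation appended to the scratchpad.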


In [None]:
llm = CustomLLM(
    model_name="vertex-ai",  # the field is named `model_name`, not `model`
    generation_config=GENERATION_CONFIG
)

agent_executor = ReActAgentExecutor(
    model=llm.model_name,  # the constructor expects the model name as a string
    generation_config=GENERATION_CONFIG,
    max_iterations=8,
    max_execution_time=60,
    google_api_key=GOOGLE_API_KEY,
    cse_id=CSE_ID
)



In [None]:
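# `train_ds` is not defined in this notebook, so the question from the
# recorded output below is passed directly.
result = agent_executor.run("Does a Starbucks passion tea have ginger in it?")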

In [None]:
result

{'input': 'Does a Starbucks passion tea have ginger in it?',
 'output': 'Agent stopped due to iteration limit or time limit.',
 'total_token': 0,
 'wall_time': 147.72699403762817}