# Deploying DeepSeek: Step-by-Step Guide

This guide demonstrates how to deploy [deepseek-ai/DeepSeek-R1-Distill-Qwen-32B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-32B) with a Vast.ai template and integrate it with LangChain, so that the model's reasoning can be separated from its final answer.

## Setup

In [24]:
%%capture
!pip install --upgrade vastai
!pip install --upgrade langchain langchain-openai openai

In [4]:
%%bash
# Replace the placeholder with your own key; never publish a real API key
export VAST_API_KEY="<YOUR_VAST_API_KEY>"
vastai set api-key "$VAST_API_KEY"

Your api key has been saved in /root/.config/vastai/vast_api_key


### Choosing Hardware

To deploy the DeepSeek-R1-Distill-Qwen-32B model on Vast.ai, we need an offer that meets the following requirements:

1. Enough GPU memory (the search below asks for at least 80 GB) to hold:
   - The DeepSeek model weights (32B parameters)
   - The KV cache needed for very long output sequences

2. At least one direct port that we can forward for:
   - vLLM's OpenAI-compatible API server
   - External access to the model endpoint
   - Secure request routing

3. At least 120 GB of disk space to hold the model and anything else we might want to download
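
As a rough check on the GPU memory requirement, the sketch below estimates the size of the weights and of the KV cache. The parameter count (32.8e9), 16-bit weights, and the attention shape (64 layers, 8 KV heads, head dimension 128) are assumptions chosen to match the Qwen2.5-32B architecture; verify them against the model's `config.json` before relying on the numbers.

```python
# Rough GPU memory estimate. All model-shape values below are assumptions;
# check them against the model's config.json.
GIB = 1024 ** 3


def weight_bytes(num_params: float, bytes_per_param: int = 2) -> float:
    """Memory for the weights alone (2 bytes per parameter for bf16/fp16)."""
    return num_params * bytes_per_param


def kv_cache_bytes_per_token(num_layers: int, num_kv_heads: int,
                             head_dim: int, bytes_per_value: int = 2) -> int:
    """One key and one value vector per layer and per KV head for each token."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


weights = weight_bytes(32.8e9)
per_token = kv_cache_bytes_per_token(num_layers=64, num_kv_heads=8, head_dim=128)
context = 8192  # matches the --max-model-len value used later in this guide

print(f"weights: {weights / GIB:.1f} GiB")
print(f"KV cache per token: {per_token / 1024:.0f} KiB")
print(f"KV cache for {context} tokens: {per_token * context / GIB:.1f} GiB")
```

Under these assumptions the weights alone take roughly 61 GiB, and one full 8192-token sequence adds about 2 GiB of KV cache, which is consistent with searching for a single GPU with at least 80 GB of memory.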

In [25]:
%%bash
vastai search offers "compute_cap >= 750 \
gpu_ram >= 80 \
num_gpus = 1 \
static_ip = true \
direct_port_count >= 1 \
verified = true \
disk_space >= 120 \
rentable = true"

ID        CUDA   N  Model      PCIE  cpu_ghz  vCPUs    RAM  Disk  $/hr    DLP    DLP/$   score  NV Driver   Net_up  Net_down  R     Max_Days  mach_id  status    host_id  ports  country       
15952834  12.2  1x  H100_SXM   54.9  3.8      28.0   193.5  433   2.0681  345.6  167.12  321.9  535.216.03  6806.7  8125.4    99.7  30.0      31688    verified  68137    1249   France,_FR    
18284030  12.4  1x  H200       48.9  4.0      24.0   258.0  2386  3.2009  454.4  141.97  319.3  550.127.05  4346.1  7118.5    99.7  122.2     32676    verified  97732    4999   ,_US          
17618585  12.2  1x  H100_SXM   54.9  3.8      32.0   221.1  701   1.9344  345.4  178.55  311.0  535.230.02  7211.0  8410.4    99.2  8.8       33035    verified  68137    1428   France,_FR    
17201721  12.6  1x  H100_SXM   52.5  4.1      32.0   128.9  1649  1.8682  345.9  185.16  305.5  560.35.05   2008.3  11354.3   99.4  151.8     32847    verified  125728   249    Czechia,_CZ   
16461479  12.6  1x  H100_SXM   54.9  3.7
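
If you would rather pick an offer programmatically than by eye, the following sketch parses rows of the table above and selects the cheapest one. It is only an illustration: the rows are the first ten columns of the complete rows shown above, and the column positions (ID first, `$/hr` tenth) are assumed from that printed layout, which may differ between CLI versions.

```python
# First ten columns of the complete rows from the search output above.
SEARCH_OUTPUT = """\
15952834  12.2  1x  H100_SXM   54.9  3.8      28.0   193.5  433   2.0681
18284030  12.4  1x  H200       48.9  4.0      24.0   258.0  2386  3.2009
17618585  12.2  1x  H100_SXM   54.9  3.8      32.0   221.1  701   1.9344
17201721  12.6  1x  H100_SXM   52.5  4.1      32.0   128.9  1649  1.8682
"""


def parse_offers(table: str) -> list[dict]:
    """Turn whitespace-separated offer rows into dicts with id, gpu and price."""
    offers = []
    for line in table.splitlines():
        fields = line.split()
        if len(fields) < 10:
            continue  # skip blank or truncated rows
        offers.append({
            "id": int(fields[0]),      # offer ID (column 1)
            "gpu": fields[3],          # GPU model (column 4)
            "dph": float(fields[9]),   # price in $/hr (column 10)
        })
    return offers


cheapest = min(parse_offers(SEARCH_OUTPUT), key=lambda offer: offer["dph"])
print(cheapest)
```

Price is only one criterion; reliability, network speed, and maximum rental duration in the remaining columns may matter more for a long-running deployment.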

### Deploying the Server via Vast Template

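Choose a machine from the search results and paste its ID below as `INSTANCE_ID`. Note that this value is the offer ID; the running instance gets its own ID, which is returned as `new_contract` when the instance is created.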

We will deploy a template that:
1. Uses the `vllm/vllm-openai:latest` Docker image, which provides an OpenAI-compatible server.
2. Forwards port `8000`, the server's default port, to the outside of the container.
3. Passes `--model deepseek-ai/DeepSeek-R1-Distill-Qwen-32B --max-model-len 8192 --enforce-eager` to the default entrypoint (the server itself).
4. Uses `--tensor-parallel-size 1` by default.
5. Uses `--gpu-memory-utilization 0.90` by default.
6. Ensures that we have 120 GB of disk space.

These settings balance performance and stability for serving the DeepSeek model.

In [26]:
%%bash
export INSTANCE_ID='13690958'
vastai create instance $INSTANCE_ID --disk 120 --template_hash eda062b3e0c9c36f09d9d9a294405ded

Started. {'success': True, 'new_contract': 18385217}
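
The `new_contract` value in this output is the ID of the running instance, which is what later commands such as `vastai destroy instance` expect. A minimal sketch for extracting it from the captured text follows; the helper name `parse_new_contract` is hypothetical, and the output format is assumed to be exactly what is printed above.

```python
import ast
import re


def parse_new_contract(cli_output: str) -> int:
    """Extract the instance (contract) ID from `vastai create instance` output."""
    match = re.search(r"\{.*\}", cli_output)
    if match is None:
        raise ValueError(f"no result dict found in: {cli_output!r}")
    # The CLI prints a Python-style dict, so literal_eval is used instead of json
    result = ast.literal_eval(match.group(0))
    if not result.get("success"):
        raise RuntimeError(f"instance creation failed: {result}")
    return int(result["new_contract"])


instance_id = parse_new_contract("Started. {'success': True, 'new_contract': 18385217}")
print(instance_id)
```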


### Verify Setup

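Once the instance is running and the model has finished loading, set `VAST_IP_ADDRESS` and `VAST_PORT` to the instance's public IP address and the external port mapped to container port `8000` (both are shown for the instance in the Vast.ai console), then send a test request.

In [28]: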
%%bash
export VAST_IP_ADDRESS="80.188.223.202"
export VAST_PORT="11459"
curl -X POST http://$VAST_IP_ADDRESS:$VAST_PORT/v1/completions \
     -H "Content-Type: application/json" \
     -d '{
           "model": "deepseek-ai/DeepSeek-R1-Distill-Qwen-32B",
           "prompt": "Hello, how are you?",
           "max_tokens": 50
         }'


{"id":"cmpl-f6d723bc5b32444a8be49d8c20cb5b9c","object":"text_completion","created":1740837657,"model":"deepseek-ai/DeepSeek-R1-Distill-Qwen-32B","choices":[{"index":0,"text":" Can you help me with my assignment? I need to analyze this passage:\n\n\"Treason doth never prosper: what's the reason?\nWhy, if it prosper, none dare call it treason.\"\n\nCan you explain the meaning of this passage? What is","logprobs":null,"finish_reason":"length","stop_reason":null,"prompt_logprobs":null}],"usage":{"prompt_tokens":7,"total_tokens":57,"completion_tokens":50,"prompt_tokens_details":null}}
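
Downloading and loading a 32B model can take several minutes, so a request like the one above may fail at first. One possible approach is to poll the server's `/v1/models` endpoint until it responds. The sketch below uses only the standard library; the helper name `wait_until_ready` and the timeout values are illustrative assumptions.

```python
import http.client
import time
import urllib.request


def wait_until_ready(base_url: str, timeout_s: float = 900.0,
                     interval_s: float = 10.0) -> bool:
    """Poll GET {base_url}/models until it returns 200 or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(f"{base_url}/models", timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (OSError, http.client.HTTPException):
            pass  # server not reachable or not ready yet; retry after a pause
        time.sleep(interval_s)
    return False


# Example (replace the address and port with your instance's values):
# wait_until_ready("http://80.188.223.202:11459/v1")
```

The function returns `False` instead of raising when the timeout expires, so the caller can decide whether to keep waiting or to inspect the instance logs.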



## Usage

### Creating a DeepSeek Output Parser

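DeepSeek's response contains two parts:
1. Thinking: the model's reasoning, terminated by a closing `</think>` tag. The opening `<think>` tag may be absent from the returned text, so the parser splits on the closing tag.
2. Answer: everything after the closing tag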


In [29]:
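from typing import Optional, Tuple

# Import from langchain_core; the older langchain.schema and langchain.prompts
# paths are deprecated and may be missing in recent LangChain releases
from langchain_core.output_parsers import BaseOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_openai import ChatOpenAI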

class R1OutputParser(BaseOutputParser[Tuple[Optional[str], str]]):
    """Parser for DeepSeek R1 model output that includes thinking and response sections."""

    def parse(self, text: str) -> Tuple[Optional[str], str]:
        """Parse the model output into thinking and response sections.

        Args:
            text: Raw text output from the model

        Returns:
            Tuple containing (thinking_text, response_text)
            - thinking_text will be None if no thinking section is found
        """
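        if "</think>" in text:
            # Split on the first </think> tag only, so a literal "</think>"
            # later in the answer does not truncate the response
            thinking_part, response_part = text.split("</think>", 1)
            # Extract thinking text (remove the opening <think> tag, if present)
            thinking_text = thinking_part.replace("<think>", "").strip()
            # Get response text
            response_text = response_part.strip()
            return thinking_text, response_text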

        # If no thinking tags found, return None for thinking and full text as response
        return None, text.strip()

    @property
    def _type(self) -> str:
        """Return type key for serialization."""
        return "r1_output_parser"
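
To see the splitting behaviour without a running server, the standalone sketch below follows the same approach as `R1OutputParser.parse` in a plain function. The function name `split_reasoning` and the sample strings are made up for illustration.

```python
from typing import Optional, Tuple


def split_reasoning(text: str) -> Tuple[Optional[str], str]:
    """Split raw R1 output at the first </think> tag."""
    head, sep, tail = text.partition("</think>")
    if not sep:
        # No closing tag: treat the whole text as the answer
        return None, text.strip()
    return head.replace("<think>", "").strip(), tail.strip()


samples = [
    "<think>Add the numbers.</think>\n\n4",  # both tags present
    "Add the numbers.\n</think>\n\n4",       # opening tag missing
    "4",                                     # no reasoning section at all
]
for raw in samples:
    print(split_reasoning(raw))
```

The first two samples yield the same `(thinking, answer)` pair, and the third yields `None` for the thinking part.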

### Set Up the Model

In [30]:
VAST_IP_ADDRESS="80.188.223.202"
VAST_PORT="11459"

openai_api_key = "EMPTY"
openai_api_base = f"http://{VAST_IP_ADDRESS}:{VAST_PORT}/v1"

model = ChatOpenAI(
    base_url=openai_api_base,
    api_key=openai_api_key,
    model_name="deepseek-ai/DeepSeek-R1-Distill-Qwen-32B",
    max_tokens=8000,  # prompt + completion must fit within --max-model-len 8192
    temperature=0.7
)

# Create prompt template
prompt = ChatPromptTemplate.from_messages([
    ("user", "{input}")
])

# Create parser
parser = R1OutputParser()

# Create chain
chain = (
    {"input": RunnablePassthrough()}
    | prompt
    | model
    | parser
)
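
Because the server was started with `--max-model-len 8192` and the model above requests `max_tokens=8000`, only a small part of the context window is left for the prompt; vLLM typically rejects a request whose prompt and requested completion together exceed the limit. The sketch below shows the arithmetic. Both helper names are hypothetical, and the prompt token count is assumed to come from a tokenizer or from the `usage` field of an earlier response.

```python
def max_completion_tokens(max_model_len: int, prompt_tokens: int) -> int:
    """Largest max_tokens value that still fits in the context window."""
    if prompt_tokens >= max_model_len:
        raise ValueError("prompt alone fills or exceeds the context window")
    return max_model_len - prompt_tokens


def fits(max_model_len: int, prompt_tokens: int, max_tokens: int) -> bool:
    """True when prompt plus requested completion fit in the context window."""
    return prompt_tokens + max_tokens <= max_model_len


print(max_completion_tokens(8192, 7))
print(fits(8192, 192, 8000))
print(fits(8192, 193, 8000))
```

With these settings, a prompt of up to 192 tokens fits alongside `max_tokens=8000`; for longer prompts, lower `max_tokens` or raise `--max-model-len` if GPU memory allows.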

### Request

In [31]:
prompt_text = "Explain CPU to a 10-year-old who loves AI."

thinking, response = chain.invoke(prompt_text)
print("\nTHINKING:\n")
print(thinking)
print("\nRESPONSE:\n")
print(response)


THINKING:

Okay, so I need to explain what a CPU is to a 10-year-old who loves AI. Hmm, let's think about how to approach this. First, I should make sure I understand what a CPU is myself. From what I know, CPU stands for Central Processing Unit, and it's often called the "brain" of the computer. It's responsible for executing instructions and performing calculations. But how do I translate that into something a kid who's into AI would understand?

Maybe I should relate it to something they already know. Since they love AI, perhaps I can compare the CPU to something in AI. Wait, but AI is more about algorithms and data processing. Maybe I should think of the CPU as the part that makes AI work? So, if the CPU is the brain, then in the context of AI, it's the part that helps the AI think and make decisions.

Let me break it down. The CPU is like the worker inside the computer who does all the tasks. When you tell the AI to do something, like answer a question or play a game, the CPU is 

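The same endpoint can also be called directly with the OpenAI Python client. Without the parser, the returned text still contains the model's reasoning, followed by `</think>` and the final answer.

In [32]: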
from openai import OpenAI

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

chat_response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-R1-Distill-Qwen-32B",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Hello, how are you today"},
        ],
    }],
)
print("Chat completion output:", chat_response.choices[0].message.content)


Chat completion output: Alright, someone just said, "Hello, how are you today." I need to respond to that.

Since I'm an AI, I don't have feelings, but I should acknowledge that.

I'll start by saying I don't have feelings, then ask how they're doing.

Keeping it friendly and open for them to continue the conversation.
</think>

Hello! I'm just a computer program, so I don't have feelings, but thanks for asking! How are you today?


## Delete Machine

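Destroy the instance when you are done so that it stops accruing charges. Use the instance ID returned as `new_contract` when the instance was created (`18385217` above), not the offer ID passed to `vastai create instance`; passing the offer ID fails with `error 404: Instance 13690958 not found`.

In [33]:
%%bash
# Destroy the Vast.ai instance (use the instance ID, not the offer ID)
export INSTANCE_ID='18385217'
vastai destroy instance $INSTANCE_ID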
