This repository provides documentation and scripts for running a local Large Language Model (LLM) using vLLM, with an OpenAI-compatible API for seamless integration into local tools and workflows.
The primary goal is to enable local reasoning and Model Context Protocol (MCP) capabilities for workflows and machines that do not have access to the internet. By hosting the model locally, users can perform inference, reasoning, and task automation securely within closed environments.
A successful test deployment was performed on the Pinback server. Pinback is a uniquely compatible server for our needs at the APS. Pinback has access to the internet, but secure offline models can access to ports on Pinback, granting them access to our models and their capabilities.
-
Create a read token on Hugging Face in Access Tokens → Create Token
-
Log in to your Hugging Face account in the terminal:
huggingface-cli login
-
Download an OpenAI-compatible model locally. Specify the model, directory, and set use-symlinks to false. For example:
huggingface-cli download mistralai/Mistral-7B-Instruct-v0.3 \ --local-dir /home/beams/AKIRSCH/rareevent/vllm/Mistral-7B-Instruct-v0.3 \ --local-dir-use-symlinks False
-
Create a Bash script and specify the endpoint, model, port, host, and datatype. For example:
#!/bin/bash # Optional: set GPU export CUDA_VISIBLE_DEVICES=1 python -m vllm.entrypoints.openai.api_server \ --model /home/beams/AKIRSCH/rareevent/vllm/Mistral-7B-Instruct-v0.3 \ --port 8000 \ --host 0.0.0.0 \ --dtype auto
This exposes an OpenAI-compatible API at:
http://localhost:8000/v1/chat/completions
-
In Cline, go to Settings → API Configuration.
-
Set API Provider to
OpenAI Compatible. -
Set Base URL to:
http://localhost:8000/v1 -
Leave Model ID and API Key blank — VLLM will use the single loaded model by default.
- Run the VLLM server script and begin chatting in Cline using your local model.
If you receive errors like: ValueError: This model's maximum context length is 8192 tokens. However, you requested 13553 tokens in the messages...
Try the following:
- Use a model with a larger context window
- Restart Cline to clear chat history.
- Adjust
context windowormax_tokensin advanced settings.