Locally hosted LLM with vLLM

Overview

This repository provides documentation and scripts for running a local Large Language Model (LLM) using vLLM, with an OpenAI-compatible API for seamless integration into local tools and workflows.

The primary goal is to enable local reasoning and Model Context Protocol (MCP) capabilities for workflows and machines that do not have access to the internet. By hosting the model locally, users can perform inference, reasoning, and task automation securely within closed environments.

A successful test deployment was performed on the Pinback server. Pinback is a uniquely compatible server for our needs at the APS. Pinback has access to the internet, but secure offline models can access to ports on Pinback, granting them access to our models and their capabilities.

Installation

Download Model

Create a read token on Hugging Face in Access Tokens → Create Token
Log in to your Hugging Face account in the terminal:
```
huggingface-cli login
```

Download an OpenAI-compatible model locally. Specify the model, directory, and set use-symlinks to false. For example:

huggingface-cli download mistralai/Mistral-7B-Instruct-v0.3 \
  --local-dir /home/beams/AKIRSCH/rareevent/vllm/Mistral-7B-Instruct-v0.3 \
  --local-dir-use-symlinks False

Usage

Launch VLLM Server

Create a Bash script and specify the endpoint, model, port, host, and datatype. For example:

#!/bin/bash

# Optional: set GPU
export CUDA_VISIBLE_DEVICES=1

python -m vllm.entrypoints.openai.api_server \
  --model /home/beams/AKIRSCH/rareevent/vllm/Mistral-7B-Instruct-v0.3 \
  --port 8000 \
  --host 0.0.0.0 \
  --dtype auto

This exposes an OpenAI-compatible API at:
http://localhost:8000/v1/chat/completions

Connect Cline to Local Endpoint

In Cline, go to Settings → API Configuration.
Set API Provider to OpenAI Compatible.
Set Base URL to:
```
http://localhost:8000/v1
```
Leave Model ID and API Key blank — VLLM will use the single loaded model by default.

Run the Model

Run the VLLM server script and begin chatting in Cline using your local model.

Problems

Too Many Tokens

If you receive errors like: ValueError: This model's maximum context length is 8192 tokens. However, you requested 13553 tokens in the messages...

Try the following:

Use a model with a larger context window
Restart Cline to clear chat history.
Adjust context window or max_tokens in advanced settings.

Name		Name	Last commit message	Last commit date
Latest commit History 14 Commits
prometheus-2.52.0.linux-amd64		prometheus-2.52.0.linux-amd64
.gitignore		.gitignore
README.md		README.md
action_agent.py		action_agent.py
chat-template.jinja		chat-template.jinja
local_inference.py		local_inference.py
local_inference.sh		local_inference.sh
vllm_doc.md		vllm_doc.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Locally hosted LLM with vLLM

Table of Contents

Overview

Installation

Download Model

Usage

Launch VLLM Server

Connect Cline to Local Endpoint

Run the Model

Problems

Too Many Tokens

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Locally hosted LLM with vLLM

Table of Contents

Overview

Installation

Download Model

Usage

Launch VLLM Server

Connect Cline to Local Endpoint

Run the Model

Problems

Too Many Tokens

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages