Skip to content

AISDC/vllm

Repository files navigation

Locally hosted LLM with vLLM

Table of Contents


Overview

This repository provides documentation and scripts for running a local Large Language Model (LLM) using vLLM, with an OpenAI-compatible API for seamless integration into local tools and workflows.

The primary goal is to enable local reasoning and Model Context Protocol (MCP) capabilities for workflows and machines that do not have access to the internet. By hosting the model locally, users can perform inference, reasoning, and task automation securely within closed environments.

A successful test deployment was performed on the Pinback server. Pinback is a uniquely compatible server for our needs at the APS. Pinback has access to the internet, but secure offline models can access to ports on Pinback, granting them access to our models and their capabilities.


Installation

Download Model

  1. Create a read token on Hugging Face in Access TokensCreate Token

  2. Log in to your Hugging Face account in the terminal:

    huggingface-cli login
  3. Download an OpenAI-compatible model locally. Specify the model, directory, and set use-symlinks to false. For example:

    huggingface-cli download mistralai/Mistral-7B-Instruct-v0.3 \
      --local-dir /home/beams/AKIRSCH/rareevent/vllm/Mistral-7B-Instruct-v0.3 \
      --local-dir-use-symlinks False

Usage

Launch VLLM Server

  1. Create a Bash script and specify the endpoint, model, port, host, and datatype. For example:

    #!/bin/bash
    
    # Optional: set GPU
    export CUDA_VISIBLE_DEVICES=1
    
    python -m vllm.entrypoints.openai.api_server \
      --model /home/beams/AKIRSCH/rareevent/vllm/Mistral-7B-Instruct-v0.3 \
      --port 8000 \
      --host 0.0.0.0 \
      --dtype auto

This exposes an OpenAI-compatible API at:
http://localhost:8000/v1/chat/completions


Connect Cline to Local Endpoint

  1. In Cline, go to SettingsAPI Configuration.

  2. Set API Provider to OpenAI Compatible.

  3. Set Base URL to:

    http://localhost:8000/v1
    
  4. Leave Model ID and API Key blank — VLLM will use the single loaded model by default.


Run the Model

  1. Run the VLLM server script and begin chatting in Cline using your local model.

Problems

Too Many Tokens

If you receive errors like: ValueError: This model's maximum context length is 8192 tokens. However, you requested 13553 tokens in the messages...

Try the following:

  • Use a model with a larger context window
  • Restart Cline to clear chat history.
  • Adjust context window or max_tokens in advanced settings.

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages