# **🧪 Running LMCache with vLLM on Google Colab**

This Colab notebook demonstrates how to run LMCache with the vLLM inference engine. We use Meta's Llama-3.1-8B-Instruct model as an example.


<p align="center">
  <a href="https://lmcache.ai/" style="display:inline-block; margin:0 1em; text-decoration:none;">
    <img
      src="https://raw.githubusercontent.com/LMCache/LMCache/2e4c7b95a0784babd6d61313724a801614898e1e/docs/source/assets/lmcache-logo_crop.png"
      alt="LMCache logo"
      width="170"
      style="vertical-align:middle; border:none;"
    />
  </a>
</p>

<p align="center" style="margin-top:.5em;">
  <a href="https://join.slack.com/t/lmcacheworkspace/shared_invite/zt-2viziwhue-5Amprc9k5hcIdXT7XevTaQ" style="text-decoration:none;">
    <img
      src="https://upload.wikimedia.org/wikipedia/commons/b/b9/Slack_Technologies_Logo.svg"
      alt="Slack logo"
      width="125"
      style="vertical-align:middle; margin-right: 0.5em;"
    />
  </a>
</p>

<!-- GitHub line -->
<p align="center">
  <em><b>Join Slack if you need help + ⭐ Star us on <a href="https://github.com/LMCache/LMCache" style="text-decoration:none;">GitHub</a> ⭐</b></em>
</p>


## ⚙️ Configure Colab Runtime

To enable GPU acceleration on Google Colab:

1. Click the **Runtime** menu in the top toolbar.
2. Select **Change runtime type**.
3. In the **Hardware accelerator** dropdown, choose **GPU** (preferably **A100 GPU**, as LMCache currently does not support the T4 GPU).
4. Click **Save**.

> 📌 You can confirm GPU access by running the following cell:
>
> ```python
> !nvidia-smi
> ```
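
For a programmatic check, the sketch below queries the GPU name and memory through `nvidia-smi`'s CSV query mode and flags a T4 runtime, following the note above that LMCache does not currently support it. The helper names `parse_gpu_rows` and `list_gpus` are illustrative, and the T4 check is a plain substring match on the GPU name, not an official compatibility test.

```python
import subprocess


def parse_gpu_rows(csv_text):
    """Parse `name, memory.total` CSV rows into (name, memory) tuples."""
    rows = []
    for line in csv_text.strip().splitlines():
        name, _, memory = line.partition(",")
        rows.append((name.strip(), memory.strip()))
    return rows


def list_gpus():
    """Return GPU rows from nvidia-smi, or an empty list if it is unavailable."""
    try:
        result = subprocess.run(
            ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader"],
            capture_output=True, text=True, check=True,
        )
    except (FileNotFoundError, subprocess.CalledProcessError):
        return []
    return parse_gpu_rows(result.stdout)


for name, memory in list_gpus():
    # Substring match only; this is an assumption, not an official support check.
    note = " (T4: not supported by LMCache per the note above)" if "T4" in name else ""
    print(f"{name}: {memory}{note}")
```

If nothing is printed, the runtime most likely has no GPU attached.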

---

## 🔐 Set Up Hugging Face Credentials

Since this demo uses **Meta’s Llama-3.1-8B-Instruct model**, you’ll need to set up your Hugging Face account and request access:

1. **Sign up** for a free account at [https://huggingface.co/join](https://huggingface.co/join).
2. **Request access** to the Llama-3.1-8B-Instruct model here:  
   👉 [Llama-3.1-8B-Instruct Model Card](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct)  
3. Once approved, **create a Hugging Face access token**:  
   👉 [Token Settings Page](https://huggingface.co/settings/tokens)  
4. In Colab, open the **"Secrets"** tab in the left sidebar.
5. Click **“+ Add new secret”** and enter:
   * Name: `HF_TOKEN`
   * Value: (paste your Hugging Face token)

> 💡 Your token is used to authenticate with the Hugging Face Hub and download the model.


> Access your secret keys in Python via:
>
> ```python
> from google.colab import userdata
> hf_token = userdata.get("HF_TOKEN")
> ```
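
If the same code may also run outside Colab, a fallback to an environment variable can be useful. The sketch below is one way to do that; `resolve_hf_token` is a hypothetical helper name. It deliberately catches any exception from `userdata.get`, on the assumption that a missing secret or a secret without notebook access raises there.

```python
import os


def resolve_hf_token(name="HF_TOKEN"):
    """Return the token from Colab secrets if possible, else from the environment."""
    try:
        from google.colab import userdata  # only importable inside Colab
    except ImportError:
        return os.environ.get(name)
    try:
        return userdata.get(name)
    except Exception:
        # Secret missing or notebook access not granted: fall back to the env var.
        return os.environ.get(name)


token = resolve_hf_token()
print("HF token found" if token else "HF token missing: add it in the Secrets tab")
```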


## Install vLLM v1

📦 Install uv (a fast Python package manager)

In [None]:
!curl -LsSf https://astral.sh/uv/install.sh | sh

📦 Install the latest nightly version of vLLM

In [None]:
!uv pip install -U vllm --pre --extra-index-url https://wheels.vllm.ai/nightly

📦 Install LMCache from source

In [None]:
!git clone https://github.com/LMCache/LMCache.git
%cd LMCache
!uv pip install .
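
Before moving on, it can help to confirm that both packages landed in the environment the notebook kernel actually uses. The sketch below reads installed versions with the standard library. It assumes the distribution names are `vllm` and `lmcache`, and `installed_version` is an illustrative helper name.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Distribution names below are assumptions; adjust them if your install differs.
for package in ("vllm", "lmcache"):
    print(f"{package}: {installed_version(package) or 'NOT INSTALLED'}")
```

If either line reports `NOT INSTALLED`, the install step probably targeted a different Python environment than the kernel.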

## Run Inference **without** LMCache

In [None]:
import os
import time
from vllm import LLM, SamplingParams
from google.colab import userdata

# Set token chunk size
os.environ["LMCACHE_CHUNK_SIZE"] = "256"

# Enable CPU offloading backend
os.environ["LMCACHE_LOCAL_CPU"] = "True"

# Set CPU memory limit (in GB)
os.environ["LMCACHE_MAX_LOCAL_CPU_SIZE"] = "5.0"

# Set Hugging Face access token
hf_token = userdata.get("HF_TOKEN")
os.environ["HF_TOKEN"] = hf_token

# Input: a long document read from Google Drive (Drive must already be mounted at /content/drive)
with open("/content/drive/MyDrive/Colab Notebooks/long_document.txt", "r", encoding="utf-8") as f:
    shared_prompt = f.read()
prompts = [shared_prompt + "\n\n" + "When did the Roman Empire begin and when did it fall?"]

# Set sampling parameters
sampling_params = SamplingParams(temperature=0, top_p=0.95, max_tokens=10)

print("🔁 Running generation WITHOUT LMCache...")

# Initialize vLLM without LMCache integration
llm_no_lmcache = LLM(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    max_model_len=8000,
    dtype="float16",
    enable_prefix_caching=False,
    gpu_memory_utilization=0.8,
)

# First run (cold cache, no reuse)
time_start_1 = time.time()
outputs = llm_no_lmcache.generate(prompts, sampling_params)
for output in outputs:
    print("Generated text (no LMCache, 1st run):", repr(output.outputs[0].text))
time_end_1 = time.time()

# Second run (still no cache reuse)
time_start_2 = time.time()
outputs = llm_no_lmcache.generate(prompts, sampling_params)
for output in outputs:
    print("Generated text (no LMCache, 2nd run):", repr(output.outputs[0].text))
time_end_2 = time.time()

print(f"❌ No LMCache - 1st run duration: {time_end_1 - time_start_1:.2f} seconds")
print(f"❌ No LMCache - 2nd run duration: {time_end_2 - time_start_2:.2f} seconds")
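
The start/end bookkeeping above repeats for every run. A small helper can take the same measurement with less duplication; the sketch below uses `time.perf_counter`, which is better suited to measuring intervals than `time.time`. The name `timed` is illustrative. Unlike the cell above, it times only the call itself and not the printing of outputs.

```python
import time


def timed(fn, *args, **kwargs):
    """Call fn and return (result, elapsed_seconds) using a monotonic clock."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


# Stand-in workload; in the notebook this would be
# `timed(llm_no_lmcache.generate, prompts, sampling_params)`.
result, seconds = timed(sum, range(1_000_000))
print(f"result={result}, duration={seconds:.4f} seconds")
```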

## Run Inference **with** LMCache

In [1]:
import os
import time
from vllm import LLM, SamplingParams
from vllm.config import KVTransferConfig
from google.colab import userdata

# Set token chunk size
os.environ["LMCACHE_CHUNK_SIZE"] = "256"

# Enable CPU offloading backend
os.environ["LMCACHE_LOCAL_CPU"] = "True"

# Set CPU memory limit (in GB)
os.environ["LMCACHE_MAX_LOCAL_CPU_SIZE"] = "5.0"

# Set Hugging Face access token
hf_token = userdata.get("HF_TOKEN")
os.environ["HF_TOKEN"] = hf_token

# Input: a long document read from Google Drive (Drive must already be mounted at /content/drive)
with open("/content/drive/MyDrive/Colab Notebooks/long_document.txt", "r", encoding="utf-8") as f:
    shared_prompt = f.read()
prompts = [shared_prompt + "\n\n" + "When did the Roman Empire begin and when did it fall?"]

# Set sampling parameters
sampling_params = SamplingParams(temperature=0, top_p=0.95, max_tokens=10)

print("🔁 Running generation WITH LMCache...")

# Set up LMCache connector for KV cache transfer
ktc = KVTransferConfig(
    kv_connector="LMCacheConnectorV1",  # MUST match registered name
    kv_role="kv_both",  # both read and write
)

# Initialize vLLM with LMCache integration
llm = LLM(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    kv_transfer_config=ktc,
    max_model_len=8000,
    dtype="float16",
    gpu_memory_utilization=0.8,
)

# First run (LMCache will store KV cache)
time_start_1 = time.time()
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print("Generated text (LMCache, 1st run):", repr(output.outputs[0].text))
time_end_1 = time.time()

# Second run (LMCache reuses KV cache)
time_start_2 = time.time()
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print("Generated text (LMCache, 2nd run):", repr(output.outputs[0].text))
time_end_2 = time.time()

print(f"✅ LMCache - 1st run duration: {time_end_1 - time_start_1:.2f} seconds")
print(f"✅ LMCache - 2nd run duration: {time_end_2 - time_start_2:.2f} seconds")
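
To build intuition for why the second run can reuse work, the toy sketch below splits a token sequence into chunks of `LMCACHE_CHUNK_SIZE` (256 here) and gives each chunk a key that depends on all tokens up to the end of that chunk. This is an illustrative model only: the hashing scheme, the key format, the handling of the final partial chunk, and the names `chunk_keys` and `reusable_tokens` are assumptions, not LMCache's actual implementation.

```python
import hashlib


def chunk_keys(token_ids, chunk_size=256):
    """Return one key per chunk; each key covers every token up to that chunk's end."""
    keys = []
    hasher = hashlib.sha256()
    for start in range(0, len(token_ids), chunk_size):
        chunk = token_ids[start:start + chunk_size]
        hasher.update(",".join(map(str, chunk)).encode() + b";")
        keys.append(hasher.copy().hexdigest())
    return keys


def reusable_tokens(cached_keys, token_ids, chunk_size=256):
    """Count leading tokens whose chunk keys are already in the cache."""
    matched = 0
    for index, key in enumerate(chunk_keys(token_ids, chunk_size)):
        if key not in cached_keys:
            break
        matched = min((index + 1) * chunk_size, len(token_ids))
    return matched


first_run = list(range(6006))            # stand-in token IDs for the long prompt
cache = set(chunk_keys(first_run))       # keys the first run would store (toy model)
same_prompt = list(first_run)
new_question = first_run[:5000] + [9, 9, 9]

print(len(cache))                             # 24 chunks (23 full + 1 partial)
print(reusable_tokens(cache, same_prompt))    # 6006
print(reusable_tokens(cache, new_question))   # 4864
```

In this model, an identical prompt matches every chunk, while a prompt that diverges partway through still reuses the chunks before the divergence.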

Clean Up LMCache Engine:

In [None]:
from lmcache.v1.cache_engine import LMCacheEngineBuilder
from lmcache.integration.vllm.utils import ENGINE_NAME

# Properly clean up the LMCache backend
LMCacheEngineBuilder.destroy(ENGINE_NAME)

> 📌 During inference, LMCache automatically stores and manages the KV cache in CPU memory. You can monitor this through the logs, which show messages like:
>
> ```text
> LMCache INFO: Storing KV cache for 6006 out of 6006 tokens for request 0
> ```

This message indicates that the KV cache was successfully offloaded to CPU memory.
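
To track these messages programmatically, a regular expression over the captured log text is usually enough. The pattern below is based only on the sample line shown above, so treat the exact log format as an assumption that may differ between LMCache versions. `parse_store_line` is an illustrative name.

```python
import re

# Pattern derived from the sample log line in this notebook (format is an assumption).
STORE_PATTERN = re.compile(
    r"Storing KV cache for (\d+) out of (\d+) tokens for request (\S+)"
)


def parse_store_line(line):
    """Return (stored, total, request_id) for a matching log line, else None."""
    match = STORE_PATTERN.search(line)
    if match is None:
        return None
    stored, total, request_id = match.groups()
    return int(stored), int(total), request_id


line = "LMCache INFO: Storing KV cache for 6006 out of 6006 tokens for request 0"
print(parse_store_line(line))  # (6006, 6006, '0')
```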

**Notes**

*   Adjust `gpu_memory_utilization` based on your GPU's available memory.
*   Adjust the CPU offloading buffer size (in GB) through `LMCACHE_MAX_LOCAL_CPU_SIZE`.
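
To pick a sensible buffer size, it helps to estimate how much KV cache one token needs. The sketch below uses the published Llama-3.1-8B architecture (32 layers, 8 key/value heads, head dimension 128) with 2-byte `float16` values. It treats the 5.0 "GB" setting as GiB and ignores any bookkeeping overhead, both of which are assumptions.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes):
    """Bytes of KV cache per token: keys and values for every layer and KV head."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


# Llama-3.1-8B: 32 layers, 8 KV heads (grouped-query attention), head size 128.
per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128, dtype_bytes=2)
prompt_tokens = 6006          # token count from the sample log line
buffer_bytes = 5.0 * 1024**3  # assumes "GB" means GiB; actual accounting may differ

print(per_token // 1024, "KiB per token")                      # 128 KiB per token
print(round(prompt_tokens * per_token / 1024**3, 2), "GiB")    # 0.73 GiB
print(int(buffer_bytes // per_token), "tokens fit in buffer")  # 40960 tokens fit in buffer
```

Under these assumptions, the 5 GB buffer holds roughly six or seven prompts of the size used in this notebook.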

## What happens in real life with and without LMCache

### Without LMCache: old cache gets evicted as new queries come in.



### With LMCache: old cache is offloaded and reused.
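
The two cases above can be illustrated with a toy simulation. It models GPU memory as a small least-recently-used cache and, optionally, a CPU tier that keeps a copy of every stored entry. This is a conceptual sketch only: the class `ToyKVCache`, the slot counts, and the eviction policy are assumptions for illustration and do not reflect how vLLM or LMCache manage memory internally.

```python
from collections import OrderedDict


class ToyKVCache:
    """Toy LRU GPU cache with an optional CPU tier that keeps offloaded copies."""

    def __init__(self, gpu_slots, offload):
        self.gpu = OrderedDict()
        self.cpu = {} if offload else None
        self.gpu_slots = gpu_slots

    def lookup(self, prompt_id):
        """Return where the KV cache was found: 'gpu', 'cpu', or 'miss'."""
        if prompt_id in self.gpu:
            self.gpu.move_to_end(prompt_id)
            return "gpu"
        source = "cpu" if self.cpu is not None and prompt_id in self.cpu else "miss"
        self._insert(prompt_id)
        return source

    def _insert(self, prompt_id):
        self.gpu[prompt_id] = True
        if self.cpu is not None:
            self.cpu[prompt_id] = True    # offloaded copy survives GPU eviction
        if len(self.gpu) > self.gpu_slots:
            self.gpu.popitem(last=False)  # evict the least recently used entry


queries = ["doc_a", "doc_b", "doc_c", "doc_a"]  # doc_a returns after two other documents
for offload in (False, True):
    cache = ToyKVCache(gpu_slots=2, offload=offload)
    print(offload, [cache.lookup(q) for q in queries])
# False ['miss', 'miss', 'miss', 'miss']
# True ['miss', 'miss', 'miss', 'cpu']
```

Without the CPU tier, the returning `doc_a` query has to be recomputed from scratch. With it, the cached entry is found in CPU memory and can be loaded back instead.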