Local Inference (LM Studio), caching and Assistants API

As an amateur, I tried to figure out how it works and added something from my experience working with LM Studio.
I thought that the Assistants API was the way to go for efficiency.

Cloud APIs (OpenAI, Anthropic, Google) operate on the principle of Stateless (stateless). They don't remember the settings between separate independent requests. Here's how it works in practice and exactly where the instructions should be.:

How it works in cloud models

* Each request is autonomous: The cloud server _forgets you immediately_ after sending a single line translation. In order for the model to know the rules, it must receive a system prompt _with each API call_ .
* Automation through a plugin: You don't need to manually copy the prompt into Subtitle Edit (SE) for each line. Instead, you insert the entire system prompt into a special field in the SE translation settings.
* Saving traffic: SE automatically concatenates the system prompt with the current subtitle line and sends it to the cloud. Every time...

Features of working with the cloud (as opposed to LM Studio)

* Token fees: Since SE will send your system prompt with each line of subtitles, you will be charged for these tokens _repeatedly with each request_.
* Saving solution: Use models that support Prompt Caching (prompt caching), such as Claude 3.5 Sonnet or DeepSeek. If the system prompt is long, the cloud may recognize it as repetitive and offer a significant discount (up to 90%).
* Alternative (Custom GPT / Assistants API): OpenAI offers the **Assistants API**, where the prompt is stored on the server, but Subtitle Edit does not typically support this mode directly. Instead, it works through the standard Chat Completions.

What to do in Subtitle Edit when working with a local API (LM Studio, Ollama)

LM Studio:
   1. Open LM Studio and go to the list of downloaded models.
   2. Select the model for translation.
   3. Go to the Inferens section and paste your prompt into the System Promot window. 
   
Subtitile Edit via Aytotranslate (API).
 1. Open the SE auto-translation window 
 2. Paste only the command to start the translation (by default, it's "Translate from {0} to {1} ...")*
 * Do not place anything there that might conflict with the system prompt.

This way, SE will only send the text, and the entire logic will remain "under the hood" of the LM Stusio settings, like a system cache.

_But why does the cloud model forget me after each response (Stateless)? After all, I have a key, and if I come with an API request, the model should know the owner of the key? As soon as I arrive, it should immediately access the system prompt, I assume. When I chat with the model and provide the system prompt in advance, I don't have to send ALL it to the chat every time._

The architecture of working through a chat on a website and through an API is fundamentally different.
This is why the cloud "forgets" you in the API, despite having an API key: 
1. An API key is a passport, not a memory, and it is used solely for identification and payment. The model recognizes you as the owner of the key  ("Access is granted, money will be deducted from account X"), but it does not have the task of remembering your work. Storing the context of millions of users on cloud servers would be extremely resource-intensive.
2. How the chat on a website (Stateful) works.When you communicate in a web interface (such as ChatGPT), the website performs hidden work for you:
* The website stores your conversation history and system prompt in a database.
* Every time you write a new message, the website takes the entire history, the system prompt, and your new word, combines them into a single giant package, and sends it to the API server.
* You do not see this, but the model re-reads the chat every time.
3. Как устроен чистый API (Stateless).
In the API, you communicate with the model directly, without the mediation of a smart website.

* The server processed the request, provided a translation of one line of subtitles, and immediately erased the request from RAM to make room for thousands of other people.
* If you send the next line in a second, it is absolutely new to the server, unrelated to the previous request.

How can I solve your problem without sending prompta every time?
If you absolutely do not want to send a system prompt with each line of subtitles via Subtitle Edit, you have two ways.:

1. Use the Assistants API (if it supports the SE plugin): OpenAI and Anthropic have a special "Assistants" mode. You create a robot on their server once, sew a system prompt into it, and it gets a permanent ID. In requests from SE, you only send this ID and the text.
2. Rely on Prompt Caching: You continue to send the system prompt from SE with each request, but the cloud server (for example, DeepSeek or OpenAI) "recognizes" the familiar instruction text. He will not process it again, but will take it from his fast cache, due to which the transfer will be instantaneous, and the fee for sending the prompt will decrease by 80-90%.

_ How do I check if Prompt Caching works exactly when I translate subtitles? For example, in the case of Gemini, Mistral, DeepSeek or OpenAI?_

Check how Prompt Caching works It won't work directly "by eye" through the Subtitle Edit (SE) interface, since the program just shows the finished translation. However, there are three reliable ways to find out if the cache is working and if you are saving money.

Method 1. According to indirect criteria (Transfer speed), the simplest marker is speed.

* The first line of subtitles is translated with a delay of 1-3 seconds (the model "reads" and warms up your long system prompt).
* All subsequent lines start to "fly out" instantly (in a fraction of a second), since the model does not waste time re-analyzing instructions. 

Method 2. Through the provider's personal account (The most accurate one)
Each neural network records cache hits in your usage statistics. Start translating a block of 50-100 lines into a Subtitle Edit, wait 5 minutes and log in to the developer console.:

* Google Gemini (Google AI Studio): In the detailed cost statistics (Billing / Usage), you will see the division of tokens. The tokens read from the cache are displayed as Cached tokens (or in the API response audit as cached_content_token_count).
* OpenAI (ChatGPT API): Go to Dashboard -> Usage. There will be a separate purple graph or a row of Cached text input tokens (Cached input tokens), which are 50% cheaper than regular ones.
* DeepSeek: There will be a clear counter in your personal account on the statistics tab: Context Cache Hit (how many tokens are read from the cache for free/with a 90% discount) and Context Cache Miss (how much is read for the full price).

Method 3. Verification via the Subtitle Edit logs (For advanced users)
If API request logging is enabled in the Subtitle Edit (or you use proxy tools like Fiddler/Charles), you can look inside the technical response (Response JSON) that the server sends to each line of subtitles.
There is a usage block at the very end of each response from the server. That's what should be there.:
OpenAI:
```
"usage": {
    "prompt_tokens": 2048,
    "completion_tokens": 25,
    "prompt_tokens_details": {
        "cached_tokens": 1920  // <- IF IT'S NOT ZERO, THE CACHE IS WORKING!
    }
}
```

Gemini:
```
"usageMetadata": {
    "promptTokenCount": 2048,
    "candidatesTokenCount": 25,
    "cachedContentTokenCount": 1920 // <- Here is your working cache
}
```

Critically important: why might the cache NOT work in SE?
In order for caching to start automatically (Implicit Caching), three strict conditions must be met:

   1. The size of the prompt: For OpenAI and Gemini, the automatic cache is enabled only if your system prompt + line text together exceed 1024 tokens (approximately 700-800 words). If your prompt is short (for example: "Translate to Russian"), caching won't be enabled, but you're already paying pennies.
   2. Absolute identity of the beginning: Only the part that does not change at all from request to request is cached.
   * How to do it correctly: System prompt (Glossary, rules) -> ALWAYS THE SAME. Only the {text} at the very end is changed.
      * How wrong it is: If the SE plugin inserts a dynamic row ID, time, or history of previous rows before the system prompt for some reason. Then each request is unique from the very first character, and the cache is reset.
   3. Time between requests (TTL): The cache lives on servers for an average of 5 to 60 minutes. If you translate subtitles in automatic mode (SE plays line by line in a row), the cache will work at 100%. If you translate one line manually every half hour, the cache will constantly "rot". 
   
Sorry, I wrote this using Google

https://www.youtube.com/watch?v=GW02ugqoIdU
https://www.reddit.com/r/GoogleGeminiAI/comments/1qbx9qe/easy_api_setup_without_programming_knowledge/
https://ai.google.dev/gemini-api/docs/coding-agents

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Local Inference (LM Studio), caching and Assistants API #274

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Local Inference (LM Studio), caching and Assistants API #274

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions