# Open Assistant as a Local ChatGPT API

Video tutorial:
[![Open Assistant as a Local ChatGPT API](https://img.youtube.com/vi/kkTNg_UOCNE/0.jpg)](https://www.youtube.com/watch?v=kkTNg_UOCNE)


Welcome everyone to a showcase and how-to for Open Assistant's Pythia 12 billion parameter model. This model is meant to be a chat assistant, like ChatGPT, but runnable locally. The weights alone take roughly 48GB of memory at full precision (float32), or about 24GB at half precision (float16).
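
To see where those memory figures come from, here is a rough back-of-the-envelope calculation. It assumes a rounded 12 billion parameters and counts the weights only, so real usage will be somewhat higher once activations and framework overhead are included. The helper name `weight_memory_gb` is just for illustration.

```python
# Back-of-the-envelope estimate: weights only, ignoring activations and overhead.
PARAMS = 12e9  # "12 billion parameters", rounded as in the text above


def weight_memory_gb(num_params, bytes_per_param):
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


fp32_gb = weight_memory_gb(PARAMS, 4)  # float32 stores 4 bytes per parameter
fp16_gb = weight_memory_gb(PARAMS, 2)  # float16 stores 2 bytes per parameter
print(f"float32: ~{fp32_gb:.0f} GB, float16: ~{fp16_gb:.0f} GB")
```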

This model is in live development and training, so you will want to keep an eye out for new releases. I started playing with this model's first variant (https://huggingface.co/OpenAssistant/oasst-sft-1-pythia-12b) and the next time I checked for an update, there was a 4th iteration available (https://huggingface.co/OpenAssistant/oasst-sft-4-pythia-12b-epoch-3.5). 


Being a local model, I'd like to also show how to essentially set up your own local API, which makes doing your own R&D and testing much quicker and easier. To start though, let's check out a super basic example. 

At their most basic level, these large GPT-style language models simply generate text sequentially, one token at a time. An example input might be:

"<|prompter|>What is a meme, and what's the history behind this word?<|endoftext|><|assistant|>"

And the output might be:

"A meme is a cultural idea, behavior, or style that spreads from person to person within a"

We can then wrap this in some basic logic that handles the special tokens `<|prompter|>`, `<|endoftext|>`, and `<|assistant|>`, which gives us more human-readable output with a chat-and-response feel.
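
As a minimal sketch of that wrapping logic, the helper below assembles a prompt string from earlier exchanges plus a new user message. The function name `build_prompt` and the list-of-pairs history format are assumptions made for illustration; only the three special tokens come from the model's documented prompt format.

```python
USERTOKEN = "<|prompter|>"
ENDTOKEN = "<|endoftext|>"
ASSISTANTTOKEN = "<|assistant|>"


def build_prompt(turns, next_user_message):
    """Build a prompt string.

    turns: list of (user_text, assistant_text) pairs already exchanged.
    next_user_message: the new message we want the assistant to answer.
    """
    parts = []
    for user_text, assistant_text in turns:
        parts.append(USERTOKEN + user_text + ENDTOKEN)
        parts.append(ASSISTANTTOKEN + assistant_text + ENDTOKEN)
    # End on the assistant tag so the model continues with the assistant's reply
    parts.append(USERTOKEN + next_user_message + ENDTOKEN + ASSISTANTTOKEN)
    return "".join(parts)


example_prompt = build_prompt([], "What color is the sky?")
print(example_prompt)
# prints: <|prompter|>What color is the sky?<|endoftext|><|assistant|>
```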

Let's dive in!


In [None]:
# OPTIONAL TO RUN ON A SPECIFIC GPU:
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

First, we'll import AutoTokenizer & AutoModelForCausalLM, which will allow us to load the model and tokenizer from the HuggingFace model hub. We'll also import torch, which we'll use to handle the model's output in a bit.

In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

MODEL_NAME = "OpenAssistant/oasst-sft-4-pythia-12b-epoch-3.5"

# load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

This will load the model and tokenizer into memory. If this is the first time you're running this with that specific model, it will take a bit to download the model and tokenizer. After that, it should take ~ a minute or so to load into memory. Once you have the model downloaded and loaded, you can optionally move it to your GPU if possible. In this example, I am also using the half precision version of the model, which is a bit faster and uses half the memory:

In [None]:
# Move the model to GPU and set it to half precision (float16)
model = model.half().cuda()

Now we'll start with some input. This could be any text you want, but it probably makes the most sense to structure it how this model was trained, with the special tokens of <|prompter|>, <|endoftext|>, and <|assistant|>. 

Imagine that we want to ask this model "What color is the sky?"

The prompt would be built like this:

"<|prompter|>What color is the sky?<|endoftext|><|assistant|>"

It feels a bit weird to use this end-of-text tag followed by the assistant tag, and it seems maybe redundant, but that's in the example provided by OpenAssistant on their HF page, so I assume every string before another "speaker" is terminated with that tag. By ending with the <|assistant|> tag, we're making it very clear to the model that a continued generation would start with the assistant's response to that input. The output from the model will likely be a continued generation, something like:

"<|prompter|>What color is the sky?<|endoftext|><|assistant|> The sky is often blue.<|endoftext|>"

You may find that after the end-of-text tag, the model generates another prompter tag and keeps producing text. You can either handle this with some Python logic that stops at the end-of-text tag, or you can let the transformers package stop generation for you by passing the end-of-text token as `eos_token_id` to `generate`. (The `early_stopping` flag applies to beam search, so when sampling it is `eos_token_id` that does the work.)

Let's see how to do this in Python:

In [None]:
inp = "<|prompter|>What color is the sky?<|endoftext|><|assistant|>"

input_ids = tokenizer.encode(inp, return_tensors="pt")

# Move the input to GPU  (ONLY do this if you're using the GPU for your model.)
input_ids = input_ids.cuda()

First, we specify some text input, then we tokenize that input with the model's tokenizer. From here, we move the tokenized input to the GPU, if we're using one. 

Next, we're going to use torch's automatic mixed precision (AMP) autocast context manager, which automatically picks the datatype for each operation. Within AMP's autocast context, we'll generate output with the model:

In [None]:
# Using automatic mixed precision
with torch.cuda.amp.autocast():
    # generate text until the output length (which includes the original input/context's length) reaches max_length. do_sample for random sampling vs greedy
    output = model.generate(input_ids, max_length=2048, do_sample=True, early_stopping=True, num_return_sequences=1, eos_token_id=model.config.eos_token_id)

Now we've got some output, but it's on the GPU. Let's move it to the CPU so we can more easily access it:

In [None]:
# Move the output back to CPU
output = output.cpu()

Finally, we can use the tokenizer to decode the output into human readable text:

In [None]:
# Decode the output
output_text = tokenizer.decode(output[0], skip_special_tokens=False)
print(output_text)
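
If the decoded text does run on past the assistant's reply, as described earlier, a bit of plain Python can cut it off at the right end-of-text tag. This is only a sketch: the function name `cut_after_reply` is hypothetical, and it assumes the decoded output contains the same number of end-of-text tags for the prompt portion as the prompt string itself.

```python
ENDTOKEN = "<|endoftext|>"


def cut_after_reply(prompt_text, generated_text):
    """Keep the prompt plus the first reply; drop anything generated after it."""
    # The prompt already contains some end-of-text tags; the reply ends at the next one.
    n_prompt_ends = prompt_text.count(ENDTOKEN)
    pos = -1
    for _ in range(n_prompt_ends + 1):
        pos = generated_text.find(ENDTOKEN, pos + 1)
        if pos == -1:
            # The model never closed its reply, so there is nothing to cut
            return generated_text
    return generated_text[:pos + len(ENDTOKEN)]


prompt_text = "<|prompter|>What color is the sky?<|endoftext|><|assistant|>"
# Made-up example of a run-on generation, not real model output
raw = prompt_text + "The sky is often blue.<|endoftext|><|prompter|>Why?<|endoftext|>"
print(cut_after_reply(prompt_text, raw))
```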

Full code up to this point:

In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
import os

# OPTIONAL TO RUN ON A SPECIFIC GPU:
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

MODEL_NAME = "OpenAssistant/oasst-sft-4-pythia-12b-epoch-3.5"

# load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

# Move the model to GPU and set it to half precision (float16)
model = model.half().cuda()

inp = "<|prompter|>What color is the sky?<|endoftext|><|assistant|>"

input_ids = tokenizer.encode(inp, return_tensors="pt")

# Move the input to GPU  (ONLY do this if you're using the GPU for your model.)
input_ids = input_ids.cuda()

# Using automatic mixed precision
with torch.cuda.amp.autocast():
    # generate text until the output length (which includes the original input/context's length) reaches max_length
    output = model.generate(input_ids, max_length=2048, do_sample=True, early_stopping=True, num_return_sequences=1, eos_token_id=model.config.eos_token_id)

# Move the output back to CPU
output = output.cpu()
# Decode the output
output_text = tokenizer.decode(output[0], skip_special_tokens=False)
print(output_text)

Okay, so that's a very basic example of how to use this model. Let's take a look at how to set up a local API to make this a bit easier to use and work with.

With an API, even just locally, we can speed up R&D time without needing to re-load the model to memory every run (though you could also just use a notebook or something in this case too!). Beyond that, we can also access this API from anywhere else on our network, or even the internet if we wanted, empowering whatever devices and computers we might want.

For this, I am going to use Flask (pip install flask), but there are certainly many ways you could do this same thing. I'll start a new script, which I'll call `oasst_api.py`. We'll start with:

In [None]:
from flask import Flask, request, jsonify
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
import os


app = Flask(__name__)
os.environ["CUDA_VISIBLE_DEVICES"] = "2"

MODEL_NAME = "OpenAssistant/oasst-sft-4-pythia-12b-epoch-3.5"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

model = model.half().cuda()

Not too much new here from before yet, other than the Flask imports and the beginning of the app definition. Now, all we need with our Flask app is a very basic route to handle our input and output. We'll use the same logic as before, but we'll wrap it in a function, and then we'll use Flask's `jsonify` to return the output as a JSON object.

Starting with:

In [None]:
@app.route('/generate', methods=['POST'])
def generate():
    content = request.json

This view will take a POST request, and that request will carry a JSON object containing our prompt. We can get the prompt with `content.get`, and then we will tokenize it and pass it to the GPU (if we're using one).

In [None]:
    inp = content.get("text", "")
    input_ids = tokenizer.encode(inp, return_tensors="pt")
    input_ids = input_ids.cuda()

Now we will query the model:

In [None]:
    with torch.cuda.amp.autocast():
        output = model.generate(input_ids, max_length=2048, do_sample=True, early_stopping=True, num_return_sequences=1, eos_token_id=model.config.eos_token_id)

Similar to before, we're using AMP's autocast and model.generate to get our output. From here, we just need to decode and return the output as a json object:

In [None]:
    decoded = tokenizer.decode(output[0], skip_special_tokens=False)
    return jsonify({'generated_text': decoded})

Finally, we can run the app:

In [None]:
if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000)

Making our full code for `oasst_api.py` now:

In [None]:
from flask import Flask, request, jsonify
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
import os


app = Flask(__name__)
os.environ["CUDA_VISIBLE_DEVICES"] = "2"

MODEL_NAME = "OpenAssistant/oasst-sft-4-pythia-12b-epoch-3.5"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

model = model.half().cuda()


@app.route('/generate', methods=['POST'])
def generate():
    content = request.json
    inp = content.get("text", "")
    input_ids = tokenizer.encode(inp, return_tensors="pt")
    input_ids = input_ids.cuda()

    with torch.cuda.amp.autocast():
        output = model.generate(input_ids, max_length=2048, do_sample=True, early_stopping=True, num_return_sequences=1, eos_token_id=model.config.eos_token_id)

    decoded = tokenizer.decode(output[0], skip_special_tokens=False)

    return jsonify({'generated_text': decoded})

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000)  # Set the host to '0.0.0.0' to make it accessible from your local network
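
The route above trusts that every request carries a JSON object with a `text` key. As a hedged sketch, a small validation helper could reject malformed payloads before they reach the tokenizer. The name `extract_prompt` and the error messages are hypothetical and not part of the original script. Inside the route you could pass it something like `request.get_json(silent=True)`, which returns `None` instead of raising when the body is not valid JSON, and send back a 400 response whenever an error message comes back.

```python
def extract_prompt(payload):
    """Validate a decoded JSON payload.

    Returns (text, error); exactly one of the two is None.
    """
    if not isinstance(payload, dict):
        return None, "request body must be a JSON object"
    text = payload.get("text")
    if not isinstance(text, str) or not text.strip():
        return None, "'text' must be a non-empty string"
    return text, None


# A well-formed payload passes through unchanged
print(extract_prompt({"text": "<|prompter|>Hi<|endoftext|><|assistant|>"}))
# A missing body or missing key yields an error message instead of a crash
print(extract_prompt(None))
print(extract_prompt({}))
```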

Now, we can run this API on whatever machine we want to host the model, and then we can query it from any other machine, provided that machine is on the same network.

For example, I can create a new file, called `chat_oasst_api.py`, to work with my new API. To start, some imports and constants:

In [None]:
import requests
import json
import colorama

SERVER_IP = "10.0.0.18" # Change this to the IP of your server that's hosting the API. This can be the same machine you're working on too.
URL = f"http://{SERVER_IP}:5000/generate"

USERTOKEN = "<|prompter|>"
ENDTOKEN = "<|endoftext|>"
ASSISTANTTOKEN = "<|assistant|>"

With the imports and constants out of the way, let's write a quick prompt function:

In [None]:
def prompt(inp):
    data = {"text": inp}
    headers = {'Content-type': 'application/json'}

    response = requests.post(URL, data=json.dumps(data), headers=headers)

    if response.status_code == 200:
        return response.json()["generated_text"]
    else:
        # Return a single string (not a tuple) so callers can always treat the result as text
        return f"Error: {response.status_code}"

This function takes input, builds a dictionary which we'll convert to a json object, sets headers, and then sends a post request to our API. We'll use the requests package to do this. From here, we'll grab either the json response, or error if there is one. Now we just need some simple logic to handle for the chat and context:


In [None]:
history = ""
while True:
    inp = input(">>> ")
    context = history + USERTOKEN + inp + ENDTOKEN + ASSISTANTTOKEN
    output = prompt(context)
    history = output
    just_latest_asst_output = output.split(ASSISTANTTOKEN)[-1].split(ENDTOKEN)[0]
    # color just_latest_asst_output green in print:
    print(colorama.Fore.GREEN + just_latest_asst_output + colorama.Style.RESET_ALL)
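
To see how the history grows from turn to turn without needing a running server, here is a small simulation of the same loop. The `fake_prompt` function is a hypothetical stand-in for `prompt` that simply appends a canned reply, and the user inputs are hard-coded instead of coming from `input()`.

```python
USERTOKEN = "<|prompter|>"
ENDTOKEN = "<|endoftext|>"
ASSISTANTTOKEN = "<|assistant|>"


def fake_prompt(context):
    """Hypothetical stand-in for prompt(): returns the context plus a canned reply."""
    turn_number = context.count(USERTOKEN)
    return context + f"canned reply {turn_number}" + ENDTOKEN


history = ""
replies = []
for inp in ["first question", "second question"]:
    context = history + USERTOKEN + inp + ENDTOKEN + ASSISTANTTOKEN
    output = fake_prompt(context)
    # The returned text already contains the whole conversation so far
    history = output
    # Same extraction as in the chat loop: text after the last assistant tag
    replies.append(output.split(ASSISTANTTOKEN)[-1].split(ENDTOKEN)[0])

print(replies)  # prints: ['canned reply 1', 'canned reply 2']
print(history)
```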

The full `chat_oasst_api.py` code:

In [None]:
import requests
import json
import colorama

SERVER_IP = "10.0.0.18"
URL = f"http://{SERVER_IP}:5000/generate"

USERTOKEN = "<|prompter|>"
ENDTOKEN = "<|endoftext|>"
ASSISTANTTOKEN = "<|assistant|>"

def prompt(inp):
    data = {"text": inp}
    headers = {'Content-type': 'application/json'}

    response = requests.post(URL, data=json.dumps(data), headers=headers)

    if response.status_code == 200:
        return response.json()["generated_text"]
    else:
        # Return a single string (not a tuple) so callers can always treat the result as text
        return f"Error: {response.status_code}"
    
history = ""
while True:
    inp = input(">>> ")
    context = history + USERTOKEN + inp + ENDTOKEN + ASSISTANTTOKEN
    output = prompt(context)
    history = output
    just_latest_asst_output = output.split(ASSISTANTTOKEN)[-1].split(ENDTOKEN)[0]
    # color just_latest_asst_output green in print:
    print(colorama.Fore.GREEN + just_latest_asst_output + colorama.Style.RESET_ALL)

With this, we can fully interact with our model! 

One slight problem is that the context is going to keep growing. The maximum context length for this model is 2048 tokens. That's quite a bit, but if you have a longer conversation, or you even just want to keep an ongoing one for days, this is going to be a problem.

How you handle for context might vary. You could just trim context to keep it in some range. Remember: context includes the prompt as well as the generation. Your generation might want to be 200 tokens long, so this really means your prompt needs to be 1848 tokens or less.

Besides a simple trimming past a certain amount of tokens, you could also get more complex by attempting to also summarize the context to "compress" it. I will skip that for now and go straight to a trim. In most cases, this will be fine. If you need to retain history more, then you might try a more complicated approach. 
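
If you do want to trim on the client, one option is to drop whole exchanges from the front of the history so the remaining text always starts at a prompter tag. The sketch below illustrates that idea only. The names `count_tokens` and `drop_oldest_turns` are hypothetical, and `count_tokens` is a crude whitespace-based stand-in for a real tokenizer (with the model's tokenizer available, something like `len(tokenizer.encode(text))` would give the actual count).

```python
USERTOKEN = "<|prompter|>"


def count_tokens(text):
    """Crude stand-in for a real tokenizer: counts whitespace-separated words."""
    return len(text.split())


def drop_oldest_turns(history, budget):
    """Remove whole exchanges from the front until the history fits the budget."""
    while count_tokens(history) > budget:
        # Find the start of the next exchange (skipping the tag at position 0)
        next_turn = history.find(USERTOKEN, 1)
        if next_turn == -1:
            break  # only one exchange left; nothing more to drop
        history = history[next_turn:]
    return history


history = (
    "<|prompter|>a b c<|endoftext|><|assistant|>d e f<|endoftext|>"
    "<|prompter|>g h<|endoftext|><|assistant|>i<|endoftext|>"
)
trimmed = drop_oldest_turns(history, budget=3)
print(trimmed)
```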

You can also choose whether you want to add this logic to the API or the client. I think summarization would be handled client-side, but a brute trimming of the context to handle longer conversations can happen API-side. This really is up to you though. I'll edit `oasst_api.py` and start by adding the following constants:

In [None]:
# Get max context length and then determine the cushion for the response
MAX_CONTEXT_LENGTH = model.config.max_position_embeddings
print(f"Max context length: {MAX_CONTEXT_LENGTH}")
ROOM_FOR_RESPONSE = 512

This dynamically pulls the maximum context length from the model's config, and then we can choose how much of a "cushion" we want to leave for the generated response. I've chosen 512 tokens, which is more than most responses will need, and 2048-512=1536 still leaves a lot of context!

Now, within the `generate` function, we can add some logic to handle for context length:

In [None]:
    # Calc current size
    print("Context length is currently", input_ids.shape[1], "tokens. Allowed amount is", MAX_CONTEXT_LENGTH-ROOM_FOR_RESPONSE, "tokens.")
    # determine if we need to trim
    if input_ids.shape[1] > (MAX_CONTEXT_LENGTH-ROOM_FOR_RESPONSE):
        print("Trimming a bit")
        # trim along the token dimension (dim 1), keeping only the most recent tokens
        input_ids = input_ids[:, -(MAX_CONTEXT_LENGTH-ROOM_FOR_RESPONSE):]
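
To make the slicing concrete without loading the model, here is the same trim applied to a plain nested list standing in for the `(1, sequence_length)` tensor. The hard-coded 2048 is the context length quoted above, and the 2000 fake token ids are made up for the example.

```python
MAX_CONTEXT_LENGTH = 2048  # context length quoted above for this model
ROOM_FOR_RESPONSE = 512
PROMPT_BUDGET = MAX_CONTEXT_LENGTH - ROOM_FOR_RESPONSE  # 1536

# A plain-list stand-in for a (1, 2000) tensor of token ids
input_ids = [list(range(2000))]

if len(input_ids[0]) > PROMPT_BUDGET:
    # Same idea as input_ids[:, -PROMPT_BUDGET:] on a tensor:
    # keep every row, but only the newest PROMPT_BUDGET tokens in each
    input_ids = [row[-PROMPT_BUDGET:] for row in input_ids]

print(len(input_ids[0]), input_ids[0][0], input_ids[0][-1])
# prints: 1536 464 1999
```

The oldest 464 tokens are dropped and the most recent ones are kept, which is what we want for a chat history.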

The full code for `oasst_api.py` is now:

In [None]:
from flask import Flask, request, jsonify
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
import os


app = Flask(__name__)
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

MODEL_NAME = "OpenAssistant/oasst-sft-4-pythia-12b-epoch-3.5"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

# Get max context length and then determine the cushion for the response
MAX_CONTEXT_LENGTH = model.config.max_position_embeddings
print(f"Max context length: {MAX_CONTEXT_LENGTH}")
ROOM_FOR_RESPONSE = 512

model = model.half().cuda()


@app.route('/generate', methods=['POST'])
def generate():
    content = request.json
    inp = content.get("text", "")
    input_ids = tokenizer.encode(inp, return_tensors="pt")

    # Calc current size
    print("Context length is currently", input_ids.shape[1], "tokens. Allowed amount is", MAX_CONTEXT_LENGTH-ROOM_FOR_RESPONSE, "tokens.")
    # determine if we need to trim
    if input_ids.shape[1] > (MAX_CONTEXT_LENGTH-ROOM_FOR_RESPONSE):
        print("Trimming a bit")
        # trim along the token dimension (dim 1), keeping only the most recent tokens
        input_ids = input_ids[:, -(MAX_CONTEXT_LENGTH-ROOM_FOR_RESPONSE):]
    
    input_ids = input_ids.cuda()

    with torch.cuda.amp.autocast():
        output = model.generate(input_ids, max_length=2048, do_sample=True, early_stopping=True, num_return_sequences=1, eos_token_id=model.config.eos_token_id)

    decoded = tokenizer.decode(output[0], skip_special_tokens=False)

    return jsonify({'generated_text': decoded})

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000)  # Set the host to '0.0.0.0' to make it accessible from your local network