# How to use this notebook to generate a LiveAvatar script

1. First, you need a vLLM server running a compatible LLM. We recommend `openai/gpt-oss-120b`.
2. Run `pip install transformers==4.41.1 torch torchaudio`.
3. Run `bash demo/download_experimental_voices.sh`.
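
Before running the cells, it can help to confirm that the vLLM server from step 1 is reachable. The following is a minimal sketch using only the standard library. It assumes the server listens at `http://localhost:8000/v1` (the `base_url` used later in this notebook) and exposes the OpenAI-compatible `/models` route; the helper names `models_endpoint`, `model_ids`, and `list_served_models` are hypothetical and not part of vLLM.

```python
import json
import urllib.request


def models_endpoint(base_url: str) -> str:
    """Build the URL of the OpenAI-compatible model listing route."""
    return base_url.rstrip("/") + "/models"


def model_ids(payload: dict) -> list:
    """Extract model names from a {"object": "list", "data": [...]} response."""
    return [entry["id"] for entry in payload.get("data", [])]


def list_served_models(base_url: str = "http://localhost:8000/v1",
                       timeout: float = 5.0) -> list:
    """Query the server (assumed address) and return the served model names."""
    with urllib.request.urlopen(models_endpoint(base_url), timeout=timeout) as resp:
        return model_ids(json.load(resp))
```

Calling `list_served_models()` raises a `URLError` if nothing is listening at that address. Otherwise the returned list should contain the name passed as `model=` in the summarization cell.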

## Run the cells in order to generate the .wav script for LiveAvatar

The output will be saved to this folder as `file_generated.wav`.

Once generation completes, run LiveAvatar on a DigitalOcean Gradient GPU Droplet with 8 x NVIDIA H100 or 8 x NVIDIA H200 GPUs.

In [None]:
!pip install transformers==4.41.1 torch torchaudio
!bash demo/download_experimental_voices.sh

In [None]:
from openai import OpenAI
 
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="EMPTY"
)

title = 'Turning Your 1-Click Model GPU Droplets Into A Personal Assistant'
article ="""
1-Click Models are the new collaborative project from DigitalOcean and Hugging Face to bring you an easy method to interface with some of the best open-source Large Language Models (LLMs) on the most powerful GPUs available on the cloud. Together, they let users make the most of the best open-source models with no setup hassle or coding.

In this tutorial, we walk through the development of a voice-enabled personal assistant tool designed to run on any 1-Click Model enabled GPU Droplet. This application uses Gradio, and is fully API enabled with FastAPI. Follow along to learn more about the advantages of using 1-Click Models, learn the basics of querying a deployed 1-Click Model GPU Droplet, and see how to use the personal assistant on your own machines!
1-Click Hugging Face Models with DigitalOcean GPU Droplets

The new 1-Click models come with a wide variety of LLM options, all with different use cases. These are namely:

    meta-llama/Meta-Llama-3.1-8B-Instruct
    meta-llama/Meta-Llama-3.1-70B-Instruct
    meta-llama/Meta-Llama-3.1-405B-Instruct-FP8
    Qwen/Qwen2.5-7B-Instruct
    google/gemma-2-9b-it
    google/gemma-2-27b-it
    mistralai/Mixtral-8x7B-Instruct-v0.1
    mistralai/Mistral-7B-Instruct-v0.3
    mistralai/Mixtral-8x22B-Instruct-v0.1
    NousResearch/Hermes-3-Llama-3.1-8B
    NousResearch/Hermes-3-Llama-3.1-70B
    NousResearch/Hermes-3-Llama-3.1-405B
    NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO

Creating a new GPU Droplet with any of these models only requires the process of setting up a GPU droplet, as shown here.

Watch the following video for a full step-by-step guide to creating a 1-Click Model GPU Droplet, and check out this article for more details on launching a new instance.

Once you have set up your new machine, navigate to the next section for more detail on interacting with your 1-Click Model.
Interacting with the 1-Click Model Deployment

Connecting to the 1-Click Model Deployment is simple if we want to interact with it on the same machine. “When connected to the HUGS Droplet, the initial SSH message will display a Bearer Token, which is required to send requests to the public IP of the deployed HUGS Droplet. Then you can send requests to the Messages API via either localhost if connected within the HUGS Droplet, or via its public IP.” (Source). To access the Droplet from other machines, we therefore need the Bearer Token. Connect to your machine using SSH to get a copy of the token, and save it for later. If we only want to interact with the inference endpoint from the GPU Droplet itself, things are simpler: the variable is already saved to the environment.

Once the Bearer Token variable is set on the machine we are using, we can begin running inference with the model. There are currently two ways to do this: cURL and Python. The endpoint is served automatically on port 8080, so we can send requests to localhost by default. If we are using a different machine, change the localhost value below to the Droplet's IPv4 address.
cURL

curl http://localhost:8080/v1/chat/completions \
    -X POST \
    -d '{"messages":[{"role":"user","content":"What is Deep Learning?"}],"temperature":0.7,"top_p":0.95,"max_tokens":128}' \
    -H 'Content-Type: application/json' \
    -H "Authorization: Bearer $BEARER_TOKEN"

This code will ask the model “What is Deep Learning?” and issue a response in the following format:

{"object":"chat.completion","id":"","created":1731532721,"model":"hfhugs/Meta-Llama-3.1-8B-Instruct","system_fingerprint":"2.3.1-dev0-sha-169178b","choices":[{"index":0,"message":{"role":"assistant","content":"**Deep Learning: A Subfield of Machine Learning**\n=====================================================\n\nDeep learning is a subfield of machine learning that focuses on the use of artificial neural networks to analyze and interpret data. It is inspired by the structure and function of the human brain and is particularly well-suited for tasks such as image and speech recognition, natural language processing, and data classification.\n\n**Key Characteristics of Deep Learning:**\n\n1. **Artificial Neural Networks**: Deep learning models are composed of multiple layers of interconnected nodes or \"neurons\" that process and transform inputs into outputs.\n2. **Non-Linear Transformations**: Each layer applies a non-linear"},"logprobs":null,"finish_reason":"length"}],"usage":{"prompt_tokens":40,"completion_tokens":128,"total_tokens":168}}

This can then be plugged into a variety of web development applications as needed.
Python

The model can also be accessed from Python using either the Hugging Face Hub or OpenAI packages. We are going to refer to the Hugging Face Hub reference code for this demonstration.

### Hugging Face Hub
import os
from huggingface_hub import InferenceClient

client = InferenceClient(base_url="http://localhost:8080", api_key=os.getenv("BEARER_TOKEN"))

chat_completion = client.chat.completions.create(
    messages=[
        {"role":"user","content":"What is Deep Learning?"},
    ],
    temperature=0.7,
    top_p=0.95,
    max_tokens=128,
)

This will return a formatted response as a ChatCompletionOutput object.

## HuggingFace Hub
ChatCompletionOutput(choices=[ChatCompletionOutputComplete(finish_reason='length', index=0, message=ChatCompletionOutputMessage(role='assistant', content='**Deep Learning: An Overview**\n\nDeep Learning is a subset of Machine Learning that involves the use of Artificial Neural Networks (ANNs) with multiple layers to analyze and interpret data. These networks are inspired by the structure and function of the human brain, with each layer processing the input data in a hierarchical manner.\n\n**Key Characteristics:**\n\n1.  **Multiple Layers:** Deep Learning models typically have 2 or more hidden layers, allowing them to learn complex patterns and relationships in the data.\n2.  **Neural Networks:** Deep Learning models are based on artificial neural networks, which are composed of interconnected nodes (neurons) that process', tool_calls=None), logprobs=None)], created=1731532948, id='', model='hfhugs/Meta-Llama-3.1-8B-Instruct', system_fingerprint='2.3.1-dev0-sha-169178b', usage=ChatCompletionOutputUsage(completion_tokens=128, prompt_tokens=40, total_tokens=168))

We can print just the output with:

chat_completion.choices[0]['message']['content']

Creating a Voice Enabled Personal Assistant

To make the best use of this powerful new tool, we have developed a new personal assistant application to run with the models. The application is fully voice enabled, capable of listening to inputs and reading outputs back out loud. To make this possible, the demo uses Whisper to transcribe an audio input, or takes plain text, and passes that to the LLM powered by 1-Click GPU Droplets to generate a text response. We then use Coqui-AI’s XTTS2 model to convert the text response into an understandable audio output. It’s worth noting that the software uses voice cloning to generate the output audio, so users will receive a voice output close to their own speaking voice.

Take a look at the code below:

import gradio as gr
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, StoppingCriteria, StoppingCriteriaList, TextIteratorStreamer
from threading import Thread
import os
from huggingface_hub import InferenceClient
import gradio as gr
import random
import time
from TTS.api import TTS
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
from datasets import load_dataset
import scipy.io.wavfile as wavfile
import numpy as np


device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id_w = "openai/whisper-large-v3"

model_w = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id_w, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model_w.to(device)

processor = AutoProcessor.from_pretrained(model_id_w)

pipe_w = pipeline(
    "automatic-speech-recognition",
    model=model_w,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    torch_dtype=torch_dtype,
    device=device,
)

client = InferenceClient(base_url="http://localhost:8080", api_key=os.getenv("BEARER_TOKEN"))

# Example voice cloning with YourTTS in English, French and Portuguese
# tts = TTS("tts_models/multilingual/multi-dataset/bark", gpu=True)

# get v2.0.2
tts = TTS(model_name="xtts_v2.0.2", gpu=True)

with gr.Blocks() as demo:
    chatbot = gr.Chatbot(type="messages")
    with gr.Row():
        msg = gr.Textbox(label = 'Prompt')
        audi = gr.Audio(label = 'Transcribe audio')
    with gr.Row():
        submit = gr.Button('Submit')
        submit_audio = gr.Button('Submit Audio')
        read_audio = gr.Button('Transcribe Text to Audio')
        clear = gr.ClearButton([msg, chatbot])
    with gr.Row():
        token_val = gr.Slider(label = 'Max new tokens', value = 512, minimum = 128, maximum = 1024, step = 8, interactive=True)
        temperature_ = gr.Slider(label = 'Temperature', value = .7, minimum = 0, maximum =1, step = .1, interactive=True)
        top_p_ = gr.Slider(label = 'Top P', value = .95, minimum = 0, maximum =1, step = .05, interactive=True)

    def respond(message, chat_history, token_val, temperature_, top_p_):
        bot_message = client.chat.completions.create(messages=[{"role":"user","content":f"{message}"},],temperature=temperature_,top_p=top_p_,max_tokens=token_val,).choices[0]['message']['content']
        chat_history.append({"role": "user", "content": message})
        chat_history.append({"role": "assistant", "content": bot_message})
        # tts.tts_to_file(bot_message, speaker_wav="output.wav", language="en", file_path="output.wav")

        return "", chat_history, #"output.wav"
    
    def respond_audio(audi, chat_history, token_val, temperature_, top_p_):  
        wavfile.write("output.wav", 44100, audi[1]) 
        result = pipe_w('output.wav')
        message = result["text"]
        print(message)
        bot_message = client.chat.completions.create(messages=[{"role":"user","content":f"{message}"},],temperature=temperature_,top_p=top_p_,max_tokens=token_val,).choices[0]['message']['content']
        chat_history.append({"role": "user", "content": message})
        chat_history.append({"role": "assistant", "content": bot_message})
        # tts.tts_to_file(bot_message, speaker_wav="output.wav", language="en", file_path="output2.wav")
        # tts.tts_to_file(bot_message,
                # file_path="output.wav",
                # speaker_wav="output.wav",
                # language="en")
        return "", chat_history, #"output.wav"
    def read_text(chat_history):
        print(chat_history)
        print(type(chat_history))
        tts.tts_to_file(chat_history[-1]['content'],
                file_path="output.wav",
                speaker_wav="output.wav",
                language="en")
        return 'output.wav'


    msg.submit(respond, [msg, chatbot, token_val, temperature_, top_p_], [msg, chatbot])
    submit.click(respond, [msg, chatbot, token_val, temperature_, top_p_], [msg, chatbot])
    submit_audio.click(respond_audio, [audi, chatbot, token_val, temperature_, top_p_], [msg, chatbot])
    read_audio.click(read_text, [chatbot], [audi])
demo.launch(share = True)

Put together, this integrated system makes it possible to take full advantage of the speed and availability of a cloud GPU to act as a personal assistant for all kinds of tasks. We have been using it in place of popular closed source tools like Gemini and ChatGPT, and have really been impressed with the results.
Setting up & Running the Demo

To install the required packages onto your GPU Droplet, paste the following into the terminal:

pip install gradio tts huggingface_hub transformers datasets scipy torch torchaudio

To run this demo, simply paste the code above into a blank python file (let’s arbitrarily call it app.py) on your 1-Click Model enabled Cloud GPU, and run it with python3 app.py.
Closing Thoughts

The personal assistant application developed for this tutorial has already proven useful for us in our daily lives, and we hope others can find some utility in it. Furthermore, the new 1-Click Model GPU Droplets offer a really interesting alternative to enterprise LLM software. While costly for single users, there are a number of use cases we can think of (namely running the largest open-source LLMs) that can justify the expenditure. Our new offerings have the largest Mixtral and LLaMA models available, so it is an interesting opportunity to test the power of these models against the best competition.

Thank you for reading!

"""
 
result = client.chat.completions.create(
    model="openai/gpt-oss-120b",
    messages=[
        {"role": "system", "content": "You are a helpful assistant for creating short video essay scripts."},
        {"role": "user", "content": f"Extract the content from the following article (which includes all text following the colon), and create a short summary script for a video explaining the article. Do not include any additional text for the script, only the words that are to be spoken. The article is titled {title}, and its content is: {article}"}
    ]
)
 
print(result.choices[0].message.content)
text = str(result.choices[0].message.content)


[upbeat intro music]  
Welcome to a quick dive into turning DigitalOcean’s 1‑Click Model GPU droplets into your very own voice‑enabled personal assistant.  

First, what are 1‑Click Models? [curious tone] They’re a collaboration between DigitalOcean and Hugging Face that lets you spin up a cloud GPU droplet pre‑loaded with powerful open‑source LLMs—no coding required. Choose from models like Meta‑Llama‑3.1, Gemma‑2, Mixtral, or Hermes, all ready to run on the droplet’s GPU.  

Once your droplet is up, you’ll see a Bearer Token in the SSH welcome message. [pause] Save that token—it’s the key to sending requests to the model’s inference API on port 8080, whether you’re on the same machine or a remote one.  

You can query the model with a simple cURL command:  
```bash
curl http://localhost:8080/v1/chat/completions -X POST \
  -d '{"messages":[{"role":"user","content":"What is Deep Learning?"}], "temperature":0.7,"top_p":0.95,"max_tokens":128}' \
  -H 'Content-Type: application/json' \
```

In [6]:
with open('file.txt', 'w') as file:
    file.write(text)
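
The generated script shown above contains bracketed stage directions such as `[upbeat intro music]` and a fenced code block, even though the prompt asks only for spoken words. If the text-to-speech model does not interpret such markers, it may read them aloud. The sketch below shows one way to strip them before writing `file.txt`. `clean_script` is a hypothetical helper, not part of VibeVoice or the OpenAI client, and the patterns are assumptions about what the LLM tends to emit.

```python
import re


def clean_script(text: str) -> str:
    """Remove content that should not be spoken from an LLM-generated script."""
    # Drop fenced code blocks, including an unterminated one at the end of the text.
    text = re.sub(r"```.*?(?:```|\Z)", "", text, flags=re.DOTALL)
    # Drop single-line bracketed stage directions such as "[pause]".
    text = re.sub(r"\[[^\]\n]*\]", "", text)
    # Collapse runs of spaces or tabs and trim every line.
    lines = [re.sub(r"[ \t]+", " ", line).strip() for line in text.splitlines()]
    # Collapse three or more newlines into a single blank line.
    cleaned = re.sub(r"\n{3,}", "\n\n", "\n".join(lines))
    return cleaned.strip()
```

To use it, write `clean_script(text)` instead of `text` in the cell above. Inspect the cleaned script before synthesis, since a bracketed phrase that was meant to be spoken would also be removed.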

In [1]:
import argparse
import os
import re
import traceback
from typing import List, Tuple, Union, Dict, Any
import time
import torch
import copy
import glob

from vibevoice.modular.modeling_vibevoice_streaming_inference import VibeVoiceStreamingForConditionalGenerationInference
from vibevoice.processor.vibevoice_streaming_processor import VibeVoiceStreamingProcessor
from transformers.utils import logging

logging.set_verbosity_info()
logger = logging.get_logger(__name__)


class VoiceMapper:
    """Maps speaker names to voice file paths"""
    
    def __init__(self):
        self.setup_voice_presets()
        for k, v in self.voice_presets.items():
            print(f"{k}: {v}")

    def setup_voice_presets(self):
        """Setup voice presets by scanning the voices directory."""
        voices_dir = os.path.join(os.path.dirname('./demo/'), "voices/streaming_model")
        
        # Check if voices directory exists
        if not os.path.exists(voices_dir):
            print(f"Warning: Voices directory not found at {voices_dir}")
            self.voice_presets = {}
            self.available_voices = {}
            return
        
        # Scan for all VOICE files in the voices directory
        self.voice_presets = {}
        
        # Get all .pt files in the voices directory
        pt_files = glob.glob(os.path.join(voices_dir, "**", "*.pt"), recursive=True)
        
        # Create dictionary with filename (without extension) as key
        for pt_file in pt_files:
            # key: filename without extension
            name = os.path.splitext(os.path.basename(pt_file))[0].lower()
            full_path = os.path.abspath(pt_file)
            self.voice_presets[name] = full_path
        
        # Sort the voice presets alphabetically by name for better UI
        self.voice_presets = dict(sorted(self.voice_presets.items()))
        
        # Filter out voices that don't exist (this is now redundant but kept for safety)
        self.available_voices = {
            name: path for name, path in self.voice_presets.items()
            if os.path.exists(path)
        }
        
        print(f"Found {len(self.available_voices)} voice files in {voices_dir}")
        print(f"Available voices: {', '.join(self.available_voices.keys())}")

    def get_voice_path(self, speaker_name: str) -> str:
        """Get voice file path for a given speaker name"""
        # First try exact match
        speaker_name = speaker_name.lower()
        if speaker_name in self.voice_presets:
            return self.voice_presets[speaker_name]
        
        # Try partial matching (case insensitive)
        matched_path = None
        for preset_name, path in self.voice_presets.items():
            if preset_name.lower() in speaker_name or speaker_name in preset_name.lower():
                if matched_path is not None:
                    raise ValueError(f"Multiple voice presets match the speaker name '{speaker_name}', please make the speaker_name more specific.")
                matched_path = path
        if matched_path is not None:
            return matched_path
        
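        # Default to the first voice if no match is found
        if not self.voice_presets:
            raise ValueError(f"No voice presets are available, so '{speaker_name}' cannot be resolved.")
        default_voice = next(iter(self.voice_presets.values()))
        print(f"Warning: No voice preset found for '{speaker_name}', using default voice: {default_voice}")
        return default_voice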


APEX FusedRMSNorm not available, using native implementation
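
The lookup order used by `VoiceMapper.get_voice_path` (exact match, then a unique substring match, then the first preset) can be tried in isolation. The following is a minimal sketch, not the VibeVoice implementation: `resolve_voice` and the preset paths are hypothetical names used only for illustration.

```python
from typing import Dict


def resolve_voice(speaker_name: str, presets: Dict[str, str]) -> str:
    """Resolve a speaker name: exact match, then unique substring match, then first preset."""
    if not presets:
        raise ValueError("No voice presets available")
    key = speaker_name.lower()
    # 1. Exact match on the lowercased preset name
    if key in presets:
        return presets[key]
    # 2. Substring match in either direction; more than one hit is ambiguous
    matches = [path for name, path in presets.items() if name in key or key in name]
    if len(matches) > 1:
        raise ValueError(f"Multiple voice presets match '{speaker_name}'")
    if matches:
        return matches[0]
    # 3. Fall back to the first preset
    return next(iter(presets.values()))


# Hypothetical preset paths, keyed by lowercased file name as in VoiceMapper
presets = {
    "en-carter_man": "/voices/en-Carter_man.pt",
    "en-emma_woman": "/voices/en-Emma_woman.pt",
}

print(resolve_voice("en-carter_man", presets))  # exact match
print(resolve_voice("Carter", presets))         # unique substring match
print(resolve_voice("unknown", presets))        # falls back to the first preset
```

A short name such as `"en"` matches both presets here and raises `ValueError`, which is why the notebook passes the full preset name `en-carter_man`.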


In [7]:


model_path = "microsoft/VibeVoice-Realtime-0.5B"
txt_path = "file.txt"
speaker_name = "en-carter_man"
output_dir = "./"  # the generated .wav is saved to the current folder
device = "cuda"
cfg_scale = 1.5
# Initialize voice mapper
voice_mapper = VoiceMapper()

# Check that the txt file exists; stop here rather than failing later on open()
if not os.path.exists(txt_path):
    raise FileNotFoundError(f"txt file not found: {txt_path}")

# Read and parse txt file
print(f"Reading script from: {txt_path}")
with open(txt_path, 'r', encoding='utf-8') as f:
    scripts = f.read().strip()

# An empty script would produce no audio, so fail early
if not scripts:
    raise ValueError("No valid scripts found in the txt file")

full_script = scripts.replace("’", "'").replace('“', '"').replace('”', '"')

print(f"Loading processor & model from {model_path}")
processor = VibeVoiceStreamingProcessor.from_pretrained(model_path)

# Decide dtype & attention implementation
if device == "mps":
    load_dtype = torch.float32  # MPS requires float32
    attn_impl_primary = "sdpa"  # flash_attention_2 not supported on MPS
elif device == "cuda":
    load_dtype = torch.bfloat16
    attn_impl_primary = "flash_attention_2"
else:  # cpu
    load_dtype = torch.float32
    attn_impl_primary = "sdpa"
print(f"Using device: {device}, torch_dtype: {load_dtype}, attn_implementation: {attn_impl_primary}")
# Load model with device-specific logic
try:
    if device == "mps":
        model = VibeVoiceStreamingForConditionalGenerationInference.from_pretrained(
            model_path,
            torch_dtype=load_dtype,
            attn_implementation=attn_impl_primary,
            device_map=None,  # load then move
        )
        model.to("mps")
    elif device == "cuda":
        model = VibeVoiceStreamingForConditionalGenerationInference.from_pretrained(
            model_path,
            torch_dtype=load_dtype,
            device_map="cuda",
            attn_implementation=attn_impl_primary,
        )
    else:  # cpu
        model = VibeVoiceStreamingForConditionalGenerationInference.from_pretrained(
            model_path,
            torch_dtype=load_dtype,
            device_map="cpu",
            attn_implementation=attn_impl_primary,
        )
except Exception as e:
    if attn_impl_primary == 'flash_attention_2':
        print(f"[ERROR] : {type(e).__name__}: {e}")
        print(traceback.format_exc())
        print("Error loading the model. Trying to use SDPA. However, note that only flash_attention_2 has been fully tested, and using SDPA may result in lower audio quality.")
        model = VibeVoiceStreamingForConditionalGenerationInference.from_pretrained(
            model_path,
            torch_dtype=load_dtype,
            device_map=(device if device in ("cuda", "cpu") else None),
            attn_implementation='sdpa'
        )
        if device == "mps":
            model.to("mps")
    else:
        # Re-raise the original exception with its traceback intact
        raise


model.eval()
model.set_ddpm_inference_steps(num_steps=5)

if hasattr(model.model, 'language_model'):
    print(f"Language model attention: {model.model.language_model.config._attn_implementation}")

target_device = device
voice_sample = voice_mapper.get_voice_path(speaker_name)
print(f"Using voice preset for {speaker_name}: {voice_sample}")
all_prefilled_outputs = torch.load(voice_sample, map_location=target_device, weights_only=False)

# Prepare inputs for the model
inputs = processor.process_input_with_cached_prompt(
    text=full_script,
    cached_prompt=all_prefilled_outputs,
    padding=True,
    return_tensors="pt",
    return_attention_mask=True,
)

# Move tensors to target device
for k, v in inputs.items():
    if torch.is_tensor(v):
        inputs[k] = v.to(target_device)

print(f"Starting generation with cfg_scale: {cfg_scale}")

# Generate audio
start_time = time.time()
outputs = model.generate(
    **inputs,
    max_new_tokens=None,
    cfg_scale=cfg_scale,
    tokenizer=processor.tokenizer,
    generation_config={'do_sample': False},
    verbose=True,
    all_prefilled_outputs=copy.deepcopy(all_prefilled_outputs) if all_prefilled_outputs is not None else None,
)
generation_time = time.time() - start_time
print(f"Generation time: {generation_time:.2f} seconds")

# Calculate audio duration and additional metrics
if outputs.speech_outputs and outputs.speech_outputs[0] is not None:
    # Assuming 24kHz sample rate (common for speech synthesis)
    sample_rate = 24000
    audio_samples = outputs.speech_outputs[0].shape[-1] if len(outputs.speech_outputs[0].shape) > 0 else len(outputs.speech_outputs[0])
    audio_duration = audio_samples / sample_rate
    rtf = generation_time / audio_duration if audio_duration > 0 else float('inf')
    
    print(f"Generated audio duration: {audio_duration:.2f} seconds")
    print(f"RTF (Real Time Factor): {rtf:.2f}x")
else:
    # Nothing to save or summarize, so stop here instead of failing in the cells below
    raise RuntimeError("No audio output generated")

# Calculate token metrics
input_tokens = inputs['tts_text_ids'].shape[1]  # Number of input tokens
output_tokens = outputs.sequences.shape[1]  # Total tokens (input + generated)
generated_tokens = output_tokens - input_tokens - all_prefilled_outputs['tts_lm']['last_hidden_state'].size(1)

print(f"Prefilling text tokens: {input_tokens}")
print(f"Generated speech tokens: {generated_tokens}")
print(f"Total tokens: {output_tokens}")

# Save output (processor handles device internally)
txt_filename = os.path.splitext(os.path.basename(txt_path))[0]
output_path = os.path.join(output_dir, f"{txt_filename}_generated.wav")
os.makedirs(output_dir, exist_ok=True)

processor.save_audio(
    outputs.speech_outputs[0], # First (and only) batch item
    output_path=output_path,
)
print(f"Saved output to {output_path}")

# Print summary
print("\n" + "="*50)
print("GENERATION SUMMARY")
print("="*50)
print(f"Input file: {txt_path}")
print(f"Output file: {output_path}")
print(f"Speaker names: {speaker_name}")
print(f"Prefilling text tokens: {input_tokens}")
print(f"Generated speech tokens: {generated_tokens}")
print(f"Total tokens: {output_tokens}")
print(f"Generation time: {generation_time:.2f} seconds")
print(f"Audio duration: {audio_duration:.2f} seconds")
print(f"RTF (Real Time Factor): {rtf:.2f}x")

print("="*50)



loading file vocab.json from cache at /root/.cache/huggingface/hub/models--Qwen--Qwen2.5-0.5B/snapshots/060db6499f32faf8b98477b0a26969ef7d8b9987/vocab.json
loading file merges.txt from cache at /root/.cache/huggingface/hub/models--Qwen--Qwen2.5-0.5B/snapshots/060db6499f32faf8b98477b0a26969ef7d8b9987/merges.txt
loading file tokenizer.json from cache at /root/.cache/huggingface/hub/models--Qwen--Qwen2.5-0.5B/snapshots/060db6499f32faf8b98477b0a26969ef7d8b9987/tokenizer.json
loading file added_tokens.json from cache at None
loading file special_tokens_map.json from cache at None
loading file tokenizer_config.json from cache at /root/.cache/huggingface/hub/models--Qwen--Qwen2.5-0.5B/snapshots/060db6499f32faf8b98477b0a26969ef7d8b9987/tokenizer_config.json
loading file chat_template.jinja from cache at None
The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 

Found 61 voice files in ./demo/voices/streaming_model
Available voices: de-spk0_man, de-spk1_woman, de-spk2_woman, de-spk3_man, de-spk4_woman, de-spk5_man, de-spk6_man, en-breeze_woman, en-brutalon_man, en-carter_man, en-clarion_man, en-clarissa_woman, en-davis_man, en-emma_woman, en-frank_man, en-grace_woman, en-gravitar_man, en-gravus_man, en-mechcorsair_man, en-mike_man, en-oldenheart_man, en-silkvox_man, en-snarkling_woman, en-soother_woman, fr-spk0_man, fr-spk1_woman, fr-spk2_man, fr-spk3_woman, fr-spk4_woman, fr-spk5_man, in-samuel_man, it-spk0_woman, it-spk1_man, jp-spk0_man, jp-spk1_woman, jp-spk2_woman, jp-spk3_woman, jp-spk4_woman, jp-spk5_man, kr-spk0_woman, kr-spk1_man, kr-spk2_woman, kr-spk3_man, nl-spk0_man, nl-spk1_woman, pl-spk0_man, pl-spk1_woman, pl-spk2_man, pl-spk3_woman, pt-spk0_woman, pt-spk1_man, pt-spk2_woman, pt-spk3_man, pt-spk4_man, pt-spk5_woman, sp-spk0_woman, sp-spk1_man, sp-spk2_woman, sp-spk3_man, sp-spk4_woman, sp-spk5_man
de-spk0_man: /home/VibeVoice/d

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
loading configuration file config.json from cache at /root/.cache/huggingface/hub/models--microsoft--VibeVoice-Realtime-0.5B/snapshots/6bce5f06044837fe6d2c5d7a71a84f0416bd57e4/config.json
Model config VibeVoiceStreamingConfig {
  "acoustic_tokenizer_config": {
    "causal": true,
    "channels": 1,
    "conv_bias": true,
    "conv_norm": "none",
    "corpus_normalize": 0.0,
    "decoder_depths": null,
    "decoder_n_filters": 32,
    "decoder_ratios": [
      8,
      5,
      5,
      4,
      2,
      2
    ],
    "disable_last_norm": true,
    "encoder_depths": "3-3-3-3-3-3-8",
    "encoder_n_filters": 32,
    "encoder_ratios": [
      8,
      5,
      5,
      4,
      2,
      2
    ],
    "fix_std": 0.5,
    "layer_scale_init_value": 1e-06,
    "layernorm": "RMSNorm",
    "layernorm_elementwise_affine": true,
    "layernorm_eps": 1e-05,
    "mixer_layer": "depthw

Using device: cuda, torch_dtype: torch.bfloat16, attn_implementation: flash_attention_2
[ERROR] : ImportError: FlashAttention2 has been toggled on, but it cannot be used due to the following error: the package flash_attn seems to be not installed. Please refer to the documentation of https://huggingface.co/docs/transformers/perf_infer_gpu_one#flashattention-2 to install Flash Attention 2.
Traceback (most recent call last):
  File "/tmp/ipykernel_437806/4057761752.py", line 50, in <module>
    model = VibeVoiceStreamingForConditionalGenerationInference.from_pretrained(
  File "/usr/local/lib/python3.10/dist-packages/transformers/modeling_utils.py", line 279, in _wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/modeling_utils.py", line 4336, in from_pretrained
    config = cls._autoset_attn_implementation(
  File "/usr/local/lib/python3.10/dist-packages/transformers/modeling_utils.py", line 2109, in _autoset_attn_implementation
    cls
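
The `ImportError` above is raised because `flash_attn` is not installed, after which the cell falls back to SDPA. One way to avoid the failed first load is to check for the package before choosing the attention implementation. This is a minimal sketch under that assumption: `pick_attn_implementation` is a hypothetical helper, and finding the package does not guarantee that it imports cleanly on a given CUDA setup.

```python
import importlib.util


def pick_attn_implementation(device: str) -> str:
    """Return "flash_attention_2" only on CUDA when flash_attn can be found, else "sdpa"."""
    # find_spec returns None when the top-level package is not installed
    if device == "cuda" and importlib.util.find_spec("flash_attn") is not None:
        return "flash_attention_2"
    return "sdpa"


for dev in ("cuda", "mps", "cpu"):
    print(f"{dev}: {pick_attn_implementation(dev)}")
```

The result could be assigned to `attn_impl_primary` in place of the hard-coded per-device values. The notebook's own warning still applies: only `flash_attention_2` has been fully tested, and SDPA may produce lower audio quality.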

All model checkpoint weights were used when initializing VibeVoiceStreamingForConditionalGenerationInference.

All the weights of VibeVoiceStreamingForConditionalGenerationInference were initialized from the model checkpoint at microsoft/VibeVoice-Realtime-0.5B.
If your task is similar to the task the model of the checkpoint was trained on, you can already use VibeVoiceStreamingForConditionalGenerationInference for predictions without further training.
Generation config file not found, using a generation config created from the model config.


Language model attention: sdpa
Using voice preset for en-carter_man: /home/VibeVoice/demo/voices/streaming_model/en-Carter_man.pt
Starting generation with cfg_scale: 1.5



Generation time: 83.14 seconds
Generated audio duration: 266.40 seconds
RTF (Real Time Factor): 0.31x
Prefilling text tokens: 646
Generated speech tokens: 1998
Total tokens: 2960
Saved output to ./file_generated.wav

GENERATION SUMMARY
Input file: file.txt
Output file: ./file_generated.wav
Speaker names: en-carter_man
Prefilling text tokens: 646
Generated speech tokens: 1998
Total tokens: 2960
Generation time: 83.14 seconds
Audio duration: 266.40 seconds
RTF (Real Time Factor): 0.31x
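
The summary figures can be checked by hand. The sketch below recomputes the real-time factor and the token accounting from the numbers printed above; `real_time_factor` is a hypothetical helper name, and the 24 kHz sample rate is the value assumed in the generation cell.

```python
def real_time_factor(generation_seconds: float, audio_seconds: float) -> float:
    """Seconds of compute per second of audio; values below 1.0 are faster than real time."""
    if audio_seconds <= 0:
        return float("inf")
    return generation_seconds / audio_seconds


# Values copied from the generation summary above
generation_time = 83.14
audio_duration = 266.40
input_tokens = 646
generated_tokens = 1998
total_tokens = 2960
sample_rate = 24000  # assumed in the generation cell

rtf = real_time_factor(generation_time, audio_duration)
print(f"RTF: {rtf:.2f}x")

# The generation cell computes generated = total - input - cached prompt length,
# so the cached voice prompt accounts for the remaining tokens
cached_prompt_tokens = total_tokens - input_tokens - generated_tokens
print(f"Cached voice prompt tokens: {cached_prompt_tokens}")

# Speech tokens emitted per second of generated audio
tokens_per_second = generated_tokens / audio_duration
print(f"Speech tokens per second of audio: {tokens_per_second:.2f}")

# Total audio samples at the assumed sample rate
print(f"Audio samples: {round(audio_duration * sample_rate)}")
```

An RTF of 0.31 means the 266.40 seconds of audio took about a third of that time to generate, even with the SDPA fallback.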
