High quality models not running on iPhone 13. Medium Quality models can only speak short phrases. #1
Comments
Another example demonstrating this, using the en_GB-Jenny medium-quality voice in the Speech Central app. You can see it performing quite well -- until it hits a longer sentence, at which point the system falls back to the compact Siri voice. RPReplay_Final1717895785-HD.720p.mov
Hello @S-Ali-Zaidi, the problem with running high quality models is their size. The Apple text-to-speech engine is built on top of an Audio Unit speech extension, which limits memory usage to roughly 60 MB. This is one of the reasons why this application is no more than a proof of concept, in addition to the missing SSML support and the missing callbacks for finished sentences and words.
Ah, I had no idea of the ~60 MB limitation, which was certainly frustrating to find out. However, I did a little poking around the model parameters (with the help of ChatGPT, as this is not my domain) and noted that essentially all Piper model weights are stored in FP32:

import onnx
import numpy as np

# Load the ONNX model
model_path = "/Users/s.alizaidi/Programming/TTS_ALL/Piper_TTS/modules/piper/models/en_GB/en_GB-jenny.onnx"
model = onnx.load(model_path)

# Convert ONNX TensorProto data types to a human-readable format
def tensor_dtype(tensor):
    if tensor.data_type == onnx.TensorProto.FLOAT:
        return 'float32'
    elif tensor.data_type == onnx.TensorProto.FLOAT16:
        return 'float16'
    elif tensor.data_type == onnx.TensorProto.INT8:
        return 'int8'
    elif tensor.data_type == onnx.TensorProto.INT32:
        return 'int32'
    elif tensor.data_type == onnx.TensorProto.INT64:
        return 'int64'
    # Add other data types as needed
    else:
        return 'unknown'

# Iterate through the initializers and print their data types
for initializer in model.graph.initializer:
    weight_name = initializer.name
    weight_dtype = tensor_dtype(initializer)
    print(f"Weight Name: {weight_name}, Data Type: {weight_dtype}")

I found this page, which gave instructions on reducing ONNX weights to FP16. After a few failed attempts, I managed to convert the Jenny Medium model to FP16 with this script:

import onnx
from onnxconverter_common import float16
# Load the model
model_path = "/Users/s.alizaidi/Programming/TTS_ALL/Piper_TTS/modules/piper/models/en_GB/jenny.onnx"
model = onnx.load(model_path)
# Convert the model to float16, keeping the inputs and outputs as float32
model_fp16 = float16.convert_float_to_float16(
    model,
    keep_io_types=True,
    op_block_list=['RandomNormalLike', 'Range']
)
# Save the converted model
onnx.save(model_fp16, "/Users/s.alizaidi/Programming/TTS_ALL/Piper_TTS/modules/piper/models/en_GB/jenny_fp16.onnx")

Happy to report that the en_GB-Jenny-Medium model is now 32 MB in size, rather than 63 MB -- and it still runs inference in Piper perfectly fine! I tried the same on the uk_UA-lada model, and after conversion to FP16 it is reduced from 20 MB to 10 MB. Attaching a drive link to the FP16 files and audio samples from the FP32 vs FP16 models: https://drive.google.com/drive/folders/1WlB4GBs1mohKi_8y9AxMztFZKokXUMl1?usp=share_link I'll report back once I test the FP16 models within a build of the iOS app. Hopefully this makes the Piper iOS app more feasible to develop!
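(For anyone repeating this, a minimal sanity-check sketch -- the paths below are placeholders, not the exact ones used above -- to validate the converted graph and compare on-disk sizes before dropping the model into the app:)

import os
import onnx

# Hypothetical paths -- substitute the fp32/fp16 model paths used above.
fp32_path = "models/en_GB/jenny.onnx"
fp16_path = "models/en_GB/jenny_fp16.onnx"

# Structural validation of the converted graph (raises if the model is malformed).
onnx.checker.check_model(onnx.load(fp16_path))

# Compare on-disk sizes; FP16 weights should come out at roughly half the FP32 size.
for label, path in [("fp32", fp32_path), ("fp16", fp16_path)]:
    print(f"{label}: {os.path.getsize(path) / 1e6:.1f} MB")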
Hello! A brief update: with FP16 weights, the uk_UA-lada model, now about 10 MB in size, seems to run pretty well on my iPhone 13 -- both within the app and when called on as a system voice in various applications. It still has the issue where longer sentences cause the system to fall back to a compact default voice.

The same goes for the Jenny FP16 model, which takes up about 20-30 MB of space -- but the "tolerance" iOS allows before falling back to a default voice is shorter, in terms of sentence length, than for the 10 MB Lada model. You can see this demonstrated in the video, where it lags on a sentence in the middle of the excerpt and then stops entirely at the final sentence -- that is usually when the debugging terminal in Xcode displays an error.

copy_2F019055-8040-4109-AB11-03144958F861.mov

In terms of memory usage: when I generate texts of various lengths and complexity in the app itself using Jenny -- on my iPhone, my Mac, and various simulated iOS devices -- I find that memory usage never really goes above 50 MB and typically hovers around 45 MB. It tends to be VERY responsive and quick in the Swift app on my Mac and in the various iOS simulations.

What I'm wondering is whether the fallback seen on real iOS devices like the iPhone 13 might have less to do with the 80 MB memory limit on Audio-Extension-based apps, and more to do with the request timing out against some internal threshold iOS has for TTS generation. If so, I wonder how a VITS-based app like yours might fare if it used a more efficient, natural, and compact variant of VITS -- such as a mini MB-iSTFT-VITS, which tends to have a much better real-time factor (0.02x) in speech generation than the vanilla VITS used by Piper (0.27x or 0.099x depending on model size). If I calculated correctly, a 7M-parameter mini MB model should take up about 15 MB or less when stored with FP16 weights.

I'm also wondering whether CoreML conversions of any VITS model might speed up performance and/or reduce memory usage enough to make them usable within the AVSpeechSynthesisProvider extension framework. Would be curious to hear your thoughts! Thank you!
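(A quick back-of-the-envelope check of that size estimate -- a sketch only, where the 7M parameter count is the assumed figure for a mini MB-iSTFT-VITS model and graph metadata is ignored:)

# Rough on-disk weight size for a hypothetical 7M-parameter model.
params = 7_000_000
bytes_per_param = {"fp32": 4, "fp16": 2, "int8": 1}

for dtype, nbytes in bytes_per_param.items():
    print(f"{dtype}: ~{params * nbytes / 1e6:.0f} MB")
# fp32: ~28 MB, fp16: ~14 MB, int8: ~7 MB -- consistent with "about 15 MB or less" in FP16.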
@S-Ali-Zaidi it'd be cool if you'd publish your stuff on your fork. I've been trying to get this repo to work with a few Piper voices (e.g. Amy medium), because none of the built-in iOS voices have the quality I need for my PokéDex app, but they end up extremely weirdly pitched -- so it'd be cool to have a second reference point for using different voices.
I’ll be returning to this project soon, and happy to push updates to my fork when I do! Most of what I’ve been doing has been on the model quantization side, rather than making changes to the actual scripts themselves. THAT SAID -- if you are having issues with weird pitch in your voice, I am 90% certain that is because of a sample rate mismatch between your model and the rate the audio unit is rendering the audio at. Note that the sample rate for Amy Medium is 22,050 Hz, according to the model card (and likely the config JSON):
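(A minimal sketch for checking this on any voice -- assuming the voice ships with the usual Piper-style .onnx.json config carrying an audio.sample_rate field; the path is a placeholder:)

import json

# Hypothetical path -- point this at the .onnx.json that ships with your voice.
config_path = "models/en_US/en_US-amy-medium.onnx.json"

with open(config_path) as f:
    config = json.load(f)

# Piper voice configs normally record the training sample rate here.
print("model sample rate:", config["audio"]["sample_rate"])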
You need to make sure the sample rate set within PiperAudioUnit.swift matches the sample rate of the Piper model you are using. Check line 28:

self.format = AVAudioFormat(commonFormat: .pcmFormatFloat32, sampleRate: 16000.0, channels: 1, interleaved: true)!

Note that by default it is set to:

self.format = AVAudioFormat(commonFormat: .pcmFormatFloat32, sampleRate: 22050.0, channels: 1, interleaved: true)!

I'm pretty sure this is your issue -- a sample rate mismatch is the only reason I can think of for getting high-pitched audio output. Make sure to also update any other references accordingly. For reference, here is how it looks on my end after I made these modifications for a 22,050 Hz Lessac Piper model:

//
// piperttsAudioUnit.swift
// pipertts
//
// Created by Ihor Shevchuk on 27.12.2023.
//
// NOTE:- An Audio Unit Speech Extension (ausp) is rendered offline, so it is safe to use
// Swift in this case. It is not recommended to use Swift in other AU types.
import AVFoundation
import piper_objc
import PiperappUtils
public class PiperttsAudioUnit: AVSpeechSynthesisProviderAudioUnit {
    private var outputBus: AUAudioUnitBus
    private var _outputBusses: AUAudioUnitBusArray!
    private var request: AVSpeechSynthesisProviderRequest?
    private var format: AVAudioFormat

    var piper: Piper?

    @objc override init(componentDescription: AudioComponentDescription, options: AudioComponentInstantiationOptions) throws {
        self.format = AVAudioFormat(commonFormat: .pcmFormatFloat32, sampleRate: 22050.0, channels: 1, interleaved: true)!
        outputBus = try AUAudioUnitBus(format: self.format)
        try super.init(componentDescription: componentDescription, options: options)
        _outputBusses = AUAudioUnitBusArray(audioUnit: self, busType: AUAudioUnitBusType.output, busses: [outputBus])
    }

    public override var outputBusses: AUAudioUnitBusArray {
        return _outputBusses
    }

    public override func allocateRenderResources() throws {
        try super.allocateRenderResources()
        Log.debug("allocateRenderResources")
        if piper == nil {
            let model = Bundle.main.path(forResource: "lessac_med_fp16", ofType: "onnx")!
            let config = Bundle.main.path(forResource: "lessac_med_fp16.onnx", ofType: "json")!
            piper = Piper(modelPath: model, andConfigPath: config)
        }
    }

    public override func deallocateRenderResources() {
        super.deallocateRenderResources()
        piper = nil
    }

    // MARK: - Rendering
    /*
     NOTE:- It is only safe to use Swift for audio rendering in this case, as Audio Unit Speech Extensions process offline.
     (Swift is not usually recommended for processing on the realtime audio thread)
     */
    public override var internalRenderBlock: AUInternalRenderBlock {
        return { [weak self] actionFlags, _, frameCount, _, outputAudioBufferList, _, _ in
            guard let self = self,
                  let piper = self.piper else {
                actionFlags.pointee = .unitRenderAction_PostRenderError
                Log.error("Utterance Client is nil while request for rendering came.")
                return kAudioComponentErr_InstanceInvalidated
            }
            if piper.completed() && !piper.hasSamplesLeft() {
                Log.debug("Completed rendering")
                actionFlags.pointee = .offlineUnitRenderAction_Complete
                self.cleanUp()
                return noErr
            }
            if !piper.readyToRead() {
                actionFlags.pointee = .offlineUnitRenderAction_Preflight
                Log.debug("No bytes yet.")
                return noErr
            }
            let levelsData = piper.popSamples(withMaxLength: UInt(frameCount))
            guard let levelsData else {
                actionFlags.pointee = .offlineUnitRenderAction_Preflight
                Log.debug("Rendering in progress. No bytes.")
                return noErr
            }
            outputAudioBufferList.pointee.mNumberBuffers = 1
            var unsafeBuffer = UnsafeMutableAudioBufferListPointer(outputAudioBufferList)[0]
            let frames = unsafeBuffer.mData!.assumingMemoryBound(to: Float.self)
            unsafeBuffer.mDataByteSize = UInt32(levelsData.count)
            unsafeBuffer.mNumberChannels = 1
            for frame in 0..<levelsData.count {
                frames[Int(frame)] = levelsData[Int(frame)].int16Value.toFloat()
            }
            actionFlags.pointee = .offlineUnitRenderAction_Render
            Log.debug("Rendering \(levelsData.count) bytes")
            return noErr
        }
    }

    public override func synthesizeSpeechRequest(_ speechRequest: AVSpeechSynthesisProviderRequest) {
        Log.debug("synthesizeSpeechRequest \(speechRequest.ssmlRepresentation)")
        self.request = speechRequest
        let text = AVSpeechUtterance(ssmlRepresentation: speechRequest.ssmlRepresentation)?.speechString
        piper?.cancel()
        piper?.synthesize(text ?? "")
    }

    public override func cancelSpeechRequest() {
        Log.debug("\(#file) cancelSpeechRequest")
        cleanUp()
        piper?.cancel()
    }

    func cleanUp() {
        request = nil
    }

    public override var speechVoices: [AVSpeechSynthesisProviderVoice] {
        get {
            return [
                AVSpeechSynthesisProviderVoice(name: "Lessac", identifier: "pipertts", primaryLanguages: ["en_US"], supportedLanguages: ["en_US", "en_GB"])
            ]
        }
        set { }
    }

    public override var canProcessInPlace: Bool {
        return true
    }
}

Note that you are still likely to run into issues on iOS -- especially on older iPhones like my iPhone 13. Due to the way Apple integrates third-party TTS systems into iOS, there are strict limits on how much RAM Piper TTS can use -- about 60-80 MB. Unless you are using a heavily quantized model, you are going to find that your iPhone may fail to render any speech from Piper, or may cut off on longer sentences.

I'm currently exploring INT8 quantization-aware training of Piper models, as well as retraining some entirely using a tokenizer vocabulary instead of phonemes or graphemes. If that works out, it may result in Piper TTS models being able to run smoothly on iOS. Until then -- don't expect it to work consistently, even with the above fixes!
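(For anyone who wants to experiment before any quantization-aware training lands: ONNX Runtime's post-training dynamic quantization is a much simpler, cruder alternative. A minimal sketch -- this is not the QAT approach described above, it mostly affects MatMul/Gemm weights, the paths are placeholders, and audio quality on a VITS model may or may not survive it:)

from onnxruntime.quantization import quantize_dynamic, QuantType

# Hypothetical paths -- substitute your own fp32 Piper voice.
quantize_dynamic(
    model_input="models/en_US/lessac-medium.onnx",
    model_output="models/en_US/lessac-medium.int8.onnx",
    weight_type=QuantType.QInt8,  # store eligible weights as int8
)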
How is your 8-bit quantization coming along?

Is there any documentation about this memory limit anywhere? I couldn't find those values officially documented anywhere, and this is a pretty substantial limit for TTS voices!
Haven’t gotten around to it yet, as there are a couple of things I'm thinking about:

Thus, what I'm coming to realize is that a more efficient way of encoding the text itself is needed, rather than phoneme-based encoding, since phonemes come out to roughly the same count per input text as characters. Language models have made great strides here by training byte-pair-encoding or unigram tokenizers, which tokenize at the subword or word level.

I've trained a tokenizer with a vocabulary of about 4000. When I use it to encode texts, I'm seeing a 50-80% reduction in the number of vectors needed to represent a string of text before it is converted into intermediate representations by the embedding layer. A vocab of 4000-10,000 seems like a good balance: it compresses the text efficiently without ballooning the model size too much from the larger embedding matrix. (A minimal tokenizer-training sketch follows at the end of this comment.)

But this will require re-training a VITS (ideally VITS2) model from scratch on tokenized graphemes rather than phonemes. So I'm currently focused on finishing my neural networks courses (I'm doing a post-bachelor's program in ML). Once I've finished that, I'll get back to training a new VITS model within the Piper framework that will ideally:

It's a relatively complicated project compared to simply quantizing an existing VITS model, so it will take a little time. I don't expect to even start it until sometime this fall, after I finish my current quarter of coursework.
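(The tokenizer-training sketch mentioned above -- a hedged example only; the corpus path is a placeholder and the 4096 vocab size just mirrors the ballpark discussed here:)

import sentencepiece as spm

# Train a unigram tokenizer on a hypothetical text corpus (one sentence per line).
spm.SentencePieceTrainer.train(
    input="corpus.txt",           # placeholder corpus path
    model_prefix="piper_unigram",
    vocab_size=4096,
    model_type="unigram",
    character_coverage=1.0,
)

# Compare sequence lengths: characters (a rough proxy for phoneme count) vs. tokens.
sp = spm.SentencePieceProcessor(model_file="piper_unigram.model")
text = "The apple never falls far from the tree."
tokens = sp.encode(text, out_type=int)
print(len(text), "characters ->", len(tokens), "tokens")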
The documentation is very sparse -- I had to dig deep through the Apple Developer Forums to find mentions of it. And even then, I think that with the latest iOS and macOS updates it may be more of a dynamic limit that depends on total system resources. I find that my MacBook Pro (M1 Pro, 16 GB RAM) has effectively no limit on how long an input it can process with VITS / Piper via the Audio Unit extension. The limit still seems to exist, but is not terrible, on newer iPhones and iPads with 8+ GB of RAM. As you work your way down to older and cheaper iOS and iPadOS devices with less RAM, the text-input limit becomes shorter and shorter. But yes, there seems to be no solid documentation from Apple on how it works, which sucks.
I still don't understand why you want to use a tokenizer/grapheme-based model instead of the phoneme-based model. Isn't the phoneme-based model much more lightweight, since we only have to deal with a limited set of phoneme IDs instead of a dictionary/unigram-based tokenizer model that adds a few thousand tokens to the input layer?
As for the memory explosion: have you seen this documentation? https://onnxruntime.ai/docs/get-started/with-c.html There are a few options available for controlling memory consumption. Maybe it would be worth investigating those?
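(For reference, the kind of knobs I mean look roughly like this in the Python API -- a sketch only; the app goes through the C/Objective-C bindings, where similar options should exist, and whether they actually help for VITS is untested here:)

import onnxruntime as ort

# Hypothetical model path.
model_path = "models/en_GB/jenny_fp16.onnx"

so = ort.SessionOptions()
so.enable_cpu_mem_arena = False   # don't grow a large memory arena up front
so.enable_mem_pattern = False     # don't pre-plan/cache memory allocation patterns
so.enable_mem_reuse = True        # allow intermediate buffers to be reused across nodes

session = ort.InferenceSession(model_path, sess_options=so)
print([i.name for i in session.get_inputs()])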
What would be interesting to know is whether the memory consumption is dominated by the encoder of the VITS model or by the decoder. I could imagine that the upsampling code of HiFi-GAN is causing the memory explosion. It also takes by far the most time (> 90%). It might be worthwhile to swap HiFi-GAN for iSTFT. Have you already done this experiment?
If one uses a pre-trained tokenizer (for example, a unigram model trained on a custom English corpus in SentencePiece, with a vocab size of 4096), the embedding matrix will indeed be larger -- 4096 tokens × 192 dimensions versus something like 200 tokens × 192 dimensions -- but the difference is pretty trivial compared to the size of the overall model. Vocabulary size: 4096 vs. 200 | Embedding dimensions: 192
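(The arithmetic, as a quick sketch -- assuming FP32 weights and counting only the embedding table itself:)

# Size of the embedding matrix alone, for the two vocabulary sizes above.
dim = 192
for vocab in (4096, 200):
    fp32_mb = vocab * dim * 4 / 1e6   # 4 bytes per FP32 weight
    fp16_mb = fp32_mb / 2
    print(f"vocab {vocab}: ~{fp32_mb:.2f} MB fp32, ~{fp16_mb:.2f} MB fp16")
# vocab 4096: ~3.15 MB fp32, ~1.57 MB fp16
# vocab  200: ~0.15 MB fp32, ~0.08 MB fp16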
So as you can see, even if one never does any model quantization, the memory cost of using a custom-trained tokenizer adds nothing substantial to the model size, and such an embedding matrix would only need to be instantiated once during the VITS model run. It would be a one-time, static increase of around 1-3 MB when migrating to a unigram tokenizer with a vocab size of around 4096. But let's consider the impact on memory for a simple sentence like "The apple never falls far from the tree." Phoneme representation (approx. 22 phonemes):
Hypothetical 4096 Vocabulary Tokenizer (8 tokens: "The", "apple", "never", "falls", "far", "from", "the", "tree.")
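(Again as a rough sketch, counting only the embedding-layer output vectors for that one sentence at 192 dimensions each:)

# Vectors produced by the embedding layer for the example sentence above.
dim = 192
for label, n_vectors in [("phonemes", 22), ("unigram tokens", 8)]:
    print(f"{label}: {n_vectors} x {dim} x 4 bytes = {n_vectors * dim * 4 / 1024:.1f} KiB (fp32)")
# phonemes:       22 x 192 x 4 bytes = 16.5 KiB
# unigram tokens:  8 x 192 x 4 bytes =  6.0 KiB  -> roughly a 63% reduction in sequence vectors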
So as we can see, going from an FP32 phoneme-based embedding to an FP32 4096-entry pre-trained tokenizer results in roughly a 63% reduction in the number of vectors output by the embedding matrix. If we were to quantize the model (including the embedding matrix) to INT8, we could see up to a 90% reduction in the size of the initial vector representation of the text.

So, as you can imagine, if you feed longer inputs to a VITS / Piper model trained with a byte-pair or unigram tokenizer, it could be much more memory-efficient on constrained frameworks like Apple's Audio Unit extension -- especially under the naive assumption that we have to keep in memory all the intermediate vector representations of the input text as it passes through the various transformer and convolutional layers. If I understand the VITS architecture correctly, we could see this 60-90% decrease in intermediate representations across all the vectors and activations generated during inference, which seems to be the source of the intense memory ballooning seen when running it within Apple's Audio Unit extension framework.

My suspicion is that, with a diverse and large enough speech-text training corpus (natural datasets like LJSpeech, augmented with synthetic data -- e.g. using ElevenLabs or Tortoise to generate additional speech-text pairs that cover the 4096-token vocabulary extensively), such a model (especially with the VITS2 architecture) might perform better -- sound more natural and adaptable -- than one trained on phonemes alone. A lot of the research I've read consistently points in this direction, which is another reason I'd like to test it.

That said, regarding the memory consumption controls available in ONNX Runtime, I will absolutely take a look and see whether that might be a solution for the immediate term. I can't recall the computational graph of VITS off the top of my head; I'll need to look back at the PyTorch scripts and check that intermediate vector representations are not kept beyond their immediate producer and consumer layers. If so, this could be a viable solution.

And regarding iSTFT: yes, I'm IMMENSELY interested in that! In fact, my target model for training on a 4096-entry tokenizer is the MB-iSTFT-VITS2 architecture that some folks put together and got very good results with. It could make the model both smaller in memory footprint and faster. But I won't get to testing all of this for a little while -- I'm currently deep in the mud of my neural networks courses -- once I have the time, I will try all of this out and report back with my results!
Hello -- I've managed to build and run the application on my iPhone 13 Pro Max.
I’m finding that any high quality models I attempt to build and run on the iPhone 13 fail to run -- and iOS falls back to using the compact Siri voice.
Medium quality models are able to run -- but only speak in short phrases. If given a long sentence or phrase, the system will also fall back to using the compact Siri Voice.
I am able to get the high quality models running fine on the iOS and iPad OS simulations, as well as on the Mac build.
I'm a bit confounded by this behavior on the iPhone 13. The size difference between the models is about 60 MB vs 110 MB. My iPhone is able to run LLMs in the 1B to 7B parameter range (at varying speeds), so I'm a bit surprised that it seems to struggle with high quality models that are only around 20M parameters.
For context -- I’m able to run Piper Medium models just fine using WASM interfaces, such as this one.
Is this an issue with the ONNX runtime, something to do with the handling of audio buffers, or something else? I'm out of my depth trying to troubleshoot this myself, so help would be appreciated.
Videos demonstrating the issue here -- using the en_GB-Cori medium model. I see the same issue with en_GB-Jenny medium. High quality models fail to run at all:
clip.1.mov
Clip.2.mov