High quality models not running on iPhone 13. Medium Quality models can only speak short phrases. #1

Open
S-Ali-Zaidi opened this issue Jun 9, 2024 · 14 comments

@S-Ali-Zaidi

S-Ali-Zaidi commented Jun 9, 2024

Hello -- I’ve managed to build and run the application on my iPhone 13 Pro Max.

I’m finding that any high quality models I attempt to build and run on the iPhone 13 fail to run -- and iOS falls back to using the compact Siri voice.

Medium quality models are able to run -- but only speak in short phrases. If given a long sentence or phrase, the system will also fall back to using the compact Siri Voice.

I am able to get the high quality models running fine in the iOS and iPadOS simulators, as well as in the Mac build.

I’m a bit confounded by this behavior on the iPhone 13. The size difference between the models is only about 60 MB vs 110 MB, and my iPhone can run LLMs in the 1B to 7B parameter range (at varying speeds), so I’m surprised it seems to struggle with high quality models that are only around 20M parameters.

For context -- I’m able to run Piper Medium models just fine using WASM interfaces, such as this one.

Is this an issue with the ONNX runtime, something to do with the handling of audio buffers, or some other issue? I’m out of my depth troubleshooting this myself -- help would be appreciated.

Videos demonstrating the issue here -- using the en_GB-Cori medium model. I see the same issue with en_GB-Jenny medium. High quality models fail to run at all:

clip.1.mov
Clip.2.mov
@S-Ali-Zaidi
Author

Another example demonstrating this when using the en_GB-Jenny medium quality voice on the Speech Central app. You can see it’s performing quite well -- until it hits a longer sentence, causing the system to fall back to compact Siri.

RPReplay_Final1717895785-HD.720p.mov

@IhorShevchuk
Owner

IhorShevchuk commented Jun 9, 2024

Hello @S-Ali-Zaidi, the problem with running the high quality models is their size. Apple's text-to-speech engine is built on top of AVSpeechSynthesisProviderAudioUnit, which is an app-extension-specific class, and unfortunately an app extension built on top of an Audio Unit on iOS can't use more than 60 MB of RAM. The ONNX runtime loads the whole file into RAM during initialisation, which is why a model can't be larger than ~50 MB -- some space has to be left for the audio buffers and the extension itself.

This is one of the reasons why this application is no more than a proof of concept, in addition to the missing SSML support and the missing callbacks for finished sentences and words.
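
As a rough way to gauge whether a given voice fits inside that budget before bundling it, here is a desktop Python sketch (not the extension itself; the model path is a placeholder) that measures how much resident memory an ONNX Runtime session adds when it loads a Piper model:

# Rough check of how much RAM ONNX Runtime needs just to hold a Piper model.
# Run on a desktop; the iOS extension will differ, but the order of magnitude
# is a useful sanity check against the ~60 MB extension budget.
import resource
import onnxruntime as ort

MODEL_PATH = "models/en_GB/jenny.onnx"  # placeholder path

def peak_rss():
    # ru_maxrss is reported in bytes on macOS and in kilobytes on Linux.
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss

before = peak_rss()
session = ort.InferenceSession(MODEL_PATH, providers=["CPUExecutionProvider"])
after = peak_rss()

print(f"Peak RSS grew by {after - before} (bytes on macOS, KB on Linux) after loading the model")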

@IhorShevchuk IhorShevchuk self-assigned this Jun 9, 2024
@S-Ali-Zaidi
Author

S-Ali-Zaidi commented Jun 9, 2024

Hello @S-Ali-Zaidi, the problem with running the high quality models is their size. Apple's text-to-speech engine is built on top of AVSpeechSynthesisProviderAudioUnit, which is an app-extension-specific class, and unfortunately an app extension built on top of an Audio Unit on iOS can't use more than 60 MB of RAM. The ONNX runtime loads the whole file into RAM during initialisation, which is why a model can't be larger than ~50 MB -- some space has to be left for the audio buffers and the extension itself.

This is one of the reasons why this application is no more than a proof of concept, in addition to the missing SSML support and the missing callbacks for finished sentences and words.

Ah, I had no idea about the ~60 MB limitation -- that was certainly frustrating to find out.

However, I did a little poking around the model parameters (with the help of ChatGPT, as this is not my domain) and noticed that essentially all Piper model weights are stored in FP32.

import onnx
import numpy as np

# Load the ONNX model
model_path = "/Users/s.alizaidi/Programming/TTS_ALL/Piper_TTS/modules/piper/models/en_GB/en_GB-jenny.onnx"
model = onnx.load(model_path)

# Function to convert ONNX TensorProto data types to human-readable format
def tensor_dtype(tensor):
    if tensor.data_type == onnx.TensorProto.FLOAT:
        return 'float32'
    elif tensor.data_type == onnx.TensorProto.FLOAT16:
        return 'float16'
    elif tensor.data_type == onnx.TensorProto.INT8:
        return 'int8'
    elif tensor.data_type == onnx.TensorProto.INT32:
        return 'int32'
    elif tensor.data_type == onnx.TensorProto.INT64:
        return 'int64'
    # Add other data types as needed
    else:
        return 'unknown'

# Iterate through the initializers and print their data types
for initializer in model.graph.initializer:
    weight_name = initializer.name
    weight_dtype = tensor_dtype(initializer)
    print(f"Weight Name: {weight_name}, Data Type: {weight_dtype}”)

I found this page, which gives instructions for reducing ONNX weights to FP16.

After a few failed attempts, I managed to convert the Jenny Medium model to fp16 with this script:

import onnx
from onnxconverter_common import float16

# Load the model
model_path = "/Users/s.alizaidi/Programming/TTS_ALL/Piper_TTS/modules/piper/models/en_GB/jenny.onnx"
model = onnx.load(model_path)

# Convert the model to float16, keeping the inputs and outputs as float32
model_fp16 = float16.convert_float_to_float16(
    model,
    keep_io_types=True,
    op_block_list=['RandomNormalLike', 'Range']
)

# Save the converted model
onnx.save(model_fp16, "/Users/s.alizaidi/Programming/TTS_ALL/Piper_TTS/modules/piper/models/en_GB/jenny_fp16.onnx")

Happy to report that the en-GB-Jenny-Medium model is now 32 mb in size, rather than 63mb -- and it is still running inference in Piper perfectly fine!

I tried the same on the uk_UA-lada model, and after being converted to fp16, it’s reduced from 20mb to 10mb!
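
For anyone repeating the conversion, a quick sanity check (a sketch with placeholder paths, not part of the repo) is to confirm the converted file really is mostly float16 and to compare the on-disk sizes:

# Compare the original and fp16-converted Piper models: file size on disk
# plus a histogram of initializer (weight) dtypes.
import os
from collections import Counter

import onnx

FP32_PATH = "models/en_GB/jenny.onnx"       # placeholder path
FP16_PATH = "models/en_GB/jenny_fp16.onnx"  # placeholder path

def dtype_histogram(path):
    # Count weight dtypes so we can see how much of the model is still float32.
    model = onnx.load(path)
    names = {value: name for name, value in onnx.TensorProto.DataType.items()}
    return Counter(names[init.data_type] for init in model.graph.initializer)

for path in (FP32_PATH, FP16_PATH):
    size_mb = os.path.getsize(path) / (1024 * 1024)
    print(f"{path}: {size_mb:.1f} MB, weight dtypes: {dict(dtype_histogram(path))}")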

Attaching a drive link to the fp16 files and audio samples from the fp32 vs fp16 models.

https://drive.google.com/drive/folders/1WlB4GBs1mohKi_8y9AxMztFZKokXUMl1?usp=share_link

I’ll report back once I test out the fp16 models within a build of the iOS app. Hopefully this makes the Piper iOS app more feasible to develop!

@S-Ali-Zaidi
Author

S-Ali-Zaidi commented Jun 13, 2024

Hello! A brief update:

With FP16 weights, the uk_UA model, which is now about 10 MB in size, seems to run pretty well on my iPhone 13 -- both within the app and when used as a system voice in various applications. It still has the issue where longer sentences cause the system to fall back to a compact default voice.

The same goes for the Jenny FP16 model, which takes up about 20-30 MB -- but iOS's "tolerance" before it falls back to a default voice is shorter, in terms of sentence length, than with the 10 MB Lada model.

You can see this demonstrated in the video, where it lags on a sentence in the middle of the excerpt and then stops entirely at the final sentence -- that is usually when the debug console in Xcode displays an error.

copy_2F019055-8040-4109-AB11-03144958F861.mov

In terms of memory usage, when I generate texts of various lengths and complexity in the app itself using Jenny -- on my iPhone, my Mac, and various simulated iOS devices -- I find that memory usage never really goes above 50 MB and typically hovers around 45 MB. It is very responsive and quick in the Swift app on my Mac and in the various iOS simulators.

What I'm wondering is whether the fallback seen on real iOS devices like the iPhone 13 might have less to do with the ~80 MB memory limit on Audio Unit extension based apps, and more to do with the request timing out against some internal threshold iOS has for TTS generation.

If so, I wonder how a VITS-based app like yours might fare if it used a more efficient, natural-sounding, and compact variant of VITS -- such as a mini MB-iSTFT-VITS, which has a much better real-time factor (0.02x) for speech generation than the vanilla VITS used by Piper (0.27x or 0.099x, depending on model size).

IMG_9756

If I've calculated correctly, a 7M parameter mini MB model should take up about 15 MB or less when stored with FP16 weights.
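
For what it's worth, the arithmetic behind that estimate is just parameter count times bytes per weight; a tiny sketch:

# Back-of-the-envelope size of a model's weights by parameter count and dtype.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}

def weight_size_mb(num_params, dtype):
    return num_params * BYTES_PER_PARAM[dtype] / (1024 * 1024)

for dtype in ("fp32", "fp16", "int8", "int4"):
    print(f"7M params @ {dtype}: {weight_size_mb(7e6, dtype):.1f} MB")
# fp16 comes out to roughly 13 MB, consistent with the ~15 MB estimate above
# (real files also carry graph structure and any weights kept in fp32).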

I'm also wondering whether Core ML conversions of VITS models might speed up performance and/or reduce memory usage enough to make them usable within the AVSpeechSynthesisProvider extension framework.

Would be curious to hear your thoughts! Thank you!

@castdrian

@S-Ali-Zaidi it'd be cool if you'd publish your work on your fork. I've been trying to get this repo to work with a few Piper voices (e.g. Amy medium), because none of the built-in iOS voices have the quality I require for my PokéDex app, but they end up extremely weirdly pitched -- so it'd be good to have a second reference point for using different voices.

@S-Ali-Zaidi
Author

S-Ali-Zaidi commented Jun 25, 2024

@S-Ali-Zaidi it'd be cool if you'd publish your work on your fork. I've been trying to get this repo to work with a few Piper voices (e.g. Amy medium), because none of the built-in iOS voices have the quality I require for my PokéDex app, but they end up extremely weirdly pitched -- so it'd be good to have a second reference point for using different voices.

I’ll be returning to this project soon, and I'm happy to push updates to my fork when I do! Most of what I’ve been doing has been on the model quantization side, rather than changes to the actual scripts themselves.

THAT SAID -- if you are getting weirdly pitched output from your voice, I am 90% certain it is because of a sample rate mismatch between your model and the format the audio unit is rendering in.

Note that the sample rate for Amy Medium is 22,050 Hz, according to the model card (and likely the config JSON):

# Model card for amy (low)

* Language: en_US (English, United States)
* Speakers: 1
* Quality: medium
* Samplerate: 22,050Hz

## Dataset

* URL: https://github.com/MycroftAI/mimic3-voices
* License: See URL

## Training

Finetuned from U.S. English lessac voice (medium quality).

You need to make sure the sample rate set in PiperAudioUnit.swift matches the sample rate of the Piper model you are using.

Check line 28:

self.format = AVAudioFormat(commonFormat: .pcmFormatFloat32, sampleRate: 16000.0, channels: 1, interleaved: true)!

Note that by default it is set to sampleRate: 16000.0, because that was the output sample rate of the Lada Piper model used by @IhorShevchuk. If you have not already, change this so it reads:

self.format = AVAudioFormat(commonFormat: .pcmFormatFloat32, sampleRate: 22050.0, channels: 1, interleaved: true)!

I’m pretty sure this is your issue -- a sample rate mismatch is the only reason I can think of for getting high-pitched audio output.

Make sure to also replace any references to uk_UA-lada in PiperAudioUnit.swift with the filename and config filename of your Amy model, and change primaryLanguages: ["uk-UA"], supportedLanguages: ["uk-UA"] to your desired language (which for English would be en-US or en-GB).

For reference, here is how it looks on my end after making those modifications for a 22,050 Hz Lessac Piper model:

//
//  piperttsAudioUnit.swift
//  pipertts
//
//  Created by Ihor Shevchuk on 27.12.2023.
//

// NOTE:- An Audio Unit Speech Extension (ausp) is rendered offline, so it is safe to use
// Swift in this case. It is not recommended to use Swift in other AU types.

import AVFoundation

import piper_objc
import PiperappUtils

public class PiperttsAudioUnit: AVSpeechSynthesisProviderAudioUnit {
    private var outputBus: AUAudioUnitBus
    private var _outputBusses: AUAudioUnitBusArray!
    
    private var request: AVSpeechSynthesisProviderRequest?

    private var format: AVAudioFormat

    var piper: Piper?

    @objc override init(componentDescription: AudioComponentDescription, options: AudioComponentInstantiationOptions) throws {

        self.format = AVAudioFormat(commonFormat: .pcmFormatFloat32, sampleRate: 22050.0, channels: 1, interleaved: true)!

        outputBus = try AUAudioUnitBus(format: self.format)
        try super.init(componentDescription: componentDescription, options: options)
        _outputBusses = AUAudioUnitBusArray(audioUnit: self, busType: AUAudioUnitBusType.output, busses: [outputBus])
    }
    
    public override var outputBusses: AUAudioUnitBusArray {
        return _outputBusses
    }
    
    public override func allocateRenderResources() throws {
        try super.allocateRenderResources()
        Log.debug("allocateRenderResources")
        if piper == nil {
            let model = Bundle.main.path(forResource: "lessac_med_fp16", ofType: "onnx")!
            let config = Bundle.main.path(forResource: "lessac_med_fp16.onnx", ofType: "json")!
            piper = Piper(modelPath: model, andConfigPath: config)
        }
    }

    public override func deallocateRenderResources() {
        super.deallocateRenderResources()
        piper = nil
    }

    // MARK: - Rendering
    /*
     NOTE:- It is only safe to use Swift for audio rendering in this case, as Audio Unit Speech Extensions process offline.
     (Swift is not usually recommended for processing on the realtime audio thread)
     */
    public override var internalRenderBlock: AUInternalRenderBlock {
        return { [weak self] actionFlags, _, frameCount, _, outputAudioBufferList, _, _ in

            guard let self = self,
            let piper = self.piper else {
                actionFlags.pointee = .unitRenderAction_PostRenderError
                Log.error("Utterance Client is nil while request for rendering came.")
                return kAudioComponentErr_InstanceInvalidated
            }

            if piper.completed() && !piper.hasSamplesLeft() {
                Log.debug("Completed rendering")
                actionFlags.pointee = .offlineUnitRenderAction_Complete
                self.cleanUp()
                return noErr
            }

            if !piper.readyToRead() {
                actionFlags.pointee = .offlineUnitRenderAction_Preflight
                Log.debug("No bytes yet.")
                return noErr
            }

            let levelsData = piper.popSamples(withMaxLength: UInt(frameCount))

            guard let levelsData else {
                actionFlags.pointee = .offlineUnitRenderAction_Preflight
                Log.debug("Rendering in progress. No bytes.")
                return noErr
            }

            outputAudioBufferList.pointee.mNumberBuffers = 1
            var unsafeBuffer = UnsafeMutableAudioBufferListPointer(outputAudioBufferList)[0]
            let frames = unsafeBuffer.mData!.assumingMemoryBound(to: Float.self)
            unsafeBuffer.mDataByteSize = UInt32(levelsData.count)
            unsafeBuffer.mNumberChannels = 1

            for frame in 0..<levelsData.count {
                frames[Int(frame)] = levelsData[Int(frame)].int16Value.toFloat()
            }

            actionFlags.pointee = .offlineUnitRenderAction_Render

            Log.debug("Rendering \(levelsData.count) bytes")

            return noErr

        }
    }

    public override func synthesizeSpeechRequest(_ speechRequest: AVSpeechSynthesisProviderRequest) {
        Log.debug("synthesizeSpeechRequest \(speechRequest.ssmlRepresentation)")
        self.request = speechRequest
        let text = AVSpeechUtterance(ssmlRepresentation: speechRequest.ssmlRepresentation)?.speechString

        piper?.cancel()
        piper?.synthesize(text ?? "")
    }
    
    public override func cancelSpeechRequest() {
        Log.debug("\(#file) cancelSpeechRequest")
        cleanUp()
        piper?.cancel()
    }

    func cleanUp() {
        request = nil
    }

    public override var speechVoices: [AVSpeechSynthesisProviderVoice] {
        get {
            return [
                AVSpeechSynthesisProviderVoice(name: "Lessac", identifier: "pipertts", primaryLanguages: ["en_US"], supportedLanguages: ["en_US", "en_GB"])
            ]
        }
        set { }
    }

    public override var canProcessInPlace: Bool {
        return true
    }

}

Note that you are still likely to run into issues on iOS -- especially on older iPhones like my iPhone 13. Due to the way Apple integrates third-party TTS systems into iOS, there are strict limits on how much RAM Piper TTS can use -- about 60-80 MB.

Unless you are using a heavily quantized model, you are going to find that your iPhone may fail to render any speech from Piper or may cut off at longer sentences.

I’m currently exploring INT8 quantization-aware training of Piper models, as well as retraining some entirely using a tokenizer vocabulary instead of phonemes or graphemes. If that works out, it may result in Piper TTS models being able to run smoothly on iOS. Until then -- don’t expect it to work consistently, even with the above fixes!
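
In the meantime, a cheaper experiment than full quantization-aware training is post-training dynamic quantization with ONNX Runtime's tooling -- not the same thing, and its effect on Piper's audio quality is untested here, but as a sketch (placeholder paths):

# Post-training dynamic quantization of a Piper ONNX model to int8 weights.
# This is NOT quantization-aware training; output quality needs to be checked by ear.
from onnxruntime.quantization import quantize_dynamic, QuantType

quantize_dynamic(
    model_input="models/en_GB/jenny.onnx",        # placeholder path (fp32 model)
    model_output="models/en_GB/jenny_int8.onnx",  # placeholder path
    weight_type=QuantType.QInt8,
)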

@lumpidu

lumpidu commented Aug 7, 2024

How is your 8-bit quantization coming along?

@lumpidu

lumpidu commented Aug 8, 2024

Hello @S-Ali-Zaidi, the problem with running the high quality models is their size. Apple's text-to-speech engine is built on top of AVSpeechSynthesisProviderAudioUnit, which is an app-extension-specific class, and unfortunately an app extension built on top of an Audio Unit on iOS can't use more than 60 MB of RAM. The ONNX runtime loads the whole file into RAM during initialisation, which is why a model can't be larger than ~50 MB -- some space has to be left for the audio buffers and the extension itself.

This is one of the reasons why this application is no more than a proof of concept, in addition to the missing SSML support and the missing callbacks for finished sentences and words.

Is there any documentation about this memory limit? I couldn't find these values officially documented anywhere, and this is a pretty substantial limit for TTS voices!

@S-Ali-Zaidi
Author

How is your 8-bit quantization coming along?

Haven’t gotten around to it yet, as there are a couple of things I'm thinking about:

  1. It’s not the model size alone that creates the issue -- the intermediate vector representations of the text swell memory usage as well. Even a smaller 10 MB model can balloon to over 100 MB if you give it a decently long chunk of text to process.

  2. The size of the intermediate vector representations essentially depends on the number of “tokens / phonemes” in the input text. Each phoneme is assigned a vector just under 200 elements long, and each element is an FP32, FP16, INT8, or INT4 value, depending on the quantization used in the training / fine-tuning / conversion of the model.

Thus, what I’m coming to realize is that a more efficient way of encoding the text itself is needed, rather than phoneme-based encoding, since phonemes come out to roughly the same count per input text as characters.

Language models have made great strides here by training byte-pair encoding or unigram tokenizers, which tokenize into subword- or word-length units. I’ve trained a tokenizer with a vocabulary of about 4,000, and when I use it to encode texts, I’m seeing a 50-80% reduction in the number of vectors needed to represent a string of text before it is converted into intermediate representations by the embedding layer.

A vocab of 4,000-10,000 seems like a good balance: it compresses the text efficiently without ballooning the model size too much through the larger embedding matrix.
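
To make the tokenizer side concrete, here is roughly what that looks like with SentencePiece (a sketch; the corpus path and vocab size are placeholders, not the exact setup described above):

# Train a unigram tokenizer and compare its token count for a sentence
# against the character count (a rough stand-in for the phoneme count).
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="english_corpus.txt",   # placeholder: plain text, one sentence per line
    model_prefix="piper_unigram",
    vocab_size=4096,
    model_type="unigram",
)

sp = spm.SentencePieceProcessor(model_file="piper_unigram.model")
sentence = "The apple never falls far from the tree."
ids = sp.encode(sentence)
print(f"{len(ids)} unigram tokens vs {len(sentence)} characters")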

But this will require re-training a VITS (ideally VITS2) model from scratch on tokenized graphemes rather than phonemes. So for now I’m focused on finishing my neural networks courses (I'm doing a post-bachelor's program in ML).

Once I’ve finished that up, I’ll get back to training a new VITS model within the Piper framework that will ideally:

  • Use a tokenizer with a vocab of 4,000-10,000.
  • Be pre-trained on a decently large body of natural and synthetic speech at 44.1 kHz or 48 kHz, plus a smaller corpus at around 22 kHz.
  • Undergo quantization-aware training, compressing the weights to INT8 or INT4.

It’s a relatively complicated project compared to simply quantizing an existing VITS model, so it will take a little time. I don’t expect to even start it until sometime this fall, after I finish my current quarter of coursework.

@S-Ali-Zaidi
Author

Hello @S-Ali-Zaidi, the problem with running the high quality models is their size. Apple's text-to-speech engine is built on top of AVSpeechSynthesisProviderAudioUnit, which is an app-extension-specific class, and unfortunately an app extension built on top of an Audio Unit on iOS can't use more than 60 MB of RAM. The ONNX runtime loads the whole file into RAM during initialisation, which is why a model can't be larger than ~50 MB -- some space has to be left for the audio buffers and the extension itself.
This is one of the reasons why this application is no more than a proof of concept, in addition to the missing SSML support and the missing callbacks for finished sentences and words.

Is there any documentation about this memory limit? I couldn't find these values officially documented anywhere, and this is a pretty substantial limit for TTS voices!

The documentation is very sparse -- I had to dig deep through the Apple Developer Forums to find mentions of it. And even then, I think with the latest iOS and macOS updates it may be more of a dynamic limit that depends on total system resources. My MacBook Pro (M1 Pro, 16 GB RAM) has effectively no limit on how long an input it can process with VITS / Piper via the Audio Unit extension.

The limit still seems to be there, but it's not terrible on newer iPhones and iPads with 8+ GB of RAM. As you work your way down to older and cheaper iPhone and iPad models with less RAM, the text-input limit becomes shorter and shorter.

But yes, there seems to be no solid documentation from Apple on how it works, which sucks.

@lumpidu

lumpidu commented Aug 12, 2024

I still don't understand why you want to use a tokenizer/grapheme-based model instead of the phoneme-based model. Isn't the phoneme-based model much more lightweight, since we only have to deal with a limited set of phoneme IDs instead of a dictionary/unigram-based tokenizer that adds a few thousand tokens to the input layer?

@lumpidu

lumpidu commented Aug 13, 2024

As for the memory explosion: have you seen this documentation? https://onnxruntime.ai/docs/get-started/with-c.html. There are a few options available for controlling memory consumption. Maybe it would be worth investigating these?
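
For reference, the same knobs are exposed through the Python API, which makes them easy to try before touching the Swift/Objective-C side; a sketch (whether these settings actually help inside the Audio Unit extension is untested here, and the model path is a placeholder):

# Session options that trade some speed for lower memory use in ONNX Runtime.
import onnxruntime as ort

so = ort.SessionOptions()
so.enable_cpu_mem_arena = False   # don't grow a persistent CPU memory arena
so.enable_mem_pattern = False     # don't cache memory allocation patterns between runs
so.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_BASIC

session = ort.InferenceSession(
    "models/en_GB/jenny_fp16.onnx",  # placeholder path
    sess_options=so,
    providers=["CPUExecutionProvider"],
)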

@lumpidu

lumpidu commented Aug 13, 2024

What would be interesting to know is whether the memory consumption is driven by the encoder of the VITS model or by the decoder. I could imagine that the HiFi-GAN upsampling code is causing the memory explosion. Heck, it also takes most of the time (> 90%). It might be worthwhile to swap HiFi-GAN for iSTFT. Have you already done this experiment?

@S-Ali-Zaidi
Author

S-Ali-Zaidi commented Aug 17, 2024

I still don't understand why you want to use a tokenizer/grapheme-based model instead of the phoneme-based model. Isn't the phoneme-based model much more lightweight, since we only have to deal with a limited set of phoneme IDs instead of a dictionary/unigram-based tokenizer that adds a few thousand tokens to the input layer?

If one uses a pre-trained tokenizer (for example, a unigram model trained on a custom English corpus with SentencePiece, set to a vocab size of 4096), the embedding matrix will indeed be larger -- 4096 tokens × 192 dimensions versus something like 200 tokens × 192 dimensions -- but the difference is pretty trivial compared to the size of the overall model:

Vocabulary size: 4096 vs. 200 | Embedding dimensions: 192

  • 4096-token vocabulary:
    • FP32 (4 bytes per element): 4096 × 192 × 4 bytes = 3,145,728 bytes = 3.00 MB
    • FP16 (2 bytes per element): 4096 × 192 × 2 bytes = 1,572,864 bytes = 1.50 MB
    • INT8 (1 byte per element): 4096 × 192 × 1 byte = 786,432 bytes = 0.75 MB
    • INT4 (0.5 bytes per element): 4096 × 192 × 0.5 bytes = 393,216 bytes = 0.375 MB

  • 200-token vocabulary:
    • FP32 (4 bytes per element): 200 × 192 × 4 bytes = 153,600 bytes = 0.146 MB
    • FP16 (2 bytes per element): 200 × 192 × 2 bytes = 76,800 bytes = 0.073 MB
    • INT8 (1 byte per element): 200 × 192 × 1 byte = 38,400 bytes = 0.037 MB
    • INT4 (0.5 bytes per element): 200 × 192 × 0.5 bytes = 19,200 bytes = 0.018 MB

So as you can see, even without any model quantization, a custom-trained tokenizer adds nothing substantial to the memory footprint, and such an embedding matrix only needs to be instantiated once for the VITS model run. It would be a one-time, static increase of around 1-3 MB when migrating to a unigram tokenizer with a vocab size of around 4096.

But let’s consider the memory impact for a simple sentence like "The apple never falls far from the tree." when represented by embedding vectors derived from a phoneme tokenizer versus a pre-trained unigram tokenizer (which, in the case of the tokenizers I’ve trained, would represent common words with a single token each, only breaking more complex and rare words into sub-word tokens):

Phoneme representation (approx. 22 phonemes):

  1. FP32 (4 bytes per element): 22 phonemes × 192 dimensions × 4 bytes = 16,896 bytes = 16.5 KB
  2. FP16 (2 bytes per element): 22 phonemes × 192 dimensions × 2 bytes = 8,448 bytes = 8.25 KB
  3. INT8 (1 byte per element): 22 phonemes × 192 dimensions × 1 byte = 4,224 bytes = 4.125 KB
  4. INT4 (0.5 bytes per element): 22 phonemes × 192 dimensions × 0.5 bytes = 2,112 bytes = 2.0625 KB

Hypothetical 4096-vocabulary tokenizer (8 tokens: "The", "apple", "never", "falls", "far", "from", "the", "tree."):

  1. FP32 (4 bytes per element): 8 tokens × 192 dimensions × 4 bytes = 6,144 bytes = 6.00 KB
  2. FP16 (2 bytes per element): 8 tokens × 192 dimensions × 2 bytes = 3,072 bytes = 3.00 KB
  3. INT8 (1 byte per element): 8 tokens × 192 dimensions × 1 byte = 1,536 bytes = 1.50 KB
  4. INT4 (0.5 bytes per element): 8 tokens × 192 dimensions × 0.5 bytes = 768 bytes = 0.75 KB

So as we can see, going from an FP32 phoneme-based embedding to an FP32 embedding driven by a 4096-token pre-trained tokenizer gives roughly a 63% reduction in the vectors output by the embedding matrix. If we also quantize the model (including the embedding matrix) to INT8, we see up to a 90% size reduction in the initial vector representation of the text.
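
Since the arithmetic above is mechanical, here it is as a small helper that reproduces both sets of numbers (192 is the embedding dimension assumed throughout):

# Memory for a sequence of embeddings: tokens × embedding_dim × bytes per element.
BYTES_PER_ELEMENT = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}

def embedding_kb(num_tokens, dim=192):
    return {dtype: num_tokens * dim * b / 1024 for dtype, b in BYTES_PER_ELEMENT.items()}

print("22 phonemes:     ", embedding_kb(22))  # {'fp32': 16.5, 'fp16': 8.25, ...}
print("8 unigram tokens:", embedding_kb(8))   # {'fp32': 6.0,  'fp16': 3.0,  ...}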

So, as you can imagine, feeding longer inputs to a VITS / Piper model trained in a byte-pair or unigram tokenizer paradigm could be much more memory-efficient in memory-constrained environments like Apple’s Audio Unit extension -- especially under the naive assumption that all the intermediate vector representations of the input text have to be kept in memory as it passes through the various transformer and convolutional layers.

If I understand the VITS architecture correctly, we could see this 60-90% decrease across all the intermediate representations / activations generated by the model during inference, which seem to be the source of the intense memory ballooning seen when running it within Apple’s Audio Unit extension framework.

My suspicion is that, with a diverse and large enough speech-text training corpus (natural datasets like LJSpeech, augmented with synthetic data -- for example generated with ElevenLabs or Tortoise -- to cover the 4096-token vocabulary extensively), such a model (especially with the VITS2 architecture) might perform better, sounding more natural and adaptable, than one trained on phonemes alone.

There’s a lot of research I’ve read that consistently points to this, which is another reason why I would like to test it out.

That all said, with respect to the memory-consumption controls available in ONNX Runtime, I will absolutely take a look and see whether they might be a solution for the immediate term! I can't recall how the computational graph of VITS looks off the top of my head -- I’ll need to go back to the PyTorch scripts and check that intermediate vector representations are not needed beyond their immediate input and output layers. If so, this could be a viable solution.

And with respect to iSTFT, yes, I’m IMMENSELY interested in that! In fact, my target model for training on a 4096-token tokenizer would be the MB-iSTFT-VITS2 architecture that some folks put together and got very good results with. It could make the model both smaller in memory footprint and faster!

But I won’t get to testing all of this for a little while -- I'm currently deep in my neural networks courses. Once I have the time, I will absolutely try all of this out and report back with my results!
