pngnw_bert

PRETRAINED WEIGHTS LINK


REPO STATUS: WORKS BUT NOT GOING TO BE MAINTAINED


Unofficial PyTorch implementation of PnG BERT with some changes.

Dubbed "Phoneme and Grapheme and Word BERT", this model includes additional word-level embeddings on both grapheme and phoneme side of the model.

It also includes an additional text-to-emoji objective that uses a DeepMoji teacher model.
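
To make that objective concrete, here is a minimal sketch of a distillation-style emoji loss. The function and tensor names are hypothetical, not taken from this repo; the teacher probabilities would come from a frozen DeepMoji model run on the same sentences.

```python
import torch
import torch.nn.functional as F

def emoji_teacher_loss(student_logits, teacher_probs):
    # KL(teacher || student) over DeepMoji's 64 emoji classes.
    return F.kl_div(F.log_softmax(student_logits, dim=-1), teacher_probs,
                    reduction="batchmean")

student_logits = torch.randn(8, 64)                         # e.g. a small head on the [CLS] state
teacher_probs  = torch.softmax(torch.randn(8, 64), dim=-1)  # stand-in for DeepMoji outputs
print(emoji_teacher_loss(student_logits, teacher_probs))
```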


I no longer recommend using PnG BERT or this modified version because of the high compute costs.

Since each input is characters + phonemes instead of just wordpieces, the input is roughly 6x longer than for BERT.

Because dot-product attention scales with the square of the input length, attention is theoretically 36x more expensive in PnG BERT than in a normal BERT.
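
For intuition, a back-of-envelope sketch of where that 36x figure comes from (the 6x length ratio is the rough estimate quoted above, not a measured constant):

```python
# Rough cost comparison; the 6x length ratio is an estimate, not a measurement.
bert_len  = 128              # typical wordpiece sequence length
pngnw_len = 6 * bert_len     # graphemes + phonemes for the same sentence

# Dot-product attention does O(n^2) work in the sequence length per layer,
# so the relative attention cost is the squared length ratio.
relative_cost = (pngnw_len / bert_len) ** 2
print(relative_cost)         # 36.0
```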


[Figure: pre_training_architecture.png]

Here's the modified architecture.

The new components are:

  • Word Values Embeddings
  • Rel Word and Rel Token Position Embeddings
  • Subword Position Embeddings
  • Emoji Teacher Loss

The position embeddings can be toggled in the config, and I will likely disable some of them once I find the best configuration for training.
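
For illustration, here is a minimal sketch of how the extra embeddings can be folded in. The module and argument names are hypothetical, not the repo's actual code: like BERT's token, segment, and position embeddings, each addition is a lookup table whose output is summed into the encoder input.

```python
import torch.nn as nn

class PnGnWEmbeddings(nn.Module):
    """Illustrative sketch only (hypothetical names, not this repo's API):
    every additional embedding is a lookup table summed with the usual
    token + segment embeddings before the transformer encoder."""
    def __init__(self, vocab_size, n_words, max_rel_pos, max_subword_pos, d_model=768):
        super().__init__()
        self.token_emb       = nn.Embedding(vocab_size, d_model)      # graphemes / phonemes
        self.segment_emb     = nn.Embedding(2, d_model)               # grapheme side vs phoneme side
        self.word_value_emb  = nn.Embedding(n_words, d_model)         # word-level "value" embedding
        self.rel_word_emb    = nn.Embedding(max_rel_pos, d_model)     # relative word position (offset to be >= 0)
        self.rel_token_emb   = nn.Embedding(max_rel_pos, d_model)     # relative token position (offset to be >= 0)
        self.subword_pos_emb = nn.Embedding(max_subword_pos, d_model) # position of the token within its word

    def forward(self, tokens, segments, word_ids, rel_word_pos, rel_token_pos, subword_pos):
        return (self.token_emb(tokens)
                + self.segment_emb(segments)
                + self.word_value_emb(word_ids)
                + self.rel_word_emb(rel_word_pos)
                + self.rel_token_emb(rel_token_pos)
                + self.subword_pos_emb(subword_pos))
```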


Update 19th Feb

I tested a 5%-trained PnGnW BERT checkpoint with a Tacotron 2 decoder.

[Figure: pngnw_bert_tacotron2_alignment.png]

Alignment was achieved within 300k samples, about 80% faster than with the original Tacotron 2 text encoder [1].

I'll look into adding Flash Attention next since training is taking longer than I'd like.

[1] Location-Relative Attention Mechanisms for Robust Long-Form Speech Synthesis
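
As a side note on the Flash Attention plan above: in PyTorch 2.x, one common route is torch.nn.functional.scaled_dot_product_attention, which dispatches to fused FlashAttention-style kernels on supported GPUs and dtypes. The sketch below shows that route under those assumptions; it is not necessarily how this repo integrates it.

```python
import torch
import torch.nn.functional as F

def fused_attention(q, k, v, dropout_p=0.0):
    # q, k, v: (batch, n_heads, seq_len, head_dim)
    # On supported hardware/dtypes this call uses a fused (FlashAttention-style)
    # kernel instead of materialising the full seq_len x seq_len score matrix.
    return F.scaled_dot_product_attention(q, k, v, dropout_p=dropout_p)

q = k = v = torch.randn(2, 12, 1024, 64)  # CPU fallback works too, just without the fused kernel
print(fused_attention(q, k, v).shape)     # torch.Size([2, 12, 1024, 64])
```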


Update 3rd March

I've:

  • Added Flash Attention
  • Trained Tacotron 2, prosody prediction, and prosody-to-mel models with PnGnW BERT
  • Experimented with different position embeddings (learned vs. sinusoidal); see the sketch after this list
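
For reference, the two position-embedding variants compared in that last bullet, written out as standard formulations (not copied from this repo):

```python
import math
import torch
import torch.nn as nn

def sinusoidal_positions(max_len, d_model):
    # Fixed sin/cos table from "Attention Is All You Need"; no trainable parameters.
    pos = torch.arange(max_len, dtype=torch.float).unsqueeze(1)
    div = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float) * (-math.log(10000.0) / d_model))
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div)
    return pe

learned_pos = nn.Embedding(512, 768)          # learned: one trainable vector per position
fixed_pos   = sinusoidal_positions(512, 768)  # sinusoidal: deterministic, no extra parameters
```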

I found that, in downstream TTS tasks, fine-tuned PnGnW BERT is about on par with fine-tuning a normal BERT plus DeepMoji plus a G2P model, while requiring much more VRAM and compute.

I can't recommend using this repo. The idea sounded really cool, but after experimenting it seems the only benefit of this method is simplifying the pipeline by using a single model instead of multiple smaller models. There is no noticeable improvement in quality (which makes me really sad), and it requires roughly 10x more compute.

It's still possible that this method will help a lot with accented speakers or other more challenging cases, but for normal English speakers it's just not worth it.

