
Install NeMo.

In [1]:
# Install NeMo library. If you are running locally (rather than on Google Colab), comment out the below lines
# and instead follow the instructions at https://github.com/NVIDIA/NeMo#Installation
GITHUB_ACCOUNT = "NVIDIA"
BRANCH = "main"
!python -m pip install git+https://github.com/{GITHUB_ACCOUNT}/NeMo.git@{BRANCH}#egg=nemo_toolkit[all]

# Download local version of NeMo scripts. If you are running locally and want to use your own local NeMo code,
# comment out the below lines and set NEMO_DIR to your local path.
NEMO_DIR = 'nemo'
!git clone -b {BRANCH} https://github.com/{GITHUB_ACCOUNT}/NeMo.git $NEMO_DIR

DEPRECATION: git+https://github.com/NVIDIA/NeMo.git@main#egg=nemo_toolkit[all] contains an egg fragment with a non-PEP 508 name pip 25.0 will enforce this behaviour change. A possible replacement is to use the req @ url syntax, and remove the egg fragment. Discussion can be found at https://github.com/pypa/pip/issues/11617
Collecting nemo_toolkit (from nemo_toolkit[all])
  Cloning https://github.com/NVIDIA/NeMo.git (to revision main) to /tmp/pip-install-rgpmguxc/nemo-toolkit_3f2837e40ca547198ec38b3c86d132c3
  Running command git clone --filter=blob:none --quiet https://github.com/NVIDIA/NeMo.git /tmp/pip-install-rgpmguxc/nemo-toolkit_3f2837e40ca547198ec38b3c86d132c3
  Resolved https://github.com/NVIDIA/NeMo.git to commit 93f000207879d718c145f6592424c8f3045d3b9f
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting onnx>=1.7.0 (from nemo_toolk

Import the required modules.

In [2]:
import torch
import IPython.display as ipd
import re
import soundfile as sf
from matplotlib.pyplot import imshow
from matplotlib import pyplot as plt

from nemo.collections.tts.models import FastPitchModel
from nemo.collections.tts.models import HifiGanModel

Define file names

In [8]:
INPUT_TEXT = "input_text.txt"
INPUT_FOR_G2P = "input_for_g2p.txt"
OUTPUT_OF_G2P = "output_of_g2p.txt"
INPUT_TEXT_PHONEMES = "input_text_phonemes.txt"

Create file with some input text.
Note that text normalization (conversion of digits to words etc.) is **not** included in this pipeline.

In [4]:
!echo "(Я представляю себе вашу ироническую улыбку. Тем не менее – буквально два слова.) Как известно, мир несовершенен." > {INPUT_TEXT}
!echo "Устоями общества являются корыстолюбие, страх и продажность." >> {INPUT_TEXT}
!echo "Конфликт мечты с действительностью не утихает тысячелетиями." >> {INPUT_TEXT}
!echo "Вместо желаемой гармонии на земле царят хаос и беспорядок." >> {INPUT_TEXT}
!echo "Более того, нечто подобное мы обнаружили в собственной душе." >> {INPUT_TEXT}
!echo "Мы жаждем совершенства, а вокруг торжествует пошлость. Как в этой ситуации поступает деятель, революционер?" >> {INPUT_TEXT}
!echo "Революционер делает попытки установить мировую гармонию." >> {INPUT_TEXT}
!echo "Он начинает преобразовывать жизнь, достигая иногда курьезных мичуринских результатов." >> {INPUT_TEXT}
!echo "Допустим, выводит морковь, совершенно неотличимую от картофеля. В общем, создает новую человеческую породу." >> {INPUT_TEXT}
!echo "Известно, чем это кончается… Что в этой ситуации предпринимает моралист? Он тоже пытается достичь гармонии." >> {INPUT_TEXT}


Define helper functions for preprocessing.

In [5]:
def clean_russian_g2p_trascription(text: str) -> str:
    result = text
    result = result.replace("<DELETE>", " ").replace("+", "").replace("~", "")
    result = result.replace("ʑ", "ɕ:").replace("ɣ", "x")
    result = result.replace(":", "ː").replace("'", "`")
    result = "".join(result.split())
    result = result.replace("_", " ")
    return result


def clean_russian_text_for_tts(text: str) -> str:
    result = text
    result = result.replace("+", "")  # remove stress
    result = result.casefold()  # lowercase
    result = result.replace("ё", "е")
    result = result.replace("\u2011", "-")  # non-breaking hyphen
    result = result.replace("\u2014", "-")  # em dash
    result = result.replace("\u2026", ".")  # horizontal ellipsis
    result = result.replace("\u00ab", "\"")  # LEFT-POINTING DOUBLE ANGLE QUOTATION MARK
    result = result.replace("\u00bb", "\"")  # RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK
    result = result.replace("\u2019", "'")  # ’ Right Single Quotation Mark
    result = result.replace("\u201c", "\"")  # “ Left Double Quotation Mark
    result = result.replace("\u201d", "\"")  # ” Right Double Quotation Mark
    result = result.replace("\u201e", "\"")  # „ Double Low-9 Quotation Mark
    result = result.replace("\u201f", "\"")  # ‟ Double High-reversed-9 Quotation Mark
    return result
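
The chain of `replace` calls in `clean_russian_text_for_tts` can also be written as a single translation table. The following is a minimal sketch of that idea, not part of the original notebook: the names `_TTS_CHAR_TABLE` and `clean_russian_text_with_table` are hypothetical, and the mapping simply mirrors the replacements listed above.

```python
# Hypothetical table-driven variant of clean_russian_text_for_tts.
# The mapping mirrors the replace() calls of the original helper.
_TTS_CHAR_TABLE = str.maketrans({
    "+": None,      # stress mark is removed
    "ё": "е",
    "\u2011": "-",  # non-breaking hyphen
    "\u2014": "-",  # em dash
    "\u2026": ".",  # horizontal ellipsis
    "\u00ab": '"',  # left-pointing double angle quotation mark
    "\u00bb": '"',  # right-pointing double angle quotation mark
    "\u2019": "'",  # right single quotation mark
    "\u201c": '"',  # left double quotation mark
    "\u201d": '"',  # right double quotation mark
    "\u201e": '"',  # double low-9 quotation mark
    "\u201f": '"',  # double high-reversed-9 quotation mark
})


def clean_russian_text_with_table(text: str) -> str:
    # casefold() runs first so that "Ё" becomes "ё" before the table maps it to "е"
    return text.casefold().translate(_TTS_CHAR_TABLE)


print(clean_russian_text_with_table("«Ёлка» \u2014 э+то\u2026"))
# "елка" - это.
```

A table keeps all character mappings in one place, which may make it easier to add further symbols later.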


Collect all unique words from the input text and prepare them as input for the G2P model.
Note that the G2P model processes each word separately and does not take context into account.

In [None]:
all_words = set()
with open(INPUT_TEXT, "r", encoding="utf-8") as inp:
    for line in inp:
        text = line.strip()
        words = re.findall(r"\w+", text)  # raw string avoids the invalid "\w" escape warning
        for w in words:
            all_words.add(clean_russian_text_for_tts(w))

with open(INPUT_FOR_G2P, "w", encoding="utf-8") as out:
    for w in all_words:
        out.write(" ".join(list(w)) + "\n")
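
To make the input format for the G2P model concrete: each line of the file holds one unique word, with its characters separated by spaces. The sketch below illustrates this on a short sample sentence. The function name `g2p_input_lines` is hypothetical, it applies only `casefold()` rather than the full cleaning helper, and the sorting is an addition to make the order reproducible (the cell above iterates over a set, whose order is not fixed).

```python
import re


def g2p_input_lines(text: str) -> list[str]:
    """Return one line per unique word, with characters separated by spaces."""
    # Simplification: only casefold() is applied, not the full cleaning helper.
    words = {w.casefold() for w in re.findall(r"\w+", text)}
    # Sorting is an addition here so that the output order is reproducible.
    return [" ".join(w) for w in sorted(words)]


for line in g2p_input_lines("Мир, мир и хаос."):
    print(line)
# и
# м и р
# х а о с
```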


In [16]:
!head {INPUT_FOR_G2P}

Clone the [G2P model](https://huggingface.co/bene-ges/ru_g2p_ipa_bert_large) from HuggingFace.
If cloning does not work, run `git lfs install` first and try again.


In [7]:
!git clone https://huggingface.co/bene-ges/ru_g2p_ipa_bert_large

Cloning into 'ru_g2p_ipa_bert_large'...
remote: Enumerating objects: 46, done.
remote: Total 46 (delta 0), reused 0 (delta 0), pack-reused 46 (from 1)
Unpacking objects: 100% (46/46), 83.49 KiB | 2.04 MiB/s, done.


Run G2P inference on the words that we prepared

In [10]:
!python {NEMO_DIR}/examples/nlp/text_normalization_as_tagging/normalization_as_tagging_infer.py \
  pretrained_model=ru_g2p_ipa_bert_large/ru_g2p.nemo \
  inference.from_file={INPUT_FOR_G2P} \
  inference.out_file={OUTPUT_OF_G2P} \
  model.max_sequence_len=512 \
  inference.batch_size=128 \
  lang=ru


2025-02-08 11:41:24.927633: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1739014884.949613    4300 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1739014884.956042    4300 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
Traceback (most recent call last):
  File "/content/nemo/examples/nlp/text_normalization_as_tagging/normalization_as_tagging_infer.py", line 42, in <module>
    from helpers import ITN_MODEL, instantiate_model_and_trainer
  File "/content/nemo/examples/nlp/text_normalization_as_tagging/helpers.py", line 22, in <module>
    from nemo.collections.nlp.models import ThutmoseTaggerModel
  File "/usr/local/lib/python3.11/dist-packages/nemo

In [17]:
!head {OUTPUT_OF_G2P}

Preprocess input text for TTS using G2P results and vocabularies of known transcriptions.

In [18]:
# Heteronyms are words with an ambiguous transcription; we leave them as plain text.
heteronyms = set()
with open("ru_g2p_ipa_bert_large/heteronyms.txt", "r", encoding="utf-8") as f:
    for line in f:
        inp = line.strip()
        heteronyms.add(inp)

g2p_vocab = {}
# first read transcriptions from our g2p prediction
with open(OUTPUT_OF_G2P, "r", encoding="utf-8") as f:
    for line in f:
        try:
            _, inp, transcription, _, _ = line.strip().split("\t")
        except ValueError:  # the line does not have exactly five tab-separated fields
            print("cannot read line: " + line)
            continue
        inp = inp.replace(" ", "")
        g2p_vocab[inp] = clean_russian_g2p_trascription(transcription)

# then override known transcriptions using vocabulary
with open("ru_g2p_ipa_bert_large/g2p_correct_vocab.txt", "r", encoding="utf-8") as f:
    for line in f:
        # Example input: ледок \t lʲɪd`ok
        inp, transcription = line.strip().split("\t")
        g2p_vocab[inp] = transcription

out = open(INPUT_TEXT_PHONEMES, "w", encoding="utf-8")

with open(INPUT_TEXT, "r", encoding="utf-8") as inp:
    for line in inp:
        text = line.strip()
        text = clean_russian_text_for_tts(text)
        phonemized_text = ""
        m = re.search(r"[\w\-]+", text)
        while m is not None:
            begin = m.start()
            end = m.end()
            phonemized_text += text[0:begin]
            w = text[begin:end]
            if w in heteronyms:
                phonemized_text += w
            elif w in g2p_vocab:
                phonemized_text += clean_russian_g2p_trascription(g2p_vocab[w])
            else:  # shouldn't go here as all words are expected to pass through g2p
                phonemized_text += w

            if end >= len(text):
                break
            text = text[end:]
            end = 0
            m = re.search(r"[\w\-]+", text)
        if end < len(text):
            phonemized_text += text[end:]

        out.write(phonemized_text + "\n")

out.close()
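
The word-by-word substitution above can also be expressed with `re.sub` and a callback, which may be easier to follow than the manual search loop. This is a minimal sketch, not the notebook's own code: the function name `phonemize` is hypothetical, the vocabulary values are assumed to be already cleaned (the loop above additionally passes them through `clean_russian_g2p_trascription`), and the transcription given for "хаос" is made up purely to show that heteronyms take priority over the vocabulary.

```python
import re


def phonemize(text: str, g2p_vocab: dict[str, str], heteronyms: set[str]) -> str:
    """Replace each known word with its transcription; leave the rest as text."""

    def replace_word(match: re.Match) -> str:
        word = match.group(0)
        if word in heteronyms:
            return word  # ambiguous transcription: keep the plain text
        return g2p_vocab.get(word, word)  # unknown words also stay as text

    # Same word pattern as in the loop above; text between matches is kept as-is.
    return re.sub(r"[\w\-]+", replace_word, text)


# The entry for "хаос" is a made-up placeholder, not a verified transcription.
sample_vocab = {"мир": "mʲir", "и": "i", "хаос": "x`aəs"}
print(phonemize("мир и хаос.", sample_vocab, {"хаос"}))
# mʲir i хаос.
```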

Look at the final TTS input

In [19]:
!head {INPUT_TEXT_PHONEMES}

(ja prʲɪtstɐvlʲ`æjʊ sʲɪbʲ`e v`aʂʊ ɪrɐnʲ`itɕɪskʊjʊ ʊɫ`ɨpkʊ. тем не mʲ`enʲɪje – bʊkv`alʲnə dva слова.) kak ɪzvʲ`esnə, mʲir nʲɪsəvʲɪrʂ`ɛnʲɪn.
ʊst`ojəmʲɪ `opɕːɪstvə jɪvlʲ`æjʊtsə корыстолюбие, strax i prɐd`aʐnəsʲtʲ.
kɐnflʲ`ikt mʲɪtɕt`ɨ s dʲɪjstvʲ`itʲɪlʲnəsʲtʲjʊ не ʊtʲɪx`ajɪt tɨsʲɪtɕɪlʲ`etʲɪjəmʲɪ.
vmʲ`estə ʐɨɫ`ajɪməj ɡɐrm`onʲɪɪ на zʲɪmlʲ`e tsɐrʲ`at хаос i bʲɪspɐrʲ`adək.
b`olʲɪje того, nʲ`eʂtə pɐd`obnəjə mɨ ɐbnɐr`uʐɨlʲɪ v s`opstvʲɪnːəj душе.
mɨ ʐ`aʐdʲɪm səvʲɪrʂ`ɛnstvə, a vɐkr`uk tərʐɨstv`ujɪt p`oʂɫəsʲtʲ. kak v `ɛtəj sʲɪtʊ`atsɨɪ pəstʊp`ajɪt dʲ`ejɪtʲɪlʲ, rʲɪvəlʲʊtsɨɐnʲ`er?
rʲɪvəlʲʊtsɨɐnʲ`er dʲ`eɫəjɪt pɐp`ɨtkʲɪ ʊstənɐvʲ`itʲ mʲɪrɐv`ujʊ ɡɐrm`onʲɪjʊ.
on nətɕɪn`ajɪt prʲɪəbrɐz`ovɨvətʲ ʐɨzʲnʲ, dəsʲtʲɪɡ`ajə ɪnɐɡd`a kʊrʲ`jɵznɨx мичуринских rʲɪzʊlʲt`atəf.
допустим, vɨv`odʲɪt mɐrk`ofʲ, səvʲɪrʂ`ɛnːə nʲɪətlʲɪtɕ`imʊjʊ от kɐrt`ofʲɪlʲə. v `opɕːɪm, səzdɐ`jɵt n`ovʊjʊ tɕɪɫɐvʲ`etɕɪskʊjʊ pɐr`odʊ.
ɪzvʲ`esnə, чем `ɛtə kɐnʲtɕ`æjɪtsə. ʂto v `ɛtəj sʲɪtʊ`atsɨɪ prʲɪtprʲɪnʲɪm`ajɪt mərɐlʲ`ist? on t`oʐɨ pɨt`ajɪtsə dɐsʲtʲ`itɕ

Run TTS. The resulting WAV files are saved to the working directory and also played back in the output cell.

In [20]:
device = "cuda" if torch.cuda.is_available() else "cpu"

# Load FastPitch
spectrogram_generator = FastPitchModel.from_pretrained("bene-ges/tts_ru_ipa_fastpitch_ruslan").eval().to(device)
# Load vocoder
vocoder = HifiGanModel.from_pretrained(model_name="bene-ges/tts_ru_hifigan_ruslan").eval().to(device)

i = 0
with open(INPUT_TEXT_PHONEMES, "r", encoding="utf-8") as inp:
    for line in inp:
        text = line.strip()
        parsed = spectrogram_generator.parse(text)
        spectrogram = spectrogram_generator.generate_spectrogram(tokens=parsed)
        audio = vocoder.convert_spectrogram_to_audio(spec=spectrogram)

        # Note that the vocoder returns a batch of audio. In this example, we take the first and only sample.
        filename = str(i) + ".wav"
        sf.write(filename, audio.to('cpu').detach().numpy()[0], 22050)
        i += 1

        # display
        print(f'"{text}"\n')
        ipd.display(ipd.Audio(audio.to('cpu').detach(), rate=22050))


tts_ru_ipa_fastpitch_ruslan.nemo:   0%|          | 0.00/183M [00:00<?, ?B/s]

[NeMo W 2025-02-08 11:45:31 nemo_logging:405] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    dataset:
      _target_: nemo.collections.tts.data.dataset.TTSDataset
      manifest_filepath: train_manifest.json
      sample_rate: 22050
      sup_data_path: sup_data
      sup_data_types:
      - align_prior_matrix
      - pitch
      n_fft: 1024
      win_length: 1024
      hop_length: 256
      window: hann
      n_mels: 80
      lowfreq: 0
      highfreq: null
      max_duration: 15
      min_duration: 0.1
      ignore_file: null
      trim: true
      trim_top_db: 50
      trim_frame_length: 1024
      trim_hop_length: 256
      pitch_fmin: 65.40639132514966
      pitch_fmax: 2093.004522404789
      pitch_norm: true
      pitch_mean: 120.88
      pitch_std: 44.0
    dataloader_params:
      drop_last: false
      shuffle: true
      batch_size

[NeMo I 2025-02-08 11:45:31 nemo_logging:393] PADDING: 1
[NeMo I 2025-02-08 11:45:32 nemo_logging:393] Model FastPitchModel was successfully restored from /root/.cache/huggingface/hub/models--bene-ges--tts_ru_ipa_fastpitch_ruslan/snapshots/396055a801d366b8a58460129f563311db34fc42/tts_ru_ipa_fastpitch_ruslan.nemo.


tts_ru_hifigan_ruslan.nemo:   0%|          | 0.00/339M [00:00<?, ?B/s]

[NeMo W 2025-02-08 11:45:36 nemo_logging:405] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    dataset:
      _target_: nemo.collections.tts.data.dataset.VocoderDataset
      manifest_filepath: train_manifest_mel.json
      sample_rate: 22050
      n_segments: 8192
      max_duration: null
      min_duration: 0.75
      load_precomputed_mel: true
      hop_length: 256
    dataloader_params:
      drop_last: false
      shuffle: true
      batch_size: 16
      num_workers: 4
      pin_memory: true
    
[NeMo W 2025-02-08 11:45:36 nemo_logging:405] If you intend to do validation, please call the ModelPT.setup_validation_data() or ModelPT.setup_multiple_validation_data() method and provide a valid configuration file to setup the validation data loader(s). 
    Validation config : 
    dataset:
      _target_: nemo.collections.tts.data.dataset.Voco

[NeMo I 2025-02-08 11:45:36 nemo_logging:393] PADDING: 0
[NeMo I 2025-02-08 11:45:36 nemo_logging:393] STFT using exact pad
[NeMo I 2025-02-08 11:45:36 nemo_logging:393] PADDING: 0
[NeMo I 2025-02-08 11:45:36 nemo_logging:393] STFT using exact pad
[NeMo I 2025-02-08 11:45:38 nemo_logging:393] Model HifiGanModel was successfully restored from /root/.cache/huggingface/hub/models--bene-ges--tts_ru_hifigan_ruslan/snapshots/c34c98456b4ec37a4ba787e2f20a86a52e6750e2/tts_ru_hifigan_ruslan.nemo.


[NeMo W 2025-02-08 11:45:38 nemo_logging:405] Text: [(ja prʲɪtstɐvlʲ`æjʊ sʲɪbʲ`e v`aʂʊ ɪrɐnʲ`itɕɪskʊjʊ ʊɫ`ɨpkʊ. тем не mʲ`enʲɪje – bʊkv`alʲnə dva слова.) kak ɪzvʲ`esnə, mʲir nʲɪsəvʲɪrʂ`ɛnʲɪn.] contains unknown char: [–]. Symbol will be skipped.


"(ja prʲɪtstɐvlʲ`æjʊ sʲɪbʲ`e v`aʂʊ ɪrɐnʲ`itɕɪskʊjʊ ʊɫ`ɨpkʊ. тем не mʲ`enʲɪje – bʊkv`alʲnə dva слова.) kak ɪzvʲ`esnə, mʲir nʲɪsəvʲɪrʂ`ɛnʲɪn."



"ʊst`ojəmʲɪ `opɕːɪstvə jɪvlʲ`æjʊtsə корыстолюбие, strax i prɐd`aʐnəsʲtʲ."



"kɐnflʲ`ikt mʲɪtɕt`ɨ s dʲɪjstvʲ`itʲɪlʲnəsʲtʲjʊ не ʊtʲɪx`ajɪt tɨsʲɪtɕɪlʲ`etʲɪjəmʲɪ."



"vmʲ`estə ʐɨɫ`ajɪməj ɡɐrm`onʲɪɪ на zʲɪmlʲ`e tsɐrʲ`at хаос i bʲɪspɐrʲ`adək."



"b`olʲɪje того, nʲ`eʂtə pɐd`obnəjə mɨ ɐbnɐr`uʐɨlʲɪ v s`opstvʲɪnːəj душе."



"mɨ ʐ`aʐdʲɪm səvʲɪrʂ`ɛnstvə, a vɐkr`uk tərʐɨstv`ujɪt p`oʂɫəsʲtʲ. kak v `ɛtəj sʲɪtʊ`atsɨɪ pəstʊp`ajɪt dʲ`ejɪtʲɪlʲ, rʲɪvəlʲʊtsɨɐnʲ`er?"



"rʲɪvəlʲʊtsɨɐnʲ`er dʲ`eɫəjɪt pɐp`ɨtkʲɪ ʊstənɐvʲ`itʲ mʲɪrɐv`ujʊ ɡɐrm`onʲɪjʊ."



"on nətɕɪn`ajɪt prʲɪəbrɐz`ovɨvətʲ ʐɨzʲnʲ, dəsʲtʲɪɡ`ajə ɪnɐɡd`a kʊrʲ`jɵznɨx мичуринских rʲɪzʊlʲt`atəf."



"допустим, vɨv`odʲɪt mɐrk`ofʲ, səvʲɪrʂ`ɛnːə nʲɪətlʲɪtɕ`imʊjʊ от kɐrt`ofʲɪlʲə. v `opɕːɪm, səzdɐ`jɵt n`ovʊjʊ tɕɪɫɐvʲ`etɕɪskʊjʊ pɐr`odʊ."



"ɪzvʲ`esnə, чем `ɛtə kɐnʲtɕ`æjɪtsə. ʂto v `ɛtəj sʲɪtʊ`atsɨɪ prʲɪtprʲɪnʲɪm`ajɪt mərɐlʲ`ist? on t`oʐɨ pɨt`ajɪtsə dɐsʲtʲ`itɕ ɡɐrm`onʲɪɪ."

