<a href="https://colab.research.google.com/github/R0b0t-Maker/LLM-T/blob/main/TTS_v001.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [3]:
!pip install nemo_toolkit[all] soundfile




In [6]:
from nemo.collections.tts.models import FastPitchModel, HifiGanModel

print("Available FastPitch models:")
for model in FastPitchModel.list_available_models():
    print(model)

print("\nAvailable HiFi-GAN models:")
for model in HifiGanModel.list_available_models():
    print(model)


Available FastPitch models:
PretrainedModelInfo(
	pretrained_model_name=tts_en_fastpitch,
	description=This model is trained on LJSpeech sampled at 22050Hz with and can be used to generate female English voices with an American accent. It is ARPABET-based.,
	location=https://api.ngc.nvidia.com/v2/models/nvidia/nemo/tts_en_fastpitch/versions/1.8.1/files/tts_en_fastpitch_align.nemo,
	class_=<class 'nemo.collections.tts.models.fastpitch.FastPitchModel'>
)
PretrainedModelInfo(
	pretrained_model_name=tts_en_fastpitch_ipa,
	description=This model is trained on LJSpeech sampled at 22050Hz with and can be used to generate female English voices with an American accent. It is IPA-based.,
	location=https://api.ngc.nvidia.com/v2/models/nvidia/nemo/tts_en_fastpitch/versions/IPA_1.13.0/files/tts_en_fastpitch_align_ipa.nemo,
	class_=<class 'nemo.collections.tts.models.fastpitch.FastPitchModel'>
)
PretrainedModelInfo(
	pretrained_model_name=tts_en_fastpitch_multispeaker,
	description=This model is tra

In [11]:
import nemo.collections.tts as nemo_tts
import soundfile as sf
import torch

def nemo_text_to_speech(text, output_path="output.wav"):
    # Load pre-trained FastPitch model
    fastpitch_model = nemo_tts.models.FastPitchModel.from_pretrained(model_name="tts_en_fastpitch")

    # Load HiFi-GAN vocoder model
    hifigan_model = nemo_tts.models.HifiGanModel.from_pretrained(model_name="tts_en_hifigan")

    # Generate spectrogram from input text
    tokens = fastpitch_model.parse(text)
    spectrogram = fastpitch_model.generate_spectrogram(tokens=tokens)

    # Convert spectrogram to waveform
    if torch.cuda.is_available():
        spectrogram = spectrogram.to('cuda')
        hifigan_model = hifigan_model.to('cuda')
    waveform = hifigan_model.convert_spectrogram_to_audio(spec=spectrogram)

    # Detach tensor and move to CPU
    waveform = waveform.detach().cpu().numpy()

    # Ensure waveform is 1D
    waveform = waveform.squeeze()

    # Save the waveform to a file
    sf.write(output_path, waveform, samplerate=22050)
    print(f"Saved speech to {output_path}")

if __name__ == "__main__":
    sample_text = "Hello, this is a test for checking the performance of Nemo Nvidia for Rasoul to find the quality of the output."
    nemo_text_to_speech(sample_text)


[NeMo I 2024-05-16 22:41:08 cloud:58] Found existing object /root/.cache/torch/NeMo/NeMo_1.23.0/tts_en_fastpitch_align/b7d086a07b5126c12d5077d9a641a38c/tts_en_fastpitch_align.nemo.
[NeMo I 2024-05-16 22:41:08 cloud:64] Re-using file from: /root/.cache/torch/NeMo/NeMo_1.23.0/tts_en_fastpitch_align/b7d086a07b5126c12d5077d9a641a38c/tts_en_fastpitch_align.nemo
[NeMo I 2024-05-16 22:41:08 common:924] Instantiating model from pre-trained checkpoint


 NeMo-text-processing :: INFO     :: Creating ClassifyFst grammars.
INFO:NeMo-text-processing:Creating ClassifyFst grammars.
[NeMo W 2024-05-16 22:42:30 en_us_arpabet:66] apply_to_oov_word=None, This means that some of words will remain unchanged if they are not handled by any of the rules in self.parse_one_word(). This may be intended if phonemes and chars are both valid inputs, otherwise, you may see unexpected deletions in your input.
[NeMo W 2024-05-16 22:42:30 modelPT:165] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    dataset:
      _target_: nemo.collections.tts.torch.data.TTSDataset
      manifest_filepath: /ws/LJSpeech/nvidia_ljspeech_train_clean_ngc.json
      sample_rate: 22050
      sup_data_path: /raid/LJSpeech/supplementary
      sup_data_types:
      - align_prior_matrix
      - pitch
      n_fft: 1024
      win_length: 1024
  

[NeMo I 2024-05-16 22:42:30 features:289] PADDING: 1
[NeMo I 2024-05-16 22:42:32 save_restore_connector:249] Model FastPitchModel was successfully restored from /root/.cache/torch/NeMo/NeMo_1.23.0/tts_en_fastpitch_align/b7d086a07b5126c12d5077d9a641a38c/tts_en_fastpitch_align.nemo.
[NeMo I 2024-05-16 22:42:32 cloud:58] Found existing object /root/.cache/torch/NeMo/NeMo_1.23.0/tts_hifigan/e6da322f0f7e7dcf3f1900a9229a7e69/tts_hifigan.nemo.
[NeMo I 2024-05-16 22:42:32 cloud:64] Re-using file from: /root/.cache/torch/NeMo/NeMo_1.23.0/tts_hifigan/e6da322f0f7e7dcf3f1900a9229a7e69/tts_hifigan.nemo
[NeMo I 2024-05-16 22:42:32 common:924] Instantiating model from pre-trained checkpoint


[NeMo W 2024-05-16 22:42:39 modelPT:165] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    dataset:
      _target_: nemo.collections.tts.data.datalayers.MelAudioDataset
      manifest_filepath: /home/fkreuk/data/train_finetune.txt
      min_duration: 0.75
      n_segments: 8192
    dataloader_params:
      drop_last: false
      shuffle: true
      batch_size: 64
      num_workers: 4
    
[NeMo W 2024-05-16 22:42:39 modelPT:172] If you intend to do validation, please call the ModelPT.setup_validation_data() or ModelPT.setup_multiple_validation_data() method and provide a valid configuration file to setup the validation data loader(s). 
    Validation config : 
    dataset:
      _target_: nemo.collections.tts.data.datalayers.MelAudioDataset
      manifest_filepath: /home/fkreuk/data/val_finetune.txt
      min_duration: 3
      n_segments: 66150


[NeMo I 2024-05-16 22:42:39 features:289] PADDING: 0


[NeMo W 2024-05-16 22:42:39 features:266] Using torch_stft is deprecated and has been removed. The values have been forcibly set to False for FilterbankFeatures and AudioToMelSpectrogramPreprocessor. Please set exact_pad to True as needed.


[NeMo I 2024-05-16 22:42:39 features:289] PADDING: 0
[NeMo I 2024-05-16 22:42:40 save_restore_connector:249] Model HifiGanModel was successfully restored from /root/.cache/torch/NeMo/NeMo_1.23.0/tts_hifigan/e6da322f0f7e7dcf3f1900a9229a7e69/tts_hifigan.nemo.


[NeMo W 2024-05-16 22:42:40 fastpitch:291] parse() is meant to be called in eval mode.
[NeMo W 2024-05-16 22:42:40 fastpitch:368] generate_spectrogram() is meant to be called in eval mode.


Saved speech to output.wav
