### Details :

Official website : https://www.outeai.com/


Huggingface : https://huggingface.co/OuteAI/OuteTTS-0.2-

Github : https://github.com/edwko/OuteTTS

To download the GGUF models for local use :

https://huggingface.co/OuteAI/OuteTTS-0.2-500M-GGUF/tree/main


Currently supports 4 languages : English, Chinese, Japanese and Korean

### If you are running this on Google Colab, make sure to change the runtime to T4 GPU
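
After switching the runtime, a quick check can confirm that a GPU is actually visible to the notebook. The snippet below is a minimal sketch: `pick_device` is a hypothetical helper name (not part of outetts), and it assumes PyTorch is importable, falling back to "cpu" when it is not.

```python
def pick_device():
    """Return "cuda" if PyTorch can see a GPU, otherwise "cpu"."""
    try:
        import torch  # normally preinstalled on Colab
    except ImportError:
        # Without PyTorch there is nothing to query, so report CPU.
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


print("Using device:", pick_device())
```

If this prints "cpu" on Colab, the runtime type has most likely not been changed yet.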


### To install the package :

1. !pip install outetts

2. !pip install llama-cpp-python (package page : https://pypi.org/project/llama-cpp-python/)
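
To verify that both installs succeeded before running the cases below, something like the following can be used. It is a sketch built on the standard library only; `missing_packages` is a hypothetical helper name, and `llama_cpp` is the import name of the llama-cpp-python package.

```python
import importlib.util


def missing_packages(names):
    """Return the subset of top-level module names that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Import names: `outetts` for outetts, `llama_cpp` for llama-cpp-python.
REQUIRED = ["outetts", "llama_cpp"]

missing = missing_packages(REQUIRED)
if missing:
    print("Not installed yet:", ", ".join(missing))
else:
    print("All required packages are importable.")
```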

## Case 1 : Using the OuteTTS model to convert an input prompt into speech

!pip install outetts

import outetts

# Configure the model
model_config = outetts.HFModelConfig_v1(
    model_path="OuteAI/OuteTTS-0.2-500M",
    language="en",
)

# Initialize the interface
interface = outetts.InterfaceHF(model_version="0.2", cfg=model_config)

# Optional: Load speaker from default presets
interface.print_default_speakers()

speaker = interface.load_default_speaker(name="female_1")

output = interface.generate(
    # Adjacent string literals are joined, so no stray indentation ends up in the text
    text=(
        "Marketers make change happen for the smallest viable market and by delivering "
        "anticipated, personal, and relevant messages that people actually want to get."
    ),
    temperature=0.1,
    repetition_penalty=1.1,
    max_length=4096,
    speaker=speaker,
)

output.save("speech_op_1.wav")

output.play()



# Understanding the code :


# model_config : Sets up the configuration for the text-to-speech model. It specifies the model to use ("OuteAI/OuteTTS-0.2-500M") and the language ("en" for English).
# interface : Creates an interface object for interacting with the chosen model, using the configuration above.
# interface.print_default_speakers() : Prints the available preset speakers, which helps when choosing a voice.
# speaker : Loads a specific speaker preset (female_1 in this case), so the audio is generated with that female voice.
# output : Generates the audio from the input text using the chosen speaker.
# output.save : Saves the generated audio to a file named "speech_op_1.wav".
# output.play() : Plays the generated audio.


# Terminology used :
# temperature : Controls sampling randomness. A lower temperature (like 0.1) results in more predictable and less varied audio.
# repetition_penalty=1.1 : Values above 1 discourage the model from repeating what it has already generated, which reduces looping output.
# max_length=4096 : Upper limit on the length of the generation. It is counted in model tokens rather than seconds or characters, so it only indirectly caps the audio duration.
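
To make the `temperature` and `repetition_penalty` notes above more concrete, here is a generic sketch of how these two settings commonly act on a language model's next-token scores. It is not OuteTTS's internal code: the function names are hypothetical, the logits are made-up numbers, and the penalty rule is the one commonly used in Hugging Face style generation (divide positive scores, multiply negative ones).

```python
import math


def apply_repetition_penalty(logits, generated_ids, penalty=1.1):
    """Lower the score of every token id that was already generated."""
    out = list(logits)
    for tok in set(generated_ids):
        # Positive scores shrink, negative scores become more negative.
        out[tok] = out[tok] / penalty if out[tok] > 0 else out[tok] * penalty
    return out


def softmax_with_temperature(logits, temperature=0.1):
    """Turn scores into probabilities; a small temperature sharpens them."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, -1.0]  # illustrative scores for three tokens
penalized = apply_repetition_penalty(logits, generated_ids=[0, 2])
sharp = softmax_with_temperature(logits, temperature=0.1)
flat = softmax_with_temperature(logits, temperature=1.0)
print("penalized logits:", penalized)
print("top-token probability at T=0.1:", sharp[0])
print("top-token probability at T=1.0:", flat[0])
```

At the low temperature almost all probability mass sits on the highest-scoring token, which is why 0.1 tends to give a stable, predictable voice.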




  WeightNorm.apply(module, name, dim)


making attention of type 'vanilla' with 768 in_channels


  state_dict_raw = torch.load(model_path, map_location="cpu")['state_dict']
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:None for open-end generation.



=== ALL AVAILABLE SPEAKERS ===
Total: 16 speakers across 4 languages
--------------------------------------------------

EN (6 speakers):
  - male_1
  - male_2
  - male_3
  - male_4
  - female_1
  - female_2

JA (4 speakers):
  - male_1
  - female_1
  - female_2
  - female_3

KO (4 speakers):
  - male_1
  - male_2
  - female_1
  - female_2

ZH (2 speakers):
  - male_1
  - female_1


=== SPEAKERS FOR CURRENT INTERFACE LANGUAGE ===
Language: EN (6 speakers)
--------------------------------------------------
  - male_1
  - male_2
  - male_3
  - male_4
  - female_1
  - female_2

To use a speaker: load_default_speaker(name)
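
To compare the preset voices listed above, one option is to loop over the speaker names and write one file per speaker. The sketch below only uses the calls already shown in Case 1 (`load_default_speaker`, `generate`, `save`); the helper name `render_speaker_samples` and the output file naming are assumptions made for illustration.

```python
def render_speaker_samples(interface, speaker_names, text, prefix="sample"):
    """Generate one WAV file per preset speaker and return the paths written."""
    written = []
    for name in speaker_names:
        speaker = interface.load_default_speaker(name=name)
        output = interface.generate(
            text=text,
            temperature=0.1,
            repetition_penalty=1.1,
            max_length=4096,
            speaker=speaker,
        )
        path = f"{prefix}_{name}.wav"  # hypothetical naming scheme
        output.save(path)
        written.append(path)
    return written
```

With the `interface` from Case 1, a call such as `render_speaker_samples(interface, ["male_1", "female_1"], "Hello there.")` would write `sample_male_1.wav` and `sample_female_1.wav`.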





## Case 2 : Voice Cloning using OuteTTS

!pip install outetts

import outetts

# Configure the model. The path is the Hugging Face checkpoint (not a .gguf file)
# and the interface below is InterfaceHF, so the HF config class is used, as in Case 1.
model_config = outetts.HFModelConfig_v1(
    model_path="OuteAI/OuteTTS-0.2-500M",
    language="en", # Supported languages in v0.2: en, zh, ja, ko
)

# Initialize the interface
interface = outetts.InterfaceHF(model_version="0.2", cfg=model_config)

speaker = interface.create_speaker(
    audio_path="/content/harvard.wav",
    transcript="The stale smell of old beer lingers. It takes heat to bring out the odor. A cold dip restores health and zest. A salt pickle tastes fine with ham. Tacos al pastor are my favorite. A zestful food is the hot cross bun."
)

# Optional: Save and load speaker profiles
interface.save_speaker(speaker, "speaker.json")
speaker = interface.load_speaker("speaker.json")

# No default speaker is needed here; the cloned speaker profile is used instead.

output = interface.generate(
    # Adjacent string literals are joined, so no stray indentation ends up in the text
    text=(
        "Marketers make change happen for the smallest viable market and by delivering "
        "anticipated, personal, and relevant messages that people actually want to get."
    ),
    # Lower temperature values may result in a more stable tone,
    # while higher values can introduce varied and expressive speech
    temperature=0.1,
    repetition_penalty=1.1,
    max_length=4096,

    # Optional: Use a speaker profile for consistent voice characteristics
    # Without a speaker profile, the model will generate a voice with random characteristics
    speaker=speaker,
)

# Save the synthesized speech to a file
output.save("speech_op_2.wav")

# Optional: Play the synthesized speech
output.play()


making attention of type 'vanilla' with 768 in_channels


Downloading: "https://dl.fbaipublicfiles.com/mms/torchaudio/ctc_alignment_mling_uroman/model.pt" to /root/.cache/torch/hub/checkpoints/model.pt
100%|██████████| 1.18G/1.18G [00:09<00:00, 129MB/s]
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:None for open-end generation.
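
Before calling `create_speaker`, it can help to inspect the reference clip, since the transcript has to match what is actually spoken in the file. The sketch below reads the duration with Python's standard `wave` module, which handles uncompressed PCM WAV files. The helper names are hypothetical, and the 15 second threshold is an illustrative assumption, not a limit stated in this notebook.

```python
import wave


def wav_duration_seconds(path):
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def check_reference_clip(path, max_seconds=15.0):
    """Return (duration, ok). max_seconds is an illustrative threshold only."""
    duration = wav_duration_seconds(path)
    return duration, duration <= max_seconds


# Example (path taken from Case 2; the file must exist for this to run):
# duration, ok = check_reference_clip("/content/harvard.wav")
# print(f"{duration:.1f} s, short enough: {ok}")
```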


## Case 3 : Voice Cloning with your own audio sample

!pip install outetts

import outetts

# Configure the model. The path is the Hugging Face checkpoint (not a .gguf file)
# and the interface below is InterfaceHF, so the HF config class is used, as in Case 1.
model_config = outetts.HFModelConfig_v1(
    model_path="OuteAI/OuteTTS-0.2-500M",
    language="en", # Supported languages in v0.2: en, zh, ja, ko
)

# Initialize the interface
interface = outetts.InterfaceHF(model_version="0.2", cfg=model_config)

speaker = interface.create_speaker(
    audio_path="/content/Chirantan_audio.wav",
    # Adjacent string literals are joined, so no stray indentation ends up in the transcript
    transcript=(
        "Marketers make change happen for the smallest viable market and by delivering "
        "anticipated, personal, and relevant messages that people actually want to get"
    ),
)

# Optional: Save and load speaker profiles
interface.save_speaker(speaker, "speaker.json")
speaker = interface.load_speaker("speaker.json")

# No default speaker is needed here; the cloned speaker profile is used instead.

output = interface.generate(
    # Adjacent string literals are joined, so no stray indentation ends up in the text
    text=(
        "Marketers make change happen for the smallest viable market and by delivering "
        "anticipated, personal, and relevant messages that people actually want to get."
    ),
    # Lower temperature values may result in a more stable tone,
    # while higher values can introduce varied and expressive speech
    temperature=0.1,
    repetition_penalty=1.1,
    max_length=4096,

    # Optional: Use a speaker profile for consistent voice characteristics
    # Without a speaker profile, the model will generate a voice with random characteristics
    speaker=speaker,
)

# Save the synthesized speech to a file
output.save("speech_op_voiceClone.wav")

# Optional: Play the synthesized speech
output.play()


making attention of type 'vanilla' with 768 in_channels


The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:None for open-end generation.
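
Cases 2 and 3 differ only in the reference audio, its transcript, and the output file name, so the shared steps can be wrapped in one function. This is a sketch that reuses only the calls shown above (`create_speaker`, `generate`, `save`); `clone_and_speak` is a hypothetical helper name, and the generation settings are simply the ones used in this notebook.

```python
def clone_and_speak(interface, audio_path, transcript, text, out_path):
    """Build a speaker profile from a reference clip, synthesize text, save it."""
    # The transcript should match what is spoken in the reference clip.
    speaker = interface.create_speaker(audio_path=audio_path, transcript=transcript)
    output = interface.generate(
        text=text,
        temperature=0.1,
        repetition_penalty=1.1,
        max_length=4096,
        speaker=speaker,
    )
    output.save(out_path)
    return output
```

For example, Case 3 would become a single call with `audio_path="/content/Chirantan_audio.wav"` and `out_path="speech_op_voiceClone.wav"`, passing the `interface` created above.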
