## Introduction

The goal is to give Home Assistant the voice of a preferred character. The main challenge is that the available voice-generation solutions are resource-intensive and slow. Even if the smart home controller had a high-end GPU, which is not the case, generating voice on the fly would still add a large latency to replies. Therefore the only option is to generate the voice beforehand. If the text lines contain parameters (like "the timer is set for _seven minutes_"), these parameters must be assembled from individually voiced words.

We've written a simple architecture that supports this. In addition, the same architecture is used to define the rules for the speech recognition engine, Rhasspy. These two systems, dubbing and Rhasspy, create a self-testing loop: we can generate the audio files with dubbing and then test them with Rhasspy.

In this demo, we will show how dubbing works.

The main data classes are located in `kaia.persona.dub.core.structures`. 

* `Dub` is an abstract class representing the connection of some value (of arbitrary type) to the string and voice line.
* `SetDub` is a class that binds values of a finite set. `DictDub` and `EnumDub` are its descendants for dictionaries and enums.
* `SequenceDub` is a sequence of constants and some other dubs.
* `UnionDub` contains several sequences. It takes the value, converts it to a dict, finds a sequence that can process the fields of this dict, and uses that sequence to create a string representation of the value.
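
To make the relationship between these classes more concrete, here is a minimal sketch. The `Toy*` classes are hypothetical stand-ins written for this illustration; they are not the classes from `kaia.persona.dub.core.structures`, and their constructors and methods are assumptions.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Union


@dataclass
class ToySetDub:
    """Hypothetical: binds each value of a finite set to its spoken string."""
    value_to_str: Dict[Any, str]

    def to_str(self, value: Any) -> str:
        return self.value_to_str[value]


@dataclass
class ToyField:
    """Hypothetical: a named slot in a sequence, filled by a nested dub."""
    name: str
    dub: ToySetDub


@dataclass
class ToySequenceDub:
    """Hypothetical: constants (plain strings) interleaved with fields."""
    parts: List[Union[str, ToyField]]

    def to_str(self, value: Dict[str, Any]) -> str:
        chunks = []
        for part in self.parts:
            if isinstance(part, str):
                # Constants are emitted as they are.
                chunks.append(part)
            else:
                # Fields delegate to the nested dub, using the matching value.
                chunks.append(part.dub.to_str(value[part.name]))
        return ''.join(chunks)


digits = ToySetDub({1: 'one', 2: 'two', 3: 'three'})
timer = ToySequenceDub(['The timer is set for ', ToyField('minutes', digits), ' minutes'])
print(timer.to_str({'minutes': 2}))
```

A union would then hold several such sequences and pick the one that fits the incoming value.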

Other dubs are language-specific and are located in `kaia.persona.dub.languages.en`. These are, e.g., `CardinalDub` and `OrdinalDub`, which inherit from `SetDub` and represent numbers, or `DateDub`, which extends `UnionDub` and processes `datetime.date` objects.

To define the intents and replies of the assistant, the `Template` class is used; it contains a `UnionDub` as a field. `Template` also provides methods for parsing, to-string conversion and others. These methods are shortcuts for algorithms that are located in `kaia.persona.dub.core.algorithms`. These algorithms are implementations of depth-first search over `UnionDub`, and you don't need to import them directly.
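
The idea of picking a suitable sequence from a union can be sketched as follows. This is a flat, simplified version that assumes no nested unions, so it is not a real depth-first search and is not the code from `kaia.persona.dub.core.algorithms`; the names `render_sequence` and `render_union` are hypothetical.

```python
from typing import Any, Dict, List, Optional, Tuple, Union

# A part is either a constant string or a (field_name, value_to_str) pair.
Part = Union[str, Tuple[str, Dict[Any, str]]]
PartSequence = List[Part]


def render_sequence(sequence: PartSequence, value: Dict[str, Any]) -> Optional[str]:
    """Return the rendered string, or None if the sequence cannot express `value`."""
    used = set()
    chunks = []
    for part in sequence:
        if isinstance(part, str):
            chunks.append(part)
            continue
        name, mapping = part
        if name not in value or value[name] not in mapping:
            return None
        used.add(name)
        chunks.append(mapping[value[name]])
    # Every key of the value must be consumed, otherwise information is lost.
    if used != set(value):
        return None
    return ''.join(chunks)


def render_union(union: List[PartSequence], value: Dict[str, Any]) -> str:
    """Try the sequences in order and use the first one that fits."""
    for sequence in union:
        result = render_sequence(sequence, value)
        if result is not None:
            return result
    raise ValueError(f'No sequence can express {value}')


days = {1: 'first', 2: 'second'}
months = {1: 'January', 2: 'February'}
years = {2024: 'twenty twenty-four'}
date_union = [
    [('day', days), ' of ', ('month', months), ' ', ('year', years)],
    [('day', days), ' of ', ('month', months)],
]
print(render_union(date_union, {'day': 2, 'month': 1}))
```

With nested dubs, `render_sequence` would recurse into each field instead of looking the value up in a flat mapping, which is where the depth-first traversal comes from.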

To represent a particular sentence that is a combination of `Template` and the associated value, `Utterance` is used.

## How it works

We will now create a template of average complexity to demonstrate how dubbing works.

In [1]:
from kaia.persona.dub.languages.en import Template, CardinalDub, PluralAgreement

template = Template(
    'It is {hours} {hours_word} and {minutes} {minutes_word}',
    hours = CardinalDub(0, 24),
    hours_word = PluralAgreement('hours', 'hour', 'hours'),
    minutes = CardinalDub(0, 60),
    minutes_word = PluralAgreement('minutes', 'minute', 'minutes')
)

The following cells demonstrate `to_str` and `parse` methods of the `Template` class:

In [2]:
value = dict(hours=11, minutes=1)
string = template.to_str(value)
string

'It is eleven hours and one minute'

Notice the words "hours" and "minute". The form is chosen by `PluralAgreement` in accordance with the value of the corresponding field.
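
The agreement rule itself is simple. Below is a minimal sketch of it, independent of the library: `choose_form`, `render_time` and `NUMBER_WORDS` are hypothetical names, and the rule (singular only for exactly one) is an assumption about what `PluralAgreement` does for English.

```python
def choose_form(count: int, singular: str, plural: str) -> str:
    """English agreement: the singular form is used only for exactly one."""
    return singular if count == 1 else plural


# A tiny hypothetical vocabulary, just enough for the example.
NUMBER_WORDS = {0: 'zero', 1: 'one', 11: 'eleven'}


def render_time(hours: int, minutes: int) -> str:
    """Render the sentence, choosing each noun form from its own field."""
    return (
        f"It is {NUMBER_WORDS[hours]} {choose_form(hours, 'hour', 'hours')} "
        f"and {NUMBER_WORDS[minutes]} {choose_form(minutes, 'minute', 'minutes')}"
    )


print(render_time(11, 1))
```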

Template can also parse strings:

In [3]:
template.parse(string)

{'minutes': 1, 'hours': 11}
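
For intuition, the same round trip can be imitated with a regular expression. This is a swapped-in technique for illustration only: the library parses by searching over the dubs, not with regexes, and `WORD_TO_NUMBER`, `PATTERN` and `parse_time` are hypothetical names.

```python
import re

# A tiny hypothetical vocabulary; a real cardinal dub covers the whole range.
WORD_TO_NUMBER = {'one': 1, 'two': 2, 'eleven': 11}

# Longer words go first so that the alternation prefers the longest match.
NUMBER_PATTERN = '|'.join(sorted(WORD_TO_NUMBER, key=len, reverse=True))

PATTERN = re.compile(
    rf'It is (?P<hours>{NUMBER_PATTERN}) hours? and (?P<minutes>{NUMBER_PATTERN}) minutes?'
)


def parse_time(text: str) -> dict:
    """Recover the field values from a rendered sentence."""
    match = PATTERN.fullmatch(text)
    if match is None:
        raise ValueError(f'Cannot parse: {text!r}')
    return {name: WORD_TO_NUMBER[word] for name, word in match.groupdict().items()}


print(parse_time('It is eleven hours and one minute'))
```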

Now to the voiceover. The classes in `kaia.persona.dub.core.dubbing` are responsible for converting the intent objects into tasks for BrainBox.

First, we need to select a voice.

In [4]:
from kaia.brainbox import BrainBox

box = BrainBox()
voice = box.settings.tortoise_tts.test_voice
voice

'test_voice'

Then we choose a batch name to assign to the tasks:

In [5]:
batch = 'sample_voicing'
batch

'sample_voicing'

The idea is that "generic" dubs, such as `CardinalDub` or `DateDub`, are processed once. Then the custom templates are processed. Finally, the non-generic dubs used by these templates (like a local `EnumDub`) are processed. All of this is done by the fragmenter, which is called here through `DubbingTaskCreator.fragment`:

In [6]:
from kaia.persona.dub.languages.en import DubbingTaskCreator

tc = DubbingTaskCreator()
sequences = tc.fragment([CardinalDub(0,60)], [template], voice)

`sequences` is a list of sequences of fragments. Each fragment represents an uninterruptible piece of text that is going to be voiced over. A sequence represents fragments that follow each other in a particular order. For example, "set the timer for seven minutes" is fragmented into "set the timer for", "seven" and "minutes", where "seven" is a placeholder: we need a word here for the sentence to make sense, but we are not going to use its voiceover, because this word is defined by a `CardinalDub`.
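
The fragmentation of a single template can be sketched like this. It is only an illustration: the function `fragment_template` and the `(text, is_placeholder)` representation are hypothetical, and the library's fragments carry more information than shown here.

```python
import re
from typing import Dict, List, Tuple


def fragment_template(template: str, sample_words: Dict[str, str]) -> List[Tuple[str, bool]]:
    """Split a template into (text, is_placeholder) fragments."""
    fragments = []
    # re.split with a capturing group keeps the field names in the result:
    # even indices hold constant text, odd indices hold field names.
    for index, piece in enumerate(re.split(r'\{(\w+)\}', template)):
        if index % 2 == 1:
            # A sample word stands in for the field so the sentence sounds natural.
            fragments.append((sample_words[piece], True))
        elif piece.strip():
            fragments.append((piece.strip(), False))
    return fragments


fragments = fragment_template('Set the timer for {amount} minutes', {'amount': 'seven'})
print(fragments)
```

Only the fragments with `is_placeholder == False` would be kept from this voiceover; the placeholder audio is discarded because the word is voiced separately.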

In [7]:
[(s.get_text(), len(s.get_text())) for s in sequences]

[('Orange, twenty, eleven, thirty, fifty, three, sixty, apple. ', 60),
 ('Orange, six, one, nineteen, thirteen, seven, five, apple. ', 58),
 ('Orange, zero, nine, forty, eighteen, seventeen, apple. ', 55),
 ('Orange, four, two, sixteen, ten, fifteen, fourteen, apple. ', 59),
 ('Orange, twelve, eight, apple. ', 30),
 ('It is one hour and one minute', 29),
 ('Orange, minute, minutes, apple. ', 32),
 ('Orange, hour, hours, apple. ', 28)]

You have probably noticed something odd about all these oranges and apples. These are buffer words, and the quality of dubbing seems to be better with them. TortoiseTTS dubs these sentences, and then we use some internal features of TortoiseTTS to cut the result into slices that correspond to words. The borders of these slices are imperfect, and the buffer words seem to help.

Also, we combine short words into one sentence, which works around a TortoiseTTS problem: for unknown reasons it dubs a short sentence such as "six" as "sixsix".

All of this behaviour is programmable and can be adjusted for other dubbing networks.
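
A minimal sketch of how such sentences could be assembled is given below. The function name `wrap_with_buffers` is hypothetical, and the limit of 60 characters is an assumption read off the output above, not a documented setting.

```python
from typing import List


def wrap_with_buffers(
    words: List[str],
    prefix: str = 'Orange',
    suffix: str = 'apple',
    max_length: int = 60,  # assumed limit, inferred from the sample output
) -> List[str]:
    """Group words into sentences framed by buffer words."""

    def build(group: List[str]) -> str:
        return ', '.join([prefix, *group, suffix]) + '. '

    sentences = []
    current: List[str] = []
    for word in words:
        # Start a new sentence when adding the word would exceed the limit.
        if current and len(build(current + [word])) > max_length:
            sentences.append(build(current))
            current = []
        current.append(word)
    if current:
        sentences.append(build(current))
    return sentences


words = ['twenty', 'eleven', 'thirty', 'fifty', 'three', 'sixty', 'six', 'one']
for sentence in wrap_with_buffers(words):
    print(repr(sentence), len(sentence))
```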

Then, we optimize the sequences by packing them together. This reduces the time TortoiseTTS spends processing them. However, TortoiseTTS cannot process sequences that are too long, so the length is limited.

In [8]:
optimized_sequences = tc.optimize_sequences(sequences)
len(sequences), len(optimized_sequences)

(8, 8)

In our case, optimization didn't change anything, because even the shortest sequences together are longer than the limit.
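
Packing under a length limit can be sketched as a greedy grouping. This is an assumption about the general approach, not the implementation of `optimize_sequences`; the function `pack_texts` and the limits used below are hypothetical.

```python
from typing import List


def pack_texts(texts: List[str], max_length: int) -> List[List[str]]:
    """Greedily group texts, shortest first, keeping each group within max_length."""
    groups: List[List[str]] = []
    current: List[str] = []
    current_length = 0
    for text in sorted(texts, key=len):
        # Close the current group if the next text would not fit into it.
        if current and current_length + len(text) > max_length:
            groups.append(current)
            current, current_length = [], 0
        current.append(text)
        current_length += len(text)
    if current:
        groups.append(current)
    return groups


# Dummy texts with the same lengths as the sequences printed earlier.
texts = ['x' * n for n in (60, 58, 55, 59, 30, 29, 32, 28)]
print(len(pack_texts(texts, max_length=50)))   # nothing can be merged
print(len(pack_texts(texts, max_length=100)))  # some short texts are merged
```

With a limit below the combined length of the two shortest texts, every text stays in its own group, which mirrors the `(8, 8)` result above.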

Then, we create `DubAndCutTasks` (which are not TortoiseTTS-specific) and, from them, the specific `BrainBoxTask` objects for TortoiseTTS:

In [9]:
dub_and_cut_tasks = tc.create_dub_and_cut_tasks(optimized_sequences)
bb_tasks = tc.create_tasks(
    dub_and_cut_tasks,
    'TortoiseTTS',
    'aligned_dub',
    batch)
bb_tasks[:-1]

[{'id': 'id_d921824343074318808b5f545c623361', 'decider': 'TortoiseTTS', 'method': 'aligned_dub', 'arguments': {'voice': 'test_voice', 'text': 'Orange, twenty, eleven, thirty, fifty, three, sixty, apple. '}, 'dependencies': None, 'back_track': None, 'batch': 'sample_voicing'},
 {'id': 'id_57ec2a8c562d49758e79717d96aa7f65', 'decider': 'TortoiseTTS', 'method': 'aligned_dub', 'arguments': {'voice': 'test_voice', 'text': 'Orange, six, one, nineteen, thirteen, seven, five, apple. '}, 'dependencies': None, 'back_track': None, 'batch': 'sample_voicing'},
 {'id': 'id_94edc6489e0544059105617c1d0a0360', 'decider': 'TortoiseTTS', 'method': 'aligned_dub', 'arguments': {'voice': 'test_voice', 'text': 'Orange, zero, nine, forty, eighteen, seventeen, apple. '}, 'dependencies': None, 'back_track': None, 'batch': 'sample_voicing'},
 {'id': 'id_b2e111833f564261af29becb4aa2a403', 'decider': 'TortoiseTTS', 'method': 'aligned_dub', 'arguments': {'voice': 'test_voice', 'text': 'Orange, four, two, sixteen, t

The last task in this list contains all the cuts that are to be made; it is very large, so we omit it.

Uncomment the function call in the following cells and execute them if you have a BrainBox service ready to process the tasks. Otherwise, you can use the ready-made voice pack we provide.

In [11]:
ADDRESS = 'http://192.168.178.50'

api = box.create_api(ADDRESS)

def create_tasks(tasks):
    for task in tasks:
        api.add_task(task)

#create_tasks(bb_tasks)

Now you can monitor your BrainBox server until it finishes the task:

In [12]:
from ipywidgets import HTML

HTML(f'<a href="{api.address}" target="_blank">BrainBox</a>')

HTML(value='<a href="http://192.168.178.50:8090" target="_blank">BrainBox</a>')

The following cell will download the result from the BrainBox and place it locally.

In [17]:
from kaia.infra import Loc
from pathlib import Path
from kaia.persona.dub.languages.en import DubbingPack

# Where the downloaded archive is stored, and where it is unpacked.
pack_path = Path('files/sample_dubbing.zip')
host_path = Loc.temp_folder/'demos/dubbing/sample_dubbing'


def download_pack(recode=False):
    # The last task of the batch with back_track 'Dubbing' holds the result.
    target_task = [t for t in api.get_tasks(batch) if t['back_track'] == 'Dubbing'][-1]
    print(target_task['received_timestamp'])
    result = api.get_result(target_task['id'])
    if result is None:
        raise ValueError('Not yet ready')
    api.download(result, pack_path, True)

#download_pack(True)

`DubbingPack` is a class that contains all the dubbings for all the voices, plus the several options per voice that TortoiseTTS produces by default. To do the actual dubbing:

In [16]:
from ipywidgets import Audio, VBox

pack = DubbingPack.from_zip(host_path, pack_path)

audios = []
for i in range(3):
    dubber = pack.create_dubber(voice, i)
    audios.append(Audio.from_file(dubber.dub_string(string, template), autoplay=False))

print(string)
VBox(audios)

It is eleven hours and one minute


VBox(children=(Audio(value=b'RIFF\x02\xbf\x01\x00WAVEfmt \x10\x00\x00\x00\x01\x00\x01\x00\xc0]\x00\x00\x80\xbb…

The result is not perfect. Aside from the intonation shift (which is probably inevitable), there are annoying noises at the borders of the fragments. These come from imperfect cutting, which relies on internal TortoiseTTS tensors. Hopefully this can be fixed either by postprocessing the fragments or by choosing another voiceover system that supports pauses better.
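
One possible form of such postprocessing is a short linear fade at both ends of every fragment, so that a cut in the middle of a waveform does not produce a click. The sketch below is an untested idea rather than part of the project: `apply_fades` is a hypothetical helper, and it works on a plain list of samples instead of real audio files.

```python
from typing import List


def apply_fades(samples: List[float], fade_length: int) -> List[float]:
    """Linearly fade the first and last `fade_length` samples towards zero."""
    n = len(samples)
    # The two fades must not overlap, so each one gets at most half the samples.
    fade_length = min(fade_length, n // 2)
    result = list(samples)
    for i in range(fade_length):
        gain = i / fade_length
        result[i] *= gain          # fade in from silence
        result[n - 1 - i] *= gain  # fade out to silence
    return result


print(apply_fades([1.0] * 6, fade_length=2))
```

Whether this removes the audible noise depends on how far the cut borders are from the true word boundaries; a fade hides clicks but cannot restore a clipped syllable.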