Rhasspy is an excellent piece of voice management software. It is open-source and multi-platform, and its API lets us:

* Manage the microphone and speakers
* Run speech recognition based on predefined sentences
* Use text-to-speech, although this feature is less impressive
* Handle the wake word

In other words, Rhasspy is a solid foundation for our home assistant.

The simplest way to install it is with Docker, so you need to have Docker on your local machine.

In [2]:
from kaia.avatar.dub.languages.en import RhasspyAPI

RhasspyAPI.warmup()

This will download everything that is needed.

You only need to do this once: Rhasspy adds itself to Docker startup and starts automatically when the system boots (or, on Windows, when Docker starts).

The command _will not_ connect Rhasspy to your microphone or speakers, as we do not intend to use this functionality right now.

You can now open Rhasspy and see what's there. Apart from the one manual step described below, you won't need to configure it by hand, as in the following cells we'll configure Rhasspy via the API.

In [3]:
from IPython.display import HTML

ADDRESS = '127.0.0.1:12101'

HTML(f'<a href="http://{ADDRESS}" target="_blank">Open Rhasspy</a>')

Open the link, set "Speech-to-text" to "Kaldi" and "Intent recognition" to "Fsticuffs". Save and restart, then click "Download" at the top of the page. This will configure Rhasspy for the functionality we need.

This can surely be done programmatically via the API, but I could not get it working quickly and decided not to dig into this topic.
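For reference, here is a minimal, untested sketch of how this manual step might be scripted. It assumes that the Rhasspy 2.5 HTTP API exposes `/api/profile`, `/api/restart` and `/api/download-profile`, and that the relevant profile keys are `speech_to_text.system` and `intent.system`. These names are taken from the Rhasspy documentation, not from this notebook, and `select_kaldi_and_fsticuffs`, `_post` and `configure_rhasspy` are hypothetical helpers.

```python
import copy
import json
import urllib.request


def select_kaldi_and_fsticuffs(profile: dict) -> dict:
    """Return a copy of the profile with Kaldi and Fsticuffs selected."""
    updated = copy.deepcopy(profile)
    updated.setdefault('speech_to_text', {})['system'] = 'kaldi'
    updated.setdefault('intent', {})['system'] = 'fsticuffs'
    return updated


def _post(url: str, body: bytes = b'') -> bytes:
    """Send a POST request and return the raw response body."""
    request = urllib.request.Request(url, data=body, method='POST')
    with urllib.request.urlopen(request) as response:
        return response.read()


def configure_rhasspy(address: str) -> None:
    """Assumed endpoint sequence; not verified against a live server."""
    base = f'http://{address}/api'
    # Assumption: layers=profile returns only the non-default settings.
    with urllib.request.urlopen(f'{base}/profile?layers=profile') as response:
        current = json.load(response)
    payload = json.dumps(select_kaldi_and_fsticuffs(current)).encode('utf-8')
    _post(f'{base}/profile', payload)      # save the updated profile
    _post(f'{base}/restart')               # apply the new settings
    _post(f'{base}/download-profile')      # fetch the missing model files
```

Calling `configure_rhasspy(ADDRESS)` would then stand in for the manual clicks. In practice the server may need some time after the restart before it accepts the download request, so a retry loop might be required.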

Now let's try Rhasspy in action. First, let's repeat the steps from the previous notebooks and create an audio file.

In [8]:
from kaia.brainbox import BrainBoxTask, BrainBoxTaskPack, DownloadingPostprocessor, BrainBox
from kaia.infra import FileIO
from ipywidgets import Audio
from kaia.avatar.server.dubbing_service import BrainBoxDubbingService
from kaia.avatar.dub.languages.en import *

def task_generator(text, voice):
    return BrainBoxTaskPack(
        BrainBoxTask(
            id = BrainBoxTask.safe_id(), 
            decider='OpenTTS', 
            arguments=dict(voice='coqui-tts:en_vctk', lang='en', speakerId=voice, text=text)
            ),
        (),
        DownloadingPostprocessor(take_element_before_downloading=0, opener=FileIO.read_bytes)
    )

template = Template(
    'Set the timer for {hours} {hours_word} and {minutes} {minutes_word}',
    hours = CardinalDub(0, 24),
    hours_word = PluralAgreement('hours', 'hour', 'hours'),
    minutes = CardinalDub(0, 60),
    minutes_word = PluralAgreement('minutes', 'minute', 'minutes')
).with_name('set_the_timer')

with BrainBox().create_test_api() as bb_api:
    service = BrainBoxDubbingService(task_generator, bb_api)
    utterance = template.utter(dict(hours=11, minutes=1))
    voiceover = service.dub_string(utterance.to_str(), 'p225').data
Audio(value=voiceover, autoplay = False)

Audio(value=b'RIFFD\x18\x02\x00WAVEfmt \x10\x00\x00\x00\x01\x00\x01\x00"V\x00\x00D\xac\x00\x00\x02\x00\x10\x00…

Run this cell to train Rhasspy on the utterance template (currently the only one) and recognize the audio:

In [9]:
rhasspy_api = RhasspyAPI(ADDRESS, [template])
rhasspy_api.train()
recognized_utterance = rhasspy_api.recognize(voiceover)
print(recognized_utterance.template.name, recognized_utterance.value)

set_the_timer {'minutes': 1, 'hours': 11}


So, it recognizes the file correctly.

Next, we can run the test on a larger scale.
First, we use the predefined `Intents`, which contain various intents.
Second, we use `TestingTools` to generate many variants of each template with different values.

In [10]:
import os
from kaia.avatar.dub.sandbox import Intents
from kaia.avatar.dub.languages.en import TestingTools
import pandas as pd
from kaia.infra import FileIO
from pathlib import Path

tmp_folder = Path('rhasspy_test_files')
os.makedirs(tmp_folder, exist_ok=True)
samples_path = tmp_folder/'samples.pkl'

if not samples_path.is_file():
    test = TestingTools(Intents.get_templates(), 100)
    FileIO.write_pickle(test, samples_path)
else:
    test = FileIO.read_pickle(samples_path)
    
TestingTools.samples_to_df(test.samples).head()

| | s | true_intent | true_value | recognition_obj | parsed_intent | parsed_value | failure | match_intent | match_keys | match_values | match |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Yes | kaia.avatar.dub.sandbox.intents.Intents.yes | {} | | | | False | False | False | False | False |
| 1 | Sure | kaia.avatar.dub.sandbox.intents.Intents.yes | {} | | | | False | False | False | False | False |
| 2 | Go on | kaia.avatar.dub.sandbox.intents.Intents.yes | {} | | | | False | False | False | False | False |
| 3 | No. | kaia.avatar.dub.sandbox.intents.Intents.no | {} | | | | False | False | False | False | False |
| 4 | Stop! | kaia.avatar.dub.sandbox.intents.Intents.no | {} | | | | False | False | False | False | False |


We now need to build voiceovers for all of these utterances. Here, a `MediaLibrary` comes in very handy.

In [11]:
import os
import shutil
from pathlib import Path

media_library_path = Path(tmp_folder/'rhasspy_test_voiceover.zip')

def create_voiceover_task(samples):
    tasks = []
    tags = {}
    dependencies = {}

    # One OpenTTS task per sample; the tags link each result back to its sample.
    for i, sample in enumerate(samples):
        task_id = BrainBoxTask.safe_id()  # avoid shadowing the built-in id()
        tasks.append(BrainBoxTask(
            id=task_id,
            decider='OpenTTS',
            arguments=dict(text=sample.s, voice='coqui-tts:en_vctk', lang='en', speakerId='p225')
        ))
        dependencies[task_id] = task_id
        tags[task_id] = dict(sample_id=i, s=sample.s)

    # The Collector task depends on all the OpenTTS tasks and receives their tags.
    voiceover_task = BrainBoxTask(
        id=BrainBoxTask.safe_id(),
        decider='Collector',
        arguments=dict(tags=tags),
        dependencies=dependencies
    )
    return BrainBoxTaskPack(voiceover_task, tasks, DownloadingPostprocessor())


if not media_library_path.is_file():
    with BrainBox().create_test_api() as api:
        task = create_voiceover_task(test.samples)
        path = api.execute(task)
        shutil.copy(path, media_library_path)

Let's browse what's in the media library:

In [14]:
from kaia.brainbox import MediaLibrary

lib = MediaLibrary.read(media_library_path)
lib.to_df().head()

| | sample_id | s | option_index | filename | timestamp | job_id |
|---|---|---|---|---|---|---|
| 0 | 0 | Yes | 0 | a4bfe29d-2faf-4ed8-bb64-c90e3f906604.wav | 2024-02-17 11:55:41.345600 | id_121e1ead144c468bb68b462d53d6a944 |
| 1 | 1 | Sure | 0 | 13bc1108-39ae-4042-a50f-1727df84bad4.wav | 2024-02-17 11:55:41.345600 | id_ca1f6e7eb6074ee382e310c53c528c39 |
| 2 | 2 | Go on | 0 | 3d2b913c-641c-4d82-ba47-dee0a4f35308.wav | 2024-02-17 11:55:41.345600 | id_c42570dc11da4ee5b826eb442902b65b |
| 3 | 3 | No. | 0 | 5ab0dac1-f321-43fd-81e9-89405146eeed.wav | 2024-02-17 11:55:41.345600 | id_a50c86288cad4df78f6f083352a22847 |
| 4 | 4 | Stop! | 0 | 3246a745-15fb-4aa4-be42-06eb270acc4b.wav | 2024-02-17 11:55:41.345600 | id_acbe52ac7d014688a24f7ccebe91398a |


Now we will feed all the sound files from the media library to `RhasspyAPI` and recognize them:

In [15]:
test_result_path = tmp_folder/'test_results.pkl'

if not test_result_path.is_file():
    lib = MediaLibrary.read(media_library_path)
    sample_index_to_file = {record.tags['sample_id'] : record for record in lib.records}
    rhasspy_api = RhasspyAPI(ADDRESS, test.intents)
    rhasspy_api.train()
    test_result = test.test_voice(sample_index_to_file, rhasspy_api)
    FileIO.write_pickle(test_result, test_result_path)
else:
    test_result = FileIO.read_pickle(test_result_path)
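The load-or-compute pattern used in the last few cells (run the expensive step only if its result file is missing) could be factored into a small helper. The sketch below uses the standard `pickle` module instead of `FileIO`, and `cached` is a hypothetical name, not part of kaia.

```python
import pickle
from pathlib import Path


def cached(path, factory):
    """Load a pickled value from path, or compute it with factory() and store it."""
    path = Path(path)
    if path.is_file():
        with path.open('rb') as stream:
            return pickle.load(stream)
    value = factory()
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open('wb') as stream:
        pickle.dump(value, stream)
    return value


# Possible usage with the names from this notebook (not executed here):
# test = cached(samples_path, lambda: TestingTools(Intents.get_templates(), 100))
```

Deleting the file forces the step to run again, which is the same behavior as in the cells above.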

Let's see some stats:

In [19]:
df = test.samples_to_df(test_result)
df.match_intent.mean(), df.match_values.mean()

(0.9935483870967742, 0.7806451612903226)
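Each `.mean()` here is the share of `True` values in a boolean column, that is, the fraction of samples where the intent (or the parsed values) matched. The two figures are consistent with 154 and 121 matching samples out of 155, although the notebook does not print the sample count. To see where the value mismatches come from, the rate can be broken down by intent. The sketch below does this in plain Python on made-up rows; only the column names `true_intent` and `match_values` come from the samples table, and `match_rate_by_intent` is a hypothetical helper.

```python
from collections import defaultdict


def match_rate_by_intent(rows, flag='match_values'):
    """Return the fraction of rows with a truthy flag, grouped by true_intent."""
    counts = defaultdict(lambda: [0, 0])  # intent -> [matches, total]
    for row in rows:
        counter = counts[row['true_intent']]
        counter[0] += 1 if row[flag] else 0
        counter[1] += 1
    return {intent: matches / total for intent, (matches, total) in counts.items()}


# Made-up rows for illustration only, not results from the notebook.
rows = [
    {'true_intent': 'yes', 'match_values': True},
    {'true_intent': 'timer', 'match_values': True},
    {'true_intent': 'timer', 'match_values': False},
]
print(match_rate_by_intent(rows))
```

With the real data, `df.groupby('true_intent').match_values.mean()` should give the same kind of breakdown.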

## Notes on Rhasspy/TTS test

Using Rhasspy to test TTS is a great way to reveal flaws in a TTS solution. This is how we realized that using TortoiseTTS to produce fragments and then combining these fragments is not really a viable option. We also discovered some fixable issues:

* The upper bound on the length of a string that Tortoise can voice over is around 60-70 characters. Much longer strings are sometimes processed correctly, but not reliably.

* Sequences like "six, sixteenth, sixth, sixtieth, sixty, tenth" are not the best way to organize the voiceover, as TortoiseTTS fails to pronounce "tenth" in this case.

* How to cut a sequence like "three, four" into fragments: the correct cut appears to be "three, " and "four". Cutting into "three" and "four" loses the ending of "three", while cutting into "three," and " four" adds noise to the beginning of "four". Unfortunately, TortoiseTTS has no pause tag that could improve this further.
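The cutting rule from the last item can be sketched as a small function that keeps the comma and the following space with the preceding fragment. `cut_sequence` is a hypothetical helper, not part of kaia, and it works on the text only; locating the matching cut points in the audio is not shown.

```python
import re
from typing import List


def cut_sequence(text: str) -> List[str]:
    """Split a comma-separated sequence, leaving ', ' attached to the left fragment."""
    # Each fragment is a run of non-comma characters followed either by
    # a comma with its trailing whitespace or by the end of the string.
    return re.findall(r'[^,]+(?:,\s*|$)', text)


print(cut_sequence('three, four'))
```

For well-formed input, joining the fragments gives back the original string, so no characters are lost at the cut points.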

Moreover, we used this analysis to understand which fragments need to be re-generated.