Summary
The project provides a voice assistant solution for controlling a VoIP telephone.
Funding
The project was funded by the German Federal Ministry of Education and Research under grant number 01IS22S34 from September 2022 to February 2023. The authors are responsible for the content of this publication.
Installation
Ubuntu 20.04 with the apt package manager is used as the reference OS.
On other operating systems a different package manager providing the same packages should work.
Prerequisites
Python
Install Python 3.10:
sudo apt install python3.10
FFmpeg
Install FFmpeg:
sudo apt-get install ffmpeg
aarch64
In an aarch64 environment install the following dependencies:
# aarch64 architecture
sudo apt install python3.10-dev portaudio19-dev
Xavier NX
The Xavier NX needs a special PyTorch version. Follow the NVIDIA instructions, which are summarized here:
sudo apt-get -y install autoconf bc build-essential g++-8 gcc-8 clang-8 lld-8 gettext-base gfortran-8 iputils-ping libbz2-dev libc++-dev libcgal-dev libffi-dev libfreetype6-dev libhdf5-dev libjpeg-dev liblzma-dev libncurses5-dev libncursesw5-dev libpng-dev libreadline-dev libssl-dev libsqlite3-dev libxml2-dev libxslt-dev locales moreutils openssl python-openssl rsync scons python3-pip libopenblas-dev
export TORCH_INSTALL=https://developer.download.nvidia.cn/compute/redist/jp/v502/pytorch/torch-1.13.0a0+410ce96a.nv22.12-cp38-cp38-linux_aarch64.whl
python3 -m pip install --upgrade pip; python3 -m pip install aiohttp; export "LD_LIBRARY_PATH=/usr/lib/llvm-8/lib:$LD_LIBRARY_PATH"; python3 -m pip install --upgrade protobuf; python3 -m pip install --no-cache $TORCH_INSTALL
NOTE: Do not install numpy=='1.19.4' scipy=='1.5.3' as given in the original NVIDIA instructions; these versions are incompatible with Coqui TTS.
Use a recent PyTorch version as TORCH_INSTALL from the PyTorch wheel downloads, matching the running JetPack version (here 502).
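To verify that the wheel installed correctly and CUDA is usable (a quick sanity check, not part of the NVIDIA instructions):
python3 -c "import torch; print(torch.__version__, torch.cuda.is_available())"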
WiFi
Needed if no Ethernet cable is used.
Instructions taken from: https://www.linuxbabe.com/ubuntu/connect-to-wi-fi-from-terminal-on-ubuntu-18-04-19-04-with-wpa-supplicant
NOTE: Replace wlp3s0 with the interface reported by iwconfig.
sudo apt install wireless-tools
sudo apt install net-tools
iwconfig
sudo ifconfig wlp3s0 up
sudo iw dev wlp3s0 scan ap-force | grep SSID
sudo apt install wpasupplicant
wpa_passphrase 'your-ESSID' your-wifi-passphrase | sudo tee /etc/wpa_supplicant.conf
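# the generated /etc/wpa_supplicant.conf should look roughly like this:
# network={
#     ssid="your-ESSID"
#     #psk="your-wifi-passphrase"
#     psk=<64-hex-digit key derived from the passphrase>
# }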
# test connection
sudo wpa_supplicant -c /etc/wpa_supplicant.conf -i wlp3s0
# different terminal
sudo dhclient wlp3s0
ip addr show wlp3s0
# CTRL+C to abort wpa_supplicant
sudo cp /lib/systemd/system/wpa_supplicant.service /etc/systemd/system/wpa_supplicant.service
sudo nano /etc/systemd/system/wpa_supplicant.service
# change line to ExecStart=/sbin/wpa_supplicant -u -s -c /etc/wpa_supplicant.conf -i wlp3s0
# comment out Alias=dbus-fi.w1.wpa_supplicant1.service
sudo systemctl daemon-reload
sudo systemctl enable wpa_supplicant.service
sudo nano /etc/systemd/system/dhclient.service
Use the following content:
[Unit]
Description=DHCP Client
Before=network.target
After=wpa_supplicant.service
[Service]
Type=forking
ExecStart=/sbin/dhclient wlp3s0 -v
ExecStop=/sbin/dhclient wlp3s0 -r
Restart=always
[Install]
WantedBy=multi-user.target
sudo systemctl enable dhclient.service
sudo systemctl restart wpa_supplicant.service
# check status
sudo journalctl -u wpa_supplicant.service
sudo systemctl restart dhclient.service
ifconfig
Time Daemon
timedatectl set-ntp true
sudo nano /etc/systemd/timesyncd.conf
Use:
[Time]
NTP=de.pool.ntp.org
Restart and check time:
sudo timedatectl set-timezone Europe/Berlin
systemctl restart systemd-timesyncd
journalctl -u systemd-timesyncd
Hardware
MeLe Quieter2A
Add the WiFi driver:
sudo apt install rtl8821ce-dkms
sudo reboot
Seeed reComputer J2012
This mini PC is equipped with an NVIDIA Xavier NX 16 GB module to support offline speech recognition with OpenAI Whisper.
An Intel Wireless-AC 8265 M.2 key A+E module (or a compatible WiFi module) must be added for WiFi support, together with two IPEX MHF2 cables and two SMA antennas.
Add M.2 SSD Disk
More disk space is needed to run all necessary tools. Use a 64 GB or larger M.2 SSD and follow the instructions to boot from the SSD.
Flash latest JetPack version
To get the latest JetPack version (here 502), follow the instructions to update the system.
WiFi after sleep not working
The system has issues restoring the network (LAN and WiFi) after resuming from sleep.
The following script restarts the WiFi after sleep mode:
sudo -i
cat <<'EOT' > /lib/systemd/system-sleep/restorenetwork.sleep
#!/bin/sh
PATH=/sbin:/usr/sbin:/bin:/usr/bin
case "$1" in
pre)
;;
post)
modprobe -r iwlmvm
modprobe -r iwlwifi
modprobe iwlmvm
;;
esac
exit 0
EOT
chmod +x /lib/systemd/system-sleep/restorenetwork.sleep
exit
Telephone NLU
Install Telephone NLU.
Program
Install the program:
pip install git+https://github.com/kaoh/HandsFreeTelephone.git
If the Whisper installation fails try:
pip install setuptools-rust
Download speech models:
hands-free-telephone-setup
Configuration
Microphone Intensity
Each microphone has a different intensity level that can be considered silence. Measure it with:
audio-intensity
Read something aloud while the program is running. The reported value can be used with the --mic_threshold option of the hands-free-telephone binary.
Some recorded values:
| Speaker | Value |
|---|---|
| ReSpeaker USB Mic Array | 0.010 |
| Jabra UC 750 | 0.015 |
| Epos 40+ | 0.010 |
| Logitech P510e | 0.025 |
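For example, using the measured value for the ReSpeaker USB Mic Array:
hands-free-telephone --mic_threshold 0.010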
Cloud Credentials
Create a directory hands_free_telephone.
If installed as system service with sudo mkdir /home/hands-free-user/hands_free_telephone.
If locally installed in the user directory with mkdir ~/hands_free_telephone.
Place there:
- The MS Azure cloud configuration ms-azure.json. See below how to create it.
- The Google cloud API configuration google-cloud.json. Download it from the Google console.
- The IBM Cloud API configurations ibm-sst-cloud.json and ibm-tts-cloud.json. Download them from the IBM console.
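After placing the files, the directory (here the local variant) should contain, depending on which cloud services are used:
ls ~/hands_free_telephone
# google-cloud.json  ibm-sst-cloud.json  ibm-tts-cloud.json  ms-azure.json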
MS Azure Configuration
Look up the API key and insert it into the ms-azure.json file:
{
"api-key": "API-KEY"
}
Run as System Service
To run the program at system start, execute the following scripts or adjust them to support your OS.
sudo -i
# execute the following scripts
exit
Create a dedicated system user:
adduser --shell=/bin/false --gecos "Hands-Free User" --disabled-login handsfree
usermod -a -G audio handsfree
Create the systemd scripts:
mkdir -p /lib/systemd/system/
cp hands_free_telephone.service /lib/systemd/system/
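The hands_free_telephone.service file referenced above is expected in the working directory. A minimal sketch of such a unit (user and binary path are assumptions, adjust them to your installation):
[Unit]
Description=Hands-Free Telephone
After=network.target
[Service]
Type=simple
User=handsfree
ExecStart=/usr/local/bin/hands-free-telephone
Restart=always
[Install]
WantedBy=multi-user.target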
Enable and restart services:
/bin/systemctl enable hands_free_telephone
/bin/systemctl restart hands_free_telephone
In case of instabilities a cron job can be created to restart the service every day:
mkdir -p /etc/cron.d
cat <<EOT > /etc/cron.d/hands_free_telephone
30 3 * * * root /bin/systemctl restart hands_free_telephone
EOT
Execution
Start the Rasa server and the Rasa action server of the telephone_nlu project (typically with rasa run and rasa run actions). Then:
hands-free-telephone
NOTE: When running under aarch64 the libgomp version bundled with scikit-learn must be preloaded. Export LD_PRELOAD first (or add LD_PRELOAD as an environment variable to the starter):
export LD_PRELOAD=</usr | <virtual env directory>>/lib/python3.8/site-packages/sklearn/__check_build/../../scikit_learn.libs/libgomp-d22c30c5.so.1.0.0
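If the program runs as a system service, LD_PRELOAD can instead be set in the unit file, e.g. (sketch, using the actual library path from above):
[Service]
Environment="LD_PRELOAD=<virtual env directory>/lib/python3.8/site-packages/scikit_learn.libs/libgomp-d22c30c5.so.1.0.0"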
To get a help screen type:
hands-free-telephone --help
Development
Build Project
Checkout the project:
git clone <URL>
git pull --recurse-submodules
git submodule init
git submodule update
cd <project directory>
pyenv install 3.8.16
pyenv local 3.8.16
python3 -m venv venv
source venv/bin/activate
python -m pip install -U pip
pip install -r requirements.txt
Install the prerequisites from the installation section.
Test Packaging
Test the pyproject.toml package:
python3 -m venv testbuild
source testbuild/bin/activate
pip install -U pip
pip install .
To clean the build, run (otherwise old artifacts are retained):
rm -rf dist build *.egg-info
Vosk (Keyword Recognition)
Vosk is used as an offline keyword recognizer.
The vosk-model-small-de-0.15 model is downloaded into the vosk folder.
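If the automatic download fails, the model can be fetched manually (URL assumed from the Vosk model list at alphacephei.com):
curl -LO https://alphacephei.com/vosk/models/vosk-model-small-de-0.15.zip
unzip vosk-model-small-de-0.15.zip -d vosk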
Coqui STT (Offline Recognition)
This only works on x86_64 since no aarch64 wheels are provided.
The German speech recognition uses the German model from Coqui.
External Scorer
These steps have to be executed to improve the scorer by providing a target vocabulary.
A language model matching the corpus of telephone related commands is created according to the Coqui documentation.
Download the alphabet.txt and the checkpoints of the German Coqui model and place them into the coqui folder.
The checkpoint files are linked from Jaco-Assistant.
Look at the Mozilla DeepSpeech German link, which should point to the DeepSpeech model d17s5_de.
Download the d17s5_de.tar.gz file and extract it into the d17s5 directory under the coqui folder.
Place a copy of alphabet.txt in the d17s5 directory as well.
NOTE: By default PyCharm removes trailing whitespace. Disable this under "Settings | Editor | General | Remove trailing spaces on Save".
The space must be included in alphabet.txt as a possible character, otherwise rubbish will be produced for the scorer.
Pay attention to this when editing the file.
Virtual Environment
For the scorer a separate virtual environment is set up. Since TensorFlow 1.15.4 is not available for Python 3.8, a Python 3.7 environment is needed.
git clone https://github.com/pyenv/pyenv.git $HOME/.pyenv
Add to your ~/.bashrc:
## pyenv configs
export PYENV_ROOT="$HOME/.pyenv"
export PATH="$PYENV_ROOT/bin:$PATH"
if command -v pyenv 1>/dev/null 2>&1; then
eval "$(pyenv init -)"
fi
Log out or run:
source ~/.bashrc
pyenv install 3.7.9
pyenv local 3.7.9
python3 -m venv coqui-stt-train-venv
source coqui-stt-train-venv/bin/activate
Coqui Tools
git clone https://github.com/coqui-ai/STT
cd STT
python -m pip install --upgrade pip wheel setuptools
python -m pip install --upgrade -e .
Language Corpus
Defined in coqui/data/corpus.txt. Execute:
cd coqui
python3 create_input.py
This will create input.txt based on the corpus and several placeholder files.
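For illustration, corpus.txt holds one utterance per line; hypothetical entries for telephone commands could look like:
rufe katja an
nimm den anruf an
lege auf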
Build KenLM
cd STT
git submodule init
git pull --recurse-submodules
cd kenlm
sudo apt install build-essential cmake libboost-system-dev libboost-thread-dev libboost-program-options-dev libboost-test-dev libeigen3-dev zlib1g-dev libbz2-dev liblzma-dev
mkdir -p build
cd build
cmake ..
make -j 4
Execute generate_lm.py
cd coqui
python3 ../STT/data/lm/generate_lm.py --input_txt input.txt --output_dir . \
--top_k 10000 --kenlm_bins ../STT/kenlm/build/bin/ \
--arpa_order 3 --max_arpa_memory "85%" --arpa_prune "0|0|0" \
--binary_a_bits 255 --binary_q_bits 8 --binary_type trie --discount_fallback
This produces lm.binary and vocab-10000.txt, which are used in the next step.
Create Scorer
# Download and extract appropriate native_client package:
curl -LO https://github.com/coqui-ai/STT/releases/download/v1.4.0/native_client.tflite.Linux.tar.xz
tar xvf native_client.tflite.Linux.tar.xz
./generate_scorer_package --checkpoint d17s5 --lm lm.binary --vocab vocab-10000.txt --package kenlm.scorer --default_alpha 0.931289039105002 --default_beta 1.1834137581510284
Optimize the scorer's alpha and beta values:
NOTE: This step did not improve the recognition after several training epochs.
cp -R STT/training/coqui_stt_training .
cp STT/lm_optimizer.py .
python3 lm_optimizer.py --test_files training/train.csv --checkpoint_dir d17s5 kenlm.scorer
IBM Speech To Text Customizations
Create a custom speech corpus:
export API_KEY=
curl -X POST -u "apikey:${API_KEY}" --header "Content-Type: application/json" --data "{\"name\": \"Telefon model\", \"base_model_name\": \"de-DE_BroadbandModel\", \"description\": \"Telefon custom language model\"}" "https://api.eu-de.speech-to-text.watson.cloud.ibm.com/v1/customizations"
# response:
#{"customization_id": "7e76cde1-e1ea-404a-97ff-a6e643ad2409"}Use the input.txt creates with python3 create_input.py in the deepspeech directory and then create some samples.
Not all lines of the input.txt corpus are needed. The following keeps each line with a probability of roughly 1% ($RANDOM is uniformly distributed over 0-32767):
while read -r line; do
    if [ "$RANDOM" -lt $((32767 / 100)) ]; then
        echo "$line" >> input_sample.txt
    fi
done < input.txt
Import to IBM Speech To Text:
curl -X POST -u "apikey:${API_KEY}" --data-binary @deepspeech/input_sample.txt "https://api.eu-de.speech-to-text.watson.cloud.ibm.com/v1/customizations/7e76cde1-e1ea-404a-97ff-a6e643ad2409/corpora/telephone?allow_overwrite=true"
Inspect the result:
curl -X GET -u "apikey:${API_KEY}" "https://api.eu-de.speech-to-text.watson.cloud.ibm.com/v1/customizations/7e76cde1-e1ea-404a-97ff-a6e643ad2409/corpora/telephone"
#response:
#{
# "out_of_vocabulary_words": 131,
# "total_words": 3374880,
# "name": "telephone",
# "status": "analyzed"
#}
Add grammar (Specification):
curl -X POST -u "apikey:${API_KEY}" --header "Content-Type: application/srgs" --data-binary @deepspeech/grammar.txt "https://api.eu-de.speech-to-text.watson.cloud.ibm.com/v1/customizations/7e76cde1-e1ea-404a-97ff-a6e643ad2409/grammars/telephone-abnf?allow_overwrite=true"
Query status:
curl -X GET -u "apikey:${API_KEY}" "https://api.eu-de.speech-to-text.watson.cloud.ibm.com/v1/customizations/7e76cde1-e1ea-404a-97ff-a6e643ad2409/grammars/telephone-abnf"
Train the model:
curl -X POST -u "apikey:${API_KEY}" "https://api.eu-de.speech-to-text.watson.cloud.ibm.com/v1/customizations/7e76cde1-e1ea-404a-97ff-a6e643ad2409/train"
Inspect the result:
curl -X GET -u "apikey:${API_KEY}" "https://api.eu-de.speech-to-text.watson.cloud.ibm.com/v1/customizations/7e76cde1-e1ea-404a-97ff-a6e643ad2409"
NOTE: If some words are not recognized correctly it might help to correct the word, e.g. by setting the sounds_like parameter.
Check word:
curl -X GET -u "apikey:${API_KEY}" "https://api.eu-de.speech-to-text.watson.cloud.ibm.com/v1/customizations/7e76cde1-e1ea-404a-97ff-a6e643ad2409/words/Katja"
Set sounds_like:
curl -X PUT -u "apikey:${API_KEY}" --header "Content-Type: application/json" --data "{\"sounds_like\":[\"katja\"]}" "https://api.eu-de.speech-to-text.watson.cloud.ibm.com/v1/customizations/7e76cde1-e1ea-404a-97ff-a6e643ad2409/words/Katja"
Test recognition:
curl -X POST -u "apikey:${API_KEY}" --header "Content-Type: audio/wav" --data-binary @.config/log/2021-11-03_22:15:00:049934.wav "https://api.eu-de.speech-to-text.watson.cloud.ibm.com/v1/recognize?model=de-DE_BroadbandModel&language_customization_id=7e76cde1-e1ea-404a-97ff-a6e643ad2409&grammar_name=telephone-abnf"
torchaudio Compilation
This is necessary for an aarch64 computer, e.g. a Raspberry Pi 4 or Xavier NX.
For aarch64 a wheel must be created manually and placed in the prebuilt folder.
git clone https://github.com/pytorch/audio.git
cd audio
python setup.py install
cd dist
wheel convert torchaudio-2.0.0a0+c6a5235-py3.8-linux-aarch64.egg
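Copy the resulting wheel into the prebuilt folder of the project (the exact wheel name depends on the commit built):
cp torchaudio-*.whl <project directory>/prebuilt/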