The project defines a voice assistant solution for controlling a VoIP telephone.
The project was funded by the German Federal Ministry of Education and Research under grant number 01IS22S34 from September 2022 to February 2023. The authors are responsible for the content of this publication.
Ubuntu 20.04 is used as the reference OS with the apt package manager.
On other operating systems, a different package manager providing the same packages should work.
Install Python 3.10:
sudo apt install python3.10
FFmpeg:
sudo apt-get install ffmpeg
In an aarch64 environment install the following dependencies:
# aarch64 architecture
sudo apt install python3.10-dev portaudio19-dev
The Xavier NX needs a special PyTorch version. Follow the NVIDIA instructions, summarized here:
sudo apt-get -y install autoconf bc build-essential g++-8 gcc-8 clang-8 lld-8 gettext-base gfortran-8 iputils-ping libbz2-dev libc++-dev libcgal-dev libffi-dev libfreetype6-dev libhdf5-dev libjpeg-dev liblzma-dev libncurses5-dev libncursesw5-dev libpng-dev libreadline-dev libssl-dev libsqlite3-dev libxml2-dev libxslt-dev locales moreutils openssl python-openssl rsync scons python3-pip libopenblas-dev
export TORCH_INSTALL=https://developer.download.nvidia.cn/compute/redist/jp/v502/pytorch/torch-1.13.0a0+410ce96a.nv22.12-cp38-cp38-linux_aarch64.whl
python3 -m pip install --upgrade pip; python3 -m pip install aiohttp; export "LD_LIBRARY_PATH=/usr/lib/llvm-8/lib:$LD_LIBRARY_PATH"; python3 -m pip install --upgrade protobuf; python3 -m pip install --no-cache $TORCH_INSTALL
NOTE: Do not install numpy=='1.19.4' scipy=='1.5.3' as given in the original NVIDIA instructions; these versions are incompatible with Coqui TTS.
Use a recent PyTorch version as TORCH_INSTALL matching the running JetPack version (here 502) from the PyTorch wheel downloads.
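To pick a matching wheel, the running L4T/JetPack release can be checked first (a sketch; `/etc/nv_tegra_release` is present on Jetson devices only):

```shell
# Print the L4T release this Jetson is running; on a non-Jetson machine
# the file does not exist and a short notice is printed instead.
cat /etc/nv_tegra_release 2>/dev/null || echo "not a Jetson (no /etc/nv_tegra_release)"
```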
This is only needed if no Ethernet cable is used.
Instructions taken from: https://www.linuxbabe.com/ubuntu/connect-to-wi-fi-from-terminal-on-ubuntu-18-04-19-04-with-wpa-supplicant
NOTE: Replace `wlp3s0` with the interface reported by `iwconfig`.
sudo apt install wireless-tools
sudo apt install net-tools
iwconfig
sudo ifconfig wlp3s0 up
sudo iw dev wlp3s0 scan ap-force | grep SSID
sudo apt install wpasupplicant
wpa_passphrase 'your-ESSID' your-wifi-passphrase | sudo tee /etc/wpa_supplicant.conf
# test connection
sudo wpa_supplicant -c /etc/wpa_supplicant.conf -i wlp3s0
# different terminal
sudo dhclient wlp3s0
ip addr show wlp3s0
# CTRL+C to abort wpa_supplicant
sudo cp /lib/systemd/system/wpa_supplicant.service /etc/systemd/system/wpa_supplicant.service
sudo nano /etc/systemd/system/wpa_supplicant.service
# change line to ExecStart=/sbin/wpa_supplicant -u -s -c /etc/wpa_supplicant.conf -i wlp3s0
# comment out Alias=dbus-fi.w1.wpa_supplicant1.service
sudo systemctl daemon-reload
sudo systemctl enable wpa_supplicant.service
sudo nano /etc/systemd/system/dhclient.service
Use the following content:
[Unit]
Description=DHCP Client
Before=network.target
After=wpa_supplicant.service
[Service]
Type=forking
ExecStart=/sbin/dhclient wlp3s0 -v
ExecStop=/sbin/dhclient wlp3s0 -r
Restart=always
[Install]
WantedBy=multi-user.target
sudo systemctl enable dhclient.service
sudo systemctl restart wpa_supplicant.service
# check status
sudo journalctl -u wpa_supplicant.service
sudo systemctl restart dhclient.service
ifconfig
timedatectl set-ntp true
sudo nano /etc/systemd/timesyncd.conf
Use:
[Time]
NTP=de.pool.ntp.org
Restart and check time:
sudo timedatectl set-timezone Europe/Berlin
systemctl restart systemd-timesyncd
journalctl -u systemd-timesyncd
Add the WiFi driver:
sudo apt install rtl8821ce-dkms
sudo reboot
This mini PC is equipped with an NVIDIA Xavier NX 16 GB module to support offline speech recognition with OpenAI Whisper.
The WiFi Intel Wireless-AC 8265 M.2 key A+E module (or a compatible WiFi module) must be added to have WiFi support.
Additionally, two IPEX MHF2 cables and two SMA antennas are required.
More disk space is needed to run all necessary tools. Use a 64 GB or larger M.2 SSD and follow the instructions to boot from the SSD.
To get the latest JetPack version (here 502), follow the instructions to update the system.
The system has issues restoring the network (LAN and WiFi) after resuming from sleep.
This script restarts the WiFi after sleep mode:
sudo -i
cat <<'EOT' > /lib/systemd/system-sleep/restorenetwork.sleep
#!/bin/sh
PATH=/sbin:/usr/sbin:/bin:/usr/bin
case "$1" in
pre)
;;
post)
modprobe -r iwlmvm
modprobe -r iwlwifi
modprobe iwlmvm
;;
esac
exit 0
EOT
chmod +x /lib/systemd/system-sleep/restorenetwork.sleep
exit
Install Telephone NLU.
Install the program:
pip install git+https://github.com/kaoh/HandsFreeTelephone.git
If the Whisper installation fails, try:
pip install setuptools-rust
Download speech models:
hands-free-telephone-setup
Each microphone has a different baseline intensity that can be considered silence. Measure it with:
audio-intensity
Read something aloud while the program is running. The reported value can be passed to the `--mic_threshold` option of the `hands-free-telephone` binary, e.g. `hands-free-telephone --mic_threshold 0.015`.
Some recorded values:
| Microphone | Value |
|---|---|
| ReSpeaker USB Mic Array | 0.010 |
| Jabra UC 750 | 0.015 |
| Epos 40+ | 0.010 |
| Logitech P510e | 0.025 |
Create a directory `hands_free_telephone`:
- If installed as a system service: sudo mkdir /home/hands-free-user/hands_free_telephone
- If installed locally in the user directory: mkdir ~/hands_free_telephone
Place the following files there:
- The MS Azure cloud configuration `ms-azure.json`. See below how to create it.
- The Google Cloud API configuration `google-cloud.json`. Download it from the Google console.
- The IBM Cloud API configurations `ibm-sst-cloud.json` and `ibm-tts-cloud.json`. Download them from the IBM console.
Look up the API key and insert it into the `ms-azure.json` file:
{
"api-key": "API-KEY"
}
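As a sketch, the file can be created and validated like this (using `/tmp` only for illustration; the real file belongs in the `hands_free_telephone` directory, and `API-KEY` is a placeholder for the real key):

```shell
# Create the configuration file (replace API-KEY with the real key).
cat > /tmp/ms-azure.json <<'EOF'
{
    "api-key": "API-KEY"
}
EOF
# Validate that it parses as JSON and the key is readable.
python3 -c "import json; print(json.load(open('/tmp/ms-azure.json'))['api-key'])"
# prints API-KEY
```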
To run the program at system start execute the following scripts or adjust them to support your OS.
sudo -i
# execute the following scripts
exit
Create dedicated system user:
adduser --shell=/bin/false --gecos "Hands-Free User" --disabled-login handsfree
usermod -a -G audio handsfree
Create Systemd scripts:
mkdir -p /lib/systemd/system/
cp hands_free_telephone.service /lib/systemd/system/
Enable and restart services:
/bin/systemctl enable hands_free_telephone
/bin/systemctl restart hands_free_telephone
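The `hands_free_telephone.service` file is copied from the project; if you need to adapt it to your OS, a minimal unit could look like the following sketch (the binary path, user name, and dependency targets are assumptions):

```ini
# Sketch only -- adjust paths, user, and ordering to your setup.
[Unit]
Description=Hands-Free Telephone
After=network.target sound.target

[Service]
Type=simple
User=handsfree
ExecStart=/usr/local/bin/hands-free-telephone
Restart=on-failure

[Install]
WantedBy=multi-user.target
```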
In case of instabilities, a cron job can be created to restart the server every day:
mkdir -p /etc/cron.d
cat <<EOT > /etc/cron.d/hands_free_telephone
30 3 * * * root /bin/systemctl restart hands_free_telephone
EOT
Start the Rasa server and the Rasa action server of the `telephone_nlu` project. Then:
hands-free-telephone
NOTE: When running under aarch64, the libgomp version bundled with scikit-learn must be preloaded. Export LD_PRELOAD first (or add LD_PRELOAD as an environment variable to the starter):
export LD_PRELOAD=</usr | <virtual env directory>>/lib/python3.8/site-packages/sklearn/__check_build/../../scikit_learn.libs/libgomp-d22c30c5.so.1.0.0
To get a help screen type:
hands-free-telephone --help
Checkout the project:
git clone <URL>
git pull --recurse-submodules
git submodule init
git submodule update
cd <project directory>
pyenv install 3.8.16
pyenv local 3.8.16
python3 -m venv venv
source venv/bin/activate
python -m pip install -U pip
pip install -r requirements.txt
Install the prerequisites from the installation section.
Test the `pyproject.toml` package:
python3 -m venv testbuild
source testbuild/bin/activate
pip install -U pip
pip install .
To clean the build, run (otherwise old artifacts are retained):
rm -rf dist build *.egg-info
Vosk is used as an offline keyword recognizer.
The vosk-model-small-de-0.15 model is downloaded into the vosk folder.
This only works on x86_64, since no aarch64 wheels are provided.
German speech recognition uses the German model from Coqui.
The following steps improve the scorer by providing a target vocabulary.
A language model matching the corpus of telephone-related commands is created according to the Coqui documentation.
Download the `alphabet.txt` and the checkpoints of the German Coqui model and place them into the `coqui` folder.
The checkpoint files are linked from Jaco-Assistant.
Look at Mozilla's DeepSpeech German link, which should point to the DeepSpeech model d17s5_de.
Download the `d17s5_de.tar.gz` file and extract it under the `coqui` folder into the `d17s5` directory.
Also place the `alphabet.txt` again in the `d17s5` directory.
NOTE: PyCharm removes trailing whitespace by default. Disable this under "Settings | Editor | General | Remove trailing spaces on Save".
The space must be included in `alphabet.txt` as a possible character, otherwise the scorer will produce garbage.
Pay attention to this when editing the file.
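A quick sanity check for the space line can be sketched as follows (a demo file stands in for the real `alphabet.txt`; adjust the path to your checkout):

```shell
# Build a small demo alphabet with a line consisting of a single space,
# then count such lines. A count of 1 means the space character survived.
printf 'a\nb\n \nc\n' > /tmp/alphabet_demo.txt
grep -c '^ $' /tmp/alphabet_demo.txt
# prints 1
```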
A separate virtual environment was set up for the scorer. Since TensorFlow 1.15.4 is not available for Python 3.8, a Python 3.7 environment is needed.
git clone https://github.com/pyenv/pyenv.git $HOME/.pyenv
Add to your `~/.bashrc`:
## pyenv configs
export PYENV_ROOT="$HOME/.pyenv"
export PATH="$PYENV_ROOT/bin:$PATH"
if command -v pyenv 1>/dev/null 2>&1; then
eval "$(pyenv init -)"
fi
Log out or run:
source ~/.bashrc
pyenv install 3.7.9
pyenv local 3.7.9
python3 -m venv coqui-stt-train-venv
source coqui-stt-train-venv/bin/activate
git clone https://github.com/coqui-ai/STT
cd STT
python -m pip install --upgrade pip wheel setuptools
python -m pip install --upgrade -e .
The corpus is defined in `coqui/data/corpus.txt`. Execute:
cd coqui
python3 create_input.py
This will create `input.txt` based on the corpus, plus several placeholder files.
cd STT
git submodule init
git pull --recurse-submodules
cd kenlm
sudo apt install build-essential cmake libboost-system-dev libboost-thread-dev libboost-program-options-dev libboost-test-dev libeigen3-dev zlib1g-dev libbz2-dev liblzma-dev
mkdir -p build
cd build
cmake ..
make -j 4
cd coqui
python3 ../STT/data/lm/generate_lm.py --input_txt input.txt --output_dir . \
--top_k 10000 --kenlm_bins ../STT/kenlm/build/bin/ \
--arpa_order 3 --max_arpa_memory "85%" --arpa_prune "0|0|0" \
--binary_a_bits 255 --binary_q_bits 8 --binary_type trie --discount_fallback
# Download and extract appropriate native_client package:
curl -LO https://github.com/coqui-ai/STT/releases/download/v1.4.0/native_client.tflite.Linux.tar.xz
tar xvf native_client.tflite.Linux.tar.xz
./generate_scorer_package --checkpoint d17s5 --lm lm.binary --vocab vocab-10000.txt --package kenlm.scorer --default_alpha 0.931289039105002 --default_beta 1.1834137581510284
NOTE: This step did not improve the recognition after several training epochs.
cp -R STT/training/coqui_stt_training .
cp STT/lm_optimizer.py .
python3 lm_optimizer.py --test_files training/train.csv --checkpoint_dir d17s5 kenlm.scorer
Create a custom speech corpus:
export API_KEY=
curl -X POST -u "apikey:${API_KEY}" --header "Content-Type: application/json" --data "{\"name\": \"Telefon model\", \"base_model_name\": \"de-DE_BroadbandModel\", \"description\": \"Telefon custom language model\"}" "https://api.eu-de.speech-to-text.watson.cloud.ibm.com/v1/customizations"
# response:
#{"customization_id": "7e76cde1-e1ea-404a-97ff-a6e643ad2409"}
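The returned customization_id is needed in all follow-up calls. As a sketch, it can be extracted from the JSON response with Python's json module (the ID below is the example from the response above):

```shell
# Store the customization_id from the curl response for later requests.
RESPONSE='{"customization_id": "7e76cde1-e1ea-404a-97ff-a6e643ad2409"}'
CUSTOMIZATION_ID=$(printf '%s' "$RESPONSE" | python3 -c "import sys, json; print(json.load(sys.stdin)['customization_id'])")
echo "$CUSTOMIZATION_ID"
# prints 7e76cde1-e1ea-404a-97ff-a6e643ad2409
```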
Use the `input.txt` created with python3 create_input.py in the deepspeech directory and then create some samples.
Not all lines of the `input.txt` corpus are needed:
cat input.txt | while read -r line; do random=$RANDOM; if [ $random -lt $((32767 / 100)) ]; then echo "$line" >> input_sample.txt; fi; done;
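The loop above keeps each line with roughly 1% probability (RANDOM ranges over 0..32767). As a sketch, the same sampling can also be done with awk; demo file names are used here in place of the real corpus:

```shell
# Build a 1000-line demo input, then keep each line with ~1% probability.
seq 1 1000 > /tmp/input_demo.txt
awk 'BEGIN{srand(42)} rand() < 0.01' /tmp/input_demo.txt > /tmp/input_sample_demo.txt
# The sample is much smaller than the input; the exact count is random.
wc -l < /tmp/input_sample_demo.txt
```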
Import to IBM Speech To Text:
curl -X POST -u "apikey:${API_KEY}" --data-binary @deepspeech/input_sample.txt "https://api.eu-de.speech-to-text.watson.cloud.ibm.com/v1/customizations/7e76cde1-e1ea-404a-97ff-a6e643ad2409/corpora/telephone?allow_overwrite=true"
Inspect the result:
curl -X GET -u "apikey:${API_KEY}" "https://api.eu-de.speech-to-text.watson.cloud.ibm.com/v1/customizations/7e76cde1-e1ea-404a-97ff-a6e643ad2409/corpora/telephone"
#response:
#{
# "out_of_vocabulary_words": 131,
# "total_words": 3374880,
# "name": "telephone",
# "status": "analyzed"
#}
Add grammar (Specification):
curl -X POST -u "apikey:${API_KEY}" --header "Content-Type: application/srgs" --data-binary @deepspeech/grammar.txt "https://api.eu-de.speech-to-text.watson.cloud.ibm.com/v1/customizations/7e76cde1-e1ea-404a-97ff-a6e643ad2409/grammars/telephone-abnf?allow_overwrite=true"
Query status:
curl -X GET -u "apikey:${API_KEY}" "https://api.eu-de.speech-to-text.watson.cloud.ibm.com/v1/customizations/7e76cde1-e1ea-404a-97ff-a6e643ad2409/grammars/telephone-abnf"
Train the model:
curl -X POST -u "apikey:${API_KEY}" "https://api.eu-de.speech-to-text.watson.cloud.ibm.com/v1/customizations/7e76cde1-e1ea-404a-97ff-a6e643ad2409/train"
Inspect the result:
curl -X GET -u "apikey:${API_KEY}" "https://api.eu-de.speech-to-text.watson.cloud.ibm.com/v1/customizations/7e76cde1-e1ea-404a-97ff-a6e643ad2409"
NOTE: If any words are not recognized correctly, it might help to correct the word, e.g. by setting the `sounds_like` parameter.
Check word:
curl -X GET -u "apikey:${API_KEY}" "https://api.eu-de.speech-to-text.watson.cloud.ibm.com/v1/customizations/7e76cde1-e1ea-404a-97ff-a6e643ad2409/words/Katja"
Set `sounds_like`:
curl -X PUT -u "apikey:${API_KEY}" --header "Content-Type: application/json" --data "{\"sounds_like\":[\"katja\"]}" "https://api.eu-de.speech-to-text.watson.cloud.ibm.com/v1/customizations/7e76cde1-e1ea-404a-97ff-a6e643ad2409/words/Katja"
Test recognition:
curl -X POST -u "apikey:${API_KEY}" --header "Content-Type: audio/wav" --data-binary @.config/log/2021-11-03_22:15:00:049934.wav "https://api.eu-de.speech-to-text.watson.cloud.ibm.com/v1/recognize?model=de-DE_BroadbandModel&language_customization_id=7e76cde1-e1ea-404a-97ff-a6e643ad2409&grammar_name=telephone-abnf"
This is necessary on an aarch64 computer, e.g. a Raspberry Pi 4 or Xavier NX.
For aarch64, a wheel must be created manually and placed in the `prebuilt` folder.
git clone https://github.com/pytorch/audio.git
cd audio
python setup.py install
cd dist
wheel convert torchaudio-2.0.0a0+c6a5235-py3.8-linux-aarch64.egg