Large Language Models for Expansion of Spoken Language Understanding Systems to New Languages

Abstract

Spoken Language Understanding (SLU) models are a core component of voice assistants (VA), such as Alexa, Bixby, and Google Assistant. In this paper, we introduce a pipeline designed to extend SLU systems to new languages, utilizing Large Language Models (LLMs) that we fine-tune for machine translation of slot-annotated SLU training data. Our approach improved on the MultiATIS++ benchmark, a primary multi-language SLU dataset, in the cloud scenario using an mBERT model. Specifically, Overall Accuracy improved from 53% to 62.18% compared to the existing state-of-the-art method, Fine and Coarse-grained Multi-Task Learning Framework (FC-MTLF). In the on-device scenario (a tiny SLU model that is not pretrained), our method improved Overall Accuracy from 5.31% to 22.06% over the baseline Global-Local Contrastive Learning Framework (GL-CLeF) method. Unlike both FC-MTLF and GL-CLeF, our LLM-based machine translation does not require changes in the production architecture of SLU. Additionally, our pipeline is slot-type independent: it does not require any slot definitions or examples.
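Overall Accuracy in SLU benchmarks is usually the share of utterances for which the intent and every slot label are predicted correctly. Assuming that definition applies to the numbers above, the metric can be sketched as follows. The function name and the sample labels are illustrative and are not taken from this repository.

```python
from typing import List, Sequence


def overall_accuracy(
    gold_intents: Sequence[str],
    pred_intents: Sequence[str],
    gold_slots: Sequence[List[str]],
    pred_slots: Sequence[List[str]],
) -> float:
    """Fraction of utterances with the intent and every slot label correct."""
    sizes = {len(gold_intents), len(pred_intents), len(gold_slots), len(pred_slots)}
    if len(sizes) != 1:
        raise ValueError("all inputs must cover the same utterances")
    if not gold_intents:
        return 0.0
    # An utterance counts only if the intent AND the full slot sequence match.
    correct = sum(
        gi == pi and gs == ps
        for gi, pi, gs, ps in zip(gold_intents, pred_intents, gold_slots, pred_slots)
    )
    return correct / len(gold_intents)
```

Because a single wrong slot label makes the whole utterance count as wrong, this metric is stricter than intent accuracy or slot F1 taken separately.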

HuggingFace

A fine-tuned version of BigTranslate for slot translation on the MultiATIS++ languages is available on HuggingFace:

https://huggingface.co/Samsung/BigTranslateSlotTranslator

How to run

BigTranslate slot-translation finetuning

Enter the BigTranslateFineTuning/ directory

Prepare MASSIVE dataset

python prepare_massive_data.py --output-dir massive

Run BigTranslate LoRA finetuning

bash deepspeed_bigtranslate_train.sh

Translate multiATIS++ dataset

To get the dataset, go to https://github.com/amazon-research/multiatis and put the train_EN.tsv file into the BigTranslateFineTuning/multiATIS folder.
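Before translating, it can help to confirm that the file is in place and that each utterance has one slot label per token. The sketch below assumes the TSV has a header row with columns named utterance and slot_labels, both whitespace-tokenized; check the header of your copy, since these names are an assumption and the helper itself is hypothetical.

```python
import csv
from pathlib import Path
from typing import List, Tuple


def find_misaligned_rows(tsv_path: Path) -> List[Tuple[int, str]]:
    """Return (row_number, utterance) for rows whose token and label counts differ."""
    bad = []
    with tsv_path.open(newline="", encoding="utf-8") as handle:
        # QUOTE_NONE keeps quotation marks inside utterances as literal text.
        reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row_number, row in enumerate(reader, start=1):
            utterance = row.get("utterance") or ""
            labels = (row.get("slot_labels") or "").split()
            if len(utterance.split()) != len(labels):
                bad.append((row_number, utterance))
    return bad
```

For example, find_misaligned_rows(Path("BigTranslateFineTuning/multiATIS/train_EN.tsv")) returns an empty list when every row is aligned.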

The model is downloaded automatically from the HuggingFace Hub. You can also specify a path to the model in the translate_multiatis.sh script (--model-name argument).

bash translate_multiatis.sh

Training input data preparation

cd MT/
python prepare_multiatis_jointnlu_traindata.py \
    $TRANSLATED_DATA_DIR \
    False \
    False \
    True \
    True \
    False \
    $OUTPUT_TRAINING_DATA_DIR \
    5 \
    $SOURCE_LANG \
    $TARGET_LANG \
    0.95
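The command above and the later steps rely on shell variables that this README does not set for you. A minimal sketch of a pre-flight check is shown below; the variable names come from the commands in this README, while the helper function is hypothetical.

```python
import os
from typing import List, Mapping

# Variables referenced by the commands in this README.
REQUIRED_VARIABLES = (
    "TRANSLATED_DATA_DIR",
    "OUTPUT_TRAINING_DATA_DIR",
    "SOURCE_LANG",
    "TARGET_LANG",
    "MODEL_TYPE",
    "MODEL_OUTPUT_DIR",
)


def missing_variables(env: Mapping[str, str]) -> List[str]:
    """Return the required variables that are unset or empty in env."""
    return [name for name in REQUIRED_VARIABLES if not env.get(name)]


if __name__ == "__main__":
    missing = missing_variables(os.environ)
    if missing:
        raise SystemExit("Set these variables first: " + ", ".join(missing))
```

The script only reads the environment, so the variables need to be exported in the shell for it to see them.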

Training input data postprocessing

bash multiatis_data_postprocess.sh $OUTPUT_TRAINING_DATA_DIR

Run training and evaluation

cd JointBERT
mkdir -p eval_results fails data
cp -r $OUTPUT_TRAINING_DATA_DIR data
python3 main.py \
    --task $(basename $OUTPUT_TRAINING_DATA_DIR) \
    --model_type ${MODEL_TYPE} \
    --patience 5 \
    --train_batch_size 500 \
    --do_train \
    --do_eval \
    --ratio 1.0 \
    --device_name cuda:0 \
    --validation loss \
    --lang $TARGET_LANG \
    --model_dir $MODEL_OUTPUT_DIR \
    --percent_broken 0 \
    --num_train_epochs 50

where:

MODEL_TYPE - multibert, xlmr_scratch or any other model type defined in utils.py
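To repeat the run for several target languages or model types, the command above can be assembled programmatically. The sketch below only builds the argument list, using the flags and values shown in this README; the function name and its parameters are hypothetical.

```python
import os
from typing import List


def build_jointbert_command(
    data_dir: str, model_type: str, target_lang: str, model_dir: str
) -> List[str]:
    """Build the main.py argument list used in the training step above."""
    # normpath drops a trailing slash so the result matches the shell's basename.
    task = os.path.basename(os.path.normpath(data_dir))
    return [
        "python3", "main.py",
        "--task", task,
        "--model_type", model_type,
        "--patience", "5",
        "--train_batch_size", "500",
        "--do_train",
        "--do_eval",
        "--ratio", "1.0",
        "--device_name", "cuda:0",
        "--validation", "loss",
        "--lang", target_lang,
        "--model_dir", model_dir,
        "--percent_broken", "0",
        "--num_train_epochs", "50",
    ]
```

The resulting list can be passed to subprocess.run(cmd, cwd="JointBERT", check=True) once the training data has been copied into JointBERT/data.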
