UMA-ASR

This repository is the official implementation of "Unimodal Aggregation for CTC-based Speech Recognition".

This work has been accepted by ICASSP 2024.


Paper 🤩 | Issues 😅 | Lab 🙉 | Contact 😘

Introduction

This project works on non-autoregressive automatic speech recognition. A unimodal aggregation (UMA) is proposed to segment and integrate the feature frames that belong to the same text token, and thus to learn better feature representations for text tokens. The frame-wise features and aggregation weights are both derived from an encoder; the feature frames are then integrated under the unimodal weights and further processed by a decoder. Connectionist temporal classification (CTC) loss is applied for training. Moreover, integrating self-conditioned CTC into the proposed framework further improves performance noticeably.

Figure: The proposed UMA model.
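To make the aggregation step concrete, here is a minimal NumPy sketch (an illustration, not the authors' code): frames are segmented at local minima (valleys) of the scalar aggregation weights, and each token-level vector is the weight-normalized sum of the frames in its segment. The valley rule, the function name, and the epsilon are assumptions made for this sketch.

```python
import numpy as np

def unimodal_aggregate(feats, weights):
    """Illustrative unimodal aggregation.

    feats:   (T, D) frame-wise features from the encoder
    weights: (T,)   frame-wise aggregation weights in [0, 1]
    Returns: (N, D) one aggregated vector per segment.
    """
    T = feats.shape[0]
    # A new segment opens at every valley of the weight curve, i.e. a
    # frame whose weight is a local minimum (assumed interpretation).
    starts = [0]
    for t in range(1, T - 1):
        if weights[t - 1] >= weights[t] <= weights[t + 1]:
            starts.append(t)
    starts.append(T)
    # Each token vector is the weighted average of its segment's frames.
    segs = []
    for s, e in zip(starts[:-1], starts[1:]):
        w = weights[s:e]
        segs.append((w[:, None] * feats[s:e]).sum(axis=0) / (w.sum() + 1e-8))
    return np.stack(segs)
```

In the full model the weights come from the encoder and the aggregated sequence is passed on to a decoder trained with CTC; this sketch only shows how unimodal weights shorten the frame sequence.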

Get started

  1. The proposed method is implemented with ESPnet2, so please make sure ESPnet is installed successfully first.
  2. Roll back ESPnet to the specified version (run inside your espnet directory):
    git checkout v.202304
    
  3. Clone the UMA-ASR codes by:
    git clone https://github.com/Audio-WestlakeU/UMA-ASR
    
  4. Copy the recipe configurations in the egs2 folder to the corresponding directory in "espnet/egs2/". At present, experiments have only been conducted on the AISHELL-1, AISHELL-2, and HKUST datasets. If you want to experiment on other Chinese datasets, you can refer to these configurations.
  5. Copy the files in the espnet2 folder to the corresponding folders in "espnet/espnet2", and check that the path in each file's header comment matches your own path.
  6. To run experiments, follow ESPnet's standard steps. You can apply the UMA method by simply replacing run.sh with our run_unimodal.sh on the command line. For example:
    ./run_unimodal.sh --stage 10 --stop_stage 13
    
    Be careful to make the bash files executable first:
    chmod +x asr_unimodal.sh
    chmod +x run_unimodal.sh
    
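Assuming ESPnet is already installed, the steps above can be sketched end-to-end as follows; the overlay copy commands and the recipe directory are illustrative and should be adapted to your own layout:

```shell
# Pin ESPnet to the version the recipes were tested against
cd espnet
git checkout v.202304

# Fetch the UMA-ASR code alongside ESPnet (location is illustrative)
cd ..
git clone https://github.com/Audio-WestlakeU/UMA-ASR

# Overlay the recipe configs and model code onto the ESPnet tree
cp -r UMA-ASR/egs2/. espnet/egs2/
cp -r UMA-ASR/espnet2/. espnet/espnet2/

# Make the UMA entry scripts executable, then run stages 10-13
cd espnet/egs2/aishell/asr1   # example recipe directory
chmod +x run_unimodal.sh asr_unimodal.sh
./run_unimodal.sh --stage 10 --stop_stage 13
```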

Citation

You can cite this paper as:

@article{fang2023unimodal,
    title={Unimodal Aggregation for CTC-based Speech Recognition},
    author={Ying Fang and Xiaofei Li},
    journal={arXiv preprint arXiv:2309.08150},
    year={2023}
}
