Embodied Multimodal ASR

📄 Multimodal Speech Recognition for Language-Guided Embodied Agents
Allen Chang, Xiaoyuan Zhu, Aarav Monga, Seoho Ahn,
Tejas Srinivasan, Jesse Thomason

A multimodal ASR implementation that uses a language-guided agent's visual observations to reduce errors in spoken-instruction transcripts. The model is trained on a dataset derived from the ALFRED household task dataset, paired with synthetic spoken instructions that are systematically noised by masking spoken words. Spoken instructions transcribed by the multimodal ASR models yield higher task-completion success rates for a language-guided embodied agent.

[Figure: Embodied Multimodal ASR overview]

Installation

Clone the repository and enter its directory:

$ git clone https://github.com/Cylumn/embodied-multimodal-asr.git
$ cd embodied-multimodal-asr

Create a conda environment and install the required packages:

$ conda create -n asr_exps python=3.9.13
$ conda activate asr_exps
$ conda install pytorch==1.12.0 torchvision==0.13.0 torchaudio==0.12.0 -c pytorch

$ pip install -r requirements.txt
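
As an optional sanity check (not part of the original instructions), you can confirm that the pinned builds are active and that a GPU is visible, since the training and demo code assume a CUDA device:

import torch, torchvision, torchaudio

# Versions should match the pinned installs above (1.12.0 / 0.13.0 / 0.12.0).
print(torch.__version__, torchvision.__version__, torchaudio.__version__)
# The demo below loads the model with device='cuda'; this should print True.
print(torch.cuda.is_available())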

Dataset download and preprocessing

Download the ALFRED dataset:

$ cd data
$ sh download_data.sh full

$ cd ../
$ python preprocess.py 

Model training and testing

To train the model:

$ python train.py \
    --pipeline {'unimodal','multimodal'} \
    --id_noise {'{speaker_label}_clean','{speaker_label}_mask_{p_mask}'}
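
For example, the following (illustrative) invocation trains the multimodal pipeline on clean American-speaker instructions; the `american_clean` noise ID matches the pre-trained checkpoint naming shown further below:

$ python train.py --pipeline multimodal --id_noise american_clean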

To test the model:

$ python test.py \
    --run {'{pipeline}_[{id_noise}]_{epochs}'} \
    --id_noise {'{speaker_label}_clean','{speaker_label}_mask_{p_mask}'}
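
Following the run-naming pattern above, a model trained as in the previous example for, say, 20 epochs (an illustrative epoch count) could be evaluated with:

$ python test.py --run 'multimodal_[american_clean]_20' --id_noise american_clean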

Colab demo and loading pre-trained models

Download pre-trained model weights:

$ cd models
$ sh download_pretrained.sh

Loading a pre-trained model:

import torch
import numpy as np
from sklearn.preprocessing import LabelEncoder
from lib.models import MultimodalDecoder, ASRPipeline

# Word Tokenizer
tokenizer = LabelEncoder()
tokenizer.classes_ = np.load('media/demo/tokenizer.npy')
n_tokens = len(tokenizer.classes_)

# ASR Model
multimodal = ASRPipeline(
    decoder=MultimodalDecoder(
        d_audio=[312, 768], d_vision=512, d_out=n_tokens, 
        depth=4, max_target_len=25, dropout=0.3
    ),
    tokenizer=tokenizer, device='cuda'
)
# Load pre-trained weights (downloaded via download_pretrained.sh)
multimodal.decoder.load_state_dict(torch.load(
    'models/multimodal_[american_clean]_pretrained.pt',
    map_location='cuda'
))
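
After the weights are loaded, a typical next step is to put the decoder in evaluation mode and use the tokenizer to map predicted token IDs back to words. The exact inference call depends on ASRPipeline's interface (see the Colab demo), so this sketch only covers those two pieces:

# Disable dropout for inference (the decoder is a torch module).
multimodal.decoder.eval()

# The tokenizer is a standard scikit-learn LabelEncoder, so predicted token IDs
# (the IDs below are illustrative) map back to words via inverse_transform.
print(tokenizer.inverse_transform([0, 1, 2]))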

Colab demo: Open In Colab

Citation

@inproceedings{chang23_interspeech,
  author={Allen Chang and Xiaoyuan Zhu and Aarav Monga and Seoho Ahn and Tejas Srinivasan and Jesse Thomason},
  title={{Multimodal Speech Recognition for Language-Guided Embodied Agents}},
  year=2023,
  booktitle={Interspeech 2023}
}
