Removes lines that failed to pass through the ASR error injection pipeline.
Performs the final processing of the evaluation data, including repeating slot labels for words that split into multiple words when converted to phonemes.
Performs the initial processing of the evaluation data.
Trains the phoneme2vec embeddings using gensim's word2vec models.
Trains the seq2seq embeddings using PyTorch.
Performs the initial processing of the training data.
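
The slot-label step in the final evaluation-data processing can be illustrated with a minimal sketch. It assumes that phonemizing one word may yield several whitespace-separated tokens and that each new token simply inherits the original word's label. The function `expand_slot_labels`, the toy lexicon, and the phoneme strings are hypothetical and are not taken from the repository. Whether the actual script rewrites `B-` tags to `I-` tags for the extra tokens is not stated here, so the sketch repeats labels verbatim.

```python
def expand_slot_labels(words, labels, phonemize):
    """Repeat each word's slot label once per token the word becomes after phonemization."""
    if len(words) != len(labels):
        raise ValueError("words and labels must have the same length")
    out_tokens, out_labels = [], []
    for word, label in zip(words, labels):
        pieces = phonemize(word).split()  # one word may become several tokens
        out_tokens.extend(pieces)
        out_labels.extend([label] * len(pieces))
    return out_tokens, out_labels


# Hypothetical stand-in for the real phonemizer; the phoneme strings are made up.
TOY_LEXICON = {"flights": "flaIts", "to": "tu", "7pm": "sEv@n pi:Em"}


def toy_phonemize(word):
    return TOY_LEXICON[word]


tokens, labels = expand_slot_labels(
    ["flights", "to", "7pm"],
    ["O", "O", "B-time"],
    toy_phonemize,
)
print(list(zip(tokens, labels)))
```

Here `7pm` becomes two tokens, so its label appears twice and the token and label sequences stay aligned.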
The SQuAD and SUBJ training data used can be found here and the CMU Pronouncing Dictionary data here. The SQuAD and SUBJ training data can be downloaded using the AWS Command Line Interface. Once downloaded, the phoneme_embedding_training_set needs to be placed under ./data.
The evaluation data used can be found here. Once downloaded, the atis and snips datasets also need to be placed under ./data.
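
Before running any of the scripts, it can help to confirm that the downloads ended up in the expected place. The following is a small sketch, not part of the repository: the helper name `missing_data_entries` is hypothetical, and it only checks that the three names mentioned above exist under the data root, without inspecting their contents.

```python
from pathlib import Path

# Names taken from the setup notes above; everything else here is illustrative.
REQUIRED_ENTRIES = ("phoneme_embedding_training_set", "atis", "snips")


def missing_data_entries(data_root="./data", required=REQUIRED_ENTRIES):
    """Return the required names that do not exist under data_root."""
    root = Path(data_root)
    return [name for name in required if not (root / name).exists()]
```

Calling `missing_data_entries()` from the repository root returns an empty list when everything is in place, and otherwise the names that still need to be moved under ./data.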
The ASR errors in the evaluation data were created using this ASR error injection pipeline, and this phonemizer was used to produce phonemized versions of the text sequences in the datasets.
Word2Vec skip-gram embeddings were trained using the Gensim framework, and the seq2seq models were trained using PyTorch.
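
To make the skip-gram objective concrete, the sketch below enumerates the (center, context) pairs that a skip-gram model learns from, assuming each training sequence is a list of phoneme tokens. This is a plain-Python illustration and not Gensim's implementation, which differs in details; the function name `skipgram_pairs`, the window size, and the example sequence are all hypothetical.

```python
def skipgram_pairs(sequence, window=2):
    """List (center, context) pairs for every token within `window` positions of the center."""
    pairs = []
    for i, center in enumerate(sequence):
        lo = max(0, i - window)
        hi = min(len(sequence), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a token is not its own context
                pairs.append((center, sequence[j]))
    return pairs


# Illustrative phoneme sequence, written in ARPAbet-style symbols without stress marks.
phonemes = ["HH", "AH", "L", "OW"]
pairs = skipgram_pairs(phonemes, window=1)
print(pairs)
```

With `window=1`, each phoneme is paired only with its immediate neighbours, so the model is trained to predict adjacent phonemes from the center phoneme.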