SpeechYOLO: Detection and Localization of Speech Objects

Yael Segal (segal.yael@campus.technion.ac.il)
Tzeviya Sylvia Fuchs (fuchstz@cs.biu.ac.il)
Joseph Keshet (jkeshet@technion.ac.il)

SpeechYOLO, inspired by the YOLO algorithm, uses object detection methods from the vision domain for speech recognition. The goal of SpeechYOLO is to localize boundaries of utterances within the input signal and to correctly classify them. Our system is composed of a convolutional neural network with a simple least-mean-squares loss function.
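
For intuition only, here is a minimal PyTorch sketch of a YOLO-style sum-of-squares loss for 1-D speech, assuming each of C time cells predicts B events as (center, width, confidence) plus per-class scores; the tensor layout, variable names, and weighting constants are illustrative assumptions, not the exact implementation in this repository:

    import torch

    def speech_yolo_loss(pred, target, obj_mask, l_coord=1.0, l_noobj=0.5):
        # pred, target: (batch, C_cells, B_events, 3 + n_classes),
        # where the last dim is (t_center, width, confidence, class scores...)
        # obj_mask: (batch, C_cells, B_events), 1 where a cell/event slot is
        # responsible for a ground-truth event, 0 elsewhere
        obj = obj_mask.unsqueeze(-1)
        coord_err = ((pred[..., :2] - target[..., :2]) ** 2) * obj
        conf_err_obj = ((pred[..., 2] - target[..., 2]) ** 2) * obj_mask
        conf_err_noobj = (pred[..., 2] ** 2) * (1 - obj_mask)
        class_err = ((pred[..., 3:] - target[..., 3:]) ** 2) * obj
        return (l_coord * coord_err.sum()
                + conf_err_obj.sum()
                + l_noobj * conf_err_noobj.sum()
                + class_err.sum())

A squared-error objective of this form pushes the predicted confidence toward 1 in cells responsible for an event and toward 0 elsewhere, while regressing the event's center and width.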

The paper can be found here.
If you find our work useful, please cite:

@article{segal2019speechyolo,
  title={SpeechYOLO: Detection and Localization of Speech Objects},
  author={Segal, Yael and Fuchs, Tzeviya Sylvia and Keshet, Joseph},
  journal={Proc. Interspeech 2019},
  pages={4210--4214},
  year={2019}
}

Installation instructions

  • Python 3.6+

  • PyTorch 1.3.1

  • numpy

  • librosa

  • soundfile
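
  • Install the Python dependencies, e.g. with pip (package versions other than the pinned PyTorch release are assumptions):

    pip install torch==1.3.1 numpy librosa soundfile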

  • Download the code:

    git clone https://github.com/MLSpeech/speech_yolo.git
    

How to use

Pretrain the network on the Google Commands dataset:

  • Download data from Google Commands.

  • The dataset contains one directory per keyword; each directory holds .wav files of a single spoken word. Split the keyword directories into train, val and test folders (code). Your data should look as follows:

    data
    ├───train
    │   ├───word_1
    │   │       1.wav
    │   │       2.wav
    │   │       3.wav
    │   └───word_2
    │           4.wav
    │           5.wav
    │           6.wav
    ├───val
    │   ├───word_1
    │   │       7.wav
    │   │       8.wav
    │   │       9.wav
    │   └───word_2
    │           10.wav
    │           11.wav
    │           12.wav
    └───test
        ├───word_1
        │       13.wav
        │       14.wav
        │       15.wav
        └───word_2
                16.wav
                17.wav
                18.wav

    You should have 30 folders (keywords) in each of the train / val / test directories. See the example in gcommand_toy_example.
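
    If you prefer not to use the linked splitting code, here is a minimal sketch of such a split (the 80/10/10 ratios and the random seed are assumptions):

    import os
    import random
    import shutil

    def split_commands(src, dst, ratios=(0.8, 0.1, 0.1), seed=0):
        # src: directory with one subdirectory per keyword
        # dst: output root that will receive train/val/test subtrees
        random.seed(seed)
        for word in sorted(os.listdir(src)):
            wavs = sorted(os.listdir(os.path.join(src, word)))
            random.shuffle(wavs)
            n_train = int(ratios[0] * len(wavs))
            n_val = int(ratios[1] * len(wavs))
            parts = {'train': wavs[:n_train],
                     'val': wavs[n_train:n_train + n_val],
                     'test': wavs[n_train + n_val:]}
            for split, files in parts.items():
                out = os.path.join(dst, split, word)
                os.makedirs(out, exist_ok=True)
                for f in files:
                    shutil.copy(os.path.join(src, word, f), out)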

  • run:

     python pretrain_run.py --train_path [path_to_data/train_folder]
                            --valid_path [path_to_data/val_folder]
                            --test_path [path_to_data/test_folder]
                            --arc VGG19
                            --cuda
                            --save_folder [directory_for_saving_models]
    

    This script trains a convolutional network for multiclass command classification.

    Our pretrained model can be found here.

Run SpeechYOLO

  • We ran SpeechYOLO on the LibriSpeech dataset. See data preparation instructions.

  • For simplicity, the SpeechYOLO code assumes that the .wav files are of length 1 sec each.
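
    If your recordings are longer, here is a minimal sketch for cutting them into 1-second chunks with the soundfile dependency (the output naming scheme is an assumption):

    import soundfile as sf

    def segment_wav(path, out_prefix, seg_sec=1.0):
        # read the full waveform and its sample rate
        wav, sr = sf.read(path)
        seg_len = int(seg_sec * sr)
        # write each complete 1-second chunk as its own .wav file
        for i in range(len(wav) // seg_len):
            chunk = wav[i * seg_len:(i + 1) * seg_len]
            sf.write('{}_{:04d}.wav'.format(out_prefix, i), chunk, sr)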

  • To train, run:

     python run_speech_yolo.py  --train_data [path_to_train_data]  
                                --val_data [path_to_validation_data]
                                --arc VGG19
                                --prev_classification_model [path_to_model_from_pretrain_part]
                                --save_folder [folder_to_save_speechyolo_model]
    

    If you want to load a previously trained speech_yolo_model file for further training, add:

    --trained_yolo_model [path_to_file].

    Our trained models can be found here: vgg11, vgg19.

  • To test, run:

     python test_yolo.py  --train_data [path_to_train_data]  
                          --test_data [path_to_test_data]
                          --model [path_to_speechyolo_model]
    

Our results for threshold theta = 0.4 are:

threshold: 0.4
Actual Accuracy (Val): 0.7746057716719794
F1 regular mean: 0.806897659409815
precision: 0.836339972153278
recall: 0.7794578126322322
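
As a sanity check, the reported F1 is the harmonic mean of the reported precision and recall: 2 * 0.8363 * 0.7795 / (0.8363 + 0.7795) ≈ 0.8069.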
