

DCASE2018 task4 baseline: Large-scale weakly labeled semi-supervised sound event detection in domestic environments

This repository contains the Python scripts needed to run the baseline.

Discussion about the DCASE challenge takes place here: dcase_discussions


task4_crnn.py needs Python >= 2.7, dcase_util >= 0.1.9, tqdm >= 4.11.2, ffmpeg >= 2.8.11, tensorflow >= 1.6.0, keras >= 2.1.5, sed_eval >= 0.2.1, librosa >= 0.6.0
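The minimum versions listed above can be checked before launching the baseline. The following is a minimal sketch, not part of the repository: the helper names `version_tuple` and `check_requirements` are hypothetical, it assumes Python 3.8 or later (for `importlib.metadata`), and it assumes the installed distribution names match the names in the list. ffmpeg is a system binary rather than a Python package, so it is left out.

```python
from importlib import metadata

# Minimum versions copied from the requirements line above.
# ffmpeg is not a Python package, so it is not checked here.
MINIMUM_VERSIONS = {
    "dcase_util": "0.1.9",
    "tqdm": "4.11.2",
    "tensorflow": "1.6.0",
    "keras": "2.1.5",
    "sed_eval": "0.2.1",
    "librosa": "0.6.0",
}


def version_tuple(version):
    """Turn '0.1.9' into (0, 1, 9); suffixes such as 'rc1' are ignored."""
    parts = []
    for piece in version.split("."):
        digits = ""
        for char in piece:
            if not char.isdigit():
                break
            digits += char
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def check_requirements(minimums=MINIMUM_VERSIONS):
    """Return a list of problems; an empty list means every check passed."""
    problems = []
    for name, minimum in minimums.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            problems.append(f"{name} is not installed (need >= {minimum})")
            continue
        if version_tuple(installed) < version_tuple(minimum):
            problems.append(f"{name} {installed} is older than {minimum}")
    return problems


if __name__ == "__main__":
    for problem in check_requirements():
        print(problem)
```

Comparing versions as integer tuples avoids the string-comparison trap where "0.1.10" would sort before "0.1.9".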

A simplified example installation procedure is provided below for a Python 3.6 Anaconda distribution on a Linux system:

  1. Install Anaconda.
  2. Create and activate a Python 3.6 environment, then launch ./install_dependencies.sh. Be careful: it installs the CPU version of TensorFlow; you can replace it with a GPU version if you have a GPU.

Note: The baseline has been tested with Python 3.6 on Linux (Ubuntu 14.04 and 16.04).


  • Clone the repository
  • Install the dependencies (check install_environment.sh)
  • Run KERAS_BACKEND="tensorflow" python task4_crnn.py

Note: To download the dataset separately, go to the dataset folder and follow the instructions there. Otherwise, the dataset (82 GB) will be downloaded the first time you launch the script.


To get more information about the dataset and files, go to the dataset folder

Files and folders

  • task4_crnn.py is the main script; it executes the different steps of the baseline (see also below).
  • evaluation_measures.py contains the measures computed by the baseline. Note that the final measure is the macro-averaged F-measure computed on events.
  • Dataset_dcase2018.py contains the dataset class used in this baseline; it downloads the data and produces meta_files usable by the baseline system.
  • task4_crnn.yaml contains the system parameters.
  • The notebooks folder is meant to contain code or notebooks that can help other participants. Do not hesitate to open a pull request on this repository. We can discuss notebooks in this discussion. (If you want to change a notebook already in the repo, using nbdime is recommended.)

Here is a list of files produced by the baseline and their explanation:

  • meta.txt lists all items contained in the dataset.
  • item_access_error.log.csv lists all files that could not be downloaded. It should be the concatenation of the missing_files_[dataset].py files in the dataset folder.
  • filelist.python.hash is a hash file generated by dcase_util to avoid rechecking that the dataset exists every time the baseline is run.
  • evaluation_setup is a folder generated by dcase_util that contains the information the dataset uses to retrieve folds.
  • baseline is a folder generated by dcase_util that contains features, models and parameters.


System description

The baseline system is based on two convolutional recurrent neural networks (CRNNs) using 64 log mel-band magnitudes as features. Each 10-second audio file is divided into 500 frames.
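The frame count above fixes the time resolution of the system: 10 seconds over 500 frames gives one frame every 20 ms. The sketch below only works through that arithmetic. The function names are hypothetical, and the sample rate is an assumption passed in by the caller, since it is not stated here.

```python
CLIP_DURATION_S = 10.0  # from the text: 10-second clips
FRAMES_PER_CLIP = 500   # from the text: 500 frames per clip
N_MELS = 64             # from the text: 64 log mel bands


def hop_duration(clip_duration_s=CLIP_DURATION_S, n_frames=FRAMES_PER_CLIP):
    """Time step between two consecutive frames, in seconds."""
    return clip_duration_s / n_frames


def hop_length_samples(sample_rate_hz):
    """Hop length in samples for a given (assumed) sample rate."""
    return int(round(sample_rate_hz * hop_duration()))


def frame_to_time(frame_index):
    """Start time of a frame, in seconds."""
    return frame_index * hop_duration()


if __name__ == "__main__":
    print("hop duration (s):", hop_duration())
    # 44100 Hz is only an illustrative value, not taken from this document.
    print("hop length at 44.1 kHz (samples):", hop_length_samples(44100))
    print("feature matrix shape per clip:", (FRAMES_PER_CLIP, N_MELS))
```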

Using these features, we train a first CRNN with three convolutional layers (64 filters (3x3), max pooling (4) along the frequency axis and 30% dropout), one recurrent layer (64 gated recurrent units (GRU) with 30% dropout on the input), a dense layer (10 units, sigmoid activation) and global average pooling across frames. The system is trained for 100 epochs (early stopping with a patience of 15 epochs) on weak labels (1578 clips, 20% of which are used for validation). This model is trained at clip level (whether the file contains the event or not): inputs are 500 frames long (a 10-second audio file) and produce a single output frame. This first model is used to predict labels for the unlabeled files (unlabel_in_domain, 14412 clips).
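To see how the layers described above fit together, the following sketch traces tensor shapes through the first-pass model in plain Python. It is an illustration under assumptions, not the code in task4_crnn.py: it assumes that each of the three convolutional layers is followed by its own pooling step, that the convolutions use padding that preserves the input size, and that pooling never touches the time axis. The function names are hypothetical.

```python
def trace_first_pass_shapes(n_frames=500, n_mels=64, n_filters=64,
                            n_conv_layers=3, freq_pool=4,
                            gru_units=64, n_classes=10):
    """Return (layer name, output shape) pairs for one input clip."""
    shapes = [("input", (n_frames, n_mels, 1))]
    freq = n_mels
    for layer in range(1, n_conv_layers + 1):
        # Assumed size-preserving 3x3 convolution: only the channel count changes.
        shapes.append((f"conv{layer}", (n_frames, freq, n_filters)))
        # Max pooling along frequency only, so the 500 frames are kept.
        freq //= freq_pool
        shapes.append((f"pool{layer}", (n_frames, freq, n_filters)))
    # Merge the remaining frequency bins and the filters into one feature axis.
    shapes.append(("reshape", (n_frames, freq * n_filters)))
    shapes.append(("gru", (n_frames, gru_units)))
    shapes.append(("dense", (n_frames, n_classes)))
    shapes.append(("global_average_pooling", (n_classes,)))
    return shapes


def clip_level_scores(frame_scores):
    """Global average pooling: mean of each class score over all frames."""
    n_frames = len(frame_scores)
    n_classes = len(frame_scores[0])
    return [sum(frame[c] for frame in frame_scores) / n_frames
            for c in range(n_classes)]


if __name__ == "__main__":
    for name, shape in trace_first_pass_shapes():
        print(f"{name:>24}: {shape}")
```

Under these assumptions the frequency axis shrinks from 64 to 16 to 4 to 1, so the recurrent layer receives 64 features per frame, and averaging the dense layer output over the 500 frames yields the single 10-class output used for clip-level training.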

A second model based on the same architecture (3 convolutional layers and 1 recurrent layer) is trained on the predictions of the first model (unlabel_in_domain, 14412 clips; the weak files, 1578 clips, are used to validate the model). The main difference from the first-pass model is that the output is taken at the dense layer, so that events can be predicted at frame level. Inputs are 500 frames long, and every frame is labeled identically, following the clip labels. The model outputs a decision for each frame. Post-processing (median filtering) is used to obtain event onsets and offsets for each file. The baseline system evaluates its results using the event-based F-score as the metric.
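The step from frame-level decisions to events with an onset and an offset can be sketched as follows for a single class. This is a minimal illustration, not the baseline's implementation: the 0.5 threshold, the filter window length and the zero padding at the clip edges are assumptions, and the function names are hypothetical. The 0.02 s hop follows from 10 seconds divided into 500 frames.

```python
from statistics import median


def binarize(frame_probs, threshold=0.5):
    """Turn per-frame probabilities into 0/1 decisions (threshold is assumed)."""
    return [1 if p >= threshold else 0 for p in frame_probs]


def median_filter(binary, window=5):
    """Smooth a 0/1 sequence with a sliding median (window must be odd)."""
    if window % 2 == 0:
        raise ValueError("window must be odd")
    half = window // 2
    # Zero padding keeps every window the same odd length at the clip edges.
    padded = [0] * half + list(binary) + [0] * half
    return [int(median(padded[i:i + window])) for i in range(len(binary))]


def frames_to_events(binary, hop_s=0.02):
    """Convert runs of active frames into (onset, offset) pairs in seconds."""
    events = []
    onset = None
    for index, active in enumerate(binary):
        if active and onset is None:
            onset = index
        elif not active and onset is not None:
            events.append((onset * hop_s, index * hop_s))
            onset = None
    if onset is not None:
        # The event is still active at the end of the clip.
        events.append((onset * hop_s, len(binary) * hop_s))
    return events


if __name__ == "__main__":
    probs = [0.1, 0.9, 0.1, 0.1, 0.8, 0.9, 0.7, 0.9, 0.2, 0.1]
    smoothed = median_filter(binarize(probs), window=3)
    print(smoothed)
    print(frames_to_events(smoothed))
```

The median filter removes isolated single-frame detections (such as the second frame in the example) so that only sustained activity is reported as an event.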

Script description

The baseline system is a semi-supervised approach:

  • Download the data (only the first time)
  • First pass at clip level:
    • Train a CRNN on weak data (train/weak); 20% of the data is used for validation
    • Predict labels for the unlabeled in-domain data (train/unlabel_in_domain)
  • Second pass at frame level:
    • Train a CRNN on the unlabeled data labeled by the first pass (train/unlabel_in_domain); weak data (train/weak) is used for validation. Note: labels are used at frame level but annotations are at clip level, so if an event is present in the 10-second clip, all frames carry this label during training
    • Predict strong test labels (test/). Note: each predicted event has an onset and an offset
  • Evaluate the model by comparing the test annotations with the second-pass predictions (the metric is event-based and macro-averaged)
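The hand-over between the two passes can be sketched in a few lines: clip-level scores from the first pass become binary pseudo-labels, and each clip label is then copied to every frame to form the second-pass training targets. This is a simplified illustration; the 0.5 threshold and the function names are assumptions, not taken from the baseline code.

```python
def pseudo_label(clip_scores, threshold=0.5):
    """Binarize the first-pass clip-level scores (one score per class)."""
    return [1 if score >= threshold else 0 for score in clip_scores]


def expand_to_frames(clip_labels, n_frames=500):
    """Copy the clip-level labels to every frame of the clip."""
    return [list(clip_labels) for _ in range(n_frames)]


if __name__ == "__main__":
    # Hypothetical first-pass output for one unlabeled clip and three classes.
    scores = [0.91, 0.12, 0.50]
    labels = pseudo_label(scores)
    targets = expand_to_frames(labels)
    print(labels)
    print(len(targets), "frames, each labeled", targets[0])
```

Because every frame inherits the clip label, the second-pass targets carry no timing information of their own; any temporal localization has to come from what the model learns.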

System performance (event-based measures with a 200 ms collar on onsets and a 200 ms / 20% of the event length collar on offsets):

Event-based overall metrics (macro-average)
F-score 14.68 %
ER 1.54

Note: This performance was obtained on a CPU-based system (Intel® Xeon E5-1630, 8 cores, 128 GB RAM). The total runtime was approximately 24 hours.

Note: The performance might not be exactly reproducible on a GPU-based system. However, the baseline runs in around 8 hours on a single Nvidia GeForce 1080 Ti GPU.
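The collar rule and the macro-averaging used for the scores above can be illustrated with a short sketch. It is not the sed_eval implementation: it assumes that the offset tolerance is the larger of 200 ms and 20% of the reference event length, and that a class with no true positives, false positives or false negatives scores 0. All function and class names are hypothetical.

```python
def offset_tolerance(reference_length_s, collar_s=0.2, length_fraction=0.2):
    """Assumed rule: the larger of the fixed collar and 20% of the event length."""
    return max(collar_s, length_fraction * reference_length_s)


def events_match(reference, estimated, collar_s=0.2, length_fraction=0.2):
    """Check one estimated (onset, offset) pair against one reference pair."""
    ref_onset, ref_offset = reference
    est_onset, est_offset = estimated
    onset_ok = abs(ref_onset - est_onset) <= collar_s
    tolerance = offset_tolerance(ref_offset - ref_onset, collar_s, length_fraction)
    offset_ok = abs(ref_offset - est_offset) <= tolerance
    return onset_ok and offset_ok


def f_score(true_pos, false_pos, false_neg):
    """F1 from event counts; returns 0.0 when there is nothing to count."""
    denominator = 2 * true_pos + false_pos + false_neg
    return 2 * true_pos / denominator if denominator else 0.0


def macro_f_score(counts_per_class):
    """Average the per-class F-scores so that every class weighs the same."""
    scores = [f_score(*counts) for counts in counts_per_class.values()]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # A 2-second reference event tolerates an offset error of up to 0.4 s.
    print(events_match((1.0, 3.0), (1.1, 3.3)))
    # Hypothetical (true_pos, false_pos, false_neg) counts for two classes.
    print(macro_f_score({"class_a": (1, 0, 0), "class_b": (0, 1, 1)}))
```

Macro-averaging gives rare and frequent classes equal weight, which is why a single poorly detected class can pull the overall F-score down noticeably.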


If you are using this source code, please consider citing the following paper:

R. Serizel, N. Turpault, H. Eghbal-Zadeh, A. P. Shah. “Large-Scale Weakly Labeled Semi-Supervised Sound Event Detection in Domestic Environments”. Submitted to DCASE2018 Workshop, 2018.


Nicolas Turpault, Romain Serizel, Hamid Eghbal-Zadeh, Ankit Parag Shah, 2018 -- Present


This software is distributed under the terms of the MIT License (https://opensource.org/licenses/MIT)