This repository provides an implementation of the referenced paper.
In the paper, acoustic scene classification is performed with deep learning methods. The dataset is the LITIS-Rouen collection, which contains 3026 audio clips from 19 different environments (classes); each clip is 30 seconds long and sampled at 22050 Hz. Following the split defined with the dataset, 2419 of the 3026 clips are used for training and the remaining 607 for testing.
In general, machine-learning pipelines for audio (and similar signals) consist of preprocessing, feature extraction, and classification. For preprocessing, a background-noise-removal method (used to create an auxiliary input channel) and the log-mel spectrogram (to expose as much of the signal's structure as possible) are applied. Features are extracted with a CNN, a GRU, and an attention mechanism, and a linear SVM is used in place of the softmax layer for the final classification.
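A minimal sketch of the log-mel front end using librosa is shown below. The `n_fft`, `hop_length`, and `n_mels` values are illustrative assumptions, not values taken from the paper; the noise-removal auxiliary channel is not shown.

```python
# Sketch of the log-mel spectrogram front end (parameter values are assumptions).
import librosa
import numpy as np

def log_mel_spectrogram(path, sr=22050, n_fft=2048, hop_length=512, n_mels=64):
    """Load a clip at the dataset's 22050 Hz rate and return a log-mel spectrogram."""
    y, _ = librosa.load(path, sr=sr)
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=n_fft, hop_length=hop_length, n_mels=n_mels
    )
    # Convert power to dB so low-energy structure remains visible to the network.
    return librosa.power_to_db(mel, ref=np.max)
```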
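The feature extractor can be sketched as a CNN followed by a GRU and an attention pooling layer. The PyTorch sketch below is only illustrative: the layer counts, channel widths, and hidden sizes are assumptions, and the repository's actual topology may differ.

```python
# Rough sketch of a CNN + GRU + attention feature extractor (sizes are assumptions).
import torch
import torch.nn as nn

class CRNNAttention(nn.Module):
    def __init__(self, n_mels=64, n_classes=19):
        super().__init__()
        # CNN front end: learns local time-frequency patterns.
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
            nn.MaxPool2d((2, 2)),
            nn.Conv2d(32, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
            nn.MaxPool2d((2, 2)),
        )
        # GRU models the temporal evolution of the CNN feature maps.
        self.gru = nn.GRU(64 * (n_mels // 4), 128, batch_first=True, bidirectional=True)
        # Attention pools the GRU outputs into a single clip-level embedding.
        self.attn = nn.Linear(256, 1)
        self.classifier = nn.Linear(256, n_classes)  # softmax head used during training

    def forward(self, x):                       # x: (batch, 1, n_mels, time)
        h = self.cnn(x)                         # (batch, 64, n_mels/4, time/4)
        h = h.permute(0, 3, 1, 2).flatten(2)    # (batch, time/4, 64 * n_mels/4)
        h, _ = self.gru(h)                      # (batch, time/4, 256)
        w = torch.softmax(self.attn(h), dim=1)  # attention weights over time steps
        emb = (w * h).sum(dim=1)                # (batch, 256) clip embedding
        return self.classifier(emb), emb        # logits for softmax; embedding for SVM
```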
The network is trained in two stages. First, the training set is enlarged with between-class data augmentation and the network is trained with a softmax classification layer. After training is complete, the softmax layer is removed and replaced with an SVM classifier. The SVM is fitted on the real data only; the samples generated by the between-class method are not used for SVM training.
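Between-class augmentation can be sketched as blending two examples from different classes with a random ratio and giving the blend a correspondingly soft label. This is a simplified linear mix; the original BC-learning formulation also normalizes for sound energy, which is omitted here.

```python
# Simplified sketch of between-class mixing with soft labels.
import numpy as np

def between_class_mix(x1, y1, x2, y2, n_classes=19, rng=np.random):
    """Mix two (feature, label) pairs; returns the blended input and soft label."""
    r = rng.uniform()                  # mixing ratio in (0, 1)
    x = r * x1 + (1.0 - r) * x2        # linear blend of the two inputs
    y = np.zeros(n_classes, dtype=np.float32)
    y[y1] += r
    y[y2] += 1.0 - r                   # soft label reflects the mixing ratio
    return x, y
```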
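Replacing the softmax head with an SVM might look like the following sketch, where the penultimate embedding of each real clip is fed to scikit-learn's `LinearSVC`. The names `model` and `real_loader` are hypothetical placeholders for this repository's actual objects.

```python
# Sketch: fit a linear SVM on the trained network's embeddings of the real data.
import numpy as np
import torch
from sklearn.svm import LinearSVC

@torch.no_grad()
def fit_svm_head(model, real_loader):
    model.eval()
    feats, labels = [], []
    for x, y in real_loader:           # only original clips, no BC-augmented mixes
        _, emb = model(x)              # discard softmax logits, keep the embedding
        feats.append(emb.cpu().numpy())
        labels.append(y.numpy())
    svm = LinearSVC()
    svm.fit(np.concatenate(feats), np.concatenate(labels))
    return svm
```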
The remaining training hyperparameters, namely the number of epochs, batch size, optimizer, and learning rate, are set to 500, 100, Adam, and 0.0001, respectively.
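Expressed as a hypothetical PyTorch setup, reusing the `CRNNAttention` sketch above (the actual repository may use a different framework):

```python
# The stated hyperparameters as a PyTorch training configuration.
import torch
from torch.utils.data import DataLoader

EPOCHS, BATCH_SIZE, LR = 500, 100, 1e-4     # values stated above

model = CRNNAttention()                      # the sketch defined earlier
optimizer = torch.optim.Adam(model.parameters(), lr=LR)
# `train_set` is a placeholder for the BC-augmented training dataset.
loader = DataLoader(train_set, batch_size=BATCH_SIZE, shuffle=True)
```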
For more details, refer to the original paper.