Some sounds are distinct and instantly recognizable, like a baby’s laugh or the strum of a guitar.

Other sounds aren’t clear and are difficult to pinpoint. If you close your eyes, can you tell which of the sounds below is a chainsaw versus a blender?

Moreover, we often experience a mix of sounds that create an ambience – like the clamoring of construction, a hum of traffic from outside the door, blended with loud laughter from the room, and the ticking of the clock on your wall. The sound clip below is of a busy food court in the UK.

Partly because of the sheer variety of sounds we experience, no reliable automatic general-purpose audio tagging systems exist. Currently, a lot of manual effort is required for tasks like annotating sound collections and providing captions for non-speech events in audiovisual content.

To tackle this problem, Freesound (an initiative by MTG-UPF that maintains a collaborative database with over 370,000 Creative Commons Licensed sounds) and Google Research’s Machine Perception Team (creators of AudioSet, a large-scale dataset of manually annotated audio events with over 500 classes) have teamed up to develop the dataset for this competition.

You’re challenged to build a general-purpose automatic audio tagging system using a dataset of audio files covering a wide range of real-world environments. Sounds in the dataset include musical instruments, human sounds, domestic sounds, and animals from Freesound’s library, annotated using a vocabulary of more than 40 labels from Google’s AudioSet ontology. To succeed in this competition, your system will need to recognize an increased number of sound events of a very diverse nature, and to leverage subsets of training data whose annotations vary in reliability (see the Data section for more information).

In [1]:
# Add the project's src directory to the import path

import sys
sys.path.insert(0, '../src')

Import necessary packages

In [2]:
from dotenv import load_dotenv
import os
import pandas as pd
from information import Information
from pre_processing import PreProcessing
from prepare_data import PrepareData
from sound_oop import SoundObjectOriented
from utils.sound_features import get_mfcc_features_2


pandas.Int64Index is deprecated and will be removed from pandas in a future version. Use pandas.Index with the appropriate dtype instead.



Load environment settings

In [3]:
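# Load variables from a local .env file, then read the settings
# (assumes the .env file defines ENV, TRAIN_PATH and TEST_PATH)
load_dotenv()

ENV = os.getenv("ENV")
TRAIN_PATH = os.getenv("TRAIN_PATH")
TEST_PATH = os.getenv("TEST_PATH")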
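
`os.getenv` returns `None` when a variable is unset, so a missing setting can go unnoticed until a path is actually used. The following is a minimal sketch of a guard that fails early. `require_env` is a hypothetical helper, not part of this project, and the paths are illustrative values only.

```python
import os


def require_env(name, default=None):
    """Return the environment variable `name`, or `default` if it is unset.

    Hypothetical helper: raises a clear error when the variable is missing
    and no default was supplied, instead of silently returning None.
    """
    value = os.environ.get(name, default)
    if value is None:
        raise KeyError(f"Environment variable {name!r} is not set")
    return value


# Illustrative value only; the real path depends on your .env file.
os.environ["TRAIN_PATH"] = "../data/train"
train_path = require_env("TRAIN_PATH")

# Falls back to the default when TEST_PATH is not defined.
test_path = require_env("TEST_PATH", default="../data/test")
```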

Load data

In [4]:
train = pd.read_csv("../data/train.csv")
test = pd.read_csv("../data/test_post_competition.csv")

Extract labels

In [5]:
LABELS = list(train.label.unique())
label_idx = {label: i for i, label in enumerate(LABELS)}
train.set_index("fname", inplace=True)
test.set_index("fname", inplace=True)
train["label_idx"] = train.label.apply(lambda x: label_idx[x])
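
To make the encoding step concrete, here is a small sketch of the same label-to-index mapping on a toy list, together with the inverse lookup needed to turn predicted indices back into label names. The label values are illustrative stand-ins for `train.label`, not taken from the dataset.

```python
# Toy label column standing in for train.label (illustrative values only).
labels_column = ["Saxophone", "Bark", "Saxophone", "Cello", "Bark"]

# Unique labels in order of first appearance, analogous to
# list(train.label.unique()).
LABELS = list(dict.fromkeys(labels_column))
label_idx = {label: i for i, label in enumerate(LABELS)}

# Encode every row as an integer class index.
encoded = [label_idx[label] for label in labels_column]

# Inverse lookup: LABELS itself maps an index back to its label.
decoded = [LABELS[i] for i in encoded]

print(LABELS)   # ['Saxophone', 'Bark', 'Cello']
print(encoded)  # [0, 1, 0, 2, 1]
```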

Extract MFCC features for both train and test audio files

In [6]:
prepare_data = PrepareData()
train_extracted = prepare_data.extract_features(
    "../data/train", "train", loadPreComputed=False
)
test_extracted = prepare_data.extract_features(
    "../data/test", "test", loadPreComputed=False
)


pre-processing object is created



100%|██████████| 31/31 [00:04<00:00,  7.68it/s]
100%|██████████| 15/15 [00:01<00:00,  9.83it/s]
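
For readers unfamiliar with MFCCs, the sketch below shows the usual pipeline (framing, power spectrum, mel filterbank, log, DCT) in plain NumPy. It is an assumption-laden illustration, not the implementation behind `PrepareData.extract_features` or `get_mfcc_features_2`, whose internals are not shown here. All function names and parameter defaults are hypothetical.

```python
import numpy as np


def hz_to_mel(hz):
    return 2595.0 * np.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_filterbank(n_mels, n_fft, sr):
    """Triangular filters spaced evenly on the mel scale.

    Returns an array of shape (n_mels, n_fft // 2 + 1).
    """
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):       # rising edge
            fbank[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):      # falling edge
            fbank[m - 1, k] = (right - k) / max(right - center, 1)
    return fbank


def mfcc(signal, sr, n_mfcc=13, n_mels=26, n_fft=512, hop=256):
    """Return MFCCs as an array of shape (n_frames, n_mfcc)."""
    # 1. Slice the signal into overlapping, Hann-windowed frames.
    n_frames = 1 + (len(signal) - n_fft) // hop
    idx = np.arange(n_fft)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hanning(n_fft)
    # 2. Power spectrum of each frame.
    power = np.abs(np.fft.rfft(frames, n=n_fft, axis=1)) ** 2 / n_fft
    # 3. Mel filterbank energies, then log compression.
    log_mel = np.log(power @ mel_filterbank(n_mels, n_fft, sr).T + 1e-10)
    # 4. DCT-II along the mel axis; keep the first n_mfcc coefficients.
    n = np.arange(n_mels)
    dct = np.cos(np.pi / n_mels * (n[None, :] + 0.5) * np.arange(n_mfcc)[:, None])
    return log_mel @ dct.T


sr = 16000
t = np.arange(sr) / sr
signal = np.sin(2 * np.pi * 440.0 * t)   # one second of a 440 Hz tone
features = mfcc(signal, sr)              # shape (61, 13)

# One fixed-length vector per clip: mean and std of each coefficient.
summary = np.concatenate([features.mean(axis=0), features.std(axis=0)])
```

Summarizing each coefficient over time is one common way to get a fixed-length row per audio file for tabular models. Whether this project does the same is not confirmed by the notebook.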


Extract corresponding labels

In [7]:
y_train = train.loc[train_extracted.index.to_numpy()]

Create the main sound classifier object and train it

In [8]:
sound_oop = SoundObjectOriented()
sound_oop.add_data(train_extracted, test_extracted, y_train, index_name="fname")
# sound_oop.information()
sound_oop.pre_processing()
# sound_oop.information()

ML = sound_oop.ml(sound_oop)
ML.show_available_algorithms()
ML.init_regressors("all")
ML.train_test_validation()


Information object is created


pre-processing object is created


SoundObjectOriented object is created


Your data has been added


Data has been Pre-Processed


Machine Learning object is created

You can fit your data with the following models

Elastic Net
Kernel Ridge
Bayesian Ridge
Lasso
Lasso Lars Ic
Random Forest
Svm
Xgboost
Gradient Boosting


Gradient Boosting === > Initialized



Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 4.040e-01, tolerance: 8.210e-02


Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 1.125e-02, tolerance: 5.833e-04


Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 1.393e+00, tolerance: 3.821e-01






********** Training *********************** Testing **********
R^2    :  0.9999970849106857         -1.185583323982778
Adj R^2:  1.0000000156032243         1.0116984911453581
MAE    :  0.0008913791841930812      86.30391357910844
MSE    :  2.360883663676816e-06      86.30391357910844
RMSE   :  0.0015365167306856168      9.289989966577382

********** Training *********************** Testing **********
R^2    :  0.9999999999997233         -2.247238345432569
Adj R^2:  1.0000000000000016         1.017381075621352
MAE    :  1.0646782014036817e-06     106.86810486295758
MSE    :  2.5660898943927996e-11     106.86810486295758
RMSE   :  5.065658786764856e-06      10.337703074811037
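
The reported metrics can be reproduced from their standard definitions, which also helps in reading the tables above. In the testing column the MAE and MSE values are identical and the RMSE is their square root, which suggests the MAE line may be printing the MSE. An adjusted R² above 1 is what the formula yields when the number of features exceeds the number of samples. The sketch below is a generic implementation under those standard definitions. `regression_report` is a hypothetical name and not the project's own reporting code.

```python
import numpy as np


def regression_report(y_true, y_pred, n_features):
    """Compute R^2, adjusted R^2, MAE, MSE and RMSE for 1-D targets."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    n = y_true.size
    residuals = y_true - y_pred

    mse = np.mean(residuals ** 2)
    mae = np.mean(np.abs(residuals))
    ss_res = np.sum(residuals ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot
    # Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1).
    # When p >= n the denominator is not positive and the value is no
    # longer meaningful (it can exceed 1).
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - n_features - 1)

    return {
        "R^2": r2,
        "Adj R^2": adj_r2,
        "MAE": mae,
        "MSE": mse,
        "RMSE": np.sqrt(mse),
    }


report = regression_report(
    [1.0, 2.0, 3.0, 4.0, 5.0],
    [1.5, 2.0, 2.5, 4.0, 5.5],
    n_features=1,
)
for name, value in report.items():
    print(f"{name:8s}: {value:.4f}")
```

A near-perfect training score next to a negative test R² is the typical signature of overfitting, which is plausible here given how few files were processed.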


ValueError: y should be a 1d array, got an array of shape (24, 3) instead.
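
The error indicates that an estimator received a target with three columns rather than a single vector. This is consistent with `y_train` being built by `train.loc[...]` without selecting a column, so it keeps every column of `train`. The sketch below reproduces the shape problem with NumPy. `as_1d_target` is a hypothetical stand-in for the estimator's input check, and the column order is assumed.

```python
import numpy as np

# Stand-in for a target table with 24 rows and 3 columns, the shape reported
# in the error. The column order is hypothetical.
y_table = np.zeros((24, 3))
y_table[:, 2] = np.arange(24) % 3   # pretend the last column holds label_idx


def as_1d_target(y):
    """Hypothetical check mirroring the error above: accept 1-D targets
    (or a single column) and reject wider arrays."""
    y = np.asarray(y)
    if y.ndim == 1:
        return y
    if y.ndim == 2 and y.shape[1] == 1:
        return y.ravel()
    raise ValueError(
        f"y should be a 1d array, got an array of shape {y.shape} instead."
    )


# Selecting a single column yields a 1-D target of shape (24,).
y_1d = as_1d_target(y_table[:, 2])
```

In pandas, the equivalent selection would be `train.loc[train_extracted.index.to_numpy(), "label_idx"]`. Whether `SoundObjectOriented.add_data` accepts a Series in place of a DataFrame has not been verified here.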