# <center>Freesound General-Purpose Audio Tagging Challenge</center>

![Logo](https://upload.wikimedia.org/wikipedia/commons/3/3c/Freesound_project_website_logo.png)

Freesound is a collaborative database of Creative Commons licensed sounds. The aim of this competition is to classify audio files that cover real-world sounds from musical instruments, humans, animals, machines, etc. A few of the labels are: `Trumpet`, `Squeak`, `Meow`, `Applause` and `Finger_snapping`. One of the challenges is that not all labels are manually verified. A creative solution should be able to partially rely on these *weak* annotations.

Let's take a tour of the data visualization and model building through this kernel. If you like this work, please show your support by upvotes. Happy Kaggling!

### Contents
1. [Exploratory Data Analysis](#eda)
    * [Loading data](#loading_data)
    * [Distribution of Categories](#distribution)
    * [Reading Audio Files](#audio_files)
    * [Audio Length](#audio_length)
2. [Building a Model using Raw Wave](#1d_model_building)
    * [Model Description](#1d_discription)
    * [Configuration](#configuration)
    * [DataGenerator class](#data_generator)
    * [Normalization](#1d_normalization)
    * [Training 1D Conv](#1d_training)
    * [Ensembling 1D Conv Predictions](#1d_ensembling)
3. [Introduction to MFCC](#intro_mfcc)
    * [Generating MFCC using Librosa](#librosa_mfcc)
4. [Building a Model using MFCC](#2d_model_building)
    * [Preparing Data](#2d_data)
    * [Normalization](#2d_normalization)
    * [Training 2D Conv on MFCC](#2d_training)
    * [Ensembling 2D Conv Predictions](#2d_ensembling)
5. [Ensembling 1D Conv and 2D Conv Predictions](#1d_2d_ensembling)
6. [Results and Conclusion](#conclusion)


<a id="eda"></a>
## <center>1. Exploratory Data Analysis</center>

In [0]:
# Change this to True to replicate the result
#COMPLETE_RUN = False
COMPLETE_RUN = True

# Downloading data from a Shared Google Drive zip file

        
##  * The file audio_train8.zip contains the original Freesound Challenge training data <font color=red> downsampled to 8 kHz</font> using subsample.py

## * The file train.csv contains the labelling information provided by Freesound

In [0]:
! pip install googledrivedownloader

In [0]:
from google_drive_downloader import GoogleDriveDownloader as gdd


# Download audio data from a shared GoogleDrive file

gdd.download_file_from_google_drive(file_id='19c_Pc9dC_E96AGxN5ZdB-5UeOCDpYPvW',
                                    dest_path='./audio_train8.zip',
                                    unzip=False)

# Download csv data from a shared GoogleDrive file

gdd.download_file_from_google_drive(file_id='1wEUNo9A_2W29YD8HWGNPoshs6Irk4VzQ',
                                    dest_path='./train.csv',
                                    unzip=False)

In [0]:
ls

## Extract audio (wav) files into audio_train8 subdirectory

In [0]:
! unzip -q audio_train8.zip

### ... check that there are 9473 audio files

In [0]:
ls ./audio_train8 | wc

<a id="loading_data"></a>
### Loading CSV data


---

First some imports

In [0]:
import numpy as np
np.random.seed(1001)

import os
import shutil

import IPython
import matplotlib
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from tqdm import tqdm_notebook

%matplotlib inline
matplotlib.style.use('ggplot')

In [0]:
train = pd.read_csv("train.csv")

In [0]:
train.head()

In [0]:
print("Number of training examples=", train.shape[0], "  Number of classes=", len(train.label.unique()))

In [0]:
print(train.label.unique())

<a id="distribution"></a>
### Distribution of Categories

In [0]:
category_group = train.groupby(['label', 'manually_verified']).count()
plot = category_group.unstack().reindex(category_group.unstack().sum(axis=1).sort_values().index)\
          .plot(kind='bar', stacked=True, title="Number of Audio Samples per Category", figsize=(16,10))
plot.set_xlabel("Category")
plot.set_ylabel("Number of Samples");

In [0]:
print('Minimum samples per category = ', min(train.label.value_counts()))
print('Maximum samples per category = ', max(train.label.value_counts()))

We observe that:
1. The number of audio samples per category is **non-uniform**. The minimum number of audio samples in a category is `94` while the maximum is `300`.
2. Also, the proportion of `manually_verified` labels per category is non-uniform.

<a id="audio_files"></a>
### Reading Audio Files

The audio files are [pulse-code modulated](https://en.wikipedia.org/wiki/Pulse-code_modulation) with a [bit depth](https://en.wikipedia.org/wiki/Audio_bit_depth) of 16 and a [sampling rate](https://en.wikipedia.org/wiki/Sampling_%28signal_processing%29) of 8 kHz (NOT 44.1 kHz).

![16-bit PCM](https://upload.wikimedia.org/wikipedia/commons/thumb/b/bf/Pcm.svg/500px-Pcm.svg.png)

* **Bit-depth = 16**: The amplitude of each sample in the audio is one of 2^16 (=65536) possible values. 
* **Sampling rate = 8000 (NOT 44.1 kHz)**: Each second of audio consists of **8000** samples, not 44100. So, if the duration of the audio file is 3.2 seconds, the audio will consist of 8000\*3.2 = 25600 values (at 44.1 kHz it would be 44100\*3.2 = 141120 values).
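
The sample-count arithmetic in the bullets above can be checked with a few lines of Python. This is only a sketch; the helper name `num_samples` is made up for illustration and is not part of the notebook.

```python
def num_samples(sampling_rate_hz, duration_s):
    """Number of PCM samples in a clip of the given duration."""
    # round() guards against floating-point noise in rate * duration
    return int(round(sampling_rate_hz * duration_s))

# A 3.2 second clip at the two sampling rates discussed above
print(num_samples(8000, 3.2))    # 25600
print(num_samples(44100, 3.2))   # 141120

# A bit depth of 16 gives this many distinct amplitude levels
print(2 ** 16)                   # 65536
```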

Let's listen to an audio file in our dataset and load it to a numpy array

In [0]:
import IPython.display as ipd  # To play sound in the notebook
fname = './audio_train8/' + '00353774.wav'   # Cello
ipd.Audio(fname)

In [0]:
# Using wave library
import wave
wav = wave.open(fname)
print("Sampling (frame) rate = ", wav.getframerate())
print("Total samples (frames) = ", wav.getnframes())
print("Duration = ", wav.getnframes()/wav.getframerate())

In [0]:
# Using scipy
from scipy.io import wavfile
rate, data = wavfile.read(fname)
print("Sampling (frame) rate = ", rate)
print("Total samples (frames) = ", data.shape)
print(data)
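
To experiment with the `wave` module without the dataset at hand, one option is to write a short synthetic tone and read its header back. The sketch below uses only the standard library; the helper names, the 440 Hz tone, and the temporary file path are assumptions chosen for illustration.

```python
import math
import os
import struct
import tempfile
import wave

def write_sine_wav(path, freq_hz=440.0, duration_s=0.5, rate=8000):
    """Write a mono 16-bit PCM sine tone to `path`."""
    n = int(rate * duration_s)
    samples = [int(0.5 * 32767 * math.sin(2 * math.pi * freq_hz * t / rate))
               for t in range(n)]
    with wave.open(path, "wb") as w:
        w.setnchannels(1)     # mono
        w.setsampwidth(2)     # 2 bytes per sample = 16-bit depth
        w.setframerate(rate)
        w.writeframes(struct.pack("<%dh" % n, *samples))

def read_wav_info(path):
    """Return (sampling rate, number of frames, bit depth) of a WAV file."""
    with wave.open(path, "rb") as w:
        return w.getframerate(), w.getnframes(), 8 * w.getsampwidth()

tmp_path = os.path.join(tempfile.mkdtemp(), "tone.wav")
write_sine_wav(tmp_path)
rate_hz, n_frames, bit_depth = read_wav_info(tmp_path)
print("rate =", rate_hz, "frames =", n_frames, "bits =", bit_depth,
      "duration =", n_frames / rate_hz)
```

Opening the file in a `with` block also closes it again, which matters when the same call is repeated for thousands of files.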

Let's plot the audio frames

In [0]:
plt.figure(figsize=(12,6))
plt.plot(data, '-', );

Let's zoom in on the first 2000 samples

In [0]:
plt.figure(figsize=(16, 4))
plt.plot(data[:2000], '.'); plt.plot(data[:2000], '-');

<a id="audio_length"></a>
### Audio Length

We shall now analyze the lengths of the audio files in our dataset

In [0]:
train['nframes'] = train['fname'].apply(lambda f: wave.open('./audio_train8/' + f).getnframes())


_, ax = plt.subplots(figsize=(16, 4))
sns.violinplot(ax=ax, x="label", y="nframes", data=train)
plt.xticks(rotation=90)
plt.title('Distribution of audio frames, per label', fontsize=16)
plt.show()

We observe:
1. The distribution of audio length across labels is non-uniform and has high variance.

Let's now analyze the frame length distribution in Train.

In [0]:

train.nframes.hist(bins=100)
plt.suptitle('Frame Length Distribution in Train', ha='center', fontsize='large');

# Methodology: Train/Test

For now, to keep things simple, we prepare both the Train and Test data from the Freesound training dataset.

It is important to convert raw labels to integer indices

In [0]:
LABELS = list(train.label.unique())
label_idx = {label: i for i, label in enumerate(LABELS)}
train.set_index("fname", inplace=True)
train["label_idx"] = train.label.apply(lambda x: label_idx[x])
if not COMPLETE_RUN:
    train = train[:2000]

In [0]:
train.index

In [0]:
train.head()



---

### - first prepare balanced test data:

      test_df Pandas DataFrame

In [0]:
# Select a test sample from TRAIN dataset with same number of audio per class

n_samples_per_class=20
test_sample=train.groupby('label')['label_idx'].apply(lambda x: x.sample(n=n_samples_per_class))

In [0]:
# test_sample is a multi-level pandas Series with the file names in the second level (1)


test_list= list(test_sample.index.get_level_values(1))
test_df=train.loc[train.index.isin(test_list)]



---

### - now let's take the rest of data as the Training dataset:

    train_df Pandas DataFrame

In [0]:
train_df=train.loc[~train.index.isin(test_list)]

print('Number of audios for final train: ',len(train_df))
print('Number of audios for final test : ',len(test_df))
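
The idea behind the balanced split (a fixed number of test files per label, everything else goes to training) can also be written without pandas. The following is a minimal sketch on toy data; `balanced_split`, the seed, and the made-up file names are assumptions for illustration, not the notebook's own code.

```python
import random
from collections import defaultdict

def balanced_split(labels_by_file, n_per_class, seed=0):
    """Split file names into (train, test) with n_per_class test files per label."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for fname, label in labels_by_file.items():
        by_label[label].append(fname)
    test = []
    for label in sorted(by_label):
        # sample without replacement inside each label group
        test.extend(rng.sample(sorted(by_label[label]), n_per_class))
    test_set = set(test)
    train = [f for f in labels_by_file if f not in test_set]
    return train, test

# Toy data: 3 labels with 5 made-up files each
toy = {"%s_%d.wav" % (lab, i): lab
       for lab in ("Cello", "Meow", "Bark") for i in range(5)}
train_files, test_files = balanced_split(toy, n_per_class=2)
print(len(train_files), len(test_files))   # 9 6
```

Fixing the seed makes the split reproducible, which the `sample` call in the cell above does not guarantee unless a `random_state` is passed.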

<a id="configuration"></a>
#### Configuration

The Configuration object stores the learning parameters that are shared between data generators, models, and training functions. Anything that is `global` as far as the training is concerned can become part of the Configuration object.

In [0]:
class Config(object):
    def __init__(self,
                 sampling_rate=8000, audio_duration=2, n_classes=41,
                 use_mfcc=False, n_folds=10, learning_rate=0.0001, 
                 max_epochs=50, n_mfcc=20):
        self.sampling_rate = sampling_rate
        self.audio_duration = audio_duration
        self.n_classes = n_classes
        self.use_mfcc = use_mfcc
        self.n_mfcc = n_mfcc
        self.n_folds = n_folds
        self.learning_rate = learning_rate
        self.max_epochs = max_epochs

        self.audio_length = self.sampling_rate * self.audio_duration
        if self.use_mfcc:
            self.dim = (self.n_mfcc, 1 + int(np.floor(self.audio_length/512)), 1)
        else:
            self.dim = (self.audio_length, 1)
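
To see what input shape `Config` produces, the `dim` computation can be reproduced in a standalone function. This sketch assumes that the constant 512 in `Config` corresponds to the default `hop_length` of librosa's MFCC computation; `input_dim` is a hypothetical helper, not part of the notebook.

```python
import math

def input_dim(sampling_rate, audio_duration, use_mfcc=False,
              n_mfcc=20, hop_length=512):
    """Mirror of the shape logic in Config (hop_length=512 is an assumption)."""
    audio_length = sampling_rate * audio_duration
    if use_mfcc:
        # one MFCC frame per hop, plus one for the frame at the start
        n_frames = 1 + int(math.floor(audio_length / hop_length))
        return (n_mfcc, n_frames, 1)
    return (audio_length, 1)

print(input_dim(8000, 2))                  # (16000, 1)
print(input_dim(8000, 2, use_mfcc=True))   # (20, 32, 1)
```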

## <font color=red>See Librosa</font>


# Librosa Introduction

<font color=FA9900 size=4>LibROSA is a python package for music and audio analysis. It provides the building blocks necessary to create music information retrieval systems.</font>

- Tutorial home: http://librosa.github.io/librosa/tutorial.html
- Librosa home: http://librosa.github.io/
- User forum: https://groups.google.com/forum/#!forum/librosa

In [0]:
! pip install -q librosa

In [0]:
import librosa

## Though generators are the better approach, let's try reading ALL the data at ONCE using `prepare_data`

<a id="data_generator"></a>
#### DataGenerator Class

The DataGenerator class inherits from **`keras.utils.Sequence`** . It is useful for preprocessing and feeding the data to a Keras model. 
* Once initialized with a batch_size, it computes the number of batches in an epoch. The **`__len__`** method tells Keras how many batches to draw in each epoch. 
* The **`__getitem__`** method takes an index (which is the batch number) and returns a batch of the data (both X and y) after calculating the offset. During test time, only `X` is returned.
* If we want to perform some action after each epoch (like shuffle the data, or increase the proportion of augmented data), we can use the **`on_epoch_end`** method.

Note:
**`Sequence`** objects are a safer way to do multiprocessing. This structure guarantees that the network will train only once on each sample per epoch, which is not the case with generators.
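
The batch bookkeeping described above (number of batches per epoch and the slice each batch index maps to) can be sketched in plain Python. `batch_slices` is a hypothetical helper used only to illustrate the arithmetic.

```python
import math

def batch_slices(n_items, batch_size):
    """Return the (start, stop) index range covered by each batch in one epoch."""
    n_batches = int(math.ceil(n_items / batch_size))   # what __len__ computes
    return [(i * batch_size, min((i + 1) * batch_size, n_items))
            for i in range(n_batches)]

# 10 samples with a batch size of 4: the last batch is smaller
print(batch_slices(10, 4))   # [(0, 4), (4, 8), (8, 10)]
```

Every sample index appears in exactly one slice, which is the "once per epoch" guarantee mentioned in the note.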

In [0]:
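from keras.utils import Sequence, to_categorical  # needed here: the model imports come in a later cell

class DataGenerator(Sequence):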
    def __init__(self, config, data_dir, list_IDs, labels=None, 
                 batch_size=64, preprocessing_fn=lambda x: x):
        self.config = config
        self.data_dir = data_dir
        self.list_IDs = list_IDs
        self.labels = labels
        self.batch_size = batch_size
        self.preprocessing_fn = preprocessing_fn
        self.on_epoch_end()
        self.dim = self.config.dim

    def __len__(self):
        return int(np.ceil(len(self.list_IDs) / self.batch_size))

    def __getitem__(self, index):
        indexes = self.indexes[index*self.batch_size:(index+1)*self.batch_size]
        list_IDs_temp = [self.list_IDs[k] for k in indexes]
        return self.__data_generation(list_IDs_temp)

    def on_epoch_end(self):
        self.indexes = np.arange(len(self.list_IDs))

    def __data_generation(self, list_IDs_temp):
        cur_batch_size = len(list_IDs_temp)
        X = np.empty((cur_batch_size, *self.dim))

        input_length = self.config.audio_length
        for i, ID in enumerate(list_IDs_temp):
            file_path = self.data_dir + ID
            
            # Read and Resample the audio
            data, _ = librosa.core.load(file_path, sr=self.config.sampling_rate,
                                        res_type='kaiser_fast')

            # Random offset / Padding
            if len(data) > input_length:
                max_offset = len(data) - input_length
                offset = np.random.randint(max_offset)
                data = data[offset:(input_length+offset)]
            else:
                if input_length > len(data):
                    max_offset = input_length - len(data)
                    offset = np.random.randint(max_offset)
                else:
                    offset = 0
                data = np.pad(data, (offset, input_length - len(data) - offset), "constant")
                
            # Normalization + Other Preprocessing
            if self.config.use_mfcc:
                # pass the signal as y=...: recent librosa versions require keyword arguments here
                data = librosa.feature.mfcc(y=data, sr=self.config.sampling_rate,
                                            n_mfcc=self.config.n_mfcc)
                data = np.expand_dims(data, axis=-1)
            else:
                data = self.preprocessing_fn(data)[:, np.newaxis]
            X[i,] = data

        if self.labels is not None:
            y = np.empty(cur_batch_size, dtype=int)
            for i, ID in enumerate(list_IDs_temp):
                y[i] = self.labels[ID]
            return X, to_categorical(y, num_classes=self.config.n_classes)
        else:
            return X
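
The "random offset / padding" step is the part of the generator that forces every clip to the same length. Below is a small NumPy sketch of that idea. It is a variation rather than a copy of the code above: it uses `np.random.default_rng` and also allows the last possible offset, and `crop_or_pad` is a hypothetical name.

```python
import numpy as np

def crop_or_pad(data, input_length, rng):
    """Return a 1D array of exactly input_length samples."""
    if len(data) > input_length:
        # long clip: keep a random window of input_length samples
        offset = rng.integers(0, len(data) - input_length + 1)
        return data[offset:offset + input_length]
    # short (or exact) clip: place it at a random position inside zeros
    pad_total = input_length - len(data)
    offset = rng.integers(0, pad_total + 1)
    return np.pad(data, (offset, pad_total - offset), "constant")

rng = np.random.default_rng(0)
long_clip = np.arange(10, dtype=float)
short_clip = np.ones(3)
print(crop_or_pad(long_clip, 4, rng).shape)    # (4,)
print(crop_or_pad(short_clip, 8, rng).shape)   # (8,)
```

Because the offset is random, two passes over the same file can see different windows, which acts as a light form of augmentation.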

In [0]:
def prepare_data(df, config, data_dir):
    if config.use_mfcc :
      X = np.empty(shape=(df.shape[0], config.dim[0], config.dim[1], 1))
    else:
      X = np.empty(shape=(df.shape[0], config.dim[0], config.dim[1]))
      
    input_length = config.audio_length
    for i, fname in enumerate(df.index):
        # print(fname)
        file_path = data_dir + fname
        data, _ = librosa.core.load(file_path, sr=config.sampling_rate, res_type="kaiser_fast")

        # Random offset / Padding
        if len(data) > input_length:
            max_offset = len(data) - input_length
            offset = np.random.randint(max_offset)
            data = data[offset:(input_length+offset)]
        else:
            if input_length > len(data):
                max_offset = input_length - len(data)
                offset = np.random.randint(max_offset)
            else:
                offset = 0
            data = np.pad(data, (offset, input_length - len(data) - offset), "constant")

        if config.use_mfcc :
          data = librosa.feature.mfcc(y=data, sr=config.sampling_rate, n_mfcc=config.n_mfcc)
          data = np.expand_dims(data, axis=-1)
        else:
          data = data[:, np.newaxis]
            
        X[i,] = data
    return X

In [0]:
test_df.index

In [0]:
config = Config(sampling_rate=8000, audio_duration=2, n_folds=10, learning_rate=0.001)
if not COMPLETE_RUN:
    config = Config(sampling_rate=8000, audio_duration=1, n_folds=2, learning_rate=0.01, max_epochs=1)

In [0]:
X_test = prepare_data(test_df, config, './audio_train8/')

In [0]:
X_test.shape

In [0]:
X_train = prepare_data(train_df, config, './audio_train8/')

In [0]:
X_train.shape

## Labels: we will use one-hot encoding via `to_categorical`, provided by Keras

In [0]:
from tensorflow import keras
from keras.utils import to_categorical

In [0]:
test_labels = to_categorical(list(test_df.label_idx), num_classes=config.n_classes)
train_labels = to_categorical(list(train_df.label_idx), num_classes=config.n_classes)

In [0]:
print('Train labels shape',train_labels.shape)
print('Test labels shape',test_labels.shape)
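
For readers who want to see what one-hot encoding produces, here is a plain NumPy sketch of the same transformation. It is not Keras code; `one_hot` is a hypothetical helper written for illustration.

```python
import numpy as np

def one_hot(indices, num_classes):
    """Turn integer class indices into rows with a single 1 per row."""
    indices = np.asarray(indices)
    out = np.zeros((len(indices), num_classes), dtype="float32")
    out[np.arange(len(indices)), indices] = 1.0
    return out

encoded = one_hot([0, 2, 1], num_classes=4)
print(encoded)
print(encoded.argmax(axis=1))   # recovers the original indices
```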

<a id="1d_model_building"></a>
## <center>2. Building a Model using Raw Wave</center>
We will build two models:
1. The first model will take the raw audio (1D array) as input and the primary operation will be Conv1D
2. The second model will take the MFCCs as input. (We will explain MFCC later)

<a id="1d_discription"></a>
### Keras Model using raw wave

Our model has the following architecture:
![raw](https://raw.githubusercontent.com/zaffnet/images/master/images/raw_model.jpg)

**Important:**
Due to the time limit on Kaggle Kernels, it is not possible to perform 10-fold training of a large model. I have trained the model locally and uploaded its output files as a dataset. If you wish to train the bigger model, change `COMPLETE_RUN = True` at the beginning of the kernel.

## Some essential imports

In [0]:
from keras import losses, models, optimizers
from keras.activations import relu, softmax
from keras.callbacks import (EarlyStopping, LearningRateScheduler,
                             ModelCheckpoint, TensorBoard, ReduceLROnPlateau)
from keras.layers import (Convolution1D, Dense, Dropout, GlobalAveragePooling1D, 
                          GlobalMaxPool1D, Input, MaxPool1D, concatenate)
from keras.utils import Sequence, to_categorical

<a id="1d_normalization"></a>
#### Normalization

Normalization is a crucial preprocessing step. The simplest method is min-max rescaling of the features to the range [0, 1]. The function below then subtracts 0.5, so its output lies in [-0.5, 0.5].

In [0]:
def audio_norm(data):
    max_data = np.max(data)
    min_data = np.min(data)
    data = (data-min_data)/(max_data-min_data+1e-6)
    return data-0.5
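
A quick check of `audio_norm` on toy arrays shows the range of its output. The function is repeated so that the snippet runs on its own; the input values are made up for illustration.

```python
import numpy as np

def audio_norm(data):
    max_data = np.max(data)
    min_data = np.min(data)
    data = (data - min_data) / (max_data - min_data + 1e-6)
    return data - 0.5

signal = np.array([-3.0, 0.0, 5.0])
normed = audio_norm(signal)
print(normed.min(), normed.max())   # close to -0.5 and +0.5

# A constant (silent) clip does not divide by zero thanks to the 1e-6 term
silence = audio_norm(np.zeros(4))
print(silence)                      # every value is -0.5
```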


---


## Using Keras sequential....

---


In [0]:
from keras.models import Sequential

from keras.layers import MaxPooling1D, Dropout, Dense, Flatten, GlobalMaxPool1D

from keras.layers import Convolution1D as Conv1D

nclass = config.n_classes
input_length = config.audio_length

model = Sequential()
# input: audio signal (input_length,1)
# apply 16 convolution filters of length 9 each and relu activation
model.add(Conv1D(1.....

model.add(GlobalMaxPool1D.....
model.add(Dropout(0.25))

...
model.add(Dense(nclass, activation='softmax'))


In [0]:
model.summary()

In [0]:
#from keras.optimizers import SGD
opt = optimizers.Adam(config.learning_rate)

model.compile(loss='categorical_crossentropy',
              optimizer=opt,
              metrics=['accuracy'])

In [0]:
.... USE...
, batch_size=100, epochs=50 , \
shuffle=True

....validation_data=(X_test, test_labels))


---

## <font color=red>Now let's use:</font>
## Keras Functional API

See this [Keras Functional API guide](https://keras.io/getting-started/functional-api-guide/)


---


In [0]:
def simple_1d_conv_model(config):
    
    nclass = config.n_classes
    input_length = config.audio_length
    
    inp = Input(shape=(input_length,1))
    
    .....
    
    

    ....
    return model

In [0]:
model = simple_1d_conv_model(config)

In [0]:
model.summary()

In [0]:
history = model.fit(X_train, train_labels, batch_size=100, epochs=50 , \
                    shuffle=True, validation_data=(X_test, test_labels))

## <font color=red> Now try a more complex (or a simple dummy) architecture</font>

* The dummy model is just for debugging purposes.
* Our 1D Conv model is fairly deep and is trained using the Adam optimizer with the learning rate set in `config` (0.001 for the complete run configured above; the `Config` default is 0.0001).

In [0]:
def get_1d_dummy_model(config):
    
    nclass = config.n_classes
    input_length = config.audio_length
    
    inp = Input(shape=(input_length,1))
    x = GlobalMaxPool1D()(inp)
    out = Dense(nclass, activation=softmax)(x)

    model = models.Model(inputs=inp, outputs=out)
    opt = optimizers.Adam(config.learning_rate)

    model.compile(optimizer=opt, loss=losses.categorical_crossentropy, metrics=['acc'])
    return model

def get_1d_conv_model(config):
    
    nclass = config.n_classes
    input_length = config.audio_length
    
    inp = Input(shape=(input_length,1))
    x = Convolution1D(16, 9, activation=relu, padding="valid")(inp)
    x = Convolution1D(16, 9, activation=relu, padding="valid")(x)
    x = MaxPool1D(16)(x)
    x = Dropout(rate=0.1)(x)
    
    x = Convolution1D(32, 3, activation=relu, padding="valid")(x)
    x = Convolution1D(32, 3, activation=relu, padding="valid")(x)
    x = MaxPool1D(4)(x)
    x = Dropout(rate=0.1)(x)
    
    x = Convolution1D(32, 3, activation=relu, padding="valid")(x)
    x = Convolution1D(32, 3, activation=relu, padding="valid")(x)
    x = MaxPool1D(4)(x)
    x = Dropout(rate=0.1)(x)
    
    x = Convolution1D(256, 3, activation=relu, padding="valid")(x)
    x = Convolution1D(256, 3, activation=relu, padding="valid")(x)
    x = GlobalMaxPool1D()(x)
    x = Dropout(rate=0.2)(x)

    x = Dense(64, activation=relu)(x)
    x = Dense(1028, activation=relu)(x)
    out = Dense(nclass, activation=softmax)(x)

    model = models.Model(inputs=inp, outputs=out)
    opt = optimizers.Adam(config.learning_rate)

    model.compile(optimizer=opt, loss=losses.categorical_crossentropy, metrics=['acc'])
    return model
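
To get a feel for how quickly `get_1d_conv_model` shrinks the time axis, the layer stack can be traced with simple arithmetic. The sketch below assumes Keras defaults (stride 1 for `"valid"` convolutions, pooling stride equal to the pool size); the helper names are hypothetical.

```python
def conv_valid(length, kernel_size):
    """Output length of a stride-1 convolution with 'valid' padding."""
    return length - kernel_size + 1

def max_pool(length, pool_size):
    """Output length of max pooling with stride equal to the pool size."""
    return length // pool_size

def trace_lengths(input_length):
    """Follow the time dimension through the conv/pool layers of the model."""
    layers = [("conv", 9), ("conv", 9), ("pool", 16),
              ("conv", 3), ("conv", 3), ("pool", 4),
              ("conv", 3), ("conv", 3), ("pool", 4),
              ("conv", 3), ("conv", 3)]
    steps, x = [], input_length
    for kind, size in layers:
        x = conv_valid(x, size) if kind == "conv" else max_pool(x, size)
        steps.append((kind, size, x))
    return steps

# 2 seconds at 8 kHz = 16000 input samples
for kind, size, length in trace_lengths(16000):
    print(kind, size, length)
```

For a 16000-sample input, only 57 time steps remain before `GlobalMaxPool1D` collapses them into a single 256-dimensional vector.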

## Simple train /test

In [0]:
if COMPLETE_RUN:
        model = get_1d_conv_model(config)
else:
        model = get_1d_dummy_model(config)

In [0]:
model.summary()

In [0]:
history = model.fit(X_train, train_labels, batch_size=100, epochs=50 , \
                    shuffle=True, validation_data=(X_test, test_labels))

<a id="intro_mfcc"></a>
## <center> 3. Introduction to MFCC</center>

As we have seen in the previous section, our Deep Learning models are powerful enough to classify sounds from the raw audio. We do not require any complex feature engineering. But before the Deep Learning era, people developed techniques to extract features from audio signals. It turns out that these techniques are still useful. One such technique is computing the MFCC (Mel Frequency Cepstral Coefficients) from the raw audio. Before we jump to MFCC, let's talk about extracting features from the sound.

If we just want to classify some sound, we should build features that are **speaker independent**. Any feature that only gives information about the speaker (like the pitch of their voice) will not be helpful for classification. In other words, we should extract features that depend on the "content" of the audio rather than the nature of the speaker. Also, a good feature extraction technique should mimic human speech perception. We don't hear loudness on a linear scale. If we want to double the perceived loudness of a sound, we have to put 8 times as much energy into it. Instead of a linear scale, our perception system uses a log scale. 

Taking these things into account, Davis and Mermelstein came up with MFCC in the 1980s. MFCC mimics the logarithmic perception of loudness and pitch of the human auditory system and tries to eliminate speaker-dependent characteristics by excluding the fundamental frequency and its harmonics. The underlying mathematics is quite complicated and we will skip that. For those interested, here is the [detailed explanation](http://practicalcryptography.com/miscellaneous/machine-learning/guide-mel-frequency-cepstral-coefficients-mfccs/).

![http://recognize-speech.com/images/FeatureExtraction/MFCC/MFCC_Flowchart.png](http://recognize-speech.com/images/FeatureExtraction/MFCC/MFCC_Flowchart.png)
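
The logarithmic perception of pitch is what the mel scale encodes. As a small illustration, here is one commonly used Hz-to-mel formula (the HTK-style variant); librosa's default is a slightly different (Slaney-style) variant, so treat the numbers as indicative only. The function names are hypothetical.

```python
import math

def hz_to_mel(f_hz):
    """HTK-style mel scale: roughly linear at low and logarithmic at high frequencies."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

for f in (100, 1000, 4000):
    print(f, "Hz ->", round(hz_to_mel(f), 1), "mel")
```

Equal steps in mel correspond to ever larger steps in Hz, so a mel filterbank spends more of its resolution on low frequencies, where human hearing is most discriminating.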

<a id="librosa_mfcc"></a>
#### Generating MFCC using Librosa
The library librosa has a function to calculate MFCC. Let's compute the MFCC of an audio file and visualize it.

In [0]:
import librosa
SAMPLE_RATE = 8000   # the files in ./audio_train8 are downsampled to 8 kHz
fname = './audio_train8/' + '00044347.wav'   # Hi-hat
wav, _ = librosa.core.load(fname, sr=SAMPLE_RATE)
wav = wav[:2*SAMPLE_RATE]   # keep the first 2 seconds

In [0]:
mfcc = librosa.feature.mfcc(y=wav, sr=SAMPLE_RATE, n_mfcc=40)
mfcc.shape

In [0]:
plt.imshow(mfcc, cmap='hot', interpolation='nearest');