## TASK 3: Transfer from VGGish AudioSet

Sound detection and classification is a field that has grown considerably in recent years. The publication of AudioSet, a collection of more than 2 million sounds, has substantially increased interest in it. VGGish is a variant of the VGG architecture adapted for audio, taking a Mel-spectrogram as input. It is trained on AudioSet, the most extensive audio dataset. As in Task 2, and following previous proposals, transfer learning is carried out by retraining only the last classification layer.





In [None]:
! pip install -U pip
! pip install -U torch==1.5.1
! pip install -U torchaudio==0.5.1
! pip install -U torchvision==0.6.1
! pip install -U matplotlib==3.2.1
! pip install -U "clearml>=0.16.1"
! pip install -U pandas==1.0.4
! pip install -U numpy==1.18.4
! pip install -U tensorboard==2.2.1
!pip install git+https://github.com/mir-dataset-loaders/mirdata.git@Pedro/good_sounds
!pip install essentia
!pip install essentia-tensorflow
!wget https://essentia.upf.edu/models/classifiers/voice_instrumental/voice_instrumental-vggish-audioset-1.pb

Collecting git+https://github.com/mir-dataset-loaders/mirdata.git@Pedro/good_sounds
  Cloning https://github.com/mir-dataset-loaders/mirdata.git (to revision Pedro/good_sounds) to /tmp/pip-req-build-0ikul_al
  Running command git clone -q https://github.com/mir-dataset-loaders/mirdata.git /tmp/pip-req-build-0ikul_al
  Running command git checkout -b Pedro/good_sounds --track origin/Pedro/good_sounds
  Switched to a new branch 'Pedro/good_sounds'
  Branch 'Pedro/good_sounds' set up to track remote branch 'Pedro/good_sounds' from 'origin'.
--2021-03-26 10:31:57--  https://essentia.upf.edu/models/classifiers/voice_instrumental/voice_instrumental-vggish-audioset-1.pb
Resolving essentia.upf.edu (essentia.upf.edu)... 84.89.139.43
Connecting to essentia.upf.edu (essentia.upf.edu)|84.89.139.43|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 288629030 (275M)
Saving to: ‘voice_instrumental-vggish-audioset-1.pb.1’


2021-03-26 10:32:50 (5.23 MB/s) - ‘voice_instrumental-v

In [None]:
!nvcc -V

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Wed_Jul_22_19:09:09_PDT_2020
Cuda compilation tools, release 11.0, V11.0.221
Build cuda_11.0_bu.TC445_37.28845127_0


In [None]:
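import torch  # must be imported before the check, otherwise this cell raises NameError

# is_available is a function: without the parentheses the condition is always truthy
if torch.cuda.is_available():
  print('GPU available')
else:
  print('Please set GPU via Edit -> Notebook Settings.')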

In [None]:
import PIL
import io

import pandas as pd
import numpy as np
from pathlib import Path
import matplotlib.pyplot as plt

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import Dataset
from torch.utils.tensorboard import SummaryWriter

import torchaudio
from torchvision.transforms import ToTensor
from torchvision import models

%matplotlib inline

In [None]:

configuration_dict = {'number_of_epochs': 20, 'batch_size': 8, 'dropout': 0.3, 'base_lr': 0.005, 
                      'number_of_mel_filters': 64, 'resample_freq': 22050}


In [None]:
import mirdata

from google.colab import drive
drive.mount('/content/drive', force_remount=True)

g = mirdata.initialize("good_sounds", data_home="drive/MyDrive/good_sounds")

Mounted at /content/drive


## Generate data input

In [None]:
import os
import pickle

def emb_path(audio_path):
    # Embeddings path
    pre, ext = os.path.splitext(audio_path)
    return pre + '.pickle'

def emb_path_vgg(audio_path):
    # Embeddings path
    pre, ext = os.path.splitext(audio_path)
    dir, file = os.path.split(pre)
    return os.path.join(dir, 'vgg#' + file + '.pickle')

def emb_path_pen(audio_path):
    # Embeddings path
    pre, ext = os.path.splitext(audio_path)
    dir, file = os.path.split(pre)
    return os.path.join(dir, 'pen#' + file + '.pickle')

def store(data, filename):
    # Store data (serialize)
    with open(filename, 'wb') as handle:
        pickle.dump(data, handle, protocol=pickle.HIGHEST_PROTOCOL)


def load(filename):
    # Load data (deserialize)
    with open(filename, 'rb') as handle:
        unserialized_data = pickle.load(handle)
    return unserialized_data
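
As a quick check of the naming scheme, the sketch below re-declares `emb_path_vgg`, `store` and `load` and round-trips a small list through a temporary directory. The file name `0001.wav` and the stored values are made up for illustration; no audio file is opened.

```python
import os
import pickle
import tempfile

def emb_path_vgg(audio_path):
    # Same rule as the helper above: prefix the file stem with 'vgg#'
    pre, _ext = os.path.splitext(audio_path)
    directory, stem = os.path.split(pre)
    return os.path.join(directory, 'vgg#' + stem + '.pickle')

def store(data, filename):
    # Serialize `data` to `filename`
    with open(filename, 'wb') as handle:
        pickle.dump(data, handle, protocol=pickle.HIGHEST_PROTOCOL)

def load(filename):
    # Deserialize whatever was stored in `filename`
    with open(filename, 'rb') as handle:
        return pickle.load(handle)

with tempfile.TemporaryDirectory() as tmp:
    audio_path = os.path.join(tmp, '0001.wav')   # hypothetical path, never opened
    cache_path = emb_path_vgg(audio_path)
    store([0.1, 0.2, 0.3], cache_path)            # placeholder values, not real embeddings
    restored = load(cache_path)
    cache_name = os.path.basename(cache_path)

print(cache_name)   # vgg#0001.pickle
print(restored)
```

Because the cache file sits next to the audio file, the extraction loop can skip any track whose two pickles already exist.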

In [None]:
import os
from essentia.standard import MonoLoader, TensorflowPredictVGGish

def get_data(dataset, klasses):
    # import pdb; pdb.set_trace()
    sound_selected = []
    track_ids, tracks = [], []
    if klasses == "all":
        for k, t in dataset.load_tracks().items():
            if t.get_sound_info['klass'] and t.get_sound_info['id'] not in sound_selected:
                track_ids.append(k)
                tracks.append(t)
                sound_selected.append(t.get_sound_info['id'])
    else:
        for klass in klasses:
          for k, t in dataset.load_tracks().items():
              if t.get_sound_info['instrument'] == klass and t.get_sound_info['klass'] and t.get_sound_info['id'] not in sound_selected:
                  track_ids.append(k)
                  tracks.append(t)
                  sound_selected.append(t.get_sound_info['id'])
    return track_ids, tracks

count = 0
sizes = []

keys,tracks = get_data(g, ['violin', 'cello', 'bass'])
for k, t in zip(keys,tracks):
    if not os.path.exists(emb_path_vgg(t.audio_path)) or not os.path.exists(emb_path_pen(t.audio_path)):
        print(count)
        sr = 16000  # sample rate used when loading audio for VGGish
        aud = MonoLoader(filename=t.audio_path, sampleRate=sr)()
        sizes.append(len(aud))
        audio = aud.copy()
        # Force every clip to 48000 samples (3 s at 16 kHz):
        # longer clips are truncated, shorter ones are zero-padded in place
        audio.resize(48000, refcheck=False)
        print(audio.shape, min(audio), max(audio))
        # Retrieve the output of the penultimate layer
        penultimate_layer = TensorflowPredictVGGish(graphFilename='/content/voice_instrumental-vggish-audioset-1.pb', output='model/fully_connected/BiasAdd')(audio)
        store(penultimate_layer, emb_path_pen(t.audio_path))
        # Retrieve the VGGish embeddings (activations of the fc2 layer, not its weights)
        vgg_embed = TensorflowPredictVGGish(graphFilename='/content/voice_instrumental-vggish-audioset-1.pb', output='model/vggish/fc2/BiasAdd')(audio)
        store(vgg_embed, emb_path_vgg(t.audio_path))

    count += 1


In [None]:
min(sizes), max(sizes)

(7693, 352179)
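
The lengths above span 7693 to 352179 samples, whereas the extraction loop forces every clip to 48000 samples (3 s at 16 kHz) with `ndarray.resize`, which truncates longer arrays and zero-fills shorter ones. The pure-Python sketch below mirrors that behaviour on plain lists; `pad_or_truncate` is a hypothetical helper, not part of the notebook, and the constant sample value is a placeholder.

```python
def pad_or_truncate(samples, target_len):
    """Return a copy of `samples` with exactly `target_len` items."""
    clipped = list(samples[:target_len])                    # drop anything past the target
    return clipped + [0.0] * (target_len - len(clipped))    # zero-fill the remainder

sr = 16000
target_len = 3 * sr  # 48000 samples

short_clip = [0.5] * 7693      # shortest length reported above
long_clip = [0.5] * 352179     # longest length reported above

short_fixed = pad_or_truncate(short_clip, target_len)
long_fixed = pad_or_truncate(long_clip, target_len)
padded = sum(1 for s in short_fixed if s == 0.0)  # zero-padded samples in the short clip

print(len(short_fixed), len(long_fixed), padded)
```

With these numbers the shortest clip ends up about 84% zero padding, while the longest (roughly 22 s) keeps only its first 3 s.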

## Generate data

In [None]:
import random
# A bare `import sklearn` does not guarantee that the submodule is loaded
import sklearn.model_selection



class good_soundsDataset(torch.utils.data.Dataset):
    def __init__(
        self,
        mirdataset,
        seq_duration=0.5,
        random_start=True,
        resample=8000,
        subset=0,
        train_split=0.8,
        test_split=0.2,
        random_seed=32,
        klasses=["violin"]
    ):
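        """
        Dataset of precomputed VGGish embeddings for good_sounds tracks.

        subset selects the split served by this instance: 0 = train,
        1 = validation, 2 = test. test_split is the fraction of all tracks
        held out for testing; train_split is the fraction of the remaining
        tracks used for training (the rest is validation).
        seq_duration, random_start and resample are stored but are not
        used when loading items.
        """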
        self.klasses = klasses # a 'klass' or 'all' or 'family'
        self.seq_duration = seq_duration
        self.dataset = mirdataset
        track_ids, tracks = get_data(self.dataset, self.klasses)
        self.track_ids = track_ids
        self.tracks = tracks
        self.resample = resample
        self.set = subset
        self.random_start = random_start

        #### build the label map
        self.labels = {label: i for i,label in enumerate(['good', 'bad'])}
        full_labels = [x.get_sound_info['klass'] for x in tracks]
        #### build the three subsets (train, validation, test) with two stratified splits.
        #### Ids and labels are split together so that the labels used to stratify the
        #### second split stay aligned with the shuffled training ids.
        self.trackids_train, self.trackids_test, train_labels, _ = sklearn.model_selection.train_test_split(
            self.track_ids, full_labels, train_size=1-test_split, random_state=random_seed, stratify=full_labels)
        self.trackids_train, self.trackids_valid = sklearn.model_selection.train_test_split(
            self.trackids_train, train_size=train_split, random_state=random_seed, stratify=train_labels)


    def __getitem__(self, index):

        #### get the file with index in the corresponding subset
        if self.set==0:
            track_id = self.trackids_train[index]
        elif self.set==1:
            track_id = self.trackids_valid[index]
        elif self.set==2:
            track_id = self.trackids_test[index]
        track = self.dataset.track(track_id)

        
        embeddings = load(emb_path_vgg(track.audio_path))
        embeddings = torch.tensor(embeddings)
        embeddings = torch.reshape(embeddings, (384,))
        # print(embeddings.shape)
        audio_signal = np.array([])

        return audio_signal, 16000, embeddings, self.labels['good' if track.get_sound_info['klass'].startswith('good') else 'bad']

    def __len__(self):
        if self.set==0:
            return len(self.trackids_train)
        elif self.set==1:
            return len(self.trackids_valid)
        else:
            return len(self.trackids_test)

random_seed=0

train_dataset = good_soundsDataset(mirdataset=g, subset=0, random_seed=random_seed)
train_loader = torch.utils.data.DataLoader(train_dataset,batch_size=64,num_workers=4,pin_memory=True)
valid_dataset = good_soundsDataset(mirdataset=g, subset=1, random_seed=random_seed)
valid_loader = torch.utils.data.DataLoader(valid_dataset,batch_size=64,num_workers=4,pin_memory=True)
test_dataset = good_soundsDataset(mirdataset=g, subset=2, random_seed=random_seed)
test_loader = torch.utils.data.DataLoader(test_dataset,batch_size=64,num_workers=4,pin_memory=True)

classes = ('good', 'bad')
sounds, sample_rate, inputs, labels = train_dataset[0]
print(type(sounds), type(sample_rate), type(inputs), type(labels))

<class 'numpy.ndarray'> <class 'int'> <class 'torch.Tensor'> <class 'int'>
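
The dataset class carves out its subsets in two stages: first a test share of `test_split=0.2`, then `train_split=0.8` of the remainder for training, which works out to roughly 64% / 16% / 20% of the tracks. The sketch below illustrates that arithmetic with a simplified stratified splitter written with the standard library only. `stratified_split` is a hypothetical stand-in, not scikit-learn's algorithm, and the 80/20 label proportions are invented.

```python
import random
from collections import defaultdict

def stratified_split(ids, labels, first_fraction, seed=0):
    """Split ids into two lists, keeping each label's share roughly equal in both."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for item, label in zip(ids, labels):
        by_label[label].append(item)
    first, second = [], []
    for label in sorted(by_label):
        group = by_label[label]
        rng.shuffle(group)
        cut = round(len(group) * first_fraction)
        first.extend(group[:cut])
        second.extend(group[cut:])
    return first, second

# Toy data: 100 ids, 80 labelled 'bad' and 20 labelled 'good' (invented proportions)
ids = list(range(100))
label_of = {i: ('bad' if i < 80 else 'good') for i in ids}

# Stage 1: hold out test_split = 0.2 of everything
rest, test_ids = stratified_split(ids, [label_of[i] for i in ids], 0.8)
# Stage 2: train_split = 0.8 of the remainder; labels are looked up per id
# so they stay aligned with the shuffled order of `rest`
train_ids, valid_ids = stratified_split(rest, [label_of[i] for i in rest], 0.8)

print(len(train_ids), len(valid_ids), len(test_ids))  # 64 16 20
```

Looking labels up per id in the second stage matters: once the first split has shuffled the ids, a label list kept in the original order no longer lines up with them.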


In [None]:
string = {
    "v" : {
        "k": ["violin"]
    },
    "c" : {
        "k": ["cello"]
    },
    "b" : {
        "k": ["bass"]
    },
    "all" : {
        "k": ["violin", "cello", "bass"]
    }
}
for k in string.keys():
    string[k]['train_dataset'] = good_soundsDataset(mirdataset=g, subset=0, random_seed=random_seed, klasses=string[k]['k'])
    string[k]['train_loader'] = torch.utils.data.DataLoader(string[k]['train_dataset'],batch_size=64,num_workers=4,pin_memory=True)
    string[k]['test_dataset'] = good_soundsDataset(mirdataset=g, subset=2, random_seed=random_seed, klasses=string[k]['k'])
    string[k]['test_loader'] = torch.utils.data.DataLoader(string[k]['test_dataset'],batch_size=64,num_workers=4,pin_memory=True)

In [None]:
inputs.size()

torch.Size([300])

## TORCH VGGish Model

In [None]:
import torch
import numpy as np

class myVGGish(torch.nn.Module):
    def __init__(self):
          super(myVGGish, self).__init__()
          self.fc1 = torch.nn.Linear(384, 256)
          self.relu1 = torch.nn.ReLU()
          self.fc2 = torch.nn.Linear(256, 128)
          self.relu2 = torch.nn.ReLU()
          self.fc3 = torch.nn.Linear(128, 2) 
    

    def forward(self, x):
          x = self.fc1(x)
          x = self.relu1(x)
          x = self.fc2(x)
          x = self.relu2(x)
          x = self.fc3(x)
          return x


model = myVGGish()

Device to use: 0


myVGGish(
  (fc2): Linear(in_features=300, out_features=128, bias=True)
  (relu2): ReLU()
  (fc3): Linear(in_features=128, out_features=2, bias=True)
)
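
To get a feel for how small the trainable head is, the sketch below counts the parameters of the three `Linear` layers using the sizes declared in the `myVGGish` class definition (384 → 256 → 128 → 2). `linear_params` is a hypothetical helper introduced only for this calculation.

```python
def linear_params(in_features, out_features):
    # One weight per input/output pair plus one bias per output unit
    return in_features * out_features + out_features

layer_sizes = [(384, 256), (256, 128), (128, 2)]  # fc1, fc2, fc3 as declared in myVGGish
per_layer = [linear_params(i, o) for i, o in layer_sizes]
total = sum(per_layer)

print(per_layer, total)  # [98560, 32896, 258] 131714
```

Most of the roughly 132k parameters sit in `fc1`, which maps the 384-dimensional embedding to 256 units.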

In [None]:
tensorboard_writer = SummaryWriter('./tensorboard_logs')

In [None]:
def plot_signal(signal, title, cmap=None):
    fig = plt.figure()
    if signal.ndim == 1:
        plt.plot(signal)
    else:
        plt.imshow(signal, cmap=cmap)    
    plt.title(title)
    
    plot_buf = io.BytesIO()
    plt.savefig(plot_buf, format='jpeg')
    plot_buf.seek(0)
    plt.close(fig)
    return ToTensor()(PIL.Image.open(plot_buf))

In [None]:
def train(model, epoch, loader):
    model.train()
    for batch_idx, (sounds, sample_rate, inputs, labels) in enumerate(loader['train_loader']):
        # print(inputs.shape, torch.min(inputs), torch.max(inputs))
        # print(labels.shape, torch.min(labels), torch.max(labels))
        inputs = inputs.to(device)
        labels = labels.to(device)

        # zero the parameter gradients
        optimizer.zero_grad()

        # forward + backward + optimize
        outputs = model(inputs)
        # _, predicted = torch.max(outputs, 1)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        
        iteration = epoch * len(loader['train_loader']) + batch_idx
        if batch_idx % log_interval == 0: #print training stats
            print('Train Epoch: {} [{}/{} ({:.0f}%)]\tLoss: {:.6f}'
                  .format(epoch, batch_idx * len(inputs), len(loader['train_loader'].dataset), 
                          100. * batch_idx / len(loader['train_loader']), loss))
            tensorboard_writer.add_scalar('training loss/loss', loss, iteration)
            tensorboard_writer.add_scalar('learning rate/lr', optimizer.param_groups[0]['lr'], iteration)
                
        
        # if batch_idx % debug_interval == 0:    # report debug image every "debug_interval" mini-batches
        #     for n, (inp, pred, label) in enumerate(zip(inputs, predicted, labels)):
        #         series = 'label_{}_pred_{}'.format(classes[label.cpu()], classes[pred.cpu()])
        #         tensorboard_writer.add_image('Train MelSpectrogram samples/{}_{}_{}'.format(batch_idx, n, series), 
        #                                      plot_signal(inp.cpu().numpy().squeeze(), series, 'hot'), iteration)

In [None]:
import sklearn.metrics  # a bare `import sklearn` does not guarantee that sklearn.metrics is loaded

def test(model, epoch, loader):
    model.eval()
    class_correct = list(0. for i in range(2))
    class_total = list(0. for i in range(2))
    true = []
    predict = []
    with torch.no_grad():
        for idx, (sounds, sample_rate, inputs, labels) in enumerate(loader['test_loader']):
            inputs = inputs.to(device)
            labels = labels.to(device)

            outputs = model(inputs)

            _, predicted = torch.max(outputs, 1)
            true = true + labels.tolist()
            predict = predict + predicted.tolist()
            c = (predicted == labels)
            for i in range(len(inputs)):
                label = labels[i].item()
                class_correct[label] += c[i].item()
                class_total[label] += 1
        
            iteration = (epoch + 1) * len(loader['test_loader'])

    total_accuracy = 100 * sum(class_correct)/sum(class_total)
    balanced_accuracy = sklearn.metrics.balanced_accuracy_score(true, predict)
    f_score = sklearn.metrics.f1_score(true, predict)
    precision = sklearn.metrics.precision_score(true, predict)
    recall = sklearn.metrics.recall_score(true, predict)
    cm = sklearn.metrics.confusion_matrix(true, predict)
    print('[Iteration {}] Accuracy on the {} test audios: {}%\n'.format(epoch, sum(class_total), total_accuracy))
    print('[Iteration {}] balanced accuracy on the {} test audios: {}%\n'.format(epoch, sum(class_total), balanced_accuracy))
    print('[Iteration {}] f_score on the {} test audios: {}%\n'.format(epoch, sum(class_total), f_score))
    print('[Iteration {}] precision on the {} test audios: {}%\n'.format(epoch, sum(class_total), precision))
    print('[Iteration {}] recall on the {} test audios: {}%\n'.format(epoch, sum(class_total), recall))
    print('[Iteration {}] cm on the {} test audios: {}%\n'.format(epoch, sum(class_total), cm))
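
All the scores printed by `test` can be derived from the confusion matrix alone. The sketch below recomputes them in plain Python for the matrix logged in the violin run further down, `[[0, 43], [0, 234]]`, assuming scikit-learn's layout (rows are true classes, columns are predicted classes, class 1 is the positive class). `binary_metrics` is a hypothetical helper, not part of the notebook.

```python
def binary_metrics(cm):
    """cm is [[TN, FP], [FN, TP]]: rows = true class, columns = predicted class."""
    (tn, fp), (fn, tp) = cm
    total = tn + fp + fn + tp
    recall = tp / (tp + fn) if tp + fn else 0.0         # true-positive rate
    specificity = tn / (tn + fp) if tn + fp else 0.0    # true-negative rate
    precision = tp / (tp + fp) if tp + fp else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    return {
        'accuracy': (tp + tn) / total,
        'balanced_accuracy': (recall + specificity) / 2,  # mean of per-class recall
        'precision': precision,
        'recall': recall,
        'f1': f1,
    }

metrics = binary_metrics([[0, 43], [0, 234]])
for name, value in metrics.items():
    print('{}: {:.4f}'.format(name, value))
```

A balanced accuracy of 0.5 together with a recall of 1.0 means every clip was assigned to class 1, so the plain accuracy simply equals the share of class 1 in the test set. Note also that only the accuracy is scaled to a percentage in `test`; the other scores are fractions between 0 and 1 despite the trailing `%` in the log lines.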

## DEPLOY MODELS
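
The deployment cell below pairs SGD with a `StepLR` scheduler whose `step_size` is `number_of_epochs // 3` and whose `gamma` is 0.1. As a sketch of what that implies, the closed form below reproduces the learning rate seen by each epoch for the values in `configuration_dict`. `step_lr` is a hypothetical helper, and it assumes `scheduler.step()` is called once at the end of every epoch.

```python
def step_lr(base_lr, epoch, step_size, gamma):
    # Learning rate in effect after `epoch` calls to scheduler.step()
    return base_lr * gamma ** (epoch // step_size)

number_of_epochs = 20
base_lr = 0.005
step_size = number_of_epochs // 3   # 6 epochs between decays
gamma = 0.1

schedule = [step_lr(base_lr, epoch, step_size, gamma) for epoch in range(number_of_epochs)]

# Print the first epoch of each plateau
for epoch in range(0, number_of_epochs, step_size):
    print('epoch {:2d}: lr = {:.1e}'.format(epoch, schedule[epoch]))
```

With 20 epochs the rate is divided by ten at epochs 6, 12 and 18, so the last two epochs run at a rate 1000 times smaller than the initial one.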

In [None]:

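def deploy(loader):
    # train() and test() read these names as module-level globals, so they are
    # declared global here instead of being left as locals of deploy()
    global optimizer, criterion, device, log_interval

    # NOTE: `model` is the module-level instance, so successive deploy() calls
    # keep training the same weights rather than starting from scratch
    optimizer = optim.SGD(model.parameters(), lr = configuration_dict.get('base_lr', 0.001), momentum = 0.9)
    scheduler = optim.lr_scheduler.StepLR(optimizer, step_size = configuration_dict.get('number_of_epochs')//3, gamma = 0.1)
    criterion = nn.CrossEntropyLoss()

    device = torch.cuda.current_device() if torch.cuda.is_available() else torch.device('cpu')
    print('Device to use: {}'.format(device))
    model.to(device)

    log_interval = 10
    debug_interval = 25
    for epoch in range(configuration_dict.get('number_of_epochs', 10)):
        print(epoch)
        train(model, epoch, loader)
        test(model, epoch, loader)
        scheduler.step()  # decay the learning rate once per epoch

deploy(string['v'])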

Device to use: 0
0
[Iteration 0] Accuracy on the 277.0 test audios: 84.47653429602889%

[Iteration 0] balanced accuracy on the 277.0 test audios: 0.5%

[Iteration 0] f_score on the 277.0 test audios: 0.9158512720156556%

[Iteration 0] precision on the 277.0 test audios: 0.8447653429602888%

[Iteration 0] recall on the 277.0 test audios: 1.0%

[Iteration 0] cm on the 277.0 test audios: [[  0  43]
 [  0 234]]%

1




[Iteration 1] Accuracy on the 277.0 test audios: 84.47653429602889%

[Iteration 1] balanced accuracy on the 277.0 test audios: 0.5%

[Iteration 1] f_score on the 277.0 test audios: 0.9158512720156556%

[Iteration 1] precision on the 277.0 test audios: 0.8447653429602888%

[Iteration 1] recall on the 277.0 test audios: 1.0%

[Iteration 1] cm on the 277.0 test audios: [[  0  43]
 [  0 234]]%

2
[Iteration 2] Accuracy on the 277.0 test audios: 84.47653429602889%

[Iteration 2] balanced accuracy on the 277.0 test audios: 0.5%

[Iteration 2] f_score on the 277.0 test audios: 0.9158512720156556%

[Iteration 2] precision on the 277.0 test audios: 0.8447653429602888%

[Iteration 2] recall on the 277.0 test audios: 1.0%

[Iteration 2] cm on the 277.0 test audios: [[  0  43]
 [  0 234]]%

3
[Iteration 3] Accuracy on the 277.0 test audios: 84.47653429602889%

[Iteration 3] balanced accuracy on the 277.0 test audios: 0.5%

[Iteration 3] f_score on the 277.0 test audios: 0.9158512720156556%

[Itera

In [None]:
deploy(string['c'])

Device to use: 0
0
[Iteration 0] Accuracy on the 146.0 test audios: 81.5068493150685%

[Iteration 0] balanced accuracy on the 146.0 test audios: 0.5%

[Iteration 0] f_score on the 146.0 test audios: 0.8981132075471698%

[Iteration 0] precision on the 146.0 test audios: 0.815068493150685%

[Iteration 0] recall on the 146.0 test audios: 1.0%

[Iteration 0] cm on the 146.0 test audios: [[  0  27]
 [  0 119]]%

1




[Iteration 1] Accuracy on the 146.0 test audios: 81.5068493150685%

[Iteration 1] balanced accuracy on the 146.0 test audios: 0.5%

[Iteration 1] f_score on the 146.0 test audios: 0.8981132075471698%

[Iteration 1] precision on the 146.0 test audios: 0.815068493150685%

[Iteration 1] recall on the 146.0 test audios: 1.0%

[Iteration 1] cm on the 146.0 test audios: [[  0  27]
 [  0 119]]%

2
[Iteration 2] Accuracy on the 146.0 test audios: 81.5068493150685%

[Iteration 2] balanced accuracy on the 146.0 test audios: 0.5%

[Iteration 2] f_score on the 146.0 test audios: 0.8981132075471698%

[Iteration 2] precision on the 146.0 test audios: 0.815068493150685%

[Iteration 2] recall on the 146.0 test audios: 1.0%

[Iteration 2] cm on the 146.0 test audios: [[  0  27]
 [  0 119]]%

3
[Iteration 3] Accuracy on the 146.0 test audios: 81.5068493150685%

[Iteration 3] balanced accuracy on the 146.0 test audios: 0.5%

[Iteration 3] f_score on the 146.0 test audios: 0.8981132075471698%

[Iteration 

In [None]:
deploy(string['b'])

Device to use: 0
0
[Iteration 0] Accuracy on the 32.0 test audios: 50.0%

[Iteration 0] balanced accuracy on the 32.0 test audios: 0.5%

[Iteration 0] f_score on the 32.0 test audios: 0.6666666666666666%

[Iteration 0] precision on the 32.0 test audios: 0.5%

[Iteration 0] recall on the 32.0 test audios: 1.0%

[Iteration 0] cm on the 32.0 test audios: [[ 0 16]
 [ 0 16]]%

1




[Iteration 1] Accuracy on the 32.0 test audios: 50.0%

[Iteration 1] balanced accuracy on the 32.0 test audios: 0.5%

[Iteration 1] f_score on the 32.0 test audios: 0.6666666666666666%

[Iteration 1] precision on the 32.0 test audios: 0.5%

[Iteration 1] recall on the 32.0 test audios: 1.0%

[Iteration 1] cm on the 32.0 test audios: [[ 0 16]
 [ 0 16]]%

2
[Iteration 2] Accuracy on the 32.0 test audios: 50.0%

[Iteration 2] balanced accuracy on the 32.0 test audios: 0.5%

[Iteration 2] f_score on the 32.0 test audios: 0.6666666666666666%

[Iteration 2] precision on the 32.0 test audios: 0.5%

[Iteration 2] recall on the 32.0 test audios: 1.0%

[Iteration 2] cm on the 32.0 test audios: [[ 0 16]
 [ 0 16]]%

3
[Iteration 3] Accuracy on the 32.0 test audios: 50.0%

[Iteration 3] balanced accuracy on the 32.0 test audios: 0.5%

[Iteration 3] f_score on the 32.0 test audios: 0.6666666666666666%

[Iteration 3] precision on the 32.0 test audios: 0.5%

[Iteration 3] recall on the 32.0 test audio

In [None]:
deploy(string['all'])

Device to use: 0
0
[Iteration 0] Accuracy on the 454.0 test audios: 81.05726872246696%

[Iteration 0] balanced accuracy on the 454.0 test audios: 0.5%

[Iteration 0] f_score on the 454.0 test audios: 0.8953771289537712%

[Iteration 0] precision on the 454.0 test audios: 0.8105726872246696%

[Iteration 0] recall on the 454.0 test audios: 1.0%

[Iteration 0] cm on the 454.0 test audios: [[  0  86]
 [  0 368]]%

1




[Iteration 1] Accuracy on the 454.0 test audios: 81.05726872246696%

[Iteration 1] balanced accuracy on the 454.0 test audios: 0.5%

[Iteration 1] f_score on the 454.0 test audios: 0.8953771289537712%

[Iteration 1] precision on the 454.0 test audios: 0.8105726872246696%

[Iteration 1] recall on the 454.0 test audios: 1.0%

[Iteration 1] cm on the 454.0 test audios: [[  0  86]
 [  0 368]]%

2
[Iteration 2] Accuracy on the 454.0 test audios: 81.05726872246696%

[Iteration 2] balanced accuracy on the 454.0 test audios: 0.5%

[Iteration 2] f_score on the 454.0 test audios: 0.8953771289537712%

[Iteration 2] precision on the 454.0 test audios: 0.8105726872246696%

[Iteration 2] recall on the 454.0 test audios: 1.0%

[Iteration 2] cm on the 454.0 test audios: [[  0  86]
 [  0 368]]%

3
[Iteration 3] Accuracy on the 454.0 test audios: 81.05726872246696%

[Iteration 3] balanced accuracy on the 454.0 test audios: 0.5%

[Iteration 3] f_score on the 454.0 test audios: 0.8953771289537712%

[Itera