<a href="https://colab.research.google.com/github/RubyQianru/Deep-Learning-for-Media/blob/main/Adapt_to_Audio.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Deep Learning for Media
#### MPATE-GE 2039 - DM-GY 9103

---


This is a class exercise that counts towards your class participation grade. The notebook is based on the companion materials of:

<blockquote>
"Deep Learning with Python", Second Edition by  F. Chollet, 2021.
</blockquote>

Follow the instructions below.

## Instrument classification using audio

Based on the code from the notebook "Building Blocks" that we discussed in class, complete this notebook to train a classifier with audio. Change your runtime to use a GPU for faster results.

### Obtain the dataset

For this assignment we will use a mini version of the Medley-Solos-DB dataset:

<blockquote>
V. Lostanlen, C.E. Cella. Deep convolutional networks on the pitch spiral for musical instrument recognition. Proceedings of the International Society for Music Information Retrieval Conference (ISMIR), 2016.
</blockquote>

Download the mini version of the dataset [from this link](bit.ly/mini_medley_solos_db), and save it in your Drive under `mir_datasets/mini_medley_solos_db`.

In [1]:
!pip install mirdata==0.3.8   # this is a package for working with music datasets

Collecting mirdata==0.3.8
  Downloading mirdata-0.3.8-py3-none-any.whl (17.2 MB)
Collecting black>=23.3.0 (from mirdata==0.3.8)
  Downloading black-24.1.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.7 MB)
Collecting Deprecated>=1.2.14 (from mirdata==0.3.8)
  Downloading Deprecated-1.2.14-py2.py3-none-any.whl (9.6 kB)
Collecting jams>=0.3.4 (from mirdata==0.3.8)
  Downloading jams-0.3.4.tar.gz (51 kB)
  Preparing metadata (setup.py) ... done
Collecting pretty-midi>=0.2.10 (from mirdata==0.3.8)
  Downloading pretty_midi-0.2.10.tar.gz (5.6 MB)

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
import os
import mirdata
import librosa
import numpy as np
import random


data_home = '/content/drive/My Drive/mir_datasets/mini_medley_solos_db'
dataset = mirdata.initialize('medley_solos_db', data_home=data_home)

In [4]:
if not os.path.exists("/content/drive/My Drive/mir_datasets/"):
  print("Make a directory at `My Drive/mir_datasets/`!")

if not (os.path.exists("/content/drive/My Drive/mir_datasets/mini_medley_solos_db")
  and os.path.exists("/content/drive/My Drive/mir_datasets/mini_medley_solos_db/audio")
  and os.path.exists("/content/drive/My Drive/mir_datasets/mini_medley_solos_db/annotation")):
  print("Unzip `mini_medley_solos_db.zip` at `My Drive/mir_datasets/`! It will create two sub-folders, `audio` and `annotation`.")
  print("If you're done with it on your laptop, you may need to wait till your Google Drive is sync'ed.")


In [5]:
# check that the code runs by loading a random file
dataset.track('fe798314-bdfb-5055-f633-5c2df5129be4').audio

(array([-0.00023576, -0.00034744, -0.00029236, ..., -0.00042982,
         0.00110277,  0.00333256], dtype=float32),
 22050)

We are not going to use the audio waveform directly: its sampling rate is very high (22,050 samples per second here), so it is a lot of data to deal with!

Instead, we're going to "summarize" its content by extracting audio features called [MFCCs](https://medium.com/prathena/the-dummys-guide-to-mfcc-aceab2450fd) (mel-frequency cepstral coefficients), which, roughly speaking, capture timbre information well.
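To get a feel for the size reduction, the sketch below counts values before and after feature extraction. It assumes a clip of 65,536 samples (a hypothetical length, chosen because it yields the 129 frames that show up in the feature shapes printed later in this notebook) and librosa's default centered framing, where the number of frames is `1 + n_samples // hop_length`.

```python
# Rough size comparison: raw waveform vs. MFCC summary.
# ASSUMPTION: n_samples is a hypothetical clip length, not read from the dataset.
n_samples = 65536
hop_length = 512   # same default as the compute_mfccs function in this notebook
n_mfcc = 20        # same default as the compute_mfccs function in this notebook

# With centered framing (librosa's default), frames = 1 + n_samples // hop_length
n_frames = 1 + n_samples // hop_length

# The 0th coefficient is dropped, so each frame keeps n_mfcc - 1 values
n_values = n_frames * (n_mfcc - 1)

print("frames:", n_frames)
print("feature values per clip:", n_values)
print("reduction factor:", round(n_samples / n_values, 1))
```

Under these assumptions, each clip shrinks from tens of thousands of samples to a few thousand feature values.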

In [6]:
def compute_mfccs(y, sr, n_fft=2048, hop_length=512, n_mels=128, n_mfcc=20):
    """Compute mfccs for an audio file, removing the 0th MFCC coefficient
    to be independent of loudness

    Parameters
    ----------
    y : np.array
        Mono audio signal
    sr : int
        Audio sample rate
    n_fft : int
        Number of points for computing the fft
    hop_length : int
        Number of samples to advance between frames
    n_mels : int
        Number of mel frequency bands to use
    n_mfcc : int
        Number of mfcc's to compute

    Returns
    -------
    mfccs: np.array (t, n_mfcc - 1)
        Matrix of mfccs

    """

    mfcc = librosa.feature.mfcc(y=y,
                                sr=sr,
                                n_mfcc=n_mfcc,
                                n_fft=n_fft,
                                hop_length=hop_length,
                                n_mels=n_mels).T

    return mfcc[:, 1:]

In [7]:
# run this to create the track ("songs") splits
all_tracks = dataset.load_tracks()
tracks_train = [t for t in all_tracks.values() if t.subset == 'training']
tracks_test = [t for t in all_tracks.values() if t.subset == 'test']
random.shuffle(tracks_test)
tracks_test = tracks_test[:65] # 10% test

print("There are {} tracks in the training set".format(len(tracks_train)))
print("There are {} tracks in the test set".format(len(tracks_test)))

There are 584 tracks in the training set
There are 65 tracks in the test set


In [8]:
# get the audio features for each audio track into a list
# t.audio is a (signal, sample_rate) pair; unpack it from a single access per track
features_train = [compute_mfccs(*t.audio) for t in tracks_train]
features_test = [compute_mfccs(*t.audio) for t in tracks_test]
# get the labels
labels_train = [t.instrument_id for t in tracks_train]
labels_test = [t.instrument_id for t in tracks_test]
# convert them into an array
features_train = np.array(features_train)
features_test = np.array(features_test)
labels_train = np.array(labels_train)
labels_test = np.array(labels_test)

In [9]:
features_train.shape

(584, 129, 19)

In [10]:
len(labels_train)

584

In [11]:
labels_train

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
       3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
       4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
       4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
       4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
       4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
       4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
       4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
       4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
       4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,

In [12]:
features_test.shape

(65, 129, 19)

In [13]:
len(labels_test)

65

In [14]:
labels_test

array([4, 3, 4, 0, 3, 3, 2, 7, 7, 3, 6, 7, 3, 7, 3, 1, 3, 3, 7, 4, 3, 3,
       4, 2, 3, 3, 3, 0, 4, 7, 3, 4, 3, 5, 2, 7, 3, 2, 3, 4, 1, 2, 1, 6,
       2, 3, 3, 1, 1, 6, 2, 7, 7, 7, 1, 4, 4, 4, 7, 6, 2, 2, 1, 4, 7])

### The network architecture

Add code to create a two-dense-layer neural network for instrument classification. The first layer should have a `relu` activation and the second one a `softmax` activation.

How many units? (= how large are the layers?)

- First layer: 🤷 you can set some number like, 10, or 100, or 30, or 512.
- Second (and last) layer: What do you think? Why were there 10 units in the last layer, in the MNIST digit classification examples?
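One way to reason about the last layer: a softmax layer outputs one probability per class, so its size follows the number of distinct labels. Below is a minimal NumPy sketch of that idea, using made-up data (`toy_labels` and `toy_scores` are hypothetical and not taken from the dataset).

```python
import numpy as np

# ASSUMPTION: toy labels standing in for the integer instrument ids
toy_labels = np.array([0, 3, 3, 7, 4, 1, 2, 6, 5, 4])
n_classes = len(np.unique(toy_labels))  # one output unit per distinct class


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    shifted = scores - scores.max()  # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()


# ASSUMPTION: arbitrary raw scores, one per class
toy_scores = np.linspace(-1.0, 1.0, n_classes)
probs = softmax(toy_scores)

print("units needed in the last layer:", n_classes)
print("sum of probabilities:", probs.sum())
```

The same reasoning explains the 10 units in the MNIST examples: there are 10 digit classes.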

In [41]:
# YOUR CODE HERE
#

from tensorflow import keras
from tensorflow.keras import layers
model = keras.Sequential([
    layers.Dense(100, activation="relu"),
    layers.Dense(8, activation="softmax")  # one unit per instrument class (8 classes)
])

### The compilation step

Add code to compile the model with a `rmsprop` optimizer, with a `sparse_categorical_crossentropy` loss and `accuracy` as metric.
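For intuition, `sparse_categorical_crossentropy` takes integer labels (not one-hot vectors) and penalizes the model by the negative log of the probability it assigned to the true class. The NumPy function below is a simplified sketch of that idea, not Keras's actual implementation; `toy_probs` and `toy_labels` are made-up values.

```python
import numpy as np


def sparse_categorical_crossentropy(probs, labels):
    """Mean negative log-probability assigned to the true class."""
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels, dtype=int)
    # Pick, for each row, the probability of that row's true class
    picked = probs[np.arange(len(labels)), labels]
    return float(-np.log(picked).mean())


# ASSUMPTION: two made-up predictions over three classes
toy_probs = np.array([[0.7, 0.2, 0.1],
                      [0.1, 0.8, 0.1]])
toy_labels = np.array([0, 1])  # integer class ids, as in labels_train

loss = sparse_categorical_crossentropy(toy_probs, toy_labels)
print("loss:", loss)
```

The loss gets smaller as the probability of the correct class approaches 1.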

In [42]:
# YOUR CODE HERE
#

model.compile(optimizer="rmsprop",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

### Preparing the audio data

A dense layer expects a matrix (a rank-2 tensor) as input, so each track's features must be flattened into a single vector. Values should be normalized to lie between -1 and 1.
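As an illustration of both steps on toy data, the sketch below flattens a rank-3 array and scales it by the largest absolute value found in the training portion. The arrays are random stand-ins (`toy_train`, `toy_test`), and clipping the test values is one possible design choice for keeping them in range, not a requirement of the exercise.

```python
import numpy as np

# ASSUMPTION: random stand-ins shaped like (tracks, frames, coefficients)
rng = np.random.default_rng(0)
toy_train = rng.normal(scale=50.0, size=(6, 129, 19)).astype("float32")
toy_test = rng.normal(scale=50.0, size=(2, 129, 19)).astype("float32")

# Flatten each track's (frames, coefficients) matrix into one vector
toy_train_r2 = toy_train.reshape((len(toy_train), -1))
toy_test_r2 = toy_test.reshape((len(toy_test), -1))

# Derive the scale from the training data only, then reuse it for the test data
scale = np.abs(toy_train_r2).max()
toy_train_r2 = toy_train_r2 / scale
# Test values may slightly exceed the training range, so clip them
toy_test_r2 = np.clip(toy_test_r2 / scale, -1.0, 1.0)

print(toy_train_r2.shape, toy_train_r2.min(), toy_train_r2.max())
```

Using `-1` in `reshape` lets NumPy infer the flattened length, so the track count and feature dimensions do not need to be hard-coded.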

In [43]:
# YOUR CODE HERE
#
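# Flatten each (129, 19) MFCC matrix into one vector per track
features_train_r2 = features_train.reshape((len(features_train), -1)).astype("float32")
features_test_r2 = features_test.reshape((len(features_test), -1)).astype("float32")
# Scale both sets by the largest absolute value in the training set (287.44623 in this run)
scale = np.abs(features_train_r2).max()
features_train_r2 = features_train_r2 / scale
features_test_r2 = features_test_r2 / scale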

In [44]:
features_train_r2.max()

1.0

In [45]:
features_train_r2.min()

-0.584992

In [46]:
# Check your code
assert len(features_train_r2.shape) == 2
assert features_train_r2.max() <= 1
assert features_train_r2.min() >= -1

assert len(features_test_r2.shape) == 2
assert features_test_r2.max() <= 1
assert features_test_r2.min() >= -1

### "Fitting" the model

In [48]:
# YOUR CODE HERE
# model..
model.fit(features_train_r2, labels_train, epochs=5, batch_size=64)


Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.src.callbacks.History at 0x79d9ef499e70>

### Using the model to make predictions

In [49]:
# YOUR CODE HERE

predictions = model.predict(features_test_r2[0:5])




In [50]:
predictions[0].argmax()

4

In [51]:
predictions[0][3]

5.874169e-05

In [52]:
labels_test[0]

4

In [53]:
# check which instrument corresponds to which label
np.unique([f'{t.instrument_id}-{t.instrument}' for t in tracks_test])

array(['0-clarinet', '1-distorted electric guitar', '2-female singer',
       '3-flute', '4-piano', '5-tenor saxophone', '6-trumpet', '7-violin'],
      dtype='<U27')
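To read a prediction as an instrument name rather than an integer, the index of the largest probability can be looked up in a list of names. The sketch below uses the label names printed above; the probability rows in `toy_predictions` are made up for illustration and are not model output.

```python
import numpy as np

# Instrument names ordered by label id, as printed above
instrument_names = [
    "clarinet", "distorted electric guitar", "female singer", "flute",
    "piano", "tenor saxophone", "trumpet", "violin",
]

# ASSUMPTION: made-up probability rows, one per clip
toy_predictions = np.array([
    [0.01, 0.02, 0.02, 0.05, 0.80, 0.03, 0.02, 0.05],
    [0.10, 0.05, 0.05, 0.60, 0.05, 0.05, 0.05, 0.05],
])

# argmax over the class axis gives the most likely label id per clip
predicted_ids = toy_predictions.argmax(axis=1)
predicted_names = [instrument_names[i] for i in predicted_ids]
print(predicted_names)
```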

### Evaluating the model on new data

In [54]:
# YOUR CODE HERE

test_loss, test_acc = model.evaluate(features_test_r2, labels_test)
print(f"test_acc: {test_acc}")

test_acc: 0.6769230961799622
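The accuracy reported here is the fraction of test clips whose most likely predicted class matches the true label. A minimal NumPy sketch of that computation follows, with made-up probabilities and labels (`toy_probs`, `toy_labels`) standing in for real model output.

```python
import numpy as np


def accuracy(probabilities, labels):
    """Fraction of rows whose argmax equals the true integer label."""
    predicted = np.asarray(probabilities).argmax(axis=1)
    return float((predicted == np.asarray(labels)).mean())


# ASSUMPTION: made-up predictions over three classes for four clips
toy_probs = np.array([
    [0.9, 0.1, 0.0],
    [0.2, 0.5, 0.3],
    [0.1, 0.3, 0.6],
    [0.6, 0.3, 0.1],
])
toy_labels = np.array([0, 1, 1, 0])

toy_acc = accuracy(toy_probs, toy_labels)
print("accuracy:", toy_acc)
```

With a trained model, the same function could be applied to the output of `model.predict` on the full test set as a cross-check.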
