<a href="https://colab.research.google.com/github/HumanAndMachineHearing/Practical_2023/blob/main/Assignment_2_Students.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Programming Assignment 2. Audio feature extraction for sound classification**

# 1. Introduction
<p align = "justify"> In the previous session, you explored the ESC-50 dataset and analyzed and visualized the sound waves, spectra, and spectrograms of its sound clips. In this session, you will dive deeper into the topic of audio feature extraction.

**Practical Report**
<br>
For this session, add the output of the exercises in this notebook, together with your answers, to the Practical Report. A link to the templates for the practical reports can be found in the Readme file.


# Preparation

Before starting the exercises, import the libraries that are essential for this exercise.

In [None]:
# Machine learning framework
import torch

# Library for audio and signal processing with PyTorch
import torchaudio
import torchaudio.transforms as T # For common audio processing and feature extraction. Implements features as objects
import torchaudio.functional as F # Implements features as standalone functions

# For manipulating directory paths
import os

# For working with datasets
import pandas as pd

# Plotting library
import matplotlib.pyplot as plt
# To embed plots within the notebook
%matplotlib inline

# Scientific and vector computation for Python
import numpy as np


## Sound selection

Use the same selection of example sounds as in Assignment 1.

In [None]:
# Add code to load example sounds with Torchaudio

# 2. Mel spectrogram

<p align = "justify"> In Session 1, you calculated time-frequency representations of sound clips as power spectrograms. The frequency axis of these spectrograms was linear from 1 Hz to 22 kHz. In this exercise, you will use the Mel filterbank to calculate an alternative spectrogram representation that resembles human hearing properties.

## Visualizing the Mel filter bank
**Exercise 2.1:**
<p align = "justify"> (A) Try different numbers of Mel filters and plot the resulting filter banks.
<p align = "justify"> (B) Describe how the Mel filterbank aims to mimic human hearing properties and why it is beneficial for a Machine Hearing model to receive input that resembles these human hearing properties.
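
<p align = "justify"> As a starting point for part (B), the sketch below illustrates the HTK-style mel formula that is also used further down in this notebook. The helper names <code>hz_to_mel</code> and <code>mel_to_hz</code> and the Nyquist frequency of 22050 Hz (44.1 kHz audio) are assumptions made for illustration. It shows that points spaced equally on the mel scale lie increasingly far apart in Hz.

```python
import numpy as np

def hz_to_mel(freq_hz):
    """HTK-style conversion from frequency in Hz to mel."""
    return 2595.0 * np.log10(1.0 + np.asarray(freq_hz, dtype=float) / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel: conversion from mel to frequency in Hz."""
    return 700.0 * (np.power(10.0, np.asarray(mel, dtype=float) / 2595.0) - 1.0)

# Ten points equally spaced on the mel scale between 1 Hz and 22050 Hz
# (22050 Hz is an assumed Nyquist frequency for 44.1 kHz audio)
mel_points = np.linspace(hz_to_mel(1.0), hz_to_mel(22050.0), 10)
hz_points = mel_to_hz(mel_points)

# Equal steps in mel correspond to increasingly wide steps in Hz
hz_steps = np.diff(hz_points)
print(np.round(hz_points))
print(np.round(hz_steps))
```

The growing step size in Hz is the reason that Mel filters are narrow at low frequencies and wide at high frequencies.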



In [None]:
# Specify parameters for mel filterbank
mgram_n_fft = 1024
mgram_n_mels = 64
mgram_f_min = 1 # set minimum frequency
mgram_f_max = sample_rate / 2 # Nyquist frequency; maximum frequency

In [None]:
# Define mel filterbank
mel_filters = F.melscale_fbanks(
    int(mgram_n_fft // 2 + 1), # number of frequency bins of the STFT (n_freqs)
    n_mels= mgram_n_mels,
    f_min= mgram_f_min,
    f_max= mgram_f_max,
    sample_rate=sample_rate,
    norm = None, # choose 'slaney' for area normalization
    mel_scale = 'htk',
)
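
The cell below is a simplified NumPy sketch of how a triangular mel filter bank of this kind can be built. It is meant to illustrate the idea only and is not guaranteed to reproduce the output of `F.melscale_fbanks` exactly. The function names and the sample rate of 44100 Hz are assumptions.

```python
import numpy as np

def hz_to_mel(freq_hz):
    """HTK-style conversion from Hz to mel."""
    return 2595.0 * np.log10(1.0 + np.asarray(freq_hz, dtype=float) / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (np.power(10.0, np.asarray(mel, dtype=float) / 2595.0) - 1.0)

def triangular_mel_filterbank(n_freqs, n_mels, f_min, f_max, sample_rate):
    """Return an (n_freqs, n_mels) matrix of triangular filters (no normalization)."""
    # Frequencies of the STFT bins, from 0 Hz up to the Nyquist frequency
    all_freqs = np.linspace(0.0, sample_rate / 2.0, n_freqs)
    # n_mels + 2 points equally spaced in mel: lower edge, n_mels centers, upper edge
    f_pts = mel_to_hz(np.linspace(hz_to_mel(f_min), hz_to_mel(f_max), n_mels + 2))
    f_diff = np.diff(f_pts)
    # Rising edge: 0 at the left edge of a filter, 1 at its center
    rising = (all_freqs[:, np.newaxis] - f_pts[np.newaxis, :-2]) / f_diff[:-1]
    # Falling edge: 1 at the center of a filter, 0 at its right edge
    falling = (f_pts[np.newaxis, 2:] - all_freqs[:, np.newaxis]) / f_diff[1:]
    # A triangle is the smaller of the two edges, clipped at zero
    return np.maximum(0.0, np.minimum(rising, falling))

# Assumed values: 44.1 kHz audio, 1024-point FFT, 64 mel filters
sketch_filters = triangular_mel_filterbank(
    n_freqs=1024 // 2 + 1, n_mels=64, f_min=1.0, f_max=22050.0, sample_rate=44100
)
print(sketch_filters.shape)
```

Each column holds one triangular filter. Neighbouring triangles overlap so that, between the first and last filter center, the filter weights at any frequency sum to one.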

In [None]:
# To label the mel bins with their center frequencies, convert the minimum and maximum
# frequency to mel, space the filter centers evenly on the mel scale, and convert back to Hz

# extract center frequencies of mel filters
mel_min = 2595*np.log10(1+(mgram_f_min/700))
mel_max = 2595*np.log10(1+(mgram_f_max/700))
bin_width = (mel_max-mel_min)/(mgram_n_mels+1) # spacing between neighbouring filter centers in mel
mel_axis = np.linspace(mel_min+bin_width,mel_max-bin_width,mgram_n_mels) # filter centers; mel_min and mel_max themselves are the outer filter edges
mel_axis_infreq = (np.round(700*(np.power(10,(mel_axis/2595))-1))).astype(int) # convert center frequencies from mel to Hz


In [None]:
# calculate frequency axis
mspectrogram_freqs = (np.round(np.arange(0,np.shape(mel_filters)[0])*sample_rate/mgram_n_fft)).astype(int)

# Visualize mel filterbank as 2D image
fig = plt.figure()
plt.title("Mel filter bank")
plt.imshow(mel_filters, aspect="auto", origin = 'lower', cmap = 'jet', vmin = 0, vmax = 1)
plt.colorbar(label = 'amplitude')
plt.ylabel("Frequency (Hertz)");
plt.xlabel("Mel bin (number)");
ytickdefinition = [0, 100, 200, 300, 400, 500]
plt.yticks(ytickdefinition,mspectrogram_freqs[ytickdefinition]);
xtickdefinition = [0,10,20,30,40,50,60]
#plt.xticks(xtickdefinition,mel_axis_infreq[xtickdefinition]); # use this to label axis with Mel center frequencies rather than filter number

# Visualize mel filterbank as line plot
fig = plt.figure()
plt.plot(mel_filters);
plt.title('Mel filters visualized')
plt.ylabel('Amplitude');
plt.xlabel('Frequency (Hertz)')
xtickdefinition = [0, 100, 200, 300, 400, 500]
plt.xticks(xtickdefinition,mspectrogram_freqs[xtickdefinition]);

## Calculate and visualize the Log-Mel spectrogram

**Exercise 2.2:**
<p align = 'justify'> Calculate and plot Log-Mel spectrograms of the sounds that you selected. Compare the Log-Mel spectrogram plots to the spectrogram plots of Assignment 1 and describe the effect of using a Mel filterbank on the time-frequency representation of a sound.

In [None]:
# Specify parameters for mel spectrogram
mgram_win_length = 512 # window size; default = n_fft; determines the resolution of the time axis
mgram_hop_length = 256 # length of hop between STFT windows; default = win_length / 2; determines the resolution of the time axis
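
The arithmetic below sketches how these parameters translate into the size and resolution of the spectrogram. It assumes a sample rate of 44100 Hz and clips of 5 seconds, which are not defined in this cell, and uses standalone variable names for that reason.

```python
# Assumed properties of the sound clips
sample_rate_assumed = 44100   # samples per second
clip_duration = 5             # seconds

# STFT parameters as specified above
n_fft = 1024
win_length = 512
hop_length = 256

n_samples = sample_rate_assumed * clip_duration

# Frequency axis: number of bins and spacing between bins in Hz
n_freq_bins = n_fft // 2 + 1
freq_bin_spacing = sample_rate_assumed / n_fft

# Time axis: number of frames when the signal is padded on both sides (center = True)
n_frames = 1 + n_samples // hop_length

# Duration of one analysis window and of one hop, in milliseconds
window_ms = 1000 * win_length / sample_rate_assumed
frame_step_ms = 1000 * hop_length / sample_rate_assumed

print(n_freq_bins, round(freq_bin_spacing, 2), n_frames)
print(round(window_ms, 2), round(frame_step_ms, 2))
```

A smaller hop length gives more frames per second, and a larger `n_fft` gives more closely spaced frequency bins.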

In [None]:
# Define transform to mel spectrogram, either magnitude (power = 1.0) or power (power = 2.0)
mel_spectrogram = T.MelSpectrogram(
    sample_rate = sample_rate,
    n_fft = mgram_n_fft,
    win_length = mgram_win_length,
    hop_length = mgram_hop_length,
    f_min = mgram_f_min, # same frequency range as the filter bank above, so the axis labels match
    f_max = mgram_f_max,
    center = True,
    pad_mode = "reflect",
    power = 2.0,
    norm = None,
    n_mels = mgram_n_mels,
    mel_scale = "htk",
)

In [None]:
# Perform transform to Mel spectrogram
melspec = mel_spectrogram(waveform)

# Define time axis: frame i starts i hops after the beginning of the clip
mspectrogram_time_axis = np.round(np.arange(0,np.shape(melspec)[2])*mgram_hop_length/sample_rate,2)

# Convert units of power spectrogram to decibel
offset_val = 1e-10 # add small offset to avoid taking logarithm of zero
melspec_db = 10 * np.log10(melspec+offset_val)
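
The decibel conversion can be checked on a handful of hand-picked power values. This is a minimal sketch with made-up example values, not output from a real sound clip.

```python
import numpy as np

# Hand-picked example power values (not taken from a real recording)
example_power = np.array([0.0, 1e-3, 1.0, 10.0, 100.0])

offset_val = 1e-10  # same small offset as above, avoids log10(0)
example_db = 10 * np.log10(example_power + offset_val)

# Every factor of 10 in power adds 10 dB; zero power ends up at the floor set by the offset
print(np.round(example_db, 2))
```

With this offset the lowest possible value is -100 dB, so silent frames fall far below the lower color limit of -30 dB used in the plot.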

In [None]:
# Plot the log-mel spectrogram
fig = plt.figure(figsize = (10,6))
plt.title('Log-Mel Spectrogram NAME SOUND')
plt.ylabel('Center Frequency (Hz)')
plt.xlabel('Time (seconds)')
plt.imshow(np.squeeze(melspec_db), origin = 'lower', cmap = 'jet', vmin = -30, vmax = 30, aspect = 3) # aspect sets the aspect ratio of the pixels
plt.colorbar(shrink = 0.3,label = 'decibel'); # add colorbar of same size as original plot
xtickdefinition = [0,100,200,300,400,500,600,700,800,861];
plt.xticks(xtickdefinition,mspectrogram_time_axis[xtickdefinition]);
ytickdefinition = [0,10,20,30,40,50,60];
plt.yticks(ytickdefinition,mel_axis_infreq[ytickdefinition]);


# 3. Mel frequency cepstral coefficients
The final audio feature that you will investigate in this practical is the Mel Frequency Cepstral Coefficient (MFCC).

**Exercise 2.3:**
Calculate and plot MFCCs of the sounds that you selected and add them to your report.
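
MFCCs are commonly obtained by applying a discrete cosine transform (DCT) along the mel axis of a log-mel spectrogram. The NumPy sketch below illustrates that step under stated assumptions: the function name `dct_matrix` is hypothetical, the input is random stand-in data rather than a real log-mel spectrogram, and the scaling is an orthonormal DCT-II, which may differ in detail from what `T.MFCC` computes.

```python
import numpy as np

def dct_matrix(n_mfcc, n_mels):
    """Orthonormal DCT-II basis with shape (n_mfcc, n_mels)."""
    n = np.arange(n_mels)
    k = np.arange(n_mfcc)[:, np.newaxis]
    basis = np.cos(np.pi / n_mels * (n + 0.5) * k)
    basis[0] *= 1.0 / np.sqrt(2.0)  # extra scaling of the constant (k = 0) row
    return basis * np.sqrt(2.0 / n_mels)

# Random stand-in for a log-mel spectrogram of shape (n_mels, n_frames)
rng = np.random.default_rng(0)
log_mel_example = rng.normal(size=(64, 100))

# Keep the first 13 coefficients of each frame
dct_example = dct_matrix(n_mfcc=13, n_mels=64)
mfcc_example = dct_example @ log_mel_example
print(mfcc_example.shape)
```

The first coefficient is proportional to the mean of the log-mel values in a frame, and higher coefficients capture increasingly fine structure along the mel axis.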

In [None]:
# Specify parameters MFCC
mfcc_n_fft = 1024 # include descriptions of these parameters.
mfcc_win_length = 512
mfcc_hop_length = 256
mfcc_n_mels = 64
mfcc_n_mfcc = 64

In [None]:
# Define MFCC transform

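mfcc_transform = T.MFCC(
    sample_rate = sample_rate,
    n_mfcc = mfcc_n_mfcc,
    melkwargs ={
        "n_fft": mfcc_n_fft,
        "win_length": mfcc_win_length, # use the window length specified above instead of the default (n_fft)
        "n_mels": mfcc_n_mels,
        "hop_length": mfcc_hop_length,
        "mel_scale": "htk",
    },
)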

In [None]:
# Perform MFCC transform
mfcc_ex = mfcc_transform(waveform)

In [None]:
# Plot MFCC
fig = plt.figure(figsize=(10,6))
plt.title('MFCC NAME SOUND')
plt.ylabel('MFCC coefficient (index)')
plt.xlabel('Time (seconds)')
plt.imshow(np.squeeze(mfcc_ex[0]), origin = 'lower', cmap = 'jet', vmin = -30, vmax = 30, aspect = 3)
plt.colorbar(shrink = 0.3, label = 'Coefficient');
xtickdefinition = [0,100,200,300,400,500,600,700,800,861];
plt.xticks(xtickdefinition,mspectrogram_time_axis[xtickdefinition]);


# 4. Preparing audio features for ResNet-18

<p align="justify"> The ResNet model was originally developed for image classification. This means that the implementation we use here can take either one-channel input (corresponding to grayscale images in the original implementation) or three-channel input (corresponding to RGB images in the original implementation). Furthermore, the network expects square input.

<p align="justify"> During this practical, you will train the ResNet-18 model on one-channel audio input (corresponding to one audio feature) and on three-channel input (corresponding to a combination of three audio features) to investigate the relevance of various audio features on sound classification performance.

<p align="justify">Given the computational load of training the model, it is recommended to generate features of dimensions 128 x 128. For the time dimension, cut sound clips to 3 seconds duration instead of 5 seconds.
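
<p align="justify"> One possible way to arrive at 128 x 128 features is sketched below. It assumes a sample rate of 44100 Hz, 128 mel bins (or coefficients) on the frequency axis, and zero-filled arrays as stand-ins for real features. All variable names are illustrative, and other parameter choices can be equally valid.

```python
import numpy as np

sample_rate_assumed = 44100   # assumed sample rate in Hz
clip_seconds = 3              # clips are cut to 3 seconds
target_frames = 128           # desired size of the time axis
n_feature_bins = 128          # desired size of the frequency axis

# Choose the hop length so that a 3 second clip yields at least 128 frames
n_samples = sample_rate_assumed * clip_seconds
hop_length = n_samples // target_frames
n_frames = 1 + n_samples // hop_length   # frame count with center = True

# Zero-filled stand-ins for three different audio features of one clip
features = [np.zeros((n_feature_bins, n_frames)) for _ in range(3)]

# Crop the time axis to exactly 128 frames
cropped = [feature[:, :target_frames] for feature in features]

# One-channel input: a single feature with a leading channel dimension
one_channel = cropped[0][np.newaxis]
# Three-channel input: three features stacked along the channel dimension
three_channel = np.stack(cropped)

print(hop_length, n_frames, one_channel.shape, three_channel.shape)
```

Features that are stacked as channels need identical dimensions, so it helps to compute all of them with the same hop length.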

**Exercise 2.4:**
<br>
<p align="justify"> Based on your analysis of audio features in this practical, select which audio feature you will use as one-channel input and which combination of features you will use as three-channel input. Motivate your choice and use illustrations of the audio features to support your motivation. Also describe the parameters that you used for generating the audio features (e.g., 128 x 128).  



In [None]:
# Add your code here