# LAB 4 - DEEP LEARNING - MASTER'S IN APPLIED ARTIFICIAL INTELLIGENCE

# JOSÉ LORENTE LÓPEZ - DNI: 48842308Z

Lab assignment on classifying sounds by label using Deep Learning.

In this first notebook (ipynb) we take the audio dataset and preprocess the clips so that we can work with them.

# Sound Classification using Deep Learning - Preprocessing
## >> Database Download and Feature Extraction

![uc3m](http://materplat.org/wp-content/uploads/LogoUC3M.jpg)
## Mount Google Drive, install dependencies, download the database and perform the feature extraction process

**It is advisable to run this script before the lab session, because the whole process can take more than 1 hour.**

Once the process is finished, you should find in your Google Drive directory the following items:



*   A directory called *UrbanSound8K* that contains the sound files in wav format
*   A pickle file called *us8k_features.pkl* that contains the corresponding features (log-mel spectrograms)


**Note that you only need to run this notebook once.**
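
A quick way to confirm that both items exist is sketched below. The helper name `missing_outputs` is hypothetical, and the item names are taken from the paths used in the code cells of this notebook (`UrbanSound8K` and `us8k_features.pkl`); adjust them if your layout differs.

```python
import os

# Item names as they appear in the code cells of this notebook (assumption:
# the working directory is the folder you created for this lab).
EXPECTED_ITEMS = ("UrbanSound8K", "us8k_features.pkl")


def missing_outputs(base_dir, expected=EXPECTED_ITEMS):
    """Return the expected items that are not present in base_dir."""
    return [name for name in expected
            if not os.path.exists(os.path.join(base_dir, name))]


# An empty list means both the audio directory and the feature file exist.
print(missing_outputs(os.getcwd()))
```

If the list is not empty, the download or the feature extraction probably has to be run again.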





In [None]:
# Load the Drive helper and mount
from google.colab import drive

# This will prompt for authorization.
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
# Create a new folder in your Drive and change to that directory, e.g.:
# /content/drive/MyDrive/new_dir_that_you_just_created_for_this_lab
import os
os.chdir('/content/drive/MyDrive/Master Inteligencia Artificial Aplicada - UCIIIM/p4DL/Dataset')

## Required Python libraries for the lab session

You may need to install librosa using pip as follows:

> **!pip install librosa==0.8.0**


In [None]:
# Import the libraries needed for this lab

import os

import librosa
import numpy as np
import pandas as pd

from tqdm import tqdm

---

## 1. Download and Uncompress the Audio Data
The database used in this lab session is the [UrbanSound8k dataset](https://urbansounddataset.weebly.com/urbansound8k.html) that contains 8732 labeled sound excerpts (<=4s) of urban sounds from 10 classes:

* air_conditioner
* car_horn
* children_playing
* dog_bark
* drilling
* engine_idling
* gun_shot
* jackhammer
* siren
* street_music

All files are loaded at a sampling frequency of 22050 Hz (the default of `librosa.load`, which resamples each file when needed).
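
The numeric class IDs in the metadata can be turned back into class names with a small lookup. The sketch below assumes that class IDs 0 to 9 follow the order of the list above and that file names follow the pattern `<sourceID>-<classID>-<occurrenceID>-<sliceID>.wav`, which is consistent with the metadata rows shown in this notebook (for example, `100032-3-0-0.wav` has classID 3). The function names are hypothetical.

```python
# Class names in classID order (0..9), assumed to follow the list above.
CLASS_NAMES = [
    "air_conditioner", "car_horn", "children_playing", "dog_bark", "drilling",
    "engine_idling", "gun_shot", "jackhammer", "siren", "street_music",
]


def class_id_from_filename(slice_file_name):
    """Read the second dash-separated field of a slice file name as the class ID."""
    return int(slice_file_name.split("-")[1])


def class_name_from_filename(slice_file_name):
    """Map a slice file name to its class name via the embedded class ID."""
    return CLASS_NAMES[class_id_from_filename(slice_file_name)]


print(class_name_from_filename("100032-3-0-0.wav"))  # dog_bark
```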




We will use the "UrbanSound8k" dataset. This database contains 8732 sounds (each at most 4 seconds long) labeled with 10 different classes (air conditioner, children playing, gun shot, siren, ...).

We work with a sampling frequency "fs" of 22050 Hz (the sampling period Ts = 1/fs is the time between consecutive samples when the analog audio signal is sampled).



In [None]:
# Download the UrbanSound8k dataset
DOWNLOAD_DATASET = True
EXTRACT_DATASET = True
DELETE_DATASET_TAR = True

DATASET_URL = "https://goo.gl/8hY5ER"

if DOWNLOAD_DATASET:
    !wget $DATASET_URL

if EXTRACT_DATASET:
    !tar xf 8hY5ER

if DELETE_DATASET_TAR:
    !rm -f 8hY5ER    

--2022-12-21 23:36:05--  https://goo.gl/8hY5ER
Resolving goo.gl (goo.gl)... 108.177.127.100, 108.177.127.139, 108.177.127.138, ...
Connecting to goo.gl (goo.gl)|108.177.127.100|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://zenodo.org/record/1203745/files/UrbanSound8K.tar.gz [following]
--2022-12-21 23:36:05--  https://zenodo.org/record/1203745/files/UrbanSound8K.tar.gz
Resolving zenodo.org (zenodo.org)... 188.185.124.72
Connecting to zenodo.org (zenodo.org)|188.185.124.72|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 6023741708 (5.6G) [application/octet-stream]
Saving to: ‘8hY5ER’


2022-12-21 23:38:21 (42.4 MB/s) - ‘8hY5ER’ saved [6023741708/6023741708]



In [None]:
# Set paths to the UrbanSound8K dataset and metadata file
US8K_AUDIO_PATH = os.path.abspath('UrbanSound8K/audio/')
US8K_METADATA_PATH = os.path.abspath('UrbanSound8K/metadata/UrbanSound8K.csv')

# Load the csv metadata file into a Pandas DataFrame structure
us8k_metadata_df = pd.read_csv(US8K_METADATA_PATH,
                               usecols=["slice_file_name", "fold", "classID"],
                               dtype={"fold": "uint8", "classID" : "uint8"})

us8k_metadata_df

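|      | slice_file_name    | fold | classID |
|------|--------------------|------|---------|
| 0    | 100032-3-0-0.wav   | 5    | 3       |
| 1    | 100263-2-0-117.wav | 5    | 2       |
| 2    | 100263-2-0-121.wav | 5    | 2       |
| 3    | 100263-2-0-126.wav | 5    | 2       |
| 4    | 100263-2-0-137.wav | 5    | 2       |
| ...  | ...                | ...  | ...     |
| 8727 | 99812-1-2-0.wav    | 7    | 1       |
| 8728 | 99812-1-3-0.wav    | 7    | 1       |
| 8729 | 99812-1-4-0.wav    | 7    | 1       |
| 8730 | 99812-1-5-0.wav    | 7    | 1       |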


We download the dataset (10 fold folders, each holding audio clips from several classes) and load the metadata csv into a DataFrame with, for each clip: file name, fold it is stored in, and class label.

## 2. Feature Extraction
The feature sequences consist of the log-mel spectrograms of the audio files belonging to the UrbanSound8K database.

In particular, the log-mel spectrograms are computed using the following configuration:

* Frame period or hop length = 512 samples (512 / 22050 = 23.22 ms)
* Size (length) of the analysis window = 1024 samples (1024 / 22050 = 46.44 ms)
* Number of filters in the mel filterbank = 128

For the log-mel spectrogram computation, we have used the function **melspectrogram** from the module *feature* of the *librosa* package. This function has, among others, the following input arguments:

* y: audio signal
* sr: sampling frequency
* hop_length: frame period or hop length (in samples)
* win_length: window size (in samples)
* n_mels: number of filters in the mel filterbank

Note that in this function the window size and the hop length must be expressed in samples. Since the sampling frequency (sr) is the number of samples per second (in our case, sr = 22050 Hz, so 1 second corresponds to 22050 samples), the conversion from **samples** to **seconds** is performed by:

```
seconds = samples/sr = samples/22050
```
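
The conversion can be checked numerically. The following is a minimal sketch with hypothetical helper names (`samples_to_ms`, `seconds_to_samples`), using the hop length, window length and sampling frequency given above.

```python
SAMPLING_RATE = 22050  # Hz, as used throughout this notebook


def samples_to_ms(samples, sr=SAMPLING_RATE):
    """Convert a length in samples to milliseconds."""
    return 1000.0 * samples / sr


def seconds_to_samples(seconds, sr=SAMPLING_RATE):
    """Convert a duration in seconds to a whole number of samples."""
    return int(round(seconds * sr))


print(f"hop length:    {samples_to_ms(512):.2f} ms")   # 23.22 ms
print(f"window length: {samples_to_ms(1024):.2f} ms")  # 46.44 ms
print(seconds_to_samples(3.0))                         # 66150
```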


We define a function that computes the log-mel spectrogram of an audio clip; it is later applied to every clip in the dataset.

In [None]:
# Extract a log-mel spectrogram for each audio file in the dataset and store it
# into a Pandas DataFrame along with its class and fold label.

# Note that the resulting log-mel spectrograms (that can be seen as sequences of
# features) are forced to have a fixed length that is determined by the input
# argument "num_of_frames". Sequences longer than this quantity are cut, whereas
# sequences shorter than this quantity are padded at the end with a predefined
# constant value (-80 dB).

# Configuration variables for log-mel spectrogram computation
WINDOW_LENGTH = 1024  # length of the analysis window in samples
HOP_LENGTH = 512      # number of samples between successive frames (frame period or hop length)
N_MEL = 128           # number of Mel bands to generate


def compute_melspectrogram_with_fixed_length(audio, sampling_rate, num_of_frames=128):
    try:
        # compute a mel-scaled spectrogram
        melspectrogram = librosa.feature.melspectrogram(y=audio, 
                                                        sr=sampling_rate, 
                                                        hop_length=HOP_LENGTH,
                                                        win_length=WINDOW_LENGTH, 
                                                        n_mels=N_MEL)

        # convert a power spectrogram to decibel units (log-mel spectrogram)
        melspectrogram_db = librosa.power_to_db(melspectrogram, ref=np.max)
        
        melspectrogram_length = melspectrogram_db.shape[1]
        
        # pad or fix the length of spectrogram 
        if melspectrogram_length != num_of_frames:
            melspectrogram_db = librosa.util.fix_length(melspectrogram_db, 
                                                        size=num_of_frames, 
                                                        axis=1, 
                                                        constant_values=(-80.0, -80.0))
    except Exception as e:
        print("\nError encountered while parsing files\n>>", e)
        return None 
    
    return melspectrogram_db
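
The pad-or-cut step can be illustrated without librosa. The sketch below is a plain NumPy stand-in for that step, not librosa's implementation, and the name `pad_or_truncate` is hypothetical. The padding value of -80 dB matches the floor of `librosa.power_to_db` when called with `ref=np.max` and its default `top_db=80`, so padded frames look like silence.

```python
import numpy as np


def pad_or_truncate(spec_db, num_of_frames=128, pad_value=-80.0):
    """Force a (n_mels, n_frames) array to exactly num_of_frames columns."""
    n_frames = spec_db.shape[1]
    if n_frames >= num_of_frames:
        # Too long (or already the right size): keep the first columns only.
        return spec_db[:, :num_of_frames]
    # Too short: append constant-valued columns at the end of the time axis.
    pad_width = ((0, 0), (0, num_of_frames - n_frames))
    return np.pad(spec_db, pad_width, mode="constant", constant_values=pad_value)


short_clip = np.zeros((128, 100))  # fewer frames than required
long_clip = np.zeros((128, 130))   # more frames than required
print(pad_or_truncate(short_clip).shape, pad_or_truncate(long_clip).shape)
# (128, 128) (128, 128)
```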

We trim the audio clips to at most 3 seconds (shorter clips are padded at the spectrogram level) and build a DataFrame in which each row is one clip with the attributes: log-mel spectrogram, label, and fold.

In [None]:
# Extract the log-mel spectrograms of the whole audio database.
# The length of the log-mel sequences is fixed to NUM_OF_FRAMES = 128 frames,
# which corresponds to NUM_OF_FRAMES*HOP_LENGTH = 65536 samples, i.e. approximately 3 seconds (2.97 s at 22050 Hz)

# Configuration variables for feature extraction
SOUND_DURATION = 3.0    # fixed duration of an audio excerpt in seconds
NUM_OF_FRAMES = 128     # fixed duration in frames

features = []

# iterate through all dataset examples and compute log-mel spectrograms
for index, row in tqdm(us8k_metadata_df.iterrows(), total=len(us8k_metadata_df)):
    file_path = f'{US8K_AUDIO_PATH}/fold{row["fold"]}/{row["slice_file_name"]}'
    audio, sample_rate = librosa.load(file_path, duration=SOUND_DURATION, res_type='kaiser_fast')

    melspectrogram = compute_melspectrogram_with_fixed_length(audio, sample_rate, num_of_frames=NUM_OF_FRAMES)
    label = row["classID"]
    fold = row["fold"]
    
    features.append([melspectrogram, label, fold])

# convert into a Pandas DataFrame 
us8k_features = pd.DataFrame(features, columns=["melspectrogram", "label", "fold"])

100%|██████████| 8732/8732 [08:43<00:00, 16.68it/s]
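
The relation between clip duration and number of frames can be checked with a little arithmetic. The sketch below assumes that the spectrogram is computed with centered frames (librosa's default, `center=True`), in which case the frame count is `1 + num_samples // hop_length`; the helper name `num_frames` is hypothetical.

```python
SAMPLING_RATE = 22050  # Hz
HOP_LENGTH = 512       # samples between successive frames
NUM_OF_FRAMES = 128    # fixed sequence length used in this notebook


def num_frames(num_samples, hop_length=HOP_LENGTH):
    """Frame count for a centered analysis (assumption: center=True)."""
    return 1 + num_samples // hop_length


# A full 3-second clip yields slightly more than 128 frames, so it is cut.
print(num_frames(int(3.0 * SAMPLING_RATE)))  # 130
# A 1-second clip yields far fewer frames, so it is padded up to 128.
print(num_frames(int(1.0 * SAMPLING_RATE)))  # 44
# Duration covered by 128 hops, in seconds.
print(round(NUM_OF_FRAMES * HOP_LENGTH / SAMPLING_RATE, 2))  # 2.97
```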


### Store the data

We save the DataFrame we just built.

In [None]:
# Store the data

# Write the Pandas DataFrame object to .pkl file
WRITE_DATA = True

if WRITE_DATA:
  us8k_features.to_pickle("us8k_features.pkl")
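
To reuse the stored features later, the pickle file can be read back and the per-row spectrograms stacked into a single array. The following is a minimal sketch on a tiny stand-in DataFrame with the same three columns; the fake spectrogram values, the file name `demo_features.pkl` and the variable names are illustrative assumptions, not part of the lab data.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Tiny stand-in for us8k_features: two fake 128x128 "spectrograms".
demo = pd.DataFrame(
    [[np.full((128, 128), -80.0), 3, 5],
     [np.full((128, 128), -40.0), 2, 5]],
    columns=["melspectrogram", "label", "fold"],
)

# Round trip through a pickle file in a temporary directory.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_features.pkl")
    demo.to_pickle(path)
    restored = pd.read_pickle(path)

# Stack the per-row arrays into one (n_clips, n_mels, n_frames) array.
X = np.stack(restored["melspectrogram"].to_numpy())
y = restored["label"].to_numpy()
print(X.shape, y)  # (2, 128, 128) [3 2]
```

With the real file, the same `pd.read_pickle` call on `us8k_features.pkl` would return the DataFrame written by the cell above.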