<a href="https://colab.research.google.com/github/MohamadMahdiDarvishi/Audio-ML/blob/main/Notebooks/Audio_ML_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Music genre classification: preparing the dataset

## 0. Prerequisites

In [None]:
# Importing
import json
import os
import math
import librosa

# Setting hyperparameters
DATASET_PATH = "/content/dataset/Data/genres_original"
JSON_PATH = "/content/dataset/data_10.json"
SAMPLE_RATE = 22050
TRACK_DURATION = 30 # in seconds
SAMPLE_PER_TRACK = TRACK_DURATION * SAMPLE_RATE # total number of samples in one track

📑 **Bringing dataset to notebook**

In [None]:
# Creating a folder to hold the dataset
os.makedirs("/content/dataset" , exist_ok = True)
os.path.isdir("/content/dataset")

# Mounting Google Drive to access the data stored there
from google.colab import drive
drive.mount('/content/drive')

# Copying the archive from the mounted Google Drive to the local dataset folder
%cp -r /content/drive/MyDrive/GTZAN/archive_2.zip /content/dataset

# Changing the working directory so the archive is extracted into the dataset folder
%cd "/content/dataset/"

# Unzipping the archive copied to the local dataset folder
import zipfile
with zipfile.ZipFile("/content/dataset/archive_2.zip" , "r") as zipref :
  zipref.extractall()

Mounted at /content/drive
/content/dataset


📖 **Food for thought**

To unzip a `file.zip` archive with the `zipfile` module:

```
# code snippet
import zipfile
with zipfile.ZipFile('file.zip', 'r') as zip_ref:
    zip_ref.extractall()

```
Similarly, for `rar` files, using the third-party `rarfile` package:


```
# code snippet
import rarfile
with rarfile.RarFile('file.rar', 'r') as rar_ref:
    rar_ref.extractall()

```
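The `zipfile` snippet above extracts into the current working directory, which is why the notebook runs `%cd` first. As a minimal sketch of an alternative, the destination can be passed explicitly through the `path` argument of `extractall`. The helper name `extract_archive` and the demo file names are hypothetical and exist only for illustration.

```python
import os
import tempfile
import zipfile

def extract_archive(zip_path, dest_dir):
    """Extract every member of zip_path into dest_dir and return the member names."""
    os.makedirs(dest_dir, exist_ok=True)
    with zipfile.ZipFile(zip_path, "r") as zip_ref:
        # Explicit target folder instead of the current working directory
        zip_ref.extractall(path=dest_dir)
        return zip_ref.namelist()

# Build a tiny throwaway archive so the sketch runs anywhere (made-up content).
work_dir = tempfile.mkdtemp()
demo_zip = os.path.join(work_dir, "demo.zip")
with zipfile.ZipFile(demo_zip, "w") as zf:
    zf.writestr("genres/blues/readme.txt", "placeholder")

extracted = extract_archive(demo_zip, os.path.join(work_dir, "out"))
print(extracted)
```

With an explicit destination, the extraction no longer depends on which directory the notebook happens to be in.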


🤔 **Difference between `!` and `%` in IPython**

* When you run a command with `!`, it is executed as a shell command in a subshell.

* When you run a command with `%`, it executes one of the magic commands defined in IPython.

📖 **Food for thought**

* `subshell` - A subshell is a child shell spawned by the main shell (also known as the parent shell). It is a separate process with its own set of variables and command history, so commands run in it execute in a separate environment.

* `magic commands` - Magic commands, also known as magic functions, are special IPython commands that provide extra functionality, such as explicitly modifying the behavior of a code cell or simplifying common tasks like timing code execution and profiling.
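The practical consequence shows up with `cd`: a directory change made in a subshell is lost when that child process exits, while a change made inside the Python process persists. The sketch below imitates both cases in plain Python. Treating `subprocess.run(..., shell=True)` as a stand-in for `!` and `os.chdir` as a stand-in for `%cd` is an assumption made for illustration, not a description of IPython internals.

```python
import os
import subprocess
import tempfile

start_dir = os.getcwd()
target_dir = tempfile.gettempdir()

# Comparable to `!cd target_dir`: the change happens in a child shell process
# and disappears when that process exits.
subprocess.run(f'cd "{target_dir}"', shell=True, check=True)
cwd_after_subshell = os.getcwd()

# Comparable to `%cd target_dir`: the change is made in the current process.
os.chdir(target_dir)
cwd_after_chdir = os.getcwd()

os.chdir(start_dir)  # restore the original working directory

print("unchanged after subshell:", cwd_after_subshell == start_dir)
print("changed after os.chdir:", os.path.samefile(cwd_after_chdir, target_dir))
```

This is why the notebook uses `%cd` rather than `!cd` before unzipping.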

📚 **References**

* [subshell](https://www.geeksforgeeks.org/shell-scripting-subshell/)

* [magic commands in IPython](https://www.geeksforgeeks.org/useful-ipython-magic-commands/#:~:text=Magic%20commands%20generally%20known%20as,code%20execution%2C%20profiling%2C%20etc.)

In [None]:
# Removing a track whose file is damaged and can't be loaded
%rm '/content/dataset/Data/genres_original/jazz/jazz.00054.wav'
# Removing unneeded extra data included in the dataset zip file
%rm '/content/dataset/Data/images_original' -r

## 1. `save_mfcc` function

**Steps:**

1. Creating the raw output dictionary
2. Walking through the dataset and adding folder names to the mapping
3. Loading the music data
4. Walking through track segments and calculating MFCCs
5. Writing the data in JSON format to the output file
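Step 4 depends on knowing how many MFCC vectors a segment should produce, since the function only keeps segments of the expected length. Here is a minimal sketch of that arithmetic, using the sample rate and track duration from the setup cell. The helper names `expected_mfcc_vectors` and `centered_frame_count` are hypothetical, and the `1 + floor(n / hop_length)` rule is an assumption about librosa's default centered framing, included only for comparison.

```python
import math

SAMPLE_RATE = 22050
TRACK_DURATION = 30  # in seconds

def expected_mfcc_vectors(num_segment, hop_length=512):
    """Samples per segment and the expected frame count, using the ceil rule."""
    samples_per_track = SAMPLE_RATE * TRACK_DURATION
    samples_per_segment = int(samples_per_track / num_segment)
    return samples_per_segment, math.ceil(samples_per_segment / hop_length)

def centered_frame_count(num_samples, hop_length=512):
    """Frame count assuming centered framing: 1 + floor(n / hop_length)."""
    return 1 + num_samples // hop_length

for segments in (5, 10):
    n, frames = expected_mfcc_vectors(segments)
    print(segments, n, frames, centered_frame_count(n))
# 5 132300 259 259
# 10 66150 130 130
```

The two rules agree here because the segment length is not an exact multiple of `hop_length`. If it were, the ceil rule would give one frame fewer than the centered rule, and the length check would reject every segment.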


In [None]:
def save_mfcc(dataset_path , json_path , num_mfcc = 13 , n_fft = 2048 , hop_length = 512 , num_segment = 5) :

  """
  Extract MFCCs from a music dataset and save them in a json file along with genre labels

    :param dataset_path (str) : Path to the dataset folder
    :param json_path (str) : Path to json file used to save MFCCs
    :param num_mfcc (int) : Number of coefficients to extract
    :param n_fft (int) : Interval we consider to apply FFT. Measured in # of samples
    :param hop_length (int) : Sliding window for FFT. Measured in # of samples
    :param num_segment (int) : Number of segments we want to divide sample tracks into
    :return :

  """

  # 1. Dictionary holding the data that will be written to the json file
  data = {
      "mapping" : [] ,
      "lables" : [] ,
      "mfcc" : []
  }

  # Calculating expected mfcc vectors per segment
  samples_per_segment = int(SAMPLE_PER_TRACK / num_segment)
  num_mfcc_vectors_per_segment = math.ceil(samples_per_segment / hop_length) # math.ceil rounds up to the next integer

  # Note : dirnames is not used in this function

  for i , (dirpath , dirnames , filenames) in enumerate(os.walk(dataset_path)) :

      # 2 . walking through dataset folder and adding folder names to the mapping
      if dirpath != dataset_path : # skip the dataset root; compare by value, not identity
        mapping = dirpath.split("/")[-1]
        data["mapping"].append(mapping)
        print(f"processing : {mapping}")

        # 3 . loading music signals from files in dataset_path
        for f in filenames :
          file_path = os.path.join(dirpath , f)
          signal , sample_rate = librosa.load(path = file_path ,sr = SAMPLE_RATE)
          # 4. walking through track segments and calculating mfcc on them
          for j in range(num_segment) :
            start = j * samples_per_segment
            end = start + samples_per_segment
            mfcc = librosa.feature.mfcc(y = signal[start:end], sr = SAMPLE_RATE , n_mfcc = num_mfcc , n_fft = n_fft , hop_length = hop_length)
            mfcc = mfcc.T # transposing mfcc matrix

            # keeping only segments with the expected number of mfcc vectors
            if len(mfcc) == num_mfcc_vectors_per_segment :
              data["mfcc"].append(mfcc.tolist())
              data["lables"].append(i-1)

              # printing out the final work
              print(f"{file_path} , segment : {j+1} , label : {i-1}")
  # 5 . exiting walking through folders' "for" loop and writing data in json format on output file
  with open(json_path , "w") as fp :
    json.dump(data , fp , indent = 4)

if __name__ == "__main__" :
  save_mfcc(DATASET_PATH , JSON_PATH , num_segment = 10)

Streaming output truncated to the last 5000 lines.
/content/dataset/Data/genres_original/reggae/reggae.00018.wav , segment : 4 , label : 4
/content/dataset/Data/genres_original/reggae/reggae.00018.wav , segment : 5 , label : 4
/content/dataset/Data/genres_original/reggae/reggae.00018.wav , segment : 6 , label : 4
/content/dataset/Data/genres_original/reggae/reggae.00018.wav , segment : 7 , label : 4
/content/dataset/Data/genres_original/reggae/reggae.00018.wav , segment : 8 , label : 4
/content/dataset/Data/genres_original/reggae/reggae.00018.wav , segment : 9 , label : 4
/content/dataset/Data/genres_original/reggae/reggae.00018.wav , segment : 10 , label : 4
processing : jazz
/content/dataset/Data/genres_original/jazz/jazz.00024.wav , segment : 1 , label : 5
/content/dataset/Data/genres_original/jazz/jazz.00024.wav , segment : 2 , label : 5
/content/dataset/Data/genres_original/jazz/jazz.00024.wav , segment : 3 , label : 5
/content/dataset/Data/genres_original/jazz/jazz.

❗ **Note**

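In machine learning classification problems, labels start from index `0`, not `1`. This is why the function stores `i-1` as the label: the first directory visited by `os.walk` is the dataset root itself (`i = 0`), so the first genre folder has `i = 1`.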
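One benefit of 0-based labels is that each label indexes directly into the `mapping` list. The sketch below uses a toy dictionary with the same keys that `save_mfcc` writes (including the `"lables"` spelling used in the code). The genre names and label values are made up for illustration.

```python
import json

# Toy data with the same keys that save_mfcc writes (values are made up).
data = {
    "mapping": ["blues", "classical", "jazz"],
    "lables": [0, 0, 2, 1],
    "mfcc": [],
}

# Round-trip through JSON, mirroring what happens when the file is saved and reloaded.
restored = json.loads(json.dumps(data))

# Each 0-based label is a valid index into the mapping list.
genres = [restored["mapping"][label] for label in restored["lables"]]
print(genres)  # ['blues', 'blues', 'jazz', 'classical']
```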

📖 **Food for thought**

The code snippet below ensures that the body of the `if` statement runs only when the file is executed directly, not when it is imported by another module.

```
# code snippet
if __name__ == "__main__" :
  # do something
```
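To see the guard in action without creating a separate project, the sketch below writes a tiny module to a temporary file and executes it twice with the standard-library `runpy` module: once under the name `"__main__"` and once under another name, as happens on import. The module content and the names `demo_module` and `guard_ran` are hypothetical.

```python
import os
import runpy
import tempfile

# A hypothetical module whose guarded block records that it ran.
MODULE_SOURCE = '''
guard_ran = False
if __name__ == "__main__":
    guard_ran = True
'''

module_path = os.path.join(tempfile.mkdtemp(), "demo_module.py")
with open(module_path, "w") as fp:
    fp.write(MODULE_SOURCE)

# Executed as a script: __name__ is "__main__", so the guarded block runs.
as_script = runpy.run_path(module_path, run_name="__main__")

# Executed under another name, as on import: the guarded block is skipped.
as_import = runpy.run_path(module_path, run_name="demo_module")

print(as_script["guard_ran"], as_import["guard_ran"])  # True False
```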

📚 **References**

[What does `if __name__ == "__main__"` do in code snippet](https://stackoverflow.com/questions/419163/what-does-if-name-main-do)

In [None]:
%cp '/content/dataset/data_10.json' '/content/drive/My Drive'