# **Data Preprocessing**

In [1]:
# Importing the drive module from google.colab library
from google.colab import drive

# Mounting the Google Drive to the Colab environment
drive.mount('/content/drive')

project_path = '/content/drive/My Drive/GitHub/MarineMammalSoundClassification/'
%cd /content/drive/My Drive/GitHub/MarineMammalSoundClassification/

Mounted at /content/drive
/content/drive/.shortcut-targets-by-id/1oJSL58N419Ve8pd0wCvgXEy52hLM2tJN/MarineMammalSoundClassification


In [2]:
import os
import shutil
import json
from utils.utilities import ensure_dir, get_wav_duration

## **Data Cleaning**

During the initial preprocessing, folders containing fewer than 20 instances were deleted. We then attempted to use the pyAudioAnalysis library to split the data into training, validation, and test sets. However, some files in the remaining folders raised errors, and one entire class (Fin_FinbackWhale) could not be processed at all. Consequently, during a second round of preprocessing, this class folder and the problematic files within other folders were also deleted.

Overall, the following class folders and individual files were removed from the dataset:

Directories:

*   **data/LeopardSeal** (10 instances)
*   **data/MinkeWhale** (17 instances)
*   **data/WeddellSeal** (2 instances)
*   **data/Fin_FinbackWhale** (IndexError: index 15 is out of bounds for axis 0 with size 15)

Files:

*   **data/PantropicalSpottedDolphin/84021003.wav** (WAV header is invalid: nAvgBytesPerSec must equal product of nSamplesPerSec and nBlockAlign, but file has nSamplesPerSec = 40960, nBlockAlign = 2, and nAvgBytesPerSec = 61440)
*   **data/Short_Finned(Pacific)PilotWhale/57021003.wav** (WAV header is invalid: nAvgBytesPerSec must equal product of nSamplesPerSec and nBlockAlign, but file has nSamplesPerSec = 30000, nBlockAlign = 2, and nAvgBytesPerSec = 45000)
*   **data/SpermWhale/84021003.wav** (WAV header is invalid: nAvgBytesPerSec must equal product of nSamplesPerSec and nBlockAlign, but file has nSamplesPerSec = 40960, nBlockAlign = 2, and nAvgBytesPerSec = 61440)
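
The header errors above all report the same inconsistency: `nAvgBytesPerSec` does not equal `nSamplesPerSec * nBlockAlign`. The following is a minimal sketch of how such files could be detected before deleting them, assuming a canonical RIFF/WAVE layout with a standard 16-byte PCM `fmt ` chunk. The helper names `read_fmt_fields` and `header_is_consistent` are hypothetical and are not part of this project.

```python
import struct


def read_fmt_fields(path):
    """Return (nSamplesPerSec, nAvgBytesPerSec, nBlockAlign) from a WAV 'fmt ' chunk."""
    with open(path, "rb") as fh:
        riff, _size, wave_tag = struct.unpack("<4sI4s", fh.read(12))
        if riff != b"RIFF" or wave_tag != b"WAVE":
            raise ValueError(f"{path} is not a RIFF/WAVE file")
        while True:
            chunk_header = fh.read(8)
            if len(chunk_header) < 8:
                raise ValueError(f"{path} has no 'fmt ' chunk")
            chunk_id, chunk_size = struct.unpack("<4sI", chunk_header)
            if chunk_id == b"fmt ":
                # wFormatTag, nChannels, nSamplesPerSec, nAvgBytesPerSec,
                # nBlockAlign, wBitsPerSample
                fields = struct.unpack("<HHIIHH", fh.read(16))
                return fields[2], fields[3], fields[4]
            # Skip other chunks; chunks are padded to an even number of bytes
            fh.seek(chunk_size + (chunk_size & 1), 1)


def header_is_consistent(samples_per_sec, avg_bytes_per_sec, block_align):
    """Check the rule quoted in the error messages above."""
    return avg_bytes_per_sec == samples_per_sec * block_align


# Values reported for the two distinct header errors listed above
print(header_is_consistent(40960, 61440, 2))  # False: 40960 * 2 = 81920
print(header_is_consistent(30000, 45000, 2))  # False: 30000 * 2 = 60000
```

Running `read_fmt_fields` over every file and passing the result to `header_is_consistent` would flag candidates for removal without relying on a downstream library to fail first.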


In [None]:
directory_paths = [
    'data/LeopardSeal',
    'data/MinkeWhale',
    'data/WeddellSeal',
    'data/Fin_FinbackWhale',
]

# Remove the class folders that are too small or could not be processed
for dpath in directory_paths:
    if os.path.exists(dpath):
        shutil.rmtree(dpath)
        print(f"The directory {dpath} has been deleted.")
    else:
        print(f"The directory {dpath} does not exist.")

file_paths = [
    'data/PantropicalSpottedDolphin/84021003.wav',
    'data/Short_Finned(Pacific)PilotWhale/57021003.wav',
    'data/SpermWhale/84021003.wav',
]

# Remove the individual files with invalid WAV headers
for fpath in file_paths:
    if os.path.exists(fpath):
        os.remove(fpath)
        print(f"The file {fpath} has been deleted.")
    else:
        print(f"The file {fpath} does not exist.")

The directory data/LeopardSeal has been deleted.
The directory data/MinkeWhale has been deleted.
The directory data/WeddellSeal has been deleted.
The directory data/Fin_FinbackWhale has been deleted.
The file data/PantropicalSpottedDolphin/84021003.wav has been deleted.
The file data/Short_Finned(Pacific)PilotWhale/57021003.wav has been deleted.
The file data/SpermWhale/84021003.wav has been deleted.


We also delete any files that, according to their metadata, appear to contain <u>sounds from more than one mammal species</u>.

In [None]:
# Load metadata from JSON files
with open('metadata/wav_metadata.json', 'r') as f:
    wav_metadata = json.load(f)

with open('metadata/species.json', 'r') as f:
    species = json.load(f)

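for m in wav_metadata:
    # The 'NA:' field lists "<count> <species code>" entries separated by '|'
    na = wav_metadata[m]['NA:'].replace("  ", " ")
    num_of_animals = na.split('|')

    # More than one entry means more than one species in the recording
    if len(num_of_animals) > 1:
        print(m+'.wav')
        print(na)

        # The species code is the last whitespace-separated token of each entry
        animal_1_code = num_of_animals[0].split()[-1]
        animal_2_code = num_of_animals[1].split()[-1]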

        animal_1_name = next((s for s in species if species[s]['code'] == animal_1_code), None)
        animal_2_name = next((s for s in species if species[s]['code'] == animal_2_code), None)

        if animal_1_name and animal_2_name:
            print(animal_1_name, animal_2_name)

            filepath_1 = os.path.join('data', animal_1_name, m + '.wav')
            filepath_2 = os.path.join('data', animal_2_name, m + '.wav')

            if os.path.exists(filepath_1):
                os.remove(filepath_1)

            if os.path.exists(filepath_2):
                os.remove(filepath_2)

72021005.wav
1, possibly 2+ AA1A | 1+ CC2A
BowheadWhale BeardedSeal
7202100T.wav
1+ AA1A | 1+ CC2A
BowheadWhale BeardedSeal
7202100V.wav
1+ AA1A | 1+ CC2A
BowheadWhale BeardedSeal
7202100Z.wav
1+ AA1A | 1+ CC2A
BowheadWhale BeardedSeal
78018002.wav
3+ CC2A | 1+ AA1A
BeardedSeal BowheadWhale
78018003.wav
3+ CC2A | 1+ AA1A
BeardedSeal BowheadWhale
7801800B.wav
1+ AA1A | 3+ CC2A
BowheadWhale BeardedSeal
7801800D.wav
1+ AA1A | 3+ CC2A
BowheadWhale BeardedSeal
7801800H.wav
1+ AA1A | 3+ CC2A
BowheadWhale BeardedSeal
7801800J.wav
1+ AA1A | 3+ CC2A
BowheadWhale BeardedSeal
91012009.wav
2+ BD10A | 1+ BA2A
MelonHeadedWhale SpermWhale
9101200B.wav
2 BD10A | 1+ BA2A
MelonHeadedWhale SpermWhale
9101200K.wav
2+ BD10A | 1+ BA2A
MelonHeadedWhale SpermWhale
9101201E.wav
1 BD10A | 1 BA2A
MelonHeadedWhale SpermWhale
84021003.wav
1 BA2A | 3 BD15A
SpermWhale PantropicalSpottedDolphin
90058068.wav
3 BA2A | 100-150 BD15A
SpermWhale PantropicalSpottedDolphin
9005807N.wav
3 BA2A | 100-150 BD15A
SpermWhale Pant
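
The metadata check above reads only the first two entries of the `NA:` field. As a hedged sketch, the same parsing can be generalised to any number of `|`-separated entries. The helper names `species_codes` and `is_multi_species` are hypothetical, and the input format is assumed to match the strings printed above (a count followed by a species code in each entry). Unlike the cell above, this version counts distinct codes rather than entries.

```python
def species_codes(na_field):
    """Return the species code of every '|'-separated entry in an 'NA:' string."""
    codes = []
    for entry in na_field.split("|"):
        tokens = entry.split()
        if tokens:  # skip empty entries such as a trailing '|'
            codes.append(tokens[-1])
    return codes


def is_multi_species(na_field):
    """True when the field names more than one distinct species code."""
    return len(set(species_codes(na_field))) > 1


# Strings taken from the output above
print(species_codes("1, possibly 2+ AA1A | 1+ CC2A"))
print(is_multi_species("3 BA2A | 100-150 BD15A"))
```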

## **Convert Sample Rate**

Our EDA revealed that the .wav files have varying sample rates across different species. Therefore, before creating spectrograms and testing the models, <u>we resample all .wav files to a common sample rate of 22,050 Hz</u>, using the [ffmpeg-python](https://github.com/kkroening/ffmpeg-python) library.

In [None]:
!pip install ffmpeg-python

Collecting ffmpeg-python
  Downloading ffmpeg_python-0.2.0-py3-none-any.whl (25 kB)
Installing collected packages: ffmpeg-python
Successfully installed ffmpeg-python-0.2.0


In [None]:
import ffmpeg

def convert_sample_rate(input_file, output_file, new_sample_rate):
    """
    Convert the sample rate of an audio file to a new specified rate.

    Parameters:
    input_file (str): The path to the input audio file.
    output_file (str): The path where the output audio file will be saved.
    new_sample_rate (int): The new sample rate to set for the audio file.

    Returns:
    None
    """
    (
        ffmpeg
        .input(input_file)
        .output(output_file, ar=new_sample_rate)
        .overwrite_output()
        .run()
    )

In [None]:
# Define the input (source) and output (destination) directories
source_dir = 'data'
destination_dir = 'data_22050'

# Ensure the destination directory exists
ensure_dir(destination_dir)

## Walk through all files in the source directory
for root, dirs, files in os.walk(source_dir):
    for file in files:
        if file.lower().endswith('.wav'):
            input_file = os.path.join(root, file)
            print('Convert: '+input_file)
            relative_path = os.path.relpath(input_file, source_dir)
            output_file = os.path.join(destination_dir, relative_path)
            ensure_dir(os.path.dirname(output_file))
            convert_sample_rate(input_file, output_file, 22050)

Convert: data/AtlanticSpottedDolphin/61025001.wav
Convert: data/AtlanticSpottedDolphin/61025002.wav
Convert: data/AtlanticSpottedDolphin/61025003.wav
Convert: data/AtlanticSpottedDolphin/61025004.wav
Convert: data/AtlanticSpottedDolphin/61025006.wav
Convert: data/AtlanticSpottedDolphin/61025007.wav
Convert: data/AtlanticSpottedDolphin/61025008.wav
Convert: data/AtlanticSpottedDolphin/61025009.wav
Convert: data/AtlanticSpottedDolphin/6102500A.wav
Convert: data/AtlanticSpottedDolphin/6102500B.wav
Convert: data/AtlanticSpottedDolphin/6102500C.wav
Convert: data/AtlanticSpottedDolphin/6102500D.wav
Convert: data/AtlanticSpottedDolphin/6102500E.wav
Convert: data/AtlanticSpottedDolphin/6102500F.wav
Convert: data/AtlanticSpottedDolphin/6102500G.wav
Convert: data/AtlanticSpottedDolphin/6102500H.wav
Convert: data/AtlanticSpottedDolphin/6102500I.wav
Convert: data/AtlanticSpottedDolphin/6102500J.wav
Convert: data/AtlanticSpottedDolphin/6102500K.wav
Convert: data/AtlanticSpottedDolphin/6102500M.wav
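
After the conversion it can be useful to confirm that every output file really has the target rate. The sketch below uses only the standard-library `wave` module and assumes the converted files are plain PCM WAV, which `wave` can read. The function name `find_wrong_sample_rates` is hypothetical.

```python
import os
import wave


def find_wrong_sample_rates(root_dir, expected_rate=22050):
    """Return (path, rate) for every .wav under root_dir whose rate differs."""
    mismatches = []
    for root, _dirs, files in os.walk(root_dir):
        for name in files:
            if name.lower().endswith(".wav"):
                path = os.path.join(root, name)
                with wave.open(path, "rb") as wav_file:
                    rate = wav_file.getframerate()
                if rate != expected_rate:
                    mismatches.append((path, rate))
    return mismatches
```

Calling `find_wrong_sample_rates('data_22050')` should return an empty list if the conversion succeeded for every file.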


## **Data Splitting**

Next, the [split-folders](https://pypi.org/project/split-folders/) library was used to split folders with .wav files into train, validation and test (dataset) folders. The ratio we chose was 80% train, 10% validation, and 10% test because the dataset has a fairly small number of instances per class and many classes. We wanted to ensure that there would be a sufficient amount of data for training the model.
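
To make the effect of the 80/10/10 ratio concrete, the sketch below estimates how many files of a class land in each set. It assumes simple truncation of the train and validation counts, with the remainder going to test; this is an illustration only and is not guaranteed to reproduce the exact rounding used by split-folders. The function name `approximate_split_sizes` is hypothetical.

```python
def approximate_split_sizes(n_files, ratio=(0.8, 0.1, 0.1)):
    """Estimate (train, val, test) counts for one class under a ratio split."""
    n_train = int(ratio[0] * n_files)
    n_val = int(ratio[1] * n_files)
    n_test = n_files - n_train - n_val  # remainder goes to the test set
    return n_train, n_val, n_test


# The smallest classes kept after cleaning have 20 instances
for n in (20, 50, 100):
    print(n, approximate_split_sizes(n))
```

Under this assumption a class with exactly 20 recordings contributes 16 files to training and only 2 each to validation and test, which is why a larger validation or test share was not chosen.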

In [None]:
!pip install split-folders

Collecting split-folders
  Downloading split_folders-0.5.1-py3-none-any.whl (8.4 kB)
Installing collected packages: split-folders
Successfully installed split-folders-0.5.1


Before splitting the files into train, validation, and test sets, we move <u>all long files (longer than 30 seconds) into the train set</u> so that they cannot skew the majority voting on the test set.
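
The duration check relies on `get_wav_duration` from `utils.utilities`, whose implementation is not shown in this notebook. A plausible stand-in, assuming PCM WAV input and using only the standard-library `wave` module, could look like the sketch below. The names `wav_duration_seconds` and `is_long_file` are hypothetical.

```python
import wave


def wav_duration_seconds(path):
    """Duration of a PCM WAV file: number of frames divided by the frame rate."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getnframes() / float(wav_file.getframerate())


def is_long_file(path, threshold_seconds=30):
    """True when the file is longer than the threshold used for the train-only rule."""
    return wav_duration_seconds(path) > threshold_seconds
```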

In [None]:
ensure_dir('data_split/train')

In [None]:
source_dir = 'data_22050'
destination_dir = 'data_split/train'

## Walk through all files in the source directory
for root, dirs, files in os.walk(source_dir):
    for file in files:
        if file.lower().endswith('.wav'):
            input_file = os.path.join(root, file)
            duration = get_wav_duration(input_file)
            relative_path = os.path.relpath(input_file, source_dir)
            output_file = os.path.join(destination_dir, relative_path)
            ensure_dir(os.path.dirname(output_file))
            if duration > 30:
                shutil.move(input_file, output_file)

In [None]:
import splitfolders

# Split the dataset into train, validation, and test sets
# Parameters:
# - "data_22050": The path to the resampled dataset directory
# - output="data_split": The path where the split data will be saved
# - seed=1337: Seed for random number generator to ensure reproducibility
# - ratio=(.8, .1, .1): The split ratio for train, validation, and test sets
# - group_prefix=None: Option to keep files with the same prefix together, set to None as it's not needed here
# - move=True: Move files instead of copying them
splitfolders.ratio("data_22050", output="data_split", seed=1337, ratio=(.8, .1, .1), group_prefix=None, move=True)

Copying files: 1527 files [00:09, 159.73 files/s]


In [None]:
# Delete the empty folder data_22050
shutil.rmtree('data_22050')

## **Extract Handcrafted Features**

We used the pyAudioAnalysis library, specifically the `MidTermFeatures.directory_feature_extraction` function, to calculate the handcrafted features for all directories in each set: train, validation, and test. The results were saved in separate CSV files for each set in the `handcrafted_features` directory to make it easier to retrieve them later.

In [3]:
!pip install eyed3
!pip install pydub
!pip install pyAudioAnalysis

Collecting eyed3
  Downloading eyed3-0.9.7-py3-none-any.whl (246 kB)
Collecting coverage[toml]<6.0.0,>=5.3.1 (from eyed3)
  Downloading coverage-5.5-cp310-cp310-manylinux1_x86_64.whl (238 kB)
Collecting deprecation<3.0.0,>=2.1.0 (from eyed3)
  Downloading deprecation-2.1.0-py2.py3-none-any.whl (11 kB)
Collecting filetype<2.0.0,>=1.0.7 (from eyed3)
  Downloading filetype-1.2.0-py2.py3-none-any.whl (19 kB)
Installing collected packages: filetype, deprecation, coverage, eyed3
Successfully installed coverage-5.5 deprecation-2.1.0 eyed3-0.9.7 filetype-1.2.0
Collecting pydub
  Downloading pydub-0.25.1-py2.py3-none-any.whl (32 kB)
Installing collected packages: pydub
Successfully installed pydub-0.25.1
Collecting pyAudioAnalysis
  Downloadin

In [4]:
import warnings
warnings.filterwarnings("ignore")

import pandas as pd
import numpy as np
from pyAudioAnalysis import MidTermFeatures as aF

In [5]:
def create_csv_with_features(set_name):
    """
    Extracts audio features from directories of audio files and saves them to a CSV file.

    Args:
    set_name (str): The name of the dataset (subdirectory in 'data_split') to process.

    This function assumes the following directory structure:
    - data_split/
        - set_name/
            - class1/
            - class2/
            - ...

    For each class directory, the function extracts audio features using
    the directory_feature_extraction function and saves the results in a CSV file
    located in the 'handcrafted_features' directory.
    """

    set_dir = os.path.join("data_split", set_name)
    set_classes = os.listdir(set_dir)
    dirs = [os.path.join(set_dir, c) for c in set_classes]

    # Define parameters for feature extraction
    m_win, m_step, s_win, s_step = 1, 1, 0.1, 0.05

    features = []

    for d in dirs:
        # Extract feature matrix, file names, and feature names for the directory
        f, files, fn = aF.directory_feature_extraction(d, m_win, m_step, s_win, s_step)

        # Get the class name from the directory path
        class_name = os.path.basename(d)

        # Keep only the file name (avoid reusing `f`, which holds the feature matrix)
        files = [os.path.basename(path) for path in files]

        # Extend feature list with class name and file name
        extended_f = [[class_name, b] + a.tolist() for a, b in zip(f, files)]
        features.extend(extended_f)

    col_names = ['class', 'file'] + fn
    features_df = pd.DataFrame(features, columns=col_names)

    # Save the DataFrame to a CSV file in the 'handcrafted_features' directory
    features_df.to_csv(os.path.join('handcrafted_features', f'{set_name}_features.csv'), sep='\t', header=True)

In [6]:
ensure_dir('handcrafted_features')

for set_name in ['train', 'val', 'test']:
  create_csv_with_features(set_name)

Analyzing file 1 of 46: data_split/train/AtlanticSpottedDolphin/61025001.wav
Analyzing file 2 of 46: data_split/train/AtlanticSpottedDolphin/61025002.wav
Analyzing file 3 of 46: data_split/train/AtlanticSpottedDolphin/61025003.wav
Analyzing file 4 of 46: data_split/train/AtlanticSpottedDolphin/61025004.wav
Analyzing file 5 of 46: data_split/train/AtlanticSpottedDolphin/61025006.wav
Analyzing file 6 of 46: data_split/train/AtlanticSpottedDolphin/61025007.wav
Analyzing file 7 of 46: data_split/train/AtlanticSpottedDolphin/61025008.wav
Analyzing file 8 of 46: data_split/train/AtlanticSpottedDolphin/61025009.wav
Analyzing file 9 of 46: data_split/train/AtlanticSpottedDolphin/6102500A.wav
Analyzing file 10 of 46: data_split/train/AtlanticSpottedDolphin/6102500B.wav
Analyzing file 11 of 46: data_split/train/AtlanticSpottedDolphin/6102500D.wav
Analyzing file 12 of 46: data_split/train/AtlanticSpottedDolphin/6102500E.wav
Analyzing file 13 of 46: data_split/train/AtlanticSpottedDolphin/6102500F
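
Note that `create_csv_with_features` writes the files with `sep='\t'`, so they are tab-separated despite the `.csv` extension, and the first column holds the DataFrame index. The sketch below shows one way to read such a file back with the standard-library `csv` module; with pandas, `pd.read_csv(path, sep='\t', index_col=0)` is the equivalent. The function name `load_feature_table` is hypothetical, and the column order (index, `class`, `file`, then features) is taken from the code above.

```python
import csv


def load_feature_table(path):
    """Read a tab-separated feature file into names, labels, files and vectors."""
    labels, file_names, vectors = [], [], []
    with open(path, newline="") as handle:
        reader = csv.reader(handle, delimiter="\t")
        header = next(reader)
        class_col = header.index("class")
        file_col = header.index("file")
        # Every column after 'file' is a numeric feature
        feature_names = header[file_col + 1:]
        for row in reader:
            labels.append(row[class_col])
            file_names.append(row[file_col])
            vectors.append([float(value) for value in row[file_col + 1:]])
    return feature_names, labels, file_names, vectors
```

For example, `load_feature_table('handcrafted_features/train_features.csv')` would return the feature names together with one label, file name, and feature vector per audio file.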

## **Extract and Save Spectrograms**

The following code utilizes the [librosa](https://github.com/librosa/librosa) library to extract spectrograms and melgrams from audio files.

In [None]:
from utils.utilities import ensure_dir
from utils.spec_functions import create_pkl_with_spectrograms

In [None]:
ensure_dir('spectrograms')

In [None]:
for set_name in ['train', 'val', 'test']:
    create_pkl_with_spectrograms(set_name, 5.0)  # 1.0, 2.0 & 5.0