# **Data Preprocessing**

In [3]:
# Importing the drive module from google.colab library
from google.colab import drive

# Mounting the Google Drive to the Colab environment
drive.mount('/content/drive')

project_path = '/content/drive/My Drive/GitHub/MarineMammalSoundClassification/'
%cd /content/drive/My Drive/GitHub/MarineMammalSoundClassification/

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
/content/drive/My Drive/GitHub/MarineMammalSoundClassification


## **Data Cleaning**

The following code deletes the folders of classes that have fewer than 20 instances, one class folder that raised an IndexError, and individual files that caused errors with the pyAudioAnalysis library. The cleanup runs before the data is split into train/validation/test sets.

Directories:

*   **data/LeopardSeal** (10 instances)
*   **data/MinkeWhale** (17 instances)
*   **data/WeddellSeal** (2 instances)
*   **data/Fin_FinbackWhale** (IndexError: index 15 is out of bounds for axis 0 with size 15)
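
The instance counts listed above can be reproduced with a short script. The following is a minimal sketch, assuming each class is a subfolder of `data` that holds its `.wav` files directly; the helper names `count_wav_files` and `small_classes` are hypothetical and are not part of the original notebook.

```python
import os


def count_wav_files(data_dir):
    """Return {class_name: number of .wav files} for each subfolder of data_dir."""
    counts = {}
    for entry in sorted(os.listdir(data_dir)):
        class_dir = os.path.join(data_dir, entry)
        if os.path.isdir(class_dir):
            counts[entry] = sum(
                1 for name in os.listdir(class_dir)
                if name.lower().endswith(".wav")
            )
    return counts


def small_classes(counts, min_instances=20):
    """Return the sorted class names that have fewer than min_instances files."""
    return sorted(name for name, n in counts.items() if n < min_instances)


# Only runs when the dataset folder is present (assumed to be ./data).
if os.path.isdir("data"):
    counts = count_wav_files("data")
    for name in small_classes(counts):
        print(f"{name}: {counts[name]} instances")
```

The threshold of 20 is the one stated in this section; the `min_instances` parameter only makes it easy to try other values.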

Files:

*   **data/PantropicalSpottedDolphin/84021003.wav** (WAV header is invalid: nAvgBytesPerSec must equal product of nSamplesPerSec and nBlockAlign, but file has nSamplesPerSec = 40960, nBlockAlign = 2, and nAvgBytesPerSec = 61440)
*   **data/Short_Finned(Pacific)PilotWhale/57021003.wav** (WAV header is invalid: nAvgBytesPerSec must equal product of nSamplesPerSec and nBlockAlign, but file has nSamplesPerSec = 30000, nBlockAlign = 2, and nAvgBytesPerSec = 45000)
*   **data/SpermWhale/84021003.wav** (WAV header is invalid: nAvgBytesPerSec must equal product of nSamplesPerSec and nBlockAlign, but file has nSamplesPerSec = 40960, nBlockAlign = 2, and nAvgBytesPerSec = 61440)
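
The error messages above quote the rule that was violated: `nAvgBytesPerSec` must equal `nSamplesPerSec * nBlockAlign`. For the first file, 40960 × 2 = 81920, but the header stores 61440. The sketch below shows one way such files could be detected ahead of time by reading the `fmt ` chunk directly. It assumes a standard RIFF/WAVE layout with the usual 16-byte format fields, the function names are hypothetical, and it is not how pyAudioAnalysis performs the check.

```python
import struct


def read_fmt_chunk(path):
    """Read the basic fields of the 'fmt ' chunk from a RIFF/WAVE file."""
    with open(path, "rb") as f:
        riff_header = f.read(12)
        if len(riff_header) < 12:
            raise ValueError("file too short to be a WAV file")
        riff, _, wave = struct.unpack("<4sI4s", riff_header)
        if riff != b"RIFF" or wave != b"WAVE":
            raise ValueError("not a RIFF/WAVE file")
        while True:
            chunk_header = f.read(8)
            if len(chunk_header) < 8:
                raise ValueError("no 'fmt ' chunk found")
            chunk_id, chunk_size = struct.unpack("<4sI", chunk_header)
            if chunk_id == b"fmt ":
                data = f.read(16)
                if len(data) < 16:
                    raise ValueError("truncated 'fmt ' chunk")
                fields = struct.unpack("<HHIIHH", data)
                names = ("wFormatTag", "nChannels", "nSamplesPerSec",
                         "nAvgBytesPerSec", "nBlockAlign", "wBitsPerSample")
                return dict(zip(names, fields))
            # Skip other chunks; chunk bodies are padded to an even length.
            f.seek(chunk_size + (chunk_size & 1), 1)


def header_is_consistent(fmt):
    """Check the rule quoted in the error messages above."""
    return fmt["nAvgBytesPerSec"] == fmt["nSamplesPerSec"] * fmt["nBlockAlign"]
```

Calling `header_is_consistent(read_fmt_chunk(path))` on each `.wav` file would flag files with this specific header problem; other kinds of corruption would need separate checks.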


In [2]:
import os
import shutil

# Class folders to remove entirely
directory_paths = [
    'data/LeopardSeal',
    'data/MinkeWhale',
    'data/WeddellSeal',
    'data/Fin_FinbackWhale',
]

for dpath in directory_paths:
    if os.path.exists(dpath):
        shutil.rmtree(dpath)
        print(f"The directory {dpath} has been deleted.")
    else:
        print(f"The directory {dpath} does not exist.")

# Individual files with invalid WAV headers
file_paths = [
    'data/PantropicalSpottedDolphin/84021003.wav',
    'data/Short_Finned(Pacific)PilotWhale/57021003.wav',
    'data/SpermWhale/84021003.wav',
]

for fpath in file_paths:
    if os.path.exists(fpath):
        os.remove(fpath)
        print(f"The file {fpath} has been deleted.")
    else:
        print(f"The file {fpath} does not exist.")

The directory data/LeopardSeal has been deleted.
The directory data/MinkeWhale has been deleted.
The directory data/WeddellSeal has been deleted.
The directory data/Fin_FinbackWhale has been deleted.
The file data/PantropicalSpottedDolphin/84021003.wav has been deleted.
The file data/Short_Finned(Pacific)PilotWhale/57021003.wav has been deleted.
The file data/SpermWhale/84021003.wav has been deleted.


## **Data Splitting**

Next, the split-folders library was used to split the per-class folders of .wav files into train, validation, and test folders. We chose a ratio of 80% train, 10% validation, and 10% test because the dataset has many classes but only a small number of instances per class, and we wanted to keep enough data for training the model.
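
To make the effect of the ratio concrete, the sketch below estimates how many files of a single class end up in each split. It assumes the train and validation sizes are rounded down and the remainder goes to the test set; the exact rounding used by split-folders may differ, and `split_counts` is a hypothetical helper, not a library function.

```python
def split_counts(n_files, ratio=(0.8, 0.1, 0.1)):
    """Estimate (train, val, test) sizes for one class with n_files recordings.

    Assumption: train and val sizes are floored, test receives the remainder.
    """
    n_train = int(ratio[0] * n_files)
    n_val = int(ratio[1] * n_files)
    n_test = n_files - n_train - n_val
    return n_train, n_val, n_test


# Smallest class kept after cleaning: 20 recordings.
print(split_counts(20))  # (16, 2, 2)
```

Under this assumption, a class at the 20-instance threshold still contributes only 2 files each to the validation and test sets, so per-class validation and test metrics for the smallest classes rest on very few samples.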

In [4]:
!pip install split-folders

Collecting split-folders
  Downloading split_folders-0.5.1-py3-none-any.whl (8.4 kB)
Installing collected packages: split-folders
Successfully installed split-folders-0.5.1


In [5]:
import os

# Create the output directory if it does not already exist
os.makedirs('data_split', exist_ok=True)

In [6]:
import splitfolders

# Split the dataset into train, validation, and test sets
# Parameters:
# - "data": The path to the original dataset directory
# - output="data_split": The path where the split data will be saved
# - seed=1337: Seed for random number generator to ensure reproducibility
# - ratio=(.8, .1, .1): The split ratio for train, validation, and test sets
# - group_prefix=None: Option to keep files with the same prefix together, set to None as it's not needed here
# - move=False: Copy files instead of moving them
splitfolders.ratio("data", output="data_split", seed=1337, ratio=(.8, .1, .1), group_prefix=None, move=False)

Copying files: 1615 files [07:31,  3.58 files/s]
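
As a sanity check after the copy, the number of files in each split folder can be counted. This is a minimal sketch that assumes `data_split` contains one subfolder per split, each with one subfolder per class; `count_files_per_split` is a hypothetical helper that counts whatever top-level subfolders it finds.

```python
import os


def count_files_per_split(split_dir):
    """Return {split_name: total number of files under that split folder}."""
    totals = {}
    for split in sorted(os.listdir(split_dir)):
        split_path = os.path.join(split_dir, split)
        if not os.path.isdir(split_path):
            continue
        totals[split] = sum(len(files) for _, _, files in os.walk(split_path))
    return totals


# Only runs when the split output folder is present (assumed to be ./data_split).
if os.path.isdir("data_split"):
    totals = count_files_per_split("data_split")
    print(totals, "total:", sum(totals.values()))
```

The total across all splits should match the 1615 files reported in the copy output above.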
