# Data Preprocessing
Prerequisites: Run the notebook `1_data_acquisition.ipynb` to download the labeled humpback whale vocalization data from [Orcasound's AWS open data repository](https://open.quiltdata.com/b/acoustic-sandbox/tree/humpbacks/Emily-Vierling-Orcasound-data/Em_HW_data/flac_files/).

This notebook extracts the humpback vocalizations from the raw audio files according to the annotations and saves each one as a separate WAV file. The extracted vocalizations are intended for training the local humpback whale vocalization model, which detects the presence of humpback whale vocalizations in audio files.

In [None]:
#!pip install librosa soundfile

Run the cell below only if you are using Google Colaboratory (comment it out otherwise). It mounts Google Drive, changes to the project's `notebooks` folder, and lets the notebook read from the project's `data` folder.

In [1]:
from google.colab import drive
drive.mount("/content/gdrive")
%cd gdrive/MyDrive/Colab Notebooks/local_humpback_vocalization/local_humpback_vocalization/notebooks

Mounted at /content/gdrive
/content/gdrive/MyDrive/Colab Notebooks/local_humpback_vocalization/local_humpback_vocalization/notebooks


In [23]:
import os
import scipy.signal as sp
import pandas as pd
import librosa
import soundfile as sf

In [15]:
data_download_folder = "../data"
annotations_path = f"{data_download_folder}/raw/annotations"
audio_path = f"{data_download_folder}/raw/audio"
extracted_calls_path = f"{data_download_folder}/extracted_calls"

In [16]:
# Create the output folder if it does not exist yet
os.makedirs(extracted_calls_path, exist_ok=True)

## 1. Extract Humpback Vocalizations

To extract the humpback vocalizations and store them in separate WAV files, we use the annotation files, which contain the labeling results. Each annotation gives the begin and end times of a vocalization, its low and high frequency bounds, and its call type.

### 1.1. Load Annotation File

In [8]:
annotation_filename = "OS_10_03_2021_19_34_00_.Table.1.selections.txt"

df = pd.read_csv(f"{annotations_path}/{annotation_filename}", sep="\t")
df.head()

Unnamed: 0,Selection,Begin Time (s),End Time (s),Low Freq (Hz),High Freq (Hz),Call Type
0,1,1646.999571,1648.733984,628.263,1297.059,Ascending moan
1,2,1653.223452,1654.641001,749.862,1134.926,Moan
2,3,1659.862135,1660.595925,770.129,1033.594,Moan
3,4,1661.796673,1663.747887,283.732,709.329,Ascending moan
4,5,1678.344185,1680.262045,506.664,1013.327,Moan


### 1.2. Extract and Store Vocalizations

Because each annotation specifies a low and a high frequency bound in Hz, we apply a bandpass filter to each extracted audio segment to retain only the frequencies within that range. The filter is designed with `butter` and applied with `filtfilt`, both from `scipy.signal`. `filtfilt` runs the filter forward and backward, so the filtered segment has no phase shift relative to the original.

In [21]:
# Function to design a Butterworth bandpass filter
def butter_bandpass(lowcut, highcut, fs, order=5):
    nyquist = 0.5 * fs
    low = lowcut / nyquist
    high = highcut / nyquist
    b, a = sp.butter(order, [low, high], btype='band')
    return b, a

# Function to apply the bandpass filter to a signal
def butter_bandpass_filter(data, lowcut, highcut, fs, order=5):
    b, a = butter_bandpass(lowcut, highcut, fs, order=order)
    y = sp.filtfilt(b, a, data)
    return y
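
For narrow bands that are low relative to the sample rate, the `(b, a)` form of a high-order Butterworth filter can run into numerical problems, and the SciPy documentation recommends second-order sections in that case. The following is a minimal sketch of that alternative, not the method used in this notebook. The helper name `butter_bandpass_filter_sos` and the synthetic test tones are assumptions made for illustration; the band edges are taken from one of the annotation rows shown earlier.

```python
import numpy as np
import scipy.signal as sp


def butter_bandpass_filter_sos(data, lowcut, highcut, fs, order=5):
    """Zero-phase Butterworth bandpass built from second-order sections.

    Hypothetical helper, not part of the original notebook.
    """
    # Passing fs lets SciPy normalize the cutoffs by the Nyquist frequency itself.
    sos = sp.butter(order, [lowcut, highcut], btype="band", fs=fs, output="sos")
    return sp.sosfiltfilt(sos, data)


# Synthetic check: a 500 Hz tone (inside the band) plus a 5000 Hz tone (outside).
fs = 44100
t = np.arange(fs) / fs  # one second of samples
in_band = np.sin(2 * np.pi * 500 * t)
out_of_band = np.sin(2 * np.pi * 5000 * t)

filtered = butter_bandpass_filter_sos(in_band + out_of_band, 283.732, 709.329, fs, order=6)

# Compare away from the edges, where filter start-up transients are largest.
middle = slice(fs // 4, 3 * fs // 4)
residual_rms = np.sqrt(np.mean((filtered[middle] - in_band[middle]) ** 2))
print(f"RMS difference from the in-band tone: {residual_rms:.6f}")
```

A small residual indicates that the out-of-band tone was removed while the in-band tone passed through essentially unchanged.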

In [24]:
# The audio file shares its base name with the annotation file
audio_file_name = annotation_filename.split(".")[0]

# x is the audio signal, sr is the sample rate (the audio is resampled to 44.1 kHz on load if needed)
x, sr = librosa.load(f"{audio_path}/{audio_file_name}.flac", sr=44100)

for _, row in df.iterrows():
    selection = row["Selection"]
    start_time = row["Begin Time (s)"]
    end_time = row["End Time (s)"]
    lowcut = row["Low Freq (Hz)"]
    highcut = row["High Freq (Hz)"]
    call_type = row["Call Type"].replace(" ", "_")

    # Convert begin and end times to sample indices
    start_sample = librosa.time_to_samples(start_time, sr=sr)
    end_sample = librosa.time_to_samples(end_time, sr=sr)

    # Extract the annotated segment
    extracted_sample = x[start_sample:end_sample]

    # Apply the bandpass filter using the annotated frequency bounds
    filtered_sample = butter_bandpass_filter(extracted_sample, lowcut, highcut, sr, order=6)

    # Save the filtered segment to a new file, grouped by call type
    call_type_dir = f"{extracted_calls_path}/{call_type}"
    os.makedirs(call_type_dir, exist_ok=True)
    sf.write(f"{call_type_dir}/{audio_file_name}_{selection}.wav", filtered_sample, sr)
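
The time-to-sample conversion above amounts to multiplying each time by the sample rate and truncating to an integer, which is our understanding of how `librosa.time_to_samples` behaves for positive times. The sketch below reproduces that arithmetic in plain Python for the first annotation row. The helper `segment_bounds` is hypothetical, and unlike the notebook it also clips the indices to the length of the recording; the 30-minute recording length is an assumed value used only for illustration.

```python
def segment_bounds(start_time, end_time, sr, n_samples):
    """Convert a (start, end) time pair in seconds to clipped sample indices.

    Hypothetical helper, not part of the original notebook.
    """
    start = max(0, int(start_time * sr))
    end = min(n_samples, int(end_time * sr))
    if end <= start:
        raise ValueError("Segment is empty or lies outside the recording")
    return start, end


sr = 44100
n_samples = 30 * 60 * sr  # assumed recording length, for illustration only

# Begin and end times of selection 1 from the annotation table
start, end = segment_bounds(1646.999571, 1648.733984, sr, n_samples)
duration_s = (end - start) / sr
print(start, end, round(duration_s, 4))
```

Clipping guards against annotations that extend past the end of the audio, which would otherwise produce a shorter segment than expected without any warning.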

Repeat steps 1.1 and 1.2 for each annotation file.
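
One way to automate this repetition is to first pair every annotation file with its audio file and then run the two steps on each pair. The following is a minimal sketch under two assumptions: every annotation file ends in `.txt`, and its audio file has the same base name (the part before the first dot) with a `.flac` extension, as in the example above. The helper name `pair_annotations_with_audio` is hypothetical.

```python
from pathlib import Path


def pair_annotations_with_audio(annotations_dir, audio_dir, audio_ext=".flac"):
    """Return (annotation_path, audio_path) pairs for annotations whose audio file exists.

    Hypothetical helper, not part of the original notebook.
    """
    pairs = []
    for annotation in sorted(Path(annotations_dir).glob("*.txt")):
        # Same naming rule as in section 1.2: keep the part before the first dot
        stem = annotation.name.split(".")[0]
        audio = Path(audio_dir) / f"{stem}{audio_ext}"
        if audio.exists():
            pairs.append((annotation, audio))
    return pairs
```

Calling `pair_annotations_with_audio(annotations_path, audio_path)` would return the list of pairs to loop over. Annotation files without a matching audio file are skipped silently in this sketch, so it may be worth logging them in practice.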