### This notebook follows the same steps as `spectrogram_generator`, but for several datasets at once

In [None]:
import sys
import os
import pickle
import numpy as np
import glob
from tqdm import tqdm
from pathlib import Path
import pandas as pd
import subprocess
from OSmOSE import Spectrogram, Job_builder
from OSmOSE.utils import *

path_osmose_dataset = "/home/datawork-osmose/dataset/"
path_osmose_home = "/home/datawork-osmose/"

jb = Job_builder()

display_folder_storage_infos(path_osmose_home)

## List datasets

Below you can list all the built datasets under a specific path. This path is built from the following two arguments:

`path_osmose`: location of all your datasets and campaigns

`campaign_ID`: leave blank to list the datasets located directly under `path_osmose`; otherwise, only the datasets under `path_osmose`/`campaign_ID` are listed
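
To make the path composition concrete, here is a minimal sketch of how such a listing could work with `pathlib`. The helper `list_subfolders` and the folder names are hypothetical and only illustrate the idea; this is not the implementation of `list_dataset`.

```python
from pathlib import Path
import tempfile


def list_subfolders(path_osmose: str, campaign_id: str = "") -> list:
    """Return the sorted names of the folders directly under path_osmose/campaign_id."""
    # An empty campaign_id is ignored by Path, so the root itself is listed
    root = Path(path_osmose, campaign_id)
    return sorted(p.name for p in root.iterdir() if p.is_dir())


# Demonstration on a throwaway directory tree (hypothetical names)
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "CAMPAIGN_A", "dataset_1").mkdir(parents=True)
    Path(tmp, "CAMPAIGN_A", "dataset_2").mkdir()
    Path(tmp, "dataset_alone").mkdir()

    under_campaign = list_subfolders(tmp, "CAMPAIGN_A")
    at_root = list_subfolders(tmp)

print(under_campaign)
print(at_root)
```

With a blank campaign, the campaign folder itself shows up next to the standalone datasets, which is why the campaign name is needed to reach the datasets it contains.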

In [None]:
list_dataset(path_osmose_dataset, "APOCADO")

### Summary

**I. Select dataset** : choose your dataset to be processed and get key metadata on it

**II. Configure spectrograms** : define all spectrogram parameters, and adjust them based on spectrograms computed on the fly

**III. Generate spectrograms** : launch the complete generation of spectrograms

# I. Select dataset 

If your datasets are part of a recording campaign, provide the campaign names in the list `list_campaign_name` (one entry per dataset); each dataset should then be located in `{path_osmose_dataset}/{campaign_name}/{dataset_name}`. Otherwise, set the corresponding entry to `""`.
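
The path layout described above can be sketched as follows. The second dataset name is hypothetical and only shows what an empty campaign entry does; the length check is a suggested safeguard, not something the notebook performs.

```python
from pathlib import Path

path_osmose_dataset = "/home/datawork-osmose/dataset/"
dataset_names = ["APOCADO_C8D1_ST7181", "standalone_dataset"]  # second name is hypothetical
campaign_names = ["APOCADO", ""]  # "" means: not part of a campaign

# zip() silently stops at the shortest list, so check the lengths first
if len(dataset_names) != len(campaign_names):
    raise ValueError("one campaign name (possibly empty) is needed per dataset")

dataset_paths = [
    Path(path_osmose_dataset, campaign, name)
    for campaign, name in zip(campaign_names, dataset_names)
]

for path in dataset_paths:
    print(path)
```

An empty campaign name simply drops out of the path, so the dataset is looked up directly under `path_osmose_dataset`.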

In [None]:
list_dataset_name = [
    "APOCADO_C1D3_ST335556632",
    "APOCADO_C8D1_ST7181",
    "APOCADO_C8D1_ST7191",
    "APOCADO_C8D2_ST7181",
    "APOCADO_C8D2_ST7191",
    "APOCADO_C8D3_ST7181",
    "APOCADO_C8D3_ST7191",
    "APOCADO_C8D4_ST7181",
    "APOCADO_C8D4_ST7191",
    "APOCADO_C8D5_ST7181",
    "APOCADO_C8D5_ST7191",
    "APOCADO_C8D6_ST7181",
    "APOCADO_C8D6_ST7191",
    "APOCADO_C8D7_ST7181",
    "APOCADO_C8D7_ST7191",
    "APOCADO_C8D8_ST7181",
    "APOCADO_C8D8_ST7191",
    "APOCADO_C8D9_ST7181",
    "APOCADO_C8D9_ST7191",
    "APOCADO_C8D10_ST7181",
    "APOCADO_C8D10_ST7191",
    "APOCADO_C8D11_ST7181",
    "APOCADO_C8D11_ST7191",
    "APOCADO_C8D12_ST7181",
    "APOCADO_C8D12_ST7191",
    "APOCADO_C8D13_ST7189",
    "APOCADO_C8D13_ST7190",
    "APOCADO_C8D14_ST7189",
    "APOCADO_C8D14_ST7190",
    "APOCADO_C8D15_ST7189",
    "APOCADO_C8D15_ST7190",
    "APOCADO_C9D1_ST7181",
    "APOCADO_C9D2_ST7181",
    "APOCADO_C9D3_ST7181",
    "APOCADO_C9D4_ST7181",
    "APOCADO_C9D5_ST7191",
    "APOCADO_C9D6_ST7191",
    "APOCADO_C9D7_ST7191",
    "APOCADO_C9D8_ST7191",
    "APOCADO_C10D1_ST7179",
    "APOCADO_C10D2_ST7193",
    "APOCADO_C10D3_ST7179",
    "APOCADO_C10D4_ST7193",
    "APOCADO_C10D5_ST7179",
    "APOCADO_C10D6_ST7193",
    "APOCADO_C10D7_ST7179",
    "APOCADO_C10D9_ST7193",
]

list_campaign_name = ["APOCADO"] * len(list_dataset_name)

## Metadata of one dataset

Here you can print the key metadata of a single dataset, selected by its index `i` in `list_dataset_name`.

In [None]:
i = -1  # index of the dataset to inspect (-1 selects the last one in the list)

dataset_name = list_dataset_name[i]
campaign_name = list_campaign_name[i]

dataset = Spectrogram(
    dataset_path=Path(path_osmose_dataset, campaign_name, dataset_name),
    owner_group="gosmose",
    local=False,
)

print(dataset)

# II. Configure spectrograms

Set your spectrogram parameters; they will be the same for all your datasets.

The two parameters `spectro_duration` (in s) and `dataset_sr` (in Hz) let you process your data with a different file duration (i.e. segmentation) and/or sampling rate (i.e. resampling). `spectro_duration` is the maximal duration of the spectrogram display window.
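
As a rough illustration of what segmentation implies, the sketch below estimates how many segments a set of original files produces. It uses the same duration-over-step ratio as the generation cell at the end of this notebook; the function name and the file durations are hypothetical.

```python
def expected_segment_count(file_durations, spectro_duration, audio_file_overlap=0):
    """Estimate the number of segments obtained from files of the given durations (in s)."""
    step = spectro_duration - audio_file_overlap  # seconds between two segment starts
    if step <= 0:
        raise ValueError("audio_file_overlap must be smaller than spectro_duration")
    return round(sum(duration / step for duration in file_durations))


# Three hypothetical original files of 60 s, 60 s and 40 s, cut into 10 s segments
n_segments = expected_segment_count([60, 60, 40], spectro_duration=10)
print(n_segments)
```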

To process audio files from your original folder (i.e. without any segmentation and/or resampling), use the original audio file duration and sample rate estimated when your dataset was uploaded (they are printed by the previous cell).

Then, you can set `zoom_levels`, the number of zoom levels you want (they are used in our web-based annotation tool APLOSE). With `zoom_levels = 0`, your shortest spectrogram display window has a duration of `spectro_duration` seconds (that is, no zoom at all); with `zoom_levels = 1`, a duration of `spectro_duration`/2 seconds; with `zoom_levels = 2`, a duration of `spectro_duration`/4 seconds, and so on.
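
The halving rule above can be written as a one-line computation. This is only a sketch of the arithmetic; the function name is hypothetical.

```python
def zoom_window_durations(spectro_duration, zoom_levels):
    """Return the display window duration (in s) for each zoom level, from 0 to zoom_levels."""
    return [spectro_duration / 2**level for level in range(zoom_levels + 1)]


durations = zoom_window_durations(spectro_duration=10, zoom_levels=2)
print(durations)
```

The last entry is the shortest display window, which is the one whose resolution is set by the spectrogram parameters.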

After that, you can set the classical spectrogram parameters: `nfft` (in samples), `window_size` (in samples) and `overlap` (in %). **Note that these parameters set the resolution of the spectrogram display window with the smallest duration, i.e. the one obtained with the highest zoom level.**
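
To get a feeling for what these three parameters mean, the sketch below derives the frequency and time steps of a spectrogram. It assumes the usual short-time Fourier transform conventions (overlap rounded down to a whole number of samples), which may differ slightly from what OSmOSE does internally; the function name is hypothetical.

```python
def spectrogram_resolution(sample_rate, nfft, window_size, overlap_percent):
    """Return the frequency step (Hz) and time step (s) of a spectrogram."""
    overlap_samples = int(window_size * overlap_percent / 100)  # assumption: rounded down
    hop = window_size - overlap_samples  # samples between two consecutive windows
    return {
        "frequency_step_hz": sample_rate / nfft,
        "time_step_s": hop / sample_rate,
        "hop_samples": hop,
    }


# Values used in the configuration cell of this notebook
res = spectrogram_resolution(sample_rate=128000, nfft=1024, window_size=1024, overlap_percent=20)
print(res)
```

A larger `nfft` refines the frequency step, while a larger `overlap` refines the time step at the cost of more columns to compute.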

Finally:
- `batch_number` indicates the number of concurrent jobs. A higher number can speed things up, but only up to a point. Note that this feature does not work very well yet.

- `save_matrix` should be set to `True` if you want to generate the numpy matrices alongside your png spectrograms.
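
The way files are split between jobs can be sketched as below. It reproduces the index arithmetic of the generation cell at the end of this notebook; the function name and the number of files are hypothetical.

```python
def batch_bounds(n_files, batch_number):
    """Return the (i_min, i_max) file index range handled by each batch."""
    batch_size = n_files // batch_number
    bounds = []
    for batch in range(batch_number):
        i_min = batch * batch_size
        # The last batch takes every remaining file
        i_max = i_min + batch_size if batch < batch_number - 1 else n_files
        bounds.append((i_min, i_max))
    return bounds


bounds = batch_bounds(n_files=100, batch_number=30)
print(bounds[0], bounds[-1])
```

Because the last batch absorbs the remainder, it can be noticeably larger than the others (13 files versus 3 here). With fewer files than batches, every batch except the last one is empty, so `batch_number` is best kept well below the number of files.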

### /!\ These parameters will be applied to all the selected datasets /!\

In [None]:
spectro_duration = 10
dataset_sr = 128000  # Hz

zoom_levels = 0

nfft = 1024  # samples
window_size = 1024  # samples
overlap = 20  # %

batch_number = 30
save_matrix = False
force_init = True  # set this parameter to True to remove existing directories of spectrograms and associated audio files

#### Amplitude normalization 

Finally, we also offer several modes of data/spectrogram normalization.

Normalization over raw data samples with the variable `data_normalization` (default value `'none'`, i.e. no normalization; set below through `data_normalization_param`):
- instrument-based normalization (`'instrument'`) with the three parameters `sensitivity` (in dB, default value = 0), `gain_dB` (in dB, default value = 0) and `peak_voltage` (in V, default value = 1). With the default values, no normalization is performed;

- z-score normalization (`'zscore'`) over a given time period set by the variable `zscore_duration`, applied directly to your raw time series. The possible values are:
    - `zscore_duration = 'original'`: the audio file duration is used as the time period;
    - `zscore_duration = '10H'`: any time period given as a string using a classical [time alias](https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#timeseries-offset-aliases). This period should be longer than your file duration.
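
The two modes can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the OSmOSE implementation: the sensitivity is taken to be in dB re 1 V/µPa, the samples are assumed to be normalized to [-1, 1], and the z-score uses the population standard deviation. All function names are hypothetical.

```python
import statistics


def instrument_normalize(samples, sensitivity_db=0.0, gain_db=0.0, peak_voltage=1.0):
    """Convert normalized samples to pressure units (assumed convention, see lead-in)."""
    volts_per_unit = 10 ** (sensitivity_db / 20)  # hydrophone sensitivity, linear scale
    gain_linear = 10 ** (gain_db / 20)  # amplifier gain, linear scale
    return [s * peak_voltage / gain_linear / volts_per_unit for s in samples]


def zscore_normalize(samples):
    """Center the samples and scale them to unit standard deviation."""
    mean = statistics.fmean(samples)
    std = statistics.pstdev(samples)
    return [(s - mean) / std for s in samples]


raw = [0.1, -0.2, 0.05, 0.3]
unchanged = instrument_normalize(raw)  # default values: nothing happens
calibrated = instrument_normalize(raw, sensitivity_db=-174.9, peak_voltage=2)
standardized = zscore_normalize(raw)
print(unchanged, calibrated[0], standardized)
```

With the default values the samples come back unchanged, which matches the statement above that no normalization is performed in that case.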

Normalization over spectra with the variable `spectro_normalization` (default value `'density'`, see OSmOSEanalytics/documentation/theory_spectrogram.pdf for details):
- density-based normalization by setting `spectro_normalization = 'density'`
- spectrum-based normalization by setting `spectro_normalization = 'spectrum'`
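
To show how the two options relate, the sketch below computes the scaling factors commonly used for each of them (the conventions followed, for instance, by `scipy.signal`). Whether OSmOSE applies exactly these factors is an assumption; the referenced PDF is the authoritative description. The function names are hypothetical.

```python
import math


def hann_window(n):
    """Periodic Hann window of n samples."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * k / n) for k in range(n)]


def psd_scale_factors(window, sample_rate):
    """Return the factors applied to |FFT|^2 for 'density' and 'spectrum' scaling."""
    density = 1.0 / (sample_rate * sum(w * w for w in window))  # result in units^2 / Hz
    spectrum = 1.0 / sum(window) ** 2  # result in units^2
    return density, spectrum


window = hann_window(1024)
density_scale, spectrum_scale = psd_scale_factors(window, sample_rate=128000)

# The two scalings differ by a constant: the equivalent noise bandwidth of the window
enbw_hz = spectrum_scale / density_scale
offset_db = 10 * math.log10(enbw_hz)
print(enbw_hz, offset_db)
```

Under these conventions, switching between the two options shifts every level by the same number of dB, so the choice matters for absolute levels but not for the shape of the spectrogram.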

In the cell below, you can also set the amplitude dynamics in dB through the parameters `dynamic_max` and `dynamic_min`, choose the colormap `colormap` to be used (see possible options in the [documentation](https://matplotlib.org/stable/tutorials/colors/colormaps.html)) and specify the cutoff frequency `hp_filter_min_freq` of a high-pass filter if needed.
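
The effect of the amplitude dynamics can be pictured as a clipping of the levels before they are mapped onto the colormap. The sketch below shows that mapping in plain Python, assuming a linear scale between the two bounds (as plotting libraries typically do); the function name and the levels are hypothetical.

```python
def to_color_scale(levels_db, dynamic_min, dynamic_max):
    """Map levels in dB to [0, 1], clipping everything outside the dynamic range."""
    span = dynamic_max - dynamic_min
    return [min(max((level - dynamic_min) / span, 0.0), 1.0) for level in levels_db]


scaled = to_color_scale([-10, 0, 60, 120, 150], dynamic_min=0, dynamic_max=120)
print(scaled)
```

Levels below `dynamic_min` or above `dynamic_max` all get the same color, so a range that is too narrow hides detail at both ends.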

In [None]:
list_sensitivity = [
    -176.4,
    -175.9,
    -174.9,
    -175.9,
    -174.9,
    -175.9,
    -174.9,
    -175.9,
    -174.9,
    -175.9,
    -174.9,
    -175.9,
    -174.9,
    -175.9,
    -174.9,
    -175.9,
    -174.9,
    -175.9,
    -174.9,
    -175.9,
    -174.9,
    -175.9,
    -174.9,
    -175.9,
    -174.9,
    -174.5,
    -174.7,
    -174.5,
    -174.7,
    -174.5,
    -174.7,
    -175.9,
    -175.9,
    -175.9,
    -175.9,
    -174.9,
    -174.9,
    -174.9,
    -174.9,
    -174.9,
    -174.6,
    -174.9,
    -174.6,
    -174.9,
    -174.6,
    -174.9,
    -174.6,
]

list_gain_dB = [0] * len(list_sensitivity)  # parameter for 'instrument' mode
list_peak_voltage = [2] * len(list_sensitivity)  # parameter for 'instrument' mode

In [None]:
data_normalization_param = "instrument"  # 'instrument' OR 'zscore' OR 'none'
spectro_normalization_param = "density"  # 'density' OR 'spectrum'
zscore_duration = ""  # parameter for 'zscore' mode, values = time alias OR 'original'
dynamic_min = 0  # dB
dynamic_max = 120  # dB
colormap = "viridis"
hp_filter_min_freq = 1  # Hz

In [None]:
# JUST RUN THIS CELL: NOTHING TO FILL IN!

for campaign_name, dataset_name, sensitivity, gain_dB, peak_voltage in zip(
    list_campaign_name,
    list_dataset_name,
    list_sensitivity,
    list_gain_dB,
    list_peak_voltage,
):

    print(f"\n### {dataset_name}")

    dataset = Spectrogram(
        dataset_path=Path(path_osmose_dataset, campaign_name, dataset_name),
        owner_group="gosmose",
        local=False,
    )

    dataset.spectro_duration = spectro_duration
    dataset.dataset_sr = dataset_sr
    dataset.nfft = nfft
    dataset.window_size = window_size
    dataset.overlap = overlap
    dataset.data_normalization = data_normalization_param
    dataset.zscore_duration = zscore_duration
    dataset.sensitivity = sensitivity
    dataset.gain_dB = gain_dB
    dataset.peak_voltage = peak_voltage
    dataset.spectro_normalization = spectro_normalization_param
    dataset.dynamic_max = dynamic_max
    dataset.dynamic_min = dynamic_min
    dataset.colormap = colormap
    dataset.hp_filter_min_freq = hp_filter_min_freq
    dataset.batch_number = batch_number

    ## segmentation
    dataset.initialize(
        env_name=sys.executable.replace("/bin/python", ""),
        force_init=force_init,
        last_file_behavior="discard",
    )

    ## compute expected_nber_segmented_files
    original_folder = dataset._get_original_after_build()
    original_metadata = pd.read_csv(Path(original_folder, "metadata.csv"), header=0)

    if dataset.spectro_duration != original_metadata["audio_file_origin_duration"][0]:
        # files will be segmented: estimate how many segments each original file yields
        origin_file_metadata = pd.read_csv(Path(original_folder, "file_metadata.csv"))
        nber_files_to_process = round(
            sum(
                dd / (dataset.spectro_duration - dataset.audio_file_overlap)
                for dd in origin_file_metadata["duration"].values
            )
        )
    else:
        # no segmentation: process the original files as they are
        nber_files_to_process = original_metadata["audio_file_count"][0]

    batch_size = nber_files_to_process // dataset.batch_number

    dataset.save_spectro_metadata(False)

    for batch in range(dataset.batch_number):
        i_min = batch * batch_size
        i_max = (
            i_min + batch_size
            if batch < dataset.batch_number - 1
            else nber_files_to_process
        )  # If it is the last batch, take all the remaining files

        jobfile = jb.build_job_file(
            script_path=Path(
                os.path.abspath("../src"), "qsub_spectrogram_generator_pkg.py"
            ),
            script_args=f"--dataset-path {dataset.path}\
                    --dataset-sr {dataset.dataset_sr} \
                    --batch-ind-min {i_min}\
                    --batch-ind-max {i_max}\
                    {'--save-matrix' if save_matrix else ''}",
            jobname="OSmOSE_SpectroGenerator",
            preset="low",
            env_name=sys.executable.replace("/bin/python", ""),
            mem="70G",
            walltime="10:00:00",
            logdir=dataset.path.joinpath("log"),
        )

    pending_jobs = [
        jobid
        for jobid in dataset.pending_jobs
        if b"finished"
        not in subprocess.run(["qstat", jobid], capture_output=True).stderr
    ]
    job_id_list = jb.submit_job(dependency=pending_jobs)  # submit all built job files
    nb_jobs = len(jb.finished_jobs) + len(job_id_list)

    print(f"The job ids are {job_id_list}")
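
As a side note on the `script_args` string built in the loop above, the sketch below shows one possible way to assemble the same flags without the padding spaces introduced by the line continuations. The helper `build_script_args` and the example path are hypothetical; the flags are the ones used in this notebook, and it is assumed that `build_job_file` accepts any space-separated string.

```python
import shlex


def build_script_args(dataset_path, dataset_sr, i_min, i_max, save_matrix=False):
    """Assemble the command-line arguments of the spectrogram generation script."""
    args = [
        "--dataset-path", str(dataset_path),
        "--dataset-sr", str(dataset_sr),
        "--batch-ind-min", str(i_min),
        "--batch-ind-max", str(i_max),
    ]
    if save_matrix:
        args.append("--save-matrix")
    # shlex.join quotes any argument containing spaces or special characters
    return shlex.join(args)


script_args = build_script_args("/data/APOCADO/demo", 128000, 0, 3)
script_args_matrix = build_script_args("/data/my dataset", 128000, 0, 3, save_matrix=True)
print(script_args)
print(script_args_matrix)
```

Quoting matters mostly for dataset paths that contain spaces, which would otherwise be split into several arguments.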