# PREAMBLE
This cell imports the libraries needed to run the notebook.

In [None]:
# FILL IN RED PARTS !
import sys
from pathlib import Path

from OSmOSE import Spectrogram, Job_builder
from OSmOSE.utils.core_utils import display_folder_storage_info, list_dataset

sys.path.append(r"../src")
from utils_datarmor import (
    adjust_spectro,
    generate_spectro,
    display_progress,
    monitor_job,
    read_job,
)

path_osmose_home = Path(r"/home/datawork-osmose/")
path_osmose_dataset = path_osmose_home / "dataset"

jb = Job_builder()

display_folder_storage_info(dir_path=path_osmose_home)

`list_dataset` takes two arguments: `path_osmose`, the path to the datasets, and `project`, an optional project path used when several datasets are grouped into a single folder. Leave `project = ""` if the dataset is located directly in `path_osmose`.

In [None]:
# FILL IN RED PARTS !
list_dataset(path_osmose=path_osmose_dataset, project="")

## Summary

**I. Select dataset** : choose your dataset to be processed and get key metadata on it

**II. Configure spectrograms** : define all spectrogram parameters, and adjust them based on spectrograms computed on the fly

**III. Generate spectrograms** : launch the complete generation of spectrograms

# I. Select dataset 

If your dataset is part of a project, please provide its name with `project_name`; in that case your dataset should be present in `/home/datawork-osmose/dataset/{project_name}/{dataset_name}`. Otherwise keep the default value `project_name = ""`.

In [None]:
# FILL IN RED PARTS !
project_name = ""
dataset_name = "APOCADO_test_reshaper"

dataset = Spectrogram(
    dataset_path=Path(path_osmose_dataset, project_name, dataset_name),
    owner_group="gosmose",
    local=False,
)

print(dataset)
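
As a minimal sketch of how the dataset path is resolved from `project_name` and `dataset_name`, the snippet below uses only `pathlib`. The helper `resolve_dataset_path` and the project name `my_project` are hypothetical and only serve the illustration; `PurePosixPath` is used so the result is the same on any machine.

```python
from pathlib import PurePosixPath

# Same root as in the preamble cell.
path_osmose_dataset = PurePosixPath("/home/datawork-osmose/dataset")


def resolve_dataset_path(root, project_name, dataset_name):
    """Join the root, the optional project folder and the dataset folder."""
    # pathlib ignores empty segments, so project_name = "" adds no folder level.
    return PurePosixPath(root, project_name, dataset_name)


# Dataset located directly in the dataset root.
print(resolve_dataset_path(path_osmose_dataset, "", "APOCADO_test_reshaper"))
# Dataset grouped under a (hypothetical) project folder.
print(resolve_dataset_path(path_osmose_dataset, "my_project", "APOCADO_test_reshaper"))
```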

# II. Configure spectrograms

The two following parameters, `spectro_duration` (in s) and `dataset_sr` (in Hz), allow you to process your data using a different file duration (i.e. segmentation) and/or sampling rate (i.e. resampling). `spectro_duration` is the maximal duration of the spectrogram display window.

To process audio files from your original folder (i.e. without any segmentation and/or resampling), use the original audio file duration and sample rate estimated when your dataset was uploaded (they are printed by the previous cell).

In [None]:
# FILL IN GREEN PARTS !
dataset.campaign = project_name
dataset.project = project_name
dataset.spectro_duration = 600  # seconds
dataset.dataset_sr = 96_000  # Hz

Then, you can set the value of `zoom_level`, an integer corresponding to the number of zoom levels you want (they are used in our web-based annotation tool APLOSE).
With `zoom_level = 0`, your shortest spectrogram display window has a duration of `spectro_duration` seconds (that is, no zoom at all); with `zoom_level = 1`, a duration of `spectro_duration`/2 seconds; with `zoom_level = 2`, a duration of `spectro_duration`/4 seconds. Each additional level halves the shortest window duration.

In [None]:
# FILL IN GREEN PARTS !
dataset.zoom_level = 4
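
The halving rule described above can be checked with a short sketch. The helper name `zoom_window_durations` is hypothetical; only the arithmetic (`spectro_duration` divided by 2 at each level) comes from the text.

```python
def zoom_window_durations(spectro_duration, zoom_level):
    """Return the display window duration (in seconds) at each zoom level."""
    # Level 0 is the full window; each further level halves the duration.
    return [spectro_duration / 2**level for level in range(zoom_level + 1)]


durations = zoom_window_durations(spectro_duration=600, zoom_level=4)
print(durations)  # [600.0, 300.0, 150.0, 75.0, 37.5]
```

With the values used in this notebook, the shortest display window is therefore 37.5 s long.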

After that, you can set the following classical spectrogram parameters: `nfft` (in samples), `window_size` (in samples) and `overlap` (in %). **Note that these parameters set the resolution of the spectrogram display window with the smallest duration, obtained with the highest zoom level.**

In [None]:
# FILL IN GREEN PARTS !
dataset.nfft = 1_024
dataset.window_size = 4_096
dataset.overlap = 50
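
To get a feel for what these values imply, here is a minimal sketch of the usual short-time Fourier transform arithmetic. It assumes the standard definitions (frequency step = sampling rate / `nfft`, hop = window size reduced by the overlap) and frames without padding; the exact computation done inside OSmOSE may differ, and the helper `spectrogram_resolution` is hypothetical.

```python
def spectrogram_resolution(dataset_sr, nfft, window_size, overlap, window_duration):
    """Estimate the frequency step, time step and frame count of a spectrogram."""
    # Number of samples between the starts of two consecutive analysis windows.
    hop_samples = int(window_size * (1 - overlap / 100))
    n_samples = int(window_duration * dataset_sr)
    # Assumption: only complete windows are kept (no padding at the edges).
    n_frames = 1 + (n_samples - window_size) // hop_samples
    return {
        "frequency_step_hz": dataset_sr / nfft,
        "hop_samples": hop_samples,
        "time_step_s": hop_samples / dataset_sr,
        "n_frames": n_frames,
    }


# Shortest display window for spectro_duration = 600 s and zoom_level = 4.
shortest_window = 600 / 2**4
resolution = spectrogram_resolution(96_000, 1_024, 4_096, 50, shortest_window)
for name, value in resolution.items():
    print(f"{name}: {value}")
```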

In case of audio segmentation, you can use the variable `audio_file_overlap` (in seconds, default value = 0) to set an overlap between two consecutive segments.

In [None]:
# FILL IN GREEN PARTS !
dataset.audio_file_overlap = 0
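
The effect of `audio_file_overlap` on segmentation can be sketched as follows. This is an illustration under assumptions, not the OSmOSE implementation: the helper `segment_bounds` is hypothetical, and an incomplete last segment is simply dropped here, whereas the real reshaper may handle it differently.

```python
def segment_bounds(total_duration, spectro_duration, audio_file_overlap=0):
    """Return (start, end) bounds in seconds of consecutive segments."""
    if not 0 <= audio_file_overlap < spectro_duration:
        raise ValueError("audio_file_overlap must be in [0, spectro_duration)")
    # Consecutive segments start `step` seconds apart.
    step = spectro_duration - audio_file_overlap
    bounds = []
    start = 0
    # Assumption: keep only segments that fit entirely in the recording.
    while start + spectro_duration <= total_duration:
        bounds.append((start, start + spectro_duration))
        start += step
    return bounds


# 30 minutes of audio cut into 600 s segments, without and with overlap.
print(segment_bounds(1800, 600, 0))
print(segment_bounds(1800, 600, 300))
```

Without overlap the 30 minutes yield 3 segments; with a 300 s overlap the same audio yields 5 segments, each sharing half of its content with the next one.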

If you do not want to concatenate your audio files, set the variable `dataset.concat` to `False` (default value = `True`). Otherwise, data will be concatenated and then segmented according to the `dataset.spectro_duration` parameter.

In [None]:
dataset.concat = True

### Amplitude normalization 

Finally, we also offer different modes of data/spectrogram normalization.

Normalization over raw data samples with the variable `data_normalization` (default value `'none'`, i.e. no normalization) :
- instrument-based normalization with the three parameters `sensitivity` (in dB, default value = 0), `gain_dB` (in dB, default value = 0) and `peak_voltage` (in V, default value = 1). With the default values, no normalization is performed;

- z-score normalization over a given time period through the variable `zscore_duration`, applied directly on your raw timeseries. The possible values are:
    - `zscore_duration = 'original'` : the audio file duration will be used as time period ;
    - `zscore_duration = '10H'` : any time period given as a string using a classical [time alias](https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#timeseries-offset-aliases). This period should be longer than your file duration.

Normalization over spectra with the variable `spectro_normalization` (default value `'density'`, see OSmOSEanalytics/documentation/theory_spectrogram.pdf for details) :
- density-based normalization by setting `spectro_normalization = 'density'`
- spectrum-based normalization by setting `spectro_normalization = 'spectrum'` 
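
To illustrate how the two spectral normalizations relate, the sketch below uses the common convention (the one followed by `scipy.signal`, for instance): the density scaling divides by `fs * sum(w**2)` and the spectrum scaling by `sum(w)**2`, so the two differ by the equivalent noise bandwidth of the window. The Hann window and the helper names are assumptions made for the example; refer to theory_spectrogram.pdf for the definitions actually used by OSmOSE.

```python
import math


def hann_window(n):
    """Periodic Hann window of length n."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * k / n) for k in range(n)]


def equivalent_noise_bandwidth(window, sample_rate):
    """Ratio (in Hz) between 'spectrum' and 'density' scaled power values."""
    # density scale  = 1 / (fs * sum(w^2))
    # spectrum scale = 1 / sum(w)^2
    sum_w = sum(window)
    sum_w2 = sum(w * w for w in window)
    return sample_rate * sum_w2 / sum_w**2


enbw = equivalent_noise_bandwidth(hann_window(4_096), 96_000)
offset_db = 10 * math.log10(enbw)
print(f"ENBW: {enbw:.3f} Hz, spectrum - density offset: {offset_db:.2f} dB")
```

Under these assumptions, a broadband level read on a `'spectrum'` spectrogram is higher than on a `'density'` one by this constant offset in dB, which matters when choosing `dynamic_min` and `dynamic_max`.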

In the cell below, you can also set the amplitude dynamics in dB through the parameters `dynamic_max` and `dynamic_min`, choose the colormap `spectro_colormap` (see possible options in the [documentation](https://matplotlib.org/stable/tutorials/colors/colormaps.html)) and specify the cutoff frequency `hp_filter_min_freq` of a high-pass filter if needed.

In [None]:
# FILL IN GREEN and RED PARTS !
dataset.data_normalization = "zscore"  # 'instrument' OR 'zscore' OR 'none'
dataset.zscore_duration = (
    "original"  # parameter for 'zscore' mode, values = time alias OR 'original'
)

dataset.sensitivity = -169.7  # parameter for 'instrument' mode
dataset.gain_dB = 0  # parameter for 'instrument' mode
dataset.peak_voltage = 2.5  # parameter for 'instrument' mode

dataset.spectro_normalization = "spectrum"  # 'density' OR 'spectrum'
dataset.spectro_colormap = "viridis"
dataset.dynamic_max = 40  # dB
dataset.dynamic_min = -40  # dB
dataset.hp_filter_min_freq = 1  # Hz
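
For reference, the `'zscore'` mode described above amounts to centering and scaling the raw time series. The sketch below shows the textbook operation on a short list of samples; the helper `zscore` is hypothetical, and the use of the population standard deviation is an assumption, as the source does not say which estimator OSmOSE uses.

```python
from statistics import fmean, pstdev


def zscore(samples):
    """Center the samples on zero and scale them to unit standard deviation."""
    mean = fmean(samples)
    # Assumption: population standard deviation (divides by N, not N - 1).
    std = pstdev(samples)
    if std == 0:
        raise ValueError("constant signal: the z-score is undefined")
    return [(x - mean) / std for x in samples]


normalized = zscore([1.0, 2.0, 3.0, 4.0])
print(normalized)
```

With `zscore_duration = 'original'`, the mean and standard deviation would be estimated over one audio file; with a time alias such as `'10H'`, over that longer period.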

You can now check the size of the spectrograms resulting from these parameters.

In [None]:
# JUST RUN THIS CELL : NOTHING TO FILL IN!
dataset.check_spectro_size()

### Adjust spectrogram parameters

In the cell below, you can visualize some spectrograms computed on the fly.

- `number_adjustment_spectrogram` is the number of spectrogram examples used to adjust your parameters.

- You can use the variable `file_list` in the cell below to adjust your spectrogram parameters on specific files. Be careful: these files must be present in a `temp_adjustment_output_dir` folder computed with a random selection. List their names as follows, e.g. `file_list = ['file1.wav','file2.wav']`; otherwise set `file_list` to an empty list `[]`.

In [None]:
# FILL IN GREEN and RED PARTS !
number_adjustment_spectrogram = 1
file_list = []

In [None]:
# JUST RUN THIS CELL : NOTHING TO FILL IN!
adjust_spectro(
    dataset=dataset,
    number_adjustment_spectrogram=number_adjustment_spectrogram,
    file_list=file_list,
)

# III. Generate spectrograms

- `dataset.batch_number` indicates the number of concurrent jobs. A higher number can speed things up, but only up to a point. This feature still does not work very well.

- **If you create your spectrograms for an APLOSE campaign, set** `write_datasets_csv_for_aplose = True` **below!**

- Set the variable `save_matrix` below to `True` if you want to generate the numpy matrices along with your png spectrograms.

In [None]:
# FILL IN GREEN PARTS !
dataset.batch_number = 10
write_datasets_csv_for_aplose = False
save_matrix = False
save_welch = False

You can set `datetime_begin` and `datetime_end` so that the reshaped audio files and spectrograms begin and end at the specified datetimes.
**Note that if you want to keep the original begin and end datetimes, set these variables to `None`.**

In [None]:
datetime_begin = "2024-02-22T03:50:00+0000"
datetime_end = "2024-03-01T10:50:00+0000"
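
Before launching the jobs, it can be useful to check that the two timestamps are well formed and in the right order. The sketch below is a suggestion only: the format string is inferred from the example values above rather than taken from the OSmOSE documentation, the helper `parse_bounds` is hypothetical, and it assumes both values are set (not `None`).

```python
from datetime import datetime

# Format inferred from the example values, e.g. "2024-02-22T03:50:00+0000".
TIMESTAMP_FORMAT = "%Y-%m-%dT%H:%M:%S%z"


def parse_bounds(datetime_begin, datetime_end):
    """Parse both timestamps and check that begin comes before end."""
    begin = datetime.strptime(datetime_begin, TIMESTAMP_FORMAT)
    end = datetime.strptime(datetime_end, TIMESTAMP_FORMAT)
    if begin >= end:
        raise ValueError("datetime_begin must be earlier than datetime_end")
    return begin, end


begin, end = parse_bounds("2024-02-22T03:50:00+0000", "2024-03-01T10:50:00+0000")
span_seconds = (end - begin).total_seconds()
# Number of 600 s segments covered by the selected period.
print(span_seconds / 600)  # 1194.0
```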

### Segmentation


In [None]:
# JUST RUN THIS CELL : NOTHING TO FILL IN!
dataset.initialize(
    env_name=sys.executable.replace("/bin/python", ""),
    force_init=False,
    datetime_begin=datetime_begin,
    datetime_end=datetime_end,
)

### Spectrogram generation

In [None]:
# JUST RUN THIS CELL : NOTHING TO FILL IN!
generate_spectro(
    dataset=dataset,
    path_osmose_dataset=path_osmose_dataset,
    write_datasets_csv_for_aplose=write_datasets_csv_for_aplose,
    overwrite=True,
    save_matrix=save_matrix,
    save_welch=save_welch,
    datetime_begin=datetime_begin,
    datetime_end=datetime_end,
)

### Track progress

You can monitor the segmentation and the spectrogram generation here

In [None]:
# JUST RUN THIS CELL : NOTHING TO FILL IN!
display_progress(dataset, datetime_begin=datetime_begin, datetime_end=datetime_end)

You can also monitor the status of your jobs here

In [None]:
# JUST RUN THIS CELL : NOTHING TO FILL IN!
monitor_job(dataset)

You can read a specific job output file here by providing its job ID, e.g. `job_id = 'job1_ID'`

In [None]:
# FILL IN RED PART !
read_job(job_id="605103.datarmor0", dataset=dataset)

You can also monitor the jobs in a terminal using the command `qstat -u username`
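
If you prefer to stay in the notebook, the same command can be called from Python. This is a minimal sketch, assuming `qstat` is available on the machine running the kernel; the helpers `build_qstat_command` and `run_qstat` are hypothetical and not part of OSmOSE.

```python
import getpass
import shutil
import subprocess


def build_qstat_command(username=None):
    """Build the `qstat -u username` command as an argument list."""
    user = username or getpass.getuser()
    return ["qstat", "-u", user]


def run_qstat(username=None):
    """Return the qstat output, or None when qstat is not installed."""
    if shutil.which("qstat") is None:
        return None
    result = subprocess.run(
        build_qstat_command(username),
        capture_output=True,
        text=True,
        check=False,
    )
    return result.stdout


output = run_qstat()
print(output if output is not None else "qstat is not available on this machine")
```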