## Regenerate Train/Val/Test Sets for `PATH_DATASET/train_soundscapes/`
The first approach to generating the data (the notebook `01-data_exploration.ipynb`) no longer meets our needs,
so we regenerate the dataset, this time

- without excluding the validation set
- saving the `.npy` files into `./train_npy/` and `./val_npy/`

We will combine this notebook with

- `utils.py`
- `soundscape_to_npy.py`

In [1]:
from utils import *

## `train_soundscapes/`

In [2]:
PATH_DATASET

PosixPath('/home/phunc20/datasets/kaggle/birdclef-2021')

In [3]:
df_train_soundscape = pd.read_csv(PATH_DATASET/"train_soundscape_labels.csv")
df_train_soundscape.head()

| | row_id | site | audio_id | seconds | birds |
|---|---|---|---|---|---|
| 0 | 7019_COR_5 | COR | 7019 | 5 | nocall |
| 1 | 7019_COR_10 | COR | 7019 | 10 | nocall |
| 2 | 7019_COR_15 | COR | 7019 | 15 | nocall |
| 3 | 7019_COR_20 | COR | 7019 | 20 | nocall |
| 4 | 7019_COR_25 | COR | 7019 | 25 | nocall |


<s>There are a total of `20` `.ogg` files in `train_soundscapes/`: I would like to split these into train/val/test sets.</s>

- <s>`12` files for train</s>
- <s>`4` files for val</s>
- <s>`4` files for test</s>

Unlike our first attempt, here I would like to apply `StratifiedShuffleSplit` (from `sklearn`) to the column `birds` of `df_train_soundscape`.


In [4]:
"bird dog cat".split()

['bird', 'dog', 'cat']

In [8]:
map(birdLabel_to_nBirds, ("nocall", "cat dog", "tiger"))

<map at 0x7f72ebb30d50>

In [7]:
list(map(birdLabel_to_nBirds, ("nocall", "cat dog", "tiger")))

[0, 2, 1]
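`birdLabel_to_nBirds` comes from `utils.py`, which is not shown here. Judging only from the outputs above, it could look roughly like the sketch below; the body is an assumption, not the actual implementation.

```python
def birdLabel_to_nBirds(label):
    """Count the species in a space-separated label string."""
    # Assumption: "nocall" is the only label that stands for zero birds.
    if label == "nocall":
        return 0
    return len(label.split())


print(list(map(birdLabel_to_nBirds, ("nocall", "cat dog", "tiger"))))  # [0, 2, 1]
```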

In [14]:
df_train_soundscape["n_birds"] = list(map(birdLabel_to_nBirds, df_train_soundscape["birds"]))
df_train_soundscape.loc[480:580, ["birds", "n_birds"]]

| | birds | n_birds |
|---|---|---|
| 480 | grekis rucwar | 2 |
| 481 | grekis rucwar | 2 |
| 482 | rucwar | 1 |
| 483 | rucwar | 1 |
| 484 | rucwar | 1 |
| ... | ... | ... |
| 576 | rucwar | 1 |
| 577 | rucwar | 1 |
| 578 | rucwar whcpar | 2 |
| 579 | whcpar | 1 |


In [21]:
df_train_soundscape["n_birds"].value_counts()

0    1529
1     627
2     183
3      55
4       5
5       1
Name: n_birds, dtype: int64
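The counts above matter for a stratified split: `StratifiedShuffleSplit` needs at least two rows per class, and `n_birds == 5` occurs only once. One possible workaround, sketched here on toy data, is to merge the rare counts into a single bucket before splitting. The array values, the cut-off of `3`, and the 80/20 ratio are all illustrative assumptions.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Toy stand-in for df_train_soundscape["n_birds"] (hypothetical values).
n_birds = np.array([0] * 60 + [1] * 25 + [2] * 10 + [3] * 4 + [5] * 1)

# Merge rare counts into one "3 or more" bucket so no class has a single row.
strata = np.minimum(n_birds, 3)

splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
# The features are irrelevant for the split itself, so a dummy array is enough.
train_idx, val_idx = next(splitter.split(np.zeros(len(strata)), strata))

print(len(train_idx), len(val_idx))
print(np.bincount(strata[val_idx]))  # per-bucket counts in the validation part
```

With a real DataFrame, the same indices could be passed to `df.iloc[train_idx]` and `df.iloc[val_idx]`.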

We build the column `"npy_path"` by string concatenation.

In [None]:
str(PATH_DATASET)

In [None]:
df_train_soundscape["npy_parent"] = ""
is_test = df_train_soundscape.is_test == True
is_train = ~is_test
df_train_soundscape.loc[is_test, "npy_parent"] = str(testSoundScapes)
df_train_soundscape.loc[is_train, "npy_parent"] = str(trainSoundScapes)
df_train_soundscape.loc[is_train].head()

In [None]:
df_train_soundscape["npy_path"] = \
    df_train_soundscape["npy_parent"] + "/" + \
    df_train_soundscape["row_id"] + ".npy"

# or equiv.
#df_train_soundscape.loc[:, "npy_path"] = \
#    df_train_soundscape.loc[:, "npy_parent"] + "/" + \
#    df_train_soundscape.loc[:, "row_id"] + ".npy"

In [None]:
df_train_soundscape.loc[is_test, ["row_id", "npy_path"]]

In [None]:
df_train_soundscape.loc[is_train, ["row_id", "npy_path"]]

For the columns `"longitude"` and `"latitude"`, we will loop through `D_location_coordinate`.

In [None]:
D_location_coordinate

In [None]:
lo, la = D_location_coordinate["COR"]
lo, la

In [None]:
for location, coordinate in D_location_coordinate.items():
    lo, la = coordinate.longitude, coordinate.latitude
    location_filter = df_train_soundscape.loc[:, "site"] == location
    df_train_soundscape.loc[location_filter, "longitude"] = lo
    df_train_soundscape.loc[location_filter, "latitude"] = la
df_train_soundscape.loc[:, ["site", "longitude", "latitude"]]

### Objective 1: `.ogg` to `.npy`

#### `joblib` way

In [None]:
def audio_to_mels(audio,
                  sr=SR,
                  n_mels=128,
                  fmin=0,
                  fmax=None):
    fmax = fmax or sr // 2
    mel_spec_computer = MelSpecComputer(sr=sr,
                                        n_mels=n_mels,
                                        fmin=fmin,
                                        fmax=fmax)
    mels = standardize_uint8(mel_spec_computer(audio))
    return mels

def every_5sec(id_,
               sr=SR,
               resample=True,
               res_type="kaiser_fast",
               single_process=True,
               save_to=Path("corbeille"),
               n_workers=2):
    """
    - read the audio file of ID `id_`
    - cut the read audio into pieces of 5 seconds
    - convert each piece into `.npy` file and save
    """
    path_ogg = next((PATH_DATASET / "train_soundscapes").glob(f"{id_}*.ogg"))
    location = (path_ogg.name).split("_")[1]
    whole_audio, orig_sr = soundfile.read(path_ogg, dtype="float32")
    if resample and orig_sr != sr:
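        # Keyword arguments: recent librosa releases no longer accept the rates positionally.
        whole_audio = librosa.resample(whole_audio, orig_sr=orig_sr, target_sr=sr, res_type=res_type)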
    n_samples = len(whole_audio)
    n_samples_5sec = sr * 5
    save_to.mkdir(exist_ok=True)

    def convert_and_save(i):
        audio_i = whole_audio[i:i + n_samples_5sec]
        mels_i = audio_to_mels(audio_i)
        path_i = save_to / f"{id_}_{location}_{((i + n_samples_5sec) // n_samples_5sec) * 5}.npy"
        np.save(str(path_i), mels_i)

    if single_process:
        # range() stops before the trailing remainder, so every chunk has exactly
        # n_samples_5sec samples and no length check is needed.
        for i in range(0, n_samples - n_samples % n_samples_5sec, n_samples_5sec):
            convert_and_save(i)
    else:
        pool = joblib.Parallel(n_workers)
        mapping = joblib.delayed(convert_and_save)
        tasks = (mapping(i) for i in range(0, n_samples - n_samples % n_samples_5sec, n_samples_5sec))
        pool(tasks)

def soundscapes_to_npy(is_test=False, n_processes=4):
    pool = joblib.Parallel(n_processes)
    mapping = joblib.delayed(every_5sec)
    if is_test:
        tasks = list(mapping(id_, save_to=testSoundScapes) for id_ in S_testSoundScapeIDs)
        #tasks = list(mapping(id_,
        #                     single_process=False,
        #                     save_to=testSoundScapes)
        #             for id_ in S_testSoundScapeIDs)
    else:
        tasks = list(mapping(id_, save_to=trainSoundScapes) for id_ in S_trainSoundScapeIDs)
        #tasks = list(mapping(id_,
        #                     single_process=False,
        #                     save_to=trainSoundScapes)
        #             for id_ in S_trainSoundScapeIDs)
    pool(tqdm(tasks))

### Nota Bene
- `tasks` (i.e. the input to `joblib.Parallel`) can be either a generator or a list. A generator has no length known in advance, so when it is wrapped in `tqdm` the progress bar cannot show a percentage, whereas with a list it can.
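A small sketch of this point, using a hypothetical generator `make_tasks` in place of the `joblib.delayed` calls: `tqdm` reads the length with `len()` when it can, and otherwise accepts it through its `total=` argument, which keeps the generator lazy.

```python
import io
from tqdm import tqdm


def make_tasks(n):
    # Hypothetical stand-in for the generator of joblib.delayed(...) calls.
    return (i * i for i in range(n))


n_tasks = 5
sink = io.StringIO()  # keeps the demo quiet; drop `file=` to see the bars

unknown = tqdm(make_tasks(n_tasks), file=sink)               # length unknown
known = tqdm(make_tasks(n_tasks), total=n_tasks, file=sink)  # length supplied
as_list = tqdm(list(make_tasks(n_tasks)), file=sink)         # len() works

print(unknown.total, known.total, as_list.total)
results = list(known)  # iterating drives the progress bar

unknown.close()
as_list.close()
```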

In [None]:
%%time
soundscapes_to_npy()

In [None]:
soundscapes_to_npy(is_test=True)

In [None]:
S_testSoundScapeIDs

In [None]:
!ls $trainSoundScapes | wc -l

In [None]:
!ls $testSoundScapes | wc -l

In [None]:
16 * (600 // 5)

In [None]:
4 * (600 // 5)
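The two products above are the expected file counts, assuming every soundscape lasts 600 seconds. The sketch below replays the chunking arithmetic of `every_5sec` on sample counts alone, without reading any audio. `SR = 32_000` is an assumption here (the notebook takes `SR` from `utils.py`), and `chunk_suffixes` is a hypothetical helper.

```python
SR = 32_000  # assumed sample rate; the notebook imports SR from utils.py
n_samples_5sec = SR * 5


def chunk_suffixes(n_samples, chunk=n_samples_5sec):
    """Return the end-second suffix of every full 5-second chunk."""
    # Same range() as in every_5sec: the trailing partial chunk is skipped.
    starts = range(0, n_samples - n_samples % chunk, chunk)
    return [((i + chunk) // chunk) * 5 for i in starts]


suffixes = chunk_suffixes(600 * SR)
print(len(suffixes), suffixes[:3], suffixes[-1])  # 120 [5, 10, 15] 600

# A recording that is slightly longer still yields 120 files.
print(len(chunk_suffixes(600 * SR + 1234)))  # 120
```

The suffixes match the `seconds` part of `row_id` (e.g. `7019_COR_5`), which is what lets `npy_path` be built from `row_id` directly.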

Let's at least verify that the saved images differ from one another.<br>
Run the next cell several times to view randomly chosen mel spectrograms.

In [None]:
rand_npy = random.choice(list(trainSoundScapes.iterdir()))
rand_image = np.load(rand_npy)
print(f"rand_npy = {rand_npy.name}")
librosa.display.specshow(rand_image);
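Besides eyeballing random samples, the same check can be made programmatically by hashing the contents of every saved array. The helper `count_distinct_npy` below is hypothetical and is demonstrated on throwaway arrays in a temporary folder; in the notebook one would presumably pass `trainSoundScapes` instead.

```python
import hashlib
import tempfile
from pathlib import Path

import numpy as np


def count_distinct_npy(folder):
    """Return (number of .npy files, number of distinct array contents)."""
    paths = sorted(Path(folder).glob("*.npy"))
    # Hash the raw bytes; array shape is ignored, which is fine for a rough check.
    digests = {hashlib.md5(np.load(p).tobytes()).hexdigest() for p in paths}
    return len(paths), len(digests)


with tempfile.TemporaryDirectory() as tmp:
    rng = np.random.default_rng(0)
    for k in range(3):
        fake_mels = rng.integers(0, 256, size=(128, 10), dtype=np.uint8)
        np.save(Path(tmp) / f"demo_{k}.npy", fake_mels)
    # Add one deliberate duplicate of the first file.
    np.save(Path(tmp) / "demo_dup.npy", np.load(Path(tmp) / "demo_0.npy"))
    n_files, n_distinct = count_distinct_npy(tmp)

print(n_files, n_distinct)  # 4 3
```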

### Objective 2: Construct `df_train_soundscape`

Recall that
> - We want to update `df_train_soundscape` to contain more information. What information?
>   - Date: Can be separated.
>   - Corresponding `.npy` path: Can be separated.
>   - Longitude, latitude: Can be separated.
>   - Map bird labels to bird indices?
>   - Add a new column `"n_birds"` and compute statistics on it?

Construct a dictionary for

- key: recording location, e.g. `COR`, `SSW`, etc.
- value: possibly `NamedTuple(longitude, latitude)`
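A minimal sketch of such a dictionary follows. The class name `Coordinate` is an assumption, and the numbers are placeholders, not the real site coordinates.

```python
from typing import NamedTuple


class Coordinate(NamedTuple):
    longitude: float
    latitude: float


# Placeholder values only; replace them with the real coordinates of each site.
D_location_coordinate = {
    "COR": Coordinate(longitude=1.0, latitude=2.0),
    "SSW": Coordinate(longitude=3.0, latitude=4.0),
}

# A NamedTuple supports both tuple unpacking and attribute access.
lo, la = D_location_coordinate["COR"]
print(lo, la)
print(D_location_coordinate["SSW"].latitude)
```

Using a `NamedTuple` rather than a plain tuple makes the order of the two fields explicit at every point of use.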

I think the year won't make much difference.