# Creating Dataset for Number of Speaker Estimation

This notebook shows how we have created the dataset for the Number of Speaker Estimation project.

We start by importing `util.py`, which provides all the necessary code.

In [1]:
import sys
sys.path.insert(0, './src/')
from src import util

dataset = "train50"


In [2]:
data_dir = "./data/{}wavs/".format(dataset)
new_dir = "./data/{}splits/".format(dataset)
t = 10
util.create_audio_splits(data_dir, new_dir, t)

5634 files found in total
['./data/train50wavs/train-clean-50/3168/173564/3168-173564-0041.wav', './data/train50wavs/train-clean-50/3168/173564/3168-173564-0040.wav', './data/train50wavs/train-clean-50/3168/173564/3168-173564-0042.wav', './data/train50wavs/train-clean-50/3168/173564/3168-173564-0043.wav', './data/train50wavs/train-clean-50/3168/173564/3168-173564-0044.wav', './data/train50wavs/train-clean-50/3168/173564/3168-173564-0045.wav', './data/train50wavs/train-clean-50/3168/173564/3168-173564-0036.wav', './data/train50wavs/train-clean-50/3168/173564/3168-173564-0022.wav', './data/train50wavs/train-clean-50/3168/173564/3168-173564-0023.wav', './data/train50wavs/train-clean-50/3168/173564/3168-173564-0037.wav', './data/train50wavs/train-clean-50/3168/173564/3168-173564-0021.wav', './data/train50wavs/train-clean-50/3168/173564/3168-173564-0035.wav', './data/train50wavs/train-clean-50/3168/173564/3168-173564-0009.wav', './data/train50wavs/train-clean-50/3168/173564/3168-173564-0008

  files_per_speaker = np.array(files_per_speaker)


Note that for this code to work, LibriSpeech has to be downloaded and unzipped into a `./data/` folder. The general folder path is then:
```
./data/LibriSpeech/type_of_librispeech/speaker_id/chapter_id/audiofile.flac
```
If you have downloaded, for example, the LibriSpeech train-100 clean dataset, the following path would be valid:
```
./data/LibriSpeech/train-clean-100/19/198/19-198-0000.flac
```
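As a small illustration of this layout, the sketch below pulls the subset, speaker ID, and chapter ID out of such a path. The helper name `parse_librispeech_path` is hypothetical and is not part of `util.py`.

```python
from pathlib import PurePosixPath


def parse_librispeech_path(path):
    """Split a LibriSpeech audio path into subset, speaker ID and chapter ID."""
    parts = PurePosixPath(path).parts
    # The last four components are: subset / speaker_id / chapter_id / file name.
    subset, speaker_id, chapter_id, filename = parts[-4:]
    return {
        "subset": subset,
        "speaker_id": speaker_id,
        "chapter_id": chapter_id,
        "filename": filename,
    }


info = parse_librispeech_path(
    "./data/LibriSpeech/train-clean-100/19/198/19-198-0000.flac"
)
print(info)
```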

**Please note that running this notebook requires a lot of hard disk space. At each step, we save the output as new files in a new folder. This is intentional, for testing purposes and to keep a backup of the previous steps. If you want to run this notebook without doing so, the code in the source files has to be changed, or intermediate results have to be removed from the hard disk manually.**

## 1. Rewrite LibriSpeech `.flac` to `.wav`

We first convert the LibriSpeech `.flac` files to `.wav` files by calling `util.flacs_to_wavs`.
This function requires two parameters:

 - `data_dir` represents where the original data is located. According to the aforementioned structure, this would be `./data/LibriSpeech/`.
 - `new_dir` represents the new location where the `.wav` files will be written to.

In the cell below, `new_dir` is built from the `dataset` name plus the suffix `wavs` (here `./data/train50wavs/`), to indicate the file format (`.wav`) and to remind us which subset the files come from.
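To make the relation between `data_dir` and `new_dir` concrete, here is a minimal sketch of how a source path could be mapped to its `.wav` target. It assumes the folder tree under `data_dir` is mirrored under `new_dir`; the helper `wav_target_path` is hypothetical, and the actual audio decoding done by `util.flacs_to_wavs` is not shown.

```python
from pathlib import PurePosixPath


def wav_target_path(flac_path, data_dir, new_dir):
    """Return the .wav path under new_dir that mirrors flac_path under data_dir."""
    # Path of the file relative to the source root, e.g. subset/speaker/chapter/file.
    relative = PurePosixPath(flac_path).relative_to(PurePosixPath(data_dir))
    # Same relative location in the new root, with the extension swapped.
    return str(PurePosixPath(new_dir) / relative.with_suffix(".wav"))


target = wav_target_path(
    "./data/train50/train-clean-50/3168/173564/3168-173564-0041.flac",
    data_dir="./data/train50/",
    new_dir="./data/train50wavs/",
)
print(target)
```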

In [2]:
data_dir = "./data/{}/".format(dataset)
new_dir = "./data/{}wavs/".format(dataset)

In [3]:
util.flacs_to_wavs(data_dir, new_dir)

172it [00:38,  4.41it/s]


## 2. Split `.wav` files into fragments of `t` seconds

The function `util.split_audio_in_samples` is written to do this. It requires three parameters:

 - `data_dir` represents where the original `.wav` files are located. Following the previous code cell, this is `./data/train50wavs/`.
 - `new_dir` represents the new location where the shorter `.wav` files will be written to.
 - `t` is the length of each fragment in seconds; the cell below uses `t = 10`.
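The splitting step can be sketched as follows. This is a simplified illustration on a plain list of samples, not the code in `util.py`; the function name `split_into_fragments` is hypothetical, and dropping the trailing remainder shorter than `t` seconds is an assumption.

```python
def split_into_fragments(samples, sample_rate, t):
    """Cut a sequence of samples into consecutive fragments of t seconds."""
    fragment_len = int(sample_rate * t)
    n_full = len(samples) // fragment_len
    # Assumption: a trailing remainder shorter than t seconds is dropped.
    return [
        samples[i * fragment_len:(i + 1) * fragment_len] for i in range(n_full)
    ]


sample_rate = 16000  # LibriSpeech audio is sampled at 16 kHz
samples = [0.0] * (sample_rate * 23)  # 23 seconds of silence as a stand-in
fragments = split_into_fragments(samples, sample_rate, t=10)
print(len(fragments), len(fragments[0]))
```

With a 23-second input and `t = 10`, this yields two fragments of 160000 samples each, and the last 3 seconds are discarded.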

In [None]:
data_dir = "./data/{}wavs/".format(dataset)
new_dir = "./data/{}splits/".format(dataset)
t = 10

In [None]:
util.split_audio_in_samples(data_dir, new_dir, t)

## 3. Change Loudness of All Splits to 70 dB

The function `util.change_loudness` changes the loudness of every split to 70 dB. It requires three parameters:
 - `data_dir` denotes where the split `.wav` files are located.
 - `new_dir` denotes where the new files will be written to.
 - `target` denotes the target loudness in decibels. The default is 70.

Since this function uses a subprocess call, it takes somewhat longer to run than the previous function calls.
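The idea behind loudness normalization can be sketched in a few lines: measure the current RMS level in dB, then apply the gain that closes the gap to the target. This is not what `util.change_loudness` runs (that goes through a subprocess); the function names are hypothetical, and the reference value of 20 µPa (`2e-5`, the usual sound-pressure reference) is an assumption about how the 70 dB is defined.

```python
import math

REF = 2e-5  # assumed reference level (20 micropascal convention)


def rms_db(samples, ref=REF):
    """RMS level of the samples in dB relative to ref (fails on pure silence)."""
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return 20.0 * math.log10(rms / ref)


def scale_to_db(samples, target_db, ref=REF):
    """Scale samples so that their RMS level equals target_db."""
    gain = 10.0 ** ((target_db - rms_db(samples, ref)) / 20.0)
    return [s * gain for s in samples]


# One second of a 440 Hz tone at 16 kHz as a stand-in for a speech fragment.
tone = [0.5 * math.sin(2 * math.pi * 440 * n / 16000) for n in range(16000)]
normalized = scale_to_db(tone, target_db=70.0)
print(round(rms_db(tone), 2), round(rms_db(normalized), 2))
```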

In [None]:
data_dir = "./data/{}splits/".format(dataset)
new_dir = "./data/{}normalized/".format(dataset)
target = 70.

In [None]:
util.change_loudness(data_dir, new_dir, target)

## 4. Merging Files

For merging splits, and hence actually creating the dataset, we use the function `util.merge_audiofiles`. It takes three parameters:
 - `data_dir` represents where the normalized split files are located.
 - `new_dir` represents the location where the merged files will be written to.
 - `max_nr_of_speakers` is the maximum number of speakers that may be combined in one merged file.

Unlike the previous functions, this one also creates a `metadata` folder inside `new_dir`, where `.txt` files are saved containing the IDs of the speakers that were merged. Each `.txt` file has the same name as the corresponding merged `.wav` file.
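A minimal sketch of the merging idea, under stated assumptions: one fragment is drawn from each of several distinct speakers, the fragments are summed sample by sample, and the mix is divided by its peak so it stays within [-1, 1]. The function `merge_fragments`, the toy data, and the one-ID-per-line metadata format are all hypothetical and only illustrate the principle.

```python
import random


def merge_fragments(fragments_by_speaker, n_speakers, rng):
    """Mix one fragment from each of n_speakers distinct speakers."""
    # Sampling without replacement guarantees no speaker is picked twice.
    speakers = rng.sample(sorted(fragments_by_speaker), n_speakers)
    chosen = [rng.choice(fragments_by_speaker[s]) for s in speakers]
    length = min(len(f) for f in chosen)
    mixed = [sum(f[i] for f in chosen) for i in range(length)]
    peak = max(abs(x) for x in mixed)
    if peak > 0:
        # Assumption: always divide by the peak so the mix lies in [-1, 1].
        mixed = [x / peak for x in mixed]
    return mixed, speakers


rng = random.Random(0)
toy = {
    "19": [[0.5, -0.5, 0.25]],
    "26": [[0.5, 0.5, -0.25]],
    "27": [[-0.9, 0.1, 0.3]],
}
mixed, speakers = merge_fragments(toy, n_speakers=2, rng=rng)
metadata = "\n".join(speakers)  # assumed layout: one speaker ID per line
print(speakers, mixed)
```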

In [None]:
data_dir = "./data/{}normalized/".format(dataset)
new_dir = "./data/trainset50/"
max_nr_of_speakers = 10

In [None]:
util.merge_audiofiles(data_dir, new_dir, max_nr_of_speakers)

## 5. Creating STFTs from Audio Data

In [10]:
data_dir = "./data/trainset250/train"
new_dir = "./data/trainset250/stft"

In [11]:
util.create_spectrograms(data_dir, new_dir)

100%|██████████| 2562/2562 [02:33<00:00, 16.67it/s]
100%|██████████| 2563/2563 [00:50<00:00, 50.87it/s]
100%|██████████| 2563/2563 [02:29<00:00, 17.20it/s]
100%|██████████| 2562/2562 [02:23<00:00, 17.80it/s]
100%|██████████| 2570/2570 [01:27<00:00, 29.49it/s]
100%|██████████| 2561/2561 [02:32<00:00, 16.82it/s]
100%|██████████| 2562/2562 [02:28<00:00, 17.25it/s]
100%|██████████| 2562/2562 [02:06<00:00, 20.30it/s]
100%|██████████| 2567/2567 [01:56<00:00, 21.98it/s]
100%|██████████| 2567/2567 [01:43<00:00, 24.79it/s]
100%|██████████| 2562/2562 [02:13<00:00, 19.19it/s]
12it [22:45, 113.78s/it]
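To illustrate what an STFT computes, the sketch below slices a signal into overlapping frames, applies a Hann window, and takes the magnitude of a naive DFT per frame. It uses only the standard library and is far slower than a real implementation; the function name, frame length, and hop size are assumptions and do not reflect the settings used by `util.create_spectrograms`.

```python
import cmath
import math


def stft_magnitude(samples, frame_len, hop):
    """Magnitude STFT: Hann-windowed frames, naive DFT, non-negative bins only."""
    window = [
        0.5 - 0.5 * math.cos(2 * math.pi * n / frame_len) for n in range(frame_len)
    ]
    n_bins = frame_len // 2 + 1
    frames = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = [samples[start + n] * window[n] for n in range(frame_len)]
        spectrum = []
        for k in range(n_bins):
            acc = sum(
                frame[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                for n in range(frame_len)
            )
            spectrum.append(abs(acc))
        frames.append(spectrum)
    return frames


sample_rate = 16000
# A 1 kHz tone: with 256-sample frames it falls on bin 1000 * 256 / 16000 = 16.
tone = [math.sin(2 * math.pi * 1000 * n / sample_rate) for n in range(1024)]
spec = stft_magnitude(tone, frame_len=256, hop=128)
peak_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
print(len(spec), len(spec[0]), peak_bin)
```

The result is a list of frames, each holding `frame_len // 2 + 1` magnitudes, which is the time-frequency layout a network would consume.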


## 6. Done

Now we have generated our dataset, which we can use in the Number of Speaker Estimation task.
In summary, we did the following:
 - Rewrote the `.flac` files to `.wav`
 - Split the `.wav` files into fixed-length fragments of `t` seconds
 - Normalized the loudness of every fragment to 70 dB
 - Merged files together, making sure that:
     - The same speaker never appears more than once in a merged file;
     - Classes are as balanced as possible, although perfect class balance is difficult to achieve;
     - Merged data arrays are normalized to the range [-1, 1] to prevent clipping.
 - Created STFTs from the audio files, so that we can pass the data directly to the network.