# Preparing Dataset with Lou's Voice

### Step 1: Split Audio Files into Chunks < 10 Seconds

- Use the Python script `scripts/split_on_silence.py`.
- The script creates new audio chunks without modifying the original files.
- Chunks are saved with the duration in milliseconds in the filename. Example: `LOU0051693_1575ms.wav`.
- Chunks are saved in directories based on length:
  - `more_than_10_seconds`
  - `less_than_10_seconds`
- These directories are created inside a `split_on_silence` directory placed next to the script's positional argument, `dataset_dir` (that is, in its parent directory). For example, if `dataset_dir` is `~/storage/dataset`, the output will be:
  - `~/storage/split_on_silence/more_than_10_seconds`
  - `~/storage/split_on_silence/less_than_10_seconds`
- Run the script multiple times. Start with default settings, then adjust `--min-silence`, `--silence-thresh`, and `--keep-silence` as needed.
- You probably want to drop any file under `1000ms`. 
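
Because the duration is part of each filename, short chunks can be found without opening the audio. The following is a minimal sketch, assuming the `<name>_<duration>ms.wav` pattern shown above. The helpers `duration_ms` and `find_short_chunks` are hypothetical and not part of `scripts/split_on_silence.py`.

```python
import re
from pathlib import Path

# Matches the duration suffix in names such as "LOU0051693_1575ms.wav".
DURATION_RE = re.compile(r"_(\d+)ms\.wav$")


def duration_ms(filename):
    """Return the duration encoded in a chunk filename, or None if absent."""
    match = DURATION_RE.search(filename)
    return int(match.group(1)) if match else None


def find_short_chunks(directory, min_ms=1000):
    """List chunk files in `directory` whose encoded duration is below `min_ms`."""
    short = []
    for path in sorted(Path(directory).glob("*.wav")):
        duration = duration_ms(path.name)
        if duration is not None and duration < min_ms:
            short.append(path)
    return short
```

The function only lists candidates. Listening to them before deleting anything (for example with `path.unlink()`) is the safer route.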

For more details on script options, see below. 


In [4]:
!python ../scripts/split_on_silence.py -h

usage: split_on_silence [-h] [--min-silence MIN_SILENCE]
                        [--silence-thresh SILENCE_THRESH]
                        [--keep-silence KEEP_SILENCE]
                        dataset_dir

This program processes a directory of audio files by splitting them at points
of silence. After splitting, the resulting audio segments are categorized
based on their duration: - Segments longer than 10 seconds are saved to a
specified directory for longer files. - Segments shorter than 10 seconds are
saved to a specified directory for shorter files. The current script max
duration is set to 100.0 seconds.

positional arguments:
  dataset_dir           Input directory

options:
  -h, --help            show this help message and exit
  --min-silence MIN_SILENCE
                        Minimum silence length in milliseconds
  --silence-thresh SILENCE_THRESH
                        Silence threshold in dBfs
  --keep-silence KEEP_SILENCE
                        The amount of silence to k
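
To build intuition for how the three options interact, here is a simplified toy model. It is not the script's implementation: it works on a hypothetical list of loudness values (one dBFS value per millisecond) instead of real audio, and the function names are made up for illustration. A run of values below `silence_thresh` counts as a split point only if it lasts at least `min_silence` milliseconds, and `keep_silence` milliseconds of that pause are kept on each side of a chunk.

```python
def find_silences(levels, min_silence, silence_thresh):
    """Return (start, end) index ranges where the level stays below the threshold."""
    runs = []
    start = None
    for i, level in enumerate(levels):
        if level < silence_thresh:
            if start is None:
                start = i  # a quiet run begins here
        else:
            if start is not None and i - start >= min_silence:
                runs.append((start, i))  # quiet run was long enough to count
            start = None
    # Handle a quiet run that extends to the end of the signal.
    if start is not None and len(levels) - start >= min_silence:
        runs.append((start, len(levels)))
    return runs


def split_ranges(levels, min_silence, silence_thresh, keep_silence):
    """Return (start, end) ranges of chunks, padded with some of the silence."""
    chunks = []
    cursor = 0  # index where the current non-silent stretch begins
    for s_start, s_end in find_silences(levels, min_silence, silence_thresh):
        if s_start > cursor:
            chunks.append((max(0, cursor - keep_silence),
                           min(len(levels), s_start + keep_silence)))
        cursor = s_end
    if cursor < len(levels):
        chunks.append((max(0, cursor - keep_silence), len(levels)))
    return chunks
```

With a signal containing one 50 ms pause and one 30 ms pause, `min_silence=40` splits only at the longer pause, while `min_silence=20` splits at both. This is why lowering `--min-silence` makes the split more aggressive.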

**Example:**

For demonstration purposes, a small dataset is hosted on GitHub. Run the command below to download it to `../examples/dataset`:

In [76]:
!!wget -q https://github.com/Pbotsaris/coqui-ljs-vits-train/releases/download/0.1/dataset.zip -O temp.zip; unzip temp.zip -d ../examples/.; rm temp.zip

['Archive:  temp.zip',
 '   creating: ../examples/./dataset/',
 '  inflating: ../examples/./dataset/Lou_VO_Heineken_Remendo_240802.wav  ',
 '  inflating: ../examples/./dataset/Lou_VO_YodaBank_230512.wav  ',
 '  inflating: ../examples/./dataset/VO_Lou_Sorriso_230325.wav  ']

In [77]:
!ls -l ../examples/dataset | awk '{print $NF}'

68248
Lou_VO_Heineken_Remendo_240802.wav
Lou_VO_YodaBank_230512.wav
VO_Lou_Sorriso_230325.wav


Here are the durations of these files. All of them are much longer than 10 seconds (10000 ms).

In [78]:
import os
from pydub import AudioSegment

dataset_dir='../examples/dataset'

for filename in os.listdir(dataset_dir):
    audio = AudioSegment.from_file(os.path.join(dataset_dir,filename))
    print(f'filename: {filename:40} ⇥ {len(audio)}ms')
    

filename: Lou_VO_Heineken_Remendo_240802.wav       ⇥ 52091ms
filename: Lou_VO_YodaBank_230512.wav               ⇥ 282253ms
filename: VO_Lou_Sorriso_230325.wav                ⇥ 150853ms


Now let's split these files. For the first pass, the default options, which are more conservative, suffice.

In [79]:
!python ../scripts/split_on_silence.py ../examples/dataset/

Creating output directory:  ../examples/dataset/../split_on_silence
creating split subdirectories "../examples/dataset/../split_on_silence/more_than_10_seconds" and "../examples/dataset/../split_on_silence/less_than_10_seconds"...
Iterating over files in the dataset directory "../examples/dataset/"...
Exported chunk with duration    1581ms.
Exported chunk with duration    8297ms.
Exported chunk with duration    4109ms.
Exported chunk with duration    5739ms.
Exported chunk with duration    1525ms.
Exported chunk with duration    1825ms.
Exported chunk with duration    5735ms.
Exported chunk with duration   10673ms.
Exported chunk with duration     267ms.
Exported chunk with duration     564ms.
Exported chunk with duration    2948ms.
Exported chunk with duration     718ms.
Exported chunk with duration     669ms.
Exported chunk with duration    2585ms.
Exported chunk with duration    3777ms.
Exported chunk with duration     404ms.
Exported chunk with duration    1884ms.
Exported chunk wi
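
A quick way to judge a pass is to count how many exported chunks fall outside the useful range. The sketch below parses log lines in the format printed above. The function name `summarize_durations` and the default limits (1000 ms and 10000 ms) are assumptions for illustration.

```python
import re

# Matches lines such as "Exported chunk with duration    1581ms."
EXPORT_RE = re.compile(r"Exported chunk with duration\s+(\d+)ms\.")


def summarize_durations(log_text, min_ms=1000, max_ms=10_000):
    """Count exported chunks and how many are too short or too long."""
    durations = [int(value) for value in EXPORT_RE.findall(log_text)]
    return {
        "total": len(durations),
        "too_short": sum(d < min_ms for d in durations),
        "too_long": sum(d > max_ms for d in durations),
    }
```

In a notebook, the script output can be captured with `log = !python ...` and passed in as `"\n".join(log)`.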

Now let's check the split output in `examples/split_on_silence`:

In [80]:
!tree ../examples/split_on_silence

../examples/split_on_silence
├── less_than_10_seconds
│   ├── LOU0000000_1581ms.wav
│   ├── LOU0000001_8297ms.wav
│   ├── LOU0000002_4109ms.wav
│   ├── LOU0000003_5739ms.wav
│   ├── LOU0000004_1525ms.wav
│   ├── LOU0000005_1825ms.wav
│   ├── LOU0000006_5735ms.wav
│   ├── LOU0000008_267ms.wav
│   ├── LOU0000009_564ms.wav
│   ├── LOU0000010_2948ms.wav
│   ├── LOU0000011_718ms.wav
│   ├── LOU0000012_669ms.wav
│   ├── LOU0000013_2585ms.wav
│   ├── LOU0000014_3777ms.wav
│   ├── LOU0000015_404ms.wav
│   ├── LOU0000016_1884ms.wav
│   ├── LOU0000017_2333ms.wav
│   ├── LOU0000018_1903ms.wav
│   ├── LOU0000019_3507ms.wav
│   ├── LOU0000020_4254ms.wav
│   ├── LOU0000021_3068ms.wav
│   ├── LOU0000022_1401ms.wav
│  

For this small dataset, the first round of splitting was almost enough: only one chunk is longer than 10 seconds, and it was saved in `more_than_10_seconds`.
Let's organize our output:

- Move `less_than_10_seconds` to a `ready` directory and `more_than_10_seconds` to `needs_split`.

In [81]:
!mv ../examples/split_on_silence/less_than_10_seconds ../examples/ready 
!mv ../examples/split_on_silence/more_than_10_seconds ../examples/needs_split
!rmdir ../examples/split_on_silence # we don't need this directory anymore
print('Directories in ../examples:')
!ls ../examples | awk '{print $NF}'

Directories in ../examples:
dataset
needs_split
ready
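
The same reorganization can also be done without shell commands. This is a sketch using `shutil.move` and `Path.rmdir`. The function name `organize_split_output` is hypothetical, and it assumes that `ready` and `needs_split` do not exist yet (otherwise `shutil.move` would place the source directory inside them).

```python
import shutil
from pathlib import Path


def organize_split_output(examples_dir):
    """Move the two split buckets up one level and remove the empty parent."""
    examples = Path(examples_dir)
    split_dir = examples / "split_on_silence"
    # Chunks that are already short enough are ready for the dataset.
    shutil.move(str(split_dir / "less_than_10_seconds"), str(examples / "ready"))
    # Longer chunks go through another round of splitting.
    shutil.move(str(split_dir / "more_than_10_seconds"), str(examples / "needs_split"))
    # rmdir only succeeds on an empty directory, so leftover files raise an error.
    split_dir.rmdir()
    return sorted(p.name for p in examples.iterdir())
```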


Preview the file that still needs splitting. It is slightly over 10 seconds long.

In [82]:
from IPython.display import Audio


Audio('../examples/needs_split/LOU0000007_10673ms.wav')

This file is an especially challenging case: it is only slightly over 10 seconds long, and there is little more than breathing between words. Let's split the remaining file again with very aggressive options. That said, you will probably want your second split to be more conservative, for example:
```
!python ../scripts/split_on_silence.py --min-silence 110 --silence-thresh -60 --keep-silence 40 ../examples/needs_split/
```

Let's push to see how far we can go with `LOU0000007_10673ms.wav`:

In [83]:
!python ../scripts/split_on_silence.py --min-silence 40 --silence-thresh -60 --keep-silence 30 ../examples/needs_split/

Creating output directory:  ../examples/needs_split/../split_on_silence
creating split subdirectories "../examples/needs_split/../split_on_silence/more_than_10_seconds" and "../examples/needs_split/../split_on_silence/less_than_10_seconds"...
Iterating over files in the dataset directory "../examples/needs_split/"...
Exported chunk with duration    2230ms.
Exported chunk with duration    1353ms.
Exported chunk with duration    7090ms.


Verify the output by listening to each chunk:

In [84]:
root='../examples/split_on_silence/less_than_10_seconds'

for filename in os.listdir(root):
    if filename.endswith('.wav'):
        print(f'filename: {filename}')
        display(Audio(os.path.join(root, filename)))

filename: LOU0000002_7090ms.wav


filename: LOU0000000_2230ms.wav


filename: LOU0000001_1353ms.wav


**This is not a great result because "cultura" is being cut off. This is an example of a case where you may want to split the file manually or drop it from the dataset.**

This process must be done carefully and manually. Listen to the audio files and check their durations.
