# Speech Dataset Generator - Usage Guide


## Prerequisites

### Pyannote Agreement


Before running the code, ensure that you have agreed to share your contact information to access the pyannote embedding model. A similar agreement may be required for the pyannote speaker diarization model.

1. https://huggingface.co/pyannote/embedding

2. https://huggingface.co/pyannote/speaker-diarization

### Hugging Face token

Generate an access token at https://huggingface.co/settings/tokens

## Installation


In [None]:
# 1. Clone the Repository
!git clone https://github.com/davidmartinrius/speech-dataset-generator.git
%cd speech-dataset-generator

In [None]:
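# 2. Set Up Environment
# Each "!" command runs in its own subshell, so "!source venv/bin/activate" would not
# stay active for later commands. Install into the notebook's own environment instead.
%pip install -r requirements.txt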

In [None]:
# 3. HuggingFace Token
!echo "HF_TOKEN=yourtoken" > .env
# Make sure to replace 'yourtoken' with your actual HuggingFace token.
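
Typing the token into a shell command leaves it visible in the notebook. As an alternative, here is a minimal sketch that writes the same `HF_TOKEN=...` line from Python. `write_env_file` is a hypothetical helper, not part of the repository, and it assumes the `.env` file only needs the single line shown above.

```python
from pathlib import Path


def write_env_file(token: str, path: str = ".env") -> Path:
    """Write HF_TOKEN=<token> to a .env file (hypothetical helper)."""
    token = token.strip()
    if not token:
        raise ValueError("The Hugging Face token must not be empty.")
    env_path = Path(path)
    # Same content as the echo command: one KEY=value line
    env_path.write_text(f"HF_TOKEN={token}\n", encoding="utf-8")
    return env_path
```

In a notebook, `write_env_file(getpass.getpass("HF token: "))` (with `import getpass`) prompts for the token without echoing it.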

In [None]:
# 4. Set up path
import os
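repo_path = "/content/speech-dataset-generator"
# PYTHONPATH may be unset, so read it with a default instead of using "+=" directly
os.environ['PYTHONPATH'] = os.pathsep.join(filter(None, [os.environ.get('PYTHONPATH'), repo_path]))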

## Usage


In [4]:
import os
from IPython.display import Audio, display

def display_the_list_of_files(output_directory):

    # List all .wav files in the output directory
    file_list = sorted(f for f in os.listdir(output_directory) if f.endswith('.wav'))

    # Display at most the first 10 file names
    print("List of generated .wav files:")
    for file_name in file_list[:10]:
        print(f"{file_name}\n")
        
# Function to play audio
def play_audio(wavs_directory):

    # Let the user choose a file to play
    selected_file = input("Enter the filename to play (e.g., example_file.wav): ")
    file_path = os.path.join(wavs_directory, selected_file)
    print(file_path)

    # Check if the selected file exists
    if os.path.exists(file_path):
        print(f"Playing: {selected_file}")
        display(Audio(filename=file_path))
    else:
        print(f"File '{selected_file}' not found in the output directory.")

### Basic

The following audio sample contains:
- 2 speakers
- 2 genders
- Background noise
- A duration of 2:14 (minutes:seconds)

In [None]:
display(Audio(filename="./assets/example_audio_1.mp3"))

I am going to apply the following processing steps to this audio:
1. deepfilternet to reduce the noise
2. resembleai to enhance the audio quality
3. Silence removal

In [None]:
output_directory = "./outputs/output_combining_enhancers"

# Both enhancers are used: deepfilternet and resembleai
!python speech_dataset_generator/main.py --input_file_path ./assets/example_audio_1.mp3 --output_directory {output_directory} --range_times 3-15 --enhancers deepfilternet resembleai

After processing the audio, the output directory contains:
1. Enhanced audio files
2. Segmented audio files within the range you specified, in this case 3 to 15 seconds, for each speaker
3. chroma_database, where the speakers are persisted. You can reuse this database to process other files, and the speaker labels will stay the same
4. A metadata.csv file and a wavs folder, following the LJSpeech dataset standard
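
To inspect the generated metadata, a minimal sketch follows. It assumes that `metadata.csv` sits directly in the output directory and uses pipe-separated fields, as the original LJSpeech `metadata.csv` does. The number and meaning of the columns are not assumed, and `read_metadata` is a hypothetical helper, not part of the repository.

```python
import csv
from pathlib import Path


def read_metadata(output_directory: str) -> list[list[str]]:
    """Return the rows of metadata.csv as lists of fields.

    Assumption: fields are pipe-separated, as in the original LJSpeech
    metadata.csv. Column count and meaning are left to the caller.
    """
    metadata_path = Path(output_directory) / "metadata.csv"
    with metadata_path.open(newline="", encoding="utf-8") as handle:
        # QUOTE_NONE keeps quotation marks in transcripts untouched
        reader = csv.reader(handle, delimiter="|", quoting=csv.QUOTE_NONE)
        return [row for row in reader if row]
```

For example, `read_metadata(output_directory)[:3]` shows the first three rows, so you can check which columns the file actually contains.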

Inside the enhanced folder you can listen to the improved audio with the silences removed. The original was 2:14 long; the enhanced version is 1:44.
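
As a quick check of those two durations, the arithmetic can be sketched as follows (`to_seconds` is a hypothetical helper used only for illustration):

```python
def to_seconds(clock: str) -> int:
    """Convert a 'minutes:seconds' string such as '2:14' to seconds."""
    minutes, seconds = clock.split(":")
    return int(minutes) * 60 + int(seconds)


original = to_seconds("2:14")   # 134 seconds
enhanced = to_seconds("1:44")   # 104 seconds
removed = original - enhanced   # 30 seconds
print(f"Removed {removed} s ({removed / original:.1%} of the original)")
# prints: Removed 30 s (22.4% of the original)
```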

In [None]:
display(Audio(filename=os.path.join(output_directory, "enhanced", "example_audio_1_enhanced.mp3")))

Let's see what is inside the wavs folder:

In [None]:
wavs_directory = os.path.join(output_directory, "wavs")
display_the_list_of_files(wavs_directory)
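
To check that the generated segments respect the `--range_times` value, a minimal sketch using only the standard library follows. It assumes the files in the wavs folder are uncompressed PCM WAV, which is what the `wave` module can read. `wav_durations` and `outside_range` are hypothetical helpers, not part of the repository.

```python
import os
import wave


def wav_durations(wavs_directory: str) -> dict[str, float]:
    """Map each .wav file name to its duration in seconds."""
    durations = {}
    for name in sorted(os.listdir(wavs_directory)):
        if not name.endswith(".wav"):
            continue
        with wave.open(os.path.join(wavs_directory, name), "rb") as wav_file:
            # duration = number of frames / frames per second
            durations[name] = wav_file.getnframes() / wav_file.getframerate()
    return durations


def outside_range(durations: dict[str, float], low: float, high: float) -> list[str]:
    """Return the file names whose duration is not within [low, high]."""
    return [name for name, secs in durations.items() if not low <= secs <= high]
```

With the command above, `outside_range(wav_durations(wavs_directory), 3, 15)` lists any segment shorter than 3 seconds or longer than 15 seconds.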

In [None]:
# Use one of the file names. Example of the output:
# List of generated .wav files:
#     1709255795_1479612617475313631572.wav

# When this cell runs, a prompt asks for a file name:

play_audio(wavs_directory)

### Advanced (still in progress)

#### Example: Input from a File


##### Generate with no enhancer

The base audio must be of very good quality, or it will be discarded.

In [None]:
output_directory = "./outputs/output_no_enhancer"

# No enhancer is used
!python speech_dataset_generator/main.py --input_file_path ./assets/example_audio_1.mp3 --output_directory {output_directory} --range_times 5-10

In [None]:
wavs_directory = os.path.join(output_directory, "wavs")
display_the_list_of_files(wavs_directory)

In [None]:
play_audio(wavs_directory)

##### Using deepfilternet enhancer

In [None]:
!python speech_dataset_generator/main.py --input_file_path ./assets/example_audio_1.mp3 --output_directory ./outputs/output_deepfilternet --range_times 4-10 --enhancers deepfilternet

##### Using resembleai enhancer

In [None]:
!python speech_dataset_generator/main.py --input_file_path ./assets/example_audio_1.mp3 --output_directory ./outputs/output_resembleai --range_times 4-10 --enhancers resembleai

##### Combining enhancers

In [None]:
!python speech_dataset_generator/main.py --input_file_path ./assets/example_audio_1.mp3 --output_directory ./outputs/output_combining_enhancers --range_times 4-10 --enhancers deepfilternet resembleai

#### Example: Input from a Folder

In [None]:
!python speech_dataset_generator/main.py --input_folder ./assets --output_directory ./outputs/output_folder --range_times 4-10 --enhancers deepfilternet

#### Example: Input from YouTube (Single Video or Playlists)


##### YouTube Single Video

In [None]:
# YouTube Single Video (the URL is quoted so the shell does not treat "?" as a wildcard)
!python speech_dataset_generator/main.py --youtube_download "https://www.youtube.com/watch?v=ExJZAegsOis" --output_directory ./outputs/output_youtube --range_times 5-15 --enhancers deepfilternet resembleai

#### Combining a YouTube video + Input File

In [None]:
!python speech_dataset_generator/main.py --youtube_download "https://www.youtube.com/watch?v=ExJZAegsOis" --input_file_path ./assets/example_audio_1.mp3 --output_directory ./outputs/output_youtube_and_file --range_times 5-15 --enhancers deepfilternet resembleai

#### Combining YouTube video + Input Folder

In [None]:
!python speech_dataset_generator/main.py --youtube_download "https://www.youtube.com/watch?v=ExJZAegsOis" --input_folder ./assets --output_directory ./outputs/output_youtube_and_folder --range_times 5-15 --enhancers deepfilternet resembleai
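
The commands in this guide differ only in their input sources, output directory, range, and enhancers. A minimal sketch of building them programmatically follows. `build_command` is a hypothetical helper, not part of the repository, and it uses only the flags that appear in this guide.

```python
import shlex


def build_command(output_directory, range_times, enhancers=(),
                  input_file_path=None, input_folder=None, youtube_download=None):
    """Build the argument list for main.py from the flags used in this guide."""
    if not (input_file_path or input_folder or youtube_download):
        raise ValueError("Provide at least one input source.")
    command = ["python", "speech_dataset_generator/main.py"]
    if youtube_download:
        command += ["--youtube_download", youtube_download]
    if input_file_path:
        command += ["--input_file_path", input_file_path]
    if input_folder:
        command += ["--input_folder", input_folder]
    command += ["--output_directory", output_directory, "--range_times", range_times]
    if enhancers:
        command += ["--enhancers", *enhancers]
    return command


command = build_command(
    output_directory="./outputs/output_youtube_and_file",
    range_times="5-15",
    enhancers=["deepfilternet", "resembleai"],
    youtube_download="https://www.youtube.com/watch?v=ExJZAegsOis",
    input_file_path="./assets/example_audio_1.mp3",
)
# shlex.join quotes arguments such as the URL for display in a shell
print(shlex.join(command))
```

Passing the list to `subprocess.run(command, check=True)` runs it without a shell, so the URL needs no extra quoting.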