# Welcome to Modal notebooks!

Write Python code and collaborate in real time. Your code runs in Modal's
**serverless cloud**, and anyone in the same workspace can join.

This notebook comes with some common Python libraries installed. Run
cells with `Shift+Enter`.

In [25]:
!modal secret create kaggle-secret \
    KAGGLE_USERNAME=seifosamahosney \
    KAGGLE_KEY=<REDACTED_KAGGLE_KEY>

Created a new secret 'kaggle-secret' with the keys 'KAGGLE_USERNAME', 'KAGGLE_KEY'

Use it in your Modal app:

@app.function(secrets=[modal.Secret.from_name("kaggle-secret")])
def some_function():
    os.getenv("KAGGLE_USERNAME")

In [29]:
%%writefile download_tts_simple.py
import modal
import subprocess
import shutil
from pathlib import Path

app = modal.App("download-tts-fixed")

volume = modal.Volume.from_name("tts-dataset-storage", create_if_missing=True)

@app.function(
    image=modal.Image.debian_slim().pip_install("kaggle"),
    secrets=[modal.Secret.from_name("kaggle-secret")],
    volumes={"/data": volume},
    timeout=3600,
)
def download_to_volume():
    """Download Kaggle TTS dataset and structure it correctly"""

    # Target structure (Parler-TTS compatible)
    target_dir = Path("/data/tts_dataset/teacher_dataset_large_updated")
    voices_dir = target_dir / "voices"

    target_dir.mkdir(parents=True, exist_ok=True)
    voices_dir.mkdir(exist_ok=True)

    # Temp download location
    temp_download_path = Path("/tmp/kaggle_data")
    temp_download_path.mkdir(exist_ok=True)

    print(f"📥 Downloading Kaggle dataset to {temp_download_path}...")

    # Download + unzip
    cmd = [
        "kaggle", "datasets", "download",
        "-d", "seifosamahosney/tts-dataset",
        "-p", str(temp_download_path),
        "--unzip",
        "--force",
    ]

    result = subprocess.run(cmd, capture_output=True, text=True)

    if result.returncode != 0:
        print(f"❌ Kaggle Error:\n{result.stderr}")
        return "Download failed"

    print("✅ Download successful! Locating dataset files...")

    # -------------------------
    # Locate voices directory
    # -------------------------
    voices_dirs = list(temp_download_path.rglob("voices"))

    if not voices_dirs:
        print("❌ voices folder not found anywhere!")
        print("📂 Extracted contents:")
        for p in temp_download_path.rglob("*"):
            print(" -", p)
        return "voices folder missing"

    source_voices_dir = voices_dirs[0]
    print(f"📂 Found voices folder at: {source_voices_dir}")

    # -------------------------
    # Move wav files
    # -------------------------
    count = 0
    for wav_file in source_voices_dir.rglob("*.wav"):
        target_file = voices_dir / wav_file.name
        if not target_file.exists():
            shutil.move(str(wav_file), str(target_file))
            count += 1

    print(f"🚚 Moved {count} .wav files to {voices_dir}")

    # -------------------------
    # Locate & move metadata
    # -------------------------
    metadata_files = list(temp_download_path.rglob("metadata.jsonl"))
    metadata_dst = target_dir / "metadata.jsonl"

    if metadata_files:
        shutil.move(str(metadata_files[0]), str(metadata_dst))
        print(f"📄 Moved metadata.jsonl to {metadata_dst}")
    else:
        print("⚠️ metadata.jsonl not found")

    # -------------------------
    # Verification
    # -------------------------
    has_wavs = any(voices_dir.glob("*.wav"))
    has_metadata = metadata_dst.exists()

    print(f"🔎 WAV files present: {has_wavs}")
    print(f"🔎 Metadata present: {has_metadata}")

    volume.commit()
    print("💾 Saved to permanent volume 'tts-dataset-storage'")

    return f"Dataset ready at {target_dir}"

@app.local_entrypoint()
def main():
    download_to_volume.remote()

Overwriting download_tts_simple.py
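
The download cell flattens every `.wav` under the extracted `voices` folder into one directory and silently skips a file when its name is already taken. The sketch below shows one way to make those skips visible. `flatten_wavs` is a hypothetical helper, not part of the script above, and it assumes the destination directory is not nested inside the source directory.

```python
import shutil
import tempfile
from pathlib import Path
from typing import List, Tuple


def flatten_wavs(source_dir: Path, dest_dir: Path) -> Tuple[List[str], List[str]]:
    """Move every .wav under source_dir into dest_dir, flattening subfolders.

    Returns (moved, skipped): the file names that were moved, and the
    source-relative paths skipped because the name already existed in dest_dir.
    """
    dest_dir.mkdir(parents=True, exist_ok=True)
    moved: List[str] = []
    skipped: List[str] = []
    # sorted() materializes the listing first, so moving files does not
    # disturb the directory walk, and the order is deterministic.
    for wav_file in sorted(source_dir.rglob("*.wav")):
        target = dest_dir / wav_file.name
        if target.exists():
            skipped.append(wav_file.relative_to(source_dir).as_posix())
            continue
        shutil.move(str(wav_file), str(target))
        moved.append(wav_file.name)
    return moved, skipped


# Small demo on a throwaway directory tree with one name collision.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "voices"
    (src / "a").mkdir(parents=True)
    (src / "b").mkdir()
    (src / "a" / "clip.wav").write_bytes(b"first")
    (src / "b" / "clip.wav").write_bytes(b"second")
    (src / "b" / "other.wav").write_bytes(b"third")
    demo_moved, demo_skipped = flatten_wavs(src, Path(tmp) / "out")
    print("moved:", demo_moved)
    print("skipped:", demo_skipped)
```

If the skipped list is non-empty, the dataset probably contains duplicate file names across subfolders, and `metadata.jsonl` entries that refer to the skipped clips may point at the wrong audio.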


In [30]:
!modal run download_tts_simple.py

✓ Initialized. View run at https://modal.com/apps/wartory705/main/ap-CGj1yHkwsbJBrfN97YZr4G
✓ Created objects.
├── 🔨 Created mount /root/download_tt

In [31]:
!modal secret create hf-secret HF_TOKEN=<REDACTED_HF_TOKEN>  # the token needs both read and write access

Created a new secret 'hf-secret' with the key 'HF_TOKEN'

Use it in your Modal app:

@app.function(secrets=[modal.Secret.from_name("hf-secret")])
def some_function():
    os.getenv("HF_TOKEN")

In [6]:
%%writefile train_parler.py
import modal
import os
import subprocess
from pathlib import Path

# -------------------------
# CONFIG
# -------------------------
GPU_CONFIG = "H100:1"
NUM_GPUS = 1

VOLUME_NAME = "tts-dataset-storage"
MOUNT_PATH = Path("/data")
OUTPUT_DIR = MOUNT_PATH / "parler-tts-finetuned-h100"
HF_DATASET_REPO = "SeifElden2342532/parler-tts-dataset-format"

# -------------------------
# DEPENDENCIES (H100 SAFE)
# -------------------------
REQUIREMENTS = [
    "torch==2.4.1",
    "torchaudio==2.4.1",
    "accelerate",
    "datasets[audio]",
    "transformers==4.46.1",
    "pydantic==1.10.17",
    "tqdm",
    "soundfile",
    "scipy",
    "pyyaml",
    "protobuf==4.25.8",
    "wandb",
    "evaluate",
    "jiwer",
    "librosa",
    "bitsandbytes",
    "huggingface_hub",
    "parler-tts @ git+https://github.com/huggingface/parler-tts.git",
]

image = (
    modal.Image.from_registry(
        "nvidia/cuda:12.1.1-devel-ubuntu22.04",
        add_python="3.11",
    )
    .apt_install("git", "ffmpeg", "libsndfile1")
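    # Note: ulimit only affects this build-time shell; it likely does not
    # carry over to the running container.
    .run_commands("ulimit -n 65536")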
    .pip_install(
        *REQUIREMENTS,
        extra_index_url="https://download.pytorch.org/whl/cu121",
    )
)

app = modal.App(
    "parler-tts-h100-finetune",
    image=image,
)

# -------------------------
# TRAIN FUNCTION
# -------------------------
@app.function(
    volumes={str(MOUNT_PATH): modal.Volume.from_name(VOLUME_NAME)},
    timeout=25000,
    gpu=GPU_CONFIG,
    env={
        "FORCE_LIBSNDFILE": "1",
        "HF_AUDIO_DISABLE_TORCHCODEC": "1",
    },
)
def finetune_parler_tts():
    repo_path = Path("/root/parler-tts")

    if not repo_path.exists():
        print("📥 Cloning Parler-TTS repository...")
        subprocess.run(
            ["git", "clone", "https://github.com/huggingface/parler-tts.git", str(repo_path)],
            check=True,
        )

    # -------------------------
    # PATCH KNOWN PARLER-TTS BUGS
    # -------------------------
    import training.data

    data_py_path = Path(training.data.__file__)
    content = data_py_path.read_text()

    buggy_code = (
        'metadata_dataset_names = metadata_dataset_names.split("+") '
        'if metadata_dataset_names is not None else None'
    )
    fixed_code = (
        'metadata_dataset_names = metadata_dataset_names.split("+") '
        'if (metadata_dataset_names is not None and isinstance(metadata_dataset_names, str)) '
        'else [None] * len(dataset_names)'
    )
    if buggy_code in content:
        content = content.replace(buggy_code, fixed_code)

    buggy_eval_code = 'vectorized_datasets["validation"]'
    fixed_eval_code = 'vectorized_datasets["eval"]'
    if buggy_eval_code in content:
        content = content.replace(buggy_eval_code, fixed_eval_code)

    data_py_path.write_text(content)

    training_script_path = repo_path / "training" / "run_parler_tts_training.py"
    script_content = training_script_path.read_text()

    buggy_num_proc = (
        'num_proc=min(data_args.preprocessing_num_workers, '
        'len(vectorized_datasets["eval"]) - 1),'
    )
    fixed_num_proc = "num_proc=1,"
    if buggy_num_proc in script_content:
        script_content = script_content.replace(buggy_num_proc, fixed_num_proc)

    training_script_path.write_text(script_content)

    # -------------------------
    # TRAINING COMMAND
    # -------------------------
    model_name = "parler-tts/parler-tts-mini-v1"
    os.makedirs(OUTPUT_DIR, exist_ok=True)

    training_command = f"""
accelerate launch --num_processes={NUM_GPUS} training/run_parler_tts_training.py \\
  --model_name_or_path "{model_name}" \\
  --train_dataset_name "{HF_DATASET_REPO}" \\
  --train_dataset_config_name "default" \\
  --train_split_name "train" \\
  --eval_dataset_name "{HF_DATASET_REPO}" \\
  --eval_dataset_config_name "default" \\
  --eval_split_name "validation" \\
  --max_train_samples 2000 \\
  --max_eval_samples 200 \\
  --seed 42 \\
  --do_train true \\
  --do_eval true \\
  --preprocessing_num_workers 1 \\
  --evaluation_strategy "epoch" \\
  --description_column_name "text_description" \\
  --prompt_column_name "text" \\
  --target_audio_column_name "audio" \\
  --description_tokenizer_name "google/flan-t5-base" \\
  --prompt_tokenizer_name "google/flan-t5-base" \\
  --save_to_disk "/tmp/parler_dataset_processed" \\
  --temporary_save_to_disk "/tmp/parler_dataset_temp" \\
  --output_dir "{OUTPUT_DIR}" \\
  --overwrite_output_dir true \\
  --per_device_train_batch_size 8 \\
  --per_device_eval_batch_size 8 \\
  --gradient_accumulation_steps 2 \\
  --gradient_checkpointing true \\
  --optim "adamw_bnb_8bit" \\
  --max_steps 400 \\
  --bf16 true \\
  --report_to "none"
"""

    print("\n🚀 Starting Parler-TTS fine-tuning on H100…")
    subprocess.run(training_command, shell=True, check=True, cwd=str(repo_path))

    modal.Volume.from_name(VOLUME_NAME).commit()
    print("\n✅ Fine-tuning complete!")

# -------------------------
# ENTRYPOINT
# -------------------------
@app.local_entrypoint()
def main():
    finetune_parler_tts.remote()

Overwriting train_parler.py
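
The training cell patches the Parler-TTS sources with plain string replacement and skips a patch without any message when the target text is missing, for example after an upstream fix. The sketch below shows one way to report what each patch did. `patch_once` and the sample strings are hypothetical and are not taken from the Parler-TTS code base; the helper assumes the replacement text does not itself contain the text being replaced.

```python
from typing import Tuple


def patch_once(content: str, old: str, new: str) -> Tuple[str, str]:
    """Replace old with new and report what happened.

    Returns (content, status) where status is one of:
      "applied"         - old was found and replaced
      "already-applied" - old is absent but new is present
      "not-found"       - neither snippet is present (upstream may have changed)
    Assumes new does not contain old; otherwise repeated runs would re-apply.
    """
    if old in content:
        return content.replace(old, new), "applied"
    if new in content:
        return content, "already-applied"
    return content, "not-found"


# Demo with made-up source text.
source = 'x = names.split("+") if names is not None else None\n'
old_snippet = "if names is not None else None"
new_snippet = "if isinstance(names, str) else [None]"

patched, first_status = patch_once(source, old_snippet, new_snippet)
_, second_status = patch_once(patched, old_snippet, new_snippet)
_, missing_status = patch_once("unrelated = 1\n", old_snippet, new_snippet)

print(first_status, second_status, missing_status)
```

Logging the status for each patch makes it easier to notice when a pinned workaround has stopped matching the cloned repository.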


In [7]:
!modal run train_parler.py

✓ Initialized. View run at https://modal.com/apps/wartory705/main/ap-fHxZml9vM7yBYgNvoz8SFG
Building image im-pNYI1mfnCZZlZjMvE3lyUJ
└── Creating mount /root/train_parle

In [14]:
%%writefile compare_models.py
import modal
import os
from pathlib import Path

GPU_CONFIG = "H100:1"
VOLUME_NAME = "tts-dataset-storage"
MOUNT_PATH = Path("/data")
FINETUNED_MODEL_PATH = MOUNT_PATH / "parler-tts-finetuned-h100"
BASE_MODEL_NAME = "parler-tts/parler-tts-mini-v1"

REQUIREMENTS = [
    "torch==2.4.1",
    "torchaudio==2.4.1",
    "transformers==4.46.1",
    "parler-tts @ git+https://github.com/huggingface/parler-tts.git",
    "soundfile",
    "scipy",
]

image = (
    modal.Image.from_registry(
        "nvidia/cuda:12.1.1-devel-ubuntu22.04",
        add_python="3.11",
    )
    .apt_install("git", "ffmpeg", "libsndfile1")
    .pip_install(
        *REQUIREMENTS,
        extra_index_url="https://download.pytorch.org/whl/cu121",
    )
)

app = modal.App("parler-tts-comparison", image=image)

@app.function(
    volumes={str(MOUNT_PATH): modal.Volume.from_name(VOLUME_NAME)},
    gpu=GPU_CONFIG,
    timeout=600,
    env={
        "FORCE_LIBSNDFILE": "1",
        "HF_AUDIO_DISABLE_TORCHCODEC": "1",
    },
)
def run_comparison(prompt: str, description: str):
    import torch
    from parler_tts import ParlerTTSForConditionalGeneration
    from transformers import AutoTokenizer
    import soundfile as sf

    device = "cuda" if torch.cuda.is_available() else "cpu"

    def generate_audio(model_id, label):
        print(f"🔊 Loading {label} model…")
        model = ParlerTTSForConditionalGeneration.from_pretrained(model_id).to(device)
        model.eval()

        prompt_tokenizer = AutoTokenizer.from_pretrained(model_id)
        description_tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")

        with torch.inference_mode():
            input_ids = description_tokenizer(
                description, return_tensors="pt"
            ).input_ids.to(device)

            prompt_input_ids = prompt_tokenizer(
                prompt, return_tensors="pt"
            ).input_ids.to(device)

            audio = model.generate(
                input_ids=input_ids,
                prompt_input_ids=prompt_input_ids,
            )

        audio_arr = audio.cpu().numpy().squeeze()
        filename = f"output_{label.lower().replace(' ', '_')}.wav"
        sf.write(filename, audio_arr, model.config.sampling_rate)

        with open(filename, "rb") as f:
            return f.read(), filename

    base_audio, base_file = generate_audio(BASE_MODEL_NAME, "Base")

    if not FINETUNED_MODEL_PATH.exists():
        # Return a plain string so the isinstance(results, str) check in main() catches it.
        return f"❌ Fine-tuned model not found at {FINETUNED_MODEL_PATH}"

    ft_audio, ft_file = generate_audio(str(FINETUNED_MODEL_PATH), "Fine-tuned")

    return {
        "base": (base_audio, base_file),
        "finetuned": (ft_audio, ft_file),
    }

@app.local_entrypoint()
def main():
    test_prompt = (
        "Well, when you play super hard, your muscles get tiny little changes that make them feel a bit tired. But don't you worry, those changes are actually helping them get stronger so you can play even more!"
    )
    test_description = (
        "A male speaker delivers a gentle and moderate-paced speech. The recording is clean with a natural quality. The voice has a neutral pitch."
    )

    print("🎧 Starting comparison inference…")
    results = run_comparison.remote(test_prompt, test_description)

    if isinstance(results, str):
        print(results)
        return

    for key, (audio_data, filename) in results.items():
        with open(filename, "wb") as f:
            f.write(audio_data)
        print(f"✅ Saved {key} result to {filename}")

    print("\n🎉 Comparison complete! Listen to both files.")

Overwriting compare_models.py
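
Before listening to the two outputs, a quick sanity check of each file's sample rate and duration can catch an empty or truncated generation. The sketch below uses only the standard library. `wav_summary` is a hypothetical helper, and it assumes the files are PCM-encoded WAV, which the standard `wave` module can read; other encodings would need a different reader.

```python
import math
import struct
import tempfile
import wave
from pathlib import Path


def wav_summary(path: Path) -> dict:
    """Return sample rate, channel count, and duration in seconds of a PCM WAV file."""
    with wave.open(str(path), "rb") as wf:
        frames = wf.getnframes()
        rate = wf.getframerate()
        return {
            "sample_rate": rate,
            "channels": wf.getnchannels(),
            "duration_s": frames / rate,
        }


# Demo: write half a second of a 440 Hz tone, then summarize it.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo.wav"
    rate = 44100
    samples = [
        int(32767 * 0.3 * math.sin(2 * math.pi * 440 * n / rate))
        for n in range(rate // 2)
    ]
    with wave.open(str(demo_path), "wb") as wf:
        wf.setnchannels(1)   # mono
        wf.setsampwidth(2)   # 16-bit samples
        wf.setframerate(rate)
        wf.writeframes(struct.pack(f"<{len(samples)}h", *samples))
    demo_summary = wav_summary(demo_path)
    print(demo_summary)
```

After a comparison run, calling `wav_summary` on `output_base.wav` and `output_fine-tuned.wav` would show whether both clips have a plausible length for the prompt.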


In [15]:
!modal run compare_models

Using Python module paths will require using the -m flag in a future version of Modal.
Use `modal run -m compare_models` instead.
✓ Initialized. View run at https://modal.com/apps/wartory705/main/ap-XI8SjZyJXwLf26xwJL3bcf
└── Creating mount /root/compare_mo