# Fish Speech Fine-Tuning

The official fine-tuning documentation is available [here](https://speech.fish.audio/finetune/).

## Structuring the Data

In [17]:
!ls ../data/

anta_dataset.txt  raw  train.txt  val.txt  wavs


In [7]:
with open('../data/anta_dataset.txt', 'r') as f:
    anta = f.readlines()

In [11]:
print(f"There are {len(anta)} lines of text data in the dataset!")

There are 19998 lines of text data in the dataset!


In [26]:
anta[0].split('|')[0].replace('.wav', '.lab')

'4726f100376ece0fc1d20cd60b64d17cc63c2b275bd7e0798f9867be8cc38af2.lab'

In [22]:
# `wc -l` without a file argument waits on stdin, so count the directory listing instead
!ls ../data/wavs/ | wc -l


In [32]:
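from tqdm import tqdm

# Write one .lab transcript file next to each .wav file.
for line in tqdm(anta):
    # Each line starts with "<name>.wav|<transcript>".
    name_with_label = line.rstrip('\n').split('|')
    filepath = '../data/wavs/' + name_with_label[0].replace('.wav', '.lab')

    # Mode "x" raises FileExistsError if the file exists, so a re-run never overwrites labels.
    with open(filepath, "x", encoding="utf-8") as f:
        f.write(name_with_label[1])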

100%|██████████████████████████████████████████████████████████████████████| 19998/19998 [00:00<00:00, 30562.67it/s]
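
Because the loop above opens files in mode `"x"`, running the cell a second time stops with `FileExistsError`. Below is a minimal re-runnable sketch, assuming every line has the form `<name>.wav|<transcript>`. The helper name `write_lab_files` is hypothetical and not part of Fish Speech.

```python
from pathlib import Path


def write_lab_files(lines, wav_dir):
    """Write one .lab transcript per dataset line; return how many files were written."""
    wav_dir = Path(wav_dir)
    written = 0
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        # Assumed line format: "<name>.wav|<transcript>"
        wav_name, transcript = line.split("|", 1)
        lab_path = wav_dir / Path(wav_name).with_suffix(".lab").name
        if lab_path.exists():
            continue  # already labelled on a previous run
        lab_path.write_text(transcript, encoding="utf-8")
        written += 1
    return written
```

Calling `write_lab_files(anta, '../data/wavs/')` would then skip any label that already exists instead of raising.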


## Dependencies

In [None]:
!pip3 install torch==2.4.1 torchvision==0.19.1 torchaudio==2.4.1 --index-url https://download.pytorch.org/whl/cu121

In [None]:
# Install fish-speech
# Each `!` command runs in its own subshell, so `cd` must be chained with `&&`
!cd src/fish-speech && pip3 install -e .

In [None]:
# Loudness normalization of the dataset is recommended; fish-audio-preprocess provides it.
# Each `!` command runs in its own subshell, so the steps are chained with `&&`
!cd src/audio-preprocess && pip install -e . && \
    fap loudness-norm ../../data/wavs ../../data/clean --clean

## Batch extraction of semantic tokens

### Downloading the VQGAN weights

In [None]:
!huggingface-cli download fishaudio/fish-speech-1.4 --local-dir checkpoints/fish-speech-1.4

In [2]:
%cd ../src/fish-speech/

/home/caytu/Wolof-TTS/src/fish-speech


In [None]:
# extracting semantic tokens
!python tools/vqgan/extract_vq.py ../../data/clean \
    --num-workers 1 --batch-size 16 \
    --config-name "firefly_gan_vq" \
    --checkpoint-path "checkpoints/fish-speech-1.4/firefly-gan-vq-fsq-8x1024-21hz-generator.pth"

In [None]:
!cp ../../data/wavs/*.lab ../../data/clean/

In [11]:
# Pack the dataset into protobuf
!python tools/llama/build_dataset.py \
    --input "../../data/clean" \
    --output "../../data/clean/protos" \
    --text-extension .lab \
    --num-workers 16

Loading ../../data/clean: 19998it [00:00, 80151.53it/s]
Grouping ../../data/clean: 100%|███████| 19998/19998 [00:00<00:00, 39333.88it/s]
2024-10-29 19:45:34.593 | INFO     | __main__:task_generator_folder:46 - Found 1 groups in ../../data/clean, ['../../data/clean']...
2024-10-29 19:45:39.728 | INFO     | __main__:main:156 - Finished writing 0 shards to ../../data/clean/protos
1it [00:06,  6.03s/it]
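
Before packing, it can help to confirm that every audio file has both a transcript and extracted tokens. This is a minimal sketch, assuming `extract_vq.py` writes a `.npy` file next to each `.wav` and that transcripts use the `.lab` extension. `find_incomplete` is a hypothetical helper, not a Fish Speech tool.

```python
from pathlib import Path


def find_incomplete(data_dir, required=(".lab", ".npy")):
    """Return {wav stem: [missing extensions]} for .wav files lacking a sibling file."""
    missing = {}
    for wav in sorted(Path(data_dir).rglob("*.wav")):
        # A sample is complete when a file with each required extension sits next to it.
        absent = [ext for ext in required if not wav.with_suffix(ext).exists()]
        if absent:
            missing[wav.stem] = absent
    return missing
```

`find_incomplete("../../data/clean")` returns an empty dictionary when every sample is complete.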


## Fine-tuning with LoRA
You can adjust training parameters such as *batch_size* and *gradient_accumulation_steps* to fit your GPU memory by editing `fish_speech/configs/text2semantic_finetune.yaml`.
> By default, the model learns only the speaker's speech patterns, not the timbre, so you still need prompts to keep the timbre stable. To learn the timbre as well, you can increase the number of training steps, but __this may lead to overfitting__.
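
When lowering *batch_size* to fit memory, raising *gradient_accumulation_steps* by the same factor keeps the number of samples per optimizer step unchanged. A small sketch of that arithmetic follows. The values are illustrative and are not the defaults of `text2semantic_finetune.yaml`.

```python
def effective_batch_size(batch_size, gradient_accumulation_steps, num_gpus=1):
    """Number of samples that contribute to one optimizer step."""
    return batch_size * gradient_accumulation_steps * num_gpus


# Illustrative values only: halving the batch and doubling accumulation
# leaves the effective batch size unchanged.
print(effective_batch_size(8, 1))  # 8
print(effective_batch_size(4, 2))  # 8
```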

In [14]:
!mv ../../data/clean .

In [15]:
!mv clean data

In [None]:
!python fish_speech/train.py --config-name text2semantic_finetune \
    project=$project \
    +lora@model.model.lora_config=r_8_alpha_16

After training, you need to convert the LoRA weights to regular weights before performing inference.

In [None]:
!python tools/llama/merge_lora.py \
    --lora-config r_8_alpha_16 \
    --base-weight checkpoints/fish-speech-1.4 \
    --lora-weight results/checkpoints/step_000011200.ckpt \
    --output checkpoints/fish-speech-1.4-yth-lora/
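
Conceptually, merging folds the low-rank update into the base weight. The toy sketch below uses the standard LoRA formula W' = W + (alpha / r) · B · A, and assumes `r_8_alpha_16` means rank 8 and alpha 16 (a scaling factor of 2.0). It illustrates the idea only and is not the code of `merge_lora.py`.

```python
def matmul(x, y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]


def merge_lora(weight, lora_a, lora_b, r, alpha):
    """Return weight + (alpha / r) * (lora_b @ lora_a)."""
    scale = alpha / r
    delta = matmul(lora_b, lora_a)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(weight, delta)]


# Toy 2x2 example with rank 1 (real layers are far larger).
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 2.0]]    # shape (r, in_features)
B = [[0.5], [0.0]]  # shape (out_features, r)
merged = merge_lora(W, A, B, r=1, alpha=2)
print(merged)  # [[2.0, 2.0], [0.0, 1.0]]
```

After merging, inference needs only the single merged matrix, which is why the merged checkpoint can be loaded like a regular one.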

## Inference

Download the required `vqgan` and `llama` models from the `fishaudio/fish-speech-1.4` Hugging Face repository.

In [18]:
!huggingface-cli download fishaudio/fish-speech-1.4 --local-dir checkpoints/fish-speech-1.4

Fetching 8 files: 100%|█████████████████████████| 8/8 [00:00<00:00, 3897.60it/s]
/home/caytu/Wolof-TTS/src/fish-speech/checkpoints/fish-speech-1.4


### Generate a prompt from a voice
> If you plan to let the model randomly choose a voice timbre, you can skip this step.

In [22]:
!cp ../../data/wavs/0ca4770a33bb4a238270029a43694cae7999903e3bdb5257fae69da565f57eff.wav ./sample_timbre.wav

In [28]:
!python tools/vqgan/inference.py \
    -i "sample_timbre.wav" \
    --checkpoint-path "checkpoints/fish-speech-1.4/firefly-gan-vq-fsq-8x1024-21hz-generator.pth"

2024-10-30 10:34:16.526 | INFO     | __main__:load_model:43 - Loaded model: <All keys matched successfully>
2024-10-30 10:34:16.527 | INFO     | __main__:main:72 - Processing in-place reconstruction of sample_timbre.wav
2024-10-30 10:34:16.545 | INFO     | __main__:main:83 - Loaded audio with 6.22 seconds
2024-10-30 10:34:16.900 | INFO     | __main__:main:91 - Generated indices of shape torch.Size([8, 134])
2024-10-30 10:34:17.296 | INFO     | __main__:main:110 - Generated audio of shape torch.Size([1, 1, 274432]), equivalent to 6.22 seconds from 134 features, features/second: 21.53
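
The figures in the log above are consistent with a 44.1 kHz sample rate and 2048 audio samples per token frame. Both values are inferred from the logged numbers rather than read from the model config, so treat them as assumptions.

```python
SAMPLE_RATE = 44_100   # assumption inferred from the log, not read from the config

num_samples = 274_432  # from "Generated audio of shape torch.Size([1, 1, 274432])"
num_frames = 134       # from "Generated indices of shape torch.Size([8, 134])"

duration = num_samples / SAMPLE_RATE
frames_per_second = num_frames / duration
samples_per_frame = num_samples // num_frames

print(f"{duration:.2f} s, {frames_per_second:.2f} frames/s, {samples_per_frame} samples/frame")
# 6.22 s, 21.53 frames/s, 2048 samples/frame
```

The same ratio holds for the decoded output later in the notebook: 239616 samples over 117 features is again 2048 samples per frame.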

### Generate semantic tokens from text
> This command creates a `codes_N.npy` file in the working directory, where __N__ is an integer starting from 0.
You can add `--compile` to fuse CUDA kernels for faster inference (~30 tokens/second -> ~500 tokens/second).
If you do not need this acceleration, remove the `--compile` flag.
For GPUs that do not support `bf16`, you may need to add the `--half` flag.

In [29]:
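# Optional voice-prompt flags, to be added to the command below when needed:
#    --prompt-text "Anta's speech is very clear, and she speaks in a very expressive voice, really slowly and with minimal variation in speed." \
#    --prompt-tokens "fake.npy" \
# They are kept out of the command itself: a `#` in the middle of a continued
# command keeps the flags after it from reaching the script.
!python tools/llama/generate.py \
    --text "Màngi tuddu Anta, di wax ak yéen ci kàllaamay wolof!" \
    --checkpoint-path "checkpoints/fish-speech-1.4/" \
    --num-samples 2 \
    --compile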

2024-10-30 10:35:30.848 | INFO     | __main__:main:662 - Loading model ...
2024-10-30 10:35:38.476 | INFO     | __main__:load_model:360 - Restored model from checkpoint
2024-10-30 10:35:38.476 | INFO     | __main__:load_model:364 - Using DualARTransformer
2024-10-30 10:35:38.487 | INFO     | __main__:main:676 - Time to load model: 7.64 seconds
2024-10-30 10:35:38.496 | INFO     | __main__:generate_long:448 - Encoded text: Màngi tuddu Anta, di wax ak yéen ci kàllaamay wolof!
2024-10-30 10:35:38.496 | INFO     | __main__:generate_long:466 - Generating sentence 1/1 of sample 1/1
  3%|█▏                                      | 116/4056 [00:08<04:56, 13
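
Because repeated runs can leave several `codes_N.npy` files in the working directory, a small helper can list them in numeric order before decoding. This sketch uses only the standard library, and `list_code_files` is a hypothetical name.

```python
import re
from pathlib import Path


def list_code_files(directory="."):
    """Return codes_N.npy paths in `directory`, sorted by the integer N."""
    pattern = re.compile(r"codes_(\d+)\.npy")
    found = []
    for path in Path(directory).iterdir():
        match = pattern.fullmatch(path.name)
        if match:
            found.append((int(match.group(1)), path))
    # Sorting by the parsed integer puts codes_10 after codes_2.
    return [path for _, path in sorted(found)]
```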

### Generate vocals from semantic tokens
__VQGAN Decoder__

In [30]:
!python tools/vqgan/inference.py \
    -i "codes_0.npy" \
    --checkpoint-path "checkpoints/fish-speech-1.4/firefly-gan-vq-fsq-8x1024-21hz-generator.pth"

2024-10-30 10:36:02.390 | INFO     | __main__:load_model:43 - Loaded model: <All keys matched successfully>
2024-10-30 10:36:02.391 | INFO     | __main__:main:96 - Processing precomputed indices from codes_0.npy
2024-10-30 10:36:02.997 | INFO     | __main__:main:110 - Generated audio of shape torch.Size([1, 1, 239616]), equivalent to 5.43 seconds from 117 features, features/second: 21.53
2024-10-30 10:36:03.005 | INFO     | __main__:main:117 - Saved audio to fake.wav


In [31]:
from IPython.display import Audio

Audio('fake.wav')