Master Deep Voice Cloning in Minutes: Unleash Your Vocal Superpowers! Free and Locally on Your PC #250

FurkanGozukara · 2025-10-24T15:13:14Z

Full tutorial: https://www.youtube.com/watch?v=OiMRlqcgDL0

Today, we're going to dive into the cutting-edge world of voice cloning and speech synthesis, a technology that has the potential to revolutionize communication as we know it. We're going to make use of TorToiSe TTS, an incredibly powerful yet user-friendly open-source tool for training and generating speech. And the best part? With TorToiSe TTS Fast it is fast and doesn't require you to be a tech wizard! This step-by-step tutorial will guide you through the whole process, from A to Z. By the end of this tutorial, you'll be able to clone any voice and generate speech so realistic, it will sound just like the original! It's almost like being able to shape-shift voices!

Source GitHub File ⤵️

https://github.com/FurkanGozukara/Stable-Diffusion/blob/main/Tutorials/Deep-Voice-Clone-Tutorial-Tortoise-TTS.md

Scripts Patreon Download Link ⤵️

https://www.patreon.com/posts/voice-clone-82712205

Our Discord server ⤵️

https://bit.ly/SECoursesDiscord

If I have been of assistance to you and you would like to show your support for my work, please consider becoming a patron on 🥰 ⤵️

https://www.patreon.com/SECourses

Technology & Science: News, Tips, Tutorials, Tricks, Best Applications, Guides, Reviews ⤵️

https://www.youtube.com/playlist?list=PL_pbwdIyffsnkay6X91BWb9rrfLATUMr3

Playlist of StableDiffusion Tutorials, Automatic1111 and Google Colab Guides, DreamBooth, Textual Inversion / Embedding, LoRA, AI Upscaling, Pix2Pix, Img2Img ⤵️

https://www.youtube.com/playlist?list=PL_pbwdIyffsmclLl0O144nQRnezKlNdx3

00:00:00 Introduction to Deep Voice Cloning Tutorial with a cloned voice

00:02:36 Preparing speech training data by installing and using OZEN Toolkit

00:07:27 How to merge multiple audio files into a single audio file

00:12:44 Installing DL-Art-School DLAS to do voice training and cloning with Tortoise TTS

00:13:57 Setting up DLAS training interface

00:16:59 Explanation of what is batch size, step count and epoch count

00:20:12 Explanation of iteration count, epoch number and loss value during training

00:21:24 Display loss values during training as a graph

00:23:20 Starting speech synthesis after training has been completed

00:23:49 How to install Tortoise TTS Fast library

00:26:07 How to use speech synthesizing commands with Tortoise TTS Fast

00:31:35 Fix inference.py file for my custom easy script to work

00:36:20 How to use Adobe Podcast to improve speech sound quality

00:38:24 How to automatically synthesize speech with drag and drop

00:43:12 Low VRAM settings for weaker GPUs

00:45:25 Automatic checkpoint comparison script

00:50:26 Latest versions of all scripts

Comprehensive Guide to Voice Cloning with TorToiSe TTS and Preprocessing with OZEN Toolkit

Welcome, tech enthusiasts and data scientists alike, to a thrilling journey through the revolutionary landscape of voice cloning and speech synthesis technologies. This in-depth tutorial is aimed at demystifying the complex aspects of this innovative technology, breaking it down into manageable, easy-to-follow steps. Today, we are going to explore the robust capabilities of TorToiSe TTS, a state-of-the-art, open-source tool that stands out for its incredible power, user-friendly interface, and quick results, even for those with a non-technical background.

Next, we'll guide you through the process of setting up TorToiSe TTS Fast, a streamlined version of the tool that optimizes speed without compromising on the quality of the output. Whether you are a seasoned programmer or just starting out in the world of deep learning, we will walk you through the installation process, system requirements, and a detailed overview of the user interface, ensuring you are well-equipped to navigate the tool with ease.

Following that, we'll delve into the intricate task of data pre-processing, leveraging the prowess of the OZEN toolkit, a widely acclaimed pre-processing tool designed to enhance the quality and effectiveness of your training datasets. We'll explain how to use it to clean up and format your voice data, thereby optimizing it for effective training.

After we've preprocessed the data, the tutorial will move into the heart of the voice cloning process: training and fine-tuning the model using the DL-Art-School (DLAS) library. We'll walk you through the different parameters you can adjust, how to interpret the training progress, and the techniques for fine-tuning the model to generate high-quality synthetic speech.

By the conclusion of this comprehensive tutorial, you will be well-versed in the art of voice cloning, equipped with the skills to transform any voice and generate speech so lifelike, it's virtually indistinguishable from the original. Imagine the ability to shape-shift voices, making them sound just the way you want - it's no longer science fiction, but a skill you can master today. So, buckle up for an informative ride into the world of voice cloning and speech synthesis with TorToiSe TTS and the OZEN toolkit.

#TextToSpeech #DeepFake #VoiceOver #VoiceTraining #VoiceCloning #DeepVoice

Video Transcription

00:00:00 Hello and welcome one and all.
00:00:02 Today we are embarking on an exciting journey through the most intuitive and comprehensive
00:00:08 deep voice cloning tutorial you will find anywhere.
00:00:12 Our mission?
00:00:13 To show you how to train your voice and generate hours of speech in just a single click.
00:00:19 This tutorial is built on open source projects, meaning it is entirely free and runs directly
00:00:25 on your own computer.
00:00:27 You may have heard of pricey services like Eleven Labs that offer voice training and
00:00:31 speech synthesis, but why break the bank when you can achieve stunning results without any
00:00:37 prior technical knowledge?
00:00:39 That is right, all you need to do is follow along with me.
00:00:43 Interestingly, the voice you are hearing now is a deep clone.
00:00:48 Drop a comment to let me know what you think and who it reminds you of.
00:00:52 We will switch to my natural voice after this introduction, so stay tuned.
00:00:57 Every step of this tutorial is meticulously detailed, both in this video and in the accompanying
00:01:04 GitHub introduction file.
00:01:06 This file is your one stop shop packed with all the information you will need.
00:01:11 You will find the link in the description and the pinned comment, so don't forget to
00:01:15 check it out.
00:01:16 Are you wondering why this tutorial stands out from the rest?
00:01:20 It is because with a simple drag and drop, your speech file can be transformed into your
00:01:25 complete training data.
00:01:26 Thanks to our chosen user friendly graphical user interface, you can set up all training
00:01:32 parameters and launch your training with just a single click.
00:01:36 Once the training is complete, my specially developed scripts will enable you to convert
00:01:41 an entire text file into speech, again, just by dragging and dropping.
00:01:46 What is more, each step is clearly documented and demonstrated in this tutorial video, making
00:01:52 it incredibly easy for you to follow along.
00:01:56 This tutorial offers a seamless and convenient process unlike any other.
00:02:01 So are you ready to jump into the world of voice cloning?
00:02:05 Let's get started.
00:02:07 So there will be three parts of this tutorial.
00:02:10 From now on, I will use our new GitHub repository to add these commands and other links instead
00:02:17 of using Gist.
00:02:19 The link of this file will be in the description and everything you need will be in this file.
00:02:24 You see it is neatly organized.
00:02:26 It is also much better looking.
00:02:29 If you also star our repository and watch our repository, I would appreciate that very
00:02:34 much.
00:02:35 The first step is preparing training speech files.
00:02:38 To do that, first you need to accept the license terms of this link.
00:02:43 This is a segmentation model.
00:02:45 Hugging Face requires you to accept their license to use this.
00:02:49 Currently, I am logged into my account.
00:02:51 You can register for free to use Hugging Face.
00:02:54 Then we will begin with cloning this repository.
00:02:57 This is an open source tool that will allow us to preprocess the speech files.
00:03:02 This is making our job very, very easy.
00:03:05 Preprocessing of audio files is really hard and extremely important.
00:03:10 So I will clone it into this directory, git clone, then we need to install it.
00:03:16 For installation, you need to have Anaconda or Miniconda.
00:03:19 The links are in here.
00:03:21 Also, I have shown how to install in this tutorial.
00:03:24 So when I type Ana, it displays me Anaconda prompt like this.
00:03:29 That means that it is installed.
00:03:31 Once you installed Anaconda or Miniconda, enter inside the Ozen Toolkit directory and
00:03:37 double click set up Ozen.bat file.
00:03:39 You see it says that it found Anaconda.
00:03:41 However, since there is already virtual environment with the Ozen name, I have to delete it to
00:03:47 reinstall.
00:03:48 Now I will do that.
00:03:50 So I will enter inside this folder and my Anaconda virtual environments are here.
00:03:55 I will delete both of them.
00:03:57 Okay the folders are deleted.
00:03:58 Now I will run the setup Ozen.bat file again.
00:04:02 Okay.
00:04:03 It says that it has found Anaconda.
00:04:05 It made the Ozen virtual environment directory, and now it will collect all the packages and
00:04:12 install them.
00:04:13 This will take a while.
00:04:14 So be patient.
00:04:15 After a while, you will see that it is installing all the necessary libraries like this.
00:04:20 Okay.
00:04:21 The installation has been completed.
00:04:23 Now I will show you what kind of output you are expecting.
00:04:28 It will show all of the packages collected, installed like this.
00:04:32 There isn't any error.
00:04:34 Okay.
00:04:35 In the bottom, you will see attempting to uninstall, found installation, installing.
00:04:39 And in the very bottom, you should see a message like this.
00:04:43 And in the end, it says that it is done.
00:04:47 So hit the key to continue.
00:04:49 Once the installation has been completed, now we are ready to preprocess our training
00:04:54 data.
00:04:56 So for preprocessing, all we need to do is drag and drop our speech file into here.
00:05:01 However, there are some key issues.
00:05:04 I will right click Ozen.py and I will edit it with Notepad++.
00:05:09 So in this file, there are several settings that you may want to change.
00:05:14 It will use Whisper for transcribing.
00:05:16 I already have an amazing Whisper tutorial on my channel if you are interested in it.
00:05:22 I also put the link of that into this file.
00:05:25 Whisper tutorial.
00:05:26 So what about Whisper?
00:05:28 It is by default using large version 2, which requires about 10 to 11 GB VRAM.
00:05:34 If you don't have such a graphics card, what can you do?
00:05:37 You can use the other models written here.
00:05:40 For example, medium.en, it will use about 5 GB of VRAM, copy it.
00:05:46 And for changing it, you need to just change the last part of it like this.
00:05:51 And it will work.
00:05:52 There is a trade-off that you need to make.
00:05:55 If you don't have sufficient amount of VRAM memory, the transcribing quality will be lower.
00:06:00 I will use a large model since I have RTX 3090.
00:06:04 And there are also other settings.
00:06:06 I am not changing any of the other settings because they are working.
00:06:10 By the way, also, make sure to change the value in here as well.
00:06:14 Make sure that both here and here are the same for Whisper.
00:06:19 However, there is one other option, which will be much slower.
00:06:22 If you don't want to use medium model, you can change this default CUDA to CPU.
00:06:28 When you do this, it will use CPU for transcribing.
00:06:34 It will be significantly slower, but large model will also work on lower VRAM graphic
00:06:40 cards.
00:06:41 However, I would prefer medium model if I were you.
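The trade-off described above can be summarized in a few lines of Python. This is only a sketch, not code from the Ozen toolkit: the function name `pick_whisper_setup` is hypothetical, and the VRAM figures are the rough numbers quoted in the video (about 10 to 11 GB for large-v2, about 5 GB for medium.en).

```python
# Approximate VRAM needs quoted in the tutorial, in GB.
# Other Whisper sizes are left out because the tutorial gives no numbers for them.
APPROX_VRAM_GB = {
    "large-v2": 11.0,   # "about 10 to 11 GB"
    "medium.en": 5.0,   # "about 5 GB"
}


def pick_whisper_setup(free_vram_gb: float) -> tuple[str, str]:
    """Return (model_name, device) for a given amount of free VRAM (hypothetical helper)."""
    # Try the most demanding model first and fall back to smaller ones.
    for name, needed in sorted(APPROX_VRAM_GB.items(), key=lambda kv: -kv[1]):
        if free_vram_gb >= needed:
            return name, "cuda"
    # Nothing fits on the GPU: run the large model on the CPU instead,
    # which the tutorial says works but is significantly slower.
    return "large-v2", "cpu"


print(pick_whisper_setup(24))  # an RTX 3090 class card
print(pick_whisper_setup(8))   # a mid-range card
```

Whatever pair you end up with, remember that the model name has to be changed in both places in Ozen.py, as mentioned above.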
00:06:45 Okay, next, I will move my speech file into here.
00:06:48 Then all you need to do is drag and drop it into this CMD window like this.
00:06:54 It will open a new CMD window and it will start processing files.
00:07:00 Since we didn't set our Hugging Face token, we need to set it in here.
00:07:05 To get your Hugging Face token, you can directly open this link.
00:07:09 You need to have a free Hugging Face account.
00:07:11 Then click New Token.
00:07:14 Type anything you want.
00:07:15 Generate Token.
00:07:16 Copy Token.
00:07:17 Right click and paste it here.
00:07:19 Hit Enter.
00:07:20 And it will start pre-processing your speech data.
00:07:23 If your speech data is not a single file, what can you do?
00:07:28 You can merge all of them into a single file.
00:07:30 To merge multiple audio files into a single file you can use this ffmpeg -i command I
00:07:38 have written here.
00:07:39 You just need to give the file names inside a given folder.
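If you prefer to build the merge job from Python, here is a minimal sketch. It is not the exact command from the GitHub file: it uses ffmpeg's concat demuxer (`-f concat`), and the helper name `build_concat_job` and the list file name `concat_list.txt` are made up for illustration. Stream copy (`-c copy`) only works when all input files share the same codec and parameters.

```python
from pathlib import Path


def build_concat_job(files: list[str], output: str, list_path: str = "concat_list.txt"):
    """Return (list_file_text, ffmpeg_command) for ffmpeg's concat demuxer."""
    # The concat demuxer reads one "file '<path>'" directive per line.
    # Paths that contain single quotes would need extra escaping.
    list_text = "".join(f"file '{Path(f).as_posix()}'\n" for f in files)
    command = [
        "ffmpeg",
        "-f", "concat",   # use the concat demuxer
        "-safe", "0",     # allow arbitrary paths in the list file
        "-i", list_path,
        "-c", "copy",     # copy the audio stream without re-encoding
        output,
    ]
    return list_text, command


list_text, command = build_concat_job(["part1.mp3", "part2.mp3"], "merged.mp3")
print(list_text)
print(" ".join(command))
```

To actually run it, write `list_text` to `concat_list.txt` and pass `command` to `subprocess.run(command, check=True)` with ffmpeg available on your PATH.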
00:07:43 You will also get some warnings like this.
00:07:46 You can ignore them.
00:07:48 It will take quite a while on this screen.
00:07:50 You won't see anything.
00:07:52 However, you see, my CPU usage is 100% because the script is currently finding the speech-less
00:08:01 parts in the given speech file.
00:08:03 And then splitting the speech into small parts because the training depends on small speech
00:08:11 files.
00:08:12 Okay, after the script processed the file, it started Whisper.
00:08:15 However, I have an error because apparently the script owner didn't fix its Torch version
00:08:22 installation.
00:08:23 So what we are going to do is we will install Torch ourselves with CUDA enabled.
00:08:29 To do that, first you need to activate this Ozen virtual environment.
00:08:34 How to do that?
00:08:36 Type anaconda.
00:08:37 You will see this.
00:08:38 Open it.
00:08:39 Then type conda activate the virtual environment name like this.
00:08:45 Okay, apparently since I also have Miniconda, I need to open Miniconda.
00:08:50 So I will open Miniconda3.
00:08:52 And now Miniconda started.
00:08:54 I am not deleting these parts because you may also encounter problems and this is how
00:09:00 you are going to solve your problems.
00:09:02 So type conda activate Ozen.
00:09:05 Then we will do pip uninstall Torch.
00:09:08 Okay, apparently 1.13 was installed.
00:09:11 I will uninstall it.
00:09:13 Then you need to execute this command.
00:09:15 It will install Torch version 1.13 with CUDA support like this.
00:09:20 Okay, new Torch installation has been completed.
00:09:23 There is no error.
00:09:25 Closing it.
00:09:26 Then I will close this screen too.
00:09:28 Then I will drag and drop the generated WAV file like this.
00:09:32 It will start processing again.
00:09:34 Each time when you preprocess a speech file, it will generate folders inside output folder.
00:09:41 You see this is the previous one we made with the name of trainingspeech.mp3.
00:09:45 When I open it, you will see WAVs folder here, train and valid txt here.
00:09:52 These txt files will be written with the whisper transcription.
00:09:55 Since the last time failed, they are empty.
00:09:58 However, we can see the segmented WAV files here correctly.
00:10:03 And now it is doing the segmentation once again.
00:10:06 Okay, it is segmented.
00:10:08 So it will start processing whisper with transcribing.
00:10:11 Yes, we are seeing the whisper transcribing.
00:10:15 While transcribing, you can open the text files and you will see the WAV files with
00:10:19 their name and the transcription of them like this.
00:10:24 The speech file will be split into small parts as I said.
00:10:27 When I click more, I will show you their duration.
00:10:31 So I pick duration here.
00:10:33 Okay, duration isn't empty.
00:10:35 Let's add another sort option.
00:10:38 Okay, the length is working.
00:10:40 So these are the durations of the split files.
00:10:44 Maximum duration is about 23 seconds.
00:10:48 Let me open one of them to show you.
00:10:51 It is exactly this problem which I am wrestling with at present in the basement of the house.
00:10:55 So this is the 30 numbered file.
00:10:58 Let's see its transcription.
00:10:59 Okay, here is its transcription.
00:11:02 It is exactly this problem which I am wrestling with at present in the basement of the house.
00:11:08 So let's listen again.
00:11:09 It is exactly this problem which I am wrestling with at present in the basement of the house.
00:11:13 As you see, it is correct.
00:11:15 You will see the process here.
00:11:17 It has segmented our entire speech file into 691 segments and then it will transcribe each
00:11:25 one of them and add them into the text file.
00:11:28 Without this repository, this would be a huge task to preprocess them.
00:11:34 However, thanks to the public Ozen library, it is done quickly and easily.
00:11:41 The transcribing of all of the files has been completed.
00:11:46 Now before moving to the next part, there is a very important thing that I had to do
00:11:52 a lot of research to find out. Go to your output.
00:11:55 Go to your folder.
00:11:57 Open train.txt file.
00:12:00 Select all with Ctrl A then Ctrl C copy.
00:12:04 Then go to encoding.
00:12:06 Make it UTF-8.
00:12:08 I am using Notepad++ by the way.
00:12:10 Then you may notice some of the characters are displayed like this.
00:12:15 These are errors because of the encoding.
00:12:18 Then paste it and they will be fixed and save it.
00:12:21 Then open the valid.txt file as well.
00:12:24 Ctrl A. Ctrl C. Go to encoding.
00:12:28 Make it UTF-8.
00:12:30 Ctrl V. Save and close.
00:12:32 If you don't do this, then you are likely to encounter errors in the next part.
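The same re-encoding can be scripted instead of done by hand in Notepad++. The sketch below is an assumption-based alternative, not part of the tutorial's scripts: the helper name `resave_as_utf8` is hypothetical, and it guesses that a file which is not already valid UTF-8 was written in the Windows code page cp1252.

```python
def resave_as_utf8(path: str, fallback_encoding: str = "cp1252") -> None:
    """Rewrite a text file as UTF-8 without a BOM (hypothetical helper)."""
    with open(path, "rb") as f:
        raw = f.read()
    try:
        text = raw.decode("utf-8")            # already valid UTF-8: keep the content as is
    except UnicodeDecodeError:
        text = raw.decode(fallback_encoding)  # assumed original encoding
    # newline="" keeps the original line endings untouched.
    with open(path, "w", encoding="utf-8", newline="") as f:
        f.write(text)


# Usage, run inside the dataset folder produced by the Ozen toolkit:
# for name in ("train.txt", "valid.txt"):
#     resave_as_utf8(name)
```

If your files were written in a different encoding, pass it as `fallback_encoding`. Open the result afterwards and check that the special characters look right.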
00:12:38 So now we have our WAV files.
00:12:40 We have our train and valid.txt files and we are ready.
00:12:44 As the second part, we will install DL Art School for training.
00:12:50 DL Art School is a fork of this one.
00:12:54 So to install it, we will begin with doing a git clone.
00:12:58 I open my folder.
00:13:00 Open a new cmd.
00:13:02 Clone it inside there.
00:13:04 It is cloned.
00:13:05 Then enter inside the folder and run setup DLAS.bat file like this.
00:13:12 You see it has found Anaconda.
00:13:15 It is generating DLAS virtual environment inside my miniconda installation.
00:13:20 This will also take a while, so be patient.
00:13:23 The installation has been completed.
00:13:26 Now I will show you the messages that I have got.
00:13:29 So in the end, you will see messages like this.
00:13:32 Installation messages.
00:13:33 Successfully installed.
00:13:35 And then it will download autoregressive model.
00:13:38 It may take a while depending on your download speed.
00:13:41 And finally, all of the installations are completed.
00:13:45 Press any key to continue.
00:13:46 Then we are ready to start our training.
00:13:49 To do that, click start DLAS.cmd file.
00:13:53 And it will start this very nice graphical user interface for us.
00:13:57 First, decide your project name.
00:14:00 The files will be saved inside a folder named after your project.
00:14:06 So let's say voiceclone like this.
00:14:09 GPU IDs you can define.
00:14:11 I will use my first GPU.
00:14:13 You don't need to change any of these variables.
00:14:16 Okay, the first very important thing is setting our data set path.
00:14:21 Click three dots icon here.
00:14:24 Go to your ozen-toolkit directory.
00:14:27 Inside here output.
00:14:28 And this is my final training data set.
00:14:31 I am picking the main folder like this.
00:14:34 Not the WAVs files.
00:14:37 Select folder.
00:14:38 If you select incorrect folder, it will give you an error.
00:14:41 Let me show you.
00:14:42 For example, let's select here.
00:14:43 You see data set folder is not valid.
00:14:46 So you need to pick a valid data set folder like this.
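Before pointing DLAS at a folder, you can check it yourself with a few lines of Python. This is a sketch based only on what the video shows inside the output folder (a WAVs folder plus the train and valid text files). The exact folder name `wavs` and the helper `missing_dataset_parts` are assumptions, and the real validity check inside DLAS may look at more than this.

```python
from pathlib import Path


def missing_dataset_parts(dataset_dir: str) -> list[str]:
    """List the expected parts that are missing from a dataset folder (hypothetical helper)."""
    root = Path(dataset_dir)
    missing = []
    # The segmented audio clips live in a sub-folder (assumed to be named "wavs").
    if not (root / "wavs").is_dir():
        missing.append("wavs/")
    # The Whisper transcriptions are stored next to it.
    for name in ("train.txt", "valid.txt"):
        if not (root / name).is_file():
            missing.append(name)
    return missing


# Usage: an empty list means the main folder (not the wavs folder) was selected.
# print(missing_dataset_parts(r"path\to\ozen-toolkit\output\your_dataset"))
```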
00:14:50 Then there are these variables that you should set.
00:14:53 This is hard to decide actually.
00:14:55 So as you increase your train batch size, it will be faster training because at each
00:15:01 step it will train multiple files at the same time.
00:15:06 However, it will also increase your VRAM usage.
00:15:09 Since we need to do a lot of training, you should set both of these train batch size
00:15:13 and validation batch size as much as possible.
00:15:17 So you can try bigger values until you get out of memory error.
00:15:22 I am clicking auto settings and it is setting some variables like this as you are seeing.
00:15:27 Then there is one important thing.
00:15:30 Logger settings.
00:15:31 So we should save checkpoints.
00:15:34 Because you may decide to interrupt your training.
00:15:37 And if you don't save any checkpoint, then you won't have anything to use.
00:15:43 If you also save states, it will save the state of the model training.
00:15:47 But this will take a lot of space on your computer.
00:15:50 So be careful.
00:15:51 I will make print status 5, save checkpoints 5 and visual debug frequency 5.
00:15:56 Then we are ready.
00:15:58 We just need to click start training.
00:16:01 After you click start training, you will see messages in your CMD window.
00:16:05 Apparently, we have an error and it looks like Torch error.
00:16:10 So now I will reinstall the Torch on the virtual environment of this repository.
00:16:17 So first we will activate DLAS.
00:16:20 I am copying this command, starting my Miniconda, typing conda activate with the DLAS virtual
00:16:27 environment.
00:16:28 Then pip uninstall Torch like this.
00:16:31 It has installed version 2.
00:16:33 I will uninstall it.
00:16:34 Then I will execute this command.
00:16:36 This is the latest Torch version from the PyTorch website.
00:16:40 OK, the Torch installation has been completed.
00:16:43 Let's restart.
00:16:44 Double click start DLAS.cmd file.
00:16:47 The user interface started.
00:16:49 It will auto load all of the previous settings as you are seeing right now.
00:16:53 Just click start training and the training is starting.
00:16:56 It will display all of the parameters like this.
00:17:00 Now I will explain number of steps, number of epochs and how it is working.
00:17:06 You see it displays that number of training data elements is 552.
00:17:12 This is automatically set and number of iterations is 4.
00:17:16 So how did it come with these numbers?
00:17:20 We did set our train batch size as 138.
00:17:26 And we have 552 training sound files in the training data set.
00:17:32 Therefore, with 4 steps it will be able to process 552 sound files.
00:17:41 Because batch size is 138.
00:17:43 And we did set number of maximum steps to 50,000.
00:17:49 Therefore, it will do 50,000 training steps, and 4 steps make 1 epoch,
00:17:57 which means it will do 12,500 epochs in total.
00:18:02 Just pause the video and read the description I have written in here and you will understand
00:18:08 how the number of iterations, number of epochs, number of total steps are decided, calculated
00:18:17 and printed on the screen.
00:18:19 It took me a while to understand how it is working and you won't find this information
00:18:25 anywhere else.
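The arithmetic above fits in a few lines of Python. This is a sketch with a hypothetical helper name, `training_schedule`. Rounding the last partial batch up is an assumption; with the numbers from the video the division is exact, so it makes no difference here.

```python
import math


def training_schedule(num_samples: int, batch_size: int, max_steps: int):
    """Return (steps_per_epoch, total_epochs) for a step-limited training run."""
    # One step processes one batch, so an epoch needs this many steps
    # to see every training file once.
    steps_per_epoch = math.ceil(num_samples / batch_size)
    # The run stops after max_steps, which fixes the number of full epochs.
    total_epochs = max_steps // steps_per_epoch
    return steps_per_epoch, total_epochs


# Values from the video: 552 training files, batch size 138, 50,000 max steps.
print(training_schedule(552, 138, 50_000))  # -> (4, 12500)
```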
00:18:26 You may get some warnings like this.
00:18:28 You can ignore them.
00:18:29 And finally, the training started.
00:18:32 You see it is using autoregressive.pth file, which I said it was downloading.
00:18:38 It is also located inside this folder under experiments.
00:18:42 Okay, training is going on right now.
00:18:45 It is using this much GPU.
00:18:48 There are also other open applications such as video recording and other things.
00:18:52 Therefore, this is probably higher than what it should be.
00:18:56 As I said, if you reduce your batch size, it will use less VRAM.
00:19:01 However, increasing your batch size to maximum possible batch size is better.
00:19:06 Okay, the first epoch is completed and it has saved the files inside experiments inside
00:19:14 my name which is voice clone.
00:19:17 My project name inside models.
00:19:19 You see this is 0_gpt.pth file.
00:19:23 This is the file where our trained voice is.
00:19:28 And since we are saving the states, the state file is also saved here.
00:19:33 After the initial zero state, it will save after every five epochs.
00:19:37 Why?
00:19:38 Because it is the setting that we set.
00:19:40 We set as save after every five epochs.
00:19:44 So be careful with that because each file is like one gigabyte for state and for model
00:19:51 checkpoints, it is like 1.6 gigabytes.
00:19:55 So it will take a lot of space in your computer if you set it frequently saving checkpoints.
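To get a feeling for how quickly the disk fills up, here is a rough estimate in Python. It is only a sketch: the helper name is hypothetical, the file sizes are the approximate ones mentioned above (about 1.6 GB per model checkpoint and about 1 GB per state file), and it assumes one extra save for the initial zero checkpoint.

```python
def checkpoint_disk_usage_gb(total_epochs: int, save_every: int,
                             model_gb: float = 1.6, state_gb: float = 1.0,
                             save_states: bool = True) -> float:
    """Estimate the disk space used by saved checkpoints, in GB (rough sketch)."""
    # One save for the initial 0 checkpoint, then one every `save_every` epochs.
    num_saves = total_epochs // save_every + 1
    per_save = model_gb + (state_gb if save_states else 0.0)
    return num_saves * per_save


# 40 epochs, saving every 5 epochs with states, as in the video.
print(round(checkpoint_disk_usage_gb(40, 5), 1))
# The same run without state files.
print(round(checkpoint_disk_usage_gb(40, 5, save_states=False), 1))
```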
00:20:00 Okay, we are over 40 epochs and we are ready to start generating our cloned voice.
00:20:07 However, there are several important things that I want to explain you.
00:20:12 As the training continues,
00:20:14 you will notice that it displays epoch number, iteration count, and most importantly, loss
00:20:21 values as you are seeing right now.
00:20:23 So when the loss value stays the same or starts increasing, that means that you have reached
00:20:31 the point where training is completed.
00:20:35 Also in the upcoming part of the video, I will show how to compare different checkpoints.
00:20:41 But this is important.
00:20:42 We want to have as low a loss value as possible and also do checkpoint comparison.
00:20:50 And if you also carefully look with every epoch, you will see the loss value we are
00:20:56 getting is decreasing.
00:20:58 It starts with 2.97, then 2.92, then 2.89, then 2.85.
00:21:08 The loss value is a term used in machine learning.
00:21:11 You can ask ChatGPT to learn more about this.
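The rule of thumb from above (stop when the loss stays the same or starts increasing) can be written as a small check. This is a sketch, not something DLAS does for you: the function name, the window size and the improvement threshold are arbitrary choices for illustration.

```python
def has_plateaued(losses: list[float], window: int = 3, min_improvement: float = 0.01) -> bool:
    """True when recent losses are no longer clearly lower than the ones before them."""
    if len(losses) < 2 * window:
        return False  # not enough history to compare two windows
    recent = sum(losses[-window:]) / window
    previous = sum(losses[-2 * window:-window]) / window
    # Flat or rising loss gives an improvement below the threshold.
    return previous - recent < min_improvement


# The first loss values from the video are still clearly decreasing.
print(has_plateaued([2.97, 2.92, 2.89, 2.85], window=2))  # -> False
```

Even when a check like this says the loss has flattened, compare the checkpoints by ear as well, because a lower loss does not always mean a better sounding voice.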
00:21:15 Moreover, to display loss values as a graph, I prepared this script.
00:21:21 The only part you need to change in this script is this file path where the training
00:21:29 log is located.
00:21:31 Let me run it and show you the results.
00:21:33 I start a new CMD type Python and run the loss value.py and it will display this graph
00:21:41 for me.
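If you want to write something similar yourself, the sketch below shows the general idea. It is not the script from the Patreon post. The log format is an assumption: it expects entries of the form `key: number`, and you have to look into your own training log to find the right key. `plot_losses` needs the third-party matplotlib package.

```python
import re


def extract_losses(log_text: str, key: str) -> list[float]:
    """Collect every number that follows '<key>:' in the log text."""
    number = r"([-+]?\d+(?:\.\d*)?(?:[eE][-+]?\d+)?)"
    # (?<!\w) avoids matching the key inside a longer name such as "val_loss".
    pattern = re.compile(r"(?<!\w)" + re.escape(key) + r":\s*" + number)
    return [float(m.group(1)) for m in pattern.finditer(log_text)]


def plot_losses(losses: list[float], out_path: str = "loss.png") -> None:
    """Save a simple line chart of the loss values."""
    import matplotlib.pyplot as plt  # imported here so parsing works without it

    plt.plot(range(1, len(losses) + 1), losses)
    plt.xlabel("logged entry")
    plt.ylabel("loss")
    plt.savefig(out_path)


# Made-up log lines for illustration only; your real log will look different.
sample_log = "epoch: 0 loss: 2.97\nepoch: 1 loss: 2.92\nepoch: 2 loss: 2.89e+00\n"
print(extract_losses(sample_log, "loss"))  # -> [2.97, 2.92, 2.89]
```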
00:21:42 You can save it.
00:21:43 You can zoom it.
00:21:44 For example, let's save the figure.
00:21:47 Also from here you can make it bigger like this.
00:21:49 This is really convenient to use and it will make your job and life easier to see how the
00:21:56 loss values are decreasing over time.
00:22:00 This script will be also posted on our Patreon post, which I will explain in the upcoming
00:22:06 chapter of the video.
00:22:07 And also, it will be available in the GitHub file as well with the instruction how to use.
00:22:13 OK, we have got 40 epoch checkpoint and now we are ready to start generating our cloned
00:22:19 voice.
00:22:20 However, there are several important things that I want to mention.
00:22:24 First of all, if you have more training data and if they are higher quality, you will get
00:22:30 better results.
00:22:31 Moreover, your learning rate stepping would also matter, but I didn't test them, so I
00:22:37 have used whatever the values the algorithm has given me with auto settings.
00:22:42 So you can change this learning rate, learning rate stepping and other values here.
00:22:47 Also, more training doesn't mean always better results.
00:22:52 It may be over trained.
00:22:54 Therefore, you should compare several checkpoints.
00:22:57 If I were you, I wouldn't save this many checkpoints.
00:23:01 Probably I would make save checkpoint frequency like 50 and then I would compare different
00:23:08 checkpoints.
00:23:09 I can't give you exact good values because I didn't have that much time to test every
00:23:15 parameter, but these are the fundamentals.
00:23:18 For part 3, generating cloned voice, we will use Tortoise TTS fast.
00:23:24 Unfortunately, there is not a neat UI for this, but I will provide a good .bat file
00:23:31 that will make your life much easier.
00:23:34 So we will begin with cloning the Tortoise TTS fast.
00:23:38 Simply this repository.
00:23:40 It is forked from the original Tortoise TTS.
00:23:43 However, this repository is much faster than Tortoise TTS.
00:23:48 So let's begin.
00:23:49 I will clone it inside my same folder like this.
00:23:53 After the cloning is done,
00:23:55 we will make a new virtual environment because we don't want it to mess with our other installations.
00:24:01 So enter inside Tortoise TTS folder or wherever you want to install the virtual environment.
00:24:08 Open a new CMD type like this.
00:24:11 I am using Python 3.10.9 version.
00:24:14 So if you use Python 3.9 or if you use Python 3.11, it may not work.
00:24:21 You should definitely use Python 3.10.x version, preferably the same version as me.
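A quick way to confirm the interpreter version before creating the environment is shown below. This is only a sketch: the helper names and the environment folder name `venv` are placeholders, and it builds the standard `python -m venv` call rather than the exact command typed in the video.

```python
import sys


def is_tutorial_python(version=sys.version_info) -> bool:
    """True for any Python 3.10.x, the version line recommended in the tutorial."""
    return tuple(version[:2]) == (3, 10)


def venv_command(env_dir: str = "venv") -> list[str]:
    """Build the standard-library command that creates a virtual environment."""
    return [sys.executable, "-m", "venv", env_dir]


if not is_tutorial_python():
    print("Warning: this tutorial was tested with Python 3.10.x")
print(" ".join(venv_command()))
```

You can run the printed command in CMD, or pass the list to `subprocess.run(venv_command(), check=True)`.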
00:24:28 Okay, virtual environment is generated.
00:24:30 Now move into the virtual environment folder like this, or you can start new CMD inside
00:24:36 there.
00:24:37 Move to scripts, type activate.
00:24:40 Now the new virtual environment is activated and we are ready to install the Tortoise TTS
00:24:46 fast.
00:24:47 So we will begin installation with Torch version 1.13.
00:24:52 Just copy paste it.
00:24:54 Be careful that you are in the activated virtual environment folder like this.
00:24:59 Okay, torch installation is completed.
00:25:01 Now as a next step, we need to move into the main installation folder.
00:25:06 This one for me.
00:25:08 So watch me cd.. and cd..
00:25:12 And now I am in the main installation folder.
00:25:15 And now I will run this installation command while I am in the main folder and the virtual
00:25:22 environment is activated, hit enter.
00:25:24 It will install every dependency inside your virtual environment folder.
00:25:30 Okay, the installation has been completed.
00:25:32 You may get some warning messages like this.
00:25:35 You can ignore them.
00:25:36 Now as a next step, we will install this one.
00:25:39 Copy paste while virtual environment is activated.
00:25:41 Okay, it has been also installed.
00:25:45 Now we are ready to start using.
00:25:47 There is a web UI in this repository.
00:25:50 However, it is not working well, unfortunately.
00:25:54 So we have to use command line interface.
00:25:57 This is the base command that we will use.
00:26:00 As I said, I will also share and show you a much better script.
00:26:04 So let's begin with the simple command.
00:26:07 Copy this.
00:26:08 So this is the base command.
00:26:09 What we need to do is first give the checkpoint path here.
00:26:14 My checkpoint is here.
00:26:16 While hitting left shift key, I right click and copy as path.
00:26:20 Then I paste it here.
00:26:22 And then I need to give my prompt to generate voice.
00:26:26 Let's generate such as welcome to the software engineering courses channel.
00:26:31 There is also one more thing.
00:26:33 You see you need to run this Python file inside script folder.
00:26:38 So while virtual environment is activated, we need to move inside scripts folder like
00:26:43 this.
00:26:44 This is the folder where I need to run this command.
00:26:47 So now I am ready.
00:26:49 I will change this as Python and the command itself just copy pasted here.
00:26:54 Okay, we are not in the correct folder.
00:26:57 Let's check it out.
00:26:58 You see there is a one single letter difference because their GitHub description is not up
00:27:04 to date.
00:27:05 The file name is not like this, but it is like this.
00:27:08 So let's run it again.
00:27:10 When generating a voice with Tortoise, there are some important parameters.
00:27:16 The most important one is the preset.
00:27:19 So when we open this file, you will see all of the options it has.
00:27:24 The presets it has are ultra fast, fast, standard, and high quality.
00:27:30 So I tested standard and high quality.
00:27:33 They really require a lot of VRAM.
00:27:35 If you don't have such amount of VRAM, then you should use ultra fast preset.
00:27:40 Fast is also working decent.
00:27:42 Okay, since their repo is not up to date, there is no option as very fast.
00:27:48 So we need to modify our command as preset fast and then run the command again and it
00:27:56 will start.
00:27:57 Okay, there is one more error.
00:27:59 And the last error is the checkpoint.
00:28:01 When we check out the file, we see that the checkpoint parameter is now like this.
00:28:08 So this will be dash dash like this.
00:28:11 Don't worry.
00:28:12 I already put this into our GitHub file.
00:28:16 So let's copy and run here.
00:28:19 And it starts generating with our trained voice model.
00:28:23 By the way, 40 epochs is very low.
00:28:26 Therefore, the quality we are going to get will be probably pretty bad.
00:28:31 Okay, the voice is generated.
00:28:33 It only took 25 seconds as you are seeing here.
00:28:36 The result will be saved inside scripts inside result folder and the random combined means
00:28:44 that it combined all of the generated voices in one run.
00:28:48 I will explain that.
00:28:49 Don't worry.
00:28:50 So let's run and see it.
00:28:52 Welcome to the software engineering courses channel.
00:28:55 So this is the voice we got.
00:28:57 Let's also listen to the original one.
00:28:59 Okay, let's listen to this one indeed.
00:29:02 Yes, I'm trying to find out what's happening about luncheon.
00:29:06 And let's also listen another original one.
00:29:08 It is exactly this problem which I am wrestling with at present in the basement of the house.
00:29:12 And let's listen our generated voice one more time.
00:29:16 Welcome to the software engineering courses channel.
00:29:19 Okay, this time I will use my pre trained model, which is at the 1480 epoch.
00:29:26 So I will change this epoch file to this one and regenerate the voice.
00:29:33 And let's compare the results.
00:29:35 Okay, new file is generated.
00:29:38 Let's listen it.
00:29:39 Welcome to the software engineering courses channel.
00:29:41 Now it is much better, but we can still improve it.
00:29:45 How?
00:29:46 We can improve it with standard quality.
00:29:49 Let's test it.
00:29:50 So instead of fast I will try standard.
00:29:52 Okay, with standard quality, it took about one minute to generate.
00:29:57 So let's listen it.
00:29:59 Welcome to the software engineering courses channel.
00:30:02 And this is the standard quality.
00:30:04 There is one more parameter that we can use to improve quality, which is diffusion iterations.
00:30:10 Let's also increase it.
00:30:12 So to add that, what we need to do is --diffusion_iterations and let's set it as 500.
00:30:19 And let's rerun.
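The command pieced together above can be sketched as a small Python helper that builds the argument list. This is only an illustration: the transcript confirms the spelling of `--diffusion_iterations`, while `--preset` as a flag spelling and the prompt being a trailing positional argument are assumptions. The script name and the checkpoint flag are not spelled out in the video, so they are left as placeholders to be filled in from the GitHub file.

```python
def build_command(script, prompt, checkpoint_flag, checkpoint_path,
                  preset="fast", diffusion_iterations=None):
    """Return an argument list for one generation run (illustrative only)."""
    argv = ["python", script,
            "--preset", preset,                # assumed flag spelling
            checkpoint_flag, checkpoint_path]  # flag name supplied by the caller
    if diffusion_iterations is not None:
        argv += ["--diffusion_iterations", str(diffusion_iterations)]
    argv.append(prompt)  # assumption: the prompt is passed as the last argument
    return argv


cmd = build_command(
    script="SCRIPT_NAME.py",              # placeholder, see the GitHub file
    prompt="Welcome to the software engineering courses channel.",
    checkpoint_flag="--CHECKPOINT_FLAG",  # placeholder, see the GitHub file
    checkpoint_path=r"C:\path\to\checkpoint.pth",  # placeholder path
    preset="standard",
    diffusion_iterations=500,
)
print(cmd)
```

Building the command as a list keeps the prompt and the checkpoint path as single arguments even when they contain spaces, which is the usual reason to avoid pasting them into one long string.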
00:30:20 While it is running let me show you some of the other parameters here.
00:30:24 There is low VRAM parameter.
00:30:26 If you provide this as --low_vram true, it will be enabled.
00:30:32 There is also the half parameter, for half precision.
00:30:35 All these will reduce your VRAM usage.
00:30:39 Some of them may also reduce your speed such as low VRAM probably.
00:30:44 The explanations are really, really good.
00:30:46 So how did I open this file?
00:30:48 This file is inside scripts and in here: Tortoise_TTS.py file.
00:30:56 So you can read all of these parameters, see their description, and you can add them if
00:31:01 you wish.
00:31:03 So currently it is doing our 500 iterations.
00:31:06 However, with this approach, it is very hard to generate long speeches.
00:31:12 So I will show you which option and which script we are going to use to generate long
00:31:19 speech consistently.
00:31:20 Okay, the file is generated inside results.
00:31:25 Welcome to the software engineering courses channel.
00:31:27 Sometimes it may not be best.
00:31:29 Okay, now time to move to our script that will generate entire speeches.
00:31:35 For my script to work, we need to make a change in the inference.py file.
00:31:40 So right click edit with notepad plus plus.
00:31:45 Then find split text function like this and then copy this code.
00:31:51 Go to your file, find split text.
00:31:54 It is in the 37th line and replace it with my text.
00:31:59 Then we are ready to use the script that I'm going to show now.
00:32:04 Before starting my developed scripts, let me explain to you what this code change will
00:32:11 make.
00:32:12 This way we will be able to provide multiple sentences, separating them with semicolons.
00:32:19 So here I have a modified command line argument.
00:32:22 This also takes output directory, so you will learn how to use it as well.
00:32:28 Let's copy this.
00:32:29 Let's also add several different sentences to this like I am showing you right now.
00:32:36 So now I have three sentences separated with a semicolon as you are seeing right now.
00:32:42 With this command, it will generate three separate voices and one combined voice.
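The semicolon convention described above can be sketched in a few lines of Python. This is not the replacement `split_text` code from the GitHub file, only a minimal illustration of the idea; the function name `split_on_semicolons` is hypothetical.

```python
def split_on_semicolons(text):
    """Split a prompt into parts at semicolons, dropping empty parts."""
    return [part.strip() for part in text.split(";") if part.strip()]


prompt = (
    "Welcome to the software engineering courses channel.; "
    "This channel is the best source for learning technology and artificial intelligence.; "
    "Please subscribe."
)
parts = split_on_semicolons(prompt)
for index, part in enumerate(parts, start=1):
    # Each part would be rendered as its own audio file, then all are combined.
    print(f"Rendering {index} of {len(parts)}: {part}")
```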
00:32:47 I will also add two more parameters to improve quality of the generated speech.
00:32:52 The first one is changing preset to high quality, copy it and change the preset.
00:32:58 And the second one is the diffusion iterations.
00:33:01 So I will make the diffusion iterations to 1000 like this.
00:33:05 I will also set my best epoch checkpoint.
00:33:09 We need to provide text split argument for this command to work.
00:33:14 So I will add it to the beginning --text_split.
00:33:18 And then you can write here anything like GG.
00:33:21 I also have modified the command here as well.
00:33:25 So you can check it out.
00:33:27 Copy by clicking here.
00:33:28 Then you can paste it into any text editor and modify the necessary variables such as
00:33:35 output directory, number of iterations you need, your text prompt that you want it to
00:33:42 convert into a speech, your checkpoint file.
00:33:45 Now I will activate virtual environment.
00:33:47 Activate, then move to the correct folders like this and paste the command.
00:33:53 It will start with loading TTS and this time you see it shows rendering one of three.
00:34:01 Why?
00:34:02 Because we have separated our prompt with semicolon and we have three sentences.
00:34:09 So you can give it a very big text and you can separate each speech part with a semicolon.
00:34:16 Why do you need this separation?
00:34:19 Because the model will start hallucinating if you provided a very long text.
00:34:24 By default it is split with 200 characters.
00:34:29 However, it is not optimal.
00:34:31 You should split sentences yourself or use the script that I will show you after this.
00:34:38 First file generation has been completed.
00:34:40 It automatically moves to second file.
00:34:43 It also shows which sentence it is turning into a speech file from here as you are seeing.
00:34:50 Generating autoregressive samples is the most time taking part of this methodology.
00:34:58 All three samples have been generated.
00:35:00 They are saved in this folder that we have provided.
00:35:05 So they are inside here.
00:35:06 Now let's listen to each one of them and their combined version.
00:35:11 Welcome to the software engineering courses channel.
00:35:14 This channel is the best source for learning, technology and artificial intelligence.
00:35:18 Please subscribe.
00:35:21 Welcome to the software engineering courses channel.
00:35:23 This channel is the best source for learning technology and artificial intelligence.
00:35:27 Please subscribe.
00:35:29 You may have noticed that the last sentence was the weirdest one.
00:35:35 Why?
00:35:36 Because this model is not able to generate either very long sentences or very small sentences.
00:35:42 Therefore, what can you do?
00:35:44 You can merge these two sentences into one sentence and it will significantly improve
00:35:49 the output.
00:35:50 Let's try again.
00:35:51 Since virtual environment is activated, I just copy paste the new command and just hit
00:35:56 enter.
00:35:57 By the way, the script will overwrite the previous files if you don't change your output
00:36:03 folder.
00:36:04 Okay, new generation is completed.
00:36:06 Let's listen again.
00:36:08 Welcome to the software engineering courses channel.
00:36:11 This channel is the best source for learning technology and artificial intelligence.
00:36:16 Please subscribe.
00:36:18 You see it made huge difference.
00:36:20 We can further improve our generated speech files by using Adobe podcast.
00:36:26 Their service is free.
00:36:28 You only need to log in with an account.
00:36:31 That account can be a free account.
00:36:33 So let's drag and drop this file into here like this.
00:36:37 Okay, the enhanced file generated.
00:36:40 Let's listen it.
00:36:42 Welcome to the software engineering courses channel.
00:36:44 This channel is the best source for learning technology and artificial intelligence.
00:36:50 Please subscribe.
00:36:52 Welcome to the software engineering courses channel.
00:36:55 This channel is the best source for learning technology and artificial intelligence.
00:37:00 Please subscribe.
00:37:02 Okay, as you are seeing, you can improve your speech by using this Adobe podcast.
00:37:09 As you have seen so far, it is very hard to use this technology this way.
00:37:15 So we need some kind of automation for it.
00:37:18 Let's say you have a speech file like this and you want to process it with a single click.
00:37:25 Do not want to spend any time.
00:37:27 I have coded three script files as you are seeing right now.
00:37:32 They are posted on our Patreon page.
00:37:35 All you need is click this link.
00:37:38 Download the attached files like this.
00:37:41 Click each file.
00:37:42 It will download each of the files.
00:37:44 Put them into the folder like this where you have installed.
00:37:48 Then follow the instructions written here.
00:37:51 I will also show you in a moment what are these files and how to use them.
00:37:57 Then modify the necessary parameters as you need as instructed here.
00:38:03 I have put every parameter here that you can change and written a great explanation for
00:38:09 each one of them.
00:38:10 I will also now show you each file and how it works.
00:38:15 So our non-Patreon members can also code them and use them.
00:38:20 So the very first file is pre-process given text file.py.
00:38:25 In this file what we are doing is simple.
00:38:28 This file will pre-process our speech file.
00:38:33 Let me run it and show you the output.
00:38:35 Python pre-process given text file and it is pre-processed.
00:38:40 So the speech file will become like this.
00:38:42 You see now they are separated with a semicolon.
00:38:45 If the sentence is too long then it will split the sentence with a semicolon.
00:38:52 Let me show you an example.
00:38:54 So I will make one of the sentences very long.
00:38:57 Such as let's make here and put a comma so it will separate from here.
00:39:02 Okay now you see it is separated from this point.
00:39:06 If there is no comma in the long sentence.
00:39:08 What will it do?
00:39:10 So let's remove the comma from here.
00:39:13 Now I will rerun and now this sentence is separated from here.
00:39:18 So this script will make your job much easier.
00:39:21 With this script.
00:39:22 You won't have to manually fix your text file to generate your speech.
00:39:27 This will also merge smaller sentences because the model works well with some duration.
00:39:35 If it is very small duration like two seconds speech.
00:39:39 Then it is not working very well.
00:39:40 Or if it is over 12 seconds, it's also not working very well.
00:39:45 So this script will also merge these lines as a single line instead of each one is unique
00:39:51 line.
00:39:52 So let's see the script.
00:39:53 The only parameter in this script that you need to modify is the split length.
00:39:58 By default it is suggested to split your text with 200 characters.
00:40:03 However I am using 175.
00:40:06 You can test different values as well.
00:40:09 Then we define split sentences function like this.
00:40:12 Then we have merge sentences function like this.
00:40:16 Then we have process sentences function like this.
00:40:19 You can pause the video and code them if you are not our Patreon supporter.
00:40:24 Then we have main function like this.
00:40:26 And finally, we call the main function here.
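For viewers who want to code the pre-processing step themselves, here is a minimal sketch of the behaviour described above: long sentences are cut at a comma when one is available and at a space otherwise, short sentences are merged, and the parts are joined with semicolons. It is not the Patreon script; the function names and the sentence-splitting regex are assumptions, and only the 175-character split length comes from the video.

```python
import re

SPLIT_LENGTH = 175  # the value used in the video; 200 is the suggested default


def split_long(sentence, limit=SPLIT_LENGTH):
    """Break a sentence longer than `limit`, preferring a comma, else a space."""
    pieces = []
    rest = sentence.strip()
    while len(rest) > limit:
        cut = rest.rfind(",", 0, limit)
        if cut > 0:
            cut += 1  # keep the comma with the first piece
        else:
            cut = rest.rfind(" ", 0, limit)
            if cut <= 0:
                cut = limit  # no comma or space found: hard cut
        pieces.append(rest[:cut].strip())
        rest = rest[cut:].strip()
    if rest:
        pieces.append(rest)
    return pieces


def merge_short(pieces, limit=SPLIT_LENGTH):
    """Join neighbouring pieces while the merged text still fits in `limit`."""
    merged = []
    for piece in pieces:
        if merged and len(merged[-1]) + 1 + len(piece) <= limit:
            merged[-1] = merged[-1] + " " + piece
        else:
            merged.append(piece)
    return merged


def preprocess(text, limit=SPLIT_LENGTH):
    """Return the text as semicolon-separated parts of at most `limit` characters."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    pieces = []
    for sentence in sentences:
        if sentence:
            pieces.extend(split_long(sentence, limit))
    return ";".join(merge_short(pieces, limit))


print(preprocess("Please subscribe. Thank you."))
```

Character count is only a rough stand-in for spoken duration, so the limit is worth tuning by ear, as the video suggests.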
00:40:29 And let's look at the second file which will make our life much easier.
00:40:34 First let me demonstrate what this script does.
00:40:38 Process given speech text.bat file.
00:40:41 I will just drag and drop this speech into this bat file and it will start a new cmd.
00:40:49 It will automatically preprocess the given text file and it will start generating speeches.
00:40:55 Let's see it in real time.
00:40:57 So you see with the parameters we did set in that bat file, it started processing the
00:41:03 speech.
00:41:04 It is generating the first line here.
00:41:07 It is one of 19 because our speech is split into 19 parts with semicolon.
00:41:14 So the first one is already generated because currently the settings are really fast.
00:41:19 You are seeing the iterations per second (it/s).
00:41:21 It is really really fast.
00:41:23 It is in the fast mode.
00:41:24 This is not the best mode, but as you see, you can generate sound files very fast with
00:41:30 fast mode.
00:41:31 I will explain all of them in a moment.
00:41:33 Don't worry.
00:41:35 So it already processed five parts of the 19 parts and now let's see the script file.
00:41:42 As you are seeing this will make your life much easier.
00:41:44 With this script you will be able to generate very long audio files very easily.
00:41:51 Just writing your text file, drag and drop it into this script and you will be able to
00:41:57 generate your speech.
00:41:58 Okay so the script is starting like this.
00:42:01 So which parameters you need to change here?
00:42:04 In this part.
00:42:05 You don't need to change any parameter.
00:42:08 Just write it as it is looking.
00:42:10 In here I have a lot of explanation as you are seeing.
00:42:14 Now this part is important.
00:42:16 You need to change this path as in your installation folder.
00:42:21 You need to change this part to where you have installed your Tortoise TTS Fast library.
00:42:28 You need to change this with the path of your installation where your virtual environment
00:42:34 is installed.
00:42:35 Of course then you need to change the path here as well.
00:42:39 This is really important.
00:42:40 Then you need to set your output directory here.
00:42:43 This is where your generated speech files will be saved.
00:42:48 This is voice directory.
00:42:49 I am not sure if this is improving the speech quality or not, but here I am setting my generated
00:42:57 wav files as we have shown in the previous parts of the video.
00:43:02 Then here we are setting our checkpoint file of our training.
00:43:06 So with the other script, you can test multiple checkpoints and use the best one.
00:43:12 Then in here there are several parameters you can set.
00:43:15 If you have a lower VRAM graphic card.
00:43:18 You should start with low VRAM true like this.
00:43:22 If you are still failing.
00:43:24 You should make preset ultra fast.
00:43:26 You can also reduce number of diffusions, but this shouldn't affect your vram usage.
00:43:31 However it will make generation faster.
00:43:34 If you are still getting out of VRAM error.
00:43:37 With these settings, you can also set no cache to true.
00:43:42 And still, if you get out of memory error, then as a last step, you can make half true.
00:43:49 So the half will reduce your quality significantly.
00:43:52 The others will make your generation speed slower such as low vram or no cache.
00:43:59 The preset also significantly affects your generation quality.
00:44:04 So you should try first the other presets like fast, standard.
00:44:09 And only if you get an out of VRAM (out of memory) error should you try the lower
00:44:16 preset like ultra fast.
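The order of fallbacks described above can be written down as a small Python sketch that yields progressively lighter settings to try after each out-of-memory error. The dictionary keys mirror the names spoken in the video, but their exact spelling in the bat file, and the `ultra_fast` spelling of the preset, are assumptions.

```python
# Starting point: the settings the video says were uploaded to Patreon.
DEFAULTS = {"preset": "standard", "low_vram": False, "no_cache": False, "half": False}

# Changes to apply one after another, in the order suggested in the video.
FALLBACK_STEPS = [
    {"low_vram": True},        # first thing to try on a lower VRAM card
    {"preset": "ultra_fast"},  # then drop to the lightest preset
    {"no_cache": True},        # then disable the cache
    {"half": True},            # last resort: costs the most quality
]


def fallback_configs(defaults=DEFAULTS, steps=FALLBACK_STEPS):
    """Yield the default config, then one lighter config per fallback step."""
    config = dict(defaults)
    yield dict(config)
    for step in steps:
        config.update(step)
        yield dict(config)


configs = list(fallback_configs())
for attempt, config in enumerate(configs):
    print(attempt, config)
```

Lowering the number of diffusion iterations is left out on purpose, since the video notes that it speeds up generation but should not change VRAM usage.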
00:44:19 So these are all the parameters that you need to change.
00:44:23 I have written every parameters in our GitHub file so you see each one is written here.
00:44:31 And when you set all of them only one time, then you will be able to use same script again
00:44:37 and again without doing any other manual thing.
00:44:40 Let's say you have encountered a problem.
00:44:42 Then we have a Discord server.
00:44:45 Click it.
00:44:46 Join our server and then you can ask me any questions.
00:44:49 You see, we have over 2400 members with over 500 online members.
00:44:54 We are growing and I am expecting you to join as well.
00:44:58 I have uploaded the file to the Patreon with these settings.
00:45:02 No cache false, half false, low vram false, diffusion iterations count 250, preset standard.
00:45:11 But as I said, according to your graphic card, you should change them.
00:45:15 I think this script works with GPUs having as little as 6 GB of VRAM, maybe even 4 GB, but I didn't test that.
00:45:22 So it is up to you to do testing.
00:45:25 And the next script we have is epoch comparison.
00:45:29 This script is pretty much same.
00:45:31 You need to change the same parameters.
00:45:32 Additionally this script will take your checkpoints folder and then it will generate speech files
00:45:42 with each one of the checkpoints.
00:45:45 Let me demonstrate.
00:45:46 So here I have a checkpoint tests folder.
00:45:50 As you are seeing right now I did put 19 files.
00:45:53 It is set as here.
00:45:55 Then the outbase directory is set as here.
00:45:59 All I need to do is then just drag and drop the speech file into epoch comparison.bat
00:46:05 file.
00:46:06 It will start with the first checkpoint path it finds.
00:46:10 Currently Python sorts the file names as text rather than by number, so it starts from 1000.
00:46:15 You can change this behavior by adding leading zeros to each one of the file names.
00:46:20 Then it will be sorted as expected.
00:46:24 Otherwise it will start from zeros like this.
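The sorting behaviour mentioned here comes from comparing file names as text. As an alternative to renaming files with leading zeros, a sort key that reads the step number can be used. This is a minimal sketch; the file names below are made up for illustration and the function name `step_number` is hypothetical.

```python
import re


def step_number(filename):
    """Return the first run of digits in the name as an int, or -1 if none."""
    match = re.search(r"\d+", filename)
    return int(match.group()) if match else -1


# Hypothetical checkpoint names that start with the step count.
names = ["1000_gpt.pth", "40_gpt.pth", "0_gpt.pth", "280_gpt.pth"]

text_order = sorted(names)                      # compares character by character
numeric_order = sorted(names, key=step_number)  # compares the step numbers
print(text_order)
print(numeric_order)
```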
00:46:27 And you see it started with the first checkpoint and then it will iteratively load each checkpoint
00:46:33 and generate speech files.
00:46:35 Now I will show results.
00:46:37 So I have done a comprehensive checkpoint comparison with the script.
00:46:42 You see, there are 55 checkpoint comparison folders.
00:46:47 Then I have written this simple python file to move all of the generated checkpoint sound
00:46:54 files into a new folder to be able to test them easily.
00:46:59 Here 54 checkpoint comparison generated speech files.
00:47:04 Now I will show you some of them and let's see how the output is changing as we do more
00:47:11 training.
00:47:13 The first numbers are the number of steps and in my case, every four steps was one epoch.
00:47:20 Therefore 40 steps is 10 epochs.
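The step-to-epoch conversion used for these file names is a simple division. A tiny sketch, assuming the four steps per epoch of this particular training run (the value will differ for other datasets and batch sizes):

```python
STEPS_PER_EPOCH = 4  # from this training run; not a general constant


def steps_to_epochs(steps, steps_per_epoch=STEPS_PER_EPOCH):
    """Convert a training step count into completed epochs."""
    return steps // steps_per_epoch


for steps in (0, 40, 160):
    print(steps, "steps =", steps_to_epochs(steps), "epochs")
```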
00:47:23 Let's begin with listening the 0 step.
00:47:28 Meta has introduced ImageBind an open source artificial intelligence model capable of combining
00:47:33 and learning from six different types of data or modalities.
00:47:37 Let's listen 40 steps 10 epoch.
00:47:41 Meta has introduced ImageBind an open source artificial intelligence model capable of combining
00:47:46 and learning from six different types of data or modalities.
00:47:51 You see.
00:47:52 In the end, you may get some hallucination like this because as you generate a bigger
00:47:59 speech, it tends to hallucinate.
00:48:01 So what you need to do is cutting those parts.
00:48:05 Let's listen to the 40th epoch.
00:48:07 Meta has introduced ImageBind an open source artificial intelligence model capable of combining
00:48:13 and learning from six different types of data or modalities.
00:48:20 Okay let's listen 80 epoch.
00:48:22 Meta has introduced ImageBind an open source artificial intelligence model capable of combining
00:48:29 and learning from six different types of data or modalities.
00:48:36 Okay let's listen 150 epochs.
00:48:39 Meta has introduced ImageBind an open source artificial intelligence model capable of combining
00:48:46 and learning from six different types of data or modalities.
00:48:53 Let's listen 280 epochs.
00:48:55 This is one of the best that I have found personally.
00:48:59 Meta has introduced ImageBind an open source artificial intelligence model capable of combining
00:49:05 and learning from six different types of data or modalities.
00:49:09 Meta often modalities.
00:49:10 Unfortunately there is still hallucination, but we can cut that part and use the original
00:49:17 generated part.
00:49:19 Now I will show you some of the very bigger epoch counts and you will see how it gets
00:49:26 over-trained.
00:49:27 Let's listen 400 epochs.
00:49:29 Meta has introduced ImageBind an open source artificial intelligence model capable of combining
00:49:36 and learning from six different types of data or modalities.
00:49:40 This almost didn't have any hallucination, but the sound quality decreased.
00:49:46 Let's listen 530 epoch and you will see how the sound quality decreased.
00:49:52 The loss value of this one was below 0.1.
00:49:56 However the lower loss value doesn't always mean that it is better in machine learning
00:50:02 because there is a very significant problem described as over-training and as you do more
00:50:09 over-training the machine learning model in this case, our trained voice will lose its
00:50:16 generalization capability.
00:50:18 Therefore, the quality will decrease.
00:50:19 Okay let's listen it.
00:50:21 Meta has introduced ImageBind an open source artificial intelligence model capable of combining
00:50:27 and learning from six different types of data or modalities.
00:50:32 So before ending the video, I will show you all of the latest script files.
00:50:38 The files are posted on our Patreon.
00:50:41 The link will be in the description and also in the GitHub file.
00:50:45 You don't need these files to follow this tutorial and also generate speech, but these
00:50:52 files will make your job easier.
00:50:54 You can just follow our GitHub file this file and everything is also written here that you
00:51:00 need along with the options, parameters, commands.
00:51:05 So these files are optional.
00:51:08 So let's see each one of the files latest versions which are also shared here.
00:51:14 The first file is preprocess given text file.
00:51:17 I did some updates.
00:51:18 So split length that you need to set in this file.
00:51:21 You can pause video and type every code if you wish or just download from our Patreon
00:51:27 page.
00:51:28 So this is the main method as you are seeing right now.
00:51:31 I did some improvements, added some replacements.
00:51:35 Also one very important thing is limiting the total length because the command line
00:51:42 is not allowing you to use infinite number of characters in commands.
00:51:47 Therefore we need to limit the total length of the speech.
00:51:52 So 7900 is working pretty decent.
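One way to enforce such a limit is to keep whole parts, in order, until the joined text would exceed the budget. The sketch below is an assumption about how this could be done, not the code in the Patreon script; only the 7900 figure comes from the video. For context, cmd.exe on Windows caps a command line at 8191 characters, so 7900 leaves room for the rest of the command.

```python
MAX_TOTAL_CHARS = 7900  # the value the video reports as working


def limit_total_length(parts, max_chars=MAX_TOTAL_CHARS, sep=";"):
    """Join parts with `sep`, keeping only as many whole parts as fit."""
    kept = []
    total = 0
    for part in parts:
        extra = len(part) + (len(sep) if kept else 0)
        if total + extra > max_chars:
            break  # the remaining parts would go into a separate run
        kept.append(part)
        total += extra
    return sep.join(kept)


sample = ["a" * 10, "b" * 10, "c" * 10]
print(limit_total_length(sample, max_chars=21))
```

Cutting at part boundaries rather than at an arbitrary character keeps every sentence that is sent to the model complete.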
00:51:55 One another important thing is saving files with utf-8.
00:51:59 This is also important.
00:52:02 So this is the entire script of the preprocess given text file.
00:52:07 Now I will show you more zoomed version so if you want to type you can just pause video
00:52:11 and type.
00:52:12 Here another part.
00:52:13 I am just showing you with a little bit scrolling and here the last part.
00:52:19 If you are feeling generous, become our Patreon supporter and download from Patreon.
00:52:23 I would appreciate that very much.
00:52:25 So the second file is process given speech text.bat file.
00:52:31 I have modified the input code.
00:52:33 Because it wasn't able to read exclamation character.
00:52:36 Therefore we have modified this part of this script.
00:52:40 This is improved script.
00:52:41 It is working very well.
00:52:44 So here you are seeing the other parts.
00:52:45 Don't forget to change this path.
00:52:48 Also this path.
00:52:50 These are important.
00:52:51 From this part of the code.
00:52:53 You don't need to change anything.
00:52:55 Also don't forget to change this path as well and also this path as well.
00:52:59 These are important.
00:53:00 You can also change the parameters here.
00:53:03 As you are seeing I have typed every one of the parameters.
00:53:07 Personally I find that turning off voice fixer with false and using Adobe speech enhancer
00:53:13 is working better.
00:53:14 This is posted on the GitHub file.
00:53:17 You see I said that you can turn off voice fixer and use podcast Adobe to enhance generated
00:53:23 speech file.
00:53:24 Also I have enabled low VRAM.
00:53:26 You can also change other settings.
00:53:28 All of them are written on here.
00:53:31 So read here carefully and set your parameters as you wish.
00:53:36 So this is the entire script.
00:53:38 This part.
00:53:39 This part.
00:53:40 And this part.
00:53:41 Just pause the video and write it if you wish.
00:53:44 Okay one another file is epoch comparison.
00:53:46 Epoch comparison is really important because finding the best epoch otherwise would take
00:53:53 a lot of time.
00:53:54 Therefore this is the beginning of the epoch comparison file.
00:53:58 Actually until this part it is exactly same as process given speech text.bat file.
00:54:05 So in here we have this part.
00:54:07 Okay in this script don't forget to change this path.
00:54:11 This is important.
00:54:12 Also don't forget to change this path.
00:54:14 This is important.
00:54:15 Don't forget to set your checkpoints directory.
00:54:18 Don't forget to change your output base directory here as well.
00:54:23 These are all important settings that you need to make to use this script.
00:54:28 Also in here.
00:54:29 Again there are parameters like you are seeing right now.
00:54:32 You can use this.
00:54:34 You may wonder what is candidates.
00:54:36 Candidates is an extremely important parameter.
00:54:39 I will show you in a moment.
00:54:40 So this is the end of the script.
00:54:43 Just pause the video and type them if you wish or download from our Patreon page.
00:54:48 So let me show you what candidates does.
00:54:51 Here you are seeing 10 candidates generated for each speech sentence.
00:54:56 I will demonstrate them one by one.
00:54:59 So let's see several candidates for the first speech.
00:55:03 Hello and welcome one and all.
00:55:06 Today we are embarking on an exciting journey through the most intuitive and comprehensive
00:55:11 deep voice cloning tutorial you will find anywhere.
00:55:15 Our mission: let's see the second generated candidate for this speech.
00:55:20 Hello and welcome one and all.
00:55:23 Today we are embarking on an exciting journey through the most intuitive and comprehensive
00:55:29 deep voice cloning tutorial you will find anywhere.
00:55:34 Our mission: let's see the third candidate.
00:55:36 So by generating multiple candidates, you can pick the best generated voice and use
00:55:42 it.
00:55:43 This is the logic of it and it is also pretty fast because only the last diffusion part
00:55:48 is used to generate different candidates.
00:55:51 Hello and welcome one and all.
00:55:54 Today we are embarking on an exciting journey through the most intuitive and comprehensive
00:56:00 deep voice cloning tutorial you will find anywhere.
00:56:03 Our mission: you understand the logic.
00:56:06 So the last script that I will show you is move checkpoint sound files.
00:56:12 Because the checkpoint files are generated like this and we want to move all of them
00:56:16 into a single folder if we want to compare them easily.
00:56:20 So you see I have moved all of them here and let's open move checkpoint sound files.
00:56:26 So this is the script of the move checkpoint sound files.
00:56:30 Don't forget to change this path and also this path.
00:56:34 This is a pretty simple script.
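For those typing the script by hand, the idea can be sketched as follows: walk through each checkpoint comparison folder and move its sound files into one folder, prefixing each file with the name of the folder it came from so the names stay unique. The function name, the `*.wav` pattern, and the folder layout in the demo are assumptions; the real script on Patreon may differ.

```python
import shutil
import tempfile
from pathlib import Path


def collect_sound_files(source_root, target_dir, pattern="*.wav"):
    """Move matching files from each subfolder of source_root into target_dir."""
    source_root, target_dir = Path(source_root), Path(target_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    moved = []
    for folder in sorted(p for p in source_root.iterdir() if p.is_dir()):
        if folder.resolve() == target_dir.resolve():
            continue  # do not re-collect from the destination itself
        for sound in sorted(folder.glob(pattern)):
            destination = target_dir / f"{folder.name}_{sound.name}"
            shutil.move(str(sound), str(destination))
            moved.append(destination.name)
    return moved


# Demo on a throwaway directory with two made-up comparison folders.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "comparisons"
    for name in ("40", "280"):
        (root / name).mkdir(parents=True)
        (root / name / "combined.wav").write_bytes(b"")
    moved = collect_sound_files(root, Path(tmp) / "all_sounds")
    print(moved)
```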
00:56:36 One another important thing is your speech.txt file.
00:56:40 Make sure that its encoding is UTF-8.
00:56:43 This is important because if you don't use UTF-8 then the special characters won't be
00:56:51 saved.
00:56:52 Also try to avoid using special characters like this one because they are not correctly
00:56:57 read with the command line interface and also probably they won't be understood by the trained
00:57:04 voice model.
00:57:05 So use UTF-8 and also try to avoid using special characters.
00:57:11 One another very important thing is try to avoid writing abbreviations like PhD because
00:57:17 model is not able to understand all of the abbreviations.
00:57:21 So what can you do?
00:57:22 Instead of PhD you can write doctor of philosophy the full sentence of the abbreviation.
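Expanding abbreviations can be automated with a small replacement table applied before the text is sent to the model. This is a minimal sketch; the table holds only the example from the video and the function name `expand_abbreviations` is hypothetical.

```python
import re

# Only the example from the video; extend with your own entries.
ABBREVIATIONS = {"PhD": "doctor of philosophy"}


def expand_abbreviations(text, table=ABBREVIATIONS):
    """Replace each abbreviation with its spoken form, matching whole words only."""
    for short, full in table.items():
        text = re.sub(rf"\b{re.escape(short)}\b", full, text)
    return text


expanded = expand_abbreviations("She holds a PhD in physics.")
print(expanded)
```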
00:57:28 These are all important things.
00:57:30 I hope you have enjoyed.
00:57:31 I have spent quite a bit of time preparing this tutorial, about one week.
00:57:38 So your support is tremendously important for me on the Patreon on the YouTube with
00:57:44 join, with liking, subscribing, leave a comment.
00:57:48 Because the view counts from these very time consuming tutorials are not very good.
00:57:54 So therefore, without your support I am not getting much revenue and I need revenue to
00:58:00 be able to continue producing better quality videos.
00:58:04 Better new training videos.
00:58:07 Also on my channel I have amazing other tutorials as well.
00:58:11 You can click our playlists and in here you will see our playlists such as Stable Diffusion
00:58:16 playlist, technology and science playlist.
00:58:19 In the Stable Diffusion playlist we have amazing tutorials as you are seeing right now so you
00:58:24 can learn a lot of information from our DreamBooth playlist.
00:58:28 We also have a DreamBooth playlist file on our GitHub repository.
00:58:33 Just go to the main repository file like this and in the read me file you will find all
00:58:39 of our Stable Diffusion videos like you are seeing right now.
00:58:43 There are currently over 30 tutorial videos at the moment so you can check each one of
00:58:49 them and learn many more things about Stable Diffusion.
00:58:53 If the script files get improved later.
00:58:56 I will update this post and I will also notify all of our Patreon users.
00:59:02 So always check out this post to get the latest script file.
00:59:06 You will find the necessary link in the description of the video like here as you are seeing right
00:59:12 now and our Patreon link here and our Discord link here.
00:59:17 Also I will put the necessary links in the pinned comment in this video like you are
00:59:22 seeing right now.
00:59:23 This is an example.
00:59:24 So check out the description and the pinned comment to find out the links.
00:59:29 Hopefully see you in another amazing video.
