-
Notifications
You must be signed in to change notification settings - Fork 1
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
- Loading branch information
Showing
28 changed files
with
479 additions
and
631 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,40 @@ | ||
name: Deploy docs | ||
on: | ||
push: | ||
branches: | ||
- main | ||
jobs: | ||
docs: | ||
# Create latest docs | ||
runs-on: ubuntu-latest | ||
steps: | ||
- uses: actions/checkout@v3 | ||
with: | ||
token: ${{ secrets.SGILE_PAT }} | ||
submodules: recursive | ||
fetch-depth: 0 # fetch all commits/branches | ||
- name: Set up Conda | ||
uses: conda-incubator/setup-miniconda@v2 | ||
with: | ||
python-version: 3.9 | ||
- name: Install libsndfile | ||
run: sudo apt-get install -y libsndfile1 | ||
- name: Install Torch related deps | ||
run: | | ||
conda install -c conda-forge pycountry pyworld | ||
pip install torch==1.13.1+cu117 torchvision==0.14.1+cu117 torchaudio==0.13.1 --extra-index-url https://download.pytorch.org/whl/cu117 | ||
- name: Install other dependencies | ||
run: | | ||
pip install coverage soundfile | ||
pip install -e . | ||
- name: Install documentation dependencies | ||
run: | | ||
pip install -r docs/requirements.txt | ||
- name: Setup doc deploy | ||
run: | | ||
git config user.name 'github-actions[bot]' | ||
git config user.email 'github-actions[bot]@users.noreply.github.com' | ||
- name: Deploy docs with mike 🚀 | ||
run: | | ||
mkdocs build | ||
mike deploy --push --update-aliases latest |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Empty file.
This file was deleted.
Oops, something went wrong.
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1 @@ | ||
# Background to Text-to-Speech |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,119 @@ | ||
# Customize to your language | ||
|
||
## Step 1: Make sure you have Permission! | ||
|
||
So, you want to build a text-to-speech system for a new language or dataset - cool! But, just because you **can** build a text-to-speech system, doesn't mean you **should**. There are a lot of tricky ethical | ||
questions around text-to-speech. It's not ethical to just use audio you find somewhere if it doesn't have explicit permission to use it for the purposes of text-to-speech. The first step is to make sure you have | ||
permission to use the data in question and that whoever contributed their voice to the data you want to use is aware and supportive of your goal. | ||
|
||
## Step 2: Gather Your Data | ||
|
||
The first thing to do is to get all the data you have (in this case audio with text transcripts) together in one place. Your audio should be in 'wav' format. Ideally it would be 16bit, mono (one channel) audio sampled somewhere between 22.05kHz and 48kHz. If that doesn't mean anything to you, don't worry, we can ensure the right format in later steps. | ||
It's best if your audio clips are somewhere between half a second and 10 seconds long. Any longer and it could be difficult to train. If your audio is longer than this, we suggest processing it into smaller chunks first. | ||
|
||
Your text should be consistently written and should be in a pipe-separated values spreadsheet, similar to [this file](https://github.com/roedoejet/EveryVoice/blob/main/everyvoice/filelists/lj_full.psv). It should have a column that contains text and a column that contains the `basename` of your associated audio file. So if you have a recording of somebody saying "hello how are you?" and the corresponding audio is called `mydata0001.wav` | ||
then you should have a psv file that looks like this: | ||
|
||
```csv hl_lines="2" | ||
basename|text | ||
mydata0001|hello how are you? | ||
mydata0002|some other sentence. | ||
... | ||
``` | ||
|
||
We also support comma and tab separated files, but recommend using pipes (|). | ||
|
||
You can also use the "festival" format which is like this (example from [Sinhala TTS](https://openslr.org/30/)): | ||
|
||
```text | ||
( sin_2241_0329430812 " කෝකටත් මං වෙනදා තරම් කාලෙ ගන්නැතිව ඇඳ ගත්තා " ) | ||
( sin_2241_0598895166 " ඇන්ජලීනා ජොලී කියන්නේ පසුගිය දිනවල බොහෝ සෙයින් කතා බහට ලක්වූ චරිතයක් " ) | ||
( sin_2241_0701577369 " ආර්ථික චින්තනය හා සාමාජීය දියුණුව ඇති කළ හැකිවනුයේ පුද්ගල ආර්ථික දියුණුව සලසා දීමෙන්ය " ) | ||
( sin_2241_0715400935 " ඉන් අදහස් වන්නේ විචාරාත්මක විනිවිද දැකීමෙන් තොර බැල්මයි " ) | ||
( sin_2241_0817100025 " අප යුද්ධයේ පළමු පියවරේදීම පරාද වී අවසානය " ) | ||
``` | ||
|
||
In this format, there are corresponding wav files labelled sin_2241_0329430812.wav etc.. | ||
|
||
## Step 3: Install EveryVoice | ||
|
||
Head over to the [install documentation](../install.md) and install EveryVoice | ||
|
||
## Step 4: Run the Configuration Wizard 🧙 | ||
|
||
Once you have your data, the best thing to do is to run the Configuration Wizard 🧙. To do that run: | ||
|
||
```bash | ||
everyvoice config-wizard | ||
``` | ||
|
||
After running the config-wizard, cd into your newly created directory. Let's call it `test` for now. | ||
|
||
```bash | ||
cd test | ||
``` | ||
|
||
## Step 5: Run the Preprocessor | ||
|
||
Your models need to do a number of preprocessing steps in order to prepare for training. To preprocess everything you need, run the following: | ||
|
||
```bash | ||
everyvoice fs2 preprocess -p config/feature_prediction.yaml | ||
``` | ||
|
||
## Step 6: Train your Vocoder | ||
|
||
```bash | ||
everyvoice hifigan train -p config/vocoder.yaml | ||
``` | ||
|
||
By default, we run our training with PyTorch Lightning's "auto" strategy. But, if you are on a machine where you know the hardware, you can specify it like: | ||
|
||
```bash | ||
everyvoice hifigan train -p config/vocoder.yaml -d 1 -a gpu | ||
``` | ||
|
||
Which would use the GPU accelerator and specify 1 device/chip. | ||
|
||
## Step 7: Train your Feature Prediction Network | ||
|
||
To generate audio when you train your feature prediction network, you need to add your vocoder checkpoint to the `config/feature_prediction.yaml` | ||
|
||
At the bottom of that file you'll find a key called vocoder_path. Add the absolute path to your trained vocder (here it would be `/path/to/test/logs/VocoderExperiment/base/checkpoints/last.ckpt` where `/path/to` would be the actual path to it on your computer.) | ||
|
||
Once you've replaced the vocoder_path key, you can train your feature prediction network: | ||
|
||
```bash | ||
everyvoice fs2 train -p config/feature_prediction.yaml | ||
``` | ||
|
||
## Step 8: Synthesize Speech in Your Language! | ||
|
||
You can synthesize by pointing the CLI to your trained feature prediction network and passing in the text. You can export to wav, npy, or pt files. | ||
|
||
```bash | ||
everyvoice fs2 synthesize logs/FeaturePredictionExperiment/base/checkpoints/last.ckpt -t "මෙදා සැරේ සාකච්ඡාවක් විදියට නෙවෙයි නේද පල කරල තියෙන්නෙ" -a gpu -d 1 -O wav | ||
``` | ||
|
||
<!-- % Step 10 (optional): Finetune your vocoder | ||
% ---------------------------------------- | ||
% .. code-block:: bash | ||
% everyvoice e2e train -p config/e2e.yaml | ||
% Step 11: Synthesize Speech | ||
% -------------------------- | ||
% .. code-block:: bash | ||
% everyvoice e2e synthesize -t "hello world" -c config/e2e.yaml | ||
% .. warning:: | ||
% TODO: this doesn't exist yet | ||
% TODO: e2e needs checkpoint paths --> |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1 @@ | ||
# How to fine-tune the existing checkpoints |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,7 @@ | ||
# Guides | ||
|
||
Here are a selection of guides to help you through the process of training and using your own text-to-speech models. | ||
|
||
1. Background to TTS | ||
|
||
2. Custom |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,16 @@ | ||
% EveryVoice documentation master file, created by | ||
% sphinx-quickstart on Mon Dec 5 17:51:09 2022. | ||
% You can adapt this file completely to your liking, but it should at least | ||
% contain the root `toctree` directive. | ||
|
||
# Welcome to EveryVoice's documentation! | ||
|
||
```{toctree} | ||
:caption: 'Contents:' | ||
:maxdepth: 2 | ||
start | ||
install | ||
guides/index | ||
reference/index | ||
``` |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,18 @@ | ||
# Installation | ||
|
||
In order to train on GPUs, you must install PyTorch and Cuda. To do so, we recommend: | ||
|
||
- installing [conda](https://docs.conda.io/projects/conda/en/stable/) or [miniconda](https://docs.conda.io/en/latest/miniconda.html) | ||
- creating a new environment: `conda create --name EveryVoice python=3.9` | ||
- activating the environment: `conda activate EveryVoice` | ||
- following the [PyTorch installation instructions](https://pytorch.org/get-started/locally/) relevant to your hardware | ||
|
||
We then recommend using an interactive installation after cloning the repo from GitHub: | ||
|
||
```bash | ||
$ git clone https://github.com/roedoejet/EveryVoice.git | ||
$ cd EveryVoice | ||
$ git submodule update --init | ||
$ conda activate EveryVoice | ||
$ pip install -e . | ||
``` |
This file was deleted.
Oops, something went wrong.
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,39 @@ | ||
{% if page.meta.comments %} | ||
<h2 id="__comments">{{ lang.t("meta.comments") }}</h2> | ||
<!-- Insert generated snippet here --> | ||
|
||
<!-- Synchronize Giscus theme with palette --> | ||
<script> | ||
var giscus = document.querySelector("script[src*=giscus]") | ||
|
||
/* Set palette on initial load */ | ||
var palette = __md_get("__palette") | ||
if (palette && typeof palette.color === "object") { | ||
var theme = palette.color.scheme === "slate" ? "dark" : "light" | ||
giscus.setAttribute("data-theme", theme) | ||
} | ||
|
||
/* Register event handlers after documented loaded */ | ||
document.addEventListener("DOMContentLoaded", function() { | ||
var ref = document.querySelector("[data-md-component=palette]") | ||
ref.addEventListener("change", function() { | ||
var palette = __md_get("__palette") | ||
if (palette && typeof palette.color === "object") { | ||
var theme = palette.color.scheme === "slate" ? "dark" : "light" | ||
|
||
/* Instruct Giscus to change theme */ | ||
var frame = document.querySelector(".giscus-frame") | ||
frame.contentWindow.postMessage({ | ||
giscus: { | ||
setConfig: { | ||
theme | ||
} | ||
} | ||
}, | ||
"https://giscus.app" | ||
) | ||
} | ||
}) | ||
}) | ||
</script> | ||
{% endif %} |
Oops, something went wrong.