# Multilingual Text-to-Speech

## What to do when stuck
- Contact me.
- Search the closed issues: https://github.com/Tomiinek/Multilingual_Text_to_Speech/issues?q=is%3Aissue+is%3Aclosed

## Install Dependencies

In [None]:
import sys
import os
import IPython
from IPython.display import Audio

In [None]:
PROJECT_ROOT = os.getcwd()

In [None]:
! pip install -q --user soundfile
! pip install -q --user phonemizer
! pip install -q --user epitran
! pip install --user protobuf==3.20.3

In [None]:
!pip install -r requirements.txt --user
!pip install --user --upgrade librosa

## Setup Dataset

We download two datasets, COMVOI and VCTK, and extract them to data/comvoi and data/vctk.

### Download COMVOI - Dutch, German, French, Russian and Chinese audio.

In [None]:
%%capture

!rm -rf data/comvoi
!curl -O -L https://github.com/Tomiinek/Multilingual_Text_to_Speech/releases/download/v1.0/comvoi.zip
!unzip -q comvoi.zip -d data/comvoi
!rm comvoi.zip

### Download VCTK - English audio.

In [None]:
%%capture

!rm -rf data/vctk
!curl -O -L http://studio336.sk/css/vctk.zip
!mkdir data/vctk
!unzip vctk.zip -d data/vctk
!rm vctk.zip

### Remove meta.csv entries whose audio files do not exist.

In [None]:
base_dir = 'data/comvoi'
for lang_dir in os.listdir(base_dir):
    lang_path = os.path.join(base_dir, lang_dir)
    # Skip stray files; only language directories contain a meta.csv.
    if not os.path.isdir(lang_path):
        continue

    # Keep the original metadata as "_meta.csv"; the filtered copy becomes "meta.csv".
    os.rename(os.path.join(lang_path, 'meta.csv'), os.path.join(lang_path, '_meta.csv'))

    with open(os.path.join(lang_path, '_meta.csv'), mode = 'r', encoding = 'utf-8') as meta_csv:
        lines = meta_csv.readlines()

    kept_lines = []
    # Example line: 04|common_voice_fr_18576291.wav|Que suis-je auprès de Lui.
    for line in lines:
        audio_file_subdir, audio_file_name = line.split('|')[:2]

        # Example audio_file_path: data/comvoi/fr/wavs/04/common_voice_fr_18576291.wav
        audio_file_path = os.path.join(lang_path, 'wavs', audio_file_subdir, audio_file_name)
        if os.path.exists(audio_file_path):
            kept_lines.append(line)

    with open(os.path.join(lang_path, 'meta.csv'), mode = 'w', encoding = 'utf-8') as meta_csv:
        meta_csv.writelines(kept_lines)

### Create a train.txt file for each dataset and generate spectrograms for each audio file.

To train the model, we need to:
1. Create a linear spectrogram and a mel spectrogram for every audio file. A spectrogram represents how the frequency content of an audio signal changes over time.
2. Create a file called train.txt.
    - Every line in train.txt corresponds to one audio file.
    - Each line has this format: id|speaker|language|audio_file_path|mel_spectrogram_path|linear_spectrogram_path|text|phonemized_text
    - Example: 002437|22-de|de|../comvoi_clean/de/wavs/22/common_voice_de_18706450.wav|../comvoi_clean/mel_spectrograms/002437.npy|../comvoi_clean/linear_spectrograms/002437.npy|Meine Sims haben immer Harndrang.|
    - The phonemized text is left empty.
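
As a rough illustration of the line format described above, the following sketch splits one train.txt line into named fields. `MetaLine` and `parse_meta_line` are hypothetical helpers written for this notebook, not part of the repository's code, and the sketch assumes the text itself contains no `|` character.

```python
from typing import NamedTuple


class MetaLine(NamedTuple):
    """One train.txt entry (hypothetical helper, not part of the repository)."""
    id: str
    speaker: str
    language: str
    audio_file_path: str
    mel_spectrogram_path: str
    linear_spectrogram_path: str
    text: str
    phonemized_text: str


def parse_meta_line(line: str) -> MetaLine:
    """Split a pipe-separated train.txt line into its eight fields."""
    parts = line.rstrip('\n').split('|')
    if len(parts) != len(MetaLine._fields):
        raise ValueError(f'expected {len(MetaLine._fields)} fields, got {len(parts)}')
    return MetaLine(*parts)


example = ('002437|22-de|de|../comvoi_clean/de/wavs/22/common_voice_de_18706450.wav|'
           '../comvoi_clean/mel_spectrograms/002437.npy|'
           '../comvoi_clean/linear_spectrograms/002437.npy|'
           'Meine Sims haben immer Harndrang.|')
entry = parse_meta_line(example)
print(entry.speaker, entry.language, repr(entry.phonemized_text))
```

The trailing `|` in the example line is what produces the empty phonemized text field.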

In [None]:
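import dataset.dataset as ds
# Positional arguments as used here: dataset name, dataset root directory, output file name,
# audio sample rate (Hz), number of FFT frequencies, generate spectrograms, generate phonemes.
ds.TextToSpeechDataset.create_meta_file('my_common_voice', 'data/comvoi', 'train.txt', 22050, 1102, True, False)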

In [None]:
import dataset.dataset as ds
ds.TextToSpeechDataset.create_meta_file('vctk', 'data/vctk/VCTK-Corpus/VCTK-Corpus', 'train.txt', 22050, 1102, True, False)

To train on COMVOI and VCTK together, we concatenate the two train.txt files into a new one in data/comvoi_vctk.

In [None]:
# Combine COMVOI and VCTK train.txt files

!mkdir -p data/comvoi_vctk
!cat data/comvoi/train.txt data/vctk/VCTK-Corpus/VCTK-Corpus/train.txt > data/comvoi_vctk/train.txt

### Ensure that file paths are relative to the data/comvoi_vctk directory.

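The generated train.txt file contains audio paths such as "de/wavs/01/common_voice_de_18362579.wav", which are relative to each dataset's own directory.<br>
The combined file lives in data/comvoi_vctk, so the paths must be relative to that directory instead.<br>
We therefore change these paths to the form "../comvoi/de/wavs/01/common_voice_de_18362579.wav".<br>
The VCTK paths are rewritten the same way.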
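
The rewrite described above can be sketched on a single line as follows. `prefix_paths` is a hypothetical helper, and the id, speaker, spectrogram paths and text in the sample line are made up for illustration; only the audio path is taken from the example above. The sketch assumes fields 3, 4 and 5 hold the audio, mel spectrogram and linear spectrogram paths.

```python
import posixpath


def prefix_paths(line: str, prefix: str, path_fields=(3, 4, 5)) -> str:
    """Prepend `prefix` to the path fields of one pipe-separated line (hypothetical helper)."""
    parts = line.split('|')
    for index in path_fields:
        parts[index] = posixpath.join(prefix, parts[index])
    return '|'.join(parts)


# Illustrative line: only the audio path comes from the notebook text.
sample = ('000001|01-de|de|de/wavs/01/common_voice_de_18362579.wav|'
          'mel_spectrograms/000001.npy|linear_spectrograms/000001.npy|Hallo.|\n')
rewritten = prefix_paths(sample, '../comvoi')
print(rewritten.split('|')[3])
```

Using `posixpath` keeps forward slashes regardless of the operating system, which matches the paths shown in this notebook.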

In [None]:
# Back up the train.txt file first.
!cp data/comvoi_vctk/train.txt data/comvoi_vctk/_train.txt

In [None]:
COMVOI_LANGUAGES = ['de', 'fr', 'nl', 'ru', 'zh']
VCTK_LANGUAGES = ['en-us']

with open('data/comvoi_vctk/_train.txt', mode = 'r', encoding = 'utf-8') as file:
    lines = file.readlines()

for i, line in enumerate(lines):
    line_parts = line.split('|')
    lang = line_parts[2]
    if lang in COMVOI_LANGUAGES:
        # Change e.g. "de/wavs/01/common_voice_de_18362579.wav" to "../comvoi/de/wavs/01/common_voice_de_18362579.wav"
        prefix = '../comvoi/'
    elif lang in VCTK_LANGUAGES:
        prefix = '../vctk/VCTK-Corpus/VCTK-Corpus/'
    else:
        continue

    # Fields 3, 4 and 5 hold the audio, mel spectrogram and linear spectrogram paths.
    for field in (3, 4, 5):
        line_parts[field] = os.path.join(prefix, line_parts[field])
    lines[i] = '|'.join(line_parts)

with open('data/comvoi_vctk/train.txt', mode = 'w', encoding = 'utf-8') as file:
    file.writelines(lines)

In [None]:
!rm data/comvoi_vctk/_train.txt

### Split data into train.txt and val.txt.

Training needs a training set and a validation set (we do not need a separate test set here).<br>
https://towardsdatascience.com/train-validation-and-test-sets-72cb40cba9e7<br>
So we move 15% of the lines from train.txt, chosen at random, to a new val.txt file.
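
To make the idea concrete, here is a minimal standard-library sketch of a seeded random split. It is not how `train_test_split` from scikit-learn is implemented and will not pick the same lines; `split_lines` and the dummy lines are hypothetical and only illustrate the principle.

```python
import random


def split_lines(lines, val_fraction=0.15, seed=42):
    """Shuffle a copy of `lines` with a fixed seed and hold out a fraction for validation."""
    shuffled = list(lines)
    random.Random(seed).shuffle(shuffled)
    n_val = round(len(shuffled) * val_fraction)
    # The first n_val shuffled lines become validation data, the rest training data.
    return shuffled[n_val:], shuffled[:n_val]


# Dummy stand-ins for train.txt lines.
dummy_lines = [f'line {i}\n' for i in range(100)]
train_lines, val_lines = split_lines(dummy_lines)
print(len(train_lines), len(val_lines))
```

Fixing the seed means rerunning the cell yields the same split, so validation results stay comparable between runs.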

In [None]:
# Back up the train.txt file first.
!cp data/comvoi_vctk/train.txt data/comvoi_vctk/_train.txt

In [None]:
from sklearn.model_selection import train_test_split

with open('data/comvoi_vctk/_train.txt', mode = 'r', encoding = 'utf-8') as file:
    all_lines = file.readlines()

# Hold out 15% of the lines for validation; the fixed seed makes the split reproducible.
train, val = train_test_split(all_lines, test_size=0.15, random_state=42)
print(len(train))
print(len(val))

with open('data/comvoi_vctk/train.txt', mode = 'w', encoding = 'utf-8') as file:
    file.writelines(train)
with open('data/comvoi_vctk/val.txt', mode = 'w', encoding = 'utf-8') as file:
    file.writelines(val)

In [None]:
!rm data/comvoi_vctk/_train.txt

## Train

The file params/generated_switching_comvoi_vctk.json contains the hyperparameters used for training.

In [None]:
os.chdir(PROJECT_ROOT)

!PYTHONIOENCODING=utf-8 python3 train.py --hyper_parameters generated_switching_comvoi_vctk

In [None]:
# Set CH to the name of a checkpoint file in checkpoints/; the Test section below reads it as $CH.
%env CH=INSERT_CHECKPOINT_FILE_NAME_HERE

### Zip logs

In [None]:
%cd ~/Multilingual_Text_to_Speech
!rm -f logs.zip
!zip logs.zip -r logs

## Test

In [None]:
!echo "fr|Cette requête s'explique par les relations peu conventionnelles que Schrödinger entretient avec les femmes.|01-zh|fr"  | python3 synthesize.py --checkpoint checkpoints/$CH --save_spec
!echo "ru|Как считают современные археологи, на месте находились четыре различных храма.|01-zh|ru"  | python3 synthesize.py --checkpoint checkpoints/$CH --save_spec
!echo "de|Sie liegt zwischen dem Ijsselmeer, der Ijssel und den Hügeln der Veluwe.|01-zh|de"  | python3 synthesize.py --checkpoint checkpoints/$CH --save_spec
!echo "zh|wǒ zài nián qīng shí hou yě céng jīng zuò guò xǔ duō mèng.|01-zh|zh"  | python3 synthesize.py --checkpoint checkpoints/$CH --save_spec
!echo "en|Now, the way I would do it is by better analysing the data.|01-zh|en-us"  | python3 synthesize.py --checkpoint checkpoints/$CH --save_spec

import IPython.display as ipd
ipd.display(ipd.Audio('fr.wav'))
ipd.display(ipd.Audio('ru.wav'))
ipd.display(ipd.Audio('de.wav'))
ipd.display(ipd.Audio('zh.wav'))
ipd.display(ipd.Audio('en.wav'))
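
The lines piped into synthesize.py above appear to follow the pattern `id|text|speaker|language`, where the id also names the output file (for example `fr.wav`). This reading is inferred from the commands in this notebook, not from the script's documentation. Under that assumption, a small hypothetical helper for building such lines could look like this:

```python
def make_synthesis_line(output_id: str, text: str, speaker: str, language: str) -> str:
    """Build one input line in the assumed `id|text|speaker|language` pattern."""
    fields = (output_id, text, speaker, language)
    for value in fields:
        # A literal pipe inside a field would shift every field after it.
        if '|' in value:
            raise ValueError(f'field must not contain "|": {value!r}')
    return '|'.join(fields)


line = make_synthesis_line(
    'en',
    'Now, the way I would do it is by better analysing the data.',
    '01-zh',
    'en-us',
)
print(line)
```

The printed line matches the English example used in the cell above.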