# Speech synthesis inference using the fine-tuned [VITS](https://arxiv.org/pdf/2106.06103.pdf) model

## Installation and Setup

Install the packages required to download the model files correctly:

In [1]:
!pip install git-lfs
!git lfs install

Collecting git-lfs
  Downloading git_lfs-1.6-py2.py3-none-any.whl (5.6 kB)
Installing collected packages: git-lfs
Successfully installed git-lfs-1.6
Error: Failed to call git rev-parse --git-dir --show-toplevel: "fatal: not a git repository (or any of the parent directories): .git\n"
Git LFS initialized.


Clone the [GitHub repository](https://github.com/GerrySant/VITS_finetuned) where the model is stored:

In [2]:
!git clone https://github.com/GerrySant/VITS_finetuned.git

Cloning into 'VITS_finetuned'...
remote: Enumerating objects: 25, done.
remote: Counting objects: 100% (25/25), done.
remote: Compressing objects: 100% (21/21), done.
remote: Total 25 (delta 3), reused 21 (delta 2), pack-reused 0
Unpacking objects: 100% (25/25), done.
tcmalloc: large alloc 1471086592 bytes == 0x5639e0ba6000 @  0x7f71c20f52a4 0x5639a3ae0e8f 0x5639a3abdfcb 0x5639a3a72f33 0x5639a3a1722a 0x5639a3a176e6 0x5639a3a34451 0x5639a3a349e9 0x5639a3a34f13 0x5639a3ad9b82 0x5639a397b162 0x5639a3961a65 0x5639a3962725 0x5639a396172a 0x7f71c143bc87 0x5639a396177a
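
If Git LFS is not active during the clone, large files such as the model checkpoint may be checked out as small text pointer files instead of the real weights. The following is a minimal sketch for detecting that case, assuming the standard LFS pointer format (a text file that begins with `version https://git-lfs.github.com/spec/v1`) and assuming the checkpoint is tracked with LFS in this repository. The helper name `is_lfs_pointer` is hypothetical.

```python
from pathlib import Path

# First bytes of a Git LFS pointer file (assumed standard pointer format).
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def is_lfs_pointer(path):
    """Return True if the file at `path` starts like a Git LFS pointer stub."""
    with open(path, "rb") as f:
        head = f.read(len(LFS_POINTER_PREFIX))
    return head == LFS_POINTER_PREFIX


# Checkpoint path as used in the inference cells of this notebook.
checkpoint = Path("./VITS_finetuned/vits_BSC_Gerard_Sant/best_model.pth")
if checkpoint.exists():
    if is_lfs_pointer(checkpoint):
        print("Checkpoint is an LFS pointer; try `git lfs pull` inside VITS_finetuned.")
    else:
        print(f"Checkpoint looks like real data ({checkpoint.stat().st_size} bytes).")
```

The check reads only the first few bytes, so it stays fast even for a checkpoint of more than a gigabyte.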


Clone and set up the [GitHub repository](https://github.com/coqui-ai/TTS) of the text-to-speech library:

In [3]:
# Clone the repository
!git clone https://github.com/coqui-ai/TTS.git

# Install the library and its required packages in editable mode
!pip install -q -e TTS
!cd TTS && python setup.py develop

# Upgrade numpy to resolve the version conflict.
# This requires restarting the runtime, which exit() triggers automatically.
!pip install --upgrade numpy
exit()

Cloning into 'TTS'...
remote: Enumerating objects: 27314, done.
remote: Counting objects: 100% (127/127), done.
remote: Compressing objects: 100% (51/51), done.
remote: Total 27314 (delta 77), reused 102 (delta 72), pack-reused 27187
Receiving objects: 100% (27314/27314), 128.35 MiB | 15.67 MiB/s, done.
Resolving deltas: 100% (19975/19975), done.
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
    Preparing wheel metadata ... done
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
    Preparing wheel metadata ... done

## Exercise 2: Inference

Create the folder where the output audio files will be saved:

In [4]:
import os

os.makedirs('./OUTPUT_AUDIOS', exist_ok=True)

Define a function that keeps the path to the "speakers.json" file in the config up to date:

In [5]:
import json

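def update_speaker_path(config_path, speakers_path):
  """Point both speakers_file entries in the config at speakers_path."""
  # Load the current configuration
  with open(config_path, "r") as config_file:
    json_object = json.load(config_file)

  # The speakers file is referenced in two places; update both
  json_object['model_args']['speakers_file'] = speakers_path
  json_object['speakers_file'] = speakers_path

  # Write the updated configuration back to the same file
  with open(config_path, "w") as config_file:
    json.dump(json_object, config_file)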
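
To see what the function changes without touching the real config, it can be run on a throwaway file. This is a minimal sketch: the function is repeated so the block runs on its own, and the config contents and the path `/tmp/speakers.json` are made-up placeholders, not values from the real model config.

```python
import json
import os
import tempfile


def update_speaker_path(config_path, speakers_path):
    # Same logic as the cell above, repeated so this block is self-contained.
    with open(config_path, "r") as f:
        config = json.load(f)
    config["model_args"]["speakers_file"] = speakers_path
    config["speakers_file"] = speakers_path
    with open(config_path, "w") as f:
        json.dump(config, f)


# Placeholder config containing only the two keys the function touches.
dummy = {
    "model_args": {"speakers_file": "old/path.json"},
    "speakers_file": "old/path.json",
}

with tempfile.TemporaryDirectory() as tmp:
    demo_config = os.path.join(tmp, "config.json")
    with open(demo_config, "w") as f:
        json.dump(dummy, f)

    update_speaker_path(demo_config, "/tmp/speakers.json")

    # Read the file back to confirm both entries were rewritten.
    with open(demo_config) as f:
        updated = json.load(f)

print(updated["speakers_file"])
print(updated["model_args"]["speakers_file"])
```

Note that the function raises a `KeyError` if the config has no `model_args` entry, so it is only suited to configs shaped like the one shipped with the model.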

Determine the arguments necessary to perform inference:

In [6]:
message = "Lets go tarkov"  # Enter the text you want to convert to speech.

model_path = "./VITS_finetuned/vits_BSC_Gerard_Sant/best_model.pth" # Path to best_model.pth

config_path = "./VITS_finetuned/vits_BSC_Gerard_Sant/config.json" # Path to config.json

speaker_path = "./VITS_finetuned/vits_BSC_Gerard_Sant/speakers.json" # Path to speakers.json
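
A wrong path would probably only show up later as an error from the synthesis script, so it can help to check the files first. The following is a minimal sketch; `missing_paths` is a hypothetical helper, and the list repeats the three paths defined above.

```python
from pathlib import Path


def missing_paths(paths):
    """Return the entries of `paths` that are not existing regular files."""
    return [p for p in paths if not Path(p).is_file()]


required = [
    "./VITS_finetuned/vits_BSC_Gerard_Sant/best_model.pth",
    "./VITS_finetuned/vits_BSC_Gerard_Sant/config.json",
    "./VITS_finetuned/vits_BSC_Gerard_Sant/speakers.json",
]

missing = missing_paths(required)
if missing:
    print("Missing files:", missing)
else:
    print("All model files found.")
```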

Perform inference using the following code cell:

In [7]:
update_speaker_path(config_path, speaker_path)

!python ./TTS/TTS/bin/synthesize.py --text "{message}" \
      --model_path {model_path} \
      --config_path {config_path} \
      --speaker_id my_speaker \
      --out_path OUTPUT_AUDIOS/output.wav

 > Using model: vits
 > Setting up Audio Processor...
 | > sample_rate:22050
 | > resample:True
 | > num_mels:80
 | > log_func:np.log
 | > min_level_db:-100
 | > frame_shift_ms:None
 | > frame_length_ms:None
 | > ref_level_db:20
 | > fft_size:1024
 | > power:1.5
 | > preemphasis:0.0
 | > griffin_lim_iters:60
 | > signal_norm:False
 | > symmetric_norm:True
 | > mel_fmin:0
 | > mel_fmax:None
 | > pitch_fmin:0.0
 | > pitch_fmax:640.0
 | > spec_gain:1.0
 | > stft_pad_mode:reflect
 | > max_norm:4.0
 | > clip_norm:True
 | > do_trim_silence:True
 | > trim_db:23
 | > do_sound_norm:False
 | > do_amp_to_db_linear:False
 | > do_amp_to_db_mel:True
 | > do_rms_norm:False
 | > db_level:None
 | > stats_path:None
 | > base:2.718281828459045
 | > hop_length:256
 | > win_length:1024
 > initialization of speaker-embedding layers.
 > Using Griffin-Lim as no vocoder model defined
 > Text: Lets go tarkov
 > Text splitted to sentences.
['Lets go tarkov']
 > Processing time: 2.5188755989074707
 > Real-time fa
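
Once synthesis finishes, the result can be inspected from Python. The following is a minimal sketch using the standard-library `wave` module, which assumes the output is an uncompressed PCM WAV file; `wav_duration_seconds` is a hypothetical helper.

```python
import os
import wave


def wav_duration_seconds(path):
    """Return (duration in seconds, sample rate in Hz) for a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
    return frames / rate, rate


out_file = "OUTPUT_AUDIOS/output.wav"  # same path as --out_path above
if os.path.exists(out_file):
    duration, rate = wav_duration_seconds(out_file)
    print(f"{duration:.2f} s at {rate} Hz")
```

In a notebook, `IPython.display.Audio(filename=out_file)` plays the file inline.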

## Exercise 3: Inference through a local website


Install the packages needed to run the website locally:

In [1]:
!pip install flask && pip install redis

Collecting redis
  Downloading redis-4.2.2-py3-none-any.whl (226 kB)
Collecting deprecated>=1.2.3
  Downloading Deprecated-1.2.13-py2.py3-none-any.whl (9.6 kB)
Collecting async-timeout>=4.0.2
  Downloading async_timeout-4.0.2-py3-none-any.whl (5.8 kB)
Installing collected packages: deprecated, async-timeout, redis
Successfully installed async-timeout-4.0.2 deprecated-1.2.13 redis-4.2.2


Run the following cell and click the address that appears in the output:

In [None]:
!python ./VITS_finetuned/vits_web/app.py

 * Serving Flask app "app" (lazy loading)
 * Environment: production
   Use a production WSGI server instead.
 * Debug mode: on
 * Running on http://127.0.0.1:5000/ (Press CTRL+C to quit)
 * Restarting with stat
 * Debugger is active!
 * Debugger PIN: 264-649-855
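
If the server does not start, one possible cause is that the port is already taken, for example by an earlier run of the same cell. The following is a minimal sketch that tests this with a plain socket, assuming the app listens on 127.0.0.1:5000 as in the output above; `port_in_use` is a hypothetical helper.

```python
import socket


def port_in_use(host, port):
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(1.0)
        # connect_ex returns 0 on success and an error code otherwise.
        return sock.connect_ex((host, port)) == 0


print(port_in_use("127.0.0.1", 5000))
```

Because the cell that runs `app.py` blocks while the server is up, this check has to be run before launching it or from a separate session.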
