# End-to-End prosody transfer prototype using QuartzNet, Mellotron, Tacotron, WaveGlow, Multi-Singer and HiFiGAN.
This notebook uses resources from the following repositories:
 - https://github.com/NVIDIA/mellotron
 - https://github.com/NVIDIA/NeMo
 - https://github.com/Rongjiehuang/Multi-Singer
 - https://github.com/NVIDIA/waveglow.git

and is designed to run in Google Colab. Running it locally may require installing many additional dependencies.

This notebook requires a GPU to run properly.
First, open `Runtime` / `Change runtime type` in the menu bar at the top left and select a `GPU` option.
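
A quick way to confirm that the runtime really has a GPU is sketched below. The helper name `gpu_available` is hypothetical; it assumes PyTorch is installed (as it is on Colab) and otherwise falls back to looking for the `nvidia-smi` utility.

```python
import shutil


def gpu_available() -> bool:
    """Return True if a CUDA-capable GPU appears to be usable."""
    try:
        import torch  # assumed to be preinstalled on Colab
        return bool(torch.cuda.is_available())
    except ImportError:
        # Without PyTorch, fall back to checking for the NVIDIA driver tool.
        return shutil.which("nvidia-smi") is not None


print("GPU available:", gpu_available())
```

If this prints `False`, the cells that call `.cuda()` later in the notebook will fail.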

Mount a personal Google Drive containing the pretrained models: choose the correct account and allow access.

In [None]:
# Mount a personal Google Drive containing the pretrained models: choose the correct account and allow access.
from google.colab import drive
drive.mount('/content/drive')

In [None]:
%cd /content/
! ls

*If the repo already exists in your local files, run the cell below to remove it and clone a fresh copy.*


In [None]:
# remove the repository left over from previous runs (-f avoids an error if it does not exist)
! rm -rf mellotron
# clone the Mellotron repository
! git clone https://github.com/NVIDIA/mellotron.git

Descend into the repo and list directory contents

In [None]:
%cd /content/mellotron
! ls

# Initialize Nvidia NeMo Framework + dependencies
The following block installs the NeMo framework from GitHub.

In [None]:
## Install dependencies
!pip install wget
!apt-get install sox libsndfile1 ffmpeg
!pip install unidecode

# ## Install NeMo
BRANCH = 'main'
!python -m pip install git+https://github.com/NVIDIA/NeMo.git@$BRANCH#egg=nemo_toolkit[all]

## Install TorchAudio (the requirement is quoted so the shell does not treat ">" as a redirect)
!pip install "torchaudio>=0.10.0" -f https://download.pytorch.org/whl/torch_stable.html

## Grab the config we'll use in this example
!mkdir configs

In [None]:
import nemo
nemo.__version__

## Initializing Git submodule for waveglow
WaveGlow is used as a default vocoder for Mellotron.

In [None]:
# this cell makes sure the sub-repository (WaveGlow) used by Mellotron is initialized
! git submodule init
! git submodule update

Copy large files (pretrained models) from the mapped GDrive to the local filesystem.
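
After copying, it can help to confirm that the expected files actually arrived. The sketch below uses a hypothetical helper `missing_models`; the file names are the ones loaded later in this notebook, and the `models` directory name assumes the copy command above was run from the current directory.

```python
from pathlib import Path

# File names that later cells in this notebook load from the models folder.
EXPECTED_MODELS = [
    "mellotron_libritts.pt",
    "waveglow_256channels_universal_v4.pt",
    "Basic.pkl",
]


def missing_models(model_dir, expected=EXPECTED_MODELS):
    """Return the expected file names that are not present in model_dir."""
    root = Path(model_dir)
    return [name for name in expected if not (root / name).is_file()]


missing = missing_models("models")
if missing:
    print("Missing model files:", ", ".join(missing))
else:
    print("All model files found.")
```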

In [None]:
!cp -r /content/drive/MyDrive/models/ .
! ls

Mellotron needs many dependencies, as listed in its requirements.txt:
- matplotlib==2.1.0
- tensorflow==1.15.2
- inflect==0.2.5
- librosa==0.6.0
- scipy==1.0.0
- tensorboardX==1.1
- Unidecode==1.0.22
- pillow
- nltk==3.4.5
- jamo==0.4.1
- music21

In [None]:
# We create a new requirements file, since Google Colab comes with many of these packages preinstalled and reinstalling them throws errors.
! touch newReqs.txt
! echo 'Unidecode==1.0.22' > newReqs.txt
! echo 'tensorflow==1.15.2' >> newReqs.txt

*Unidecode is not preinstalled by Colab.*
*TensorFlow is included in Colab, but versions 2.0 and newer removed the `contrib` module. Mellotron requires an older version of TensorFlow to function properly.*
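
The version constraint can be checked with a small helper. This is only a sketch: `major_minor` and `has_tf_contrib` are hypothetical names, and in the notebook you would pass `tf.__version__` instead of the literal strings used here.

```python
from itertools import takewhile


def major_minor(version: str) -> tuple:
    """Return (major, minor) from a version string such as '1.15.2'."""
    numbers = []
    for piece in version.split(".")[:2]:
        # Keep only the leading digits, so suffixes like 'rc1' are ignored.
        digits = "".join(takewhile(str.isdigit, piece))
        numbers.append(int(digits) if digits else 0)
    while len(numbers) < 2:
        numbers.append(0)
    return tuple(numbers)


def has_tf_contrib(version: str) -> bool:
    """tf.contrib was removed in TensorFlow 2.0, so only 1.x releases have it."""
    return major_minor(version)[0] < 2


for v in ("1.15.2", "2.8.0"):
    print(v, "->", "has tf.contrib" if has_tf_contrib(v) else "no tf.contrib")
```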

In [None]:
# requirement installation - can take a while.
! pip install -r newReqs.txt

## Mellotron Inference
Mellotron needs to instantiate and import many of its libraries to synthesize mel spectrograms.

In [None]:
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
import IPython.display as ipd

import sys
sys.path.append('waveglow/')

from itertools import cycle
import numpy as np
import scipy as sp
from scipy.io.wavfile import write
import pandas as pd
import librosa
import torch

# importing custom Mellotron classes
from hparams import create_hparams
from model import Tacotron2, load_model
from waveglow.denoiser import Denoiser
from layers import TacotronSTFT
from data_utils import TextMelLoader, TextMelCollate
from text import cmudict, text_to_sequence
from mellotron_utils import get_data_from_musicxml

In [None]:
def panner(signal, angle):
    angle = np.radians(angle)
    left = np.sqrt(2)/2.0 * (np.cos(angle) - np.sin(angle)) * signal
    right = np.sqrt(2)/2.0 * (np.cos(angle) + np.sin(angle)) * signal
    return np.dstack((left, right))[0]
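
The `panner` helper above is not explained in the notebook. Reading its formula, it appears to apply a constant-power pan law: the squared left and right gains always sum to 1. The sketch below reproduces only the gain computation with the standard library (the name `pan_gains` is hypothetical) so the behaviour at a few angles can be inspected.

```python
import math


def pan_gains(angle_deg: float) -> tuple:
    """Left/right gains from the panner() formula for an angle in degrees."""
    angle = math.radians(angle_deg)
    left = math.sqrt(2) / 2.0 * (math.cos(angle) - math.sin(angle))
    right = math.sqrt(2) / 2.0 * (math.cos(angle) + math.sin(angle))
    return left, right


for angle in (-45, 0, 45):
    left, right = pan_gains(angle)
    power = left ** 2 + right ** 2
    print(f"{angle:>4} deg: left={left:.3f} right={right:.3f} power={power:.3f}")
```

With this formula, -45 degrees sends the signal fully to the left channel, 0 degrees splits it equally and +45 degrees sends it fully to the right.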

In [None]:
def plot_mel_f0_alignment(mel_source, mel_outputs_postnet, f0s, alignments, figsize=(16, 16)):
    fig, axes = plt.subplots(4, 1, figsize=figsize)
    axes = axes.flatten()
    # matplotlib only accepts origin='upper' or origin='lower'
    axes[0].imshow(mel_source, aspect='auto', origin='lower', interpolation='none')
    axes[1].imshow(mel_outputs_postnet, aspect='auto', origin='lower', interpolation='none')
    axes[2].scatter(range(len(f0s)), f0s, alpha=0.5, color='red', marker='.', s=1)
    axes[2].set_xlim(0, len(f0s))
    axes[3].imshow(alignments, aspect='auto', origin='lower', interpolation='none')
    axes[0].set_title("Source Mel")
    axes[1].set_title("Predicted Mel")
    axes[2].set_title("Source pitch contour")
    axes[3].set_title("Source rhythm")
    plt.tight_layout()

In [None]:
def load_mel(path):
    audio, sampling_rate = librosa.core.load(path, sr=hparams.sampling_rate)
    audio = torch.from_numpy(audio)
    if sampling_rate != hparams.sampling_rate:
        raise ValueError("{} SR doesn't match target {} SR".format(
            sampling_rate, hparams.sampling_rate))
    audio_norm = audio.unsqueeze(0)
    audio_norm = torch.autograd.Variable(audio_norm, requires_grad=False)
    melspec = stft.mel_spectrogram(audio_norm)
    melspec = melspec.cuda()
    return melspec

In [None]:
hparams = create_hparams()

Create the STFT module that converts audio into mel spectrograms for the inner Tacotron model used as the aligner.
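
To get a feel for what these settings mean, the sketch below estimates how many mel frames one second of audio produces. The values `22050` and `256` are assumptions about Mellotron's default `sampling_rate` and `hop_length`; read the real ones from `hparams`. The formula assumes a centred STFT, which pads the signal so that every hop yields one frame.

```python
def mel_frame_count(n_samples: int, hop_length: int) -> int:
    """Number of frames produced by a centred STFT with the given hop."""
    return n_samples // hop_length + 1


SAMPLING_RATE = 22050  # assumed default, check hparams.sampling_rate
HOP_LENGTH = 256       # assumed default, check hparams.hop_length

frames = mel_frame_count(SAMPLING_RATE, HOP_LENGTH)
frame_ms = 1000.0 * HOP_LENGTH / SAMPLING_RATE
print(f"1 s of audio -> {frames} frames, {frame_ms:.1f} ms per hop")
```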

In [None]:
stft = TacotronSTFT(hparams.filter_length, hparams.hop_length, hparams.win_length,
                    hparams.n_mel_channels, hparams.sampling_rate, hparams.mel_fmin,
                    hparams.mel_fmax)

## Loading pretrained models
The pretrained LibriTTS model for Mellotron is available from: https://drive.google.com/open?id=1ZesPPyRRKloltRIuRnGZ2LIUEuMSVjkI

In [None]:
checkpoint_path = "models/mellotron_libritts.pt"
mellotron = load_model(hparams).cuda().eval()
mellotron.load_state_dict(torch.load(checkpoint_path)['state_dict'])

The pretrained model for WaveGlow is available from: https://drive.google.com/open?id=1okuUstGoBe_qZ4qUEF8CcwEugHP7GM_b

In [None]:
waveglow_path = 'models/waveglow_256channels_universal_v4.pt'
waveglow = torch.load(waveglow_path)['model'].cuda().eval()
denoiser = Denoiser(waveglow).cuda().eval()

# Add full ASR NeMo Pipeline with Quartznet
To demonstrate the power of NeMo, we can add a speech recognition model that transcribes the text of the source audio clip for us.

In [None]:
import nemo.collections.asr as nemo_asr

In [None]:
# import pretrained quartznet
quartznet = nemo_asr.models.EncDecCTCModel.from_pretrained(model_name="QuartzNet15x5Base-En")

In [None]:
files = ['./data/test-sing.wav']
for fname, transcription in zip(files, quartznet.transcribe(paths2audio_files=files)):
  print(f"Audio in {fname} was recognized as: {transcription}")

## Load additional data required
The CMU pronouncing dictionary is a large text file that maps normalized words to ARPAbet phonemes with the correct pronunciation. For example "ABDOMINAL ==> AE0 B D AA1 M AH0 N AH0 L"
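
As an illustration of that mapping, the sketch below splits one dictionary-style line into a word and its phoneme list. The helper `parse_cmudict_line` is hypothetical and is not the parser Mellotron uses. The trailing digits on vowels mark stress (0 none, 1 primary, 2 secondary).

```python
def parse_cmudict_line(line: str):
    """Split a dictionary line into (word, [phonemes])."""
    word, *phonemes = line.split()
    return word, phonemes


word, phonemes = parse_cmudict_line("ABDOMINAL  AE0 B D AA1 M AH0 N AH0 L")
# Vowels ending in "1" carry the primary stress.
stressed = [p for p in phonemes if p.endswith("1")]
print(word, phonemes, "primary stress:", stressed)
```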

`audio_paths` points to a file list whose entries pair an audio path with its text and speaker ID for inference.
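
Each entry in such a file list is unpacked later as `audio_path, text, sid`. The sketch below shows one way a line could be parsed, assuming the fields are separated by `|`; both the separator and the example line are assumptions, so check them against `data/examples_filelist.txt`.

```python
def parse_filelist_line(line: str, sep: str = "|"):
    """Split one file-list line into (audio_path, text, speaker_id)."""
    # Raises ValueError if the line does not have exactly three fields.
    audio_path, text, speaker_id = line.strip().split(sep)
    return audio_path, text, speaker_id


# Hypothetical example line, not taken from the real file list.
entry = parse_filelist_line("data/example.wav|hello world|0\n")
print(entry)
```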

In [None]:
arpabet_dict = cmudict.CMUDict('data/cmu_dictionary')
audio_paths = 'data/examples_filelist.txt'
dataloader = TextMelLoader(audio_paths, hparams)
datacollate = TextMelCollate(1)

Data loading is done here:

In [None]:
file_idx = 0
audio_path, text, sid = dataloader.audiopaths_and_text[file_idx]

# get audio path, encoded text, pitch contour and mel for gst
text_encoded = torch.LongTensor(text_to_sequence(text, hparams.text_cleaners, arpabet_dict))[None, :].cuda()    
pitch_contour = dataloader[file_idx][3][None].cuda()
mel = load_mel(audio_path)
print(audio_path, text)

# load source data to obtain rhythm using tacotron 2 as a forced aligner
x, y = mellotron.parse_batch(datacollate([dataloader[file_idx]]))

This is the source audio clip selected for prosody extraction.

In [None]:
ipd.Audio(audio_path, rate=hparams.sampling_rate)

In [None]:
# extract speaker IDs from dataset
speaker_ids = TextMelLoader("filelists/libritts_train_clean_100_audiopath_text_sid_shorterthan10s_atleast5min_train_filelist.txt", hparams).speaker_ids
# Extract speaker information
speakers = pd.read_csv('filelists/libritts_speakerinfo.txt', engine='python', header=None, comment=';', sep=r' *\| *',
                       names=['ID', 'SEX', 'SUBSET', 'MINUTES', 'NAME'])

# Connect speaker information with ID
speakers['MELLOTRON_ID'] = speakers['ID'].apply(lambda x: speaker_ids[x] if x in speaker_ids else -1)
# Create speaker list based on SEX and length of recordings
female_speakers = cycle(
    speakers.query("SEX == 'F' and MINUTES > 20 and MELLOTRON_ID >= 0")['MELLOTRON_ID'].sample(frac=1).tolist())
male_speakers = cycle(
    speakers.query("SEX == 'M' and MINUTES > 20 and MELLOTRON_ID >= 0")['MELLOTRON_ID'].sample(frac=1).tolist())

## Prosody transfer

In [None]:
with torch.no_grad():
    # get rhythm (alignment map) using tacotron 2
    mel_outputs, mel_outputs_postnet, gate_outputs, rhythm = mellotron.forward(x)
    rhythm = rhythm.permute(1, 0, 2)

In [None]:
# choose random speaker ID and SEX for synthesis
speaker_id = next(female_speakers) if np.random.randint(2) else next(male_speakers)
speaker_id = torch.LongTensor([speaker_id]).cuda()

# Generate spectrogram
with torch.no_grad():
    mel_outputs, mel_outputs_postnet, gate_outputs, _ = mellotron.inference_noattention(
        (text_encoded, mel, speaker_id, pitch_contour, rhythm))

# Plot spectrogram and additional info
plot_mel_f0_alignment(x[2].data.cpu().numpy()[0],
                      mel_outputs_postnet.data.cpu().numpy()[0],
                      pitch_contour.data.cpu().numpy()[0, 0],
                      rhythm.data.cpu().numpy()[:, 0].T)

The first two plots show the source mel spectrogram and the predicted mel spectrogram that will be synthesized into audio.
The third plot displays the pitch contour extracted from the source clip.
The last plot shows the source rhythm, i.e. the attention alignment between the text and the mel frames.

## Waveform generation with WaveGlow

In [None]:
with torch.no_grad():
    audioWaveglow = denoiser(waveglow.infer(mel_outputs_postnet, sigma=0.8), 0.01)[:, 0]
ipd.Audio(audioWaveglow[0].data.cpu().numpy(), rate=hparams.sampling_rate)

## Replacing WaveGlow with Multi-Singer
This is an attempt to replace WaveGlow for waveform generation with a better-suited vocoder, Multi-Singer.

In [None]:
# return to the absolute path root
%cd /content/
! git clone https://github.com/Rongjiehuang/Multi-Singer.git
%cd Multi-Singer

## Inference for Multi-Singer
Inference for Multi-Singer works differently: it is run from the command line via a command with parameters.


`python inference.py -i data/feature -o outputs/  -c checkpoints/*.pkl -g config/config.yaml`

-i acoustic feature folder

-o directory to save generated speech

-c checkpoint file

-g config file
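
The same command can be assembled in Python, which makes it easier to swap paths between runs. This is a minimal sketch: `build_inference_command` is a hypothetical helper, and it only uses the four flags shown in the example command above.

```python
import shlex


def build_inference_command(feature_dir, output_dir, checkpoint, config):
    """Build the argument list for Multi-Singer's inference.py."""
    return [
        "python", "inference.py",
        "-i", feature_dir,   # acoustic feature folder
        "-o", output_dir,    # directory to save generated speech
        "-c", checkpoint,    # checkpoint file
        "-g", config,        # config file
    ]


cmd = build_inference_command(
    "data/feature", "outputs/", "../models/Basic.pkl", "config/config.yaml"
)
print(shlex.join(cmd))
# To actually run it: subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems if a path contains spaces.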

In [None]:
# make directory for output
! mkdir outputs

In [None]:
# save the mel spectrogram predicted by Mellotron
# (torch.save creates the file itself, so no separate open() call is needed)

# it is unclear which spectrogram format Multi-Singer expects and how to save it properly for import
torch.save(mel_outputs_postnet, 'mel_output.pt')

## Generating audio waveform with Multi-singer
So far this always fails because Multi-Singer does not accept the input spectrogram format.

In [None]:
! python inference.py -i mel_output.pt -o outputs/  -c ../models/Basic.pkl -g config/config.yaml

## Generating audio waveform with HiFiGAN via NeMo
The following approach demonstrates the possibility of swapping a vocoder module for Mellotron and attaching a different module provided by the NeMo framework.

In [None]:
# import pretrained TTS models
import nemo.collections.tts as nemo_tts

# import pretrained HiFiGAN
from nemo.collections.tts.models import HifiGanModel
hifigan = HifiGanModel.from_pretrained("tts_hifigan").eval().cuda()

### Audio synthesis from original mel-Spectrogram
The following audio was synthesized via HiFiGAN.

In [None]:
audio = hifigan.convert_spectrogram_to_audio(spec=mel_outputs_postnet).to('cpu').detach().numpy()
ipd.display(ipd.Audio(audio, rate=hparams.sampling_rate))

For comparison, the clip synthesized earlier with WaveGlow:

In [None]:
ipd.Audio(audioWaveglow[0].data.cpu().numpy(), rate=hparams.sampling_rate)