# HuggingFace: Audio (Text to Speech)

## Imports

In [1]:
# these are the necessary packages for the text to speech model
# it is important that this section runs before the others
# the utils and data_utils package are from this repo
%pwd
!git clone https://github.com/jaywalnut310/vits.git
!python --version
%cd vits/

%cd monotonic_align/
%mkdir monotonic_align
!python3 setup.py build_ext --inplace
%cd ../
%pwd

fatal: destination path 'vits' already exists and is not an empty directory.


Python 3.8.18
/Users/mocha/DataspellProjects/CMPE258/FastAI-Keras/HuggingFace/Audio-Model/vits
/Users/mocha/DataspellProjects/CMPE258/FastAI-Keras/HuggingFace/Audio-Model/vits/monotonic_align
mkdir: monotonic_align: File exists
/Users/mocha/DataspellProjects/CMPE258/FastAI-Keras/HuggingFace/Audio-Model/vits


'/Users/mocha/DataspellProjects/CMPE258/FastAI-Keras/HuggingFace/Audio-Model/vits'

In [2]:
# used to import the pretrained torch huggingface models 
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
import torch
from torch import nn
from torch.nn import functional as F
from torch.utils.data import DataLoader
# used to import checkpoint
import os
import subprocess
import locale
import re
import glob
import tempfile
import math
import utils
import argparse
# used for inference / eval
from IPython.display import Audio
from jiwer import wer

  torch.utils._pytree._register_pytree_node(


INFO:numexpr.utils:Note: NumExpr detected 12 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8.
INFO:numexpr.utils:NumExpr defaulting to 8 threads.


In [3]:
# loads/modify the data
import pandas as pd
import numpy as np
import gzip
import json

In [14]:
# loads the packages from the authors' github directory
from data_utils import TextAudioLoader, TextAudioCollate, TextAudioSpeakerLoader, TextAudioSpeakerCollate
from models import SynthesizerTrn
from scipy.io.wavfile import write
import commons

In [5]:
# explore the data and select features
def parse(path):
  g = gzip.open(path, 'rb')
  for l in g:
    yield json.loads(l)

def getDF(path):
  i = 0
  df = {}
  for d in parse(path):
    df[i] = d
    i += 1
  return pd.DataFrame.from_dict(df, orient='index')
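
The `parse` helper above never closes the gzip handle explicitly. A minimal sketch of the same line-by-line read with a context manager, assuming the file holds one JSON object per line (as the loader above implies); `read_json_lines` is a hypothetical name, not something defined elsewhere in this notebook:

```python
import gzip
import json


def read_json_lines(path):
    """Yield one dict per line from a gzip-compressed JSON-lines file."""
    # Text mode ("rt") handles decoding; the with-block closes the file
    # once iteration finishes.
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines instead of failing in json.loads
                yield json.loads(line)
```

Something like `pd.DataFrame(list(read_json_lines(path)))` should then give a frame with the same rows as `getDF`, without the manual counter.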

## The Data

The pretrained HuggingFace audio [model](https://huggingface.co/facebook/wav2vec2-base-960h) takes audio sampled at 16 kHz as input, since that is what it was originally trained on. The HuggingFace documentation also gives instructions on how to transcribe speech to text using this model.

Since I want to reverse this architecture, I would need to find a "vec2wav2" model. After checking HuggingFace, I found another [model](https://huggingface.co/facebook/mms-tts) that implements text-to-speech. This model is pretrained on a variety of languages, so it may be interesting to apply it to text inputs in several languages. The Facebook text-to-speech documentation that I used for the initial setup of the model can be found [here](https://colab.research.google.com/github/facebookresearch/fairseq/blob/main/examples/mms/tts/tutorial/MMS_TTS_Inference_Colab.ipynb#scrollTo=UtEeQcmwuUaG).

Unfortunately, my Amazon fashion reviews text dataset seems to contain only English reviews.

## ETL (Extract/Transform/Load)

In [6]:
fashion_reviews = getDF('../../../data/text_data/Amazon Fashion Review Data.json.gz')

In [20]:
y_sample = fashion_reviews['reviewText'].dropna().sample(50)

## Model Inference
Because the model is already pre-trained, inference only requires downloading the trained weights (the 'checkpoint') for the language of the input text. Since my fashion `reviewText` feature is in English, I will be using the English checkpoint.

### Loading the checkpoint

In [8]:
locale.getpreferredencoding = lambda: "UTF-8"

def download(lang, tgt_dir="./"):
  lang_fn, lang_dir = os.path.join(tgt_dir, lang+'.tar.gz'), os.path.join(tgt_dir, lang)
  cmd = ";".join([
        f"wget https://dl.fbaipublicfiles.com/mms/tts/{lang}.tar.gz -O {lang_fn}",
        f"tar zxvf {lang_fn}"
  ])
  print(f"Download model for language: {lang}")
  subprocess.check_output(cmd, shell=True)
  print(f"Model checkpoints in {lang_dir}: {os.listdir(lang_dir)}")
  return lang_dir

LANG = "eng"
ckpt_dir = download(LANG)

Download model for language: eng


--2024-03-07 14:48:44--  https://dl.fbaipublicfiles.com/mms/tts/eng.tar.gz
Resolving dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)... 2600:9000:201d:3400:13:6e38:acc0:93a1, 2600:9000:201d:6800:13:6e38:acc0:93a1, 2600:9000:201d:3200:13:6e38:acc0:93a1, ...
Connecting to dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)|2600:9000:201d:3400:13:6e38:acc0:93a1|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 134859962 (129M) [application/x-tar]
Saving to: ‘./eng.tar.gz’

     0K .......... .......... .......... .......... ..........  0% 5.33M 24s
    50K .......... .......... .......... .......... ..........  0% 7.20M 21s
   100K .......... .......... .......... .......... ..........  0% 10.0M 18s
   150K .......... .......... .......... .......... ..........  0% 6.79M 18s
   200K .......... .......... .......... .......... ..........  0% 47.4M 15s
   250K .......... .......... .......... .......... ..........  0% 8.49M 15s
   300K .......... .......... .......... ..

Model checkpoints in ./eng: ['config.json', 'G_100000.pth', 'vocab.txt']



x eng/config.json
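
The `download` function above shells out to `wget` and `tar`, which may not be installed on every system. A hedged sketch of the same two steps using only the standard library: the URL pattern is copied from the cell above, while `checkpoint_url`, `extract_archive`, and `download_stdlib` are hypothetical names introduced for illustration.

```python
import os
import tarfile
import urllib.request

# URL pattern taken from the download cell above.
MMS_TTS_URL = "https://dl.fbaipublicfiles.com/mms/tts/{lang}.tar.gz"


def checkpoint_url(lang):
    """Build the archive URL for a language code such as 'eng'."""
    return MMS_TTS_URL.format(lang=lang)


def extract_archive(archive_path, tgt_dir):
    """Unpack a .tar.gz into tgt_dir and return the member names."""
    # Only use this on archives from a source you trust.
    with tarfile.open(archive_path, "r:gz") as tar:
        names = tar.getnames()
        tar.extractall(tgt_dir)
    return names


def download_stdlib(lang, tgt_dir="./"):
    """Fetch and unpack one language checkpoint; returns its directory."""
    archive_path = os.path.join(tgt_dir, lang + ".tar.gz")
    urllib.request.urlretrieve(checkpoint_url(lang), archive_path)
    extract_archive(archive_path, tgt_dir)
    return os.path.join(tgt_dir, lang)
```

One difference from the cell above: this sketch extracts into `tgt_dir`, whereas `tar zxvf` extracts into the current working directory, so the two only agree when `tgt_dir` is `"./"`.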


In [10]:
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")

Some weights of the model checkpoint at facebook/wav2vec2-base-960h were not used when initializing Wav2Vec2ForCTC: ['wav2vec2.encoder.pos_conv_embed.conv.weight_g', 'wav2vec2.encoder.pos_conv_embed.conv.weight_v']
- This IS expected if you are initializing Wav2Vec2ForCTC from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing Wav2Vec2ForCTC from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of Wav2Vec2ForCTC were not initialized from the model checkpoint at facebook/wav2vec2-base-960h and are newly initialized: ['wav2vec2.encoder.pos_conv_embed.conv.parametrizations.weight.original0', 'wav2vec2.masked_spec_embed', 'wav2vec2.encoder.pos_conv_embed.conv.parametrizations.weight.original1']
You sho

preprocessor_config.json:   0%|          | 0.00/159 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/163 [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/291 [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/85.0 [00:00<?, ?B/s]

In [12]:
def preprocess_char(text, lang=None):
    """
    Special treatment of characters in certain languages
    """
    print(lang)
    if lang == 'ron':
        text = text.replace("ț", "ţ")
    return text

class TextMapper(object):
    def __init__(self, vocab_file):
        self.symbols = [x.replace("\n", "") for x in open(vocab_file, encoding="utf-8").readlines()]
        self.SPACE_ID = self.symbols.index(" ")
        self._symbol_to_id = {s: i for i, s in enumerate(self.symbols)}
        self._id_to_symbol = {i: s for i, s in enumerate(self.symbols)}

    def text_to_sequence(self, text, cleaner_names):
        '''Converts a string of text to a sequence of IDs corresponding to the symbols in the text.
        Args:
        text: string to convert to a sequence
        cleaner_names: names of the cleaner functions to run the text through
        Returns:
        List of integers corresponding to the symbols in the text
        '''
        sequence = []
        clean_text = text.strip()
        for symbol in clean_text:
            symbol_id = self._symbol_to_id[symbol]
            sequence += [symbol_id]
        return sequence

    def uromanize(self, text, uroman_pl):
        iso = "xxx"
        with tempfile.NamedTemporaryFile() as tf, \
             tempfile.NamedTemporaryFile() as tf2:
            with open(tf.name, "w") as f:
                f.write("\n".join([text]))
            cmd = f"perl {uroman_pl} -l {iso} < {tf.name} > {tf2.name}"
            os.system(cmd)
            outtexts = []
            with open(tf2.name) as f:
                for line in f:
                    line =  re.sub(r"\s+", " ", line).strip()
                    outtexts.append(line)
            outtext = outtexts[0]
        return outtext

    def get_text(self, text, hps):
        text_norm = self.text_to_sequence(text, hps.data.text_cleaners)
        if hps.data.add_blank:
            text_norm = commons.intersperse(text_norm, 0)
        text_norm = torch.LongTensor(text_norm)
        return text_norm

    def filter_oov(self, text):
        val_chars = self._symbol_to_id
        txt_filt = "".join(list(filter(lambda x: x in val_chars, text)))
        print(f"text after filtering OOV: {txt_filt}")
        return txt_filt

def preprocess_text(txt, text_mapper, hps, uroman_dir=None, lang=None):
    txt = preprocess_char(txt, lang=lang)
    is_uroman = hps.data.training_files.split('.')[-1] == 'uroman'
    if is_uroman:
        with tempfile.TemporaryDirectory() as tmp_dir:
            if uroman_dir is None:
                cmd = f"git clone git@github.com:isi-nlp/uroman.git {tmp_dir}"
                print(cmd)
                subprocess.check_output(cmd, shell=True)
                uroman_dir = tmp_dir
            uroman_pl = os.path.join(uroman_dir, "bin", "uroman.pl")
            print(f"uromanize")
            txt = text_mapper.uromanize(txt, uroman_pl)
            print(f"uroman text: {txt}")
    txt = txt.lower()
    txt = text_mapper.filter_oov(txt)
    return txt

if torch.cuda.is_available():
    device = torch.device("cuda")
else:
    device = torch.device("cpu")

print(f"Run inference with {device}")
vocab_file = f"{ckpt_dir}/vocab.txt"
config_file = f"{ckpt_dir}/config.json"
assert os.path.isfile(config_file), f"{config_file} doesn't exist"
hps = utils.get_hparams_from_file(config_file)
text_mapper = TextMapper(vocab_file)
net_g = SynthesizerTrn(
    len(text_mapper.symbols),
    hps.data.filter_length // 2 + 1,
    hps.train.segment_size // hps.data.hop_length,
    **hps.model)
net_g.to(device)
_ = net_g.eval()

g_pth = f"{ckpt_dir}/G_100000.pth"
print(f"load {g_pth}")

_ = utils.load_checkpoint(g_pth, net_g, None)

Run inference with cpu




load ./eng/G_100000.pth
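
To make the text pipeline in the cell above concrete: `filter_oov` drops characters that are missing from the vocabulary, `text_to_sequence` maps each remaining character to its index, and, when `hps.data.add_blank` is set, `commons.intersperse` places ID 0 around every symbol. Below is a toy sketch with a made-up six-symbol vocabulary (the real one comes from `vocab.txt`). `intersperse` is re-implemented here for illustration, on the assumption that it behaves like the VITS helper.

```python
def intersperse(ids, blank):
    """Return [blank, ids[0], blank, ids[1], ..., blank]."""
    result = [blank] * (len(ids) * 2 + 1)
    result[1::2] = ids
    return result


# Hypothetical toy vocabulary; index 0 plays the role of the blank symbol.
symbols = ["_", " ", "a", "c", "e", "t"]
symbol_to_id = {s: i for i, s in enumerate(symbols)}

text = "A cat!".lower()
# Keep only in-vocabulary characters, as filter_oov does ("!" is dropped).
filtered = "".join(ch for ch in text if ch in symbol_to_id)
ids = [symbol_to_id[ch] for ch in filtered]
with_blanks = intersperse(ids, 0)

print(filtered)     # a cat
print(ids)          # [2, 1, 3, 2, 5]
print(with_blanks)  # [0, 2, 0, 1, 0, 3, 0, 2, 0, 5, 0]
```

The same filtering explains why punctuation disappears in the inference log further down, for example "for...not" becoming "fornot".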


In [17]:
def translate_to_speech(txt):
    txt = preprocess_text(txt, text_mapper, hps, lang=LANG)
    stn_tst = text_mapper.get_text(txt, hps)
    with torch.no_grad():
        x_tst = stn_tst.unsqueeze(0).to(device)
        x_tst_lengths = torch.LongTensor([stn_tst.size(0)]).to(device)
        hyp = net_g.infer(
            x_tst, x_tst_lengths, noise_scale=.667,
            noise_scale_w=0.8, length_scale=1.0
        )[0][0,0].cpu().float().numpy()

    return Audio(hyp, rate=hps.data.sampling_rate)
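
`translate_to_speech` returns an `IPython.display.Audio` object, which plays in the notebook but is not written to disk. A minimal sketch of saving a float waveform as 16-bit PCM with the standard-library `wave` module follows. `save_wav` is a hypothetical helper, the samples are assumed to lie in [-1, 1], and the 16000 Hz rate is a placeholder for `hps.data.sampling_rate`.

```python
import math
import os
import struct
import tempfile
import wave


def save_wav(path, samples, sampling_rate):
    """Write mono float samples (assumed in [-1, 1]) as 16-bit PCM."""
    # Clip first so out-of-range values cannot overflow int16.
    pcm = [int(max(-1.0, min(1.0, float(s))) * 32767) for s in samples]
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)   # mono
        wav_file.setsampwidth(2)   # 2 bytes = 16 bits per sample
        wav_file.setframerate(sampling_rate)
        wav_file.writeframes(struct.pack(f"<{len(pcm)}h", *pcm))


# Stand-in for model output: half a second of a 440 Hz tone.
rate = 16000  # placeholder; use hps.data.sampling_rate with the real model
tone = [0.5 * math.sin(2 * math.pi * 440 * n / rate) for n in range(rate // 2)]
out_path = os.path.join(tempfile.gettempdir(), "tone.wav")
save_wav(out_path, tone, rate)
```

With the NumPy array produced inside `translate_to_speech`, `scipy.io.wavfile.write(path, rate, hyp)`, which is already imported above, should do a similar job in one call.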

In [21]:
audio_from_text = y_sample.apply(translate_to_speech)

eng
text after filtering OOV: love my new nike's  it's been years since i've been able to wear them sizing changed years ago and i just couldn't wear them any more but now the new styles and sizing are perfect  lightweight shoe with tons of style  the fog color is just what i was looking fornot black and not neon  i wore them to participate in relay for life 24 hour walk and they held up great
eng
text after filtering OOV: comfortable
eng
text after filtering OOV: excellent
eng
text after filtering OOV: cute shoe fit as expected and very comfortable
eng
text after filtering OOV: i like it i wasn't sure if i could pull off this kind of hat but it looks good and fits nicely
eng
text after filtering OOV: on the left shoe there were two stained yellow spots
eng
text after filtering OOV: great arch support and cushion both for the heel and ball of the foot great aerobic shoe
eng
text after filtering OOV: a nice lightweight shoe not a lot of cushion so i wouldn't run long distances in them
e

In [24]:
text_to_speech_fashionreviews = pd.DataFrame({'textReview': y_sample, 'audioReview': audio_from_text})

In [40]:
# let's check out the first few reviews (rows 0 through 5)
n = 6
for i in range(min(n, len(text_to_speech_fashionreviews))):
    review = text_to_speech_fashionreviews.iloc[i, :]
    print('Text: ', review['textReview'])
    display('Audio: ', review['audioReview'])

Text:  Love my new Nike's.  It's been years since I've been able to wear them (sizing changed years ago and I just couldn't wear them any more) but now the new styles and sizing are perfect.  Lightweight shoe with tons of style.  The Fog color is just what I was looking for...not black and not neon!  I wore them to participate in Relay for Life (24 hour walk) and they held up great!


'Audio: '

Text:  Comfortable


'Audio: '

Text:  EXCELLENT


'Audio: '

Text:  Cute shoe, fit as expected, and very comfortable.


'Audio: '

Text:  I like it. I wasn't sure if I could pull off this kind of hat, but it looks good and fits nicely.


'Audio: '

Text:  On the left shoe, there were two stained yellow spots.


'Audio: '