<h1>Text-to-Speech Converter and Emotion Prediction Model</h1>

<hr>

<h3>The program performs two main tasks: text-to-speech synthesis and emotion prediction from text.

For text-to-speech synthesis, the program uses two pre-trained models, Tacotron2 and WaveGlow, from the NVIDIA/DeepLearningExamples repository. It loads both models and prepares them for inference on the GPU. Tacotron2 generates a mel-spectrogram from the input text, and WaveGlow converts that spectrogram into an audio waveform. The resulting audio is saved as a WAV file and played back in the Jupyter Notebook with the Audio class from IPython.display.

For emotion prediction, the program uses the VADER (Valence Aware Dictionary and sEntiment Reasoner) library. It installs the library and imports the SentimentIntensityAnalyzer class. The predict_emotion function takes the input text, computes sentiment scores with the VADER analyzer, and labels the text as "positive", "negative", or "neutral" based on the compound score.

Together, the two parts show how to synthesize speech from text and predict its sentiment using pre-trained models and libraries.</h3>

 
Note that the program also includes code for installing dependencies and loading utility functions. These steps may be required in a particular environment but are not part of the core text-to-speech and emotion prediction logic.

<hr>

In [1]:
pip install vaderSentiment


Collecting vaderSentiment
  Downloading vaderSentiment-3.3.2-py2.py3-none-any.whl (125 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 126.0/126.0 kB 4.2 MB/s eta 0:00:00
Installing collected packages: vaderSentiment
Successfully installed vaderSentiment-3.3.2
Note: you may need to restart the kernel to use updated packages.


<h3>pip install vaderSentiment: This command installs the vaderSentiment library, which is used for sentiment analysis and emotion prediction.</h3>



In [2]:
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer


<h3>This line imports the SentimentIntensityAnalyzer class from the vaderSentiment library, which is used for sentiment analysis and emotion prediction.</h3>

In [4]:
%%bash
pip install numpy scipy librosa unidecode inflect
apt-get update
apt-get install -y libsndfile1





Hit:1 http://packages.cloud.google.com/apt gcsfuse-focal InRelease
Hit:2 https://packages.cloud.google.com/apt cloud-sdk InRelease
Get:3 https://packages.cloud.google.com/apt google-fast-socket InRelease [5015 B]
Hit:4 http://security.ubuntu.com/ubuntu focal-security InRelease
Hit:5 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64  InRelease
Hit:6 http://archive.ubuntu.com/ubuntu focal InRelease
Hit:7 http://archive.ubuntu.com/ubuntu focal-updates InRelease
Hit:8 http://archive.ubuntu.com/ubuntu focal-backports InRelease
Fetched 5015 B in 1s (4943 B/s)
Reading package lists...
Reading package lists...
Building dependency tree...
Reading state information...
libsndfile1 is already the newest version (1.0.28-7ubuntu0.1).
0 upgraded, 0 newly installed, 0 to remove and 43 not upgraded.
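
Once the install cell above has finished, it can be useful to confirm that the Python packages are actually importable before loading any models. The following is a minimal sketch that uses only the standard library. The helper name `find_missing` is an illustrative choice, not part of the original notebook, and libsndfile1 is left out because it is a system library rather than a Python package.

```python
import importlib.util

# Python packages installed by the pip command in the cell above.
# libsndfile1 is installed through apt, so it is not checked here.
REQUIRED_PACKAGES = ["numpy", "scipy", "librosa", "unidecode", "inflect"]


def find_missing(packages):
    """Return the names in `packages` that cannot be located for import."""
    return [name for name in packages if importlib.util.find_spec(name) is None]


missing = find_missing(REQUIRED_PACKAGES)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are importable.")
```

If anything is reported as missing, rerunning the install cell and restarting the kernel is usually the first thing to try.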



<h3>The %%bash cell magic runs the cell's contents as shell commands in a Jupyter Notebook. Here it installs the Python dependencies (numpy, scipy, librosa, unidecode, and inflect) with pip, refreshes the apt package index, and installs the libsndfile1 system library with apt-get.</h3>

Load the Tacotron2 model pre-trained on [LJ Speech dataset](https://keithito.com/LJ-Speech-Dataset/) and prepare it for inference:

In [5]:
import torch
tacotron2 = torch.hub.load('NVIDIA/DeepLearningExamples:torchhub', 'nvidia_tacotron2', model_math='fp16')
tacotron2 = tacotron2.to('cuda')
tacotron2.eval()

Downloading: "https://github.com/NVIDIA/DeepLearningExamples/zipball/torchhub" to /root/.cache/torch/hub/torchhub.zip
Downloading checkpoint from https://api.ngc.nvidia.com/v2/models/nvidia/tacotron2_pyt_ckpt_amp/versions/19.09.0/files/nvidia_tacotron2pyt_fp16_20190427


Tacotron2(
  (embedding): Embedding(148, 512)
  (encoder): Encoder(
    (convolutions): ModuleList(
      (0-2): 3 x Sequential(
        (0): ConvNorm(
          (conv): Conv1d(512, 512, kernel_size=(5,), stride=(1,), padding=(2,))
        )
        (1): BatchNorm1d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      )
    )
    (lstm): LSTM(512, 256, batch_first=True, bidirectional=True)
  )
  (decoder): Decoder(
    (prenet): Prenet(
      (layers): ModuleList(
        (0): LinearNorm(
          (linear_layer): Linear(in_features=80, out_features=256, bias=False)
        )
        (1): LinearNorm(
          (linear_layer): Linear(in_features=256, out_features=256, bias=False)
        )
      )
    )
    (attention_rnn): LSTMCell(768, 1024)
    (attention_layer): Attention(
      (query_layer): LinearNorm(
        (linear_layer): Linear(in_features=1024, out_features=128, bias=False)
      )
      (memory_layer): LinearNorm(
        (linear_layer): Linear(in_fea

<h3>The code above loads the pre-trained Tacotron2 model for text-to-speech synthesis from the NVIDIA/DeepLearningExamples repository through torch.hub. The checkpoint is downloaded, moved to the GPU ('cuda'), and switched to evaluation mode for inference. WaveGlow is loaded the same way in the next cell.</h3>

Load the pre-trained WaveGlow model, remove its weight normalization for inference, and move it to the GPU:

In [6]:
waveglow = torch.hub.load('NVIDIA/DeepLearningExamples:torchhub', 'nvidia_waveglow', model_math='fp16')
waveglow = waveglow.remove_weightnorm(waveglow)
waveglow = waveglow.to('cuda')
waveglow.eval()

Using cache found in /root/.cache/torch/hub/NVIDIA_DeepLearningExamples_torchhub
Downloading checkpoint from https://api.ngc.nvidia.com/v2/models/nvidia/waveglow_ckpt_amp/versions/19.09.0/files/nvidia_waveglowpyt_fp16_20190427


WaveGlow(
  (upsample): ConvTranspose1d(80, 80, kernel_size=(1024,), stride=(256,))
  (WN): ModuleList(
    (0-3): 4 x WN(
      (in_layers): ModuleList(
        (0): Conv1d(512, 1024, kernel_size=(3,), stride=(1,), padding=(1,))
        (1): Conv1d(512, 1024, kernel_size=(3,), stride=(1,), padding=(2,), dilation=(2,))
        (2): Conv1d(512, 1024, kernel_size=(3,), stride=(1,), padding=(4,), dilation=(4,))
        (3): Conv1d(512, 1024, kernel_size=(3,), stride=(1,), padding=(8,), dilation=(8,))
        (4): Conv1d(512, 1024, kernel_size=(3,), stride=(1,), padding=(16,), dilation=(16,))
        (5): Conv1d(512, 1024, kernel_size=(3,), stride=(1,), padding=(32,), dilation=(32,))
        (6): Conv1d(512, 1024, kernel_size=(3,), stride=(1,), padding=(64,), dilation=(64,))
        (7): Conv1d(512, 1024, kernel_size=(3,), stride=(1,), padding=(128,), dilation=(128,))
      )
      (res_skip_layers): ModuleList(
        (0-6): 7 x Conv1d(512, 1024, kernel_size=(1,), stride=(1,))
        (7

Now, let's make the model say:

In [38]:
text = "   Witnessing ,  the sufferings of the poor and helpless  ,  sinks my heart   with sadness. "



Format the input using utility methods:

<h3>The utils variable holds utility functions loaded from the NVIDIA/DeepLearningExamples repository.

The prepare_input_sequence function preprocesses the input text and converts it into the integer sequences and sequence lengths that Tacotron2 expects.</h3>

In [39]:
utils = torch.hub.load('NVIDIA/DeepLearningExamples:torchhub', 'nvidia_tts_utils')
sequences, lengths = utils.prepare_input_sequence([text])

Using cache found in /root/.cache/torch/hub/NVIDIA_DeepLearningExamples_torchhub
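
To make the idea of "sequences and lengths" concrete, here is a toy sketch of the same kind of preprocessing: each character is mapped to an integer ID, and shorter inputs are padded so the batch is rectangular. The symbol table, `encode`, and `prepare_batch` are hypothetical stand-ins for illustration. The NVIDIA utilities use their own, larger symbol set and text cleaning rules, which are not reproduced here.

```python
# Hypothetical symbol table for illustration only; the NVIDIA utilities use
# their own, larger table (the Tacotron2 embedding above has 148 entries).
SYMBOLS = ["_"] + list(" abcdefghijklmnopqrstuvwxyz'!,.?")
SYMBOL_TO_ID = {symbol: index for index, symbol in enumerate(SYMBOLS)}
PAD_ID = SYMBOL_TO_ID["_"]


def encode(text):
    """Lowercase the text and map each known character to its integer ID."""
    return [SYMBOL_TO_ID[ch] for ch in text.lower() if ch in SYMBOL_TO_ID]


def prepare_batch(texts):
    """Encode each text and pad to the longest one. Returns (sequences, lengths)."""
    encoded = [encode(t) for t in texts]
    lengths = [len(seq) for seq in encoded]
    longest = max(lengths)
    padded = [seq + [PAD_ID] * (longest - len(seq)) for seq in encoded]
    return padded, lengths


toy_sequences, toy_lengths = prepare_batch(["Hi there.", "Hi"])
print(toy_sequences)
print(toy_lengths)
```

The lengths are kept alongside the padded sequences so that a model can ignore the padding positions.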


Run the chained models:

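<h3>Inside the with torch.no_grad() block, which disables gradient tracking, Tacotron2 (tacotron2) infers a mel-spectrogram from the input sequences. The mel-spectrogram is then passed to WaveGlow (waveglow) to generate audio. The first waveform in the batch is copied to the CPU as a NumPy array, and rate is set to 22050 Hz, the sampling rate of the LJ Speech recordings.</h3>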

In [40]:
with torch.no_grad():
    mel, _, _ = tacotron2.infer(sequences, lengths)
    audio = waveglow.infer(mel)
audio_numpy = audio[0].data.cpu().numpy()
rate = 22050
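
The length of the generated clip follows from simple arithmetic. The sketch below assumes that each mel-spectrogram frame corresponds to 256 audio samples, which matches the stride of the WaveGlow upsample layer printed earlier, and uses the 22050 Hz rate from the cell above. The function names and the frame count are illustrative, not taken from the notebook.

```python
SAMPLE_RATE = 22050  # samples per second, the `rate` used in the cell above
HOP_LENGTH = 256     # assumed samples per mel frame (WaveGlow upsample stride)


def mel_frames_to_samples(n_frames, hop_length=HOP_LENGTH):
    """Number of audio samples produced for a given number of mel frames."""
    return n_frames * hop_length


def samples_to_seconds(n_samples, rate=SAMPLE_RATE):
    """Duration in seconds of a clip with `n_samples` samples."""
    return n_samples / rate


n_frames = 500  # hypothetical mel-spectrogram length
n_samples = mel_frames_to_samples(n_frames)
print(n_samples, "samples")
print(round(samples_to_seconds(n_samples), 2), "seconds")
```

In the notebook, `len(audio_numpy) / rate` gives the actual duration of the synthesized audio.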

You can write it to a file and listen to it:

In [41]:
from scipy.io.wavfile import write
write("audio.wav", rate, audio_numpy)
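
When scipy.io.wavfile.write receives a floating-point array, it stores floating-point WAV data, which some players may not handle. One possible alternative, sketched here with only the standard library, is to clip the samples to [-1, 1], scale them to 16-bit integers, and write a PCM file with the `wave` module. The helper `write_pcm16` and the test tone are assumptions made for illustration and are not part of the original notebook.

```python
import array
import math
import wave


def write_pcm16(path, samples, rate):
    """Write floats in [-1.0, 1.0] as a mono 16-bit PCM WAV file."""
    # Clip first so out-of-range values cannot overflow the 16-bit range.
    clipped = (max(-1.0, min(1.0, float(s))) for s in samples)
    pcm = array.array("h", (int(round(s * 32767)) for s in clipped))
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)  # 2 bytes = 16 bits per sample
        wav_file.setframerate(rate)
        wav_file.writeframes(pcm.tobytes())


# Stand-in signal: one second of a 440 Hz tone instead of real model output.
tone_rate = 22050
tone = [0.5 * math.sin(2 * math.pi * 440 * n / tone_rate) for n in range(tone_rate)]
write_pcm16("tone_pcm16.wav", tone, tone_rate)
```

In the notebook, a call such as `write_pcm16("audio_pcm16.wav", audio_numpy, rate)` would apply the same conversion to the synthesized audio.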

Alternatively, play it right away in the notebook with IPython.display.Audio:

In [42]:
from IPython.display import Audio
Audio(audio_numpy, rate=rate)

### Details
For detailed information on model input and output, training recipes, inference, and performance, visit [GitHub](https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/SpeechSynthesis/Tacotron2) and/or [NGC](https://ngc.nvidia.com/catalog/resources/nvidia:tacotron_2_and_waveglow_for_pytorch).



In [43]:
analyzer = SentimentIntensityAnalyzer()


<h3>The predict_emotion function takes the input text and uses the SentimentIntensityAnalyzer to obtain sentiment scores. It labels the text "positive" when the compound score is at least 0.05, "negative" when it is at most -0.05, and "neutral" otherwise.</h3>

In [15]:
def predict_emotion(text):
    """Return 'positive', 'negative', or 'neutral' for the given text."""
    # polarity_scores returns a dict with 'neg', 'neu', 'pos', and 'compound' keys
    sentiment_scores = analyzer.polarity_scores(text)

    # The compound score summarizes overall polarity in the range [-1, 1]
    compound_score = sentiment_scores['compound']

    # Classify with +/-0.05 thresholds on the compound score
    if compound_score >= 0.05:
        emotion = 'positive'
    elif compound_score <= -0.05:
        emotion = 'negative'
    else:
        emotion = 'neutral'

    return emotion
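
The thresholding step can also be separated from the analyzer, which makes it easy to try different cut-offs without running VADER. This is a minimal sketch: `label_from_compound` and its `threshold` parameter are hypothetical names introduced here, and the example scores are made up to exercise each branch.

```python
def label_from_compound(compound, threshold=0.05):
    """Map a compound score in [-1, 1] to 'positive', 'negative', or 'neutral'."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


# Hypothetical compound scores, chosen only to exercise each branch.
for score in (0.62, 0.05, 0.0, -0.05, -0.7):
    print(score, "->", label_from_compound(score))
```

A wider threshold sends more borderline text to "neutral", while a narrower one forces a positive or negative label more often.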


In [44]:

emotion = predict_emotion(text)
print(f"The emotion (sentiment) of the input text {text!r} is: {emotion}")


The emotion (sentiment) of the input text '   Witnessing ,  the sufferings of the poor and helpless  ,  sinks my heart   with sadness. ' is: negative
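
For background on where the compound score comes from: as far as the vaderSentiment source shows, the library sums the adjusted valence of each word and then squashes that sum into [-1, 1] with x / sqrt(x^2 + alpha), using alpha = 15. The sketch below reproduces only that normalization step in plain Python. Treat it as an assumption about the library's internals rather than a description of its full scoring pipeline, and note that the raw sums are made-up inputs.

```python
import math


def normalize(raw_sum, alpha=15):
    """Squash an unbounded valence sum into the range [-1, 1]."""
    score = raw_sum / math.sqrt(raw_sum * raw_sum + alpha)
    return max(-1.0, min(1.0, score))


# Hypothetical raw valence sums, for illustration only.
for raw in (-4.0, -1.0, 0.0, 1.0, 4.0):
    print(raw, "->", round(normalize(raw), 4))
```

Because the function saturates slowly, a few strongly negative words are enough to push the compound score well below the -0.05 threshold used above.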
