# 🗣️ Text-to-Audio: AI Generated Speech from Text
This project implements a complete **Text-to-Speech (TTS)** pipeline, built from scratch without third-party TTS tools, using FastSpeech2 and HiFi-GAN models. FastSpeech2 is the acoustic model that maps text to mel spectrograms, and HiFi-GAN is the vocoder that turns those spectrograms into natural-sounding speech.

### 🎯 Key Objectives:
- Convert raw text into mel spectrograms
- Use the HiFi-GAN vocoder to synthesize realistic audio
- Visualize the generated mel spectrogram and the training loss
- Prepare and validate datasets like LJ Speech for training


## 📂 Dataset and Preprocessing
We used the [LJ Speech](https://keithito.com/LJ-Speech-Dataset/) dataset. The pipeline includes:
- Extracting and validating mel spectrograms from WAV files
- Directory setup: `wavs`, `mel_spectrograms`, validation scripts
- Shaping and normalizing spectrograms for training
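
The extraction, shaping, and normalization steps above can be sketched as follows. This is a minimal NumPy-only illustration, not the project's actual preprocessing script: the FFT size, hop length, number of mel bands, per-utterance normalization, and all function names are assumptions chosen for the example. Only the 22050 Hz sampling rate comes from LJ Speech itself.

```python
import numpy as np

SAMPLE_RATE = 22050  # LJ Speech sampling rate
N_FFT = 1024         # assumed FFT size
HOP_LENGTH = 256     # assumed hop between frames, in samples
N_MELS = 80          # assumed number of mel bands


def hz_to_mel(hz):
    return 2595.0 * np.log10(1.0 + np.asarray(hz, dtype=np.float64) / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (np.asarray(mel, dtype=np.float64) / 2595.0) - 1.0)


def mel_filterbank(sr=SAMPLE_RATE, n_fft=N_FFT, n_mels=N_MELS):
    """Triangular mel filters with shape (n_mels, n_fft // 2 + 1)."""
    fft_freqs = np.linspace(0.0, sr / 2.0, n_fft // 2 + 1)
    hz_points = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2))
    filters = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        left, center, right = hz_points[i], hz_points[i + 1], hz_points[i + 2]
        rising = (fft_freqs - left) / (center - left)
        falling = (right - fft_freqs) / (right - center)
        filters[i] = np.maximum(0.0, np.minimum(rising, falling))
    return filters


def log_mel_spectrogram(waveform, sr=SAMPLE_RATE, n_fft=N_FFT,
                        hop=HOP_LENGTH, n_mels=N_MELS):
    """Return a log-mel spectrogram with shape (n_mels, n_frames)."""
    waveform = np.asarray(waveform, dtype=np.float64)
    if len(waveform) < n_fft:  # pad very short clips to one full frame
        waveform = np.pad(waveform, (0, n_fft - len(waveform)))
    n_frames = 1 + (len(waveform) - n_fft) // hop
    window = np.hanning(n_fft)
    frames = np.stack([waveform[i * hop:i * hop + n_fft] * window
                       for i in range(n_frames)])
    magnitude = np.abs(np.fft.rfft(frames, axis=1))  # (n_frames, n_fft // 2 + 1)
    mel = mel_filterbank(sr, n_fft, n_mels) @ magnitude.T
    return np.log(np.clip(mel, 1e-5, None))  # clip to avoid log(0)


def normalize(mel):
    """Zero mean, unit variance per utterance (an assumed scheme)."""
    return (mel - mel.mean()) / (mel.std() + 1e-8)


def validate_mel(mel, n_mels=N_MELS):
    """Check shape (n_mels, n_frames) and that every value is finite."""
    return (mel.ndim == 2 and mel.shape[0] == n_mels
            and mel.shape[1] > 0 and bool(np.isfinite(mel).all()))


# Usage on a synthetic one-second 440 Hz tone instead of a real WAV file.
t = np.arange(SAMPLE_RATE) / SAMPLE_RATE
mel = normalize(log_mel_spectrogram(0.5 * np.sin(2 * np.pi * 440.0 * t)))
print(mel.shape, validate_mel(mel))
```

In practice the waveform would be loaded from a file in `wavs` and the result saved under `mel_spectrograms`; the file format used for the saved spectrograms is not specified here.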


## 📈 Mel Spectrogram Visualization
This section displays a sample generated mel spectrogram, the intermediate representation that the HiFi-GAN vocoder converts into a waveform.


In [None]:
from IPython.display import Image
Image(filename='generated_mel.png')

## 📉 Training Loss Curve
This plot shows how the MSE loss evolves over training epochs, which helps assess convergence and model performance.


In [None]:
from IPython.display import Image
Image(filename='training_loss.png')
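
To make the convergence check concrete, here is a small sketch of how an MSE value is computed and how a loss history might be tested for a plateau. The function names, the window size, and the tolerance are illustrative assumptions, not values taken from the training script.

```python
from statistics import mean


def mse(predicted, target):
    """Mean squared error between two equal-length sequences of floats."""
    if len(predicted) == 0 or len(predicted) != len(target):
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(predicted)


def has_plateaued(epoch_losses, window=5, tolerance=1e-3):
    """True when the mean loss of the last `window` epochs improved by less
    than `tolerance` (relative) over the `window` epochs before it."""
    if len(epoch_losses) < 2 * window:
        return False  # not enough history to compare two windows
    previous = mean(epoch_losses[-2 * window:-window])
    recent = mean(epoch_losses[-window:])
    return (previous - recent) / max(previous, 1e-12) < tolerance


# Usage with a made-up loss history that is still decreasing.
history = [10.0, 9.0, 8.0, 7.0, 6.0, 5.0, 4.0, 3.0, 2.0, 1.0]
print(has_plateaued(history))
```

Note that a plateau only says the loss has stopped improving. A rising loss also satisfies this check, so the curve itself should still be inspected.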

## 🚀 Inference Pipeline
HiFi-GAN generates waveforms from mel spectrograms. The inference script processes test samples and plays the generated audio.


In [None]:
# Run inference from script
!python inference.py

## ✅ Conclusion
This notebook demonstrates a full TTS system capable of producing intelligible, natural-sounding audio. With additional fine-tuning and FastSpeech2 integration, the system could approach near real-time text-to-audio conversion.
