### SpeechT5 Model

- **What it is**:  
  SpeechT5 is a **transformer-based model** from Microsoft that covers both **text-to-speech (TTS)** and **speech-to-text (ASR)** tasks.  
  It uses a *“unified encoder-decoder”* design: a shared transformer backbone with modality-specific pre-nets and post-nets for speech and text, so the same architecture handles several speech and text tasks.  

- **How it works in TTS**:  
  1. **Text input** → tokenized by the `SpeechT5Processor`.  
  2. **Encoder** processes the text tokens into hidden representations.  
  3. **Decoder** generates **acoustic features** (like mel-spectrograms).  
  4. **Vocoder (HiFi-GAN)** converts those spectrograms into actual **waveforms**.  
  5. **Speaker embeddings** (x-vectors) are passed to the model alongside the text to select the **voice** (pitch, accent, timbre).  

- **Key features**:  
  - High-quality, natural-sounding speech synthesis.  
  - Supports **speaker adaptation** → you can change voices with embeddings.  
  - Part of the Hugging Face `transformers` library, easy to integrate.  

- **Why you used it**:  
  It gives you **high-quality TTS** with relatively simple code. Because synthesis runs one chunk at a time, you can stream speech chunk by chunk and attach approximate character-level timings to each chunk.  



### STEP 1: Load Models and Speaker Embeddings

- **Processor**  
  Handles text preprocessing and tokenization before passing into the model.

- **Model (SpeechT5)**  
  Converts processed text tokens into intermediate **mel-spectrograms** (representation of speech).

- **Vocoder (SpeechT5HifiGan)**  
  Converts mel-spectrograms into actual **audio waveforms** that we can play.

- **Speaker Embeddings**  
  - Precomputed x-vectors that capture voice characteristics (from the CMU ARCTIC dataset).  
  - Index **7306** is used here to pick a specific voice.  
  - These embeddings define **pitch, tone, accent, and timbre** of the generated speech.


In [5]:
import nest_asyncio
nest_asyncio.apply()
import asyncio
import torch
from transformers import SpeechT5Processor, SpeechT5ForTextToSpeech, SpeechT5HifiGan
from datasets import load_dataset
import numpy as np
import websockets
import json
import base64
import librosa

# Load models
processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_tts")
model = SpeechT5ForTextToSpeech.from_pretrained("microsoft/speecht5_tts")
vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")
embeddings_dataset = load_dataset("Matthijs/cmu-arctic-xvectors", split="validation")
speaker_embeddings = torch.tensor(embeddings_dataset[7306]["xvector"]).unsqueeze(0)




### STEP 2: Generate Audio Chunks with Character Timestamps

This function processes text in chunks and generates synchronized audio:

- **Chunking the text**  
  Splits the input text on whitespace into **4-word chunks** (the default `chunk_size`); the last chunk may be shorter.

- **Generating speech for each chunk**  
  - Passes tokens through the TTS model + vocoder to generate the audio waveform.

- **Estimating character timings**  
  - Calculates the total audio duration for the chunk.  
  - Divides it equally across all characters (simple approximation).  

- **Streaming output**  
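  Yields one dictionary per chunk containing:  
  - `chars`: the characters in the chunk  
  - `char_start_times_ms` and `char_durations_ms`: the estimated timing of each character, rounded to whole milliseconds  
  - `audio`: the generated audio tensor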
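
The chunking and timing arithmetic described above can be tried without loading the model. The following is a minimal sketch using only the standard library; the helper names `chunk_words` and `uniform_char_timings` and the sample count of 8000 are illustrative assumptions, not part of the notebook.

```python
from itertools import accumulate

SAMPLE_RATE = 16_000  # SpeechT5 output rate in Hz


def chunk_words(text: str, chunk_size: int = 4) -> list[str]:
    """Split text on whitespace and re-join it in groups of chunk_size words."""
    words = text.strip().split()
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]


def uniform_char_timings(chunk_text: str, total_samples: int) -> tuple[list[float], list[float]]:
    """Spread the chunk's duration evenly over its characters (spaces included)."""
    duration_ms = total_samples / SAMPLE_RATE * 1000
    per_char_ms = duration_ms / len(chunk_text)
    durations = [per_char_ms] * len(chunk_text)
    # Each character starts where the previous one ended; the first starts at 0.
    starts = [0.0] + list(accumulate(durations[:-1]))
    return starts, durations


chunks = chunk_words("the quick brown fox jumps over the lazy dog")
# chunks == ["the quick brown fox", "jumps over the lazy", "dog"]

# Hypothetical chunk: 4 characters rendered as 8000 samples (0.5 s at 16 kHz).
starts, durations = uniform_char_timings("abcd", 8000)
# starts == [0.0, 125.0, 250.0, 375.0], durations == [125.0] * 4
```

Because every character gets the same share of the duration, the timings are only a rough guide: real speech spends more time on vowels and pauses than on other characters.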


In [6]:

async def generate_audio_chunks_with_char_timestamps(text: str, chunk_size=4):
    """Yield audio plus approximate per-character timings for each chunk of words."""
    sample_rate = 16000  # SpeechT5 output sample rate in Hz
    words = text.strip().split()
    for i in range(0, len(words), chunk_size):
        chunk_text = " ".join(words[i:i + chunk_size])
        # Tokenize the chunk and synthesize a 1-D waveform tensor.
        inputs = processor(text=chunk_text, return_tensors="pt")
        speech_chunk = model.generate_speech(inputs["input_ids"], speaker_embeddings, vocoder=vocoder)
        # Spread the chunk's duration evenly across its characters (spaces included).
        num_chars = len(chunk_text)
        duration_ms = speech_chunk.shape[0] / sample_rate * 1000
        char_durations_ms = [duration_ms / num_chars] * num_chars
        char_start_times_ms = np.cumsum([0] + char_durations_ms[:-1]).tolist()
        yield {
            "chars": list(chunk_text),
            "char_start_times_ms": [round(t) for t in char_start_times_ms],
            "char_durations_ms": [round(d) for d in char_durations_ms],
            "audio": speech_chunk,
        }
        # Hand control back to the event loop between chunks.
        await asyncio.sleep(0.01)

### STEP 3: Conversion

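- **Model output**: The TTS model produces audio as a PyTorch tensor at 16 kHz.  
- **Playback compatibility**: This pipeline sends **44.1 kHz, 16-bit PCM** to the client, so the audio is resampled before transmission. (Browsers can play other sample rates as well; 44.1 kHz is the target chosen here.)  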

➡️ In practice:  
- Convert PyTorch tensor → NumPy array  
- Resample from 16kHz → 44.1kHz  
- Convert to 16-bit PCM integers  
- Encode as Base64 for safe WebSocket transmission  
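
To make the last two steps concrete, here is a minimal sketch of the float-to-PCM and Base64 round trip using only the standard library. It deliberately skips the resampling step, and the function names `floats_to_pcm16_base64` and `pcm16_base64_to_ints` are hypothetical helpers, not part of the notebook.

```python
import base64
import struct


def floats_to_pcm16_base64(samples: list[float]) -> str:
    """Clip to [-1, 1], scale to signed 16-bit little-endian PCM, then Base64-encode."""
    ints = [int(max(-1.0, min(1.0, s)) * 32767) for s in samples]
    raw = struct.pack(f"<{len(ints)}h", *ints)
    return base64.b64encode(raw).decode("ascii")


def pcm16_base64_to_ints(payload: str) -> list[int]:
    """Inverse step a client might run before playback: Base64 -> 16-bit integers."""
    raw = base64.b64decode(payload)
    return list(struct.unpack(f"<{len(raw) // 2}h", raw))


# 1.5 is out of range on purpose to show the clipping.
encoded = floats_to_pcm16_base64([0.0, 0.5, -1.0, 1.5])
decoded = pcm16_base64_to_ints(encoded)
# decoded == [0, 16383, -32767, 32767]
```

Each sample takes two bytes, and Base64 adds roughly a third on top of that, which is the price of sending binary audio inside a JSON text message.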


In [7]:
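def audio_to_base64(audio_tensor):
    # Tensor -> NumPy array (detach/cpu keep this safe for any tensor).
    audio_np = audio_tensor.detach().cpu().numpy()
    audio_44k = librosa.resample(audio_np, orig_sr=16000, target_sr=44100)
    # Clip before scaling so resampling overshoot cannot wrap around in int16.
    audio_int16 = (np.clip(audio_44k, -1.0, 1.0) * 32767).astype(np.int16)
    return base64.b64encode(audio_int16.tobytes()).decode('utf-8')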


### STEP 4: Handle WebSocket Client

This function manages communication with the frontend UI:

1. **Connection setup** – runs when a client connects, creates a `text_buffer`.  
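2. **Receiving messages** – each message is a JSON object with a `"text"` string that is appended to the buffer and a `"flush"` boolean that triggers processing; a `"text"` of `" "` signals the start of a stream and `""` signals the end.  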
3. **Processing text** – on flush, generates audio chunks with timestamps, converts to Base64, and sends JSON back to the client.  
4. **Keep alive** – clears the buffer for the next input; keeps the connection open unless closed or an error occurs.  
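
The message handling above can be sketched as a small pure function that can be exercised without a network connection. This is an illustration under assumptions: the helper name `apply_message` and the sample messages are hypothetical, and only the `"text"` and `"flush"` fields come from the notebook.

```python
import json
from typing import Optional


def apply_message(buffer: str, raw: str) -> tuple[str, Optional[str], bool]:
    """Return (new_buffer, text_to_synthesize_or_None, keep_reading)."""
    data = json.loads(raw)
    text = data["text"]
    if text == " ":                # start-of-stream marker: nothing to buffer
        return buffer, None, True
    if text == "":                 # end-of-stream marker: stop reading
        return buffer, None, False
    buffer += text
    if data.get("flush", False):   # synthesize everything buffered so far
        return "", buffer, True
    return buffer, None, True


# Hypothetical message sequence a client might send.
messages = [
    {"text": " ", "flush": False},
    {"text": "Hello ", "flush": False},
    {"text": "world", "flush": True},
    {"text": "", "flush": False},
]

buffer, outputs = "", []
for msg in messages:
    buffer, to_speak, keep_reading = apply_message(buffer, json.dumps(msg))
    if to_speak is not None:
        outputs.append(to_speak)
    if not keep_reading:
        break
# outputs == ["Hello world"]
```

Keeping the branching in a function like this makes the protocol easy to unit-test separately from synthesis and socket handling.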


In [8]:
async def handle_client(websocket):
    text_buffer = ""
    print("Client connected")
    
    try:
        async for message in websocket:
            print(f"Received: {message}")
            data = json.loads(message)
            text = data["text"]
            flush = data.get("flush", False)  # treat a missing flag as "no flush"
            
            if text == " ":
                print("Got initial space")
                continue
            if text == "":
                print("Got final empty")
                break
                
            text_buffer += text
            print(f"Text buffer: {text_buffer}")
            
            if flush:
                print("Processing flush...")
                async for chunk_data in generate_audio_chunks_with_char_timestamps(text_buffer):
                    print("Got chunk, converting audio...")
                    audio_b64 = audio_to_base64(chunk_data["audio"])
                    response = {
                        "audio": audio_b64,
                        "alignment": {
                            "chars": chunk_data["chars"],
                            "char_start_times_ms": chunk_data["char_start_times_ms"],
                            "char_durations_ms": chunk_data["char_durations_ms"]
                        }
                    }
                    print("Sending response...")
                    await websocket.send(json.dumps(response))
                text_buffer = ""
                print("Done processing - connection stays open")
    except Exception as e:
        print(f"Error: {e}")
        import traceback
        traceback.print_exc()

async def main():
    # Serve until the task is cancelled (for example, when the kernel is interrupted).
    async with websockets.serve(handle_client, "localhost", 8111):
        await asyncio.Future()  # never completes, so the server keeps running

print("Server running on ws://localhost:8111")
asyncio.run(main())

Server running on ws://localhost:8111


OSError: [Errno 48] error while attempting to bind on address ('127.0.0.1', 8111): address already in use