<a href="https://colab.research.google.com/github/HosseinEyvazi/Voice-AI-Booklet/blob/main/Voice_Sec1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


# 1) From signal → sound → voice

* **Signal**: any quantity that varies over time; for audio, air-pressure changes captured as a waveform.
* **Sound**: a pressure-wave signal in the audible range (roughly 20 Hz to 20 kHz for humans).
* **Voice**: sound produced by the human vocal tract, characterized by pitch, timbre, and phonemes.

In ML we go from the **raw waveform** → **features** (e.g., mel-spectrograms) → **models** that map audio⇄text.
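
To make the waveform → features step concrete, here is a minimal sketch using only the standard library. It synthesizes one 25 ms frame of a 440 Hz tone at an assumed 16 kHz sample rate and recovers the dominant frequency with a naive DFT. The helper names (`sine_wave`, `dft_magnitudes`, `dominant_frequency`) are illustrative, and real pipelines would use an FFT and a mel filterbank instead.

```python
import cmath
import math

SAMPLE_RATE = 16_000  # samples per second (assumed for this sketch)


def sine_wave(freq_hz, n_samples, sample_rate=SAMPLE_RATE):
    """Raw waveform: amplitude values sampled at a fixed rate."""
    return [math.sin(2 * math.pi * freq_hz * n / sample_rate) for n in range(n_samples)]


def dft_magnitudes(frame):
    """Naive O(N^2) DFT; returns magnitudes for bins 0..N/2."""
    n = len(frame)
    mags = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * i / n) for i, x in enumerate(frame))
        mags.append(abs(acc))
    return mags


def dominant_frequency(frame, sample_rate=SAMPLE_RATE):
    """Frequency (Hz) of the strongest DFT bin in the frame."""
    mags = dft_magnitudes(frame)
    peak_bin = max(range(len(mags)), key=mags.__getitem__)
    return peak_bin * sample_rate / len(frame)


frame = sine_wave(440.0, 400)      # 400 samples = 25 ms at 16 kHz
print(dominant_frequency(frame))   # 440.0
```

With 400 samples per frame, each DFT bin is 40 Hz wide, which is why the 440 Hz tone lands exactly on bin 11.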

---

# 2) Speech-to-Text (STT) from microphone with `speech_recognition` (Google Web Speech API)

### Install

```bash
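pip install SpeechRecognition pyaudio
# sr.Microphone requires PyAudio. If PyAudio wheels are hard to install on your OS,
# you can record with another library and transcribe the saved file via sr.AudioFile:
# pip install sounddevice
# pip install soundfile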
```

### Code (microphone → text)

```python
import speech_recognition as sr

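def listen_and_transcribe(lang="en-US"):
    """Record one phrase from the default microphone; return the transcript or None."""
    r = sr.Recognizer()
    try:
        with sr.Microphone() as source:
            print("Adjusting for ambient noise…")
            r.adjust_for_ambient_noise(source, duration=1)
            print("Speak now!")
            # timeout: max seconds to wait for speech to start
            # phrase_time_limit: max seconds of speech to record
            audio = r.listen(source, timeout=5, phrase_time_limit=15)
    except sr.WaitTimeoutError:
        print("No speech detected before the timeout")
        return None

    try:
        # Uses Google’s free web API (rate-limited; needs internet; not for production)
        text = r.recognize_google(audio, language=lang)
        print("You said:", text)
        return text
    except sr.UnknownValueError:
        print("Could not understand audio")
    except sr.RequestError as e:
        print("API request error:", e)
    return None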

if __name__ == "__main__":
    listen_and_transcribe("en-US")  # e.g., "fa-IR" for Persian
```

> Notes
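> • The free recognizer is rate-limited and sends your audio to Google's servers, so avoid it for sensitive data. For production, use local Whisper, Vosk, or paid cloud APIs.
> • Microphone level matters: keep the input gain low enough to avoid clipping. `adjust_for_ambient_noise` calibrates the recognizer's energy threshold so background noise is not mistaken for speech.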
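
One rough way to check for clipping is to count how many samples sit at or near full scale. The sketch below assumes 16-bit little-endian PCM bytes; `clipping_ratio` and its threshold of 32,000 are illustrative choices, not part of `speech_recognition`.

```python
import struct


def clipping_ratio(pcm_bytes, threshold=32_000):
    """Fraction of 16-bit little-endian samples at or near full scale."""
    n = len(pcm_bytes) // 2
    if n == 0:
        return 0.0
    samples = struct.unpack(f"<{n}h", pcm_bytes[: n * 2])
    clipped = sum(1 for s in samples if abs(s) >= threshold)
    return clipped / n
```

With the recorder above, something like `clipping_ratio(audio.get_raw_data(convert_width=2))` should give the ratio for a captured phrase. If more than a small fraction of samples clip, lower the input gain and record again.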

---

# 3) Text-to-Speech (TTS) using Hugging Face **serverless** (Inference API)

## What does “serverless” mean here?

You **call a managed model endpoint** (Hugging Face Inference API). You don’t provision servers; HF handles scaling, GPU, and uptime. You pay per usage/plan. Pros: zero ops, quick start. Cons: network latency, request limits/cold starts, and you’re sending data off-box.

### Install

```bash
pip install requests
```

### Code (send text → get a WAV file)

Here we call a public TTS model via the Inference API. You need a **HF API token** (create one on your HF account, scope: “Read”).

```python
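import os
import requests

# Prefer an environment variable over hard-coding the token in a notebook
HF_TOKEN = os.environ.get("HF_TOKEN", "hf_XXXXXXXXXXXXXXXXXXXXXXXXXXXX")
MODEL_ID = "suno/bark-small"  # or "facebook/mms-tts-eng", etc.

headers = {
    "Authorization": f"Bearer {HF_TOKEN}",
    "Accept": "audio/wav"  # ask for WAV audio back
}

payload = {
    "inputs": "Hello! This speech was generated with a serverless Hugging Face endpoint.",
    "parameters": {
        # optional model-specific options, if the model supports any (check the model card)
    }
}

resp = requests.post(
    f"https://api-inference.huggingface.co/models/{MODEL_ID}",
    headers=headers,
    json=payload,
    timeout=120
)
resp.raise_for_status()

# Error details can arrive as JSON even with a 200 status; only save real audio
content_type = resp.headers.get("content-type", "")
if not content_type.startswith("audio/"):
    raise RuntimeError(f"Expected audio, got {content_type!r}: {resp.text[:200]}")

with open("tts_output.wav", "wb") as f:
    f.write(resp.content)

print("Saved: tts_output.wav")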
```

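> Swap `MODEL_ID` for languages/voices (e.g., `facebook/mms-tts-eng` for English, other MMS TTS variants for different languages). The first call to a model can be slow, or fail with a temporary error, while the model loads. Not every Hub model is deployed on the serverless API, and availability changes over time, so check the model page if a request is rejected.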
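
Cold starts can be handled with a small retry loop. The sketch below assumes the endpoint signals a still-loading model with HTTP 503; `call_with_retry` and its `send` callable are hypothetical helpers, written so the loop can be exercised without a network.

```python
import time


def call_with_retry(send, max_tries=5, base_delay=2.0, sleep=time.sleep):
    """Call send() until it returns a non-503 status or tries run out.

    send: zero-argument callable returning (status_code, body).
    """
    status, body = None, None
    for attempt in range(max_tries):
        status, body = send()
        if status != 503:
            return status, body
        if attempt < max_tries - 1:
            sleep(base_delay * 2 ** attempt)  # exponential backoff: 2 s, 4 s, 8 s, ...
    return status, body
```

To use it with the request above, wrap the `requests.post(...)` call in a function that returns `(resp.status_code, resp.content)` and pass that function as `send`.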

---

# 4) “Tokenizer” vs “Processor” in audio

On Hugging Face:

* **Tokenizer**: text ↔ tokens (NLP).
* **Feature extractor**: audio → numeric features (e.g., log-mels).
* **Processor**: **a wrapper that bundles both** (and sometimes normalizers) so you can prepare **audio and text** inputs and decode model outputs correctly.
  Examples: `WhisperProcessor`, `Wav2Vec2Processor`, `SpeechT5Processor`.

> For speech models we therefore typically use a **Processor** instead of just a Tokenizer.
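
The bundling idea can be illustrated with a toy in plain Python. None of these classes come from `transformers`; `ToyTokenizer`, `ToyFeatureExtractor`, and `ToyProcessor` are hypothetical stand-ins that only mirror the division of labor.

```python
class ToyTokenizer:
    """Text <-> token ids (character-level, for illustration only)."""

    def __init__(self, vocab="abcdefghijklmnopqrstuvwxyz '"):
        self.char_to_id = {c: i for i, c in enumerate(vocab)}
        self.id_to_char = {i: c for c, i in self.char_to_id.items()}

    def encode(self, text):
        return [self.char_to_id[c] for c in text.lower() if c in self.char_to_id]

    def decode(self, ids):
        return "".join(self.id_to_char[i] for i in ids)


class ToyFeatureExtractor:
    """Waveform -> per-frame energy (a stand-in for log-mel features)."""

    def __init__(self, frame_len=400):
        self.frame_len = frame_len

    def __call__(self, waveform):
        n = self.frame_len
        return [sum(x * x for x in waveform[i:i + n]) / n
                for i in range(0, len(waveform) - n + 1, n)]


class ToyProcessor:
    """Bundles both so callers use one object for audio and text."""

    def __init__(self):
        self.tokenizer = ToyTokenizer()
        self.feature_extractor = ToyFeatureExtractor()

    def __call__(self, audio=None, text=None):
        out = {}
        if audio is not None:
            out["input_features"] = self.feature_extractor(audio)
        if text is not None:
            out["labels"] = self.tokenizer.encode(text)
        return out

    def decode(self, ids):
        return self.tokenizer.decode(ids)
```

The real processors follow the same shape: one call prepares audio features and text ids, and decoding is delegated to the tokenizer.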

---

# 5) Embeddings, sample rate, and key audio concepts

* **Sample rate**: the number of samples per second, in Hz. Many ASR models expect **16 kHz** mono; feeding audio at a different rate hurts accuracy, so resample first.
* **Features**:

  * **Mel-spectrograms** or **MFCCs** summarize frequency content over time.
  * **Speaker embeddings** (a.k.a. d-vectors/x-vectors) represent **voice identity** and are used for verification and **voice cloning**.
* **Voice activity detection (VAD)**: detects speech vs. silence; used for streaming and diarization.
* **Latency tips**: stream in short chunks (e.g., 0.5–1.0 s), do on-the-fly VAD, and prefer models with streaming decoders.
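
As a rough illustration of VAD, the sketch below flags frames whose RMS energy exceeds a fixed threshold. It assumes float samples in the range -1 to 1 at 16 kHz; the 20 ms frame size and 0.02 threshold are arbitrary choices for this example, and production VADs rely on more robust features or trained models.

```python
import math


def frame_rms(samples, frame_len):
    """RMS energy for consecutive non-overlapping frames."""
    return [
        math.sqrt(sum(x * x for x in samples[i:i + frame_len]) / frame_len)
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]


def energy_vad(samples, sample_rate=16_000, frame_ms=20, threshold=0.02):
    """Return one boolean per frame: True where RMS exceeds the threshold."""
    frame_len = sample_rate * frame_ms // 1000  # 320 samples at 16 kHz
    return [rms > threshold for rms in frame_rms(samples, frame_len)]
```

For example, 20 ms of silence, then 20 ms of signal, then 20 ms of silence yields `[False, True, False]`.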

---

# 6) Speech-to-Text, Text-to-Speech, Voice Cloning (deepfake)

* **STT**: audio → text (e.g., Whisper, Wav2Vec2, cloud APIs).
* **TTS**: text → audio (e.g., Bark, SpeechT5, VITS, FastPitch, FastSpeech2).
* **Voice cloning**: generate TTS that **sounds like a specific speaker** using a short voice sample (few seconds to a minute).
  ⚠️ **Ethics & consent**: only clone voices with explicit permission; label synthetic audio; be mindful of laws and platform policies.

---

# 7) Real-time voice cloning: encoder → synthesizer → vocoder

Classic pipeline:

1. **Speaker encoder**: turns a reference voice clip into a **speaker embedding**.
2. **Text/Acoustic synthesizer**: predicts a **mel-spectrogram** from text + speaker embedding.
3. **Vocoder**: converts the spectrogram into a **waveform** (Griffin-Lim, WaveGlow, HiFi-GAN, etc.).

Modern “all-in-one” models hide these steps but the logic is the same.
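
The data flow of the three stages can be written down as a skeleton. This is only a structural sketch: `clone_and_speak` is a hypothetical function, and the three callables stand in for whatever encoder, synthesizer, and vocoder models you actually load.

```python
from typing import Callable, List

Embedding = List[float]
Mel = List[List[float]]   # frames x mel bins
Waveform = List[float]


def clone_and_speak(
    reference: Waveform,
    text: str,
    encode_speaker: Callable[[Waveform], Embedding],
    synthesize: Callable[[str, Embedding], Mel],
    vocode: Callable[[Mel], Waveform],
) -> Waveform:
    """Run the classic three-stage pipeline with caller-supplied models."""
    embedding = encode_speaker(reference)   # 1. who is speaking
    mel = synthesize(text, embedding)       # 2. what is said, in that voice
    return vocode(mel)                      # 3. spectrogram -> audio samples
```

Keeping the stages as separate callables makes it easy to swap in a faster vocoder or to compute the speaker embedding once and reuse it.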

### Practical zero-shot cloning example (local) with Coqui-TTS

This is a simple way to demo cloning on your own machine. Check the model's license before using it beyond a demo.

#### Install

```bash
pip install TTS==0.22.0 torch soundfile
# pick the right torch build for your GPU/OS if needed
```

#### Code (clone from a reference WAV, synthesize to file)

```python
from TTS.api import TTS

# A popular zero-shot cloning model (multilingual):
MODEL_NAME = "tts_models/multilingual/multi-dataset/your_tts"

# Path to a short reference clip (5–15 seconds) of the target speaker; mono 16 kHz or 22 kHz works well
SPEAKER_WAV = "reference_speaker.wav"

tts = TTS(MODEL_NAME)

text = "This is a quick demo of zero-shot voice cloning."
tts.tts_to_file(
    text=text,
    speaker_wav=SPEAKER_WAV,   # zero-shot cloning
    language="en",             # multilingual models need an explicit language
    file_path="cloned_voice.wav"
)

print("Saved: cloned_voice.wav")
```

> Tips
> • Clean reference audio: no music, minimal noise, consistent mic distance.
> • Try multiple 5–15s samples for better similarity.
> • For **real-time**, you need **low-latency** models, fast vocoders (e.g., HiFi-GAN), and streaming chunks. True “live” cloning is possible but hardware-dependent.
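
Similarity between voices is commonly measured as the cosine similarity of speaker embeddings. The sketch below assumes you already have fixed-length embedding vectors from some speaker encoder (not shown here); `cosine_similarity` and `mean_embedding` are illustrative helpers, and averaging several clips is one simple way to get a steadier reference.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def mean_embedding(embeddings):
    """Element-wise average of several embeddings of the same speaker."""
    n = len(embeddings)
    return [sum(col) / n for col in zip(*embeddings)]
```

Comparing the embedding of the cloned output against the averaged reference gives a rough, model-dependent similarity score.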

---

# 8) Bonus: local, offline TTS with Transformers (no serverless)

If you want synthesis to run entirely on your machine (privacy, no network after the one-time model download), you can do:

```bash
pip install transformers datasets torchaudio accelerate soundfile sentencepiece
```

```python
import torch
import soundfile as sf
from transformers import SpeechT5Processor, SpeechT5ForTextToSpeech, SpeechT5HifiGan

text = "Offline speech synthesis with SpeechT5."
processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_tts")
model = SpeechT5ForTextToSpeech.from_pretrained("microsoft/speecht5_tts")
vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")

inputs = processor(text=text, return_tensors="pt")

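# Speaker embedding: a 512-dim vector that sets the voice.
# A random vector runs, but the result usually sounds noisy or unnatural;
# provide a real speaker embedding for usable output.
speaker_embeddings = torch.nn.functional.normalize(torch.randn(1, 512), dim=-1)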

with torch.no_grad():
    speech = model.generate_speech(inputs["input_ids"], speaker_embeddings, vocoder=vocoder)

sf.write("offline_tts.wav", speech.numpy(), 16000)
print("Saved: offline_tts.wav")
```

> For cloning with SpeechT5 you’d feed a **real speaker embedding** (extracted with a compatible speaker-ID model) rather than random vectors.

---

# 9) Putting it together (mini recipes)

**A) Live captioning**

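* Mic → VAD → stream 16 kHz chunks → ASR on each chunk (e.g., Whisper small on VAD-segmented audio; Whisper is not a native streaming model, so partial results come from decoding short segments) → display partial transcripts.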
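
The chunking loop in this recipe might look like the sketch below. The `is_speech` and `transcribe` arguments are placeholders for a real VAD and ASR model, and the 0.5 s chunk size at 16 kHz is an assumption taken from the latency tips earlier.

```python
def iter_chunks(samples, sample_rate=16_000, chunk_seconds=0.5):
    """Yield consecutive fixed-duration chunks (the last one may be shorter)."""
    chunk_len = int(sample_rate * chunk_seconds)
    for start in range(0, len(samples), chunk_len):
        yield samples[start:start + chunk_len]


def live_captions(samples, is_speech, transcribe, sample_rate=16_000, chunk_seconds=0.5):
    """Yield a growing partial transcript, skipping chunks without speech."""
    partial = []
    for chunk in iter_chunks(samples, sample_rate, chunk_seconds):
        if not is_speech(chunk):
            continue
        partial.append(transcribe(chunk))
        yield " ".join(partial)
```

In a real application `samples` would be a live microphone buffer rather than a list, but the control flow stays the same.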

**B) Talking assistant**

* User speaks → STT → LLM → TTS (serverless for simplicity) → play audio. Cache TTS responses to reduce costs.
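
Caching TTS responses can be as simple as keying saved audio on a hash of the text and voice. In this sketch `cached_tts` is a hypothetical helper, `synthesize` is any function that returns audio bytes (for example a wrapper around the serverless call), and the `tts_cache` directory name is arbitrary.

```python
import hashlib
from pathlib import Path


def cache_key(text, voice):
    """Stable key for a (voice, text) pair."""
    return hashlib.sha256(f"{voice}\n{text}".encode("utf-8")).hexdigest()


def cached_tts(text, voice, synthesize, cache_dir="tts_cache"):
    """Return audio bytes, calling synthesize(text, voice) only on a cache miss."""
    path = Path(cache_dir) / f"{cache_key(text, voice)}.wav"
    if path.exists():
        return path.read_bytes()
    audio = synthesize(text, voice)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(audio)
    return audio
```

This helps most for fixed phrases such as greetings and error messages, since free-form LLM replies rarely repeat exactly.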

**C) Real-time cloning**

* Precompute speaker embedding (from 10–30 s sample).
* For each user prompt: text → spectrogram (conditioned on embedding) → fast vocoder → stream audio frames.

---

# 10) Checklist & pitfalls

* Match **sample rate** (usually 16 kHz for ASR).
* Normalize loudness; avoid clipping.
* Trim silences for better transcription and cloning.
* Mind **latency** (smaller chunks, GPU if possible, lightweight models).
* Legal/ethical: consent, disclosure, watermarking if applicable.
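
Two of the checklist items can be sketched in a few lines. Note that this uses peak normalization as a simple stand-in for true loudness normalization, and amplitude thresholding as a stand-in for a proper VAD-based trim; both helpers and their default values are illustrative, and samples are assumed to be floats in the range -1 to 1.

```python
def peak_normalize(samples, target_peak=0.9):
    """Scale samples so the largest absolute value equals target_peak."""
    peak = max((abs(x) for x in samples), default=0.0)
    if peak == 0.0:
        return list(samples)
    gain = target_peak / peak
    return [x * gain for x in samples]


def trim_silence(samples, threshold=0.01):
    """Drop leading and trailing samples whose magnitude is below threshold."""
    start = 0
    end = len(samples)
    while start < end and abs(samples[start]) < threshold:
        start += 1
    while end > start and abs(samples[end - 1]) < threshold:
        end -= 1
    return samples[start:end]
```

Trimming before normalizing avoids scaling the clip based on a stray click in the silent region.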

