<a href="https://colab.research.google.com/github/MohanK-17/Voice-Conversion-System/blob/main/stt_tts.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Voice Recording and Transcription Notebook

This notebook lets you record audio directly from your microphone in Google Colab. Recording starts as soon as the cell runs and the browser grants microphone access; speak, then press the button to stop. After recording, you can immediately listen to the audio in the notebook.

The browser captures the audio in a compressed container (typically WebM), and `ffmpeg` converts it to WAV so it is easy to process in Python. The waveform cell later reads the WAV file into a NumPy array with `soundfile`, giving access to the audio samples and sampling rate.

This setup can be extended to save recordings, generate unique timestamped filenames, or transcribe audio using models like Whisper. It provides a simple and interactive way to capture, preview, and process voice recordings all within Colab.
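
As a rough sketch of the timestamped-filename idea, the helper below builds a path under `recordings/` using the same `strftime` pattern as the recording cell further down. The function name `timestamped_wav_path` and its `prefix` and `now` parameters are illustrative assumptions, not part of the notebook.

```python
from datetime import datetime
from pathlib import Path


def timestamped_wav_path(folder="recordings", prefix="", now=None):
    """Build a WAV path whose name encodes the date and time (hypothetical helper)."""
    now = now or datetime.now()
    stamp = now.strftime("%b%d_%H-%M-%S")  # month abbreviation, day, then H-M-S
    return Path(folder) / f"{prefix}{stamp}.wav"


# A fixed datetime keeps the demo reproducible; omit `now` to use the current time.
path = timestamped_wav_path(now=datetime(2024, 1, 5, 14, 3, 59))
print(path.as_posix())
```

Because the pattern has one-second resolution, two recordings saved within the same second would share a name; appending a counter or a `uuid4` suffix is one way around that.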


**Install all required libraries in a single cell**

In [None]:

!pip install -q gradio soundfile torch torchaudio huggingface_hub ruaccent datasets transformers faster-whisper ffmpeg-python coqui-tts speechbrain accelerate


**Clone the ESpeech-TTS repository**

In [None]:
!git clone https://huggingface.co/ESpeech/ESpeech-TTS-1_RL-V2


# 📝 End-to-End Voice Recording, Transcription, and Speech Synthesis

This Colab cell implements a complete workflow for **recording, transcribing, and synthesizing speech**:

1. **Record Audio in Browser**  
   - Uses JavaScript via `IPython.display.HTML` to capture microphone input.
   - Saves the recorded audio as a WAV file in a `recordings/` folder.

2. **Transcribe Audio**  
   - Transcribes the saved recording with the **Faster-Whisper** `medium.en` model.
   - Reports the detected language; because `medium.en` is an English-only checkpoint, this is expected to be English.
   - Prints each segment with timestamps and stores the full transcription in `transcribed_text`.

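3. **Text-to-Speech (Speech Synthesis)**  
   - Uses **SpeechT5** with the HifiGan vocoder to generate speech from the transcribed text.
   - Requires `speaker_embeddings` and the SpeechT5 `model` to already exist in the session; this cell loads only the processor and vocoder, so synthesis is skipped on a fresh runtime.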
   - Saves the synthesized speech as a WAV file and plays it directly in the notebook.

4. **Output**  
   - Displays the recorded audio player.
   - Displays the synthesized speech player (if embeddings and model are available).
   - All recordings are timestamped for easy reference.

> This workflow enables end-to-end **voice cloning / speech synthesis experiments**, combining live recording, transcription, and TTS in a single interactive cell.
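
Step 1 relies on the browser handing the recording to Python as a base64 data URL, which is then split on the comma and decoded. The snippet below is a minimal sketch of that decoding step in isolation; `decode_data_url` is a hypothetical helper name, and the payload is a stand-in string rather than real audio.

```python
from base64 import b64decode, b64encode


def decode_data_url(data_url):
    """Split a 'data:<mime>;base64,<payload>' string into (mime, raw bytes)."""
    header, _, payload = data_url.partition(",")
    if not header.startswith("data:") or not header.endswith(";base64"):
        raise ValueError("expected a base64 data URL")
    mime = header[len("data:"):-len(";base64")]
    return mime, b64decode(payload)


# Stand-in for what FileReader.readAsDataURL returns; real payloads hold audio bytes.
fake_url = "data:audio/webm;base64," + b64encode(b"not real audio").decode("ascii")
mime, raw = decode_data_url(fake_url)
print(mime, len(raw))
```

Validating the prefix matters here because the JavaScript initialises `base64data` to `0`: if the recorder has produced no data by the time the promise resolves, Python receives the string `"0"` instead of a data URL.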


In [None]:
import os, io, ffmpeg, torch
from datetime import datetime
from base64 import b64decode
from IPython.display import HTML, display, Audio
from google.colab.output import eval_js  # lets Python await the JavaScript promise
from faster_whisper import WhisperModel
from transformers import SpeechT5Processor, SpeechT5HifiGan
import soundfile as sf

os.makedirs("recordings", exist_ok=True)

AUDIO_HTML = """
<script>
var my_div = document.createElement("DIV");
var my_btn = document.createElement("BUTTON");
var t = document.createTextNode("Press to start recording");
my_btn.appendChild(t);
my_div.appendChild(my_btn);
document.body.appendChild(my_div);

var base64data = 0;
var reader;
var recorder, gumStream;
var recordButton = my_btn;

var handleSuccess = function(stream) {
  gumStream = stream;
  recorder = new MediaRecorder(stream);
  recorder.ondataavailable = function(e) {
    reader = new FileReader();
    reader.readAsDataURL(e.data);
    reader.onloadend = function() {
      base64data = reader.result;
    }
  };
  recorder.start();
};

recordButton.innerText = "Recording... press to stop";
navigator.mediaDevices.getUserMedia({audio: true}).then(handleSuccess);

function toggleRecording() {
  if (recorder && recorder.state == "recording") {
      recorder.stop();
      gumStream.getAudioTracks()[0].stop();
      recordButton.innerText = "Saving..."
  }
}

function sleep(ms) {
  return new Promise(resolve => setTimeout(resolve, ms));
}

var data = new Promise(resolve=>{
recordButton.onclick = ()=>{
  toggleRecording()
  sleep(2000).then(() => {
    resolve(base64data.toString())
  });
}
});
</script>
"""

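def get_audio():
    """Record from the browser microphone and return the audio as WAV bytes."""
    display(HTML(AUDIO_HTML))
    data = eval_js("data")  # blocks until the JavaScript promise resolves
    if "," not in data:
        # base64data keeps its initial value 0 if the recorder produced no data in time
        raise RuntimeError("No audio was captured; re-run the cell and try again.")
    binary = b64decode(data.split(',')[1])  # drop the "data:...;base64," prefix

    # Convert the browser's container format (typically WebM) to WAV through ffmpeg pipes
    process = (
        ffmpeg
        .input('pipe:0')
        .output('pipe:1', format='wav')
        .run_async(pipe_stdin=True, pipe_stdout=True, pipe_stderr=True, quiet=True, overwrite_output=True)
    )
    output, _ = process.communicate(input=binary)
    return output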

print("Click button, record, then stop...")
wav_bytes = get_audio()

timestamp = datetime.now().strftime("%b%d_%H-%M-%S")
filename = f"recordings/{timestamp}.wav"

with open(filename, "wb") as f:
    f.write(wav_bytes)

print(f"Saved audio → {filename}")
display(Audio(filename))

# Transcribe using Faster-Whisper
model_size = "medium.en"
device = "cuda" if torch.cuda.is_available() else "cpu"
whisper_model = WhisperModel(model_size, device=device)

print("Transcribing audio...")
segments, info = whisper_model.transcribe(filename)

print(f"Detected language: {info.language}")
transcript = ""
for segment in segments:
    print(f"[{segment.start:.2f}s → {segment.end:.2f}s] {segment.text}")
    transcript += segment.text + " "

transcribed_text = transcript.strip()
print("\nTranscription complete.")
print(f"Stored transcript:\n{transcribed_text}")

# Text-to-speech with SpeechT5
vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")
processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_tts")

if 'speaker_embeddings' in locals() and 'model' in locals():
    inputs = processor(text=transcribed_text, return_tensors="pt")
    speech = model.generate_speech(inputs["input_ids"], speaker_embeddings, vocoder=vocoder)
    output_file = f"recordings/synthesized_{timestamp}.wav"
    sf.write(output_file, speech.numpy(), samplerate=16000)
    print(f"✅ Synthesized speech saved to {output_file}")
    display(Audio(output_file, rate=16000))
else:
    print("Speaker embeddings or SpeechT5 model not found. Please load them before generating speech.")


# 📊 Audio Waveform Visualization

This cell visualizes the amplitude over time for audio files:

1. **Recorded Audio**  
   - Plots the waveform of the audio captured from the microphone.
   - X-axis: Time in seconds  
   - Y-axis: Amplitude

2. **Synthesized Audio**  
   - If SpeechT5 synthesized speech was generated, its waveform is also plotted for comparison.
   - Helps visually compare the recorded input and synthesized output.

> This is useful for inspecting audio quality, duration, and amplitude patterns.
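
The time axis in these plots follows from one relation: sample `n` occurs at `n / sample_rate` seconds. The sketch below shows this with only the standard library, writing a synthetic one-second tone and reading it back; the 16 kHz rate, the 440 Hz tone, and the file name `demo_tone.wav` are assumptions chosen for the demo, and the notebook itself uses `soundfile` and NumPy instead.

```python
import math
import os
import struct
import tempfile
import wave

SAMPLE_RATE = 16000  # assumed rate for this demo
path = os.path.join(tempfile.gettempdir(), "demo_tone.wav")

# Write one second of a 440 Hz tone at half scale as 16-bit mono PCM.
samples = [int(0.5 * 32767 * math.sin(2 * math.pi * 440 * n / SAMPLE_RATE))
           for n in range(SAMPLE_RATE)]
with wave.open(path, "wb") as wf:
    wf.setnchannels(1)
    wf.setsampwidth(2)  # 2 bytes per sample = 16-bit
    wf.setframerate(SAMPLE_RATE)
    wf.writeframes(struct.pack(f"<{len(samples)}h", *samples))

# Read it back and rebuild the time axis from the frame count and rate.
with wave.open(path, "rb") as wf:
    rate = wf.getframerate()
    n_frames = wf.getnframes()
    pcm = struct.unpack(f"<{n_frames}h", wf.readframes(n_frames))

times = [n / rate for n in range(n_frames)]
duration = n_frames / rate
peak = max(abs(s) for s in pcm) / 32768  # scale 16-bit integers to roughly [-1, 1]
print(f"{duration:.2f} s, peak amplitude {peak:.2f}")
```

Duration and peak amplitude are the two quantities most easily read off a waveform plot, so computing them directly is a quick sanity check on a recording.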


In [None]:
import matplotlib.pyplot as plt
import numpy as np

def plot_waveform(audio_file_path, title="Audio Waveform"):
    import soundfile as sf
    try:
        audio, samplerate = sf.read(audio_file_path)
        time = np.arange(len(audio)) / samplerate  # sample n occurs at n / samplerate seconds
        plt.figure(figsize=(12, 4))
        plt.plot(time, audio)
        plt.xlabel("Time [s]")
        plt.ylabel("Amplitude")
        plt.title(f"{title}: {audio_file_path}")
        plt.grid(True)
        plt.show()
    except Exception as e:
        print(f"Error plotting waveform: {e}")

plot_waveform(filename, title="Recorded Audio")

if 'output_file' in locals():
    plot_waveform(output_file, title="Synthesized Audio")


# 📊 ASR and TTS Evaluation Metrics Visualization

This cell demonstrates **hypothetical performance metrics** for ASR (Automatic Speech Recognition) and TTS (Text-to-Speech) models, along with **voice cloning evaluation**:

1. **ASR Metrics (Word Error Rate - WER)**  
   - Shows hypothetical WER scores for different ASR models.  
   - Lower WER indicates better transcription accuracy.  
   - Visualized using a bar chart.

2. **TTS Subjective Metrics (Mean Opinion Score - MOS)**  
   - Shows hypothetical MOS evaluations for TTS models across multiple dimensions:  
     - Overall MOS  
     - Fluency MOS  
     - Prosody MOS  
     - Quality MOS  
   - Higher MOS indicates better perceived naturalness and quality.  
   - Visualized with grouped bar charts to compare multiple metrics.

3. **TTS Objective Metric (Voice Cloning: Cosine Similarity)**  
   - Hypothetical cosine similarity between original and synthesized speaker embeddings.  
   - Higher similarity (closer to 1) indicates better voice cloning fidelity.  
   - Visualized with a simple bar chart.

> These visualizations help compare **different ASR/TTS models** and evaluate **voice cloning quality**, both subjectively (MOS) and objectively (embedding similarity).  
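
For reference, the two objective metrics above can be computed with a few lines of plain Python. This is a minimal sketch: `word_error_rate` and `cosine_similarity` are hypothetical helper names, and the example sentences and vectors are made up. Real evaluations usually normalise text (punctuation, numerals) before scoring and compare embeddings with hundreds of dimensions. MOS has no such formula, since it is an average of listener ratings.

```python
import math


def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Word-level Levenshtein distance, keeping only the previous row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


wer = word_error_rate("the quick brown fox", "the quick brown dog jumps")
sim = cosine_similarity([0.2, 0.8, 0.1], [0.25, 0.75, 0.05])  # toy 3-D "embeddings"
print(f"WER: {wer:.0%}, cosine similarity: {sim:.3f}")
```

Note that WER can exceed 100% when the hypothesis contains many inserted words, which is why the bar chart's y-limit may need adjusting for weak models.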


In [None]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from IPython.display import display

# Hypothetical data for demonstration
# Replace with your actual evaluation results

# ASR Metrics
asr_data = {
    'Model': ['Faster-Whisper (small)', 'Another ASR Model', 'Yet Another ASR Model'],
    'WER (%)': [5.5, 8.2, 7.1] # Hypothetical Word Error Rates (lower is better)
}
df_asr = pd.DataFrame(asr_data)

# TTS Subjective Metrics (from MOS evaluation)
tts_subjective_data = {
    'Model': ['SpeechT5', 'TTS Model 2', 'TTS Model 3'],
    'Overall MOS': [4.2, 3.8, 4.5], # Hypothetical Mean Opinion Scores (higher is better)
    'Fluency MOS': [4.3, 3.9, 4.4],
    'Prosody MOS': [4.0, 3.5, 4.6],
    'Quality MOS': [4.1, 3.7, 4.5]
}
df_tts_subjective = pd.DataFrame(tts_subjective_data)

# TTS Objective Metric (for Voice Cloning)
tts_objective_data = {
    'Model': ['SpeechT5 (Cloned)', 'Another Cloning Model'],
    'Cosine Similarity': [0.92, 0.88] # Hypothetical Cosine Similarity (closer to 1 is better)
}
df_tts_objective = pd.DataFrame(tts_objective_data)

# --- Display DataFrames ---
print("Hypothetical ASR Model Performance (WER):")
display(df_asr)

print("\nHypothetical TTS Model Subjective Performance (MOS):")
display(df_tts_subjective)

print("\nHypothetical Voice Cloning Performance (Cosine Similarity):")
display(df_tts_objective)


# --- Visualization of ASR Word Error Rate ---
plt.figure(figsize=(8, 5))
plt.bar(df_asr['Model'], df_asr['WER (%)'], color='skyblue')
plt.ylabel('Word Error Rate (%)')
plt.title('Hypothetical ASR Model Performance (WER)')
plt.ylim(0, 10) # Adjust y-limit based on expected scores
plt.show()

# --- Visualization of TTS Subjective Metrics (MOS) ---
bar_width = 0.2
r1 = np.arange(len(df_tts_subjective['Model']))
r2 = [x + bar_width for x in r1]
r3 = [x + bar_width for x in r2]
r4 = [x + bar_width for x in r3]

plt.figure(figsize=(12, 6))
plt.bar(r1, df_tts_subjective['Overall MOS'], color='lightcoral', width=bar_width, edgecolor='grey', label='Overall MOS')
plt.bar(r2, df_tts_subjective['Fluency MOS'], color='salmon', width=bar_width, edgecolor='grey', label='Fluency MOS')
plt.bar(r3, df_tts_subjective['Prosody MOS'], color='tomato', width=bar_width, edgecolor='grey', label='Prosody MOS')
plt.bar(r4, df_tts_subjective['Quality MOS'], color='orangered', width=bar_width, edgecolor='grey', label='Quality MOS')

plt.ylabel('Mean Opinion Score (MOS)')
plt.title('Hypothetical TTS Model Subjective Performance (MOS)')
plt.ylim(0, 5.5) # MOS is typically on a scale of 1-5
plt.xticks([r + bar_width*1.5 for r in range(len(df_tts_subjective['Model']))], df_tts_subjective['Model'])
plt.legend()
plt.show()


# --- Visualization of TTS Objective Metric (Cosine Similarity) ---
plt.figure(figsize=(8, 5))
plt.bar(df_tts_objective['Model'], df_tts_objective['Cosine Similarity'], color='mediumseagreen')
plt.ylabel('Cosine Similarity')
plt.title('Hypothetical Voice Cloning Performance (Cosine Similarity)')
plt.ylim(0, 1.1) # Cosine similarity is between -1 and 1, often 0 to 1 for embeddings
plt.show()

In [None]:
import matplotlib.pyplot as plt
import soundfile as sf
import numpy as np

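# This file is not written by any cell shown in this notebook; point the path
# at an existing WAV (for example the `output_file` produced earlier) before running.
audio_file_path = "speecht5_hifigan_synthesized_speech.wav"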

try:
    audio, samplerate = sf.read(audio_file_path)

    # Sample n occurs at n / samplerate seconds
    time = np.arange(len(audio)) / samplerate

    # Plot the waveform
    plt.figure(figsize=(12, 4))
    plt.plot(time, audio)
    plt.xlabel("Time [s]")
    plt.ylabel("Amplitude")
    plt.title("Synthesized Audio Waveform")
    plt.grid(True)
    plt.show()

except FileNotFoundError:
    print(f"Error: Audio file not found at {audio_file_path}")
except Exception as e:
    print(f"An error occurred: {e}")

# 🗣️ Voice Cloning with SpeechT5 and Speaker Embedding

The following cells continue the workflow from the previous steps and perform **voice cloning** in two stages:

1. **Speaker embedding extraction**:
   - Loads a short reference audio (`ref_wav`) of the user's voice.
   - Processes the audio to match the sample rate expected by the pre-trained SpeechBrain speaker recognition model (`spkrec-xvect-voxceleb`).
   - Generates a speaker embedding tensor that captures the unique characteristics of the speaker's voice.

2. **Text-to-Speech synthesis**:
   - Uses the previously transcribed text (`transcribed_text`) from the recorded audio.
   - Loads the pre-trained **SpeechT5** model, processor, and HifiGan vocoder.
   - Generates a speech waveform in the **user's voice** by combining the text and speaker embedding.
   - Saves the cloned speech to `cloned_speech.wav` and plays it inline.

**Notes:**
- The speaker embedding must have a batch dimension; if missing, it’s automatically added.
- The generated speech is at 16 kHz, suitable for playback or further processing.
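
The preprocessing in stage 1 amounts to two operations: averaging the channels down to mono and changing the sample rate. The sketch below illustrates both on plain Python lists. It deliberately swaps in naive linear interpolation, whereas `torchaudio.transforms.Resample` uses band-limited sinc interpolation, so treat it as a way to see the index arithmetic rather than as a replacement. The names `to_mono` and `resample_linear` are hypothetical.

```python
def to_mono(channels):
    """Average equal-length channels sample by sample."""
    n_channels = len(channels)
    return [sum(frame) / n_channels for frame in zip(*channels)]


def resample_linear(samples, orig_sr, target_sr):
    """Naive linear-interpolation resampler, for illustration only."""
    if orig_sr == target_sr or len(samples) < 2:
        return list(samples)
    out_len = len(samples) * target_sr // orig_sr
    step = orig_sr / target_sr  # input samples advanced per output sample
    out = []
    for i in range(out_len):
        pos = i * step
        left = int(pos)
        right = min(left + 1, len(samples) - 1)  # clamp at the final sample
        frac = pos - left
        out.append(samples[left] * (1 - frac) + samples[right] * frac)
    return out


stereo = [[0.0, 1.0, 2.0, 3.0], [2.0, 1.0, 0.0, -1.0]]
mono = to_mono(stereo)
downsampled = resample_linear([0.0, 1.0, 2.0, 3.0, 4.0, 5.0], 48000, 16000)
print(mono, downsampled)
```

Linear interpolation applies no anti-aliasing filter when downsampling, which is one reason a dedicated resampler is preferable for real audio.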


In [None]:
from google.colab import files

print("Please upload a short audio file of your voice.")
uploaded = files.upload()


if len(uploaded) == 0:
  print("No file was uploaded.")
  ref_wav = None
else:
  ref_wav = list(uploaded.keys())[0]
  print(f"✅ Uploaded audio file: {ref_wav}")


In [None]:
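try:
    from speechbrain.inference.speaker import EncoderClassifier  # SpeechBrain >= 1.0
except ImportError:
    from speechbrain.pretrained import EncoderClassifier  # older releases
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
speaker_model = EncoderClassifier.from_hparams(
    source="speechbrain/spkrec-xvect-voxceleb",
    run_opts={"device": device},
)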

print("Speaker embedding model loaded successfully.")

In [None]:
import torch
import torchaudio
from IPython.display import Audio

speaker_embedding = None

if ref_wav is not None:
    print(f"Loading audio from {ref_wav}...")
    try:
        # torchaudio.load returns a (waveform, sample_rate) tuple; waveform is [channels, samples]
        audio_tensor, sample_rate = torchaudio.load(ref_wav)

        # Ensure mono
        if audio_tensor.shape[0] > 1:
            audio_tensor = audio_tensor.mean(dim=0, keepdim=True)

        # Resample if needed
        target_sr = getattr(speaker_model.hparams, "sample_rate", 16000)  # fallback to 16kHz
        if sample_rate != target_sr:
            resampler = torchaudio.transforms.Resample(orig_freq=sample_rate, new_freq=target_sr)
            audio_tensor = resampler(audio_tensor)
            print(f"Resampled audio to {target_sr} Hz")
        else:
            print(f"Audio sample rate matches model: {sample_rate} Hz")

        # Generate speaker embedding
        print("Generating speaker embedding...")
        with torch.no_grad():
            embeddings = speaker_model.encode_batch(audio_tensor)

        # Drop the leading batch axis and move to CPU, where the SpeechT5 model below runs
        speaker_embedding = embeddings.squeeze(0).cpu()
        print("✅ Speaker embedding generated.")
        print(f"Embedding shape: {speaker_embedding.shape}")

    except Exception as e:
        print(f"Error processing audio file: {e}")
        speaker_embedding = None
else:
    print("No audio file provided. Cannot generate speaker embedding.")


In [None]:

print(speaker_model.hparams)

In [None]:
from transformers import SpeechT5Processor, SpeechT5ForTextToSpeech, SpeechT5HifiGan
import torch
import soundfile as sf
from IPython.display import display, Audio

processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_tts")
model = SpeechT5ForTextToSpeech.from_pretrained("microsoft/speecht5_tts")
vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")

if 'transcribed_text' in locals() and 'speaker_embedding' in locals() and speaker_embedding is not None:
    print("Using transcribed text:", transcribed_text)
    print("Using speaker embedding with shape:", speaker_embedding.shape)

    inputs = processor(text=transcribed_text, return_tensors="pt")

    if speaker_embedding.ndim == 1:
        speaker_embedding = speaker_embedding.unsqueeze(0)
        print("Added batch dimension to speaker embedding. New shape:", speaker_embedding.shape)

    speech = model.generate_speech(inputs["input_ids"], speaker_embedding, vocoder=vocoder)

    output_file = "cloned_speech.wav"
    sf.write(output_file, speech.numpy(), samplerate=16000)

    print("Speech saved to:", output_file)
    display(Audio(output_file, rate=16000))
else:
    print("Transcribed text or speaker embedding not available. Cannot synthesize speech.")


## Play cloned voice output

Listen to the synthesized speech in `cloned_speech.wav` to evaluate the voice cloning subjectively.



In [None]:
import IPython.display as ipd
import os

output_file = "cloned_speech.wav"

if os.path.exists(output_file):
    print(f"Playing synthesized speech from {output_file}")
    ipd.display(ipd.Audio(output_file, rate=16000))
else:
    print(f"Error: Synthesized speech file not found at {output_file}")
