Install Text-Text

## 🔧 Installing Required Text-to-Speech Library
We install the `TTS` library (by coqui.ai), which is used for text-to-speech conversion.


In [1]:
# Installing the Coqui TTS library for speech synthesis
!pip install TTS

Collecting TTS
  Downloading TTS-0.22.0-cp311-cp311-manylinux1_x86_64.whl.metadata (21 kB)
Collecting anyascii>=0.3.0 (from TTS)
  Downloading anyascii-0.3.3-py3-none-any.whl.metadata (1.6 kB)
Collecting pysbd>=0.3.4 (from TTS)
  Downloading pysbd-0.3.4-py3-none-any.whl.metadata (6.1 kB)
Collecting pandas<2.0,>=1.4 (from TTS)
  Downloading pandas-1.5.3-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (11 kB)
Collecting trainer>=0.0.32 (from TTS)
  Downloading trainer-0.0.36-py3-none-any.whl.metadata (8.1 kB)
Collecting coqpit>=0.0.16 (from TTS)
  Downloading coqpit-0.0.17-py3-none-any.whl.metadata (11 kB)
Collecting pypinyin (from TTS)
  Downloading pypinyin-0.55.0-py2.py3-none-any.whl.metadata (12 kB)
Collecting hangul-romanize (from TTS)
  Downloading hangul_romanize-0.1.0-py3-none-any.whl.metadata (1.2 kB)
Collecting gruut==2.2.3 (from gruut[de,es,fr]==2.2.3->TTS)
  Downloading gruut-2.2.3.tar.gz (73 kB)

## 🔧 Installing Resemblyzer for Voice Embeddings
We install `resemblyzer`, a library used for extracting speaker embeddings from audio. This is useful for speaker verification or voice cloning.


In [1]:
# Installing Resemblyzer library to extract speaker embeddings
!pip install resemblyzer


Collecting resemblyzer
  Downloading Resemblyzer-0.1.4-py3-none-any.whl.metadata (5.8 kB)
Collecting webrtcvad>=2.0.10 (from resemblyzer)
  Downloading webrtcvad-2.0.10.tar.gz (66 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 66.2/66.2 kB 1.8 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
Collecting typing (from resemblyzer)
  Downloading typing-3.7.4.3.tar.gz (78 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 78.6/78.6 kB 4.1 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
Downloading Resemblyzer-0.1.4-py3-none-any.whl (15.7 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 15.7/15.7 MB 43.8 MB/s eta 0:00:00
Building wheels for collected packages: webrtcvad, typing
  Building wheel for webrtcvad (setup.py) ... done
  Created wheel for webrtcvad: filename=webrtcvad-2.0.10-cp311-cp311-linux
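
Once the installs finish, it can help to confirm which versions actually landed in the environment, since long pip logs are easy to misread. The sketch below uses only the standard library's `importlib.metadata`; the helper name `installed_versions` is hypothetical, and the names passed to it are simply the two distributions installed so far.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


# Distribution names as passed to pip in the cells above
for name, ver in installed_versions(["TTS", "resemblyzer"]).items():
    print(f"{name}: {ver or 'not installed'}")
```

A `None` entry means pip did not register that distribution, which is usually a sign that the install cell failed partway through.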

## 🔧 Installing Additional Audio Processing Libraries
We install key libraries for audio preprocessing and transcription, including `faster-whisper`, `pydub`, `librosa`, and `soundfile`.



In [1]:
# Installing whisper-based ASR (Automatic Speech Recognition) and audio processing tools
!pip install faster-whisper pydub librosa soundfile --quiet  # --quiet reduces console output


## 🎵 Audio Preprocessing: Convert to Mono and Resample
We load the input audio file, convert it to mono, resample it to 22050 Hz, and save the preprocessed output.



In [2]:
import librosa
import soundfile as sf

# Define file paths
input_path = "/content/drive/MyDrive/AudioFiles/Vijay-Mallya-UGLY-TRUTH-Madan-Go.wav"
output_path = "/content/drive/MyDrive/AudioFiles/output.wav"

# Load the audio: convert to mono and resample to 22050 Hz
wav, sr = librosa.load(input_path, sr=22050, mono=True)

# Save the preprocessed audio to a new file
sf.write(output_path, wav, 22050)

print(f"✔ Preprocessed audio saved to: {output_path}")

✔ Preprocessed audio saved to: /content/drive/MyDrive/AudioFiles/output.wav
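
A quick sanity check on the saved file can confirm that the conversion did what was intended. The sketch below reads the WAV header with Python's built-in `wave` module. `describe_wav` is a hypothetical helper, and it assumes the file was written as integer PCM (the `soundfile` default for `.wav` when no subtype is given), since `wave` does not read float-encoded WAV files.

```python
import wave


def describe_wav(path):
    """Return channel count, sample rate, and duration from a PCM WAV header."""
    with wave.open(path, "rb") as wf:
        channels = wf.getnchannels()
        sample_rate = wf.getframerate()
        frames = wf.getnframes()
    return {
        "channels": channels,
        "sample_rate": sample_rate,
        "duration_s": frames / sample_rate,
    }
```

Calling `describe_wav(output_path)` on the preprocessed file should report one channel and a sample rate of 22050 if the step above worked as described.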


## 🧠 Extracting Speaker Embedding using Resemblyzer
We use `Resemblyzer` to extract a speaker embedding (also called a d-vector) from the preprocessed audio. This embedding represents the unique vocal features of the speaker.


In [3]:
from resemblyzer import VoiceEncoder, preprocess_wav
import numpy as np

# Initialize the voice encoder
encoder = VoiceEncoder()

# Preprocess the audio file for embedding extraction
wav_pre = preprocess_wav(output_path)

# Extract speaker embedding (d-vector)
embed = encoder.embed_utterance(wav_pre)

print("✔ Speaker embedding extracted. Shape:", embed.shape)

# Save the embedding as a .npy file for future use
np.save("/content/drive/MyDrive/AudioFiles/embeddings.npy", embed)


Loaded the voice encoder model on cuda in 0.39 seconds.
✔ Speaker embedding extracted. Shape: (256,)
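
Speaker embeddings like this one are commonly compared by cosine similarity, where values closer to 1 suggest the same speaker. The sketch below shows that comparison using only the standard library. The function name `cosine_similarity` and the toy vectors are illustrative assumptions, not part of Resemblyzer; the helper accepts any sequence of floats, including the NumPy array stored in `embed`.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy 3-dimensional vectors standing in for real 256-dimensional embeddings
same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
print(same_direction, orthogonal)
```

In practice the second argument would be an embedding extracted the same way from another recording, and any decision threshold would need to be tuned on real data.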


## 🗣️ Voice Cloning with Pre-trained Multilingual TTS Model
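We use the Coqui TTS library to generate speech from text in the voice of a reference speaker. The YourTTS model loaded here computes its own speaker embedding from the reference audio passed as `speaker_wav`, so that argument is what drives the cloning. The Resemblyzer embedding is passed along as well, but it is likely not what the model conditions on.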


In [4]:
from TTS.api import TTS

# Load a pre-trained multilingual TTS model capable of voice cloning
tts = TTS(model_name="tts_models/multilingual/multi-dataset/your_tts", progress_bar=False, gpu=False)

# Define the text to synthesize
text = "Hello Every one welcome to MG squad naa thaa unga madan"
output_audio = "cloned.wav"

# Generate cloned speech; YourTTS derives the voice from the reference audio
tts.tts_to_file(
    text=text,
    speaker_wav=output_path,           # Reference audio (used for voice cloning)
    file_path=output_audio,            # Output path for cloned audio
    language='en',                     # Language code
    speaker_embedding=embed            # Resemblyzer vector; likely unused by this model
)

print(f"✔ Cloned voice saved to: {output_audio}")




 > Downloading model to /root/.local/share/tts/tts_models--multilingual--multi-dataset--your_tts
 > Model's license - CC BY-NC-ND 4.0
 > Check https://creativecommons.org/licenses/by-nc-nd/4.0/ for more info.
 > Using model: vits
 > Setting up Audio Processor...
 | > sample_rate:16000
 | > resample:False
 | > num_mels:80
 | > log_func:np.log10
 | > min_level_db:0
 | > frame_shift_ms:None
 | > frame_length_ms:None
 | > ref_level_db:None
 | > fft_size:1024
 | > power:None
 | > preemphasis:0.0
 | > griffin_lim_iters:None
 | > signal_norm:None
 | > symmetric_norm:None
 | > mel_fmin:0
 | > mel_fmax:None
 | > pitch_fmin:None
 | > pitch_fmax:None
 | > spec_gain:20.0
 | > stft_pad_mode:reflect
 | > max_norm:1.0
 | > clip_norm:True
 | > do_trim_silence:False
 | > trim_db:60
 | > do_sound_norm:False
 | > do_amp_to_db_linear:True
 | > do_amp_to_db_mel:True
 | > do_rms_norm:False
 | > db_level:None
 | > stats_path:None
 | > base:10
 | > hop_length:256
 | > win_length:1024
 > Model fully restored. 

## ▶️ Play and Download the Cloned Audio
We play the generated cloned audio within the notebook and also provide a download link to save it locally.


In [6]:
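from IPython.display import Audio, FileLink, display

# Play the cloned audio; display() is needed because only a cell's last expression renders on its own
display(Audio("cloned.wav"))

# Provide a download link to manually download the file
display(FileLink("cloned.wav"))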


## 🧾 Transcribing Audio Using Faster-Whisper
We use the `faster-whisper` ASR model to transcribe the input audio into text. The transcription is done segment-wise and then combined into a single string.



In [2]:
from faster_whisper import WhisperModel

# Load the Whisper model ("small" here; larger sizes such as "medium" trade speed for accuracy)
model = WhisperModel("small", compute_type="auto")

# Transcribe with the language forced to English
segments, _ = model.transcribe(
    "/content/drive/MyDrive/AudioFiles/Vijay-Mallya-UGLY-TRUTH-Madan-Go.wav",
    language="en"  # Force the model to interpret speech as English
)

# Combine all segments into one text
transcribed_text = " ".join([seg.text for seg in segments])
print("📝 Transcribed Text:\n", transcribed_text)


📝 Transcribed Text:
  Madan, tell me about Vijay Malia's podcast.  Vijay Malia has been in a podcast for so many years and he has been telling the truth about what really happened to him.  He is not a thief, he is not even talking about me.  This weekend, I got a DM from him.  So, you asked me what happened.  Today, we are going to watch the whole of India.  All the Indians and the world are talking about this podcast with Vijay Malia.  He has been telling me about what really happened to him.  One side is Katrina Kaif, the other side is Deepika Padukone, the other side is Virat Kohli.  Vijay Malia has been among the stars all around him.  Today, in some country, when he has been taking over the bank completely, he is trying to win.  The reason is that he has been sitting in the United Kingdom for so many years.  He has made a mistake in the court he is in.  But he has been telling me that he made a mistake.  So, what is happening? Let's watch it in detail.  Before that, if you want to
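
Each segment returned by `faster-whisper` also carries `start` and `end` times in seconds alongside `text`, so the transcript can be kept segment by segment instead of flattened into one string. The sketch below formats such segments as timestamped lines. The `Segment` dataclass is a stand-in with made-up values so the example runs without the model, and `format_timestamp` and `format_segments` are hypothetical helper names.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """Stand-in exposing the start, end, and text fields used below."""
    start: float
    end: float
    text: str


def format_timestamp(seconds):
    """Render seconds as MM:SS.mmm."""
    total_ms = round(seconds * 1000)
    minutes, rem_ms = divmod(total_ms, 60_000)
    secs, ms = divmod(rem_ms, 1000)
    return f"{minutes:02d}:{secs:02d}.{ms:03d}"


def format_segments(segments):
    """One line per segment: [start -> end] text, with whitespace trimmed."""
    return [
        f"[{format_timestamp(s.start)} -> {format_timestamp(s.end)}] {s.text.strip()}"
        for s in segments
    ]


# Made-up segments for illustration only
demo = [Segment(0.0, 2.5, " Hello there."), Segment(2.5, 61.04, " Second line.")]
print("\n".join(format_segments(demo)))
```

Note that `segments` from `model.transcribe` is a generator, so it would need to be collected with `list(segments)` first if it is to feed both the joined string and the timestamped lines.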