<h1 align="center">🎙️ Deepfake Audio</h1>
<h3 align="center"><i>A neural voice cloning studio powered by SV2TTS technology</i></h3>

<div align="center">

| **Author** | **Profiles** |
|:---:|:---|
| **Amey Thakur** | [![GitHub](https://img.shields.io/badge/GitHub-Amey--Thakur-181717?logo=github)](https://github.com/Amey-Thakur) [![ORCID](https://img.shields.io/badge/ORCID-0000--0001--5644--1575-A6CE39?logo=orcid)](https://orcid.org/0000-0001-5644-1575) [![Google Scholar](https://img.shields.io/badge/Google_Scholar-Amey_Thakur-4285F4?logo=google-scholar&logoColor=white)](https://scholar.google.ca/citations?user=0inooPgAAAAJ&hl=en) [![Kaggle](https://img.shields.io/badge/Kaggle-Amey_Thakur-20BEFF?logo=kaggle)](https://www.kaggle.com/ameythakur20) |
| **Mega Satish** | [![GitHub](https://img.shields.io/badge/GitHub-msatmod-181717?logo=github)](https://github.com/msatmod) [![ORCID](https://img.shields.io/badge/ORCID-0000--0002--1844--9557-A6CE39?logo=orcid)](https://orcid.org/0000-0002-1844-9557) [![Google Scholar](https://img.shields.io/badge/Google_Scholar-Mega_Satish-4285F4?logo=google-scholar&logoColor=white)](https://scholar.google.ca/citations?user=7Ajrr6EAAAAJ&hl=en) [![Kaggle](https://img.shields.io/badge/Kaggle-Mega_Satish-20BEFF?logo=kaggle)](https://www.kaggle.com/megasatish) |

---

**Attribution:** This project builds upon the foundational work of [CorentinJ/Real-Time-Voice-Cloning](https://github.com/CorentinJ/Real-Time-Voice-Cloning).

🚀 **Live Demo:** [Hugging Face Space](https://huggingface.co/spaces/ameythakur/Deepfake-Audio) | 🎥 **Video Demo:** [YouTube](https://youtu.be/i3wnBcbHDbs) | 💻 **Repository:** [GitHub](https://github.com/Amey-Thakur/DEEPFAKE-AUDIO)

<a href="https://youtu.be/i3wnBcbHDbs">
  <img src="https://img.youtube.com/vi/i3wnBcbHDbs/0.jpg" alt="Video Demo" width="60%">
</a>

</div>

## 📖 Introduction

> **An audio deepfake is when a “cloned” voice that is potentially indistinguishable from the real person’s is used to produce synthetic audio.**

This research notebook demonstrates the **SV2TTS (Speaker Verification to Text-to-Speech)** framework, a three-stage deep learning pipeline capable of cloning a voice from a mere 5 seconds of audio.

### 🧠 System Architecture

The following diagram illustrates how the three components interact to produce the final audio output.

```mermaid
graph LR
    A[🎤 Reference Audio] -->|Preprocess| B(Speaker Encoder)
    B -->|Generates Embedding| C{Synthesizer}
    D[📝 Input Text] --> C
    C -->|Mel Spectrogram| E(Vocoder)
    E -->|Waveform Synthesis| F[🔊 Output Audio]
    style A fill:#f9f,stroke:#333,stroke-width:2px
    style F fill:#9f9,stroke:#333,stroke-width:2px
    style B fill:#bbf,stroke:#333,stroke-width:2px
    style C fill:#bbf,stroke:#333,stroke-width:2px
    style E fill:#bbf,stroke:#333,stroke-width:2px
```

### The Pipeline Components
1.  **Speaker Encoder**: A Recurrent Neural Network (RNN) that condenses the *timbre* and *prosody* of the reference audio into a fixed-length vector (embedding).
2.  **Synthesizer**: A Tacotron-2 based implementation that takes text and the speaker embedding and generates a Mel Spectrogram, a time-frequency representation of the speech to be produced.
3.  **Vocoder**: A WaveRNN network that iteratively generates the raw audio waveform from the Mel Spectrogram, sample by sample.
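
The hand-off between the three components can be sketched with stand-in functions. This is a toy illustration of the data flow only, not the real models: the `toy_*` functions are hypothetical, and the mel channel count, hop length, and sample rate are assumed values chosen for the example. Only the 256-dimensional embedding size comes from this notebook.

```python
import math

EMBED_DIM = 256       # embedding size shown in the notebook's heatmap
N_MELS = 80           # assumed number of mel channels
HOP_LENGTH = 200      # assumed audio samples per mel frame
SAMPLE_RATE = 16000   # assumed sample rate in Hz


def toy_speaker_encoder(reference_wav):
    """Stand-in encoder: returns a unit-length vector of EMBED_DIM floats."""
    value = 1.0 / math.sqrt(EMBED_DIM)
    return [value] * EMBED_DIM


def toy_synthesizer(text, embed, frames_per_char=5):
    """Stand-in synthesizer: returns a mel 'spectrogram' as N_MELS rows of frames."""
    assert len(embed) == EMBED_DIM
    n_frames = len(text) * frames_per_char
    return [[0.0] * n_frames for _ in range(N_MELS)]


def toy_vocoder(mel):
    """Stand-in vocoder: returns HOP_LENGTH silent samples per mel frame."""
    n_frames = len(mel[0])
    return [0.0] * (n_frames * HOP_LENGTH)


reference_wav = [0.0] * (5 * SAMPLE_RATE)      # five seconds of placeholder audio
embed = toy_speaker_encoder(reference_wav)      # audio -> fixed-length embedding
mel = toy_synthesizer("Hello world", embed)     # text + embedding -> mel spectrogram
wav = toy_vocoder(mel)                          # mel spectrogram -> waveform

print(len(embed), len(mel), len(mel[0]), len(wav))
```

The point of the sketch is the shape of each intermediate result: the embedding has a fixed length regardless of how long the reference audio is, while the spectrogram and waveform grow with the length of the text.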

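## ☁️ Cloud Environment Setup
Execute the following cell **only** if you are running this notebook in a cloud environment such as **Google Colab** or **Kaggle**. As written, the setup steps run only when Google Colab is detected; in any other environment, including Kaggle, the cell prints a message and skips setup.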

This script will:
1.  Clone the [DEEPFAKE-AUDIO repository](https://github.com/Amey-Thakur/DEEPFAKE-AUDIO).
2.  Install system-level dependencies (e.g., `libsndfile1` for audio processing).
3.  Install Python libraries required for signal processing and deep learning.
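
One detail in step 3 is easy to get wrong: when an install command is passed to a shell as a single string, an unquoted version specifier such as `scikit-learn>=1.3` is parsed as output redirection. A minimal sketch of one way to avoid this, assuming you are free to call `pip` through `subprocess` with an argument list instead of a shell string (the helper name `pip_install_command` is hypothetical):

```python
import sys


def pip_install_command(requirements):
    """Build an argument list for `python -m pip install`, one argument per requirement."""
    return [sys.executable, "-m", "pip", "install", *requirements]


requirements = ["librosa==0.9.2", "scikit-learn>=1.3", "matplotlib>=3.7"]
command = pip_install_command(requirements)
print(command[1:])

# To actually run it (commented out so the sketch has no side effects):
# import subprocess
# subprocess.run(command, check=True)
```

Because each requirement is its own list element, no shell ever sees the `>` character, so no quoting is needed.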

In [None]:
import os
import sys

# Detect Google Colab (any other environment, including Kaggle, takes the skip branch below)
try:
    shell = get_ipython()
    if 'google.colab' in str(shell):
        print("💻 Detected Google Colab Environment. Initiating setup...")
        
        # 1. Clone the Repository
        if not os.path.exists("DEEPFAKE-AUDIO"):
            print("⬇️ Cloning DEEPFAKE-AUDIO repository...")
            shell.system("git clone https://github.com/Amey-Thakur/DEEPFAKE-AUDIO")
        
        # 2. Change Working Directory
        os.chdir("/content/DEEPFAKE-AUDIO")
        
        # 3. Pull Latest Changes (Ensure freshness)
        print("🔄 Synchronizing with remote repository...")
        shell.system("git pull")
        
        # 4. Install System Dependencies
        # libsndfile1 is crucial for reading/writing audio files via SoundFile/Librosa
        print("🔧 Installing system dependencies (libsndfile1)...")
        shell.system("apt-get install -y libsndfile1")
        
        # 5. Install Python Dependencies
        # Added 'gradio' for the alternative UI
        print("📦 Installing Python libraries...")
        # Version specifiers are quoted so the shell does not treat '>' as output redirection
        shell.system("pip install librosa==0.9.2 unidecode webrtcvad inflect umap-learn 'scikit-learn>=1.3' tqdm scipy 'matplotlib>=3.7' 'Pillow>=10.2' soundfile huggingface_hub gradio")

        print("✅ Environment setup complete. Ready for cloning.")
    else:
        print("🏠 Running in local or custom environment. Skipping cloud setup.")
except NameError:
    print("🏠 Running in local or custom environment. Skipping cloud setup.")

## 1️⃣ Model & Data Initialization

We prioritize data availability to ensure the notebook runs smoothly regardless of the platform. The system checks for checkpoints in this order:

1.  **Repository Local** (`Dataset/`): Fast local access if cloned.
2.  **Kaggle Dataset** (`/kaggle/input/deepfakeaudio/`): Pre-loaded environment data.
    *   *Reference*: [Amey Thakur's Kaggle Dataset](https://www.kaggle.com/datasets/ameythakur20/deepfakeaudio)
    *   *Kaggle Profile*: [ameythakur20](https://www.kaggle.com/ameythakur20)
3.  **HuggingFace Auto-Download**: Robust fallback for fresh environments.
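
The priority order above amounts to "return the first location that exists". A minimal sketch of that idea, assuming each location is checked purely by file existence; `first_existing` and `encoder_candidates` are hypothetical helper names, and the paths mirror the ones listed above:

```python
from pathlib import Path


def first_existing(candidates):
    """Return the first candidate path that exists on disk, or None if none do."""
    for candidate in candidates:
        path = Path(candidate)
        if path.exists():
            return path
    return None


def encoder_candidates():
    # Ordered from highest to lowest priority, mirroring the list above.
    return [
        Path("Dataset") / "encoder.pt",
        Path("/kaggle/input/deepfakeaudio") / "encoder.pt",
        Path("pretrained_models/default") / "encoder.pt",
    ]


found = first_existing(encoder_candidates())
print(found if found is not None else "No encoder checkpoint found in any location")
```

Keeping the candidates in an ordered list makes the priority explicit and easy to change without touching the lookup logic.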

In [None]:
import sys
import os
from pathlib import Path
import zipfile
import shutil

# Register 'Source Code' to Python path for module imports
source_path = os.path.abspath("Source Code")
if source_path not in sys.path:
    sys.path.append(source_path)

print(f"📂 Working Directory: {os.getcwd()}")
print(f"✅ Module Path Registered: {source_path}")

# Define paths for model checkpoints
extract_path = "pretrained_models"
zip_path = "Dataset/pretrained.zip"

if not os.path.exists(extract_path):
    os.makedirs(extract_path)

# --- 🧠 Checkpoint Verification Strategy ---
print("⬇️ Verifying Model Availability...")

# Priority 1: Check Local Repository 'Dataset/' folder
core_models = ["encoder.pt", "synthesizer.pt", "vocoder.pt"]
dataset_models_present = all([os.path.exists(os.path.join("Dataset", m)) for m in core_models])

if dataset_models_present:
    print("✅ Found high-priority local models in 'Dataset/'. Verified.")
else:
    print("⚠️ Models not found in 'Dataset/'. Attempting fallback strategies...")

    # Priority 2 (Kaggle input) is not checked in this cell; resolve_checkpoint() handles it at load time.
    # Priority 3 (Fallback): Auto-download from HuggingFace via utils script
    try:
        from utils.default_models import ensure_default_models
        ensure_default_models(Path("pretrained_models"))
        print("✅ Models successfully acquired via HuggingFace fallback.")
    except Exception as e:
        print(f"⚠️ Critical: Could not auto-download models. Error: {e}")

## 2️⃣ Architecture Loading

We now initialize the three distinct neural networks that comprise the SV2TTS framework. Please ensure you are running on a **GPU Runtime** (e.g., T4 on Colab) for optimal performance.

In [None]:
from encoder import inference as encoder
from synthesizer.inference import Synthesizer
from vocoder import inference as vocoder
import numpy as np
import torch
from pathlib import Path

# Hardware Acceleration Check
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"🎯 Computation Device: {device}")

def resolve_checkpoint(component_name, legacy_path_suffix):
    """
    Resolves the path to a model checkpoint, trying locations in priority order.
    1. Repository /Dataset/ folder.
    2. Kaggle Input directory.
    3. Auto-downloaded 'pretrained_models/default'.
    4. Legacy/manual paths under 'pretrained_models/'.
    Returns None if no checkpoint is found.
    """
    
    # 1. Repository Local (Dataset/)
    dataset_p = Path("Dataset") / f"{component_name.lower()}.pt"
    if dataset_p.exists():
        print(f"🟢 Loading {component_name} from Repository: {dataset_p}")
        return dataset_p

    # 2. Kaggle Environment
    kaggle_p = Path("/kaggle/input/deepfakeaudio") / f"{component_name.lower()}.pt"
    if kaggle_p.exists():
        print(f"🟢 Loading {component_name} from Kaggle Input: {kaggle_p}")
        return kaggle_p
    
    # 3. Default / Auto-Downloaded
    default_p = Path("pretrained_models/default") / f"{component_name.lower()}.pt"
    if default_p.exists():
        print(f"🟢 Loading {component_name} from Auto-Download: {default_p}")
        return default_p

    # 4. Legacy/Manual Paths
    legacy_p = Path("pretrained_models") / legacy_path_suffix
    if legacy_p.exists():
        if legacy_p.is_dir():
            # Prefer a checkpoint directly inside the directory, then search subdirectories
            pts = [f for f in legacy_p.glob("*.pt") if f.is_file()]
            if pts:
                return pts[0]
            pts_rec = [f for f in legacy_p.rglob("*.pt") if f.is_file()]
            if pts_rec:
                return pts_rec[0]
        return legacy_p

    print(f'⚠️ Warning: Checkpoint for {component_name} not found!')
    return None

print("⏳ Initializing Neural Networks...")

try:
    # 1. Encoder: Derives a speaker embedding that captures the voice's unique characteristics
    encoder_path = resolve_checkpoint("Encoder", "encoder/saved_models")
    encoder.load_model(encoder_path)

    # 2. Synthesizer: Generates spectrograms from text
    synth_path = resolve_checkpoint("Synthesizer", "synthesizer/saved_models/logs-pretrained/taco_pretrained")
    synthesizer = Synthesizer(synth_path)

    # 3. Vocoder: Converts spectrograms to audio waveforms
    vocoder_path = resolve_checkpoint("Vocoder", "vocoder/saved_models/pretrained")
    vocoder.load_model(vocoder_path)

    print("✅ All models loaded successfully. The pipeline is ready.")
except Exception as e:
    print(f"❌ Initialization Error: {e}")
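
Since the encoder comes from speaker verification, a common way to reason about its output is to compare two embeddings by cosine similarity: clips from the same speaker are expected to score higher than clips from different speakers. The sketch below uses small made-up vectors purely for illustration; it does not call the loaded encoder, and the numbers are not results from this project.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical 4-D embeddings for illustration (real ones would be 256-D).
speaker_a_clip1 = [0.9, 0.1, 0.3, 0.2]
speaker_a_clip2 = [0.8, 0.2, 0.35, 0.15]
speaker_b_clip1 = [0.1, 0.9, 0.1, 0.4]

same = cosine_similarity(speaker_a_clip1, speaker_a_clip2)
different = cosine_similarity(speaker_a_clip1, speaker_b_clip1)
print(f"same speaker: {same:.3f}, different speakers: {different:.3f}")
```

In the notebook, the vectors would instead come from `encoder.embed_utterance(...)` applied to two preprocessed clips.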

## 3️⃣ Inference Interface

Select your **Input Method** below to begin cloning.

*   **Option A: Classic Widget UI (Default)**: Simple, lightweight, and reliable.
*   **Option B: Modern Gradio UI**: Advanced interface with waveform visualization and easy downloading.

Run either cell below to start.

In [None]:
# --- OPTION A: CLASSIC WIDGET UI ---

import ipywidgets as widgets
from IPython.display import display, Javascript, Audio, HTML
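try:
    from google.colab import output  # Needed only for microphone recording on Colab
except ImportError:
    output = None  # Outside Colab, the Record tab reports an error instead of breaking this cell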
from base64 import b64decode, b64encode
import io
import librosa
import librosa.display
import soundfile as sf
import matplotlib.pyplot as plt
import numpy as np
import time

# --- JAVASCRIPT: Audio Recording Logic ---
# This script injects JS into the browser to capture microphone input.
RECORD = """
const sleep  = time => new Promise(resolve => setTimeout(resolve, time))
const b2text = blob => new Promise(resolve => {
  const reader = new FileReader()
  reader.onloadend = e => resolve(e.srcElement.result)
  reader.readAsDataURL(blob)
})
var record = time => new Promise(async resolve => {
  stream = await navigator.mediaDevices.getUserMedia({ audio: true })
  recorder = new MediaRecorder(stream)
  chunks = []
  recorder.ondataavailable = e => chunks.push(e.data)
  recorder.start()
  await sleep(time)
  recorder.onstop = async ()=>{
    blob = new Blob(chunks)
    text = await b2text(blob)
    resolve(text)
  }
  recorder.stop()
})"""

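def record_audio(sec=10):
    """Invokes the JS recorder and saves the result to 'recording.wav'.

    The bytes are written exactly as the browser's MediaRecorder produced them,
    which is often WebM or Ogg rather than true WAV, despite the file name.
    """
    print("🔴 Recording active for %d seconds..." % sec)
    display(Javascript(RECORD))
    s = output.eval_js('record(%d)' % (sec*1000))
    print("✅ Recording saved.")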
    binary = b64decode(s.split(',')[1])
    with open('recording.wav', 'wb') as f:
        f.write(binary)
    return 'recording.wav'

# --- VISUALIZATION FUNCTION ---
def visualize_results(original_wav, generated_wav, spec, embed, title="Cloning Analysis"):
    fig, axes = plt.subplots(3, 1, figsize=(10, 12))
    
    # 1. Waveform Comparison
    axes[0].set_title("Input Voice vs. Cloned Voice (Waveform)")
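    # waveshow assumes sr=22050, which matches librosa.load's default for the input audio.
    # The generated audio uses the synthesizer's sample rate, so it is passed explicitly
    # to keep both waveforms on a correct time axis.
    librosa.display.waveshow(original_wav, alpha=0.5, ax=axes[0], label="Original")
    librosa.display.waveshow(generated_wav, sr=synthesizer.sample_rate, alpha=0.5, ax=axes[0], label="Cloned", color='r')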
    axes[0].legend()
    
    # 2. Spectrogram
    axes[1].set_title("Generated Mel Spectrogram")
    im = axes[1].imshow(spec, aspect="auto", origin="lower", interpolation='none')
    fig.colorbar(im, ax=axes[1])
    
    # 3. Embedding Heatmap
    axes[2].set_title("Speaker Embedding (256-D Latent Representation)")
    if len(embed) == 256:
        # Reshape to 16x16 for a square heatmap visualization
        axes[2].imshow(embed.reshape(16, 16), aspect='auto', cmap='viridis')
    else:
        axes[2].plot(embed) 
        
    plt.tight_layout()
    plt.show()

# --- WIDGETS: User Interface Construction ---

print("Select Input Method:")
tab = widgets.Tab()

# Tab 1: Presets (Celebrity Samples)
sample_roots = [
    "Source Code/samples",
    "Dataset/samples",
    "/kaggle/input/deepfakeaudio/samples"
]
samples_dir = "Source Code/samples" # Default fallback
for d in sample_roots:
    if os.path.exists(d) and len(os.listdir(d)) > 0:
        samples_dir = d
        print(f"📂 Loading Reference Samples from: {d}")
        break

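# Filter for audio files (left empty if no samples directory exists, instead of raising)
preset_files = []
if os.path.isdir(samples_dir):
    preset_files = [f for f in os.listdir(samples_dir) if f.endswith((".wav", ".mp3"))]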
preset_files.sort()

# Prioritize Key Samples
priority_samples = ["Steve Jobs.wav", "Donald Trump.wav"]
for sample in reversed(priority_samples):
    if sample in preset_files:
        preset_files.insert(0, preset_files.pop(preset_files.index(sample)))

dropdown = widgets.Dropdown(options=preset_files, description='Preset:')
tab1 = widgets.VBox([dropdown])

# Tab 2: File Upload
uploader = widgets.FileUpload(accept='.wav,.mp3', multiple=False)
tab2 = widgets.VBox([uploader])

# Tab 3: Microphone Recording
record_btn = widgets.Button(description="Start Recording (10s)", button_style='danger')
record_out = widgets.Output()
def on_record_click(b):
    with record_out:
        record_btn.disabled = True
        try:
            record_audio(10)
        except Exception as e:
             print(f"Error: {e}. (Note: Recording requires Colab/Browser context)")
        record_btn.disabled = False
        
record_btn.on_click(on_record_click)
tab3 = widgets.VBox([record_btn, record_out])

# Assemble Tabs
tab.children = [tab1, tab2, tab3]
tab.set_title(0, '🎵 Presets')
tab.set_title(1, '📂 Upload')
tab.set_title(2, '🔴 Record')
display(tab)

# Text Input Area
text_input = widgets.Textarea(
    value='Hello! This is a real-time voice cloning test. The quality is truly amazing.',
    placeholder='Enter text to synthesize...',
    description='Text:',
    disabled=False,
    layout=widgets.Layout(width='50%', height='100px')
)

# Post-Processing Options
normalize_chk = widgets.Checkbox(value=False, description="Normalize Audio 🎚️")

clone_btn = widgets.Button(description="Clone Voice! 🚀", button_style='primary')
out = widgets.Output()

display(text_input, normalize_chk, clone_btn, out)

# --- PROCESSING PIPELINE ---

def run_cloning(b):
    with out:
        out.clear_output()
        active_tab = tab.selected_index
        input_path = None
        
        try:
            # 1. Acquire Input Audio
            if active_tab == 0: # Preset
                input_path = os.path.join(samples_dir, dropdown.value)
                print(f"🎙️ Source: Preset ({dropdown.value})")

            elif active_tab == 1: # Upload
                if not uploader.value:
                    print("❌ User Error: Please upload a file first!")
                    return
                # ipywidgets 7 exposes uploads as a dict keyed by filename;
                # ipywidgets 8 exposes a tuple of dicts with a 'name' field.
                upload = uploader.value
                if isinstance(upload, dict):
                    fname, meta = next(iter(upload.items()))
                else:
                    meta = upload[0]
                    fname = meta['name']
                content = meta['content']
                input_path = "uploaded_sample.wav"
                with open(input_path, "wb") as f:
                    f.write(content)
                print(f"🎙️ Source: Upload ({fname})")

            elif active_tab == 2: # Record
                if not os.path.exists("recording.wav"):
                    print("❌ User Error: Please record audio first!")
                    return
                input_path = "recording.wav"
                print("🎙️ Source: Microphone Recording")
            
            # Start Timer
            start_time = time.time()

            # 2. Preprocess & Embed (Encoder)
            print("⏳ Step 1/3: Preprocessing & Encoding Speaker Characteristics...")
            original_wav, sampling_rate = librosa.load(input_path)
            preprocessed_wav = encoder.preprocess_wav(original_wav, sampling_rate)
            embed = encoder.embed_utterance(preprocessed_wav)
            print("   ✅ Speaker embedding generated.")

            # 3. Synthesize Spectrogram (Synthesizer)
            print("⏳ Step 2/3: Synthesizing Mel Spectrogram from Text...")
            specs = synthesizer.synthesize_spectrograms([text_input.value], [embed])
            spec = specs[0]
            print("   ✅ Spectrogram generated.")

            # 4. Generate Waveform (Vocoder)
            print("⏳ Step 3/3: Vocoding (Spectrogram -> Audio)...")
            generated_wav = vocoder.infer_waveform(spec)
            
            # 5. Post-Processing
            if normalize_chk.value:
                print("🎚️ Normalizing Audio Output...")
                generated_wav = librosa.util.normalize(generated_wav)

            # Stop Timer & Calculate RTF
            end_time = time.time()
            proc_time = end_time - start_time
            duration = len(generated_wav) / synthesizer.sample_rate
            rtf = proc_time / duration
            print(f"⚡ Performance Analysis: Generated {duration:.2f}s of audio in {proc_time:.2f}s (RTF: {rtf:.3f}x)")

            # 6. Output Audio & Download
            print("🎉 Synthesis Complete! Playing Audio:")
            display(Audio(generated_wav, rate=synthesizer.sample_rate))
            
            # Generate Download Link (Base64)
            buf = io.BytesIO()
            sf.write(buf, generated_wav, synthesizer.sample_rate, format='WAV')
            b64 = b64encode(buf.getvalue()).decode()
            html_str = f'<a href="data:audio/wav;base64,{b64}" download="cloned_voice.wav" style="background-color:#4CAF50; color:white; padding:8px 16px; text-decoration:none; border-radius:4px;">⬇️ Download Audio (WAV)</a>'
            display(HTML(html_str))
            
            # 7. Visualization
            print("\n📊 Generating Scholarly Analysis...")
            visualize_results(original_wav, generated_wav, spec, embed)
            
        except Exception as e:
            print(f"❌ Execution Error: {e}")

clone_btn.on_click(run_cloning)
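
The real-time factor (RTF) printed by the classic UI is processing time divided by the duration of the generated audio, so a value below 1 means synthesis ran faster than playback. A minimal standalone sketch of that arithmetic follows; the function name `real_time_factor` and the example numbers are hypothetical, not measurements from this project.

```python
def real_time_factor(processing_seconds, num_samples, sample_rate):
    """Processing time divided by the duration of the generated audio."""
    if num_samples <= 0 or sample_rate <= 0:
        raise ValueError("num_samples and sample_rate must be positive")
    duration_seconds = num_samples / sample_rate
    return processing_seconds / duration_seconds


# Hypothetical run: 48,000 samples at 16 kHz (3 s of audio) generated in 1.5 s.
rtf = real_time_factor(1.5, 48_000, 16_000)
print(f"RTF: {rtf:.3f}x")
```

The guard against empty output is an addition in this sketch; it avoids a division by zero when nothing was generated.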

In [None]:
# --- OPTION B: MODERN GRADIO UI ---

import gradio as gr

def gradio_cloning_pipeline(text, audio_input, normalize):
    if audio_input is None:
        raise gr.Error("Please provide an input audio file (upload or record).")
        
    # 1. Load Audio
    # Gradio passes audio as (sample_rate, numpy_array) or filepath depending on type
    # Here type="filepath" is safest for consistency with librosa
    original_wav, sampling_rate = librosa.load(audio_input)
    
    # 2. Encoder
    preprocessed_wav = encoder.preprocess_wav(original_wav, sampling_rate)
    embed = encoder.embed_utterance(preprocessed_wav)
    
    # 3. Synthesizer
    specs = synthesizer.synthesize_spectrograms([text], [embed])
    spec = specs[0]
    
    # 4. Vocoder
    generated_wav = vocoder.infer_waveform(spec)
    if normalize:
        generated_wav = librosa.util.normalize(generated_wav)
        
    return (synthesizer.sample_rate, generated_wav)

iface = gr.Interface(
    fn=gradio_cloning_pipeline,
    inputs=[
        gr.Textbox(label="Text to Synthesize", value="This is a test of the Deepfake Audio cloning system.", lines=3),
        gr.Audio(label="Reference Voice", type="filepath"), # Supports upload & mic
        gr.Checkbox(label="Normalize Output", value=False)
    ],
    outputs=gr.Audio(label="Cloned Voice"),
    title="🎙️ Deepfake Audio Studio",
    description="Clone any voice in seconds using the SV2TTS framework. Upload a sample, type your text, and generate!",
    theme="default"
)

print("🚀 Launching Gradio Interface...")
iface.launch(share=True, debug=True)
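
Both interfaces offer an optional normalization step. With its default settings, `librosa.util.normalize` scales a signal by its largest absolute value, so the loudest sample ends up at 1.0 or -1.0. The idea can be sketched in plain Python as follows; `peak_normalize` is a hypothetical helper for illustration, not a drop-in replacement for the library call.

```python
def peak_normalize(samples):
    """Scale samples so the largest absolute value becomes 1.0."""
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0.0:
        return list(samples)  # silent or empty input: nothing to scale
    return [s / peak for s in samples]


quiet = [0.1, -0.25, 0.05]
print(peak_normalize(quiet))
```

Peak normalization changes only the overall level, not the shape of the waveform, which is why it is safe to apply as a final post-processing step.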