<h1 align="center">🎙️ Deepfake Audio</h1>
<h3 align="center"><i>A neural voice cloning studio powered by SV2TTS technology</i></h3>

<div align="center">

| **Author** | **Profiles** |
|:---:|:---|
| **Amey Thakur** | [![GitHub](https://img.shields.io/badge/GitHub-Amey--Thakur-181717?logo=github)](https://github.com/Amey-Thakur) [![ORCID](https://img.shields.io/badge/ORCID-0000--0001--5644--1575-A6CE39?logo=orcid)](https://orcid.org/0000-0001-5644-1575) [![Google Scholar](https://img.shields.io/badge/Google_Scholar-Amey_Thakur-4285F4?logo=google-scholar&logoColor=white)](https://scholar.google.ca/citations?user=0inooPgAAAAJ&hl=en) [![Kaggle](https://img.shields.io/badge/Kaggle-Amey_Thakur-20BEFF?logo=kaggle)](https://www.kaggle.com/ameythakur20) |
| **Mega Satish** | [![GitHub](https://img.shields.io/badge/GitHub-msatmod-181717?logo=github)](https://github.com/msatmod) [![ORCID](https://img.shields.io/badge/ORCID-0000--0002--1844--9557-A6CE39?logo=orcid)](https://orcid.org/0000-0002-1844-9557) [![Google Scholar](https://img.shields.io/badge/Google_Scholar-Mega_Satish-4285F4?logo=google-scholar&logoColor=white)](https://scholar.google.ca/citations?user=7Ajrr6EAAAAJ&hl=en) [![Kaggle](https://img.shields.io/badge/Kaggle-Mega_Satish-20BEFF?logo=kaggle)](https://www.kaggle.com/megasatish) |

---

**Attribution:** This project builds upon the foundational work of [CorentinJ/Real-Time-Voice-Cloning](https://github.com/CorentinJ/Real-Time-Voice-Cloning).

🚀 **Live Demo:** [Hugging Face Space](https://huggingface.co/spaces/ameythakur/Deepfake-Audio) | 🎥 **Video Demo:** [YouTube](https://youtu.be/i3wnBcbHDbs) | 💻 **Repository:** [GitHub](https://github.com/Amey-Thakur/DEEPFAKE-AUDIO)

<a href="https://youtu.be/i3wnBcbHDbs">
  <img src="https://img.youtube.com/vi/i3wnBcbHDbs/0.jpg" alt="Video Demo" width="60%">
</a>

</div>

## 📖 Introduction

> **An audio deepfake is when a “cloned” voice that is potentially indistinguishable from the real person’s is used to produce synthetic audio.**

This research notebook demonstrates the **SV2TTS (Speaker Verification to Text-to-Speech)** framework, a three-stage deep learning pipeline capable of cloning a voice from a mere 5 seconds of audio.

### The Pipeline Components
1.  **Speaker Encoder**: A Recurrent Neural Network (RNN) that condenses the *timbre* and *prosody* of the reference audio into a fixed-length vector (embedding).
2.  **Synthesizer**: A Tacotron 2-based model that takes text and the speaker embedding and generates a Mel Spectrogram, a time-frequency representation of the speech.
3.  **Vocoder**: A WaveRNN network that iteratively generates the raw audio waveform from the Mel Spectrogram, sample by sample.
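
The data flow between the three stages can be sketched with stand-in functions. This is a shape-level illustration only: the function names (`fake_encoder`, `fake_synthesizer`, `fake_vocoder`) are hypothetical, their bodies are placeholders rather than neural networks, and the dimensions (256-value embedding, 80 mel channels, 16 kHz audio, 200 samples per spectrogram frame) are assumptions based on defaults commonly used with the upstream Real-Time-Voice-Cloning code, not values confirmed by this notebook.

```python
import numpy as np

# Assumed dimensions (not confirmed by this notebook)
EMBED_DIM = 256      # length of the speaker embedding
N_MELS = 80          # mel channels per spectrogram frame
SAMPLE_RATE = 16000  # audio samples per second
HOP_LENGTH = 200     # audio samples represented by one spectrogram frame


def fake_encoder(wav: np.ndarray) -> np.ndarray:
    """Stand-in for the speaker encoder: audio of any length -> fixed-length vector."""
    rng = np.random.default_rng(0)
    embed = rng.random(EMBED_DIM)
    # Scaling to unit length is an assumption made for this sketch
    return embed / np.linalg.norm(embed)


def fake_synthesizer(text: str, embed: np.ndarray) -> np.ndarray:
    """Stand-in for the synthesizer: text + embedding -> mel spectrogram (N_MELS x frames)."""
    n_frames = 5 * len(text)  # arbitrary placeholder: five frames per character
    return np.zeros((N_MELS, n_frames))


def fake_vocoder(mel: np.ndarray) -> np.ndarray:
    """Stand-in for the vocoder: mel spectrogram -> waveform, HOP_LENGTH samples per frame."""
    return np.zeros(mel.shape[1] * HOP_LENGTH)


reference = np.zeros(5 * SAMPLE_RATE)  # five seconds of placeholder reference audio
embed = fake_encoder(reference)
mel = fake_synthesizer("Hello world", embed)
wav = fake_vocoder(mel)
print(embed.shape, mel.shape, wav.shape)
```

The point of the sketch is the interface between stages: the embedding has the same length regardless of how long the reference audio is, while the spectrogram and waveform lengths grow with the text.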

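## ☁️ Cloud Environment Setup
Execute the following cell **only** if you are running this notebook in a cloud environment like **Google Colab** or **Kaggle**. As written, the cell performs the setup only when it detects Google Colab; in any other environment it prints a message and skips these steps.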

This script will:
1.  Clone the [DEEPFAKE-AUDIO repository](https://github.com/Amey-Thakur/DEEPFAKE-AUDIO).
2.  Install system-level dependencies (e.g., `libsndfile1` for audio processing).
3.  Install Python libraries required for signal processing and deep learning.
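
One way to carry out step 3 without going through a shell is to pass pip its arguments as a list, so that version specifiers such as `scikit-learn>=1.3` are never parsed as output redirection. The following is a minimal sketch under that assumption; `build_pip_command` and `install` are hypothetical helpers, and the requirement strings are copied from the setup cell in this notebook.

```python
import subprocess
import sys

# Requirement strings copied from the notebook's setup cell
REQUIREMENTS = [
    "librosa==0.9.2", "unidecode", "webrtcvad", "inflect", "umap-learn",
    "scikit-learn>=1.3", "tqdm", "scipy", "matplotlib>=3.7", "Pillow>=10.2",
    "soundfile", "huggingface_hub", "gradio",
]


def build_pip_command(requirements):
    """Hypothetical helper: argv list that installs the requirements with the current interpreter's pip."""
    return [sys.executable, "-m", "pip", "install", *requirements]


def install(requirements):
    """Run pip without a shell; raises subprocess.CalledProcessError if pip fails."""
    subprocess.run(build_pip_command(requirements), check=True)


cmd = build_pip_command(REQUIREMENTS)
print(" ".join(cmd[1:]))
# Call install(REQUIREMENTS) to actually perform the installation.
```

Using `sys.executable -m pip` also ensures the packages are installed for the same interpreter that runs the notebook kernel.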

In [None]:
import os
import sys

# Detect the cloud environment (only Google Colab triggers the setup below)
try:
    shell = get_ipython()
    if 'google.colab' in str(shell):
        print("💻 Detected Google Colab Environment. Initiating setup...")

        # 1. Clone the Repository
        if not os.path.exists("DEEPFAKE-AUDIO"):
            print("⬇️ Cloning DEEPFAKE-AUDIO repository...")
            shell.system("git clone https://github.com/Amey-Thakur/DEEPFAKE-AUDIO")

        # 2. Change Working Directory
        os.chdir("/content/DEEPFAKE-AUDIO")

        # 3. Pull Latest Changes (Ensure freshness)
        print("🔄 Synchronizing with remote repository...")
        shell.system("git pull")

        # 4. Install System Dependencies
        # libsndfile1 is crucial for reading/writing audio files via SoundFile/Librosa
        print("🔧 Installing system dependencies (libsndfile1)...")
        shell.system("apt-get install -y libsndfile1")

        # 5. Install Python Dependencies
        # Added 'gradio' for the alternative UI
        # Version specifiers are quoted so the shell does not treat '>' as output redirection
        print("📦 Installing Python libraries...")
        shell.system("pip install librosa==0.9.2 unidecode webrtcvad inflect umap-learn 'scikit-learn>=1.3' tqdm scipy 'matplotlib>=3.7' 'Pillow>=10.2' soundfile huggingface_hub gradio")

        print("✅ Environment setup complete. Ready for cloning.")
    else:
        print("🏠 Running in local or custom environment. Skipping cloud setup.")
except NameError:
    print("🏠 Running in local or custom environment. Skipping cloud setup.")

## 1️⃣ Model & Data Initialization

We prioritize data availability to ensure the notebook runs smoothly regardless of the platform. The system checks for checkpoints in this order:

1.  **Repository Local** (`Dataset/`): Fast local access if cloned.
2.  **Kaggle Dataset** (`/kaggle/input/deepfakeaudio/`): Pre-loaded environment data.
    *   *Reference*: [Amey Thakur's Kaggle Dataset](https://www.kaggle.com/datasets/ameythakur20/deepfakeaudio)
    *   *Kaggle Profile*: [ameythakur20](https://www.kaggle.com/ameythakur20)
3.  **HuggingFace Auto-Download**: Robust fallback for fresh environments.
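
The priority order above amounts to walking a list of candidate directories and taking the first usable one. Below is a minimal sketch of that idea using temporary directories in place of the real locations; `first_complete_root` is a hypothetical helper, and it requires all three files to be present in a single directory, which is stricter than resolving each checkpoint on its own.

```python
from pathlib import Path
import tempfile

CORE_MODELS = ("encoder.pt", "synthesizer.pt", "vocoder.pt")


def first_complete_root(roots, names=CORE_MODELS):
    """Return the first directory that contains every file in names, or None."""
    for root in roots:
        root = Path(root)
        if all((root / name).is_file() for name in names):
            return root
    return None


# Demonstration: temporary folders stand in for the real search locations.
with tempfile.TemporaryDirectory() as tmp:
    repo = Path(tmp) / "Dataset"         # priority 1: exists but is incomplete
    kaggle = Path(tmp) / "kaggle_input"  # priority 2: holds the complete set
    repo.mkdir()
    kaggle.mkdir()
    (repo / "encoder.pt").touch()
    for name in CORE_MODELS:
        (kaggle / name).touch()

    chosen = first_complete_root([repo, kaggle])
    chosen_name = chosen.name
    missing = first_complete_root([repo])
    print(chosen_name, missing)
```

When the helper returns `None`, the caller would move on to the download fallback described in item 3.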

In [None]:
import sys
import os
from pathlib import Path
import zipfile
import shutil

# Register 'Source Code' to Python path for module imports
source_path = os.path.abspath("Source Code")
if source_path not in sys.path:
    sys.path.append(source_path)

print(f"üìÇ Working Directory: {os.getcwd()}")
print(f"‚úÖ Module Path Registered: {source_path}")

# Define paths for model checkpoints
extract_path = "pretrained_models"
zip_path = "Dataset/pretrained.zip"

if not os.path.exists(extract_path):
    os.makedirs(extract_path)

# --- üß† Checkpoint Verification Strategy ---
print("‚¨áÔ∏è Verifying Model Availability...")

# Priority 1: Check Local Repository 'Dataset/' folder
core_models = ["encoder.pt", "synthesizer.pt", "vocoder.pt"]
dataset_models_present = all([os.path.exists(os.path.join("Dataset", m)) for m in core_models])

if dataset_models_present:
     print("‚úÖ Found high-priority local models in 'Dataset/'. verified.")
else:
    print("‚ö†Ô∏è Models not found in 'Dataset/'. Attempting fallback strategies...")
    
    # Priority 3 (Fallback): Auto-download from HuggingFace via utils script
    try:
        from utils.default_models import ensure_default_models
        ensure_default_models(Path("pretrained_models"))
        print("‚úÖ Models successfully acquired via HuggingFace fallback.")
    except Exception as e:
        print(f"‚ö†Ô∏è Critical: Could not auto-download models. Error: {e}")

## 2️⃣ Architecture Loading

We now initialize the three distinct neural networks that comprise the SV2TTS framework. Please ensure you are running on a **GPU Runtime** (e.g., T4 on Colab) for optimal performance.

In [None]:
from encoder import inference as encoder
from synthesizer.inference import Synthesizer
from vocoder import inference as vocoder
import numpy as np
import torch
from pathlib import Path

# Hardware Acceleration Check
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"üéØ Computation Device: {device}")

def resolve_checkpoint(component_name, legacy_path_suffix):
    """
    Intelligently resolves the path to model checkpoints based on priority.
    1. Repository /Dataset/ folder.
    2. Kaggle Input directory.
    3. Auto-downloaded 'pretrained_models'.
    """
    
    # 1. Repository Local (Dataset/)
    dataset_p = Path("Dataset") / f"{component_name.lower()}.pt"
    if dataset_p.exists():
        print(f"🟢 Loading {component_name} from Repository: {dataset_p}")
        return dataset_p

    # 2. Kaggle Environment
    kaggle_p = Path("/kaggle/input/deepfakeaudio") / f"{component_name.lower()}.pt"
    if kaggle_p.exists():
        print(f"🟢 Loading {component_name} from Kaggle Input: {kaggle_p}")
        return kaggle_p

    # 3. Default / Auto-Downloaded
    default_p = Path("pretrained_models/default") / f"{component_name.lower()}.pt"
    if default_p.exists():
        print(f"🟢 Loading {component_name} from Auto-Download: {default_p}")
        return default_p

    # 4. Legacy/Manual Paths
    legacy_p = Path("pretrained_models") / legacy_path_suffix
    if legacy_p.exists():
        if legacy_p.is_dir():
            # Prefer a checkpoint directly inside the folder, then search subfolders
            pts = [f for f in legacy_p.glob("*.pt") if f.is_file()]
            if pts:
                return pts[0]
            pts_rec = [f for f in legacy_p.rglob("*.pt") if f.is_file()]
            if pts_rec:
                return pts_rec[0]
        return legacy_p

    print(f'⚠️ Warning: Checkpoint for {component_name} not found!')
    return None

print("⏳ Initializing Neural Networks...")

try:
    # 1. Encoder: Encodes the voice's unique characteristics into an embedding
    encoder_path = resolve_checkpoint("Encoder", "encoder/saved_models")
    encoder.load_model(encoder_path)

    # 2. Synthesizer: Generates spectrograms from text
    synth_path = resolve_checkpoint("Synthesizer", "synthesizer/saved_models/logs-pretrained/taco_pretrained")
    synthesizer = Synthesizer(synth_path)

    # 3. Vocoder: Converts spectrograms to audio waveforms
    vocoder_path = resolve_checkpoint("Vocoder", "vocoder/saved_models/pretrained")
    vocoder.load_model(vocoder_path)

    print("✅ All models loaded successfully. The pipeline is ready.")
except Exception as e:
    print(f"❌ Initialization Error: {e}")

## 3️⃣ Inference Interface

The cell below launches the **Deepfake Audio Studio**, a Gradio interface. Select or upload a reference voice, type the text to speak, and click **Generate Voice Clone** to synthesize the audio.
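
After each run, the interface's status line reports a real-time factor (RTF): processing time divided by the duration of the generated audio, so values below 1 mean synthesis ran faster than playback. The generated waveform is also peak-normalized to 0.98 of full scale. The NumPy sketch below illustrates both calculations; the helper names and the 16 kHz sample rate are assumptions made for illustration and are not part of the notebook's API.

```python
import numpy as np


def peak_normalize(wav, peak=0.98):
    """Scale wav so that its largest absolute sample equals peak."""
    max_abs = np.max(np.abs(wav))
    if max_abs == 0:
        return wav  # silent input: nothing to scale
    return wav / max_abs * peak


def real_time_factor(elapsed_seconds, n_samples, sample_rate):
    """Processing time divided by the duration of the audio produced."""
    return elapsed_seconds / (n_samples / sample_rate)


sample_rate = 16000  # assumed for this sketch
t = np.arange(2 * sample_rate) / sample_rate
wav = 0.3 * np.sin(2 * np.pi * 220 * t)  # two-second test tone
normalized = peak_normalize(wav)
rtf = real_time_factor(elapsed_seconds=1.0, n_samples=len(normalized), sample_rate=sample_rate)
print(f"peak={np.max(np.abs(normalized)):.2f}, RTF={rtf:.2f}x")
```

Scaling to slightly below full scale leaves a small margin against clipping when the audio is later converted or played back.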

In [None]:
import gradio as gr
import librosa
import numpy as np
import time
import base64
import os

# --- 📂 Assets & Configuration ---
sample_roots = [
    "Source Code/samples",
    "Dataset/samples",
    "d:/GitHub/DEEPFAKE-AUDIO/Source Code/samples",
    "d:/GitHub/DEEPFAKE-AUDIO/Dataset/samples",
    "DEEPFAKE-AUDIO/Source Code/samples",
    "DEEPFAKE-AUDIO/Dataset/samples",
    "/kaggle/input/deepfakeaudio/samples"
]
samples_dir = None
for d in sample_roots:
    if os.path.exists(d):
        files = [f for f in os.listdir(d) if f.endswith((".wav", ".mp3"))]
        if len(files) > 0:
            samples_dir = d
            break

REFERENCE_SAMPLES = {}
DEFAULT_CHOICE = "Custom Upload"

if samples_dir:
    preset_files = sorted([f for f in os.listdir(samples_dir) if f.endswith((".wav", ".mp3"))])
    # Prioritize Key Samples
    priority = ["Steve Jobs.wav", "Donald Trump.wav"]
    for p in reversed(priority):
        if p in preset_files:
            preset_files.insert(0, preset_files.pop(preset_files.index(p)))
    
    for f in preset_files:
        name = os.path.splitext(f)[0]
        REFERENCE_SAMPLES[name] = os.path.join(samples_dir, f)
    
    # Set User Preference: Donald Trump as default
    if "Donald Trump" in REFERENCE_SAMPLES:
        DEFAULT_CHOICE = "Donald Trump"

# --- 🎨 PREVIEW FAVICON (Base64 Injection) ---
# Neon mic icon, tiled as the page background pattern
FAVICON_B64 = "iVBORw0KGgoAAAANSUhEUgAAAJAAAACQCAYAAADnDHb+AAA..."  # Base64 PNG data; truncated in this listing
NEON_MIC_ICON = f"data:image/png;base64,{FAVICON_B64}"

# --- 🧠 Synthesis Pipeline ---
def run_synthesis(text, audio_input, progress=gr.Progress()):
    if not audio_input or not text.strip():
        return None, "❌ Error: Please provide both a reference voice and text."
    
    try:
        start_time = time.time()
        
        # 1. Load & Encode
        progress(0.2, desc="Extracting Voice Identity")
        original_wav, sampling_rate = librosa.load(audio_input, sr=None)
        preprocessed_wav = encoder.preprocess_wav(original_wav, sampling_rate)
        embed = encoder.embed_utterance(preprocessed_wav)
        
        # 2. Synthesize Spectrogram
        progress(0.5, desc="Synthesizing Speech")
        specs = synthesizer.synthesize_spectrograms([text], [embed])
        spec = specs[0]
        
        # 3. Vocode to Waveform
        progress(0.8, desc="Generating High-Fidelity Audio")
        generated_wav = vocoder.infer_waveform(spec)
        
        # Post-Processing
        generated_wav = librosa.util.normalize(generated_wav) * 0.98
        
        duration = len(generated_wav) / synthesizer.sample_rate
        rtf = (time.time() - start_time) / duration
        
        progress(1.0, desc="Finalizing")
        return (synthesizer.sample_rate, generated_wav), f"✅ Synthesis Complete. (RTF: {rtf:.2f}x)"

    except Exception as e:
        return None, f"❌ Execution Error: {str(e)}"

# --- 💅 Gradio Custom Styling ---
custom_css = """
@import url('https://fonts.googleapis.com/css2?family=Play:wght@400;700&display=swap');
* { font-family: 'Play', sans-serif !important; }
body { background-color: #0a192f !important; color: #ccd6f6 !important; }
body::before {
    content: ""; position: fixed; top: 0; left: 0; width: 100%; height: 100%;
    background-image: url('""" + NEON_MIC_ICON + """') !important;
    background-repeat: repeat !important; background-size: 60px !important;
    opacity: 0.05 !important; pointer-events: none; z-index: 0;
}
.studio-card { background: #112240 !important; border: 1px solid #233554 !important; border-radius: 12px !important; padding: 15px !important; margin-bottom: 10px !important; }
.card-title { color: #ff8c00; font-weight: 800; font-size: 1.1rem; border-bottom: 1px solid #233554; padding-bottom: 5px; margin-bottom: 10px; }
#voice-deck { max-height: 200px !important; overflow-y: auto !important; border-radius: 8px !important; }
.btn-primary { background: #ff8c00 !important; color: #0a192f !important; font-weight: 800 !important; border: none !important; height: 50px !important; }
.btn-secondary { background: transparent !important; color: #8892b0 !important; border: 1px solid #233554 !important; height: 50px !important; }
.footer { text-align: center; margin-top: 40px; padding-top: 20px; border-top: 1px solid #233554; font-size: 0.8rem; color: #8892b0; }
.footer a { color: #ff8c00; text-decoration: none; }
"""

theme = gr.themes.Default(primary_hue="orange", secondary_hue="slate").set(
    body_background_fill="#0a192f", block_background_fill="#112240",
    input_background_fill="#0a192f", input_border_color="#233554",
)

with gr.Blocks(title="Deepfake Audio Studio", theme=theme, css=custom_css) as demo:
    with gr.Column(elem_id="main-container", scale=1, min_width=800):
        # Header
        gr.HTML("""
        <div style='text-align: center; margin-bottom: 30px;'>
            <h1 style='color: #ff8c00; font-size: 2.5rem; margin-bottom: 0px;'>🎙️ Deepfake Audio</h1>
            <p style='color: #8892b0; font-size: 1rem;'>Neural cloning studio powered by SV2TTS technology.</p>
        </div>
        """)
        
        # 2x2 Grid Layout
        with gr.Row():
            # 01. Voice Reference
            with gr.Column(elem_classes=["studio-card"]):
                gr.HTML("<div class='card-title'>01. Voice Reference</div>")
                preset_radio = gr.Radio(
                    choices=["Custom Upload"] + list(REFERENCE_SAMPLES.keys()),
                    value=DEFAULT_CHOICE, label="Voice Selection", show_label=False, elem_id="voice-deck"
                )
                audio_input = gr.Audio(type="filepath", value=REFERENCE_SAMPLES.get(DEFAULT_CHOICE) if DEFAULT_CHOICE != "Custom Upload" else None, label="Reference Voice", container=False)
                
            # 02. Synthesis Output
            with gr.Column(elem_classes=["studio-card"]):
                gr.HTML("<div class='card-title'>02. Synthesis Output</div>")
                audio_output = gr.Audio(label="Generated Result", interactive=False, container=False)

        with gr.Row():
            # 03. Target Script
            with gr.Column(elem_classes=["studio-card"]):
                gr.HTML("<div class='card-title'>03. Target Script</div>")
                text_input = gr.Textbox(
                    label="Input Text", placeholder="Enter text to clone...", lines=5, show_label=False
                )
                
            # 04. System Status
            with gr.Column(elem_classes=["studio-card"]):
                gr.HTML("<div class='card-title'>04. System Status</div>")
                status_info = gr.Textbox(value="Ready.", label="Status", interactive=False, show_label=False)
                run_btn = gr.Button("Generate Voice Clone", variant="primary", elem_classes=["btn-primary"])
                reset_btn = gr.Button("Reset Interface", variant="secondary", elem_classes=["btn-secondary"])

        # Branded Footer
        gr.HTML("""
        <div class='footer'>
            <p>Created by <a href='https://github.com/Amey-Thakur' target='_blank'>Amey Thakur</a> & <a href='https://github.com/msatmod' target='_blank'>Mega Satish</a></p>
            <p><a href='https://github.com/Amey-Thakur/DEEPFAKE-AUDIO' target='_blank'>GitHub Repository</a> | <a href='https://youtu.be/i3wnBcbHDbs' target='_blank'>YouTube Demo</a></p>
            <p style='opacity: 0.6;'>© 2021 Deepfake Audio Studio</p>
        </div>
        """)

    # --- Events ---
    def on_preset_change(choice):
        if choice == "Custom Upload": return None
        return REFERENCE_SAMPLES.get(choice)
    
    preset_radio.change(fn=on_preset_change, inputs=[preset_radio], outputs=[audio_input])
    
    run_btn.click(
        fn=run_synthesis, inputs=[text_input, audio_input], outputs=[audio_output, status_info]
    )
    
    reset_btn.click(
        lambda: (DEFAULT_CHOICE, REFERENCE_SAMPLES.get(DEFAULT_CHOICE) if DEFAULT_CHOICE != "Custom Upload" else None, None, "", "Ready."),
        outputs=[preset_radio, audio_input, audio_output, text_input, status_info]
    )

print("üöÄ Launching Deepfake Audio Studio...")
demo.launch(share=True, debug=False)