<h1 align="center">🎙️ Deepfake Audio</h1>
<h3 align="center"><i>A neural voice cloning studio powered by SV2TTS technology</i></h3>

<div align="center">

| **Author** | **Profiles** |
|:---:|:---|
| **Amey Thakur** | [![GitHub](https://img.shields.io/badge/GitHub-Amey--Thakur-181717?logo=github)](https://github.com/Amey-Thakur) [![ORCID](https://img.shields.io/badge/ORCID-0000--0001--5644--1575-A6CE39?logo=orcid)](https://orcid.org/0000-0001-5644-1575) [![Google Scholar](https://img.shields.io/badge/Google_Scholar-Amey_Thakur-4285F4?logo=google-scholar&logoColor=white)](https://scholar.google.ca/citations?user=0inooPgAAAAJ&hl=en) [![Kaggle](https://img.shields.io/badge/Kaggle-Amey_Thakur-20BEFF?logo=kaggle)](https://www.kaggle.com/ameythakur20) |
| **Mega Satish** | [![GitHub](https://img.shields.io/badge/GitHub-msatmod-181717?logo=github)](https://github.com/msatmod) [![ORCID](https://img.shields.io/badge/ORCID-0000--0002--1844--9557-A6CE39?logo=orcid)](https://orcid.org/0000-0002-1844-9557) [![Google Scholar](https://img.shields.io/badge/Google_Scholar-Mega_Satish-4285F4?logo=google-scholar&logoColor=white)](https://scholar.google.ca/citations?user=7Ajrr6EAAAAJ&hl=en) [![Kaggle](https://img.shields.io/badge/Kaggle-Mega_Satish-20BEFF?logo=kaggle)](https://www.kaggle.com/megasatish) |

---

**Attribution:** This project builds upon the foundational work of [CorentinJ/Real-Time-Voice-Cloning](https://github.com/CorentinJ/Real-Time-Voice-Cloning).

🚀 **Live Demo:** [Hugging Face Space](https://huggingface.co/spaces/ameythakur/Deepfake-Audio) | 🎥 **Video Demo:** [YouTube](https://youtu.be/i3wnBcbHDbs) | 💻 **Repository:** [GitHub](https://github.com/Amey-Thakur/DEEPFAKE-AUDIO)

<a href="https://youtu.be/i3wnBcbHDbs">
  <img src="https://img.youtube.com/vi/i3wnBcbHDbs/0.jpg" alt="Video Demo" width="60%">
</a>

</div>

## 📖 Introduction

> **An audio deepfake is when a “cloned” voice that is potentially indistinguishable from the real person’s is used to produce synthetic audio.**

This research notebook demonstrates the **SV2TTS (Speaker Verification to Text-to-Speech)** framework, a three-stage deep learning pipeline that can clone a voice from about 5 seconds of reference audio.

### The Pipeline Components
1.  **Speaker Encoder**: A Recurrent Neural Network (RNN) that condenses the *timbre* and *prosody* of the reference audio into a fixed-length vector (embedding).
2.  **Synthesizer**: A Tacotron-2 based implementation that takes text and the speaker embedding and generates a time-frequency representation of speech (Mel Spectrogram).
3.  **Vocoder**: A WaveRNN network that iteratively generates the raw audio waveform from the Mel Spectrogram, sample by sample.
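
To make the hand-off between the three stages concrete, the sketch below tracks only the *shapes* of the data that move through the pipeline. The numbers (a 256-dimensional embedding, 80 mel channels, a 16 kHz sample rate, and a 200-sample hop) are assumptions based on the defaults of the upstream Real-Time-Voice-Cloning project and may differ in this repository. The names `PipelineShapes` and `pipeline_shapes` are hypothetical.

```python
import math
from dataclasses import dataclass

# Assumed defaults (upstream Real-Time-Voice-Cloning); verify against this repo's hparams.
EMBEDDING_SIZE = 256   # length of the speaker embedding
NUM_MELS = 80          # mel channels per spectrogram frame
SAMPLE_RATE = 16_000   # audio samples per second
HOP_SIZE = 200         # audio samples covered by one mel frame


@dataclass
class PipelineShapes:
    embedding: tuple   # output of the speaker encoder
    mel: tuple         # output of the synthesizer
    waveform: tuple    # output of the vocoder


def pipeline_shapes(output_seconds: float) -> PipelineShapes:
    """Shapes produced by each stage for `output_seconds` of synthesized speech."""
    n_frames = math.ceil(output_seconds * SAMPLE_RATE / HOP_SIZE)
    return PipelineShapes(
        embedding=(EMBEDDING_SIZE,),      # fixed size, whatever the reference length
        mel=(NUM_MELS, n_frames),         # grows with the length of the spoken text
        waveform=(n_frames * HOP_SIZE,),  # one hop of samples per mel frame
    )


shapes = pipeline_shapes(output_seconds=2.0)
print(shapes)
```

The embedding keeps the same size however long the reference clip is, which is what lets the synthesizer condition on any speaker.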

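## ☁️ Cloud Environment Setup
Run the following cell when working in a cloud environment such as **Google Colab** or **Kaggle**. The automated setup runs only when **Google Colab** is detected; in any other environment the cell prints a message and makes no changes.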

This script will:
1.  Clone the [DEEPFAKE-AUDIO repository](https://github.com/Amey-Thakur/DEEPFAKE-AUDIO).
2.  Install system-level dependencies (e.g., `libsndfile1` for audio processing).
3.  Install Python libraries required for signal processing and deep learning.

In [None]:
import os
import sys

# Detect Google Colab (any other environment, including Kaggle, skips this setup)
try:
    shell = get_ipython()
    if 'google.colab' in str(shell):
        print("💻 Detected Google Colab Environment. Initiating setup...")

        # 1. Clone the Repository
        # An absolute path keeps a re-run of this cell from cloning a nested copy.
        repo_dir = "/content/DEEPFAKE-AUDIO"
        if not os.path.exists(repo_dir):
            print("⬇️ Cloning DEEPFAKE-AUDIO repository...")
            shell.system(f"git clone https://github.com/Amey-Thakur/DEEPFAKE-AUDIO {repo_dir}")

        # 2. Change Working Directory
        os.chdir(repo_dir)

        # 3. Pull Latest Changes (Ensure freshness)
        print("🔄 Synchronizing with remote repository...")
        shell.system("git pull")

        # 4. Install System Dependencies
        # libsndfile1 is crucial for reading/writing audio files via SoundFile/Librosa
        print("🔧 Installing system dependencies (libsndfile1)...")
        shell.system("apt-get install -y libsndfile1")

        # 5. Install Python Dependencies
        # Version specifiers are quoted so the shell does not treat '>' as output redirection.
        print("📦 Installing Python libraries...")
        shell.system("pip install librosa==0.9.2 unidecode webrtcvad inflect umap-learn 'scikit-learn>=1.3' tqdm scipy 'matplotlib>=3.7' 'Pillow>=10.2' soundfile huggingface_hub")

        print("✅ Environment setup complete. Ready for cloning.")
    else:
        print("🏠 Running in local or custom environment. Skipping cloud setup.")
except NameError:
    print("🏠 Running in local or custom environment. Skipping cloud setup.")
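
The cell above acts only when it detects Google Colab. A minimal sketch of a detector that also recognizes Kaggle is shown below. It assumes that Colab has `google.colab` loaded in `sys.modules` and that Kaggle kernels export the `KAGGLE_KERNEL_RUN_TYPE` environment variable. The function `detect_environment` and the `REPO_PARENT` mapping are hypothetical and not part of this repository.

```python
import os
import sys


def detect_environment(modules=None, environ=None):
    """Return 'colab', 'kaggle', or 'local' based on simple runtime hints."""
    modules = sys.modules if modules is None else modules
    environ = os.environ if environ is None else environ
    if "google.colab" in modules:
        return "colab"
    # Assumption: Kaggle kernels export KAGGLE_KERNEL_RUN_TYPE.
    if "KAGGLE_KERNEL_RUN_TYPE" in environ:
        return "kaggle"
    return "local"


# Hypothetical mapping from environment to the directory the repo is cloned into.
REPO_PARENT = {"colab": "/content", "kaggle": "/kaggle/working"}

env = detect_environment()
print(env, REPO_PARENT.get(env, os.getcwd()))
```

Passing `modules` and `environ` explicitly makes the detector easy to exercise outside a notebook.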

## 1️⃣ Model & Data Initialization

We prioritize data availability to ensure the notebook runs smoothly regardless of the platform. The system checks for checkpoints in this order:

1.  **Repository Local** (`Dataset/`): Fast local access if cloned.
2.  **Kaggle Dataset** (`/kaggle/input/deepfakeaudio/`): Pre-loaded environment data.
    *   *Reference*: [Amey Thakur's Kaggle Dataset](https://www.kaggle.com/datasets/ameythakur20/deepfakeaudio)
    *   *Kaggle Profile*: [ameythakur20](https://www.kaggle.com/ameythakur20)
3.  **HuggingFace Auto-Download**: Robust fallback for fresh environments.

In [None]:
import sys
import os
from pathlib import Path
import zipfile
import shutil

# Register 'Source Code' to Python path for module imports
source_path = os.path.abspath("Source Code")
if source_path not in sys.path:
    sys.path.append(source_path)

print(f"📂 Working Directory: {os.getcwd()}")
print(f"✅ Module Path Registered: {source_path}")

# Define paths for model checkpoints
extract_path = "pretrained_models"
zip_path = "Dataset/pretrained.zip"

os.makedirs(extract_path, exist_ok=True)

# --- 🧠 Checkpoint Verification Strategy ---
print("⬇️ Verifying Model Availability...")

core_models = ["encoder.pt", "synthesizer.pt", "vocoder.pt"]

def models_present(folder):
    """Return True if every core checkpoint file exists in `folder`."""
    return all(os.path.exists(os.path.join(folder, m)) for m in core_models)

# Priority 1: Check Local Repository 'Dataset/' folder
if models_present("Dataset"):
    print("✅ Found high-priority local models in 'Dataset/'. Verified.")
# Priority 2: Check the Kaggle dataset mount
elif models_present("/kaggle/input/deepfakeaudio"):
    print("✅ Found models in the Kaggle dataset. Verified.")
else:
    print("⚠️ Models not found locally or on Kaggle. Attempting fallback strategies...")

    # Priority 3 (Fallback): Auto-download from HuggingFace via utils script
    try:
        from utils.default_models import ensure_default_models
        ensure_default_models(Path(extract_path))
        print("✅ Models successfully acquired via HuggingFace fallback.")
    except Exception as e:
        print(f"⚠️ Critical: Could not auto-download models. Error: {e}")
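
The priority order described in this section can be sketched as a single "first complete directory wins" lookup. This is an illustrative sketch, not the notebook's own logic: `first_complete_dir` is a hypothetical helper, and the third entry assumes the auto-download places files under `pretrained_models/default`.

```python
from pathlib import Path
from typing import Iterable, Optional

CORE_MODELS = ("encoder.pt", "synthesizer.pt", "vocoder.pt")


def first_complete_dir(candidates: Iterable[Path]) -> Optional[Path]:
    """Return the first directory that contains every file in CORE_MODELS."""
    for folder in candidates:
        if all((folder / name).is_file() for name in CORE_MODELS):
            return folder
    return None


# Candidate locations, highest priority first.
search_order = [
    Path("Dataset"),
    Path("/kaggle/input/deepfakeaudio"),
    Path("pretrained_models/default"),
]
found = first_complete_dir(search_order)
print(found if found else "No complete checkpoint set found; a download would be needed.")
```

Requiring all three files in one place avoids mixing checkpoints from different sources.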

## 2️⃣ Architecture Loading

We now initialize the three distinct neural networks that comprise the SV2TTS framework. Please ensure you are running on a **GPU Runtime** (e.g., T4 on Colab) for optimal performance.

In [None]:
from encoder import inference as encoder
from synthesizer.inference import Synthesizer
from vocoder import inference as vocoder
import numpy as np
import torch
from pathlib import Path

# Hardware Acceleration Check
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"🎯 Computation Device: {device}")

def resolve_checkpoint(component_name, legacy_path_suffix):
    """
    Resolves the path to a model checkpoint, trying locations in priority order:
    1. Repository 'Dataset/' folder.
    2. Kaggle Input directory.
    3. Auto-downloaded 'pretrained_models/default'.
    4. Legacy/manual paths under 'pretrained_models'.
    Returns None if no checkpoint is found.
    """
    filename = f"{component_name.lower()}.pt"

    # 1. Priority: Repository Local (Dataset/)
    dataset_p = Path("Dataset") / filename
    if dataset_p.exists():
        print(f"🟢 Loading {component_name} from Repository: {dataset_p}")
        return dataset_p

    # 2. Priority: Kaggle Environment
    kaggle_p = Path("/kaggle/input/deepfakeaudio") / filename
    if kaggle_p.exists():
        print(f"🟢 Loading {component_name} from Kaggle Input: {kaggle_p}")
        return kaggle_p

    # 3. Priority: Auto-Downloaded Fallback
    default_p = Path("pretrained_models/default") / filename
    if default_p.exists():
        print(f"🟢 Loading {component_name} from Auto-Download: {default_p}")
        return default_p

    # 4. Legacy/Manual Paths
    legacy_p = Path("pretrained_models") / legacy_path_suffix
    if legacy_p.exists():
        if legacy_p.is_dir():
            # Prefer a checkpoint directly inside the folder, then search subfolders.
            # Sorting makes the choice deterministic when several .pt files exist.
            pts = sorted(f for f in legacy_p.glob("*.pt") if f.is_file())
            if pts:
                return pts[0]
            pts_rec = sorted(f for f in legacy_p.rglob("*.pt") if f.is_file())
            if pts_rec:
                return pts_rec[0]
        return legacy_p

    print(f'⚠️ Warning: Checkpoint for {component_name} not found!')
    return None

print("⏳ Initializing Neural Networks...")

try:
    # 1. Encoder: Condenses the reference voice into a fixed-length speaker embedding
    encoder_path = resolve_checkpoint("Encoder", "encoder/saved_models")
    encoder.load_model(encoder_path)

    # 2. Synthesizer: Generates spectrograms from text
    synth_path = resolve_checkpoint("Synthesizer", "synthesizer/saved_models/logs-pretrained/taco_pretrained")
    synthesizer = Synthesizer(synth_path)

    # 3. Vocoder: Converts spectrograms to audio waveforms
    vocoder_path = resolve_checkpoint("Vocoder", "vocoder/saved_models/pretrained")
    vocoder.load_model(vocoder_path)

    print("✅ All models loaded successfully. The pipeline is ready.")
except Exception as e:
    print(f"❌ Initialization Error: {e}")
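
Because `resolve_checkpoint` returns `None` when nothing is found, the error that eventually surfaces comes from whichever loader receives that `None`. One possible way to fail earlier with a clearer message is sketched below. `require_checkpoint` is a hypothetical helper and not part of this repository.

```python
from pathlib import Path
from typing import Optional, Union


def require_checkpoint(path: Optional[Union[str, Path]], component: str) -> Path:
    """Return `path` as a Path, or raise a descriptive error if it is unusable."""
    if path is None:
        raise FileNotFoundError(f"No checkpoint was resolved for the {component}.")
    path = Path(path)
    if not path.exists():
        raise FileNotFoundError(f"{component} checkpoint does not exist: {path}")
    return path


try:
    require_checkpoint(None, "Encoder")
except FileNotFoundError as err:
    print(f"Initialization stopped early: {err}")
```

Wrapping each resolved path this way, for example `require_checkpoint(resolve_checkpoint(...), "Encoder")`, would name the missing component before any model code runs.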

## 3️⃣ Inference Interface

Run the full pipeline through the **Neural Voice Cloning Studio** widget below: select a preset reference voice or upload your own recording, enter the text to be spoken, and click the generate button to hear the cloned voice.

In [None]:
import ipywidgets as widgets
from IPython.display import display, Audio, clear_output
import librosa
import numpy as np
import time
import os

# --- 📂 Configuration ---
sample_roots = [
    "Source Code/samples",
    "Dataset/samples",
    "d:/GitHub/DEEPFAKE-AUDIO/Source Code/samples",
    "d:/GitHub/DEEPFAKE-AUDIO/Dataset/samples",
    "/kaggle/input/deepfakeaudio/samples"
]

samples_dir = None
for d in sample_roots:
    if os.path.exists(d):
        files = [f for f in os.listdir(d) if f.endswith((".wav", ".mp3"))]
        if len(files) > 0:
            samples_dir = d
            break

presets = {}
if samples_dir:
    files = sorted([f for f in os.listdir(samples_dir) if f.endswith((".wav", ".mp3"))])
    for f in files:
        presets[os.path.splitext(f)[0]] = os.path.join(samples_dir, f)

# --- 💅 Interface Components ---
style = {'description_width': 'initial'}

preset_dropdown = widgets.Dropdown(
    options=[('Custom Upload', None)] + sorted(list(presets.items())),
    # Falls back to None ('Custom Upload') when the default preset is missing
    value=presets.get("Donald Trump"),
    description='Reference Voice:',
    style=style
)

upload_widget = widgets.FileUpload(
    accept='.wav,.mp3',
    multiple=False,
    description='Upload Voice:',
    style=style
)

text_area = widgets.Textarea(
    value='Welcome to Deepfake Audio. This is a simple neural voice cloning demonstration.',
    placeholder='Enter text to synthesize...',
    description='Target Script:',
    layout={'width': '100%', 'height': '100px'},
    style=style
)

generate_btn = widgets.Button(
    description='🚀 Generate Voice Clone',
    button_style='primary',
    layout={'width': '100%', 'height': '50px'}
)

output_area = widgets.Output()

# --- 🧠 Logic ---
def run_synthesis(_):
    with output_area:
        clear_output()
        print("⏳ Processing...")

        try:
            # Get Reference Audio
            if upload_widget.value:
                # ipywidgets 7 exposes uploads as a dict keyed by filename;
                # ipywidgets 8 exposes them as a tuple of dicts.
                uploads = upload_widget.value
                uploaded_file = list(uploads.values())[0] if isinstance(uploads, dict) else uploads[0]
                # Save temp (bytes() also handles the memoryview used by ipywidgets 8)
                with open("temp_ref.wav", "wb") as f:
                    f.write(bytes(uploaded_file['content']))
                ref_path = "temp_ref.wav"
            else:
                ref_path = preset_dropdown.value

            if not ref_path:
                print("❌ Error: Please select a preset or upload a custom voice.")
                return

            script = text_area.value.strip()
            if not script:
                print("❌ Error: Please enter a target script.")
                return

            # 1. Encode Voice
            wav, sr = librosa.load(ref_path, sr=None)
            preprocessed_wav = encoder.preprocess_wav(wav, sr)
            embed = encoder.embed_utterance(preprocessed_wav)

            # 2. Synthesize Spectrogram
            specs = synthesizer.synthesize_spectrograms([script], [embed])

            # 3. Vocode Waveform
            generated_wav = vocoder.infer_waveform(specs[0])

            # Post-Process: peak-normalize, leaving a little headroom below full scale
            generated_wav = librosa.util.normalize(generated_wav) * 0.98

            clear_output()
            print("✅ Synthesis Complete!")
            display(Audio(generated_wav, rate=synthesizer.sample_rate, autoplay=True))

        except Exception as e:
            print(f"❌ Error: {e}")

generate_btn.on_click(run_synthesis)

# --- Layout ---
display(widgets.VBox([
    widgets.HTML("<h2>🎙️ Neural Voice Cloning Studio</h2>"),
    widgets.HBox([preset_dropdown, upload_widget]),
    text_area,
    generate_btn,
    output_area
]))
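
The post-processing step in `run_synthesis` scales the waveform so that its largest absolute sample sits at 0.98. The sketch below shows the same arithmetic on a plain Python list so that it runs without NumPy or librosa. `peak_normalize` is a hypothetical helper, and the all-zero guard is an illustrative choice rather than a statement about how librosa treats silence.

```python
def peak_normalize(samples, peak=0.98):
    """Scale `samples` so the largest absolute value equals `peak`."""
    max_abs = max((abs(s) for s in samples), default=0.0)
    if max_abs == 0.0:
        # Silence (or an empty signal) has nothing to scale.
        return list(samples)
    scale = peak / max_abs
    return [s * scale for s in samples]


quiet = [0.1, -0.5, 0.25]
print(peak_normalize(quiet))
```

Keeping the peak slightly below 1.0 leaves headroom against clipping on playback.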