# Project 6 - Voice Cloning and Fake Audio Detection (VCFAD)

**Introduction:**

A technology company in the cybersecurity industry, which builds systems that help individuals and organizations maintain a safe and secure digital presence, requires an algorithm that can synthesize spoken audio by converting one speaker’s voice into another speaker’s voice. The end goal is to detect whether any given spoken audio is pristine or fake.


**Data Description:**

Two datasets will be used in this project:
- **TIMIT Dataset:** The TIMIT corpus of read speech is designed to provide speech data for acoustic-phonetic studies and for the development and evaluation of automatic speech recognition systems. TIMIT contains a total of 6300 sentences, 10 sentences spoken by each of 630 speakers from 8 major dialect regions of the United States.

- **CommonVoice Dataset:** Common Voice is part of Mozilla's initiative to help teach machines how real people speak. It is a corpus of speech data read by users on the Common Voice website (https://commonvoice.mozilla.org/), based on text from a number of public domain sources such as user-submitted blog posts, old books, movies, and other public speech corpora. Its primary purpose is to enable the training and testing of automatic speech recognition (ASR) systems.


**Goal(s):**

- Build a machine learning system to detect whether spoken audio is synthetically generated.

    - First, build a voice cloning (VC) system that, given a source speaker’s spoken audio, converts it to a target speaker’s voice. Use the TIMIT dataset, as it consists of aligned text-audio data from various speakers.
    
    - Next, build a machine learning system that detects whether spoken audio is natural speech or synthetically generated by a machine. Use the CommonVoice dataset, as it contains thousands of naturally spoken audio clips that can serve as golden human speech (positive examples); create the negative examples with the voice cloning system, which acts as an automatic data/label generator. Since the CommonVoice English dataset is large, you can work with a subset by sampling it.
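The sampling step mentioned above could look like the minimal sketch below. The helper name `sample_subset`, the fixed seed, and the file names in the usage example are assumptions made for illustration; they are not part of the project brief or of the CommonVoice tooling.

```python
import random


def sample_subset(paths, k, seed=0):
    """Return a reproducible random subset of at most k items from paths.

    Hypothetical helper: sorting first makes the result independent of the
    order in which the file system lists the clips.
    """
    rng = random.Random(seed)  # local generator, leaves global random state alone
    ordered = sorted(paths)
    return rng.sample(ordered, min(k, len(ordered)))


# Usage with made-up clip names (assumption: clips are stored as individual files).
clips = [f"clip_{i:05d}.mp3" for i in range(1000)]
subset = sample_subset(clips, k=50, seed=42)
print(len(subset))
```

Fixing the seed keeps the positive examples stable across notebook runs, so the fake audio generated from them stays comparable.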


**Success Metrics:**


- Voice cloning (VC):
    - Use Word Error Rate (WER)
    - Report speaker classification accuracy
<br><br>
- Fake audio detection (FAD):
    - Use the F-score, with positive labels coming from the ground-truth dataset and negative labels generated by the VC system.

## Voice cloning system (VC)

In [1]:
import numpy as np
import os
import sys
from pathlib import Path
import glob

%load_ext autoreload
%autoreload 2

sys.path.append("Real_Time_Voice_Cloning")

from encoder import inference as encoder
from synthesizer.inference import Synthesizer
from Real_Time_Voice_Cloning.utils.default_models import ensure_default_models
from vocoder import inference as vocoder

import functions

  warn("Unable to import 'webrtcvad'. This package enables noise removal and is recommended.")


For the voice cloning system we use the TIMIT dataset, which contains a total of 6300 sentences: 10 sentences spoken by each of 630 speakers from 8 major dialect regions of the United States.

### Data preprocessing

In [14]:
!xxd -b ./data/TIMIT/TRAIN/DR1/FCJF0/SA1.WAV | head

00000000: 01001110 01001001 01010011 01010100 01011111 00110001  NIST_1
00000006: 01000001 00001010 00100000 00100000 00100000 00110001  A.   1
0000000c: 00110000 00110010 00110100 00001010 01100100 01100001  024.da
00000012: 01110100 01100001 01100010 01100001 01110011 01100101  tabase
00000018: 01011111 01101001 01100100 00100000 00101101 01110011  _id -s
0000001e: 00110101 00100000 01010100 01001001 01001101 01001001  5 TIMI
00000024: 01010100 00001010 01100100 01100001 01110100 01100001  T.data
0000002a: 01100010 01100001 01110011 01100101 01011111 01110110  base_v
00000030: 01100101 01110010 01110011 01101001 01101111 01101110  ersion
00000036: 00100000 00101101 01110011 00110011 00100000 00110001   -s3 1
xxd: Broken pipe


Since the files start with a NIST_1A header instead of a RIFF header, we have to convert them to a proper .wav format to be able to use the sound_recognition functions.
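The same header check can be done from Python before converting anything. The sketch below only inspects the first bytes of a file; the function name `audio_container` is an assumption for illustration and is not part of the `functions` module used in this notebook.

```python
def audio_container(path):
    """Return 'nist', 'riff', or 'unknown' based on the file's leading bytes."""
    with open(path, "rb") as fh:
        magic = fh.read(7)
    if magic.startswith(b"NIST_1A"):
        return "nist"   # NIST header, as seen in the xxd dump above
    if magic.startswith(b"RIFF"):
        return "riff"   # standard WAV container
    return "unknown"
```

For example, `audio_container("data/TIMIT/TRAIN/DR1/FCJF0/SA1.WAV")` would be expected to return `'nist'` for the file dumped above, which is a quick way to confirm which files still need conversion.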

In [45]:
ORIGINAL_AUDIOFILES_PATH = "data/TIMIT/TRAIN"
functions.convert_audiofiles(ORIGINAL_AUDIOFILES_PATH)

Extract .wav files from the original folder and organize them in a new one.

In [46]:
# ORIGINAL_WAV_PATH and FAKE_WAV_PATH must be defined before this cell runs
# (their definitions are not shown in this notebook).
os.makedirs(ORIGINAL_WAV_PATH, exist_ok=True)
os.makedirs(FAKE_WAV_PATH, exist_ok=True)

functions.organize_audio_files(root_path=ORIGINAL_AUDIOFILES_PATH, new_folder=ORIGINAL_WAV_PATH)

### Voice cloning process

I will be using the functions from the following GitHub repository:
`https://github.com/CorentinJ/Real-Time-Voice-Cloning.git`

In [21]:
functions.check_cuda()

Found 1 GPUs available. Using GPU 0 (NVIDIA GeForce RTX 3060 Laptop GPU) of compute capability 8.6 with 6.2Gb total memory.



In [32]:
## Load the models one by one.
ensure_default_models(Path("saved_models"));
encoder.load_model(Path("saved_models/default/encoder.pt"));
synthesizer = Synthesizer("saved_models/default/synthesizer.pt");
vocoder.load_model(Path("saved_models/default/vocoder.pt"));

Loaded encoder "encoder.pt" trained to step 1564501
Synthesizer using device: cuda
Building Wave-RNN
Trainable Parameters: 4.481M
Loading model weights at saved_models/default/vocoder.pt


In [None]:
original_voice_files = glob.glob(os.path.join(ORIGINAL_WAV_PATH, "*"))
n_files = len(original_voice_files)
for i, src_audio in enumerate(original_voice_files):
    try:
        # Pair each file with the next one (wrapping around at the end) as the sample audio.
        sample_audio = original_voice_files[(i + 1) % n_files]
        # os.path.basename is portable, unlike splitting the path on '/'.
        dst_audio = os.path.join(FAKE_WAV_PATH, "fake_" + os.path.basename(src_audio))
        functions.voice_to_voice(src_audio, sample_audio, dst_audio)

    except Exception as e:
        print("Caught exception: %s" % repr(e))

MISSING:
- use Word Error Rate (WER)
- report speaker classification accuracy
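As a starting point for the two missing metrics, here is a minimal sketch in plain Python. It assumes the transcripts of the cloned audio and the predicted speaker labels are already available from some ASR system and speaker classifier, which this notebook does not yet include. The function names are hypothetical.

```python
def word_error_rate(reference, hypothesis):
    """WER = word-level edit distance / number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must not be empty")
    # prev[j] holds the edit distance between the processed ref words and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def speaker_accuracy(target_speakers, predicted_speakers):
    """Fraction of cloned utterances classified as their intended target speaker."""
    if not target_speakers or len(target_speakers) != len(predicted_speakers):
        raise ValueError("label lists must be non-empty and of equal length")
    hits = sum(t == p for t, p in zip(target_speakers, predicted_speakers))
    return hits / len(target_speakers)
```

A low WER suggests the cloned audio preserves the spoken content, while a high speaker accuracy suggests it actually sounds like the target speaker; the two are reported together because either can be good while the other is poor.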

## Fake Audio Detection (FAD)

MISSING:
- Use the F-score, with positive labels coming from the ground-truth dataset and negative labels generated by the VC system.
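Until the detector itself exists, the metric can at least be pinned down. The sketch below computes the F-score from binary labels, assuming `1` marks genuine (ground-truth) audio and `0` marks audio produced by the VC system. The function name and the label encoding are assumptions for illustration.

```python
def f_score(y_true, y_pred, positive=1, beta=1.0):
    """F-beta score for the positive class (beta=1 gives the usual F1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Toy labels: 1 = genuine recording, 0 = cloned by the VC system (assumed encoding).
y_true = [1, 1, 1, 0, 0]
y_pred = [1, 1, 0, 1, 0]
print(f_score(y_true, y_pred))
```

Because the number of cloned clips can be chosen freely, the class balance is a design decision; reporting precision and recall alongside the F-score makes an imbalanced setup easier to interpret.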