# Part 3 - Custom Trainset Walkthrough
This notebook walks you through creating your own train set and using it to train a new model with AASIST.

In [None]:
# Setup

# Import the git repo and install required libraries
!git clone --branch Showcase https://github.com/Hapemo/Deepfake-Audio-Detection/
%cd Deepfake-Audio-Detection/
%pip install -r requirements.txt

# Prepare data folder and download speech samples
!mkdir data
%cd data
!gdown 1WevCjrJJ7pv9XzCwjbyJ1iNr2uoOYnZI
!gdown 1TE88aXpA5YvTP5KwjM_1SYs3_HinXjDj
!gdown 1JvitJbYdjojw5ORYv42XTA9iairs8_40
!gdown 1Vin3K95s8IVlFH8evu75ZYuOud07sLvU
!unzip showcase_samples.zip
!unzip SileroVAD_samples.zip
!unzip SPEAKER0001.zip
!unzip showcase_dev_samples.zip
%cd ..


## Preparing your trainset
Preparing a train set is much the same as preparing a test set. If you have not gone through Part 2 yet, it is highly recommended to review how the test set is prepared there. One additional point: the train set and the dev set must also each contain at least one spoof sample and one bonafide sample.
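
The "at least one spoof and one bonafide" requirement is easy to check before training starts. Below is a minimal sketch, assuming you can load the labels of a set into a Python list and that they are spelled "bonafide" and "spoof"; the helper name `check_both_classes` and the sample labels are hypothetical and not part of the codebase.

```python
from collections import Counter


def check_both_classes(labels, set_name="train"):
    """Raise ValueError unless labels has at least one spoof and one bonafide."""
    counts = Counter(labels)
    # Assumed label spellings; adjust them to match your own label file.
    missing = [c for c in ("bonafide", "spoof") if counts[c] == 0]
    if missing:
        raise ValueError(f"{set_name} set is missing class(es): {', '.join(missing)}")
    return counts


# Hypothetical labels; in practice these come from your own label file.
train_labels = ["bonafide", "spoof", "spoof", "bonafide"]
counts = check_both_classes(train_labels, "train")
print(dict(counts))
```

Run the same check on the dev set labels, since both sets need both classes.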

## Prepare your config file for dataset
The config file dictates which pretrained model, model parameters, hyperparameters, and dataset to use. If you have not gone through Part 2's config file instructions yet, it is highly recommended to review them, as the details of data preparation are discussed there. This section only covers some of the finetuning options you can play around with.

Here are a few notable parameters you can change in the config file:
1. batch_size
   Batch size determines how many speech samples are loaded in a single step during training and evaluation. Note that a larger batch consumes more GPU memory, since more samples are sent to the GPU at once; set it too high and you will hit a CUDA out-of-memory error.
2. num_epochs
   This indicates the number of epochs to run the training for. The larger it is, the longer the training takes. The codebase has built-in best-epoch saving, which saves the model whenever the training and development EER is at its minimum, so you don't have to worry too much about overtraining when num_epochs is large.
3. model_config
   This customization requires in-depth knowledge of the AASIST model, as it modifies the parameters of the layers. If you want to learn more about the AASIST model, you can read [this paper](https://arxiv.org/abs/2110.01200).
   
   One parameter to take note of is "nb_samp". It dictates the length of speech, in samples, that the model can access. The default sample rate of the audio used is 16000 Hz, so if most of your speech samples are at 16000 Hz and you want the model to see 4 seconds of each, nb_samp should be 4*16000 = 64000. If the audio is too short, the backend fills the empty portion with repeats from the beginning of the audio. If the audio is too long, it is simply cut short. Don't set this too high either: more data is sent to the GPU at once, and too much will cause a CUDA out-of-memory error.

4. optim_config
   This customizes the optimizer parameters for training. By default, the learning rate decreases during training in an attempt to reach the minimum EER. One easy customization is "base_lr", which sets the starting learning rate. Another is "lr_min", which sets the lowest value the learning rate can fall to during training.
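
The "nb_samp" behaviour described in item 3 (repeat short audio from the beginning, cut long audio short) can be sketched as follows. This is an illustration under assumptions, not the codebase's actual implementation: the function name `fit_to_length` is hypothetical, and plain Python lists are used to keep the example dependency-free, whereas real audio would normally be a NumPy array.

```python
def fit_to_length(samples, nb_samp=64000):
    """Return exactly nb_samp samples: truncate long audio, repeat short audio."""
    if len(samples) == 0:
        raise ValueError("cannot pad empty audio")
    if len(samples) >= nb_samp:
        # Too long: keep only the first nb_samp samples.
        return list(samples[:nb_samp])
    # Too short: repeat from the beginning until nb_samp is reached.
    repeats = nb_samp // len(samples) + 1
    return (list(samples) * repeats)[:nb_samp]


short = fit_to_length([1, 2, 3], nb_samp=8)
long_ = fit_to_length(list(range(10)), nb_samp=4)
print(short)  # [1, 2, 3, 1, 2, 3, 1, 2]
print(long_)  # [0, 1, 2, 3]
```

Every sample ends up with the same length, which is why a larger nb_samp increases GPU memory use for every item in the batch.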

Play around with these values and see which one works best for you!
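
To get a feel for how "base_lr" and "lr_min" interact, here is a small sketch of a decaying learning rate. The text above only says the learning rate decreases from "base_lr" toward "lr_min"; the cosine shape used here is an assumption (the upstream AASIST repository uses a cosine scheduler), and the function name `cosine_lr` and the numeric values are purely illustrative.

```python
import math


def cosine_lr(step, total_steps, base_lr, lr_min):
    """Cosine decay from base_lr at step 0 down to lr_min at total_steps."""
    progress = step / total_steps
    return lr_min + 0.5 * (base_lr - lr_min) * (1 + math.cos(math.pi * progress))


# Illustrative values only; read the real ones from your config file.
base_lr, lr_min = 1e-4, 5e-6
for step in (0, 50, 100):
    print(step, cosine_lr(step, 100, base_lr, lr_min))
```

Under this assumed schedule, raising "base_lr" makes the early steps more aggressive, while "lr_min" controls how small the updates become near the end of training.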

Once you have played around with a few values, you might find it troublesome to change the parameters manually for each test. You can explore how to change the config file values with Python and rerun the tests, or even change the codebase. Have fun!
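
One way to change config values with Python is sketched below. It assumes the .conf file is plain JSON, as in the upstream AASIST repository; check your own file before relying on this. The helper `write_config_variant`, the file names, and the demo values are all hypothetical.

```python
import json
import os
import tempfile


def write_config_variant(src_path, dst_path, overrides):
    """Copy a JSON config, applying dotted-key overrides like 'optim_config.base_lr'."""
    with open(src_path) as f:
        config = json.load(f)
    for dotted_key, value in overrides.items():
        *parents, leaf = dotted_key.split(".")
        node = config
        for key in parents:
            node = node[key]  # a KeyError here means the section is missing
        if leaf not in node:
            raise KeyError(f"unknown config key: {dotted_key}")
        node[leaf] = value
    with open(dst_path, "w") as f:
        json.dump(config, f, indent=4)
    return config


# Demo on a throwaway file; the keys mirror the parameters discussed above.
demo = {"batch_size": 24, "num_epochs": 100,
        "optim_config": {"base_lr": 0.0001, "lr_min": 0.000005}}
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "base.conf")
    dst = os.path.join(tmp, "variant.conf")
    with open(src, "w") as f:
        json.dump(demo, f)
    variant = write_config_variant(
        src, dst, {"num_epochs": 5, "optim_config.base_lr": 0.001})
print(variant["num_epochs"], variant["optim_config"]["base_lr"])
```

Writing each variant to a new file keeps the original config untouched; you can then pass the new path to the `--config` argument of `pyfiles/main.py`.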






## Running training
This sample training uses the "showcase_samples" dataset for training and the "showcase_dev_samples" dataset for development, and finally performs evaluation on the "eval_folders" dataset. You can explore the parameters in the config file and play around with them to find out which values suit the model best.

showcase_dev_samples comes from the same domain as showcase_samples: they are different subsets drawn from the same pool of data, and the spoof audio in both was generated with the same Voice Conversion model.

With showcase_samples containing 1,432 speech samples, training takes about 1 minute per epoch on a T4 GPU, so expect some waiting time if your num_epochs is large. For the experiment below, num_epochs is reduced to 5 to save time.

In [None]:
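import os
# Launch the run described above using the showcase training config
%run "pyfiles/main.py" --eval --config "./config/ShowcaseTrain.conf" 
# Remove the segment info file from the data folder once the run finishes
os.remove("data/segment_info.txt")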
