## Contents

##### 1. Problem statement
##### 2. Data Description
##### 3. Description of attacks and models used for detection
##### 4. SOTA and Baseline models
##### 5. ASVspoof metrics
##### 6. References

## Link to project proposal
https://docs.google.com/document/d/1K46pcbYzvXcz3di70oCOtl7iO4J0LP3QNHHRxRG-8Cg/edit

## 1. Problem statement

Deepfakes are synthetic media in which audio, images, or video are fabricated to look or sound like authentic content depicting real people. This project addresses the **detection of fake audio, in particular spoofing attacks on automatic speaker verification (ASV) systems**.

We aim to analyze and improve current approaches that combine deep learning and traditional machine learning architectures in an ensemble to build robust detection systems.

## 2. Data Description

[Research paper on ASVspoof 2019 Data analysis](https://arxiv.org/pdf/1911.01601.pdf)

### Overview:

The challenge covers two types of attacks: logical access (LA) and physical access (PA).

* The source audio was sampled at 96 kHz and downsampled to 16 kHz
* 107 speakers (46 male, 61 female)

**LA**: Attacks based on feature manipulation, generated with 7 text-to-speech (TTS) and 6 voice conversion (VC) spoofing systems.

Number of utterances:

* Train: bonafide 2,580 | spoofed 22,800
* Development: bonafide 2,548 | spoofed 22,296
* Evaluation: 80,000

**PA**: Attacks at the sensor level or in the physical environment, covering 9 attack types and 27 environments.

Number of utterances:

* Train: bonafide 5,400 | spoofed 48,600
* Development: bonafide 5,400 | spoofed 24,300
* Evaluation: 135,000

Types of features used in dataset creation:
60-dimensional Mel-cepstrum, 1-dimensional F0, a 5-dimensional aperiodic component, 25-dimensional band aperiodicity coefficients (BAPs), etc.

Types of features and baselines available for the dataset:
512-dimensional x-vectors extracted with Kaldi, a baseline model using CQCC features, and a baseline model using LFCC features.

A human evaluation was also carried out to obtain 27,000 bona fide scores for target speakers, 12,000 bona fide scores for non-target speakers, and around 1,200 scores for spoofed utterances.

### Database structure:

Some details about the database:

1. Training and development data for the LA scenario are included in 'ASVspoof2019_LA_train' and 'ASVspoof2019_LA_dev'. Both sets contain audio files with known ground truth: the training set is used to train systems to distinguish genuine from spoofed speech, and the development set is used to develop and tune spoofing detection algorithms. Likewise, training and development data for the PA scenario are included in 'ASVspoof2019_PA_train' and 'ASVspoof2019_PA_dev'.

2. Evaluation data for LA and PA are available in 'ASVspoof2019_LA_eval'  and 'ASVspoof2019_PA_eval', respectively.

3. Dev and eval enrollment data for ASV are available in 'ASVspoof2019_{LA,PA}_dev' and 'ASVspoof2019_{LA,PA}_eval', respectively.

4. Protocols and keys for LA and PA are available in 'ASVspoof2019_LA_{cm,asv}_protocols' and 'ASVspoof2019_PA_{cm,asv}_protocols', respectively.

5. Additional README.LA.txt and README.PA.txt files are included in the packages. They are extended versions of the ASVspoof2019_{LA,PA}_instructions_v1.txt files originally used in the challenge to explain the database.

6. The baseline results based on LFCC and CQCC can be reproduced using the publicly released Matlab-based implementation of a replay attack spoofing detector: http://www.asvspoof.org/asvspoof2019/ASVspoof_2019_baseline_CM_v1.zip
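The protocol files mentioned in item 4 map each audio file to its ground-truth label. Below is a minimal sketch of reading one. It assumes that each line holds five whitespace-separated fields (speaker ID, file ID, environment ID, attack ID, and a `bonafide`/`spoof` key); that layout, the `Trial` type, and the sample lines are assumptions for illustration and should be checked against the README files shipped with the database.

```python
from collections import Counter
from typing import NamedTuple


class Trial(NamedTuple):
    speaker_id: str
    file_id: str
    environment_id: str  # "-" when not applicable (assumed convention)
    attack_id: str       # "-" for bona fide trials (assumed convention)
    key: str             # "bonafide" or "spoof"


def parse_cm_protocol(lines):
    """Parse protocol lines into Trial records, skipping blank lines."""
    trials = []
    for line in lines:
        fields = line.split()
        if not fields:
            continue
        if len(fields) != 5:
            raise ValueError(f"expected 5 fields, got {len(fields)}: {line!r}")
        trials.append(Trial(*fields))
    return trials


# Hypothetical sample lines, made up for illustration only.
sample = [
    "LA_0001 LA_T_0000001 - - bonafide",
    "LA_0001 LA_T_0000002 - A01 spoof",
    "LA_0002 LA_T_0000003 - A04 spoof",
]
trials = parse_cm_protocol(sample)
label_counts = Counter(trial.key for trial in trials)
print(label_counts)
```

With a real protocol file, the same function can be called on the open file object, since iterating over a file yields its lines.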

## 3. Description of attacks and models used for detection

The ASVspoof challenge aims to encourage further progress through

*   the collection and distribution of a standard dataset with varying spoofing attacks implemented with multiple, diverse algorithms
*   a series of competitive evaluations for automatic speaker verification.

#### Background:

The ASVspoof 2019 challenge follows on from three special sessions on spoofing and countermeasures for automatic speaker verification held during INTERSPEECH 2013 [1], 2015 [2], and 2017 [3]. While the first edition in 2013 was targeted mainly at increasing awareness of the spoofing problem, the 2015 edition included the first challenge on the topic, accompanied by commonly defined evaluation data, metrics and protocols. The task in ASVspoof 2015 was to design countermeasure solutions capable of discriminating between bona fide (genuine) speech and spoofed speech produced using either text-to-speech (TTS) or voice conversion (VC) systems. The ASVspoof 2017 challenge focused on the design of countermeasures aimed at detecting replay spoofing attacks that could, in principle, be implemented by anyone using common consumer-grade devices.

The ASVspoof 2019 challenge extends the previous challenge in several directions. The 2019 edition is the first to focus on countermeasures for all three major attack types, namely those stemming from TTS, VC and replay spoofing attacks. Advances with regard to the 2015 edition include the addition of up-to-date TTS and VC systems that draw upon substantial progress made in both fields during the last four years. ASVspoof 2019 thus aims to determine whether the advances in TTS and VC technology pose a greater threat to automatic speaker verification and the reliability of spoofing countermeasures.

Advances with regard to the 2017 edition concern the use of a far more controlled evaluation setup for the assessment of replay spoofing countermeasures. Whereas the 2017 challenge was created from the recordings of real replayed spoofing attacks, the use of an uncontrolled setup made results somewhat difficult to analyse. A controlled setup, in the form of replay attacks simulated using a range of real replay devices and carefully controlled acoustic conditions, is adopted in ASVspoof 2019 with the aim of bringing new insights into the replay spoofing problem.

#### **ASVspoof 2019**

The 2019 edition aligns ASVspoof more closely with the field of automatic speaker verification. Whereas the 2015 and 2017 editions focused on the development and assessment of stand-alone countermeasures, ASVspoof 2019 adopts for the first time a new ASV-centric metric in the form of the tandem decision cost function (t-DCF) [4].

The ASVspoof 2019 database encompasses two partitions for the assessment of logical access (LA) and physical access (PA) scenarios. Both are derived from the VCTK base corpus [5] which includes speech data captured from 107 speakers (46 males, 61 females). Both LA and PA databases are themselves partitioned into three datasets, namely training, development and evaluation which comprise the speech from 20 (8 male, 12 female), 10 (4 male, 6 female) and 48 (21 male, 27 female) speakers respectively. The three partitions are disjoint in terms of speakers, and the recording conditions for all source data are identical. While the training and development sets contain spoofing attacks generated with the same algorithms/conditions (designated as known attacks), the evaluation set also contains attacks generated with different algorithms/conditions (designated as unknown attacks). Reliable spoofing detection performance therefore calls for systems that generalise well to previously-unseen spoofing attacks.

## 4. SOTA and Baseline models

### Criteria for judgement:-

ASVspoof 2019 focuses on the assessment of tandem systems consisting of both a spoofing countermeasure (CM), designed by the participant, and an ASV system, provided by the organisers. The performance of the combined system is evaluated via the minimum normalized tandem detection cost function (t-DCF).
The EER serves as a secondary metric. The EER corresponds to a CM operating point with equal miss and false alarm rates and was the primary metric for previous editions of ASVspoof.


### Challenge Results:-

The top-performing system for the LA scenario, T05, achieved a t-DCF of 0.0069 and EER of 0.22%. The top-performing system for the PA scenario, T28, achieved a t-DCF of 0.0096 and EER of 0.39%. Monotonic increases in the t-DCF that are not always mirrored by monotonic increases in the EER show the importance of considering the performance of the ASV and CM systems in tandem.


### Approaches used by different teams-

#### Ensemble technique
https://arxiv.org/pdf/1904.04589.pdf

They train five deep models, using raw audio or time-frequency representations as input, to minimise a binary cross-entropy (CE) loss with the Adam optimiser and early stopping with a patience of P epochs. Because the dataset contains far more spoofed than bona fide examples, they replicate the bona fide examples so that each batch contains an equal number of bona fide and spoofed examples, which helps stabilise training. At inference time, they use the sigmoid activation of the output layer as the score. Model-specific training details are summarised below.
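The batch balancing described above can be sketched as follows. This is a minimal illustration, assuming that "replicating" means drawing bona fide examples with replacement so that each batch is half bona fide and half spoofed; the function name, batch size, and seed are hypothetical and not taken from the paper.

```python
import random


def balanced_batches(bonafide, spoofed, batch_size, seed=0):
    """Yield batches with equal numbers of bona fide and spoofed items.

    Spoofed items are visited once per epoch in shuffled order; bona fide
    items are drawn with replacement, which replicates the minority class.
    Each item is returned as (example, label) with label 1 for bona fide.
    """
    if batch_size % 2 != 0:
        raise ValueError("batch_size must be even")
    rng = random.Random(seed)
    half = batch_size // 2
    spoofed = list(spoofed)
    rng.shuffle(spoofed)
    # Drop the final partial batch so every batch stays balanced.
    for start in range(0, len(spoofed) - half + 1, half):
        spoof_part = [(x, 0) for x in spoofed[start:start + half]]
        bona_part = [(rng.choice(bonafide), 1) for _ in range(half)]
        batch = spoof_part + bona_part
        rng.shuffle(batch)
        yield batch


# Toy data with roughly the 1:9 imbalance of the LA training set.
bona = [f"bona_{i}" for i in range(10)]
spoof = [f"spoof_{i}" for i in range(90)]
batches = list(balanced_batches(bona, spoof, batch_size=8))
print(len(batches), batches[0])
```

Sampling with replacement keeps memory use flat, whereas physically duplicating the bona fide files would reach the same balance at the cost of a larger training list.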

#### CNN
They use an utterance-level mean-variance normalized log spectrogram, computed using a 1024-point FFT with a hop size of 160 samples, as the input. For each task, they train two such CNN models, model A and model B, on the first and last 4 seconds of each audio sample, respectively.
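As a quick check of the input size these settings imply, the sketch below computes the spectrogram shape for a 4-second segment. It assumes the 16 kHz sampling rate given in the data description and one frame per hop; the exact framing and padding convention used by the authors is not stated here, so the frame count may differ by a few frames in practice.

```python
def spectrogram_shape(duration_s, sample_rate=16000, n_fft=1024, hop=160):
    """Return (frequency_bins, frames) for a one-sided spectrogram.

    Assumes one frame per hop; other framing conventions differ by a
    few frames at the edges.
    """
    n_samples = int(duration_s * sample_rate)
    freq_bins = n_fft // 2 + 1  # one-sided spectrum of a real signal
    frames = n_samples // hop
    return freq_bins, frames


shape = spectrogram_shape(4.0)
print(shape)  # (513, 400)
```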

#### CRNN
As input, they use a mean-variance normalized (statistics computed on the training set) log-Mel spectrogram of 40 Mel bands, computed on the first 5 seconds of truncated or looped audio samples, using a 1024-point FFT with a hop size of 256 samples.
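The "truncated or looped" step can be sketched as below: recordings longer than the target are cut, and shorter ones are repeated until they fill it. The function name is hypothetical, and the 16 kHz rate is taken from the data description; this is an illustration of the idea, not the authors' code.

```python
def loop_or_truncate(samples, target_len):
    """Return exactly target_len samples: repeat short clips, cut long ones."""
    if not samples:
        raise ValueError("cannot loop an empty signal")
    repeats = -(-target_len // len(samples))  # ceiling division
    return (list(samples) * repeats)[:target_len]


SAMPLE_RATE = 16000            # from the data description
TARGET_LEN = 5 * SAMPLE_RATE   # first 5 seconds

short_clip = [0.1, 0.2, 0.3]
looped = loop_or_truncate(short_clip, 7)
print(looped)  # [0.1, 0.2, 0.3, 0.1, 0.2, 0.3, 0.1]

long_clip = list(range(100000))
cut = loop_or_truncate(long_clip, TARGET_LEN)
print(len(cut))  # 80000
```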

#### 1D CNN
In total, the model consists of 9 ReSE-2 blocks [22]. These blocks are a combination of ResNets [23] and SENets [24]. They use the multi-level feature aggregation, where the outputs of the last three blocks are concatenated and followed by a fully connected layer of 1024 units, batch normalization and ReLU layers, a 50% dropout layer and a fully connected layer of 1 unit with sigmoid activation. Each convolutional layer has filters of size 3, L2 weight regularizer of 0.0005 and all strides are of unit value. The raw audio input is 3.7 seconds in duration and randomly sampled segments of this size are selected from the recordings. 

#### Wave U-Net
They use a modified version of the Wave-U-Net [25], with five layers of stride four, and without upsampling blocks (model E). The outputs of the last convolution are max-pooled across time, reducing the parameter count and incorporating the intuition that the important features in the tasks are temporally local. Finally, they apply a fully connected layer with a single output to yield a classification probability. They train the model using a batch size of 64, a learning rate of $10^{-5}$ and early stopping patience of P = 10 for both the LA and PA tasks, where an epoch is defined as 500 update steps.
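Max-pooling across time can be illustrated with a few lines of plain Python. In this sketch a feature map is a list of channels, each holding one value per time step; the representation and the function name are assumptions chosen for readability. The pooled output has one value per channel regardless of the input duration, which is why the following fully connected layer can stay small.

```python
def global_max_pool_time(feature_map):
    """Collapse a (channels x time) feature map to one value per channel."""
    return [max(channel) for channel in feature_map]


# Two toy feature maps with the same channels but different durations.
short_map = [[0.1, 0.9], [0.4, 0.2]]
long_map = [[0.1, 0.3, 0.9, 0.0], [0.4, 0.2, 0.1, 0.3]]

pooled_short = global_max_pool_time(short_map)
pooled_long = global_max_pool_time(long_map)
print(pooled_short, pooled_long)  # [0.9, 0.4] [0.9, 0.4]
```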

#### Gaussian Mixture models
They train three GMMs using 60-dimensional static, delta and acceleration (SDA) mel frequency cepstral coefficients (MFCCs) (model F), inverted mel frequency cepstral coefficients (IMFCCs) (model G), and sub-band centroid magnitude coefficients (SCMC) (model H), chosen for their performance on the ASVspoof 2015 and 2017 spoofing datasets.

#### Support Vector Machines
They train two SVMs using i-vectors (model I) and the long-term average spectrum (LTAS) feature (model J), since both have shown good performance on prior spoofing datasets.


#### RESULTS-

This team achieved good performance on the PA task and ranked 3rd on the LA task of the challenge. The PA task appears to be generally more difficult and should therefore be the primary focus of future work.



### Using VGG net and SincNet
https://omilia.com/wp-content/uploads/2019/09/BUT_Omilia_anti_spoof_2019_system_description-2.pdf

For this challenge, two different topologies were used for physical access. The first is a modified version of a VGG network, which has shown good performance in audio tagging and audio scene classification. The second is a modified version of a Light CNN (LCNN), which had the best performance in the ASVspoof 2017 challenge. The authors also used modified versions of both networks for the 2019 acoustic scene classification challenge. Both networks are described in more detail below.

The VGG network comprises several convolutional and pooling layers followed by a statistics pooling layer and several dense layers which perform classification. There are 6 convolutional blocks in the model, each containing 2 convolutional layers and one max-pooling layer. Each max-pooling layer halves the size of the frequency axis, while only one of them reduces the temporal resolution.

The LCNN network is a combination of convolutional and max-pooling layers and uses Max-Feature-Map (MFM) as its non-linearity. MFM is a layer which simply halves the number of output channels by taking the maximum of two consecutive channels (or any other combination of two channels).
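The MFM operation described above can be sketched in plain Python. This version pairs consecutive channels (0 with 1, 2 with 3, and so on), which is only one of the pairings the description allows; the list-of-lists representation and the function name are assumptions for illustration.

```python
def max_feature_map(channels):
    """Halve the channel count by an element-wise max over channel pairs.

    `channels` is a list of equally sized feature vectors. Channels 0 and 1
    are paired, then 2 and 3, and so on.
    """
    if len(channels) % 2 != 0:
        raise ValueError("MFM needs an even number of channels")
    return [
        [max(a, b) for a, b in zip(channels[i], channels[i + 1])]
        for i in range(0, len(channels), 2)
    ]


features = [
    [1.0, -2.0, 0.5],
    [0.0, 3.0, 0.5],
    [-1.0, -1.0, 2.0],
    [4.0, -3.0, 1.0],
]
reduced = max_feature_map(features)
print(reduced)  # [[1.0, 3.0, 0.5], [4.0, -1.0, 2.0]]
```

Unlike ReLU, which thresholds each value against zero, MFM keeps the larger of two competing activations, so negative values can survive (as in the second output channel).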


#### RESULTS-

For PA, they followed the VGG architecture and obtained very competitive results on both the development and evaluation sets by fusing only two networks. For LA, they fused a VGG architecture with the recently proposed SincNet. The rationale for employing the latter was its ability to jointly optimize the network and the feature extractor, which was shown to be very effective for speech and speaker recognition. Despite their efforts to prevent overfitting (mainly via attack-level cross-validation in training and development), the results on LA showed the difficulty SincNet has in generalizing to certain attacks that differ significantly from those seen in training. They concluded that more research is required in order to make full use of end-to-end anti-spoofing architectures such as SincNet in cases of large mismatch between training and evaluation attacks.


---



## 5. ASVspoof metrics

The ASVspoof challenge uses two metrics: t-DCF (tandem detection cost function) and EER (equal error rate). Here we briefly explain what these metrics measure (reference: https://www.isca-speech.org/archive/Odyssey_2018/pdfs/68.pdf).

### Sub-systems in ASVspoof

In the ASVspoof challenge, we consider two different systems: <br>

1) ASV (automatic speaker verification): This system verifies whether the speakers in the given audio samples are the same or different.

2) CM (countermeasure): This system identifies whether an audio sample is spoofed or not.

Because the two sub-systems are inter-connected and affect each other's performance, we want a metric that measures the joint performance of the two sub-systems. This is where t-DCF comes into play.



### Dataset types
Consider that for a given speaker we have three separate datasets: <br>

a) target: audio of the speaker <br>
b) non-target: audio of a different speaker <br>
c) spoofed: spoofed audio of speaker <br>



### Detection error rates
Some detection error rates that we will use later: <br>

* ASV
     * Miss error rate: $P_{miss}^{asv}(t) = \frac{1}{N_{tar}} \sum_{i \in S_{tar}} \mathbb{I} \{r_i \leq t\}$ <br>
       This estimates the probability that the ASV system misses target speech, i.e. misclassifies audio of the speaker as non-speaker audio (if the model score $r_i$ is $\leq$ a threshold $t$, the ASV system classifies the audio as non-speaker). The indicator function $\mathbb{I}$ equals 1 when its condition holds and 0 otherwise, so the sum counts the occurrences for which the condition is true.
     * False acceptance rate: $P_{fa}^{asv}(t) = \frac{1}{N_{nontar}} \sum_{i \in S_{nontar}} \mathbb{I} \{r_i > t\}$ <br>
      This estimates the probability of a sample not belonging to speaker model but being falsely accepted as such by the ASV system
      
     * Because spoofs can be considered very close to the speaker model, we can also define the fraction of spoofs that were not missed (i.e. were accepted) by the ASV system: $1 - P_{miss,spoof}^{asv}(t) = 1 - \frac{1}{N_{spoof}} \sum_{i \in S_{spoof}} \mathbb{I} \{r_i \leq t\}$, which is one minus the probability that a spoof is *correctly* rejected by the ASV system.

* CM <br>
    Because CM systems are not concerned with speaker verification, we pool the target and non-target audios into one common "human" audio dataset for measuring CM related metrics
    * Miss error rate: $P_{miss}^{cm}(s) = \frac{1}{N_{human}} \sum_{j \in S_{human}} \mathbb{I} \{q_j \leq s\}$.<br> This calculates the probability of a bonafide audio being incorrectly missed/classified as spoof by the CM system. Here, if $q_j$ is $\leq$ some threshold $s$, an audio is classified as spoof.
    * False acceptance rate: $P_{fa}^{cm}(s) = \frac{1}{N_{spoof}} \sum_{j \in S_{spoof}} \mathbb{I} \{q_j > s\}$.<br> This estimates the probability of a spoofed audio being falsely accepted as bonafide by the CM system
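The error rates above translate directly into code. The sketch below implements the two generic rates and evaluates them on made-up toy scores; all score values, thresholds, and variable names are hypothetical and only serve to show which score set feeds which rate.

```python
def miss_rate(positive_scores, threshold):
    """Fraction of scores at or below the threshold (rejected positives)."""
    return sum(score <= threshold for score in positive_scores) / len(positive_scores)


def false_alarm_rate(negative_scores, threshold):
    """Fraction of scores strictly above the threshold (accepted negatives)."""
    return sum(score > threshold for score in negative_scores) / len(negative_scores)


# Made-up ASV scores r_i for the three datasets.
asv_target = [2.1, 1.7, 0.4, 3.0]
asv_nontarget = [-1.0, 0.6, -0.3, 0.1]
asv_spoof = [1.5, 0.2, 2.2, 0.9]

# Made-up CM scores q_j; target and non-target are pooled as "human".
cm_human = [0.9, 0.8, 0.2, 0.7, 0.6]
cm_spoof = [0.1, 0.3, 0.75, 0.05]

t, s = 0.5, 0.5  # ASV and CM thresholds

p_miss_asv = miss_rate(asv_target, t)
p_fa_asv = false_alarm_rate(asv_nontarget, t)
p_miss_spoof_asv = miss_rate(asv_spoof, t)
p_miss_cm = miss_rate(cm_human, s)
p_fa_cm = false_alarm_rate(cm_spoof, s)

print(p_miss_asv, p_fa_asv, p_miss_spoof_asv, p_miss_cm, p_fa_cm)
```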



### Equal error rate
The EER (equal error rate) is the common error rate at the threshold where $P_{miss}^{system} = P_{fa}^{system}$.
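With a finite set of scores the two rates rarely cross exactly, so the EER is usually approximated. The sketch below sweeps every observed score as a candidate threshold and reports the average of the two rates where they are closest. This is one simple approximation chosen for illustration, not necessarily the procedure used in the official scoring scripts, and the toy scores are made up.

```python
def equal_error_rate(positive_scores, negative_scores):
    """Approximate the EER by a sweep over all observed scores.

    Returns (eer, threshold), where eer is the mean of the miss and
    false-alarm rates at the threshold where they are closest.
    """
    thresholds = [float("-inf")] + sorted(set(positive_scores) | set(negative_scores))
    best = None
    for th in thresholds:
        p_miss = sum(s <= th for s in positive_scores) / len(positive_scores)
        p_fa = sum(s > th for s in negative_scores) / len(negative_scores)
        gap = abs(p_miss - p_fa)
        if best is None or gap < best[0]:
            best = (gap, (p_miss + p_fa) / 2, th)
    return best[1], best[2]


# Made-up CM scores: higher means "more likely bona fide".
bonafide_scores = [0.9, 0.8, 0.7, 0.3]
spoof_scores = [0.1, 0.2, 0.4, 0.6]

eer, eer_threshold = equal_error_rate(bonafide_scores, spoof_scores)
print(eer, eer_threshold)  # 0.25 0.4
```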

### Types of action outputs of sub-systems
Now, consider the ASV and CM systems working together. Each system can take one action from $\mathcal{A} = \{ACCEPT, REJECT, SLEEP\}$. ACCEPT indicates that the system passed the audio, and REJECT indicates that the system rejected it. SLEEP arises when the two systems run sequentially: if the first system rejects the audio, the second system is never consulted, and we consider its output to be SLEEP.

Considering this, the possible set of outcomes for the combined system will be: <br>
$\alpha_{1} = (ACCEPT, REJECT)$ <br>
$\alpha_{2} = (ACCEPT, ACCEPT)$ <br>
$\alpha_{3} = (REJECT, REJECT)$ <br>
$\alpha_{4} = (REJECT, ACCEPT)$ <br>
$\alpha_{5} = (REJECT, SLEEP)$ <br>
$\alpha_{6} = (SLEEP, REJECT)$ <br>

### Detection cost function
For a particular action, we define the DCF (detection cost function) as: <br>
$DCF(\alpha_j) = \sum_{i=1}^{M}\pi_{i} \cdot C(\alpha_j|\theta_i) \cdot P_{err}(\alpha_{j}|\theta_{i})$ <br>
Here, <br>
1) $\alpha_j$ corresponds to a particular action (one of the outcomes listed above) <br>
2) $\sum_{i=1}^{M}$ sums over the $M$ possible true classes $\theta_i$ (here: target, non-target, and spoof) <br>
3) $\pi_{i}$ is the prior probability of the true class $\theta_{i}$ <br>
4) $C(\alpha_{j}|\theta_{i})$ is the cost associated with taking action $\alpha_{j}$ when the true class is $\theta_{i}$ <br>
5) $P_{err}(\alpha_{j}|\theta_{i})$ is the probability that we take action $\alpha_{j}$ given that the true class is $\theta_{i}$. In our case, these are the error rates defined above.

Then, $DCF = \sum_{j=1}^{L}DCF(\alpha_{j})$, where the sum runs over all $L$ possible actions.

### Tandem Detection Cost Function
Now, for t-DCF.

$\text{t-DCF}(s, t) = C_{miss}^{asv} \cdot \pi_{tar} \cdot P_{a}(s,t) + C_{fa}^{asv} \cdot \pi_{nontar} \cdot P_{b}(s,t) + C_{fa}^{cm} \cdot \pi_{spoof} \cdot P_{c}(s,t) + C_{miss}^{cm} \cdot \pi_{tar} \cdot P_{d}(s)$

Here, <br>

1) s is the threshold for CM system and t is the threshold for ASV system (as used in the error rates above) <br>
2) $C_{miss}^{asv}=$ cost associated with ASV missing speaker audio; $C_{fa}^{asv}=$ cost associated with ASV falsely accepting non-target audio; $C_{fa}^{cm}=$ cost associated with CM system falsely accepting spoofed audio; and $C_{miss}^{cm}=$ cost associated with CM system falsely rejecting non-spoof audio <br>
3) $\pi_{tar}$ is probability of audio being from target dataset; $\pi_{nontar}$ is probability of audio being from nontarget dataset; and $\pi_{spoof}$ is probability of audio being from spoofed dataset

For the last probability terms, in the equation: <br>

a) $P_{a}(s,t)=(1-P_{miss}^{cm}(s)) \cdot P_{miss}^{asv}(t)$ is the probability that the CM does not miss target speech but the ASV system misses (falsely rejects) it. <br>
b) $P_{b}(s,t)=(1-P_{miss}^{cm}(s)) \cdot P_{fa}^{asv}(t)$ is the probability that the CM does not miss human speech (non-target in this case) and the ASV system falsely accepts the non-target speaker. <br>
c) $P_{c}(s,t)=P_{fa}^{cm}(s) \cdot (1-P_{miss,spoof}^{asv}(t))$ is the probability that the CM falsely accepts spoofed audio and the ASV system does not miss (i.e. accepts) it. <br>
d) $P_{d}(s)=P_{miss}^{cm}(s)$ is the probability that the CM falsely rejects a target utterance, which gives the ASV system no chance to classify it and makes its prediction irrelevant.
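Putting the four terms together, the t-DCF at fixed thresholds can be computed as in the sketch below. The cost and prior values are placeholders chosen for illustration and are not taken from this document; the values actually used in the challenge are defined in the evaluation plan. The sketch computes the raw t-DCF only, without the normalization and minimization over $s$ that the challenge metric applies.

```python
def tdcf(p_miss_cm, p_fa_cm, p_miss_asv, p_fa_asv, p_miss_spoof_asv, costs, priors):
    """Raw t-DCF at fixed CM and ASV thresholds, following terms a) to d)."""
    p_a = (1 - p_miss_cm) * p_miss_asv        # CM passes target, ASV misses it
    p_b = (1 - p_miss_cm) * p_fa_asv          # CM passes non-target, ASV accepts it
    p_c = p_fa_cm * (1 - p_miss_spoof_asv)    # CM passes spoof, ASV accepts it
    p_d = p_miss_cm                           # CM rejects target speech
    return (
        costs["miss_asv"] * priors["tar"] * p_a
        + costs["fa_asv"] * priors["nontar"] * p_b
        + costs["fa_cm"] * priors["spoof"] * p_c
        + costs["miss_cm"] * priors["tar"] * p_d
    )


# Placeholder parameters (assumptions, see the evaluation plan for real values).
costs = {"miss_asv": 1.0, "fa_asv": 10.0, "miss_cm": 1.0, "fa_cm": 10.0}
priors = {"tar": 0.9405, "nontar": 0.0095, "spoof": 0.05}

# A perfect CM leaves only the ASV system's own errors on human speech.
perfect_cm = tdcf(0.0, 0.0, 0.02, 0.03, 0.5, costs, priors)

# A CM that rejects everything pays the full miss cost on target speech.
reject_all = tdcf(1.0, 0.0, 0.02, 0.03, 0.5, costs, priors)

print(perfect_cm, reject_all)
```

The two extreme cases show why the metric is "tandem": even a perfect CM cannot push the cost to zero when the ASV system makes errors, and an overly strict CM is penalized for blocking genuine users.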

## ASVspoof evaluation plan
http://www.asvspoof.org/asvspoof2019/asvspoof2019_evaluation_plan.pdf

## Paper describing the ASVspoof 2019 results
https://arxiv.org/pdf/1904.05441.pdf

## 6. References



[1] Nicholas Evans, Tomi Kinnunen, Junichi Yamagishi, "Spoofing and countermeasures for automatic speaker verification", Proc. Interspeech 2013, pp. 925-929, August 2013

[2] Zhizheng Wu, Tomi Kinnunen, Nicholas Evans, Junichi Yamagishi, Cemal Hanilçi, Md Sahidullah, Aleksandr Sizov, "ASVspoof 2015: the First Automatic Speaker Verification Spoofing and Countermeasures Challenge", Proc. Interspeech 2015, pp. 2037-2041, September 2015

[3] Tomi Kinnunen, Md Sahidullah, Héctor Delgado, Massimiliano Todisco, Nicholas Evans, Junichi Yamagishi, Kong Aik Lee, "The ASVspoof 2017 Challenge: Assessing the Limits of Replay Spoofing Attack Detection", Proc. Interspeech 2017, August 2017

[4] Tomi Kinnunen, Kong Aik Lee, Héctor Delgado, Nicholas Evans, Massimiliano Todisco, Md Sahidullah, Junichi Yamagishi, Douglas A. Reynolds, "t-DCF: a detection cost function for the tandem assessment of spoofing countermeasures and automatic speaker verification", Proc. Odyssey 2018, June 2018

[5] VCTK corpus: http://dx.doi.org/10.7488/ds/1994