# LinguAS: Linguistically-augmented Audio Speech data

LinguAS is a dataset of genuine and spoofed audio speech samples annotated by expert linguists. The dataset contains over 800 samples, each annotated with an expert linguist's judgment of five linguistic cues associated with deepfaked speech. This dataset is an exemplar of how *deep data*, rather than big data, can powerfully improve deep learning modeling.

**Linguistic features over acoustic sampling**
We created LinguAS to address a need for audio deepfake detection data that is explainable, generalizable, and representative of actual linguistic phenomena, that is, how humans actually speak. Most audio deepfake detection models use only raw audio or acoustic sampling to predict whether speech is deepfaked or real.

The key aspect of the LinguAS dataset is the inclusion of *expert-defined linguistic features*, or EDLFs. The linguists on our research team selected five linguistic indices that can reveal whether speech is spoofed or real. For every audio sample, a linguist annotated whether they perceived each of these cues, assigning a binary code per feature. If a sample sounded anomalous for a particular feature, it received a "1" for that feature. Samples that sounded typical (not anomalous) received a "0".
    
Our linguists selected the following five linguistic features for annotation and exploration:

- *breath anomaly*: whether a speaker produced typical breath sounds and breathing rhythm during the recording
- *pitch anomaly*: whether a speaker's pitch varied in an unusual way during the recording
- *pause anomaly*: whether a speaker paused unnaturally during the recording
- *bursts anomaly*: whether a speaker's production of the English consonant class [/p/, /t/, /k/, /b/, /d/, /g/] was typical of human speech
- *audio quality anomaly*: whether the audio sample featured any signs of alteration or falsification, including but not limited to clicks, compression, or oversmoothing
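
As a minimal sketch of the binary coding described above, one annotated sample could be represented as follows. The `EDLFAnnotation` class and its method names are hypothetical and not part of the dataset's tooling; only the five feature names come from this README.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class EDLFAnnotation:
    """One linguist's binary judgment per feature (1 = anomalous, 0 = typical).

    Hypothetical container for illustration; not shipped with LinguAS.
    """
    breath_anomaly: int
    pitch_anomaly: int
    pause_anomaly: int
    bursts_anomaly: int
    audio_quality_anomaly: int

    def __post_init__(self):
        # Reject anything other than the two codes used by the annotators.
        for name, value in asdict(self).items():
            if value not in (0, 1):
                raise ValueError(f"{name} must be 0 or 1, got {value!r}")

    def as_vector(self):
        """Return the five codes as a list, in field-declaration order."""
        return list(asdict(self).values())

    def anomalous_features(self):
        """Return the names of the features that were judged anomalous."""
        return [name for name, value in asdict(self).items() if value == 1]


# Example: a sample judged anomalous for breath and audio quality only.
annotation = EDLFAnnotation(
    breath_anomaly=1,
    pitch_anomaly=0,
    pause_anomaly=0,
    bursts_anomaly=0,
    audio_quality_anomaly=1,
)
print(annotation.as_vector())
print(annotation.anomalous_features())
```

A vector like this could be supplied to a classifier alongside acoustic features, although this README does not specify how the features are combined with audio in any particular model.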

To learn more about the theoretical motivation behind these choices and how they improve model performance, read our paper HERE.
<br>

## Data Description
LinguAS provides 845 audio samples annotated by expert linguists. The data is roughly balanced by proportion of spoofed vs genuine audio, attack type, and speaker gender. The audio samples in the data are a combination of samples from publicly-available deepfake detection challenge datasets and samples generated by the LinguAS research team using voice conversion and text-to-speech tools. We included a wide variety of sources, speakers, and attack types so that, although the dataset is small, it contains a wide array of human (and human*like*) speech variation.

**The dataset includes samples from the following sources:**

- _Publicly-available datasets_
  - ASVspoof 2021; deepfakes
  - ASVspoof 2017; voice spoofing replay attacks
  - Cotatron
  - FakeorReal
  - LJSpeech
    
- _Generator tools used by LinguAS team_
  - ASSEM-VC
  - Google/WaveNet TTS
  - Lyrebird (Descript)
  - MelGAN
  - Mellotron
  - Phonetic PosteriorGram (PPG)
  - Resemble.ai
  - YouTube videos of public figures and imitators

## LinguAS metadata configuration
- **clip_id**
    - The unique id of every audio sample in the data.
- **type**
    - Classifies the audio file into one of five categories:
      - genuine: real human speech, not spoofed
      - replay_attack: a real human speaker whose voice was spoofed by replaying their previous voice production
      - mimicry: a speaker's purposeful imitation of another person's voice, intended to impersonate them
      - vc: voice conversion; a type of audio deepfake where the victim's voice is overlaid on another speaker's speech
      - tts: text-to-speech; naturalistic computer-generated speech, generated from text input
- **generator_or_source**
    - Identifies where a sample originated, whether from a publicly-available dataset or created by our research team.
- **duration**
    - Length of the audio clip in seconds.
- **gender**
  - Apparent speaker gender. Although it is not possible to know someone's gender only by listening to their voice, we provide gender labels either from the public dataset where we gathered them, or based on the "male" or "female" selection options for TTS and voice conversion generators.
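- **expert-defined linguistic features (EDLFs)**
  - columns 6 to 10 each represent one EDLF (defined above), in the column order shown in the table below.
    - *pitch_anomaly*
    - *pause_anomaly*
    - *bursts_anomaly*
    - *breath_anomaly*
    - *audio_quality_anomaly*
- **y_true**
  - Ground-truth label for the sample. In the example rows below, genuine speech is coded 0 and the spoofed types are coded 1.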
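
The field definitions above can be turned into a simple consistency check. The sketch below is an assumption-laden illustration, not an official validator: `validate_row`, `ALLOWED_TYPES`, and `EDLF_COLUMNS` are hypothetical names, and the rules are only those this README states (five `type` categories, a duration in seconds, and binary EDLF codes).

```python
# Hypothetical constants derived from the field descriptions in this README.
ALLOWED_TYPES = {"genuine", "replay_attack", "mimicry", "vc", "tts"}
EDLF_COLUMNS = (
    "pitch_anomaly",
    "pause_anomaly",
    "bursts_anomaly",
    "breath_anomaly",
    "audio_quality_anomaly",
)


def validate_row(row):
    """Return a list of problems found in one metadata row (empty if none)."""
    problems = []
    if not row.get("clip_id"):
        problems.append("missing clip_id")
    if row.get("type") not in ALLOWED_TYPES:
        problems.append(f"unknown type: {row.get('type')!r}")
    try:
        if float(row.get("duration", "")) <= 0:
            problems.append("duration must be positive")
    except (TypeError, ValueError):
        problems.append("duration is not a number")
    for col in EDLF_COLUMNS:
        # Accept both integer and string codes, e.g. 1 or "1".
        if str(row.get(col)) not in {"0", "1"}:
            problems.append(f"{col} must be 0 or 1")
    return problems


example = {
    "clip_id": "file204.wav", "type": "tts",
    "generator_or_source": "FoR_Dataset_tts", "duration": "3", "gender": "M",
    "pitch_anomaly": "1", "pause_anomaly": "0", "bursts_anomaly": "0",
    "breath_anomaly": "0", "audio_quality_anomaly": "1",
}
print(validate_row(example))
```

Returning a list of problems rather than raising on the first one makes it easier to report every issue in a row at once.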


**Configuration of linguAS_metadata.csv**
| clip_id          | type    | generator_or_source | duration | gender | pitch_anomaly | pause_anomaly | bursts_anomaly | breath_anomaly | audio_quality_anomaly | y_true |
|------------------|---------|---------------------|----------|--------|---------------|---------------|----------------|----------------|-----------------------|--------|
| Assem1           | vc      | ASSEM               | 4        | F      | 0             | 0             | 0              | 0              | 0                     | 1      |
| DF_E_2051342.wav | vc      | asvspoof2021_A19    | 4        | M      | 0             | 0             | 0              | 1              | 1                     | 1      |
| file204.wav      | tts     | FoR_Dataset_tts     | 3        | M      | 1             | 0             | 0              | 0              | 1                     | 1      |
| KamalaF9.wav     | mimicry | mimicry             | 5        | F      | 0             | 1             | 0              | 0              | 1                     | 1      |
| KamalaR1.wav     | genuine | genuine             | 8        | F      | 1             | 0             | 0              | 0              | 0                     | 0      |
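
A minimal sketch of reading this layout with Python's standard `csv` module is shown below. It assumes the file is comma-delimited with the header names from the table; the five rows are copied from the table so the example runs without the dataset, and `load_rows` and `anomaly_counts` are hypothetical helper names. Treating `y_true == 1` as "spoofed" is inferred from the example rows rather than stated elsewhere in this README.

```python
import csv
import io
from collections import Counter

# The example rows from the table above, written as comma-separated text.
SAMPLE_CSV = """\
clip_id,type,generator_or_source,duration,gender,pitch_anomaly,pause_anomaly,bursts_anomaly,breath_anomaly,audio_quality_anomaly,y_true
Assem1,vc,ASSEM,4,F,0,0,0,0,0,1
DF_E_2051342.wav,vc,asvspoof2021_A19,4,M,0,0,0,1,1,1
file204.wav,tts,FoR_Dataset_tts,3,M,1,0,0,0,1,1
KamalaF9.wav,mimicry,mimicry,5,F,0,1,0,0,1,1
KamalaR1.wav,genuine,genuine,8,F,1,0,0,0,0,0
"""

EDLF_COLUMNS = [
    "pitch_anomaly",
    "pause_anomaly",
    "bursts_anomaly",
    "breath_anomaly",
    "audio_quality_anomaly",
]


def load_rows(handle):
    """Parse metadata rows, converting numeric columns from text."""
    rows = []
    for row in csv.DictReader(handle):
        for col in EDLF_COLUMNS + ["y_true"]:
            row[col] = int(row[col])
        row["duration"] = float(row["duration"])
        rows.append(row)
    return rows


def anomaly_counts(rows):
    """Count how many of the given rows were flagged for each EDLF."""
    counts = Counter()
    for row in rows:
        for col in EDLF_COLUMNS:
            counts[col] += row[col]
    return counts


rows = load_rows(io.StringIO(SAMPLE_CSV))
type_counts = Counter(row["type"] for row in rows)
spoofed = [row for row in rows if row["y_true"] == 1]

print(dict(type_counts))
print(dict(anomaly_counts(spoofed)))
```

To run the same summary on the real file, the `io.StringIO(SAMPLE_CSV)` argument could be replaced with a handle from `open("linguAS_metadata.csv", newline="")`, assuming the file sits in the working directory.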



## Acknowledgements 
Thank you to the creators of the publicly-available datasets *ASVspoof*, *FakeorReal*, and *LJSpeech*.

## Citations
Park, S. W., Kim, D. Y., & Joe, M. C. (2020). Cotatron: Transcription-guided speech encoder for any-to-many voice conversion without parallel data. arXiv preprint arXiv:2005.03295.

Valle, R., Li, J., Prenger, R., & Catanzaro, B. (2020, May). Mellotron: Multispeaker expressive voice synthesis by conditioning on rhythm, pitch and global style tokens. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 6189-6193). IEEE.

Van Den Oord, A., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., ... & Kavukcuoglu, K. (2016). WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499.

ITO, K., AND JOHNSON, L. The LJ Speech dataset, 2017.

KIM, K.-W., PARK, S.-W., LEE, J., AND JOE, M.-C. Assem-VC: Realistic voice conversion by assembling modern speech synthesis techniques. In ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2022), pp. 6997–7001.

KUMAR, K., KUMAR, R., DE BOISSIERE, T., GESTIN, L., TEOH, W. Z., SOTELO, J., DE BRÉBISSON, A., BENGIO, Y., AND COURVILLE, A. C. MelGAN: Generative adversarial networks for conditional waveform synthesis. Advances in neural information processing systems 32 (2019).

REIMAO, R., AND TZERPOS, V. FoR: A dataset for synthetic speech detection. In 2019 International Conference on Speech Technology and Human-Computer Dialogue (SpeD) (2019), pp. 1–10.

WANG, X., AND YAMAGISHI, J. A comparative study on recent neural spoofing countermeasures for synthetic speech detection. arXiv preprint arXiv:2103.11326 (2021).

WU, Z., YAMAGISHI, J., KINNUNEN, T., HANILÇI, C., SAHIDULLAH, M., SIZOV, A., EVANS, N., TODISCO, M., AND DELGADO, H. ASVspoof: the automatic speaker verification spoofing and countermeasures challenge. IEEE Journal of Selected Topics in Signal Processing 11, 4 (2017), 588–604.

T. Kinnunen, M. Sahidullah, H. Delgado, M. Todisco, N. Evans, J. Yamagishi, and K. A. Lee, "The ASVspoof 2017 challenge: Assessing the limits of replay spoofing attack detection," in Interspeech 2017. ISCA, 2017, pp. 2–6. [Online]. Available: http://www.isca-speech.org/archive/Interspeech_2017/abstracts/1111.html

YAMAGISHI, J., WANG, X., TODISCO, M., SAHIDULLAH, M., PATINO, J., NAUTSCH, A., LIU, X., LEE, K. A., KINNUNEN, T., EVANS, N., ET AL. ASVspoof 2021: accelerating progress in spoofed and deepfake speech detection.