The Abuse Project Audio Dataset (TAPAD). Think MNIST for audio profanity.

profanitas/TAPAD

The Abuse Project Audio Dataset (TAPAD)

World's largest profanity audio dataset

Dataset consists of 26,365 audio files
Click here for documentation

See The Abuse Project

TAPAD (∿) is an open dataset, meaning it will grow over time as more data is contributed. To enable reproducibility and accurate citation, the dataset is versioned using git tags.

Current Status & ID3

| Category | Value |
|---|---|
| Total files | 26,365 |
| Dataset updated | July 30, 2019 |
| Language classes | 75 |
| File Type | MP3 |
| Mime Type | audio/mpeg |
| Mpeg Audio Version | 2 |
| Audio Layer | 3 |
| Audio Bitrate | 32 kbps |
| Sample Rate | 24000 Hz |
| Channel Mode | Single Channel |
| Ms Stereo | Off |
| Intensity Stereo | Off |
| Codec Type | audio |
| Codec Time Base | 1/24000 |
| Codec Tag | 0x0000 |
| Sample Fmt | fltp |
| Sample Rate | 24000 Hz |
| Channels | 1 |
| Channel Layout | mono |
| Bits Per Sample | 0 |
| R Frame Rate | 0/0 |
| Avg Frame Rate | 0/0 |
| Time Base | 1/14112000 |
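
Several of the fields above (MPEG version, layer, bitrate, sample rate, channel mode) are encoded in the 4-byte MPEG audio frame header, so they can be checked without external tools. The sketch below is only an illustration and is not part of the TAPAD repository: the function name `parse_mpeg2_layer3_header` is hypothetical, it handles only the MPEG-2 Layer III combination listed in the table, and it assumes the caller has already located a frame header (any leading ID3v2 tag has to be skipped first).

```python
import struct

# Lookup tables for MPEG-2 Layer III only (the combination reported in the table).
# Bitrate index 0 means "free format" and 15 is invalid, so both map to None.
MPEG2_L3_BITRATES_KBPS = [None, 8, 16, 24, 32, 40, 48, 56, 64,
                          80, 96, 112, 128, 144, 160, None]
MPEG2_SAMPLE_RATES_HZ = [22050, 24000, 16000, None]
CHANNEL_MODES = ["Stereo", "Joint Stereo", "Dual Channel", "Single Channel"]


def parse_mpeg2_layer3_header(header: bytes) -> dict:
    """Decode bitrate, sample rate and channel mode from a 4-byte frame header.

    Hypothetical helper: raises ValueError if the bytes are not an
    MPEG-2 Layer III frame header.
    """
    if len(header) < 4:
        raise ValueError("need at least 4 bytes")
    (word,) = struct.unpack(">I", header[:4])

    # The top 11 bits are the frame sync and must all be set.
    if word >> 21 != 0x7FF:
        raise ValueError("frame sync not found")

    version_bits = (word >> 19) & 0b11  # 0b10 = MPEG version 2
    layer_bits = (word >> 17) & 0b11    # 0b01 = Layer III
    if version_bits != 0b10 or layer_bits != 0b01:
        raise ValueError("not an MPEG-2 Layer III frame")

    return {
        "bitrate_kbps": MPEG2_L3_BITRATES_KBPS[(word >> 12) & 0xF],
        "sample_rate_hz": MPEG2_SAMPLE_RATES_HZ[(word >> 10) & 0b11],
        "channel_mode": CHANNEL_MODES[(word >> 6) & 0b11],
    }
```

For example, the header bytes `FF F3 44 C4` (chosen here to match the table, not read from a TAPAD file) decode to 32 kbps, 24000 Hz, Single Channel.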

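Language classes are identified by a two-letter code, normally the language's ISO 639-1 code (see: ISO_639-1). Regional variants in the structure below append a region suffix, for example en-au or pt-br.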
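
The naming rule can be checked mechanically when adding a new class directory. The sketch below is an illustration under stated assumptions: the function name `is_valid_class_name` is hypothetical, and the optional two-letter region suffix is inferred from directory names such as en-au and zh-cn in the structure listing, not from a rule the project states. It checks only the shape of the name, not whether the code is a registered ISO 639-1 code.

```python
import re

# Assumed shape: two lowercase letters, optionally followed by
# a hyphen and a two-letter lowercase region suffix (e.g. "en", "en-au").
CLASS_NAME = re.compile(r"[a-z]{2}(-[a-z]{2})?")


def is_valid_class_name(name: str) -> bool:
    """Return True if name looks like a TAPAD language class directory."""
    return CLASS_NAME.fullmatch(name) is not None
```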

Scripts & Utilities

| Filename | Location | Description | Type |
|---|---|---|---|
| record.py | acquire\custom | Records audio in WAV format (default: 3 sec) | Helper script |
| wingen.py | acquire\generate | TTS conversion using SAPI.SpVoice | Helper script |
| gTTSgen.py | acquire\generate | TTS conversion using gTTS & abuse 0.1.1 | Helper script |
| gspectogram.py | utils | Generates spectrogram of a wav file | Utility tool |

Structure

.
├───af
├───ar
├───bn
├───bs
├───ca
├───cs
├───cy
├───da
├───de
├───el
├───en
│   ├───1 (340 wav files)
│   └───2
├───en-au
├───en-ca
├───en-gb
├───en-gh
├───en-ie
├───en-in
├───en-ng
├───en-nz
├───en-ph
├───en-tz
├───en-uk
├───en-us
├───en-za
├───eo
├───es
├───es-es
├───es-us
├───et
├───fi
├───fr
├───fr-ca
├───fr-fr
├───hi
├───hr
├───hu
├───hy
├───id
├───is
├───it
├───ja
├───jw
├───km
├───ko
├───la
├───lv
├───mk
├───ml
├───mr
├───my
├───ne
├───nl
├───no
├───pl
├───pt
├───pt-br
├───pt-pt
├───ro
├───ru
├───si
├───sk
├───sq
├───sr
├───su
├───sv
├───sw
├───ta
├───te
├───th
├───tl
├───tr
├───uk
├───vi
├───zh-cn
└───zh-tw

Most of these audio classes have 347 MP3 files of ~5.783 minutes each. MP3 was long encumbered by patents, but according to Wikipedia, "If the longest-running patent mentioned in the aforementioned references is taken as a measure, then the MP3 technology became patent-free in the United States on 16 April 2017 when U.S. Patent 6,009,399, held by and administered by Technicolor, expired".

Checking files

find audio/ -type f | wc -l
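
The command above reports a single total. To see how the files are spread across classes, a per-directory count can help. The sketch below assumes, as the command does, that the dataset sits under `audio/` with one top-level directory per language class; the function name `count_files_per_class` is hypothetical and not part of the repository.

```python
from collections import Counter
from pathlib import Path


def count_files_per_class(root: str) -> Counter:
    """Count regular files under each top-level directory of root."""
    root_path = Path(root)
    counts = Counter()
    for path in root_path.rglob("*"):
        if not path.is_file():
            continue
        parts = path.relative_to(root_path).parts
        if len(parts) < 2:
            continue  # file sits directly under root, not inside a class directory
        counts[parts[0]] += 1  # first path component is the class directory
    return counts
```

Calling `count_files_per_class("audio")` returns a mapping from class directory name to file count, and summing its values gives the number of files that live inside class directories.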

Made with TAPAD

Have you used or seen TAPAD in a paper, project, or app? Add it here!

Maintainers

The dataset is regularly updated and maintained by:

Useful Resources

The textual data was collected from several sources, all of which are listed below:

LICENSE

Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

To view a copy of this license, visit NC-SA 4.0 or send a letter to Creative Commons, PO Box 1866, Mountain View, CA 94042, USA.