The Abuse Project Audio Dataset (TAPAD)

World's largest profanity audio dataset

Dataset consists of 26,365 audio files
Click here for documentation

See The Abuse Project

TAPAD (∿) is an open dataset, meaning it will grow over time as more data is contributed. To enable reproducibility and accurate citation, the dataset is versioned using git tags.

Current Status & ID3

Category	Const
Total files	`26,365`
Dataset updated	`July 30, 2019`
Language classes	`75`
File Type	MP3
Mime Type	audio/mpeg
Mpeg Audio Version	2
Audio Layer	3
Audio Bitrate	32 kbps
Sample Rate	24000
Channel Mode	Single Channel
Ms Stereo	Off
Intensity Stereo	Off
Codec Type	audio
Codec Time Base	1/24000
Codec Tag	0x0000
Sample Fmt	fltp
Sample Rate	24000
Channels	1
Channel Layout	mono
Bits Per Sample	0
R Frame Rate	0/0
Avg Frame Rate	0/0
Time Base	1/14112000
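
The container-level values in the table above (MPEG version, layer, bitrate, sample rate, channel mode) are all encoded in the 4-byte header that starts every MPEG audio frame. The following is a minimal sketch, not part of the TAPAD tooling, showing how those fields could be decoded to spot-check a file. The function name `parse_mpeg2_layer3_header` is hypothetical, and the sketch assumes the caller passes the 4 bytes of an actual frame header (it does not skip a leading ID3v2 tag and only handles MPEG-2 Layer III).

```python
# Lookup tables from the MPEG audio frame header layout.
MPEG_VERSIONS = {0b00: "2.5", 0b10: "2", 0b11: "1"}  # 0b01 is reserved
LAYERS = {0b01: 3, 0b10: 2, 0b11: 1}                 # 0b00 is reserved
# Bitrates in kbps for MPEG-2 Layer III; index 0 is "free", index 15 is invalid.
V2_L3_BITRATES = [None, 8, 16, 24, 32, 40, 48, 56, 64, 80, 96, 112, 128, 144, 160, None]
V2_SAMPLE_RATES = [22050, 24000, 16000, None]        # index 3 is reserved
CHANNEL_MODES = ["Stereo", "Joint Stereo", "Dual Channel", "Single Channel"]


def parse_mpeg2_layer3_header(header: bytes) -> dict:
    """Decode the main fields of a 4-byte MPEG-2 Layer III frame header (hypothetical helper)."""
    if len(header) < 4:
        raise ValueError("need at least 4 bytes")
    word = int.from_bytes(header[:4], "big")
    # The top 11 bits must all be set (frame sync).
    if word >> 21 != 0x7FF:
        raise ValueError("no frame sync found")
    version = MPEG_VERSIONS.get((word >> 19) & 0b11)
    layer = LAYERS.get((word >> 17) & 0b11)
    if version != "2" or layer != 3:
        raise ValueError("this sketch only handles MPEG-2 Layer III")
    return {
        "version": version,
        "layer": layer,
        "bitrate_kbps": V2_L3_BITRATES[(word >> 12) & 0xF],
        "sample_rate": V2_SAMPLE_RATES[(word >> 10) & 0b11],
        "channel_mode": CHANNEL_MODES[(word >> 6) & 0b11],
    }
```

For a header matching the table (MPEG version 2, layer 3, 32 kbps, 24000 Hz, single channel), the first four bytes would be `FF F3 44` followed by a byte whose top two bits are set, for example `C4`.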

Language class names are required to be two letters, normally the language's two-letter ISO 639-1 code (see: ISO_639-1).
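
A naming rule like this is easy to check mechanically. The sketch below is an illustration only: the helper name `is_valid_language_dir` is hypothetical, and the pattern assumes lowercase names with an optional two-letter regional suffix (such as `en-au` or `zh-cn`), since such names appear in the directory structure alongside the plain two-letter codes. It checks the shape of the name only, not whether the code is actually assigned in ISO 639-1.

```python
import re

# Two lowercase letters, optionally followed by "-" and a two-letter region.
# The regional suffix is an assumption based on directory names like "pt-br".
LANG_DIR_PATTERN = re.compile(r"[a-z]{2}(-[a-z]{2})?")


def is_valid_language_dir(name: str) -> bool:
    """Return True if `name` looks like a language class directory name."""
    return LANG_DIR_PATTERN.fullmatch(name) is not None
```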

Scripts & Utilities

Filename	Location	Description	Type
`record.py`	`acquire\custom`	Records audio in WAV format (default: 3 sec)	Helper script
`wingen.py`	`acquire\generate`	TTS conversion using `SAPI.SpVoice`	Helper script
`gTTSgen.py`	`acquire\generate`	TTS conversion using gTTS & `abuse 0.1.1`	Helper script
`gspectogram.py`	`utils`	Generates spectrogram of a wav file	Utility tool

Structure

.
├───af
├───ar
├───bn
├───bs
├───ca
├───cs
├───cy
├───da
├───de
├───el
├───en
│   ├───1 (340 wav files)
│   └───2
├───en-au
├───en-ca
├───en-gb
├───en-gh
├───en-ie
├───en-in
├───en-ng
├───en-nz
├───en-ph
├───en-tz
├───en-uk
├───en-us
├───en-za
├───eo
├───es
├───es-es
├───es-us
├───et
├───fi
├───fr
├───fr-ca
├───fr-fr
├───hi
├───hr
├───hu
├───hy
├───id
├───is
├───it
├───ja
├───jw
├───km
├───ko
├───la
├───lv
├───mk
├───ml
├───mr
├───my
├───ne
├───nl
├───no
├───pl
├───pt
├───pt-br
├───pt-pt
├───ro
├───ru
├───si
├───sk
├───sq
├───sr
├───su
├───sv
├───sw
├───ta
├───te
├───th
├───tl
├───tr
├───uk
├───vi
├───zh-cn
└───zh-tw

Most of these audio classes contain 347 MP3 files of ~5.783 minutes each. MP3 was long encumbered by patents, but according to Wikipedia, "If the longest-running patent mentioned in the aforementioned references is taken as a measure, then the MP3 technology became patent-free in the United States on 16 April 2017 when U.S. Patent 6,009,399, held by and administered by Technicolor, expired".

Checking files

find audio/ -type f | wc -l
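
The `find` command above gives only the total. To see how the files are distributed across language classes, a per-directory count can help. The following is a minimal sketch using only the Python standard library; the function name `count_files_per_class` is hypothetical, and it assumes the layout shown under Structure, where each top-level directory below `audio/` is one language class.

```python
from collections import Counter
from pathlib import Path


def count_files_per_class(root: str = "audio") -> Counter:
    """Count files under each top-level language directory of `root`."""
    root_path = Path(root)
    counts = Counter()
    for path in root_path.rglob("*"):
        if not path.is_file():
            continue
        parts = path.relative_to(root_path).parts
        # Skip files that sit directly in the root, outside any language class.
        if len(parts) < 2:
            continue
        # The first component is the language class, e.g. "en" or "pt-br".
        counts[parts[0]] += 1
    return counts
```

Calling `sum(count_files_per_class("audio").values())` should agree with the `find` total as long as no files sit directly in `audio/` itself.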

Made with TAPAD

Did you use or see TAPAD in a paper, project, or app? Add it here!

The Abuse Project
(...)

Maintainers

The dataset is regularly updated and maintained by:

Piyush Raj (@0x48piraj)

Useful Resources

The textual data was collected from several sources, all of which are listed below:

LICENSE

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

To view a copy of this license, visit NC-SA 4.0 or send a letter to Creative Commons, PO Box 1866, Mountain View, CA 94042, USA.

