Stolarski-Lukasz/PCA

Phonetic Corpus of Audiobooks

The Phonetic Corpus of Audiobooks (PCA) is a linguistic corpus for phonetic and acoustic research on speech and articulation. It pairs audio recordings of audiobooks with their corresponding text versions. The materials were downloaded from librivox.org and gutenberg.org, then segmented and synchronized using Python scripts and the aeneas library.
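
To give a rough idea of what synchronized text and audio look like in practice, here is a minimal sketch that reads a sync map and looks up the time span of every fragment containing a given word. It assumes the alignment was exported in the JSON sync map format that aeneas can produce (a `fragments` list whose entries carry `id`, `begin`, `end`, and `lines`). The sample data and the `find_fragments` helper are hypothetical illustrations, not part of this repository, and the actual PCA scripts may store alignments differently.

```python
import json

# Hypothetical sample in the style of an aeneas JSON sync map.
# The sentences and timings are made up for illustration only.
SAMPLE_SYNC_MAP = """
{
  "fragments": [
    {"id": "f000001", "begin": "0.000", "end": "2.640",
     "language": "eng", "children": [],
     "lines": ["This is a sample sentence."]},
    {"id": "f000002", "begin": "2.640", "end": "5.120",
     "language": "eng", "children": [],
     "lines": ["A second sample follows, slightly longer."]},
    {"id": "f000003", "begin": "5.120", "end": "7.000",
     "language": "eng", "children": [],
     "lines": ["The last one ends here."]}
  ]
}
"""


def find_fragments(sync_map, word):
    """Return (id, begin_s, end_s, text) for fragments containing `word`.

    Matching is case-insensitive and works on whole tokens after
    stripping common punctuation from both ends of each token.
    """
    needle = word.lower()
    hits = []
    for fragment in sync_map["fragments"]:
        text = " ".join(fragment["lines"])
        tokens = [t.strip(".,;:!?\"'()").lower() for t in text.split()]
        if needle in tokens:
            # Times are stored as strings of seconds; convert to float.
            hits.append(
                (fragment["id"], float(fragment["begin"]),
                 float(fragment["end"]), text)
            )
    return hits


sync_map = json.loads(SAMPLE_SYNC_MAP)
for frag_id, begin, end, text in find_fragments(sync_map, "sample"):
    print(f"{frag_id}: {begin:.3f}-{end:.3f} s  {text}")
```

The begin and end times returned for each hit could then be used to cut the matching stretch out of the audio file for acoustic analysis.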

This GitHub repository contains the code for the corpus and a small part of the database. To explore the full database of over 100 audiobooks, visit pca.clarin-pl.eu.

Users can search for specific words and phrases or speech sound combinations in the corpus, and narrow their search by author, reader, and text criteria. For more information on the purpose and materials used in the creation of the corpus, as well as background data and corpus statistics, see the full documentation available at the project's website.