This is the data and code behind the Data Stories Podcast Archive.
It is based on automatic audio transcriptions of the 170 episodes we have recorded since February 2012.
We used AssemblyAI (https://www.assemblyai.com/) to transcribe the audio files.
The data folder contains both the unprocessed output of the automatic transcription and the revised versions. The automatic transcription works very well for common English words, but it has limitations with proper names and domain-specific terms, so we used custom word replacement lists to fix some of the most apparent misspellings.
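As a rough illustration of the word replacement step, here is a minimal Python sketch. It is not the code used for this archive: the `REPLACEMENTS` entries, the `build_replacer` helper, and the dictionary format are all assumptions made for the example, and the actual replacement lists in the data folder may be structured differently.

```python
import re

# Hypothetical replacement list: misheard form -> intended spelling.
# These entries are invented for illustration only.
REPLACEMENTS = {
    "more it's stefaner": "Moritz Stefaner",
    "d three": "D3",
}


def build_replacer(replacements):
    """Return a function that applies all replacements in a single pass."""
    # Lowercased lookup so matching can be case-insensitive.
    lookup = {wrong.lower(): right for wrong, right in replacements.items()}
    # Longest phrases first, so a longer phrase wins over a shorter one it contains.
    alternatives = sorted(lookup, key=len, reverse=True)
    # The lookarounds keep matches on whole words ("and three" is left alone).
    pattern = re.compile(
        r"(?<!\w)(?:" + "|".join(re.escape(a) for a in alternatives) + r")(?!\w)",
        re.IGNORECASE,
    )

    def replace(text):
        return pattern.sub(lambda m: lookup[m.group(0).lower()], text)

    return replace


fix = build_replacer(REPLACEMENTS)
print(fix("We talked to more it's Stefaner about d three."))
```

Doing all replacements in one pass avoids the case where the output of one replacement is accidentally rewritten by a later one.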
The web application is a custom frontend for this dataset.
It is accessible online at https://archive.datastori.es/
It lets you browse and search the transcripts and play back the corresponding audio files.
You can also explore and download the data (along with some additional download statistics) in an Observable Notebook.
- Miska Knapek (Data processing)
- Moritz Stefaner (Concept, design and web code)