
Podcast Transcription Data Pipeline using Apache Airflow

Project Overview

In this project, we'll create a data pipeline using Apache Airflow to download podcast episodes and automatically transcribe them using speech recognition. The results will be stored in a SQLite database, making it easy to query and analyze the transcribed podcast content.

(Pipeline diagram)

While this project doesn't strictly require the use of Apache Airflow, it offers several advantages:

  • We can schedule the pipeline to run daily.
  • Each task runs independently, and we get per-task error logs for troubleshooting.
  • Tasks can be easily parallelized, and the pipeline can run in the cloud if needed.
  • It provides extensibility for future enhancements, such as more advanced speech recognition or summarization.

By the end of this project, you'll have a solid understanding of how to use Apache Airflow, along with a practical project that can serve as a foundation for further development.

Project Steps

  1. Download and Parse the Podcast Metadata XML

    • Obtain metadata for the podcast episodes by downloading and parsing the feed's XML file.
  2. Create a SQLite Database for Podcast Metadata

    • Set up a SQLite database to store the podcast metadata.
  3. Download Podcast Audio Files Using Requests

    • Download the podcast audio files from their sources using the Python requests library.
  4. Transcribe Audio Files Using Vosk

    • Transcribe the audio with the Vosk speech recognition library. (A minimal DAG sketch tying these steps together follows this list.)
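
Here is a minimal sketch of how these steps could be wired together as an Airflow DAG. The feed URL, database path, table schema, and task names below are illustrative assumptions, not the project's actual values; the real implementation lives in the code directory.

```python
# A sketch only, assuming Airflow 2.x (TaskFlow API) plus the requests,
# xmltodict, and vosk packages. The feed URL, database path, and schema
# are illustrative placeholders, not the project's actual values.
import sqlite3

import pendulum
import requests
import xmltodict
from airflow.decorators import dag, task

FEED_URL = "https://example.com/feed/podcast"  # placeholder feed URL
DB_PATH = "episodes.db"                        # placeholder SQLite file


@dag(schedule="@daily", start_date=pendulum.datetime(2023, 1, 1), catchup=False)
def podcast_summary():

    @task()
    def get_episodes():
        # Step 1: download the feed and parse its XML into a list of episodes.
        data = requests.get(FEED_URL).text
        feed = xmltodict.parse(data)
        return feed["rss"]["channel"]["item"]

    @task()
    def load_episodes(episodes):
        # Step 2: store episode metadata in SQLite, skipping rows we already have.
        conn = sqlite3.connect(DB_PATH)
        conn.execute(
            "CREATE TABLE IF NOT EXISTS episodes "
            "(link TEXT PRIMARY KEY, title TEXT, description TEXT, transcript TEXT)"
        )
        for ep in episodes:
            conn.execute(
                "INSERT OR IGNORE INTO episodes (link, title, description) "
                "VALUES (?, ?, ?)",
                (ep["link"], ep["title"], ep["description"]),
            )
        conn.commit()
        conn.close()
        return [ep["enclosure"]["@url"] for ep in episodes]

    @task()
    def download_episodes(audio_urls):
        # Step 3: fetch each audio file with requests and save it locally.
        paths = []
        for url in audio_urls:
            filename = url.split("/")[-1].split("?")[0]
            with open(filename, "wb") as f:
                f.write(requests.get(url).content)
            paths.append(filename)
        return paths

    # Step 4 (Vosk transcription) would be one more task chained after
    # download_episodes; it is sketched separately in the Data section below.
    download_episodes(load_episodes(get_episodes()))


podcast_summary()
```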

Getting Started

Local Setup

Before you begin, make sure you have Apache Airflow installed locally. Please follow the Airflow installation guide to set it up.

Data

During the project, we'll download the required data, including a language model for Vosk and podcast episodes. If you wish to explore the podcast metadata, you can find it here.
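
For the transcription step, here is a hedged sketch of how Vosk might be used. It assumes the downloaded language model has been unpacked into a local model/ directory and that pydub (with ffmpeg) is available to convert the MP3 audio into the 16 kHz mono PCM that Vosk recognizers expect; the function name and chunking strategy are illustrative, not the project's exact code.

```python
# A sketch of transcribing one episode with Vosk; paths, the helper name,
# and the chunking strategy are illustrative assumptions.
import json

from pydub import AudioSegment          # needs ffmpeg installed for MP3 decoding
from vosk import Model, KaldiRecognizer

FRAME_RATE = 16000  # Vosk models are typically trained on 16 kHz audio


def transcribe(mp3_path: str, model_dir: str = "model") -> str:
    model = Model(model_dir)
    rec = KaldiRecognizer(model, FRAME_RATE)
    rec.SetWords(True)

    # Convert the MP3 to 16 kHz mono PCM, the format the recognizer expects.
    audio = AudioSegment.from_mp3(mp3_path)
    audio = audio.set_channels(1).set_frame_rate(FRAME_RATE)

    # Feed the audio in 60-second slices and collect the recognized text.
    transcript = ""
    step = 60 * 1000  # pydub indexes audio in milliseconds
    for start in range(0, len(audio), step):
        segment = audio[start:start + step]
        rec.AcceptWaveform(segment.raw_data)
        result = json.loads(rec.Result())
        transcript += result.get("text", "") + " "
    return transcript.strip()
```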

Code

You can access the project code in the code directory.

Project Screenshots

  • Airflow SQLite database connection (screenshot)

  • DAG (screenshot)

  • Get Episodes task output (screenshot)

Project Usage

To run the data pipeline, follow the steps provided in the steps.md file.
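
Once the pipeline has finished, the transcripts are ordinary rows in SQLite and can be explored with a few lines of Python. A quick illustration, reusing the placeholder database path and schema from the DAG sketch above:

```python
# Assumes the placeholder episodes table from the DAG sketch above.
import sqlite3

conn = sqlite3.connect("episodes.db")
rows = conn.execute(
    "SELECT title, substr(transcript, 1, 80) FROM episodes "
    "WHERE transcript IS NOT NULL LIMIT 5"
).fetchall()
for title, snippet in rows:
    print(f"{title}: {snippet}...")
conn.close()
```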
