Large, high-quality datasets are critical to the performance, fairness, robustness, reliability, and scalability of ML systems, yet data scientists often lack datasets diverse and large enough to train and test the machine learning models they design. This project develops a tool that posts and receives text and audio files to and from a data lake, applies transformations in a distributed manner, and loads the results into a warehouse in a format suitable for training a speech-to-text model. The general objective of the project is to develop a data engineering pipeline using Apache Kafka, Apache Spark, and Apache Airflow that allows the collection of millions of Amharic and Swahili audio recordings from speakers reading digital text on app and web platforms. These recordings can be used to produce a large and diverse dataset for training and testing speech-to-text models.
The proposed data pipeline is built on Apache Kafka, an open-source distributed event streaming platform. By combining messaging, storage, and stream processing, the pipeline allows real-time audio datasets to be collected, stored, and analyzed. It consists of the following key components (illustrative sketches of the producer and the Spark preprocessor follow the list):
- Data producers
- Data consumers
- Apache Kafka cluster
- Amazon S3 bucket Connectors
- Apache Spark Stream preprocessors
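As a concrete illustration of the producer side, here is a minimal sketch, assuming the `kafka-python` client, a broker at `localhost:9092`, and a hypothetical topic named `audio-uploads`; the event fields are illustrative only and not fixed by the project.

```python
# Minimal producer sketch (assumptions: kafka-python, localhost broker,
# hypothetical "audio-uploads" topic and event schema).
import json
from pathlib import Path

from kafka import KafkaProducer  # pip install kafka-python

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    # Serialize event dicts as UTF-8 JSON bytes.
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def publish_recording(audio_path: str, transcript: str, language: str) -> None:
    """Publish one recording event. The raw audio itself would typically
    live in the data lake (S3); only a lightweight reference is streamed."""
    event = {
        "audio_key": Path(audio_path).name,  # e.g. the S3 object key
        "transcript": transcript,
        "language": language,  # "am" (Amharic) or "sw" (Swahili)
    }
    producer.send("audio-uploads", value=event)

publish_recording("recordings/sample_001.wav", "ሰላም ለዓለም", "am")
producer.flush()  # block until buffered events are delivered
```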
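On the consuming side, the Spark stream preprocessor can subscribe to the same topic with Structured Streaming. The sketch below makes the same broker and topic assumptions and additionally assumes PySpark with the `spark-sql-kafka` connector on the classpath; the parsing schema mirrors the hypothetical event above.

```python
# Minimal Spark Structured Streaming sketch (assumptions: PySpark with the
# spark-sql-kafka package, the same hypothetical broker/topic as above).
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StringType, StructField, StructType

spark = SparkSession.builder.appName("stt-preprocessor").getOrCreate()

# Schema matching the producer's illustrative event payload.
schema = StructType([
    StructField("audio_key", StringType()),
    StructField("transcript", StringType()),
    StructField("language", StringType()),
])

events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "audio-uploads")
    .load()
    # Kafka values arrive as bytes; decode and parse the JSON payload.
    .select(from_json(col("value").cast("string"), schema).alias("event"))
    .select("event.*")
)

# A real job would transform the referenced audio and load it into the
# warehouse; the console sink is used here only to show the wiring.
query = events.writeStream.format("console").start()
query.awaitTermination()
```

Streaming only lightweight references through Kafka, while the raw audio stays in the S3 data lake, is a common design that keeps broker storage small and lets Spark fetch the audio in bulk during transformation.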
The repository is structured as follows:

- `images/`: the folder where all snapshots for the project are stored.
- `logs/`: the folder where script logs are stored.
- `data/`: the folder where the dataset files are stored.
- `.github/`: the folder where GitHub Actions and unit tests are integrated.
- `cml.yaml`: the file where the CML configuration is stored.
- `.vscode/`: the folder where local paths are stored.
- `notebooks/`: the folder containing a Jupyter notebook for preprocessing the data.
- `scripts/`: the folder where modules are stored.
- `tests/`: the folder containing unit tests for the scripts.
- `requirements.txt`: a text file listing the project's dependencies.
- `.travis.yml`: a Travis CI configuration file for unit tests.
- `setup.py`: a configuration file for installing the scripts as a package.
- `results.txt`: a text file containing the results of the CML report.
- `README.md`: a Markdown file with a brief explanation of the project and the repository structure.
To set up the project, create and activate a conda environment, clone the repository, and install the scripts as a package:

```bash
conda create --name stt python==3.8
conda activate stt
git clone https://github.com/Speech-to-text-Kafka-Airflow-Spark/StoTkas.git
cd StoTkas
sudo python3 setup.py install
```
Then bring up the services with Docker Compose:

```bash
docker-compose -f docker-compose.yml up -d
docker-compose -f docker-compose1.yml up -d
```