Large, high-quality datasets are critical to the performance, fairness, robustness, reliability, and scalability of ML systems, yet data scientists often lack datasets diverse and large enough to train and test the machine learning models they design. This project develops a tool that posts and receives text and audio files to and from a data lake, applies transformations in a distributed manner, and loads the results into a warehouse in a format suitable for training a speech-to-text model. The general objective of the project is to develop a data engineering pipeline using Apache Kafka, Apache Spark, and Apache Airflow that allows the collection of millions of Amharic and Swahili audio recordings from speakers reading digital text on app and web platforms. These recordings can be used to produce a large and diverse dataset for training and testing speech-to-text models.
The proposed data pipeline is built on Apache Kafka, an open-source distributed event streaming platform. By combining messaging, storage, and stream processing, the pipeline allows real-time audio datasets to be collected, stored, and analyzed. It consists of the following key components (illustrative sketches of the producer and the Spark preprocessor follow the list):
- Data producers
- Data consumers
- Apache Kafka cluster
- Amazon S3 bucket Connectors
- Apache Spark Stream preprocessors
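As a concrete illustration of the producer side, here is a minimal sketch, assuming the `kafka-python` client, a broker at `localhost:9092`, and a hypothetical topic named `audio-uploads`; the event fields are illustrative only and not fixed by the project.

```python
# Minimal producer sketch (assumptions: kafka-python, localhost broker,
# hypothetical "audio-uploads" topic and event schema).
import json
from pathlib import Path

from kafka import KafkaProducer  # pip install kafka-python

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    # Serialize event dicts as UTF-8 JSON bytes.
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def publish_recording(audio_path: str, transcript: str, language: str) -> None:
    """Publish one recording event. The raw audio itself would typically
    live in the data lake (S3); only a lightweight reference is streamed."""
    event = {
        "audio_key": Path(audio_path).name,  # e.g. the S3 object key
        "transcript": transcript,
        "language": language,  # "am" (Amharic) or "sw" (Swahili)
    }
    producer.send("audio-uploads", value=event)

publish_recording("recordings/sample_001.wav", "ሰላም ለዓለም", "am")
producer.flush()  # block until buffered events are delivered
```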
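On the consuming side, the Spark stream preprocessor can subscribe to the same topic with Structured Streaming. The sketch below makes the same broker and topic assumptions and additionally assumes PySpark with the `spark-sql-kafka` connector on the classpath; the parsing schema mirrors the hypothetical event above.

```python
# Minimal Spark Structured Streaming sketch (assumptions: PySpark with the
# spark-sql-kafka package, the same hypothetical broker/topic as above).
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StringType, StructField, StructType

spark = SparkSession.builder.appName("stt-preprocessor").getOrCreate()

# Schema matching the producer's illustrative event payload.
schema = StructType([
    StructField("audio_key", StringType()),
    StructField("transcript", StringType()),
    StructField("language", StringType()),
])

events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "audio-uploads")
    .load()
    # Kafka values arrive as bytes; decode and parse the JSON payload.
    .select(from_json(col("value").cast("string"), schema).alias("event"))
    .select("event.*")
)

# A real job would transform the referenced audio and load it into the
# warehouse; the console sink is used here only to show the wiring.
query = events.writeStream.format("console").start()
query.awaitTermination()
```

Streaming only lightweight references through Kafka, while the raw audio stays in the S3 data lake, is a common design that keeps broker storage small and lets Spark fetch the audio in bulk during transformation.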
The repository is structured as follows:

- `images/`: the folder where all snapshots for the project are stored.
- `logs/`: the folder where script logs are stored.
- `data/`: the folder where the dataset files are stored.
- `.github/`: the folder where GitHub Actions and unit tests are integrated.
- `cml.yaml`: the file where the CML configuration is stored.
- `.vscode/`: the folder where local paths are stored.
- `notebooks/`: the folder containing a Jupyter notebook for preprocessing the data.
- `scripts/`: the folder where modules are stored.
- `tests/`: the folder containing unit tests for the scripts.
- `requirements.txt`: a text file listing the project's dependencies.
- `.travis.yml`: a Travis CI configuration file for unit tests.
- `setup.py`: a configuration file for installing the scripts as a package.
- `results.txt`: a text file containing the results of the CML report.
- `README.md`: a Markdown file with a brief explanation of the project and the repository structure.
To set up the project, create and activate a conda environment, clone the repository, and install the scripts as a package:

```bash
conda create --name stt python==3.8
conda activate stt
git clone https://github.com/Speech-to-text-Kafka-Airflow-Spark/StoTkas.git
cd StoTkas
sudo python3 setup.py install
```
Then bring up the services with Docker Compose:

```bash
docker-compose -f docker-compose.yml up -d
docker-compose -f docker-compose1.yml up -d
```