This week, 10 Academy is your client. Recognizing the value of large data sets for speech-to-text work, the opportunity presented by the many text corpora available for both languages, and the fact that complex data-engineering skills make your profile valuable to employers, this week's task is simple: design and build a robust, large-scale, fault-tolerant, highly available Kafka cluster that can be used to post a sentence and receive an audio file. By the end of this project, you should produce a tool that can be deployed to post and receive text and audio files from and into a data lake, apply transformations in a distributed manner, and load the results into a warehouse in a format suitable for training a speech-to-text model.
The data source for this project is the Amharic news text classification dataset with baseline performance: [dataset](https://github.com/IsraelAbebe/An-Amharic-News-Text-classification-Dataset).
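Since the dataset contains whole news articles but the pipeline posts individual sentences, the articles need to be split first. A simple sketch (the helper name is ours; Amharic text conventionally ends sentences with the Ethiopic full stop `።`):

```python
def split_sentences(article: str) -> list:
    # Split an Amharic article into sentences on the Ethiopic
    # full stop "።", dropping empty fragments and whitespace.
    return [s.strip() for s in article.split("።") if s.strip()]
```

Each resulting sentence can then be posted to the Kafka topic as its own message.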
- Install and run Kafka
- Install Airflow
- Install Spark
To install and run this project:

```
git clone https://github.com/STT-data-collection/STT-Data-Collection.git
cd STT-Data-Collection
pip install -r requirements.txt
```
- This folder holds Python script files for Airflow DAGs
- This folder holds the data of the project (data is stored on Google Drive using DVC)
- This folder holds the backend Flask API
- This folder holds the front end of the project, built with React.js
- This folder holds Python script files that define the producer, consumer, and topic, and manage the Kafka cluster
- This folder holds log data of the project
- This folder holds prediction models
- This folder holds demonstrations of the project
- This folder holds images of parts of the project
- This folder holds prediction model scripts
- This folder holds test files
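The producer and consumer scripts described above presumably agree on a message format for posted sentences. A hedged sketch of one possible JSON envelope (the field names are assumptions, not the project's actual schema; Kafka producers and consumers exchange raw bytes, so both directions are shown):

```python
import json
import uuid

def encode_sentence(text: str) -> bytes:
    # Serialize a sentence into the JSON bytes a producer would post.
    # An "id" field lets the consumer name the matching audio file.
    record = {"id": str(uuid.uuid4()), "text": text}
    return json.dumps(record, ensure_ascii=False).encode("utf-8")

def decode_sentence(raw: bytes) -> dict:
    # Inverse operation on the consumer side.
    return json.loads(raw.decode("utf-8"))
```

With kafka-python, for example, `encode_sentence` would feed `KafkaProducer.send(topic, value=...)`, and `decode_sentence` would be applied to each consumed record's `value`.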