This was my project for the Machine Learning course, during my Master, and it consisted of using deep learning for speech recognition. More specifically, recognizing which word is being played on an audio track.
I tried the experiment using the two main audio features: spectrograms and MFCCs (Mel Frequency Cepstral Coefficients). To run the implementation, first download the dataset (more instructions in the dataset folder) and run one of the prepare_dataset.py
files, depending on which feature you want to use. This python
script will create a file called data.json
in which there are the features that will be used to train the model. Then run the corresponding train.py
file to train the model. When it has finished, the model will be saved (I provide two models already trained, model_spectograms.h5
and model_mfccs.h5
). Finally, put in the test folder the tracks you want to make predictions about and run the corresponding predictions.py
file and change the path to the track in the main
function.
Here I show the loss and accuracy curves I got when I did the project.
Curves using spectrograms
Curves using MFCCs