# Presentation Outline

***COMP47650 - Deep Learning - 2024/25 Spring***

> - **Name:** Yu Du
> - **Date:** May 5, 2025

For this final individual deep learning project, I implemented audio genre classification on the GTZAN dataset using a modern Transformer architecture. As explained in the report, I chose the Transformer because it has become the mainstream framework in modern AI and has been validated repeatedly in practice.

**Keywords:** Deep Learning, Transformer, Audio Genre Classification

**Outline**
- [Walkthrough](#1-walkthrough)
- [Model Explanation](#2-model-explanation)
- [Training Revelation](#3-training-revelation)
- [Test and Evaluation](#4-test-and-evaluation)
- [Conclusion](#5-conclusion)
- [Reference](#reference)

***

## 1 Walkthrough

```text
dl_individual_project/
        ├─ src/             # Source code storage directory
        │  ├─ data/         # Data processing source code storage directory
        │  │  ├─ __init__.py
        │  │  ├─ wav_dataset_loader.py
        │  │  └─ audio_preprocessor.py
        │  ├─ models/       # Model source code storage directory
        │  │  ├─ __init__.py
        │  │  ├─ pre_encoder.py
        │  │  ├─ encoder.py
        │  │  ├─ audio_classifier.py
        │  │  └─ multi_layer_audio_classifier.py
        │  ├─ utils/        # Encapsulation class source code storage directory
        │  │  ├─ __init__.py
        │  │  ├─ data_processor_model.py
        │  │  └─ gtzan_audio_classifier_model.py
        │  └─ main.py       # Program execution entry
       ...
```

During development, I organized the code along a typical data mining workflow. Data loading and preprocessing live in `data/`. Following the steps of building the model (see the architecture diagram below), the three sub-models (pre-encoder, encoder, and classifier) live in `models/`. Finally, on top of these two parts, I wrapped two utility classes in `utils/`, one for data handling and one for the whole model, so that external callers do not need to know the internal details.

## 2 Model Explanation

<div align=center><img src="./figs/architecture.png" style="width:300px" /></div>

<center><b>Figure.</b> Overall model architecture.</center>

For the model, I adopted the architecture shown in the diagram above for two reasons. First, it is built on the Transformer. Second, it has already worked well in Whisper, the speech transcription system developed by OpenAI. Choosing this proven design directly saves much of the experimental cost of trying out model structures and tuning parameters.
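To make the data flow concrete, here is a minimal shape trace of a Whisper-style encoder used as a classifier. It is a sketch under assumptions, not the project's code: the 80 mel bins, 3000 input frames, and two-convolution stem (kernel 3, second convolution with stride 2) follow Whisper's published encoder, while `d_model=256`, mean pooling before the classifier, and the function names are hypothetical choices for illustration. The 10 output classes correspond to GTZAN's 10 genres.

```python
def conv1d_out_len(length, kernel, stride=1, padding=0):
    """Output length of a 1-D convolution along the time axis."""
    return (length + 2 * padding - kernel) // stride + 1


def trace_shapes(n_mels=80, n_frames=3000, d_model=256, n_classes=10):
    """List the tensor shape (without batch dimension) after each stage.

    d_model and the mean-pooling head are assumptions, not the project's settings.
    """
    shapes = [("log-mel input", (n_mels, n_frames))]

    # Pre-encoder: first convolution keeps the time resolution.
    t = conv1d_out_len(n_frames, kernel=3, stride=1, padding=1)
    shapes.append(("conv1 + GELU", (d_model, t)))

    # Pre-encoder: second convolution halves the number of time steps.
    t = conv1d_out_len(t, kernel=3, stride=2, padding=1)
    shapes.append(("conv2 + GELU", (d_model, t)))

    # Encoder: self-attention blocks keep the (time, feature) shape.
    shapes.append(("positions + encoder blocks", (t, d_model)))

    # Classifier: pool over time, then project to genre logits.
    shapes.append(("mean pool over time", (d_model,)))
    shapes.append(("linear classifier", (n_classes,)))
    return shapes


for stage, shape in trace_shapes():
    print(f"{stage:28s} {shape}")
```

The stride-2 convolution halves the sequence that self-attention has to process, which matters here because attention cost grows quadratically with sequence length and computing power was limited.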

## 3 Training Revelation

During training, the biggest challenges were low accuracy, overfitting in some models, and long running times caused by limited computing power. To address them, I consulted many online resources and adopted several optimization techniques, such as data augmentation, automatic learning rate adjustment, and early stopping, which improved the model to some degree.
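Two of these techniques, automatic learning rate adjustment and early stopping, can be sketched in a few lines of framework-independent Python. This is an illustrative sketch, not the project's implementation: the class name `PlateauController`, the helper `simulate`, and all default values (halving factor, patience settings) are assumptions, and data augmentation is left out.

```python
class PlateauController:
    """Watches validation loss, lowers the learning rate on plateaus,
    and signals when training should stop early."""

    def __init__(self, lr=1e-3, factor=0.5, lr_patience=2,
                 stop_patience=5, min_lr=1e-6):
        self.lr = lr
        self.factor = factor                # multiplier applied on a plateau
        self.lr_patience = lr_patience      # bad epochs between LR reductions
        self.stop_patience = stop_patience  # bad epochs before stopping
        self.min_lr = min_lr
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch's validation loss; return True to stop training."""
        if val_loss < self.best:
            # Improvement: remember it and reset the plateau counter.
            self.best = val_loss
            self.bad_epochs = 0
            return False
        self.bad_epochs += 1
        if self.bad_epochs % self.lr_patience == 0:
            # No improvement for lr_patience epochs: shrink the learning rate.
            self.lr = max(self.lr * self.factor, self.min_lr)
        return self.bad_epochs >= self.stop_patience


def simulate(val_losses, **kwargs):
    """Feed a sequence of validation losses; return (stop_epoch, final_lr)."""
    controller = PlateauController(**kwargs)
    for epoch, loss in enumerate(val_losses):
        if controller.step(loss):
            return epoch, controller.lr
    return None, controller.lr


# Made-up validation losses that stop improving after epoch 2.
stop_epoch, final_lr = simulate([1.0, 0.8, 0.7, 0.72, 0.71, 0.73, 0.74, 0.75, 0.76])
print(stop_epoch, final_lr)
```

In a real training loop, the weights saved at the best validation loss would be restored after stopping, so that the overfitted later epochs are discarded.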

## 4 Test and Evaluation

**Table.** Highest accuracy of each experiment.

| Experiment | Training Set (%) | Validation Set (%) |
| :---: | :---: | :---: |
| (a) | 77.6 | 54.0 |
| (b) | 76.5 | 60.5 |
| (c) | 77.4 | 51.0 |
| (d) | 79.9 | 60.0 |

Judging from the final accuracy, all four model configurations reach stable and fairly good accuracy on the training set (76.5% to 79.9%), but differ considerably on the validation set (51.0% to 60.5%). One possible reason is model complexity: a more complex model does not necessarily predict better, especially since the dataset is small, with only about 1,000 audio samples. The loss curves also show that the models generalize weakly: in the final convergence stage of all four curves, a gap remains between the training loss and the validation loss.
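The size of the generalization gap can be read directly off the accuracy table. The short sketch below uses the table's values; the variable and function names are only illustrative.

```python
# (training accuracy %, validation accuracy %) for each experiment in the table
results = {
    "a": (77.6, 54.0),
    "b": (76.5, 60.5),
    "c": (77.4, 51.0),
    "d": (79.9, 60.0),
}


def accuracy_gaps(results):
    """Training minus validation accuracy, in percentage points."""
    return {name: round(train - val, 1) for name, (train, val) in results.items()}


gaps = accuracy_gaps(results)
for name, gap in gaps.items():
    print(f"({name}) gap = {gap} points")

smallest = min(gaps, key=gaps.get)
print(f"smallest gap: ({smallest})")
```

The gaps range from 16.0 points for (b) to 26.4 points for (c), and the experiment with the highest validation accuracy, (b), is also the one with the smallest gap.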

<div align=center><img src="./figs/loss_curve.png" style="width:600px" /></div>

<center><b>Figure.</b> Loss curves for four experiments.</center>

## 5 Conclusion

Overall, my main finding from this individual project is that, with a limited dataset, relatively lightweight models often perform better. At its best (about 60% validation accuracy), the model performs roughly on par with the human ability to distinguish audio genres. Due to time constraints and other factors, many hyperparameters remain untested, and several problems in the model still need to be fixed. In future work, I plan to keep testing and tuning the hyperparameters and to explore potential improvements to the model architecture, with the goal of a more stable and more accurate classifier.

## Reference

- Mogonediwa, K. (2024). Music genre classification: Training an AI model. *arXiv*. https://arxiv.org/abs/2405.15096
- Tzanetakis, G., & Cook, P. (2002). Musical genre classification of audio signals. *IEEE Transactions on Speech and Audio Processing*, *10*(5), 293–302. https://doi.org/10.1109/TSA.2002.800560
- Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention is all you need. In *Advances in Neural Information Processing Systems* (Vol. 30, pp. 5998–6008). Curran Associates, Inc. https://papers.nips.cc/paper/7181-attention-is-all-you-need.pdf
- Radford, A., Kim, J. W., Xu, T., Brockman, G., McLeavey, C., & Sutskever, I. (2023). Robust speech recognition via large-scale weak supervision. In *Proceedings of the 40th International Conference on Machine Learning* (Vol. 202, pp. 28492–28518). PMLR. https://proceedings.mlr.press/v202/radford23a.html
