##### Project ID: 028 

##### Project Title: Environmental Sound Classification 

##### Area of Research: Computer Vision

##### Team name: NNDL

##### Team members:

Sheng Zhang(z5446399)

Zelong Huang(z5489331) 

Shanyu Zhou(z5466581) 

Lingyun Yan(z5467937)

### 1. Introduction, Motivation and Problem Statement
#### 1.1 Introduction
<img src="./picture/esc50.gif" alt="final" width="600">

In today's information age, the field of machine learning has seen rapid technological advancement and significant achievements in various fields, although research in speech and music processing has progressed, the analysis of ambient sound is still relatively lagging behind. Environmental Sound Classification (ESC), as one of the important application areas, is attracting more and more researchers' attention. 

Environmental sound refers to a variety of non-speech sounds produced by nature and human activities, such as animal calls, traffic noise, natural environmental sounds (e.g., rain, wind), and home environmental sounds such as washing machine sound, door bell. These sounds contain a wealth of information and provide important clues about the state of the environment, the type of activity, and so on.

#### 1.2 Motivation
With the rapid development of deep learning technology, which has made breakthroughs in the fields of image, speech and natural language processing, ambient sound classification has a wide range of application prospects and important significance in the real world. Ambient sound classification technology is widely used in many fields, such as audio surveillance systems to detect and respond to security threats in a timely manner, hearing aids using sound classification technology to enhance the user's understanding of ambient sound, smart homes where sound classification is used for home security monitoring and user experience enhancement, video content generation where background sound effects can be added automatically, and automated driving to help vehicles recognize ambient sound signals to improve safety. 

In ecological monitoring, natural environmental sounds are analyzed to protect wildlife and identify environmental changes, and in medical diagnostics, sounds such as breathing and coughing are analyzed to assist doctors in early diagnosis and monitoring. For the research of this problem, this not only helps to promote the development of academic research, but also provides technical support and solutions for practical applications.

#### 1.3 Problem Statement
This study will address the problem of environmental sound classification, and the ESC-50 used provides a standardized benchmark dataset. In the task of environmental sound classification, the following problems are faced: the diversity and complexity of sound data, the influence of noise and sound quality of the environment, model selection and feature extraction, and generalization capabilities. 

Based on the above problems, the objectives of this study are: exploring and applying different deep learning models, optimizing feature extraction methods, and improving the robustness and generalization ability of the models. We hope that the study of the above problems can find the optimal strategy for environmental sound classification and provide valuable help for future research and applications.

### 2. Data Sources

#### 2.1 Dataset Description
The ESC-50 dataset is designed for the classification of environmental sounds. It contains a diverse range of audio clips representing various everyday sounds, it contains 2000 audio clips divided into 50 categories of 40 audio clips each. All audio clips are 5-second WAV files with a sampling rate of 44.1 kHz in mono.

#### 2.2 Specific Categories
This table lists the five major categories of the ESC-50 dataset and their corresponding subcategories.
<img src="./picture/截屏2024-07-21 10.51.50.png" alt="final" width="800">

#### 2.3 Download
The dataset can be downloaded from the following link：
https://github.com/karoldvl/ESC-50/archive/master.zip


### 3. Exploratory Analysis of Data

#### 3.1 Data properties
**Data Type:** The data for this project consists of direct audio signals and images using converted audio. to facilitate processing using image recognition models such as Convolutional Neural Networks (CNN).

#### 3.2 Number of categories
**Number of categories:** The dataset contains 50 categories, each with 40 audio files.
The project divides these audio files into dataset and test set, which are roughly categorized into 9:1, 8:2, 7:3 by looking at some common audio categorization papers.

#### 3.3 Preprocessing
##### 3.3.1 Audio signal preprocessing
+ **Normalization:** Normalize the audio signal to improve the stability and performance of subsequent processing and analysis, and reduce the effect of noise.
+ **Label Coding:** Use one-hot coding to convert labels to a categorical format to avoid sequential relationships between categories, improve model performance, and good compatibility.
##### 3.3.2 Image Preprocessing
+ **Gray Scale Processing:** Convert audio to gray scale image, it reduces computational complexity and storage requirements, reduces data redundancy, makes processing more concise and increases processing speed.
+ **Size Adjustment:** At first, our group learned from the paper that 1000*400 size image has good effect, so we initially used that size image, but the performance is not good in the actual test, adjusted to 224*224, in the subsequent test to get better results.

#### 3.4 Feature Extraction
We have learned that feature extraction is roughly done in the following three ways by reviewing related papers.

+ **STFT** 

STFT, full name Short-Time Fourier Transform, which is based on the principle of dividing the audio signal into small time segments, and applying Fourier Transform to the small time segments.STFT can capture the frequency characteristics of the sound in time, for example, in the ESC-50 dataset, STFT can capture the beginning and end of a dog barking, or the persistence pattern of the sound of rain.

<img src="./picture/stft.png" alt="final" width="600">

+ **MFCC** 

MFCC, full name Mel Frequency Cepstral Coefficients, which maps the linear spectrum onto the Mel scale after performing a Fourier transform on the audio. In the ESC-50 dataset, MFCC can effectively capture the timbral characteristics of sounds, such as distinguishing different animal calls, or different types of machine sounds.

<img src="./picture/mfcc.png" alt="final" width="600">

+ **Mel** 

The Mel spectrogram is the result of converting a regular spectrogram to the Mel frequency scale. It retains the time-frequency representation of STFT, but the frequency axes use the Meier scale, which is more compatible with the perception of the human ear. In the ESC-50 dataset, the Mel Spectrogram is effective in capturing time-frequency patterns of ambient sounds, such as the sustained high-frequency component of rain, or the transient high-energy signature of a car horn.

<img src="./picture/mel.png" alt="final" width="600">

#### 3.5 Data Augmentation
+ **Audio Data Enhancement:** Including time shifting, adding noise and other methods to improve the robustness and generalization ability of the model.
+ **Image Data Enhancement:** Including rotation, horizontal flipping, adding noise and other methods to improve the robustness of the model and generalization ability.

#### 3.6 Challenging aspects
+ **Category diversity:** the 50 categories cover a wide range of sound types, which increases the difficulty of categorization. 
+ **Fewer samples per category:** there are only 40 samples per category, which can lead to overfitting or underfitting problems. 
+ **Background noise:** real-world recordings may contain background noise, which increases the complexity of classification. 
+ **Intra-class variation:** there may be significant variation in sounds from the same category, e.g. different kinds of dog barks. 
+ **Inter-class similarity:** there may be similarities between certain classes, e.g. "rain" and "waves".


### 4. Basic Model
We chose different models based on traditional machine learning direction, CNN direction, RNN direction, for different models respectively change their feature extraction method, image size, training set to test set ratio, optimizer selection, optimizer learning rate, training rounds and run out multiple sets of results for comparison.

##### Evaluation of model results
And we also output loss vs. rounds plots, confusion matrices, accuracy, f1-score, and runtime for each training to evaluate the model's results.

For solving this problem, based on previous learning experiences with neural networks as well as machine learning, the first thing that comes to mind is to convert audio into pictures, and transform the problem of categorizing sounds into a problem of categorizing pictures.

So we convert all the audio into images using different feature extraction methods, here is an example, we convert the audio into a Mel Spectrogram with a size of 224*224.

In [None]:
import os
import numpy as np
import matplotlib.pyplot as plt
import librosa
import librosa.display

def wav_to_melspectrogram_image(wav_file, output_folder):
    y, sr = librosa.load(wav_file, sr=None)  # Load the audio file
    S = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128)  # Compute the Mel spectrogram
    S_DB = librosa.power_to_db(S, ref=np.max)  # Convert to dB scale
    
    os.makedirs(output_folder, exist_ok=True)  # Ensure output folder exists
    output_file = os.path.join(output_folder, f"{os.path.splitext(os.path.basename(wav_file))[0]}.png")
    
    # Plot and save the Mel spectrogram
    fig, ax = plt.subplots(figsize=(2.24, 2.24), dpi=100)
    ax.set_axis_off()
    librosa.display.specshow(S_DB, sr=sr, x_axis='time', y_axis='mel', cmap='viridis', ax=ax)
    plt.subplots_adjust(left=0, right=1, top=1, bottom=0)
    fig.savefig(output_file, bbox_inches='tight', pad_inches=0)
    plt.close(fig)

def process_all_wav_files(input_folder, output_folder):
    for wav_file in os.listdir(input_folder):
        if wav_file.endswith('.wav'):
            wav_to_melspectrogram_image(os.path.join(input_folder, wav_file), output_folder)

#### 4.1 Traditional machine learning
For the problem of classifying audio images, we first considered classification using traditional machine learning methods. We used three traditional machine learning methods (KNN, SVM, Random Forest) for testing and analyzing the final results.
Here is the core code for all three models.

In [None]:
import os
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score
from tensorflow.keras.preprocessing.image import load_img, img_to_array

In [None]:
# Read CSV file
file_path = ''
df = pd.read_csv(file_path)

# Assume spectrogram file path and format
spectrogram_path = ''

# Function to load spectrogram images
def load_spectrogram(file_name):
    img_path = os.path.join(spectrogram_path, file_name.replace('.wav', '.png'))
    img = load_img(img_path, color_mode='grayscale') 
    img_array = img_to_array(img)
    return img_array

# Prepare dataset
X = []
y = []

for index, row in df.iterrows():
    spectrogram = load_spectrogram(row['filename'])
    X.append(spectrogram)
    y.append(row['target'])

X = np.array(X)
y = np.array(y)

# Flatten the data into one-dimensional vectors
X_flatten = X.reshape(X.shape[0], -1)

# Split data into training and testing sets, using stratified splitting to ensure uniform distribution of each class
X_train, X_test, y_train, y_test = train_test_split(X_flatten, y, test_size=0.1, random_state=42, stratify=y)

# Standardize the data
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

In [None]:
# Train KNN model
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train_scaled, y_train)

In [None]:
# Train SVM model
svm = SVC(kernel='linear', C=1, random_state=42)
svm.fit(X_train_scaled, y_train)

In [None]:
# Train Random Forest model
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train_scaled, y_train)

#### 4.1.2 Running Result
Below is a table of results for all test data

<img src="./picture/KNN.png" alt="final" width="600">

In traditional machine learning approaches, we experimented with three models: KNN, SVM, and Random Forest. We also tried various data preprocessing and testing methods, such as different image sizes (1000x400, 224x224) and different train-test splits (90:10, 80:20, 70:30). We found that the best performance was achieved with an image size of 1000x400 and a train-test split of 90:10. Among these three models, the highest accuracy rates were 27% for KNN, 38.75% for SVM, and 40% for Random Forest. Despite these efforts, the overall performance was unsatisfactory.


After analyzing the test results of the above three traditional machine learning, we found that they are not very effective in classifying audio images. After our analysis, we feel that it is mainly because of their limitations in feature extraction, handling high dimensional data, exploiting temporal dependencies, and large-scale data processing capabilities. 

In contrast, deep learning methods have significant advantages in these areas and can perform much better in audio image classification tasks. So in our next work, we consider comparative tests using different deep learning models in the hope of finding the one that performs better.

#### 4.2 CNN
We analyze and compare their test performance by reviewing relevant audio image classification papers, we consider the following deep learning models, mainly considering convolutional neural network architectures.

##### 4.2.1 AlexNet
AlexNet is characterized by its innovative network architecture and combination of techniques, including ReLU activation function, Dropout regularization, overlap pooling, data augmentation, and parallel computing, which can effectively improve the accuracy of the test.

*Ismail Fawaz, H., Lucas, B., Forestier, G., Pelletier, C., Schmidt, D. F., Weber, J., ... & Petitjean, F. (2020). Inceptiontime: Finding alexnet for time series classification. Data Mining and Knowledge Discovery, 34(6), 1936-1962.*

For the AlexNet model we conducted several tests using control variables in terms of feature extraction, image size, optimizer, number of training rounds, etc. and plotted the final test results as the table below.

<img src="./picture/截屏2024-07-23 21.38.10.png" alt="final" width="1000">

By analyzing the results of several tests, we found that the accuracy can reach up to 59.5%. 

The AlexNet model reads and preprocesses the spectrogram image, resizes it to 224x224 pixels, and normalizes it. Data enhancement was performed using ImageDataGenerator, which includes operations such as rotation, translation and horizontal flipping. The defined AlexNet model consists of five convolutional layers and three fully connected layers using ReLU activation function and L2 regularization. The model is compiled using the SGD optimizer and trained by the fit method, during which the Dropout technique is applied to prevent overfitting. 

Finally, the performance of the model is evaluated by the test data, and the classification report and confusion matrix are generated for visualization and analysis.

Below is the relevant code.

In [None]:
import os
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from tensorflow.keras.preprocessing.image import load_img, img_to_array
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score, confusion_matrix, classification_report
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense, Dropout
from tensorflow.keras.regularizers import l2
from tensorflow.keras.optimizers import SGD

# Read CSV file
file_path = ''
df = pd.read_csv(file_path)

# Assuming the spectrogram file path and file format
spectrogram_path = ''

# Function: Load spectrogram image
def load_spectrogram(file_name):
    img_path = os.path.join(spectrogram_path, file_name.replace('.wav', '.png'))
    img = load_img(img_path, target_size=(224, 224))  # AlexNet's default input size is 224x224
    img_array = img_to_array(img)
    img_array /= 255.0  # Normalize image
    return img_array

# Prepare the dataset
X = []
y = []

for index, row in df.iterrows():
    spectrogram = load_spectrogram(row['filename'])
    X.append(spectrogram)
    y.append(row['target'])

X = np.array(X)
y = np.array(y)

# Convert labels to categorical format
y_categorical = to_categorical(y, num_classes=len(df['target'].unique()))

# Split data into training and testing sets using stratified split to ensure even distribution of each class
X_train, X_test, y_train, y_test = train_test_split(X, y_categorical, test_size=0.1, random_state=42, stratify=y)

# Define the AlexNet model
def create_alexnet(input_shape, num_classes):
    model = Sequential()

    # First convolutional and pooling layer
    model.add(Conv2D(96, (11, 11), strides=(4, 4), activation='relu', input_shape=input_shape, kernel_regularizer=l2(0.001)))
    model.add(MaxPooling2D((3, 3), strides=(2, 2)))

    # Second convolutional and pooling layer
    model.add(Conv2D(256, (5, 5), padding='same', activation='relu', kernel_regularizer=l2(0.001)))
    model.add(MaxPooling2D((3, 3), strides=(2, 2)))

    # Third convolutional layer
    model.add(Conv2D(384, (3, 3), padding='same', activation='relu', kernel_regularizer=l2(0.001)))

    # Fourth convolutional layer
    model.add(Conv2D(384, (3, 3), padding='same', activation='relu', kernel_regularizer=l2(0.001)))

    # Fifth convolutional and pooling layer
    model.add(Conv2D(256, (3, 3), padding='same', activation='relu', kernel_regularizer=l2(0.001)))
    model.add(MaxPooling2D((3, 3), strides=(2, 2)))

    # Flatten and fully connected layers
    model.add(Flatten())
    model.add(Dense(4096, activation='relu', kernel_regularizer=l2(0.001)))
    model.add(Dropout(0.5))
    model.add(Dense(4096, activation='relu', kernel_regularizer=l2(0.001)))
    model.add(Dropout(0.5))
    model.add(Dense(num_classes, activation='softmax'))

    return model

# Create the model
input_shape = (224, 224, 3)
num_classes = len(df['target'].unique())
model = create_alexnet(input_shape, num_classes)

# Adjust learning rate and momentum
optimizer = SGD(learning_rate=0.01, momentum=0.9)
model.compile(optimizer=optimizer, loss='categorical_crossentropy', metrics=['accuracy'])

# Train the model
history = model.fit(X_train, y_train, batch_size=32, epochs=80, validation_data=(X_test, y_test))

# Evaluate the model
y_pred = model.predict(X_test)
y_pred_classes = np.argmax(y_pred, axis=1)
y_true = np.argmax(y_test, axis=1)

accuracy = accuracy_score(y_true, y_pred_classes)
f1 = f1_score(y_true, y_pred_classes, average='weighted')

##### 4.2.2 VGG16
VGG is a network structure with sixteen layers deep. Due to its use of the same convolutional kernel and pooling kernel, as well as its unified design, the architecture of VGG16 is relatively simple and easy to understand and implement, and it also has an excellent performance in dealing with the problem of image classification, which is why we have chosen it as one of our basic models.

*Qassim, H., Verma, A., & Feinzimer, D. (2018, January). Compressed residual-VGG16 CNN model for big data places image recognition. In 2018 IEEE 8th annual computing and communication workshop and conference (CCWC) (pp. 169-175). IEEE.*

For the VGG16 model we conducted several tests using control variables in terms of feature extraction, image size, optimizer, number of training rounds, etc. and plotted the final test results as the table below.

<img src="./picture/截屏2024-07-23 10.28.31.png" alt="final" width="1000">

By analyzing the results of several tests, we found that the model can achieve up to 60% accuracy. The model reads and preprocesses the spectrogram image, resizing it to 224x224 pixels and normalizing it. The defined VGG16 model consists of a pre-trained VGG16 convolutional layer (without the top fully connected layer) and a custom fully connected layer using the ReLU activation function and L2 regularization. The model is compiled using the Adam optimizer and trained by the fit method, during which the Dropout technique is applied to prevent overfitting. Finally, the model performance is evaluated by test data.

The core code is shown below.

In [None]:
import os
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from tensorflow.keras.preprocessing.image import load_img, img_to_array
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score

from tensorflow.keras.utils import to_categorical
from tensorflow.keras.models import Sequential, Model
from tensorflow.keras.layers import Flatten, Dense, Dropout, GlobalAveragePooling2D
from tensorflow.keras.applications import VGG16
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.regularizers import l2  # Import l2

In [None]:
# Freeze the convolutional layer weights of VGG16
for layer in base_model.layers:
    layer.trainable = False

# Build the new model
model = Sequential()
model.add(base_model)
model.add(GlobalAveragePooling2D())
model.add(Dense(4096, activation='relu', kernel_regularizer=l2(0.001)))
model.add(Dropout(0.5))
model.add(Dense(4096, activation='relu', kernel_regularizer=l2(0.001)))
model.add(Dropout(0.5))
model.add(Dense(len(df['target'].unique()), activation='softmax'))

# Adjust the learning rate
optimizer = Adam(learning_rate=0.0001)
model.compile(optimizer=optimizer, loss='categorical_crossentropy', metrics=['accuracy'])

# Train the model
history = model.fit(X_train, y_train, batch_size=32, epochs=50, validation_data=(X_test, y_test))

##### 4.2.3 ResNet
ResNet50 is a network sturcture with 50 deep convolutional neural network. This network was introduced residual blocks to train the deeper netural network. And it is widely used in image classification tasks.

*He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 770-778).*

In this project, we alse used Global average pooling layer, fully connected layer, Dropout layer and output layer based on ResNet50 model to further process the features, prevent overfitting and output results.

This code is used for classification task of spectrogram. Firstly, the spectrogram file name and corresponding labels are read from the CSV file and the spectrogram image is loaded and preprocessed and the dataset is divided into training set and test set according to the classification. Next, the code uses the ResNet50 model as a feature extractor to construct a new classification model and train it on the training set while evaluating it on the test set. 

After the training of this code is completed, the code calculates and prints the accuracy and F1 scores of the model, plots the curves of training and validation losses, and displays the confusion matrix and the accuracy of each category, and calculates and prints the running time of the whole process.

<img src="./picture/Resnet.png" alt=final width="1000">

For the ResNet50 model, we tried to use three ratios, 9:1, 8:2, and 7:3, to partition the dataset into a training set and a test set, and all other things being equal, the 9:1 training-test ratio was able to have the best accuracy (77%) and F1-score (0.767). For the 9:1 training-to-test ratio, we used three optimizers, Adam, Adagrad, and SGD, with different learning rates, and after experimentation, the Adam optimizer worked better for the model.

The following is the core code of ResNet

In [None]:
import os
import time
import pandas as pd
from tensorflow.keras.preprocessing.image import load_img, img_to_array
from sklearn.model_selection import train_test_split
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns

from tensorflow.keras.utils import to_categorical
from tensorflow.keras.models import Sequential, Model
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense, Dropout, Input, GlobalAveragePooling2D
from tensorflow.keras.applications import ResNet50
from tensorflow.keras.applications.resnet50 import preprocess_input

In [None]:
# Build the ResNet base model
resnet_base = ResNet50(weights='imagenet', include_top=False, input_shape=(224, 224, 3))

# Freeze the convolutional layers of ResNet
for layer in resnet_base.layers:
    layer.trainable = False

# Build a new model on top of ResNet
input_tensor = Input(shape=(224, 224, 3))
x = resnet_base(input_tensor, training=False)
x = GlobalAveragePooling2D()(x)
x = Dense(128, activation='relu')(x)
x = Dropout(0.5)(x)
output_tensor = Dense(len(df['target'].unique()), activation='softmax')(x)

model = Model(inputs=input_tensor, outputs=output_tensor)

model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

# Train the model
history = model.fit(X_train, y_train, epochs=50, validation_data=(X_test, y_test))

##### 4.2.4 DenseNet
DenseNet121 connects each layerto every other layer in a feed-forward fashion.Foreach layer, the feature-maps of all preceding layers areused as inputs, and its own feature-maps are used as inputsinto all subsequent layers. DenseNets have several compelling advantages: they alleviate the vanishing-gradient problem, strengthen feature propagation, encourage feature reuse, and substantially reduce the number of parameters.

*Huang, G., Liu, Z., Van Der Maaten, L., & Weinberger, K. Q. (2017). Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 4700-4708).*

In this project, we alse used Global average pooling layer, fully connected layer, Dropout layer and output layer based on DenseNet121 model to further process the features, prevent overfitting and output results.

This code is used for classification task of spectrogram. Firstly, the spectrogram file name and corresponding labels are read from the CSV file and the spectrogram image is loaded and preprocessed and the dataset is divided into training set and test set according to the classification. Next, the code uses the DenseNet121 model as a feature extractor to construct a new classification model and train it on the training set while evaluating it on the test set. After the training of this code is completed, the code calculates and prints the accuracy and F1 score of the model, plots the curves of training and validation losses, and displays the confusion matrix and the accuracy of each category, and calculates and prints the running time of the whole process.

<img src="./picture/Densnet.png" alt="final" width="1000">

For the DenseNet121 model, we tried to use three ratios, 9:1, 8:2, and 7:3, to partition the dataset into a training set and a test set, and all other things being equal, the 9:1 training-test ratio was able to have the best accuracy (80.5%) and F1-score (0.799). For the 9:1 training-to-test ratio, we used three optimizers, Adam, Adagrad, and SGD, with different learning rates, and after experimentation, the Adam optimizer worked better for the model.

The following is the core code of DenseNet

In [None]:
from tensorflow.keras.applications import DenseNet121
from tensorflow.keras.applications.densenet import preprocess_input

In [None]:
# Build DenseNet pre-processing model
densenet_base = DenseNet121(weights='imagenet', include_top=False, input_shape=(224, 224, 3))

# Freeze the convolutional layer weights of DenseNet
for layer in densenet_base.layers:
    layer.trainable = False

# Build a new model
input_tensor = Input(shape=(224, 224, 3))
x = densenet_base(input_tensor, training=False)
x = GlobalAveragePooling2D()(x)
x = Dense(128, activation='relu')(x)
x = Dropout(0.5)(x)
output_tensor = Dense(len(df['target'].unique()), activation='softmax')(x)

model = Model(inputs=input_tensor, outputs=output_tensor)

model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

# Train the model
history = model.fit(X_train, y_train, epochs=50, validation_data=(X_test, y_test))


#### 4.3 RNN Model

Recurrent Neural Networks (RNNs) are a class of neural networks that are specialized for processing sequential data. Unlike Convolutional Neural Networks (CNNs), which are good at processing grid-like data such as images, RNNs are particularly well suited for tasks involving time series or sequence data.

RNNs have several unique advantages in audio classification tasks:

1. **Direct processing of sequential data**: RNNs can directly process the temporal characteristics of audio signals without first converting them to spectrograms or other image-like representations like CNNs.

2. **Capture long-term dependencies**: RNNs are able to maintain internal states, allowing them to capture long-term dependencies in audio data. This is particularly useful for understanding context in sounds that evolve over time.

3. **Variable input length**: RNNs can process input sequences of varying lengths, which is useful when processing audio clips of varying lengths.

4. **Memory efficiency**: For long sequences, RNNs are more memory-efficient than CNNs because they process inputs sequentially rather than as a whole.

##### 4.3.1 LSTM (Long Short-Term Memory)

Long Short-Term Memory *(Hochreiter & Schmidhuber, 1997)* is a special type of RNN designed to solve the vanishing gradient problem that standard RNNs may face when processing long sequences. LSTM introduces a memory cell and various gates (input, forget, and output gates) that enable the network to selectively remember or forget information in long sequences.

*Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural computation, 9(8), 1735-1780.*

Our implementation of the LSTM model includes several key components:

In [None]:
def create_improved_rnn_model(input_shape, num_classes):
    model = Sequential([
        LSTM(64, return_sequences=True, kernel_regularizer=l2(0.01), input_shape=input_shape),
        BatchNormalization(),
        LSTM(32),
        BatchNormalization(),
        Dense(64, activation='relu', kernel_regularizer=l2(0.01)),
        Dropout(0.5),
        Dense(32, activation='relu', kernel_regularizer=l2(0.01)),
        Dropout(0.3),
        Dense(num_classes, activation='softmax')
    ])
    return model

Key features of our model:

1. **Stacked LSTM layers**: We use two LSTM layers, allowing the model to learn hierarchical representations of the audio data.

2. **Regularization**: We apply L2 regularization to the LSTM and Dense layers to prevent overfitting.

3. **Batch Normalization**: After each LSTM layer, we include a BatchNormalization layer to stabilize the learning process and potentially allow for higher learning rates.

4. **Dropout**: We incorporate Dropout layers after the Dense layers to further combat overfitting.

5. **Dense layers**: The model includes two Dense layers with ReLU activation before the final classification layer, allowing for non-linear combinations of the learned features.

##### 4.3.2 Bi-LSTM

To further improve performance, we implemented a bidirectional LSTM (Bi-LSTM) model *(Schuster & Paliwal, 1997)*. Bi-LSTM processes the input sequence in both forward and backward directions, allowing the network to capture information that may come from future context as well as past context.

*Schuster, M., & Paliwal, K. K. (1997). Bidirectional recurrent neural networks. IEEE transactions on Signal Processing, 45(11), 2673-2681.*

Advantages of Bi-LSTM over standard LSTM include:

1. **Improved contextual understanding**: By processing the sequence bidirectionally, Bi-LSTM can capture more comprehensive context for each time step in the audio.

2. **Better performance on tasks that require full sequence context**: For the audio datasets we use, Bi-LSTM generally outperforms unidirectional LSTM because they are correlated throughout the sequence.

3. **Increased model capacity**: Bi-LSTM effectively doubles the number of parameters in the recurrent layer, potentially allowing for more complex feature learning.

In [None]:
def create_improved_rnn_model(input_shape, num_classes):
    model = Sequential([
        Bidirectional(LSTM(64, return_sequences=True, kernel_regularizer=l2(0.01)), input_shape=input_shape),
        BatchNormalization(),
        Bidirectional(LSTM(32)),
        BatchNormalization(),
        Dense(64, activation='relu', kernel_regularizer=l2(0.01)),
        Dropout(0.5),
        Dense(32, activation='relu', kernel_regularizer=l2(0.01)),
        Dropout(0.3),
        Dense(num_classes, activation='softmax')
    ])
    return model

##### 4.3.3 Augmentation

To further enhance our model's performance and generalization capabilities, we implemented data augmentation techniques specifically tailored for audio data:

In [None]:
def augment_audio(audio, sr):
    # Time shift
    shift_max = int(sr * 0.1)
    shift = np.random.randint(-shift_max, shift_max)
    augmented = np.roll(audio, shift)

    # Add noise
    noise_factor = 0.05
    noise = np.random.randn(len(audio))
    augmented += noise_factor * noise

    # Normalize
    augmented = np.clip(augmented, -1, 1)
    return augmented

Our augmentation strategy consists of two main techniques:

1. **Time Shift**: We randomly shift the audio time by up to 10% of its length. This helps the model remain invariant to the exact timing of the sounds in the clip.

2. **Noise Injection**: We add a small amount of random noise to the audio. This simulates real-world conditions where background noise is often present and helps the model be more robust to such changes.

3. **Normalization**: After applying these augmentations, we clip the values ​​to ensure they stay within the valid range of the audio data.

These augmentation techniques help artificially increase the diversity of our training data, potentially improving the model's ability to generalize to new, unseen audio samples.

We summarize the results of several of the above runs with respect to the LSTM model in the following table:

<img src="./picture/LSTM结果.jpg" alt="final" width="1000">


##### 4.3.4 GRU
GRU stands for Gated Recurrent Unit, which is a type of recurrent neural network (RNN) architecture and a variant of the Long Short-Term Memory (LSTM) network. GRUs are good at handle sequence data and capture temporal dependencies without suffering from the vanishing gradient problem that can affect traditional RNNs.

This model uses the 'Sequential' model from TensorFlow's Keras API and includes 'Input' and 'GRU' layers from Keras. In the GRU layer, we define 64 units. The 'Dense' layer uses 'relu' (Rectified Linear Unit) as the activation function. The 'Dropout' layer sets 50% of the neurons to drop out during training. The final 'Dense' layer uses the 'softmax' activation function, which is commonly used for multi-class classification problems. In the 'compile' method, we choose the 'adam' optimizer, known for achieving better accuracy.

*Chung, J., Gulcehre, C., Cho, K., & Bengio, Y. (2014). Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555.*

In [None]:
# Define simple GRU model
def create_simple_gru_model(input_shape, num_classes):
    model = Sequential([
        Input(shape=input_shape),
        GRU(64),
        Dense(64, activation='relu'),
        Dropout(0.5),
        Dense(num_classes, activation='softmax')
    ])
    model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
    return model

##### 4.3.5 Bi-GRU
Bi-GRU processes sequences in both forward and backward directions, allowing it to achieve better performance in audio classification where context from both directions is important. Additionally, it is more robust to data variations such as noise and differing sequence lengths. By capturing dependencies missed by unidirectional GRUs, Bi-GRU leads to higher accuracy and richer feature extraction.

For the Bidirectional GRU model, we use the 'Bidirectional' wrapper along with the GRU layer from Keras. The 'Dense' layers use the same 'relu' and 'softmax' activation functions. We retain the 'adam' optimizer and the same loss function in the 'compile' method from Keras.

*Li, X., Ma, X., Xiao, F., Xiao, C., Wang, F., & Zhang, S. (2022). Time-series production forecasting method based on the integration of Bidirectional Gated Recurrent Unit (Bi-GRU) network and Sparrow Search Algorithm (SSA). Journal of Petroleum Science and Engineering, 208, 109309.*

In [None]:
# Define Bidirectional GRU model
def create_simple_gru_model(input_shape, num_classes):
    model = Sequential([
        Input(shape=input_shape),
        Bidirectional(GRU(64)),
        Dense(64, activation='relu'),
        Dropout(0.5),
        Dense(num_classes, activation='softmax')
    ])
    model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
    return model

In this section, we used the same data augmentation methods as in the LSTM section: time shift, noise addition, and normalization. Additionally, we experimented with different feature extraction methods, including MFCC and Mel spectrograms. The results are as follows.

<img src="./picture/GRU.png" alt="final" width="1000">

As can be seen from the results, with the introduction of GRU-related modeling and data augmentation, we get the highest accuracy of 78.5% in this part.

### 5 Basic Model Enhancement and Result

#### 5.1  Basic model results

These are all the basic models we have chosen, and we have selected the best performing cases of each basic model to be plotted in the table below.

<img src="./picture/final.png" alt="final" width="1000">

In different base models, we experimented with various image sizes (1000x400, 224x224), different training/testing splits (9:1, 8:2, 7:3), different optimizers (Adam, SGD, Adagrad), and different learning rates (0.01, 0.001, 0.0001). Based on these analyses, we compiled a table showing the highest accuracy achieved for each model along with the corresponding image size, training/testing split, optimizer, and learning rate. The standout model among machine learning, CNN, and RNN approaches was the DenseNet model from CNN, achieving an accuracy of 80.5%. The next best was the Bi-GRU model from RNN, which achieved an accuracy of 78.5% after data augmentation.

#### 5.1  Analysis of baseline results

Since we have chosen DenseNet as our base model, the next step is to consider how to enhance on this base model. We output the confusion matrix of the DenseNet model results during training as shown below.

<img src="./picture/混淆矩阵.png" alt="final" width="500">

In the confusion matrix, we have labeled some categories with relatively low accuracy, we looked at the spectral images of these categories in the hope of finding the reason for the low accuracy, some of the categories are shown in the figure below.

<img src="./picture/表现不好的类别.png" alt="final" width="500">

By analyzing these images, we found that some of the poorly performing categories shared the common feature of different repetitions of the sound in that category, of varying lengths. For example, for one of the categories that had sound clips about breathing, some of the audios had multiple breaths performed, while others had only one breath performed. So we feel that this could be the cause of the lower accuracy.

#### 5.2  Base model improvement enhancements

To address the above problems, we believe that we can try to use data enhancement approaches, for example, we can enhance the training images by panning, adding noise, rotating, etc. to help the model better recognize different audio features and thus improve the classification performance.

However, the data enhancement approach to the pictures is not effective in improving the DenseNet121 model in this project. After analysis, we believe that the spectrogram is a conversion of the audio signal into a visual representation too much visual enhancement of the spectrogram may destroy the inherent structure of the audio signal, resulting in the model not being able to understand the features of the audio correctly.

Considering that in the base model we have better data enhancement results of the audio directly to the RNN structure, we decided to adopt the RNN structure mentioned above to use the data enhancement technique of the Bi-GRU model combined with the DenseNet121 model. Eventually there is some improvement for the testing accuracy of the model.

In [None]:
# Data augmentation function
def augment_audio(audio, sr):
    # Time shift
    shift_max = int(sr * 0.1)
    shift = np.random.randint(-shift_max, shift_max)
    augmented = np.roll(audio, shift)

    # Add noise
    noise_factor = 0.05
    noise = np.random.randn(len(audio))
    augmented += noise_factor * noise

    # Normalize
    augmented = np.clip(augmented, -1, 1)
    return augmented

In [None]:
# Load and preprocess data
def load_and_preprocess_data(audio_dir, csv_file, max_length=220500, augment=False):
    df = pd.read_csv(csv_file)
    X = []
    y = []
    for _, row in df.iterrows():
        file_path = os.path.join(audio_dir, row['filename'])
        audio, sr = librosa.load(file_path, sr=44100, duration=5.0)
        if augment:
            audio = augment_audio(audio, sr)
        if len(audio) < max_length:
            audio = np.pad(audio, (0, max_length - len(audio)))
        else:
            audio = audio[:max_length]
        mel_spectrogram = librosa.feature.melspectrogram(y=audio, sr=sr, n_mels=128)
        mel_spectrogram_db = librosa.power_to_db(mel_spectrogram, ref=np.max)
        X.append(mel_spectrogram_db.T)
        y.append(row['target'])
    return np.array(X), np.array(y)

In [None]:
# Define Bidirectional GRU model with DenseNet121 and Attention
def create_birnn_gru_model_with_densenet_attention(input_shape, num_classes):
    input_layer = Input(shape=input_shape)
    reshaped_input = tf.keras.layers.Reshape((input_shape[0], input_shape[1], 1))(input_layer)
    
    # DenseNet121 model
    densenet = DenseNet121(include_top=False, input_shape=(input_shape[0], input_shape[1], 1), weights=None)
    densenet_out = densenet(reshaped_input)
    densenet_out = tf.keras.layers.GlobalAveragePooling2D()(densenet_out)
    
    reshaped_densenet_out = tf.keras.layers.Reshape((1, -1))(densenet_out)
    birnn_out = Bidirectional(GRU(64, return_sequences=True))(reshaped_densenet_out)
    attention_out = Attention()(birnn_out)
    
    dense1 = Dense(64, activation='relu')(attention_out)
    dropout = Dropout(0.5)(dense1)
    output_layer = Dense(num_classes, activation='softmax')(dropout)
    
    model = Model(inputs=input_layer, outputs=output_layer)
    model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
    return mode

For the above model, after reviewing information and papers, we try to introduce the attention mechanism to optimize the model. We used the additive attention mechanism, which is used on the bidirectional GRU in this model, to focus on the output of the time step of the GRU, enabling the model to focus on more important features. By adding the attention mechanism, the model is able to focus on important time steps in the input sequence to improve the model's audio classification ability.

For several scenarios mentioned above for the basic model enhancement, we have tested them, and the relevant tests belong to the following figure.

<img src="./picture/improve结果表格.png" alt="final" width="1000">

By analyzing the data results, we can find that the best test results can be obtained by using the Bi-GRU model with data enhancement techniques in combination with the DenseNet121 model followed by the introduction of the additive attention mechanism, with a test result of 87%.

We use this model as our final result after lifting on the base model, and the complete code is as follows.

In [29]:
import numpy as np
import pandas as pd
import tensorflow as tf
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Bidirectional, GRU, Dense, Dropout, Layer
from tensorflow.keras.applications import DenseNet121
from tensorflow.keras.callbacks import ReduceLROnPlateau, EarlyStopping, ModelCheckpoint
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score
import librosa
import os

# Data augmentation function
def augment_audio(audio, sr):
    # Time shift
    shift_max = int(sr * 0.1)
    shift = np.random.randint(-shift_max, shift_max)
    augmented = np.roll(audio, shift)

    # Add noise
    noise_factor = 0.05
    noise = np.random.randn(len(audio))
    augmented += noise_factor * noise

    # Normalize
    augmented = np.clip(augmented, -1, 1)
    return augmented

# Load and preprocess data
def load_and_preprocess_data(audio_dir, csv_file, max_length=220500, augment=False):
    df = pd.read_csv(csv_file)
    X = []
    y = []
    for _, row in df.iterrows():
        file_path = os.path.join(audio_dir, row['filename'])
        audio, sr = librosa.load(file_path, sr=44100, duration=5.0)
        if augment:
            audio = augment_audio(audio, sr)
        if len(audio) < max_length:
            audio = np.pad(audio, (0, max_length - len(audio)))
        else:
            audio = audio[:max_length]
        mel_spectrogram = librosa.feature.melspectrogram(y=audio, sr=sr, n_mels=128)
        mel_spectrogram_db = librosa.power_to_db(mel_spectrogram, ref=np.max)
        X.append(mel_spectrogram_db.T)
        y.append(row['target'])
    return np.array(X), np.array(y)

# Define Attention Layer
class Attention(Layer):
    def __init__(self, **kwargs):
        super(Attention, self).__init__(**kwargs)

    def build(self, input_shape):
        self.W = self.add_weight(name='attention_weight', shape=(input_shape[-1], input_shape[-1]), initializer='random_normal', trainable=True)
        self.b = self.add_weight(name='attention_bias', shape=(input_shape[-1],), initializer='zeros', trainable=True)
        super(Attention, self).build(input_shape)

    def call(self, x):
        score = tf.nn.tanh(tf.tensordot(x, self.W, axes=[-1, 0]) + self.b)
        attention_weights = tf.nn.softmax(score, axis=1)
        context_vector = attention_weights * x
        context_vector = tf.reduce_sum(context_vector, axis=1)
        return context_vector

# Define Bidirectional GRU model with DenseNet121 and Attention
def create_birnn_gru_model_with_densenet_attention(input_shape, num_classes):
    input_layer = Input(shape=input_shape)
    reshaped_input = tf.keras.layers.Reshape((input_shape[0], input_shape[1], 1))(input_layer)
    
    # DenseNet121 model
    densenet = DenseNet121(include_top=False, input_shape=(input_shape[0], input_shape[1], 1), weights=None)
    densenet_out = densenet(reshaped_input)
    densenet_out = tf.keras.layers.GlobalAveragePooling2D()(densenet_out)
    
    reshaped_densenet_out = tf.keras.layers.Reshape((1, -1))(densenet_out)
    birnn_out = Bidirectional(GRU(64, return_sequences=True))(reshaped_densenet_out)
    attention_out = Attention()(birnn_out)
    
    dense1 = Dense(64, activation='relu')(attention_out)
    dropout = Dropout(0.5)(dense1)
    output_layer = Dense(num_classes, activation='softmax')(dropout)
    
    model = Model(inputs=input_layer, outputs=output_layer)
    model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
    return model

# File paths
audio_dir = 'audio'
csv_file = 'meta/esc50.csv'

# Load and preprocess data
X, y = load_and_preprocess_data(audio_dir, csv_file)

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, stratify=y, random_state=42)

# Data augmentation for training set
X_train_aug, y_train_aug = load_and_preprocess_data(audio_dir, csv_file, augment=True)
X_train = np.concatenate([X_train, X_train_aug])
y_train = np.concatenate([y_train, y_train_aug])

# Create and train Bidirectional GRU model with DenseNet121 and Attention
model_birnn_gru_densenet_attention = create_birnn_gru_model_with_densenet_attention(X_train.shape[1:], len(np.unique(y)))

# Callbacks
reduce_lr = ReduceLROnPlateau(monitor='val_loss', factor=0.2, patience=5, min_lr=1e-5)
early_stopping = EarlyStopping(monitor='val_loss', patience=10, restore_best_weights=True)
model_checkpoint = ModelCheckpoint('best_birnn_gru_densenet_attention_model.keras', save_best_only=True, monitor='val_accuracy')

history_birnn_gru_densenet_attention = model_birnn_gru_densenet_attention.fit(
    X_train, y_train,
    epochs=100,
    batch_size=32,
    validation_split=0.1,
    callbacks=[reduce_lr, early_stopping, model_checkpoint],
    verbose=1
)

# Evaluate Bidirectional GRU model with DenseNet121 and Attention
y_pred_birnn_gru_densenet_attention = model_birnn_gru_densenet_attention.predict(X_test)
y_pred_classes_birnn_gru_densenet_attention = np.argmax(y_pred_birnn_gru_densenet_attention, axis=1)
accuracy_birnn_gru_densenet_attention = accuracy_score(y_test, y_pred_classes_birnn_gru_densenet_attention)
f1_birnn_gru_densenet_attention = f1_score(y_test, y_pred_classes_birnn_gru_densenet_attention, average='weighted')
print(f'Overall Accuracy: {accuracy_birnn_gru_densenet_attention:.4f}')
print(f'Overall F1 Score: {f1_birnn_gru_densenet_attention:.4f}')

# Calculate accuracy for each class
class_accuracies_birnn_gru_densenet_attention = {}
for class_id in np.unique(y):
    mask = y_test == class_id
    class_accuracy = accuracy_score(y_test[mask], y_pred_classes_birnn_gru_densenet_attention[mask])
    class_accuracies_birnn_gru_densenet_attention[class_id] = class_accuracy
    print(f'Class {class_id} Accuracy: {class_accuracy:.4f}')

# Plot learning curves
import matplotlib.pyplot as plt
plt.figure(figsize=(12, 6))
plt.subplot(1, 2, 1)
plt.plot(history_birnn_gru_densenet_attention.history['accuracy'], label='Training Accuracy')
plt.plot(history_birnn_gru_densenet_attention.history['val_accuracy'], label='Validation Accuracy')
plt.title('Model Accuracy')
plt.ylabel('Accuracy')
plt.xlabel('Epoch')
plt.legend()

plt.subplot(1, 2, 2)
plt.plot(history_birnn_gru_densenet_attention.history['loss'], label='Training Loss')
plt.plot(history_birnn_gru_densenet_attention.history['val_loss'], label='Validation Loss')
plt.title('Model Loss')
plt.ylabel('Loss')
plt.xlabel('Epoch')
plt.legend()

plt.tight_layout()
plt.savefig('learning_curves_birnn_gru_densenet_attention.png')
plt.close()

# Plot confusion matrix
import seaborn as sns
from sklearn.metrics import confusion_matrix
cm_birnn_gru_densenet_attention = confusion_matrix(y_test, y_pred_classes_birnn_gru_densenet_attention)
plt.figure(figsize=(12, 10))
sns.heatmap(cm_birnn_gru_densenet_attention, annot=True, fmt='d', cmap='Blues')
plt.title('Confusion Matrix')
plt.ylabel('True Label')
plt.xlabel('Predicted Label')
plt.savefig('confusion_matrix_birnn_gru_densenet_attention.png')
plt.close()

# Plot class accuracies
plt.figure(figsize=(12, 6))
plt.bar(class_accuracies_birnn_gru_densenet_attention.keys(), class_accuracies_birnn_gru_densenet_attention.values())
plt.title('Accuracy by Class')
plt.xlabel('Class')
plt.ylabel('Accuracy')
plt.savefig('class_accuracies_birnn_gru_densenet_attention.png')
plt.close()


Epoch 1/100
[1m107/107[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m505s[0m 5s/step - accuracy: 0.0283 - loss: 3.9323 - val_accuracy: 0.0237 - val_loss: 3.9265 - learning_rate: 0.0010
Epoch 2/100
[1m107/107[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m471s[0m 4s/step - accuracy: 0.0479 - loss: 3.8229 - val_accuracy: 0.0211 - val_loss: 4.3189 - learning_rate: 0.0010
Epoch 3/100
[1m107/107[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m485s[0m 5s/step - accuracy: 0.0522 - loss: 3.6988 - val_accuracy: 0.0211 - val_loss: 4.0221 - learning_rate: 0.0010
Epoch 4/100
[1m107/107[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m485s[0m 5s/step - accuracy: 0.0989 - loss: 3.4716 - val_accuracy: 0.0316 - val_loss: 4.3279 - learning_rate: 0.0010
Epoch 5/100
[1m107/107[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m486s[0m 5s/step - accuracy: 0.1028 - loss: 3.2969 - val_accuracy: 0.0605 - val_loss: 3.8046 - learning_rate: 0.0010
Epoch 6/100
[1m107/107[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[3

Plot confusion matrix

<img src="./picture/wemeet image_20240804120902074.png" alt="final" width="700">

##### 6.Discution

#### 6.1 Strengths
+ **High-efficiency feature extraction:** High-efficiency feature extraction through DenseNet121, combined with Mel spectrogram, makes the feature extraction process more efficient and meaningful.

+ **Timing Modeling Capability:** Enhanced timing modeling capability for audio data through bi-directional GRU modeling, which better captures the temporal dependencies in audio signals.

+ **Enhanced model robustness:** Data enhancement strategies (time offset and additive noise) improve the robustness of the model to the diversity of audio data in real-world applications.

+ **Optimizing the training process:** The use of callback functions optimizes the training process and improves the convergence speed and final performance of the model.

+ **high accuracy：** The final model achieved a high level of accuracy (87%), with many categories improving in the final model compared to the base model.

#### 6.2 Weaknesses
In the 10 categories where the baseline DenseNet model achieved accuracies below 50%, our final model (Bi-GRU + DenseNet + data augmentation + attention mechanism) improved 7 of them. Among the 3 categories that did not see improvement, the most notable were categories 23 and 28. These categories, representing breathing and snoring sounds respectively, were confused in the final results, leading to lower accuracies.

<img src="./picture/最终结果缺点.jpg" alt="final" width="800">

We analyzed the possible reasons for the model's failure to distinguish these two categories:

+ **Visual Similarities**: The spectrograms for these sounds show similar frequency patterns and intensity distributions.
+ **Acoustic Similarities**: Both sounds have similar timbre, pitch, and rhythmic patterns, making them difficult to differentiate.
+ **Technical Limitations**: The DenseNet may struggle to make fine distinctions between these similar sounds. The Bi-GRU might also have difficulty capturing dependencies in both directions effectively. Even with an attention layer, if the key features of both classes are similar, the mechanism may not differentiate them well.

#### 6.3 Future Work
To improve the model, we suggest:

+ **Increasing the Dataset Size**: Pretraining with a larger dataset could help the model learn more subtle differences between the categories.
+ **Using More Advanced Models**: Employing more sophisticated, computationally intensive models and advanced data augmentation techniques might enhance the model's ability to distinguish between these similar sounds.