# A Voice Story (Fake or Real) Discrimination Model Based on ResNet

# 1 Author

**Student Name**:  MingHao Hong

**Student ID**:  221171051

### Tip:

In Model.ipynb:

    It contains the dataset construction, network construction, training function construction, and model training steps.

In Submission_Report.ipynb:

    It contains the report for the entire mini-project, covering problem analysis and reasoning, model selection, dataset construction, and the concluding analysis.

You can download everything from my GitHub:

    https://github.com/0tiemWBmine0/mini-project

# 2 Problem formulation

In this mini-project, the machine learning problem we aim to address is: **Predicting the authenticity of recorded stories based on audio**.

Specifically, the goal is to build a machine learning model that takes an audio recording of 3 to 5 minutes as input, extracts features from the audio along different dimensions, and uses these features to predict whether the narrated story is true or fictional.

This is an interesting and challenging problem, which I believe can be explained from the following aspects:

1. **Cross-modal Learning**: This issue may involve the intersection of sound signal processing and natural language processing. The model needs to be able to understand and process data from different modalities, which is a challenge.

2. **Sentiment Analysis and Speech Recognition**: Audio data contains not only linguistic information but also the speaker's emotions and intonation. The model needs to be able to recognize and analyze these non-verbal cues to help determine the authenticity of the story.

3. **Time Series Analysis**: Unlike static data, audio data is sequential data that changes over time. The model needs to be able to handle the characteristics of this time series, capturing the dynamic changes in the storytelling process.

4. **Data Scarcity and Imbalance**: In the real world, there may be an imbalance between true and fictional stories, and the collected data may be biased towards one category, which requires the model to handle class imbalance issues during training.

5. **Ethics and Privacy Issues**: When dealing with audio data, privacy and ethical issues of the data must be considered to ensure that the collection and use of data comply with legal regulations and ethical standards.

6. **Wide Application**: This technology can be applied to various fields, such as the authenticity analysis of court testimonies, verification of news reports, and the authenticity determination of information on social media, and has extensive practical application value.

7. **Challenging Traditional Perceptions**: People generally believe that they can judge whether a person is lying by observing and listening, but this project challenges this perception, attempting to automate the process using machine learning technology.

8. **Cutting-edge Technology**: This project may involve the latest deep learning technologies, such as Convolutional Neural Networks (CNN) and Recurrent Neural Networks (RNN), as well as possible attention mechanisms and Long Short-Term Memory Networks (LSTM), all of which are hot technologies in the field of machine learning today.

Through the basic analysis of the characteristics of this problem, it is easy to see that this is a machine learning project that is both challenging and interesting.


# 3 Methodology

## Analysis of the Problem:

Firstly, I considered two approaches to handling the problem:

One approach is to extract the textual content and classify it, which turns the task into judging whether a piece of text is true, much like current AI applications for detecting fake news. However, this approach is actually harder. With my current skills, I cannot process text with an RNN, nor can I verify the information online. Moreover, these stories are mostly personal events rather than historical events, which makes verification even more difficult.

The second approach is to work directly with the characteristics of the audio. Since it is a human voice, it inevitably carries information about emotion, pauses, pitch, and timbre. I assume that people handle these characteristics differently when telling true and false stories. If we analyze them, we might be able to distinguish the authenticity of the story.

Therefore, I have chosen the second approach for building the machine learning model.

For discriminating voice signals, the most important aspect is how the different characteristics vary over time. I therefore looked for a way to represent the change of each characteristic over time, that is, a plot of each parameter against time.

This kind of multi-dimensional, time-varying information is hard to capture with a single linear model; therefore, I output the features as images and use a convolutional neural network to analyze them.

In terms of features, I have referred to the following parameters and their main applications:

## Feature Study and Selection

##### 1. DWT (Discrete Wavelet Transform)
- **Basic Characteristics**: DWT is a time-frequency analysis tool that decomposes a signal into time-localized representations of different frequency components. DWT can capture the instantaneous characteristics of a signal and is suitable for the analysis of non-stationary signals.
- **Applications**: DWT is widely used in voice signal processing, image compression (such as JPEG2000), and signal denoising. It can analyze signals at multiple resolutions, making it important for feature extraction in speech recognition and audio processing.

##### 2. STFT (Short-Time Fourier Transform)
- **Basic Characteristics**: STFT achieves dual analysis of time and frequency by sliding a window in time and performing a Fourier transform on the signal. The choice of window significantly affects the resolution of the results.
- **Applications**: STFT is commonly used for spectral analysis of audio and speech signals, music information retrieval, and audio effect processing. It is suitable for analyzing signals with brief change characteristics.

##### 3. MFCCs (Mel-Frequency Cepstral Coefficients)
- **Basic Characteristics**: MFCC is a speech feature extraction method based on the characteristics of human ear perception. It transforms the audio signal to the Mel frequency scale and then extracts features through the Discrete Cosine Transform (DCT). MFCC can effectively represent the timbre and pitch of speech.
- **Applications**: MFCC is very important in speech recognition, speaker recognition, emotion analysis, and other speech processing tasks. It is widely used as input features in machine learning and deep learning.

##### 4. Spectrogram
- **Basic Characteristics**: A spectrogram is a two-dimensional image formed by decomposing the audio signal into its frequency components and plotting their changes along the time axis. The horizontal axis of the spectrogram is time, the vertical axis is frequency, and the color or brightness represents amplitude.
- **Applications**: Spectrograms are widely used in audio signal analysis, speech recognition, and music information retrieval. They can intuitively reflect the time-frequency characteristics of audio signals and are the basis for many speech processing algorithms.
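
To make the time-frequency idea concrete, here is a minimal NumPy sketch of a magnitude spectrogram computed with a sliding Hann window. The function name, frame length, and hop length are illustrative assumptions; the project's own feature extraction code may use a library and different settings.

```python
import numpy as np

def magnitude_spectrogram(signal, frame_length=256, hop_length=128):
    """Return |STFT| with shape (frequency bins, time frames)."""
    x = np.asarray(signal, dtype=float)
    window = np.hanning(frame_length)
    n_frames = 1 + (len(x) - frame_length) // hop_length
    # Cut the signal into overlapping frames and apply the window to each one.
    frames = np.stack([
        x[i * hop_length : i * hop_length + frame_length] * window
        for i in range(n_frames)
    ])
    # One FFT per frame; transpose so rows are frequency and columns are time.
    return np.abs(np.fft.rfft(frames, axis=1)).T

# Synthetic check (assumed values): one second of a 1000 Hz tone at 8 kHz.
sample_rate = 8000
t = np.arange(sample_rate) / sample_rate
spec = magnitude_spectrogram(np.sin(2 * np.pi * 1000 * t))
peak_bins = spec.argmax(axis=0)  # strongest frequency bin in each frame
```

Each column of `spec` is one time frame, so plotting the array as an image gives the kind of feature picture that is later fed to the CNN.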

##### 5. LPC (Linear Predictive Coding)
- **Basic Characteristics**: LPC models a signal by predicting the current sample as a linear combination of previous samples, thereby extracting signal features. It estimates the prediction coefficients by minimizing the prediction error.
- **Applications**: LPC is widely used in speech synthesis, speaker recognition, and phoneme recognition. It can effectively capture the resonance characteristics of speech signals and is a classic tool for speech analysis.
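
As an illustration of linear prediction, the sketch below estimates LPC coefficients with the autocorrelation method by solving the normal equations directly. It is not the project's `compute_lpc` function, whose implementation is not shown here; the function name and the synthetic test signal are assumptions.

```python
import numpy as np

def lpc_autocorrelation(signal, order):
    """Estimate a_1..a_p such that x[n] is approximated by sum_k a_k * x[n-k]."""
    x = np.asarray(signal, dtype=float)
    # Autocorrelation values r[0], ..., r[order].
    r = np.array([np.dot(x[: len(x) - k], x[k:]) for k in range(order + 1)])
    # Toeplitz normal equations R a = r[1:], with R[i, j] = r[|i - j|].
    lags = np.abs(np.subtract.outer(np.arange(order), np.arange(order)))
    return np.linalg.solve(r[lags], r[1:])

# Synthetic signal: impulse response of x[n] = 1.3 x[n-1] - 0.4 x[n-2].
x = np.zeros(200)
x[0] = 1.0
for n in range(1, len(x)):
    previous2 = x[n - 2] if n >= 2 else 0.0
    x[n] = 1.3 * x[n - 1] - 0.4 * previous2

coeffs = lpc_autocorrelation(x, order=2)  # should be close to [1.3, -0.4]
```

In practice the same equations are usually solved with the Levinson-Durbin recursion, which exploits the Toeplitz structure; the direct solve is used here only to keep the sketch short.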

##### 6. LPCC (Linear Predictive Cepstral Coefficients)
- **Basic Characteristics**: LPCC are cepstral coefficients derived from the LPC coefficients, usually computed with a recursion on the LPC coefficients. They describe the resonance (spectral envelope) characteristics of the signal in the cepstral domain.
- **Applications**: LPCC is used in speech recognition and synthesis because it has better discriminability than LPC and is more consistent with human ear perception of sound.

##### 7. LSF (Line Spectral Frequencies)
- **Basic Characteristics**: LSF are an alternative representation of the LPC coefficients, obtained from the roots of the sum and difference polynomials built from the LPC polynomial, and they reflect the spectral information of the signal. LSF has higher numerical stability than the original LPC coefficients while carrying the same amount of information.
- **Applications**: LSF is widely used in speech synthesis, coding, and analysis, and is commonly used in vocoders and speaker recognition.

##### 8. PLP (Perceptual Linear Prediction)
- **Basic Characteristics**: PLP is based on psychoacoustic models and performs linear predictive analysis of signals while considering various factors of human ear auditory characteristics. By modeling the perceptual model of audio signals, it can more effectively extract features close to human ear perception.
- **Applications**: PLP is widely used in speech recognition, speaker verification, and other related fields, often used to enhance the performance of speech recognition systems under human auditory characteristics.

##### 9. Fundamental Frequency Period
- **Basic Characteristics**: The fundamental frequency period refers to the repetition interval of the fundamental frequency in sound waves, which is a direct reflection of the periodic characteristics of sound. It determines the pitch of the sound and is crucial for the analysis of music and speech signals. The measurement of the fundamental frequency period can be achieved by detecting the peaks or zero-crossings of the sound wave.
- **Applications**: In music production, speech synthesis, and speech recognition, the detection and analysis of the fundamental frequency period are essential for pitch tracking, vocoder design, and music information retrieval tasks.

##### 10. Spectral Centroid
- **Basic Characteristics**: The spectral centroid is a statistical measure that describes the central position of a sound's spectrum, obtained by calculating the magnitude-weighted average frequency of the spectrum. It reflects the "brightness" of a sound: a higher centroid means more energy at high frequencies.
- **Applications**: The spectral centroid has a wide range of applications in music information retrieval, sound classification, and emotion analysis. It can help distinguish the sounds of different musical instruments and recognize the emotional tendencies of sounds.
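
The weighted-average definition can be written in a few lines. This is a minimal NumPy sketch for a single frame; the function name and the test tone are assumptions, and library implementations typically compute the value frame by frame over a spectrogram.

```python
import numpy as np

def spectral_centroid(frame, sample_rate):
    """Magnitude-weighted mean frequency (in Hz) of one audio frame."""
    magnitudes = np.abs(np.fft.rfft(frame))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sample_rate)
    total = magnitudes.sum()
    if total == 0:
        return 0.0  # silent frame: centroid is undefined, return 0 by convention
    return float((freqs * magnitudes).sum() / total)

# A pure 440 Hz tone should have its centroid at about 440 Hz.
sample_rate = 8000
t = np.arange(sample_rate) / sample_rate
centroid = spectral_centroid(np.sin(2 * np.pi * 440 * t), sample_rate)
```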

##### 11. Spectral Contrast
- **Basic Characteristics**: Spectral contrast is an indicator that measures the relative intensity of different frequency components in a sound spectrum, calculated by comparing the energy distribution in different frequency bands. Spectral contrast can reveal the texture and complexity of sound, which is very important for analyzing the clarity and layering of sound.
- **Applications**: Spectral contrast plays an important role in speech recognition, music classification, and sound effect design. It can be used to distinguish the voices of different speakers and assess the style and emotion of musical works.

I have constructed five datasets for testing and learning, comparing them to achieve the best results. The feature combinations are:

- MFCC + Spectrogram
- LSF + LPC
- Spectrogram + Fundamental Frequency Period + Spectral Contrast + Spectral Centroid
- Fundamental Frequency Period + Spectral Contrast + Spectral Centroid

## Building and Basic Selection of Network Architecture

For the network, I chose ResNet as the base architecture. The structure I built (after many modifications) is as follows:

### Summary of Network Hierarchy

The ResNet network I built includes the following main parts:

1. **Basic Layer (b1)**:
   - It uses a 7x7 convolutional layer, batch normalization, ReLU activation function, and max pooling layer, aiming to extract preliminary features from the input image and reduce its dimensions.

2. **Residual Blocks (b2, b3, b4, b5)**:
   - Each residual block is composed of multiple `Residual` modules with skip (cross-layer) connections. Where the number of channels or the stride changes, a 1x1 convolution on the skip path makes the input shape match the output so the two can be added.
   - The design of residual blocks allows the network to learn more complex features and alleviates the vanishing gradient problem in deep networks.

3. **Global Average Pooling and Fully Connected Layer**:
   - At the end of the network, a global average pooling layer is used to adjust the size of the feature map to 1x1, which is then flattened and passed through a fully connected layer to output the classification results.

### How to Define Model Performance

Model performance is usually defined by the following metrics:

- **Accuracy**: Measures the model's correct classification rate on the test set. Accuracy is the most commonly used performance metric, especially in classification tasks. It is obtained by calculating the ratio of correctly classified samples to the total number of samples.

- **Loss**: The output value of the loss function, used to evaluate the quality of the model's predictions. The lower the loss value, the closer the model's predictions are to the true values.

- **Training and Testing Accuracy**: By comparing the accuracy of the training set and the test set, the generalization ability of the model can be assessed. Overfitting models perform well on the training set but poorly on the test set.

### Selected Accuracy Analysis

In the code, accuracy is computed with the help of the `d2l.accuracy()` function. The calculation uses two quantities:

- **Number of Correctly Classified Samples**: Calculates the number of samples where the model's predicted results match the true labels.
- **Total Number of Samples**: Calculates the total number of samples in the current batch.

The formula for calculating accuracy is:
\[
\text{Accuracy} = \frac{\text{Correct Predictions}}{\text{Total Samples}}
\]

By regularly recording training accuracy and testing accuracy during the training process, the learning progress and performance changes of the model can be observed. This monitoring method helps to adjust the learning rate and other hyperparameters in a timely manner to optimize model performance.
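
The accuracy formula can be illustrated with a small plain-Python sketch. It is not the `d2l` implementation; the function name and the example scores are made up for illustration.

```python
def accuracy(logits, labels):
    """Fraction of samples whose highest-scoring class equals the true label."""
    correct = 0
    for scores, label in zip(logits, labels):
        predicted = max(range(len(scores)), key=lambda c: scores[c])
        correct += int(predicted == label)
    return correct / len(labels)

# Four samples, two classes (0 = fake, 1 = real); the third one is misclassified.
logits = [[2.0, 0.5], [0.1, 1.2], [0.3, 0.9], [1.5, 1.4]]
labels = [0, 1, 0, 0]
acc = accuracy(logits, labels)  # 3 correct out of 4
```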

### Choice of Loss Function

In the network, the loss function is **CrossEntropyLoss**, one of the most commonly used loss functions for classification problems. Its features include:

- **Applicability**: Suitable for multi-class classification problems, including the two-class case used here, and handles the probability distribution over class labels.
- **Numerical Stability**: In PyTorch, `nn.CrossEntropyLoss` applies log-softmax internally, so the network outputs raw scores (logits) and no separate softmax layer is needed; combining the two steps keeps the computation numerically stable.
- **Gradient Information**: Provides good gradient information, which is helpful for parameter updates during optimization.

The formula for cross-entropy loss is:
\[
L = -\frac{1}{N} \sum_{i=1}^{N} \sum_{c=1}^{C} y_{i,c} \log(\hat{y}_{i,c})
\]
where \(N\) is the number of samples, \(C\) is the number of classes, \(y\) is the true label, and \(\hat{y}\) is the probability predicted by the model.
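
As a worked instance of this formula, the plain-Python sketch below computes the cross-entropy from raw scores, assuming one-hot labels so that only the true class contributes to the inner sum. The function name and example values are illustrative; the log-sum-exp step is the usual way to keep the softmax numerically stable.

```python
import math

def cross_entropy(logits, labels):
    """Mean of -log(softmax(scores)[label]) over all samples."""
    total = 0.0
    for scores, label in zip(logits, labels):
        m = max(scores)  # subtract the maximum before exponentiating
        log_sum_exp = m + math.log(sum(math.exp(s - m) for s in scores))
        total += log_sum_exp - scores[label]
    return total / len(labels)

uniform = cross_entropy([[0.0, 0.0]], [0])     # equal scores give log(2)
confident = cross_entropy([[4.0, -4.0]], [0])  # correct and confident: small loss
wrong = cross_entropy([[4.0, -4.0]], [1])      # confident but wrong: large loss
```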

### Training Process Selection

The training process in the code consists of the following key steps:

1. **Weight Initialization**: Initialize the weights of the convolutional and fully connected layers using Xavier uniform distribution to ensure a good initial state of the model.

   ```python
   def init_weights(m):
       if type(m) == nn.Linear or type(m) == nn.Conv2d:
           nn.init.xavier_uniform_(m.weight)
   ```

2. **Optimizer**: Use SGD (Stochastic Gradient Descent) as the optimizer with a learning rate of 0.05. SGD is a commonly used optimization algorithm in deep learning, suitable for large-scale datasets.

   ```python
   optimizer = torch.optim.SGD(net.parameters(), lr=lr)
   ```

3. **Training Loop**: In each epoch, perform forward propagation, calculate loss, backward propagation, and parameter updates. Accumulate training loss and accuracy using `d2l.Accumulator` to facilitate the calculation of averages.

4. **Visualization**: Use `d2l.Animator` to visualize the changes in loss and accuracy during the training process, helping to monitor the model's learning progress.

5. **Testing Accuracy**: At the end of each epoch, calculate the accuracy of the test set using the `evaluate_accuracy_gpu` function to assess the model's generalization ability.







# 4 Implemented ML prediction pipelines

The overall pipeline is as follows:

1. Input Data: Processed audio feature images

2. Transformation Stage:
   - Feature Extraction: Utilize the first two layers of ResNet to extract low-level features from the images
   - Residual Block Construction: Use multiple residual blocks to further extract high-level semantic features from the images

3. Model Stage:
   - Network Construction: On top of the feature extraction, add a global average pooling layer, a Flatten layer, and a fully connected layer to build the complete ResNet network model
   - Model Training: Train the network using the cross-entropy loss function and adopt the SGD optimizer
   
4. Integration Stage:
   - No ensemble methods are included; direct predictions are made using a single ResNet model.


## 4.1 Transformation stage

In the transformation stage, we mainly carried out feature extraction operations. First, we constructed the first two layers of ResNet using convolutional layers, BatchNorm layers, and ReLU activation functions to extract low-level features from the images.

Next, we defined the construction function for ResNet residual blocks, stacking multiple residual blocks to further extract high-level semantic features from the images. This feature extraction method based on residual connections can effectively enhance the model's performance and is one of the core innovations of the ResNet network.


## 4.2 Model stage

In the model stage, we integrated the aforementioned feature extraction part into a complete ResNet network model. Specifically, we sequentially stacked the first two layers of ResNet and four residual block modules, and then added a global average pooling layer, a Flatten layer, and a fully connected layer at the end, forming the complete ResNet network structure.

This network structure can effectively extract multi-scale feature representations from images, thereby accomplishing the task of image classification. During the model training process, we use the cross-entropy loss function and the SGD optimizer to perform end-to-end training of the network.


## 4.3 Ensemble stage

In the current pipeline, we have not included any ensemble methods; predictions come directly from a single ResNet model. Ensemble strategies such as Bagging, Boosting, or Stacking can often improve performance further, but they were not used here.

Overall, this ML prediction pipeline leverages the feature extraction capabilities of the ResNet network to build a complete classification model that can effectively accomplish image classification tasks. In the future, it is possible to consider adding ensemble methods to further enhance the performance of the model.


# 5 Dataset

### 1. Dataset Construction

#### 1.1 Dataset Sources

The `MLEnd Deception` dataset contains audio files of recorded stories, together with a CSV file that indicates whether each story is true or false.

In this task, we use the audio files from this dataset and construct feature vectors by extracting different audio features such as MFCC, LPC, LSF, and CQCC.

1. The first step in this task is to preprocess the dataset provided on QM, transforming the data into the format and shape required by our model.
    a. Initially, we opt for ResNet to process the feature images extracted from the audio, hence our first task is to select different audio features and convert them into images. These images will serve as our required dataset.

2. The second step of the task primarily involves establishing a Dataset, which is essential for using PyTorch's Dataloader tool.
    a. First, process the CSV file to mark the filenames and classifications.
    b. Construct the dataset using PyTorch.
    c. Format the dataset into the standard Dataset format, enabling us to train it in ResNet using PyTorch tools.

3. For dataset division, in this experiment, we adopt a training set to test set ratio of 7:3 for the entire dataset.

For dataset processing, see the code in `data.py`.

#### 1.2 Dataset Preprocessing

Preprocess the dataset provided on QM, transforming the data into the format and shape required by our model. After analyzing the various dimensions of sound features and actual experimental results, I have chosen LSF and LPC coefficients as the sources for the dataset.

1. **Pre-emphasis Filter**:
   - Apply pre-emphasis processing to the audio data through the `pre_emphasis` function. This enhances the high-frequency components and improves the robustness of the features.

2. **Feature Extraction**:
   - **Line Spectral Frequencies (LSF)**: Calculate the line spectral frequencies of the audio through the `compute_lsf` function. LSF is derived from Linear Predictive Coding (LPC) coefficients and reflects the spectral characteristics of the sound.

   - **LPC (Linear Predictive Coding) Coefficients**: Obtain LPC coefficients through the `compute_lpc` function. LPC describes the spectral envelope of the audio signal.

3. **Save and Visualize Features**:
   - Save the extracted features as images for subsequent analysis and visualization. These images can help understand the distribution of characteristics in the audio signal.
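
The pre-emphasis step can be sketched as a first-order filter. The body of the project's `pre_emphasis` function is not shown in this report, so the function below is a hypothetical stand-in, and the coefficient 0.97 is a commonly used value assumed here for illustration.

```python
def pre_emphasis_sketch(signal, coeff=0.97):
    """Apply y[n] = x[n] - coeff * x[n-1]; the first sample is kept unchanged."""
    if not signal:
        return []
    return [signal[0]] + [
        signal[n] - coeff * signal[n - 1] for n in range(1, len(signal))
    ]

# A constant (low-frequency) signal is strongly attenuated...
flat = pre_emphasis_sketch([1.0, 1.0, 1.0, 1.0])
# ...while a rapidly alternating (high-frequency) signal is amplified.
alternating = pre_emphasis_sketch([1.0, -1.0, 1.0, -1.0])
```

Comparing `flat` and `alternating` shows why the filter is said to enhance high-frequency components.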

#### 1.3 Dataset Construction

We use the **MLEnd Deception dataset** to construct the model. After feature extraction, each audio recording is represented by a feature image, so the task becomes an image classification problem: classifying each feature image as belonging to a real or a fake story. According to the provided code, the dataset includes images and a label for each image.

The label for each image is indicated by an entry in a CSV file, with label values of 0 or 1, representing "fake" and "real" stories, respectively.

CSV file processing:

![image-2.png](attachment:image-2.png)

##### 1.3.1 Data Loading

Each image in the dataset is stored in a specified directory, and its corresponding labels are recorded in two CSV files:

- `index1.csv` records the image filenames and labels for the training set.
- `index2.csv` records the image filenames and labels for the test set.

The relationship between image files and labels is mapped through CSV files. In the code, the `pd.read_csv()` function is used to read the CSV files, and then images are loaded from the specified path based on the filenames.

```python
self.img_labels = pd.read_csv(img_label_dir)  # Read the CSV file containing image filenames and labels
```

Each time an image is loaded, the code constructs the image path through the file path and then loads it as a tensor using the `read_image()` method:

```python
img_path = os.path.join(self.img_dir, self.img_labels.iloc[index, 0])  # Join the directory and the filename
image = read_image(img_path)
```

##### 1.3.2 Data Preprocessing and Transformation

After loading the images, to ensure the training effectiveness of the model, it is usually necessary to perform some preprocessing operations on the images. The code uses `transforms.Compose()` to perform multiple transformation steps on the images:

```python
transform = transforms.Compose([  
    transforms.Resize((224, 224)),  # Resize the image to 224x224
    transforms.Grayscale(num_output_channels=1)  # Convert to grayscale image
])  
```

These transformations include:

- **Resize**: Adjust the image to a uniform size (224x224) for input into deep learning models (such as ResNet, VGG, etc.).
- **Grayscale**: Convert the image to a single-channel (grayscale image), which helps reduce the complexity of the input for certain tasks (such as image classification).

* Visualization of the dataset:


![image.png](attachment:image.png)

##### 1.3.3 Dataset Splitting

The code utilizes two distinct datasets:

- `dataset_train`: The image dataset for training, stored in the path specified by `img_root_path`.
- `dataset_test`: The image dataset for evaluation, stored in the path specified by `img_root_path1`.

```python
dataset_train = myImageDataset(img_root_path, label_path, transform)  # Training set
dataset_test = myImageDataset(img_root_path1, label_path1, transform)  # Test set
```

#### 1.4 Dataset Splitting

When training machine learning models, it is common to split the dataset into a training set and a validation set (as well as a possible test set). This is to prevent the model from "memorizing" the data and losing its generalization ability. When splitting the dataset, we need to ensure the following points:

1. **Independent and Identically Distributed (IID)**:
   - Ensure that the samples in the training and validation sets come from the same distribution, meaning they should have similar feature distributions.
   - Avoid data leakage and ensure that the samples in the training and validation sets do not overlap.

2. **Proportional Splitting**:
   - Common splitting ratios are 80% for training and 20% for validation, or 70% for training and 30% for validation.
   - In some cases, the test set can be independent of the training and validation sets, but more data is usually required to create a reasonable test set.

In this experiment, we adopt a training set to test set ratio of 7:3 for the entire dataset.
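
A 7:3 split with no overlap between the two sets can be sketched as follows. The filenames, the seed, and the function name are hypothetical; the project's actual split is recorded in `index1.csv` and `index2.csv`.

```python
import random

def split_train_test(items, train_ratio=0.7, seed=0):
    """Shuffle once with a fixed seed, then cut into disjoint train/test lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(train_ratio * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]

# Hypothetical image filenames standing in for the feature images.
filenames = [f"{i:05d}.jpg" for i in range(1, 101)]
train_files, test_files = split_train_test(filenames)
```

Fixing the seed keeps the split reproducible, and splitting by file guarantees that no image appears in both sets.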

### Conclusion

By employing appropriate feature extraction and dataset construction methods, we can create an effective and representative dataset for audio classification tasks. Ensuring the diversity, independence and identical distribution of the dataset, as well as addressing potential limiting factors, is crucial for enhancing the model's generalization ability.




# 6 Experiments and results

**Note: The complete, executable experiment code can be found in `Model.ipynb`, under "2. Dataset Creation and Model Training".**

Below is the analysis of the code and the overall network, with some code excerpts for better understanding.

### Network Structure Construction

1. **Base Layer (b1)**:
   - Consists of a convolutional layer, batch normalization, ReLU activation function, and max pooling. The code is as follows:

   ```python
   b1 = nn.Sequential(
       nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3),
       nn.BatchNorm2d(64),
       nn.ReLU(),
       nn.MaxPool2d(kernel_size=3, stride=2, padding=1)
   )
   ```

   The role of this base layer is to extract preliminary features and reduce the size of the input image.

2. **Residual Block (resnet_block)**:
   - Defines how to create residual blocks, adjusting the number of channels and the stride so that each block matches the output of the previous stage:

   ```python
   def resnet_block(input_channels, num_channels, num_residuals, first_block=False):
       blk = []
       for i in range(num_residuals):
           if i == 0 and not first_block:
               blk.append(Residual(input_channels, num_channels, use_1x1conv=True, strides=2))
           else:
               blk.append(Residual(num_channels, num_channels))
       return blk
   ```

   The blocks created by this function enable the network to learn residual mappings and help mitigate vanishing gradients.

3. **Building Multiple Layers (b2, b3, b4, b5)**:
   - Each subsequent layer is composed of multiple residual blocks, with the number of channels gradually increasing between layers. Example code is as follows:

   ```python
   b2 = nn.Sequential(*resnet_block(64, 64, 2, first_block=True))
   b3 = nn.Sequential(*resnet_block(64, 128, 2))
   b4 = nn.Sequential(*resnet_block(128, 256, 2))
   b5 = nn.Sequential(*resnet_block(256, 512, 2))
   ```

4. **Global Average Pooling and Fully Connected Layer**:
   - Finally, the model is classified through a global average pooling layer and a fully connected layer:

   ```python
   net = nn.Sequential(
       b1, b2, b3, b4, b5,
       nn.AdaptiveAvgPool2d((1, 1)),
       nn.Flatten(),
       nn.Linear(512, 2)
   )
   ```
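
The `Residual` module used by `resnet_block` is not shown in the excerpts above. The sketch below follows the widely used d2l-style residual block and matches the constructor arguments seen in `resnet_block`; it is an assumption about the implementation, and the version in `Model.ipynb` may differ.

```python
import torch
from torch import nn
from torch.nn import functional as F

class Residual(nn.Module):
    """Two 3x3 conv layers plus a skip connection (assumed d2l-style block)."""

    def __init__(self, input_channels, num_channels, use_1x1conv=False, strides=1):
        super().__init__()
        self.conv1 = nn.Conv2d(input_channels, num_channels,
                               kernel_size=3, padding=1, stride=strides)
        self.conv2 = nn.Conv2d(num_channels, num_channels,
                               kernel_size=3, padding=1)
        # Optional 1x1 conv so the skip path matches the output shape.
        self.conv3 = (nn.Conv2d(input_channels, num_channels,
                                kernel_size=1, stride=strides)
                      if use_1x1conv else None)
        self.bn1 = nn.BatchNorm2d(num_channels)
        self.bn2 = nn.BatchNorm2d(num_channels)

    def forward(self, X):
        Y = F.relu(self.bn1(self.conv1(X)))
        Y = self.bn2(self.conv2(Y))
        if self.conv3 is not None:
            X = self.conv3(X)
        return F.relu(Y + X)  # add the skip path before the final activation

# Shape check: halve the spatial size and double the channels.
x = torch.rand(2, 64, 56, 56)
down = Residual(64, 128, use_1x1conv=True, strides=2)(x)
same = Residual(64, 64)(x)
```

With `use_1x1conv=True` and `strides=2`, the block halves the height and width while changing the channel count, which is how `b3`, `b4`, and `b5` shrink the feature map.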

### Training Process

1. **Weight Initialization**:
   - Initialize the weights of convolutional and fully connected layers using Xavier uniform distribution as shown below:

   ```python
   def init_weights(m):
       if type(m) == nn.Linear or type(m) == nn.Conv2d:
           nn.init.xavier_uniform_(m.weight)
   net.apply(init_weights)
   ```

2. **Training Setup**:
   - Use the SGD optimizer with learning rate `lr` and the cross-entropy loss:

   ```python
   optimizer = torch.optim.SGD(net.parameters(), lr=lr)
   loss = nn.CrossEntropyLoss()
   ```

3. **Training Loop**:
   - In each epoch, the process of forward propagation, loss calculation, backpropagation, and gradient update for each batch is exemplified as follows:

   ```python
   for X, y in train_iter:
       ...
       y_hat = net(X)
       ...
       l = loss(y_hat, y)
       l.backward()
       optimizer.step()
   ```

4. **Testing and Visualization**:
   - At the end of each epoch, calculate and visualize the accuracy of the test set with the following example code:

   ```python
   test_acc = evaluate_accuracy_gpu(net, test_iter)
   animator.add(epoch + 1, (None, None, test_acc))
   ```

5. **Performance Evaluation**:
   - At the end of training, print the training loss, training accuracy, and test accuracy:

   ```python
   print(f'loss {train_l:.3f}, train acc {train_acc:.3f}, test acc {test_acc:.3f}')
   ```

### Results Analysis:

As mentioned above, I constructed five datasets for testing and learning. The feature combinations are: MFCC+Spectrogram; LSF+LPC; Spectrogram+F0+Spectral Contrast+Spectral Centroid; F0+Spectral Contrast+Spectral Centroid.

Sample feature images from the datasets are shown below (in order):


![00001.jpg](attachment:00001.jpg)

![00001-2.jpg](attachment:00001-2.jpg)

![00001-3.jpg](attachment:00001-3.jpg)

![00001-4.jpg](attachment:00001-4.jpg)

![00001-5.jpg](attachment:00001-5.jpg)


### Training I

However, the training results were not very satisfactory.

With the MFCC+Spectrogram and LSF+LPC datasets, the model is prone to non-convergence.

![image-2.png](attachment:image-2.png)


Analyzing the reasons:

1. Inadequate data preprocessing: Data has not been normalized or standardized, leading to difficulties in model optimization across data with different scales.

2. Mismatch between data volume and model complexity: Having too much data with a small model, or too little data with an overly complex model, can both result in the model's inability to learn effectively.

3. Gradient issues: Exploding or vanishing gradients are common problems in deep learning; they can cause abnormal parameter updates and prevent the model from converging.

### Training II

For the training involving fundamental frequency, spectral contrast, and spectral centroid, the data is very prone to overfitting (this could be due to a small training set, or it could be due to excessive model complexity, which increases the degree of overfitting).

![image.png](attachment:image.png)


Analyzing the causes of overfitting:

1. Excessive model complexity: If the number of model parameters is too high relative to the amount of training data, the model may learn the noise and details in the training data, rather than just the underlying data distribution.

2. Insufficient training data: A lack of training samples or insufficient sample diversity can prevent the model from generalizing to new data.

3. Inadequate regularization: Regularization techniques (such as L1 and L2 regularization) can help reduce overfitting. If these techniques are not applied properly or are not strong enough, overfitting may occur.

4. Lack of cross-validation: Without cross-validation, the model's generalization ability is not measured reliably, so overfitting can go undetected and the model's performance can be overestimated.
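To make the cross-validation point concrete, here is a minimal sketch of how sample indices could be split into k folds. The function `k_fold_indices`, the sample count of 100, and k = 5 are assumptions chosen for illustration; they do not describe the actual dataset.

```python
import random


def k_fold_indices(num_samples, k, seed=0):
    """Yield (train_idx, val_idx) pairs for k-fold cross-validation."""
    indices = list(range(num_samples))
    random.Random(seed).shuffle(indices)  # fixed seed for reproducibility
    folds = [indices[i::k] for i in range(k)]  # k nearly equal folds
    for i in range(k):
        val_idx = folds[i]
        train_idx = [j for f, fold in enumerate(folds) if f != i
                     for j in fold]
        yield train_idx, val_idx


# Hypothetical sizes: 100 samples, 5 folds
splits = list(k_fold_indices(num_samples=100, k=5))
for fold, (train_idx, val_idx) in enumerate(splits, start=1):
    print(f'fold {fold}: {len(train_idx)} train, {len(val_idx)} validation')
```

Each sample is used for validation exactly once, so averaging the validation accuracy over the folds gives a less optimistic estimate than a single split.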

### Improvement Directions

#### For the non-convergence of MFCC+spectrogram and LSF+LPC datasets:

1. **Enhance data preprocessing**:
   - Normalize or standardize the data so that the model can optimize effectively across features with different scales.
   - Consider applying pre-emphasis to the raw audio before extracting the MFCC and LSF features; this boosts the high-frequency components of the signal, which may make the extracted features more informative.

2. **Adjust the match between data volume and model complexity**:
   - If the data volume is large and the model is small, consider increasing the model's complexity; conversely, if the data volume is insufficient and the model is overly complex, simplify the model structure.

3. **Address gradient issues**:
   - Exploding or vanishing gradients can be mitigated by using gradient clipping, choosing a suitable optimizer (such as Adam), or adjusting the learning rate.
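The preprocessing and gradient-clipping suggestions above can be sketched as follows. This is a minimal illustration under stated assumptions: the helper names (`pre_emphasis`, `standardize`, `train_step`), the pre-emphasis coefficient 0.97, the 13-row feature matrix, and the per-recording normalization are all choices made for this example, not the settings used in the project.

```python
import numpy as np
import torch
from torch import nn


def pre_emphasis(signal, coeff=0.97):
    """Apply y[t] = x[t] - coeff * x[t-1] to a 1-D waveform."""
    return np.concatenate(([signal[0]], signal[1:] - coeff * signal[:-1]))


def standardize(features, eps=1e-8):
    """Scale each feature row (axis 1 = time) to zero mean, unit variance."""
    mean = features.mean(axis=1, keepdims=True)
    std = features.std(axis=1, keepdims=True)
    return (features - mean) / (std + eps)


def train_step(net, X, y, loss_fn, optimizer, max_norm=1.0):
    """One optimization step with gradient-norm clipping."""
    optimizer.zero_grad()
    loss = loss_fn(net(X), y)
    loss.backward()
    # Rescales gradients in place; returns the norm measured before clipping
    norm_before = nn.utils.clip_grad_norm_(net.parameters(), max_norm)
    optimizer.step()
    return loss.item(), float(norm_before)


# Toy usage with random data (illustrative only, not project data)
rng = np.random.default_rng(0)
torch.manual_seed(0)
wave = rng.standard_normal(1000)                 # stand-in waveform
emphasized = pre_emphasis(wave)
feats = rng.standard_normal((13, 200)) * 50 + 10  # badly scaled features
feats_norm = standardize(feats)

toy_net = nn.Linear(13, 2)
optimizer = torch.optim.Adam(toy_net.parameters(), lr=1e-3)
X = torch.tensor(feats_norm.T, dtype=torch.float32)  # (frames, features)
y = torch.randint(0, 2, (200,))
loss_value, norm_before = train_step(
    toy_net, X, y, nn.CrossEntropyLoss(), optimizer)
print(f'loss {loss_value:.3f}, grad norm before clipping {norm_before:.3f}')
```

When the features are rendered as images before being fed to the network, the same idea applies to the pixel values, for example by normalizing each channel with statistics computed on the training set.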

#### For the overfitting of the fundamental frequency+spectral contrast+spectral centroid dataset:

1. **Reduce model complexity**:
   - If the model has too many parameters, try reducing the number of parameters to decrease overfitting to the training data.

2. **Increase the volume of training data**:
   - Increase the number of samples or sample diversity to enhance the model's generalization ability.

3. **Enhance regularization**:
   - Apply L1 or L2 regularization, or add Dropout layers to the network, to reduce overfitting.

4. **Implement cross-validation**:
   - Use cross-validation to assess the model's generalization ability and avoid overestimating the model's performance.

5. **Adjust training strategies**:
   - Consider using early stopping, that is, halting training once the loss on a validation set no longer decreases, to prevent overfitting.

6. **Decrease the batch size**:
   - A smaller batch size can have a regularizing effect, because gradient estimates computed from smaller batches are noisier, which acts somewhat like adding noise during training.
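Several of the measures above (Dropout, L2 regularization, early stopping) take only a few lines in PyTorch. The sketch below is one possible arrangement under assumptions: the `EarlyStopping` class, the 512-dimensional classifier head, the dropout rate, the weight decay value, and the validation losses are all made up for illustration and are not results or settings from this project.

```python
import torch
from torch import nn


class EarlyStopping:
    """Stop when validation loss has not improved for `patience` epochs."""

    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float('inf')
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch's validation loss; return True to stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Hypothetical classifier head with Dropout; weight_decay adds an L2 penalty
head = nn.Sequential(nn.Dropout(p=0.5), nn.Linear(512, 2))
optimizer = torch.optim.Adam(head.parameters(), lr=1e-3, weight_decay=1e-4)

# Made-up validation losses, used only to show the stopping logic
val_losses = [0.70, 0.62, 0.58, 0.59, 0.61, 0.60, 0.65]
stopper = EarlyStopping(patience=3)
stopped_at = None
for epoch, val_loss in enumerate(val_losses, start=1):
    if stopper.step(val_loss):
        stopped_at = epoch
        break
print(f'stopped at epoch {stopped_at}, best val loss {stopper.best:.2f}')
```

Early stopping is best driven by a validation split held out from the training data, so that the test set remains untouched for the final evaluation.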

Alternatively, one could abandon the ResNet structure and train using RNNs or other models.
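As a rough idea of what an RNN-based alternative could look like, here is a minimal GRU classifier sketch that reads a sequence of feature frames and outputs two logits (true or fictional). The class name `GRUClassifier`, the 13 features per frame, the hidden size, and the sequence length are assumptions for illustration; this model has not been trained or evaluated in this project.

```python
import torch
from torch import nn


class GRUClassifier(nn.Module):
    """Classify a feature sequence using the final GRU hidden state."""

    def __init__(self, num_features, hidden_size=64, num_classes=2):
        super().__init__()
        self.gru = nn.GRU(num_features, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, num_classes)

    def forward(self, x):
        # x: (batch, time, num_features)
        _, h_n = self.gru(x)      # h_n: (num_layers, batch, hidden_size)
        return self.fc(h_n[-1])   # logits: (batch, num_classes)


# Toy usage: 4 recordings, 200 frames each, 13 features per frame
torch.manual_seed(0)
model = GRUClassifier(num_features=13)
dummy = torch.randn(4, 200, 13)
logits = model(dummy)
print(logits.shape)
```

Unlike the image-based ResNet input, this kind of model consumes the features frame by frame, which matches the time-series nature of audio discussed in the problem formulation.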




# 7 Conclusions

**Conclusion:**

The experiments showed that the model struggled to converge on the MFCC+Spectrogram and LSF+LPC datasets, likely due to inadequate data preprocessing, a mismatch between data volume and model complexity, and exploding or vanishing gradients.

In addition, the model tended to overfit when trained on the dataset built from fundamental frequency, spectral contrast, and spectral centroid, possibly due to excessive model complexity, insufficient training data, and inadequate regularization. Because no cross-validation was used, the reported performance may also be overestimated.

**Improvement Suggestions:**

1. **Enhance Data Preprocessing:**
   - Normalize or standardize data to facilitate model optimization across different scales.
   - Apply pre-emphasis to the raw audio before extracting MFCC and LSF features to boost high-frequency components.

2. **Balance Data Volume and Model Complexity:**
   - Adjust model complexity based on the volume of data available to prevent underfitting or overfitting.

3. **Manage Gradient Issues:**
   - Use gradient clipping, select an appropriate optimizer, or adjust the learning rate to mitigate exploding or vanishing gradients.

4. **Reduce Model Complexity:**
   - Decrease the number of parameters if the model is overfitting to the training data.

5. **Expand Training Data:**
   - Increase sample size or diversity to improve model generalization.

6. **Strengthen Regularization:**
   - Apply L1 or L2 regularization, or add Dropout layers, to mitigate overfitting.

7. **Implement Cross-Validation:**
   - Use cross-validation to assess and avoid overestimating model performance.

8. **Adjust Training Strategies:**
   - Employ early stopping to prevent overfitting by halting training when validation loss stops improving.

9. **Optimize Batch Size:**
   - Consider smaller batch sizes for stronger regularization effects.

10. **Explore Alternative Models:**
    - Consider replacing the ResNet structure with RNNs or other models that may better handle the specific challenges of these datasets.


# 8 References

While building and training the deep learning models in this project, I drew on a number of books, papers, tools, and libraries.

The resources that deserve acknowledgment are listed below:

### Books
1. **"Deep Learning"** (Ian Goodfellow, Yoshua Bengio, Aaron Courville):
   - This book is a classic textbook in the field of deep learning, systematically introducing the basic theories and applications of deep learning.

### Papers
1. **"Deep Residual Learning for Image Recognition"** (Kaiming He et al.):
   - This paper proposed the residual network (ResNet) architecture, which made very deep networks much easier to train and became an important milestone in computer vision.

2. **"Densely Connected Convolutional Networks"** (Gao Huang et al.):
   - Introduced the DenseNet architecture, emphasizing the idea of dense connections, further promoting the design of deep learning models.

### Open Source Libraries and Tools
1. **PyTorch**:
   - A popular deep learning framework that provides flexible dynamic computation graphs, easy to debug and develop, and widely used in research and industry.

2. **Matplotlib** and **Seaborn**:
   - Python libraries for data visualization, helping us visualize changes in loss and accuracy during training, facilitating the analysis of model performance.

### Code Repositories
1. **d2l.ai** (Dive into Deep Learning):
   - This open-source project provides practical tutorials and example code for deep learning, helping learners better understand the concepts and implementations of deep learning.

2. **Fastai**:
   - A high-level library based on PyTorch, aiming to simplify the deep learning training process, providing many pre-trained models and convenient APIs.

### Other Tools
1. **Jupyter Notebook**:
   - An interactive computing environment, suitable for data analysis and deep learning experiments, supporting visualization and document writing.

I am grateful to the authors and maintainers of the above resources, which provided important support for this project and for the development and application of deep learning in general.

