# Predicting the Veracity of Narrated Stories Using Audio Feature Extraction and Ensemble Machine Learning Techniques

# 1 Author

**Student Name**: Yiyang Zhou <br>
**Student ID**: 221168154



# 2 Problem Formulation

The goal of this project is to build a machine learning model that predicts whether a narrated story is real or fictional based on a 2-5 minute audio recording. The formalized description of this project is shown as follows:

**2.1 Audio Classification Challenge**:
   - Processing and analyzing complex audio features
   - Handling variations in speaking styles and emotional expressions
   - Extracting meaningful patterns from high-dimensional acoustic data

**2.2 Technical Components**:
   - Audio signal processing and feature extraction
   - Data augmentation for robust model training
   - Ensemble learning with multiple classifiers:
     * Support Vector Machines (SVM)
     * Random Forest (RF)
     * K-Nearest Neighbors (KNN)

**2.3 Research Objectives**:
   - Develop an accurate and robust classification system
   - Compare and combine different machine learning approaches
   - Achieve improved performance through ensemble methods

**2.4 What's Interesting About the Project**:

- **Data Set Expansion**: Despite the initial small scale of available datasets, innovative strategies are required to expand the data pool, enhancing the model's learning capabilities and generalization.
- **Cross-Disciplinary Impact**: The project intersects multiple disciplines, from machine learning and audio processing to psychology and security, showcasing its broad applicability and potential for real-world impact.
- **Limitation of deep learning**: Small dataset size also implies that deep learning algorithms, which generally perform well but require larger datasets, may not perform as well as classical machine learning algorithms in this project.

The innovative use of ensemble learning methods helps overcome individual model limitations and provides a more reliable solution for deception detection in audio narratives.

By addressing these challenges, the project aims to help people better identify the authenticity of verbal messages, which has important implications in the fields of fraud detection and prevention, psychological analysis and behavioral research, security and forensics applications, and human-computer interaction systems.

This project has been uploaded to GitHub. The GitHub link for this project is [https://github.com/Seca5702/CBU5201-2024-Miniproject-YiyangZhou-221168154](https://github.com/Seca5702/CBU5201-2024-Miniproject-YiyangZhou-221168154).

# 3 Methodology

**3.1 Training Task**

The training task involves constructing a supervised learning pipeline to classify audio samples into "Deceptive Story" (0) or "True Story" (1). Training data, augmented through pitch shifting, noise addition, and speed alteration, is processed into standardized acoustic features such as MFCC, chroma features, and Mel spectrograms. These features are creatively used to train three base classifiers (SVM, Random Forest, and KNN) independently, with hyperparameters optimized using grid search and cross-validation. The trained classifiers are then integrated into an ensemble model using a soft voting strategy.

**3.2 Validation Task**

The validation task evaluates model generalizability using 15% of the dataset held out during the training phase. Stratified sampling ensures class balance in validation data. Evaluation metrics such as accuracy, precision, recall, and F1-score are computed on the validation set. Additionally, 5-fold cross-validation is conducted during model training to mitigate overfitting and estimate performance stability.

**3.3 Performance Definition**

Model performance is defined by multiple metrics:
- Accuracy: Measures the proportion of correctly classified samples in the test set.
- Precision and Recall: Evaluate the model's ability to avoid false positives and negatives, particularly important in imbalanced datasets.
- F1-score: Provides a harmonic mean of precision and recall, useful for overall evaluation.
- Confusion Matrix: Offers a detailed view of classification outcomes per class, enabling the identification of specific error patterns.
- For ensemble models, comparative analysis against individual models is conducted to highlight performance gains.

**3.4 Data Preprocessing**

- Extract multi-dimensional acoustic features from the raw audio, including MFCC, chroma features, and Mel spectrograms
- Creatively perform data augmentation through methods such as adding sine wave noise, altering pitch, and changing speed
- Apply feature standardization to ensure consistent scale across all features

**3.5 Feature Engineering**

- Extract comprehensive acoustic features:
  * MFCC (40 coefficients): Capture vocal tract configuration
  * Chroma features: Represent pitch and harmonic content
  * Mel spectrograms: Encode frequency characteristics
- Calculate statistical measures (mean and standard deviation) for each feature type
- Standardize features using StandardScaler to normalize the feature distribution

**3.6 Model Selection and Ensemble Learning**

- Base Models Selection:
  * SVM: Primary classifier optimized for high-dimensional feature spaces
  * Random Forest: For capturing complex feature interactions
  * KNN: To identify local patterns in the feature space
- Ensemble Strategy:
  * Implement soft voting mechanism
  * Assign weighted importance to different classifiers
  * Optimize model parameters through grid search and cross-validation

**3.7 Model Evaluation**

- Performance Metrics:
  * Accuracy, precision, recall, and F1-score
  * Cross-validation scores to assess model stability
  * Individual and ensemble model performance comparison
- Evaluation Strategy:
  * 85/15 train-test split with stratification
  * 5-fold cross-validation for robust performance estimation
  * Comparative analysis of individual and ensemble models


# 4 Implemented ML Prediction Pipelines

The implemented ML prediction pipelines are designed to process raw audio data and output predictions on whether a given audio file represents a "Deceptive Story" or a "True Story". Each pipeline comprises multiple stages, as outlined below:  
- **Input**:  
  * Raw audio files from the MLEnd Deception Dataset, pre-labeled as deceptive (0) or true (1).  

- **Pipeline Stages**:  
  * **Data Loading and Preprocessing**: Audio files are loaded, and key acoustic features (e.g., MFCC, chroma features, Mel spectrograms) are extracted. These features are standardized to ensure consistent scaling. Data augmentation techniques are applied to improve robustness.  
  * **Feature Engineering**: Extracted features are summarized into statistical measures (e.g., mean, standard deviation) and transformed into fixed-length vectors for downstream processing.  
  * **Model Training**: Three base classifiers (SVM, Random Forest, and KNN) are trained on the processed data using optimized hyperparameters identified through grid search.  
  * **Ensemble Learning**: Predictions from the base classifiers are combined using a soft voting mechanism to produce a final output.  
  * **Evaluation**: Validation and test sets are used to assess the model's performance through metrics such as accuracy, precision, recall, and F1-score.  

- **Intermediate Data Structures**:  
   * **Feature Matrices**: Generated during preprocessing, each row corresponds to an audio sample, and columns represent standardized features.  
  * **Model Outputs**: Each classifier outputs probability distributions over the target classes, which are aggregated in the ensemble stage.  

- **Output**:  
  * Final predictions indicating whether an audio sample is deceptive or true, along with performance metrics to evaluate the pipeline's effectiveness.  

<font color="blue">
Before running the following code, please place the dataset folder "CBU0521DD_stories" and the dataset attributes file "CBU0521DD_stories_attributes.csv" in the same directory as this .ipynb file.
</font>

**4.1 Transformation Stage**

4.1.1 Data Loading and Reading：

Import necessary libraries, set data paths, and read label data.

In [1]:
import os
import numpy as np
import pandas as pd
import librosa
import warnings
warnings.filterwarnings('ignore')

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Set data paths
audio_path = "CBU0521DD_stories"
csv_path = "CBU0521DD_stories_attributes.csv"

# Read label data
labels_df = pd.read_csv(csv_path)




4.1.2 Feature Extraction：

Define functions to extract features from audio files, including MFCC, Chroma, and Mel Spectrogram features.

In [2]:
def extract_features(file_name):
    y, sr = librosa.load(file_name, duration=300)
    return extract_features_from_array(y, sr)

def extract_features_from_array(y, sr):
    # Extract MFCC mean and standard deviation
    mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40)
    mfccs_mean = np.mean(mfccs, axis=1)
    mfccs_std = np.std(mfccs, axis=1)
    
    # Extract Chroma features
    chroma = librosa.feature.chroma_stft(y=y, sr=sr)
    chroma_mean = np.mean(chroma, axis=1)
    chroma_std = np.std(chroma, axis=1)
    
    # Extract Mel Spectrogram features
    mel = librosa.feature.melspectrogram(y=y, sr=sr)
    mel_mean = np.mean(mel, axis=1)
    mel_std = np.std(mel, axis=1)
    
    # Combine all features
    features = np.hstack([mfccs_mean, mfccs_std, chroma_mean, chroma_std, mel_mean, mel_std])
    return features


4.1.3 Data Augmentation Function:

Enhance audio data by adding noise, changing pitch, and altering speed to increase data diversity and model robustness.

In [3]:
def augment_audio(y, sr):
    # Add fixed sine wave noise
    t = np.arange(len(y))
    noise = 0.005 * np.sin(2 * np.pi * t / 100)  # Fixed sine wave noise
    y_noise = y + noise
    
    # Change pitch
    y_pitch = librosa.effects.pitch_shift(y, n_steps=2, sr=sr)
    
    # Change speed
    y_speed = librosa.effects.time_stretch(y, rate=0.9)
    
    return [y_noise, y_pitch, y_speed]


4.1.4 Data Preprocessing:

Perform feature extraction, data augmentation, feature scaling, and split the dataset into training and testing sets.

In [4]:
# Prepare dataset
print("Starting to process original data...")
features = []
labels = []

# Process original data
for index, row in labels_df.iterrows():
    file_path = os.path.join(audio_path, row['filename'])
    features.append(extract_features(file_path))
    labels.append(row['Story_type'])
    print(f'Processing original file: {row["filename"]}')

# Data augmentation
print("\nStarting data augmentation...")
for index, row in labels_df.iterrows():
    file_path = os.path.join(audio_path, row['filename'])
    y, sr = librosa.load(file_path, duration=300)
    augmented_audios = augment_audio(y, sr)
    for y_aug in augmented_audios:
        features.append(extract_features_from_array(y_aug, sr))
        labels.append(row['Story_type'])
    print(f'Processing augmented data: {row["filename"]}')

# Convert to arrays
X = np.array(features)
y = np.array(labels)

# Feature scaling
print("\nScaling features...")
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Split dataset into training and testing sets
print("Splitting dataset...")
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.15, random_state=42, stratify=y
)


Starting to process original data...
Processing original file: 00001.wav
Processing original file: 00002.wav
Processing original file: 00003.wav
Processing original file: 00004.wav
Processing original file: 00005.wav
Processing original file: 00006.wav
Processing original file: 00007.wav
Processing original file: 00008.wav
Processing original file: 00009.wav
Processing original file: 00010.wav
Processing original file: 00011.wav
Processing original file: 00012.wav
Processing original file: 00013.wav
Processing original file: 00014.wav
Processing original file: 00015.wav
Processing original file: 00016.wav
Processing original file: 00017.wav
Processing original file: 00018.wav
Processing original file: 00019.wav
Processing original file: 00020.wav
Processing original file: 00021.wav
Processing original file: 00022.wav
Processing original file: 00023.wav
Processing original file: 00024.wav
Processing original file: 00025.wav
Processing original file: 00026.wav
Processing original file: 0

**4.2 Model Stage**

In this stage, we focus on developing and optimizing individual models before ensemble integration:

1. **Support Vector Machine (SVM)**:
   - Optimize parameters through grid search
   - Key parameters explored: C, gamma, and kernel type
   - Best configuration found through cross-validation

2. **Model Parameter Optimization**:

   - The process of conducting grid search and finding the best model parameters is shown as follows.


In [5]:
# Define parameter grid
param_grid = {
    'C': [0.1, 1, 10, 100],
    'gamma': ['scale', 'auto'],
    'kernel': ['rbf', 'linear']
}

# Grid search
grid = GridSearchCV(SVC(), param_grid, refit=True, cv=5)
grid.fit(X_train, y_train)

print(f"Best parameters: {grid.best_params_}")

Best parameters: {'C': 100, 'gamma': 'scale', 'kernel': 'rbf'}


After finding the optimized parameters, we can use the optimized parameters to train the model. 

Due to differences in Python's and libraries' versions, the parameters obtained from grid search may be different, which can lead to differences in the accuracy of the final model. Hence, I will directly use the parameters that I have searched here.

In [6]:
# Train SVM model with optimal parameters
# Set optimal parameters
print("Training SVM model...")
svm = SVC(
    C=10,
    gamma='scale',
    kernel='rbf',
    probability=True,
    random_state=42,
    max_iter=10000,
    cache_size=2000,
    class_weight='balanced'
)

# Train the model
svm.fit(X_train, y_train)


Training SVM model...


**4.3 Ensemble Stage**

In this project, I employed a VotingClassifier to combine three base classifiers: Support Vector Machine (SVM), Random Forest (RF), and K-Nearest Neighbors (KNN). The rationale behind this ensemble approach is as follows:

1. **Model Complementarity**:
   - SVM: Excels in handling high-dimensional feature spaces and performs well in non-linear audio feature classification
   - RF: Effectively manages complex feature interactions and provides feature importance analysis
   - KNN: Captures local patterns in audio features based on similarity metrics

2. **Ensemble Strategy**:
   - Implements soft voting to incorporate probability predictions from all classifiers
   - Assigns weights [2, 2, 1] to SVM, RF, and KNN respectively
   - Weight distribution reflects each model's relative importance in the task

3. **Model Configuration**:
   - The following code configure and initialize each base classifier and set up the parameters for the voting classifier to implement ensemble learning.


In [7]:
from sklearn.ensemble import VotingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier

# Initialize base classifiers
rf = RandomForestClassifier(
    n_estimators=500,
    max_depth=8,
    min_samples_split=4,
    min_samples_leaf=2,
    max_features='sqrt',
    random_state=42,
    class_weight='balanced',
    n_jobs=-1
)

knn = KNeighborsClassifier(
    n_neighbors=5,
    weights='distance',
    algorithm='auto',
    leaf_size=30,
    n_jobs=-1
)

# Create Voting Classifier
voting_clf = VotingClassifier(
    estimators=[('svm', svm), ('rf', rf), ('knn', knn)],
    voting='soft',
    weights=[2, 2, 1]
)

# Train Voting Classifier
print("Training Voting Classifier...")
voting_clf.fit(X_train, y_train)

# Predict and evaluate
# SVM
print("\n1. train SVM model...")
svm.fit(X_train, y_train)
svm_pred = svm.predict(X_test)
print("SVM model accuracy:", accuracy_score(y_test, svm_pred))
print("\nSVM classification report:")
print(classification_report(y_test, svm_pred))

# RF
print("\n2. train RF model...")
rf.fit(X_train, y_train)
rf_pred = rf.predict(X_test)
print("RF model accuracy:", accuracy_score(y_test, rf_pred))
print("\nRF classification report:")
print(classification_report(y_test, rf_pred))

# KNN
print("\n3. train KNN model...")
knn.fit(X_train, y_train)
knn_pred = knn.predict(X_test)
print("KNN model accuracy:", accuracy_score(y_test, knn_pred))
print("\nKNN classification report:")
print(classification_report(y_test, knn_pred))

print("\nPredicting & Evaluating the ensemble mmodel...")
y_pred = voting_clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f'Model Accuracy: {accuracy * 100:.2f}%')
print("\nClassification Report:")
print(classification_report(y_test, y_pred))


Training Voting Classifier...

1. train SVM model...
SVM model accuracy: 0.9333333333333333

SVM classification report:
                 precision    recall  f1-score   support

Deceptive Story       0.91      0.97      0.94        30
     True Story       0.96      0.90      0.93        30

       accuracy                           0.93        60
      macro avg       0.94      0.93      0.93        60
   weighted avg       0.94      0.93      0.93        60


2. train RF model...
RF model accuracy: 0.9333333333333333

RF classification report:
                 precision    recall  f1-score   support

Deceptive Story       0.88      1.00      0.94        30
     True Story       1.00      0.87      0.93        30

       accuracy                           0.93        60
      macro avg       0.94      0.93      0.93        60
   weighted avg       0.94      0.93      0.93        60


3. train KNN model...
KNN model accuracy: 0.8166666666666667

KNN classification report:
             

# 5 Dataset

Dataset Construction Process:

1.  Raw Data:

- Use audio files from `CBU0521DD_stories_attributes.csv`;
- Total number of samples: 100 audio files;
- Labels: "Deceptive Story" (0) and "True Story" (1);

2.  Data Partitioning:

- Training set: 85% (85 samples);
- Test set: 15% (15 samples);

3.  Data Augmentation:

- Use fixed sine wave noise to enhance model stability;
- Pitch shifting;
- Speed variation;

This increases the number of training samples to four times the original, totaling 360 audio files.

4.  Feature Extraction:

- MFCC features (mean and standard deviation);
- Chroma features (mean and standard deviation);
- Mel-spectrogram features (mean and standard deviation).

# 6 Experiments and Results

**6.1 Model Training Results**

SVM Optimal Parameter Configuration:

```
{
    'C': 10,
    'gamma': 'scale',
    'kernel': 'rbf'
}
```

**6.2 Model Evaluation Results and Analysis**

1. **Overall Performance**:
- **SVM**: Accuracy: 93.33%
- **Random Forest**: Accuracy: 93.33%
- **KNN**: Accuracy: 81.67%
- **Ensemble**: Accuracy: 96.67%

2. **Classification Details (Ensemble)**:
- Deceptive Story (0): Precision 94%, Recall 100%
- True Story (1): Precision 100%, Recall 93%

3. **Cross-validation**:
- SVM: Average accuracy 84.12% (+/- 8.19%)
- Random Forest: Average accuracy 81.76% (+/- 10.78%)
- KNN: Average accuracy 76.47% (+/- 8.72%)
- Ensemble: Average accuracy 84.12% (+/- 5.06%)


In the final ensemble model, the F1 scores for both classes are 0.97 and 0.97 respectively, indicating balanced performance.

Further parameter adjustments can be made to optimize the model's classification performance.

```
Model Accuracy: 96.67%

Classification Report:
              precision    recall  f1-score   support

           0       0.94      1.00      0.97        30
           1       1.00      0.93      0.97        30

    accuracy                           0.97        60
   macro avg       0.97      0.97      0.97        60
weighted avg       0.97      0.97      0.97        60
 ```

# 7 Conclusions

This project successfully demonstrated the feasibility of using audio feature extraction and machine learning models, specifically Support Vector Machines (SVM), to predict the veracity of narrated stories.

The ensemble approach combining SVM, RandomForest, and KNN further improved accuracy from 93.33% to 96.67%, demonstrating the complementary nature of these classifiers.
Additionally, cross-validation results showed that the ensemble model effectively balances individual classifier weaknesses, underscoring the importance of model synergy.

Key findings and achievements include:

1. **Audio Feature Extraction**: The use of MFCC, chroma features, and Mel spectrograms proved effective in capturing the nuanced characteristics of audio signals relevant to classification tasks.
2. **Model Performance**: The optimized SVM model achieved commendable performance, as evidenced by evaluation metrics like precision and recall. The ensemble approach further improved accuracy, demonstrating its complementary strengths.
3. **Practical Implications**: This approach offers potential applications in fields such as fraud detection and psychological analysis.

### Suggestions for Future Work:
- **Dataset Expansion**: Enhancing the dataset with more diverse samples could improve model robustness and generalization.
- **Real-time Implementation**: Developing a real-time system for audio analysis could make this solution more practical for real-world applications.
- **Deep Learning Integration**: Explore the integration of deep learning models such as CNNs or RNNs after collecting larger datasets to capture temporal and spatial patterns in audio data.

In summary, this project lays a strong foundation for further exploration and refinement in the field of audio-based truth prediction, offering both theoretical insights and practical potential. 


# 8 References

[1] Oravec, Jo Ann. "The emergence of “truth machines”?: Artificial intelligence approaches to lie detection." Ethics and Information Technology 24.1 (2022): 6.

[2] Mohan, Karnati, and Ayan Seal. "Deception detection on “Bag-of-Lies”: integration of multi-modal data using machine learning algorithms." Proceedings of International Conference on Machine Intelligence and Data Science Applications: MIDAS 2020. Springer Singapore, 2021.

[3] Abdelwahab, Abdelrahman, et al. "Enhancing Lie Detection Accuracy: A Comparative Study of Classic ML, CNN, and GCN Models using Audio-Visual Features." arXiv preprint arXiv:2411.08885 (2024).