# Urban Sound Classification using Deep Learning

## 1. Project Overview

**Type of Learning**: Supervised Deep Learning

**Algorithms**: Deep Neural Networks (MLP, CNN, LSTM) with Hyperparameter Tuning

**Task**: Multi-class Classification of Urban Sounds

This project focuses on classifying urban sounds into 10 different categories using deep learning approaches. The UrbanSound8K dataset contains 8732 labeled sound excerpts from urban environments, which we use to train and evaluate various neural network architectures.

## 2. Motivation and Goal

**Motivation**: Urban sound classification has important applications in:
- Smart city monitoring and noise pollution analysis
- Audio surveillance systems
- Environmental sound recognition for IoT devices
- Audio-based context awareness in mobile applications

**Goal**: Develop an accurate and robust deep learning model that classifies urban sounds into 10 distinct categories, using hyperparameter tuning and a sound evaluation methodology.

## 3. Dataset Source and Citation

**Dataset**: UrbanSound8K Dataset

**Source**: https://urbansounddataset.weebly.com/urbansound8k.html

**Citation**:
J. Salamon, C. Jacoby and J. P. Bello, "A Dataset and Taxonomy for Urban Sound Research",
22nd ACM International Conference on Multimedia, Orlando USA, Nov. 2014.

Dataset compiled by Justin Salamon, Christopher Jacoby and Juan Pablo Bello. All files are excerpts of recordings
uploaded to www.freesound.org. Please see FREESOUNDCREDITS.txt for an attribution list.

In [None]:
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import librosa
import librosa.display
import warnings
warnings.filterwarnings('ignore')

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.preprocessing import LabelEncoder, StandardScaler
import keras_tuner as kt

# Set random seeds for reproducibility
np.random.seed(42)
tf.random.set_seed(42)

In [None]:
# Dataset path
DATASET_PATH = "UrbanSound8K"
METADATA_FILE = os.path.join(DATASET_PATH, "metadata", "UrbanSound8K.csv")

# Load metadata
metadata = pd.read_csv(METADATA_FILE)
print("Dataset loaded successfully!")
print("Dataset shape:", metadata.shape)

## 4. Dataset Description
**Data Size**:
- Total sound files: 8,732 excerpts
- Classes: 10 urban sound categories
- Clip length: at most 4 seconds
- Source: excerpts sliced from original recordings of varying length
- Format: WAV files organized in 10 folds for cross-validation

The sampling rate, bit depth, and number of channels are the same as those of the original file uploaded to Freesound (and hence may vary from file to file).
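Because these properties differ from file to file, it can help to inspect them before extracting features. The sketch below uses Python's standard `wave` module on a small synthetic WAV built in memory; `make_test_wav` and `describe_wav` are hypothetical helpers, not part of the dataset tooling. The `wave` module reads plain PCM WAV only, so some real dataset files may need a more general reader. The later `librosa.load(file_path, sr=22050)` calls in this notebook resample every clip to a common rate and, by librosa's default, mix it down to mono.

```python
import io
import wave


def make_test_wav(sample_rate=44100, n_channels=2, sample_width=2, n_frames=1000):
    """Build a small silent PCM WAV in memory (a stand-in for a dataset file)."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_out:
        wav_out.setnchannels(n_channels)
        wav_out.setsampwidth(sample_width)
        wav_out.setframerate(sample_rate)
        wav_out.writeframes(b"\x00" * (n_frames * n_channels * sample_width))
    buffer.seek(0)
    return buffer


def describe_wav(file_obj):
    """Return sampling rate, bit depth, channel count, and duration of a PCM WAV."""
    with wave.open(file_obj, "rb") as wav_in:
        n_frames = wav_in.getnframes()
        sample_rate = wav_in.getframerate()
        return {
            "sample_rate": sample_rate,
            "bit_depth": 8 * wav_in.getsampwidth(),
            "channels": wav_in.getnchannels(),
            "duration_s": n_frames / sample_rate,
        }


info = describe_wav(make_test_wav())
print(info)
```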

In [None]:
# Display basic dataset information
print("Total number of audio files:", len(metadata))
print("Number of unique classes:", metadata['class'].nunique())
print("Classes:", sorted(metadata['class'].unique()))

# Display dataset structure
display(metadata.head())

# Display dataset columns
print("Dataset columns:", metadata.columns.tolist())

## 5. Exploratory Data Analysis (EDA)

In [None]:
# 5.1 Label Distribution Analysis
plt.figure(figsize=(15, 10))

# Class distribution
plt.subplot(2, 2, 1)
class_counts = metadata['class'].value_counts()
sns.barplot(x=class_counts.values, y=class_counts.index, palette='viridis')
plt.title('Distribution of Sound Classes')
plt.xlabel('Number of Samples')
plt.ylabel('Class')

# Fold distribution
plt.subplot(2, 2, 2)
fold_counts = metadata['fold'].value_counts().sort_index()
sns.barplot(x=fold_counts.index, y=fold_counts.values, palette='coolwarm')
plt.title('Distribution of Samples Across Folds')
plt.xlabel('Fold')
plt.ylabel('Number of Samples')

# Duration statistics
plt.subplot(2, 2, 3)
durations_by_class = []
class_names = []
for class_name in sorted(metadata['class'].unique()):
    class_mask = metadata['class'] == class_name
    class_durations = metadata.loc[class_mask, 'end'] - metadata.loc[class_mask, 'start']
    durations_by_class.append(class_durations)
    class_names.append(class_name)

plt.boxplot(durations_by_class, labels=class_names)
plt.xticks(rotation=45, ha='right')
plt.title('Duration Distribution by Class (Box Plot)')
plt.ylabel('Duration (seconds)')
plt.grid(True, alpha=.3)

# Sample audio file analysis
plt.subplot(2, 2, 4)
# Analyze a few sample files to show waveform diversity
sample_files = metadata.sample(1, random_state=42)  # Just one sample for this small subplot
for _, row in sample_files.iterrows():
    file_path = os.path.join(DATASET_PATH, "audio", f"fold{row['fold']}", row['slice_file_name'])
    audio, sr = librosa.load(file_path, sr=22050)
    time_axis = np.linspace(0, len(audio)/sr, len(audio))
    plt.plot(time_axis, audio, color='green', alpha=.7)
    plt.title('Sample Waveform: %s' % row["class"])
    plt.xlabel('Time (s)')
    plt.ylabel('Amplitude')
    plt.grid(True, alpha=.3)
    break  # Just plot one

plt.tight_layout()  # keep rotated tick labels and titles from overlapping
plt.show()

### 5.1 Observations

Most classes have around 1,000 samples, while a few have significantly fewer. Even so, resampling is not warranted because the overall distribution is mostly balanced.

Another important observation is that the folds do not contain exactly the same number of samples. Even if they did, the *amount of data* per fold would still differ because clip duration varies. An important implication is that a fair comparison requires 10-fold cross-validation.
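The difference between clip count and amount of audio can be checked directly. The following is a minimal sketch on made-up values; `summarize_folds` is a hypothetical helper, and with the real metadata it could be called as `summarize_folds(metadata['fold'], metadata['start'], metadata['end'])`.

```python
import numpy as np


def summarize_folds(folds, starts, ends):
    """Return {fold: (n_clips, total_seconds)} for per-clip metadata arrays."""
    folds = np.asarray(folds)
    durations = np.asarray(ends, dtype=float) - np.asarray(starts, dtype=float)
    summary = {}
    for fold in np.unique(folds):
        mask = folds == fold
        summary[int(fold)] = (int(mask.sum()), float(durations[mask].sum()))
    return summary


# Toy metadata (made-up values): both folds hold 3 clips, but fold 2 has less audio
folds = [1, 1, 1, 2, 2, 2]
starts = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
ends = [4.0, 4.0, 4.0, 4.0, 1.0, 0.5]

fold_summary = summarize_folds(folds, starts, ends)
for fold, (n_clips, seconds) in fold_summary.items():
    print("Fold %d: %d clips, %.1f s of audio" % (fold, n_clips, seconds))
```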

It is also interesting to note that, except for car horns and gun shots, every class has a median clip duration of around 4.0 seconds. Gun shots have a far lower median. Although duration is not used here, I suspect that including it as a feature during training would improve the scores of these two classes.

In [None]:
# 5.2 Audio Characteristics Analysis

# Analyze a few sample files to demonstrate audio properties
sample_analysis = []
for idx, row in metadata.sample(20, random_state=42).iterrows():
    file_path = os.path.join(DATASET_PATH, "audio", f"fold{row['fold']}", row['slice_file_name'])
    audio, sr = librosa.load(file_path, sr=None)
    duration = len(audio) / sr
    sample_analysis.append({
        'class': row['class'],
        'duration': duration,
        'sample_rate': sr,
        'samples': len(audio),
        'max_amplitude': np.max(np.abs(audio))
    })

pd.DataFrame(sample_analysis).style.set_caption("Audio characteristics")

In [None]:
# 5.3 Visualize Sample Audio Files
def plot_sample_audios(metadata, n_samples=3):
    """Plot waveform and spectrogram for sample audio files"""
    fig, axes = plt.subplots(n_samples, 3, figsize=(15, 4*n_samples))

    sample_data = metadata.sample(n_samples, random_state=25)

    for idx, (_, row) in enumerate(sample_data.iterrows()):
        file_path = os.path.join(DATASET_PATH, "audio", f"fold{row['fold']}", row['slice_file_name'])

        audio, sr = librosa.load(file_path, sr=22050)

        # Waveform
        axes[idx, 0].plot(np.linspace(0, len(audio)/sr, len(audio)), audio)
        axes[idx, 0].set_title("Waveform: %s" % row['class'])
        axes[idx, 0].set_xlabel('Time (s)')
        axes[idx, 0].set_ylabel('Amplitude')

        # Spectrogram
        D = librosa.amplitude_to_db(np.abs(librosa.stft(audio)), ref=np.max)
        img = librosa.display.specshow(D, y_axis='log', x_axis='time', sr=sr, ax=axes[idx, 1])
        axes[idx, 1].set_title("Spectrogram: %s" % row['class'])
        plt.colorbar(img, ax=axes[idx, 1])

        # MFCCs
        mfccs = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=13)
        img_mfcc = librosa.display.specshow(mfccs, x_axis='time', ax=axes[idx, 2])
        axes[idx, 2].set_title("MFCCs: %s" % row['class'])
        plt.colorbar(img_mfcc, ax=axes[idx, 2])  # use the MFCC image, not the spectrogram's

    plt.tight_layout()
    plt.show()

# Plot sample audio analysis
plot_sample_audios(metadata, n_samples=3)

In [None]:
# 5.4 Duration Analysis
durations = metadata['end'] - metadata['start']
plt.figure(figsize=(12, 4))

plt.subplot(1, 2, 1)
plt.hist(durations, bins=30, alpha=.7, color='lightblue', edgecolor='black')
plt.title('Distribution of Audio Durations')
plt.xlabel('Duration (seconds)')
plt.ylabel('Frequency')

plt.subplot(1, 2, 2)
sns.boxplot(x=durations)
plt.title('Box Plot of Audio Durations')
plt.xlabel('Duration (seconds)')

plt.tight_layout()
plt.show()

print("Duration Statistics:")
print("Mean: %.2fs" % durations.mean())
print("Std: %.2fs" % durations.std())
print("Min: %.2fs" % durations.min())
print("Max: %.2fs" % durations.max())

## 6. Data Preprocessing
**Preprocessing Steps**:
1. **Feature Extraction**: MFCC (Mel-Frequency Cepstral Coefficients) with delta and delta-delta features
2. **Feature Aggregation**: Mean, standard deviation, and median across time frames
3. **Feature Scaling**: StandardScaler for normalization
4. **Label Encoding**: Convert class labels to numerical format
5. **Data Splitting**: Predefined folds for training, validation, and testing

**Benefits of the preprocessing steps**:
- MFCCs are well-established for audio classification as they capture perceptual frequency characteristics
- Delta features capture temporal dynamics of the audio signal
- Statistical aggregation reduces variable-length audio to fixed-length feature vectors
- Standard scaling ensures stable training of neural networks
- Fixed fold assignment ensures reproducible evaluation
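The aggregation step can be illustrated in isolation. The sketch below uses random matrices as stand-ins for real MFCC features (an assumption made so the example runs without audio files) and shows that clips with different frame counts map to vectors of the same length; `aggregate_over_time` is a hypothetical name for the mean/std/median step.

```python
import numpy as np


def aggregate_over_time(frame_features):
    """Collapse a (n_coefficients, n_frames) matrix into one fixed-length vector."""
    return np.concatenate([
        np.mean(frame_features, axis=1),
        np.std(frame_features, axis=1),
        np.median(frame_features, axis=1),
    ])


rng = np.random.default_rng(0)
# Random stand-ins for 40 MFCCs + 40 deltas + 40 delta-deltas = 120 rows
short_clip = rng.normal(size=(120, 12))    # 12 time frames
long_clip = rng.normal(size=(120, 173))    # 173 time frames

short_vector = aggregate_over_time(short_clip)
long_vector = aggregate_over_time(long_clip)
print(short_vector.shape, long_vector.shape)  # (360,) (360,)
```

Three statistics over 120 rows give 360 values per clip regardless of clip length, at the cost of discarding the frame order.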

In [None]:
# Define fixed folds for model selection
VAL_FOLD = 9
TEST_FOLD = 10
TRAIN_FOLDS = [1, 2, 3, 4, 5, 6, 7, 8]

print("Fold Assignment:")
print("Training folds:", TRAIN_FOLDS)
print("Validation fold:", VAL_FOLD)
print("Test fold:", TEST_FOLD)

In [None]:
# Precompute features for all data once
def precompute_all_features(metadata):
    """Precompute features for all audio files efficiently"""
    all_features = []
    all_labels = []
    all_folds = []

    print("Precomputing features for all audio files...")
    for idx, row in metadata.iterrows():
        file_path = os.path.join(DATASET_PATH, "audio", f"fold{row['fold']}", row['slice_file_name'])

        audio, sr = librosa.load(file_path, sr=22050)
        mfccs = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=40, n_fft=2048, hop_length=512)

        # Must set mode='nearest' in order to handle shorter clips
        mfccs_delta = librosa.feature.delta(mfccs, mode='nearest')
        mfccs_delta2 = librosa.feature.delta(mfccs, order=2, mode='nearest')

        features_combined = np.vstack([mfccs, mfccs_delta, mfccs_delta2])
        features_aggregated = np.concatenate([
            np.mean(features_combined, axis=1),
            np.std(features_combined, axis=1),
            np.median(features_combined, axis=1)
        ])

        all_features.append(features_aggregated)
        all_labels.append(row['class'])
        all_folds.append(row['fold'])

    return np.array(all_features), np.array(all_labels), np.array(all_folds)

# Precompute features for entire dataset
all_features, all_labels, all_folds = precompute_all_features(metadata)
print("Precomputed features shape:", all_features.shape)

In [None]:
# Split data using precomputed features
train_mask = np.isin(all_folds, TRAIN_FOLDS)
val_mask = all_folds == VAL_FOLD
test_mask = all_folds == TEST_FOLD

X_train, y_train = all_features[train_mask], all_labels[train_mask]
X_val, y_val = all_features[val_mask], all_labels[val_mask]
X_test, y_test = all_features[test_mask], all_labels[test_mask]

print("Training set:", X_train.shape[0], "samples")
print("Validation set:", X_val.shape[0], "samples")
print("Test set:", X_test.shape[0], "samples")

# Verify class distribution
print("\nClass distribution across splits:")
print("\nTraining:", pd.Series(y_train).value_counts().sort_index(), sep='\n')
print("\nValidation:", pd.Series(y_val).value_counts().sort_index(), sep='\n')
print("\nTest:", pd.Series(y_test).value_counts().sort_index(), sep='\n')

In [None]:
# Encode labels
label_encoder = LabelEncoder()
y_train_encoded = label_encoder.fit_transform(y_train)
y_val_encoded = label_encoder.transform(y_val)
y_test_encoded = label_encoder.transform(y_test)

print("Class mapping:")
for i, class_name in enumerate(label_encoder.classes_):
    print("%s: %d" % (class_name, i))

In [None]:
# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_val_scaled = scaler.transform(X_val)
X_test_scaled = scaler.transform(X_test)

# Reshape for CNN/LSTM models
X_train_reshaped = X_train_scaled.reshape(X_train_scaled.shape[0], X_train_scaled.shape[1], 1)
X_val_reshaped = X_val_scaled.reshape(X_val_scaled.shape[0], X_val_scaled.shape[1], 1)
X_test_reshaped = X_test_scaled.reshape(X_test_scaled.shape[0], X_test_scaled.shape[1], 1)
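
The scaler above is fitted on the training folds only and then reused for validation and test data, which keeps statistics from held-out folds out of training. A minimal NumPy sketch of the same idea follows; `fit_standardizer` and `apply_standardizer` are hypothetical helpers standing in for `StandardScaler`, and the data is random.

```python
import numpy as np


def fit_standardizer(X_train):
    """Compute per-feature mean and standard deviation from training data only."""
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)
    std[std == 0.0] = 1.0  # avoid division by zero for constant features
    return mean, std


def apply_standardizer(X, mean, std):
    """Scale any split with statistics learned from the training split."""
    return (X - mean) / std


rng = np.random.default_rng(0)
X_train_demo = rng.normal(loc=5.0, scale=2.0, size=(200, 3))
X_val_demo = rng.normal(loc=5.0, scale=2.0, size=(50, 3))

mean, std = fit_standardizer(X_train_demo)
X_train_std = apply_standardizer(X_train_demo, mean, std)
X_val_std = apply_standardizer(X_val_demo, mean, std)

print("train mean:", X_train_std.mean(axis=0).round(3))
print("val mean:  ", X_val_std.mean(axis=0).round(3))
```

The training split ends up with zero mean and unit variance by construction, while the validation split is only approximately centered, which is expected.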

## 7. Model Selection and Architecture
**Choice of Models**:
1. **MLP (Multi-Layer Perceptron)**: Baseline model for comparison, good for structured feature data
2. **CNN (Convolutional Neural Network)**: Effective for capturing local patterns in feature sequences
3. **LSTM (Long Short-Term Memory)**: Suitable for temporal sequence modeling in audio data

**Rationale**:
- **MLP**: Simple baseline to establish performance benchmark
- **CNN**: Can learn hierarchical features from MFCC sequences
- **LSTM**: Can model temporal dependencies in audio signals
- **Hyperparameter Tuning**: Bayesian optimization to find optimal architecture and parameters
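
The tuner code in this notebook searches the learning rate between 1e-4 and 1e-2 with `sampling='log'`. The sketch below illustrates why a log scale suits ranges that span orders of magnitude. It is a plain NumPy illustration of log-uniform sampling, not Keras Tuner's internal implementation, and `sample_log_uniform` is a hypothetical helper.

```python
import numpy as np


def sample_log_uniform(low, high, size, rng):
    """Draw values whose logarithm is uniform on [log(low), log(high))."""
    return np.exp(rng.uniform(np.log(low), np.log(high), size=size))


rng = np.random.default_rng(0)
linear = rng.uniform(1e-4, 1e-2, size=10_000)
log_scale = sample_log_uniform(1e-4, 1e-2, size=10_000, rng=rng)

# Fraction of draws that land in the lowest decade, [1e-4, 1e-3)
frac_linear = np.mean(linear < 1e-3)
frac_log = np.mean(log_scale < 1e-3)
print("linear: %.2f, log: %.2f" % (frac_linear, frac_log))
```

Linear sampling puts only about 9% of its draws below 1e-3, whereas log sampling splits them roughly evenly between the two decades, so small learning rates get a fair share of trials.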

In [None]:
# ADVANCED HYPERPARAMETER TUNING WITH KERAS TUNER
def build_model(hp):
    """Build model with tunable hyperparameters using Keras Tuner"""

    model_type = hp.Choice('model_type', ['mlp', 'cnn', 'lstm'])
    input_dim = X_train_scaled.shape[1]
    num_classes = len(label_encoder.classes_)

    # Tunable learning rate
    learning_rate = hp.Float('learning_rate', min_value=1e-4, max_value=1e-2, sampling='log')

    if model_type == 'mlp':
        model = keras.Sequential()
        model.add(layers.Input(shape=(input_dim,)))

        # Tunable number of layers
        for i in range(hp.Int('num_layers', 2, 4)):
            units = hp.Int(f'units_{i}', min_value=128, max_value=512, step=128)

            model.add(layers.Dense(units, activation='relu'))
            model.add(layers.BatchNormalization())
            model.add(layers.Dropout(hp.Float(f'dropout_{i}', .2, .6)))

        model.add(layers.Dense(num_classes, activation='softmax'))

    elif model_type == 'cnn':
        model = keras.Sequential()
        model.add(layers.Reshape((input_dim, 1), input_shape=(input_dim,)))

        # Tunable CNN architecture
        for i in range(hp.Int('conv_layers', 2, 4)):
            filters = hp.Int(f'filters_{i}', min_value=32, max_value=256, step=32)
            kernel_size = hp.Int(f'kernel_{i}', min_value=3, max_value=7, step=2)

            model.add(layers.Conv1D(filters, kernel_size, activation='relu', padding='same'))
            model.add(layers.BatchNormalization())
            model.add(layers.MaxPooling1D(2))

        model.add(layers.GlobalAveragePooling1D())

        # Tunable dense layers
        for i in range(hp.Int('dense_layers', 1, 3)):
            units = hp.Int(f'dense_units_{i}', min_value=64, max_value=256, step=64)
            model.add(layers.Dense(units, activation='relu'))
            model.add(layers.BatchNormalization())
            model.add(layers.Dropout(hp.Float('dense_dropout', .3, .6)))

        model.add(layers.Dense(num_classes, activation='softmax'))

    elif model_type == 'lstm':
        model = keras.Sequential()
        model.add(layers.Reshape((input_dim, 1), input_shape=(input_dim,)))

        # Bidirectional LSTM layers
        n_lstm_layers = hp.Int('lstm_layers', 1, 3)
        for i in range(n_lstm_layers):
            units = hp.Int(f'lstm_units_{i}', min_value=32, max_value=128, step=32)
            return_sequences = i < n_lstm_layers - 1  # Last layer doesn't return sequences

            model.add(layers.Bidirectional(layers.LSTM(units, return_sequences=return_sequences)))
            model.add(layers.BatchNormalization())
            model.add(layers.Dropout(hp.Float(f'lstm_dropout_{i}', .2, .5)))

        model.add(layers.Flatten())
        model.add(layers.Dense(num_classes, activation='softmax'))

    # Tunable optimizer
    optimizer_name = hp.Choice('optimizer', ['adam', 'rmsprop'])
    if optimizer_name == 'adam':
        optimizer = keras.optimizers.Adam(learning_rate=learning_rate)
    else:
        optimizer = keras.optimizers.RMSprop(learning_rate=learning_rate)

    model.compile(
        optimizer=optimizer,
        loss='sparse_categorical_crossentropy',
        metrics=['accuracy']
    )

    return model

In [None]:
def run_advanced_hyperparameter_tuning(X_train, y_train, X_val, y_val, max_trials=25):
    """Run advanced hyperparameter tuning with Bayesian optimization"""

    tuner = kt.BayesianOptimization(
        build_model,
        objective='val_accuracy',
        max_trials=max_trials,
        executions_per_trial=1,  # Run each trial once for speed (can increase to 2 for stability)
        directory='advanced_tuning',
        project_name='urban_sound_advanced',
        overwrite=True
    )

    print("Starting advanced hyperparameter tuning with Bayesian Optimization...")
    print("Max trials:", max_trials)
    print("Training samples:", X_train.shape[0])
    print("Validation samples:", X_val.shape[0])

    # Callbacks for training
    callbacks = [
        keras.callbacks.EarlyStopping(patience=8, restore_best_weights=True),
        keras.callbacks.ReduceLROnPlateau(patience=5, factor=.5, min_lr=1e-7)
    ]

    # Perform the search
    tuner.search(
        X_train, y_train,
        epochs=50,
        validation_data=(X_val, y_val),
        batch_size=32,
        callbacks=callbacks,
        verbose=1
    )

    # Get best hyperparameters and model
    best_hps = tuner.get_best_hyperparameters(num_trials=1)[0]
    best_model = tuner.get_best_models(num_models=1)[0]

    print("\n" + "="*60)
    print("BEST HYPERPARAMETERS FOUND:")
    print("="*60)
    for param, value in best_hps.values.items():
        print("%s: %s" % (param, value))

    # Evaluate best model on validation set
    val_accuracy = best_model.evaluate(X_val, y_val, verbose=0)[1]
    print("Best model validation accuracy: %.4f" % val_accuracy)

    return best_model, best_hps, tuner

In [None]:
# Run advanced hyperparameter tuning
print("=== ADVANCED HYPERPARAMETER TUNING WITH KERAS TUNER ===")

best_model_advanced, best_hps, tuner = run_advanced_hyperparameter_tuning(
    X_train_reshaped, y_train_encoded, X_val_reshaped, y_val_encoded, max_trials=200
)

# Convert best hyperparameters to config format
best_config = {
    'model_type': best_hps.get('model_type'),
    'learning_rate': best_hps.get('learning_rate'),
    'optimizer': best_hps.get('optimizer'),
    'num_layers': best_hps.get('num_layers'),
    'batch_size': 32  # Fixed during tuning
}

print("\nBest configuration:", best_config)

In [None]:
# Retrain the best model with more epochs for final evaluation
print("=== RETRAINING BEST MODEL WITH MORE EPOCHS ===")

# Rebuild the best model with the found hyperparameters
final_model = build_model(best_hps)

# Train with more epochs and callbacks
final_history = final_model.fit(
    X_train_reshaped, y_train_encoded,
    validation_data=(X_val_reshaped, y_val_encoded),
    epochs=100,
    batch_size=32,
    callbacks=[
        keras.callbacks.EarlyStopping(patience=12, restore_best_weights=True),
        keras.callbacks.ReduceLROnPlateau(patience=6, factor=.5, min_lr=1e-7)
    ],
    verbose=1
)

## 8. Results and Analysis
**Evaluation Metrics**:
- Accuracy: Overall classification performance
- Precision, Recall, F1-score: Per-class performance metrics
- Confusion Matrix: Visual representation of classification patterns
- Cross-validation: Robust performance estimation across different data splits
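
The per-class metrics listed above all derive from the confusion matrix. A minimal sketch with a made-up 3-class matrix follows; `per_class_metrics` is a hypothetical helper, and the notebook itself relies on scikit-learn's `classification_report` for the real numbers.

```python
import numpy as np


def per_class_metrics(cm):
    """Precision, recall, and F1 per class, plus overall accuracy.

    Rows of `cm` are true classes and columns are predicted classes.
    """
    cm = np.asarray(cm, dtype=float)
    true_positives = np.diag(cm)
    predicted_totals = cm.sum(axis=0)  # column sums: how often each class was predicted
    actual_totals = cm.sum(axis=1)     # row sums: how often each class truly occurs
    # Guard against classes that were never predicted or never present
    precision = np.divide(true_positives, predicted_totals,
                          out=np.zeros_like(predicted_totals), where=predicted_totals > 0)
    recall = np.divide(true_positives, actual_totals,
                       out=np.zeros_like(actual_totals), where=actual_totals > 0)
    denom = precision + recall
    f1 = np.divide(2 * precision * recall, denom,
                   out=np.zeros_like(denom), where=denom > 0)
    accuracy = true_positives.sum() / cm.sum()
    return precision, recall, f1, accuracy


# Made-up confusion matrix for illustration only
cm_demo = [[8, 1, 1],
           [2, 6, 2],
           [0, 1, 9]]
precision, recall, f1, accuracy = per_class_metrics(cm_demo)
print("precision:", precision.round(2))
print("recall:   ", recall.round(2))
print("f1:       ", f1.round(2))
print("accuracy: %.3f" % accuracy)
```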

In [None]:
# Evaluate the final model on test set
print("=== FINAL EVALUATION ON TEST SET (FOLD 10) ===")

test_loss, test_accuracy = final_model.evaluate(X_test_reshaped, y_test_encoded, verbose=0)
y_pred = final_model.predict(X_test_reshaped)
y_pred_classes = np.argmax(y_pred, axis=1)

print("Final Model Test Accuracy (Fold 10): %.4f" % test_accuracy)
print("Final Model Test Loss: %.4f" % test_loss)

# Detailed classification report
print("\nClassification Report:")
class_report = classification_report(y_test_encoded, y_pred_classes,
                          target_names=label_encoder.classes_, output_dict=True)
print(classification_report(y_test_encoded, y_pred_classes,
                          target_names=label_encoder.classes_))

# Plot confusion matrix
plt.figure(figsize=(10, 8))
cm = confusion_matrix(y_test_encoded, y_pred_classes)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=label_encoder.classes_,
            yticklabels=label_encoder.classes_)
plt.title('Confusion Matrix - Final Model')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.xticks(rotation=45)
plt.yticks(rotation=0)
plt.tight_layout()
plt.show()

In [None]:
# Efficient 10-Fold Cross-Validation
print("=== 10-FOLD CROSS-VALIDATION WITH BEST MODEL ===")

def run_efficient_cross_validation(all_features, all_labels, all_folds, best_config, n_folds=10):
    """Efficient cross-validation using precomputed features"""

    fold_accuracies = []
    label_encoder_cv = LabelEncoder()
    labels_encoded = label_encoder_cv.fit_transform(all_labels)

    for test_fold in range(1, n_folds + 1):
        print("\n--- Fold %d/%d ---" % (test_fold, n_folds))

        # Use current fold for testing, previous fold for validation
        val_fold = test_fold - 1 if test_fold > 1 else n_folds
        train_folds = [f for f in range(1, n_folds + 1) if f not in [test_fold, val_fold]]

        # Get indices using precomputed features
        train_mask = np.isin(all_folds, train_folds)
        val_mask = all_folds == val_fold
        test_mask = all_folds == test_fold

        X_train_fold, y_train_fold = all_features[train_mask], labels_encoded[train_mask]
        X_val_fold, y_val_fold = all_features[val_mask], labels_encoded[val_mask]
        X_test_fold, y_test_fold = all_features[test_mask], labels_encoded[test_mask]

        # Scale features per fold
        scaler_fold = StandardScaler()
        X_train_scaled_fold = scaler_fold.fit_transform(X_train_fold)
        X_val_scaled_fold = scaler_fold.transform(X_val_fold)
        X_test_scaled_fold = scaler_fold.transform(X_test_fold)

        # Reshape for CNN/LSTM
        X_train_reshaped_fold = X_train_scaled_fold.reshape(X_train_scaled_fold.shape[0], X_train_scaled_fold.shape[1], 1)
        X_val_reshaped_fold = X_val_scaled_fold.reshape(X_val_scaled_fold.shape[0], X_val_scaled_fold.shape[1], 1)
        X_test_reshaped_fold = X_test_scaled_fold.reshape(X_test_scaled_fold.shape[0], X_test_scaled_fold.shape[1], 1)

        # Rebuild and train model for this fold
        # Rebuild the model for this fold. build_model already compiles it with the
        # tuned optimizer and learning rate, so recompiling with Adam would discard
        # the optimizer chosen during tuning.
        model_fold = build_model(best_hps)

        history_fold = model_fold.fit(
            X_train_reshaped_fold, y_train_fold,
            validation_data=(X_val_reshaped_fold, y_val_fold),
            epochs=100,
            batch_size=32,
            verbose=0,
            callbacks=[
                keras.callbacks.EarlyStopping(patience=10, restore_best_weights=True)
            ]
        )

        # Evaluate on test fold
        test_accuracy_fold = model_fold.evaluate(X_test_reshaped_fold, y_test_fold, verbose=0)[1]
        fold_accuracies.append(test_accuracy_fold)
        print("Fold %d Test Accuracy: %.4f" % (test_fold, test_accuracy_fold))

        # Clean up
        del model_fold
        tf.keras.backend.clear_session()

    return fold_accuracies

# Run efficient cross-validation
cv_accuracies = run_efficient_cross_validation(all_features, all_labels, all_folds, best_config)
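
The fold rotation used in the loop above can be viewed on its own. This sketch mirrors the notebook's logic in a small generator; `rotate_folds` is a hypothetical name introduced for illustration.

```python
def rotate_folds(n_folds=10):
    """Yield (train_folds, val_fold, test_fold) for each cross-validation round."""
    for test_fold in range(1, n_folds + 1):
        # The fold before the test fold is used for validation; fold 1 wraps to the last
        val_fold = test_fold - 1 if test_fold > 1 else n_folds
        train_folds = [f for f in range(1, n_folds + 1) if f not in (test_fold, val_fold)]
        yield train_folds, val_fold, test_fold


splits = list(rotate_folds(10))
for train_folds, val_fold, test_fold in splits[:2]:
    print("test=%d val=%d train=%s" % (test_fold, val_fold, train_folds))
```

Each fold serves as the test set exactly once, and the final round (test fold 10, validation fold 9, training folds 1 to 8) matches the fixed split used for model selection.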

In [None]:
# Cross-Validation Results Analysis
print("\n=== 10-FOLD CROSS-VALIDATION RESULTS ===")
print("Individual Fold Accuracies:", ['%.4f' % acc for acc in cv_accuracies])
print("Mean Accuracy: %.4f ± %.4f" % (np.mean(cv_accuracies), np.std(cv_accuracies)))
print("Min Accuracy: %.4f" % np.min(cv_accuracies))
print("Max Accuracy: %.4f" % np.max(cv_accuracies))

# Enhanced visualization
plt.figure(figsize=(15, 5))

plt.subplot(1, 3, 1)
plt.plot(range(1, 11), cv_accuracies, 'o-', linewidth=2, markersize=8, label='Fold Accuracy')
plt.axhline(y=np.mean(cv_accuracies), color='r', linestyle='--', label='Mean: %.4f' % np.mean(cv_accuracies))
plt.xlabel('Fold')
plt.ylabel('Accuracy')
plt.title('10-Fold Cross-Validation Results')
plt.legend()
plt.grid(True, alpha=.3)

plt.subplot(1, 3, 2)
plt.boxplot(cv_accuracies)
plt.ylabel('Accuracy')
plt.title('Accuracy Distribution Across Folds')
plt.grid(True, alpha=.3)

plt.subplot(1, 3, 3)
# Training history for the final model
plt.plot(final_history.history['accuracy'], label='Training Accuracy')
plt.plot(final_history.history['val_accuracy'], label='Validation Accuracy')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.title('Final Model Training History')
plt.legend()
plt.grid(True, alpha=.3)

plt.tight_layout()
plt.show()

In [None]:
# Final Comparison with detailed analysis
print("=== COMPREHENSIVE RESULTS ANALYSIS ===")
print("Model Selection Phase (Fold 10 Test): %.4f" % test_accuracy)
print("10-Fold Cross-Validation Mean: %.4f" % np.mean(cv_accuracies))
print("Performance Difference: %.4f" % (test_accuracy - np.mean(cv_accuracies)))

# Per-class performance analysis
print("\n=== PER-CLASS PERFORMANCE ANALYSIS ===")
class_performance = pd.DataFrame(class_report).transpose()
print(class_performance)

In [None]:
# Hyperparameter Tuning Analysis
print("Best Model Type:", best_config['model_type'])
print("Best Learning Rate:", best_config['learning_rate'])
print("Best Optimizer:", best_config['optimizer'])
print("Best Architecture Configuration:", best_config)

# Visualize hyperparameter search results
tuner_results = tuner.oracle.get_best_trials(num_trials=10)
tuner_data = []
for trial in tuner_results:
    tuner_data.append({**trial.hyperparameters.values, 'score': trial.score})

tuner_df = pd.DataFrame(tuner_data)
print("\nTop 10 Hyperparameter Configurations:")
print(tuner_df.sort_values('score', ascending=False).head(10))

In [None]:
# Final Summary Table
pd.DataFrame({
    'Evaluation Method': ['Single Test Fold (Fold 10)', '10-Fold Cross-Validation'],
    'Accuracy': ['%.4f' % test_accuracy, '%.4f ± %.4f' % (np.mean(cv_accuracies), np.std(cv_accuracies))],  # a single fold has no spread to report
    'Best Model': [best_config['model_type'], best_config['model_type']],
    'Learning Rate': [best_config['learning_rate'], best_config['learning_rate']]
}).style.set_caption("Final performance summary")

## 9. Discussion and Conclusion

### Troubleshooting

- Preprocessing failed on some of the much shorter clips. The issue was fixed by passing `mode='nearest'` to both `librosa.feature.delta()` calls.
- Feature extraction took too long, especially during cross-validation, where the assignment of folds to training, validation, and testing changes on every round. This was fixed by extracting the features for every file once at the very beginning, storing each sample's fold number alongside them, and selecting the splits with boolean masks.
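
The short-clip failure in the first item comes down to frame counts. To my understanding, `librosa.feature.delta()` defaults to a window of `width=9` frames with `mode='interp'`, which cannot be applied when a clip has fewer frames than the window. The sketch below estimates the number of frames for a given duration, assuming librosa's default centered framing and the `hop_length=512` and 22050 Hz rate used in this notebook; `count_frames` and `too_short_for_default_delta` are hypothetical helpers.

```python
def count_frames(n_samples, hop_length=512):
    """Number of analysis frames for centered framing (assumed librosa default)."""
    return 1 + n_samples // hop_length


def too_short_for_default_delta(duration_s, sr=22050, hop_length=512, width=9):
    """True if a clip has fewer frames than the delta window width."""
    n_samples = int(duration_s * sr)
    return count_frames(n_samples, hop_length) < width


for duration in (0.05, 0.2, 4.0):
    n_frames = count_frames(int(duration * 22050))
    print("%.2f s -> %d frames, too short: %s"
          % (duration, n_frames, too_short_for_default_delta(duration)))
```

Under these assumptions a 0.05-second clip yields only 3 frames, while clips of about 0.2 seconds or longer reach the 9 frames the default window needs.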