<a href="https://colab.research.google.com/github/AceCentre/SoundSwitch/blob/main/SoundDetect.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Positive Clips**
The positive_clips should contain audio samples of the sound you are interested in detecting. For example, if you are building a system to detect the sound of a specific word being spoken, your positive clips would contain various instances of that word being spoken by different people, in different tones, and possibly with background noise.

**Negative Clips**
The negative_clips should contain audio samples that are representative of the types of sounds that the system will encounter but should not react to. This could include background noise, other words being spoken, or any other sounds that are not the target sound. These clips are used to test the system's ability to correctly identify non-target sounds as negative.

**Template Clips**


The templates are pre-recorded audio clips used as a reference for comparison with incoming audio. They should be the clearest available examples of the sound you are trying to detect. In this notebook they are loaded from Heather1.wav and Heather2.wav, and their Mel spectrograms are computed and stored in S1 and S2. The raw waveforms are reused as templates for the cross-correlation method, and MFCCs of the same recordings (M1 and M2) are computed for the MFCC method.



In [1]:
!pip install fastdtw

import numpy as np
import librosa
from fastdtw import fastdtw
from scipy.spatial.distance import euclidean
import time
import os
from sklearn.metrics.pairwise import euclidean_distances



In [2]:
# Load the positive and negative clips

def load_clips(folder):
    clips = []
    for filename in os.listdir(folder):
        if filename.endswith(".wav"):
            filepath = os.path.join(folder, filename)
            audio, _ = librosa.load(filepath, sr=44100)
            clips.append(audio)
    return clips

from google.colab import drive
drive.mount('/content/drive')
negative_clips = load_clips("/content/drive/My Drive/SoundDetectSamples/Background Clips")
positive_clips = load_clips("/content/drive/My Drive/SoundDetectSamples/Loud Clips")



In [3]:
def test_method(method, positive_clips, background_noises, *args):
    true_positives = 0
    false_positives = 0
    false_negatives = 0
    true_negatives = 0

    start_time = time.time()

    # Test on positive clips
    for clip in positive_clips:
        if method(clip, *args):
            true_positives += 1
        else:
            false_negatives += 1

    # Test on background noise clips
    for noise in background_noises:
        if method(noise, *args):
            false_positives += 1
        else:
            true_negatives += 1

    elapsed_time = time.time() - start_time

    print(f"Method: {method.__name__}")
    print(f"True Positives: {true_positives}")
    print(f"False Negatives: {false_negatives}")
    print(f"True Negatives: {true_negatives}")
    print(f"False Positives: {false_positives}")
    print(f"Time taken: {elapsed_time} seconds")


The cross-correlation baseline in the next cell compares raw waveforms directly. The Mel spectrogram method that follows it uses Mel spectrograms for feature extraction and Dynamic Time Warping (DTW) to compare the features. Its primary steps are:

1. Compute the Mel spectrogram of the audio clip.
2. Use DTW to find the minimum distance between the Mel spectrogram of the audio clip and the Mel spectrograms of the templates.
3. Compare the minimum distance to a threshold to make a classification decision.
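
To make the DTW step concrete, here is a minimal sketch of the classic exact algorithm in plain NumPy. It is for illustration only: the notebook itself calls fastdtw, which computes an approximation of this cost. The function name dtw_distance, the toy sequences, and the threshold value are assumptions that do not appear elsewhere in the notebook.

```python
import numpy as np

def dtw_distance(a, b):
    """Exact DTW cost between two sequences of feature vectors (frames x features)."""
    n, m = len(a), len(b)
    # cost[i, j] = cheapest alignment of the first i frames of a with the first j of b
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])  # Euclidean frame distance
            cost[i, j] = d + min(cost[i - 1, j],      # frame of a repeated
                                 cost[i, j - 1],      # frame of b repeated
                                 cost[i - 1, j - 1])  # both advance
    return cost[n, m]

# Toy one-dimensional "spectrogram" frames (hypothetical data).
x = np.array([[0.0], [1.0], [2.0], [1.0], [0.0]])
x_slow = np.repeat(x, 2, axis=0)          # same shape, played at half speed
y = np.array([[0.0], [3.0], [0.0], [3.0], [0.0]])

print(dtw_distance(x, x_slow))  # 0.0: time stretching costs nothing
print(dtw_distance(x, y))       # larger: the shapes really differ

# Steps 2 and 3: minimum distance over the templates, then a threshold.
templates = [x_slow, y]
threshold = 1.0  # hypothetical value
min_distance = min(dtw_distance(x, t) for t in templates)
print(min_distance < threshold)
```

The nested loops make this quadratic in the number of frames, which is why an approximate method such as fastdtw is attractive for full-length clips.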

In [4]:
def cross_correlation_detect(positive_clips, negative_clips, templates, threshold=0.8):
    results = {
        'true_positives': 0,
        'false_negatives': 0,
        'true_negatives': 0,
        'false_positives': 0
    }

    def detect_sound(audio_signal, templates, threshold):
        #print(f"Audio Signal Shape: {audio_signal.shape}, Type: {type(audio_signal)}")
        max_correlations = []
        for template in templates:
            #print(f"Template Shape: {template.shape}, Type: {type(template)}")
            c = np.correlate(audio_signal, template, mode='valid')
            max_correlations.append(np.max(c))
        max_correlation = np.max(max_correlations)
        return max_correlation > threshold


    # Test on positive clips
    for clip in positive_clips:
        if detect_sound(clip, templates, threshold):
            results['true_positives'] += 1
        else:
            results['false_negatives'] += 1

    # Test on negative clips
    for clip in negative_clips:
        if detect_sound(clip, templates, threshold):
            results['false_positives'] += 1
        else:
            results['true_negatives'] += 1

    return results
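
np.correlate returns raw sums of products, so the peak value grows with the loudness and length of the signals, and a fixed threshold such as 0.8 is hard to interpret. One possible variant, sketched here as an assumption rather than as part of the notebook, normalizes each window so the score falls between -1 and 1. The function name normalized_peak_correlation and the test signals are hypothetical.

```python
import numpy as np

def normalized_peak_correlation(signal, template):
    """Peak normalized cross-correlation in [-1, 1]; -1.0 if no window has energy."""
    if len(template) > len(signal):
        signal, template = template, signal  # slide the shorter over the longer
    t = template - template.mean()
    t_norm = np.linalg.norm(t)
    n = len(t)
    best = -1.0
    for start in range(len(signal) - n + 1):
        w = signal[start:start + n]
        w = w - w.mean()
        denom = np.linalg.norm(w) * t_norm
        if denom > 0:  # skip silent (constant) windows
            best = max(best, float(np.dot(w, t) / denom))
    return best

# Hypothetical data: a short sine burst as the template.
template = np.sin(np.linspace(0, 4 * np.pi, 50))
# The same burst, five times louder, embedded in silence.
loud_signal = np.concatenate([np.zeros(30), 5.0 * template, np.zeros(20)])
# Pure noise with no burst in it.
noise = np.random.default_rng(0).normal(size=100)

print(normalized_peak_correlation(loud_signal, template))  # close to 1.0
print(normalized_peak_correlation(noise, template))        # well below 1.0
```

With this scaling a threshold of 0.8 means "at least 80% similar in shape", regardless of volume. The Python loop is slow for 44.1 kHz audio, so a real implementation would need a vectorized or FFT-based version.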

In [5]:
def melspectrogram_detect(positive_clips, background_noises, templates, sr=44100, threshold=200):
    results = {
        'true_positives': 0,
        'false_negatives': 0,
        'true_negatives': 0,
        'false_positives': 0
    }

    # Function to detect sound using Mel spectrogram and DTW
    def detect_sound(audio_signal, templates, sr, threshold):
        S = librosa.feature.melspectrogram(y=audio_signal, sr=sr, n_mels=128)
        min_distance = float('inf')

        for template in templates:
            distance, _ = fastdtw(S.T, template.T, dist=euclidean)
            min_distance = min(min_distance, distance)
            #print(f"Min distance: {min_distance}")


        return min_distance < threshold

    # Test on positive clips
    for clip in positive_clips:
        if detect_sound(clip, templates, sr, threshold):
            results['true_positives'] += 1
        else:
            results['false_negatives'] += 1

    # Test on background noise clips
    for noise in background_noises:
        if detect_sound(noise, templates, sr, threshold):
            results['false_positives'] += 1
        else:
            results['true_negatives'] += 1

    return results
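
The threshold of 200 above is a hand-picked constant. One possible way to choose it from data, sketched here under assumptions, is to collect the minimum DTW distance for every clip and sweep candidate thresholds between the observed values. The function name best_threshold and the distances in the example are hypothetical; they are not measurements from this notebook.

```python
import numpy as np

def best_threshold(pos_distances, neg_distances):
    """Return (threshold, accuracy) maximizing accuracy for the rule distance < threshold.

    Returns (None, -1.0) if fewer than two distinct distances are given.
    """
    pos = np.asarray(pos_distances, dtype=float)
    neg = np.asarray(neg_distances, dtype=float)
    # Candidate thresholds: midpoints between neighbouring distinct distances.
    values = np.unique(np.concatenate([pos, neg]))
    candidates = (values[:-1] + values[1:]) / 2.0
    best_t, best_acc = None, -1.0
    for t in candidates:
        correct = np.sum(pos < t) + np.sum(neg >= t)
        acc = correct / (len(pos) + len(neg))
        if acc > best_acc:
            best_t, best_acc = float(t), float(acc)
    return best_t, best_acc

# Hypothetical minimum DTW distances for three positive and three negative clips.
pos_d = [120.0, 150.0, 180.0]
neg_d = [400.0, 520.0, 610.0]
print(best_threshold(pos_d, neg_d))  # (290.0, 1.0)
```

A threshold tuned on the same clips it is then scored on will look better than it really is, so some clips should be held back for the final check.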

The next method uses MFCCs (Mel-Frequency Cepstral Coefficients), which are often used in speech and audio processing to capture the timbral texture of the audio. Here we use MFCCs instead of Mel spectrograms for feature extraction, and Euclidean distance instead of DTW for comparison. Because there is no time alignment, a template is only compared against clips that have the same number of frames. This should provide a different perspective on the performance of sound detection techniques.

In [6]:
def mfcc_detect(positive_clips, background_noises, templates, sr=44100, threshold=100):
    results = {
        'true_positives': 0,
        'false_negatives': 0,
        'true_negatives': 0,
        'false_positives': 0
    }

    # Function to detect sound using MFCC and Euclidean distance
    def detect_sound(audio_signal, templates, sr, threshold):
        mfccs = librosa.feature.mfcc(y=audio_signal, sr=sr, n_mfcc=13)
        min_distance = float('inf')

        for template in templates:
            if mfccs.shape[1] != template.shape[1]:
                continue  # Skip this template if dimensions don't match
            distance = np.sum(euclidean_distances(mfccs.T, template.T))
            min_distance = min(min_distance, distance)

        return min_distance < threshold

    # Test on positive clips
    for clip in positive_clips:
        if detect_sound(clip, templates, sr, threshold):
            results['true_positives'] += 1
        else:
            results['false_negatives'] += 1

    # Test on background noise clips
    for noise in background_noises:
        if detect_sound(noise, templates, sr, threshold):
            results['false_positives'] += 1
        else:
            results['true_negatives'] += 1

    return results
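
mfcc_detect skips any template whose frame count differs from the clip's. A clip whose duration matches none of the templates therefore keeps an infinite distance and is always classified as negative. One possible workaround, offered as an assumption and not as the notebook's method, is to average each MFCC matrix over time (as the SVM cells do) before taking the Euclidean distance. The function name mean_mfcc_distance is hypothetical, and random arrays stand in for real MFCC matrices.

```python
import numpy as np

def mean_mfcc_distance(mfccs, template):
    """Euclidean distance between time-averaged MFCC vectors.

    Both inputs have shape (n_mfcc, n_frames); the frame counts may differ.
    """
    return float(np.linalg.norm(mfccs.mean(axis=1) - template.mean(axis=1)))

# Random arrays stand in for real MFCC matrices (13 coefficients each).
rng = np.random.default_rng(0)
clip_mfccs = rng.normal(size=(13, 40))      # 40 frames
template_mfccs = rng.normal(size=(13, 65))  # 65 frames

print(mean_mfcc_distance(clip_mfccs, template_mfccs))  # works despite 40 vs 65 frames
print(mean_mfcc_distance(clip_mfccs, clip_mfccs))      # 0.0
```

Averaging discards the order of the frames, so two sounds with the same overall timbre but different timing become indistinguishable. DTW on the MFCC frames would keep that information at a higher cost.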


The next method trains an SVM classifier on MFCC features. We train the classifier using the mean MFCCs across time for each clip, labeling positive clips as 1 and negative clips as 0. After training, we test the classifier on the same positive and negative clips and update the results dictionary accordingly.

In [7]:
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

def svm_detect(positive_clips, background_noises, sr=44100):
    results = {
        'true_positives': 0,
        'false_negatives': 0,
        'true_negatives': 0,
        'false_positives': 0
    }

    # Extract MFCC features for training
    X_train = []
    y_train = []

    for clip in positive_clips:
        mfccs = librosa.feature.mfcc(y=clip, sr=sr, n_mfcc=13)
        X_train.append(mfccs.mean(axis=1))
        y_train.append(1)  # Label for positive clips

    for noise in background_noises:
        mfccs = librosa.feature.mfcc(y=noise, sr=sr, n_mfcc=13)
        X_train.append(mfccs.mean(axis=1))
        y_train.append(0)  # Label for negative clips

    # Train the SVM classifier
    clf = make_pipeline(StandardScaler(), SVC(gamma='auto'))
    clf.fit(X_train, y_train)

    # Test on positive clips
    for clip in positive_clips:
        mfccs = librosa.feature.mfcc(y=clip, sr=sr, n_mfcc=13)
        prediction = clf.predict([mfccs.mean(axis=1)])
        if prediction == 1:
            results['true_positives'] += 1
        else:
            results['false_negatives'] += 1

    # Test on background noise clips
    for noise in background_noises:
        mfccs = librosa.feature.mfcc(y=noise, sr=sr, n_mfcc=13)
        prediction = clf.predict([mfccs.mean(axis=1)])
        if prediction == 0:
            results['true_negatives'] += 1
        else:
            results['false_positives'] += 1

    return results
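
svm_detect scores the classifier on the same clips it was trained on, which tends to overstate how well it will do on new audio. A minimal sketch of a held-out evaluation follows, assuming scikit-learn's train_test_split. The function name svm_holdout_counts is hypothetical, and synthetic vectors stand in for the mean MFCC features.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def svm_holdout_counts(X, y, test_size=0.3, seed=0):
    """Train on one part of the data and count outcomes on the held-out part."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, stratify=y, random_state=seed)
    clf = make_pipeline(StandardScaler(), SVC(gamma='auto'))
    clf.fit(X_train, y_train)
    pred = clf.predict(X_test)
    return {
        'true_positives': int(np.sum((pred == 1) & (y_test == 1))),
        'false_negatives': int(np.sum((pred == 0) & (y_test == 1))),
        'true_negatives': int(np.sum((pred == 0) & (y_test == 0))),
        'false_positives': int(np.sum((pred == 1) & (y_test == 0))),
    }

# Synthetic stand-ins for 13-dimensional mean MFCC vectors (not real clips).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (20, 13)),   # "negative" clips
               rng.normal(5.0, 1.0, (20, 13))])  # "positive" clips
y = np.array([0] * 20 + [1] * 20)

counts = svm_holdout_counts(X, y)
print(counts)
```

The returned dictionary uses the same keys as the results dictionaries in this notebook, so it can be passed to evaluate_results unchanged.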

In [8]:
# Combined technique: SVM on MFCC, chroma, and spectral contrast features

def extract_combined_features(audio, sr):
    mfccs = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=13)
    chroma = librosa.feature.chroma_stft(y=audio, sr=sr)
    spectral_contrast = librosa.feature.spectral_contrast(y=audio, sr=sr)

    # Take the mean of each feature to use as the feature vector
    mfccs_mean = np.mean(mfccs, axis=1)
    chroma_mean = np.mean(chroma, axis=1)
    spectral_contrast_mean = np.mean(spectral_contrast, axis=1)

    # Combine the features into a single array
    combined_features = np.concatenate((mfccs_mean, chroma_mean, spectral_contrast_mean))

    return combined_features

def svm_detect_combined(positive_clips, background_noises, sr=44100):
    results = {
        'true_positives': 0,
        'false_negatives': 0,
        'true_negatives': 0,
        'false_positives': 0
    }

    # Check if positive_clips and background_noises have data
    if len(positive_clips) == 0 or len(background_noises) == 0:
        print("Either positive_clips or background_noises is empty. Please check your data.")
        return results  # You can choose to return an empty results dict or handle it differently

    # Extract features for training
    X_train = []
    y_train = []

    for clip in positive_clips:
        features = extract_combined_features(clip, sr)
        X_train.append(features)
        y_train.append(1)  # Label for positive clips

    for noise in background_noises:
        features = extract_combined_features(noise, sr)
        X_train.append(features)
        y_train.append(0)  # Label for negative clips

    # Train the SVM classifier
    clf = make_pipeline(StandardScaler(), SVC(gamma='auto'))
    clf.fit(X_train, y_train)

    # Test on positive clips
    for clip in positive_clips:
        features = extract_combined_features(clip, sr)
        prediction = clf.predict([features])
        if prediction == 1:
            results['true_positives'] += 1
        else:
            results['false_negatives'] += 1

    # Test on background noise clips
    for noise in background_noises:
        features = extract_combined_features(noise, sr)
        prediction = clf.predict([features])
        if prediction == 0:
            results['true_negatives'] += 1
        else:
            results['false_positives'] += 1

    return results


In [10]:
# Load pre-recorded templates and compute their Mel spectrograms
audio1, sr = librosa.load("/content/drive/My Drive/SoundDetectSamples/Heather1.wav", sr=44100)
audio2, _ = librosa.load("/content/drive/My Drive/SoundDetectSamples/Heather2.wav", sr=44100)

S1 = librosa.feature.melspectrogram(y=audio1, sr=sr, n_mels=128)
S2 = librosa.feature.melspectrogram(y=audio2, sr=sr, n_mels=128)

templates = [S1, S2]

def evaluate_results(results):
    tp = results['true_positives']
    fp = results['false_positives']
    fn = results['false_negatives']
    tn = results['true_negatives']

    # Calculate metrics
    try:
        precision = tp / (tp + fp)
    except ZeroDivisionError:
        precision = 0.0

    try:
        recall = tp / (tp + fn)
    except ZeroDivisionError:
        recall = 0.0

    try:
        f1_score = 2 * (precision * recall) / (precision + recall)
    except ZeroDivisionError:
        f1_score = 0.0

    try:
        accuracy = (tp + tn) / (tp + tn + fp + fn)
    except ZeroDivisionError:
        accuracy = 0.0

    print(f"Precision: {precision:.4f}")
    print(f"Recall: {recall:.4f}")
    print(f"F1 Score: {f1_score:.4f}")
    print(f"Accuracy: {accuracy:.4f}")

templates_raw = [audio1, audio2]
cross_corr_results = cross_correlation_detect(positive_clips, negative_clips, templates_raw, threshold=0.8)

# Evaluate and print metrics
print("Metrics for Cross-Correlation Detection:")
evaluate_results(cross_corr_results)

# Mel spectrogram + DTW detection, using the Mel spectrogram templates
mel_results = melspectrogram_detect(positive_clips, negative_clips, templates)

# MFCC templates computed from the same template recordings loaded above
M1 = librosa.feature.mfcc(y=audio1, sr=sr, n_mfcc=13)
M2 = librosa.feature.mfcc(y=audio2, sr=sr, n_mfcc=13)
mfcc_templates = [M1, M2]

# MFCC detection must be given the MFCC templates, not the Mel spectrograms
mfcc_results = mfcc_detect(positive_clips, negative_clips, mfcc_templates)
svm_results = svm_detect(positive_clips, negative_clips)

# Evaluate and print metrics for each method
print("\nMetrics for Mel Spectrogram Detection:")
evaluate_results(mel_results)

print("\nMetrics for MFCC Detection:")
evaluate_results(mfcc_results)

print("\nMetrics for SVM Detection:")
evaluate_results(svm_results)

# Run the SVM trained on the combined feature set
svm_combined_results = svm_detect_combined(positive_clips, negative_clips)

# Evaluate and print metrics for the combined SVM method
print("\nMetrics for Combined SVM Detection:")
evaluate_results(svm_combined_results)


Metrics for Cross-Correlation Detection:
Precision: 0.5000
Recall: 1.0000
F1 Score: 0.6667
Accuracy: 0.5000

Metrics for Mel Spectrogram Detection:
Precision: 0.0000
Recall: 0.0000
F1 Score: 0.0000
Accuracy: 0.5000

Metrics for MFCC Detection:
Precision: 0.0000
Recall: 0.0000
F1 Score: 0.0000
Accuracy: 0.5000

Metrics for SVM Detection:
Precision: 1.0000
Recall: 1.0000
F1 Score: 1.0000
Accuracy: 1.0000

Metrics for Combined SVM Detection:
Precision: 1.0000
Recall: 1.0000
F1 Score: 1.0000
Accuracy: 1.0000


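**What do the scores mean?**

The values below are example scores used to explain each metric; they differ from the run shown above.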

Precision: 1.0000
Precision is the ratio of true positive predictions to the total number of positive predictions made (true positives + false positives). A precision score of 1.0000 means that every time the model predicted that a clip was positive, it was correct. In other words, there were no false positives.

Recall: 0.9000
Recall is the ratio of true positive predictions to the total number of actual positive instances (true positives + false negatives). A recall score of 0.9000 means that the model correctly identified 90% of all actual positive clips. The remaining 10% were false negatives, meaning the model incorrectly classified them as negative.

F1 Score: 0.9474
The F1 Score is the harmonic mean of Precision and Recall and provides a single score that balances the trade-off between Precision and Recall. The F1 Score ranges between 0 and 1, where 1 indicates perfect precision and recall, and 0 is the worst score. An F1 Score of 0.9474 is quite high, indicating that the model has both good precision and good recall.

Accuracy: 0.9500
Accuracy is the ratio of correct predictions (both true positives and true negatives) to the total number of instances (true positives + true negatives + false positives + false negatives). An accuracy of 0.9500 means that the model correctly classified 95% of all clips, whether they were positive or negative.
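
The four example scores above can be reproduced from a single set of counts. The counts below (10 positive and 10 negative clips, with one positive missed) are an assumption chosen to match those values, not results from the notebook, and the helper name compute_metrics is hypothetical.

```python
def compute_metrics(tp, fp, fn, tn):
    """Return (precision, recall, f1, accuracy), using 0.0 where a ratio is undefined."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    return precision, recall, f1, accuracy

# Hypothetical counts: 9 of 10 positives found, all 10 negatives rejected.
precision, recall, f1, accuracy = compute_metrics(tp=9, fp=0, fn=1, tn=10)
print(f"Precision: {precision:.4f}")  # Precision: 1.0000
print(f"Recall: {recall:.4f}")        # Recall: 0.9000
print(f"F1 Score: {f1:.4f}")          # F1 Score: 0.9474
print(f"Accuracy: {accuracy:.4f}")    # Accuracy: 0.9500
```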

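In summary, scores like these suggest that an SVM-based model is performing very well. It correctly identifies almost all of the positive and negative clips, and when it predicts a clip is positive, it is always correct. However, it misses 10% of the actual positive clips, as indicated by the recall of 0.9000. Keep in mind that svm_detect and svm_detect_combined are evaluated on the same clips they were trained on, so their scores measure fit to the training data rather than performance on unseen audio.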