# Introduction

In this notebook I will use a convolutional neural network to determine which string of the guitar a note was played on. To achieve this I will be using a dataset of spectrogram images produced from wav files of single notes played on guitar. Finding an accurate way to determine which string a note was played on addresses the problem that the same "exact" note exists in multiple places on the guitar, with only slight tonal differences distinguishing them. This approach aims to incorporate those tonal differences by looking at the spectrogram.
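To make the "same note in multiple places" problem concrete, here is a minimal sketch that lists every string and fret position producing a given pitch. It assumes standard tuning and a 22-fret neck, neither of which is stated in this notebook, and it assumes that the `H` file-name prefix used for labels later on denotes the high E string. `OPEN_STRINGS`, `NUM_FRETS`, and `positions_for` are hypothetical names used only for illustration.

```python
# Assumed standard tuning as MIDI note numbers (E2, A2, D3, G3, B3, E4).
# "H" is assumed to be the high E string, following the label prefixes in this notebook.
OPEN_STRINGS = {"E": 40, "A": 45, "D": 50, "G": 55, "B": 59, "H": 64}
NUM_FRETS = 22  # hypothetical fret count


def positions_for(midi_note):
    """Return every (string, fret) pair that produces the given pitch."""
    return [
        (name, midi_note - open_note)
        for name, open_note in OPEN_STRINGS.items()
        if 0 <= midi_note - open_note <= NUM_FRETS
    ]


# E4 (MIDI 64) is the pitch of the open high E string.
e4_positions = positions_for(64)
print(e4_positions)
# [('A', 19), ('D', 14), ('G', 9), ('B', 5), ('H', 0)]
```

Under these assumptions a single pitch can be played on five different strings, which is why pitch alone cannot identify the string.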

# Data Analysis

Deriving the spectrograms involved recording hundreds of audio clips and then using librosa to convert the wav files to spectrograms. The recordings were made with my own electric guitar, an audio interface, and the Ableton Lite recording software. The tone is completely clean, meaning the input of the guitar is not modified in any way. Every fret was recorded on every string. Once all frets had been recorded, I tuned the guitar down and recorded again, to make sure the notes are distinct and to increase the overlap between notes with the same pitch on different strings. I then wrote a script to convert the audio clips to spectrograms, trimming the beginning and end of each clip to make sure there is no external sound (such as a finger going on or off the fret).
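The conversion script itself is not shown in this notebook. As a rough illustration of the two steps described above (trimming the edges, then computing a spectrogram), here is a sketch that uses only NumPy in place of librosa, which provides comparable building blocks such as `librosa.stft` and `librosa.amplitude_to_db`. The trim length, FFT size, hop size, and the synthetic test tone are all assumptions, and `trim_edges` and `log_spectrogram` are hypothetical helper names.

```python
import numpy as np


def trim_edges(signal, sample_rate, trim_seconds=0.1):
    """Drop a fixed slice from the start and end of a recording."""
    n = int(trim_seconds * sample_rate)
    if 2 * n >= len(signal):
        raise ValueError("recording is too short to trim")
    return signal[n:len(signal) - n]


def log_spectrogram(signal, n_fft=1024, hop=256):
    """Magnitude STFT in decibels, shape (n_fft // 2 + 1, n_frames)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack(
        [signal[i * hop:i * hop + n_fft] * window for i in range(n_frames)]
    )
    magnitude = np.abs(np.fft.rfft(frames, axis=1)).T
    return 20 * np.log10(magnitude + 1e-10)  # small offset avoids log(0)


# Synthetic stand-in for a recorded note: one second of a 110 Hz sine tone.
sample_rate = 22050
t = np.arange(sample_rate) / sample_rate
note = np.sin(2 * np.pi * 110.0 * t)

trimmed = trim_edges(note, sample_rate, trim_seconds=0.1)
spec = log_spectrogram(trimmed)

# Frequency of the strongest bin, averaged over time.
peak_hz = spec.mean(axis=1).argmax() * sample_rate / 1024
print(spec.shape, peak_hz)
```

A real guitar note would show the fundamental plus a series of overtones in this array. The relative strength of those overtones is the kind of tonal detail the network is expected to pick up from the images.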

# Data Import

In [12]:
import os
import cv2
import numpy as np

image_directory = "C:/Users/Mario/PycharmProjects/SoundTesting/Spectrograms"

# The first character of each file name encodes the string the note was played on
string_to_label = {'E': 0, 'A': 1, 'D': 2, 'G': 3, 'B': 4, 'H': 5}

y = []
X = []

for file in os.listdir(image_directory):
    label = string_to_label.get(file[0])
    pixels = cv2.imread(os.path.join(image_directory, file))
    # Skip unlabeled or unreadable files so that X and y stay aligned
    if label is None or pixels is None:
        continue
    X.append(pixels)
    y.append(label)

y = np.array(y)
X = np.array(X)

In [13]:
print(len(X))
print(len(y))

823
823


# Training

In [31]:
import tensorflow as tf
from tensorflow.keras import layers, models

# Define the CNN architecture
model = models.Sequential([
    layers.Conv2D(32, (3, 3), activation='relu', input_shape=(200, 500, 3)),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation='relu'),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation='relu'),
    layers.Flatten(),
    layers.Dense(64, activation='relu'),
    layers.Dense(6, activation='softmax')
])

# Compile the model
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

In [33]:
from sklearn.model_selection import train_test_split

# Split the data into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


In [39]:
epochs = 10
batch_size = 64

model.summary()

history = model.fit(X_train, y_train, epochs=epochs, batch_size=batch_size, validation_data=(X_test, y_test))

Model: "sequential_3"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 conv2d_9 (Conv2D)           (None, 198, 498, 32)      896       
                                                                 
 max_pooling2d_6 (MaxPoolin  (None, 99, 249, 32)       0         
 g2D)                                                            
                                                                 
 conv2d_10 (Conv2D)          (None, 97, 247, 64)       18496     
                                                                 
 max_pooling2d_7 (MaxPoolin  (None, 48, 123, 64)       0         
 g2D)                                                            
                                                                 
 conv2d_11 (Conv2D)          (None, 46, 121, 64)       36928     
                                                                 
 flatten_3 (Flatten)         (None, 356224)           

In [40]:
# Evaluate the model on the test set
test_loss, test_accuracy = model.evaluate(X_test, y_test)

print("Test Accuracy:", test_accuracy)

Test Accuracy: 0.8727272748947144


In [41]:
from sklearn.metrics import classification_report

y_pred = model.predict(X_test)

y_pred_classes = np.argmax(y_pred, axis=1)

report = classification_report(y_test, y_pred_classes)

print(report)

              precision    recall  f1-score   support

           0       0.89      0.92      0.91        26
           1       0.90      0.87      0.89        31
           2       0.88      0.94      0.91        32
           3       0.96      0.73      0.83        30
           4       0.73      0.90      0.81        21
           5       0.88      0.88      0.88        25

    accuracy                           0.87       165
   macro avg       0.87      0.87      0.87       165
weighted avg       0.88      0.87      0.87       165



# Model Evaluation

Looking at the results, the model reached a test accuracy of about 87% on the 165 held-out images. With 6 classes, guessing uniformly would give roughly 17%, so this level of accuracy is encouraging. Precision ranges from 73% to 96% across the strings, so the model gives relatively few false positives for most classes; class 4 (the B string) is the weakest at 73%. Recall ranges from 73% to 94%, so the model recognizes most of the notes on every string; class 3 (the G string) has the lowest recall at 73%, meaning roughly a quarter of its notes were assigned to another string. Overall, using spectrograms to identify slight tonal differences in similar sounds seems to be a viable approach, and it could be used in other cases that involve identifying something from its tone or frequency content.
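As a cross-check on the numbers above, the overall accuracy can be recovered from the per-class recall and support columns of the classification report, since recall times support gives the number of correctly classified notes per class. The sketch below copies those two columns from the report. Rounding to whole counts is an assumption, needed because the report only shows two decimals.

```python
import numpy as np

# Per-class recall and support, copied from the classification report.
recall = np.array([0.92, 0.87, 0.94, 0.73, 0.90, 0.88])
support = np.array([26, 31, 32, 30, 21, 25])

# recall * support is the number of correct predictions per class,
# rounded to the nearest whole note.
correct = np.rint(recall * support).astype(int)

accuracy = correct.sum() / support.sum()
chance = 1 / len(support)  # uniform guessing over six classes

print(correct.sum(), round(accuracy, 4), round(chance, 4))
# 144 0.8727 0.1667
```

The reconstructed 144 correct predictions out of 165 agree with the reported test accuracy of 0.8727.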

# Improvements

In terms of improvements, there are various things that can be worked on. The biggest limitation is hardware resources: the images are currently kept fairly small (200 x 500 pixels) so that training can run on my system. With a better system I could use larger images, which would give the model finer detail about the spectrogram. Another thing to consider is that every guitar is unique and has its own distinct sound. Since the audio was recorded only on my guitar, the model will likely perform well only on my guitar. A more diverse set of recordings across different guitars should improve the generalization of the model. In addition, including audio with different tones or added effects could also improve generalization.
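One way to see why image size is tied to hardware is to count the weights in the first dense layer, which grows with the size of the flattened feature map. The sketch below reproduces the layer arithmetic of the model defined earlier (three unpadded 3x3 convolutions with two 2x2 max-pooling layers in between, followed by `Dense(64)`). The 400 x 1000 input is a hypothetical larger size chosen only for comparison, and the helper names are not part of Keras.

```python
def conv_out(size, kernel=3):
    """Output length of an unpadded ('valid') convolution with stride 1."""
    return size - kernel + 1


def pool_out(size, pool=2):
    """Output length of non-overlapping max pooling."""
    return size // pool


def flatten_units(height, width, channels=64):
    """Size of the Flatten output for the conv/pool/conv/pool/conv stack."""
    for layer in (conv_out, pool_out, conv_out, pool_out, conv_out):
        height, width = layer(height), layer(width)
    return height * width * channels


def first_dense_params(height, width, units=64):
    """Weights plus biases of the Dense(64) layer that follows Flatten."""
    return flatten_units(height, width) * units + units


current = first_dense_params(200, 500)   # input size used in this notebook
doubled = first_dense_params(400, 1000)  # hypothetical larger input
print(flatten_units(200, 500), current, doubled)
# 356224 22798400 96731200
```

The 356224 flattened units match the model summary printed during training. Doubling both image dimensions roughly quadruples the dense layer, before even counting the extra memory needed to hold the larger images and activations.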