## Speech recognition

__You should be able to do this exercise after Lecture 9.__

In this exercise, we will work with the <a href="https://arxiv.org/pdf/1804.03209.pdf">Google Speech Command Dataset</a>, which can be downloaded from <a href="http://download.tensorflow.org/data/speech_commands_v0.02.tar.gz">here</a> (note: you do not need to download the full dataset, but it will allow you to play around with the raw audio files). This dataset contains 105,829 one-second-long audio files with utterances of 35 common words.

We will use a subset of this dataset as indicated in the table below.

| Word | How many? | Class # |
| :-: | :-: | :-: |
| Yes | 4,044 | 3 |
| No | 3,941 | 1 |
| Stop | 3,872 | 2 |
| Go | 3,880 | 0 |

The data is given in the files `XSound.npy` and `YSound.npy`, both of which can be imported using `numpy.load`. `XSound.npy` contains spectrograms (_i.e._, matrices with a time axis and a frequency axis, of size 62 (time) x 65 (frequency)). `YSound.npy` contains the class numbers, as indicated in the table above.
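Before training, it can help to confirm that the arrays match the description above. The following is a minimal sketch, not part of the exercise material: `summarize_dataset` is a hypothetical helper, and it runs on small synthetic arrays on the assumption that the real `XSound.npy` has shape `(n_samples, 62, 65)`.

```python
import numpy as np

def summarize_dataset(X, Y):
    """Return the per-sample shape and the number of samples per class (hypothetical helper)."""
    classes, counts = np.unique(Y, return_counts=True)
    class_counts = {int(c): int(n) for c, n in zip(classes, counts)}
    return X.shape[1:], class_counts

# Synthetic stand-in for the real data (assumed shape: n_samples x 62 x 65)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(8, 62, 65))
Y_demo = np.array([0, 1, 2, 3, 3, 1, 0, 3])

sample_shape, class_counts = summarize_dataset(X_demo, Y_demo)
print(sample_shape)
print(class_counts)
```

On the real data, the counts returned for classes 0 to 3 can be compared with the "How many?" column of the table.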

__(a)__ Train a convolutional neural network on the data. Find a good set of hyperparameters for the model. Do you think a convolutional neural network is suitable for this kind of problem? Why/why not?

__(b)__ Classify instances of the test set using your models. Draw a confusion matrix and comment on the results.

__(c)__ Choose one other algorithm from the course, and redo (a) and (b) using this algorithm. Supply a brief discussion of why we would expect this algorithm to do better/worse than the CNN.

In [None]:
import numpy as np
from tensorflow import keras
from sklearn.model_selection import train_test_split
from tensorflow.keras import layers


### Part A - training CNN

In [None]:
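# Load data
X = np.load("XSound.npy")
Y = np.load("YSound.npy")

# Assumption: if the spectrograms are stored as (n_samples, 62, 65), add the
# channel axis that Conv2D expects, giving (n_samples, 62, 65, 1)
if X.ndim == 3:
    X = X[..., np.newaxis]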

display(X.shape)
display(Y.shape)

x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=50)

# Define and compile the CNN model
model = keras.Sequential()

# Convolutional layers
model.add(layers.Conv2D(32, (3, 3), activation='relu', input_shape=(62, 65, 1)))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Conv2D(64, (3, 3), activation='relu'))
model.add(layers.MaxPooling2D((2, 2)))

# Flatten the output for the fully connected layers
model.add(layers.Flatten())

# Dense layers
model.add(layers.Dense(64, activation='relu'))
model.add(layers.Dense(4, activation='softmax'))  # 4 output units, one per class (go, no, stop, yes)

# Compile the model
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

model.fit(x_train, y_train, validation_data=(x_test, y_test), epochs=5, batch_size=32)

# I would expect a CNN to perform fairly well: convolution followed by pooling is approximately translation invariant, meaning that the same frequency pattern occurring at different times within the one-second recording
# should yield the same result. This matters more the shorter the word is, as a short word has more possible temporal positions within the one-second window.
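The translation argument above can be illustrated without Keras. The sketch below is a simplified illustration, not the model trained in this notebook: it applies a single hand-written filter (a made-up 3 x 3 pattern) with a valid cross-correlation and then takes a global maximum, which makes the invariance exact. A real CNN uses local pooling and dense layers, so its invariance is only approximate.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def conv_then_global_max(spectrogram, kernel):
    """Valid 2D cross-correlation followed by a global max over all positions."""
    windows = sliding_window_view(spectrogram, kernel.shape)  # (H-kh+1, W-kw+1, kh, kw)
    feature_map = np.einsum("ijkl,kl->ij", windows, kernel)
    return feature_map.max()

# A small made-up time-frequency pattern (3 time steps x 3 frequency bins)
pattern = np.array([[1.0, 2.0, 1.0],
                    [2.0, 4.0, 2.0],
                    [1.0, 2.0, 1.0]])

def place_pattern(start_time):
    """Return an empty 62 x 65 spectrogram with the pattern starting at start_time."""
    spec = np.zeros((62, 65))
    spec[start_time:start_time + 3, 30:33] = pattern
    return spec

# Same pattern, early and late in the recording
early = conv_then_global_max(place_pattern(5), pattern)
late = conv_then_global_max(place_pattern(40), pattern)
print(early, late)
```

Both responses are identical because the filter slides over every time position, so the strongest match is found wherever the pattern occurs.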

### Part B - evaluating CNN

In [None]:
# Confusion matrix function

def PlotConfusionMatrix(y_pred, y_test):
    # Calculate the confusion matrix
    cm = confusion_matrix(y_test, y_pred)

    # Display the confusion matrix using ConfusionMatrixDisplay
    labels = ['go', 'no', 'stop', 'yes']  # Must follow the class order: 0 = go, 1 = no, 2 = stop, 3 = yes
    disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=labels)

    # Plot the confusion matrix
    disp.plot(cmap=plt.cm.Blues)
    plt.title("Confusion Matrix")
    plt.show()

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

# Evaluate the model on the test data
test_loss, test_accuracy = model.evaluate(x_test, y_test)

# Print the test accuracy
print("Test accuracy:", test_accuracy)

# You can also print the test loss if needed
print("Test loss:", test_loss)

y_pred = model.predict(x_test)
y_pred = np.argmax(y_pred, axis=1)

PlotConfusionMatrix(y_pred, y_test)


# Overall performance is good: the model identifies most commands correctly. Interestingly, the errors are not the ones I would expect.
# The most frequently confused pair in the plot is "no" and "yes" (89 times in the given test set), even though the two words are phonetically very different.
# One guess is that, because they are used in the same contexts, they are spoken with similar intonation, which could show up as similar frequency patterns. Before reading much into this, check that the plot labels follow the class order (0 = go, 1 = no, 2 = stop, 3 = yes); if they do not, the confused pair may really be "go" and "no", which sound alike.
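Reading the most confused pair off a plot is error-prone, so it can be extracted from the matrix directly. The sketch below is a minimal illustration with hypothetical helper names (`most_confused_pair`, `per_class_recall`) and a made-up confusion matrix; the numbers are not results from this notebook. It assumes rows are true classes and columns are predictions, as returned by `sklearn.metrics.confusion_matrix`.

```python
import numpy as np

def most_confused_pair(cm, labels):
    """Return (true_label, predicted_label, count) for the largest off-diagonal entry."""
    off_diag = cm.copy()
    np.fill_diagonal(off_diag, 0)
    true_idx, pred_idx = np.unravel_index(np.argmax(off_diag), off_diag.shape)
    return labels[true_idx], labels[pred_idx], int(off_diag[true_idx, pred_idx])

def per_class_recall(cm):
    """Fraction of each true class that was predicted correctly (rows = true classes)."""
    return np.diag(cm) / cm.sum(axis=1)

# Made-up confusion matrix for illustration only
labels = ["go", "no", "stop", "yes"]  # class order 0, 1, 2, 3 from the table
cm_demo = np.array([[90,  8,  1,  1],
                    [ 5, 92,  2,  1],
                    [ 2,  1, 96,  1],
                    [ 0,  2,  1, 97]])

print(most_confused_pair(cm_demo, labels))
print(per_class_recall(cm_demo))
```

Passing the labels in class order keeps the reported words consistent with the class numbers in `YSound.npy`.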

### Part C - different algorithm - Random Forest

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Flatten each spectrogram into a 1D feature vector
x_train = x_train.reshape(x_train.shape[0], -1)  # Shape becomes (n_train, 62*65)
x_test = x_test.reshape(x_test.shape[0], -1)

# Create and train a Random Forest classifier
clf = RandomForestClassifier(n_estimators=20, random_state=42)
clf.fit(x_train, y_train)

# Make predictions on the test data
y_pred = clf.predict(x_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Test accuracy:", accuracy)


PlotConfusionMatrix(y_pred, y_test)

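# The CNN significantly outperforms the random forest (by approximately 15% in accuracy). The words most commonly misclassified by the random forest are "yes" and "no". A likely cause is the lack of translation invariance: after flattening, a random forest
# treats every time-frequency bin as an independent feature, so it is not built to recognize the same pattern in different parts of the spectrogram.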
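The effect of flattening can be made concrete with a small sketch. It uses made-up data and hypothetical helper names (`active_feature_indices`, `single_blob`), and it only shows how feature columns move under a time shift; it does not train a random forest.

```python
import numpy as np

def active_feature_indices(spectrogram):
    """Indices of non-zero entries after flattening, as a random forest would see them."""
    return np.flatnonzero(spectrogram.reshape(-1))

def single_blob(start_time):
    """Empty 62 x 65 spectrogram with one 2 x 2 blob of energy (made-up data)."""
    spec = np.zeros((62, 65))
    spec[start_time:start_time + 2, 10:12] = 1.0
    return spec

idx_early = active_feature_indices(single_blob(5))
idx_late = active_feature_indices(single_blob(6))  # same blob, one time step later

print(idx_early)
print(idx_late)
print(idx_late - idx_early)
```

A shift of one time step moves every active feature by 65 columns (one full row of frequency bins), so a tree that splits on specific columns has to learn each temporal position separately.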