
# Speaker Recognition CNN #

The following sections show the code used for the project.

**Cloning the repository**


The repository is cloned to gain access to the preprocessed audio files, spectrograms and functions.



*   Functions used are found in src/data_utils
*   Spectrograms and audio files are found in data/train and data/test



In [1]:
!git clone https://github.com/SilasRu/Oeko3.git

Cloning into 'Oeko3'...
remote: Enumerating objects: 378, done.
remote: Counting objects: 100% (378/378), done.
remote: Compressing objects: 100% (357/357), done.
remote: Total 22912 (delta 43), reused 343 (delta 19), pack-reused 22534
Receiving objects: 100% (22912/22912), 1.26 GiB | 36.43 MiB/s, done.
Resolving deltas: 100% (730/730), done.
Checking out files: 100% (20270/20270), done.


**Importing packages and path handling**

In [1]:
import os
import random
import sys

import numpy as np
import matplotlib.pyplot as plt
import keras
import keras.models as km
import keras.layers as kl
from keras.preprocessing.image import ImageDataGenerator, array_to_img, img_to_array, load_img
from keras.models import Sequential
from keras.layers import Dense, Conv2D, Flatten, MaxPooling2D, Activation, Dropout
from keras.utils import to_categorical
from sklearn.metrics import confusion_matrix, accuracy_score

# Training spectrograms: one subfolder per speaker
path = os.path.join('Oeko3', 'data', 'train', 'spectrograms')
persons = sorted(os.listdir(path))

# Test spectrograms: persons_test holds the spectrogram file names
path_test = os.path.join('Oeko3', 'data', 'test', 'spectrograms')
persons_test = os.listdir(path_test)

# CSV file with the speaker of each test spectrogram
speaker_list = os.path.join('Oeko3', 'data', 'test', 'speaker_list.csv')

# Make the repository's helper module importable, then import it
sys.path.append(os.path.join('Oeko3', 'src'))
import data_utils

Using TensorFlow backend.


The training folder contains the following speakers:


In [3]:
persons

['berset', 'goess', 'projer', 'roesti', 'rytz']

---

**Converting train and test spectrograms to matrices**

The following two functions convert a given number of images into an array that is then fed to the neural network.
To randomize the training input, each loop picks one of the available speaker folders at random and draws 10 random images from it. Images whose shape differs from 480×640×3 are skipped.

In [0]:
def convert_train(size):
    """Run `size` loops; each loop loads 10 random spectrograms of one random speaker."""
    x_train = []
    y_train = []
    for _ in range(size):
        # Pick a random speaker folder
        sample = random.randrange(len(persons))
        tempdir = os.path.join(path, persons[sample])
        tempfiles = os.listdir(tempdir)
        amount = 0
        while amount < 10:
            # Load one random spectrogram and scale pixel values to [0, 1]
            sample_img = random.choice(tempfiles)
            temp_img = load_img(os.path.join(tempdir, sample_img))
            temp_x = img_to_array(temp_img) / 255.
            # Keep only images that match the network's input shape
            if temp_x.shape == (480, 640, 3):
                x_train.append(temp_x)
                y_train.append(sample)
                amount += 1
    # Stack the collected images into one array of shape (n, 480, 640, 3)
    return np.array(x_train), y_train
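
The sampling scheme described above can be illustrated without loading any images. The following is a minimal sketch, not part of the project code: the helper `sample_labels` and its parameters are hypothetical, and the fixed seed is only there to make the run reproducible.

```python
import random
from collections import Counter

def sample_labels(size, num_speakers=5, per_loop=10, seed=0):
    """Mimic the label sequence of the sampling scheme (hypothetical helper)."""
    rng = random.Random(seed)  # seeded only to make the sketch reproducible
    labels = []
    for _ in range(size):
        speaker = rng.randrange(num_speakers)  # one random speaker per loop
        labels.extend([speaker] * per_loop)    # 10 images, all from that speaker
    return labels

labels = sample_labels(20)
print(len(labels))      # total number of images in one batch
print(Counter(labels))  # images per speaker, always a multiple of 10
```

The labels arrive in runs of 10 identical values. Keras takes the `validation_split` samples from the end of the arrays before shuffling, so with this ordering the validation part of a batch may cover only a few speakers.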

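For the test set, the preprocessed spectrograms are loaded and converted to an array together with the corresponding speaker labels, which are read from the speaker_list CSV file. Entries with label 5 are skipped, because 5 is not one of the five trained classes (0 to 4).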

In [0]:
def convert_test(speaker_list, range_start, range_stop):
    """Load the test spectrograms in [range_start, range_stop) with their labels."""
    x_test = []
    y_test = []
    converter = data_utils.Utils()
    y_test_list = converter.create_y_test(speaker_list)[range_start:range_stop]
    # Sorted file names are assumed to line up with the rows of the speaker list
    tempfiles = sorted(persons_test)[range_start:range_stop]

    for index, img in enumerate(tempfiles):
        # Label 5 is not one of the trained classes, so skip it
        if y_test_list[index] == 5:
            continue
        # Load the spectrogram and scale pixel values to [0, 1]
        temp_img = load_img(os.path.join(path_test, img))
        x_test.append(img_to_array(temp_img) / 255.)
        y_test.append(y_test_list[index])

    # Stack the collected images into one array of shape (n, height, width, 3)
    return np.array(x_test), y_test
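
The pairing of sorted file names with labels, and the removal of label 5, can be sketched on toy data. The file names and labels below are hypothetical and only illustrate the mechanism; the real ones come from the repository.

```python
# Hypothetical file names and labels, for illustration only
files = ["seg_10.png", "seg_2.png", "seg_1.png"]
labels = [0, 5, 3]  # label i belongs to the i-th file name in sorted order

def pair_and_filter(files, labels, skip_label=5):
    """Pair sorted file names with labels and drop entries carrying skip_label."""
    pairs = zip(sorted(files), labels)
    return [(name, label) for name, label in pairs if label != skip_label]

kept = pair_and_filter(files, labels)
print(kept)
```

Note that `sorted` orders strings lexicographically, so `seg_10.png` comes before `seg_2.png`. Whether this matters depends on how the files in the repository are named and on the order of the rows in the speaker list.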

---

**CNN build**


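Here we create the core CNN that is trained and then used for the predictions. It consists of two strided convolution layers, each followed by max pooling, and a softmax output layer with one unit per speaker.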

In [6]:
#create model
model = Sequential()
#add model layers
model.add(Conv2D(128,strides=3, kernel_size=3, activation="relu",
                 input_shape= (480, 640, 3)))
model.add(MaxPooling2D())
model.add(Conv2D(64,strides=3 ,kernel_size=3, activation="sigmoid"))
model.add(MaxPooling2D())
model.add(Flatten())
model.add(Dense(len(persons), activation="softmax"))

#compile model using accuracy to measure model performance
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

Instructions for updating:
Colocations handled automatically by placer.
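
To see how the layers above shrink the 480×640×3 input, the output shapes and parameter counts can be worked out by hand. The sketch below assumes the Keras defaults that the model code does not override: `'valid'` padding for `Conv2D` and a 2×2 pool with stride 2 for `MaxPooling2D`. The helper `out_len` is a hypothetical name introduced for this example.

```python
def out_len(size, kernel, stride):
    """Output length along one axis for 'valid' padding (hypothetical helper)."""
    return (size - kernel) // stride + 1

h, w, c = 480, 640, 3

# Conv2D(128, kernel_size=3, strides=3)
h, w = out_len(h, 3, 3), out_len(w, 3, 3)
params_conv1 = 3 * 3 * c * 128 + 128  # weights + biases
c = 128

# MaxPooling2D() with default 2x2 pool and stride 2
h, w = out_len(h, 2, 2), out_len(w, 2, 2)

# Conv2D(64, kernel_size=3, strides=3)
h, w = out_len(h, 3, 3), out_len(w, 3, 3)
params_conv2 = 3 * 3 * c * 64 + 64
c = 64

# MaxPooling2D() again
h, w = out_len(h, 2, 2), out_len(w, 2, 2)

# Flatten followed by Dense(5, softmax)
flat = h * w * c
params_dense = flat * 5 + 5
total_params = params_conv1 + params_conv2 + params_dense

print("final feature map:", (h, w, c))
print("flattened size:", flat)
print("total parameters:", total_params)
```

Under these assumptions the final feature map is 13×17×64, which flattens to 14144 values, and the model has 148101 trainable parameters. Calling `model.summary()` on the actual model is the authoritative check.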


---

**Fitting the model on the training set**

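Training runs for 40 rounds. In each round, 200 random images are drawn from the training folder (20 loops of 10 images each) and the model is fitted on them for 20 epochs, with 20% of the images held out for validation. We chose not to train on all available files at once because converting every image to a matrix requires too much memory.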


In [0]:
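episodes = 40
for i in range(episodes):
    # Draw a fresh random batch of 20 * 10 = 200 spectrograms
    x_train, y_train = convert_train(20)
    # One-hot encode the speaker indices
    y_cat = to_categorical(y_train, num_classes=len(persons))
    # Fit for 20 epochs, holding out 20% of the batch for validation
    model.fit(x_train, y_cat, validation_split=0.2, epochs=20, shuffle=True)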
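
The memory cost mentioned above can be estimated with a short calculation. This is a rough sketch that assumes the images are stored as 32-bit floats (4 bytes per value), which is the usual Keras default but is not stated in the notebook.

```python
images_per_batch = 20 * 10        # convert_train(20): 20 loops of 10 images
values_per_image = 480 * 640 * 3  # height * width * colour channels
bytes_per_value = 4               # assumption: float32

batch_bytes = images_per_batch * values_per_image * bytes_per_value
batch_mib = batch_bytes / 2**20

print(f"one image:  {values_per_image * bytes_per_value / 2**20:.2f} MiB")
print(f"one batch:  {batch_mib:.2f} MiB")
```

Under this assumption a single batch of 200 images already takes roughly 700 MiB, so holding thousands of spectrograms in memory at once would require many gigabytes.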

---

**Prediction on the test set and confusion matrix**

In the last step, the predictions are made on the SRF Arena test spectrograms.

In [0]:
# Load the test spectrograms and their true speaker labels
x_test, y_test_list = convert_test(speaker_list, 0, 4500)
# Predict a class index for every test spectrogram
predictions = model.predict_classes(x_test)

In [69]:
len(predictions)

3477

In [67]:
# Rows are true speakers, columns are predicted speakers
print(confusion_matrix(y_test_list, predictions))
print(accuracy_score(y_test_list, predictions))

[[1314   64   52   88   13]
 [  62  298    5   32   17]
 [ 101   33  328  102   14]
 [ 140   18   83  291   17]
 [  25   70   14   16  280]]
0.722174288179465


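We achieve an accuracy of about 72% (2511 of 3477 test spectrograms classified correctly). The results vary by speaker: class 0 is recognized best (1314 of 1531 correct), while class 3 is recognized worst (291 of 549 correct), so there is room for further improvement.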
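
The accuracy and the per-class figures can be recomputed directly from the confusion matrix printed above. The sketch below uses plain Python and relies on the scikit-learn convention that rows are true classes and columns are predicted classes; the variable names are chosen for this example.

```python
# Confusion matrix copied from the output above (rows: true, columns: predicted)
cm = [
    [1314,  64,  52,  88,  13],
    [  62, 298,   5,  32,  17],
    [ 101,  33, 328, 102,  14],
    [ 140,  18,  83, 291,  17],
    [  25,  70,  14,  16, 280],
]
n = len(cm)

total = sum(sum(row) for row in cm)
correct = sum(cm[i][i] for i in range(n))
accuracy = correct / total

# Recall: share of each true class that was predicted correctly
recall = [cm[i][i] / sum(cm[i]) for i in range(n)]
# Precision: share of each predicted class that was actually correct
precision = [cm[i][i] / sum(row[i] for row in cm) for i in range(n)]

print(f"accuracy: {accuracy:.4f}")
for i in range(n):
    print(f"class {i}: recall {recall[i]:.3f}, precision {precision[i]:.3f}")
```

The test set is imbalanced (class 0 makes up 1531 of the 3477 samples), so the per-class recall gives a more complete picture than the overall accuracy alone.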