<a href="https://colab.research.google.com/github/BenUCL/Reef-acoustics-and-AI/blob/main/Code/CNN_minibatch_creation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Create minibatches for CNN training**

This script splits the audio files into training, validation and test sets. Recordings from the same 1 hr block of the Indo data are always placed in the same set. This prevents recordings from one block appearing in both the training and the test/validation sets, which could artificially boost accuracy because of their close proximity in time.
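
As a rough illustration of the idea, the sketch below groups a few filenames by deployment and assigns whole deployments to a set. The filenames are copied from the example outputs further down; the `deployment_id` helper and the `assignment` mapping are hypothetical and only mirror the naming convention assumed in this notebook.

```python
from collections import defaultdict

def deployment_id(filename):
    """Return '<site>.<time and class>', e.g. 'BaN12.0529H' (assumed naming convention)."""
    parts = filename.split(".")
    return parts[0] + "." + parts[1][:5]

# Filenames taken from the example outputs in this notebook
files = [
    "BaN12.0529H.805322778.180907.1.24.wav",
    "BaN12.0529H.805322778.180907.3.48.wav",
    "BoF4.2333H.1677983769.180830.4.42.wav",
    "BoF4.2333H.1677983769.180830.2.38.wav",
    "SaF4.0533D.805322778.180830.2.27.wav",
    "SaF4.0533D.805322778.180830.4.41.wav",
]

# Group the 1 min files by the deployment they came from
by_deployment = defaultdict(list)
for f in files:
    by_deployment[deployment_id(f)].append(f)

# Assign whole deployments (never single files) to a set; hypothetical assignment
assignment = {"BaN12.0529H": "train", "BoF4.2333H": "val", "SaF4.0533D": "test"}
splits = defaultdict(list)
for dep, members in by_deployment.items():
    splits[assignment[dep]].extend(members)

# Check that no deployment ID appears in more than one set
ids_per_split = {name: {deployment_id(f) for f in fs} for name, fs in splits.items()}
assert ids_per_split["train"].isdisjoint(ids_per_split["val"])
assert ids_per_split["train"].isdisjoint(ids_per_split["test"])
assert ids_per_split["val"].isdisjoint(ids_per_split["test"])

print({name: len(fs) for name, fs in splits.items()})
```

Because the split is made on deployment IDs rather than on individual files, two minutes recorded in the same hour at the same site can never end up on opposite sides of the train/test boundary.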

Audio files are then read and log-mel spectrogram matrices of size [n, 96, 64] are extracted from them: the audio is split into 0.96 s chunks, each with 96 time frames and 64 frequency bins, where n is the number of 0.96 s chunks that fit in the file (e.g. 62 for a 1 min file).
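
The value of n can be estimated without loading any audio. The sketch below is a hypothetical helper (`count_examples` is not part of VGGish) that assumes the default VGGish settings: audio resampled to 16 kHz, a 25 ms STFT window with a 10 ms hop, and non-overlapping 0.96 s examples. If `vggish_params` has been edited, the numbers will differ.

```python
def count_examples(duration_s, sample_rate=16000, stft_window_s=0.025,
                   stft_hop_s=0.010, example_window_s=0.96, example_hop_s=0.96):
    """Estimate how many [96, 64] patches a recording of duration_s seconds yields.

    All defaults are assumed to match the stock VGGish parameters.
    """
    num_samples = int(round(duration_s * sample_rate))
    window = int(round(sample_rate * stft_window_s))  # 400 samples
    hop = int(round(sample_rate * stft_hop_s))        # 160 samples
    # Number of complete STFT frames that fit in the waveform
    num_frames = max(0, 1 + (num_samples - window) // hop)

    frames_per_second = 1.0 / stft_hop_s
    example_window = int(round(example_window_s * frames_per_second))  # 96 frames
    example_hop = int(round(example_hop_s * frames_per_second))        # 96 frames
    # Number of complete 96-frame patches that fit in the spectrogram
    return max(0, 1 + (num_frames - example_window) // example_hop)

print(count_examples(60.0))  # 62
```

A 1 min file therefore gives an array of shape [62, 96, 64]; any audio left over after the last complete chunk is dropped.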

These matrices are then saved as pickle files on your GDrive so that the CNN training script can access them. This is because: i) larger datasets would require too much memory if they were read in one go, and ii) features do not need recalculating every time you return to the CNN training script.
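
A minimal sketch of the round trip, assuming each pickle file holds a list of (features, label) pairs as produced later in this notebook. The arrays here are dummy zeros and the file is written to a temporary directory rather than GDrive.

```python
import os
import pickle
import tempfile

import numpy as np

# Stand-in minibatch: 3 spectrogram patches with one-hot labels (dummy values)
features = np.zeros((3, 96, 64))
labels = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
feats_labels = list(zip(features, labels))

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "train_minibatch_0")
    # Write the minibatch to disk once...
    with open(path, "wb") as fp:
        pickle.dump(feats_labels, fp)
    # ...and read it back later, one minibatch at a time
    with open(path, "rb") as fp:
        restored = pickle.load(fp)

# Unzip the pairs back into two arrays
restored_features = np.array([f for f, _ in restored])
restored_labels = np.array([l for _, l in restored])
print(restored_features.shape, restored_labels.shape)
```

Only one minibatch needs to be held in memory at a time when the files are read back this way.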



In [None]:
# Connect your Google Drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


Download the pretrained CNN (VGGish) and required packages. The smoke test code block should return 'Looks good to me!' right at the bottom.

In [None]:
!pip install numpy==1.21.5 resampy==0.2.2 tensorflow==1.15 tf_slim==1.1.0 six==1.15.0 soundfile==0.10.3.post1

""" Package versions are pinned because newer releases caused errors in the smoke test. 
Removing the pins would give a faster download, but may cause errors. 
As of 17/10/22 the install gives the output below, but the smoke test codeblock passes:

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
tensorflow-probability 0.16.0 requires gast>=0.3.2, but you have gast 0.2.2 which is incompatible.
kapre 0.3.7 requires tensorflow>=2.0.0, but you have tensorflow 1.15.0 which is incompatible.
Successfully installed gast-0.2.2 keras-applications-1.0.8 llvmlite-0.32.1 numba-0.49.1 numpy-1.21.5 resampy-0.2.2 soundfile-0.10.3.post1 tensorboard-1.15.0 tensorflow-1.15.0 tensorflow-estimator-1.15.1 tf-slim-1.1.0
WARNING: The following packages were previously imported in this runtime:
  [numpy]
You must restart the runtime in order to use newly installed versions. """

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting numpy==1.21.5
  Downloading numpy-1.21.5-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (15.7 MB)
Collecting resampy==0.2.2
  Downloading resampy-0.2.2.tar.gz (323 kB)
Collecting tensorflow==1.15
  Downloading tensorflow-1.15.0-cp37-cp37m-manylinux2010_x86_64.whl (412.3 MB)
Collecting tf_slim==1.1.0
  Downloading tf_slim-1.1.0-py2.py3-none-any.whl (352 kB)
Collecting soundfile==0.10.3.post1
  Downloading SoundFile-0.10.3.post1-py2.py3-none-any.whl (21 kB)
Collecting tensorflow-estimator==1.15.1
  Downloading tensorflow_estimator-1.15.1-py2.py3-none-any.whl (503 kB)



In [None]:
# Should output 'Looks good to me!' at the bottom
%cd /content/drive/MyDrive/Reef soundscapes with AI/Audioset
!python vggish_smoke_test.py

/content/drive/MyDrive/Reef soundscapes with AI/Audioset
Instructions for updating:
non-resource variables are not supported in the long term

Testing your install of VGGish

Log Mel Spectrogram example:  [[-4.47297436 -4.29457354 -4.14940631 ... -3.9747003  -3.94774997
  -3.78687669]
 [-4.48589533 -4.28825497 -4.139964   ... -3.98368686 -3.94976505
  -3.7951698 ]
 [-4.46158065 -4.29329706 -4.14905953 ... -3.96442484 -3.94895483
  -3.78619839]
 ...
 [-4.46152626 -4.29365061 -4.14848608 ... -3.96638113 -3.95057575
  -3.78538167]
 [-4.46152595 -4.2936572  -4.14848104 ... -3.96640507 -3.95059567
  -3.78537143]
 [-4.46152565 -4.29366386 -4.14847603 ... -3.96642906 -3.95061564
  -3.78536116]]
2022-10-18 09:49:27.307407: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2022-10-18 09:49:27.382342: E tensorflow/stream_executor/cuda/cuda_driver.cc:318] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected


**Imports**

In [None]:
# May be some redundancy here
from __future__ import print_function
import random
from random import shuffle

import numpy as np
import tensorflow.compat.v1 as tf
import tf_slim as slim

import vggish_input
import vggish_params
import vggish_slim

#Modules added by Ben
import os #for handling directories
import glob #for dealing with files in dir
import pandas as pd #for saving output at end in dataframe
import sklearn #added for train/test split
from sklearn.model_selection import train_test_split #added for train/test split
from numpy import loadtxt #added so predictions can be output to CSV file
from datetime import datetime #added to append time to csv output file name to prevent overwriting
import pickle #for storing minibatches in pickle files

**Set paths to access modules and where to store pickle files.**

Also set the number of classes and their names. Note that codeblocks further down
will need changing where highlighted with ##### if more classes are added.

In [None]:
#which repeat of the cross-val is this? (1-8):
repeat = 1 # Selects which blocks of the shuffled deployment list become the val and test sets

### Change paths if you re-structure folders

# Path to the location where your audio files are stored:
audio_dir = r'/content/drive/MyDrive/Reef soundscapes with AI/audio_dir' 

# Path to folder containing vggish setup files and 'AudiosetAnalysis' downloaded from sarebs supplementary
vggish_files = r'/content/drive/MyDrive/Reef soundscapes with AI/Audioset' 

# Output folder for results:
results_dir = r'/content/drive/MyDrive/Reef soundscapes with AI/Results/' 

#Set the directories where logmel-spectrograms will be stored for train, test and validation sets:
pickle_trainfiles_dir = r'/content/drive/MyDrive/Reef soundscapes with AI/Results/minibatches_train'
pickle_valfiles_dir = r'/content/drive/MyDrive/Reef soundscapes with AI/Results/minibatches_val'
pickle_testfiles_dir = r'/content/drive/MyDrive/Reef soundscapes with AI/Results/minibatches_test'

#how many classes?:
_NUM_CLASSES = 2

#name a column for each class e.g 'class1', 'class2', or 'healthy', 'degraded'
col_names = 'Healthy','Degraded'

#Batch size:
batch_size = 16 # number of audio files per minibatch; larger batches can cause a memory error in the NN script on Colab, depending on which GPU you are allocated

**Check how many files are in the directory (should be 152)**

In [None]:
os.chdir(audio_dir)
print(len([name for name in os.listdir('.') if os.path.isfile(name)]))

152


**Select files for training, test and validation sets**

In [None]:
'''This block finds the unique identifier of each deployment (i.e. which
hour of the day at which site) and splits these IDs into training, val and
test sets. The next codeblock uses these IDs, which are present in the name
of every 1 min file from the same deployment, to select the actual recordings.'''

#This function takes the parts of a filename that make it unique
#This uses Tim's naming convention, specific to the 2018 Indonesia data
def get_identifier(filename):
    #find part of the name that corresponds to the deployment
    t0 = filename.split(".")[0]
    t1 = filename.split(".")[1][0:5]
    t = t0+'.'+t1
    return t

 
#Function to get unique values within an array
def unique(list1):
    x = np.array(list1)
    return np.unique(x)

    
#Get deployment ID from each file
 #For indo, deployments were made in 1hr blocks so will include 60*1min files for each deployment
os.chdir(audio_dir)
all_files = []
all_IDs = []
for file in glob.glob("*.wav"):
  all_files.append(file)
  all_IDs.append(get_identifier(file))

#Use the above function to get a list of unique deployment IDs (approx 30 for healthy, and again for degraded)
unshuffled_unique_deployments = unique(all_IDs) #for the real data this gives a long, sorted list

#shuffle this list
np.random.seed(123) #this ensures the same random shuffle is made each time, so the order is conserved when resubmitted
shuffled_unique_deployments = np.random.permutation(unshuffled_unique_deployments)
np.random.seed() #now lift the seed so that randomisation can be used again in the rest of the script

print('Number of unique deployments:')
print(len(shuffled_unique_deployments))
print('The shuffled list of IDs corresponding to these:')
print(shuffled_unique_deployments)

#Bin deployments to the train, val and test sets
a = (repeat*6)-6
b = repeat*6
c = (repeat*6)+6
d = len(shuffled_unique_deployments)
val_deployments =   shuffled_unique_deployments[a:b]
test_deployments = shuffled_unique_deployments[b:c]
train_deployments = np.concatenate((shuffled_unique_deployments[c:d],shuffled_unique_deployments[0:a]))
print('IDs designated to the validation set:')
print(val_deployments)
print('IDs designated to the test set:')
print(test_deployments)
print('IDs designated to the training set:')
print(train_deployments)

Number of unique deployments:
57
The shuffled list of IDs corresponding to these:
['BoN12.1220D' 'BoN11.1200H' 'BaN12.0529H' 'SaF4.0533D' 'BoF5.0532D'
 'BoF4.2333H' 'SaF5.1202D' 'BoN11.1200D' 'BoF2.0930H' 'BoF4.1733D'
 'BoF5.0940H' 'BoF5.0940D' 'BaN11.1731H' 'BaF1.1055H' 'BaF4.1040H'
 'BaN11.1315H' 'SaF3.1733D' 'BaF5.2332H' 'BaN11.2330H' 'SaF3.1355D'
 'SaF4.0902D' 'BoF4.1300D' 'BaF2.1733H' 'BaF5.1330H' 'BoN10.1731D'
 'BoN11.0529D' 'SaN11.0940D' 'BaF5.1732H' 'BoN11.1103D' 'BoF4.1733H'
 'BoF1.1315D' 'SaF2.1203D' 'SaF5.1125D' 'BoF3.1205H' 'SaN10.1345D'
 'BaF3.0533H' 'BaF3.0915H' 'BoF1.1328H' 'SaF1.1733D' 'BoN11.0529H'
 'BaN12.0915H' 'SaF3.2333D' 'BoF4.2333D' 'SaF1.2333D' 'SaF2.1112D'
 'BaN10.0927H' 'BoN10.1731H' 'BoN10.2359D' 'BoF4.1300H' 'BoN12.1220H'
 'BoF3.1205D' 'BoF2.0930D' 'BoN11.1103H' 'BoN10.2359H' 'BoF5.0532H'
 'BaF2.2333H' 'SaF2.0534D']
IDs designated to the validation set:
['BoN12.1220D' 'BoN11.1200H' 'BaN12.0529H' 'SaF4.0533D' 'BoF5.0532D'
 'BoF4.2333H']
IDs designated to the 
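
A minimal sketch of how the `repeat` setting moves the validation and test windows along the shuffled deployment list, assuming 57 deployments and blocks of 6 as in the cell above. `split_indices` is a hypothetical helper that returns index positions in the shuffled list rather than deployment IDs.

```python
def split_indices(repeat, n_deployments=57, block=6):
    """Return (train, val, test) index lists for one cross-validation repeat."""
    a = repeat * block - block
    b = repeat * block
    c = repeat * block + block
    val = list(range(a, b))
    test = list(range(b, min(c, n_deployments)))
    # Training uses everything after the test block plus everything before the val block
    train = list(range(c, n_deployments)) + list(range(0, a))
    return train, val, test

for repeat in range(1, 9):
    train, val, test = split_indices(repeat)
    # The three sets never share an index
    assert set(train).isdisjoint(val)
    assert set(train).isdisjoint(test)
    assert set(val).isdisjoint(test)
    print(repeat, (val[0], val[-1]), (test[0], test[-1]), len(train))
```

With these settings every repeat from 1 to 8 uses 6 validation, 6 test and 45 training deployments, and the test block of one repeat becomes the validation block of the next.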

In [None]:
'''This block uses the IDs of the train, val and test sets generated above
to select the actual recordings. This generates:
train_files
val_files
test_files
which are the lists of recordings corresponding to each set '''


# Define empty lists
val_files = []
test_files = []
train_files = []

# Select all files in the directory whose deployment ID is in each set
for f in all_files: # all_files was built in the previous codeblock
  ID = get_identifier(f) # same ID format as used to build the deployment lists
  if ID in val_deployments:
    val_files.append(f)
  elif ID in test_deployments:
    test_files.append(f)
  elif ID in train_deployments:
    train_files.append(f)

print('Number and list of validation files:')
print(len(val_files))
print(val_files)
print('Number and list of test files:')
print(len(test_files))
print(test_files)
print('Number and list of training files:')
print(len(train_files))
print(train_files)



Number and list of validation files:
14
['BaN12.0529H.805322778.180907.1.24.wav', 'BaN12.0529H.805322778.180907.3.48.wav', 'BoF4.2333H.1677983769.180830.4.42.wav', 'BoF4.2333H.1677983769.180830.2.38.wav', 'BoF4.2333H.1677983769.180830.5.54.wav', 'BoN11.1200H.805322778.180906.1.2.wav', 'BoN11.1200H.805322778.180906.3.12.wav', 'BoF5.0532D.671907872.180831.3.14.wav', 'BoN12.1220D.1678278701.180907.4.39.wav', 'BoN12.1220D.1678278701.180907.1.9.wav', 'SaF4.0533D.805322778.180830.2.27.wav', 'SaF4.0533D.805322778.180830.4.41.wav', 'SaF4.0533D.805322778.180830.3.34.wav', 'BaN12.0529H.805322778.180907.4.50.wav']
Number and list of test files:
15
['BoF2.0930H.805322778.180828.3.28.wav', 'BoF2.0930H.805322778.180828.5.53.wav', 'BoF2.0930H.805322778.180828.1.4.wav', 'BoF5.0940H.671907872.180831.2.30.wav', 'BoF5.0940H.671907872.180831.3.36.wav', 'BoF5.0940H.671907872.180831.4.47.wav', 'BoF5.0940D.1677983769.180831.1.1.wav', 'BoF5.0940D.1677983769.180831.2.30.wav', 'SaF5.1202D.1677983769.180831.5.30

In [None]:
"""This block gets the log-mel spectrogram of each 0.96s chunk of a file and stores 
these in arrays, with labels for each class, that are compatible with the NN training script. """


def get_features(file_):
    #Loop to calculate log-mel spectrogram features for each file
    features_from_files_list = np.empty([0,96,64])
        
    #calculate features
    print('Calculating features for:')
    print(file_)
    features_from_files_list = np.vstack([features_from_files_list, vggish_input.wavfile_to_examples(file_)])
    
    #save results in array   
    features_from_files = np.array(features_from_files_list)
    return features_from_files

def prepare_audio(file_, num_of_classes, class_num):  #### add more classes as appropriate
  """
  Cut an audio file into 0.96s chunks and calculate the log-mel spectrogram
  for each chunk. The output is not shuffled here; training data are shuffled
  later, when the minibatch is pickled.

  Args:
    file_: path to a .wav file.
    num_of_classes: total number of classes.
    class_num: class of this file, counted from 1 (1, 2, 3, ...).

  Returns:
    logmel_spectrogram: NumPy array of shape [n, num_frames, num_bands], where
      n is the number of 0.96s chunks in the file and each row is a log-mel
      spectrogram patch suitable for feeding VGGish.
    labels: NumPy array of shape [n, num_classes], where each row is the
      one-hot label vector for the corresponding row of logmel_spectrogram.
  """

  # Generate logmel-spec array for each 0.96s of file, this is shape (96,64)*length of file e.g (62, 96, 64) = 1min
  logmel_spectrogram = get_features(file_) 

  # Create one-hot coding
  class_num = class_num - 1 # function takes classes numbered from 1 (1, 2, 3 etc), not from 0
  label = np.zeros(num_of_classes) # Make array of 0's the length of the number of classes
  label[class_num] = 1 # One-hot code the element in the array corresponding to the class
  labels = np.array([label]*len(logmel_spectrogram)) # Repeat this so it is the same length as logmel_spectrogram
  #labels = labels.astype(int) #wanted to convert y_ to int not float, but can't get it to work


  return logmel_spectrogram, labels
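
The labelling step can be checked on its own without any audio. This is a small sketch of the same one-hot scheme; `one_hot_labels` is a hypothetical helper, and 62 patches is the count assumed for a 1 min file.

```python
import numpy as np

def one_hot_labels(class_num, num_classes, num_patches):
    """Build one label row per spectrogram patch; class_num counts from 1."""
    label = np.zeros(num_classes)
    label[class_num - 1] = 1
    # Repeat the same row once for every patch in the file
    return np.tile(label, (num_patches, 1))

labels = one_hot_labels(class_num=2, num_classes=2, num_patches=62)
print(labels.shape)  # (62, 2)
print(labels[0])     # [0. 1.]
```

Every patch from a file inherits the class of the whole file, so a 'Degraded' recording yields 62 identical rows of `[0, 1]`.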

In [None]:
""" This block uses the above functions to define more which get the feats and 
labels for each minibatch and stores these as pickle files to be used by the NN """

def get_class(filename):
    #find part of the name that corresponds to the deployment
     #adapted the get_identifier function above to only get class (e.g healthy)
    deployment_ID = filename.split(".")[1][4:5]
    return deployment_ID


def pickle_minibatches(minibatch, testvaltrain, minibatch_number):
  '''Extract log-mel spectrograms using prepare_audio (defined above), store
    them with their labels in arrays and pickle the result.
    This function should be run on one minibatch at a time.
    
    minibatch = a list of .wav files
    testvaltrain = 'test', 'val', or 'train'
    minibatch_number = index of the minibatch, used in the pickle filename '''

  print('Creating minibatch '+str(minibatch_number))
  # Define arrays for each class, add additional classes as appropriate
  all_features_class1 = np.empty([0,96,64])
  all_labels_class1 = np.empty([0,_NUM_CLASSES])
  all_features_class2 = np.empty([0,96,64])
  all_labels_class2 = np.empty([0,_NUM_CLASSES])
  # for each file in the minibatch, get feats and labels
  for k in minibatch:#[i]:
    os.chdir(audio_dir)
    #print(k)
    if get_class(k) == 'H':
      features_class1, labels_class1 = prepare_audio(k, _NUM_CLASSES, 1)
      all_features_class1 = np.vstack([all_features_class1, features_class1])
      all_labels_class1 = np.vstack([all_labels_class1, labels_class1])
    if get_class(k) == 'D':
      features_class2, labels_class2 = prepare_audio(k, _NUM_CLASSES, 2)
      all_features_class2 = np.vstack([all_features_class2, features_class2])
      all_labels_class2 = np.vstack([all_labels_class2, labels_class2])
  
  # Now combine the feats/labels from all classes
  minibatch_features = np.vstack([all_features_class1, all_features_class2]) #will need to repeat these lines if using >2 classes
  minibatch_labels = np.vstack([all_labels_class1, all_labels_class2])
  
  # Pickle into the correct folder for train, val or test batches
  if testvaltrain == 'train':
    #shuffle training data
    feats_labels = list(zip(minibatch_features, minibatch_labels)) #zip up
    random.shuffle(feats_labels) #shuffle

    #save pickle file 
    os.chdir(pickle_trainfiles_dir)
    pickle_filename = testvaltrain + '_' + 'minibatch_' + str(minibatch_number) 
    with open(pickle_filename, "wb") as fp:   #Pickling
      pickle.dump(feats_labels, fp)
    print('Pickled ' + pickle_filename)

  if testvaltrain == 'test':
    feats_labels = list(zip(minibatch_features, minibatch_labels))

    #save pickle file 
    os.chdir(pickle_testfiles_dir)
    pickle_filename = testvaltrain + '_' + 'minibatch_' + str(minibatch_number) 
    with open(pickle_filename, "wb") as fp:   #Pickling
      pickle.dump(feats_labels, fp)
    print('Pickled ' + pickle_filename)

  if testvaltrain == 'val':
    feats_labels = list(zip(minibatch_features, minibatch_labels))

    #save pickle file 
    os.chdir(pickle_valfiles_dir)
    pickle_filename = testvaltrain + '_' + 'minibatch_' + str(minibatch_number) 
    with open(pickle_filename, "wb") as fp:   #Pickling
      pickle.dump(feats_labels, fp)
    print('Pickled ' + pickle_filename)

Execute all the above functions on the minibatches and corresponding files
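
The chunking step used in the next cell can be tried on its own first. This sketch uses made-up filenames and the same slicing approach, written as a named function instead of a lambda.

```python
def split_minibatches(items, size):
    """Split a list into consecutive chunks of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]

# Hypothetical filenames, for illustration only
file_names = ["file_%d.wav" % n for n in range(40)]
batches = split_minibatches(file_names, 16)
print([len(b) for b in batches])  # [16, 16, 8]
```

Note that the batch size counts audio files, not spectrogram patches: if every file is a 1 min recording yielding 62 patches, a full minibatch of 16 files holds 992 patches, and the final minibatch may be smaller.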

In [None]:
""" Split train, val and test files into minibatches.
 This gives nested lists of shape (n, batch_size),
 e.g. 12 minibatches of 32 files would be shape (12, 32);
 the last minibatch may contain fewer files. """
 
split_minibatches = lambda test_list, x: [test_list[i:i+x] for i in range(0, len(test_list), x)]
train_minibatches = split_minibatches(train_files, batch_size)
val_minibatches = split_minibatches(val_files, batch_size)
test_minibatches = split_minibatches(test_files, batch_size)

# Set dir to the directory where all the audio files are stored
os.chdir(audio_dir)

# Run pickle_minibatches for the train, val and test files, save as pickle files
for i in range(len(train_minibatches)):
  pickle_minibatches(train_minibatches[i], testvaltrain = 'train', minibatch_number = i)

for i in range(len(val_minibatches)):
  pickle_minibatches(val_minibatches[i], testvaltrain = 'val', minibatch_number = i)

for i in range(len(test_minibatches)):
  pickle_minibatches(test_minibatches[i], testvaltrain = 'test', minibatch_number = i)

Creating minibatch 0
Calculating features for:
BaF1.1055H.1678278701.180827.3.35.wav
Calculating features for:
BaF3.0915H.1678278701.180829.4.52.wav
Calculating features for:
BaF3.0533H.1677983769.180829.3.22.wav
Calculating features for:
BaF2.2333H.1677983769.180828.1.15.wav
Calculating features for:
BaF3.0533H.1677983769.180829.5.59.wav
Calculating features for:
BaF2.1733H.1677983769.180828.3.26.wav
Calculating features for:
BaF3.0533H.1677983769.180829.4.53.wav
Calculating features for:
BaF1.1055H.1678278701.180827.1.1.wav
Calculating features for:
BaF3.0915H.1678278701.180829.1.20.wav
Calculating features for:
BaF1.1055H.1678278701.180827.5.45.wav
Calculating features for:
BaF5.2332H.671907872.180831.3.35.wav
Calculating features for:
BaF5.1330H.671907872.180831.3.18.wav
Calculating features for:
BaF5.1330H.671907872.180831.5.49.wav
Calculating features for:
BaF5.2332H.671907872.180831.4.39.wav
Calculating features for:
BaF5.1732H.671907872.180831.4.31.wav
Calculating features for: