Import wav audio samples into a pandas dataframe

Create base dataset csv file with columns:
language, speaker, audio_raw_data

In [1]:
#import libraries
import numpy as np
from scipy.io import wavfile
import os
import pandas as pd

sample_length = 60 * 16000 # 16kHz sampling of 60 seconds of audio

# print size of list of audio files
folder = os.getcwd() + "/../rec/"
audio_files_list = os.listdir(folder)
print(len(audio_files_list))

# store the audio recordings and their associated language and speaker
audio_recs = np.zeros( (len(audio_files_list), sample_length), dtype="int16")
languages = []
speakers = []
print(audio_recs.shape)

# create starting point: dataset with all the audio recordings
for i, audio in enumerate(audio_files_list):
    sample_rate, audio_rec = wavfile.read(folder + audio)
    audio_recs[i] = audio_rec
    languages.append(audio[4:7])
    speakers.append(audio[8 : len(audio) - 7])

# print unique speakers and languages
print(np.unique(speakers))
print(np.unique(languages))


# Create a dictionary with the structured data
data_dictionary = {
    'language': languages,
    'speaker': speakers,
    'audio_raw_data': audio_recs.tolist()
}

'''
# Specify the data types for each column
data_types = {
    'audio_raw_data': 'object',
    'language': 'str'
}
'''
dataset = pd.DataFrame(data_dictionary)
# Save the DataFrame to a CSV file (disabled by default)
# dataset.to_csv('dataset0.csv', index=False)

print(dataset.head())


53
(53, 960000)
['alessandro' 'donia' 'elena' 'francesco' 'gabriele' 'lorenzo' 'omar'
 'thiago']
['ara' 'eng' 'esp' 'ita' 'por']
  language     speaker                                     audio_raw_data
0      ita  alessandro  [21, -4, -25, -74, -23, -43, -177, -195, -236,...
1      por      thiago  [-561, -550, -536, -539, -545, -573, -573, -58...
2      ita       elena  [43, 41, 39, 39, 39, 38, 39, 31, 34, 38, 34, 3...
3      eng     lorenzo  [-653, -649, -667, -681, -701, -673, -694, -70...
4      esp       elena  [-415, -409, -412, -411, -407, -413, -421, -43...
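
The loading loop above writes each recording into a fixed-width row, which only works if every file holds exactly `sample_length` samples of mono audio. Below is a minimal sketch of a guard, assuming that truncating long recordings and zero-padding short ones is acceptable for this dataset. `fit_length` is a hypothetical helper and is not part of the original notebook.

```python
import numpy as np

def fit_length(audio: np.ndarray, n_samples: int) -> np.ndarray:
    """Return `audio` truncated or zero-padded at the end to exactly `n_samples`."""
    audio = np.asarray(audio)
    if audio.ndim > 1:
        # assumption: for multi-channel files, keep only the first channel
        audio = audio[:, 0]
    out = np.zeros(n_samples, dtype=audio.dtype)
    n = min(len(audio), n_samples)
    out[:n] = audio[:n]
    return out

short = np.arange(5, dtype="int16")
padded = fit_length(short, 8)     # shorter than the target: zeros appended
truncated = fit_length(short, 3)  # longer than the target: tail dropped
print(padded, truncated)
```

In the loop, this would be applied as `audio_recs[i] = fit_length(audio_rec, sample_length)`.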


## Training - Validation - Test splitting criteria

Create training and validation dataset by splitting the data based on the speaker and language:

- all the speakers having 1 or 2 languages associated will have their data split 75-25 between training and validation. Each audio sample is split into a piece of 45 seconds (used for training) and a separate piece of 15 seconds (used for validation).
- speakers having more than 2 languages associated will have about 25% of their languages placed entirely in validation: the audio samples of those languages are taken whole, without splitting. The samples of the remaining languages are split 45 s / 15 s as in the previous case.
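
As a rough illustration of how the rule is selected per speaker, the sketch below counts the distinct languages of each speaker and maps the count to one of the two cases. The toy speakers and the policy labels are hypothetical and serve only as an example.

```python
import pandas as pd

# Toy metadata; speaker names and languages are made up for illustration.
meta = pd.DataFrame({
    "speaker":  ["a", "a", "b", "b", "b", "c"],
    "language": ["ita", "eng", "ita", "eng", "esp", "eng"],
})

# Number of distinct languages recorded for each speaker.
langs_per_speaker = meta.groupby("speaker")["language"].nunique()

def split_policy(n_langs: int) -> str:
    """Hypothetical labels for the two cases listed above."""
    return "split_each_sample_45_15" if n_langs <= 2 else "hold_out_languages"

policy = langs_per_speaker.map(split_policy).to_dict()
print(policy)
```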

The data collected are audio samples of 60 seconds. The split between training and validation is done so that the two sets contain completely separate frames.
This way the frames computed from the audio never overlap between training and validation, which is important in order to have validation data that is unseen during training.
The risk of overlap comes from how frames are computed: as sequences of frame_size samples, with consecutive frames partially overlapping (controlled by hop_size).
In order to have zero overlap between frames in the 2 datasets, we divide each audio sample into 2 separate non-overlapping pieces, and frames are computed within each piece.
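
The non-overlap property can be checked numerically. The sketch below uses the window and hop values set later in this notebook (5.6 s and 1.875 s at 16 kHz) and lists the sample range covered by every frame in the 45 s and 15 s pieces. It is a standalone check under those assumptions, not the notebook's own splitting code.

```python
# Values taken from the notebook: 16 kHz audio, 5.6 s window, 1.875 s hop.
frequency = 16000
window = 5.6
hop = 1.875

win = round(window * frequency)   # samples per frame
step = round(hop * frequency)     # samples between consecutive frame starts
boundary = 45 * frequency         # first sample of the validation piece
total = 60 * frequency

# (start, end) sample indices of every frame, end exclusive
train_frames = [(s, s + win) for s in range(0, boundary - win + 1, step)]
val_frames = [(s, s + win) for s in range(boundary, total - win + 1, step)]

last_train_end = max(end for _, end in train_frames)
first_val_start = min(start for start, _ in val_frames)
last_val_end = max(end for _, end in val_frames)

# No training frame reaches the boundary, no validation frame starts before it.
print(last_train_end, first_val_start, last_val_end)
assert last_train_end <= boundary <= first_val_start
assert last_val_end <= total
```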

This way the validation set can be used to estimate the model's performance on:
- known speakers speaking in languages that have already been heard from them
- known speakers speaking in languages that were never heard from them -> useful to understand whether the model has learned the language rather than the speaker's voice

The test set will be created ad hoc once more data has been collected. A few separate test sets will be created to evaluate the performance of the model in different scenarios:

1. known speakers in heard languages -> evaluate model performance in tested scenarios
2. known speakers in un-heard languages -> evaluate performance for recognizing language instead of the speaker vocal characteristics
3. unknown speakers -> evaluate performance for recognizing language from an unseen speaker

We expect the performance in case 3 to be lower than in case 2. Keeping the test sets separate gives an unbiased estimate of the model's performance on each task.
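
One possible way to assign a future test recording to one of the three scenarios is sketched below. It assumes the training set is summarized as a set of (speaker, language) pairs. The function name `scenario_of` and the example pairs are hypothetical.

```python
def scenario_of(speaker: str, language: str, train_pairs: set) -> int:
    """Return the scenario number (1, 2 or 3) of a test recording.

    train_pairs: set of (speaker, language) tuples present in the training set.
    """
    known_speakers = {s for s, _ in train_pairs}
    if speaker not in known_speakers:
        return 3  # unknown speaker
    if (speaker, language) in train_pairs:
        return 1  # known speaker, language already heard from them
    return 2      # known speaker, language never heard from them

# Illustrative training content, not the real dataset.
train_pairs = {("s1", "ita"), ("s1", "eng"), ("s2", "eng")}

print(scenario_of("s1", "ita", train_pairs))
print(scenario_of("s2", "ita", train_pairs))
print(scenario_of("s9", "eng", train_pairs))
```

Grouping test recordings by this number would give the separate test sets described above.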

In [2]:
window = 5.6 # seconds of audio in input
hop = 1.875 # step in seconds between the starts of consecutive windows (they overlap by window - hop)
frequency = 16000
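
With these values, the number of frames extracted from a segment follows directly from the `range(...)` used in the splitting functions of this notebook. The sketch below reproduces that arithmetic; `n_frames` is a hypothetical helper introduced only for this illustration.

```python
frequency = 16000
window = 5.6
hop = 1.875

def n_frames(segment_seconds: float) -> int:
    """Count frame starts in [0, segment_seconds - window), one every `hop` seconds."""
    usable = round((segment_seconds - window) * frequency)
    step = round(hop * frequency)
    return len(range(0, usable, step))

per_train = n_frames(45)   # frames from the 45 s training piece
per_val = n_frames(15)     # frames from the 15 s validation piece
per_full = n_frames(60)    # frames from an unsplit 60 s recording
print(per_train, per_val, per_full)
```

For a speaker with 7 recordings this gives 7 × 22 = 154 training frames and 7 × 6 = 42 validation frames, which matches the counts logged later in the notebook.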


This function takes as input a set of audio samples and extracts frames from each of them over the full 60 seconds, without any 75%-25% splitting

In [3]:

def split_100(audio_list: pd.DataFrame) -> pd.DataFrame:
    # frames extracted from every recording, with the matching labels
    audio_splits = []
    frame_languages = []
    frame_speakers = []
    
    samples = audio_list["audio_raw_data"].to_numpy()
    languages = audio_list["language"].to_numpy()
    speakers = audio_list["speaker"].to_numpy()
    
    for i, sample in enumerate(samples):
        # slide a window of `window` seconds over the full 60 s, advancing by `hop` seconds
        for time in range(0, int((60-window) * frequency), int(hop * frequency)):
            audio_splits.append( sample[time : int(time + window * frequency) ] )
            # use separate output lists: the input label arrays are numpy arrays and have no append()
            frame_languages.append(languages[i])
            frame_speakers.append(speakers[i])

    data_dictionary = {
        'language': frame_languages,
        'speaker': frame_speakers,
        'audio_raw_data': audio_splits
    }

    return pd.DataFrame(data_dictionary)


Code for splitting each audio sample into 2 sequences of 45 and 15 seconds. Frames are extracted within each sequence separately, so no frame overlaps between the 2 splits

In [4]:

def split_75(audio_list: pd.DataFrame) -> (pd.DataFrame, pd.DataFrame):
    audio_splits_train = []
    audio_splits_val = []
    languages_train = []
    languages_val = []
    speakers_train = []
    speakers_val = []

    samples = audio_list["audio_raw_data"].to_numpy()
    languages = audio_list["language"].to_numpy()
    speakers = audio_list["speaker"].to_numpy()

    for i, sample in enumerate(samples):

        for time in range(0, int((45-window) * frequency), int(hop * frequency)):
            audio_splits_train.append(sample[time : int(time + window * frequency) ] )
            languages_train.append(languages[i])
            speakers_train.append(speakers[i])
        
        for time in range(45 * frequency, int((60-window) * frequency), int(hop * frequency)):
            audio_splits_val.append(sample[time : int(time + window * frequency) ] )
            languages_val.append(languages[i])
            speakers_val.append(speakers[i])

    data_dictionary_train = {
        'language': languages_train,
        'speaker': speakers_train,
        'audio_raw_data': audio_splits_train
    }

    data_dictionary_val = {
        'language': languages_val,
        'speaker': speakers_val,
        'audio_raw_data': audio_splits_val
    }

    return pd.DataFrame(data_dictionary_train), pd.DataFrame(data_dictionary_val)
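
As an alternative to the explicit loops, the same framing can be sketched with NumPy's `sliding_window_view`, assuming each recording is available as a 1-D NumPy array rather than a Python list. The `frames` helper and the synthetic signal are hypothetical. This is an illustration of the technique, not a drop-in replacement for `split_75`.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def frames(signal: np.ndarray, win: int, step: int) -> np.ndarray:
    """Return a (n_frames, win) read-only view of `signal`, one frame every `step` samples."""
    return sliding_window_view(signal, win)[::step]

frequency = 16000
win, step = round(5.6 * frequency), round(1.875 * frequency)

# Synthetic stand-in for one 60 s recording: each value equals its sample index.
signal = np.arange(60 * frequency, dtype="int32")

# Frame each piece on its own so that no frame crosses the 45 s boundary.
train = frames(signal[: 45 * frequency], win, step)
val = frames(signal[45 * frequency :], win, step)
print(train.shape, val.shape)
```

Because the result is a view, no audio is copied; call `.copy()` on it if the frames need to be modified.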



In [5]:

split_ratio = 0.75 # 75% of training, 25% of validation

classes_list = ["ita", "eng"] # substitute with np.unique(languages) to obtain whole set of languages
speakers_list = np.unique(speakers)
print(speakers_list)


['alessandro' 'donia' 'elena' 'francesco' 'gabriele' 'lorenzo' 'omar'
 'thiago']


In [6]:
# select the samples in dataset whose language is in classes_list
valid_samples = dataset[dataset["language"].isin(classes_list)]

print("samples considered: ", len(valid_samples))
print("entire dataset: ", dataset.shape)

samples considered:  35
entire dataset:  (53, 3)


In [7]:
dataset_train = []
dataset_validation = []

for speaker in speakers_list:
	# take slice of the dataset containing the samples associated to one speaker
	data_speaker = valid_samples[ valid_samples["speaker"] == speaker]
	print("speaker: ", speaker, " speaks: ", np.unique(data_speaker["language"]) )
	print("samples quantity: ", data_speaker.shape)

	# compute number of languages spoken by the speaker
	langs_spoken = np.unique(data_speaker["language"])
	print("langs_spoken: ", langs_spoken)
	langs_spoken_count = langs_spoken.shape[0]
	
	samples_number = len(data_speaker)

	if (langs_spoken_count == 1 or langs_spoken_count == 2):
		# split an audio sequence of 60 seconds into one sequence of 45s and one sequence of 15s
		# such that none of the frames in the sequences are shared between the two splits
		# the 45s sequence goes in training, the 15s sequence goes in validation

		data_speaker_train, data_speaker_val = split_75(data_speaker)
		
		count_train = data_speaker_train.shape[0]
		count_valid = data_speaker_val.shape[0]

		print("added ", count_train, " samples to training")
		print("added ", count_valid, " samples to validation")

		dataset_train.extend(np.array(data_speaker_train))
		dataset_validation.extend(np.array(data_speaker_val))

	else: # more than 2 languages spoken by the speaker
		# choose 25% of the languages at random spoken by the speaker and place them entirely in validation
		# the remaining 75% of the languages are split 75-25 between training and validation

		if (langs_spoken_count == 3 or langs_spoken_count == 4):
			# one language in validation, two in training / validation
			
			# choose uniformly at random the language that goes in validation
			lang_choice = np.random.choice(langs_spoken)
			print("chosen language: ", lang_choice, " to be placed in validation entirely")

			for lang in langs_spoken:
				if (lang == lang_choice):
					data_speaker_val = split_100(data_speaker[ data_speaker["language"] == lang ])
					data_speaker_val = np.array(data_speaker_val)
					print("adding ", data_speaker_val.shape[0], " samples of language ", lang, " to validation")
					dataset_validation.extend(data_speaker_val)
				else:
					data_speaker_train, data_speaker_val = split_75(data_speaker[ data_speaker["language"] == lang ])
					
					count_train = len(data_speaker_train)
					count_valid = len(data_speaker_val)
					print("adding ", count_train, " samples of language ", lang, " to training")
					print("adding ", count_valid, " samples of language ", lang, " to validation")

					dataset_train.extend(np.array(data_speaker_train))
					dataset_validation.extend(np.array(data_speaker_val))

		else: 
			# if the speaker speaks 5 languages, 2 of them go in validation and 3 in training / validation
			pass
		
	print("")

len_train = len(dataset_train)
len_valid = len(dataset_validation)
total = len_train + len_valid
print("dataset_train: ", len_train)
print("dataset_validation: ", len_valid)
print("training ratio: ", len_train / total * 100.0, "%")
print("validation ratio: ", len_valid / total * 100.0, "%")
print("total number of samples: ", total)

speaker:  alessandro  speaks:  ['eng' 'ita']
samples quantity:  (7, 3)
langs_spoken:  ['eng' 'ita']


added  154  samples to training
added  42  samples to validation

speaker:  donia  speaks:  []
samples quantity:  (0, 3)
langs_spoken:  []

speaker:  elena  speaks:  ['eng' 'ita']
samples quantity:  (10, 3)
langs_spoken:  ['eng' 'ita']
added  220  samples to training
added  60  samples to validation

speaker:  francesco  speaks:  ['eng' 'ita']
samples quantity:  (4, 3)
langs_spoken:  ['eng' 'ita']
added  88  samples to training
added  24  samples to validation

speaker:  gabriele  speaks:  ['eng' 'ita']
samples quantity:  (4, 3)
langs_spoken:  ['eng' 'ita']
added  88  samples to training
added  24  samples to validation

speaker:  lorenzo  speaks:  ['eng' 'ita']
samples quantity:  (6, 3)
langs_spoken:  ['eng' 'ita']
added  132  samples to training
added  36  samples to validation

speaker:  omar  speaks:  ['eng']
samples quantity:  (2, 3)
langs_spoken:  ['eng']
added  44  samples to training
added  12  samples to validation

speaker:  thiago  speaks:  ['eng']
samples quantity:  (2, 3)


Save the training and validation datasets to disk as CSV files

In [8]:
# convert dataset_train to dataframe and save it into a csv file
# convert dataset_validation to dataframe and save it into a csv file
folder = os.path.dirname(os.getcwd()) + "/datasets/"

train_df = pd.DataFrame(dataset_train)
train_df.to_csv(folder + 'dataset_train.csv', index=False)

valid_df = pd.DataFrame(dataset_validation)
valid_df.to_csv(folder + 'dataset_validation.csv', index=False)
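
When these files are read back, pandas returns the audio column as text, because each list of samples is written to CSV as its string representation. The sketch below shows one way to restore the lists with `ast.literal_eval`. It assumes each cell holds a plain Python list, and it uses a small in-memory DataFrame with hypothetical column names and values instead of the real files.

```python
import ast
import io

import pandas as pd

# Tiny stand-in for the saved dataset; values are made up.
df = pd.DataFrame({
    "language": ["ita", "eng"],
    "speaker": ["a", "b"],
    "audio_raw_data": [[1, -2, 3], [4, 5, -6]],
})

buffer = io.StringIO()
df.to_csv(buffer, index=False)

# Plain read: the audio column comes back as strings such as "[1, -2, 3]".
buffer.seek(0)
naive = pd.read_csv(buffer)
print(type(naive.loc[0, "audio_raw_data"]))

# Read with a converter: each cell is parsed back into a list of ints.
buffer.seek(0)
restored = pd.read_csv(buffer, converters={"audio_raw_data": ast.literal_eval})
print(restored.loc[0, "audio_raw_data"])
```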