# Data preparation

From the `Portale della Didattica`, under `labs/lab3-training-deployment`, download `msc-train.zip`, `msc-val.zip`, and `msc-test.zip`. Then go to the `FILES` tab of your `Deepnote` project, click the `+` icon, select `Upload file`, and upload the three zip archives. Extract them with the following commands (uncomment them before the first run):

In [None]:
# !unzip -q msc-train.zip -d msc-train
# !unzip -q msc-val.zip -d msc-val
# !unzip -q msc-test.zip -d msc-test

# Create a TF Dataset

The `tf.data` API enables you to build complex input pipelines. Its main class is `tf.data.Dataset`, an abstraction representing a sequence of elements. For example, each element might be a pair of tensors representing an audio clip and its label.

The `tf.data.Dataset` class enables you to create TF datasets from different sources, such as Python variables, text data, or sets of files. For more information, refer to the [API Documentation](https://www.tensorflow.org/api_docs/python/tf/data/Dataset) and the [Official Tutorial](https://www.tensorflow.org/guide/data).
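
As a minimal sketch of the first case (a dataset built from Python variables), the snippet below uses `tf.data.Dataset.from_tensor_slices`. The toy features and labels are made up for illustration and are not part of the lab data.

```python
import tensorflow as tf

# Toy data (hypothetical): three 2-dimensional feature vectors and their labels.
toy_features = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
toy_labels = ['yes', 'no', 'yes']

# Slice along the first axis: each element is a (feature, label) pair of tensors.
toy_ds = tf.data.Dataset.from_tensor_slices((toy_features, toy_labels))

# Convert the tensors back to plain Python values for printing.
toy_elements = [(x.numpy().tolist(), y.numpy().decode()) for x, y in toy_ds]
for feature, label in toy_elements:
    print(f'Feature: {feature} - Label: {label}')
```

Each element of `toy_ds` has the same structure as the tuple passed in, which is the same pattern used later for audio and label pairs.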

For example, we can create a TF dataset from the files stored in our working directory with the following code:



In [None]:
import tensorflow as tf

train_files_ds = tf.data.Dataset.list_files('msc-train/*')

In [None]:
num_files = len(train_files_ds)

print(f'The dataset contains {num_files} files.')

for x in train_files_ds.take(5):
    print(x)

The dataset contains 6400 files.
tf.Tensor(b'msc-train/up_fbe51750_nohash_0.wav', shape=(), dtype=string)
tf.Tensor(b'msc-train/right_ab81c9c8_nohash_0.wav', shape=(), dtype=string)
tf.Tensor(b'msc-train/no_abbfc3b4_nohash_1.wav', shape=(), dtype=string)
tf.Tensor(b'msc-train/yes_5ab63b0a_nohash_0.wav', shape=(), dtype=string)
tf.Tensor(b'msc-train/yes_92037d73_nohash_1.wav', shape=(), dtype=string)
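
Note that the file order printed above is not alphabetical: `list_files` shuffles the matched paths by default. The sketch below assumes a throwaway temporary directory with three empty placeholder files (hypothetical names) and shows that passing `shuffle=False` yields the same order on every pass.

```python
import os
import tempfile

import tensorflow as tf

# Hypothetical placeholder files, created only for this demonstration.
tmp_dir = tempfile.mkdtemp()
for name in ['yes_0.wav', 'no_0.wav', 'up_0.wav']:
    open(os.path.join(tmp_dir, name), 'w').close()

# shuffle=False disables the default shuffling of the matched paths.
ordered_ds = tf.data.Dataset.list_files(os.path.join(tmp_dir, '*'), shuffle=False)

# Iterate twice to confirm that the order does not change between passes.
first_pass = [os.path.basename(p.numpy().decode()) for p in ordered_ds]
second_pass = [os.path.basename(p.numpy().decode()) for p in ordered_ds]
print(first_pass)
print(first_pass == second_pass)
```

For training, the default shuffling is usually desirable; a fixed order is mainly useful for debugging or for reproducible evaluation.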


# Apply pre-processing

In [None]:
from preprocessing import AudioReader

audio_reader = AudioReader(tf.int16, 16000)

train_audio_and_label_ds = train_files_ds.map(audio_reader.get_audio_and_label)

for x, label in train_audio_and_label_ds.take(5):
    print(f'Data shape: {x.shape} - Label: {label.numpy().decode()}')

2023-11-07 12:17:50.923132: I tensorflow_io/core/kernels/cpu_check.cc:128] Your CPU supports instructions that this TensorFlow IO binary was not compiled to use: AVX AVX2 FMA
2023-11-07 12:17:50.926011: W tensorflow_io/core/kernels/audio_video_mp3_kernels.cc:271] libmp3lame.so.0 or lame functions are not available
Data shape: (16000,) - Label: down
Data shape: (16000,) - Label: stop
Data shape: (16000,) - Label: yes
Data shape: (16000,) - Label: stop
Data shape: (16000,) - Label: left
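
The file names listed earlier suggest that each label is the prefix before the first underscore (for example, `up_fbe51750_nohash_0.wav`). The `AudioReader` implementation is not shown here, so the following plain-Python sketch is only an assumption about how a label could be derived from a path; `label_from_path` is a hypothetical helper.

```python
import os


def label_from_path(path):
    """Return the keyword prefix of a file name such as 'up_fbe51750_nohash_0.wav'."""
    file_name = os.path.basename(path)
    # Everything before the first underscore is assumed to be the label.
    return file_name.split('_')[0]


print(label_from_path('msc-train/up_fbe51750_nohash_0.wav'))
print(label_from_path('msc-train/right_ab81c9c8_nohash_0.wav'))
```

Inside a `tf.data` pipeline the same idea would have to be written with TensorFlow string operations, because `map` functions operate on tensors rather than Python strings.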


## Log-Mel Spectrogram

In [None]:
from preprocessing import MelSpectrogram


MEL_SPECTROGRAM_ARGS = {
    'sampling_rate': 16000,
    'frame_length_in_s': 0.04,
    'frame_step_in_s': 0.02,
    'num_mel_bins': 128,
    'lower_frequency': 0,
    'upper_frequency': 8000,
}

mel_spec_processor = MelSpectrogram(**MEL_SPECTROGRAM_ARGS)

train_mel_spec_ds = train_audio_and_label_ds.map(mel_spec_processor.get_mel_spec_and_label)

In [None]:
for x, y in train_mel_spec_ds.take(5):
    print(f'Data shape: {x.shape} -  Label: {y.numpy()}')

Data shape: (49, 128) -  Label: b'up'
Data shape: (49, 128) -  Label: b'up'
Data shape: (49, 128) -  Label: b'up'
Data shape: (49, 128) -  Label: b'up'
Data shape: (49, 128) -  Label: b'up'
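
The `(49, 128)` shape can be checked by hand. Assuming the spectrogram is computed with a sliding window and no padding at the end of the signal (an assumption, since the `MelSpectrogram` source is not shown), the number of frames is `1 + (num_samples - frame_length) // frame_step`:

```python
sampling_rate = 16000        # Hz, from MEL_SPECTROGRAM_ARGS
num_samples = 16000          # one second of audio, as in the shapes printed earlier
frame_length = round(0.04 * sampling_rate)   # 640 samples per frame
frame_step = round(0.02 * sampling_rate)     # 320 samples between frame starts
num_mel_bins = 128

# Sliding window without end padding (assumed).
num_frames = 1 + (num_samples - frame_length) // frame_step

print((num_frames, num_mel_bins))  # (49, 128)
```

The first axis of the output is therefore time (frames) and the second is frequency (Mel bins).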


## MFCC

In [None]:
from preprocessing import MFCC

MFCC_ARGS = {
    **MEL_SPECTROGRAM_ARGS,
    'num_coefficients': 10,
}

mfcc_processor = MFCC(**MFCC_ARGS)
train_mfcc_ds = train_audio_and_label_ds.map(mfcc_processor.get_mfccs_and_label)

In [None]:
for x, y in train_mfcc_ds.take(5):
    print(f'Data shape: {x.shape} -  Label: {y.numpy()}')

Data shape: (49, 10) -  Label: b'up'
Data shape: (49, 10) -  Label: b'no'
Data shape: (49, 10) -  Label: b'up'
Data shape: (49, 10) -  Label: b'left'
Data shape: (49, 10) -  Label: b'up'
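
`MFCC_ARGS` reuses the spectrogram settings through dictionary unpacking and adds one key. The plain-Python sketch below repeats the two dictionaries from the cells above and compares the number of values per example for the two feature types, using the shapes printed in the outputs.

```python
MEL_SPECTROGRAM_ARGS = {
    'sampling_rate': 16000,
    'frame_length_in_s': 0.04,
    'frame_step_in_s': 0.02,
    'num_mel_bins': 128,
    'lower_frequency': 0,
    'upper_frequency': 8000,
}

# Unpacking copies the entries; the original dictionary is left unchanged.
MFCC_ARGS = {**MEL_SPECTROGRAM_ARGS, 'num_coefficients': 10}
print(len(MEL_SPECTROGRAM_ARGS), len(MFCC_ARGS))  # 6 7

# Values per example, from the printed shapes (49, 128) and (49, 10).
mel_values = 49 * 128
mfcc_values = 49 * 10
print(mel_values, mfcc_values)  # 6272 490
```

With these settings, keeping 10 coefficients makes each example 12.8 times smaller than the full log-Mel spectrogram.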


# Batching

In [None]:
train_mfcc_batched_ds = train_mfcc_ds.batch(8)

In [None]:
for x, y in train_mfcc_batched_ds.take(5):
    print(f'Data shape: {x.shape} -  Labels: {y.numpy()}')

Data shape: (8, 49, 10) -  Labels: [b'left' b'left' b'go' b'stop' b'right' b'no' b'right' b'no']
Data shape: (8, 49, 10) -  Labels: [b'up' b'stop' b'stop' b'up' b'yes' b'no' b'stop' b'right']
Data shape: (8, 49, 10) -  Labels: [b'left' b'no' b'go' b'no' b'no' b'go' b'left' b'left']
Data shape: (8, 49, 10) -  Labels: [b'stop' b'no' b'up' b'go' b'stop' b'go' b'stop' b'up']
Data shape: (8, 49, 10) -  Labels: [b'go' b'no' b'no' b'down' b'go' b'go' b'yes' b'yes']
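
Each batch stacks 8 examples along a new leading axis. As a quick check in plain Python, the number of batches for the 6400 training files reported earlier follows from a ceiling division; the 6405-file case is a hypothetical example of a final partial batch, and `batch_sizes` is a helper written only for this illustration.

```python
import math


def batch_sizes(num_examples, batch_size):
    """Sizes of consecutive batches when the last partial batch is kept."""
    num_batches = math.ceil(num_examples / batch_size)
    return [min(batch_size, num_examples - i * batch_size) for i in range(num_batches)]


sizes = batch_sizes(6400, 8)
print(len(sizes), sizes[-1])  # 800 8

# Hypothetical dataset size that is not a multiple of the batch size.
sizes = batch_sizes(6405, 8)
print(len(sizes), sizes[-1])  # 801 5
```

`Dataset.batch` keeps such a smaller final batch unless `drop_remainder=True` is passed.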


# Prepare for Training

For training a classifier with TF Keras, we need to:
* Add the channel axis to the data
    * the number of channels must be explicit, even when there is only one
* Convert the label string to an integer class ID (its index in `LABELS`)
    * this is needed by the [SparseCategoricalCrossentropy](https://www.tensorflow.org/api_docs/python/tf/keras/losses/SparseCategoricalCrossentropy) loss

In [None]:
from preprocessing import LABELS

def prepare_for_training(feature, label):
    feature = tf.expand_dims(feature, -1)
    label_id = tf.argmax(label == LABELS)

    return feature, label_id

train_ds = train_mfcc_ds.map(prepare_for_training).batch(8)

In [None]:
for x, y in train_ds.take(5):
    print(f'Data shape: {x.shape} -  Label IDs: {y.numpy()}')

Data shape: (8, 49, 10, 1) -  Label IDs: [7 2 5 3 4 0 3 4]
Data shape: (8, 49, 10, 1) -  Label IDs: [2 3 4 1 3 7 4 2]
Data shape: (8, 49, 10, 1) -  Label IDs: [2 7 4 0 5 3 5 3]
Data shape: (8, 49, 10, 1) -  Label IDs: [1 6 7 4 2 6 2 7]
Data shape: (8, 49, 10, 1) -  Label IDs: [2 3 7 6 2 0 7 7]
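
To see why `tf.argmax(label == LABELS)` returns a class ID, here is a plain-Python analogue. Comparing the label with every entry of `LABELS` gives a boolean mask with a single `True`, and the position of that `True` is the ID. The `LABELS` list below is hypothetical: it contains the eight keywords that appear in the outputs above, but the order in `preprocessing.LABELS` may differ.

```python
# Hypothetical label list; the real one lives in preprocessing.LABELS.
LABELS = ['down', 'go', 'left', 'no', 'right', 'stop', 'up', 'yes']


def label_to_id(label):
    # Boolean mask with True only where the label matches.
    mask = [label == candidate for candidate in LABELS]
    # Index of the first True, which is what argmax returns on such a mask.
    return mask.index(True)


print(label_to_id('left'))           # 2
print(LABELS[label_to_id('left')])   # left

# Adding a channel axis turns a (49, 10) feature into (49, 10, 1).
feature_shape = (49, 10)
print(feature_shape + (1,))          # (49, 10, 1)
```

One difference is worth keeping in mind: this sketch raises a `ValueError` for an unknown label, whereas `tf.argmax` on an all-`False` mask returns 0, so an unexpected label would silently map to the first class.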

