<a href="https://colab.research.google.com/github/PriyankaTUI/AudioClassificationWithDeepLearningAnalysis/blob/master/dataset/data_processing_comparison.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [3]:
!git clone https://github.com/PriyankaTUI/AudioClassificationWithDeepLearningAnalysis.git
%cd AudioClassificationWithDeepLearningAnalysis/dataset
!pwd

Cloning into 'AudioClassificationWithDeepLearningAnalysis'...
remote: Enumerating objects: 215, done.
remote: Counting objects: 100% (70/70), done.
remote: Compressing objects: 100% (47/47), done.
remote: Total 215 (delta 52), reused 29 (delta 23), pack-reused 145
Receiving objects: 100% (215/215), 44.37 MiB | 29.64 MiB/s, done.
Resolving deltas: 100% (108/108), done.
/content/AudioClassificationWithDeepLearningAnalysis/dataset
/content/AudioClassificationWithDeepLearningAnalysis/dataset


# **For data preprocessing, we use two different strategies. Below, we evaluate both in terms of processing time and memory usage.**


Using the `memory_profiler` module, we measure memory usage line by line during data processing to determine how much memory each strategy requires. Additionally, we use the `%%time` cell magic to track the overall processing time.

Please check the outputs of cells 7 and 10. They provide detailed memory profiles for both methods. Based on those results, we can decide which technique is more efficient.
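To make the measurement idea concrete outside a notebook, here is a minimal sketch that times a function and records its peak memory. It is an assumption-laden stand-in, not the method used in this notebook: it swaps in the standard-library `tracemalloc` and `time.perf_counter` for `%mprun` and `%%time`, and `build_list` and `measure` are hypothetical names. Note that `tracemalloc` tracks only Python-level allocations, whereas `memory_profiler` reports the memory of the whole process, so the numbers are not directly comparable.

```python
import time
import tracemalloc


def build_list(n):
    # Hypothetical stand-in workload: allocate a list of n floats.
    return [float(i) for i in range(n)]


def measure(func, *args):
    """Return (result, elapsed_seconds, peak_bytes) for one call of func."""
    tracemalloc.start()
    start = time.perf_counter()
    result = func(*args)
    elapsed = time.perf_counter() - start
    # get_traced_memory() returns (current, peak) in bytes.
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return result, elapsed, peak


result, elapsed, peak = measure(build_list, 100_000)
print(f"elapsed: {elapsed:.3f} s, peak traced memory: {peak / 2**20:.2f} MiB")
```

The same wrapper could be applied to each strategy's entry point to obtain one time and one peak figure per strategy.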



In [4]:
!pip install memory_profiler

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [5]:
%load_ext memory_profiler


# First data processing approach


**With this strategy, all data will be processed beforehand and stored locally for use during training.**
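The pattern behind this strategy can be sketched in a few lines: transform everything once, write the result to disk, and reuse the file on later runs. The sketch below is only an illustration under stated assumptions: it uses `pickle` and a toy transform instead of `torch.save` and MFCC, and `toy_transform` and `load_or_build` are hypothetical names.

```python
import os
import pickle
import tempfile


def toy_transform(sample):
    # Hypothetical stand-in for an expensive feature transform such as MFCC.
    return [x * x for x in sample]


def load_or_build(cache_path, raw_samples):
    """Return (features, loaded_from_cache), reusing the cache file if present."""
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as f:
            return pickle.load(f), True
    # Cache miss: transform every sample up front and store the result.
    features = [toy_transform(s) for s in raw_samples]
    with open(cache_path, "wb") as f:
        pickle.dump(features, f)
    return features, False


raw = [[1, 2, 3], [4, 5, 6]]
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "features.pkl")
    first, first_from_cache = load_or_build(path, raw)
    second, second_from_cache = load_or_build(path, raw)

print(first_from_cache, second_from_cache)
```

The first call pays the full transformation cost and holds every feature in memory at once; the second call only reads the file.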


In [6]:
%%file mprun_data_processing.py


from torchaudio.datasets import SPEECHCOMMANDS
import os
import torchaudio
import torch
from torch.utils.data import Dataset, DataLoader, random_split
import torch.nn as nn
import numpy as np


def label_to_index(labels, label):
    # Return the position of the word in labels
    return torch.tensor(labels.index(label))

def index_to_label(labels, index):
    # Return the word corresponding to the index in labels
    # This is the inverse of label_to_index
    return labels[index]

    
def load_and_preprocess_speech_command_dataset(random_targets, digits):
    # random_targets (novel classes) and digits (old classes) are supplied by
    # the caller; both are lists of Speech Commands label strings.
    tensors = []
    targets = []

    #old classes storage
    old_class_tensors = []
    old_class_targets = []
    
    labels = digits + random_targets

    ### Cache the processed tensors in local files so that repeated runs
    ### during development can skip the download and the MFCC computation.
    ### Delete the local files to force the data to be rebuilt.
    if (os.path.exists(path='data/novel_class_tensors.pt') and 
        os.path.exists(path='data/novel_class_targets.pt') and 
        os.path.exists(path='data/old_class_tensors.pt') and 
        os.path.exists(path='data/old_class_targets.pt')):
        
        tensors = torch.load('data/novel_class_tensors.pt')
        targets = torch.load('data/novel_class_targets.pt')
        old_class_tensors = torch.load('data/old_class_tensors.pt')
        old_class_targets = torch.load('data/old_class_targets.pt')

    else:


        #  Loading dataset and custom dataloader
        dataset = torchaudio.datasets.SPEECHCOMMANDS('./data/' , url = 'speech_commands_v0.02', folder_in_archive= 'SpeechCommands',  download = True)
        #parameters for MFCC transformation
        n_fft = 2048
        win_length = None
        hop_length = 512
        n_mels = 256
        n_mfcc = 256

        for waveform, sample_rate, label, *_ in dataset:
            if label in random_targets:
                if sample_rate == 16000:
                    if waveform.shape == (1, 16000):
                        tensors += [torchaudio.transforms.MFCC(sample_rate=sample_rate, n_mfcc=32, 
                                                                melkwargs={
                                                                            'n_fft': n_fft,
                                                                            'n_mels': n_mels,
                                                                            'hop_length': hop_length,
                                                                            'mel_scale': 'htk',
                                                                            }
                                                                            )(waveform)]
                        targets += [label_to_index(labels, label)]

            if label in digits:
                if sample_rate == 16000:
                    if waveform.shape == (1, 16000):
                        old_class_tensors += [torchaudio.transforms.MFCC(sample_rate=sample_rate, n_mfcc=32, 
                                                                melkwargs={
                                                                            'n_fft': n_fft,
                                                                            'n_mels': n_mels,
                                                                            'hop_length': hop_length,
                                                                            'mel_scale': 'htk',
                                                                            }
                                                                            )(waveform)]
                        old_class_targets += [label_to_index(labels, label)]

        torch.save(tensors, 'data/novel_class_tensors.pt')
        torch.save(targets, 'data/novel_class_targets.pt')
        torch.save(old_class_tensors, 'data/old_class_tensors.pt')
        torch.save(old_class_targets, 'data/old_class_targets.pt')
        
    return {"tensors": tensors, "targets": targets,
            "old_class_tensors": old_class_tensors, 
            "old_class_targets": old_class_targets}
    


class SpeechCommandSubDataset(Dataset):
    
    def __init__(self,data,labels):
        self.data = data
        self.labels = labels
            
    def __len__(self):
        return len(self.data)    
    
    def __getitem__(self,idx):
        # print(f"getting data {idx}")
        return self.data[idx], self.labels[idx]


######starting processing##########
def data_preprocessing():
  random_targets = ['right', 'down', 'yes', 'sheila', 'marvin', 'backward', 'follow', 'bed', 'bird', 'cat', 'dog', 'happy', 'left', 'stop']
  digits = ['zero','one', 'two', 'three', 'four', 'five', 'six', 'seven', 'eight', 'nine'] 
  data = load_and_preprocess_speech_command_dataset(random_targets=random_targets, digits=digits)
  tensors = data["tensors"]
  targets = data["targets"]
  old_class_tensors = data["old_class_tensors"]
  old_class_targets = data["old_class_targets"]
  
  #taking small amount of sample data for training and testing
  random_index = np.random.randint(len(tensors), size=500)
  # print(f"Random index: {random_index}")
  valid_dataset = SpeechCommandSubDataset(data = [tensors[index] for index in random_index], 
                                          labels = [targets[index] for index in random_index])

  traindata, testdata = random_split(valid_dataset, [round(len(valid_dataset)*.6), round(len(valid_dataset)*.4)])
  trainloader = DataLoader(traindata, batch_size=10, shuffle=True)
  testloader = DataLoader(testdata, batch_size=10, shuffle=True)

  # #creating old data loder for measures
  random_index_olddata = np.random.randint(len(old_class_tensors), size=200)
  # print(f"Random index: {random_index}")
  old_class_dataset = SpeechCommandSubDataset(data = [old_class_tensors[index] for index in random_index_olddata], 
                                          labels = [old_class_targets[index] for index in random_index_olddata])

  old_class_testloader = DataLoader(old_class_dataset, batch_size=10, shuffle=True)

  #to check time required for iterating training data
  for i, (input, lables) in enumerate(traindata):
    print("", end= "")
  
  #to check time required for iterating testing data
  for i, (input, lables) in enumerate(testloader):
    print("", end= "")


Writing mprun_data_processing.py


In [7]:
%%time
from mprun_data_processing import data_preprocessing
%mprun -f data_preprocessing data_preprocessing()


sys.settrace() should not be used when the debugger is being used.
This may cause the debugger to stop working correctly.
If this is needed, please check: 
http://pydev.blogspot.com/2007/06/why-cant-pydev-debugger-work-with.html
to see how to restore the debug tracing back correctly.
Call Location:
  File "/usr/local/lib/python3.7/dist-packages/memory_profiler.py", line 845, in enable
    sys.settrace(self.trace_memory_usage)



  0%|          | 0.00/2.26G [00:00<?, ?B/s]


sys.settrace() should not be used when the debugger is being used.
This may cause the debugger to stop working correctly.
If this is needed, please check: 
http://pydev.blogspot.com/2007/06/why-cant-pydev-debugger-work-with.html
to see how to restore the debug tracing back correctly.
Call Location:
  File "/usr/local/lib/python3.7/dist-packages/memory_profiler.py", line 848, in disable
    sys.settrace(self._original_trace_function)




CPU times: user 14min 12s, sys: 1min 4s, total: 15min 16s
Wall time: 15min 23s




```
Filename: /content/AudioClassificationWithDeepLearningAnalysis/dataset/mprun_data_processing.py

Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
   112    268.4 MiB    268.4 MiB           1   def data_preprocessing():
   113    268.4 MiB      0.0 MiB           1     random_targets = ['right', 'down', 'yes', 'sheila', 'marvin', 'backward', 'follow', 'bed', 'bird', 'cat', 'dog', 'happy', 'left', 'stop']
   114    268.4 MiB      0.0 MiB           1     digits = ['zero','one', 'two', 'three', 'four', 'five', 'six', 'seven', 'eight', 'nine'] 
   115    748.2 MiB    479.9 MiB           1     data = load_and_preprocess_speech_command_dataset(random_targets=random_targets, digits=digits)
   116    748.2 MiB      0.0 MiB           1     tensors = data["tensors"]
   117    748.2 MiB      0.0 MiB           1     targets = data["targets"]
   118    748.2 MiB      0.0 MiB           1     old_class_tensors = data["old_class_tensors"]
   119    748.2 MiB      0.0 MiB           1     old_class_targets = data["old_class_targets"]
   120                                           
   121                                           #taking small amount of sample data for training and testing
   122    748.2 MiB      0.0 MiB           1     random_index = np.random.randint(len(tensors), size=500)
   123                                           # print(f"Random index: {random_index}")
   124    748.2 MiB      0.0 MiB         503     valid_dataset = SpeechCommandSubDataset(data = [tensors[index] for index in random_index], 
   125    748.2 MiB      0.0 MiB         503                                             labels = [targets[index] for index in random_index])
   126                                         
   127    748.2 MiB      0.0 MiB           1     traindata, testdata = random_split(valid_dataset, [round(len(valid_dataset)*.6), round(len(valid_dataset)*.4)])
   128    748.2 MiB      0.0 MiB           1     trainloader = DataLoader(traindata, batch_size=10, shuffle=True)
   129    748.2 MiB      0.0 MiB           1     testloader = DataLoader(testdata, batch_size=10, shuffle=True)
   130                                         
   131                                           # #creating old data loder for measures
   132    748.2 MiB      0.0 MiB           1     random_index_olddata = np.random.randint(len(old_class_tensors), size=200)
   133                                           # print(f"Random index: {random_index}")
   134    748.2 MiB      0.0 MiB         203     old_class_dataset = SpeechCommandSubDataset(data = [old_class_tensors[index] for index in random_index_olddata], 
   135    748.2 MiB      0.0 MiB         203                                             labels = [old_class_targets[index] for index in random_index_olddata])
   136                                         
   137    748.2 MiB      0.0 MiB           1     old_class_testloader = DataLoader(old_class_dataset, batch_size=10, shuffle=True)
   138                                         
   139                                           #to check time required for iterating training data
   140    748.2 MiB      0.0 MiB         301     for i, (input, lables) in enumerate(traindata):
   141    748.2 MiB      0.0 MiB         300       print("", end= "")
   142                                           
   143                                           #to check time required for iterating testing data
   144    748.6 MiB      0.4 MiB          21     for i, (input, lables) in enumerate(testloader):
   145    748.6 MiB      0.0 MiB          20       print("", end= "")
```



In [8]:
!rm -rf data/SpeechCommands

# Second data processing approach

**With this strategy, only a list of file paths is built up front; each audio file is loaded and processed on demand during training.**
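The on-demand idea can be sketched as a dataset that stores only references to its samples and applies the transform when an item is indexed. This is a simplified illustration, not the class used below: `LazyDataset`, its `loader` and `transform` arguments, and the file names are all hypothetical, and "loading" a file is faked by taking the length of its name.

```python
class LazyDataset:
    """Holds only sample references; transforms an item when it is indexed."""

    def __init__(self, references, loader, transform):
        self.references = references
        self.loader = loader
        self.transform = transform
        self.transform_calls = 0  # counts how many items were actually processed

    def __len__(self):
        return len(self.references)

    def __getitem__(self, idx):
        self.transform_calls += 1
        return self.transform(self.loader(self.references[idx]))


# Hypothetical stand-ins: the loader returns the length of the file name.
refs = ["a.wav", "bb.wav", "ccc.wav"]
dataset = LazyDataset(refs, loader=len, transform=lambda n: n * 2)
item = dataset[1]
print(item, dataset.transform_calls)
```

Only the requested item is processed, so memory grows with the batch size rather than with the size of the whole dataset.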

In [9]:
%%file mprun_meta_data_processing.py

from collections import defaultdict
import random
from torchaudio.datasets import SPEECHCOMMANDS
import os
import torchaudio
import torch
from torch.utils.data import Dataset, DataLoader, random_split
import torch.nn as nn
import numpy as np


def label_to_index(labels, label):
    # Return the position of the word in labels
    return torch.tensor(labels.index(label))

def index_to_label(labels, index):
    # Return the word corresponding to the index in labels
    # This is the inverse of label_to_index
    return labels[index]

class SubsetSC(SPEECHCOMMANDS):
    def __init__(self, subset: str = None, subset_type : str = None):
        digits = ['zero','one', 'two', 'three', 'four', 'five', 'six', 'seven', 'eight', 'nine'] 
        super().__init__("./", download=True)
        n_fft = 2048
        win_length = None
        hop_length = 512
        n_mels = 256
        n_mfcc = 256
        sampling_rate = 16000
        self.transform = torchaudio.transforms.MFCC(sample_rate=sampling_rate, n_mfcc=32, 
                                                                        melkwargs={
                                                                                    'n_fft': n_fft,
                                                                                    'n_mels': n_mels,
                                                                                    'hop_length': hop_length,
                                                                                    'mel_scale': 'htk',
                                                                                    }
                                                                                    )

        def load_list(filename):
            filepath = os.path.join(self._path, filename)
            with open(filepath) as fileobj:
                if subset_type == 'old':
                    return [os.path.normpath(os.path.join(self._path, line.strip())) for line in fileobj if line.startswith(tuple(digits))]
                elif subset_type == 'novel':
                    return [os.path.normpath(os.path.join(self._path, line.strip())) for line in fileobj if not line.startswith(tuple(digits))]
                else:
                    return [os.path.normpath(os.path.join(self._path, line.strip())) for line in fileobj]

        if subset == "training":
            self._walker = load_list("validation_list.txt")  # note: the "training" subset is read from validation_list.txt
        elif subset == "testing":
            self._walker = load_list("testing_list.txt")

            
    def __getitem__(self,idx):
        waveform, sampling_rate, label, *_ = super().__getitem__(idx)
        #returning waveform and it's label
        if self.transform is not None:
            waveform = self.transform(waveform)
        return waveform, label

class FewShotBatchSampler(object):

    def __init__(self, dataset_targets, N_way, K_shot, include_query=False, shuffle=True, shuffle_once=False):
        """
        Inputs:
            dataset_targets - PyTorch tensor of the labels of the data elements.
            N_way - Number of classes to sample per batch.
            K_shot - Number of examples to sample per class in the batch.
            include_query - If True, returns batch of size N_way*K_shot*2, which
                            can be split into support and query set. Simplifies
                            the implementation of sampling the same classes but
                            distinct examples for support and query set.
            shuffle - If True, examples and classes are newly shuffled in each
                      iteration (for training)
            shuffle_once - If True, examples and classes are shuffled once in
                           the beginning, but kept constant across iterations
                           (for validation)
        """
        super().__init__()
        self.dataset_targets = dataset_targets
        self.N_way = N_way
        self.K_shot = K_shot
        self.shuffle = shuffle
        self.include_query = include_query
        if self.include_query:
            self.K_shot *= 2
        self.batch_size = self.N_way * self.K_shot  # Number of overall examples per batch

        # Organize examples by class
        self.classes = torch.unique(self.dataset_targets).tolist()
        self.num_classes = len(self.classes)
        self.indices_per_class = {}
        self.batches_per_class = {}  # Number of K-shot batches that each class can provide
        for c in self.classes:
            self.indices_per_class[c] = torch.where(self.dataset_targets == c)[0]
            self.batches_per_class[c] = self.indices_per_class[c].shape[0] // self.K_shot

        # Create a list of classes from which we select the N classes per batch
        self.iterations = sum(self.batches_per_class.values()) // self.N_way
        self.class_list = [c for c in self.classes for _ in range(self.batches_per_class[c])]
        if shuffle_once or self.shuffle:
            self.shuffle_data()
        else:
            # For testing, we iterate over classes instead of shuffling them
            sort_idxs = [i+p*self.num_classes for i,
                         c in enumerate(self.classes) for p in range(self.batches_per_class[c])]
            self.class_list = np.array(self.class_list)[np.argsort(sort_idxs)].tolist()

    def shuffle_data(self):
        # Shuffle the examples per class
        for c in self.classes:
            perm = torch.randperm(self.indices_per_class[c].shape[0])
            self.indices_per_class[c] = self.indices_per_class[c][perm]
        # Shuffle the class list from which we sample. Note that this way of shuffling
        # does not prevent the same class from being chosen twice in a batch. However,
        # for training and validation, this is not a problem.
        random.shuffle(self.class_list)

    def __iter__(self):
        # Shuffle data
        if self.shuffle:
            self.shuffle_data()

        # Sample few-shot batches
        start_index = defaultdict(int)
        for it in range(self.iterations):
            class_batch = self.class_list[it*self.N_way:(it+1)*self.N_way]  # Select N classes for the batch
            index_batch = []
            for c in class_batch:  # For each class, select the next K examples and add them to the batch
                index_batch.extend(self.indices_per_class[c][start_index[c]:start_index[c]+self.K_shot])
                start_index[c] += self.K_shot
            if self.include_query:  # If we return support+query set, sort them so that they are easy to split
                index_batch = index_batch[::2] + index_batch[1::2]
            yield index_batch

    def __len__(self):
        return self.iterations


def data_processing():
  # from torch.utils.data.dataset import Subset
  old_train_set = SubsetSC("training", "old")
  old_test_set = SubsetSC("testing", "old")
  novel_train_set = SubsetSC("training", "novel")
  novel_test_set = SubsetSC("testing", "novel")

  training_data = next(iter(novel_train_set))

  #labels would be combination of novel classes and old classes(digits)
  targets_list = [os.path.basename(os.path.dirname(novel_train_set._walker[i])) for i in range(len(novel_train_set))]
  targets = list(set(targets_list))
  digits = ['zero','one', 'two', 'three', 'four', 'five', 'six', 'seven', 'eight', 'nine'] 
  labels = digits + targets 
  targets_idx = [label_to_index(targets, i) for i in targets_list]
  # targets_idx = [list(targets).index(i) for i in targets_list]
  N_WAY = 3
  K_SHOT = 3
  sampler = FewShotBatchSampler(torch.as_tensor(targets_idx),N_WAY, K_SHOT, include_query= False, shuffle=True, shuffle_once=True)

  def pad_sequence(batch):
    # Make all tensor in a batch the same length by padding with zeros
    batch = [item.t() for item in batch]
    batch = torch.nn.utils.rnn.pad_sequence(batch, batch_first=True, padding_value=0.)
    return batch.permute(0, 2, 1)


  def collate_fn(batch):
          tensors, targets = [], []
          for waveform, label in batch:
                  tensors += [torch.squeeze(waveform)]
                  targets += [label_to_index(labels, label)]
                  
          tensors = torch.unsqueeze(pad_sequence(tensors), 1)
          targets = torch.stack(targets)
          return tensors, targets

  train_data_loader = DataLoader(novel_train_set, batch_sampler=sampler, collate_fn=collate_fn)

  #memory required to iterate through training data
  training_data = next(iter(train_data_loader))


Writing mprun_meta_data_processing.py


In [10]:
%%time
from mprun_meta_data_processing import data_processing
%mprun -f data_processing data_processing()

  0%|          | 0.00/2.26G [00:00<?, ?B/s]


CPU times: user 3min 35s, sys: 36.7 s, total: 4min 12s
Wall time: 4min 16s




```
Filename: /content/AudioClassificationWithDeepLearningAnalysis/dataset/mprun_meta_data_processing.py

Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
   143    747.1 MiB    747.1 MiB           1   def data_processing():
   144                                           # from torch.utils.data.dataset import Subset
   145    351.4 MiB   -395.6 MiB           1     old_train_set = SubsetSC("training", "old")
   146    352.4 MiB      1.0 MiB           1     old_test_set = SubsetSC("testing", "old")
   147    353.4 MiB      1.0 MiB           1     novel_train_set = SubsetSC("training", "novel")
   148    354.4 MiB      1.0 MiB           1     novel_test_set = SubsetSC("testing", "novel")
   149                                         
   150    355.3 MiB      0.9 MiB           1     training_data = next(iter(novel_train_set))
   151                                         
   152                                           #labels would be combination of novel classes and old classes(digits)
   153    355.3 MiB      0.0 MiB        6341     targets_list = [os.path.basename(os.path.dirname(novel_train_set._walker[i])) for i in range(len(novel_train_set))]
   154    355.3 MiB      0.0 MiB           1     targets = list(set(targets_list))
   155    355.3 MiB      0.0 MiB           1     digits = ['zero','one', 'two', 'three', 'four', 'five', 'six', 'seven', 'eight', 'nine'] 
   156    355.3 MiB      0.0 MiB           1     labels = digits + targets 
   157    355.6 MiB      0.3 MiB        6341     targets_idx = [label_to_index(targets, i) for i in targets_list]
   158                                           # targets_idx = [list(targets).index(i) for i in targets_list]
   159    355.6 MiB      0.0 MiB           1     N_WAY = 3
   160    355.6 MiB      0.0 MiB           1     K_SHOT = 3
   161    356.3 MiB      0.7 MiB           1     sampler = FewShotBatchSampler(torch.as_tensor(targets_idx),N_WAY, K_SHOT, include_query= False, shuffle=True, shuffle_once=True)
   162                                         
   163    358.0 MiB      0.0 MiB           2     def pad_sequence(batch):
   164                                             # Make all tensor in a batch the same length by padding with zeros
   165    358.2 MiB      0.2 MiB          12       batch = [item.t() for item in batch]
   166    358.2 MiB      0.0 MiB           1       batch = torch.nn.utils.rnn.pad_sequence(batch, batch_first=True, padding_value=0.)
   167    358.2 MiB      0.0 MiB           1       return batch.permute(0, 2, 1)
   168                                         
   169                                         
   170    358.0 MiB      1.7 MiB           2     def collate_fn(batch):
   171    358.0 MiB      0.0 MiB           1             tensors, targets = [], []
   172    358.0 MiB      0.0 MiB          10             for waveform, label in batch:
   173    358.0 MiB      0.0 MiB           9                     tensors += [torch.squeeze(waveform)]
   174    358.0 MiB      0.0 MiB           9                     targets += [label_to_index(labels, label)]
   175                                                           
   176    358.2 MiB      0.0 MiB           1             tensors = torch.unsqueeze(pad_sequence(tensors), 1)
   177    358.2 MiB      0.0 MiB           1             targets = torch.stack(targets)
   178    358.2 MiB      0.0 MiB           1             return tensors, targets
   179                                         
   180    356.3 MiB      0.0 MiB           1     train_data_loader = DataLoader(novel_train_set, batch_sampler=sampler, collate_fn=collate_fn)
   181                                         
   182                                           #memory required to iterate through training data
   183    358.2 MiB      0.0 MiB           1     training_data = next(iter(train_data_loader))
```



# Conclusion

Processing all the data with the first strategy takes about 15 minutes (wall time 15 min 23 s), while the second strategy takes only about 4 minutes (wall time 4 min 16 s). Furthermore, the first strategy peaks at 748.6 MiB of memory, roughly twice the second approach's 358.2 MiB. Therefore, the second strategy, which processes data as needed, is more efficient than the first, which processes all data in advance.
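As a quick check of the comparison, the ratios can be computed directly from the wall times and peak memory figures reported in cells 7 and 10. The helper name `to_seconds` is only illustrative.

```python
def to_seconds(minutes, seconds):
    # Convert a "minutes, seconds" wall time into seconds.
    return minutes * 60 + seconds


first_wall = to_seconds(15, 23)   # first strategy: 15 min 23 s
second_wall = to_seconds(4, 16)   # second strategy: 4 min 16 s

speedup = first_wall / second_wall
memory_ratio = 748.6 / 358.2      # peak MiB, first vs. second strategy

print(f"wall-time ratio: {speedup:.2f}x")         # wall-time ratio: 3.61x
print(f"peak-memory ratio: {memory_ratio:.2f}x")  # peak-memory ratio: 2.09x
```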