# **A. Introduction and Dataset Overview:**


1.   Dataset: AudioMNIST (spoken-digit audio dataset)
2.   Goal: Classify recordings of spoken digits (0–9) based on audio features
3.   Problem: Multi-class classification
4.   Target labels: Digits 0–9
5.   Link to the dataset: https://huggingface.co/datasets/flexthink/audiomnist



**Importing necessary packages:**

In [1]:
!pip install datasets


In [3]:
!pip install torchaudio librosa numpy matplotlib


In [4]:
import torch
import torchaudio
import torchaudio.transforms as T
from torch.utils.data import Dataset, DataLoader

# **B. Data Loading and Cleaning:**

In [2]:
# Loading the dataset (it is already split into training, validation, and test sets)
from datasets import load_dataset

ds = load_dataset("flexthink/audiomnist")


Generating train split:   0%|          | 0/28520 [00:00<?, ? examples/s]

Generating valid split:   0%|          | 0/750 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/750 [00:00<?, ? examples/s]

In [5]:
# Printing a single example from the training split:
print(ds["train"][0])

{'file_name': '/storage/hf-datasets-cache/all/datasets/31152253818556-config-parquet-and-info-flexthink-audiomnist-315086ef/downloads/extracted/a5ef8955819b16af0fa318318c415faec8cf9fe6d8554d5835ca9333d2a0aed0/dataset/26/5_26_27.wav', 'audio': {'path': '5_26_27.wav', 'array': array([3.66210938e-04, 3.05175781e-04, 3.05175781e-04, ...,
       1.83105469e-04, 1.22070312e-04, 9.15527344e-05]), 'sampling_rate': 48000}, 'speaker_id': '26', 'age': 22, 'gender': 0, 'accent': 2, 'native_speaker': False, 'origin': 'Asia, China, Beijing', 'digit': 5}
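
The record above stores the raw waveform under `audio["array"]` and its rate under `audio["sampling_rate"]`, so the clip length in seconds is the sample count divided by the rate. A minimal sketch of that calculation follows; `clip_duration_seconds` and `fake_example` are hypothetical names, and the stand-in record only mimics the keys printed above rather than real data.

```python
def clip_duration_seconds(example):
    """Return the clip length in seconds for one dataset example."""
    audio = example["audio"]
    return len(audio["array"]) / audio["sampling_rate"]


# Hypothetical stand-in with the same keys as the printed record (not real data).
fake_example = {
    "audio": {"array": [0.0] * 24000, "sampling_rate": 48000},
    "digit": 5,
}

print(clip_duration_seconds(fake_example))  # 0.5
```

Running the same function over a handful of real examples is a quick way to check whether a fixed clip length of about one second is a reasonable choice.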


# **C. Convert Data into Tensor Format:**

**Preprocessing Steps:**

In [7]:
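# Defining parameters for preprocessing:
target_sample_rate = 16000  # Hz; the source audio is recorded at 48000 Hz
n_mels = 64                 # number of mel frequency bins
max_duration = 1            # seconds; clips are trimmed or padded to this length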
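
These three parameters fix the shape of every spectrogram. The sketch below estimates that shape, assuming `T.MelSpectrogram` is used with its default STFT settings, which to my knowledge are `n_fft=400`, a hop of 200 samples, and centered frames; `mel_frame_count` is a hypothetical helper, and the result should be checked against an actual transform output.

```python
def mel_frame_count(num_samples, hop_length=200, n_fft=400, center=True):
    """Estimate the number of STFT frames for a waveform of num_samples.

    Assumption: centered framing pads n_fft // 2 samples on each side,
    which gives 1 + num_samples // hop_length frames.
    """
    if center:
        return 1 + num_samples // hop_length
    return 1 + (num_samples - n_fft) // hop_length


target_sample_rate = 16000
n_mels = 64
max_duration = 1

num_samples = target_sample_rate * max_duration
expected_shape = (1, n_mels, mel_frame_count(num_samples))
print(expected_shape)  # (1, 64, 81)
```

Under these assumptions each example becomes a single-channel 64 x 81 "image", which is the input size a downstream model would need to accept.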

In [8]:
def preprocess_audio(example):
    # Convert the raw samples to a float32 tensor of shape [1, num_samples]:
    waveform = torch.tensor(example["audio"]["array"]).unsqueeze(0).to(torch.float32)

    # Resample to the target rate if needed:
    if example["audio"]["sampling_rate"] != target_sample_rate:
        resampler = T.Resample(example["audio"]["sampling_rate"], target_sample_rate)
        waveform = resampler(waveform)

    # Trim long clips and zero-pad short ones to a fixed length:
    max_len = target_sample_rate * max_duration
    if waveform.shape[1] > max_len:
        waveform = waveform[:, :max_len]
    else:
        pad_len = max_len - waveform.shape[1]
        waveform = torch.nn.functional.pad(waveform, (0, pad_len))

    # Convert to a mel spectrogram. This must run for every clip, so it sits
    # outside the if/else above; otherwise trimmed clips would return None.
    mel_spec_transform = T.MelSpectrogram(sample_rate=target_sample_rate, n_mels=n_mels)
    mel_spec = mel_spec_transform(waveform)

    return {"mel_spec": mel_spec, "label": torch.tensor(example["digit"])}
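
To see what the resampling and trim-or-pad steps do to a clip's length, the arithmetic can be worked through without any audio. This is a rough sketch: `resampled_length` and `pad_or_trim_amount` are hypothetical helpers, and the ceiling rule for the resampled length is my understanding of how torchaudio sizes its output, so treat it as an approximation.

```python
import math


def resampled_length(num_samples, orig_rate, new_rate):
    """Approximate sample count after resampling (assumed ceiling rule)."""
    return math.ceil(num_samples * new_rate / orig_rate)


def pad_or_trim_amount(num_samples, max_len):
    """Positive: zeros to append. Negative: samples to drop. Zero: no change."""
    return max_len - num_samples


max_len = 16000 * 1  # target_sample_rate * max_duration

# A 0.6 s clip recorded at 48 kHz is shorter than max_len and gets padded.
short_clip = resampled_length(28800, 48000, 16000)
print(short_clip, pad_or_trim_amount(short_clip, max_len))  # 9600 6400

# A 1.5 s clip recorded at 48 kHz is longer than max_len and gets trimmed.
long_clip = resampled_length(72000, 48000, 16000)
print(long_clip, pad_or_trim_amount(long_clip, max_len))  # 24000 -8000
```

Both branches end with exactly `max_len` samples, which is what lets every spectrogram share one shape.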

Apply the preprocessing to the training, validation, and test splits:

In [9]:
ds["train"] = ds["train"].map(preprocess_audio)
ds["valid"] = ds["valid"].map(preprocess_audio)
ds["test"] = ds["test"].map(preprocess_audio)

Map:   0%|          | 0/28520 [00:00<?, ? examples/s]

Map:   0%|          | 0/750 [00:00<?, ? examples/s]

Map:   0%|          | 0/750 [00:00<?, ? examples/s]

**Convert Dataset into PyTorch Dataset Class:**

In [10]:
class AudioDataset(Dataset):
    def __init__(self, dataset):
        self.dataset = dataset

    # __len__ and __getitem__ must be defined at class level (not nested
    # inside __init__) so that DataLoader can find them.
    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, idx):
        sample = self.dataset[idx]

        # The mapped dataset stores values as nested lists, so convert back to tensors
        mel_spec = torch.tensor(sample["mel_spec"], dtype=torch.float32)
        label = torch.tensor(sample["label"], dtype=torch.long)

        # Add a channel dimension if it is missing
        if mel_spec.dim() == 2:
            mel_spec = mel_spec.unsqueeze(0)

        return mel_spec, label  # Output: [1, 64, Time], label as scalar tensor

# Create Dataset instances for train, validation, and test
train_dataset = AudioDataset(ds["train"])
valid_dataset = AudioDataset(ds["valid"])
test_dataset = AudioDataset(ds["test"])
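
A map-style dataset such as `AudioDataset` only has to expose `__len__` and `__getitem__` as methods of the class; a loader then asks for indices and groups the returned samples. The torch-free sketch below illustrates that protocol. `ListDataset` and `iter_batches` are hypothetical stand-ins, not part of PyTorch, and they skip the shuffling and tensor collation that `DataLoader` performs.

```python
class ListDataset:
    """Hypothetical stand-in exposing the same protocol as AudioDataset."""

    def __init__(self, items):
        self.items = items

    def __len__(self):
        return len(self.items)

    def __getitem__(self, idx):
        return self.items[idx]


def iter_batches(dataset, batch_size):
    """Yield lists of samples in index order, without shuffling or collation."""
    for start in range(0, len(dataset), batch_size):
        stop = min(start + batch_size, len(dataset))
        yield [dataset[i] for i in range(start, stop)]


# Five fake (spectrogram, label) pairs; the strings stand in for tensors.
toy = ListDataset([("spec%d" % i, i % 10) for i in range(5)])
batches = list(iter_batches(toy, batch_size=2))
print([len(batch) for batch in batches])  # [2, 2, 1]
```

With the real classes, the equivalent step would be wrapping `train_dataset` in the `DataLoader` imported earlier, with a batch size of your choosing.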

# **D. Save Processed Data**

In [12]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [13]:
# Define the save path inside Google Drive
save_path = "/content/drive/My Drive/audio_datasets/"

# Ensure the directory exists
import os
os.makedirs(save_path, exist_ok=True)

# Save the datasets
torch.save(train_dataset, save_path + "train_dataset.pt")
torch.save(valid_dataset, save_path + "valid_dataset.pt")
torch.save(test_dataset, save_path + "test_dataset.pt")

print("Datasets saved successfully in Google Drive!")

Datasets saved successfully in Google Drive!
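
The success message is printed unconditionally, so it can be worth confirming that the three files actually exist in the target folder. A minimal sketch of such a check follows; `missing_files` is a hypothetical helper, and the demo uses a temporary directory instead of Google Drive so that it runs anywhere.

```python
import os
import tempfile

EXPECTED_FILES = ("train_dataset.pt", "valid_dataset.pt", "test_dataset.pt")


def missing_files(save_dir, names=EXPECTED_FILES):
    """Return the expected file names that are not present in save_dir."""
    return [name for name in names if not os.path.isfile(os.path.join(save_dir, name))]


# Demo in a temporary directory: only the training file has been "saved".
with tempfile.TemporaryDirectory() as demo_dir:
    open(os.path.join(demo_dir, "train_dataset.pt"), "wb").close()
    print(missing_files(demo_dir))  # ['valid_dataset.pt', 'test_dataset.pt']
```

In the notebook, calling `missing_files(save_path)` after the save cell should return an empty list if all three files were written.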
