# LSTM for Spoken Digit Classification

This notebook applies post-training static quantization to a recurrent neural network (LSTM) trained to classify spoken digits (0–9) from audio recordings, then evaluates the quantized model.

- Dataset: [Free Spoken Digit Dataset (FSDD)](https://github.com/Jakobovski/free-spoken-digit-dataset)
- Framework: PyTorch
- Architecture: RNN with LSTM layers

In [286]:
import sys
import os

# Add the project root (parent of current folder) to Python path
project_root_dir = os.path.abspath(os.path.join(os.getcwd(), '..'))
sys.path.append(project_root_dir)


## Load Model Configuration from YAML

To keep the training pipeline configurable and modular, we store model parameters such as the number of LSTM layers, the hidden size, and the learning rate in a YAML file. This structure makes it quick to adapt the pipeline to the related tasks B and C.

This section loads the model configuration using a custom utility function.

In [287]:
import src.utils as utils

In [288]:
utils.set_seed(42)

[INFO] Random seed set to: 42


In [289]:
import yaml
import json

model_config_path = os.path.join(project_root_dir, 'config', 'model_config.yaml')
model_config = utils.read_yaml_file(model_config_path)
# print(json.dumps(model_config, indent=2))
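
The lookups later in this notebook imply a nested configuration layout. The sketch below parses a hypothetical config with PyYAML. The key names follow those lookups, but every value (including the dataset path and `hidden_dim`) is an illustrative placeholder, and `read_yaml_text` is only a stand-in for `utils.read_yaml_file`, whose implementation is not shown here.

```python
import yaml

# Hypothetical config text: key names mirror the lookups in this notebook,
# values are illustrative placeholders only.
CONFIG_TEXT = """
experiment:
  seed: 42
dataset:
  path: data/recordings
data_splitting:
  test_size: 0.2
model:
  input_dim: 13
  hidden_dim: 64
  num_layers: 2
  output_dim: 10
"""


def read_yaml_text(text):
    """Parse YAML text into nested dicts (stand-in for utils.read_yaml_file)."""
    return yaml.safe_load(text)


config = read_yaml_text(CONFIG_TEXT)
print(config["model"]["num_layers"], config["data_splitting"]["test_size"])
```

Using `yaml.safe_load` rather than `yaml.load` avoids constructing arbitrary Python objects from the file.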

## Load and Split Dataset for Training and Evaluation

In this section, we load the recordings from disk, build data-label pairs, and split them into training and test sets according to the `test_size` defined in the YAML file.

Using `test_size` and `seed` from the YAML config ensures that experiments are reproducible and easily tunable for other tasks by simply updating the configuration.


In [290]:
data_path = model_config['dataset']['path']
test_data_size = model_config['data_splitting']['test_size']
seed = model_config['experiment']['seed']

In [291]:
data_label_pairs, calibration_samples = utils.prepare_data_label_pairs(data_path, calibration_samples_per_class=50)

In [292]:
print(len(calibration_samples))

500


In [293]:
from sklearn.model_selection import train_test_split
train_data, test_data = train_test_split(data_label_pairs, test_size=test_data_size, random_state=seed)
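
To illustrate why a fixed `random_state` makes the split reproducible, here is a small sketch on stand-in data. The `(sample_id, label)` tuples, the `test_size` of 0.2 (inferred from the 2400/600 sizes reported below), and the seed value are assumptions for illustration. The `stratify` variant is an optional alternative, not something this notebook does.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Stand-in for data_label_pairs: 3000 (sample_id, label) tuples, 300 per digit.
pairs = [(i, i % 10) for i in range(3000)]
labels = [label for _, label in pairs]

# Same seed, same partition: the two calls return identical splits.
train_a, test_a = train_test_split(pairs, test_size=0.2, random_state=42)
train_b, test_b = train_test_split(pairs, test_size=0.2, random_state=42)
same_split = test_a == test_b

# Optional variation: stratify keeps the per-class counts equal in the test set.
_, test_s = train_test_split(pairs, test_size=0.2, random_state=42, stratify=labels)
per_class = Counter(label for _, label in test_s)
print(len(train_a), len(test_a), same_split, sorted(per_class.values()))
```

Without `stratify`, per-class test counts vary from run configuration to run configuration, which is consistent with the uneven supports in the classification report further down.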

## Transform Raw Data into PyTorch Dataset Objects

The `AudioFeaturesDataset` class converts raw data-label pairs into PyTorch-compatible datasets that provide easy access to samples and labels.

`AudioFeaturesDataset` is a custom dataset class that:

- Loads audio recordings of spoken digits along with their labels.
- Optionally cleans the audio by filtering out noise.
- Extracts MFCC features (a common speech feature).
- Pads or trims these features to a fixed length so all inputs have the same shape.
- Serves samples one at a time through the PyTorch `Dataset` interface during training or testing.

Together, these steps put the audio data in the format needed to train neural networks efficiently.
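
The pad-or-trim step can be sketched as follows. This is a minimal NumPy illustration, not the actual `AudioFeaturesDataset` code: the function name `pad_or_trim`, the `(frames, coefficients)` layout, zero padding at the end, and the target length of 20 are all assumptions.

```python
import numpy as np


def pad_or_trim(features, target_len):
    """Return a (target_len, n_coeffs) array: zero-pad short inputs, cut long ones."""
    n_frames, n_coeffs = features.shape
    if n_frames >= target_len:
        # Too long (or exact): keep only the first target_len frames.
        return features[:target_len]
    # Too short: append all-zero frames until the target length is reached.
    padding = np.zeros((target_len - n_frames, n_coeffs), dtype=features.dtype)
    return np.concatenate([features, padding], axis=0)


short_clip = np.ones((5, 13), dtype=np.float32)   # 5 frames of 13 MFCCs
long_clip = np.ones((50, 13), dtype=np.float32)   # 50 frames of 13 MFCCs
padded = pad_or_trim(short_clip, 20)
trimmed = pad_or_trim(long_clip, 20)
print(padded.shape, trimmed.shape)
```

A fixed length lets the default `DataLoader` collation stack samples into one tensor per batch without a custom `collate_fn`.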


In [294]:
from src.data_preprocessor import AudioFeaturesDataset

train_dataset = AudioFeaturesDataset(train_data)
test_dataset = AudioFeaturesDataset(test_data)

In [295]:
print(f"Train size: {len(train_dataset)}")
print(f"Test size: {len(test_dataset)}")


Train size: 2400
Test size: 600


## Create DataLoaders for Batch Processing

Using PyTorch DataLoaders, we enable efficient loading, batching, and shuffling of data during training and evaluation.

In [296]:
import torch

train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=32, shuffle=True)
test_loader = torch.utils.data.DataLoader(test_dataset, batch_size=32, shuffle=False)
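
As a quick illustration of what batching produces, the sketch below runs a `DataLoader` over random stand-in tensors. The sample count (100) and frame count (40) are arbitrary placeholders; only the 13 features per frame matches the LSTM input size shown later in this notebook.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in data: 100 samples, 40 frames each, 13 features per frame.
features = torch.randn(100, 40, 13)
labels = torch.randint(0, 10, (100,))

loader = DataLoader(TensorDataset(features, labels), batch_size=32, shuffle=True)

# Each batch is (batch, frames, features), the layout a batch_first LSTM expects.
batch_shapes = [tuple(x.shape) for x, _ in loader]
print(batch_shapes)
```

The last batch is smaller because 100 is not a multiple of 32; passing `drop_last=True` would discard it instead.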

In [297]:
input_dim = model_config['model']['input_dim']
hidden_dim = model_config['model']['hidden_dim']
num_layers = model_config['model']['num_layers']
output_dim = model_config['model']['output_dim']

In [298]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

## Static Quantization

In [299]:
import src.model as model

# Override the config value: the checkpoint loaded below uses hidden_dim=24.
hidden_dim = 24
quant_model = model.LSTMClassifier(input_dim=input_dim,
                       hidden_dim=hidden_dim,
                       num_layers=num_layers,
                       output_dim=output_dim).to(device)

In [300]:
memory_constraint_model_load_path = os.path.join(project_root_dir, 'outputs', 'models', 'task-b-part-1_weights.pth')
quant_model.load_state_dict(torch.load(memory_constraint_model_load_path, map_location=device))

<All keys matched successfully>

In [301]:
quant_model.eval()

LSTMClassifier(
  (lstm): LSTM(13, 24, num_layers=2, batch_first=True, dropout=0.3)
  (dropout): Dropout(p=0.3, inplace=False)
  (fc): Linear(in_features=24, out_features=10, bias=True)
)

In [302]:
import src.quantize as quantize
static_quant_model = quantize.StaticQuantizableModel(quant_model)

In [303]:
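# Default qconfig for x86 CPUs (use 'qnnpack' when targeting ARM/edge devices).
static_quant_model.qconfig = torch.quantization.get_default_qconfig('x86')
# torch.backends.quantized.engine = 'fbgemm'

from torch.ao.quantization import MovingAverageMinMaxObserver, PerChannelMinMaxObserver, QConfig

# Alternative custom qconfig (not applied below): per-tensor activations, per-channel weights.
# To use it, assign it to static_quant_model.qconfig before calling prepare().
qconfig = QConfig(activation=MovingAverageMinMaxObserver.with_args(qscheme=torch.per_tensor_affine, dtype=torch.quint8),
       weight=PerChannelMinMaxObserver.with_args(dtype=torch.qint8, qscheme=torch.per_channel_symmetric))

# prepare() reads the qconfig from the model's .qconfig attribute. Its second
# positional parameter is `inplace`, so a qconfig must not be passed there.
static_quant_model_prepared = torch.quantization.prepare(static_quant_model, inplace=True)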



In [304]:
# calibration_indices = list(range(100))
# calibration_dataset = torch.utils.data.Subset(test_dataset, calibration_indices)
# calibration_dataloader = torch.utils.data.DataLoader(calibration_dataset, batch_size=32)


calibration_sampleset = AudioFeaturesDataset(calibration_samples)
calibration_loader = torch.utils.data.DataLoader(calibration_sampleset, batch_size=32, shuffle=True)
print(f"Calibration set size : {len(calibration_sampleset)}")

Calibration set size : 500


In [305]:
# Eager-mode static quantization runs on the CPU, so calibrate there.
device = torch.device("cpu")
static_quant_model_prepared = static_quant_model_prepared.to(device)

quantize.calibrate(static_quant_model_prepared, calibration_loader, device)
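
Calibration exists so that observers can record the range of activations, from which a scale and zero point are derived for the 8-bit mapping. The NumPy sketch below illustrates the general min/max affine scheme on random stand-in batches. It is not PyTorch's observer implementation, and the helper names (`minmax_qparams`, `quantize_affine`, `dequantize_affine`) are hypothetical.

```python
import numpy as np


def minmax_qparams(x_min, x_max, qmin=0, qmax=255):
    """Derive scale and zero point for affine uint8 quantization from a range."""
    # Extend the range to include 0 so that zero is represented exactly.
    x_min, x_max = min(x_min, 0.0), max(x_max, 0.0)
    scale = max((x_max - x_min) / (qmax - qmin), 1e-8)
    zero_point = int(round(qmin - x_min / scale))
    return scale, max(qmin, min(qmax, zero_point))


def quantize_affine(x, scale, zero_point, qmin=0, qmax=255):
    """Map floats to uint8 codes: q = clip(round(x / scale) + zero_point)."""
    return np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.uint8)


def dequantize_affine(q, scale, zero_point):
    """Map uint8 codes back to approximate floats."""
    return (q.astype(np.float32) - zero_point) * scale


# "Calibration": track the global min and max over a few stand-in batches.
rng = np.random.default_rng(0)
batches = [rng.normal(size=(32, 13)).astype(np.float32) for _ in range(4)]
x_min = min(float(b.min()) for b in batches)
x_max = max(float(b.max()) for b in batches)
scale, zero_point = minmax_qparams(x_min, x_max)

# Round-trip one batch and measure the worst-case quantization error.
x = batches[0]
x_hat = dequantize_affine(quantize_affine(x, scale, zero_point), scale, zero_point)
max_err = float(np.abs(x - x_hat).max())
print(f"scale={scale:.5f} zero_point={zero_point} max_err={max_err:.5f}")
```

For values inside the calibrated range, the round-trip error is bounded by about half the scale, which is why calibration data should resemble the inputs seen at inference time.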

In [306]:
static_quantized_model_converted = torch.quantization.convert(static_quant_model_prepared)

In [307]:
from src.evaluate import ModelEvaluator
print(device)
quant_test_instance = ModelEvaluator(
    static_quantized_model_converted, 
    test_loader,
    device
)

cpu


In [308]:
quant_test_instance.evaluate()

 Accuracy on test data: 86.33%

 Classification Report:
              precision    recall  f1-score   support

           0     0.7952    0.9167    0.8516        72
           1     0.9344    0.8261    0.8769        69
           2     0.8478    0.6842    0.7573        57
           3     0.7467    1.0000    0.8550        56
           4     1.0000    0.9492    0.9739        59
           5     0.8704    0.7460    0.8034        63
           6     0.9800    0.8750    0.9245        56
           7     0.7833    0.8545    0.8174        55
           8     0.9804    0.8772    0.9259        57
           9     0.7969    0.9107    0.8500        56

    accuracy                         0.8633       600
   macro avg     0.8735    0.8640    0.8636       600
weighted avg     0.8736    0.8633    0.8634       600


 Confusion Matrix:
[[66  0  2  3  0  0  0  0  0  1]
 [ 1 57  2  0  0  1  0  8  0  0]
 [14  0 39  3  0  0  0  1  0  0]
 [ 0  0  0 56  0  0  0  0  0  0]
 [ 0  2  0  0 56  0  0  1  0  0]
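
The recall column of the classification report can be checked against the confusion matrix rows that are visible above. The sketch below uses only those five rows (true digits 0 to 4): recall is the diagonal entry divided by the row sum.

```python
import numpy as np

# First five rows of the confusion matrix printed above (true digits 0-4).
rows = np.array([
    [66, 0, 2, 3, 0, 0, 0, 0, 0, 1],
    [1, 57, 2, 0, 0, 1, 0, 8, 0, 0],
    [14, 0, 39, 3, 0, 0, 0, 1, 0, 0],
    [0, 0, 0, 56, 0, 0, 0, 0, 0, 0],
    [0, 2, 0, 0, 56, 0, 0, 1, 0, 0],
])

support = rows.sum(axis=1)                     # samples per true class
correct = rows[np.arange(5), np.arange(5)]     # diagonal entries
recall = correct / support
print(support.tolist(), np.round(recall, 4).tolist())
```

The row for digit 2 shows where its low recall comes from: 14 of its 57 samples are predicted as digit 0.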

In [309]:
for name, module in static_quantized_model_converted.named_modules():
    if isinstance(module, torch.nn.quantized.Linear):
        print(f"\n{name} is a quantized Linear layer")
        weight = module.weight()
        print("Weight type:", type(weight))
        print("Weight dtype:", weight.dtype)
        print("Weight shape:", weight.shape)
        print("---")


model.lstm.layers.0.layer_fw.cell.igates is a quantized Linear layer
Weight type: <class 'torch.Tensor'>
Weight dtype: torch.qint8
Weight shape: torch.Size([96, 13])
---

model.lstm.layers.0.layer_fw.cell.hgates is a quantized Linear layer
Weight type: <class 'torch.Tensor'>
Weight dtype: torch.qint8
Weight shape: torch.Size([96, 24])
---

model.lstm.layers.1.layer_fw.cell.igates is a quantized Linear layer
Weight type: <class 'torch.Tensor'>
Weight dtype: torch.qint8
Weight shape: torch.Size([96, 24])
---

model.lstm.layers.1.layer_fw.cell.hgates is a quantized Linear layer
Weight type: <class 'torch.Tensor'>
Weight dtype: torch.qint8
Weight shape: torch.Size([96, 24])
---

model.fc is a quantized Linear layer
Weight type: <class 'torch.Tensor'>
Weight dtype: torch.qint8
Weight shape: torch.Size([10, 24])
---


In [310]:
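# Estimate weight storage only: biases and quantization parameters
# (scales, zero points) are not counted.
total_fp32_bytes = 0
total_int8_bytes = 0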

for name, module in static_quantized_model_converted.named_modules():
    if isinstance(module, torch.nn.quantized.Linear):
        weight = module.weight()
        if weight.is_quantized:
            num_elements = weight.numel()
            total_fp32_bytes += num_elements * 4  # FP32
            total_int8_bytes += num_elements * 1  # INT8

print(f"Estimated FP32 size: {total_fp32_bytes / 1024:.2f} KB")
print(f"Estimated INT8 size: {total_int8_bytes / 1024:.2f} KB")
print(f"Compression ratio: {total_fp32_bytes / total_int8_bytes:.2f}x")

Estimated FP32 size: 32.81 KB
Estimated INT8 size: 8.20 KB
Compression ratio: 4.00x
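
These figures can be reproduced by hand from the weight shapes listed earlier. The sketch below repeats the arithmetic in plain Python. The 96 rows in the LSTM matrices are the four gates stacked (4 × 24), and like the estimate above it counts weight elements only.

```python
hidden_dim, input_dim, output_dim = 24, 13, 10
gate_rows = 4 * hidden_dim  # input, forget, cell, and output gates stacked -> 96

# Shapes of the quantized Linear weights reported by the inspection cell.
weight_shapes = [
    (gate_rows, input_dim),    # layer 0 igates
    (gate_rows, hidden_dim),   # layer 0 hgates
    (gate_rows, hidden_dim),   # layer 1 igates
    (gate_rows, hidden_dim),   # layer 1 hgates
    (output_dim, hidden_dim),  # fc
]

num_weights = sum(rows * cols for rows, cols in weight_shapes)
fp32_kb = num_weights * 4 / 1024   # 4 bytes per FP32 weight
int8_kb = num_weights * 1 / 1024   # 1 byte per INT8 weight
print(num_weights, f"{fp32_kb:.2f} KB", f"{int8_kb:.2f} KB")
```

The exact 4.00x ratio follows directly from the 4-byte versus 1-byte element size. The size of a saved model file will differ somewhat, since it also stores biases, per-channel scales, and zero points.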


In [311]:
# static_quant_model_save_path = os.path.join(project_root_dir, 'outputs', 'models', 'task-b-part-2_weights_no_jit.pth')
# torch.save(static_quantized_model_converted.state_dict(), static_quant_model_save_path)

In [312]:
# static_quantized_model_converted.eval()

# static_quant_model_save_path = os.path.join(project_root_dir, 'outputs', 'models', 'task-b-part-2_weights_jit.pth')

# script_model = torch.jit.script(static_quantized_model_converted)
# torch.jit.save(script_model, static_quant_model_save_path)

In [314]:
# nn.Module has no .save() method; save the state dict with torch.save instead.
torch.save(static_quantized_model_converted.state_dict(), "save-model-test.pt")