## LSTM for Spoken Digit Classification: Train without constraints

This notebook builds and trains a recurrent neural network (LSTM) to classify spoken digits (0–9) from audio recordings.

- Dataset: [Free Spoken Digit Dataset (FSDD)](https://github.com/Jakobovski/free-spoken-digit-dataset)
- Framework: PyTorch
- Architecture: RNN with LSTM layers

In [1]:
import sys
import os

# Add the project root (parent of current folder) to Python path
project_root_dir = os.path.abspath(os.path.join(os.getcwd(), '..'))
sys.path.append(project_root_dir)


## Load Model Configuration from YAML

To keep the training pipeline configurable and modular, we store model parameters such as the number of LSTM layers, the hidden size, and the learning rate in a YAML file. This structure enables quick adaptation to the related tasks B and C.

This section loads the model configuration using a custom utility function.

In [2]:
import src.utils as utils

In [3]:
utils.set_seed(42)

[INFO] Random seed set to: 42
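
The source of `utils.set_seed` is not shown in this notebook. As a minimal sketch of what such a helper commonly does, assuming it seeds Python's `random` module, NumPy, and PyTorch (the body below is an illustration, not the project's actual code):

```python
import random

import numpy as np
import torch


def set_seed(seed: int) -> None:
    """Seed the Python, NumPy, and PyTorch random number generators."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    # Safe to call on a CPU-only machine: it has no effect without CUDA.
    torch.cuda.manual_seed_all(seed)
    print(f"[INFO] Random seed set to: {seed}")


set_seed(42)
first = torch.rand(3)
set_seed(42)
second = torch.rand(3)  # same values as `first` because the seed was reset
```

Resetting the seed before each draw reproduces the same tensor, which is the property the notebook relies on for repeatable runs.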


In [4]:
import yaml
import json

model_config_path = os.path.join(project_root_dir, 'config', 'model_config.yaml')
model_config = utils.read_yaml_file(model_config_path)
# print(json.dumps(model_config, indent=2))
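
The YAML file itself is not reproduced here. The sketch below shows a hypothetical in-memory equivalent of what `utils.read_yaml_file` might return, using only the keys this notebook reads. The model dimensions, epoch count, and test fraction are inferred from outputs printed later in the notebook; the dataset path, seed, and learning rate are placeholders, and `missing_keys` is an illustrative helper that is not part of the project.

```python
# Hypothetical dict mirroring the parsed model_config.yaml (not the real file).
model_config_example = {
    "dataset": {"path": "path/to/recordings"},   # placeholder path
    "data_splitting": {"test_size": 0.2},        # inferred: 600 of 3000 recordings
    "experiment": {"seed": 42},                  # placeholder value
    "model": {
        "input_dim": 13,     # inferred: 6,656 = 4 * 128 * 13
        "hidden_dim": 128,   # inferred: 512 = 4 * 128
        "num_layers": 2,     # inferred: parameters end in _l0 and _l1
        "output_dim": 10,    # ten digit classes
    },
    "training": {"learning_rate": 1e-3, "epochs": 20},  # learning rate is a placeholder
}

# Sections and keys that the notebook cells look up.
REQUIRED_KEYS = {
    "dataset": ["path"],
    "data_splitting": ["test_size"],
    "experiment": ["seed"],
    "model": ["input_dim", "hidden_dim", "num_layers", "output_dim"],
    "training": ["learning_rate", "epochs"],
}


def missing_keys(config):
    """Return dotted names of required keys that are absent from config."""
    missing = []
    for section, keys in REQUIRED_KEYS.items():
        for key in keys:
            if key not in config.get(section, {}):
                missing.append(f"{section}.{key}")
    return missing


problems = missing_keys(model_config_example)
```

A check like this fails early with a readable message instead of a `KeyError` several cells later.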

## Load and Split Dataset for Training and Evaluation

In this section, we load the recordings from disk, generate data-label pairs, and split them into training and test sets according to the `test_size` defined in the YAML file.

Using `test_size` and `seed` from the YAML config ensures that experiments are reproducible and easily tunable for other tasks by simply updating the configuration.


In [5]:
data_path = model_config['dataset']['path']
test_data_size = model_config['data_splitting']['test_size']
seed = model_config['experiment']['seed']

In [6]:
data_label_pairs, _ = utils.prepare_data_label_pairs(data_path)

In [7]:
from sklearn.model_selection import train_test_split
train_data, test_data = train_test_split(data_label_pairs, test_size=test_data_size, random_state=seed)
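
The per-class support in the classification report further down ranges from 55 to 72, which suggests the split above is not stratified. If balanced test classes are wanted, `train_test_split` accepts a `stratify` argument. The sketch below uses a synthetic stand-in for `data_label_pairs`, assuming each element is an `(item, label)` tuple and that there are 300 items per digit; both are assumptions.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Synthetic stand-in for data_label_pairs: 300 items for each of the 10 digits.
pairs = [(f"recording_{i}", i % 10) for i in range(3000)]
labels = [label for _, label in pairs]

# stratify=labels keeps the class proportions equal in both splits.
train_pairs, test_pairs = train_test_split(
    pairs, test_size=0.2, random_state=42, stratify=labels
)

test_counts = Counter(label for _, label in test_pairs)
print(len(train_pairs), len(test_pairs))
print(sorted(test_counts.items()))
```

Whether stratification is appropriate here depends on how the other tasks are evaluated, so this is an option rather than a correction.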

## Transform Raw Data into PyTorch Dataset Objects

The `AudioFeaturesDataset` class converts raw data-label pairs into PyTorch-compatible datasets that provide easy access to samples and labels.

`AudioFeaturesDataset` is a custom dataset class that:

- Loads audio recordings of spoken digits along with their labels.
- Optionally cleans the audio by filtering out noise.
- Extracts MFCC features (a common speech feature).
- Pads or trims these features to a fixed length so all inputs have the same shape.
- Works with PyTorch to provide samples one by one when training or testing a model.

Together, these steps put the audio data in the format needed to train neural networks efficiently.
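
The pad-or-trim step can be sketched as follows. This is an illustration, not the code in `src.data_preprocessor`: the function name `pad_or_trim`, the zero padding at the end, and the target length of 32 frames are assumptions. The 13 coefficients per frame match the input size implied by the parameter counts printed later.

```python
import torch


def pad_or_trim(features: torch.Tensor, target_len: int) -> torch.Tensor:
    """Return features with exactly target_len frames along dim 0.

    features has shape (num_frames, num_coefficients).
    """
    num_frames = features.shape[0]
    if num_frames >= target_len:
        # Too long: keep only the first target_len frames.
        return features[:target_len]
    # Too short: append zero frames at the end.
    padding = features.new_zeros(target_len - num_frames, features.shape[1])
    return torch.cat([features, padding], dim=0)


short_clip = torch.randn(20, 13)
long_clip = torch.randn(50, 13)
padded = pad_or_trim(short_clip, 32)   # shape (32, 13)
trimmed = pad_or_trim(long_clip, 32)   # shape (32, 13)
```

A fixed length lets the default `DataLoader` collation stack samples into one tensor per batch without a custom `collate_fn`.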


In [8]:
from src.data_preprocessor import AudioFeaturesDataset

train_dataset = AudioFeaturesDataset(train_data)
test_dataset = AudioFeaturesDataset(test_data)

In [9]:
print(f"Train size: {len(train_dataset)}")
print(f"Test size: {len(test_dataset)}")

Train size: 2400
Test size: 600


## Create DataLoaders for Batch Processing

Using PyTorch DataLoaders, we enable efficient loading, batching, and shuffling of data during training and evaluation.

In [10]:
import torch

train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=32, shuffle=True)
test_loader = torch.utils.data.DataLoader(test_dataset, batch_size=32, shuffle=False)

## LSTM Model Definition

A simple `n`-layer LSTM followed by a fully connected output layer. The number of layers `n` is read from `num_layers` in the configuration YAML file.
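
The definition of `LSTMClassifier` lives in `src.model` and is not shown here. A minimal sketch of a model with this shape follows. The attribute names `lstm` and `fc` match the parameter names printed later in this notebook; using `batch_first=True` and classifying from the last time step are assumptions, and the class name `LSTMClassifierSketch` is hypothetical.

```python
import torch
import torch.nn as nn


class LSTMClassifierSketch(nn.Module):
    """Stacked LSTM followed by a linear layer over the final time step."""

    def __init__(self, input_dim, hidden_dim, num_layers, output_dim):
        super().__init__()
        self.lstm = nn.LSTM(input_dim, hidden_dim, num_layers=num_layers, batch_first=True)
        self.fc = nn.Linear(hidden_dim, output_dim)

    def forward(self, x):
        # x has shape (batch, time, input_dim).
        output, _ = self.lstm(x)
        last_step = output[:, -1, :]   # hidden state of the top layer at the last frame
        return self.fc(last_step)      # unnormalized class scores (logits)


# Dimensions inferred from the parameter counts reported after training.
sketch_model = LSTMClassifierSketch(input_dim=13, hidden_dim=128, num_layers=2, output_dim=10)
logits = sketch_model(torch.randn(4, 50, 13))
num_params = sum(p.numel() for p in sketch_model.parameters())
```

With these dimensions the sketch has the same total parameter count as the trained model reported below, which supports the inferred sizes but does not confirm the forward pass.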

In [11]:
input_dim = model_config['model']['input_dim']
hidden_dim = model_config['model']['hidden_dim']
num_layers = model_config['model']['num_layers']
output_dim = model_config['model']['output_dim']

In [12]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

In [13]:
import src.model as model
import torch.nn as nn
import torch.optim as optim

float_model = model.LSTMClassifier(input_dim=input_dim,
                       hidden_dim=hidden_dim,
                       num_layers=num_layers,
                       output_dim=output_dim).to(device)

## Training Loop

In [14]:
learning_rate = model_config['training']['learning_rate']
epochs = model_config['training']['epochs']

In [15]:
from src.train import ModelTrainer
trainer_instance = ModelTrainer(
    float_model, 
    epochs,
    train_loader,
    device,
    learning_rate
)

In [16]:
trainer_instance.train()

Epoch [1/20], Loss: 135.5336, Accuracy: 57.92%
Epoch [2/20], Loss: 60.2363, Accuracy: 87.79%
Epoch [3/20], Loss: 25.2365, Accuracy: 95.75%
Epoch [4/20], Loss: 15.5873, Accuracy: 96.46%
Epoch [5/20], Loss: 12.0516, Accuracy: 97.29%
Epoch [6/20], Loss: 6.3865, Accuracy: 99.08%
Epoch [7/20], Loss: 5.4559, Accuracy: 96.46%
Epoch [8/20], Loss: 7.2806, Accuracy: 97.54%
Epoch [9/20], Loss: 5.7572, Accuracy: 98.75%
Epoch [10/20], Loss: 7.9146, Accuracy: 98.42%
Epoch [11/20], Loss: 3.3656, Accuracy: 99.62%
Epoch [12/20], Loss: 1.4955, Accuracy: 99.83%
Epoch [13/20], Loss: 0.9625, Accuracy: 99.88%
Epoch [14/20], Loss: 0.5560, Accuracy: 99.96%
Epoch [15/20], Loss: 0.3895, Accuracy: 99.96%
Epoch [16/20], Loss: 0.3192, Accuracy: 99.96%
Epoch [17/20], Loss: 0.1587, Accuracy: 99.96%
Epoch [18/20], Loss: 0.3076, Accuracy: 99.92%
Epoch [19/20], Loss: 0.2430, Accuracy: 99.96%
Epoch [20/20], Loss: 0.1775, Accuracy: 99.96%
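
`ModelTrainer` is defined in `src.train` and its internals are not shown. The sketch below is one plausible shape for a single epoch, assuming cross-entropy loss and the Adam optimizer (neither is confirmed by the notebook). It also assumes the logged loss is the sum of per-batch losses: with 2400 samples and a batch size of 32 there are 75 batches, so the first epoch's 135.5336 would correspond to about 1.81 per batch, a plausible value for an untrained 10-class model.

```python
import torch
import torch.nn as nn


def train_one_epoch(model, loader, optimizer, criterion, device):
    """Run one pass over loader; return (summed batch loss, accuracy in %)."""
    model.train()
    total_loss, correct, seen = 0.0, 0, 0
    for features, labels in loader:
        features, labels = features.to(device), labels.to(device)
        optimizer.zero_grad()
        logits = model(features)
        loss = criterion(logits, labels)
        loss.backward()
        optimizer.step()
        total_loss += loss.item()   # accumulate the mean loss of each batch
        correct += (logits.argmax(dim=1) == labels).sum().item()
        seen += labels.size(0)
    return total_loss, 100.0 * correct / seen


# Tiny synthetic run to show the call pattern (random data, not FSDD).
torch.manual_seed(0)
toy_model = nn.Linear(4, 3)
toy_dataset = torch.utils.data.TensorDataset(torch.randn(64, 4), torch.randint(0, 3, (64,)))
toy_loader = torch.utils.data.DataLoader(toy_dataset, batch_size=16)
toy_optimizer = torch.optim.Adam(toy_model.parameters(), lr=1e-2)
epoch_loss, epoch_acc = train_one_epoch(
    toy_model, toy_loader, toy_optimizer, nn.CrossEntropyLoss(), torch.device("cpu")
)
```

Note that the accuracy logged above is measured on the training set, so it is expected to be higher than the test accuracy reported in the evaluation section.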


In [17]:
_, _ = utils.get_model_params_size(float_model)


Layer-wise parameter counts:
lstm.weight_ih_l0              -> 6,656 params
lstm.weight_hh_l0              -> 65,536 params
lstm.bias_ih_l0                -> 512 params
lstm.bias_hh_l0                -> 512 params
lstm.weight_ih_l1              -> 65,536 params
lstm.weight_hh_l1              -> 65,536 params
lstm.bias_ih_l1                -> 512 params
lstm.bias_hh_l1                -> 512 params
fc.weight                      -> 1,280 params
fc.bias                        -> 10 params


 Total Parameters: 206,602
Estimated Memory: 807.04 KB (0.79 MB)
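
The totals above can be reproduced by hand. Each LSTM layer has four gates, so its input-to-hidden weight has `4 * hidden_dim * layer_input` entries, its hidden-to-hidden weight has `4 * hidden_dim * hidden_dim`, and each of its two bias vectors has `4 * hidden_dim`. The dimensions used below (13, 128, 2, 10) are inferred from the per-layer counts; the function name is hypothetical.

```python
def lstm_classifier_param_count(input_dim, hidden_dim, num_layers, output_dim):
    """Count parameters of a stacked LSTM followed by one linear layer."""
    total = 0
    for layer in range(num_layers):
        # The first layer sees the input features; later layers see hidden states.
        layer_input = input_dim if layer == 0 else hidden_dim
        total += 4 * hidden_dim * layer_input   # weight_ih
        total += 4 * hidden_dim * hidden_dim    # weight_hh
        total += 2 * 4 * hidden_dim             # bias_ih and bias_hh
    total += hidden_dim * output_dim + output_dim  # fc.weight and fc.bias
    return total


total_params = lstm_classifier_param_count(13, 128, 2, 10)
size_kb = total_params * 4 / 1024   # 4 bytes per FP32 parameter
print(f"{total_params:,} params, {size_kb:.2f} KB ({size_kb / 1024:.2f} MB)")
# prints: 206,602 params, 807.04 KB (0.79 MB)
```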


## Evaluation & Visualization

In [18]:
from src.evaluate import ModelEvaluator

In [19]:
test_instance = ModelEvaluator(
    float_model, 
    test_loader,
    device
)


In [20]:
test_instance.evaluate()


 Accuracy on test data: 99.00%

 Classification Report:
              precision    recall  f1-score   support

           0     0.9861    0.9861    0.9861        72
           1     0.9855    0.9855    0.9855        69
           2     0.9825    0.9825    0.9825        57
           3     1.0000    1.0000    1.0000        56
           4     1.0000    1.0000    1.0000        59
           5     1.0000    0.9841    0.9920        63
           6     0.9825    1.0000    0.9912        56
           7     0.9821    1.0000    0.9910        55
           8     1.0000    1.0000    1.0000        57
           9     0.9818    0.9643    0.9730        56

    accuracy                         0.9900       600
   macro avg     0.9900    0.9902    0.9901       600
weighted avg     0.9900    0.9900    0.9900       600


 Confusion Matrix:
[[71  0  0  0  0  0  1  0  0  0]
 [ 0 68  1  0  0  0  0  0  0  0]
 [ 1  0 56  0  0  0  0  0  0  0]
 [ 0  0  0 56  0  0  0  0  0  0]
 [ 0  0  0  0 59  0  0  0  0  0]
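
The rows of the confusion matrix shown above tie back to the recall column of the classification report. Assuming the usual convention that each row holds the counts for one true digit, recall is the diagonal entry divided by the row sum. The sketch uses only the rows visible above.

```python
# Rows for true digits 0-4, copied from the confusion matrix output above.
visible_rows = [
    [71, 0, 0, 0, 0, 0, 1, 0, 0, 0],
    [0, 68, 1, 0, 0, 0, 0, 0, 0, 0],
    [1, 0, 56, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 56, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 59, 0, 0, 0, 0, 0],
]

# Recall for digit d = correct predictions for d / all samples whose true label is d.
recalls = [row[digit] / sum(row) for digit, row in enumerate(visible_rows)]
supports = [sum(row) for row in visible_rows]

for digit, recall in enumerate(recalls):
    print(f"digit {digit}: recall = {recall:.4f}, support = {supports[digit]}")
```

The row sums (72, 69, 57, 56, 59) match the support column of the report, which is consistent with the row-per-true-label reading.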

### Compute Inference Time

In [24]:
utils.compute_inference_time(float_model, test_loader)

Median inference time: 0.9460 ms
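
How `utils.compute_inference_time` measures latency is not shown, and the output does not say whether the figure is per batch or per sample. A generic sketch of median-latency timing, assuming a few untimed warm-up calls followed by one timed call per input (the function name and the warm-up count are hypothetical):

```python
import statistics
import time


def median_latency_ms(run_once, inputs, warmup=3):
    """Median wall-clock time of run_once(x) in milliseconds over inputs."""
    inputs = list(inputs)
    for x in inputs[:warmup]:
        run_once(x)  # warm-up calls are not timed
    timings = []
    for x in inputs:
        start = time.perf_counter()
        run_once(x)
        timings.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(timings)


# Stand-in workload; with a model, run_once would wrap the forward pass.
latency = median_latency_ms(lambda n: sum(i * i for i in range(n)), [1000] * 20)
print(f"Median inference time: {latency:.4f} ms")
```

For a PyTorch model, the forward pass would normally run under `torch.no_grad()` with the model in `eval()` mode, and on a GPU `torch.cuda.synchronize()` should be called before reading the clock so that queued kernels are included in the measurement.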


### Save the model

In [25]:
float_model_save_path = os.path.join(project_root_dir, 'outputs', 'models', 'float_model_weights.pth')
torch.save(float_model.state_dict(), float_model_save_path)
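
Because only the `state_dict` is saved, reloading requires building a model with the same architecture first and then loading the weights into it. The sketch below demonstrates the round trip with a bare `nn.LSTM` and an in-memory buffer standing in for the `.pth` file; both substitutions are for illustration only.

```python
import io

import torch
import torch.nn as nn

source_model = nn.LSTM(13, 128, num_layers=2, batch_first=True)

buffer = io.BytesIO()  # stands in for float_model_weights.pth
torch.save(source_model.state_dict(), buffer)
buffer.seek(0)

# Rebuild the same architecture, then copy the saved weights into it.
restored_model = nn.LSTM(13, 128, num_layers=2, batch_first=True)
state_dict = torch.load(buffer, map_location="cpu")
restored_model.load_state_dict(state_dict)
restored_model.eval()  # switch to inference mode before evaluating

weights_match = all(
    torch.equal(value, restored_model.state_dict()[name])
    for name, value in source_model.state_dict().items()
)
```

`map_location="cpu"` lets weights saved on a GPU machine load on a CPU-only one.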

In [26]:
utils.print_float_model_analysis(float_model)

Layer Name                | Num Parameters | Size (Memory)
----------------------------------------------------------------------
lstm.weight_ih_l0         |          6656 | 26.000 KB 
lstm.weight_hh_l0         |         65536 | 256.000 KB 
lstm.bias_ih_l0           |           512 | 2.000 KB 
lstm.bias_hh_l0           |           512 | 2.000 KB 
lstm.weight_ih_l1         |         65536 | 256.000 KB 
lstm.weight_hh_l1         |         65536 | 256.000 KB 
lstm.bias_ih_l1           |           512 | 2.000 KB 
lstm.bias_hh_l1           |           512 | 2.000 KB 
fc.weight                 |          1280 | 5.000 KB 
fc.bias                   |            10 | 0.039 KB 

📊 Total Model Summary
Total number of parameters:      206602
Estimated total size (FP32):    807.04 KB (0.79 MB)
Memory per parameter (FP32):    4 bytes
Meets 36KB per-layer limit?     ❌ No
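
The per-layer check in the summary can be reproduced from the table above. The sketch assumes the limit applies to each parameter tensor separately and that sizes are computed as parameters times 4 bytes divided by 1024, which matches the sizes printed in the table; how `print_float_model_analysis` actually applies the limit is not shown.

```python
LIMIT_KB = 36
BYTES_PER_PARAM = 4  # FP32

# Parameter counts copied from the table above.
layer_params = {
    "lstm.weight_ih_l0": 6656,
    "lstm.weight_hh_l0": 65536,
    "lstm.bias_ih_l0": 512,
    "lstm.bias_hh_l0": 512,
    "lstm.weight_ih_l1": 65536,
    "lstm.weight_hh_l1": 65536,
    "lstm.bias_ih_l1": 512,
    "lstm.bias_hh_l1": 512,
    "fc.weight": 1280,
    "fc.bias": 10,
}

# Keep only the tensors whose FP32 size exceeds the limit.
oversized = {
    name: count * BYTES_PER_PARAM / 1024
    for name, count in layer_params.items()
    if count * BYTES_PER_PARAM / 1024 > LIMIT_KB
}
max_params_per_layer = LIMIT_KB * 1024 // BYTES_PER_PARAM

for name, size_kb in oversized.items():
    print(f"{name}: {size_kb:.3f} KB exceeds {LIMIT_KB} KB")
```

Under these assumptions, three tensors (256 KB each) exceed the limit, and a tensor would need at most 9,216 FP32 parameters to fit.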
