# Task B: Train with 36 KB memory constraint

This notebook builds and trains a recurrent neural network (LSTM) to classify spoken digits (0–9) from audio recordings.

- Dataset: [Free Spoken Digit Dataset (FSDD)](https://github.com/Jakobovski/free-spoken-digit-dataset)
- Framework: PyTorch
- Architecture: RNN with LSTM layers

In [1]:
import sys
import os

# Add the project root (parent of current folder) to Python path
project_root_dir = os.path.abspath(os.path.join(os.getcwd(), '..'))
sys.path.append(project_root_dir)


## Load Model Configuration from YAML

To make the training pipeline configurable and modular, we store model parameters such as the number of LSTM layers, hidden size, and learning rate in a YAML file. This structure enables quick adaptation to the related tasks B and C.

This section loads the model configuration using a custom utility function.

In [2]:
import src.utils as utils

In [3]:
utils.set_seed(42)

[INFO] Random seed set to: 42


In [4]:
import yaml
import json

model_config_path = os.path.join(project_root_dir, 'config', 'model_config.yaml')
model_config = utils.read_yaml_file(model_config_path)
# print(json.dumps(model_config, indent=2))

## Load and Split Dataset for Training and Evaluation

In this section, we load the recordings from disk, generate data-label pairs, and split them into training and test sets according to the `test_size` defined in the YAML file.

Using `test_size` and `seed` from the YAML config ensures that experiments are reproducible and easily tunable for other tasks by simply updating the configuration.


In [5]:
data_path = model_config['dataset']['path']
test_data_size = model_config['data_splitting']['test_size']
seed = model_config['experiment']['seed']

In [6]:
data_label_pairs, _ = utils.prepare_data_label_pairs(data_path)

In [7]:
from sklearn.model_selection import train_test_split
train_data, test_data = train_test_split(data_label_pairs, test_size=test_data_size, random_state=seed)

## Transform Raw Data into PyTorch Dataset Objects

The `AudioFeaturesDataset` class converts raw data-label pairs into PyTorch-compatible datasets that provide easy access to samples and labels.

`AudioFeaturesDataset` is a custom dataset class that:

- Loads audio recordings of spoken digits along with their labels.
- Optionally cleans the audio by filtering out noise.
- Extracts MFCC features (a common speech feature).
- Pads or trims these features to a fixed length so all inputs have the same shape.
- Works with PyTorch to provide samples one by one when training or testing a model.

In short, it prepares the audio data in the format the network expects for training and evaluation.
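
The pad-or-trim step in the list above can be sketched in plain Python. This is only an illustration, not the code inside `AudioFeaturesDataset`: the function name `pad_or_trim`, the `(time, coefficients)` layout, and zero-padding at the end of the sequence are all assumptions.

```python
def pad_or_trim(frames, target_len, pad_value=0.0):
    """Return exactly `target_len` frames (hypothetical helper).

    `frames` is assumed to be a list of per-time-step feature vectors,
    e.g. one list of MFCC coefficients per frame.
    """
    if not frames:
        raise ValueError("expected at least one frame")
    n_coeffs = len(frames[0])
    # Trim: keep at most `target_len` frames from the start.
    trimmed = [list(frame) for frame in frames[:target_len]]
    # Pad: append constant frames until the target length is reached.
    padding = [[pad_value] * n_coeffs for _ in range(target_len - len(trimmed))]
    return trimmed + padding


short_clip = [[1.0, 2.0], [3.0, 4.0]]
print(pad_or_trim(short_clip, 3))  # [[1.0, 2.0], [3.0, 4.0], [0.0, 0.0]]
```

With every sample forced to the same number of frames, a `DataLoader` can stack samples into a batch without a custom collate function.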


In [8]:
from src.data_preprocessor import AudioFeaturesDataset

train_dataset = AudioFeaturesDataset(train_data)
test_dataset = AudioFeaturesDataset(test_data)

In [9]:
print(f"Train size: {len(train_dataset)}")
print(f"Test size: {len(test_dataset)}")

Train size: 2400
Test size: 600


## Create DataLoaders for Batch Processing

Using PyTorch DataLoaders, we enable efficient loading, batching, and shuffling of data during training and evaluation.

In [10]:
import torch

train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=32, shuffle=True)
test_loader = torch.utils.data.DataLoader(test_dataset, batch_size=32, shuffle=False)

In [11]:
input_dim = model_config['model']['input_dim']
hidden_dim = model_config['model']['hidden_dim']
num_layers = model_config['model']['num_layers']
output_dim = model_config['model']['output_dim']

In [12]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

## Task B — Retrain Under Memory Constraints

- All parameters of any one layer must fit into memory simultaneously
- Maximum memory available for layer parameters is 36 KB

Since PyTorch stores layer parameters as 32-bit floats by default (4 bytes per parameter), the maximum number of parameters must not exceed

$$
\text{Max total number of parameters} = \frac{36\,\text{KB}}{4\,\text{bytes}} = \frac{36 \times 1024}{4} = 9,216 \text{ parameters}
$$
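
The same budget arithmetic as a quick check in Python. The helper name `max_params` is made up for this sketch, and it assumes 1 KB = 1024 bytes, as in the formula above.

```python
BYTES_PER_KB = 1024
BYTES_PER_FP32 = 4  # one 32-bit float


def max_params(budget_kb, bytes_per_param=BYTES_PER_FP32):
    """Number of whole parameters that fit into `budget_kb` kilobytes."""
    return (budget_kb * BYTES_PER_KB) // bytes_per_param


print(max_params(36))  # 9216
```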

## Model Parameter Breakdown for a 2-Layer LSTM
The following calculations are based on the parameter definitions from PyTorch's LSTM implementation, as described in the [official documentation](https://docs.pytorch.org/docs/stable/generated/torch.nn.LSTM.html#torch.nn.LSTM).

Let’s define the following variables:

$$
\begin{aligned}
\text{Input dimension} &= i \\
\text{Hidden dimension} &= h \\
\text{Output dimension} &= o \\
\text{Number of LSTM layers} &= 2 \\
\text{Fully Connected (Linear) layer count} &= 1 \\
\end{aligned}
$$

These will be used to calculate the total number of parameters in the model.

---

## LSTM Layer Parameters

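Each LSTM layer has 4 internal gates (input, forget, cell, output), and PyTorch stacks their weights along the first dimension.  
So each layer has the following tensors, where the input size is $i$ for the first layer and $h$ for every later layer: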

$$
\begin{aligned}
W_{ih} &: \text{Weights from input to hidden} \rightarrow \text{shape: } (4 \times h, i) \\
W_{hh} &: \text{Weights from hidden to hidden} \rightarrow \text{shape: } (4 \times h, h) \\
b_{ih}, b_{hh} &: \text{Biases for each gate} \rightarrow \text{shape: } (4 \times h)
\end{aligned}
$$

### Calculations:

$$
\begin{aligned}
&\textbf{Layer 1 Parameters} \\
&W_{ih}: 4 \times h \times i \\
&W_{hh}: 4 \times h \times h \\
&b_{ih}: 4 \times h \\
&b_{hh}: 4 \times h \\
&\\
&\textbf{Layer 2 Parameters} \\
&W_{ih}: 4 \times h \times h \\
&W_{hh}: 4 \times h \times h \\
&b_{ih}: 4 \times h \\
&b_{hh}: 4 \times h \\
&\\
&\text{Total LSTM parameters} = \underbrace{(4hi + 4h^2 + 8h)}_{\text{layer 1}} + \underbrace{(8h^2 + 8h)}_{\text{layer 2}} = 4hi + 12h^2 + 16h
\end{aligned}
$$
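
As a sanity check on the tensor shapes listed above, the per-tensor counts can be tallied in a few lines of Python. This sketch does not import PyTorch; the function name is hypothetical, and the values $i = 13$ and $h = 24$ are the ones this notebook uses further down.

```python
def lstm_layer_param_counts(input_size, hidden_size):
    """Parameter count of each tensor in one LSTM layer (4 stacked gates)."""
    return {
        "weight_ih": 4 * hidden_size * input_size,
        "weight_hh": 4 * hidden_size * hidden_size,
        "bias_ih": 4 * hidden_size,
        "bias_hh": 4 * hidden_size,
    }


i, h = 13, 24
layer_1 = lstm_layer_param_counts(i, h)  # first layer reads the MFCC input
layer_2 = lstm_layer_param_counts(h, h)  # second layer reads layer 1's hidden state
lstm_total = sum(layer_1.values()) + sum(layer_2.values())

print(layer_1)     # {'weight_ih': 1248, 'weight_hh': 2304, 'bias_ih': 96, 'bias_hh': 96}
print(layer_2)     # {'weight_ih': 2304, 'weight_hh': 2304, 'bias_ih': 96, 'bias_hh': 96}
print(lstm_total)  # 8544
```

These numbers match the `lstm.*` rows that `utils.get_model_params_size` prints after training.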

---

## Fully Connected (Linear) Layer

- Input features = $\text{hidden\_dim} = h$  
- Output features = $\text{output\_dim} = o$

### Calculations:
$$
\begin{aligned}
&\text{Weights}: \quad h \times o \\
&\text{Bias}: \quad o \\
&\textbf{Total Linear parameters} = h \cdot o + o = o(h + 1)
\end{aligned}
$$

---

## Total Parameters

The total number of parameters in the model is the sum of the LSTM and Linear layer parameters:

$$
\text{Total Parameters} = \text{Total LSTM Parameters} + \text{Total Linear Parameters}
$$

From earlier calculations:

- $\text{Total LSTM Parameters} = (4hi + 4h^2 + 8h) + (8h^2 + 8h) = 4hi + 12h^2 + 16h$ (layer 1 plus layer 2)
- $\text{Total Linear Parameters} = h \cdot o + o = o(h + 1)$

Therefore:

$$
\text{Total Parameters} = (4hi + 12h^2 + 16h) + o(h + 1)
$$

Or more compactly:

$$
\boxed{\text{Total Parameters} = 4hi + 12h^2 + 16h + o(h + 1)}
$$
$$
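
Summing both LSTM layers and the linear layer can be written as a small function, which makes it easy to cross-check the count against the 8,794 parameters reported for the trained model later in this notebook. The function name `total_params` is an illustrative choice, not part of `src.utils`.

```python
def total_params(h, i, o):
    """Parameters of a 2-layer LSTM (hidden size h, input size i) plus a linear head (o outputs)."""
    lstm_layer_1 = 4 * h * i + 4 * h * h + 8 * h  # W_ih, W_hh, b_ih + b_hh
    lstm_layer_2 = 4 * h * h + 4 * h * h + 8 * h  # input to layer 2 has size h
    linear = o * (h + 1)                          # weights plus bias
    return lstm_layer_1 + lstm_layer_2 + linear


print(total_params(h=24, i=13, o=10))  # 8794
```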

---


## Designing a 2-Layer LSTM Under a 36 KB Memory Constraint

Given an output dimension of $o = 10$ (one class per digit) and an input dimension of $i = 13$ (13 MFCC coefficients per time step), the total number of parameters in the model reduces to a quadratic in $h$:

$$
\text{Total Parameters} = 4h(13) + 12h^2 + 16h + 10(h + 1) = 12h^2 + 78h + 10
$$

Here, $h$ (the hidden dimension) is the only variable left to solve for.

Since the memory constraint allows for a maximum of **9,216 parameters**, the hidden dimension must satisfy:

$$
12h^2 + 78h + 10 \leq 9216
$$

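The largest integer $h$ that satisfies the inequality is $h = 24$, which gives $12(24)^2 + 78(24) + 10 = 8794$ parameters; $h = 25$ would give $9460$, which exceeds the limit:
$$
h = 24
$$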
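
The bound on the hidden dimension can also be found by a short search over the quadratic above. This is a minimal sketch with hypothetical function names; it assumes the parameter count grows with the hidden dimension, which holds for this polynomial.

```python
def params_for_hidden(h):
    """Total parameters as a function of the hidden dimension (i = 13, o = 10)."""
    return 12 * h * h + 78 * h + 10


def largest_hidden_dim(budget):
    """Largest h whose parameter count stays within `budget`."""
    h = 0
    while params_for_hidden(h + 1) <= budget:
        h += 1
    return h


best_h = largest_hidden_dim(9216)
print(best_h, params_for_hidden(best_h), params_for_hidden(best_h + 1))  # 24 8794 9460
```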

---

## LSTM Model Definition

A simple `n`-layer LSTM followed by a fully connected output layer. The number of layers `n` is read from the configuration YAML file (`num_layers`).

In [13]:
import src.model as model_with_constraints

hidden_dim = 24
memory_constraint_model = model_with_constraints.LSTMClassifier(input_dim=input_dim,
                                           hidden_dim=hidden_dim,
                                           num_layers=num_layers,
                                           output_dim=output_dim).to(device)

## Training Loop

In [14]:
learning_rate = model_config['training']['learning_rate']
epochs = model_config['training']['epochs']

In [15]:
from src.train import ModelTrainer
trainer_instance_2 = ModelTrainer(
    memory_constraint_model, 
    epochs,
    train_loader,
    device,
    learning_rate
)

In [16]:
trainer_instance_2.train()

Epoch [1/20], Loss: 171.4266, Accuracy: 26.54%
Epoch [2/20], Loss: 136.4815, Accuracy: 48.58%
Epoch [3/20], Loss: 99.6908, Accuracy: 58.12%
Epoch [4/20], Loss: 86.6863, Accuracy: 63.04%
Epoch [5/20], Loss: 73.5414, Accuracy: 70.42%
Epoch [6/20], Loss: 68.6883, Accuracy: 71.54%
Epoch [7/20], Loss: 62.5110, Accuracy: 71.00%
Epoch [8/20], Loss: 55.6682, Accuracy: 78.33%
Epoch [9/20], Loss: 47.1418, Accuracy: 84.04%
Epoch [10/20], Loss: 48.2762, Accuracy: 85.00%
Epoch [11/20], Loss: 41.0196, Accuracy: 87.62%
Epoch [12/20], Loss: 34.3845, Accuracy: 88.96%
Epoch [13/20], Loss: 31.9757, Accuracy: 87.25%
Epoch [14/20], Loss: 29.9293, Accuracy: 90.00%
Epoch [15/20], Loss: 31.3984, Accuracy: 89.71%
Epoch [16/20], Loss: 26.2656, Accuracy: 92.17%
Epoch [17/20], Loss: 24.9169, Accuracy: 92.67%
Epoch [18/20], Loss: 23.2875, Accuracy: 92.67%
Epoch [19/20], Loss: 21.0122, Accuracy: 95.00%
Epoch [20/20], Loss: 17.6823, Accuracy: 95.54%


In [17]:
_, _ = utils.get_model_params_size(memory_constraint_model)


Layer-wise parameter counts:
lstm.weight_ih_l0              -> 1,248 params
lstm.weight_hh_l0              -> 2,304 params
lstm.bias_ih_l0                -> 96 params
lstm.bias_hh_l0                -> 96 params
lstm.weight_ih_l1              -> 2,304 params
lstm.weight_hh_l1              -> 2,304 params
lstm.bias_ih_l1                -> 96 params
lstm.bias_hh_l1                -> 96 params
fc.weight                      -> 240 params
fc.bias                        -> 10 params


 Total Parameters: 8,794
Estimated Memory: 34.35 KB (0.03 MB)


## Evaluation & Visualization

In [18]:
from src.evaluate import ModelEvaluator

In [19]:
test_instance_2 = ModelEvaluator(
    memory_constraint_model, 
    test_loader,
    device
)

In [20]:
test_instance_2.evaluate()


 Accuracy on test data: 91.33%

 Classification Report:
              precision    recall  f1-score   support

           0     0.9589    0.9722    0.9655        72
           1     0.9077    0.8551    0.8806        69
           2     0.9423    0.8596    0.8991        57
           3     0.9455    0.9286    0.9369        56
           4     0.9048    0.9661    0.9344        59
           5     0.9344    0.9048    0.9194        63
           6     0.9038    0.8393    0.8704        56
           7     0.8361    0.9273    0.8793        55
           8     0.8438    0.9474    0.8926        57
           9     0.9630    0.9286    0.9455        56

    accuracy                         0.9133       600
   macro avg     0.9140    0.9129    0.9124       600
weighted avg     0.9153    0.9133    0.9133       600


 Confusion Matrix:
[[70  0  2  0  0  0  0  0  0  0]
 [ 0 59  0  0  5  1  0  4  0  0]
 [ 3  0 49  1  0  1  0  2  0  1]
 [ 0  0  1 52  0  0  2  0  1  0]
 [ 0  2  0  0 57  0  0  0  0  0]

### Compute Inference Time

In [27]:
utils.compute_inference_time(memory_constraint_model, test_loader)

Median inference time: 0.4450 ms


### Save the model

In [28]:
memory_constraint_model_save_path = os.path.join(project_root_dir, 'outputs', 'models', 'task-b-part-1_weights.pth')
torch.save(memory_constraint_model.state_dict(), memory_constraint_model_save_path)

## Do the Model's Layer Parameters Meet the 36 KB Memory Constraint?

In [29]:
utils.print_float_model_analysis(memory_constraint_model)

Layer Name                | Num Parameters | Size (Memory)
----------------------------------------------------------------------
lstm.weight_ih_l0         |          1248 | 4.875 KB 
lstm.weight_hh_l0         |          2304 | 9.000 KB 
lstm.bias_ih_l0           |            96 | 0.375 KB 
lstm.bias_hh_l0           |            96 | 0.375 KB 
lstm.weight_ih_l1         |          2304 | 9.000 KB 
lstm.weight_hh_l1         |          2304 | 9.000 KB 
lstm.bias_ih_l1           |            96 | 0.375 KB 
lstm.bias_hh_l1           |            96 | 0.375 KB 
fc.weight                 |           240 | 0.938 KB 
fc.bias                   |            10 | 0.039 KB 

📊 Total Model Summary
Total number of parameters:      8794
Estimated total size (FP32):    34.35 KB (0.03 MB)
Memory per parameter (FP32):    4 bytes
Meets 36KB per-layer limit?     ✅ Yes
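
The table lists sizes per tensor, while the constraint is stated per layer. The sketch below groups the tensor counts from the table by layer and checks each group against the 36 KB budget. It assumes that a "layer" means one stacked LSTM layer (its four tensors together) or the linear head; how `utils.print_float_model_analysis` decides this internally is not shown here, and the grouping helper is hypothetical.

```python
BUDGET_BYTES = 36 * 1024
BYTES_PER_PARAM = 4  # FP32

# Parameter counts copied from the table above.
tensor_counts = {
    "lstm.weight_ih_l0": 1248, "lstm.weight_hh_l0": 2304,
    "lstm.bias_ih_l0": 96, "lstm.bias_hh_l0": 96,
    "lstm.weight_ih_l1": 2304, "lstm.weight_hh_l1": 2304,
    "lstm.bias_ih_l1": 96, "lstm.bias_hh_l1": 96,
    "fc.weight": 240, "fc.bias": 10,
}


def group_by_layer(counts):
    """Sum tensor counts per layer: 'lstm.weight_ih_l0' -> 'lstm.l0', 'fc.bias' -> 'fc'."""
    groups = {}
    for name, n_params in counts.items():
        module, _, tensor = name.partition(".")
        suffix = tensor.rsplit("_", 1)[-1]
        is_layer_suffix = suffix.startswith("l") and suffix[1:].isdigit()
        key = f"{module}.{suffix}" if is_layer_suffix else module
        groups[key] = groups.get(key, 0) + n_params
    return groups


per_layer = group_by_layer(tensor_counts)
fits = {name: n * BYTES_PER_PARAM <= BUDGET_BYTES for name, n in per_layer.items()}

for name, n_params in per_layer.items():
    size_kb = n_params * BYTES_PER_PARAM / 1024
    print(f"{name:8s} {n_params:5d} params  {size_kb:7.3f} KB  fits={fits[name]}")
```

Under this grouping the largest layer is the second LSTM layer at 4,800 parameters (18.75 KB), so every layer fits with room to spare, and so does the whole model at 8,794 parameters.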
