Real-time speaker identification using MFCC + TensorFlow Lite Micro (TFLM).
Author: Mehedi
Course / Group: EEE 322 — Digital Signal Processing I Lab / ODD
Date: 9 December 2025
This project runs a real-time speaker recognition system on the STM32F746G‑DISCOVERY board. The pipeline captures audio from the on-board microphone, converts it to PCM, extracts MFCC features, and performs on‑device inference with a quantized TensorFlow Lite Micro model. Results are shown on the TFT and output over serial.
Prerequisites:
- STM32CubeIDE or ARM GCC toolchain + make
- git and git-lfs (recommended for dataset/model binaries)
- ST‑Link or compatible programmer
Build and flash:
- Open `Audio_Classification.ioc` in STM32CubeMX/STM32CubeIDE and generate project files.
- Build in STM32CubeIDE or run the Makefile under `Debug/`.
- Flash with STM32CubeIDE, st-flash, or OpenOCD.
If you encounter LFS-related problems after a migration, reclone or run:

```sh
git fetch origin
git reset --hard origin/main
```

Features:
- Real-time audio capture using WM8994 + on-board MEMS microphone (MP34DT01)
- PDM → PCM conversion with DMA
- MFCC extraction (40 coefficients) on the MCU
- Quantized TFLite (int8) model for TFLM inference
- TFT LCD + serial output for predicted speaker
- Dataset: ~2,595 one‑second clips across 5 speakers
Hardware:
- STM32F746G‑DISCOVERY (Cortex‑M7 @ 216 MHz, 1 MB Flash, 320 KB SRAM)
- MP34DT01 MEMS microphone (on-board)
- WM8994 audio codec
- 4.3" TFT LCD
- SD card (optional)
- ST‑Link for flashing & debugging
Recommended versions used during development:
- STM32CubeIDE v1.19, STM32CubeF7 HAL v1.26.0
- Python 3.8+, librosa v0.10.0, NumPy v1.25.0
- TensorFlow 2.15.0, TensorFlow Lite Micro (manually integrated)
- Jupyter Notebook v6.5+, optional CMSIS‑DSP
Project structure:

```
.
├── README.md
├── dataset/               # recorded WAV clips (per speaker)
├── python/                # preprocessing, training, conversion scripts
│   ├── split_audio.py
│   ├── extract_mfcc.py
│   ├── train_model.py
│   ├── convert_to_tflite.py
│   └── ai_test.py
├── stm32/                 # STM32 project and integration files
│   ├── Core/
│   ├── Drivers/
│   ├── Middlewares/
│   └── xx_model_data.cc   # model C array (generated)
└── report.tex
```
Dataset preparation:
- Recordings were split into 1‑second WAV clips using `pydub`. Total ≈ 2,595 clips across 5 speakers.
- Example split script (`python/split_audio.py`):

```python
from pydub import AudioSegment
import math, os

audio = AudioSegment.from_file("fa him.wav")
chunk_length_ms = 1000
os.makedirs("fa_him", exist_ok=True)
for i in range(math.ceil(len(audio) / chunk_length_ms)):
    start = i * chunk_length_ms
    end = start + chunk_length_ms
    audio[start:end].export(f"fa_him/clip_{i+1:03}.wav", format="wav")
```

- Extracted 40 MFCC coefficients per clip using `librosa.feature.mfcc`.
- Standard pipeline: pre‑emphasis → framing → Hamming window → FFT → Mel filters → log energy → DCT → MFCC. See `python/extract_mfcc.py`; a minimal sketch follows below.
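A minimal sketch of the extraction step (an assumed shape of `python/extract_mfcc.py`; the 16 kHz sample rate, per-speaker folder layout, and frame averaging are assumptions, not confirmed details of the actual script):

```python
import os
import numpy as np
import librosa

def extract_features(wav_path, n_mfcc=40, sr=16000):
    """Return a 40-dimensional MFCC feature vector for one 1-second clip."""
    # librosa handles framing, windowing, FFT, Mel filtering, log, and DCT internally.
    signal, sr = librosa.load(wav_path, sr=sr)
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc)
    # Average coefficients across frames so each clip maps to a single vector.
    return np.mean(mfcc, axis=1)

# Hypothetical usage: build a feature matrix from per-speaker folders under dataset/.
features, labels = [], []
for label, speaker in enumerate(sorted(os.listdir("dataset"))):
    speaker_dir = os.path.join("dataset", speaker)
    for clip in sorted(os.listdir(speaker_dir)):
        features.append(extract_features(os.path.join(speaker_dir, clip)))
        labels.append(label)
X, y = np.array(features), np.array(labels)
```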
Example architecture (reference; a Keras sketch follows the list):
- Input: 40 MFCC features
- Dense(100) + ReLU + Dropout(0.2)
- Dense(200) + ReLU + Dropout(0.2)
- Dense(100) + ReLU + Dropout(0.2)
- Output: Softmax (5 classes)
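A minimal Keras sketch of this reference architecture (layer sizes taken from the list above; the actual `python/train_model.py` may differ):

```python
import tensorflow as tf
from tensorflow.keras import layers, models

# Reference architecture: 40 MFCC inputs -> 100 -> 200 -> 100 -> 5-way softmax.
model = models.Sequential([
    layers.Input(shape=(40,)),
    layers.Dense(100, activation='relu'),
    layers.Dropout(0.2),
    layers.Dense(200, activation='relu'),
    layers.Dropout(0.2),
    layers.Dense(100, activation='relu'),
    layers.Dropout(0.2),
    layers.Dense(5, activation='softmax'),
])
model.summary()
```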
Training snippet:
```python
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(X_train, y_train, epochs=100, batch_size=32)
model.save("audio_classification_100.keras")
```

Convert to quantized TFLite (int8):
```python
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset_gen
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8
tflite_model = converter.convert()

with open("audio_class_quant.tflite", "wb") as f:
    f.write(tflite_model)
```

Typical quantized model size: ~57 KB.
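The snippet above references `representative_dataset_gen` without defining it. A minimal sketch, assuming the float MFCC training matrix `X_train` from the training step is available:

```python
import numpy as np

def representative_dataset_gen():
    """Yield a few hundred real MFCC feature vectors so the converter can
    calibrate int8 quantization ranges. Data that spans the expected audio
    variance (speakers, volume, noise) reduces quantization loss."""
    for sample in X_train[:300]:
        # The converter expects a list of float32 input tensors with a batch dimension.
        yield [np.expand_dims(sample.astype(np.float32), axis=0)]
```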
Key firmware components:
- record.c — read audio from WM8994 via DMA
- mfcc.c — compute MFCC on MCU
- ai_on() — perform inference using TFLM/X‑CUBE‑AI integration
- display.c — update TFT with the predicted speaker
- role.c — application state machine
Workflow: Microphone → WM8994 → PCM → MFCC → NN Model → LCD / Serial
Results:
- Real-time latency: ≈ 300 ms
- On-device accuracy: ≈ 70% (depends on noise, placement, quantization)
- Offline training accuracy: up to ≈ 99.8% (likely overfit; validate on held-out test set)
- Max tested distance: ≈ 1 meter
Known challenges:
- Background noise and variable speech volume reduce performance.
- RAM constraints when holding MFCC buffers and TFLM interpreter simultaneously.
- Quantization can degrade accuracy — use a representative dataset for calibration.
Troubleshooting:
- If the model fails to load or inference crashes, check available RAM and stack sizes in the linker script and FreeRTOS task configuration.
- If captured audio is noisy or clipped, verify WM8994 initialization and DMA buffer sizes.
- For quantization issues, ensure the representative data covers the expected audio variance; a host-side sanity check of the quantized model is sketched after this list.
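Before debugging on the MCU, the quantized model can be sanity-checked on the host with the TFLite interpreter, roughly in the spirit of `python/ai_test.py` (the file name, input shape, and scaling steps here are assumptions):

```python
import numpy as np
import tensorflow as tf

# Load the quantized model and allocate tensors.
interpreter = tf.lite.Interpreter(model_path="audio_class_quant.tflite")
interpreter.allocate_tensors()
input_details = interpreter.get_input_details()[0]
output_details = interpreter.get_output_details()[0]

# Quantize a float MFCC feature vector to int8 using the model's scale/zero-point.
scale, zero_point = input_details["quantization"]
mfcc_features = np.random.randn(40).astype(np.float32)  # replace with a real clip's MFCCs
quantized = np.clip(np.round(mfcc_features / scale + zero_point), -128, 127).astype(np.int8)

interpreter.set_tensor(input_details["index"], quantized.reshape(1, 40))
interpreter.invoke()
scores = interpreter.get_tensor(output_details["index"])[0]
print("Predicted speaker index:", int(np.argmax(scores)))
```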
Contributing:
- Open an issue for bugs, feature requests, or questions.
- Submit PRs for fixes, model improvements, or CI additions.
- After history rewrite or LFS migration, ask collaborators to reclone.
See LICENSE_X-CUBE-AI.txt for licensing details applicable to generated AI code and model artifacts.
For questions or support, open an issue in this repository.





