A Swift library for extracting mel-spectrogram features compatible with NVIDIA NeMo speech models. Designed for iOS/macOS applications using CoreML.
- Exact compatibility with NeMo's feature extraction pipeline
- Supports VAD, Speaker Recognition, and ASR models
- High performance using Apple's Accelerate framework (vDSP)
- Pre-computed mel filterbank from NeMo for maximum accuracy
- Output as `[[Float]]` or `MLMultiArray` for CoreML inference

| Model Type | Config Preset | Use Case |
|---|---|---|
| VAD | `.nemoVAD` | Voice Activity Detection (MarbleNet) |
| Speaker | `.nemoSpeaker` | Speaker Verification/Identification (TitaNet) |
| ASR | `.nemoASR` | Speech Recognition (Parakeet, Conformer) |

Add to your `Package.swift`:

    dependencies: [
        .package(url: "https://github.com/Otosaku/NeMoFeatureExtractor-iOS.git", from: "1.0.5")
    ]

Or in Xcode: File → Add Package Dependencies → Enter repository URL.

Basic usage:

    import NeMoFeatureExtractor

    // Create an extractor with the desired config
    let extractor = NeMoFeatureExtractor(config: .nemoVAD)

    // Process audio samples (Float32, mono, 16 kHz)
    let audioSamples: [Float] = loadAudio() // Your audio loading code
    let features = try extractor.process(samples: audioSamples)
    // features: [[Float]] with shape [80, numFrames]

To feed a CoreML model, request an `MLMultiArray` directly:

    let extractor = NeMoFeatureExtractor(config: .nemoSpeaker)

    // Get MLMultiArray directly for CoreML
    let mlFeatures = try extractor.processToMLMultiArray(samples: audioSamples)
    // mlFeatures: MLMultiArray with shape [1, 80, numFrames]

    // Use with your CoreML model
    let prediction = try model.prediction(audio_signal: mlFeatures)

For a custom configuration:

    let customConfig = MelSpectrogramConfig(
        sampleRate: 16000,
        nMels: 80,
        nFFT: 512,
        windowSize: 400,                    // 25 ms at 16 kHz
        hopLength: 160,                     // 10 ms at 16 kHz
        fMin: 0.0,
        fMax: nil,                          // Nyquist frequency
        normalization: .perFeature,
        melNorm: .slaney,
        logEpsilon: 5.960464477539063e-08,  // 2^-24
        center: true,
        preemph: 0.97,
        padTo: 16
    )
    let extractor = NeMoFeatureExtractor(config: customConfig)

The presets configure the extractor as follows:

- `.nemoVAD`
  - No normalization
  - `padTo: 2`
  - For MarbleNet and similar VAD models
- `.nemoSpeaker`
  - Per-feature normalization
  - `padTo: 16`
  - For TitaNet and speaker embedding models
- `.nemoASR`
  - Per-feature normalization
  - `padTo: 0` (no padding)
  - For Parakeet, Conformer, and other ASR models

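
The window, hop, and `padTo` values above determine how many frames come out for a given input length. The following is a minimal sketch of that arithmetic, not the library's code: the helper names are hypothetical, and the frame count assumes the common centered-STFT convention of `1 + numSamples / hopLength`, which this document does not confirm.

```swift
import Foundation

/// Hypothetical helper (not part of the library): duration of `samples` in milliseconds.
func durationMs(samples: Int, sampleRate: Int) -> Double {
    return Double(samples) * 1000.0 / Double(sampleRate)
}

/// Assumed convention for a centered STFT: one frame per full hop, plus one.
func centeredFrameCount(numSamples: Int, hopLength: Int) -> Int {
    return 1 + numSamples / hopLength
}

/// Rounds `frames` up to the next multiple of `padTo`; a `padTo` of 0 means no padding.
func paddedFrameCount(frames: Int, padTo: Int) -> Int {
    guard padTo > 0 else { return frames }
    let remainder = frames % padTo
    return remainder == 0 ? frames : frames + (padTo - remainder)
}

// One second of 16 kHz audio with a 10 ms hop.
let frames = centeredFrameCount(numSamples: 16000, hopLength: 160)
print(durationMs(samples: 400, sampleRate: 16000))  // 25.0
print(frames)                                       // 101
print(paddedFrameCount(frames: frames, padTo: 2))   // 102
print(paddedFrameCount(frames: frames, padTo: 16))  // 112
print(paddedFrameCount(frames: frames, padTo: 0))   // 101
```

Under these assumptions, the same one-second clip yields 102, 112, or 101 frames depending on the `padTo` value, so the time dimension of the model input depends on the chosen configuration.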
The feature extraction pipeline applies these steps in order:

- Pre-emphasis: `y[n] = x[n] - 0.97 * x[n-1]`
- STFT: Center-padded, Hann window (symmetric)
- Power Spectrum: `|FFT|²`
- Mel Filterbank: 80 mel bands, Slaney normalization
- Log Transform: `log(mel + epsilon)`
- Normalization: Per-feature mean/std (optional)
- Padding: To multiple of `padTo` (optional)

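
Three of the pipeline steps (pre-emphasis, log transform, and per-feature normalization) can be sketched in plain Swift as below. This is an illustration only: the function names are hypothetical, and the library itself is described as using Accelerate. The normalization assumes a sample standard deviation (dividing by N - 1) and a small stabilizing constant of `1e-5`; neither detail is stated in this document.

```swift
import Foundation

/// Pre-emphasis: y[n] = x[n] - coeff * x[n-1]; the first sample is kept unchanged.
func preEmphasis(_ x: [Float], coeff: Float = 0.97) -> [Float] {
    guard !x.isEmpty else { return [] }
    var y = x
    for n in 1..<x.count {
        y[n] = x[n] - coeff * x[n - 1]
    }
    return y
}

/// Log transform: log(mel + epsilon), applied elementwise to one mel band.
func logMel(_ band: [Float], epsilon: Float = 5.960464477539063e-08) -> [Float] {
    return band.map { log($0 + epsilon) }
}

/// Per-feature normalization: each mel band (row) is shifted to zero mean and
/// scaled by its standard deviation across time.
/// Assumption: sample std (N - 1) plus a small stabilizer.
func normalizePerFeature(_ features: [[Float]], stabilizer: Float = 1e-5) -> [[Float]] {
    return features.map { (band: [Float]) -> [Float] in
        guard band.count > 1 else { return band }
        let mean = band.reduce(0, +) / Float(band.count)
        let sumSquares = band.map { ($0 - mean) * ($0 - mean) }.reduce(0, +)
        let std = (sumSquares / Float(band.count - 1)).squareRoot() + stabilizer
        return band.map { ($0 - mean) / std }
    }
}

// Tiny worked example with made-up values.
let emphasized = preEmphasis([1.0, 1.0, 1.0])
let logged = logMel([0.0, 1.0])
let normalized = normalizePerFeature([[1.0, 2.0, 3.0]])
print(emphasized, logged, normalized)
```

The epsilon in `logMel` matches the `logEpsilon` value from the custom configuration, which keeps the logarithm finite when a mel band is exactly zero.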
Tested against NeMo Python reference with maximum difference < 6e-05 (floating point precision).
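
A comparison of this kind can be reproduced with a small helper that computes the largest absolute elementwise difference between two feature matrices. The sketch below is hypothetical and not part of the library; it assumes the reference features have been exported from Python and loaded as `[[Float]]` by your own code, and the sample values are placeholders.

```swift
/// Hypothetical helper: largest absolute elementwise difference between two
/// feature matrices. Returns nil when the shapes do not match.
func maxAbsDifference(_ a: [[Float]], _ b: [[Float]]) -> Float? {
    guard a.count == b.count else { return nil }
    var worst: Float = 0
    for (rowA, rowB) in zip(a, b) {
        guard rowA.count == rowB.count else { return nil }
        for (x, y) in zip(rowA, rowB) {
            worst = max(worst, abs(x - y))
        }
    }
    return worst
}

let swiftOutput: [[Float]] = [[0.10, 0.20], [0.30, 0.40]]    // placeholder values
let reference: [[Float]] = [[0.10, 0.20002], [0.30, 0.40]]   // placeholder values
if let diff = maxAbsDifference(swiftOutput, reference) {
    print(diff < 6e-05)  // true
}
```

Returning `nil` on a shape mismatch keeps a frame-count disagreement from being mistaken for a numerical one.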
- iOS 14.0+ / macOS 11.0+
- Swift 5.9+
MIT License
- NVIDIA NeMo - Original Python implementation
- Apple Accelerate framework for optimized DSP operations