Kotlin library for extracting mel spectrograms compatible with NVIDIA NeMo models on Android.
- NeMo-compatible mel spectrogram extraction
- Support for VAD (MarbleNet), ASR (Conformer, Parakeet), and Speaker (TitaNet) models
- Pre-computed NeMo filterbank for maximum accuracy
- Pure Kotlin implementation with no external dependencies
- Configurable normalization modes
- Requires Android API 24+
- Requires Kotlin 1.9+
Add the JitPack repository to your project's settings.gradle.kts:

    dependencyResolutionManagement {
        repositories {
            maven { url = uri("https://jitpack.io") }
        }
    }

Add the dependency to your module's build.gradle.kts:

    dependencies {
        implementation("com.github.Otosaku:NeMoFeatureExtractor-Android:1.0.0")
    }

Import the extractor and configuration classes, then create an extractor with the preset that matches your model:

    import com.otosaku.nemofeatureextractor.NeMoFeatureExtractor
    import com.otosaku.nemofeatureextractor.MelSpectrogramConfig

    // For VAD (MarbleNet)
    val vadExtractor = NeMoFeatureExtractor(context, MelSpectrogramConfig.nemoVAD)
    val vadFeatures = vadExtractor.process(audioSamples)

    // For ASR (Conformer, Parakeet)
    val asrExtractor = NeMoFeatureExtractor(context, MelSpectrogramConfig.nemoASR)
    val asrFeatures = asrExtractor.process(audioSamples)

    // For Speaker (TitaNet)
    val speakerExtractor = NeMoFeatureExtractor(context, MelSpectrogramConfig.nemoSpeaker)
    val speakerFeatures = speakerExtractor.process(audioSamples)

The extractor can also be constructed without a context:

    val extractor = NeMoFeatureExtractor(MelSpectrogramConfig.nemoVAD)
    val features = extractor.process(audioSamples)

To use custom parameters, build a MelSpectrogramConfig yourself:

    val config = MelSpectrogramConfig(
        sampleRate = 16000,
        nMels = 80,
        nFFT = 512,
        windowSize = 400,
        hopLength = 160,
        normalization = NormalizationMode.PER_FEATURE,
        preemph = 0.97f
    )
    val extractor = NeMoFeatureExtractor(context, config)

The extractor expects input audio in the following format:

- Sample rate: 16,000 Hz
- Channels: Mono
- Format: Float32 array
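
Audio recorded on Android often arrives as 16-bit PCM, sometimes in stereo, so it may need converting before it matches the format above. The following is a minimal sketch with hypothetical helper names (`pcm16ToFloat`, `stereoToMono`) that are not part of this library. It assumes samples should be scaled to the range [-1.0, 1.0), which the README does not state explicitly, and it does not resample: audio at another rate still has to be converted to 16,000 Hz separately.

```kotlin
// Hypothetical helpers for illustration; not part of NeMoFeatureExtractor.

/** Scales signed 16-bit PCM samples to floats in [-1.0, 1.0) (assumed range). */
fun pcm16ToFloat(pcm: ShortArray): FloatArray =
    FloatArray(pcm.size) { i -> pcm[i].toFloat() / 32768f }

/** Averages interleaved stereo samples (L, R, L, R, ...) into a single mono channel. */
fun stereoToMono(interleaved: FloatArray): FloatArray =
    FloatArray(interleaved.size / 2) { i ->
        (interleaved[2 * i] + interleaved[2 * i + 1]) / 2f
    }
```

The resulting FloatArray can then be passed to process() in place of audioSamples.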
| Preset | Normalization | Pad To | Use Case |
|---|---|---|---|
| nemoVAD | None | 2 | Voice Activity Detection (MarbleNet) |
| nemoASR | Per-feature | 0 | Speech Recognition (Conformer, Parakeet) |
| nemoSpeaker | Per-feature | 16 | Speaker Verification (TitaNet) |
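
To make the Normalization and Pad To columns concrete, here is a rough sketch of what these two steps usually mean in NeMo-style preprocessing. It is an illustration under assumptions, not this library's actual code: the function names are hypothetical, per-feature normalization is assumed to standardize each mel bin over time using the sample standard deviation plus a small epsilon, and padding is assumed to extend the frame axis with zeros up to the next multiple of the Pad To value (with 0 meaning no padding).

```kotlin
import kotlin.math.sqrt

// Hypothetical sketches for illustration; the library's internals may differ.

/** Normalizes each mel bin (row) to zero mean and unit variance across time. */
fun normalizePerFeature(mel: Array<FloatArray>, eps: Float = 1e-5f): Array<FloatArray> =
    mel.map { row ->
        val mean = row.average().toFloat()
        // Sample variance (n - 1 denominator); the guard avoids dividing by zero for one frame.
        val denom = maxOf(row.size - 1, 1)
        val variance = row.fold(0f) { acc, v -> acc + (v - mean) * (v - mean) } / denom
        val std = sqrt(variance) + eps
        FloatArray(row.size) { t -> (row[t] - mean) / std }
    }.toTypedArray()

/** Pads the frame axis up to the next multiple of padTo; padTo <= 0 leaves the input unchanged. */
fun padFrames(mel: Array<FloatArray>, padTo: Int, padValue: Float = 0f): Array<FloatArray> {
    val nFrames = mel.firstOrNull()?.size ?: 0
    if (padTo <= 0 || nFrames % padTo == 0) return mel
    val padded = (nFrames / padTo + 1) * padTo
    return Array(mel.size) { m ->
        FloatArray(padded) { t -> if (t < nFrames) mel[m][t] else padValue }
    }
}
```

Under these assumptions, a 5-frame spectrogram padded with the nemoSpeaker value of 16 would grow to 16 frames, while the nemoASR value of 0 would leave it at 5.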
The process() method returns Array<FloatArray> with shape [nMels, nFrames]:
- nMels: Number of mel frequency bins (default: 80)
- nFrames: Number of time frames (depends on audio length)
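
Inference runtimes typically take a flat buffer rather than a nested array, so the [nMels, nFrames] output usually has to be flattened first. The sketch below uses hypothetical helper names that are not part of this library. `flattenRowMajor` copies the rows into one row-major FloatArray. `estimateFrameCount` is only a rough guess at nFrames: it assumes a centered STFT, as in NeMo's reference preprocessing, with the default hop length of 160, and it ignores any Pad To padding.

```kotlin
// Hypothetical helpers for illustration; not part of NeMoFeatureExtractor.

/** Flattens [nMels, nFrames] features into a row-major FloatArray of size nMels * nFrames. */
fun flattenRowMajor(features: Array<FloatArray>): FloatArray {
    val nMels = features.size
    val nFrames = features.firstOrNull()?.size ?: 0
    val out = FloatArray(nMels * nFrames)
    for (m in 0 until nMels) {
        require(features[m].size == nFrames) {
            "Row $m has ${features[m].size} frames, expected $nFrames"
        }
        // Each mel bin occupies one contiguous run of nFrames values.
        features[m].copyInto(out, destinationOffset = m * nFrames)
    }
    return out
}

/** Estimates the frame count before padding, assuming a centered STFT (an assumption). */
fun estimateFrameCount(numSamples: Int, hopLength: Int = 160): Int =
    1 + numSamples / hopLength
```

With this layout, the value for mel bin m at frame t sits at index m * nFrames + t, which matches a tensor of shape [1, nMels, nFrames]. Check your model's expected input layout before relying on it.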
MIT License
- NeMoFeatureExtractor-iOS - iOS/macOS version
- NVIDIA NeMo - Original implementation