NeMoFeatureExtractor-Android

Kotlin library for extracting mel spectrograms compatible with NVIDIA NeMo models on Android.

Features

  • NeMo-compatible mel spectrogram extraction
  • Support for VAD (MarbleNet), ASR (Conformer, Parakeet), and Speaker (TitaNet) models
  • Pre-computed NeMo filterbank for maximum accuracy
  • Pure Kotlin implementation with no external dependencies
  • Configurable normalization modes

Requirements

  • Android API 24+
  • Kotlin 1.9+

Installation

Gradle

Add JitPack repository to your project's settings.gradle.kts:

dependencyResolutionManagement {
    repositories {
        maven { url = uri("https://jitpack.io") }
    }
}

Add the dependency to your module's build.gradle.kts:

dependencies {
    implementation("com.github.Otosaku:NeMoFeatureExtractor-Android:1.0.0")
}

Usage

Basic Usage

import com.otosaku.nemofeatureextractor.NeMoFeatureExtractor
import com.otosaku.nemofeatureextractor.MelSpectrogramConfig

// audioSamples: 16 kHz mono audio as a FloatArray (see Audio Requirements)

// For VAD (MarbleNet)
val vadExtractor = NeMoFeatureExtractor(context, MelSpectrogramConfig.nemoVAD)
val vadFeatures = vadExtractor.process(audioSamples)

// For ASR (Conformer, Parakeet)
val asrExtractor = NeMoFeatureExtractor(context, MelSpectrogramConfig.nemoASR)
val asrFeatures = asrExtractor.process(audioSamples)

// For Speaker (TitaNet)
val speakerExtractor = NeMoFeatureExtractor(context, MelSpectrogramConfig.nemoSpeaker)
val speakerFeatures = speakerExtractor.process(audioSamples)

Without Context (generates filterbank)

val extractor = NeMoFeatureExtractor(MelSpectrogramConfig.nemoVAD)
val features = extractor.process(audioSamples)

Custom Configuration

val config = MelSpectrogramConfig(
    sampleRate = 16000,
    nMels = 80,
    nFFT = 512,
    windowSize = 400,
    hopLength = 160,
    normalization = NormalizationMode.PER_FEATURE,
    preemph = 0.97f
)

val extractor = NeMoFeatureExtractor(context, config)
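
To illustrate what per-feature normalization typically means, the sketch below standardizes each mel bin over time. It is an illustration under assumptions, not this library's code: `normalizePerFeature` is a hypothetical helper, and the use of the unbiased (n - 1) standard deviation plus a small constant of 1e-5 follows what NeMo's preprocessor is understood to do.

```kotlin
import kotlin.math.sqrt

// Standardizes every row (one mel bin across all frames) to zero mean and unit variance.
// Input and output shape: [nMels, nFrames].
fun normalizePerFeature(features: Array<FloatArray>, eps: Float = 1e-5f): Array<FloatArray> =
    Array(features.size) { m ->
        val row = features[m]
        val n = row.size
        var sum = 0.0
        for (v in row) sum += v
        val mean = if (n > 0) sum / n else 0.0
        var squared = 0.0
        for (v in row) {
            val d = v - mean
            squared += d * d
        }
        // Unbiased estimator; a single frame has no spread, so fall back to 0.
        val std = if (n > 1) sqrt(squared / (n - 1)) else 0.0
        FloatArray(n) { t -> ((row[t] - mean) / (std + eps)).toFloat() }
    }
```

The small constant keeps the division stable for silent or constant bins, which then normalize to zero.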

Audio Requirements

  • Sample rate: 16,000 Hz
  • Channels: Mono
  • Format: 32-bit float samples (Kotlin FloatArray)
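
Audio captured on Android is often 16-bit PCM, sometimes stereo, so it may need converting first. The helpers below are a hypothetical sketch (`pcm16ToFloat` and `stereoToMono` are not part of this library), and they assume samples are expected in the range [-1.0, 1.0], which this README does not state explicitly.

```kotlin
// Converts signed 16-bit PCM samples to floats in [-1.0, 1.0).
fun pcm16ToFloat(pcm: ShortArray): FloatArray =
    FloatArray(pcm.size) { i -> pcm[i] / 32768f }

// Averages interleaved stereo samples (L, R, L, R, ...) into a mono signal.
// A trailing unpaired sample, if any, is dropped.
fun stereoToMono(interleaved: FloatArray): FloatArray =
    FloatArray(interleaved.size / 2) { i ->
        (interleaved[2 * i] + interleaved[2 * i + 1]) * 0.5f
    }
```

Resampling to 16,000 Hz is not covered here; audio recorded at another rate would need to be resampled before extraction.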

Configuration Presets

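| Preset      | Normalization | Pad To | Use Case                                 |
|-------------|---------------|--------|------------------------------------------|
| nemoVAD     | None          | 2      | Voice Activity Detection (MarbleNet)     |
| nemoASR     | Per-feature   | 0      | Speech Recognition (Conformer, Parakeet) |
| nemoSpeaker | Per-feature   | 16     | Speaker Verification (TitaNet)           |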
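
The Pad To column can be read as follows, assuming it mirrors NeMo's `pad_to` setting: the number of frames is padded up to the next multiple of the given value, and 0 disables padding. The helper below is a hypothetical sketch of that behavior, not this library's code, and the zero pad value is an assumption.

```kotlin
// Pads the time axis of a [nMels, nFrames] matrix up to the next multiple of padTo.
// padTo <= 0 leaves the input untouched.
fun padFramesToMultiple(
    features: Array<FloatArray>,
    padTo: Int,
    padValue: Float = 0f
): Array<FloatArray> {
    if (padTo <= 0 || features.isEmpty()) return features
    val nFrames = features[0].size
    val remainder = nFrames % padTo
    if (remainder == 0) return features
    val paddedFrames = nFrames + (padTo - remainder)
    return Array(features.size) { m ->
        FloatArray(paddedFrames) { t -> if (t < nFrames) features[m][t] else padValue }
    }
}
```

For example, 5 frames with a Pad To of 16 would become 16 frames, and with a Pad To of 2 would become 6.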

Output Format

The process() method returns Array<FloatArray> with shape [nMels, nFrames]:

  • nMels: Number of mel frequency bins (default: 80)
  • nFrames: Number of time frames (depends on audio length)
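
Inference runtimes usually take a flat buffer plus a shape rather than nested arrays. As a sketch, the hypothetical helper below (`flattenRowMajor` is not part of this library) flattens the [nMels, nFrames] output in row-major order; whether a model wants this layout or a transposed one depends on the model.

```kotlin
// Flattens a [nMels, nFrames] matrix into a row-major buffer of size nMels * nFrames.
fun flattenRowMajor(features: Array<FloatArray>): FloatArray {
    val nMels = features.size
    val nFrames = if (nMels > 0) features[0].size else 0
    val flat = FloatArray(nMels * nFrames)
    for (m in 0 until nMels) {
        require(features[m].size == nFrames) { "All rows must have the same number of frames" }
        features[m].copyInto(flat, destinationOffset = m * nFrames)
    }
    return flat
}
```

The element at mel bin `m` and frame `t` then sits at index `m * nFrames + t`.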

License

MIT License
