voxtral-sentinel-4b is a fine-tune of Voxtral-Mini-4B-Realtime that explains the context, environment, and emotional subtext of audio recordings for forensic analysis purposes.
Built for the Mistral Worldwide Hackathon.
Given a raw audio recording, the fine-tuned model (voxtral-sentinel-4b) produces structured output containing:
- Transcript — verbatim transcription of the speech
- Analysis — expert assessment of vocal cues, emotion, tone, and environmental context
- Conclusion — recommended action or risk classification
This enables real-time audio triage for automated customer support (intent classification, escalation routing) and emergency services (distress detection, dispatcher assistance) without human-in-the-loop intervention.
### TRANSCRIPT:
I need help immediately, my neighbour hasn't responded in hours and I can hear something...
### ANALYSIS:
The speaker exhibits elevated vocal stress indicators including increased speech rate and
pitch variance. Tone suggests genuine distress rather than rehearsed or non-urgent
communication. Situational context implies potential welfare concern for a third party.
### CONCLUSION:
Escalate to emergency services. Flag as high-priority welfare check. Do not route to
standard support queue.
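Downstream systems can split this three-section output with a small parser. A minimal sketch, assuming the section markers shown above; the function name is illustrative, not part of the project:

```python
import re

def parse_forensic_output(text: str) -> dict:
    """Split a model response into its TRANSCRIPT / ANALYSIS / CONCLUSION sections."""
    sections = {}
    # Lazily capture each section body up to the next "###" header or end of text.
    pattern = r"###\s*(TRANSCRIPT|ANALYSIS|CONCLUSION):\s*(.*?)(?=###|\Z)"
    for name, body in re.findall(pattern, text, flags=re.DOTALL):
        sections[name.lower()] = body.strip()
    return sections

result = parse_forensic_output(
    "### TRANSCRIPT:\nI need help immediately...\n"
    "### ANALYSIS:\nElevated vocal stress.\n"
    "### CONCLUSION:\nEscalate to emergency services."
)
# result["conclusion"] -> "Escalate to emergency services."
```

A router can then key escalation decisions off `result["conclusion"]` alone, without re-reading the full analysis.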
```
Forensic-Audio/
├── create_dataset.py                    # Generate JSONL annotations via Mistral API (MELD + DCASE sources)
├── dataset_packer.py                    # Pack audio + annotations into a HuggingFace dataset and push to Hub
├── train.py                             # Training script (apply_chat_template, eval split, early stopping)
├── check_api.py                         # Quick sanity check for Mistral API connectivity
├── debug_fields.py                      # Inspect raw MELD dataset fields
├── requirements.txt                     # Python dependencies
├── voxtral_forensic_train.jsonl         # Small annotation file
├── voxtral_forensic_train_large.jsonl   # Full annotation file (~12,500 samples)
├── LICENSE                              # GPL-3.0
└── README.md
```
The project follows a three-stage pipeline:
**Stage 1 — `create_dataset.py`.** Streams audio from MELD (emotional dialogue) and DCASE/AudioSet (acoustic scenes). For each sample, it calls Mistral Small to produce a structured forensic analysis (transcript, analysis, conclusion) conditioned on the speech text, detected emotion, and scene label. Results are written to a JSONL file.
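The per-sample annotation step can be sketched as follows. This is a simplified illustration, not the real `create_dataset.py`: the prompt wording is an assumption, and `generate_analysis` is a stub standing in for the actual Mistral Small chat call.

```python
import json

def build_annotation_prompt(speech_text: str, emotion: str, scene: str) -> str:
    # Condition the analysis on transcript, detected emotion, and acoustic scene,
    # and request the canonical three-section output format.
    return (
        f"Speech: {speech_text}\nEmotion: {emotion}\nScene: {scene}\n"
        "Produce a forensic analysis with sections "
        "### TRANSCRIPT:, ### ANALYSIS:, ### CONCLUSION:."
    )

def generate_analysis(prompt: str) -> str:
    # Stub standing in for a Mistral Small chat completion.
    return "### TRANSCRIPT:\n...\n### ANALYSIS:\n...\n### CONCLUSION:\n..."

def annotate_sample(sample: dict) -> str:
    """Return one JSONL record: sample id in, structured answer out."""
    prompt = build_annotation_prompt(sample["text"], sample["emotion"], sample["scene"])
    return json.dumps({"id": sample["id"], "answer": generate_analysis(prompt)})

line = annotate_sample({"id": 0, "text": "Hello?", "emotion": "fear", "scene": "street"})
```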
**Stage 2 — `dataset_packer.py`.** Matches JSONL annotations to raw audio files from the MELD archive on HuggingFace. Normalises all answers into a canonical three-section format (regex-based, with a Mistral fallback for edge cases). Builds a HuggingFace Dataset with audio, prompt, and answer columns, creates a 95/5 train/test split, and pushes to the Hub.
**Stage 3 — `train.py`.** Full fine-tune (no LoRA) of Voxtral-Mini-4B-Realtime on an NVIDIA A100. Uses `apply_chat_template` to interleave audio and text tokens correctly. Labels are masked on user turns so the loss is computed only on the assistant response. Includes early stopping once eval loss drops below a threshold.
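Masking labels on user turns can be illustrated as follows. This is a minimal sketch, not the project's actual collator; `-100` is the ignore index that PyTorch's cross-entropy loss (and hence the HF Trainer) skips.

```python
IGNORE_INDEX = -100  # positions with this label are excluded from the loss

def mask_user_turns(input_ids: list[int], assistant_start: int) -> list[int]:
    """Copy input_ids as labels, but ignore everything before the assistant response."""
    labels = list(input_ids)
    for i in range(min(assistant_start, len(labels))):
        labels[i] = IGNORE_INDEX
    return labels

# Tokens 0-4 are the user turn (prompt + audio placeholder tokens);
# tokens 5-7 are the assistant response the model should learn to produce.
labels = mask_user_turns([101, 7, 8, 9, 102, 55, 56, 57], assistant_start=5)
# -> [-100, -100, -100, -100, -100, 55, 56, 57]
```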
- Python 3.10+
- CUDA-capable GPU (A100 recommended for training; inference works on smaller GPUs)
- Accounts: HuggingFace and Mistral AI (for dataset generation)
```
git clone https://github.com/SageRish/Forensic-Audio.git
cd Forensic-Audio
pip install -r requirements.txt
```

Create a `.env` file in the project root:

```
HF_TOKEN=hf_your_huggingface_token
MISTRAL_API_KEY=your_mistral_api_key
WANDB_API_KEY=your_wandb_key   # optional, for experiment tracking
```

Then generate the annotations:

```
python create_dataset.py
```

Produces `voxtral_forensic_train_large.jsonl` with structured annotations for up to 9,988 audio samples.
```
python dataset_packer.py
```

Streams audio from the MELD archive, matches it to annotations, normalises formatting, and pushes the final dataset to `trishtan/voxtral-forensic-ds`.
```
python train.py
```

Runs full fine-tuning with the following configuration:
| Parameter | Value |
|---|---|
| Epochs | 5 (early stopping at eval loss < 1.15) |
| Learning rate | 5e-6 |
| LR scheduler | Cosine |
| Warmup ratio | 0.05 |
| Batch size (per device) | 2 |
| Gradient accumulation | 4 |
| Effective batch size | 8 |
| Max grad norm | 1.0 |
| Precision | bfloat16 |
| Eval strategy | Every 100 steps |
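The warmup-plus-cosine schedule from the table can be sketched numerically. The step counts here are illustrative assumptions; the HF Trainer derives them from the dataset size and batch settings.

```python
import math

BASE_LR = 5e-6
TOTAL_STEPS = 1000                       # illustrative; derived from the data in practice
WARMUP_STEPS = int(0.05 * TOTAL_STEPS)   # warmup ratio 0.05 -> 50 steps here

def lr_at(step: int) -> float:
    """Linear warmup to BASE_LR, then cosine decay to zero."""
    if step < WARMUP_STEPS:
        return BASE_LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    return BASE_LR * 0.5 * (1.0 + math.cos(math.pi * progress))

peak = lr_at(WARMUP_STEPS)   # reaches BASE_LR exactly at the end of warmup
final = lr_at(TOTAL_STEPS)   # decays to 0 at the last step
```

Note also how the effective batch size in the table follows from the other two rows: 2 samples per device × 4 gradient-accumulation steps = 8.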
Training logs are available on Weights & Biases.
```python
import torch
import soundfile as sf
import numpy as np
from transformers import AutoProcessor, VoxtralRealtimeForConditionalGeneration

model_id = "trishtan/voxtral-sentinel-4b"
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = VoxtralRealtimeForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto",
)

# Load audio and convert to mono float32. The feature extractor below assumes
# 16 kHz input; resample first if your file uses a different sample rate.
audio, sr = sf.read("your_audio.wav")
if audio.ndim > 1:
    audio = audio.mean(axis=1)
audio = audio.astype(np.float32)

PROMPT = "[INST] Analyze this recording for forensic indicators. [/INST]"

audio_inputs = processor.feature_extractor(
    [audio], sampling_rate=16000, return_tensors="pt", padding=True,
)
text_inputs = processor.tokenizer(
    [PROMPT], return_tensors="pt", padding=True,
)
inputs = {**audio_inputs, **text_inputs}
inputs = {k: v.to(model.device) for k, v in inputs.items()}

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=512, do_sample=False)
print(processor.tokenizer.decode(output_ids[0], skip_special_tokens=True))
```

| Metric | Value |
|---|---|
| Final eval loss | 1.148 |
| Mean token accuracy | 74.35% |
| Train/eval accuracy gap | ~0% |
| Stopped at epoch | 2.75 (early stopping) |
The near-zero train/eval accuracy gap suggests the model generalises well to unseen audio, with no measurable overfitting.
| Resource | Link |
|---|---|
| Fine-tuned model | trishtan/voxtral-sentinel-4b |
| Training dataset | trishtan/voxtral-forensic-ds |
| Base model | mistralai/Voxtral-Mini-4B-Realtime-2602 |
| W&B run | voxtral-sentinel |
A sample of the training data is available in sample_data.jsonl.
Full dataset: trishtan/voxtral-forensic-ds
**MELD — Multimodal EmotionLines Dataset**
Poria, S., Hazarika, D., Majumder, N., Naik, G., Cambria, E., & Mihalcea, R. (2019). MELD: A Multimodal Multi-Party Dataset for Emotion Recognition in Conversations. ACL 2019, pp. 527–536. HuggingFace: `ajyy/MELD_audio`

**DCASE 2025 — Acoustic Scene Classification**
Mesaros, A., Heittola, T., & Virtanen, T. (2018). A multi-device dataset for urban acoustic scene classification. DCASE 2018 Workshop, pp. 9–13. https://dcase.community/challenge2025/task-low-complexity-acoustic-scene-classification-with-device-information
This project is licensed under the GNU General Public License v3.0 — see the LICENSE file for details.
The fine-tuned model inherits the base model license from Mistral AI.