voxtral-sentinel-4b is a fine-tune of Voxtral-Mini-4B-Realtime that explains the context, environment, and emotional subtext of audio recordings for forensic analysis purposes.
Built for the Mistral Worldwide Hackathon.
Given a raw audio recording, the fine-tuned model (voxtral-sentinel-4b) produces structured output containing:
- Transcript — verbatim transcription of the speech
- Analysis — expert assessment of vocal cues, emotion, tone, and environmental context
- Conclusion — recommended action or risk classification
This enables real-time audio triage for automated customer support (intent classification, escalation routing) and emergency services (distress detection, dispatcher assistance) without human-in-the-loop intervention.
### TRANSCRIPT:
I need help immediately, my neighbour hasn't responded in hours and I can hear something...
### ANALYSIS:
The speaker exhibits elevated vocal stress indicators including increased speech rate and
pitch variance. Tone suggests genuine distress rather than rehearsed or non-urgent
communication. Situational context implies potential welfare concern for a third party.
### CONCLUSION:
Escalate to emergency services. Flag as high-priority welfare check. Do not route to
standard support queue.
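Downstream systems can split this three-section output with a small parser. A minimal sketch, assuming the section markers shown above; the function name is illustrative, not part of the project:

```python
import re

def parse_forensic_output(text: str) -> dict:
    """Split a model response into its TRANSCRIPT / ANALYSIS / CONCLUSION sections."""
    sections = {}
    # Lazily capture each section body up to the next "###" header or end of text.
    pattern = r"###\s*(TRANSCRIPT|ANALYSIS|CONCLUSION):\s*(.*?)(?=###|\Z)"
    for name, body in re.findall(pattern, text, flags=re.DOTALL):
        sections[name.lower()] = body.strip()
    return sections

result = parse_forensic_output(
    "### TRANSCRIPT:\nI need help immediately...\n"
    "### ANALYSIS:\nElevated vocal stress.\n"
    "### CONCLUSION:\nEscalate to emergency services."
)
# result["conclusion"] -> "Escalate to emergency services."
```

A router can then key escalation decisions off `result["conclusion"]` alone, without re-reading the full analysis.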
```
Forensic-Audio/
├── create_dataset.py                    # Generate JSONL annotations via Mistral API (MELD + DCASE sources)
├── dataset_packer.py                    # Pack audio + annotations into a HuggingFace dataset and push to Hub
├── train.py                             # Training script (apply_chat_template, eval split, early stopping)
├── check_api.py                         # Quick sanity check for Mistral API connectivity
├── debug_fields.py                      # Inspect raw MELD dataset fields
├── requirements.txt                     # Python dependencies
├── voxtral_forensic_train.jsonl         # Small annotation file
├── voxtral_forensic_train_large.jsonl   # Full annotation file (~12,500 samples)
├── LICENSE                              # GPL-3.0
└── README.md
```
The project follows a three-stage pipeline:
**Stage 1 — `create_dataset.py`.** Streams audio from MELD (emotional dialogue) and DCASE/AudioSet (acoustic scenes). For each sample, it calls Mistral Small to produce a structured forensic analysis (transcript, analysis, conclusion) conditioned on the speech text, detected emotion, and scene label. Results are written to a JSONL file.
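The per-sample annotation step can be sketched as follows. This is a simplified illustration, not the real `create_dataset.py`: the prompt wording is an assumption, and `generate_analysis` is a stub standing in for the actual Mistral Small chat call.

```python
import json

def build_annotation_prompt(speech_text: str, emotion: str, scene: str) -> str:
    # Condition the analysis on transcript, detected emotion, and acoustic scene,
    # and request the canonical three-section output format.
    return (
        f"Speech: {speech_text}\nEmotion: {emotion}\nScene: {scene}\n"
        "Produce a forensic analysis with sections "
        "### TRANSCRIPT:, ### ANALYSIS:, ### CONCLUSION:."
    )

def generate_analysis(prompt: str) -> str:
    # Stub standing in for a Mistral Small chat completion.
    return "### TRANSCRIPT:\n...\n### ANALYSIS:\n...\n### CONCLUSION:\n..."

def annotate_sample(sample: dict) -> str:
    """Return one JSONL record: sample id in, structured answer out."""
    prompt = build_annotation_prompt(sample["text"], sample["emotion"], sample["scene"])
    return json.dumps({"id": sample["id"], "answer": generate_analysis(prompt)})

line = annotate_sample({"id": 0, "text": "Hello?", "emotion": "fear", "scene": "street"})
```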
**Stage 2 — `dataset_packer.py`.** Matches JSONL annotations to raw audio files from the MELD archive on HuggingFace. Normalises all answers into a canonical three-section format (regex-based, with a Mistral fallback for edge cases). Builds a HuggingFace Dataset with audio, prompt, and answer columns, creates a 95/5 train/test split, and pushes to the Hub.
**Stage 3 — `train.py`.** Full fine-tune (no LoRA) of Voxtral-Mini-4B-Realtime on an NVIDIA A100. Uses `apply_chat_template` to interleave audio and text tokens correctly. Labels are masked on user turns so the loss is computed only on the assistant response. Includes early stopping once eval loss drops below a threshold.
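Masking labels on user turns can be illustrated as follows. This is a minimal sketch, not the project's actual collator; `-100` is the ignore index that PyTorch's cross-entropy loss (and hence the HF Trainer) skips.

```python
IGNORE_INDEX = -100  # positions with this label are excluded from the loss

def mask_user_turns(input_ids: list[int], assistant_start: int) -> list[int]:
    """Copy input_ids as labels, but ignore everything before the assistant response."""
    labels = list(input_ids)
    for i in range(min(assistant_start, len(labels))):
        labels[i] = IGNORE_INDEX
    return labels

# Tokens 0-4 are the user turn (prompt + audio placeholder tokens);
# tokens 5-7 are the assistant response the model should learn to produce.
labels = mask_user_turns([101, 7, 8, 9, 102, 55, 56, 57], assistant_start=5)
# -> [-100, -100, -100, -100, -100, 55, 56, 57]
```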
- Python 3.10+
- CUDA-capable GPU (A100 recommended for training; inference works on smaller GPUs)
- Accounts: HuggingFace and Mistral AI (for dataset generation)
```
git clone https://github.com/SageRish/Forensic-Audio.git
cd Forensic-Audio
pip install -r requirements.txt
```

Create a `.env` file in the project root:

```
HF_TOKEN=hf_your_huggingface_token
MISTRAL_API_KEY=your_mistral_api_key
WANDB_API_KEY=your_wandb_key   # optional, for experiment tracking
```

Then generate the annotations:

```
python create_dataset.py
```

Produces `voxtral_forensic_train_large.jsonl` with structured annotations for up to 9,988 audio samples.
```
python dataset_packer.py
```

Streams audio from the MELD archive, matches it to annotations, normalises formatting, and pushes the final dataset to `trishtan/voxtral-forensic-ds`.
```
python train.py
```

Runs full fine-tuning with the following configuration:
| Parameter | Value |
|---|---|
| Epochs | 5 (early stopping at eval loss < 1.15) |
| Learning rate | 5e-6 |
| LR scheduler | Cosine |
| Warmup ratio | 0.05 |
| Batch size (per device) | 2 |
| Gradient accumulation | 4 |
| Effective batch size | 8 |
| Max grad norm | 1.0 |
| Precision | bfloat16 |
| Eval strategy | Every 100 steps |
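The warmup-plus-cosine schedule from the table can be sketched numerically. The step counts here are illustrative assumptions; the HF Trainer derives them from the dataset size and batch settings.

```python
import math

BASE_LR = 5e-6
TOTAL_STEPS = 1000                       # illustrative; derived from the data in practice
WARMUP_STEPS = int(0.05 * TOTAL_STEPS)   # warmup ratio 0.05 -> 50 steps here

def lr_at(step: int) -> float:
    """Linear warmup to BASE_LR, then cosine decay to zero."""
    if step < WARMUP_STEPS:
        return BASE_LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    return BASE_LR * 0.5 * (1.0 + math.cos(math.pi * progress))

peak = lr_at(WARMUP_STEPS)   # reaches BASE_LR exactly at the end of warmup
final = lr_at(TOTAL_STEPS)   # decays to 0 at the last step
```

Note also how the effective batch size in the table follows from the other two rows: 2 samples per device × 4 gradient-accumulation steps = 8.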
Training logs are available on Weights & Biases.
```python
import torch
import soundfile as sf
import numpy as np
from transformers import AutoProcessor, VoxtralRealtimeForConditionalGeneration

model_id = "trishtan/voxtral-sentinel-4b"
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = VoxtralRealtimeForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto",
)

# Load audio and convert to mono float32. The feature extractor below assumes
# 16 kHz input; resample first if your file uses a different sample rate.
audio, sr = sf.read("your_audio.wav")
if audio.ndim > 1:
    audio = audio.mean(axis=1)
audio = audio.astype(np.float32)

PROMPT = "[INST] Analyze this recording for forensic indicators. [/INST]"

audio_inputs = processor.feature_extractor(
    [audio], sampling_rate=16000, return_tensors="pt", padding=True,
)
text_inputs = processor.tokenizer(
    [PROMPT], return_tensors="pt", padding=True,
)
inputs = {**audio_inputs, **text_inputs}
inputs = {k: v.to(model.device) for k, v in inputs.items()}

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=512, do_sample=False)
print(processor.tokenizer.decode(output_ids[0], skip_special_tokens=True))
```

| Metric | Value |
|---|---|
| Final eval loss | 1.148 |
| Mean token accuracy | 74.35% |
| Train/eval accuracy gap | ~0% |
| Stopped at epoch | 2.75 (early stopping) |
The near-zero train/eval accuracy gap suggests the model generalises well to unseen audio, with no measurable overfitting.
| Resource | Link |
|---|---|
| Fine-tuned model | trishtan/voxtral-sentinel-4b |
| Training dataset | trishtan/voxtral-forensic-ds |
| Base model | mistralai/Voxtral-Mini-4B-Realtime-2602 |
| W&B run | voxtral-sentinel |
A sample of the training data is available in sample_data.jsonl.
Full dataset: trishtan/voxtral-forensic-ds
**MELD — Multimodal EmotionLines Dataset**
Poria, S., Hazarika, D., Majumder, N., Naik, G., Cambria, E., & Mihalcea, R. (2019). MELD: A Multimodal Multi-Party Dataset for Emotion Recognition in Conversations. ACL 2019, pp. 527–536. HuggingFace: `ajyy/MELD_audio`

**DCASE 2025 — Acoustic Scene Classification**
Mesaros, A., Heittola, T., & Virtanen, T. (2018). A multi-device dataset for urban acoustic scene classification. DCASE 2018 Workshop, pp. 9–13. https://dcase.community/challenge2025/task-low-complexity-acoustic-scene-classification-with-device-information
This project is licensed under the GNU General Public License v3.0 — see the LICENSE file for details.
The fine-tuned model inherits the base model license from Mistral AI.