## Open notebook in:
| Colab |
|:------|
| [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Nicolepcx/transformers-the-definitive-guide/blob/master/CH06/ch06_Kimi_audio_transcribe_meeting.ipynb) |

# About this Notebook

This notebook demonstrates a complete pipeline for **multi-speaker diarization and transcription** using state-of-the-art audio foundation models. The goal is to identify who is speaking when, extract individual speaker segments, and generate corresponding transcriptions using a multimodal large language model.

### Steps Included:

1. **Dataset Loading**:
   The notebook uses the `sanchit-gandhi/concatenated_librispeech` dataset from Hugging Face, which concatenates LibriSpeech utterances from different speakers into a longer clip. You can also replace this with your own local audio files.

2. **Speaker Diarization**:
   A pre-trained diarization pipeline from `pyannote.audio` is used to segment the audio by speaker. Each segment is annotated with speaker identity (e.g., `SPEAKER_00`) along with start and end timestamps.

3. **Segment Extraction**:
   Once diarized, the notebook slices the raw waveform into speaker-specific segments. Each audio chunk is saved as an individual `.wav` file for downstream processing.

4. **Model Setup (Kimi-Audio)**:
   The [Kimi-Audio-7B-Instruct](https://huggingface.co/moonshotai/Kimi-Audio-7B-Instruct) model is loaded using the `kimia_infer` library. This model is designed for **instruction-following audio-text tasks**, including transcription, summarization, and Q&A over audio.

5. **Multimodal Inference**:
   Each speaker segment is paired with a prompt, asking the model to transcribe the speech. The model is run iteratively over all segments to produce labeled transcriptions.

6. **Structured Output**:
   Final results are printed with speaker identity, timestamp boundaries, and transcribed text, providing a detailed, readable transcription of a multi-speaker conversation.

This notebook highlights the power of **hybrid pipelines** that combine **a dedicated speaker-diarization model** with **LLMs capable of multimodal understanding**. It produces speaker-attributed transcripts from raw audio, which suits tasks like meeting transcription, podcast summarization, or conversation analysis.



# Installs

In [None]:
!pip install pyannote.audio -q


In [None]:
!pip install --upgrade datasets

Collecting datasets
  Downloading datasets-3.6.0-py3-none-any.whl.metadata (19 kB)
Collecting fsspec<=2025.3.0,>=2023.1.0 (from fsspec[http]<=2025.3.0,>=2023.1.0->datasets)
  Downloading fsspec-2025.3.0-py3-none-any.whl.metadata (11 kB)
Downloading datasets-3.6.0-py3-none-any.whl (491 kB)
Downloading fsspec-2025.3.0-py3-none-any.whl (193 kB)
Installing collected packages: fsspec, datasets
  Attempting uninstall: fsspec
    Found existing installation: fsspec 2025.3.2
    Uninstalling fsspec-2025.3.2:
      Successfully uninstalled fsspec-2025.3.2
  Attempting uninstall: datasets
    Found existing installation: datasets 2.14.4
    Uninstalling datasets-2.14.4:
      Successfully uninstalled datasets-2.14.4
ERROR: pip's dependency r

In [None]:
!pip install git+https://github.com/MoonshotAI/Kimi-Audio.git

Collecting git+https://github.com/MoonshotAI/Kimi-Audio.git
  Cloning https://github.com/MoonshotAI/Kimi-Audio.git to /tmp/pip-req-build-syjocq3f
  Running command git clone --filter=blob:none --quiet https://github.com/MoonshotAI/Kimi-Audio.git /tmp/pip-req-build-syjocq3f
  Resolved https://github.com/MoonshotAI/Kimi-Audio.git to commit 349251e1d8f4f98d58fda59246381faecd7392e0
  Running command git submodule update --init --recursive -q
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting flash-attn (from kimi-audio==0.1.0)
  Downloading flash_attn-2.8.0.post2.tar.gz (7.9 MB)
  Preparing metadata (setup.py) ... done
Collecting loguru (from kimi-audio==0.1.0)
  Downloading loguru-0.7.3-py3-none-any.whl.metadata (22 kB)

In [None]:
!pip uninstall -y flash-attn
!pip install flash-attn==2.7.2post1


Collecting flash-attn==2.7.2post1
  Downloading flash_attn-2.7.2.post1.tar.gz (3.1 MB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: flash-attn
  Building wheel for flash-attn (setup.py) ... done
  Created wheel for flash-attn: filename=flash_attn-2.7.2.post1-cp311-cp311-linux_x86_64.whl size=190180789 sha256=8596401f69dea28c0c9f2ea8ee5b636876eb2ea511c34849b8063c46ee37a4fd
  Stored in directory: /root/.cache/pip/wheels/6a/8b/7d/0ac2b18cb28f4104a1852da090dcf9ea8239ce45fc82bcc4d1
Successfully built flash-attn
Installing collected packages: flash-attn
Successfully installed flash-attn-2.7.2.post1


# Imports

In [None]:
from datasets import load_dataset
from pyannote.audio import Pipeline
import torch
import soundfile as sf
import os
from kimia_infer.api.kimia import KimiAudio


# Load audio from LibriSpeech dataset or local file

In [None]:
dataset = load_dataset("sanchit-gandhi/concatenated_librispeech")
#sample = dataset[0]




In [None]:
dataset

DatasetDict({
    train: Dataset({
        features: ['audio'],
        num_rows: 1
    })
})

### Note: you need to request access to the gated pyannote models on Hugging Face
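
The cell below passes `use_auth_token=True`, which relies on a Hugging Face token already cached by a previous login. As a minimal sketch of an alternative, assuming the token is stored in an environment variable named `HF_TOKEN`, the hypothetical helper `resolve_hf_token` returns that token and falls back to `True` when the variable is unset:

```python
import os

def resolve_hf_token(env=None):
    """Return the token from HF_TOKEN, or True to fall back to the cached login.

    Hypothetical helper for illustration; not part of pyannote or this notebook.
    """
    env = os.environ if env is None else env
    token = env.get("HF_TOKEN", "").strip()
    return token if token else True

auth = resolve_hf_token()

# Assumed usage with the call from this notebook (needs network access and
# approved access to the gated models):
# diarization_pipeline = Pipeline.from_pretrained(
#     "pyannote/speaker-diarization@2.1", use_auth_token=auth
# )
```

Keeping the token in the environment avoids pasting it into a notebook cell that might later be shared.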

In [None]:
# Diarization pipeline
diarization_pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization@2.1", use_auth_token=True)

dataset = load_dataset("sanchit-gandhi/concatenated_librispeech", split="train")



Model was trained with pyannote.audio 0.0.1, yours is 3.3.2. Bad things might happen unless you revert pyannote.audio to 0.x.
Model was trained with torch 1.10.0+cu102, yours is 2.6.0+cu124. Bad things might happen unless you revert torch to 1.x.



# Slice audio

In [None]:
sample = dataset[0]
audio_array = sample["audio"]["array"]
sr = sample["audio"]["sampling_rate"]

waveform = torch.tensor(audio_array[None, :]).float()
annotation = diarization_pipeline({"waveform": waveform, "sample_rate": sr})

segments = []
for turn, _, speaker in annotation.itertracks(yield_label=True):
    segments.append({
        "start": turn.start,
        "end": turn.end,
        "speaker": speaker
    })

# Now you can slice:
speaker_segments = []
for seg in segments:
    start = int(seg["start"] * sr)
    end = int(seg["end"] * sr)
    audio_chunk = audio_array[start:end]
    speaker_segments.append({
        "speaker": seg["speaker"],
        "start": seg["start"],
        "end": seg["end"],
        "audio": audio_chunk
    })
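
The slicing above converts each timestamp from seconds to a sample index by multiplying by the sampling rate. The following sketch shows the same conversion on a synthetic clip, with one addition that is an assumption rather than part of the notebook: indices are clamped to the clip boundaries. The names `slice_segment`, `demo_sr`, and `demo_audio` are hypothetical.

```python
import numpy as np

def slice_segment(audio, sr, start_s, end_s):
    """Return the samples between start_s and end_s (in seconds).

    Uses the same int(seconds * sr) conversion as the notebook, and
    additionally clamps both indices to the valid range of the clip.
    """
    start = max(0, int(start_s * sr))
    end = min(len(audio), int(end_s * sr))
    return audio[start:end]

# Synthetic 2-second clip at 16 kHz standing in for the dataset audio.
demo_sr = 16_000
demo_audio = np.arange(2 * demo_sr, dtype=np.float32)

demo_chunk = slice_segment(demo_audio, demo_sr, 0.5, 1.25)
print(len(demo_chunk) / demo_sr)  # 0.75 seconds

# An end time past the clip is clamped instead of silently relying on
# NumPy's out-of-range slice behaviour.
demo_tail = slice_segment(demo_audio, demo_sr, 1.5, 5.0)
print(len(demo_tail) / demo_sr)  # 0.5 seconds
```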


## Show segments

In [None]:
speaker_segments

[{'speaker': 'SPEAKER_01',
  'start': 0.03096875,
  'end': 14.543468750000002,
  'audio': array([0.        , 0.        , 0.        , ..., 0.02108765, 0.02081299,
         0.02102661])},
 {'speaker': 'SPEAKER_00',
  'start': 15.387218750000002,
  'end': 21.25971875,
  'audio': array([-0.00094604, -0.00125122, -0.00183105, ..., -0.00531006,
         -0.0057373 , -0.006073  ])}]

# Save Speaker Segments as Individual Audio Files

This section creates a temporary folder and saves each diarized speaker segment as a separate `.wav` file for later transcription.


In [None]:

# Create temp folder
os.makedirs("tmp_segments", exist_ok=True)

for i, seg in enumerate(speaker_segments):
    filename = f"tmp_segments/speaker_{i}.wav"
    sf.write(filename, seg["audio"], sr)
    speaker_segments[i]["file_path"] = filename


# Load Kimi-Audio-7B-Instruct for Multimodal Inference

Here, the KimiAudio model is loaded with detokenizer support and inference parameters are defined for both audio and text modalities.


In [None]:

model_path = "moonshotai/Kimi-Audio-7B-Instruct"
model = KimiAudio(model_path=model_path, load_detokenizer=True)

sampling_params = {
    "audio_temperature": 0.8,
    "audio_top_k": 10,
    "text_temperature": 0.0,
    "text_top_k": 5,
    "audio_repetition_penalty": 1.0,
    "audio_repetition_window_size": 64,
    "text_repetition_penalty": 1.0,
    "text_repetition_window_size": 16,
}


# Multi-Speaker Transcription Using Joint Audio & Text Prompts

Each speaker segment is paired with a text prompt that names the speaker and the time range. The prompt/audio pairs are collected into a single message list, which is then passed to the model in one call to attempt a conversation-level transcript.


In [None]:
messages = []
for seg in speaker_segments:
    # Add optional prompt to control output
    messages.append({
        "role": "user",
        "message_type": "text",
        "content": f"What is {seg['speaker']} saying between {seg['start']:.1f}s and {seg['end']:.1f}s?"
    })
    messages.append({
        "role": "user",
        "message_type": "audio",
        "content": seg["file_path"]
    })


In [None]:
_, text_output = model.generate(messages, **sampling_params, output_type="text")
print(">>> Transcribed Multi-speaker Output:\n")
print(text_output)


Generating tokens:   0%|          | 27/7191 [00:01<08:15, 14.46it/s]

>>> Transcribed Multi-speaker Output:

He was in a fevered state of mind, owing to the blight his wife's action threatened to cast upon his entire future.

In this run, the single batched call returned only one sentence, which matches the `SPEAKER_00` segment transcribed later in the notebook, rather than a transcript covering both segments. The next section therefore processes each segment separately.


# Detailed Per-Speaker Transcription (Segment-Wise Loop)

Each speaker segment is processed independently using a prompt + audio pair to generate **fine-grained transcriptions** with speaker labels and timestamps.


In [None]:
transcriptions = []

for seg in speaker_segments:
    msgs = [
        {"role": "user", "message_type": "text", "content": f"Transcribe what {seg['speaker']} says:"},
        {"role": "user", "message_type": "audio", "content": seg["file_path"]}
    ]
    _, output = model.generate(msgs, **sampling_params, output_type="text")
    transcriptions.append({
        "speaker": seg["speaker"],
        "start": seg["start"],
        "end": seg["end"],
        "text": output
    })


Generating tokens:   0%|          | 36/7302 [00:01<06:00, 20.13it/s]
Generating tokens:   0%|          | 27/7410 [00:01<06:13, 19.76it/s]
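
The per-segment loop can also be wrapped in a function that takes the generation call as an argument, which makes the logic easy to check without loading the 7B model. This is a sketch under that assumption: `transcribe_segments`, `fake_generate`, and `demo_segments` are hypothetical names, and the stub only echoes the file path instead of transcribing audio.

```python
def transcribe_segments(segments, generate_fn):
    """Call generate_fn once per segment and collect labeled results."""
    results = []
    for seg in segments:
        # Same two-message layout as the loop above: a text prompt, then the audio file.
        msgs = [
            {"role": "user", "message_type": "text",
             "content": f"Transcribe what {seg['speaker']} says:"},
            {"role": "user", "message_type": "audio", "content": seg["file_path"]},
        ]
        text = generate_fn(msgs)
        results.append({
            "speaker": seg["speaker"],
            "start": seg["start"],
            "end": seg["end"],
            "text": text.strip(),
        })
    return results

# Stub standing in for the model so the sketch runs without a GPU.
def fake_generate(msgs):
    return f" transcript of {msgs[1]['content']} "

demo_segments = [
    {"speaker": "SPEAKER_01", "start": 0.0, "end": 14.5,
     "file_path": "tmp_segments/speaker_0.wav"},
    {"speaker": "SPEAKER_00", "start": 15.4, "end": 21.3,
     "file_path": "tmp_segments/speaker_1.wav"},
]
demo_results = transcribe_segments(demo_segments, fake_generate)
print(demo_results[0]["text"])
```

With the real model, `generate_fn` could be `lambda msgs: model.generate(msgs, **sampling_params, output_type="text")[1]`, mirroring the call used in this notebook.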


# Print Final Transcription Results with Speaker and Time Annotations

The final output is printed in a structured format showing **who spoke when** and **what was said**, enabling easy inspection of multi-speaker conversations.


In [None]:
for t in transcriptions:
    print(f"{t['speaker']} ({t['start']:.1f}s - {t['end']:.1f}s): {t['text']}\n")


SPEAKER_01 (0.0s - 14.5s): The second in importance is as follows sovereignty may be defined to be the right of making laws in france the king really exercises a portion of the sovereign power since the laws have no weight

SPEAKER_00 (15.4s - 21.3s): He was in a fevered state of mind, owing to the blight his wife's action threatened to cast upon his entire future.
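
For longer recordings, raw second counts become harder to read. As an optional sketch, the hypothetical helpers below format timestamps as `MM:SS.s` and sort entries by start time before joining them into a single transcript string. The demo entries are placeholders, not model output.

```python
def format_timestamp(seconds):
    """Format a time in seconds as MM:SS.s, rounding to tenths first."""
    total_tenths = round(seconds * 10)
    minutes, tenths = divmod(total_tenths, 600)
    return f"{minutes:02d}:{tenths / 10:04.1f}"

def format_transcript(entries):
    """Return one line per entry, ordered by start time."""
    lines = []
    for e in sorted(entries, key=lambda item: item["start"]):
        span = f"{format_timestamp(e['start'])} - {format_timestamp(e['end'])}"
        lines.append(f"[{span}] {e['speaker']}: {e['text']}")
    return "\n".join(lines)

# Placeholder entries, deliberately out of order to show the sorting.
demo_entries = [
    {"speaker": "SPEAKER_00", "start": 15.4, "end": 21.3, "text": "second"},
    {"speaker": "SPEAKER_01", "start": 0.0, "end": 14.5, "text": "first"},
]
demo_transcript = format_transcript(demo_entries)
print(demo_transcript)
```

Rounding to tenths before splitting into minutes and seconds avoids outputs such as `00:60.0` for values just under a minute boundary.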

