# Senko in Google Colab

Install `uv`

In [1]:
!curl -LsSf https://astral.sh/uv/install.sh | sh

downloading uv 0.9.7 x86_64-unknown-linux-gnu
no checksums to verify
installing to /usr/local/bin
  uv
  uvx
everything's installed!


Verify `uv` installation

In [2]:
!/usr/local/bin/uv --version

uv 0.9.7


Install Senko

In [3]:
!/usr/local/bin/uv pip install --system "git+https://github.com/narcotic-sh/senko.git[nvidia]"

Using Python 3.12.12 environment at: /usr
Resolved 246 packages in 25.90s
Prepared 71 packages in 1m 03s
Uninstalled 31 packages in 731ms
Installed 71 packages in 494ms
 + asteroid-filterbanks==0.4.0
 - bokeh==3.7.3
 + bokeh==3.6.3
 + coloredlogs==15.0.1
 + colorlog==6.10.1
 + colour-science==0.4.6
 + cucim-cu12==25.8.0
 + cuda-bindings==12.9.4
 + cuda-pathfinder==1.3.2
 - cuda-python==12.6.2.post1
 + cuda-python==12.9.4
 - cudf-cu12==25.6.0 (from https://pypi.nvidia.com/cudf-cu12/cudf_cu12-25.6.0-cp312-cp312-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl)
 +

Regular `pip` would also work here instead of `uv`:
```sh
!pip install "git+https://github.com/narcotic-sh/senko.git[nvidia]"
```
`uv` is just much faster.

Download a 1-hour, 16 kHz, mono, 16-bit test WAV file

In [4]:
import urllib.request
import os

wav_path = os.path.abspath("cowen.wav")
print("Downloading audio file...")
if not os.path.exists(wav_path):
    urllib.request.urlretrieve("https://www.dropbox.com/scl/fi/77kgl6luhmsm6k30x1muf/cowen.wav?rlkey=n6goatgi3pjpgn7glna654f2a&dl=1", wav_path)
print(f"Downloaded to: {wav_path}")

Downloading audio file...
Downloaded to: /content/cowen.wav


Initialize and warm up Diarizer

In [5]:
import senko
diarizer = senko.Diarizer(device='auto', warmup=True, quiet=False)

Using device: cuda


DEBUG:speechbrain.utils.checkpoints:Registered checkpoint save hook for _speechbrain_save
DEBUG:speechbrain.utils.checkpoints:Registered checkpoint load hook for _speechbrain_load
DEBUG:speechbrain.utils.checkpoints:Registered checkpoint save hook for save
DEBUG:speechbrain.utils.checkpoints:Registered checkpoint load hook for load
DEBUG:speechbrain.utils.checkpoints:Registered checkpoint save hook for _save
DEBUG:speechbrain.utils.checkpoints:Registered checkpoint load hook for _recover



Loading embedding model ........ done [0.7s]
Warming up embedding model ..... done [5.4s]
Using GPU clustering
Warming up clustering objects ... done [31.7s]


Diarize

In [6]:
ls /content/

cowen.wav  sample_data/


In [7]:
result = diarizer.diarize(wav_path, generate_colors=False)

# Print first few speaker segments
for seg in result["merged_segments"][:5]:
    print(f"{seg['speaker']}: {seg['start']:.2f}s - {seg['end']:.2f}s")


    cowen.wav
      ├── Voice activity detection ..... done [7.32s]
      ├── Fbank feature extraction ..... done [37.53s]
      ├── Embeddings generation ........ done [5.82s]
      └── Clustering ................... done [2.05s]

    Total diarization time: 52.73s

SPEAKER_01: 0.03s - 25.16s
SPEAKER_02: 25.16s - 26.61s
SPEAKER_01: 26.61s - 47.91s
SPEAKER_02: 48.11s - 49.56s
SPEAKER_01: 50.00s - 52.43s


The fbank audio feature extraction is noticeably slow because of the underpowered CPU in this Colab runtime.

(I used the standard free T4 offering, with whatever CPU comes with it.)

It is on my to-do list to see whether the same audio feature extraction can run on the GPU instead of the CPU.

(Notice that all the other parts of the pipeline, which run on the GPU, are quite fast.)
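To put a number on that, here is a small sketch that turns the stage timings from the log above into percentages of the total run time. The timings are copied from the output. The one-hour audio length is an assumption based on the description of the test file, and `stage_shares` is a helper name made up for this example.

```python
# Stage timings (seconds) copied from the diarization log above.
stage_seconds = {
    "Voice activity detection": 7.32,
    "Fbank feature extraction": 37.53,
    "Embeddings generation": 5.82,
    "Clustering": 2.05,
}
total_seconds = 52.73    # "Total diarization time" from the log
audio_seconds = 60 * 60  # assumption: the test file is exactly 1 hour long


def stage_shares(stages, total):
    """Return each stage's share of the total run time, in percent."""
    return {name: 100 * secs / total for name, secs in stages.items()}


shares = stage_shares(stage_seconds, total_seconds)
for name, pct in shares.items():
    print(f"{name:<26} {pct:5.1f}%")

# How many seconds of audio are processed per second of wall-clock time.
speedup = audio_seconds / total_seconds
print(f"Processed about {speedup:.0f}x faster than real time")
```

With these numbers, fbank extraction accounts for roughly 71% of the run, so it is the stage where moving work to the GPU would pay off most.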

# test


Mount Google Drive

In [8]:
from google.colab import drive
drive.mount('/content/drive/')

Mounted at /content/drive/


## Sampling
1. Convert .m4a to .wav
2. Resample to 16 kHz
3. Convert to 16-bit
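Before handing a file to the diarizer, it can help to confirm that it really has the target format. The sketch below uses only the standard-library `wave` module. The helper names `wav_format` and `is_16k_mono_16bit` and the demo file are invented for this example and are not part of Senko or pydub.

```python
import os
import tempfile
import wave


def wav_format(path):
    """Return (sample_rate_hz, channels, bits_per_sample) for a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        return wav.getframerate(), wav.getnchannels(), wav.getsampwidth() * 8


def is_16k_mono_16bit(path):
    """True if the file is 16 kHz, mono, 16-bit (the target format listed above)."""
    return wav_format(path) == (16000, 1, 16)


# Demo on a tiny generated file so the sketch runs anywhere.
demo_path = os.path.join(tempfile.gettempdir(), "demo_16k.wav")  # hypothetical file
with wave.open(demo_path, "wb") as wav:
    wav.setnchannels(1)       # mono
    wav.setsampwidth(2)       # 2 bytes per sample = 16 bits
    wav.setframerate(16000)   # 16 kHz
    wav.writeframes(b"\x00\x00" * 16000)  # one second of silence

print(wav_format(demo_path), is_16k_mono_16bit(demo_path))
```

In the notebook, the same check could be pointed at the converted file on Drive instead of the demo file.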

In [9]:
!pip install pydub



In [14]:
from pydub import AudioSegment

filePath = "/content/drive/MyDrive/final_project/1105_오전회의.m4a"

# Load the M4A file
audio = AudioSegment.from_file(filePath, format="m4a")

# Convert to 16 kHz mono, 16-bit (fully meets Senko's requirements)
audio_converted = audio.set_frame_rate(16000).set_channels(1).set_sample_width(2)

# Save as a WAV file
wavFilePath = filePath.replace(".m4a", "_16k.wav")
audio_converted.export(wavFilePath, format="wav")

print("Original file:")
print(f"  - Sample rate: {audio.frame_rate}Hz")
print(f"  - Channels: {audio.channels}")
print(f"  - Bit depth: {audio.sample_width * 8}bit")

print("\nConverted file:")
print(f"  - Sample rate: {audio_converted.frame_rate}Hz")
print(f"  - Channels: {audio_converted.channels}")
print(f"  - Bit depth: {audio_converted.sample_width * 8}bit")
print(f"  - Saved to: {wavFilePath}")

Original file:
  - Sample rate: 44100Hz
  - Channels: 1
  - Bit depth: 16bit

Converted file:
  - Sample rate: 16000Hz
  - Channels: 1
  - Bit depth: 16bit
  - Saved to: /content/drive/MyDrive/final_project/1105_오전회의_16k.wav


In [10]:
from pydub import AudioSegment

filePath = "/content/drive/MyDrive/final_project/1105_오전회의.m4a"

audio = AudioSegment.from_file(filePath, format="m4a")
wavFilePath = filePath.replace(".m4a", ".wav")  # match the extension only, not any "m4a" in the path
audio.export(wavFilePath, format="wav")

<_io.BufferedRandom name='/content/drive/MyDrive/final_project/1105_오전회의.wav'>

In [15]:
ls "/content/drive/MyDrive/final_project/"

 1105_오전회의_16k.wav   1105_오전회의.wav   pyannote.ipynb  'STT_Model .ipynb'
 1105_오전회의.m4a       DGBAH21000190.wav   Senko.ipynb


In [16]:
wav_path1 = "/content/drive/MyDrive/final_project/1105_오전회의_16k.wav"

In [17]:
result1 = diarizer.diarize(wav_path1, generate_colors=False)

# Print first few speaker segments
for seg in result1["merged_segments"][:5]:
    print(f"{seg['speaker']}: {seg['start']:.2f}s - {seg['end']:.2f}s")


    1105_오전회의_16k.wav
      ├── Voice activity detection ..... done [1.82s]
      ├── Fbank feature extraction ..... done [8.06s]
      ├── Embeddings generation ........ done [1.61s]
      └── Clustering ................... done [1.80s]

    Total diarization time: 13.30s

SPEAKER_04: 0.03s - 1.00s
SPEAKER_01: 1.00s - 3.75s
SPEAKER_04: 3.75s - 126.95s
SPEAKER_01: 128.60s - 208.51s
SPEAKER_02: 208.51s - 209.70s


## Analyzing the result

In [19]:
result1.keys()

dict_keys(['raw_segments', 'raw_speakers_detected', 'merged_speakers_detected', 'merged_segments', 'speaker_centroids', 'timing_stats'])

In [21]:
result1['raw_segments'][:5]

[{'speaker': 'SPEAKER_04', 'start': 0.03096875, 'end': 0.9976354166666667},
 {'speaker': 'SPEAKER_01',
  'start': 0.9976354166666667,
  'end': 3.7482604166666667},
 {'speaker': 'SPEAKER_03',
  'start': 2.7815937500000003,
  'end': 3.7482604166666667},
 {'speaker': 'SPEAKER_04', 'start': 3.7482604166666667, 'end': 126.94784375},
 {'speaker': 'SPEAKER_01',
  'start': 5.2790937499999995,
  'end': 14.239718750000002}]
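In the output above, some raw segments overlap in time (for example, `SPEAKER_03` starts while `SPEAKER_01` is still active). The sketch below lists such overlaps. It works on a rounded copy of the five segments shown and makes no claim about how Senko itself handles them. `overlapping_pairs` is a helper name made up for this example.

```python
# First five raw segments, copied from the output above and rounded to 3 decimals.
raw_segments = [
    {"speaker": "SPEAKER_04", "start": 0.031, "end": 0.998},
    {"speaker": "SPEAKER_01", "start": 0.998, "end": 3.748},
    {"speaker": "SPEAKER_03", "start": 2.782, "end": 3.748},
    {"speaker": "SPEAKER_04", "start": 3.748, "end": 126.948},
    {"speaker": "SPEAKER_01", "start": 5.279, "end": 14.240},
]


def overlapping_pairs(segments):
    """Return (a, b, overlap_seconds) for every pair of segments that overlap in time."""
    pairs = []
    for i, a in enumerate(segments):
        for b in segments[i + 1:]:
            # Positive only when the two intervals share some time span.
            overlap = min(a["end"], b["end"]) - max(a["start"], b["start"])
            if overlap > 0:
                pairs.append((a, b, overlap))
    return pairs


for a, b, overlap in overlapping_pairs(raw_segments):
    print(f"{a['speaker']} and {b['speaker']} overlap for {overlap:.2f}s")
```

The pairwise comparison is quadratic in the number of segments, which is fine for a quick look at a handful of segments but would need a sweep over sorted start times for long recordings.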

In [23]:
result1['raw_speakers_detected']

4

In [24]:
result1['merged_speakers_detected']

4

In [26]:
result1['merged_segments'][:5]

[{'speaker': 'SPEAKER_04', 'start': 0.03096875, 'end': 0.9976354166666667},
 {'speaker': 'SPEAKER_01',
  'start': 0.9976354166666667,
  'end': 3.7482604166666667},
 {'speaker': 'SPEAKER_04', 'start': 3.7482604166666667, 'end': 126.94784375},
 {'speaker': 'SPEAKER_01', 'start': 128.60284375000003, 'end': 208.50534375},
 {'speaker': 'SPEAKER_02', 'start': 208.50534375, 'end': 209.70284375}]
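A common next step with `merged_segments` is to total the speaking time per speaker. Here is a minimal sketch, assuming only the `speaker`, `start` and `end` keys visible in the output above. The data is a rounded copy of the five segments shown, and `speaking_time` is a helper name made up for this example.

```python
from collections import defaultdict

# First five merged segments, copied from the output above and rounded to 3 decimals.
merged_segments = [
    {"speaker": "SPEAKER_04", "start": 0.031, "end": 0.998},
    {"speaker": "SPEAKER_01", "start": 0.998, "end": 3.748},
    {"speaker": "SPEAKER_04", "start": 3.748, "end": 126.948},
    {"speaker": "SPEAKER_01", "start": 128.603, "end": 208.505},
    {"speaker": "SPEAKER_02", "start": 208.505, "end": 209.703},
]


def speaking_time(segments):
    """Sum segment durations (seconds) per speaker label."""
    totals = defaultdict(float)
    for seg in segments:
        totals[seg["speaker"]] += seg["end"] - seg["start"]
    return dict(totals)


totals = speaking_time(merged_segments)
# Print speakers from most to least talkative.
for speaker, seconds in sorted(totals.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{speaker}: {seconds:.1f}s")
```

In the notebook, passing `result1["merged_segments"]` instead of the copied list would cover the whole recording.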

In [27]:
result1['speaker_centroids']

{'SPEAKER_01': array([ 0.1384968 ,  0.03869756,  0.18093893,  0.00583312,  0.18648282,
         0.1327788 ,  0.02576677,  0.39688292,  0.22473563,  0.32121652,
         0.07376263,  0.18233094,  0.09347282,  0.08392147,  0.17489341,
         0.10710016,  0.02736935,  0.25635603,  0.43392834,  0.23541993,
         0.5425431 ,  0.05170824,  0.21380652,  0.2618908 ,  0.18008958,
         0.27240026,  0.5896259 ,  0.5328004 ,  0.13063905,  0.13001113,
         0.49800563,  0.34457728,  0.19830647,  0.27737334,  0.02190099,
         0.01962602,  0.0539074 ,  0.1303857 ,  0.01326968,  0.06699817,
         0.4081028 ,  0.02147262,  0.37411192,  0.25594944,  0.17794313,
         0.214274  ,  0.82653934,  0.36102635,  0.1330322 ,  0.13204658,
         0.02555575,  0.3146018 ,  0.30124903,  0.05304889,  0.0861363 ,
         0.13142169,  0.07866371,  0.27712032,  0.54350275,  0.12176625,
         0.03949152,  0.09984626,  0.41466728,  0.01421771,  0.17727932,
         0.06828993,  0.3314459 ,  0.

In [28]:
result1['timing_stats']

{'vad_time': 1.82,
 'fbank_time': 8.06,
 'embeddings_time': 1.61,
 'clustering_time': 1.8,
 'total_time': 13.3}