In [None]:
from moviepy.editor import VideoFileClip, AudioFileClip
import glob
import pandas as pd
import pydub
import re
import os
from IPython.display import Video, Audio

# Speaker Segmentation

Here I will try to identify panel speech, so it can be cut out from the audio segment.

In [None]:
# Finding good example Audios

# Import tsst-data dataframe
tsst_data = pd.read_csv("/data/dst_tsst_22_bi_multi_nt_lab/processed/audio_files/tsst_data.csv", index_col=0)
sample_tokens = ["SS291122", "ML031122", "NE563556", "JB011222"]
panel_intervals = {"SS291122": [], "ML031122": [(133, 138), (196, 199)], "NE563556": [],
					   "JB011222": [(149, 151), (173, 176), (213, 215), (241, 244), (275, 278)]}

#### "Transcripts" of Sample Segments

**SS291122** male participant

| Time      | Event   |
|-----------| --- |
| 4:15-4:25 | silence |


**ML031122** male participant

| Time      | Event       |
|-----------|-------------|
| 1:47-2:13 | silence |
| 2:13-2:18 | panel |
| 2:20-2:28 | silence |
| 2:39-3:16 | silence |
| 3:16-3:19 | panel |
| 4:08-4:14 | silence |
| 4:43-4:49 | silence |

**NE563556** female participant

| Time      | Event   |
|-----------|---------|
| 4:39-5:00 | silence |

**JB011222** female participant

| Time      | Event   |
|-----------|---------|
| 1:06-1:10 | silence |
| 2:07-2:29 | silence |
| 2:29-2:31 | panel |
| 2:36-2:53 | silence |
| 2:53-2:56 | panel |
| 3:06-3:10 | silence |
| 3:15-3:33 | silence |
| 3:33-3:35 | panel |
| 3:35-4:01 | silence |
| 4:01-4:04 | panel |
| 4:13-4:35 | silence |
| 4:35-4:38 | panel |
| 4:38-4:52 | silence |
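
The `panel_intervals` dictionary defined earlier holds the "panel" rows of these tables in seconds. As a minimal sketch of that conversion (the helper names `to_seconds` and `panel_rows_to_intervals` are made up here for illustration and are not used elsewhere in the notebook), the rows of one table can be turned into `(start, end)` tuples like this:

```python
def to_seconds(timestamp):
    """Convert an 'm:ss' timestamp such as '2:13' to whole seconds."""
    minutes, seconds = timestamp.split(":")
    return int(minutes) * 60 + int(seconds)


def panel_rows_to_intervals(rows):
    """Keep only rows labelled 'panel' and return (start, end) tuples in seconds."""
    intervals = []
    for time_range, label in rows:
        if label != "panel":
            continue
        start, end = time_range.split("-")
        intervals.append((to_seconds(start), to_seconds(end)))
    return intervals


# Rows copied from the ML031122 table
ml031122_rows = [
    ("1:47-2:13", "silence"),
    ("2:13-2:18", "panel"),
    ("2:20-2:28", "silence"),
    ("2:39-3:16", "silence"),
    ("3:16-3:19", "panel"),
    ("4:08-4:14", "silence"),
    ("4:43-4:49", "silence"),
]

ml031122_intervals = panel_rows_to_intervals(ml031122_rows)
print(ml031122_intervals)  # [(133, 138), (196, 199)]
```

The result matches the hand-written entry for ML031122 in `panel_intervals`.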

In [None]:
display(tsst_data)

In [None]:
# Code to go through audio segments
participant = 19
sample_audio = tsst_data["TSST_audio_segment"][participant]
sample_token = tsst_data["token"][participant]
print(sample_token)
Audio(sample_audio)

#### inaSpeechSegmenter

In [None]:
from inaSpeechSegmenter import Segmenter, seg2csv

In [None]:
segmenter = Segmenter(vad_engine='smn', detect_gender=True)
#segmenter = Segmenter(vad_engine='smn', detect_gender=False)

for sample_token in sample_tokens:
	sample_audio = tsst_data.loc[tsst_data['token'] == sample_token, 'TSST_audio_segment'].values[0]
	segmentation = segmenter(sample_audio)
	print(sample_token)
	print(segmentation, "\n")

**detect_gender=True**
This works well for the male participants: in both example audios the panel is correctly classified as female, with correct start and end times for the panel speech. It does not work for the female participants at all. There are many misclassifications as male (e.g. 11 times for NE563556) during silence or while the female participant is speaking, and even the start and end times are incorrect.

Without detect_gender there is no distinction between speakers. The other VAD engine is sm, which only distinguishes between speech and music, not noise. There are no further settings to tweak. **Not an option**
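
To put a number on "works well", the segments labelled as female can be compared against the annotated panel intervals. The sketch below assumes the segmentation is a list of `(label, start, end)` tuples, as printed in the cell above. The example segmentation is invented for illustration and is not real model output, and the helper names are hypothetical.

```python
def segments_with_label(segmentation, label):
    """Return (start, end) pairs of all segments carrying the given label."""
    return [(start, end) for seg_label, start, end in segmentation if seg_label == label]


def overlap_seconds(segments, intervals):
    """Total overlap in seconds between two lists of (start, end) pairs.

    Assumes the pairs within each list do not overlap each other,
    otherwise time would be counted twice.
    """
    total = 0.0
    for seg_start, seg_end in segments:
        for int_start, int_end in intervals:
            total += max(0.0, min(seg_end, int_end) - max(seg_start, int_start))
    return total


# Invented segmentation in the assumed (label, start, end) format
example_segmentation = [
    ("male", 0.0, 107.0),
    ("noEnergy", 107.0, 133.2),
    ("female", 133.2, 137.8),
    ("noEnergy", 137.8, 150.0),
    ("male", 150.0, 196.0),
    ("female", 196.0, 199.0),
]
annotated_panel = [(133, 138), (196, 199)]  # ML031122 entry of panel_intervals

female = segments_with_label(example_segmentation, "female")
hit = overlap_seconds(female, annotated_panel)
annotated_total = sum(end - start for start, end in annotated_panel)
detected_total = sum(end - start for start, end in female)

# Share of annotated panel time that was found, and share of detections that are panel
print(f"recall: {hit / annotated_total:.2f}, precision: {hit / detected_total:.2f}")
```

For the female participants, the same numbers would show the misclassifications as low precision of the "male" label rather than as a list to be counted by hand.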



#### audioSegmentation from pyAudioAnalysis

In [None]:
from pydub import AudioSegment
from pyAudioAnalysis import audioSegmentation

In [None]:
for sample_token in sample_tokens:
	sample_audio = tsst_data.loc[tsst_data['token'] == sample_token, 'TSST_audio_segment'].values[0]

	"""
	# Convert the mp3 file to wav-format
	audio = AudioSegment.from_file(sample_audio, format="wav")
	wav_path = sample_audio[:-3] + "wav"
	audio.export(wav_path, format="wav")
	"""
	print(sample_token)
	audioSegmentation.speaker_diarization(sample_audio, n_speakers=2, lda_dim=1, plot_res=1)

Parameters:
- n_speakers: number of speakers (clusters)
- lda_dim: LDA (Linear Discriminant Analysis) dimension (0 for no LDA), uses GaussianHMM instead
- plot_res: plot results yes/no

**No LDA (lda_dim=0)**
Does not show good results, way too many speaker switches.

**lda_dim=1**
If lda_dim should be C-1 (number of classes minus one), it would be 1 in our case, since we have two speakers. This does not improve the results at all.

**lda_dim=2**
Fewer speaker switches, but not the correct ones.

**Not an option**

#### Diarization by pydiar

In [None]:
import numpy as np
from pydiar.models import BinaryKeyDiarizationModel, Segment
from pydiar.util.misc import optimize_segments
from pydub import AudioSegment

In [None]:
sample_rate = 32000

for sample_token in sample_tokens:
	sample_audio = tsst_data.loc[tsst_data['token'] == sample_token, 'TSST_audio_segment'].values[0]
	audio = AudioSegment.from_file(sample_audio, format="wav")
	audio = audio.set_frame_rate(sample_rate)
	audio = audio.set_channels(1)
	diarization_model = BinaryKeyDiarizationModel()
	segments = diarization_model.diarize(sample_rate, np.array(audio.get_array_of_samples()))
	optimized_segments = optimize_segments(segments, skip_short_limit=2)
	print(sample_token)
	print(optimized_segments)
	print("\n")

To my surprise, the BinaryKeyDiarizationModel generates four speaker IDs for SS291122, although no panel speech was annotated there. The diarization for ML031122 is completely correct (id=2.0 is the panel). The result for NE563556 is completely correct as well: it predicts only one speaker for the whole segment (almost 5 minutes). Unfortunately, it also predicts only one speaker for JB011222 and misses all 5 times the panel is speaking.

It works great for two scenarios (male with panel, female without panel) and terribly for the other two. The next step is to see whether parameters can be tweaked. BinaryKeyDiarizationModel takes no parameters, and diarize only takes sample_rate and the audio. There are some parameters for optimize_segments:
- keep_gaps=False -> should be True if silence is to be cut out (we do not want this)
- skip_short_limit=0.5 -> segments produced by the model that are shorter than x (0.5 seconds) are cut out. The panel speaks for about 3 seconds on average, so I set this to 2 seconds.

But since the segmentation is already done at that point, tweaking this parameter only removes very short segments. It does not make the model perform better for the two participants it fails on.
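
The point that post-processing cannot recover missed panel speech can be illustrated with a toy version of such a filter. This is not pydiar's implementation of `optimize_segments`. It is a simplified stand-in on plain `(speaker_id, start, end)` tuples, with a hypothetical function name and invented segments.

```python
def drop_short_and_merge(segments, min_length):
    """Drop segments shorter than min_length seconds, then merge neighbours
    that belong to the same speaker.

    segments: list of (speaker_id, start, end) tuples sorted by start time.
    """
    kept = [seg for seg in segments if seg[2] - seg[1] >= min_length]
    merged = []
    for speaker, start, end in kept:
        if merged and merged[-1][0] == speaker:
            # Same speaker as the previous segment: extend it to the new end
            merged[-1] = (speaker, merged[-1][1], end)
        else:
            merged.append((speaker, start, end))
    return merged


# Invented model output: one spurious 0.8 s switch and one real 5 s panel turn
toy_segments = [
    (1, 0.0, 130.0),
    (2, 130.0, 130.8),
    (1, 130.8, 133.0),
    (2, 133.0, 138.0),
    (1, 138.0, 200.0),
]

cleaned = drop_short_and_merge(toy_segments, min_length=2)
print(cleaned)  # [(1, 0.0, 133.0), (2, 133.0, 138.0), (1, 138.0, 200.0)]

# If the model only ever predicts one speaker, there is nothing to clean up
print(drop_short_and_merge([(1, 0.0, 300.0)], min_length=2))  # [(1, 0.0, 300.0)]
```

A filter like this can only remove or merge segments the model already produced, which is why it does not help for JB011222, where the panel turns are missing from the model output in the first place.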

#### pyannote

In [None]:
from pyannote.audio import Pipeline

In [None]:
# 1. visit hf.co/pyannote/speaker-diarization and hf.co/pyannote/segmentation and accept user conditions (only if requested)
# 2. visit hf.co/settings/tokens to create an access token (only if you had to go through 1.)
# 3. instantiate pretrained speaker diarization pipeline
from pyannote.audio import Pipeline
pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization",
                                    use_auth_token="ACCESS_TOKEN_GOES_HERE")

# 4. apply pretrained pipeline
diarization = pipeline("audio.wav")

# 5. print the result
for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(f"start={turn.start:.1f}s stop={turn.end:.1f}s speaker_{speaker}")
# start=0.2s stop=1.5s speaker_0
# start=1.8s stop=3.9s speaker_1
# start=4.2s stop=5.7s speaker_0
# ...

In [None]:
import torch
pipeline = torch.hub.load('pyannote/pyannote-audio', 'dia_ami')
# speech activity detection model trained on AMI training set
sad = torch.hub.load('pyannote/pyannote-audio', 'sad_ami')
# speaker change detection model trained on AMI training set
scd = torch.hub.load('pyannote/pyannote-audio', 'scd_ami')
# overlapped speech detection model trained on AMI training set
ovl = torch.hub.load('pyannote/pyannote-audio', 'ovl_ami')
# speaker embedding model trained on AMI training set
emb = torch.hub.load('pyannote/pyannote-audio', 'emb_ami')



In [None]:
from pyannote.audio import Pipeline

#pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization", use_auth_token="ACCESS_TOKEN_GOES_HERE")

# 4. apply pretrained pipeline
diarization = pipeline(sample_audio)

# 5. print the result
for turn, _, speaker in diarization.itertracks(yield_label=True):
	print(f"start={turn.start:.1f}s stop={turn.end:.1f}s speaker_{speaker}")


#### Fundamental Frequencies

In [None]:
import seaborn as sns
import parselmouth
import matplotlib.pyplot as plt
sns.set()

In [None]:
# Plot raw waveform
for sample_token in sample_tokens:
	sample_audio = tsst_data.loc[tsst_data['token'] == sample_token, 'TSST_audio_segment'].values[0]
	# wav_path already created for sample files above
	snd = parselmouth.Sound(sample_audio)
	plt.figure()
	plt.title(sample_token)
	plt.plot(snd.xs(), snd.values.T)
	plt.xlim([snd.xmin, snd.xmax])
	plt.xlabel("time [s]")
	plt.ylabel("amplitude")
	for interval in panel_intervals[sample_token]:
		start_time, end_time = interval
		plt.axvspan(start_time, end_time, color='red', alpha=0.3)
	plt.show()

In [None]:
# Plot fundamental frequencies
for sample_token in sample_tokens:
	sample_audio = tsst_data.loc[tsst_data['token'] == sample_token, 'TSST_audio_segment'].values[0]
	# wav_path already created for sample files above
	snd = parselmouth.Sound(sample_audio[:-3]+"wav")
	channel_left = snd.extract_left_channel()
	channel_right = snd.extract_right_channel()
	mono = snd.convert_to_mono()

	pitch_stereo = snd.to_pitch()
	pitch_left = channel_left.to_pitch()
	pitch_right = channel_right.to_pitch()
	pitch_mono = mono.to_pitch()

	fig, axs = plt.subplots(1, 4, figsize=(18, 4))
	# Plot the fundamental frequencies for the stereo
	axs[0].plot(pitch_stereo.xs(), pitch_stereo.selected_array['frequency'])
	axs[0].set(xlabel='Time (s)', ylabel='Fundamental Frequency (Hz)', title='Stereo '+sample_token)

	# Plot the fundamental frequencies for the left channel
	axs[1].plot(pitch_left.xs(), pitch_left.selected_array['frequency'])
	axs[1].set(xlabel='Time (s)', ylabel='Fundamental Frequency (Hz)', title='Left Channel '+sample_token)

	# Plot the fundamental frequencies for the right channel
	axs[2].plot(pitch_right.xs(), pitch_right.selected_array['frequency'])
	axs[2].set(xlabel='Time (s)', ylabel='Fundamental Frequency (Hz)', title='Right Channel '+sample_token)

	# Plot the fundamental frequencies for the converted mono
	axs[3].plot(pitch_mono.xs(), pitch_mono.selected_array['frequency'])
	axs[3].set(xlabel='Time (s)', ylabel='Fundamental Frequency (Hz)', title='Mono '+sample_token)

	for interval in panel_intervals[sample_token]:
		for i in range(4):
			start_time, end_time = interval
			axs[i].axvspan(start_time, end_time, color='red', alpha=0.3)
	plt.show()

I tried plotting the fundamental frequencies, overlaid with the intervals in which the panel is speaking, for the stereo file, the mono file and both channels separately. Unfortunately, none of them aligns completely with the panel. While there are spikes in frequency when the panel is speaking, there are also (higher) spikes when the panel is not speaking. There is also always a longer break before the panel speaks, which shows up as lower frequencies, but that is not reliable either: in JB011222 there is a spike in the "break", and in ML031122 the "cut-off" for low frequency is not consistent. The other two recordings also contain spikes and longer stretches of low frequency, although not as distinct.

##### Combination of Amplitude and Frequency

Silences are easier to see in the amplitude, so one could pick out timepoints there that follow a longer break and check whether those timepoints show a higher frequency. This is an option, but it will most likely not work perfectly in cases where the panel gives other instructions, does not wait the full length, or the participant says an "mmh" in the middle that the panel does not count as speech, so it does not start the 20 seconds over.
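
A minimal sketch of this idea, on synthetic data rather than the real recordings: first find long quiet runs in the amplitude, then look at the mean fundamental frequency right after each run. The function names, the amplitude threshold (0.01), the minimum silence duration (15 s), the look-ahead window (3 s) and the pitch cut-off (165 Hz) are all assumptions for illustration and would need tuning on the actual files. Unvoiced frames are assumed to have a frequency of 0, as in the pitch arrays plotted above.

```python
import numpy as np


def find_long_silences(amplitude, sample_rate, threshold, min_duration):
    """Return (start, end) times in seconds of runs where |amplitude| stays
    below threshold for at least min_duration seconds."""
    quiet = np.abs(amplitude) < threshold
    # Pad with False so every quiet run has a rising and a falling edge
    padded = np.concatenate(([False], quiet, [False]))
    edges = np.diff(padded.astype(int))
    starts = np.flatnonzero(edges == 1)
    ends = np.flatnonzero(edges == -1)
    return [
        (start / sample_rate, end / sample_rate)
        for start, end in zip(starts, ends)
        if (end - start) / sample_rate >= min_duration
    ]


def mean_pitch_after(times, frequencies, start, window):
    """Mean of the voiced (non-zero) F0 values in [start, start + window).
    Returns NaN if there is no voiced frame in that window."""
    mask = (times >= start) & (times < start + window) & (frequencies > 0)
    return float(frequencies[mask].mean()) if mask.any() else float("nan")


# Synthetic signal: 5 s speech, 20 s near-silence, 3 s speech (100 samples/s)
sample_rate = 100
amplitude = np.concatenate([np.full(500, 0.5), np.full(2000, 0.001), np.full(300, 0.5)])

# Synthetic pitch track: 120 Hz speaker, unvoiced silence, then a 220 Hz speaker
times = np.arange(0, 28, 0.5)
frequencies = np.where(times < 5, 120.0, np.where(times < 25, 0.0, 220.0))

silences = find_long_silences(amplitude, sample_rate, threshold=0.01, min_duration=15)
for silence_start, silence_end in silences:
    pitch = mean_pitch_after(times, frequencies, silence_end, window=3.0)
    is_candidate = pitch > 165.0  # hypothetical cut-off for "higher frequency"
    print(f"silence {silence_start:.1f}-{silence_end:.1f}s, "
          f"mean F0 afterwards {pitch:.0f} Hz, panel candidate: {is_candidate}")
```

Thresholding the raw amplitude is the simplest choice here; a smoothed envelope would probably be more robust against short noises during a break, such as the "mmh" case mentioned above.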

In [None]:
Audio(tsst_data["TSST_audio_segment"][0])