# Experiment Setup

## Audio Corpus
We use an English excerpt from the IMDA National Speech Corpus (NSC). The aim of this transcription, and of the subsequent minutes generation, is to test the latest `large-v3` Whisper model on __Singlish__-accented speech.

We test on sample ID `3030` from the `NSC` dataset, a conversation between a Singaporean male and a Singaporean female speaker. The recording includes Singlish particles and interjections, e.g. 'aiya', 'leh', 'lah', and we want to see whether the model transcribes these phrases accurately. NSC provides:
- Speaker 1's Audio
- Speaker 2's Audio
- Overall Audio

## Transcription Corpus
The NSC also provides a transcription of the audio corpus. We will be using this to compare the model's transcription accuracy. NSC provides:
- Speaker 1's Transcription
- Speaker 2's Transcription

This will greatly aid our efforts when evaluating the text-level speaker diarization later on. Note that the transcriptions are provided in the TextGrid format. TextGrid is the plain-text annotation format used by Praat (tiers of time-stamped intervals with their text), rather than a variant of SSML, so the files need to be parsed accordingly.
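To make the reference transcriptions usable for comparison, the TextGrid intervals have to be read into Python. The sketch below assumes the NSC files follow Praat's long text layout (`xmin = ...`, `xmax = ...`, `text = "..."` per interval), which has not been verified against the corpus here. The function name `parse_textgrid_intervals` and the sample string are illustrative only.

```python
import re

# Matches one interval in Praat's long TextGrid text layout (assumed layout).
INTERVAL_RE = re.compile(
    r'xmin\s*=\s*([\d.]+)\s*'
    r'xmax\s*=\s*([\d.]+)\s*'
    r'text\s*=\s*"((?:[^"]|"")*)"'
)

def parse_textgrid_intervals(content):
    """Return (start, end, text) tuples for every non-empty interval."""
    intervals = []
    for xmin, xmax, text in INTERVAL_RE.findall(content):
        # Praat escapes a double quote inside text by doubling it
        text = text.replace('""', '"').strip()
        if text:
            intervals.append((float(xmin), float(xmax), text))
    return intervals

# Hypothetical excerpt, not taken from the corpus
sample = '''
        intervals [1]:
            xmin = 0.0
            xmax = 1.25
            text = "aiya so late already"
        intervals [2]:
            xmin = 1.25
            xmax = 2.0
            text = ""
'''
print(parse_textgrid_intervals(sample))
```

Silent intervals (empty text) are dropped so that only spoken segments are compared against the model output.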

## Model
We will use the latest `large-v3` model released by OpenAI, run it on the CPU (my current system has no CUDA GPU), and analyse the transcription accuracy. If the accuracy is up to standard, we can proceed to __Speaker Diarization__. Otherwise, we may put plans in place to retrain the model on Singlish-accented speech. As mentioned above, the audio corpus is the IMDA NSC, which is roughly 890 GB of data, so we use a small subset of it for the initial testing.
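The section does not name a metric for "transcription accuracy". One common choice is the word error rate (WER), the word-level edit distance between reference and hypothesis divided by the number of reference words. The sketch below is a minimal pure-Python version under that assumption; `word_error_rate` is a hypothetical helper and the sentences are made up.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # previous[j] = edit distance between the processed reference words and hyp[:j]
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    # Guard against an empty reference
    return previous[-1] / max(len(ref), 1)

reference = "aiya why you so late lah"
hypothesis = "aiya why you so late"
print(round(word_error_rate(reference, hypothesis), 3))
```

A dropped particle such as 'lah' counts as one deletion, so this metric directly penalises the Singlish omissions the experiment is interested in.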

## Future Plans
If the model is up to standard, we will proceed to __Speaker Diarization__ and __Minutes Generation__.

In [31]:
!pip install openai python-dotenv faster-whisper



In [1]:
import threading
import time
import dotenv
from openai import AzureOpenAI

# Locate the .env file once and read each setting from it
DOTENV_PATH = dotenv.find_dotenv()
DEPLOYMENT = dotenv.get_key(DOTENV_PATH, "DEPLOYMENT")
ENDPOINT = dotenv.get_key(DOTENV_PATH, "AZURE_OPENAI_ENDPOINT")
KEY = dotenv.get_key(DOTENV_PATH, "AZURE_OPENAI_KEY")
VERSION = dotenv.get_key(DOTENV_PATH, "AZURE_OPENAI_VERSION")
HF_ACCESS_KEY = dotenv.get_key(DOTENV_PATH, "HF_ACCESS_KEY")

# The API key is read from the AZURE_OPENAI_KEY entry of the .env file above
client = AzureOpenAI(
    # https://learn.microsoft.com/en-us/azure/ai-services/openai/reference#rest-api-versioning
    api_version=VERSION,
    # https://learn.microsoft.com/en-us/azure/cognitive-services/openai/how-to/create-resource?pivots=web-portal#create-a-resource
    azure_endpoint=ENDPOINT,
    api_key=KEY,
)

Python-dotenv could not parse statement starting at line 1
Python-dotenv could not parse statement starting at line 2
Python-dotenv could not parse statement starting at line 3
Python-dotenv could not parse statement starting at line 4

In [2]:
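import whisper
import torch

# Use the GPU when one is available, otherwise fall back to the CPU
device = "cuda:0" if torch.cuda.is_available() else "cpu"

# NOTE: the setup text targets `large-v3`; this cell loads the smaller `medium` checkpoint
model = whisper.load_model("medium", device=device)
model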

Whisper(
  (encoder): AudioEncoder(
    (conv1): Conv1d(80, 1024, kernel_size=(3,), stride=(1,), padding=(1,))
    (conv2): Conv1d(1024, 1024, kernel_size=(3,), stride=(2,), padding=(1,))
    (blocks): ModuleList(
      (0-23): 24 x ResidualAttentionBlock(
        (attn): MultiHeadAttention(
          (query): Linear(in_features=1024, out_features=1024, bias=True)
          (key): Linear(in_features=1024, out_features=1024, bias=False)
          (value): Linear(in_features=1024, out_features=1024, bias=True)
          (out): Linear(in_features=1024, out_features=1024, bias=True)
        )
        (attn_ln): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
        (mlp): Sequential(
          (0): Linear(in_features=1024, out_features=4096, bias=True)
          (1): GELU(approximate='none')
          (2): Linear(in_features=4096, out_features=1024, bias=True)
        )
        (mlp_ln): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
      )
    )
    (ln_post): LayerNorm((

# Meeting-Specific Prompts and Phrases


In [3]:
# general = ['Air Traffic Control communications','1','2','3','4','5','6','7','8','9','0','90','180','270','360']
# nato = [
# 	'Alpha', 'Bravo', 'Charlie', 'Delta', 'Echo', 'Foxtrot', 'Golf',
# 	'Hotel', 'India', 'Juliett', 'Kilo', 'Lima', 'Mike', 'November',
# 	'Oscar', 'Papa', 'Quebec', 'Romeo', 'Sierra', 'Tango', 'Uniform',
# 	'Victor', 'Whiskey', 'Xray', 'Yankee', 'Zulu'
# ]
# 
# atc_words = [
#     "acknowledge", "affirmative", "altitude", "approach", "apron", "arrival",
#     "bandbox", "base", "bearing", "cleared", "climb", "contact", "control",
#     "crosswind", "cruise", "descend", "departure", "direct", "disregard",
#     "downwind", "estimate", "final", "flight", "frequency", "go around",
#     "heading", "hold", "identified", "immediate", "information", "instruct",
#     "intentions", "land", "level", "maintain", "mayday", "message", "missed",
#     "navigation", "negative", "obstruction", "option", "orbit", "pan-pan",
#     "pattern", "position", "proceed", "radar", "readback", "received",
#     "report", "request", "required", "runway", "squawk", "standby", "takeoff",
#     "taxi", "threshold", "traffic", "transit", "turn", "vector", "visual",
#     "waypoint", "weather", "wilco", "wind", "with you", "speed",
#     "heavy", "light", "medium", "emergency", "fuel", "identifier",
#     "limit", "monitor", "notice", "operation", "permission", "relief",
#     "route", "signal", "stand", "system", "terminal", "test", "track",
#     "understand", "verify", "vertical", "warning", "zone", "no", "yes", "unable",
#     "clearance", "conflict", "coordination", "cumulonimbus", "deviation", "enroute",
#     "fix", "glideslope", "handoff", "holding", "IFR", "jetstream", "knots",
#     "localizer", "METAR", "NOTAM", "overfly", "pilot", "QNH", "radial",
#     "sector", "SID", "STAR", "tailwind", "transition", "turbulence", "uncontrolled",
#     "VFR", "wake turbulence", "X-wind", "yaw", "Zulu time", "airspace",
#     "briefing", "checkpoint", "elevation", "FL",
#     "ground control", "hazard", "ILS", "jetway", "kilo", "logbook", "missed approach",
#     "nautical mile", "offset", "profile", "quadrant", "RVR",
#     "static", "touchdown", "upwind", "variable", "wingtip", "Yankee", "zoom climb",
#     "airspeed", "backtrack", "ETOPS", "gate", "holding pattern", 
#     "jumpseat", "minimums", "pushback", "RNAV", "slot time", "taxiway", "TCAS",
#     "wind shear", "zero fuel weight", "ETA",
#     "flight deck", "ground proximity warning system", "jet route",
#     "landing clearance", "Mach number", "NDB", "obstacle clearance",
#     "PAPI", "QFE", "radar contact",
#     'ATC', 'Pilot', 'Call sign', 'Altitude', 'Heading', 'Speed', 'Climb to', 'Descend to',
#     'Maintain', 'Tower', 'Ground', 'Runway', 'Taxi', 'Takeoff', 'Landing',
#     'Flight level', 'Traffic', 'Hold short', 'Cleared for',
#     'Roger', 'Visibility', 'Weather', 'Wind', 'Gusts',
#     'Icing conditions', 'Deicing', 'VFR', 'IFR', 'No-fly zone',
#     'Restricted airspace', 'Flight path', 'Direct route', 'Vector', 'Frequency change',
#     'Final approach', 'Initial climb to', 'Contact approach', 'FIR', 'Control zone', 'TMA',
#     'Missed approach', 'Minimum safe altitude', 'Transponder',
#     'Reduce speed to', 'Increase speed to',
#     'Flight conditions', 'Clear of conflict', 'Resume own navigation', 'Request altitude change',
#     'Request route change', 'Flight visibility', 'Ceiling', 'Severe weather', 'Convective SIGMET',
#     'AIRMET', 'QNH', 'QFE', 'Transition altitude', 'Transition level',
#     'NOSIG', 'TFR', 'Special use airspace',
#     'MOA', 'IAP', 'Visual approach',
#     'NDB', 'VOR',
#     'ATIS', 'Engine start clearance',
#     'Line up and wait', 'Unicom', 'Cross runway', 'Departure frequency',
#     'Arrival frequency', 'Go-ahead', 'Hold position', 'Check gear down',
#     'Touch and go', 'Circuit pattern', 'Climb via SID',
#     'Descend via STAR', 'Speed restriction', 'Flight following', 'Radar service terminated', 'Squawk VFR',
#     'Change to advisory frequency', 'Report passing altitude', 'Report position',
#     'ATD', 'Block altitude', 'Cruise climb', 'Direct to', 'Execute missed approach',
#     'In-flight refueling', 'Joining instructions', 'Lost communications', 'MEA', 'Next waypoint', 'OCH',
#     'Procedure turn', 'Radar vectoring', 'Radio failure', 'Short final', 'Standard rate turn',
#     'TRSA', 'Undershoot', 'VMC',
#     'Wide-body aircraft', 'Yaw damper', 'Zulu time conversion', 'RNAV',
#     'RNP', 'Barometric pressure', 'Control tower handover', 'Datalink communication',
#     'ELT', 'FDR', 'GCI',
#     'Hydraulic failure', 'IMC', 'Knock-it-off',
#     'LVO', 'MAP', 'NAVAIDS',
#     'Oxygen mask deployment', 'PAR', 'QRA',
#     'Runway incursion', 'SAR', 'Tail strike', 'Upwind leg', 'Vertical speed',
#     'Wake turbulence category', 'X-ray cockpit security', 'Yield to incoming aircraft', 'Zero visibility takeoff','good day'
# ]
# 
# 
# collated_list = general + nato + atc_words 


general = ['Singaporean Singlish Meeting Transcription Recording']

singlish_phrases = [
	"ah", "lah", "aiya", "leh", "aiyo", "can or not", "on the ball", "makan session", "pow-wow",
	"kiasu", "bo jio", "sian", "shiok", "jialat", 'saigang',
	"talk cock", "wayang", "kena", "chop-chop", "steady",
	"own time own target (OTOT)", "kopi talk", "catch up", "brainstorm", "align",
	"lobang", "paiseh", "action", "agaration", "angkat bola",
	"bao ga liao", "buay pai", "cheem", "chio", "garang",
	"goondu", "kaypoh", "leh", "lor", "nia",
	"one corner", "open table", "pai seh", "relak one corner", "sabo",
	"sai kang", "shiok", "siam", "sikit-sikit", "suay",
	"tabao", "talk shop", "tan tio", "up lorry", "wa kau"
]

singlish_business_phrases = [
	"lah", "can or not?", "on the ball", "kiasu", "shiok",
	"talk cock", "steady pom pi pi", "own time own target", "bo jio", "catch no ball",
	"chiong", "chop chop", "die die must do", "eat snake", "gostan",
	"jialat", "kaypoh", "leh", "lor", "makan",
	"nabei", "paiseh", "sabo", "sian", "suay",
	"walao eh", "wayang", "win already lor", "yaya papaya", "zi high",
	"send it", "check back next week", "let’s touch base on this", "circle back on that", "park this for now",
	"align our ducks", "low key", "see how", "can make it", "noted with thanks",
	"bo bian", "anyhow", "confirm plus chop", "got chance", "mai tu liao",
	"double confirm", "one shot", "over already", "swee", "talk later"
]

collated_list = general + singlish_phrases + singlish_business_phrases 


collated_list_string = ' '.join(collated_list)
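The phrase lists above repeat several entries ("leh", "shiok", "kiasu", and others), and Whisper is generally documented to use only a limited tail of the prompt (on the order of the final 224 tokens), so repeated phrases waste that budget. A minimal sketch of de-duplicating before joining is shown here; `build_prompt` and the comma separator are assumptions for illustration, not part of the original pipeline.

```python
def build_prompt(phrases, separator=", "):
    """Join phrases into one prompt string, dropping case-insensitive repeats."""
    seen = set()
    unique = []
    for phrase in phrases:
        cleaned = phrase.strip()
        key = cleaned.lower()
        # Keep only the first occurrence of each phrase, preserving order
        if key and key not in seen:
            seen.add(key)
            unique.append(cleaned)
    return separator.join(unique)

sample_phrases = ["lah", "leh", "shiok", "leh", "Shiok", "paiseh"]
print(build_prompt(sample_phrases))  # lah, leh, shiok, paiseh
```

Applied to the notebook's lists, the call would be `build_prompt(general + singlish_phrases + singlish_business_phrases)`.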

# Our Experiment
## Label-Aware Strided Adaptive Diarization

In this experiment, we implement a strategy to update chunks with strides, incorporating both past and future context, and enhancing the diarization process by making the model aware of speaker labels from previous chunks.

### Methodology

We split the transcribed lines into `x` chunks and extend each chunk with a stride of `a` lines behind and in front of it. The overlap carries conversational context across chunk boundaries, which should yield a more accurate diarization.

#### Chunking and Stride

1. **Chunk Size and Number of Chunks:**

\begin{equation}
N = \text{length of list}
\end{equation}

\begin{equation}
c = \text{chunk size}, \, a = \text{stride length}
\end{equation}

\begin{equation}
x = \left\lceil \frac{N}{c} \right\rceil, \, \text{number of chunks}
\end{equation}

2. **Initial Chunk Definition:**

\begin{equation}
S_i = i \cdot c, \, E_i = \min((i+1) \cdot c - 1, N-1), \, 0 \leq i < x
\end{equation}

3. **Stride Modifications:**

\begin{equation}
\text{For } 1 \leq i \leq x-1: \text{ prepend the last } a \text{ elements of the initial } \text{chunk}_{i-1} \text{ to } \text{chunk}_i \text{ (backstride)}
\end{equation}

\begin{equation}
\text{For } 1 \leq i \leq x-2: \text{ append the first } a \text{ elements of the initial } \text{chunk}_{i+1} \text{ to } \text{chunk}_i \text{ (forwardstride)}
\end{equation}

\begin{equation}
\text{Both strides keep their original order; } \text{chunk}_0 \text{ remains unchanged.}
\end{equation}
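The definitions in steps 1 and 2 can be sketched in a few lines. The helper name `chunk_bounds` is hypothetical; it simply evaluates \(x = \lceil N / c \rceil\) and the inclusive bounds \((S_i, E_i)\) as defined above.

```python
import math

def chunk_bounds(n, c):
    """Return the inclusive (S_i, E_i) index bounds of each initial chunk."""
    x = math.ceil(n / c)  # number of chunks
    return [(i * c, min((i + 1) * c - 1, n - 1)) for i in range(x)]

print(chunk_bounds(22, 6))  # [(0, 5), (6, 11), (12, 17), (18, 21)]
```

The `min` only matters for the final chunk, which is shorter whenever `c` does not divide `N`.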

### Example

Consider a list of \(N = 22\) elements split with chunk size \(c = 6\), giving \(x = \lceil 22/6 \rceil = 4\) chunks, and its updated version with a stride \(a=2\):

#### Initial chunked list:

\begin{bmatrix}
[0, 1, 2, 3, 4, 5] \\
[6, 7, 8, 9, 10, 11] \\
[12, 13, 14, 15, 16, 17] \\
[18, 19, 20, 21]
\end{bmatrix}


#### Updated with stride:

\begin{bmatrix}
[0, 1, 2, 3, 4, 5] \\
[4, 5, 6, 7, 8, 9, 10, 11, 12, 13] \\
[10, 11, 12, 13, 14, 15, 16, 17, 18, 19] \\
[16, 17, 18, 19, 20, 21]
\end{bmatrix}
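The two lists above can be reproduced with a short sketch. It assumes the behaviour visible in the example: the first chunk is left unchanged, interior chunks receive both strides, and the last chunk receives only the backstride. `strided_index_chunks` is a hypothetical helper and is not the notebook's `chunk_with_stride_and_indices`.

```python
def strided_index_chunks(n, c, a):
    """Build index chunks of size c over range(n), then add strides of length a."""
    base = [list(range(start, min(start + c, n))) for start in range(0, n, c)]
    last = len(base) - 1
    strided = []
    for i, chunk in enumerate(base):
        # Backstride: last a indices of the previous initial chunk
        back = base[i - 1][-a:] if i > 0 and a > 0 else []
        # Forwardstride: first a indices of the next initial chunk (interior chunks only)
        forward = base[i + 1][:a] if 0 < i < last else []
        strided.append(back + chunk + forward)
    return strided

for chunk in strided_index_chunks(22, 6, 2):
    print(chunk)
```

Working on indices rather than sentences keeps the mapping back to the original transcript lines explicit.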



### Diarization Process

Each chunk, upon submission, will carry the respective speaker labels for the preceding \(a\) elements, enriching the context for improved accuracy.

**Backstride:** Elements from the past stride

**Forwardstride:** Elements from the future stride

Take chunk \(i = 1\), i.e. the second chunk under zero-based indexing (\(0 \leq i < x\)):


\begin{bmatrix}
[4, 5, 6, 7, 8, 9, 10, 11, 12, 13]
\end{bmatrix}


The speaker labels already assigned to the \(a=2\) `backstride` elements are incorporated into the context.

If, for instance, elements \(4\) and \(5\) were both labelled `Speaker 1` when the previous chunk was diarized, the Large Language Model (LLM) receives that history alongside the new chunk, hypothetically enabling more precise and more consistent diarization results.
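A minimal sketch of how that history could be packaged is shown below. It assumes the previous chunk's sentences and labels are available as two parallel lists; `backstride_context` is a hypothetical helper, and the sentences are invented.

```python
def backstride_context(prev_sentences, prev_labels, a):
    """Pair the last a sentences of the previous chunk with their labels."""
    if len(prev_sentences) != len(prev_labels):
        raise ValueError("sentences and labels must have the same length")
    if a <= 0:
        return []
    return list(zip(prev_sentences[-a:], prev_labels[-a:]))

prev_sentences = ["Morning all.", "Can start or not?", "Can.", "Okay lah."]
prev_labels = ["Speaker 1", "Speaker 1", "Speaker 2", "Speaker 1"]
print(backstride_context(prev_sentences, prev_labels, 2))
# [('Can.', 'Speaker 2'), ('Okay lah.', 'Speaker 1')]
```

Returning a list of pairs instead of a `{sentence: label}` dictionary keeps both entries when the same short sentence (for example "Okay.") occurs twice in the stride.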

### Final Formulation

#### Initial Chunks:

\begin{equation}
\text{chunks}_i = [i \cdot c \text{ to } \min((i+1) \cdot c - 1, N-1)]
\end{equation}

#### Stride Modifications:

\begin{equation}
\text{chunks}_i = [(\text{chunks}_{i-1} \text{ last } a \text{ elements}) + \text{chunks}_i + (\text{chunks}_{i+1} \text{ first } a \text{ elements})]
\end{equation}

#### Label-Aware Diarization:

\begin{equation}
\text{Speaker Labels for chunk}_i = \text{LLM}( \text{chunks}_i \text{ with context from previous labels} )
\end{equation}

This approach enriches the information available for each chunk, aiding the Language Model in generating more accurate diarization outcomes by leveraging historical speaker labels.
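The formulation can be sketched as a loop with a pluggable labelling function, so that the control flow can be checked without calling an LLM. Everything here is an assumption for illustration: `label_aware_diarize`, the `label_fn(sentences, context)` signature, and `stub_labeler`, which merely stands in for the model call.

```python
def label_aware_diarize(lines, index_chunks, a, label_fn):
    """Label each strided chunk, passing known labels for its first a lines as context."""
    assigned = {}   # line index -> most recently assigned speaker label
    per_chunk = []
    for indices in index_chunks:
        sentences = [lines[i] for i in indices]
        # Backstride context: labels already known for the first a lines of this chunk
        context = [(lines[i], assigned[i]) for i in indices[:a] if i in assigned]
        labels = label_fn(sentences, context)
        if len(labels) != len(indices):
            raise ValueError(f"expected {len(indices)} labels, got {len(labels)}")
        per_chunk.append(labels)
        assigned.update(zip(indices, labels))
    return per_chunk

def stub_labeler(sentences, context):
    # Stand-in for the LLM call: continue with the last known speaker, else "Speaker 1"
    speaker = context[-1][1] if context else "Speaker 1"
    return [speaker] * len(sentences)

lines = [f"sentence {i}" for i in range(8)]
index_chunks = [[0, 1, 2, 3], [2, 3, 4, 5, 6, 7]]
print(label_aware_diarize(lines, index_chunks, a=2, label_fn=stub_labeler))
```

The length check surfaces the most likely failure of a real model call, returning fewer labels than sentences, before the labels are stored.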


In [4]:
import os
import re
import ast
from datetime import datetime
import pandas as pd
import numpy as np
import json
from collections import deque

In [5]:
# Setup code
# Specify the directory path
transcriptions_dir = "./transcriptions"
audio_file = '../content/meeting/daily_ketchup.wav'
specific_filename = "daily_ketchup.json"

In [6]:
import os
import json
from datetime import datetime

# Check if the directory exists
if not os.path.exists(transcriptions_dir):
    # If the directory does not exist, create it
    os.makedirs(transcriptions_dir)

def find_latest_transcription(directory, specific_filename):
    for filename in os.listdir(directory):
        if filename == specific_filename:
            return filename
    return None

# Attempt to find the specific transcription file
latest_transcription_file = find_latest_transcription(transcriptions_dir, specific_filename)

if latest_transcription_file:
    # Full path for the latest file including the directory
    file_path = os.path.join(transcriptions_dir, latest_transcription_file)
    # Extract the filename without the extension
    file_name, _ = os.path.splitext(latest_transcription_file)
    try:
        with open(file_path, "r") as stored_result:
            # If the file exists and is opened successfully, read the content
            result = json.load(stored_result)
        print('\033[92mTranscription located:\033[0m')
        # Extract and print the first 5 sentences
        segments = result.get('segments', [])
        for segment in segments[:5]:
            start_time = round(float(segment['start']),2)
            end_time = round(float(segment['end']),2)
            text = segment['text']
            print(f'[{start_time}:{end_time}] -> {text}')
        print('...')

    except FileNotFoundError:
        print("File not found, although it was expected to exist.")
else:
    # No transcription file matching the pattern was found
    print('\033[91mNo matching transcription files found.\033[0m')
    # If the file does not exist, execute the transcription process and create the file
    # `initial_prompt` is the argument Whisper's transcribe() uses to bias the vocabulary
    temp_result = model.transcribe(audio_file, verbose=True, language="en", initial_prompt=collated_list_string)
    file_path = os.path.join(transcriptions_dir, specific_filename)
    with open(file_path, "w") as f:
        json.dump(temp_result, f)
    result = temp_result
    # Extract the filename without the extension
    file_name, _ = os.path.splitext(os.path.basename(file_path))
    print(f'\033[92mTranscription completed and saved at {file_path}.\033[0m')

# Collect the text of each segment, dropping the leading space Whisper adds
per_line = [segment['text'].lstrip() for segment in result['segments']]


No matching transcription files found.
[00:00.000 --> 00:06.880]  After a few weeks, he didn't see any recovery, muscle recovery.
[00:08.160 --> 00:13.280]  So he said, very likely that I will have to be paralysed for life.
[00:13.280 --> 00:16.320]  What goes through your head though, when this information is given to a man?
[00:16.320 --> 00:20.720]  Initially, there was a lot of denial that this actually was going to happen.
[00:20.720 --> 00:22.400]  All my friends are riders.
[00:22.400 --> 00:27.920]  All of us have been through accidents, even some major ones.
[00:27.920 --> 00:32.960]  My friend banged his body into a tree and his rib punctured his lungs.
[00:32.960 --> 00:34.800]  It was quite bad.
[00:34.800 --> 00:39.280]  But after a few months, he still managed to recover and he's back on his feet.
[00:39.280 --> 00:42.800]  So how long did it take for you to be discharged from the hospital?
[00:42.800 --> 00:46.080]  I was in hospital for a total of about six mon

In [9]:
len(per_line)

33

In [13]:
# Initialization code for strided_chunks and strided_chunk_indices
STRIDE = 2
TOTAL_NUMBER_OF_LINES = len(per_line)
DESIRED_CHUNK_SIZE = 4

def calculate_number_of_chunks_with_stride(total_lines, chunk_size, stride):
    effective_chunk_size = chunk_size + stride - 1  # Adjust chunk size to account for stride
    number_of_chunks = (total_lines + effective_chunk_size - 1) // effective_chunk_size
    return number_of_chunks

def find_optimal_chunks(total_lines, desired_chunk_size, max_stride):
    # Iterate through stride values from max_stride down to 1
    for stride in range(max_stride, 0, -1):
        number_of_chunks = calculate_number_of_chunks_with_stride(total_lines, desired_chunk_size, stride)
        # Check if the total coverage with the current number of chunks and chunk size is sufficient
        if number_of_chunks * desired_chunk_size >= total_lines:
            return number_of_chunks, stride
    return -1, -1  # Return an error if no suitable chunk and stride combination is found

NUMBER_OF_CHUNKS, STRIDE = find_optimal_chunks(TOTAL_NUMBER_OF_LINES, DESIRED_CHUNK_SIZE, STRIDE)
print('Optimal Number of Chunks:', NUMBER_OF_CHUNKS)
print('Optimal Stride:',STRIDE)
# A stride of 2 is preferred here because it retains more history, so override the computed value
STRIDE = 2
print('Forced Stride:', STRIDE)

Optimal Number of Chunks: 9
Optimal Stride: 1
Forced Stride: 2


In [14]:
def chunk_with_stride_and_indices(initial_list: list, stride: int, number_of_chunks: int):
    """Split `initial_list` into overlapping chunks; return the chunks and their index lists."""
    # Each chunk is extended by (stride - 1) elements on either side, so neighbouring
    # chunks share 2 * (stride - 1) elements (2 elements when stride == 2).
    stride -= 1
    N = len(initial_list)

    # Base chunk size (ceiling division), before any stride is added
    base_chunk_size = (N + number_of_chunks - 1) // number_of_chunks

    stride_chunks = []
    stride_chunk_indices = []

    for i in range(number_of_chunks):
        # Calculate the effective start and end, incorporating stride where applicable
        start = max(0, i * base_chunk_size - stride)
        end = min(N, (i + 1) * base_chunk_size + stride if i < number_of_chunks - 1 else N)

        # Slice the original list and indices accordingly
        current_chunk = initial_list[start:end]
        current_indices = list(range(start, end))

        stride_chunks.append(current_chunk)
        stride_chunk_indices.append(current_indices)

    return stride_chunks, stride_chunk_indices

strided_chunks, strided_chunk_indices = chunk_with_stride_and_indices(per_line, STRIDE, NUMBER_OF_CHUNKS)

# Remove empty lists
strided_chunk_indices = [ele for ele in strided_chunk_indices if ele != []]
strided_chunks = [ele for ele in strided_chunks if ele != []]

# 'per_line' and 'strided_chunk_indices' come from the cells above
comparison_df = pd.DataFrame({'original': per_line})

# Use numpy to efficiently calculate the min and max values for DataFrame index
min_val = np.min([min(sublist) for sublist in strided_chunk_indices])
max_val = np.max([max(sublist) for sublist in strided_chunk_indices])

# Initialize the DataFrame with the correct index range
strided_chunk_df = pd.DataFrame(index=np.arange(min_val, max_val + 1))

# Populate the DataFrame with strided chunk data
for i, sublist in enumerate(strided_chunk_indices):
    # Direct assignment to the DataFrame using loc for precise index matching
    strided_chunk_df.loc[sublist, f'strided_chunk_{i}'] = strided_chunks[i]

# Combine the initial comparison DataFrame with the newly created strided chunk DataFrame
combined_df = pd.concat([comparison_df, strided_chunk_df], axis=1)

class TextDiarizer:
    def __init__(self, client, deployment):
        self.client = client
        self.deployment = deployment

    def diarize_chunk(self, sentences, prev_labels_info=None):
        """Diarize a chunk of text, optionally using information from previous labels."""
        system_message = "You are a linguistics expert with 100 years of experience. You will be given a transcription of a meeting between 4 people, and you are to assign the Speaker label to each sentence PER line. I.e. Given the prompt, you will return me: ['Speaker 1', 'Speaker 2', 'Speaker 2']. There is a possibility that a speaker may speak for more than 1 line at time. You will DO YOUR JOB WELL."

        if prev_labels_info:
            prev_speaker_labels = list(prev_labels_info.values())
            sentences_dict = prev_speaker_labels[0]
            speaker_labels_dict = prev_speaker_labels[1]

            # creating {'sentence':speaker_label} dictionary
            sentence_speaker_mapping = {value: speaker_labels_dict[key] for key, value in sentences_dict.items()}

            user_message = f"Here are the previous exchanges RIGHT before this followed by their respective speaker(s):\n{sentence_speaker_mapping} \n Here is the list of sentences:\n{sentences}\nNote that this contains the previous exchanges as well. I MUST RECEIVE ALL {len(sentences)} exchanges. JUST RETURN ME THE LIST."
        else:
            user_message = f"Here is the list of sentences: \n{sentences}. \nThere are {len(sentences)} exchanges. You will diarize ALL the sentences in the list. You WILL ensure that you label ALL {len(sentences)} lines. JUST RETURN ME THE LIST."
        print('user message:', user_message + '\n\n')
        diarization = self.client.chat.completions.create(
            model=self.deployment,
            messages=[
                {"role": "system", "content": system_message},
                {"role": "user", "content": user_message}
            ],
            max_tokens=2500,
            stream=False,
            temperature=0.2,
        )
        return ast.literal_eval(diarization.choices[0].message.content)

    def get_stride_info(self, chunk_df, total_label_list, chunk_number, stride):
        """Retrieve and label the information for a given stride."""
        column_name = f'strided_chunk_{chunk_number}'
        chunk_df[column_name] = chunk_df[column_name].replace('nan', np.nan)

        _ = chunk_df[column_name].dropna()
        stride_df = pd.DataFrame(_.tail(stride))
        previous_speaker_labels = total_label_list[-1][-stride:]
        stride_df['speaker_labels'] = previous_speaker_labels
        return stride_df.to_dict()

    def label_aware(self, stride, number_of_chunks, combined_chunk_df):
        total_label_list = []

        # Process the first chunk
        combined_chunk_df['strided_chunk_0'] = combined_chunk_df['strided_chunk_0'].replace('nan', np.nan)
        first_chunk = combined_chunk_df['strided_chunk_0'].dropna().tolist()
        speaker_labels = self.diarize_chunk(first_chunk)
        total_label_list.append(speaker_labels)

        # Process subsequent chunks
        for chunk_number in range(1, (len(combined_chunk_df.columns) - 1)):
            column_name = f'strided_chunk_{chunk_number}'
            combined_chunk_df[column_name] = combined_chunk_df[column_name].replace('nan', np.nan)
            current_chunk = combined_chunk_df[column_name].dropna(how='any').tolist()
            prev_labels_info = self.get_stride_info(combined_chunk_df, total_label_list, chunk_number - 1, stride)
            speaker_labels = self.diarize_chunk(current_chunk, prev_labels_info)
            total_label_list.append(speaker_labels)

        return total_label_list

diarizer = TextDiarizer(client, DEPLOYMENT)
final_labels = diarizer.label_aware(STRIDE, NUMBER_OF_CHUNKS, combined_df)
print(final_labels)

def final_df(per_line, strided_chunk_indices, final_labels):
    # Start from the original transcript lines, one row per line
    label_comparison_df = pd.DataFrame({'original': per_line})

    # Use numpy to efficiently calculate the min and max values for DataFrame index
    min_val = np.min([min(sublist) for sublist in strided_chunk_indices])
    max_val = np.max([max(sublist) for sublist in strided_chunk_indices])

    # Initialize the DataFrame with the correct index range
    label_strided_chunk_df = pd.DataFrame(index=np.arange(min_val, max_val + 1))

    # Populate the DataFrame with strided chunk data
    for i, sublist in enumerate(strided_chunk_indices):
        # Direct assignment to the DataFrame using loc for precise index matching
        label_strided_chunk_df.loc[sublist, f'strided_chunk_{i}'] = final_labels[i]

    # Combine the initial comparison DataFrame with the newly created strided chunk DataFrame
    combined_label_df = pd.concat([label_comparison_df, label_strided_chunk_df], axis=1)
    return combined_label_df

final_label_df = final_df(per_line, strided_chunk_indices, final_labels)
final_label_df


user message: Here is the list of sentences: 
["After a few weeks, he didn't see any recovery, muscle recovery.", 'So he said, very likely that I will have to be paralysed for life.', 'What goes through your head though, when this information is given to a man?', 'Initially, there was a lot of denial that this actually was going to happen.', 'All my friends are riders.']. 
There are 5 exchanges. You will diarize ALL the sentences in the list. You WILL ensure that you label ALL 5 lines. JUST RETURN ME THE LIST.

user message: Here are the previous exchanges RIGHT before this followed by their respective speaker(s):
{'Initially, there was a lot of denial that this actually was going to happen.': 'Speaker 3', 'All my friends are riders.': 'Speaker 4'} 
 Here is the list of sentences:
['Initially, there was a lot of denial that this actually was going to happen.', 'All my friends are riders.', 'All of us have been through accidents, even some major ones.', 'My friend banged his body into a

Unnamed: 0,original,strided_chunk_0,strided_chunk_1,strided_chunk_2,strided_chunk_3,strided_chunk_4,strided_chunk_5,strided_chunk_6,strided_chunk_7,strided_chunk_8
0,"After a few weeks, he didn't see any recovery,...",Speaker 1,,,,,,,,
1,"So he said, very likely that I will have to be...",Speaker 1,,,,,,,,
2,"What goes through your head though, when this ...",Speaker 2,,,,,,,,
3,"Initially, there was a lot of denial that this...",Speaker 3,Speaker 3,,,,,,,
4,All my friends are riders.,Speaker 4,Speaker 4,,,,,,,
5,"All of us have been through accidents, even so...",,Speaker 4,,,,,,,
6,My friend banged his body into a tree and his ...,,Speaker 4,,,,,,,
7,It was quite bad.,,Speaker 4,Speaker 4,,,,,,
8,"But after a few months, he still managed to re...",,Speaker 4,Speaker 4,,,,,,
9,So how long did it take for you to be discharg...,,,Speaker 1,,,,,,
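`diarize_chunk` passes the raw model reply straight to `ast.literal_eval`, which fails if the reply wraps the list in extra words or a code fence, and it does not check that every line received a label. A more defensive parser might look like the sketch below; `parse_label_list` is a hypothetical helper, not part of the notebook.

```python
import ast
import re

def parse_label_list(reply, expected_len):
    """Extract a Python list of speaker labels from a model reply and validate it."""
    # Take everything from the first '[' to the last ']' in the reply
    match = re.search(r"\[.*\]", reply, flags=re.DOTALL)
    if match is None:
        raise ValueError("no list found in reply")
    labels = ast.literal_eval(match.group(0))
    if not isinstance(labels, list) or not all(isinstance(x, str) for x in labels):
        raise ValueError("reply is not a list of strings")
    if len(labels) != expected_len:
        raise ValueError(f"expected {expected_len} labels, got {len(labels)}")
    return labels

reply = "Sure, here is the list:\n['Speaker 1', 'Speaker 2', 'Speaker 2']"
print(parse_label_list(reply, 3))  # ['Speaker 1', 'Speaker 2', 'Speaker 2']
```

Raising on a length mismatch lets the caller retry the chunk instead of silently misaligning labels and transcript rows.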


### Storing Results

In [15]:
import pandas as pd
import tables

In [16]:
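# Make sure the output directory exists before writing
os.makedirs('./final_labels', exist_ok=True)
DF_FILE_NAME = os.path.join('./final_labels', file_name + '.h5')
# The context manager closes the HDF5 file once the DataFrame is written
with pd.HDFStore(DF_FILE_NAME) as store:
    store['df'] = final_label_df  # save it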

# Diarization at an utterance level

In [10]:
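from pyannote.audio import Pipeline
import torch

# The pretrained pipeline is gated on Hugging Face, so pass the access token loaded from .env
pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1", use_auth_token=HF_ACCESS_KEY)

# send pipeline to GPU (when available), otherwise stay on the CPU
pipeline.to(torch.device("cuda" if torch.cuda.is_available() else "cpu"))

# apply pretrained pipeline
diarization = pipeline(audio_file, num_speakers=4)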

# print the result
for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(f"start={turn.start:.1f}s stop={turn.end:.1f}s speaker_{speaker}")
# start=0.2s stop=1.5s speaker_0
# start=1.8s stop=3.9s speaker_1
# start=4.2s stop=5.7s speaker_0
# ...

Found only 3 clusters. Using a smaller value than 12 for `min_cluster_size` might help.
start=0.0s stop=5.1s speaker_SPEAKER_00
start=5.5s stop=6.9s speaker_SPEAKER_00
start=7.5s stop=7.9s speaker_SPEAKER_00
start=8.0s stop=10.4s speaker_SPEAKER_00
start=10.8s stop=13.2s speaker_SPEAKER_00
start=13.4s stop=16.3s speaker_SPEAKER_02
start=16.3s stop=37.6s speaker_SPEAKER_00
start=17.0s stop=17.3s speaker_SPEAKER_02
start=26.0s stop=26.5s speaker_SPEAKER_02
start=33.0s stop=34.2s speaker_SPEAKER_02
start=37.6s stop=37.8s speaker_SPEAKER_02
start=37.8s stop=46.1s speaker_SPEAKER_00
start=37.8s stop=38.2s speaker_SPEAKER_02
start=38.2s stop=38.3s speaker_SPEAKER_01
start=38.3s stop=38.3s speaker_SPEAKER_02
start=38.3s stop=38.3s speaker_SPEAKER_01
start=46.1s stop=53.3s speaker_SPEAKER_00
start=46.2s stop=46.2s speaker_SPEAKER_01
start=46.3s stop=46.9s speaker_SPEAKER_01
start=53.6s stop=56.5s speaker_SPEAKER_00
start=57.1s stop=62.9s speaker_SPEAKER_02
start=59.9s stop=60.2s speaker_SPEAKE