# Experiment Setup

## Audio Corpus
We use an English excerpt from the IMDA National Speech Corpus (NSC). The aim of this transcription, and of the subsequent minutes generation, is to test Whisper's latest `large-v3` model on __Singlish__-accented speech.

We test on sample ID `3030` from the `NSC` dataset, a conversation between a Singaporean male and a Singaporean female speaker. The recording includes Singlish particles and interjections, e.g. 'aiya', 'leh', 'lah', and we want to see whether the model transcribes them accurately. NSC provides:
- Speaker 1's Audio
- Speaker 2's Audio
- Overall Audio

## Transcription Corpus
The NSC also provides a transcription of the audio corpus. We will use it as the reference when measuring the model's transcription accuracy. NSC provides:
- Speaker 1's Transcription
- Speaker 2's Transcription

This will greatly aid our efforts when comparing the efficacy of the text-level speaker diarization later on. Note that the transcriptions are provided in TextGrid format, the time-aligned annotation format used by Praat, in which each tier holds intervals with a start time, an end time, and a text label.
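
To compare against the reference transcripts, the labelled intervals first have to be pulled out of the TextGrid files. The following is a minimal sketch, assuming the files use Praat's long text format (`xmin = …`, `xmax = …`, `text = "…"` per interval). The sample string and the helper name `parse_textgrid_intervals` are illustrative and not taken from NSC; real files may need a specific encoding or a dedicated parser.

```python
import re

# Hypothetical excerpt in Praat's long TextGrid format (not real NSC data)
SAMPLE_TEXTGRID = '''
        intervals [1]:
            xmin = 0
            xmax = 1.25
            text = "aiya why like that"
        intervals [2]:
            xmin = 1.25
            xmax = 2.0
            text = ""
        intervals [3]:
            xmin = 2.0
            xmax = 3.5
            text = "can lah"
'''

# An interval is an xmin/xmax pair immediately followed by a quoted text label
INTERVAL_RE = re.compile(
    r'xmin\s*=\s*([\d.]+)\s*'
    r'xmax\s*=\s*([\d.]+)\s*'
    r'text\s*=\s*"((?:[^"]|"")*)"'
)

def parse_textgrid_intervals(content: str):
    """Return (start, end, text) tuples for every non-empty interval."""
    intervals = []
    for xmin, xmax, text in INTERVAL_RE.findall(content):
        text = text.replace('""', '"').strip()  # embedded quotes are doubled
        if text:  # skip silent (empty) intervals
            intervals.append((float(xmin), float(xmax), text))
    return intervals

intervals = parse_textgrid_intervals(SAMPLE_TEXTGRID)
```

Each speaker's file can be parsed this way and the two interval lists merged by start time to obtain a reference speaker sequence.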

## Model
We will run OpenAI's latest `large-v3` model on the CPU (my current system has no CUDA GPU) and analyse the transcription accuracy. If it is up to standard, we proceed to __Speaker Diarization__; otherwise, we may put plans in place to retrain the model on Singlish-accented speech. As mentioned above, our audio corpus is the IMDA NSC, which is roughly 890 GB of data, so we use only a small subset for the initial testing.
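
The text above does not fix a metric for "up to standard". One common choice is word error rate (WER), sketched below under that assumption. The function name `word_error_rate` and the example sentences are illustrative only, and a real evaluation would also need a normalisation policy for casing, punctuation and Singlish particles.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Two-row dynamic programme for the Levenshtein distance over words
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)

# Hypothetical example: "aiya" is misheard and the particle "lah" is dropped
wer = word_error_rate("aiya cannot like that lah", "ayah cannot like that")
```

Here one substitution and one deletion against five reference words give a WER of 0.4.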

## Future Plans
If the model is up to standard, we will proceed to __Speaker Diarization__ and __Minutes Generation__.

In [33]:
!pip install openai python-dotenv faster-whisper openai-whisper pandas



In [34]:
import threading
import time
import dotenv
from openai import AzureOpenAI

# Read the Azure OpenAI settings from the nearest .env file (located once, reused below)
ENV_PATH = dotenv.find_dotenv()
DEPLOYMENT = dotenv.get_key(ENV_PATH, "DEPLOYMENT")
ENDPOINT = dotenv.get_key(ENV_PATH, "ENDPOINT")
KEY = dotenv.get_key(ENV_PATH, "KEY")
VERSION = dotenv.get_key(ENV_PATH, "VERSION")

client = AzureOpenAI(
	# https://learn.microsoft.com/en-us/azure/ai-services/openai/reference#rest-api-versioning
	api_version=VERSION,
	# https://learn.microsoft.com/en-us/azure/cognitive-services/openai/how-to/create-resource?pivots=web-portal#create-a-resource
	azure_endpoint=ENDPOINT,
	# API key read from the KEY entry of the .env file
	api_key=KEY,
)

In [36]:
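import whisper
import torch

# torch.cuda.init() raises on machines without CUDA, so only check availability here
device = "cuda:0" if torch.cuda.is_available() else "cpu"

# NOTE: this loads the "medium" checkpoint; pass "large-v3" to match the Model section above
model = whisper.load_model("medium")
model.to(device)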

Whisper(
  (encoder): AudioEncoder(
    (conv1): Conv1d(80, 1024, kernel_size=(3,), stride=(1,), padding=(1,))
    (conv2): Conv1d(1024, 1024, kernel_size=(3,), stride=(2,), padding=(1,))
    (blocks): ModuleList(
      (0-23): 24 x ResidualAttentionBlock(
        (attn): MultiHeadAttention(
          (query): Linear(in_features=1024, out_features=1024, bias=True)
          (key): Linear(in_features=1024, out_features=1024, bias=False)
          (value): Linear(in_features=1024, out_features=1024, bias=True)
          (out): Linear(in_features=1024, out_features=1024, bias=True)
        )
        (attn_ln): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
        (mlp): Sequential(
          (0): Linear(in_features=1024, out_features=4096, bias=True)
          (1): GELU(approximate='none')
          (2): Linear(in_features=4096, out_features=1024, bias=True)
        )
        (mlp_ln): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
      )
    )
    (ln_post): LayerNorm((

# Meeting-Specific Prompts and Phrases


In [38]:
general = ['Singaporean Singlish Government Business Meeting Transcription Recording']

singlish_phrases = [
	"ah", "lah", "aiya", "leh", "aiyo", "can or not", "on the ball", "makan session", "pow-wow",
	"kiasu", "bo jio", "sian", "shiok", "jialat", 'saigang',
	"talk cock", "wayang", "kena", "chop-chop", "steady",
	"own time own target (OTOT)", "kopi talk", "catch up", "brainstorm", "align",
	"lobang", "paiseh", "action", "agaration", "angkat bola",
	"bao ga liao", "buay pai", "cheem", "chio", "garang",
	"goondu", "kaypoh", "leh", "lor", "nia",
	"one corner", "open table", "pai seh", "relak one corner", "sabo",
	"sai kang", "shiok", "siam", "sikit-sikit", "suay",
	"tabao", "talk shop", "tan tio", "up lorry", "wa kau"
]

singlish_business_phrases = [
	"lah", "can or not?", "on the ball", "kiasu", "shiok",
	"talk cock", "steady pom pi pi", "own time own target", "bo jio", "catch no ball",
	"chiong", "chop chop", "die die must do", "eat snake", "gostan",
	"jialat", "kaypoh", "leh", "lor", "makan",
	"nabei", "paiseh", "sabo", "sian", "suay",
	"walao eh", "wayang", "win already lor", "yaya papaya", "zi high",
	"send it", "check back next week", "let’s touch base on this", "circle back on that", "park this for now",
	"align our ducks", "low key", "see how", "can make it", "noted with thanks",
	"bo bian", "anyhow", "confirm plus chop", "got chance", "mai tu liao",
	"double confirm", "one shot", "over already", "swee", "talk later"
]

collated_list = general + singlish_phrases + singlish_business_phrases 

collated_list_string = ' '.join(collated_list)
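
The two phrase lists overlap (for example "lah", "kiasu" and "shiok" appear in both), and Whisper's decoder keeps only a bounded window of prompt tokens, so repeated entries waste space. A minimal sketch of order-preserving de-duplication follows. The helper name `dedupe_phrases` and the short sample lists are illustrative stand-ins for the lists above.

```python
def dedupe_phrases(phrases):
    """Drop repeated phrases (case-insensitively) while keeping first-seen order."""
    seen = set()
    unique = []
    for phrase in phrases:
        key = phrase.casefold().strip()
        if key and key not in seen:
            seen.add(key)
            unique.append(phrase)
    return unique

# Short illustrative samples, not the full lists
sample_general = ["lah", "kiasu", "shiok", "leh"]
sample_business = ["lah", "can or not?", "kiasu", "Shiok"]

unique_phrases = dedupe_phrases(sample_general + sample_business)
prompt_string = " ".join(unique_phrases)
```

Near-duplicates that differ in punctuation or spacing ("can or not" versus "can or not?", "paiseh" versus "pai seh") are kept as distinct entries here; whether to merge them is a judgment call.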

# Performing Transcription

In [39]:
import os
import re
import ast
from datetime import datetime
import pandas as pd
import numpy as np
from collections import deque

# Setup code
# Specify the directory path
transcriptions_dir = "./transcriptions"

# Create the directory if it does not exist yet
os.makedirs(transcriptions_dir, exist_ok=True)

def find_latest_transcription(directory):
    # Filenames are written below as transcription_YYMMDD.txt
    pattern = re.compile(r'transcription_(\d{6})\.txt')
    latest_file = None
    latest_date = None

    for filename in os.listdir(directory):
        match = pattern.fullmatch(filename)
        if match:
            try:
                # Parse with the same '%y%m%d' layout that is used when saving
                file_date = datetime.strptime(match.group(1), '%y%m%d')
            except ValueError:
                continue  # six digits that do not form a valid date

            # Update the latest file based on date
            if latest_date is None or file_date > latest_date:
                latest_date = file_date
                latest_file = filename

    return latest_file

# Attempt to find the latest transcription file
latest_transcription_file = find_latest_transcription(transcriptions_dir)

if latest_transcription_file:
    # Full path for the latest file including the directory
    file_path = os.path.join(transcriptions_dir, latest_transcription_file)
    try:
        with open(file_path, "r") as stored_result:
            # If the file exists and is opened successfully, read the content
            temp_result = stored_result.read()
            result = ast.literal_eval(temp_result)
            # result = dict(result)
        print('\033[92mTranscription located:\033[0m')
        print(result['text'])

    except FileNotFoundError:
        print("File not found, although it was expected to exist.")
else:
    # No transcription file matching the pattern was found
    print('\033[91mNo matching transcription files found.\033[0m')
    # If the file does not exist, execute the transcription process and create the file
    # transcribe() takes the priming text as initial_prompt
    temp_result = str(model.transcribe("./content/conference_call.mp3", verbose=True,
                                  language="en", initial_prompt=collated_list_string))
    file_path = os.path.join(transcriptions_dir, f'transcription_{datetime.now().strftime("%y%m%d")}.txt')
    with open(file_path, "w") as f:
        f.write(temp_result)
    result = ast.literal_eval(temp_result)
    print(f'\033[92mTranscription completed and saved at {file_path}.\033[0m')

per_line = []
for segment in result['segments']:
    # Segment text usually starts with a space; strip it without dropping a real character
    per_line.append(segment['text'].lstrip())

per_line

# Initialization code for strided_chunks and strided_chunk_indices
STRIDE = 2
NUMBER_OF_CHUNKS = 60

def chunk_with_stride_and_indices(initial_list: list, stride: int, number_of_chunks: int):
    """Split a list into chunks that are extended by (stride - 1) items on each side."""
    overlap = stride - 1
    N = len(initial_list)

    # Base chunk size before the overlap is added (ceiling division)
    base_chunk_size = (N + number_of_chunks - 1) // number_of_chunks

    stride_chunks = []
    stride_chunk_indices = []

    for i in range(number_of_chunks):
        # Stop once the base chunks cover the whole list; otherwise a trailing
        # chunk made up only of overlap lines would be emitted
        if i * base_chunk_size >= N:
            break

        # Extend the base chunk by the overlap on both sides, clamped to the list bounds
        start = max(0, i * base_chunk_size - overlap)
        end = min(N, (i + 1) * base_chunk_size + overlap)

        stride_chunks.append(initial_list[start:end])
        stride_chunk_indices.append(list(range(start, end)))

    return stride_chunks, stride_chunk_indices

strided_chunks, strided_chunk_indices = chunk_with_stride_and_indices(per_line, STRIDE, NUMBER_OF_CHUNKS)

# Code for Experiment 3
# Remove empty lists
strided_chunk_indices = [ele for ele in strided_chunk_indices if ele != []]
strided_chunks = [ele for ele in strided_chunks if ele != []]

# One row per transcribed line, used as the reference column
comparison_df = pd.DataFrame({'original': per_line})

# Use numpy to efficiently calculate the min and max values for DataFrame index
min_val = np.min([min(sublist) for sublist in strided_chunk_indices])
max_val = np.max([max(sublist) for sublist in strided_chunk_indices])

# Initialize the DataFrame with the correct index range
strided_chunk_df = pd.DataFrame(index=np.arange(min_val, max_val + 1))

# Populate the DataFrame with strided chunk data
for i, sublist in enumerate(strided_chunk_indices):
    # Direct assignment to the DataFrame using loc for precise index matching
    strided_chunk_df.loc[sublist, f'strided_chunk_{i}'] = strided_chunks[i]

# Combine the initial comparison DataFrame with the newly created strided chunk DataFrame
combined_df = pd.concat([comparison_df, strided_chunk_df], axis=1)

class TextDiarizer:
    def __init__(self, client, deployment):
        self.client = client
        self.deployment = deployment

    def diarize_chunk(self, sentences, prev_labels_info=None):
        """Diarize a chunk of text, optionally using information from previous labels."""
        system_message = "You are a linguistics expert with 100 years of experience. You will be given a transcription of a MEETING between an unknown number of speakers, and you are to assign the Speaker label to each sentence PER line. E.g. given the prompt, you will return me: ['Speaker 1', 'Speaker 2', 'Speaker 2']. There is a possibility that a speaker may speak for more than 1 line at a time. You will DO YOUR JOB WELL."

        if prev_labels_info:
            prev_speaker_labels = list(prev_labels_info.values())
            sentences_dict = prev_speaker_labels[0]
            speaker_labels_dict = prev_speaker_labels[1]

            # creating {'sentence':speaker_label} dictionary
            sentence_speaker_mapping = {value: speaker_labels_dict[key] for key, value in sentences_dict.items()}

            user_message = f"Here are the previous exchanges RIGHT before this followed by their respective speaker(s):\n{sentence_speaker_mapping} \n Here is the list of sentences:\n{sentences}\nNote that this contains the previous exchanges as well. I MUST RECEIVE ALL {len(sentences)} exchanges. JUST RETURN ME THE LIST."
        else:
            user_message = f"Here is the list of sentences: \n{sentences}. \nThere are {len(sentences)} exchanges. You will diarize ALL the sentences in the list. You WILL ensure that you label ALL {len(sentences)} lines. JUST RETURN ME THE LIST."
        print('user message:', user_message + '\n\n')
        diarization = self.client.chat.completions.create(
            model=self.deployment,
            messages=[
                {"role": "system", "content": system_message},
                {"role": "user", "content": user_message}
            ],
            max_tokens=2500,
            stream=False,
            temperature=0.2,
        )
        return ast.literal_eval(diarization.choices[0].message.content)

    def get_stride_info(self, chunk_df, total_label_list, chunk_number, stride):
        """Retrieve and label the information for a given stride."""
        column_name = f'strided_chunk_{chunk_number}'
        chunk_df[column_name] = chunk_df[column_name].replace('nan', np.nan)

        _ = chunk_df[column_name].dropna()
        stride_df = pd.DataFrame(_.tail(stride))
        previous_speaker_labels = total_label_list[-1][-stride:]
        stride_df['speaker_labels'] = previous_speaker_labels
        return stride_df.to_dict()

    def label_aware(self, stride, number_of_chunks, combined_chunk_df):
        total_label_list = []

        # Process the first chunk
        combined_chunk_df['strided_chunk_0'] = combined_chunk_df['strided_chunk_0'].replace('nan', np.nan)
        first_chunk = combined_chunk_df['strided_chunk_0'].dropna().tolist()
        speaker_labels = self.diarize_chunk(first_chunk)
        total_label_list.append(speaker_labels)

        # Process subsequent chunks
        for chunk_number in range(1, (len(combined_chunk_df.columns) - 1)):
            column_name = f'strided_chunk_{chunk_number}'
            combined_chunk_df[column_name] = combined_chunk_df[column_name].replace('nan', np.nan)
            current_chunk = combined_chunk_df[column_name].dropna(how='any').tolist()
            prev_labels_info = self.get_stride_info(combined_chunk_df, total_label_list, chunk_number - 1, stride)
            speaker_labels = self.diarize_chunk(current_chunk, prev_labels_info)
            total_label_list.append(speaker_labels)

        return total_label_list

diarizer = TextDiarizer(client, DEPLOYMENT)
final_labels = diarizer.label_aware(STRIDE, NUMBER_OF_CHUNKS, combined_df)
print(final_labels)

def final_df(per_line, strided_chunk_indices, final_labels):
    # One row per transcribed line, used as the reference column
    label_comparison_df = pd.DataFrame({'original': per_line})

    # Use numpy to efficiently calculate the min and max values for DataFrame index
    min_val = np.min([min(sublist) for sublist in strided_chunk_indices])
    max_val = np.max([max(sublist) for sublist in strided_chunk_indices])

    # Initialize the DataFrame with the correct index range
    label_strided_chunk_df = pd.DataFrame(index=np.arange(min_val, max_val + 1))

    # Populate the DataFrame with strided chunk data
    for i, sublist in enumerate(strided_chunk_indices):
        # Direct assignment to the DataFrame using loc for precise index matching
        label_strided_chunk_df.loc[sublist, f'strided_chunk_{i}'] = final_labels[i]

    # Combine the initial comparison DataFrame with the newly created strided chunk DataFrame
    combined_label_df = pd.concat([label_comparison_df, label_strided_chunk_df], axis=1)
    return combined_label_df

final_label_df = final_df(per_line, strided_chunk_indices, final_labels)
final_label_df
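
`diarize_chunk` passes the model's reply straight to `ast.literal_eval`, which fails if the reply is wrapped in a code fence or surrounded by extra words, and `final_df` assumes exactly one label per line. A defensive parsing step might look like the sketch below. `parse_speaker_labels` is a hypothetical helper, not part of the notebook, and the sample reply is made up.

```python
import ast
import re

def parse_speaker_labels(reply: str, expected_count: int):
    """Extract a list of speaker labels from an LLM reply and check its length."""
    # Take the first [...] span so code fences or surrounding prose are ignored
    match = re.search(r"\[.*?\]", reply, flags=re.DOTALL)
    if match is None:
        raise ValueError("no list found in reply")
    labels = ast.literal_eval(match.group(0))
    if not isinstance(labels, list) or not all(isinstance(x, str) for x in labels):
        raise ValueError("reply is not a list of strings")
    if len(labels) != expected_count:
        raise ValueError(f"expected {expected_count} labels, got {len(labels)}")
    return labels

# Made-up reply: wrapped in a code fence, with a sentence in front
fence = "`" * 3
sample_reply = f"Here you go:\n{fence}\n['Speaker 1', 'Speaker 2', 'Speaker 2']\n{fence}"
labels = parse_speaker_labels(sample_reply, expected_count=3)
```

Raising on a length mismatch makes a short reply visible at the chunk where it happens, instead of surfacing later as a pandas assignment error.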


Transcription located:
 Good afternoon. My name is JL and I will be your conference operator today. At this time, I would like to welcome everyone to NVIDIA's third quarter earnings call. All lines have been placed on mute to prevent any background noise. After the speakers remarks, there will be a question and answer session. If you would like to ask a question during this time, simply press star followed by the number one on your telephone keypad. If you would like to withdraw your question, again press the star one. Thank you. Simone Ciechowski, you may now begin your conference. Thank you. Good afternoon everyone and welcome to NVIDIA's conference call for the third quarter of fiscal 2024. With me today from NVIDIA are Jensen Huang, President and Chief Executive Officer and Collette Press, Executive Vice President and Chief Financial Officer. I'd like to remind you that our call is being webcast live on NVIDIA's investor relations website. The webcast will be available thr