<a href="https://colab.research.google.com/github/Sander-Mander/Notulinator/blob/main/Notulinator.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#Notulinator 🤖

**Welcome to the Notulinator!**

Before we start, go to Runtime > Change runtime type and select T4 GPU. Save your selection.

Wait until you are connected. You can see that you are connected if a green checkmark appears in the top right of your screen.


You use the Notulinator by running one code block at a time. You can run a code block by hovering over it and pressing the arrow button in its top-left corner. <u>Do not change anything in the code itself!</u> After running a code block, <u>instructions will appear below the block</u>. Listen to the instructions from the little robot 🤖. Follow these instructions and you will have created your transcription in no time!


---

#Code (<-- Run this before you follow the steps! This takes about 5 minutes.)

In [None]:
!pip uninstall torch torchvision torchaudio -y -q
!pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu121 -q

!pip install ctranslate2==4.4.0 -q
!pip install faster-whisper==1.1.0 -q
!pip install whisperx==3.3.1 -q

!apt-get update -q
!apt-get install -y libcudnn8=8.9.2.26-1+cuda12.1 -q
!apt-get install -y libcudnn8-dev=8.9.2.26-1+cuda12.1 -q

!python -c "import torch; torch.backends.cuda.matmul.allow_tf32 = True; torch.backends.cudnn.allow_tf32 = True"
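
Once the setup cell has finished, it can be useful to confirm that the pinned versions were actually installed. The following is a minimal sketch that uses only the standard library. The helper name `find_version_mismatches` is hypothetical and not part of the Notulinator; the pins are copied from the install commands above.

```python
from importlib import metadata


def find_version_mismatches(pins):
    """Return {package: installed_version_or_None} for every pin that is not met."""
    mismatches = {}
    for package, wanted in pins.items():
        try:
            installed = metadata.version(package)
        except metadata.PackageNotFoundError:
            installed = None  # the package is not installed at all
        # Builds may carry a local tag such as "+cu121"; ignore it when comparing.
        if installed is None or installed.split("+")[0] != wanted:
            mismatches[package] = installed
    return mismatches


# Pins copied from the setup cell above.
PINS = {
    "torch": "2.5.1",
    "ctranslate2": "4.4.0",
    "faster-whisper": "1.1.0",
    "whisperx": "3.3.1",
}

for package, installed in find_version_mismatches(PINS).items():
    print(f"{package}: expected {PINS[package]}, found {installed}")
```

If nothing is printed, every pinned package matches. A mismatch usually means the setup cell was interrupted and should be run again.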

In [2]:
import ipywidgets as widgets
from IPython.display import display, clear_output
import os
import time
from google.colab import drive
import torch
import gc
import datetime
import csv

class BaseBot:
    def show_message(self, text, delay=0):
      """Display a robot message with optional delay"""
      print(f'🤖: {text}')
      if delay > 0:
          time.sleep(delay)
    def clear(self):
      clear_output(wait=True)

class UploadBot(BaseBot):
    def __init__(self):
        self.path_name = 'Notulinator-Audio'
        self.file_path = f'/content/gdrive/MyDrive/{self.path_name}'
        self.selected_file = None

    def start(self):
        """Start the interactive helper flow"""
        self.clear()

        if torch.cuda.is_available():
          self.show_message('Hello! Time to get started with your minutes.', 2)
          self.preprocessing_choice()
        else:
          self.show_message("You are not connected to a GPU. Go to Runtime > Change runtime type and select T4 GPU. Then run the above code and this code block again.")

    def preprocessing_choice(self):
        """Ask user if they want to preprocess the audio""" #TODO: Skip the ask to make more user friendly
        self.show_message("Do you wish to pre-process your audio file first (highly recommended) or do you want to upload it as-is? The AI needs audio that is not too quiet; otherwise the transcription will fail.")

        preproc_yes = widgets.Button(description='Pre-process first')
        preproc_no = widgets.Button(description='No pre-processing')

        preproc_yes.on_click(lambda b: self.handle_preprocessing(True))
        preproc_no.on_click(lambda b: self.handle_preprocessing(False))

        display(preproc_yes, preproc_no)

    def handle_preprocessing(self, preprocess):
        """Handle the preprocessing choice"""
        self.clear()

        if preprocess:
            self.show_message("Good choice to first process your audio. However, I cannot do this myself; you will have to do this manually.", 2)
            self.show_message("Please look at the instructions at the very bottom of this document. They describe how you can quickly process your audio using Audacity!", 1)
        else:
            self.mount_drive()

    def mount_drive(self):
        """Mount Google Drive and set up folders"""
        self.show_message("A pop-up should now appear to get access to your Google Drive. Please accept and wait a few seconds...", 1)

        drive.mount('/content/gdrive')

        os.makedirs(self.file_path, exist_ok=True)

        self.show_message(f"I created a folder in your Google Drive called {self.path_name}. Please go to your Google Drive and upload your audio file to this folder. It should be saved in: {self.file_path}")

        done_button = widgets.Button(description='Done')
        done_button.on_click(lambda b: self.check_uploaded_files())
        display(done_button)

    def check_uploaded_files(self):
        """Check for uploaded files in the designated folder"""
        self.clear()
        file_names = os.listdir(self.file_path) #TODO: Ignore unsupported file formats

        if len(file_names) == 0:
            self.show_message(f"I found no files in the {self.path_name} folder. Can you check if you uploaded anything?")
            done_button = widgets.Button(description='Check Again')
            done_button.on_click(lambda b: self.check_uploaded_files())
            display(done_button)

        elif len(file_names) == 1:
            self.selected_file = file_names[0]
            self.file_selected()

        else:
            self.show_file_selection(file_names)

    def show_file_selection(self, file_names):
        """Display radio buttons for file selection"""
        self.show_message("I found multiple files. Which one do you want to process?")

        radio_buttons = widgets.RadioButtons(
            options=file_names,
            description='Choose file:',
        )
        confirm_button = widgets.Button(description='Confirm')

        confirm_button.on_click(lambda b: self.handle_file_selection(radio_buttons.value))

        display(radio_buttons, confirm_button)

    def handle_file_selection(self, filename):
        """Handle the file selection"""
        self.clear()
        self.selected_file = filename
        self.file_selected()

    def file_selected(self):
        """Process the selected file"""
        audio_path = os.path.join(self.file_path, self.selected_file)
        self.show_message(f"Ready to process the following file: {self.selected_file}", 1)
        self.show_message("You can now move on to the next code cell! (Step 2)")

        # Share the chosen paths with the later cells (Steps 2 and 3)
        global FILE_PATH, TOKEN_PATH
        FILE_PATH = audio_path
        TOKEN_PATH = os.path.join(self.file_path, 'access_token.txt')
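
The TODO in `check_uploaded_files` mentions ignoring unsupported file formats. This matters in practice because `access_token.txt` is stored in the same Drive folder as the audio, so on a later run the folder listing contains more than just recordings. A minimal sketch of such a filter follows. The helper `filter_audio_files` and the extension list are assumptions for illustration, not part of the Notulinator and not a statement of what WhisperX can decode.

```python
import os

# Assumed set of audio extensions; adjust it to the formats you actually use.
AUDIO_EXTENSIONS = {".mp3", ".wav", ".m4a", ".flac", ".ogg"}


def filter_audio_files(file_names, extensions=AUDIO_EXTENSIONS):
    """Keep only the names whose extension (case-insensitive) is in `extensions`."""
    return [
        name for name in file_names
        if os.path.splitext(name)[1].lower() in extensions
    ]


# Example listing as os.listdir() might return it after a token was saved.
listing = ["meeting.MP3", "access_token.txt", "notes.docx", "part2.wav"]
print(filter_audio_files(listing))
```

Applied to the result of `os.listdir(self.file_path)`, a filter like this would keep the token file and stray documents out of the file selection.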

In [3]:
class AccessBot(BaseBot):
  def __init__(self):
    self.token_input = None

  def start(self):
    self.clear()
    self.show_message('To use some of the AI models, you have to request access. Have you already requested access to the models?', 1)
    self.show_message('If not, I will guide you through the process.')
    self.choice_access()

  def choice_access(self):
    access_yes = widgets.Button(description='Yes I have')
    access_no = widgets.Button(description='No I have not')

    access_yes.on_click(lambda b: self.handle_access(True))
    access_no.on_click(lambda b: self.handle_access(False))

    display(access_yes, access_no)

  def handle_access(self, access):
    self.clear()
    if access:
      self.show_message('Great, if you already have access you can move on to the next code cell! (Step 3)')
    else:
      self.show_message('Alright, since we are using AI models from HuggingFace (a large AI community platform), you have to create an account there.', 2)
      self.show_message('Go to huggingface.co and create an account. Press Done when you are ready!')
      done_button = widgets.Button(description='Done')
      done_button.on_click(lambda b: self.request_access())
      display(done_button)

  def request_access(self):
    self.clear()
    self.show_message('Great! Now you have to request access to the specific AI models we will be using.', 2)
    self.show_message('Request access to: https://huggingface.co/pyannote/speaker-diarization-3.1', 1)
    self.show_message('And request access to: https://huggingface.co/pyannote/segmentation-3.0',1)
    self.show_message('You can request access by filling in some basic contact information.')
    done_button = widgets.Button(description='Done')
    done_button.on_click(lambda b: self.create_token())
    display(done_button)

  def create_token(self):
    self.clear()
    self.show_message('Super, we are almost done. We still have to generate an access token so we can actually use these models in our code.', 3)
    self.show_message('Go to your HuggingFace account and go to Settings > Access Tokens > Create new token.', 2)
    self.show_message('You should set your token settings like this:', 1)
    self.show_message('''Token type = Fine-grained
    Token name can be anything you want (but do give it a name)
    Under User permissions, repositories:
    Enable 'Read access to contents of all repos under your personal namespace' AND 'Read access to contents of all public gated repos you can access'.''', 2)
    self.show_message("Finally, click 'Create Token' to create your token")
    done_button = widgets.Button(description='Done')
    done_button.on_click(lambda b: self.ask_token())
    display(done_button)

  def ask_token(self):
    self.clear()
    self.show_message('Paste the generated token here:')
    self.token_input = widgets.Text(
    description="Token:",
    placeholder="Enter your Hugging Face token",)
    save_button = widgets.Button(description="Save Token")
    save_button.on_click(lambda b: self.save_token())
    display(self.token_input, save_button)

  def save_token(self):
    self.clear()
    with open(TOKEN_PATH, "w") as f:
      f.write(self.token_input.value.strip())  # Drop spaces or newlines picked up when pasting
      self.show_message('Beep. Token saved in your Google Drive. You can move on to the next code cell! (Step 3)')
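
The token is written in Step 2 and read back in Step 3, and the TODO in `TranscriptBot.start` asks for an error when no token is found. One way this could look is sketched below. The functions `save_token` and `load_token` are hypothetical helpers, the token value is a placeholder, and a temporary folder stands in for the Google Drive folder.

```python
import os
import tempfile


def save_token(path, token):
    """Write the token to `path`, refusing empty input."""
    token = token.strip()
    if not token:
        raise ValueError("The token is empty; paste it into the text box first.")
    with open(path, "w", encoding="utf-8") as f:
        f.write(token)


def load_token(path):
    """Read the token back, with a readable error if it is missing or empty."""
    if not os.path.isfile(path):
        raise FileNotFoundError(f"No token file found at {path}. Run Step 2 first.")
    with open(path, "r", encoding="utf-8") as f:
        token = f.read().strip()
    if not token:
        raise ValueError(f"The token file at {path} is empty. Run Step 2 again.")
    return token


# Round trip in a temporary folder (stands in for the Google Drive folder).
with tempfile.TemporaryDirectory() as folder:
    token_path = os.path.join(folder, "access_token.txt")
    save_token(token_path, "  placeholder-token  ")
    print(load_token(token_path) == "placeholder-token")
```

Failing early with a clear message is friendlier than letting the diarization step fail later with an authentication error.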

In [8]:
class TranscriptBot(BaseBot):
  def __init__(self, batch_size=16, language='en', print=True, model='large-v3', task='transcribe'):
    self.batch_size = batch_size
    self.language = language
    self.print = print
    self.model = model
    self.task = task
    self.device = 'cuda'
    self.compute_type = 'float16'
    self.access_token = None
    self.result = None

  def start(self):
    self.clear() #TODO: Error when no file path specified
    self.show_message('Now it is finally time to start the transcription!')
    with open(TOKEN_PATH, "r") as f:
      self.access_token = f.read().strip() #TODO: Give error when no token found

    # Do not print the access token: it is a secret, and notebook output is easily shared.
    print(FILE_PATH)
    self.transcribe()

  def transcribe(self):
    self.show_message('Importing the AI models...')
    torch.cuda.empty_cache()
    model_whisper = whisperx.load_model(whisper_arch=self.model, device=self.device, compute_type=self.compute_type, language=self.language, task=self.task)
    audio = whisperx.load_audio(FILE_PATH)

    self.show_message('Starting the transcription...')
    self.result = model_whisper.transcribe(audio, batch_size=self.batch_size)

    self.show_message('Aligning the transcription...')
    model_a, metadata = whisperx.load_align_model(language_code=self.result["language"], device=self.device)
    self.result = whisperx.align(self.result["segments"], model_a, metadata, audio, self.device, return_char_alignments=False)

    # Delete the alignment model first, then collect, so its GPU memory can actually be released
    del model_a
    gc.collect()
    torch.cuda.empty_cache()

    self.show_message('Assigning speakers to the transcription...')
    diarize_model = whisperx.DiarizationPipeline(use_auth_token=self.access_token, device=self.device)
    diarize_segments = diarize_model(audio)
    self.result = whisperx.assign_word_speakers(diarize_segments, self.result)
    self.print_and_save()

  def print_and_save(self):
    result_intermediate = []
    for segment in self.result['segments']:
        start = datetime.timedelta(seconds=round(segment['start'], 0))
        end = datetime.timedelta(seconds=round(segment['end'], 0))
        speaker = segment.get('speaker', 'UNKNOWN')
        text = segment['text'].strip()
        if self.print:
          print(f"{speaker}: {text}")

        result_intermediate.append((f"{start} to {end}", f"{speaker}: {text}"))

    txt_file = "transcription-notulinator.txt"
    with open(txt_file, mode='w', encoding='utf-8') as file:
        # Write data rows
        for time_interval, transcription in result_intermediate:
            file.write(f"[{time_interval}]\t{transcription}\n")
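
To make the output format of `print_and_save` concrete, the sketch below formats a single segment the same way, outside the class. The function name `format_segment` and the example segment are made up for illustration; the dictionary keys match the ones used in the code above.

```python
import datetime


def format_segment(segment):
    """Format one segment as '[start to end]<TAB>speaker: text'."""
    start = datetime.timedelta(seconds=round(segment["start"], 0))
    end = datetime.timedelta(seconds=round(segment["end"], 0))
    speaker = segment.get("speaker", "UNKNOWN")  # no speaker was assigned
    text = segment["text"].strip()
    return f"[{start} to {end}]\t{speaker}: {text}"


# Hypothetical segment; real ones come from self.result['segments'].
example = {"start": 3.4, "end": 65.2, "speaker": "SPEAKER_00", "text": " Hello everyone. "}
print(format_segment(example))
```

Times are rounded to whole seconds and shown as hours:minutes:seconds, so this example runs from 0:00:03 to 0:01:05.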

#Steps

**Step 1: Uploading your audio file** 📝

In [None]:
UploadBot().start()

**Step 2: Getting access to the AI models**

In [None]:
AccessBot().start()

**Step 3: Creating your transcription**

In [None]:
import whisperx
TranscriptBot(language='en').start()

---

**Pre-processing**

To get a good speech-to-text transcription, the audio needs to be as intelligible as possible. The model performs poorly on quiet audio. For many audio files, it is therefore important to first do some processing on the audio recording.

1. Download Audacity (or any other audio processing software/site with dynamic range compression)
2. File > Import > Audio, and select your audio file(s). If you have multiple files, it is easiest to move them to the same track.
3. Find a section of the recording where the speech is loud and easy to hear, and check its decibel level. Then find a section where the speech is very quiet, and check its decibel level as well.
4. Press Ctrl + A to select everything and go to Effect > Volume and Compression > Compressor. Set the threshold to a decibel level **below** the clearly intelligible audio and **above** the poorly intelligible audio. Set the ratio quite high (something like 6:1, but this depends on your audio). Click Apply.
5. (Optional) If the quiet sections are still hard to hear, increase the overall volume by going to Effect > Volume and Compression > Amplify. Amplify the audio so that quiet sections are easy to hear, but loud speakers are not so loud that the audio distorts (the volume line turns red).
6. (Optional) For convenience, it is recommended to export the audio into multiple smaller files (60 minutes or so). Place your selection line where you want to split the audio and go to Edit > Labels > Add Label at Selection. You can do this at multiple locations to split the file into multiple separate audio files.
7. Go to File > Export Audio. Give your file a name and specify a folder. Choose Channels: **Mono** and Sample Rate: **16000 Hz**. If you want to export the audio to multiple files (Step 6), for Export Range select Multiple Files. Split files based on labels, include audio before first label, and choose a naming format that you like. Click Export.
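
To see why the threshold should sit between the loud and the quiet speech, it may help to work through the arithmetic of a compressor. The sketch below models only the static level curve: anything above the threshold is scaled down by the ratio, and anything below is left alone. It ignores attack, release, and make-up gain, so it is an illustration and not a description of Audacity's implementation. The function `compress_db` and the decibel values are made up.

```python
def compress_db(level_db, threshold_db, ratio):
    """Static downward compression: levels above the threshold are reduced by `ratio`."""
    if level_db <= threshold_db:
        return level_db  # below the threshold nothing changes
    return threshold_db + (level_db - threshold_db) / ratio


# Hypothetical measurements: loud speech at -12 dB, quiet speech at -36 dB.
loud, quiet = -12.0, -36.0
threshold, ratio = -30.0, 6.0  # threshold between the two, ratio 6:1

new_loud = compress_db(loud, threshold, ratio)
new_quiet = compress_db(quiet, threshold, ratio)
print(new_loud, new_quiet)                    # -27.0 -36.0
print(loud - quiet, "->", new_loud - new_quiet)  # 24.0 -> 9.0
```

With these numbers the gap between loud and quiet speech shrinks from 24 dB to 9 dB. Amplifying afterwards raises both together, which is why the optional Amplify step comes after compression.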


---

**Common Problems:**



*   *My transcription results are bad.*

The main cause of poor results is bad audio quality. The AI model struggles especially when voices are quiet. Be sure to pre-process your audio beforehand so that quiet sections become louder (as described above). The model also struggles when multiple people speak at once, for which there is no real fix.

*   *I want to use a different language than English.*

The model supports many different languages, and can also translate speech into English. You can check under 'DEFAULT_ALIGN_MODELS_TORCH' and 'DEFAULT_ALIGN_MODELS_HF' which languages are available in this section of the code: https://github.com/m-bain/whisperX/blob/main/whisperx/alignment.py. For example, setting language='nl' in Step 3 sets the model to Dutch.

*   *I am getting an error when running the code.*

Be sure you have run all the code above the code block you want to run. Errors can often be fixed by deleting and restarting your runtime: go to Runtime > Disconnect and Delete Runtime, then run the code again. If this does not resolve the error, you can open a new issue on the Notulinator GitHub and I can take a look.