<a href="https://colab.research.google.com/github/MNoichl/speech_to_text_with_whisper_to_GPT/blob/main/whisper_to_GPT_MN.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Actually useful speech to text via Whisper and GPT-4 🎤 ⌨ 😎

This notebook provides a workflow to turn a dictation, which you can record e.g. with Audacity on your PC, into a reasonable piece of text in the genre that you need. After transcribing your recording with the Whisper model, the notebook passes the transcript to GPT-4, with instructions to fix any dictation infidelities and to adapt the text to your predefined style, usually without any typos.

I use this for e-mails, to-do lists, drafting, meeting minutes, and also to turn ramblings into reasonable notes. You can also dictate several e-mails in a row, and there is no need to worry too much about pauses. The GitHub repo also contains a (hopefully growing) style sheet, and a notebook that helps with creating styles.

Importantly, for this to work you will need an OpenAI API key, and you will have to pay for your usage. I feel this is quite reasonably priced, but until you get a feeling for it, I suggest you monitor your spending! Also **make sure to double-check your texts**. For example, dictated last names in e-mails are commonly misspelled and might need post-hoc fixing. First, we need to install some stuff:

In [None]:
%%capture
#@markdown #Load some dependencies!
# !pip install gradio -q
!pip install "openai<1.0"  # the cells below use the pre-1.0 interface (openai.Audio, openai.ChatCompletion)
!pip install noisereduce
!pip install pydub

import time
import shutil
import os
from google.colab import drive
import json


import subprocess
from pydub import AudioSegment
from pydub.silence import detect_nonsilent
from scipy.io import wavfile
import noisereduce as nr
import openai

import datetime

In [None]:
#@markdown # Set everything up

#@markdown # Mount drive 
#@markdown Check this if you want to mount your drive. I would recommend doing that, because this way
#@markdown you can save your recordings in a Drive folder on your PC and load them from there,
#@markdown without the hassle of uploading them individually every time. You can also (reasonably) safely store
#@markdown your API key there.

mount_drive = True  #@param {type:"boolean"}

if mount_drive:
  # Mount Google Drive
  drive.mount('/content/drive')
#@markdown # Set up your API-key
#@markdown Here you have to provide your API key for the OpenAI API.
#@markdown **WARNING: Do not share the notebook with your key in it (or in its edit history),
#@markdown because this will allow people to spend your money!
#@markdown Also make sure to set reasonable spending limits on OpenAI's website,
#@markdown and monitor your spending regularly.** Alternatively, you can
#@markdown put the path to a text file in your drive where you keep your key.

api_key_or_path = "" #@param {type:"string"}

# Check whether api_key_or_path points to a file
if os.path.isfile(api_key_or_path):
    # If it is a file, read the key from it; strip the trailing newline that most editors add
    with open(api_key_or_path, 'r') as file:
        api_key = file.read().strip()
else:
    # Otherwise treat the value itself as the key
    api_key = api_key_or_path.strip()

openai.api_key = api_key # this tells the OpenAI library about your key, so it can be used below.


#@markdown # Load your file:


#@markdown Specify a directory or audio-file which you want to transcribe.
#@markdown If a directory is specified, the most recently saved file is loaded.


path = '/content/drive/MyDrive/whisper_transcripts/recordings' #@param {type:"string"}

# Check if the path is a directory or a file
if os.path.isdir(path):
    # If the path is a directory, change the current working directory to the specified path
    os.chdir(path)

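    # List all files in the directory (sub-folders are skipped, so they cannot be picked as the latest recording)
    files = [f for f in os.listdir() if os.path.isfile(f)]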

    # Print the list of files
    print("Files in directory:", files)

    # Find the latest file (i.e., the file with the most recent modification time)
    latest_file = max(files, key=os.path.getmtime)
    
    # Print the time when the last added file was recorded
    print("Last recorded file time:", datetime.datetime.fromtimestamp(os.path.getmtime(latest_file)))
    # Specify the working directory
    working_dir = '/content'

    # Copy the latest file to the working directory and rename it as 'current_audio.mp3'
    shutil.copy(os.path.join(path, latest_file), os.path.join(working_dir, 'current_audio.mp3'))
    
else:
    # If the path is a file, copy it to the working directory and rename it as 'current_audio.mp3'
    working_dir = '/content'
    shutil.copy(path, os.path.join(working_dir, 'current_audio.mp3'))
    
    # Print the time when the file was last modified
    print("Last recorded file time:", datetime.datetime.fromtimestamp(os.path.getmtime(path)))
    
    # The latest file is the file itself
    latest_file = path

# Change the current working directory to the working directory
os.chdir(working_dir)

# Get the size of the 'current_audio.mp3' file
file_size = os.path.getsize('current_audio.mp3')

# Print the size of the file in megabytes
print("File size in MB:", file_size / (1024 * 1024))



Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
Files in directory: ['rec0330-004051.mp3', 'rec0330-005353.mp3', 'rec0330-005533.mp3', 'rec0330-011500.mp3', 'rec0330-131940.mp3', 'rec0330-132315.mp3', 'rec0331-134231.mp3', 'rec0331-135136.mp3', 'rec0331-135257.mp3', 'rec0331-190408.mp3', 'rec0402-172127.mp3', 'rec0402-172137.mp3', 'rec0402-184648.mp3', 'rec0402-211045.mp3', 'rec0404-180034.mp3', 'rec0412-140902.mp3', 'rec0412-141024.mp3', 'rec0412-150105.mp3', 'rec0412-190837.mp3', 'rec0413-134437.mp3', 'rec0419-012022.mp3', 'rec0419-012807.mp3', 'rec0419-012854.mp3', 'rec0420-195205.mp3', 'rec0420-195239.mp3', 'rec0420-201028.mp3', 'rec0427-162614.mp3', 'rec0427-162647.mp3', 'rec0427-172429.mp3', 'rec0430-002737.mp3', 'rec0430-002753.mp3', 'rec0503-153105.mp3', 'rec0503-153123.mp3', 'rec0503-161853.mp3', 'rec0506-130227.mp3', 'rec0506-130617.mp3', 'rec0508-140714.mp3', 'rec0508-162059.mp3', 'rec0508-16375
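
The directory branch above picks whichever entry was modified most recently, regardless of its type. Here is a minimal sketch of a stricter variant, assuming you only want audio files to be considered. The helper name `latest_audio_file` and the extension list are illustrative and not part of the notebook:

```python
import os

# Assumed list of recorder output formats; adjust it to whatever your recorder writes.
AUDIO_EXTENSIONS = (".mp3", ".wav", ".m4a", ".ogg", ".flac")


def latest_audio_file(directory, extensions=AUDIO_EXTENSIONS):
    """Return the full path of the most recently modified audio file, or None."""
    candidates = [
        os.path.join(directory, name)
        for name in os.listdir(directory)
        if name.lower().endswith(extensions)
    ]
    # Keep regular files only, so a folder named like 'old.mp3' is ignored
    candidates = [p for p in candidates if os.path.isfile(p)]
    if not candidates:
        return None
    return max(candidates, key=os.path.getmtime)
```

Because the function returns full paths, it does not depend on the current working directory, so the `os.chdir(path)` step would not be needed with it.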

In [None]:


def reduce_noise_in_directory(directory):
    """
    This function reduces noise in all .mp3 files in a given directory.
    """
    for filename in os.listdir(directory):
        if filename.endswith('.mp3'):
            # Convert mp3 to wav
            input_path = os.path.join(directory, filename)
            wav_path = os.path.join(directory, f"{os.path.splitext(filename)[0]}.wav")
            # Pass the arguments as a list so that paths containing spaces are handled correctly
            subprocess.run(["ffmpeg", "-y", "-hide_banner", "-loglevel", "error", "-i", input_path, wav_path], check=True)

            # Load wav data and perform noise reduction
            rate, data = wavfile.read(wav_path)
            if data.ndim > 1:
                data = data[:, 0]  # keep only the first channel; mono files are already one-dimensional
            reduced_noise = nr.reduce_noise(y=data, sr=rate, chunk_size=6000)

            # Save reduced noise wav file
            reduced_wav_path = os.path.join(directory, f"{os.path.splitext(filename)[0]}_reduced_noise.wav")
            wavfile.write(reduced_wav_path, rate, reduced_noise)

            # Convert reduced noise wav back to mp3
            reduced_mp3_path = os.path.join(directory, f"{os.path.splitext(filename)[0]}.mp3")
            # This overwrites the original chunk with its noise-reduced version
            subprocess.run(["ffmpeg", "-y", "-hide_banner", "-loglevel", "error", "-i", reduced_wav_path, reduced_mp3_path], check=True)

            # Remove intermediate wav files
            os.remove(wav_path)
            os.remove(reduced_wav_path)


def remove_long_silences(input_file, output_file, silence_length_ms=1000, silence_thresh=-50):
    """
    This function removes long silences from an audio file.
    It doesn't really work right now though.
    """
    # Load audio file
    audio = AudioSegment.from_file(input_file, format=os.path.splitext(input_file)[1][1:])

    # Detect non-silent chunks
    nonsilent_chunks = detect_nonsilent(audio, min_silence_len=silence_length_ms, silence_thresh=silence_thresh)

    # Concatenate non-silent chunks
    output_audio = AudioSegment.empty()
    for start, end in nonsilent_chunks:
        output_audio += audio[start:end]

    # Export the audio file without longer silences
    output_audio.export(output_file, format=os.path.splitext(output_file)[1][1:])


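def split_audio_file_into_chunks(input_file, output_folder='intermediate_chunks', chunk_length=30, reduce_noise=True):
    """
    This function splits an audio file into chunks of a specified length (in seconds).
    The last chunk also takes any remainder, so it can be longer than chunk_length.
    """
    # Convert the input file to wav (IPython shell escape, so this only runs inside a notebook)
    !ffmpeg -y -i $input_file current_audio.wav -hide_banner -loglevel error

    # Remove the output folder if it exists, and create a new one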
    if os.path.exists(output_folder):
        shutil.rmtree(output_folder)
    os.makedirs(output_folder)

    # Load the input audio file
    audio = AudioSegment.from_file("current_audio.wav")

    # Calculate the chunk length in milliseconds
    chunk_length_ms = chunk_length * 1000

    # Split the audio into chunks and export them
    if chunk_length_ms > len(audio):
      output_filename = f"{output_folder}/{str(0).zfill(4)}.mp3"
      audio.export(output_filename, format="mp3")
    else:
      num_chunks = len(audio) // chunk_length_ms  # the last chunk absorbs the remainder

      for i in range(num_chunks):
          start_time = i * chunk_length_ms
          if i != num_chunks-1:
            end_time =(i + 1) * chunk_length_ms
          else:
            end_time = len(audio)
          chunk = audio[start_time:end_time]
          output_filename = f"{output_folder}/{str(i).zfill(4)}.mp3"
          chunk.export(output_filename, format="mp3")

    # If reduce_noise is True, apply noise reduction to all chunks
    if reduce_noise:
      reduce_noise_in_directory(output_folder)



#@markdown # Transcription. 
#@markdown Now we transcribe the audio file to text using Whisper.

#@markdown You can choose the language in which you mainly spoke; doing so improves results (adding more languages would be possible). If you specify a different language than the one you speak in, this can even serve as a translation tool.
#@markdown If you use a lot of technical terminology, it can help to paste a piece of text that is similar to your dictation into the prompt field.
#@markdown There is also a reduce-noise function (as Whisper is sensitive to background noise), but I haven't quite gotten it to work yet.


# Set the language for transcription
language = 'en' #@param ["en", "de"] {allow-input: true}

# Set whether to reduce noise
reduce_noise= False #@param {type:"boolean"}

# Set the prompt for transcription
prompt = "Email to Gareth" #@param {type:"string"}

# Set the input file and split it into chunks
input_file = "current_audio.mp3"
split_audio_file_into_chunks(input_file, output_folder='intermediate_chunks', chunk_length=300,reduce_noise=reduce_noise)

# Transcribe each chunk and concatenate the transcriptions
sub_transcripts = []
for this_chunk in sorted(os.listdir('intermediate_chunks')):
  # Open each chunk in binary mode; the with-block closes the file once the request is done
  with open(os.path.join('intermediate_chunks', this_chunk), "rb") as audio_file:
    transcript = openai.Audio.transcribe("whisper-1", audio_file, language=language, prompt=prompt)
  sub_transcripts.append(transcript.text)
transcript = ' '.join(sub_transcripts)

# Print the final transcription
print(transcript)


Hey Gareth, so I saw you mentioned the dictation to writing workflow on Twitter and I thought I'd just quickly clean up my notebook that I made for that and if you would like to you can give it a little go and test everything and see what it works and I would be very grateful if you had time to do that and give me a little feedback, if there are any issues with it that I didn't anticipate Thanks and all the best, Max
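
The splitting step can be easier to follow without the audio handling. This sketch reproduces only the boundary arithmetic of `split_audio_file_into_chunks`. The function name `chunk_boundaries` is made up for illustration, and the zero-length check is an addition that the notebook does not have:

```python
def chunk_boundaries(total_ms, chunk_length_s):
    """Return (start_ms, end_ms) pairs; the last chunk absorbs any remainder."""
    chunk_ms = chunk_length_s * 1000
    if chunk_ms <= 0:
        raise ValueError("chunk_length_s must be positive")

    # A recording shorter than one chunk is exported as a single piece
    if chunk_ms > total_ms:
        return [(0, total_ms)]

    num_chunks = total_ms // chunk_ms
    bounds = []
    for i in range(num_chunks):
        start = i * chunk_ms
        # Every chunk ends on a multiple of chunk_ms, except the last one,
        # which runs to the end of the recording
        end = (i + 1) * chunk_ms if i != num_chunks - 1 else total_ms
        bounds.append((start, end))
    return bounds


# A recording of 11 min 40 s, split with chunk_length=300 as in the cell above
print(chunk_boundaries(700_000, 300))  # [(0, 300000), (300000, 700000)]
```

Note that the final chunk can be almost twice as long as `chunk_length`. That is worth keeping in mind if you choose the chunk length to stay under the upload size limit of the transcription endpoint.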


In [None]:
#@markdown # Importing the styles
#@markdown This downloads some style-instructions I prepared from the Github repo. 
#@markdown Feel free to make your own and share them with me, so I can add them!

!wget -q https://raw.githubusercontent.com/MNoichl/speech_to_text_with_whisper_to_GPT/main/style_library.json



def load_json_file(file_path):
    data = {}

    if os.path.exists(file_path):
        with open(file_path, 'r') as file:
            data = json.load(file)

    return data

# Example usage
file_path = 'style_library.json'
loaded_style_dict = load_json_file(file_path)

for x in loaded_style_dict.keys():
  print(x) 




haraway_epistemic
howl_style
koselleck_de
lewis_david_plurality
list_and_pettit
max_noichl_emails_de
max_noichl_emails_en
max_to_do_list
moby_dick
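
In the next cell, a style name that is not in the library is silently used as the style instructions themselves, so a small typo in a name goes unnoticed. Here is a hedged sketch of a lookup that reports near misses, using `difflib` from the standard library. The helper `resolve_style` and the sample library entries are made up for illustration:

```python
import difflib


def resolve_style(style, style_library):
    """Return (instructions, suggestions).

    If `style` is a key of the library, its instructions are returned and
    `suggestions` is empty. Otherwise `style` is treated as free-form
    instructions, and `suggestions` lists similarly named library keys.
    """
    if style in style_library:
        return style_library[style], []
    suggestions = difflib.get_close_matches(style, list(style_library), n=3, cutoff=0.6)
    return style, suggestions


# Made-up entries; the real ones come from style_library.json
library = {
    "max_noichl_emails_en": "Write a short, friendly e-mail.",
    "max_to_do_list": "Write a terse bulleted to-do list.",
}

instructions, suggestions = resolve_style("max_emails_en", library)
if suggestions:
    print("Not in the library. Did you mean:", suggestions)
```

Free-form instructions such as "Write like a pirate" will normally produce no suggestions, so the fallback in the notebook keeps working as before.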


In [None]:
#@markdown # Fixing transcriptions using GPT


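style = "max_noichl_emails_en" #@param {type:"string"}
# Use the library entry if the name is known; otherwise treat the input itself as the style instructions
if style in loaded_style_dict:
  style_instructions = loaded_style_dict[style]
else:
  style_instructions = style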

fixing_language = 'currently set language' #@param ["currently set language","en", "de"] {allow-input: true}
if 'currently set language' == fixing_language:
  fixing_language = language 

OpenAi_model = "gpt-4" #@param ["gpt-4", "gpt-3.5-turbo"] {allow-input: true}




if fixing_language == 'en':
  answer = openai.ChatCompletion.create(
                    model=OpenAi_model,
                    messages=[
                          {"role": "system", "content": """You are a perfect transcription program that is able to take faulty dictations and put them into a readable, grammatically correct form without changing their content or their formulations. """},
                          {"role": "user", "content": ("""Transcribe this faulty dictation. Use all of the techniques in Writing Style A, which you are provided with below the dictation, to generate your result!\n""" +
                                                      """<transcript start> """+ transcript + """<transcript end> """ + """\n\n"""+
                                                      """Writing Style A: """+
                                                      style_instructions)},
                      ]
                  )
  fixed_text = answer['choices'][0]['message']['content']


elif fixing_language == 'de':
  answer = openai.ChatCompletion.create(
      model=OpenAi_model,  # use the model selected above, as in the English branch
      messages=[
          {
              "role": "system",
              "content": """Du bist ein perfektes Transkriptionsprogramm, das in der Lage ist, fehlerhafte Diktate aufzunehmen und sie in eine lesbare, grammatikalisch korrekte Form zu bringen, ohne ihren Inhalt oder ihre Formulierungen zu verändern.""",
          },
          {
              "role": "user",
              "content": (
                  """Transkribiere dieses fehlerhafte Diktat. Verwende alle Techniken in Schreibstil A, die dir unter dem Diktat zur Verfügung gestellt werden, um dein Ergebnis zu erzeugen!\n"""
                  + """<transkript start> """
                  + transcript
                  + """<transkript ende> """
                  + """\n\n"""
                  + """Schreibstil A: """
                  + style_instructions 
                  # + """Transkribiere, ohne den Inhalt zu verändern"""
              ),
          },
      ]
  )
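  fixed_text = answer["choices"][0]["message"]["content"]
else:
  # Only English and German prompts are defined; fail with a clear message for anything else
  raise ValueError(f"No fixing prompt is defined for language '{fixing_language}'; use 'en' or 'de'.")
print(fixed_text)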

Subject: Testing Dictation-to-Writing Workflow Notebook

Dear Gareth,

I hope this email finds you well. I recently came across your mention of the dictation-to-writing workflow on Twitter. I thought it would be a good idea to clean up the notebook I had created for that purpose. If you're interested, I would be grateful if you could give it a try and test everything to see how well it works.

Your feedback would be invaluable to me, particularly if you encounter any issues that I may not have anticipated. Thank you in advance for your time and assistance.

Best regards,
Max
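
The English and German branches above differ only in their prompt text. One possible way to avoid the duplication is to keep the prompt fragments in a dictionary and build the message list in one place. This is a sketch rather than the notebook's own structure: the names `PROMPTS` and `build_messages` are assumptions, and the prompt wording is taken from the cell above with light cleanup.

```python
# Prompt fragments per language, adapted from the cell above
PROMPTS = {
    "en": {
        "system": ("You are a perfect transcription program that is able to take faulty "
                   "dictations and put them into a readable, grammatically correct form "
                   "without changing their content or their formulations."),
        "task": ("Transcribe this faulty dictation. Use all of the techniques in Writing "
                 "Style A, which you are provided with below the dictation, to generate "
                 "your result!"),
        "start": "<transcript start>",
        "end": "<transcript end>",
        "style_label": "Writing Style A:",
    },
    "de": {
        "system": ("Du bist ein perfektes Transkriptionsprogramm, das in der Lage ist, "
                   "fehlerhafte Diktate aufzunehmen und sie in eine lesbare, grammatikalisch "
                   "korrekte Form zu bringen, ohne ihren Inhalt oder ihre Formulierungen zu "
                   "verändern."),
        "task": ("Transkribiere dieses fehlerhafte Diktat. Verwende alle Techniken in "
                 "Schreibstil A, die dir unter dem Diktat zur Verfügung gestellt werden, "
                 "um dein Ergebnis zu erzeugen!"),
        "start": "<transkript start>",
        "end": "<transkript ende>",
        "style_label": "Schreibstil A:",
    },
}


def build_messages(transcript, style_instructions, language="en"):
    """Assemble the chat messages for the fixing step in the given language."""
    if language not in PROMPTS:
        raise ValueError(f"No prompt defined for language {language!r}")
    p = PROMPTS[language]
    user_content = (
        p["task"] + "\n"
        + p["start"] + " " + transcript + " " + p["end"] + "\n\n"
        + p["style_label"] + " " + style_instructions
    )
    return [
        {"role": "system", "content": p["system"]},
        {"role": "user", "content": user_content},
    ]


messages = build_messages("hey gareth so i saw you mentioned", "Write a short, friendly e-mail.", "en")
# The list can then be passed to the same call as above, e.g.
# answer = openai.ChatCompletion.create(model=OpenAi_model, messages=messages)
```

With this layout, adding another language only means adding one more entry to `PROMPTS`.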
