# Transcribe Audio

Use the two code cells below to transcribe audio. The first cell transcribes a
single file; the second transcribes every file in a folder.

All transcriptions are saved in the `completed_transcriptions` folder.

The only information you need to provide is the audio file, or the folder
containing the audio files. An example file is included in the `to_transcribe`
folder, and the code below is currently set up to transcribe it. The quickest
way to get started is to add your own files to the `to_transcribe` folder and
then either update the file name ('test.mp3') in the first cell, or run the
second cell to transcribe every file in that folder.

The following filetypes are supported:

`.mp3, .mp4, .mpeg, .mpga, .m4a, .wav, .webm`
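
If you want to check file names before running a cell, the list above can be turned into a small filter. This is only a sketch: `SUPPORTED_EXTENSIONS` and `is_supported` are illustrative names and are not part of the `audio.transcribe` module.

```python
from pathlib import Path

# Extensions copied from the list above; both names are illustrative.
SUPPORTED_EXTENSIONS = {".mp3", ".mp4", ".mpeg", ".mpga", ".m4a", ".wav", ".webm"}


def is_supported(file_name: str) -> bool:
    """Return True if the file name ends in one of the supported extensions."""
    return Path(file_name).suffix.lower() in SUPPORTED_EXTENSIONS


print(is_supported("interview.MP3"))  # True (the comparison ignores case)
print(is_supported("notes.txt"))      # False
```

A check like this could be used to skip stray files (for example hidden system files) in the `to_transcribe` folder.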

---

### A note on large files

The [Whisper API](https://openai.com/research/whisper) only accepts files up to
25 MB in size. This code can handle larger files, but to do so it requires
`FFmpeg` to be installed on your machine. You can install it by following the
instructions here:
https://ffmpeg.org/download.html
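
One way to confirm that FFmpeg is installed is to look for it on your `PATH`. The sketch below assumes the executable is named `ffmpeg` and that the transcription code locates it through `PATH`; `ffmpeg_available` is a hypothetical helper, not part of this project.

```python
import shutil


def ffmpeg_available() -> bool:
    """Return True if an `ffmpeg` executable can be found on the PATH."""
    return shutil.which("ffmpeg") is not None


if not ffmpeg_available():
    print("FFmpeg was not found on PATH; files over the size limit may fail to split.")
```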

To handle larger files, this code breaks the original audio file into chunks.
While this works to get under the size limit, it can sometimes cut off words or
lead to slightly odd transcriptions around the point the audio was spliced. To
deal with this, if the file has been split, the final transcript will include a
line of asterisks (\*\*\*\*) where the separate chunks have been rejoined. This
should allow you to more easily jump in and manually correct any odd
transcription behaviour.
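
The rejoining step described above can be sketched as follows. This is not the project's actual implementation: `chunks_needed` and `join_chunk_transcripts` are hypothetical helpers, the limit is assumed to mean 25 * 1024 * 1024 bytes, and the real code may split by duration rather than by byte count.

```python
import math
from typing import List

SIZE_LIMIT_BYTES = 25 * 1024 * 1024  # assumed interpretation of the 25 MB limit
CHUNK_SEPARATOR = "****"             # the marker described above


def chunks_needed(file_size_bytes: int, limit: int = SIZE_LIMIT_BYTES) -> int:
    """Smallest number of equal-sized pieces that are each at or under the limit."""
    return max(1, math.ceil(file_size_bytes / limit))


def join_chunk_transcripts(chunk_transcripts: List[str]) -> str:
    """Join per-chunk transcripts, putting the separator on its own line."""
    return f"\n{CHUNK_SEPARATOR}\n".join(t.strip() for t in chunk_transcripts)


print(chunks_needed(60 * 1024 * 1024))  # 3
print(join_chunk_transcripts(["First part of the talk", "and the second part."]))
```

Searching the finished transcript for `****` then takes you straight to each splice point that may need a manual fix.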


In [None]:
from audio.transcribe import get_transcription
import shutil
import os
import time

audio_path = "./to_transcribe/test.mp3"
save_dir = "completed_transcriptions"
completed_audio_dir = "./completed_audio_files/"

# Use the file name without its extension as the transcript name
name = os.path.splitext(os.path.basename(audio_path))[0]

# Close the audio file before moving it (an open file cannot be moved on Windows)
with open(audio_path, "rb") as audio_file:
    transcript = get_transcription(audio_file)

timestamp = time.strftime("%Y_%m_%d_%H_%M_%S")
filename = os.path.join(save_dir, f"{name}_{timestamp}.txt")

os.makedirs(save_dir, exist_ok=True)
with open(filename, "w") as f:
    f.write(transcript)

# Move the audio file to the completed folder
os.makedirs(completed_audio_dir, exist_ok=True)
shutil.move(audio_path, completed_audio_dir)

## Transcribe a folder full of files

This cell transcribes all files in the `to_transcribe` directory. The
transcriptions are saved in a new timestamped subdirectory of the
`completed_transcriptions` folder.


In [None]:
from audio.transcribe import get_transcription
import shutil
import os
import time

folder_path = "./to_transcribe"
audio_files = os.listdir(folder_path)
timestamp = time.strftime("%Y_%m_%d_%H_%M_%S")
save_dir = f"completed_transcriptions/{timestamp}"
os.makedirs(save_dir, exist_ok=True)

completed_audio_dir = './completed_audio_files/'
os.makedirs(completed_audio_dir, exist_ok=True)

for audio_file_name in audio_files:
    try:
        audio_file_path = os.path.join(folder_path, audio_file_name)
        # File name without its extension, e.g. "test" for "test.mp3"
        name = os.path.splitext(audio_file_name)[0]
        print(f"Transcribing {name}...")
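        # Pass an open file object, as in the single-file cell above
        with open(audio_file_path, "rb") as audio_file:
            transcript = get_transcription(audio_file)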
        timestamp = time.strftime("%Y_%m_%d_%H_%M_%S")
        filename = f"{save_dir}/{name}_{timestamp}.txt"
        with open(filename, "w") as f:
            f.write(transcript)
        shutil.move(audio_file_path, os.path.join(completed_audio_dir, audio_file_name))
        print(f"Transcription for {audio_file_name} completed")
    except Exception as e:
        print(f"Error transcribing {audio_file_name}: {e}")
        continue