<a href="https://colab.research.google.com/github/JanEggers-hr/youtube-scraper/blob/main/whisper_audio_conversion.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Whisper-based audio-to-text conversion

Runs OpenAI's "Whisper" speech-to-text library in a Colab. Nothing is uploaded to OpenAI's servers; everything is processed within the Colab (i.e. in the Google Cloud).

You are asked to upload a file; the file is then transcribed to a .txt file, which is downloaded to your download folder.

The notebook works with M4A files, so other formats such as MP3 are converted first.

## Tips for running this colab

- Activate the GPU in the Colab environment (menu "Runtime" / "Change runtime type") - this speeds up the Whisper conversion immensely
- Use a browser plugin like [Colab Auto Clicker](https://addons.mozilla.org/en-US/firefox/addon/colab-automatic-clicker/) for Firefox to keep the connection to the notebook alive while it is working, and leave the browser tab open
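
To check whether the GPU tip above has taken effect, a minimal sketch like the following can be run in a cell. It assumes PyTorch is importable (it is installed as a Whisper dependency); `detect_device` is a hypothetical helper name, not part of Whisper or Colab.

```python
def detect_device():
    """Return "cuda" if PyTorch can see a GPU, otherwise "cpu"."""
    try:
        import torch  # assumed to be present as a Whisper dependency
    except ImportError:
        # Without PyTorch we cannot query the GPU, so report CPU
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


device = detect_device()
print(f"Whisper will run on: {device}")
```

If this prints `cpu`, the runtime type has probably not been switched to a GPU runtime yet.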




The notebook uses the medium-sized multilingual model (needs about 5 GB of GPU memory); for better accuracy, switch to "large" (about 10 GB); for faster transcription, use "small" (about 2 GB).
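
As a rough illustration of this trade-off, the sketch below picks the largest model that fits a given amount of GPU memory. It treats the sizes quoted above as approximate memory budgets, which is an assumption; `APPROX_GB` and `pick_model` are hypothetical names, not part of Whisper.

```python
# Approximate memory budgets in GB, taken from the figures quoted above
APPROX_GB = {"small": 2, "medium": 5, "large": 10}


def pick_model(free_gb):
    """Return the largest model name whose approximate budget fits free_gb."""
    fitting = [name for name, need in APPROX_GB.items() if need <= free_gb]
    if not fitting:
        # Nothing fits: fall back to the smallest model listed here
        return "small"
    return max(fitting, key=APPROX_GB.get)


print(pick_model(6))
```

The returned name could then be passed to `whisper.load_model(...)` in place of the hard-coded `"medium"`.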

Remember to switch on the GPU in Colab, or conversion will be really, really slow. **But even with the GPU enabled, the conversion takes some time** - approx. one minute for every five minutes of audio with the medium model - so be patient! If you lose the connection to the Colab VM, reconnect and rerun the cell; it will ask you to upload the file again.
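
The rule of thumb above can be turned into a quick estimate. This is only a sketch: the ratio of five minutes of recording per minute of processing is the approximate figure quoted above for the medium model on a GPU, and `estimate_minutes` is a hypothetical helper name.

```python
def estimate_minutes(recording_minutes, recording_per_compute_minute=5.0):
    """Estimate processing time from the rough 5:1 ratio quoted in the text."""
    return recording_minutes / recording_per_compute_minute


# A one-hour recording at the quoted ratio
print(f"Expect roughly {estimate_minutes(60):.0f} minutes of processing")
```

Actual times depend on the GPU assigned to the Colab session and on the model size, so treat the result as a ballpark figure.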

One thing that Whisper does not do for you: insert paragraphs, line breaks, indentations, emphases. Anything that makes the text block more readable is missing. Sorry.
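
One possible workaround is to insert paragraph breaks wherever the speaker pauses. The sketch below is a heuristic, not something Whisper provides: it relies on the `segments` list in the transcription result (each segment has `start`, `end` and `text`), while the function name `segments_to_paragraphs`, the 1.5-second threshold and the demo data are assumptions for illustration.

```python
def segments_to_paragraphs(segments, min_pause=1.5):
    """Join segment texts, starting a new paragraph after long pauses."""
    paragraphs, current, last_end = [], [], None
    for seg in segments:
        # A gap of at least min_pause seconds closes the current paragraph
        if current and seg["start"] - last_end >= min_pause:
            paragraphs.append(" ".join(current))
            current = []
        current.append(seg["text"].strip())
        last_end = seg["end"]
    if current:
        paragraphs.append(" ".join(current))
    return "\n\n".join(paragraphs)


# Made-up segments in the shape of result["segments"]
demo = [
    {"start": 0.0, "end": 2.0, "text": " Hello and welcome."},
    {"start": 2.1, "end": 4.0, "text": " Today we talk about Whisper."},
    {"start": 6.5, "end": 8.0, "text": " First, the setup."},
]
print(segments_to_paragraphs(demo))
```

In the notebook, `segments_to_paragraphs(result["segments"])` could be written to the .txt file instead of `result["text"]`.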

In [None]:
#@title
# Install ffmpeg, Whisper and pydub
import os
print("Installing libraries for conversion (may take some time)")
!apt install -y ffmpeg
!pip install git+https://github.com/openai/whisper.git > /dev/null 2>&1
!pip install pydub  > /dev/null 2>&1

output_dir = "/content/audio/"
# Create output directory if it does not exist yet
os.makedirs(output_dir, exist_ok=True)

os.chdir(output_dir)

import whisper
model = whisper.load_model("medium")
# model = whisper.load_model("large")

from google.colab import files
file_dict = files.upload()

# Take the first uploaded file; files.upload() saves it into the current
# working directory, which is output_dir.

f = list(file_dict.keys())[0]
fname = os.path.join(output_dir, f)

from pydub import AudioSegment

# File extension, without the dot and lower-cased
stem, suffix_raw = os.path.splitext(fname)
suffix = suffix_raw[1:].lower()
audio_fname = fname
if suffix != "m4a":
    # Convert to M4A using pydub (which calls ffmpeg)
    audio = AudioSegment.from_file(fname, suffix)
    audio_fname = stem + ".m4a"
    audio.export(audio_fname, format="mp4")
    print("M4A file generated")


print("Starting conversion of audio to text file.")
txt_fname = stem + ".txt"
# Transcribe the M4A version of the file
result = model.transcribe(audio_fname)
with open(txt_fname, 'w') as txt_file:
    txt_file.write(result["text"])


print("Done - file converted.")
print("Saving to the download folder")
files.download(txt_fname)

Installing libraries for conversion (may take some time)
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
ffmpeg is already the newest version (7:4.4.2-0ubuntu0.22.04.1).
0 upgraded, 0 newly installed, 0 to remove and 15 not upgraded.
