# Description

This Python script for Google Colab automates the batch transcription of Italian audio files using OpenAI's Whisper model. Audio files are read directly from Google Drive, and the resulting transcriptions are saved as text files on Drive (with the default settings, in the same folder as the audio).

## Variable Description

- `model_name`: Specifies the Whisper model to use ("tiny", "base", "small", "medium", "large"). Larger models offer greater accuracy but require more time and resources.
- `lang`: Sets the audio file language to "it" (Italian).
- `in_folder`: Path on Google Drive where the audio files to be transcribed are located.
- `out_folder`: Path on Google Drive where the transcriptions will be saved.
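
As a minimal sketch, the settings above could be checked before the (slow) model load. The helper name `check_settings` and the `KNOWN_MODELS` tuple are assumptions made for illustration; the tuple only repeats the model names listed in this description, and the library may accept others.

```python
import os

# Model names listed in this description (assumption: the library may accept more).
KNOWN_MODELS = ("tiny", "base", "small", "medium", "large")


def check_settings(model_name, lang, in_folder):
    """Return a list of problems found; an empty list means the settings look usable.

    out_folder is not checked because the script creates it when missing.
    """
    problems = []
    if model_name not in KNOWN_MODELS:
        problems.append(f"unknown model name: {model_name!r}")
    if not lang:
        problems.append("lang must not be empty")
    if not os.path.isdir(in_folder):
        problems.append(f"input folder not found: {in_folder!r}")
    return problems


# Example: the current directory always exists, so only the model name is reported.
print(check_settings("huge", "it", "."))
```

Running a check like this first can save a long model download when the Drive path is simply mistyped.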

## Audio File Structure

Audio files must be organized in the `in_folder` folder using the following naming convention:

`<lesson_identifier>.<recording_number>.<extension>`

Where `<lesson_identifier>` can be a number, a date, or any other string that contains no dots (`.`), because the script treats everything before the first dot as the lesson identifier.

### Examples:

- `1.1.m4a` (first recording of lesson 1)
- `2023-10-27.1.mp3` (first recording of the lesson from October 27, 2023)
- `AdvancedCourse.2.wav` (second recording of the AdvancedCourse lesson)
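
The naming convention can be illustrated with a small parser. This is a sketch, not part of the script: the names `Recording` and `parse_recording_name` are hypothetical, and the check is stricter than the script itself, which only looks at the text before the first dot.

```python
from typing import NamedTuple, Optional


class Recording(NamedTuple):
    lesson: str     # text before the first dot
    number: str     # recording number, kept as text
    extension: str  # for example "m4a", without the dot


def parse_recording_name(filename: str) -> Optional[Recording]:
    """Split '<lesson_identifier>.<recording_number>.<extension>' into its parts.

    Returns None when the name does not have exactly three non-empty
    dot-separated parts.
    """
    parts = filename.split(".")
    if len(parts) != 3 or not all(parts):
        return None
    return Recording(*parts)


for name in ["1.1.m4a", "2023-10-27.1.mp3", "AdvancedCourse.2.wav", "notes.txt"]:
    print(name, "->", parse_recording_name(name))
```

A name such as `notes.txt` yields `None`, which makes it easy to spot files that do not follow the convention before starting a long transcription run.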

## Transcription Structure

Transcriptions are saved in the `out_folder` folder, one text file per lesson, using the following naming convention:

`<lesson_identifier>.txt`

### Examples:

- `1.txt` (contains the transcription of 1.1.m4a)
- `2023-10-27.txt` (contains the transcription of 2023-10-27.1.mp3)
- `AdvancedCourse.txt` (contains the transcription of AdvancedCourse.2.wav)
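
The mapping from audio files to transcript files can be sketched as follows. The function name `plan_transcripts` is hypothetical; it only mirrors the rule used by the script (text before the first dot, plus `.txt`) and does not touch the file system.

```python
from collections import defaultdict


def plan_transcripts(filenames):
    """Map each transcript name to the recordings that would be written into it."""
    plan = defaultdict(list)
    for name in sorted(filenames):       # same lexicographic order as list.sort()
        lesson = name.split(".")[0]      # lesson identifier: text before the first dot
        plan[lesson + ".txt"].append(name)
    return dict(plan)


print(plan_transcripts(["1.2.m4a", "1.1.m4a", "2023-10-27.1.mp3"]))
# {'1.txt': ['1.1.m4a', '1.2.m4a'], '2023-10-27.txt': ['2023-10-27.1.mp3']}
```

Because the order is lexicographic, `1.10.m4a` sorts before `1.2.m4a`; zero-padding the recording number (`1.02.m4a`) is one way to keep a lesson with ten or more recordings in order.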

In [1]:
%%capture
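# Install Whisper and ffmpeg (Whisper uses ffmpeg to decode audio)
!pip install -U openai-whisper
!sudo apt install -y ffmpeg
import os
import whisper
from google.colab import drive
# Mount Google Drive so the audio folder is reachable under /content/drive
drive.mount('/content/drive')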

In [2]:
%%capture
model_name = "large"  # Whisper model name
lang = "it"  # audio language
in_folder = "drive/MyDrive/audio-files/"  # folder containing the audio files
out_folder = "drive/MyDrive/audio-files/"  # folder where transcriptions are saved

In [None]:
done = []  # lesson identifiers that already have a transcription
model = whisper.load_model(model_name)

In [None]:
os.makedirs(out_folder, exist_ok=True)  # create the folder if it does not exist
transcriptions = os.listdir(out_folder)
for file in transcriptions:
  if file.endswith(".txt"):
    done.append(file.split(".")[0])  # keep only the lesson identifier
print(done)

In [None]:
audios = sorted(os.listdir(in_folder))
for file in audios:
  print("PROCESSING: " + file)
  lesson = file.split(".")[0]  # lesson identifier: text before the first dot
  if lesson in done:
    print("SKIP: " + file)
    continue
  print("TRANSCRIBING: " + file)
  try:
    result = model.transcribe(os.path.join(in_folder, file), language=lang)
  except Exception as e:  # the whisper package exposes no WhisperError class
    print(f"ERROR while working on {file}: {e}")
    continue  # nothing to save for this file, move on to the next one
  print("SAVING: " + file + " transcription")
  # Append, so several recordings of one lesson end up in the same file
  with open(os.path.join(out_folder, lesson + ".txt"), "a", encoding="utf-8") as f:
    f.write(result["text"])
  print("DONE: " + file)
  print("=======================================")

print("======== END ========")
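
With the default settings `in_folder` and `out_folder` are the same folder, so `os.listdir(in_folder)` also returns the `.txt` transcripts and any other files stored there. A possible refinement, sketched here under the assumption that only the extensions from the examples above are in use, is to filter the listing before the loop. The names `AUDIO_EXTENSIONS` and `audio_files` are hypothetical.

```python
import os

# Assumed extension list, taken from the examples in this description; extend as needed.
AUDIO_EXTENSIONS = {".m4a", ".mp3", ".wav"}


def audio_files(names):
    """Return the sorted names whose extension (case-insensitive) is a known audio type."""
    return sorted(
        name for name in names
        if os.path.splitext(name)[1].lower() in AUDIO_EXTENSIONS
    )


print(audio_files(["1.txt", "1.1.m4a", "2.1.MP3", "notes.pdf"]))
```

In the loop above, `audios = audio_files(os.listdir(in_folder))` would then replace the plain sorted listing, so that non-audio files are never passed to Whisper.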