<a href="https://colab.research.google.com/github/0ldriku/CAF-Annotator/blob/main/CAF_ANNOTATOR_Whisper_timestamped.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# How to use

whisper-timestamped is an extension of the openai-whisper Python package and is meant to be compatible with any version of openai-whisper. It provides more efficient/accurate word timestamps.
https://github.com/linto-ai/whisper-timestamped

**Important: Set the Google Colab runtime to a GPU runtime before running the cells. The Transcribe cell loads the model with `device="cuda"`, and a GPU also gives much faster processing.**
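
If you are unsure whether the current runtime has a GPU, a quick check like the one below may help. It is a minimal sketch and not part of the original notebook: the helper name `gpu_runtime_available` is hypothetical, it assumes PyTorch is importable (as it normally is on Colab), and it falls back to a rough heuristic (looking for `nvidia-smi` on the PATH) when PyTorch is missing.

```python
import shutil


def gpu_runtime_available() -> bool:
    """Return True if a CUDA GPU appears to be usable in this runtime.

    Hypothetical helper for illustration; not part of whisper-timestamped.
    """
    try:
        import torch  # assumed to be preinstalled on Colab; optional elsewhere
    except ImportError:
        # Rough fallback: the NVIDIA driver tool is present on PATH.
        return shutil.which("nvidia-smi") is not None
    return torch.cuda.is_available()


if gpu_runtime_available():
    print("GPU runtime detected.")
else:
    print("No GPU detected. In Colab: Runtime > Change runtime type, then select a GPU.")
```

Running this before the Transcribe cell avoids waiting for the package installation only to fail when the model is moved to CUDA.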

To use this notebook:
1. Upload the audio or video files you wish to transcribe. You can select and upload multiple files at once.
2. Adjust the settings according to your preferences, selecting the model size and specifying the language.
3. Run the "Transcribe" cell to start the transcription. The results are downloaded automatically as a ZIP file when it finishes.



In [None]:
#@title **1. Upload Local File**
# @markdown You can upload multiple files.
from google.colab import files

use_drive = False
uploaded = files.upload()
# Collect the names of the files uploaded in this run
file_names = list(uploaded.keys())
print('File uploaded. Please continue to upload more or execute the next cell.')


Saving en_example.wav to en_example.wav
File uploaded. Please continue to upload more or execute the next cell.


In [None]:
#@title **2. Required settings:**


# @markdown Model size affects both processing time and transcription quality.
# @markdown <br/>The default model is the stable large-v2 model.
# @markdown <br/>The default recognition language is English. If your audio is in another language, change the language code (for example, 'en' or 'ja').
# @markdown <br/>**Please note:** large-v3 is not necessarily better than large-v2 or earlier models in every case. Choose the model that works best for your audio.

model_size = "large-v2"  # @param ["base","small","medium", "large-v1","large-v2","large-v3"]
language = "en"  # @param {type:"string"}
set_beam_size = 5
is_vad_filter = "False"



In [None]:
#@title **3. Transcribe**
#@markdown The transcription files are downloaded automatically when this cell finishes.
!pip install ffmpeg
!pip install whisper-timestamped
!pip install auditok

import json
import os
import zipfile

import whisper_timestamped as whisper
from IPython.display import clear_output  # needed for the clear_output() call below

def convert_to_schema(result):
    segments = []
    for segment in result['segments']:
        word_timestamps = []
        for word in segment['words']:
            word_timestamps.append({
                'start': word['start'],
                'end': word['end'],
                'text': word['text']
            })
        segments.append({
            'start': segment['start'],
            'end': segment['end'],
            'subtitle': segment['text'],
            'word_timestamps': word_timestamps
        })
    return segments

# Load the Whisper model
model = whisper.load_model(model_size, device="cuda")
clear_output()


# Create a single ZIP file
combined_zip_filename = "transcription_results.zip"
with zipfile.ZipFile(combined_zip_filename, 'w') as zipf:
    for file_name in file_names:
        file_basename = os.path.splitext(os.path.basename(file_name))[0]
        _, extension = os.path.splitext(file_name)

        # Perform the transcription
        result = whisper.transcribe(
            model,
            file_name,
            language=language,
            beam_size=set_beam_size,  # value from the settings cell (default 5)
            best_of=5,
            temperature=(0.0, 0.2, 0.4, 0.6, 0.8, 1.0),
            vad="auditok",
            detect_disfluencies=True,
        )

        # Convert the result to the desired schema
        converted_result = convert_to_schema(result)

        # Save the transcription results to a JSON file
        output_file = f"{file_basename}{extension}.transcribe.json"
        with open(output_file, "w", encoding="utf-8") as json_file:
            json.dump(converted_result, json_file, indent=2, ensure_ascii=False)

        # Extract and save subtitles to a text file
        subtitles_path = f"{file_basename}{extension}.subtitles.txt"
        with open(subtitles_path, "w", encoding="utf-8") as subtitles_file:
            # Use a distinct loop variable so the transcription `result` is not shadowed
            for segment in converted_result:
                subtitles_file.write(segment["subtitle"] + "\n")

        # Write the JSON file and the subtitles text file to the ZIP archive
        zipf.write(output_file, f"{file_basename}/{output_file}")
        zipf.write(subtitles_path, f"{file_basename}/{subtitles_path}")

        print(f'File {file_name} was completed!')

print('All done!')
files.download(combined_zip_filename)

Collecting ffmpeg
  Downloading ffmpeg-1.4.tar.gz (5.1 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: ffmpeg
  Building wheel for ffmpeg (setup.py) ... done
  Created wheel for ffmpeg: filename=ffmpeg-1.4-py3-none-any.whl size=6082 sha256=538db8d40d2eedb80f5dfc79eb2937a7307d9906ca700eb0748dd86f39687742
  Stored in directory: /root/.cache/pip/wheels/8e/7a/69/cd6aeb83b126a7f04cbe7c9d929028dc52a6e7d525ff56003a
Successfully built ffmpeg
Installing collected packages: ffmpeg
Successfully installed ffmpeg-1.4
Collecting whisper-timestamped
  Downloading whisper_timestamped-1.15.4-py3-none-any.whl (53 kB)
Collecting dtw-python (from whisper-timestamped)
  Downloading dtw_python-1.5.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (770 kB)

100%|█████████████████████████████████████| 2.87G/2.87G [00:34<00:00, 89.9MiB/s]
100%|██████████| 5690/5690 [00:14<00:00, 398.93frames/s]


File en_example.wav was completed!
All done!
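
Each `.transcribe.json` file in the ZIP (for example `en_example.wav.transcribe.json`) holds the list built by `convert_to_schema`: one entry per segment with `start`, `end`, `subtitle`, and a `word_timestamps` list. The sketch below shows one possible way to read that structure back. The sample data and the helper names `load_transcription` and `iter_words` are hypothetical and only illustrate the schema; they are not part of the notebook.

```python
import json

# Hypothetical sample in the same shape that convert_to_schema() produces.
sample = [
    {
        "start": 0.0,
        "end": 1.2,
        "subtitle": " Hello world",
        "word_timestamps": [
            {"start": 0.0, "end": 0.5, "text": "Hello"},
            {"start": 0.7, "end": 1.2, "text": "world"},
        ],
    }
]


def load_transcription(path):
    """Load a saved .transcribe.json file (a list of segment dicts)."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def iter_words(segments):
    """Yield (text, start, end) for every word across all segments."""
    for segment in segments:
        for word in segment["word_timestamps"]:
            yield word["text"], word["start"], word["end"]


# Round-trip through JSON text to mimic reading a saved file.
segments = json.loads(json.dumps(sample))
words = list(iter_words(segments))
for text, start, end in words:
    print(f"{start:6.2f} - {end:6.2f}  {text}")
```

With a real file, replace the round-trip line with `segments = load_transcription("en_example.wav.transcribe.json")` after extracting the ZIP.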

