**Transcribing Audio to Subtitles Using OpenAI’s Whisper Model on Google Colab**

**Step 1: Setting Up the Environment**

In [1]:
# Install required packages
!pip install git+https://github.com/openai/whisper.git

Collecting git+https://github.com/openai/whisper.git
  Cloning https://github.com/openai/whisper.git to /tmp/pip-req-build-go1euf4g
  Running command git clone --filter=blob:none --quiet https://github.com/openai/whisper.git /tmp/pip-req-build-go1euf4g
  Resolved https://github.com/openai/whisper.git to commit 90db0de1896c23cbfaf0c58bc2d30665f709f170
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting tiktoken (from openai-whisper==20240930)
  Downloading tiktoken-0.8.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.6 kB)
Collecting triton>=2.0.0 (from openai-whisper==20240930)
  Downloading triton-3.1.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (1.3 kB)
Downloading triton-3.1.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (209.5 MB)

In [2]:
!pip install torch



**Step 2: Loading the Model**

In [3]:
import whisper
import torch
import os

# Check if a CUDA-enabled GPU is available
device = "cuda" if torch.cuda.is_available() else "cpu"
print(device)

cuda


In [4]:
# Load the Whisper model and move it to the GPU if available
model = whisper.load_model("large", device=device)

100%|██████████████████████████████████████| 2.88G/2.88G [00:30<00:00, 102MiB/s]
  checkpoint = torch.load(fp, map_location=device)
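
The `large` checkpoint is a 2.88 GB download and needs the most GPU memory of the available sizes. As a rough sketch, the model name could be picked from the memory that is actually available. The thresholds below are the approximate VRAM figures listed in the Whisper README (about 1 GB for `tiny` and `base`, 2 GB for `small`, 5 GB for `medium`, 10 GB for `large`); `pick_model_size` is a hypothetical helper, not part of Whisper.

```python
# Approximate VRAM needs in GB per model, largest first (rough figures, not guarantees).
APPROX_VRAM_GB = [("large", 10), ("medium", 5), ("small", 2), ("base", 1), ("tiny", 1)]

def pick_model_size(available_gb):
    """Return the largest model name whose approximate VRAM need fits."""
    for name, needed_gb in APPROX_VRAM_GB:
        if available_gb >= needed_gb:
            return name
    return "tiny"  # fall back to the smallest model

print(pick_model_size(16))  # large
print(pick_model_size(6))   # medium
```

On a GPU runtime, the available memory could be read with `torch.cuda.get_device_properties(0).total_memory / 1e9` and the result passed to `whisper.load_model`.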


**Step 3: Transcribing the Audio File**

In [7]:
# Specify the path to the audio file on Google Drive
audio_file = "./Data/harvard.wav"

# Set the input language to English
input_language = "en"  # English language code


In [8]:
# Transcribe the entire audio file in full precision (fp16 disabled) with the specified language
result = model.transcribe(audio_file, fp16=False, language=input_language)
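
Half precision only helps on a GPU; on a CPU, Whisper warns and falls back to FP32. A minimal sketch of deriving the `fp16` flag from the device chosen in Step 2 instead of hard-coding it. `transcribe_options` is a hypothetical helper, and unlike the cell above it would turn fp16 on for a `cuda` runtime.

```python
def transcribe_options(device, language=None):
    """Build keyword arguments for model.transcribe() from the target device."""
    use_fp16 = str(device).startswith("cuda")  # half precision only on GPU
    options = {"fp16": use_fp16}
    if language is not None:
        options["language"] = language
    return options

print(transcribe_options("cuda", "en"))  # {'fp16': True, 'language': 'en'}
print(transcribe_options("cpu"))         # {'fp16': False}
```

It could then be used as `model.transcribe(audio_file, **transcribe_options(device, input_language))`.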

In [10]:
import pprint
pp = pprint.PrettyPrinter(indent=4)
pp.pprint(result)

{   'language': 'en',
    'segments': [   {   'avg_logprob': -0.14225021760855147,
                        'compression_ratio': 1.4210526315789473,
                        'end': 4.0,
                        'id': 0,
                        'no_speech_prob': 0.10386829078197479,
                        'seek': 0,
                        'start': 0.0,
                        'temperature': 0.0,
                        'text': ' The stale smell of old beer lingers.',
                        'tokens': [   50365,
                                      440,
                                      342,
                                      1220,
                                      4316,
                                      295,
                                      1331,
                                      8795,
                                      22949,
                                      433,
                                      13,
                                      50565]},
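
The full `result` dictionary is verbose because every segment carries its token IDs and scoring fields. A small sketch that keeps only the fields needed for subtitles (`id`, `start`, `end`, `text`), using an abbreviated copy of the first segment printed above as sample data. `summarize_segments` is a hypothetical helper.

```python
def summarize_segments(segments):
    """Return (id, start, end, text) tuples, dropping tokens and scores."""
    return [(s["id"], s["start"], s["end"], s["text"].strip()) for s in segments]

# Abbreviated sample: the token list is shortened for illustration.
sample = [{"id": 0, "start": 0.0, "end": 4.0,
           "text": " The stale smell of old beer lingers.",
           "tokens": [50365, 440, 342]}]

for row in summarize_segments(sample):
    print(row)
```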
   

In [25]:
# Transcribed Text
print('Transcription: \n',result['text'])

Transcription: 
  The stale smell of old beer lingers. It takes heat to bring out the odor. A cold dip restores health and zest. A salt pickle tastes fine with ham. Tacos al pastor are my favorite. A zestful food is the hot cross bun.


**Step 4: Creating the SRT File**

In [11]:
# Helper function to convert seconds to SRT timestamp format
def format_timestamp(seconds):
    # Round to whole milliseconds first so float error cannot drop a millisecond
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, milliseconds = divmod(rem, 1000)
    return f"{hours:02}:{minutes:02}:{secs:02},{milliseconds:03}"
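
SRT timestamps have the form `HH:MM:SS,mmm`. To sanity-check generated values, the conversion can be inverted. This is only a sketch; `parse_timestamp` is a hypothetical helper and not part of the notebook.

```python
def parse_timestamp(stamp):
    """Convert an SRT timestamp 'HH:MM:SS,mmm' back to seconds."""
    clock, millis = stamp.split(",")
    hours, minutes, seconds = (int(part) for part in clock.split(":"))
    return hours * 3600 + minutes * 60 + seconds + int(millis) / 1000

print(parse_timestamp("00:00:04,000"))  # 4.0
print(parse_timestamp("01:01:01,500"))  # 3661.5
```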

In [12]:
# Create the SRT file content
srt_content = []
for i, segment in enumerate(result["segments"]):
    start_time = format_timestamp(segment["start"])
    end_time = format_timestamp(segment["end"])
    text = segment["text"].strip()
    srt_content.append(f"{i + 1}")
    srt_content.append(f"{start_time} --> {end_time}")
    srt_content.append(text)
    srt_content.append("")

In [15]:
print(*srt_content, sep='\n')

1
00:00:00,000 --> 00:00:04,000
The stale smell of old beer lingers.

2
00:00:04,000 --> 00:00:07,000
It takes heat to bring out the odor.

3
00:00:07,000 --> 00:00:10,000
A cold dip restores health and zest.

4
00:00:10,000 --> 00:00:13,000
A salt pickle tastes fine with ham.

5
00:00:13,000 --> 00:00:15,000
Tacos al pastor are my favorite.

6
00:00:15,000 --> 00:00:18,000
A zestful food is the hot cross bun.



**Step 5: Saving the SRT File to Google Drive**

In [16]:
# Write the SRT file to Google Drive
output_srt_file = "./Data/harvard.srt"
with open(output_srt_file, "w", encoding="utf-8") as f:
    f.write("\n".join(srt_content))

print(f"Subtitle file saved to {output_srt_file}")

Subtitle file saved to ./Data/harvard.srt
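
The write above fails if the `./Data` folder does not exist yet. A minimal sketch of a save helper that creates missing parent folders first; `save_srt` is a hypothetical name. To persist the file to Google Drive rather than the Colab session disk, the path would presumably need to point inside a mounted Drive folder, which this sketch does not handle.

```python
from pathlib import Path

def save_srt(lines, path):
    """Write subtitle lines to `path`, creating parent folders if needed."""
    target = Path(path)
    target.parent.mkdir(parents=True, exist_ok=True)
    # Join with newlines and end the file with a final newline
    target.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return target
```

In the notebook this could be called as `save_srt(srt_content, "./Data/harvard.srt")`.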


**Let's also try TTS: converting the generated transcription back to speech.**


**edge-tts** for text-to-speech conversion

edge-tts is a Python module that lets you use Microsoft Edge's online text-to-speech service from Python code.


**ipython** for audio playback in Jupyter notebooks

In [26]:
!pip install edge-tts

Collecting edge-tts
  Downloading edge_tts-7.0.0-py3-none-any.whl.metadata (5.2 kB)
Collecting srt<4.0.0,>=3.4.1 (from edge-tts)
  Downloading srt-3.5.3.tar.gz (28 kB)
  Preparing metadata (setup.py) ... done
Downloading edge_tts-7.0.0-py3-none-any.whl (23 kB)
Building wheels for collected packages: srt
  Building wheel for srt (setup.py) ... done
  Created wheel for srt: filename=srt-3.5.3-py3-none-any.whl size=22428 sha256=352ddeacea20f3fff5bf28f0c5600a85f178e13db3899a73bfc05e2ffc6f2267
  Stored in directory: /root/.cache/pip/wheels/d7/31/a1/18e1e7e8bfdafd19e6803d7eb919b563dd11de380e4304e332
Successfully built srt
Installing collected packages: srt, edge-tts
Successfully installed edge-tts-7.0.0 srt-3.5.3


In [27]:
!edge-tts --list-voices

Name                               Gender    ContentCategories      VoicePersonalities
---------------------------------  --------  ---------------------  --------------------------------------
af-ZA-AdriNeural                   Female    General                Friendly, Positive
af-ZA-WillemNeural                 Male      General                Friendly, Positive
am-ET-AmehaNeural                  Male      General                Friendly, Positive
am-ET-MekdesNeural                 Female    General                Friendly, Positive
ar-AE-FatimaNeural                 Female    General                Friendly, Positive
ar-AE-HamdanNeural                 Male      General                Friendly, Positive
ar-BH-AliNeural                    Male      General                Friendly, Positive
ar-BH-LailaNeural                  Female    General                Friendly, Positive
ar-DZ-AminaNeural                  Female    General                Friendly, Positive
ar-DZ-IsmaelNeural     

In [None]:
# en-IN-NeerjaExpressiveNeural
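
The voice list is long, so it can help to filter it by locale before picking a voice. A sketch that parses table rows in the layout printed by `edge-tts --list-voices` above (columns separated by two or more spaces); `voices_for_locale` is a hypothetical helper, and the sample rows are copied from that output.

```python
import re

def voices_for_locale(table_lines, locale_prefix):
    """Return (name, gender) pairs whose voice name starts with locale_prefix."""
    matches = []
    for line in table_lines:
        columns = re.split(r"\s{2,}", line.strip())
        if len(columns) >= 2 and columns[0].startswith(locale_prefix):
            matches.append((columns[0], columns[1]))
    return matches

listing = [
    "af-ZA-AdriNeural                   Female    General                Friendly, Positive",
    "ar-AE-FatimaNeural                 Female    General                Friendly, Positive",
    "ar-AE-HamdanNeural                 Male      General                Friendly, Positive",
]
print(voices_for_locale(listing, "ar-AE"))
# [('ar-AE-FatimaNeural', 'Female'), ('ar-AE-HamdanNeural', 'Male')]
```

In a notebook, the lines could be captured with `listing = !edge-tts --list-voices` and passed to the same function.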

In [28]:
!pip install ipython

Collecting jedi>=0.16 (from ipython)
  Downloading jedi-0.19.2-py2.py3-none-any.whl.metadata (22 kB)
Downloading jedi-0.19.2-py2.py3-none-any.whl (1.6 MB)
Installing collected packages: jedi
Successfully installed jedi-0.19.2


In [29]:
import edge_tts
import IPython.display as ipd

async def text_to_speech_tunable(text, voice="en-IN-NeerjaExpressiveNeural", rate="+0%", pitch="+0Hz"):
    # Initialize the edge-tts Communicate object
    communicate = edge_tts.Communicate(text=text, voice=voice, rate=rate, pitch=pitch)

    # Synthesize and save the output audio
    await communicate.save("./Data/output_audio.mp3")

    # Play the audio file in a Jupyter notebook or similar environment
    ipd.display(ipd.Audio("./Data/output_audio.mp3"))

In [31]:
# Call the async function with await
await text_to_speech_tunable(result['text'], voice="en-IN-NeerjaExpressiveNeural", rate="+0%", pitch="+0Hz")
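
The `rate` and `pitch` arguments are signed strings such as `"+0%"` and `"+0Hz"`. A minimal sketch for building them from plain integers, assuming edge-tts expects an explicit sign on both values (this matches the defaults used above but is not verified here). `format_rate` and `format_pitch` are hypothetical helpers.

```python
def format_rate(percent):
    """Format a signed integer percentage, e.g. 10 -> '+10%'."""
    return f"{int(percent):+d}%"

def format_pitch(hertz):
    """Format a signed integer pitch offset, e.g. -5 -> '-5Hz'."""
    return f"{int(hertz):+d}Hz"

print(format_rate(0), format_pitch(0))     # +0% +0Hz
print(format_rate(-20), format_pitch(15))  # -20% +15Hz
```

For example, a slower and slightly higher voice could be requested with `rate=format_rate(-20), pitch=format_pitch(15)` in the call above.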