# Insanely Fast Whisper: A journey to build the fastest possible transcription with Whisper 🔥

By: [Vaibhav (VB) Srivastav](https://twitter.com/reach_vb)

## Base case: fp16

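The baseline loads Whisper in half precision (fp16) so that inference runs on efficient half-precision GPU kernels. On CPU, the code below falls back to fp32.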

### Set up our inference environment 🧑‍💻

In [1]:
!pip install -q --upgrade transformers accelerate torch ipython-autotime

[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m797.1/797.1 MB[0m [31m1.5 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m410.6/410.6 MB[0m [31m4.1 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m14.1/14.1 MB[0m [31m80.1 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m23.7/23.7 MB[0m [31m66.6 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m823.6/823.6 kB[0m [31m45.2 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m664.8/664.8 MB[0m [31m2.8 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m121.6/121.6 MB[0m [31m7.5 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m56.5/56.5 MB[0m [31m12.4 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

### Setting up the utilities to track time taken by each step ⏳

In [2]:
%load_ext autotime

time: 360 µs (started: 2024-09-19 11:09:28 +00:00)


### Necessary imports 🔧




In [3]:
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

time: 21.5 s (started: 2024-09-19 11:09:28 +00:00)


### Define Model checkpoint, device and datatype 🔉

In [4]:
device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "openai/whisper-large-v3"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


config.json:   0%|          | 0.00/1.27k [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/3.09G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/3.90k [00:00<?, ?B/s]

WhisperForConditionalGeneration(
  (model): WhisperModel(
    (encoder): WhisperEncoder(
      (conv1): Conv1d(128, 1280, kernel_size=(3,), stride=(1,), padding=(1,))
      (conv2): Conv1d(1280, 1280, kernel_size=(3,), stride=(2,), padding=(1,))
      (embed_positions): Embedding(1500, 1280)
      (layers): ModuleList(
        (0-31): 32 x WhisperEncoderLayer(
          (self_attn): WhisperSdpaAttention(
            (k_proj): Linear(in_features=1280, out_features=1280, bias=False)
            (v_proj): Linear(in_features=1280, out_features=1280, bias=True)
            (q_proj): Linear(in_features=1280, out_features=1280, bias=True)
            (out_proj): Linear(in_features=1280, out_features=1280, bias=True)
          )
          (self_attn_layer_norm): LayerNorm((1280,), eps=1e-05, elementwise_affine=True)
          (activation_fn): GELUActivation()
          (fc1): Linear(in_features=1280, out_features=5120, bias=True)
          (fc2): Linear(in_features=5120, out_features=1280, bia

time: 31.9 s (started: 2024-09-19 11:09:50 +00:00)
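To make the fp16 choice concrete, here is a rough back-of-the-envelope sketch of the weight memory at each precision. The parameter count (about 1.55 billion for `whisper-large-v3`) is an assumption that does not come from this notebook, although it is consistent with the roughly 3 GB `model.safetensors` download in the log above. `weight_memory_gb` is a hypothetical helper name, not a `transformers` function.

```python
# Rough estimate of weight memory only; activations and caches come on top.
BYTES_PER_PARAM = {"float32": 4, "float16": 2}

# Assumption: whisper-large-v3 has roughly 1.55 billion parameters.
ASSUMED_NUM_PARAMS = 1.55e9


def weight_memory_gb(num_params, dtype):
    """Return the approximate size of the weights in gigabytes (10^9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


for dtype in ("float32", "float16"):
    print(f"{dtype}: ~{weight_memory_gb(ASSUMED_NUM_PARAMS, dtype):.1f} GB")
```

Under that assumption, halving the bytes per parameter halves the weight footprint, which is the main reason fp16 is used as the base case here.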


### Load the processor and initialise the speech recognition pipeline ⚡

In [5]:
processor = AutoProcessor.from_pretrained(model_id)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    torch_dtype=torch_dtype,
    device=device,
)

preprocessor_config.json:   0%|          | 0.00/340 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/283k [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/1.04M [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/2.48M [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/494k [00:00<?, ?B/s]

normalizer.json:   0%|          | 0.00/52.7k [00:00<?, ?B/s]

added_tokens.json:   0%|          | 0.00/34.6k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/2.07k [00:00<?, ?B/s]

time: 7.2 s (started: 2024-09-19 11:10:22 +00:00)


### Define an audio sample to test on 👇

The test clip is a 10-minute recording. You can find it [here](https://huggingface.co/datasets/reach-vb/random-audios/resolve/main/4469669-10.mp3).

In [6]:
sample = "https://huggingface.co/datasets/reach-vb/random-audios/resolve/main/4469669-10.mp3"

time: 387 µs (started: 2024-09-19 11:10:29 +00:00)


### Transcribe away! 💪

In [7]:
result = pipe(sample)

Due to a bug fix in https://github.com/huggingface/transformers/pull/28687 transcription using a multilingual Whisper will default to language detection followed by transcription instead of translation to English.This might be a breaking change for your use case. If you want to instead always translate your audio to English, make sure to pass `language='en'`.
Passing a tuple of `past_key_values` is deprecated and will be removed in Transformers v4.43.0. You should pass an instance of `EncoderDecoderCache` instead, e.g. `past_key_values=EncoderDecoderCache.from_legacy_cache(past_key_values)`.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


time: 1min 13s (started: 2024-09-19 11:10:34 +00:00)
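As a quick sanity check on that number, the sketch below converts the wall-clock time into a real-time factor. It assumes the clip is exactly 10 minutes long, as described above. `real_time_factor` is a hypothetical helper, not part of `transformers`.

```python
def real_time_factor(processing_seconds, audio_seconds):
    """Processing time divided by audio duration; lower means faster."""
    return processing_seconds / audio_seconds


processing_seconds = 1 * 60 + 13  # "1min 13s" reported by autotime above
audio_seconds = 10 * 60           # assumption: the clip is exactly 10 minutes

rtf = real_time_factor(processing_seconds, audio_seconds)
print(f"RTF: {rtf:.3f} ({1 / rtf:.1f}x faster than real time)")
```

Note that the measured time covers the whole `pipe(sample)` call, so it likely also includes fetching and decoding the audio, not just the model's forward passes.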


In [8]:
print(result["text"])

 Thank you. and investors will be accepted from 5.30 to 6 o'clock Japan time. Please be aware of that. Now, we will be collecting questions via telephone conferencing system. As is informed to you beforehand, the conference call system will require the pre-registration beforehand. Let me introduce the presenter today, President and CEO, Satoshi Tsunakawa. Corporate Senior Executive Vice President Mamoru Hatazawa. Representative Executive Officer, Corporate Executive Vice President, NCFO, Masayoshi Hirata. We have a chairperson of Strategic Review Committee Outside Director, Paul Brough. He is joining from Hong Kong online. My name is Hara of Corporate Communications Department. We are providing simultaneous translation, so if you are watching the live streaming in Japanese, you will be able to hear translation's voice. Please be aware of that. First, before going into transforming Toshiba to enhance Shihara's value, May I have Mr. Tanaka to say a few words upon the receipt of the repor

In [9]:
# Input audio sample
sample = "https://storage.googleapis.com/imagestg-bucket/noise_free_audio/1010259_2024.09.14_08.05.28-2024.09.14_08.06.28.wav"

# Perform speech recognition with timestamps
result = pipe(sample, batch_size=8, return_timestamps=True)

# Print the full transcript
print("Full Transcript:", result["text"])

# Adjust timestamps
num_chunks = len(result['chunks'])
total_duration = 60  # Total duration in seconds
adjusted_chunks = []

# Set the start time for the first chunk
start_time = 0.0

# Calculate adjusted timestamps
for i in range(num_chunks):
    # End each chunk where the next chunk starts; the chunk's own end time is not used
    if i < num_chunks - 1:
        end_time = result["chunks"][i + 1]['timestamp'][0]  # Start time of the next chunk
    else:
        end_time = total_duration  # Last chunk should end at total duration

    # Keep the original chunk text and replace only its timestamps
    adjusted_chunks.append({
        'text': result['chunks'][i]['text'],
        'timestamp': (start_time, end_time)
    })

    # Update the start time for the next chunk
    start_time = end_time

# Print each adjusted chunk with timestamps
for chunk in adjusted_chunks:
    start, end = chunk['timestamp']
    print(f"Transcript: {chunk['text']}, Start: {start:.2f}, End: {end:.2f}")



Full Transcript:  down on terror of course continues search operations continuing in some of these areas where terrorists are believed to be hiding my colleague nazir joins us with more details nazir take us through these three separate encounters uh one in kishtwar one in khatwar one in baramulla well there was a major encounter yesterday evening in in kishtwar the area which will go to polls and on in just in four days on 18th of this month and prime minister is addressing rally in the same region and so hours before prime minister's arrival in doda we saw this major encounter where army has suffered casualties two soldiers have been killed in action a junior commission officer has been killed and a soldier has javan has been killed in this encounter two soldiers have been injured the operation is underway according to army they had launched an operation in the area but there was a heavy exchange of firing in which army suffered two casualties in this encounter. So, there was also…
T

In [10]:
# Input audio sample
sample = "https://storage.googleapis.com/imagestg-bucket/noise_free_audio/1010834_2024.09.14_08.06.27-2024.09.14_08.07.27.wav"

# Perform speech recognition with timestamps
result = pipe(sample, batch_size=8, return_timestamps=True)

# Print the full transcript
print("Full Transcript:", result["text"])

# Adjust timestamps
num_chunks = len(result['chunks'])
total_duration = 60  # Total duration in seconds
adjusted_chunks = []

# Set the start time for the first chunk
start_time = 0.0

# Calculate adjusted timestamps
for i in range(num_chunks):
    # End each chunk where the next chunk starts; the chunk's own end time is not used
    if i < num_chunks - 1:
        end_time = result["chunks"][i + 1]['timestamp'][0]  # Start time of the next chunk
    else:
        end_time = total_duration  # Last chunk should end at total duration

    # Keep the original chunk text and replace only its timestamps
    adjusted_chunks.append({
        'text': result['chunks'][i]['text'],
        'timestamp': (start_time, end_time)
    })

    # Update the start time for the next chunk
    start_time = end_time

# Print each adjusted chunk with timestamps
for chunk in adjusted_chunks:
    start, end = chunk['timestamp']
    print(f"Transcript: {chunk['text']}, Start: {start:.2f}, End: {end:.2f}")



Full Transcript:  On this former industrial site in Whitehaven, Cumbria, planned to produce 60 million tonnes of new coal. But since it was approved, a Supreme Court ruling found projects must consider carbon emissions from burning fossil fuels, not just digging them up. The mine was the first legal test of that. It's been a really important victory. But what frustrates me most is that the years local decision makers have been putting all their efforts in fighting for a coal mine in Whitehaven, they could have spent that time investing in green jobs. it was mining jobs the community expected. This site has sat empty for 20 years. It's devastating for the community. The jobs that were going to be created were possibly in the construction and supply chain surrounding it were probably around 2,000 and the long-term well-paid jobs for running the facility were in excess of 500. The new government has been clear about its net zero ambitions but judgments like this one could test them we're 

In [11]:
# Input audio sample
sample = "https://storage.googleapis.com/imagestg-bucket/noise_free_audio/1010834_2024.09.14_08.05.27-2024.09.14_08.06.27.wav"

# Perform speech recognition with timestamps
result = pipe(sample, batch_size=8, return_timestamps=True)

# Print the full transcript
print("Full Transcript:", result["text"])

# Adjust timestamps
num_chunks = len(result['chunks'])
total_duration = 60  # Total duration in seconds
adjusted_chunks = []

# Set the start time for the first chunk
start_time = 0.0

# Calculate adjusted timestamps
for i in range(num_chunks):
    # End each chunk where the next chunk starts; the chunk's own end time is not used
    if i < num_chunks - 1:
        end_time = result["chunks"][i + 1]['timestamp'][0]  # Start time of the next chunk
    else:
        end_time = total_duration  # Last chunk should end at total duration

    # Keep the original chunk text and replace only its timestamps
    adjusted_chunks.append({
        'text': result['chunks'][i]['text'],
        'timestamp': (start_time, end_time)
    })

    # Update the start time for the next chunk
    start_time = end_time

# Print each adjusted chunk with timestamps
for chunk in adjusted_chunks:
    start, end = chunk['timestamp']
    print(f"Transcript: {chunk['text']}, Start: {start:.2f}, End: {end:.2f}")



Full Transcript:  Content we flagged was taken down, but it's still possible to find more. A century-old ideology of hate, pushed by cutting-edge algorithms, to a massive modern audience. Tom Cheshire, Sky News. The UK's High Court has reversed their decision to approve the UK's first new coal mine in 30 years. It follows the confirmation this week that nearly 3,000 jobs will be lost at the Port Talbot steelworks and potentially 400 at Scotland's only oil refinery, Grangemouth. Our science and technology editor Tom Clark reports now on the UK's changing green credentials. Their case argued UK coal has no future on a rapidly warming planet. We have won. The High Court pretty much agreed. It's a huge win. And this, the proposed coal mine that lost. The Woodhouse Colliery...
Transcript:  Content we flagged was taken down, but it's still possible to find more., Start: 0.00, End: 5.48
Transcript:  A century-old ideology of hate, pushed by cutting-edge algorithms, to a massive modern audienc

In [12]:
# Input audio sample
sample = "https://storage.googleapis.com/imagestg-bucket/noise_free_audio/1010259_2024.09.14_08.04.28-2024.09.14_08.05.28.wav"

# Perform speech recognition with timestamps
result = pipe(sample, batch_size=8, return_timestamps=True)

# Print the full transcript
print("Full Transcript:", result["text"])

# Adjust timestamps
num_chunks = len(result['chunks'])
total_duration = 60  # Total duration in seconds
adjusted_chunks = []

# Set the start time for the first chunk
start_time = 0.0

# Calculate adjusted timestamps
for i in range(num_chunks):
    # End each chunk where the next chunk starts; the chunk's own end time is not used
    if i < num_chunks - 1:
        end_time = result["chunks"][i + 1]['timestamp'][0]  # Start time of the next chunk
    else:
        end_time = total_duration  # Last chunk should end at total duration

    # Keep the original chunk text and replace only its timestamps
    adjusted_chunks.append({
        'text': result['chunks'][i]['text'],
        'timestamp': (start_time, end_time)
    })

    # Update the start time for the next chunk
    start_time = end_time

# Print each adjusted chunk with timestamps
for chunk in adjusted_chunks:
    start, end = chunk['timestamp']
    print(f"Transcript: {chunk['text']}, Start: {start:.2f}, End: {end:.2f}")



Full Transcript:  assembly seats. And in second and third phase, Prime Minister would be campaigning in Riyasi district, in Katra, Vaishnava Mata, Vaishnava Devi, Konesh Jhansi, Kutwa, Jammu. So an extensive campaigning, Home Minister Amit Shah has already addressed rallies here. He is coming here again in next few days. So our other senior leaders, BJP President J.P. Nanda, Defence Minister Rajnath Singh recently addressed Riley in Ramban and Banihal as well. So, top leaders, top guns of the BJP are campaigning in Jammu and Kashmir as the first phase of polling is approaching. It is just four days away. Nazir, stay on with us. There's some more news coming in from Jammu and Kashmir. Like you had mentioned, there was an encounter that took place in Jammu and Kashmir right ahead of the Prime Minister's visit. Three separate encounters, in fact. In Kishtwar, Two soldiers have unfortunately died. Two terrorists were killed in Kathua in the encounter, and an encounter underway in Barakmula

In [13]:
# Input audio sample
sample = "https://storage.googleapis.com/imagestg-bucket/noise_free_audio/1010834_2024.09.14_08.04.27-2024.09.14_08.05.27.wav"

# Perform speech recognition with timestamps
result = pipe(sample, batch_size=8, return_timestamps=True)

# Print the full transcript
print("Full Transcript:", result["text"])

# Adjust timestamps
num_chunks = len(result['chunks'])
total_duration = 60  # Total duration in seconds
adjusted_chunks = []

# Set the start time for the first chunk
start_time = 0.0

# Calculate adjusted timestamps
for i in range(num_chunks):
    # End each chunk where the next chunk starts; the chunk's own end time is not used
    if i < num_chunks - 1:
        end_time = result["chunks"][i + 1]['timestamp'][0]  # Start time of the next chunk
    else:
        end_time = total_duration  # Last chunk should end at total duration

    # Keep the original chunk text and replace only its timestamps
    adjusted_chunks.append({
        'text': result['chunks'][i]['text'],
        'timestamp': (start_time, end_time)
    })

    # Update the start time for the next chunk
    start_time = end_time

# Print each adjusted chunk with timestamps
for chunk in adjusted_chunks:
    start, end = chunk['timestamp']
    print(f"Transcript: {chunk['text']}, Start: {start:.2f}, End: {end:.2f}")



Full Transcript:  content to people who don't really understand but think it's cool or funny. However, the impact on the victim is the same, which is the kind of experience of hate of minority communities. This is one of the musicians whose songs have been bolted onto Nazi content without their knowledge. The artist, Pastel Ghost, told us, I was not previously aware that my music was being used in this way and I find it shocking and deplorable. Sky News previously reported about Islamic State supporters using the same sounds loophole to gain more traction on TikTok. We forwarded all the Nazi videos we found this time to TikTok and asked the company for comment. A spokesperson told us, This content was immediately removed for breaching our strict policies against hate speech. We regularly train our safety professionals and update our safeguards to detect hateful behaviour on an ongoing basis and we remove 91% of this type of content before it is reported to us.
Transcript:  content to p

In [14]:
# Input audio sample
sample = "https://storage.googleapis.com/imagestg-bucket/noise_free_audio/1010259_2024.09.14_08.03.28-2024.09.14_08.04.28.wav"

# Perform speech recognition with timestamps
result = pipe(sample, batch_size=8, return_timestamps=True)

# Print the full transcript
print("Full Transcript:", result["text"])

# Adjust timestamps
num_chunks = len(result['chunks'])
total_duration = 60  # Total duration in seconds
adjusted_chunks = []

# Set the start time for the first chunk
start_time = 0.0

# Calculate adjusted timestamps
for i in range(num_chunks):
    # End each chunk where the next chunk starts; the chunk's own end time is not used
    if i < num_chunks - 1:
        end_time = result["chunks"][i + 1]['timestamp'][0]  # Start time of the next chunk
    else:
        end_time = total_duration  # Last chunk should end at total duration

    # Keep the original chunk text and replace only its timestamps
    adjusted_chunks.append({
        'text': result['chunks'][i]['text'],
        'timestamp': (start_time, end_time)
    })

    # Update the start time for the next chunk
    start_time = end_time

# Print each adjusted chunk with timestamps
for chunk in adjusted_chunks:
    start, end = chunk['timestamp']
    print(f"Transcript: {chunk['text']}, Start: {start:.2f}, End: {end:.2f}")



Full Transcript:  voters and galvanize them and get win seats for the BJP. Doda, Kishtwar area is a very significant, important region for the BJP when it comes to, you know, increasing the number of seats in Jammu province. So, besides addressing this rally in Doda and Kishtwar and Ramban are part of it because It was essentially one district until 2008. And then they were hit two more districts were carved out of the Dora district. So it has eight assembly segments and BJP is fighting from all the eight, even as these are Muslim majority regions. But some of the during the delimitation, some of the constituencies have been carved out, which have become the Hindu majority seats. So BJP is hoping to win big from this election from Doda, Krishnaburam and District which has eight.
Transcript:  voters and galvanize them and get win seats for the BJP., Start: 0.00, End: 7.14
Transcript:  Doda, Kishtwar area is a very significant, important region for the BJP when it comes, Start: 7.14, End

In [15]:
# Input audio sample
sample = "https://storage.googleapis.com/imagestg-bucket/noise_free_audio/1010834_2024.09.14_08.03.27-2024.09.14_08.04.27.wav"

# Perform speech recognition with timestamps
result = pipe(sample, batch_size=8, return_timestamps=True)

# Print the full transcript
print("Full Transcript:", result["text"])

# Adjust timestamps
num_chunks = len(result['chunks'])
total_duration = 60  # Total duration in seconds
adjusted_chunks = []

# Set the start time for the first chunk
start_time = 0.0

# Calculate adjusted timestamps
for i in range(num_chunks):
    # End each chunk where the next chunk starts; the chunk's own end time is not used
    if i < num_chunks - 1:
        end_time = result["chunks"][i + 1]['timestamp'][0]  # Start time of the next chunk
    else:
        end_time = total_duration  # Last chunk should end at total duration

    # Keep the original chunk text and replace only its timestamps
    adjusted_chunks.append({
        'text': result['chunks'][i]['text'],
        'timestamp': (start_time, end_time)
    })

    # Update the start time for the next chunk
    start_time = end_time

# Print each adjusted chunk with timestamps
for chunk in adjusted_chunks:
    start, end = chunk['timestamp']
    print(f"Transcript: {chunk['text']}, Start: {start:.2f}, End: {end:.2f}")



Full Transcript:  popular with children and the owner of Minecraft, Microsoft told us that hate speech and terrorist content is strictly forbidden and they take action to remove such content. But on TikTok, there are posts that are just too graphic to show, specifically anti-Semitic. We've blurred this one here, which shows images from gas chambers set to the same type of audio. And there's much more of this graphic type of content. In fact, Sky News has seen 72,000 posts used in this way. Not only is that number big, but the level of engagement is high too. Between them, these posts have racked up 21 million likes, showing people are engaging with the videos. Well, how are they engaging? This is a good example in an image of a Nuremberg rally accompanied by a Hitler speech. It's been liked by more than 56,000 users. And in a comment that has been liked 1,695 times, one user states, modern society needs him. Another says, we miss you. It's difficult to know the motivations of the peopl

In [16]:
# Input audio sample
sample = "https://storage.googleapis.com/imagestg-bucket/noise_free_audio/1010259_2024.09.14_08.02.28-2024.09.14_08.03.28.wav"

# Perform speech recognition with timestamps
result = pipe(sample, batch_size=8, return_timestamps=True)

# Print the full transcript
print("Full Transcript:", result["text"])

# Adjust timestamps
num_chunks = len(result['chunks'])
total_duration = 60  # Total duration in seconds
adjusted_chunks = []

# Set the start time for the first chunk
start_time = 0.0

# Calculate adjusted timestamps
for i in range(num_chunks):
    # End each chunk where the next chunk starts; the chunk's own end time is not used
    if i < num_chunks - 1:
        end_time = result["chunks"][i + 1]['timestamp'][0]  # Start time of the next chunk
    else:
        end_time = total_duration  # Last chunk should end at total duration

    # Keep the original chunk text and replace only its timestamps
    adjusted_chunks.append({
        'text': result['chunks'][i]['text'],
        'timestamp': (start_time, end_time)
    })

    # Update the start time for the next chunk
    start_time = end_time

# Print each adjusted chunk with timestamps
for chunk in adjusted_chunks:
    start, end = chunk['timestamp']
    print(f"Transcript: {chunk['text']}, Start: {start:.2f}, End: {end:.2f}")



Full Transcript:  to be addressed by the Prime Minister in Duda near Kishtwar of Jammu and Kashmir ahead of the first phase of polling on the 18th of September. Prime Minister Modi had last campaigned for the BJP in the Chenab region during the 2014 Assembly elections when the party had won four of the six seats. Remember, the number of seats after delimitation has increased to eight. Ahead of the Prime Minister's visit, there was an encounter that broke out in Kishtwar as well. the Prime Minister will be visiting Duda after 45 years at least. My colleague Nazir joins us with more details. Nazir, if you could take us through the Prime Minister's visit and the big campaign for the BJP in Jammu and Kashmir. Well, it is Prime Minister Narendra Modi's first election rally in Jammu and Kashmir Assembly elections, four days ahead of the first phase of Assembly polling. So, very significant rally as far as the BJP's election campaign is concerned. He is seen as the biggest vote catcher and wh

In [17]:
# Input audio sample
sample = "https://storage.googleapis.com/imagestg-bucket/noise_free_audio/1010834_2024.09.14_08.02.27-2024.09.14_08.03.27.wav"

# Perform speech recognition with timestamps
result = pipe(sample, batch_size=8, return_timestamps=True)

# Print the full transcript
print("Full Transcript:", result["text"])

# Adjust timestamps
num_chunks = len(result['chunks'])
total_duration = 60  # Total duration in seconds
adjusted_chunks = []

# Set the start time for the first chunk
start_time = 0.0

# Calculate adjusted timestamps
for i in range(num_chunks):
    # End each chunk where the next chunk starts; the chunk's own end time is not used
    if i < num_chunks - 1:
        end_time = result["chunks"][i + 1]['timestamp'][0]  # Start time of the next chunk
    else:
        end_time = total_duration  # Last chunk should end at total duration

    # Keep the original chunk text and replace only its timestamps
    adjusted_chunks.append({
        'text': result['chunks'][i]['text'],
        'timestamp': (start_time, end_time)
    })

    # Update the start time for the next chunk
    start_time = end_time

# Print each adjusted chunk with timestamps
for chunk in adjusted_chunks:
    start, end = chunk['timestamp']
    print(f"Transcript: {chunk['text']}, Start: {start:.2f}, End: {end:.2f}")



Full Transcript:  Nazi speeches and marching music have been used as background sound on tens of thousands of TikTok videos as far-right groups try and spread their message To appeal to a wider audience Most of the speeches are set to a type of music popular on tik-tok called drift funk without the creators permission or knowledge And that could be all sorts cat videos gym post gaming or cars here Those are a few of the most popular categories we have seen used. It's a way to get content shared widely before offering the user more sinister stuff if they hit the sound button in the corner of a post, which shows them other videos using the same sound. For example, this is a more innocuous video of a cat that looks like Hitler. We'll put that back into the stack, and this is a huge stack here. Have a look at another type we've seen, gaming. This video was made using Minecraft, the German dictator recreated in the game. Now, this is popular with children and the owner of Minecraft, Microso

In [18]:
# Input audio sample
sample = "https://storage.googleapis.com/imagestg-bucket/noise_free_audio/1010259_2024.09.14_08.01.27-2024.09.14_08.02.27.wav"

# Perform speech recognition with timestamps
result = pipe(sample, batch_size=8, return_timestamps=True)

# Print the full transcript
print("Full Transcript:", result["text"])

# Adjust timestamps
num_chunks = len(result['chunks'])
total_duration = 60  # Total duration in seconds
adjusted_chunks = []

# Set the start time for the first chunk
start_time = 0.0

# Calculate adjusted timestamps
for i in range(num_chunks):
    # End each chunk where the next chunk starts; the chunk's own end time is not used
    if i < num_chunks - 1:
        end_time = result["chunks"][i + 1]['timestamp'][0]  # Start time of the next chunk
    else:
        end_time = total_duration  # Last chunk should end at total duration

    # Keep the original chunk text and replace only its timestamps
    adjusted_chunks.append({
        'text': result['chunks'][i]['text'],
        'timestamp': (start_time, end_time)
    })

    # Update the start time for the next chunk
    start_time = end_time

# Print each adjusted chunk with timestamps
for chunk in adjusted_chunks:
    start, end = chunk['timestamp']
    print(f"Transcript: {chunk['text']}, Start: {start:.2f}, End: {end:.2f}")

You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset


Full Transcript:  Probe continues into the Delhi gym owner's death. Six suspects detained, two country-made pistols recovered. Afghan origin gym owner was shot dead in GK. Adani listed among world's best companies of 2024. Prestigious global industry list features Adani's eight companies. Thank you.
Transcript:  Probe continues into the Delhi gym owner's death., Start: 0.00, End: 6.60
Transcript:  Six suspects detained, two country-made pistols recovered., Start: 6.60, End: 9.78
Transcript:  Afghan origin gym owner was shot dead in GK., Start: 9.78, End: 0.00
Transcript:  Adani listed among world's best companies of 2024., Start: 0.00, End: 16.32
Transcript:  Prestigious global industry list features Adani's eight companies., Start: 16.32, End: 0.00
Transcript:  Thank you., Start: 0.00, End: 60.00
time: 7.13 s (started: 2024-09-19 11:18:24 +00:00)
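The output above contains spans such as `Start: 9.78, End: 0.00`, which appear whenever the next chunk's reported start time is smaller than the current start. The sketch below is one possible way to keep the adjusted boundaries non-decreasing while reusing the same loop for every clip. `adjust_timestamps` is a hypothetical helper, and the sample `chunks` list is made-up data shaped like `result["chunks"]`, not real pipeline output. The helper only makes the boundaries monotonic; it cannot recover the true times.

```python
def adjust_timestamps(chunks, total_duration):
    """Return contiguous, non-decreasing (start, end) spans for each chunk.

    `chunks` is a list of dicts shaped like the pipeline's result["chunks"]:
    {"text": str, "timestamp": (start, end)}.
    """
    adjusted = []
    start_time = 0.0
    last = len(chunks) - 1
    for i, chunk in enumerate(chunks):
        own_end = chunk["timestamp"][1]
        next_start = chunks[i + 1]["timestamp"][0] if i < last else total_duration
        # Prefer the next chunk's start, then this chunk's own end, and fall
        # back to a zero-length span if neither moves forward in time.
        end_time = start_time
        for candidate in (next_start, own_end):
            if candidate is not None and start_time <= candidate <= total_duration:
                end_time = candidate
                break
        adjusted.append({"text": chunk["text"], "timestamp": (start_time, end_time)})
        start_time = end_time
    return adjusted


# Made-up chunks: the fourth start time jumps back to 0.0.
chunks = [
    {"text": " first", "timestamp": (0.0, 6.6)},
    {"text": " second", "timestamp": (6.6, 9.78)},
    {"text": " third", "timestamp": (9.78, 12.0)},
    {"text": " fourth", "timestamp": (0.0, 16.32)},
    {"text": " fifth", "timestamp": (16.32, None)},
]

adjusted = adjust_timestamps(chunks, total_duration=60)
for chunk in adjusted:
    start, end = chunk["timestamp"]
    print(f"Transcript: {chunk['text']}, Start: {start:.2f}, End: {end:.2f}")
```

With a real result, the call would be `adjust_timestamps(result["chunks"], total_duration=60)`, where the 60-second duration is the same hard-coded assumption used in the cells above.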
