# Note 0:

1. Whisper is inherently designed to work with 30-second samples.
2. Anything shorter than 30 seconds is padded to 30 seconds with silence.
3. Anything longer than 30 seconds is truncated to 30 seconds by cutting off the extra audio.
4. If long audio is passed to the model directly, only the first 30 seconds are transcribed.
5. Memory in a transformer network scales with the square of the sequence length - doubling the input length quadruples the memory requirement.
6. Passing very long audio files is therefore likely to cause an out-of-memory (OOM) error.
7. Long-form transcription in 🤗 Transformers works by chunking the input audio into smaller, more manageable segments - each segment has a small amount of overlap with the previous one.
8. Overlap allows for accurately stitching the segments back together at the boundaries by finding the overlap between segments and merging the transcriptions accordingly.
9. The exact overlap can be found algorithmically by analyzing the audio signal in the overlapping regions and searching for the point where the segments best align, often by maximizing a correlation measure. Alternatively, the overlap can be fixed in advance based on the application's requirements.
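
The padding, truncation, and chunking described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the code 🤗 Transformers runs internally: the helper names `pad_or_trim` and `chunk_with_overlap` are hypothetical, and the 5-second overlap is an assumed value chosen only for the example.

```python
import numpy as np

SAMPLING_RATE = 16_000  # Whisper's feature extractor works with 16 kHz audio
WINDOW_S = 30           # fixed 30-second input window from the notes above


def pad_or_trim(audio: np.ndarray, length: int = WINDOW_S * SAMPLING_RATE) -> np.ndarray:
    """Zero-pad short audio, or cut long audio, to exactly `length` samples."""
    if len(audio) >= length:
        return audio[:length]
    return np.pad(audio, (0, length - len(audio)))


def chunk_with_overlap(audio, chunk_s=30.0, overlap_s=5.0, sampling_rate=SAMPLING_RATE):
    """Yield (start_sample, chunk) pairs; neighbouring chunks share `overlap_s` seconds."""
    chunk_len = int(chunk_s * sampling_rate)
    step = chunk_len - int(overlap_s * sampling_rate)
    if step <= 0:
        raise ValueError("overlap_s must be smaller than chunk_s")
    for start in range(0, len(audio), step):
        yield start, audio[start:start + chunk_len]
        if start + chunk_len >= len(audio):
            break  # this chunk already reaches the end of the audio


audio = np.zeros(70 * SAMPLING_RATE, dtype=np.float32)  # 70 s of silence as a stand-in
chunks = list(chunk_with_overlap(audio))
starts_s = [start / SAMPLING_RATE for start, _ in chunks]
print(starts_s)                   # chunk start times in seconds
print(pad_or_trim(audio).shape)   # long audio is cut to one 30 s window
```

With these assumed settings each chunk starts 25 seconds after the previous one, so the last 5 seconds of a chunk are repeated at the start of the next.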

In [3]:
!pip install transformers


In [4]:
import torch
from transformers import pipeline

device = "cuda:0" if torch.cuda.is_available() else "cpu"
pipe = pipeline(
    "automatic-speech-recognition", model="openai/whisper-base", device=device
)


In [8]:
!pip install datasets


In [9]:
from datasets import load_dataset

In [10]:
dataset = load_dataset(
    "facebook/multilingual_librispeech", "spanish", split="validation", streaming=True
)
sample = next(iter(dataset))


In [11]:
import numpy as np

target_length_in_m = 5

# convert from minutes to seconds (* 60) to num samples (* sampling rate)
sampling_rate = pipe.feature_extractor.sampling_rate
target_length_in_samples = target_length_in_m * 60 * sampling_rate

# iterate over our streaming dataset, collecting samples until we hit our target
long_audio = []
num_samples = 0
for sample in dataset:
    long_audio.append(sample["audio"]["array"])
    num_samples += len(long_audio[-1])
    if num_samples > target_length_in_samples:
        break

# join the collected arrays into one long waveform
long_audio = np.concatenate(long_audio)

# how did we do?
seconds = len(long_audio) / sampling_rate
minutes, seconds = divmod(seconds, 60)
print(f"Length of audio sample is {minutes} minutes {seconds:.2f} seconds")

Length of audio sample is 5.0 minutes 17.22 seconds


# Note 1:

# Overlap in Audio Segmentation
1. Overlap doesn't generally cause an error but helps in aligning segments accurately.
2. It's used to prevent discontinuities between segments and maintain signal continuity.

# Dealing with Overlap
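1. When segments of an audio signal are recombined, overlap is handled with windowing functions or similar techniques that blend the segments smoothly. When transcriptions are recombined, the text from the overlapping region is matched and merged.
2. In both cases the goal is to avoid abrupt changes or loss of information at the boundaries.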
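
For transcription, combining segments happens on the text rather than on the waveform. The sketch below merges two overlapping transcripts by looking for the longest run of words that ends the first and starts the second. It is a simplified illustration with exact word matching, not the merging routine used inside 🤗 Transformers; the function name `merge_transcripts` is hypothetical.

```python
def merge_transcripts(left: str, right: str) -> str:
    """Join two transcripts, dropping the words duplicated in their overlap."""
    left_words, right_words = left.split(), right.split()
    max_k = min(len(left_words), len(right_words))
    # try the longest candidate overlap first, then shrink
    for k in range(max_k, 0, -1):
        if left_words[-k:] == right_words[:k]:
            return " ".join(left_words + right_words[k:])
    # no shared words found: fall back to plain concatenation
    return " ".join(left_words + right_words)


merged = merge_transcripts(
    "the quick brown fox jumps over",
    "fox jumps over the lazy dog",
)
print(merged)  # the quick brown fox jumps over the lazy dog
```

Exact matching is fragile in practice, because the model may transcribe the same overlapping audio slightly differently in each chunk. A more robust version would tolerate a few mismatches.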

# Without Overlap
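1. Lack of overlap can cause discontinuities at the boundaries: audible "clicking" sounds when audio is reconstructed, or words cut in half when audio is transcribed.
2. This may lead to loss of information or inconsistencies in the final output.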

# Disadvantages of Overlap
1. Increased Computational Complexity: Requires more processing for alignment.
2. Potential Redundancy: Leads to redundant processing of the same data.
3. Tuning Required: The degree of overlap needs optimization for specific applications.
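
The redundancy cost can be estimated with simple arithmetic. The sketch below assumes fixed-length chunks where each chunk repeats the last `overlap_s` seconds of the previous one. The function name and the 5-second overlap are assumptions for illustration, not library defaults.

```python
import math


def overlap_overhead(total_s: float, chunk_s: float = 30.0, overlap_s: float = 5.0):
    """Return (number of chunks, seconds of audio processed more than once)."""
    if not 0 <= overlap_s < chunk_s:
        raise ValueError("overlap_s must satisfy 0 <= overlap_s < chunk_s")
    if total_s <= chunk_s:
        return 1, 0.0
    step = chunk_s - overlap_s  # how far each chunk advances
    n_chunks = math.ceil((total_s - chunk_s) / step) + 1
    redundant_s = (n_chunks - 1) * overlap_s  # one shared region per boundary
    return n_chunks, redundant_s


# 5 minutes 17.22 seconds, the length of the audio assembled earlier
n_chunks, redundant_s = overlap_overhead(317.22)
print(n_chunks, redundant_s, f"{redundant_s / 317.22:.1%}")
```

Under these assumptions the 317-second clip needs 13 chunks, and 60 seconds of audio (about 19% of the clip) are processed twice.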

# Advantages of Overlap
1. Improved Continuity: Ensures smooth alignment between segments.
2. Reduced Information Loss: Prevents losing information at segment edges.
3. Enhanced Accuracy: Increases accuracy in applications like speech recognition by considering context across boundaries.

# Note 2:


# Chunking Advantage
1. **Stateless Algorithm:** Chunking the audio makes the algorithm stateless, meaning it doesn't need the result of one chunk to transcribe the next.
2. **Parallel Processing:** This approach allows for parallel processing of chunks, significantly speeding up the transcription.
3. **Order Independence:** Chunks can be transcribed in any order, providing flexibility in processing.
4. **Batching Capability:** Chunks can be batched and run through the model simultaneously, further enhancing computational efficiency.
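
The order independence and batching described above can be illustrated without a real model. In this sketch `fake_transcribe` is a hypothetical stand-in for a model call, and chunks are represented only by their position. The point is that batches may be processed in any order as long as the results are put back in sequence before stitching.

```python
import random


def fake_transcribe(batch):
    """Hypothetical stand-in for a model call: one string per chunk in the batch."""
    return [f"<text of chunk {chunk_id}>" for chunk_id in batch]


def batched(items, batch_size):
    """Split a list into consecutive batches of at most `batch_size` items."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


chunk_ids = list(range(11))              # 11 chunks, identified by position
batches = batched(chunk_ids, batch_size=8)
random.shuffle(batches)                  # process the batches in arbitrary order

results = {}
for batch in batches:
    # each batch is independent: no result from another chunk is needed
    for chunk_id, text in zip(batch, fake_transcribe(batch)):
        results[chunk_id] = text

# restore the original order before stitching the texts together
ordered = [results[i] for i in range(len(chunk_ids))]
print(ordered[:3])
```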

# Activation of Long-Form Transcriptions
1. **Chunk Length Control:** An additional argument, chunk_length_s, controls the length of the chunked segments. For Whisper, 30-second chunks are optimal because they match the 30-second input the model is designed for.
2. **Batching Activation:** By passing the batch_size argument, batching can be activated for even more efficient processing.
3. **Stitching at Boundaries:** Stitching is done at the chunk boundaries after all chunks have been transcribed.
4. **Integration with 🤗 Transformers:** Chunking, batching, and stitching are all handled by the 🤗 Transformers pipeline call itself, so no extra code is needed.

In [12]:
pipe(
    long_audio,
    max_new_tokens=256,
    generate_kwargs={"task": "transcribe"},
    chunk_length_s=30,
    batch_size=8,
)

{'text': ' Entonces te deleitarás en Jehová y yo te haré subir sobre las alturas de la tierra y te daré a comer la heredad de apartó por su camino, mas Jehová car tu retaguaria. ¿Quiénes son estos que vuelan como nubes y como palomas a sus ventanas? Ciertamente a mí esperaran las islas y las naves de tarsis desde el principio para traer tus hijos de lejos y su plata y su oro con ellos al nombre de Jovato Dios y al Santo Israel que te ha glorificado. y por jefe y por maestro a las naciones. E hiciste con ellos alianza, amaste su cama donde quiera que la veías. Y fuiste al rey con un huento y multiplicaste tus perfumes, y enviaste tus embajadores lejos y te batiste hasta el profundo. Y del todo serán asoladas, la gloria deliva no vendrá ti, hallas, pinos y bojes juntamente para decorar el lugar de mi santuario, y yo honraré el lugar de mis pies, y vendrán a ti y y en el madera metal y en lugar de piedras hierro y pondré paz por tu tributo y justicia por tus exactores. Múltitura de camell

# Timestamp Prediction with Whisper
1. **Functionality:** Whisper can predict segment-level timestamps, indicating the start and end time for short audio passages.
2. **Usefulness:** These timestamps are valuable for aligning transcriptions with the corresponding audio or visual segments, like providing closed captions for videos.
3. **Activation:** The prediction of timestamps can be activated by setting the argument return_timestamps=True.
4. **Compatibility:** Timestamp prediction is compatible with both chunking and batching methods, seamlessly integrating with previously described transcription approaches.
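
As an example of the closed-captioning use case, the sketch below converts a list of chunks in the `{'timestamp': (start, end), 'text': ...}` shape returned with return_timestamps=True into SubRip (SRT) caption text. The helper names are hypothetical, and the handling of a missing end time is an assumption made for this illustration.

```python
def to_srt_time(seconds: float) -> str:
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def chunks_to_srt(chunks) -> str:
    """Turn [{'timestamp': (start, end), 'text': ...}, ...] into SRT text."""
    entries = []
    for index, chunk in enumerate(chunks, start=1):
        start, end = chunk["timestamp"]
        if end is None:  # assumption: treat a missing end time as zero-length
            end = start
        entries.append(
            f"{index}\n{to_srt_time(start)} --> {to_srt_time(end)}\n"
            f"{chunk['text'].strip()}\n"
        )
    return "\n".join(entries)


example = [
    {"timestamp": (0.0, 26.4), "text": " first passage"},
    {"timestamp": (26.4, 32.48), "text": " second passage"},
]
print(chunks_to_srt(example))
```

The same function can be applied to the "chunks" list produced by the pipeline call, and the result written to a `.srt` file.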

In [13]:
pipe(
    long_audio,
    max_new_tokens=256,
    generate_kwargs={"task": "transcribe"},
    chunk_length_s=30,
    batch_size=8,
    return_timestamps=True,
)["chunks"]

[{'timestamp': (0.0, 26.4),
  'text': ' Entonces te deleitarás en Jehová, y yo te haré subir sobre las alturas de la tierra, y te daré a comer la heredad de Jacob tu padre, porque la boca de Jehová lo ha hablado. nosotros curados. Todos nosotros nos descarriamos como bejas, cada cual se apartó por su camino,'},
 {'timestamp': (26.4, 32.48),
  'text': ' mas Jehová cargó en él el pecado de todos nosotros. No es que partas tu pan con el'},
 {'timestamp': (32.48, 38.4),
  'text': ' hambriento y a los hombres herrantes metas en casa, que cuando vieres al desnudo lo cubras y no'},
 {'timestamp': (38.4, 49.3),
  'text': ' tescondas de tu carne, entonces nacerá tu luz como el alba y tu salud se dejará ver presto, E irá a tu justicia delante de ti y la gloria de Jehová será tu retaguaria.'},
 {'timestamp': (49.3, 54.1),
  'text': ' ¿Quiénes son estos que vuelan como nubes y como palomas a sus ventanas?'},
 {'timestamp': (54.1, 59.6),
  'text': ' Ciertamente a mí esperaran las islas y las naves 