In [6]:
import torch
import librosa
from transformers import pipeline
import time

### Testing the model without specifying the audio language
As the audio example, we use a 50-second Hebrew dialogue with medium sound quality.

In [3]:
# Track the total time for each part
start_time = time.time()

# Set device to CUDA if available
device = "cuda:0" if torch.cuda.is_available() else "cpu"

# Measure pipeline initialization time
init_pipe_start = time.time()
# Initialize the pipeline for automatic speech recognition
pipe = pipeline(
  "automatic-speech-recognition",
  model="ivrit-ai/whisper-large-v2-tuned",
  chunk_length_s=30,
  device=device,
)
init_pipe_end = time.time()

# Load audio file
load_audio_start = time.time()
audio_file_path = "POC_Examples/audio_sample_1.mp3"
audio_array, sampling_rate = librosa.load(audio_file_path, sr=16000)
load_audio_end = time.time()

# Create a dictionary similar to what the pipeline expects
create_sample_start = time.time()
sample = {
    "array": audio_array,
    "sampling_rate": sampling_rate
}
create_sample_end = time.time()

# Get text prediction with timestamps from the audio file
prediction_start = time.time()
prediction = pipe(sample, batch_size=8, return_timestamps=True)["chunks"]
prediction_end = time.time()

# Print each chunk and the time range
print_chunks_start = time.time()
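for chunk in prediction:
    start, end = chunk['timestamp']  # timestamp is a tuple (start, end)
    text = chunk['text']
    # The end timestamp may be None if the audio is cut off mid-segment
    end_str = f"{end:.2f}" if end is not None else "n/a"
    print(f"Text: {text}, Start: {start:.2f}, End: {end_str}")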
print_chunks_end = time.time()

# Calculate total times
total_time = time.time() - start_time
pipe_time = init_pipe_end - init_pipe_start
load_audio_time = load_audio_end - load_audio_start
create_sample_time = create_sample_end - create_sample_start
prediction_time = prediction_end - prediction_start
print_chunks_time = print_chunks_end - print_chunks_start

# Print timing results
print(f"Total time: {total_time:.2f} seconds")
print(f"Pipeline initialization time: {pipe_time:.2f} seconds")
print(f"Audio loading time: {load_audio_time:.2f} seconds")
print(f"Sample creation time: {create_sample_time:.2f} seconds")
print(f"Prediction time: {prediction_time:.2f} seconds")
print(f"Chunk printing time: {print_chunks_time:.2f} seconds")


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Due to a bug fix in https://github.com/huggingface/transformers/pull/28687 transcription using a multilingual Whisper will default to language detection followed by transcription instead of translation to English.This might be a breaking change for your use case. If you want to instead always translate your audio to English, make sure to pass `language='en'`.
Passing a tuple of `past_key_values` is deprecated and will be removed in Transformers v4.43.0. You should pass an instance of `EncoderDecoderCache` instead, e.g. `past_key_values=EncoderDecoderCache.from_legacy_cache(past_key_values)`.


Text: רובן, תחרות אכילת לפות, אתה נגד ערן לוי, מי לוקח?, Start: 0.00, End: 6.00
Text:  אה, חוזרים שתרענו אותי על כדורגל, אתם מדברים איתי על לפות עכשיו. דווקא במקרה הזה אני לא אוהב לפות, אבל אם מצא, אז כבר שיהנו. אבל אני לא, עדיף שאני לא נהיה לשאלה הזאת, אמרתי לכם, דברו איתי על כדורגל, לא מעבר לזה., Start: 6.00, End: 18.00
Text:  אוקיי, שאלה לגבי כדורגל, למה לכדורגלנים אומרים שאין ידע כללי?, Start: 18.00, End: 23.00
Text:  לא יודע, אולי זה סטיגמה, שחושבים שהם אולי טיפשים או משהו כזה? אמרתי לך, זה... Why do people say that there is no general knowledge about football? I don't know, maybe it's a stigma that people think that they are maybe stupid or something like that?, Start: 23.00, End: 28.00
Text:  I told you, it's..., Start: 28.00, End: 29.00
Text:  How's your general knowledge?, Start: 29.00, End: 30.00
Text:  It's okay., Start: 30.00, End: 31.00
Text:  Let's say, what was the profession of Yohanan Asandlar?, Start: 31.00, End: 34.00
Text:  No, I was talking about something in sport

## Model Performance Summary without Language Information

- **Processing Time:**  
  Processing 50 seconds of Hebrew audio took **212 seconds**, roughly four times the length of the audio, which is slow. However, this figure depends heavily on the device used.
  
- **Unexpected Behavior:**  
  From about the 23-second mark, the output switches from Hebrew transcription to **English translation**. This is not what we need, since our task requires only the Hebrew transcription.
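
One way to spot this drift automatically is to flag chunks whose text is mostly Latin letters. The sketch below is only an illustration: `latin_ratio`, `flag_english_chunks`, the 0.2 threshold, and the sample chunks are hypothetical names and values, not part of the notebook or of `transformers`. It assumes the chunks have the same `{"timestamp", "text"}` shape that the pipeline printed above.

```python
import re

# Basic Latin letters; Hebrew letters, digits, and punctuation will not match.
LATIN_RE = re.compile(r"[A-Za-z]")

def latin_ratio(text: str) -> float:
    """Return the share of alphabetic characters that are Latin letters."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    latin = sum(1 for ch in letters if LATIN_RE.match(ch))
    return latin / len(letters)

def flag_english_chunks(chunks, threshold=0.2):
    """Return the chunks whose text is at least `threshold` Latin letters."""
    return [c for c in chunks if latin_ratio(c["text"]) >= threshold]

# Hypothetical sample shaped like the pipeline's "chunks" output
sample_chunks = [
    {"timestamp": (18.0, 23.0), "text": " אוקיי, שאלה לגבי כדורגל"},
    {"timestamp": (29.0, 30.0), "text": " How's your general knowledge?"},
]

for chunk in flag_english_chunks(sample_chunks):
    print(chunk["timestamp"], chunk["text"])
```

The threshold is a judgment call: a low value also catches mixed chunks like the one at 23 seconds, where Hebrew and English appear together.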
  



In [4]:
# Track the total time for each part
start_time = time.time()

# Set device to CUDA if available
device = "cuda:0" if torch.cuda.is_available() else "cpu"

# Measure pipeline initialization time
init_pipe_start = time.time()
# Initialize the pipeline for automatic speech recognition
pipe = pipeline(
  "automatic-speech-recognition",
  model="ivrit-ai/whisper-large-v2-tuned",
  chunk_length_s=30,
  device=device,
)
init_pipe_end = time.time()

# Load audio file
load_audio_start = time.time()
audio_file_path = "POC_Examples/audio_sample_1.mp3"
audio_array, sampling_rate = librosa.load(audio_file_path, sr=16000)
load_audio_end = time.time()

# Create a dictionary similar to what the pipeline expects
create_sample_start = time.time()
sample = {
    "array": audio_array,
    "sampling_rate": sampling_rate
}
create_sample_end = time.time()

# Get text prediction with timestamps from the audio file
prediction_start = time.time()
prediction = pipe(sample, batch_size=8, return_timestamps=True, generate_kwargs={"language": "he"})["chunks"]

prediction_end = time.time()

# Print each chunk and the time range
print_chunks_start = time.time()
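for chunk in prediction:
    start, end = chunk['timestamp']  # timestamp is a tuple (start, end)
    text = chunk['text']
    # The end timestamp may be None if the audio is cut off mid-segment
    end_str = f"{end:.2f}" if end is not None else "n/a"
    print(f"Text: {text}, Start: {start:.2f}, End: {end_str}")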
print_chunks_end = time.time()

# Calculate total times
total_time = time.time() - start_time
pipe_time = init_pipe_end - init_pipe_start
load_audio_time = load_audio_end - load_audio_start
create_sample_time = create_sample_end - create_sample_start
prediction_time = prediction_end - prediction_start
print_chunks_time = print_chunks_end - print_chunks_start

# Print timing results
print(f"Total time: {total_time:.2f} seconds")
print(f"Pipeline initialization time: {pipe_time:.2f} seconds")
print(f"Audio loading time: {load_audio_time:.2f} seconds")
print(f"Sample creation time: {create_sample_time:.2f} seconds")
print(f"Prediction time: {prediction_time:.2f} seconds")
print(f"Chunk printing time: {print_chunks_time:.2f} seconds")


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Text: רובן, תחרות אכילת לפות, אתה נגד ערן לוי, מי לוקח?, Start: 0.00, End: 6.00
Text:  אה, חוזרים שתרענו אותי על כדורגל, אתם מדברים איתי על לפות עכשיו. דווקא במקרה הזה אני לא אוהב לפות, אבל אם מצא, אז כבר שיהנו. אבל אני לא, עדיף שאני לא נהיה לשאלה הזאת, אמרתי לכם, דברו איתי על כדורגל, לא מעבר לזה., Start: 6.00, End: 18.00
Text:  אוקיי, שאלה לגבי כדורגל, למה לכדורגלנים אומרים שאין ידע כללי?, Start: 18.00, End: 23.00
Text:  לא יודע, אולי זה סטיגמה, שחושבים שהם אולי טיפשים או משהו כזה?, Start: 23.00, End: 27.88
Text:  אמרתי לך, זה..., Start: 27.88, End: 29.04
Text:  איך הידע כללי שלך?, Start: 29.04, End: 29.96
Text:  בסדר גמור., Start: 30.28, End: 30.96
Text:  נגיד, מה היה מקצוע של יוחנן הסנדלר?, Start: 31.08, End: 33.48
Text:  לא... דבר איתי משהו בספורט., Start: 33.72, End: 36.28
Text:  תודה רובן, אחלה רעיון., Start: 37.48, End: 38.92
Text:  כן, רעיון קצר..., Start: 38.92, End: 40.40
Text:  נתראה בחתול., Start: 40.40, End: 41.24
Text:  כן, איזה חתול., Start: 41.24, End: 42.60
Text:  אה, 

## Model Performance Summary with Known Language

With the language specified, processing the 50 seconds of audio still takes 150 seconds. That is faster than the 212 seconds measured without language information, and the output stays in Hebrew, but it is still slow.
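
The cells above collect each stage's timing with paired `*_start` / `*_end` variables. A small context manager could gather the same numbers with less repetition. This is a minimal sketch, not the notebook's method: `timed` and `timings` are hypothetical names, and the `time.sleep` call is a stand-in for the real `pipe(...)` call.

```python
import time
from contextlib import contextmanager

timings = {}  # stage name -> elapsed seconds

@contextmanager
def timed(stage):
    """Record the wall-clock duration of the enclosed block under `stage`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[stage] = time.perf_counter() - start

# Stand-in workload; in the notebook this would be the pipe(...) call
with timed("prediction"):
    time.sleep(0.05)

for stage, seconds in timings.items():
    print(f"{stage}: {seconds:.2f} seconds")
```

Wrapping pipeline initialization, audio loading, and prediction in separate `with timed(...)` blocks would reproduce the per-stage breakdown printed by the cells above.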



### Summary

- **Model Performance:**
    The model takes a long time to process the audio file: **212 seconds** for 50 seconds of audio without language information and **150 seconds** with it. That is roughly three to four times the duration of the audio, so the model may not be suitable for real-time applications.

- **Model Size and Resource Usage:**  
  The model itself is **5 GB** in size, and the additional libraries also take up considerable space. The model is relatively accurate, but it runs slowly and requires significant computational resources.
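
To make the comparison with real time concrete, the two measurements can be expressed as a real-time factor (processing time divided by audio duration). The helper name `real_time_factor` is introduced here for illustration only; the numbers are the ones reported in this notebook.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Processing time divided by audio duration; below 1.0 is faster than real time."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds

AUDIO_SECONDS = 50  # duration of the sample, as stated in this notebook
runs = {
    "without language": 212,      # seconds, from this notebook
    "with language='he'": 150,    # seconds, from this notebook
}

for label, seconds in runs.items():
    print(f"{label}: RTF = {real_time_factor(seconds, AUDIO_SECONDS):.2f}")
# without language: RTF = 4.24
# with language='he': RTF = 3.00
```

A real-time application needs a factor below 1.0 on the target hardware, so both runs are well outside that range.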
  
---

Although the model transcribes correctly when the language is specified, its slow speed and large size may pose challenges in some scenarios.