# Task 1.3: Basic Whisper Model Test

This notebook loads a pre-trained OpenAI Whisper model with the Hugging Face `transformers` library and runs inference on a sample audio file to produce a transcription that can serve as subtitle text.

## 1. Setup and Imports

Import the required libraries. Make sure `transformers`, `torch`, and an audio-loading library such as `librosa` or `soundfile` are installed. Alternatively, the Hugging Face `datasets` library can load and resample audio.

In [1]:
import torch
from transformers import WhisperProcessor, WhisperForConditionalGeneration
import librosa  # Loads audio as mono at a chosen sample rate; soundfile or datasets.Audio also work
import IPython.display as ipd # For playing audio in notebook (optional)
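
Before running the imports, it can help to confirm that the packages named above are actually installed. The following is a minimal sketch using only the standard library; `REQUIRED` and `missing_packages` are hypothetical names introduced here for illustration, not part of any of the libraries involved.

```python
from importlib.util import find_spec

# Import names of the packages this notebook relies on (assumed list)
REQUIRED = ["torch", "transformers", "librosa"]


def missing_packages(names):
    """Return the import names that cannot be found in the current environment."""
    return [name for name in names if find_spec(name) is None]


missing = missing_packages(REQUIRED)
if missing:
    print(f"Missing packages: {', '.join(missing)}")
else:
    print("All required packages are available.")
```

Any name reported as missing would need to be installed before the import cell above can run.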

## 2. Configuration

Define the model to use and a placeholder for the audio file path.

In [2]:
# Define the model ID from Hugging Face Model Hub
MODEL_ID = "openai/whisper-tiny"  # tiny keeps the test quick; larger checkpoints such as openai/whisper-base or openai/whisper-small also work

# Placeholder for the sample audio file path
# TODO: Replace this with an actual path to a .wav, .flac, or .mp3 file
SAMPLE_AUDIO_PATH = "../samples/sample_audio.mp3" 

# Device configuration (use GPU if available)
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {DEVICE}")

Using device: cpu


## 3. Load Model and Processor

In [3]:
# Initialize to None so later cells can check whether loading succeeded
processor = None
model = None
try:
    processor = WhisperProcessor.from_pretrained(MODEL_ID)
    model = WhisperForConditionalGeneration.from_pretrained(MODEL_ID).to(DEVICE)
    print(f"Successfully loaded model and processor for {MODEL_ID}")
except Exception as e:
    print(f"Error loading model or processor: {e}")

Successfully loaded model and processor for openai/whisper-tiny


## 4. Load and Preprocess Audio

Load the audio file and preprocess it with `WhisperProcessor`. Whisper expects single-channel (mono) audio sampled at 16 kHz; the processor then converts the waveform into the log-Mel spectrogram features the model consumes.

In [None]:
def load_and_preprocess_audio(audio_path: str, target_sr: int = 16000):
    """Loads an audio file, converts to mono, and resamples to target_sr."""
    try:
        # librosa returns float32 samples, downmixed to mono and resampled to target_sr
        speech_array, sampling_rate = librosa.load(audio_path, sr=target_sr, mono=True)
        # sampling_rate equals target_sr here; this call does not report the file's native rate
        print(f"Audio loaded at {sampling_rate} Hz, duration: {len(speech_array) / sampling_rate:.2f}s")
        # Display audio player in notebook (optional)
        ipd.display(ipd.Audio(data=speech_array, rate=target_sr))
        return speech_array, target_sr
    except FileNotFoundError:
        print(f"Error: Audio file not found at {audio_path}. Please update SAMPLE_AUDIO_PATH.")
        return None, None
    except Exception as e:
        print(f"Error loading or processing audio: {e}")
        return None, None

# Load and preprocess the sample audio
# This prints a file-not-found message until SAMPLE_AUDIO_PATH points to a real file.
speech_array, sampling_rate = load_and_preprocess_audio(SAMPLE_AUDIO_PATH)

# Process the audio array to get input features
input_features = None
if speech_array is not None and processor is not None:
    try:
        input_features = processor(speech_array, sampling_rate=sampling_rate, return_tensors="pt").input_features.to(DEVICE)
        print("Audio preprocessed into input features.")
    except Exception as e:
        print(f"Error during feature extraction: {e}")
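
Whisper's feature extractor works on 30-second windows, so a single processor call pads shorter clips and truncates longer ones. For longer recordings, one option is to split the waveform into 30-second chunks and process each in turn. The sketch below only computes the chunk boundaries; `chunk_bounds` is a hypothetical helper, not part of `transformers`, and it assumes simple non-overlapping chunks, which can cut a word at a boundary.

```python
def chunk_bounds(num_samples: int, sr: int = 16000, chunk_s: float = 30.0):
    """Return (start, end) sample indices of consecutive, non-overlapping chunks."""
    if num_samples < 0 or sr <= 0 or chunk_s <= 0:
        raise ValueError("num_samples must be >= 0; sr and chunk_s must be > 0")
    chunk_len = int(sr * chunk_s)  # samples per chunk
    return [
        (start, min(start + chunk_len, num_samples))
        for start in range(0, num_samples, chunk_len)
    ]


# Example: a 70-second clip at 16 kHz splits into two full chunks and one short one
bounds = chunk_bounds(70 * 16000)
print(bounds)
```

Each slice `speech_array[start:end]` could then be passed to the processor and model in the same way as the full array above.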

## 5. Perform Inference (Generate Subtitles)

In [None]:
transcription = ""
if input_features is not None and model is not None and processor is not None:
    try:
        print("Generating transcription...")
        # Generate token IDs. A single call covers at most the 30-second window the processor produced.
        predicted_ids = model.generate(input_features)

        # Decode token IDs to text; batch_decode returns a list with one string per input
        transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)

        print("\nTranscription:")
        for i, t in enumerate(transcription):
            print(f"Input {i+1}: {t.strip()}")
            
    except Exception as e:
        print(f"Error during transcription generation: {e}")
else:
    print("Skipping transcription generation due to previous errors or missing components.")
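
The cell above yields plain text without timing information, whereas a subtitle file also needs a start and end time for each line. As a minimal sketch, assuming timed segments are already available as `(start_seconds, end_seconds, text)` tuples (for example, one per processed chunk), the following formats them as SubRip (SRT) text. The helper names and the sample segments are hypothetical and do not come from the notebook's output.

```python
def srt_timestamp(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp, HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(segments) -> str:
    """Build SRT text from an iterable of (start_s, end_s, text) tuples."""
    blocks = []
    for index, (start, end, text) in enumerate(segments, start=1):
        blocks.append(
            f"{index}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text.strip()}\n"
        )
    # A blank line separates consecutive subtitle entries
    return "\n".join(blocks)


# Hypothetical segments, for illustration only
example_segments = [(0.0, 2.5, " Hello there."), (2.5, 5.0, " This is a test.")]
print(to_srt(example_segments))
```

The resulting string can be written to a `.srt` file with a regular `open(..., "w", encoding="utf-8")` call.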

## 6. Document Results and Observations

*(Manually fill this section after running the notebook)*

*   **Model Used:** (e.g., `openai/whisper-tiny`)
*   **Sample Audio File:** (Brief description or name if you replace the placeholder)
*   **Generated Transcription:** (Paste the output here)
*   **Observations:** (Any issues encountered, quality of transcription, time taken, etc.)
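
To make the "time taken" and "quality of transcription" observations a little more concrete, the sketch below shows one possible way to record wall-clock time and to compute a word error rate (WER) against a manually written reference transcript. Both helpers are hypothetical additions, not part of the notebook; the WER here lowercases and splits on whitespace only, so punctuation differences count as errors.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label: str, results: dict):
    """Store the wall-clock duration (seconds) of the enclosed block in results[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1] / len(ref)


timings = {}
with timed("example", timings):
    wer = word_error_rate("the cat sat", "the cat sat down")
print(f"WER: {wer:.2f}, elapsed: {timings['example']:.6f}s")
```

In the notebook, the `model.generate(...)` call could be wrapped in `with timed("generate", timings):` and the decoded text compared against a reference transcript typed by hand.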