# 🎧 VisionAI Audio Quickstart
This notebook demonstrates how to transcribe audio with Faster-Whisper on Google Colab, using the GPU when one is available and falling back to the CPU otherwise.

# Step 1: Install dependencies

In [1]:
!pip install faster-whisper pydub torch torchvision torchaudio --quiet

# Step 2: Imports

In [17]:
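import os
import warnings

import torch

# Register the filters before the imports and calls that trigger the warnings.
# SyntaxWarnings may surface while pydub is being imported.
warnings.filterwarnings("ignore", category=SyntaxWarning)
# Colab notice shown when no Hugging Face token is configured.
warnings.filterwarnings("ignore", message="The secret `HF_TOKEN` does not exist")

from pydub import AudioSegment
from faster_whisper import WhisperModel
from google.colab import files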


# Step 3: Auto detect GPU/CPU

In [12]:
# GPU status check
if not torch.cuda.is_available():
    print("⚠️ GPU is NOT enabled! Go to: Runtime → Change runtime type → Hardware accelerator → GPU → Save, then rerun the notebook from Step 1 (changing the runtime type starts a fresh session).")
else:
    print("✅ GPU is active:", torch.cuda.get_device_name(0))

device = "cuda" if torch.cuda.is_available() else "cpu"
model_size = "large-v3"
compute_type = "float16" if device == "cuda" else "int8"
print(f"🔥 Using {device.upper()} with model '{model_size}' and precision {compute_type}")

✅ GPU is active: Tesla T4
🔥 Using CUDA with model 'large-v3' and precision float16
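
The device and precision choice above can be wrapped in a small function, which makes it easy to try a lower-memory GPU setting if `large-v3` in `float16` does not fit. This is a minimal sketch: `choose_runtime` and its `low_memory` flag are hypothetical names introduced here, while `int8_float16` is one of the compute types that faster-whisper accepts on GPU.

```python
def choose_runtime(cuda_available: bool, low_memory: bool = False):
    """Return (device, compute_type) to pass to WhisperModel.

    `low_memory` is a hypothetical flag for this sketch, not a library option.
    """
    if not cuda_available:
        # CPU path: int8 quantization, as in the cell above.
        return "cpu", "int8"
    # GPU path: float16 by default, int8_float16 to reduce GPU memory use.
    return "cuda", ("int8_float16" if low_memory else "float16")


# In the notebook: device, compute_type = choose_runtime(torch.cuda.is_available())
print(choose_runtime(True))
print(choose_runtime(True, low_memory=True))
print(choose_runtime(False))
```

Quantized compute types trade a little accuracy for memory, so it is worth comparing transcripts before switching permanently.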


# Step 4: Load model

In [18]:
fw_model = WhisperModel(model_size, device=device, compute_type=compute_type)
print("✅ Model loaded successfully")

✅ Model loaded successfully


# Step 5: Upload an audio file

In [9]:
uploaded = files.upload()
audio_path = list(uploaded.keys())[0]
print(f"📁 Uploaded: {audio_path}")

Saving Jos001-[AudioTrimmer.com].mp3 to Jos001-[AudioTrimmer.com].mp3
📁 Uploaded: Jos001-[AudioTrimmer.com].mp3
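
`pydub` is installed and imported above but is not used in the later cells. One plausible use, sketched here as an assumption rather than as part of the notebook's pipeline, is to convert the upload to 16 kHz mono WAV and report its duration before transcribing. faster-whisper can decode common formats on its own, so this step is optional. The helper names `wav_path_for` and `to_mono_16k_wav` are hypothetical.

```python
from pathlib import Path


def wav_path_for(src_path: str) -> str:
    """Return the sibling .wav path for an input file (hypothetical helper)."""
    return str(Path(src_path).with_suffix(".wav"))


def to_mono_16k_wav(src_path: str) -> str:
    """Convert an ffmpeg-readable audio file to 16 kHz mono WAV with pydub."""
    # Imported lazily so the helper above works without pydub installed.
    from pydub import AudioSegment  # pydub relies on ffmpeg being available

    audio = AudioSegment.from_file(src_path)
    audio = audio.set_frame_rate(16000).set_channels(1)
    out_path = wav_path_for(src_path)
    audio.export(out_path, format="wav")
    # len() of an AudioSegment is its duration in milliseconds.
    print(f"Duration: {len(audio) / 1000:.1f} s -> {out_path}")
    return out_path


# In the notebook: audio_path = to_mono_16k_wav(audio_path)
print(wav_path_for("Jos001-[AudioTrimmer.com].mp3"))
```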



# Step 6: Transcription

In [24]:
segments, info = fw_model.transcribe(
    audio_path,
    beam_size=15,
    vad_filter=True,
    chunk_length=30,
    without_timestamps=False,
    multilingual=True,
)

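print("🧠 Transcribing...\n")
# `segments` is a lazy generator: the audio is decoded as this loop consumes it.
texts = []
for s in segments:
    print(s.text)
    texts.append(s.text.strip())
# Segment text usually starts with a space, so strip before joining.
full_text = " ".join(texts)
print("\n✅ Done!")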

🧠 Transcribing...

 یشیو باب ایک رب کے خادم موسیٰ کی موت کے بعد رب موسیٰ کے مددگار یشیو بن نون سے ہم کلام ہوا اس نے کہا میرا خادم موسیٰ فوت ہو گیا ہے
 اب اٹھ اس پوری قوم کے ساتھ دریائے یردن کو پار کر کے اس ملک میں داخل ہو جا جو میں اسرائیلیوں کو دینے کو ہوں
 جس زمین پر بھی

✅ Done!
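
Each segment yielded by `transcribe()` carries `start` and `end` times in seconds alongside `text`, so the result can be saved as subtitles. The following is a minimal sketch of an SRT export; `format_srt_time` and `segments_to_srt` are hypothetical helpers, and the demo segments are stand-ins rather than real model output.

```python
from types import SimpleNamespace


def format_srt_time(seconds: float) -> str:
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments) -> str:
    """Build SRT text from objects exposing .start, .end and .text."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        blocks.append(
            f"{index}\n"
            f"{format_srt_time(seg.start)} --> {format_srt_time(seg.end)}\n"
            f"{seg.text.strip()}\n"
        )
    return "\n".join(blocks)


# Stand-in segments for illustration only. In the notebook, pass the objects
# yielded by fw_model.transcribe(); collect them with list() first if you also
# want to print them, because the generator can only be consumed once.
demo = [
    SimpleNamespace(start=0.0, end=2.5, text=" first line"),
    SimpleNamespace(start=2.5, end=61.04, text=" second line"),
]
print(segments_to_srt(demo))
```

Writing the returned string to a file with `encoding="utf-8"` keeps non-Latin scripts such as the Urdu output above intact.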


### Parameter Tuning for Accuracy and Speed

When running `fw_model.transcribe()`, you can adjust the parameters to balance **accuracy**, **speed**, and **language handling**:

| Parameter | What it does | Effect of changing it | Example values |
|-----------|--------------|-----------------------|----------------|
| `beam_size` | Number of candidate transcriptions kept at each step of beam search decoding. | Higher → usually more accurate, but slower, with diminishing returns. Lower → faster, somewhat less accurate. | 5 (library default), 10, 15 |
| `chunk_length` | Length (in seconds) of the audio window the model processes at a time. | Whisper models are trained on 30-second windows, so 30 is the safe choice. Shorter windows give the model less context per chunk; treat other values as experimental and compare the output. | 15, 30, 60 |
| `vad_filter` | Voice activity detection that removes non-speech audio before transcription. | `True` → skips silence, faster, and can reduce spurious text in silent stretches. `False` → processes all audio, may include noise. | True / False |
| `without_timestamps` | Whether the model skips predicting timestamp tokens. | `False` → the model predicts timestamps, so segment start/end times are finer. `True` → text tokens only, with coarser segment times. | True / False |
| `multilingual` | Per-segment language detection. | `True` → re-detects the language for each segment, useful when speakers switch languages. `False` → one language is used for the whole file. | True / False |

💡 **Tips:**
- Increase `beam_size` for higher transcription accuracy at the cost of speed; the gains tend to flatten at larger values.
- Keep `chunk_length` at 30 unless you have a reason to change it; if you experiment (e.g., 15 or 60), compare transcripts and timings on the same file.
- Turn off `vad_filter` if speech is being dropped or you want every part of the audio processed, including silent parts.
- **Longer audio files** make differences in these parameters more noticeable: the effect of a higher `beam_size` or a different `chunk_length` on quality and runtime is easier to see.
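
One way to see these trade-offs on your own audio is to time the same file under several settings. The sketch below assumes only that the function passed in behaves like `fw_model.transcribe`, returning a `(segments, info)` pair; `time_settings` and `print_report` are hypothetical helpers, not part of faster-whisper.

```python
import time


def time_settings(transcribe_fn, audio_path, settings_list):
    """Run one full transcription per settings dict and record the timing."""
    results = []
    for settings in settings_list:
        started = time.perf_counter()
        segments, _info = transcribe_fn(audio_path, **settings)
        # Consume the generator inside the timed region: decoding is lazy.
        text = " ".join(seg.text.strip() for seg in segments)
        elapsed = time.perf_counter() - started
        results.append({"settings": settings, "seconds": elapsed, "text": text})
    return results


def print_report(results):
    """Print one line per run: settings, elapsed time, transcript length."""
    for row in results:
        print(f"{row['settings']} -> {row['seconds']:.2f} s, {len(row['text'])} chars")


# In the notebook (fw_model and audio_path come from the earlier steps):
# results = time_settings(
#     fw_model.transcribe,
#     audio_path,
#     [{"beam_size": 5, "vad_filter": True}, {"beam_size": 15, "vad_filter": True}],
# )
# print_report(results)
```

Timings from a single run are noisy, especially on shared Colab hardware, so repeat each setting a few times before drawing conclusions.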
