## Import Transformer
# 🔹 Transformers

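- **Transformer** is a deep learning architecture used mainly for **NLP** tasks, and increasingly for audio and vision.  
- It relies on **Attention** → compares every word with every other word at once to find which ones matter most.  
- Processes a whole sequence in parallel, so it usually trains faster than RNN/LSTM and handles long-range context better.  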
- Power behind models like **BERT** and **GPT**.  
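
To make the attention idea above concrete, here is a minimal sketch of scaled dot-product attention in plain Python. The three 2-dimensional token vectors are made up for illustration, and the sketch leaves out the learned query/key/value projections and multiple heads that real Transformer layers use.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    dim = len(keys[0])
    outputs = []
    for q in queries:
        # Score this query against every key in one pass.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        # The output is the weighted average of the value vectors.
        mixed = [sum(w * v[j] for w, v in zip(weights, values))
                 for j in range(len(values[0]))]
        outputs.append(mixed)
    return outputs

# Toy input: three "words", each a made-up 2-dimensional vector.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
context = attention(tokens, tokens, tokens)
print(context)
```

Each output vector is a blend of all three inputs, weighted by how strongly the corresponding word "attends" to the others.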

In [2]:
pip install transformers




### 📌 What does `from transformers import pipeline` mean?

- `transformers` → A library by **Hugging Face** with pre-trained AI/ML models.  
- `pipeline` → A quick way to use those models (like sentiment analysis, text generation, summarization, translation, etc.) **without writing complex code**.  

👉 Basically: **One line to load and use powerful AI models easily.**
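
As a rough illustration of that one-line style, the sketch below wraps the speech recognition pipeline in a small helper. The function name `transcribe_file` is hypothetical, the checkpoint is the one used later in this notebook, and the first call needs internet access to download the model.

```python
def transcribe_file(audio_path, model_name="facebook/wav2vec2-base-960h"):
    """Transcribe one audio file with the high-level pipeline API."""
    # Imported inside the function so that defining it does not load the library.
    from transformers import pipeline

    # The first call downloads the checkpoint from the Hugging Face Hub.
    asr = pipeline("automatic-speech-recognition", model=model_name)
    # Passing a file path relies on ffmpeg being installed to decode the audio.
    result = asr(audio_path)
    return result["text"]
```

Calling `transcribe_file("/content/Hello.m4a")` should return the transcription as a string, covering in one call the loading, preprocessing, and decoding that the rest of this notebook does step by step.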


In [3]:
from transformers import pipeline

### 📌 Checking Transformers Library Version

- It is good practice to check the installed version, because features and defaults change between releases.  
- Recording the version helps when debugging or reproducing results later.  
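
One way to act on the version is to fail early when it is older than the notebook expects. The sketch below compares versions as integer tuples. The helper names and the `4.30.0` minimum are assumptions for illustration, not requirements stated by the library.

```python
import re

def version_tuple(version):
    """Extract the leading numeric components, e.g. '4.56.1' -> (4, 56, 1)."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return tuple(int(part) for part in match.groups(default="0"))

def require_at_least(installed, minimum):
    """Raise if the installed version is older than the minimum."""
    if version_tuple(installed) < version_tuple(minimum):
        raise RuntimeError(f"need version >= {minimum}, found {installed}")

# Hypothetical minimum; pass transformers.__version__ as the first argument.
require_at_least("4.56.1", "4.30.0")
```

Comparing tuples avoids the trap of comparing strings, where `"4.9.0"` would sort after `"4.56.1"`. The `packaging` library offers a more complete version parser if it is available.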


In [4]:
import transformers
print(transformers.__version__)

4.56.1


### 🎵 Importing Required Libraries for Speech-to-Text

- **librosa** → For loading & processing audio files.  
- **torch** → PyTorch (deep learning framework).  
- **IPython.display** → To play audio inside Colab.  
- **transformers (Wav2Vec2ForCTC, Wav2Vec2Tokenizer)** → Pretrained speech-to-text model & tokenizer.  
- **numpy** → For numerical operations on audio data.  


In [5]:
import librosa
import torch
import IPython.display as display
from transformers import Wav2Vec2ForCTC, Wav2Vec2Tokenizer
import numpy as np

### 🧠 Load Pretrained Wav2Vec2 Model & Tokenizer

- **Tokenizer** → Prepares the raw audio (waveform) as model-readable inputs and, after prediction, maps token IDs back to text.  
- **Model (Wav2Vec2ForCTC)** → Speech-to-text model pretrained and fine-tuned on 960 hours of English speech (LibriSpeech).  
- We use `"facebook/wav2vec2-base-960h"` → A popular checkpoint for ASR (Automatic Speech Recognition).  


In [6]:
tokenizer = Wav2Vec2Tokenizer.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'Wav2Vec2CTCTokenizer'. 
The class this function is called from is 'Wav2Vec2Tokenizer'.
Some weights of Wav2Vec2ForCTC were not initialized from the model checkpoint at facebook/wav2vec2-base-960h and are newly initialized: ['wav2vec2.masked_spec_embed']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
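
The tokenizer-class warning above appears because `Wav2Vec2Tokenizer` is deprecated and the checkpoint was saved with `Wav2Vec2CTCTokenizer`. A possible alternative, sketched below, is `Wav2Vec2Processor`, which bundles the feature extractor and the CTC tokenizer. The function name `transcribe_waveform` is hypothetical, and the sketch assumes a 1-D float waveform already sampled at 16 kHz.

```python
def transcribe_waveform(waveform, sampling_rate=16000,
                        model_name="facebook/wav2vec2-base-960h"):
    """Transcribe a 1-D float waveform using Wav2Vec2Processor."""
    # Imported inside the function so that defining it does not load the libraries.
    import torch
    from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

    processor = Wav2Vec2Processor.from_pretrained(model_name)
    model = Wav2Vec2ForCTC.from_pretrained(model_name)
    model.eval()

    # The processor normalizes the waveform and returns PyTorch tensors.
    inputs = processor(waveform, sampling_rate=sampling_rate, return_tensors="pt")
    with torch.no_grad():  # inference only, so skip gradient tracking
        logits = model(inputs.input_values).logits
    predicted_ids = torch.argmax(logits, dim=-1)
    # batch_decode returns one string per item in the batch.
    return processor.batch_decode(predicted_ids)[0]
```

The steps mirror the rest of this notebook, so the cells below can be read as an unrolled version of this function.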


### 🎧 Load Audio File with Librosa

- Using `librosa.load()` to read the audio file.  
- `sr=16000` → Resamples audio to 16 kHz, the sampling rate this Wav2Vec2 checkpoint expects.  
- `audio` → 1-D NumPy array (float32) containing the waveform, mixed down to mono by default.  
- `sampling_rate` → Sampling rate of the loaded audio.  
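
To show what resampling does to the data, here is a deliberately crude sketch based on linear interpolation. It is not how `librosa` resamples (real resamplers also filter the signal to avoid aliasing), and the 48 kHz ramp is made-up test data.

```python
def resample_linear(samples, orig_sr, target_sr):
    """Resample by linear interpolation (a crude stand-in for a real resampler)."""
    if not samples:
        return []
    duration = len(samples) / orig_sr            # clip length in seconds
    n_out = int(round(duration * target_sr))     # samples needed at the new rate
    out = []
    for i in range(n_out):
        # Position of output sample i, measured in input-sample units.
        pos = i * orig_sr / target_sr
        left = min(int(pos), len(samples) - 1)
        right = min(left + 1, len(samples) - 1)
        frac = pos - left
        out.append(samples[left] * (1 - frac) + samples[right] * frac)
    return out

# One second of a made-up ramp signal recorded at 48 kHz.
ramp_48k = [i / 48000 for i in range(48000)]
ramp_16k = resample_linear(ramp_48k, 48000, 16000)
print(len(ramp_48k), "->", len(ramp_16k))
```

The clip still lasts one second, but it is now described by a third as many samples.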


In [8]:
audio, sampling_rate = librosa.load("/content/Hello.m4a", sr=16000)


### 🔍 Inspect Audio & Sampling Rate


In [9]:
audio, sampling_rate

(array([ 1.7316779e-09, -2.3574103e-09,  2.3719622e-09, ...,
        -4.4372920e-03, -4.6387115e-03, -2.5428366e-03], dtype=float32),
 16000)

### 🎶 Play the Audio File in Colab
- Using `IPython.display.Audio()` to play the sound directly.  
- `autoplay=True` → Automatically starts playing when the cell runs.  


In [10]:
display.Audio("/content/Hello.m4a", autoplay=True)

### 📝 Convert Audio to Input Tensors
- **Tokenizer** takes the raw waveform (`audio`) and prepares it for the model.  
- `return_tensors="pt"` → Returns PyTorch tensors.  
- `input_values` → Model-ready numerical representation of the audio.
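
The printed `input_values` differ from the raw `audio` values by a roughly constant scale factor, which is consistent with the waveform being normalized to zero mean and unit variance. A minimal sketch of that normalization follows. The epsilon value and the sample numbers are assumptions for illustration.

```python
import math

def normalize_waveform(samples, eps=1e-7):
    """Scale a waveform to zero mean and approximately unit variance."""
    mean = sum(samples) / len(samples)
    variance = sum((s - mean) ** 2 for s in samples) / len(samples)
    # eps keeps the division safe when the clip is silent (variance of 0).
    return [(s - mean) / math.sqrt(variance + eps) for s in samples]

# Made-up samples standing in for a short audio clip.
clip = [0.1, -0.2, 0.3, -0.1, 0.05]
scaled = normalize_waveform(clip)
print(scaled)
```

Normalizing this way makes quiet and loud recordings look alike to the model.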

In [11]:
input_values = tokenizer(audio, return_tensors="pt").input_values
input_values

tensor([[-1.6341e-05, -1.6393e-05, -1.6332e-05,  ..., -5.6781e-02,
         -5.9358e-02, -3.2546e-02]])

### 🧮 Forward Pass: Audio → Model → Logits
- `model(input_values)` → Runs the audio input through Wav2Vec2.  
- `.logits` → Raw, unnormalized scores for every token at every frame (before softmax).  
- Shape → `(batch_size, sequence_length, vocab_size)`  
  - `batch_size = 1` (your one audio file)  
  - `sequence_length` = number of audio frames (roughly one per 20 ms of audio)  
  - `vocab_size` = number of possible tokens (characters/letters plus a few special tokens).  
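
The `sequence_length` can be estimated from the number of audio samples. The sketch below assumes the kernel sizes and strides of the default Wav2Vec2 base convolutional feature encoder, so treat the constants as assumptions rather than values read from this checkpoint.

```python
# Kernel sizes and strides of the convolutional feature encoder
# (assumed to match the default Wav2Vec2 base configuration).
CONV_KERNELS = (10, 3, 3, 3, 3, 2, 2)
CONV_STRIDES = (5, 2, 2, 2, 2, 2, 2)

def num_frames(num_samples):
    """Estimate how many logit frames the model emits for a waveform."""
    length = num_samples
    for kernel, stride in zip(CONV_KERNELS, CONV_STRIDES):
        # Output length of an unpadded 1-D convolution.
        length = (length - kernel) // stride + 1
    return length

frames_per_second = num_frames(16000)  # one second of 16 kHz audio
print(frames_per_second)
```

Under these assumptions one second of audio gives 49 frames, because the strides multiply to 320 samples, which is 20 ms at 16 kHz.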


In [12]:
logits = model(input_values).logits
logits

tensor([[[ 12.5849, -26.1309, -25.8220,  ...,  -5.6840,  -6.8442,  -5.8844],
         [ 12.8003, -26.3805, -26.0685,  ...,  -5.7693,  -7.1110,  -5.8951],
         [ 12.7660, -26.1959, -25.8812,  ...,  -5.6713,  -6.9799,  -5.9420],
         ...,
         [ 12.6222, -26.3600, -26.0735,  ...,  -5.9103,  -7.3980,  -5.8720],
         [ 12.4199, -26.2321, -25.9570,  ...,  -5.8930,  -7.4127,  -5.8910],
         [ 12.3831, -26.2839, -26.0082,  ...,  -5.9778,  -7.5147,  -5.9104]]],
       grad_fn=<ViewBackward0>)

### 🔡 Get Predicted Token IDs
- `torch.argmax(logits, dim=-1)` → Finds the index of the highest score at each time step.  
- These indices correspond to **predicted tokens** (characters/letters).  
- Next, we’ll decode them into **human-readable text** using the tokenizer.  
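
Because softmax preserves the ordering of scores, taking the argmax of the raw logits picks the same token as taking it after softmax. The sketch below checks this on made-up logits for 3 time steps over a 4-token vocabulary.

```python
import math

def softmax(row):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(row)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def argmax(row):
    """Index of the largest value in a list."""
    return max(range(len(row)), key=row.__getitem__)

# Made-up logits: one row per time step, one column per token.
toy_logits = [
    [12.6, -26.1, -5.7, -6.8],
    [-3.0, 1.5, 9.2, 0.4],
    [0.1, 7.7, -2.2, 7.1],
]
ids_from_logits = [argmax(row) for row in toy_logits]
ids_from_probs = [argmax(softmax(row)) for row in toy_logits]
print(ids_from_logits, ids_from_probs)
```

This is why the notebook can skip softmax entirely when it only needs the most likely token per frame.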


In [13]:
predicted_ids = torch.argmax(logits, dim=-1)

### 🗣️ Decode Token IDs → Text
- `tokenizer.decode()` → Collapses repeated IDs, drops blank (padding) tokens, and converts the remaining IDs back into words/characters.  
- Output is your **speech-to-text transcription**.  
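
The decoding step can be sketched as greedy CTC decoding: merge runs of the same ID, remove blanks, and turn word delimiters into spaces. The vocabulary below is a hypothetical toy mapping, not the checkpoint's real one, and the frame IDs are made up.

```python
from itertools import groupby

# Hypothetical toy vocabulary; the real checkpoint ships its own mapping.
TOY_VOCAB = {0: "<pad>", 1: "|", 2: "H", 3: "E", 4: "L", 5: "O", 6: "I"}
BLANK_ID = 0          # "<pad>" doubles as the CTC blank token
WORD_DELIMITER = "|"  # marks the boundary between words

def ctc_greedy_decode(token_ids):
    """Turn per-frame token IDs into text."""
    # 1. Collapse runs of the same ID into a single ID.
    collapsed = [token_id for token_id, _ in groupby(token_ids)]
    # 2. Drop blanks; they only separate genuine repeats such as "LL".
    chars = [TOY_VOCAB[i] for i in collapsed if i != BLANK_ID]
    # 3. Turn word delimiters into spaces.
    return "".join(chars).replace(WORD_DELIMITER, " ").strip()

# Made-up frame-level predictions, one ID per audio frame.
frame_ids = [0, 2, 2, 0, 3, 4, 4, 0, 4, 5, 5, 0, 1, 1, 2, 6, 0, 0]
text = ctc_greedy_decode(frame_ids)
print(text)
```

The blank between the two runs of `L` is what lets the double letter in "HELLO" survive the collapsing step.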


In [14]:
transcriptions = tokenizer.decode(predicted_ids[0])

### 📜 Final Transcription Output


In [15]:
transcriptions

'HELLO MY NAME IS RAKISH HI'