# Speech to Text Project

**1: Install Transformers Library**

Installs Hugging Face’s transformers package, which provides pre-trained models for tasks such as speech-to-text, NLP, and more.

In [1]:
pip install transformers



**2: Import Transformers Pipeline**

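Imports the pipeline API from transformers, a high-level wrapper that bundles preprocessing, inference, and decoding for tasks such as audio transcription. The remaining steps call the Wav2Vec2 classes directly, so this import is not used again in the notebook.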

In [3]:
from transformers import pipeline
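
For comparison, the whole workflow in this notebook can be approximated with a single pipeline call. The sketch below is a minimal illustration rather than the notebook's method: the helper name `transcribe_with_pipeline` is hypothetical, and it assumes the same facebook/wav2vec2-base-960h checkpoint and that ffmpeg is available so the pipeline can decode an audio file from its path.

```python
def transcribe_with_pipeline(audio_path, checkpoint="facebook/wav2vec2-base-960h"):
    """Transcribe an audio file with the high-level ASR pipeline (illustrative helper)."""
    # Imported inside the function so that defining the helper stays cheap;
    # the checkpoint is only downloaded and loaded when the helper is called.
    from transformers import pipeline

    asr = pipeline("automatic-speech-recognition", model=checkpoint)
    result = asr(audio_path)  # a dict that contains the transcription under "text"
    return result["text"]
```

Calling `transcribe_with_pipeline("/content/v.mp3")` would download the checkpoint on first use and return the transcription as a string.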

**3: Import Additional Libraries**

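Imports the remaining libraries: librosa for audio loading and resampling, torch for tensor computations, IPython.display for audio playback, the Wav2Vec2 model and tokenizer classes from transformers, and numpy for array handling.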

In [4]:
import librosa
import torch
import IPython.display as display
from transformers import Wav2Vec2ForCTC, Wav2Vec2Tokenizer
import numpy as np

**4: Load Pretrained Tokenizer and Model**

Loads the Wav2Vec2 tokenizer and model from Hugging Face, specifically the facebook/wav2vec2-base-960h checkpoint trained for English speech recognition.

In [5]:
tokenizer = Wav2Vec2Tokenizer.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'Wav2Vec2CTCTokenizer'. 
The class this function is called from is 'Wav2Vec2Tokenizer'.



Some weights of Wav2Vec2ForCTC were not initialized from the model checkpoint at facebook/wav2vec2-base-960h and are newly initialized: ['wav2vec2.masked_spec_embed']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
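
The tokenizer warning above appears because `Wav2Vec2Tokenizer` is a deprecated class, while the checkpoint was saved with `Wav2Vec2CTCTokenizer`. One possible way to avoid it, sketched below under the assumption that a reasonably recent transformers release is installed, is to load a `Wav2Vec2Processor`, which wraps the feature extractor and the CTC tokenizer together. The helper name `load_asr_components` is hypothetical.

```python
def load_asr_components(checkpoint="facebook/wav2vec2-base-960h"):
    """Load a processor and CTC model for the given checkpoint (illustrative helper)."""
    # Imported lazily so that defining the helper does not load the heavy dependencies.
    from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

    processor = Wav2Vec2Processor.from_pretrained(checkpoint)
    model = Wav2Vec2ForCTC.from_pretrained(checkpoint)
    model.eval()  # inference mode: disables dropout
    return processor, model
```

With a processor, the input step would become `processor(audio, sampling_rate=16000, return_tensors="pt").input_values` and the decoding step `processor.decode(predicted_ids[0])`. The second warning concerns `wav2vec2.masked_spec_embed`, a parameter that appears to be used only for masking during training, so it is generally treated as harmless for inference.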


**5: Load Audio File**

Loads an MP3 audio file using librosa and resamples it to 16 kHz, the required sampling rate for Wav2Vec2.

In [6]:
audio, sampling_rate = librosa.load("/content/v.mp3", sr=16000)
audio, sampling_rate

(array([ 1.9825604e-10,  1.2977464e-10, -7.7391737e-10, ...,
        -5.6942262e-08, -1.0897398e-07,  0.0000000e+00], dtype=float32),
 16000)
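
A quick sanity check on the loaded waveform can catch sample-rate or channel problems before inference. The sketch below uses a synthetic one-second tone in place of the real file, and `describe_audio` is a hypothetical helper name. It assumes the default behavior of `librosa.load`, which returns a mono float32 array.

```python
import numpy as np


def describe_audio(audio, sampling_rate, expected_rate=16000):
    """Validate a waveform and summarize it (illustrative helper)."""
    if audio.ndim != 1:
        raise ValueError("expected a mono (1-D) waveform")
    if sampling_rate != expected_rate:
        raise ValueError(f"expected {expected_rate} Hz, got {sampling_rate} Hz")
    return {
        "num_samples": int(audio.shape[0]),
        "duration_seconds": audio.shape[0] / sampling_rate,
        "peak_amplitude": float(np.max(np.abs(audio))),
    }


# Synthetic stand-in for the array returned by librosa.load(..., sr=16000):
# one second of a 440 Hz sine wave at half amplitude.
t = np.arange(16000) / 16000
tone = (0.5 * np.sin(2 * np.pi * 440 * t)).astype(np.float32)
info = describe_audio(tone, 16000)
```

In the notebook, the same check would be `describe_audio(audio, sampling_rate)`.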

In [7]:
display.Audio('/content/v.mp3', autoplay=True)

**6: Tokenize Audio Input**

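Converts the raw audio waveform into the batched PyTorch tensor (`input_values`) that the Wav2Vec2 model expects. Despite the class name, this step prepares audio input rather than tokenizing text.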

In [8]:
input_values = tokenizer(audio, return_tensors="pt").input_values
input_values

tensor([[0.0040, 0.0040, 0.0040,  ..., 0.0040, 0.0040, 0.0040]])
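
The values in the tensor above are consistent with per-utterance normalization to zero mean and unit variance, which Wav2Vec2 preprocessing applies when the checkpoint's configuration enables it. The sketch below reproduces that arithmetic with NumPy on a synthetic waveform. The function name `normalize_waveform` and the `eps` constant are illustrative assumptions, not the library's code.

```python
import numpy as np


def normalize_waveform(waveform, eps=1e-7):
    """Scale a 1-D waveform to zero mean and unit variance."""
    waveform = np.asarray(waveform, dtype=np.float32)
    return (waveform - waveform.mean()) / np.sqrt(waveform.var() + eps)


# Synthetic noisy waveform with a small DC offset, standing in for real audio.
rng = np.random.default_rng(0)
raw = 0.1 * rng.standard_normal(16000).astype(np.float32) + 0.02

normalized = normalize_waveform(raw)
batch = normalized[np.newaxis, :]  # add the batch dimension the model expects
```

Near-silent samples in a real recording all map to roughly the same small value after this scaling, which would explain the repeated entries at the edges of the tensor.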

**7: Generate Logits from Model**

Passes the input tensor through the Wav2Vec2 model to produce raw logits: one unnormalized score per vocabulary token for each audio frame.

In [9]:
logits = model(input_values).logits
logits

tensor([[[ 15.0434, -31.1234, -30.7598,  ...,  -8.2236,  -9.7469,  -8.5590],
         [ 14.9926, -30.9031, -30.5264,  ...,  -8.0866,  -9.6095,  -8.5321],
         [ 14.7949, -30.9562, -30.5803,  ...,  -8.0040,  -9.8658,  -8.5285],
         ...,
         [ 14.5032, -30.7909, -30.4184,  ...,  -8.0420,  -9.9960,  -8.0174],
         [ 14.8440, -31.3340, -30.9675,  ...,  -8.4790,  -9.7610,  -8.3460],
         [ 14.9384, -31.1158, -30.7467,  ...,  -8.2787,  -9.6670,  -8.2084]]],
       grad_fn=<ViewBackward0>)
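
The middle dimension of the logits tensor is the number of audio frames, which is far smaller than the number of input samples because the model's convolutional front end downsamples the waveform. The sketch below computes that frame count, assuming the kernel sizes and strides of the base Wav2Vec2 configuration (seven unpadded convolution layers). The constant and function names are hypothetical.

```python
# Assumed kernel sizes and strides of the base Wav2Vec2 feature encoder.
CONV_KERNELS = (10, 3, 3, 3, 3, 2, 2)
CONV_STRIDES = (5, 2, 2, 2, 2, 2, 2)


def num_output_frames(num_samples, kernels=CONV_KERNELS, strides=CONV_STRIDES):
    """Return how many frames the convolutional stack emits for num_samples inputs."""
    length = num_samples
    for kernel, stride in zip(kernels, strides):
        # Output length of an unpadded 1-D convolution.
        length = (length - kernel) // stride + 1
    return length


frames_for_one_second = num_output_frames(16000)
frames_for_ten_seconds = num_output_frames(160000)
```

Under these assumptions, one second of 16 kHz audio yields 49 frames, roughly one every 20 ms, and each frame carries one score per vocabulary token.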

**8: Decode Predicted Token IDs to Text**

Uses argmax to select the most probable token ID for each audio frame, then converts the predicted token IDs into human-readable text with the Wav2Vec2 tokenizer’s decode method.

In [10]:
predicted_ids = torch.argmax(logits, dim=-1)

In [11]:
transcriptions = tokenizer.decode(predicted_ids[0])
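
Greedy CTC decoding does more than look up characters: consecutive repeated IDs are collapsed, the blank token is dropped, and the word-delimiter token becomes a space. The sketch below shows that logic in plain Python with a toy vocabulary. The vocabulary, the `ctc_greedy_decode` name, and the choice of `<pad>` as blank and `|` as delimiter are assumptions chosen to mirror the Wav2Vec2 convention, not the tokenizer's actual implementation.

```python
from itertools import groupby


def ctc_greedy_decode(frame_ids, id_to_token, blank="<pad>", word_delimiter="|"):
    """Turn per-frame argmax IDs into text using the CTC collapsing rules."""
    # 1. Collapse runs of identical consecutive IDs into a single ID.
    collapsed = [token_id for token_id, _ in groupby(frame_ids)]
    # 2. Map IDs to tokens and drop the blank token.
    tokens = [id_to_token[i] for i in collapsed if id_to_token[i] != blank]
    # 3. Replace the word delimiter with a space and join the characters.
    return "".join(" " if t == word_delimiter else t for t in tokens).strip()


# Toy vocabulary and frame sequence, for illustration only.
toy_vocab = {0: "<pad>", 1: "|", 2: "H", 3: "E", 4: "L", 5: "O"}
frames = [0, 2, 2, 0, 3, 4, 4, 0, 4, 5, 5, 1, 0]
text = ctc_greedy_decode(frames, toy_vocab)
```

The blank between the two runs of `L` is what allows the double letter in `HELLO` to survive the collapsing step.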

**9: Display Final Transcription**

Displays the final transcription produced by the speech-to-text model.

In [12]:
transcriptions

'IN THE ANCIENT LAND OF ELDORIA WHERE SKIES SHIMMERED AND FORESTS WHISPERED SECRETS TO THE WIND LIVED A DRAGON NAMED ZEPHYROS NOT THE BURNET ALL DOWNKIND BUT HE WAS GENTLE WISE WITH EYES LIKE OLD STARS EVEN THE BIRDS FELL SILENT WHEN HE PASSED'
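
Steps 6 through 8 can be folded into a single helper. The sketch below is one possible arrangement, not the notebook's own code: the name `transcribe` is hypothetical, and it assumes `model` is a loaded `Wav2Vec2ForCTC` and `tokenizer` is the object created in step 4. It wraps the forward pass in `torch.no_grad()` so that no autograd graph is built; the `grad_fn` in the step 7 output indicates that gradients were being tracked there.

```python
def transcribe(audio, model, tokenizer):
    """Run preprocessing, inference, and greedy decoding on one 16 kHz waveform."""
    # Imported lazily so that defining the helper does not require torch to be loaded.
    import torch

    input_values = tokenizer(audio, return_tensors="pt").input_values
    model.eval()  # inference mode: disables dropout
    with torch.no_grad():  # skip gradient tracking to save memory and time
        logits = model(input_values).logits
    predicted_ids = torch.argmax(logits, dim=-1)
    return tokenizer.decode(predicted_ids[0])
```

With the notebook's variables, the call would be `transcribe(audio, model, tokenizer)`.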