# Speech to Text Project

**1: Install Transformers Library**

Installs Hugging Face’s transformers package, which provides pre-trained models for tasks such as speech-to-text, NLP, and more.

In [None]:
%pip install transformers

**2: Import Transformers Pipeline**

Imports the pipeline API from transformers, which simplifies access to models for various tasks like audio transcription.

In [None]:
from transformers import pipeline
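The notebook imports `pipeline` but the later steps call the model directly. As a minimal sketch of what the shortcut looks like, the helper below wraps the `automatic-speech-recognition` pipeline. The function name `transcribe_with_pipeline` is hypothetical, and it assumes the input is a mono waveform already sampled at 16 kHz, such as the array loaded in step 5.

```python
import numpy as np


def transcribe_with_pipeline(
    audio: np.ndarray,
    checkpoint: str = "facebook/wav2vec2-base-960h",
) -> str:
    """Transcribe a mono 16 kHz waveform with the high-level pipeline API."""
    # Imported inside the function so that defining the helper stays cheap;
    # the checkpoint is only downloaded when the helper is actually called.
    from transformers import pipeline

    asr = pipeline("automatic-speech-recognition", model=checkpoint)
    # A 1-D NumPy array is assumed to already be at the model's sampling rate.
    result = asr(audio)
    return result["text"]
```

Calling `transcribe_with_pipeline(audio)` once the audio is loaded should give a transcription comparable to the manual steps 6 to 9, since the pipeline performs the same preprocessing, inference, and decoding internally.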

**3: Import Additional Libraries**

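Imports the remaining libraries: librosa for audio loading and resampling, torch for tensor computations, IPython.display for audio playback, NumPy for array handling, and the Wav2Vec2 model and tokenizer classes from transformers.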

In [None]:
import librosa
import torch
import IPython.display as display
from transformers import Wav2Vec2ForCTC, Wav2Vec2Tokenizer
import numpy as np

**4: Load Pretrained Tokenizer and Model**

Loads the Wav2Vec2 tokenizer and model from Hugging Face, specifically the facebook/wav2vec2-base-960h checkpoint trained for English speech recognition.

In [None]:
tokenizer = Wav2Vec2Tokenizer.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")
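Recent releases of transformers emit a deprecation warning for `Wav2Vec2Tokenizer` and point to `Wav2Vec2Processor` instead. The sketch below shows one possible way to load the same checkpoint with the processor. The helper name `load_processor_and_model` is hypothetical, and whether the warning appears depends on the installed version.

```python
def load_processor_and_model(checkpoint: str = "facebook/wav2vec2-base-960h"):
    """Load the checkpoint with Wav2Vec2Processor instead of Wav2Vec2Tokenizer."""
    # Imported inside the function so nothing is downloaded until it is called.
    from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

    processor = Wav2Vec2Processor.from_pretrained(checkpoint)
    model = Wav2Vec2ForCTC.from_pretrained(checkpoint)
    model.eval()  # make inference mode explicit (dropout disabled)
    return processor, model
```

With this variant, the later steps would use `processor(audio, sampling_rate=16000, return_tensors="pt").input_values` to prepare the input and `processor.batch_decode(predicted_ids)` to decode the predictions.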

**5: Load Audio File**

Loads an MP3 audio file using librosa and resamples it to 16 kHz, the required sampling rate for Wav2Vec2.

In [None]:
audio, sampling_rate = librosa.load("/content/v.mp3", sr=16000)
audio, sampling_rate

In [None]:
display.Audio('/content/v.mp3', autoplay=True)
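By default `librosa.load` returns a mono `float32` NumPy array together with the sampling rate. A quick sanity check on that array can catch a wrong file or sampling rate before inference. The sketch below uses a hypothetical helper, `describe_waveform`, and a synthetic tone as a stand-in for the real file.

```python
import numpy as np


def describe_waveform(audio: np.ndarray, sampling_rate: int) -> dict:
    """Summarize a waveform such as the one returned by librosa.load."""
    return {
        "num_samples": int(audio.shape[0]),
        "duration_s": audio.shape[0] / sampling_rate,
        "dtype": str(audio.dtype),
        "peak": float(np.max(np.abs(audio))) if audio.size else 0.0,
        "is_mono": audio.ndim == 1,
    }


# Synthetic stand-in for the loaded file: 2 seconds of a 440 Hz tone at 16 kHz.
sr = 16000
t = np.arange(2 * sr) / sr
tone = (0.5 * np.sin(2 * np.pi * 440 * t)).astype(np.float32)

summary = describe_waveform(tone, sr)
print(summary)
```

In the notebook, the same check would be `describe_waveform(audio, sampling_rate)`.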

**6: Tokenize Audio Input**

Converts the audio waveform into the PyTorch input tensor (input_values) that the Wav2Vec2 model expects for inference.

In [None]:
input_values = tokenizer(audio, return_tensors="pt").input_values
input_values

**7: Generate Logits from Model**

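Passes the prepared input through the Wav2Vec2 model to obtain logits: raw, unnormalized scores for each vocabulary character at every time step. Applying a softmax over the last dimension turns them into probabilities.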

In [None]:
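# Disable gradient tracking: inference does not need it, and this saves memory.
with torch.no_grad():
    logits = model(input_values).logits
logits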
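The logits tensor has shape (batch, frames, vocabulary size). The sketch below estimates how many frames to expect for a given number of samples and converts logits to per-frame probabilities with a softmax. The kernel sizes, strides, and vocabulary size of 32 are assumptions taken from the published configuration of the base checkpoint, and random numbers stand in for real model output.

```python
import numpy as np

# Assumed convolutional feature encoder layout of the wav2vec 2.0 base model.
CONV_KERNELS = (10, 3, 3, 3, 3, 2, 2)
CONV_STRIDES = (5, 2, 2, 2, 2, 2, 2)


def num_output_frames(num_samples: int) -> int:
    """Number of time steps the model emits for a waveform of num_samples."""
    length = num_samples
    for kernel, stride in zip(CONV_KERNELS, CONV_STRIDES):
        length = (length - kernel) // stride + 1
    return length


def softmax(logits: np.ndarray, axis: int = -1) -> np.ndarray:
    """Numerically stable softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


frames = num_output_frames(16000)  # one second of 16 kHz audio
rng = np.random.default_rng(0)
fake_logits = rng.normal(size=(1, frames, 32))  # stand-in for model output
probs = softmax(fake_logits)

print(frames, probs.shape)
```

Under these assumptions one second of audio yields 49 frames, so each frame covers roughly 20 ms. In the notebook, `torch.softmax(logits, dim=-1)` performs the same normalization on the real logits.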

**8: Decode Predicted Token IDs to Text**

Uses argmax to select the most probable token ID at each time step, then converts the predicted token IDs into human-readable text with the Wav2Vec2 tokenizer’s decode method.

In [None]:
predicted_ids = torch.argmax(logits, dim=-1)

In [None]:
transcriptions = tokenizer.decode(predicted_ids[0])
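The argmax followed by `decode` amounts to greedy CTC decoding: pick the best token per frame, collapse consecutive repeats, drop the blank token, and turn the word delimiter into a space. The sketch below illustrates that logic in plain Python. The tiny vocabulary, the token IDs, and the helper names are hypothetical and do not match the real checkpoint's vocabulary.

```python
from itertools import groupby

# Hypothetical toy vocabulary; index 0 plays the role of the CTC blank token.
VOCAB = ["<pad>", "|", "C", "A", "T", "S"]
BLANK_ID = 0
WORD_DELIMITER = "|"


def greedy_ctc_decode(frame_scores):
    """Decode a list of per-frame score lists into text."""
    # 1. Argmax: most probable token ID for each frame.
    best_ids = [max(range(len(scores)), key=lambda i: scores[i])
                for scores in frame_scores]
    # 2. Collapse runs of the same ID into a single ID.
    collapsed = [token_id for token_id, _ in groupby(best_ids)]
    # 3. Drop blanks, map IDs to characters, and restore spaces.
    chars = [VOCAB[token_id] for token_id in collapsed if token_id != BLANK_ID]
    return "".join(chars).replace(WORD_DELIMITER, " ").strip()


def one_hot(token_id):
    """Score list that puts all weight on one token (for the demo only)."""
    return [1.0 if i == token_id else 0.0 for i in range(len(VOCAB))]


# Frames predicting: C C <pad> A A T <pad> <pad>
frames = [one_hot(i) for i in [2, 2, 0, 3, 3, 4, 0, 0]]
print(greedy_ctc_decode(frames))

# A blank between two identical tokens keeps both: A T <pad> T
double_t = [one_hot(i) for i in [3, 4, 0, 4]]
print(greedy_ctc_decode(double_t))
```

The blank token is what lets CTC output doubled letters: without the blank between the two `T` frames, the repeats would collapse into a single character.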

**9: Display Final Transcription**

Displays the final transcription produced by the speech-to-text model.

In [None]:
transcriptions