## **Wav2Vec (Transformer-based)**

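Wav2Vec 2.0 is a transformer-based model that learns speech representations from raw audio through self-supervised pretraining. Fine-tuned with a CTC head, as in the `facebook/wav2vec2-base-960h` checkpoint used below, it performs speech-to-text transcription.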

**Imports**

In [1]:
!pip install transformers torch librosa






In [2]:
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC
import torch
import librosa




**Load Wav2Vec2 model and processor**

In [3]:
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

Some weights of Wav2Vec2ForCTC were not initialized from the model checkpoint at facebook/wav2vec2-base-960h and are newly initialized: ['wav2vec2.masked_spec_embed']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


**Load audio and preprocess**

In [4]:
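# Resample to 16 kHz on load, the rate this checkpoint expects
audio, rate = librosa.load("hello-278029.mp3", sr=16000)
# Convert the waveform into a batched PyTorch tensor for the model
inputs = processor(audio, sampling_rate=rate, return_tensors="pt", padding=True)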


In [5]:
# Optional: record audio from a microphone instead of loading a file.
# Uncomment to use (requires `pip install sounddevice soundfile`).
# import sounddevice as sd
# import soundfile as sf
#
# # 1. Record audio
# fs = 16000   # Sample rate in Hz (must match the rate the model expects)
# seconds = 5  # Recording duration in seconds
#
# print("Recording audio...")
# recording = sd.rec(int(seconds * fs), samplerate=fs, channels=1)
# sd.wait()  # Block until the recording is finished
# print("Recording finished.")
#
# # Optional: save the recording for debugging or inspection
# sf.write("recording.wav", recording, fs)
#
# # 2. Preprocess the recorded audio the same way as the file-based input
# audio = recording.flatten()  # Shape (n, 1) -> (n,): the processor expects 1D audio
# rate = fs
#
# inputs = processor(audio, sampling_rate=rate, return_tensors="pt", padding=True)



**Perform inference**

In [6]:
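# Disable gradient tracking; it is not needed for inference
with torch.no_grad():
    logits = model(**inputs).logits
# Greedy decoding: pick the most likely token at each time step
predicted_ids = torch.argmax(logits, dim=-1)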
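
The `logits` tensor has one row per output frame, not one per audio sample. As a rough sketch of where the frame count comes from, the snippet below applies the standard convolution output-length formula across the layers of the feature encoder. The kernel and stride tuples are assumed to match the default `conv_kernel` and `conv_stride` of the base Wav2Vec2 configuration, and `num_output_frames` is a hypothetical helper written for illustration, not part of `transformers`.

```python
# Assumed to match the base Wav2Vec2 feature encoder configuration
CONV_KERNELS = (10, 3, 3, 3, 3, 2, 2)
CONV_STRIDES = (5, 2, 2, 2, 2, 2, 2)


def num_output_frames(num_samples):
    """Estimate how many frames the feature encoder emits for a waveform.

    Hypothetical helper: applies floor((L - kernel) / stride) + 1 per layer,
    i.e. the output length of an unpadded 1D convolution.
    """
    length = num_samples
    for kernel, stride in zip(CONV_KERNELS, CONV_STRIDES):
        length = (length - kernel) // stride + 1
    return length


# One second of audio at 16 kHz
print(num_output_frames(16000))
```

Under these assumptions, one second of 16 kHz audio yields 49 frames, or roughly one frame every 20 ms. `predicted_ids` therefore holds one token id per frame, which is why a decoding step is still needed to turn it into text.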

**Decode transcription**

In [7]:
# batch_decode returns a list with one string per item in the batch
transcription = processor.batch_decode(predicted_ids)
print("Transcription:", transcription)

Transcription: ['HELLO']
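
To illustrate what the decoding step does conceptually, here is a minimal sketch of greedy CTC decoding: consecutive repeated ids are merged, then blank tokens are dropped. The toy vocabulary, the frame ids, and the helper `ctc_greedy_collapse` are all hypothetical and exist only for this example. Treating id 0 as the blank token is an assumption here; the real mapping comes from the processor's tokenizer.

```python
from itertools import groupby


def ctc_greedy_collapse(ids, blank_id=0):
    """Merge consecutive repeated ids, then remove blank tokens.

    Hypothetical helper; blank_id=0 is an assumption for this sketch.
    """
    collapsed = [key for key, _ in groupby(ids)]
    return [i for i in collapsed if i != blank_id]


# Toy vocabulary for illustration only (not the checkpoint's real vocabulary)
toy_vocab = {0: "<pad>", 1: "H", 2: "E", 3: "L", 4: "O"}

# Made-up per-frame argmax output; the blank between the two runs of 3
# is what keeps the double "L" from being merged into one
frame_ids = [0, 1, 1, 0, 2, 3, 3, 0, 3, 4, 4, 0]

decoded = "".join(toy_vocab[i] for i in ctc_greedy_collapse(frame_ids))
print(decoded)
```

The blank token is what lets CTC represent genuinely repeated letters: without a blank separating them, adjacent identical predictions are treated as a single character held over several frames.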
