# Assignment: Speech to Text using Wav2Vec2

**Objective**: Convert a recorded audio file (.wav) into text using a pre-trained Wav2Vec2 model.

## Steps Followed:
- Installed the required libraries (torch, torchaudio, librosa, transformers, sounddevice).
- Loaded the pre-trained Facebook Wav2Vec2 model (`facebook/wav2vec2-base-960h`) and its processor.
- Recorded my own audio file at 16 kHz and loaded it with librosa at the same rate (librosa resamples if the file rate differs).
- Preprocessed the audio input with the processor.
- Passed the audio through the model and decoded the output into a transcription.
- Printed the transcription result.
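
The decoding part of the last steps is greedy CTC decoding: pick the most likely token for each audio frame, collapse consecutive repeats, then drop the blank token. In the notebook this is done by `torch.argmax` followed by `processor.batch_decode`. The sketch below only illustrates the idea on a made-up frame sequence; the function name `ctc_greedy_decode` is hypothetical, and the use of `<pad>` as the blank token and `|` as the word separator is my understanding of this checkpoint's vocabulary rather than something shown in the run.

```python
from itertools import groupby

BLANK = "<pad>"      # assumed CTC blank token
WORD_DELIM = "|"     # assumed word separator token

def ctc_greedy_decode(frame_tokens):
    """Collapse repeated frame tokens, drop blanks, and join into text."""
    # 1. Merge runs of identical consecutive tokens into a single token.
    collapsed = [token for token, _ in groupby(frame_tokens)]
    # 2. Remove blanks (a blank between two equal letters keeps both letters).
    chars = [token for token in collapsed if token != BLANK]
    # 3. Join characters and turn word separators into spaces.
    return "".join(chars).replace(WORD_DELIM, " ").strip()

# Made-up per-frame argmax tokens, one entry per audio frame.
frames = ["H", "H", "<pad>", "I", "I", "|", "|", "<pad>", "A", "L", "<pad>", "L"]
decoded = ctc_greedy_decode(frames)
print(decoded)
```

The blank between the two `L` frames is what lets the double letter in "ALL" survive the repeat-collapsing step.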

## Result:
The code ran without errors and printed a transcription of my audio. The transcription was close to the sentence I recorded, though a few words were misrecognized.


In [3]:
pip install torch torchaudio librosa transformers


Collecting librosa
  Using cached librosa-0.11.0-py3-none-any.whl.metadata (8.7 kB)
Collecting audioread>=2.1.9 (from librosa)
  Using cached audioread-3.0.1-py3-none-any.whl.metadata (8.4 kB)
Collecting soundfile>=0.12.1 (from librosa)
  Using cached soundfile-0.13.1-py2.py3-none-win_amd64.whl.metadata (16 kB)
Collecting pooch>=1.1 (from librosa)
  Using cached pooch-1.8.2-py3-none-any.whl.metadata (10 kB)
Collecting soxr>=0.3.2 (from librosa)
  Downloading soxr-0.5.0.post1-cp312-abi3-win_amd64.whl.metadata (5.6 kB)
Using cached librosa-0.11.0-py3-none-any.whl (260 kB)
Using cached audioread-3.0.1-py3-none-any.whl (23 kB)
Using cached pooch-1.8.2-py3-none-any.whl (64 kB)
Using cached soundfile-0.13.1-py2.py3-none-win_amd64.whl (1.0 MB)
Downloading soxr-0.5.0.post1-cp312-abi3-win_amd64.whl (164 kB)
Installing collected packages: soxr, audioread, soundfile, pooch, librosa
Successfully installed audioread-3.0.1 librosa-0.11.0 pooch-1.8.2 soundfile-0.13.1 soxr-0.5.0.post1
Note: you may need to restart the kernel to use updated packages.

In [21]:
pip install sounddevice

Collecting sounddevice
  Using cached sounddevice-0.5.1-py3-none-win_amd64.whl.metadata (1.4 kB)
Using cached sounddevice-0.5.1-py3-none-win_amd64.whl (363 kB)
Installing collected packages: sounddevice
Successfully installed sounddevice-0.5.1
Note: you may need to restart the kernel to use updated packages.


### My Recorded Sentence:
_"Hello, this is Jack, it's lovely to meet you."_

### Model's Transcription Output:


In [46]:
# Import required libraries (installed in the cells above)

import torch
import librosa
import sounddevice as sd
from scipy.io.wavfile import write
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

# Step 1: Record Audio

def record_audio(filename="recording_test.wav", duration=5, fs=16000):
    print(f"🎙️ Recording for {duration} seconds...")
    recording = sd.rec(int(duration * fs), samplerate=fs, channels=1, dtype='float32')
    sd.wait()
    write(filename, fs, recording)
    print(f"✅ Recording saved as: {filename}")

# Step 2: Load Model and Processor

def load_model_and_processor():
    print("\n🔄 Loading model and processor...")
    processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
    # ignore_mismatched_sizes is not needed for this checkpoint and could hide a real mismatch
    model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")
    print("✅ Model loaded successfully!\n")
    return processor, model

# Step 3: Transcribe Audio

def transcribe_audio(file_path, processor, model):
    print(f"🎵 Loading audio from: {file_path}")
    speech_array, sampling_rate = librosa.load(file_path, sr=16000)
    # Peak-normalize the volume (optional); skip silent recordings to avoid dividing by zero
    peak = abs(speech_array).max()
    if peak > 0:
        speech_array = speech_array / peak
    inputs = processor(speech_array, return_tensors="pt", sampling_rate=16000)

    with torch.no_grad():
        logits = model(inputs.input_values).logits

    predicted_ids = torch.argmax(logits, dim=-1)
    transcription = processor.batch_decode(predicted_ids)[0]

    print("\n📝 Transcription Result:")
    print("---------------------------------")
    print(transcription)
    print("---------------------------------")
    return transcription

# Step 4: Execute Everything

if __name__ == "__main__":
    filename = "recording_test.wav"
    duration = 10  # Record for 10 seconds (you can change this if you want)

    # 1. Record
    record_audio(filename=filename, duration=duration)

    # 2. Load Model
    processor, model = load_model_and_processor()

    # 3. Transcribe
    transcribe_audio(filename, processor, model)


🎙️ Recording for 10 seconds...
✅ Recording saved as: recording_test.wav

🔄 Loading model and processor...


Some weights of Wav2Vec2ForCTC were not initialized from the model checkpoint at facebook/wav2vec2-base-960h and are newly initialized: ['wav2vec2.masked_spec_embed']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


✅ Model loaded successfully!

🎵 Loading audio from: recording_test.wav

📝 Transcription Result:
---------------------------------
HI THIS IS JACK IT'S VERY LOUDLY TO MEET YOU ALL
---------------------------------
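
To put a number on how close this output is to my recorded sentence, the word error rate (WER) can be computed: the word-level edit distance divided by the number of words in the reference. This is a minimal sketch and was not part of the original run. The helper names `normalize` and `word_edit_distance` are illustrative, and the normalization (uppercase, punctuation removed except apostrophes) is an assumption chosen to match the model's output style.

```python
import re

def normalize(text):
    """Uppercase the text, keep letters/apostrophes/spaces, and split into words."""
    return re.sub(r"[^A-Z' ]", " ", text.upper()).split()

def word_edit_distance(ref_words, hyp_words):
    """Levenshtein distance over words (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]

reference = normalize("Hello, this is Jack, it's lovely to meet you.")
hypothesis = normalize("HI THIS IS JACK IT'S VERY LOUDLY TO MEET YOU ALL")

errors = word_edit_distance(reference, hypothesis)
wer = errors / len(reference)
print(f"{errors} word errors over {len(reference)} reference words, WER = {wer:.2f}")
```

For these two sentences this gives 4 word errors over 9 reference words (WER of about 0.44): HELLO became HI, LOVELY became VERY LOUDLY, and ALL was added at the end.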
