# Audio Gender Classification Pipeline

*This is the third of three notebooks. It covers model inference.*

## 1. Audio Recording
- **Step:** Record audio directly from the microphone.
- **Details:** The function records mono audio for a specified duration in seconds (6 seconds at 16 kHz by default) and saves the result as a `.wav` file.
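
The recording step needs a microphone, so the sketch below substitutes a synthetic sine tone to show the two parts that matter downstream: the frame count (`duration * sample_rate`) and the mono `.wav` layout. It uses only the standard-library `wave` module and 16-bit PCM, whereas the notebook records float32 samples with `sounddevice` and writes them with `scipy`. `write_tone_wav` and `wav_info` are hypothetical helper names, not part of the notebook.

```python
import math
import os
import struct
import tempfile
import wave


def write_tone_wav(path, duration=6, sample_rate=16000, freq_hz=220.0):
    """Write a mono 16-bit PCM sine tone; stands in for microphone capture."""
    n_frames = int(duration * sample_rate)  # same frame count as a timed recording
    samples = (
        int(0.5 * 32767 * math.sin(2 * math.pi * freq_hz * i / sample_rate))
        for i in range(n_frames)
    )
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)            # mono
        wf.setsampwidth(2)            # 2 bytes per sample = 16-bit PCM
        wf.setframerate(sample_rate)
        wf.writeframes(b"".join(struct.pack("<h", s) for s in samples))
    return n_frames


def wav_info(path):
    """Return (channels, sample_rate, n_frames) read back from the file header."""
    with wave.open(path, "rb") as wf:
        return wf.getnchannels(), wf.getframerate(), wf.getnframes()


# Write a 1-second tone to the temp directory and inspect it.
demo_path = os.path.join(tempfile.gettempdir(), "tone_demo.wav")
frames = write_tone_wav(demo_path, duration=1)
print(frames, wav_info(demo_path))
```

A file written this way can stand in for `recorded_audio.wav` when checking file handling, although a pure tone should not be expected to trigger speech detection.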

## 2. Voice Activity Detection (VAD)
- **Step:** Detect if there's any speech in the recorded audio using the Silero VAD model.
- **Details:** 
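  - The function loads the `.wav` file.
  - Stereo audio is converted to mono if necessary.
  - The signal is denoised with `noisereduce` and peak-normalized to [-1, 1].
  - Silero VAD looks for speech segments and the function returns their timestamps (an empty list if no speech is found).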
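
As a minimal sketch of what can be done with the returned timestamps, the helpers below convert them to seconds and total the detected speech. This assumes the Silero default output format: a list of dicts with `start` and `end` keys measured in samples. `speech_seconds` and `total_speech_duration` are hypothetical names, and the example values are made up.

```python
def speech_seconds(speech_timestamps, sampling_rate=16000):
    """Convert sample-index segments to (start_s, end_s) tuples in seconds."""
    return [
        (seg["start"] / sampling_rate, seg["end"] / sampling_rate)
        for seg in speech_timestamps
    ]


def total_speech_duration(speech_timestamps, sampling_rate=16000):
    """Sum the length of all detected speech segments, in seconds."""
    return sum(
        end - start
        for start, end in speech_seconds(speech_timestamps, sampling_rate)
    )


# Hypothetical detector output for a 6-second clip recorded at 16 kHz.
example = [{"start": 8000, "end": 40000}, {"start": 56000, "end": 88000}]
print(speech_seconds(example))
print(total_speech_duration(example))
```

A total well below the recording length is a hint that most of the clip is silence, which may be worth checking before trusting a prediction.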

## 3. Audio Preprocessing
- **Step:** Preprocess the audio for gender classification.
- **Details:** 
  - The audio is resampled to 16 kHz if necessary.
  - Stereo audio is converted to mono.
  - The waveform is passed through the Wav2Vec2 processor, which turns the raw samples into the `input_values` tensor the model expects.
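
The two waveform operations above can be illustrated without any dependencies. This is a sketch only: the notebook uses `torch.mean` and `torchaudio.transforms.Resample`, while the version below swaps in plain linear interpolation, which applies no anti-aliasing filter and is therefore a poor choice for real downsampling. `to_mono` and `resample_linear` are hypothetical names.

```python
def to_mono(channels):
    """Average equal-length channels sample by sample (channels: list of lists)."""
    n = len(channels)
    return [sum(frame) / n for frame in zip(*channels)]


def resample_linear(samples, src_rate, dst_rate=16000):
    """Resample by linear interpolation; a simple stand-in for a real resampler."""
    if src_rate == dst_rate or not samples:
        return list(samples)
    n_out = int(len(samples) * dst_rate / src_rate)
    out = []
    for i in range(n_out):
        pos = i * src_rate / dst_rate          # fractional index into the source
        lo = int(pos)
        hi = min(lo + 1, len(samples) - 1)     # clamp at the last sample
        frac = pos - lo
        out.append(samples[lo] * (1 - frac) + samples[hi] * frac)
    return out


stereo = [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0]]
mono = to_mono(stereo)
print(mono)
print(resample_linear([0.0, 1.0, 2.0, 3.0], src_rate=32000, dst_rate=16000))
```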

## 4. Gender Prediction
- **Step:** Predict the gender of the speaker using a fine-tuned Wav2Vec2 model.
- **Details:** 
  - The model processes the input audio and outputs logits.
  - The predicted label is the index of the largest logit (argmax): index 0 maps to Male and index 1 to Female.
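
To make the logits-to-label step concrete, here is a dependency-free sketch that also reports a softmax probability alongside the label. The notebook itself only takes the argmax. The label order in `LABELS` is an assumption that mirrors the notebook's `0 -> Male` mapping, and `softmax` and `label_from_logits` are hypothetical helpers.

```python
import math

# Assumed index order, matching the notebook's mapping of 0 -> Male.
LABELS = ("Male", "Female")


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def label_from_logits(logits, labels=LABELS):
    """Return (label, probability) for the highest-scoring class."""
    probs = softmax(logits)
    idx = max(range(len(probs)), key=probs.__getitem__)  # argmax
    return labels[idx], probs[idx]


label, prob = label_from_logits([2.0, -1.0])
print(f"{label} ({prob:.3f})")
```

Reporting the probability makes borderline predictions visible, although softmax scores from a fine-tuned classifier are not necessarily well calibrated.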

## 5. Pipeline Execution
- **Step:** Run the entire pipeline from audio recording to gender prediction.
- **Details:** 
  - The function first detects voice activity in the audio.
  - If speech is detected, the gender of the speaker is predicted and printed.
  - If no speech is detected, gender prediction is skipped.
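
The gating logic can be sketched independently of the models by passing the detector and classifier in as callables. This variant returns the label instead of printing it, which is a design choice made here for testability and not how the notebook's `main_pipeline` behaves. `run_pipeline` and the stub functions are hypothetical.

```python
from typing import Callable, Optional, Sequence


def run_pipeline(
    audio_path: str,
    detect: Callable[[str], Sequence],
    classify: Callable[[str], str],
) -> Optional[str]:
    """Classify only when the detector reports at least one speech segment."""
    if not detect(audio_path):
        return None  # no speech: skip the classifier entirely
    return classify(audio_path)


# Stubs standing in for the VAD and the Wav2Vec2 classifier.
def always_speech(path):
    return [{"start": 0, "end": 16000}]


def always_male(path):
    return "Male"


print(run_pipeline("recorded_audio.wav", always_speech, always_male))
```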


In [2]:
import sounddevice as sd
import numpy as np
import noisereduce as nr
import scipy.io.wavfile as wavfile
import torchaudio
import torch
from transformers import Wav2Vec2Processor, Wav2Vec2ForSequenceClassification

In [3]:
def record_audio(duration=6, sample_rate=16000, output_file="recorded_audio.wav"):
    print(f"Recording for {duration} seconds...")
    audio_data = sd.rec(int(duration * sample_rate), samplerate=sample_rate, channels=1, dtype=np.float32)
    sd.wait()  # Wait until the recording is finished
    wavfile.write(output_file, sample_rate, audio_data)
    print(f"Recording saved as {output_file}")

In [4]:
def load_wav_file(file_path):
    wav, sr = torchaudio.load(file_path)
    return wav, sr

In [5]:
model_vad, utils = torch.hub.load(repo_or_dir='snakers4/silero-vad', 
                                  model='silero_vad', 
                                  force_reload=True, 
                                  trust_repo=True)

# Extract utilities from Silero VAD
(get_speech_timestamps, save_audio, read_audio, VADIterator, collect_chunks) = utils

Downloading: "https://github.com/snakers4/silero-vad/zipball/master" to C:\Users\Mostafa/.cache\torch\hub\master.zip


In [6]:
def detect_voice_activity(wav_file_path):
    wav, sr = load_wav_file(wav_file_path)
    
    # Convert stereo to mono 
    if wav.shape[0] > 1:
        wav = torch.mean(wav, dim=0, keepdim=True)
    
    wav_np = wav.numpy()[0]

    # Suppress background noise before running VAD
    wav_np_denoised = nr.reduce_noise(y=wav_np, sr=sr)

    # Peak-normalize to [-1, 1]; skip an all-zero clip to avoid dividing by zero
    peak = np.max(np.abs(wav_np_denoised))
    if peak > 0:
        wav_np_denoised = wav_np_denoised / peak

    # Silero VAD is called with a 1-D float tensor, so no channel dimension is added
    wav = torch.tensor(wav_np_denoised, dtype=torch.float32)

    speech_timestamps = get_speech_timestamps(wav, model_vad, sampling_rate=sr)
    
    if speech_timestamps:
        print("Voice detected in the audio.")
    else:
        print("No voice detected in the audio.")
    
    return speech_timestamps

In [7]:
def predict_gender(audio_path):
    speech_array, sampling_rate = torchaudio.load(audio_path)
    
    if sampling_rate != 16000:  # assuming Wav2Vec2 expects 16kHz
        resampler = torchaudio.transforms.Resample(sampling_rate, 16000)
        speech_array = resampler(speech_array)
    
    # Convert stereo to mono if needed (Wav2Vec2 expects single channel audio)
    if speech_array.shape[0] > 1:
        speech_array = torch.mean(speech_array, dim=0, keepdim=True)
    
    # Drop the channel dimension: [1, sequence_length] -> [sequence_length]
    speech_array = speech_array.squeeze(0)

    # The processor prepares the waveform and returns a batched tensor of shape [1, sequence_length]
    inputs = processor(speech_array.numpy(), sampling_rate=16000, return_tensors="pt", padding=True)
    
    with torch.no_grad():
        logits = model(**inputs).logits
        predicted_label = torch.argmax(logits, dim=-1).item()

    # This mapping must match the label order used when the model was fine-tuned
    return "Male" if predicted_label == 0 else "Female"

In [8]:
def main_pipeline(audio_path):
    # Step 1: Detect voice activity using Silero VAD
    speech_timestamps = detect_voice_activity(audio_path)
    
    # Step 2: If voice is detected, predict the gender
    if speech_timestamps:
        predicted_gender = predict_gender(audio_path)
        print(f"Predicted Gender: {predicted_gender}")
    else:
        print("Skipping gender prediction due to no voice detected.")

In [9]:
# Load the saved Wav2Vec2 gender classification model and processor
model_dir = "model_output"
model = Wav2Vec2ForSequenceClassification.from_pretrained(model_dir)
processor = Wav2Vec2Processor.from_pretrained(model_dir)

record_audio(duration=6, sample_rate=16000, output_file="recorded_audio.wav")

audio_path = "recorded_audio.wav"
main_pipeline(audio_path)


Recording for 6 seconds...
Recording saved as recorded_audio.wav
Voice detected in the audio.

Predicted Gender: Male
