# Audio Gender Classification Pipeline

*This is the second of three notebooks on fine-tuning the Wav2Vec2 model.*

This pipeline downloads and extracts the dataset, preprocesses the audio files, fine-tunes a Wav2Vec2 model for gender classification, and tests the resulting model.

### Step 1: Download Dataset
The dataset is downloaded from Google Drive with `gdown`, using the file's Drive ID. It is a ZIP archive (about 3.2 GB) containing audio recordings of male and female voices.

### Step 2: Extract Dataset
Once downloaded, the ZIP file is extracted into a working directory. The extracted data folder contains one subfolder of recordings per class, `male` and `female`.
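
Before extracting a multi-gigabyte archive, it can help to confirm that it has the layout the later steps expect. The sketch below is an illustration rather than part of the original notebook: `count_wavs_per_label` is a hypothetical helper, and it assumes the class folders are named `male` and `female` and hold `.wav` files, as the loading code in this notebook does.

```python
import zipfile
from pathlib import PurePosixPath


def count_wavs_per_label(zip_path, labels=("male", "female")):
    """Count .wav entries whose immediate parent folder is one of `labels`."""
    counts = {label: 0 for label in labels}
    with zipfile.ZipFile(zip_path) as zf:
        for name in zf.namelist():
            # Skip directory entries and anything that is not a .wav file
            if not name.endswith(".wav"):
                continue
            parent = PurePosixPath(name).parent.name
            if parent in counts:
                counts[parent] += 1
    return counts
```

A class with a count of zero would suggest the archive uses different folder names than the loader expects.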

### Step 3: Load and Prepare Audio Files
Audio files (`.wav`) are loaded from the extracted folder, and each one is labeled by the folder it sits in, either `male` or `female`. The labels are then converted to integers: 0 for male, 1 for female.
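
The folder-to-label mapping described above can be kept in one place so that training and prediction cannot drift apart. This is a minimal sketch, not the notebook's own code: `LABEL2ID`, `ID2LABEL`, and `list_labeled_files` are hypothetical names, and the 0/1 assignment mirrors the encoding used in this notebook.

```python
from pathlib import Path

# Same encoding as the notebook: 0 = male, 1 = female
LABEL2ID = {"male": 0, "female": 1}
ID2LABEL = {idx: name for name, idx in LABEL2ID.items()}


def list_labeled_files(data_dir):
    """Return one {"audio": path, "label": int} record per .wav file."""
    records = []
    for label_name, label_id in LABEL2ID.items():
        # Each class lives in its own subfolder of data_dir
        for wav_path in sorted(Path(data_dir, label_name).glob("*.wav")):
            records.append({"audio": str(wav_path), "label": label_id})
    return records
```

Hugging Face model configs also accept `id2label` and `label2id` entries, so a mapping like this could be stored with the model if desired.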

### Step 4: Preprocess Audio Files
Each file is resampled to 16 kHz when its sampling rate differs, because Wav2Vec2 expects 16 kHz input. Only the first channel is kept, and the processor then converts the waveform into the `input_values` the model consumes.
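
Resampling changes the number of samples in a clip but not its duration. The worked example below illustrates the arithmetic only; it does not resample audio. `resampled_length` is a hypothetical helper, and rounding the result up is an assumption here, since the exact rounding depends on the resampler in use.

```python
TARGET_SR = 16_000  # sampling rate expected by Wav2Vec2


def resampled_length(num_samples, orig_sr, target_sr=TARGET_SR):
    """Approximate sample count after resampling (ceiling division)."""
    return (num_samples * target_sr + orig_sr - 1) // orig_sr


def duration_seconds(num_samples, sr):
    """Clip duration in seconds at the given sampling rate."""
    return num_samples / sr


# One second of 44.1 kHz audio is still one second at 16 kHz
one_second = resampled_length(44_100, 44_100)
print(one_second, duration_seconds(one_second, TARGET_SR))
```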

### Step 5: Split the Dataset
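The dataset is split 80/20 into training and test sets (14,360 and 3,591 recordings in this run). The training set is used to fine-tune the model, and the test set is used to evaluate it. In the code, the split happens before label encoding and preprocessing, so both steps are applied to each subset separately.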
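
The notebook splits at random without a fixed seed, so the subsets differ between runs and class proportions are not guaranteed to match. One alternative, sketched here in plain Python under the assumption that each record is a dict with a `label` key, is a seeded stratified split. `stratified_split` is a hypothetical helper, not a function from the `datasets` library.

```python
import random
from collections import defaultdict


def stratified_split(records, test_size=0.2, seed=42):
    """Split records so each label keeps roughly the same test fraction."""
    by_label = defaultdict(list)
    for record in records:
        by_label[record["label"]].append(record)

    rng = random.Random(seed)  # fixed seed makes the split reproducible
    train, test = [], []
    for label in sorted(by_label):
        items = list(by_label[label])
        rng.shuffle(items)
        n_test = round(len(items) * test_size)
        test.extend(items[:n_test])
        train.extend(items[n_test:])

    # Shuffle again so the classes are interleaved within each subset
    rng.shuffle(train)
    rng.shuffle(test)
    return train, test
```

If reproducibility alone is the goal, `Dataset.train_test_split` also accepts a `seed` argument.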

### Step 6: Train the Model
The pretrained `facebook/wav2vec2-base-960h` checkpoint is fine-tuned on the preprocessed audio with a new two-class classification head. Training runs for three epochs, and accuracy on the test set is reported after each epoch.

### Step 7: Evaluate the Model
After training, the model is evaluated using the test dataset. This step checks how accurately the model can classify new, unseen audio recordings into male or female categories.
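
Accuracy alone does not show whether errors fall mostly on one class. The sketch below, which is an addition and not part of the notebook, computes accuracy and a count of each (true label, predicted label) pair from two plain lists of integers. `accuracy` and `confusion_counts` are hypothetical helper names.

```python
from collections import Counter


def accuracy(predictions, references):
    """Fraction of predictions that match the reference labels."""
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)


def confusion_counts(predictions, references):
    """Count occurrences of each (true_label, predicted_label) pair."""
    return Counter(zip(references, predictions))


# Toy example with the notebook's encoding: 0 = male, 1 = female
preds = [0, 1, 1, 0]
refs = [0, 1, 0, 0]
print(accuracy(preds, refs))
print(confusion_counts(preds, refs))
```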

### Step 8: Save the Model
Once the model is trained and evaluated, it is saved along with its processor. The model can now be used to predict the gender of new audio recordings.

### Step 9: Make Predictions
The saved model is used to predict gender from new audio files. The input audio is preprocessed similarly to the training data, and the model outputs whether the voice is male or female.
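
The prediction step reports only the winning class. A softmax over the two logits also gives a rough confidence score, as in this sketch. It is an illustrative addition in plain Python: `softmax` and `label_with_confidence` are hypothetical helpers, and the input is assumed to be a list of two floats in the notebook's label order (0 = male, 1 = female).

```python
import math

ID2LABEL = {0: "Male", 1: "Female"}  # same encoding as the notebook


def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def label_with_confidence(logits):
    """Return the predicted label name and its softmax probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return ID2LABEL[best], probs[best]
```

With a PyTorch logits tensor of shape `[1, 2]`, the list form can be obtained with `logits[0].tolist()`.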

### Step 10: Package and Share the Model
The saved model directory is also packaged into a ZIP file, making it easy to share or deploy elsewhere. In the cells below, packaging runs before the prediction cell.


In [1]:
!pip install gdown


Collecting gdown
  Downloading gdown-5.2.0-py3-none-any.whl.metadata (5.8 kB)
Downloading gdown-5.2.0-py3-none-any.whl (18 kB)
Installing collected packages: gdown
Successfully installed gdown-5.2.0


In [8]:
import gdown

# Build a direct-download URL from the Google Drive file ID
file_id = '1b2NV1yM0u9bv9B6SmbCRa1u8Tl-N7FUn'  # Drive ID of the dataset archive
gdown.download(f'https://drive.google.com/uc?export=download&id={file_id}', 'my_dataset.zip', quiet=False)

Downloading...
From (original): https://drive.google.com/uc?export=download&id=1b2NV1yM0u9bv9B6SmbCRa1u8Tl-N7FUn
From (redirected): https://drive.google.com/uc?export=download&id=1b2NV1yM0u9bv9B6SmbCRa1u8Tl-N7FUn&confirm=t&uuid=e266c698-7751-4375-9f1a-d3bd79eae07d
To: /kaggle/working/my_dataset.zip
100%|██████████| 3.20G/3.20G [00:42<00:00, 75.7MB/s]


'my_dataset.zip'

In [9]:
import zipfile
import os

zip_file_path = 'my_dataset.zip'
output_folder = '/kaggle/working/unzipped_folder'

os.makedirs(output_folder, exist_ok=True)

with zipfile.ZipFile(zip_file_path, 'r') as zip_ref:
    zip_ref.extractall(output_folder)

print(f"Contents extracted to {output_folder}.")


Contents extracted to /kaggle/working/unzipped_folder.


In [12]:
import os
import torch
import torchaudio
from transformers import Wav2Vec2ForSequenceClassification, Wav2Vec2Processor
from datasets import Dataset
from transformers import Trainer, TrainingArguments
import evaluate  

model_name = "facebook/wav2vec2-base-960h"  
num_labels = 2  # Male and Female
output_dir = "./model_output"

processor = Wav2Vec2Processor.from_pretrained(model_name)

def load_audio_files(data_dir):
    dataset = []
    for label in ["male", "female"]:
        label_dir = os.path.join(data_dir, label)
        audio_files = [f for f in os.listdir(label_dir) if f.endswith('.wav')]
        for audio_file in audio_files:
            dataset.append({"audio": os.path.join(label_dir, audio_file), "label": label})
    return dataset

data_dir = "/kaggle/working/unzipped_folder/Final Data"
audio_data = load_audio_files(data_dir)
dataset = Dataset.from_list(audio_data)

train_test_split = dataset.train_test_split(test_size=0.2)
train_dataset = train_test_split['train']
test_dataset = train_test_split['test']

# Convert labels to numerical values
def encode_labels(examples):
    examples['label'] = 1 if examples['label'] == 'female' else 0
    return examples

# Apply encoding to the train and test datasets
train_dataset = train_dataset.map(encode_labels)
test_dataset = test_dataset.map(encode_labels)

def preprocess_function(examples):
    # Load the waveform; shape is [channels, samples]
    audio, sr = torchaudio.load(examples["audio"])

    # Resample to the 16 kHz rate Wav2Vec2 expects
    if sr != 16000:
        resampler = torchaudio.transforms.Resample(orig_freq=sr, new_freq=16000)
        audio = resampler(audio)

    audio = audio[0].numpy()  # Keep the first channel as a 1-D NumPy array

    inputs = processor(audio, sampling_rate=16000, return_tensors="pt", padding=True)

    # Drop the batch dimension so each example stores a 1-D sequence
    inputs['input_values'] = inputs['input_values'].squeeze(0)

    # Carry the integer label through to the processed example
    inputs['label'] = examples['label']

    return inputs

train_dataset = train_dataset.map(preprocess_function)
test_dataset = test_dataset.map(preprocess_function)

# Remove the "audio" column (no longer needed after preprocessing)
train_dataset = train_dataset.remove_columns(["audio"])
test_dataset = test_dataset.remove_columns(["audio"])

model = Wav2Vec2ForSequenceClassification.from_pretrained(model_name, num_labels=num_labels)

# Data collator for padding
def data_collator(features):
    # Extract input values and labels
    input_values = [f["input_values"] for f in features]
    labels = [f["label"] for f in features]

    # Pad input values
    batch = processor.pad(
        {"input_values": input_values},
        padding=True,
        return_tensors="pt"
    )

    # Add labels to batch
    batch["labels"] = torch.tensor(labels, dtype=torch.long)

    return batch

accuracy_metric = evaluate.load("accuracy")  # Load accuracy metric

def compute_metrics(pred):
    predictions = pred.predictions.argmax(-1)
    labels = pred.label_ids
    accuracy = accuracy_metric.compute(predictions=predictions, references=labels)
    return {"accuracy": accuracy["accuracy"]}

training_args = TrainingArguments(
    output_dir=output_dir,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    logging_dir='./logs',
    logging_steps=10,
    evaluation_strategy="epoch",  # Evaluate after every epoch (renamed eval_strategy in newer transformers releases)
    save_steps=500,  # Save a checkpoint every 500 steps
    learning_rate=5e-5,
    weight_decay=0.01,   # Regularization with weight decay
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,  
    data_collator=data_collator,
    compute_metrics=compute_metrics,  
)

trainer.train()

evaluation_results = trainer.evaluate()
print("Evaluation results:", evaluation_results)

trainer.save_model(output_dir)
processor.save_pretrained(output_dir)  # Save the processor as well
print(f"Model saved to {output_dir}")


Map:   0%|          | 0/14360 [00:00<?, ? examples/s]

Map:   0%|          | 0/3591 [00:00<?, ? examples/s]

Map:   0%|          | 0/14360 [00:00<?, ? examples/s]

Map:   0%|          | 0/3591 [00:00<?, ? examples/s]

Some weights of Wav2Vec2ForSequenceClassification were not initialized from the model checkpoint at facebook/wav2vec2-base-960h and are newly initialized: ['classifier.bias', 'classifier.weight', 'projector.bias', 'projector.weight', 'wav2vec2.encoder.pos_conv_embed.conv.parametrizations.weight.original0', 'wav2vec2.encoder.pos_conv_embed.conv.parametrizations.weight.original1', 'wav2vec2.masked_spec_embed']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



| Epoch | Training Loss | Validation Loss | Accuracy |
|-------|---------------|-----------------|----------|
| 1     | 0.3715        | 0.320929        | 0.901142 |
| 2     | 0.3049        | 0.220329        | 0.931774 |
| 3     | 0.1818        | 0.199163        | 0.93985  |



Evaluation results: {'eval_loss': 0.199162557721138, 'eval_accuracy': 0.9398496240601504, 'eval_runtime': 286.0688, 'eval_samples_per_second': 12.553, 'eval_steps_per_second': 0.395, 'epoch': 3.0}
Model saved to ./model_output


In [13]:
import shutil

folder_to_zip = '/kaggle/working/model_output'  # Update this with your folder path

zip_file_name = '/kaggle/working/model_output.zip'

shutil.make_archive(zip_file_name.replace('.zip', ''), 'zip', folder_to_zip)

print(f"Folder zipped successfully: {zip_file_name}")


Folder zipped successfully: /kaggle/working/model_output.zip


In [23]:
import torch
import torchaudio
from transformers import Wav2Vec2Processor, Wav2Vec2ForSequenceClassification

model_dir = "/kaggle/working/model_output"
model = Wav2Vec2ForSequenceClassification.from_pretrained(model_dir)
processor = Wav2Vec2Processor.from_pretrained(model_dir)


def predict_gender(audio_path):
    speech_array, sampling_rate = torchaudio.load(audio_path)

    # Resample to the 16 kHz rate used during training
    if sampling_rate != 16000:
        resampler = torchaudio.transforms.Resample(sampling_rate, 16000)
        speech_array = resampler(speech_array)

    # Keep the first channel only, matching the training preprocessing
    waveform = speech_array[0].numpy()

    # Pass sampling_rate explicitly so the processor can check it
    inputs = processor(waveform, sampling_rate=16000, return_tensors="pt", padding=True)

    with torch.no_grad():
        logits = model(**inputs).logits
        predicted_label = torch.argmax(logits, dim=-1).item()

    # Label encoding from training: 0 = male, 1 = female
    return "Male" if predicted_label == 0 else "Female"

audio_path = "/kaggle/input/gender-test/test/fe/74_combined.wav"  # Replace this with the path to your audio file
predicted_gender = predict_gender(audio_path)
print(f"Predicted Gender: {predicted_gender}")


Predicted Gender: Female
