# Voice Gender Recognition
Gender recognition by voice is a technique for determining the gender category of a speaker by processing their speech signal.

## Preparing the Dataset
We'll be using Mozilla's Common Voice dataset, a corpus of speech data read by users on the Common Voice website. Its purpose is to enable the training and testing of automatic speech recognition. However, when I looked at the dataset, I found that many of the samples are labeled in the gender column. Therefore, we can extract these labeled samples and perform gender recognition.

We won't be using raw audio data, since audio samples can be of any length and are often noisy. As a result, we need to perform some feature extraction before feeding the data into the neural network. Here are the steps I followed to prepare the dataset:

- First, I kept only the samples that have a label in the gender field.
- After that, I balanced the dataset so that the number of female samples equals the number of male samples; this helps keep the model from becoming biased toward one gender.
- Finally, I used Mel spectrogram extraction to get a vector of length 128 from each voice sample.
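
The balancing step in the list above can be illustrated on a toy table. This is only a minimal sketch: the helper name `undersample`, the file names, and the class counts are made up for illustration, and it uses pandas' `DataFrame.sample` rather than the `resample` call used in the notebook.

```python
import pandas as pd

def undersample(df, label_col="gender", seed=42):
    """Randomly drop rows from the larger classes until all classes match the smallest one."""
    n_min = df[label_col].value_counts().min()
    parts = [
        group.sample(n=n_min, random_state=seed)  # sample without replacement
        for _, group in df.groupby(label_col)
    ]
    return pd.concat(parts, ignore_index=True)

# Hypothetical metadata: 7 male clips and 3 female clips
toy = pd.DataFrame({
    "filename": [f"clip-{i:03d}.mp3" for i in range(10)],
    "gender": ["male"] * 7 + ["female"] * 3,
})

balanced = undersample(toy)
print(balanced["gender"].value_counts())
```

After undersampling, both classes have three rows each, so neither gender dominates the training data.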

In [1]:
import numpy as np
import librosa
from tqdm import tqdm

def extract_feature(file_name, **kwargs):
    """
    Extract feature from audio file `file_name`
        Features supported:
            - MFCC (mfcc)
            - Chroma (chroma)
            - MEL Spectrogram Frequency (mel)
            - Contrast (contrast)
            - Tonnetz (tonnetz)
        e.g:
        `features = extract_feature(path, mel=True, mfcc=True)`
    """
    mfcc = kwargs.get("mfcc")
    chroma = kwargs.get("chroma")
    mel = kwargs.get("mel")
    contrast = kwargs.get("contrast")
    tonnetz = kwargs.get("tonnetz")
    # librosa.load resamples to 22,050 Hz mono by default
    X, sample_rate = librosa.load(file_name)
    if chroma or contrast:
        stft = np.abs(librosa.stft(X))
    result = np.array([])
    if mfcc:
        mfccs = np.mean(librosa.feature.mfcc(y=X, sr=sample_rate, n_mfcc=40).T, axis=0)
        result = np.hstack((result, mfccs))
    if chroma:
        chroma = np.mean(librosa.feature.chroma_stft(S=stft, sr=sample_rate).T,axis=0)
        result = np.hstack((result, chroma))
    if mel:
        mel = np.mean(librosa.feature.melspectrogram(y=X, sr=sample_rate).T,axis=0)
        result = np.hstack((result, mel))
    if contrast:
        contrast = np.mean(librosa.feature.spectral_contrast(S=stft, sr=sample_rate).T,axis=0)
        result = np.hstack((result, contrast))
    if tonnetz:
        tonnetz = np.mean(librosa.feature.tonnetz(y=librosa.effects.harmonic(X), sr=sample_rate).T,axis=0)
        result = np.hstack((result, tonnetz))
    return result
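
The `np.mean(... .T, axis=0)` pattern used above is what turns clips of any duration into a fixed-length vector. The sketch below shows the idea with random arrays standing in for Mel spectrograms (an assumption made so the example runs without audio files); the helper name `time_average` is hypothetical, and 128 is librosa's default number of Mel bands.

```python
import numpy as np

n_mels = 128  # librosa's default number of Mel bands
rng = np.random.default_rng(0)

# Stand-ins for the Mel spectrograms of a short and a long clip,
# shaped (n_mels, n_frames) like the output of librosa.feature.melspectrogram
short_clip = rng.random((n_mels, 40))
long_clip = rng.random((n_mels, 900))

def time_average(spec):
    # Transpose to (n_frames, n_mels), then average over the frames (axis 0)
    return np.mean(spec.T, axis=0)

short_vec = time_average(short_clip)
long_vec = time_average(long_clip)
print(short_vec.shape, long_vec.shape)  # (128,) (128,)
```

Averaging over time discards the order of the frames, which is the price paid for getting one vector of the same size per clip.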

In [2]:
import glob
import os
import pandas as pd
from sklearn.utils import resample

def prepare_dataset():
    """
    This function will prepare and preprocess the dataset. The original dataset must be in the working directory
    in order to be preprocessed. After the execution of this function, the new dataset will be in the data directory.
    """
    
    dirname = "data"
    if not os.path.exists(os.path.join(dirname, "balanced_dataset.csv")):
        if not os.path.isdir(dirname):
            os.mkdir(dirname)

        csv_files = glob.glob("*.csv")

        balanced_dataset = pd.DataFrame()

        for csv_file in csv_files:
            print(f"[+] Preprocessing {csv_file}")
            df = pd.read_csv(csv_file)

            # only take filename and gender columns
            new_df = df[["filename", "gender"]]

            print("Previously:", len(new_df), "rows")
            # take only male & female genders (i.e. dropping NaNs & 'other' gender)
            new_df = new_df[new_df["gender"].isin(["female", "male"])]
            print("Now:", len(new_df), "rows")

            majority_class = new_df[new_df["gender"] == new_df["gender"].value_counts().idxmax()]
            minority_class = new_df[new_df["gender"] == new_df["gender"].value_counts().idxmin()]

            print("Before Balancing:")
            print(new_df["gender"].value_counts())

            undersampled_majority = resample(majority_class,
                                     replace=False,  # Without replacement
                                     n_samples=len(minority_class),  # Match minority class size
                                     random_state=42)  # Set random seed for reproducibility

            # Combine the undersampled majority class with the minority class
            balanced_data = pd.concat([undersampled_majority, minority_class])

            print("After Balancing:")
            print(balanced_data["gender"].value_counts())

            # Concatenate the balanced data from the current file onto the running balanced_dataset
            balanced_dataset = pd.concat([balanced_dataset, balanced_data], ignore_index=True)

            # get the folder name (the CSV file name without its extension)
            folder_name, _ = os.path.splitext(csv_file)
            audio_files = glob.glob(f"{folder_name}/{folder_name}/*")
            all_audio_filenames = set(balanced_data["filename"])

            for audio_file in tqdm(audio_files, f"Extracting features of {folder_name}"):
                head, tail = os.path.split(audio_file)
                audio_filename = f"{os.path.split(head)[-1]}/{tail}"
                if audio_filename in all_audio_filenames:
                    src_path = f"{folder_name}/{audio_filename}"
                    target_path = f"{dirname}/{audio_filename}"
                    # create the target folder if it doesn't exist
                    os.makedirs(os.path.dirname(target_path), exist_ok=True)
                    features = extract_feature(src_path, mel=True)
                    # drop the audio extension; np.save appends ".npy"
                    target_filename = os.path.splitext(target_path)[0]
                    np.save(target_filename, features)

        balanced_dataset.to_csv(os.path.join(dirname, "balanced_dataset.csv"), index=False)
    else:
        print(f"Balanced dataset already exists: {os.path.join(dirname, 'balanced_dataset.csv')}")
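
The function above stores each feature vector under the audio file's name with the extension removed, and `np.save` adds `.npy` on its own. The following sketch shows that round trip in a temporary directory; the helper `feature_path` and the sample file name are hypothetical, and a zero vector stands in for real features.

```python
import os
import tempfile
import numpy as np

def feature_path(out_dir, csv_filename):
    """Map a CSV entry such as 'cv-other-dev/sample-000000.mp3' to its feature file stem."""
    stem, _ext = os.path.splitext(csv_filename)
    return os.path.join(out_dir, stem)

with tempfile.TemporaryDirectory() as out_dir:
    # Hypothetical file name in the "<folder>/<file>" form the notebook expects
    target = feature_path(out_dir, "cv-other-dev/sample-000000.mp3")
    os.makedirs(os.path.dirname(target), exist_ok=True)

    np.save(target, np.zeros(128))       # np.save appends ".npy" to the stem
    loaded = np.load(target + ".npy")    # load it back the way a training script might
    saved_name = os.path.basename(target + ".npy")

print(saved_name, loaded.shape)  # sample-000000.npy (128,)
```

Keeping the same relative path in the CSV and in the `data` directory means a later loading step only has to swap the extension to find the features for a row.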


In [3]:
prepare_dataset()

[+] Preprocessing cv-other-dev.csv
Previously: 3022 rows
Now: 1342 rows
Before Balancing:
male      1034
female     308
Name: gender, dtype: int64
After Balancing:
male      308
female    308
Name: gender, dtype: int64


Extracting features of cv-other-dev: 100%|████████████████████████████████████████| 3022/3022 [00:18<00:00, 162.47it/s]


[+] Preprocessing cv-other-test.csv
Previously: 2961 rows
Now: 1272 rows
Before Balancing:
male      1001
female     271
Name: gender, dtype: int64
After Balancing:
male      271
female    271
Name: gender, dtype: int64


Extracting features of cv-other-test: 100%|███████████████████████████████████████| 2961/2961 [00:18<00:00, 161.19it/s]


[+] Preprocessing cv-other-train.csv
Previously: 145135 rows
Now: 63253 rows
Before Balancing:
male      49398
female    13855
Name: gender, dtype: int64
After Balancing:
male      13855
female    13855
Name: gender, dtype: int64


Extracting features of cv-other-train: 100%|██████████████████████████████████| 145135/145135 [11:55<00:00, 202.75it/s]


[+] Preprocessing cv-valid-dev.csv
Previously: 4076 rows
Now: 1529 rows
Before Balancing:
male      1135
female     394
Name: gender, dtype: int64
After Balancing:
male      394
female    394
Name: gender, dtype: int64


Extracting features of cv-valid-dev: 100%|████████████████████████████████████████| 4076/4076 [00:19<00:00, 213.83it/s]


[+] Preprocessing cv-valid-test.csv
Previously: 3995 rows
Now: 1529 rows
Before Balancing:
male      1137
female     392
Name: gender, dtype: int64
After Balancing:
male      392
female    392
Name: gender, dtype: int64


Extracting features of cv-valid-test: 100%|███████████████████████████████████████| 3995/3995 [00:18<00:00, 213.30it/s]


[+] Preprocessing cv-valid-train.csv
Previously: 195776 rows
Now: 73278 rows
Before Balancing:
male      55029
female    18249
Name: gender, dtype: int64
After Balancing:
male      18249
female    18249
Name: gender, dtype: int64


Extracting features of cv-valid-train: 100%|██████████████████████████████████| 195776/195776 [15:54<00:00, 205.12it/s]
