# 🎤 Gender Classification from Speech using Machine Learning

This notebook demonstrates how to build a machine learning pipeline for **speech classification** — specifically, classifying whether a speaker is male or female based on their voice recordings.

We'll walk through the following steps:

---

## 📌 Objectives

- Preprocess audio data to improve quality and consistency.
- Extract relevant features from speech signals.
- Train and compare machine learning models (Logistic Regression, Random Forest, XGBoost, Gaussian Naive Bayes) for classification.
- Evaluate model performance using accuracy and classification metrics.

---

## 🛠️ Techniques Used

### 🔉 Audio Preprocessing
- **Trimming**: Remove silence from the beginning and end of the audio.
- **Noise Reduction**: Suppress background noise such as hums and hisses.
- **Normalization**: Ensure audio levels are consistent across files.
- **Resampling**: Convert all files to a consistent sampling rate.
- **Padding/Truncating**: Ensure all inputs have the same length.

### 📊 Feature Extraction
We extract features that capture characteristics of the speaker's voice:
- **MFCCs (Mel-Frequency Cepstral Coefficients)**: Capture timbral and phonetic content.
- **Spectral Centroid**: Measures the "center of mass" of frequencies.
- **Spectral Rolloff**: Frequency below which a set percentage (e.g., 85%) of the energy is contained.
- **Zero-Crossing Rate**: Counts how often the signal changes sign — higher for noisy or unvoiced sounds.
- **RMS Energy**: Captures the loudness of the signal.

---

## 🧠 Model
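We start with a **Logistic Regression** model that classifies audio based on the extracted features, and also train Random Forest, XGBoost, and Gaussian Naive Bayes models for comparison. Each model is trained on labeled examples of male and female voices.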

---

## 📈 Evaluation
The final model is evaluated using:
- **Accuracy**
- **Precision, Recall, F1-Score**
- **Confusion Matrix**
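
As a rough illustration of how these metrics relate to the confusion matrix, the sketch below computes them from the four cell counts of a binary confusion matrix. The helper name `binary_metrics` and the counts are hypothetical and are not results from this notebook.

```python
def binary_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Hypothetical counts for illustration only
scores = binary_metrics(tp=8, fp=2, fn=1, tn=9)
print(scores)
```

`classification_report` from scikit-learn reports the same quantities per class, so this is only meant to make the definitions concrete.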

---

Let's get started!


In [None]:
!pip install noisereduce

In [None]:
!pip install xgboost


# Importing Libraries

In [None]:
# Standard Library
import os
from collections import Counter

# Numerical and Data Processing
import numpy as np
import pandas as pd

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Audio Processing
import librosa
import librosa.display  # explicit import so specshow is available on older librosa releases
import librosa.effects
import noisereduce as nr
import IPython.display as ipd

# Machine Learning Models
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier

# Model Evaluation
from sklearn.metrics import confusion_matrix, classification_report

# Model Utilities
from sklearn.model_selection import train_test_split
from sklearn.utils import resample
from sklearn.decomposition import PCA

# Experimenting on a Sample

Let's take a single sample and use it for experimenting and visualizing.

In [None]:
sample = '/kaggle/input/gender-recognition-by-voiceoriginal/data/female/arctic_b0454.wav'

# Audio Processing

## 1. Load Audio File

In [None]:
ipd.Audio(sample)

In [None]:
x, sr = librosa.load(sample)

pd.Series(x).plot(figsize=(8, 3), lw=1)
plt.title('Audio Wave Plot')
plt.show()

## 2. ✂️ Trim Silence
Removes unnecessary silence from the beginning and end of the audio. This helps eliminate parts of the audio that contain no useful information.

We use the `librosa.effects.trim()` function. Its `top_db` parameter sets the silence threshold: anything more than `top_db` decibels below the loudest part of the signal is treated as silence.

"Decibels" is a logarithmic unit used to measure sound level.

* Lower `top_db` (e.g., `20`) → more aggressive trimming (even moderately quiet parts count as silence).
* Higher `top_db` (e.g., `60`) → more conservative trimming (only very quiet parts count as silence).

We will choose `35` which is somewhat in the middle
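
To make the role of the threshold concrete, here is a simplified NumPy sketch of threshold-based trimming. It works sample by sample on amplitudes, whereas `librosa.effects.trim()` compares frame-wise energy, so treat it as an approximation. The helper `trim_by_threshold` and the test signal are hypothetical.

```python
import numpy as np

def trim_by_threshold(signal, top_db):
    """Keep the span between the first and last sample louder than (peak - top_db) dB."""
    peak = np.max(np.abs(signal))
    threshold = peak * 10 ** (-top_db / 20)  # convert a dB drop into an amplitude ratio
    loud = np.flatnonzero(np.abs(signal) > threshold)
    if loud.size == 0:
        return signal[:0]
    return signal[loud[0]:loud[-1] + 1]

# Hypothetical signal: quiet lead-in (0.01), loud middle (1.0), quiet tail (0.01)
demo = np.concatenate([np.full(100, 0.01), np.full(200, 1.0), np.full(100, 0.01)])
len_20 = len(trim_by_threshold(demo, top_db=20))  # threshold 0.1: quiet parts are cut
len_60 = len(trim_by_threshold(demo, top_db=60))  # threshold 0.001: quiet parts are kept
print(len_20, len_60)
```

The quiet sections sit 40 dB below the peak, so they are removed with `top_db=20` and kept with `top_db=60`.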

In [None]:
trimmed_x, index = librosa.effects.trim(x, top_db=35)
pd.Series(trimmed_x).plot(figsize=(8, 2), lw=1)
plt.title('Trimmed Audio Wave Plot')
plt.show()

## 3. 🧼 Noise Reduction

Reduces background noise such as hums, hisses, or ambient sounds using filters or noise reduction algorithms.

* High-pass filters remove low-frequency noise.
* `noisereduce` library can automatically detect and reduce noise.

In [None]:
import noisereduce as nr

denoised_x = nr.reduce_noise(y=trimmed_x, sr=sr)

pd.Series(denoised_x).plot(figsize=(8, 2), lw=1)
plt.title('Denoised Audio Wave Plot')
plt.show()

## 4. 📈 Normalization

Ensures all audio signals are on the same volume scale by scaling the waveform so its peak is consistent across samples.
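
Peak normalization can be sketched in a few lines of NumPy. This mirrors what `librosa.util.normalize` does with its default settings (dividing by the maximum absolute value), but the helper `peak_normalize` below is a hypothetical stand-in, not the library code.

```python
import numpy as np

def peak_normalize(signal):
    """Scale the signal so its largest absolute sample equals 1."""
    peak = np.max(np.abs(signal))
    return signal if peak == 0 else signal / peak

quiet = np.array([0.1, -0.25, 0.05])  # hypothetical low-volume samples
normalized = peak_normalize(quiet)
print(normalized)
```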

In [None]:
normalized_x = librosa.util.normalize(denoised_x)

## 5. ⏱️ Resampling (Optional)

Resamples all audio to the same sampling rate (e.g. 16,000 Hz), which ensures uniformity across the dataset.

Sample rate: how many samples per second are used to represent the audio.

In [None]:
resampled_x = librosa.resample(x, orig_sr=sr, target_sr=16000)

## 6. 🧱 Padding or Truncating

In [None]:
desired_length = 30000

print("Length of our audio sample (x):", len(x))
resized_x = x

if len(x) < desired_length:
    resized_x = np.pad(x, (0, desired_length - len(x)))
else:
    resized_x = x[:desired_length]

print('Length of our resized sample:', len(resized_x))
pd.Series(resized_x).plot(figsize=(8, 2), lw=1)
plt.title('Resized Audio Wave Plot')
plt.show()

Let's look at the audio wave at a more zoomed-in level.

In [None]:
pd.Series(denoised_x[9000:9500]).plot(figsize=(8, 2), lw=1)
plt.title('Zoomed Audio Wave Plot')
plt.show()

# Feature Extraction

Now let's look at various feature extraction techniques

## 1. 📊 Spectrogram

Shows how the frequencies of the audio signal change over time.

* **X-axis**: time
* **Y-axis**: frequency
* **Color**: amplitude (loudness) of each frequency at that moment
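
Each column of a spectrogram is the magnitude spectrum of one short, windowed frame. The NumPy sketch below computes a single such column for a synthetic 1000 Hz tone; the sampling rate, frame length and tone are assumptions chosen for illustration.

```python
import numpy as np

demo_sr = 16000                        # assumed sampling rate
frame_length = 2048                    # same size as librosa's default n_fft
t = np.arange(frame_length) / demo_sr
frame = np.sin(2 * np.pi * 1000 * t)   # synthetic 1000 Hz tone

window = np.hanning(frame_length)      # taper the frame edges to limit spectral leakage
spectrum = np.abs(np.fft.rfft(frame * window))
freqs = np.fft.rfftfreq(frame_length, d=1.0 / demo_sr)

peak_hz = freqs[np.argmax(spectrum)]
print(len(spectrum), peak_hz)
```

A frame of 2048 samples yields `1 + 2048 // 2 = 1025` frequency bins, which is the first dimension of the STFT output when the default `n_fft` is used.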

In [None]:
transformed_x = librosa.stft(trimmed_x)
db = librosa.amplitude_to_db(abs(transformed_x))
db.shape

In [None]:
image = librosa.display.specshow(db, sr=sr, x_axis='time', y_axis='log')
plt.colorbar(image)
plt.title('Spectrogram of Audio Data')
plt.show()

## 2. 🎵 Mel Spectrogram

Similar to a regular spectrogram, but the frequency axis is scaled to match how humans hear (the Mel scale).

It focuses more on low to mid frequencies, which are most important for speech and music.
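
The compression of high frequencies can be seen with a Hz-to-Mel conversion. The sketch below uses the common HTK-style formula; note that librosa uses Slaney's variant unless `htk=True` is passed, so the exact numbers differ from what `melspectrogram` uses internally.

```python
import math

def hz_to_mel_htk(freq_hz):
    """HTK-style Mel scale: roughly linear below 1 kHz, logarithmic above."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)

# The same 1000 Hz step covers far fewer Mels at high frequencies
low_step = hz_to_mel_htk(2000) - hz_to_mel_htk(1000)
high_step = hz_to_mel_htk(8000) - hz_to_mel_htk(7000)
print(round(low_step, 1), round(high_step, 1))
```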

In [None]:
S = librosa.feature.melspectrogram(y=x, sr=sr)

In [None]:
fig, ax = plt.subplots()
S_db = librosa.power_to_db(S, ref=np.max)
img = librosa.display.specshow(S_db, x_axis='time', y_axis='mel', sr=sr, fmax=8000, ax=ax)
fig.colorbar(img, ax=ax, format='%+2.0f dB')
ax.set(title='Mel-frequency spectrogram')
plt.show()

## 3. 🎯 Spectral Centroid

The Spectral Centroid tells us where the "center of mass" of the sound frequencies is — it shows us how "bright" or "dark" a sound is.

* If most energy is in high frequencies, the centroid is high → the sound is bright or sharp (like cymbals).
* If most energy is in low frequencies, the centroid is low → the sound is dull or bassy (like drums or male voices).
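
For a single frame, the centroid is the magnitude-weighted average of the frequency bins. A minimal NumPy sketch, assuming a 16 kHz sampling rate and two synthetic tones (the helper `frame_centroid` is hypothetical):

```python
import numpy as np

def frame_centroid(frame, sample_rate):
    """Magnitude-weighted mean frequency of one audio frame."""
    magnitudes = np.abs(np.fft.rfft(frame))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sample_rate)
    total = magnitudes.sum()
    return float((freqs * magnitudes).sum() / total) if total > 0 else 0.0

demo_sr = 16000
t = np.arange(2048) / demo_sr
low_c = frame_centroid(np.sin(2 * np.pi * 250 * t), demo_sr)    # "dark" tone
high_c = frame_centroid(np.sin(2 * np.pi * 4000 * t), demo_sr)  # "bright" tone
print(round(low_c), round(high_c))
```

For a pure tone the centroid sits at the tone's frequency; real speech spreads energy over many bins, so the centroid lands somewhere in between.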

In [None]:
cent = librosa.feature.spectral_centroid(y=x, sr=sr)
# Magnitude spectrogram, used as the background image for the centroid curve
S, phase = librosa.magphase(librosa.stft(y=x))

In [None]:
times = librosa.times_like(cent, sr=sr)
fig, ax = plt.subplots()
# Keep the returned image so the colorbar refers to this plot, not an earlier one
img = librosa.display.specshow(librosa.amplitude_to_db(S, ref=np.max),
                               sr=sr, y_axis='log', x_axis='time', ax=ax)
ax.plot(times, cent.T, label='Spectral centroid', color='w')
fig.colorbar(img, ax=ax, format='%+2.0f dB')
ax.legend(loc='upper right')
ax.set(title='log Power Spectrogram')
plt.show()

## 4. ⚡ Zero Crossings (ZCR)

Counts how often the waveform changes sign (+ to - or - to +).

* High ZCR → Noisy, sharp, or high-pitched sounds
* Low ZCR → Smooth, low, bassy sounds
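
Counting sign changes can be sketched directly in NumPy. This is a simplification of `librosa.zero_crossings` (which has extra options for handling zeros and padding), and the helper name and test tones are assumptions.

```python
import numpy as np

def count_zero_crossings(signal):
    """Count positions where consecutive samples have different signs."""
    signs = np.signbit(signal)
    return int(np.count_nonzero(signs[1:] != signs[:-1]))

demo_sr = 16000
t = np.arange(demo_sr) / demo_sr                  # one second of audio
low = np.sin(2 * np.pi * 100 * t + 0.1)           # 100 Hz tone (phase offset avoids exact zeros)
high = np.sin(2 * np.pi * 1000 * t + 0.1)         # 1000 Hz tone
low_count = count_zero_crossings(low)
high_count = count_zero_crossings(high)
print(low_count, high_count)
```

A sine wave crosses zero twice per cycle, so the higher-pitched tone produces roughly ten times as many crossings.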

In [None]:
n0 = 10000
n1 = 10050
pd.Series(x[n0:n1]).plot(figsize = (8, 3), lw = 1)
plt.title("Zoomed Audio Plot")
plt.show()

In [None]:
zero_crossings = librosa.zero_crossings(x[n0:n1], pad=False)
print(zero_crossings.shape)
print("Number of Zero Crossings: ", sum(zero_crossings))

## 5. 🌊 Spectral Rolloff

The spectral rolloff is the frequency below which a set share (85% by default in librosa) of the total spectral energy lies.

#### Example:
🧔 Male Voice (Deep, Low-pitched)

* Most energy is in low frequencies.
* We might reach 85% of the energy by 2000 Hz.
* ✅ So the Spectral Rolloff is low.

👩 Female Voice (High-pitched)

* Energy is spread into higher frequencies.
* We may need to go up to 5000 Hz to reach 85% of the energy.
* ✅ So the Spectral Rolloff is higher.
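
The rolloff for one frame can be found by accumulating the spectrum from low to high frequency and stopping once 85% of the total is reached. A minimal sketch with a made-up flat spectrum (the helper `rolloff_frequency` and its inputs are hypothetical):

```python
import numpy as np

def rolloff_frequency(magnitudes, freqs, roll_percent=0.85):
    """Lowest frequency at which the cumulative magnitude reaches roll_percent of the total."""
    cumulative = np.cumsum(magnitudes)
    index = np.searchsorted(cumulative, roll_percent * cumulative[-1])
    return freqs[index]

# Hypothetical flat spectrum: ten bins spaced 100 Hz apart, equal energy in each
demo_freqs = np.arange(10) * 100.0
demo_mags = np.ones(10)
demo_rolloff = rolloff_frequency(demo_mags, demo_freqs)
print(demo_rolloff)
```

With equal energy in every bin, 85% of the total is first reached at the ninth bin, i.e. 800 Hz.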

In [None]:
rolloff = librosa.feature.spectral_rolloff(y=x, sr=sr)[0]
print('Rolloff Shape:', rolloff.shape)

## 6. 🎼 MFCC (Mel-Frequency Cepstral Coefficients)
Captures the overall shape of the audio spectrum in a way that mimics human hearing. Commonly used in speech and music analysis.

In [None]:
mfccs = librosa.feature.mfcc(y=x, sr=sr)
print("MFCCs Shape:", mfccs.shape)

librosa.display.specshow(mfccs, sr=sr, x_axis='time')
plt.title('MFCCs')
plt.show()

## 7. 📉 RMS (Root Mean Square Energy)

Captures the energy or loudness of the signal over time. Useful for understanding how powerful the sound is at each frame.

* High RMS values = loud parts (e.g., speech, music, noise).
* Low RMS values = silence or quiet parts.

It helps in voice activity detection, emotion recognition, and even trimming silent segments.
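
Frame-wise RMS is the square root of the mean squared sample value in each frame. The sketch below skips the centering and padding that `librosa.feature.rms` applies by default, so frame counts will differ slightly; the helper `frame_rms` and the test tones are assumptions.

```python
import numpy as np

def frame_rms(signal, frame_length=2048, hop_length=512):
    """RMS of each full frame, without centering or padding."""
    starts = range(0, len(signal) - frame_length + 1, hop_length)
    return np.array([np.sqrt(np.mean(signal[s:s + frame_length] ** 2)) for s in starts])

demo_sr = 16000
t = np.arange(demo_sr) / demo_sr
loud = np.sin(2 * np.pi * 200 * t)   # full-scale tone
quiet = 0.1 * loud                   # same tone at one tenth of the amplitude
loud_rms = frame_rms(loud)
quiet_rms = frame_rms(quiet)
print(loud_rms.mean(), quiet_rms.mean())
```

Scaling the amplitude by 0.1 scales the RMS by the same factor, which is why RMS works as a simple loudness indicator.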

In [None]:
rms = librosa.feature.rms(y=x)[0]

# Get time axis for plotting
frames = range(len(rms))
t = librosa.frames_to_time(frames, sr=sr)

In [None]:
plt.figure(figsize=(10, 4))
plt.plot(t, rms, label='RMS Energy', color='orange')
plt.xlabel('Time (s)')
plt.ylabel('Energy')
plt.title('RMS Energy Over Time')
plt.legend()
plt.tight_layout()
plt.show()

# Final Preprocessing on Data

### Audio Preprocessing

In [None]:
SAMPLE_RATE = 16000 # Standard rate for speech models
DURATION = 3 # seconds
SAMPLES_PER_TRACK = SAMPLE_RATE * DURATION

In [None]:
def PreprocessAudio(file_path):
    try:
        y, sr = librosa.load(file_path, sr=SAMPLE_RATE)
        
        # trim silence
        y, _ = librosa.effects.trim(y)
        
        # reduce noise
        y = nr.reduce_noise(y=y, sr=sr)
        
        # normalize
        y = librosa.util.normalize(y)

        # padding/truncating
        if len(y) < SAMPLES_PER_TRACK:
            y = np.pad(y, (0, SAMPLES_PER_TRACK - len(y)))
        else:
            y = y[:SAMPLES_PER_TRACK]

        return y, sr
    except Exception as error:
        print(f"Failed to process '{file_path}': {error}")
        return None, None

### Feature Extraction

In [None]:
def ExtractAudioFeatures(file_path):
    audio_signal, sample_rate = PreprocessAudio(file_path)
    if audio_signal is None:
        return None

    n_fft = 2048  # means 2048 samples per window (~128ms if sr=16k)
    hop_length = 512  # means we move 512 samples forward for next window (~32ms)
    
    mfcc = librosa.feature.mfcc(y=audio_signal, sr=sample_rate, n_mfcc=13, hop_length=hop_length, n_fft=n_fft)
    mfcc_mean = np.mean(mfcc, axis=1)

    rolloff = librosa.feature.spectral_rolloff(y=audio_signal, sr=sample_rate, hop_length=hop_length, n_fft=n_fft)[0]
    rolloff_mean = np.mean(rolloff)

    zcr = librosa.feature.zero_crossing_rate(y=audio_signal, hop_length=hop_length)[0]
    zcr_mean = np.mean(zcr)

    centroid = librosa.feature.spectral_centroid(y=audio_signal, sr=sample_rate, hop_length=hop_length, n_fft=n_fft)[0]
    centroid_mean = np.mean(centroid)

    rms = librosa.feature.rms(y=audio_signal, hop_length=hop_length)[0]
    rms_mean = np.mean(rms)

    combined_features = np.hstack([
        mfcc_mean,
        rolloff_mean,
        zcr_mean,
        centroid_mean,
        rms_mean
    ])

    return combined_features

# Load Data

In [None]:
male_folder = '/kaggle/input/gender-recognition-by-voiceoriginal/data/male/'
female_folder = '/kaggle/input/gender-recognition-by-voiceoriginal/data/female'

In [None]:
data = []
labels = []

In [None]:
def process_male_file(file_path):
    features = ExtractAudioFeatures(file_path)
    if features is not None:
        return (features, 'male')
    return None

Since the two folders are large (`male_folder` contains 10.5k files), we will use parallel processing to speed up the process 
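
The pattern used in the next cells can be sketched independently of the audio code. In the example below, `label_file` and `fake_extract` are hypothetical stand-ins (the latter replaces the real feature extractor), and the file names are made up. The point is that `executor.map` returns results in input order and that failed files can be filtered out afterwards.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import partial

def label_file(file_path, label, extract):
    """Run the extractor on one file and attach its label; None signals a failure."""
    features = extract(file_path)
    return None if features is None else (features, label)

def fake_extract(file_path):
    # Stand-in for real feature extraction: "fails" on paths ending in ".bad"
    return None if file_path.endswith(".bad") else [len(file_path)]

paths = ["a.wav", "bb.wav", "c.bad"]
worker = partial(label_file, label="male", extract=fake_extract)

with ThreadPoolExecutor() as executor:
    results = list(executor.map(worker, paths))  # order matches `paths`

labelled = [r for r in results if r is not None]
print(labelled)
```

Passing the label as an argument would also let one function serve both folders instead of defining a separate function per class.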

In [None]:
import concurrent.futures
from tqdm import tqdm

# get all file paths first
file_paths = [os.path.join(male_folder, f) for f in os.listdir(male_folder) if os.path.isfile(os.path.join(male_folder, f))]

with concurrent.futures.ThreadPoolExecutor() as executor:
    results = list(tqdm(executor.map(process_male_file, file_paths), total=len(file_paths)))

for result in results:
    if result is not None:
        features, label = result
        data.append(features)
        labels.append(label)

In [None]:
def process_female_file(file_path):
    features = ExtractAudioFeatures(file_path)
    if features is not None:
        return (features, 'female')
    return None

In [None]:
file_paths = [os.path.join(female_folder, f) for f in os.listdir(female_folder) if os.path.isfile(os.path.join(female_folder, f))]

with concurrent.futures.ThreadPoolExecutor() as executor:
    results = list(tqdm(executor.map(process_female_file, file_paths), total=len(file_paths)))

for result in results:
    if result is not None:
        features, label = result
        data.append(features)
        labels.append(label)

Now that we loaded our two folders, we can create our dataframe

In [None]:
df = pd.DataFrame(data)
df['gender'] = labels

In [None]:
feature_columns = ([f"mfcc_{i+1}" for i in range(13)] +["spectral_rolloff", "zero_crossing_rate", "spectral_centroid", "rms"])

df.columns = feature_columns + ['gender']

df = df.sample(frac=1).reset_index(drop=True)

df.head()

# EDA & Preprocessing

Now let's explore our constructed data

In [None]:
df.info()

In [None]:
print("Data Shape:", df.shape)

In [None]:
df.describe().T

## Null Values

In [None]:
df.isna().sum()

## Exploring Duplicates

In [None]:
print("Number of Duplicates:", df.duplicated().sum())
print(f"Percentage of Duplicates: {df.duplicated().sum() / len(df) * 100:.2f}%")

In [None]:
df = df.drop_duplicates()
print("Dropped Duplicates")
print(f"Percentage of Duplicates: {df.duplicated().sum() / len(df) * 100:.2f}%")
print("Data Shape:", df.shape)

## Examining Class Imbalance

In [None]:
gender_counts = df['gender'].value_counts()

plt.figure(figsize=(5, 5))
plt.pie(gender_counts, labels=gender_counts.index, autopct='%1.1f%%', startangle=90)
plt.title('Gender Distribution')
plt.axis('equal')
plt.show()

We see here that we have a class imbalance in our target variable

To resolve this issue, we will assign weights to each of them in training phase
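
One common way to derive such weights is the "balanced" heuristic, `n_samples / (n_classes * class_count)`, which is what scikit-learn applies when an estimator is given `class_weight='balanced'`. A minimal sketch with made-up labels (the helper `balanced_class_weights` is hypothetical):

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class inversely to its frequency: n_samples / (n_classes * count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}

# Hypothetical imbalanced labels, not the notebook's actual class counts
demo_labels = ["male"] * 3 + ["female"]
weights = balanced_class_weights(demo_labels)
print(weights)
```

The rarer class receives the larger weight, so its errors count more during training.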

### Data Distribution

In [None]:
feature_columns = df.columns[:-1]
df[feature_columns].hist(bins=30, figsize=(20, 15))
plt.suptitle('Feature Distributions')
plt.show()

## Handling Outliers

In [None]:
n = len(feature_columns)

rows = (n + 1) // 2
cols = 2

fig, axes = plt.subplots(rows, cols, figsize=(12, 3 * rows))
axes = axes.flatten()

for i, col in enumerate(feature_columns):
    sns.boxplot(x='gender', y=col, data=df, ax=axes[i])
    axes[i].set_title(f'{col}')

# Hide any unused plots
for j in range(i+1, len(axes)):
    fig.delaxes(axes[j])

plt.tight_layout()
plt.show()

In [None]:
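numerical_columns = df.select_dtypes(include=['number']).columns.tolist()

# IQR rule: flag values more than 1.5 * IQR outside the middle 50% of each feature
Q1 = df[numerical_columns].quantile(0.25)
Q3 = df[numerical_columns].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# Drop every row that has an outlier in at least one feature
outlier_mask = ((df[numerical_columns] < lower_bound) |
                (df[numerical_columns] > upper_bound)).any(axis=1)
df = df[~outlier_mask].copy()  # copy so later column assignments do not act on a view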

## Categorical Encoding

In [None]:
df['gender'] = df['gender'].map({
    'male': 1,
    'female': 0
})

In [None]:
df.head()

In [None]:
df.describe().T

In [None]:
plt.figure(figsize=(14,10))
sns.heatmap(df.corr(), annot=True, fmt='.2f', cmap='coolwarm')
plt.title('Feature Correlation Matrix')
plt.show()

In [None]:
df = df.drop(['rms', 'spectral_centroid', 'zero_crossing_rate'], axis=1)

In [None]:
df.shape

# Splitting Data

In [None]:
X = df.drop(columns='gender')
y = df['gender']

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(f"Training set size: {len(X_train)}")
print(f"Testing set size: {len(X_test)}")

In [None]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Model Evaluation

## Logistic Regression

In [None]:
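# class_weight='balanced' applies the class weighting planned in the class-imbalance section
lr = LogisticRegression(max_iter=1000, class_weight='balanced')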
lr.fit(X_train, y_train)
y_pred_lr = lr.predict(X_test)

print("Logistic Regression - Classification Report:")
print(classification_report(y_test, y_pred_lr))
print("Confusion Matrix:")
sns.heatmap(confusion_matrix(y_test, y_pred_lr),annot=True,fmt='d')
plt.show()

## Random Forest

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt

rf = RandomForestClassifier(random_state=42)
rf.fit(X_train, y_train)
y_pred_rf = rf.predict(X_test)

print("Random Forest - Classification Report:")
print(classification_report(y_test, y_pred_rf))

print("Confusion Matrix:")
sns.heatmap(confusion_matrix(y_test, y_pred_rf), annot=True, fmt='d', cmap='Greens')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Random Forest Confusion Matrix')
plt.show()


## XGBoost

In [None]:
from xgboost import XGBClassifier
from sklearn.metrics import classification_report, confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt

xgb = XGBClassifier(eval_metric='logloss', random_state=42)
xgb.fit(X_train, y_train)
y_pred_xgb = xgb.predict(X_test)

print("XGBoost - Classification Report:")
print(classification_report(y_test, y_pred_xgb))

print("Confusion Matrix:")
sns.heatmap(confusion_matrix(y_test, y_pred_xgb), annot=True, fmt='d', cmap='Oranges')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('XGBoost Confusion Matrix')
plt.show()


## Naive Bayes

In [None]:
gnb = GaussianNB()
gnb.fit(X_train, y_train)
y_pred_gnb = gnb.predict(X_test)

print("GaussianNB - Classification Report:")
print(classification_report(y_test, y_pred_gnb))

print("Confusion Matrix:")
sns.heatmap(confusion_matrix(y_test, y_pred_gnb), annot=True, fmt='d', cmap='Purples')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('GaussianNB Confusion Matrix')
plt.show()
