# **Imports**

In [27]:
import numpy as np
import librosa
import soundfile as sf
from pathlib import Path
from tqdm import tqdm

- **numpy** → numerical arrays for ML
- **librosa** → audio feature extraction
- **soundfile** → reads audio at its native sample rate via libsndfile, without the automatic resampling that `librosa.load` applies by default
- **Path** → OS-independent file traversal
- **tqdm** → progress bar (important for large datasets)

# **Configuration**

In [None]:
DATA_ROOT = Path("../data")

LABEL_MAP = {
    "AI": 0,
    "Human": 1
}

# **Audio Loading**

In [29]:
def load_audio(path, target_sr=16000):
    y, sr = sf.read(path)

    # Convert stereo to mono
    if y.ndim > 1:
        y = y.mean(axis=1)

    # Resample if needed
    if sr != target_sr:
        y = librosa.resample(y, orig_sr=sr, target_sr=target_sr)

    # Peak-normalize so that max |y| == 1
    y = librosa.util.normalize(y)

    return y, target_sr

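- **Converts all audio to mono** by averaging the channels
- **Resamples to 16 kHz**, a common sampling rate for speech models
- **Peak-normalizes each clip** so its maximum absolute amplitude is 1, which removes level differences between recordings
- **Ensures every sample is comparable**

**This keeps the model from exploiting recording artifacts (channel count, sample rate, loudness) instead of the voice itself.**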
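
To make the mono and normalization steps concrete, here is a minimal NumPy-only sketch on a synthetic stereo tone. The helpers `to_mono` and `peak_normalize` are hypothetical stand-ins written for illustration; they mirror what `load_audio` does with `y.mean(axis=1)` and `librosa.util.normalize` under its default settings, but they are not the notebook's code.

```python
import numpy as np

def to_mono(y: np.ndarray) -> np.ndarray:
    """Average the channels of a (frames, channels) array; pass 1-D input through."""
    return y.mean(axis=1) if y.ndim > 1 else y

def peak_normalize(y: np.ndarray) -> np.ndarray:
    """Scale so that max |y| == 1; leave all-zero (silent) input unchanged."""
    peak = np.max(np.abs(y))
    return y / peak if peak > 0 else y

# Synthetic stereo clip: the same 440 Hz tone, quiet on the left, louder on the right.
sr = 16000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 440 * t)
stereo = np.stack([0.1 * tone, 0.3 * tone], axis=1)  # shape (16000, 2)

mono = peak_normalize(to_mono(stereo))
print(mono.shape, float(np.max(np.abs(mono))))
```

Whatever the original recording level, the normalized clip peaks at an absolute amplitude of 1, so raw level cannot separate the two classes.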

# **Feature Extraction**

In [30]:
def extract_features(y, sr):
    # Mel spectrogram
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128, fmax=sr // 2)
    
    mel_db = librosa.power_to_db(mel, ref=np.max)
    
    mel_mean = np.mean(mel_db, axis=1)
    mel_std = np.std(mel_db, axis=1)

    # MFCC
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
    
    mfcc_mean = np.mean(mfcc, axis=1)
    mfcc_std = np.std(mfcc, axis=1)

    features = np.concatenate([mel_mean, mel_std, mfcc_mean, mfcc_std])
    return features

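- **Mel spectrogram** represents energy on the mel frequency scale, which approximates human pitch perception
- **AI voices often show:**
    - **unnatural spectral smoothness**
    - **reduced variance**
- We aggregate each mel band and each MFCC coefficient over time using **mean + std**
- Output is a **fixed-length vector** of 282 values (2 × 128 mel + 2 × 13 MFCC), regardless of clip duration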
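
The fixed-length property can be checked without any audio. The sketch below uses random matrices as stand-ins for `mel_db` and `mfcc` (only their shapes matter) and a hypothetical `aggregate` helper that mirrors the mean + std step; the values are not real features.

```python
import numpy as np

N_MELS, N_MFCC = 128, 13  # same values as in extract_features

def aggregate(frames: np.ndarray) -> np.ndarray:
    """Collapse a (n_bins, n_frames) matrix to per-bin mean followed by per-bin std."""
    return np.concatenate([frames.mean(axis=1), frames.std(axis=1)])

rng = np.random.default_rng(0)

def fake_features(n_frames: int) -> np.ndarray:
    # Random stand-ins for mel_db and mfcc; only the shapes are meaningful here.
    mel_db = rng.normal(size=(N_MELS, n_frames))
    mfcc = rng.normal(size=(N_MFCC, n_frames))
    return np.concatenate([aggregate(mel_db), aggregate(mfcc)])

short_clip = fake_features(n_frames=50)
long_clip = fake_features(n_frames=900)
print(short_clip.shape, long_clip.shape)  # both (282,)
```

Because the time axis is reduced away, a short clip and a long clip produce vectors of the same length, which is what lets them be stacked into a single feature matrix.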

# **Dataset Traversal & Feature Collection**

In [31]:
X = []
y = []
languages = []

for label_name in ["Human", "AI"]:
    label_value = LABEL_MAP[label_name]
    label_dir = DATA_ROOT / label_name

    for lang_dir in label_dir.iterdir():
        if not lang_dir.is_dir():
            continue

        language = lang_dir.name

        for audio_file in tqdm(list(lang_dir.glob("*.mp3")), desc=f"{label_name}-{language}"):

            try:
                audio, sr = load_audio(audio_file)
                features = extract_features(audio, sr)

                # 2 × 128 mel stats + 2 × 13 MFCC stats = 282
                assert features.shape[0] == 282, f"unexpected feature length {features.shape[0]}"

                X.append(features)
                y.append(label_value)
                languages.append(language)

            except Exception as e:
                print(f"Failed on {audio_file}: {e}")

Human-English: 100%|██████████| 200/200 [00:08<00:00, 22.74it/s]
Human-Hindi: 100%|██████████| 200/200 [00:06<00:00, 30.43it/s]
Human-Malayalam: 100%|██████████| 200/200 [00:05<00:00, 33.99it/s]
Human-Tamil: 100%|██████████| 200/200 [00:06<00:00, 31.22it/s]
Human-Telugu: 100%|██████████| 200/200 [00:06<00:00, 31.29it/s]
AI-English: 100%|██████████| 200/200 [00:03<00:00, 65.00it/s]
AI-Hindi: 100%|██████████| 200/200 [00:03<00:00, 57.96it/s]
AI-Malayalam: 100%|██████████| 200/200 [00:03<00:00, 58.22it/s]
AI-Tamil: 100%|██████████| 200/200 [00:03<00:00, 55.25it/s]
AI-Telugu: 100%|██████████| 200/200 [00:03<00:00, 56.30it/s]


**Stores:**
- **X** → Numerical Features
- **y** → Class Label
- **languages** → Language Metadata

In [32]:
X = np.array(X)
y = np.array(y)
languages = np.array(languages)

print("Feature Matrix Shape:", X.shape)
print("Labels Shape:", y.shape)
print("Languages Shape:", languages.shape)

print("AI Samples:", np.sum(y == 0))
print("Human Samples:", np.sum(y == 1))

Feature Matrix Shape: (2000, 282)
Labels Shape: (2000,)
Languages Shape: (2000,)
AI Samples: 1000
Human Samples: 1000
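
The totals above confirm overall class balance. A per-language breakdown can be sketched with `np.unique`, as below; the arrays `y_demo` and `lang_demo` and the helper `count_by_language` are illustrative names, not part of the notebook, and the toy data is made up.

```python
import numpy as np

# Toy stand-ins for the real `y` and `languages` arrays.
y_demo = np.array([0, 0, 1, 1, 0, 1])
lang_demo = np.array(["English", "Hindi", "English", "Hindi", "English", "Hindi"])

def count_by_language(labels: np.ndarray, langs: np.ndarray) -> dict:
    """Return {language: {label_value: count}} for every pair that occurs."""
    counts = {}
    for lang in np.unique(langs):
        values, n = np.unique(labels[langs == lang], return_counts=True)
        counts[str(lang)] = {int(v): int(c) for v, c in zip(values, n)}
    return counts

print(count_by_language(y_demo, lang_demo))
```

Applied to the real arrays, the same function would be called as `count_by_language(y, languages)`.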


In [33]:
FEATURE_DIR = Path("../artifacts/features/ML")
FEATURE_DIR.mkdir(parents=True, exist_ok=True)

np.save(FEATURE_DIR / "X_features.npy", X)
np.save(FEATURE_DIR / "y_labels.npy", y)
np.save(FEATURE_DIR / "languages.npy", languages)

print("Features saved to:", FEATURE_DIR)

Features saved to: ..\artifacts\features\ML
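
A later notebook would typically reload these arrays with `np.load`. The sketch below shows the save and reload round trip in a temporary directory with small placeholder arrays, so it runs without the real artifacts. The file names match the ones used above; everything else (the `_demo` arrays and the temporary path) is assumed for illustration.

```python
import tempfile
from pathlib import Path

import numpy as np

with tempfile.TemporaryDirectory() as tmp:
    feature_dir = Path(tmp)

    # Placeholder data with the same layout as the real arrays.
    X_demo = np.zeros((4, 282), dtype=np.float64)
    y_demo = np.array([0, 1, 0, 1])
    lang_demo = np.array(["English", "Hindi", "Tamil", "Telugu"])

    np.save(feature_dir / "X_features.npy", X_demo)
    np.save(feature_dir / "y_labels.npy", y_demo)
    np.save(feature_dir / "languages.npy", lang_demo)

    X_loaded = np.load(feature_dir / "X_features.npy")
    y_loaded = np.load(feature_dir / "y_labels.npy")
    lang_loaded = np.load(feature_dir / "languages.npy")

# The three arrays must stay row-aligned: one label and one language per feature row.
rows_match = X_loaded.shape[0] == y_loaded.shape[0] == lang_loaded.shape[0]
print(X_loaded.shape, rows_match)
```

Checking row alignment right after loading catches the case where one of the three files was regenerated without the others.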
