# 🧮 5. Data Preprocessing
## 5.1 Objective

The data preprocessing stage converts raw audio into numerical data that is ready for modeling. It covers reading the audio files, normalizing the signal, removing noise, trimming silent segments, and extracting time-series statistical features such as the mean, standard deviation, skewness, kurtosis, RMS, and zero crossing rate (ZCR).
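
To make these feature names concrete, the following minimal sketch computes several of them directly with NumPy on a synthetic sine wave. The test signal, the 440 Hz frequency, and the helper name `basic_stats` are assumptions made for illustration. The notebook itself relies on `librosa`, which computes RMS and ZCR per frame rather than over the whole signal, so the numbers are not directly comparable.

```python
import numpy as np

def basic_stats(y):
    """Whole-signal versions of the time-domain features (illustrative helper)."""
    y = np.asarray(y, dtype=float)
    signs = np.signbit(y)
    return {
        "mean": y.mean(),
        "std": y.std(),
        "rms": np.sqrt(np.mean(y ** 2)),
        # Fraction of adjacent sample pairs whose sign differs
        "zcr": np.mean(signs[1:] != signs[:-1]),
    }

sr = 48000                          # sample rate used in this section
t = np.arange(sr) / sr              # one second of samples
y = np.sin(2 * np.pi * 440 * t)     # hypothetical 440 Hz test tone
stats = basic_stats(y)
print(stats)
```

For a pure sine wave of amplitude 1, the mean should be close to 0 and the RMS close to 1/√2 ≈ 0.707, which gives a quick sanity check for the definitions.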

## 5.2 Preprocessing Steps
### a. Import the Required Libraries

In [1]:
import os
import librosa
import numpy as np
import pandas as pd
from scipy.stats import skew, kurtosis


### b. Setting the Dataset Path

In [2]:
# Path to the dataset folder
DATASET_PATH = r"D:\KULIAH\SEMESTER 5\Program Saint Data\Uranus\myfirstbook\Audio_recognition\dataset48k"

# Train and validation folders
train_path = os.path.join(DATASET_PATH, "train")
val_path = os.path.join(DATASET_PATH, "val")
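
The loading code in the later steps assumes that each split folder contains one subfolder per class, with `.wav` files inside. A quick way to confirm that layout before extracting features is sketched below. The helper name `count_wavs` and the temporary demo folder are hypothetical; in practice the function would be called on `train_path` and `val_path`.

```python
import os
import tempfile

def count_wavs(split_path):
    """Map each class subfolder to the number of .wav files it contains."""
    counts = {}
    for label in sorted(os.listdir(split_path)):
        label_path = os.path.join(split_path, label)
        if os.path.isdir(label_path):
            counts[label] = sum(
                name.lower().endswith(".wav") for name in os.listdir(label_path)
            )
    return counts

# Demo on a throwaway folder that mimics the assumed layout
with tempfile.TemporaryDirectory() as root:
    for label, n in [("buka", 3), ("tutup", 2)]:
        os.makedirs(os.path.join(root, label))
        for i in range(n):
            open(os.path.join(root, label, f"{label}_{i}.wav"), "wb").close()
    demo_counts = count_wavs(root)

print(demo_counts)  # {'buka': 3, 'tutup': 2}
```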


### c. Statistical Feature Extraction Function

This function reads each .wav file, applies basic preprocessing, and then computes statistical features from the time-domain signal.

In [3]:
def extract_features(file_path):
    """Return a dict of time-domain statistics for one .wav file, or None on failure."""
    try:
        # 1. Load the audio as mono, resampled to 48 kHz
        y, sr = librosa.load(file_path, sr=48000, mono=True)

        # Skip empty files
        if y is None or len(y) == 0:
            print(f"⚠️ Empty file: {file_path}")
            return None

        # 2. Peak-normalize the amplitude to [-1, 1] (all-zero signals are left as-is)
        peak = np.max(np.abs(y))
        if peak != 0:
            y = y / peak

        # 3. Trim leading and trailing silence (more than 20 dB below the peak)
        y, _ = librosa.effects.trim(y, top_db=20)
        if len(y) == 0:
            print(f"⚠️ Audio is empty after trimming: {file_path}")
            return None

        # 4. Compute the statistical features
        features = {
            'mean': np.mean(y),
            'std': np.std(y),
            'skew': skew(y),
            'kurtosis': kurtosis(y),
            'rms': np.mean(librosa.feature.rms(y=y)),                # mean of frame-wise RMS
            'zcr': np.mean(librosa.feature.zero_crossing_rate(y)),  # mean of frame-wise ZCR
        }

        return features

    except Exception as e:
        print(f"❌ Error while extracting features from {file_path}: {e}")
        return None
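
The normalization and trimming steps can be illustrated without any audio files. The sketch below is a simplified, NumPy-only stand-in for what `librosa.effects.trim` does conceptually (frame-wise energy compared against a threshold in dB below the loudest frame). It is not librosa's implementation, and the function names `peak_normalize` and `trim_silence`, the frame length, and the synthetic signal are all assumptions.

```python
import numpy as np

def peak_normalize(y):
    """Scale the signal so its largest absolute sample is 1 (no-op for all-zero input)."""
    peak = np.max(np.abs(y))
    return y / peak if peak > 0 else y

def trim_silence(y, top_db=20, frame_length=2048):
    """Drop leading/trailing frames quieter than `top_db` dB below the loudest frame."""
    n_frames = len(y) // frame_length
    if n_frames == 0:
        return y
    frames = y[: n_frames * frame_length].reshape(n_frames, frame_length)
    rms = np.sqrt(np.mean(frames ** 2, axis=1))
    ref = rms.max()
    if ref == 0:
        return y[:0]                       # entirely silent input
    db = 20 * np.log10(np.maximum(rms, 1e-10) / ref)
    loud = np.flatnonzero(db > -top_db)    # indices of non-silent frames
    start = loud[0] * frame_length
    end = (loud[-1] + 1) * frame_length
    return y[start:end]

sr = 48000
tone = 0.5 * np.sin(2 * np.pi * 440 * np.arange(sr) / sr)   # 1 s hypothetical tone
silence = np.zeros(sr // 2)                                  # 0.5 s of silence
y = np.concatenate([silence, tone, silence])

y_norm = peak_normalize(y)
y_trim = trim_silence(y_norm, top_db=20)
print(len(y), len(y_trim), np.max(np.abs(y_norm)))
```

The trimmed signal keeps the whole tone plus at most one partially silent frame on each side, which is why its length is slightly more than one second of samples.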

### d. Looping Over All Files to Extract Features

We collect every file from the `buka` (open) and `tutup` (closed) folders and attach a label to each file.

In [4]:
def process_dataset(folder_path):
    """Extract features from every .wav file under folder_path/<label>/."""
    data = []
    for label in os.listdir(folder_path):
        label_path = os.path.join(folder_path, label)
        if not os.path.isdir(label_path):
            continue

        for file_name in os.listdir(label_path):
            if file_name.endswith(".wav"):
                file_path = os.path.join(label_path, file_name)
                features = extract_features(file_path)
                # extract_features returns None on failure; skip those files
                if features is None:
                    continue
                features['file'] = file_name
                features['label'] = label
                data.append(features)
    return pd.DataFrame(data)

# Process the train and validation datasets
df_train = process_dataset(train_path)
df_val = process_dataset(val_path)
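
Because `extract_features` can return `None`, the rows collected by the loop should be filtered before they are turned into a DataFrame. The following toy example shows the pattern on hand-written dictionaries. The function `fake_extract` and its values are made up for illustration and do not come from the dataset.

```python
import pandas as pd

def fake_extract(name):
    """Stand-in for a feature extractor: returns None for a 'broken' file."""
    if name.startswith("broken"):
        return None
    return {"mean": 0.01, "std": 0.2}

rows = []
for label, names in {"buka": ["a.wav", "broken.wav"], "tutup": ["b.wav"]}.items():
    for name in names:
        features = fake_extract(name)
        if features is None:      # skip files that could not be processed
            continue
        features["file"] = name
        features["label"] = label
        rows.append(features)

df_demo = pd.DataFrame(rows)
print(df_demo)
```

Without the `None` check, assigning `features["file"]` would raise a `TypeError` on the first unreadable file and stop the whole run.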


### e. Saving the Data to CSV

The extracted features are saved so they can be reused easily in the subsequent modeling stage.

In [5]:
# Combine the train and validation data
df_all = pd.concat([df_train, df_val], ignore_index=True)

# Save to a CSV file
df_all.to_csv("fitur_statistik_bukatutup.csv", index=False)
print("Feature extraction finished! Total rows:", len(df_all))
df_all.head()


Feature extraction finished! Total rows: 400


|   | mean | std | skew | kurtosis | rms | zcr | file | label |
|---|------|-----|------|----------|-----|-----|------|-------|
| 0 | 0.034605 | 0.18093 | 2.029111 | 8.908755 | 0.147845 | 0.033033 | buka48k-buka_0.wav.wav | buka |
| 1 | 0.030874 | 0.147154 | 1.870739 | 7.919638 | 0.11814 | 0.033538 | buka48k-buka_1.wav.wav | buka |
| 2 | 0.030876 | 0.147155 | 1.870702 | 7.919473 | 0.118141 | 0.033629 | buka48k-buka_10.wav.wav | buka |
| 3 | 0.033366 | 0.156327 | 1.792386 | 7.179306 | 0.128227 | 0.035024 | buka48k-buka_100.wav.wav | buka |
| 4 | 0.033274 | 0.15634 | 1.793582 | 7.18673 | 0.128196 | 0.034838 | buka48k-buka_101.wav.wav | buka |
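
For the modeling stage, the saved CSV would typically be split into a feature matrix and a label vector. The following is a minimal sketch, assuming the column names shown in the output above. The inline CSV text reuses the first two rows of that output so the example runs without the real file, and the names `X` and `y` are common conventions rather than part of the original notebook.

```python
import io
import pandas as pd

csv_text = """mean,std,skew,kurtosis,rms,zcr,file,label
0.034605,0.18093,2.029111,8.908755,0.147845,0.033033,buka48k-buka_0.wav.wav,buka
0.030874,0.147154,1.870739,7.919638,0.11814,0.033538,buka48k-buka_1.wav.wav,buka
"""

# In the real pipeline this would be pd.read_csv("fitur_statistik_bukatutup.csv")
df_demo = pd.read_csv(io.StringIO(csv_text))

feature_cols = ["mean", "std", "skew", "kurtosis", "rms", "zcr"]
X = df_demo[feature_cols]      # numeric features
y = df_demo["label"]           # class labels (buka / tutup)

print(X.shape, y.tolist())
```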


### f. Handling Missing Values and Non-Numeric Columns

In [6]:
# =====================================
# 🧹 Stage: Cleaning + Missing-Value Interpolation
# =====================================

import pandas as pd
import numpy as np

# Load the dataset
df = pd.read_csv('fitur_statistik_bukatutup.csv')

print("=== Initial Dataset Info ===")
df.info()  # info() prints directly; wrapping it in print() would also print "None"
print("\nSample data:")
print(df.head())

# Drop non-numeric columns except 'label'
non_numeric_cols = df.select_dtypes(include=['object']).columns.tolist()
if 'label' in non_numeric_cols:
    non_numeric_cols.remove('label')

if non_numeric_cols:
    print(f"\n🧾 Non-numeric columns dropped: {non_numeric_cols}")
    df = df.drop(columns=non_numeric_cols)

# Replace inf/-inf with NaN
df = df.replace([np.inf, -np.inf], np.nan)

# Count missing values before interpolation
missing_before = df.isna().sum().sum()
print(f"\nMissing values before interpolation: {missing_before}")

# Linear interpolation down the rows, applied to numeric columns only
# so the object-typed 'label' column is left untouched
num_cols = df.select_dtypes(include='number').columns
df[num_cols] = df[num_cols].interpolate(method='linear', limit_direction='forward', axis=0)

# If NaN values remain (e.g. at the start), fill them with the column mean
df.fillna(df.mean(numeric_only=True), inplace=True)

# Count missing values after interpolation
missing_after = df.isna().sum().sum()
print(f"Missing values after interpolation: {missing_after}")

# Save the cleaned dataset
df.to_csv('fitur_statistik_bukatutup_clean.csv', index=False)
print("\n💾 Cleaned dataset (with interpolation) saved to 'fitur_statistik_bukatutup_clean.csv'")


=== Initial Dataset Info ===
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 400 entries, 0 to 399
Data columns (total 8 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   mean      400 non-null    float64
 1   std       400 non-null    float64
 2   skew      396 non-null    float64
 3   kurtosis  396 non-null    float64
 4   rms       400 non-null    float64
 5   zcr       400 non-null    float64
 6   file      400 non-null    object 
 7   label     400 non-null    object 
dtypes: float64(6), object(2)
memory usage: 25.1+ KB

Sample data:
       mean       std      skew  kurtosis       rms       zcr  \
0  0.034605  0.180930  2.029111  8.908755  0.147845  0.033033   
1  0.030874  0.147154  1.870739  7.919638  0.118140  0.033538   
2  0.030876  0.147155  1.870702  7.919473  0.118141  0.033629   
3  0.033366  0.156327  1.792386  7.179306  0.128227  0.035024   
4  0.033274  0.156340  1.793582  7.186730  0.128196  0.034838   

                 

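
The effect of forward linear interpolation followed by a mean fill is easiest to see on a tiny table. The sketch below uses made-up numbers (not values from the dataset) and hypothetical names `df_demo`, `step1`, and `step2`. It shows that a gap between two valid rows is filled with their midpoint, while a leading gap is left as NaN by `limit_direction="forward"` and is only filled by the column mean afterwards.

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({
    "skew": [np.nan, 1.0, np.nan, 3.0],
    "label": ["buka", "buka", "tutup", "tutup"],
})

num_cols = df_demo.select_dtypes(include="number").columns

# Step 1: linear interpolation down the rows (forward only)
step1 = df_demo.copy()
step1[num_cols] = step1[num_cols].interpolate(method="linear", limit_direction="forward")

# Step 2: fill whatever is still missing with the column mean
step2 = step1.fillna(step1.mean(numeric_only=True))

print(step1["skew"].tolist())  # [nan, 1.0, 2.0, 3.0]
print(step2["skew"].tolist())  # [2.0, 1.0, 2.0, 3.0]
```

Note that row-wise interpolation borrows values from neighbouring rows, so the filled values depend on the order of the files in the table.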
