# 🎵 Music Genre Classification
## Notebook 2: Audio Feature Extraction

**Objective:** Extract audio features (MFCCs, spectral descriptors, tempo, etc.) from every file in the dataset.

---

## 1. Configuration and Imports

In [None]:
import os
import sys
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
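import librosa
import librosa.display  # explicit import: older librosa versions do not load this submodule automatically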
from pathlib import Path

sys.path.insert(0, '..')

from src.config import Config
from src.data_loader import DataLoader
from src.feature_extraction import FeatureExtractor
from src.visualization import Visualizer

plt.style.use('seaborn-v0_8-whitegrid')
%matplotlib inline

print("✅ Imports successful!")

## 2. Loading the Dataset

In [None]:
loader = DataLoader()
df = loader.scan_dataset()
print(f"\n📊 {len(df)} audio files found")

## 3. Overview of the Audio Features

We will extract the following features:

| Feature | Description | Use |
|---------|-------------|-----|
| **MFCC** | Mel-frequency cepstral coefficients | Captures timbre |
| **Spectral Centroid** | Center of mass of the spectrum | Brightness of the sound |
| **Spectral Bandwidth** | Spread of the spectrum around its centroid | Harmonic richness |
| **Spectral Rolloff** | Frequency below which 85% of the spectral energy lies | Harmonic vs. percussive sounds |
| **Zero Crossing Rate** | Rate at which the signal changes sign | Noise vs. harmonic signal |
| **Tempo** | Beats per minute (BPM) | Rhythm |
| **Chroma** | Energy distribution across the 12 pitch classes | Harmony |
| **RMS Energy** | Root-mean-square energy of the signal | Loudness |
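
To make the definitions in the table concrete, the sketch below computes three of the descriptors with plain NumPy on a synthetic 440 Hz sine tone. It is an illustration only: the helper names `magnitude_spectrum`, `spectral_centroid`, `spectral_rolloff` and `zero_crossing_rate` are hypothetical, they treat the whole signal as a single frame, and they are not the `FeatureExtractor` implementation, which is not shown in this notebook.

```python
import numpy as np

def magnitude_spectrum(signal, sr):
    """Return (frequencies in Hz, magnitudes) of the one-sided spectrum."""
    mags = np.abs(np.fft.rfft(signal))
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sr)
    return freqs, mags

def spectral_centroid(signal, sr):
    """Magnitude-weighted mean frequency (perceived as brightness)."""
    freqs, mags = magnitude_spectrum(signal, sr)
    return float(np.sum(freqs * mags) / np.sum(mags))

def spectral_rolloff(signal, sr, roll_percent=0.85):
    """Lowest frequency at which the cumulative magnitude reaches roll_percent of the total."""
    freqs, mags = magnitude_spectrum(signal, sr)
    cumulative = np.cumsum(mags)
    idx = np.searchsorted(cumulative, roll_percent * cumulative[-1])
    return float(freqs[idx])

def zero_crossing_rate(signal):
    """Fraction of consecutive sample pairs whose sign differs."""
    signs = np.signbit(signal)
    return float(np.mean(signs[1:] != signs[:-1]))

sr = 22050                      # sample rate chosen for the demo
t = np.arange(sr) / sr          # one second of audio
tone = np.sin(2 * np.pi * 440 * t)

centroid = spectral_centroid(tone, sr)
rolloff = spectral_rolloff(tone, sr)
zcr = zero_crossing_rate(tone)
print(f"centroid={centroid:.1f} Hz, rolloff={rolloff:.1f} Hz, zcr={zcr:.4f}")
```

For a pure tone, both the centroid and the rolloff sit at the tone's frequency, and the zero crossing rate is roughly twice the frequency divided by the sample rate. Real music spreads energy over many frequencies, which is what makes these descriptors informative.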


In [None]:
# Initialize the extractor
extractor = FeatureExtractor()

# Print a summary of the features
extractor.print_feature_summary()

## 4. Demonstration on a Single File

In [None]:
# Pick an example file
sample_file = df.iloc[0]['filepath']
sample_genre = df.iloc[0]['genre']

print(f"🎵 File: {Path(sample_file).name}")
print(f"🎵 Genre: {sample_genre}")

In [None]:
# Load the audio (first 30 seconds, resampled to the configured rate)
y, sr = librosa.load(sample_file, sr=Config.SAMPLE_RATE, duration=30)

# Extract the features
features = extractor.extract_all_features(y, sr)

print(f"\n📊 Number of features extracted: {len(features)}")
print("\n🔍 Preview of the first 10 features:")
for name, value in list(features.items())[:10]:
    print(f"   {name}: {value:.4f}")
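
Column names used later in this notebook, such as `mfcc_1_mean` and `spectral_centroid_mean`, suggest that frame-level features are summarized into per-track statistics. The sketch below shows one way such an aggregation could work, assuming a mean and a standard deviation per coefficient. The helper `summarize_frames` and the `_std` suffix are assumptions for illustration and are not confirmed by `FeatureExtractor`.

```python
import numpy as np

def summarize_frames(name, frames):
    """Collapse a (n_coefficients, n_frames) matrix into per-coefficient statistics.

    Returns a flat dict such as {"mfcc_1_mean": ..., "mfcc_1_std": ...}.
    """
    frames = np.atleast_2d(frames)
    summary = {}
    for i, row in enumerate(frames, start=1):
        summary[f"{name}_{i}_mean"] = float(np.mean(row))
        summary[f"{name}_{i}_std"] = float(np.std(row))
    return summary

# Stand-in for an MFCC matrix: 3 coefficients x 4 frames
fake_mfcc = np.array([
    [1.0, 2.0, 3.0, 4.0],
    [0.0, 0.0, 0.0, 0.0],
    [-1.0, 1.0, -1.0, 1.0],
])
track_features = summarize_frames("mfcc", fake_mfcc)
print(track_features)
```

Summarizing over time gives every track a fixed-length feature vector regardless of its duration, which is what tabular classifiers expect.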

In [None]:
# Visualize the MFCCs
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)

fig, ax = plt.subplots(figsize=(12, 4))
img = librosa.display.specshow(mfcc, sr=sr, x_axis='time', ax=ax)
ax.set_title(f"MFCC - {sample_genre.upper()}", fontsize=14, fontweight='bold')
ax.set_ylabel('MFCC coefficient')
fig.colorbar(img, ax=ax)
plt.tight_layout()
plt.show()

## 5. Extracting All Features

In [None]:
# Path where the features will be saved
features_path = Config.DATA_PROCESSED / Config.FEATURES_FILE

print(f"📁 Features will be saved to: {features_path}")

In [None]:
# Extract features from every file
# ⚠️ This operation can take several minutes!

print("⏳ Extraction in progress... (roughly 5-10 minutes)")
features_df = extractor.extract_features_from_dataset(df, save_path=features_path)

print("\n✅ Extraction complete!")
print(f"   - Files processed: {len(features_df)}")
# Subtract the two non-feature columns ('filename' and 'genre')
print(f"   - Features extracted: {len(features_df.columns) - 2}")
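
Because the full extraction takes several minutes, it can be worth reusing a previously saved file when the notebook is re-run. The sketch below assumes the features are stored as CSV, as in the final save cell. The names `load_or_extract` and `extract_fn` are hypothetical; `extract_fn` stands in for a call such as `extractor.extract_features_from_dataset(df, save_path=features_path)`.

```python
import tempfile
from pathlib import Path

import pandas as pd

def load_or_extract(cache_path, extract_fn):
    """Return cached features if the CSV exists, otherwise compute and save them."""
    cache_path = Path(cache_path)
    if cache_path.exists():
        return pd.read_csv(cache_path)
    result = extract_fn()
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    result.to_csv(cache_path, index=False)
    return result

# Demo with a tiny fake extractor that records how often it runs
calls = []

def fake_extract():
    calls.append(1)
    return pd.DataFrame({"filename": ["a.wav"], "genre": ["rock"], "tempo": [120.0]})

with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / "processed" / "features.csv"
    first = load_or_extract(cache, fake_extract)   # runs the extractor
    second = load_or_extract(cache, fake_extract)  # reads the CSV instead

print(f"Extractor ran {len(calls)} time(s)")
```

A cache like this goes stale if the audio files or the extraction settings change, so delete the CSV in that case.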

In [None]:
# Preview the DataFrame
features_df.head()

In [None]:
# Descriptive statistics
features_df.describe()

## 6. Visualizing the Extracted Features

In [None]:
visualizer = Visualizer()

In [None]:
# Distribution of a few important features
important_features = [
    'tempo',
    'spectral_centroid_mean',
    'zero_crossing_rate_mean',
    'mfcc_1_mean',
    'rms_mean',
    'spectral_rolloff_mean'
]

fig, axes = plt.subplots(2, 3, figsize=(15, 10))
axes = axes.flatten()

for ax, feature in zip(axes, important_features):
    if feature in features_df.columns:
        sns.boxplot(data=features_df, x='genre', y=feature, ax=ax, palette='husl')
        ax.set_title(feature)
        ax.set_xlabel('')
        plt.setp(ax.xaxis.get_majorticklabels(), rotation=45, ha='right')

plt.suptitle("Feature Distributions by Genre", fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

In [None]:
# Correlation matrix (first 30 numeric features)
numeric_cols = features_df.select_dtypes(include=[np.number]).columns[:30]
corr_matrix = features_df[numeric_cols].corr()

fig, ax = plt.subplots(figsize=(14, 12))
mask = np.triu(np.ones_like(corr_matrix, dtype=bool))
sns.heatmap(corr_matrix, mask=mask, annot=False, cmap='RdBu_r', center=0, ax=ax)
ax.set_title("Feature Correlation Matrix", fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()
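
A heat map shows redundancy visually; listing the most strongly correlated pairs makes it easier to act on. The sketch below assumes a pandas DataFrame of numeric features. The helper `top_correlated_pairs`, the 0.9 threshold and the demo column names are illustrative choices, not part of the project code.

```python
import numpy as np
import pandas as pd

def top_correlated_pairs(frame, threshold=0.9):
    """List feature pairs whose absolute Pearson correlation is >= threshold."""
    corr = frame.corr().abs()
    # Keep only the strict upper triangle so each pair appears once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    pairs = upper.stack()
    pairs = pairs[pairs >= threshold].sort_values(ascending=False)
    return [(a, b, float(r)) for (a, b), r in pairs.items()]

rng = np.random.default_rng(0)
base = rng.normal(size=200)
demo = pd.DataFrame({
    "rms_mean": base,
    "rms_scaled": 2.0 * base + 1.0,   # perfectly correlated with rms_mean
    "tempo": rng.normal(size=200),    # independent noise
})
strong_pairs = top_correlated_pairs(demo)
print(strong_pairs)
```

On the real data this would be called as `top_correlated_pairs(features_df[numeric_cols])`. Highly correlated pairs are candidates for dropping one member before training.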

## 7. Dimensionality Reduction (PCA)

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Prepare the data
feature_cols = [c for c in features_df.columns if c not in ['filename', 'genre']]
X = features_df[feature_cols].values
# Note: this rebinds `y`, which held the audio signal of the sample file in section 4
y = features_df['genre'].values

# Standardize
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

print(f"Explained variance: {pca.explained_variance_ratio_.sum()*100:.1f}%")
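
The explained-variance figure is the share of total variance captured by the two retained components. The sketch below reproduces the standardize-then-project steps with a NumPy SVD on synthetic data to show where that number comes from. The helpers `standardize` and `pca_project` are hypothetical; the notebook itself relies on scikit-learn's `StandardScaler` and `PCA`.

```python
import numpy as np

def standardize(X):
    """Zero mean and unit variance per column (population std, as StandardScaler uses)."""
    std = X.std(axis=0)
    std[std == 0] = 1.0  # avoid dividing by zero for constant columns
    return (X - X.mean(axis=0)) / std

def pca_project(X, n_components=2):
    """Project X onto its leading principal components via SVD."""
    Xc = X - X.mean(axis=0)
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    variances = S ** 2 / (len(X) - 1)      # variance along each component
    ratio = variances / variances.sum()    # share of the total variance
    return Xc @ Vt[:n_components].T, ratio[:n_components]

rng = np.random.default_rng(42)
latent = rng.normal(size=(300, 1))
# Five features: four driven by one latent factor, one pure noise
X_demo = np.hstack([latent * w for w in (1.0, 0.8, -0.6, 0.9)] + [rng.normal(size=(300, 1))])
X_demo[:, :4] += 0.1 * rng.normal(size=(300, 4))

projected, explained = pca_project(standardize(X_demo))
print(f"Explained variance: {explained.sum() * 100:.1f}%")
```

When many features are driven by a few underlying factors, as in this demo, two components capture most of the variance. A low percentage on the real features means the 2D scatter plot hides much of the structure.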

In [None]:
# Visualize the PCA projection
fig, ax = plt.subplots(figsize=(12, 8))

genres = np.unique(y)
colors = sns.color_palette('husl', len(genres))

for genre, color in zip(genres, colors):
    mask = y == genre
    ax.scatter(X_pca[mask, 0], X_pca[mask, 1], c=[color], label=genre, alpha=0.7, s=50)

ax.set_xlabel(f"PC1 ({pca.explained_variance_ratio_[0]*100:.1f}%)")
ax.set_ylabel(f"PC2 ({pca.explained_variance_ratio_[1]*100:.1f}%)")
ax.set_title("PCA Projection of Music Genres", fontsize=14, fontweight='bold')
ax.legend(title='Genre', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.tight_layout()
plt.show()

## 8. Data Validation

In [None]:
# Check for missing values
missing = features_df.isnull().sum().sum()
print(f"Missing values: {missing}")

# Check for infinite values
inf_count = np.isinf(features_df.select_dtypes(include=[np.number])).sum().sum()
print(f"Infinite values: {inf_count}")

# Genre distribution
print("\nGenre distribution:")
print(features_df['genre'].value_counts())
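
If these checks report missing or infinite values, they need to be handled before modeling. One possible approach is sketched below, under the assumption that median imputation per column is acceptable for this dataset. The helper `clean_features` is hypothetical, and dropping the affected rows is an equally valid choice.

```python
import numpy as np
import pandas as pd

def clean_features(frame):
    """Replace +/-inf with NaN, then fill NaN in numeric columns with the column median."""
    cleaned = frame.copy()
    numeric_cols = cleaned.select_dtypes(include=[np.number]).columns
    cleaned[numeric_cols] = cleaned[numeric_cols].replace([np.inf, -np.inf], np.nan)
    cleaned[numeric_cols] = cleaned[numeric_cols].fillna(cleaned[numeric_cols].median())
    return cleaned

demo = pd.DataFrame({
    "genre": ["rock", "jazz", "pop", "rock"],
    "tempo": [120.0, np.inf, 100.0, np.nan],
    "rms_mean": [0.1, 0.2, np.nan, 0.4],
})
cleaned = clean_features(demo)
print(cleaned)
```

The function works on a copy, so the original DataFrame stays available for inspecting which files produced the problematic values.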

## 9. Final Save

In [None]:
# Check that the file was saved
if features_path.exists():
    print(f"✅ Features saved to: {features_path}")
    print(f"   File size: {features_path.stat().st_size / 1024:.1f} KB")
else:
    # Save now if extraction did not already write the file
    features_path.parent.mkdir(parents=True, exist_ok=True)
    features_df.to_csv(features_path, index=False)
    print(f"✅ Features saved to: {features_path}")

In [None]:
print("\n✅ Feature extraction complete!")
print("\n📌 Continue with notebook 03_modeling.ipynb to train the models.")