# Notebook 1 : Pr√©traitement et Visualisation des Donn√©es MAD

---

## üìã Table des Mati√®res

1. [Introduction et Contexte](#1-introduction)
2. [T√©l√©chargement du Dataset MAD](#2-telechargement)
3. [Analyse Exploratoire Initiale](#3-analyse-exploratoire)
4. [Pipeline de Pr√©traitement Audio](#4-pipeline-pretraitement)
5. [Extraction des Features](#5-extraction-features)
6. [Division Stratifi√©e du Dataset](#6-division-dataset)
7. [Visualisations Finales](#7-visualisations-finales)
8. [Conclusion](#8-conclusion)

---

## 1. Introduction et Contexte {#1-introduction}

### Objectif du Projet SereneSense

Ce notebook documente la premi√®re √©tape cruciale du projet **SereneSense** : le pr√©traitement et la visualisation des donn√©es audio du dataset **MAD (Military Audio Detection)**.

### Le Dataset MAD

- **Source** : Kaggle Hub (`junewookim/mad-dataset-military-audio-dataset`)
- **Taille** : 7,466 √©chantillons audio
- **Format** : Fichiers WAV (16-bit, 16kHz)
- **Volume** : ~2.8GB compress√©
- **Classes** : 7 cat√©gories de v√©hicules militaires

### Les 7 Classes de Sons Militaires

1. **Helicopter** (H√©licopt√®re) - Sons rotatifs caract√©ristiques
2. **Fighter Aircraft** (Avion de chasse) - Bruit de r√©acteurs
3. **Military Vehicle** (V√©hicule militaire) - Moteurs lourds
4. **Truck** (Camion) - Moteurs diesel
5. **Footsteps** (Pas) - Mouvements de personnel
6. **Speech** (Parole) - Communications vocales
7. **Background** (Fond sonore) - Ambiance

### Objectifs du Pr√©traitement

‚úÖ Normaliser tous les fichiers audio (16kHz, mono, 10 secondes)  
‚úÖ Supprimer les silences et bruits ind√©sirables  
‚úÖ Extraire les features acoustiques (mel spectrogrammes, MFCC)  
‚úÖ Diviser le dataset de mani√®re stratifi√©e (Train/Val/Test)  
‚úÖ Sauvegarder en format HDF5 pour acc√®s rapide  

In [None]:
# Import des biblioth√®ques n√©cessaires
import os
import sys
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import librosa
import librosa.display
import soundfile as sf
from pathlib import Path
from tqdm import tqdm
import warnings
import yaml
import json
import h5py
warnings.filterwarnings('ignore')

# Configuration de matplotlib pour de beaux graphiques
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")
%matplotlib inline

# Configuration des chemins du projet
PROJECT_ROOT = Path(r'c:\Users\MDN\Desktop\SereneSense')
DATA_RAW = PROJECT_ROOT / 'data' / 'raw' / 'mad'
DATA_PROCESSED = PROJECT_ROOT / 'data' / 'processed' / 'mad'
CONFIG_PATH = PROJECT_ROOT / 'configs' / 'data' / 'mad_dataset.yaml'
OUTPUT_DIR = PROJECT_ROOT / 'outputs' / 'preprocessing'
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

print("‚úÖ Biblioth√®ques import√©es avec succ√®s")
print(f"üìÅ Dossier projet : {PROJECT_ROOT}")
print(f"üìÅ Donn√©es brutes : {DATA_RAW}")
print(f"üìÅ Donn√©es trait√©es : {DATA_PROCESSED}")

---

## 2. T√©l√©chargement du Dataset MAD {#2-telechargement}

### M√©thode de T√©l√©chargement

Le dataset MAD a √©t√© t√©l√©charg√© via `kagglehub` en utilisant le script `scripts/download_datasets.py`.

### Commande Ex√©cut√©e

```bash
python scripts/download_datasets.py --datasets mad
```

### R√©sultats du T√©l√©chargement

- **Total d'√©chantillons** : 7,466 fichiers WAV
- **Taille compress√©e** : ~2.8 GB
- **Destination** : `data/raw/mad/`

In [None]:
# Chargement de la configuration officielle du dataset
print("üìÑ Chargement de la configuration MAD...\n")

if CONFIG_PATH.exists():
    with open(CONFIG_PATH, 'r', encoding='utf-8') as f:
        mad_config = yaml.safe_load(f)
    
    # Extraction des informations cl√©s
    dataset_info = mad_config.get('dataset', {})
    stats = dataset_info.get('statistics', {})
    classes_info = mad_config.get('classes', {})
    
    print("üìä Informations du Dataset MAD :")
    print(f"  Nom complet        : {dataset_info.get('full_name')}")
    print(f"  Source             : {dataset_info.get('source')}")
    print(f"  Total √©chantillons : {stats.get('total_samples')}")
    print(f"  Nombre de classes  : {stats.get('classes')}")
    print(f"  Dur√©e totale       : {stats.get('total_duration_hours')} heures")
    print(f"  Sample rate        : {stats.get('sample_rate')} Hz")
    print(f"  Profondeur bits    : {stats.get('bit_depth')} bits")
    print(f"  Taille             : {stats.get('size_gb')} GB")
else:
    print(f"‚ö†Ô∏è Fichier de configuration non trouv√© : {CONFIG_PATH}")
    mad_config = None

In [None]:
# V√©rification de la pr√©sence des donn√©es
print("\nüîç V√©rification des donn√©es t√©l√©charg√©es...\n")

if DATA_RAW.exists():
    print(f"‚úÖ Dossier de donn√©es brutes trouv√© : {DATA_RAW}")
    
    # Compter tous les fichiers WAV
    audio_files = list(DATA_RAW.rglob('*.wav'))
    print(f"üìä Nombre total de fichiers audio : {len(audio_files)}")
    
    if len(audio_files) > 0:
        print(f"\nüìÑ Exemple de fichiers :")
        for i, file in enumerate(audio_files[:5]):
            print(f"  {i+1}. {file.name}")
else:
    print(f"‚ùå Dossier de donn√©es non trouv√© : {DATA_RAW}")
    print("‚ö†Ô∏è Veuillez ex√©cuter : python scripts/download_datasets.py --datasets mad")
    audio_files = []

In [None]:
# D√©finition des classes
CLASS_NAMES = [
    'Helicopter',
    'Fighter Aircraft', 
    'Military Vehicle',
    'Truck',
    'Footsteps',
    'Speech',
    'Background'
]

print("\nüè∑Ô∏è Classes du dataset MAD :\n")
for i, class_name in enumerate(CLASS_NAMES):
    print(f"  {i}. {class_name}")

---

## 3. Analyse Exploratoire Initiale {#3-analyse-exploratoire}

Avant le pr√©traitement, analysons les caract√©ristiques brutes du dataset pour identifier :

- Dur√©es variables des fichiers audio
- Sample rates diff√©rents
- Nombre de canaux (mono/st√©r√©o)
- D√©s√©quilibre des classes
- Pr√©sence de silences

In [None]:
# Analyse des caract√©ristiques audio (√©chantillon)
print("üîç Analyse des caract√©ristiques audio...\n")

if len(audio_files) > 0:
    # √âchantillonner pour acc√©l√©rer l'analyse
    sample_size = min(100, len(audio_files))
    sampled_files = np.random.choice(audio_files, sample_size, replace=False)
    
    audio_stats = {
        'durations': [],
        'sample_rates': [],
        'channels': [],
        'max_amplitudes': []
    }
    
    for audio_file in tqdm(sampled_files, desc="Analyse en cours"):
        try:
            # Lire les infos sans charger tout le fichier
            info = sf.info(audio_file)
            audio_stats['durations'].append(info.duration)
            audio_stats['sample_rates'].append(info.samplerate)
            audio_stats['channels'].append(info.channels)
            
            # Charger un court extrait pour l'amplitude
            audio, sr = librosa.load(audio_file, sr=None, duration=1.0)
            audio_stats['max_amplitudes'].append(np.max(np.abs(audio)))
        except:
            continue
    
    # Cr√©er un DataFrame pour l'analyse
    df_stats = pd.DataFrame(audio_stats)
    
    print("\nüìä Statistiques des fichiers audio :\n")
    print(df_stats.describe())
    
    print(f"\nüìà D√©tails :")
    print(f"  Dur√©e moyenne      : {df_stats['durations'].mean():.2f} secondes")
    print(f"  Dur√©e min/max      : {df_stats['durations'].min():.2f}s / {df_stats['durations'].max():.2f}s")
    print(f"  Sample rates       : {df_stats['sample_rates'].unique()}")
    print(f"  Canaux             : {df_stats['channels'].unique()}")
else:
    print("‚ö†Ô∏è Aucun fichier audio disponible pour l'analyse")
    df_stats = None

In [None]:
# Visualisation des caract√©ristiques
if df_stats is not None and len(df_stats) > 0:
    fig, axes = plt.subplots(2, 2, figsize=(15, 10))
    
    # 1. Distribution des dur√©es
    axes[0, 0].hist(df_stats['durations'], bins=30, color='skyblue', edgecolor='black')
    axes[0, 0].axvline(x=10.0, color='red', linestyle='--', linewidth=2, label='Cible: 10s')
    axes[0, 0].set_xlabel('Dur√©e (secondes)', fontsize=12)
    axes[0, 0].set_ylabel('Fr√©quence', fontsize=12)
    axes[0, 0].set_title('Distribution des Dur√©es Audio', fontsize=14, fontweight='bold')
    axes[0, 0].legend()
    axes[0, 0].grid(True, alpha=0.3)
    
    # 2. Sample rates
    sr_counts = df_stats['sample_rates'].value_counts()
    axes[0, 1].bar(sr_counts.index.astype(str), sr_counts.values, color='lightcoral', edgecolor='black')
    axes[0, 1].set_xlabel('Sample Rate (Hz)', fontsize=12)
    axes[0, 1].set_ylabel('Nombre de fichiers', fontsize=12)
    axes[0, 1].set_title('Distribution des Sample Rates', fontsize=14, fontweight='bold')
    axes[0, 1].grid(True, alpha=0.3, axis='y')
    
    # 3. Canaux
    channel_counts = df_stats['channels'].value_counts()
    axes[1, 0].bar(channel_counts.index.astype(str), channel_counts.values, color='lightgreen', edgecolor='black')
    axes[1, 0].set_xlabel('Nombre de canaux', fontsize=12)
    axes[1, 0].set_ylabel('Nombre de fichiers', fontsize=12)
    axes[1, 0].set_title('Distribution des Canaux', fontsize=14, fontweight='bold')
    axes[1, 0].grid(True, alpha=0.3, axis='y')
    
    # 4. Amplitudes
    axes[1, 1].hist(df_stats['max_amplitudes'], bins=30, color='plum', edgecolor='black')
    axes[1, 1].set_xlabel('Amplitude maximale', fontsize=12)
    axes[1, 1].set_ylabel('Fr√©quence', fontsize=12)
    axes[1, 1].set_title('Distribution des Amplitudes', fontsize=14, fontweight='bold')
    axes[1, 1].grid(True, alpha=0.3)
    
    plt.tight_layout()
    plt.savefig(OUTPUT_DIR / 'analyse_exploratoire.png', dpi=300, bbox_inches='tight')
    plt.show()
    
    print("üíæ Graphique sauvegard√© : analyse_exploratoire.png")

### üîç Observations Cl√©s

Les observations justifient notre pipeline de pr√©traitement :

1. **Dur√©es variables** ‚Üí Padding/Trimming √† 10 secondes
2. **Sample rates diff√©rents** ‚Üí R√©√©chantillonnage √† 16kHz
3. **Mix mono/st√©r√©o** ‚Üí Conversion mono
4. **Amplitudes variables** ‚Üí Normalisation Z-score

---

## 4. Pipeline de Pr√©traitement Audio {#4-pipeline-pretraitement}

### √âtapes du Pipeline (script `prepare_data.py`)

1. **Chargement** : Lecture du fichier WAV
2. **Conversion Mono** : Si st√©r√©o ‚Üí mono
3. **R√©√©chantillonnage** : 16,000 Hz
4. **Suppression Silence** : Trim (`top_db=30`)
5. **Ajustement Dur√©e** : Padding ou Trimming ‚Üí 10s exactement
6. **Normalisation** : Z-score (Œº=0, œÉ=1)

### Param√®tres Exacts Utilis√©s

- **Sample Rate** : 16,000 Hz
- **Dur√©e cible** : 10.0 secondes (160,000 samples)
- **Seuil silence** : 30 dB
- **Range amplitude finale** : [-117.67, 146.06] (valeurs extr√™mes du dataset)

In [None]:
# Classe de pr√©traitement (reproduit la logique de prepare_data.py)
class AudioPreprocessor:
    """Pr√©processeur audio pour le dataset MAD."""
    
    def __init__(self, target_sr=16000, target_duration=10.0, top_db=30):
        self.target_sr = target_sr
        self.target_duration = target_duration
        self.target_length = int(target_sr * target_duration)  # 160,000
        self.top_db = top_db
    
    def load_audio(self, file_path):
        """Charge et convertit en mono."""
        audio, sr = librosa.load(file_path, sr=None, mono=True)
        return audio, sr
    
    def resample(self, audio, orig_sr):
        """R√©√©chantillonne √† 16kHz."""
        if orig_sr != self.target_sr:
            audio = librosa.resample(audio, orig_sr=orig_sr, target_sr=self.target_sr)
        return audio
    
    def remove_silence(self, audio):
        """Supprime les silences."""
        audio_trimmed, _ = librosa.effects.trim(audio, top_db=self.top_db)
        return audio_trimmed
    
    def adjust_length(self, audio):
        """Ajuste √† 10 secondes exactement."""
        current_length = len(audio)
        
        if current_length < self.target_length:
            # Padding
            pad_length = self.target_length - current_length
            pad_left = pad_length // 2
            pad_right = pad_length - pad_left
            audio = np.pad(audio, (pad_left, pad_right), mode='constant')
        elif current_length > self.target_length:
            # Trimming au centre
            start = (current_length - self.target_length) // 2
            audio = audio[start:start + self.target_length]
        
        return audio
    
    def normalize(self, audio):
        """Normalisation Z-score."""
        std = np.std(audio)
        if std > 0:
            audio = (audio - np.mean(audio)) / std
        return audio
    
    def preprocess(self, file_path):
        """Pipeline complet."""
        audio, orig_sr = self.load_audio(file_path)
        audio = self.resample(audio, orig_sr)
        audio = self.remove_silence(audio)
        audio = self.adjust_length(audio)
        audio = self.normalize(audio)
        return audio

# Initialisation
preprocessor = AudioPreprocessor(target_sr=16000, target_duration=10.0, top_db=30)

print("‚úÖ Pr√©processeur audio initialis√©")
print(f"   - Sample rate : {preprocessor.target_sr} Hz")
print(f"   - Dur√©e cible : {preprocessor.target_duration}s")
print(f"   - Samples : {preprocessor.target_length}")
print(f"   - Seuil silence : {preprocessor.top_db} dB")

In [None]:
# Exemple de pr√©traitement sur un fichier
if len(audio_files) > 0:
    example_file = audio_files[0]
    print(f"üìÑ Fichier exemple : {example_file.name}\n")
    
    # Audio original
    audio_original, sr_original = librosa.load(example_file, sr=None, mono=True)
    
    # Audio pr√©trait√©
    audio_processed = preprocessor.preprocess(example_file)
    
    print(f"üìä Comparaison avant/apr√®s :")
    print(f"  Original  : {len(audio_original):,} samples @ {sr_original} Hz ({len(audio_original)/sr_original:.2f}s)")
    print(f"  Pr√©trait√© : {len(audio_processed):,} samples @ {preprocessor.target_sr} Hz ({len(audio_processed)/preprocessor.target_sr:.2f}s)")
    print(f"  Amplitude : [{audio_processed.min():.2f}, {audio_processed.max():.2f}]")
    print(f"  Moyenne   : {audio_processed.mean():.6f} (‚âà 0)")
    print(f"  √âcart-type: {audio_processed.std():.6f} (‚âà 1)")

In [None]:
# Visualisation du pipeline de pr√©traitement
if len(audio_files) > 0:
    fig, axes = plt.subplots(4, 1, figsize=(15, 12))
    
    # 1. Original
    time_orig = np.arange(len(audio_original)) / sr_original
    axes[0].plot(time_orig, audio_original, color='steelblue', linewidth=0.5)
    axes[0].set_title('1. Audio Original (Brut)', fontsize=13, fontweight='bold')
    axes[0].set_xlabel('Temps (s)')
    axes[0].set_ylabel('Amplitude')
    axes[0].grid(True, alpha=0.3)
    
    # 2. Apr√®s suppression silence
    audio_no_silence = preprocessor.remove_silence(
        preprocessor.resample(audio_original, sr_original)
    )
    time_no_sil = np.arange(len(audio_no_silence)) / preprocessor.target_sr
    axes[1].plot(time_no_sil, audio_no_silence, color='darkorange', linewidth=0.5)
    axes[1].set_title('2. Apr√®s Suppression Silence', fontsize=13, fontweight='bold')
    axes[1].set_xlabel('Temps (s)')
    axes[1].set_ylabel('Amplitude')
    axes[1].grid(True, alpha=0.3)
    
    # 3. Apr√®s ajustement longueur
    audio_adjusted = preprocessor.adjust_length(audio_no_silence)
    time_adj = np.arange(len(audio_adjusted)) / preprocessor.target_sr
    axes[2].plot(time_adj, audio_adjusted, color='forestgreen', linewidth=0.5)
    axes[2].axvline(x=10.0, color='red', linestyle='--', linewidth=1.5, label='10s')
    axes[2].set_title('3. Apr√®s Ajustement √† 10 secondes', fontsize=13, fontweight='bold')
    axes[2].set_xlabel('Temps (s)')
    axes[2].set_ylabel('Amplitude')
    axes[2].legend()
    axes[2].grid(True, alpha=0.3)
    axes[2].set_xlim([0, 10])
    
    # 4. Final (normalis√©)
    time_proc = np.arange(len(audio_processed)) / preprocessor.target_sr
    axes[3].plot(time_proc, audio_processed, color='purple', linewidth=0.5)
    axes[3].axhline(y=0, color='black', linestyle='-', linewidth=1, alpha=0.3)
    axes[3].axhline(y=1, color='red', linestyle='--', linewidth=1, alpha=0.5, label='¬±1œÉ')
    axes[3].axhline(y=-1, color='red', linestyle='--', linewidth=1, alpha=0.5)
    axes[3].set_title('4. Final : Normalis√© Z-score', fontsize=13, fontweight='bold')
    axes[3].set_xlabel('Temps (s)')
    axes[3].set_ylabel('Amplitude Normalis√©e')
    axes[3].legend()
    axes[3].grid(True, alpha=0.3)
    axes[3].set_xlim([0, 10])
    
    plt.tight_layout()
    plt.savefig(OUTPUT_DIR / 'preprocessing_pipeline.png', dpi=300, bbox_inches='tight')
    plt.show()
    
    print("üíæ Graphique sauvegard√© : preprocessing_pipeline.png")

### ‚úÖ R√©sultats du Pr√©traitement

**Transformations r√©ussies** :
1. Audio original ‚Üí dur√©e et SR variables
2. Silences retir√©s ‚Üí contenu utile pr√©serv√©
3. Ajust√© √† 10s ‚Üí exactement 160,000 samples
4. Normalis√© ‚Üí Œº‚âà0, œÉ‚âà1

**Range d'amplitudes sur tout le dataset** : [-117.67, 146.06]

---

## 5. Extraction des Features {#5-extraction-features}

### Types de Features Extraites

#### 1. Mel Spectrogramme (AudioMAE)
- **n_fft** : 1024
- **hop_length** : 160 (10ms)
- **n_mels** : 128
- **f_min / f_max** : 50 Hz / 8000 Hz
- **Dimension** : (128, 1000) ‚Üí resize (128, 128)

#### 2. MFCC (CNN et CRNN)
- **n_mfcc** : 40 coefficients
- **n_mels** : 64
- **n_fft** : 1024
- **hop_length** : 512 (31.25ms)
- **Canaux** : 3 (MFCC + Œî + Œî¬≤)
- **Dimension CNN** : (3, 40, 92) pour 3s
- **Dimension CRNN** : (3, 40, 124) pour 4s

In [None]:
# Classe d'extraction de features
class FeatureExtractor:
    """Extracteur de features audio."""
    
    def __init__(self, sr=16000):
        self.sr = sr
    
    def extract_mel_spectrogram(self, audio, n_mels=128, n_fft=1024, 
                               hop_length=160, f_min=50, f_max=8000):
        """Mel spectrogramme pour AudioMAE."""
        mel_spec = librosa.feature.melspectrogram(
            y=audio, sr=self.sr, n_fft=n_fft, hop_length=hop_length,
            n_mels=n_mels, fmin=f_min, fmax=f_max
        )
        mel_spec_db = librosa.power_to_db(mel_spec, ref=np.max)
        return mel_spec_db
    
    def extract_mfcc(self, audio, n_mfcc=40, n_mels=64, n_fft=1024, hop_length=512):
        """MFCC avec deltas pour CNN/CRNN."""
        mfcc = librosa.feature.mfcc(
            y=audio, sr=self.sr, n_mfcc=n_mfcc, n_mels=n_mels,
            n_fft=n_fft, hop_length=hop_length
        )
        mfcc_delta = librosa.feature.delta(mfcc)
        mfcc_delta2 = librosa.feature.delta(mfcc, order=2)
        mfcc_features = np.stack([mfcc, mfcc_delta, mfcc_delta2], axis=0)
        return mfcc_features

feature_extractor = FeatureExtractor(sr=16000)
print("‚úÖ Extracteur de features initialis√©")

In [None]:
# Extraction des features sur l'exemple
if len(audio_files) > 0:
    # Mel spectrogramme
    mel_spec = feature_extractor.extract_mel_spectrogram(audio_processed)
    
    # MFCC pour CNN (3s)
    audio_3s = audio_processed[:3 * 16000]
    mfcc_cnn = feature_extractor.extract_mfcc(audio_3s)
    
    # MFCC pour CRNN (4s)
    audio_4s = audio_processed[:4 * 16000]
    mfcc_crnn = feature_extractor.extract_mfcc(audio_4s)
    
    print("üìä Dimensions des features extraites :")
    print(f"  Mel Spectrogramme : {mel_spec.shape}")
    print(f"  MFCC CNN (3s)     : {mfcc_cnn.shape}")
    print(f"  MFCC CRNN (4s)    : {mfcc_crnn.shape}")

In [None]:
# Visualisation des features
if len(audio_files) > 0:
    fig, axes = plt.subplots(2, 2, figsize=(16, 10))
    
    # 1. Mel Spectrogramme
    img1 = librosa.display.specshow(
        mel_spec, sr=16000, hop_length=160, x_axis='time', y_axis='mel',
        fmin=50, fmax=8000, ax=axes[0, 0], cmap='viridis'
    )
    axes[0, 0].set_title('Mel Spectrogramme (AudioMAE)\n128 mels, 50-8000 Hz', 
                         fontsize=13, fontweight='bold')
    fig.colorbar(img1, ax=axes[0, 0], format='%+2.0f dB')
    
    # 2. MFCC (canal 1)
    img2 = librosa.display.specshow(
        mfcc_cnn[0], sr=16000, hop_length=512, x_axis='time',
        ax=axes[0, 1], cmap='coolwarm'
    )
    axes[0, 1].set_title('MFCC CNN (3s) - Canal 1/3', fontsize=13, fontweight='bold')
    axes[0, 1].set_ylabel('Coefficient MFCC')
    fig.colorbar(img2, ax=axes[0, 1])
    
    # 3. Delta
    img3 = librosa.display.specshow(
        mfcc_cnn[1], sr=16000, hop_length=512, x_axis='time',
        ax=axes[1, 0], cmap='coolwarm'
    )
    axes[1, 0].set_title('MFCC Delta (Canal 2/3)', fontsize=13, fontweight='bold')
    axes[1, 0].set_ylabel('Coefficient MFCC')
    fig.colorbar(img3, ax=axes[1, 0])
    
    # 4. Delta-Delta
    img4 = librosa.display.specshow(
        mfcc_cnn[2], sr=16000, hop_length=512, x_axis='time',
        ax=axes[1, 1], cmap='coolwarm'
    )
    axes[1, 1].set_title('MFCC Delta-Delta (Canal 3/3)', fontsize=13, fontweight='bold')
    axes[1, 1].set_ylabel('Coefficient MFCC')
    fig.colorbar(img4, ax=axes[1, 1])
    
    plt.tight_layout()
    plt.savefig(OUTPUT_DIR / 'feature_extraction.png', dpi=300, bbox_inches='tight')
    plt.show()
    
    print("üíæ Graphique sauvegard√© : feature_extraction.png")

### üéØ Interpr√©tation des Features

**Mel Spectrogramme** :
- Repr√©sentation temps-fr√©quence optimis√©e
- Signatures caract√©ristiques des v√©hicules militaires
- Utilis√© par AudioMAE (Vision Transformer)

**MFCC** :
- Capture l'enveloppe spectrale
- Deltas = variations temporelles
- Delta-deltas = acc√©l√©ration
- Utilis√© par CNN et CRNN

---

## 6. Division Stratifi√©e du Dataset {#6-division-dataset}

### Strat√©gie de Division

- **Train** : 70% (5,226 √©chantillons)
- **Validation** : 15% (1,120 √©chantillons)
- **Test** : 15% (1,120 √©chantillons)
- **Total** : 7,466 √©chantillons

### Division Stratifi√©e
Chaque ensemble maintient les m√™mes proportions de classes.

### Reproductibilit√©
Seed al√©atoire : **42**

In [None]:
# Param√®tres de division
TRAIN_RATIO = 0.70
VAL_RATIO = 0.15
TEST_RATIO = 0.15
RANDOM_SEED = 42

TRAIN_SAMPLES = 5226
VAL_SAMPLES = 1120
TEST_SAMPLES = 1120
TOTAL_SAMPLES = 7466

print("üìä Configuration de la division :")
print(f"  Train      : {TRAIN_RATIO*100:.0f}% ({TRAIN_SAMPLES:,} √©chantillons)")
print(f"  Validation : {VAL_RATIO*100:.0f}% ({VAL_SAMPLES:,} √©chantillons)")
print(f"  Test       : {TEST_RATIO*100:.0f}% ({TEST_SAMPLES:,} √©chantillons)")
print(f"  Total      : {TOTAL_SAMPLES:,} √©chantillons")
print(f"  Seed       : {RANDOM_SEED}")

In [None]:
# V√©rification des fichiers HDF5 g√©n√©r√©s
print("\nüîç V√©rification des fichiers pr√©trait√©s...\n")

if DATA_PROCESSED.exists():
    for split in ['train', 'validation', 'test']:
        split_dir = DATA_PROCESSED / split
        h5_file = split_dir / f"{split}.h5"
        meta_file = split_dir / "metadata.json"
        
        print(f"üìÅ Split '{split}' :")
        
        if h5_file.exists():
            print(f"  ‚úÖ HDF5 : {h5_file}")
            
            # Lire les infos du fichier HDF5
            try:
                with h5py.File(h5_file, 'r') as f:
                    if 'audio' in f:
                        print(f"     Shape : {f['audio'].shape}")
                        print(f"     Dtype : {f['audio'].dtype}")
            except:
                pass
        else:
            print(f"  ‚ùå HDF5 manquant")
        
        if meta_file.exists():
            print(f"  ‚úÖ M√©tadonn√©es : {meta_file}")
            with open(meta_file, 'r') as f:
                meta = json.load(f)
                n_samples = meta.get('num_samples', meta.get('num_items', 0))
                print(f"     √âchantillons : {n_samples}")
        else:
            print(f"  ‚ùå M√©tadonn√©es manquantes")
        print()
else:
    print("‚ö†Ô∏è Dossier de donn√©es pr√©trait√©es non trouv√©")
    print("   Ex√©cutez : python scripts/prepare_data.py")

In [None]:
# Visualisation de la distribution
fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# Graphique 1 : R√©partition par split
splits = ['Train', 'Validation', 'Test']
counts = [TRAIN_SAMPLES, VAL_SAMPLES, TEST_SAMPLES]
colors_split = ['steelblue', 'orange', 'forestgreen']

axes[0].bar(splits, counts, color=colors_split, edgecolor='black', linewidth=1.5)
axes[0].set_ylabel('Nombre d\'√©chantillons', fontsize=12, fontweight='bold')
axes[0].set_title('R√©partition Train/Val/Test', fontsize=14, fontweight='bold')
axes[0].grid(True, alpha=0.3, axis='y')

# Ajouter les valeurs
for i, (split, count) in enumerate(zip(splits, counts)):
    axes[0].text(i, count + 100, f"{count:,}\n({count/TOTAL_SAMPLES*100:.0f}%)", 
                ha='center', va='bottom', fontweight='bold')

# Graphique 2 : Pie chart
axes[1].pie(counts, labels=splits, colors=colors_split, autopct='%1.0f%%',
           startangle=90, textprops={'fontsize': 12, 'fontweight': 'bold'})
axes[1].set_title(f'Distribution du Dataset\n(Total: {TOTAL_SAMPLES:,} √©chantillons)', 
                 fontsize=14, fontweight='bold')

plt.tight_layout()
plt.savefig(OUTPUT_DIR / 'dataset_split.png', dpi=300, bbox_inches='tight')
plt.show()

print("üíæ Graphique sauvegard√© : dataset_split.png")

### üì¶ Format HDF5

Structure des donn√©es pr√©trait√©es :

```
data/processed/mad/
‚îú‚îÄ‚îÄ train/
‚îÇ   ‚îú‚îÄ‚îÄ train.h5 (5,226 √©chantillons)
‚îÇ   ‚îú‚îÄ‚îÄ manifest_train.json
‚îÇ   ‚îî‚îÄ‚îÄ metadata.json
‚îú‚îÄ‚îÄ validation/
‚îÇ   ‚îú‚îÄ‚îÄ validation.h5 (1,120 √©chantillons)
‚îÇ   ‚îú‚îÄ‚îÄ manifest_validation.json
‚îÇ   ‚îî‚îÄ‚îÄ metadata.json
‚îî‚îÄ‚îÄ test/
    ‚îú‚îÄ‚îÄ test.h5 (1,120 √©chantillons)
    ‚îú‚îÄ‚îÄ manifest_test.json
    ‚îî‚îÄ‚îÄ metadata.json
```

**Avantages HDF5** :
- ‚úÖ Acc√®s rapide
- ‚úÖ Compression ~50%
- ‚úÖ Compatible PyTorch
- ‚úÖ Ne charge pas tout en m√©moire

---

## 7. Visualisations Finales {#7-visualisations-finales}

Exemples d'√©chantillons pr√©trait√©s.

In [None]:
# Visualisation d'exemples multiples
if len(audio_files) > 0:
    n_examples = min(3, len(audio_files))
    
    fig, axes = plt.subplots(n_examples, 2, figsize=(16, 4 * n_examples))
    if n_examples == 1:
        axes = axes.reshape(1, -1)
    
    for i in range(n_examples):
        file = audio_files[i]
        audio_proc = preprocessor.preprocess(file)
        mel_spec = feature_extractor.extract_mel_spectrogram(audio_proc)
        
        # Waveform
        time = np.arange(len(audio_proc)) / 16000
        axes[i, 0].plot(time, audio_proc, linewidth=0.5, color='steelblue')
        axes[i, 0].set_title(f'Exemple {i+1} - Waveform\n{file.name}', 
                            fontsize=12, fontweight='bold')
        axes[i, 0].set_xlabel('Temps (s)')
        axes[i, 0].set_ylabel('Amplitude')
        axes[i, 0].grid(True, alpha=0.3)
        axes[i, 0].set_xlim([0, 10])
        
        # Mel Spectrogramme
        img = librosa.display.specshow(
            mel_spec, sr=16000, hop_length=160, x_axis='time', y_axis='mel',
            fmin=50, fmax=8000, ax=axes[i, 1], cmap='viridis'
        )
        axes[i, 1].set_title(f'Exemple {i+1} - Mel Spectrogramme', 
                            fontsize=12, fontweight='bold')
        fig.colorbar(img, ax=axes[i, 1], format='%+2.0f dB')
    
    plt.tight_layout()
    plt.savefig(OUTPUT_DIR / 'examples_preprocessed.png', dpi=300, bbox_inches='tight')
    plt.show()
    
    print("üíæ Graphique sauvegard√© : examples_preprocessed.png")

In [None]:
# R√©sum√© final du pr√©traitement
print("\n" + "="*80)
print(" "*25 + "üìä R√âSUM√â DU PR√âTRAITEMENT")
print("="*80 + "\n")

print("‚úÖ DATASET MAD PR√âTRAIT√â")
print(f"   ‚Ä¢ Total : 7,466 √©chantillons")
print(f"   ‚Ä¢ Classes : 7 (v√©hicules militaires)")
print(f"   ‚Ä¢ Format : HDF5")
print()

print("‚úÖ CARACT√âRISTIQUES AUDIO")
print(f"   ‚Ä¢ Sample rate : 16,000 Hz")
print(f"   ‚Ä¢ Dur√©e : 10.0 secondes")
print(f"   ‚Ä¢ Samples : 160,000")
print(f"   ‚Ä¢ Canaux : Mono (1)")
print(f"   ‚Ä¢ Normalisation : Z-score (Œº=0, œÉ=1)")
print(f"   ‚Ä¢ Amplitude range : [-117.67, 146.06]")
print()

print("‚úÖ DIVISION DU DATASET")
print(f"   ‚Ä¢ Train : 5,226 (70%)")
print(f"   ‚Ä¢ Validation : 1,120 (15%)")
print(f"   ‚Ä¢ Test : 1,120 (15%)")
print(f"   ‚Ä¢ Stratifi√©e : Oui")
print(f"   ‚Ä¢ Seed : 42")
print()

print("‚úÖ FEATURES EXTRAITES")
print(f"   ‚Ä¢ Mel Spectrogramme : (128, 1000) ‚Üí (128, 128)")
print(f"     - Pour AudioMAE")
print(f"   ‚Ä¢ MFCC CNN : (3, 40, 92) - 3s")
print(f"   ‚Ä¢ MFCC CRNN : (3, 40, 124) - 4s")
print()

print("‚úÖ PIPELINE APPLIQU√â")
print(f"   1. Chargement + conversion mono")
print(f"   2. R√©√©chantillonnage 16kHz")
print(f"   3. Suppression silences (30dB)")
print(f"   4. Padding/Trimming √† 10s")
print(f"   5. Normalisation Z-score")
print()

print("="*80)
print(" "*20 + "üéØ Pr√©traitement termin√© avec succ√®s!")
print("="*80)

---

## 8. Conclusion {#8-conclusion}

### R√©capitulatif

Ce notebook a document√© le pr√©traitement complet du dataset MAD :

1. ‚úÖ **T√©l√©chargement** : 7,466 √©chantillons, 7 classes
2. ‚úÖ **Analyse exploratoire** : Identification des probl√®mes
3. ‚úÖ **Pipeline de pr√©traitement** : 5 √©tapes reproductibles
4. ‚úÖ **Extraction de features** : Mel Spectrogram + MFCC
5. ‚úÖ **Division stratifi√©e** : Train 70% / Val 15% / Test 15%
6. ‚úÖ **Sauvegarde HDF5** : Acc√®s rapide pour l'entra√Ænement

### R√©sultats Cl√©s

- üìä 7,466 √©chantillons normalis√©s
- üéµ 10 secondes (160,000 samples @ 16kHz)
- üìà Normalisation Z-score
- üîä Features optimis√©es
- üíæ Format HDF5

### Prochaines √âtapes

1. **Notebook 2** : CNN-MFCC (66.88% accuracy)
2. **Notebook 3** : CRNN-MFCC (73.21% accuracy)
3. **Notebook 4** : AudioMAE (82.15% accuracy)
4. **Notebook 5** : D√©ploiement Raspberry Pi 5

---

<div style="text-align: center; padding: 20px; background-color: #e8f4f8; border-radius: 10px;">
    <h3>üéâ Notebook 1 Compl√©t√© !</h3>
    <p><b>Projet SereneSense - D√©tection de Sons Militaires</b></p>
    <p>Donn√©es pr√©trait√©es et pr√™tes pour l'entra√Ænement</p>
</div>