# Board Meeting Transcription (CA)

This notebook runs the **complete pipeline**:
- Audio **preprocessing** (FFmpeg + noisereduce)
- **Transcription** with faster-whisper (anti-hallucination settings)
- **Long chunks** for better coherence (900 s per chunk in the configuration below)
- **Diarization** (pyannote → whisperx fallback)
- **Post-processing** (deduplication + number/unit normalization)
- **LLM cleanup** in 1000-character pieces, with a cap on the correction rate
- **JSON export** of the outputs (raw, diarized, cleaned, llm_cleaned)

# **Installing the required packages**

In [1]:
# %%capture
# # Silent installation of the dependencies, with conflict handling

# # 1. Upgrade pip to avoid problems
# #!pip install --upgrade pip -q

# # 2. Install FFmpeg (system package)
# !apt-get update -qq
# !apt-get install -qq ffmpeg sox

# # 3. Clean up and pin the NumPy/Numba/SciPy stack
# #!pip uninstall -y numpy numba >/dev/null 2>&1 || true
# !pip install -q numpy==1.26.4 scipy==1.11.4
# !pip install -q numba==0.58.1
# !pip install -q torch==2.1.2 torchaudio==2.1.2 --index-url https://download.pytorch.org/whl/cu118

# # 4. Transcription packages
# #!pip install -q openai-whisper==20231117
# !pip install -q faster-whisper==1.0.3

# # 5. Audio denoising packages
# !pip install -q librosa==0.10.1
# !pip install -q soundfile==0.12.1
# !pip install -q noisereduce==3.0.0
# !pip install -q pydub==0.25.1

# # 6. Diarization
# !pip install -q "pyannote.audio>=3.1"
# !pip install -q whisperx

# !pip install -q regex==2023.12.25 unidecode==1.3.8

# # 7. Document packages
# !pip install -q python-docx==1.2.0
# !pip install -q python-pptx==1.0.2

# # 8. LLM and NLP packages
# !pip install -q openai==1.91.0
# !pip install -q assemblyai==0.44.3
# !pip install -q tiktoken==0.9.0

# # 9. LangChain
# #!pip install -q langchain==0.3.27 langchain-community==0.3.29 langchain-core==0.3.30

# # 10. Utility packages
# !pip install -q pandas==2.1.4 matplotlib==3.8.2 seaborn==0.13.2

# # 11. FAISS for RAG
# #!pip install -q faiss-cpu==1.7.4

# print("✅ Installation terminée!")


In [2]:
%%capture
# Minimal installation of the required dependencies without disturbing the Kaggle environment
import importlib
import os
import shutil
import subprocess
import sys

def ensure_packages(requirements):
    missing = []
    for module_name, package_spec in requirements:
        try:
            importlib.import_module(module_name)
        except Exception:
            missing.append(package_spec)
    if missing:
        cmd = [sys.executable, '-m', 'pip', 'install', '--no-cache-dir', '-q'] + missing
        subprocess.check_call(cmd)

core_requirements = [
    ('faster_whisper', 'faster-whisper==1.0.3'),
    ('librosa', 'librosa==0.10.1'),
    ('soundfile', 'soundfile==0.12.1'),
    ('noisereduce', 'noisereduce==3.0.0'),
    ('pydub', 'pydub==0.25.1'),
    ('docx', 'python-docx==1.2.0'),
    ('pptx', 'python-pptx==1.0.2'),
    ('openai', 'openai==1.91.0'),
    ('assemblyai', 'assemblyai==0.44.3'),
    ('tiktoken', 'tiktoken==0.9.0'),
]

ensure_packages(core_requirements)

# if os.environ.get('INSTALL_LANGCHAIN', '0') == '1':
#     optional_requirements = [
#         ('langchain', 'langchain==0.3.27'),
#         ('langchain_community', 'langchain-community==0.3.29'),
#         ('faiss', 'faiss-cpu==1.7.4'),
#     ]
#     try:
#         ensure_packages(optional_requirements)
#     except subprocess.CalledProcessError:
#         pass

if not shutil.which('ffmpeg'):
    subprocess.check_call(['apt-get', 'update', '-qq'])
    subprocess.check_call(['apt-get', 'install', '-qq', 'ffmpeg'])

print('✅ Vérification des dépendances terminée')



In [3]:
# Check that everything is installed correctly
import importlib

packages_to_check = [
    ('numpy', 'numpy'),
    ('scipy', 'scipy'),
    ('numba', 'numba'),
    ('whisper', 'openai-whisper'),
    ('faster_whisper', 'faster-whisper'),
    ('librosa', 'librosa'),
    ('soundfile', 'soundfile'),
    ('noisereduce', 'noisereduce'),
    ('pydub', 'pydub'),
    ('docx', 'python-docx'),
    ('pptx', 'python-pptx'),
    ('openai', 'openai'),
    ('langchain', 'langchain'),
    ('langchain_community', 'langchain-community'),
    ('faiss', 'faiss-cpu'),
    ('assemblyai', 'assemblyai'),
    ('tiktoken', 'tiktoken')
]

print("🔍 Vérification des packages installés:")
print("-" * 50)

all_ok = True
for import_name, package_name in packages_to_check:
    try:
        module = importlib.import_module(import_name)
        version = getattr(module, '__version__', 'N/A')
        print(f"✅ {package_name:20} : {version}")
    except ImportError:
        print(f"❌ {package_name:20} : Non installé")
        all_ok = False
    except Exception as exc:
        print(f"⚠️ {package_name:20} : Erreur lors de l'import ({type(exc).__name__}: {exc})")
        all_ok = False

if all_ok:
    print("✨ Tous les packages sont installés correctement!")
else:
    print("⚠️ Certains packages nécessitent une attention. Consultez les messages ci-dessus.")


🔍 Vérification des packages installés:
--------------------------------------------------
✅ numpy                : 1.26.4
✅ scipy                : 1.15.3
✅ numba                : 0.60.0
❌ openai-whisper       : Non installé
✅ faster-whisper       : 1.0.3
✅ librosa              : 0.11.0
✅ soundfile            : 0.13.1
✅ noisereduce          : N/A
✅ pydub                : N/A
✅ python-docx          : 1.2.0
✅ python-pptx          : 1.0.2
✅ openai               : 1.91.0
✅ langchain            : 0.3.26
❌ langchain-community  : Non installé
❌ faiss-cpu            : Non installé
✅ assemblyai           : 0.44.3
✅ tiktoken             : 0.9.0
⚠️ Certains packages nécessitent une attention. Consultez les messages ci-dessus.
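
In the output above, `noisereduce` and `pydub` show `N/A` because the check reads `module.__version__`, which these modules do not define. A minimal sketch of an alternative, assuming Python 3.8+ where `importlib.metadata` is in the standard library: look up the version by distribution name (the name given to pip) instead of by import name. The helper `installed_version` is a hypothetical name, not part of the notebook.

```python
from importlib import metadata
from typing import Optional


def installed_version(distribution: str) -> Optional[str]:
    """Return the installed version of a pip distribution, or None if it is absent."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None


# Distribution names (as passed to pip), not import names.
for dist in ("noisereduce", "pydub", "faster-whisper"):
    print(f"{dist:20} : {installed_version(dist) or 'not installed'}")
```

This reads package metadata only, so it also avoids importing heavy modules just to check a version.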


# **Imports and GPU configuration**

In [4]:
# Standard imports
import os, sys, json, re, shutil, subprocess, tempfile
from math import ceil
import warnings
warnings.filterwarnings('ignore')

from datetime import datetime, timezone
import time
from pathlib import Path
from typing import Dict, List, Any, Optional, Tuple
from dataclasses import dataclass, field
import gc  # Garbage collector

import numpy as np
import pandas as pd

# Audio and denoising imports
import librosa
import soundfile as sf
import noisereduce as nr
from scipy.signal import butter, filtfilt, medfilt
from pydub import AudioSegment

# Transcription imports
#import whisper
from faster_whisper import WhisperModel

# Document imports
from docx import Document
from pptx import Presentation

# NLP and LLM imports
import openai
try:
    from langchain.text_splitter import RecursiveCharacterTextSplitter
    from langchain_community.vectorstores import FAISS
    from langchain_community.embeddings import OpenAIEmbeddings
    langchain_available = True
except ImportError:
    print("⚠️ LangChain non disponible")
    langchain_available = False

import torch
print(f"🔧 PyTorch: {torch.__version__}")
print(f"🎮 CUDA disponible: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"   GPU: {torch.cuda.get_device_name(0)}")
    print(f"   Mémoire: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")

⚠️ LangChain non disponible
🔧 PyTorch: 2.6.0+cu124
🎮 CUDA disponible: True
   GPU: Tesla T4
   Mémoire: 15.83 GB


# **API key configuration**

In [5]:
try:
    from kaggle_secrets import UserSecretsClient
    user_secrets = UserSecretsClient()
    OPENAI_API_KEY = user_secrets.get_secret("OPENAI_API_KEY")
    ASSEMBLYAI_API_KEY = user_secrets.get_secret("ASSEMBLYAI_API_KEY")
    HUGGINGFACE_TOKEN = user_secrets.get_secret("HUGGINGFACE_TOKEN")
    GROQ_API_KEY = user_secrets.get_secret("GROQ_API_KEY")
except Exception:
    # Outside Kaggle, or if a secret lookup fails: fall back to environment variables
    OPENAI_API_KEY = os.environ.get("OPENAI_API_KEY", "")
    ASSEMBLYAI_API_KEY = os.environ.get("ASSEMBLYAI_API_KEY", "")
    HUGGINGFACE_TOKEN = os.environ.get("HUGGINGFACE_TOKEN", "")
    GROQ_API_KEY = os.environ.get("GROQ_API_KEY", "")
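
In the cell above the four lookups share one `try` block, so if any single `get_secret` call raises, all four keys fall back to the environment. A minimal sketch of a per-key fallback follows. The helper name `load_secret` is hypothetical, and the sketch assumes that a failed lookup surfaces as an exception.

```python
import os


def load_secret(name: str, default: str = "") -> str:
    """Read one secret from Kaggle Secrets, falling back to the environment."""
    try:
        # kaggle_secrets is only importable inside a Kaggle notebook.
        from kaggle_secrets import UserSecretsClient
        value = UserSecretsClient().get_secret(name)
        if value:
            return value
    except Exception:
        # Not on Kaggle, or this particular secret is not attached to the notebook.
        pass
    return os.environ.get(name, default)


# Each key is resolved independently of the others.
GROQ_API_KEY = load_secret("GROQ_API_KEY")
```

With this shape, a missing optional key such as `ASSEMBLYAI_API_KEY` no longer hides the keys that are present.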

# **Path configuration**

In [6]:
UPLOAD_PATH = "/kaggle/input/meeting-audio/"  # Path to the uploaded files
OUTPUT_PATH = "/kaggle/working"  # Output path

In [7]:
DEBUG_PATH = Path(OUTPUT_PATH) / "debug"
DEBUG_PATH.mkdir(parents=True, exist_ok=True)

In [8]:
TEMP_DIR = Path(OUTPUT_PATH) / "temp_chunks"
TEMP_DIR.mkdir(parents=True, exist_ok=True)

def cleanup_temp_files():
    """Delete the temporary files and recreate an empty temp directory"""
    if TEMP_DIR.exists():
        shutil.rmtree(TEMP_DIR)
    TEMP_DIR.mkdir(parents=True, exist_ok=True)

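# **System command utilities**

In [9]:
def ensure_dir(p):
    """Create the directory (and its parents) if it does not exist."""
    Path(p).mkdir(parents=True, exist_ok=True)

def run(cmd):
    """Run a command and return (return code, stdout, stderr)."""
    p = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    out, err = p.communicate()
    # errors="replace" keeps non-UTF-8 bytes in tool output from raising
    return p.returncode, out.decode(errors="replace"), err.decode(errors="replace")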

In [10]:
def check_gpu_memory():
    """Report free GPU memory; return a smaller model name when memory is low, else None."""
    if torch.cuda.is_available():
        free, total = torch.cuda.mem_get_info()
        print(f"📊 GPU: {free/1e9:.2f} GB free / {total/1e9:.2f} GB total")
        if free < 4e9:  # less than 4 GB free
            fallback_model = "medium"
            print(f"⚠️ Low GPU memory, '{fallback_model}' recommended")
            return fallback_model
    return None

In [11]:
def save_debug_json(data: Dict, step_name: str, timestamp: Optional[str] = None) -> str:
    """Save a debug JSON summary for each pipeline step"""
    if not config.save_intermediate_json:
        return ""
    
    if timestamp is None:
        timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    
    filename = f"{step_name}_{timestamp}.json"
    filepath = DEBUG_PATH / filename
    
    # Build a summary so large payloads are not written in full
    debug_data = {
        "step": step_name,
        "timestamp": timestamp,
        "status": data.get("status", "unknown"),
        "summary": {}
    }
    
    if "segments" in data and isinstance(data["segments"], list):
        debug_data["summary"]["total_segments"] = len(data["segments"])
        debug_data["summary"]["sample_segments"] = data["segments"][:3] if data["segments"] else []
        debug_data["segments_count"] = len(data["segments"])
    
    if "transcription" in data:
        debug_data["summary"]["text_length"] = len(data["transcription"])
        debug_data["summary"]["text_preview"] = data["transcription"][:500] + "..." if len(data["transcription"]) > 500 else data["transcription"]
    
    if "transcription_postprocessed" in data:
        debug_data["summary"]["postprocessed_length"] = len(data["transcription_postprocessed"])
        debug_data["summary"]["postprocessed_preview"] = data["transcription_postprocessed"][:500] + "..."
    
    if "transcription_llm" in data:
        debug_data["summary"]["llm_length"] = len(data["transcription_llm"])
        debug_data["summary"]["llm_preview"] = data["transcription_llm"][:500] + "..."
        debug_data["llm_correction_rate"] = data.get("llm_correction_rate", 0)
    
    # Record which keys the full payload contained
    debug_data["full_data_keys"] = list(data.keys())
    
    with open(filepath, "w", encoding="utf-8") as f:
        json.dump(debug_data, f, ensure_ascii=False, indent=2)
    
    print(f"📁 Debug JSON sauvé: {filepath}")
    return str(filepath)

# **Monitoring and debugging**

In [12]:
def print_memory_usage(step_name: str = ""):
    """Print current GPU and RAM usage"""
    import psutil  # local import

    prefix = f"[{step_name}] " if step_name else ""
    if torch.cuda.is_available():
        allocated = torch.cuda.memory_allocated() / 1e9
        peak = torch.cuda.max_memory_allocated() / 1e9
        print(f"{prefix}GPU memory: {allocated:.2f} GB allocated / {peak:.2f} GB peak")
    process = psutil.Process()
    print(f"{prefix}RAM usage: {process.memory_info().rss / 1e9:.2f} GB")

# **Pipeline configuration**

In [13]:
def get_optimal_model_size() -> str:
    """
    Pick the Whisper model size automatically from the available resources.
    Adapted from the SIIS project for better memory management.
    """
    if torch.cuda.is_available():
        try:
            total_mem = torch.cuda.get_device_properties(0).total_memory / (1024 ** 3)
        except Exception:
            total_mem = 0
        
        if total_mem >= 12:
            return "large-v3"
        elif total_mem >= 8:
            return "medium"
        elif total_mem >= 4:
            return "small"
        else:
            return "base"
    
    # CPU mode: psutil is not imported at module level, so import it here
    try:
        import psutil
    except ImportError:
        psutil = None
    
    if psutil is not None:
        try:
            ram_gb = psutil.virtual_memory().total / (1024 ** 3)
        except Exception:
            ram_gb = 0
    else:
        ram_gb = 8  # Conservative default
    
    if ram_gb >= 16:
        return "small"
    elif ram_gb >= 8:
        return "base"
    else:
        return "tiny"

In [14]:
@dataclass
class Config:
    """Centralized configuration for Kaggle"""
    
    timezone: str = "Indian/Antananarivo"

    # Debug
    debug_mode: bool = True
    save_intermediate_json: bool = True
    
    # API keys
    openai_key: str = OPENAI_API_KEY 
    assemblyai_key: str = ASSEMBLYAI_API_KEY
    huggingface_token: str = HUGGINGFACE_TOKEN
    
    # Whisper
    whisper_model: str = get_optimal_model_size() # 'tiny', 'base', 'small', 'medium', 'large', "large-v3"
    device: str = "cuda" if torch.cuda.is_available() else "cpu"
    compute_type: str = "float16" if torch.cuda.is_available() else "int8"
    num_workers: int = 4
    
    # Audio
    sample_rate: int = 16000
    
    # Decoding / anti-hallucination
    beam_size: int = 5
    best_of: int = 5
    patience: float = 1.0
    temperature: float = 0.0
    compression_ratio_threshold: float = 2.4
    log_prob_threshold: float = -1.0
    no_speech_threshold: float = 0.6
    condition_on_previous_text: bool = False
    suppress_blank: bool = True
    suppress_tokens: list[int] = field(default_factory=lambda: [-1])
    max_initial_timestamp: float = 1.0
    
    # VAD
    use_vad: bool = True
    vad_threshold: float = 0.5
    vad_min_speech_duration_ms: int = 250
    vad_max_speech_duration_s: float = float('inf')
    vad_min_silence_duration_ms: int = 2000
    vad_speech_pad_ms: int = 400
    
    # Long chunks for coherence (900 s = 15 min per chunk)
    chunk_length_s: int = 900
    chunk_overlap_s: int = 30
    
    # Post-processing
    max_repetitions: int = 3
    
    # Domain-specific prompt (kept in French, the language of the audio)
    initial_prompt: str = (
        "Transcription d'une réunion du conseil d'administration à Madagascar. "
        "Vocabulaire: conseil d'administration, procès-verbal, quorum, "
        "résolution, délibération, vote, ordre du jour, budget, "
        "millions d'Ariary, rapport financier. "
        "Termes spécifiques: Fihariana, SON'INVEST, UNIMA, AQUALMA. "
        "Format: discours naturel sans répétitions ni hallucinations."
    )
    
    # LLM (enabled by default in production)
    enable_llm: bool = True
    use_groq: bool = True
    groq_model: str = "llama-3.3-70b-versatile"
    openai_model: str = "gpt-4o-mini"  # fallback provider; alternative: "gpt-3.5-turbo" (cheaper than GPT-4)
    max_correction_rate: float = 0.18
    chunk_size_chars: int = 1000
    chunk_overlap_chars: int = 200

config = Config() 

print(f"✅ Configuration: Whisper {config.whisper_model} | Device: {config.device}")
print(f"   Chunking: {config.chunk_length_s}s | Condition on previous: {config.condition_on_previous_text}")

✅ Configuration: Whisper large-v3 | Device: cuda
   Chunking: 900s | Condition on previous: False


***How to tune the parameters for each case***

Case A: clean audio (dictaphones, quiet room)
* beam_size=3, best_of=1–2 (faster)
* no_speech_threshold=0.6 (fine as is)
* temperature=0.0
* VAD: min_silence_duration_ms=1500

Case B: noisy audio (doors, background chatter)
* beam_size=5, best_of=5 (quality)
* lower no_speech_threshold to 0.5 if speech gets cut off
* VAD: threshold=0.4–0.5, min_speech_duration_ms=200, min_silence_duration_ms=1800–2200
* Safeguard: keep compression_ratio_threshold=2.4

Case C: CPU only (no Kaggle GPU)
* compute_type="int8", tiny or base model
* beam_size=3, best_of=1
* Threads: cpu_threads=2, num_workers=1
* Expect a real-time factor (RTF) of about 2–5, depending on audio length
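
The three cases above can be captured as named presets and applied with `dataclasses.replace`, which returns a modified copy and leaves the original untouched. This is a sketch, not the notebook's own mechanism. `DecodingSettings`, `PRESETS` and `apply_preset` are hypothetical names, the stand-in class holds only the fields the presets touch, and where the text gives a range (for example `threshold=0.4–0.5`) one value inside it was picked arbitrarily. `cpu_threads` is omitted because it is not a `Config` field in this notebook.

```python
from dataclasses import dataclass, replace


@dataclass
class DecodingSettings:
    """Hypothetical stand-in for the subset of Config fields used by the presets."""
    whisper_model: str = "large-v3"
    compute_type: str = "float16"
    num_workers: int = 4
    beam_size: int = 5
    best_of: int = 5
    temperature: float = 0.0
    no_speech_threshold: float = 0.6
    compression_ratio_threshold: float = 2.4
    vad_threshold: float = 0.5
    vad_min_speech_duration_ms: int = 250
    vad_min_silence_duration_ms: int = 2000


PRESETS = {
    # Case A: clean audio
    "clean": dict(beam_size=3, best_of=1, no_speech_threshold=0.6,
                  temperature=0.0, vad_min_silence_duration_ms=1500),
    # Case B: noisy audio (values chosen inside the suggested ranges)
    "noisy": dict(beam_size=5, best_of=5, no_speech_threshold=0.5,
                  vad_threshold=0.45, vad_min_speech_duration_ms=200,
                  vad_min_silence_duration_ms=2000,
                  compression_ratio_threshold=2.4),
    # Case C: CPU only
    "cpu": dict(whisper_model="base", compute_type="int8",
                beam_size=3, best_of=1, num_workers=1),
}


def apply_preset(cfg, name: str):
    """Return a copy of cfg with the named preset applied."""
    return replace(cfg, **PRESETS[name])


cpu_cfg = apply_preset(DecodingSettings(), "cpu")
print(cpu_cfg.whisper_model, cpu_cfg.compute_type, cpu_cfg.beam_size)
```

Because `replace` works on any dataclass instance, the same `apply_preset` call should also accept the notebook's `Config` object, provided the preset keys match its field names.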

# **Audio preparation**

In [15]:
def slice_audio(input_path: str, output_path: str, start: float = 0.0, duration: Optional[int] = None) -> str:
    args = ["ffmpeg","-y","-hide_banner","-loglevel","error","-ss",str(start),"-i",input_path,"-ac","1","-ar",str(config.sample_rate)]
    if duration and duration > 0:
        args += ["-t",str(duration)]
    args += [output_path]
    ensure_dir(str(Path(output_path).parent))
    code, _, err = run(args)
    if code!=0:
        raise RuntimeError("FFmpeg slice failed: " + err)
    return output_path

# **Audio preprocessing and denoising**

In [16]:
class AudioPreprocessor:
    """Audio preprocessing with FFmpeg and noise reduction"""
    
    def __init__(self, sample_rate: int = 16000):
        self.sample_rate = sample_rate
    
    def ffmpeg_enhance(self, input_path: str, output_path: str):
        """Enhance the audio with FFmpeg"""
        cmd = [
            "ffmpeg", "-y", "-hide_banner", "-loglevel", "error",
            "-i", input_path,
            "-ac", "1",  # Mono
            "-ar", str(self.sample_rate),  # Target sample rate (16 kHz by default)
            "-af", "highpass=f=200,lowpass=f=3000,afftdn=nf=-20",  # Band-limit to 200-3000 Hz, then FFT denoise
            output_path
        ]
        subprocess.run(cmd, check=True)
    
    def reduce_noise(self, input_path: str, output_path: str):
        """Reduce noise with noisereduce"""
        y, sr = librosa.load(input_path, sr=self.sample_rate)
        y_clean = nr.reduce_noise(y=y, sr=sr, stationary=True, prop_decrease=0.8)
        sf.write(output_path, y_clean, sr)
        return output_path
    
    def process(self, input_path: str, output_dir: str) -> str:
        """Full preprocessing pipeline"""
        base_name = Path(input_path).stem
        ffmpeg_path = str(Path(output_dir) / f"{base_name}_ffmpeg.wav")
        denoise_path = str(Path(output_dir) / f"{base_name}_clean.wav")
        
        self.ffmpeg_enhance(input_path, ffmpeg_path)
        self.reduce_noise(ffmpeg_path, denoise_path)
        
        # Remove the intermediate file
        if Path(ffmpeg_path).exists():
            Path(ffmpeg_path).unlink()
        
        return denoise_path

In [17]:
def prepare_audio_file(audio_path: str) -> Dict:
    """Inspect and validate the audio file before transcription"""
    file_info = {
        "path": audio_path,
        "exists": os.path.exists(audio_path),
        "size_mb": 0,
        "duration_seconds": 0,
        "format": Path(audio_path).suffix.lstrip('.'),
        "sample_rate": 0,
        "channels": 0
    }
    
    if file_info["exists"]:
        file_info["size_mb"] = os.path.getsize(audio_path) / (1024 * 1024)
        
        try:
            # Decode only the first 10 s, keeping the native rate and channel layout
            y, sr = librosa.load(audio_path, sr=None, mono=False, duration=10)
            file_info["sample_rate"] = sr
            file_info["channels"] = 1 if y.ndim == 1 else y.shape[0]
            file_info["duration_seconds"] = librosa.get_duration(path=audio_path)
        except Exception as e:
            print(f"⚠️ Error reading audio: {e}")
    
    return file_info
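
As an alternative to decoding audio just to read its properties, `ffprobe` (typically installed together with `ffmpeg`) can report sample rate, channel count and duration from the container metadata. A minimal sketch follows. `probe_audio` and `parse_ffprobe_output` are hypothetical helpers, and the parsing assumes the usual ffprobe JSON layout with a `streams` list and a `format` object.

```python
import json
import subprocess
from typing import Any, Dict


def parse_ffprobe_output(payload: str) -> Dict[str, Any]:
    """Extract sample rate, channels and duration from ffprobe's JSON output."""
    data = json.loads(payload)
    streams = data.get("streams", [])
    # Take the first audio stream; fall back to an empty dict if there is none.
    stream = next((s for s in streams if s.get("codec_type") == "audio"), {})
    fmt = data.get("format", {})
    return {
        "sample_rate": int(stream.get("sample_rate", 0)),
        "channels": int(stream.get("channels", 0)),
        "duration_seconds": float(fmt.get("duration", 0.0)),
    }


def probe_audio(path: str) -> Dict[str, Any]:
    """Run ffprobe on a file and return the parsed properties."""
    cmd = [
        "ffprobe", "-v", "error",
        "-select_streams", "a:0",
        "-show_entries", "stream=codec_type,sample_rate,channels:format=duration",
        "-of", "json",
        path,
    ]
    out = subprocess.run(cmd, check=True, capture_output=True, text=True).stdout
    return parse_ffprobe_output(out)
```

Keeping the parser separate from the subprocess call makes it possible to check the parsing logic without an audio file at hand.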

# **Audio transcription**
**Transcription service using the cleaned audio**

In [18]:
class TranscriptionService:
    """Transcription service with the optimized SIIS settings"""
    
    def __init__(self, cfg: Config):
        self.cfg = cfg
        self.model = None
    
    def load_model(self):
        """Load the Whisper model, freeing cached memory first"""
        if self.model is None:
            torch.cuda.empty_cache()
            gc.collect()
            
            self.model = WhisperModel(
                self.cfg.whisper_model,
                device=self.cfg.device,
                compute_type=self.cfg.compute_type,
                num_workers=self.cfg.num_workers,  # 4 instead of 1
                cpu_threads=4 if self.cfg.device == "cpu" else 0
            )
        return self.model
    
    def unload_model(self):
        """Release the model from memory"""
        if self.model is not None:
            del self.model
            self.model = None
            torch.cuda.empty_cache()
            gc.collect()
    
    def transcribe_chunk(self, audio_path: str) -> Tuple[List, Dict]:
        """Transcribe one audio chunk"""
        model = self.load_model()
        
        segments, info = model.transcribe(
            audio_path,
            language="fr",
            beam_size=self.cfg.beam_size,
            best_of=self.cfg.best_of,
            patience=self.cfg.patience,
            temperature=self.cfg.temperature,
            compression_ratio_threshold=self.cfg.compression_ratio_threshold,
            log_prob_threshold=self.cfg.log_prob_threshold,
            no_speech_threshold=self.cfg.no_speech_threshold,
            condition_on_previous_text=self.cfg.condition_on_previous_text,  # False in Config
            initial_prompt=self.cfg.initial_prompt,
            word_timestamps=True,
            suppress_tokens=self.cfg.suppress_tokens,
            suppress_blank=self.cfg.suppress_blank,
            max_initial_timestamp=self.cfg.max_initial_timestamp,
            vad_filter=self.cfg.use_vad,
            vad_parameters={
                "threshold": self.cfg.vad_threshold,
                "min_speech_duration_ms": self.cfg.vad_min_speech_duration_ms,
                "max_speech_duration_s": self.cfg.vad_max_speech_duration_s,
                "min_silence_duration_ms": self.cfg.vad_min_silence_duration_ms,
                "speech_pad_ms": self.cfg.vad_speech_pad_ms,
            } if self.cfg.use_vad else None
        )
        
        return list(segments), info
    
    def transcribe_long_audio(self, audio_path: str) -> Dict[str, Any]:
        """
        Transcribe a long recording with the SIIS-optimized chunking.
        Uses 900 s chunks instead of 180-300 s to reduce drift.
        """
        # Total duration, read from the file header (no audio is decoded)
        info = sf.info(audio_path)
        total_duration = info.duration
        
        # Chunk layout
        chunk_length = self.cfg.chunk_length_s
        chunk_overlap = self.cfg.chunk_overlap_s
        num_chunks = max(1, ceil(total_duration / chunk_length))
        
        print(f"📊 Audio: {total_duration:.1f}s | {num_chunks} chunks de {chunk_length}s")
        
        all_segments = []
        all_text = []
        
        for i in range(num_chunks):
            start_time = max(0, i * chunk_length - (chunk_overlap if i > 0 else 0))
            duration = min(chunk_length + chunk_overlap, total_duration - start_time)
            
            # Extract the chunk with ffmpeg
            chunk_path = str(TEMP_DIR / f"chunk_{i:04d}.wav")
            cmd = [
                "ffmpeg", "-y", "-hide_banner", "-loglevel", "error",
                "-ss", str(start_time),
                "-t", str(duration),
                "-i", audio_path,
                "-ac", "1",
                "-ar", str(self.cfg.sample_rate),
                chunk_path
            ]
            subprocess.run(cmd, check=True)
            
            # Transcribe the chunk
            print(f"  Chunk {i+1}/{num_chunks}: {start_time:.1f}s - {start_time+duration:.1f}s")
            segments, chunk_info = self.transcribe_chunk(chunk_path)
            
            # Shift timestamps from chunk-relative to absolute time
            for seg in segments:
                # Build a fresh dictionary for each segment
                segment_dict = {
                    "start": seg.start + start_time,
                    "end": seg.end + start_time,
                    "text": seg.text.strip(),
                }
                
                # Add word-level entries with shifted timestamps when available
                if hasattr(seg, 'words') and seg.words:
                    segment_dict["words"] = [
                        {
                            "start": w.start + start_time,
                            "end": w.end + start_time,
                            "word": w.word,
                            "probability": getattr(w, 'probability', 0.0)
                        }
                        for w in seg.words
                    ]
                
                # Add other metadata when available
                if hasattr(seg, 'no_speech_prob'):
                    segment_dict["no_speech_prob"] = seg.no_speech_prob
                if hasattr(seg, 'avg_logprob'):
                    segment_dict["avg_logprob"] = seg.avg_logprob
                if hasattr(seg, 'compression_ratio'):
                    segment_dict["compression_ratio"] = seg.compression_ratio
                
                all_segments.append(segment_dict)
                all_text.append(seg.text.strip())
            
            # Delete the temporary chunk
            Path(chunk_path).unlink()
        
        # Assemble the result
        result = {
            "status": "success",
            "duration": total_duration,
            "language": "fr",
            "segments": all_segments,
            "text": " ".join(all_text),
            "metadata": {
                "model": self.cfg.whisper_model,
                "chunks": num_chunks,
                "chunk_length": chunk_length,
                "condition_on_previous": self.cfg.condition_on_previous_text
            }
        }
        
        return result
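
In `transcribe_long_audio`, consecutive chunks overlap and every segment from every chunk is appended, so speech that falls inside an overlap can appear twice in `segments` and `text`. One common way to handle this is sketched below: each chunk decodes a padded window but only "owns" its nominal span, and a segment is kept only by the chunk whose owned span contains the segment midpoint. This is a variant of the loop above, not the notebook's method. `chunk_windows` and `owned_segments` are hypothetical helpers, and segment timestamps are assumed to be already shifted to absolute time.

```python
from math import ceil
from typing import Dict, List


def chunk_windows(total_s: float, chunk_s: float, overlap_s: float) -> List[Dict[str, float]]:
    """Plan chunks: start/end is the audio to decode, keep_from/keep_to the span the chunk owns."""
    n = max(1, ceil(total_s / chunk_s))
    windows = []
    for i in range(n):
        keep_from = i * chunk_s
        keep_to = min(total_s, (i + 1) * chunk_s)
        windows.append({
            "start": max(0.0, keep_from - overlap_s),   # padded on the left
            "end": min(total_s, keep_to + overlap_s),   # padded on the right
            "keep_from": keep_from,
            "keep_to": keep_to,
        })
    return windows


def owned_segments(segments: List[dict], window: Dict[str, float]) -> List[dict]:
    """Keep the segments whose midpoint lies inside the window's owned span."""
    kept = []
    for seg in segments:
        midpoint = (seg["start"] + seg["end"]) / 2
        if window["keep_from"] <= midpoint < window["keep_to"]:
            kept.append(seg)
    return kept


for w in chunk_windows(total_s=2000, chunk_s=900, overlap_s=30):
    print(w)
```

The midpoint rule is a heuristic: segment boundaries can differ between two decodes of the same overlap, so a sentence straddling a chunk boundary may still be split imperfectly.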

In [19]:
# Example usage
#service = TranscriptionService(config)
#result = service.transcribe_long_audio(audio_file)
#print(f"Transcription: {result['text'][:500]}...")

# **Diarization**

In [20]:
def diarize(transcription_data: Dict, audio_path: str, hf_token: str) -> Dict:
    """
    Diarization with pyannote (falls back to whisperx on failure)
    """
    try:
        from pyannote.audio import Pipeline
        
        print("🎙️ Diarisation avec pyannote...")
        pipeline = Pipeline.from_pretrained(
            "pyannote/speaker-diarization-3.1",
            use_auth_token=hf_token
        )
        
        diarization = pipeline(audio_path)
        
        # Map transcript segments to speakers
        segments_with_speakers = []
        for seg in transcription_data.get("segments", []):
            start, end = seg["start"], seg["end"]
            
            # Accumulate, per speaker, the time overlapping this segment
            speaker_times = {}
            for turn, _, speaker in diarization.itertracks(yield_label=True):
                overlap_start = max(start, turn.start)
                overlap_end = min(end, turn.end)
                if overlap_start < overlap_end:
                    overlap_duration = overlap_end - overlap_start
                    speaker_times[speaker] = speaker_times.get(speaker, 0) + overlap_duration
            
            # Assign the speaker with the largest overlap
            if speaker_times:
                main_speaker = max(speaker_times, key=speaker_times.get)
                seg["speaker"] = main_speaker
            else:
                seg["speaker"] = "Unknown"
            
            segments_with_speakers.append(seg)
        
        transcription_data["segments_diarized"] = segments_with_speakers
        transcription_data["diarization_method"] = "pyannote"
        
    except Exception as e:
        print(f"⚠️ Diarisation pyannote échouée: {e}")
        
        # Fall back to whisperx if it is available
        try:
            import whisperx
            print("🔄 Fallback sur whisperx...")
            
            # Align with whisperx
            device = "cuda" if torch.cuda.is_available() else "cpu"
            align_model, metadata = whisperx.load_align_model(
                language_code="fr",
                device=device
            )
            
            result_aligned = whisperx.align(
                transcription_data["segments"],
                align_model,
                metadata,
                audio_path,
                device
            )
            
            # Diarization
            diarize_model = whisperx.DiarizationPipeline(use_auth_token=hf_token)
            diarize_segments = diarize_model(audio_path)
            result_diarized = whisperx.assign_word_speakers(diarize_segments, result_aligned)
            
            transcription_data["segments_diarized"] = result_diarized["segments"]
            transcription_data["diarization_method"] = "whisperx"
            
        except Exception as e2:
            print(f"⚠️ Diarisation whisperx échouée: {e2}")
            transcription_data["diarization_method"] = "none"
    
    return transcription_data
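
The speaker assignment in the pyannote branch is a majority-overlap rule: for each transcript segment, sum the time it shares with each speaker turn and pick the speaker with the largest total. The sketch below isolates that rule on plain `(start, end, speaker)` tuples so it can be checked without a diarization model. `dominant_speaker` is a hypothetical helper, and the tuples stand in for the items yielded by `diarization.itertracks(yield_label=True)`.

```python
from typing import Dict, Iterable, Tuple

Turn = Tuple[float, float, str]  # (start, end, speaker label)


def dominant_speaker(start: float, end: float, turns: Iterable[Turn],
                     default: str = "Unknown") -> str:
    """Return the speaker whose turns overlap [start, end] the longest."""
    totals: Dict[str, float] = {}
    for turn_start, turn_end, speaker in turns:
        overlap = min(end, turn_end) - max(start, turn_start)
        if overlap > 0:
            totals[speaker] = totals.get(speaker, 0.0) + overlap
    return max(totals, key=totals.get) if totals else default


turns = [(0.0, 4.0, "SPEAKER_00"), (4.0, 9.0, "SPEAKER_01")]
print(dominant_speaker(3.0, 8.0, turns))  # SPEAKER_01 (4 s of overlap against 1 s)
```

Converting the diarization once, for example with `[(t.start, t.end, spk) for t, _, spk in diarization.itertracks(yield_label=True)]`, also avoids re-iterating the pyannote annotation for every segment.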

# **Text post-processing**

In [21]:
def normalize_numbers_and_units(text: str) -> str:
    """Normalize numbers and currency units"""
    import re
    
    # A decimal comma (written, or spoken as "virgule") before "millions" becomes a decimal point
    text = re.sub(r'(\d+)\s*,\s*(\d+)\s*millions?', r'\1.\2 millions', text)
    text = re.sub(r'(\d+)\s*virgule\s*(\d+)\s*millions?', r'\1.\2 millions', text)
    
    # Append "d'Ariary" to amounts in millions unless it already follows.
    # The lookahead includes the whitespace so an existing "d'Ariary" is detected,
    # and the replacement contains no escaped quote, so no backslash reaches the output.
    text = re.sub(
        r"(\d+(?:\.\d+)?)\s*millions?\b(?!\s*d'?\s*[Aa]riary)",
        r"\1 millions d'Ariary",
        text,
    )
    
    # Normalize percentages
    text = re.sub(r'(\d+)\s*pour\s*cent', r'\1%', text)
    
    return text

def deduplicate_sentences(text: str) -> str:
    """Remove repeated sentences (every later occurrence, case-insensitive)"""
    import re
    
    sentences = re.split(r'(?<=[.!?])\s+', text)
    seen = set()
    unique_sentences = []
    
    for sent in sentences:
        sent_lower = sent.lower().strip()
        if sent_lower and sent_lower not in seen:
            seen.add(sent_lower)
            unique_sentences.append(sent)
    
    return ' '.join(unique_sentences)

def postprocess_text(text: str) -> str:
    # Collapse runs of whitespace
    text = re.sub(r'\s+', ' ', text).strip()
    
    text = normalize_numbers_and_units(text)
    text = deduplicate_sentences(text)
    
    return text
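
`deduplicate_sentences` drops every later occurrence of a sentence anywhere in the transcript, which also removes legitimate repeats such as several separate "Oui." answers. The configuration defines `max_repetitions`, but the cells shown here do not use it. A possible, more conservative variant is sketched below: it only trims runs of consecutive identical sentences, keeping at most `max_repetitions` of each run. `collapse_repetitions` is a hypothetical name and this is not the notebook's implementation.

```python
import re


def collapse_repetitions(text: str, max_repetitions: int = 3) -> str:
    """Keep at most max_repetitions consecutive copies of the same sentence."""
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    kept = []
    run_key, run_length = None, 0
    for sentence in sentences:
        key = sentence.lower().strip()
        if not key:
            continue
        if key == run_key:
            run_length += 1
        else:
            run_key, run_length = key, 1
        if run_length <= max_repetitions:
            kept.append(sentence)
    return " ".join(kept)


print(collapse_repetitions("Oui. Oui. Oui. Oui. Non. Oui.", max_repetitions=2))
# Oui. Oui. Non. Oui.
```

Looping hallucinations usually show up as long consecutive runs, so this targets them while leaving repeats that are separated by other speech.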


# **LLM cleanup**
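
The configuration caps LLM edits with `max_correction_rate` (0.18). How the rate is measured is not shown in this part of the notebook, so the sketch below is only one plausible definition: one minus the `difflib.SequenceMatcher` similarity ratio between the original and the cleaned chunk, with the original kept whenever the cap is exceeded. `correction_rate` and `accept_correction` are hypothetical helpers.

```python
from difflib import SequenceMatcher


def correction_rate(original: str, cleaned: str) -> float:
    """Share of the text that changed: 0.0 means identical, 1.0 means nothing in common."""
    if not original and not cleaned:
        return 0.0
    return 1.0 - SequenceMatcher(None, original, cleaned).ratio()


def accept_correction(original: str, cleaned: str, max_rate: float = 0.18) -> str:
    """Return the cleaned text only if the LLM changed no more than max_rate of it."""
    if correction_rate(original, cleaned) <= max_rate:
        return cleaned
    return original


before = "La réunion est ouverte"
after = "La réunion est ouverte."
print(f"{correction_rate(before, after):.3f}", accept_correction(before, after))
```

A character-level ratio is cheap and language-agnostic, but it treats a changed figure the same as a changed comma, so a separate check that all numbers survive the cleanup may still be worthwhile.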

In [22]:
class LLMCleaner:
    """Text cleanup with an LLM (Groq or OpenAI)"""
    
    def __init__(self, cfg: Config):
        self.cfg = cfg
        self.client = None
        
        if cfg.use_groq and GROQ_API_KEY:
            try:
                from groq import Groq
                self.client = Groq(api_key=GROQ_API_KEY)
                self.provider = "groq"
                print("✅ Utilisation de Groq pour le nettoyage LLM")
            except ImportError:
                print("⚠️ Package groq non installé, installation...")
                subprocess.check_call([sys.executable, "-m", "pip", "install", "groq"])
                from groq import Groq
                self.client = Groq(api_key=GROQ_API_KEY)
                self.provider = "groq"
        elif OPENAI_API_KEY:
            from openai import OpenAI
            self.client = OpenAI(api_key=OPENAI_API_KEY)
            self.provider = "openai"
            print("✅ Utilisation d'OpenAI pour le nettoyage LLM")
        else:
            print("⚠️ Aucune clé API LLM disponible")
    
    def create_chunks(self, text: str) -> List[str]:
        """Split the text into overlapping chunks for LLM processing"""
        if not text:
            return []
        
        step = max(1, self.cfg.chunk_size_chars - self.cfg.chunk_overlap_chars)
        chunks = []
        for i in range(0, len(text), step):
            chunks.append(text[i:i + self.cfg.chunk_size_chars])
        
        return chunks
    
    def clean_text(self, text: str) -> Tuple[str, float]:
        """Clean the text with the LLM and return (text, correction_rate)"""
        if not self.client or not text:
            return text, 0.0
        
        chunks = self.create_chunks(text)
        cleaned_chunks = []
        total_delta = 0
        
        system_prompt = """Tu es un assistant de correction de transcription.
            Tu corriges UNIQUEMENT : orthographe, grammaire, ponctuation, noms propres malgaches.
            RÈGLES STRICTES :
            1. NE JAMAIS ajouter d'information non présente
            2. NE PAS changer le sens des phrases
            3. Conserver tous les chiffres et montants exacts
            Contexte: Réunion du conseil d'administration à Madagascar.
            Termes valides: Fihariana, SON'INVEST, UNIMA, AQUALMA, Ariary."""
        
        for i, chunk in enumerate(chunks, 1):
            print(f"  Nettoyage chunk {i}/{len(chunks)}...")
            
            try:
                if self.provider == "groq":
                    response = self.client.chat.completions.create(
                        model=self.cfg.groq_model,
                        messages=[
                            {"role": "system", "content": system_prompt},
                            {"role": "user", "content": f"Corrige ce texte:\n\n{chunk}"}
                        ],
                        temperature=0.2,
                        max_tokens=1500
                    )
                    cleaned = response.choices[0].message.content.strip()
                else:  # OpenAI
                    response = self.client.chat.completions.create(
                        model=self.cfg.openai_model,
                        messages=[
                            {"role": "system", "content": system_prompt},
                            {"role": "user", "content": chunk}
                        ],
                        temperature=0.2,
                        max_tokens=1400
                    )
                    cleaned = response.choices[0].message.content.strip()
                
                cleaned_chunks.append(cleaned)
                total_delta += abs(len(cleaned) - len(chunk))
                
            except Exception as e:
                print(f"    ⚠️ Erreur LLM chunk {i}: {e}")
                cleaned_chunks.append(chunk)  # Keep the original chunk on error
        
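        # Merge the chunks and compute the correction rate.
        # Note: when cfg.chunk_overlap_chars > 0, the overlapping characters
        # appear twice in merged_text because the chunks are simply joined.
        merged_text = ' '.join(cleaned_chunks)
        correction_rate = total_delta / max(len(text), 1)
        
        # Reject the LLM output if it changed the text too much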
        if correction_rate > self.cfg.max_correction_rate:
            print(f"⚠️ Taux de correction {correction_rate:.1%} > seuil {self.cfg.max_correction_rate:.0%}")
            print("   → Conservation du texte post-traité sans LLM")
            return text, correction_rate
        
        return merged_text, correction_rate
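
Two details of `clean_text` may be worth revisiting. First, chunks are cut at fixed character offsets with an overlap and then joined with spaces, so overlapping text ends up duplicated in the merged result. Second, the correction rate only compares lengths, so a substitution that keeps the length unchanged counts as zero. The sketch below shows one possible alternative for each, under stated assumptions: `split_on_sentences` and `change_rate` are hypothetical helpers, not part of the notebook, and the 1000-character default simply mirrors the chunk size mentioned in the introduction.

```python
import difflib
import re
from typing import List


def split_on_sentences(text: str, max_chars: int = 1000) -> List[str]:
    """Pack whole sentences into chunks of at most max_chars, with no overlap."""
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    chunks: List[str] = []
    current = ""
    for sentence in sentences:
        if not sentence:
            continue
        candidate = f"{current} {sentence}".strip()
        # A single sentence longer than max_chars becomes its own oversized chunk.
        if not current or len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks


def change_rate(original: str, cleaned: str) -> float:
    """Share of characters that differ, derived from difflib's similarity ratio."""
    if not original and not cleaned:
        return 0.0
    matcher = difflib.SequenceMatcher(None, original, cleaned, autojunk=False)
    return 1.0 - matcher.ratio()


chunks = split_on_sentences("A b. C d. E f.", max_chars=9)
print(chunks)
print(change_rate("abcd", "abce"))
```

Because the chunks no longer overlap, joining them with a single space reproduces text whose whitespace has already been collapsed by `postprocess_text`. `SequenceMatcher` is quadratic in the worst case, so it is better applied per chunk than to the whole transcript.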

# **AssemblyAI Fallback (if Whisper fails)**

In [23]:
class AssemblyAIFallback:
    """Fallback service backed by AssemblyAI"""
    
    def __init__(self, api_key: str):
        self.api_key = api_key
        
    def transcribe_with_assemblyai(self, audio_path: str) -> Dict:
        """
        Backup transcription through AssemblyAI
        
        Args:
            audio_path: Path to the audio file
            
        Returns:
            Dict holding the transcription, or an error status
        """
        if not self.api_key:
            return {
                "status": "error",
                "error": "Clé API AssemblyAI non configurée"
            }
        
        try:
            import assemblyai as aai
            
            print("🔄 Utilisation du fallback AssemblyAI...")
            
            aai.settings.api_key = self.api_key
            transcriber = aai.Transcriber()
            
            # Upload and transcribe. Transcriber.transcribe() blocks until the
            # job has finished, so no polling loop is needed afterwards.
            config_lang = aai.TranscriptionConfig(
                language_code="fr",
                punctuate=True,
                format_text=True,
                disfluencies=True,
                speaker_labels=True
            )
            transcript = transcriber.transcribe(audio_path, config=config_lang)
            
            if transcript.status == aai.TranscriptStatus.error:
                raise Exception(f"Erreur AssemblyAI: {transcript.error}")
            
            return {
                "status": "success",
                "method": "assemblyai",
                "transcription": transcript.text,
                "confidence": transcript.confidence if hasattr(transcript, 'confidence') else 0.85,
                "words": transcript.words if hasattr(transcript, 'words') else []
            }
            
        except Exception as e:
            print(f"❌ Erreur AssemblyAI: {str(e)}")
            return {
                "status": "error",
                "error": str(e),
                "method": "assemblyai"
            }

# Fallback service instance
fallback_service = AssemblyAIFallback(config.assemblyai_key)
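
The dictionary returned by `transcribe_with_assemblyai` carries `transcript.words` as SDK objects, which `json.dump` may not be able to serialize if the result is later written to disk like the other pipeline outputs. A minimal sketch of a conversion step follows. `words_to_dicts` is a hypothetical helper, and the attribute names (`text`, `start`, `end`, `confidence`, `speaker`) are assumptions about the SDK's word objects; they are read with `getattr`, so a missing attribute simply yields `None`.

```python
import json
from types import SimpleNamespace
from typing import Any, Dict, Iterable, List, Optional


def words_to_dicts(words: Optional[Iterable[Any]]) -> List[Dict[str, Any]]:
    """Convert word-like objects into plain dicts that json can serialize."""
    fields = ("text", "start", "end", "confidence", "speaker")  # assumed names
    return [
        {name: getattr(word, name, None) for name in fields}
        for word in (words or [])
    ]


# Stand-in object for illustration; real input would be transcript.words.
sample = [SimpleNamespace(text="Bonjour", start=0, end=480, confidence=0.97, speaker="A")]
print(json.dumps(words_to_dicts(sample), ensure_ascii=False))
```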

1. Do not rely on the default language setting. For French audio, set the language explicitly:
        config_lang = aai.TranscriptionConfig(language_code="fr")
2. Diarization (speaker labels):
        config_lang = aai.TranscriptionConfig(speaker_labels=True)

Example:
    config_lang = aai.TranscriptionConfig(language_code="fr", speaker_labels=True)
    transcript = transcriber.transcribe(audio_path, config=config_lang)

When to call it:
    If TranscriptionService.transcribe_audio returns status="error", a real_time_factor well above 5 (too slow), or too many segments below your confidence_threshold, then:
        > result = fallback_service.transcribe_with_assemblyai(audio_path)
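
The three trigger conditions above can be sketched as a single predicate. Everything specific here is an assumption made for illustration: the helper name `should_fallback`, the result keys (`metadata.real_time_factor`, a per-segment `confidence`), and the default thresholds would all need to be aligned with what `TranscriptionService` actually returns.

```python
from typing import Any, Dict


def should_fallback(
    result: Dict[str, Any],
    max_rtf: float = 5.0,               # assumed ceiling for real_time_factor
    confidence_threshold: float = 0.5,  # assumed per-segment threshold
    max_low_conf_ratio: float = 0.3,    # assumed tolerated share of weak segments
) -> bool:
    """Return True when the Whisper result looks unusable."""
    if result.get("status") == "error":
        return True

    rtf = (result.get("metadata") or {}).get("real_time_factor")
    if rtf is not None and rtf > max_rtf:
        return True

    segments = result.get("segments") or []
    if segments:
        low = sum(
            1 for seg in segments
            if seg.get("confidence", 1.0) < confidence_threshold
        )
        if low / len(segments) > max_low_conf_ratio:
            return True

    return False


print(should_fallback({"status": "success", "metadata": {"real_time_factor": 7.2}}))
```

Segments without a `confidence` key are treated as confident, so the check stays inert when the transcription backend does not report per-segment scores.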

# **Transcription pipeline with automatic fallback handling**

In [24]:
def transcribe_audio_pipeline(
    audio_path: str,
    cfg: Config,
    save_json: bool = True
) -> Dict[str, Any]:
    """
    Complete, optimized transcription pipeline
    
    Steps:
    1. Audio preprocessing (FFmpeg + denoising)
    2. Transcription with long chunking (900s)
    3. Speaker diarization
    4. Post-processing (normalization, deduplication)
    5. LLM cleanup (Groq or OpenAI)
    
    Returns:
        Dictionary with every version of the transcription
    """
    
    try:
        # Initialization
        cleanup_temp_files()
        timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
        
        # [1/5] Preprocessing
        print("\n[1/5] 🔊 Prétraitement audio...")
        preprocessor = AudioPreprocessor(cfg.sample_rate)
        clean_audio_path = preprocessor.process(audio_path, str(TEMP_DIR))
        
        # Free memory
        del preprocessor
        gc.collect()
        
        # [2/5] Transcription
        print("\n[2/5] 📝 Transcription avec Whisper...")
        service = TranscriptionService(cfg)
        transcription_result = service.transcribe_long_audio(clean_audio_path)
        
        # Save the raw transcription (02_transcription)
        if save_json:
            raw_output = {
                "timestamp": timestamp,
                "status": transcription_result["status"],
                "duration": transcription_result["duration"],
                "text": transcription_result["text"],
                "segments": transcription_result["segments"],
                "metadata": transcription_result["metadata"]
            }
            raw_path = Path(OUTPUT_PATH) / f"02_transcription_{timestamp}.json"
            with open(raw_path, 'w', encoding='utf-8') as f:
                json.dump(raw_output, f, ensure_ascii=False, indent=2)
            print(f"  💾 Sauvegardé: {raw_path}")
        
        # Release the model
        service.unload_model()
        gc.collect()
        
        # [3/5] Diarization
        print("\n[3/5] 🎙️ Diarisation des locuteurs...")
        if HUGGINGFACE_TOKEN:
            transcription_result = diarize(
                transcription_result,
                clean_audio_path,
                HUGGINGFACE_TOKEN
            )
            
            # Save the diarized result (03_diarization)
            if save_json and transcription_result.get("segments_diarized"):
                diar_output = {
                    "timestamp": timestamp,
                    "status": "success",
                    "duration": transcription_result["duration"],
                    "diarization_method": transcription_result.get("diarization_method", "none"),
                    "segments": transcription_result.get("segments_diarized", transcription_result["segments"]),
                    "text": transcription_result["text"]
                }
                diar_path = Path(OUTPUT_PATH) / f"03_diarization_{timestamp}.json"
                with open(diar_path, 'w', encoding='utf-8') as f:
                    json.dump(diar_output, f, ensure_ascii=False, indent=2)
                print(f"  💾 Sauvegardé: {diar_path}")
        else:
            print("  ⚠️ Token HuggingFace manquant, diarisation ignorée")
        
        # [4/5] Post-processing
        print("\n[4/5] 🔧 Post-traitement du texte...")
        text_postprocessed = postprocess_text(transcription_result["text"])
        print(f"  Réduction: {len(transcription_result['text'])} → {len(text_postprocessed)} caractères")
        
        # [5/5] LLM cleanup
        print("\n[5/5] ✨ Nettoyage LLM...")
        text_final = text_postprocessed
        correction_rate = 0.0
        
        if cfg.enable_llm:
            cleaner = LLMCleaner(cfg)
            text_final, correction_rate = cleaner.clean_text(text_postprocessed)
            print(f"  Taux de correction: {correction_rate:.1%}")
        else:
            print("  ℹ️ LLM désactivé")
        
        # Final result
        final_result = {
            "timestamp": timestamp,
            "status": "success",
            "duration": transcription_result["duration"],
            "model": cfg.whisper_model,
            "transcription_raw": transcription_result["text"],
            "transcription_postprocessed": text_postprocessed,
            "transcription_final": text_final,
            "llm_correction_rate": correction_rate,
            "llm_provider": getattr(cleaner, 'provider', 'none') if cfg.enable_llm else 'none',
            "segments": transcription_result.get("segments_diarized", transcription_result["segments"]),
            "metadata": {
                "pipeline_version": "2.0-optimized",
                "chunk_length": cfg.chunk_length_s,
                "condition_on_previous": cfg.condition_on_previous_text,
                "vad_enabled": cfg.use_vad,
                "diarization": transcription_result.get("diarization_method", "none")
            }
        }
        
        # Save the final result
        if save_json:
            final_path = Path(OUTPUT_PATH) / f"transcription_complete_{timestamp}.json"
            with open(final_path, 'w', encoding='utf-8') as f:
                json.dump(final_result, f, ensure_ascii=False, indent=2)
            print(f"\n✅ Pipeline terminé! Résultats sauvegardés:")
            print(f"   - {raw_path.name} (transcription brute)")
            #if HUGGINGFACE_TOKEN:
                #print(f"   - {diar_path.name} (avec diarisation)")
            print(f"   - {final_path.name} (version finale)")
        
        return final_result
        
    except Exception as e:
        print(f"\n❌ Erreur dans le pipeline: {e}")
        import traceback
        traceback.print_exc()
        return {"status": "error", "error": str(e)}
    
    finally:
        cleanup_temp_files()
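
The pipeline writes each stage to disk with the same open-and-dump pattern. A small helper could centralize that pattern. The sketch below is only a suggestion: `save_stage_json` is a hypothetical name, and creating the output directory on demand is an added assumption rather than something the notebook does.

```python
import json
from pathlib import Path
from typing import Any, Dict


def save_stage_json(
    payload: Dict[str, Any],
    output_dir: str,
    prefix: str,
    timestamp: str,
) -> Path:
    """Write one pipeline stage to <output_dir>/<prefix>_<timestamp>.json."""
    out_dir = Path(output_dir)
    out_dir.mkdir(parents=True, exist_ok=True)  # assumption: create if missing
    path = out_dir / f"{prefix}_{timestamp}.json"
    with open(path, "w", encoding="utf-8") as f:
        # ensure_ascii=False keeps accented characters readable in the file.
        json.dump(payload, f, ensure_ascii=False, indent=2)
    return path
```

With such a helper, the raw-transcription step would reduce to something like `raw_path = save_stage_json(raw_output, OUTPUT_PATH, "02_transcription", timestamp)`.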

# **MAIN EXECUTION**

In [25]:
# Test with your audio file
#audio_file = f"{UPLOAD_PATH}atelier.mp3"
#audio_file = f"{UPLOAD_PATH}test_1h.wav"
audio_file = f"{UPLOAD_PATH}test_30mn.mp3"
#audio_info = prepare_audio_file(audio_file)

# **Analysis of Debug Results**

The debug JSON files are saved in `/kaggle/working/debug_json/` with the following naming scheme:
- `02_transcription_[timestamp].json`: raw Whisper output
- `03_diarization_[timestamp].json`: after speaker identification
- `04_postprocessing_[timestamp].json`: after normalization and deduplication
- `05_llm_cleaning_[timestamp].json`: final version cleaned by the LLM

Each file contains:
- A summary (`summary`) with a text preview and statistics
- The metadata of the step (`status`, `timestamp`)

The complete data can be consulted in the main file.
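
A quick way to review these files is to load them in stage order and print one line per file. This is a minimal sketch under assumptions: `load_debug_outputs` and `describe` are hypothetical helpers, and the only keys relied on are `status` and `summary` as described above.

```python
import json
from pathlib import Path
from typing import Any, Dict, List


def load_debug_outputs(debug_dir: str) -> Dict[str, Dict[str, Any]]:
    """Load every JSON file in debug_dir, keyed by file name, in sorted order."""
    outputs: Dict[str, Dict[str, Any]] = {}
    for path in sorted(Path(debug_dir).glob("*.json")):
        with open(path, encoding="utf-8") as f:
            outputs[path.name] = json.load(f)
    return outputs


def describe(outputs: Dict[str, Dict[str, Any]]) -> List[str]:
    """Build one human-readable line per loaded file."""
    lines = []
    for name, data in outputs.items():
        status = data.get("status", "?")
        lines.append(f"{name}: status={status}, has_summary={'summary' in data}")
    return lines
```

Since the file names start with the stage number, sorting them alphabetically also sorts them by pipeline stage, for example `print("\n".join(describe(load_debug_outputs("/kaggle/working/debug_json/"))))`.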

In [26]:
if __name__ == "__main__":
    # Look for audio files; the first match is used (this overrides audio_file set above)
    audio_files = list(Path(UPLOAD_PATH).glob("*.mp3")) + \
                  list(Path(UPLOAD_PATH).glob("*.wav")) + \
                  list(Path(UPLOAD_PATH).glob("*.m4a"))
    
    if audio_files:
        audio_file = audio_files[0]
        print(f"\n📂 Fichier trouvé: {audio_file.name}")
        print("=" * 60)
        
        # Run the pipeline
        result = transcribe_audio_pipeline(
            str(audio_file),
            config,
            save_json=True
        )
        
        # Print a summary
        if result["status"] == "success":
            print("\n" + "=" * 60)
            print("📊 RÉSUMÉ DE LA TRANSCRIPTION")
            print("=" * 60)
            print(f"Durée: {result['duration']:.1f} secondes")
            print(f"Modèle: {result['model']}")
            print(f"Provider LLM: {result.get('llm_provider', 'none')}")
            print(f"Taux correction LLM: {result.get('llm_correction_rate', 0):.1%}")
            print(f"\n📝 Extrait (500 premiers caractères):")
            print("-" * 40)
            print(result['transcription_final'][:500] + "...")
    else:
        print("⚠️ Aucun fichier audio trouvé dans", UPLOAD_PATH)
        print("   Formats supportés: .mp3, .wav, .m4a")


📂 Fichier trouvé: test_30mn.mp3

[1/5] 🔊 Prétraitement audio...

[2/5] 📝 Transcription avec Whisper...
📊 Audio: 1928.0s | 3 chunks de 900s
  Chunk 1/3: 0.0s - 930.0s


  Chunk 2/3: 870.0s - 1800.0s
  Chunk 3/3: 1770.0s - 1928.0s
  💾 Sauvegardé: /kaggle/working/02_transcription_20250929_084057.json

[3/5] 🎙️ Diarisation des locuteurs...
⚠️ Diarisation pyannote échouée: No module named 'pyannote'
⚠️ Diarisation whisperx échouée: No module named 'whisperx'

[4/5] 🔧 Post-traitement du texte...
  Réduction: 14897 → 14543 caractères

[5/5] ✨ Nettoyage LLM...
⚠️ Package groq non installé, installation...
Collecting groq
  Downloading groq-0.32.0-py3-none-any.whl.metadata (16 kB)
Downloading groq-0.32.0-py3-none-any.whl (135 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 135.4/135.4 kB 3.1 MB/s eta 0:00:00
Installing collected packages: groq
Successfully installed groq-0.32.0
  Nettoyage chunk 1/19...
  Nettoyage chunk 2/19...
  Nettoyage chunk 3/19...
  Nettoyage chunk 4/19...
  Nettoyage chunk 5/19...
  Nettoyage chunk 6/19...
  Nettoyage chunk 7/19...
  Nettoyage chunk 8/19...
  Nettoyage chunk 9/19...
  Nettoyage chunk 10/19...
  Nettoyage chunk 11/19..