# Lab – “Digital Social Score” API

From text analysis to a secure, scalable, and compliant cloud infrastructure

## Step 1: Data exploration, analysis, and anonymization

### Download the **Toxic Comment Classification Dataset** (Hugging Face)

In [1]:
from datasets import load_dataset
import pandas as pd

dataset = load_dataset("thesofakillers/jigsaw-toxic-comment-classification-challenge", split="train")

# Display the first rows
df = pd.DataFrame(dataset)
df




| Unnamed: 0 | id | comment_text | toxic | severe_toxic | obscene | threat | insult | identity_hate |
|---|---|---|---|---|---|---|---|---|
| 0 | 0000997932d777bf | Explanation\nWhy the edits made under my usern... | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 000103f0d9cfb60f | D'aww! He matches this background colour I'm s... | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 000113f07ec002fd | Hey man, I'm really not trying to edit war. It... | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0001b41b1c6bb37e | "\nMore\nI can't make any real suggestions on ... | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0001d958c54c6e35 | You, sir, are my hero. Any chance you remember... | 0 | 0 | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 159566 | ffe987279560d7ff | ":::::And for the second time of asking, when ... | 0 | 0 | 0 | 0 | 0 | 0 |
| 159567 | ffea4adeee384e90 | You should be ashamed of yourself \n\nThat is ... | 0 | 0 | 0 | 0 | 0 | 0 |
| 159568 | ffee36eab5c267c9 | Spitzer \n\nUmm, theres no actual article for ... | 0 | 0 | 0 | 0 | 0 | 0 |
| 159569 | fff125370e4aaaf3 | And it looks like it was actually you who put ... | 0 | 0 | 0 | 0 | 0 | 0 |


### Dataset structure analysis

Let's analyze the structure, dimensions, and label distribution of the Toxic Comment Classification dataset.

In [None]:
# 1. DIMENSIONS
print("\n📏 DIMENSIONS:")
print(f"   Number of rows: {len(df):,}")
print(f"   Number of columns: {len(df.columns)}")
print(f"   Columns: {list(df.columns)}")

# 2. DATA TYPES
print("\n📋 DATA TYPES:")
print(df.dtypes)

# 3. MISSING VALUES
print("\n🔍 MISSING VALUES:")
missing = df.isnull().sum()
if missing.sum() == 0:
    print("   ✅ No missing values")
else:
    print(missing[missing > 0])

# 4. LABEL DISTRIBUTION
print("\n📊 LABEL DISTRIBUTION (% of toxic comments):")
label_columns = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']
for col in label_columns:
    count = df[col].sum()
    percentage = (count / len(df)) * 100
    print(f"   {col:15s}: {count:6,} ({percentage:5.2f}%)")

# 5. TEXT LENGTH STATISTICS
df['text_length'] = df['comment_text'].astype(str).apply(len)
df['word_count'] = df['comment_text'].astype(str).apply(lambda x: len(x.split()))

print("\n📝 TEXT STATISTICS:")
print(f"   Mean length: {df['text_length'].mean():.0f} characters")
print(f"   Median length: {df['text_length'].median():.0f} characters")
print(f"   Mean word count: {df['word_count'].mean():.0f} words")

# 6. SAMPLE COMMENTS
print("\n💬 SAMPLE COMMENTS:")
print("\n   Non-toxic:")
non_toxic = df[df['toxic'] == 0]['comment_text'].iloc[0]
print(f"   {non_toxic[:150]}...")

print("\n   Toxic:")
toxic = df[df['toxic'] == 1]['comment_text'].iloc[0]
print(f"   {toxic[:150]}...")
# 7. RATIONALE FOR USING 10K SAMPLES
print("\n🎯 RATIONALE FOR USING 10,000 SAMPLES:")
print(f"   ✓ Full dataset: {len(df):,} comments (very large)")
print("   ✓ 10k sample: manageable for a lab")
print("   ✓ Allows fast training for the lab")
print("   ✓ Class distribution: expected to stay close to the full dataset (first 10,000 rows, not stratified)")
print("   ✓ GDPR compliance: limits the amount of data processed")
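
The first 10,000 rows are not guaranteed to mirror the class balance of the full dataset. If preserving that balance matters, a stratified sample is one option. The sketch below is dependency-free and uses toy labels in place of `df['toxic']`; the function name `stratified_sample` is hypothetical.

```python
import random
from collections import defaultdict

def stratified_sample(labels, sample_size, seed=42):
    """Return sorted row indices whose per-class proportions follow the full set."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    total = len(labels)
    picked = []
    for indices in by_class.values():
        # Proportional allocation, rounded to the nearest integer
        quota = round(sample_size * len(indices) / total)
        picked.extend(rng.sample(indices, min(quota, len(indices))))
    return sorted(picked)

# Toy labels: 10% toxic, standing in for df['toxic']
labels = [1] * 100 + [0] * 900
sample = stratified_sample(labels, sample_size=200)
toxic_share = sum(labels[i] for i in sample) / len(sample)
print(len(sample), toxic_share)
```

Because each quota is rounded, the total can differ from `sample_size` by a few rows when the class proportions do not divide evenly. With pandas, `df.groupby('toxic').sample(frac=..., random_state=...)` achieves a comparable result.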

### Use spaCy to anonymize the data

In [2]:
# Install the model first: python -m spacy download en_core_web_sm
import spacy
import re
import pandas as pd
from tqdm import tqdm

# Load the English spaCy model (once)
nlp = spacy.load("en_core_web_sm")

# Regex patterns for detecting personal data
# (the TLD class is [A-Za-z]; a "|" inside a character class would match a literal pipe)
EMAIL_PATTERN = re.compile(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b')
PHONE_PATTERN = re.compile(r'\b(?:\+?1[-.\s]?)?\(?[0-9]{3}\)?[-.\s]?[0-9]{3}[-.\s]?[0-9]{4}\b')

def detect_personal_data(text):
    """
    Detect personal data (emails, phone numbers, person names) for GDPR compliance
    """
    if not text or pd.isna(text):
        # Same keys as the normal return value below
        return {
            'has_personal_data': False,
            'names': [],
            'emails': [],
            'phones': []
        }
    
    text_str = str(text)
    
    # 1. EMAILS
    emails = EMAIL_PATTERN.findall(text_str)
    
    # 2. PHONE NUMBERS
    phones = PHONE_PATTERN.findall(text_str)
    
    # 3. NAMES with spaCy NER
    doc = nlp(text_str)
    names = []
    for ent in doc.ents:
        if ent.label_ == 'PERSON':
            names.append(ent.text)
    
    return {
        'has_personal_data': bool(names or emails or phones),
        'names': names,
        'emails': emails,
        'phones': phones,
    }

def anonymize_text(text, detection_result):
    """
    Anonymize the text by replacing each detected item with a generic token
    """
    if not detection_result['has_personal_data']:
        return text
    
    anonymized_text = str(text)
    
    for email in detection_result['emails']:
        anonymized_text = anonymized_text.replace(email, "[EMAIL]")
    
    for phone in detection_result['phones']:
        anonymized_text = anonymized_text.replace(phone, "[PHONE]")
    
    # Note: str.replace substitutes every occurrence of the name, including substrings
    for name in detection_result['names']:
        anonymized_text = anonymized_text.replace(name, "[NAME]")
    
    return anonymized_text
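
To see the regex part in isolation, here is a minimal sketch that redacts emails and phone numbers without spaCy. It reuses the two patterns (with the email TLD class written as `[A-Za-z]`); the helper `redact_contacts` and the sample sentence are illustrative assumptions, not part of the notebook.

```python
import re

EMAIL_PATTERN = re.compile(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b')
PHONE_PATTERN = re.compile(r'\b(?:\+?1[-.\s]?)?\(?[0-9]{3}\)?[-.\s]?[0-9]{3}[-.\s]?[0-9]{4}\b')

def redact_contacts(text):
    """Replace emails and phone numbers with generic tokens (names are left to NER)."""
    text = EMAIL_PATTERN.sub("[EMAIL]", text)
    return PHONE_PATTERN.sub("[PHONE]", text)

sample = "Write to john.doe@example.com or call 555-123-4567 tomorrow."
redacted = redact_contacts(sample)
print(redacted)  # Write to [EMAIL] or call [PHONE] tomorrow.
```

Using `Pattern.sub` replaces matches in place, which avoids a second pass over the text with `str.replace`.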


In [None]:
# ANONYMIZED DATASET CREATION (df_anonymized)
print("DATASET ANONYMIZATION")
print("=" * 30)

# Work with 10,000 rows
max_to_anonymize = 10000

# Actual number of rows to process
actual_size = min(max_to_anonymize, len(df))

print("ANONYMIZATION PARAMETERS:")
print(f"Full dataset: {len(df):,} comments")
print(f"To anonymize: {actual_size:,} comments")

# Copy only the rows that will be anonymized, so the saved file
# does not contain untreated comments
df_anonymized = df.head(actual_size).copy()

# Processing statistics
comments_with_personal_data = 0
total_anonymizations = 0

print("\nAnonymization in progress...")

# Process the comments with a progress bar
for i in tqdm(range(actual_size), desc="Anonymization"):
    original_comment = df['comment_text'].iloc[i]
    
    # Detect personal data
    detection = detect_personal_data(original_comment)
    
    # If personal data is found, anonymize
    if detection['has_personal_data']:
        comments_with_personal_data += 1
        
        # Count the total number of items to anonymize
        num_elements = (len(detection['names']) + len(detection['emails']) + len(detection['phones']))
        total_anonymizations += num_elements
        
        # Anonymize the text
        anonymized_comment = anonymize_text(original_comment, detection)
        
        # Write it into the anonymized DataFrame
        # (default 0..n-1 index, so label i equals position i)
        df_anonymized.loc[i, 'comment_text'] = anonymized_comment

print(f"Total comments processed: {actual_size:,}")
print(f"Comments anonymized: {comments_with_personal_data:,}")
print(f"Share of comments affected: {comments_with_personal_data/actual_size*100:.2f}%")
print(f"Total items anonymized: {total_anonymizations:,}")

# Save the anonymized dataset
anonymized_filename = "dataset_anonymized_rgpd_10k.csv"
df_anonymized.to_csv(anonymized_filename, index=False)
print(f"\nAnonymized dataset saved: {anonymized_filename}")

### BEFORE/AFTER anonymization comparison

Let's look at concrete anonymization examples.

In [None]:
# Find comments containing personal data
examples_to_show = []
for i in range(min(10000, len(df))):
    original = df['comment_text'].iloc[i]
    detection = detect_personal_data(original)
    
    if detection['has_personal_data']:
        anonymized = anonymize_text(original, detection)
        examples_to_show.append({
            'index': i,
            'original': original,
            'anonymized': anonymized,
            'detection': detection
        })
        
        if len(examples_to_show) >= 10:  # Show 10 examples
            break

# Display the examples
for idx, example in enumerate(examples_to_show, 1):
    print(f"\nEXAMPLE {idx}:")
    print(f"   Index: {example['index']}")
    print("\n    BEFORE (personal data detected):")
    print(f"      {example['original'][:200]}...")
    print("\n   Detections:")
    if example['detection']['names']:
        print(f"      - Names: {example['detection']['names']}")
    if example['detection']['emails']:
        print(f"      - Emails: {example['detection']['emails']}")
    if example['detection']['phones']:
        print(f"      - Phones: {example['detection']['phones']}")
    print("\n    AFTER (anonymized):")
    print(f"      {example['anonymized'][:200]}...")
    print("-" * 70)

print(f"\nTotal examples found: {len(examples_to_show)}")

In [None]:
# SUMMARY TABLE
import pandas as pd

comparison_data = []
for example in examples_to_show[:5]:  # First 5 examples for the table
    comparison_data.append({
        'Index': example['index'],
        'Names': len(example['detection']['names']),
        'Emails': len(example['detection']['emails']),
        'Phones': len(example['detection']['phones']),
        'Total anonymizations': (
            len(example['detection']['names']) + 
            len(example['detection']['emails']) + 
            len(example['detection']['phones'])
        )
    })

df_comparison = pd.DataFrame(comparison_data)
print("\nANONYMIZATION SUMMARY TABLE:")
print(df_comparison.to_string(index=False))

### Rationale for the anonymization choices

**Why these methods?**

1. **NER (Named Entity Recognition) with spaCy**:
   - Automatic detection of person names
   - Good accuracy on English text, although the small `en_core_web_sm` model can still miss or mislabel names
   - Robust pre-trained model

2. **Regex for emails and phone numbers**:
   - Widely used standard patterns
   - Complements NER
   - High sensitivity on well-formed emails and North American phone formats

3. **Token replacement**:
   - `[NAME]`, `[EMAIL]`, `[PHONE]`: generic tokens
   - Preserves the structure of the text
   - Allows analysis without personal data
   - Irreversible: the original values are not kept in the anonymized dataset

**GDPR compliance:**
- Article 4(5): effective pseudonymization
- Article 25: privacy by design
- Article 30: records of processing activities
- Article 32: security of processing
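
One caveat of token replacement with `str.replace` is that a short detected name can also match inside longer words. A possible alternative, sketched here under the assumption that character offsets are available for each detection (with spaCy they would come from `ent.start_char` and `ent.end_char`), is to replace by span. The helper `replace_spans` and the sample offsets are hypothetical.

```python
def replace_spans(text, spans):
    """Replace (start, end, token) character spans, working right to left
    so that earlier offsets stay valid after each substitution."""
    for start, end, token in sorted(spans, reverse=True):
        text = text[:start] + token + text[end:]
    return text

text = "Al said that Alice is also wrong."
# Hypothetical detections as (start_char, end_char, token)
spans = [(0, 2, "[NAME]"), (13, 18, "[NAME]")]

print(replace_spans(text, spans))
# Naive approach for comparison: "Al" is also replaced inside "Alice"
print(text.replace("Al", "[NAME]"))
```

The span-based version yields `[NAME] said that [NAME] is also wrong.`, while the naive version leaves the fragment `[NAME]ice` behind.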

### Compare the original and anonymized versions and justify each choice

## Step 2: Preparing and training an AI model

### Text cleaning (punctuation, emojis, case)

In [None]:
import re
import string
import pandas as pd
from tqdm import tqdm

def clean_text(text):
    """
    Full text-cleaning function for NLP
    """
    if not text or pd.isna(text):
        return ""
    
    # Convert to string just in case
    text = str(text)
    
    # 1. CASE: convert to lowercase
    text = text.lower()
    
    # 2. EMOJIS: remove emojis (Unicode ranges)
    emoji_pattern = re.compile("["
                              u"\U0001F600-\U0001F64F"  # emoticons
                              u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                              u"\U0001F680-\U0001F6FF"  # transport & map symbols
                              u"\U0001F1E0-\U0001F1FF"  # flags
                              u"\U00002500-\U00002BEF"  # box drawing through misc symbols and arrows
                              u"\U00002702-\U000027B0"  # dingbats
                              u"\U000024C2-\U0001F251"
                              u"\U0001f926-\U0001f937"
                              u"\U00010000-\U0010ffff"  # all supplementary planes
                              u"\u2640-\u2642"
                              u"\u2600-\u2B55"
                              u"\u200d"  # zero-width joiner
                              u"\u23cf"
                              u"\u23e9"
                              u"\u231a"
                              u"\ufe0f"  # variation selector-16
                              u"\u3030"
                              "]+", flags=re.UNICODE)
    text = emoji_pattern.sub(r'', text)
    
    # 3. SPECIAL CHARACTERS: remove non-ASCII characters
    text = re.sub(r'[^\x00-\x7F]+', ' ', text)
    
    # 4. URLs: remove URLs
    text = re.sub(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\\(\\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', ' ', text)
    text = re.sub(r'www\.(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\\(\\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', ' ', text)
    
    # 5. MENTIONS: remove @mentions
    text = re.sub(r'@\w+', ' ', text)
    
    # 6. HASHTAGS: remove #hashtags
    text = re.sub(r'#\w+', ' ', text)
    
    # 7. DIGITS: remove standalone numbers
    text = re.sub(r'\b\d+\b', ' ', text)
    
    # 8. PUNCTUATION: remove punctuation
    # (this also turns [NAME], [EMAIL], [PHONE] into plain words)
    text = text.translate(str.maketrans('', '', string.punctuation))
    
    # 9. WHITESPACE: collapse repeated whitespace
    text = re.sub(r'\s+', ' ', text)
    
    # 10. TRIM: strip leading/trailing whitespace
    text = text.strip()
    
    return text
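
Lowercasing and punctuation removal turn `[NAME]` into the ordinary word `name`, so the model can no longer tell a placeholder from real text. If that distinction matters, one possible approach is to shield the placeholders during cleaning. This is a minimal sketch, not the notebook's method; the `xx...xx` sentinel and the function name are assumptions.

```python
import re
import string

PLACEHOLDER = re.compile(r'\[(NAME|EMAIL|PHONE)\]')
# Same punctuation table as a plain string.punctuation strip
PUNCT_TABLE = str.maketrans('', '', string.punctuation)

def clean_keep_placeholders(text):
    """Lowercase and strip punctuation, but keep [NAME]/[EMAIL]/[PHONE] intact."""
    # Swap each placeholder for a punctuation-free sentinel before cleaning
    text = PLACEHOLDER.sub(lambda m: f" xx{m.group(1).lower()}xx ", text)
    text = text.lower().translate(PUNCT_TABLE)
    text = re.sub(r'\s+', ' ', text).strip()
    # Restore the bracketed, upper-case form
    return re.sub(r'\bxx(name|email|phone)xx\b',
                  lambda m: f"[{m.group(1).upper()}]", text)

print(clean_keep_placeholders("Hello [NAME], email me at [EMAIL]!!!"))
# hello [NAME] email me at [EMAIL]
```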

In [None]:
# Apply the cleaning to the anonymized dataset
print("CLEANING THE ANONYMIZED DATASET")
print("=" * 35)

# Check that df_anonymized exists
if 'df_anonymized' not in locals():
    print("df_anonymized does not exist yet!")
else:
    # Work only on the 10,000 anonymized rows
    max_to_clean = 10000
    actual_size_to_clean = min(max_to_clean, len(df_anonymized))
    
    # Copy the anonymized dataset for cleaning (10k rows only)
    df_cleaned = df_anonymized.head(actual_size_to_clean).copy()
    
    # Cleaning statistics
    total_comments = len(df_cleaned)
    original_total_words = 0
    cleaned_total_words = 0
    empty_after_cleaning = 0
    
    print(f"Cleaning {total_comments:,} anonymized comments...")
    
    # Apply the cleaning with a progress bar
    for i in tqdm(range(total_comments), desc="Cleaning comments"):
        original_comment = df_cleaned['comment_text'].iloc[i]
        
        if original_comment and not pd.isna(original_comment):
            # Count words before cleaning
            original_total_words += len(str(original_comment).split())
            
            # Apply the standard cleaning
            cleaned_comment = clean_text(original_comment)
            
            # Write it back into the DataFrame
            df_cleaned.loc[i, 'comment_text'] = cleaned_comment
            
            # Count words after cleaning
            if cleaned_comment:
                cleaned_total_words += len(cleaned_comment.split())
            else:
                empty_after_cleaning += 1
    

    print(f"Total comments processed: {total_comments:,}")
    if original_total_words > 0:
        print(f"Word count reduction: {((original_total_words - cleaned_total_words) / original_total_words * 100):.1f}%")
    print(f"Comments empty after cleaning: {empty_after_cleaning}")
    
    # Drop empty comments
    df_cleaned = df_cleaned[df_cleaned['comment_text'].str.len() > 0]
    print(f"Comments kept: {len(df_cleaned):,}")
    
    # Save the cleaned dataset
    cleaned_filename = "dataset_cleaned_and_anonymized_10k.csv"
    df_cleaned.to_csv(cleaned_filename, index=False)
    print(f"\nCleaned dataset saved: {cleaned_filename}")

### Train the model with an SVM

In [None]:
# Check that df_cleaned exists
if 'df_cleaned' not in locals():
    print("ERROR: df_cleaned does not exist!")
    print("Please run the previous steps first.")
else:
    # Keep only the columns we need
    required_columns = ['comment_text', 'toxic']
    
    # Check that the columns exist
    available_columns = list(df_cleaned.columns)
    missing_columns = [col for col in required_columns if col not in available_columns]
    
    if missing_columns:
        print(f"WARNING: missing columns: {missing_columns}")
        # Fall back to the columns that are available
        final_columns = [col for col in required_columns if col in available_columns]
    else:
        final_columns = required_columns
    
    print(f"Final columns used: {final_columns}")
    
    # Build the final dataset
    df_final = df_cleaned[final_columns].copy()
    
   

### Train/test split

Let's split the dataset into training and test sets while preserving the class distribution.

In [None]:
import os
from sklearn.model_selection import train_test_split
import pandas as pd

# Check that df_final exists
if 'df_final' not in locals():
    print("ERROR: df_final does not exist!")
    print("Please run the previous cells.")
else:
    # Extract X (texts) and y (labels)
    X = df_final['comment_text'].values
    y = df_final['toxic'].values
    
    # Stratified 80/20 split
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, 
        test_size=0.2,     
        random_state=42, 
        stratify=y          # Preserve the class distribution
    )
    
    print("\nSPLIT RESULTS:")
    print(f"   Full dataset: {len(df_final):,} comments")
    print(f"   Train: {len(X_train):,} comments ({len(X_train)/len(df_final)*100:.1f}%)")
    print(f"   Test:  {len(X_test):,} comments ({len(X_test)/len(df_final)*100:.1f}%)")
    
    # Save the datasets
    train_df = pd.DataFrame({'comment_text': X_train, 'toxic': y_train})
    test_df = pd.DataFrame({'comment_text': X_test, 'toxic': y_test})
    
    # Make sure the output directory exists before writing
    os.makedirs("data", exist_ok=True)
    train_filename = "data/train_toxic_10k.csv"
    test_filename = "data/test_toxic_10k.csv"
    
    train_df.to_csv(train_filename, index=False)
    test_df.to_csv(test_filename, index=False)
    
    print("\nFILES SAVED:")
    print(f"   Train: {train_filename}")
    print(f"   Test:  {test_filename}")
    print("\nSplit completed successfully!")

### Training the SVM model

Let's train the SVM model with the TF-IDF pipeline.

In [None]:
import subprocess
import os
import sys

# Check that the training file exists
train_file = "data/train_toxic_10k.csv"
if not os.path.exists(train_file):
    print(f"ERROR: {train_file} does not exist!")
    print("Please run the previous cell (Train/test split).")
else:
    print(f"Training file found: {train_file}")
    print("\nStarting SVM training...")
    print("-" * 60)
    
    # Run the SVM.py script with the same interpreter as the notebook
    try:
        result = subprocess.run(
            [sys.executable, "model/SVM.py"],
            capture_output=True,
            text=True,
            timeout=300  # 5-minute timeout
        )
        
        # Show the output
        if result.stdout:
            print(result.stdout)
        
        if result.returncode == 0:
            print("\nSVM TRAINING COMPLETED SUCCESSFULLY!")
            
            # Check that the model was saved
            model_path = "model/svm_pipeline.pkl"
            if os.path.exists(model_path):
                model_size = os.path.getsize(model_path) / (1024 * 1024)  # MB
                print(f"Model saved: {model_path} ({model_size:.2f} MB)")
            else:
                print(f"Warning: {model_path} not found")
        else:
            print("\nERROR during training:")
            if result.stderr:
                print(result.stderr)
                
    except subprocess.TimeoutExpired:
        print("TIMEOUT: training took too long (>5 min)")
    except Exception as e:
        print(f"ERROR: {e}")
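
The training script `model/SVM.py` is not shown in this notebook, so its exact configuration is unknown. To illustrate what the TF-IDF stage of such a pipeline computes, here is a small dependency-free sketch. It assumes whitespace tokenization and uses the smoothed IDF formula with L2 normalization that scikit-learn documents as the `TfidfVectorizer` defaults; whether the script uses those defaults is an assumption.

```python
import math
from collections import Counter

def tfidf(corpus):
    """Return one {term: weight} dict per document, L2-normalized."""
    docs = [doc.split() for doc in corpus]
    n_docs = len(docs)
    # Document frequency: number of documents containing each term
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
        weights = {t: c * (math.log((1 + n_docs) / (1 + doc_freq[t])) + 1)
                   for t, c in tf.items()}
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({t: w / norm for t, w in weights.items()})
    return vectors

corpus = ["you are great", "you are stupid", "thank you"]
vectors = tfidf(corpus)
# "stupid" appears in one document only, so it outweighs "you", which appears in all three
print(vectors[1]["stupid"] > vectors[1]["you"])  # True
```

A linear SVM then learns one weight per term on top of these vectors, which is why rare, class-specific words tend to drive the decision.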

### SVM model evaluation

Let's evaluate the SVM model on the test set with detailed metrics and visualizations.

In [None]:
import os
import pickle
import pandas as pd
import numpy as np
from sklearn.metrics import (
    accuracy_score, precision_recall_fscore_support,
    confusion_matrix, classification_report
)
import matplotlib.pyplot as plt
import seaborn as sns

# SVM MODEL EVALUATION
print("=" * 60)
print("SVM MODEL EVALUATION")
print("=" * 60)

# Locate the model
model_path = "model/svm_pipeline.pkl"
if not os.path.exists(model_path):
    print(f"ERROR: model not found: {model_path}")
    print("Please train the model first (previous cell).")
else:
    # Load the model
    with open(model_path, 'rb') as f:
        svm_model = pickle.load(f)
    print(f"Model loaded: {model_path}")
    
    # Load the test data
    test_df = pd.read_csv("data/test_toxic_10k.csv")
    X_test = test_df['comment_text'].values
    y_test = test_df['toxic'].values
    
    print(f"Test data loaded: {len(X_test)} comments")
    
    # Predictions
    print("\nRunning predictions...")
    y_pred = svm_model.predict(X_test)
    
    # Compute the metrics
    accuracy = accuracy_score(y_test, y_pred)
    precision, recall, f1, _ = precision_recall_fscore_support(y_test, y_pred, average='binary')
    
    print("\nPERFORMANCE METRICS:")
    print(f"   Accuracy:  {accuracy:.4f} ({accuracy*100:.2f}%)")
    print(f"   Precision: {precision:.4f}")
    print(f"   Recall:    {recall:.4f}")
    print(f"   F1-Score:  {f1:.4f}")
    
    # Detailed classification report
    print("\nDETAILED CLASSIFICATION REPORT:")
    print(classification_report(y_test, y_pred, target_names=['Non-toxic', 'Toxic']))
    
    # Confusion matrix
    cm = confusion_matrix(y_test, y_pred)
    
    print("\nCONFUSION MATRIX:")
    print("                  Predicted Non-toxic  Predicted Toxic")
    print(f"Actual Non-toxic       {cm[0][0]:6d}        {cm[0][1]:6d}")
    print(f"Actual Toxic           {cm[1][0]:6d}        {cm[1][1]:6d}")
    
    # Plot the confusion matrix
    plt.figure(figsize=(8, 6))
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', 
                xticklabels=['Non-toxic', 'Toxic'],
                yticklabels=['Non-toxic', 'Toxic'])
    plt.title('Confusion Matrix - SVM')
    plt.ylabel('True class')
    plt.xlabel('Predicted class')
    plt.tight_layout()
    plt.savefig('svm_confusion_matrix.png', dpi=100, bbox_inches='tight')
    plt.show()
    
    print("\nConfusion matrix saved: svm_confusion_matrix.png")
    print("\nEVALUATION COMPLETE!")
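
To make the link between the confusion matrix and the reported metrics explicit, the sketch below derives accuracy, precision, recall, and F1 from a 2x2 matrix laid out as `[[TN, FP], [FN, TP]]`. The counts are hypothetical and are not results from this notebook.

```python
def binary_metrics(cm):
    """Compute accuracy, precision, recall and F1 for the positive class
    from a 2x2 confusion matrix laid out as [[TN, FP], [FN, TP]]."""
    (tn, fp), (fn, tp) = cm
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return accuracy, precision, recall, f1

# Hypothetical counts for a 2,000-comment test set with 10% toxic comments
cm = [[1780, 20], [80, 120]]
accuracy, precision, recall, f1 = binary_metrics(cm)
print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

With imbalanced classes, accuracy can look high (0.95 here) while recall on the toxic class stays modest (0.60), which is why the per-class metrics matter for this task.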

### Prediction tests on concrete examples

Let's test the SVM model on concrete examples to check its behavior.

In [None]:
import pickle

# Load the model if needed
if 'svm_model' not in locals():
    with open("model/svm_pipeline.pkl", 'rb') as f:
        svm_model = pickle.load(f)

# Test examples
test_examples = [
    # Non-toxic
    "This is a great article, thank you for sharing!",
    "I respectfully disagree with your point of view.",
    "Can you please explain this concept in more detail?",
    
    # Toxic
    "You are stupid and your opinion is worthless!",
    "I hate you and everything you stand for.",
    "Shut up, nobody cares about what you think!",
    
    # Borderline cases
    "This is ridiculous.",
    "I don't like this approach.",
    "You're wrong about this.",
]

print(f"\nTesting {len(test_examples)} examples:\n")

for i, text in enumerate(test_examples, 1):
    # Prediction
    prediction = svm_model.predict([text])[0]
    label = "TOXIC" if prediction == 1 else "NON-TOXIC"
    
    # Distance to the decision boundary, used as a rough confidence
    try:
        decision_score = svm_model.decision_function([text])[0]
        confidence = abs(decision_score)
    except AttributeError:
        # The model does not expose decision_function
        confidence = 0
    
    print(f"{i:2d}. {label} (confidence: {confidence:.2f})")
    print(f"    Text: \"{text}\"")
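
The decision score is an unbounded signed margin, whereas the API described later returns a score between 0 and 100. The notebook does not say how that conversion is done; one possible mapping, sketched here as an assumption, is a logistic squash. The `scale` parameter is hypothetical and the result is not a calibrated probability.

```python
import math

def decision_to_score(decision_score, scale=1.0):
    """Squash a signed SVM margin into a 0-100 score with a logistic curve.
    `scale` is a hypothetical steepness parameter, not a calibrated value."""
    return round(100 / (1 + math.exp(-scale * decision_score)))

for margin in (-2.0, 0.0, 2.0):
    score = decision_to_score(margin)
    print(margin, score, score >= 50)
```

A margin of 0 maps to 50, so thresholding the score at 50 is equivalent to the sign of the margin. For scores that can be read as probabilities, a calibration step fitted on held-out data would be needed.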
  

## Step 3: Deploying the model as a cloud API

### Building a FastAPI API that receives a text and returns a score

### FastAPI architecture

Our toxicity detection API is built with **FastAPI** and deployed on **Google Kubernetes Engine (GKE)**.

**Main endpoints:**

1. **`POST /token`**: JWT authentication
   - Input: `username` and `password`
   - Output: JWT valid for 24h

2. **`POST /predict`**: toxicity prediction
   - Input: `text` (comment to analyze)
   - Output: `score` (0-100) and `is_toxic` (boolean)
   - Authentication: JWT required

3. **`GET /health`**: health check
   - Output: status of the API and the model

4. **`GET /metrics`**: Prometheus metrics
   - Output: metrics for monitoring

**Security:**
- JWT authentication (python-jose)
- Input validation (pydantic)
- Rate limiting
- CORS configured
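
The list above mentions rate limiting without describing the mechanism. A token bucket is one common way to implement it; the sketch below shows the idea and is not necessarily what this API uses. The class name, capacity, and refill rate are illustrative assumptions.

```python
import time

class TokenBucket:
    """Allow bursts of up to `capacity` requests, refilled at `rate` tokens per second."""

    def __init__(self, capacity, rate, clock=time.monotonic):
        self.capacity = capacity
        self.rate = rate
        self.clock = clock
        self.tokens = float(capacity)
        self.last = clock()

    def allow(self):
        now = self.clock()
        # Refill in proportion to elapsed time, capped at the bucket capacity
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# Fake clock so the example is deterministic
fake_now = [0.0]
bucket = TokenBucket(capacity=3, rate=1.0, clock=lambda: fake_now[0])
burst = [bucket.allow() for _ in range(4)]  # four requests at t=0
fake_now[0] = 2.0                           # two seconds later
later = bucket.allow()
print(burst, later)  # [True, True, True, False] True
```

In an HTTP API, a request rejected by the limiter is typically answered with status 429 (Too Many Requests), usually with one bucket per client or per token.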

### Testing the API locally

Let's test the API locally (localhost:8080) to check that it works.

In [None]:
import requests
import json

# CONFIGURATION
API_URL = "http://localhost:8080"  # Local API
USERNAME = "admin"  
PASSWORD = "admin"

# STEP 1: Get a JWT
print("\nSTEP 1: JWT authentication")

try:
    token_response = requests.post(
        f"{API_URL}/token",
        data={
            "username": USERNAME,
            "password": PASSWORD
        },
        timeout=10
    )
    
    if token_response.status_code == 200:
        token_data = token_response.json()
        access_token = token_data['access_token']
        print("Token obtained successfully!")
        print(f"   Token type: {token_data['token_type']}")
        print(f"   Token: {access_token[:50]}...")
    else:
        print(f"Authentication error: {token_response.status_code}")
        print(f"   Response: {token_response.text}")
        access_token = None
        
except requests.exceptions.ConnectionError:
    print(f"ERROR: cannot connect to {API_URL}")
    print("   Make sure the API is running (python app.py)")
    access_token = None
except Exception as e:
    print(f"ERROR: {e}")
    access_token = None

# STEP 2: Test predictions
if access_token:
    print("\nSTEP 2: Prediction tests")
    print("-" * 70)
    
    # Test cases
    test_cases = [
        {"text": "This is a great article, thank you!", "expected": "Non-toxic"},
        {"text": "You are stupid and worthless!", "expected": "Toxic"},
        {"text": "I disagree with your opinion.", "expected": "Non-toxic"},
    ]
    
    headers = {"Authorization": f"Bearer {access_token}"}
    
    for i, test in enumerate(test_cases, 1):
        try:
            response = requests.post(
                f"{API_URL}/predict",
                headers=headers,
                json={"text": test["text"]},
                timeout=10
            )
            
            if response.status_code == 200:
                result = response.json()
                prediction = "Toxic" if result['is_toxic'] else "Non-toxic"
                print(f"\nTest {i}: {prediction} (Score: {result['score']}/100)")
                print(f"   Text: \"{test['text']}\"")
                print(f"   Expected: {test['expected']}")
            else:
                print(f"\nError in test {i}: {response.status_code}")
                print(f"   {response.text}")
                
        except Exception as e:
            print(f"\nError in test {i}: {e}")

### Testing the API on GKE

Let's test the API deployed on Google Kubernetes Engine.

In [None]:
import requests
import json

# PRODUCTION CONFIGURATION
# Base URL only: with a trailing "/docs", the calls below would hit /docs/health, /docs/token, ...
PROD_API_URL = "http://34.22.130.34"  # GKE public IP
USERNAME = "admin"
PASSWORD = "admin"

# Health check
print("\nAPI health check")


try:
    health_response = requests.get(f"{PROD_API_URL}/health", timeout=10)
    
    if health_response.status_code == 200:
        health_data = health_response.json()
        print("API is up!")
        print(f"   Status: {health_data.get('status', 'N/A')}")
        print(f"   Model: {health_data.get('model', 'N/A')}")
    else:
        print(f"Unexpected response: {health_response.status_code}")
        
except requests.exceptions.ConnectionError:
    print(f"ERROR: cannot connect to {PROD_API_URL}")
    print("   The production API may be stopped.")
except Exception as e:
    print(f"ERROR: {e}")

# Authentication
print("\nJWT authentication")
print("-" * 70)

try:
    # Note: the /token endpoint may not be deployed in production
    token_response = requests.post(
        f"{PROD_API_URL}/token",
        data={"username": USERNAME, "password": PASSWORD},
        timeout=10
    )
    
    if token_response.status_code == 200:
        token_data = token_response.json()
        access_token = token_data['access_token']
        print("Token obtained!")
        
        # Prediction test
        print("\nPrediction test")
        print("-" * 70)
        
        headers = {"Authorization": f"Bearer {access_token}"}
        test_text = "This is a test comment."
        
        pred_response = requests.post(
            f"{PROD_API_URL}/predict",
            headers=headers,
            json={"text": test_text},
            timeout=10
        )
        
        if pred_response.status_code == 200:
            result = pred_response.json()
            print("Prediction succeeded!")
            print(f"   Text: \"{test_text}\"")
            print(f"   Score: {result['score']}/100")
            print(f"   Toxic: {result['is_toxic']}")
        else:
            print(f"Prediction error: {pred_response.status_code}")
            
    elif token_response.status_code == 404:
        print("Endpoint /token not found (404)")
        print("   The production API must be redeployed with the /token endpoint")
    else:
        print(f"Authentication error: {token_response.status_code}")
        
except requests.exceptions.ConnectionError:
    print("Connection failed")
except Exception as e:
    print(f"ERROR: {e}")

## Step 4: Security and GDPR compliance

### Security measures implemented

Our API follows security standards and GDPR requirements:

**1. JWT (JSON Web Token) authentication**
- Tokens signed with a cryptographic secret
- Configurable expiration (24h by default)
- Systematic validation of requests

**2. Input validation**
- Pydantic models for strict validation
- Input sanitization
- Protection against injection

**3. GDPR**
- Prior anonymization of the data (spaCy NER)
- No storage of personal data
- Documented record of processing

**4. Infrastructure security**
- GCP service account with minimal permissions
- Secrets in Secret Manager (JWT_SECRET)
- Kubernetes network policies

**5. Monitoring & audit**
- Centralized logs (Cloud Logging)
- Security metrics (failed authentication attempts)
- Alerts on abnormal behavior
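
To show what "signed with a cryptographic secret" and "configurable expiration" mean in practice, here is a standard-library sketch of HS256 token signing and verification. The API itself relies on python-jose, so this is only an illustration of the mechanism and not its implementation; the function names and the demo secret are assumptions.

```python
import base64
import hashlib
import hmac
import json
import time

def _b64url(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")

def _b64url_decode(data: str) -> bytes:
    return base64.urlsafe_b64decode(data + "=" * (-len(data) % 4))

def create_token(username, secret, ttl_seconds=24 * 3600, now=None):
    """Build an HS256-signed JWT whose `exp` claim is `ttl_seconds` in the future."""
    now = int(time.time()) if now is None else now
    header = _b64url(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    payload = _b64url(json.dumps({"sub": username, "exp": now + ttl_seconds}).encode())
    signature = hmac.new(secret.encode(), f"{header}.{payload}".encode(), hashlib.sha256).digest()
    return f"{header}.{payload}.{_b64url(signature)}"

def verify_token(token, secret, now=None):
    """Return the claims if the signature is valid and the token is not expired, else None."""
    now = int(time.time()) if now is None else now
    try:
        header, payload, signature = token.split(".")
    except ValueError:
        return None
    expected = hmac.new(secret.encode(), f"{header}.{payload}".encode(), hashlib.sha256).digest()
    # Constant-time comparison to avoid leaking how many characters matched
    if not hmac.compare_digest(_b64url(expected), signature):
        return None
    claims = json.loads(_b64url_decode(payload))
    return claims if claims["exp"] > now else None

token = create_token("admin", secret="demo-secret", now=1_000)
print(verify_token(token, "demo-secret", now=2_000))           # valid token
print(verify_token(token, "wrong-secret", now=2_000))          # bad signature
print(verify_token(token, "demo-secret", now=1_000 + 86_401))  # expired
```

In production code, a maintained library such as python-jose should handle this, with the secret loaded from Secret Manager rather than hard-coded.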

In [None]:
# Checking IAM permissions and the service account
print("Kubernetes IAM configuration")
print("=" * 60)

iam_config = {
    "Service Account": "mlops-api-sa",
    "Namespace": "default",
    "Permissions GCP": [
        "storage.objects.get (Cloud Storage)",
        "logging.logEntries.create (Cloud Logging)",
        "monitoring.timeSeries.create (Cloud Monitoring)",
        "aiplatform.endpoints.predict (Vertex AI)"
    ],
    "Principe": "Least privilege (minimal permissions)",
    "Workload Identity": "Recommended for production"
}

print(f"Kubernetes service account: {iam_config['Service Account']}")
print("\nGCP permissions granted:")
# The key is 'Permissions GCP'; 'Permissions' alone raises a KeyError
for perm in iam_config['Permissions GCP']:
    print(f" {perm}")

print(f"\nPrinciple applied: {iam_config['Principe']}")
print(f"Workload Identity: {iam_config['Workload Identity']}")

print("\nCommands to check IAM:")
print("gcloud iam service-accounts list")
print("kubectl get serviceaccount mlops-api-sa -o yaml")

## Step 5: Load test simulation with Locust

### Load testing goals

Our load testing strategy aims to:
- **Measure performance**: P50, P95, P99 latency
- **Identify limits**: maximum capacity (RPS)
- **Detect bottlenecks**: CPU, memory, I/O
- **Validate scalability**: Horizontal Pod Autoscaling

### Test configuration
- **Tool**: Locust (Python)
- **Scenarios**: authentication + predictions (toxic/non-toxic mix)
- **Ramp-up**: progressive (controlled spawn rate)
- **Key metrics**: response time, throughput, error rate

In [None]:
# Check the locustfile.py file
import os

locust_file = "locustfile.py"
if os.path.exists(locust_file):
    print("locustfile.py found")
    print("\nLoad test structure:")
    
    # Read the file contents
    with open(locust_file, 'r', encoding='utf-8') as f:
        content = f.read()
        
    print("\nTest scenarios detected:")
    if "def login" in content:
        print("  JWT authentication (/token)")
    if "def predict_toxicity" in content:
        print("  Toxicity predictions (/predict)")
    if "def health_check" in content:
        print("  Health checks (/health)")
    print("  • Endpoint: http://localhost:8080 (or production)")
else:
    print("locustfile.py not found")

### Running the load test simulation

**Command to run Locust (headless mode):**

```bash
# Local test (5 minutes, 50 users max)
locust -f locustfile.py --host=http://localhost:8080 --users 50 --spawn-rate 5 --run-time 5m --headless --html report_loadtest.html

# Production test
locust -f locustfile.py --host=http://34.77.253.251 --users 100 --spawn-rate 10 --run-time 10m --headless --html report_production.html
```

**Command for the web interface (interactive mode):**

```bash
locust -f locustfile.py --host=http://localhost:8080
# Open http://localhost:8089 in the browser
```

**Metrics:**
- **P50 (median)**: response time within which 50% of requests complete
- **P95**: 95% of requests complete within this time (a typical SLA target)
- **P99**: 99% of requests complete within this time (reveals outliers)
- **RPS (requests/sec)**: throughput sustained by the API
- **Error rate**: share of failed requests (target: below 1%)
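
As a worked example of these percentiles, the sketch below computes P50, P95, and P99 with the nearest-rank method on hypothetical response times. The numbers are made up for illustration and are not measurements from this project; load testing tools may use a slightly different percentile estimator.

```python
import math

def percentile(values, p):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

# Hypothetical response times in milliseconds
latencies_ms = [20] * 90 + [100] * 8 + [900] * 2
for p in (50, 95, 99):
    print(f"P{p}: {percentile(latencies_ms, p)} ms")
```

Here the mean is 44 ms, yet P99 is 900 ms: tail percentiles expose slow requests that an average hides.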

## Step 6: Monitoring with Google Cloud Monitoring

### Monitoring architecture

Our monitoring solution uses **Google Cloud Monitoring** for:

**1. Kubernetes/GKE metrics (automatic)**
- Pod CPU/memory
- Network I/O
- Disk usage
- Pod restarts and crashes

**2. Application metrics (custom)**
- Request rate (requests/sec)
- Latency (P50, P95, P99)
- Error rate
- Token generation rate
- Model prediction count

**3. Centralized logs**
- Cloud Logging (container stdout/stderr)
- Filters and advanced search
- Log-based alerts

**4. Alerting**
- Latency P95 > 500ms
- Error rate > 5%
- Pod availability < 2 replicas
- Memory usage > 80%
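
The alert thresholds can be read as simple predicates over a metrics snapshot. The sketch below spells out that logic in plain Python; in Cloud Monitoring these would be configured as alerting policies, and the metric names used here are illustrative assumptions rather than Cloud Monitoring identifiers.

```python
# Thresholds copied from the alerting list
ALERT_RULES = {
    "latency_p95_ms": lambda v: v > 500,
    "error_rate_pct": lambda v: v > 5,
    "available_replicas": lambda v: v < 2,
    "memory_usage_pct": lambda v: v > 80,
}

def firing_alerts(snapshot):
    """Return the names of the rules violated by a metrics snapshot."""
    return [name for name, breached in ALERT_RULES.items()
            if name in snapshot and breached(snapshot[name])]

# Hypothetical snapshot of current metrics
snapshot = {"latency_p95_ms": 620, "error_rate_pct": 0.4,
            "available_replicas": 3, "memory_usage_pct": 85}
print(firing_alerts(snapshot))  # ['latency_p95_ms', 'memory_usage_pct']
```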