# Toxic Comment Classification - Multilingual Model

## XLM-RoBERTa for toxicity detection in French, English, and Arabic

This notebook fine-tunes an **XLM-RoBERTa** model to detect toxicity in several languages.

### Supported languages:
- French (FR)
- English (EN)
- Arabic (AR)

### Architecture:
- Base: `xlm-roberta-base` (Meta AI)
- Fine-tuned on the Jigsaw dataset plus a small set of hard-coded French and Arabic examples
- Binary classification (toxic / non-toxic)

In [None]:
# Install dependencies
!pip install transformers datasets torch accelerate sentencepiece protobuf sacremoses -q

In [None]:
import pandas as pd
import numpy as np
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
from torch.optim import AdamW  # transformers.AdamW is deprecated in favor of torch.optim.AdamW
from transformers import (
    XLMRobertaTokenizer,
    XLMRobertaModel,
    get_linear_schedule_with_warmup
)
from sklearn.model_selection import train_test_split
from sklearn.metrics import (
    classification_report, 
    f1_score, 
    roc_auc_score,
    confusion_matrix
)
from tqdm.auto import tqdm
import warnings
warnings.filterwarnings('ignore')

# Configuration
RANDOM_STATE = 42
np.random.seed(RANDOM_STATE)
torch.manual_seed(RANDOM_STATE)

# Device
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Device: {device}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")

## 1. Loading and Preparing the Data

In [None]:
# Load the original (English) dataset
print("Loading the original dataset...")
df_en = pd.read_csv('train.csv')
print(f"Original dataset: {len(df_en)} comments")

# Create the binary label: 1 if at least one of the six toxicity labels is set
LABEL_COLS = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']
df_en['is_toxic'] = (df_en[LABEL_COLS].sum(axis=1) > 0).astype(int)
df_en['lang'] = 'en'

print(f"\nDistribution:")
print(df_en['is_toxic'].value_counts())

In [None]:
# Sampling to balance the classes and reduce training time
print("Balanced sampling...")

# Separate toxic and non-toxic comments
df_toxic = df_en[df_en['is_toxic'] == 1]
df_clean = df_en[df_en['is_toxic'] == 0]

n_toxic = len(df_toxic)
print(f"Toxic comments: {n_toxic}")
print(f"Clean comments: {len(df_clean)}")

# Undersample the non-toxic comments (toxic:clean ratio of 1:1.5),
# never asking for more rows than are available
n_clean_sample = min(int(n_toxic * 1.5), len(df_clean))
df_clean_sample = df_clean.sample(n=n_clean_sample, random_state=RANDOM_STATE)

# Combine and shuffle
df_balanced = pd.concat([df_toxic, df_clean_sample], ignore_index=True)
df_balanced = df_balanced.sample(frac=1, random_state=RANDOM_STATE).reset_index(drop=True)

print(f"\nBalanced dataset: {len(df_balanced)} comments")
print(df_balanced['is_toxic'].value_counts())
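
The 1:1.5 ratio fixes the class proportions regardless of how many toxic comments the dataset contains. A minimal sketch of that arithmetic follows; `class_shares` is a hypothetical helper and the count of 10,000 is illustrative, not taken from the dataset.

```python
def class_shares(n_toxic: int, clean_ratio: float = 1.5) -> dict:
    """Return sample counts and the toxic share after undersampling.

    Hypothetical helper mirroring n_clean_sample = int(n_toxic * clean_ratio).
    """
    n_clean = int(n_toxic * clean_ratio)
    total = n_toxic + n_clean
    return {
        "n_toxic": n_toxic,
        "n_clean": n_clean,
        "total": total,
        "toxic_share": n_toxic / total,
    }


# Illustrative count only
shares = class_shares(10_000)
print(shares)
```

With this ratio the toxic class makes up 1 / 2.5 of the balanced set, so the data remain mildly imbalanced toward clean comments.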

## 2. Creating Multilingual Data

We create French and Arabic examples so that the model sees these languages during fine-tuning.

In [None]:
# Multilingual examples for training
# These examples are meant to help the model generalize to French and Arabic

multilingual_examples = [
    # French - toxic
    {"comment_text": "Tu es stupide et je te deteste!", "is_toxic": 1, "lang": "fr"},
    {"comment_text": "Ferme ta gueule espece d'imbecile", "is_toxic": 1, "lang": "fr"},
    {"comment_text": "Va te faire foutre sale con", "is_toxic": 1, "lang": "fr"},
    {"comment_text": "T'es qu'un pauvre nul sans cerveau", "is_toxic": 1, "lang": "fr"},
    {"comment_text": "Je vais te tuer si tu continues", "is_toxic": 1, "lang": "fr"},
    {"comment_text": "Degage de la sale raciste", "is_toxic": 1, "lang": "fr"},
    {"comment_text": "Tu me degouttes espece de porc", "is_toxic": 1, "lang": "fr"},
    {"comment_text": "Tous les gens comme toi sont des idiots", "is_toxic": 1, "lang": "fr"},
    {"comment_text": "Tu merites de mourir", "is_toxic": 1, "lang": "fr"},
    {"comment_text": "Retourne dans ton pays sale etranger", "is_toxic": 1, "lang": "fr"},
    
    # French - non-toxic
    {"comment_text": "Merci beaucoup pour cet article interessant", "is_toxic": 0, "lang": "fr"},
    {"comment_text": "Je suis d'accord avec ton analyse", "is_toxic": 0, "lang": "fr"},
    {"comment_text": "Tres bon travail, continue comme ca", "is_toxic": 0, "lang": "fr"},
    {"comment_text": "C'est une excellente idee", "is_toxic": 0, "lang": "fr"},
    {"comment_text": "Je ne suis pas d'accord mais je respecte ton avis", "is_toxic": 0, "lang": "fr"},
    {"comment_text": "Pourrais-tu expliquer davantage s'il te plait?", "is_toxic": 0, "lang": "fr"},
    {"comment_text": "J'ai appris quelque chose de nouveau aujourd'hui", "is_toxic": 0, "lang": "fr"},
    {"comment_text": "Bonne journee a tous", "is_toxic": 0, "lang": "fr"},
    {"comment_text": "Cet article m'a beaucoup aide", "is_toxic": 0, "lang": "fr"},
    {"comment_text": "Je trouve cette discussion tres constructive", "is_toxic": 0, "lang": "fr"},
    
    # Arabic - toxic
    {"comment_text": "انت غبي ولا تفهم شيء", "is_toxic": 1, "lang": "ar"},
    {"comment_text": "اخرس يا حمار", "is_toxic": 1, "lang": "ar"},
    {"comment_text": "انت كلب وابن كلب", "is_toxic": 1, "lang": "ar"},
    {"comment_text": "سوف اقتلك", "is_toxic": 1, "lang": "ar"},
    {"comment_text": "يلعن ابوك", "is_toxic": 1, "lang": "ar"},
    {"comment_text": "انت حقير ووسخ", "is_toxic": 1, "lang": "ar"},
    {"comment_text": "كل الناس مثلك اغبياء", "is_toxic": 1, "lang": "ar"},
    {"comment_text": "ارجع لبلدك يا اجنبي", "is_toxic": 1, "lang": "ar"},
    {"comment_text": "انت لا تستحق الحياة", "is_toxic": 1, "lang": "ar"},
    {"comment_text": "تفو عليك", "is_toxic": 1, "lang": "ar"},
    
    # Arabic - non-toxic
    {"comment_text": "شكرا جزيلا على هذا المقال الرائع", "is_toxic": 0, "lang": "ar"},
    {"comment_text": "اتفق معك تماما", "is_toxic": 0, "lang": "ar"},
    {"comment_text": "عمل ممتاز", "is_toxic": 0, "lang": "ar"},
    {"comment_text": "فكرة جيدة جدا", "is_toxic": 0, "lang": "ar"},
    {"comment_text": "لا اتفق لكن احترم رايك", "is_toxic": 0, "lang": "ar"},
    {"comment_text": "هل يمكنك التوضيح اكثر من فضلك", "is_toxic": 0, "lang": "ar"},
    {"comment_text": "تعلمت شيئا جديدا اليوم", "is_toxic": 0, "lang": "ar"},
    {"comment_text": "يوم سعيد للجميع", "is_toxic": 0, "lang": "ar"},
    {"comment_text": "هذا المقال مفيد جدا", "is_toxic": 0, "lang": "ar"},
    {"comment_text": "نقاش بناء ومفيد", "is_toxic": 0, "lang": "ar"},
]

df_multilingual = pd.DataFrame(multilingual_examples)
print(f"Multilingual examples: {len(df_multilingual)}")
print(df_multilingual.groupby(['lang', 'is_toxic']).size())

In [None]:
# Augment the multilingual data by repeating each example 50 times (simple oversampling).
# Caution: these exact duplicates are split at random later on, so the same sentences end up
# in both the training and validation sets and make FR/AR validation scores optimistic.
df_multilingual_augmented = pd.concat([df_multilingual] * 50, ignore_index=True)
print(f"Augmented multilingual examples: {len(df_multilingual_augmented)}")

# Combine with the English dataset
df_balanced_small = df_balanced[['comment_text', 'is_toxic', 'lang']].head(20000)  # Cap the size to limit training time
df_train_full = pd.concat([df_balanced_small, df_multilingual_augmented], ignore_index=True)
df_train_full = df_train_full.sample(frac=1, random_state=RANDOM_STATE).reset_index(drop=True)

print(f"\nFinal dataset: {len(df_train_full)} comments")
print("\nBy language:")
print(df_train_full['lang'].value_counts())
print("\nBy toxicity:")
print(df_train_full['is_toxic'].value_counts())
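
Repeating the multilingual examples before a random train/validation split places identical sentences on both sides of the split. One possible way to avoid this, sketched here under the assumption that a few unique FR/AR sentences can be held out, is to split the unique examples first and oversample only the training part. `split_then_oversample` and the placeholder data are hypothetical and not part of the notebook.

```python
import random


def split_then_oversample(examples, val_fraction=0.2, repeats=50, seed=42):
    """Split unique examples first, then repeat only the training part."""
    rng = random.Random(seed)
    unique = list(dict.fromkeys(examples))  # drop exact duplicates, keep order
    rng.shuffle(unique)
    n_val = max(1, int(len(unique) * val_fraction))
    val = unique[:n_val]              # held-out sentences, never repeated
    train = unique[n_val:] * repeats  # oversampling happens after the split
    return train, val


# Hypothetical (text, label) pairs standing in for the rows of df_multilingual
examples = [(f"sentence {i}", i % 2) for i in range(40)]
train, val = split_then_oversample(examples)

print(len(train), len(val))
print(set(train).isdisjoint(val))
```

The validation sentences never appear in the training set, so their scores reflect generalization rather than memorization.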

## 3. Preparing the XLM-RoBERTa Model

In [None]:
# Model configuration
MODEL_NAME = 'xlm-roberta-base'
MAX_LENGTH = 128
BATCH_SIZE = 16
EPOCHS = 3
LEARNING_RATE = 2e-5

# Load the tokenizer
print(f"Loading the tokenizer {MODEL_NAME}...")
tokenizer = XLMRobertaTokenizer.from_pretrained(MODEL_NAME)
print(f"Vocabulary: {tokenizer.vocab_size} tokens")

In [None]:
# PyTorch dataset
class ToxicDataset(Dataset):
    def __init__(self, texts, labels, tokenizer, max_length):
        self.texts = texts
        self.labels = labels
        self.tokenizer = tokenizer
        self.max_length = max_length
    
    def __len__(self):
        return len(self.texts)
    
    def __getitem__(self, idx):
        text = str(self.texts[idx])
        label = self.labels[idx]
        
        encoding = self.tokenizer(
            text,
            padding='max_length',
            truncation=True,
            max_length=self.max_length,
            return_tensors='pt'
        )
        
        # squeeze(0) drops only the batch dimension added by return_tensors='pt'
        return {
            'input_ids': encoding['input_ids'].squeeze(0),
            'attention_mask': encoding['attention_mask'].squeeze(0),
            'label': torch.tensor(label, dtype=torch.long)
        }

In [None]:
# Split train/validation
X_train, X_val, y_train, y_val = train_test_split(
    df_train_full['comment_text'].values,
    df_train_full['is_toxic'].values,
    test_size=0.15,
    random_state=RANDOM_STATE,
    stratify=df_train_full['is_toxic'].values
)

print(f"Train: {len(X_train)} samples")
print(f"Validation: {len(X_val)} samples")

# Create the datasets
train_dataset = ToxicDataset(X_train, y_train, tokenizer, MAX_LENGTH)
val_dataset = ToxicDataset(X_val, y_val, tokenizer, MAX_LENGTH)

# Create the dataloaders
train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=BATCH_SIZE)
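
With `padding='max_length'`, every comment is padded to 128 tokens, even very short ones. A common alternative is to pad each batch only to its longest sequence. The sketch below shows the idea without any library; `pad_batch` is a hypothetical helper, the token ids are made up, and the pad id of 1 is assumed to match the XLM-RoBERTa vocabulary.

```python
PAD_ID = 1  # assumed: the <pad> id in the XLM-RoBERTa vocabulary


def pad_batch(token_id_lists, pad_id=PAD_ID):
    """Pad each sequence to the longest one in the batch and build the mask."""
    longest = max(len(ids) for ids in token_id_lists)
    input_ids, attention_mask = [], []
    for ids in token_id_lists:
        n_pad = longest - len(ids)
        input_ids.append(list(ids) + [pad_id] * n_pad)
        attention_mask.append([1] * len(ids) + [0] * n_pad)  # 0 marks padding
    return input_ids, attention_mask


# Made-up token ids; 0 and 2 stand for <s> and </s>
batch = [[0, 581, 2], [0, 581, 7, 9, 2]]
input_ids, attention_mask = pad_batch(batch)

print(input_ids)
print(attention_mask)
```

In practice this logic would be passed to the `DataLoader` as a `collate_fn`, with tokenization done without padding in `__getitem__`; `DataCollatorWithPadding` from transformers covers the same use case.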

In [None]:
# XLM-RoBERTa model for classification
class XLMRobertaToxicClassifier(nn.Module):
    def __init__(self, model_name, num_classes=2, dropout=0.3):
        super().__init__()
        self.xlm_roberta = XLMRobertaModel.from_pretrained(model_name)
        self.dropout = nn.Dropout(dropout)
        self.classifier = nn.Linear(self.xlm_roberta.config.hidden_size, num_classes)
        
    def forward(self, input_ids, attention_mask):
        outputs = self.xlm_roberta(
            input_ids=input_ids,
            attention_mask=attention_mask
        )
        # Use the first token (<s>, XLM-RoBERTa's equivalent of BERT's [CLS])
        pooled_output = outputs.last_hidden_state[:, 0, :]
        pooled_output = self.dropout(pooled_output)
        return self.classifier(pooled_output)

# Initialize the model
print(f"Loading the model {MODEL_NAME}...")
model = XLMRobertaToxicClassifier(MODEL_NAME)
model = model.to(device)

# Count the parameters
total_params = sum(p.numel() for p in model.parameters())
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Total parameters: {total_params:,}")
print(f"Trainable parameters: {trainable_params:,}")

## 4. Training

In [None]:
# Optimizer and scheduler
optimizer = AdamW(model.parameters(), lr=LEARNING_RATE, weight_decay=0.01)

total_steps = len(train_loader) * EPOCHS
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(0.1 * total_steps),
    num_training_steps=total_steps
)

# Weighted loss for imbalanced classes
class_weights = torch.tensor([1.0, 1.5]).to(device)  # More weight on the toxic class
criterion = nn.CrossEntropyLoss(weight=class_weights)

print(f"Total steps: {total_steps}")
print(f"Warmup steps: {int(0.1 * total_steps)}")
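
The scheduler raises the learning rate linearly from 0 during the first 10% of steps, then decays it linearly back to 0. A minimal sketch of that multiplier follows; `linear_warmup_factor` is a hypothetical stand-in for what the scheduler computes, and the step counts are illustrative.

```python
def linear_warmup_factor(step: int, warmup_steps: int, total_steps: int) -> float:
    """Multiplier applied to the base learning rate at a given step."""
    if step < warmup_steps:
        # Warmup: grow linearly from 0 to 1
        return step / max(1, warmup_steps)
    # Decay: shrink linearly from 1 to 0 over the remaining steps
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))


BASE_LR = 2e-5
total_steps, warmup_steps = 1000, 100  # illustrative values
checkpoints = (0, 50, 100, 550, 1000)
lrs = [BASE_LR * linear_warmup_factor(s, warmup_steps, total_steps) for s in checkpoints]

for step, lr in zip(checkpoints, lrs):
    print(f"step {step:4d}: lr = {lr:.2e}")
```

The peak learning rate is reached exactly at the end of warmup, which is why `scheduler.step()` is called once per batch rather than once per epoch.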

In [None]:
def train_epoch(model, dataloader, optimizer, scheduler, criterion, device):
    model.train()
    total_loss = 0
    predictions = []
    actuals = []
    
    progress_bar = tqdm(dataloader, desc="Training")
    
    for batch in progress_bar:
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['label'].to(device)
        
        optimizer.zero_grad()
        
        outputs = model(input_ids, attention_mask)
        loss = criterion(outputs, labels)
        
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        optimizer.step()
        scheduler.step()
        
        total_loss += loss.item()
        
        preds = torch.argmax(outputs, dim=1).cpu().numpy()
        predictions.extend(preds)
        actuals.extend(labels.cpu().numpy())
        
        progress_bar.set_postfix({'loss': loss.item()})
    
    avg_loss = total_loss / len(dataloader)
    f1 = f1_score(actuals, predictions, average='binary')
    
    return avg_loss, f1

def evaluate(model, dataloader, criterion, device):
    model.eval()
    total_loss = 0
    predictions = []
    actuals = []
    probabilities = []
    
    with torch.no_grad():
        for batch in tqdm(dataloader, desc="Evaluating"):
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            labels = batch['label'].to(device)
            
            outputs = model(input_ids, attention_mask)
            loss = criterion(outputs, labels)
            
            total_loss += loss.item()
            
            probs = torch.softmax(outputs, dim=1)[:, 1].cpu().numpy()
            preds = torch.argmax(outputs, dim=1).cpu().numpy()
            
            predictions.extend(preds)
            actuals.extend(labels.cpu().numpy())
            probabilities.extend(probs)
    
    avg_loss = total_loss / len(dataloader)
    f1 = f1_score(actuals, predictions, average='binary')
    auc = roc_auc_score(actuals, probabilities)
    
    return avg_loss, f1, auc, predictions, actuals
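
Both functions rely on the class-weighted cross-entropy defined earlier. With the default mean reduction, PyTorch divides the weighted sum of per-sample losses by the sum of the weights of the target classes, not by the batch size. A library-free sketch of that computation follows; `weighted_cross_entropy` is a hypothetical helper written for illustration.

```python
import math


def weighted_cross_entropy(logits, labels, class_weights):
    """Weighted mean of per-sample negative log-likelihoods."""
    weighted_sum = 0.0
    weight_total = 0.0
    for row, y in zip(logits, labels):
        m = max(row)  # subtract the max for a numerically stable log-sum-exp
        log_norm = m + math.log(sum(math.exp(v - m) for v in row))
        nll = log_norm - row[y]
        weighted_sum += class_weights[y] * nll
        weight_total += class_weights[y]
    return weighted_sum / weight_total


# Two samples with uniform logits: each has a loss of log(2) whatever its weight
loss = weighted_cross_entropy([[0.0, 0.0], [0.0, 0.0]], [0, 1], [1.0, 1.5])
print(loss)
```

A missed toxic comment (label 1 with a low toxic logit) therefore contributes 1.5 times as much to the gradient as an equally wrong prediction on a clean comment.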

In [None]:
# Training
print("="*70)
print("TRAINING THE MULTILINGUAL XLM-ROBERTA MODEL")
print("="*70)

best_f1 = 0
history = {'train_loss': [], 'val_loss': [], 'train_f1': [], 'val_f1': [], 'val_auc': []}

for epoch in range(EPOCHS):
    print(f"\n{'='*70}")
    print(f"Epoch {epoch + 1}/{EPOCHS}")
    print("="*70)
    
    # Train
    train_loss, train_f1 = train_epoch(
        model, train_loader, optimizer, scheduler, criterion, device
    )
    
    # Evaluate
    val_loss, val_f1, val_auc, val_preds, val_actuals = evaluate(
        model, val_loader, criterion, device
    )
    
    # Record the metrics
    history['train_loss'].append(train_loss)
    history['val_loss'].append(val_loss)
    history['train_f1'].append(train_f1)
    history['val_f1'].append(val_f1)
    history['val_auc'].append(val_auc)
    
    print(f"\nResults for epoch {epoch + 1}:")
    print(f"  Train Loss: {train_loss:.4f} | Train F1: {train_f1:.4f}")
    print(f"  Val Loss: {val_loss:.4f} | Val F1: {val_f1:.4f} | Val AUC: {val_auc:.4f}")
    
    # Save the best model
    if val_f1 > best_f1:
        best_f1 = val_f1
        torch.save(model.state_dict(), 'xlm_roberta_toxic_best.pt')
        print(f"  New best model saved! F1: {best_f1:.4f}")

print("\n" + "="*70)
print("TRAINING COMPLETE")
print(f"Best F1: {best_f1:.4f}")
print("="*70)

## 5. Final Evaluation

In [None]:
# Load the best model (map_location keeps this working if the checkpoint was saved on another device)
model.load_state_dict(torch.load('xlm_roberta_toxic_best.pt', map_location=device))
model.eval()

# Final evaluation
val_loss, val_f1, val_auc, val_preds, val_actuals = evaluate(
    model, val_loader, criterion, device
)

print("\n" + "="*70)
print("CLASSIFICATION REPORT")
print("="*70)
print(classification_report(val_actuals, val_preds, target_names=['Non-toxic', 'Toxic']))

print("\nConfusion matrix:")
print(confusion_matrix(val_actuals, val_preds))
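
The confusion matrix printed by scikit-learn has true classes as rows and predicted classes as columns, so for this binary task it reads `[[TN, FP], [FN, TP]]`. The sketch below derives precision, recall, and F1 for the toxic class from such a matrix; `binary_metrics` is a hypothetical helper and the counts are made up.

```python
def binary_metrics(cm):
    """Precision, recall and F1 for the positive class from [[TN, FP], [FN, TP]]."""
    (tn, fp), (fn, tp) = cm
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Made-up counts: 100 clean and 100 toxic validation comments
cm = [[90, 10],
      [20, 80]]
metrics = binary_metrics(cm)
print(metrics)
```

Here recall is limited by the 20 toxic comments that were missed, while precision is limited by the 10 clean comments flagged as toxic.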

## 6. Testing on Multilingual Examples

In [None]:
def predict_toxicity(text, model, tokenizer, device):
    """Predict the toxicity of a single text."""
    model.eval()
    
    encoding = tokenizer(
        text,
        padding='max_length',
        truncation=True,
        max_length=MAX_LENGTH,  # same limit as during training
        return_tensors='pt'
    )
    
    input_ids = encoding['input_ids'].to(device)
    attention_mask = encoding['attention_mask'].to(device)
    
    with torch.no_grad():
        outputs = model(input_ids, attention_mask)
        probs = torch.softmax(outputs, dim=1)
        toxic_prob = probs[0][1].item()
        is_toxic = toxic_prob >= 0.5
    
    return {
        'is_toxic': is_toxic,
        'toxic_probability': toxic_prob,
        'clean_probability': 1 - toxic_prob
    }

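# Multilingual tests
# Note: several FR/AR sentences below also appear verbatim in the training data,
# so they check memorization rather than generalization.
test_examples = [
    # English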
    ("You are an idiot and I hate you!", "en"),
    ("Thank you for this great article!", "en"),
    
    # French
    ("Tu es stupide et je te deteste!", "fr"),
    ("Merci beaucoup pour ce super article!", "fr"),
    ("Va te faire voir sale con!", "fr"),
    ("C'est une excellente contribution", "fr"),
    
    # Arabic
    ("انت غبي ولا تفهم شيء", "ar"),
    ("شكرا جزيلا على هذا المقال الرائع", "ar"),
    ("اخرس يا حمار", "ar"),
    ("عمل ممتاز جدا", "ar"),
]

print("\n" + "="*70)
print("MULTILINGUAL TESTS")
print("="*70)

for text, lang in test_examples:
    result = predict_toxicity(text, model, tokenizer, device)
    status = "TOXIC" if result['is_toxic'] else "OK"
    print(f"\n[{lang.upper()}] {text}")
    print(f"     -> {status} (probability: {result['toxic_probability']:.1%})")
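
The loop above prints predictions without comparing them with expected labels, and the validation set is dominated by English. Assuming each test example were extended with a gold label, accuracy could be reported per language as sketched below. `accuracy_by_language` and the records are hypothetical and do not come from an actual run.

```python
from collections import defaultdict


def accuracy_by_language(records):
    """Compute accuracy per language from (lang, y_true, y_pred) triples."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for lang, y_true, y_pred in records:
        totals[lang] += 1
        hits[lang] += int(y_true == y_pred)
    return {lang: hits[lang] / totals[lang] for lang in totals}


# Made-up records for illustration only
records = [
    ("en", 1, 1), ("en", 0, 0),
    ("fr", 1, 1), ("fr", 0, 1),
    ("ar", 1, 0), ("ar", 0, 0),
]
per_lang = accuracy_by_language(records)
print(per_lang)
```

A per-language breakdown makes it visible when a good overall score hides weak French or Arabic performance.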

## 7. Saving the Model

In [None]:
# Save the model and the tokenizer
import json
import os

save_dir = 'xlm_roberta_multilingual'
os.makedirs(save_dir, exist_ok=True)

# Save the model weights
torch.save(model.state_dict(), os.path.join(save_dir, 'model.pt'))

# Save the tokenizer
tokenizer.save_pretrained(save_dir)

# Save the configuration
config = {
    'model_name': MODEL_NAME,
    'max_length': MAX_LENGTH,
    'num_classes': 2,
    'dropout': 0.3,
    'languages': ['en', 'fr', 'ar'],
    'best_f1': float(best_f1)  # plain float so that it serializes cleanly to JSON
}

with open(os.path.join(save_dir, 'config.json'), 'w') as f:
    json.dump(config, f, indent=2)

print(f"Model saved to: {save_dir}/")
print("Files:")
for name in sorted(os.listdir(save_dir)):
    size = os.path.getsize(os.path.join(save_dir, name)) / (1024*1024)
    print(f"  - {name}: {size:.1f} MB")
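
At deployment time, the saved `config.json` is what tells the inference code how to rebuild the classifier before loading `model.pt`. The following sketch reads such a file back and checks that the expected keys are present. `load_inference_config` is a hypothetical helper, and the example writes to a temporary directory so that it runs on its own.

```python
import json
import os
import tempfile

REQUIRED_KEYS = {"model_name", "max_length", "num_classes", "dropout"}


def load_inference_config(save_dir):
    """Read config.json from save_dir and check that the expected keys exist."""
    with open(os.path.join(save_dir, "config.json"), encoding="utf-8") as fh:
        config = json.load(fh)
    missing = REQUIRED_KEYS - config.keys()
    if missing:
        raise KeyError(f"config.json is missing keys: {sorted(missing)}")
    return config


# Round trip in a temporary directory, using the same keys as the notebook
with tempfile.TemporaryDirectory() as tmp_dir:
    example = {
        "model_name": "xlm-roberta-base",
        "max_length": 128,
        "num_classes": 2,
        "dropout": 0.3,
        "languages": ["en", "fr", "ar"],
    }
    with open(os.path.join(tmp_dir, "config.json"), "w", encoding="utf-8") as fh:
        json.dump(example, fh, indent=2)
    loaded = load_inference_config(tmp_dir)

print(loaded["model_name"], loaded["max_length"])
```

The model would then be rebuilt with `XLMRobertaToxicClassifier(loaded['model_name'], loaded['num_classes'], loaded['dropout'])` before calling `load_state_dict` on the saved weights.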

In [None]:
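# The best checkpoint is already in the working directory (saved during training),
# so no copy is needed; just confirm that it exists.
import os
best_checkpoint = 'xlm_roberta_toxic_best.pt'
assert os.path.exists(best_checkpoint), f"Missing checkpoint: {best_checkpoint}"
print("\nModel ready for deployment!")
print(f"File: {best_checkpoint}")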

## Summary

The multilingual **XLM-RoBERTa** model has been fine-tuned and saved.

### Characteristics:
- **Base model**: xlm-roberta-base (278M parameters)
- **Languages**: English (EN), French (FR), Arabic (AR)
- **Task**: Binary classification (toxic / non-toxic)

### Next steps:
1. Deploy on AWS Lambda
2. Add the multilingual option to the frontend