# Benchmarking classifiers on this dataset

| Model Name                                           | Size   | Type    | Notes                                                          |
| ---------------------------------------------------- | ------ | ------- | -------------------------------------------------------------- |
| `unitary/toxic-bert`                                 | ~110M  | BERT    | Classic English toxicity model trained on Jigsaw.              |
| `microsoft/Multilingual-Toxic-XLMR`                  | ~270M  | XLM-R   | Multilingual support (English, French, etc.), trained on mTCR. |
| `Davlan/bert-base-multilingual-cased-toxic-comments` | ~110M  | BERT    | French/English support, Jigsaw-based.                          |
| `facebook/roberta-hate-speech-dynabench`             | ~355M  | RoBERTa | English hate speech classifier.                                |
| `cardiffnlp/twitter-roberta-base-offensive`          | ~125M  | RoBERTa | Robust for social media, English only.                         |
| `CAMeL-Lab/bert-base-arabic-camelbert-da-sentiment`  | ~110M  | BERT    | Arabic sentiment (not toxicity); tests multilingual range.     |
| `papluca/xlm-roberta-base-language-detection`        | -      | XLM-R   | Not toxicity, but to test misclassification across languages.  |

| Model              | Strengths                                                  | Weaknesses                                     |
| ------------------ | ---------------------------------------------------------- | ---------------------------------------------- |
| `gpt-4o`           | Best zero-shot classification, supports nuanced reasoning. | Expensive, API-only.                           |
| `gpt-3.5-turbo`    | Fast, decent reasoning.                                    | Weaker on edge cases.                          |
| `text-davinci-003` | Classic prompt-following.                                  | Legacy completions model, superseded by GPT-4. |

Classify the following message as toxic (1) or non-toxic (0):
<text>
Answer: 
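
The prompt above can be filled in and its reply parsed along the following lines. This is a minimal sketch: `PROMPT_TEMPLATE`, `build_prompt`, and `parse_answer` are hypothetical names that do not appear in the notebook, and returning `None` for an unparseable reply is an assumption, chosen so that such rows can be counted instead of being silently scored as 0.

```python
import re
from typing import Optional

# Same wording as the prompt above, with a placeholder for the message.
PROMPT_TEMPLATE = (
    "Classify the following message as toxic (1) or non-toxic (0):\n"
    "{text}\n"
    "Answer: "
)


def build_prompt(text: str) -> str:
    """Insert one message into the classification prompt."""
    return PROMPT_TEMPLATE.format(text=text)


def parse_answer(reply: str) -> Optional[int]:
    """Return the first standalone 0 or 1 in the reply, or None if there is none."""
    match = re.search(r"\b([01])\b", reply)
    return int(match.group(1)) if match else None


print(build_prompt("Allez on rejoint les golemz"))
print(parse_answer(" 1\n"), parse_answer("Answer: 0."), parse_answer("unsure"))
```

Matching a standalone digit means a reply such as `10` or `v1` is not mistaken for a label.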

| Model             | Notes                                     |
| ----------------- | ----------------------------------------- |
| `claude-3-opus`   | Very high performance, nuanced judgement. |
| `claude-3-sonnet` | Balanced latency and cost.                |
| `claude-instant`  | Faster but less accurate.                 |



| Model                                | Prompt Type | Notes                                                      |
| ------------------------------------ | ----------- | ---------------------------------------------------------- |
| `mistralai/Mistral-7B-Instruct-v0.2` | Chat        | Strong general model; use chain-of-thought for edge cases. |
| `Qwen/Qwen1.5-7B-Chat`               | Chat        | Very good at French and toxicity detection.               |
| `meta-llama/Llama-3-8b-chat-hf`      | Chat        | Top-tier general performance.                              |

| Model                                   | Size | Notes                                       |
| --------------------------------------- | ---- | ------------------------------------------- |
| `Qwen/Qwen1.5-0.5B`                     | 0.5B | Lightest usable Qwen.                       |
| `mistralai/Mistral-7B-Instruct-v0.1`    | 7B   | Efficient and strong for offline inference. |
| `NousResearch/TinyLlama-1.1B-Chat-v1.0` | 1.1B | Useful for embedded or quick deployments.   |




## Libraries

In [2]:
import pandas as pd
import numpy as np
from sklearn.metrics import classification_report, roc_auc_score, confusion_matrix
from tqdm.notebook import tqdm
from pathlib import Path

# Optional: For calling OpenAI or HF models
# from openai import OpenAI
# from transformers import pipeline, AutoTokenizer, AutoModelForSequenceClassification

## Global variables

In [5]:
ROOT = Path("..")
DATA_DIR = ROOT / "data"
BENCHMARK_PATH = DATA_DIR / "benchmark" / "benchmark.csv" 

## Load dataset

In [6]:
df = pd.read_csv(BENCHMARK_PATH, encoding="utf-8")
df = df.dropna(subset=["content", "label"])
df["label"] = df["label"].astype(int)
df.head()

| Unnamed: 0 | msg_id                | content                                           | label |
| ---------- | --------------------- | ------------------------------------------------- | ----- |
| 0          | anon_msg_1a684cbe350a | Bien manger c'est le début du bonheur.            | 0     |
| 1          | anon_msg_0cc5e09d68f9 | non comparaison à l’ouzo t’as rien suivi          | 0     |
| 2          | anon_msg_bc0764308f82 | Faut être très fragile pour boucler sur un jeu... | 0     |
| 3          | anon_msg_2700e892bb78 | Non juste pour montrer que ça vaut le coup d'a... | 0     |
| 4          | anon_msg_41523912b7c8 | Allez on rejoint les golemz                       | 0     |
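
Before scoring any model it helps to know how the two classes are distributed, since a heavily skewed label column makes accuracy hard to interpret. The sketch below uses only the standard library; `label_distribution` is a hypothetical helper and the labels are toy values, not rows from the benchmark.

```python
from collections import Counter


def label_distribution(labels):
    """Return {label: (count, share)} for an iterable of class labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: (n, n / total) for label, n in sorted(counts.items())}


# Toy labels for illustration only.
toy_labels = [0, 0, 0, 1, 0, 0, 0, 0, 1, 0]
for label, (n, share) in label_distribution(toy_labels).items():
    print(f"label={label}: {n} rows ({share:.1%})")
```

In the notebook the same helper could be called as `label_distribution(df["label"])`.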


## Define prediction function

In [7]:
def mock_predict(text: str) -> int:
    """
    Replace this with a real model inference.
    Returns 1 if the message seems toxic, 0 otherwise.
    """
    keywords = ["fragile", "vote truqué", "golemz"]
    return int(any(word in text.lower() for word in keywords))

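# If using HuggingFace (example):
# classifier = pipeline("text-classification", model="unitary/toxic-bert")
# def hf_predict(text: str) -> int:
#     # Truncate long messages to the model's maximum input length.
#     out = classifier(text, truncation=True)[0]
#     # Label casing varies by model, so compare case-insensitively.
#     return int(out["label"].lower() == "toxic" and out["score"] > 0.5)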

# If using OpenAI GPT model (requires the OPENAI_API_KEY environment variable):
# from openai import OpenAI
# client = OpenAI()
# def gpt_predict(text: str) -> int:
#     prompt = f"Classify the following message as toxic (1) or non-toxic (0):\n\n{text}\n\nAnswer:"
#     response = client.chat.completions.create(
#         model="gpt-4o",
#         messages=[{"role": "user", "content": prompt}],
#         temperature=0,
#     )
#     # Check the start of the reply so a stray "1" later in the text is ignored.
#     return int(response.choices[0].message.content.strip().startswith("1"))
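
The prediction functions in this cell return a hard 0/1 label. Keeping the underlying score as well allows the decision threshold to be tuned later. The sketch below assumes the classifier yields, for each message, a list of `{"label", "score"}` dicts, which is the shape Hugging Face text-classification pipelines produce when asked for all labels. `toxic_score` and `to_label` are hypothetical helpers, and the scores in the example are made up.

```python
def toxic_score(label_scores, toxic_label="toxic"):
    """Return the score attached to `toxic_label`, or 0.0 if the label is absent."""
    for item in label_scores:
        if item["label"].lower() == toxic_label.lower():
            return float(item["score"])
    return 0.0


def to_label(score, threshold=0.5):
    """Turn a continuous score into a 0/1 prediction."""
    return int(score >= threshold)


# Stand-in for one message's classifier output (made-up numbers).
example_output = [
    {"label": "toxic", "score": 0.91},
    {"label": "insult", "score": 0.40},
]
score = toxic_score(example_output)
print(score, to_label(score), to_label(score, threshold=0.95))
```

Label names differ between models, so `toxic_label` would need adjusting for each classifier.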

## Run prediction

In [8]:
tqdm.pandas()
df["prediction"] = df["content"].progress_apply(mock_predict)
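
When the prediction function calls a paid API or a large model, repeated messages can be served from a cache instead of being classified again. This is a minimal sketch using `functools.lru_cache`; `make_cached` and `slow_predict` are hypothetical names, and the approach assumes the model is deterministic (for example, `temperature=0`).

```python
from functools import lru_cache


def make_cached(predict_fn, maxsize=None):
    """Wrap predict_fn so each distinct text is classified only once."""
    return lru_cache(maxsize=maxsize)(predict_fn)


calls = []


def slow_predict(text: str) -> int:
    """Stand-in for an expensive model call; records every real invocation."""
    calls.append(text)
    return int("golemz" in text.lower())


cached_predict = make_cached(slow_predict)
messages = ["Allez on rejoint les golemz", "ok", "ok", "Allez on rejoint les golemz"]
predictions = [cached_predict(m) for m in messages]
print(predictions, "model calls:", len(calls))
```

The wrapped function can be passed to `progress_apply` in place of the original one.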


## Metrics & report

In [9]:
y_true = df["label"]
y_pred = df["prediction"]

print("Classification Report:")
print(classification_report(y_true, y_pred, digits=3))

print("\nConfusion Matrix:")
print(confusion_matrix(y_true, y_pred))

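try:
    # y_pred holds hard 0/1 labels here; pass continuous scores when the model provides them.
    auc = roc_auc_score(y_true, y_pred)
    print(f"\nROC AUC Score: {auc:.3f}")
except ValueError:
    # roc_auc_score raises ValueError when y_true contains a single class.
    print("\nROC AUC Score could not be computed (only one class present).")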

Classification Report:
              precision    recall  f1-score   support

           0      0.968     0.999     0.984     21274
           1      0.053     0.001     0.003       694

    accuracy                          0.968     21968
   macro avg      0.511     0.500     0.493     21968
weighted avg      0.939     0.968     0.953     21968


Confusion Matrix:
[[21256    18]
 [  693     1]]

ROC AUC Score: 0.500
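
The figures in the report can be recomputed by hand from the four cells of the confusion matrix, which also shows how little the 0.968 accuracy says on data this imbalanced. The sketch below is standard-library Python; `binary_metrics` is a hypothetical helper, and the counts are the ones printed above.

```python
def binary_metrics(tn, fp, fn, tp):
    """Derive common binary-classification metrics from confusion-matrix counts."""
    total = tn + fp + fn + tp
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if tp + fp + fn else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "accuracy": (tp + tn) / total,
        # Mean of recall on each class.
        "balanced_accuracy": (recall + specificity) / 2,
        # Accuracy of always predicting the more frequent class.
        "majority_baseline": max(tn + fp, fn + tp) / total,
    }


# Counts taken from the confusion matrix above: [[TN, FP], [FN, TP]].
m = binary_metrics(tn=21256, fp=18, fn=693, tp=1)
for name, value in m.items():
    print(f"{name:>18}: {value:.3f}")
```

Accuracy and the always-predict-0 baseline both round to 0.968, so the keyword mock adds nothing over a constant prediction. Balanced accuracy comes out at 0.500, matching the reported ROC AUC. That is expected: with hard 0/1 predictions the ROC curve has a single interior point, and the area under it equals the mean of the true-positive and true-negative rates.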
