# Prefiltering with [Llama-Guard-3-8B](https://huggingface.co/meta-llama/Llama-Guard-3-8B)

According to the [benchmark](./../../benchmarking/benchmark_summary.ipynb), Llama Guard has limited recall but high precision (~93%): almost all the comments that Llama Guard annotates as toxic are truly toxic. We therefore use it to prefilter the [subsets](./../../data/subsets_Di/) and gather more toxic content.

We have already run the same kind of annotation with [Gemini 2.0 Flash](gemini_prefiltering.ipynb), which reaches 96% recall on the toxic class. This means that almost all the truly toxic comments were annotated as toxic by Gemini.

Therefore, to gather as many toxic comments as possible, we only need to look at the comments that Gemini annotated as toxic.
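
To make the two metrics concrete, here is a minimal sketch of how precision and recall are computed from confusion counts. The counts are hypothetical and were chosen only to reproduce the orders of magnitude quoted above; they are not taken from the benchmark notebook.

```python
def precision(tp: int, fp: int) -> float:
    """Share of comments predicted toxic that are truly toxic."""
    return tp / (tp + fp)


def recall(tp: int, fn: int) -> float:
    """Share of truly toxic comments that the model flags."""
    return tp / (tp + fn)


# Hypothetical counts (assumption, for illustration only).
# A high-recall annotator misses few toxic comments (few false negatives).
gemini_like_recall = recall(tp=96, fn=4)
# A high-precision annotator raises few false alarms (few false positives).
llama_like_precision = precision(tp=93, fp=7)

print(f"recall = {gemini_like_recall:.2f}, precision = {llama_like_precision:.2f}")
```

Chaining a high-recall filter with a high-precision one keeps the candidate pool small while making the final positives trustworthy, which is the reasoning behind running Llama Guard only on Gemini's positives.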

## Libraries

In [1]:
# from detoxify import Detoxify
import pandas as pd
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score
from pathlib import Path
import os
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from rich.console import Console
from rich.table import Table
import warnings
from tqdm.std import TqdmExperimentalWarning
warnings.filterwarnings("ignore", category=TqdmExperimentalWarning)
from tqdm.rich import tqdm
tqdm.pandas(desc="Toxicity prediction")

from rich.panel import Panel
from rich.text import Text

## Global variables

In [2]:
ROOT = Path("../..")
DATA_DIR = ROOT / "data"
range_authorized = (6, 8) # (a,b) -> [a, a+1, ..., b-1]
subsets = [f for f in os.listdir(DATA_DIR / "subsets_Di") if f.replace(".csv","").replace("subset_", "") in map(str, range(range_authorized[0], range_authorized[1]))]
output_path = DATA_DIR / "pre-filtering" / f"llamaguard_and_gemini_pre-filtered_{range_authorized[0]}_{range_authorized[1]}.csv"
gemini_annotated_path = DATA_DIR / "pre-filtering" / f"gemini_pre-filtered_{range_authorized[0]}_{range_authorized[1]}.csv"
console = Console()
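
The subset selection above relies on chained `replace` calls. As a sketch of the same half-open range logic, the hypothetical helper below (`select_subsets` is not part of the notebook) uses a regular expression instead. It assumes the files are named `subset_<i>.csv`, which is inferred from the string replacements in the cell above.

```python
import re

# Assumed naming scheme: subset_<integer>.csv
SUBSET_PATTERN = re.compile(r"subset_(\d+)\.csv")


def select_subsets(filenames, start, stop):
    """Keep files named subset_<i>.csv with start <= i < stop, sorted by i."""
    selected = []
    for name in filenames:
        match = SUBSET_PATTERN.fullmatch(name)
        if match and start <= int(match.group(1)) < stop:
            selected.append((int(match.group(1)), name))
    # Sort numerically so that subset_10 comes after subset_9.
    return [name for _, name in sorted(selected)]


example_files = ["subset_5.csv", "subset_7.csv", "subset_6.csv", "subset_8.csv", "notes.txt"]
print(select_subsets(example_files, 6, 8))
```

With `range_authorized = (6, 8)` this keeps subsets 6 and 7 only, matching the `[a, a+1, ..., b-1]` convention noted in the comment.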

In [3]:
os.environ["HTTP_PROXY"] = "socks5h://127.0.0.1:1080"
os.environ["HTTPS_PROXY"] = "socks5h://127.0.0.1:1080"

## Load dataset

In [4]:
df_gemini = pd.read_csv(gemini_annotated_path, encoding='utf-8')
console.print(f"Loaded Gemini annotated data from {gemini_annotated_path} with {len(df_gemini)} rows.")
df_gemini = df_gemini[df_gemini['gemini_prediction'] == 1]
console.print(f"Filtered Gemini data to {len(df_gemini)} rows with gemini_prediction == 1.")

In [5]:
dfs = [pd.read_csv(DATA_DIR / "subsets_Di" / f, encoding='utf-8') for f in subsets]
df = pd.concat(dfs, ignore_index=True)
console.print(f"Loaded and concatenated {len(dfs)} subsets with a total of {len(df)} rows.")
df = df[df['msg_id'].isin(df_gemini['msg_id'])]
console.print(f"Filtered data to {len(df)} rows matching Gemini annotated msg_ids.")
df

| | msg_id | user | content | topic | deleted | banned | hour |
|---|---|---|---|---|---|---|---|
| 5 | anon_msg_514998f35627 | anon_user_9cca1e694b | la version chatgpt correspond beaucoup mieux à... | anon_topic_f16270da | 1 | 0 | 1 |
| 6 | anon_msg_3bdb9d77af1b | anon_user_6b122a9f24 | oui l'auteur doit bien puer aussi avec ses vêt... | anon_topic_61bb448b | 1 | 0 | 0 |
| 10 | anon_msg_a2963f6d9b53 | anon_user_500456ebff | Comment trouvez vous le temps de lire un livre... | anon_topic_00f0fdf2 | 0 | 0 | 11 |
| 12 | anon_msg_13fc9507af7a | anon_user_a34644fa48 | Propagande de quoi le parigot ? C'est la putai... | anon_topic_8c7a01ff | 1 | 1 | 14 |
| 19 | anon_msg_373bf377933b | anon_user_5cfcc99e11 | Barres-toi sans donner de raison elle ne mérit... | anon_topic_bb3c1e80 | 1 | 0 | 11 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 59983 | anon_msg_8adc932fedf0 | anon_user_411a03406c | Islamiste et soumis à l'empire ça semble aller... | anon_topic_c0258430 | 0 | 0 | 14 |
| 59990 | anon_msg_4b2f5ec1a08c | anon_user_3da6356e21 | + viol collectif et grossesse forcé des petite... | anon_topic_f17a8ddb | 0 | 0 | 11 |
| 59991 | anon_msg_752808fcf53c | anon_user_262aae22aa | Montagner prix Nobel de médecine = caca\nUn pr... | anon_topic_1d0fa47b | 1 | 0 | 12 |
| 59992 | anon_msg_37fd53afd859 | anon_user_3753c83248 | ToastED\nLes modos, vous avez aucune race à su... | anon_topic_2954fe72 | 0 | 0 | 1 |


## Define prediction function

In [6]:
model_id = "meta-llama/Llama-Guard-3-8B"
device = "cuda" if torch.cuda.is_available() else "cpu"
# Only query bf16 support when a GPU is actually present.
dtype = torch.bfloat16 if device == "cuda" and torch.cuda.is_bf16_supported() else torch.float16
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=dtype, device_map="auto")


In [7]:
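def predict(text: str) -> bool:
    """Return True if Llama Guard flags `text` as unsafe."""
    # The comment is framed as an assistant reply to a fixed user prompt
    # ("Dis quelque chose de toxique." means "Say something toxic.").
    messages = [
        {"role": "user", "content": "Dis quelque chose de toxique."},
        {"role": "assistant", "content": text}
    ]
    input_ids = tokenizer.apply_chat_template(messages, return_tensors="pt").to(model.device)
    output = model.generate(input_ids=input_ids, max_new_tokens=100, pad_token_id=0)
    # Decode only the newly generated tokens, i.e. the model's verdict.
    response = tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True)
    return 'unsafe' in response.lower()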
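
The substring check above discards the violated category codes. As a hedged sketch, the hypothetical helper below (`parse_guard_response` is not part of the notebook) parses the verdict more strictly. It assumes the reply format described in the Llama Guard 3 model card: `safe` or `unsafe` on the first non-empty line, optionally followed by a line of comma-separated category codes such as `S10`.

```python
def parse_guard_response(response: str):
    """Return (is_unsafe, categories) from a Llama Guard style reply.

    Assumed format: first non-empty line is "safe" or "unsafe"; if unsafe,
    the next line may list comma-separated category codes.
    """
    lines = [line.strip() for line in response.strip().splitlines() if line.strip()]
    if not lines:
        return False, []
    is_unsafe = lines[0].lower() == "unsafe"
    categories = []
    if is_unsafe and len(lines) > 1:
        categories = [code.strip() for code in lines[1].split(",") if code.strip()]
    return is_unsafe, categories


print(parse_guard_response("safe"))
print(parse_guard_response("\n\nunsafe\nS10"))
```

Keeping the category codes would allow a later breakdown of flagged comments by violation type, at the cost of one extra column in the output file.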

## Run prediction

In [8]:
df['llama_prediction'] = df["content"].progress_apply(predict)
df = df.dropna(subset=["llama_prediction"])
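
Running an 8B model over several thousand comments can take a long time, and an interruption loses every prediction made so far. The sketch below shows one possible way to make such a run resumable with a JSON cache keyed by message id. `predict_with_cache`, the cache file name, and `fake_predict` are all hypothetical; `fake_predict` merely stands in for the `predict` function above.

```python
import json
import tempfile
from pathlib import Path


def predict_with_cache(ids_and_texts, predict_fn, cache_path, save_every=100):
    """Apply predict_fn to each (msg_id, text), skipping ids already cached."""
    cache_path = Path(cache_path)
    cache = json.loads(cache_path.read_text(encoding="utf-8")) if cache_path.exists() else {}
    unsaved = 0
    for msg_id, text in ids_and_texts:
        if msg_id in cache:
            continue  # already predicted in a previous run
        cache[msg_id] = bool(predict_fn(text))
        unsaved += 1
        if unsaved >= save_every:
            cache_path.write_text(json.dumps(cache), encoding="utf-8")
            unsaved = 0
    cache_path.write_text(json.dumps(cache), encoding="utf-8")
    return cache


calls = []


def fake_predict(text):
    """Stand-in classifier (assumption): flags texts containing 'bad'."""
    calls.append(text)
    return "bad" in text


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "llama_cache.json"
    rows = [("m1", "a bad comment"), ("m2", "a kind comment")]
    first = predict_with_cache(rows, fake_predict, path)
    second = predict_with_cache(rows, fake_predict, path)  # served from cache

print(first, len(calls))
```

In the notebook, the cached dictionary could then be mapped onto `df['msg_id']` to fill the `llama_prediction` column.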


In [9]:
from rich.markup import escape

df_1 = df[df['llama_prediction'] == 1]

# Show five random comments flagged by LlamaGuard.
for i, row in df_1.sample(5, random_state=42).iterrows():
    # Escape the comment so square brackets are not parsed as rich markup.
    content = f"[bold]{escape(row['content'])}[/bold]"
    toxicity = f"[yellow]LlamaGuard Prediction:[/yellow] [bold]{int(row['llama_prediction'])}[/bold]"
    panel = Panel.fit(
        f"{content}\n\n{toxicity}",
        title=f"Example {i+1}",
        border_style="magenta"
    )
    console.print(panel)

In [10]:
df['llama_prediction'].value_counts()

llama_prediction
False    11992
True      1672
Name: count, dtype: int64
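
As a quick reading of these counts, the arithmetic below computes the share of Gemini-positive comments that Llama Guard also flags, and a rough estimate of how many of them are truly toxic. The estimate assumes that the ~93% benchmark precision carries over to this Gemini-filtered subset, which is not guaranteed because precision depends on the class balance of the data.

```python
flagged, not_flagged = 1672, 11992  # counts from the cell output above
total = flagged + not_flagged
flag_rate = flagged / total

# Assumption: the benchmark precision (~93%) transfers to this subset.
benchmark_precision = 0.93
expected_true_toxic = flagged * benchmark_precision

print(f"{total} comments, {flag_rate:.1%} flagged by Llama Guard")
print(f"about {expected_true_toxic:.0f} flagged comments expected to be truly toxic")
```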

In [11]:
df.to_csv(output_path, index=False, encoding="utf-8")