# Creating a batch $D_{\text{filtered}}$ from [Gemini 2.0-flash](https://ai.google.dev/gemini-api/docs/quickstart?hl=fr) + [Llama-Guard-3-8B](https://huggingface.co/meta-llama/Llama-Guard-3-8B) pre-filtering

In the [Gemini pre-filtering notebook](gemini_prefiltering.ipynb) and the [Llama Guard pre-filtering notebook](Llama_guard_prefiltering.ipynb), we pre-filtered likely toxic content to balance the dataset. Here we build a batch, $D_{\text{filtered}}$, from those pre-filtering results.

## Libraries

In [1]:
from pathlib import Path
import os
import pandas as pd
from rich.console import Console

console = Console()

## Global variables

In [7]:
ROOT = Path("../..")
DATA_DIR = ROOT / "data"
SUBSETS_FILTERED = DATA_DIR / "pre-filtering"

# Keep only the combined Llama Guard + Gemini CSVs and skip checkpoint copies.
files = [
    f
    for f in os.listdir(SUBSETS_FILTERED)
    if f.endswith(".csv") and "checkpoint" not in f and "llamaguard_and_gemini" in f
]

output_csv_1 = DATA_DIR / "subsets_Di" / "subset_filtered.csv"
output_csv_2 = DATA_DIR / "subsets_Di_annotated" / "subset_filtered_gpt-4o-mini.csv"
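
The file filter above can be checked in isolation. The sketch below rebuilds the same three conditions as a small helper and runs it on a temporary folder. The helper name `select_prefiltered` and all file names are invented for illustration, and the `sorted` call is an addition that the notebook does not make.

```python
import tempfile
from pathlib import Path


def select_prefiltered(folder: Path) -> list[str]:
    """Return the CSV names that pass the notebook's three conditions, sorted."""
    return sorted(
        p.name
        for p in folder.iterdir()
        if p.name.endswith(".csv")
        and "checkpoint" not in p.name
        and "llamaguard_and_gemini" in p.name
    )


with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    # Hypothetical file names: two valid parts, one checkpoint copy,
    # one CSV from a different filter, and one non-CSV file.
    for name in [
        "part2_llamaguard_and_gemini.csv",
        "part1_llamaguard_and_gemini.csv",
        "part1_llamaguard_and_gemini-checkpoint.csv",
        "part1_gemini_only.csv",
        "notes_llamaguard_and_gemini.txt",
    ]:
        (folder / name).touch()
    selected = select_prefiltered(folder)

print(selected)
```

`os.listdir` returns entries in arbitrary order, so sorting is one way to make the later concatenation order reproducible across machines.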

## Load dataset

In [4]:
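# Read every selected CSV and stack them into a single frame.
dfs = [pd.read_csv(SUBSETS_FILTERED / file) for file in files]
df = pd.concat(dfs, ignore_index=True)

# Keep only the rows where `llama_prediction` is true.
df = df[df['llama_prediction'].astype(int) == 1]
df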

| Unnamed: 0 | msg_id | user | content | topic | deleted | banned | hour | llama_prediction |
|---|---|---|---|---|---|---|---|---|
| 4 | anon_msg_373bf377933b | anon_user_5cfcc99e11 | Barres-toi sans donner de raison elle ne mérit... | anon_topic_bb3c1e80 | 1 | 0 | 11 | True |
| 20 | anon_msg_8c76e5ea5401 | anon_user_62bd0aa998 | Ah oui les agriculteurs qui vivent en Seine Sa... | anon_topic_83bbdbb1 | 0 | 0 | 12 | True |
| 27 | anon_msg_cdcc06b5f882 | anon_user_e769c386b5 | je propose un invasion des états-unis pour se ... | anon_topic_de7347d6 | 1 | 0 | 11 | True |
| 28 | anon_msg_175572b594d7 | anon_user_972dc0a264 | Je nie tout ça car je suis expert en escort as... | anon_topic_422f0859 | 1 | 0 | 0 | True |
| 33 | anon_msg_3885c2832dab | anon_user_2ad71e016a | Sortir un couteau et menacer ouvertement un fl... | anon_topic_f3ec6e6d | 1 | 0 | 13 | True |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 41816 | anon_msg_0299fab1f230 | anon_user_485ca4b2c7 | This\n+ Les italiens massacrent les croissants... | anon_topic_53c6f06b | 0 | 0 | 1 | True |
| 41821 | anon_msg_43a9606c9e21 | anon_user_2f87740825 | Non mais nous khey ont peu pas les laisser dan... | anon_topic_d38c2c2a | 1 | 0 | 1 | True |
| 41845 | anon_msg_e24abf1cd5f6 | anon_user_fb511cc977 | la Russie a tout intérêt à s'allier avec les m... | anon_topic_d28eab06 | 1 | 0 | 12 | True |
| 41851 | anon_msg_716f80784555 | anon_user_305b6c4214 | Putain t'as du faire des trucs d'homosexuelle ... | anon_topic_a492a8b1 | 1 | 1 | 13 | True |
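
To make the concatenate-then-filter step concrete, here is a minimal sketch on two toy frames. The `msg_id` values are invented, and it assumes `llama_prediction` is read back as booleans, as the `True` values in the output suggest.

```python
import pandas as pd

# Toy stand-ins for two pre-filtered CSV chunks (values are invented).
chunk_a = pd.DataFrame({"msg_id": ["a1", "a2"], "llama_prediction": [True, False]})
chunk_b = pd.DataFrame({"msg_id": ["b1", "b2"], "llama_prediction": [False, True]})

# Stack the chunks and renumber the rows 0..n-1.
toy = pd.concat([chunk_a, chunk_b], ignore_index=True)

# Casting to int maps True -> 1 and False -> 0, so `== 1` keeps the true rows.
flagged = toy[toy["llama_prediction"].astype(int) == 1]
flagged_ids = flagged["msg_id"].tolist()
flagged_labels = flagged.index.tolist()

print(flagged_ids, flagged_labels)
```

The filtered frame keeps the row labels it had after concatenation (0 and 3 here), so the labels are no longer contiguous. Calling `reset_index(drop=True)` would renumber them if that matters downstream.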


In [6]:
# Print five random messages (fixed seed) for a quick manual check.
for _, row in df.sample(5, random_state=42).iterrows():
    console.print(f"Text: {row['content']}")
    console.print("-" * 40)

## Save

In [8]:
for path in (output_csv_1, output_csv_2):
    path.parent.mkdir(parents=True, exist_ok=True)  # to_csv does not create missing folders
    df.to_csv(path, index=False)
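
The `index=False` argument matters when the file is read back. The sketch below, on an invented toy frame written to a temporary directory, shows that saving with the default index adds a column that `pd.read_csv` names `Unnamed: 0`. That is a likely origin of the `Unnamed: 0` column in the loaded data, though the notebooks that produced those files are not shown here.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy frame with invented values.
toy = pd.DataFrame({"msg_id": ["a1", "b2"], "llama_prediction": [True, True]})

with tempfile.TemporaryDirectory() as tmp:
    with_index = Path(tmp) / "with_index.csv"
    without_index = Path(tmp) / "without_index.csv"

    toy.to_csv(with_index)                  # default index=True also writes the row labels
    toy.to_csv(without_index, index=False)  # writes only the data columns

    cols_with_index = pd.read_csv(with_index).columns.tolist()
    cols_without_index = pd.read_csv(without_index).columns.tolist()

print(cols_with_index)
print(cols_without_index)
```

If the extra column is not wanted in the saved batch, one option is to drop it before saving, for example with `df.drop(columns="Unnamed: 0")`.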