# Creating a batch $D_{\text{filtered}}$ with [Gemini 2.0-flash](https://ai.google.dev/gemini-api/docs/quickstart?hl=fr)'s prefiltering

In [this notebook](gemini_prefiltering.ipynb), we pre-filtered toxic content with Gemini to balance the dataset. Here we build the batch $D_{\text{filtered}}$ from those pre-filtered subsets.

## Libraries

In [1]:
from pathlib import Path
import os
import pandas as pd
from rich.console import Console

console = Console()

## Global variables

In [2]:
ROOT = Path("../..")
DATA_DIR = ROOT / "data"
SUBSETS_FILTERED = DATA_DIR / "pre-filtering"
# Pre-filtered CSV subsets, skipping any checkpoint files
files = [
    f for f in os.listdir(SUBSETS_FILTERED)
    if f.endswith(".csv") and "checkpoint" not in f
]
output_csv_1 = DATA_DIR / "subsets_Di" / "subset_filtered.csv"
output_csv_2 = DATA_DIR / "subsets_Di_annotated" / "subset_filtered_gpt-4o-mini.csv"
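
The file listing above could also be written with `pathlib` alone. The following is a minimal sketch, not the notebook's own code: `list_subset_csvs` is a hypothetical helper, and the file names are made up for the demonstration in a temporary directory.

```python
from pathlib import Path
import tempfile


def list_subset_csvs(directory: Path) -> list[Path]:
    """Return the CSV files in `directory`, sorted, skipping checkpoint files."""
    return sorted(
        p for p in directory.glob("*.csv")
        if "checkpoint" not in p.name
    )


# Demonstration on a throwaway directory with made-up file names.
with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = Path(tmp)
    for name in ("a.csv", "b.csv", "a-checkpoint.csv", "notes.txt"):
        (tmp_dir / name).touch()
    demo_names = [p.name for p in list_subset_csvs(tmp_dir)]

print(demo_names)  # ['a.csv', 'b.csv']
```

Sorting the paths makes the concatenation order independent of the order in which the operating system lists the directory.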

## Load dataset

In [11]:
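# Concatenate all pre-filtered subsets into a single DataFrame
dfs = [pd.read_csv(SUBSETS_FILTERED / file) for file in files]
df = pd.concat(dfs, ignore_index=True)
# Keep only messages flagged by Gemini that were also banned and deleted
df = df[df['gemini_prediction'].astype(int) == 1]
df = df[df['banned'].astype(int) == 1]
df = df[df['deleted'].astype(int) == 1]
df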

| Unnamed: 0 | msg_id | user | content | topic | deleted | banned | hour | gemini_prediction |
|---|---|---|---|---|---|---|---|---|
| 4 | anon_msg_4222a131e996 | anon_user_8c8e6fd0a1 | Bah dans une société saine il ressortent jamai... | anon_topic_6ee71e37 | 1 | 1 | 0 | 1 |
| 24 | anon_msg_68296d1810d2 | anon_user_e414b0c5f8 | Ayaaaaa ce tarax\nLe meme qui disait il y a 5 ... | anon_topic_246f3619 | 1 | 1 | 15 | 1 |
| 46 | anon_msg_05aaf2d18d1f | anon_user_5ca2bf0253 | AYAAAAAAA maintenant le député LFI dénonce le ... | anon_topic_ea0e3e29 | 1 | 1 | 12 | 1 |
| 208 | anon_msg_8598730afbe0 | anon_user_f00e4728d0 | Week-end foot j'ai dit\nL'odeur de ton calbard... | anon_topic_2dc6f771 | 1 | 1 | 13 | 1 |
| 222 | anon_msg_adbef7ac4c23 | anon_user_2be664cd79 | Nettoie la, même en portant plainte tu récupér... | anon_topic_ab747476 | 1 | 1 | 12 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 179895 | anon_msg_1cb0569b7e2e | anon_user_16c349ba60 | NOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO... | anon_topic_50f50f16 | 1 | 1 | 13 | 1 |
| 179916 | anon_msg_3de75eb30c5e | anon_user_8f9fbbdd55 | Ah bah si incelus premier le dit\nOn peut clor... | anon_topic_952a0846 | 1 | 1 | 13 | 1 |
| 179923 | anon_msg_34e07f6e807f | anon_user_86cbeabf14 | Gourvès bonne mère sur la tête à Marius tu vas... | anon_topic_9a400bc8 | 1 | 1 | 0 | 1 |
| 179955 | anon_msg_fcdc9ebf6f9a | anon_user_38cb3e2e7e | On s'en bat les couilles d'être detesté dans u... | anon_topic_da6605c9 | 1 | 1 | 14 | 1 |
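
The three successive filters in the loading cell keep a row only when all three flags equal 1. As a rough illustration, the same selection can be expressed as one combined boolean mask. The toy DataFrame below is invented for the example and only reuses the column names from the table; it is not data from the notebook.

```python
import pandas as pd

# Invented toy data reusing the notebook's column names.
toy = pd.DataFrame({
    "msg_id": ["m1", "m2", "m3", "m4"],
    "deleted": [1, 1, 0, 1],
    "banned": [1, 0, 1, 1],
    "gemini_prediction": [1, 1, 1, 0],
})

flags = ["gemini_prediction", "banned", "deleted"]
# A row is kept only if every flag column equals 1.
mask = (toy[flags].astype(int) == 1).all(axis=1)
kept = toy[mask]

print(kept["msg_id"].tolist())  # ['m1']
```

Chaining the filters one by one, as the notebook does, yields the same rows. The single mask mainly makes the "all three conditions" intent explicit.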


In [12]:
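# Print five random messages (fixed seed) as a quick sanity check
for _, row in df.sample(5, random_state=42).iterrows():
    # markup=False stops rich from treating square brackets in messages as style tags
    console.print(f"Text: {row['content']}", markup=False)
    console.print("-" * 40)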

## Save

In [13]:
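# Write the same filtered batch to both output locations
for path in (output_csv_1, output_csv_2):
    path.parent.mkdir(parents=True, exist_ok=True)  # create the folder if it is missing
    df.to_csv(path, index=False)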
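
The save step passes `index=False`. The sketch below, on an invented two-row DataFrame written to an in-memory buffer rather than to the notebook's files, shows what that option changes when the CSV is read back.

```python
import io

import pandas as pd

# Invented toy data; the real batch has more columns.
toy = pd.DataFrame({"msg_id": ["m1", "m2"], "content": ["a", "b"]})

# to_csv() without a path returns the CSV text as a string.
with_index = pd.read_csv(io.StringIO(toy.to_csv()))
without_index = pd.read_csv(io.StringIO(toy.to_csv(index=False)))

print(list(with_index.columns))     # ['Unnamed: 0', 'msg_id', 'content']
print(list(without_index.columns))  # ['msg_id', 'content']
```

With `index=False`, the row index is not written, so a later `pd.read_csv` does not pick up an extra unnamed column.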