# Preprocessing pipeline
This notebook loads the four raw CSVs, inspects them, normalizes each to a standard `text,label` schema, cleans the text, deduplicates, splits the result, and writes `train/val/test` CSVs to `data/processed/`.

It is written to be runnable step-by-step with checks and explanatory notes so you can review intermediate outputs.

## 1) Setup and quick inspection
Load the libraries used throughout the notebook. `deque` is imported for quick head/tail/total checks that stream a file's tail instead of loading large files in full.
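
The cells in this notebook do not define the head/tail/total helper themselves, so here is a minimal sketch of what one could look like. The name `peek_csv` and its parameters are assumptions made for illustration. It reads the file once with the standard `csv` module and keeps only the last `n` rows in a bounded `deque`.

```python
import csv
import os
import tempfile
from collections import deque
from itertools import islice


def peek_csv(path, n=5, encoding="utf-8"):
    """Return (header, first n rows, last n rows, total data rows) in one pass."""
    with open(path, newline="", encoding=encoding) as handle:
        reader = csv.reader(handle)
        header = next(reader, None)
        head = list(islice(reader, n))
        # Bounded buffer: only the most recent n rows are kept in memory
        tail = deque(head, maxlen=n)
        total = len(head)
        for row in reader:
            tail.append(row)
            total += 1
    return header, head, list(tail), total


# Demo on a throwaway file (hypothetical data, not one of the raw CSVs)
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.csv")
    with open(demo_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["text", "label"])
        writer.writerows([[f"row {i}", i % 2] for i in range(10)])
    header, head, tail, total = peek_csv(demo_path, n=2)

print(header, head, tail, total)
```

Because the tail buffer has a fixed size, memory use does not grow with the file, which is the point of streaming the tail.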

In [3]:
import re
import html
import os
from pathlib import Path
from collections import deque
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from tqdm.auto import tqdm

print('pandas', pd.__version__)
print("Python packages imported: pandas, numpy, sklearn, tqdm, re, html, os, pathlib, collections ")

pandas 2.3.3
Python packages imported: pandas, numpy, sklearn, tqdm, re, html, os, pathlib, collections 




## Load and Inspect Data Files
Below we load each CSV into a DataFrame and show:
- first few rows (.head())
- column names
- total rows

This helps us decide how to standardize columns.


In [6]:
# Load HC3
hc3_path = "../data/raw/hc3.csv"
assert os.path.exists(hc3_path), f"{hc3_path} not found"

df_hc3 = pd.read_csv(hc3_path)
print("HC3 shape:", df_hc3.shape)
print("HC3 columns:", df_hc3.columns.tolist())
display(df_hc3.head(5))


HC3 shape: (24322, 5)
HC3 columns: ['question', 'human_answers', 'chatgpt_answers', 'index', 'source']


Unnamed: 0,question,human_answers,chatgpt_answers,index,source
0,"Why is every book I hear about a "" NY Times # ...","['Basically there are many categories of "" Bes...",['There are many different best seller lists t...,,reddit_eli5
1,"If salt is so bad for cars , why do we use it ...",['salt is good for not dying in car crashes an...,"[""Salt is used on roads to help melt ice and s...",,reddit_eli5
2,Why do we still have SD TV channels when HD lo...,"[""The way it works is that old TV stations got...","[""There are a few reasons why we still have SD...",,reddit_eli5
3,Why has nobody assassinated Kim Jong - un He i...,"[""You ca n't just go around assassinating the ...",['It is generally not acceptable or ethical to...,,reddit_eli5
4,How was airplane technology able to advance so...,['Wanting to kill the shit out of Germans driv...,['After the Wright Brothers made the first pow...,,reddit_eli5


In [7]:
# Load gpt_generated
gpt_path = "../data/raw/gpt_generated.csv"
assert os.path.exists(gpt_path), f"{gpt_path} not found"

df_gpt = pd.read_csv(gpt_path)
print("gpt_generated shape:", df_gpt.shape)
print("gpt_generated columns:", df_gpt.columns.tolist())
display(df_gpt.head(5))


gpt_generated shape: (1392522, 3)
gpt_generated columns: ['source', 'id', 'text']


Unnamed: 0,source,id,text
0,human,0,12 Years a Slave: An Analysis of the Film Essa...
1,human,1,20+ Social Media Post Ideas to Radically Simpl...
2,human,2,2022 Russian Invasion of Ukraine in Global Med...
3,human,3,533 U.S. 27 (2001) Kyllo v. United States: The...
4,human,4,A Charles Schwab Corporation Case Essay\n\nCha...


In [8]:
# Load Kaggle AI vs Human
kaggle_path = "../data/raw/kaggle_ai_human.csv"
assert os.path.exists(kaggle_path), f"{kaggle_path} not found"

df_kaggle = pd.read_csv(kaggle_path)
print("kaggle_ai_human shape:", df_kaggle.shape)
print("kaggle_ai_human columns:", df_kaggle.columns.tolist())
display(df_kaggle.head(5))


kaggle_ai_human shape: (487235, 2)
kaggle_ai_human columns: ['text', 'generated']


Unnamed: 0,text,generated
0,Cars. Cars have been around since they became ...,0.0
1,Transportation is a large necessity in most co...,0.0
2,"""America's love affair with it's vehicles seem...",0.0
3,How often do you ride in a car? Do you drive a...,0.0
4,Cars are a wonderful thing. They are perhaps o...,0.0


In [9]:
# Load routellm_gpt4_dataset
route_path = "../data/raw/routellm_gpt4_dataset.csv"
assert os.path.exists(route_path), f"{route_path} not found"

df_route = pd.read_csv(route_path)
print("routellm_gpt4_dataset shape:", df_route.shape)
print("routellm_gpt4_dataset columns:", df_route.columns.tolist())
display(df_route.head(5))


routellm_gpt4_dataset shape: (109101, 5)
routellm_gpt4_dataset columns: ['prompt', 'source', 'gpt4_response', 'mixtral_response', 'mixtral_score']


Unnamed: 0,prompt,source,gpt4_response,mixtral_response,mixtral_score
0,"I'll give you a review, can you extract the fo...",['lmsys-chat-1m'],"Sure, here's the analysis of the review:\n\n1....",Food aspects and opinion words:\n\n1. Made to ...,4
1,"Answer the following question: Claim: ""Joker m...",['flan_v2_cot'],The answer is no.\nChain of thoughts: Stan Lee...,The answer is no.\n\nChain of thoughts: While ...,5
2,TASK DEFINITION: In this task you will be give...,['flan_v2_niv2'],ZdoublexpropheciesS,"ZdoublexpropheciesS\n\nIn this task, you are a...",5
3,"Detailed Instructions: In this task, you need ...",['flan_v2_niv2'],Yes,"No, 'station' is not the longest word in the s...",5
4,A guy pick up a woman Then he puts her down Ex...,['sharegpt'],This phrase could be interpreted as a joke bec...,This joke is a play on words and relies on the...,5


## Normalization Plan (brief)
We will convert each dataset to a common two-column format:
- `text` (string)
- `label` (int) where `0 = human`, `1 = ai`

Dataset-specific notes:
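- HC3: `human_answers` and `chatgpt_answers` are stored as strings that look like Python lists. We will parse them and explode each answer into its own row.
- gpt_generated: has `source` (human/ai) and `text`; we map `source` to `label`.
- kaggle: has `text` and `generated` (0/1); `generated` becomes `label`.
- routellm: contains `prompt` and multiple AI response columns; we treat `prompt` as human text and each response column as AI text. This is an assumption, since some prompts (e.g. from `flan_v2_niv2`) are templated task instructions.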
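
To make the target schema concrete, the sketch below checks that a processed frame has exactly the `text` and `label` columns and that every label is 0 or 1. `check_schema` is a hypothetical helper, not part of the notebook.

```python
import pandas as pd


def check_schema(df):
    """Raise ValueError unless df has the columns text,label and only 0/1 labels."""
    if list(df.columns) != ["text", "label"]:
        raise ValueError(f"unexpected columns: {list(df.columns)}")
    if df["label"].isna().any():
        raise ValueError("label column contains missing values")
    unexpected = set(df["label"].unique()) - {0, 1}
    if unexpected:
        raise ValueError(f"labels must be 0 or 1, found: {sorted(unexpected)}")
    return df


# Toy frame for illustration only
ok = check_schema(pd.DataFrame({"text": ["a", "b"], "label": [0, 1]}))
print(ok.shape)
```

Calling a check like this on each processed frame before concatenation would surface a bad mapping early instead of after the merge.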


In [10]:
# Text cleaning function
# Compile the patterns once instead of on every call
TAG_PATTERN = re.compile(r"<.*?>")
EMOJI_PATTERN = re.compile("["
    "\U0001F600-\U0001F64F"  # emoticons
    "\U0001F300-\U0001F5FF"  # symbols & pictographs
    "\U0001F680-\U0001F6FF"  # transport & map symbols
    "\U0001F1E0-\U0001F1FF"  # regional indicator symbols (flags)
    "]+", flags=re.UNICODE)

def clean_text(text):
    """
    Basic cleaning: return "" for non-strings, then remove HTML-like tags,
    emojis in four common Unicode ranges, and extra whitespace.
    Everything else is kept, since detectors rely on writing patterns.
    """
    if not isinstance(text, str):
        return ""
    text = TAG_PATTERN.sub("", text)
    text = EMOJI_PATTERN.sub("", text)
    # Normalize whitespace
    return " ".join(text.split())
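
The `html` module is imported in the setup cell but `clean_text` does not use it. If the raw text contains HTML entities such as `&amp;`, one option is to decode them with `html.unescape`. The variant below is a sketch under that assumption; `clean_text_with_entities` is a hypothetical name, and the emoji step is left out to keep it short. The second call shows a limitation of the `<.*?>` tag pattern: it also removes ordinary text that sits between a `<` and a later `>`.

```python
import html
import re

TAG_RE = re.compile(r"<.*?>")


def clean_text_with_entities(text):
    """Strip HTML-like tags, decode HTML entities, and collapse whitespace."""
    if not isinstance(text, str):
        return ""
    text = TAG_RE.sub("", text)
    # Decode entities after tag removal so "&lt;b&gt;" stays as literal text
    text = html.unescape(text)
    return " ".join(text.split())


decoded = clean_text_with_entities("Tom &amp; Jerry <br> are   back")
print(decoded)

# Limitation: "< b and b >" looks like a tag to the pattern and is removed
over_stripped = clean_text_with_entities("if a < b and b > c")
print(over_stripped)
```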


### Process HC3
HC3 often stores `human_answers` and `chatgpt_answers` as Python-style lists serialized to strings.
We will:
- Convert list-strings into real lists if necessary
- Explode both human and chatgpt lists into rows
- Assign labels (0=human, 1=ai)
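
The three steps above can be sketched on a toy frame. The two rows here are made up for illustration; only the column names come from HC3. `parse_list_cell` is a hypothetical stand-in for the parsing helper.

```python
import ast

import pandas as pd


def parse_list_cell(cell):
    """Turn "['a', 'b']" into ['a', 'b']; wrap anything else in a one-item list."""
    if isinstance(cell, str) and cell.strip().startswith("["):
        try:
            parsed = ast.literal_eval(cell)
        except (ValueError, SyntaxError):
            return [cell]
        if isinstance(parsed, list):
            return parsed
    return [cell]


toy = pd.DataFrame({
    "human_answers": ["['h1', 'h2']", "['h3']"],
    "chatgpt_answers": ["['a1']", "not a list"],
})

parts = []
for column, label in [("human_answers", 0), ("chatgpt_answers", 1)]:
    # explode() emits one row per list element
    exploded = toy[column].apply(parse_list_cell).explode()
    parts.append(pd.DataFrame({"text": exploded.to_numpy(), "label": label}))

toy_proc = pd.concat(parts, ignore_index=True)
print(toy_proc)
```

A row whose list holds two answers contributes two rows to the result, which is why the processed HC3 frame has more rows than the raw file.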


In [11]:
import ast

def try_parse_list(x):
    """Parse a string that looks like a Python list; wrap anything else as [x]."""
    if isinstance(x, str) and x.strip().startswith('['):
        try:
            parsed = ast.literal_eval(x)
            if isinstance(parsed, list):
                return parsed
        except Exception:
            # Malformed literal: fall through and keep the raw value
            pass
    return [x]

# human answers: one row per answer
hc3_human = df_hc3[['human_answers']].copy()
hc3_human['human_answers'] = hc3_human['human_answers'].apply(try_parse_list)
hc3_human = hc3_human.explode('human_answers').rename(columns={'human_answers':'text'})
hc3_human['label'] = 0

# ai answers: one row per answer
hc3_ai = df_hc3[['chatgpt_answers']].copy()
hc3_ai['chatgpt_answers'] = hc3_ai['chatgpt_answers'].apply(try_parse_list)
hc3_ai = hc3_ai.explode('chatgpt_answers').rename(columns={'chatgpt_answers':'text'})
hc3_ai['label'] = 1

# Keep only text + label
hc3_proc = pd.concat([hc3_human[['text','label']], hc3_ai[['text','label']]], ignore_index=True)
print("HC3 processed rows:", hc3_proc.shape[0])
display(hc3_proc.head())


HC3 processed rows: 85904


Unnamed: 0,text,label
0,"Basically there are many categories of "" Best ...",0
1,"If you 're hearing about it , it 's because it...",0
2,"One reason is lots of catagories . However , h...",0
3,salt is good for not dying in car crashes and ...,0
4,"In Minnesota and North Dakota , they tend to u...",0


### Process gpt_generated
This file already has `source` (human/ai) and `text`. We'll map `source` to labels.
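
`Series.map` with a dictionary returns `NaN` for any value that is not a key, so a `source` spelled differently from `human` or `ai` would silently end up without a label. The sketch below shows one way to spot such values. The toy values `Human` and `None` are invented; the preview of the real file only shows `human`.

```python
import pandas as pd

# Toy data for illustration; not taken from gpt_generated.csv
toy = pd.DataFrame({
    "source": ["human", "ai", "Human", None],
    "text": ["a", "b", "c", "d"],
})

labels = toy["source"].map({"human": 0, "ai": 1})

# Rows whose source had no entry in the mapping
unmapped = toy.loc[labels.isna(), "source"]
print("unmapped rows:", int(labels.isna().sum()))
print(unmapped.value_counts(dropna=False))
```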


In [12]:
# Work on a copy so the raw frame stays untouched
df_gpt_proc = df_gpt.copy()
# Map source -> label; values other than 'human'/'ai' become NaN
if 'source' in df_gpt_proc.columns:
    df_gpt_proc['label'] = df_gpt_proc['source'].map({'human':0, 'ai':1})
elif 'label' not in df_gpt_proc.columns:
    # Neither a source nor a ready-made label column is available
    raise ValueError("gpt_generated missing label/source column")
# Keep only text and label
df_gpt_proc = df_gpt_proc[['text','label']]
print("gpt_generated processed rows:", df_gpt_proc.shape[0])
display(df_gpt_proc.head())


gpt_generated processed rows: 1392522


Unnamed: 0,text,label
0,12 Years a Slave: An Analysis of the Film Essa...,0
1,20+ Social Media Post Ideas to Radically Simpl...,0
2,2022 Russian Invasion of Ukraine in Global Med...,0
3,533 U.S. 27 (2001) Kyllo v. United States: The...,0
4,A Charles Schwab Corporation Case Essay\n\nCha...,0


### Process kaggle_ai_human
Kaggle file has `text` and `generated` where 0=human, 1=ai. We will rename `generated` to `label`.


In [13]:
df_kaggle_proc = df_kaggle.copy()
# If generated is float (0.0/1.0), cast to int
if 'generated' in df_kaggle_proc.columns:
    df_kaggle_proc['label'] = df_kaggle_proc['generated'].astype(int)
else:
    # try other names
    possible = [c for c in df_kaggle_proc.columns if 'gen' in c.lower()]
    if possible:
        df_kaggle_proc['label'] = df_kaggle_proc[possible[0]].astype(int)
    else:
        raise ValueError("kaggle dataset doesn't have 'generated' column")
df_kaggle_proc = df_kaggle_proc[['text','label']]
print("kaggle processed rows:", df_kaggle_proc.shape[0])
display(df_kaggle_proc.head())


kaggle processed rows: 487235


Unnamed: 0,text,label
0,Cars. Cars have been around since they became ...,0
1,Transportation is a large necessity in most co...,0
2,"""America's love affair with it's vehicles seem...",0
3,How often do you ride in a car? Do you drive a...,0
4,Cars are a wonderful thing. They are perhaps o...,0


### Process routellm_gpt4_dataset
This dataset includes `prompt` (human) and AI responses (`gpt4_response`, `mixtral_response`, ...).
We will:
- use `prompt` as human samples
- extract AI responses as AI samples

Note: empty or missing responses are not filtered in this cell; the length filter applied after concatenation removes them.
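
As an alternative to building one frame per column, the same stacking can be sketched with `DataFrame.melt`, filtering out empty and missing texts in the same step. This is an illustration on made-up rows, not the approach used in the cell below.

```python
import pandas as pd

# Toy rows for illustration; only the column names come from the dataset
toy = pd.DataFrame({
    "prompt": ["p1", "p2"],
    "gpt4_response": ["g1", ""],
    "mixtral_response": ["m1", None],
})

stacked = toy.melt(
    value_vars=["prompt", "gpt4_response", "mixtral_response"],
    var_name="column",
    value_name="text",
)
# Prompts are labeled human (0); every response column is labeled AI (1)
stacked["label"] = (stacked["column"] != "prompt").astype(int)

# Keep only non-empty strings
is_text = stacked["text"].apply(lambda t: isinstance(t, str) and t.strip() != "")
stacked = stacked.loc[is_text, ["text", "label"]].reset_index(drop=True)
print(stacked)
```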


In [14]:
# Build human and ai frames from routellm
route_frames = []

# human prompts
if 'prompt' in df_route.columns:
    df_r_human = df_route[['prompt']].rename(columns={'prompt':'text'})
    df_r_human['label'] = 0
    route_frames.append(df_r_human)

# AI response columns to include; keep only those present in this file
ai_cols = [c for c in ['gpt4_response','mixtral_response','anthropic_response','model_response'] if c in df_route.columns]
for c in ai_cols:
    tmp = df_route[[c]].rename(columns={c:'text'}).copy()
    tmp['label'] = 1
    route_frames.append(tmp)

df_route_proc = pd.concat(route_frames, ignore_index=True)
print("routellm processed rows:", df_route_proc.shape[0])
display(df_route_proc.head())


routellm processed rows: 327303


Unnamed: 0,text,label
0,"I'll give you a review, can you extract the fo...",0
1,"Answer the following question: Claim: ""Joker m...",0
2,TASK DEFINITION: In this task you will be give...,0
3,"Detailed Instructions: In this task, you need ...",0
4,A guy pick up a woman Then he puts her down Ex...,0


## Concatenate all processed datasets
Now we will concatenate the processed frames from HC3, GPT-generated, Kaggle, and RouteLLM into a single DataFrame.
We will then:
- clean text
- drop nulls and very short texts
- deduplicate
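
Deduplicating on `text` alone keeps the first occurrence of each text, whatever its label. If the same text appears once as human and once as AI, one of the two labels is silently dropped. The sketch below, on invented rows, shows one way to list such conflicts before deduplicating.

```python
import pandas as pd

# Toy data for illustration only
toy = pd.DataFrame({
    "text": ["same sentence", "same sentence", "only human", "only ai"],
    "label": [0, 1, 0, 1],
})

# Number of distinct labels attached to each text
labels_per_text = toy.groupby("text")["label"].nunique()
conflicting = labels_per_text[labels_per_text > 1].index.tolist()
print("texts with both labels:", conflicting)

# drop_duplicates keeps the first row for each text by default
deduped = toy.drop_duplicates(subset=["text"])
print(deduped)
```

Whether to drop conflicting texts entirely or keep one label is a modeling decision that this sketch leaves open.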


In [26]:
# Collect all processed frames
frames = [hc3_proc, df_gpt_proc, df_kaggle_proc, df_route_proc]
print("Frames lengths:", [f.shape[0] for f in frames])

full_df = pd.concat(frames, ignore_index=True)
print("Combined rows before cleaning:", full_df.shape[0])

# Basic cleaning
# Truncate to MAX_CHARS to keep the output files manageable
MAX_CHARS = 1500
# astype(str) turns missing values into the string 'nan', which the length filter below removes
full_df['text'] = full_df['text'].astype(str).apply(clean_text).str.slice(0, MAX_CHARS)
# Remove empty or too short texts (< 20 chars)
full_df['text_len'] = full_df['text'].str.len()
full_df = full_df[full_df['text_len'] >= 20].copy()
# Drop duplicate texts (compared after truncation), keeping the first occurrence
before = full_df.shape[0]
full_df.drop_duplicates(subset=['text'], inplace=True)
after = full_df.shape[0]
# 'removed' counts duplicates only; short texts were already filtered out above
print(f"Rows after removing short & duplicates: {after} (removed {before-after})")

# Keep only text and label
full_df = full_df[['text','label']].sample(frac=1, random_state=42).reset_index(drop=True)
print("Final dataset shape:", full_df.shape)
display(full_df.head())


Frames lengths: [85904, 1392522, 487235, 327303]
Combined rows before cleaning: 2292964
Rows after removing short & duplicates: 2170318 (removed 112904)
Final dataset shape: (2170318, 2)


Unnamed: 0,text,label
0,"ime, making bank for minutes, selling my body ...",0
1,Jessica. I am so sorry. I wished we did things...,0
2,2016 Local Elections We want more Kiwis to get...,0
3,The Department of Human Services has created t...,1
4,"``Why am I here,'' I think to myself. I'm stan...",0


## Class balance check
Count the human and AI samples that remain after the merge. If the classes are heavily imbalanced, we may need to handle it (resampling, class weights, or subsampling for training).
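
As a worked example of the class-weight option, the sketch below applies the common "balanced" heuristic, `weight = total / (n_classes * count)`, to the label counts reported in this section. About 65% of the merged samples are human, so the AI class receives the larger weight. How the weights are passed to a model depends on the training setup and is not shown.

```python
# Counts copied from the class balance output of this notebook
counts = {0: 1417533, 1: 752785}  # 0 = human, 1 = ai

total = sum(counts.values())
n_classes = len(counts)

# "Balanced" heuristic: rarer classes get proportionally larger weights
weights = {label: total / (n_classes * count) for label, count in counts.items()}
shares = {label: count / total for label, count in counts.items()}

for label in counts:
    print(f"label {label}: share={shares[label]:.3f}, weight={weights[label]:.3f}")
```

With these weights, each class contributes the same total weight, since `weight * count` equals `total / n_classes` for both.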


In [27]:
full_df['label'].value_counts().rename_axis('label').reset_index(name='counts')


Unnamed: 0,label,counts
0,0,1417533
1,1,752785


## Train / Validation / Test split
We'll use a standard split: 70% train, 15% validation, 15% test.
We will stratify by `label` to preserve class balance.
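
The 70/15/15 split is done in two stages: first 30% is held out, then that part is halved, so each of validation and test receives 0.30 × 0.50 = 0.15 of the data. The arithmetic below reproduces the expected sizes for the 2,170,318 rows of the final dataset, assuming the held-out part is rounded up at each stage (this matches the shapes printed by the split cell).

```python
import math

n_total = 2170318  # rows in the final combined dataset

# Stage 1: hold out 30% (rounded up), train on the rest
n_temp = math.ceil(0.30 * n_total)
n_train = n_total - n_temp

# Stage 2: split the held-out part in half
n_test = math.ceil(0.50 * n_temp)
n_val = n_temp - n_test

print("train:", n_train, "val:", n_val, "test:", n_test)
print("fractions:", round(n_train / n_total, 3), round(n_val / n_total, 3), round(n_test / n_total, 3))
```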


In [28]:
train_df, temp_df = train_test_split(full_df, test_size=0.30, random_state=42, stratify=full_df['label'])
val_df, test_df = train_test_split(temp_df, test_size=0.50, random_state=42, stratify=temp_df['label'])

print("Train:", train_df.shape, "Val:", val_df.shape, "Test:", test_df.shape)


Train: (1519222, 2) Val: (325548, 2) Test: (325548, 2)

Note: the cells numbered `In [20]`, `In [21]`, `In [23]`, and `In [25]` below were executed before the latest run of `In [26]`, so their outputs (2,187,357 rows) are stale. The counts from `In [26]` onward (2,170,318 rows) are the ones that match the saved files.


In [20]:
print(full_df.shape)


(2187357, 2)


In [21]:
print(full_df['text'].nunique())


2187357


In [23]:
print(full_df.head())

                                                text  label
0  eve what he was seeing. Throughout the room wa...      0
1  Title: "The King's Legacy" In the vast, rugged...      1
2  Organisation of the Organisationless: The Ques...      0
3  aised her eyebrow again. “ Registered? We regi...      0
4  There is no doubt that successful people often...      1


In [29]:
print(full_df['text'].str.len().describe())

count    2.170318e+06
mean     1.120267e+03
std      4.371374e+02
min      2.000000e+01
25%      9.540000e+02
50%      1.242000e+03
75%      1.500000e+03
max      1.500000e+03
Name: text, dtype: float64


In [25]:
print(full_df['label'].value_counts())

label
0    1429280
1     758077
Name: count, dtype: int64


In [30]:
# os is already imported in the setup cell
os.makedirs("../data/processed", exist_ok=True)
train_df.to_csv("../data/processed/train.csv", index=False)
val_df.to_csv("../data/processed/val.csv", index=False)
test_df.to_csv("../data/processed/test.csv", index=False)
print("Saved train/val/test in data/processed/")


Saved train/val/test in data/processed/


## Sanity checks
- Print a few positive (AI) and negative (human) examples from train set
- Print label distribution
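
One further check that fits here is confirming that no text appears in more than one split. Since the data is deduplicated on `text` before splitting, the expected overlap is empty. The helper `shared_texts` is hypothetical and is shown on toy lists; with the real frames one would pass columns such as `train_df['text']` and `test_df['text']`.

```python
def shared_texts(left, right):
    """Return the set of texts that occur in both iterables."""
    return set(left) & set(right)


# Toy splits for illustration; "beta text" is a deliberate leak
toy_train = ["alpha text", "beta text", "gamma text"]
toy_val = ["delta text"]
toy_test = ["beta text", "epsilon text"]

leaks = shared_texts(toy_train, toy_test)
print("train/test overlap:", leaks)
print("train/val overlap:", shared_texts(toy_train, toy_val))
```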


In [31]:
print("Train label distribution")
display(train_df['label'].value_counts())

print("\nExample human text:")
display(train_df[train_df['label']==0]['text'].sample(3).tolist())

print("\nExample AI text:")
display(train_df[train_df['label']==1]['text'].sample(3).tolist())


Train label distribution


label
0    992273
1    526949
Name: count, dtype: int64


Example human text:


["reflection had... *something* that I didn't. A quick glance down at my chest determined that to be true ; I had a definite lack of mammaries, which was, of course, normal for a guy. I squinted and leaned forward, and almost thought that the reflection was ever-so-slightly delayed in mimicking me. It must've been some kind of carnival mirror effect, I decided. I snickered, and turned to the side, ``admiring'' the strange mirror. Maybe I was still dreaming. A shift in the reflection caught my attention ; its eyes had *definitely* moved. ``What the...?'' I asked nobody, leaning closer again. A thought popped into my head, and I moved to the edge of the mirror, looking across its surface. Yeah, it *looked* flat. Flat as *my* chest. I stepped back and regarded the reflection once more. Yeah, now that I'd looked closely, I could tell more that was off. Its skin was just a hair lighter. Less hairy arms, for sure. Finer features. Maybe a bit shorter. And a hand to my face (",
 "We were able 


Example AI text:


["You'll need to contact the admissions office at the community college and let them know that you're interested in enrolling in classes. They will be able to provide more information on how to proceed with the application and enrollment process.",
 'In an attempt to increase sales at every level, Walmart launched an ad campaign for Black Friday, but in so doing they unwittingly sparked an uproar that they now can\'t get out of. It all started when Walmart invited the New York City-based Black Youth Project 100 to perform at the company\'s Black Friday 2013, but the group didn\'t like that because the group had protested the US immigration system over the summer in a series of protests. To get around this, Walmart enlisted a PR firm and the group released a statement via Billboard that said that they were only invited to perform at the event, and that the invitation was "invalid." But despite the fact that the invitation was only for a private event (and the fact that Black Lives Matte

## Optional: Save a smaller dev set for quick experiments
Create a small balanced dev set (e.g., 5k samples) for fast debugging on local machines.
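
An equivalent way to draw the same number of rows per class is `groupby(...).sample(...)`, available in pandas 1.1 and later. The sketch below uses a toy frame and is an alternative to the cell that follows, not a replacement for it.

```python
import pandas as pd

# Toy training frame for illustration: 7 human rows, 3 AI rows
toy_train = pd.DataFrame({
    "text": [f"t{i}" for i in range(10)],
    "label": [0] * 7 + [1] * 3,
})

# Cap at 2,500 per class, or fewer if the smaller class has fewer rows
per_class = int(min(toy_train["label"].value_counts().min(), 2500))

dev = (
    toy_train.groupby("label")
    .sample(n=per_class, random_state=42)   # same count from every class
    .sample(frac=1, random_state=42)        # shuffle the combined rows
    .reset_index(drop=True)
)
print(dev["label"].value_counts())
```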


In [33]:
# Create a small balanced sample: at most 2,500 rows per class, fewer if the smaller class has fewer
min_class = min(train_df['label'].value_counts().min(), 2500)
dev_samples = pd.concat([
    train_df[train_df['label']==0].sample(n=min_class, random_state=42),
    train_df[train_df['label']==1].sample(n=min_class, random_state=42)
]).sample(frac=1, random_state=42).reset_index(drop=True)

dev_samples.to_csv("../data/processed/dev_small.csv", index=False)
print("Saved small dev set with shape:", dev_samples.shape)


Saved small dev set with shape: (5000, 2)
