# Detecting AI-Generated Text: A Comprehensive Analysis and Model Development

## Executive Summary and Key Findings

In this project, I develop a multi-class text classifier to distinguish between **human-written**, **AI-paraphrased**, and **AI-generated** long-form content. This notebook (and subsequent ones) documents the entire process, from data curation to model evaluation. Key findings include:

- **High Classification Performance:** My best model (RoBERTa-base) achieves about **91.1% overall accuracy** and **0.910 macro F1** on a held-out test set. Human-written text is identified with ~98.3% F1, while AI-generated and AI-paraphrased texts achieve ~88.3% and ~86.5% F1 respectively.
- **Human vs AI Text Differences:** Exploratory analysis shows human-written articles tend to be longer (median ~255 words) than AI-generated or paraphrased ones (median ~184–211 words). Readability metrics and sentiment analysis reveal subtle differences: human text often has slightly more varied sentiment and higher complexity, whereas AI-generated content is relatively more neutral.
- **Confusion Patterns:** The classifier most often confuses **AI-paraphrased** and **AI-generated** texts with each other, while human text is rarely misclassified (>99% recall).
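
As a quick consistency check on the figures above: macro F1 is the unweighted mean of the per-class F1 scores. The sketch below recomputes it from the rounded per-class values quoted in the summary. The dictionary keys are illustrative labels, not names taken from the evaluation code.

```python
# Per-class F1 scores as quoted in the summary (rounded values).
per_class_f1 = {
    "human_written": 0.983,
    "ai_generated": 0.883,
    "ai_paraphrased": 0.865,
}

# Macro F1 is the unweighted mean over classes, so each class counts equally.
macro_f1 = sum(per_class_f1.values()) / len(per_class_f1)
print(f"macro F1 = {macro_f1:.3f}")  # prints: macro F1 = 0.910
```

Because the three classes are balanced, macro F1 and the support-weighted F1 should be close to each other here.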

## Introduction

Advances in generative AI have made it possible for algorithms to produce human-like text, raising concerns in domains like journalism and education about authenticity and plagiarism. **Detecting AI-generated text** has thus become crucial for maintaining academic integrity and trust in written content.

I approach the problem as a three-class classification:
1. **Human-written** text (authored entirely by people),  
2. **AI-generated** text (produced by language models like GPT-4 or DeepSeek without human edits),  
3. **AI-paraphrased** text (human-written content that has been rephrased by AI, or AI text lightly edited by humans).

My dataset is balanced (128,967 samples per class, roughly 387k in total) and is derived from a large news corpus for the human-written text, with the AI variants created using a locally hosted DeepSeek model. I experiment with transformer-based models (BERT, RoBERTa, Longformer) to develop and evaluate my detector. This notebook begins with the data preparation steps.


### Data Loading

First, I load the raw dataset of articles. The data is expected as a CSV file (`final_dataset.csv`) with one text column per class (`human_written`, `ai_paraphrased`, `ai_generated`). I use my `data_utils.load_raw_data` function, which reads the file path from my configuration.


In [1]:
# move up one level so that open("config.yaml") works
import os
os.chdir(os.path.abspath(os.path.join(os.getcwd(), "..")))
print("new cwd:", os.getcwd())


new cwd: C:\Testing\Final_Year_Project\AI-Text-Detection-Tool


In [2]:
import os
import sys

import pandas as pd

# Cell 1 already changed the working directory to the project root,
# so add the current directory to Python's module search path
sys.path.insert(0, os.getcwd())
from utils import data_utils

# Load raw data
raw_df = data_utils.load_raw_data()
print(f"Loaded raw dataset with {raw_df.shape[0]} entries and columns: {list(raw_df.columns)}")
raw_df.head(3)


[data_utils] Loaded raw data: 128967 records, 3 columns.
Loaded raw dataset with 128967 entries and columns: ['human_written', 'ai_paraphrased', 'ai_generated']


|   | human_written | ai_paraphrased | ai_generated |
|---|---|---|---|
| 0 | LONDON (Reuters) - Italy's 10-year government ... | London (Reuters) Italys 10-year government bo... | PIMCO has expressed concerns about the risk co... |
| 1 | The Yankees vs Tigers brawl was so crazy ... t... | One night, when the 83-year-old Larry King, wh... | The Yankees and Tigers had a heated baseball b... |
| 2 | Meet Otto Von Schirach. He's been DJing since ... | Heres a fresh take on the article with nearly ... | Otto Von Schirach: A DJ Who Meets Fashion\n\n... |


### Data Preparation

Next, I ensure the dataset is in a standard format and clean the text. I will:
1. **Flatten** the dataset into `text` and `label` columns using my `flatten_dataset` function.
2. **Clean** the text strings (lowercase, remove extra whitespace) with `clean_text`.

This prepares the data for downstream analysis and modeling.


In [4]:
from utils import data_utils, text_cleaner
from utils.data_utils import config

# 1) Load the raw DataFrame (the three-column CSV, one column per class)
raw_df = data_utils.load_raw_data()

# 2) Flatten into the two‑column format (text + label)
df = data_utils.flatten_dataset(raw_df)

# 3) Clean the text strings in place
df['text'] = df['text'].apply(lambda t: text_cleaner.clean_text(t, lemmatize=False))
df.to_parquet(config['paths']['cleaned_data'], index=False)
print("Saved full cleaned dataset for EDA at", config['paths']['cleaned_data'])

# 4) Inspect a sample and label distribution
print("Sample cleaned text:")
print(df.loc[0, 'text'][:100] + "...")
print("Labels distribution:", df['label'].value_counts().to_dict())



[data_utils] Loaded raw data: 128967 records, 3 columns.
[data_utils] Flattened dataset: 386901 records with columns ['text', 'label']
Saved full cleaned dataset for EDA at data/cleaned_dataset.parquet
Sample cleaned text:
london (reuters) - italy's 10-year government bond yields currently do not adequately compensate inv...
Labels distribution: {'human_written': 128967, 'ai_paraphrased': 128967, 'ai_generated': 128967}
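
The helpers `flatten_dataset` and `clean_text` live in my `utils` package and are not shown in this notebook. The following is a minimal sketch of what the two steps might look like, assuming the flattening is a plain wide-to-long reshape and the cleaning only lowercases and collapses whitespace. The function names with the `_sketch` suffix and the toy rows are hypothetical and are not the project's actual code or data.

```python
import re

import pandas as pd

LABELS = ["human_written", "ai_paraphrased", "ai_generated"]


def flatten_dataset_sketch(raw_df: pd.DataFrame) -> pd.DataFrame:
    """Reshape one-column-per-class data into long (text, label) rows."""
    long_df = raw_df[LABELS].melt(var_name="label", value_name="text")
    # Drop rows with missing text and put the columns in (text, label) order.
    return long_df.dropna(subset=["text"])[["text", "label"]].reset_index(drop=True)


def clean_text_sketch(text: str) -> str:
    """Lowercase the text and collapse whitespace runs into single spaces."""
    return re.sub(r"\s+", " ", text.lower()).strip()


# Toy stand-in for the raw three-column frame (made-up strings, not real data).
toy_raw = pd.DataFrame({
    "human_written": ["First   HUMAN article.", "Second human\narticle."],
    "ai_paraphrased": ["First paraphrased article.", "Second   paraphrased article."],
    "ai_generated": ["First generated article.", "Second generated article."],
})

toy_df = flatten_dataset_sketch(toy_raw)
toy_df["text"] = toy_df["text"].apply(clean_text_sketch)

print(toy_df["label"].value_counts().to_dict())
print("rows:", len(toy_df), "expected:", len(LABELS) * len(toy_raw))
```

The same row arithmetic explains the log output above: 3 classes × 128,967 raw rows gives the 386,901 flattened records.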


### Train–Validation–Test Split

Now I split the cleaned dataset into train, validation, and test subsets (80/10/10) with stratification to preserve class balance. I then save each split to disk as Parquet files for reuse in later notebooks.


In [None]:
import sys
!{sys.executable} -m pip install pyarrow


In [None]:
import yaml

# Load config for file paths
with open("config.yaml", "r") as f:
    config = yaml.safe_load(f)

# Split into train, validation, and test sets
train_df, val_df, test_df = data_utils.train_val_test_split(
    df,
    val_fraction=0.1,  # 10% validation, matching the 80/10/10 split
    test_fraction=0.1,
    random_state=42
)
print(f"Split sizes → Train: {len(train_df)}, Val: {len(val_df)}, Test: {len(test_df)}")

# Save the splits
train_df.to_parquet(config['paths']['train_data'], index=False, engine='pyarrow')
val_df.to_parquet(config['paths']['val_data'], index=False, engine='pyarrow')
test_df.to_parquet(config['paths']['test_data'], index=False, engine='pyarrow')
print("Train/Val/Test saved to disk.")
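
The `train_val_test_split` helper is defined in my `utils` package and is not shown here. As a rough illustration of a stratified 80/10/10 split, the sketch below samples each label separately with pandas. The function name, its parameters, and the toy data are assumptions made for this example, and the real helper may be implemented differently (for instance with scikit-learn).

```python
import pandas as pd


def stratified_split_sketch(df, val_fraction=0.1, test_fraction=0.1,
                            label_col="label", random_state=42):
    """Split df into train/val/test while preserving the label proportions.

    Assumes df has a unique index, since rows are removed by index label.
    """
    # Draw the test rows class by class, then remove them from the pool.
    test_df = df.groupby(label_col).sample(frac=test_fraction, random_state=random_state)
    remaining = df.drop(test_df.index)

    # Rescale so that validation is val_fraction of the original data,
    # not of the rows left after the test split.
    val_frac_of_remaining = val_fraction / (1.0 - test_fraction)
    val_df = remaining.groupby(label_col).sample(
        frac=val_frac_of_remaining, random_state=random_state
    )
    train_df = remaining.drop(val_df.index)
    return train_df, val_df, test_df


# Toy balanced dataset: 100 rows per class (made-up text, not real data).
labels = ["human_written", "ai_paraphrased", "ai_generated"]
toy = pd.DataFrame({
    "text": [f"doc {i}" for i in range(300)],
    "label": [labels[i % 3] for i in range(300)],
})

toy_train, toy_val, toy_test = stratified_split_sketch(toy)
print(f"Train: {len(toy_train)}, Val: {len(toy_val)}, Test: {len(toy_test)}")
print(toy_test["label"].value_counts().to_dict())
```

Fixing `random_state` keeps the splits reproducible across notebook runs, which matters because later notebooks reload the saved Parquet files.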
