# Feature Engineering

While my primary models will be transformer-based (which learn features from text directly), I also engineer some intuitive features for analysis and potential baseline models:
- **Lexical Diversity:** ratio of unique words to total words (higher = more varied vocabulary).
- **Readability and Sentiment:** already computed during EDA; these scores can be reused as features as well.
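
As a rough illustration of the lexical diversity ratio described above, here is a minimal sketch. It is not the implementation in `utils.features`; the function name `lexical_diversity` and the regex-based tokenisation are assumptions made for this example.

```python
import re


def lexical_diversity(text: str) -> float:
    """Return unique words / total words, or 0.0 for text with no words."""
    # Assumption: lowercase the text and treat runs of letters/apostrophes as words.
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)


# "the" appears twice, so 5 unique words out of 6 total.
sample_score = lexical_diversity("the cat sat on the mat")
print(round(sample_score, 3))
```

Note that this ratio tends to fall as texts get longer, so it is most comparable between entries of similar length.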

These features help me interpret the data and also let me try a classic ML approach (e.g., Random Forest) to see how it performs relative to the transformers. In this notebook, I will:
1. Compute the lexical diversity for each entry.
2. (Optional) Train a simple classifier on these features + TF-IDF as a comparison.
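
Step 2 could look roughly like the sketch below, which stacks a TF-IDF matrix with one extra numeric column and fits a Random Forest. The toy texts, labels, and lexical diversity values are made up for illustration, and `build_matrix` is a hypothetical helper; the real notebook would draw these from `train_df` and `val_df`, whose column names are not shown here.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-in data (assumption): label 0 = human-written, 1 = AI-generated.
train_texts = [
    "the weather was lovely so we walked along the river",
    "in conclusion the results demonstrate significant improvements overall",
    "my dog chewed the couch again and I am not even mad",
    "furthermore it is important to note that the findings are robust",
]
train_labels = [0, 1, 0, 1]
train_lexdiv = np.array([[0.90], [1.00], [0.92], [1.00]])  # one engineered feature

val_texts = ["we walked the dog along the river", "it is important to note the results"]
val_lexdiv = np.array([[0.86], [1.00]])


def build_matrix(vectorizer, texts, extra, fit=False):
    """Stack TF-IDF columns with dense engineered features into one sparse matrix."""
    tfidf = vectorizer.fit_transform(texts) if fit else vectorizer.transform(texts)
    return hstack([tfidf, csr_matrix(extra)]).tocsr()


vectorizer = TfidfVectorizer(max_features=5000)
X_train = build_matrix(vectorizer, train_texts, train_lexdiv, fit=True)
X_val = build_matrix(vectorizer, val_texts, val_lexdiv)  # reuse the fitted vocabulary

clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(X_train, train_labels)
preds = clf.predict(X_val)
print(X_train.shape, preds.shape)
```

Fitting the vectorizer on the training split only, then reusing it for validation, keeps validation vocabulary from leaking into the features.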

Let's begin by loading the datasets.


In [2]:
# Move up one level to the project root so that relative imports and data paths work
import os
os.chdir(os.path.abspath(os.path.join(os.getcwd(), "..")))
print("new cwd:", os.getcwd())


new cwd: c:\Testing\Final_Year_Project\AI-Text-Detection-Tool


In [3]:
import pandas as pd
from utils import features

# Load splits
train_df = pd.read_parquet("data/train_dataset.parquet")
val_df   = pd.read_parquet("data/val_dataset.parquet")
test_df  = pd.read_parquet("data/test_dataset.parquet")
print(f"Train: {len(train_df)}, Val: {len(val_df)}, Test: {len(test_df)}")


Train: 309520, Val: 38690, Test: 38691
