This repository contains the full pipeline for the Subtask C boundary detection task, framed as windowed text classification. The code is written entirely in Python and builds several classical ML baselines plus ensemble variants.
Install dependencies:
uv sync
source .venv/bin/activate
Train and evaluate all models (note: this takes quite a while):
python main.py
- External code used: None

- main.py: Entry point that loads data, builds models, and writes evaluation results.
- data.py: JSONL iterators, windowing, and train/dev splits.
- preprocessing.py: Text cleaning pipeline (mixins).
- docs/preprocessing_and_jsoniterator.md: Detailed usage notes for preprocessing and JSONL iteration.
- models/: Model builders, feature pipelines, and ensemble wrappers.
- evaluation.py: Macro F1 evaluation and report export.
- stats.py: Dataset stats and label distribution summary.
- optimiser.py: Generic grid search helper.
- docs/pipeline_structures.md: Pipeline and feature-union architecture details.
The dataset lives in data/ as JSONL files. Each line has:
- text: the full document string
- label: an integer index indicating the boundary between human and machine segments
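As a rough illustration of this record format, the sketch below reads `text` and `label` from a JSONL stream. The function name `iter_records` and the two sample records are made up for this example; the project's real iterators live in data.py and may behave differently.

```python
import io
import json
from typing import Iterator, TextIO, Tuple


def iter_records(handle: TextIO) -> Iterator[Tuple[str, int]]:
    """Yield (text, label) pairs from a JSONL stream, skipping blank lines."""
    for line in handle:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        yield record["text"], int(record["label"])


# Two invented records standing in for a file under data/.
sample = io.StringIO(
    '{"text": "I wrote this part and the model wrote this part", "label": 5}\n'
    '\n'
    '{"text": "entirely generated text", "label": 0}\n'
)
records = list(iter_records(sample))
```

Reading lazily, one line at a time, keeps memory use flat even when the JSONL files are large.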
Windowing converts each document into overlapping word windows. For a given window size, each window is labeled according to its index:
- 0 if the index is before the boundary
- 1 if the index is at or after the boundary
See data.py for exact behavior.
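The labeling rule can be sketched as follows. This is a minimal illustration, not the code in data.py: the function name `make_windows` is hypothetical, and the sketch assumes that windows advance one word at a time, that a window is identified by the index of its first word, and that the relative position is that index divided by the document length in words.

```python
from typing import List, Tuple


def make_windows(text: str, boundary: int, size: int) -> List[Tuple[str, float, int]]:
    """Return (window_text, position, label) triples for one document.

    Assumptions for this sketch: stride of one word, a window is indexed by
    its first word, and position is that index divided by the word count.
    """
    words = text.split()
    # A document shorter than the window still yields a single window.
    n_windows = max(len(words) - size + 1, 1)
    examples = []
    for start in range(n_windows):
        window_text = " ".join(words[start:start + size])
        position = start / len(words) if words else 0.0
        label = 0 if start < boundary else 1  # 1 at or after the boundary
        examples.append((window_text, position, label))
    return examples


doc = "one two three four five six"
windows = make_windows(doc, boundary=2, size=4)
# Labels for the three windows starting at words 0, 1, 2: [0, 0, 1]
```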
The preprocessing pipeline is a mixin-based class in preprocessing.py. Active steps:
- Unicode NFKD normalization + remove diacritics
- Remove punctuation (`! ? , .`)
- Lowercase tokens unless the alphabetic part is all caps
Whitespace-altering mixins also exist, but we ended up not using them: changing the whitespace shifts word indices, which would have forced us to implement a label offset counter to keep the labels aligned.
Detailed usage examples are in docs/preprocessing_and_jsoniterator.md.
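The three active steps could look roughly like the sketch below. It uses plain functions rather than the mixin classes in preprocessing.py, and the names `strip_diacritics`, `clean_token`, and `preprocess` are invented for this example. Tokens are split and rejoined on single spaces so the token count stays the same, which is the property that keeps the boundary labels valid.

```python
import unicodedata

PUNCTUATION = "!?,."


def strip_diacritics(text: str) -> str:
    """NFKD-normalize, then drop combining marks (the detached accents)."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def clean_token(token: str) -> str:
    """Remove the listed punctuation; lowercase unless the letters are all caps."""
    token = "".join(ch for ch in token if ch not in PUNCTUATION)
    letters = "".join(ch for ch in token if ch.isalpha())
    if letters and letters.isupper():
        return token  # keep tokens such as "NASA" as written
    return token.lower()


def preprocess(text: str) -> str:
    """Clean each space-separated token and rejoin with the same separator."""
    tokens = strip_diacritics(text).split(" ")
    return " ".join(clean_token(tok) for tok in tokens)


cleaned = preprocess("Café visits, said NASA. Hello!")
# cleaned == "cafe visits said NASA hello"
```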
Each training example is a tuple (window_text, position), where position is the relative index of the window within the document.
Two feature pipelines are used:
- Simple features: word TF-IDF with n-grams (1,2), `min_df=3`, `sublinear_tf=True`.
- Complex features: word TF-IDF with n-grams (1,2) and char TF-IDF with n-grams (3,5), using the same `min_df=3`, `sublinear_tf=True` settings, plus the numeric position feature.
Feature definitions are in models/model_utils.py.
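One way to assemble the complex feature set with scikit-learn is a `FeatureUnion` over the (window_text, position) tuples. The sketch below is an assumption about how such a union might be wired, not a copy of models/model_utils.py; the helper names `get_text`, `get_position`, and `build_complex_features` are hypothetical, and the four toy rows exist only so that some terms reach `min_df=3`.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion, make_pipeline
from sklearn.preprocessing import FunctionTransformer


def get_text(rows):
    """Pull the window text out of (window_text, position) tuples."""
    return [text for text, _ in rows]


def get_position(rows):
    """Return the position values as a single-column 2-D array."""
    return np.array([[position] for _, position in rows], dtype=float)


def build_complex_features(min_df: int = 3) -> FeatureUnion:
    """Word TF-IDF (1,2) + char TF-IDF (3,5) + numeric position."""
    shared = {"min_df": min_df, "sublinear_tf": True}
    return FeatureUnion([
        ("word", make_pipeline(
            FunctionTransformer(get_text),
            TfidfVectorizer(analyzer="word", ngram_range=(1, 2), **shared))),
        ("char", make_pipeline(
            FunctionTransformer(get_text),
            TfidfVectorizer(analyzer="char", ngram_range=(3, 5), **shared))),
        ("position", FunctionTransformer(get_position)),
    ])


# Toy rows: "the", "cat", and "the cat" each occur in at least three windows.
rows = [("the cat sat", 0.0), ("the cat ran", 0.25),
        ("the cat hid", 0.5), ("the dog sat", 0.75)]
features = build_complex_features()
matrix = features.fit_transform(rows)  # sparse; the last column is position
```

The position column is appended unscaled here; whether it should be scaled relative to the TF-IDF values is a design choice this sketch does not settle.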
Base models are trained via scikit-learn pipelines in simple_models.py and complex_models.py:
- Logistic Regression (balanced)
- Linear SVM
- Naive Bayes
- Random Forest
- Dummy baseline
Ensembles:
- Voting classifier (hard voting)
- Stacking classifier (logistic regression meta-learner)
All model builders accept (X_train, y_train) and return a trained estimator.
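A builder following this contract might look like the sketch below, which fits a hard-voting ensemble on the simple word TF-IDF features. The function name `build_voting_model`, the choice of three voters, and `max_iter=1000` are assumptions made for illustration; the actual builders in models/ may combine different estimators and settings.

```python
from sklearn.ensemble import VotingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.svm import LinearSVC


def build_voting_model(X_train, y_train, min_df: int = 3) -> Pipeline:
    """Fit a hard-voting ensemble on window texts and return the trained pipeline."""
    voters = VotingClassifier(
        estimators=[
            ("logreg", LogisticRegression(class_weight="balanced", max_iter=1000)),
            ("svm", LinearSVC()),
            ("nb", MultinomialNB()),
        ],
        voting="hard",  # each voter casts one vote; the majority label wins
    )
    features = TfidfVectorizer(ngram_range=(1, 2), min_df=min_df, sublinear_tf=True)
    model = make_pipeline(features, voters)
    return model.fit(X_train, y_train)


# Tiny invented training set, just large enough for min_df=3 to keep some terms.
X_train = ["human text here", "human words here", "human prose here",
           "machine text there", "machine words there", "machine prose there"]
y_train = [0, 0, 0, 1, 1, 1]
model = build_voting_model(X_train, y_train)
predictions = model.predict(X_train)
```

Returning the fitted pipeline keeps vectorization and classification together, so callers can pass raw window text straight to `predict`.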
Evaluation uses macro F1 and full classification reports. Results are appended to the result files in the evaluations/ directory by evaluation.py.
The best model is selected by test macro F1. Based on the current results file, the top test score is:
- Complex Linear SVM: macro F1 = 0.9301, with window size 4
See the files in evaluations/ for the full evaluation results.
Descriptive statistics and label distribution are generated by stats.py and saved to stats.md. The current label distribution for window size 4 is ~29.60% label 0 and ~70.40% label 1. This indicates a noticeable class imbalance (label 1 dominates), which is why macro F1 is emphasized in evaluation and several models use class weighting.
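To see why macro F1 is the safer metric under this imbalance, consider a classifier that always predicts label 1 on a roughly 30/70 split. The sketch below computes accuracy and macro F1 by hand on ten invented labels; the function names are hypothetical, and evaluation.py may well rely on scikit-learn's own metrics instead.

```python
from typing import Sequence


def f1_for_class(y_true: Sequence[int], y_pred: Sequence[int], cls: int) -> float:
    """F1 for one class, computed as 2*TP / (2*TP + FP + FN)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == cls and p == cls)
    fp = sum(1 for t, p in pairs if t != cls and p == cls)
    fn = sum(1 for t, p in pairs if t == cls and p != cls)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Unweighted mean of the per-class F1 scores."""
    classes = sorted(set(y_true) | set(y_pred))
    return sum(f1_for_class(y_true, y_pred, c) for c in classes) / len(classes)


y_true = [0] * 3 + [1] * 7   # roughly the 30/70 split reported above
always_one = [1] * 10        # a classifier that ignores its input
accuracy = sum(t == p for t, p in zip(y_true, always_one)) / len(y_true)
score = macro_f1(y_true, always_one)
# accuracy is 0.70, but macro F1 is only about 0.41 (0 for label 0, 14/17 for label 1)
```

Because each class contributes equally to the average, a model cannot score well on macro F1 by favoring the majority label.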