Final Project Code Overview

This repository contains the full pipeline for the subtask C boundary detection task, framed as windowed text classification. The code is written entirely in Python and builds several classical ML baselines plus ensemble variants.

How to Run

Install dependencies:

uv sync
source .venv/bin/activate

Train and evaluate all models (note: this takes quite a while):

python main.py

Attribution

External code used: None

Project Structure

main.py: Entry point that loads data, builds models, and writes evaluation results.
data.py: JSONL iterators, windowing, and train/dev splits.
preprocessing.py: Text cleaning pipeline (mixins).
docs/preprocessing_and_jsoniterator.md: Detailed usage notes for preprocessing and JSONL iteration.
models/: Model builders, feature pipelines, and ensemble wrappers.
evaluation.py: Macro F1 evaluation and report export.
stats.py: Dataset stats and label distribution summary.
optimiser.py: Generic grid search helper.
docs/pipeline_structures.md: Pipeline and feature-union architecture details.

Data and Task Setup

The dataset lives in data/ as JSONL files. Each line has:

text: the full document string
label: an integer index indicating the boundary between human and machine segments
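
As a rough illustration of the record format, the sketch below reads (text, label) pairs from a JSONL stream. The function name iter_jsonl and the sample records are made up for this example; the actual iterators live in data.py and may behave differently.

```python
import io
import json
from typing import Iterator, TextIO, Tuple


def iter_jsonl(handle: TextIO) -> Iterator[Tuple[str, int]]:
    """Yield (text, label) pairs from an open JSONL stream, skipping blank lines."""
    for line in handle:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        yield record["text"], int(record["label"])


# Two made-up records standing in for a file under data/.
sample = io.StringIO(
    '{"text": "a human wrote this and a model continued", "label": 4}\n'
    '{"text": "fully machine text", "label": 0}\n'
)
records = list(iter_jsonl(sample))
```

Passing an open file handle instead of the StringIO object would work the same way, since both are iterated line by line.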

Windowing converts each document into overlapping word windows. For a window size of $n$, each training instance is the center word with its context window around it. Labels are derived from the index position for the center word:

0 if the index is before the boundary
1 if the index is at or after the boundary

See data.py for exact behavior.
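
A minimal sketch of the windowing and labeling rule described above. It assumes whitespace tokenization, a boundary expressed as a word index, and a symmetric window of `size` words on each side of the center word; make_windows is a hypothetical name, and data.py remains the reference for the real behavior.

```python
from typing import List, Tuple


def make_windows(text: str, boundary: int, size: int) -> List[Tuple[str, int]]:
    """Return one (window_text, label) pair per word in the document.

    Assumes whitespace tokenization and a symmetric window of `size` words
    on each side of the center word, truncated at the document edges.
    """
    words = text.split()
    examples = []
    for i in range(len(words)):
        start = max(0, i - size)
        end = min(len(words), i + size + 1)
        label = 0 if i < boundary else 1  # 1 at or after the boundary
        examples.append((" ".join(words[start:end]), label))
    return examples


examples = make_windows("one two three four five", boundary=3, size=1)
# examples[2] == ("two three four", 0)   center word "three" is before the boundary
# examples[3] == ("three four five", 1)  center word "four" is at the boundary
```

Under these assumptions every word yields exactly one training instance, so a document with k words contributes k examples.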

Preprocessing

The preprocessing pipeline is a mixin-based class in preprocessing.py. Active steps:

Unicode NFKD normalization + remove diacritics
Remove punctuation (! ? , .)
Lowercase tokens unless the alphabetic part is all caps

There are also whitespace-altering mixins that we considered using. We ended up leaving them out: because they change whitespace, they would have forced us to implement a label offset counter to keep the boundary labels aligned.

Detailed usage examples are in docs/preprocessing_and_jsoniterator.md.
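
The three active steps can be approximated with plain functions, as sketched below. This is not the mixin-based class from preprocessing.py: the function names are invented for illustration, and the handling of edge cases (for example tokens without letters) is an assumption.

```python
import re
import unicodedata


def strip_diacritics(text: str) -> str:
    """NFKD-normalize and drop combining marks (e.g. 'é' becomes 'e')."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def remove_punctuation(text: str) -> str:
    """Delete the four punctuation characters ! ? , . and nothing else."""
    return re.sub(r"[!?,.]", "", text)


def lowercase_unless_all_caps(token: str) -> str:
    """Lowercase a token unless its alphabetic characters are all uppercase."""
    letters = [ch for ch in token if ch.isalpha()]
    if letters and all(ch.isupper() for ch in letters):
        return token
    return token.lower()


def clean(text: str) -> str:
    text = remove_punctuation(strip_diacritics(text))
    # Splitting on a single space and rejoining leaves the existing spaces in place.
    return " ".join(lowercase_unless_all_caps(tok) for tok in text.split(" "))


cleaned = clean("Café NASA-2 said: Hello, World!")
# cleaned == "cafe NASA-2 said: hello world"
```

Note that "NASA-2" survives unchanged because only its alphabetic part is checked, while the colon stays because it is not one of the four removed characters.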

Features

Each training example is a tuple (window_text, position), where position $\in [0,1]$ is the relative index within the document.

Two feature pipelines are used:

Simple features: word TF-IDF with n-grams (1,2), min_df=3, sublinear_tf=True.
Complex features: word TF-IDF with n-grams (1,2) and char TF-IDF with n-grams (3,5), both with min_df=3, sublinear_tf=True, plus the numeric position feature.

Feature definitions are in models/model_utils.py.
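
One way the complex feature set could be assembled with scikit-learn is sketched below. It is an illustration only: build_complex_features, get_texts, and get_positions are hypothetical names, the choice of analyzer="char" (rather than "char_wb") is an assumption, and the real definitions in models/model_utils.py may be structured differently.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion, make_pipeline
from sklearn.preprocessing import FunctionTransformer


def get_texts(examples):
    """Pull the window text out of each (window_text, position) tuple."""
    return [text for text, _ in examples]


def get_positions(examples):
    """Pull the relative position out as an (n_samples, 1) column."""
    return np.array([[position] for _, position in examples], dtype=float)


def build_complex_features(min_df: int = 3) -> FeatureUnion:
    """Word TF-IDF + char TF-IDF + numeric position, stacked side by side."""
    word = make_pipeline(
        FunctionTransformer(get_texts),
        TfidfVectorizer(ngram_range=(1, 2), min_df=min_df, sublinear_tf=True),
    )
    char = make_pipeline(
        FunctionTransformer(get_texts),
        TfidfVectorizer(
            analyzer="char", ngram_range=(3, 5), min_df=min_df, sublinear_tf=True
        ),
    )
    position = FunctionTransformer(get_positions)
    return FeatureUnion([("word", word), ("char", char), ("position", position)])


# Toy data; min_df=1 is used here only because three examples are too few for min_df=3.
toy = [("the cat sat", 0.0), ("the dog sat", 0.5), ("a model wrote", 1.0)]
X = build_complex_features(min_df=1).fit_transform(toy)
```

The resulting matrix is sparse, with the position feature as its last column. Dropping the "char" and "position" entries would give something close to the simple feature set.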

Models

Base models are trained via scikit-learn pipelines defined in simple_models.py and complex_models.py:

Logistic Regression (balanced)
Linear SVM
Naive Bayes
Random Forest
Dummy baseline

Ensembles:

Voting classifier (hard voting)
Stacking classifier (logistic regression meta-learner)

All model builders accept (X_train, y_train) and return a trained estimator.
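
To make the builder convention concrete, here is a hedged sketch of a builder that takes (X_train, y_train) and returns a fitted hard-voting ensemble. The name build_voting_ensemble, the use of MultinomialNB as the Naive Bayes variant, max_iter=1000, and the choice of ensemble members are all assumptions made for the example, and X_train is simplified to a list of window strings without the position feature.

```python
from sklearn.ensemble import VotingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.svm import LinearSVC


def build_voting_ensemble(X_train, y_train, min_df: int = 3) -> Pipeline:
    """Fit word TF-IDF features followed by a hard-voting ensemble."""
    ensemble = VotingClassifier(
        estimators=[
            ("logreg", LogisticRegression(class_weight="balanced", max_iter=1000)),
            ("svm", LinearSVC()),
            ("nb", MultinomialNB()),
        ],
        voting="hard",  # majority vote over predicted labels
    )
    model = make_pipeline(
        TfidfVectorizer(ngram_range=(1, 2), min_df=min_df, sublinear_tf=True),
        ensemble,
    )
    return model.fit(X_train, y_train)


# Toy data; min_df=1 only because four examples are too few for min_df=3.
X_train = [
    "human wrote this",
    "human text here",
    "machine made output",
    "machine generated output",
]
y_train = [0, 0, 1, 1]
model = build_voting_ensemble(X_train, y_train, min_df=1)
predictions = model.predict(X_train)
```

Hard voting only needs predicted labels, which is why LinearSVC can take part even though it does not expose predict_proba.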

Evaluation

Evaluation uses macro F1 and full classification reports. Results are appended to the result files in the evaluations/ directory by evaluation.py.
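
For readers unfamiliar with the metric, macro F1 is the unweighted mean of the per-class F1 scores. The sketch below computes it by hand on made-up labels; evaluation.py may well compute it differently, for example through scikit-learn.

```python
from typing import Sequence


def f1_for_class(y_true: Sequence[int], y_pred: Sequence[int], cls: int) -> float:
    """F1 for one class, written as 2*TP / (2*TP + FP + FN)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Unweighted mean of per-class F1 over all labels that occur."""
    classes = sorted(set(y_true) | set(y_pred))
    return sum(f1_for_class(y_true, y_pred, c) for c in classes) / len(classes)


# Made-up labels: 3 examples of class 0, 7 of class 1.
y_true = [0, 0, 0, 1, 1, 1, 1, 1, 1, 1]
always_one = [1] * len(y_true)
score = macro_f1(y_true, always_one)
# Accuracy would be 0.70, but macro F1 is (0 + 14/17) / 2, about 0.41.
```

A constant predictor scores zero on the class it never predicts, so macro F1 penalizes it far more heavily than accuracy does.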

The best model is selected by test macro F1. Based on the current results file, the top test score is:

Complex Linear SVM: macro F1 = 0.9301, with window size 4

See the files in evaluations/ for the full evaluation results.

Dataset Statistics

Descriptive statistics and label distribution are generated by stats.py and saved to stats.md. The current label distribution for window size 4 is ~29.60% label 0 and ~70.40% label 1. This indicates a noticeable class imbalance (label 1 dominates), which is why macro F1 is emphasized in evaluation and several models use class weighting.
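
A label distribution like the one above can be computed in a few lines. The sketch below uses toy labels and a hypothetical label_distribution helper; it is not taken from stats.py.

```python
from collections import Counter
from typing import Dict, Iterable


def label_distribution(labels: Iterable[int]) -> Dict[int, float]:
    """Map each label to its share of all examples, in percent."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: 100.0 * n / total for label, n in sorted(counts.items())}


# Toy labels chosen to roughly mirror a 30/70 split.
labels = [0] * 3 + [1] * 7
distribution = label_distribution(labels)
# Accuracy (in percent) of always predicting the majority label.
majority_share = max(distribution.values())
```

With a split like this, always predicting label 1 already reaches about 70% accuracy, which is the kind of misleading baseline that a class-sensitive metric guards against.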
