JasperK04/MLP_final

Final Project Code Overview

This repository contains the full pipeline for the subtask C boundary detection task, framed as windowed text classification. The code is written entirely in Python and builds several classical ML baselines plus ensemble variants.

How to Run

Install dependencies:

uv sync
source .venv/bin/activate

Train and evaluate all models (note: this takes quite a while):

python main.py

Attribution

  • External code used: None

Project Structure

Data and Task Setup

The dataset lives in data/ as JSONL files. Each line has:

  • text: the full document string
  • label: an integer index indicating the boundary between human and machine segments
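As an illustration of the record format above, the following is a minimal sketch of a JSONL reader. The function name iter_documents and the sample records are hypothetical and are not taken from the repository; the repository's own iterator may differ.

```python
import json
from typing import Iterable, Iterator, Tuple


def iter_documents(lines: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Yield (text, label) pairs from an iterable of JSONL lines."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        yield record["text"], int(record["label"])


# Hypothetical records, only to show the expected shape of each line.
sample = [
    '{"text": "the cat sat on the mat today", "label": 3}',
    '{"text": "hello world", "label": 1}',
]
docs = list(iter_documents(sample))
```

For a file on disk, the same function can be fed an open file handle, since iterating over a text file yields one line at a time.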

Windowing converts each document into overlapping word windows. For a window size of $n$, each training instance is a center word together with the context window around it. The label of each instance is derived from the position of its center word relative to the boundary index:

  • 0 if the index is before the boundary
  • 1 if the index is at or after the boundary
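The windowing and labeling rule can be sketched as follows. This is not the code in data.py: it assumes whitespace tokenization, that the boundary is a word index, that a window of size $n$ means up to $n$ words on each side of the center word, and that windows are truncated at document edges. The name make_windows is hypothetical.

```python
from typing import List, Tuple


def make_windows(text: str, boundary: int, n: int = 4) -> List[Tuple[str, int]]:
    """Return one (window_text, label) pair per word in the document."""
    words = text.split()  # assumption: whitespace tokenization
    examples = []
    for i in range(len(words)):
        lo = max(0, i - n)               # assumption: truncate at the start
        hi = min(len(words), i + n + 1)  # assumption: truncate at the end
        window = " ".join(words[lo:hi])
        label = 0 if i < boundary else 1  # 0 before the boundary, 1 at or after
        examples.append((window, label))
    return examples


windows = make_windows("a b c d e", boundary=2, n=1)
```

With these assumptions, every word yields exactly one instance, so a document of $k$ words produces $k$ labeled windows.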

See data.py for exact behavior.

Preprocessing

The preprocessing pipeline is a mixin-based class in preprocessing.py. Active steps:

  • Unicode NFKD normalization + remove diacritics
  • Remove punctuation (! ? , .)
  • Lowercase tokens unless the alphabetic part is all caps
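The three active steps can be approximated with the standard library alone. The sketch below is not the mixin-based class from preprocessing.py; the function names are hypothetical, and the order in which the steps are applied is an assumption. Tokens are rewritten in place so that the whitespace between them is left untouched.

```python
import re
import unicodedata

PUNCTUATION = "!?,."  # the four characters listed above


def strip_diacritics(text: str) -> str:
    """NFKD-normalize, then drop combining marks such as accents."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def remove_punctuation(text: str) -> str:
    return text.translate(str.maketrans("", "", PUNCTUATION))


def smart_lower(token: str) -> str:
    """Lowercase a token unless its alphabetic part is all caps."""
    letters = "".join(ch for ch in token if ch.isalpha())
    if letters and letters.isupper():
        return token
    return token.lower()


def preprocess(text: str) -> str:
    text = remove_punctuation(strip_diacritics(text))
    # Rewrite each non-whitespace run in place; whitespace is preserved as-is.
    return re.sub(r"\S+", lambda m: smart_lower(m.group()), text)


cleaned = preprocess("Caf\u00e9, NASA rocks!")
```

Under this rule a single capital letter such as "I" counts as all caps and is kept unchanged.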

Whitespace-altering mixins also exist, and we considered using them. We ended up leaving them out because changing the whitespace would have forced us to implement a label offset counter to keep the labels aligned.

Detailed usage examples are in docs/preprocessing_and_jsoniterator.md.

Features

Each training example is a tuple (window_text, position), where position $\in [0,1]$ is the relative index within the document.

Two feature pipelines are used:

  • Simple features: word TF-IDF with n-grams (1,2), min_df=3, sublinear_tf=True.
  • Complex features: word TF-IDF (1,2), char TF-IDF (3,5), also with min_df=3, sublinear_tf=True, plus the numeric position feature.
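A rough sketch of the complex feature set, using scikit-learn's TfidfVectorizer and SciPy's sparse hstack, is shown below. It is not the code in models/model_utils.py: the function name fit_complex_features is hypothetical, the plain "char" analyzer (rather than "char_wb") is an assumption, and the toy call uses min_df=1 only because three tiny examples would otherwise leave an empty vocabulary.

```python
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer


def fit_complex_features(examples, min_df=3):
    """Fit word and char TF-IDF on window texts and append the position column."""
    texts = [text for text, _ in examples]
    positions = csr_matrix([[float(pos)] for _, pos in examples])
    word_vec = TfidfVectorizer(
        analyzer="word", ngram_range=(1, 2), min_df=min_df, sublinear_tf=True
    )
    char_vec = TfidfVectorizer(
        analyzer="char",  # assumption: the source does not say char vs. char_wb
        ngram_range=(3, 5), min_df=min_df, sublinear_tf=True
    )
    X = hstack(
        [word_vec.fit_transform(texts), char_vec.fit_transform(texts), positions]
    ).tocsr()
    return X, word_vec, char_vec


# Toy (window_text, position) tuples; min_df=1 only for this tiny demo.
examples = [("the cat sat", 0.0), ("cat sat on", 0.5), ("sat on the", 1.0)]
X, word_vec, char_vec = fit_complex_features(examples, min_df=1)
```

The simple feature set corresponds to keeping only the word-level vectorizer from this sketch.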

Feature definitions are in models/model_utils.py.

Models

Base models are trained via scikit-learn pipelines in simple_models.py and complex_models.py:

  • Logistic Regression (balanced)
  • Linear SVM
  • Naive Bayes
  • Random Forest
  • Dummy baseline

Ensembles:

  • Voting classifier (hard voting)
  • Stacking classifier (logistic regression meta-learner)
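To make the hard-voting rule concrete, here is a small standalone sketch of majority voting over per-model predictions. It is independent of the repository's ensemble code; the function name hard_vote is hypothetical, and resolving ties toward the smaller label is an assumption of this sketch.

```python
from collections import Counter
from typing import List, Sequence


def hard_vote(predictions: Sequence[Sequence[int]]) -> List[int]:
    """Combine per-model label lists by majority vote, one vote per model."""
    voted = []
    for labels in zip(*predictions):  # labels for one example across all models
        counts = Counter(labels)
        top = max(counts.values())
        # Assumption: ties go to the smallest label.
        voted.append(min(label for label, c in counts.items() if c == top))
    return voted


model_a = [0, 1, 1]
model_b = [0, 0, 1]
model_c = [1, 0, 1]
combined = hard_vote([model_a, model_b, model_c])
```

Stacking differs in that the base model outputs are not counted but fed as inputs to a second model, here a logistic regression meta-learner.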

All model builders accept (X_train, y_train) and return a trained estimator.

Evaluation

Evaluation uses macro F1 and full classification reports. Results are appended to the result files in the evaluations/ directory by evaluation.py.
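For reference, macro F1 is the unweighted mean of the per-class F1 scores. The sketch below computes it in plain Python on made-up labels; it is not the code in evaluation.py, and the function names are hypothetical.

```python
def f1_for_class(y_true, y_pred, cls):
    """F1 for one class, computed as 2*TP / (2*TP + FP + FN)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels that occur."""
    classes = sorted(set(y_true) | set(y_pred))
    return sum(f1_for_class(y_true, y_pred, c) for c in classes) / len(classes)


# Made-up labels for illustration only.
y_true = [0, 0, 1, 1, 1, 1]
y_pred = [0, 1, 1, 1, 1, 1]
score = macro_f1(y_true, y_pred)

# A predictor that always outputs the majority class is penalized heavily.
always_one = macro_f1(y_true, [1] * len(y_true))
```

In this toy case the always-1 predictor is correct on 4 of 6 examples but gets a macro F1 of only 0.4, because its F1 on class 0 is zero.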

The best model is selected by test macro F1. Based on the current results file, the top test score is:

  • Complex Linear SVM: macro F1 = 0.9301, with window size 4

See the files in evaluations/ for the full evaluation results.

Dataset Statistics

Descriptive statistics and label distribution are generated by stats.py and saved to stats.md. The current label distribution for window size 4 is ~29.60% label 0 and ~70.40% label 1. This indicates a noticeable class imbalance (label 1 dominates), which is why macro F1 is emphasized in evaluation and several models use class weighting.
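To show what class weighting does with this distribution, the sketch below applies the "balanced" heuristic documented by scikit-learn, n_samples / (n_classes * class_count). Whether every weighted model in the repository uses exactly this scheme is an assumption, and the absolute counts are hypothetical values chosen only to match the reported percentages.

```python
def balanced_class_weights(counts):
    """Weight each class by n_samples / (n_classes * class_count)."""
    n_samples = sum(counts.values())
    n_classes = len(counts)
    return {
        label: n_samples / (n_classes * count) for label, count in counts.items()
    }


# Hypothetical totals matching the reported 29.60% / 70.40% split.
counts = {0: 2960, 1: 7040}
weights = balanced_class_weights(counts)
# weights[0] is about 1.69 and weights[1] is about 0.71, so errors on the
# minority class (label 0) cost roughly 2.4 times as much during training.
```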
