CodeOriginClassifier

Predict whether a code snippet was written by a human or generated by an LLM, using a fine-tuned CodeBERT text classification model with gradient-based explainability.

This project forms a narrative arc with CodeForensics (a rule-based heuristic tool elsewhere in the portfolio), demonstrating a deliberate progression from hand-crafted detection rules to learned, data-driven classification.

Architecture

┌──────────────────┐     ┌──────────────────┐     ┌──────────────────┐
│  Reflex Frontend │────▶│  FastAPI Backend  │────▶│  PostgreSQL      │
│ (Python, :53000) │     │ (/predict, :58000)│     │  (predictions)   │
└──────────────────┘     └────────┬─────────┘     └──────────────────┘
                                  │
                    ┌─────────────▼──────────────┐
                    │  TensorFlow SavedModel     │
                    │  (CodeBERT + classification │
                    │   head, ~125M params)       │
                    └─────────────┬──────────────┘
                                  │
                    ┌─────────────▼──────────────┐
                    │  Integrated Gradients       │
                    │  (token-level attribution)  │
                    └────────────────────────────┘
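As a rough illustration of the attribution stage in the diagram, the sketch below approximates Integrated Gradients for a toy logistic model with a hand-written gradient. It is not the project's implementation: `toy_model`, `toy_gradient`, the all-zero baseline, and the step count are assumptions chosen for illustration, whereas the real pipeline would differentiate the CodeBERT classifier with respect to its token embeddings.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def toy_model(x, w):
    """Hypothetical stand-in for the classifier: sigmoid of a linear score."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)))


def toy_gradient(x, w):
    """Analytic gradient of toy_model with respect to the input x."""
    p = toy_model(x, w)
    return [p * (1.0 - p) * wi for wi in w]


def integrated_gradients(x, baseline, w, steps=200):
    """Approximate IG with a midpoint Riemann sum along the straight-line path
    from the baseline to the input."""
    grad_sum = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        grad = toy_gradient(point, w)
        grad_sum = [acc + g for acc, g in zip(grad_sum, grad)]
    # Scale the averaged gradient by the input-minus-baseline difference.
    return [(xi - b) * g / steps for xi, b, g in zip(x, baseline, grad_sum)]


weights = [1.5, -2.0, 0.5]
features = [1.0, 0.25, 1.0]
zero_baseline = [0.0, 0.0, 0.0]
attributions = integrated_gradients(features, zero_baseline, weights)
```

A useful sanity check is the completeness property: the attributions should sum to approximately the difference between the model output at the input and at the baseline.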

Stack

Layer	Technology
Frontend	Reflex 0.8.x (full-stack Python — no JavaScript)
API	FastAPI 0.115+ with Pydantic v2 schemas
ML framework	TensorFlow 2.21 / Keras
Pre-trained model	microsoft/codebert-base via HuggingFace Transformers 4.48.x
Experiment tracking	MLflow 3.10
Database	PostgreSQL 17 via SQLAlchemy async ORM + asyncpg
Testing	pytest 8.3 with pytest-asyncio
Containerisation	Docker Compose
CI	GitHub Actions

Quick Start

Prerequisites

Python 3.12
Docker & Docker Compose (for PostgreSQL and MLflow)
~2 GB disk space (CodeBERT weights)

Application Demo

The interface provides:

Code Editor: Input code snippets with language selection (Python, JavaScript, Java, C++, Go, Rust)
Prediction Results: Classification as Human or LLM-Generated with confidence score
Token Attribution: Top-5 most influential tokens via Integrated Gradients
Model Stats: Displays model parameters, architecture, and explainability method
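The Top-5 selection in the Token Attribution panel could be sketched as follows. This assumes tokens are ranked by the absolute value of their attribution score, which the project may or may not do; `top_k_tokens` and the sample scores are hypothetical.

```python
def top_k_tokens(tokens, attributions, k=5):
    """Return the k (token, score) pairs with the largest absolute attribution."""
    if len(tokens) != len(attributions):
        raise ValueError("tokens and attributions must have the same length")
    ranked = sorted(
        zip(tokens, attributions),
        key=lambda pair: abs(pair[1]),
        reverse=True,
    )
    return ranked[:k]


# Made-up scores for illustration only; real values would come from the model.
sample_tokens = ["def", "greet", "(", "name", ")", ":", "return"]
sample_scores = [0.12, -0.40, 0.01, 0.33, 0.02, -0.05, 0.27]
top_tokens = top_k_tokens(sample_tokens, sample_scores)
```

Ranking by magnitude keeps tokens that push strongly towards either class; ranking by signed score instead would surface only the tokens supporting the predicted label.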

1. Clone and install

git clone https://github.com/<your-username>/CodeOriginClassifier.git
cd CodeOriginClassifier
python -m venv .venv && source .venv/bin/activate
pip install -e ".[dev]"

2. Start infrastructure

This project uses high-numbered, non-standard default host ports to avoid collisions with other local apps:

Frontend: 53000
API: 58000
MLflow: 55000
PostgreSQL: 55432

You can override these with environment variables (FRONTEND_PORT, API_PORT, MLFLOW_PORT, DB_PORT).

docker compose up -d db mlflow
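For scripts that need the same port values, the defaults and overrides listed in this step could be resolved in Python roughly as shown below. The environment variable names and default ports come from the list above; `resolve_ports` is a hypothetical helper and is not taken from `src/config.py`.

```python
import os

# Default host ports as listed in this step.
DEFAULT_PORTS = {
    "FRONTEND_PORT": 53000,
    "API_PORT": 58000,
    "MLFLOW_PORT": 55000,
    "DB_PORT": 55432,
}


def resolve_ports(environ=None):
    """Return a name -> port mapping, preferring environment overrides."""
    env = os.environ if environ is None else environ
    ports = {}
    for name, default in DEFAULT_PORTS.items():
        raw = env.get(name)
        # Treat an unset or empty variable as "use the default".
        port = default if raw in (None, "") else int(raw)
        if not 0 < port < 65536:
            raise ValueError(f"{name} must be between 1 and 65535, got {port}")
        ports[name] = port
    return ports
```

Passing a plain dictionary as `environ` keeps the helper easy to test without touching the real process environment.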

3. Build the dataset

First, generate the LLM samples (requires a Hugging Face token for the Inference API, or use --local with a local model):

# Option A: Hugging Face Inference API (free tier)
export HF_TOKEN=hf_your_token_here
python -m scripts.generate_llm_samples --samples 5000

# Option B: Local model (requires GPU or patience)
python -m scripts.generate_llm_samples --local --model bigcode/starcoder2-3b --samples 5000

Then build and tokenise the dataset:

python -m scripts.build_dataset --samples-per-class 5000
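The project structure mentions stratified splitting during preprocessing. A minimal standard-library sketch of that idea follows; the function name, the 80/20 split, and the seed are assumptions for illustration, and the actual logic in `src/dataset/preprocessing.py` may differ.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.2, seed=42):
    """Return (train_indices, test_indices) that preserve per-class proportions."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be strictly between 0 and 1")
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train, test = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)


# Example: 0 = human, 1 = LLM-generated (label encoding assumed for illustration).
example_labels = [0] * 50 + [1] * 50
train_idx, test_idx = stratified_split(example_labels)
```

Splitting within each class keeps the human and LLM-generated samples balanced in both partitions, which matters when the two classes are built to be equal in size.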

4. Train the model

python -m scripts.train --epochs 5 --batch-size 16 --lr 2e-5

View training metrics in MLflow at http://localhost:55000.

5. Serve the API

uvicorn src.api.app:app --reload --port 58000

Test with curl:

curl -X POST http://localhost:58000/api/v1/predict \
  -H "Content-Type: application/json" \
  -d '{"code": "def greet(name):\n    return f\"Hello, {name}!\"", "language": "python"}'
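The same request can be issued from Python using only the standard library. The URL and the `code` and `language` fields mirror the curl call above; the helper names are hypothetical, and since the response schema is not documented here, the sketch simply returns the decoded JSON.

```python
import json
import urllib.request

# Endpoint taken from the curl example; adjust if API_PORT is overridden.
API_URL = "http://localhost:58000/api/v1/predict"


def build_predict_request(code, language, url=API_URL):
    """Build a POST request carrying the snippet as a JSON body."""
    payload = json.dumps({"code": code, "language": language}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict(code, language, url=API_URL, timeout=30.0):
    """Send the snippet to the API and return the decoded JSON response."""
    request = build_predict_request(code, language, url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the API running, calling `predict('def greet(name):\n    return f"Hello, {name}!"', "python")` should return the prediction payload as a Python dictionary.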

6. Run the frontend

cd frontend && reflex run

If running locally with reflex run, open http://localhost:3000.

If running with Docker Compose (docker compose up), open http://localhost:53000.

If you need custom ports, set them before docker compose up, for example:

export FRONTEND_PORT=53111
export API_PORT=58111
export MLFLOW_PORT=55111
export DB_PORT=55511
docker compose up -d

PowerShell equivalent:

$env:FRONTEND_PORT = "53111"
$env:API_PORT = "58111"
$env:MLFLOW_PORT = "55111"
$env:DB_PORT = "55511"
docker compose up -d

7. Run tests

pytest tests/ -v --tb=short -m "not slow"

Project Structure

CodeOriginClassifier/
├── src/
│   ├── config.py                 # Centralised configuration
│   ├── dataset/
│   │   ├── loader.py             # CodeSearchNet + LLM sample loading
│   │   ├── preprocessing.py      # Tokenisation and stratified splitting
│   │   └── validation.py         # Dataset integrity checks
│   ├── model/
│   │   ├── architecture.py       # CodeBERT + classification head
│   │   ├── evaluation.py         # Metrics (accuracy, F1, AUC-ROC, etc.)
│   │   └── attribution.py        # Integrated Gradients explainability
│   ├── api/
│   │   ├── app.py                # FastAPI application factory
│   │   ├── routes.py             # /predict and /health endpoints
│   │   ├── schemas.py            # Pydantic request/response models
│   │   └── dependencies.py       # Model loading lifespan + DI
│   └── db/
│       ├── engine.py             # Async SQLAlchemy engine
│       └── models.py             # ORM model (Prediction table)
├── frontend/
│   └── app.py                    # Reflex UI (code editor + results)
├── scripts/
│   ├── build_dataset.py          # End-to-end dataset construction
│   ├── generate_llm_samples.py   # LLM sample generation with provenance
│   └── train.py                  # Fine-tuning with MLflow tracking
├── tests/                        # pytest suite
├── DATASET.md                    # Dataset construction documentation
├── MODEL.md                      # Transfer learning strategy documentation
├── docker-compose.yml            # PostgreSQL + MLflow + API + Frontend
└── pyproject.toml                # Dependencies and tool config

Limitations

This classifier demonstrates the methodology of applying transfer learning to code analysis, but has inherent limitations:

Temporal degradation — LLMs improve continuously. A model trained on today's LLM outputs may not detect tomorrow's. This is a fundamental limitation of the classification task, not a bug in the approach.
Single-generator bias — The training dataset uses a single LLM for generation. A production system would need multi-model training data.
Style vs. content — The model detects stylistic patterns, not semantic correctness. Heavily edited LLM output or human code that follows strict linting rules may be misclassified.
Dataset size — With ~10k samples, the model is a proof-of-concept. Production accuracy would require 100k+ diverse samples.

See DATASET.md and MODEL.md for deeper discussion.

License

GNU GPL v3.0
