CyberKatsu/CodeOriginClassifier

CodeOriginClassifier

Predict whether a code snippet was written by a human or generated by an LLM, using a fine-tuned CodeBERT text classification model with gradient-based explainability.

This project is a companion to CodeForensics (a rule-based heuristic tool elsewhere in the portfolio). Together they show a deliberate progression from hand-crafted detection rules to learned, data-driven classification.

Architecture

┌──────────────────┐     ┌──────────────────┐     ┌──────────────────┐
│  Reflex Frontend │────▶│  FastAPI Backend  │────▶│  PostgreSQL      │
│ (Python, :53000) │     │ (/predict, :58000)│     │  (predictions)   │
└──────────────────┘     └────────┬─────────┘     └──────────────────┘
                                  │
                    ┌─────────────▼──────────────┐
                    │  TensorFlow SavedModel     │
                    │  (CodeBERT + classification │
                    │   head, ~125M params)       │
                    └─────────────┬──────────────┘
                                  │
                    ┌─────────────▼──────────────┐
                    │  Integrated Gradients       │
                    │  (token-level attribution)  │
                    └────────────────────────────┘

Stack

Layer                Technology
Frontend             Reflex 0.8.x (full-stack Python, no JavaScript)
API                  FastAPI 0.115+ with Pydantic v2 schemas
ML framework         TensorFlow 2.21 / Keras
Pre-trained model    microsoft/codebert-base via HuggingFace Transformers 4.48.x
Experiment tracking  MLflow 3.10
Database             PostgreSQL 17 via SQLAlchemy async ORM + asyncpg
Testing              pytest 8.3 with pytest-asyncio
Containerisation     Docker Compose
CI                   GitHub Actions

Quick Start

Prerequisites

  • Python 3.12
  • Docker & Docker Compose (for PostgreSQL and MLflow)
  • ~2 GB disk space (CodeBERT weights)

Application Demo

The interface provides:

  • Code Editor: Input code snippets with language selection (Python, JavaScript, Java, C++, Go, Rust)
  • Prediction Results: Classification as Human or LLM-Generated with confidence score
  • Token Attribution: Top-5 most influential tokens via Integrated Gradients
  • Model Stats: Displays model parameters, architecture, and explainability method
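As a rough illustration of what the token attribution step computes, the sketch below applies Integrated Gradients to a toy logistic model in pure Python. The model, weights, inputs, and function names are hypothetical stand-ins and are not taken from the project's attribution.py, which works on CodeBERT token embeddings. The path integral is approximated with a midpoint Riemann sum, and features are ranked by absolute attribution, analogous to the top-5 tokens shown in the interface.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def model(x, w, b):
    """Toy stand-in for the classifier: a probability for feature vector x."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def gradient(x, w, b):
    """Analytic gradient of the toy model with respect to each input feature."""
    p = model(x, w, b)
    return [p * (1.0 - p) * wi for wi in w]


def integrated_gradients(x, baseline, w, b, steps=200):
    """Midpoint Riemann-sum approximation of Integrated Gradients."""
    total = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps
        # Point on the straight line from the baseline to the input.
        point = [bi + alpha * (xi - bi) for xi, bi in zip(x, baseline)]
        for i, g in enumerate(gradient(point, w, b)):
            total[i] += g
    # Scale the averaged gradient by the input-baseline difference.
    return [(xi - bi) * t / steps for xi, bi, t in zip(x, baseline, total)]


def top_k(attributions, k=5):
    """Indices of the k features with the largest absolute attribution."""
    order = sorted(
        range(len(attributions)),
        key=lambda i: abs(attributions[i]),
        reverse=True,
    )
    return order[:k]


# Hypothetical weights and input; an all-zero baseline is assumed.
W = [1.5, -2.0, 0.3, 0.0, 0.8, -0.1]
B = 0.1
X = [1.0, 0.5, 2.0, 3.0, -1.0, 0.2]
BASELINE = [0.0] * len(X)

ATTR = integrated_gradients(X, BASELINE, W, B)
```

A useful sanity check for any Integrated Gradients implementation is the completeness property: the attributions should sum to approximately the difference between the model output at the input and at the baseline.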

1. Clone and install

git clone https://github.com/<your-username>/CodeOriginClassifier.git
cd CodeOriginClassifier
python -m venv .venv && source .venv/bin/activate
pip install -e ".[dev]"

2. Start infrastructure

This project uses non-standard default host ports to avoid collisions with other local apps:

  • Frontend: 53000
  • API: 58000
  • MLflow: 55000
  • PostgreSQL: 55432

You can override these with environment variables (FRONTEND_PORT, API_PORT, MLFLOW_PORT, DB_PORT).
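The override behaviour described above can be sketched in Python as follows. This is a minimal illustration, assuming each variable holds a plain integer port; `resolve_ports` and `DEFAULT_PORTS` are hypothetical names, and the project's own resolution (in docker-compose.yml or src/config.py) may differ.

```python
import os

# Defaults as listed above.
DEFAULT_PORTS = {
    "FRONTEND_PORT": 53000,
    "API_PORT": 58000,
    "MLFLOW_PORT": 55000,
    "DB_PORT": 55432,
}


def resolve_ports(env=None):
    """Return the effective port per service, preferring environment overrides."""
    env = os.environ if env is None else env
    ports = {}
    for name, default in DEFAULT_PORTS.items():
        raw = env.get(name, "")
        # An unset or blank variable falls back to the documented default.
        port = int(raw) if raw.strip() else default
        if not 1 <= port <= 65535:
            raise ValueError(f"{name}={port} is outside the valid port range")
        ports[name] = port
    return ports
```

Passing a plain dictionary instead of reading `os.environ` makes it easy to check a configuration before starting the containers, for example `resolve_ports({"API_PORT": "58111"})`.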

docker compose up -d db mlflow

3. Build the dataset

First, generate the LLM samples (requires a Hugging Face token for the Inference API, or use --local with a local model):

# Option A: Hugging Face Inference API (free tier)
export HF_TOKEN=hf_your_token_here
python -m scripts.generate_llm_samples --samples 5000

# Option B: Local model (requires GPU or patience)
python -m scripts.generate_llm_samples --local --model bigcode/starcoder2-3b --samples 5000

Then build and tokenise the dataset:

python -m scripts.build_dataset --samples-per-class 5000
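DATASET.md documents the actual construction. As a rough illustration of the stratified splitting that the build step performs, the sketch below splits sample indices per class so that each subset keeps the same human/LLM ratio. The 80/10/10 fractions, the seed, and the function name are assumptions made for this example, not values taken from the project.

```python
import random
from collections import defaultdict


def stratified_split(labels, fractions=(0.8, 0.1, 0.1), seed=42):
    """Split sample indices into train/val/test while preserving class ratios."""
    if abs(sum(fractions) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    rng = random.Random(seed)
    train, val, test = [], [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        n_train = int(len(indices) * fractions[0])
        n_val = int(len(indices) * fractions[1])
        train += indices[:n_train]
        val += indices[n_train:n_train + n_val]
        # Whatever remains after train and val goes to the test subset.
        test += indices[n_train + n_val:]
    return train, val, test


# 0 = human, 1 = LLM-generated; 5000 samples per class as in the command above.
LABELS = [0] * 5000 + [1] * 5000
TRAIN, VAL, TEST = stratified_split(LABELS)
```

Splitting within each class, rather than over the pooled samples, keeps the validation and test subsets balanced even when the shuffle is unlucky.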

4. Train the model

python -m scripts.train --epochs 5 --batch-size 16 --lr 2e-5

View training metrics in MLflow at http://localhost:55000.

5. Serve the API

uvicorn src.api.app:app --reload --port 58000

Test with curl:

curl -X POST http://localhost:58000/api/v1/predict \
  -H "Content-Type: application/json" \
  -d '{"code": "def greet(name):\n    return f\"Hello, {name}!\"", "language": "python"}'
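The same request can be sent from Python using only the standard library. The sketch below mirrors the curl call above; the helper names are hypothetical, and the response is returned as decoded JSON without assuming any particular fields, since the response schema is defined in src/api/schemas.py.

```python
import json
import urllib.request

API_URL = "http://localhost:58000/api/v1/predict"


def build_predict_request(code, language="python", url=API_URL):
    """Build the same POST request as the curl example, without sending it."""
    payload = json.dumps({"code": code, "language": language}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict(code, language="python", url=API_URL, timeout=30):
    """Send the request and return the decoded JSON body (needs a running API)."""
    request = build_predict_request(code, language, url)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Building the request does not touch the network.
REQUEST = build_predict_request('def greet(name):\n    return f"Hello, {name}!"')
```

With the API running, `predict('print("hi")')` returns whatever JSON the endpoint sends back.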

6. Run the frontend

cd frontend && reflex run

If running locally with reflex run, open http://localhost:3000.

If running with Docker Compose (docker compose up), open http://localhost:53000.

If you need custom ports, set them before docker compose up, for example:

export FRONTEND_PORT=53111
export API_PORT=58111
export MLFLOW_PORT=55111
export DB_PORT=55511
docker compose up -d

PowerShell equivalent:

$env:FRONTEND_PORT = "53111"
$env:API_PORT = "58111"
$env:MLFLOW_PORT = "55111"
$env:DB_PORT = "55511"
docker compose up -d

7. Run tests

pytest tests/ -v --tb=short -m "not slow"

Project Structure

CodeOriginClassifier/
├── src/
│   ├── config.py                 # Centralised configuration
│   ├── dataset/
│   │   ├── loader.py             # CodeSearchNet + LLM sample loading
│   │   ├── preprocessing.py      # Tokenisation and stratified splitting
│   │   └── validation.py         # Dataset integrity checks
│   ├── model/
│   │   ├── architecture.py       # CodeBERT + classification head
│   │   ├── evaluation.py         # Metrics (accuracy, F1, AUC-ROC, etc.)
│   │   └── attribution.py        # Integrated Gradients explainability
│   ├── api/
│   │   ├── app.py                # FastAPI application factory
│   │   ├── routes.py             # /predict and /health endpoints
│   │   ├── schemas.py            # Pydantic request/response models
│   │   └── dependencies.py       # Model loading lifespan + DI
│   └── db/
│       ├── engine.py             # Async SQLAlchemy engine
│       └── models.py             # ORM model (Prediction table)
├── frontend/
│   └── app.py                    # Reflex UI (code editor + results)
├── scripts/
│   ├── build_dataset.py          # End-to-end dataset construction
│   ├── generate_llm_samples.py   # LLM sample generation with provenance
│   └── train.py                  # Fine-tuning with MLflow tracking
├── tests/                        # pytest suite
├── DATASET.md                    # Dataset construction documentation
├── MODEL.md                      # Transfer learning strategy documentation
├── docker-compose.yml            # PostgreSQL + MLflow + API + Frontend
└── pyproject.toml                # Dependencies and tool config

Limitations

This classifier demonstrates the methodology of applying transfer learning to code analysis, but has inherent limitations:

  1. Temporal degradation — LLMs improve continuously. A model trained on today's LLM outputs may not detect tomorrow's. This is a fundamental limitation of the classification task, not a bug in the approach.

  2. Single-generator bias — The training dataset uses a single LLM for generation. A production system would need multi-model training data.

  3. Style vs. content — The model detects stylistic patterns, not semantic correctness. Heavily edited LLM output or human code that follows strict linting rules may be misclassified.

  4. Dataset size — With ~10k samples, the model is a proof-of-concept. Production accuracy would require 100k+ diverse samples.

See DATASET.md and MODEL.md for deeper discussion.

License

GNU GPL v3.0
