
# Syntax-Aware Tokenizer for Go Code Style Analysis

Python 3.12 · Code style: ruff · Type checker: pyright

## Requirements

- Python 3.12
- uv for dependency and environment management (the repo ships a `uv.lock`)
- A Go toolchain, for the `checker` and `tokenizer` Go modules (the checker is built as a shared library)
- make, which drives the common tasks via the `Makefile`

## Project Structure

```
├── data/                          <- Raw data
├── notebooks/                     <- Jupyter notebooks
├── src/
│   └── go_ast_tokenizer/
│       ├── checker/
│       │   ├── checker.go
│       │   ├── checker_test.go
│       │   ├── export.go
│       │   ├── go.mod
│       │   └── go.sum
│       ├── tokenizer/
│       │   ├── go.mod
│       │   ├── tokenizer.go
│       │   └── tokenizer_test.go
│       ├── __init__.py
│       ├── dataset.py             <- Dataset and data module
│       ├── dataset_builder.py     <- Dataset builder
│       ├── dataset_card.py       <- Dataset info card for Hugging Face
│       ├── go_style_checker.py    <- Style checker wrapper
│       ├── main.py                <- Entry point
│       ├── model.py               <- Model definition
│       ├── tokenizer.py           <- Wrapper for Go tokenizer
│       └── utils.py
├── tests/                         <- Unit tests
├── config.yaml                    <- Configuration file for LightningCLI
├── LICENSE
├── Makefile
├── pyproject.toml
├── README.md
└── uv.lock
```

## Usage

### Configuration

The project uses a `config.yaml` file to configure the model training parameters:

```yaml
seed_everything: 2357         # Random seed for reproducibility
model:
  learning_rate: 1.0e-05      # Learning rate for optimizer
data:
  batch_size: 8               # Batch size for training
  num_workers: 4              # DataLoader workers
trainer:
  precision: "bf16-mixed"     # Training precision (bf16, 16, 32)
  max_epochs: 10              # Maximum training epochs
  # More options in the file...
```

You can modify these parameters in the configuration file to adjust training behavior.
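
Because training is driven by LightningCLI, these values can also be overridden on the command line for a single run, without editing the file. A minimal sketch, assuming `make fit` wraps the same entry point as the `test` command shown below:

```sh
uv run --env-file .env -m src.go_ast_tokenizer.main fit --config config.yaml \
  --model.learning_rate 2.0e-05 \
  --data.batch_size 16
```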

### Model Fine-tuning

Run the Llama 3 fine-tuning with:

```sh
make fit
```

To evaluate the fine-tuned model on the test data:

```sh
uv run --env-file .env -m src.go_ast_tokenizer.main test --config config.yaml --ckpt_path <path_to_checkpoint>
```

## Development

### Code Quality Checks

Run all code quality checks with:

```sh
make checks
```

This command runs the following checks in sequence:

1. Dependencies: `make uv-lock` locks the dependency versions
2. Linting: `make lint` lints the code using Ruff, applying auto-fixes
3. Formatting: `make format` formats the code with the Ruff formatter
4. Type checking: `make typecheck` performs static type checking with Pyright

You can also run each check individually as needed.
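
If you prefer to invoke the tools directly (for example in CI), the targets most likely wrap commands along these lines; treat this as a sketch, not the Makefile's exact contents:

```sh
uv lock                    # make uv-lock: lock dependencies
uv run ruff check --fix .  # make lint: lint with auto-fixes
uv run ruff format .       # make format: apply the Ruff formatter
uv run pyright             # make typecheck: static type checking
```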

### Run Tests

To run the unit tests:

```sh
make unit-test
```

or, equivalently:

```sh
uv run pytest tests/ -v
```

### Build Style Checker

Build and test the Go style checker with:

```sh
make checker
```

This command:

1. Runs the Go tests for the checker package
2. Builds the checker as a shared library for use by Python
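
Under the hood this is the standard Go shared-library flow. A minimal sketch of the two steps, assuming output names the Makefile may spell differently:

```sh
cd src/go_ast_tokenizer/checker

# Run the package tests
go test ./...

# Build a C shared library (libchecker.so plus a generated libchecker.h)
# that the Python wrapper can load via ctypes/cffi
go build -buildmode=c-shared -o libchecker.so .
```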

### Build Dataset

Generate the dataset with:

```sh
make dataset
```

This command:

1. Pulls the Go split of `bigcode/the-stack-v2-dedup`
2. Runs `go-critic` (style group) and labels each snippet accordingly
3. Pushes the dataset and its README to 🤗 `${HF_USERNAME}/go-critic-style`

Note: the following variables must be set in `.env`: `AWS_PROFILE_NAME`, `AWS_ROLE_ARN`, `AWS_SESSION_NAME`, `HF_USERNAME`, `HF_TOKEN`. The AWS credentials are needed because The Stack v2 stores file contents in S3 rather than in the Hugging Face dataset itself.
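
For orientation, the pipeline in `dataset_builder.py` boils down to something like the sketch below. The S3 fetch mirrors the usage documented on The Stack v2 dataset card; `check_style` is a hypothetical name for the wrapper in `go_style_checker.py`, and the output columns are assumptions:

```python
import os

import boto3
from datasets import load_dataset
from smart_open import open as s3_open

from go_ast_tokenizer.go_style_checker import check_style  # hypothetical API

# File contents live in Software Heritage's S3 bucket, hence the AWS_* variables.
session = boto3.Session(profile_name=os.environ["AWS_PROFILE_NAME"])
s3 = session.client("s3")

def fetch_blob(blob_id: str, src_encoding: str) -> str:
    url = f"s3://softwareheritage/content/{blob_id}"
    with s3_open(url, "rb", transport_params={"client": s3}) as f:
        return f.read().decode(src_encoding)

rows = load_dataset("bigcode/the-stack-v2-dedup", "Go", split="train", streaming=True)

def label(row):
    code = fetch_blob(row["blob_id"], row["src_encoding"])
    issues = check_style(code)  # go-critic style-group findings decide the label
    return {"code": code, "label": int(bool(issues))}

labeled = rows.map(label)
# dataset_builder.py then pushes the result (plus a dataset card) to the Hub,
# e.g. push_to_hub(f"{os.environ['HF_USERNAME']}/go-critic-style")
```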

## Jupyter Notebook

### Setup Jupyter Kernel

Install a dedicated Jupyter kernel for this project:

```sh
make jupyter-kernel
```
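
This target presumably registers the project's environment with ipykernel; the direct equivalent would look roughly like:

```sh
uv run python -m ipykernel install --user --name go-ast-tokenizer
```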

### Run Jupyter Lab

Start Jupyter Lab:

```sh
make lab
```

This launches Jupyter Lab with the `./notebooks` directory as its root.

## License

This project is licensed under the MIT License; see the `LICENSE` file for details.
