- Python 3.12
- UV package manager
- Go 1.24
```
├── data/                       <- Raw data
├── notebooks/                  <- Jupyter notebooks
├── src/
│   └── go_ast_tokenizer/
│       ├── checker/
│       │   ├── checker.go
│       │   ├── checker_test.go
│       │   ├── export.go
│       │   ├── go.mod
│       │   └── go.sum
│       ├── tokenizer/
│       │   ├── go.mod
│       │   ├── tokenizer.go
│       │   └── tokenizer_test.go
│       ├── __init__.py
│       ├── dataset.py          <- Dataset and data module
│       ├── dataset_builder.py  <- Dataset builder
│       ├── dataset_card.py     <- Dataset info card for Hugging Face
│       ├── go_style_checker.py <- Style checker wrapper
│       ├── main.py             <- Entry point
│       ├── model.py            <- Model definition
│       ├── tokenizer.py        <- Wrapper for Go tokenizer
│       └── utils.py
├── tests/                      <- Unit tests
├── config.yaml                 <- Configuration file for LightningCLI
├── LICENSE
├── Makefile
├── pyproject.toml
├── README.md
└── uv.lock
```
The project uses a `config.yaml` file to configure model training parameters:
```yaml
seed_everything: 2357        # Random seed for reproducibility
model:
  learning_rate: 1.0e-05     # Learning rate for optimizer
data:
  batch_size: 8              # Batch size for training
  num_workers: 4             # DataLoader workers
trainer:
  precision: "bf16-mixed"    # Training precision (bf16, 16, 32)
  max_epochs: 10             # Maximum training epochs
# More options in the file...
```
You can modify these parameters in the configuration file to adjust training behavior.
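With LightningCLI, values from `config.yaml` can also be overridden on the command line using dotted keys (e.g. `--model.learning_rate 2e-05`), so a one-off experiment doesn't require editing the file. The sketch below illustrates the dotted-override idea on a plain nested dict; `apply_override` is a hypothetical helper for illustration, not part of this project.

```python
# Hypothetical helper mimicking LightningCLI's dotted-key overrides.
def apply_override(cfg: dict, dotted_key: str, value) -> dict:
    """Set a nested value, e.g. 'model.learning_rate' -> cfg['model']['learning_rate']."""
    node = cfg
    *parents, leaf = dotted_key.split(".")
    for key in parents:
        node = node.setdefault(key, {})  # descend, creating levels as needed
    node[leaf] = value
    return cfg

# Values mirroring config.yaml above.
config = {
    "seed_everything": 2357,
    "model": {"learning_rate": 1.0e-05},
    "data": {"batch_size": 8, "num_workers": 4},
    "trainer": {"precision": "bf16-mixed", "max_epochs": 10},
}
apply_override(config, "model.learning_rate", 2e-5)
apply_override(config, "trainer.max_epochs", 5)
print(config["model"]["learning_rate"])  # 2e-05
```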
Run the Llama 3 fine-tuning with:
```shell
make fit
```
To evaluate the fine-tuned model on test data:
```shell
uv run --env-file .env -m src.go_ast_tokenizer.main test --config config.yaml --ckpt_path <path_to_checkpoint>
```
Run all code quality checks with:
```shell
make checks
```
This command runs the following checks in sequence:
- Dependencies: `make uv-lock` locks dependencies
- Linting: `make lint` lints the code using Ruff with auto-fixes
- Formatting: `make format` formats code using the Ruff formatter
- Type checking: `make typecheck` performs static type checking with Pyright
You can also run each check individually as needed.
To run unit tests:
```shell
make unit-test
```

or

```shell
uv run pytest tests/ -v
```
Build and test the Go style checker with:
```shell
make checker
```
This command:
- Runs the Go tests for the checker package
- Builds the checker as a shared library for use by Python
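On the Python side, a wrapper such as `go_style_checker.py` would typically load the resulting shared library with `ctypes`. The sketch below is illustrative only: the exported symbol `Check`, its C signature, and the newline-separated output format are assumptions, not the project's actual interface.

```python
import ctypes

def load_checker(lib_path: str) -> ctypes.CDLL:
    """Load the Go checker built as a shared library (e.g. by `make checker`)."""
    lib = ctypes.CDLL(lib_path)
    # Assumed export from the Go side: char* Check(char* source)
    lib.Check.argtypes = [ctypes.c_char_p]
    lib.Check.restype = ctypes.c_char_p
    return lib

def parse_issues(raw: bytes) -> list[str]:
    """Split an assumed newline-separated diagnostic list returned by the checker."""
    return [line for line in raw.decode().splitlines() if line.strip()]
```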
Generate the dataset with:
```shell
make dataset
```
This command:
- Pulls the "Go" split of `bigcode/the-stack-v2-dedup`
- Runs go-critic (style group) and labels each snippet
- Pushes dataset and README to 🤗 ${HF_USERNAME}/go-critic-style
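The labeling step can be pictured as follows: a snippet is marked positive when go-critic emits at least one diagnostic for it. This is a minimal sketch of that idea; `label_snippet` and the sample diagnostic format are illustrative, not the project's code.

```python
def label_snippet(gocritic_output: str) -> int:
    """Return 1 if go-critic reported any diagnostics for the snippet, else 0."""
    issues = [line for line in gocritic_output.splitlines() if line.strip()]
    return 1 if issues else 0

# Example diagnostic in the typical file:line:col style (format assumed).
flagged = "main.go:10:2: ifElseChain: rewrite if-else to switch statement"
print(label_snippet(""), label_snippet(flagged))  # 0 1
```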
Note: the following variables must be set in `.env`: AWS_PROFILE_NAME, AWS_ROLE_ARN, AWS_SESSION_NAME, HF_USERNAME, HF_TOKEN
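Since the dataset step fails without these variables, a small preflight check can surface missing ones early. The helper below is a hedged sketch, not part of the project; the variable list mirrors the note above.

```python
import os

# Variables the dataset step expects in .env (from the note above).
REQUIRED_VARS = [
    "AWS_PROFILE_NAME",
    "AWS_ROLE_ARN",
    "AWS_SESSION_NAME",
    "HF_USERNAME",
    "HF_TOKEN",
]

def missing_env_vars(env=os.environ) -> list[str]:
    """Return required variables that are unset or empty."""
    return [name for name in REQUIRED_VARS if not env.get(name)]

print(missing_env_vars({"HF_TOKEN": "hf_xxx"}))
# → ['AWS_PROFILE_NAME', 'AWS_ROLE_ARN', 'AWS_SESSION_NAME', 'HF_USERNAME']
```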
Install a dedicated Jupyter kernel for this project:
```shell
make jupyter-kernel
```
Start Jupyter Lab:
```shell
make lab
```
This launches Jupyter Lab with the ./notebooks directory as the root.
This project is licensed under the MIT License - see the LICENSE file for details.