Text to Piano is a clean, public-facing version of our text-to-piano model.
Its purpose is to generate piano music from text prompts through a two-stage pipeline:
- A base text-to-piano model generates structural music tokens
- A complementary transformer predicts duration and velocity to make the result more expressive
This project is part of the BachGround ecosystem. BachGround is the product and research context behind this work, and this repository is intended to expose the core model assets and inference flow in a simpler form.
When referring to this repository, please treat it as an open source release of the model and pipeline components developed for BachGround.
Website reference:
This repository includes the core pieces needed to understand and run the generation pipeline:
- a fine-tuned Llama-based text-to-token model
- a complementary transformer for duration and velocity prediction
- inference scripts for both stages
- lightweight documentation for running the models
This repository does not aim to include the full internal development history or all experimental datasets.
Hugging Face model links will be published here:
- Llama adapter model: T2P Base Model
- Complementary transformer model: Complementary Transformer
Recommended split:
- GitHub hosts code, lightweight configs, and documentation
- Hugging Face hosts large model weights and downloadable inference assets
Generating piano music directly from text is easier to manage when musical structure and expressive playback are separated.
In Text to Piano:
- the base model generates the musical skeleton
- the complementary transformer adds performance details
This makes the system easier to inspect, debug, and improve.
The full pipeline is:
text prompt -> base token generation -> duration/velocity enrichment -> MIDI
More concretely:
- A text prompt is given to the base model
- The base model generates piano token sequences
- The complementary transformer enriches the sequence with dur_* and vel_* tokens
- The enriched output is converted into MIDI
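The flow above can be sketched as three composable stages. This is a minimal illustration only, not the repository's implementation: the function names, the `note_*` token names, and the placement of `dur_*` and `vel_*` tokens directly after each base token are all assumptions made for the example, and the stand-in stages return fixed values instead of calling any model.

```python
from typing import Callable, List

Tokens = List[str]


def toy_generate_base(prompt: str) -> Tokens:
    """Stand-in for the base model: returns a fixed, made-up token sequence."""
    return ["note_60", "note_64", "note_67"]


def toy_enrich(tokens: Tokens) -> Tokens:
    """Stand-in for the complementary transformer.

    Appends a constant dur_* and vel_* token after every base token.
    The real model predicts these values; the layout here is assumed.
    """
    enriched: Tokens = []
    for token in tokens:
        enriched.extend([token, "dur_4", "vel_64"])
    return enriched


def toy_to_midi(tokens: Tokens) -> bytes:
    """Placeholder for MIDI conversion: encodes the tokens as text, NOT real MIDI."""
    return " ".join(tokens).encode("utf-8")


def run_stages(
    prompt: str,
    generate_base: Callable[[str], Tokens],
    enrich: Callable[[Tokens], Tokens],
    to_midi: Callable[[Tokens], bytes],
) -> bytes:
    """Chain the stages: prompt -> base tokens -> enriched tokens -> output bytes."""
    base_tokens = generate_base(prompt)
    enriched_tokens = enrich(base_tokens)
    return to_midi(enriched_tokens)


if __name__ == "__main__":
    print(run_stages("A calm piano melody in C major",
                     toy_generate_base, toy_enrich, toy_to_midi))
```

Passing each stage as a callable mirrors the separation described above: either stage can be swapped or inspected without touching the other.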
Current top-level structure:
- models/llama-8b-piast-at-10585: Base text-to-piano model assets
- models/complementary_transformer: Complementary transformer code and trained duration/velocity models
- scripts/infer_llama_music.py: Optional standalone inference entrypoint for the base model
- scripts/infer_music_full_pipeline.py: End-to-end prompt -> base tokens -> enrichment -> MIDI pipeline
- scripts/midi_to_mp3.py: Optional MIDI to MP3 rendering utility
- README.md: Project overview and usage notes
Inside models/complementary_transformer:
- train.py
- detokenizer.py
- token2midi.py
- models/duration
- models/velocity
The recommended inference entrypoint is the combined pipeline script:
python3 scripts/infer_music_full_pipeline.py \
--llama-model-dir models/llama-8b-10585 \
--duration-model-dir models/complementary_transformer/models/duration \
--velocity-model-dir models/complementary_transformer/models/velocity \
--prompt "A calm piano melody in C major" \
--output-dir pipeline_out \
--do-sample \
--temperature 0.9 \
--top-p 0.9 \
--max-new-tokens 450
This single command performs the full chain:
- generate base symbolic piano tokens from the text prompt
- detokenize the base sequence to a base MIDI file
- enrich the sequence with dur_* and vel_* tokens using the complementary transformer
- detokenize the enriched sequence to the final MIDI output
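If you prefer to launch the pipeline from Python, the same command can be assembled as an argument list and run with `subprocess`. This is a hedged sketch: the helper names `build_pipeline_command` and `run_full_pipeline` are hypothetical, the flags and paths are copied from the command shown here, and `sys.executable` is used in place of `python3` so the script runs under the current interpreter.

```python
import subprocess
import sys
from typing import List


def build_pipeline_command(
    prompt: str,
    output_dir: str = "pipeline_out",
    temperature: float = 0.9,
    top_p: float = 0.9,
    max_new_tokens: int = 450,
) -> List[str]:
    """Build the argv list for the combined pipeline script (hypothetical helper)."""
    return [
        sys.executable,
        "scripts/infer_music_full_pipeline.py",
        "--llama-model-dir", "models/llama-8b-10585",
        "--duration-model-dir", "models/complementary_transformer/models/duration",
        "--velocity-model-dir", "models/complementary_transformer/models/velocity",
        "--prompt", prompt,
        "--output-dir", output_dir,
        "--do-sample",
        "--temperature", str(temperature),
        "--top-p", str(top_p),
        "--max-new-tokens", str(max_new_tokens),
    ]


def run_full_pipeline(prompt: str) -> None:
    """Run the pipeline from the repository root; raises if the script exits non-zero."""
    subprocess.run(build_pipeline_command(prompt), check=True)
```

Passing the arguments as a list avoids shell quoting problems when the prompt contains spaces or quotes.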
Typical outputs written into pipeline_out/ are:
- timestamped base token text
- timestamped base MIDI
- timestamped enriched token text
- timestamped final MIDI
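Because each run writes several timestamped files, it can help to list the output directory with the newest files first. The sketch below makes no assumption about file names or extensions, which are not specified here, and sorts by modification time only; `list_outputs_newest_first` is a hypothetical helper and not part of the repository.

```python
from pathlib import Path
from typing import List


def list_outputs_newest_first(output_dir: str = "pipeline_out") -> List[Path]:
    """Return the files in output_dir sorted by modification time, newest first."""
    root = Path(output_dir)
    if not root.is_dir():
        # Nothing has been generated yet, or the path is wrong.
        return []
    files = [path for path in root.iterdir() if path.is_file()]
    return sorted(files, key=lambda path: path.stat().st_mtime, reverse=True)


if __name__ == "__main__":
    for path in list_outputs_newest_first():
        print(path)
```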
scripts/infer_llama_music.py is still useful if you want to inspect only the raw base-token model without the complementary stage.
Typical command:
python3 scripts/infer_llama_music.py \
--model-dir models/llama-8b-10585 \
--prompt "A calm piano melody in C major" \
--max-new-tokens 450
This standalone mode is mainly useful for debugging, token inspection, or comparing pre- and post-enrichment outputs.
A single model could try to generate note identity, timing, duration, and velocity all at once. We chose not to do that here.
The two-stage design has practical advantages:
- cleaner separation between composition and performance
- easier dataset preparation for the complementary task
- easier debugging when results sound wrong
- more flexibility when improving one stage without retraining everything
This repository is meant for:
- researchers exploring symbolic music generation
- developers who want to inspect the inference pipeline
- collaborators who need a simpler view of the BachGround text-to-piano stack
This repository is being cleaned up for open source use.
The models are present, and the inference path is the main priority. Documentation may continue to improve as the repository becomes more polished.
- Some scripts assume a Python environment with torch, transformers, peft, and MIDI-related dependencies installed.
- The complementary transformer is included mainly for inference use in this repository.
- Large training datasets are intentionally not included here.
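A quick way to confirm that the packages named above are importable before running the scripts is sketched below. It checks only `torch`, `transformers`, and `peft`, because the specific MIDI-related dependencies are not named here; `missing_packages` is a hypothetical helper, not part of the repository.

```python
import importlib.util
from typing import Iterable, List


def missing_packages(
    names: Iterable[str] = ("torch", "transformers", "peft"),
) -> List[str]:
    """Return the top-level package names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All checked packages are importable.")
```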
Licensing is split by artifact type:
- Repository code: LICENSE_PLACEHOLDER_CODE
- Llama adapter weights: distributed separately on Hugging Face under the applicable Llama 3.1 license terms
- Complementary transformer weights: LICENSE_PLACEHOLDER_COMPLEMENTARY
If you publish the model artifacts, replace the placeholders above and add the final Hugging Face links in the Model Downloads section.
If you reference or use this repository, please mention that it is part of the BachGround project.
