A data-driven approach to evolving optimal keyboard layouts using empirical typing performance measurements.
This repository implements a Genetic Algorithm (GA) system that optimizes keyboard layouts based on real typing performance data collected from human participants. The system minimizes corpus typing time by evolving arrangements of 46 physical key symbols using measured unigram, bigram, and optional trigram timings.
This project is part of the BagreType research initiative, developed by:
- Samuel Marcio Fonseca Santos (RA: 30599661)
- JoΓ£o Pedro de Souza Letro (RA: 25851691)
8th semester students at Universidade de Franca (UniFran)
Graduation Project (TCC)
- π BagreType Platform: bagretype.com - Web platform for collecting typing performance data from participants
- π¦ BagreType Repository: github.com/SamuelMFS/BagreType - Main project repository with data collection system
- π Technical Overview (English): Article.md - Comprehensive technical specification
- π VisΓ£o TΓ©cnica (PortuguΓͺs): Article_pt_BR.md - EspecificaΓ§Γ£o tΓ©cnica completa
The keyboard layout optimization problem involves searching through 46! β 5.5Γ10^55 possible permutationsβa space so large that exhaustive enumeration is computationally infeasible (see Article.md for detailed analysis). This system uses a Genetic Algorithm to efficiently discover high-quality layouts by:
- Parsing empirical timing data from BagreType to construct probability distributions over typing latencies
- Computing corpus-level n-gram frequencies from target text corpora
- Evaluating candidate layouts via a weighted cost function using these distributions
- Evolving populations through tournament selection, Order Crossover (OX), swap mutation, and elitism
- Visualizing results with fitness evolution plots, per-key cost heatmaps, and timing analysis
python -m venv .venv
source .venv/bin/activate # On Windows: .venv\Scripts\activate
pip install -r requirements.txtTyping Data: CSV file with a column named typing_data containing JSON arrays of timing records. Data can be collected using the BagreType platform or provided manually.
Example JSON structure:
[
{
"sequence": "h",
"letterTimings": [
{"letter": "h", "reactionTime": 834}
],
"totalSequenceTime": 834
},
{
"sequence": "th",
"letterTimings": [
{"letter": "t", "reactionTime": 210},
{"letter": "h", "reactionTime": 230}
],
"totalSequenceTime": 440
}
]Text Corpus: A text file containing the target corpus for optimization (e.g., Portuguese literature, English articles).
python -m ga_keyboard.main \
--csv /home/xamu/dev/ML/data/typing_test_data_rows\ \(2\).csv \
--csv-json-col typing_data \
--corpus /home/xamu/dev/ML/data/machado.txt \
--generations 300 \
--population 200 \
--mutation-rate 0.1 \
--crossover-rate 0.7 \
--elitism 5 \
--use-trigrams false \
--cost-order bi \
--fallback-to-unigrams falseOutputs are saved to outputs/:
best_layout.txt- ASCII rendering of the optimized layoutfitness.png- Evolution of fitness across generationsheatmap.png- Per-key cost visualizationunigram_timing_heatmap.png- Average unigram timing per physical keybigram_timing_heatmap.png- Average bigram timing per physical keycorpus_character_frequencies.txt- Character frequency statistics
- BagreType Data Collection: Integrates with typing performance data collected via bagretype.com
- Flexible Data Sources: Supports CSV files with JSON timing arrays
- Data Mixing: Option to merge datasets using
--mix-with-typing-testflag
- Tournament Selection: Tunable selective pressure via tournament size
- Order Crossover (OX): Provably correct permutation crossover operator
- Swap Mutation: Local search with configurable mutation rate
- Elitism: Preserves best individuals across generations
- Fitness Evolution: Track convergence across generations
- Timing Heatmaps: Cyan-gradient heatmaps showing per-key timing costs
- Cost Analysis: Per-key cost approximation for evolved layouts
- ASCII Layout Renderer: Human-readable keyboard layout visualization
- N-gram Support: Unigrams, bigrams, and optional trigrams
- Flexible Cost Orders: Choose which n-gram order to use exclusively (
--cost-order uni|bi|tri) - Graceful Missing Data: Configurable backoff strategies for sparse timing data
python -m ga_keyboard.main [OPTIONS]
Required:
--csv PATH Path to CSV file with typing data
--corpus PATH Path to text corpus file
Optional:
--csv-json-col NAME Column name with JSON timing array (default: typing_data)
--generations N Number of evolution generations (default: 300)
--population N Population size (default: 200)
--mutation-rate FLOAT Mutation probability per individual (default: 0.1)
--crossover-rate FLOAT Crossover probability (default: 0.7)
--elitism N Number of elites preserved (default: 5)
--use-trigrams BOOL Enable trigram optimization (default: false)
--cost-order ORDER Which n-gram order to use: uni|bi|tri (default: bi)
--fallback-to-unigrams Enable additive backoff for missing timings
--mix-with-typing-test Merge with typing_test.csv before processing
--seed N Random seed for reproducibility (default: 42)
--outdir PATH Output directory (default: outputs/)- Size: 46! = 55,026,221,598,120,889,498,503,054,288,002,548,929,616,517,529,600,000,000 (β 5.5Γ10^55)
- Complexity: Approximately 19 orders of magnitude larger than all computational operations possible since the Big Bang (see Article.md)
The expected typing time for a corpus under layout L:
Fitness is the inverse of cost: $$ \text{fitness}(L) = \begin{cases} 1 / C(L) & \text{if } C(L) > 0 \ 0 & \text{otherwise} \end{cases} $$
See Article.md for detailed mathematical formulation.
ga_keyboard/
βββ __init__.py # Package exports
βββ layout.py # Canonical key definitions, ASCII rendering
βββ typing_data.py # CSV/JSON parsing, timing aggregation, data merging
βββ corpus.py # Corpus loading, n-gram counting
βββ corpus_stats.py # Character frequency analysis
βββ fitness.py # Cost computation, fitness evaluation
βββ ga.py # Population init, selection, crossover, mutation, evolution
βββ viz.py # Heatmaps, plots, sparklines
βββ main.py # CLI entrypoint, orchestration
scripts/
βββ generate_synthetic_typing_csv.py # Generate synthetic timing data for testing
βββ fuse_typing_csvs.py # Merge multiple typing CSV files
data/ # Input data directory
outputs/ # Generated results directory
For testing and development, synthetic timing data can be generated:
python scripts/generate_synthetic_typing_csv.py \
--out data/synthetic_typing.csv \
--row-base-numbers 220.0 \
--row-base-top 170.0 \
--row-base-home 140.0 \
--row-base-bottom 180.0 \
--noise-std 0.0 \
--seed 1234This generates a complete dataset with all 46 unigrams and 2,116 bigrams (46Γ46) with realistic timing values based on key positions.
- Goldberg, D. E. (1989). Genetic Algorithms in Search, Optimization, and Machine Learning. Addison-Wesley.
- Davis, L. (1991). Handbook of Genetic Algorithms. Van Nostrand Reinhold.
This project is developed as part of the Graduation Thesis at Universidade de Franca (UniFran).
This is an academic research project. Contributions and suggestions are welcome!
For questions or collaboration inquiries, please refer to the main BagreType repository.
- Samuel Marcio Fonseca Santos: GitHub
- JoΓ£o Pedro de Souza Letro: Co-author
- Project Website: bagretype.com
- Main Repository: github.com/SamuelMFS/BagreType
