My personal implementation of an LLM Tokenizer.
The scope of this repo is not settled yet. For now I'm following the lecture and will decide where to go next afterwards.
Developed with Python 3.11.6.
This project was mainly inspired by the great video lectures on Neural Networks by Andrej Karpathy.
- Neural Networks: Zero to Hero
- OpenAI Tokenizer
- TikTokenizer Webapp
- tinyshakespeare Data (from Andrej Karpathy's char-rnn repo)
- A Programmer's Introduction to Unicode
- The Unicode Standard
- Python3: Unicode HOWTO
- Wikipedia
- Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., & Sutskever, I. (2019). Language Models are Unsupervised Multitask Learners. arXiv preprint arXiv:1906.05231. https://doi.org/10.48550/arXiv.1906.05231
- Touvron, H., Martin, L., Stone, K., ... & Scialom, T. (2023). Llama 2: Open Foundation and Fine-Tuned Chat Models. arXiv preprint arXiv:2307.09288.