Mistral and Mixtral implemented from scratch, with detailed documentation.

References for each building block (minimal illustrative sketches of the components follow this list):
- Mistral: https://huggingface.co/docs/transformers/en/model_doc/mistral, https://arxiv.org/abs/2310.06825
- Mixtral MoE (Mixtral of Experts): https://huggingface.co/docs/transformers/en/model_doc/mixtral, https://arxiv.org/abs/2401.04088
- RoPE (Rotary Position Embedding): https://arxiv.org/abs/2104.09864
- RMSNorm (Root Mean Square Layer Normalization): https://arxiv.org/abs/1910.07467
- MHA (Multi-Head Attention, one KV head per query head): https://arxiv.org/abs/1706.03762
- MQA (Multi-Query Attention, a single KV head shared by all query heads): https://arxiv.org/abs/1911.02150
- GQA (Grouped-Query Attention, fewer KV heads than query heads, each shared by a group of query heads; used by Mistral, see the attention sketch below): https://arxiv.org/abs/2305.13245
- KVCache: https://medium.com/@joaolages/kv-caching-explained-276520203249
- SiLU (Sigmoid Linear Unit activation): https://arxiv.org/abs/1702.03118
- SwiGLU activation: https://arxiv.org/abs/2002.05202
- Inference: https://medium.com/@javaid.nabi/all-you-need-to-know-about-llm-text-generation-03b138e0ed19
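
RoPE rotates each (query, key) pair of channels in the complex plane by an angle proportional to the token position, so relative positions show up directly in the dot products. A minimal sketch, assuming PyTorch and the usual complex-number formulation (function and parameter names are illustrative, not necessarily this repo's exact code):

```python
import torch

def precompute_rope_freqs(head_dim: int, max_seq_len: int, theta: float = 10000.0) -> torch.Tensor:
    # One rotation angle per position m and per frequency index k: m * theta^(-2k/head_dim).
    inv_freq = 1.0 / (theta ** (torch.arange(0, head_dim, 2).float() / head_dim))  # (head_dim/2,)
    angles = torch.outer(torch.arange(max_seq_len).float(), inv_freq)              # (seq, head_dim/2)
    return torch.polar(torch.ones_like(angles), angles)                            # complex rotations e^{i*angle}

def apply_rope(x: torch.Tensor, freqs: torch.Tensor) -> torch.Tensor:
    # x: (batch, seq, n_heads, head_dim); pair up the last dim and rotate each pair in the complex plane.
    x_complex = torch.view_as_complex(x.float().reshape(*x.shape[:-1], -1, 2))
    freqs = freqs[: x.shape[1]].unsqueeze(0).unsqueeze(2)                          # (1, seq, 1, head_dim/2)
    return torch.view_as_real(x_complex * freqs).flatten(-2).type_as(x)
```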
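RMSNorm only rescales by the root mean square of the activations: no mean subtraction and no bias, which makes it cheaper than LayerNorm. A minimal sketch, assuming PyTorch:

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))  # learnable gain

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Normalize over the last dimension by 1 / RMS(x), then apply the gain.
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * (x * rms)
```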
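One attention module covers all three head layouts above: with n_kv_heads == n_heads it is MHA, with n_kv_heads == 1 it is MQA, and anything in between is GQA. A hedged sketch (the wq/wk/wv/wo names are illustrative; RoPE and the KV cache are omitted to keep it short):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def repeat_kv(x: torch.Tensor, n_rep: int) -> torch.Tensor:
    # Expand each KV head n_rep times so K/V line up with the query heads.
    b, s, n_kv, d = x.shape
    return x[:, :, :, None, :].expand(b, s, n_kv, n_rep, d).reshape(b, s, n_kv * n_rep, d)

class Attention(nn.Module):
    def __init__(self, dim: int, n_heads: int, n_kv_heads: int):
        super().__init__()
        assert n_heads % n_kv_heads == 0
        self.n_heads, self.n_kv_heads = n_heads, n_kv_heads
        self.head_dim = dim // n_heads
        self.wq = nn.Linear(dim, n_heads * self.head_dim, bias=False)
        self.wk = nn.Linear(dim, n_kv_heads * self.head_dim, bias=False)
        self.wv = nn.Linear(dim, n_kv_heads * self.head_dim, bias=False)
        self.wo = nn.Linear(n_heads * self.head_dim, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, s, _ = x.shape
        q = self.wq(x).view(b, s, self.n_heads, self.head_dim)
        k = self.wk(x).view(b, s, self.n_kv_heads, self.head_dim)
        v = self.wv(x).view(b, s, self.n_kv_heads, self.head_dim)
        # (RoPE would be applied to q and k here.)
        k = repeat_kv(k, self.n_heads // self.n_kv_heads)
        v = repeat_kv(v, self.n_heads // self.n_kv_heads)
        q, k, v = (t.transpose(1, 2) for t in (q, k, v))  # (b, heads, s, head_dim)
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.wo(out.transpose(1, 2).reshape(b, s, -1))
```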
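The KV cache stores every past key/value tensor so that each new token only has to compute its own K/V and attend over the cache, instead of re-running the whole prefix. A naive sketch (Mistral additionally uses a sliding-window rolling buffer on top of this idea):

```python
import torch

class KVCache:
    # Keys/values stored as (batch, n_kv_heads, seq_so_far, head_dim), grown once per decoding step.
    def __init__(self):
        self.k = None
        self.v = None

    def update(self, k_new: torch.Tensor, v_new: torch.Tensor):
        if self.k is None:
            self.k, self.v = k_new, v_new
        else:
            self.k = torch.cat([self.k, k_new], dim=2)  # append along the sequence axis
            self.v = torch.cat([self.v, v_new], dim=2)
        return self.k, self.v  # attention then runs over the full cached sequence
```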
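SiLU is silu(z) = z * sigmoid(z); SwiGLU uses it as the gate of a gated linear unit inside the feed-forward block. A minimal sketch with the usual three-projection layout (the w1/w3/w2 gate/up/down names are illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFeedForward(nn.Module):
    # FFN(x) = W2( SiLU(W1 x) * W3 x )
    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.w1 = nn.Linear(dim, hidden_dim, bias=False)  # gate projection
        self.w3 = nn.Linear(dim, hidden_dim, bias=False)  # up projection
        self.w2 = nn.Linear(hidden_dim, dim, bias=False)  # down projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w2(F.silu(self.w1(x)) * self.w3(x))
```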
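Mixtral replaces the dense feed-forward block with a sparse mixture of experts: a router scores all experts per token, keeps the top-k (8 experts, top-2 in the paper), and mixes their outputs with the softmaxed router scores. A hedged sketch that reuses the SwiGLUFeedForward class from the previous block as the expert network:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoEBlock(nn.Module):
    # Experts are SwiGLUFeedForward modules (defined in the sketch above).
    def __init__(self, dim: int, hidden_dim: int, n_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.gate = nn.Linear(dim, n_experts, bias=False)  # router
        self.experts = nn.ModuleList([SwiGLUFeedForward(dim, hidden_dim) for _ in range(n_experts)])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, s, d = x.shape
        flat = x.reshape(-1, d)                                     # (tokens, dim)
        scores, selected = torch.topk(self.gate(flat), self.top_k, dim=-1)
        weights = F.softmax(scores, dim=-1)                         # normalize over the chosen experts only
        out = torch.zeros_like(flat)
        for i, expert in enumerate(self.experts):
            token_idx, slot = torch.where(selected == i)            # tokens routed to expert i, and in which slot
            if token_idx.numel() > 0:
                out[token_idx] += weights[token_idx, slot].unsqueeze(-1) * expert(flat[token_idx])
        return out.view(b, s, d)
```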
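Inference is an autoregressive loop: run the model, take the logits at the last position, choose the next token (greedy or sampled), append it, and repeat; the KV cache above is what keeps each step cheap. A hedged sketch assuming `model(input_ids)` returns logits of shape (batch, seq, vocab):

```python
import torch

@torch.no_grad()
def generate(model, input_ids: torch.Tensor, max_new_tokens: int,
             temperature: float = 0.7, top_k: int = 50) -> torch.Tensor:
    for _ in range(max_new_tokens):
        logits = model(input_ids)[:, -1, :]                  # next-token logits: (batch, vocab)
        if temperature == 0.0:
            next_token = logits.argmax(dim=-1, keepdim=True)              # greedy decoding
        else:
            logits = logits / temperature
            if top_k > 0:
                kth = torch.topk(logits, top_k).values[:, -1, None]
                logits = logits.masked_fill(logits < kth, float("-inf"))  # keep only the top-k logits
            probs = torch.softmax(logits, dim=-1)
            next_token = torch.multinomial(probs, num_samples=1)          # sample one token per sequence
        input_ids = torch.cat([input_ids, next_token], dim=1)
    return input_ids
```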