KookaS/ml

Machine Learning

Modern implementations of LLM architectures, sharding strategies, and kernel optimizations.

Core

Transformer Architecture

Positional Encoder

  • Positional Encoder Sinusoidal in NumPy
  • RoPE in NumPy
  • RoPE GPT-NeoX in NumPy
📊 Positional encoding visualizations: sinusoidal, RoPE, and RoPE GPT-NeoX (images in the repository).
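As a rough sketch of what the linked implementations compute (not the repository's code itself), here is a minimal sinusoidal positional encoding and an interleaved-pair RoPE in NumPy:

```python
import numpy as np

def sinusoidal_pe(seq_len, d_model, base=10000.0):
    # PE[pos, 2i]   = sin(pos / base^(2i/d_model))
    # PE[pos, 2i+1] = cos(pos / base^(2i/d_model))
    pos = np.arange(seq_len)[:, None]
    i = np.arange(d_model // 2)[None, :]
    angles = pos / base ** (2 * i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

def rope(x, base=10000.0):
    # Rotate each consecutive pair (x[2i], x[2i+1]) by a
    # position- and frequency-dependent angle theta.
    seq_len, d = x.shape
    pos = np.arange(seq_len)[:, None]
    theta = pos / base ** (2 * np.arange(d // 2)[None, :] / d)
    cos, sin = np.cos(theta), np.sin(theta)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out
```

Note the difference captured here: the sinusoidal encoding is added to embeddings, while RoPE rotates query/key vectors in place, so it preserves norms and encodes relative position in dot products. (The GPT-NeoX variant rotates half-split rather than interleaved pairs.)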

Flash Attention

  • Flash Attention v1 and v2 in PyTorch
📊 Flash Attention Visualizations

See image/flash_attention/README.md for details.

Plots: performance vs block size, HBM vs SRAM memory traffic, tile size (Br × Bc) vs latency heatmap, and FA1 vs FA2 theoretical HBM accesses.
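The core trick behind both FA1 and FA2 is tiling with an online softmax: process K/V in blocks, keeping a running row max and normalizer so the full score matrix never materializes. A minimal NumPy sketch of that idea (single head, no masking, not the repository's PyTorch kernels):

```python
import numpy as np

def attention_ref(Q, K, V):
    # Standard attention for comparison.
    s = Q @ K.T / np.sqrt(Q.shape[-1])
    p = np.exp(s - s.max(axis=-1, keepdims=True))
    return (p / p.sum(axis=-1, keepdims=True)) @ V

def flash_attention(Q, K, V, block=2):
    # Tiled attention with online softmax:
    # m = running row max, l = running normalizer, O = unnormalized output.
    n, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    O = np.zeros((n, d))
    m = np.full((n, 1), -np.inf)
    l = np.zeros((n, 1))
    for j in range(0, K.shape[0], block):
        Kj, Vj = K[j:j + block], V[j:j + block]
        S = Q @ Kj.T * scale
        m_new = np.maximum(m, S.max(axis=-1, keepdims=True))
        P = np.exp(S - m_new)
        correction = np.exp(m - m_new)   # rescale old partial sums
        l = l * correction + P.sum(axis=-1, keepdims=True)
        O = O * correction + P @ Vj
        m = m_new
    return O / l   # FA2-style: normalize once at the end
```

Deferring the division by `l` to the end is the FA2 refinement that the FA1-vs-FA2 HBM-access plot above reflects.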

Sharding strategies

Scaling plots

The following are roofline analyses for different architectures, assuming non-fused operations.

  • MLP roofline analysis in NumPy
  • Multi-Head Attention roofline analysis in NumPy
📊 Roofline plots: MLP and multi-head attention (images in the repository).
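The quantity behind these plots is arithmetic intensity: FLOPs per byte moved to and from HBM. A minimal sketch for a non-fused GEMM (the hardware peak numbers in the usage note are hypothetical, not from the repository):

```python
def gemm_intensity(m, n, k, bytes_per_el=2):
    # Non-fused (m x k) @ (k x n) GEMM: 2mnk FLOPs,
    # reading A and B and writing C once each.
    flops = 2 * m * n * k
    bytes_moved = bytes_per_el * (m * k + k * n + m * n)
    return flops / bytes_moved

def attainable_tflops(intensity, peak_tflops, mem_bw_tbps):
    # Roofline: throughput is capped by compute or by bandwidth x intensity.
    return min(peak_tflops, mem_bw_tbps * intensity)
```

For example, a decode-style matvec (`m=1`) has intensity near 1 FLOP/byte and sits on the bandwidth roof, while a square 4096 GEMM reaches roughly 4096/3 FLOPs/byte and is typically compute-bound.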

NumPy Tutorial

JAX Tutorial

PyTorch Notes

  • Torch distributed API.
  • prefer the in-place, single-tensor collectives such as dist.all_gather_into_tensor and dist.reduce_scatter_tensor, which concatenate/aggregate along the primary dimension, over the old list-based primitives.
  • custom ops for training require subclassing torch.autograd.Function, defining @staticmethod forward/backward, and using ctx.save_for_backward to stash tensors for the backward pass.
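The last note can be sketched as a minimal custom op; an illustrative example, not code from the repository:

```python
import torch

class Square(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        # Stash the input for the backward pass.
        ctx.save_for_backward(x)
        return x * x

    @staticmethod
    def backward(ctx, grad_out):
        (x,) = ctx.saved_tensors
        # d(x^2)/dx = 2x, chained with the incoming gradient.
        return grad_out * 2 * x

x = torch.tensor([3.0], requires_grad=True)
y = Square.apply(x)
y.backward()
```

Custom functions are invoked via `.apply`, never by calling `forward` directly, so autograd records the node in the graph.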
