MonkeyAndCoder/nanoPyC

This repo is an introduction to compiling numerical Python programs, that is, to building a Python compiler. The source code is kept as simple as possible, and each day is a small step forward from the previous one. We start by hand-coding simple functions and gradually move toward generating and optimizing them automatically with a compiler.

Following the tutorial, you will learn:

  • Basic CUDA programming
  • How to accelerate your numerical code using parallelization, fusion and tiling
  • How a compiler automatically generates and optimizes numerical code/loops

The end result is, given a function like the following:

def foo(a):
    b = exp(a)
    # keepdims=True keeps c at shape (N, 1) so that b / c broadcasts per row
    c = b.sum(axis=1, keepdims=True)
    d = b / c
    return d

You will know how to make this code run much faster on a CPU or GPU by adding just one line:

@compiler.jit
def foo(a):
    b = exp(a)
    c = b.sum(axis=1, keepdims=True)
    d = b / c
    return d

and understand why it runs faster!

Prerequisites

  • PyTorch
  • CuPy

You will also need an NVIDIA GPU to run the code. For now, we generate CUDA code and compute on the GPU by default.

Day 1

Implement a JIT compiler using a Python decorator!
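To make the decorator idea concrete, here is a minimal sketch of what a JIT-style decorator could look like. It is not the repo's implementation: the names `jit`, `compile_for`, and `add_one` are hypothetical, and the "compilation" step is a placeholder that returns the original function unchanged. The point is only the control flow: intercept the call, compile once per argument signature, then reuse the cached result.

```python
import functools


def jit(fn):
    """Hypothetical JIT decorator: compile on first call, then reuse the result."""
    cache = {}

    def compile_for(signature):
        # Placeholder for a real compiler. A real one would generate and build
        # a kernel specialized for this signature; here we return fn unchanged.
        return fn

    @functools.wraps(fn)
    def wrapper(*args):
        # Specialize on argument type names, a stand-in for dtype and shape.
        signature = tuple(type(a).__name__ for a in args)
        if signature not in cache:
            cache[signature] = compile_for(signature)
        return cache[signature](*args)

    wrapper.cache = cache  # exposed only so the cache can be inspected
    return wrapper


@jit
def add_one(x):
    return x + 1
```

Calling `add_one(1)` and then `add_one(2.5)` would fill the cache with two entries, one per argument type; later calls with the same types skip the compile step.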

Day 2

Implement a simple element-wise exp function over a matrix in CUDA!

Day 3

Make the exp kernel more efficient by exposing more parallelism! At this point the performance already matches cuBLAS.

Day 4

Simplify the kernel code by using 2D partitioning. The pitfall to avoid is mapping the rows to the x dimension: threads that are adjacent in x should touch adjacent elements in memory.
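The pitfall can be illustrated without a GPU. The sketch below assumes a row-major matrix (the default contiguous layout in PyTorch and CuPy) and computes which element offsets a few threads with consecutive x indices would touch. The helper names `flat_index` and `offsets_along_x` are made up for this illustration.

```python
def flat_index(row, col, num_cols):
    """Offset of element (row, col) in a row-major matrix."""
    return row * num_cols + col


def offsets_along_x(num_cols, rows_on_x, lanes=4):
    """Element offsets touched by threads with x = 0, 1, ..., lanes - 1."""
    if rows_on_x:
        # x selects the row: neighbouring threads are num_cols elements apart.
        return [flat_index(x, 0, num_cols) for x in range(lanes)]
    # x selects the column: neighbouring threads touch neighbouring elements.
    return [flat_index(0, x, num_cols) for x in range(lanes)]


coalesced = offsets_along_x(num_cols=1024, rows_on_x=False)
strided = offsets_along_x(num_cols=1024, rows_on_x=True)
```

With columns on x, the offsets are consecutive, so a warp can be served by a few wide memory transactions. With rows on x, each thread is a full row apart, and the accesses can no longer be coalesced.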

Day 5

First taste of fusion by creating a fused exp-div kernel!
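As a rough CPU-side picture of what fusion means, the sketch below compares two separate passes (exp, then divide) with a single fused pass. It uses plain Python lists rather than CUDA, and the function names are hypothetical. On a GPU, the unfused version corresponds to two kernel launches with the intermediate `b` written to and read back from global memory.

```python
import math


def exp_then_div_unfused(a, c):
    b = [math.exp(x) for x in a]  # pass 1: materializes the intermediate b
    return [x / c for x in b]     # pass 2: reads b back


def exp_div_fused(a, c):
    # One pass: each element is loaded once and no intermediate is stored.
    return [math.exp(x) / c for x in a]
```

Both functions perform the same arithmetic per element, so they return identical results; the fused one simply moves less data.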

Day 6

Introduce reduction through a simple implementation of softmax.
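For reference, this is a plain-Python sketch of the row-wise softmax that the `foo` example computes, with the reduction step called out. It assumes a list-of-lists input and, like `foo`, does not subtract the row maximum first, which numerically robust versions usually do. The name `softmax_rows` is hypothetical.

```python
import math


def softmax_rows(a):
    """Row-wise softmax over a list of rows."""
    out = []
    for row in a:
        exps = [math.exp(x) for x in row]
        total = sum(exps)  # the reduction: many values collapse into one
        out.append([e / total for e in exps])
    return out
```

The element-wise steps parallelize trivially, but `sum` makes every output in a row depend on every input in that row, which is why reduction needs separate treatment.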

Day 7

Use parallel reduction and shared memory to make the softmax implementation more efficient.
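The access pattern of a tree-style parallel reduction can be simulated sequentially. In the sketch below, `buf` stands in for a shared-memory buffer and each iteration of the inner loop stands in for one thread; this is one common reduction scheme, not necessarily the exact one used in the repo, and `tree_reduce_sum` is a hypothetical name.

```python
def tree_reduce_sum(values):
    """Sum values with a halving-stride tree, as a thread block might."""
    buf = list(values)

    # Pad to a power of two so every step splits the buffer into equal halves.
    size = 1
    while size < len(buf):
        size *= 2
    buf += [0.0] * (size - len(buf))

    stride = size // 2
    while stride > 0:
        # The iterations below are independent of each other, so on a GPU
        # they would run as `stride` threads in parallel.
        for tid in range(stride):
            buf[tid] += buf[tid + stride]
        # A real kernel needs a barrier (__syncthreads()) before the next step.
        stride //= 2
    return buf[0]
```

For n elements this takes about log2(n) steps instead of n sequential additions, at the cost of a barrier between steps.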

Day 8

More parallel reduction.

Day 9

Make the softmax even more efficient by storing exp(a) in fast memory, which eliminates reloading it from global memory.

Day 10

Try out template-based dynamic code generation.
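One simple way to approach template-based generation is to fill a CUDA source template from Python. The sketch below uses the standard library's `string.Template`; the template text, the function `render_elementwise_kernel`, and the kernel name are illustrative assumptions, not the repo's actual templates.

```python
from string import Template

# Hypothetical template for an element-wise kernel over a flat float array.
KERNEL_TEMPLATE = Template('''
extern "C" __global__
void ${name}(const float* a, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i] = ${expr};
    }
}
''')


def render_elementwise_kernel(name, expr):
    """Return CUDA source with the kernel name and per-element expression filled in."""
    return KERNEL_TEMPLATE.substitute(name=name, expr=expr)


exp_kernel_src = render_elementwise_kernel("exp_kernel", "expf(a[i])")
```

The resulting string could then be handed to a runtime compiler such as `cupy.RawKernel(exp_kernel_src, "exp_kernel")`; that step needs a GPU and is left out here.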
