# CSE113: Parallel Programming

Feb, 28, 2024

- Topics:
  - Intro to GPUs



- Midterm is graded, you should be able to see your grades
  - The TAs will go over the questions and solutions in class next Wednesday with you.
  - This lecture will not be recorded and the slides will not be uploaded.
  - If you want to see your test, please come see me or the TAs in office hours.
  - Sadly we do these things due to academic integrity violations last year.

• Last day to turn in HW 3 was yesterday, hope you got it in!

• HW 4 is released today. We will discuss it in class a little bit today

We will plan for HW 5 to be released on Feb 11.

 I am gone tomorrow, so I will not have office hours, which is why I had them yesterday

Please talk to the TAs (in person) if you want to go over your midterm.

Probably last day on GPUs.

 This is just the tip of the iceburg! Read CUDA by example (free textbook linked on website). Lots of good examples online once you get the basics down.

- Rest of quarter:
  - Memory models
  - Barriers
  - Concurrent Set
  - Schedulers (if time)

# Previous quiz + Review

The host (CPU) will write a C++-like program that allocates and sets up memory on the GPU. The host will then call a GPU program called a kernel. Is this affirmation true?

○ True

○ False

GPUs come in two flavors



How do we allocate CPU memory on the host?

```
int *x = (int*) malloc(sizeof(int)*SIZE);
CPU
GPU

System Memory
PCIE
```

How do we allocate CPU memory on the host?



We need to allocate GPU memory on the host



We need to allocate GPU memory on the host



# Previous quiz + Review

How do we initialize memory for a variable we aim to use in the GPU computation?

- Using cudaMemcpy
- Using memcpy from C++
- O Just declaring a new variable

• Our heterogeneous, parallel, programming model

If we can't access d\_x on the CPU, how do we initialize the memory?

GPU has no access to input devices e.g. disk





# Previous quiz + Review

What keyword do we need to include for a function to be executed on the GPU using CUDA?

- \_\_kernel\_\_
- \_\_global\_\_
- \_\_this\_\_
- O \_\_gpu\_\_

# The GPU Program

```
__global___ void vector_add(int * a, int * b, int * c, int size) {
   for (int i = 0; i < size; i++) {
      a[i] = b[i] + c[i];
   }
}</pre>
```

# First parallelization attempt

```
__global___ void vector_add(int * d_a, int * d_b, int * d_c, int size) {
   int chunk_size = size/blockDim.x;
   int start = chunk_size * threadIdx.x;
   int end = start + end;
   for (int i = start; i < end; i++) {
      d_a[i] = d_b[i] + d_c[i];
   }
}</pre>
```

#### calling the function

```
vector_add<<<1,32>>>(d_a, d_b, d_c, size);
number of threads
thread id
```

### Kernel constraints

• What can and can't you do in a GPU kernel?

### Kernel constraints

- What can and can't you do in a GPU kernel?
  - Print?
  - File I/O?
  - C++ standard library? E.g. vectors?
  - Memory allocation?
  - Atomics?

# Previous quiz + Review

Using a few sentences, give examples of workloads that benefit from GPU parallelism.

# Recalling where we left off

# Programming a GPU

Tiny GPU in an embedded system



Fight!



https://www.prolast.com/prolast-elevated-boxing-rings-22-x-22/

Nvidia Jetson Nano (whole chip, CPU + GPU)

2 Billion transistors

10 TDP

Est. \$99

Intel i7-970

2.16 Billion
95 TDP

https://www.techpowerup.com/gpu-specs/geforce-940m.c2648
https://www.alibaba.com/product-detail/Intel-Core-i7-9700K-8-Cores 62512430487.html

Est. \$316

The CPU in my research workstation



Intel i7-9700K 2.16 Billion transistors 95 TDP Est. \$316

# Programming a GPU

• The problem: Vector addition

### Parallel Schedules



# Lets set up the CPU

• CPU code

# First GPU attempt

- Basic sequential GPU program
  - Issues?
  - Key insights?

woah, 32 cores!

We should parallelize our application!



# Second GPU attempt

- Use 32 cores
  - Issues?
  - Key insights?

# GPU Memory

#### **CPU Memory:**

Fast: Low Latency

Easily saturated: Low Bandwidth

Scales well: up to 1 TB

DDR

2-lane straight highway driven on by sports cars



Different technologies

#### **GPU Memory:**

slow: High Latency

hard to saturate: High Bandwidth

doesn't scale: 32 GB

GDDR, HBM

16-lane highway on a windy road driven by semi trucks

warp 0

**GPU** 

warp 0

all threads load from memory.

**GPU** 

warp 0

all threads load from memory.

600 cycles!

GPU

warp 0

warp 1

warp 2

We can hide latency through preemption and concurrency!

**GPU** 





warp 1

warp 2

We can hide latency through preemption and concurrency!

warp 0

**GPU** 



memory access 600 cycles

warp 1

warp 2

We can hide latency through preemption and concurrency!

warp 0

**GPU** 



memory access 600 cycles

warp 1

warp 2

warp 0

**GPU** 

**Graphics Memory** 



preempt warp 0 and put warp 1 on

We can hide latency through preemption and concurrency!

warp 2

warp 1

warp 0

GPU

Graphics Memory

memory access 600 cycles

warp 1

GPU

**Graphics Memory** 

warp 2

warp 0



preempt warp 1 and put warp 2 on

warp 2

**GPU** 

**Graphics Memory** 

warp 0

warp 1



memory access 600 cycles

warp 2

**GPU** 

**Graphics Memory** 

warp 0

warp 1



preempt warp 2 and put warp 0 on

#### Hey, my memory has arrived!

warp 0

**GPU** 

**Graphics Memory** 



preempt warp 2 and put warp 0 on

### Go back to our program

```
__global___ void vector_add(int * d_a, int * d_b, int * d_c, int size) {
   int chunk_size = size/blockDim.x;
   int start = chunk_size * threadIdx.x;
   int end = start + end;
   for (int i = start; i < end; i++) {
      d_a[i] = d_b[i] + d_c[i];
   }
}</pre>
```

#### calling the function

Lets launch with 32 warps

```
vector_add<<<1, 1024>>> (d_a, d_b, d_c, size);
```

# Concurrent warps

Lets try it! What do we think?

# Next steps

Back to the architecture

### These cores don't come for free...

Tiny GPU in an embedded system



Fight!



Nvidia Jetson Nano (whole chip, CPU + GPU)
2 Billion transistors
10 TDP

Est. \$99

The CPU in my research workstation



Intel i7-9700K 2.16 Billion transistors 95 TDP Est. \$316

### Optimizing memory accesses



### Optimizing memory accesses



this is the load/store unit. The hardware component responsible for issuing loads and stores.

Why doesn't every core have one?

### Optimizing memory accesses



This is the instruction cache... Why doesn't every core have a instruction buffer to keep track of its program?

this is the load/store unit. The hardware component responsible for issuing loads and stores.

Why doesn't every core have one?



Groups of 32 threads are called a "warp"

They are executed in lock-step, i.e. they all execute the same instruction at the same time



Groups of 32 threads are called a "warp"

They are executed in lock-step, i.e. they all execute the same instruction at the same time

```
int variable1 = b[0];
int variable2 = c[0];
int variable3 = variable1 + variable2;
a[0] = variable3;
```



Groups of 32 threads are called a "warp"

They are executed in lock-step, i.e. they all execute the same instruction at the same time

```
int variable1 = b[0];
int variable2 = c[0];
int variable3 = variable1 + variable2;
a[0] = variable3;
```



Groups of 32 threads are called a "warp"

They are executed in lock-step, i.e. they all execute the same instruction at the same time

instruction is fetched from the buffer and distributed to all the cores.

```
int variable1 = b[0];
int variable2 = c[0];
int variable3 = variable1 + variable2;
a[0] = variable3;
```



Groups of 32 threads are called a "warp"

They are executed in lock-step, i.e. they all execute the same instruction at the same time

Cores can a large register file they share expensive HW units (load/store and special functions)

```
int variable1 = b[0];
int variable2 = c[0];
int variable3 = variable1 + variable2;
a[0] = variable3;
```



Groups of 32 threads are called a "warp"

They are executed in lock-step, i.e. they all execute the same instruction at the same time

All cores need to wait until all cores finish the first instruction

```
int variable1 = b[0];
int variable2 = c[0];
int variable3 = variable1 + variable2;
a[0] = variable3;
```



Groups of 32 threads are called a "warp"

They are executed in lock-step, i.e. they all execute the same instruction at the same time

Start the next instruction.

#### **Program:**

```
int variable1 = b[0];
int variable2 = c[0];
int variable3 = variable1 + variable2;
a[0] = variable3;
```

Why would we have a programming model like this?



Groups of 32 threads are called a "warp"

They are executed in lock-step, i.e. they all execute the same instruction at the same time

Start the next instruction.

#### **Program:**

```
int variable1 = b[0];
int variable2 = c[0];
int variable3 = variable1 + variable2;
a[0] = variable3;
```

Why would we have a programming model like this? More cores (share program counters)

Can be efficient to share other hardware resources



#### Lets look closer at memory

#### **Program:**

```
int variable1 = b[0];
int variable2 = c[0];
int variable3 = variable1 + variable2;
a[0] = variable3;
```

4 cores are accessing memory. what happens if they access the same value?

**GPU Memory** 

Load Store Unit

T0

T1

T2

T3

All read the same value

**GPU Memory** 

**Load Store Unit** 

T0 T1 T2 T3

a[0] a[0] a[0] a[0]

#### All read the same value

This is efficient: the load store unit can ask for the value and then broadcast it to all cores.



#### All read the same value

This is efficient: the load store unit can ask for the value and then broadcast it to all cores.

1 request to GPU memory

Efficient, but probably not too common.



Read contiguous values

**GPU Memory** 

Load Store Unit

T0 T1 T2 T3

a[0] a[1] a[2] a[3]

### **Read contiguous values**

Like the CPU cache, the Load/Store Unit reads in memory in chunks. 16 bytes

### **GPU Memory**

**Load Store Unit** 

T0 T1 T2 T3

a[0] a[1] a[2] a[3]

### **Read contiguous values**

Like the CPU cache, the Load/Store Unit reads in memory in chunks. 16 bytes (for this example)



### **Read contiguous values**

Like the CPU cache, the Load/Store Unit reads in memory in chunks. 16 bytes

Can easily distribute the values to the threads



### **Read contiguous values**

Like the CPU cache, the Load/Store Unit reads in memory in chunks. 16 bytes

Can easily distribute the values to the threads

1 request to GPU memory



**Read non-contiguous values** 

Not good!

Accesses are Serialized.
You need 4 requests to GPU memory

**GPU Memory** 

**Load Store Unit** 

T0 T1 T2 T3

a[x] a[y] a[z] a[w]

**Read non-contiguous values** 

Not good!



**Read non-contiguous values** 

Not good!



**Read non-contiguous values** 

Not good!



**Read non-contiguous values** 

Not good!



### Go back to our program

vector add<<<1,32>>>(d a, d b, d c, size);

```
__global___ void vector_add(int * d_a, int * d_b, int * d_c, int size) {
    int chunk_size = size/blockDim.x;
    int start = chunk_size * threadIdx.x;
    int end = start + end;
    for (int i = start; i < end; i++) {
        d_a[i] = d_b[i] + d_c[i];
    }
}
calling the function</pre>
```

#### Chunked Pattern



#### Chunked Pattern

the first element accessed by the 4 threads sharing a load store unit. What sort of access is this?

Computation can easily be divided into threads

Thread 0 - Blue

Thread 1 - Yellow

Thread 2 - Green

Thread 3 - Orange



#### Chunked Pattern

the first element accessed by the 4 threads sharing a load store unit. What sort of access is this?

Computation can easily be divided into threads

Thread 0 - Blue

Thread 1 - Yellow

Thread 2 - Green

Thread 3 - Orange



How can we fix this

#### Stride Pattern



#### Stride Pattern



### Go back to our program

```
__global___ void vector_add(int * d_a, int * d_b, int * d_c, int size) {
   int chunk_size = size/blockDim.x;
   int start = chunk_size * threadIdx.x;
   int end = start + end;
   for (int i = start; i < end; i++) {
      d_a[i] = d_b[i] + d_c[i];
   }
}</pre>
```

calling the function

Lets change this to a stride pattern

```
vector_add<<<1,1024>>>(d_a, d_b, d_c, size);
```

### Go back to our program

```
__global___ void vector_add(int * d_a, int * d_b, int * d_c, int size) {
   for (int i = threadIdx.x; i < size; i+=blockDim.x) {
      d_a[i] = d_b[i] + d_c[i];
   }
}</pre>
```

#### calling the function

```
vector_add<<<1,1024>>>(d_a, d_b, d_c, size);
```

#### Stride Pattern



# Coalesced memory accesses

Lets try it! What do we think?

# Coalesced memory accesses

Lets try it! What do we think?



What else can we do?

## Multiple streaming multiprocessors

We've been talking only about 1 streaming multiprocessor, most GPUs have multiple SMs big ML GPUs have 80.



### Multiple streaming multiprocessors

We've been talking only about 1 streaming multiprocessor, most GPUs have multiple SMs big ML GPUs have 80. This little GPU has 4







|      | Instruction Buffer              |      |      |       |     |  |  |  |
|------|---------------------------------|------|------|-------|-----|--|--|--|
|      | Warp Scheduler                  |      |      |       |     |  |  |  |
| Di   | Dispatch Unit Dispatch Unit     |      |      |       |     |  |  |  |
|      | Register File (16,384 x 32-bit) |      |      |       |     |  |  |  |
| Core | ore Core Core LD/ST SFL         |      |      |       |     |  |  |  |
| Core | Core                            | Core | Core | LD/ST | SFU |  |  |  |
| Core | Core                            | Core | Core | LD/ST | SFU |  |  |  |
| Core | Core                            | Core | Core | LD/ST | SFU |  |  |  |
| Core | Core                            | Core | Core | LD/ST | SFU |  |  |  |
| Core | Core                            | Core | Core | LD/ST | SFU |  |  |  |
| Core | Core                            | Core | Core | LD/ST | SFU |  |  |  |
| Core | Core                            | Core | Core | LD/ST | SFU |  |  |  |

## Multiple streaming multiprocessors

CUDA provides virtual streaming multiprocessors called **blocks** Very efficient at launching and joining **blocks**.

No limit on blocks: launch as many as you need to map 1 thread to 1 data element

| Instruction Buffer |                             |            |          |         |     |  |  |  |
|--------------------|-----------------------------|------------|----------|---------|-----|--|--|--|
|                    | Warp Scheduler              |            |          |         |     |  |  |  |
| Di                 | Dispatch Unit Dispatch Unit |            |          |         |     |  |  |  |
|                    | Regist                      | er File (* | 16,384 x | 32-bit) |     |  |  |  |
| Core               | Core Core Core LDIST SFU    |            |          |         |     |  |  |  |
| Core               | Core                        | Core       | Core     |         | SFU |  |  |  |
| Core               | Core                        | Core       | Core     |         | SFU |  |  |  |
| Core               | Core                        | Core       | Core     |         | SFU |  |  |  |
| Core               | Core                        | Core       | Core     | LD/ST   | SFU |  |  |  |
| Core               | Core                        | Core       | Core     | LD/ST   | SFU |  |  |  |
| Core               | Core                        | Core       | Core     | LD/ST   | SFU |  |  |  |
| Core               | Core                        | Core       | Core     |         | SFU |  |  |  |

|      | 1          | nstructi  | on Buffe | ır          |     |
|------|------------|-----------|----------|-------------|-----|
|      |            | Warp So   | cheduler |             |     |
| D    | ispatch Un | it        |          | Dispatch Ur | iit |
|      | Regist     | er File ( | 16,384 x | 32-bit)     |     |
| Core | Core       | Core      | Core     | LD/ST       | SFU |
| Core | Core       | Core      | Core     |             | SFU |
| Core | Core       | Core      | Core     |             | SFU |
| Core | Core       | Core      | Core     |             | SFU |
| Core | Core       | Core      | Core     |             | SFU |
| Core | Core       | Core      | Core     |             | SFU |
| Core | Core       | Core      | Core     |             | SFU |
| Core | Core       | Core      | Core     | LD/ST       | SFU |

| Instruction Buffer |                                 |                       |         |             |     |  |  |  |
|--------------------|---------------------------------|-----------------------|---------|-------------|-----|--|--|--|
|                    |                                 | Warp So               | heduler |             |     |  |  |  |
| D                  | ispatch Un                      | it                    | ı       | Dispatch Ur | iit |  |  |  |
|                    | Register File (16,384 x 32-bit) |                       |         |             |     |  |  |  |
| Core               | Core                            | e Core Core LD/ST SFU |         |             |     |  |  |  |
| Core               | Core                            | Core                  | Core    |             | SFU |  |  |  |
| Core               | Core                            | Core                  | Core    |             | SFU |  |  |  |
| Core               | Core                            | Core                  | Core    |             | SFU |  |  |  |
| Core               | Core                            | Core                  | Core    |             | SFU |  |  |  |
| Core               | Core                            | Core                  | Core    |             | SFU |  |  |  |
| Core               | Core                            | Core                  | Core    |             | SFU |  |  |  |
| Core               | Core                            | Core                  | Core    | LD/ST       | SFU |  |  |  |

| Instruction Buffer       |            |            |          |             |     |  |  |
|--------------------------|------------|------------|----------|-------------|-----|--|--|
|                          |            | Warp So    | heduler  |             |     |  |  |
| D                        | ispatch Un | it         | ı        | Dispatch Ur | iit |  |  |
|                          | Regist     | er File (* | 16,384 x | 32-bit)     |     |  |  |
| Core Core Core LD/ST SFU |            |            |          |             |     |  |  |
| Core                     | Core       | Core       | Core     |             | SFU |  |  |
| Core                     | Core       | Core       | Core     |             | SFU |  |  |
| Core                     | Core       | Core       | Core     |             | SFU |  |  |
| Core                     | Core       | Core       | Core     |             | SFU |  |  |
| Core                     | Core       | Core       | Core     |             | SFU |  |  |
| Core                     | Core       | Core       | Core     |             | SFU |  |  |
| Core                     | Core       | Core       | Core     |             | SFU |  |  |

| Instruction Buffer          |      |           |          |         |     |  |
|-----------------------------|------|-----------|----------|---------|-----|--|
|                             |      | Warp So   | cheduler |         |     |  |
| Dispatch Unit Dispatch Unit |      |           |          |         |     |  |
|                             |      | er File ( | 16,384 x | 32-bit) |     |  |
| Core                        | Core | Core      | Core     |         | SFU |  |
| Core                        | Core | Core      | Core     |         | SFU |  |
| Core                        | Core | Core      | Core     |         | SFU |  |
| Core                        | Core | Core      | Core     |         | SFU |  |
| Core                        | Core | Core      | Core     |         | SFU |  |
| Core                        | Core | Core      | Core     |         | SFU |  |
| Core                        | Core | Core      | Core     |         | SFU |  |
| Core                        | Core | Core      | Core     |         | SFU |  |

| Instruction Buffer          |        |                |          |         |      |  |  |  |  |
|-----------------------------|--------|----------------|----------|---------|------|--|--|--|--|
| Warp Scheduler              |        |                |          |         |      |  |  |  |  |
| Dispatch Unit Dispatch Unit |        |                |          |         |      |  |  |  |  |
|                             | •      |                |          | •       |      |  |  |  |  |
|                             | Regist | er File (1     | 16,384 x | 32-bit) |      |  |  |  |  |
| _                           |        |                |          |         |      |  |  |  |  |
| Core                        | Core   | Core           | Core     | LD/ST   | SFU  |  |  |  |  |
| Core                        | Core   | Core           | Core     | LD/ST   | SFU  |  |  |  |  |
|                             |        | $\blacksquare$ | -        | =       |      |  |  |  |  |
| Core                        | Core   | Core           | Core     | LD/ST   | SFU  |  |  |  |  |
| Core                        | Core   | Core           | Core     | LDIST   | SFU  |  |  |  |  |
| 0010                        | 00.0   | Colle          | 0010     |         | 5. 0 |  |  |  |  |
| Core                        | Core   | Core           | Core     | LD/ST   | SFU  |  |  |  |  |
| Core                        | Core   | Core           | Core     | LDIST   | SFU  |  |  |  |  |
| Core                        | Core   | Core           | Core     | LD/S1   | aru  |  |  |  |  |
| Core                        | Core   | Core           | Core     |         | SFU  |  |  |  |  |
| Core                        | Core   | Core           | Core     |         | SFU  |  |  |  |  |
|                             |        |                |          |         |      |  |  |  |  |

### Go back to our program

```
__global___ void vector_add(int * d_a, int * d_b, int * d_c, int size) {
   for (int i = threadIdx.x; i < size; i+=blockDim.x) {
      d_a[i] = d_b[i] + d_c[i];
   }
}</pre>
```

#### calling the function

Launch with many thread blocks

```
vector_add<<<1,1024>>>(d_a, d_b, d_c, size);
```

### Go back to our program

```
__global___ void vector_add(int * d_a, int * d_b, int * d_c, int size) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  d_a[i] = d_b[i] + d_c[i];
}
```

#### calling the function

```
vector_add<<<1024,1024>>>(d_a, d_b, d_c, size);
#define SIZE (1024*1024)
```

Need to recalculate some thread ids.

Launch with many thread blocks

Now we have 1 thread for each element

#### Final Round

Tiny GPU in an embedded system



Fight!



The CPU in my professor workstation



Nvidia Jetson Nano (whole chip, CPU + GPU) 2 Billion transistors 10 TDP

Est. \$99

Intel i7-9700K 2.16 Billion transistors 95 TDP Est. \$316

# Extra thoughts

## Extra thoughts

Why threadIdx.x?

 Because you can pass in a 3D range (x,y,z). Historically threads mapped to pixels (or voxels)

## Consequences of Warps

• Branches can cause significant performance impacts, Demo

### Locks on GPU?

# What is the right way to program GPUs?

• Still an open question!

• It is the first time offering this homework, so feedback is very welcome and we will be generous with support.

 Thanks to Mingun Cho who prepared the initial assignment! (and thanks to the TA's last year who polished it)



- Prerequisits
  - Google Chrome
  - should be stable on Windows and Mac
  - if you are running linux, please try things out ASAP
- We need some security set up

- Javascript shared array buffer:
  - How javascript threads can actually share memory
  - Similar to memory in C++

Shared memory and high-resolution timers were effectively disabled at the start of 2018 in light of Spectre . In 2020, a new, secure approach has been standardized to re-enable shared memory. With a few security measures, postMessage() will no longer throw for SharedArrayBuffer objects and shared memory across threads will be available:

As a baseline requirement, your document needs to be in a secure context.

Your application will be in a secure context (you are writing and running locally!)

You will generate some SSL keys and run in a local server

```
openssl req -x509 -newkey rsa:4096 -keyout key.pem -outcert.pem -days 365 -nodes python3 serve.py
```

• Let's have a look!

- Your assignment:
  - N-body simulation
- Each particle interacts with every other particle







time = 0 time = 1 time = 2

# Examples

• Gravity:

- Boids:
  - https://en.wikipedia.org/wiki/Boids



- N-body require a little bit of physics background so we will do something simpler.
  - If you want to explore with physics please feel free

- Local attraction clustering:
  - For each particle: find your closest neighbor
  - You can take one step in the x direction and one step in the y direction towards your closest neighbor.

• Part 1 of your homework will do this on a single javascript thread

Demo

• Looks good, but with more particles, things start to go slower...

- Looks good, but with more particles, things start to go slower...
- Part 2 of the homework is to implement with multiple CPU threads using javascript webworkers
  - Should get a linear speedup
- Part 3 is to implement with webGPU
  - Should get a BIG speedup!
- You need to explore how many particles you can simulate while keeping a 60 FPS framerate.

Let's look at the code and see some javascript

## Shared Array Buffer

• Like Malloc, allocates a "pointer" to a contagious array of bytes

• Can pass the "pointer" to different threads

Need to instantiate a typed array to access the values

#### Web Workers

How to do multi-threading in javascript

- Async
  - Concurrent (executes on the same thread)
  - Good for I/O and user interactions
- Web Workers will execute on multiple cores
  - Better for compute intensive applications
  - Better performance

#### How to use?

- Create a new worker with a file
  - Doesn't do anything yet
- File contains a function: "on message"
- Main file calls "post message" to start the thread along with arguments
- Worker sends a message back to the main file, it can catch the data

- The language is wgsl
  - It is new, there are not many examples (and the specification changes!)
  - Official specification is here: https://www.w3.org/TR/WGSL/

- wgsl is NOT javascript
- Javascript is interpreted: not possible on GPUs
- wgsl is compiled
  - into Vulkan on Linux
  - into Metal on Apple
  - into HLSL on Windows
- No printing (so GPU code can be difficult to debug)

variables (optional types):

```
var <name> = <value>;
var cluster_dist = 3.0;

var <name> : <type> = <value>;
var cluster_dist : f32 = 3.0;
```

```
types:
```

- i32
- u32
- f32
- vec2<f32>

structures

```
array<type>
```

struct Particle { pos : vec2<f32>; **}**; struct Particles { particles : array<Particle>; **}**; var index\_pos : vec2<f32> = particlesA.particles[index].pos; var index : u32 = GlobalInvocationID.x;

• Built-ins (global id) you have one thread for each particle!

- Built in functions:
  - arrayLength
  - sqrt
  - pow
  - distance

#### For loops:

```
for (var i : u32 = 0u; i < arrayLength(&particlesA.particles); i = i + 1u) {
...
}</pre>
```

Types can be frustrating

• But compiler errors will help you, and you can do casts.