<a href="https://colab.research.google.com/github/Emilevillette/Emilevillette/blob/main/SIMD_student.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# LINFO2241 : Practical Session 8

# Exercice 1 : SIMD

## Section 1 : Turning a regular program into AVX

This exercice consist in understanding a rather basic C program, and adapt it so that it can run in AVX, AVX2 and AVX512. We'll then look at how well each implementation performs

### Preliminary steps

#### Checking CPU support

To use AVX, AVX2 and AVX512, we need the support from the CPU. Check the flags of /proc/cpuinfo

Don't forget the scroll bar to see the line to the end. You should see the *avx*, *avx2* and different variant of *avx512* in the flag section. This indicates that the process has hardware support for each version of AVX. You can check on your own computer if you wan (hint : On your local machine, you'll probably have support for AVX and AVX2, but likely not for AVX512).

In [1]:
!cat /proc/cpuinfo | grep avx

flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single ssbd ibrs ibpb stibp fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves arat md_clear arch_capabilities
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single ssbd ibrs ibpb stibp fsgsbase tsc_adjust bmi1 hle avx2 smep

#### Create basic files

This is the skeleton you'll need later, just execute these cells to create the required file. You don't have to edit it.

But look at the `aligned_alloc` functions, what do they do ? Why 64 ?

In [2]:
%%writefile avx.c
#include <stdio.h>
#include <stdlib.h>

unsigned long long rdtscl(void)
{
    unsigned int lo, hi;
    __asm__ __volatile__ ("rdtsc" : "=a"(lo), "=d"(hi));
    return ( (unsigned long long)lo)|( ((unsigned long long)hi)<<32 );
}

long N = 65536;

void VecAdd(float* A, float* B, float* C, long N);

// Host code
int main()
{

    size_t size = N * sizeof(float);
    // Allocate input vectors h_A and h_B in host memory
    float* A = (float*)aligned_alloc(64, size);
    float* B = (float*)aligned_alloc(64, size);
    float* C = (float*)aligned_alloc(64, size);

    // Initialize input vectors
    for (int i = 0; i < N; i ++) {
        A[i] = (float)i;
        B[i] = (float)i;
        //C[i] = 0; will be overwritten, don't care
    }

    printf("Launching computation...\n");
    unsigned long long start = rdtscl();
    VecAdd(A, B, C, N);

    printf("Finished in %llu cycles !\n", rdtscl() - start);

    printf("First floats of C : %f %f %f ...\n", C[0], C[1] , C[2]);

    // Free host memory
    free(A);
    free(B);
    free(C);

    printf("Exiting...");
}

Writing avx.c


### Writing you own implementation

Look a this naive implementation and write an equivalent in AVX, AVX2 and AVX512

#### Naive implementation

In [3]:
%%writefile naive.c

// Naive implementation, you don't have to edit it
void VecAdd(float* A, float* B, float* C, long N)
{
    for (int i = 0; i < N; i++)
        C[i] = A[i] + B[i];
}

Writing naive.c


You can check that it works by compiling and running it

In [4]:
!gcc -march=native avx.c naive.c -o naive
!./naive

Launching computation...
Finished in 950400 cycles !
First floats of C : 0.000000 2.000000 4.000000 ...
Exiting...

#### AVX128 implementation

In [23]:
%%writefile avx128.c
#include <immintrin.h>

void VecAdd(float* A, float* B, float* C, long N)
{
  // Your code here
  for(int i = 0; i<N; i+=4) {
    __m128 a = _mm_load_ps(A+i);
    __m128 b = _mm_load_ps(B+i);
    __m128 result = _mm_add_ps(a, b);
    _mm_store_ps(C+i, result);
  }

}

Overwriting avx128.c


In [24]:
!gcc -march=native avx.c avx128.c -o avx
!./avx

Launching computation...
Finished in 617058 cycles !
First floats of C : 0.000000 2.000000 4.000000 ...
Exiting...

#### AVX256 implementation

In [25]:
%%writefile avx256.c
#include <immintrin.h>

void VecAdd(float* A, float* B, float* C, long N)
{
  // Your code here
  for(int i = 0; i<N; i+=8) {
    __m256 a = _mm256_load_ps(A+i);
    __m256 b = _mm256_load_ps(B+i);
    __m256 result = _mm256_add_ps(a, b);
    _mm256_store_ps(C+i, result);
  }
}

Overwriting avx256.c


In [26]:
!gcc -march=native avx.c avx256.c -o avx2
!./avx2

Launching computation...
Finished in 387120 cycles !
First floats of C : 0.000000 2.000000 4.000000 ...
Exiting...

#### AVX512 implementation

Sadly, AVX512 isn't always available on Google Collab. You need to setup your colab environment to run with a GPU (We won't use the GPU right now, but the GPU machine happens to be AVX512 compatible).

In [29]:
%%writefile avx512.c
#include <immintrin.h>

void VecAdd(float *A, float *B, float *C, long N) {
  // Your code here
  for(int i = 0; i<N; i+=16) {
    __m512 a = _mm512_load_ps(A+i);
    __m512 b = _mm512_load_ps(B+i);
    __m512 result = _mm512_add_ps(a, b);
    _mm512_store_ps(C+i, result);
  }
}

Overwriting avx512.c


In [30]:
!gcc -march=native avx.c avx512.c -mavx512bw -o avx512
!./avx512

Launching computation...
Finished in 397960 cycles !
First floats of C : 0.000000 2.000000 4.000000 ...
Exiting...

#### Performances

Can you see a difference in performances ? Maybe you'll have to increase the value of N to have significant result. (If so, don't forget to recompile everything)

In [None]:
!./naive
!./avx
!./avx2
!./avx512

## Section 2 : Branching and masks

Look at the code below, it uses branching to perform multiplication on odd indexes (Note: odd means 'impair' in english). How can we perform this with AVX ?

In [31]:
%%writefile conditional.c
// Code. Well this is still not vectorized
void VecAdd(float* A, float* B, float* C, long N)
{
    for (int i = 0; i < N; i++)
        if (i % 2 == 0)
            C[i] = A[i] + B[i];
        else
            C[i] = A[i] * B[i];
}

Writing conditional.c


In [32]:
!gcc -march=native avx.c conditional.c -o avx_conditional
!./avx_conditional

Launching computation...
Finished in 1129280 cycles !
First floats of C : 0.000000 1.000000 4.000000 ...
Exiting...

Use masks and blenders to write an AVX version of this program. You're free to use the AVX version you want (AVX, AVX2 or AVX512).

In [75]:
%%writefile mask.c
// Code. Well this is still not vectorized
#include <immintrin.h>

void VecAdd(float* A, float* B, float* C, long N)
{
  // Your code here
  for (int i = 0; i < N; i+=16) {
    __m512 a = _mm512_load_ps(A+i);
    __m512 b = _mm512_load_ps(B+i);
    int array[16];
    if (i%2 == 0) {
      int array[16] = {0, 1, 0, 1 ,0, 1 ,0, 1, 0, 1, 0, 1 ,0, 1 ,0, 1};
    } else {
      int array[16] = {1 ,0,1 ,0,1 ,0,1 ,0, 1 ,0,1 ,0,1 ,0,1 ,0};
    }
    __m512i arr = _mm512_load_epi32(array);
    __mmask32 mask = _mm512_cmpeq_epi32_mask(arr, _mm512_set_epi32(0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0));

    __m512 result = _mm512_mask_add_ps(a, mask, a, b);
    __m512 other_result = _mm512_maskz_mul_ps(mask, a, b);

    __m512 real_result = _mm512_mask_blend_ps(mask, result, other_result);

    _mm512_store_ps(C+i, real_result);
  }

}

Overwriting mask.c


In [76]:
!gcc -march=native avx.c mask.c -o avx_mask
!./avx_mask

Launching computation...


### Performances

What can you say about each implementation performances ?

In [None]:
!./avx_conditional
!./avx_mask