gemm

The HPC toolbox: fused matrix multiplication, convolution, data-parallel strided tensor primitives, OpenMP facilities, SIMD, JIT Assembler, CPU detection, state-of-the-art vectorized BLAS for floats and integers

deep-learning assembler parallel openmp jit simd matrix-multiplication high-performance-computing blas convolution tensor compiler-optimization gemm runtime-cpu-detection

Updated Jan 4, 2024
Nim

yzhaiustc / Optimizing-SGEMM-on-NVIDIA-Turing-GPUs

Optimizing SGEMM kernel functions on NVIDIA GPUs to close-to-cuBLAS performance.

optimization cuda nvidia gemm

Updated Nov 28, 2021
Cuda

Bruce-Lee-LY / cuda_hgemm

Several optimization methods for half-precision general matrix multiplication (HGEMM) using Tensor Cores with the WMMA API and MMA PTX instructions.

gpu cuda cublas nvidia gemm matrix-multiply tensor-core hgemm

Updated Nov 7, 2023
Cuda

ROCm / Tensile

Stretching GPU performance for GEMMs and tensor contractions.

python machine-learning amd gpu assembly opencl dnn matrix-multiplication neural-networks gpu-acceleration blas hip gpu-computing tensors tensor-contraction gemm radeon auto-tuning radeon-open-compute

Updated May 17, 2024
Python

cp2k / dbcsr

DBCSR: Distributed Block Compressed Sparse Row matrix library

hpc linear-algebra mpi cuda matrix-multiplication blas sparse-matrix cp2k gemm openmp-parallelization

Updated May 17, 2024
Fortran

yui0 / slibs

Single file libraries for C/C++

Updated Nov 8, 2023
C

codingonion / awesome-cuda-tensorrt-fpga

🔥🔥🔥 A collection of some awesome public NVIDIA CUDA, cuBLAS, cuDNN, TensorRT, AMD ROCm and FPGA projects.

Updated Apr 27, 2024

yzhaiustc / Optimizing-DGEMM-on-Intel-CPUs-with-AVX512F

Stepwise optimizations of DGEMM on CPU, eventually outperforming Intel MKL, even under multithreading.

openmp simd blas avx512 gemm mkl

Updated Feb 3, 2022
C

enp1s0 / ozIMMU

FP64-equivalent GEMM via Int8 Tensor Cores using the Ozaki scheme

cuda gemm mixed-precision tensorcore tensorcores

Updated Jan 20, 2024
Cuda

ROCm / hipBLASLt

hipBLASLt is a library that provides general matrix-matrix operations with a flexible API and extends functionalities beyond a traditional BLAS library

machine-learning amd assembly matrix-multiplication blas hip gpu-computing gemm rocm radeon-open-compute