Insights: NVIDIA/cutlass
Overview
8 Pull requests merged by 7 people
- [DOC] Add more exposition to composition example
  #2536 merged
  Aug 12, 2025
- Fix typo in smem_allocator.py
  #2517 merged
  Aug 11, 2025
- fix typo
  #2529 merged
  Aug 11, 2025
- NIT: Grammar
  #2537 merged
  Aug 11, 2025
- Made documentation updates in batched_gemm.cu
  #2538 merged
  Aug 11, 2025
- Fix a copy error in the SM70 main loop
  #2540 merged
  Aug 11, 2025
- Support both CUDA 12 and 13 cccl header locations
  #2543 merged
  Aug 11, 2025
- Fix incorrect K dim in CuTe MMA Atom doc
  #2544 merged
  Aug 11, 2025
10 Pull requests opened by 10 people
- Fix typo in cute.nvgpu.warpgroup.mma doc
  #2548 opened
  Aug 7, 2025
- Make swizzle in pycute work
  #2553 opened
  Aug 7, 2025
- Liberate runtime check in example 13
  #2554 opened
  Aug 7, 2025
- ex77 backwards GQA
  #2556 opened
  Aug 7, 2025
- Add missing CUDA_ARCH guard for `__nanosleep` in example
  #2558 opened
  Aug 11, 2025
- fix a typo.
  #2561 opened
  Aug 12, 2025
- Add movmatrix support
  #2562 opened
  Aug 12, 2025
- Fix typo in tv layout stride in elementwise_add
  #2564 opened
  Aug 12, 2025
- Fix arch guards in a few examples
  #2567 opened
  Aug 13, 2025
6 Issues closed by 4 people
- [QST] [CuTeDSL] What functions should be in `cute.jit` vs `cute.kernel`
  #2546 closed
  Aug 8, 2025
- [QST] Question about logical_divide (only on older version of cutlass)
  #2545 closed
  Aug 7, 2025
- Why not use sync after loading from TMEM to RMEM in example 02_mma_tma_sm100.cu
  #2525 closed
  Aug 7, 2025
- [QST] Deadlock in producer consumer loop
  #2404 closed
  Aug 7, 2025
- [QST] [CuTeDSL] Branching: ValueError: unable to convert ... in type <class 'list'> to Numeric
  #2531 closed
  Aug 6, 2025
7 Issues opened by 6 people
- [QST]How to optimize the tensorrt-llm's mixed gemm
  #2566 opened
  Aug 13, 2025
- [QST] Why are my registers being loaded from global memory?
  #2563 opened
  Aug 12, 2025
- [QST][Cute-DSL] How to return more than 1 value from llvm.inline_asm
  #2560 opened
  Aug 11, 2025
- [BUG] `88_hopper_fmha_fp8` example fails to compile on some CUDA archs
  #2559 opened
  Aug 11, 2025
- [BUG] [CuTeDSL] LLVM ERROR: pthread_create failed: Resource temporarily unavailable
  #2551 opened
  Aug 7, 2025
- Missing PYPI releases
  #2549 opened
  Aug 7, 2025
- [BUG] Grouped GEMM with GroupScaling Fails with TMA Error on Hopper When a Problem has K=0
  #2547 opened
  Aug 7, 2025
21 Unresolved conversations
Sometimes conversations happen on old items that aren’t yet closed. Here is a list of all the Issues and Pull Requests with unresolved conversations.
- [QST] [CuTeDSL] operand #0 does not dominate this use
  #2532 commented on
  Aug 6, 2025 • 0 new comments
- [QST] how blackwell gemm deal with Not 16-byte aligned data?
  #2491 commented on
  Aug 7, 2025 • 0 new comments
- [QST]Will K-Major Blockwise Scale Config be supported on the Hopper architecture?
  #2464 commented on
  Aug 7, 2025 • 0 new comments
- [QST]Can I use list of tensor as input in 68_hopper_fp8_warp_specialized_grouped_gemm_with_blockwise_scaling
  #2423 commented on
  Aug 7, 2025 • 0 new comments
- [QST] b2b gemm fp32
  #2415 commented on
  Aug 7, 2025 • 0 new comments
- [BUG] Issue with TMA-REDUCE in Split-K GEMM for Inputs with Negative Values on H100
  #2535 commented on
  Aug 7, 2025 • 0 new comments
- [DOC] Building cutlass_profiler
  #2226 commented on
  Aug 7, 2025 • 0 new comments
- [QST] Adding new parameter to Conv2dFprop in Python
  #2166 commented on
  Aug 7, 2025 • 0 new comments
- [QST] How to correctly vectorize copy with cast?
  #2440 commented on
  Aug 8, 2025 • 0 new comments
- [FEA] [Cute-DSL] Passing flags to ptxas
  #2486 commented on
  Aug 9, 2025 • 0 new comments
- [QST]how to use 68_hopper_fp8_warp_specialized_grouped_gemm_with_blockwise_scaling
  #2412 commented on
  Aug 12, 2025 • 0 new comments
- [QST] CUTLASS GEMV with very large matrices [50000, 50000]
  #2409 commented on
  Aug 12, 2025 • 0 new comments
- [QST] Variable size Gemm that can be Cuda graphed
  #2164 commented on
  Aug 12, 2025 • 0 new comments
- [QST] How to implement a fused mixed precision matrix multiplication such as w4a4 + w16a16?
  #2058 commented on
  Aug 12, 2025 • 0 new comments
- [FEA] Blackwell support for python libraries.
  #2237 commented on
  Aug 12, 2025 • 0 new comments
- [BUG] The `print_layout` function failed to format the output correctly.
  #2496 commented on
  Aug 13, 2025 • 0 new comments
- [QST]What's wrong with my usage of cutlass profiler???
  #2458 commented on
  Aug 13, 2025 • 0 new comments
- [QST] incomplete type "cutlass::gemm::device::DefaultGemmConfiguration<MMAOp, SmArch, ElementInputA, ElementInputB, ElementOutput, ElementAccumulator>"
  #2461 commented on
  Aug 13, 2025 • 0 new comments
- Remove duplicate function calls
  #1584 commented on
  Aug 13, 2025 • 0 new comments
- bwl1289/fix/cmake-build-fixes
  #2305 commented on
  Aug 12, 2025 • 0 new comments
- Make unittest compilation faster
  #2402 commented on
  Aug 8, 2025 • 0 new comments