What is your question?
In the wgmma_sm90 tutorial, why does the pipelined main loop run while k_tile_count > -K_PIPE_MAX?
The producers load K_PIPE_MAX tiles before entering the main loop, so wouldn't only k_tile_count - K_PIPE_MAX tiles remain?
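The Python model below makes the counter arithmetic in the question concrete. It tracks only the `k_tile_count` bookkeeping and does not simulate TMA copies, barriers, or MMAs. It assumes two things that should be checked against the tutorial source. First, the prologue decrements `k_tile_count` once per prefetched tile. Second, the main loop issues a refill only while unloaded tiles remain. The function name `simulate_k_loop` is hypothetical and is not part of CuTe.

```python
def simulate_k_loop(total_k_tiles: int, k_pipe_max: int) -> tuple[int, int, int]:
    """Model only the k_tile_count bookkeeping of a prologue + pipelined main loop.

    Hypothetical model: no TMA copies, barriers, or MMAs are simulated.
    Returns (count_after_prologue, main_loop_iterations, total_loads_issued).
    """
    k_tile_count = total_k_tiles
    loads_issued = 0

    # Prologue (assumed): one load per pipeline stage, decrementing the counter each time.
    for _ in range(k_pipe_max):
        loads_issued += 1
        k_tile_count -= 1
    count_after_prologue = k_tile_count  # tiles not yet loaded

    iterations = 0
    # Main loop: each iteration consumes one stage that is already in flight.
    while k_tile_count > -k_pipe_max:
        iterations += 1
        # Assumed refill guard for this model: load another tile only while unloaded tiles remain.
        if k_tile_count > 0:
            loads_issued += 1
        k_tile_count -= 1

    return count_after_prologue, iterations, loads_issued


if __name__ == "__main__":
    print(simulate_k_loop(total_k_tiles=8, k_pipe_max=3))  # (5, 8, 8)
```

Suppose the prologue really does decrement the counter. In that case, the value left after it (5 in the example) counts tiles still to be loaded, not tiles still to be consumed. The K_PIPE_MAX extra iterations that take the counter below zero would then drain the stages already in flight, so the loop body runs total_k_tiles times.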