### API-without-secrets
- In Vulkan, a render pass describes a set of framebuffer attachments (images) required for drawing operations and a collection of subpasses into which those drawing operations are ordered. It collects all color, depth, and stencil attachments, together with the operations that modify them, so that the driver does not have to deduce this information on its own, which may open up substantial optimization opportunities on some GPUs. A subpass consists of drawing operations that use (more or less) the same attachments. Each of these drawing operations may read from some input attachments and render data into other (color, depth, stencil) attachments. A render pass also describes the dependencies between these attachments: for example, one subpass renders into a texture, and a later subpass uses that texture as a source of data (that is, samples from it). All this information helps the graphics hardware optimize drawing operations.

- Tiling defines the inner memory structure of an image. Images may have linear or optimal tiling (buffers always have linear tiling).
- Images with linear tiling have their texels laid out linearly, one texel after another, one row after another, and so on.
- When we specify optimal tiling for an image, we don’t know how its memory is structured. The driver is free to arrange the texels in whatever way performs best on the given hardware.
- That’s why it is strongly suggested to always specify optimal tiling for images.
- Layout, as it was described in a tutorial about swapchains, defines an image’s memory layout and is strictly connected with the way in which we want to use an image. Each specific usage has its own memory layout. Before we can use an image in a given way we need to perform a layout transition.

- When we want to change the way in which an image is used, we need to perform the above-mentioned layout transition. We must specify a current (old) layout and a new one. The old layout can have one of two values: the image’s current layout or an undefined layout. When we specify the image’s current layout, the image contents are preserved during the transition. But when we don’t need the image’s contents, we can provide an undefined layout, which may allow the transition to be performed faster.

- We want to use an image as a texture inside shaders. For this purpose we specify the ***VK_IMAGE_USAGE_SAMPLED_BIT*** usage. We also need a way to upload data to the image. We are going to read it from an image file and copy it to the image object. This can be done by transferring data using a staging resource. In such a case, the image will be a target of a transfer operation; that’s why we also specify the ***VK_IMAGE_USAGE_TRANSFER_DST_BIT*** usage.

- Of course, when we want to bind memory to an image, we don’t need to create a new memory object each time. It is more efficient to create a small number of larger memory objects and bind parts of them by providing a proper offset value.
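The sub-allocation idea above can be sketched as a simple linear (bump) allocator that hands out aligned offsets inside one large memory object. This is a minimal sketch under stated assumptions, not a production allocator: `LinearSubAllocator` and `alignUp` are hypothetical names, `capacity` stands for the size of a single `vkAllocateMemory` allocation, and `size` and `alignment` are assumed to come from the `VkMemoryRequirements` of the image or buffer being bound. Nothing is ever freed in this sketch.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

// Round `value` up to the next multiple of `alignment`.
// Vulkan reports power-of-two alignments, which the bit trick relies on.
inline std::uint64_t alignUp(std::uint64_t value, std::uint64_t alignment) {
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (value + alignment - 1) & ~(alignment - 1);
}

// Hypothetical linear sub-allocator over one large memory object.
class LinearSubAllocator {
public:
    explicit LinearSubAllocator(std::uint64_t capacity) : capacity_(capacity) {}

    // Returns the offset to pass to vkBindImageMemory or vkBindBufferMemory,
    // or std::nullopt when the remaining space is too small.
    std::optional<std::uint64_t> allocate(std::uint64_t size, std::uint64_t alignment) {
        const std::uint64_t offset = alignUp(next_, alignment);
        if (offset + size > capacity_) {
            return std::nullopt;
        }
        next_ = offset + size;
        return offset;
    }

private:
    std::uint64_t capacity_;  // total size of the backing memory object
    std::uint64_t next_ = 0;  // first byte that has not been handed out yet
};
```

Each returned offset would be used as the `memoryOffset` argument when binding, so many images can share one memory object.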

- The operation of copying data from a buffer to an image requires recording a command buffer and submitting it to a queue.

- One last thing is to perform another layout transition. Our image will be used as a texture inside shaders, so we need to transition it to a ***VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL*** layout. After that, we can end our command buffer, submit it to a queue, and wait for the transfer to complete (in a real-life application, we should skip waiting and synchronize operations in some other way; for example, using semaphores, to avoid unnecessary pipeline stalls).
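The two layout transitions involved in a texture upload can be summarized as data. The sketch below is an illustration under assumptions rather than the tutorial's own code: the constants are local stand-ins that mirror the names and numeric values of the corresponding `VK_*` enumerants so the example compiles without the SDK, `LayoutTransition` and `textureUploadTransitions` are hypothetical names, and the choice of the fragment shader stage as the consumer is an assumption. Real code would fill a `VkImageMemoryBarrier` with these values and record it with `vkCmdPipelineBarrier`.

```cpp
#include <array>
#include <cstdint>

// Local stand-ins for VK_IMAGE_LAYOUT_*, VK_ACCESS_* and VK_PIPELINE_STAGE_*.
constexpr std::uint32_t IMAGE_LAYOUT_UNDEFINED = 0;
constexpr std::uint32_t IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL = 5;
constexpr std::uint32_t IMAGE_LAYOUT_TRANSFER_DST_OPTIMAL = 7;
constexpr std::uint32_t ACCESS_SHADER_READ_BIT = 0x00000020;
constexpr std::uint32_t ACCESS_TRANSFER_WRITE_BIT = 0x00001000;
constexpr std::uint32_t PIPELINE_STAGE_TOP_OF_PIPE_BIT = 0x00000001;
constexpr std::uint32_t PIPELINE_STAGE_FRAGMENT_SHADER_BIT = 0x00000080;
constexpr std::uint32_t PIPELINE_STAGE_TRANSFER_BIT = 0x00001000;

// The barrier fields that matter for one layout transition.
struct LayoutTransition {
    std::uint32_t oldLayout;
    std::uint32_t newLayout;
    std::uint32_t srcAccessMask;
    std::uint32_t dstAccessMask;
    std::uint32_t srcStageMask;
    std::uint32_t dstStageMask;
};

// Transition 0 happens before the buffer-to-image copy, transition 1 after it.
constexpr std::array<LayoutTransition, 2> textureUploadTransitions() {
    return {{
        // Old contents are not needed, so the old layout is UNDEFINED.
        {IMAGE_LAYOUT_UNDEFINED, IMAGE_LAYOUT_TRANSFER_DST_OPTIMAL,
         0, ACCESS_TRANSFER_WRITE_BIT,
         PIPELINE_STAGE_TOP_OF_PIPE_BIT, PIPELINE_STAGE_TRANSFER_BIT},
        // The copied texels must be preserved, so the old layout is the current one.
        {IMAGE_LAYOUT_TRANSFER_DST_OPTIMAL, IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL,
         ACCESS_TRANSFER_WRITE_BIT, ACCESS_SHADER_READ_BIT,
         PIPELINE_STAGE_TRANSFER_BIT, PIPELINE_STAGE_FRAGMENT_SHADER_BIT},
    }};
}
```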
----------------------
- Use pipeline caches to reduce the cost of pipeline creation
- Vulkan® queues represent independent and asynchronous execution ports for command buffers. Queues belong to a family, which describes capabilities (such as GRAPHICS or COMPUTE).
- Vulkan® on Stadia exposes 3 queue families that map to specific engine types on an AMD Vega 10 class GPU. The number of queues in each family maps to the number of available independently-addressable engines of that type. Each engine has a ring buffer to communicate with the Linux kernel (called a command ring). Submitting command buffers to a queue ultimately results in command buffers being written to the command ring and consumed by the engine's front end.
- Queues correspond to certain rings:
    - The single graphics queue addresses the graphics front end command ring. This is also known as the universal queue.
    - Each compute queue addresses one ***Asynchronous Compute Engine (ACE)*** command ring.
    - Each transfer queue addresses one ***System Direct Memory Access (SDMA)*** engine command ring.
- Use transfer queues to keep the graphics queue free
- Avoid concurrent access on images unless necessary
    - Avoid creating images with sharing mode ***VK_SHARING_MODE_CONCURRENT*** because it limits compression options. This can result in performance degradation since in most cases the images will remain uncompressed.
- Two render passes are compatible if the formats and sample counts of their overlapping attachments match. This means that one render pass can have fewer attachments than the other.
- The following are not used in determining the compatibility of render passes:
    - Initial and final image layout in attachment descriptions
    - Load and store operations in attachment descriptions
    - Image layout in attachment references
- A framebuffer is compatible with a render pass if it was created using the same render pass or a compatible render pass.
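The compatibility rule as stated in these notes can be sketched as a small check. It deliberately follows the simplified rule above (formats and sample counts of overlapping attachments) and is not a full implementation of the specification's compatibility rules. `AttachmentDesc` is a hypothetical stand-in for `VkAttachmentDescription`, with plain integers in place of the Vulkan enum types.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for VkAttachmentDescription.
struct AttachmentDesc {
    std::uint32_t format;         // compared
    std::uint32_t samples;        // compared
    std::uint32_t loadOp;         // ignored for compatibility
    std::uint32_t storeOp;        // ignored for compatibility
    std::uint32_t initialLayout;  // ignored for compatibility
    std::uint32_t finalLayout;    // ignored for compatibility
};

// Two attachment lists are compatible when every overlapping pair agrees on
// format and sample count. Either list may be shorter than the other.
template <std::size_t N, std::size_t M>
constexpr bool renderPassesCompatible(const std::array<AttachmentDesc, N>& a,
                                      const std::array<AttachmentDesc, M>& b) {
    const std::size_t overlap = N < M ? N : M;
    for (std::size_t i = 0; i < overlap; ++i) {
        if (a[i].format != b[i].format || a[i].samples != b[i].samples) {
            return false;
        }
    }
    return true;
}
```

Because load and store operations and layouts are not compared, a render pass that clears its attachments and one that loads them can share the same framebuffers under this rule.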

- There are a few different ways to clear in Vulkan and some of them are faster than others. This section briefly details each of the clear methods and its performance characteristics.
    - Clearing with ***VK_ATTACHMENT_LOAD_OP_CLEAR***
        - This clear happens at the start of the render pass. This is the fastest possible way to clear. When clearing this way, the driver uses the fast path to clear.
    - Clearing with ***vkCmdClearAttachments***
        - This clear must be done inside a render pass. On secondary command buffers, this method of clearing results in a slow clear.
    - Clearing with ***vkCmdClearColorImage***
        - This clear must be done outside a render pass and requires the image to be in one of the following layouts: ***VK_IMAGE_LAYOUT_SHARED_PRESENT_KHR***, ***VK_IMAGE_LAYOUT_GENERAL*** or ***VK_IMAGE_LAYOUT_TRANSFER_DST_OPTIMAL***. Note that if the image layout is ***VK_IMAGE_LAYOUT_TRANSFER_DST_OPTIMAL*** when clearing, a decompress will be forced resulting in slower performance.
    - Clearing with compute shader
        - Clearing with a compute shader may cause some overlap with the driver's clear operations since it also uses a compute shader clear.
    - Clearing caveats
        - Clearing on non-graphics queues is slower than clearing on the graphics queue.

- Synchronization
    - Use a fence to wait for GPU work on the CPU
        - Assuming there are multiple frames in flight, place a fence on the final submission of GPU work and subsequently wait for this fence on the next usage of the in-flight frame.
    - Use semaphores to synchronize queue dependencies
        - To synchronize work between two queue submissions, use a signal semaphore on the first submit and have the second submit wait on that same semaphore. This works for any pair of queues, even across queue families. While it is possible to use a fence to achieve the same work synchronization, this approach is heavy-handed and very slow because a CPU thread will sit idle in vkWaitForFences while waiting for the GPU to finish before processing the dependent submission, causing both the CPU and GPU to go idle for a large number of cycles. This can cause noticeable performance degradation.

- Use ***VK_PRESENT_MODE_MAILBOX_KHR*** mode to minimize latency at the risk of some stutter in frame timing. This mode behaves like triple buffering where the most recent image completed by the GPU before the "vsync" is presented and any previously rendered frames are dropped. This mode allows for rendering as fast as the GPU can render images, but that can potentially cause minor stutter. Stutter here means a mismatch between when the frame was simulated and when it is displayed. To avoid this, the application should pace its frame rendering to 60 Hz or 30 Hz during gameplay. During menus the app can optionally stop pacing itself, let some frames be dropped and reduce input latency, going back to paced mode when interactivity resumes. When choosing MAILBOX present mode, the application must create a swapchain with minImageCount set to 3 to get the expected behavior. If the swapchain is created with only 2 images in this mode, it will behave like FIFO mode. This is the preferred mode for performance benchmarks, as the current implementation will allow the game to go up to 120 Hz.
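The image-count requirement for MAILBOX mode can be sketched as a small helper that asks for three images while respecting the limits reported by the surface. This is a sketch under assumptions: `SurfaceLimits` and `mailboxImageCount` are hypothetical names, and the two fields stand in for `minImageCount` and `maxImageCount` of `VkSurfaceCapabilitiesKHR`, where a maximum of 0 means there is no upper limit.

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical stand-in for the two relevant VkSurfaceCapabilitiesKHR fields.
struct SurfaceLimits {
    std::uint32_t minImageCount;
    std::uint32_t maxImageCount;  // 0 means "no upper limit"
};

// Value to place in VkSwapchainCreateInfoKHR::minImageCount for MAILBOX mode:
// at least 3, clamped to what the surface supports.
constexpr std::uint32_t mailboxImageCount(const SurfaceLimits& limits) {
    std::uint32_t count = std::max<std::uint32_t>(3u, limits.minImageCount);
    if (limits.maxImageCount != 0) {
        count = std::min(count, limits.maxImageCount);
    }
    return count;
}
```

If the helper returns fewer than 3 because the surface cannot provide more, the notes above imply the swapchain would behave like FIFO mode, so an application may want to detect that case.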
     
- ***VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT*** Memory with this flag is local to the GPU and has the best throughput and latency for accesses from the GPU. Memory without this property flag is local to the CPU and GPU accesses require the GPU to issue PCIe transactions, which have high latency and limited throughput on Stadia Gen 1.
- ***VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT*** Memory with this flag is visible to the CPU and can be mapped to CPU virtual addresses. Use memory with this flag primarily as staging memory in transit from or to the GPU.
- ***VK_MEMORY_PROPERTY_HOST_COHERENT_BIT*** Memory with this flag has automatic CPU cache management. This means that no explicit GPU or CPU cache management commands are required to maintain CPU cache consistency when this memory is read or written by the CPU or GPU. For example, a CPU write to memory with this flag done prior to a GPU command submission (vkQueueSubmit) is visible to the GPU commands included in that submission. As another example, a GPU write to memory with this flag done by a command included in a submission with fence A is visible to the CPU as soon as the CPU receives fence A's signal. Host coherency is intrinsic to CPU-visible memory on Stadia Gen 1. Host coherency implies that if CPU caches are present, the system manages them automatically. It does not imply anything about the use of such caches.
- ***VK_MEMORY_PROPERTY_HOST_CACHED_BIT*** Memory with this flag is cached by the CPU for reads and writes. A CPU read to cached device-local memory is significantly faster than an uncached read since the CPU does not have to issue PCIe transactions, which have high latency and limited throughput on Stadia Gen 1.
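The property flags above are typically combined into a bitmask and matched against the memory types a device reports. The following is a minimal sketch of that common selection loop, not code from the Stadia documentation: the constants are local stand-ins carrying the numeric values of the `VK_MEMORY_PROPERTY_*` bits so the example compiles without the SDK, and `findMemoryType` is a hypothetical helper. `typeFlags` is assumed to hold the `propertyFlags` of each memory type, and `typeBits` is assumed to be the `memoryTypeBits` field of `VkMemoryRequirements`. The function returns the first matching index, or -1 when nothing matches.

```cpp
#include <cstdint>

// Local stand-ins for the VK_MEMORY_PROPERTY_* bits described above.
constexpr std::uint32_t MEMORY_PROPERTY_DEVICE_LOCAL_BIT = 0x1;
constexpr std::uint32_t MEMORY_PROPERTY_HOST_VISIBLE_BIT = 0x2;
constexpr std::uint32_t MEMORY_PROPERTY_HOST_COHERENT_BIT = 0x4;
constexpr std::uint32_t MEMORY_PROPERTY_HOST_CACHED_BIT = 0x8;

// Return the index of the first memory type that the resource allows
// (bit i set in typeBits) and that has every flag in `required`.
constexpr int findMemoryType(const std::uint32_t* typeFlags,
                             std::uint32_t typeCount,
                             std::uint32_t typeBits,
                             std::uint32_t required) {
    for (std::uint32_t i = 0; i < typeCount; ++i) {
        const bool allowed = (typeBits & (1u << i)) != 0;
        const bool suitable = (typeFlags[i] & required) == required;
        if (allowed && suitable) {
            return static_cast<int>(i);
        }
    }
    return -1;  // no memory type satisfies the request
}
```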
------------------------------------------------------------------

- ***VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT*** This memory type and its associated heap is local to the GPU device. Therefore, it has the fastest access times for GPU operations. Use it to store all GPU-only data that you use during command buffer execution (render targets, static vertices, textures, etc.). To modify this memory from the CPU, stage your data in a memory type with the HOST_VISIBLE property flag (one of memory types 1, 2, or 3 below) and schedule a copy of that data using the vkCmdCopy* commands. Below, we make recommendations for which memory type to use for staging.
     
- ***VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT | VK_MEMORY_PROPERTY_HOST_COHERENT_BIT*** This memory type is local to the CPU and is CPU uncached and write-combined. The GPU does not snoop CPU caches on reads. This allows lower latency and higher throughput compared to cached non-write-combined memory if you write to it with write-combining in mind (for example, consecutive, no gap, aligned writes). Don’t use this memory type in time-sensitive command buffer operations.

- ***VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT | VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT | VK_MEMORY_PROPERTY_HOST_COHERENT_BIT*** This memory type and its associated heap are local to the GPU device and have the same access performance characteristics as Device Local (type index 0), but are CPU-visible. On instances with the gen1-heaps-legacy heap configuration, this memory type uses a very small dedicated heap, whereas on instances with the gen1-heaps-2019-final heap configuration this memory type shares the same heap as Device Local (type index 0). The system manages CPU caches automatically (see HOST_COHERENT). Writes are not write-combined and reads are not cached (see HOST_CACHED). Reads happen on full cache lines. For optimal performance, when you copy data from this heap, you need to ensure two things: (1) align source and destination pointers to 64 bytes, and (2) make the size of the copy a multiple of 64 bytes.
- ***VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT | VK_MEMORY_PROPERTY_HOST_COHERENT_BIT | VK_MEMORY_PROPERTY_HOST_CACHED_BIT*** This memory type is local to the CPU, is CPU cached, and is not write-combined. The GPU snoops on the CPU caches to ensure CPU-GPU cache coherence, which reduces GPU access performance. CPU reads have much higher performance compared to USWC GTT MEMORY, but CPU writes are worse. Prefer this memory type for GPU writes and CPU reads.
- Use ***DEVICE_LOCAL*** memory types for all static, GPU-only data, and prefer the CPU-invisible type as it always has the largest heap size. Use the ***DEVICE_LOCAL*** and ***HOST_VISIBLE*** memory type for all small, dynamic data updates from CPU to GPU that are meant to be used in command buffer execution. Be careful using this type in the gen1-heaps-legacy heap configuration because of its small heap size.
- ***vkCmdCopyBuffer*** behavior: on graphics and compute queues, the command processor (front end) associated with the queue copies the copy regions that are smaller than or equal to 64 KiB. A scheduled copy compute shader copies the larger regions. On transfer queues, the SDMA engine associated with the queue always executes the copy of each region.
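The `vkCmdCopyBuffer` behavior described above amounts to a small decision rule per copy region. The sketch below only models that rule for reasoning about where a copy will run. `QueueKind`, `CopyPath`, and `copyPathFor` are hypothetical names, and nothing here queries the driver.

```cpp
#include <cstdint>

enum class QueueKind { Graphics, Compute, Transfer };
enum class CopyPath { CommandProcessor, ComputeShader, SdmaEngine };

// Regions up to and including 64 KiB are copied by the command processor.
constexpr std::uint64_t kCommandProcessorLimit = 64 * 1024;

// Which hardware path executes a single copy region, per the rule above.
constexpr CopyPath copyPathFor(QueueKind queue, std::uint64_t regionSize) {
    if (queue == QueueKind::Transfer) {
        return CopyPath::SdmaEngine;  // SDMA handles every region on transfer queues
    }
    return regionSize <= kCommandProcessorLimit ? CopyPath::CommandProcessor
                                                : CopyPath::ComputeShader;
}
```

For example, a 1 MiB region recorded on the graphics queue maps to the compute shader path, while the same region on a transfer queue maps to the SDMA engine.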