Ifu 2023 05 04 #40

Merged
merged 83 commits into main on May 5, 2023
Conversation

@liligwu (Collaborator) commented May 5, 2023

All tests pass (except batched_unary_embeddings_test.py)
test_log.txt

banitag1 and others added 30 commits March 10, 2023 13:04
Summary:
Pull Request resolved: pytorch#1638

This diff adds another mechanism for allocating the host-mapped pinned memory to reduce the adverse effect on other processes running on the same host when one process is doing some large allocations.
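
For context, a minimal CUDA sketch of allocating host-mapped (zero-copy) pinned memory with the runtime API; this is illustrative only and not the allocator mechanism added in this diff:
```
#include <cstdio>
#include <cuda_runtime.h>

int main() {
  const size_t bytes = 64ULL << 20; // 64 MiB
  void* host_ptr = nullptr;

  // Host-mapped (zero-copy) pinned allocation: the device can access the
  // buffer directly without a separate device allocation.
  cudaError_t err = cudaHostAlloc(&host_ptr, bytes, cudaHostAllocMapped);
  if (err != cudaSuccess) {
    std::printf("cudaHostAlloc failed: %s\n", cudaGetErrorString(err));
    return 1;
  }

  // Device-side alias of the same memory, usable as a kernel argument.
  void* device_ptr = nullptr;
  cudaHostGetDevicePointer(&device_ptr, host_ptr, /*flags=*/0);

  // ... launch kernels that read/write device_ptr ...

  cudaFreeHost(host_ptr);
  return 0;
}
```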

Reviewed By: zyan0, jianyuh

Differential Revision: D43950253

fbshipit-source-id: 41a434cb63354509d32e00c851c5f3a2d68be686
Summary:
This PR addresses the issue pytorch#1636

akin to https://github.com/pytorch/FBGEMM/blob/8616ed701015f8b9e4c2825ce592b204b4cfaf28/fbgemm_gpu/test/split_table_batched_embeddings_test.py#L1009

Pull Request resolved: pytorch#1635

Reviewed By: shintaro-iwasaki

Differential Revision: D44033725

Pulled By: q10

fbshipit-source-id: 49f28fc2f1c20948a42728eebf3defc5195baa5d
… when using freq based methods (pytorch#1352)

Summary:
Pull Request resolved: pytorch#1352

1. Update the interface to accommodate rowwise_adagrad_with_counter.
2. Route the backend for rowwise_adagrad to the new rowwise_adagrad_with_counter when freq-based methods (e.g., freq SGD, counter-adjusted regularization) are used.

Reviewed By: csmiler

Differential Revision: D36788395

fbshipit-source-id: 8eb5da8a5c8b52bc1e237af1054aac9f7245c443
…ard (pytorch#1642)

Summary:
Pull Request resolved: pytorch#1642

Remove sync point in jagged_dense_elementwise_add_jagged_output backward

Reviewed By: brad-mengchi

Differential Revision: D44039901

fbshipit-source-id: 8e7e23e4d9e01359e67e5b166adc57f894a1224d
…ytorch#1639)

Summary:
- Remove `.post0` suffix from the autogenerated package version
- Document the full FBGEMM_GPU OSS build process in a separate Markdown file
- Remove installation of packages not needed for ROCm builds
- Migrate CPU and ROCm jobs to run on top of Docker containers instead of bare metal instances
- Update GitHub workflow configuration to cancel previous jobs for a PR if a new commit is pushed to the PR

Pull Request resolved: pytorch#1639

Reviewed By: shintaro-iwasaki

Differential Revision: D44076312

Pulled By: q10

fbshipit-source-id: 6b2d083022feb7421b26da2d998678e00c11f283
Summary:
fix build with gcc-13

Pull Request resolved: pytorch#1640

Reviewed By: shintaro-iwasaki

Differential Revision: D44044422

Pulled By: q10

fbshipit-source-id: 692ec9c34f4aaf726294a2b643fbceabf8159033
Summary:
Pull Request resolved: pytorch#1611

If the group size is larger than 54, this diff internally breaks the group down into
smaller groups (each subgroup size is less than or equal to 54).
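
A small illustrative sketch of the splitting logic (the names `kMaxGroupSize` and `split_group` are made up for this example):
```
#include <algorithm>
#include <cstdio>
#include <vector>

constexpr int kMaxGroupSize = 54;

// Break a group into subgroups of at most kMaxGroupSize entries each.
std::vector<int> split_group(int group_size) {
  std::vector<int> subgroup_sizes;
  for (int remaining = group_size; remaining > 0; remaining -= kMaxGroupSize) {
    subgroup_sizes.push_back(std::min(remaining, kMaxGroupSize));
  }
  return subgroup_sizes;
}

int main() {
  for (int size : split_group(130)) {
    std::printf("%d ", size); // prints: 54 54 22
  }
  std::printf("\n");
  return 0;
}
```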

Reviewed By: jianyuh

Differential Revision: D43585937

fbshipit-source-id: bf14eeb79881a5737dcf7660e3e0f56d21f7b326
Summary:
Pull Request resolved: pytorch#1637

Enforce cache misses (even if trace-driven testing doesn't experience cache misses due to the limited trace size) so that we can evaluate performance under cache misses.

Note that these are not exactly cache misses; we enforce access to UVM by overriding lxu_cache_locations for N / 256 requests.
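
A minimal sketch of the idea (not the actual kernel); the sentinel value -1 for "not cached" is an assumption here:
```
#include <cstdint>
#include <cuda_runtime.h>

// Force roughly N / 256 requests to bypass the cache by overwriting their
// cache locations with a "missing" sentinel so they fall back to UVM.
__global__ void enforce_cache_miss_kernel(
    int32_t* lxu_cache_locations,
    int64_t num_requests) {
  const int64_t idx =
      static_cast<int64_t>(blockIdx.x) * blockDim.x + threadIdx.x;
  if (idx < num_requests && idx % 256 == 0) {
    lxu_cache_locations[idx] = -1; // assumed sentinel for "not in cache"
  }
}
```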

Reviewed By: YuzeDaiMeta

Differential Revision: D42194019

fbshipit-source-id: ab04c1cc7a749e84d605cfe4f1687489ceab5725
Summary:
Pull Request resolved: pytorch#1602

Illegal memory access is a common problem during GPU kernel execution.
The FBGEMM GPU relies on PyTorch's `C10_CUDA_KERNEL_LAUNCH_CHECK()` and
the CUDA runtime to detect such problems and throw an error.  However,
there are a few known issues with this approach.

(1) `C10_CUDA_KERNEL_LAUNCH_CHECK()` detects errors on the host.
However, due to the non-blocking, asynchronous nature of GPU kernel
execution, the error is caught on the host at a later point than where
the problematic kernel was launched.  This can cause the stack trace
to be inaccurate and make debugging more difficult.  Although the
issue can be fixed by running the code with `CUDA_LAUNCH_BLOCKING=1`,
this can change the state of the execution and cause Heisenbugs.

(2) Not all illegal memory accesses are caught by the runtime.  This
means that the system may not always throw an error when illegal
memory access occurs.

(3) Although the runtime throws an error for illegal memory access, it
is difficult to pinpoint the specific kernel and memory buffer/address
that is causing the problem.

For all the aforementioned reasons, we attempt to catch and throw an
error as soon as possible in the kernel when illegal memory accesses
occur in FBGEMM GPU.  We introduce the `FBGEMM_GPU_MEMCHECK` flag
to enable memory checking during compile time.  We copy PyTorch's
`TensorAccessor.h` into the FBGEMM GPU and extend it to check every
memory access through the `PackedTensorAccessor`.  If an invalid memory
access occurs, we throw an error using `CUDA_KERNEL_ASSERT`.  The error
message includes the name of the tensor and the kernel that caused the
problem.

If `FBGEMM_GPU_MEMCHECK` is enabled, FBGEMM operators will use
`fbgemm::PackedTensorAccessor`.  Otherwise, they will use
`at::PackedTensorAccessor`.

`FBGEMM_GPU_MEMCHECK` integration in FBGEMM ops will be done in
subsequent diffs.
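
A stripped-down sketch of the bounds-checking idea (not the actual `fbgemm::PackedTensorAccessor`); it uses a plain device `assert` where the real code uses `CUDA_KERNEL_ASSERT` with the tensor and kernel names:
```
#include <cassert>
#include <cstdint>

template <typename T>
struct CheckedAccessor1D {
  T* data;
  int64_t size;

  // Every access is validated on the device, so an illegal index fails at
  // the offending kernel instead of at a later host-side check.
  __device__ T& operator[](int64_t i) const {
    assert(i >= 0 && i < size);
    return data[i];
  }
};

__global__ void copy_kernel(
    CheckedAccessor1D<float> dst,
    CheckedAccessor1D<float> src,
    int64_t n) {
  const int64_t i =
      static_cast<int64_t>(blockIdx.x) * blockDim.x + threadIdx.x;
  if (i < n) {
    dst[i] = src[i]; // both accesses are bounds-checked
  }
}
```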

Reviewed By: r-barnes

Differential Revision: D43421838

fbshipit-source-id: c8ef04970d94bb097cb5f09b42f994db72845167
Summary:
Pull Request resolved: pytorch#1648

This hack is not needed in Xcode 14.3 anymore, where the clang version is 14.0.3. So change the workaround to only include up to 14.0.2.

Reviewed By: MatzeB

Differential Revision: D44130421

fbshipit-source-id: 1fb2948567941bdf6ee9487ccfaa9dfb2caf92dd
…ch#1646)

Summary:
- Parallelize the FBGEMM CI builds to build and test static and shared libraries independently instead of in serial
- Move the FBGEMM CI builds to run inside Docker containers
- Add support for building FBGEMM_GPU against Python 3.11 in OSS
- Move all FBGEMM_GPU nightly and release build jobs to run inside `amazonlinux:2023` Docker container
- Assuming no build errors or resource starvation, the full OSS build process now runs in under 30 minutes.

Pull Request resolved: pytorch#1646

Reviewed By: shintaro-iwasaki

Differential Revision: D44157228

Pulled By: q10

fbshipit-source-id: 6403ea9955856157785c50837b0b8e4c0cd26d53
Summary:
Pull Request resolved: pytorch#1629

Replaces magic numbers with constexpr variables

Reviewed By: sryap

Differential Revision: D43776442

fbshipit-source-id: 5cef7566816f8730f5daa08948ee3260367787aa
Summary:
Pull Request resolved: pytorch#1645

as in title

Reviewed By: jianyuh

Differential Revision: D44096435

fbshipit-source-id: a7a87a14ffecc2fb6e0be74d199d385357946672
Summary:
Pull Request resolved: pytorch#1643

This diff optimizes the jagged_dense_bmm operator with the following optimizations (see the generic sketch below):
* tiling across thread blocks, using GPU shared memory per thread block
* tiling across threads within a thread block, using registers for each thread
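
For reference, the generic shared-memory/register tiling pattern (a plain dense GEMM, not the actual jagged_dense_bmm kernel, which additionally handles jagged offsets):
```
#include <cuda_runtime.h>

constexpr int TILE = 16;

// C = A * B with A (M x K), B (K x N), C (M x N), all row-major.
__global__ void tiled_matmul(
    const float* A, const float* B, float* C, int M, int N, int K) {
  __shared__ float As[TILE][TILE];
  __shared__ float Bs[TILE][TILE];

  const int row = blockIdx.y * TILE + threadIdx.y;
  const int col = blockIdx.x * TILE + threadIdx.x;
  float acc = 0.0f; // register-resident accumulator per thread

  for (int t = 0; t < K; t += TILE) {
    // Cooperatively stage one tile of A and B into shared memory.
    As[threadIdx.y][threadIdx.x] =
        (row < M && t + threadIdx.x < K) ? A[row * K + t + threadIdx.x] : 0.0f;
    Bs[threadIdx.y][threadIdx.x] =
        (t + threadIdx.y < K && col < N) ? B[(t + threadIdx.y) * N + col] : 0.0f;
    __syncthreads();

    for (int k = 0; k < TILE; ++k) {
      acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
    }
    __syncthreads();
  }
  if (row < M && col < N) {
    C[row * N + col] = acc;
  }
}
```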

Reviewed By: brad-mengchi

Differential Revision: D43674845

fbshipit-source-id: 85f0abf89fa958f79636ef59c3070a1c569b73c2
Summary:
This patch fixes test failures on AMD GPUs.

1. Remove `__restrict__`. I don't think it is needed even for CUDA, but it confuses HIPCC.
2. Use `uint32_t` instead of `auto`: old ROCm (including ROCm <= 5.3) does not have a `+=` operator for the type of `blockIdx.z`, causing a compilation error. We observed that this issue is fixed in ROCm 5.4.3, but let's use `uint32_t` for now. We should revisit and use `auto` later. See this for details: ROCm/hipamd@86a1634

Pull Request resolved: pytorch#1655

Test Plan: GitHub Actions' AMD CI

Reviewed By: q10, brad-mengchi

Differential Revision: D44242622

Pulled By: shintaro-iwasaki

fbshipit-source-id: c9b88155ebf1ed881b2d03e3be0e8991b4b30174
Summary:
Pull Request resolved: pytorch#1656

wushirong reported the failure on https://fburl.com/code/hae91ra7 .

- The embedding config is from  f418615450 .
- `max_int8_128b_rows` is 10 --> D = 1280

Our embedding dim has grown to 1024 + ?

Note that the static shared memory can only go up to 48 KB:

> Kernels relying on shared memory allocations over 48 KB per block are architecture-specific, as such they must use dynamic shared memory (rather than statically sized arrays)

in https://docs.nvidia.com/cuda/cuda-c-programming-guide/

for ptx shared mem error:
```
[2023-03-21T22:04:33.899-07:00] ptxas error   : Entry function '_ZN4nbit60INT8_split_embedding_codegen_forward_weighted_kernel_small_LIiN3c104HalfELm2ELm4ELm4E
Lm8ELm16ELb1EEEvN2at27GenericPackedTensorAccessorIhLm1ENS3_17RestrictPtrTraitsElEES6_NS4_IiLm1ES5_iEENS4_IlLm1ES5_iEENS4_IhLm1ES5_iEES7_N10fbgemm_gpu12FixedDiv
isorENS4_IT_Lm1ES5_iEESD_llNS4_IfLm1ES5_iEENS4_IT0_Lm2ES5_iEENS4_IhLm2ES5_lEES7_' uses too much shared data (0x10080 bytes, 0xc000 max)
```

Currently we reduce `InputRowsInFlight` to bypass the issue. The static shared memory used in the kernel is:
```
  typedef uint4 AllBuffers[WarpsPerBlock][OutputRowsPerThread][InputRowsInFlight][NumUint4LoadsPerRow];
  __shared__ AllBuffers buffers;
```

Long term, we can change the static shared memory to dynamic shared memory and increase the shared memory size to 64 KB or more.
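
A sketch of that long-term approach with an illustrative kernel (the names here are not the actual TBE kernel):
```
#include <cstdint>
#include <cuda_runtime.h>

__global__ void forward_kernel(const uint8_t* weights, float* out) {
  extern __shared__ uint4 buffers[]; // dynamically sized at launch time
  // ... use `buffers` the same way as the static AllBuffers array ...
}

void launch_with_large_smem(const uint8_t* weights, float* out) {
  const int smem_bytes = 64 * 1024; // 64 KB, above the 48 KB static limit

  // Opt-in required before requesting more than 48 KB of dynamic shared
  // memory per block on supported architectures.
  cudaFuncSetAttribute(
      forward_kernel, cudaFuncAttributeMaxDynamicSharedMemorySize, smem_bytes);

  forward_kernel<<<128, 256, smem_bytes>>>(weights, out);
}
```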

Reviewed By: wushirong

Differential Revision: D44270081

fbshipit-source-id: 367ae838ea073dfe58d859ea3c0e6c7190beca6a
Summary:
- Containerize the remaining FBGEMM_GPU CI jobs
- Add Conda cleanups to make PyTorch and CUDA installs more reliable
- Update post-install checks for PyTorch to work with ROCm
- Update the CI to continue running on jobs that fail on just a few variants
- Use PIP to install PyTorch GPU nightly as the nightly packages show up in PIP more reliably than in Conda

Pull Request resolved: pytorch#1658

Reviewed By: shintaro-iwasaki

Differential Revision: D44306708

Pulled By: q10

fbshipit-source-id: 5f0862f18eca7151759d9983aa97849222539d7d
Summary:
Pull Request resolved: pytorch#1647

Implement `tbe_input_combine_with_length` for GPU.  The operator takes
3 lists of tensors (`indices`, `lengths`, and `per_sample_weights`)
and concatenates each one into a single tensor.  Implicit type casting
is also performed if the input types are different from the output
types.  `indices` and `lengths` tensors can be of type `int32_t` or
`int64_t`.  The outputs for `indices` concatenation and `lengths`
concatenation are fixed to `int32_t`.  `per_sample_weights` must be
`float`.
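
A host-side sketch of the semantics only, using ATen's `at::cat` (the real operator is a fused GPU implementation; `combine_reference` is a made-up name):
```
#include <ATen/ATen.h>
#include <tuple>
#include <vector>

std::tuple<at::Tensor, at::Tensor, at::Tensor> combine_reference(
    const std::vector<at::Tensor>& indices_list,
    const std::vector<at::Tensor>& lengths_list,
    const std::vector<at::Tensor>& per_sample_weights_list) {
  // Concatenate each list and cast to the fixed output types described above.
  at::Tensor indices = at::cat(indices_list).to(at::kInt); // int32 output
  at::Tensor lengths = at::cat(lengths_list).to(at::kInt); // int32 output
  at::Tensor weights = at::cat(per_sample_weights_list);   // float
  return std::make_tuple(indices, lengths, weights);
}
```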

Reviewed By: bangshengtang

Differential Revision: D44076452

fbshipit-source-id: f6ce8628e7345093bb55835f9523870c2914516f
Summary:
Pull Request resolved: pytorch#1644

This diff optimizes the jagged_jagged_bmm operator using tiling across thread blocks and GPU shared memory.

Reviewed By: brad-mengchi

Differential Revision: D44029528

fbshipit-source-id: fa5cd5a26893f935427bce5efb7dfcc731c3f47d
Summary:
Pull Request resolved: pytorch#1660

When emulating cache misses was enabled, it caused illegal memory access if more than one GPU was used. It turns out that the previous diff didn't specify the device within the emulate_cache_miss kernel.

This diff fixes it. In addition, it cleans things up a bit (e.g., there is no need to use an index_t-based kernel launch for the emulate_cache_miss kernel, as lxu_cache_locations is always int32_t).
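
The fix pattern, sketched with a hypothetical launcher (the guard type is from c10; the kernel launch itself is elided):
```
#include <ATen/ATen.h>
#include <c10/cuda/CUDAGuard.h>

void launch_emulate_cache_miss(at::Tensor lxu_cache_locations) {
  // Make the tensor's device current for the scope of this function, so the
  // kernel is launched on the GPU that actually owns lxu_cache_locations.
  c10::cuda::CUDAGuard device_guard(lxu_cache_locations.device());

  // ... launch emulate_cache_miss_kernel on the current device/stream ...
}
```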

Reviewed By: sryap, YuzeDaiMeta

Differential Revision: D44340131

fbshipit-source-id: d99ba2364e9030cbca6c1166e578d24d99646bb1
Summary:
- Add C++17 support for the entire FBGEMM_GPU build
- Add C++17 support for the entire FBGEMM build
- Update FBGEMM tests and benchmarks to be C++17-compatible
- Make FBGEMM builds output more logging
- Cherry-pick code changes from D43776442 v4 now that C++17 is fully supported

Pull Request resolved: pytorch#1652

Reviewed By: shintaro-iwasaki

Differential Revision: D44287321

Pulled By: q10

fbshipit-source-id: 4bf2bcf66d528939865d42b6deafc470bee55d17
Summary:
Pull Request resolved: pytorch#1659

This diff aims to reduce the build time and library size of
`//deeplearning/fbgemm/fbgemm_gpu/codegen:embedding_ops`.

The diff modifies the build target to generate and compile only the
necessary files. This is based on the fact that CPU and GPU do not
support all optimizers in `SplitTBE`.  (Before this diff, all optimizers
were generated and compiled for both CPU and GPU.)

The following is the list of supported optimizers

|OptimType|Generated optimizer|Supported on CPU|Supported on GPU|
|---|---|---|---|
|EXACT_ADAGRAD|adagrad|x|x|
|EXACT_ROWWISE_ADAGRAD|rowwise_adagrad_with_counter|x|x|
||rowwise_adagrad|x|x|
|EXACT_ROWWISE_WEIGHTED_ADAGRAD|rowwise_weighted_adagrad|x|x|
|EXACT_SGD|sgd|x|x|
|SGD|approx_sgd|x|x|
|ROWWISE_ADAGRAD|approx_rowwise_adagrad_with_counter|x||
||approx_rowwise_adagrad|x||
|ADAM|adam||x|
|LAMB|lamb||x|
|LARS_SGD|lars_sgd||x|
|PARTIAL_ROWWISE_ADAM|partial_rowwise_adam||x|
|PARTIAL_ROWWISE_LAMB|partial_rowwise_lamb||x|
|-|rowwise_adagrad_with_weight_decay|||
|-|approx_rowwise_adagrad_with_weight_decay|||
Note: x = supported

Reviewed By: jianyuh

Differential Revision: D44326540

fbshipit-source-id: 02413256b4a675f13ada8e8820820cb5112cb405
Summary:
- Rewrite the documentation builds job to use the build infrastructure tooling
- Rename workflow files for consistency

Pull Request resolved: pytorch#1673

Reviewed By: shintaro-iwasaki

Differential Revision: D44472660

Pulled By: q10

fbshipit-source-id: 60434c1f7098b7efa8c750133bb22f14fc98d5dc
Summary:
Pull Request resolved: pytorch#1675

Original commit changeset: 02413256b4a6

Original Phabricator Diff: D44326540

Reviewed By: q10, jianyuh

Differential Revision: D44475251

fbshipit-source-id: 5be66944a833e03a2737fc6d1baaa5c351455b2c
Summary:
Pull Request resolved: pytorch#1633

Prepare `bounds_check_indices` for variable batch size TBE (VBE).

- Update the frontend API to accept VBE args
- Update the backend logic to process VBE data

Reviewed By: jianyuh

Differential Revision: D43253703

fbshipit-source-id: 2870f0c41a96265650281a9b6362d4e6dc48009b
pytorch#1667)

Summary:
Pull Request resolved: pytorch#1667

As title. This diff moves pruning/index_remapping support to embedding inplace update files.

Reviewed By: jianyuh

Differential Revision: D44409419

fbshipit-source-id: 93fc91d83502eb95cb0feca2a8a03b003c336078
Summary:
Pull Request resolved: pytorch#1661

This diff optimizes jagged_softmax forward with more efficient reduction from cub library.

Reviewed By: brad-mengchi

Differential Revision: D44161021

fbshipit-source-id: bf2e059d14ef4d7ad311edac65155a463ba653ff
Summary:
Pull Request resolved: pytorch#1662

This diff optimizes jagged_softmax backward with more efficient reduction from cub library

Reviewed By: brad-mengchi

Differential Revision: D44205819

fbshipit-source-id: cd1d7a886d6ba68201dc1ad782c2e8cde7ff706b
Summary:
Pull Request resolved: pytorch#1674

Improved multi-GPU all_to_one with (see the sketch below):
	1. new intermediate-hop selection taking advantage of distinct NVLinks
	2. overlapping of intermediate-hop transfers with each other and with direct-peer transfers
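
A generic sketch of overlapping direct peer copies, each on its own stream so transfers can proceed concurrently over distinct links (the intermediate-hop routing from this diff is omitted; all names are illustrative):
```
#include <cuda_runtime.h>
#include <vector>

void all_to_one(
    const std::vector<void*>& src_ptrs,          // one buffer per source GPU
    const std::vector<void*>& dst_ptrs,          // destinations on the target GPU
    const std::vector<size_t>& num_bytes,
    const std::vector<int>& src_devices,
    int target_device,
    const std::vector<cudaStream_t>& copy_streams) { // one stream per source
  // Issue all peer copies asynchronously so they overlap with each other.
  for (size_t i = 0; i < src_ptrs.size(); ++i) {
    cudaMemcpyPeerAsync(
        dst_ptrs[i], target_device,
        src_ptrs[i], src_devices[i],
        num_bytes[i], copy_streams[i]);
  }
  // Wait for all transfers before the result is consumed.
  for (cudaStream_t stream : copy_streams) {
    cudaStreamSynchronize(stream);
  }
}
```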

Reviewed By: doehyun

Differential Revision: D44285941

fbshipit-source-id: 0202083f04388b5ba60b8155809433f334993ef4
pytorch#1669)

Summary:
Pull Request resolved: pytorch#1669

Extract portions initializing the weights_placements/offsets tensors into separate functions and jit.export them.
SplitState is converted to a NamedTuple since we can't jit.script a dataclass that also holds an enum.

Reviewed By: houseroad

Differential Revision: D44338256

fbshipit-source-id: e1c12e5956f7217d51cd190958c3764d220e521d
generatedunixname89002005325676 and others added 28 commits April 20, 2023 08:56
Reviewed By: bigfootjon

Differential Revision: D45141964

fbshipit-source-id: 58308a31522a3b1446835e358a93483b611c4b15
Summary:
- Re-organize and comment the `CMakeLists.txt` for FBGEMM_GPU for better clarity
- Disable verbose HIPCC warnings that are non-actionable when building the ROCm variant of FBGEMM_GPU

Pull Request resolved: pytorch#1712

Reviewed By: shintaro-iwasaki

Differential Revision: D45189904

Pulled By: q10

fbshipit-source-id: 3df6ff3b957886c64bc13fc6bc7a0147b74ee783
Summary:
Pull Request resolved: pytorch#1711

This is to support the case of request-only combined input sparse feature broadcast.

When `broadcast_indices` is enabled, the assumptions for the inputs are:
- `cat_ad_offsets` and `cat_ad_indices` only contain the offsets and indices for the combined batches, where each batch contains only one instance (potentially multiple tables)
- `reordered_cat_ad_offsets` must already be broadcasted and contains `num_ads_in_batch * num_tables + 1` elements
- `batch_offsets` is also already broadcasted
- `num_indices_after_broadcast` is required to allocate the output buffer

added coverage for the newly added branch

Reviewed By: r-barnes

Differential Revision: D45155887

fbshipit-source-id: 67f96d60168aa83cf24fef459addee89f06e1c6b
Summary:
Add a check that the get_filelist Python exec worked.
With bad params (python, args, ...), get_filelist() used to continue without noticing/warning/erroring out,
making CMake fail later for obscure reasons ("no sources").
This adds a safety check on the RESULT_VARIABLE of CMake's execute_process().

Pull Request resolved: pytorch#1715

Reviewed By: shintaro-iwasaki

Differential Revision: D45235231

Pulled By: q10

fbshipit-source-id: 049eae1fc5d7f42d73048e81c02c2f282d8859b0
Summary:
- Fix bug introduced by PR 1711 (D45155887), which broke compilation of FBGEMM_GPU under ROCm 5.3

Pull Request resolved: pytorch#1719

Reviewed By: sryap

Differential Revision: D45238536

Pulled By: q10

fbshipit-source-id: de9d2aa01ced0a37be1ea7903a361e3a24beed8d
…ernel templates out of `embedding_backward_split_template.cu` (pytorch#1710)

Summary:
- Migrate the definition of `split_embedding_*_backward_codegen_*_*_kernel_warp_per_row_1` from `embedding_backward_split_template.cu` over to `embedding_backward_split_kernel_warp_template.cu` and explicitly instantiate the templates separately
- Migrate the definition of `split_embedding_*_backward_codegen_*_*_kernel_cta_per_row_1` from `embedding_backward_split_template.cu` over to `embedding_backward_split_kernel_cta_template.cu` and explicitly instantiate the templates separately

Pull Request resolved: pytorch#1710

Reviewed By: sryap

Differential Revision: D45205217

Pulled By: q10

fbshipit-source-id: 96b34e9389e70b64d8391f2c9d39f4009f3d65ce
Summary:
Add CLI support (M,N,K) to GEMMsBenchmark

Pull Request resolved: pytorch#1721

Reviewed By: sryap

Differential Revision: D45281533

Pulled By: q10

fbshipit-source-id: 0ce5b38f54877acb26421dead1d2dc63cd11a2a1

Summary:
Fix data conversion in `radix_sort` that can cause data loss.

Details:
When `elements_count` is passed to the internal kernel implementation it is implicitly converted from `int64_t` to `int`. It can cause data loss, resulting in a partially sorted array. This PR fixes this issue. As a result of changing the `elements_count` type in internal functions to `int64_t`, `histogram` and `histogram_ps` types also were updated (to not generate further conversions).
This is a follow-up for pytorch#1672.
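
A simplified illustration of the type change (not the actual `radix_sort` code); keeping the count and histogram in `int64_t` end-to-end avoids the narrowing:
```
#include <cstdint>
#include <vector>

// Count keys per lowest-radix-digit bucket; elements_count stays int64_t all
// the way down instead of being implicitly narrowed to int.
void radix_sort_histogram(
    const int64_t* keys,
    int64_t elements_count,
    std::vector<int64_t>& histogram) { // sized to 256 buckets by the caller
  for (int64_t i = 0; i < elements_count; ++i) {
    const int bucket = static_cast<int>(keys[i] & 0xFF);
    ++histogram[bucket];
  }
}
```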

Pull Request resolved: pytorch#1718

Reviewed By: sryap

Differential Revision: D45253811

Pulled By: q10

fbshipit-source-id: a5368a4401f05ebc471cb17107297a48f43a75c0
Summary:
Pull Request resolved: pytorch#1716

similar to D45155887

When `broadcast_lengths` is enabled, the lengths are copied from the only instance of each batch; this is also to facilitate request-only broadcast.

Reviewed By: r-barnes

Differential Revision: D45208736

fbshipit-source-id: 2c06cd4e9aae0c9c4e0668098de7db6f6da8c06b
Summary:
Pull Request resolved: pytorch#1722

remove unnecessary optional decorators for the two newly added sparse ops

Reviewed By: r-barnes

Differential Revision: D45286152

fbshipit-source-id: 26109548db1acbc8fdf1a5183977eb8c64b45d41
Summary:
Pull Request resolved: pytorch#1713

Prepare bounds_check_indices for variable batch size TBE (VBE).

- Update arg names

Reviewed By: jspark1105, r-barnes

Differential Revision: D45203680

fbshipit-source-id: 396c4122058db8dd1fc9eb5f0d620e8179c3e7a9
Summary:
Pull Request resolved: pytorch#1728

Freq-SGD requires setting both `weight_decay_mode=WeightDecayMode.COUNTER` and `counter_based_regularization` to kick in. Previously we checked the case where `weight_decay_mode` is set but no config is provided. There is another missing case, where the config is provided but users forget to set `weight_decay_mode`. We add that check in this diff.

In addition, we added logging to print out whether the counter is used **internally** or not, to make debugging easier.

Reviewed By: dvksabin

Differential Revision: D45329516

fbshipit-source-id: 30389671c34a17d4baf48726f28096a670ede0b6
Summary:
Pull Request resolved: pytorch#1717

Prepare `transpose_embedding_input` for variable batch size TBE (VBE).

- Update the frontend API with new args

Reviewed By: yuguo68

Differential Revision: D45212897

fbshipit-source-id: 5ad11a737130777fbe119aed6c7086e892752f4a
Summary:
Convert timebreakdown to a runtime CLI option.
Note: there is no code to measure packing, compute, kernel time ...
so these are (for now) reported as 0; only total time is measured.
```
     M,      N,      K,             Type,     Packing (us),      Kernel(us),    Postproc (us),       Total (us),  GOPs
    64,    800,    320,  FBGEMM_i8_acc32,                0,                 0,                0,          218.593, 149.9
    64,    800,    320,  FBGEMM_i8_acc16,              0.0,               0.0,              0.0,            187.6, 174.7
```

Pull Request resolved: pytorch#1725

Reviewed By: sryap

Differential Revision: D45361847

Pulled By: q10

fbshipit-source-id: 4f2991a6208f0a5ae780729ce19bee611720953b
Summary:
Pull Request resolved: pytorch#1730

In some cases, `torch.max(row_counter_dev)` causes a failure because `row_counter_dev` is an empty tensor; see the example flow (f431977946).

Here we guard the op by first checking if `row_counter_dev` is empty.

Reviewed By: sryap

Differential Revision: D45342010

fbshipit-source-id: 756a481c1098095f71dbb278ea84a01e89783790
…ytorch#1729)

Summary:
Pull Request resolved: pytorch#1729

As all-gather becomes expensive for tensor/sequence parallel training, we create padded row-wise quantization/dequantization kernels for flattened tensors to convert between fp8 (stored as uint8 for GPUs up to A100) and fp32 formats.
Since the activations/grads will be concatenated into a 1D tensor for all-gather, scaling to fit into the fp8 format's range can be tricky: small elements will be quantized to zero if the scale is chosen to accommodate the largest element in the model.

Thus, we continue to use the row-wise quantization used in the previous all2all kernel. Every block of size "row_dim" will be quantized with a scale chosen to accommodate the largest value in the block.

Since the total length of the flattened tensor will not always be divisible by row_dim, we pad the 1D tensor to a multiple of row_dim. The padding/unpadding is handled by the quantize/dequantize kernels and is invisible to the APIs calling them.
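
A CPU reference sketch of the scheme (the real op is a GPU kernel; int8 stored in uint8 is used here as a stand-in for the fp8 encoding, and 127 as the stand-in max code):
```
#include <cmath>
#include <cstdint>
#include <vector>

void quantize_padded_rowwise(
    const std::vector<float>& input,
    int64_t row_dim,
    std::vector<uint8_t>& quantized, // padded to a multiple of row_dim
    std::vector<float>& scales) {    // one scale per row_dim-sized block
  const int64_t n = static_cast<int64_t>(input.size());
  const int64_t padded_len = (n + row_dim - 1) / row_dim * row_dim;
  const int64_t num_rows = padded_len / row_dim;
  quantized.assign(padded_len, 0); // padded tail stays zero
  scales.assign(num_rows, 1.0f);

  for (int64_t r = 0; r < num_rows; ++r) {
    // Per-block scale chosen from the largest magnitude in the block.
    float max_abs = 0.0f;
    for (int64_t i = r * row_dim; i < (r + 1) * row_dim && i < n; ++i) {
      max_abs = std::fmax(max_abs, std::fabs(input[i]));
    }
    const float scale = max_abs > 0.0f ? 127.0f / max_abs : 1.0f;
    scales[r] = scale;
    for (int64_t i = r * row_dim; i < (r + 1) * row_dim && i < n; ++i) {
      const int code = static_cast<int>(std::lround(input[i] * scale));
      quantized[i] = static_cast<uint8_t>(static_cast<int8_t>(code));
    }
  }
}
```
Dequantization reverses the mapping (decode the 8-bit code, divide by the block's scale) and drops the padded tail, so callers never see the padding.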

Reviewed By: rohan-varma

Differential Revision: D42721325

Privacy Context Container: L1138451

fbshipit-source-id: 33c712ba2fae709d29babee5ee4a8af6c7637b68
…emm/fbgemm_gpu/codegen/embedding_forward_split_cpu.cpp (pytorch#1732)

Summary:
Pull Request resolved: pytorch#1732

`TORCH_CHECK` produces pretty generic error messages. Using, eg, `TORCH_CHECK_GE` produces a message that shows the names of the variables being compared as well as their values at the time of comparison. This makes debugging easier.
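
For illustration, the difference between the two macros on a made-up check (both are c10 macros; the exact message text may differ by PyTorch version):
```
#include <c10/util/Exception.h>
#include <cstdint>

void check_offsets(int64_t offsets_size, int64_t num_tables) {
  // Generic failure message that only echoes the whole expression.
  TORCH_CHECK(offsets_size >= num_tables);

  // Failure message additionally names the two operands and reports their
  // values at the time of the comparison.
  TORCH_CHECK_GE(offsets_size, num_tables);
}
```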

 - If you approve of this diff, please use the "Accept & Ship" button :-)

(7 files modified.)

Reviewed By: bangshengtang

Differential Revision: D45402701

fbshipit-source-id: 42501350543e31455e430b240e53f8e1883eb1ba
…emm/fbgemm_gpu/codegen/embedding_backward_dense_host.cpp (pytorch#1733)

Summary:
Pull Request resolved: pytorch#1733

`TORCH_CHECK` produces pretty generic error messages. Using, eg, `TORCH_CHECK_GE` produces a message that shows the names of the variables being compared as well as their values at the time of comparison. This makes debugging easier.

 - If you approve of this diff, please use the "Accept & Ship" button :-)

(7 files modified.)

Reviewed By: bangshengtang

Differential Revision: D45402700

fbshipit-source-id: 275bf837341a00d1cd4642b31bf9168455fa6c77
Summary:
- Further break up `setup_env.bash` into separate domain scripts for easier maintenance
- Update FBGEMM `CMakeLists.txt` to remove warning (pytorch#1714)

Pull Request resolved: pytorch#1731

Reviewed By: sryap

Differential Revision: D45406676

Pulled By: q10

fbshipit-source-id: 3ff5a7e2486b6898cb450d268a092371da5c2717
…emm/fbgemm_gpu/fb/src/split_embeddings_utils.cu (pytorch#1735)

Summary:
Pull Request resolved: pytorch#1735

`TORCH_CHECK` produces pretty generic error messages. Using, eg, `TORCH_CHECK_GE` produces a message that shows the names of the variables being compared as well as their values at the time of comparison. This makes debugging easier.

 - If you approve of this diff, please use the "Accept & Ship" button :-)

(7 files modified.)

Reviewed By: bangshengtang

Differential Revision: D45402704

fbshipit-source-id: 9e9b1c1f526a398bbe50c99055187195ab751fa2
…emm/fbgemm_gpu/src/split_embeddings_utils.cu (pytorch#1737)

Summary:
Pull Request resolved: pytorch#1737

`TORCH_CHECK` produces pretty generic error messages. Using, eg, `TORCH_CHECK_GE` produces a message that shows the names of the variables being compared as well as their values at the time of comparison. This makes debugging easier.

 - If you approve of this diff, please use the "Accept & Ship" button :-)

(3 files modified.)

Reviewed By: bangshengtang

Differential Revision: D45402697

fbshipit-source-id: c490d39bc826eab44ec16cbcc86273f8d7258fd9
Summary:
Pull Request resolved: pytorch#1739

In the multi-block cumsum case, the `inclusive_sum_scan_kernel`
implements the stream-scan technique in which each thread block has to
consume the preceding sum result from the previous block. The sum
result is passed via the `block_sums` buffer (global memory). To ensure
that the sum results are visible for inter-thread-block consumption,
the buffer has to be declared as `volatile` to prevent the compiler from
caching the results in registers. This diff adds the `volatile` keyword
to `block_sums`.
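
A reduced illustration of why the `volatile` qualifier matters for this hand-off (not the actual scan kernel): without it, the compiler may keep the loads below in registers and the spinning block would never observe the predecessor's write.
```
#include <cuda_runtime.h>

__global__ void consume_previous_block_sum(
    volatile const int* block_sums,  // running totals published per block
    volatile const int* block_flags, // set to 1 once a block's sum is ready
    int* out) {
  if (blockIdx.x == 0 || threadIdx.x != 0) {
    return; // block 0 has no predecessor; one thread per block spins
  }
  const unsigned prev = blockIdx.x - 1;
  while (block_flags[prev] == 0) {
    // volatile forces a fresh global-memory read on every iteration
  }
  out[blockIdx.x] = block_sums[prev];
}
```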

Reviewed By: q10

Differential Revision: D45435897

fbshipit-source-id: f81a25b43eda18ae1eb18bed33f595fc27ef2707
Summary:
Pull Request resolved: pytorch#1744

Adding BF16 support for HBC ops, and updating tests.

Reviewed By: q10, sryap

Differential Revision: D45449360

fbshipit-source-id: 8321155b426143d80064f12a910c0626bdfafbba
…gemm/include/fbgemm/Utils.h (pytorch#1746)

Summary:
Pull Request resolved: pytorch#1746

Designated initializers can make the code cleaner
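
A small example of the style (C++20 designated initializers; the struct here is made up, not the one in `Utils.h`):
```
struct BlockingParams {
  int mc;
  int nc;
  int kc;
};

// Before: positional initialization, easy to misorder silently.
const BlockingParams positional{96, 192, 256};

// After: each field is named at the initialization site.
const BlockingParams designated{.mc = 96, .nc = 192, .kc = 256};
```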

 - If you approve of this diff, please use the "Accept & Ship" button :-)

(1 files modified.)

Reviewed By: sryap

Differential Revision: D45464948

fbshipit-source-id: 28e38dc60b893fe7c91db0d791e069a6de87b420
Summary:
Pull Request resolved: pytorch#1742

Pull Request resolved: pytorch#1738

Instead of hardcoding x86_64 when installing dependencies, let's now dynamically determine the platform name

Reviewed By: excelle08

Differential Revision: D45246996

fbshipit-source-id: d9031e76a915c2362be62c85a3c1f0786828ca8b
…ller Files (pytorch#1723)

Summary:
- Migrate `*_embedding_*_codegen_forward_*_kernel` out of `embedding_forward_split_template.cu` and into `embedding_forward_split_kernel_template.cu`
- Migrate `*_embedding_nobag_codegen_forward_unweighted_small_kernel` out of `embedding_forward_split_template.cu` and into `embedding_forward_split_kernel_small_template.cu`

Pull Request resolved: pytorch#1723

Reviewed By: sryap

Differential Revision: D45363388

Pulled By: q10

fbshipit-source-id: 563ca610b15830aca854bc00d6a31fd6e8cb8a53
Summary:
- Add installation instructions for OSS
- Migrate Installation, Test, and Documentation information out of the README
- Add link to GitHub Discussions in the README
- Migrate the Netlify configuration from website to TOML file in the repo so that build jobs are configurable by developers

Pull Request resolved: pytorch#1750

Reviewed By: sryap, shintaro-iwasaki

Differential Revision: D45540724

Pulled By: q10

fbshipit-source-id: beaab824cc5d441b96b89daea2a71f541e21f2ec
@liligwu self-assigned this May 5, 2023
@liligwu merged commit b28e14b into main May 5, 2023