Unrecognized AIO_DEBUG_MODE= 5, using default level = WARNING
Version: v0.8.0
Built with: clang++ (Ubuntu Clang 14.0.0)
git-90df81a2a, Kuba Wolynko, 2023-08-07T16:35:09+02:00
built 20230809_111727 by on 96f65684ca4a
Internal environment variable DLS_DEBUG_SAVE_FAULTY_DATA is not prefixed with AIO_.
Internal environment variable DLS_DEBUG_PRINT_ON_SAME_KERNEL is not prefixed with AIO_.
AIO_DATA_DIR is /usr/local/share//libampere-aio
Available cores: 0, 1, 2, 3, 4, 5, 6, 7
AIO_NUM_THREADS read (but not applied yet) as 16
Couldn't read cpu governor
Numa balancing is off - OK
Requested 16 but only 8 are available. Num threads limited to 8
Binding thread 2 to 2
Binding thread 1 to 1
Binding thread 3 to 3
Binding thread 4 to 4
Binding thread 7 to 7
Binding thread 6 to 6
Binding thread 5 to 5
CPU bind done
Attempt to register kernel AvgPoolingMeta@NEON with priority clashes (priority-wise) with the following kernels: AvgPoolingMeta@NEON AvgPoolingMeta@NEON
Attempt to register kernel MaxPoolingMeta@NEON with priority clashes (priority-wise) with the following kernels: MaxPoolingMeta@NEON MaxPoolingMeta@NEON
Attempt to register kernel TransposeBERTVectorized@NEON with priority clashes (priority-wise) with the following kernels: TransposeBERTVectorized@NEON TransposeBERTVectorized@NEON
Attempt to register kernel TorchSliceVectorized@NEON with priority clashes (priority-wise) with the following kernels: TorchSliceVectorized@NEON TorchSliceVectorized@NEON
Attempt to register kernel TorchSliceVectorized@NEON with priority clashes (priority-wise) with the following kernels: TorchSliceVectorized@NEON TorchSliceVectorized@NEON TorchSliceVectorized@NEON
Attempt to register kernel TorchSliceVectorized@NEON with priority clashes (priority-wise) with the following kernels: TorchSliceVectorized@NEON TorchSliceVectorized@NEON TorchSliceVectorized@NEON TorchSliceVectorized@NEON
Registered Variables:
AIO_ALLOW_UNSAFE_DEPTHWISE = "0" is using default value
AIO_JIT_PROFILING = "0" is using default value
AIO_MICROKERNEL_MATMUL_FORCE = "0" is using default value
AIO_MICROKERNEL_DOTPROD_FORCE = "0" is using default value
AIO_DEBUG_LAYER_MERGING = "0" is using default value
AIO_DATA_CHECK_IMMUTABLE = "0" is using default value
AIO_LAYERS_TO_DEBUG is not set and has no default value
AIO_IMPLICIT_FP16_TRANSFORM_FILTER = "" (default = "")
DLS_DEBUG_SAVE_FAULTY_DATA is not set and has no default value
AIO_DEBUG_LAYER_MAX_ERROR_FLOAT = "1e-5" is using default value
AIO_DEBUG_LAYER_MEAN_ERROR_FP16 = "1e-5" is using default value
AIO_DEBUG_LAYER_MEAN_ERROR_INT8 = "1" is using default value
AIO_DEBUG_LAYER_MEAN_ERROR is not set and has no default value
AIO_DEBUG_LAYER_MAX_ERROR is not set and has no default value
AIO_CVJM_USE_MAGIC = "1" is using default value
DLS_DEBUG_PRINT_ON_SAME_KERNEL = "0" is using default value
AIO_CPU_BIND = "1" is using default value
AIO_PROFILER_TIME_SCALE = "1e3" is using default value
AIO_LEGACY_TF = "0" is using default value
AIO_PROCESS_MODE = "1" (default = "1")
AIO_REMOVE_PASSTHRU = "1" is using default value
AIO_PROFILER_SORT_MODE = "0" is using default value
AIO_DEBUGGER_LAYER_ID is not set and has no default value
AIO_GRAPH_FILE = "dls_graph" is using default value
AIO_PROFILER_SKIP_FIRST = "1" is using default value
AIO_DEBUG_LAYER_MAX_ERROR_INT8 = "1" is using default value
AIO_TRACING is not set and has no default value
AIO_SUPERNODE = "0" is using default value
AIO_PROFILER_LAYERS_TO_SKIP = "Data [merged]" is using default value
AIO_DEBUG_STRING_PRECISION = "3" is using default value
AIO_RECYCLE_BUFFERS = "1" is using default value AIO_DEBUGGER = "0" is using default value AIO_FORCE_MODE = "0" is using default value AIO_MEM_BIND = "1" is using default value AIO_PROFILER_OUTPUT_MODE = "NL" is using default value AIO_CPU_LEVEL is not set and has no default value AIO_NUMA_CPUS = "ALL" is using default value AIO_KERNEL_PREFERLIST = "" is using default value AIO_PROFILER_FLOAT_PRECISION = "6" is using default value AIO_SOFT_FP16 is not set and has no default value AIO_LIST_ENV_VARIABLES = "0" is using default value AIO_PROFILER_MAX_NAME_LEN = "60" is using default value AIO_ABORT_ON_ERROR = "0" is using default value AIO_PREFER_FLOAT_QUANTIZATION = "1" is using default value AIO_FORCE_GENERIC_MICROKERNEL = "0" is using default value AIO_EXPORT_GRAPH = "0" is using default value AIO_PROFILER_CONFIDENCE = "0.9" is using default value AIO_DEBUG_FILE = "" is using default value AIO_PROFILER_CSV_FILE = "cout" is using default value AIO_TOPOLOGY_DEBUG = "0" is using default value AIO_PROFILER_OUT_FILE = "cout" is using default value AIO_SANITIZE_OUTPUT = "0" is using default value AIO_CONV_ONE_JIT_USE_MAGIC = "1" is using default value AIO_NUM_THREADS = "16" has no default AIO_DEBUG_STRING_WIDTH = "-1" is using default value AIO_TRACER_STRING_POOL = "1000000" is using default value AIO_KERNEL_BLACKLIST = "" is using default value AIO_SHOULD_USE_NUMA = "0" is using default value AIO_SPLIT_BATCH = "0" is using default value AIO_USE_NAIVE_BINOP_ALG = "1" is using default value AIO_NEON_CONV_ONE_D = "256" is using default value AIO_NO_LAYER_MERGING = "0" is using default value AIO_DEBUG_LAYER_MAX_ERROR_FP16 = "1e-4" is using default value AIO_USE_SIMPLE_TRANSFORM = "1" is using default value AIO_USE_DETRANSPOSER_TRANSFORM = "1" is using default value AIO_PROFILER_CSV_MODE = "0" is using default value AIO_SAVE_MODEL = "0" is using default value AIO_SKIP_MASTER_THREAD = "1" (default = "0" ) AIO_UKERNEL_QADD_ROUND_INPUT = "1" is using default value AIO_MERGE_PAD_TO_CONV = "1" is using default value AIO_DEBUG_LAYER_MEAN_ERROR_FLOAT = "1e-6" is using default value AIO_PROFILER = "0" is using default value AIO_NEON_CONV_ONE_N = "200" is using default value AIO_REPORT_CONV_TASK is not set and has no default value AIO_CVJM_USE_LOOKUP = "1" is using default value AIO_DEBUG_MODE = "5" (default = "WARN" ) AIO_LIST_UNREGISTERED_ENV_VARIABLES = "1" is using default value XDG_DATA_DIRS = "/usr/local/share/:/usr/share/" is using default value AIO_CVJM_SPARSE_THRESHOLD = "0.05" is using default value AIO_NUMA_NODES = "LOCAL" is using default value AIO_NEON_CONV_ONE_F = "32" is using default value AIO_CONV_ONE_JIT_USE_LOOKUP = "1" is using default value Unknown AIO variable: AIO_LIB_ROOT = "/aio" DLS STARTED 11-10-2023 14:57:08 AIO_PROCESS_MODE: 1 AIO_FORCE_MODE: 0 AIO_NUM_THREADS: 8 CPU_BIND: 1 MEM_BIND: 1 AIO_SPLIT_BATCH: 0 AIO_NO_LAYER_MERGING 0 AIO_LEGACY_TF 0 AIO_SUPERNODE 0 AIO_USE_SIMPLE_TRANSFORM 1 AIO_USE_DETRANSPOSER_TRANSFORM 1 AIO_GRAPH_FILE dls_graph DLS_DEBUG (threshold): 0 AIO_DEBUG_FILE: AIO_PROFILER: 0 Unrecognized AIO_DEBUG_MODE= 5 usigng default level = WARNING Graph before optimizations graph(%self.1 : __torch__.torch.fx.graph_module.___torch_mangle_421.GraphModule, %x : Float(1, 3, 110, 110, strides=[36300, 12100, 110, 1], requires_grad=0, device=cpu)): %self.self_pos_embed : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_patch_embed_proj.bias : Float(128, strides=[1], requires_grad=0, device=cpu) = 
prim::Constant[value=]() %self.self_patch_embed_proj.weight : Float(128, 3, 10, 10, strides=[300, 100, 10, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_0_norm1.bias : Float(128, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_0_norm1.weight : Float(128, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_0_attn_qkv.bias : Float(384, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_0_attn_qkv.weight : Float(384, 128, strides=[128, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_0_attn_proj.weight : Float(128, 128, strides=[128, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_0_mlp_fc1.bias : Float(512, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_0_mlp_fc1.weight : Float(512, 128, strides=[128, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_0_mlp_fc2.weight : Float(128, 512, strides=[512, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_1_attn_qkv.weight : Float(384, 128, strides=[128, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_1_attn_proj.weight : Float(128, 128, strides=[128, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_1_mlp_fc1.weight : Float(512, 128, strides=[128, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_1_mlp_fc2.weight : Float(128, 512, strides=[512, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_2_attn_qkv.weight : Float(384, 128, strides=[128, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_2_attn_proj.weight : Float(128, 128, strides=[128, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_2_mlp_fc1.weight : Float(512, 128, strides=[128, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_2_mlp_fc2.weight : Float(128, 512, strides=[512, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_3_attn_qkv.weight : Float(384, 128, strides=[128, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_3_attn_proj.weight : Float(128, 128, strides=[128, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_3_mlp_fc1.weight : Float(512, 128, strides=[128, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_3_mlp_fc2.weight : Float(128, 512, strides=[512, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_4_attn_qkv.weight : Float(384, 128, strides=[128, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_4_attn_proj.weight : Float(128, 128, strides=[128, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_4_mlp_fc1.weight : Float(512, 128, strides=[128, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_4_mlp_fc2.weight : Float(128, 512, strides=[512, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_5_attn_qkv.weight : Float(384, 128, strides=[128, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_5_attn_proj.weight : Float(128, 128, strides=[128, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_5_mlp_fc1.weight : Float(512, 128, strides=[128, 1], requires_grad=0, device=cpu) = 
prim::Constant[value=]() %self.self_blocks_5_mlp_fc2.weight : Float(128, 512, strides=[512, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_6_attn_qkv.weight : Float(384, 128, strides=[128, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_6_attn_proj.weight : Float(128, 128, strides=[128, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_6_mlp_fc1.weight : Float(512, 128, strides=[128, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_6_mlp_fc2.weight : Float(128, 512, strides=[512, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_7_attn_qkv.weight : Float(384, 128, strides=[128, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_7_attn_proj.weight : Float(128, 128, strides=[128, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_7_mlp_fc1.weight : Float(512, 128, strides=[128, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_7_mlp_fc2.weight : Float(128, 512, strides=[512, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_8_attn_qkv.weight : Float(384, 128, strides=[128, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_8_attn_proj.weight : Float(128, 128, strides=[128, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_8_mlp_fc1.weight : Float(512, 128, strides=[128, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_8_mlp_fc2.weight : Float(128, 512, strides=[512, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_9_attn_qkv.weight : Float(384, 128, strides=[128, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_9_attn_proj.weight : Float(128, 128, strides=[128, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_9_mlp_fc1.weight : Float(512, 128, strides=[128, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_9_mlp_fc2.weight : Float(128, 512, strides=[512, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_10_attn_qkv.weight : Float(384, 128, strides=[128, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_10_attn_proj.weight : Float(128, 128, strides=[128, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_10_mlp_fc1.weight : Float(512, 128, strides=[128, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_10_mlp_fc2.weight : Float(128, 512, strides=[512, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_11_attn_qkv.weight : Float(384, 128, strides=[128, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_11_attn_proj.weight : Float(128, 128, strides=[128, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_11_mlp_fc1.weight : Float(512, 128, strides=[128, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_11_mlp_fc2.weight : Float(128, 512, strides=[512, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_head.bias : Float(1000, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_head.weight : Float(1000, 128, strides=[128, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %20 : bool = prim::Constant[value=1](), scope: __module.self_patch_embed_proj # 
/usr/local/lib/python3.10/dist-packages/torch/nn/modules/conv.py:459:0 %19 : bool = prim::Constant[value=0](), scope: __module.self_patch_embed_proj # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/conv.py:459:0 %18 : int = prim::Constant[value=1](), scope: __module.self_patch_embed_proj # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/conv.py:459:0 %17 : int = prim::Constant[value=0](), scope: __module.self_patch_embed_proj # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/conv.py:459:0 %16 : int[] = prim::Constant[value=[10, 10]]() %15 : int[] = prim::Constant[value=[0, 0]]() %14 : int[] = prim::Constant[value=[1, 1]]() %13 : int = prim::Constant[value=2]() # .1:6:0 %12 : int = prim::Constant[value=-1]() # .1:6:0 %11 : Float(1, 1, 128, strides=[128, 128, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %10 : float = prim::Constant[value=0.](), scope: __module.self_pos_drop # /usr/local/lib/python3.10/dist-packages/torch/nn/functional.py:1252:0 %9 : float = prim::Constant[value=9.9999999999999995e-07](), scope: __module.self_blocks_0_norm1 # /usr/local/lib/python3.10/dist-packages/torch/nn/functional.py:2515:0 %8 : int[] = prim::Constant[value=[128]]() %7 : int[] = prim::Constant[value=[1, 122, 3, 8, 16]]() %6 : int[] = prim::Constant[value=[2, 0, 3, 1, 4]]() %5 : NoneType = prim::Constant() %4 : int[] = prim::Constant[value=[1, 122, 128]]() %3 : str = prim::Constant[value="none"](), scope: __module.self_blocks_0_mlp_act # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/activation.py:685:0 %2 : int = prim::Constant[value=9223372036854775807]() # .1:354:0 %self_patch_embed_proj.1 : Float(1, 128, 11, 11, strides=[15488, 121, 11, 1], requires_grad=0, device=cpu) = aten::_convolution(%x, %self.self_patch_embed_proj.weight, %self.self_patch_embed_proj.bias, %16, %15, %14, %19, %15, %18, %19, %19, %20, %20), scope: __module.self_patch_embed_proj # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/conv.py:459:0 %flatten : Float(1, 128, 121, strides=[15488, 121, 1], requires_grad=0, device=cpu) = aten::flatten(%self_patch_embed_proj.1, %13, %12) # .1:6:0 %self_patch_embed_norm.1 : Float(1, 121, 128, strides=[15488, 1, 121], requires_grad=0, device=cpu) = aten::transpose(%flatten, %18, %13) # .1:7:0 %88 : Tensor[] = prim::ListConstruct(%11, %self_patch_embed_norm.1) %cat : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::cat(%88, %18) # .1:12:0 %input.1 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::add(%cat, %self.self_pos_embed, %18) # .1:13:0 %input.5 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::layer_norm(%input.1, %8, %self.self_blocks_0_norm1.weight, %self.self_blocks_0_norm1.bias, %9, %20), scope: __module.self_blocks_0_norm1 # /usr/local/lib/python3.10/dist-packages/torch/nn/functional.py:2515:0 %self_blocks_0_attn_qkv.1 : Float(1, 122, 384, strides=[46848, 384, 1], requires_grad=0, device=cpu) = aten::linear(%input.5, %self.self_blocks_0_attn_qkv.weight, %self.self_blocks_0_attn_qkv.bias), scope: __module.self_blocks_0_attn_qkv # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 %reshape : Float(1, 122, 3, 8, 16, strides=[46848, 384, 128, 16, 1], requires_grad=0, device=cpu) = aten::reshape(%self_blocks_0_attn_qkv.1, %7) # .1:19:0 %permute : Float(3, 1, 8, 122, 16, strides=[128, 46848, 16, 384, 1], requires_grad=0, device=cpu) = aten::permute(%reshape, %6) # .1:20:0 %106 : Tensor[] = aten::unbind(%permute, 
%17) # .1:21:0 %self_blocks_0_attn_q_norm.1 : Tensor, %self_blocks_0_attn_k_norm.1 : Tensor, %getitem_2 : Tensor = prim::ListUnpack(%106) %scaled_dot_product_attention : Float(1, 8, 122, 16, strides=[15616, 1952, 16, 1], requires_grad=0, device=cpu) = aten::scaled_dot_product_attention(%self_blocks_0_attn_q_norm.1, %self_blocks_0_attn_k_norm.1, %getitem_2, %5, %10, %19) # .1:27:0 %transpose_1 : Float(1, 122, 8, 16, strides=[15616, 16, 1952, 1], requires_grad=0, device=cpu) = aten::transpose(%scaled_dot_product_attention, %18, %13) # .1:28:0 %input.7 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::reshape(%transpose_1, %4) # .1:29:0 %input.9 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::linear(%input.7, %self.self_blocks_0_attn_proj.weight, %self.self_blocks_0_norm1.bias), scope: __module.self_blocks_0_attn_proj # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 %input.11 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::add(%input.1, %input.9, %18) # .1:34:0 %input.13 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::layer_norm(%input.11, %8, %self.self_blocks_0_norm1.weight, %self.self_blocks_0_norm1.bias, %9, %20), scope: __module.self_blocks_0_norm2 # /usr/local/lib/python3.10/dist-packages/torch/nn/functional.py:2515:0 %input.15 : Float(1, 122, 512, strides=[62464, 512, 1], requires_grad=0, device=cpu) = aten::linear(%input.13, %self.self_blocks_0_mlp_fc1.weight, %self.self_blocks_0_mlp_fc1.bias), scope: __module.self_blocks_0_mlp_fc1 # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 %input.17 : Float(1, 122, 512, strides=[62464, 512, 1], requires_grad=0, device=cpu) = aten::gelu(%input.15, %3), scope: __module.self_blocks_0_mlp_act # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/activation.py:685:0 %input.21 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::linear(%input.17, %self.self_blocks_0_mlp_fc2.weight, %self.self_blocks_0_norm1.bias), scope: __module.self_blocks_0_mlp_fc2 # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 %input.23 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::add(%input.11, %input.21, %18) # .1:44:0 %input.25 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::layer_norm(%input.23, %8, %self.self_blocks_0_norm1.weight, %self.self_blocks_0_norm1.bias, %9, %20), scope: __module.self_blocks_1_norm1 # /usr/local/lib/python3.10/dist-packages/torch/nn/functional.py:2515:0 %self_blocks_1_attn_qkv.1 : Float(1, 122, 384, strides=[46848, 384, 1], requires_grad=0, device=cpu) = aten::linear(%input.25, %self.self_blocks_1_attn_qkv.weight, %self.self_blocks_0_attn_qkv.bias), scope: __module.self_blocks_1_attn_qkv # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 %reshape_2 : Float(1, 122, 3, 8, 16, strides=[46848, 384, 128, 16, 1], requires_grad=0, device=cpu) = aten::reshape(%self_blocks_1_attn_qkv.1, %7) # .1:47:0 %permute_1 : Float(3, 1, 8, 122, 16, strides=[128, 46848, 16, 384, 1], requires_grad=0, device=cpu) = aten::permute(%reshape_2, %6) # .1:48:0 %152 : Tensor[] = aten::unbind(%permute_1, %17) # .1:49:0 %self_blocks_1_attn_q_norm.1 : Tensor, %self_blocks_1_attn_k_norm.1 : Tensor, %getitem_5 : Tensor = prim::ListUnpack(%152) %scaled_dot_product_attention_1 : Float(1, 8, 122, 16, strides=[15616, 1952, 16, 1], 
requires_grad=0, device=cpu) = aten::scaled_dot_product_attention(%self_blocks_1_attn_q_norm.1, %self_blocks_1_attn_k_norm.1, %getitem_5, %5, %10, %19) # .1:55:0 %transpose_2 : Float(1, 122, 8, 16, strides=[15616, 16, 1952, 1], requires_grad=0, device=cpu) = aten::transpose(%scaled_dot_product_attention_1, %18, %13) # .1:56:0 %input.27 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::reshape(%transpose_2, %4) # .1:57:0 %input.29 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::linear(%input.27, %self.self_blocks_1_attn_proj.weight, %self.self_blocks_0_norm1.bias), scope: __module.self_blocks_1_attn_proj # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 %input.31 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::add(%input.23, %input.29, %18) # .1:62:0 %input.33 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::layer_norm(%input.31, %8, %self.self_blocks_0_norm1.weight, %self.self_blocks_0_norm1.bias, %9, %20), scope: __module.self_blocks_1_norm2 # /usr/local/lib/python3.10/dist-packages/torch/nn/functional.py:2515:0 %input.35 : Float(1, 122, 512, strides=[62464, 512, 1], requires_grad=0, device=cpu) = aten::linear(%input.33, %self.self_blocks_1_mlp_fc1.weight, %self.self_blocks_0_mlp_fc1.bias), scope: __module.self_blocks_1_mlp_fc1 # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 %input.37 : Float(1, 122, 512, strides=[62464, 512, 1], requires_grad=0, device=cpu) = aten::gelu(%input.35, %3), scope: __module.self_blocks_1_mlp_act # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/activation.py:685:0 %input.41 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::linear(%input.37, %self.self_blocks_1_mlp_fc2.weight, %self.self_blocks_0_norm1.bias), scope: __module.self_blocks_1_mlp_fc2 # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 %input.43 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::add(%input.31, %input.41, %18) # .1:72:0 %input.45 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::layer_norm(%input.43, %8, %self.self_blocks_0_norm1.weight, %self.self_blocks_0_norm1.bias, %9, %20), scope: __module.self_blocks_2_norm1 # /usr/local/lib/python3.10/dist-packages/torch/nn/functional.py:2515:0 %self_blocks_2_attn_qkv.1 : Float(1, 122, 384, strides=[46848, 384, 1], requires_grad=0, device=cpu) = aten::linear(%input.45, %self.self_blocks_2_attn_qkv.weight, %self.self_blocks_0_attn_qkv.bias), scope: __module.self_blocks_2_attn_qkv # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 %reshape_4 : Float(1, 122, 3, 8, 16, strides=[46848, 384, 128, 16, 1], requires_grad=0, device=cpu) = aten::reshape(%self_blocks_2_attn_qkv.1, %7) # .1:75:0 %permute_2 : Float(3, 1, 8, 122, 16, strides=[128, 46848, 16, 384, 1], requires_grad=0, device=cpu) = aten::permute(%reshape_4, %6) # .1:76:0 %198 : Tensor[] = aten::unbind(%permute_2, %17) # .1:77:0 %self_blocks_2_attn_q_norm.1 : Tensor, %self_blocks_2_attn_k_norm.1 : Tensor, %getitem_8 : Tensor = prim::ListUnpack(%198) %scaled_dot_product_attention_2 : Float(1, 8, 122, 16, strides=[15616, 1952, 16, 1], requires_grad=0, device=cpu) = aten::scaled_dot_product_attention(%self_blocks_2_attn_q_norm.1, %self_blocks_2_attn_k_norm.1, %getitem_8, %5, %10, %19) # .1:83:0 %transpose_3 : Float(1, 122, 8, 16, strides=[15616, 16, 1952, 1], 
requires_grad=0, device=cpu) = aten::transpose(%scaled_dot_product_attention_2, %18, %13) # .1:84:0 %input.47 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::reshape(%transpose_3, %4) # .1:85:0 %input.49 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::linear(%input.47, %self.self_blocks_2_attn_proj.weight, %self.self_blocks_0_norm1.bias), scope: __module.self_blocks_2_attn_proj # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 %input.51 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::add(%input.43, %input.49, %18) # .1:90:0 %input.53 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::layer_norm(%input.51, %8, %self.self_blocks_0_norm1.weight, %self.self_blocks_0_norm1.bias, %9, %20), scope: __module.self_blocks_2_norm2 # /usr/local/lib/python3.10/dist-packages/torch/nn/functional.py:2515:0 %input.55 : Float(1, 122, 512, strides=[62464, 512, 1], requires_grad=0, device=cpu) = aten::linear(%input.53, %self.self_blocks_2_mlp_fc1.weight, %self.self_blocks_0_mlp_fc1.bias), scope: __module.self_blocks_2_mlp_fc1 # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 %input.57 : Float(1, 122, 512, strides=[62464, 512, 1], requires_grad=0, device=cpu) = aten::gelu(%input.55, %3), scope: __module.self_blocks_2_mlp_act # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/activation.py:685:0 %input.61 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::linear(%input.57, %self.self_blocks_2_mlp_fc2.weight, %self.self_blocks_0_norm1.bias), scope: __module.self_blocks_2_mlp_fc2 # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 %input.63 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::add(%input.51, %input.61, %18) # .1:100:0 %input.65 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::layer_norm(%input.63, %8, %self.self_blocks_0_norm1.weight, %self.self_blocks_0_norm1.bias, %9, %20), scope: __module.self_blocks_3_norm1 # /usr/local/lib/python3.10/dist-packages/torch/nn/functional.py:2515:0 %self_blocks_3_attn_qkv.1 : Float(1, 122, 384, strides=[46848, 384, 1], requires_grad=0, device=cpu) = aten::linear(%input.65, %self.self_blocks_3_attn_qkv.weight, %self.self_blocks_0_attn_qkv.bias), scope: __module.self_blocks_3_attn_qkv # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 %reshape_6 : Float(1, 122, 3, 8, 16, strides=[46848, 384, 128, 16, 1], requires_grad=0, device=cpu) = aten::reshape(%self_blocks_3_attn_qkv.1, %7) # .1:103:0 %permute_3 : Float(3, 1, 8, 122, 16, strides=[128, 46848, 16, 384, 1], requires_grad=0, device=cpu) = aten::permute(%reshape_6, %6) # .1:104:0 %244 : Tensor[] = aten::unbind(%permute_3, %17) # .1:105:0 %self_blocks_3_attn_q_norm.1 : Tensor, %self_blocks_3_attn_k_norm.1 : Tensor, %getitem_11 : Tensor = prim::ListUnpack(%244) %scaled_dot_product_attention_3 : Float(1, 8, 122, 16, strides=[15616, 1952, 16, 1], requires_grad=0, device=cpu) = aten::scaled_dot_product_attention(%self_blocks_3_attn_q_norm.1, %self_blocks_3_attn_k_norm.1, %getitem_11, %5, %10, %19) # .1:111:0 %transpose_4 : Float(1, 122, 8, 16, strides=[15616, 16, 1952, 1], requires_grad=0, device=cpu) = aten::transpose(%scaled_dot_product_attention_3, %18, %13) # .1:112:0 %input.67 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::reshape(%transpose_4, %4) # 
.1:113:0 %input.69 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::linear(%input.67, %self.self_blocks_3_attn_proj.weight, %self.self_blocks_0_norm1.bias), scope: __module.self_blocks_3_attn_proj # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 %input.71 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::add(%input.63, %input.69, %18) # .1:118:0 %input.73 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::layer_norm(%input.71, %8, %self.self_blocks_0_norm1.weight, %self.self_blocks_0_norm1.bias, %9, %20), scope: __module.self_blocks_3_norm2 # /usr/local/lib/python3.10/dist-packages/torch/nn/functional.py:2515:0 %input.75 : Float(1, 122, 512, strides=[62464, 512, 1], requires_grad=0, device=cpu) = aten::linear(%input.73, %self.self_blocks_3_mlp_fc1.weight, %self.self_blocks_0_mlp_fc1.bias), scope: __module.self_blocks_3_mlp_fc1 # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 %input.77 : Float(1, 122, 512, strides=[62464, 512, 1], requires_grad=0, device=cpu) = aten::gelu(%input.75, %3), scope: __module.self_blocks_3_mlp_act # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/activation.py:685:0 %input.81 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::linear(%input.77, %self.self_blocks_3_mlp_fc2.weight, %self.self_blocks_0_norm1.bias), scope: __module.self_blocks_3_mlp_fc2 # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 %input.83 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::add(%input.71, %input.81, %18) # .1:128:0 %input.85 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::layer_norm(%input.83, %8, %self.self_blocks_0_norm1.weight, %self.self_blocks_0_norm1.bias, %9, %20), scope: __module.self_blocks_4_norm1 # /usr/local/lib/python3.10/dist-packages/torch/nn/functional.py:2515:0 %self_blocks_4_attn_qkv.1 : Float(1, 122, 384, strides=[46848, 384, 1], requires_grad=0, device=cpu) = aten::linear(%input.85, %self.self_blocks_4_attn_qkv.weight, %self.self_blocks_0_attn_qkv.bias), scope: __module.self_blocks_4_attn_qkv # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 %reshape_8 : Float(1, 122, 3, 8, 16, strides=[46848, 384, 128, 16, 1], requires_grad=0, device=cpu) = aten::reshape(%self_blocks_4_attn_qkv.1, %7) # .1:131:0 %permute_4 : Float(3, 1, 8, 122, 16, strides=[128, 46848, 16, 384, 1], requires_grad=0, device=cpu) = aten::permute(%reshape_8, %6) # .1:132:0 %290 : Tensor[] = aten::unbind(%permute_4, %17) # .1:133:0 %self_blocks_4_attn_q_norm.1 : Tensor, %self_blocks_4_attn_k_norm.1 : Tensor, %getitem_14 : Tensor = prim::ListUnpack(%290) %scaled_dot_product_attention_4 : Float(1, 8, 122, 16, strides=[15616, 1952, 16, 1], requires_grad=0, device=cpu) = aten::scaled_dot_product_attention(%self_blocks_4_attn_q_norm.1, %self_blocks_4_attn_k_norm.1, %getitem_14, %5, %10, %19) # .1:139:0 %transpose_5 : Float(1, 122, 8, 16, strides=[15616, 16, 1952, 1], requires_grad=0, device=cpu) = aten::transpose(%scaled_dot_product_attention_4, %18, %13) # .1:140:0 %input.87 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::reshape(%transpose_5, %4) # .1:141:0 %input.89 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::linear(%input.87, %self.self_blocks_4_attn_proj.weight, %self.self_blocks_0_norm1.bias), scope: 
__module.self_blocks_4_attn_proj # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 %input.91 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::add(%input.83, %input.89, %18) # .1:146:0 %input.93 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::layer_norm(%input.91, %8, %self.self_blocks_0_norm1.weight, %self.self_blocks_0_norm1.bias, %9, %20), scope: __module.self_blocks_4_norm2 # /usr/local/lib/python3.10/dist-packages/torch/nn/functional.py:2515:0 %input.95 : Float(1, 122, 512, strides=[62464, 512, 1], requires_grad=0, device=cpu) = aten::linear(%input.93, %self.self_blocks_4_mlp_fc1.weight, %self.self_blocks_0_mlp_fc1.bias), scope: __module.self_blocks_4_mlp_fc1 # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 %input.97 : Float(1, 122, 512, strides=[62464, 512, 1], requires_grad=0, device=cpu) = aten::gelu(%input.95, %3), scope: __module.self_blocks_4_mlp_act # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/activation.py:685:0 %input.101 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::linear(%input.97, %self.self_blocks_4_mlp_fc2.weight, %self.self_blocks_0_norm1.bias), scope: __module.self_blocks_4_mlp_fc2 # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 %input.103 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::add(%input.91, %input.101, %18) # .1:156:0 %input.105 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::layer_norm(%input.103, %8, %self.self_blocks_0_norm1.weight, %self.self_blocks_0_norm1.bias, %9, %20), scope: __module.self_blocks_5_norm1 # /usr/local/lib/python3.10/dist-packages/torch/nn/functional.py:2515:0 %self_blocks_5_attn_qkv.1 : Float(1, 122, 384, strides=[46848, 384, 1], requires_grad=0, device=cpu) = aten::linear(%input.105, %self.self_blocks_5_attn_qkv.weight, %self.self_blocks_0_attn_qkv.bias), scope: __module.self_blocks_5_attn_qkv # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 %reshape_10 : Float(1, 122, 3, 8, 16, strides=[46848, 384, 128, 16, 1], requires_grad=0, device=cpu) = aten::reshape(%self_blocks_5_attn_qkv.1, %7) # .1:159:0 %permute_5 : Float(3, 1, 8, 122, 16, strides=[128, 46848, 16, 384, 1], requires_grad=0, device=cpu) = aten::permute(%reshape_10, %6) # .1:160:0 %336 : Tensor[] = aten::unbind(%permute_5, %17) # .1:161:0 %self_blocks_5_attn_q_norm.1 : Tensor, %self_blocks_5_attn_k_norm.1 : Tensor, %getitem_17 : Tensor = prim::ListUnpack(%336) %scaled_dot_product_attention_5 : Float(1, 8, 122, 16, strides=[15616, 1952, 16, 1], requires_grad=0, device=cpu) = aten::scaled_dot_product_attention(%self_blocks_5_attn_q_norm.1, %self_blocks_5_attn_k_norm.1, %getitem_17, %5, %10, %19) # .1:167:0 %transpose_6 : Float(1, 122, 8, 16, strides=[15616, 16, 1952, 1], requires_grad=0, device=cpu) = aten::transpose(%scaled_dot_product_attention_5, %18, %13) # .1:168:0 %input.107 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::reshape(%transpose_6, %4) # .1:169:0 %input.109 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::linear(%input.107, %self.self_blocks_5_attn_proj.weight, %self.self_blocks_0_norm1.bias), scope: __module.self_blocks_5_attn_proj # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 %input.111 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = 
aten::add(%input.103, %input.109, %18) # .1:174:0 %input.113 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::layer_norm(%input.111, %8, %self.self_blocks_0_norm1.weight, %self.self_blocks_0_norm1.bias, %9, %20), scope: __module.self_blocks_5_norm2 # /usr/local/lib/python3.10/dist-packages/torch/nn/functional.py:2515:0 %input.115 : Float(1, 122, 512, strides=[62464, 512, 1], requires_grad=0, device=cpu) = aten::linear(%input.113, %self.self_blocks_5_mlp_fc1.weight, %self.self_blocks_0_mlp_fc1.bias), scope: __module.self_blocks_5_mlp_fc1 # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 %input.117 : Float(1, 122, 512, strides=[62464, 512, 1], requires_grad=0, device=cpu) = aten::gelu(%input.115, %3), scope: __module.self_blocks_5_mlp_act # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/activation.py:685:0 %input.121 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::linear(%input.117, %self.self_blocks_5_mlp_fc2.weight, %self.self_blocks_0_norm1.bias), scope: __module.self_blocks_5_mlp_fc2 # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 %input.123 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::add(%input.111, %input.121, %18) # .1:184:0 %input.125 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::layer_norm(%input.123, %8, %self.self_blocks_0_norm1.weight, %self.self_blocks_0_norm1.bias, %9, %20), scope: __module.self_blocks_6_norm1 # /usr/local/lib/python3.10/dist-packages/torch/nn/functional.py:2515:0 %self_blocks_6_attn_qkv.1 : Float(1, 122, 384, strides=[46848, 384, 1], requires_grad=0, device=cpu) = aten::linear(%input.125, %self.self_blocks_6_attn_qkv.weight, %self.self_blocks_0_attn_qkv.bias), scope: __module.self_blocks_6_attn_qkv # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 %reshape_12 : Float(1, 122, 3, 8, 16, strides=[46848, 384, 128, 16, 1], requires_grad=0, device=cpu) = aten::reshape(%self_blocks_6_attn_qkv.1, %7) # .1:187:0 %permute_6 : Float(3, 1, 8, 122, 16, strides=[128, 46848, 16, 384, 1], requires_grad=0, device=cpu) = aten::permute(%reshape_12, %6) # .1:188:0 %382 : Tensor[] = aten::unbind(%permute_6, %17) # .1:189:0 %self_blocks_6_attn_q_norm.1 : Tensor, %self_blocks_6_attn_k_norm.1 : Tensor, %getitem_20 : Tensor = prim::ListUnpack(%382) %scaled_dot_product_attention_6 : Float(1, 8, 122, 16, strides=[15616, 1952, 16, 1], requires_grad=0, device=cpu) = aten::scaled_dot_product_attention(%self_blocks_6_attn_q_norm.1, %self_blocks_6_attn_k_norm.1, %getitem_20, %5, %10, %19) # .1:195:0 %transpose_7 : Float(1, 122, 8, 16, strides=[15616, 16, 1952, 1], requires_grad=0, device=cpu) = aten::transpose(%scaled_dot_product_attention_6, %18, %13) # .1:196:0 %input.127 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::reshape(%transpose_7, %4) # .1:197:0 %input.129 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::linear(%input.127, %self.self_blocks_6_attn_proj.weight, %self.self_blocks_0_norm1.bias), scope: __module.self_blocks_6_attn_proj # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 %input.131 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::add(%input.123, %input.129, %18) # .1:202:0 %input.133 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::layer_norm(%input.131, %8, 
%self.self_blocks_0_norm1.weight, %self.self_blocks_0_norm1.bias, %9, %20), scope: __module.self_blocks_6_norm2 # /usr/local/lib/python3.10/dist-packages/torch/nn/functional.py:2515:0 %input.135 : Float(1, 122, 512, strides=[62464, 512, 1], requires_grad=0, device=cpu) = aten::linear(%input.133, %self.self_blocks_6_mlp_fc1.weight, %self.self_blocks_0_mlp_fc1.bias), scope: __module.self_blocks_6_mlp_fc1 # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 %input.137 : Float(1, 122, 512, strides=[62464, 512, 1], requires_grad=0, device=cpu) = aten::gelu(%input.135, %3), scope: __module.self_blocks_6_mlp_act # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/activation.py:685:0 %input.141 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::linear(%input.137, %self.self_blocks_6_mlp_fc2.weight, %self.self_blocks_0_norm1.bias), scope: __module.self_blocks_6_mlp_fc2 # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 %input.143 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::add(%input.131, %input.141, %18) # .1:212:0 %input.145 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::layer_norm(%input.143, %8, %self.self_blocks_0_norm1.weight, %self.self_blocks_0_norm1.bias, %9, %20), scope: __module.self_blocks_7_norm1 # /usr/local/lib/python3.10/dist-packages/torch/nn/functional.py:2515:0 %self_blocks_7_attn_qkv.1 : Float(1, 122, 384, strides=[46848, 384, 1], requires_grad=0, device=cpu) = aten::linear(%input.145, %self.self_blocks_7_attn_qkv.weight, %self.self_blocks_0_attn_qkv.bias), scope: __module.self_blocks_7_attn_qkv # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 %reshape_14 : Float(1, 122, 3, 8, 16, strides=[46848, 384, 128, 16, 1], requires_grad=0, device=cpu) = aten::reshape(%self_blocks_7_attn_qkv.1, %7) # .1:215:0 %permute_7 : Float(3, 1, 8, 122, 16, strides=[128, 46848, 16, 384, 1], requires_grad=0, device=cpu) = aten::permute(%reshape_14, %6) # .1:216:0 %428 : Tensor[] = aten::unbind(%permute_7, %17) # .1:217:0 %self_blocks_7_attn_q_norm.1 : Tensor, %self_blocks_7_attn_k_norm.1 : Tensor, %getitem_23 : Tensor = prim::ListUnpack(%428) %scaled_dot_product_attention_7 : Float(1, 8, 122, 16, strides=[15616, 1952, 16, 1], requires_grad=0, device=cpu) = aten::scaled_dot_product_attention(%self_blocks_7_attn_q_norm.1, %self_blocks_7_attn_k_norm.1, %getitem_23, %5, %10, %19) # .1:223:0 %transpose_8 : Float(1, 122, 8, 16, strides=[15616, 16, 1952, 1], requires_grad=0, device=cpu) = aten::transpose(%scaled_dot_product_attention_7, %18, %13) # .1:224:0 %input.147 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::reshape(%transpose_8, %4) # .1:225:0 %input.149 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::linear(%input.147, %self.self_blocks_7_attn_proj.weight, %self.self_blocks_0_norm1.bias), scope: __module.self_blocks_7_attn_proj # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 %input.151 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::add(%input.143, %input.149, %18) # .1:230:0 %input.153 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::layer_norm(%input.151, %8, %self.self_blocks_0_norm1.weight, %self.self_blocks_0_norm1.bias, %9, %20), scope: __module.self_blocks_7_norm2 # /usr/local/lib/python3.10/dist-packages/torch/nn/functional.py:2515:0 %input.155 : 
Float(1, 122, 512, strides=[62464, 512, 1], requires_grad=0, device=cpu) = aten::linear(%input.153, %self.self_blocks_7_mlp_fc1.weight, %self.self_blocks_0_mlp_fc1.bias), scope: __module.self_blocks_7_mlp_fc1 # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 %input.157 : Float(1, 122, 512, strides=[62464, 512, 1], requires_grad=0, device=cpu) = aten::gelu(%input.155, %3), scope: __module.self_blocks_7_mlp_act # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/activation.py:685:0 %input.161 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::linear(%input.157, %self.self_blocks_7_mlp_fc2.weight, %self.self_blocks_0_norm1.bias), scope: __module.self_blocks_7_mlp_fc2 # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 %input.163 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::add(%input.151, %input.161, %18) # .1:240:0 %input.165 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::layer_norm(%input.163, %8, %self.self_blocks_0_norm1.weight, %self.self_blocks_0_norm1.bias, %9, %20), scope: __module.self_blocks_8_norm1 # /usr/local/lib/python3.10/dist-packages/torch/nn/functional.py:2515:0 %self_blocks_8_attn_qkv.1 : Float(1, 122, 384, strides=[46848, 384, 1], requires_grad=0, device=cpu) = aten::linear(%input.165, %self.self_blocks_8_attn_qkv.weight, %self.self_blocks_0_attn_qkv.bias), scope: __module.self_blocks_8_attn_qkv # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 %reshape_16 : Float(1, 122, 3, 8, 16, strides=[46848, 384, 128, 16, 1], requires_grad=0, device=cpu) = aten::reshape(%self_blocks_8_attn_qkv.1, %7) # .1:243:0 %permute_8 : Float(3, 1, 8, 122, 16, strides=[128, 46848, 16, 384, 1], requires_grad=0, device=cpu) = aten::permute(%reshape_16, %6) # .1:244:0 %474 : Tensor[] = aten::unbind(%permute_8, %17) # .1:245:0 %self_blocks_8_attn_q_norm.1 : Tensor, %self_blocks_8_attn_k_norm.1 : Tensor, %getitem_26 : Tensor = prim::ListUnpack(%474) %scaled_dot_product_attention_8 : Float(1, 8, 122, 16, strides=[15616, 1952, 16, 1], requires_grad=0, device=cpu) = aten::scaled_dot_product_attention(%self_blocks_8_attn_q_norm.1, %self_blocks_8_attn_k_norm.1, %getitem_26, %5, %10, %19) # .1:251:0 %transpose_9 : Float(1, 122, 8, 16, strides=[15616, 16, 1952, 1], requires_grad=0, device=cpu) = aten::transpose(%scaled_dot_product_attention_8, %18, %13) # .1:252:0 %input.167 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::reshape(%transpose_9, %4) # .1:253:0 %input.169 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::linear(%input.167, %self.self_blocks_8_attn_proj.weight, %self.self_blocks_0_norm1.bias), scope: __module.self_blocks_8_attn_proj # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 %input.171 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::add(%input.163, %input.169, %18) # .1:258:0 %input.173 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::layer_norm(%input.171, %8, %self.self_blocks_0_norm1.weight, %self.self_blocks_0_norm1.bias, %9, %20), scope: __module.self_blocks_8_norm2 # /usr/local/lib/python3.10/dist-packages/torch/nn/functional.py:2515:0 %input.175 : Float(1, 122, 512, strides=[62464, 512, 1], requires_grad=0, device=cpu) = aten::linear(%input.173, %self.self_blocks_8_mlp_fc1.weight, %self.self_blocks_0_mlp_fc1.bias), scope: 
__module.self_blocks_8_mlp_fc1 # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 %input.177 : Float(1, 122, 512, strides=[62464, 512, 1], requires_grad=0, device=cpu) = aten::gelu(%input.175, %3), scope: __module.self_blocks_8_mlp_act # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/activation.py:685:0 %input.181 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::linear(%input.177, %self.self_blocks_8_mlp_fc2.weight, %self.self_blocks_0_norm1.bias), scope: __module.self_blocks_8_mlp_fc2 # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 %input.183 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::add(%input.171, %input.181, %18) # .1:268:0 %input.185 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::layer_norm(%input.183, %8, %self.self_blocks_0_norm1.weight, %self.self_blocks_0_norm1.bias, %9, %20), scope: __module.self_blocks_9_norm1 # /usr/local/lib/python3.10/dist-packages/torch/nn/functional.py:2515:0 %self_blocks_9_attn_qkv.1 : Float(1, 122, 384, strides=[46848, 384, 1], requires_grad=0, device=cpu) = aten::linear(%input.185, %self.self_blocks_9_attn_qkv.weight, %self.self_blocks_0_attn_qkv.bias), scope: __module.self_blocks_9_attn_qkv # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 %reshape_18 : Float(1, 122, 3, 8, 16, strides=[46848, 384, 128, 16, 1], requires_grad=0, device=cpu) = aten::reshape(%self_blocks_9_attn_qkv.1, %7) # .1:271:0 %permute_9 : Float(3, 1, 8, 122, 16, strides=[128, 46848, 16, 384, 1], requires_grad=0, device=cpu) = aten::permute(%reshape_18, %6) # .1:272:0 %520 : Tensor[] = aten::unbind(%permute_9, %17) # .1:273:0 %self_blocks_9_attn_q_norm.1 : Tensor, %self_blocks_9_attn_k_norm.1 : Tensor, %getitem_29 : Tensor = prim::ListUnpack(%520) %scaled_dot_product_attention_9 : Float(1, 8, 122, 16, strides=[15616, 1952, 16, 1], requires_grad=0, device=cpu) = aten::scaled_dot_product_attention(%self_blocks_9_attn_q_norm.1, %self_blocks_9_attn_k_norm.1, %getitem_29, %5, %10, %19) # .1:279:0 %transpose_10 : Float(1, 122, 8, 16, strides=[15616, 16, 1952, 1], requires_grad=0, device=cpu) = aten::transpose(%scaled_dot_product_attention_9, %18, %13) # .1:280:0 %input.187 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::reshape(%transpose_10, %4) # .1:281:0 %input.189 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::linear(%input.187, %self.self_blocks_9_attn_proj.weight, %self.self_blocks_0_norm1.bias), scope: __module.self_blocks_9_attn_proj # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 %input.191 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::add(%input.183, %input.189, %18) # .1:286:0 %input.193 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::layer_norm(%input.191, %8, %self.self_blocks_0_norm1.weight, %self.self_blocks_0_norm1.bias, %9, %20), scope: __module.self_blocks_9_norm2 # /usr/local/lib/python3.10/dist-packages/torch/nn/functional.py:2515:0 %input.195 : Float(1, 122, 512, strides=[62464, 512, 1], requires_grad=0, device=cpu) = aten::linear(%input.193, %self.self_blocks_9_mlp_fc1.weight, %self.self_blocks_0_mlp_fc1.bias), scope: __module.self_blocks_9_mlp_fc1 # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 %input.197 : Float(1, 122, 512, strides=[62464, 512, 1], requires_grad=0, device=cpu) = 
aten::gelu(%input.195, %3), scope: __module.self_blocks_9_mlp_act # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/activation.py:685:0 %input.201 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::linear(%input.197, %self.self_blocks_9_mlp_fc2.weight, %self.self_blocks_0_norm1.bias), scope: __module.self_blocks_9_mlp_fc2 # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 %input.203 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::add(%input.191, %input.201, %18) # .1:296:0 %input.205 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::layer_norm(%input.203, %8, %self.self_blocks_0_norm1.weight, %self.self_blocks_0_norm1.bias, %9, %20), scope: __module.self_blocks_10_norm1 # /usr/local/lib/python3.10/dist-packages/torch/nn/functional.py:2515:0 %self_blocks_10_attn_qkv.1 : Float(1, 122, 384, strides=[46848, 384, 1], requires_grad=0, device=cpu) = aten::linear(%input.205, %self.self_blocks_10_attn_qkv.weight, %self.self_blocks_0_attn_qkv.bias), scope: __module.self_blocks_10_attn_qkv # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 %reshape_20 : Float(1, 122, 3, 8, 16, strides=[46848, 384, 128, 16, 1], requires_grad=0, device=cpu) = aten::reshape(%self_blocks_10_attn_qkv.1, %7) # .1:299:0 %permute_10 : Float(3, 1, 8, 122, 16, strides=[128, 46848, 16, 384, 1], requires_grad=0, device=cpu) = aten::permute(%reshape_20, %6) # .1:300:0 %566 : Tensor[] = aten::unbind(%permute_10, %17) # .1:301:0 %self_blocks_10_attn_q_norm.1 : Tensor, %self_blocks_10_attn_k_norm.1 : Tensor, %getitem_32 : Tensor = prim::ListUnpack(%566) %scaled_dot_product_attention_10 : Float(1, 8, 122, 16, strides=[15616, 1952, 16, 1], requires_grad=0, device=cpu) = aten::scaled_dot_product_attention(%self_blocks_10_attn_q_norm.1, %self_blocks_10_attn_k_norm.1, %getitem_32, %5, %10, %19) # .1:307:0 %transpose_11 : Float(1, 122, 8, 16, strides=[15616, 16, 1952, 1], requires_grad=0, device=cpu) = aten::transpose(%scaled_dot_product_attention_10, %18, %13) # .1:308:0 %input.207 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::reshape(%transpose_11, %4) # .1:309:0 %input.209 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::linear(%input.207, %self.self_blocks_10_attn_proj.weight, %self.self_blocks_0_norm1.bias), scope: __module.self_blocks_10_attn_proj # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 %input.211 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::add(%input.203, %input.209, %18) # .1:314:0 %input.213 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::layer_norm(%input.211, %8, %self.self_blocks_0_norm1.weight, %self.self_blocks_0_norm1.bias, %9, %20), scope: __module.self_blocks_10_norm2 # /usr/local/lib/python3.10/dist-packages/torch/nn/functional.py:2515:0 %input.215 : Float(1, 122, 512, strides=[62464, 512, 1], requires_grad=0, device=cpu) = aten::linear(%input.213, %self.self_blocks_10_mlp_fc1.weight, %self.self_blocks_0_mlp_fc1.bias), scope: __module.self_blocks_10_mlp_fc1 # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 %input.217 : Float(1, 122, 512, strides=[62464, 512, 1], requires_grad=0, device=cpu) = aten::gelu(%input.215, %3), scope: __module.self_blocks_10_mlp_act # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/activation.py:685:0 %input.221 : Float(1, 122, 
128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::linear(%input.217, %self.self_blocks_10_mlp_fc2.weight, %self.self_blocks_0_norm1.bias), scope: __module.self_blocks_10_mlp_fc2 # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 %input.223 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::add(%input.211, %input.221, %18) # .1:324:0 %input.225 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::layer_norm(%input.223, %8, %self.self_blocks_0_norm1.weight, %self.self_blocks_0_norm1.bias, %9, %20), scope: __module.self_blocks_11_norm1 # /usr/local/lib/python3.10/dist-packages/torch/nn/functional.py:2515:0 %self_blocks_11_attn_qkv.1 : Float(1, 122, 384, strides=[46848, 384, 1], requires_grad=0, device=cpu) = aten::linear(%input.225, %self.self_blocks_11_attn_qkv.weight, %self.self_blocks_0_attn_qkv.bias), scope: __module.self_blocks_11_attn_qkv # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 %reshape_22 : Float(1, 122, 3, 8, 16, strides=[46848, 384, 128, 16, 1], requires_grad=0, device=cpu) = aten::reshape(%self_blocks_11_attn_qkv.1, %7) # .1:327:0 %permute_11 : Float(3, 1, 8, 122, 16, strides=[128, 46848, 16, 384, 1], requires_grad=0, device=cpu) = aten::permute(%reshape_22, %6) # .1:328:0 %612 : Tensor[] = aten::unbind(%permute_11, %17) # .1:329:0 %self_blocks_11_attn_q_norm.1 : Tensor, %self_blocks_11_attn_k_norm.1 : Tensor, %getitem_35 : Tensor = prim::ListUnpack(%612) %scaled_dot_product_attention_11 : Float(1, 8, 122, 16, strides=[15616, 1952, 16, 1], requires_grad=0, device=cpu) = aten::scaled_dot_product_attention(%self_blocks_11_attn_q_norm.1, %self_blocks_11_attn_k_norm.1, %getitem_35, %5, %10, %19) # .1:335:0 %transpose_12 : Float(1, 122, 8, 16, strides=[15616, 16, 1952, 1], requires_grad=0, device=cpu) = aten::transpose(%scaled_dot_product_attention_11, %18, %13) # .1:336:0 %input.227 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::reshape(%transpose_12, %4) # .1:337:0 %input.229 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::linear(%input.227, %self.self_blocks_11_attn_proj.weight, %self.self_blocks_0_norm1.bias), scope: __module.self_blocks_11_attn_proj # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 %input.231 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::add(%input.223, %input.229, %18) # .1:342:0 %input.233 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::layer_norm(%input.231, %8, %self.self_blocks_0_norm1.weight, %self.self_blocks_0_norm1.bias, %9, %20), scope: __module.self_blocks_11_norm2 # /usr/local/lib/python3.10/dist-packages/torch/nn/functional.py:2515:0 %input.235 : Float(1, 122, 512, strides=[62464, 512, 1], requires_grad=0, device=cpu) = aten::linear(%input.233, %self.self_blocks_11_mlp_fc1.weight, %self.self_blocks_0_mlp_fc1.bias), scope: __module.self_blocks_11_mlp_fc1 # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 %input.237 : Float(1, 122, 512, strides=[62464, 512, 1], requires_grad=0, device=cpu) = aten::gelu(%input.235, %3), scope: __module.self_blocks_11_mlp_act # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/activation.py:685:0 %input.241 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::linear(%input.237, %self.self_blocks_11_mlp_fc2.weight, %self.self_blocks_0_norm1.bias), scope: 
__module.self_blocks_11_mlp_fc2 # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 %input.243 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::add(%input.231, %input.241, %18) # .1:352:0 %self_norm.1 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::layer_norm(%input.243, %8, %self.self_blocks_0_norm1.weight, %self.self_blocks_0_norm1.bias, %9, %20), scope: __module.self_norm # /usr/local/lib/python3.10/dist-packages/torch/nn/functional.py:2515:0 %650 : Tensor = aten::slice(%self_norm.1, %17, %17, %2, %18) # .1:354:0 %input.245 : Float(1, 128, strides=[15616, 1], requires_grad=0, device=cpu) = aten::select(%650, %18, %17) # .1:354:0 %655 : Float(1, 1000, strides=[1000, 1], requires_grad=0, device=cpu) = aten::linear(%input.245, %self.self_head.weight, %self.self_head.bias), scope: __module.self_head # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 %657 : (Tensor) = prim::TupleConstruct(%655) return (%657) Graph after fusion pass graph(%self.1 : __torch__.torch.fx.graph_module.___torch_mangle_421.GraphModule, %x : Float(1, 3, 110, 110, strides=[36300, 12100, 110, 1], requires_grad=0, device=cpu)): %19 : bool = prim::Constant[value=0](), scope: __module.self_patch_embed_proj # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/conv.py:459:0 %18 : int = prim::Constant[value=1](), scope: __module.self_patch_embed_proj # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/conv.py:459:0 %17 : int = prim::Constant[value=0](), scope: __module.self_patch_embed_proj # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/conv.py:459:0 %10 : float = prim::Constant[value=0.](), scope: __module.self_pos_drop # /usr/local/lib/python3.10/dist-packages/torch/nn/functional.py:1252:0 %5 : NoneType = prim::Constant() %self_patch_embed_proj.2 : Float(1, 128, 11, 11, strides=[15488, 121, 11, 1], requires_grad=0, device=cpu) = prim::AIOFusionGroup_0(%x) %flatten.1 : Float(1, 128, 121, strides=[15488, 121, 1], requires_grad=0, device=cpu) = prim::AIOFusionGroup_1(%self_patch_embed_proj.2) %self_patch_embed_norm.2 : Float(1, 121, 128, strides=[15488, 1, 121], requires_grad=0, device=cpu) = prim::AIOFusionGroup_2(%flatten.1) %987 : Tensor[] = prim::AIOFusionGroup_3(%self_patch_embed_norm.2) %cat.1 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = prim::AIOFusionGroup_4(%987) %input.2 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = prim::AIOFusionGroup_5(%cat.1) %input.6 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = prim::AIOFusionGroup_6(%input.2) %self_blocks_0_attn_qkv.2 : Float(1, 122, 384, strides=[46848, 384, 1], requires_grad=0, device=cpu) = prim::AIOFusionGroup_7(%input.6) %reshape.1 : Float(1, 122, 3, 8, 16, strides=[46848, 384, 128, 16, 1], requires_grad=0, device=cpu) = prim::AIOFusionGroup_8(%self_blocks_0_attn_qkv.2) %permute.1 : Float(3, 1, 8, 122, 16, strides=[128, 46848, 16, 384, 1], requires_grad=0, device=cpu) = prim::AIOFusionGroup_9(%reshape.1) %106 : Tensor[] = aten::unbind(%permute.1, %17) # .1:21:0 %self_blocks_0_attn_q_norm.1 : Tensor, %self_blocks_0_attn_k_norm.1 : Tensor, %getitem_2 : Tensor = prim::ListUnpack(%106) %scaled_dot_product_attention : Float(1, 8, 122, 16, strides=[15616, 1952, 16, 1], requires_grad=0, device=cpu) = aten::scaled_dot_product_attention(%self_blocks_0_attn_q_norm.1, %self_blocks_0_attn_k_norm.1, %getitem_2, %5, %10, %19) # .1:27:0 %transpose_1.1 : 
Float(1, 122, 8, 16, strides=[15616, 16, 1952, 1], requires_grad=0, device=cpu) = prim::AIOFusionGroup_10(%scaled_dot_product_attention) %input.10 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = prim::AIOFusionGroup_11(%transpose_1.1) %input.14 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = prim::AIOFusionGroup_12(%input.10) %input.18 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = prim::AIOFusionGroup_13(%input.2, %input.14) %input.22 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = prim::AIOFusionGroup_14(%input.18) %input.26 : Float(1, 122, 512, strides=[62464, 512, 1], requires_grad=0, device=cpu) = prim::AIOFusionGroup_15(%input.22) %input.30 : Float(1, 122, 512, strides=[62464, 512, 1], requires_grad=0, device=cpu) = prim::AIOFusionGroup_16(%input.26) %input.34 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = prim::AIOFusionGroup_17(%input.30) %input.38 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = prim::AIOFusionGroup_18(%input.18, %input.34) %input.42 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = prim::AIOFusionGroup_19(%input.38) %self_blocks_1_attn_qkv.2 : Float(1, 122, 384, strides=[46848, 384, 1], requires_grad=0, device=cpu) = prim::AIOFusionGroup_20(%input.42) %reshape_2.1 : Float(1, 122, 3, 8, 16, strides=[46848, 384, 128, 16, 1], requires_grad=0, device=cpu) = prim::AIOFusionGroup_21(%self_blocks_1_attn_qkv.2) %permute_1.1 : Float(3, 1, 8, 122, 16, strides=[128, 46848, 16, 384, 1], requires_grad=0, device=cpu) = prim::AIOFusionGroup_22(%reshape_2.1) %152 : Tensor[] = aten::unbind(%permute_1.1, %17) # .1:49:0 %self_blocks_1_attn_q_norm.1 : Tensor, %self_blocks_1_attn_k_norm.1 : Tensor, %getitem_5 : Tensor = prim::ListUnpack(%152) %scaled_dot_product_attention_1 : Float(1, 8, 122, 16, strides=[15616, 1952, 16, 1], requires_grad=0, device=cpu) = aten::scaled_dot_product_attention(%self_blocks_1_attn_q_norm.1, %self_blocks_1_attn_k_norm.1, %getitem_5, %5, %10, %19) # .1:55:0 %transpose_2.1 : Float(1, 122, 8, 16, strides=[15616, 16, 1952, 1], requires_grad=0, device=cpu) = prim::AIOFusionGroup_23(%scaled_dot_product_attention_1) %input.46 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = prim::AIOFusionGroup_24(%transpose_2.1) %input.50 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = prim::AIOFusionGroup_25(%input.46) %input.54 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = prim::AIOFusionGroup_26(%input.38, %input.50) %input.58 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = prim::AIOFusionGroup_27(%input.54) %input.62 : Float(1, 122, 512, strides=[62464, 512, 1], requires_grad=0, device=cpu) = prim::AIOFusionGroup_28(%input.58) %input.66 : Float(1, 122, 512, strides=[62464, 512, 1], requires_grad=0, device=cpu) = prim::AIOFusionGroup_29(%input.62) %input.70 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = prim::AIOFusionGroup_30(%input.66) %input.74 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = prim::AIOFusionGroup_31(%input.54, %input.70) %input.78 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = prim::AIOFusionGroup_32(%input.74) %self_blocks_2_attn_qkv.2 : Float(1, 122, 384, strides=[46848, 384, 1], requires_grad=0, device=cpu) = prim::AIOFusionGroup_33(%input.78) 
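For reference, the repeating per-block attention pattern visible in this fused graph (qkv linear, reshape to [1, 122, 3, 8, 16], permute by [2, 0, 3, 1, 4], unbind, scaled_dot_product_attention, transpose, reshape, output projection) can be written as a minimal PyTorch sketch. The fusion-group numbers in the comments refer to block 0 (AIOFusionGroup_7 through _12 as defined further below); the function and parameter names are illustrative, not taken from the traced module:

import torch
import torch.nn.functional as F

def attention_sketch(x, qkv_w, qkv_b, proj_w, proj_b):
    # x: (1, 122, 128) tokens; 8 heads of dim 16, matching the shapes printed in the dump.
    B, N, C = x.shape
    qkv = F.linear(x, qkv_w, qkv_b)                 # (1, 122, 384)      -- AIOFusionGroup_7
    qkv = qkv.reshape(B, N, 3, 8, 16)               #                    -- AIOFusionGroup_8
    qkv = qkv.permute(2, 0, 3, 1, 4)                # (3, 1, 8, 122, 16) -- AIOFusionGroup_9
    q, k, v = qkv.unbind(0)                         # left as aten::unbind in the fused graph
    attn = F.scaled_dot_product_attention(q, k, v)  # (1, 8, 122, 16), also left as an aten op
    out = attn.transpose(1, 2).reshape(B, N, C)     # (1, 122, 128)      -- AIOFusionGroup_10 / _11
    return F.linear(out, proj_w, proj_b)            #                    -- AIOFusionGroup_12

Each of the twelve blocks repeats this pattern with its own fusion-group indices, which is why the fused graph grows by the same cluster of nodes per block while aten::unbind and aten::scaled_dot_product_attention stay unfused.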
%reshape_4.1 : Float(1, 122, 3, 8, 16, strides=[46848, 384, 128, 16, 1], requires_grad=0, device=cpu) = prim::AIOFusionGroup_34(%self_blocks_2_attn_qkv.2) %permute_2.1 : Float(3, 1, 8, 122, 16, strides=[128, 46848, 16, 384, 1], requires_grad=0, device=cpu) = prim::AIOFusionGroup_35(%reshape_4.1) %198 : Tensor[] = aten::unbind(%permute_2.1, %17) # .1:77:0 %self_blocks_2_attn_q_norm.1 : Tensor, %self_blocks_2_attn_k_norm.1 : Tensor, %getitem_8 : Tensor = prim::ListUnpack(%198) %scaled_dot_product_attention_2 : Float(1, 8, 122, 16, strides=[15616, 1952, 16, 1], requires_grad=0, device=cpu) = aten::scaled_dot_product_attention(%self_blocks_2_attn_q_norm.1, %self_blocks_2_attn_k_norm.1, %getitem_8, %5, %10, %19) # .1:83:0 %transpose_3.1 : Float(1, 122, 8, 16, strides=[15616, 16, 1952, 1], requires_grad=0, device=cpu) = prim::AIOFusionGroup_36(%scaled_dot_product_attention_2) %input.82 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = prim::AIOFusionGroup_37(%transpose_3.1) %input.86 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = prim::AIOFusionGroup_38(%input.82) %input.90 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = prim::AIOFusionGroup_39(%input.74, %input.86) %input.94 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = prim::AIOFusionGroup_40(%input.90) %input.98 : Float(1, 122, 512, strides=[62464, 512, 1], requires_grad=0, device=cpu) = prim::AIOFusionGroup_41(%input.94) %input.102 : Float(1, 122, 512, strides=[62464, 512, 1], requires_grad=0, device=cpu) = prim::AIOFusionGroup_42(%input.98) %input.106 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = prim::AIOFusionGroup_43(%input.102) %input.110 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = prim::AIOFusionGroup_44(%input.90, %input.106) %input.114 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = prim::AIOFusionGroup_45(%input.110) %self_blocks_3_attn_qkv.2 : Float(1, 122, 384, strides=[46848, 384, 1], requires_grad=0, device=cpu) = prim::AIOFusionGroup_46(%input.114) %reshape_6.1 : Float(1, 122, 3, 8, 16, strides=[46848, 384, 128, 16, 1], requires_grad=0, device=cpu) = prim::AIOFusionGroup_47(%self_blocks_3_attn_qkv.2) %permute_3.1 : Float(3, 1, 8, 122, 16, strides=[128, 46848, 16, 384, 1], requires_grad=0, device=cpu) = prim::AIOFusionGroup_48(%reshape_6.1) %244 : Tensor[] = aten::unbind(%permute_3.1, %17) # .1:105:0 %self_blocks_3_attn_q_norm.1 : Tensor, %self_blocks_3_attn_k_norm.1 : Tensor, %getitem_11 : Tensor = prim::ListUnpack(%244) %scaled_dot_product_attention_3 : Float(1, 8, 122, 16, strides=[15616, 1952, 16, 1], requires_grad=0, device=cpu) = aten::scaled_dot_product_attention(%self_blocks_3_attn_q_norm.1, %self_blocks_3_attn_k_norm.1, %getitem_11, %5, %10, %19) # .1:111:0 %transpose_4.1 : Float(1, 122, 8, 16, strides=[15616, 16, 1952, 1], requires_grad=0, device=cpu) = prim::AIOFusionGroup_49(%scaled_dot_product_attention_3) %input.118 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = prim::AIOFusionGroup_50(%transpose_4.1) %input.122 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = prim::AIOFusionGroup_51(%input.118) %input.126 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = prim::AIOFusionGroup_52(%input.110, %input.122) %input.130 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = prim::AIOFusionGroup_53(%input.126) 
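The remainder of each block (the residual adds, the layer norms with eps printed as 9.9999999999999995e-07, i.e. 1e-6, and the 128 -> 512 -> 128 MLP with GELU) follows the same scheme. A hedged sketch of that tail, again with illustrative names and fusion-group numbers taken from block 0:

import torch
import torch.nn.functional as F

def block_tail_sketch(x, attn_out, norm_w, norm_b, fc1_w, fc1_b, fc2_w, fc2_b):
    # Shapes as printed in the dump: (1, 122, 128) -> (1, 122, 512) -> (1, 122, 128).
    x = x + attn_out                                      # AIOFusionGroup_13 (residual after attn proj)
    h = F.layer_norm(x, [128], norm_w, norm_b, eps=1e-6)  # AIOFusionGroup_14 (norm2)
    h = F.gelu(F.linear(h, fc1_w, fc1_b))                 # AIOFusionGroup_15 / _16
    h = F.linear(h, fc2_w, fc2_b)                         # AIOFusionGroup_17
    return x + h                                          # AIOFusionGroup_18 (residual after MLP)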
%input.134 : Float(1, 122, 512, strides=[62464, 512, 1], requires_grad=0, device=cpu) = prim::AIOFusionGroup_54(%input.130) %input.138 : Float(1, 122, 512, strides=[62464, 512, 1], requires_grad=0, device=cpu) = prim::AIOFusionGroup_55(%input.134) %input.142 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = prim::AIOFusionGroup_56(%input.138) %input.146 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = prim::AIOFusionGroup_57(%input.126, %input.142) %input.150 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = prim::AIOFusionGroup_58(%input.146) %self_blocks_4_attn_qkv.2 : Float(1, 122, 384, strides=[46848, 384, 1], requires_grad=0, device=cpu) = prim::AIOFusionGroup_59(%input.150) %reshape_8.1 : Float(1, 122, 3, 8, 16, strides=[46848, 384, 128, 16, 1], requires_grad=0, device=cpu) = prim::AIOFusionGroup_60(%self_blocks_4_attn_qkv.2) %permute_4.1 : Float(3, 1, 8, 122, 16, strides=[128, 46848, 16, 384, 1], requires_grad=0, device=cpu) = prim::AIOFusionGroup_61(%reshape_8.1) %290 : Tensor[] = aten::unbind(%permute_4.1, %17) # .1:133:0 %self_blocks_4_attn_q_norm.1 : Tensor, %self_blocks_4_attn_k_norm.1 : Tensor, %getitem_14 : Tensor = prim::ListUnpack(%290) %scaled_dot_product_attention_4 : Float(1, 8, 122, 16, strides=[15616, 1952, 16, 1], requires_grad=0, device=cpu) = aten::scaled_dot_product_attention(%self_blocks_4_attn_q_norm.1, %self_blocks_4_attn_k_norm.1, %getitem_14, %5, %10, %19) # .1:139:0 %transpose_5.1 : Float(1, 122, 8, 16, strides=[15616, 16, 1952, 1], requires_grad=0, device=cpu) = prim::AIOFusionGroup_62(%scaled_dot_product_attention_4) %input.154 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = prim::AIOFusionGroup_63(%transpose_5.1) %input.158 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = prim::AIOFusionGroup_64(%input.154) %input.162 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = prim::AIOFusionGroup_65(%input.146, %input.158) %input.166 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = prim::AIOFusionGroup_66(%input.162) %input.170 : Float(1, 122, 512, strides=[62464, 512, 1], requires_grad=0, device=cpu) = prim::AIOFusionGroup_67(%input.166) %input.174 : Float(1, 122, 512, strides=[62464, 512, 1], requires_grad=0, device=cpu) = prim::AIOFusionGroup_68(%input.170) %input.178 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = prim::AIOFusionGroup_69(%input.174) %input.182 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = prim::AIOFusionGroup_70(%input.162, %input.178) %input.186 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = prim::AIOFusionGroup_71(%input.182) %self_blocks_5_attn_qkv.2 : Float(1, 122, 384, strides=[46848, 384, 1], requires_grad=0, device=cpu) = prim::AIOFusionGroup_72(%input.186) %reshape_10.1 : Float(1, 122, 3, 8, 16, strides=[46848, 384, 128, 16, 1], requires_grad=0, device=cpu) = prim::AIOFusionGroup_73(%self_blocks_5_attn_qkv.2) %permute_5.1 : Float(3, 1, 8, 122, 16, strides=[128, 46848, 16, 384, 1], requires_grad=0, device=cpu) = prim::AIOFusionGroup_74(%reshape_10.1) %336 : Tensor[] = aten::unbind(%permute_5.1, %17) # .1:161:0 %self_blocks_5_attn_q_norm.1 : Tensor, %self_blocks_5_attn_k_norm.1 : Tensor, %getitem_17 : Tensor = prim::ListUnpack(%336) %scaled_dot_product_attention_5 : Float(1, 8, 122, 16, strides=[15616, 1952, 16, 1], requires_grad=0, device=cpu) = 
aten::scaled_dot_product_attention(%self_blocks_5_attn_q_norm.1, %self_blocks_5_attn_k_norm.1, %getitem_17, %5, %10, %19) # .1:167:0 %transpose_6.1 : Float(1, 122, 8, 16, strides=[15616, 16, 1952, 1], requires_grad=0, device=cpu) = prim::AIOFusionGroup_75(%scaled_dot_product_attention_5) %input.190 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = prim::AIOFusionGroup_76(%transpose_6.1) %input.194 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = prim::AIOFusionGroup_77(%input.190) %input.198 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = prim::AIOFusionGroup_78(%input.182, %input.194) %input.202 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = prim::AIOFusionGroup_79(%input.198) %input.206 : Float(1, 122, 512, strides=[62464, 512, 1], requires_grad=0, device=cpu) = prim::AIOFusionGroup_80(%input.202) %input.210 : Float(1, 122, 512, strides=[62464, 512, 1], requires_grad=0, device=cpu) = prim::AIOFusionGroup_81(%input.206) %input.214 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = prim::AIOFusionGroup_82(%input.210) %input.218 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = prim::AIOFusionGroup_83(%input.198, %input.214) %input.222 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = prim::AIOFusionGroup_84(%input.218) %self_blocks_6_attn_qkv.2 : Float(1, 122, 384, strides=[46848, 384, 1], requires_grad=0, device=cpu) = prim::AIOFusionGroup_85(%input.222) %reshape_12.1 : Float(1, 122, 3, 8, 16, strides=[46848, 384, 128, 16, 1], requires_grad=0, device=cpu) = prim::AIOFusionGroup_86(%self_blocks_6_attn_qkv.2) %permute_6.1 : Float(3, 1, 8, 122, 16, strides=[128, 46848, 16, 384, 1], requires_grad=0, device=cpu) = prim::AIOFusionGroup_87(%reshape_12.1) %382 : Tensor[] = aten::unbind(%permute_6.1, %17) # .1:189:0 %self_blocks_6_attn_q_norm.1 : Tensor, %self_blocks_6_attn_k_norm.1 : Tensor, %getitem_20 : Tensor = prim::ListUnpack(%382) %scaled_dot_product_attention_6 : Float(1, 8, 122, 16, strides=[15616, 1952, 16, 1], requires_grad=0, device=cpu) = aten::scaled_dot_product_attention(%self_blocks_6_attn_q_norm.1, %self_blocks_6_attn_k_norm.1, %getitem_20, %5, %10, %19) # .1:195:0 %transpose_7.1 : Float(1, 122, 8, 16, strides=[15616, 16, 1952, 1], requires_grad=0, device=cpu) = prim::AIOFusionGroup_88(%scaled_dot_product_attention_6) %input.226 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = prim::AIOFusionGroup_89(%transpose_7.1) %input.230 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = prim::AIOFusionGroup_90(%input.226) %input.234 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = prim::AIOFusionGroup_91(%input.218, %input.230) %input.238 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = prim::AIOFusionGroup_92(%input.234) %input.242 : Float(1, 122, 512, strides=[62464, 512, 1], requires_grad=0, device=cpu) = prim::AIOFusionGroup_93(%input.238) %input.246 : Float(1, 122, 512, strides=[62464, 512, 1], requires_grad=0, device=cpu) = prim::AIOFusionGroup_94(%input.242) %input.250 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = prim::AIOFusionGroup_95(%input.246) %input.254 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = prim::AIOFusionGroup_96(%input.234, %input.250) %input.258 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, 
device=cpu) = prim::AIOFusionGroup_97(%input.254) %self_blocks_7_attn_qkv.2 : Float(1, 122, 384, strides=[46848, 384, 1], requires_grad=0, device=cpu) = prim::AIOFusionGroup_98(%input.258) %reshape_14.1 : Float(1, 122, 3, 8, 16, strides=[46848, 384, 128, 16, 1], requires_grad=0, device=cpu) = prim::AIOFusionGroup_99(%self_blocks_7_attn_qkv.2) %permute_7.1 : Float(3, 1, 8, 122, 16, strides=[128, 46848, 16, 384, 1], requires_grad=0, device=cpu) = prim::AIOFusionGroup_100(%reshape_14.1) %428 : Tensor[] = aten::unbind(%permute_7.1, %17) # .1:217:0 %self_blocks_7_attn_q_norm.1 : Tensor, %self_blocks_7_attn_k_norm.1 : Tensor, %getitem_23 : Tensor = prim::ListUnpack(%428) %scaled_dot_product_attention_7 : Float(1, 8, 122, 16, strides=[15616, 1952, 16, 1], requires_grad=0, device=cpu) = aten::scaled_dot_product_attention(%self_blocks_7_attn_q_norm.1, %self_blocks_7_attn_k_norm.1, %getitem_23, %5, %10, %19) # .1:223:0 %transpose_8.1 : Float(1, 122, 8, 16, strides=[15616, 16, 1952, 1], requires_grad=0, device=cpu) = prim::AIOFusionGroup_101(%scaled_dot_product_attention_7) %input.262 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = prim::AIOFusionGroup_102(%transpose_8.1) %input.266 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = prim::AIOFusionGroup_103(%input.262) %input.270 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = prim::AIOFusionGroup_104(%input.254, %input.266) %input.274 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = prim::AIOFusionGroup_105(%input.270) %input.278 : Float(1, 122, 512, strides=[62464, 512, 1], requires_grad=0, device=cpu) = prim::AIOFusionGroup_106(%input.274) %input.282 : Float(1, 122, 512, strides=[62464, 512, 1], requires_grad=0, device=cpu) = prim::AIOFusionGroup_107(%input.278) %input.286 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = prim::AIOFusionGroup_108(%input.282) %input.290 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = prim::AIOFusionGroup_109(%input.270, %input.286) %input.294 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = prim::AIOFusionGroup_110(%input.290) %self_blocks_8_attn_qkv.2 : Float(1, 122, 384, strides=[46848, 384, 1], requires_grad=0, device=cpu) = prim::AIOFusionGroup_111(%input.294) %reshape_16.1 : Float(1, 122, 3, 8, 16, strides=[46848, 384, 128, 16, 1], requires_grad=0, device=cpu) = prim::AIOFusionGroup_112(%self_blocks_8_attn_qkv.2) %permute_8.1 : Float(3, 1, 8, 122, 16, strides=[128, 46848, 16, 384, 1], requires_grad=0, device=cpu) = prim::AIOFusionGroup_113(%reshape_16.1) %474 : Tensor[] = aten::unbind(%permute_8.1, %17) # .1:245:0 %self_blocks_8_attn_q_norm.1 : Tensor, %self_blocks_8_attn_k_norm.1 : Tensor, %getitem_26 : Tensor = prim::ListUnpack(%474) %scaled_dot_product_attention_8 : Float(1, 8, 122, 16, strides=[15616, 1952, 16, 1], requires_grad=0, device=cpu) = aten::scaled_dot_product_attention(%self_blocks_8_attn_q_norm.1, %self_blocks_8_attn_k_norm.1, %getitem_26, %5, %10, %19) # .1:251:0 %transpose_9.1 : Float(1, 122, 8, 16, strides=[15616, 16, 1952, 1], requires_grad=0, device=cpu) = prim::AIOFusionGroup_114(%scaled_dot_product_attention_8) %input.298 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = prim::AIOFusionGroup_115(%transpose_9.1) %input.302 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = prim::AIOFusionGroup_116(%input.298) %input.306 : Float(1, 122, 128, 
strides=[15616, 128, 1], requires_grad=0, device=cpu) = prim::AIOFusionGroup_117(%input.290, %input.302) %input.310 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = prim::AIOFusionGroup_118(%input.306) %input.314 : Float(1, 122, 512, strides=[62464, 512, 1], requires_grad=0, device=cpu) = prim::AIOFusionGroup_119(%input.310) %input.318 : Float(1, 122, 512, strides=[62464, 512, 1], requires_grad=0, device=cpu) = prim::AIOFusionGroup_120(%input.314) %input.322 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = prim::AIOFusionGroup_121(%input.318) %input.326 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = prim::AIOFusionGroup_122(%input.306, %input.322) %input.330 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = prim::AIOFusionGroup_123(%input.326) %self_blocks_9_attn_qkv.2 : Float(1, 122, 384, strides=[46848, 384, 1], requires_grad=0, device=cpu) = prim::AIOFusionGroup_124(%input.330) %reshape_18.1 : Float(1, 122, 3, 8, 16, strides=[46848, 384, 128, 16, 1], requires_grad=0, device=cpu) = prim::AIOFusionGroup_125(%self_blocks_9_attn_qkv.2) %permute_9.1 : Float(3, 1, 8, 122, 16, strides=[128, 46848, 16, 384, 1], requires_grad=0, device=cpu) = prim::AIOFusionGroup_126(%reshape_18.1) %520 : Tensor[] = aten::unbind(%permute_9.1, %17) # .1:273:0 %self_blocks_9_attn_q_norm.1 : Tensor, %self_blocks_9_attn_k_norm.1 : Tensor, %getitem_29 : Tensor = prim::ListUnpack(%520) %scaled_dot_product_attention_9 : Float(1, 8, 122, 16, strides=[15616, 1952, 16, 1], requires_grad=0, device=cpu) = aten::scaled_dot_product_attention(%self_blocks_9_attn_q_norm.1, %self_blocks_9_attn_k_norm.1, %getitem_29, %5, %10, %19) # .1:279:0 %transpose_10.1 : Float(1, 122, 8, 16, strides=[15616, 16, 1952, 1], requires_grad=0, device=cpu) = prim::AIOFusionGroup_127(%scaled_dot_product_attention_9) %input.334 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = prim::AIOFusionGroup_128(%transpose_10.1) %input.338 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = prim::AIOFusionGroup_129(%input.334) %input.342 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = prim::AIOFusionGroup_130(%input.326, %input.338) %input.346 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = prim::AIOFusionGroup_131(%input.342) %input.350 : Float(1, 122, 512, strides=[62464, 512, 1], requires_grad=0, device=cpu) = prim::AIOFusionGroup_132(%input.346) %input.354 : Float(1, 122, 512, strides=[62464, 512, 1], requires_grad=0, device=cpu) = prim::AIOFusionGroup_133(%input.350) %input.358 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = prim::AIOFusionGroup_134(%input.354) %input.362 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = prim::AIOFusionGroup_135(%input.342, %input.358) %input.366 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = prim::AIOFusionGroup_136(%input.362) %self_blocks_10_attn_qkv.2 : Float(1, 122, 384, strides=[46848, 384, 1], requires_grad=0, device=cpu) = prim::AIOFusionGroup_137(%input.366) %reshape_20.1 : Float(1, 122, 3, 8, 16, strides=[46848, 384, 128, 16, 1], requires_grad=0, device=cpu) = prim::AIOFusionGroup_138(%self_blocks_10_attn_qkv.2) %permute_10.1 : Float(3, 1, 8, 122, 16, strides=[128, 46848, 16, 384, 1], requires_grad=0, device=cpu) = prim::AIOFusionGroup_139(%reshape_20.1) %566 : Tensor[] = aten::unbind(%permute_10.1, %17) # 
.1:301:0 %self_blocks_10_attn_q_norm.1 : Tensor, %self_blocks_10_attn_k_norm.1 : Tensor, %getitem_32 : Tensor = prim::ListUnpack(%566) %scaled_dot_product_attention_10 : Float(1, 8, 122, 16, strides=[15616, 1952, 16, 1], requires_grad=0, device=cpu) = aten::scaled_dot_product_attention(%self_blocks_10_attn_q_norm.1, %self_blocks_10_attn_k_norm.1, %getitem_32, %5, %10, %19) # .1:307:0 %transpose_11.1 : Float(1, 122, 8, 16, strides=[15616, 16, 1952, 1], requires_grad=0, device=cpu) = prim::AIOFusionGroup_140(%scaled_dot_product_attention_10) %input.370 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = prim::AIOFusionGroup_141(%transpose_11.1) %input.374 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = prim::AIOFusionGroup_142(%input.370) %input.378 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = prim::AIOFusionGroup_143(%input.362, %input.374) %input.382 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = prim::AIOFusionGroup_144(%input.378) %input.386 : Float(1, 122, 512, strides=[62464, 512, 1], requires_grad=0, device=cpu) = prim::AIOFusionGroup_145(%input.382) %input.390 : Float(1, 122, 512, strides=[62464, 512, 1], requires_grad=0, device=cpu) = prim::AIOFusionGroup_146(%input.386) %input.394 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = prim::AIOFusionGroup_147(%input.390) %input.398 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = prim::AIOFusionGroup_148(%input.378, %input.394) %input.402 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = prim::AIOFusionGroup_149(%input.398) %self_blocks_11_attn_qkv.2 : Float(1, 122, 384, strides=[46848, 384, 1], requires_grad=0, device=cpu) = prim::AIOFusionGroup_150(%input.402) %reshape_22.1 : Float(1, 122, 3, 8, 16, strides=[46848, 384, 128, 16, 1], requires_grad=0, device=cpu) = prim::AIOFusionGroup_151(%self_blocks_11_attn_qkv.2) %permute_11.1 : Float(3, 1, 8, 122, 16, strides=[128, 46848, 16, 384, 1], requires_grad=0, device=cpu) = prim::AIOFusionGroup_152(%reshape_22.1) %612 : Tensor[] = aten::unbind(%permute_11.1, %17) # .1:329:0 %self_blocks_11_attn_q_norm.1 : Tensor, %self_blocks_11_attn_k_norm.1 : Tensor, %getitem_35 : Tensor = prim::ListUnpack(%612) %scaled_dot_product_attention_11 : Float(1, 8, 122, 16, strides=[15616, 1952, 16, 1], requires_grad=0, device=cpu) = aten::scaled_dot_product_attention(%self_blocks_11_attn_q_norm.1, %self_blocks_11_attn_k_norm.1, %getitem_35, %5, %10, %19) # .1:335:0 %transpose_12.1 : Float(1, 122, 8, 16, strides=[15616, 16, 1952, 1], requires_grad=0, device=cpu) = prim::AIOFusionGroup_153(%scaled_dot_product_attention_11) %input.406 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = prim::AIOFusionGroup_154(%transpose_12.1) %input.410 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = prim::AIOFusionGroup_155(%input.406) %input.414 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = prim::AIOFusionGroup_156(%input.398, %input.410) %input.418 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = prim::AIOFusionGroup_157(%input.414) %input.422 : Float(1, 122, 512, strides=[62464, 512, 1], requires_grad=0, device=cpu) = prim::AIOFusionGroup_158(%input.418) %input.426 : Float(1, 122, 512, strides=[62464, 512, 1], requires_grad=0, device=cpu) = prim::AIOFusionGroup_159(%input.422) %input.430 : Float(1, 122, 128, 
strides=[15616, 128, 1], requires_grad=0, device=cpu) = prim::AIOFusionGroup_160(%input.426) %input.434 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = prim::AIOFusionGroup_161(%input.414, %input.430) %self_norm.2 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = prim::AIOFusionGroup_162(%input.434) %983 : Tensor = prim::AIOFusionGroup_163(%self_norm.2) %input.245 : Float(1, 128, strides=[15616, 1], requires_grad=0, device=cpu) = aten::select(%983, %18, %17) # .1:354:0 %985 : Float(1, 1000, strides=[1000, 1], requires_grad=0, device=cpu) = prim::AIOFusionGroup_164(%input.245) %657 : (Tensor) = prim::TupleConstruct(%985) return (%657) with prim::AIOFusionGroup_0 = graph(%x : Float(1, 3, 110, 110, strides=[36300, 12100, 110, 1], requires_grad=0, device=cpu)): %self.self_patch_embed_proj.weight : Float(128, 3, 10, 10, strides=[300, 100, 10, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_patch_embed_proj.bias : Float(128, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %3 : int[] = prim::Constant[value=[10, 10]]() %4 : int[] = prim::Constant[value=[0, 0]]() %5 : int[] = prim::Constant[value=[1, 1]]() %6 : bool = prim::Constant[value=0]() %7 : int = prim::Constant[value=1]() %8 : bool = prim::Constant[value=1]() %self_patch_embed_proj.2 : Float(1, 128, 11, 11, strides=[15488, 121, 11, 1], requires_grad=0, device=cpu) = aten::_convolution(%x, %self.self_patch_embed_proj.weight, %self.self_patch_embed_proj.bias, %3, %4, %5, %6, %4, %7, %6, %6, %8, %8), scope: __module.self_patch_embed_proj # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/conv.py:459:0 return (%self_patch_embed_proj.2) with prim::AIOFusionGroup_1 = graph(%self_patch_embed_proj.2 : Float(1, 128, 11, 11, strides=[15488, 121, 11, 1], requires_grad=0, device=cpu)): %1 : int = prim::Constant[value=2]() %2 : int = prim::Constant[value=-1]() %flatten.1 : Float(1, 128, 121, strides=[15488, 121, 1], requires_grad=0, device=cpu) = aten::flatten(%self_patch_embed_proj.2, %1, %2) # .1:6:0 return (%flatten.1) with prim::AIOFusionGroup_2 = graph(%flatten.1 : Float(1, 128, 121, strides=[15488, 121, 1], requires_grad=0, device=cpu)): %1 : int = prim::Constant[value=1]() %2 : int = prim::Constant[value=2]() %self_patch_embed_norm.2 : Float(1, 121, 128, strides=[15488, 1, 121], requires_grad=0, device=cpu) = aten::transpose(%flatten.1, %1, %2) # .1:7:0 return (%self_patch_embed_norm.2) with prim::AIOFusionGroup_3 = graph(%self_patch_embed_norm.2 : Float(1, 121, 128, strides=[15488, 1, 121], requires_grad=0, device=cpu)): %0 : Float(1, 1, 128, strides=[128, 128, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %2 : Tensor[] = prim::ListConstruct(%0, %self_patch_embed_norm.2) return (%2) with prim::AIOFusionGroup_4 = graph(%0 : Tensor[]): %1 : int = prim::Constant[value=1]() %cat.1 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::cat(%0, %1) # .1:12:0 return (%cat.1) with prim::AIOFusionGroup_5 = graph(%cat.1 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu)): %self.self_pos_embed : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %2 : int = prim::Constant[value=1]() %input.2 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::add(%cat.1, %self.self_pos_embed, %2) # .1:13:0 return (%input.2) with prim::AIOFusionGroup_6 = graph(%input.2 : Float(1, 122, 128, strides=[15616, 128, 1], 
requires_grad=0, device=cpu)): %1 : int[] = prim::Constant[value=[128]]() %self.self_blocks_0_norm1.weight : Float(128, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_0_norm1.bias : Float(128, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %4 : float = prim::Constant[value=9.9999999999999995e-07]() %5 : bool = prim::Constant[value=1]() %input.6 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::layer_norm(%input.2, %1, %self.self_blocks_0_norm1.weight, %self.self_blocks_0_norm1.bias, %4, %5), scope: __module.self_blocks_0_norm1 # /usr/local/lib/python3.10/dist-packages/torch/nn/functional.py:2515:0 return (%input.6) with prim::AIOFusionGroup_7 = graph(%input.6 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu)): %self.self_blocks_0_attn_qkv.weight : Float(384, 128, strides=[128, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_0_attn_qkv.bias : Float(384, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self_blocks_0_attn_qkv.2 : Float(1, 122, 384, strides=[46848, 384, 1], requires_grad=0, device=cpu) = aten::linear(%input.6, %self.self_blocks_0_attn_qkv.weight, %self.self_blocks_0_attn_qkv.bias), scope: __module.self_blocks_0_attn_qkv # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 return (%self_blocks_0_attn_qkv.2) with prim::AIOFusionGroup_8 = graph(%self_blocks_0_attn_qkv.2 : Float(1, 122, 384, strides=[46848, 384, 1], requires_grad=0, device=cpu)): %1 : int[] = prim::Constant[value=[1, 122, 3, 8, 16]]() %reshape.1 : Float(1, 122, 3, 8, 16, strides=[46848, 384, 128, 16, 1], requires_grad=0, device=cpu) = aten::reshape(%self_blocks_0_attn_qkv.2, %1) # .1:19:0 return (%reshape.1) with prim::AIOFusionGroup_9 = graph(%reshape.1 : Float(1, 122, 3, 8, 16, strides=[46848, 384, 128, 16, 1], requires_grad=0, device=cpu)): %1 : int[] = prim::Constant[value=[2, 0, 3, 1, 4]]() %permute.1 : Float(3, 1, 8, 122, 16, strides=[128, 46848, 16, 384, 1], requires_grad=0, device=cpu) = aten::permute(%reshape.1, %1) # .1:20:0 return (%permute.1) with prim::AIOFusionGroup_10 = graph(%scaled_dot_product_attention : Float(1, 8, 122, 16, strides=[15616, 1952, 16, 1], requires_grad=0, device=cpu)): %1 : int = prim::Constant[value=1]() %2 : int = prim::Constant[value=2]() %transpose_1.1 : Float(1, 122, 8, 16, strides=[15616, 16, 1952, 1], requires_grad=0, device=cpu) = aten::transpose(%scaled_dot_product_attention, %1, %2) # .1:28:0 return (%transpose_1.1) with prim::AIOFusionGroup_11 = graph(%transpose_1.1 : Float(1, 122, 8, 16, strides=[15616, 16, 1952, 1], requires_grad=0, device=cpu)): %1 : int[] = prim::Constant[value=[1, 122, 128]]() %input.10 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::reshape(%transpose_1.1, %1) # .1:29:0 return (%input.10) with prim::AIOFusionGroup_12 = graph(%input.10 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu)): %self.self_blocks_0_attn_proj.weight : Float(128, 128, strides=[128, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_0_norm1.bias : Float(128, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %input.14 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::linear(%input.10, %self.self_blocks_0_attn_proj.weight, %self.self_blocks_0_norm1.bias), scope: __module.self_blocks_0_attn_proj # 
/usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 return (%input.14) with prim::AIOFusionGroup_13 = graph(%input.2 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu), %input.14 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu)): %2 : int = prim::Constant[value=1]() %input.18 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::add(%input.2, %input.14, %2) # .1:34:0 return (%input.18) with prim::AIOFusionGroup_14 = graph(%input.18 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu)): %1 : int[] = prim::Constant[value=[128]]() %self.self_blocks_0_norm1.weight : Float(128, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_0_norm1.bias : Float(128, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %4 : float = prim::Constant[value=9.9999999999999995e-07]() %5 : bool = prim::Constant[value=1]() %input.22 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::layer_norm(%input.18, %1, %self.self_blocks_0_norm1.weight, %self.self_blocks_0_norm1.bias, %4, %5), scope: __module.self_blocks_0_norm2 # /usr/local/lib/python3.10/dist-packages/torch/nn/functional.py:2515:0 return (%input.22) with prim::AIOFusionGroup_15 = graph(%input.22 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu)): %self.self_blocks_0_mlp_fc1.weight : Float(512, 128, strides=[128, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_0_mlp_fc1.bias : Float(512, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %input.26 : Float(1, 122, 512, strides=[62464, 512, 1], requires_grad=0, device=cpu) = aten::linear(%input.22, %self.self_blocks_0_mlp_fc1.weight, %self.self_blocks_0_mlp_fc1.bias), scope: __module.self_blocks_0_mlp_fc1 # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 return (%input.26) with prim::AIOFusionGroup_16 = graph(%input.26 : Float(1, 122, 512, strides=[62464, 512, 1], requires_grad=0, device=cpu)): %1 : str = prim::Constant[value="none"]() %input.30 : Float(1, 122, 512, strides=[62464, 512, 1], requires_grad=0, device=cpu) = aten::gelu(%input.26, %1), scope: __module.self_blocks_0_mlp_act # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/activation.py:685:0 return (%input.30) with prim::AIOFusionGroup_17 = graph(%input.30 : Float(1, 122, 512, strides=[62464, 512, 1], requires_grad=0, device=cpu)): %self.self_blocks_0_mlp_fc2.weight : Float(128, 512, strides=[512, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_0_norm1.bias : Float(128, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %input.34 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::linear(%input.30, %self.self_blocks_0_mlp_fc2.weight, %self.self_blocks_0_norm1.bias), scope: __module.self_blocks_0_mlp_fc2 # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 return (%input.34) with prim::AIOFusionGroup_18 = graph(%input.18 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu), %input.34 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu)): %2 : int = prim::Constant[value=1]() %input.38 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::add(%input.18, %input.34, %2) # .1:44:0 return (%input.38) with prim::AIOFusionGroup_19 = graph(%input.38 : Float(1, 122, 128, 
strides=[15616, 128, 1], requires_grad=0, device=cpu)): %1 : int[] = prim::Constant[value=[128]]() %self.self_blocks_0_norm1.weight : Float(128, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_0_norm1.bias : Float(128, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %4 : float = prim::Constant[value=9.9999999999999995e-07]() %5 : bool = prim::Constant[value=1]() %input.42 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::layer_norm(%input.38, %1, %self.self_blocks_0_norm1.weight, %self.self_blocks_0_norm1.bias, %4, %5), scope: __module.self_blocks_1_norm1 # /usr/local/lib/python3.10/dist-packages/torch/nn/functional.py:2515:0 return (%input.42) with prim::AIOFusionGroup_20 = graph(%input.42 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu)): %self.self_blocks_1_attn_qkv.weight : Float(384, 128, strides=[128, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_0_attn_qkv.bias : Float(384, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self_blocks_1_attn_qkv.2 : Float(1, 122, 384, strides=[46848, 384, 1], requires_grad=0, device=cpu) = aten::linear(%input.42, %self.self_blocks_1_attn_qkv.weight, %self.self_blocks_0_attn_qkv.bias), scope: __module.self_blocks_1_attn_qkv # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 return (%self_blocks_1_attn_qkv.2) with prim::AIOFusionGroup_21 = graph(%self_blocks_1_attn_qkv.2 : Float(1, 122, 384, strides=[46848, 384, 1], requires_grad=0, device=cpu)): %1 : int[] = prim::Constant[value=[1, 122, 3, 8, 16]]() %reshape_2.1 : Float(1, 122, 3, 8, 16, strides=[46848, 384, 128, 16, 1], requires_grad=0, device=cpu) = aten::reshape(%self_blocks_1_attn_qkv.2, %1) # .1:47:0 return (%reshape_2.1) with prim::AIOFusionGroup_22 = graph(%reshape_2.1 : Float(1, 122, 3, 8, 16, strides=[46848, 384, 128, 16, 1], requires_grad=0, device=cpu)): %1 : int[] = prim::Constant[value=[2, 0, 3, 1, 4]]() %permute_1.1 : Float(3, 1, 8, 122, 16, strides=[128, 46848, 16, 384, 1], requires_grad=0, device=cpu) = aten::permute(%reshape_2.1, %1) # .1:48:0 return (%permute_1.1) with prim::AIOFusionGroup_23 = graph(%scaled_dot_product_attention_1 : Float(1, 8, 122, 16, strides=[15616, 1952, 16, 1], requires_grad=0, device=cpu)): %1 : int = prim::Constant[value=1]() %2 : int = prim::Constant[value=2]() %transpose_2.1 : Float(1, 122, 8, 16, strides=[15616, 16, 1952, 1], requires_grad=0, device=cpu) = aten::transpose(%scaled_dot_product_attention_1, %1, %2) # .1:56:0 return (%transpose_2.1) with prim::AIOFusionGroup_24 = graph(%transpose_2.1 : Float(1, 122, 8, 16, strides=[15616, 16, 1952, 1], requires_grad=0, device=cpu)): %1 : int[] = prim::Constant[value=[1, 122, 128]]() %input.46 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::reshape(%transpose_2.1, %1) # .1:57:0 return (%input.46) with prim::AIOFusionGroup_25 = graph(%input.46 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu)): %self.self_blocks_1_attn_proj.weight : Float(128, 128, strides=[128, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_0_norm1.bias : Float(128, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %input.50 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::linear(%input.46, %self.self_blocks_1_attn_proj.weight, %self.self_blocks_0_norm1.bias), scope: __module.self_blocks_1_attn_proj # 
/usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 return (%input.50) with prim::AIOFusionGroup_26 = graph(%input.38 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu), %input.50 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu)): %2 : int = prim::Constant[value=1]() %input.54 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::add(%input.38, %input.50, %2) # .1:62:0 return (%input.54) with prim::AIOFusionGroup_27 = graph(%input.54 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu)): %1 : int[] = prim::Constant[value=[128]]() %self.self_blocks_0_norm1.weight : Float(128, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_0_norm1.bias : Float(128, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %4 : float = prim::Constant[value=9.9999999999999995e-07]() %5 : bool = prim::Constant[value=1]() %input.58 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::layer_norm(%input.54, %1, %self.self_blocks_0_norm1.weight, %self.self_blocks_0_norm1.bias, %4, %5), scope: __module.self_blocks_1_norm2 # /usr/local/lib/python3.10/dist-packages/torch/nn/functional.py:2515:0 return (%input.58) with prim::AIOFusionGroup_28 = graph(%input.58 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu)): %self.self_blocks_1_mlp_fc1.weight : Float(512, 128, strides=[128, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_0_mlp_fc1.bias : Float(512, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %input.62 : Float(1, 122, 512, strides=[62464, 512, 1], requires_grad=0, device=cpu) = aten::linear(%input.58, %self.self_blocks_1_mlp_fc1.weight, %self.self_blocks_0_mlp_fc1.bias), scope: __module.self_blocks_1_mlp_fc1 # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 return (%input.62) with prim::AIOFusionGroup_29 = graph(%input.62 : Float(1, 122, 512, strides=[62464, 512, 1], requires_grad=0, device=cpu)): %1 : str = prim::Constant[value="none"]() %input.66 : Float(1, 122, 512, strides=[62464, 512, 1], requires_grad=0, device=cpu) = aten::gelu(%input.62, %1), scope: __module.self_blocks_1_mlp_act # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/activation.py:685:0 return (%input.66) with prim::AIOFusionGroup_30 = graph(%input.66 : Float(1, 122, 512, strides=[62464, 512, 1], requires_grad=0, device=cpu)): %self.self_blocks_1_mlp_fc2.weight : Float(128, 512, strides=[512, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_0_norm1.bias : Float(128, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %input.70 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::linear(%input.66, %self.self_blocks_1_mlp_fc2.weight, %self.self_blocks_0_norm1.bias), scope: __module.self_blocks_1_mlp_fc2 # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 return (%input.70) with prim::AIOFusionGroup_31 = graph(%input.54 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu), %input.70 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu)): %2 : int = prim::Constant[value=1]() %input.74 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::add(%input.54, %input.70, %2) # .1:72:0 return (%input.74) with prim::AIOFusionGroup_32 = graph(%input.74 : Float(1, 122, 128, 
strides=[15616, 128, 1], requires_grad=0, device=cpu)): %1 : int[] = prim::Constant[value=[128]]() %self.self_blocks_0_norm1.weight : Float(128, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_0_norm1.bias : Float(128, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %4 : float = prim::Constant[value=9.9999999999999995e-07]() %5 : bool = prim::Constant[value=1]() %input.78 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::layer_norm(%input.74, %1, %self.self_blocks_0_norm1.weight, %self.self_blocks_0_norm1.bias, %4, %5), scope: __module.self_blocks_2_norm1 # /usr/local/lib/python3.10/dist-packages/torch/nn/functional.py:2515:0 return (%input.78) with prim::AIOFusionGroup_33 = graph(%input.78 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu)): %self.self_blocks_2_attn_qkv.weight : Float(384, 128, strides=[128, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_0_attn_qkv.bias : Float(384, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self_blocks_2_attn_qkv.2 : Float(1, 122, 384, strides=[46848, 384, 1], requires_grad=0, device=cpu) = aten::linear(%input.78, %self.self_blocks_2_attn_qkv.weight, %self.self_blocks_0_attn_qkv.bias), scope: __module.self_blocks_2_attn_qkv # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 return (%self_blocks_2_attn_qkv.2) with prim::AIOFusionGroup_34 = graph(%self_blocks_2_attn_qkv.2 : Float(1, 122, 384, strides=[46848, 384, 1], requires_grad=0, device=cpu)): %1 : int[] = prim::Constant[value=[1, 122, 3, 8, 16]]() %reshape_4.1 : Float(1, 122, 3, 8, 16, strides=[46848, 384, 128, 16, 1], requires_grad=0, device=cpu) = aten::reshape(%self_blocks_2_attn_qkv.2, %1) # .1:75:0 return (%reshape_4.1) with prim::AIOFusionGroup_35 = graph(%reshape_4.1 : Float(1, 122, 3, 8, 16, strides=[46848, 384, 128, 16, 1], requires_grad=0, device=cpu)): %1 : int[] = prim::Constant[value=[2, 0, 3, 1, 4]]() %permute_2.1 : Float(3, 1, 8, 122, 16, strides=[128, 46848, 16, 384, 1], requires_grad=0, device=cpu) = aten::permute(%reshape_4.1, %1) # .1:76:0 return (%permute_2.1) with prim::AIOFusionGroup_36 = graph(%scaled_dot_product_attention_2 : Float(1, 8, 122, 16, strides=[15616, 1952, 16, 1], requires_grad=0, device=cpu)): %1 : int = prim::Constant[value=1]() %2 : int = prim::Constant[value=2]() %transpose_3.1 : Float(1, 122, 8, 16, strides=[15616, 16, 1952, 1], requires_grad=0, device=cpu) = aten::transpose(%scaled_dot_product_attention_2, %1, %2) # .1:84:0 return (%transpose_3.1) with prim::AIOFusionGroup_37 = graph(%transpose_3.1 : Float(1, 122, 8, 16, strides=[15616, 16, 1952, 1], requires_grad=0, device=cpu)): %1 : int[] = prim::Constant[value=[1, 122, 128]]() %input.82 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::reshape(%transpose_3.1, %1) # .1:85:0 return (%input.82) with prim::AIOFusionGroup_38 = graph(%input.82 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu)): %self.self_blocks_2_attn_proj.weight : Float(128, 128, strides=[128, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_0_norm1.bias : Float(128, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %input.86 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::linear(%input.82, %self.self_blocks_2_attn_proj.weight, %self.self_blocks_0_norm1.bias), scope: __module.self_blocks_2_attn_proj # 
/usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 return (%input.86) with prim::AIOFusionGroup_39 = graph(%input.74 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu), %input.86 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu)): %2 : int = prim::Constant[value=1]() %input.90 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::add(%input.74, %input.86, %2) # .1:90:0 return (%input.90) with prim::AIOFusionGroup_40 = graph(%input.90 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu)): %1 : int[] = prim::Constant[value=[128]]() %self.self_blocks_0_norm1.weight : Float(128, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_0_norm1.bias : Float(128, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %4 : float = prim::Constant[value=9.9999999999999995e-07]() %5 : bool = prim::Constant[value=1]() %input.94 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::layer_norm(%input.90, %1, %self.self_blocks_0_norm1.weight, %self.self_blocks_0_norm1.bias, %4, %5), scope: __module.self_blocks_2_norm2 # /usr/local/lib/python3.10/dist-packages/torch/nn/functional.py:2515:0 return (%input.94) with prim::AIOFusionGroup_41 = graph(%input.94 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu)): %self.self_blocks_2_mlp_fc1.weight : Float(512, 128, strides=[128, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_0_mlp_fc1.bias : Float(512, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %input.98 : Float(1, 122, 512, strides=[62464, 512, 1], requires_grad=0, device=cpu) = aten::linear(%input.94, %self.self_blocks_2_mlp_fc1.weight, %self.self_blocks_0_mlp_fc1.bias), scope: __module.self_blocks_2_mlp_fc1 # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 return (%input.98) with prim::AIOFusionGroup_42 = graph(%input.98 : Float(1, 122, 512, strides=[62464, 512, 1], requires_grad=0, device=cpu)): %1 : str = prim::Constant[value="none"]() %input.102 : Float(1, 122, 512, strides=[62464, 512, 1], requires_grad=0, device=cpu) = aten::gelu(%input.98, %1), scope: __module.self_blocks_2_mlp_act # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/activation.py:685:0 return (%input.102) with prim::AIOFusionGroup_43 = graph(%input.102 : Float(1, 122, 512, strides=[62464, 512, 1], requires_grad=0, device=cpu)): %self.self_blocks_2_mlp_fc2.weight : Float(128, 512, strides=[512, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_0_norm1.bias : Float(128, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %input.106 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::linear(%input.102, %self.self_blocks_2_mlp_fc2.weight, %self.self_blocks_0_norm1.bias), scope: __module.self_blocks_2_mlp_fc2 # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 return (%input.106) with prim::AIOFusionGroup_44 = graph(%input.90 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu), %input.106 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu)): %2 : int = prim::Constant[value=1]() %input.110 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::add(%input.90, %input.106, %2) # .1:100:0 return (%input.110) with prim::AIOFusionGroup_45 = graph(%input.110 : Float(1, 122, 
128, strides=[15616, 128, 1], requires_grad=0, device=cpu)): %1 : int[] = prim::Constant[value=[128]]() %self.self_blocks_0_norm1.weight : Float(128, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_0_norm1.bias : Float(128, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %4 : float = prim::Constant[value=9.9999999999999995e-07]() %5 : bool = prim::Constant[value=1]() %input.114 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::layer_norm(%input.110, %1, %self.self_blocks_0_norm1.weight, %self.self_blocks_0_norm1.bias, %4, %5), scope: __module.self_blocks_3_norm1 # /usr/local/lib/python3.10/dist-packages/torch/nn/functional.py:2515:0 return (%input.114) with prim::AIOFusionGroup_46 = graph(%input.114 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu)): %self.self_blocks_3_attn_qkv.weight : Float(384, 128, strides=[128, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_0_attn_qkv.bias : Float(384, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self_blocks_3_attn_qkv.2 : Float(1, 122, 384, strides=[46848, 384, 1], requires_grad=0, device=cpu) = aten::linear(%input.114, %self.self_blocks_3_attn_qkv.weight, %self.self_blocks_0_attn_qkv.bias), scope: __module.self_blocks_3_attn_qkv # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 return (%self_blocks_3_attn_qkv.2) with prim::AIOFusionGroup_47 = graph(%self_blocks_3_attn_qkv.2 : Float(1, 122, 384, strides=[46848, 384, 1], requires_grad=0, device=cpu)): %1 : int[] = prim::Constant[value=[1, 122, 3, 8, 16]]() %reshape_6.1 : Float(1, 122, 3, 8, 16, strides=[46848, 384, 128, 16, 1], requires_grad=0, device=cpu) = aten::reshape(%self_blocks_3_attn_qkv.2, %1) # .1:103:0 return (%reshape_6.1) with prim::AIOFusionGroup_48 = graph(%reshape_6.1 : Float(1, 122, 3, 8, 16, strides=[46848, 384, 128, 16, 1], requires_grad=0, device=cpu)): %1 : int[] = prim::Constant[value=[2, 0, 3, 1, 4]]() %permute_3.1 : Float(3, 1, 8, 122, 16, strides=[128, 46848, 16, 384, 1], requires_grad=0, device=cpu) = aten::permute(%reshape_6.1, %1) # .1:104:0 return (%permute_3.1) with prim::AIOFusionGroup_49 = graph(%scaled_dot_product_attention_3 : Float(1, 8, 122, 16, strides=[15616, 1952, 16, 1], requires_grad=0, device=cpu)): %1 : int = prim::Constant[value=1]() %2 : int = prim::Constant[value=2]() %transpose_4.1 : Float(1, 122, 8, 16, strides=[15616, 16, 1952, 1], requires_grad=0, device=cpu) = aten::transpose(%scaled_dot_product_attention_3, %1, %2) # .1:112:0 return (%transpose_4.1) with prim::AIOFusionGroup_50 = graph(%transpose_4.1 : Float(1, 122, 8, 16, strides=[15616, 16, 1952, 1], requires_grad=0, device=cpu)): %1 : int[] = prim::Constant[value=[1, 122, 128]]() %input.118 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::reshape(%transpose_4.1, %1) # .1:113:0 return (%input.118) with prim::AIOFusionGroup_51 = graph(%input.118 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu)): %self.self_blocks_3_attn_proj.weight : Float(128, 128, strides=[128, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_0_norm1.bias : Float(128, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %input.122 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::linear(%input.118, %self.self_blocks_3_attn_proj.weight, %self.self_blocks_0_norm1.bias), scope: 
__module.self_blocks_3_attn_proj # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 return (%input.122) with prim::AIOFusionGroup_52 = graph(%input.110 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu), %input.122 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu)): %2 : int = prim::Constant[value=1]() %input.126 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::add(%input.110, %input.122, %2) # .1:118:0 return (%input.126) with prim::AIOFusionGroup_53 = graph(%input.126 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu)): %1 : int[] = prim::Constant[value=[128]]() %self.self_blocks_0_norm1.weight : Float(128, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_0_norm1.bias : Float(128, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %4 : float = prim::Constant[value=9.9999999999999995e-07]() %5 : bool = prim::Constant[value=1]() %input.130 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::layer_norm(%input.126, %1, %self.self_blocks_0_norm1.weight, %self.self_blocks_0_norm1.bias, %4, %5), scope: __module.self_blocks_3_norm2 # /usr/local/lib/python3.10/dist-packages/torch/nn/functional.py:2515:0 return (%input.130) with prim::AIOFusionGroup_54 = graph(%input.130 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu)): %self.self_blocks_3_mlp_fc1.weight : Float(512, 128, strides=[128, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_0_mlp_fc1.bias : Float(512, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %input.134 : Float(1, 122, 512, strides=[62464, 512, 1], requires_grad=0, device=cpu) = aten::linear(%input.130, %self.self_blocks_3_mlp_fc1.weight, %self.self_blocks_0_mlp_fc1.bias), scope: __module.self_blocks_3_mlp_fc1 # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 return (%input.134) with prim::AIOFusionGroup_55 = graph(%input.134 : Float(1, 122, 512, strides=[62464, 512, 1], requires_grad=0, device=cpu)): %1 : str = prim::Constant[value="none"]() %input.138 : Float(1, 122, 512, strides=[62464, 512, 1], requires_grad=0, device=cpu) = aten::gelu(%input.134, %1), scope: __module.self_blocks_3_mlp_act # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/activation.py:685:0 return (%input.138) with prim::AIOFusionGroup_56 = graph(%input.138 : Float(1, 122, 512, strides=[62464, 512, 1], requires_grad=0, device=cpu)): %self.self_blocks_3_mlp_fc2.weight : Float(128, 512, strides=[512, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_0_norm1.bias : Float(128, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %input.142 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::linear(%input.138, %self.self_blocks_3_mlp_fc2.weight, %self.self_blocks_0_norm1.bias), scope: __module.self_blocks_3_mlp_fc2 # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 return (%input.142) with prim::AIOFusionGroup_57 = graph(%input.126 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu), %input.142 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu)): %2 : int = prim::Constant[value=1]() %input.146 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::add(%input.126, %input.142, %2) # .1:128:0 return (%input.146) with 
prim::AIOFusionGroup_58 = graph(%input.146 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu)): %1 : int[] = prim::Constant[value=[128]]() %self.self_blocks_0_norm1.weight : Float(128, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_0_norm1.bias : Float(128, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %4 : float = prim::Constant[value=9.9999999999999995e-07]() %5 : bool = prim::Constant[value=1]() %input.150 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::layer_norm(%input.146, %1, %self.self_blocks_0_norm1.weight, %self.self_blocks_0_norm1.bias, %4, %5), scope: __module.self_blocks_4_norm1 # /usr/local/lib/python3.10/dist-packages/torch/nn/functional.py:2515:0 return (%input.150) with prim::AIOFusionGroup_59 = graph(%input.150 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu)): %self.self_blocks_4_attn_qkv.weight : Float(384, 128, strides=[128, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_0_attn_qkv.bias : Float(384, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self_blocks_4_attn_qkv.2 : Float(1, 122, 384, strides=[46848, 384, 1], requires_grad=0, device=cpu) = aten::linear(%input.150, %self.self_blocks_4_attn_qkv.weight, %self.self_blocks_0_attn_qkv.bias), scope: __module.self_blocks_4_attn_qkv # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 return (%self_blocks_4_attn_qkv.2) with prim::AIOFusionGroup_60 = graph(%self_blocks_4_attn_qkv.2 : Float(1, 122, 384, strides=[46848, 384, 1], requires_grad=0, device=cpu)): %1 : int[] = prim::Constant[value=[1, 122, 3, 8, 16]]() %reshape_8.1 : Float(1, 122, 3, 8, 16, strides=[46848, 384, 128, 16, 1], requires_grad=0, device=cpu) = aten::reshape(%self_blocks_4_attn_qkv.2, %1) # .1:131:0 return (%reshape_8.1) with prim::AIOFusionGroup_61 = graph(%reshape_8.1 : Float(1, 122, 3, 8, 16, strides=[46848, 384, 128, 16, 1], requires_grad=0, device=cpu)): %1 : int[] = prim::Constant[value=[2, 0, 3, 1, 4]]() %permute_4.1 : Float(3, 1, 8, 122, 16, strides=[128, 46848, 16, 384, 1], requires_grad=0, device=cpu) = aten::permute(%reshape_8.1, %1) # .1:132:0 return (%permute_4.1) with prim::AIOFusionGroup_62 = graph(%scaled_dot_product_attention_4 : Float(1, 8, 122, 16, strides=[15616, 1952, 16, 1], requires_grad=0, device=cpu)): %1 : int = prim::Constant[value=1]() %2 : int = prim::Constant[value=2]() %transpose_5.1 : Float(1, 122, 8, 16, strides=[15616, 16, 1952, 1], requires_grad=0, device=cpu) = aten::transpose(%scaled_dot_product_attention_4, %1, %2) # .1:140:0 return (%transpose_5.1) with prim::AIOFusionGroup_63 = graph(%transpose_5.1 : Float(1, 122, 8, 16, strides=[15616, 16, 1952, 1], requires_grad=0, device=cpu)): %1 : int[] = prim::Constant[value=[1, 122, 128]]() %input.154 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::reshape(%transpose_5.1, %1) # .1:141:0 return (%input.154) with prim::AIOFusionGroup_64 = graph(%input.154 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu)): %self.self_blocks_4_attn_proj.weight : Float(128, 128, strides=[128, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_0_norm1.bias : Float(128, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %input.158 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::linear(%input.154, %self.self_blocks_4_attn_proj.weight, 
%self.self_blocks_0_norm1.bias), scope: __module.self_blocks_4_attn_proj # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 return (%input.158) with prim::AIOFusionGroup_65 = graph(%input.146 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu), %input.158 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu)): %2 : int = prim::Constant[value=1]() %input.162 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::add(%input.146, %input.158, %2) # .1:146:0 return (%input.162) with prim::AIOFusionGroup_66 = graph(%input.162 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu)): %1 : int[] = prim::Constant[value=[128]]() %self.self_blocks_0_norm1.weight : Float(128, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_0_norm1.bias : Float(128, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %4 : float = prim::Constant[value=9.9999999999999995e-07]() %5 : bool = prim::Constant[value=1]() %input.166 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::layer_norm(%input.162, %1, %self.self_blocks_0_norm1.weight, %self.self_blocks_0_norm1.bias, %4, %5), scope: __module.self_blocks_4_norm2 # /usr/local/lib/python3.10/dist-packages/torch/nn/functional.py:2515:0 return (%input.166) with prim::AIOFusionGroup_67 = graph(%input.166 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu)): %self.self_blocks_4_mlp_fc1.weight : Float(512, 128, strides=[128, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_0_mlp_fc1.bias : Float(512, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %input.170 : Float(1, 122, 512, strides=[62464, 512, 1], requires_grad=0, device=cpu) = aten::linear(%input.166, %self.self_blocks_4_mlp_fc1.weight, %self.self_blocks_0_mlp_fc1.bias), scope: __module.self_blocks_4_mlp_fc1 # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 return (%input.170) with prim::AIOFusionGroup_68 = graph(%input.170 : Float(1, 122, 512, strides=[62464, 512, 1], requires_grad=0, device=cpu)): %1 : str = prim::Constant[value="none"]() %input.174 : Float(1, 122, 512, strides=[62464, 512, 1], requires_grad=0, device=cpu) = aten::gelu(%input.170, %1), scope: __module.self_blocks_4_mlp_act # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/activation.py:685:0 return (%input.174) with prim::AIOFusionGroup_69 = graph(%input.174 : Float(1, 122, 512, strides=[62464, 512, 1], requires_grad=0, device=cpu)): %self.self_blocks_4_mlp_fc2.weight : Float(128, 512, strides=[512, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_0_norm1.bias : Float(128, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %input.178 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::linear(%input.174, %self.self_blocks_4_mlp_fc2.weight, %self.self_blocks_0_norm1.bias), scope: __module.self_blocks_4_mlp_fc2 # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 return (%input.178) with prim::AIOFusionGroup_70 = graph(%input.162 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu), %input.178 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu)): %2 : int = prim::Constant[value=1]() %input.182 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::add(%input.162, %input.178, %2) # 
.1:156:0 return (%input.182) with prim::AIOFusionGroup_71 = graph(%input.182 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu)): %1 : int[] = prim::Constant[value=[128]]() %self.self_blocks_0_norm1.weight : Float(128, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_0_norm1.bias : Float(128, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %4 : float = prim::Constant[value=9.9999999999999995e-07]() %5 : bool = prim::Constant[value=1]() %input.186 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::layer_norm(%input.182, %1, %self.self_blocks_0_norm1.weight, %self.self_blocks_0_norm1.bias, %4, %5), scope: __module.self_blocks_5_norm1 # /usr/local/lib/python3.10/dist-packages/torch/nn/functional.py:2515:0 return (%input.186) with prim::AIOFusionGroup_72 = graph(%input.186 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu)): %self.self_blocks_5_attn_qkv.weight : Float(384, 128, strides=[128, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_0_attn_qkv.bias : Float(384, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self_blocks_5_attn_qkv.2 : Float(1, 122, 384, strides=[46848, 384, 1], requires_grad=0, device=cpu) = aten::linear(%input.186, %self.self_blocks_5_attn_qkv.weight, %self.self_blocks_0_attn_qkv.bias), scope: __module.self_blocks_5_attn_qkv # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 return (%self_blocks_5_attn_qkv.2) with prim::AIOFusionGroup_73 = graph(%self_blocks_5_attn_qkv.2 : Float(1, 122, 384, strides=[46848, 384, 1], requires_grad=0, device=cpu)): %1 : int[] = prim::Constant[value=[1, 122, 3, 8, 16]]() %reshape_10.1 : Float(1, 122, 3, 8, 16, strides=[46848, 384, 128, 16, 1], requires_grad=0, device=cpu) = aten::reshape(%self_blocks_5_attn_qkv.2, %1) # .1:159:0 return (%reshape_10.1) with prim::AIOFusionGroup_74 = graph(%reshape_10.1 : Float(1, 122, 3, 8, 16, strides=[46848, 384, 128, 16, 1], requires_grad=0, device=cpu)): %1 : int[] = prim::Constant[value=[2, 0, 3, 1, 4]]() %permute_5.1 : Float(3, 1, 8, 122, 16, strides=[128, 46848, 16, 384, 1], requires_grad=0, device=cpu) = aten::permute(%reshape_10.1, %1) # .1:160:0 return (%permute_5.1) with prim::AIOFusionGroup_75 = graph(%scaled_dot_product_attention_5 : Float(1, 8, 122, 16, strides=[15616, 1952, 16, 1], requires_grad=0, device=cpu)): %1 : int = prim::Constant[value=1]() %2 : int = prim::Constant[value=2]() %transpose_6.1 : Float(1, 122, 8, 16, strides=[15616, 16, 1952, 1], requires_grad=0, device=cpu) = aten::transpose(%scaled_dot_product_attention_5, %1, %2) # .1:168:0 return (%transpose_6.1) with prim::AIOFusionGroup_76 = graph(%transpose_6.1 : Float(1, 122, 8, 16, strides=[15616, 16, 1952, 1], requires_grad=0, device=cpu)): %1 : int[] = prim::Constant[value=[1, 122, 128]]() %input.190 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::reshape(%transpose_6.1, %1) # .1:169:0 return (%input.190) with prim::AIOFusionGroup_77 = graph(%input.190 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu)): %self.self_blocks_5_attn_proj.weight : Float(128, 128, strides=[128, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_0_norm1.bias : Float(128, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %input.194 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::linear(%input.190, 
%self.self_blocks_5_attn_proj.weight, %self.self_blocks_0_norm1.bias), scope: __module.self_blocks_5_attn_proj # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 return (%input.194) with prim::AIOFusionGroup_78 = graph(%input.182 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu), %input.194 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu)): %2 : int = prim::Constant[value=1]() %input.198 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::add(%input.182, %input.194, %2) # .1:174:0 return (%input.198) with prim::AIOFusionGroup_79 = graph(%input.198 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu)): %1 : int[] = prim::Constant[value=[128]]() %self.self_blocks_0_norm1.weight : Float(128, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_0_norm1.bias : Float(128, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %4 : float = prim::Constant[value=9.9999999999999995e-07]() %5 : bool = prim::Constant[value=1]() %input.202 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::layer_norm(%input.198, %1, %self.self_blocks_0_norm1.weight, %self.self_blocks_0_norm1.bias, %4, %5), scope: __module.self_blocks_5_norm2 # /usr/local/lib/python3.10/dist-packages/torch/nn/functional.py:2515:0 return (%input.202) with prim::AIOFusionGroup_80 = graph(%input.202 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu)): %self.self_blocks_5_mlp_fc1.weight : Float(512, 128, strides=[128, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_0_mlp_fc1.bias : Float(512, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %input.206 : Float(1, 122, 512, strides=[62464, 512, 1], requires_grad=0, device=cpu) = aten::linear(%input.202, %self.self_blocks_5_mlp_fc1.weight, %self.self_blocks_0_mlp_fc1.bias), scope: __module.self_blocks_5_mlp_fc1 # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 return (%input.206) with prim::AIOFusionGroup_81 = graph(%input.206 : Float(1, 122, 512, strides=[62464, 512, 1], requires_grad=0, device=cpu)): %1 : str = prim::Constant[value="none"]() %input.210 : Float(1, 122, 512, strides=[62464, 512, 1], requires_grad=0, device=cpu) = aten::gelu(%input.206, %1), scope: __module.self_blocks_5_mlp_act # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/activation.py:685:0 return (%input.210) with prim::AIOFusionGroup_82 = graph(%input.210 : Float(1, 122, 512, strides=[62464, 512, 1], requires_grad=0, device=cpu)): %self.self_blocks_5_mlp_fc2.weight : Float(128, 512, strides=[512, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_0_norm1.bias : Float(128, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %input.214 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::linear(%input.210, %self.self_blocks_5_mlp_fc2.weight, %self.self_blocks_0_norm1.bias), scope: __module.self_blocks_5_mlp_fc2 # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 return (%input.214) with prim::AIOFusionGroup_83 = graph(%input.198 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu), %input.214 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu)): %2 : int = prim::Constant[value=1]() %input.218 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = 
aten::add(%input.198, %input.214, %2) # .1:184:0 return (%input.218) with prim::AIOFusionGroup_84 = graph(%input.218 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu)): %1 : int[] = prim::Constant[value=[128]]() %self.self_blocks_0_norm1.weight : Float(128, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_0_norm1.bias : Float(128, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %4 : float = prim::Constant[value=9.9999999999999995e-07]() %5 : bool = prim::Constant[value=1]() %input.222 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::layer_norm(%input.218, %1, %self.self_blocks_0_norm1.weight, %self.self_blocks_0_norm1.bias, %4, %5), scope: __module.self_blocks_6_norm1 # /usr/local/lib/python3.10/dist-packages/torch/nn/functional.py:2515:0 return (%input.222) with prim::AIOFusionGroup_85 = graph(%input.222 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu)): %self.self_blocks_6_attn_qkv.weight : Float(384, 128, strides=[128, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_0_attn_qkv.bias : Float(384, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self_blocks_6_attn_qkv.2 : Float(1, 122, 384, strides=[46848, 384, 1], requires_grad=0, device=cpu) = aten::linear(%input.222, %self.self_blocks_6_attn_qkv.weight, %self.self_blocks_0_attn_qkv.bias), scope: __module.self_blocks_6_attn_qkv # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 return (%self_blocks_6_attn_qkv.2) with prim::AIOFusionGroup_86 = graph(%self_blocks_6_attn_qkv.2 : Float(1, 122, 384, strides=[46848, 384, 1], requires_grad=0, device=cpu)): %1 : int[] = prim::Constant[value=[1, 122, 3, 8, 16]]() %reshape_12.1 : Float(1, 122, 3, 8, 16, strides=[46848, 384, 128, 16, 1], requires_grad=0, device=cpu) = aten::reshape(%self_blocks_6_attn_qkv.2, %1) # .1:187:0 return (%reshape_12.1) with prim::AIOFusionGroup_87 = graph(%reshape_12.1 : Float(1, 122, 3, 8, 16, strides=[46848, 384, 128, 16, 1], requires_grad=0, device=cpu)): %1 : int[] = prim::Constant[value=[2, 0, 3, 1, 4]]() %permute_6.1 : Float(3, 1, 8, 122, 16, strides=[128, 46848, 16, 384, 1], requires_grad=0, device=cpu) = aten::permute(%reshape_12.1, %1) # .1:188:0 return (%permute_6.1) with prim::AIOFusionGroup_88 = graph(%scaled_dot_product_attention_6 : Float(1, 8, 122, 16, strides=[15616, 1952, 16, 1], requires_grad=0, device=cpu)): %1 : int = prim::Constant[value=1]() %2 : int = prim::Constant[value=2]() %transpose_7.1 : Float(1, 122, 8, 16, strides=[15616, 16, 1952, 1], requires_grad=0, device=cpu) = aten::transpose(%scaled_dot_product_attention_6, %1, %2) # .1:196:0 return (%transpose_7.1) with prim::AIOFusionGroup_89 = graph(%transpose_7.1 : Float(1, 122, 8, 16, strides=[15616, 16, 1952, 1], requires_grad=0, device=cpu)): %1 : int[] = prim::Constant[value=[1, 122, 128]]() %input.226 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::reshape(%transpose_7.1, %1) # .1:197:0 return (%input.226) with prim::AIOFusionGroup_90 = graph(%input.226 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu)): %self.self_blocks_6_attn_proj.weight : Float(128, 128, strides=[128, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_0_norm1.bias : Float(128, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %input.230 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, 
device=cpu) = aten::linear(%input.226, %self.self_blocks_6_attn_proj.weight, %self.self_blocks_0_norm1.bias), scope: __module.self_blocks_6_attn_proj # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 return (%input.230) with prim::AIOFusionGroup_91 = graph(%input.218 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu), %input.230 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu)): %2 : int = prim::Constant[value=1]() %input.234 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::add(%input.218, %input.230, %2) # .1:202:0 return (%input.234) with prim::AIOFusionGroup_92 = graph(%input.234 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu)): %1 : int[] = prim::Constant[value=[128]]() %self.self_blocks_0_norm1.weight : Float(128, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_0_norm1.bias : Float(128, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %4 : float = prim::Constant[value=9.9999999999999995e-07]() %5 : bool = prim::Constant[value=1]() %input.238 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::layer_norm(%input.234, %1, %self.self_blocks_0_norm1.weight, %self.self_blocks_0_norm1.bias, %4, %5), scope: __module.self_blocks_6_norm2 # /usr/local/lib/python3.10/dist-packages/torch/nn/functional.py:2515:0 return (%input.238) with prim::AIOFusionGroup_93 = graph(%input.238 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu)): %self.self_blocks_6_mlp_fc1.weight : Float(512, 128, strides=[128, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_0_mlp_fc1.bias : Float(512, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %input.242 : Float(1, 122, 512, strides=[62464, 512, 1], requires_grad=0, device=cpu) = aten::linear(%input.238, %self.self_blocks_6_mlp_fc1.weight, %self.self_blocks_0_mlp_fc1.bias), scope: __module.self_blocks_6_mlp_fc1 # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 return (%input.242) with prim::AIOFusionGroup_94 = graph(%input.242 : Float(1, 122, 512, strides=[62464, 512, 1], requires_grad=0, device=cpu)): %1 : str = prim::Constant[value="none"]() %input.246 : Float(1, 122, 512, strides=[62464, 512, 1], requires_grad=0, device=cpu) = aten::gelu(%input.242, %1), scope: __module.self_blocks_6_mlp_act # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/activation.py:685:0 return (%input.246) with prim::AIOFusionGroup_95 = graph(%input.246 : Float(1, 122, 512, strides=[62464, 512, 1], requires_grad=0, device=cpu)): %self.self_blocks_6_mlp_fc2.weight : Float(128, 512, strides=[512, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_0_norm1.bias : Float(128, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %input.250 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::linear(%input.246, %self.self_blocks_6_mlp_fc2.weight, %self.self_blocks_0_norm1.bias), scope: __module.self_blocks_6_mlp_fc2 # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 return (%input.250) with prim::AIOFusionGroup_96 = graph(%input.234 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu), %input.250 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu)): %2 : int = prim::Constant[value=1]() %input.254 : Float(1, 122, 128, strides=[15616, 128, 
1], requires_grad=0, device=cpu) = aten::add(%input.234, %input.250, %2) # .1:212:0 return (%input.254) with prim::AIOFusionGroup_97 = graph(%input.254 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu)): %1 : int[] = prim::Constant[value=[128]]() %self.self_blocks_0_norm1.weight : Float(128, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_0_norm1.bias : Float(128, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %4 : float = prim::Constant[value=9.9999999999999995e-07]() %5 : bool = prim::Constant[value=1]() %input.258 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::layer_norm(%input.254, %1, %self.self_blocks_0_norm1.weight, %self.self_blocks_0_norm1.bias, %4, %5), scope: __module.self_blocks_7_norm1 # /usr/local/lib/python3.10/dist-packages/torch/nn/functional.py:2515:0 return (%input.258) with prim::AIOFusionGroup_98 = graph(%input.258 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu)): %self.self_blocks_7_attn_qkv.weight : Float(384, 128, strides=[128, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_0_attn_qkv.bias : Float(384, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self_blocks_7_attn_qkv.2 : Float(1, 122, 384, strides=[46848, 384, 1], requires_grad=0, device=cpu) = aten::linear(%input.258, %self.self_blocks_7_attn_qkv.weight, %self.self_blocks_0_attn_qkv.bias), scope: __module.self_blocks_7_attn_qkv # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 return (%self_blocks_7_attn_qkv.2) with prim::AIOFusionGroup_99 = graph(%self_blocks_7_attn_qkv.2 : Float(1, 122, 384, strides=[46848, 384, 1], requires_grad=0, device=cpu)): %1 : int[] = prim::Constant[value=[1, 122, 3, 8, 16]]() %reshape_14.1 : Float(1, 122, 3, 8, 16, strides=[46848, 384, 128, 16, 1], requires_grad=0, device=cpu) = aten::reshape(%self_blocks_7_attn_qkv.2, %1) # .1:215:0 return (%reshape_14.1) with prim::AIOFusionGroup_100 = graph(%reshape_14.1 : Float(1, 122, 3, 8, 16, strides=[46848, 384, 128, 16, 1], requires_grad=0, device=cpu)): %1 : int[] = prim::Constant[value=[2, 0, 3, 1, 4]]() %permute_7.1 : Float(3, 1, 8, 122, 16, strides=[128, 46848, 16, 384, 1], requires_grad=0, device=cpu) = aten::permute(%reshape_14.1, %1) # .1:216:0 return (%permute_7.1) with prim::AIOFusionGroup_101 = graph(%scaled_dot_product_attention_7 : Float(1, 8, 122, 16, strides=[15616, 1952, 16, 1], requires_grad=0, device=cpu)): %1 : int = prim::Constant[value=1]() %2 : int = prim::Constant[value=2]() %transpose_8.1 : Float(1, 122, 8, 16, strides=[15616, 16, 1952, 1], requires_grad=0, device=cpu) = aten::transpose(%scaled_dot_product_attention_7, %1, %2) # .1:224:0 return (%transpose_8.1) with prim::AIOFusionGroup_102 = graph(%transpose_8.1 : Float(1, 122, 8, 16, strides=[15616, 16, 1952, 1], requires_grad=0, device=cpu)): %1 : int[] = prim::Constant[value=[1, 122, 128]]() %input.262 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::reshape(%transpose_8.1, %1) # .1:225:0 return (%input.262) with prim::AIOFusionGroup_103 = graph(%input.262 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu)): %self.self_blocks_7_attn_proj.weight : Float(128, 128, strides=[128, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_0_norm1.bias : Float(128, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %input.266 : Float(1, 122, 128, 
strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::linear(%input.262, %self.self_blocks_7_attn_proj.weight, %self.self_blocks_0_norm1.bias), scope: __module.self_blocks_7_attn_proj # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 return (%input.266) with prim::AIOFusionGroup_104 = graph(%input.254 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu), %input.266 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu)): %2 : int = prim::Constant[value=1]() %input.270 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::add(%input.254, %input.266, %2) # .1:230:0 return (%input.270) with prim::AIOFusionGroup_105 = graph(%input.270 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu)): %1 : int[] = prim::Constant[value=[128]]() %self.self_blocks_0_norm1.weight : Float(128, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_0_norm1.bias : Float(128, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %4 : float = prim::Constant[value=9.9999999999999995e-07]() %5 : bool = prim::Constant[value=1]() %input.274 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::layer_norm(%input.270, %1, %self.self_blocks_0_norm1.weight, %self.self_blocks_0_norm1.bias, %4, %5), scope: __module.self_blocks_7_norm2 # /usr/local/lib/python3.10/dist-packages/torch/nn/functional.py:2515:0 return (%input.274) with prim::AIOFusionGroup_106 = graph(%input.274 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu)): %self.self_blocks_7_mlp_fc1.weight : Float(512, 128, strides=[128, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_0_mlp_fc1.bias : Float(512, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %input.278 : Float(1, 122, 512, strides=[62464, 512, 1], requires_grad=0, device=cpu) = aten::linear(%input.274, %self.self_blocks_7_mlp_fc1.weight, %self.self_blocks_0_mlp_fc1.bias), scope: __module.self_blocks_7_mlp_fc1 # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 return (%input.278) with prim::AIOFusionGroup_107 = graph(%input.278 : Float(1, 122, 512, strides=[62464, 512, 1], requires_grad=0, device=cpu)): %1 : str = prim::Constant[value="none"]() %input.282 : Float(1, 122, 512, strides=[62464, 512, 1], requires_grad=0, device=cpu) = aten::gelu(%input.278, %1), scope: __module.self_blocks_7_mlp_act # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/activation.py:685:0 return (%input.282) with prim::AIOFusionGroup_108 = graph(%input.282 : Float(1, 122, 512, strides=[62464, 512, 1], requires_grad=0, device=cpu)): %self.self_blocks_7_mlp_fc2.weight : Float(128, 512, strides=[512, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_0_norm1.bias : Float(128, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %input.286 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::linear(%input.282, %self.self_blocks_7_mlp_fc2.weight, %self.self_blocks_0_norm1.bias), scope: __module.self_blocks_7_mlp_fc2 # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 return (%input.286) with prim::AIOFusionGroup_109 = graph(%input.270 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu), %input.286 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu)): %2 : int = prim::Constant[value=1]() 
%input.290 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::add(%input.270, %input.286, %2) # .1:240:0 return (%input.290) with prim::AIOFusionGroup_110 = graph(%input.290 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu)): %1 : int[] = prim::Constant[value=[128]]() %self.self_blocks_0_norm1.weight : Float(128, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_0_norm1.bias : Float(128, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %4 : float = prim::Constant[value=9.9999999999999995e-07]() %5 : bool = prim::Constant[value=1]() %input.294 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::layer_norm(%input.290, %1, %self.self_blocks_0_norm1.weight, %self.self_blocks_0_norm1.bias, %4, %5), scope: __module.self_blocks_8_norm1 # /usr/local/lib/python3.10/dist-packages/torch/nn/functional.py:2515:0 return (%input.294) with prim::AIOFusionGroup_111 = graph(%input.294 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu)): %self.self_blocks_8_attn_qkv.weight : Float(384, 128, strides=[128, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_0_attn_qkv.bias : Float(384, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self_blocks_8_attn_qkv.2 : Float(1, 122, 384, strides=[46848, 384, 1], requires_grad=0, device=cpu) = aten::linear(%input.294, %self.self_blocks_8_attn_qkv.weight, %self.self_blocks_0_attn_qkv.bias), scope: __module.self_blocks_8_attn_qkv # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 return (%self_blocks_8_attn_qkv.2) with prim::AIOFusionGroup_112 = graph(%self_blocks_8_attn_qkv.2 : Float(1, 122, 384, strides=[46848, 384, 1], requires_grad=0, device=cpu)): %1 : int[] = prim::Constant[value=[1, 122, 3, 8, 16]]() %reshape_16.1 : Float(1, 122, 3, 8, 16, strides=[46848, 384, 128, 16, 1], requires_grad=0, device=cpu) = aten::reshape(%self_blocks_8_attn_qkv.2, %1) # .1:243:0 return (%reshape_16.1) with prim::AIOFusionGroup_113 = graph(%reshape_16.1 : Float(1, 122, 3, 8, 16, strides=[46848, 384, 128, 16, 1], requires_grad=0, device=cpu)): %1 : int[] = prim::Constant[value=[2, 0, 3, 1, 4]]() %permute_8.1 : Float(3, 1, 8, 122, 16, strides=[128, 46848, 16, 384, 1], requires_grad=0, device=cpu) = aten::permute(%reshape_16.1, %1) # .1:244:0 return (%permute_8.1) with prim::AIOFusionGroup_114 = graph(%scaled_dot_product_attention_8 : Float(1, 8, 122, 16, strides=[15616, 1952, 16, 1], requires_grad=0, device=cpu)): %1 : int = prim::Constant[value=1]() %2 : int = prim::Constant[value=2]() %transpose_9.1 : Float(1, 122, 8, 16, strides=[15616, 16, 1952, 1], requires_grad=0, device=cpu) = aten::transpose(%scaled_dot_product_attention_8, %1, %2) # .1:252:0 return (%transpose_9.1) with prim::AIOFusionGroup_115 = graph(%transpose_9.1 : Float(1, 122, 8, 16, strides=[15616, 16, 1952, 1], requires_grad=0, device=cpu)): %1 : int[] = prim::Constant[value=[1, 122, 128]]() %input.298 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::reshape(%transpose_9.1, %1) # .1:253:0 return (%input.298) with prim::AIOFusionGroup_116 = graph(%input.298 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu)): %self.self_blocks_8_attn_proj.weight : Float(128, 128, strides=[128, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_0_norm1.bias : Float(128, strides=[1], requires_grad=0, device=cpu) = 
prim::Constant[value=]() %input.302 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::linear(%input.298, %self.self_blocks_8_attn_proj.weight, %self.self_blocks_0_norm1.bias), scope: __module.self_blocks_8_attn_proj # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 return (%input.302) with prim::AIOFusionGroup_117 = graph(%input.290 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu), %input.302 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu)): %2 : int = prim::Constant[value=1]() %input.306 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::add(%input.290, %input.302, %2) # .1:258:0 return (%input.306) with prim::AIOFusionGroup_118 = graph(%input.306 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu)): %1 : int[] = prim::Constant[value=[128]]() %self.self_blocks_0_norm1.weight : Float(128, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_0_norm1.bias : Float(128, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %4 : float = prim::Constant[value=9.9999999999999995e-07]() %5 : bool = prim::Constant[value=1]() %input.310 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::layer_norm(%input.306, %1, %self.self_blocks_0_norm1.weight, %self.self_blocks_0_norm1.bias, %4, %5), scope: __module.self_blocks_8_norm2 # /usr/local/lib/python3.10/dist-packages/torch/nn/functional.py:2515:0 return (%input.310) with prim::AIOFusionGroup_119 = graph(%input.310 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu)): %self.self_blocks_8_mlp_fc1.weight : Float(512, 128, strides=[128, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_0_mlp_fc1.bias : Float(512, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %input.314 : Float(1, 122, 512, strides=[62464, 512, 1], requires_grad=0, device=cpu) = aten::linear(%input.310, %self.self_blocks_8_mlp_fc1.weight, %self.self_blocks_0_mlp_fc1.bias), scope: __module.self_blocks_8_mlp_fc1 # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 return (%input.314) with prim::AIOFusionGroup_120 = graph(%input.314 : Float(1, 122, 512, strides=[62464, 512, 1], requires_grad=0, device=cpu)): %1 : str = prim::Constant[value="none"]() %input.318 : Float(1, 122, 512, strides=[62464, 512, 1], requires_grad=0, device=cpu) = aten::gelu(%input.314, %1), scope: __module.self_blocks_8_mlp_act # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/activation.py:685:0 return (%input.318) with prim::AIOFusionGroup_121 = graph(%input.318 : Float(1, 122, 512, strides=[62464, 512, 1], requires_grad=0, device=cpu)): %self.self_blocks_8_mlp_fc2.weight : Float(128, 512, strides=[512, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_0_norm1.bias : Float(128, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %input.322 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::linear(%input.318, %self.self_blocks_8_mlp_fc2.weight, %self.self_blocks_0_norm1.bias), scope: __module.self_blocks_8_mlp_fc2 # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 return (%input.322) with prim::AIOFusionGroup_122 = graph(%input.306 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu), %input.322 : Float(1, 122, 128, strides=[15616, 128, 1], 
requires_grad=0, device=cpu)): %2 : int = prim::Constant[value=1]() %input.326 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::add(%input.306, %input.322, %2) # .1:268:0 return (%input.326) with prim::AIOFusionGroup_123 = graph(%input.326 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu)): %1 : int[] = prim::Constant[value=[128]]() %self.self_blocks_0_norm1.weight : Float(128, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_0_norm1.bias : Float(128, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %4 : float = prim::Constant[value=9.9999999999999995e-07]() %5 : bool = prim::Constant[value=1]() %input.330 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::layer_norm(%input.326, %1, %self.self_blocks_0_norm1.weight, %self.self_blocks_0_norm1.bias, %4, %5), scope: __module.self_blocks_9_norm1 # /usr/local/lib/python3.10/dist-packages/torch/nn/functional.py:2515:0 return (%input.330) with prim::AIOFusionGroup_124 = graph(%input.330 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu)): %self.self_blocks_9_attn_qkv.weight : Float(384, 128, strides=[128, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_0_attn_qkv.bias : Float(384, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self_blocks_9_attn_qkv.2 : Float(1, 122, 384, strides=[46848, 384, 1], requires_grad=0, device=cpu) = aten::linear(%input.330, %self.self_blocks_9_attn_qkv.weight, %self.self_blocks_0_attn_qkv.bias), scope: __module.self_blocks_9_attn_qkv # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 return (%self_blocks_9_attn_qkv.2) with prim::AIOFusionGroup_125 = graph(%self_blocks_9_attn_qkv.2 : Float(1, 122, 384, strides=[46848, 384, 1], requires_grad=0, device=cpu)): %1 : int[] = prim::Constant[value=[1, 122, 3, 8, 16]]() %reshape_18.1 : Float(1, 122, 3, 8, 16, strides=[46848, 384, 128, 16, 1], requires_grad=0, device=cpu) = aten::reshape(%self_blocks_9_attn_qkv.2, %1) # .1:271:0 return (%reshape_18.1) with prim::AIOFusionGroup_126 = graph(%reshape_18.1 : Float(1, 122, 3, 8, 16, strides=[46848, 384, 128, 16, 1], requires_grad=0, device=cpu)): %1 : int[] = prim::Constant[value=[2, 0, 3, 1, 4]]() %permute_9.1 : Float(3, 1, 8, 122, 16, strides=[128, 46848, 16, 384, 1], requires_grad=0, device=cpu) = aten::permute(%reshape_18.1, %1) # .1:272:0 return (%permute_9.1) with prim::AIOFusionGroup_127 = graph(%scaled_dot_product_attention_9 : Float(1, 8, 122, 16, strides=[15616, 1952, 16, 1], requires_grad=0, device=cpu)): %1 : int = prim::Constant[value=1]() %2 : int = prim::Constant[value=2]() %transpose_10.1 : Float(1, 122, 8, 16, strides=[15616, 16, 1952, 1], requires_grad=0, device=cpu) = aten::transpose(%scaled_dot_product_attention_9, %1, %2) # .1:280:0 return (%transpose_10.1) with prim::AIOFusionGroup_128 = graph(%transpose_10.1 : Float(1, 122, 8, 16, strides=[15616, 16, 1952, 1], requires_grad=0, device=cpu)): %1 : int[] = prim::Constant[value=[1, 122, 128]]() %input.334 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::reshape(%transpose_10.1, %1) # .1:281:0 return (%input.334) with prim::AIOFusionGroup_129 = graph(%input.334 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu)): %self.self_blocks_9_attn_proj.weight : Float(128, 128, strides=[128, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() 
%self.self_blocks_0_norm1.bias : Float(128, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %input.338 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::linear(%input.334, %self.self_blocks_9_attn_proj.weight, %self.self_blocks_0_norm1.bias), scope: __module.self_blocks_9_attn_proj # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 return (%input.338) with prim::AIOFusionGroup_130 = graph(%input.326 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu), %input.338 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu)): %2 : int = prim::Constant[value=1]() %input.342 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::add(%input.326, %input.338, %2) # .1:286:0 return (%input.342) with prim::AIOFusionGroup_131 = graph(%input.342 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu)): %1 : int[] = prim::Constant[value=[128]]() %self.self_blocks_0_norm1.weight : Float(128, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_0_norm1.bias : Float(128, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %4 : float = prim::Constant[value=9.9999999999999995e-07]() %5 : bool = prim::Constant[value=1]() %input.346 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::layer_norm(%input.342, %1, %self.self_blocks_0_norm1.weight, %self.self_blocks_0_norm1.bias, %4, %5), scope: __module.self_blocks_9_norm2 # /usr/local/lib/python3.10/dist-packages/torch/nn/functional.py:2515:0 return (%input.346) with prim::AIOFusionGroup_132 = graph(%input.346 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu)): %self.self_blocks_9_mlp_fc1.weight : Float(512, 128, strides=[128, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_0_mlp_fc1.bias : Float(512, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %input.350 : Float(1, 122, 512, strides=[62464, 512, 1], requires_grad=0, device=cpu) = aten::linear(%input.346, %self.self_blocks_9_mlp_fc1.weight, %self.self_blocks_0_mlp_fc1.bias), scope: __module.self_blocks_9_mlp_fc1 # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 return (%input.350) with prim::AIOFusionGroup_133 = graph(%input.350 : Float(1, 122, 512, strides=[62464, 512, 1], requires_grad=0, device=cpu)): %1 : str = prim::Constant[value="none"]() %input.354 : Float(1, 122, 512, strides=[62464, 512, 1], requires_grad=0, device=cpu) = aten::gelu(%input.350, %1), scope: __module.self_blocks_9_mlp_act # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/activation.py:685:0 return (%input.354) with prim::AIOFusionGroup_134 = graph(%input.354 : Float(1, 122, 512, strides=[62464, 512, 1], requires_grad=0, device=cpu)): %self.self_blocks_9_mlp_fc2.weight : Float(128, 512, strides=[512, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_0_norm1.bias : Float(128, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %input.358 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::linear(%input.354, %self.self_blocks_9_mlp_fc2.weight, %self.self_blocks_0_norm1.bias), scope: __module.self_blocks_9_mlp_fc2 # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 return (%input.358) with prim::AIOFusionGroup_135 = graph(%input.342 : Float(1, 122, 128, strides=[15616, 128, 1], 
requires_grad=0, device=cpu), %input.358 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu)): %2 : int = prim::Constant[value=1]() %input.362 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::add(%input.342, %input.358, %2) # .1:296:0 return (%input.362) with prim::AIOFusionGroup_136 = graph(%input.362 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu)): %1 : int[] = prim::Constant[value=[128]]() %self.self_blocks_0_norm1.weight : Float(128, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_0_norm1.bias : Float(128, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %4 : float = prim::Constant[value=9.9999999999999995e-07]() %5 : bool = prim::Constant[value=1]() %input.366 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::layer_norm(%input.362, %1, %self.self_blocks_0_norm1.weight, %self.self_blocks_0_norm1.bias, %4, %5), scope: __module.self_blocks_10_norm1 # /usr/local/lib/python3.10/dist-packages/torch/nn/functional.py:2515:0 return (%input.366) with prim::AIOFusionGroup_137 = graph(%input.366 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu)): %self.self_blocks_10_attn_qkv.weight : Float(384, 128, strides=[128, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_0_attn_qkv.bias : Float(384, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self_blocks_10_attn_qkv.2 : Float(1, 122, 384, strides=[46848, 384, 1], requires_grad=0, device=cpu) = aten::linear(%input.366, %self.self_blocks_10_attn_qkv.weight, %self.self_blocks_0_attn_qkv.bias), scope: __module.self_blocks_10_attn_qkv # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 return (%self_blocks_10_attn_qkv.2) with prim::AIOFusionGroup_138 = graph(%self_blocks_10_attn_qkv.2 : Float(1, 122, 384, strides=[46848, 384, 1], requires_grad=0, device=cpu)): %1 : int[] = prim::Constant[value=[1, 122, 3, 8, 16]]() %reshape_20.1 : Float(1, 122, 3, 8, 16, strides=[46848, 384, 128, 16, 1], requires_grad=0, device=cpu) = aten::reshape(%self_blocks_10_attn_qkv.2, %1) # .1:299:0 return (%reshape_20.1) with prim::AIOFusionGroup_139 = graph(%reshape_20.1 : Float(1, 122, 3, 8, 16, strides=[46848, 384, 128, 16, 1], requires_grad=0, device=cpu)): %1 : int[] = prim::Constant[value=[2, 0, 3, 1, 4]]() %permute_10.1 : Float(3, 1, 8, 122, 16, strides=[128, 46848, 16, 384, 1], requires_grad=0, device=cpu) = aten::permute(%reshape_20.1, %1) # .1:300:0 return (%permute_10.1) with prim::AIOFusionGroup_140 = graph(%scaled_dot_product_attention_10 : Float(1, 8, 122, 16, strides=[15616, 1952, 16, 1], requires_grad=0, device=cpu)): %1 : int = prim::Constant[value=1]() %2 : int = prim::Constant[value=2]() %transpose_11.1 : Float(1, 122, 8, 16, strides=[15616, 16, 1952, 1], requires_grad=0, device=cpu) = aten::transpose(%scaled_dot_product_attention_10, %1, %2) # .1:308:0 return (%transpose_11.1) with prim::AIOFusionGroup_141 = graph(%transpose_11.1 : Float(1, 122, 8, 16, strides=[15616, 16, 1952, 1], requires_grad=0, device=cpu)): %1 : int[] = prim::Constant[value=[1, 122, 128]]() %input.370 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::reshape(%transpose_11.1, %1) # .1:309:0 return (%input.370) with prim::AIOFusionGroup_142 = graph(%input.370 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu)): %self.self_blocks_10_attn_proj.weight : Float(128, 
128, strides=[128, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_0_norm1.bias : Float(128, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %input.374 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::linear(%input.370, %self.self_blocks_10_attn_proj.weight, %self.self_blocks_0_norm1.bias), scope: __module.self_blocks_10_attn_proj # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 return (%input.374) with prim::AIOFusionGroup_143 = graph(%input.362 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu), %input.374 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu)): %2 : int = prim::Constant[value=1]() %input.378 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::add(%input.362, %input.374, %2) # .1:314:0 return (%input.378) with prim::AIOFusionGroup_144 = graph(%input.378 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu)): %1 : int[] = prim::Constant[value=[128]]() %self.self_blocks_0_norm1.weight : Float(128, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_0_norm1.bias : Float(128, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %4 : float = prim::Constant[value=9.9999999999999995e-07]() %5 : bool = prim::Constant[value=1]() %input.382 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::layer_norm(%input.378, %1, %self.self_blocks_0_norm1.weight, %self.self_blocks_0_norm1.bias, %4, %5), scope: __module.self_blocks_10_norm2 # /usr/local/lib/python3.10/dist-packages/torch/nn/functional.py:2515:0 return (%input.382) with prim::AIOFusionGroup_145 = graph(%input.382 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu)): %self.self_blocks_10_mlp_fc1.weight : Float(512, 128, strides=[128, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_0_mlp_fc1.bias : Float(512, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %input.386 : Float(1, 122, 512, strides=[62464, 512, 1], requires_grad=0, device=cpu) = aten::linear(%input.382, %self.self_blocks_10_mlp_fc1.weight, %self.self_blocks_0_mlp_fc1.bias), scope: __module.self_blocks_10_mlp_fc1 # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 return (%input.386) with prim::AIOFusionGroup_146 = graph(%input.386 : Float(1, 122, 512, strides=[62464, 512, 1], requires_grad=0, device=cpu)): %1 : str = prim::Constant[value="none"]() %input.390 : Float(1, 122, 512, strides=[62464, 512, 1], requires_grad=0, device=cpu) = aten::gelu(%input.386, %1), scope: __module.self_blocks_10_mlp_act # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/activation.py:685:0 return (%input.390) with prim::AIOFusionGroup_147 = graph(%input.390 : Float(1, 122, 512, strides=[62464, 512, 1], requires_grad=0, device=cpu)): %self.self_blocks_10_mlp_fc2.weight : Float(128, 512, strides=[512, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_0_norm1.bias : Float(128, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %input.394 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::linear(%input.390, %self.self_blocks_10_mlp_fc2.weight, %self.self_blocks_0_norm1.bias), scope: __module.self_blocks_10_mlp_fc2 # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 return (%input.394) with 
prim::AIOFusionGroup_148 = graph(%input.378 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu), %input.394 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu)): %2 : int = prim::Constant[value=1]() %input.398 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::add(%input.378, %input.394, %2) # .1:324:0 return (%input.398) with prim::AIOFusionGroup_149 = graph(%input.398 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu)): %1 : int[] = prim::Constant[value=[128]]() %self.self_blocks_0_norm1.weight : Float(128, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_0_norm1.bias : Float(128, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %4 : float = prim::Constant[value=9.9999999999999995e-07]() %5 : bool = prim::Constant[value=1]() %input.402 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::layer_norm(%input.398, %1, %self.self_blocks_0_norm1.weight, %self.self_blocks_0_norm1.bias, %4, %5), scope: __module.self_blocks_11_norm1 # /usr/local/lib/python3.10/dist-packages/torch/nn/functional.py:2515:0 return (%input.402) with prim::AIOFusionGroup_150 = graph(%input.402 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu)): %self.self_blocks_11_attn_qkv.weight : Float(384, 128, strides=[128, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_0_attn_qkv.bias : Float(384, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self_blocks_11_attn_qkv.2 : Float(1, 122, 384, strides=[46848, 384, 1], requires_grad=0, device=cpu) = aten::linear(%input.402, %self.self_blocks_11_attn_qkv.weight, %self.self_blocks_0_attn_qkv.bias), scope: __module.self_blocks_11_attn_qkv # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 return (%self_blocks_11_attn_qkv.2) with prim::AIOFusionGroup_151 = graph(%self_blocks_11_attn_qkv.2 : Float(1, 122, 384, strides=[46848, 384, 1], requires_grad=0, device=cpu)): %1 : int[] = prim::Constant[value=[1, 122, 3, 8, 16]]() %reshape_22.1 : Float(1, 122, 3, 8, 16, strides=[46848, 384, 128, 16, 1], requires_grad=0, device=cpu) = aten::reshape(%self_blocks_11_attn_qkv.2, %1) # .1:327:0 return (%reshape_22.1) with prim::AIOFusionGroup_152 = graph(%reshape_22.1 : Float(1, 122, 3, 8, 16, strides=[46848, 384, 128, 16, 1], requires_grad=0, device=cpu)): %1 : int[] = prim::Constant[value=[2, 0, 3, 1, 4]]() %permute_11.1 : Float(3, 1, 8, 122, 16, strides=[128, 46848, 16, 384, 1], requires_grad=0, device=cpu) = aten::permute(%reshape_22.1, %1) # .1:328:0 return (%permute_11.1) with prim::AIOFusionGroup_153 = graph(%scaled_dot_product_attention_11 : Float(1, 8, 122, 16, strides=[15616, 1952, 16, 1], requires_grad=0, device=cpu)): %1 : int = prim::Constant[value=1]() %2 : int = prim::Constant[value=2]() %transpose_12.1 : Float(1, 122, 8, 16, strides=[15616, 16, 1952, 1], requires_grad=0, device=cpu) = aten::transpose(%scaled_dot_product_attention_11, %1, %2) # .1:336:0 return (%transpose_12.1) with prim::AIOFusionGroup_154 = graph(%transpose_12.1 : Float(1, 122, 8, 16, strides=[15616, 16, 1952, 1], requires_grad=0, device=cpu)): %1 : int[] = prim::Constant[value=[1, 122, 128]]() %input.406 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::reshape(%transpose_12.1, %1) # .1:337:0 return (%input.406) with prim::AIOFusionGroup_155 = graph(%input.406 : Float(1, 122, 128, strides=[15616, 128, 
1], requires_grad=0, device=cpu)): %self.self_blocks_11_attn_proj.weight : Float(128, 128, strides=[128, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_0_norm1.bias : Float(128, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %input.410 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::linear(%input.406, %self.self_blocks_11_attn_proj.weight, %self.self_blocks_0_norm1.bias), scope: __module.self_blocks_11_attn_proj # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 return (%input.410) with prim::AIOFusionGroup_156 = graph(%input.398 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu), %input.410 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu)): %2 : int = prim::Constant[value=1]() %input.414 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::add(%input.398, %input.410, %2) # .1:342:0 return (%input.414) with prim::AIOFusionGroup_157 = graph(%input.414 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu)): %1 : int[] = prim::Constant[value=[128]]() %self.self_blocks_0_norm1.weight : Float(128, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_0_norm1.bias : Float(128, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %4 : float = prim::Constant[value=9.9999999999999995e-07]() %5 : bool = prim::Constant[value=1]() %input.418 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::layer_norm(%input.414, %1, %self.self_blocks_0_norm1.weight, %self.self_blocks_0_norm1.bias, %4, %5), scope: __module.self_blocks_11_norm2 # /usr/local/lib/python3.10/dist-packages/torch/nn/functional.py:2515:0 return (%input.418) with prim::AIOFusionGroup_158 = graph(%input.418 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu)): %self.self_blocks_11_mlp_fc1.weight : Float(512, 128, strides=[128, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_0_mlp_fc1.bias : Float(512, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %input.422 : Float(1, 122, 512, strides=[62464, 512, 1], requires_grad=0, device=cpu) = aten::linear(%input.418, %self.self_blocks_11_mlp_fc1.weight, %self.self_blocks_0_mlp_fc1.bias), scope: __module.self_blocks_11_mlp_fc1 # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 return (%input.422) with prim::AIOFusionGroup_159 = graph(%input.422 : Float(1, 122, 512, strides=[62464, 512, 1], requires_grad=0, device=cpu)): %1 : str = prim::Constant[value="none"]() %input.426 : Float(1, 122, 512, strides=[62464, 512, 1], requires_grad=0, device=cpu) = aten::gelu(%input.422, %1), scope: __module.self_blocks_11_mlp_act # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/activation.py:685:0 return (%input.426) with prim::AIOFusionGroup_160 = graph(%input.426 : Float(1, 122, 512, strides=[62464, 512, 1], requires_grad=0, device=cpu)): %self.self_blocks_11_mlp_fc2.weight : Float(128, 512, strides=[512, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_0_norm1.bias : Float(128, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %input.430 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::linear(%input.426, %self.self_blocks_11_mlp_fc2.weight, %self.self_blocks_0_norm1.bias), scope: __module.self_blocks_11_mlp_fc2 # 
/usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 return (%input.430) with prim::AIOFusionGroup_161 = graph(%input.414 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu), %input.430 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu)): %2 : int = prim::Constant[value=1]() %input.434 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::add(%input.414, %input.430, %2) # .1:352:0 return (%input.434) with prim::AIOFusionGroup_162 = graph(%input.434 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu)): %1 : int[] = prim::Constant[value=[128]]() %self.self_blocks_0_norm1.weight : Float(128, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_0_norm1.bias : Float(128, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %4 : float = prim::Constant[value=9.9999999999999995e-07]() %5 : bool = prim::Constant[value=1]() %self_norm.2 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::layer_norm(%input.434, %1, %self.self_blocks_0_norm1.weight, %self.self_blocks_0_norm1.bias, %4, %5), scope: __module.self_norm # /usr/local/lib/python3.10/dist-packages/torch/nn/functional.py:2515:0 return (%self_norm.2) with prim::AIOFusionGroup_163 = graph(%self_norm.2 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu)): %1 : int = prim::Constant[value=0]() %2 : int = prim::Constant[value=9223372036854775807]() %3 : int = prim::Constant[value=1]() %4 : Tensor = aten::slice(%self_norm.2, %1, %1, %2, %3) # .1:354:0 return (%4) with prim::AIOFusionGroup_164 = graph(%input.245 : Float(1, 128, strides=[15616, 1], requires_grad=0, device=cpu)): %self.self_head.weight : Float(1000, 128, strides=[128, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_head.bias : Float(1000, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %3 : Float(1, 1000, strides=[1000, 1], requires_grad=0, device=cpu) = aten::linear(%input.245, %self.self_head.weight, %self.self_head.bias), scope: __module.self_head # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 return (%3) Graph after AIOFuser graph(%self.1 : __torch__.torch.fx.graph_module.___torch_mangle_421.GraphModule, %x : Float(1, 3, 110, 110, strides=[36300, 12100, 110, 1], requires_grad=0, device=cpu)): %19 : bool = prim::Constant[value=0](), scope: __module.self_patch_embed_proj # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/conv.py:459:0 %18 : int = prim::Constant[value=1](), scope: __module.self_patch_embed_proj # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/conv.py:459:0 %17 : int = prim::Constant[value=0](), scope: __module.self_patch_embed_proj # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/conv.py:459:0 %10 : float = prim::Constant[value=0.](), scope: __module.self_pos_drop # /usr/local/lib/python3.10/dist-packages/torch/nn/functional.py:1252:0 %5 : NoneType = prim::Constant() %7115 : Float(1, 3, 110, 110, strides=[36300, 12100, 110, 1], requires_grad=0, device=cpu), %7116 : bool = prim::AIOFusionGuard[types=[Float(1, 3, 110, 110, strides=[36300, 12100, 110, 1], requires_grad=0, device=cpu)]](%x) %7117 : Float(3, 1, 8, 122, 16, strides=[128, 46848, 16, 384, 1], requires_grad=0, device=cpu), %7118 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = prim::If(%7116) block0(): %permute.5 : Float(3, 1, 8, 122, 16, strides=[128, 46848, 16, 
384, 1], requires_grad=0, device=cpu), %input.2698 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = prim::AIOFusionGroup_0(%7115) -> (%permute.5, %input.2698) block1(): %permute.1 : Tensor, %input.2 : Tensor = prim::FallbackGraph_1(%x) -> (%permute.1, %input.2) %106 : Tensor[] = aten::unbind(%7117, %17) # .1:21:0 %self_blocks_0_attn_q_norm.1 : Tensor, %self_blocks_0_attn_k_norm.1 : Tensor, %getitem_2 : Tensor = prim::ListUnpack(%106) %scaled_dot_product_attention : Float(1, 8, 122, 16, strides=[15616, 1952, 16, 1], requires_grad=0, device=cpu) = aten::scaled_dot_product_attention(%self_blocks_0_attn_q_norm.1, %self_blocks_0_attn_k_norm.1, %getitem_2, %5, %10, %19) # .1:27:0 %7151 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu), %7152 : Float(1, 8, 122, 16, strides=[15616, 1952, 16, 1], requires_grad=0, device=cpu), %7153 : bool = prim::AIOFusionGuard[types=[Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu), Float(1, 8, 122, 16, strides=[15616, 1952, 16, 1], requires_grad=0, device=cpu)]](%7118, %scaled_dot_product_attention) %7154 : Float(3, 1, 8, 122, 16, strides=[128, 46848, 16, 384, 1], requires_grad=0, device=cpu), %7155 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = prim::If(%7153) block0(): %permute_1.5 : Float(3, 1, 8, 122, 16, strides=[128, 46848, 16, 384, 1], requires_grad=0, device=cpu), %input.2702 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = prim::AIOFusionGroup_2(%7151, %7152) -> (%permute_1.5, %input.2702) block1(): %permute_1.1 : Tensor, %input.38 : Tensor = prim::FallbackGraph_3(%7118, %scaled_dot_product_attention) -> (%permute_1.1, %input.38) %152 : Tensor[] = aten::unbind(%7154, %17) # .1:49:0 %self_blocks_1_attn_q_norm.1 : Tensor, %self_blocks_1_attn_k_norm.1 : Tensor, %getitem_5 : Tensor = prim::ListUnpack(%152) %scaled_dot_product_attention_1 : Float(1, 8, 122, 16, strides=[15616, 1952, 16, 1], requires_grad=0, device=cpu) = aten::scaled_dot_product_attention(%self_blocks_1_attn_q_norm.1, %self_blocks_1_attn_k_norm.1, %getitem_5, %5, %10, %19) # .1:55:0 %7188 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu), %7189 : Float(1, 8, 122, 16, strides=[15616, 1952, 16, 1], requires_grad=0, device=cpu), %7190 : bool = prim::AIOFusionGuard[types=[Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu), Float(1, 8, 122, 16, strides=[15616, 1952, 16, 1], requires_grad=0, device=cpu)]](%7155, %scaled_dot_product_attention_1) %7191 : Float(3, 1, 8, 122, 16, strides=[128, 46848, 16, 384, 1], requires_grad=0, device=cpu), %7192 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = prim::If(%7190) block0(): %permute_2.5 : Float(3, 1, 8, 122, 16, strides=[128, 46848, 16, 384, 1], requires_grad=0, device=cpu), %input.2706 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = prim::AIOFusionGroup_4(%7188, %7189) -> (%permute_2.5, %input.2706) block1(): %permute_2.1 : Tensor, %input.74 : Tensor = prim::FallbackGraph_5(%7155, %scaled_dot_product_attention_1) -> (%permute_2.1, %input.74) %198 : Tensor[] = aten::unbind(%7191, %17) # .1:77:0 %self_blocks_2_attn_q_norm.1 : Tensor, %self_blocks_2_attn_k_norm.1 : Tensor, %getitem_8 : Tensor = prim::ListUnpack(%198) %scaled_dot_product_attention_2 : Float(1, 8, 122, 16, strides=[15616, 1952, 16, 1], requires_grad=0, device=cpu) = aten::scaled_dot_product_attention(%self_blocks_2_attn_q_norm.1, 
%self_blocks_2_attn_k_norm.1, %getitem_8, %5, %10, %19) # .1:83:0 %7225 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu), %7226 : Float(1, 8, 122, 16, strides=[15616, 1952, 16, 1], requires_grad=0, device=cpu), %7227 : bool = prim::AIOFusionGuard[types=[Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu), Float(1, 8, 122, 16, strides=[15616, 1952, 16, 1], requires_grad=0, device=cpu)]](%7192, %scaled_dot_product_attention_2) %7228 : Float(3, 1, 8, 122, 16, strides=[128, 46848, 16, 384, 1], requires_grad=0, device=cpu), %7229 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = prim::If(%7227) block0(): %permute_3.5 : Float(3, 1, 8, 122, 16, strides=[128, 46848, 16, 384, 1], requires_grad=0, device=cpu), %input.2710 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = prim::AIOFusionGroup_6(%7225, %7226) -> (%permute_3.5, %input.2710) block1(): %permute_3.1 : Tensor, %input.110 : Tensor = prim::FallbackGraph_7(%7192, %scaled_dot_product_attention_2) -> (%permute_3.1, %input.110) %244 : Tensor[] = aten::unbind(%7228, %17) # .1:105:0 %self_blocks_3_attn_q_norm.1 : Tensor, %self_blocks_3_attn_k_norm.1 : Tensor, %getitem_11 : Tensor = prim::ListUnpack(%244) %scaled_dot_product_attention_3 : Float(1, 8, 122, 16, strides=[15616, 1952, 16, 1], requires_grad=0, device=cpu) = aten::scaled_dot_product_attention(%self_blocks_3_attn_q_norm.1, %self_blocks_3_attn_k_norm.1, %getitem_11, %5, %10, %19) # .1:111:0 %7262 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu), %7263 : Float(1, 8, 122, 16, strides=[15616, 1952, 16, 1], requires_grad=0, device=cpu), %7264 : bool = prim::AIOFusionGuard[types=[Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu), Float(1, 8, 122, 16, strides=[15616, 1952, 16, 1], requires_grad=0, device=cpu)]](%7229, %scaled_dot_product_attention_3) %7265 : Float(3, 1, 8, 122, 16, strides=[128, 46848, 16, 384, 1], requires_grad=0, device=cpu), %7266 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = prim::If(%7264) block0(): %permute_4.5 : Float(3, 1, 8, 122, 16, strides=[128, 46848, 16, 384, 1], requires_grad=0, device=cpu), %input.2714 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = prim::AIOFusionGroup_8(%7262, %7263) -> (%permute_4.5, %input.2714) block1(): %permute_4.1 : Tensor, %input.146 : Tensor = prim::FallbackGraph_9(%7229, %scaled_dot_product_attention_3) -> (%permute_4.1, %input.146) %290 : Tensor[] = aten::unbind(%7265, %17) # .1:133:0 %self_blocks_4_attn_q_norm.1 : Tensor, %self_blocks_4_attn_k_norm.1 : Tensor, %getitem_14 : Tensor = prim::ListUnpack(%290) %scaled_dot_product_attention_4 : Float(1, 8, 122, 16, strides=[15616, 1952, 16, 1], requires_grad=0, device=cpu) = aten::scaled_dot_product_attention(%self_blocks_4_attn_q_norm.1, %self_blocks_4_attn_k_norm.1, %getitem_14, %5, %10, %19) # .1:139:0 %7299 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu), %7300 : Float(1, 8, 122, 16, strides=[15616, 1952, 16, 1], requires_grad=0, device=cpu), %7301 : bool = prim::AIOFusionGuard[types=[Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu), Float(1, 8, 122, 16, strides=[15616, 1952, 16, 1], requires_grad=0, device=cpu)]](%7266, %scaled_dot_product_attention_4) %7302 : Float(3, 1, 8, 122, 16, strides=[128, 46848, 16, 384, 1], requires_grad=0, device=cpu), %7303 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = 
prim::If(%7301) block0(): %permute_5.5 : Float(3, 1, 8, 122, 16, strides=[128, 46848, 16, 384, 1], requires_grad=0, device=cpu), %input.2718 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = prim::AIOFusionGroup_10(%7299, %7300) -> (%permute_5.5, %input.2718) block1(): %permute_5.1 : Tensor, %input.182 : Tensor = prim::FallbackGraph_11(%7266, %scaled_dot_product_attention_4) -> (%permute_5.1, %input.182) %336 : Tensor[] = aten::unbind(%7302, %17) # .1:161:0 %self_blocks_5_attn_q_norm.1 : Tensor, %self_blocks_5_attn_k_norm.1 : Tensor, %getitem_17 : Tensor = prim::ListUnpack(%336) %scaled_dot_product_attention_5 : Float(1, 8, 122, 16, strides=[15616, 1952, 16, 1], requires_grad=0, device=cpu) = aten::scaled_dot_product_attention(%self_blocks_5_attn_q_norm.1, %self_blocks_5_attn_k_norm.1, %getitem_17, %5, %10, %19) # .1:167:0 %7336 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu), %7337 : Float(1, 8, 122, 16, strides=[15616, 1952, 16, 1], requires_grad=0, device=cpu), %7338 : bool = prim::AIOFusionGuard[types=[Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu), Float(1, 8, 122, 16, strides=[15616, 1952, 16, 1], requires_grad=0, device=cpu)]](%7303, %scaled_dot_product_attention_5) %7339 : Float(3, 1, 8, 122, 16, strides=[128, 46848, 16, 384, 1], requires_grad=0, device=cpu), %7340 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = prim::If(%7338) block0(): %permute_6.5 : Float(3, 1, 8, 122, 16, strides=[128, 46848, 16, 384, 1], requires_grad=0, device=cpu), %input.2722 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = prim::AIOFusionGroup_12(%7336, %7337) -> (%permute_6.5, %input.2722) block1(): %permute_6.1 : Tensor, %input.218 : Tensor = prim::FallbackGraph_13(%7303, %scaled_dot_product_attention_5) -> (%permute_6.1, %input.218) %382 : Tensor[] = aten::unbind(%7339, %17) # .1:189:0 %self_blocks_6_attn_q_norm.1 : Tensor, %self_blocks_6_attn_k_norm.1 : Tensor, %getitem_20 : Tensor = prim::ListUnpack(%382) %scaled_dot_product_attention_6 : Float(1, 8, 122, 16, strides=[15616, 1952, 16, 1], requires_grad=0, device=cpu) = aten::scaled_dot_product_attention(%self_blocks_6_attn_q_norm.1, %self_blocks_6_attn_k_norm.1, %getitem_20, %5, %10, %19) # .1:195:0 %7373 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu), %7374 : Float(1, 8, 122, 16, strides=[15616, 1952, 16, 1], requires_grad=0, device=cpu), %7375 : bool = prim::AIOFusionGuard[types=[Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu), Float(1, 8, 122, 16, strides=[15616, 1952, 16, 1], requires_grad=0, device=cpu)]](%7340, %scaled_dot_product_attention_6) %7376 : Float(3, 1, 8, 122, 16, strides=[128, 46848, 16, 384, 1], requires_grad=0, device=cpu), %7377 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = prim::If(%7375) block0(): %permute_7.5 : Float(3, 1, 8, 122, 16, strides=[128, 46848, 16, 384, 1], requires_grad=0, device=cpu), %input.2726 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = prim::AIOFusionGroup_14(%7373, %7374) -> (%permute_7.5, %input.2726) block1(): %permute_7.1 : Tensor, %input.254 : Tensor = prim::FallbackGraph_15(%7340, %scaled_dot_product_attention_6) -> (%permute_7.1, %input.254) %428 : Tensor[] = aten::unbind(%7376, %17) # .1:217:0 %self_blocks_7_attn_q_norm.1 : Tensor, %self_blocks_7_attn_k_norm.1 : Tensor, %getitem_23 : Tensor = prim::ListUnpack(%428) %scaled_dot_product_attention_7 : 
Float(1, 8, 122, 16, strides=[15616, 1952, 16, 1], requires_grad=0, device=cpu) = aten::scaled_dot_product_attention(%self_blocks_7_attn_q_norm.1, %self_blocks_7_attn_k_norm.1, %getitem_23, %5, %10, %19) # .1:223:0 %7410 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu), %7411 : Float(1, 8, 122, 16, strides=[15616, 1952, 16, 1], requires_grad=0, device=cpu), %7412 : bool = prim::AIOFusionGuard[types=[Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu), Float(1, 8, 122, 16, strides=[15616, 1952, 16, 1], requires_grad=0, device=cpu)]](%7377, %scaled_dot_product_attention_7) %7413 : Float(3, 1, 8, 122, 16, strides=[128, 46848, 16, 384, 1], requires_grad=0, device=cpu), %7414 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = prim::If(%7412) block0(): %permute_8.5 : Float(3, 1, 8, 122, 16, strides=[128, 46848, 16, 384, 1], requires_grad=0, device=cpu), %input.2730 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = prim::AIOFusionGroup_16(%7410, %7411) -> (%permute_8.5, %input.2730) block1(): %permute_8.1 : Tensor, %input.290 : Tensor = prim::FallbackGraph_17(%7377, %scaled_dot_product_attention_7) -> (%permute_8.1, %input.290) %474 : Tensor[] = aten::unbind(%7413, %17) # .1:245:0 %self_blocks_8_attn_q_norm.1 : Tensor, %self_blocks_8_attn_k_norm.1 : Tensor, %getitem_26 : Tensor = prim::ListUnpack(%474) %scaled_dot_product_attention_8 : Float(1, 8, 122, 16, strides=[15616, 1952, 16, 1], requires_grad=0, device=cpu) = aten::scaled_dot_product_attention(%self_blocks_8_attn_q_norm.1, %self_blocks_8_attn_k_norm.1, %getitem_26, %5, %10, %19) # .1:251:0 %7447 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu), %7448 : Float(1, 8, 122, 16, strides=[15616, 1952, 16, 1], requires_grad=0, device=cpu), %7449 : bool = prim::AIOFusionGuard[types=[Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu), Float(1, 8, 122, 16, strides=[15616, 1952, 16, 1], requires_grad=0, device=cpu)]](%7414, %scaled_dot_product_attention_8) %7450 : Float(3, 1, 8, 122, 16, strides=[128, 46848, 16, 384, 1], requires_grad=0, device=cpu), %7451 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = prim::If(%7449) block0(): %permute_9.5 : Float(3, 1, 8, 122, 16, strides=[128, 46848, 16, 384, 1], requires_grad=0, device=cpu), %input.2734 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = prim::AIOFusionGroup_18(%7447, %7448) -> (%permute_9.5, %input.2734) block1(): %permute_9.1 : Tensor, %input.326 : Tensor = prim::FallbackGraph_19(%7414, %scaled_dot_product_attention_8) -> (%permute_9.1, %input.326) %520 : Tensor[] = aten::unbind(%7450, %17) # .1:273:0 %self_blocks_9_attn_q_norm.1 : Tensor, %self_blocks_9_attn_k_norm.1 : Tensor, %getitem_29 : Tensor = prim::ListUnpack(%520) %scaled_dot_product_attention_9 : Float(1, 8, 122, 16, strides=[15616, 1952, 16, 1], requires_grad=0, device=cpu) = aten::scaled_dot_product_attention(%self_blocks_9_attn_q_norm.1, %self_blocks_9_attn_k_norm.1, %getitem_29, %5, %10, %19) # .1:279:0 %7484 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu), %7485 : Float(1, 8, 122, 16, strides=[15616, 1952, 16, 1], requires_grad=0, device=cpu), %7486 : bool = prim::AIOFusionGuard[types=[Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu), Float(1, 8, 122, 16, strides=[15616, 1952, 16, 1], requires_grad=0, device=cpu)]](%7451, %scaled_dot_product_attention_9) %7487 : Float(3, 1, 8, 122, 
16, strides=[128, 46848, 16, 384, 1], requires_grad=0, device=cpu), %7488 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = prim::If(%7486) block0(): %permute_10.5 : Float(3, 1, 8, 122, 16, strides=[128, 46848, 16, 384, 1], requires_grad=0, device=cpu), %input.2738 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = prim::AIOFusionGroup_20(%7484, %7485) -> (%permute_10.5, %input.2738) block1(): %permute_10.1 : Tensor, %input.362 : Tensor = prim::FallbackGraph_21(%7451, %scaled_dot_product_attention_9) -> (%permute_10.1, %input.362) %566 : Tensor[] = aten::unbind(%7487, %17) # .1:301:0 %self_blocks_10_attn_q_norm.1 : Tensor, %self_blocks_10_attn_k_norm.1 : Tensor, %getitem_32 : Tensor = prim::ListUnpack(%566) %scaled_dot_product_attention_10 : Float(1, 8, 122, 16, strides=[15616, 1952, 16, 1], requires_grad=0, device=cpu) = aten::scaled_dot_product_attention(%self_blocks_10_attn_q_norm.1, %self_blocks_10_attn_k_norm.1, %getitem_32, %5, %10, %19) # .1:307:0 %7521 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu), %7522 : Float(1, 8, 122, 16, strides=[15616, 1952, 16, 1], requires_grad=0, device=cpu), %7523 : bool = prim::AIOFusionGuard[types=[Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu), Float(1, 8, 122, 16, strides=[15616, 1952, 16, 1], requires_grad=0, device=cpu)]](%7488, %scaled_dot_product_attention_10) %7524 : Float(3, 1, 8, 122, 16, strides=[128, 46848, 16, 384, 1], requires_grad=0, device=cpu), %7525 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = prim::If(%7523) block0(): %permute_11.5 : Float(3, 1, 8, 122, 16, strides=[128, 46848, 16, 384, 1], requires_grad=0, device=cpu), %input.2742 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = prim::AIOFusionGroup_22(%7521, %7522) -> (%permute_11.5, %input.2742) block1(): %permute_11.1 : Tensor, %input.398 : Tensor = prim::FallbackGraph_23(%7488, %scaled_dot_product_attention_10) -> (%permute_11.1, %input.398) %612 : Tensor[] = aten::unbind(%7524, %17) # .1:329:0 %self_blocks_11_attn_q_norm.1 : Tensor, %self_blocks_11_attn_k_norm.1 : Tensor, %getitem_35 : Tensor = prim::ListUnpack(%612) %scaled_dot_product_attention_11 : Float(1, 8, 122, 16, strides=[15616, 1952, 16, 1], requires_grad=0, device=cpu) = aten::scaled_dot_product_attention(%self_blocks_11_attn_q_norm.1, %self_blocks_11_attn_k_norm.1, %getitem_35, %5, %10, %19) # .1:335:0 %7558 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu), %7559 : Float(1, 8, 122, 16, strides=[15616, 1952, 16, 1], requires_grad=0, device=cpu), %7560 : bool = prim::AIOFusionGuard[types=[Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu), Float(1, 8, 122, 16, strides=[15616, 1952, 16, 1], requires_grad=0, device=cpu)]](%7525, %scaled_dot_product_attention_11) %7561 : Tensor = prim::If(%7560) block0(): %983 : Tensor = prim::AIOFusionGroup_24(%7558, %7559) -> (%983) block1(): %7588 : Tensor = prim::FallbackGraph_25(%7525, %scaled_dot_product_attention_11) -> (%7588) %input.245 : Float(1, 128, strides=[15616, 1], requires_grad=0, device=cpu) = aten::select(%7561, %18, %17) # .1:354:0 %7589 : Float(1, 128, strides=[15616, 1], requires_grad=0, device=cpu), %7590 : bool = prim::AIOFusionGuard[types=[Float(1, 128, strides=[15616, 1], requires_grad=0, device=cpu)]](%input.245) %7591 : Float(1, 1000, strides=[1000, 1], requires_grad=0, device=cpu) = prim::If(%7590) block0(): %985 : Float(1, 1000, strides=[1000, 1], 
requires_grad=0, device=cpu) = prim::AIOFusionGroup_26(%7589) -> (%985) block1(): %7595 : Tensor = prim::FallbackGraph_27(%input.245) -> (%7595) %657 : (Tensor) = prim::TupleConstruct(%7591) return (%657) with prim::AIOFusionGroup_0 = graph(%x : Float(1, 3, 110, 110, strides=[36300, 12100, 110, 1], requires_grad=0, device=cpu)): %1 : int[] = prim::Constant[value=[2, 0, 3, 1, 4]]() %4 : int[] = prim::Constant[value=[1, 122, 3, 8, 16]]() %self.self_blocks_0_attn_qkv.bias.2 : Float(384, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_0_attn_qkv.weight.2 : Float(384, 128, strides=[128, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %17 : float = prim::Constant[value=9.9999999999999995e-07]() %self.self_blocks_0_norm1.bias.3 : Float(128, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_0_norm1.weight.3 : Float(128, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %14 : int[] = prim::Constant[value=[128]]() %self.self_pos_embed.4 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %35 : Float(1, 1, 128, strides=[128, 128, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %47 : int = prim::Constant[value=-1]() %46 : int = prim::Constant[value=2]() %59 : bool = prim::Constant[value=1]() %58 : int = prim::Constant[value=1]() %57 : bool = prim::Constant[value=0]() %56 : int[] = prim::Constant[value=[1, 1]]() %55 : int[] = prim::Constant[value=[0, 0]]() %54 : int[] = prim::Constant[value=[10, 10]]() %self.self_patch_embed_proj.bias.9 : Float(128, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_patch_embed_proj.weight.9 : Float(128, 3, 10, 10, strides=[300, 100, 10, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self_patch_embed_proj.2 : Float(1, 128, 11, 11, strides=[15488, 121, 11, 1], requires_grad=0, device=cpu) = aten::_convolution(%x, %self.self_patch_embed_proj.weight.9, %self.self_patch_embed_proj.bias.9, %54, %55, %56, %57, %55, %58, %57, %57, %59, %59), scope: __module.self_patch_embed_proj # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/conv.py:459:0 %flatten.1 : Float(1, 128, 121, strides=[15488, 121, 1], requires_grad=0, device=cpu) = aten::flatten(%self_patch_embed_proj.2, %46, %47) # .1:6:0 %69 : int[] = prim::Constant[value=[0, 2, 1]]() %70 : Float(1, 121, 128, strides=[15488, 1, 121], requires_grad=0, device=cpu) = aten::permute(%flatten.1, %69) %37 : Tensor[] = prim::ListConstruct(%35, %70) %cat.1 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::cat(%37, %58) # .1:12:0 %input.2 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::add(%cat.1, %self.self_pos_embed.4, %58) # .1:13:0 %input.6 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::layer_norm(%input.2, %14, %self.self_blocks_0_norm1.weight.3, %self.self_blocks_0_norm1.bias.3, %17, %59), scope: __module.self_blocks_0_norm1 # /usr/local/lib/python3.10/dist-packages/torch/nn/functional.py:2515:0 %self_blocks_0_attn_qkv.2 : Float(1, 122, 384, strides=[46848, 384, 1], requires_grad=0, device=cpu) = aten::linear(%input.6, %self.self_blocks_0_attn_qkv.weight.2, %self.self_blocks_0_attn_qkv.bias.2), scope: __module.self_blocks_0_attn_qkv # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 %reshape.1 : Float(1, 122, 3, 8, 16, strides=[46848, 384, 128, 16, 1], requires_grad=0, device=cpu) = 
aten::reshape(%self_blocks_0_attn_qkv.2, %4) # .1:19:0 %permute.1 : Float(3, 1, 8, 122, 16, strides=[128, 46848, 16, 384, 1], requires_grad=0, device=cpu) = aten::permute(%reshape.1, %1) # .1:20:0 return (%permute.1, %input.2) with prim::FallbackGraph_1 = graph(%x : Float(1, 3, 110, 110, strides=[36300, 12100, 110, 1], requires_grad=0, device=cpu)): %1 : int[] = prim::Constant[value=[2, 0, 3, 1, 4]]() %2 : int[] = prim::Constant[value=[1, 122, 3, 8, 16]]() %self.self_blocks_0_attn_qkv.bias.2 : Float(384, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_0_attn_qkv.weight.2 : Float(384, 128, strides=[128, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %5 : float = prim::Constant[value=9.9999999999999995e-07]() %self.self_blocks_0_norm1.bias.3 : Float(128, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_0_norm1.weight.3 : Float(128, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %8 : int[] = prim::Constant[value=[128]]() %self.self_pos_embed.4 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %10 : Float(1, 1, 128, strides=[128, 128, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %11 : int = prim::Constant[value=-1]() %12 : int = prim::Constant[value=2]() %13 : bool = prim::Constant[value=1]() %14 : int = prim::Constant[value=1]() %15 : bool = prim::Constant[value=0]() %16 : int[] = prim::Constant[value=[1, 1]]() %17 : int[] = prim::Constant[value=[0, 0]]() %18 : int[] = prim::Constant[value=[10, 10]]() %self.self_patch_embed_proj.bias.9 : Float(128, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_patch_embed_proj.weight.9 : Float(128, 3, 10, 10, strides=[300, 100, 10, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self_patch_embed_proj.2 : Tensor = aten::_convolution(%x, %self.self_patch_embed_proj.weight.9, %self.self_patch_embed_proj.bias.9, %18, %17, %16, %15, %17, %14, %15, %15, %13, %13), scope: __module.self_patch_embed_proj # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/conv.py:459:0 %flatten.1 : Tensor = aten::flatten(%self_patch_embed_proj.2, %12, %11) # .1:6:0 %self_patch_embed_norm.2 : Tensor = aten::transpose(%flatten.1, %14, %12) # .1:7:0 %24 : Tensor[] = prim::ListConstruct(%10, %self_patch_embed_norm.2) %cat.1 : Tensor = aten::cat(%24, %14) # .1:12:0 %input.2 : Tensor = aten::add(%cat.1, %self.self_pos_embed.4, %14) # .1:13:0 %input.6 : Tensor = aten::layer_norm(%input.2, %8, %self.self_blocks_0_norm1.weight.3, %self.self_blocks_0_norm1.bias.3, %5, %13), scope: __module.self_blocks_0_norm1 # /usr/local/lib/python3.10/dist-packages/torch/nn/functional.py:2515:0 %self_blocks_0_attn_qkv.2 : Tensor = aten::linear(%input.6, %self.self_blocks_0_attn_qkv.weight.2, %self.self_blocks_0_attn_qkv.bias.2), scope: __module.self_blocks_0_attn_qkv # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 %reshape.1 : Tensor = aten::reshape(%self_blocks_0_attn_qkv.2, %2) # .1:19:0 %permute.1 : Tensor = aten::permute(%reshape.1, %1) # .1:20:0 return (%permute.1, %input.2) with prim::AIOFusionGroup_2 = graph(%input.2 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu), %scaled_dot_product_attention : Float(1, 8, 122, 16, strides=[15616, 1952, 16, 1], requires_grad=0, device=cpu)): %1 : int[] = prim::Constant[value=[2, 0, 3, 1, 4]]() %4 : int[] = prim::Constant[value=[1, 122, 3, 8, 16]]() %self.self_blocks_0_attn_qkv.bias.2 : 
Float(384, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_1_attn_qkv.weight.2 : Float(384, 128, strides=[128, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_0_mlp_fc2.weight.5 : Float(128, 512, strides=[512, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %37 : str = prim::Constant[value="none"]() %self.self_blocks_0_mlp_fc1.bias.7 : Float(512, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_0_mlp_fc1.weight.7 : Float(512, 128, strides=[128, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %50 : bool = prim::Constant[value=1]() %49 : float = prim::Constant[value=9.9999999999999995e-07]() %self.self_blocks_0_norm1.weight.8 : Float(128, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %46 : int[] = prim::Constant[value=[128]]() %self.self_blocks_0_norm1.bias.10 : Float(128, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_0_attn_proj.weight.10 : Float(128, 128, strides=[128, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %69 : int[] = prim::Constant[value=[1, 122, 128]]() %73 : int = prim::Constant[value=1]() %78 : int[] = prim::Constant[value=[0, 2, 1, 3]]() %79 : Float(1, 122, 8, 16, strides=[15616, 16, 1952, 1], requires_grad=0, device=cpu) = aten::permute(%scaled_dot_product_attention, %78) %input.10 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::reshape(%79, %69) # .1:29:0 %input.14 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::linear(%input.10, %self.self_blocks_0_attn_proj.weight.10, %self.self_blocks_0_norm1.bias.10), scope: __module.self_blocks_0_attn_proj # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 %input.18 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::add(%input.2, %input.14, %73) # .1:34:0 %input.22 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::layer_norm(%input.18, %46, %self.self_blocks_0_norm1.weight.8, %self.self_blocks_0_norm1.bias.10, %49, %50), scope: __module.self_blocks_0_norm2 # /usr/local/lib/python3.10/dist-packages/torch/nn/functional.py:2515:0 %input.26 : Float(1, 122, 512, strides=[62464, 512, 1], requires_grad=0, device=cpu) = aten::linear(%input.22, %self.self_blocks_0_mlp_fc1.weight.7, %self.self_blocks_0_mlp_fc1.bias.7), scope: __module.self_blocks_0_mlp_fc1 # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 %input.30 : Float(1, 122, 512, strides=[62464, 512, 1], requires_grad=0, device=cpu) = aten::gelu(%input.26, %37), scope: __module.self_blocks_0_mlp_act # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/activation.py:685:0 %input.34 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::linear(%input.30, %self.self_blocks_0_mlp_fc2.weight.5, %self.self_blocks_0_norm1.bias.10), scope: __module.self_blocks_0_mlp_fc2 # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 %input.38 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::add(%input.18, %input.34, %73) # .1:44:0 %input.42 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::layer_norm(%input.38, %46, %self.self_blocks_0_norm1.weight.8, %self.self_blocks_0_norm1.bias.10, %49, %50), scope: __module.self_blocks_1_norm1 # 
/usr/local/lib/python3.10/dist-packages/torch/nn/functional.py:2515:0 %self_blocks_1_attn_qkv.2 : Float(1, 122, 384, strides=[46848, 384, 1], requires_grad=0, device=cpu) = aten::linear(%input.42, %self.self_blocks_1_attn_qkv.weight.2, %self.self_blocks_0_attn_qkv.bias.2), scope: __module.self_blocks_1_attn_qkv # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 %reshape_2.1 : Float(1, 122, 3, 8, 16, strides=[46848, 384, 128, 16, 1], requires_grad=0, device=cpu) = aten::reshape(%self_blocks_1_attn_qkv.2, %4) # .1:47:0 %permute_1.1 : Float(3, 1, 8, 122, 16, strides=[128, 46848, 16, 384, 1], requires_grad=0, device=cpu) = aten::permute(%reshape_2.1, %1) # .1:48:0 return (%permute_1.1, %input.38) with prim::FallbackGraph_3 = graph(%0 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu), %scaled_dot_product_attention : Float(1, 8, 122, 16, strides=[15616, 1952, 16, 1], requires_grad=0, device=cpu)): %2 : int[] = prim::Constant[value=[2, 0, 3, 1, 4]]() %3 : int[] = prim::Constant[value=[1, 122, 3, 8, 16]]() %self.self_blocks_0_attn_qkv.bias.2 : Float(384, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_1_attn_qkv.weight.2 : Float(384, 128, strides=[128, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_0_mlp_fc2.weight.5 : Float(128, 512, strides=[512, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %7 : str = prim::Constant[value="none"]() %self.self_blocks_0_mlp_fc1.bias.7 : Float(512, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_0_mlp_fc1.weight.7 : Float(512, 128, strides=[128, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %10 : bool = prim::Constant[value=1]() %11 : float = prim::Constant[value=9.9999999999999995e-07]() %self.self_blocks_0_norm1.weight.8 : Float(128, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %13 : int[] = prim::Constant[value=[128]]() %self.self_blocks_0_norm1.bias.10 : Float(128, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_0_attn_proj.weight.10 : Float(128, 128, strides=[128, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %16 : int[] = prim::Constant[value=[1, 122, 128]]() %17 : int = prim::Constant[value=2]() %18 : int = prim::Constant[value=1]() %transpose_1.1 : Tensor = aten::transpose(%scaled_dot_product_attention, %18, %17) # .1:28:0 %input.10 : Tensor = aten::reshape(%transpose_1.1, %16) # .1:29:0 %input.14 : Tensor = aten::linear(%input.10, %self.self_blocks_0_attn_proj.weight.10, %self.self_blocks_0_norm1.bias.10), scope: __module.self_blocks_0_attn_proj # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 %input.18 : Tensor = aten::add(%0, %input.14, %18) # .1:34:0 %input.22 : Tensor = aten::layer_norm(%input.18, %13, %self.self_blocks_0_norm1.weight.8, %self.self_blocks_0_norm1.bias.10, %11, %10), scope: __module.self_blocks_0_norm2 # /usr/local/lib/python3.10/dist-packages/torch/nn/functional.py:2515:0 %input.26 : Tensor = aten::linear(%input.22, %self.self_blocks_0_mlp_fc1.weight.7, %self.self_blocks_0_mlp_fc1.bias.7), scope: __module.self_blocks_0_mlp_fc1 # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 %input.30 : Tensor = aten::gelu(%input.26, %7), scope: __module.self_blocks_0_mlp_act # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/activation.py:685:0 %input.34 : Tensor = aten::linear(%input.30, %self.self_blocks_0_mlp_fc2.weight.5, 
%self.self_blocks_0_norm1.bias.10), scope: __module.self_blocks_0_mlp_fc2 # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 %input.38 : Tensor = aten::add(%input.18, %input.34, %18) # .1:44:0 %input.42 : Tensor = aten::layer_norm(%input.38, %13, %self.self_blocks_0_norm1.weight.8, %self.self_blocks_0_norm1.bias.10, %11, %10), scope: __module.self_blocks_1_norm1 # /usr/local/lib/python3.10/dist-packages/torch/nn/functional.py:2515:0 %self_blocks_1_attn_qkv.2 : Tensor = aten::linear(%input.42, %self.self_blocks_1_attn_qkv.weight.2, %self.self_blocks_0_attn_qkv.bias.2), scope: __module.self_blocks_1_attn_qkv # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 %reshape_2.1 : Tensor = aten::reshape(%self_blocks_1_attn_qkv.2, %3) # .1:47:0 %permute_1.1 : Tensor = aten::permute(%reshape_2.1, %2) # .1:48:0 return (%permute_1.1, %input.38) with prim::AIOFusionGroup_4 = graph(%input.38 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu), %scaled_dot_product_attention_1 : Float(1, 8, 122, 16, strides=[15616, 1952, 16, 1], requires_grad=0, device=cpu)): %1 : int[] = prim::Constant[value=[2, 0, 3, 1, 4]]() %4 : int[] = prim::Constant[value=[1, 122, 3, 8, 16]]() %self.self_blocks_0_attn_qkv.bias.2 : Float(384, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_2_attn_qkv.weight.2 : Float(384, 128, strides=[128, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_1_mlp_fc2.weight.5 : Float(128, 512, strides=[512, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %37 : str = prim::Constant[value="none"]() %self.self_blocks_0_mlp_fc1.bias.7 : Float(512, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_1_mlp_fc1.weight.7 : Float(512, 128, strides=[128, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %50 : bool = prim::Constant[value=1]() %49 : float = prim::Constant[value=9.9999999999999995e-07]() %self.self_blocks_0_norm1.weight.8 : Float(128, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %46 : int[] = prim::Constant[value=[128]]() %self.self_blocks_0_norm1.bias.10 : Float(128, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_1_attn_proj.weight.10 : Float(128, 128, strides=[128, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %69 : int[] = prim::Constant[value=[1, 122, 128]]() %73 : int = prim::Constant[value=1]() %78 : int[] = prim::Constant[value=[0, 2, 1, 3]]() %79 : Float(1, 122, 8, 16, strides=[15616, 16, 1952, 1], requires_grad=0, device=cpu) = aten::permute(%scaled_dot_product_attention_1, %78) %input.46 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::reshape(%79, %69) # .1:57:0 %input.50 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::linear(%input.46, %self.self_blocks_1_attn_proj.weight.10, %self.self_blocks_0_norm1.bias.10), scope: __module.self_blocks_1_attn_proj # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 %input.54 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::add(%input.38, %input.50, %73) # .1:62:0 %input.58 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::layer_norm(%input.54, %46, %self.self_blocks_0_norm1.weight.8, %self.self_blocks_0_norm1.bias.10, %49, %50), scope: __module.self_blocks_1_norm2 # 
/usr/local/lib/python3.10/dist-packages/torch/nn/functional.py:2515:0 %input.62 : Float(1, 122, 512, strides=[62464, 512, 1], requires_grad=0, device=cpu) = aten::linear(%input.58, %self.self_blocks_1_mlp_fc1.weight.7, %self.self_blocks_0_mlp_fc1.bias.7), scope: __module.self_blocks_1_mlp_fc1 # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 %input.66 : Float(1, 122, 512, strides=[62464, 512, 1], requires_grad=0, device=cpu) = aten::gelu(%input.62, %37), scope: __module.self_blocks_1_mlp_act # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/activation.py:685:0 %input.70 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::linear(%input.66, %self.self_blocks_1_mlp_fc2.weight.5, %self.self_blocks_0_norm1.bias.10), scope: __module.self_blocks_1_mlp_fc2 # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 %input.74 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::add(%input.54, %input.70, %73) # .1:72:0 %input.78 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::layer_norm(%input.74, %46, %self.self_blocks_0_norm1.weight.8, %self.self_blocks_0_norm1.bias.10, %49, %50), scope: __module.self_blocks_2_norm1 # /usr/local/lib/python3.10/dist-packages/torch/nn/functional.py:2515:0 %self_blocks_2_attn_qkv.2 : Float(1, 122, 384, strides=[46848, 384, 1], requires_grad=0, device=cpu) = aten::linear(%input.78, %self.self_blocks_2_attn_qkv.weight.2, %self.self_blocks_0_attn_qkv.bias.2), scope: __module.self_blocks_2_attn_qkv # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 %reshape_4.1 : Float(1, 122, 3, 8, 16, strides=[46848, 384, 128, 16, 1], requires_grad=0, device=cpu) = aten::reshape(%self_blocks_2_attn_qkv.2, %4) # .1:75:0 %permute_2.1 : Float(3, 1, 8, 122, 16, strides=[128, 46848, 16, 384, 1], requires_grad=0, device=cpu) = aten::permute(%reshape_4.1, %1) # .1:76:0 return (%permute_2.1, %input.74) with prim::FallbackGraph_5 = graph(%0 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu), %scaled_dot_product_attention_1 : Float(1, 8, 122, 16, strides=[15616, 1952, 16, 1], requires_grad=0, device=cpu)): %2 : int[] = prim::Constant[value=[2, 0, 3, 1, 4]]() %3 : int[] = prim::Constant[value=[1, 122, 3, 8, 16]]() %self.self_blocks_0_attn_qkv.bias.2 : Float(384, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_2_attn_qkv.weight.2 : Float(384, 128, strides=[128, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_1_mlp_fc2.weight.5 : Float(128, 512, strides=[512, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %7 : str = prim::Constant[value="none"]() %self.self_blocks_0_mlp_fc1.bias.7 : Float(512, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_1_mlp_fc1.weight.7 : Float(512, 128, strides=[128, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %10 : bool = prim::Constant[value=1]() %11 : float = prim::Constant[value=9.9999999999999995e-07]() %self.self_blocks_0_norm1.weight.8 : Float(128, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %13 : int[] = prim::Constant[value=[128]]() %self.self_blocks_0_norm1.bias.10 : Float(128, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_1_attn_proj.weight.10 : Float(128, 128, strides=[128, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %16 : int[] = prim::Constant[value=[1, 
122, 128]]() %17 : int = prim::Constant[value=2]() %18 : int = prim::Constant[value=1]() %transpose_2.1 : Tensor = aten::transpose(%scaled_dot_product_attention_1, %18, %17) # .1:56:0 %input.46 : Tensor = aten::reshape(%transpose_2.1, %16) # .1:57:0 %input.50 : Tensor = aten::linear(%input.46, %self.self_blocks_1_attn_proj.weight.10, %self.self_blocks_0_norm1.bias.10), scope: __module.self_blocks_1_attn_proj # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 %input.54 : Tensor = aten::add(%0, %input.50, %18) # .1:62:0 %input.58 : Tensor = aten::layer_norm(%input.54, %13, %self.self_blocks_0_norm1.weight.8, %self.self_blocks_0_norm1.bias.10, %11, %10), scope: __module.self_blocks_1_norm2 # /usr/local/lib/python3.10/dist-packages/torch/nn/functional.py:2515:0 %input.62 : Tensor = aten::linear(%input.58, %self.self_blocks_1_mlp_fc1.weight.7, %self.self_blocks_0_mlp_fc1.bias.7), scope: __module.self_blocks_1_mlp_fc1 # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 %input.66 : Tensor = aten::gelu(%input.62, %7), scope: __module.self_blocks_1_mlp_act # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/activation.py:685:0 %input.70 : Tensor = aten::linear(%input.66, %self.self_blocks_1_mlp_fc2.weight.5, %self.self_blocks_0_norm1.bias.10), scope: __module.self_blocks_1_mlp_fc2 # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 %input.74 : Tensor = aten::add(%input.54, %input.70, %18) # .1:72:0 %input.78 : Tensor = aten::layer_norm(%input.74, %13, %self.self_blocks_0_norm1.weight.8, %self.self_blocks_0_norm1.bias.10, %11, %10), scope: __module.self_blocks_2_norm1 # /usr/local/lib/python3.10/dist-packages/torch/nn/functional.py:2515:0 %self_blocks_2_attn_qkv.2 : Tensor = aten::linear(%input.78, %self.self_blocks_2_attn_qkv.weight.2, %self.self_blocks_0_attn_qkv.bias.2), scope: __module.self_blocks_2_attn_qkv # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 %reshape_4.1 : Tensor = aten::reshape(%self_blocks_2_attn_qkv.2, %3) # .1:75:0 %permute_2.1 : Tensor = aten::permute(%reshape_4.1, %2) # .1:76:0 return (%permute_2.1, %input.74) with prim::AIOFusionGroup_6 = graph(%input.74 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu), %scaled_dot_product_attention_2 : Float(1, 8, 122, 16, strides=[15616, 1952, 16, 1], requires_grad=0, device=cpu)): %1 : int[] = prim::Constant[value=[2, 0, 3, 1, 4]]() %4 : int[] = prim::Constant[value=[1, 122, 3, 8, 16]]() %self.self_blocks_0_attn_qkv.bias.2 : Float(384, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_3_attn_qkv.weight.2 : Float(384, 128, strides=[128, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_2_mlp_fc2.weight.5 : Float(128, 512, strides=[512, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %37 : str = prim::Constant[value="none"]() %self.self_blocks_0_mlp_fc1.bias.7 : Float(512, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_2_mlp_fc1.weight.7 : Float(512, 128, strides=[128, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %50 : bool = prim::Constant[value=1]() %49 : float = prim::Constant[value=9.9999999999999995e-07]() %self.self_blocks_0_norm1.weight.8 : Float(128, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %46 : int[] = prim::Constant[value=[128]]() %self.self_blocks_0_norm1.bias.10 : Float(128, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() 
%self.self_blocks_2_attn_proj.weight.10 : Float(128, 128, strides=[128, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %69 : int[] = prim::Constant[value=[1, 122, 128]]() %73 : int = prim::Constant[value=1]() %78 : int[] = prim::Constant[value=[0, 2, 1, 3]]() %79 : Float(1, 122, 8, 16, strides=[15616, 16, 1952, 1], requires_grad=0, device=cpu) = aten::permute(%scaled_dot_product_attention_2, %78) %input.82 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::reshape(%79, %69) # .1:85:0 %input.86 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::linear(%input.82, %self.self_blocks_2_attn_proj.weight.10, %self.self_blocks_0_norm1.bias.10), scope: __module.self_blocks_2_attn_proj # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 %input.90 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::add(%input.74, %input.86, %73) # .1:90:0 %input.94 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::layer_norm(%input.90, %46, %self.self_blocks_0_norm1.weight.8, %self.self_blocks_0_norm1.bias.10, %49, %50), scope: __module.self_blocks_2_norm2 # /usr/local/lib/python3.10/dist-packages/torch/nn/functional.py:2515:0 %input.98 : Float(1, 122, 512, strides=[62464, 512, 1], requires_grad=0, device=cpu) = aten::linear(%input.94, %self.self_blocks_2_mlp_fc1.weight.7, %self.self_blocks_0_mlp_fc1.bias.7), scope: __module.self_blocks_2_mlp_fc1 # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 %input.102 : Float(1, 122, 512, strides=[62464, 512, 1], requires_grad=0, device=cpu) = aten::gelu(%input.98, %37), scope: __module.self_blocks_2_mlp_act # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/activation.py:685:0 %input.106 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::linear(%input.102, %self.self_blocks_2_mlp_fc2.weight.5, %self.self_blocks_0_norm1.bias.10), scope: __module.self_blocks_2_mlp_fc2 # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 %input.110 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::add(%input.90, %input.106, %73) # .1:100:0 %input.114 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::layer_norm(%input.110, %46, %self.self_blocks_0_norm1.weight.8, %self.self_blocks_0_norm1.bias.10, %49, %50), scope: __module.self_blocks_3_norm1 # /usr/local/lib/python3.10/dist-packages/torch/nn/functional.py:2515:0 %self_blocks_3_attn_qkv.2 : Float(1, 122, 384, strides=[46848, 384, 1], requires_grad=0, device=cpu) = aten::linear(%input.114, %self.self_blocks_3_attn_qkv.weight.2, %self.self_blocks_0_attn_qkv.bias.2), scope: __module.self_blocks_3_attn_qkv # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 %reshape_6.1 : Float(1, 122, 3, 8, 16, strides=[46848, 384, 128, 16, 1], requires_grad=0, device=cpu) = aten::reshape(%self_blocks_3_attn_qkv.2, %4) # .1:103:0 %permute_3.1 : Float(3, 1, 8, 122, 16, strides=[128, 46848, 16, 384, 1], requires_grad=0, device=cpu) = aten::permute(%reshape_6.1, %1) # .1:104:0 return (%permute_3.1, %input.110) with prim::FallbackGraph_7 = graph(%0 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu), %scaled_dot_product_attention_2 : Float(1, 8, 122, 16, strides=[15616, 1952, 16, 1], requires_grad=0, device=cpu)): %2 : int[] = prim::Constant[value=[2, 0, 3, 1, 4]]() %3 : int[] = prim::Constant[value=[1, 122, 3, 8, 
16]]() %self.self_blocks_0_attn_qkv.bias.2 : Float(384, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_3_attn_qkv.weight.2 : Float(384, 128, strides=[128, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_2_mlp_fc2.weight.5 : Float(128, 512, strides=[512, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %7 : str = prim::Constant[value="none"]() %self.self_blocks_0_mlp_fc1.bias.7 : Float(512, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_2_mlp_fc1.weight.7 : Float(512, 128, strides=[128, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %10 : bool = prim::Constant[value=1]() %11 : float = prim::Constant[value=9.9999999999999995e-07]() %self.self_blocks_0_norm1.weight.8 : Float(128, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %13 : int[] = prim::Constant[value=[128]]() %self.self_blocks_0_norm1.bias.10 : Float(128, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_2_attn_proj.weight.10 : Float(128, 128, strides=[128, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %16 : int[] = prim::Constant[value=[1, 122, 128]]() %17 : int = prim::Constant[value=2]() %18 : int = prim::Constant[value=1]() %transpose_3.1 : Tensor = aten::transpose(%scaled_dot_product_attention_2, %18, %17) # .1:84:0 %input.82 : Tensor = aten::reshape(%transpose_3.1, %16) # .1:85:0 %input.86 : Tensor = aten::linear(%input.82, %self.self_blocks_2_attn_proj.weight.10, %self.self_blocks_0_norm1.bias.10), scope: __module.self_blocks_2_attn_proj # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 %input.90 : Tensor = aten::add(%0, %input.86, %18) # .1:90:0 %input.94 : Tensor = aten::layer_norm(%input.90, %13, %self.self_blocks_0_norm1.weight.8, %self.self_blocks_0_norm1.bias.10, %11, %10), scope: __module.self_blocks_2_norm2 # /usr/local/lib/python3.10/dist-packages/torch/nn/functional.py:2515:0 %input.98 : Tensor = aten::linear(%input.94, %self.self_blocks_2_mlp_fc1.weight.7, %self.self_blocks_0_mlp_fc1.bias.7), scope: __module.self_blocks_2_mlp_fc1 # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 %input.102 : Tensor = aten::gelu(%input.98, %7), scope: __module.self_blocks_2_mlp_act # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/activation.py:685:0 %input.106 : Tensor = aten::linear(%input.102, %self.self_blocks_2_mlp_fc2.weight.5, %self.self_blocks_0_norm1.bias.10), scope: __module.self_blocks_2_mlp_fc2 # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 %input.110 : Tensor = aten::add(%input.90, %input.106, %18) # .1:100:0 %input.114 : Tensor = aten::layer_norm(%input.110, %13, %self.self_blocks_0_norm1.weight.8, %self.self_blocks_0_norm1.bias.10, %11, %10), scope: __module.self_blocks_3_norm1 # /usr/local/lib/python3.10/dist-packages/torch/nn/functional.py:2515:0 %self_blocks_3_attn_qkv.2 : Tensor = aten::linear(%input.114, %self.self_blocks_3_attn_qkv.weight.2, %self.self_blocks_0_attn_qkv.bias.2), scope: __module.self_blocks_3_attn_qkv # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 %reshape_6.1 : Tensor = aten::reshape(%self_blocks_3_attn_qkv.2, %3) # .1:103:0 %permute_3.1 : Tensor = aten::permute(%reshape_6.1, %2) # .1:104:0 return (%permute_3.1, %input.110) with prim::AIOFusionGroup_8 = graph(%input.110 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu), %scaled_dot_product_attention_3 : 
Float(1, 8, 122, 16, strides=[15616, 1952, 16, 1], requires_grad=0, device=cpu)): %1 : int[] = prim::Constant[value=[2, 0, 3, 1, 4]]() %4 : int[] = prim::Constant[value=[1, 122, 3, 8, 16]]() %self.self_blocks_0_attn_qkv.bias.2 : Float(384, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_4_attn_qkv.weight.2 : Float(384, 128, strides=[128, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_3_mlp_fc2.weight.5 : Float(128, 512, strides=[512, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %37 : str = prim::Constant[value="none"]() %self.self_blocks_0_mlp_fc1.bias.7 : Float(512, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_3_mlp_fc1.weight.7 : Float(512, 128, strides=[128, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %50 : bool = prim::Constant[value=1]() %49 : float = prim::Constant[value=9.9999999999999995e-07]() %self.self_blocks_0_norm1.weight.8 : Float(128, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %46 : int[] = prim::Constant[value=[128]]() %self.self_blocks_0_norm1.bias.10 : Float(128, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_3_attn_proj.weight.10 : Float(128, 128, strides=[128, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %69 : int[] = prim::Constant[value=[1, 122, 128]]() %73 : int = prim::Constant[value=1]() %78 : int[] = prim::Constant[value=[0, 2, 1, 3]]() %79 : Float(1, 122, 8, 16, strides=[15616, 16, 1952, 1], requires_grad=0, device=cpu) = aten::permute(%scaled_dot_product_attention_3, %78) %input.118 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::reshape(%79, %69) # .1:113:0 %input.122 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::linear(%input.118, %self.self_blocks_3_attn_proj.weight.10, %self.self_blocks_0_norm1.bias.10), scope: __module.self_blocks_3_attn_proj # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 %input.126 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::add(%input.110, %input.122, %73) # .1:118:0 %input.130 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::layer_norm(%input.126, %46, %self.self_blocks_0_norm1.weight.8, %self.self_blocks_0_norm1.bias.10, %49, %50), scope: __module.self_blocks_3_norm2 # /usr/local/lib/python3.10/dist-packages/torch/nn/functional.py:2515:0 %input.134 : Float(1, 122, 512, strides=[62464, 512, 1], requires_grad=0, device=cpu) = aten::linear(%input.130, %self.self_blocks_3_mlp_fc1.weight.7, %self.self_blocks_0_mlp_fc1.bias.7), scope: __module.self_blocks_3_mlp_fc1 # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 %input.138 : Float(1, 122, 512, strides=[62464, 512, 1], requires_grad=0, device=cpu) = aten::gelu(%input.134, %37), scope: __module.self_blocks_3_mlp_act # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/activation.py:685:0 %input.142 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::linear(%input.138, %self.self_blocks_3_mlp_fc2.weight.5, %self.self_blocks_0_norm1.bias.10), scope: __module.self_blocks_3_mlp_fc2 # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 %input.146 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::add(%input.126, %input.142, %73) # .1:128:0 %input.150 : Float(1, 122, 128, strides=[15616, 128, 1], 
requires_grad=0, device=cpu) = aten::layer_norm(%input.146, %46, %self.self_blocks_0_norm1.weight.8, %self.self_blocks_0_norm1.bias.10, %49, %50), scope: __module.self_blocks_4_norm1 # /usr/local/lib/python3.10/dist-packages/torch/nn/functional.py:2515:0 %self_blocks_4_attn_qkv.2 : Float(1, 122, 384, strides=[46848, 384, 1], requires_grad=0, device=cpu) = aten::linear(%input.150, %self.self_blocks_4_attn_qkv.weight.2, %self.self_blocks_0_attn_qkv.bias.2), scope: __module.self_blocks_4_attn_qkv # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 %reshape_8.1 : Float(1, 122, 3, 8, 16, strides=[46848, 384, 128, 16, 1], requires_grad=0, device=cpu) = aten::reshape(%self_blocks_4_attn_qkv.2, %4) # .1:131:0 %permute_4.1 : Float(3, 1, 8, 122, 16, strides=[128, 46848, 16, 384, 1], requires_grad=0, device=cpu) = aten::permute(%reshape_8.1, %1) # .1:132:0 return (%permute_4.1, %input.146) with prim::FallbackGraph_9 = graph(%0 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu), %scaled_dot_product_attention_3 : Float(1, 8, 122, 16, strides=[15616, 1952, 16, 1], requires_grad=0, device=cpu)): %2 : int[] = prim::Constant[value=[2, 0, 3, 1, 4]]() %3 : int[] = prim::Constant[value=[1, 122, 3, 8, 16]]() %self.self_blocks_0_attn_qkv.bias.2 : Float(384, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_4_attn_qkv.weight.2 : Float(384, 128, strides=[128, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_3_mlp_fc2.weight.5 : Float(128, 512, strides=[512, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %7 : str = prim::Constant[value="none"]() %self.self_blocks_0_mlp_fc1.bias.7 : Float(512, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_3_mlp_fc1.weight.7 : Float(512, 128, strides=[128, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %10 : bool = prim::Constant[value=1]() %11 : float = prim::Constant[value=9.9999999999999995e-07]() %self.self_blocks_0_norm1.weight.8 : Float(128, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %13 : int[] = prim::Constant[value=[128]]() %self.self_blocks_0_norm1.bias.10 : Float(128, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_3_attn_proj.weight.10 : Float(128, 128, strides=[128, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %16 : int[] = prim::Constant[value=[1, 122, 128]]() %17 : int = prim::Constant[value=2]() %18 : int = prim::Constant[value=1]() %transpose_4.1 : Tensor = aten::transpose(%scaled_dot_product_attention_3, %18, %17) # .1:112:0 %input.118 : Tensor = aten::reshape(%transpose_4.1, %16) # .1:113:0 %input.122 : Tensor = aten::linear(%input.118, %self.self_blocks_3_attn_proj.weight.10, %self.self_blocks_0_norm1.bias.10), scope: __module.self_blocks_3_attn_proj # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 %input.126 : Tensor = aten::add(%0, %input.122, %18) # .1:118:0 %input.130 : Tensor = aten::layer_norm(%input.126, %13, %self.self_blocks_0_norm1.weight.8, %self.self_blocks_0_norm1.bias.10, %11, %10), scope: __module.self_blocks_3_norm2 # /usr/local/lib/python3.10/dist-packages/torch/nn/functional.py:2515:0 %input.134 : Tensor = aten::linear(%input.130, %self.self_blocks_3_mlp_fc1.weight.7, %self.self_blocks_0_mlp_fc1.bias.7), scope: __module.self_blocks_3_mlp_fc1 # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 %input.138 : Tensor = aten::gelu(%input.134, %7), 
scope: __module.self_blocks_3_mlp_act # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/activation.py:685:0 %input.142 : Tensor = aten::linear(%input.138, %self.self_blocks_3_mlp_fc2.weight.5, %self.self_blocks_0_norm1.bias.10), scope: __module.self_blocks_3_mlp_fc2 # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 %input.146 : Tensor = aten::add(%input.126, %input.142, %18) # .1:128:0 %input.150 : Tensor = aten::layer_norm(%input.146, %13, %self.self_blocks_0_norm1.weight.8, %self.self_blocks_0_norm1.bias.10, %11, %10), scope: __module.self_blocks_4_norm1 # /usr/local/lib/python3.10/dist-packages/torch/nn/functional.py:2515:0 %self_blocks_4_attn_qkv.2 : Tensor = aten::linear(%input.150, %self.self_blocks_4_attn_qkv.weight.2, %self.self_blocks_0_attn_qkv.bias.2), scope: __module.self_blocks_4_attn_qkv # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 %reshape_8.1 : Tensor = aten::reshape(%self_blocks_4_attn_qkv.2, %3) # .1:131:0 %permute_4.1 : Tensor = aten::permute(%reshape_8.1, %2) # .1:132:0 return (%permute_4.1, %input.146) with prim::AIOFusionGroup_10 = graph(%input.146 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu), %scaled_dot_product_attention_4 : Float(1, 8, 122, 16, strides=[15616, 1952, 16, 1], requires_grad=0, device=cpu)): %1 : int[] = prim::Constant[value=[2, 0, 3, 1, 4]]() %4 : int[] = prim::Constant[value=[1, 122, 3, 8, 16]]() %self.self_blocks_0_attn_qkv.bias.2 : Float(384, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_5_attn_qkv.weight.2 : Float(384, 128, strides=[128, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_4_mlp_fc2.weight.5 : Float(128, 512, strides=[512, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %37 : str = prim::Constant[value="none"]() %self.self_blocks_0_mlp_fc1.bias.7 : Float(512, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_4_mlp_fc1.weight.7 : Float(512, 128, strides=[128, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %50 : bool = prim::Constant[value=1]() %49 : float = prim::Constant[value=9.9999999999999995e-07]() %self.self_blocks_0_norm1.weight.8 : Float(128, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %46 : int[] = prim::Constant[value=[128]]() %self.self_blocks_0_norm1.bias.10 : Float(128, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_4_attn_proj.weight.10 : Float(128, 128, strides=[128, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %69 : int[] = prim::Constant[value=[1, 122, 128]]() %73 : int = prim::Constant[value=1]() %78 : int[] = prim::Constant[value=[0, 2, 1, 3]]() %79 : Float(1, 122, 8, 16, strides=[15616, 16, 1952, 1], requires_grad=0, device=cpu) = aten::permute(%scaled_dot_product_attention_4, %78) %input.154 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::reshape(%79, %69) # .1:141:0 %input.158 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::linear(%input.154, %self.self_blocks_4_attn_proj.weight.10, %self.self_blocks_0_norm1.bias.10), scope: __module.self_blocks_4_attn_proj # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 %input.162 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::add(%input.146, %input.158, %73) # .1:146:0 %input.166 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, 
device=cpu) = aten::layer_norm(%input.162, %46, %self.self_blocks_0_norm1.weight.8, %self.self_blocks_0_norm1.bias.10, %49, %50), scope: __module.self_blocks_4_norm2 # /usr/local/lib/python3.10/dist-packages/torch/nn/functional.py:2515:0 %input.170 : Float(1, 122, 512, strides=[62464, 512, 1], requires_grad=0, device=cpu) = aten::linear(%input.166, %self.self_blocks_4_mlp_fc1.weight.7, %self.self_blocks_0_mlp_fc1.bias.7), scope: __module.self_blocks_4_mlp_fc1 # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 %input.174 : Float(1, 122, 512, strides=[62464, 512, 1], requires_grad=0, device=cpu) = aten::gelu(%input.170, %37), scope: __module.self_blocks_4_mlp_act # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/activation.py:685:0 %input.178 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::linear(%input.174, %self.self_blocks_4_mlp_fc2.weight.5, %self.self_blocks_0_norm1.bias.10), scope: __module.self_blocks_4_mlp_fc2 # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 %input.182 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::add(%input.162, %input.178, %73) # .1:156:0 %input.186 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::layer_norm(%input.182, %46, %self.self_blocks_0_norm1.weight.8, %self.self_blocks_0_norm1.bias.10, %49, %50), scope: __module.self_blocks_5_norm1 # /usr/local/lib/python3.10/dist-packages/torch/nn/functional.py:2515:0 %self_blocks_5_attn_qkv.2 : Float(1, 122, 384, strides=[46848, 384, 1], requires_grad=0, device=cpu) = aten::linear(%input.186, %self.self_blocks_5_attn_qkv.weight.2, %self.self_blocks_0_attn_qkv.bias.2), scope: __module.self_blocks_5_attn_qkv # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 %reshape_10.1 : Float(1, 122, 3, 8, 16, strides=[46848, 384, 128, 16, 1], requires_grad=0, device=cpu) = aten::reshape(%self_blocks_5_attn_qkv.2, %4) # .1:159:0 %permute_5.1 : Float(3, 1, 8, 122, 16, strides=[128, 46848, 16, 384, 1], requires_grad=0, device=cpu) = aten::permute(%reshape_10.1, %1) # .1:160:0 return (%permute_5.1, %input.182) with prim::FallbackGraph_11 = graph(%0 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu), %scaled_dot_product_attention_4 : Float(1, 8, 122, 16, strides=[15616, 1952, 16, 1], requires_grad=0, device=cpu)): %2 : int[] = prim::Constant[value=[2, 0, 3, 1, 4]]() %3 : int[] = prim::Constant[value=[1, 122, 3, 8, 16]]() %self.self_blocks_0_attn_qkv.bias.2 : Float(384, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_5_attn_qkv.weight.2 : Float(384, 128, strides=[128, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_4_mlp_fc2.weight.5 : Float(128, 512, strides=[512, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %7 : str = prim::Constant[value="none"]() %self.self_blocks_0_mlp_fc1.bias.7 : Float(512, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_4_mlp_fc1.weight.7 : Float(512, 128, strides=[128, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %10 : bool = prim::Constant[value=1]() %11 : float = prim::Constant[value=9.9999999999999995e-07]() %self.self_blocks_0_norm1.weight.8 : Float(128, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %13 : int[] = prim::Constant[value=[128]]() %self.self_blocks_0_norm1.bias.10 : Float(128, strides=[1], requires_grad=0, device=cpu) = 
prim::Constant[value=]() %self.self_blocks_4_attn_proj.weight.10 : Float(128, 128, strides=[128, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %16 : int[] = prim::Constant[value=[1, 122, 128]]() %17 : int = prim::Constant[value=2]() %18 : int = prim::Constant[value=1]() %transpose_5.1 : Tensor = aten::transpose(%scaled_dot_product_attention_4, %18, %17) # .1:140:0 %input.154 : Tensor = aten::reshape(%transpose_5.1, %16) # .1:141:0 %input.158 : Tensor = aten::linear(%input.154, %self.self_blocks_4_attn_proj.weight.10, %self.self_blocks_0_norm1.bias.10), scope: __module.self_blocks_4_attn_proj # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 %input.162 : Tensor = aten::add(%0, %input.158, %18) # .1:146:0 %input.166 : Tensor = aten::layer_norm(%input.162, %13, %self.self_blocks_0_norm1.weight.8, %self.self_blocks_0_norm1.bias.10, %11, %10), scope: __module.self_blocks_4_norm2 # /usr/local/lib/python3.10/dist-packages/torch/nn/functional.py:2515:0 %input.170 : Tensor = aten::linear(%input.166, %self.self_blocks_4_mlp_fc1.weight.7, %self.self_blocks_0_mlp_fc1.bias.7), scope: __module.self_blocks_4_mlp_fc1 # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 %input.174 : Tensor = aten::gelu(%input.170, %7), scope: __module.self_blocks_4_mlp_act # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/activation.py:685:0 %input.178 : Tensor = aten::linear(%input.174, %self.self_blocks_4_mlp_fc2.weight.5, %self.self_blocks_0_norm1.bias.10), scope: __module.self_blocks_4_mlp_fc2 # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 %input.182 : Tensor = aten::add(%input.162, %input.178, %18) # .1:156:0 %input.186 : Tensor = aten::layer_norm(%input.182, %13, %self.self_blocks_0_norm1.weight.8, %self.self_blocks_0_norm1.bias.10, %11, %10), scope: __module.self_blocks_5_norm1 # /usr/local/lib/python3.10/dist-packages/torch/nn/functional.py:2515:0 %self_blocks_5_attn_qkv.2 : Tensor = aten::linear(%input.186, %self.self_blocks_5_attn_qkv.weight.2, %self.self_blocks_0_attn_qkv.bias.2), scope: __module.self_blocks_5_attn_qkv # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 %reshape_10.1 : Tensor = aten::reshape(%self_blocks_5_attn_qkv.2, %3) # .1:159:0 %permute_5.1 : Tensor = aten::permute(%reshape_10.1, %2) # .1:160:0 return (%permute_5.1, %input.182) with prim::AIOFusionGroup_12 = graph(%input.182 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu), %scaled_dot_product_attention_5 : Float(1, 8, 122, 16, strides=[15616, 1952, 16, 1], requires_grad=0, device=cpu)): %1 : int[] = prim::Constant[value=[2, 0, 3, 1, 4]]() %4 : int[] = prim::Constant[value=[1, 122, 3, 8, 16]]() %self.self_blocks_0_attn_qkv.bias.2 : Float(384, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_6_attn_qkv.weight.2 : Float(384, 128, strides=[128, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_5_mlp_fc2.weight.5 : Float(128, 512, strides=[512, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %37 : str = prim::Constant[value="none"]() %self.self_blocks_0_mlp_fc1.bias.7 : Float(512, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_5_mlp_fc1.weight.7 : Float(512, 128, strides=[128, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %50 : bool = prim::Constant[value=1]() %49 : float = prim::Constant[value=9.9999999999999995e-07]() %self.self_blocks_0_norm1.weight.8 : Float(128, 
strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %46 : int[] = prim::Constant[value=[128]]() %self.self_blocks_0_norm1.bias.10 : Float(128, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_5_attn_proj.weight.10 : Float(128, 128, strides=[128, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %69 : int[] = prim::Constant[value=[1, 122, 128]]() %73 : int = prim::Constant[value=1]() %78 : int[] = prim::Constant[value=[0, 2, 1, 3]]() %79 : Float(1, 122, 8, 16, strides=[15616, 16, 1952, 1], requires_grad=0, device=cpu) = aten::permute(%scaled_dot_product_attention_5, %78) %input.190 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::reshape(%79, %69) # .1:169:0 %input.194 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::linear(%input.190, %self.self_blocks_5_attn_proj.weight.10, %self.self_blocks_0_norm1.bias.10), scope: __module.self_blocks_5_attn_proj # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 %input.198 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::add(%input.182, %input.194, %73) # .1:174:0 %input.202 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::layer_norm(%input.198, %46, %self.self_blocks_0_norm1.weight.8, %self.self_blocks_0_norm1.bias.10, %49, %50), scope: __module.self_blocks_5_norm2 # /usr/local/lib/python3.10/dist-packages/torch/nn/functional.py:2515:0 %input.206 : Float(1, 122, 512, strides=[62464, 512, 1], requires_grad=0, device=cpu) = aten::linear(%input.202, %self.self_blocks_5_mlp_fc1.weight.7, %self.self_blocks_0_mlp_fc1.bias.7), scope: __module.self_blocks_5_mlp_fc1 # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 %input.210 : Float(1, 122, 512, strides=[62464, 512, 1], requires_grad=0, device=cpu) = aten::gelu(%input.206, %37), scope: __module.self_blocks_5_mlp_act # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/activation.py:685:0 %input.214 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::linear(%input.210, %self.self_blocks_5_mlp_fc2.weight.5, %self.self_blocks_0_norm1.bias.10), scope: __module.self_blocks_5_mlp_fc2 # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 %input.218 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::add(%input.198, %input.214, %73) # .1:184:0 %input.222 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::layer_norm(%input.218, %46, %self.self_blocks_0_norm1.weight.8, %self.self_blocks_0_norm1.bias.10, %49, %50), scope: __module.self_blocks_6_norm1 # /usr/local/lib/python3.10/dist-packages/torch/nn/functional.py:2515:0 %self_blocks_6_attn_qkv.2 : Float(1, 122, 384, strides=[46848, 384, 1], requires_grad=0, device=cpu) = aten::linear(%input.222, %self.self_blocks_6_attn_qkv.weight.2, %self.self_blocks_0_attn_qkv.bias.2), scope: __module.self_blocks_6_attn_qkv # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 %reshape_12.1 : Float(1, 122, 3, 8, 16, strides=[46848, 384, 128, 16, 1], requires_grad=0, device=cpu) = aten::reshape(%self_blocks_6_attn_qkv.2, %4) # .1:187:0 %permute_6.1 : Float(3, 1, 8, 122, 16, strides=[128, 46848, 16, 384, 1], requires_grad=0, device=cpu) = aten::permute(%reshape_12.1, %1) # .1:188:0 return (%permute_6.1, %input.218) with prim::FallbackGraph_13 = graph(%0 : Float(1, 122, 128, strides=[15616, 128, 1], 
requires_grad=0, device=cpu), %scaled_dot_product_attention_5 : Float(1, 8, 122, 16, strides=[15616, 1952, 16, 1], requires_grad=0, device=cpu)): %2 : int[] = prim::Constant[value=[2, 0, 3, 1, 4]]() %3 : int[] = prim::Constant[value=[1, 122, 3, 8, 16]]() %self.self_blocks_0_attn_qkv.bias.2 : Float(384, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_6_attn_qkv.weight.2 : Float(384, 128, strides=[128, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_5_mlp_fc2.weight.5 : Float(128, 512, strides=[512, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %7 : str = prim::Constant[value="none"]() %self.self_blocks_0_mlp_fc1.bias.7 : Float(512, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_5_mlp_fc1.weight.7 : Float(512, 128, strides=[128, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %10 : bool = prim::Constant[value=1]() %11 : float = prim::Constant[value=9.9999999999999995e-07]() %self.self_blocks_0_norm1.weight.8 : Float(128, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %13 : int[] = prim::Constant[value=[128]]() %self.self_blocks_0_norm1.bias.10 : Float(128, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_5_attn_proj.weight.10 : Float(128, 128, strides=[128, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %16 : int[] = prim::Constant[value=[1, 122, 128]]() %17 : int = prim::Constant[value=2]() %18 : int = prim::Constant[value=1]() %transpose_6.1 : Tensor = aten::transpose(%scaled_dot_product_attention_5, %18, %17) # .1:168:0 %input.190 : Tensor = aten::reshape(%transpose_6.1, %16) # .1:169:0 %input.194 : Tensor = aten::linear(%input.190, %self.self_blocks_5_attn_proj.weight.10, %self.self_blocks_0_norm1.bias.10), scope: __module.self_blocks_5_attn_proj # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 %input.198 : Tensor = aten::add(%0, %input.194, %18) # .1:174:0 %input.202 : Tensor = aten::layer_norm(%input.198, %13, %self.self_blocks_0_norm1.weight.8, %self.self_blocks_0_norm1.bias.10, %11, %10), scope: __module.self_blocks_5_norm2 # /usr/local/lib/python3.10/dist-packages/torch/nn/functional.py:2515:0 %input.206 : Tensor = aten::linear(%input.202, %self.self_blocks_5_mlp_fc1.weight.7, %self.self_blocks_0_mlp_fc1.bias.7), scope: __module.self_blocks_5_mlp_fc1 # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 %input.210 : Tensor = aten::gelu(%input.206, %7), scope: __module.self_blocks_5_mlp_act # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/activation.py:685:0 %input.214 : Tensor = aten::linear(%input.210, %self.self_blocks_5_mlp_fc2.weight.5, %self.self_blocks_0_norm1.bias.10), scope: __module.self_blocks_5_mlp_fc2 # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 %input.218 : Tensor = aten::add(%input.198, %input.214, %18) # .1:184:0 %input.222 : Tensor = aten::layer_norm(%input.218, %13, %self.self_blocks_0_norm1.weight.8, %self.self_blocks_0_norm1.bias.10, %11, %10), scope: __module.self_blocks_6_norm1 # /usr/local/lib/python3.10/dist-packages/torch/nn/functional.py:2515:0 %self_blocks_6_attn_qkv.2 : Tensor = aten::linear(%input.222, %self.self_blocks_6_attn_qkv.weight.2, %self.self_blocks_0_attn_qkv.bias.2), scope: __module.self_blocks_6_attn_qkv # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 %reshape_12.1 : Tensor = aten::reshape(%self_blocks_6_attn_qkv.2, %3) # 
.1:187:0 %permute_6.1 : Tensor = aten::permute(%reshape_12.1, %2) # .1:188:0 return (%permute_6.1, %input.218) with prim::AIOFusionGroup_14 = graph(%input.218 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu), %scaled_dot_product_attention_6 : Float(1, 8, 122, 16, strides=[15616, 1952, 16, 1], requires_grad=0, device=cpu)): %1 : int[] = prim::Constant[value=[2, 0, 3, 1, 4]]() %4 : int[] = prim::Constant[value=[1, 122, 3, 8, 16]]() %self.self_blocks_0_attn_qkv.bias.2 : Float(384, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_7_attn_qkv.weight.2 : Float(384, 128, strides=[128, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_6_mlp_fc2.weight.5 : Float(128, 512, strides=[512, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %37 : str = prim::Constant[value="none"]() %self.self_blocks_0_mlp_fc1.bias.7 : Float(512, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_6_mlp_fc1.weight.7 : Float(512, 128, strides=[128, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %50 : bool = prim::Constant[value=1]() %49 : float = prim::Constant[value=9.9999999999999995e-07]() %self.self_blocks_0_norm1.weight.8 : Float(128, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %46 : int[] = prim::Constant[value=[128]]() %self.self_blocks_0_norm1.bias.10 : Float(128, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_6_attn_proj.weight.10 : Float(128, 128, strides=[128, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %69 : int[] = prim::Constant[value=[1, 122, 128]]() %73 : int = prim::Constant[value=1]() %78 : int[] = prim::Constant[value=[0, 2, 1, 3]]() %79 : Float(1, 122, 8, 16, strides=[15616, 16, 1952, 1], requires_grad=0, device=cpu) = aten::permute(%scaled_dot_product_attention_6, %78) %input.226 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::reshape(%79, %69) # .1:197:0 %input.230 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::linear(%input.226, %self.self_blocks_6_attn_proj.weight.10, %self.self_blocks_0_norm1.bias.10), scope: __module.self_blocks_6_attn_proj # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 %input.234 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::add(%input.218, %input.230, %73) # .1:202:0 %input.238 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::layer_norm(%input.234, %46, %self.self_blocks_0_norm1.weight.8, %self.self_blocks_0_norm1.bias.10, %49, %50), scope: __module.self_blocks_6_norm2 # /usr/local/lib/python3.10/dist-packages/torch/nn/functional.py:2515:0 %input.242 : Float(1, 122, 512, strides=[62464, 512, 1], requires_grad=0, device=cpu) = aten::linear(%input.238, %self.self_blocks_6_mlp_fc1.weight.7, %self.self_blocks_0_mlp_fc1.bias.7), scope: __module.self_blocks_6_mlp_fc1 # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 %input.246 : Float(1, 122, 512, strides=[62464, 512, 1], requires_grad=0, device=cpu) = aten::gelu(%input.242, %37), scope: __module.self_blocks_6_mlp_act # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/activation.py:685:0 %input.250 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::linear(%input.246, %self.self_blocks_6_mlp_fc2.weight.5, %self.self_blocks_0_norm1.bias.10), scope: __module.self_blocks_6_mlp_fc2 
# /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 %input.254 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::add(%input.234, %input.250, %73) # .1:212:0 %input.258 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::layer_norm(%input.254, %46, %self.self_blocks_0_norm1.weight.8, %self.self_blocks_0_norm1.bias.10, %49, %50), scope: __module.self_blocks_7_norm1 # /usr/local/lib/python3.10/dist-packages/torch/nn/functional.py:2515:0 %self_blocks_7_attn_qkv.2 : Float(1, 122, 384, strides=[46848, 384, 1], requires_grad=0, device=cpu) = aten::linear(%input.258, %self.self_blocks_7_attn_qkv.weight.2, %self.self_blocks_0_attn_qkv.bias.2), scope: __module.self_blocks_7_attn_qkv # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 %reshape_14.1 : Float(1, 122, 3, 8, 16, strides=[46848, 384, 128, 16, 1], requires_grad=0, device=cpu) = aten::reshape(%self_blocks_7_attn_qkv.2, %4) # .1:215:0 %permute_7.1 : Float(3, 1, 8, 122, 16, strides=[128, 46848, 16, 384, 1], requires_grad=0, device=cpu) = aten::permute(%reshape_14.1, %1) # .1:216:0 return (%permute_7.1, %input.254) with prim::FallbackGraph_15 = graph(%0 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu), %scaled_dot_product_attention_6 : Float(1, 8, 122, 16, strides=[15616, 1952, 16, 1], requires_grad=0, device=cpu)): %2 : int[] = prim::Constant[value=[2, 0, 3, 1, 4]]() %3 : int[] = prim::Constant[value=[1, 122, 3, 8, 16]]() %self.self_blocks_0_attn_qkv.bias.2 : Float(384, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_7_attn_qkv.weight.2 : Float(384, 128, strides=[128, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_6_mlp_fc2.weight.5 : Float(128, 512, strides=[512, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %7 : str = prim::Constant[value="none"]() %self.self_blocks_0_mlp_fc1.bias.7 : Float(512, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_6_mlp_fc1.weight.7 : Float(512, 128, strides=[128, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %10 : bool = prim::Constant[value=1]() %11 : float = prim::Constant[value=9.9999999999999995e-07]() %self.self_blocks_0_norm1.weight.8 : Float(128, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %13 : int[] = prim::Constant[value=[128]]() %self.self_blocks_0_norm1.bias.10 : Float(128, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_6_attn_proj.weight.10 : Float(128, 128, strides=[128, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %16 : int[] = prim::Constant[value=[1, 122, 128]]() %17 : int = prim::Constant[value=2]() %18 : int = prim::Constant[value=1]() %transpose_7.1 : Tensor = aten::transpose(%scaled_dot_product_attention_6, %18, %17) # .1:196:0 %input.226 : Tensor = aten::reshape(%transpose_7.1, %16) # .1:197:0 %input.230 : Tensor = aten::linear(%input.226, %self.self_blocks_6_attn_proj.weight.10, %self.self_blocks_0_norm1.bias.10), scope: __module.self_blocks_6_attn_proj # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 %input.234 : Tensor = aten::add(%0, %input.230, %18) # .1:202:0 %input.238 : Tensor = aten::layer_norm(%input.234, %13, %self.self_blocks_0_norm1.weight.8, %self.self_blocks_0_norm1.bias.10, %11, %10), scope: __module.self_blocks_6_norm2 # /usr/local/lib/python3.10/dist-packages/torch/nn/functional.py:2515:0 %input.242 : 
Tensor = aten::linear(%input.238, %self.self_blocks_6_mlp_fc1.weight.7, %self.self_blocks_0_mlp_fc1.bias.7), scope: __module.self_blocks_6_mlp_fc1 # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 %input.246 : Tensor = aten::gelu(%input.242, %7), scope: __module.self_blocks_6_mlp_act # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/activation.py:685:0 %input.250 : Tensor = aten::linear(%input.246, %self.self_blocks_6_mlp_fc2.weight.5, %self.self_blocks_0_norm1.bias.10), scope: __module.self_blocks_6_mlp_fc2 # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 %input.254 : Tensor = aten::add(%input.234, %input.250, %18) # .1:212:0 %input.258 : Tensor = aten::layer_norm(%input.254, %13, %self.self_blocks_0_norm1.weight.8, %self.self_blocks_0_norm1.bias.10, %11, %10), scope: __module.self_blocks_7_norm1 # /usr/local/lib/python3.10/dist-packages/torch/nn/functional.py:2515:0 %self_blocks_7_attn_qkv.2 : Tensor = aten::linear(%input.258, %self.self_blocks_7_attn_qkv.weight.2, %self.self_blocks_0_attn_qkv.bias.2), scope: __module.self_blocks_7_attn_qkv # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 %reshape_14.1 : Tensor = aten::reshape(%self_blocks_7_attn_qkv.2, %3) # .1:215:0 %permute_7.1 : Tensor = aten::permute(%reshape_14.1, %2) # .1:216:0 return (%permute_7.1, %input.254) with prim::AIOFusionGroup_16 = graph(%input.254 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu), %scaled_dot_product_attention_7 : Float(1, 8, 122, 16, strides=[15616, 1952, 16, 1], requires_grad=0, device=cpu)): %1 : int[] = prim::Constant[value=[2, 0, 3, 1, 4]]() %4 : int[] = prim::Constant[value=[1, 122, 3, 8, 16]]() %self.self_blocks_0_attn_qkv.bias.2 : Float(384, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_8_attn_qkv.weight.2 : Float(384, 128, strides=[128, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_7_mlp_fc2.weight.5 : Float(128, 512, strides=[512, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %37 : str = prim::Constant[value="none"]() %self.self_blocks_0_mlp_fc1.bias.7 : Float(512, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_7_mlp_fc1.weight.7 : Float(512, 128, strides=[128, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %50 : bool = prim::Constant[value=1]() %49 : float = prim::Constant[value=9.9999999999999995e-07]() %self.self_blocks_0_norm1.weight.8 : Float(128, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %46 : int[] = prim::Constant[value=[128]]() %self.self_blocks_0_norm1.bias.10 : Float(128, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_7_attn_proj.weight.10 : Float(128, 128, strides=[128, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %69 : int[] = prim::Constant[value=[1, 122, 128]]() %73 : int = prim::Constant[value=1]() %78 : int[] = prim::Constant[value=[0, 2, 1, 3]]() %79 : Float(1, 122, 8, 16, strides=[15616, 16, 1952, 1], requires_grad=0, device=cpu) = aten::permute(%scaled_dot_product_attention_7, %78) %input.262 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::reshape(%79, %69) # .1:225:0 %input.266 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::linear(%input.262, %self.self_blocks_7_attn_proj.weight.10, %self.self_blocks_0_norm1.bias.10), scope: __module.self_blocks_7_attn_proj # 
/usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 %input.270 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::add(%input.254, %input.266, %73) # .1:230:0 %input.274 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::layer_norm(%input.270, %46, %self.self_blocks_0_norm1.weight.8, %self.self_blocks_0_norm1.bias.10, %49, %50), scope: __module.self_blocks_7_norm2 # /usr/local/lib/python3.10/dist-packages/torch/nn/functional.py:2515:0 %input.278 : Float(1, 122, 512, strides=[62464, 512, 1], requires_grad=0, device=cpu) = aten::linear(%input.274, %self.self_blocks_7_mlp_fc1.weight.7, %self.self_blocks_0_mlp_fc1.bias.7), scope: __module.self_blocks_7_mlp_fc1 # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 %input.282 : Float(1, 122, 512, strides=[62464, 512, 1], requires_grad=0, device=cpu) = aten::gelu(%input.278, %37), scope: __module.self_blocks_7_mlp_act # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/activation.py:685:0 %input.286 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::linear(%input.282, %self.self_blocks_7_mlp_fc2.weight.5, %self.self_blocks_0_norm1.bias.10), scope: __module.self_blocks_7_mlp_fc2 # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 %input.290 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::add(%input.270, %input.286, %73) # .1:240:0 %input.294 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::layer_norm(%input.290, %46, %self.self_blocks_0_norm1.weight.8, %self.self_blocks_0_norm1.bias.10, %49, %50), scope: __module.self_blocks_8_norm1 # /usr/local/lib/python3.10/dist-packages/torch/nn/functional.py:2515:0 %self_blocks_8_attn_qkv.2 : Float(1, 122, 384, strides=[46848, 384, 1], requires_grad=0, device=cpu) = aten::linear(%input.294, %self.self_blocks_8_attn_qkv.weight.2, %self.self_blocks_0_attn_qkv.bias.2), scope: __module.self_blocks_8_attn_qkv # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 %reshape_16.1 : Float(1, 122, 3, 8, 16, strides=[46848, 384, 128, 16, 1], requires_grad=0, device=cpu) = aten::reshape(%self_blocks_8_attn_qkv.2, %4) # .1:243:0 %permute_8.1 : Float(3, 1, 8, 122, 16, strides=[128, 46848, 16, 384, 1], requires_grad=0, device=cpu) = aten::permute(%reshape_16.1, %1) # .1:244:0 return (%permute_8.1, %input.290) with prim::FallbackGraph_17 = graph(%0 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu), %scaled_dot_product_attention_7 : Float(1, 8, 122, 16, strides=[15616, 1952, 16, 1], requires_grad=0, device=cpu)): %2 : int[] = prim::Constant[value=[2, 0, 3, 1, 4]]() %3 : int[] = prim::Constant[value=[1, 122, 3, 8, 16]]() %self.self_blocks_0_attn_qkv.bias.2 : Float(384, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_8_attn_qkv.weight.2 : Float(384, 128, strides=[128, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_7_mlp_fc2.weight.5 : Float(128, 512, strides=[512, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %7 : str = prim::Constant[value="none"]() %self.self_blocks_0_mlp_fc1.bias.7 : Float(512, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_7_mlp_fc1.weight.7 : Float(512, 128, strides=[128, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %10 : bool = prim::Constant[value=1]() %11 : float = 
prim::Constant[value=9.9999999999999995e-07]() %self.self_blocks_0_norm1.weight.8 : Float(128, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %13 : int[] = prim::Constant[value=[128]]() %self.self_blocks_0_norm1.bias.10 : Float(128, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_7_attn_proj.weight.10 : Float(128, 128, strides=[128, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %16 : int[] = prim::Constant[value=[1, 122, 128]]() %17 : int = prim::Constant[value=2]() %18 : int = prim::Constant[value=1]() %transpose_8.1 : Tensor = aten::transpose(%scaled_dot_product_attention_7, %18, %17) # .1:224:0 %input.262 : Tensor = aten::reshape(%transpose_8.1, %16) # .1:225:0 %input.266 : Tensor = aten::linear(%input.262, %self.self_blocks_7_attn_proj.weight.10, %self.self_blocks_0_norm1.bias.10), scope: __module.self_blocks_7_attn_proj # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 %input.270 : Tensor = aten::add(%0, %input.266, %18) # .1:230:0 %input.274 : Tensor = aten::layer_norm(%input.270, %13, %self.self_blocks_0_norm1.weight.8, %self.self_blocks_0_norm1.bias.10, %11, %10), scope: __module.self_blocks_7_norm2 # /usr/local/lib/python3.10/dist-packages/torch/nn/functional.py:2515:0 %input.278 : Tensor = aten::linear(%input.274, %self.self_blocks_7_mlp_fc1.weight.7, %self.self_blocks_0_mlp_fc1.bias.7), scope: __module.self_blocks_7_mlp_fc1 # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 %input.282 : Tensor = aten::gelu(%input.278, %7), scope: __module.self_blocks_7_mlp_act # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/activation.py:685:0 %input.286 : Tensor = aten::linear(%input.282, %self.self_blocks_7_mlp_fc2.weight.5, %self.self_blocks_0_norm1.bias.10), scope: __module.self_blocks_7_mlp_fc2 # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 %input.290 : Tensor = aten::add(%input.270, %input.286, %18) # .1:240:0 %input.294 : Tensor = aten::layer_norm(%input.290, %13, %self.self_blocks_0_norm1.weight.8, %self.self_blocks_0_norm1.bias.10, %11, %10), scope: __module.self_blocks_8_norm1 # /usr/local/lib/python3.10/dist-packages/torch/nn/functional.py:2515:0 %self_blocks_8_attn_qkv.2 : Tensor = aten::linear(%input.294, %self.self_blocks_8_attn_qkv.weight.2, %self.self_blocks_0_attn_qkv.bias.2), scope: __module.self_blocks_8_attn_qkv # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 %reshape_16.1 : Tensor = aten::reshape(%self_blocks_8_attn_qkv.2, %3) # .1:243:0 %permute_8.1 : Tensor = aten::permute(%reshape_16.1, %2) # .1:244:0 return (%permute_8.1, %input.290) with prim::AIOFusionGroup_18 = graph(%input.290 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu), %scaled_dot_product_attention_8 : Float(1, 8, 122, 16, strides=[15616, 1952, 16, 1], requires_grad=0, device=cpu)): %1 : int[] = prim::Constant[value=[2, 0, 3, 1, 4]]() %4 : int[] = prim::Constant[value=[1, 122, 3, 8, 16]]() %self.self_blocks_0_attn_qkv.bias.2 : Float(384, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_9_attn_qkv.weight.2 : Float(384, 128, strides=[128, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_8_mlp_fc2.weight.5 : Float(128, 512, strides=[512, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %37 : str = prim::Constant[value="none"]() %self.self_blocks_0_mlp_fc1.bias.7 : Float(512, strides=[1], requires_grad=0, device=cpu) = 
prim::Constant[value=]() %self.self_blocks_8_mlp_fc1.weight.7 : Float(512, 128, strides=[128, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %50 : bool = prim::Constant[value=1]() %49 : float = prim::Constant[value=9.9999999999999995e-07]() %self.self_blocks_0_norm1.weight.8 : Float(128, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %46 : int[] = prim::Constant[value=[128]]() %self.self_blocks_0_norm1.bias.10 : Float(128, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_8_attn_proj.weight.10 : Float(128, 128, strides=[128, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %69 : int[] = prim::Constant[value=[1, 122, 128]]() %73 : int = prim::Constant[value=1]() %78 : int[] = prim::Constant[value=[0, 2, 1, 3]]() %79 : Float(1, 122, 8, 16, strides=[15616, 16, 1952, 1], requires_grad=0, device=cpu) = aten::permute(%scaled_dot_product_attention_8, %78) %input.298 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::reshape(%79, %69) # .1:253:0 %input.302 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::linear(%input.298, %self.self_blocks_8_attn_proj.weight.10, %self.self_blocks_0_norm1.bias.10), scope: __module.self_blocks_8_attn_proj # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 %input.306 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::add(%input.290, %input.302, %73) # .1:258:0 %input.310 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::layer_norm(%input.306, %46, %self.self_blocks_0_norm1.weight.8, %self.self_blocks_0_norm1.bias.10, %49, %50), scope: __module.self_blocks_8_norm2 # /usr/local/lib/python3.10/dist-packages/torch/nn/functional.py:2515:0 %input.314 : Float(1, 122, 512, strides=[62464, 512, 1], requires_grad=0, device=cpu) = aten::linear(%input.310, %self.self_blocks_8_mlp_fc1.weight.7, %self.self_blocks_0_mlp_fc1.bias.7), scope: __module.self_blocks_8_mlp_fc1 # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 %input.318 : Float(1, 122, 512, strides=[62464, 512, 1], requires_grad=0, device=cpu) = aten::gelu(%input.314, %37), scope: __module.self_blocks_8_mlp_act # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/activation.py:685:0 %input.322 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::linear(%input.318, %self.self_blocks_8_mlp_fc2.weight.5, %self.self_blocks_0_norm1.bias.10), scope: __module.self_blocks_8_mlp_fc2 # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 %input.326 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::add(%input.306, %input.322, %73) # .1:268:0 %input.330 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::layer_norm(%input.326, %46, %self.self_blocks_0_norm1.weight.8, %self.self_blocks_0_norm1.bias.10, %49, %50), scope: __module.self_blocks_9_norm1 # /usr/local/lib/python3.10/dist-packages/torch/nn/functional.py:2515:0 %self_blocks_9_attn_qkv.2 : Float(1, 122, 384, strides=[46848, 384, 1], requires_grad=0, device=cpu) = aten::linear(%input.330, %self.self_blocks_9_attn_qkv.weight.2, %self.self_blocks_0_attn_qkv.bias.2), scope: __module.self_blocks_9_attn_qkv # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 %reshape_18.1 : Float(1, 122, 3, 8, 16, strides=[46848, 384, 128, 16, 1], requires_grad=0, device=cpu) = 
aten::reshape(%self_blocks_9_attn_qkv.2, %4) # .1:271:0 %permute_9.1 : Float(3, 1, 8, 122, 16, strides=[128, 46848, 16, 384, 1], requires_grad=0, device=cpu) = aten::permute(%reshape_18.1, %1) # .1:272:0 return (%permute_9.1, %input.326) with prim::FallbackGraph_19 = graph(%0 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu), %scaled_dot_product_attention_8 : Float(1, 8, 122, 16, strides=[15616, 1952, 16, 1], requires_grad=0, device=cpu)): %2 : int[] = prim::Constant[value=[2, 0, 3, 1, 4]]() %3 : int[] = prim::Constant[value=[1, 122, 3, 8, 16]]() %self.self_blocks_0_attn_qkv.bias.2 : Float(384, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_9_attn_qkv.weight.2 : Float(384, 128, strides=[128, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_8_mlp_fc2.weight.5 : Float(128, 512, strides=[512, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %7 : str = prim::Constant[value="none"]() %self.self_blocks_0_mlp_fc1.bias.7 : Float(512, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_8_mlp_fc1.weight.7 : Float(512, 128, strides=[128, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %10 : bool = prim::Constant[value=1]() %11 : float = prim::Constant[value=9.9999999999999995e-07]() %self.self_blocks_0_norm1.weight.8 : Float(128, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %13 : int[] = prim::Constant[value=[128]]() %self.self_blocks_0_norm1.bias.10 : Float(128, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_8_attn_proj.weight.10 : Float(128, 128, strides=[128, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %16 : int[] = prim::Constant[value=[1, 122, 128]]() %17 : int = prim::Constant[value=2]() %18 : int = prim::Constant[value=1]() %transpose_9.1 : Tensor = aten::transpose(%scaled_dot_product_attention_8, %18, %17) # .1:252:0 %input.298 : Tensor = aten::reshape(%transpose_9.1, %16) # .1:253:0 %input.302 : Tensor = aten::linear(%input.298, %self.self_blocks_8_attn_proj.weight.10, %self.self_blocks_0_norm1.bias.10), scope: __module.self_blocks_8_attn_proj # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 %input.306 : Tensor = aten::add(%0, %input.302, %18) # .1:258:0 %input.310 : Tensor = aten::layer_norm(%input.306, %13, %self.self_blocks_0_norm1.weight.8, %self.self_blocks_0_norm1.bias.10, %11, %10), scope: __module.self_blocks_8_norm2 # /usr/local/lib/python3.10/dist-packages/torch/nn/functional.py:2515:0 %input.314 : Tensor = aten::linear(%input.310, %self.self_blocks_8_mlp_fc1.weight.7, %self.self_blocks_0_mlp_fc1.bias.7), scope: __module.self_blocks_8_mlp_fc1 # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 %input.318 : Tensor = aten::gelu(%input.314, %7), scope: __module.self_blocks_8_mlp_act # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/activation.py:685:0 %input.322 : Tensor = aten::linear(%input.318, %self.self_blocks_8_mlp_fc2.weight.5, %self.self_blocks_0_norm1.bias.10), scope: __module.self_blocks_8_mlp_fc2 # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 %input.326 : Tensor = aten::add(%input.306, %input.322, %18) # .1:268:0 %input.330 : Tensor = aten::layer_norm(%input.326, %13, %self.self_blocks_0_norm1.weight.8, %self.self_blocks_0_norm1.bias.10, %11, %10), scope: __module.self_blocks_9_norm1 # /usr/local/lib/python3.10/dist-packages/torch/nn/functional.py:2515:0 
%self_blocks_9_attn_qkv.2 : Tensor = aten::linear(%input.330, %self.self_blocks_9_attn_qkv.weight.2, %self.self_blocks_0_attn_qkv.bias.2), scope: __module.self_blocks_9_attn_qkv # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 %reshape_18.1 : Tensor = aten::reshape(%self_blocks_9_attn_qkv.2, %3) # .1:271:0 %permute_9.1 : Tensor = aten::permute(%reshape_18.1, %2) # .1:272:0 return (%permute_9.1, %input.326) with prim::AIOFusionGroup_20 = graph(%input.326 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu), %scaled_dot_product_attention_9 : Float(1, 8, 122, 16, strides=[15616, 1952, 16, 1], requires_grad=0, device=cpu)): %1 : int[] = prim::Constant[value=[2, 0, 3, 1, 4]]() %4 : int[] = prim::Constant[value=[1, 122, 3, 8, 16]]() %self.self_blocks_0_attn_qkv.bias.2 : Float(384, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_10_attn_qkv.weight.2 : Float(384, 128, strides=[128, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_9_mlp_fc2.weight.5 : Float(128, 512, strides=[512, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %37 : str = prim::Constant[value="none"]() %self.self_blocks_0_mlp_fc1.bias.7 : Float(512, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_9_mlp_fc1.weight.7 : Float(512, 128, strides=[128, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %50 : bool = prim::Constant[value=1]() %49 : float = prim::Constant[value=9.9999999999999995e-07]() %self.self_blocks_0_norm1.weight.8 : Float(128, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %46 : int[] = prim::Constant[value=[128]]() %self.self_blocks_0_norm1.bias.10 : Float(128, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_9_attn_proj.weight.10 : Float(128, 128, strides=[128, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %69 : int[] = prim::Constant[value=[1, 122, 128]]() %73 : int = prim::Constant[value=1]() %78 : int[] = prim::Constant[value=[0, 2, 1, 3]]() %79 : Float(1, 122, 8, 16, strides=[15616, 16, 1952, 1], requires_grad=0, device=cpu) = aten::permute(%scaled_dot_product_attention_9, %78) %input.334 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::reshape(%79, %69) # .1:281:0 %input.338 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::linear(%input.334, %self.self_blocks_9_attn_proj.weight.10, %self.self_blocks_0_norm1.bias.10), scope: __module.self_blocks_9_attn_proj # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 %input.342 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::add(%input.326, %input.338, %73) # .1:286:0 %input.346 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::layer_norm(%input.342, %46, %self.self_blocks_0_norm1.weight.8, %self.self_blocks_0_norm1.bias.10, %49, %50), scope: __module.self_blocks_9_norm2 # /usr/local/lib/python3.10/dist-packages/torch/nn/functional.py:2515:0 %input.350 : Float(1, 122, 512, strides=[62464, 512, 1], requires_grad=0, device=cpu) = aten::linear(%input.346, %self.self_blocks_9_mlp_fc1.weight.7, %self.self_blocks_0_mlp_fc1.bias.7), scope: __module.self_blocks_9_mlp_fc1 # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 %input.354 : Float(1, 122, 512, strides=[62464, 512, 1], requires_grad=0, device=cpu) = aten::gelu(%input.350, %37), scope: 
__module.self_blocks_9_mlp_act # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/activation.py:685:0 %input.358 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::linear(%input.354, %self.self_blocks_9_mlp_fc2.weight.5, %self.self_blocks_0_norm1.bias.10), scope: __module.self_blocks_9_mlp_fc2 # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 %input.362 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::add(%input.342, %input.358, %73) # .1:296:0 %input.366 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::layer_norm(%input.362, %46, %self.self_blocks_0_norm1.weight.8, %self.self_blocks_0_norm1.bias.10, %49, %50), scope: __module.self_blocks_10_norm1 # /usr/local/lib/python3.10/dist-packages/torch/nn/functional.py:2515:0 %self_blocks_10_attn_qkv.2 : Float(1, 122, 384, strides=[46848, 384, 1], requires_grad=0, device=cpu) = aten::linear(%input.366, %self.self_blocks_10_attn_qkv.weight.2, %self.self_blocks_0_attn_qkv.bias.2), scope: __module.self_blocks_10_attn_qkv # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 %reshape_20.1 : Float(1, 122, 3, 8, 16, strides=[46848, 384, 128, 16, 1], requires_grad=0, device=cpu) = aten::reshape(%self_blocks_10_attn_qkv.2, %4) # .1:299:0 %permute_10.1 : Float(3, 1, 8, 122, 16, strides=[128, 46848, 16, 384, 1], requires_grad=0, device=cpu) = aten::permute(%reshape_20.1, %1) # .1:300:0 return (%permute_10.1, %input.362) with prim::FallbackGraph_21 = graph(%0 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu), %scaled_dot_product_attention_9 : Float(1, 8, 122, 16, strides=[15616, 1952, 16, 1], requires_grad=0, device=cpu)): %2 : int[] = prim::Constant[value=[2, 0, 3, 1, 4]]() %3 : int[] = prim::Constant[value=[1, 122, 3, 8, 16]]() %self.self_blocks_0_attn_qkv.bias.2 : Float(384, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_10_attn_qkv.weight.2 : Float(384, 128, strides=[128, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_9_mlp_fc2.weight.5 : Float(128, 512, strides=[512, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %7 : str = prim::Constant[value="none"]() %self.self_blocks_0_mlp_fc1.bias.7 : Float(512, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_9_mlp_fc1.weight.7 : Float(512, 128, strides=[128, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %10 : bool = prim::Constant[value=1]() %11 : float = prim::Constant[value=9.9999999999999995e-07]() %self.self_blocks_0_norm1.weight.8 : Float(128, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %13 : int[] = prim::Constant[value=[128]]() %self.self_blocks_0_norm1.bias.10 : Float(128, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_9_attn_proj.weight.10 : Float(128, 128, strides=[128, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %16 : int[] = prim::Constant[value=[1, 122, 128]]() %17 : int = prim::Constant[value=2]() %18 : int = prim::Constant[value=1]() %transpose_10.1 : Tensor = aten::transpose(%scaled_dot_product_attention_9, %18, %17) # .1:280:0 %input.334 : Tensor = aten::reshape(%transpose_10.1, %16) # .1:281:0 %input.338 : Tensor = aten::linear(%input.334, %self.self_blocks_9_attn_proj.weight.10, %self.self_blocks_0_norm1.bias.10), scope: __module.self_blocks_9_attn_proj # 
/usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 %input.342 : Tensor = aten::add(%0, %input.338, %18) # .1:286:0 %input.346 : Tensor = aten::layer_norm(%input.342, %13, %self.self_blocks_0_norm1.weight.8, %self.self_blocks_0_norm1.bias.10, %11, %10), scope: __module.self_blocks_9_norm2 # /usr/local/lib/python3.10/dist-packages/torch/nn/functional.py:2515:0 %input.350 : Tensor = aten::linear(%input.346, %self.self_blocks_9_mlp_fc1.weight.7, %self.self_blocks_0_mlp_fc1.bias.7), scope: __module.self_blocks_9_mlp_fc1 # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 %input.354 : Tensor = aten::gelu(%input.350, %7), scope: __module.self_blocks_9_mlp_act # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/activation.py:685:0 %input.358 : Tensor = aten::linear(%input.354, %self.self_blocks_9_mlp_fc2.weight.5, %self.self_blocks_0_norm1.bias.10), scope: __module.self_blocks_9_mlp_fc2 # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 %input.362 : Tensor = aten::add(%input.342, %input.358, %18) # .1:296:0 %input.366 : Tensor = aten::layer_norm(%input.362, %13, %self.self_blocks_0_norm1.weight.8, %self.self_blocks_0_norm1.bias.10, %11, %10), scope: __module.self_blocks_10_norm1 # /usr/local/lib/python3.10/dist-packages/torch/nn/functional.py:2515:0 %self_blocks_10_attn_qkv.2 : Tensor = aten::linear(%input.366, %self.self_blocks_10_attn_qkv.weight.2, %self.self_blocks_0_attn_qkv.bias.2), scope: __module.self_blocks_10_attn_qkv # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 %reshape_20.1 : Tensor = aten::reshape(%self_blocks_10_attn_qkv.2, %3) # .1:299:0 %permute_10.1 : Tensor = aten::permute(%reshape_20.1, %2) # .1:300:0 return (%permute_10.1, %input.362) with prim::AIOFusionGroup_22 = graph(%input.362 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu), %scaled_dot_product_attention_10 : Float(1, 8, 122, 16, strides=[15616, 1952, 16, 1], requires_grad=0, device=cpu)): %1 : int[] = prim::Constant[value=[2, 0, 3, 1, 4]]() %4 : int[] = prim::Constant[value=[1, 122, 3, 8, 16]]() %self.self_blocks_0_attn_qkv.bias.2 : Float(384, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_11_attn_qkv.weight.2 : Float(384, 128, strides=[128, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_10_mlp_fc2.weight.5 : Float(128, 512, strides=[512, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %37 : str = prim::Constant[value="none"]() %self.self_blocks_0_mlp_fc1.bias.7 : Float(512, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_10_mlp_fc1.weight.7 : Float(512, 128, strides=[128, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %50 : bool = prim::Constant[value=1]() %49 : float = prim::Constant[value=9.9999999999999995e-07]() %self.self_blocks_0_norm1.weight.8 : Float(128, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %46 : int[] = prim::Constant[value=[128]]() %self.self_blocks_0_norm1.bias.10 : Float(128, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_10_attn_proj.weight.10 : Float(128, 128, strides=[128, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %69 : int[] = prim::Constant[value=[1, 122, 128]]() %73 : int = prim::Constant[value=1]() %78 : int[] = prim::Constant[value=[0, 2, 1, 3]]() %79 : Float(1, 122, 8, 16, strides=[15616, 16, 1952, 1], requires_grad=0, device=cpu) = 
aten::permute(%scaled_dot_product_attention_10, %78) %input.370 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::reshape(%79, %69) # .1:309:0 %input.374 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::linear(%input.370, %self.self_blocks_10_attn_proj.weight.10, %self.self_blocks_0_norm1.bias.10), scope: __module.self_blocks_10_attn_proj # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 %input.378 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::add(%input.362, %input.374, %73) # .1:314:0 %input.382 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::layer_norm(%input.378, %46, %self.self_blocks_0_norm1.weight.8, %self.self_blocks_0_norm1.bias.10, %49, %50), scope: __module.self_blocks_10_norm2 # /usr/local/lib/python3.10/dist-packages/torch/nn/functional.py:2515:0 %input.386 : Float(1, 122, 512, strides=[62464, 512, 1], requires_grad=0, device=cpu) = aten::linear(%input.382, %self.self_blocks_10_mlp_fc1.weight.7, %self.self_blocks_0_mlp_fc1.bias.7), scope: __module.self_blocks_10_mlp_fc1 # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 %input.390 : Float(1, 122, 512, strides=[62464, 512, 1], requires_grad=0, device=cpu) = aten::gelu(%input.386, %37), scope: __module.self_blocks_10_mlp_act # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/activation.py:685:0 %input.394 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::linear(%input.390, %self.self_blocks_10_mlp_fc2.weight.5, %self.self_blocks_0_norm1.bias.10), scope: __module.self_blocks_10_mlp_fc2 # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 %input.398 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::add(%input.378, %input.394, %73) # .1:324:0 %input.402 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::layer_norm(%input.398, %46, %self.self_blocks_0_norm1.weight.8, %self.self_blocks_0_norm1.bias.10, %49, %50), scope: __module.self_blocks_11_norm1 # /usr/local/lib/python3.10/dist-packages/torch/nn/functional.py:2515:0 %self_blocks_11_attn_qkv.2 : Float(1, 122, 384, strides=[46848, 384, 1], requires_grad=0, device=cpu) = aten::linear(%input.402, %self.self_blocks_11_attn_qkv.weight.2, %self.self_blocks_0_attn_qkv.bias.2), scope: __module.self_blocks_11_attn_qkv # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 %reshape_22.1 : Float(1, 122, 3, 8, 16, strides=[46848, 384, 128, 16, 1], requires_grad=0, device=cpu) = aten::reshape(%self_blocks_11_attn_qkv.2, %4) # .1:327:0 %permute_11.1 : Float(3, 1, 8, 122, 16, strides=[128, 46848, 16, 384, 1], requires_grad=0, device=cpu) = aten::permute(%reshape_22.1, %1) # .1:328:0 return (%permute_11.1, %input.398) with prim::FallbackGraph_23 = graph(%0 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu), %scaled_dot_product_attention_10 : Float(1, 8, 122, 16, strides=[15616, 1952, 16, 1], requires_grad=0, device=cpu)): %2 : int[] = prim::Constant[value=[2, 0, 3, 1, 4]]() %3 : int[] = prim::Constant[value=[1, 122, 3, 8, 16]]() %self.self_blocks_0_attn_qkv.bias.2 : Float(384, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_11_attn_qkv.weight.2 : Float(384, 128, strides=[128, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_10_mlp_fc2.weight.5 : Float(128, 512, strides=[512, 
1], requires_grad=0, device=cpu) = prim::Constant[value=]() %7 : str = prim::Constant[value="none"]() %self.self_blocks_0_mlp_fc1.bias.7 : Float(512, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_10_mlp_fc1.weight.7 : Float(512, 128, strides=[128, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %10 : bool = prim::Constant[value=1]() %11 : float = prim::Constant[value=9.9999999999999995e-07]() %self.self_blocks_0_norm1.weight.8 : Float(128, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %13 : int[] = prim::Constant[value=[128]]() %self.self_blocks_0_norm1.bias.10 : Float(128, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_10_attn_proj.weight.10 : Float(128, 128, strides=[128, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %16 : int[] = prim::Constant[value=[1, 122, 128]]() %17 : int = prim::Constant[value=2]() %18 : int = prim::Constant[value=1]() %transpose_11.1 : Tensor = aten::transpose(%scaled_dot_product_attention_10, %18, %17) # .1:308:0 %input.370 : Tensor = aten::reshape(%transpose_11.1, %16) # .1:309:0 %input.374 : Tensor = aten::linear(%input.370, %self.self_blocks_10_attn_proj.weight.10, %self.self_blocks_0_norm1.bias.10), scope: __module.self_blocks_10_attn_proj # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 %input.378 : Tensor = aten::add(%0, %input.374, %18) # .1:314:0 %input.382 : Tensor = aten::layer_norm(%input.378, %13, %self.self_blocks_0_norm1.weight.8, %self.self_blocks_0_norm1.bias.10, %11, %10), scope: __module.self_blocks_10_norm2 # /usr/local/lib/python3.10/dist-packages/torch/nn/functional.py:2515:0 %input.386 : Tensor = aten::linear(%input.382, %self.self_blocks_10_mlp_fc1.weight.7, %self.self_blocks_0_mlp_fc1.bias.7), scope: __module.self_blocks_10_mlp_fc1 # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 %input.390 : Tensor = aten::gelu(%input.386, %7), scope: __module.self_blocks_10_mlp_act # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/activation.py:685:0 %input.394 : Tensor = aten::linear(%input.390, %self.self_blocks_10_mlp_fc2.weight.5, %self.self_blocks_0_norm1.bias.10), scope: __module.self_blocks_10_mlp_fc2 # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 %input.398 : Tensor = aten::add(%input.378, %input.394, %18) # .1:324:0 %input.402 : Tensor = aten::layer_norm(%input.398, %13, %self.self_blocks_0_norm1.weight.8, %self.self_blocks_0_norm1.bias.10, %11, %10), scope: __module.self_blocks_11_norm1 # /usr/local/lib/python3.10/dist-packages/torch/nn/functional.py:2515:0 %self_blocks_11_attn_qkv.2 : Tensor = aten::linear(%input.402, %self.self_blocks_11_attn_qkv.weight.2, %self.self_blocks_0_attn_qkv.bias.2), scope: __module.self_blocks_11_attn_qkv # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 %reshape_22.1 : Tensor = aten::reshape(%self_blocks_11_attn_qkv.2, %3) # .1:327:0 %permute_11.1 : Tensor = aten::permute(%reshape_22.1, %2) # .1:328:0 return (%permute_11.1, %input.398) with prim::AIOFusionGroup_24 = graph(%input.398 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu), %scaled_dot_product_attention_11 : Float(1, 8, 122, 16, strides=[15616, 1952, 16, 1], requires_grad=0, device=cpu)): %2 : int = prim::Constant[value=9223372036854775807]() %1 : int = prim::Constant[value=0]() %self.self_blocks_11_mlp_fc2.weight.3 : Float(128, 512, strides=[512, 1], requires_grad=0, device=cpu) = 
prim::Constant[value=]() %29 : str = prim::Constant[value="none"]() %self.self_blocks_0_mlp_fc1.bias.5 : Float(512, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_11_mlp_fc1.weight.5 : Float(512, 128, strides=[128, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %42 : bool = prim::Constant[value=1]() %41 : float = prim::Constant[value=9.9999999999999995e-07]() %self.self_blocks_0_norm1.weight.6 : Float(128, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %38 : int[] = prim::Constant[value=[128]]() %self.self_blocks_0_norm1.bias.8 : Float(128, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_11_attn_proj.weight.8 : Float(128, 128, strides=[128, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %61 : int[] = prim::Constant[value=[1, 122, 128]]() %65 : int = prim::Constant[value=1]() %70 : int[] = prim::Constant[value=[0, 2, 1, 3]]() %71 : Float(1, 122, 8, 16, strides=[15616, 16, 1952, 1], requires_grad=0, device=cpu) = aten::permute(%scaled_dot_product_attention_11, %70) %input.406 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::reshape(%71, %61) # .1:337:0 %input.410 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::linear(%input.406, %self.self_blocks_11_attn_proj.weight.8, %self.self_blocks_0_norm1.bias.8), scope: __module.self_blocks_11_attn_proj # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 %input.414 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::add(%input.398, %input.410, %65) # .1:342:0 %input.418 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::layer_norm(%input.414, %38, %self.self_blocks_0_norm1.weight.6, %self.self_blocks_0_norm1.bias.8, %41, %42), scope: __module.self_blocks_11_norm2 # /usr/local/lib/python3.10/dist-packages/torch/nn/functional.py:2515:0 %input.422 : Float(1, 122, 512, strides=[62464, 512, 1], requires_grad=0, device=cpu) = aten::linear(%input.418, %self.self_blocks_11_mlp_fc1.weight.5, %self.self_blocks_0_mlp_fc1.bias.5), scope: __module.self_blocks_11_mlp_fc1 # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 %input.426 : Float(1, 122, 512, strides=[62464, 512, 1], requires_grad=0, device=cpu) = aten::gelu(%input.422, %29), scope: __module.self_blocks_11_mlp_act # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/activation.py:685:0 %input.430 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::linear(%input.426, %self.self_blocks_11_mlp_fc2.weight.3, %self.self_blocks_0_norm1.bias.8), scope: __module.self_blocks_11_mlp_fc2 # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 %input.434 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::add(%input.414, %input.430, %65) # .1:352:0 %self_norm.2 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::layer_norm(%input.434, %38, %self.self_blocks_0_norm1.weight.6, %self.self_blocks_0_norm1.bias.8, %41, %42), scope: __module.self_norm # /usr/local/lib/python3.10/dist-packages/torch/nn/functional.py:2515:0 %4 : Tensor = aten::slice(%self_norm.2, %1, %1, %2, %65) # .1:354:0 return (%4) with prim::FallbackGraph_25 = graph(%0 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu), %scaled_dot_product_attention_11 : Float(1, 8, 122, 16, strides=[15616, 1952, 16, 1], 
requires_grad=0, device=cpu)): %2 : int = prim::Constant[value=9223372036854775807]() %3 : int = prim::Constant[value=0]() %self.self_blocks_11_mlp_fc2.weight.3 : Float(128, 512, strides=[512, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %5 : str = prim::Constant[value="none"]() %self.self_blocks_0_mlp_fc1.bias.5 : Float(512, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_11_mlp_fc1.weight.5 : Float(512, 128, strides=[128, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %8 : bool = prim::Constant[value=1]() %9 : float = prim::Constant[value=9.9999999999999995e-07]() %self.self_blocks_0_norm1.weight.6 : Float(128, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %11 : int[] = prim::Constant[value=[128]]() %self.self_blocks_0_norm1.bias.8 : Float(128, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_11_attn_proj.weight.8 : Float(128, 128, strides=[128, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %14 : int[] = prim::Constant[value=[1, 122, 128]]() %15 : int = prim::Constant[value=2]() %16 : int = prim::Constant[value=1]() %transpose_12.1 : Tensor = aten::transpose(%scaled_dot_product_attention_11, %16, %15) # .1:336:0 %input.406 : Tensor = aten::reshape(%transpose_12.1, %14) # .1:337:0 %input.410 : Tensor = aten::linear(%input.406, %self.self_blocks_11_attn_proj.weight.8, %self.self_blocks_0_norm1.bias.8), scope: __module.self_blocks_11_attn_proj # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 %input.414 : Tensor = aten::add(%0, %input.410, %16) # .1:342:0 %input.418 : Tensor = aten::layer_norm(%input.414, %11, %self.self_blocks_0_norm1.weight.6, %self.self_blocks_0_norm1.bias.8, %9, %8), scope: __module.self_blocks_11_norm2 # /usr/local/lib/python3.10/dist-packages/torch/nn/functional.py:2515:0 %input.422 : Tensor = aten::linear(%input.418, %self.self_blocks_11_mlp_fc1.weight.5, %self.self_blocks_0_mlp_fc1.bias.5), scope: __module.self_blocks_11_mlp_fc1 # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 %input.426 : Tensor = aten::gelu(%input.422, %5), scope: __module.self_blocks_11_mlp_act # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/activation.py:685:0 %input.430 : Tensor = aten::linear(%input.426, %self.self_blocks_11_mlp_fc2.weight.3, %self.self_blocks_0_norm1.bias.8), scope: __module.self_blocks_11_mlp_fc2 # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 %input.434 : Tensor = aten::add(%input.414, %input.430, %16) # .1:352:0 %self_norm.2 : Tensor = aten::layer_norm(%input.434, %11, %self.self_blocks_0_norm1.weight.6, %self.self_blocks_0_norm1.bias.8, %9, %8), scope: __module.self_norm # /usr/local/lib/python3.10/dist-packages/torch/nn/functional.py:2515:0 %27 : Tensor = aten::slice(%self_norm.2, %3, %3, %2, %16) # .1:354:0 return (%27) with prim::AIOFusionGroup_26 = graph(%input.245 : Float(1, 128, strides=[15616, 1], requires_grad=0, device=cpu)): %self.self_head.bias : Float(1000, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_head.weight : Float(1000, 128, strides=[128, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %3 : Float(1, 1000, strides=[1000, 1], requires_grad=0, device=cpu) = aten::linear(%input.245, %self.self_head.weight, %self.self_head.bias), scope: __module.self_head # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 return (%3) with prim::FallbackGraph_27 = 
graph(%input.245 : Float(1, 128, strides=[15616, 1], requires_grad=0, device=cpu)): %self.self_head.bias : Float(1000, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_head.weight : Float(1000, 128, strides=[128, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %3 : Tensor = aten::linear(%input.245, %self.self_head.weight, %self.self_head.bias), scope: __module.self_head # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 return (%3) Running DLS graph fuser taken 43342 microseconds Building AIO network from graph graph(%x : Float(1, 3, 110, 110, strides=[36300, 12100, 110, 1], requires_grad=0, device=cpu)): %1 : int[] = prim::Constant[value=[2, 0, 3, 1, 4]]() %2 : int[] = prim::Constant[value=[1, 122, 3, 8, 16]]() %self.self_blocks_0_attn_qkv.bias.2 : Float(384, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_0_attn_qkv.weight.2 : Float(384, 128, strides=[128, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %5 : float = prim::Constant[value=9.9999999999999995e-07]() %self.self_blocks_0_norm1.bias.3 : Float(128, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_0_norm1.weight.3 : Float(128, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %8 : int[] = prim::Constant[value=[128]]() %self.self_pos_embed.4 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %10 : Float(1, 1, 128, strides=[128, 128, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %11 : int = prim::Constant[value=-1]() %12 : int = prim::Constant[value=2]() %13 : bool = prim::Constant[value=1]() %14 : int = prim::Constant[value=1]() %15 : bool = prim::Constant[value=0]() %16 : int[] = prim::Constant[value=[1, 1]]() %17 : int[] = prim::Constant[value=[0, 0]]() %18 : int[] = prim::Constant[value=[10, 10]]() %self.self_patch_embed_proj.bias.9 : Float(128, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_patch_embed_proj.weight.9 : Float(128, 3, 10, 10, strides=[300, 100, 10, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self_patch_embed_proj.2 : Float(1, 128, 11, 11, strides=[15488, 121, 11, 1], requires_grad=0, device=cpu) = aten::_convolution(%x, %self.self_patch_embed_proj.weight.9, %self.self_patch_embed_proj.bias.9, %18, %17, %16, %15, %17, %14, %15, %15, %13, %13), scope: __module.self_patch_embed_proj # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/conv.py:459:0 %flatten.1 : Float(1, 128, 121, strides=[15488, 121, 1], requires_grad=0, device=cpu) = aten::flatten(%self_patch_embed_proj.2, %12, %11) # .1:6:0 %23 : int[] = prim::Constant[value=[0, 2, 1]]() %24 : Float(1, 121, 128, strides=[15488, 1, 121], requires_grad=0, device=cpu) = aten::permute(%flatten.1, %23) %25 : Tensor[] = prim::ListConstruct(%10, %24) %cat.1 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::cat(%25, %14) # .1:12:0 %input.2 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::add(%cat.1, %self.self_pos_embed.4, %14) # .1:13:0 %input.6 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::layer_norm(%input.2, %8, %self.self_blocks_0_norm1.weight.3, %self.self_blocks_0_norm1.bias.3, %5, %13), scope: __module.self_blocks_0_norm1 # /usr/local/lib/python3.10/dist-packages/torch/nn/functional.py:2515:0 %self_blocks_0_attn_qkv.2 : Float(1, 122, 384, strides=[46848, 384, 1], 
requires_grad=0, device=cpu) = aten::linear(%input.6, %self.self_blocks_0_attn_qkv.weight.2, %self.self_blocks_0_attn_qkv.bias.2), scope: __module.self_blocks_0_attn_qkv # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 %reshape.1 : Float(1, 122, 3, 8, 16, strides=[46848, 384, 128, 16, 1], requires_grad=0, device=cpu) = aten::reshape(%self_blocks_0_attn_qkv.2, %2) # .1:19:0 %permute.1 : Float(3, 1, 8, 122, 16, strides=[128, 46848, 16, 384, 1], requires_grad=0, device=cpu) = aten::permute(%reshape.1, %1) # .1:20:0 return (%permute.1, %input.2) Adding prim::Constant layer to network Adding prim::Constant layer to network Adding prim::Constant layer to network Adding prim::Constant layer to network Adding prim::Constant layer to network Adding prim::Constant layer to network Adding prim::Constant layer to network Adding prim::Constant layer to network Adding prim::Constant layer to network Adding prim::Constant layer to network Creating blob for Data layer 2 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 1, 128] Adding prim::Constant layer to network Adding prim::Constant layer to network Adding prim::Constant layer to network Adding prim::Constant layer to network Adding prim::Constant layer to network Adding prim::Constant layer to network Adding prim::Constant layer to network Adding prim::Constant layer to network Adding prim::Constant layer to network Adding prim::Constant layer to network Adding aten::_convolution layer to network Binding inputs for Convolution Layer Weight 0xaaab0de3f340 , Bias 0xaaab0de23440 , padding [0, 0] , stride [10, 10] , dilation [1, 1] , groups 1 Registering network input: Conv input index: 0 Creating blob for Input layer 4 with type FLOAT format PlainDataFormat(FORMATF_CAFFE)[0x000000000000000a] shape [1, 3, 110, 110] Creating blob for Data layer 5 with type FLOAT format PlainDataFormat(FORMATF_CAFFE)[0x000000000000000a] shape [128, 3, 10, 10] Creating blob for Data layer 6 with type FLOAT format PlainDataFormat(FORMATF_LINEAR)[0x0000000000000001] shape [128] Adding aten::flatten layer to network Adding prim::Constant layer to network Adding aten::permute layer to network Creating blob (executor) for Data layer 9 with type INT32 shape [3] 0 2 1 Allocating 12 bytes (aligned) Adding prim::ListConstruct layer to network Adding aten::cat layer to network Binding inputs for Cat layer dim 1 tensor_inputs.size 2 Adding aten::add layer to network Creating blob for Data layer 12 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 122, 128] Adding aten::layer_norm layer to network Creating blob (executor) for Data layer 14 with type INT32 shape [1] 128 Allocating 4 bytes (aligned) Creating blob for Data layer 15 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [128] Creating blob for Data layer 16 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [128] Adding aten::linear layer to network Binding inputs for Linear layer Weight 0xaaab0de834c0 , Bias 0xaaab0dd9b2c0 Creating blob for Data layer 18 with type FLOAT format PlainDataFormat(FORMATF_ROW_MAJOR)[0x0000000000000004] shape [384, 128] Creating blob for Data layer 19 with type FLOAT format PlainDataFormat(FORMATF_LINEAR)[0x0000000000000001] shape [384] Adding aten::reshape layer to network Creating blob (executor) for Data layer 21 with type INT64 shape [5] [1, 122, 3, 8, 16] Allocating 40 bytes (aligned) Adding aten::permute layer to network Creating blob (executor) for 
Data layer 23 with type INT32 shape [5] 2 0 3 1 4 Allocating 20 bytes (aligned) Running AIO Network Allocating 16 bytes (aligned) Allocating 16 bytes (aligned) Layer FullyConnected got PlainDataFormat(FORMATF_BATCH_ROW_MAJOR)[0x0000000000000015] while it prefers PlainDataFormat(FORMATF_ROW_MAJOR)[0x0000000000000004] but no such conversion is available in DLS Allocating 12 bytes (aligned) Allocating 12 bytes (aligned) Selected kernel Data for layer Data : Selected kernel Data for layer Data : Selected kernel Data for layer Data : Selected kernel Data for layer Data : Selected kernel Data for layer Data : Selected kernel Data for layer Data : Selected kernel Data for layer Data : Selected kernel Data for layer Data : Selected kernel Data for layer Data : Selected kernel Data for layer Data : Selected kernel Data for layer Data : Selected kernel Data for layer Data : Selected kernel TransposeBRC4x4 for layer Transpose : Selected kernel Data for layer Data : Selected kernel Data for layer Data : Selected kernel Data for layer Data : Selected kernel Input for layer Input : Conv input Selected kernel TransposeBRC3x4 for layer Transpose : Selected kernel ConvViaJitMatmul for layer Convolution : PlatformInfo(vendor_id=3, cpu_family=8, cpu_model=3340, isa=NEON, L1=CacheInfo(size=65536, inclusive=1, share_count=1), L2=CacheInfo(size=1048576, inclusive=0, share_count=1), L3=CacheInfo(size=33554432, inclusive=0, share_count=80)) Tuning ConvTask(batch=1,idepth=1,iheight=110,iwidth=110,ichannels=3,odepth=1,oheight=11,owidth=11,ochannels=128,kdepth=1,kheight=10,kwidth=10,dstride=1,hstride=10,wstride=10,ddilation=1,hdilation=1,wdilation=1,dpad=0,hpad=0,wpad=0,dtype=FLOAT,extbatch=1,mut_w=0) Found preset with best score CvjmPreset(in_regs=8,w_regs=2,batch_tile=8,inpf_step_tile=100,outf_tile=128) Allocating 153600 bytes (aligned) Selected kernel TransposeBRC4x4 for layer Transpose : Selected kernel ForwardingKernelFlatten for layer Flatten : Selected kernel Data for layer Data : Selected kernel TransposeIndexed for layer Transpose : Allocating 512 bytes (aligned) Selected kernel ConcatLastDim for layer Concat : Selected kernel BinaryOpVectorized[Add]@NEON for layer Add : Selected kernel TransposeBRC4x4 for layer Transpose : Selected kernel LayerNormVectorized@NEON for layer LayerNorm : Selected kernel FCViaConvOne for layer FullyConnected : Selected kernel ForwardingKernelReshape for layer Reshape : Selected kernel TransposeIndexed for layer Transpose : Allocating 187392 bytes (aligned) Selected kernel ForwardingKernelOutput for layer Output : Selected kernel ForwardingKernelOutput for layer Output : Merge of ( Transpose [1, 110, 110, 3] ) to Conv input ( Input Input ): Target layer type is not mergeable Merge of ( Convolution [1, 11, 11, 128] ) to ( Transpose TransposeBRC3x4 ): Target layer type is not mergeable Merge of ( Transpose [1, 128, 11, 11] ) to ( Convolution ConvViaJitMatmul ): Target layer type is not mergeable Merge of ( Flatten [1, 128, 121] ) to ( Transpose TransposeBRC4x4 ): Target layer type is not mergeable Merge of ( Concat [1, 128, 122] ) to ( Flatten ForwardingKernelFlatten ): Target layer type is not mergeable Considering merge of Add to ConcatLastDim Kernel ConcatLastDim rejected merge Merge of ( Add [1, 128, 122] ) to ( Concat ConcatLastDim ): Attempt merge failed Merge of ( Transpose [1, 122, 128] ) to ( Add BinaryOpVectorized[Add]@NEON ): Target layer type is not mergeable Merge of ( FullyConnected [1, 122, 384] ) to ( LayerNorm LayerNormVectorized@NEON ): Target layer type is not 
mergeable Merge of ( Reshape [1, 122, 3, 8, 16] ) to ( FullyConnected FCViaConvOne ): Target layer type is not mergeable Merge of ( Transpose [3, 1, 8, 122, 16] ) to ( Reshape ForwardingKernelReshape ): Target layer type is not mergeable Merge of ( Output [3, 1, 8, 122, 16] ) to ( Transpose TransposeIndexed ): Target layer type is not mergeable External allocation: allocating 187392 bytes Creating external output: 0 , shape: [3, 1, 8, 122, 16] External allocation: allocating 62464 bytes Creating external output: 1 , shape: [1, 122, 128] Allocating 62464 bytes (aligned) Allocating 145200 bytes (aligned) Allocating 61952 bytes (aligned) Allocating 61952 bytes (aligned) Allocating 512 bytes (aligned) Allocating 62464 bytes (aligned) Allocating 62464 bytes (aligned) Allocating 187392 bytes (aligned) Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running TransposeBRC4x4 Running Data Running Data Running Data Running Input Running TransposeBRC3x4 Running ConvViaJitMatmul Allocating 48400 bytes (aligned) Jitted kernel for init: in_mode: ProxyInput acc_init: AccInitializer::ZERO in_dtype: FLOAT ref_grid: [8x2] out_features: 8 in_tail_cols: 3 int8_apply_filter_offset: 0 int8_shift_uint8_to_sint8: 0 postprocessing_ops: PP[BINOP_ADD_LINEAR,NONE,NONE,NONE,NONE,NONE,] inner_iter_length: [no-value] sparse_proxy_in_optimization: 0 strided_weights: 0 input_can_read_last_full_vector: 0 weights_can_read_last_full_vector: 0 prefetch_options: {w_ahead: 0} at 0xffff84cc0000 , used 5900 B Running TransposeBRC4x4 Running ForwardingKernelFlatten Running Data Running TransposeIndexed Running ConcatLastDim Running BinaryOpVectorized[Add]@NEON Running TransposeBRC4x4 Running LayerNormVectorized@NEON Running FCViaConvOne /usr/local/share//libampere-aio/data/lookup_files/conv_one_jit.csv 1 Could not parse lookup entry: Missing column task.extbatch Tuning ConvTask(batch=122,idepth=1,iheight=1,iwidth=1,ichannels=128,odepth=1,oheight=1,owidth=1,ochannels=384,kdepth=1,kheight=1,kwidth=1,dstride=1,hstride=1,wstride=1,ddilation=1,hdilation=1,wdilation=1,dpad=0,hpad=0,wpad=0,dtype=FLOAT,extbatch=1,mut_w=0) Found preset with best score ConvOnePreset(in_regs=7,w_regs=3,w_prefetches=0,outf_tile=48,in_mode=MS,d_minibatch=128,n_minibatch=61) Allocating 196608 bytes (aligned) Scratches: 0 @ 0 Jitted kernel for init: in_mode: MultiStream acc_init: AccInitializer::ZERO in_dtype: FLOAT ref_grid: [7x3] out_features: 12 in_tail_cols: 0 int8_apply_filter_offset: 0 int8_shift_uint8_to_sint8: 0 postprocessing_ops: PP[BINOP_ADD_LINEAR,NONE,NONE,NONE,NONE,NONE,] inner_iter_length: [no-value] sparse_proxy_in_optimization: 0 strided_weights: 0 input_can_read_last_full_vector: 0 weights_can_read_last_full_vector: 0 prefetch_options: {w_ahead: 0} at 0xffff84cb0000 , used 3832 B Running ForwardingKernelReshape Running TransposeIndexed Running ForwardingKernelOutput Running ForwardingKernelOutput Building AIO network from graph graph(%input.2 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu), %scaled_dot_product_attention : Float(1, 8, 122, 16, strides=[15616, 1952, 16, 1], requires_grad=0, device=cpu)): %2 : int[] = prim::Constant[value=[2, 0, 3, 1, 4]]() %3 : int[] = prim::Constant[value=[1, 122, 3, 8, 16]]() %self.self_blocks_0_attn_qkv.bias.2 : Float(384, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_1_attn_qkv.weight.2 : Float(384, 128, strides=[128, 1], 
requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_0_mlp_fc2.weight.5 : Float(128, 512, strides=[512, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %7 : str = prim::Constant[value="none"]() %self.self_blocks_0_mlp_fc1.bias.7 : Float(512, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_0_mlp_fc1.weight.7 : Float(512, 128, strides=[128, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %10 : bool = prim::Constant[value=1]() %11 : float = prim::Constant[value=9.9999999999999995e-07]() %self.self_blocks_0_norm1.weight.8 : Float(128, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %13 : int[] = prim::Constant[value=[128]]() %self.self_blocks_0_norm1.bias.10 : Float(128, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_0_attn_proj.weight.10 : Float(128, 128, strides=[128, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %16 : int[] = prim::Constant[value=[1, 122, 128]]() %17 : int = prim::Constant[value=1]() %18 : int[] = prim::Constant[value=[0, 2, 1, 3]]() %19 : Float(1, 122, 8, 16, strides=[15616, 16, 1952, 1], requires_grad=0, device=cpu) = aten::permute(%scaled_dot_product_attention, %18) %input.10 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::reshape(%19, %16) # .1:29:0 %input.14 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::linear(%input.10, %self.self_blocks_0_attn_proj.weight.10, %self.self_blocks_0_norm1.bias.10), scope: __module.self_blocks_0_attn_proj # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 %input.18 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::add(%input.2, %input.14, %17) # .1:34:0 %input.22 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::layer_norm(%input.18, %13, %self.self_blocks_0_norm1.weight.8, %self.self_blocks_0_norm1.bias.10, %11, %10), scope: __module.self_blocks_0_norm2 # /usr/local/lib/python3.10/dist-packages/torch/nn/functional.py:2515:0 %input.26 : Float(1, 122, 512, strides=[62464, 512, 1], requires_grad=0, device=cpu) = aten::linear(%input.22, %self.self_blocks_0_mlp_fc1.weight.7, %self.self_blocks_0_mlp_fc1.bias.7), scope: __module.self_blocks_0_mlp_fc1 # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 %input.30 : Float(1, 122, 512, strides=[62464, 512, 1], requires_grad=0, device=cpu) = aten::gelu(%input.26, %7), scope: __module.self_blocks_0_mlp_act # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/activation.py:685:0 %input.34 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::linear(%input.30, %self.self_blocks_0_mlp_fc2.weight.5, %self.self_blocks_0_norm1.bias.10), scope: __module.self_blocks_0_mlp_fc2 # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 %input.38 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::add(%input.18, %input.34, %17) # .1:44:0 %input.42 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::layer_norm(%input.38, %13, %self.self_blocks_0_norm1.weight.8, %self.self_blocks_0_norm1.bias.10, %11, %10), scope: __module.self_blocks_1_norm1 # /usr/local/lib/python3.10/dist-packages/torch/nn/functional.py:2515:0 %self_blocks_1_attn_qkv.2 : Float(1, 122, 384, strides=[46848, 384, 1], requires_grad=0, device=cpu) = aten::linear(%input.42, 
%self.self_blocks_1_attn_qkv.weight.2, %self.self_blocks_0_attn_qkv.bias.2), scope: __module.self_blocks_1_attn_qkv # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 %reshape_2.1 : Float(1, 122, 3, 8, 16, strides=[46848, 384, 128, 16, 1], requires_grad=0, device=cpu) = aten::reshape(%self_blocks_1_attn_qkv.2, %3) # .1:47:0 %permute_1.1 : Float(3, 1, 8, 122, 16, strides=[128, 46848, 16, 384, 1], requires_grad=0, device=cpu) = aten::permute(%reshape_2.1, %2) # .1:48:0 return (%permute_1.1, %input.38) Adding prim::Constant layer to network Adding prim::Constant layer to network Adding prim::Constant layer to network Adding prim::Constant layer to network Adding prim::Constant layer to network Adding prim::Constant layer to network Adding prim::Constant layer to network Adding prim::Constant layer to network Adding prim::Constant layer to network Adding prim::Constant layer to network Adding prim::Constant layer to network Adding prim::Constant layer to network Adding prim::Constant layer to network Adding prim::Constant layer to network Adding prim::Constant layer to network Adding prim::Constant layer to network Adding prim::Constant layer to network Adding aten::permute layer to network Registering network input: Permute input index: 1 Creating blob for Input layer 37 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 8, 122, 16] Creating blob (executor) for Data layer 38 with type INT32 shape [4] 0 2 1 3 Allocating 16 bytes (aligned) Adding aten::reshape layer to network Creating blob (executor) for Data layer 40 with type INT64 shape [3] [1, 122, 128] Allocating 24 bytes (aligned) Adding aten::linear layer to network Binding inputs for Linear layer Weight 0xaaab0deb3540 , Bias 0xaaab0dd9a000 Creating blob for Data layer 42 with type FLOAT format PlainDataFormat(FORMATF_ROW_MAJOR)[0x0000000000000004] shape [128, 128] Creating blob for Data layer 43 with type FLOAT format PlainDataFormat(FORMATF_LINEAR)[0x0000000000000001] shape [128] Adding aten::add layer to network Registering network input: Lhs input index: 0 Creating blob for Input layer 45 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 122, 128] Adding aten::layer_norm layer to network Creating blob (executor) for Data layer 47 with type INT32 shape [1] 128 Allocating 4 bytes (aligned) Creating blob for Data layer 48 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [128] Creating blob for Data layer 49 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [128] Adding aten::linear layer to network Binding inputs for Linear layer Weight 0xaaab0dec35c0 , Bias 0xaaab0dd9bdc0 Creating blob for Data layer 51 with type FLOAT format PlainDataFormat(FORMATF_ROW_MAJOR)[0x0000000000000004] shape [512, 128] Creating blob for Data layer 52 with type FLOAT format PlainDataFormat(FORMATF_LINEAR)[0x0000000000000001] shape [512] Adding aten::gelu layer to network Adding aten::linear layer to network Binding inputs for Linear layer Weight 0xaaab0df03640 , Bias 0xaaab0dd9a000 Creating blob for Data layer 55 with type FLOAT format PlainDataFormat(FORMATF_ROW_MAJOR)[0x0000000000000004] shape [128, 512] Creating blob for Data layer 56 with type FLOAT format PlainDataFormat(FORMATF_LINEAR)[0x0000000000000001] shape [128] Adding aten::add layer to network Adding aten::layer_norm layer to network Creating blob (executor) for Data layer 59 with type INT32 shape [1] 128 Allocating 4 bytes (aligned) Creating blob for 
Data layer 60 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [128] Creating blob for Data layer 61 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [128] Adding aten::linear layer to network Binding inputs for Linear layer Weight 0xaaab0df436c0 , Bias 0xaaab0dd9b2c0 Creating blob for Data layer 63 with type FLOAT format PlainDataFormat(FORMATF_ROW_MAJOR)[0x0000000000000004] shape [384, 128] Creating blob for Data layer 64 with type FLOAT format PlainDataFormat(FORMATF_LINEAR)[0x0000000000000001] shape [384] Adding aten::reshape layer to network Creating blob (executor) for Data layer 66 with type INT64 shape [5] [1, 122, 3, 8, 16] Allocating 40 bytes (aligned) Adding aten::permute layer to network Creating blob (executor) for Data layer 68 with type INT32 shape [5] 2 0 3 1 4 Allocating 20 bytes (aligned) Running AIO Network Layer FullyConnected got PlainDataFormat(FORMATF_BATCH_ROW_MAJOR)[0x0000000000000015] while it prefers PlainDataFormat(FORMATF_ROW_MAJOR)[0x0000000000000004] but no such conversion is available in DLS Layer FullyConnected got PlainDataFormat(FORMATF_BATCH_ROW_MAJOR)[0x0000000000000015] while it prefers PlainDataFormat(FORMATF_ROW_MAJOR)[0x0000000000000004] but no such conversion is available in DLS Layer FullyConnected got PlainDataFormat(FORMATF_BATCH_ROW_MAJOR)[0x0000000000000015] while it prefers PlainDataFormat(FORMATF_ROW_MAJOR)[0x0000000000000004] but no such conversion is available in DLS Layer FullyConnected got PlainDataFormat(FORMATF_BATCH_ROW_MAJOR)[0x0000000000000015] while it prefers PlainDataFormat(FORMATF_ROW_MAJOR)[0x0000000000000004] but no such conversion is available in DLS Selected kernel Data for layer Data : Selected kernel Data for layer Data : Selected kernel Data for layer Data : Selected kernel Data for layer Data : Selected kernel Data for layer Data : Selected kernel Data for layer Data : Selected kernel Data for layer Data : Selected kernel Data for layer Data : Selected kernel Data for layer Data : Selected kernel Data for layer Data : Selected kernel Data for layer Data : Selected kernel Data for layer Data : Selected kernel Data for layer Data : Selected kernel Data for layer Data : Selected kernel Input for layer Input : Lhs input Selected kernel Data for layer Data : Selected kernel Data for layer Data : Selected kernel Data for layer Data : Selected kernel Data for layer Data : Selected kernel Input for layer Input : Permute input Selected kernel TransposeBERTVectorized@NEON for layer Transpose : Selected kernel ForwardingKernelReshape for layer Reshape : Selected kernel FCViaConvOne for layer FullyConnected : Selected kernel BinaryOpVectorized[Add]@NEON for layer Add : Selected kernel LayerNormVectorized@NEON for layer LayerNorm : Selected kernel FCViaConvOne for layer FullyConnected : Selected kernel UnaryOpVectorized[Gelu]@NEON for layer Gelu : Selected kernel FCViaConvOne for layer FullyConnected : Selected kernel BinaryOpVectorized[Add]@NEON for layer Add : Selected kernel ForwardingKernelOutput for layer Output : Selected kernel LayerNormVectorized@NEON for layer LayerNorm : Selected kernel FCViaConvOne for layer FullyConnected : Selected kernel ForwardingKernelReshape for layer Reshape : Selected kernel TransposeIndexed for layer Transpose : Allocating 187392 bytes (aligned) Selected kernel ForwardingKernelOutput for layer Output : Considering merge of Add to Input Kernel Input rejected merge Merge of ( Add [1, 122, 128] ) to Lhs input ( Input Input ): Attempt merge 
failed Merge of ( Transpose [1, 122, 8, 16] ) to Permute input ( Input Input ): Target layer type is not mergeable Merge of ( Reshape [1, 122, 128] ) to ( Transpose TransposeBERTVectorized@NEON ): Target layer type is not mergeable Merge of ( FullyConnected [1, 122, 128] ) to ( Reshape ForwardingKernelReshape ): Target layer type is not mergeable Considering merge of Add to FCViaConvOne Merge of ( Add [1, 122, 128] ) to ( FullyConnected FCViaConvOne ): Successful Merge of ( FullyConnected [1, 122, 512] ) to ( LayerNorm LayerNormVectorized@NEON ): Target layer type is not mergeable Merge of ( Gelu [1, 122, 512] ) to ( FullyConnected FCViaConvOne ): Target layer type is not mergeable Merge of ( FullyConnected [1, 122, 128] ) to ( Gelu UnaryOpVectorized[Gelu]@NEON ): Target layer type is not mergeable Considering merge of Add to FCViaConvOne Merge of ( Add [1, 122, 128] ) to ( FullyConnected FCViaConvOne ): Successful Merge of ( FullyConnected [1, 122, 384] ) to ( LayerNorm LayerNormVectorized@NEON ): Target layer type is not mergeable Merge of ( Reshape [1, 122, 3, 8, 16] ) to ( FullyConnected FCViaConvOne ): Target layer type is not mergeable Merge of ( Transpose [3, 1, 8, 122, 16] ) to ( Reshape ForwardingKernelReshape ): Target layer type is not mergeable Merge of ( Output [3, 1, 8, 122, 16] ) to ( Transpose TransposeIndexed ): Target layer type is not mergeable External allocation: allocating 62464 bytes Creating external output: 1 , shape: [1, 122, 128] External allocation: allocating 187392 bytes Creating external output: 0 , shape: [3, 1, 8, 122, 16] Allocating 62464 bytes (aligned) Allocating 62464 bytes (aligned) Allocating 249856 bytes (aligned) Allocating 249856 bytes (aligned) Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Input Running Data Running Data Running Data Running Data Running Input Running TransposeBERTVectorized@NEON Running ForwardingKernelReshape Running FCViaConvOne Tuning ConvTask(batch=122,idepth=1,iheight=1,iwidth=1,ichannels=128,odepth=1,oheight=1,owidth=1,ochannels=128,kdepth=1,kheight=1,kwidth=1,dstride=1,hstride=1,wstride=1,ddilation=1,hdilation=1,wdilation=1,dpad=0,hpad=0,wpad=0,dtype=FLOAT,extbatch=1,mut_w=0) Found preset with best score ConvOnePreset(in_regs=6,w_regs=4,w_prefetches=0,outf_tile=32,in_mode=MS,d_minibatch=128,n_minibatch=61) Allocating 65536 bytes (aligned) Scratches: 0 @ 0 Jitted kernel for init: in_mode: MultiStream acc_init: AccInitializer::ZERO in_dtype: FLOAT ref_grid: [6x4] out_features: 16 in_tail_cols: 0 int8_apply_filter_offset: 0 int8_shift_uint8_to_sint8: 0 postprocessing_ops: PP[BINOP_ADD_LINEAR,BINOP_ADD_MATRIX,NONE,NONE,NONE,NONE,] inner_iter_length: [no-value] sparse_proxy_in_optimization: 0 strided_weights: 0 input_can_read_last_full_vector: 0 weights_can_read_last_full_vector: 0 prefetch_options: {w_ahead: 0} at 0xffff7c5c0000 , used 4652 B Running LayerNormVectorized@NEON Running FCViaConvOne Tuning ConvTask(batch=122,idepth=1,iheight=1,iwidth=1,ichannels=128,odepth=1,oheight=1,owidth=1,ochannels=512,kdepth=1,kheight=1,kwidth=1,dstride=1,hstride=1,wstride=1,ddilation=1,hdilation=1,wdilation=1,dpad=0,hpad=0,wpad=0,dtype=FLOAT,extbatch=1,mut_w=0) Found preset with best score ConvOnePreset(in_regs=6,w_regs=4,w_prefetches=0,outf_tile=64,in_mode=MS,d_minibatch=128,n_minibatch=61) Allocating 262144 bytes (aligned) Scratches: 0 @ 0 Jitted kernel for init: in_mode: MultiStream acc_init: 
AccInitializer::ZERO in_dtype: FLOAT ref_grid: [6x4] out_features: 16 in_tail_cols: 0 int8_apply_filter_offset: 0 int8_shift_uint8_to_sint8: 0 postprocessing_ops: PP[BINOP_ADD_LINEAR,NONE,NONE,NONE,NONE,NONE,] inner_iter_length: [no-value] sparse_proxy_in_optimization: 0 strided_weights: 0 input_can_read_last_full_vector: 0 weights_can_read_last_full_vector: 0 prefetch_options: {w_ahead: 0} at 0xffff7c590000 , used 3716 B Running UnaryOpVectorized[Gelu]@NEON Running FCViaConvOne Tuning ConvTask(batch=122,idepth=1,iheight=1,iwidth=1,ichannels=512,odepth=1,oheight=1,owidth=1,ochannels=128,kdepth=1,kheight=1,kwidth=1,dstride=1,hstride=1,wstride=1,ddilation=1,hdilation=1,wdilation=1,dpad=0,hpad=0,wpad=0,dtype=FLOAT,extbatch=1,mut_w=0) Found preset with best score ConvOnePreset(in_regs=6,w_regs=4,w_prefetches=0,outf_tile=16,in_mode=MS,d_minibatch=512,n_minibatch=122) Allocating 262144 bytes (aligned) Scratches: 0 @ 0 Running ForwardingKernelOutput Running LayerNormVectorized@NEON Running FCViaConvOne Tuning ConvTask(batch=122,idepth=1,iheight=1,iwidth=1,ichannels=128,odepth=1,oheight=1,owidth=1,ochannels=384,kdepth=1,kheight=1,kwidth=1,dstride=1,hstride=1,wstride=1,ddilation=1,hdilation=1,wdilation=1,dpad=0,hpad=0,wpad=0,dtype=FLOAT,extbatch=1,mut_w=0) Found preset via lookup ConvOnePreset(in_regs=7,w_regs=3,w_prefetches=0,outf_tile=48,in_mode=MS,d_minibatch=128,n_minibatch=61) Allocating 196608 bytes (aligned) Scratches: 0 @ 0 Running ForwardingKernelReshape Running TransposeIndexed Running ForwardingKernelOutput Building AIO network from graph graph(%input.38 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu), %scaled_dot_product_attention_1 : Float(1, 8, 122, 16, strides=[15616, 1952, 16, 1], requires_grad=0, device=cpu)): %2 : int[] = prim::Constant[value=[2, 0, 3, 1, 4]]() %3 : int[] = prim::Constant[value=[1, 122, 3, 8, 16]]() %self.self_blocks_0_attn_qkv.bias.2 : Float(384, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_2_attn_qkv.weight.2 : Float(384, 128, strides=[128, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_1_mlp_fc2.weight.5 : Float(128, 512, strides=[512, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %7 : str = prim::Constant[value="none"]() %self.self_blocks_0_mlp_fc1.bias.7 : Float(512, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_1_mlp_fc1.weight.7 : Float(512, 128, strides=[128, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %10 : bool = prim::Constant[value=1]() %11 : float = prim::Constant[value=9.9999999999999995e-07]() %self.self_blocks_0_norm1.weight.8 : Float(128, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %13 : int[] = prim::Constant[value=[128]]() %self.self_blocks_0_norm1.bias.10 : Float(128, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_1_attn_proj.weight.10 : Float(128, 128, strides=[128, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %16 : int[] = prim::Constant[value=[1, 122, 128]]() %17 : int = prim::Constant[value=1]() %18 : int[] = prim::Constant[value=[0, 2, 1, 3]]() %19 : Float(1, 122, 8, 16, strides=[15616, 16, 1952, 1], requires_grad=0, device=cpu) = aten::permute(%scaled_dot_product_attention_1, %18) %input.46 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::reshape(%19, %16) # .1:57:0 %input.50 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = 
aten::linear(%input.46, %self.self_blocks_1_attn_proj.weight.10, %self.self_blocks_0_norm1.bias.10), scope: __module.self_blocks_1_attn_proj # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 %input.54 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::add(%input.38, %input.50, %17) # .1:62:0 %input.58 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::layer_norm(%input.54, %13, %self.self_blocks_0_norm1.weight.8, %self.self_blocks_0_norm1.bias.10, %11, %10), scope: __module.self_blocks_1_norm2 # /usr/local/lib/python3.10/dist-packages/torch/nn/functional.py:2515:0 %input.62 : Float(1, 122, 512, strides=[62464, 512, 1], requires_grad=0, device=cpu) = aten::linear(%input.58, %self.self_blocks_1_mlp_fc1.weight.7, %self.self_blocks_0_mlp_fc1.bias.7), scope: __module.self_blocks_1_mlp_fc1 # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 %input.66 : Float(1, 122, 512, strides=[62464, 512, 1], requires_grad=0, device=cpu) = aten::gelu(%input.62, %7), scope: __module.self_blocks_1_mlp_act # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/activation.py:685:0 %input.70 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::linear(%input.66, %self.self_blocks_1_mlp_fc2.weight.5, %self.self_blocks_0_norm1.bias.10), scope: __module.self_blocks_1_mlp_fc2 # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 %input.74 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::add(%input.54, %input.70, %17) # .1:72:0 %input.78 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::layer_norm(%input.74, %13, %self.self_blocks_0_norm1.weight.8, %self.self_blocks_0_norm1.bias.10, %11, %10), scope: __module.self_blocks_2_norm1 # /usr/local/lib/python3.10/dist-packages/torch/nn/functional.py:2515:0 %self_blocks_2_attn_qkv.2 : Float(1, 122, 384, strides=[46848, 384, 1], requires_grad=0, device=cpu) = aten::linear(%input.78, %self.self_blocks_2_attn_qkv.weight.2, %self.self_blocks_0_attn_qkv.bias.2), scope: __module.self_blocks_2_attn_qkv # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 %reshape_4.1 : Float(1, 122, 3, 8, 16, strides=[46848, 384, 128, 16, 1], requires_grad=0, device=cpu) = aten::reshape(%self_blocks_2_attn_qkv.2, %3) # .1:75:0 %permute_2.1 : Float(3, 1, 8, 122, 16, strides=[128, 46848, 16, 384, 1], requires_grad=0, device=cpu) = aten::permute(%reshape_4.1, %2) # .1:76:0 return (%permute_2.1, %input.74) Adding prim::Constant layer to network Adding prim::Constant layer to network Adding prim::Constant layer to network Adding prim::Constant layer to network Adding prim::Constant layer to network Adding prim::Constant layer to network Adding prim::Constant layer to network Adding prim::Constant layer to network Adding prim::Constant layer to network Adding prim::Constant layer to network Adding prim::Constant layer to network Adding prim::Constant layer to network Adding prim::Constant layer to network Adding prim::Constant layer to network Adding prim::Constant layer to network Adding prim::Constant layer to network Adding prim::Constant layer to network Adding aten::permute layer to network Registering network input: Permute input index: 1 Creating blob for Input layer 72 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 8, 122, 16] Creating blob (executor) for Data layer 73 with type INT32 shape [4] 0 2 1 3 Allocating 
16 bytes (aligned) Adding aten::reshape layer to network Creating blob (executor) for Data layer 75 with type INT64 shape [3] [1, 122, 128] Allocating 24 bytes (aligned) Adding aten::linear layer to network Binding inputs for Linear layer Weight 0xaaab0df73740 , Bias 0xaaab0dd9a000 Creating blob for Data layer 77 with type FLOAT format PlainDataFormat(FORMATF_ROW_MAJOR)[0x0000000000000004] shape [128, 128] Creating blob for Data layer 78 with type FLOAT format PlainDataFormat(FORMATF_LINEAR)[0x0000000000000001] shape [128] Adding aten::add layer to network Registering network input: Lhs input index: 0 Creating blob for Input layer 80 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 122, 128] Adding aten::layer_norm layer to network Creating blob (executor) for Data layer 82 with type INT32 shape [1] 128 Allocating 4 bytes (aligned) Creating blob for Data layer 83 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [128] Creating blob for Data layer 84 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [128] Adding aten::linear layer to network Binding inputs for Linear layer Weight 0xaaab0df837c0 , Bias 0xaaab0dd9bdc0 Creating blob for Data layer 86 with type FLOAT format PlainDataFormat(FORMATF_ROW_MAJOR)[0x0000000000000004] shape [512, 128] Creating blob for Data layer 87 with type FLOAT format PlainDataFormat(FORMATF_LINEAR)[0x0000000000000001] shape [512] Adding aten::gelu layer to network Adding aten::linear layer to network Binding inputs for Linear layer Weight 0xaaab0dfc3840 , Bias 0xaaab0dd9a000 Creating blob for Data layer 90 with type FLOAT format PlainDataFormat(FORMATF_ROW_MAJOR)[0x0000000000000004] shape [128, 512] Creating blob for Data layer 91 with type FLOAT format PlainDataFormat(FORMATF_LINEAR)[0x0000000000000001] shape [128] Adding aten::add layer to network Adding aten::layer_norm layer to network Creating blob (executor) for Data layer 94 with type INT32 shape [1] 128 Allocating 4 bytes (aligned) Creating blob for Data layer 95 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [128] Creating blob for Data layer 96 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [128] Adding aten::linear layer to network Binding inputs for Linear layer Weight 0xaaab0e0038c0 , Bias 0xaaab0dd9b2c0 Creating blob for Data layer 98 with type FLOAT format PlainDataFormat(FORMATF_ROW_MAJOR)[0x0000000000000004] shape [384, 128] Creating blob for Data layer 99 with type FLOAT format PlainDataFormat(FORMATF_LINEAR)[0x0000000000000001] shape [384] Adding aten::reshape layer to network Creating blob (executor) for Data layer 101 with type INT64 shape [5] [1, 122, 3, 8, 16] Allocating 40 bytes (aligned) Adding aten::permute layer to network Creating blob (executor) for Data layer 103 with type INT32 shape [5] 2 0 3 1 4 Allocating 20 bytes (aligned) Running AIO Network Layer FullyConnected got PlainDataFormat(FORMATF_BATCH_ROW_MAJOR)[0x0000000000000015] while it prefers PlainDataFormat(FORMATF_ROW_MAJOR)[0x0000000000000004] but no such conversion is available in DLS Layer FullyConnected got PlainDataFormat(FORMATF_BATCH_ROW_MAJOR)[0x0000000000000015] while it prefers PlainDataFormat(FORMATF_ROW_MAJOR)[0x0000000000000004] but no such conversion is available in DLS Layer FullyConnected got PlainDataFormat(FORMATF_BATCH_ROW_MAJOR)[0x0000000000000015] while it prefers PlainDataFormat(FORMATF_ROW_MAJOR)[0x0000000000000004] but no such conversion is available in 
DLS Layer FullyConnected got PlainDataFormat(FORMATF_BATCH_ROW_MAJOR)[0x0000000000000015] while it prefers PlainDataFormat(FORMATF_ROW_MAJOR)[0x0000000000000004] but no such conversion is available in DLS Selected kernel Data for layer Data : Selected kernel Data for layer Data : Selected kernel Data for layer Data : Selected kernel Data for layer Data : Selected kernel Data for layer Data : Selected kernel Data for layer Data : Selected kernel Data for layer Data : Selected kernel Data for layer Data : Selected kernel Data for layer Data : Selected kernel Data for layer Data : Selected kernel Data for layer Data : Selected kernel Data for layer Data : Selected kernel Data for layer Data : Selected kernel Data for layer Data : Selected kernel Input for layer Input : Lhs input Selected kernel Data for layer Data : Selected kernel Data for layer Data : Selected kernel Data for layer Data : Selected kernel Data for layer Data : Selected kernel Input for layer Input : Permute input Selected kernel TransposeBERTVectorized@NEON for layer Transpose : Selected kernel ForwardingKernelReshape for layer Reshape : Selected kernel FCViaConvOne for layer FullyConnected : Selected kernel BinaryOpVectorized[Add]@NEON for layer Add : Selected kernel LayerNormVectorized@NEON for layer LayerNorm : Selected kernel FCViaConvOne for layer FullyConnected : Selected kernel UnaryOpVectorized[Gelu]@NEON for layer Gelu : Selected kernel FCViaConvOne for layer FullyConnected : Selected kernel BinaryOpVectorized[Add]@NEON for layer Add : Selected kernel ForwardingKernelOutput for layer Output : Selected kernel LayerNormVectorized@NEON for layer LayerNorm : Selected kernel FCViaConvOne for layer FullyConnected : Selected kernel ForwardingKernelReshape for layer Reshape : Selected kernel TransposeIndexed for layer Transpose : Allocating 187392 bytes (aligned) Selected kernel ForwardingKernelOutput for layer Output : Considering merge of Add to Input Kernel Input rejected merge Merge of ( Add [1, 122, 128] ) to Lhs input ( Input Input ): Attempt merge failed Merge of ( Transpose [1, 122, 8, 16] ) to Permute input ( Input Input ): Target layer type is not mergeable Merge of ( Reshape [1, 122, 128] ) to ( Transpose TransposeBERTVectorized@NEON ): Target layer type is not mergeable Merge of ( FullyConnected [1, 122, 128] ) to ( Reshape ForwardingKernelReshape ): Target layer type is not mergeable Considering merge of Add to FCViaConvOne Merge of ( Add [1, 122, 128] ) to ( FullyConnected FCViaConvOne ): Successful Merge of ( FullyConnected [1, 122, 512] ) to ( LayerNorm LayerNormVectorized@NEON ): Target layer type is not mergeable Merge of ( Gelu [1, 122, 512] ) to ( FullyConnected FCViaConvOne ): Target layer type is not mergeable Merge of ( FullyConnected [1, 122, 128] ) to ( Gelu UnaryOpVectorized[Gelu]@NEON ): Target layer type is not mergeable Considering merge of Add to FCViaConvOne Merge of ( Add [1, 122, 128] ) to ( FullyConnected FCViaConvOne ): Successful Merge of ( FullyConnected [1, 122, 384] ) to ( LayerNorm LayerNormVectorized@NEON ): Target layer type is not mergeable Merge of ( Reshape [1, 122, 3, 8, 16] ) to ( FullyConnected FCViaConvOne ): Target layer type is not mergeable Merge of ( Transpose [3, 1, 8, 122, 16] ) to ( Reshape ForwardingKernelReshape ): Target layer type is not mergeable Merge of ( Output [3, 1, 8, 122, 16] ) to ( Transpose TransposeIndexed ): Target layer type is not mergeable External allocation: allocating 62464 bytes Creating external output: 1 , shape: [1, 122, 128] External 
allocation: allocating 187392 bytes Creating external output: 0 , shape: [3, 1, 8, 122, 16] Allocating 62464 bytes (aligned) Allocating 62464 bytes (aligned) Allocating 249856 bytes (aligned) Allocating 249856 bytes (aligned) Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Input Running Data Running Data Running Data Running Data Running Input Running TransposeBERTVectorized@NEON Running ForwardingKernelReshape Running FCViaConvOne Tuning ConvTask(batch=122,idepth=1,iheight=1,iwidth=1,ichannels=128,odepth=1,oheight=1,owidth=1,ochannels=128,kdepth=1,kheight=1,kwidth=1,dstride=1,hstride=1,wstride=1,ddilation=1,hdilation=1,wdilation=1,dpad=0,hpad=0,wpad=0,dtype=FLOAT,extbatch=1,mut_w=0) Found preset via lookup ConvOnePreset(in_regs=6,w_regs=4,w_prefetches=0,outf_tile=32,in_mode=MS,d_minibatch=128,n_minibatch=61) Allocating 65536 bytes (aligned) Scratches: 0 @ 0 Running LayerNormVectorized@NEON Running FCViaConvOne Tuning ConvTask(batch=122,idepth=1,iheight=1,iwidth=1,ichannels=128,odepth=1,oheight=1,owidth=1,ochannels=512,kdepth=1,kheight=1,kwidth=1,dstride=1,hstride=1,wstride=1,ddilation=1,hdilation=1,wdilation=1,dpad=0,hpad=0,wpad=0,dtype=FLOAT,extbatch=1,mut_w=0) Found preset via lookup ConvOnePreset(in_regs=6,w_regs=4,w_prefetches=0,outf_tile=64,in_mode=MS,d_minibatch=128,n_minibatch=61) Allocating 262144 bytes (aligned) Scratches: 0 @ 0 Running UnaryOpVectorized[Gelu]@NEON Running FCViaConvOne Tuning ConvTask(batch=122,idepth=1,iheight=1,iwidth=1,ichannels=512,odepth=1,oheight=1,owidth=1,ochannels=128,kdepth=1,kheight=1,kwidth=1,dstride=1,hstride=1,wstride=1,ddilation=1,hdilation=1,wdilation=1,dpad=0,hpad=0,wpad=0,dtype=FLOAT,extbatch=1,mut_w=0) Found preset via lookup ConvOnePreset(in_regs=6,w_regs=4,w_prefetches=0,outf_tile=16,in_mode=MS,d_minibatch=512,n_minibatch=122) Allocating 262144 bytes (aligned) Scratches: 0 @ 0 Running ForwardingKernelOutput Running LayerNormVectorized@NEON Running FCViaConvOne Tuning ConvTask(batch=122,idepth=1,iheight=1,iwidth=1,ichannels=128,odepth=1,oheight=1,owidth=1,ochannels=384,kdepth=1,kheight=1,kwidth=1,dstride=1,hstride=1,wstride=1,ddilation=1,hdilation=1,wdilation=1,dpad=0,hpad=0,wpad=0,dtype=FLOAT,extbatch=1,mut_w=0) Found preset via lookup ConvOnePreset(in_regs=7,w_regs=3,w_prefetches=0,outf_tile=48,in_mode=MS,d_minibatch=128,n_minibatch=61) Allocating 196608 bytes (aligned) Scratches: 0 @ 0 Running ForwardingKernelReshape Running TransposeIndexed Running ForwardingKernelOutput Building AIO network from graph graph(%input.74 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu), %scaled_dot_product_attention_2 : Float(1, 8, 122, 16, strides=[15616, 1952, 16, 1], requires_grad=0, device=cpu)): %2 : int[] = prim::Constant[value=[2, 0, 3, 1, 4]]() %3 : int[] = prim::Constant[value=[1, 122, 3, 8, 16]]() %self.self_blocks_0_attn_qkv.bias.2 : Float(384, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_3_attn_qkv.weight.2 : Float(384, 128, strides=[128, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_2_mlp_fc2.weight.5 : Float(128, 512, strides=[512, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %7 : str = prim::Constant[value="none"]() %self.self_blocks_0_mlp_fc1.bias.7 : Float(512, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_2_mlp_fc1.weight.7 : Float(512, 128, 
strides=[128, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %10 : bool = prim::Constant[value=1]() %11 : float = prim::Constant[value=9.9999999999999995e-07]() %self.self_blocks_0_norm1.weight.8 : Float(128, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %13 : int[] = prim::Constant[value=[128]]() %self.self_blocks_0_norm1.bias.10 : Float(128, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_2_attn_proj.weight.10 : Float(128, 128, strides=[128, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %16 : int[] = prim::Constant[value=[1, 122, 128]]() %17 : int = prim::Constant[value=1]() %18 : int[] = prim::Constant[value=[0, 2, 1, 3]]() %19 : Float(1, 122, 8, 16, strides=[15616, 16, 1952, 1], requires_grad=0, device=cpu) = aten::permute(%scaled_dot_product_attention_2, %18) %input.82 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::reshape(%19, %16) # .1:85:0 %input.86 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::linear(%input.82, %self.self_blocks_2_attn_proj.weight.10, %self.self_blocks_0_norm1.bias.10), scope: __module.self_blocks_2_attn_proj # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 %input.90 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::add(%input.74, %input.86, %17) # .1:90:0 %input.94 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::layer_norm(%input.90, %13, %self.self_blocks_0_norm1.weight.8, %self.self_blocks_0_norm1.bias.10, %11, %10), scope: __module.self_blocks_2_norm2 # /usr/local/lib/python3.10/dist-packages/torch/nn/functional.py:2515:0 %input.98 : Float(1, 122, 512, strides=[62464, 512, 1], requires_grad=0, device=cpu) = aten::linear(%input.94, %self.self_blocks_2_mlp_fc1.weight.7, %self.self_blocks_0_mlp_fc1.bias.7), scope: __module.self_blocks_2_mlp_fc1 # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 %input.102 : Float(1, 122, 512, strides=[62464, 512, 1], requires_grad=0, device=cpu) = aten::gelu(%input.98, %7), scope: __module.self_blocks_2_mlp_act # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/activation.py:685:0 %input.106 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::linear(%input.102, %self.self_blocks_2_mlp_fc2.weight.5, %self.self_blocks_0_norm1.bias.10), scope: __module.self_blocks_2_mlp_fc2 # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 %input.110 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::add(%input.90, %input.106, %17) # .1:100:0 %input.114 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::layer_norm(%input.110, %13, %self.self_blocks_0_norm1.weight.8, %self.self_blocks_0_norm1.bias.10, %11, %10), scope: __module.self_blocks_3_norm1 # /usr/local/lib/python3.10/dist-packages/torch/nn/functional.py:2515:0 %self_blocks_3_attn_qkv.2 : Float(1, 122, 384, strides=[46848, 384, 1], requires_grad=0, device=cpu) = aten::linear(%input.114, %self.self_blocks_3_attn_qkv.weight.2, %self.self_blocks_0_attn_qkv.bias.2), scope: __module.self_blocks_3_attn_qkv # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 %reshape_6.1 : Float(1, 122, 3, 8, 16, strides=[46848, 384, 128, 16, 1], requires_grad=0, device=cpu) = aten::reshape(%self_blocks_3_attn_qkv.2, %3) # .1:103:0 %permute_3.1 : Float(3, 1, 8, 122, 16, strides=[128, 46848, 16, 
384, 1], requires_grad=0, device=cpu) = aten::permute(%reshape_6.1, %2) # .1:104:0 return (%permute_3.1, %input.110) Adding prim::Constant layer to network Adding prim::Constant layer to network Adding prim::Constant layer to network Adding prim::Constant layer to network Adding prim::Constant layer to network Adding prim::Constant layer to network Adding prim::Constant layer to network Adding prim::Constant layer to network Adding prim::Constant layer to network Adding prim::Constant layer to network Adding prim::Constant layer to network Adding prim::Constant layer to network Adding prim::Constant layer to network Adding prim::Constant layer to network Adding prim::Constant layer to network Adding prim::Constant layer to network Adding prim::Constant layer to network Adding aten::permute layer to network Registering network input: Permute input index: 1 Creating blob for Input layer 107 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 8, 122, 16] Creating blob (executor) for Data layer 108 with type INT32 shape [4] 0 2 1 3 Allocating 16 bytes (aligned) Adding aten::reshape layer to network Creating blob (executor) for Data layer 110 with type INT64 shape [3] [1, 122, 128] Allocating 24 bytes (aligned) Adding aten::linear layer to network Binding inputs for Linear layer Weight 0xaaab0e033940 , Bias 0xaaab0dd9a000 Creating blob for Data layer 112 with type FLOAT format PlainDataFormat(FORMATF_ROW_MAJOR)[0x0000000000000004] shape [128, 128] Creating blob for Data layer 113 with type FLOAT format PlainDataFormat(FORMATF_LINEAR)[0x0000000000000001] shape [128] Adding aten::add layer to network Registering network input: Lhs input index: 0 Creating blob for Input layer 115 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 122, 128] Adding aten::layer_norm layer to network Creating blob (executor) for Data layer 117 with type INT32 shape [1] 128 Allocating 4 bytes (aligned) Creating blob for Data layer 118 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [128] Creating blob for Data layer 119 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [128] Adding aten::linear layer to network Binding inputs for Linear layer Weight 0xaaab0e0439c0 , Bias 0xaaab0dd9bdc0 Creating blob for Data layer 121 with type FLOAT format PlainDataFormat(FORMATF_ROW_MAJOR)[0x0000000000000004] shape [512, 128] Creating blob for Data layer 122 with type FLOAT format PlainDataFormat(FORMATF_LINEAR)[0x0000000000000001] shape [512] Adding aten::gelu layer to network Adding aten::linear layer to network Binding inputs for Linear layer Weight 0xaaab0e083a40 , Bias 0xaaab0dd9a000 Creating blob for Data layer 125 with type FLOAT format PlainDataFormat(FORMATF_ROW_MAJOR)[0x0000000000000004] shape [128, 512] Creating blob for Data layer 126 with type FLOAT format PlainDataFormat(FORMATF_LINEAR)[0x0000000000000001] shape [128] Adding aten::add layer to network Adding aten::layer_norm layer to network Creating blob (executor) for Data layer 129 with type INT32 shape [1] 128 Allocating 4 bytes (aligned) Creating blob for Data layer 130 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [128] Creating blob for Data layer 131 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [128] Adding aten::linear layer to network Binding inputs for Linear layer Weight 0xaaab0e0c3ac0 , Bias 0xaaab0dd9b2c0 Creating blob for Data layer 133 with type FLOAT format 
PlainDataFormat(FORMATF_ROW_MAJOR)[0x0000000000000004] shape [384, 128] Creating blob for Data layer 134 with type FLOAT format PlainDataFormat(FORMATF_LINEAR)[0x0000000000000001] shape [384] Adding aten::reshape layer to network Creating blob (executor) for Data layer 136 with type INT64 shape [5] [1, 122, 3, 8, 16] Allocating 40 bytes (aligned) Adding aten::permute layer to network Creating blob (executor) for Data layer 138 with type INT32 shape [5] 2 0 3 1 4 Allocating 20 bytes (aligned) Running AIO Network Layer FullyConnected got PlainDataFormat(FORMATF_BATCH_ROW_MAJOR)[0x0000000000000015] while it prefers PlainDataFormat(FORMATF_ROW_MAJOR)[0x0000000000000004] but no such conversion is available in DLS Layer FullyConnected got PlainDataFormat(FORMATF_BATCH_ROW_MAJOR)[0x0000000000000015] while it prefers PlainDataFormat(FORMATF_ROW_MAJOR)[0x0000000000000004] but no such conversion is available in DLS Layer FullyConnected got PlainDataFormat(FORMATF_BATCH_ROW_MAJOR)[0x0000000000000015] while it prefers PlainDataFormat(FORMATF_ROW_MAJOR)[0x0000000000000004] but no such conversion is available in DLS Layer FullyConnected got PlainDataFormat(FORMATF_BATCH_ROW_MAJOR)[0x0000000000000015] while it prefers PlainDataFormat(FORMATF_ROW_MAJOR)[0x0000000000000004] but no such conversion is available in DLS Selected kernel Data for layer Data : Selected kernel Data for layer Data : Selected kernel Data for layer Data : Selected kernel Data for layer Data : Selected kernel Data for layer Data : Selected kernel Data for layer Data : Selected kernel Data for layer Data : Selected kernel Data for layer Data : Selected kernel Data for layer Data : Selected kernel Data for layer Data : Selected kernel Data for layer Data : Selected kernel Data for layer Data : Selected kernel Data for layer Data : Selected kernel Data for layer Data : Selected kernel Input for layer Input : Lhs input Selected kernel Data for layer Data : Selected kernel Data for layer Data : Selected kernel Data for layer Data : Selected kernel Data for layer Data : Selected kernel Input for layer Input : Permute input Selected kernel TransposeBERTVectorized@NEON for layer Transpose : Selected kernel ForwardingKernelReshape for layer Reshape : Selected kernel FCViaConvOne for layer FullyConnected : Selected kernel BinaryOpVectorized[Add]@NEON for layer Add : Selected kernel LayerNormVectorized@NEON for layer LayerNorm : Selected kernel FCViaConvOne for layer FullyConnected : Selected kernel UnaryOpVectorized[Gelu]@NEON for layer Gelu : Selected kernel FCViaConvOne for layer FullyConnected : Selected kernel BinaryOpVectorized[Add]@NEON for layer Add : Selected kernel ForwardingKernelOutput for layer Output : Selected kernel LayerNormVectorized@NEON for layer LayerNorm : Selected kernel FCViaConvOne for layer FullyConnected : Selected kernel ForwardingKernelReshape for layer Reshape : Selected kernel TransposeIndexed for layer Transpose : Allocating 187392 bytes (aligned) Selected kernel ForwardingKernelOutput for layer Output : Considering merge of Add to Input Kernel Input rejected merge Merge of ( Add [1, 122, 128] ) to Lhs input ( Input Input ): Attempt merge failed Merge of ( Transpose [1, 122, 8, 16] ) to Permute input ( Input Input ): Target layer type is not mergeable Merge of ( Reshape [1, 122, 128] ) to ( Transpose TransposeBERTVectorized@NEON ): Target layer type is not mergeable Merge of ( FullyConnected [1, 122, 128] ) to ( Reshape ForwardingKernelReshape ): Target layer type is not mergeable Considering merge of Add to 
FCViaConvOne Merge of ( Add [1, 122, 128] ) to ( FullyConnected FCViaConvOne ): Successful Merge of ( FullyConnected [1, 122, 512] ) to ( LayerNorm LayerNormVectorized@NEON ): Target layer type is not mergeable Merge of ( Gelu [1, 122, 512] ) to ( FullyConnected FCViaConvOne ): Target layer type is not mergeable Merge of ( FullyConnected [1, 122, 128] ) to ( Gelu UnaryOpVectorized[Gelu]@NEON ): Target layer type is not mergeable Considering merge of Add to FCViaConvOne Merge of ( Add [1, 122, 128] ) to ( FullyConnected FCViaConvOne ): Successful Merge of ( FullyConnected [1, 122, 384] ) to ( LayerNorm LayerNormVectorized@NEON ): Target layer type is not mergeable Merge of ( Reshape [1, 122, 3, 8, 16] ) to ( FullyConnected FCViaConvOne ): Target layer type is not mergeable Merge of ( Transpose [3, 1, 8, 122, 16] ) to ( Reshape ForwardingKernelReshape ): Target layer type is not mergeable Merge of ( Output [3, 1, 8, 122, 16] ) to ( Transpose TransposeIndexed ): Target layer type is not mergeable External allocation: allocating 62464 bytes Creating external output: 1 , shape: [1, 122, 128] External allocation: allocating 187392 bytes Creating external output: 0 , shape: [3, 1, 8, 122, 16] Allocating 62464 bytes (aligned) Allocating 62464 bytes (aligned) Allocating 249856 bytes (aligned) Allocating 249856 bytes (aligned) Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Input Running Data Running Data Running Data Running Data Running Input Running TransposeBERTVectorized@NEON Running ForwardingKernelReshape Running FCViaConvOne Tuning ConvTask(batch=122,idepth=1,iheight=1,iwidth=1,ichannels=128,odepth=1,oheight=1,owidth=1,ochannels=128,kdepth=1,kheight=1,kwidth=1,dstride=1,hstride=1,wstride=1,ddilation=1,hdilation=1,wdilation=1,dpad=0,hpad=0,wpad=0,dtype=FLOAT,extbatch=1,mut_w=0) Found preset via lookup ConvOnePreset(in_regs=6,w_regs=4,w_prefetches=0,outf_tile=32,in_mode=MS,d_minibatch=128,n_minibatch=61) Allocating 65536 bytes (aligned) Scratches: 0 @ 0 Running LayerNormVectorized@NEON Running FCViaConvOne Tuning ConvTask(batch=122,idepth=1,iheight=1,iwidth=1,ichannels=128,odepth=1,oheight=1,owidth=1,ochannels=512,kdepth=1,kheight=1,kwidth=1,dstride=1,hstride=1,wstride=1,ddilation=1,hdilation=1,wdilation=1,dpad=0,hpad=0,wpad=0,dtype=FLOAT,extbatch=1,mut_w=0) Found preset via lookup ConvOnePreset(in_regs=6,w_regs=4,w_prefetches=0,outf_tile=64,in_mode=MS,d_minibatch=128,n_minibatch=61) Allocating 262144 bytes (aligned) Scratches: 0 @ 0 Running UnaryOpVectorized[Gelu]@NEON Running FCViaConvOne Tuning ConvTask(batch=122,idepth=1,iheight=1,iwidth=1,ichannels=512,odepth=1,oheight=1,owidth=1,ochannels=128,kdepth=1,kheight=1,kwidth=1,dstride=1,hstride=1,wstride=1,ddilation=1,hdilation=1,wdilation=1,dpad=0,hpad=0,wpad=0,dtype=FLOAT,extbatch=1,mut_w=0) Found preset via lookup ConvOnePreset(in_regs=6,w_regs=4,w_prefetches=0,outf_tile=16,in_mode=MS,d_minibatch=512,n_minibatch=122) Allocating 262144 bytes (aligned) Scratches: 0 @ 0 Running ForwardingKernelOutput Running LayerNormVectorized@NEON Running FCViaConvOne Tuning ConvTask(batch=122,idepth=1,iheight=1,iwidth=1,ichannels=128,odepth=1,oheight=1,owidth=1,ochannels=384,kdepth=1,kheight=1,kwidth=1,dstride=1,hstride=1,wstride=1,ddilation=1,hdilation=1,wdilation=1,dpad=0,hpad=0,wpad=0,dtype=FLOAT,extbatch=1,mut_w=0) Found preset via lookup 
ConvOnePreset(in_regs=7,w_regs=3,w_prefetches=0,outf_tile=48,in_mode=MS,d_minibatch=128,n_minibatch=61) Allocating 196608 bytes (aligned) Scratches: 0 @ 0 Running ForwardingKernelReshape Running TransposeIndexed Running ForwardingKernelOutput Building AIO network from graph graph(%input.110 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu), %scaled_dot_product_attention_3 : Float(1, 8, 122, 16, strides=[15616, 1952, 16, 1], requires_grad=0, device=cpu)): %2 : int[] = prim::Constant[value=[2, 0, 3, 1, 4]]() %3 : int[] = prim::Constant[value=[1, 122, 3, 8, 16]]() %self.self_blocks_0_attn_qkv.bias.2 : Float(384, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_4_attn_qkv.weight.2 : Float(384, 128, strides=[128, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_3_mlp_fc2.weight.5 : Float(128, 512, strides=[512, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %7 : str = prim::Constant[value="none"]() %self.self_blocks_0_mlp_fc1.bias.7 : Float(512, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_3_mlp_fc1.weight.7 : Float(512, 128, strides=[128, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %10 : bool = prim::Constant[value=1]() %11 : float = prim::Constant[value=9.9999999999999995e-07]() %self.self_blocks_0_norm1.weight.8 : Float(128, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %13 : int[] = prim::Constant[value=[128]]() %self.self_blocks_0_norm1.bias.10 : Float(128, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_3_attn_proj.weight.10 : Float(128, 128, strides=[128, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %16 : int[] = prim::Constant[value=[1, 122, 128]]() %17 : int = prim::Constant[value=1]() %18 : int[] = prim::Constant[value=[0, 2, 1, 3]]() %19 : Float(1, 122, 8, 16, strides=[15616, 16, 1952, 1], requires_grad=0, device=cpu) = aten::permute(%scaled_dot_product_attention_3, %18) %input.118 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::reshape(%19, %16) # .1:113:0 %input.122 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::linear(%input.118, %self.self_blocks_3_attn_proj.weight.10, %self.self_blocks_0_norm1.bias.10), scope: __module.self_blocks_3_attn_proj # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 %input.126 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::add(%input.110, %input.122, %17) # .1:118:0 %input.130 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::layer_norm(%input.126, %13, %self.self_blocks_0_norm1.weight.8, %self.self_blocks_0_norm1.bias.10, %11, %10), scope: __module.self_blocks_3_norm2 # /usr/local/lib/python3.10/dist-packages/torch/nn/functional.py:2515:0 %input.134 : Float(1, 122, 512, strides=[62464, 512, 1], requires_grad=0, device=cpu) = aten::linear(%input.130, %self.self_blocks_3_mlp_fc1.weight.7, %self.self_blocks_0_mlp_fc1.bias.7), scope: __module.self_blocks_3_mlp_fc1 # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 %input.138 : Float(1, 122, 512, strides=[62464, 512, 1], requires_grad=0, device=cpu) = aten::gelu(%input.134, %7), scope: __module.self_blocks_3_mlp_act # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/activation.py:685:0 %input.142 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = 
aten::linear(%input.138, %self.self_blocks_3_mlp_fc2.weight.5, %self.self_blocks_0_norm1.bias.10), scope: __module.self_blocks_3_mlp_fc2 # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 %input.146 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::add(%input.126, %input.142, %17) # .1:128:0 %input.150 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::layer_norm(%input.146, %13, %self.self_blocks_0_norm1.weight.8, %self.self_blocks_0_norm1.bias.10, %11, %10), scope: __module.self_blocks_4_norm1 # /usr/local/lib/python3.10/dist-packages/torch/nn/functional.py:2515:0 %self_blocks_4_attn_qkv.2 : Float(1, 122, 384, strides=[46848, 384, 1], requires_grad=0, device=cpu) = aten::linear(%input.150, %self.self_blocks_4_attn_qkv.weight.2, %self.self_blocks_0_attn_qkv.bias.2), scope: __module.self_blocks_4_attn_qkv # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 %reshape_8.1 : Float(1, 122, 3, 8, 16, strides=[46848, 384, 128, 16, 1], requires_grad=0, device=cpu) = aten::reshape(%self_blocks_4_attn_qkv.2, %3) # .1:131:0 %permute_4.1 : Float(3, 1, 8, 122, 16, strides=[128, 46848, 16, 384, 1], requires_grad=0, device=cpu) = aten::permute(%reshape_8.1, %2) # .1:132:0 return (%permute_4.1, %input.146) Adding prim::Constant layer to network Adding prim::Constant layer to network Adding prim::Constant layer to network Adding prim::Constant layer to network Adding prim::Constant layer to network Adding prim::Constant layer to network Adding prim::Constant layer to network Adding prim::Constant layer to network Adding prim::Constant layer to network Adding prim::Constant layer to network Adding prim::Constant layer to network Adding prim::Constant layer to network Adding prim::Constant layer to network Adding prim::Constant layer to network Adding prim::Constant layer to network Adding prim::Constant layer to network Adding prim::Constant layer to network Adding aten::permute layer to network Registering network input: Permute input index: 1 Creating blob for Input layer 142 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 8, 122, 16] Creating blob (executor) for Data layer 143 with type INT32 shape [4] 0 2 1 3 Allocating 16 bytes (aligned) Adding aten::reshape layer to network Creating blob (executor) for Data layer 145 with type INT64 shape [3] [1, 122, 128] Allocating 24 bytes (aligned) Adding aten::linear layer to network Binding inputs for Linear layer Weight 0xaaab0e0f3b40 , Bias 0xaaab0dd9a000 Creating blob for Data layer 147 with type FLOAT format PlainDataFormat(FORMATF_ROW_MAJOR)[0x0000000000000004] shape [128, 128] Creating blob for Data layer 148 with type FLOAT format PlainDataFormat(FORMATF_LINEAR)[0x0000000000000001] shape [128] Adding aten::add layer to network Registering network input: Lhs input index: 0 Creating blob for Input layer 150 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 122, 128] Adding aten::layer_norm layer to network Creating blob (executor) for Data layer 152 with type INT32 shape [1] 128 Allocating 4 bytes (aligned) Creating blob for Data layer 153 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [128] Creating blob for Data layer 154 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [128] Adding aten::linear layer to network Binding inputs for Linear layer Weight 0xaaab0e103bc0 , Bias 0xaaab0dd9bdc0 Creating blob for Data layer 156 
with type FLOAT format PlainDataFormat(FORMATF_ROW_MAJOR)[0x0000000000000004] shape [512, 128] Creating blob for Data layer 157 with type FLOAT format PlainDataFormat(FORMATF_LINEAR)[0x0000000000000001] shape [512] Adding aten::gelu layer to network Adding aten::linear layer to network Binding inputs for Linear layer Weight 0xaaab0e143c40 , Bias 0xaaab0dd9a000 Creating blob for Data layer 160 with type FLOAT format PlainDataFormat(FORMATF_ROW_MAJOR)[0x0000000000000004] shape [128, 512] Creating blob for Data layer 161 with type FLOAT format PlainDataFormat(FORMATF_LINEAR)[0x0000000000000001] shape [128] Adding aten::add layer to network Adding aten::layer_norm layer to network Creating blob (executor) for Data layer 164 with type INT32 shape [1] 128 Allocating 4 bytes (aligned) Creating blob for Data layer 165 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [128] Creating blob for Data layer 166 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [128] Adding aten::linear layer to network Binding inputs for Linear layer Weight 0xaaab0e183cc0 , Bias 0xaaab0dd9b2c0 Creating blob for Data layer 168 with type FLOAT format PlainDataFormat(FORMATF_ROW_MAJOR)[0x0000000000000004] shape [384, 128] Creating blob for Data layer 169 with type FLOAT format PlainDataFormat(FORMATF_LINEAR)[0x0000000000000001] shape [384] Adding aten::reshape layer to network Creating blob (executor) for Data layer 171 with type INT64 shape [5] [1, 122, 3, 8, 16] Allocating 40 bytes (aligned) Adding aten::permute layer to network Creating blob (executor) for Data layer 173 with type INT32 shape [5] 2 0 3 1 4 Allocating 20 bytes (aligned) Running AIO Network Layer FullyConnected got PlainDataFormat(FORMATF_BATCH_ROW_MAJOR)[0x0000000000000015] while it prefers PlainDataFormat(FORMATF_ROW_MAJOR)[0x0000000000000004] but no such conversion is available in DLS Layer FullyConnected got PlainDataFormat(FORMATF_BATCH_ROW_MAJOR)[0x0000000000000015] while it prefers PlainDataFormat(FORMATF_ROW_MAJOR)[0x0000000000000004] but no such conversion is available in DLS Layer FullyConnected got PlainDataFormat(FORMATF_BATCH_ROW_MAJOR)[0x0000000000000015] while it prefers PlainDataFormat(FORMATF_ROW_MAJOR)[0x0000000000000004] but no such conversion is available in DLS Layer FullyConnected got PlainDataFormat(FORMATF_BATCH_ROW_MAJOR)[0x0000000000000015] while it prefers PlainDataFormat(FORMATF_ROW_MAJOR)[0x0000000000000004] but no such conversion is available in DLS Selected kernel Data for layer Data : Selected kernel Data for layer Data : Selected kernel Data for layer Data : Selected kernel Data for layer Data : Selected kernel Data for layer Data : Selected kernel Data for layer Data : Selected kernel Data for layer Data : Selected kernel Data for layer Data : Selected kernel Data for layer Data : Selected kernel Data for layer Data : Selected kernel Data for layer Data : Selected kernel Data for layer Data : Selected kernel Data for layer Data : Selected kernel Data for layer Data : Selected kernel Input for layer Input : Lhs input Selected kernel Data for layer Data : Selected kernel Data for layer Data : Selected kernel Data for layer Data : Selected kernel Data for layer Data : Selected kernel Input for layer Input : Permute input Selected kernel TransposeBERTVectorized@NEON for layer Transpose : Selected kernel ForwardingKernelReshape for layer Reshape : Selected kernel FCViaConvOne for layer FullyConnected : Selected kernel BinaryOpVectorized[Add]@NEON for layer Add : Selected 
kernel LayerNormVectorized@NEON for layer LayerNorm : Selected kernel FCViaConvOne for layer FullyConnected : Selected kernel UnaryOpVectorized[Gelu]@NEON for layer Gelu : Selected kernel FCViaConvOne for layer FullyConnected : Selected kernel BinaryOpVectorized[Add]@NEON for layer Add : Selected kernel ForwardingKernelOutput for layer Output : Selected kernel LayerNormVectorized@NEON for layer LayerNorm : Selected kernel FCViaConvOne for layer FullyConnected : Selected kernel ForwardingKernelReshape for layer Reshape : Selected kernel TransposeIndexed for layer Transpose : Allocating 187392 bytes (aligned) Selected kernel ForwardingKernelOutput for layer Output : Considering merge of Add to Input Kernel Input rejected merge Merge of ( Add [1, 122, 128] ) to Lhs input ( Input Input ): Attempt merge failed Merge of ( Transpose [1, 122, 8, 16] ) to Permute input ( Input Input ): Target layer type is not mergeable Merge of ( Reshape [1, 122, 128] ) to ( Transpose TransposeBERTVectorized@NEON ): Target layer type is not mergeable Merge of ( FullyConnected [1, 122, 128] ) to ( Reshape ForwardingKernelReshape ): Target layer type is not mergeable Considering merge of Add to FCViaConvOne Merge of ( Add [1, 122, 128] ) to ( FullyConnected FCViaConvOne ): Successful Merge of ( FullyConnected [1, 122, 512] ) to ( LayerNorm LayerNormVectorized@NEON ): Target layer type is not mergeable Merge of ( Gelu [1, 122, 512] ) to ( FullyConnected FCViaConvOne ): Target layer type is not mergeable Merge of ( FullyConnected [1, 122, 128] ) to ( Gelu UnaryOpVectorized[Gelu]@NEON ): Target layer type is not mergeable Considering merge of Add to FCViaConvOne Merge of ( Add [1, 122, 128] ) to ( FullyConnected FCViaConvOne ): Successful Merge of ( FullyConnected [1, 122, 384] ) to ( LayerNorm LayerNormVectorized@NEON ): Target layer type is not mergeable Merge of ( Reshape [1, 122, 3, 8, 16] ) to ( FullyConnected FCViaConvOne ): Target layer type is not mergeable Merge of ( Transpose [3, 1, 8, 122, 16] ) to ( Reshape ForwardingKernelReshape ): Target layer type is not mergeable Merge of ( Output [3, 1, 8, 122, 16] ) to ( Transpose TransposeIndexed ): Target layer type is not mergeable External allocation: allocating 62464 bytes Creating external output: 1 , shape: [1, 122, 128] External allocation: allocating 187392 bytes Creating external output: 0 , shape: [3, 1, 8, 122, 16] Allocating 62464 bytes (aligned) Allocating 62464 bytes (aligned) Allocating 249856 bytes (aligned) Allocating 249856 bytes (aligned) Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Input Running Data Running Data Running Data Running Data Running Input Running TransposeBERTVectorized@NEON Running ForwardingKernelReshape Running FCViaConvOne Tuning ConvTask(batch=122,idepth=1,iheight=1,iwidth=1,ichannels=128,odepth=1,oheight=1,owidth=1,ochannels=128,kdepth=1,kheight=1,kwidth=1,dstride=1,hstride=1,wstride=1,ddilation=1,hdilation=1,wdilation=1,dpad=0,hpad=0,wpad=0,dtype=FLOAT,extbatch=1,mut_w=0) Found preset via lookup ConvOnePreset(in_regs=6,w_regs=4,w_prefetches=0,outf_tile=32,in_mode=MS,d_minibatch=128,n_minibatch=61) Allocating 65536 bytes (aligned) Scratches: 0 @ 0 Running LayerNormVectorized@NEON Running FCViaConvOne Tuning 
ConvTask(batch=122,idepth=1,iheight=1,iwidth=1,ichannels=128,odepth=1,oheight=1,owidth=1,ochannels=512,kdepth=1,kheight=1,kwidth=1,dstride=1,hstride=1,wstride=1,ddilation=1,hdilation=1,wdilation=1,dpad=0,hpad=0,wpad=0,dtype=FLOAT,extbatch=1,mut_w=0) Found preset via lookup ConvOnePreset(in_regs=6,w_regs=4,w_prefetches=0,outf_tile=64,in_mode=MS,d_minibatch=128,n_minibatch=61) Allocating 262144 bytes (aligned) Scratches: 0 @ 0 Running UnaryOpVectorized[Gelu]@NEON Running FCViaConvOne Tuning ConvTask(batch=122,idepth=1,iheight=1,iwidth=1,ichannels=512,odepth=1,oheight=1,owidth=1,ochannels=128,kdepth=1,kheight=1,kwidth=1,dstride=1,hstride=1,wstride=1,ddilation=1,hdilation=1,wdilation=1,dpad=0,hpad=0,wpad=0,dtype=FLOAT,extbatch=1,mut_w=0) Found preset via lookup ConvOnePreset(in_regs=6,w_regs=4,w_prefetches=0,outf_tile=16,in_mode=MS,d_minibatch=512,n_minibatch=122) Allocating 262144 bytes (aligned) Scratches: 0 @ 0 Running ForwardingKernelOutput Running LayerNormVectorized@NEON Running FCViaConvOne Tuning ConvTask(batch=122,idepth=1,iheight=1,iwidth=1,ichannels=128,odepth=1,oheight=1,owidth=1,ochannels=384,kdepth=1,kheight=1,kwidth=1,dstride=1,hstride=1,wstride=1,ddilation=1,hdilation=1,wdilation=1,dpad=0,hpad=0,wpad=0,dtype=FLOAT,extbatch=1,mut_w=0) Found preset via lookup ConvOnePreset(in_regs=7,w_regs=3,w_prefetches=0,outf_tile=48,in_mode=MS,d_minibatch=128,n_minibatch=61) Allocating 196608 bytes (aligned) Scratches: 0 @ 0 Running ForwardingKernelReshape Running TransposeIndexed Running ForwardingKernelOutput Building AIO network from graph graph(%input.146 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu), %scaled_dot_product_attention_4 : Float(1, 8, 122, 16, strides=[15616, 1952, 16, 1], requires_grad=0, device=cpu)): %2 : int[] = prim::Constant[value=[2, 0, 3, 1, 4]]() %3 : int[] = prim::Constant[value=[1, 122, 3, 8, 16]]() %self.self_blocks_0_attn_qkv.bias.2 : Float(384, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_5_attn_qkv.weight.2 : Float(384, 128, strides=[128, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_4_mlp_fc2.weight.5 : Float(128, 512, strides=[512, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %7 : str = prim::Constant[value="none"]() %self.self_blocks_0_mlp_fc1.bias.7 : Float(512, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_4_mlp_fc1.weight.7 : Float(512, 128, strides=[128, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %10 : bool = prim::Constant[value=1]() %11 : float = prim::Constant[value=9.9999999999999995e-07]() %self.self_blocks_0_norm1.weight.8 : Float(128, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %13 : int[] = prim::Constant[value=[128]]() %self.self_blocks_0_norm1.bias.10 : Float(128, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_4_attn_proj.weight.10 : Float(128, 128, strides=[128, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %16 : int[] = prim::Constant[value=[1, 122, 128]]() %17 : int = prim::Constant[value=1]() %18 : int[] = prim::Constant[value=[0, 2, 1, 3]]() %19 : Float(1, 122, 8, 16, strides=[15616, 16, 1952, 1], requires_grad=0, device=cpu) = aten::permute(%scaled_dot_product_attention_4, %18) %input.154 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::reshape(%19, %16) # .1:141:0 %input.158 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = 
aten::linear(%input.154, %self.self_blocks_4_attn_proj.weight.10, %self.self_blocks_0_norm1.bias.10), scope: __module.self_blocks_4_attn_proj # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 %input.162 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::add(%input.146, %input.158, %17) # .1:146:0 %input.166 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::layer_norm(%input.162, %13, %self.self_blocks_0_norm1.weight.8, %self.self_blocks_0_norm1.bias.10, %11, %10), scope: __module.self_blocks_4_norm2 # /usr/local/lib/python3.10/dist-packages/torch/nn/functional.py:2515:0 %input.170 : Float(1, 122, 512, strides=[62464, 512, 1], requires_grad=0, device=cpu) = aten::linear(%input.166, %self.self_blocks_4_mlp_fc1.weight.7, %self.self_blocks_0_mlp_fc1.bias.7), scope: __module.self_blocks_4_mlp_fc1 # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 %input.174 : Float(1, 122, 512, strides=[62464, 512, 1], requires_grad=0, device=cpu) = aten::gelu(%input.170, %7), scope: __module.self_blocks_4_mlp_act # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/activation.py:685:0 %input.178 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::linear(%input.174, %self.self_blocks_4_mlp_fc2.weight.5, %self.self_blocks_0_norm1.bias.10), scope: __module.self_blocks_4_mlp_fc2 # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 %input.182 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::add(%input.162, %input.178, %17) # .1:156:0 %input.186 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::layer_norm(%input.182, %13, %self.self_blocks_0_norm1.weight.8, %self.self_blocks_0_norm1.bias.10, %11, %10), scope: __module.self_blocks_5_norm1 # /usr/local/lib/python3.10/dist-packages/torch/nn/functional.py:2515:0 %self_blocks_5_attn_qkv.2 : Float(1, 122, 384, strides=[46848, 384, 1], requires_grad=0, device=cpu) = aten::linear(%input.186, %self.self_blocks_5_attn_qkv.weight.2, %self.self_blocks_0_attn_qkv.bias.2), scope: __module.self_blocks_5_attn_qkv # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 %reshape_10.1 : Float(1, 122, 3, 8, 16, strides=[46848, 384, 128, 16, 1], requires_grad=0, device=cpu) = aten::reshape(%self_blocks_5_attn_qkv.2, %3) # .1:159:0 %permute_5.1 : Float(3, 1, 8, 122, 16, strides=[128, 46848, 16, 384, 1], requires_grad=0, device=cpu) = aten::permute(%reshape_10.1, %2) # .1:160:0 return (%permute_5.1, %input.182) Adding prim::Constant layer to network Adding prim::Constant layer to network Adding prim::Constant layer to network Adding prim::Constant layer to network Adding prim::Constant layer to network Adding prim::Constant layer to network Adding prim::Constant layer to network Adding prim::Constant layer to network Adding prim::Constant layer to network Adding prim::Constant layer to network Adding prim::Constant layer to network Adding prim::Constant layer to network Adding prim::Constant layer to network Adding prim::Constant layer to network Adding prim::Constant layer to network Adding prim::Constant layer to network Adding prim::Constant layer to network Adding aten::permute layer to network Registering network input: Permute input index: 1 Creating blob for Input layer 177 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 8, 122, 16] Creating blob (executor) for Data layer 178 with type INT32 
shape [4] 0 2 1 3 Allocating 16 bytes (aligned) Adding aten::reshape layer to network Creating blob (executor) for Data layer 180 with type INT64 shape [3] [1, 122, 128] Allocating 24 bytes (aligned) Adding aten::linear layer to network Binding inputs for Linear layer Weight 0xaaab0e1b3d40 , Bias 0xaaab0dd9a000 Creating blob for Data layer 182 with type FLOAT format PlainDataFormat(FORMATF_ROW_MAJOR)[0x0000000000000004] shape [128, 128] Creating blob for Data layer 183 with type FLOAT format PlainDataFormat(FORMATF_LINEAR)[0x0000000000000001] shape [128] Adding aten::add layer to network Registering network input: Lhs input index: 0 Creating blob for Input layer 185 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 122, 128] Adding aten::layer_norm layer to network Creating blob (executor) for Data layer 187 with type INT32 shape [1] 128 Allocating 4 bytes (aligned) Creating blob for Data layer 188 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [128] Creating blob for Data layer 189 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [128] Adding aten::linear layer to network Binding inputs for Linear layer Weight 0xaaab0e1c3dc0 , Bias 0xaaab0dd9bdc0 Creating blob for Data layer 191 with type FLOAT format PlainDataFormat(FORMATF_ROW_MAJOR)[0x0000000000000004] shape [512, 128] Creating blob for Data layer 192 with type FLOAT format PlainDataFormat(FORMATF_LINEAR)[0x0000000000000001] shape [512] Adding aten::gelu layer to network Adding aten::linear layer to network Binding inputs for Linear layer Weight 0xaaab0e203e40 , Bias 0xaaab0dd9a000 Creating blob for Data layer 195 with type FLOAT format PlainDataFormat(FORMATF_ROW_MAJOR)[0x0000000000000004] shape [128, 512] Creating blob for Data layer 196 with type FLOAT format PlainDataFormat(FORMATF_LINEAR)[0x0000000000000001] shape [128] Adding aten::add layer to network Adding aten::layer_norm layer to network Creating blob (executor) for Data layer 199 with type INT32 shape [1] 128 Allocating 4 bytes (aligned) Creating blob for Data layer 200 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [128] Creating blob for Data layer 201 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [128] Adding aten::linear layer to network Binding inputs for Linear layer Weight 0xaaab0e243ec0 , Bias 0xaaab0dd9b2c0 Creating blob for Data layer 203 with type FLOAT format PlainDataFormat(FORMATF_ROW_MAJOR)[0x0000000000000004] shape [384, 128] Creating blob for Data layer 204 with type FLOAT format PlainDataFormat(FORMATF_LINEAR)[0x0000000000000001] shape [384] Adding aten::reshape layer to network Creating blob (executor) for Data layer 206 with type INT64 shape [5] [1, 122, 3, 8, 16] Allocating 40 bytes (aligned) Adding aten::permute layer to network Creating blob (executor) for Data layer 208 with type INT32 shape [5] 2 0 3 1 4 Allocating 20 bytes (aligned) Running AIO Network Layer FullyConnected got PlainDataFormat(FORMATF_BATCH_ROW_MAJOR)[0x0000000000000015] while it prefers PlainDataFormat(FORMATF_ROW_MAJOR)[0x0000000000000004] but no such conversion is available in DLS Layer FullyConnected got PlainDataFormat(FORMATF_BATCH_ROW_MAJOR)[0x0000000000000015] while it prefers PlainDataFormat(FORMATF_ROW_MAJOR)[0x0000000000000004] but no such conversion is available in DLS Layer FullyConnected got PlainDataFormat(FORMATF_BATCH_ROW_MAJOR)[0x0000000000000015] while it prefers 
PlainDataFormat(FORMATF_ROW_MAJOR)[0x0000000000000004] but no such conversion is available in DLS Layer FullyConnected got PlainDataFormat(FORMATF_BATCH_ROW_MAJOR)[0x0000000000000015] while it prefers PlainDataFormat(FORMATF_ROW_MAJOR)[0x0000000000000004] but no such conversion is available in DLS Selected kernel Data for layer Data : Selected kernel Data for layer Data : Selected kernel Data for layer Data : Selected kernel Data for layer Data : Selected kernel Data for layer Data : Selected kernel Data for layer Data : Selected kernel Data for layer Data : Selected kernel Data for layer Data : Selected kernel Data for layer Data : Selected kernel Data for layer Data : Selected kernel Data for layer Data : Selected kernel Data for layer Data : Selected kernel Data for layer Data : Selected kernel Data for layer Data : Selected kernel Input for layer Input : Lhs input Selected kernel Data for layer Data : Selected kernel Data for layer Data : Selected kernel Data for layer Data : Selected kernel Data for layer Data : Selected kernel Input for layer Input : Permute input Selected kernel TransposeBERTVectorized@NEON for layer Transpose : Selected kernel ForwardingKernelReshape for layer Reshape : Selected kernel FCViaConvOne for layer FullyConnected : Selected kernel BinaryOpVectorized[Add]@NEON for layer Add : Selected kernel LayerNormVectorized@NEON for layer LayerNorm : Selected kernel FCViaConvOne for layer FullyConnected : Selected kernel UnaryOpVectorized[Gelu]@NEON for layer Gelu : Selected kernel FCViaConvOne for layer FullyConnected : Selected kernel BinaryOpVectorized[Add]@NEON for layer Add : Selected kernel ForwardingKernelOutput for layer Output : Selected kernel LayerNormVectorized@NEON for layer LayerNorm : Selected kernel FCViaConvOne for layer FullyConnected : Selected kernel ForwardingKernelReshape for layer Reshape : Selected kernel TransposeIndexed for layer Transpose : Allocating 187392 bytes (aligned) Selected kernel ForwardingKernelOutput for layer Output : Considering merge of Add to Input Kernel Input rejected merge Merge of ( Add [1, 122, 128] ) to Lhs input ( Input Input ): Attempt merge failed Merge of ( Transpose [1, 122, 8, 16] ) to Permute input ( Input Input ): Target layer type is not mergeable Merge of ( Reshape [1, 122, 128] ) to ( Transpose TransposeBERTVectorized@NEON ): Target layer type is not mergeable Merge of ( FullyConnected [1, 122, 128] ) to ( Reshape ForwardingKernelReshape ): Target layer type is not mergeable Considering merge of Add to FCViaConvOne Merge of ( Add [1, 122, 128] ) to ( FullyConnected FCViaConvOne ): Successful Merge of ( FullyConnected [1, 122, 512] ) to ( LayerNorm LayerNormVectorized@NEON ): Target layer type is not mergeable Merge of ( Gelu [1, 122, 512] ) to ( FullyConnected FCViaConvOne ): Target layer type is not mergeable Merge of ( FullyConnected [1, 122, 128] ) to ( Gelu UnaryOpVectorized[Gelu]@NEON ): Target layer type is not mergeable Considering merge of Add to FCViaConvOne Merge of ( Add [1, 122, 128] ) to ( FullyConnected FCViaConvOne ): Successful Merge of ( FullyConnected [1, 122, 384] ) to ( LayerNorm LayerNormVectorized@NEON ): Target layer type is not mergeable Merge of ( Reshape [1, 122, 3, 8, 16] ) to ( FullyConnected FCViaConvOne ): Target layer type is not mergeable Merge of ( Transpose [3, 1, 8, 122, 16] ) to ( Reshape ForwardingKernelReshape ): Target layer type is not mergeable Merge of ( Output [3, 1, 8, 122, 16] ) to ( Transpose TransposeIndexed ): Target layer type is not mergeable External 
allocation: allocating 62464 bytes Creating external output: 1 , shape: [1, 122, 128] External allocation: allocating 187392 bytes Creating external output: 0 , shape: [3, 1, 8, 122, 16] Allocating 62464 bytes (aligned) Allocating 62464 bytes (aligned) Allocating 249856 bytes (aligned) Allocating 249856 bytes (aligned) Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Input Running Data Running Data Running Data Running Data Running Input Running TransposeBERTVectorized@NEON Running ForwardingKernelReshape Running FCViaConvOne Tuning ConvTask(batch=122,idepth=1,iheight=1,iwidth=1,ichannels=128,odepth=1,oheight=1,owidth=1,ochannels=128,kdepth=1,kheight=1,kwidth=1,dstride=1,hstride=1,wstride=1,ddilation=1,hdilation=1,wdilation=1,dpad=0,hpad=0,wpad=0,dtype=FLOAT,extbatch=1,mut_w=0) Found preset via lookup ConvOnePreset(in_regs=6,w_regs=4,w_prefetches=0,outf_tile=32,in_mode=MS,d_minibatch=128,n_minibatch=61) Allocating 65536 bytes (aligned) Scratches: 0 @ 0 Running LayerNormVectorized@NEON Running FCViaConvOne Tuning ConvTask(batch=122,idepth=1,iheight=1,iwidth=1,ichannels=128,odepth=1,oheight=1,owidth=1,ochannels=512,kdepth=1,kheight=1,kwidth=1,dstride=1,hstride=1,wstride=1,ddilation=1,hdilation=1,wdilation=1,dpad=0,hpad=0,wpad=0,dtype=FLOAT,extbatch=1,mut_w=0) Found preset via lookup ConvOnePreset(in_regs=6,w_regs=4,w_prefetches=0,outf_tile=64,in_mode=MS,d_minibatch=128,n_minibatch=61) Allocating 262144 bytes (aligned) Scratches: 0 @ 0 Running UnaryOpVectorized[Gelu]@NEON Running FCViaConvOne Tuning ConvTask(batch=122,idepth=1,iheight=1,iwidth=1,ichannels=512,odepth=1,oheight=1,owidth=1,ochannels=128,kdepth=1,kheight=1,kwidth=1,dstride=1,hstride=1,wstride=1,ddilation=1,hdilation=1,wdilation=1,dpad=0,hpad=0,wpad=0,dtype=FLOAT,extbatch=1,mut_w=0) Found preset via lookup ConvOnePreset(in_regs=6,w_regs=4,w_prefetches=0,outf_tile=16,in_mode=MS,d_minibatch=512,n_minibatch=122) Allocating 262144 bytes (aligned) Scratches: 0 @ 0 Running ForwardingKernelOutput Running LayerNormVectorized@NEON Running FCViaConvOne Tuning ConvTask(batch=122,idepth=1,iheight=1,iwidth=1,ichannels=128,odepth=1,oheight=1,owidth=1,ochannels=384,kdepth=1,kheight=1,kwidth=1,dstride=1,hstride=1,wstride=1,ddilation=1,hdilation=1,wdilation=1,dpad=0,hpad=0,wpad=0,dtype=FLOAT,extbatch=1,mut_w=0) Found preset via lookup ConvOnePreset(in_regs=7,w_regs=3,w_prefetches=0,outf_tile=48,in_mode=MS,d_minibatch=128,n_minibatch=61) Allocating 196608 bytes (aligned) Scratches: 0 @ 0 Running ForwardingKernelReshape Running TransposeIndexed Running ForwardingKernelOutput Building AIO network from graph graph(%input.182 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu), %scaled_dot_product_attention_5 : Float(1, 8, 122, 16, strides=[15616, 1952, 16, 1], requires_grad=0, device=cpu)): %2 : int[] = prim::Constant[value=[2, 0, 3, 1, 4]]() %3 : int[] = prim::Constant[value=[1, 122, 3, 8, 16]]() %self.self_blocks_0_attn_qkv.bias.2 : Float(384, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_6_attn_qkv.weight.2 : Float(384, 128, strides=[128, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_5_mlp_fc2.weight.5 : Float(128, 512, strides=[512, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %7 : str = prim::Constant[value="none"]() %self.self_blocks_0_mlp_fc1.bias.7 : Float(512, strides=[1], requires_grad=0, 
device=cpu) = prim::Constant[value=]() %self.self_blocks_5_mlp_fc1.weight.7 : Float(512, 128, strides=[128, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %10 : bool = prim::Constant[value=1]() %11 : float = prim::Constant[value=9.9999999999999995e-07]() %self.self_blocks_0_norm1.weight.8 : Float(128, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %13 : int[] = prim::Constant[value=[128]]() %self.self_blocks_0_norm1.bias.10 : Float(128, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_5_attn_proj.weight.10 : Float(128, 128, strides=[128, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %16 : int[] = prim::Constant[value=[1, 122, 128]]() %17 : int = prim::Constant[value=1]() %18 : int[] = prim::Constant[value=[0, 2, 1, 3]]() %19 : Float(1, 122, 8, 16, strides=[15616, 16, 1952, 1], requires_grad=0, device=cpu) = aten::permute(%scaled_dot_product_attention_5, %18) %input.190 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::reshape(%19, %16) # .1:169:0 %input.194 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::linear(%input.190, %self.self_blocks_5_attn_proj.weight.10, %self.self_blocks_0_norm1.bias.10), scope: __module.self_blocks_5_attn_proj # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 %input.198 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::add(%input.182, %input.194, %17) # .1:174:0 %input.202 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::layer_norm(%input.198, %13, %self.self_blocks_0_norm1.weight.8, %self.self_blocks_0_norm1.bias.10, %11, %10), scope: __module.self_blocks_5_norm2 # /usr/local/lib/python3.10/dist-packages/torch/nn/functional.py:2515:0 %input.206 : Float(1, 122, 512, strides=[62464, 512, 1], requires_grad=0, device=cpu) = aten::linear(%input.202, %self.self_blocks_5_mlp_fc1.weight.7, %self.self_blocks_0_mlp_fc1.bias.7), scope: __module.self_blocks_5_mlp_fc1 # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 %input.210 : Float(1, 122, 512, strides=[62464, 512, 1], requires_grad=0, device=cpu) = aten::gelu(%input.206, %7), scope: __module.self_blocks_5_mlp_act # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/activation.py:685:0 %input.214 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::linear(%input.210, %self.self_blocks_5_mlp_fc2.weight.5, %self.self_blocks_0_norm1.bias.10), scope: __module.self_blocks_5_mlp_fc2 # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 %input.218 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::add(%input.198, %input.214, %17) # .1:184:0 %input.222 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::layer_norm(%input.218, %13, %self.self_blocks_0_norm1.weight.8, %self.self_blocks_0_norm1.bias.10, %11, %10), scope: __module.self_blocks_6_norm1 # /usr/local/lib/python3.10/dist-packages/torch/nn/functional.py:2515:0 %self_blocks_6_attn_qkv.2 : Float(1, 122, 384, strides=[46848, 384, 1], requires_grad=0, device=cpu) = aten::linear(%input.222, %self.self_blocks_6_attn_qkv.weight.2, %self.self_blocks_0_attn_qkv.bias.2), scope: __module.self_blocks_6_attn_qkv # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 %reshape_12.1 : Float(1, 122, 3, 8, 16, strides=[46848, 384, 128, 16, 1], requires_grad=0, device=cpu) = 
aten::reshape(%self_blocks_6_attn_qkv.2, %3) # .1:187:0 %permute_6.1 : Float(3, 1, 8, 122, 16, strides=[128, 46848, 16, 384, 1], requires_grad=0, device=cpu) = aten::permute(%reshape_12.1, %2) # .1:188:0 return (%permute_6.1, %input.218) Adding prim::Constant layer to network Adding prim::Constant layer to network Adding prim::Constant layer to network Adding prim::Constant layer to network Adding prim::Constant layer to network Adding prim::Constant layer to network Adding prim::Constant layer to network Adding prim::Constant layer to network Adding prim::Constant layer to network Adding prim::Constant layer to network Adding prim::Constant layer to network Adding prim::Constant layer to network Adding prim::Constant layer to network Adding prim::Constant layer to network Adding prim::Constant layer to network Adding prim::Constant layer to network Adding prim::Constant layer to network Adding aten::permute layer to network Registering network input: Permute input index: 1 Creating blob for Input layer 212 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 8, 122, 16] Creating blob (executor) for Data layer 213 with type INT32 shape [4] 0 2 1 3 Allocating 16 bytes (aligned) Adding aten::reshape layer to network Creating blob (executor) for Data layer 215 with type INT64 shape [3] [1, 122, 128] Allocating 24 bytes (aligned) Adding aten::linear layer to network Binding inputs for Linear layer Weight 0xaaab0e273f40 , Bias 0xaaab0dd9a000 Creating blob for Data layer 217 with type FLOAT format PlainDataFormat(FORMATF_ROW_MAJOR)[0x0000000000000004] shape [128, 128] Creating blob for Data layer 218 with type FLOAT format PlainDataFormat(FORMATF_LINEAR)[0x0000000000000001] shape [128] Adding aten::add layer to network Registering network input: Lhs input index: 0 Creating blob for Input layer 220 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 122, 128] Adding aten::layer_norm layer to network Creating blob (executor) for Data layer 222 with type INT32 shape [1] 128 Allocating 4 bytes (aligned) Creating blob for Data layer 223 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [128] Creating blob for Data layer 224 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [128] Adding aten::linear layer to network Binding inputs for Linear layer Weight 0xaaab0e283fc0 , Bias 0xaaab0dd9bdc0 Creating blob for Data layer 226 with type FLOAT format PlainDataFormat(FORMATF_ROW_MAJOR)[0x0000000000000004] shape [512, 128] Creating blob for Data layer 227 with type FLOAT format PlainDataFormat(FORMATF_LINEAR)[0x0000000000000001] shape [512] Adding aten::gelu layer to network Adding aten::linear layer to network Binding inputs for Linear layer Weight 0xaaab0e2c4040 , Bias 0xaaab0dd9a000 Creating blob for Data layer 230 with type FLOAT format PlainDataFormat(FORMATF_ROW_MAJOR)[0x0000000000000004] shape [128, 512] Creating blob for Data layer 231 with type FLOAT format PlainDataFormat(FORMATF_LINEAR)[0x0000000000000001] shape [128] Adding aten::add layer to network Adding aten::layer_norm layer to network Creating blob (executor) for Data layer 234 with type INT32 shape [1] 128 Allocating 4 bytes (aligned) Creating blob for Data layer 235 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [128] Creating blob for Data layer 236 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [128] Adding aten::linear layer to network Binding inputs for Linear 
layer Weight 0xaaab0e3040c0 , Bias 0xaaab0dd9b2c0 Creating blob for Data layer 238 with type FLOAT format PlainDataFormat(FORMATF_ROW_MAJOR)[0x0000000000000004] shape [384, 128] Creating blob for Data layer 239 with type FLOAT format PlainDataFormat(FORMATF_LINEAR)[0x0000000000000001] shape [384] Adding aten::reshape layer to network Creating blob (executor) for Data layer 241 with type INT64 shape [5] [1, 122, 3, 8, 16] Allocating 40 bytes (aligned) Adding aten::permute layer to network Creating blob (executor) for Data layer 243 with type INT32 shape [5] 2 0 3 1 4 Allocating 20 bytes (aligned) Running AIO Network Layer FullyConnected got PlainDataFormat(FORMATF_BATCH_ROW_MAJOR)[0x0000000000000015] while it prefers PlainDataFormat(FORMATF_ROW_MAJOR)[0x0000000000000004] but no such conversion is available in DLS Layer FullyConnected got PlainDataFormat(FORMATF_BATCH_ROW_MAJOR)[0x0000000000000015] while it prefers PlainDataFormat(FORMATF_ROW_MAJOR)[0x0000000000000004] but no such conversion is available in DLS Layer FullyConnected got PlainDataFormat(FORMATF_BATCH_ROW_MAJOR)[0x0000000000000015] while it prefers PlainDataFormat(FORMATF_ROW_MAJOR)[0x0000000000000004] but no such conversion is available in DLS Layer FullyConnected got PlainDataFormat(FORMATF_BATCH_ROW_MAJOR)[0x0000000000000015] while it prefers PlainDataFormat(FORMATF_ROW_MAJOR)[0x0000000000000004] but no such conversion is available in DLS Selected kernel Data for layer Data : Selected kernel Data for layer Data : Selected kernel Data for layer Data : Selected kernel Data for layer Data : Selected kernel Data for layer Data : Selected kernel Data for layer Data : Selected kernel Data for layer Data : Selected kernel Data for layer Data : Selected kernel Data for layer Data : Selected kernel Data for layer Data : Selected kernel Data for layer Data : Selected kernel Data for layer Data : Selected kernel Data for layer Data : Selected kernel Data for layer Data : Selected kernel Input for layer Input : Lhs input Selected kernel Data for layer Data : Selected kernel Data for layer Data : Selected kernel Data for layer Data : Selected kernel Data for layer Data : Selected kernel Input for layer Input : Permute input Selected kernel TransposeBERTVectorized@NEON for layer Transpose : Selected kernel ForwardingKernelReshape for layer Reshape : Selected kernel FCViaConvOne for layer FullyConnected : Selected kernel BinaryOpVectorized[Add]@NEON for layer Add : Selected kernel LayerNormVectorized@NEON for layer LayerNorm : Selected kernel FCViaConvOne for layer FullyConnected : Selected kernel UnaryOpVectorized[Gelu]@NEON for layer Gelu : Selected kernel FCViaConvOne for layer FullyConnected : Selected kernel BinaryOpVectorized[Add]@NEON for layer Add : Selected kernel ForwardingKernelOutput for layer Output : Selected kernel LayerNormVectorized@NEON for layer LayerNorm : Selected kernel FCViaConvOne for layer FullyConnected : Selected kernel ForwardingKernelReshape for layer Reshape : Selected kernel TransposeIndexed for layer Transpose : Allocating 187392 bytes (aligned) Selected kernel ForwardingKernelOutput for layer Output : Considering merge of Add to Input Kernel Input rejected merge Merge of ( Add [1, 122, 128] ) to Lhs input ( Input Input ): Attempt merge failed Merge of ( Transpose [1, 122, 8, 16] ) to Permute input ( Input Input ): Target layer type is not mergeable Merge of ( Reshape [1, 122, 128] ) to ( Transpose TransposeBERTVectorized@NEON ): Target layer type is not mergeable Merge of ( FullyConnected [1, 122, 128] ) to 
( Reshape ForwardingKernelReshape ): Target layer type is not mergeable Considering merge of Add to FCViaConvOne Merge of ( Add [1, 122, 128] ) to ( FullyConnected FCViaConvOne ): Successful Merge of ( FullyConnected [1, 122, 512] ) to ( LayerNorm LayerNormVectorized@NEON ): Target layer type is not mergeable Merge of ( Gelu [1, 122, 512] ) to ( FullyConnected FCViaConvOne ): Target layer type is not mergeable Merge of ( FullyConnected [1, 122, 128] ) to ( Gelu UnaryOpVectorized[Gelu]@NEON ): Target layer type is not mergeable Considering merge of Add to FCViaConvOne Merge of ( Add [1, 122, 128] ) to ( FullyConnected FCViaConvOne ): Successful Merge of ( FullyConnected [1, 122, 384] ) to ( LayerNorm LayerNormVectorized@NEON ): Target layer type is not mergeable Merge of ( Reshape [1, 122, 3, 8, 16] ) to ( FullyConnected FCViaConvOne ): Target layer type is not mergeable Merge of ( Transpose [3, 1, 8, 122, 16] ) to ( Reshape ForwardingKernelReshape ): Target layer type is not mergeable Merge of ( Output [3, 1, 8, 122, 16] ) to ( Transpose TransposeIndexed ): Target layer type is not mergeable External allocation: allocating 62464 bytes Creating external output: 1 , shape: [1, 122, 128] External allocation: allocating 187392 bytes Creating external output: 0 , shape: [3, 1, 8, 122, 16] Allocating 62464 bytes (aligned) Allocating 62464 bytes (aligned) Allocating 249856 bytes (aligned) Allocating 249856 bytes (aligned) Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Input Running Data Running Data Running Data Running Data Running Input Running TransposeBERTVectorized@NEON Running ForwardingKernelReshape Running FCViaConvOne Tuning ConvTask(batch=122,idepth=1,iheight=1,iwidth=1,ichannels=128,odepth=1,oheight=1,owidth=1,ochannels=128,kdepth=1,kheight=1,kwidth=1,dstride=1,hstride=1,wstride=1,ddilation=1,hdilation=1,wdilation=1,dpad=0,hpad=0,wpad=0,dtype=FLOAT,extbatch=1,mut_w=0) Found preset via lookup ConvOnePreset(in_regs=6,w_regs=4,w_prefetches=0,outf_tile=32,in_mode=MS,d_minibatch=128,n_minibatch=61) Allocating 65536 bytes (aligned) Scratches: 0 @ 0 Running LayerNormVectorized@NEON Running FCViaConvOne Tuning ConvTask(batch=122,idepth=1,iheight=1,iwidth=1,ichannels=128,odepth=1,oheight=1,owidth=1,ochannels=512,kdepth=1,kheight=1,kwidth=1,dstride=1,hstride=1,wstride=1,ddilation=1,hdilation=1,wdilation=1,dpad=0,hpad=0,wpad=0,dtype=FLOAT,extbatch=1,mut_w=0) Found preset via lookup ConvOnePreset(in_regs=6,w_regs=4,w_prefetches=0,outf_tile=64,in_mode=MS,d_minibatch=128,n_minibatch=61) Allocating 262144 bytes (aligned) Scratches: 0 @ 0 Running UnaryOpVectorized[Gelu]@NEON Running FCViaConvOne Tuning ConvTask(batch=122,idepth=1,iheight=1,iwidth=1,ichannels=512,odepth=1,oheight=1,owidth=1,ochannels=128,kdepth=1,kheight=1,kwidth=1,dstride=1,hstride=1,wstride=1,ddilation=1,hdilation=1,wdilation=1,dpad=0,hpad=0,wpad=0,dtype=FLOAT,extbatch=1,mut_w=0) Found preset via lookup ConvOnePreset(in_regs=6,w_regs=4,w_prefetches=0,outf_tile=16,in_mode=MS,d_minibatch=512,n_minibatch=122) Allocating 262144 bytes (aligned) Scratches: 0 @ 0 Running ForwardingKernelOutput Running LayerNormVectorized@NEON Running FCViaConvOne Tuning ConvTask(batch=122,idepth=1,iheight=1,iwidth=1,ichannels=128,odepth=1,oheight=1,owidth=1,ochannels=384,kdepth=1,kheight=1,kwidth=1,dstride=1,hstride=1,wstride=1,ddilation=1,hdilation=1,wdilation=1,dpad=0,hpad=0,wpad=0,dtype=FLOAT,extbatch=1,mut_w=0) 
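Note on the FCViaConvOne entries above: every ConvTask the tuner reports has kdepth=kheight=kwidth=1 and spatial extents of 1, with batch=122 matching the token count and ichannels/ochannels matching the layer widths (128, 512, 384), so the fully-connected layers appear to be lowered to 1x1 convolutions over the tokens; the aligned scratch allocations (65536, 262144 and 196608 bytes) are consistent with float32 copies of the 128x128, 128x512/512x128 and 384x128 weight matrices. The sketch below only illustrates this linear-equals-1x1-convolution equivalence with stock PyTorch; it is not the DLS kernel, and the variable names are made up for the example.

    # Illustrative only: a Linear layer over [batch, tokens, channels] gives the same
    # result as a 1x1 Conv2d applied to the tokens laid out as 1x1 "pixels".
    # Shapes mirror the log: 122 tokens, 128 input channels, 384 output channels (qkv).
    import torch

    tokens, in_ch, out_ch = 122, 128, 384
    x = torch.randn(1, tokens, in_ch)

    fc = torch.nn.Linear(in_ch, out_ch)
    conv1x1 = torch.nn.Conv2d(in_ch, out_ch, kernel_size=1)
    conv1x1.weight.data = fc.weight.data.view(out_ch, in_ch, 1, 1)  # reuse the same weights
    conv1x1.bias.data = fc.bias.data

    y_fc = fc(x)                                    # [1, 122, 384]
    y_cv = conv1x1(x.reshape(tokens, in_ch, 1, 1))  # each token becomes one 1x1 "image"
    y_cv = y_cv.reshape(1, tokens, out_ch)

    print(torch.allclose(y_fc, y_cv, atol=1e-5))    # True: FC == 1x1 convolution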
Found preset via lookup ConvOnePreset(in_regs=7,w_regs=3,w_prefetches=0,outf_tile=48,in_mode=MS,d_minibatch=128,n_minibatch=61)
Allocating 196608 bytes (aligned)
Scratches: 0 @ 0
Running ForwardingKernelReshape
Running TransposeIndexed
Running ForwardingKernelOutput
Building AIO network from graph
graph(%input.218 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu),
      %scaled_dot_product_attention_6 : Float(1, 8, 122, 16, strides=[15616, 1952, 16, 1], requires_grad=0, device=cpu)):
  %2 : int[] = prim::Constant[value=[2, 0, 3, 1, 4]]()
  %3 : int[] = prim::Constant[value=[1, 122, 3, 8, 16]]()
  %self.self_blocks_0_attn_qkv.bias.2 : Float(384, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]()
  %self.self_blocks_7_attn_qkv.weight.2 : Float(384, 128, strides=[128, 1], requires_grad=0, device=cpu) = prim::Constant[value=]()
  %self.self_blocks_6_mlp_fc2.weight.5 : Float(128, 512, strides=[512, 1], requires_grad=0, device=cpu) = prim::Constant[value=]()
  %7 : str = prim::Constant[value="none"]()
  %self.self_blocks_0_mlp_fc1.bias.7 : Float(512, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]()
  %self.self_blocks_6_mlp_fc1.weight.7 : Float(512, 128, strides=[128, 1], requires_grad=0, device=cpu) = prim::Constant[value=]()
  %10 : bool = prim::Constant[value=1]()
  %11 : float = prim::Constant[value=9.9999999999999995e-07]()
  %self.self_blocks_0_norm1.weight.8 : Float(128, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]()
  %13 : int[] = prim::Constant[value=[128]]()
  %self.self_blocks_0_norm1.bias.10 : Float(128, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]()
  %self.self_blocks_6_attn_proj.weight.10 : Float(128, 128, strides=[128, 1], requires_grad=0, device=cpu) = prim::Constant[value=]()
  %16 : int[] = prim::Constant[value=[1, 122, 128]]()
  %17 : int = prim::Constant[value=1]()
  %18 : int[] = prim::Constant[value=[0, 2, 1, 3]]()
  %19 : Float(1, 122, 8, 16, strides=[15616, 16, 1952, 1], requires_grad=0, device=cpu) = aten::permute(%scaled_dot_product_attention_6, %18)
  %input.226 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::reshape(%19, %16) # .1:197:0
  %input.230 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::linear(%input.226, %self.self_blocks_6_attn_proj.weight.10, %self.self_blocks_0_norm1.bias.10), scope: __module.self_blocks_6_attn_proj # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0
  %input.234 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::add(%input.218, %input.230, %17) # .1:202:0
  %input.238 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::layer_norm(%input.234, %13, %self.self_blocks_0_norm1.weight.8, %self.self_blocks_0_norm1.bias.10, %11, %10), scope: __module.self_blocks_6_norm2 # /usr/local/lib/python3.10/dist-packages/torch/nn/functional.py:2515:0
  %input.242 : Float(1, 122, 512, strides=[62464, 512, 1], requires_grad=0, device=cpu) = aten::linear(%input.238, %self.self_blocks_6_mlp_fc1.weight.7, %self.self_blocks_0_mlp_fc1.bias.7), scope: __module.self_blocks_6_mlp_fc1 # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0
  %input.246 : Float(1, 122, 512, strides=[62464, 512, 1], requires_grad=0, device=cpu) = aten::gelu(%input.242, %7), scope: __module.self_blocks_6_mlp_act # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/activation.py:685:0
  %input.250 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::linear(%input.246, %self.self_blocks_6_mlp_fc2.weight.5, %self.self_blocks_0_norm1.bias.10), scope: __module.self_blocks_6_mlp_fc2 # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0
  %input.254 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::add(%input.234, %input.250, %17) # .1:212:0
  %input.258 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::layer_norm(%input.254, %13, %self.self_blocks_0_norm1.weight.8, %self.self_blocks_0_norm1.bias.10, %11, %10), scope: __module.self_blocks_7_norm1 # /usr/local/lib/python3.10/dist-packages/torch/nn/functional.py:2515:0
  %self_blocks_7_attn_qkv.2 : Float(1, 122, 384, strides=[46848, 384, 1], requires_grad=0, device=cpu) = aten::linear(%input.258, %self.self_blocks_7_attn_qkv.weight.2, %self.self_blocks_0_attn_qkv.bias.2), scope: __module.self_blocks_7_attn_qkv # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0
  %reshape_14.1 : Float(1, 122, 3, 8, 16, strides=[46848, 384, 128, 16, 1], requires_grad=0, device=cpu) = aten::reshape(%self_blocks_7_attn_qkv.2, %3) # .1:215:0
  %permute_7.1 : Float(3, 1, 8, 122, 16, strides=[128, 46848, 16, 384, 1], requires_grad=0, device=cpu) = aten::permute(%reshape_14.1, %2) # .1:216:0
  return (%permute_7.1, %input.254)
Adding prim::Constant layer to network
Adding prim::Constant layer to network
Adding prim::Constant layer to network
Adding prim::Constant layer to network
Adding prim::Constant layer to network
Adding prim::Constant layer to network
Adding prim::Constant layer to network
Adding prim::Constant layer to network
Adding prim::Constant layer to network
Adding prim::Constant layer to network
Adding prim::Constant layer to network
Adding prim::Constant layer to network
Adding prim::Constant layer to network
Adding prim::Constant layer to network
Adding prim::Constant layer to network
Adding prim::Constant layer to network
Adding prim::Constant layer to network
Adding aten::permute layer to network
Registering network input: Permute input index: 1
Creating blob for Input layer 247 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 8, 122, 16]
Creating blob (executor) for Data layer 248 with type INT32 shape [4] 0 2 1 3
Allocating 16 bytes (aligned)
Adding aten::reshape layer to network
Creating blob (executor) for Data layer 250 with type INT64 shape [3] [1, 122, 128]
Allocating 24 bytes (aligned)
Adding aten::linear layer to network
Binding inputs for Linear layer Weight 0xaaab0e334140 , Bias 0xaaab0dd9a000
Creating blob for Data layer 252 with type FLOAT format PlainDataFormat(FORMATF_ROW_MAJOR)[0x0000000000000004] shape [128, 128]
Creating blob for Data layer 253 with type FLOAT format PlainDataFormat(FORMATF_LINEAR)[0x0000000000000001] shape [128]
Adding aten::add layer to network
Registering network input: Lhs input index: 0
Creating blob for Input layer 255 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 122, 128]
Adding aten::layer_norm layer to network
Creating blob (executor) for Data layer 257 with type INT32 shape [1] 128
Allocating 4 bytes (aligned)
Creating blob for Data layer 258 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [128]
Creating blob for Data layer 259 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [128]
Adding aten::linear layer to network
Binding inputs for Linear layer Weight 0xaaab0e3441c0 , Bias 0xaaab0dd9bdc0
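Each "Building AIO network from graph" section in this log describes the same per-block computation with different block weights: the scaled-dot-product-attention output is permuted and reshaped back to [1, 122, 128], projected and added to the residual, layer-normalised, passed through a 128 -> 512 -> 128 GELU MLP with a second residual add, layer-normalised again, and projected to a fused [1, 122, 384] qkv tensor that is reshaped and permuted to [3, 1, 8, 122, 16] for the next attention step. The sketch below is a rough PyTorch rendering of one such graph for orientation only; the function, module and weight names are invented, the weights are random, and it is not the network DLS actually builds. Its two outputs match the external allocations reported for each of these networks (1*122*128 and 3*1*8*122*16 float32 elements are 62464 and 187392 bytes).

    # Rough functional equivalent of one per-block graph (illustrative only; names
    # invented, weights random; eps matches the ~1e-06 constant in the graph).
    import torch
    import torch.nn.functional as F

    def block(residual, sdpa_out, w):
        # sdpa_out: [1, 8, 122, 16] -> [1, 122, 128]   (aten::permute + aten::reshape)
        x = sdpa_out.permute(0, 2, 1, 3).reshape(1, 122, 128)
        x = F.linear(x, w["attn_proj"], w["bias128"])           # attention output projection
        x = residual + x                                         # aten::add (residual)
        y = F.layer_norm(x, [128], w["ln_w"], w["ln_b"], 1e-6)   # aten::layer_norm
        y = F.linear(y, w["fc1"], w["bias512"])                  # 128 -> 512
        y = F.gelu(y)                                            # aten::gelu
        y = F.linear(y, w["fc2"], w["bias128"])                  # 512 -> 128
        x = x + y                                                # second residual add
        z = F.layer_norm(x, [128], w["ln_w"], w["ln_b"], 1e-6)
        qkv = F.linear(z, w["qkv"], w["bias384"])                # 128 -> 384 (q, k, v fused)
        qkv = qkv.reshape(1, 122, 3, 8, 16).permute(2, 0, 3, 1, 4)
        return qkv, x    # [3, 1, 8, 122, 16] and [1, 122, 128], the graph's two outputs

    w = {"attn_proj": torch.randn(128, 128), "qkv": torch.randn(384, 128),
         "fc1": torch.randn(512, 128), "fc2": torch.randn(128, 512),
         "ln_w": torch.ones(128), "ln_b": torch.zeros(128),
         "bias128": torch.zeros(128), "bias512": torch.zeros(512), "bias384": torch.zeros(384)}
    qkv, res = block(torch.randn(1, 122, 128), torch.randn(1, 8, 122, 16), w)
    print(qkv.shape, res.shape)  # torch.Size([3, 1, 8, 122, 16]) torch.Size([1, 122, 128])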
Creating blob for Data layer 261 with type FLOAT format PlainDataFormat(FORMATF_ROW_MAJOR)[0x0000000000000004] shape [512, 128] Creating blob for Data layer 262 with type FLOAT format PlainDataFormat(FORMATF_LINEAR)[0x0000000000000001] shape [512] Adding aten::gelu layer to network Adding aten::linear layer to network Binding inputs for Linear layer Weight 0xaaab0e384240 , Bias 0xaaab0dd9a000 Creating blob for Data layer 265 with type FLOAT format PlainDataFormat(FORMATF_ROW_MAJOR)[0x0000000000000004] shape [128, 512] Creating blob for Data layer 266 with type FLOAT format PlainDataFormat(FORMATF_LINEAR)[0x0000000000000001] shape [128] Adding aten::add layer to network Adding aten::layer_norm layer to network Creating blob (executor) for Data layer 269 with type INT32 shape [1] 128 Allocating 4 bytes (aligned) Creating blob for Data layer 270 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [128] Creating blob for Data layer 271 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [128] Adding aten::linear layer to network Binding inputs for Linear layer Weight 0xaaab0e3c42c0 , Bias 0xaaab0dd9b2c0 Creating blob for Data layer 273 with type FLOAT format PlainDataFormat(FORMATF_ROW_MAJOR)[0x0000000000000004] shape [384, 128] Creating blob for Data layer 274 with type FLOAT format PlainDataFormat(FORMATF_LINEAR)[0x0000000000000001] shape [384] Adding aten::reshape layer to network Creating blob (executor) for Data layer 276 with type INT64 shape [5] [1, 122, 3, 8, 16] Allocating 40 bytes (aligned) Adding aten::permute layer to network Creating blob (executor) for Data layer 278 with type INT32 shape [5] 2 0 3 1 4 Allocating 20 bytes (aligned) Running AIO Network Layer FullyConnected got PlainDataFormat(FORMATF_BATCH_ROW_MAJOR)[0x0000000000000015] while it prefers PlainDataFormat(FORMATF_ROW_MAJOR)[0x0000000000000004] but no such conversion is available in DLS Layer FullyConnected got PlainDataFormat(FORMATF_BATCH_ROW_MAJOR)[0x0000000000000015] while it prefers PlainDataFormat(FORMATF_ROW_MAJOR)[0x0000000000000004] but no such conversion is available in DLS Layer FullyConnected got PlainDataFormat(FORMATF_BATCH_ROW_MAJOR)[0x0000000000000015] while it prefers PlainDataFormat(FORMATF_ROW_MAJOR)[0x0000000000000004] but no such conversion is available in DLS Layer FullyConnected got PlainDataFormat(FORMATF_BATCH_ROW_MAJOR)[0x0000000000000015] while it prefers PlainDataFormat(FORMATF_ROW_MAJOR)[0x0000000000000004] but no such conversion is available in DLS Selected kernel Data for layer Data : Selected kernel Data for layer Data : Selected kernel Data for layer Data : Selected kernel Data for layer Data : Selected kernel Data for layer Data : Selected kernel Data for layer Data : Selected kernel Data for layer Data : Selected kernel Data for layer Data : Selected kernel Data for layer Data : Selected kernel Data for layer Data : Selected kernel Data for layer Data : Selected kernel Data for layer Data : Selected kernel Data for layer Data : Selected kernel Data for layer Data : Selected kernel Input for layer Input : Lhs input Selected kernel Data for layer Data : Selected kernel Data for layer Data : Selected kernel Data for layer Data : Selected kernel Data for layer Data : Selected kernel Input for layer Input : Permute input Selected kernel TransposeBERTVectorized@NEON for layer Transpose : Selected kernel ForwardingKernelReshape for layer Reshape : Selected kernel FCViaConvOne for layer FullyConnected : Selected kernel 
BinaryOpVectorized[Add]@NEON for layer Add : Selected kernel LayerNormVectorized@NEON for layer LayerNorm : Selected kernel FCViaConvOne for layer FullyConnected : Selected kernel UnaryOpVectorized[Gelu]@NEON for layer Gelu : Selected kernel FCViaConvOne for layer FullyConnected : Selected kernel BinaryOpVectorized[Add]@NEON for layer Add : Selected kernel ForwardingKernelOutput for layer Output : Selected kernel LayerNormVectorized@NEON for layer LayerNorm : Selected kernel FCViaConvOne for layer FullyConnected : Selected kernel ForwardingKernelReshape for layer Reshape : Selected kernel TransposeIndexed for layer Transpose : Allocating 187392 bytes (aligned) Selected kernel ForwardingKernelOutput for layer Output : Considering merge of Add to Input Kernel Input rejected merge Merge of ( Add [1, 122, 128] ) to Lhs input ( Input Input ): Attempt merge failed Merge of ( Transpose [1, 122, 8, 16] ) to Permute input ( Input Input ): Target layer type is not mergeable Merge of ( Reshape [1, 122, 128] ) to ( Transpose TransposeBERTVectorized@NEON ): Target layer type is not mergeable Merge of ( FullyConnected [1, 122, 128] ) to ( Reshape ForwardingKernelReshape ): Target layer type is not mergeable Considering merge of Add to FCViaConvOne Merge of ( Add [1, 122, 128] ) to ( FullyConnected FCViaConvOne ): Successful Merge of ( FullyConnected [1, 122, 512] ) to ( LayerNorm LayerNormVectorized@NEON ): Target layer type is not mergeable Merge of ( Gelu [1, 122, 512] ) to ( FullyConnected FCViaConvOne ): Target layer type is not mergeable Merge of ( FullyConnected [1, 122, 128] ) to ( Gelu UnaryOpVectorized[Gelu]@NEON ): Target layer type is not mergeable Considering merge of Add to FCViaConvOne Merge of ( Add [1, 122, 128] ) to ( FullyConnected FCViaConvOne ): Successful Merge of ( FullyConnected [1, 122, 384] ) to ( LayerNorm LayerNormVectorized@NEON ): Target layer type is not mergeable Merge of ( Reshape [1, 122, 3, 8, 16] ) to ( FullyConnected FCViaConvOne ): Target layer type is not mergeable Merge of ( Transpose [3, 1, 8, 122, 16] ) to ( Reshape ForwardingKernelReshape ): Target layer type is not mergeable Merge of ( Output [3, 1, 8, 122, 16] ) to ( Transpose TransposeIndexed ): Target layer type is not mergeable External allocation: allocating 62464 bytes Creating external output: 1 , shape: [1, 122, 128] External allocation: allocating 187392 bytes Creating external output: 0 , shape: [3, 1, 8, 122, 16] Allocating 62464 bytes (aligned) Allocating 62464 bytes (aligned) Allocating 249856 bytes (aligned) Allocating 249856 bytes (aligned) Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Input Running Data Running Data Running Data Running Data Running Input Running TransposeBERTVectorized@NEON Running ForwardingKernelReshape Running FCViaConvOne Tuning ConvTask(batch=122,idepth=1,iheight=1,iwidth=1,ichannels=128,odepth=1,oheight=1,owidth=1,ochannels=128,kdepth=1,kheight=1,kwidth=1,dstride=1,hstride=1,wstride=1,ddilation=1,hdilation=1,wdilation=1,dpad=0,hpad=0,wpad=0,dtype=FLOAT,extbatch=1,mut_w=0) Found preset via lookup ConvOnePreset(in_regs=6,w_regs=4,w_prefetches=0,outf_tile=32,in_mode=MS,d_minibatch=128,n_minibatch=61) Allocating 65536 bytes (aligned) Scratches: 0 @ 0 Running LayerNormVectorized@NEON Running FCViaConvOne Tuning 
ConvTask(batch=122,idepth=1,iheight=1,iwidth=1,ichannels=128,odepth=1,oheight=1,owidth=1,ochannels=512,kdepth=1,kheight=1,kwidth=1,dstride=1,hstride=1,wstride=1,ddilation=1,hdilation=1,wdilation=1,dpad=0,hpad=0,wpad=0,dtype=FLOAT,extbatch=1,mut_w=0) Found preset via lookup ConvOnePreset(in_regs=6,w_regs=4,w_prefetches=0,outf_tile=64,in_mode=MS,d_minibatch=128,n_minibatch=61) Allocating 262144 bytes (aligned) Scratches: 0 @ 0 Running UnaryOpVectorized[Gelu]@NEON Running FCViaConvOne Tuning ConvTask(batch=122,idepth=1,iheight=1,iwidth=1,ichannels=512,odepth=1,oheight=1,owidth=1,ochannels=128,kdepth=1,kheight=1,kwidth=1,dstride=1,hstride=1,wstride=1,ddilation=1,hdilation=1,wdilation=1,dpad=0,hpad=0,wpad=0,dtype=FLOAT,extbatch=1,mut_w=0) Found preset via lookup ConvOnePreset(in_regs=6,w_regs=4,w_prefetches=0,outf_tile=16,in_mode=MS,d_minibatch=512,n_minibatch=122) Allocating 262144 bytes (aligned) Scratches: 0 @ 0 Running ForwardingKernelOutput Running LayerNormVectorized@NEON Running FCViaConvOne Tuning ConvTask(batch=122,idepth=1,iheight=1,iwidth=1,ichannels=128,odepth=1,oheight=1,owidth=1,ochannels=384,kdepth=1,kheight=1,kwidth=1,dstride=1,hstride=1,wstride=1,ddilation=1,hdilation=1,wdilation=1,dpad=0,hpad=0,wpad=0,dtype=FLOAT,extbatch=1,mut_w=0) Found preset via lookup ConvOnePreset(in_regs=7,w_regs=3,w_prefetches=0,outf_tile=48,in_mode=MS,d_minibatch=128,n_minibatch=61) Allocating 196608 bytes (aligned) Scratches: 0 @ 0 Running ForwardingKernelReshape Running TransposeIndexed Running ForwardingKernelOutput Building AIO network from graph graph(%input.254 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu), %scaled_dot_product_attention_7 : Float(1, 8, 122, 16, strides=[15616, 1952, 16, 1], requires_grad=0, device=cpu)): %2 : int[] = prim::Constant[value=[2, 0, 3, 1, 4]]() %3 : int[] = prim::Constant[value=[1, 122, 3, 8, 16]]() %self.self_blocks_0_attn_qkv.bias.2 : Float(384, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_8_attn_qkv.weight.2 : Float(384, 128, strides=[128, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_7_mlp_fc2.weight.5 : Float(128, 512, strides=[512, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %7 : str = prim::Constant[value="none"]() %self.self_blocks_0_mlp_fc1.bias.7 : Float(512, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_7_mlp_fc1.weight.7 : Float(512, 128, strides=[128, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %10 : bool = prim::Constant[value=1]() %11 : float = prim::Constant[value=9.9999999999999995e-07]() %self.self_blocks_0_norm1.weight.8 : Float(128, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %13 : int[] = prim::Constant[value=[128]]() %self.self_blocks_0_norm1.bias.10 : Float(128, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_7_attn_proj.weight.10 : Float(128, 128, strides=[128, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %16 : int[] = prim::Constant[value=[1, 122, 128]]() %17 : int = prim::Constant[value=1]() %18 : int[] = prim::Constant[value=[0, 2, 1, 3]]() %19 : Float(1, 122, 8, 16, strides=[15616, 16, 1952, 1], requires_grad=0, device=cpu) = aten::permute(%scaled_dot_product_attention_7, %18) %input.262 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::reshape(%19, %16) # .1:225:0 %input.266 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = 
aten::linear(%input.262, %self.self_blocks_7_attn_proj.weight.10, %self.self_blocks_0_norm1.bias.10), scope: __module.self_blocks_7_attn_proj # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 %input.270 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::add(%input.254, %input.266, %17) # .1:230:0 %input.274 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::layer_norm(%input.270, %13, %self.self_blocks_0_norm1.weight.8, %self.self_blocks_0_norm1.bias.10, %11, %10), scope: __module.self_blocks_7_norm2 # /usr/local/lib/python3.10/dist-packages/torch/nn/functional.py:2515:0 %input.278 : Float(1, 122, 512, strides=[62464, 512, 1], requires_grad=0, device=cpu) = aten::linear(%input.274, %self.self_blocks_7_mlp_fc1.weight.7, %self.self_blocks_0_mlp_fc1.bias.7), scope: __module.self_blocks_7_mlp_fc1 # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 %input.282 : Float(1, 122, 512, strides=[62464, 512, 1], requires_grad=0, device=cpu) = aten::gelu(%input.278, %7), scope: __module.self_blocks_7_mlp_act # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/activation.py:685:0 %input.286 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::linear(%input.282, %self.self_blocks_7_mlp_fc2.weight.5, %self.self_blocks_0_norm1.bias.10), scope: __module.self_blocks_7_mlp_fc2 # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 %input.290 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::add(%input.270, %input.286, %17) # .1:240:0 %input.294 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::layer_norm(%input.290, %13, %self.self_blocks_0_norm1.weight.8, %self.self_blocks_0_norm1.bias.10, %11, %10), scope: __module.self_blocks_8_norm1 # /usr/local/lib/python3.10/dist-packages/torch/nn/functional.py:2515:0 %self_blocks_8_attn_qkv.2 : Float(1, 122, 384, strides=[46848, 384, 1], requires_grad=0, device=cpu) = aten::linear(%input.294, %self.self_blocks_8_attn_qkv.weight.2, %self.self_blocks_0_attn_qkv.bias.2), scope: __module.self_blocks_8_attn_qkv # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 %reshape_16.1 : Float(1, 122, 3, 8, 16, strides=[46848, 384, 128, 16, 1], requires_grad=0, device=cpu) = aten::reshape(%self_blocks_8_attn_qkv.2, %3) # .1:243:0 %permute_8.1 : Float(3, 1, 8, 122, 16, strides=[128, 46848, 16, 384, 1], requires_grad=0, device=cpu) = aten::permute(%reshape_16.1, %2) # .1:244:0 return (%permute_8.1, %input.290) Adding prim::Constant layer to network Adding prim::Constant layer to network Adding prim::Constant layer to network Adding prim::Constant layer to network Adding prim::Constant layer to network Adding prim::Constant layer to network Adding prim::Constant layer to network Adding prim::Constant layer to network Adding prim::Constant layer to network Adding prim::Constant layer to network Adding prim::Constant layer to network Adding prim::Constant layer to network Adding prim::Constant layer to network Adding prim::Constant layer to network Adding prim::Constant layer to network Adding prim::Constant layer to network Adding prim::Constant layer to network Adding aten::permute layer to network Registering network input: Permute input index: 1 Creating blob for Input layer 282 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 8, 122, 16] Creating blob (executor) for Data layer 283 with type INT32 
shape [4] 0 2 1 3 Allocating 16 bytes (aligned) Adding aten::reshape layer to network Creating blob (executor) for Data layer 285 with type INT64 shape [3] [1, 122, 128] Allocating 24 bytes (aligned) Adding aten::linear layer to network Binding inputs for Linear layer Weight 0xaaab0e3f4340 , Bias 0xaaab0dd9a000 Creating blob for Data layer 287 with type FLOAT format PlainDataFormat(FORMATF_ROW_MAJOR)[0x0000000000000004] shape [128, 128] Creating blob for Data layer 288 with type FLOAT format PlainDataFormat(FORMATF_LINEAR)[0x0000000000000001] shape [128] Adding aten::add layer to network Registering network input: Lhs input index: 0 Creating blob for Input layer 290 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 122, 128] Adding aten::layer_norm layer to network Creating blob (executor) for Data layer 292 with type INT32 shape [1] 128 Allocating 4 bytes (aligned) Creating blob for Data layer 293 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [128] Creating blob for Data layer 294 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [128] Adding aten::linear layer to network Binding inputs for Linear layer Weight 0xaaab0e4043c0 , Bias 0xaaab0dd9bdc0 Creating blob for Data layer 296 with type FLOAT format PlainDataFormat(FORMATF_ROW_MAJOR)[0x0000000000000004] shape [512, 128] Creating blob for Data layer 297 with type FLOAT format PlainDataFormat(FORMATF_LINEAR)[0x0000000000000001] shape [512] Adding aten::gelu layer to network Adding aten::linear layer to network Binding inputs for Linear layer Weight 0xaaab0e444440 , Bias 0xaaab0dd9a000 Creating blob for Data layer 300 with type FLOAT format PlainDataFormat(FORMATF_ROW_MAJOR)[0x0000000000000004] shape [128, 512] Creating blob for Data layer 301 with type FLOAT format PlainDataFormat(FORMATF_LINEAR)[0x0000000000000001] shape [128] Adding aten::add layer to network Adding aten::layer_norm layer to network Creating blob (executor) for Data layer 304 with type INT32 shape [1] 128 Allocating 4 bytes (aligned) Creating blob for Data layer 305 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [128] Creating blob for Data layer 306 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [128] Adding aten::linear layer to network Binding inputs for Linear layer Weight 0xaaab0e4844c0 , Bias 0xaaab0dd9b2c0 Creating blob for Data layer 308 with type FLOAT format PlainDataFormat(FORMATF_ROW_MAJOR)[0x0000000000000004] shape [384, 128] Creating blob for Data layer 309 with type FLOAT format PlainDataFormat(FORMATF_LINEAR)[0x0000000000000001] shape [384] Adding aten::reshape layer to network Creating blob (executor) for Data layer 311 with type INT64 shape [5] [1, 122, 3, 8, 16] Allocating 40 bytes (aligned) Adding aten::permute layer to network Creating blob (executor) for Data layer 313 with type INT32 shape [5] 2 0 3 1 4 Allocating 20 bytes (aligned) Running AIO Network Layer FullyConnected got PlainDataFormat(FORMATF_BATCH_ROW_MAJOR)[0x0000000000000015] while it prefers PlainDataFormat(FORMATF_ROW_MAJOR)[0x0000000000000004] but no such conversion is available in DLS Layer FullyConnected got PlainDataFormat(FORMATF_BATCH_ROW_MAJOR)[0x0000000000000015] while it prefers PlainDataFormat(FORMATF_ROW_MAJOR)[0x0000000000000004] but no such conversion is available in DLS Layer FullyConnected got PlainDataFormat(FORMATF_BATCH_ROW_MAJOR)[0x0000000000000015] while it prefers 
PlainDataFormat(FORMATF_ROW_MAJOR)[0x0000000000000004] but no such conversion is available in DLS Layer FullyConnected got PlainDataFormat(FORMATF_BATCH_ROW_MAJOR)[0x0000000000000015] while it prefers PlainDataFormat(FORMATF_ROW_MAJOR)[0x0000000000000004] but no such conversion is available in DLS Selected kernel Data for layer Data : Selected kernel Data for layer Data : Selected kernel Data for layer Data : Selected kernel Data for layer Data : Selected kernel Data for layer Data : Selected kernel Data for layer Data : Selected kernel Data for layer Data : Selected kernel Data for layer Data : Selected kernel Data for layer Data : Selected kernel Data for layer Data : Selected kernel Data for layer Data : Selected kernel Data for layer Data : Selected kernel Data for layer Data : Selected kernel Data for layer Data : Selected kernel Input for layer Input : Lhs input Selected kernel Data for layer Data : Selected kernel Data for layer Data : Selected kernel Data for layer Data : Selected kernel Data for layer Data : Selected kernel Input for layer Input : Permute input Selected kernel TransposeBERTVectorized@NEON for layer Transpose : Selected kernel ForwardingKernelReshape for layer Reshape : Selected kernel FCViaConvOne for layer FullyConnected : Selected kernel BinaryOpVectorized[Add]@NEON for layer Add : Selected kernel LayerNormVectorized@NEON for layer LayerNorm : Selected kernel FCViaConvOne for layer FullyConnected : Selected kernel UnaryOpVectorized[Gelu]@NEON for layer Gelu : Selected kernel FCViaConvOne for layer FullyConnected : Selected kernel BinaryOpVectorized[Add]@NEON for layer Add : Selected kernel ForwardingKernelOutput for layer Output : Selected kernel LayerNormVectorized@NEON for layer LayerNorm : Selected kernel FCViaConvOne for layer FullyConnected : Selected kernel ForwardingKernelReshape for layer Reshape : Selected kernel TransposeIndexed for layer Transpose : Allocating 187392 bytes (aligned) Selected kernel ForwardingKernelOutput for layer Output : Considering merge of Add to Input Kernel Input rejected merge Merge of ( Add [1, 122, 128] ) to Lhs input ( Input Input ): Attempt merge failed Merge of ( Transpose [1, 122, 8, 16] ) to Permute input ( Input Input ): Target layer type is not mergeable Merge of ( Reshape [1, 122, 128] ) to ( Transpose TransposeBERTVectorized@NEON ): Target layer type is not mergeable Merge of ( FullyConnected [1, 122, 128] ) to ( Reshape ForwardingKernelReshape ): Target layer type is not mergeable Considering merge of Add to FCViaConvOne Merge of ( Add [1, 122, 128] ) to ( FullyConnected FCViaConvOne ): Successful Merge of ( FullyConnected [1, 122, 512] ) to ( LayerNorm LayerNormVectorized@NEON ): Target layer type is not mergeable Merge of ( Gelu [1, 122, 512] ) to ( FullyConnected FCViaConvOne ): Target layer type is not mergeable Merge of ( FullyConnected [1, 122, 128] ) to ( Gelu UnaryOpVectorized[Gelu]@NEON ): Target layer type is not mergeable Considering merge of Add to FCViaConvOne Merge of ( Add [1, 122, 128] ) to ( FullyConnected FCViaConvOne ): Successful Merge of ( FullyConnected [1, 122, 384] ) to ( LayerNorm LayerNormVectorized@NEON ): Target layer type is not mergeable Merge of ( Reshape [1, 122, 3, 8, 16] ) to ( FullyConnected FCViaConvOne ): Target layer type is not mergeable Merge of ( Transpose [3, 1, 8, 122, 16] ) to ( Reshape ForwardingKernelReshape ): Target layer type is not mergeable Merge of ( Output [3, 1, 8, 122, 16] ) to ( Transpose TransposeIndexed ): Target layer type is not mergeable External 
allocation: allocating 62464 bytes Creating external output: 1 , shape: [1, 122, 128] External allocation: allocating 187392 bytes Creating external output: 0 , shape: [3, 1, 8, 122, 16] Allocating 62464 bytes (aligned) Allocating 62464 bytes (aligned) Allocating 249856 bytes (aligned) Allocating 249856 bytes (aligned) Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Input Running Data Running Data Running Data Running Data Running Input Running TransposeBERTVectorized@NEON Running ForwardingKernelReshape Running FCViaConvOne Tuning ConvTask(batch=122,idepth=1,iheight=1,iwidth=1,ichannels=128,odepth=1,oheight=1,owidth=1,ochannels=128,kdepth=1,kheight=1,kwidth=1,dstride=1,hstride=1,wstride=1,ddilation=1,hdilation=1,wdilation=1,dpad=0,hpad=0,wpad=0,dtype=FLOAT,extbatch=1,mut_w=0) Found preset via lookup ConvOnePreset(in_regs=6,w_regs=4,w_prefetches=0,outf_tile=32,in_mode=MS,d_minibatch=128,n_minibatch=61) Allocating 65536 bytes (aligned) Scratches: 0 @ 0 Running LayerNormVectorized@NEON Running FCViaConvOne Tuning ConvTask(batch=122,idepth=1,iheight=1,iwidth=1,ichannels=128,odepth=1,oheight=1,owidth=1,ochannels=512,kdepth=1,kheight=1,kwidth=1,dstride=1,hstride=1,wstride=1,ddilation=1,hdilation=1,wdilation=1,dpad=0,hpad=0,wpad=0,dtype=FLOAT,extbatch=1,mut_w=0) Found preset via lookup ConvOnePreset(in_regs=6,w_regs=4,w_prefetches=0,outf_tile=64,in_mode=MS,d_minibatch=128,n_minibatch=61) Allocating 262144 bytes (aligned) Scratches: 0 @ 0 Running UnaryOpVectorized[Gelu]@NEON Running FCViaConvOne Tuning ConvTask(batch=122,idepth=1,iheight=1,iwidth=1,ichannels=512,odepth=1,oheight=1,owidth=1,ochannels=128,kdepth=1,kheight=1,kwidth=1,dstride=1,hstride=1,wstride=1,ddilation=1,hdilation=1,wdilation=1,dpad=0,hpad=0,wpad=0,dtype=FLOAT,extbatch=1,mut_w=0) Found preset via lookup ConvOnePreset(in_regs=6,w_regs=4,w_prefetches=0,outf_tile=16,in_mode=MS,d_minibatch=512,n_minibatch=122) Allocating 262144 bytes (aligned) Scratches: 0 @ 0 Running ForwardingKernelOutput Running LayerNormVectorized@NEON Running FCViaConvOne Tuning ConvTask(batch=122,idepth=1,iheight=1,iwidth=1,ichannels=128,odepth=1,oheight=1,owidth=1,ochannels=384,kdepth=1,kheight=1,kwidth=1,dstride=1,hstride=1,wstride=1,ddilation=1,hdilation=1,wdilation=1,dpad=0,hpad=0,wpad=0,dtype=FLOAT,extbatch=1,mut_w=0) Found preset via lookup ConvOnePreset(in_regs=7,w_regs=3,w_prefetches=0,outf_tile=48,in_mode=MS,d_minibatch=128,n_minibatch=61) Allocating 196608 bytes (aligned) Scratches: 0 @ 0 Running ForwardingKernelReshape Running TransposeIndexed Running ForwardingKernelOutput Building AIO network from graph graph(%input.290 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu), %scaled_dot_product_attention_8 : Float(1, 8, 122, 16, strides=[15616, 1952, 16, 1], requires_grad=0, device=cpu)): %2 : int[] = prim::Constant[value=[2, 0, 3, 1, 4]]() %3 : int[] = prim::Constant[value=[1, 122, 3, 8, 16]]() %self.self_blocks_0_attn_qkv.bias.2 : Float(384, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_9_attn_qkv.weight.2 : Float(384, 128, strides=[128, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_8_mlp_fc2.weight.5 : Float(128, 512, strides=[512, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %7 : str = prim::Constant[value="none"]() %self.self_blocks_0_mlp_fc1.bias.7 : Float(512, strides=[1], requires_grad=0, 
device=cpu) = prim::Constant[value=]() %self.self_blocks_8_mlp_fc1.weight.7 : Float(512, 128, strides=[128, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %10 : bool = prim::Constant[value=1]() %11 : float = prim::Constant[value=9.9999999999999995e-07]() %self.self_blocks_0_norm1.weight.8 : Float(128, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %13 : int[] = prim::Constant[value=[128]]() %self.self_blocks_0_norm1.bias.10 : Float(128, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_8_attn_proj.weight.10 : Float(128, 128, strides=[128, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %16 : int[] = prim::Constant[value=[1, 122, 128]]() %17 : int = prim::Constant[value=1]() %18 : int[] = prim::Constant[value=[0, 2, 1, 3]]() %19 : Float(1, 122, 8, 16, strides=[15616, 16, 1952, 1], requires_grad=0, device=cpu) = aten::permute(%scaled_dot_product_attention_8, %18) %input.298 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::reshape(%19, %16) # .1:253:0 %input.302 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::linear(%input.298, %self.self_blocks_8_attn_proj.weight.10, %self.self_blocks_0_norm1.bias.10), scope: __module.self_blocks_8_attn_proj # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 %input.306 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::add(%input.290, %input.302, %17) # .1:258:0 %input.310 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::layer_norm(%input.306, %13, %self.self_blocks_0_norm1.weight.8, %self.self_blocks_0_norm1.bias.10, %11, %10), scope: __module.self_blocks_8_norm2 # /usr/local/lib/python3.10/dist-packages/torch/nn/functional.py:2515:0 %input.314 : Float(1, 122, 512, strides=[62464, 512, 1], requires_grad=0, device=cpu) = aten::linear(%input.310, %self.self_blocks_8_mlp_fc1.weight.7, %self.self_blocks_0_mlp_fc1.bias.7), scope: __module.self_blocks_8_mlp_fc1 # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 %input.318 : Float(1, 122, 512, strides=[62464, 512, 1], requires_grad=0, device=cpu) = aten::gelu(%input.314, %7), scope: __module.self_blocks_8_mlp_act # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/activation.py:685:0 %input.322 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::linear(%input.318, %self.self_blocks_8_mlp_fc2.weight.5, %self.self_blocks_0_norm1.bias.10), scope: __module.self_blocks_8_mlp_fc2 # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 %input.326 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::add(%input.306, %input.322, %17) # .1:268:0 %input.330 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::layer_norm(%input.326, %13, %self.self_blocks_0_norm1.weight.8, %self.self_blocks_0_norm1.bias.10, %11, %10), scope: __module.self_blocks_9_norm1 # /usr/local/lib/python3.10/dist-packages/torch/nn/functional.py:2515:0 %self_blocks_9_attn_qkv.2 : Float(1, 122, 384, strides=[46848, 384, 1], requires_grad=0, device=cpu) = aten::linear(%input.330, %self.self_blocks_9_attn_qkv.weight.2, %self.self_blocks_0_attn_qkv.bias.2), scope: __module.self_blocks_9_attn_qkv # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 %reshape_18.1 : Float(1, 122, 3, 8, 16, strides=[46848, 384, 128, 16, 1], requires_grad=0, device=cpu) = 
aten::reshape(%self_blocks_9_attn_qkv.2, %3) # .1:271:0 %permute_9.1 : Float(3, 1, 8, 122, 16, strides=[128, 46848, 16, 384, 1], requires_grad=0, device=cpu) = aten::permute(%reshape_18.1, %2) # .1:272:0 return (%permute_9.1, %input.326) Adding prim::Constant layer to network Adding prim::Constant layer to network Adding prim::Constant layer to network Adding prim::Constant layer to network Adding prim::Constant layer to network Adding prim::Constant layer to network Adding prim::Constant layer to network Adding prim::Constant layer to network Adding prim::Constant layer to network Adding prim::Constant layer to network Adding prim::Constant layer to network Adding prim::Constant layer to network Adding prim::Constant layer to network Adding prim::Constant layer to network Adding prim::Constant layer to network Adding prim::Constant layer to network Adding prim::Constant layer to network Adding aten::permute layer to network Registering network input: Permute input index: 1 Creating blob for Input layer 317 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 8, 122, 16] Creating blob (executor) for Data layer 318 with type INT32 shape [4] 0 2 1 3 Allocating 16 bytes (aligned) Adding aten::reshape layer to network Creating blob (executor) for Data layer 320 with type INT64 shape [3] [1, 122, 128] Allocating 24 bytes (aligned) Adding aten::linear layer to network Binding inputs for Linear layer Weight 0xaaab0e4b4540 , Bias 0xaaab0dd9a000 Creating blob for Data layer 322 with type FLOAT format PlainDataFormat(FORMATF_ROW_MAJOR)[0x0000000000000004] shape [128, 128] Creating blob for Data layer 323 with type FLOAT format PlainDataFormat(FORMATF_LINEAR)[0x0000000000000001] shape [128] Adding aten::add layer to network Registering network input: Lhs input index: 0 Creating blob for Input layer 325 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 122, 128] Adding aten::layer_norm layer to network Creating blob (executor) for Data layer 327 with type INT32 shape [1] 128 Allocating 4 bytes (aligned) Creating blob for Data layer 328 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [128] Creating blob for Data layer 329 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [128] Adding aten::linear layer to network Binding inputs for Linear layer Weight 0xaaab0e4c4ac0 , Bias 0xaaab0dd9bdc0 Creating blob for Data layer 331 with type FLOAT format PlainDataFormat(FORMATF_ROW_MAJOR)[0x0000000000000004] shape [512, 128] Creating blob for Data layer 332 with type FLOAT format PlainDataFormat(FORMATF_LINEAR)[0x0000000000000001] shape [512] Adding aten::gelu layer to network Adding aten::linear layer to network Binding inputs for Linear layer Weight 0xaaab0e5053c0 , Bias 0xaaab0dd9a000 Creating blob for Data layer 335 with type FLOAT format PlainDataFormat(FORMATF_ROW_MAJOR)[0x0000000000000004] shape [128, 512] Creating blob for Data layer 336 with type FLOAT format PlainDataFormat(FORMATF_LINEAR)[0x0000000000000001] shape [128] Adding aten::add layer to network Adding aten::layer_norm layer to network Creating blob (executor) for Data layer 339 with type INT32 shape [1] 128 Allocating 4 bytes (aligned) Creating blob for Data layer 340 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [128] Creating blob for Data layer 341 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [128] Adding aten::linear layer to network Binding inputs for Linear 
layer Weight 0xaaab0e545bc0 , Bias 0xaaab0dd9b2c0 Creating blob for Data layer 343 with type FLOAT format PlainDataFormat(FORMATF_ROW_MAJOR)[0x0000000000000004] shape [384, 128] Creating blob for Data layer 344 with type FLOAT format PlainDataFormat(FORMATF_LINEAR)[0x0000000000000001] shape [384] Adding aten::reshape layer to network Creating blob (executor) for Data layer 346 with type INT64 shape [5] [1, 122, 3, 8, 16] Allocating 40 bytes (aligned) Adding aten::permute layer to network Creating blob (executor) for Data layer 348 with type INT32 shape [5] 2 0 3 1 4 Allocating 20 bytes (aligned) Running AIO Network Layer FullyConnected got PlainDataFormat(FORMATF_BATCH_ROW_MAJOR)[0x0000000000000015] while it prefers PlainDataFormat(FORMATF_ROW_MAJOR)[0x0000000000000004] but no such conversion is available in DLS Layer FullyConnected got PlainDataFormat(FORMATF_BATCH_ROW_MAJOR)[0x0000000000000015] while it prefers PlainDataFormat(FORMATF_ROW_MAJOR)[0x0000000000000004] but no such conversion is available in DLS Layer FullyConnected got PlainDataFormat(FORMATF_BATCH_ROW_MAJOR)[0x0000000000000015] while it prefers PlainDataFormat(FORMATF_ROW_MAJOR)[0x0000000000000004] but no such conversion is available in DLS Layer FullyConnected got PlainDataFormat(FORMATF_BATCH_ROW_MAJOR)[0x0000000000000015] while it prefers PlainDataFormat(FORMATF_ROW_MAJOR)[0x0000000000000004] but no such conversion is available in DLS Selected kernel Data for layer Data : Selected kernel Data for layer Data : Selected kernel Data for layer Data : Selected kernel Data for layer Data : Selected kernel Data for layer Data : Selected kernel Data for layer Data : Selected kernel Data for layer Data : Selected kernel Data for layer Data : Selected kernel Data for layer Data : Selected kernel Data for layer Data : Selected kernel Data for layer Data : Selected kernel Data for layer Data : Selected kernel Data for layer Data : Selected kernel Data for layer Data : Selected kernel Input for layer Input : Lhs input Selected kernel Data for layer Data : Selected kernel Data for layer Data : Selected kernel Data for layer Data : Selected kernel Data for layer Data : Selected kernel Input for layer Input : Permute input Selected kernel TransposeBERTVectorized@NEON for layer Transpose : Selected kernel ForwardingKernelReshape for layer Reshape : Selected kernel FCViaConvOne for layer FullyConnected : Selected kernel BinaryOpVectorized[Add]@NEON for layer Add : Selected kernel LayerNormVectorized@NEON for layer LayerNorm : Selected kernel FCViaConvOne for layer FullyConnected : Selected kernel UnaryOpVectorized[Gelu]@NEON for layer Gelu : Selected kernel FCViaConvOne for layer FullyConnected : Selected kernel BinaryOpVectorized[Add]@NEON for layer Add : Selected kernel ForwardingKernelOutput for layer Output : Selected kernel LayerNormVectorized@NEON for layer LayerNorm : Selected kernel FCViaConvOne for layer FullyConnected : Selected kernel ForwardingKernelReshape for layer Reshape : Selected kernel TransposeIndexed for layer Transpose : Allocating 187392 bytes (aligned) Selected kernel ForwardingKernelOutput for layer Output : Considering merge of Add to Input Kernel Input rejected merge Merge of ( Add [1, 122, 128] ) to Lhs input ( Input Input ): Attempt merge failed Merge of ( Transpose [1, 122, 8, 16] ) to Permute input ( Input Input ): Target layer type is not mergeable Merge of ( Reshape [1, 122, 128] ) to ( Transpose TransposeBERTVectorized@NEON ): Target layer type is not mergeable Merge of ( FullyConnected [1, 122, 128] ) to 
( Reshape ForwardingKernelReshape ): Target layer type is not mergeable Considering merge of Add to FCViaConvOne Merge of ( Add [1, 122, 128] ) to ( FullyConnected FCViaConvOne ): Successful Merge of ( FullyConnected [1, 122, 512] ) to ( LayerNorm LayerNormVectorized@NEON ): Target layer type is not mergeable Merge of ( Gelu [1, 122, 512] ) to ( FullyConnected FCViaConvOne ): Target layer type is not mergeable Merge of ( FullyConnected [1, 122, 128] ) to ( Gelu UnaryOpVectorized[Gelu]@NEON ): Target layer type is not mergeable Considering merge of Add to FCViaConvOne Merge of ( Add [1, 122, 128] ) to ( FullyConnected FCViaConvOne ): Successful Merge of ( FullyConnected [1, 122, 384] ) to ( LayerNorm LayerNormVectorized@NEON ): Target layer type is not mergeable Merge of ( Reshape [1, 122, 3, 8, 16] ) to ( FullyConnected FCViaConvOne ): Target layer type is not mergeable Merge of ( Transpose [3, 1, 8, 122, 16] ) to ( Reshape ForwardingKernelReshape ): Target layer type is not mergeable Merge of ( Output [3, 1, 8, 122, 16] ) to ( Transpose TransposeIndexed ): Target layer type is not mergeable External allocation: allocating 62464 bytes Creating external output: 1 , shape: [1, 122, 128] External allocation: allocating 187392 bytes Creating external output: 0 , shape: [3, 1, 8, 122, 16] Allocating 62464 bytes (aligned) Allocating 62464 bytes (aligned) Allocating 249856 bytes (aligned) Allocating 249856 bytes (aligned) Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Input Running Data Running Data Running Data Running Data Running Input Running TransposeBERTVectorized@NEON Running ForwardingKernelReshape Running FCViaConvOne Tuning ConvTask(batch=122,idepth=1,iheight=1,iwidth=1,ichannels=128,odepth=1,oheight=1,owidth=1,ochannels=128,kdepth=1,kheight=1,kwidth=1,dstride=1,hstride=1,wstride=1,ddilation=1,hdilation=1,wdilation=1,dpad=0,hpad=0,wpad=0,dtype=FLOAT,extbatch=1,mut_w=0) Found preset via lookup ConvOnePreset(in_regs=6,w_regs=4,w_prefetches=0,outf_tile=32,in_mode=MS,d_minibatch=128,n_minibatch=61) Allocating 65536 bytes (aligned) Scratches: 0 @ 0 Running LayerNormVectorized@NEON Running FCViaConvOne Tuning ConvTask(batch=122,idepth=1,iheight=1,iwidth=1,ichannels=128,odepth=1,oheight=1,owidth=1,ochannels=512,kdepth=1,kheight=1,kwidth=1,dstride=1,hstride=1,wstride=1,ddilation=1,hdilation=1,wdilation=1,dpad=0,hpad=0,wpad=0,dtype=FLOAT,extbatch=1,mut_w=0) Found preset via lookup ConvOnePreset(in_regs=6,w_regs=4,w_prefetches=0,outf_tile=64,in_mode=MS,d_minibatch=128,n_minibatch=61) Allocating 262144 bytes (aligned) Scratches: 0 @ 0 Running UnaryOpVectorized[Gelu]@NEON Running FCViaConvOne Tuning ConvTask(batch=122,idepth=1,iheight=1,iwidth=1,ichannels=512,odepth=1,oheight=1,owidth=1,ochannels=128,kdepth=1,kheight=1,kwidth=1,dstride=1,hstride=1,wstride=1,ddilation=1,hdilation=1,wdilation=1,dpad=0,hpad=0,wpad=0,dtype=FLOAT,extbatch=1,mut_w=0) Found preset via lookup ConvOnePreset(in_regs=6,w_regs=4,w_prefetches=0,outf_tile=16,in_mode=MS,d_minibatch=512,n_minibatch=122) Allocating 262144 bytes (aligned) Scratches: 0 @ 0 Running ForwardingKernelOutput Running LayerNormVectorized@NEON Running FCViaConvOne Tuning ConvTask(batch=122,idepth=1,iheight=1,iwidth=1,ichannels=128,odepth=1,oheight=1,owidth=1,ochannels=384,kdepth=1,kheight=1,kwidth=1,dstride=1,hstride=1,wstride=1,ddilation=1,hdilation=1,wdilation=1,dpad=0,hpad=0,wpad=0,dtype=FLOAT,extbatch=1,mut_w=0) 
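The ConvTask just above (batch=122, ichannels=128, ochannels=384, kdepth=kheight=kwidth=1) together with the kernel name FCViaConvOne suggests that each aten::linear in these blocks is being lowered to a pointwise (1x1) convolution over the 122 tokens. The snippet below is only an illustrative sketch of that standard linear-to-1x1-conv equivalence in plain PyTorch, using the shapes from the log; it is not the library's implementation, and the tensor names are made up.

import torch

tokens, in_ch, out_ch = 122, 128, 384          # shapes taken from the ConvTask above
x = torch.randn(1, tokens, in_ch)              # activation, like the [1, 122, 128] blob
w = torch.randn(out_ch, in_ch)                 # qkv-style weight [384, 128]
b = torch.randn(out_ch)

y_linear = torch.nn.functional.linear(x, w, b)             # [1, 122, 384]

# The same result as a pointwise convolution with the tokens on a spatial axis:
# [1, 128, 122, 1] convolved with a [384, 128, 1, 1] kernel gives [1, 384, 122, 1].
x_conv = x.transpose(1, 2).unsqueeze(-1)
y_conv = torch.nn.functional.conv2d(x_conv, w.view(out_ch, in_ch, 1, 1), b)
y_conv = y_conv.squeeze(-1).transpose(1, 2)

assert torch.allclose(y_linear, y_conv, atol=1e-4)

Under that reading, the ConvOnePreset lookup that follows appears to select a register-tiling schedule (in_regs, w_regs, outf_tile, minibatch sizes) for the resulting 1x1 convolution shape.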
Found preset via lookup ConvOnePreset(in_regs=7,w_regs=3,w_prefetches=0,outf_tile=48,in_mode=MS,d_minibatch=128,n_minibatch=61) Allocating 196608 bytes (aligned) Scratches: 0 @ 0 Running ForwardingKernelReshape Running TransposeIndexed Running ForwardingKernelOutput Building AIO network from graph graph(%input.326 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu), %scaled_dot_product_attention_9 : Float(1, 8, 122, 16, strides=[15616, 1952, 16, 1], requires_grad=0, device=cpu)): %2 : int[] = prim::Constant[value=[2, 0, 3, 1, 4]]() %3 : int[] = prim::Constant[value=[1, 122, 3, 8, 16]]() %self.self_blocks_0_attn_qkv.bias.2 : Float(384, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_10_attn_qkv.weight.2 : Float(384, 128, strides=[128, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_9_mlp_fc2.weight.5 : Float(128, 512, strides=[512, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %7 : str = prim::Constant[value="none"]() %self.self_blocks_0_mlp_fc1.bias.7 : Float(512, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_9_mlp_fc1.weight.7 : Float(512, 128, strides=[128, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %10 : bool = prim::Constant[value=1]() %11 : float = prim::Constant[value=9.9999999999999995e-07]() %self.self_blocks_0_norm1.weight.8 : Float(128, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %13 : int[] = prim::Constant[value=[128]]() %self.self_blocks_0_norm1.bias.10 : Float(128, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_9_attn_proj.weight.10 : Float(128, 128, strides=[128, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %16 : int[] = prim::Constant[value=[1, 122, 128]]() %17 : int = prim::Constant[value=1]() %18 : int[] = prim::Constant[value=[0, 2, 1, 3]]() %19 : Float(1, 122, 8, 16, strides=[15616, 16, 1952, 1], requires_grad=0, device=cpu) = aten::permute(%scaled_dot_product_attention_9, %18) %input.334 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::reshape(%19, %16) # .1:281:0 %input.338 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::linear(%input.334, %self.self_blocks_9_attn_proj.weight.10, %self.self_blocks_0_norm1.bias.10), scope: __module.self_blocks_9_attn_proj # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 %input.342 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::add(%input.326, %input.338, %17) # .1:286:0 %input.346 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::layer_norm(%input.342, %13, %self.self_blocks_0_norm1.weight.8, %self.self_blocks_0_norm1.bias.10, %11, %10), scope: __module.self_blocks_9_norm2 # /usr/local/lib/python3.10/dist-packages/torch/nn/functional.py:2515:0 %input.350 : Float(1, 122, 512, strides=[62464, 512, 1], requires_grad=0, device=cpu) = aten::linear(%input.346, %self.self_blocks_9_mlp_fc1.weight.7, %self.self_blocks_0_mlp_fc1.bias.7), scope: __module.self_blocks_9_mlp_fc1 # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 %input.354 : Float(1, 122, 512, strides=[62464, 512, 1], requires_grad=0, device=cpu) = aten::gelu(%input.350, %7), scope: __module.self_blocks_9_mlp_act # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/activation.py:685:0 %input.358 : Float(1, 122, 128, strides=[15616, 128, 1], 
requires_grad=0, device=cpu) = aten::linear(%input.354, %self.self_blocks_9_mlp_fc2.weight.5, %self.self_blocks_0_norm1.bias.10), scope: __module.self_blocks_9_mlp_fc2 # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 %input.362 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::add(%input.342, %input.358, %17) # .1:296:0 %input.366 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::layer_norm(%input.362, %13, %self.self_blocks_0_norm1.weight.8, %self.self_blocks_0_norm1.bias.10, %11, %10), scope: __module.self_blocks_10_norm1 # /usr/local/lib/python3.10/dist-packages/torch/nn/functional.py:2515:0 %self_blocks_10_attn_qkv.2 : Float(1, 122, 384, strides=[46848, 384, 1], requires_grad=0, device=cpu) = aten::linear(%input.366, %self.self_blocks_10_attn_qkv.weight.2, %self.self_blocks_0_attn_qkv.bias.2), scope: __module.self_blocks_10_attn_qkv # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 %reshape_20.1 : Float(1, 122, 3, 8, 16, strides=[46848, 384, 128, 16, 1], requires_grad=0, device=cpu) = aten::reshape(%self_blocks_10_attn_qkv.2, %3) # .1:299:0 %permute_10.1 : Float(3, 1, 8, 122, 16, strides=[128, 46848, 16, 384, 1], requires_grad=0, device=cpu) = aten::permute(%reshape_20.1, %2) # .1:300:0 return (%permute_10.1, %input.362) Adding prim::Constant layer to network Adding prim::Constant layer to network Adding prim::Constant layer to network Adding prim::Constant layer to network Adding prim::Constant layer to network Adding prim::Constant layer to network Adding prim::Constant layer to network Adding prim::Constant layer to network Adding prim::Constant layer to network Adding prim::Constant layer to network Adding prim::Constant layer to network Adding prim::Constant layer to network Adding prim::Constant layer to network Adding prim::Constant layer to network Adding prim::Constant layer to network Adding prim::Constant layer to network Adding prim::Constant layer to network Adding aten::permute layer to network Registering network input: Permute input index: 1 Creating blob for Input layer 352 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 8, 122, 16] Creating blob (executor) for Data layer 353 with type INT32 shape [4] 0 2 1 3 Allocating 16 bytes (aligned) Adding aten::reshape layer to network Creating blob (executor) for Data layer 355 with type INT64 shape [3] [1, 122, 128] Allocating 24 bytes (aligned) Adding aten::linear layer to network Binding inputs for Linear layer Weight 0xaaab0e5762c0 , Bias 0xaaab0dd9a000 Creating blob for Data layer 357 with type FLOAT format PlainDataFormat(FORMATF_ROW_MAJOR)[0x0000000000000004] shape [128, 128] Creating blob for Data layer 358 with type FLOAT format PlainDataFormat(FORMATF_LINEAR)[0x0000000000000001] shape [128] Adding aten::add layer to network Registering network input: Lhs input index: 0 Creating blob for Input layer 360 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 122, 128] Adding aten::layer_norm layer to network Creating blob (executor) for Data layer 362 with type INT32 shape [1] 128 Allocating 4 bytes (aligned) Creating blob for Data layer 363 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [128] Creating blob for Data layer 364 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [128] Adding aten::linear layer to network Binding inputs for Linear layer Weight 0xaaab0e586ac0 , Bias 
0xaaab0dd9bdc0 Creating blob for Data layer 366 with type FLOAT format PlainDataFormat(FORMATF_ROW_MAJOR)[0x0000000000000004] shape [512, 128] Creating blob for Data layer 367 with type FLOAT format PlainDataFormat(FORMATF_LINEAR)[0x0000000000000001] shape [512] Adding aten::gelu layer to network Adding aten::linear layer to network Binding inputs for Linear layer Weight 0xaaab0e5c73c0 , Bias 0xaaab0dd9a000 Creating blob for Data layer 370 with type FLOAT format PlainDataFormat(FORMATF_ROW_MAJOR)[0x0000000000000004] shape [128, 512] Creating blob for Data layer 371 with type FLOAT format PlainDataFormat(FORMATF_LINEAR)[0x0000000000000001] shape [128] Adding aten::add layer to network Adding aten::layer_norm layer to network Creating blob (executor) for Data layer 374 with type INT32 shape [1] 128 Allocating 4 bytes (aligned) Creating blob for Data layer 375 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [128] Creating blob for Data layer 376 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [128] Adding aten::linear layer to network Binding inputs for Linear layer Weight 0xaaab0e607bc0 , Bias 0xaaab0dd9b2c0 Creating blob for Data layer 378 with type FLOAT format PlainDataFormat(FORMATF_ROW_MAJOR)[0x0000000000000004] shape [384, 128] Creating blob for Data layer 379 with type FLOAT format PlainDataFormat(FORMATF_LINEAR)[0x0000000000000001] shape [384] Adding aten::reshape layer to network Creating blob (executor) for Data layer 381 with type INT64 shape [5] [1, 122, 3, 8, 16] Allocating 40 bytes (aligned) Adding aten::permute layer to network Creating blob (executor) for Data layer 383 with type INT32 shape [5] 2 0 3 1 4 Allocating 20 bytes (aligned) Running AIO Network Layer FullyConnected got PlainDataFormat(FORMATF_BATCH_ROW_MAJOR)[0x0000000000000015] while it prefers PlainDataFormat(FORMATF_ROW_MAJOR)[0x0000000000000004] but no such conversion is available in DLS Layer FullyConnected got PlainDataFormat(FORMATF_BATCH_ROW_MAJOR)[0x0000000000000015] while it prefers PlainDataFormat(FORMATF_ROW_MAJOR)[0x0000000000000004] but no such conversion is available in DLS Layer FullyConnected got PlainDataFormat(FORMATF_BATCH_ROW_MAJOR)[0x0000000000000015] while it prefers PlainDataFormat(FORMATF_ROW_MAJOR)[0x0000000000000004] but no such conversion is available in DLS Layer FullyConnected got PlainDataFormat(FORMATF_BATCH_ROW_MAJOR)[0x0000000000000015] while it prefers PlainDataFormat(FORMATF_ROW_MAJOR)[0x0000000000000004] but no such conversion is available in DLS Selected kernel Data for layer Data : Selected kernel Data for layer Data : Selected kernel Data for layer Data : Selected kernel Data for layer Data : Selected kernel Data for layer Data : Selected kernel Data for layer Data : Selected kernel Data for layer Data : Selected kernel Data for layer Data : Selected kernel Data for layer Data : Selected kernel Data for layer Data : Selected kernel Data for layer Data : Selected kernel Data for layer Data : Selected kernel Data for layer Data : Selected kernel Data for layer Data : Selected kernel Input for layer Input : Lhs input Selected kernel Data for layer Data : Selected kernel Data for layer Data : Selected kernel Data for layer Data : Selected kernel Data for layer Data : Selected kernel Input for layer Input : Permute input Selected kernel TransposeBERTVectorized@NEON for layer Transpose : Selected kernel ForwardingKernelReshape for layer Reshape : Selected kernel FCViaConvOne for layer FullyConnected : Selected kernel 
BinaryOpVectorized[Add]@NEON for layer Add : Selected kernel LayerNormVectorized@NEON for layer LayerNorm : Selected kernel FCViaConvOne for layer FullyConnected : Selected kernel UnaryOpVectorized[Gelu]@NEON for layer Gelu : Selected kernel FCViaConvOne for layer FullyConnected : Selected kernel BinaryOpVectorized[Add]@NEON for layer Add : Selected kernel ForwardingKernelOutput for layer Output : Selected kernel LayerNormVectorized@NEON for layer LayerNorm : Selected kernel FCViaConvOne for layer FullyConnected : Selected kernel ForwardingKernelReshape for layer Reshape : Selected kernel TransposeIndexed for layer Transpose : Allocating 187392 bytes (aligned) Selected kernel ForwardingKernelOutput for layer Output : Considering merge of Add to Input Kernel Input rejected merge Merge of ( Add [1, 122, 128] ) to Lhs input ( Input Input ): Attempt merge failed Merge of ( Transpose [1, 122, 8, 16] ) to Permute input ( Input Input ): Target layer type is not mergeable Merge of ( Reshape [1, 122, 128] ) to ( Transpose TransposeBERTVectorized@NEON ): Target layer type is not mergeable Merge of ( FullyConnected [1, 122, 128] ) to ( Reshape ForwardingKernelReshape ): Target layer type is not mergeable Considering merge of Add to FCViaConvOne Merge of ( Add [1, 122, 128] ) to ( FullyConnected FCViaConvOne ): Successful Merge of ( FullyConnected [1, 122, 512] ) to ( LayerNorm LayerNormVectorized@NEON ): Target layer type is not mergeable Merge of ( Gelu [1, 122, 512] ) to ( FullyConnected FCViaConvOne ): Target layer type is not mergeable Merge of ( FullyConnected [1, 122, 128] ) to ( Gelu UnaryOpVectorized[Gelu]@NEON ): Target layer type is not mergeable Considering merge of Add to FCViaConvOne Merge of ( Add [1, 122, 128] ) to ( FullyConnected FCViaConvOne ): Successful Merge of ( FullyConnected [1, 122, 384] ) to ( LayerNorm LayerNormVectorized@NEON ): Target layer type is not mergeable Merge of ( Reshape [1, 122, 3, 8, 16] ) to ( FullyConnected FCViaConvOne ): Target layer type is not mergeable Merge of ( Transpose [3, 1, 8, 122, 16] ) to ( Reshape ForwardingKernelReshape ): Target layer type is not mergeable Merge of ( Output [3, 1, 8, 122, 16] ) to ( Transpose TransposeIndexed ): Target layer type is not mergeable External allocation: allocating 62464 bytes Creating external output: 1 , shape: [1, 122, 128] External allocation: allocating 187392 bytes Creating external output: 0 , shape: [3, 1, 8, 122, 16] Allocating 62464 bytes (aligned) Allocating 62464 bytes (aligned) Allocating 249856 bytes (aligned) Allocating 249856 bytes (aligned) Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Input Running Data Running Data Running Data Running Data Running Input Running TransposeBERTVectorized@NEON Running ForwardingKernelReshape Running FCViaConvOne Tuning ConvTask(batch=122,idepth=1,iheight=1,iwidth=1,ichannels=128,odepth=1,oheight=1,owidth=1,ochannels=128,kdepth=1,kheight=1,kwidth=1,dstride=1,hstride=1,wstride=1,ddilation=1,hdilation=1,wdilation=1,dpad=0,hpad=0,wpad=0,dtype=FLOAT,extbatch=1,mut_w=0) Found preset via lookup ConvOnePreset(in_regs=6,w_regs=4,w_prefetches=0,outf_tile=32,in_mode=MS,d_minibatch=128,n_minibatch=61) Allocating 65536 bytes (aligned) Scratches: 0 @ 0 Running LayerNormVectorized@NEON Running FCViaConvOne Tuning 
ConvTask(batch=122,idepth=1,iheight=1,iwidth=1,ichannels=128,odepth=1,oheight=1,owidth=1,ochannels=512,kdepth=1,kheight=1,kwidth=1,dstride=1,hstride=1,wstride=1,ddilation=1,hdilation=1,wdilation=1,dpad=0,hpad=0,wpad=0,dtype=FLOAT,extbatch=1,mut_w=0) Found preset via lookup ConvOnePreset(in_regs=6,w_regs=4,w_prefetches=0,outf_tile=64,in_mode=MS,d_minibatch=128,n_minibatch=61) Allocating 262144 bytes (aligned) Scratches: 0 @ 0 Running UnaryOpVectorized[Gelu]@NEON Running FCViaConvOne Tuning ConvTask(batch=122,idepth=1,iheight=1,iwidth=1,ichannels=512,odepth=1,oheight=1,owidth=1,ochannels=128,kdepth=1,kheight=1,kwidth=1,dstride=1,hstride=1,wstride=1,ddilation=1,hdilation=1,wdilation=1,dpad=0,hpad=0,wpad=0,dtype=FLOAT,extbatch=1,mut_w=0) Found preset via lookup ConvOnePreset(in_regs=6,w_regs=4,w_prefetches=0,outf_tile=16,in_mode=MS,d_minibatch=512,n_minibatch=122) Allocating 262144 bytes (aligned) Scratches: 0 @ 0 Running ForwardingKernelOutput Running LayerNormVectorized@NEON Running FCViaConvOne Tuning ConvTask(batch=122,idepth=1,iheight=1,iwidth=1,ichannels=128,odepth=1,oheight=1,owidth=1,ochannels=384,kdepth=1,kheight=1,kwidth=1,dstride=1,hstride=1,wstride=1,ddilation=1,hdilation=1,wdilation=1,dpad=0,hpad=0,wpad=0,dtype=FLOAT,extbatch=1,mut_w=0) Found preset via lookup ConvOnePreset(in_regs=7,w_regs=3,w_prefetches=0,outf_tile=48,in_mode=MS,d_minibatch=128,n_minibatch=61) Allocating 196608 bytes (aligned) Scratches: 0 @ 0 Running ForwardingKernelReshape Running TransposeIndexed Running ForwardingKernelOutput Building AIO network from graph graph(%input.362 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu), %scaled_dot_product_attention_10 : Float(1, 8, 122, 16, strides=[15616, 1952, 16, 1], requires_grad=0, device=cpu)): %2 : int[] = prim::Constant[value=[2, 0, 3, 1, 4]]() %3 : int[] = prim::Constant[value=[1, 122, 3, 8, 16]]() %self.self_blocks_0_attn_qkv.bias.2 : Float(384, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_11_attn_qkv.weight.2 : Float(384, 128, strides=[128, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_10_mlp_fc2.weight.5 : Float(128, 512, strides=[512, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %7 : str = prim::Constant[value="none"]() %self.self_blocks_0_mlp_fc1.bias.7 : Float(512, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_10_mlp_fc1.weight.7 : Float(512, 128, strides=[128, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %10 : bool = prim::Constant[value=1]() %11 : float = prim::Constant[value=9.9999999999999995e-07]() %self.self_blocks_0_norm1.weight.8 : Float(128, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %13 : int[] = prim::Constant[value=[128]]() %self.self_blocks_0_norm1.bias.10 : Float(128, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]() %self.self_blocks_10_attn_proj.weight.10 : Float(128, 128, strides=[128, 1], requires_grad=0, device=cpu) = prim::Constant[value=]() %16 : int[] = prim::Constant[value=[1, 122, 128]]() %17 : int = prim::Constant[value=1]() %18 : int[] = prim::Constant[value=[0, 2, 1, 3]]() %19 : Float(1, 122, 8, 16, strides=[15616, 16, 1952, 1], requires_grad=0, device=cpu) = aten::permute(%scaled_dot_product_attention_10, %18) %input.370 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::reshape(%19, %16) # .1:309:0 %input.374 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = 
aten::linear(%input.370, %self.self_blocks_10_attn_proj.weight.10, %self.self_blocks_0_norm1.bias.10), scope: __module.self_blocks_10_attn_proj # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 %input.378 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::add(%input.362, %input.374, %17) # .1:314:0 %input.382 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::layer_norm(%input.378, %13, %self.self_blocks_0_norm1.weight.8, %self.self_blocks_0_norm1.bias.10, %11, %10), scope: __module.self_blocks_10_norm2 # /usr/local/lib/python3.10/dist-packages/torch/nn/functional.py:2515:0 %input.386 : Float(1, 122, 512, strides=[62464, 512, 1], requires_grad=0, device=cpu) = aten::linear(%input.382, %self.self_blocks_10_mlp_fc1.weight.7, %self.self_blocks_0_mlp_fc1.bias.7), scope: __module.self_blocks_10_mlp_fc1 # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 %input.390 : Float(1, 122, 512, strides=[62464, 512, 1], requires_grad=0, device=cpu) = aten::gelu(%input.386, %7), scope: __module.self_blocks_10_mlp_act # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/activation.py:685:0 %input.394 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::linear(%input.390, %self.self_blocks_10_mlp_fc2.weight.5, %self.self_blocks_0_norm1.bias.10), scope: __module.self_blocks_10_mlp_fc2 # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 %input.398 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::add(%input.378, %input.394, %17) # .1:324:0 %input.402 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::layer_norm(%input.398, %13, %self.self_blocks_0_norm1.weight.8, %self.self_blocks_0_norm1.bias.10, %11, %10), scope: __module.self_blocks_11_norm1 # /usr/local/lib/python3.10/dist-packages/torch/nn/functional.py:2515:0 %self_blocks_11_attn_qkv.2 : Float(1, 122, 384, strides=[46848, 384, 1], requires_grad=0, device=cpu) = aten::linear(%input.402, %self.self_blocks_11_attn_qkv.weight.2, %self.self_blocks_0_attn_qkv.bias.2), scope: __module.self_blocks_11_attn_qkv # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0 %reshape_22.1 : Float(1, 122, 3, 8, 16, strides=[46848, 384, 128, 16, 1], requires_grad=0, device=cpu) = aten::reshape(%self_blocks_11_attn_qkv.2, %3) # .1:327:0 %permute_11.1 : Float(3, 1, 8, 122, 16, strides=[128, 46848, 16, 384, 1], requires_grad=0, device=cpu) = aten::permute(%reshape_22.1, %2) # .1:328:0 return (%permute_11.1, %input.398) Adding prim::Constant layer to network Adding prim::Constant layer to network Adding prim::Constant layer to network Adding prim::Constant layer to network Adding prim::Constant layer to network Adding prim::Constant layer to network Adding prim::Constant layer to network Adding prim::Constant layer to network Adding prim::Constant layer to network Adding prim::Constant layer to network Adding prim::Constant layer to network Adding prim::Constant layer to network Adding prim::Constant layer to network Adding prim::Constant layer to network Adding prim::Constant layer to network Adding prim::Constant layer to network Adding prim::Constant layer to network Adding aten::permute layer to network Registering network input: Permute input index: 1 Creating blob for Input layer 387 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 8, 122, 16] Creating blob (executor) for Data layer 388 
with type INT32 shape [4] 0 2 1 3 Allocating 16 bytes (aligned) Adding aten::reshape layer to network Creating blob (executor) for Data layer 390 with type INT64 shape [3] [1, 122, 128] Allocating 24 bytes (aligned) Adding aten::linear layer to network Binding inputs for Linear layer Weight 0xaaab0e6382c0 , Bias 0xaaab0dd9a000 Creating blob for Data layer 392 with type FLOAT format PlainDataFormat(FORMATF_ROW_MAJOR)[0x0000000000000004] shape [128, 128] Creating blob for Data layer 393 with type FLOAT format PlainDataFormat(FORMATF_LINEAR)[0x0000000000000001] shape [128] Adding aten::add layer to network Registering network input: Lhs input index: 0 Creating blob for Input layer 395 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 122, 128] Adding aten::layer_norm layer to network Creating blob (executor) for Data layer 397 with type INT32 shape [1] 128 Allocating 4 bytes (aligned) Creating blob for Data layer 398 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [128] Creating blob for Data layer 399 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [128] Adding aten::linear layer to network Binding inputs for Linear layer Weight 0xaaab0e648ac0 , Bias 0xaaab0dd9bdc0 Creating blob for Data layer 401 with type FLOAT format PlainDataFormat(FORMATF_ROW_MAJOR)[0x0000000000000004] shape [512, 128] Creating blob for Data layer 402 with type FLOAT format PlainDataFormat(FORMATF_LINEAR)[0x0000000000000001] shape [512] Adding aten::gelu layer to network Adding aten::linear layer to network Binding inputs for Linear layer Weight 0xaaab0e6893c0 , Bias 0xaaab0dd9a000 Creating blob for Data layer 405 with type FLOAT format PlainDataFormat(FORMATF_ROW_MAJOR)[0x0000000000000004] shape [128, 512] Creating blob for Data layer 406 with type FLOAT format PlainDataFormat(FORMATF_LINEAR)[0x0000000000000001] shape [128] Adding aten::add layer to network Adding aten::layer_norm layer to network Creating blob (executor) for Data layer 409 with type INT32 shape [1] 128 Allocating 4 bytes (aligned) Creating blob for Data layer 410 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [128] Creating blob for Data layer 411 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [128] Adding aten::linear layer to network Binding inputs for Linear layer Weight 0xaaab0e6c9bc0 , Bias 0xaaab0dd9b2c0 Creating blob for Data layer 413 with type FLOAT format PlainDataFormat(FORMATF_ROW_MAJOR)[0x0000000000000004] shape [384, 128] Creating blob for Data layer 414 with type FLOAT format PlainDataFormat(FORMATF_LINEAR)[0x0000000000000001] shape [384] Adding aten::reshape layer to network Creating blob (executor) for Data layer 416 with type INT64 shape [5] [1, 122, 3, 8, 16] Allocating 40 bytes (aligned) Adding aten::permute layer to network Creating blob (executor) for Data layer 418 with type INT32 shape [5] 2 0 3 1 4 Allocating 20 bytes (aligned) Running AIO Network Layer FullyConnected got PlainDataFormat(FORMATF_BATCH_ROW_MAJOR)[0x0000000000000015] while it prefers PlainDataFormat(FORMATF_ROW_MAJOR)[0x0000000000000004] but no such conversion is available in DLS Layer FullyConnected got PlainDataFormat(FORMATF_BATCH_ROW_MAJOR)[0x0000000000000015] while it prefers PlainDataFormat(FORMATF_ROW_MAJOR)[0x0000000000000004] but no such conversion is available in DLS Layer FullyConnected got PlainDataFormat(FORMATF_BATCH_ROW_MAJOR)[0x0000000000000015] while it prefers 
PlainDataFormat(FORMATF_ROW_MAJOR)[0x0000000000000004] but no such conversion is available in DLS Layer FullyConnected got PlainDataFormat(FORMATF_BATCH_ROW_MAJOR)[0x0000000000000015] while it prefers PlainDataFormat(FORMATF_ROW_MAJOR)[0x0000000000000004] but no such conversion is available in DLS Selected kernel Data for layer Data : Selected kernel Data for layer Data : Selected kernel Data for layer Data : Selected kernel Data for layer Data : Selected kernel Data for layer Data : Selected kernel Data for layer Data : Selected kernel Data for layer Data : Selected kernel Data for layer Data : Selected kernel Data for layer Data : Selected kernel Data for layer Data : Selected kernel Data for layer Data : Selected kernel Data for layer Data : Selected kernel Data for layer Data : Selected kernel Data for layer Data : Selected kernel Input for layer Input : Lhs input Selected kernel Data for layer Data : Selected kernel Data for layer Data : Selected kernel Data for layer Data : Selected kernel Data for layer Data : Selected kernel Input for layer Input : Permute input Selected kernel TransposeBERTVectorized@NEON for layer Transpose : Selected kernel ForwardingKernelReshape for layer Reshape : Selected kernel FCViaConvOne for layer FullyConnected : Selected kernel BinaryOpVectorized[Add]@NEON for layer Add : Selected kernel LayerNormVectorized@NEON for layer LayerNorm : Selected kernel FCViaConvOne for layer FullyConnected : Selected kernel UnaryOpVectorized[Gelu]@NEON for layer Gelu : Selected kernel FCViaConvOne for layer FullyConnected : Selected kernel BinaryOpVectorized[Add]@NEON for layer Add : Selected kernel ForwardingKernelOutput for layer Output : Selected kernel LayerNormVectorized@NEON for layer LayerNorm : Selected kernel FCViaConvOne for layer FullyConnected : Selected kernel ForwardingKernelReshape for layer Reshape : Selected kernel TransposeIndexed for layer Transpose : Allocating 187392 bytes (aligned) Selected kernel ForwardingKernelOutput for layer Output : Considering merge of Add to Input Kernel Input rejected merge Merge of ( Add [1, 122, 128] ) to Lhs input ( Input Input ): Attempt merge failed Merge of ( Transpose [1, 122, 8, 16] ) to Permute input ( Input Input ): Target layer type is not mergeable Merge of ( Reshape [1, 122, 128] ) to ( Transpose TransposeBERTVectorized@NEON ): Target layer type is not mergeable Merge of ( FullyConnected [1, 122, 128] ) to ( Reshape ForwardingKernelReshape ): Target layer type is not mergeable Considering merge of Add to FCViaConvOne Merge of ( Add [1, 122, 128] ) to ( FullyConnected FCViaConvOne ): Successful Merge of ( FullyConnected [1, 122, 512] ) to ( LayerNorm LayerNormVectorized@NEON ): Target layer type is not mergeable Merge of ( Gelu [1, 122, 512] ) to ( FullyConnected FCViaConvOne ): Target layer type is not mergeable Merge of ( FullyConnected [1, 122, 128] ) to ( Gelu UnaryOpVectorized[Gelu]@NEON ): Target layer type is not mergeable Considering merge of Add to FCViaConvOne Merge of ( Add [1, 122, 128] ) to ( FullyConnected FCViaConvOne ): Successful Merge of ( FullyConnected [1, 122, 384] ) to ( LayerNorm LayerNormVectorized@NEON ): Target layer type is not mergeable Merge of ( Reshape [1, 122, 3, 8, 16] ) to ( FullyConnected FCViaConvOne ): Target layer type is not mergeable Merge of ( Transpose [3, 1, 8, 122, 16] ) to ( Reshape ForwardingKernelReshape ): Target layer type is not mergeable Merge of ( Output [3, 1, 8, 122, 16] ) to ( Transpose TransposeIndexed ): Target layer type is not mergeable External 
External allocation: allocating 62464 bytes
Creating external output: 1 , shape: [1, 122, 128]
External allocation: allocating 187392 bytes
Creating external output: 0 , shape: [3, 1, 8, 122, 16]
Allocating 62464 bytes (aligned)   (x2)
Allocating 249856 bytes (aligned)   (x2)
Running Data   (x14)
Running Input
Running Data   (x4)
Running Input
Running TransposeBERTVectorized@NEON
Running ForwardingKernelReshape
Running FCViaConvOne
Tuning ConvTask(batch=122,idepth=1,iheight=1,iwidth=1,ichannels=128,odepth=1,oheight=1,owidth=1,ochannels=128,kdepth=1,kheight=1,kwidth=1,dstride=1,hstride=1,wstride=1,ddilation=1,hdilation=1,wdilation=1,dpad=0,hpad=0,wpad=0,dtype=FLOAT,extbatch=1,mut_w=0)
Found preset via lookup ConvOnePreset(in_regs=6,w_regs=4,w_prefetches=0,outf_tile=32,in_mode=MS,d_minibatch=128,n_minibatch=61)
Allocating 65536 bytes (aligned)
Scratches: 0 @ 0
Running LayerNormVectorized@NEON
Running FCViaConvOne
Tuning ConvTask(batch=122,idepth=1,iheight=1,iwidth=1,ichannels=128,odepth=1,oheight=1,owidth=1,ochannels=512,kdepth=1,kheight=1,kwidth=1,dstride=1,hstride=1,wstride=1,ddilation=1,hdilation=1,wdilation=1,dpad=0,hpad=0,wpad=0,dtype=FLOAT,extbatch=1,mut_w=0)
Found preset via lookup ConvOnePreset(in_regs=6,w_regs=4,w_prefetches=0,outf_tile=64,in_mode=MS,d_minibatch=128,n_minibatch=61)
Allocating 262144 bytes (aligned)
Scratches: 0 @ 0
Running UnaryOpVectorized[Gelu]@NEON
Running FCViaConvOne
Tuning ConvTask(batch=122,idepth=1,iheight=1,iwidth=1,ichannels=512,odepth=1,oheight=1,owidth=1,ochannels=128,kdepth=1,kheight=1,kwidth=1,dstride=1,hstride=1,wstride=1,ddilation=1,hdilation=1,wdilation=1,dpad=0,hpad=0,wpad=0,dtype=FLOAT,extbatch=1,mut_w=0)
Found preset via lookup ConvOnePreset(in_regs=6,w_regs=4,w_prefetches=0,outf_tile=16,in_mode=MS,d_minibatch=512,n_minibatch=122)
Allocating 262144 bytes (aligned)
Scratches: 0 @ 0
Running ForwardingKernelOutput
Running LayerNormVectorized@NEON
Running FCViaConvOne
Tuning ConvTask(batch=122,idepth=1,iheight=1,iwidth=1,ichannels=128,odepth=1,oheight=1,owidth=1,ochannels=384,kdepth=1,kheight=1,kwidth=1,dstride=1,hstride=1,wstride=1,ddilation=1,hdilation=1,wdilation=1,dpad=0,hpad=0,wpad=0,dtype=FLOAT,extbatch=1,mut_w=0)
Found preset via lookup ConvOnePreset(in_regs=7,w_regs=3,w_prefetches=0,outf_tile=48,in_mode=MS,d_minibatch=128,n_minibatch=61)
Allocating 196608 bytes (aligned)
Scratches: 0 @ 0
Running ForwardingKernelReshape
Running TransposeIndexed
Running ForwardingKernelOutput
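Each FullyConnected above is dispatched through FCViaConvOne: the 122 tokens become the convolution batch and the feature dimensions become channels, which is why the tuner sees a 1x1x1 kernel (kdepth=kheight=kwidth=1) with ichannels/ochannels taken from the weight shape. A quick PyTorch sketch of that equivalence for the [512, 128] fc1 weight; the conv1d formulation and names are illustrative, mirroring ConvTask(batch=122, ichannels=128, ochannels=512) rather than reproducing the runtime's code:

```python
import torch

x = torch.randn(1, 122, 128)               # [batch, tokens, features]
w = torch.randn(512, 128)                   # stand-in for the Data layer 401 blob
b = torch.randn(512)                        # stand-in for the Data layer 402 blob

as_linear = torch.nn.functional.linear(x, w, b)            # [1, 122, 512]

# Same math as a pointwise convolution: 122 "pixels", 128 in-channels, 512 out-channels.
as_conv = torch.nn.functional.conv1d(
    x.reshape(122, 128, 1),                 # each token is a single spatial position
    w.reshape(512, 128, 1),
    b,
).reshape(1, 122, 512)

print(torch.allclose(as_linear, as_conv, atol=1e-4))        # True
```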
Building AIO network from graph
graph(%input.398 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu),
      %scaled_dot_product_attention_11 : Float(1, 8, 122, 16, strides=[15616, 1952, 16, 1], requires_grad=0, device=cpu)):
  %2 : int = prim::Constant[value=9223372036854775807]()
  %3 : int = prim::Constant[value=0]()
  %self.self_blocks_11_mlp_fc2.weight.3 : Float(128, 512, strides=[512, 1], requires_grad=0, device=cpu) = prim::Constant[value=]()
  %5 : str = prim::Constant[value="none"]()
  %self.self_blocks_0_mlp_fc1.bias.5 : Float(512, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]()
  %self.self_blocks_11_mlp_fc1.weight.5 : Float(512, 128, strides=[128, 1], requires_grad=0, device=cpu) = prim::Constant[value=]()
  %8 : bool = prim::Constant[value=1]()
  %9 : float = prim::Constant[value=9.9999999999999995e-07]()
  %self.self_blocks_0_norm1.weight.6 : Float(128, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]()
  %11 : int[] = prim::Constant[value=[128]]()
  %self.self_blocks_0_norm1.bias.8 : Float(128, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]()
  %self.self_blocks_11_attn_proj.weight.8 : Float(128, 128, strides=[128, 1], requires_grad=0, device=cpu) = prim::Constant[value=]()
  %14 : int[] = prim::Constant[value=[1, 122, 128]]()
  %15 : int = prim::Constant[value=1]()
  %16 : int[] = prim::Constant[value=[0, 2, 1, 3]]()
  %17 : Float(1, 122, 8, 16, strides=[15616, 16, 1952, 1], requires_grad=0, device=cpu) = aten::permute(%scaled_dot_product_attention_11, %16)
  %input.406 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::reshape(%17, %14) # .1:337:0
  %input.410 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::linear(%input.406, %self.self_blocks_11_attn_proj.weight.8, %self.self_blocks_0_norm1.bias.8), scope: __module.self_blocks_11_attn_proj # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0
  %input.414 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::add(%input.398, %input.410, %15) # .1:342:0
  %input.418 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::layer_norm(%input.414, %11, %self.self_blocks_0_norm1.weight.6, %self.self_blocks_0_norm1.bias.8, %9, %8), scope: __module.self_blocks_11_norm2 # /usr/local/lib/python3.10/dist-packages/torch/nn/functional.py:2515:0
  %input.422 : Float(1, 122, 512, strides=[62464, 512, 1], requires_grad=0, device=cpu) = aten::linear(%input.418, %self.self_blocks_11_mlp_fc1.weight.5, %self.self_blocks_0_mlp_fc1.bias.5), scope: __module.self_blocks_11_mlp_fc1 # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0
  %input.426 : Float(1, 122, 512, strides=[62464, 512, 1], requires_grad=0, device=cpu) = aten::gelu(%input.422, %5), scope: __module.self_blocks_11_mlp_act # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/activation.py:685:0
  %input.430 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::linear(%input.426, %self.self_blocks_11_mlp_fc2.weight.3, %self.self_blocks_0_norm1.bias.8), scope: __module.self_blocks_11_mlp_fc2 # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0
  %input.434 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::add(%input.414, %input.430, %15) # .1:352:0
  %self_norm.2 : Float(1, 122, 128, strides=[15616, 128, 1], requires_grad=0, device=cpu) = aten::layer_norm(%input.434, %11, %self.self_blocks_0_norm1.weight.6, %self.self_blocks_0_norm1.bias.8, %9, %8), scope: __module.self_norm # /usr/local/lib/python3.10/dist-packages/torch/nn/functional.py:2515:0
  %27 : Tensor = aten::slice(%self_norm.2, %3, %3, %2, %15) # .1:354:0
  return (%27)
Adding prim::Constant layer to network   (x15)
Adding aten::permute layer to network
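Read back into plain PyTorch, the traced subgraph above is the tail of the last encoder block plus the final norm and slice. A hypothetical reconstruction under that reading (parameter names loosely follow the scopes in the IR; the actual values are of course not in the log):

```python
import torch
import torch.nn.functional as F

def last_block_tail(x_residual, sdpa_out, p):
    """Hypothetical re-reading of the traced graph above.
    x_residual : %input.398, Float[1, 122, 128]
    sdpa_out   : %scaled_dot_product_attention_11, Float[1, 8, 122, 16]
    p          : dict holding the prim::Constant weights/biases."""
    y = sdpa_out.permute(0, 2, 1, 3).reshape(1, 122, 128)             # %17 -> %input.406
    y = F.linear(y, p["attn_proj.weight"], p["attn_proj.bias"])       # %input.410
    x = x_residual + y                                                # %input.414
    h = F.layer_norm(x, [128], p["norm2.weight"], p["norm2.bias"],
                     eps=1e-6)                                        # %input.418 (%9 ~ 1e-6)
    h = F.linear(h, p["mlp_fc1.weight"], p["mlp_fc1.bias"])           # %input.422
    h = F.gelu(h)                                                     # %input.426 ("none" approx.)
    h = F.linear(h, p["mlp_fc2.weight"], p["mlp_fc2.bias"])           # %input.430
    x = x + h                                                         # %input.434
    x = F.layer_norm(x, [128], p["norm.weight"], p["norm.bias"],
                     eps=1e-6)                                        # %self_norm.2
    return x[0:]   # %27: aten::slice(dim=0, start=0, end=2**63-1, step=1) keeps everything

# Random parameters just to exercise the shapes.
p = {k: torch.randn(*s) for k, s in {
    "attn_proj.weight": (128, 128), "attn_proj.bias": (128,),
    "norm2.weight": (128,), "norm2.bias": (128,),
    "mlp_fc1.weight": (512, 128), "mlp_fc1.bias": (512,),
    "mlp_fc2.weight": (128, 512), "mlp_fc2.bias": (128,),
    "norm.weight": (128,), "norm.bias": (128,)}.items()}
out = last_block_tail(torch.randn(1, 122, 128), torch.randn(1, 8, 122, 16), p)
print(out.shape)   # torch.Size([1, 122, 128]) -- matches external output 0 of this network
```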
Registering network input: Permute input index: 1
Creating blob for Input layer 422 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 8, 122, 16]
Creating blob (executor) for Data layer 423 with type INT32 shape [4] 0 2 1 3
Allocating 16 bytes (aligned)
Adding aten::reshape layer to network
Creating blob (executor) for Data layer 425 with type INT64 shape [3] [1, 122, 128]
Allocating 24 bytes (aligned)
Adding aten::linear layer to network
Binding inputs for Linear layer Weight 0xaaab0e6fa2c0 , Bias 0xaaab0dd9a000
Creating blob for Data layer 427 with type FLOAT format PlainDataFormat(FORMATF_ROW_MAJOR)[0x0000000000000004] shape [128, 128]
Creating blob for Data layer 428 with type FLOAT format PlainDataFormat(FORMATF_LINEAR)[0x0000000000000001] shape [128]
Adding aten::add layer to network
Registering network input: Lhs input index: 0
Creating blob for Input layer 430 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 122, 128]
Adding aten::layer_norm layer to network
Creating blob (executor) for Data layer 432 with type INT32 shape [1] 128
Allocating 4 bytes (aligned)
Creating blob for Data layer 433 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [128]
Creating blob for Data layer 434 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [128]
Adding aten::linear layer to network
Binding inputs for Linear layer Weight 0xaaab0e70aac0 , Bias 0xaaab0dd9bdc0
Creating blob for Data layer 436 with type FLOAT format PlainDataFormat(FORMATF_ROW_MAJOR)[0x0000000000000004] shape [512, 128]
Creating blob for Data layer 437 with type FLOAT format PlainDataFormat(FORMATF_LINEAR)[0x0000000000000001] shape [512]
Adding aten::gelu layer to network
Adding aten::linear layer to network
Binding inputs for Linear layer Weight 0xaaab0e74b3c0 , Bias 0xaaab0dd9a000
Creating blob for Data layer 440 with type FLOAT format PlainDataFormat(FORMATF_ROW_MAJOR)[0x0000000000000004] shape [128, 512]
Creating blob for Data layer 441 with type FLOAT format PlainDataFormat(FORMATF_LINEAR)[0x0000000000000001] shape [128]
Adding aten::add layer to network
Adding aten::layer_norm layer to network
Creating blob (executor) for Data layer 444 with type INT32 shape [1] 128
Allocating 4 bytes (aligned)
Creating blob for Data layer 445 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [128]
Creating blob for Data layer 446 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [128]
Adding aten::slice layer to network
Creating blob (executor) for data layer 448 with type INT64
Allocating 8 bytes (aligned)
Creating blob (executor) for data layer 449 with type INT64
Allocating 8 bytes (aligned)
Creating blob (executor) for data layer 450 with type INT64
Allocating 8 bytes (aligned)
Creating blob (executor) for data layer 451 with type INT64
Allocating 8 bytes (aligned)
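The four INT64 executor blobs just created (data layers 448-451) hold the aten::slice arguments: dim, start, end and step. In the traced graph these are 0, 0, 9223372036854775807 and 1, i.e. INT64_MAX as the open end, so the TorchSlice layer is a full-range copy along dim 0. A minimal illustration of why the oversized end index is harmless (standard slicing clamps it to the dimension size):

```python
import torch

x = torch.randn(1, 122, 128)
INT64_MAX = 9223372036854775807            # the %2 constant in the traced graph

# aten::slice(self, dim=0, start=0, end=INT64_MAX, step=1): the end is clamped,
# so the result is simply the whole tensor.
sliced = x[0:INT64_MAX:1]
print(sliced.shape, torch.equal(sliced, x))   # torch.Size([1, 122, 128]) True
```

Even though the slice is a no-op here, the log below shows it being kept as its own TorchSliceVectorized@NEON kernel rather than being elided.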
Running AIO Network
Layer FullyConnected got PlainDataFormat(FORMATF_BATCH_ROW_MAJOR)[0x0000000000000015] while it prefers PlainDataFormat(FORMATF_ROW_MAJOR)[0x0000000000000004] but no such conversion is available in DLS   (x3)
Selected kernel Data for layer Data : Slice step
Selected kernel Data for layer Data : Slice end
Selected kernel Data for layer Data : Slice start
Selected kernel Data for layer Data : Slice dim
Selected kernel Data for layer Data :   (x10)
Selected kernel Input for layer Input : Lhs input
Selected kernel Data for layer Data :   (x4)
Selected kernel Input for layer Input : Permute input
Selected kernel TransposeBERTVectorized@NEON for layer Transpose :
Selected kernel ForwardingKernelReshape for layer Reshape :
Selected kernel FCViaConvOne for layer FullyConnected :
Selected kernel BinaryOpVectorized[Add]@NEON for layer Add :
Selected kernel LayerNormVectorized@NEON for layer LayerNorm :
Selected kernel FCViaConvOne for layer FullyConnected :
Selected kernel UnaryOpVectorized[Gelu]@NEON for layer Gelu :
Selected kernel FCViaConvOne for layer FullyConnected :
Selected kernel BinaryOpVectorized[Add]@NEON for layer Add :
Selected kernel LayerNormVectorized@NEON for layer LayerNorm :
Selected kernel TorchSliceVectorized@NEON for layer TorchSlice :
Selected kernel ForwardingKernelOutput for layer Output :
Considering merge of Add to Input
Kernel Input rejected merge
Merge of ( Add [1, 122, 128] ) to Lhs input ( Input Input ): Attempt merge failed
Merge of ( Transpose [1, 122, 8, 16] ) to Permute input ( Input Input ): Target layer type is not mergeable
Merge of ( Reshape [1, 122, 128] ) to ( Transpose TransposeBERTVectorized@NEON ): Target layer type is not mergeable
Merge of ( FullyConnected [1, 122, 128] ) to ( Reshape ForwardingKernelReshape ): Target layer type is not mergeable
Considering merge of Add to FCViaConvOne
Merge of ( Add [1, 122, 128] ) to ( FullyConnected FCViaConvOne ): Successful
Merge of ( FullyConnected [1, 122, 512] ) to ( LayerNorm LayerNormVectorized@NEON ): Target layer type is not mergeable
Merge of ( Gelu [1, 122, 512] ) to ( FullyConnected FCViaConvOne ): Target layer type is not mergeable
Merge of ( FullyConnected [1, 122, 128] ) to ( Gelu UnaryOpVectorized[Gelu]@NEON ): Target layer type is not mergeable
Considering merge of Add to FCViaConvOne
Merge of ( Add [1, 122, 128] ) to ( FullyConnected FCViaConvOne ): Successful
Merge of ( LayerNorm [1, 122, 128] ) to ( FullyConnected FCViaConvOne ): Target layer type is not mergeable
Merge of ( TorchSlice [1, 122, 128] ) to ( LayerNorm LayerNormVectorized@NEON ): Target layer type is not mergeable
Merge of ( Output [1, 122, 128] ) to ( TorchSlice TorchSliceVectorized@NEON ): Target layer type is not mergeable
External allocation: allocating 62464 bytes
Creating external output: 0 , shape: [1, 122, 128]
Allocating 62464 bytes (aligned)   (x2)
Allocating 249856 bytes (aligned)   (x2)
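The LayerNorm layers selected above normalize over the last dimension ([128], per the INT32 shape blobs) with the eps taken from the traced constant %9, i.e. roughly 1e-6. A manual sketch of what LayerNormVectorized@NEON has to compute per token, checked against torch (random data, hypothetical names):

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 122, 128)
weight, bias = torch.randn(128), torch.randn(128)
eps = 1e-6                                   # %9 = 9.9999999999999995e-07 in the IR

ref = F.layer_norm(x, [128], weight, bias, eps)

# Per-token mean and (biased) variance over the 128 features, then scale and shift.
mean = x.mean(dim=-1, keepdim=True)
var = x.var(dim=-1, unbiased=False, keepdim=True)
manual = (x - mean) / torch.sqrt(var + eps) * weight + bias

print(torch.allclose(ref, manual, atol=1e-5))   # True
```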
Running Data   (x14)
Running Input
Running Data   (x4)
Running Input
Running TransposeBERTVectorized@NEON
Running ForwardingKernelReshape
Running FCViaConvOne
Tuning ConvTask(batch=122,idepth=1,iheight=1,iwidth=1,ichannels=128,odepth=1,oheight=1,owidth=1,ochannels=128,kdepth=1,kheight=1,kwidth=1,dstride=1,hstride=1,wstride=1,ddilation=1,hdilation=1,wdilation=1,dpad=0,hpad=0,wpad=0,dtype=FLOAT,extbatch=1,mut_w=0)
Found preset via lookup ConvOnePreset(in_regs=6,w_regs=4,w_prefetches=0,outf_tile=32,in_mode=MS,d_minibatch=128,n_minibatch=61)
Allocating 65536 bytes (aligned)
Scratches: 0 @ 0
Running LayerNormVectorized@NEON
Running FCViaConvOne
Tuning ConvTask(batch=122,idepth=1,iheight=1,iwidth=1,ichannels=128,odepth=1,oheight=1,owidth=1,ochannels=512,kdepth=1,kheight=1,kwidth=1,dstride=1,hstride=1,wstride=1,ddilation=1,hdilation=1,wdilation=1,dpad=0,hpad=0,wpad=0,dtype=FLOAT,extbatch=1,mut_w=0)
Found preset via lookup ConvOnePreset(in_regs=6,w_regs=4,w_prefetches=0,outf_tile=64,in_mode=MS,d_minibatch=128,n_minibatch=61)
Allocating 262144 bytes (aligned)
Scratches: 0 @ 0
Running UnaryOpVectorized[Gelu]@NEON
Running FCViaConvOne
Tuning ConvTask(batch=122,idepth=1,iheight=1,iwidth=1,ichannels=512,odepth=1,oheight=1,owidth=1,ochannels=128,kdepth=1,kheight=1,kwidth=1,dstride=1,hstride=1,wstride=1,ddilation=1,hdilation=1,wdilation=1,dpad=0,hpad=0,wpad=0,dtype=FLOAT,extbatch=1,mut_w=0)
Found preset via lookup ConvOnePreset(in_regs=6,w_regs=4,w_prefetches=0,outf_tile=16,in_mode=MS,d_minibatch=512,n_minibatch=122)
Allocating 262144 bytes (aligned)
Scratches: 0 @ 0
Running LayerNormVectorized@NEON
Running TorchSliceVectorized@NEON
Running ForwardingKernelOutput
Building AIO network from graph
graph(%input.245 : Float(1, 128, strides=[15616, 1], requires_grad=0, device=cpu)):
  %self.self_head.bias : Float(1000, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=]()
  %self.self_head.weight : Float(1000, 128, strides=[128, 1], requires_grad=0, device=cpu) = prim::Constant[value=]()
  %3 : Float(1, 1000, strides=[1000, 1], requires_grad=0, device=cpu) = aten::linear(%input.245, %self.self_head.weight, %self.self_head.bias), scope: __module.self_head # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py:114:0
  return (%3)
Adding prim::Constant layer to network   (x2)
Adding aten::linear layer to network
Binding inputs for Linear layer Weight 0xaaab0e78bd80 , Bias 0xaaab0e808dc0
Registering network input: Linear input index: 0
Creating blob for Input layer 454 with type FLOAT format PlainDataFormat(FORMATF_ROW_MAJOR)[0x0000000000000004] shape [1, 128]
Creating blob for Data layer 455 with type FLOAT format PlainDataFormat(FORMATF_ROW_MAJOR)[0x0000000000000004] shape [1000, 128]
Creating blob for Data layer 456 with type FLOAT format PlainDataFormat(FORMATF_LINEAR)[0x0000000000000001] shape [1000]
Running AIO Network
Selected kernel Data for layer Data :   (x2)
Selected kernel Input for layer Input : Linear input
Selected kernel FCViaConvOne for layer FullyConnected :
Selected kernel ForwardingKernelOutput for layer Output :
Merge of ( FullyConnected [1, 1000] ) to Linear input ( Input Input ): Target layer type is not mergeable
Merge of ( Output [1, 1000] ) to ( FullyConnected FCViaConvOne ): Target layer type is not mergeable
External allocation: allocating 4000 bytes
Creating external output: 0 , shape: [1, 1000]
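This last subnetwork is just the classification head from the IR above: an aten::linear from the 128-dim token to 1000 logits, which is why its external output is 4000 bytes (1000 FP32 values). A hedged sketch with hypothetical names:

```python
import torch

cls_token = torch.randn(1, 128)            # Input layer 454, shape [1, 128]
head_w = torch.randn(1000, 128)            # stand-in for the Data layer 455 blob
head_b = torch.randn(1000)                 # stand-in for the Data layer 456 blob

logits = torch.nn.functional.linear(cls_token, head_w, head_b)
print(logits.shape, logits.numel() * 4)    # torch.Size([1, 1000]) 4000 bytes of FP32
```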
Running Data   (x2)
Running Input
Running FCViaConvOne
Tuning ConvTask(batch=1,idepth=1,iheight=1,iwidth=1,ichannels=128,odepth=1,oheight=1,owidth=1,ochannels=1000,kdepth=1,kheight=1,kwidth=1,dstride=1,hstride=1,wstride=1,ddilation=1,hdilation=1,wdilation=1,dpad=0,hpad=0,wpad=0,dtype=FLOAT,extbatch=1,mut_w=0)
Found preset with best score ConvOnePreset(in_regs=3,w_regs=4,w_prefetches=0,outf_tile=32,in_mode=MS,d_minibatch=256,n_minibatch=200)
Allocating 516096 bytes (aligned)
Scratches: 0 @ 0
Jitted kernel for init: in_mode: MultiStream acc_init: AccInitializer::ZERO in_dtype: FLOAT ref_grid: [3x4] out_features: 16 in_tail_cols: 0 int8_apply_filter_offset: 0 int8_shift_uint8_to_sint8: 0 postprocessing_ops: PP[BINOP_ADD_LINEAR,NONE,NONE,NONE,NONE,NONE,] inner_iter_length: [no-value] sparse_proxy_in_optimization: 0 strided_weights: 0 input_can_read_last_full_vector: 0 weights_can_read_last_full_vector: 0 prefetch_options: {w_ahead: 0} at 0xffff6963a000 , used 1372 B
Jitted kernel for init: in_mode: MultiStream acc_init: AccInitializer::ZERO in_dtype: FLOAT ref_grid: [3x4] out_features: 8 in_tail_cols: 0 int8_apply_filter_offset: 0 int8_shift_uint8_to_sint8: 0 postprocessing_ops: PP[BINOP_ADD_LINEAR,NONE,NONE,NONE,NONE,NONE,] inner_iter_length: [no-value] sparse_proxy_in_optimization: 0 strided_weights: 0 input_can_read_last_full_vector: 0 weights_can_read_last_full_vector: 0 prefetch_options: {w_ahead: 0} at 0xffff4c000000 , used 964 B
Running ForwardingKernelOutput
Creating blob for Input layer 4 with type FLOAT format PlainDataFormat(FORMATF_CAFFE)[0x000000000000000a] shape [1, 3, 110, 110]
Running AIO Network
External allocation: allocating 187392 bytes
Creating external output: 0 , shape: [3, 1, 8, 122, 16]
External allocation: allocating 62464 bytes
Creating external output: 1 , shape: [1, 122, 128]
Running Data   (x12)
Layer Transpose 33 was const-folded
Running Data   (x3)
Running Input
Running TransposeBRC3x4
Running ConvViaJitMatmul
Running TransposeBRC4x4
Running ForwardingKernelFlatten
Running Data
Layer Transpose 31 was const-folded
Running ConcatLastDim
Running BinaryOpVectorized[Add]@NEON
Running TransposeBRC4x4
Running LayerNormVectorized@NEON
Running FCViaConvOne
Scratches: 0 @ 0
Running ForwardingKernelReshape
Running TransposeIndexed
Running ForwardingKernelOutput
Running ForwardingKernelOutput
Creating blob for Input layer 37
ForwardingKernelReshape Running TransposeIndexed Running ForwardingKernelOutput Creating blob for Input layer 72 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 8, 122, 16] Creating blob for Input layer 80 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 122, 128] Running AIO Network External allocation: allocating 62464 bytes Creating external output: 1 , shape: [1, 122, 128] External allocation: allocating 187392 bytes Creating external output: 0 , shape: [3, 1, 8, 122, 16] Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Input Running Data Running Data Running Data Running Data Running Input Running TransposeBERTVectorized@NEON Running ForwardingKernelReshape Running FCViaConvOne Scratches: 0 @ 0 Running LayerNormVectorized@NEON Running FCViaConvOne Scratches: 0 @ 0 Running UnaryOpVectorized[Gelu]@NEON Running FCViaConvOne Scratches: 0 @ 0 Running ForwardingKernelOutput Running LayerNormVectorized@NEON Running FCViaConvOne Scratches: 0 @ 0 Running ForwardingKernelReshape Running TransposeIndexed Running ForwardingKernelOutput Creating blob for Input layer 107 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 8, 122, 16] Creating blob for Input layer 115 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 122, 128] Running AIO Network External allocation: allocating 62464 bytes Creating external output: 1 , shape: [1, 122, 128] External allocation: allocating 187392 bytes Creating external output: 0 , shape: [3, 1, 8, 122, 16] Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Input Running Data Running Data Running Data Running Data Running Input Running TransposeBERTVectorized@NEON Running ForwardingKernelReshape Running FCViaConvOne Scratches: 0 @ 0 Running LayerNormVectorized@NEON Running FCViaConvOne Scratches: 0 @ 0 Running UnaryOpVectorized[Gelu]@NEON Running FCViaConvOne Scratches: 0 @ 0 Running ForwardingKernelOutput Running LayerNormVectorized@NEON Running FCViaConvOne Scratches: 0 @ 0 Running ForwardingKernelReshape Running TransposeIndexed Running ForwardingKernelOutput Creating blob for Input layer 142 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 8, 122, 16] Creating blob for Input layer 150 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 122, 128] Running AIO Network External allocation: allocating 62464 bytes Creating external output: 1 , shape: [1, 122, 128] External allocation: allocating 187392 bytes Creating external output: 0 , shape: [3, 1, 8, 122, 16] Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Input Running Data Running Data Running Data Running Data Running Input Running TransposeBERTVectorized@NEON Running ForwardingKernelReshape Running FCViaConvOne Scratches: 0 @ 0 Running LayerNormVectorized@NEON Running FCViaConvOne Scratches: 0 @ 0 Running UnaryOpVectorized[Gelu]@NEON Running FCViaConvOne Scratches: 0 @ 0 Running ForwardingKernelOutput Running LayerNormVectorized@NEON Running FCViaConvOne Scratches: 0 @ 0 Running ForwardingKernelReshape Running 
TransposeIndexed Running ForwardingKernelOutput Creating blob for Input layer 177 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 8, 122, 16] Creating blob for Input layer 185 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 122, 128] Running AIO Network External allocation: allocating 62464 bytes Creating external output: 1 , shape: [1, 122, 128] External allocation: allocating 187392 bytes Creating external output: 0 , shape: [3, 1, 8, 122, 16] Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Input Running Data Running Data Running Data Running Data Running Input Running TransposeBERTVectorized@NEON Running ForwardingKernelReshape Running FCViaConvOne Scratches: 0 @ 0 Running LayerNormVectorized@NEON Running FCViaConvOne Scratches: 0 @ 0 Running UnaryOpVectorized[Gelu]@NEON Running FCViaConvOne Scratches: 0 @ 0 Running ForwardingKernelOutput Running LayerNormVectorized@NEON Running FCViaConvOne Scratches: 0 @ 0 Running ForwardingKernelReshape Running TransposeIndexed Running ForwardingKernelOutput Creating blob for Input layer 212 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 8, 122, 16] Creating blob for Input layer 220 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 122, 128] Running AIO Network External allocation: allocating 62464 bytes Creating external output: 1 , shape: [1, 122, 128] External allocation: allocating 187392 bytes Creating external output: 0 , shape: [3, 1, 8, 122, 16] Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Input Running Data Running Data Running Data Running Data Running Input Running TransposeBERTVectorized@NEON Running ForwardingKernelReshape Running FCViaConvOne Scratches: 0 @ 0 Running LayerNormVectorized@NEON Running FCViaConvOne Scratches: 0 @ 0 Running UnaryOpVectorized[Gelu]@NEON Running FCViaConvOne Scratches: 0 @ 0 Running ForwardingKernelOutput Running LayerNormVectorized@NEON Running FCViaConvOne Scratches: 0 @ 0 Running ForwardingKernelReshape Running TransposeIndexed Running ForwardingKernelOutput Creating blob for Input layer 247 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 8, 122, 16] Creating blob for Input layer 255 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 122, 128] Running AIO Network External allocation: allocating 62464 bytes Creating external output: 1 , shape: [1, 122, 128] External allocation: allocating 187392 bytes Creating external output: 0 , shape: [3, 1, 8, 122, 16] Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Input Running Data Running Data Running Data Running Data Running Input Running TransposeBERTVectorized@NEON Running ForwardingKernelReshape Running FCViaConvOne Scratches: 0 @ 0 Running LayerNormVectorized@NEON Running FCViaConvOne Scratches: 0 @ 0 Running UnaryOpVectorized[Gelu]@NEON Running FCViaConvOne Scratches: 0 @ 0 Running ForwardingKernelOutput Running LayerNormVectorized@NEON Running FCViaConvOne Scratches: 0 @ 0 Running ForwardingKernelReshape Running TransposeIndexed Running 
ForwardingKernelOutput Creating blob for Input layer 282 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 8, 122, 16] Creating blob for Input layer 290 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 122, 128] Running AIO Network External allocation: allocating 62464 bytes Creating external output: 1 , shape: [1, 122, 128] External allocation: allocating 187392 bytes Creating external output: 0 , shape: [3, 1, 8, 122, 16] Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Input Running Data Running Data Running Data Running Data Running Input Running TransposeBERTVectorized@NEON Running ForwardingKernelReshape Running FCViaConvOne Scratches: 0 @ 0 Running LayerNormVectorized@NEON Running FCViaConvOne Scratches: 0 @ 0 Running UnaryOpVectorized[Gelu]@NEON Running FCViaConvOne Scratches: 0 @ 0 Running ForwardingKernelOutput Running LayerNormVectorized@NEON Running FCViaConvOne Scratches: 0 @ 0 Running ForwardingKernelReshape Running TransposeIndexed Running ForwardingKernelOutput Creating blob for Input layer 317 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 8, 122, 16] Creating blob for Input layer 325 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 122, 128] Running AIO Network External allocation: allocating 62464 bytes Creating external output: 1 , shape: [1, 122, 128] External allocation: allocating 187392 bytes Creating external output: 0 , shape: [3, 1, 8, 122, 16] Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Input Running Data Running Data Running Data Running Data Running Input Running TransposeBERTVectorized@NEON Running ForwardingKernelReshape Running FCViaConvOne Scratches: 0 @ 0 Running LayerNormVectorized@NEON Running FCViaConvOne Scratches: 0 @ 0 Running UnaryOpVectorized[Gelu]@NEON Running FCViaConvOne Scratches: 0 @ 0 Running ForwardingKernelOutput Running LayerNormVectorized@NEON Running FCViaConvOne Scratches: 0 @ 0 Running ForwardingKernelReshape Running TransposeIndexed Running ForwardingKernelOutput Creating blob for Input layer 352 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 8, 122, 16] Creating blob for Input layer 360 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 122, 128] Running AIO Network External allocation: allocating 62464 bytes Creating external output: 1 , shape: [1, 122, 128] External allocation: allocating 187392 bytes Creating external output: 0 , shape: [3, 1, 8, 122, 16] Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Input Running Data Running Data Running Data Running Data Running Input Running TransposeBERTVectorized@NEON Running ForwardingKernelReshape Running FCViaConvOne Scratches: 0 @ 0 Running LayerNormVectorized@NEON Running FCViaConvOne Scratches: 0 @ 0 Running UnaryOpVectorized[Gelu]@NEON Running FCViaConvOne Scratches: 0 @ 0 Running ForwardingKernelOutput Running LayerNormVectorized@NEON Running FCViaConvOne Scratches: 0 @ 0 Running ForwardingKernelReshape Running TransposeIndexed Running ForwardingKernelOutput 
Creating blob for Input layer 387 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 8, 122, 16] Creating blob for Input layer 395 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 122, 128] Running AIO Network External allocation: allocating 62464 bytes Creating external output: 1 , shape: [1, 122, 128] External allocation: allocating 187392 bytes Creating external output: 0 , shape: [3, 1, 8, 122, 16] Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Input Running Data Running Data Running Data Running Data Running Input Running TransposeBERTVectorized@NEON Running ForwardingKernelReshape Running FCViaConvOne Scratches: 0 @ 0 Running LayerNormVectorized@NEON Running FCViaConvOne Scratches: 0 @ 0 Running UnaryOpVectorized[Gelu]@NEON Running FCViaConvOne Scratches: 0 @ 0 Running ForwardingKernelOutput Running LayerNormVectorized@NEON Running FCViaConvOne Scratches: 0 @ 0 Running ForwardingKernelReshape Running TransposeIndexed Running ForwardingKernelOutput Creating blob for Input layer 422 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 8, 122, 16] Creating blob for Input layer 430 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 122, 128] Running AIO Network External allocation: allocating 62464 bytes Creating external output: 0 , shape: [1, 122, 128] Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Input Running Data Running Data Running Data Running Data Running Input Running TransposeBERTVectorized@NEON Running ForwardingKernelReshape Running FCViaConvOne Scratches: 0 @ 0 Running LayerNormVectorized@NEON Running FCViaConvOne Scratches: 0 @ 0 Running UnaryOpVectorized[Gelu]@NEON Running FCViaConvOne Scratches: 0 @ 0 Running LayerNormVectorized@NEON Running TorchSliceVectorized@NEON Running ForwardingKernelOutput Creating blob for Input layer 454 with type FLOAT format PlainDataFormat(FORMATF_ROW_MAJOR)[0x0000000000000004] shape [1, 128] Running AIO Network External allocation: allocating 4000 bytes Creating external output: 0 , shape: [1, 1000] Running Data Running Data Running Input Running FCViaConvOne Scratches: 0 @ 0 Running ForwardingKernelOutput Creating blob for Input layer 4 with type FLOAT format PlainDataFormat(FORMATF_CAFFE)[0x000000000000000a] shape [1, 3, 110, 110] Running AIO Network External allocation: allocating 187392 bytes Creating external output: 0 , shape: [3, 1, 8, 122, 16] External allocation: allocating 62464 bytes Creating external output: 1 , shape: [1, 122, 128] Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Layer Transpose 33 was const-folded Running Data Running Data Running Data Running Input Running TransposeBRC3x4 Running ConvViaJitMatmul Running TransposeBRC4x4 Running ForwardingKernelFlatten Running Data Layer Transpose 31 was const-folded Running ConcatLastDim Running BinaryOpVectorized[Add]@NEON Running TransposeBRC4x4 Running LayerNormVectorized@NEON Running FCViaConvOne Scratches: 0 @ 0 Running ForwardingKernelReshape Running TransposeIndexed Running ForwardingKernelOutput Running ForwardingKernelOutput Creating blob for Input layer 37 
with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 8, 122, 16] Creating blob for Input layer 45 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 122, 128] Running AIO Network External allocation: allocating 62464 bytes Creating external output: 1 , shape: [1, 122, 128] External allocation: allocating 187392 bytes Creating external output: 0 , shape: [3, 1, 8, 122, 16] Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Input Running Data Running Data Running Data Running Data Running Input Running TransposeBERTVectorized@NEON Running ForwardingKernelReshape Running FCViaConvOne Scratches: 0 @ 0 Running LayerNormVectorized@NEON Running FCViaConvOne Scratches: 0 @ 0 Running UnaryOpVectorized[Gelu]@NEON Running FCViaConvOne Scratches: 0 @ 0 Running ForwardingKernelOutput Running LayerNormVectorized@NEON Running FCViaConvOne Scratches: 0 @ 0 Running ForwardingKernelReshape Running TransposeIndexed Running ForwardingKernelOutput Creating blob for Input layer 72 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 8, 122, 16] Creating blob for Input layer 80 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 122, 128] Running AIO Network External allocation: allocating 62464 bytes Creating external output: 1 , shape: [1, 122, 128] External allocation: allocating 187392 bytes Creating external output: 0 , shape: [3, 1, 8, 122, 16] Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Input Running Data Running Data Running Data Running Data Running Input Running TransposeBERTVectorized@NEON Running ForwardingKernelReshape Running FCViaConvOne Scratches: 0 @ 0 Running LayerNormVectorized@NEON Running FCViaConvOne Scratches: 0 @ 0 Running UnaryOpVectorized[Gelu]@NEON Running FCViaConvOne Scratches: 0 @ 0 Running ForwardingKernelOutput Running LayerNormVectorized@NEON Running FCViaConvOne Scratches: 0 @ 0 Running ForwardingKernelReshape Running TransposeIndexed Running ForwardingKernelOutput Creating blob for Input layer 107 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 8, 122, 16] Creating blob for Input layer 115 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 122, 128] Running AIO Network External allocation: allocating 62464 bytes Creating external output: 1 , shape: [1, 122, 128] External allocation: allocating 187392 bytes Creating external output: 0 , shape: [3, 1, 8, 122, 16] Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Input Running Data Running Data Running Data Running Data Running Input Running TransposeBERTVectorized@NEON Running ForwardingKernelReshape Running FCViaConvOne Scratches: 0 @ 0 Running LayerNormVectorized@NEON Running FCViaConvOne Scratches: 0 @ 0 Running UnaryOpVectorized[Gelu]@NEON Running FCViaConvOne Scratches: 0 @ 0 Running ForwardingKernelOutput Running LayerNormVectorized@NEON Running FCViaConvOne Scratches: 0 @ 0 Running ForwardingKernelReshape Running TransposeIndexed Running ForwardingKernelOutput Creating blob for Input layer 142 with type FLOAT format 
PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 8, 122, 16] Creating blob for Input layer 150 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 122, 128] Running AIO Network External allocation: allocating 62464 bytes Creating external output: 1 , shape: [1, 122, 128] External allocation: allocating 187392 bytes Creating external output: 0 , shape: [3, 1, 8, 122, 16] Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Input Running Data Running Data Running Data Running Data Running Input Running TransposeBERTVectorized@NEON Running ForwardingKernelReshape Running FCViaConvOne Scratches: 0 @ 0 Running LayerNormVectorized@NEON Running FCViaConvOne Scratches: 0 @ 0 Running UnaryOpVectorized[Gelu]@NEON Running FCViaConvOne Scratches: 0 @ 0 Running ForwardingKernelOutput Running LayerNormVectorized@NEON Running FCViaConvOne Scratches: 0 @ 0 Running ForwardingKernelReshape Running TransposeIndexed Running ForwardingKernelOutput Creating blob for Input layer 177 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 8, 122, 16] Creating blob for Input layer 185 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 122, 128] Running AIO Network External allocation: allocating 62464 bytes Creating external output: 1 , shape: [1, 122, 128] External allocation: allocating 187392 bytes Creating external output: 0 , shape: [3, 1, 8, 122, 16] Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Input Running Data Running Data Running Data Running Data Running Input Running TransposeBERTVectorized@NEON Running ForwardingKernelReshape Running FCViaConvOne Scratches: 0 @ 0 Running LayerNormVectorized@NEON Running FCViaConvOne Scratches: 0 @ 0 Running UnaryOpVectorized[Gelu]@NEON Running FCViaConvOne Scratches: 0 @ 0 Running ForwardingKernelOutput Running LayerNormVectorized@NEON Running FCViaConvOne Scratches: 0 @ 0 Running ForwardingKernelReshape Running TransposeIndexed Running ForwardingKernelOutput Creating blob for Input layer 212 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 8, 122, 16] Creating blob for Input layer 220 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 122, 128] Running AIO Network External allocation: allocating 62464 bytes Creating external output: 1 , shape: [1, 122, 128] External allocation: allocating 187392 bytes Creating external output: 0 , shape: [3, 1, 8, 122, 16] Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Input Running Data Running Data Running Data Running Data Running Input Running TransposeBERTVectorized@NEON Running ForwardingKernelReshape Running FCViaConvOne Scratches: 0 @ 0 Running LayerNormVectorized@NEON Running FCViaConvOne Scratches: 0 @ 0 Running UnaryOpVectorized[Gelu]@NEON Running FCViaConvOne Scratches: 0 @ 0 Running ForwardingKernelOutput Running LayerNormVectorized@NEON Running FCViaConvOne Scratches: 0 @ 0 Running ForwardingKernelReshape Running TransposeIndexed Running ForwardingKernelOutput Creating blob for Input layer 247 with type FLOAT format 
PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 8, 122, 16] Creating blob for Input layer 255 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 122, 128] Running AIO Network External allocation: allocating 62464 bytes Creating external output: 1 , shape: [1, 122, 128] External allocation: allocating 187392 bytes Creating external output: 0 , shape: [3, 1, 8, 122, 16] Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Input Running Data Running Data Running Data Running Data Running Input Running TransposeBERTVectorized@NEON Running ForwardingKernelReshape Running FCViaConvOne Scratches: 0 @ 0 Running LayerNormVectorized@NEON Running FCViaConvOne Scratches: 0 @ 0 Running UnaryOpVectorized[Gelu]@NEON Running FCViaConvOne Scratches: 0 @ 0 Running ForwardingKernelOutput Running LayerNormVectorized@NEON Running FCViaConvOne Scratches: 0 @ 0 Running ForwardingKernelReshape Running TransposeIndexed Running ForwardingKernelOutput Creating blob for Input layer 282 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 8, 122, 16] Creating blob for Input layer 290 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 122, 128] Running AIO Network External allocation: allocating 62464 bytes Creating external output: 1 , shape: [1, 122, 128] External allocation: allocating 187392 bytes Creating external output: 0 , shape: [3, 1, 8, 122, 16] Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Input Running Data Running Data Running Data Running Data Running Input Running TransposeBERTVectorized@NEON Running ForwardingKernelReshape Running FCViaConvOne Scratches: 0 @ 0 Running LayerNormVectorized@NEON Running FCViaConvOne Scratches: 0 @ 0 Running UnaryOpVectorized[Gelu]@NEON Running FCViaConvOne Scratches: 0 @ 0 Running ForwardingKernelOutput Running LayerNormVectorized@NEON Running FCViaConvOne Scratches: 0 @ 0 Running ForwardingKernelReshape Running TransposeIndexed Running ForwardingKernelOutput Creating blob for Input layer 317 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 8, 122, 16] Creating blob for Input layer 325 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 122, 128] Running AIO Network External allocation: allocating 62464 bytes Creating external output: 1 , shape: [1, 122, 128] External allocation: allocating 187392 bytes Creating external output: 0 , shape: [3, 1, 8, 122, 16] Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Input Running Data Running Data Running Data Running Data Running Input Running TransposeBERTVectorized@NEON Running ForwardingKernelReshape Running FCViaConvOne Scratches: 0 @ 0 Running LayerNormVectorized@NEON Running FCViaConvOne Scratches: 0 @ 0 Running UnaryOpVectorized[Gelu]@NEON Running FCViaConvOne Scratches: 0 @ 0 Running ForwardingKernelOutput Running LayerNormVectorized@NEON Running FCViaConvOne Scratches: 0 @ 0 Running ForwardingKernelReshape Running TransposeIndexed Running ForwardingKernelOutput Creating blob for Input layer 352 with type FLOAT format 
PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 8, 122, 16] Creating blob for Input layer 360 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 122, 128] Running AIO Network External allocation: allocating 62464 bytes Creating external output: 1 , shape: [1, 122, 128] External allocation: allocating 187392 bytes Creating external output: 0 , shape: [3, 1, 8, 122, 16] Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Input Running Data Running Data Running Data Running Data Running Input Running TransposeBERTVectorized@NEON Running ForwardingKernelReshape Running FCViaConvOne Scratches: 0 @ 0 Running LayerNormVectorized@NEON Running FCViaConvOne Scratches: 0 @ 0 Running UnaryOpVectorized[Gelu]@NEON Running FCViaConvOne Scratches: 0 @ 0 Running ForwardingKernelOutput Running LayerNormVectorized@NEON Running FCViaConvOne Scratches: 0 @ 0 Running ForwardingKernelReshape Running TransposeIndexed Running ForwardingKernelOutput Creating blob for Input layer 387 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 8, 122, 16] Creating blob for Input layer 395 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 122, 128] Running AIO Network External allocation: allocating 62464 bytes Creating external output: 1 , shape: [1, 122, 128] External allocation: allocating 187392 bytes Creating external output: 0 , shape: [3, 1, 8, 122, 16] Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Input Running Data Running Data Running Data Running Data Running Input Running TransposeBERTVectorized@NEON Running ForwardingKernelReshape Running FCViaConvOne Scratches: 0 @ 0 Running LayerNormVectorized@NEON Running FCViaConvOne Scratches: 0 @ 0 Running UnaryOpVectorized[Gelu]@NEON Running FCViaConvOne Scratches: 0 @ 0 Running ForwardingKernelOutput Running LayerNormVectorized@NEON Running FCViaConvOne Scratches: 0 @ 0 Running ForwardingKernelReshape Running TransposeIndexed Running ForwardingKernelOutput Creating blob for Input layer 422 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 8, 122, 16] Creating blob for Input layer 430 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 122, 128] Running AIO Network External allocation: allocating 62464 bytes Creating external output: 0 , shape: [1, 122, 128] Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Input Running Data Running Data Running Data Running Data Running Input Running TransposeBERTVectorized@NEON Running ForwardingKernelReshape Running FCViaConvOne Scratches: 0 @ 0 Running LayerNormVectorized@NEON Running FCViaConvOne Scratches: 0 @ 0 Running UnaryOpVectorized[Gelu]@NEON Running FCViaConvOne Scratches: 0 @ 0 Running LayerNormVectorized@NEON Running TorchSliceVectorized@NEON Running ForwardingKernelOutput Creating blob for Input layer 454 with type FLOAT format PlainDataFormat(FORMATF_ROW_MAJOR)[0x0000000000000004] shape [1, 128] Running AIO Network External allocation: allocating 4000 bytes Creating external output: 0 , shape: [1, 1000] Running Data Running Data Running Input 
Running FCViaConvOne Scratches: 0 @ 0 Running ForwardingKernelOutput Creating blob for Input layer 4 with type FLOAT format PlainDataFormat(FORMATF_CAFFE)[0x000000000000000a] shape [1, 3, 110, 110] Running AIO Network External allocation: allocating 187392 bytes Creating external output: 0 , shape: [3, 1, 8, 122, 16] External allocation: allocating 62464 bytes Creating external output: 1 , shape: [1, 122, 128] Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Layer Transpose 33 was const-folded Running Data Running Data Running Data Running Input Running TransposeBRC3x4 Running ConvViaJitMatmul Running TransposeBRC4x4 Running ForwardingKernelFlatten Running Data Layer Transpose 31 was const-folded Running ConcatLastDim Running BinaryOpVectorized[Add]@NEON Running TransposeBRC4x4 Running LayerNormVectorized@NEON Running FCViaConvOne Scratches: 0 @ 0 Running ForwardingKernelReshape Running TransposeIndexed Running ForwardingKernelOutput Running ForwardingKernelOutput Creating blob for Input layer 37 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 8, 122, 16] Creating blob for Input layer 45 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 122, 128] Running AIO Network External allocation: allocating 62464 bytes Creating external output: 1 , shape: [1, 122, 128] External allocation: allocating 187392 bytes Creating external output: 0 , shape: [3, 1, 8, 122, 16] Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Input Running Data Running Data Running Data Running Data Running Input Running TransposeBERTVectorized@NEON Running ForwardingKernelReshape Running FCViaConvOne Scratches: 0 @ 0 Running LayerNormVectorized@NEON Running FCViaConvOne Scratches: 0 @ 0 Running UnaryOpVectorized[Gelu]@NEON Running FCViaConvOne Scratches: 0 @ 0 Running ForwardingKernelOutput Running LayerNormVectorized@NEON Running FCViaConvOne Scratches: 0 @ 0 Running ForwardingKernelReshape Running TransposeIndexed Running ForwardingKernelOutput Creating blob for Input layer 72 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 8, 122, 16] Creating blob for Input layer 80 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 122, 128] Running AIO Network External allocation: allocating 62464 bytes Creating external output: 1 , shape: [1, 122, 128] External allocation: allocating 187392 bytes Creating external output: 0 , shape: [3, 1, 8, 122, 16] Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Input Running Data Running Data Running Data Running Data Running Input Running TransposeBERTVectorized@NEON Running ForwardingKernelReshape Running FCViaConvOne Scratches: 0 @ 0 Running LayerNormVectorized@NEON Running FCViaConvOne Scratches: 0 @ 0 Running UnaryOpVectorized[Gelu]@NEON Running FCViaConvOne Scratches: 0 @ 0 Running ForwardingKernelOutput Running LayerNormVectorized@NEON Running FCViaConvOne Scratches: 0 @ 0 Running ForwardingKernelReshape Running TransposeIndexed Running ForwardingKernelOutput Creating blob for Input layer 107 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 8, 
122, 16] Creating blob for Input layer 115 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 122, 128] Running AIO Network External allocation: allocating 62464 bytes Creating external output: 1 , shape: [1, 122, 128] External allocation: allocating 187392 bytes Creating external output: 0 , shape: [3, 1, 8, 122, 16] Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Input Running Data Running Data Running Data Running Data Running Input Running TransposeBERTVectorized@NEON Running ForwardingKernelReshape Running FCViaConvOne Scratches: 0 @ 0 Running LayerNormVectorized@NEON Running FCViaConvOne Scratches: 0 @ 0 Running UnaryOpVectorized[Gelu]@NEON Running FCViaConvOne Scratches: 0 @ 0 Running ForwardingKernelOutput Running LayerNormVectorized@NEON Running FCViaConvOne Scratches: 0 @ 0 Running ForwardingKernelReshape Running TransposeIndexed Running ForwardingKernelOutput Creating blob for Input layer 142 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 8, 122, 16] Creating blob for Input layer 150 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 122, 128] Running AIO Network External allocation: allocating 62464 bytes Creating external output: 1 , shape: [1, 122, 128] External allocation: allocating 187392 bytes Creating external output: 0 , shape: [3, 1, 8, 122, 16] Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Input Running Data Running Data Running Data Running Data Running Input Running TransposeBERTVectorized@NEON Running ForwardingKernelReshape Running FCViaConvOne Scratches: 0 @ 0 Running LayerNormVectorized@NEON Running FCViaConvOne Scratches: 0 @ 0 Running UnaryOpVectorized[Gelu]@NEON Running FCViaConvOne Scratches: 0 @ 0 Running ForwardingKernelOutput Running LayerNormVectorized@NEON Running FCViaConvOne Scratches: 0 @ 0 Running ForwardingKernelReshape Running TransposeIndexed Running ForwardingKernelOutput Creating blob for Input layer 177 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 8, 122, 16] Creating blob for Input layer 185 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 122, 128] Running AIO Network External allocation: allocating 62464 bytes Creating external output: 1 , shape: [1, 122, 128] External allocation: allocating 187392 bytes Creating external output: 0 , shape: [3, 1, 8, 122, 16] Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Input Running Data Running Data Running Data Running Data Running Input Running TransposeBERTVectorized@NEON Running ForwardingKernelReshape Running FCViaConvOne Scratches: 0 @ 0 Running LayerNormVectorized@NEON Running FCViaConvOne Scratches: 0 @ 0 Running UnaryOpVectorized[Gelu]@NEON Running FCViaConvOne Scratches: 0 @ 0 Running ForwardingKernelOutput Running LayerNormVectorized@NEON Running FCViaConvOne Scratches: 0 @ 0 Running ForwardingKernelReshape Running TransposeIndexed Running ForwardingKernelOutput Creating blob for Input layer 212 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 8, 122, 16] Creating blob for 
Input layer 220 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 122, 128] Running AIO Network External allocation: allocating 62464 bytes Creating external output: 1 , shape: [1, 122, 128] External allocation: allocating 187392 bytes Creating external output: 0 , shape: [3, 1, 8, 122, 16] Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Input Running Data Running Data Running Data Running Data Running Input Running TransposeBERTVectorized@NEON Running ForwardingKernelReshape Running FCViaConvOne Scratches: 0 @ 0 Running LayerNormVectorized@NEON Running FCViaConvOne Scratches: 0 @ 0 Running UnaryOpVectorized[Gelu]@NEON Running FCViaConvOne Scratches: 0 @ 0 Running ForwardingKernelOutput Running LayerNormVectorized@NEON Running FCViaConvOne Scratches: 0 @ 0 Running ForwardingKernelReshape Running TransposeIndexed Running ForwardingKernelOutput Creating blob for Input layer 247 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 8, 122, 16] Creating blob for Input layer 255 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 122, 128] Running AIO Network External allocation: allocating 62464 bytes Creating external output: 1 , shape: [1, 122, 128] External allocation: allocating 187392 bytes Creating external output: 0 , shape: [3, 1, 8, 122, 16] Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Input Running Data Running Data Running Data Running Data Running Input Running TransposeBERTVectorized@NEON Running ForwardingKernelReshape Running FCViaConvOne Scratches: 0 @ 0 Running LayerNormVectorized@NEON Running FCViaConvOne Scratches: 0 @ 0 Running UnaryOpVectorized[Gelu]@NEON Running FCViaConvOne Scratches: 0 @ 0 Running ForwardingKernelOutput Running LayerNormVectorized@NEON Running FCViaConvOne Scratches: 0 @ 0 Running ForwardingKernelReshape Running TransposeIndexed Running ForwardingKernelOutput Creating blob for Input layer 282 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 8, 122, 16] Creating blob for Input layer 290 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 122, 128] Running AIO Network External allocation: allocating 62464 bytes Creating external output: 1 , shape: [1, 122, 128] External allocation: allocating 187392 bytes Creating external output: 0 , shape: [3, 1, 8, 122, 16] Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Input Running Data Running Data Running Data Running Data Running Input Running TransposeBERTVectorized@NEON Running ForwardingKernelReshape Running FCViaConvOne Scratches: 0 @ 0 Running LayerNormVectorized@NEON Running FCViaConvOne Scratches: 0 @ 0 Running UnaryOpVectorized[Gelu]@NEON Running FCViaConvOne Scratches: 0 @ 0 Running ForwardingKernelOutput Running LayerNormVectorized@NEON Running FCViaConvOne Scratches: 0 @ 0 Running ForwardingKernelReshape Running TransposeIndexed Running ForwardingKernelOutput Creating blob for Input layer 317 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 8, 122, 16] Creating blob for Input layer 325 with type 
FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 122, 128] Running AIO Network External allocation: allocating 62464 bytes Creating external output: 1 , shape: [1, 122, 128] External allocation: allocating 187392 bytes Creating external output: 0 , shape: [3, 1, 8, 122, 16] Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Input Running Data Running Data Running Data Running Data Running Input Running TransposeBERTVectorized@NEON Running ForwardingKernelReshape Running FCViaConvOne Scratches: 0 @ 0 Running LayerNormVectorized@NEON Running FCViaConvOne Scratches: 0 @ 0 Running UnaryOpVectorized[Gelu]@NEON Running FCViaConvOne Scratches: 0 @ 0 Running ForwardingKernelOutput Running LayerNormVectorized@NEON Running FCViaConvOne Scratches: 0 @ 0 Running ForwardingKernelReshape Running TransposeIndexed Running ForwardingKernelOutput Creating blob for Input layer 352 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 8, 122, 16] Creating blob for Input layer 360 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 122, 128] Running AIO Network External allocation: allocating 62464 bytes Creating external output: 1 , shape: [1, 122, 128] External allocation: allocating 187392 bytes Creating external output: 0 , shape: [3, 1, 8, 122, 16] Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Input Running Data Running Data Running Data Running Data Running Input Running TransposeBERTVectorized@NEON Running ForwardingKernelReshape Running FCViaConvOne Scratches: 0 @ 0 Running LayerNormVectorized@NEON Running FCViaConvOne Scratches: 0 @ 0 Running UnaryOpVectorized[Gelu]@NEON Running FCViaConvOne Scratches: 0 @ 0 Running ForwardingKernelOutput Running LayerNormVectorized@NEON Running FCViaConvOne Scratches: 0 @ 0 Running ForwardingKernelReshape Running TransposeIndexed Running ForwardingKernelOutput Creating blob for Input layer 387 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 8, 122, 16] Creating blob for Input layer 395 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 122, 128] Running AIO Network External allocation: allocating 62464 bytes Creating external output: 1 , shape: [1, 122, 128] External allocation: allocating 187392 bytes Creating external output: 0 , shape: [3, 1, 8, 122, 16] Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Input Running Data Running Data Running Data Running Data Running Input Running TransposeBERTVectorized@NEON Running ForwardingKernelReshape Running FCViaConvOne Scratches: 0 @ 0 Running LayerNormVectorized@NEON Running FCViaConvOne Scratches: 0 @ 0 Running UnaryOpVectorized[Gelu]@NEON Running FCViaConvOne Scratches: 0 @ 0 Running ForwardingKernelOutput Running LayerNormVectorized@NEON Running FCViaConvOne Scratches: 0 @ 0 Running ForwardingKernelReshape Running TransposeIndexed Running ForwardingKernelOutput Creating blob for Input layer 422 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 8, 122, 16] Creating blob for Input layer 430 with type FLOAT format 
PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 122, 128] Running AIO Network External allocation: allocating 62464 bytes Creating external output: 0 , shape: [1, 122, 128] Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Input Running Data Running Data Running Data Running Data Running Input Running TransposeBERTVectorized@NEON Running ForwardingKernelReshape Running FCViaConvOne Scratches: 0 @ 0 Running LayerNormVectorized@NEON Running FCViaConvOne Scratches: 0 @ 0 Running UnaryOpVectorized[Gelu]@NEON Running FCViaConvOne Scratches: 0 @ 0 Running LayerNormVectorized@NEON Running TorchSliceVectorized@NEON Running ForwardingKernelOutput Creating blob for Input layer 454 with type FLOAT format PlainDataFormat(FORMATF_ROW_MAJOR)[0x0000000000000004] shape [1, 128] Running AIO Network External allocation: allocating 4000 bytes Creating external output: 0 , shape: [1, 1000] Running Data Running Data Running Input Running FCViaConvOne Scratches: 0 @ 0 Running ForwardingKernelOutput Creating blob for Input layer 4 with type FLOAT format PlainDataFormat(FORMATF_CAFFE)[0x000000000000000a] shape [1, 3, 110, 110] Running AIO Network External allocation: allocating 187392 bytes Creating external output: 0 , shape: [3, 1, 8, 122, 16] External allocation: allocating 62464 bytes Creating external output: 1 , shape: [1, 122, 128] Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Layer Transpose 33 was const-folded Running Data Running Data Running Data Running Input Running TransposeBRC3x4 Running ConvViaJitMatmul Running TransposeBRC4x4 Running ForwardingKernelFlatten Running Data Layer Transpose 31 was const-folded Running ConcatLastDim Running BinaryOpVectorized[Add]@NEON Running TransposeBRC4x4 Running LayerNormVectorized@NEON Running FCViaConvOne Scratches: 0 @ 0 Running ForwardingKernelReshape Running TransposeIndexed Running ForwardingKernelOutput Running ForwardingKernelOutput Creating blob for Input layer 37 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 8, 122, 16] Creating blob for Input layer 45 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 122, 128] Running AIO Network External allocation: allocating 62464 bytes Creating external output: 1 , shape: [1, 122, 128] External allocation: allocating 187392 bytes Creating external output: 0 , shape: [3, 1, 8, 122, 16] Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Input Running Data Running Data Running Data Running Data Running Input Running TransposeBERTVectorized@NEON Running ForwardingKernelReshape Running FCViaConvOne Scratches: 0 @ 0 Running LayerNormVectorized@NEON Running FCViaConvOne Scratches: 0 @ 0 Running UnaryOpVectorized[Gelu]@NEON Running FCViaConvOne Scratches: 0 @ 0 Running ForwardingKernelOutput Running LayerNormVectorized@NEON Running FCViaConvOne Scratches: 0 @ 0 Running ForwardingKernelReshape Running TransposeIndexed Running ForwardingKernelOutput Creating blob for Input layer 72 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 8, 122, 16] Creating blob for Input layer 80 with type FLOAT format 
PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 122, 128] Running AIO Network External allocation: allocating 62464 bytes Creating external output: 1 , shape: [1, 122, 128] External allocation: allocating 187392 bytes Creating external output: 0 , shape: [3, 1, 8, 122, 16] Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Input Running Data Running Data Running Data Running Data Running Input Running TransposeBERTVectorized@NEON Running ForwardingKernelReshape Running FCViaConvOne Scratches: 0 @ 0 Running LayerNormVectorized@NEON Running FCViaConvOne Scratches: 0 @ 0 Running UnaryOpVectorized[Gelu]@NEON Running FCViaConvOne Scratches: 0 @ 0 Running ForwardingKernelOutput Running LayerNormVectorized@NEON Running FCViaConvOne Scratches: 0 @ 0 Running ForwardingKernelReshape Running TransposeIndexed Running ForwardingKernelOutput Creating blob for Input layer 107 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 8, 122, 16] Creating blob for Input layer 115 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 122, 128] Running AIO Network External allocation: allocating 62464 bytes Creating external output: 1 , shape: [1, 122, 128] External allocation: allocating 187392 bytes Creating external output: 0 , shape: [3, 1, 8, 122, 16] Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Input Running Data Running Data Running Data Running Data Running Input Running TransposeBERTVectorized@NEON Running ForwardingKernelReshape Running FCViaConvOne Scratches: 0 @ 0 Running LayerNormVectorized@NEON Running FCViaConvOne Scratches: 0 @ 0 Running UnaryOpVectorized[Gelu]@NEON Running FCViaConvOne Scratches: 0 @ 0 Running ForwardingKernelOutput Running LayerNormVectorized@NEON Running FCViaConvOne Scratches: 0 @ 0 Running ForwardingKernelReshape Running TransposeIndexed Running ForwardingKernelOutput Creating blob for Input layer 142 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 8, 122, 16] Creating blob for Input layer 150 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 122, 128] Running AIO Network External allocation: allocating 62464 bytes Creating external output: 1 , shape: [1, 122, 128] External allocation: allocating 187392 bytes Creating external output: 0 , shape: [3, 1, 8, 122, 16] Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Input Running Data Running Data Running Data Running Data Running Input Running TransposeBERTVectorized@NEON Running ForwardingKernelReshape Running FCViaConvOne Scratches: 0 @ 0 Running LayerNormVectorized@NEON Running FCViaConvOne Scratches: 0 @ 0 Running UnaryOpVectorized[Gelu]@NEON Running FCViaConvOne Scratches: 0 @ 0 Running ForwardingKernelOutput Running LayerNormVectorized@NEON Running FCViaConvOne Scratches: 0 @ 0 Running ForwardingKernelReshape Running TransposeIndexed Running ForwardingKernelOutput Creating blob for Input layer 177 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 8, 122, 16] Creating blob for Input layer 185 with type FLOAT format 
PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 122, 128] Running AIO Network External allocation: allocating 62464 bytes Creating external output: 1 , shape: [1, 122, 128] External allocation: allocating 187392 bytes Creating external output: 0 , shape: [3, 1, 8, 122, 16] Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Input Running Data Running Data Running Data Running Data Running Input Running TransposeBERTVectorized@NEON Running ForwardingKernelReshape Running FCViaConvOne Scratches: 0 @ 0 Running LayerNormVectorized@NEON Running FCViaConvOne Scratches: 0 @ 0 Running UnaryOpVectorized[Gelu]@NEON Running FCViaConvOne Scratches: 0 @ 0 Running ForwardingKernelOutput Running LayerNormVectorized@NEON Running FCViaConvOne Scratches: 0 @ 0 Running ForwardingKernelReshape Running TransposeIndexed Running ForwardingKernelOutput Creating blob for Input layer 212 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 8, 122, 16] Creating blob for Input layer 220 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 122, 128] Running AIO Network External allocation: allocating 62464 bytes Creating external output: 1 , shape: [1, 122, 128] External allocation: allocating 187392 bytes Creating external output: 0 , shape: [3, 1, 8, 122, 16] Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Input Running Data Running Data Running Data Running Data Running Input Running TransposeBERTVectorized@NEON Running ForwardingKernelReshape Running FCViaConvOne Scratches: 0 @ 0 Running LayerNormVectorized@NEON Running FCViaConvOne Scratches: 0 @ 0 Running UnaryOpVectorized[Gelu]@NEON Running FCViaConvOne Scratches: 0 @ 0 Running ForwardingKernelOutput Running LayerNormVectorized@NEON Running FCViaConvOne Scratches: 0 @ 0 Running ForwardingKernelReshape Running TransposeIndexed Running ForwardingKernelOutput Creating blob for Input layer 247 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 8, 122, 16] Creating blob for Input layer 255 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 122, 128] Running AIO Network External allocation: allocating 62464 bytes Creating external output: 1 , shape: [1, 122, 128] External allocation: allocating 187392 bytes Creating external output: 0 , shape: [3, 1, 8, 122, 16] Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Input Running Data Running Data Running Data Running Data Running Input Running TransposeBERTVectorized@NEON Running ForwardingKernelReshape Running FCViaConvOne Scratches: 0 @ 0 Running LayerNormVectorized@NEON Running FCViaConvOne Scratches: 0 @ 0 Running UnaryOpVectorized[Gelu]@NEON Running FCViaConvOne Scratches: 0 @ 0 Running ForwardingKernelOutput Running LayerNormVectorized@NEON Running FCViaConvOne Scratches: 0 @ 0 Running ForwardingKernelReshape Running TransposeIndexed Running ForwardingKernelOutput Creating blob for Input layer 282 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 8, 122, 16] Creating blob for Input layer 290 with type FLOAT format 
PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 122, 128] Running AIO Network External allocation: allocating 62464 bytes Creating external output: 1 , shape: [1, 122, 128] External allocation: allocating 187392 bytes Creating external output: 0 , shape: [3, 1, 8, 122, 16] Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Input Running Data Running Data Running Data Running Data Running Input Running TransposeBERTVectorized@NEON Running ForwardingKernelReshape Running FCViaConvOne Scratches: 0 @ 0 Running LayerNormVectorized@NEON Running FCViaConvOne Scratches: 0 @ 0 Running UnaryOpVectorized[Gelu]@NEON Running FCViaConvOne Scratches: 0 @ 0 Running ForwardingKernelOutput Running LayerNormVectorized@NEON Running FCViaConvOne Scratches: 0 @ 0 Running ForwardingKernelReshape Running TransposeIndexed Running ForwardingKernelOutput Creating blob for Input layer 317 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 8, 122, 16] Creating blob for Input layer 325 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 122, 128] Running AIO Network External allocation: allocating 62464 bytes Creating external output: 1 , shape: [1, 122, 128] External allocation: allocating 187392 bytes Creating external output: 0 , shape: [3, 1, 8, 122, 16] Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Input Running Data Running Data Running Data Running Data Running Input Running TransposeBERTVectorized@NEON Running ForwardingKernelReshape Running FCViaConvOne Scratches: 0 @ 0 Running LayerNormVectorized@NEON Running FCViaConvOne Scratches: 0 @ 0 Running UnaryOpVectorized[Gelu]@NEON Running FCViaConvOne Scratches: 0 @ 0 Running ForwardingKernelOutput Running LayerNormVectorized@NEON Running FCViaConvOne Scratches: 0 @ 0 Running ForwardingKernelReshape Running TransposeIndexed Running ForwardingKernelOutput Creating blob for Input layer 352 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 8, 122, 16] Creating blob for Input layer 360 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 122, 128] Running AIO Network External allocation: allocating 62464 bytes Creating external output: 1 , shape: [1, 122, 128] External allocation: allocating 187392 bytes Creating external output: 0 , shape: [3, 1, 8, 122, 16] Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Input Running Data Running Data Running Data Running Data Running Input Running TransposeBERTVectorized@NEON Running ForwardingKernelReshape Running FCViaConvOne Scratches: 0 @ 0 Running LayerNormVectorized@NEON Running FCViaConvOne Scratches: 0 @ 0 Running UnaryOpVectorized[Gelu]@NEON Running FCViaConvOne Scratches: 0 @ 0 Running ForwardingKernelOutput Running LayerNormVectorized@NEON Running FCViaConvOne Scratches: 0 @ 0 Running ForwardingKernelReshape Running TransposeIndexed Running ForwardingKernelOutput Creating blob for Input layer 387 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 8, 122, 16] Creating blob for Input layer 395 with type FLOAT format 
PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 122, 128] Running AIO Network External allocation: allocating 62464 bytes Creating external output: 1 , shape: [1, 122, 128] External allocation: allocating 187392 bytes Creating external output: 0 , shape: [3, 1, 8, 122, 16] Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Input Running Data Running Data Running Data Running Data Running Input Running TransposeBERTVectorized@NEON Running ForwardingKernelReshape Running FCViaConvOne Scratches: 0 @ 0 Running LayerNormVectorized@NEON Running FCViaConvOne Scratches: 0 @ 0 Running UnaryOpVectorized[Gelu]@NEON Running FCViaConvOne Scratches: 0 @ 0 Running ForwardingKernelOutput Running LayerNormVectorized@NEON Running FCViaConvOne Scratches: 0 @ 0 Running ForwardingKernelReshape Running TransposeIndexed Running ForwardingKernelOutput Creating blob for Input layer 422 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 8, 122, 16] Creating blob for Input layer 430 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 122, 128] Running AIO Network External allocation: allocating 62464 bytes Creating external output: 0 , shape: [1, 122, 128] Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Input Running Data Running Data Running Data Running Data Running Input Running TransposeBERTVectorized@NEON Running ForwardingKernelReshape Running FCViaConvOne Scratches: 0 @ 0 Running LayerNormVectorized@NEON Running FCViaConvOne Scratches: 0 @ 0 Running UnaryOpVectorized[Gelu]@NEON Running FCViaConvOne Scratches: 0 @ 0 Running LayerNormVectorized@NEON Running TorchSliceVectorized@NEON Running ForwardingKernelOutput Creating blob for Input layer 454 with type FLOAT format PlainDataFormat(FORMATF_ROW_MAJOR)[0x0000000000000004] shape [1, 128] Running AIO Network External allocation: allocating 4000 bytes Creating external output: 0 , shape: [1, 1000] Running Data Running Data Running Input Running FCViaConvOne Scratches: 0 @ 0 Running ForwardingKernelOutput Creating blob for Input layer 4 with type FLOAT format PlainDataFormat(FORMATF_CAFFE)[0x000000000000000a] shape [1, 3, 110, 110] Running AIO Network External allocation: allocating 187392 bytes Creating external output: 0 , shape: [3, 1, 8, 122, 16] External allocation: allocating 62464 bytes Creating external output: 1 , shape: [1, 122, 128] Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Layer Transpose 33 was const-folded Running Data Running Data Running Data Running Input Running TransposeBRC3x4 Running ConvViaJitMatmul Running TransposeBRC4x4 Running ForwardingKernelFlatten Running Data Layer Transpose 31 was const-folded Running ConcatLastDim Running BinaryOpVectorized[Add]@NEON Running TransposeBRC4x4 Running LayerNormVectorized@NEON Running FCViaConvOne Scratches: 0 @ 0 Running ForwardingKernelReshape Running TransposeIndexed Running ForwardingKernelOutput Running ForwardingKernelOutput Creating blob for Input layer 37 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 8, 122, 16] Creating blob for Input layer 45 with type FLOAT format 
PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 122, 128] Running AIO Network External allocation: allocating 62464 bytes Creating external output: 1 , shape: [1, 122, 128] External allocation: allocating 187392 bytes Creating external output: 0 , shape: [3, 1, 8, 122, 16] Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Input Running Data Running Data Running Data Running Data Running Input Running TransposeBERTVectorized@NEON Running ForwardingKernelReshape Running FCViaConvOne Scratches: 0 @ 0 Running LayerNormVectorized@NEON Running FCViaConvOne Scratches: 0 @ 0 Running UnaryOpVectorized[Gelu]@NEON Running FCViaConvOne Scratches: 0 @ 0 Running ForwardingKernelOutput Running LayerNormVectorized@NEON Running FCViaConvOne Scratches: 0 @ 0 Running ForwardingKernelReshape Running TransposeIndexed Running ForwardingKernelOutput Creating blob for Input layer 72 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 8, 122, 16] Creating blob for Input layer 80 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 122, 128] Running AIO Network External allocation: allocating 62464 bytes Creating external output: 1 , shape: [1, 122, 128] External allocation: allocating 187392 bytes Creating external output: 0 , shape: [3, 1, 8, 122, 16] Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Input Running Data Running Data Running Data Running Data Running Input Running TransposeBERTVectorized@NEON Running ForwardingKernelReshape Running FCViaConvOne Scratches: 0 @ 0 Running LayerNormVectorized@NEON Running FCViaConvOne Scratches: 0 @ 0 Running UnaryOpVectorized[Gelu]@NEON Running FCViaConvOne Scratches: 0 @ 0 Running ForwardingKernelOutput Running LayerNormVectorized@NEON Running FCViaConvOne Scratches: 0 @ 0 Running ForwardingKernelReshape Running TransposeIndexed Running ForwardingKernelOutput Creating blob for Input layer 107 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 8, 122, 16] Creating blob for Input layer 115 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 122, 128] Running AIO Network External allocation: allocating 62464 bytes Creating external output: 1 , shape: [1, 122, 128] External allocation: allocating 187392 bytes Creating external output: 0 , shape: [3, 1, 8, 122, 16] Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Input Running Data Running Data Running Data Running Data Running Input Running TransposeBERTVectorized@NEON Running ForwardingKernelReshape Running FCViaConvOne Scratches: 0 @ 0 Running LayerNormVectorized@NEON Running FCViaConvOne Scratches: 0 @ 0 Running UnaryOpVectorized[Gelu]@NEON Running FCViaConvOne Scratches: 0 @ 0 Running ForwardingKernelOutput Running LayerNormVectorized@NEON Running FCViaConvOne Scratches: 0 @ 0 Running ForwardingKernelReshape Running TransposeIndexed Running ForwardingKernelOutput Creating blob for Input layer 142 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 8, 122, 16] Creating blob for Input layer 150 with type FLOAT format 
PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 122, 128] Running AIO Network External allocation: allocating 62464 bytes Creating external output: 1 , shape: [1, 122, 128] External allocation: allocating 187392 bytes Creating external output: 0 , shape: [3, 1, 8, 122, 16] Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Input Running Data Running Data Running Data Running Data Running Input Running TransposeBERTVectorized@NEON Running ForwardingKernelReshape Running FCViaConvOne Scratches: 0 @ 0 Running LayerNormVectorized@NEON Running FCViaConvOne Scratches: 0 @ 0 Running UnaryOpVectorized[Gelu]@NEON Running FCViaConvOne Scratches: 0 @ 0 Running ForwardingKernelOutput Running LayerNormVectorized@NEON Running FCViaConvOne Scratches: 0 @ 0 Running ForwardingKernelReshape Running TransposeIndexed Running ForwardingKernelOutput Creating blob for Input layer 177 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 8, 122, 16] Creating blob for Input layer 185 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 122, 128] Running AIO Network External allocation: allocating 62464 bytes Creating external output: 1 , shape: [1, 122, 128] External allocation: allocating 187392 bytes Creating external output: 0 , shape: [3, 1, 8, 122, 16] Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Input Running Data Running Data Running Data Running Data Running Input Running TransposeBERTVectorized@NEON Running ForwardingKernelReshape Running FCViaConvOne Scratches: 0 @ 0 Running LayerNormVectorized@NEON Running FCViaConvOne Scratches: 0 @ 0 Running UnaryOpVectorized[Gelu]@NEON Running FCViaConvOne Scratches: 0 @ 0 Running ForwardingKernelOutput Running LayerNormVectorized@NEON Running FCViaConvOne Scratches: 0 @ 0 Running ForwardingKernelReshape Running TransposeIndexed Running ForwardingKernelOutput Creating blob for Input layer 212 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 8, 122, 16] Creating blob for Input layer 220 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 122, 128] Running AIO Network External allocation: allocating 62464 bytes Creating external output: 1 , shape: [1, 122, 128] External allocation: allocating 187392 bytes Creating external output: 0 , shape: [3, 1, 8, 122, 16] Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Input Running Data Running Data Running Data Running Data Running Input Running TransposeBERTVectorized@NEON Running ForwardingKernelReshape Running FCViaConvOne Scratches: 0 @ 0 Running LayerNormVectorized@NEON Running FCViaConvOne Scratches: 0 @ 0 Running UnaryOpVectorized[Gelu]@NEON Running FCViaConvOne Scratches: 0 @ 0 Running ForwardingKernelOutput Running LayerNormVectorized@NEON Running FCViaConvOne Scratches: 0 @ 0 Running ForwardingKernelReshape Running TransposeIndexed Running ForwardingKernelOutput Creating blob for Input layer 247 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 8, 122, 16] Creating blob for Input layer 255 with type FLOAT format 
PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 122, 128] Running AIO Network External allocation: allocating 62464 bytes Creating external output: 1 , shape: [1, 122, 128] External allocation: allocating 187392 bytes Creating external output: 0 , shape: [3, 1, 8, 122, 16] Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Input Running Data Running Data Running Data Running Data Running Input Running TransposeBERTVectorized@NEON Running ForwardingKernelReshape Running FCViaConvOne Scratches: 0 @ 0 Running LayerNormVectorized@NEON Running FCViaConvOne Scratches: 0 @ 0 Running UnaryOpVectorized[Gelu]@NEON Running FCViaConvOne Scratches: 0 @ 0 Running ForwardingKernelOutput Running LayerNormVectorized@NEON Running FCViaConvOne Scratches: 0 @ 0 Running ForwardingKernelReshape Running TransposeIndexed Running ForwardingKernelOutput Creating blob for Input layer 282 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 8, 122, 16] Creating blob for Input layer 290 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 122, 128] Running AIO Network External allocation: allocating 62464 bytes Creating external output: 1 , shape: [1, 122, 128] External allocation: allocating 187392 bytes Creating external output: 0 , shape: [3, 1, 8, 122, 16] Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Input Running Data Running Data Running Data Running Data Running Input Running TransposeBERTVectorized@NEON Running ForwardingKernelReshape Running FCViaConvOne Scratches: 0 @ 0 Running LayerNormVectorized@NEON Running FCViaConvOne Scratches: 0 @ 0 Running UnaryOpVectorized[Gelu]@NEON Running FCViaConvOne Scratches: 0 @ 0 Running ForwardingKernelOutput Running LayerNormVectorized@NEON Running FCViaConvOne Scratches: 0 @ 0 Running ForwardingKernelReshape Running TransposeIndexed Running ForwardingKernelOutput Creating blob for Input layer 317 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 8, 122, 16] Creating blob for Input layer 325 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 122, 128] Running AIO Network External allocation: allocating 62464 bytes Creating external output: 1 , shape: [1, 122, 128] External allocation: allocating 187392 bytes Creating external output: 0 , shape: [3, 1, 8, 122, 16] Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Input Running Data Running Data Running Data Running Data Running Input Running TransposeBERTVectorized@NEON Running ForwardingKernelReshape Running FCViaConvOne Scratches: 0 @ 0 Running LayerNormVectorized@NEON Running FCViaConvOne Scratches: 0 @ 0 Running UnaryOpVectorized[Gelu]@NEON Running FCViaConvOne Scratches: 0 @ 0 Running ForwardingKernelOutput Running LayerNormVectorized@NEON Running FCViaConvOne Scratches: 0 @ 0 Running ForwardingKernelReshape Running TransposeIndexed Running ForwardingKernelOutput Creating blob for Input layer 352 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 8, 122, 16] Creating blob for Input layer 360 with type FLOAT format 
PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 122, 128] Running AIO Network External allocation: allocating 62464 bytes Creating external output: 1 , shape: [1, 122, 128] External allocation: allocating 187392 bytes Creating external output: 0 , shape: [3, 1, 8, 122, 16] Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Input Running Data Running Data Running Data Running Data Running Input Running TransposeBERTVectorized@NEON Running ForwardingKernelReshape Running FCViaConvOne Scratches: 0 @ 0 Running LayerNormVectorized@NEON Running FCViaConvOne Scratches: 0 @ 0 Running UnaryOpVectorized[Gelu]@NEON Running FCViaConvOne Scratches: 0 @ 0 Running ForwardingKernelOutput Running LayerNormVectorized@NEON Running FCViaConvOne Scratches: 0 @ 0 Running ForwardingKernelReshape Running TransposeIndexed Running ForwardingKernelOutput Creating blob for Input layer 387 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 8, 122, 16] Creating blob for Input layer 395 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 122, 128] Running AIO Network External allocation: allocating 62464 bytes Creating external output: 1 , shape: [1, 122, 128] External allocation: allocating 187392 bytes Creating external output: 0 , shape: [3, 1, 8, 122, 16] Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Input Running Data Running Data Running Data Running Data Running Input Running TransposeBERTVectorized@NEON Running ForwardingKernelReshape Running FCViaConvOne Scratches: 0 @ 0 Running LayerNormVectorized@NEON Running FCViaConvOne Scratches: 0 @ 0 Running UnaryOpVectorized[Gelu]@NEON Running FCViaConvOne Scratches: 0 @ 0 Running ForwardingKernelOutput Running LayerNormVectorized@NEON Running FCViaConvOne Scratches: 0 @ 0 Running ForwardingKernelReshape Running TransposeIndexed Running ForwardingKernelOutput Creating blob for Input layer 422 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 8, 122, 16] Creating blob for Input layer 430 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 122, 128] Running AIO Network External allocation: allocating 62464 bytes Creating external output: 0 , shape: [1, 122, 128] Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Input Running Data Running Data Running Data Running Data Running Input Running TransposeBERTVectorized@NEON Running ForwardingKernelReshape Running FCViaConvOne Scratches: 0 @ 0 Running LayerNormVectorized@NEON Running FCViaConvOne Scratches: 0 @ 0 Running UnaryOpVectorized[Gelu]@NEON Running FCViaConvOne Scratches: 0 @ 0 Running LayerNormVectorized@NEON Running TorchSliceVectorized@NEON Running ForwardingKernelOutput Creating blob for Input layer 454 with type FLOAT format PlainDataFormat(FORMATF_ROW_MAJOR)[0x0000000000000004] shape [1, 128] Running AIO Network External allocation: allocating 4000 bytes Creating external output: 0 , shape: [1, 1000] Running Data Running Data Running Input Running FCViaConvOne Scratches: 0 @ 0 Running ForwardingKernelOutput Creating blob for Input layer 4 with type FLOAT format 
PlainDataFormat(FORMATF_CAFFE)[0x000000000000000a] shape [1, 3, 110, 110] Running AIO Network External allocation: allocating 187392 bytes Creating external output: 0 , shape: [3, 1, 8, 122, 16] External allocation: allocating 62464 bytes Creating external output: 1 , shape: [1, 122, 128] Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Layer Transpose 33 was const-folded Running Data Running Data Running Data Running Input Running TransposeBRC3x4 Running ConvViaJitMatmul Running TransposeBRC4x4 Running ForwardingKernelFlatten Running Data Layer Transpose 31 was const-folded Running ConcatLastDim Running BinaryOpVectorized[Add]@NEON Running TransposeBRC4x4 Running LayerNormVectorized@NEON Running FCViaConvOne Scratches: 0 @ 0 Running ForwardingKernelReshape Running TransposeIndexed Running ForwardingKernelOutput Running ForwardingKernelOutput Creating blob for Input layer 37 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 8, 122, 16] Creating blob for Input layer 45 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 122, 128] Running AIO Network External allocation: allocating 62464 bytes Creating external output: 1 , shape: [1, 122, 128] External allocation: allocating 187392 bytes Creating external output: 0 , shape: [3, 1, 8, 122, 16] Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Input Running Data Running Data Running Data Running Data Running Input Running TransposeBERTVectorized@NEON Running ForwardingKernelReshape Running FCViaConvOne Scratches: 0 @ 0 Running LayerNormVectorized@NEON Running FCViaConvOne Scratches: 0 @ 0 Running UnaryOpVectorized[Gelu]@NEON Running FCViaConvOne Scratches: 0 @ 0 Running ForwardingKernelOutput Running LayerNormVectorized@NEON Running FCViaConvOne Scratches: 0 @ 0 Running ForwardingKernelReshape Running TransposeIndexed Running ForwardingKernelOutput Creating blob for Input layer 72 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 8, 122, 16] Creating blob for Input layer 80 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 122, 128] Running AIO Network External allocation: allocating 62464 bytes Creating external output: 1 , shape: [1, 122, 128] External allocation: allocating 187392 bytes Creating external output: 0 , shape: [3, 1, 8, 122, 16] Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Input Running Data Running Data Running Data Running Data Running Input Running TransposeBERTVectorized@NEON Running ForwardingKernelReshape Running FCViaConvOne Scratches: 0 @ 0 Running LayerNormVectorized@NEON Running FCViaConvOne Scratches: 0 @ 0 Running UnaryOpVectorized[Gelu]@NEON Running FCViaConvOne Scratches: 0 @ 0 Running ForwardingKernelOutput Running LayerNormVectorized@NEON Running FCViaConvOne Scratches: 0 @ 0 Running ForwardingKernelReshape Running TransposeIndexed Running ForwardingKernelOutput Creating blob for Input layer 107 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 8, 122, 16] Creating blob for Input layer 115 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 
122, 128] Running AIO Network External allocation: allocating 62464 bytes Creating external output: 1 , shape: [1, 122, 128] External allocation: allocating 187392 bytes Creating external output: 0 , shape: [3, 1, 8, 122, 16] Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Input Running Data Running Data Running Data Running Data Running Input Running TransposeBERTVectorized@NEON Running ForwardingKernelReshape Running FCViaConvOne Scratches: 0 @ 0 Running LayerNormVectorized@NEON Running FCViaConvOne Scratches: 0 @ 0 Running UnaryOpVectorized[Gelu]@NEON Running FCViaConvOne Scratches: 0 @ 0 Running ForwardingKernelOutput Running LayerNormVectorized@NEON Running FCViaConvOne Scratches: 0 @ 0 Running ForwardingKernelReshape Running TransposeIndexed Running ForwardingKernelOutput Creating blob for Input layer 142 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 8, 122, 16] Creating blob for Input layer 150 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 122, 128] Running AIO Network External allocation: allocating 62464 bytes Creating external output: 1 , shape: [1, 122, 128] External allocation: allocating 187392 bytes Creating external output: 0 , shape: [3, 1, 8, 122, 16] Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Input Running Data Running Data Running Data Running Data Running Input Running TransposeBERTVectorized@NEON Running ForwardingKernelReshape Running FCViaConvOne Scratches: 0 @ 0 Running LayerNormVectorized@NEON Running FCViaConvOne Scratches: 0 @ 0 Running UnaryOpVectorized[Gelu]@NEON Running FCViaConvOne Scratches: 0 @ 0 Running ForwardingKernelOutput Running LayerNormVectorized@NEON Running FCViaConvOne Scratches: 0 @ 0 Running ForwardingKernelReshape Running TransposeIndexed Running ForwardingKernelOutput Creating blob for Input layer 177 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 8, 122, 16] Creating blob for Input layer 185 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 122, 128] Running AIO Network External allocation: allocating 62464 bytes Creating external output: 1 , shape: [1, 122, 128] External allocation: allocating 187392 bytes Creating external output: 0 , shape: [3, 1, 8, 122, 16] Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Input Running Data Running Data Running Data Running Data Running Input Running TransposeBERTVectorized@NEON Running ForwardingKernelReshape Running FCViaConvOne Scratches: 0 @ 0 Running LayerNormVectorized@NEON Running FCViaConvOne Scratches: 0 @ 0 Running UnaryOpVectorized[Gelu]@NEON Running FCViaConvOne Scratches: 0 @ 0 Running ForwardingKernelOutput Running LayerNormVectorized@NEON Running FCViaConvOne Scratches: 0 @ 0 Running ForwardingKernelReshape Running TransposeIndexed Running ForwardingKernelOutput Creating blob for Input layer 212 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 8, 122, 16] Creating blob for Input layer 220 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 122, 128] Running AIO Network 
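The kernel sequence that repeats for each of these block networks (TransposeBERTVectorized, FCViaConvOne, LayerNormVectorized, FCViaConvOne, UnaryOpVectorized[Gelu], FCViaConvOne, then LayerNorm, FC, Reshape, TransposeIndexed), together with the recurring shapes [1, 122, 128] and [3, 1, 8, 122, 16], looks consistent with a standard transformer encoder block over 122 tokens of width 128 using 8 attention heads of size 16 and a packed QKV tensor. The sketch below is only an interpretation of that pattern, not the library's actual kernels; every class, module and variable name in it is an illustrative assumption.

    # Minimal PyTorch sketch of the block pattern suggested by the log above.
    # Assumptions: packed QKV of shape [3, B, heads, tokens, head_dim], hidden
    # states of shape [B, tokens, dim], LayerNorm -> FC -> GELU -> FC for the MLP.
    import torch
    import torch.nn as nn

    class EncoderBlockSketch(nn.Module):
        def __init__(self, dim=128, heads=8):
            super().__init__()
            self.heads = heads
            self.proj = nn.Linear(dim, dim)          # attention output projection (plausibly one of the FCViaConvOne calls)
            self.norm1 = nn.LayerNorm(dim)           # LayerNormVectorized
            self.mlp = nn.Sequential(                # FC -> GELU -> FC (UnaryOpVectorized[Gelu])
                nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            self.norm2 = nn.LayerNorm(dim)
            self.qkv_next = nn.Linear(dim, 3 * dim)  # produces the next block's packed QKV

        def forward(self, qkv, x):
            # qkv: [3, B, heads, tokens, head_dim]; x: [B, tokens, dim]
            q, k, v = qkv.unbind(0)
            attn = (q @ k.transpose(-2, -1)) * (q.shape[-1] ** -0.5)
            out = attn.softmax(dim=-1) @ v                        # [B, heads, tokens, head_dim]
            out = out.transpose(1, 2).flatten(2)                  # back to [B, tokens, dim]
            x = x + self.proj(out)
            x = x + self.mlp(self.norm1(x))
            h = self.norm2(x)
            nxt = self.qkv_next(h)                                # [B, tokens, 3*dim]
            B, T, _ = nxt.shape
            nxt = nxt.view(B, T, 3, self.heads, -1).permute(2, 0, 3, 1, 4)  # [3, B, heads, T, head_dim]
            return nxt, x

    qkv = torch.randn(3, 1, 8, 122, 16)
    x = torch.randn(1, 122, 128)
    next_qkv, hidden = EncoderBlockSketch()(qkv, x)
    print(next_qkv.shape, hidden.shape)

Running the sketch prints torch.Size([3, 1, 8, 122, 16]) and torch.Size([1, 122, 128]), matching the two external outputs allocated for each block network in the log.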
External allocation: allocating 62464 bytes Creating external output: 1 , shape: [1, 122, 128] External allocation: allocating 187392 bytes Creating external output: 0 , shape: [3, 1, 8, 122, 16] Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Input Running Data Running Data Running Data Running Data Running Input Running TransposeBERTVectorized@NEON Running ForwardingKernelReshape Running FCViaConvOne Scratches: 0 @ 0 Running LayerNormVectorized@NEON Running FCViaConvOne Scratches: 0 @ 0 Running UnaryOpVectorized[Gelu]@NEON Running FCViaConvOne Scratches: 0 @ 0 Running ForwardingKernelOutput Running LayerNormVectorized@NEON Running FCViaConvOne Scratches: 0 @ 0 Running ForwardingKernelReshape Running TransposeIndexed Running ForwardingKernelOutput Creating blob for Input layer 247 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 8, 122, 16] Creating blob for Input layer 255 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 122, 128] Running AIO Network External allocation: allocating 62464 bytes Creating external output: 1 , shape: [1, 122, 128] External allocation: allocating 187392 bytes Creating external output: 0 , shape: [3, 1, 8, 122, 16] Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Input Running Data Running Data Running Data Running Data Running Input Running TransposeBERTVectorized@NEON Running ForwardingKernelReshape Running FCViaConvOne Scratches: 0 @ 0 Running LayerNormVectorized@NEON Running FCViaConvOne Scratches: 0 @ 0 Running UnaryOpVectorized[Gelu]@NEON Running FCViaConvOne Scratches: 0 @ 0 Running ForwardingKernelOutput Running LayerNormVectorized@NEON Running FCViaConvOne Scratches: 0 @ 0 Running ForwardingKernelReshape Running TransposeIndexed Running ForwardingKernelOutput Creating blob for Input layer 282 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 8, 122, 16] Creating blob for Input layer 290 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 122, 128] Running AIO Network External allocation: allocating 62464 bytes Creating external output: 1 , shape: [1, 122, 128] External allocation: allocating 187392 bytes Creating external output: 0 , shape: [3, 1, 8, 122, 16] Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Input Running Data Running Data Running Data Running Data Running Input Running TransposeBERTVectorized@NEON Running ForwardingKernelReshape Running FCViaConvOne Scratches: 0 @ 0 Running LayerNormVectorized@NEON Running FCViaConvOne Scratches: 0 @ 0 Running UnaryOpVectorized[Gelu]@NEON Running FCViaConvOne Scratches: 0 @ 0 Running ForwardingKernelOutput Running LayerNormVectorized@NEON Running FCViaConvOne Scratches: 0 @ 0 Running ForwardingKernelReshape Running TransposeIndexed Running ForwardingKernelOutput Creating blob for Input layer 317 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 8, 122, 16] Creating blob for Input layer 325 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 122, 128] Running AIO Network External allocation: 
allocating 62464 bytes Creating external output: 1 , shape: [1, 122, 128] External allocation: allocating 187392 bytes Creating external output: 0 , shape: [3, 1, 8, 122, 16] Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Input Running Data Running Data Running Data Running Data Running Input Running TransposeBERTVectorized@NEON Running ForwardingKernelReshape Running FCViaConvOne Scratches: 0 @ 0 Running LayerNormVectorized@NEON Running FCViaConvOne Scratches: 0 @ 0 Running UnaryOpVectorized[Gelu]@NEON Running FCViaConvOne Scratches: 0 @ 0 Running ForwardingKernelOutput Running LayerNormVectorized@NEON Running FCViaConvOne Scratches: 0 @ 0 Running ForwardingKernelReshape Running TransposeIndexed Running ForwardingKernelOutput Creating blob for Input layer 352 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 8, 122, 16] Creating blob for Input layer 360 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 122, 128] Running AIO Network External allocation: allocating 62464 bytes Creating external output: 1 , shape: [1, 122, 128] External allocation: allocating 187392 bytes Creating external output: 0 , shape: [3, 1, 8, 122, 16] Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Input Running Data Running Data Running Data Running Data Running Input Running TransposeBERTVectorized@NEON Running ForwardingKernelReshape Running FCViaConvOne Scratches: 0 @ 0 Running LayerNormVectorized@NEON Running FCViaConvOne Scratches: 0 @ 0 Running UnaryOpVectorized[Gelu]@NEON Running FCViaConvOne Scratches: 0 @ 0 Running ForwardingKernelOutput Running LayerNormVectorized@NEON Running FCViaConvOne Scratches: 0 @ 0 Running ForwardingKernelReshape Running TransposeIndexed Running ForwardingKernelOutput Creating blob for Input layer 387 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 8, 122, 16] Creating blob for Input layer 395 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 122, 128] Running AIO Network External allocation: allocating 62464 bytes Creating external output: 1 , shape: [1, 122, 128] External allocation: allocating 187392 bytes Creating external output: 0 , shape: [3, 1, 8, 122, 16] Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Input Running Data Running Data Running Data Running Data Running Input Running TransposeBERTVectorized@NEON Running ForwardingKernelReshape Running FCViaConvOne Scratches: 0 @ 0 Running LayerNormVectorized@NEON Running FCViaConvOne Scratches: 0 @ 0 Running UnaryOpVectorized[Gelu]@NEON Running FCViaConvOne Scratches: 0 @ 0 Running ForwardingKernelOutput Running LayerNormVectorized@NEON Running FCViaConvOne Scratches: 0 @ 0 Running ForwardingKernelReshape Running TransposeIndexed Running ForwardingKernelOutput Creating blob for Input layer 422 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 8, 122, 16] Creating blob for Input layer 430 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 122, 128] Running AIO Network External allocation: allocating 62464 bytes 
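The allocation sizes reported in these lines follow directly from the blob shapes at 4 bytes per element (assuming FLOAT here means 32-bit float): 1 x 122 x 128 x 4 = 62464, 3 x 1 x 8 x 122 x 16 x 4 = 187392, and 1 x 1000 x 4 = 4000 for the classifier output. A quick check, with the shapes copied from the log:

    # Element counts times 4 bytes per 32-bit float reproduce the logged allocation sizes.
    from math import prod

    for shape, expected in [([1, 122, 128], 62464),
                            ([3, 1, 8, 122, 16], 187392),
                            ([1, 1000], 4000)]:
        nbytes = prod(shape) * 4
        assert nbytes == expected, (shape, nbytes, expected)
        print(shape, "->", nbytes, "bytes")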
Creating external output: 0 , shape: [1, 122, 128] Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Input Running Data Running Data Running Data Running Data Running Input Running TransposeBERTVectorized@NEON Running ForwardingKernelReshape Running FCViaConvOne Scratches: 0 @ 0 Running LayerNormVectorized@NEON Running FCViaConvOne Scratches: 0 @ 0 Running UnaryOpVectorized[Gelu]@NEON Running FCViaConvOne Scratches: 0 @ 0 Running LayerNormVectorized@NEON Running TorchSliceVectorized@NEON Running ForwardingKernelOutput Creating blob for Input layer 454 with type FLOAT format PlainDataFormat(FORMATF_ROW_MAJOR)[0x0000000000000004] shape [1, 128] Running AIO Network External allocation: allocating 4000 bytes Creating external output: 0 , shape: [1, 1000] Running Data Running Data Running Input Running FCViaConvOne Scratches: 0 @ 0 Running ForwardingKernelOutput Creating blob for Input layer 4 with type FLOAT format PlainDataFormat(FORMATF_CAFFE)[0x000000000000000a] shape [1, 3, 110, 110] Running AIO Network External allocation: allocating 187392 bytes Creating external output: 0 , shape: [3, 1, 8, 122, 16] External allocation: allocating 62464 bytes Creating external output: 1 , shape: [1, 122, 128] Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Layer Transpose 33 was const-folded Running Data Running Data Running Data Running Input Running TransposeBRC3x4 Running ConvViaJitMatmul Running TransposeBRC4x4 Running ForwardingKernelFlatten Running Data Layer Transpose 31 was const-folded Running ConcatLastDim Running BinaryOpVectorized[Add]@NEON Running TransposeBRC4x4 Running LayerNormVectorized@NEON Running FCViaConvOne Scratches: 0 @ 0 Running ForwardingKernelReshape Running TransposeIndexed Running ForwardingKernelOutput Running ForwardingKernelOutput Creating blob for Input layer 37 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 8, 122, 16] Creating blob for Input layer 45 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 122, 128] Running AIO Network External allocation: allocating 62464 bytes Creating external output: 1 , shape: [1, 122, 128] External allocation: allocating 187392 bytes Creating external output: 0 , shape: [3, 1, 8, 122, 16] Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Input Running Data Running Data Running Data Running Data Running Input Running TransposeBERTVectorized@NEON Running ForwardingKernelReshape Running FCViaConvOne Scratches: 0 @ 0 Running LayerNormVectorized@NEON Running FCViaConvOne Scratches: 0 @ 0 Running UnaryOpVectorized[Gelu]@NEON Running FCViaConvOne Scratches: 0 @ 0 Running ForwardingKernelOutput Running LayerNormVectorized@NEON Running FCViaConvOne Scratches: 0 @ 0 Running ForwardingKernelReshape Running TransposeIndexed Running ForwardingKernelOutput Creating blob for Input layer 72 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 8, 122, 16] Creating blob for Input layer 80 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 122, 128] Running AIO Network External allocation: allocating 62464 bytes Creating external output: 1 , shape: 
[1, 122, 128] External allocation: allocating 187392 bytes Creating external output: 0 , shape: [3, 1, 8, 122, 16] Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Input Running Data Running Data Running Data Running Data Running Input Running TransposeBERTVectorized@NEON Running ForwardingKernelReshape Running FCViaConvOne Scratches: 0 @ 0 Running LayerNormVectorized@NEON Running FCViaConvOne Scratches: 0 @ 0 Running UnaryOpVectorized[Gelu]@NEON Running FCViaConvOne Scratches: 0 @ 0 Running ForwardingKernelOutput Running LayerNormVectorized@NEON Running FCViaConvOne Scratches: 0 @ 0 Running ForwardingKernelReshape Running TransposeIndexed Running ForwardingKernelOutput Creating blob for Input layer 107 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 8, 122, 16] Creating blob for Input layer 115 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 122, 128] Running AIO Network External allocation: allocating 62464 bytes Creating external output: 1 , shape: [1, 122, 128] External allocation: allocating 187392 bytes Creating external output: 0 , shape: [3, 1, 8, 122, 16] Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Input Running Data Running Data Running Data Running Data Running Input Running TransposeBERTVectorized@NEON Running ForwardingKernelReshape Running FCViaConvOne Scratches: 0 @ 0 Running LayerNormVectorized@NEON Running FCViaConvOne Scratches: 0 @ 0 Running UnaryOpVectorized[Gelu]@NEON Running FCViaConvOne Scratches: 0 @ 0 Running ForwardingKernelOutput Running LayerNormVectorized@NEON Running FCViaConvOne Scratches: 0 @ 0 Running ForwardingKernelReshape Running TransposeIndexed Running ForwardingKernelOutput Creating blob for Input layer 142 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 8, 122, 16] Creating blob for Input layer 150 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 122, 128] Running AIO Network External allocation: allocating 62464 bytes Creating external output: 1 , shape: [1, 122, 128] External allocation: allocating 187392 bytes Creating external output: 0 , shape: [3, 1, 8, 122, 16] Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Input Running Data Running Data Running Data Running Data Running Input Running TransposeBERTVectorized@NEON Running ForwardingKernelReshape Running FCViaConvOne Scratches: 0 @ 0 Running LayerNormVectorized@NEON Running FCViaConvOne Scratches: 0 @ 0 Running UnaryOpVectorized[Gelu]@NEON Running FCViaConvOne Scratches: 0 @ 0 Running ForwardingKernelOutput Running LayerNormVectorized@NEON Running FCViaConvOne Scratches: 0 @ 0 Running ForwardingKernelReshape Running TransposeIndexed Running ForwardingKernelOutput Creating blob for Input layer 177 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 8, 122, 16] Creating blob for Input layer 185 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 122, 128] Running AIO Network External allocation: allocating 62464 bytes Creating external output: 1 , shape: [1, 122, 128] External 
allocation: allocating 187392 bytes
Creating external output: 0 , shape: [3, 1, 8, 122, 16]
Running Data (x14)
Running Input
Running Data (x4)
Running Input
Running TransposeBERTVectorized@NEON
Running ForwardingKernelReshape
Running FCViaConvOne
Scratches: 0 @ 0
Running LayerNormVectorized@NEON
Running FCViaConvOne
Scratches: 0 @ 0
Running UnaryOpVectorized[Gelu]@NEON
Running FCViaConvOne
Scratches: 0 @ 0
Running ForwardingKernelOutput
Running LayerNormVectorized@NEON
Running FCViaConvOne
Scratches: 0 @ 0
Running ForwardingKernelReshape
Running TransposeIndexed
Running ForwardingKernelOutput
Creating blob for Input layer 212 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 8, 122, 16]
Creating blob for Input layer 220 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 122, 128]
Running AIO Network
External allocation: allocating 62464 bytes
Creating external output: 1 , shape: [1, 122, 128]
External allocation: allocating 187392 bytes
Creating external output: 0 , shape: [3, 1, 8, 122, 16]
[the kernel trace above repeats for this and each remaining sub-network of the iteration; only the ids in the "Creating blob for Input layer" lines change, running through the pairs 247/255, 282/290, 317/325, 352/360 and 387/395, all with the same shapes and allocation sizes]
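Annotation (not part of the original log): the external-allocation sizes reported above are consistent with the logged FLOAT blob shapes at 4 bytes per element. A minimal check, assuming FP32 storage:

# Sanity check: logged allocation sizes vs. element counts of the FLOAT shapes.
from math import prod

def nbytes(shape, bytes_per_elem=4):  # "FLOAT" blobs -> assume 4-byte FP32
    return prod(shape) * bytes_per_elem

assert nbytes([1, 122, 128]) == 62464        # "allocating 62464 bytes"
assert nbytes([3, 1, 8, 122, 16]) == 187392  # "allocating 187392 bytes"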
Creating blob for Input layer 422 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 8, 122, 16]
Creating blob for Input layer 430 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 122, 128]
Running AIO Network
External allocation: allocating 62464 bytes
Creating external output: 0 , shape: [1, 122, 128]
Running Data (x14)
Running Input
Running Data (x4)
Running Input
Running TransposeBERTVectorized@NEON
Running ForwardingKernelReshape
Running FCViaConvOne
Scratches: 0 @ 0
Running LayerNormVectorized@NEON
Running FCViaConvOne
Scratches: 0 @ 0
Running UnaryOpVectorized[Gelu]@NEON
Running FCViaConvOne
Scratches: 0 @ 0
Running LayerNormVectorized@NEON
Running TorchSliceVectorized@NEON
Running ForwardingKernelOutput
Creating blob for Input layer 454 with type FLOAT format PlainDataFormat(FORMATF_ROW_MAJOR)[0x0000000000000004] shape [1, 128]
Running AIO Network
External allocation: allocating 4000 bytes
Creating external output: 0 , shape: [1, 1000]
Running Data
Running Data
Running Input
Running FCViaConvOne
Scratches: 0 @ 0
Running ForwardingKernelOutput
Creating blob for Input layer 4 with type FLOAT format PlainDataFormat(FORMATF_CAFFE)[0x000000000000000a] shape [1, 3, 110, 110]
Running AIO Network
External allocation: allocating 187392 bytes
Creating external output: 0 , shape: [3, 1, 8, 122, 16]
External allocation: allocating 62464 bytes
Creating external output: 1 , shape: [1, 122, 128]
Running Data (x12)
Layer Transpose 33 was const-folded
Running Data (x3)
Running Input
Running TransposeBRC3x4
Running ConvViaJitMatmul
Running TransposeBRC4x4
Running ForwardingKernelFlatten
Running Data
Layer Transpose 31 was const-folded
Running ConcatLastDim
Running BinaryOpVectorized[Add]@NEON
Running TransposeBRC4x4
Running LayerNormVectorized@NEON
Running FCViaConvOne
Scratches: 0 @ 0
Running ForwardingKernelReshape
Running TransposeIndexed
Running ForwardingKernelOutput
Running ForwardingKernelOutput
Creating blob for Input layer 37 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 8, 122, 16]
Creating blob for Input layer 45 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 122, 128]
Running AIO Network
External allocation: allocating 62464 bytes
Creating external output: 1 , shape: [1, 122, 128]
External allocation: allocating 187392 bytes
Creating external output: 0 , shape: [3, 1, 8, 122, 16]
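Annotation (not part of the original log): the recurring shapes are consistent with a ViT-style encoder, although the log itself does not name the model. The readings below (Q/K/V packing in the leading dimension of [3, 1, 8, 122, 16], a 10x10 patch grid over the 110x110 input, a class token) are assumptions rather than facts stated in the log:

# Interpretation sketch; the heads/head_dim/qkv/patch readings are assumptions.
heads, head_dim = 8, 16           # from shape [3, 1, 8, 122, 16]
hidden = heads * head_dim         # 128, matches shape [1, 122, 128]
tokens = 11 * 11 + 1              # 122: assumed 10x10 patches on a 110x110
                                  # image, plus one class token
qkv = 3                           # assumed Q/K/V packing in the leading dim
classes = 1000                    # from the head output shape [1, 1000]
assert hidden == 128 and tokens == 122
assert qkv * heads * tokens * head_dim * 4 == 187392  # logged allocation
assert classes * 4 == 4000                            # logged head allocation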
16] Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Input Running Data Running Data Running Data Running Data Running Input Running TransposeBERTVectorized@NEON Running ForwardingKernelReshape Running FCViaConvOne Scratches: 0 @ 0 Running LayerNormVectorized@NEON Running FCViaConvOne Scratches: 0 @ 0 Running UnaryOpVectorized[Gelu]@NEON Running FCViaConvOne Scratches: 0 @ 0 Running ForwardingKernelOutput Running LayerNormVectorized@NEON Running FCViaConvOne Scratches: 0 @ 0 Running ForwardingKernelReshape Running TransposeIndexed Running ForwardingKernelOutput Creating blob for Input layer 72 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 8, 122, 16] Creating blob for Input layer 80 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 122, 128] Running AIO Network External allocation: allocating 62464 bytes Creating external output: 1 , shape: [1, 122, 128] External allocation: allocating 187392 bytes Creating external output: 0 , shape: [3, 1, 8, 122, 16] Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Input Running Data Running Data Running Data Running Data Running Input Running TransposeBERTVectorized@NEON Running ForwardingKernelReshape Running FCViaConvOne Scratches: 0 @ 0 Running LayerNormVectorized@NEON Running FCViaConvOne Scratches: 0 @ 0 Running UnaryOpVectorized[Gelu]@NEON Running FCViaConvOne Scratches: 0 @ 0 Running ForwardingKernelOutput Running LayerNormVectorized@NEON Running FCViaConvOne Scratches: 0 @ 0 Running ForwardingKernelReshape Running TransposeIndexed Running ForwardingKernelOutput Creating blob for Input layer 107 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 8, 122, 16] Creating blob for Input layer 115 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 122, 128] Running AIO Network External allocation: allocating 62464 bytes Creating external output: 1 , shape: [1, 122, 128] External allocation: allocating 187392 bytes Creating external output: 0 , shape: [3, 1, 8, 122, 16] Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Input Running Data Running Data Running Data Running Data Running Input Running TransposeBERTVectorized@NEON Running ForwardingKernelReshape Running FCViaConvOne Scratches: 0 @ 0 Running LayerNormVectorized@NEON Running FCViaConvOne Scratches: 0 @ 0 Running UnaryOpVectorized[Gelu]@NEON Running FCViaConvOne Scratches: 0 @ 0 Running ForwardingKernelOutput Running LayerNormVectorized@NEON Running FCViaConvOne Scratches: 0 @ 0 Running ForwardingKernelReshape Running TransposeIndexed Running ForwardingKernelOutput Creating blob for Input layer 142 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 8, 122, 16] Creating blob for Input layer 150 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 122, 128] Running AIO Network External allocation: allocating 62464 bytes Creating external output: 1 , shape: [1, 122, 128] External allocation: allocating 187392 bytes Creating external output: 0 , shape: [3, 1, 8, 122, 16] Running Data Running Data 
Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Input Running Data Running Data Running Data Running Data Running Input Running TransposeBERTVectorized@NEON Running ForwardingKernelReshape Running FCViaConvOne Scratches: 0 @ 0 Running LayerNormVectorized@NEON Running FCViaConvOne Scratches: 0 @ 0 Running UnaryOpVectorized[Gelu]@NEON Running FCViaConvOne Scratches: 0 @ 0 Running ForwardingKernelOutput Running LayerNormVectorized@NEON Running FCViaConvOne Scratches: 0 @ 0 Running ForwardingKernelReshape Running TransposeIndexed Running ForwardingKernelOutput Creating blob for Input layer 177 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 8, 122, 16] Creating blob for Input layer 185 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 122, 128] Running AIO Network External allocation: allocating 62464 bytes Creating external output: 1 , shape: [1, 122, 128] External allocation: allocating 187392 bytes Creating external output: 0 , shape: [3, 1, 8, 122, 16] Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Input Running Data Running Data Running Data Running Data Running Input Running TransposeBERTVectorized@NEON Running ForwardingKernelReshape Running FCViaConvOne Scratches: 0 @ 0 Running LayerNormVectorized@NEON Running FCViaConvOne Scratches: 0 @ 0 Running UnaryOpVectorized[Gelu]@NEON Running FCViaConvOne Scratches: 0 @ 0 Running ForwardingKernelOutput Running LayerNormVectorized@NEON Running FCViaConvOne Scratches: 0 @ 0 Running ForwardingKernelReshape Running TransposeIndexed Running ForwardingKernelOutput Creating blob for Input layer 212 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 8, 122, 16] Creating blob for Input layer 220 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 122, 128] Running AIO Network External allocation: allocating 62464 bytes Creating external output: 1 , shape: [1, 122, 128] External allocation: allocating 187392 bytes Creating external output: 0 , shape: [3, 1, 8, 122, 16] Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Input Running Data Running Data Running Data Running Data Running Input Running TransposeBERTVectorized@NEON Running ForwardingKernelReshape Running FCViaConvOne Scratches: 0 @ 0 Running LayerNormVectorized@NEON Running FCViaConvOne Scratches: 0 @ 0 Running UnaryOpVectorized[Gelu]@NEON Running FCViaConvOne Scratches: 0 @ 0 Running ForwardingKernelOutput Running LayerNormVectorized@NEON Running FCViaConvOne Scratches: 0 @ 0 Running ForwardingKernelReshape Running TransposeIndexed Running ForwardingKernelOutput Creating blob for Input layer 247 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 8, 122, 16] Creating blob for Input layer 255 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 122, 128] Running AIO Network External allocation: allocating 62464 bytes Creating external output: 1 , shape: [1, 122, 128] External allocation: allocating 187392 bytes Creating external output: 0 , shape: [3, 1, 8, 122, 16] Running Data Running Data Running Data Running Data 
Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Input Running Data Running Data Running Data Running Data Running Input Running TransposeBERTVectorized@NEON Running ForwardingKernelReshape Running FCViaConvOne Scratches: 0 @ 0 Running LayerNormVectorized@NEON Running FCViaConvOne Scratches: 0 @ 0 Running UnaryOpVectorized[Gelu]@NEON Running FCViaConvOne Scratches: 0 @ 0 Running ForwardingKernelOutput Running LayerNormVectorized@NEON Running FCViaConvOne Scratches: 0 @ 0 Running ForwardingKernelReshape Running TransposeIndexed Running ForwardingKernelOutput Creating blob for Input layer 282 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 8, 122, 16] Creating blob for Input layer 290 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 122, 128] Running AIO Network External allocation: allocating 62464 bytes Creating external output: 1 , shape: [1, 122, 128] External allocation: allocating 187392 bytes Creating external output: 0 , shape: [3, 1, 8, 122, 16] Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Input Running Data Running Data Running Data Running Data Running Input Running TransposeBERTVectorized@NEON Running ForwardingKernelReshape Running FCViaConvOne Scratches: 0 @ 0 Running LayerNormVectorized@NEON Running FCViaConvOne Scratches: 0 @ 0 Running UnaryOpVectorized[Gelu]@NEON Running FCViaConvOne Scratches: 0 @ 0 Running ForwardingKernelOutput Running LayerNormVectorized@NEON Running FCViaConvOne Scratches: 0 @ 0 Running ForwardingKernelReshape Running TransposeIndexed Running ForwardingKernelOutput Creating blob for Input layer 317 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 8, 122, 16] Creating blob for Input layer 325 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 122, 128] Running AIO Network External allocation: allocating 62464 bytes Creating external output: 1 , shape: [1, 122, 128] External allocation: allocating 187392 bytes Creating external output: 0 , shape: [3, 1, 8, 122, 16] Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Input Running Data Running Data Running Data Running Data Running Input Running TransposeBERTVectorized@NEON Running ForwardingKernelReshape Running FCViaConvOne Scratches: 0 @ 0 Running LayerNormVectorized@NEON Running FCViaConvOne Scratches: 0 @ 0 Running UnaryOpVectorized[Gelu]@NEON Running FCViaConvOne Scratches: 0 @ 0 Running ForwardingKernelOutput Running LayerNormVectorized@NEON Running FCViaConvOne Scratches: 0 @ 0 Running ForwardingKernelReshape Running TransposeIndexed Running ForwardingKernelOutput Creating blob for Input layer 352 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 8, 122, 16] Creating blob for Input layer 360 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 122, 128] Running AIO Network External allocation: allocating 62464 bytes Creating external output: 1 , shape: [1, 122, 128] External allocation: allocating 187392 bytes Creating external output: 0 , shape: [3, 1, 8, 122, 16] Running Data Running Data Running Data Running Data Running Data Running Data 
Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Input Running Data Running Data Running Data Running Data Running Input Running TransposeBERTVectorized@NEON Running ForwardingKernelReshape Running FCViaConvOne Scratches: 0 @ 0 Running LayerNormVectorized@NEON Running FCViaConvOne Scratches: 0 @ 0 Running UnaryOpVectorized[Gelu]@NEON Running FCViaConvOne Scratches: 0 @ 0 Running ForwardingKernelOutput Running LayerNormVectorized@NEON Running FCViaConvOne Scratches: 0 @ 0 Running ForwardingKernelReshape Running TransposeIndexed Running ForwardingKernelOutput Creating blob for Input layer 387 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 8, 122, 16] Creating blob for Input layer 395 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 122, 128] Running AIO Network External allocation: allocating 62464 bytes Creating external output: 1 , shape: [1, 122, 128] External allocation: allocating 187392 bytes Creating external output: 0 , shape: [3, 1, 8, 122, 16] Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Input Running Data Running Data Running Data Running Data Running Input Running TransposeBERTVectorized@NEON Running ForwardingKernelReshape Running FCViaConvOne Scratches: 0 @ 0 Running LayerNormVectorized@NEON Running FCViaConvOne Scratches: 0 @ 0 Running UnaryOpVectorized[Gelu]@NEON Running FCViaConvOne Scratches: 0 @ 0 Running ForwardingKernelOutput Running LayerNormVectorized@NEON Running FCViaConvOne Scratches: 0 @ 0 Running ForwardingKernelReshape Running TransposeIndexed Running ForwardingKernelOutput Creating blob for Input layer 422 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 8, 122, 16] Creating blob for Input layer 430 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 122, 128] Running AIO Network External allocation: allocating 62464 bytes Creating external output: 0 , shape: [1, 122, 128] Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Input Running Data Running Data Running Data Running Data Running Input Running TransposeBERTVectorized@NEON Running ForwardingKernelReshape Running FCViaConvOne Scratches: 0 @ 0 Running LayerNormVectorized@NEON Running FCViaConvOne Scratches: 0 @ 0 Running UnaryOpVectorized[Gelu]@NEON Running FCViaConvOne Scratches: 0 @ 0 Running LayerNormVectorized@NEON Running TorchSliceVectorized@NEON Running ForwardingKernelOutput Creating blob for Input layer 454 with type FLOAT format PlainDataFormat(FORMATF_ROW_MAJOR)[0x0000000000000004] shape [1, 128] Running AIO Network External allocation: allocating 4000 bytes Creating external output: 0 , shape: [1, 1000] Running Data Running Data Running Input Running FCViaConvOne Scratches: 0 @ 0 Running ForwardingKernelOutput Creating blob for Input layer 4 with type FLOAT format PlainDataFormat(FORMATF_CAFFE)[0x000000000000000a] shape [1, 3, 110, 110] Running AIO Network External allocation: allocating 187392 bytes Creating external output: 0 , shape: [3, 1, 8, 122, 16] External allocation: allocating 62464 bytes Creating external output: 1 , shape: [1, 122, 128] Running Data Running Data Running Data Running Data Running Data Running Data 
Running Data Running Data Running Data Running Data Running Data Running Data Layer Transpose 33 was const-folded Running Data Running Data Running Data Running Input Running TransposeBRC3x4 Running ConvViaJitMatmul Running TransposeBRC4x4 Running ForwardingKernelFlatten Running Data Layer Transpose 31 was const-folded Running ConcatLastDim Running BinaryOpVectorized[Add]@NEON Running TransposeBRC4x4 Running LayerNormVectorized@NEON Running FCViaConvOne Scratches: 0 @ 0 Running ForwardingKernelReshape Running TransposeIndexed Running ForwardingKernelOutput Running ForwardingKernelOutput Creating blob for Input layer 37 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 8, 122, 16] Creating blob for Input layer 45 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 122, 128] Running AIO Network External allocation: allocating 62464 bytes Creating external output: 1 , shape: [1, 122, 128] External allocation: allocating 187392 bytes Creating external output: 0 , shape: [3, 1, 8, 122, 16] Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Input Running Data Running Data Running Data Running Data Running Input Running TransposeBERTVectorized@NEON Running ForwardingKernelReshape Running FCViaConvOne Scratches: 0 @ 0 Running LayerNormVectorized@NEON Running FCViaConvOne Scratches: 0 @ 0 Running UnaryOpVectorized[Gelu]@NEON Running FCViaConvOne Scratches: 0 @ 0 Running ForwardingKernelOutput Running LayerNormVectorized@NEON Running FCViaConvOne Scratches: 0 @ 0 Running ForwardingKernelReshape Running TransposeIndexed Running ForwardingKernelOutput Creating blob for Input layer 72 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 8, 122, 16] Creating blob for Input layer 80 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 122, 128] Running AIO Network External allocation: allocating 62464 bytes Creating external output: 1 , shape: [1, 122, 128] External allocation: allocating 187392 bytes Creating external output: 0 , shape: [3, 1, 8, 122, 16] Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Input Running Data Running Data Running Data Running Data Running Input Running TransposeBERTVectorized@NEON Running ForwardingKernelReshape Running FCViaConvOne Scratches: 0 @ 0 Running LayerNormVectorized@NEON Running FCViaConvOne Scratches: 0 @ 0 Running UnaryOpVectorized[Gelu]@NEON Running FCViaConvOne Scratches: 0 @ 0 Running ForwardingKernelOutput Running LayerNormVectorized@NEON Running FCViaConvOne Scratches: 0 @ 0 Running ForwardingKernelReshape Running TransposeIndexed Running ForwardingKernelOutput Creating blob for Input layer 107 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 8, 122, 16] Creating blob for Input layer 115 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 122, 128] Running AIO Network External allocation: allocating 62464 bytes Creating external output: 1 , shape: [1, 122, 128] External allocation: allocating 187392 bytes Creating external output: 0 , shape: [3, 1, 8, 122, 16] Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data 
Running Data Running Data Running Data Running Input Running Data Running Data Running Data Running Data Running Input Running TransposeBERTVectorized@NEON Running ForwardingKernelReshape Running FCViaConvOne Scratches: 0 @ 0 Running LayerNormVectorized@NEON Running FCViaConvOne Scratches: 0 @ 0 Running UnaryOpVectorized[Gelu]@NEON Running FCViaConvOne Scratches: 0 @ 0 Running ForwardingKernelOutput Running LayerNormVectorized@NEON Running FCViaConvOne Scratches: 0 @ 0 Running ForwardingKernelReshape Running TransposeIndexed Running ForwardingKernelOutput Creating blob for Input layer 142 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 8, 122, 16] Creating blob for Input layer 150 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 122, 128] Running AIO Network External allocation: allocating 62464 bytes Creating external output: 1 , shape: [1, 122, 128] External allocation: allocating 187392 bytes Creating external output: 0 , shape: [3, 1, 8, 122, 16] Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Input Running Data Running Data Running Data Running Data Running Input Running TransposeBERTVectorized@NEON Running ForwardingKernelReshape Running FCViaConvOne Scratches: 0 @ 0 Running LayerNormVectorized@NEON Running FCViaConvOne Scratches: 0 @ 0 Running UnaryOpVectorized[Gelu]@NEON Running FCViaConvOne Scratches: 0 @ 0 Running ForwardingKernelOutput Running LayerNormVectorized@NEON Running FCViaConvOne Scratches: 0 @ 0 Running ForwardingKernelReshape Running TransposeIndexed Running ForwardingKernelOutput Creating blob for Input layer 177 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 8, 122, 16] Creating blob for Input layer 185 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 122, 128] Running AIO Network External allocation: allocating 62464 bytes Creating external output: 1 , shape: [1, 122, 128] External allocation: allocating 187392 bytes Creating external output: 0 , shape: [3, 1, 8, 122, 16] Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Input Running Data Running Data Running Data Running Data Running Input Running TransposeBERTVectorized@NEON Running ForwardingKernelReshape Running FCViaConvOne Scratches: 0 @ 0 Running LayerNormVectorized@NEON Running FCViaConvOne Scratches: 0 @ 0 Running UnaryOpVectorized[Gelu]@NEON Running FCViaConvOne Scratches: 0 @ 0 Running ForwardingKernelOutput Running LayerNormVectorized@NEON Running FCViaConvOne Scratches: 0 @ 0 Running ForwardingKernelReshape Running TransposeIndexed Running ForwardingKernelOutput Creating blob for Input layer 212 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 8, 122, 16] Creating blob for Input layer 220 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 122, 128] Running AIO Network External allocation: allocating 62464 bytes Creating external output: 1 , shape: [1, 122, 128] External allocation: allocating 187392 bytes Creating external output: 0 , shape: [3, 1, 8, 122, 16] Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data 
Running Data Running Input Running Data Running Data Running Data Running Data Running Input Running TransposeBERTVectorized@NEON Running ForwardingKernelReshape Running FCViaConvOne Scratches: 0 @ 0 Running LayerNormVectorized@NEON Running FCViaConvOne Scratches: 0 @ 0 Running UnaryOpVectorized[Gelu]@NEON Running FCViaConvOne Scratches: 0 @ 0 Running ForwardingKernelOutput Running LayerNormVectorized@NEON Running FCViaConvOne Scratches: 0 @ 0 Running ForwardingKernelReshape Running TransposeIndexed Running ForwardingKernelOutput Creating blob for Input layer 247 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 8, 122, 16] Creating blob for Input layer 255 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 122, 128] Running AIO Network External allocation: allocating 62464 bytes Creating external output: 1 , shape: [1, 122, 128] External allocation: allocating 187392 bytes Creating external output: 0 , shape: [3, 1, 8, 122, 16] Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Input Running Data Running Data Running Data Running Data Running Input Running TransposeBERTVectorized@NEON Running ForwardingKernelReshape Running FCViaConvOne Scratches: 0 @ 0 Running LayerNormVectorized@NEON Running FCViaConvOne Scratches: 0 @ 0 Running UnaryOpVectorized[Gelu]@NEON Running FCViaConvOne Scratches: 0 @ 0 Running ForwardingKernelOutput Running LayerNormVectorized@NEON Running FCViaConvOne Scratches: 0 @ 0 Running ForwardingKernelReshape Running TransposeIndexed Running ForwardingKernelOutput Creating blob for Input layer 282 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 8, 122, 16] Creating blob for Input layer 290 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 122, 128] Running AIO Network External allocation: allocating 62464 bytes Creating external output: 1 , shape: [1, 122, 128] External allocation: allocating 187392 bytes Creating external output: 0 , shape: [3, 1, 8, 122, 16] Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Input Running Data Running Data Running Data Running Data Running Input Running TransposeBERTVectorized@NEON Running ForwardingKernelReshape Running FCViaConvOne Scratches: 0 @ 0 Running LayerNormVectorized@NEON Running FCViaConvOne Scratches: 0 @ 0 Running UnaryOpVectorized[Gelu]@NEON Running FCViaConvOne Scratches: 0 @ 0 Running ForwardingKernelOutput Running LayerNormVectorized@NEON Running FCViaConvOne Scratches: 0 @ 0 Running ForwardingKernelReshape Running TransposeIndexed Running ForwardingKernelOutput Creating blob for Input layer 317 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 8, 122, 16] Creating blob for Input layer 325 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 122, 128] Running AIO Network External allocation: allocating 62464 bytes Creating external output: 1 , shape: [1, 122, 128] External allocation: allocating 187392 bytes Creating external output: 0 , shape: [3, 1, 8, 122, 16] Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Input 
Running Data Running Data Running Data Running Data Running Input Running TransposeBERTVectorized@NEON Running ForwardingKernelReshape Running FCViaConvOne Scratches: 0 @ 0 Running LayerNormVectorized@NEON Running FCViaConvOne Scratches: 0 @ 0 Running UnaryOpVectorized[Gelu]@NEON Running FCViaConvOne Scratches: 0 @ 0 Running ForwardingKernelOutput Running LayerNormVectorized@NEON Running FCViaConvOne Scratches: 0 @ 0 Running ForwardingKernelReshape Running TransposeIndexed Running ForwardingKernelOutput Creating blob for Input layer 352 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 8, 122, 16] Creating blob for Input layer 360 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 122, 128] Running AIO Network External allocation: allocating 62464 bytes Creating external output: 1 , shape: [1, 122, 128] External allocation: allocating 187392 bytes Creating external output: 0 , shape: [3, 1, 8, 122, 16] Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Input Running Data Running Data Running Data Running Data Running Input Running TransposeBERTVectorized@NEON Running ForwardingKernelReshape Running FCViaConvOne Scratches: 0 @ 0 Running LayerNormVectorized@NEON Running FCViaConvOne Scratches: 0 @ 0 Running UnaryOpVectorized[Gelu]@NEON Running FCViaConvOne Scratches: 0 @ 0 Running ForwardingKernelOutput Running LayerNormVectorized@NEON Running FCViaConvOne Scratches: 0 @ 0 Running ForwardingKernelReshape Running TransposeIndexed Running ForwardingKernelOutput Creating blob for Input layer 387 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 8, 122, 16] Creating blob for Input layer 395 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 122, 128] Running AIO Network External allocation: allocating 62464 bytes Creating external output: 1 , shape: [1, 122, 128] External allocation: allocating 187392 bytes Creating external output: 0 , shape: [3, 1, 8, 122, 16] Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Input Running Data Running Data Running Data Running Data Running Input Running TransposeBERTVectorized@NEON Running ForwardingKernelReshape Running FCViaConvOne Scratches: 0 @ 0 Running LayerNormVectorized@NEON Running FCViaConvOne Scratches: 0 @ 0 Running UnaryOpVectorized[Gelu]@NEON Running FCViaConvOne Scratches: 0 @ 0 Running ForwardingKernelOutput Running LayerNormVectorized@NEON Running FCViaConvOne Scratches: 0 @ 0 Running ForwardingKernelReshape Running TransposeIndexed Running ForwardingKernelOutput Creating blob for Input layer 422 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 8, 122, 16] Creating blob for Input layer 430 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 122, 128] Running AIO Network External allocation: allocating 62464 bytes Creating external output: 0 , shape: [1, 122, 128] Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Input Running Data Running Data Running Data Running Data Running Input Running TransposeBERTVectorized@NEON Running 
ForwardingKernelReshape Running FCViaConvOne Scratches: 0 @ 0 Running LayerNormVectorized@NEON Running FCViaConvOne Scratches: 0 @ 0 Running UnaryOpVectorized[Gelu]@NEON Running FCViaConvOne Scratches: 0 @ 0 Running LayerNormVectorized@NEON Running TorchSliceVectorized@NEON Running ForwardingKernelOutput Creating blob for Input layer 454 with type FLOAT format PlainDataFormat(FORMATF_ROW_MAJOR)[0x0000000000000004] shape [1, 128] Running AIO Network External allocation: allocating 4000 bytes Creating external output: 0 , shape: [1, 1000] Running Data Running Data Running Input Running FCViaConvOne Scratches: 0 @ 0 Running ForwardingKernelOutput Creating blob for Input layer 4 with type FLOAT format PlainDataFormat(FORMATF_CAFFE)[0x000000000000000a] shape [1, 3, 110, 110] Running AIO Network External allocation: allocating 187392 bytes Creating external output: 0 , shape: [3, 1, 8, 122, 16] External allocation: allocating 62464 bytes Creating external output: 1 , shape: [1, 122, 128] Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Layer Transpose 33 was const-folded Running Data Running Data Running Data Running Input Running TransposeBRC3x4 Running ConvViaJitMatmul Running TransposeBRC4x4 Running ForwardingKernelFlatten Running Data Layer Transpose 31 was const-folded Running ConcatLastDim Running BinaryOpVectorized[Add]@NEON Running TransposeBRC4x4 Running LayerNormVectorized@NEON Running FCViaConvOne Scratches: 0 @ 0 Running ForwardingKernelReshape Running TransposeIndexed Running ForwardingKernelOutput Running ForwardingKernelOutput Creating blob for Input layer 37 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 8, 122, 16] Creating blob for Input layer 45 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 122, 128] Running AIO Network External allocation: allocating 62464 bytes Creating external output: 1 , shape: [1, 122, 128] External allocation: allocating 187392 bytes Creating external output: 0 , shape: [3, 1, 8, 122, 16] Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Input Running Data Running Data Running Data Running Data Running Input Running TransposeBERTVectorized@NEON Running ForwardingKernelReshape Running FCViaConvOne Scratches: 0 @ 0 Running LayerNormVectorized@NEON Running FCViaConvOne Scratches: 0 @ 0 Running UnaryOpVectorized[Gelu]@NEON Running FCViaConvOne Scratches: 0 @ 0 Running ForwardingKernelOutput Running LayerNormVectorized@NEON Running FCViaConvOne Scratches: 0 @ 0 Running ForwardingKernelReshape Running TransposeIndexed Running ForwardingKernelOutput Creating blob for Input layer 72 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 8, 122, 16] Creating blob for Input layer 80 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 122, 128] Running AIO Network External allocation: allocating 62464 bytes Creating external output: 1 , shape: [1, 122, 128] External allocation: allocating 187392 bytes Creating external output: 0 , shape: [3, 1, 8, 122, 16] Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Input Running Data Running Data Running Data Running 
Data Running Input Running TransposeBERTVectorized@NEON Running ForwardingKernelReshape Running FCViaConvOne Scratches: 0 @ 0 Running LayerNormVectorized@NEON Running FCViaConvOne Scratches: 0 @ 0 Running UnaryOpVectorized[Gelu]@NEON Running FCViaConvOne Scratches: 0 @ 0 Running ForwardingKernelOutput Running LayerNormVectorized@NEON Running FCViaConvOne Scratches: 0 @ 0 Running ForwardingKernelReshape Running TransposeIndexed Running ForwardingKernelOutput Creating blob for Input layer 107 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 8, 122, 16] Creating blob for Input layer 115 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 122, 128] Running AIO Network External allocation: allocating 62464 bytes Creating external output: 1 , shape: [1, 122, 128] External allocation: allocating 187392 bytes Creating external output: 0 , shape: [3, 1, 8, 122, 16] Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Input Running Data Running Data Running Data Running Data Running Input Running TransposeBERTVectorized@NEON Running ForwardingKernelReshape Running FCViaConvOne Scratches: 0 @ 0 Running LayerNormVectorized@NEON Running FCViaConvOne Scratches: 0 @ 0 Running UnaryOpVectorized[Gelu]@NEON Running FCViaConvOne Scratches: 0 @ 0 Running ForwardingKernelOutput Running LayerNormVectorized@NEON Running FCViaConvOne Scratches: 0 @ 0 Running ForwardingKernelReshape Running TransposeIndexed Running ForwardingKernelOutput Creating blob for Input layer 142 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 8, 122, 16] Creating blob for Input layer 150 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 122, 128] Running AIO Network External allocation: allocating 62464 bytes Creating external output: 1 , shape: [1, 122, 128] External allocation: allocating 187392 bytes Creating external output: 0 , shape: [3, 1, 8, 122, 16] Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Input Running Data Running Data Running Data Running Data Running Input Running TransposeBERTVectorized@NEON Running ForwardingKernelReshape Running FCViaConvOne Scratches: 0 @ 0 Running LayerNormVectorized@NEON Running FCViaConvOne Scratches: 0 @ 0 Running UnaryOpVectorized[Gelu]@NEON Running FCViaConvOne Scratches: 0 @ 0 Running ForwardingKernelOutput Running LayerNormVectorized@NEON Running FCViaConvOne Scratches: 0 @ 0 Running ForwardingKernelReshape Running TransposeIndexed Running ForwardingKernelOutput Creating blob for Input layer 177 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 8, 122, 16] Creating blob for Input layer 185 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 122, 128] Running AIO Network External allocation: allocating 62464 bytes Creating external output: 1 , shape: [1, 122, 128] External allocation: allocating 187392 bytes Creating external output: 0 , shape: [3, 1, 8, 122, 16] Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Input Running Data Running Data Running Data Running Data Running Input Running 
TransposeBERTVectorized@NEON Running ForwardingKernelReshape Running FCViaConvOne Scratches: 0 @ 0 Running LayerNormVectorized@NEON Running FCViaConvOne Scratches: 0 @ 0 Running UnaryOpVectorized[Gelu]@NEON Running FCViaConvOne Scratches: 0 @ 0 Running ForwardingKernelOutput Running LayerNormVectorized@NEON Running FCViaConvOne Scratches: 0 @ 0 Running ForwardingKernelReshape Running TransposeIndexed Running ForwardingKernelOutput Creating blob for Input layer 212 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 8, 122, 16] Creating blob for Input layer 220 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 122, 128] Running AIO Network External allocation: allocating 62464 bytes Creating external output: 1 , shape: [1, 122, 128] External allocation: allocating 187392 bytes Creating external output: 0 , shape: [3, 1, 8, 122, 16] Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Input Running Data Running Data Running Data Running Data Running Input Running TransposeBERTVectorized@NEON Running ForwardingKernelReshape Running FCViaConvOne Scratches: 0 @ 0 Running LayerNormVectorized@NEON Running FCViaConvOne Scratches: 0 @ 0 Running UnaryOpVectorized[Gelu]@NEON Running FCViaConvOne Scratches: 0 @ 0 Running ForwardingKernelOutput Running LayerNormVectorized@NEON Running FCViaConvOne Scratches: 0 @ 0 Running ForwardingKernelReshape Running TransposeIndexed Running ForwardingKernelOutput Creating blob for Input layer 247 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 8, 122, 16] Creating blob for Input layer 255 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 122, 128] Running AIO Network External allocation: allocating 62464 bytes Creating external output: 1 , shape: [1, 122, 128] External allocation: allocating 187392 bytes Creating external output: 0 , shape: [3, 1, 8, 122, 16] Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Input Running Data Running Data Running Data Running Data Running Input Running TransposeBERTVectorized@NEON Running ForwardingKernelReshape Running FCViaConvOne Scratches: 0 @ 0 Running LayerNormVectorized@NEON Running FCViaConvOne Scratches: 0 @ 0 Running UnaryOpVectorized[Gelu]@NEON Running FCViaConvOne Scratches: 0 @ 0 Running ForwardingKernelOutput Running LayerNormVectorized@NEON Running FCViaConvOne Scratches: 0 @ 0 Running ForwardingKernelReshape Running TransposeIndexed Running ForwardingKernelOutput Creating blob for Input layer 282 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 8, 122, 16] Creating blob for Input layer 290 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 122, 128] Running AIO Network External allocation: allocating 62464 bytes Creating external output: 1 , shape: [1, 122, 128] External allocation: allocating 187392 bytes Creating external output: 0 , shape: [3, 1, 8, 122, 16] Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Input Running Data Running Data Running Data Running Data Running Input Running TransposeBERTVectorized@NEON 
Running ForwardingKernelReshape Running FCViaConvOne Scratches: 0 @ 0 Running LayerNormVectorized@NEON Running FCViaConvOne Scratches: 0 @ 0 Running UnaryOpVectorized[Gelu]@NEON Running FCViaConvOne Scratches: 0 @ 0 Running ForwardingKernelOutput Running LayerNormVectorized@NEON Running FCViaConvOne Scratches: 0 @ 0 Running ForwardingKernelReshape Running TransposeIndexed Running ForwardingKernelOutput Creating blob for Input layer 317 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 8, 122, 16] Creating blob for Input layer 325 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 122, 128] Running AIO Network External allocation: allocating 62464 bytes Creating external output: 1 , shape: [1, 122, 128] External allocation: allocating 187392 bytes Creating external output: 0 , shape: [3, 1, 8, 122, 16] Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Input Running Data Running Data Running Data Running Data Running Input Running TransposeBERTVectorized@NEON Running ForwardingKernelReshape Running FCViaConvOne Scratches: 0 @ 0 Running LayerNormVectorized@NEON Running FCViaConvOne Scratches: 0 @ 0 Running UnaryOpVectorized[Gelu]@NEON Running FCViaConvOne Scratches: 0 @ 0 Running ForwardingKernelOutput Running LayerNormVectorized@NEON Running FCViaConvOne Scratches: 0 @ 0 Running ForwardingKernelReshape Running TransposeIndexed Running ForwardingKernelOutput Creating blob for Input layer 352 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 8, 122, 16] Creating blob for Input layer 360 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 122, 128] Running AIO Network External allocation: allocating 62464 bytes Creating external output: 1 , shape: [1, 122, 128] External allocation: allocating 187392 bytes Creating external output: 0 , shape: [3, 1, 8, 122, 16] Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Input Running Data Running Data Running Data Running Data Running Input Running TransposeBERTVectorized@NEON Running ForwardingKernelReshape Running FCViaConvOne Scratches: 0 @ 0 Running LayerNormVectorized@NEON Running FCViaConvOne Scratches: 0 @ 0 Running UnaryOpVectorized[Gelu]@NEON Running FCViaConvOne Scratches: 0 @ 0 Running ForwardingKernelOutput Running LayerNormVectorized@NEON Running FCViaConvOne Scratches: 0 @ 0 Running ForwardingKernelReshape Running TransposeIndexed Running ForwardingKernelOutput Creating blob for Input layer 387 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 8, 122, 16] Creating blob for Input layer 395 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 122, 128] Running AIO Network External allocation: allocating 62464 bytes Creating external output: 1 , shape: [1, 122, 128] External allocation: allocating 187392 bytes Creating external output: 0 , shape: [3, 1, 8, 122, 16] Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Input Running Data Running Data Running Data Running Data Running Input Running TransposeBERTVectorized@NEON Running 
ForwardingKernelReshape Running FCViaConvOne Scratches: 0 @ 0 Running LayerNormVectorized@NEON Running FCViaConvOne Scratches: 0 @ 0 Running UnaryOpVectorized[Gelu]@NEON Running FCViaConvOne Scratches: 0 @ 0 Running ForwardingKernelOutput Running LayerNormVectorized@NEON Running FCViaConvOne Scratches: 0 @ 0 Running ForwardingKernelReshape Running TransposeIndexed Running ForwardingKernelOutput Creating blob for Input layer 422 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 8, 122, 16] Creating blob for Input layer 430 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 122, 128] Running AIO Network External allocation: allocating 62464 bytes Creating external output: 0 , shape: [1, 122, 128] Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Input Running Data Running Data Running Data Running Data Running Input Running TransposeBERTVectorized@NEON Running ForwardingKernelReshape Running FCViaConvOne Scratches: 0 @ 0 Running LayerNormVectorized@NEON Running FCViaConvOne Scratches: 0 @ 0 Running UnaryOpVectorized[Gelu]@NEON Running FCViaConvOne Scratches: 0 @ 0 Running LayerNormVectorized@NEON Running TorchSliceVectorized@NEON Running ForwardingKernelOutput Creating blob for Input layer 454 with type FLOAT format PlainDataFormat(FORMATF_ROW_MAJOR)[0x0000000000000004] shape [1, 128] Running AIO Network External allocation: allocating 4000 bytes Creating external output: 0 , shape: [1, 1000] Running Data Running Data Running Input Running FCViaConvOne Scratches: 0 @ 0 Running ForwardingKernelOutput Creating blob for Input layer 4 with type FLOAT format PlainDataFormat(FORMATF_CAFFE)[0x000000000000000a] shape [1, 3, 110, 110] Running AIO Network External allocation: allocating 187392 bytes Creating external output: 0 , shape: [3, 1, 8, 122, 16] External allocation: allocating 62464 bytes Creating external output: 1 , shape: [1, 122, 128] Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Layer Transpose 33 was const-folded Running Data Running Data Running Data Running Input Running TransposeBRC3x4 Running ConvViaJitMatmul Running TransposeBRC4x4 Running ForwardingKernelFlatten Running Data Layer Transpose 31 was const-folded Running ConcatLastDim Running BinaryOpVectorized[Add]@NEON Running TransposeBRC4x4 Running LayerNormVectorized@NEON Running FCViaConvOne Scratches: 0 @ 0 Running ForwardingKernelReshape Running TransposeIndexed Running ForwardingKernelOutput Running ForwardingKernelOutput Creating blob for Input layer 37 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 8, 122, 16] Creating blob for Input layer 45 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 122, 128] Running AIO Network External allocation: allocating 62464 bytes Creating external output: 1 , shape: [1, 122, 128] External allocation: allocating 187392 bytes Creating external output: 0 , shape: [3, 1, 8, 122, 16] Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Data Running Input Running Data Running Data Running Data Running Data Running Input Running TransposeBERTVectorized@NEON Running ForwardingKernelReshape Running 
FCViaConvOne
Scratches: 0 @ 0
Running LayerNormVectorized@NEON
Running FCViaConvOne
Scratches: 0 @ 0
Running UnaryOpVectorized[Gelu]@NEON
Running FCViaConvOne
Scratches: 0 @ 0
Running ForwardingKernelOutput
Running LayerNormVectorized@NEON
Running FCViaConvOne
Scratches: 0 @ 0
Running ForwardingKernelReshape
Running TransposeIndexed
Running ForwardingKernelOutput
Creating blob for Input layer 72 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 8, 122, 16]
Creating blob for Input layer 80 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 122, 128]
Running AIO Network
External allocation: allocating 62464 bytes
Creating external output: 1 , shape: [1, 122, 128]
External allocation: allocating 187392 bytes
Creating external output: 0 , shape: [3, 1, 8, 122, 16]
Running Data (repeated 14 times)
Running Input
Running Data (repeated 4 times)
Running Input
Running TransposeBERTVectorized@NEON
Running ForwardingKernelReshape
Running FCViaConvOne
Scratches: 0 @ 0
Running LayerNormVectorized@NEON
Running FCViaConvOne
Scratches: 0 @ 0
Running UnaryOpVectorized[Gelu]@NEON
Running FCViaConvOne
Scratches: 0 @ 0
Running ForwardingKernelOutput
Running LayerNormVectorized@NEON
Running FCViaConvOne
Scratches: 0 @ 0
Running ForwardingKernelReshape
Running TransposeIndexed
Running ForwardingKernelOutput
[The same blob creation, external allocations, and kernel sequence are then logged for the remaining encoder blocks of this run, with input layer pairs 107/115, 142/150, 177/185, 212/220, 247/255, 282/290, 317/325, 352/360, and 387/395.]
Creating blob for Input layer 422 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 8, 122, 16]
Creating blob for Input layer 430 with type FLOAT format PlainDataFormat(FORMATF_ANY)[0x0000000000000018] shape [1, 122, 128]
Running AIO Network
External allocation: allocating 62464 bytes
Creating external output: 0 , shape: [1, 122, 128]
Running Data (repeated 14 times)
Running Input
Running Data (repeated 4 times)
Running Input
Running TransposeBERTVectorized@NEON
Running ForwardingKernelReshape
Running FCViaConvOne
Scratches: 0 @ 0
Running LayerNormVectorized@NEON
Running FCViaConvOne
Scratches: 0 @ 0
Running UnaryOpVectorized[Gelu]@NEON
Running FCViaConvOne
Scratches: 0 @ 0
Running LayerNormVectorized@NEON
Running TorchSliceVectorized@NEON
Running ForwardingKernelOutput
Creating blob for Input layer 454 with type FLOAT format PlainDataFormat(FORMATF_ROW_MAJOR)[0x0000000000000004] shape [1, 128]
Running AIO Network
External allocation: allocating 4000 bytes
Creating external output: 0 , shape: [1, 1000]
Running Data (repeated 2 times)
Running Input
Running FCViaConvOne
Scratches: 0 @ 0
Running ForwardingKernelOutput
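As an aside, the external allocation sizes reported above match FP32 buffers of the listed output shapes. A quick check (the shapes and byte counts come from the log; the 4-byte element size is my assumption):

```python
from math import prod

# "Creating external output" shapes paired with the byte counts from the
# matching "External allocation" lines in the log above.
outputs = {
    (1, 122, 128): 62464,        # per-block token embeddings
    (3, 1, 8, 122, 16): 187392,  # per-block Q/K/V-shaped output
    (1, 1000): 4000,             # classifier logits
}

for shape, reported in outputs.items():
    computed = prod(shape) * 4   # assuming 4-byte (FP32) elements
    assert computed == reported, (shape, computed, reported)
    print(f"{shape}: {computed} bytes")
```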
[A second complete run of the network follows, starting from the patch-embedding subgraph.]
Creating blob for Input layer 4 with type FLOAT format PlainDataFormat(FORMATF_CAFFE)[0x000000000000000a] shape [1, 3, 110, 110]
Running AIO Network
External allocation: allocating 187392 bytes
Creating external output: 0 , shape: [3, 1, 8, 122, 16]
External allocation: allocating 62464 bytes
Creating external output: 1 , shape: [1, 122, 128]
Running Data (repeated 12 times)
Layer Transpose 33 was const-folded
Running Data (repeated 3 times)
Running Input
Running TransposeBRC3x4
Running ConvViaJitMatmul
Running TransposeBRC4x4
Running ForwardingKernelFlatten
Running Data
Layer Transpose 31 was const-folded
Running ConcatLastDim
Running BinaryOpVectorized[Add]@NEON
Running TransposeBRC4x4
Running LayerNormVectorized@NEON
Running FCViaConvOne
Scratches: 0 @ 0
Running ForwardingKernelReshape
Running TransposeIndexed
Running ForwardingKernelOutput
Running ForwardingKernelOutput
[The encoder blocks then execute exactly as in the first run, for input layer pairs 37/45, 72/80, 107/115, 142/150, 177/185, 212/220, 247/255, 282/290, 317/325, 352/360, and 387/395, followed by the final block (input layers 422/430) with its single [1, 122, 128] external output.]
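The recurring [1, 122, 128] and [3, 1, 8, 122, 16] shapes read like a BERT-style split of a 128-dim embedding into 8 heads of 16 dims, with a leading dimension of 3 for Q/K/V; that interpretation is mine, the log only reports the shapes. A small NumPy sketch of such a layout change:

```python
import numpy as np

# [batch, tokens, 3 * hidden]: a hypothetical fused Q/K/V projection output.
qkv = np.zeros((1, 122, 3 * 128), dtype=np.float32)

# Split hidden into 8 heads x 16 dims and move the Q/K/V axis to the front:
# [1, 122, 3, 8, 16] -> [3, 1, 8, 122, 16], matching the logged output shape.
split = qkv.reshape(1, 122, 3, 8, 16).transpose(2, 0, 3, 1, 4)
print(split.shape)  # (3, 1, 8, 122, 16)
```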
Creating blob for Input layer 454 with type FLOAT format PlainDataFormat(FORMATF_ROW_MAJOR)[0x0000000000000004] shape [1, 128]
Running AIO Network
External allocation: allocating 4000 bytes
Creating external output: 0 , shape: [1, 1000]
Running Data (repeated 2 times)
Running Input
Running FCViaConvOne
Scratches: 0 @ 0
Running ForwardingKernelOutput
[A third complete run follows with identical output: the patch-embedding subgraph (input layer 4), then the encoder blocks for input layer pairs 37/45 through 387/395, with the same blobs, allocations, and kernel sequence as above.]
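For reference, the 122-token sequence length seen throughout is consistent with the [1, 3, 110, 110] input being cut into 10x10 patches plus one class token; the patch size is an assumption on my part, only the tensor shapes appear in the log.

```python
image_size = 110                           # from the [1, 3, 110, 110] input blob
patch_size = 10                            # assumed patch size
patches = (image_size // patch_size) ** 2  # 11 * 11 = 121 patches
tokens = patches + 1                       # plus a class token
print(tokens)                              # 122, matching the [1, 122, 128] token tensor
```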
[The third run ends, as before, with the final encoder block (input layers 422/430, single external output [1, 122, 128]) and the classifier head (input layer 454, output [1, 1000]).]
Latency: 19ms, rate: 52 per second
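The closing figures are consistent with each other: 1000 ms / 19 ms is roughly 52.6 inferences per second, which truncates to the reported 52 (the unrounded latency was presumably a little over 19 ms).

```python
latency_ms = 19                  # "Latency: 19ms" from the line above
rate = 1000 / latency_ms         # ~52.6 inferences per second
print(int(rate))                 # 52 -> consistent with "rate: 52 per second"
```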