## •Analytical Cache Modeling and Tile-size Optimization for Tensor Contractions

Rui Li<sup>1</sup>, Aravind Sukumaran-Rajam<sup>2</sup>, Richard Veras<sup>3</sup>, Tze-Meng Low<sup>4</sup>, Fabrice Rastello<sup>5</sup>, Atanas Rountev<sup>6</sup>, P. Sadayappan<sup>1</sup>

Utah<sup>1</sup>, WSU<sup>2</sup>, LSU<sup>3</sup>, CMU<sup>4</sup>, INRIA<sup>5</sup>, OSU<sup>6</sup>

#### Overview

#### Goal

•High performance tensor contraction on CPU

#### Challenge

- •Exponential space of valid code configurations
- •How to select the best performing code configuration?

#### **Solution**

- •Data movement is the critical factor affects performance
- •Develop analytical models to estimate the data movement based on the code configuration
- •Estimate the time of each code configuration and select the best performing version

#### **Contributions**

- •Modeling data movement of a multi-level memory hierarchy system as a non-linear optimization problem
- •Estimating the execution time of tensor contractions on a multi-level memory system

# Data Movement Modeling for Tiled Loop Code

#### **6D Tiled Loop Code**

.6! Permutations brought by

#### 6 tiling loops

- Tile sizes are selected such that the data foot print fits in cache
- Permutation of point loops has no effects to data movement

```
for(ti=0:Ni:Ti)
  for(tj=0:Ni:Tj)
  for(tk=0:Nk:Tk)
    for(tl=0:Nl:Tl)
    for(tm=0:Nm:Tm)
        for(tn=0:Nn:Tn)
        for(j=tj:tj+Tj:1)
            for(k=tk:tk+Tk:1)
            for(l=tl:tl+Tl:1)
            for(m=tm:tm+Tm:1)
            for(n=tn:tn+tn:1)
            C[i][j][k][l]+=
            A[i][m][k][n]*
            B[j][n][l][m];
```

List.1 Tiled version of tensor contraction (single level cache), Ni, Nj,..., Nn are problem ranges, Ti, Tj,..., Tn are tiles sizes.

#### **Automatic Data Movement Calculation**

- •Bottom-up algorithm for each permutation
- •Prune loop permutations with same data movement expression

$$\begin{aligned} & N_{i}N_{j}N_{k}N_{l} + N_{i}N_{m}N_{k}N_{n}(\frac{N_{l}}{T_{l}})(\frac{N_{j}}{T_{j}}) + N_{j}N_{n}N_{l}N_{m}\left(\frac{N_{k}}{T_{k}}\right)(\frac{N_{i}}{T_{l}}) \\ & \text{for (ti=0; ti$$

#### Problem Formalization for Single Level Cache dataMov4 Bandwidth4

•Conditional Optimization problem based on data movement expression and cache capacity constraint

•argmin 
$$N_i N_j N_k N_l + N_i N_m N_k N_n (\frac{N_l}{T_l}) (\frac{N_j}{T_j}) + N_j N_n N_l N_m (\frac{N_k}{T_k}) (\frac{N_i}{T_i})$$
  
• under constraint  $T_i T_j T_k T_l + T_i T_m T_k T_n + T_j T_n T_l T_m \leq C$ 

#### **Edge Cases Handling: Tile Size Equal to Range**

- •If some problem range  $N_x$  can fit in cache, setting  $T_x = N_x$  changes the cost
- •2<sup>d</sup> cases to be traversed for a pruned permutation Must be performed within and across trees
- •Fig.1 shows all cases for a fixed permutation of Matmul
- •Each leaf contains a data movement expression
- •Merge leaves of all trees with same cost expression



#### Time Cost Model on Multi-level Cache



- •Optimization Problem:  $\arg\min_{T}(\max_{i=1,2,3,4}\frac{dataMov_i}{bandwidth_i})$
- • $P^L$  different max-min problems to solve for L-level memory hierarchy system, (P =
- #Pruned permutations)
- •Solver: Couenne (from COIN-OR)
- •Utilize BLIS micro-kernel to guarantee highthroughput on FMA unit
- •Packing is required to reduce conflict misses

### **Experimental Evaluation**

•Geometric Mean speedup is 1.34x versus TBLIS, 1.27x versus TCL-MKL, and 1.34x versus TCL-BLIS on i7-6700K •Performance model can also be used for predicting execution time (less than 10% for most cases)



Fig.2 Evaluation on TTCG benchmark on i7-6700K, single thread, comparing to TBLIS, TCL-BLIS and TCL-MKL