# Matrix Multiplication

In [1]:
import torch
import ttnn

torch.manual_seed(0)

device_id = 0
device = ttnn.open(device_id)

                  Metal | INFO     | Initializing device 0
                 Device | INFO     | Opening user mode device driver
2023-12-05 02:52:39.770 | INFO     | SiliconDriver   - Detected 1 PCI device
2023-12-05 02:52:39.781 | INFO     | SiliconDriver   - Using 1 Hugepages/NumHostMemChannels for TTDevice (pci_interface_id: 0 device_id: 0xfaca revision: 0)
---- ttSiliconDevice::init_hugepage: bind_area_to_memory_nodeset() failed (physical_device_id: 0 ch: 0). Hugepage allocation is not on NumaNode matching TT Device. Side-Effect is decreased Device->Host perf (Issue #893).
2023-12-05 02:52:39.879 | INFO     | SiliconDriver   - Disable PCIE DMA
                  Metal | INFO     | AI CLK for d

## Enable program cache

Enabling the program cache speeds up operations that run repeatedly.

In [2]:
ttnn.enable_program_cache()

                     Op | INFO     | Program Cache: enabled.


## Configuration

In [3]:
m = 1024
k = 1024
n = 1024

## Initialize tensors a and b with random values using torch

In [4]:
torch_a = torch.randn((m, k), dtype=torch.bfloat16)
torch_b = torch.randn((k, n), dtype=torch.bfloat16)

In [5]:
a = ttnn.from_torch(torch_a)
b = ttnn.from_torch(torch_b)

a = ttnn.to_device(a, device)
b = ttnn.to_device(b, device)

## Matrix multiply tensor a and b
The operation takes longer the first time because the kernels need to be compiled.

In [6]:
output = a @ b

                     Op | INFO     | Program of Operation tt::tt_metal::Tilize                               finished in     0.465105517 seconds
                     Op | INFO     | Program of Operation tt::tt_metal::Tilize                               finished in      0.00173577 seconds
                     Op | INFO     | Program of Operation tt::tt_metal::Matmul                               finished in     0.563651304 seconds


Re-running the operation is significantly faster because it uses the program cache.

In [7]:
output = a @ b

                     Op | INFO     | Program of Operation tt::tt_metal::Tilize                               finished in      0.00178385 seconds
                     Op | INFO     | Program of Operation tt::tt_metal::Tilize                               finished in     0.001768481 seconds
                     Op | INFO     | Program of Operation tt::tt_metal::Matmul                               finished in     0.001274913 seconds
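
To put a number on the speed-up, the two Matmul timings logged above can be compared directly. The snippet below is a small sketch in plain Python; the helper name `speedup` is made up for this example and is not part of ttnn, and the timings are simply copied from the logs of this particular run.

```python
# Matmul timings copied from the log output of the two runs above (seconds).
first_run_matmul = 0.563651304   # first run, includes kernel compilation
second_run_matmul = 0.001274913  # second run, uses the program cache


def speedup(cold_seconds: float, warm_seconds: float) -> float:
    """Return how many times faster the warm run is than the cold run."""
    return cold_seconds / warm_seconds


ratio = speedup(first_run_matmul, second_run_matmul)
print(f"Matmul speed-up on the second run: {ratio:.0f}x")
```

On this run the ratio comes out at roughly 440x. The exact figure will vary from machine to machine and run to run.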


## Inspect the layout of matrix multiplication output

In [8]:
print(output.layout)

Layout.TILE


As can be seen, matrix multiplication produces its output in tile layout. That is because, on Tenstorrent accelerators, computing matrix multiplications in this layout is much more efficient than in row-major layout.

This is also why the logs show two Tilize operations: inputs in row-major layout are automatically converted to tile layout.

Learn more about tile layout here: TODO
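
As a rough mental model of what tilizing does, the sketch below regroups a row-major matrix into square tiles using plain Python lists. It assumes 32x32 tiles and a simple left-to-right, top-to-bottom tile order; the function name `row_major_to_tiles` is hypothetical, and the real on-device layout may organise the data inside each tile differently.

```python
TILE = 32  # tile edge length assumed for this sketch


def row_major_to_tiles(matrix, tile=TILE):
    """Regroup a 2D list into a list of tile x tile blocks.

    Tiles are visited left-to-right, top-to-bottom. Each tile is returned
    as a flat list of its elements in row-major order.
    """
    rows, cols = len(matrix), len(matrix[0])
    if rows % tile or cols % tile:
        raise ValueError("this sketch requires dimensions divisible by the tile size")
    tiles = []
    for tile_row in range(0, rows, tile):
        for tile_col in range(0, cols, tile):
            tiles.append([
                matrix[r][c]
                for r in range(tile_row, tile_row + tile)
                for c in range(tile_col, tile_col + tile)
            ])
    return tiles


# A 64x64 matrix whose entries record their own (row, col) position.
matrix = [[(r, c) for c in range(64)] for r in range(64)]
tiles = row_major_to_tiles(matrix)

print(len(tiles))   # number of tiles the matrix was split into
print(tiles[1][0])  # original position of the first element of the second tile
```

The point of the regrouping is that all elements of one tile end up next to each other in memory, so a compute unit working on a tile can read it as one contiguous block.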

## Inspect the result of the matrix multiplication

To inspect the results, we first convert the output to row-major layout.

In [9]:
output = ttnn.to_layout(output, ttnn.ROW_MAJOR_LAYOUT)

print("Printing ttnn tensor")
print(f"shape: {output.shape}")
print(f"chunk of a tensor:\n{output[:1, :32]}")

                     Op | INFO     | Program of Operation tt::tt_metal::Untilize                             finished in     0.426100972 seconds
Printing ttnn tensor
shape: [1024, 1024]
chunk of a tensor:
Tensor([ [34.25, 9.625, 11.3125, 0.964844, 1.45312, -26.875, 23.125, -1.39062, -20.375, 33, 5.8125, 10.6875, -18.625, 14.5, -42.75, -18.375, 27.75, 44.25, -27.25, -20.5, 43.5, -5.75, -46.75, -45.75, 43.75, 33, -16.125, 39.25, 11.6875, 9.4375, -39.75, -6.5625]], dtype=bfloat16 )



## Tilize tensors before running matrix multiplication
Inputs can be tilized before running matrix multiplication.

There are times when this is highly recommended. For example, during inference, pre-tilizing the weights avoids needlessly tilizing them every time a matmul runs.

In [10]:
a = ttnn.to_layout(a, ttnn.TILE_LAYOUT)
b = ttnn.to_layout(b, ttnn.TILE_LAYOUT)

                     Op | INFO     | Program of Operation tt::tt_metal::Tilize                               finished in     0.002108158 seconds
                     Op | INFO     | Program of Operation tt::tt_metal::Tilize                               finished in      0.00173901 seconds


Running matrix multiplication with tilized inputs shows that only the matrix multiplication operation is invoked, without any additional pre-processing operations.

In [11]:
output = a @ b

                     Op | INFO     | Program of Operation tt::tt_metal::Matmul                               finished in     0.001403602 seconds
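
To see what pre-tilizing saves per call, the logged timings can be added up. This is only back-of-the-envelope arithmetic on the numbers printed in this notebook (the cached `a @ b` run with automatic tilizing versus the run above with pre-tilized inputs); the variable names are illustrative.

```python
# Cached run with row-major inputs: two Tilize programs plus the Matmul (seconds).
auto_tilize_run = [0.00178385, 0.001768481, 0.001274913]

# Run with pre-tilized inputs: only the Matmul program is launched (seconds).
pre_tilized_run = [0.001403602]

auto_total = sum(auto_tilize_run)
pre_total = sum(pre_tilized_run)

print(f"with automatic tilizing: {auto_total * 1e3:.2f} ms per call")
print(f"with pre-tilized inputs: {pre_total * 1e3:.2f} ms per call")
print(f"ratio: {auto_total / pre_total:.1f}x")
```

In an inference setting only the weights can be tilized ahead of time, so the saving per call would be closer to one Tilize program than two.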


## Matrix multiply tensor a and b using a more performant config
By default, matrix multiplication might not be as efficient as it could be. To speed it up further, the user can specify how many cores the matrix multiplication should use. This can speed up the operation significantly.
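
One way to get a feel for what a core grid means for the workload is to count how many output tiles each core would own under an even split. The sketch below assumes 32x32 tiles and an even division of the output over the grid; the helper `tiles_per_core` is hypothetical, and the way ttnn actually distributes work across cores may differ.

```python
TILE = 32  # tile edge length assumed for this sketch


def tiles_per_core(m, n, core_grid, tile=TILE):
    """Return (tile rows, tile cols) of the m x n output owned by each core,
    assuming the output tiles are split evenly over the core grid."""
    grid_rows, grid_cols = core_grid
    tile_rows, tile_cols = m // tile, n // tile
    if tile_rows % grid_rows or tile_cols % grid_cols:
        raise ValueError("output tiles do not divide evenly over the core grid")
    return tile_rows // grid_rows, tile_cols // grid_cols


# The 1024 x 1024 output of this notebook on the 8 x 8 grid used in the cells below.
per_core = tiles_per_core(1024, 1024, core_grid=(8, 8))
print(per_core)
```

Under these assumptions the 32 x 32 grid of output tiles is shared by 64 cores, so each core is responsible for a small block of tiles rather than the whole output.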

In [12]:
a = ttnn.from_torch(torch_a)
b = ttnn.from_torch(torch_b)

a = ttnn.to_device(a, device, memory_config=ttnn.L1_MEMORY_CONFIG)
b = ttnn.to_device(b, device, memory_config=ttnn.L1_MEMORY_CONFIG)

a = ttnn.to_layout(a, ttnn.TILE_LAYOUT)
b = ttnn.to_layout(b, ttnn.TILE_LAYOUT)

                     Op | INFO     | Program of Operation tt::tt_metal::Tilize                               finished in     0.409176746 seconds
                     Op | INFO     | Program of Operation tt::tt_metal::Tilize                               finished in     0.000644077 seconds


Run once to compile the kernels.

In [13]:
output = ttnn.matmul(a, b, memory_config=ttnn.L1_MEMORY_CONFIG, core_grid=(8, 8))

                     Op | INFO     | Program of Operation tt::operations::primary::Matmul                    finished in     0.627620922 seconds


Enjoy a massive speed-up on subsequent runs.

In [14]:
output = ttnn.matmul(a, b, memory_config=ttnn.L1_MEMORY_CONFIG, core_grid=(8, 8))

                     Op | INFO     | Program of Operation tt::operations::primary::Matmul                    finished in     0.000367018 seconds
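
The logged time can be turned into an approximate throughput figure. The sketch below uses the conventional count of 2·m·k·n floating-point operations for an m x k by k x n matrix multiplication and the timings printed in this notebook. It treats the logged program time as the matmul time, which is an assumption: the logged value may include host-side overhead.

```python
m = k = n = 1024

# Conventional operation count: each of the m*n outputs is a dot product of
# length k, counted as k multiplies plus k adds.
flops = 2 * m * k * n

tuned_matmul_seconds = 0.000367018    # second run with core_grid=(8, 8), from the log above
default_matmul_seconds = 0.001403602  # earlier `a @ b` run with pre-tilized inputs

tflops = flops / tuned_matmul_seconds / 1e12
speedup_vs_default = default_matmul_seconds / tuned_matmul_seconds

print(f"approximate throughput: {tflops:.2f} TFLOP/s")
print(f"speed-up over the default config: {speedup_vs_default:.1f}x")
```

For this run the tuned configuration is a little under 4x faster than the default one. The comparison is indicative only, since the two runs also differ in where the tensors are stored (L1 versus the default memory config).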


## Close the device

In [15]:
ttnn.close(device)

                  Metal | INFO     | Closing device 0
