# Tensor and Add Operation

ttnn.Tensor is the central type of ttnn.

It is similar to torch.Tensor in that it represents a multi-dimensional matrix containing elements of a single data type.

There are a few key differences:

- ttnn.Tensor can be stored in the SRAM or DRAM of Tenstorrent devices
- ttnn.Tensor has no concept of strides; instead, it has a concept of row-major and tile layout
- ttnn.Tensor supports data types that torch does not support, such as `bfp8`
- ttnn.Tensor's shape stores the padding added to the tensor due to TILE_LAYOUT

## Creating a tensor

The recommended way to create a tensor is to use a torch creation function and then call `ttnn.from_torch`. So, let's import both `torch` and `ttnn`.

In [1]:
import torch
import ttnn



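If you're using a Wormhole card (N150/N300), you will need to make the full Tensix grid available to continue with this tutorial. The cell below selects the architecture YAML through an environment variable; the Wormhole line is commented out and the Grayskull line is active.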

In [2]:
import os
# Wormhole (N150/N300): uncomment the next line
# os.environ["WH_ARCH_YAML"] = "wormhole_b0_80_arch_eth_dispatch.yaml"
# Grayskull
os.environ["GS_ARCH_YAML"] = "grayskull_120_arch.yaml"

And now let's create a torch Tensor and convert it to a ttnn Tensor.

In [3]:
torch_tensor = torch.rand(3, 4)
ttnn_tensor = ttnn.from_torch(torch_tensor)

print(f"shape: {ttnn_tensor.shape}")
print(f"layout: {ttnn_tensor.layout}")
print(f"dtype: {ttnn_tensor.dtype}")

shape: ttnn.Shape([3, 4])
layout: Layout.ROW_MAJOR
dtype: DataType.FLOAT32


As expected, we get a tensor of shape [3, 4] in row-major layout with a data type of float32.

## Host Storage: Borrowed vs Owned

In this particular case, the ttnn Tensor borrows the data of the torch Tensor because the ttnn Tensor is in row-major layout, the torch tensor is contiguous, and their data types match.

Let's print the current ttnn tensor, set every element of the torch tensor to 1234, and print the ttnn Tensor again to see borrowed storage in action.

In [4]:
print(f"Original values:\n{ttnn_tensor}")
torch_tensor[:] = 1234
print(f"New values are all going to be 1234:\n{ttnn_tensor}")

Original values:
ttnn.Tensor([[ 0.14538,  0.60650,  ...,  0.07408,  0.95921],
             [ 0.61710,  0.61210,  ...,  0.70045,  0.36176],
             [ 0.42633,  0.62021,  ...,  0.52240,  0.18872]], shape=Shape([3, 4]), dtype=DataType::FLOAT32, layout=Layout::ROW_MAJOR)
New values are all going to be 1234:
ttnn.Tensor([[1234.00000, 1234.00000,  ..., 1234.00000, 1234.00000],
             [1234.00000, 1234.00000,  ..., 1234.00000, 1234.00000],
             [1234.00000, 1234.00000,  ..., 1234.00000, 1234.00000]], shape=Shape([3, 4]), dtype=DataType::FLOAT32, layout=Layout::ROW_MAJOR)

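The difference between borrowed and owned storage can be illustrated without ttnn at all. The sketch below is only an analogy built on the Python standard library: a `memoryview` stands in for borrowed storage because it shares the source buffer, and a copied `array` stands in for owned storage. It does not show how ttnn implements either storage type.

```python
from array import array

# The source buffer plays the role of the torch tensor's data.
source = array("f", [0.5, 0.25, 0.125])

# "Borrowed": a view that shares the source buffer (no copy).
borrowed = memoryview(source)

# "Owned": an independent copy of the data.
owned = array("f", source)

# Overwrite every element of the source, as the cell above does with 1234.
for i in range(len(source)):
    source[i] = 1234.0

print("borrowed:", borrowed.tolist())  # follows the source
print("owned:   ", owned.tolist())     # keeps the original values
```

The borrowed view reflects the write to the source, while the owned copy does not, which mirrors the behaviour printed by the ttnn cell above.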

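We try our best to use borrowed storage, but if the torch data type is not supported in ttnn, we have no choice but to automatically pick a different data type and copy the data. The cell below builds a 32x32 `int32` tensor (kept as `DataType::INT32` in the output), then indexes it and converts it to tile layout.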

In [6]:
W = 32
H = 32
# Fill a 32x32 int32 tensor with the values 0..1023 in row-major order
torch_tensor = torch.arange(W * H, dtype=torch.int32).reshape(W, H)

print(torch_tensor)
ttnn_tensor = ttnn.from_torch(torch_tensor)
print()
# torch indexing returns the single element at row 1, column 3
print(torch_tensor[1, 3])

print()
# In the output below, ttnn normalizes (1, 3) to the slices [:1, :3]
print(ttnn_tensor[1, 3])

ttnn_tensor = ttnn.to_layout(ttnn_tensor, ttnn.TILE_LAYOUT)
print()
# Indexing also logs the normalized slice
test_tensor = ttnn_tensor[1]
print(ttnn_tensor)


tensor([[   0,    1,    2,  ...,   29,   30,   31],
        [  32,   33,   34,  ...,   61,   62,   63],
        [  64,   65,   66,  ...,   93,   94,   95],
        ...,
        [ 928,  929,  930,  ...,  957,  958,  959],
        [ 960,  961,  962,  ...,  989,  990,  991],
        [ 992,  993,  994,  ..., 1021, 1022, 1023]], dtype=torch.int32)

tensor(35, dtype=torch.int32)

input_rank =  2
slice =  (1, 3)
normalized slice =  (slice(None, 1, None), slice(None, 3, None))
ttnn.Tensor([[    0,     1,     2]], shape=Shape([1, 3]), dtype=DataType::INT32, layout=Layout::ROW_MAJOR)

input_rank =  2
slice =  (slice(None, 1, None),)
normalized slice =  (slice(None, 1, None),)
ttnn.Tensor([[    0,     1,  ...,    46,    47],
             [   64,    65,  ...,   110,   111],
             ...,
             [  912,   913,  ...,   958,   959],
             [  976,   977,  ...,  1022,  1023]], shape=Shape([32, 32]), dtype=DataType::INT32, layout=Layout::TILE)

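The tile-layout printout above no longer lists the values row by row: the first printed row ends in `46, 47` and the last one starts with `976, 977`. One reading that reproduces these numbers is that a 32x32 tile is stored as four 16x16 faces (top-left, top-right, bottom-left, bottom-right), each in row-major order. The sketch below is a pure-Python illustration under that assumption; the face size of 16 and the helper name `tilize_single_tile` are inferred or made up for illustration and are not taken from the ttnn API.

```python
TILE = 32  # tile height and width, as stated in the Layout section
FACE = 16  # assumed face size, inferred from the printed values above


def tilize_single_tile(row_major):
    """Reorder one 32x32 tile from row-major order into face order."""
    assert len(row_major) == TILE * TILE
    out = []
    for face_row in range(TILE // FACE):
        for face_col in range(TILE // FACE):
            for r in range(FACE):
                # Start of this face's row inside the row-major buffer.
                start = (face_row * FACE + r) * TILE + face_col * FACE
                out.extend(row_major[start:start + FACE])
    return out


values = list(range(TILE * TILE))
tiled = tilize_single_tile(values)

# First and last 32 stored values, matching the first and last printed rows.
print(tiled[:2], tiled[30:32])
print(tiled[-32:-30], tiled[-2:])
```

Under this assumption, the first 32 stored values are rows 0 and 1 of the top-left face, which is why the printed row jumps from 15 to 32.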

## Data Type

The data type of the ttnn tensor can be controlled explicitly when converting from torch.

In [8]:
torch_tensor = torch.rand(3, 4).to(torch.float32)
ttnn_tensor = ttnn.from_torch(torch_tensor, dtype=ttnn.bfloat16)
print(f"torch_tensor.dtype: {torch_tensor.dtype}")
print(f"ttnn_tensor.dtype: {ttnn_tensor.dtype}")
# List the attributes and methods available on a ttnn.Tensor
for att in ttnn_tensor.__dir__():
    print(att)

torch_tensor.dtype: torch.float32
ttnn_tensor.dtype: DataType.BFLOAT16
__init__
__doc__
__module__
shape
dtype
layout
deallocate
to
track_ref_count
sync
extract_shard
cpu
cpu_sharded
pad
unpad
pad_to_tile
unpad_from_tile
__repr__
get_legacy_shape
volume
storage_type
device
devices
to_torch
to_numpy
buffer
buffer_address
get_layout
memory_config
is_allocated
is_contiguous
is_sharded
get_dtype
shape_without_padding
reshape
tensor_id
__matmul__
__add__
__radd__
__sub__
__mul__
__rmul__
__eq__
__ne__
__gt__
__ge__
__lt__
__le__
__getitem__
__new__
__hash__
__str__
__getattribute__
__setattr__
__delattr__
__reduce_ex__
__reduce__
__subclasshook__
__init_subclass__
__format__
__sizeof__
__dir__
__class__

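Converting from `float32` to `bfloat16` reduces precision: a bfloat16 value keeps the sign bit, the 8 exponent bits and only the top 7 mantissa bits of a float32. The sketch below emulates this with plain truncation using the standard `struct` module. The helper names are illustrative, and truncation is an assumption made for simplicity; this tutorial does not state which rounding mode ttnn applies.

```python
import struct


def float32_to_bfloat16_bits(x):
    """Return the upper 16 bits of the float32 encoding of x (truncation)."""
    bits32 = struct.unpack("<I", struct.pack("<f", x))[0]
    return bits32 >> 16


def bfloat16_bits_to_float(bits16):
    """Expand 16 bfloat16 bits back to a Python float."""
    return struct.unpack("<f", struct.pack("<I", bits16 << 16))[0]


original = 0.14538
restored = bfloat16_bits_to_float(float32_to_bfloat16_bits(original))

print(f"original: {original}")
print(f"restored: {restored}")
print(f"relative error: {abs(restored - original) / original:.4%}")
```

The restored value agrees with the original to roughly two to three significant decimal digits, which is the precision to expect after requesting `dtype=ttnn.bfloat16`.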

## Layout

Tenstorrent hardware is most efficiently utilized when running tensors using [tile layout](https://tenstorrent.github.io/ttnn/latest/ttnn/tensor.html#layout).
The current tile size is hard-coded to [32, 32]. It was determined to be the optimal size for a tile given the compute, memory and data transfer constraints.


ttnn provides an easy and intuitive way to convert from row-major layout to tile layout and back.

In [2]:
torch_tensor = torch.rand(3, 4).to(torch.float16)
ttnn_tensor = ttnn.from_torch(torch_tensor)
print(f"Tensor in row-major layout:\nShape {ttnn_tensor.shape}\nLayout: {ttnn_tensor.layout}\n{ttnn_tensor}")
ttnn_tensor = ttnn.to_layout(ttnn_tensor, ttnn.TILE_LAYOUT)
print(f"Tensor in tile layout:\nShape {ttnn_tensor.shape}\nLayout: {ttnn_tensor.layout}\n{ttnn_tensor}")
ttnn_tensor = ttnn.to_layout(ttnn_tensor, ttnn.ROW_MAJOR_LAYOUT)
print(f"Tensor back in row-major layout:\nShape {ttnn_tensor.shape}\nLayout: {ttnn_tensor.layout}\n{ttnn_tensor}")


Note that padding is automatically inserted to put the tensor into tile layout, and it is automatically removed after the tensor is converted back to row-major layout.

The conversion to tile layout can also be done when calling `ttnn.from_torch`.

In [37]:
torch_tensor = torch.rand(3, 4).to(torch.float16)
ttnn_tensor = ttnn.from_torch(torch_tensor, layout=ttnn.TILE_LAYOUT)
print(f"Tensor in tile layout:\nShape {ttnn_tensor.shape}; Layout: {ttnn_tensor.layout}")

Tensor in tile layout:
Shape ttnn.Shape([3[32], 4[32]]); Layout: Layout.TILE

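In the shape printed above, `3[32]` and `4[32]` read as the logical size followed by the padded size in brackets. The padded size is the logical size rounded up to the next multiple of the tile size. The sketch below reproduces that arithmetic in plain Python; the helper names are illustrative, and it assumes that only the last two dimensions are padded.

```python
TILE_HEIGHT = 32
TILE_WIDTH = 32


def round_up(value, multiple):
    """Round value up to the nearest multiple of `multiple`."""
    return -(-value // multiple) * multiple


def padded_shape(shape):
    """Pad the last two dimensions of a shape up to whole tiles."""
    *batch, height, width = shape
    return [*batch, round_up(height, TILE_HEIGHT), round_up(width, TILE_WIDTH)]


print(padded_shape([3, 4]))       # logical [3, 4] -> padded [32, 32]
print(padded_shape([2, 33, 64]))  # 33 rows need two tiles -> [2, 64, 64]
```

A [3, 4] tensor therefore occupies one full 32x32 tile on the device, even though only 12 elements carry data.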

Note that `ttnn.to_torch` will always convert to row-major layout

## Device storage

Finally, to actually use the tensor, we need to put it on the device so that we can run `ttnn` operations on it.

## Open the device

Use `ttnn.open_device` to get a handle to the device.

In [33]:
device_id = 0
device = ttnn.open_device(device_id=device_id)

Device | INFO | Opening user mode device driver
2024-08-21 04:39:07.161 | INFO | SiliconDriver - Detected 8 PCI devices : [0, 1, 2, 3, 4, 5, 6, 7]

## Initialize tensors a and b with random values using torch

To create a tensor that can be used by a `ttnn` operation:
1. Create a tensor using torch
2. Use `ttnn.from_torch` to convert the tensor from `torch.Tensor` to `ttnn.Tensor`, change the layout to `ttnn.TILE_LAYOUT` and put the tensor on the `device`

In [None]:
torch.manual_seed(0)

torch_input_tensor_a = torch.rand((32, 32), dtype=torch.bfloat16)
torch_input_tensor_b = torch.rand((32, 32), dtype=torch.bfloat16)

input_tensor_a = ttnn.from_torch(torch_input_tensor_a, layout=ttnn.TILE_LAYOUT, device=device)
input_tensor_b = ttnn.from_torch(torch_input_tensor_b, layout=ttnn.TILE_LAYOUT, device=device)



## Add tensor a and b

`ttnn` supports operator overloading, so the `+` operator can be used instead of calling `ttnn.add`.

In [None]:
output_tensor = input_tensor_a + input_tensor_b


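Operator overloading in Python works through special methods: `a + b` calls `a.__add__(b)`, and `__add__` appears in the attribute list printed in the Data Type section. The toy class below shows that dispatch for an element-wise add. `ToyTensor` is a hypothetical stand-in written for this illustration and says nothing about how ttnn implements the operation on the device.

```python
class ToyTensor:
    """A hypothetical stand-in for a tensor type; not part of ttnn."""

    def __init__(self, values):
        self.values = list(values)

    def __add__(self, other):
        # Python calls this method when evaluating `self + other`.
        if not isinstance(other, ToyTensor):
            return NotImplemented
        if len(self.values) != len(other.values):
            raise ValueError("shapes must match")
        return ToyTensor(x + y for x, y in zip(self.values, other.values))

    def __repr__(self):
        return f"ToyTensor({self.values})"


a = ToyTensor([1.0, 2.0, 3.0])
b = ToyTensor([0.5, 0.5, 0.5])
result = a + b  # dispatches to ToyTensor.__add__
print(result)
```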

## Inspect the output tensor of the add in ttnn

As can be seen, a tensor of the same shape, layout and dtype is produced.

In [40]:
print(f"shape: {output_tensor.shape}")
print(f"dtype: {output_tensor.dtype}")
print(f"layout: {output_tensor.layout}")

shape: ttnn.Shape([32, 32])
dtype: DataType.BFLOAT16
layout: Layout.TILE


In general, we expect layout and dtype to stay the same when running most operations unless explicit arguments to modify them are passed in. However, there are obvious exceptions, such as an embedding operation that takes in `ttnn.uint32` and produces `ttnn.bfloat16`.

## Convert to torch and inspect the attributes of the torch tensor

When converting the tensor to torch, `ttnn.to_torch` will move the tensor from the device, convert it to row-major layout, and figure out the best data type to use on the torch side.

In [41]:
output_tensor = ttnn.to_torch(output_tensor)
print(f"shape: {output_tensor.shape}")
print(f"dtype: {output_tensor.dtype}")

shape: torch.Size([32, 32])
dtype: torch.bfloat16


## Close the device

Close the handle to the device. This is a very important step, as the device can currently hang if it is not closed properly.

In [42]:
ttnn.close_device(device)

Metal | INFO | Closing device 0
Metal | INFO | DPRINT Server dettached device 0
Metal | INFO | Disabling and clearing program cache on device 0

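Because a device that is not closed properly can hang, it helps to guarantee that the close call runs even when an operation in between raises. One way to do that is a small context manager built on `try`/`finally`. The sketch below is an assumption about how such a helper could look, not something ttnn provides: `opened_device` is a made-up name, and the open and close functions are passed in so that the example runs with stand-ins and without hardware.

```python
from contextlib import contextmanager


@contextmanager
def opened_device(open_fn, close_fn, device_id=0):
    """Open a device, yield it, and always close it afterwards."""
    device = open_fn(device_id=device_id)
    try:
        yield device
    finally:
        # Runs on normal exit and when the body raises.
        close_fn(device)


# Stand-in functions so the sketch runs without a Tenstorrent device.
events = []


def fake_open(device_id):
    events.append(f"open {device_id}")
    return {"id": device_id}


def fake_close(device):
    events.append(f"close {device['id']}")


try:
    with opened_device(fake_open, fake_close, device_id=0) as device:
        raise RuntimeError("simulated failure while using the device")
except RuntimeError:
    pass

print(events)
```

With ttnn installed, the same helper could presumably be used as `with opened_device(ttnn.open_device, ttnn.close_device) as device:`, since both functions are called with these arguments in the cells above.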