# Data Types and Sizes

In [1]:
import torch

## Integers
An unsigned integer data type represents non-negative integers. The range of an n-bit unsigned integer is **[0, $2^{n}-1$]**, so an 8-bit unsigned integer has a minimum of 0 and a maximum of 255. The computer allocates a sequence of 8 bits to store the 8-bit integer. To decode an unsigned integer, each bit that is 0 contributes 0 and each bit that is 1 contributes a power of 2: $2^0$ for the first bit, $2^1$ for the second bit, and so on up to $2^7$ for the 8th bit.<br>
Example with 8-bit (torch.uint8): the allocated sequence of bits is [1,0,0,0,1,0,0,1] → indexing runs from right to left → $2^0+0+0+2^3+0+0+0+2^7=137$.
For signed integers (used to represent negative or positive integers), the **2's complement** representation is used. The range is **[$-2^{n-1}$, $2^{n-1}-1$]**, which for 8 bits (torch.int8) is [-128, 127]. Here the bit in the last (leftmost) position carries a negative weight, $-2^{n-1}$ for an n-bit integer. Example: [1,0,0,0,1,0,0,1] → $2^0+0+0+2^3+0+0+0+(-2^7) = -119$.
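
The two decoding rules can be checked with a few lines of plain Python. This is a minimal sketch; the helper names `decode_unsigned` and `decode_twos_complement` are illustrative and not part of PyTorch. Bits are listed with the most significant bit first, as in the example above.

```python
def decode_unsigned(bits):
    """Decode a bit list (most significant bit first) as an unsigned integer."""
    n = len(bits)
    # The rightmost bit has weight 2**0, the leftmost has weight 2**(n-1).
    return sum(bit << (n - 1 - i) for i, bit in enumerate(bits))


def decode_twos_complement(bits):
    """Decode the same bit list as a two's complement signed integer."""
    n = len(bits)
    # The leftmost bit weighs -2**(n-1) instead of +2**(n-1),
    # which is the unsigned value minus 2**n whenever that bit is set.
    return decode_unsigned(bits) - (bits[0] << n)


bits = [1, 0, 0, 0, 1, 0, 0, 1]
unsigned_value = decode_unsigned(bits)
signed_value = decode_twos_complement(bits)
print(unsigned_value)  # 137
print(signed_value)    # -119
```

The same bit pattern therefore means 137 as `torch.uint8` and -119 as `torch.int8`.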

## Integers in PyTorch
8-bit signed integer → torch.int8<br>
8-bit unsigned integer → torch.uint8<br>
16-bit signed integer → torch.int16 → torch.short<br>
32-bit signed integer → torch.int32 → torch.int<br>
64-bit signed integer → torch.int64 → torch.long<br>

Here we will use PyTorch's **torch.iinfo()** function. It is similar to NumPy's **np.iinfo()**: it returns information about an integer data type, including the smallest and largest values that the type can represent.

In [2]:
# Information of 8-bit unsigned integer
torch.iinfo(torch.uint8)

iinfo(min=0, max=255, dtype=uint8)

In [3]:
# Information of 8-bit signed integer
torch.iinfo(torch.int8)

iinfo(min=-128, max=127, dtype=int8)

In [4]:
# Information of 16-bit signed integer
torch.iinfo(torch.int16)

iinfo(min=-32768, max=32767, dtype=int16)

In [5]:
# Information of 32-bit signed integer
torch.iinfo(torch.int32)

iinfo(min=-2.14748e+09, max=2.14748e+09, dtype=int32)

In [6]:
# Information of 64-bit signed integer
torch.iinfo(torch.int64)

iinfo(min=-9.22337e+18, max=9.22337e+18, dtype=int64)
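
The `int32` and `int64` limits above are displayed in rounded scientific notation. The exact limits follow from the signed range formula $[-2^{n-1}, 2^{n-1}-1]$. A small sketch in plain Python, where `signed_range` is an illustrative helper and not a PyTorch function:

```python
def signed_range(n_bits):
    """Return (min, max) of an n-bit two's complement integer."""
    return -(1 << (n_bits - 1)), (1 << (n_bits - 1)) - 1


for n_bits in (8, 16, 32, 64):
    lowest, highest = signed_range(n_bits)
    print(f"int{n_bits}: min={lowest}, max={highest}")
```

The exact integers should also be available from the `min` and `max` attributes of the object returned by `torch.iinfo()`.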

## **Floating Point**
3 components:
- **sign :-** positive or negative; always 1 bit
- **exponent :-** determines the representable range of the number
- **fraction :-** determines the precision of the number

Here precision means how finely a number can be resolved, for example whether a value is stored as 0.4999999 or rounded to 0.5.<br>
**FP32, BF16, FP16, FP8** are floating point formats with a specific number of bits for exponent and the fraction.<br>

**1. Floating Point 32**<br>
- **sign :-** 1 bit
- **exponent (range) :-** 8 bit
- **fraction (precision) :-** 23 bit
- **Total :-** 32 bit<br>
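For positive values we can represent numbers as small as about **$10^{-45}$** and as big as about **$10^{38}$**. For negative values the range is the same with a minus sign in front.<br>
For FP32 there are two formulas to decode the bit sequence. The first covers very small values, called **subnormal values** (E = 0): **$(-1)^S \times F \times 2^{-126}$**. The second covers all other finite values, called **normal values** (E ≠ 0): **$(-1)^S \times (1+F) \times 2^{E-127}$**. Here S is the sign bit, E is the exponent field read as an unsigned integer, and F is the fraction field read as a binary fraction in [0, 1); E = 255 is reserved for infinity and NaN. This data type is very important in ML since most models store weights in FP32.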
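
The two FP32 decoding formulas, $(-1)^S \times F \times 2^{-126}$ for subnormal values and $(-1)^S \times (1+F) \times 2^{E-127}$ for normal values, can be tried out with Python's standard `struct` module. The sketch below uses illustrative helper names (`decode_fp32`, `float_to_bits`) and leaves out infinity and NaN.

```python
import struct


def decode_fp32(bits32):
    """Decode a 32-bit pattern (given as an int) with the subnormal/normal formulas."""
    sign = bits32 >> 31                       # S: 1 bit
    exponent = (bits32 >> 23) & 0xFF          # E: 8 bits, unsigned
    fraction = (bits32 & 0x7FFFFF) / 2 ** 23  # F: 23 bits as a fraction in [0, 1)
    if exponent == 0xFF:
        raise ValueError("E = 255 encodes infinity or NaN, not covered in this sketch")
    if exponent == 0:
        return (-1) ** sign * fraction * 2.0 ** -126                # subnormal
    return (-1) ** sign * (1 + fraction) * 2.0 ** (exponent - 127)  # normal


def float_to_bits(x):
    """Return the FP32 bit pattern of x as an unsigned 32-bit int."""
    return struct.unpack(">I", struct.pack(">f", x))[0]


for x in (0.5, -6.25, 1e-40):
    bits32 = float_to_bits(x)
    print(f"{x!r}: bits={bits32:032b} decoded={decode_fp32(bits32)!r}")

# The smallest positive subnormal has only the last fraction bit set.
print(decode_fp32(0x00000001))  # 2**-149, roughly 1.4e-45
```

The value `1e-40` is below the smallest normal FP32 number, so it goes through the subnormal branch.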

**2. Floating Point 16**<br>
- **sign :-** 1 bit
- **exponent (range) :-** 5 bit
- **fraction (precision) :-** 10 bit
- **Total :-** 16 bit<br>
Here we have only 5 bits for the exponent and 10 for the fraction. So the smallest positive value is on the order of **$10^{-8}$** and the biggest is on the order of **$10^{4}$**.

**3. Brain Floating Point 16**<br>
- **sign :-** 1 bit
- **exponent (range) :-** 8 bit
- **fraction (precision) :-** 7 bit
- **Total :-** 16 bit<br>
Here we have 8 bits for the exponent and 7 for the fraction. So the smallest positive value is on the order of **$10^{-41}$** and the biggest is on the order of **$10^{38}$**. Compared with FP16 this gives a much larger range, but the downside is lower precision, since only 7 fraction bits remain.

FP32 → best precision   → max ~**$10^{38}$**<br>
FP16 → better precision → max ~**$10^{4}$**<br>
BF16 → good precision  → max ~**$10^{38}$**<br>
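
The limits quoted for each format can be derived from the number of exponent and fraction bits alone. The sketch below assumes the usual IEEE-style layout (bias of $2^{e-1}-1$, all-ones exponent reserved for infinity and NaN); `format_limits` is an illustrative helper.

```python
def format_limits(exponent_bits, fraction_bits):
    """Return (smallest positive subnormal, largest finite value) of a format."""
    bias = 2 ** (exponent_bits - 1) - 1
    # Smallest subnormal: only the last fraction bit set, exponent fixed at 1 - bias.
    smallest = 2.0 ** (1 - bias - fraction_bits)
    # Largest finite value: all fraction bits set, exponent field one below all-ones.
    largest = (2 - 2.0 ** -fraction_bits) * 2.0 ** bias
    return smallest, largest


formats = {"FP32": (8, 23), "FP16": (5, 10), "BF16": (8, 7)}
for name, (e_bits, f_bits) in formats.items():
    smallest, largest = format_limits(e_bits, f_bits)
    print(f"{name}: smallest positive ~ {smallest:.3e}, max ~ {largest:.5e}")
```

FP32 and BF16 share the same 8-bit exponent, which is why their maxima are both near $10^{38}$, while FP16 tops out at 65504.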

## **FP in PyTorch**
16-bit floating point → torch.float16 → torch.half<br>
16-bit brain floating point → torch.bfloat16<br>
32-bit floating point → torch.float32 → torch.float<br>
64-bit floating point → torch.float64 → torch.double

In [7]:
# by default, python stores float data in FP64
value = 1/3

In [8]:
# Let's check the stored number to 60 decimal places
format(value, '.60f')

'0.333333333333333314829616256247390992939472198486328125000000'

In [9]:
# 64-bit floating point
tensor_fp64 = torch.tensor(value, dtype = torch.float64)
print(f"FP64 tensor: {format(tensor_fp64.item(), '.60f')}")

FP64 tensor: 0.333333333333333314829616256247390992939472198486328125000000


In [10]:
tensor_fp32 = torch.tensor(value, dtype = torch.float32)
tensor_fp16 = torch.tensor(value, dtype = torch.float16)
tensor_bf16 = torch.tensor(value, dtype = torch.bfloat16)

In [11]:
print(f"fp64 tensor: {format(tensor_fp64.item(), '.60f')}")
print(f"fp32 tensor: {format(tensor_fp32.item(), '.60f')}")
print(f"fp16 tensor: {format(tensor_fp16.item(), '.60f')}")
print(f"bf16 tensor: {format(tensor_bf16.item(), '.60f')}")

fp64 tensor: 0.333333333333333314829616256247390992939472198486328125000000
fp32 tensor: 0.333333343267440795898437500000000000000000000000000000000000
fp16 tensor: 0.333251953125000000000000000000000000000000000000000000000000
bf16 tensor: 0.333984375000000000000000000000000000000000000000000000000000


Observe that the **fewer fraction bits** we have, the **less precise** the approximation will be. As mentioned above, precision is worst for bfloat16: its stored value stops after 9 decimal places, and its error relative to 1/3 is about $6.5 \times 10^{-4}$, compared with about $8 \times 10^{-5}$ for float16.
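
The rounded values can be reproduced without PyTorch. The following sketch uses the `struct` module's `"f"` (FP32) and `"e"` (FP16) format codes, and emulates bfloat16 by keeping the top 16 bits of the FP32 pattern with round-to-nearest-even. The helper names are illustrative, and the bfloat16 emulation is an assumption about the rounding mode that ignores NaN handling.

```python
import struct


def round_to_fp32(x):
    """Round a Python float (FP64) to the nearest FP32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def round_to_fp16(x):
    """Round a Python float to the nearest FP16 value ("e" = IEEE half precision)."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def round_to_bf16(x):
    """Emulate bfloat16: top 16 bits of the FP32 pattern, round to nearest even."""
    bits32 = struct.unpack("<I", struct.pack("<f", x))[0]
    rounding_bias = 0x7FFF + ((bits32 >> 16) & 1)
    bits16 = ((bits32 + rounding_bias) >> 16) & 0xFFFF
    return struct.unpack("<f", struct.pack("<I", bits16 << 16))[0]


value = 1 / 3
for name, fn in [("fp32", round_to_fp32), ("fp16", round_to_fp16), ("bf16", round_to_bf16)]:
    approx = fn(value)
    print(f"{name}: {approx:.12f}  abs error = {abs(approx - value):.2e}")
```

With 10 fraction bits float16 lands closer to 1/3 than bfloat16 does with 7, even though both use 16 bits in total.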

In [12]:
# Information of `16-bit floating point`
torch.finfo(torch.float16)

finfo(resolution=0.001, min=-65504, max=65504, eps=0.000976562, smallest_normal=6.10352e-05, tiny=6.10352e-05, dtype=float16)

In [13]:
# Information of `16-bit brain floating point`
torch.finfo(torch.bfloat16)

finfo(resolution=0.01, min=-3.38953e+38, max=3.38953e+38, eps=0.0078125, smallest_normal=1.17549e-38, tiny=1.17549e-38, dtype=bfloat16)

In [14]:
# Information of `32-bit floating point`
torch.finfo(torch.float32)

finfo(resolution=1e-06, min=-3.40282e+38, max=3.40282e+38, eps=1.19209e-07, smallest_normal=1.17549e-38, tiny=1.17549e-38, dtype=float32)

In [15]:
# Information of `64-bit floating point`
torch.finfo(torch.float64)

finfo(resolution=1e-15, min=-1.79769e+308, max=1.79769e+308, eps=2.22045e-16, smallest_normal=2.22507e-308, tiny=2.22507e-308, dtype=float64)

## **Downcasting**
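Downcasting happens when we convert a value from a higher-precision data type to a lower-precision one. The value is converted to a nearby value that the lower data type can represent: float-to-float casts round to the nearest representable value, while float-to-integer casts drop the fractional part. For example **FP32 = 0.1** downcast to an 8-bit integer will be converted to **0**, hence there is **loss of data**.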

#### **Advantages**
1. Reduced memory footprint.
   - More efficient use of GPU memory.
   - Enables the training of larger models.
   - Enables larger batch sizes.
2. Increased computation speed.
   - Computation in low precision (FP16, BF16) can be faster than in FP32, since less data has to be stored and moved.
   - The speedup depends on the hardware (for example Google TPU, NVIDIA A100).

#### **Disadvantages**
- Less precision :- each value is stored in fewer bits, hence computation is less precise.

#### **Use case of downcasting**
Mixed precision training
- Do computation in smaller precision (FP16/BF16/FP8).
- Store and update the weights in higher precision (FP32).
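
One reason to keep the weights in higher precision is that small updates can vanish entirely when the weight itself is stored in FP16. The sketch below is a toy illustration in plain Python, not a training loop: the update size and step count are made-up values, and a Python float (FP64) stands in for the higher-precision master copy.

```python
import struct


def round_to_fp16(x):
    """Round a Python float to the nearest FP16 value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


update = 1e-4        # hypothetical small weight update
steps = 100          # hypothetical number of update steps

weight_fp16 = 1.0    # weight stored and updated in FP16
weight_master = 1.0  # master copy kept in higher precision

for _ in range(steps):
    # The FP16 weight is rounded back to FP16 after every update.
    weight_fp16 = round_to_fp16(weight_fp16 + round_to_fp16(update))
    weight_master = weight_master + update

print(weight_fp16)    # 1.0, every update was rounded away
print(weight_master)  # close to 1.01
```

Near 1.0 the spacing between FP16 values is $2^{-10} \approx 0.001$, so an update of $10^{-4}$ is less than half a step and rounds back to the old weight every time.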

In [16]:
# Create a random pytorch tensor: float32, size=1000
tensor_fp32 = torch.rand(1000, dtype = torch.float32)

In [17]:
# first 5 elements of the random tensor
tensor_fp32[:5]

tensor([0.9997, 0.9861, 0.8572, 0.2733, 0.2319])

In [18]:
# Downcast the tensor to bfloat16 using the "to" method
tensor_fp32_to_bf16 = tensor_fp32.to(dtype = torch.bfloat16)

In [19]:
tensor_fp32_to_bf16[:5]

tensor([1.0000, 0.9844, 0.8555, 0.2734, 0.2314], dtype=torch.bfloat16)

We can see that after downcasting the values have changed, but they are very close to the original ones. Let's check the impact of downcasting on multiplication. For this we will use PyTorch's **torch.dot()** function, which computes the dot product of two 1-D tensors.

In [20]:
# tensor_fp32 x tensor_fp32
m_float32 = torch.dot(tensor_fp32, tensor_fp32)

In [21]:
m_float32

tensor(313.5908)

In [22]:
# tensor_fp32_to_bf16 x tensor_fp32_to_bf16
m_bfloat16 = torch.dot(tensor_fp32_to_bf16, tensor_fp32_to_bf16)

In [23]:
m_bfloat16

tensor(314., dtype=torch.bfloat16)
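
The bfloat16 result is a whole number because bfloat16 keeps only 8 significant bits: between 256 and 512 the representable values are 2 apart, so 313.5908 rounds to 314. The sketch below emulates this in plain Python under stated assumptions: `round_to_bf16` is an illustrative round-to-nearest-even emulation, the random data is seeded independently of the tensors above, and the sum is accumulated in FP64, which may differ from how `torch.dot` accumulates internally.

```python
import random
import struct


def round_to_bf16(x):
    """Emulate bfloat16: top 16 bits of the FP32 pattern, round to nearest even."""
    bits32 = struct.unpack("<I", struct.pack("<f", x))[0]
    rounding_bias = 0x7FFF + ((bits32 >> 16) & 1)
    bits16 = ((bits32 + rounding_bias) >> 16) & 0xFFFF
    return struct.unpack("<f", struct.pack("<I", bits16 << 16))[0]


# The FP32 dot product from the notebook, rounded to bfloat16.
print(round_to_bf16(313.5908))  # 314.0

random.seed(0)
values = [random.random() for _ in range(1000)]
values_bf16 = [round_to_bf16(v) for v in values]

dot_full = sum(v * v for v in values)
dot_bf16_inputs = sum(v * v for v in values_bf16)  # inputs rounded, sum in FP64
dot_bf16_result = round_to_bf16(dot_bf16_inputs)   # final result rounded too

print(f"full precision : {dot_full:.4f}")
print(f"bf16 inputs    : {dot_bf16_inputs:.4f}")
print(f"bf16 result    : {dot_bf16_result}")
print(f"relative error : {abs(dot_bf16_result - dot_full) / dot_full:.2e}")
```

Rounding the inputs changes the sum only slightly; most of the visible difference comes from rounding the final result to the coarse bfloat16 grid.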