In [1]:
!nvcc --version

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Aug_15_22:02:13_PDT_2023
Cuda compilation tools, release 12.2, V12.2.140
Build cuda_12.2.r12.2/compiler.33191640_0


In [2]:
!nvcc f7a1.cu -o a71.out

      float final_result;
            ^




In [3]:
!./a71.out

Dot product (float16): 512.000000


**Dot product computation in CUDA using float16.** Given arrays $ \mathbf{a} $ and $ \mathbf{b} $ of size $ N $, compute

$$ \sum_{i=0}^{N-1} \text{float16}(a_i) \times \text{float16}(b_i) $$

Here `hmul(a, b)` denotes the element-wise product of the arrays $ \mathbf{a} $ and $ \mathbf{b} $, where each element is in float16 format. In CUDA, the intrinsic `__hmul` multiplies a single pair of `__half` values, so each thread applies it to one index $ i $. The operation is defined as:

$$ \text{hmul}(\mathbf{a}, \mathbf{b}) = \left\{ c_i = \text{float16}(a_i) \times \text{float16}(b_i) \quad \forall i \right\} $$

where $ \mathbf{a} = [a_0, a_1, \ldots, a_{N-1}] $ and $ \mathbf{b} = [b_0, b_1, \ldots, b_{N-1}] $ are arrays of size $ N $.

**Example:**

Let $ \mathbf{a} = [1.0_{\text{float16}}, 2.0_{\text{float16}}, 3.0_{\text{float16}}] $ and $ \mathbf{b} = [0.5_{\text{float16}}, 0.5_{\text{float16}}, 0.5_{\text{float16}}] $.

Then, the result of `hmul(a, b)` would be:

$ \text{hmul}(\mathbf{a}, \mathbf{b}) = [0.5_{\text{float16}}, 1.0_{\text{float16}}, 1.5_{\text{float16}}] $



### Float16 Format

Float16 (half-precision floating-point) is a 16-bit format that stores floating-point numbers. It consists of:
- **1 sign bit**: Indicates the sign of the number (0 for positive, 1 for negative).
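- **5 exponent bits**: Store the exponent with a bias of 15, so a stored value of 18 means $ 2^{18-15} = 2^3 $.
- **10 fraction bits**: Store the fractional part of the significand (mantissa); normal numbers have an implicit leading 1 that is not stored.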
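As a minimal sketch of this layout, the host-side C++ below splits a 16-bit pattern into the three fields and rebuilds the value. The names `HalfFields`, `split_half`, and `normal_half_value` are hypothetical helpers, not part of `cuda_fp16.h`, and the sketch assumes a normal number (stored exponent between 1 and 30).

```cpp
#include <cmath>
#include <cstdint>

// Hypothetical helper type: the three float16 fields of a 16-bit pattern.
struct HalfFields {
    unsigned sign;      // 1 bit
    unsigned exponent;  // 5 bits, stored with a bias of 15
    unsigned fraction;  // 10 bits
};

// Split the raw bits into sign, stored exponent, and fraction.
HalfFields split_half(uint16_t bits) {
    return { (bits >> 15) & 0x1u, (bits >> 10) & 0x1Fu, bits & 0x3FFu };
}

// Value of a normal float16 number.
// Assumption: zeros, subnormals, infinities, and NaNs are not handled.
float normal_half_value(uint16_t bits) {
    HalfFields f = split_half(bits);
    float significand = 1.0f + f.fraction / 1024.0f;  // implicit leading 1
    float magnitude = std::ldexp(significand, int(f.exponent) - 15);
    return f.sign ? -magnitude : magnitude;
}
```

For example, `normal_half_value(0x3C00)` gives `1.0` (sign 0, stored exponent 15, fraction 0), and `normal_half_value(0xC000)` gives `-2.0`.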

### Polynomial Representation and Multiplication

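Let's write `a` and `b` in terms of their sign, exponent, and fraction fields in both float16 and float32 formats. Here $ \text{sign} $ is the sign bit, $ \text{exp} $ is the stored (biased) exponent, and $ \text{frac} $ is the fraction field read as a value in $ [0, 1) $. The formulas apply to normal numbers.

For float16 (5 exponent bits, 10 fraction bits, bias 15):
- Let $ a_{\text{float16}} $ and $ b_{\text{float16}} $ denote the float16 representation of `a` and `b`.
- Suppose $ a_{\text{float16}} = (-1)^{\text{sign}_a} \times 2^{(\text{exp}_a - 15)} \times (1 + \text{frac}_a) $ and similarly for $ b_{\text{float16}} $.

Their product $ a_{\text{float16}} \times b_{\text{float16}} $ can be computed as:
$$ a_{\text{float16}} \times b_{\text{float16}} = (-1)^{\text{sign}_a + \text{sign}_b} \times 2^{(\text{exp}_a - 15) + (\text{exp}_b - 15)} \times (1 + \text{frac}_a) \times (1 + \text{frac}_b) $$

Before normalization, the stored exponent of the product is therefore $ \text{exp}_a + \text{exp}_b - 15 $.

For float32 (8 exponent bits, 23 fraction bits, bias 127):
- Let $ a_{\text{float32}} $ and $ b_{\text{float32}} $ denote the float32 representation of `a` and `b`.
- Suppose $ a_{\text{float32}} = (-1)^{\text{sign}_a} \times 2^{(\text{exp}_a - 127)} \times (1 + \text{frac}_a) $ and similarly for $ b_{\text{float32}} $.

Their product $ a_{\text{float32}} \times b_{\text{float32}} $ can be computed as:
$$ a_{\text{float32}} \times b_{\text{float32}} = (-1)^{\text{sign}_a + \text{sign}_b} \times 2^{(\text{exp}_a - 127) + (\text{exp}_b - 127)} \times (1 + \text{frac}_a) \times (1 + \text{frac}_b) $$

Before normalization, the stored exponent of the product is $ \text{exp}_a + \text{exp}_b - 127 $.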

### Example Calculation

Let's consider an example with specific float16 and float32 values for `a` and `b`:

- Suppose $ a_{\text{float16}} = 1.5 $ and $ b_{\text{float16}} = 0.5 $.
- Compute $ a_{\text{float16}} \times b_{\text{float16}} $ in float16 format.
- Compute $ a_{\text{float32}} \times b_{\text{float32}} $ in float32 format.
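
Worked through by hand, with the bias written out explicitly (15 for float16, 127 for float32), the example looks as follows. The stored exponents are 15 and 14 in float16 and 127 and 126 in float32; the fractions are 0.5 for $ a $ and 0 for $ b $ in both formats.

$$ a_{\text{float16}} = (-1)^{0} \times 2^{(15-15)} \times (1 + 0.5) = 1.5, \qquad b_{\text{float16}} = (-1)^{0} \times 2^{(14-15)} \times (1 + 0) = 0.5 $$

$$ a_{\text{float16}} \times b_{\text{float16}} = 2^{(15-15) + (14-15)} \times 1.5 \times 1.0 = 2^{-1} \times 1.5 = 0.75 $$

$$ a_{\text{float32}} \times b_{\text{float32}} = 2^{(127-127) + (126-127)} \times 1.5 \times 1.0 = 2^{-1} \times 1.5 = 0.75 $$

Both formats represent 0.75 exactly, so this particular product shows no precision difference.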

### Comparison

Compare the results of $ a \times b $ in float16 and float32 formats to observe differences in precision and range.
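
Since $ 1.5 \times 0.5 = 0.75 $ is exact in both formats, a value such as 0.1 shows the precision gap more clearly. The host-side C++ sketch below rounds a float32 value to the nearest point on the float16 grid. The name `round_to_half_precision` is hypothetical, and the sketch assumes the input and the result lie in the float16 normal range (magnitude between $ 2^{-14} $ and 65504) and that the default round-to-nearest mode is active.

```cpp
#include <cmath>

// Hypothetical helper: round x to the nearest value that float16 can hold.
// Assumption: |x| is in the float16 normal range, so 11 significant bits apply.
float round_to_half_precision(float x) {
    int e = std::ilogb(x);                   // exponent of the leading bit
    float scaled = std::ldexp(x, 10 - e);    // magnitude now in [1024, 2048)
    float rounded = std::nearbyint(scaled);  // nearest integer, ties to even
    return std::ldexp(rounded, e - 10);      // undo the scaling
}
```

With this sketch, `round_to_half_precision(0.1f)` returns `0.0999755859375`, whereas float32 stores 0.1 as roughly `0.1000000015`. Similarly, `round_to_half_precision(2049.0f)` returns `2048.0`, because float16 has only 11 significant bits and cannot represent odd integers above 2048.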



To represent the numbers `a = 12` and `b = 7` in float16, we convert each one into its sign, exponent, and fraction (`frac`) fields.

### Conversion to Float16 Format

#### Step 1: Represent `a` and `b` in Binary
- **a = 12**:
  - Binary representation: `1100`
  - Normalized form: $ 1.100_2 \times 2^3 $ (binary point moved 3 places to the left)
  - Stored exponent: $ 3 + 15 = 18 $ (in binary: `10010`)
  - Fraction (`frac_a`): `1000000000` (the bits after the binary point, padded to 10 bits)
  - Float16 representation: `0 10010 1000000000`

- **b = 7**:
  - Binary representation: `0111`
  - Normalized form: $ 1.11_2 \times 2^2 $ (binary point moved 2 places to the left)
  - Stored exponent: $ 2 + 15 = 17 $ (in binary: `10001`)
  - Fraction (`frac_b`): `1100000000` (the bits after the binary point, padded to 10 bits)
  - Float16 representation: `0 10001 1100000000`
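
The conversion steps above can be sketched in host-side C++. The name `encode_normal_half` is a hypothetical helper, not a CUDA API, and the sketch assumes a positive input that float16 represents exactly as a normal number; anything else (negative values, zero, values needing rounding) is out of scope here.

```cpp
#include <cmath>
#include <cstdint>

// Hypothetical helper: encode a positive value into float16 bits.
// Assumption: x is exactly representable as a normal float16 number;
// extra low-order bits would simply be truncated, not rounded.
uint16_t encode_normal_half(float x) {
    int e = std::ilogb(x);                    // x = m * 2^e with 1 <= m < 2
    float m = std::ldexp(x, -e);              // normalized significand
    unsigned fraction = unsigned((m - 1.0f) * 1024.0f);  // 10 fraction bits
    unsigned stored_exponent = unsigned(e + 15);         // add the bias
    return uint16_t((stored_exponent << 10) | fraction); // sign bit stays 0
}
```

Here `encode_normal_half(12.0f)` yields `0x4A00` (`0 10010 1000000000`) and `encode_normal_half(7.0f)` yields `0x4700` (`0 10001 1100000000`), matching the bit patterns derived by hand.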

### Calculation

#### Float16 Format
- For `a`:
  - Float16 representation: `0 10010 1000000000`
  - Sign (`sign_a`): `0`
  - Exponent (`exp_a`): `10010` (binary) = $ 18 $ (decimal)
  - Fraction (`frac_a`): `1000000000` = $ 2^{-1} = 0.5 $
  - $ a_{\text{float16}} = (-1)^{0} \times 2^{(18-15)} \times (1 + 0.5) = 12 $

- For `b`:
  - Float16 representation: `0 10001 1100000000`
  - Sign (`sign_b`): `0`
  - Exponent (`exp_b`): `10001` (binary) = $ 17 $ (decimal)
  - Fraction (`frac_b`): `1100000000` = $ 2^{-1} + 2^{-2} = 0.75 $
  - $ b_{\text{float16}} = (-1)^{0} \times 2^{(17-15)} \times (1 + 0.75) = 7 $

### Conclusion

The calculations above confirm the float16 encodings of `a = 12` and `b = 7`: decoding the sign, exponent, and fraction fields (`frac_a` and `frac_b`) recovers the original values exactly.

To compute the product $ a \times b $ using float16 representation, let's proceed with the calculation based on the float16 formats derived above for $ a = 12 $ and $ b = 7 $.

### Float16 Representation Recap

For $ a = 12 $:
- Float16 representation: `0 10010 1000000000`
- Sign (`sign_a`): `0`
- Exponent (`exp_a`): `10010` (binary) = $ 18 $ (decimal)
- Fraction (`frac_a`): `1000000000`

For $ b = 7 $:
- Float16 representation: `0 10001 1100000000`
- Sign (`sign_b`): `0`
- Exponent (`exp_b`): `10001` (binary) = $ 17 $ (decimal)
- Fraction (`frac_b`): `1100000000`

### Calculation of $ a \times b $

#### Step 1: Compute the Multiplication in Float16 Format
To compute $ a \times b $:

1. **Sign**: Both $ a $ and $ b $ have a sign bit of `0`, so the product is positive (sign bit `0`).

2. **Exponent Adjustment**:
   - Exponent of $ a $: $ 18 $ (stored exponent) - $ 15 $ (float16 bias) = $ 3 $
   - Exponent of $ b $: $ 17 $ (stored exponent) - $ 15 $ (float16 bias) = $ 2 $
   - Combined exponent for multiplication: $ 3 + 2 = 5 $
   - Stored exponent before normalization: $ 5 + 15 = 20 $ (binary: `10100`)

3. **Significand Multiplication**:
   - Multiply the significands, including the implicit leading 1, rather than the raw fraction fields:
     $$
     (1 + \text{frac}_a) \times (1 + \text{frac}_b) = 1.1_2 \times 1.11_2 = 10.101_2 \quad (1.5 \times 1.75 = 2.625)
     $$

4. **Normalize and Round**:
   - The product $ 10.101_2 $ is at least 2, so move the binary point one place to the left and add 1 to the exponent: $ 1.0101_2 \times 2^{6} $.
   - Stored exponent: $ 6 + 15 = 21 $ (binary: `10101`).
   - Fraction: `0101000000`. All significant bits fit in the 10 fraction bits, so no rounding is needed.

### Result

The product is encoded as `0 10101 0101000000`, which decodes to $ (-1)^{0} \times 2^{(21-15)} \times (1 + 0.3125) = 84 $, exactly $ 12 \times 7 $. In general, the significand product can have more bits than the 10-bit fraction field can hold; it is then rounded after normalization (to nearest, ties to even, which is the mode `__hmul` uses).

This process illustrates how multiplication is handled in float16 format, using the encodings of $ a $ and $ b $ derived above.
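
The four steps above (sign, exponent, significand product, normalize and round) can be sketched as a software multiply on raw float16 bit patterns. This is host-side C++ for illustration only; the name `half_mul_bits` is hypothetical, and the sketch assumes that both inputs and the result are normal numbers, so zeros, subnormals, infinities, NaNs, overflow, and underflow are not handled.

```cpp
#include <cstdint>

// Hypothetical helper: multiply two float16 bit patterns in software.
// Assumption: inputs and result are normal numbers (no special cases).
uint16_t half_mul_bits(uint16_t a, uint16_t b) {
    // 1. Sign: XOR of the two sign bits.
    uint32_t sign = ((a ^ b) >> 15) & 0x1u;

    // 2. Exponent: add the stored exponents and remove one copy of the bias.
    int exponent = int((a >> 10) & 0x1F) + int((b >> 10) & 0x1F) - 15;

    // 3. Significands: restore the implicit leading 1, then multiply.
    uint32_t sig_a = (a & 0x3FFu) | 0x400u;
    uint32_t sig_b = (b & 0x3FFu) | 0x400u;
    uint32_t product = sig_a * sig_b;  // value in [1, 4), scaled by 2^20

    // 4. Normalize: a product of 2 or more needs one extra shift.
    int shift = 10;
    if (product & (1u << 21)) {
        shift = 11;
        exponent += 1;
    }
    uint32_t sig = product >> shift;  // 11-bit significand
    uint32_t remainder = product & ((1u << shift) - 1u);
    uint32_t halfway = 1u << (shift - 1);

    // Round to nearest, ties to even.
    if (remainder > halfway || (remainder == halfway && (sig & 1u))) {
        sig += 1;
        if (sig == 0x800u) {  // rounding carried out of the significand
            sig >>= 1;
            exponent += 1;
        }
    }
    return uint16_t((sign << 15) | (uint32_t(exponent) << 10) | (sig & 0x3FFu));
}
```

For the inputs used here, `half_mul_bits(0x4A00, 0x4700)` returns `0x5540`, which is `0 10101 0101000000`, the encoding of 84.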


In [6]:
%%writefile f7a2.cu
#include <stdio.h>
#include <stdlib.h>   // malloc, free
#include <cuda_fp16.h>

#define N 512
#define BLOCK_SIZE 256

__global__ void dot_product_float16(__half *a, __half *b, float *result) {
    __shared__ float temp[BLOCK_SIZE];
    int tid = threadIdx.x;
    int index = blockIdx.x * blockDim.x + threadIdx.x;

    // Perform half-precision multiplication and accumulate in single-precision
    temp[tid] = (index < N) ? __half2float(__hmul(a[index], b[index])) : 0.0f;
    __syncthreads();

    // Reduce within block
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride) {
            temp[tid] += temp[tid + stride];
        }
        __syncthreads();
    }

    // Store result of this block in global memory
    if (tid == 0) {
        atomicAdd(result, temp[0]);
    }
}

int main() {
    __half *a, *b;
    float *result;
    __half *d_a, *d_b;
    float *d_result;
    float final_result;

    // Allocate host memory
    a = (__half*)malloc(N * sizeof(__half));
    b = (__half*)malloc(N * sizeof(__half));
    result = (float*)malloc(sizeof(float));

    // Initialize host arrays
    for (int i = 0; i < N; ++i) {
        a[i] = __float2half(1.0f); // Initialize to 1.0 in float16
        b[i] = __float2half(1.0f); // Initialize to 1.0 in float16
    }
    *result = 0.0f;

    // Allocate device memory
    cudaMalloc((void**)&d_a, N * sizeof(__half));
    cudaMalloc((void**)&d_b, N * sizeof(__half));
    cudaMalloc((void**)&d_result, sizeof(float));

    // Copy data from host to device
    cudaMemcpy(d_a, a, N * sizeof(__half), cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b, N * sizeof(__half), cudaMemcpyHostToDevice);
    cudaMemcpy(d_result, result, sizeof(float), cudaMemcpyHostToDevice);

    // Launch kernel
    dot_product_float16<<<(N + BLOCK_SIZE - 1) / BLOCK_SIZE, BLOCK_SIZE>>>(d_a, d_b, d_result);

    // Copy result back to host
    cudaMemcpy(result, d_result, sizeof(float), cudaMemcpyDeviceToHost);

    // Free device memory
    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_result);

    // Print result
    printf("Dot product (float16): %f\n", *result);

    // Free host memory
    free(a);
    free(b);
    free(result);

    return 0;
}

Overwriting f7a2.cu


In [7]:
!nvcc f7a2.cu -o a72.out

      float final_result;
            ^




In [8]:
!./a72.out

Dot product (float16): 512.000000
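
To see why the program prints 512, the in-block reduction can be replayed sequentially on the host. With `N = 512` and `BLOCK_SIZE = 256` the kernel runs two blocks; each reduces 256 products of $ 1.0 \times 1.0 $ to 256, and `atomicAdd` adds the two partial sums. The function name `tree_reduce` is hypothetical, and the sketch assumes a non-empty buffer whose length is a power of two, as `BLOCK_SIZE` is.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical host-side replay of the kernel's shared-memory reduction.
// Assumption: temp.size() is a non-zero power of two (256 in the kernel).
float tree_reduce(std::vector<float> temp) {
    for (std::size_t stride = temp.size() / 2; stride > 0; stride >>= 1) {
        // On the GPU, the threads with tid < stride run this step in
        // parallel, and __syncthreads() separates it from the next step.
        for (std::size_t tid = 0; tid < stride; ++tid) {
            temp[tid] += temp[tid + stride];
        }
    }
    return temp[0];  // the block's partial sum
}
```

Calling `tree_reduce(std::vector<float>(256, 1.0f))` returns `256.0`, and two such partial sums add up to the reported `512.0`.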
