## Week 1

### Week-1: Summary

**Introduction to HLS Coding:** Definition, need for HLS, comparison with Verilog coding, advantages, limitations, and applications (e.g., IIR/FIR filters, matrix multipliers).

**Working with Vivado HLS:** Steps for using Vivado HLS, illustrated with an "Addition of 2 Numbers" example. This section also addresses challenges encountered with user-defined inputs and file I/O, including handling input/output files and generating Verilog code.

**Working with Datatypes in HLS:** Observations and code examples (C and Verilog) for unsigned int(), signed int(), char(), and float(). This section highlights how HLS handles different data types, including the creation of FSMs for float operations and the use of ap\_fixed<W, I> in C++ to manage fixed-point precision.

This document provides a summary of the work completed during Week 1 of a summer internship, spanning from June 2nd to June 6th, 2025. The primary focus of this period was on High-Level Synthesis (HLS) and its diverse applications.

### The summary encompasses the following key areas:

**Introduction to HLS Coding:** This section defines HLS, articulates the necessity for its implementation, and offers a comparative analysis with Verilog coding. It further explores the advantages, limitations, and practical applications of HLS, including examples such as IIR/FIR filters and matrix multipliers.

**Working with Vivado HLS:** This part outlines the procedural steps for utilizing Vivado HLS, exemplified through an "Addition of 2 Numbers" scenario. It also addresses challenges encountered concerning user-defined inputs and file I/O, detailing methods for handling input/output files and the generation of Verilog code.

**Working with Datatypes in HLS:** This section presents observations and provides corresponding C and Verilog code examples for unsigned int(), signed int(), char(), and float(). It explains how HLS processes various data types, highlighting the creation of Finite State Machines (FSMs) for float operations and the application of ap\_fixed<W, I> in C++ for managing fixed-point precision.

# 02/06/2025

### **HLS CODING**

- Introduction:
  - High-Level Synthesis (HLS) is the process of transforming a high-level implementation (C, C++, SystemC, etc.) into an RTL (Register Transfer Level) implementation.
  - The resulting RTL implementation can then be synthesized onto an FPGA.
  - Top-level function arguments are synthesized into RTL I/O ports.
  - C functions are synthesized into blocks in the RTL hierarchy. If the C code contains a hierarchy of sub-functions, they are synthesized as a hierarchy of modules in the RTL implementation.
  - Loops are kept rolled by default. When a loop is unrolled and optimized, its iterations can run in parallel.
  - Arrays are implemented as RAM in the final FPGA design.
- Need for HLS: It is mainly used to bridge the gap between software and hardware design by allowing designers to write code in high-level languages. It also lets users explore trade-offs between power, performance, and area while designing.
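
As a minimal sketch of how the mapping rules in the introduction apply, consider the plain C function below. The names `vec_scale_sum`, `scale`, and `N` are illustrative and are not part of our project; the comments describe what we would expect an HLS tool to do with each construct, not a verified synthesis result.

```c
#define N 8  /* hypothetical array length, chosen for illustration */

/* Sub-function: would typically become its own block in the RTL hierarchy
   (unless the tool decides to inline it). */
static int scale(int x, int k)
{
    return x * k;
}

/* Top-level function: each argument would become an RTL port.
   The array argument would typically map to a memory (RAM) interface. */
int vec_scale_sum(const int data[N], int k)
{
    int acc = 0;
    /* The loop stays rolled by default: one iteration at a time.
       An unroll directive placed here would replicate the loop body
       so that several iterations can execute in parallel. */
    for (int i = 0; i < N; i++) {
        acc += scale(data[i], k);
    }
    return acc;
}
```

Calling `vec_scale_sum` on the values 1 to 8 with `k = 2` returns 72, which is the kind of check a C test bench would perform before synthesis.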

### 2. HLS v/s Verilog coding:

- i. HLS uses high-level languages, whereas Verilog is a hardware description language that maps the code directly to hardware.
- ii. HLS works at a higher level of abstraction, allowing the user to focus on the logic and its implementation. Verilog works at a lower level of abstraction, describing the hardware structure and its logic in terms of components such as registers and pipelines.
- iii. HLS is used to build hardware accelerators or custom processors where the algorithm is given higher priority. In Verilog, we describe hardware designs ranging from simple logic blocks to complex systems.
- iv. An HLS description usually takes fewer lines of code, which makes it quicker to write and to simulate at the C level, whereas the equivalent Verilog usually takes more lines of code and more time to write and verify.

### Advantages:

- The primary advantage of HLS is the shorter development time.
- It increases the productivity of hardware designers.
- The high level of abstraction provided by C allows for the creation of high-performance hardware.
- Computationally intensive parts of an algorithm can be accelerated by running them on the FPGA.
- Verification of algorithms at the C level is much quicker than with traditional HDLs (Hardware Description Languages).

### 5. Limitations:

- Performance depends on the complexity of the algorithm.
- HLS can sometimes produce a less efficient hardware implementation than hand-written HDL.
- The code often has to be rewritten, and pragmas have to be inserted, to reach higher performance.

### 6. Applications:

- 1. HLS enables pipelining and parallelism, which are important in the design of IIR/FIR filters.
- 2. HLS-coded matrix multipliers can be used for real-time physics simulations.
- 3. Code reuse helps reduce redundancy in the code.
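
To make the FIR filter application more concrete, here is a minimal sketch of a FIR filter written in the C style that HLS tools accept. The function name `fir`, the tap count `TAPS`, and the coefficient values are assumptions made for illustration; they do not come from our project.

```c
#define TAPS 4  /* hypothetical filter length */

/* One call processes one input sample and returns one output sample. */
int fir(int x)
{
    /* Example coefficients (a simple moving sum), assumed for illustration. */
    static const int coeff[TAPS] = {1, 1, 1, 1};
    /* Delay line: static so that it keeps its contents between calls. */
    static int shift_reg[TAPS] = {0};

    int acc = 0;
    /* Shift the delay line and accumulate, oldest tap first.
       This is the loop that pipelining or unrolling would target. */
    for (int i = TAPS - 1; i > 0; i--) {
        shift_reg[i] = shift_reg[i - 1];
        acc += shift_reg[i] * coeff[i];
    }
    shift_reg[0] = x;
    acc += x * coeff[0];
    return acc;
}
```

Feeding the samples 1, 2, 3, 4, 5 one call at a time gives 1, 3, 6, 10, 14, i.e. the running sum of the last four inputs.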

# 03/06/2025

# Working with Vivado HLS:

- 1. Install the Xilinx tools and then open Vivado HLS.
- 2. Create a new project, select the target part, and name the design source and the test bench with the .c extension, here hello.c and hello\_tb.c.
- 3. Write the design code and the test bench code in their respective files and save them.
- 4. Click on **C Simulation** and confirm that the test bench runs as expected without errors.
- 5. Check that the output is displayed correctly.
- 6. Now click on **Run C Synthesis** and, after it is done, go to **Solution Reports**, where you can see the Verilog and VHDL equivalents of the written code.

#### Addition of 2 Numbers:

First, the addition of 2 numbers was performed with hard-coded values and then with user-defined inputs. The version with user-defined inputs is:

```
// hello.c
void hello(int a, int b, int* sum)
{
    *sum = a + b;
}
```

```
// hello_tb.c
#include <stdio.h>
#include <stdlib.h>

void hello(int a, int b, int* sum);

int main(int argc, char* argv[])
{
    if (argc != 3)
    {
        printf("Usage: %s <int a> <int b>\n", argv[0]);
        return 1;
    }

    int a = atoi(argv[1]);
    int b = atoi(argv[2]);
    int sum;

    hello(a, b, &sum);

    printf("Result: %d\n", sum);
    return 0;
}
```

We saw that we could not use the regular scanf(), fscanf(), fopen(), and fclose() calls to take in values while the program was running. In addition, the arguments passed to the test bench arrive as strings and had to be converted to integers (here with atoi()). We therefore gave the inputs before compile time.
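
Since atoi() returns 0 when it cannot convert its argument, a mistyped argument cannot be told apart from a real 0. A slightly more defensive conversion can be sketched with the standard strtol() function. The helper name `parse_int` is hypothetical and was not part of our test bench.

```c
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/* Hypothetical helper: converts a command-line string to int.
   Returns 1 on success and stores the value in *out; returns 0 on failure. */
int parse_int(const char *text, int *out)
{
    char *end = NULL;
    errno = 0;
    long value = strtol(text, &end, 10);

    if (end == text || *end != '\0') {
        return 0;   /* empty string or trailing non-numeric characters */
    }
    if (errno == ERANGE || value < INT_MIN || value > INT_MAX) {
        return 0;   /* value does not fit in an int */
    }
    *out = (int)value;
    return 1;
}
```

In the test bench, `atoi(argv[1])` could then be replaced by a call such as `parse_int(argv[1], &a)`, printing the usage message when it returns 0.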



# 04/06/2025

# Addition using File I/O:

Today, we performed the addition of 2 numbers using file inputs, i.e. reading the input values from a file "input1.txt" and writing the results to a new file "output1.txt".

```
// add3_tb.c
#include <stdio.h>

void add3(int a, int b, int* sum);

int main()
{
    FILE *input1, *output1;
    int a, b, sum;

    input1 = fopen("input1.txt", "r");
    if (input1 == NULL)
    {
        printf("Error: Cannot open input file.\n");
        return 1;
    }

    output1 = fopen("output1.txt", "w");
    if (output1 == NULL)
    {
        printf("Error: Cannot open output file.\n");
        fclose(input1);
        return 1;
    }

    while (fscanf(input1, "%d %d", &a, &b) == 2)
    {
        add3(a, b, &sum);
        fprintf(output1, "%d + %d = %d\n", a, b, sum);
    }

    fclose(input1);
    fclose(output1);
    printf("Test completed. Result written to output1.txt\n");
    return 0;
}

// add3.c
// This is the design (HLS target) function
void add3(int a, int b, int* sum)
{
    *sum = a + b;
}

// input1.txt
15 12
45 78
-9 85
84 12
45 21
-96 23

// output1.txt
15 + 12 = 27
45 + 78 = 123
-9 + 85 = 76
84 + 12 = 96
45 + 21 = 66
-96 + 23 = -73
```

The generated Verilog code for the same code:

```
`timescale 1 ns / 1 ps

// (* CORE_GENERATION_INFO="add3,hls_ip_2017_4,{HLS_INPUT_TYPE=c,HLS_INPUT_FLOAT=0,HLS_INPUT_FIXED...
module add3 (
        ap_start,
        ap_done,
        ap_idle,
        ap_ready,
        a,
        b,
        sum,
        sum_ap_vld
);

input   ap_start;
output   ap_done;
output   ap_idle;
output   ap_ready;
input  [31:0] a;
input  [31:0] b;
output  [31:0] sum;
output   sum_ap_vld;

reg sum_ap_vld;

always @ (*) begin
    if ((ap_start == 1'b1)) begin
        sum_ap_vld = 1'b1;
    end else begin
        sum_ap_vld = 1'b0;
    end
end

assign ap_done = ap_start;
assign ap_idle = 1'b1;
assign ap_ready = ap_start;
assign sum = (b + a);

endmodule //add3
```

The issue we faced today was that the output file was written to a different location than we expected, which led us to think that the code was wrong. After a little tinkering and exploring the interface, we found the output file.

The next step was to take the generated Verilog code into Xilinx Vivado and check its correctness by writing a testbench with hard-coded values for the variables.

```
// C:/Users/matta/Xilinx projects/add3_hls/add3_hls.srcs/sim_1/new/add3_tb.v
module tb_add3;
    // DUT Inputs
    reg ap_start;
    reg [31:0] a, b;
    // DUT Outputs
    wire ap_done, ap_idle, ap_ready;
    wire [31:0] sum;
    wire sum_ap_vld;

    // Instantiate the DUT
    add3 dut (
        .ap_start(ap_start),
        .ap_done(ap_done),
        .ap_idle(ap_idle),
        .ap_ready(ap_ready),
        .a(a),
        .b(b),
        .sum(sum),
        .sum_ap_vld(sum_ap_vld)
    );

    initial begin
        $display("Starting Verilog testbench...");
        // Test Case 1
        a = 10;
        b = 20;
        ap_start = 1;
        // ...
        a = 100;
        b = 50;
        ap_start = 1;
        #10;
        ap_start = 0;
        #10;
        // ...
```



# 05/06/2025 - 06/06/2025

# Working with Datatypes in HLS

### 1. Unsigned int():

C code and testbench:

```
// datatypes1_tb.c
#include <stdio.h>

unsigned int datatypes1(unsigned int a, unsigned int b);

int main()
{
    unsigned int result;
    result = datatypes1(43, 768);
    printf("Result: %u\n", result);
    return 0;
}
```

### Verilog code and testbench:



#### Observations:

- I. Test Case 1: a = 10, b = 20
  - Both values are positive, so the result is displayed correctly as 30.
- II. Test Case 2: a = 32'hFFFFFFFF, b = 1
  - The bit pattern of a corresponds to -1 in signed form, so it is displayed as 4294967295 ($2^{32}-1$) in unsigned form, while b is displayed as 1. The sum wraps around and is displayed as 0 (-1 + 1 = 0).
- III. Test Case 3: a = -1000, b = 5000
  - Since a is negative, it is shown as $2^{32}-1000 = 4294966296$ in unsigned form, while b stays 5000. The sum wraps around, giving the result 4000.
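
The wrap-around in these test cases can be reproduced in plain C with the fixed-width types from `<stdint.h>`. This is a small sketch to check the arithmetic, not the code we synthesized; the helper names `add_u32` and `as_unsigned` are chosen for illustration.

```c
#include <stdint.h>

/* Adds two 32-bit unsigned values; overflow wraps modulo 2^32. */
uint32_t add_u32(uint32_t a, uint32_t b)
{
    return a + b;
}

/* Shows how a signed value appears on a 32-bit unsigned port:
   the conversion is defined as the value modulo 2^32. */
uint32_t as_unsigned(int32_t value)
{
    return (uint32_t)value;
}
```

With these helpers, `as_unsigned(-1000)` is 4294966296 and `add_u32(as_unsigned(-1000), 5000)` is 4000, matching Test Case 3.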

### 2. Signed int():

C code and testbench:

```
// dataypes1.c
int datatypes1(int a, int b)
{
    return a + b;
}

// datatypes1_tb.c
#include <stdio.h>

int datatypes1(int a, int b);

int main()
{
    int result; // Now signed
    result = datatypes1(-567, -56); // Both inputs are signed
    printf("Result: %d\n", result); // Use %d for signed integers
    return 0;
}
```

### Verilog code and testbench:





#### Observations:

- I. Test Case 1: a = 0, b = 0
  - The result is displayed as 0.
- II. Test Case 2: a = 43, b = 768
  - The result is displayed as 811.
- III. Test Case 3: a = 100, b = -50
  - Here -50 is stored in its 2's complement form, which reads as 4294967246 in unsigned decimal form. The addition is therefore performed as 100 + 4294967246, which wraps modulo $2^{32}$ to 50.
- IV. Test Case 4: a = -200, b = -300
  - Both numbers are converted to unsigned decimal form (4294967096 and 4294966996) and then added. Modulo $2^{32}$ the result is 4294966796, which is -500 in signed form (11111111111111111111111000001100 is the binary equivalent).
- V. Test Case 5: a = 32'h7FFFFFFF, b = -1
  - Here a is the highest value that can be stored in a signed int, i.e. $2^{31}-1 = 2147483647$, and b is -1, so the addition of the two results in 2147483646.
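
The signed test cases can be checked the same way. The sketch below adds the raw 32-bit patterns, as the generated adder does, and then reads the result back as a 2's complement value. The helper names `add_bits32` and `as_signed32` are hypothetical.

```c
#include <stdint.h>

/* Adds two signed values the way a 32-bit adder does: on raw bit patterns,
   so the result wraps modulo 2^32. */
uint32_t add_bits32(int32_t a, int32_t b)
{
    return (uint32_t)a + (uint32_t)b;
}

/* Reads a 32-bit pattern as a 2's complement signed value. A 64-bit result
   type is used so that no implementation-defined conversion is needed. */
int64_t as_signed32(uint32_t bits)
{
    if (bits & 0x80000000u) {
        return (int64_t)bits - 4294967296LL;   /* subtract 2^32 */
    }
    return (int64_t)bits;
}
```

For Test Case 4, `add_bits32(-200, -300)` gives 4294966796, and `as_signed32` of that pattern gives -500.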

### 3. char():

C code and test bench:

```
// datatypes1_tb.c
#include <stdio.h>

// Function declaration using char
char datatypes1(char a, char b);

int main()
{
    char result; // Now char (8-bit signed)
    result = datatypes1(-57, -56); // Inputs within char range
    printf("Result: %d\n", result); // %d prints as signed decimal

    return 0;
}
```

### Verilog code and testbench:



#### Observations:

- a. Test Case 1: a = 0, b = 0. The result is printed as 0.
- b. Test Case 2: a = 50, b = 25. The result is printed as 75.
- c. Test Case 3: a = -30, b = 20. The sum is -10; its 8-bit 2's complement pattern is displayed as 246.
- d. Test Case 4: a = -60, b = -70. The sum is -130, which does not fit in 8 bits; the 8-bit 2's complement pattern wraps around and is displayed as 126.
- e. Test Case 5: a = 100, b = 50. The addition is performed normally as 100 + 50 = 150, but since 150 crosses the upper limit of a signed char ($2^{7}-1 = 127$), it wraps around to -106 when read as a signed value.
- f. Test Case 6: a = 127, b = 1. The addition is performed normally as 127 + 1 = 128 and is displayed accordingly.

<u>Point to note:</u> Unless the signedness is specified explicitly by the compiler or the user, the generated Verilog code treats the char values as unsigned by default.
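
The 8-bit cases follow the same pattern with a modulus of $2^{8} = 256$. The sketch below is only a check of the arithmetic, with hypothetical helper names `add_bits8` and `as_signed8`.

```c
#include <stdint.h>

/* 8-bit addition on raw bit patterns; the result wraps modulo 2^8. */
uint8_t add_bits8(int a, int b)
{
    return (uint8_t)(a + b);
}

/* Reads an 8-bit pattern as a 2's complement signed value. */
int as_signed8(uint8_t bits)
{
    return (bits >= 128) ? (int)bits - 256 : (int)bits;
}
```

For example, `add_bits8(-30, 20)` is 246 (which reads as -10 when signed), `add_bits8(-60, -70)` is 126, and `as_signed8(add_bits8(100, 50))` is -106.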

### 4. float():

C++ code and testbench:

```
// datatypes1.cpp
#include "ap_fixed.h"
typedef ap_fixed<22,5> fix_t;

// Function definition using the fixed-point type
fix_t datatypes1(fix_t a, fix_t b)
{
    return a + b;
}

// datatypes1_tb.cpp
#include <iostream>
#include "ap_fixed.h"
typedef ap_fixed<22,5> fix_t; // 5 integer bits (including sign), 17 fractional bits

fix_t datatypes1(fix_t a, fix_t b);

int main()
{
    fix_t a = fix_t(2.1); // Explicit cast to fix_t
    fix_t b = fix_t(3.0);
    fix_t result = datatypes1(a, b);
    std::cout << "Result: " << result << std::endl;
    return 0;
}
```

Verilog Code and testbench with simulation:

```
// C:/Users/matta/Xilinx projects/datatypes1/datatypes1.srcs/sources_1/new/dataypes1.v
`timescale 1 ns / 1 ps

// (* CORE_GENERATION_INFO="datatypes1,hls_ip_2017_4,{HLS_INPUT_TYPE=cxx,HLS_INPUT_FLOAT=0,HLS_INPUT_FIXED=1,HLS_INPUT_PART=xc7a35tcpg236-1,HLS_INPUT_CLOCK=10.000000...
module datatypes1 (
        ap_start,
        ap_done,
        ap_idle,
        ap_ready,
        a_V,
        b_V,
        ap_return
);

input   ap_start;
output   ap_done;
output   ap_idle;
output   ap_ready;
input  [21:0] a_V;
input  [21:0] b_V;
output  [21:0] ap_return;

assign ap_done = ap_start;
assign ap_idle = 1'b1;
assign ap_ready = ap_start;
assign ap_return = (b_V + a_V);

endmodule

// C:/Users/matta/Xilinx projects/datatypes1/datatypes1.srcs/sim_1/new/datatypes1_tb.v
`timescale 1ns / 1ps
module tb_datatypes1;
    // Inputs
    reg ap_start;
    reg [21:0] a_V;
    reg [21:0] b_V;
    // Outputs
    wire ap_done;
    wire ap_idle;
    wire ap_ready;
    wire [21:0] ap_return;

    // Instantiate the DUT
    datatypes1 dut (.ap_start(ap_start), .ap_done(ap_done), .ap_idle(ap_idle), .ap_ready(ap_ready), .a_V(a_V), .b_V(b_V), .ap_return(ap_return));

    // Convert a 22-bit fixed-point value (17 fractional bits) to real
    function real fixed22_to_real(input [21:0] fixed_val);
    begin
        if (fixed_val[21]) // Negative
            fixed22_to_real = -((~fixed_val + 1'b1) & 22'h3FFFFF) / 131072.0;
        // ...
```



#### Observations:

- 1. We observed that when we coded with the float type in C, FSMs were created, because the float operation could not be completed in a single clock cycle.
- 2. **Floating-point arithmetic** (IEEE 754) is complex and involves multiple steps such as alignment, normalization, rounding, and exception handling.
- 3. These steps depend on each other and involve a lot of bit manipulation, so they do not fit into a **single clock cycle** at the target clock period.
- 4. Vivado HLS maps float operations to **multi-stage IP cores**, typically pipelined over several cycles.
- 5. An **FSM** (**Finite State Machine**) is automatically generated to control the sequence and timing of these stages.
- 6. The FSM ensures correct data flow, result readiness, and ap\_done/ap\_ready signaling during multi-cycle float operations.

To resolve this, we switched to C++ code that uses ap\_fixed<W,I>, where W is the total number of bits and I is the number of integer bits (including the sign bit), which leaves W-I fractional bits. This lets us limit the number of bits that are used. In our code I = 5, which means 4 bits for the integer magnitude and 1 bit for the sign, and W = 22 leaves 17 fractional bits. The larger the value of W for a given I, the higher the precision. This matters when converting decimal numbers that have no exact binary representation, such as 2.1, which need more fractional bits to be represented accurately.
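
The effect of the 17 fractional bits can be sketched in plain C by scaling with $2^{17} = 131072$, the same factor the Verilog testbench divides by. This is a simplified model and not the ap\_fixed implementation: it assumes truncation toward minus infinity (to our understanding the default quantization mode of ap\_fixed) and does not model overflow. The names `to_fixed` and `to_real` are hypothetical.

```c
#include <stdint.h>

#define TOTAL_BITS 22                       /* W */
#define INT_BITS    5                       /* I, including the sign bit */
#define FRAC_BITS  (TOTAL_BITS - INT_BITS)  /* W - I = 17 */

/* Converts a real value to a raw fixed-point integer by scaling with 2^17
   and truncating toward minus infinity. Overflow is not handled here. */
int32_t to_fixed(double x)
{
    double scaled = x * (double)(1 << FRAC_BITS);
    int32_t raw = (int32_t)scaled;          /* truncates toward zero */
    if ((double)raw > scaled) {
        raw -= 1;                           /* move negatives toward -inf */
    }
    return raw;
}

/* Converts a raw fixed-point integer back to a real value. */
double to_real(int32_t raw)
{
    return (double)raw / (double)(1 << FRAC_BITS);
}
```

In this model 3.0 is stored exactly as the raw value 393216, while 2.1 becomes 275251, which converts back to a value slightly below 2.1. The error is smaller than $2^{-17}$, and it shrinks further if more fractional bits are used.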

# Week 2 09/06/2025

### D FLIP-FLOP

C Code and Testbench:

```
// dff.h
#ifndef DFF_H
#define DFF_H

void dff(int clk, int rst, int d, int *q);

#endif

// dff.c
#include "dff.h"

void dff(int clk, int rst, int d, int *q) {
    static int q_reg = 0;
    if (rst) {
        q_reg = 0;
    }
    else if (clk) {
        q_reg = d;
    }
    *q = q_reg;
}
```

```
// dff_tb.c
#include <stdio.h>
#include "dff.h"

int main() {
    int clk = 0, rst = 1, d = 0, q = 0;
    printf("Applying reset...\n");
    int reset_cycles = 2;
    while (reset_cycles > 0)
    {
        dff(clk, rst, d, &q);
        clk = !clk;
        reset_cycles--;
    }
    rst = 0;
    int d_values[] = {1, 0, 1, 1, 0};
    int index = 0;
    int total_values = sizeof(d_values) / sizeof(d_values[0]);
    printf("\nStarting simulation...\n");
    while (index < total_values) {
        d = d_values[index];
        clk = 1;
        dff(clk, rst, d, &q); // rising edge

        printf("Cycle %d | D: %d | CLK: %d | Q: %d\n", index, d, clk, q);
        clk = 0;
        dff(clk, rst, d, &q); // falling edge
        index++;
    }
    return 0;
}
```

Here, we have used a header file "dff.h" that holds the function declaration. It is included in the other files, "dff.c" and "dff\_tb.c", which gives better code modularity and reusability.

### Verilog code and testbench:


C:/Users/matta/Xilinx projects/dff/dff.srcs/sources 1/new/dff.v

```
// ...
end

always @ (*) begin
    if (((ap_start == 1'b0) & (1'b1 == ap_CS_fsm_state1))) begin
        ap_idle = 1'b1;
    end else begin
        ap_idle = 1'b0;
    end
end

always @ (*) begin
    if ((1'b1 == ap_CS_fsm_state2)) begin
        ap_ready = 1'b1;
    end else begin
        ap_ready = 1'b0;
    end
end

always @ (*) begin
    if ((1'b1 == ap_CS_fsm_state2)) begin
        q_ap_vld = 1'b1;
    end else begin
        q_ap_vld = 1'b0;
    end
end

always @ (*) begin
    case (ap_CS_fsm)
        ap_ST_fsm_state1 : begin
            if (((ap_start == 1'b1) & (1'b1 == ap_CS_fsm_state1))) begin
                ap_NS_fsm = ap_ST_fsm_state2;
            end else begin
                ap_NS_fsm = ap_ST_fsm_state1;
            end
        end
        ap_ST_fsm_state2 : begin
            ap_NS_fsm = ap_ST_fsm_state1;
        end
        default : begin
            ap_NS_fsm = 'bx;
        end
    endcase
end

assign ap_CS_fsm_state1 = ap_CS_fsm[32'd0];
assign ap_CS_fsm_state2 = ap_CS_fsm[32'd1];
assign q = q_reg;
assign tmp_1_fu_59_p2 = ((clk == 32'd0) ? 1'b1 : 1'b0);
assign tmp_fu_47_p2 = ((rst == 32'd0) ? 1'b1 : 1'b0);

endmodule //dff
```

```
C:/Users/matta/Xilinx projects/dff/dff.srcs/sim_1/new/dff_tb.v
```

```
`timescale 1ns / 1ps
module tb_dff;
    reg ap_clk; reg ap_rst; reg ap_start; reg [3:0] clk; reg [3:0] rst; reg [3:0] d;
    wire ap_done; wire ap_idle; wire ap_ready; wire [3:0] q; wire q_ap_vld;

    // Instantiate the DFF module
    dff uut (.ap_clk(ap_clk), .ap_rst(ap_rst), .ap_start(ap_start), .ap_done(ap_done), .ap_idle(ap_idle), .ap_ready(ap_ready), .clk(clk), .rst(rst), .d(d), .q(q), .q_ap_vld(q_ap_vld));

    // Clock generation
    initial begin
        ap_clk = 0;
        forever #5 ap_clk = ~ap_clk; // 10ns period
    end

    // Stimulus
    initial begin
        // Initialize
        ap_rst = 1; ap_start = 0;
        clk = 1;
        rst = 1;
        d = 0;
        // Wait for some clock cycles
        // ...
        // De-assert reset
        ap_rst = 0; // active-low reset logic in module
        // Start FSM
        ap_start = 1;
        // Apply data input
        d = 4'd5;
        clk = 1;
        // ...
        #20;
        // Reset again
        ap_rst = 1;
        rst = 1;
        #10;
        ap_rst = 0;
        rst = 0;
        // Final input
        ap_start = 1;
        d = 4'd21;
        #10;
        ap_start = 0;
        #20;
        $finish;
    end
endmodule
```

#### Simulation:



### Observations:

- 1. All inputs and outputs are 4 bits wide.
- 2. We observed that there are 2 clocks in the code, ap\_clk and an external clk. The design runs on ap\_clk, and the external clk is used as an enable flag.
- 3. It is a synchronous circuit and the first sequential circuit that we have designed.
- 4. Since all the inputs are 4 bits wide, rst and clk each have 4 bits, but a single button is enough to drive each of them.
- 5. Since it is a synchronous circuit, it works on a clock that we control through a push button.
- 6. Initially the reset is off and the stored data is latched onto q; when reset = 1, the stored data goes to 0.
- 7. We faced a few initial problems, such as pin mapping and misreading the circuit as asynchronous when it was actually synchronous. This led us to change the pin mapping, where we had given junk values before.
- 8. When ap\_start = 0, the circuit does not operate. It begins to work when ap\_start goes high and ap\_idle = 0. After ap\_start = 1, we can apply the input data and watch it being stored.
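
The enable-like behaviour of the external clk in observation 2 can be sketched with a small behavioural model in C. Here one call to `dff_step` stands for one ap\_clk cycle; the type `dff_state` and the function name are hypothetical, and the model leaves out the ap\_start handshake.

```c
/* Behavioural model: one call to dff_step() stands for one ap_clk cycle.
   The clk argument is sampled as a level, so it acts as an enable. */
typedef struct {
    int q_reg;   /* stored value */
} dff_state;

int dff_step(dff_state *s, int clk, int rst, int d)
{
    if (rst) {
        s->q_reg = 0;   /* reset has priority */
    } else if (clk) {
        s->q_reg = d;   /* capture d only while clk (enable) is high */
    }
    return s->q_reg;    /* otherwise hold the previous value */
}
```

Starting from a cleared state, applying d = 5 with clk = 1 stores 5; changing d to 9 with clk = 0 leaves q at 5; and asserting rst clears q regardless of clk.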

# **Pre-distortion Circuits**

# **Constellation Mapper**

### C code:

```
// constellation5.cpp
#include <ap_int.h>
#include <hls_stream.h>

// Fixed-point type for I/Q values
typedef ap_int<16> symbol_type;

// Standard QPSK constellation values (normalized to 0.7071)
// Scaled for 16-bit fixed point representation (Q15 format)
// 0.7071 * 32767 = 23170, -0.7071 * 32768 = -23170
const symbol_type QPSK_I_VALUES[4] = { 23170, -23170, -23170, 23170}; // I values
const symbol_type QPSK_Q_VALUES[4] = { 23170, 23170, -23170, -23170}; // Q values

/**
 * @brief QPSK Constellation mapper function
 *
 * @param input_bytes Input byte array to map to QPSK constellation points
 * @param output_symbols_I I-channel output symbols
 * @param output_symbols_Q Q-channel output symbols
 * @param num_bits Number of bits to process (should be even for QPSK) */
void constellation5(
    ap_uint<8> *input_bytes,
    symbol_type *output_symbols_I,
    symbol_type *output_symbols_Q,
    unsigned int num_bits
) {
    #pragma HLS INTERFACE m_axi port=input_bytes offset=slave depth=256
    #pragma HLS INTERFACE m_axi port=output_symbols_I offset=slave depth=256
    #pragma HLS INTERFACE m_axi port=output_symbols_Q offset=slave depth=256
    #pragma HLS INTERFACE s_axilite port=num_bits bundle=AXILiteS
    #pragma HLS INTERFACE s_axilite port=return bundle=AXILiteS

    // QPSK: Process 2 bits per symbol
    unsigned int num_symbols = (num_bits + 1) / 2; // Ceiling division by 2
    for (unsigned int i = 0; i < (num_symbols + 3) / 4; i++) { // Process bytes
        ap_uint<8> byte_val = input_bytes[i];
        // Process each 2-bit group in the byte
        for (int j = 0; j < 4 && (i*4 + j) < num_symbols; j++) {
            // Extract 2 bits for QPSK
            ap_uint<2> symbol_idx = (byte_val >> (6 - j*2)) & 0x3;
            // Map to I/Q constellation point
            output_symbols_I[i*4 + j] = QPSK_I_VALUES[symbol_idx];
            output_symbols_Q[i*4 + j] = QPSK_Q_VALUES[symbol_idx];
        }
    }
}
```

### Purpose:

Maps input bits to QPSK constellation points (I/Q values) for digital communication.

### Input:

- Bitstream packed as bytes.
- Each QPSK symbol uses 2 bits.

### Output:

Two arrays: one for I (in-phase), one for Q (quadrature) components.

### Constellation Mapping:

```
const symbol_type QPSK_I_VALUES[4] = {23170, -23170, -23170, 23170};  // I values
const symbol_type QPSK_Q_VALUES[4] = {23170, 23170, -23170, -23170};  // Q values
```

The mapping here uses whole numbers, which are just scaled versions of the decimal values, i.e. 0.7071 \* 32767 ≈ 23170.

#### Process:

- For each symbol, extract 2 bits from the input.
- Use these bits as an index into I/Q lookup tables.
- Store the corresponding I and Q values in the output arrays.

### Looping:

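- Iterates over the input bytes; within each byte, up to four 2-bit symbols are processed (the number of symbols is half the number of input bits, rounded up).
- Stops early in the last byte when fewer than four symbols remain. Symbols never cross byte boundaries, since each byte holds exactly four 2-bit groups.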
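The byte and bit-pair indexing can be tried in software with a plain C++ model. This is a sketch, not the synthesized function: it uses standard integer types instead of `ap_uint`, the name `qpsk_map` is hypothetical, and the fourth Q table entry is assumed to be -23170, which is what Gray coding implies.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Lookup tables (1/sqrt(2) scaled to 16 bits); fourth Q entry is an assumption.
static const std::int16_t QPSK_I[4] = {23170, -23170, -23170, 23170};
static const std::int16_t QPSK_Q[4] = {23170, 23170, -23170, -23170};

struct QpskSymbols {
    std::vector<std::int16_t> i;
    std::vector<std::int16_t> q;
};

// Map a packed bitstream to QPSK symbols: two bits per symbol, MSB first.
QpskSymbols qpsk_map(const std::vector<std::uint8_t>& bytes, unsigned num_bits) {
    QpskSymbols out;
    const unsigned num_symbols = (num_bits + 1) / 2;      // ceiling division
    for (unsigned s = 0; s < num_symbols && s / 4 < bytes.size(); ++s) {
        const unsigned shift = 6 - 2 * (s % 4);           // position of the bit pair in its byte
        const unsigned idx = (bytes[s / 4] >> shift) & 0x3u;
        out.i.push_back(QPSK_I[idx]);
        out.q.push_back(QPSK_Q[idx]);
    }
    return out;
}
```

With the byte 179 (binary 10110011) and `num_bits = 8`, the bit pairs are 10, 11, 00, 11, so the table indices are 2, 3, 0, 3.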

### Efficiency:

- Uses lookup tables for fast mapping.
- Suitable for both software simulation and hardware (FPGA) implementation.

### Verilog Code:

The C code generates 5 Verilog files:

- constellation5\_AXILiteS\_s\_axi.v
- constellation5\_gmem\_m\_axi.v
- constellation5\_gmem2\_m\_axi.v
- constellation5\_mubkb.v
- constellation5.v

The main file is constellation5.v.

Simulation Graph: waveform capture (cursor at 76.667 ns, 65 ns to 105 ns shown) of the testbench signals clk, rst\_n, test\_input[0:3][7:0] = 179, 85, 240, 15, num\_bits = 32, i\_value[15:0], q\_value[15:0], start\_test, done, symbol\_index, symbol\_valid, input\_byte\_addr, output\_I\_mem[0:15][15:0], output\_Q\_mem[0:15][15:0] and symbol\_count. The I and Q values toggle between 23170 and 42366; 42366 is -23170 displayed as an unsigned 16-bit value.

# Pulse Shaping Filter

C code:

```
#include "ap_fixed.h"
#include "hls_math.h"

#define DATA_LEN 512
#define NUM_WEIGHTS 41
#define SPS 25
#define ALPHA 0.5

typedef ap_fixed<16, 4> fixed_t;
// fixed_t only spans [-8, 8): tap offsets (up to +/-20), SPS (25) and the
// coefficient sum need more integer bits, so they use a wider type.
typedef ap_fixed<24, 8> wide_t;

void raised_cosine_filter(fixed_t rc[NUM_WEIGHTS]) {
    const fixed_t PI = fixed_t(3.14159);
    const fixed_t ALPHA_FIXED = fixed_t(ALPHA);
    const fixed_t EPS = fixed_t(1e-5);   // below the 2^-12 resolution of fixed_t, so stored as 0
    const fixed_t EPS_X = fixed_t(1e-3);
    const fixed_t SCALE = fixed_t(0.9999);

    int mid = NUM_WEIGHTS / 2;
    wide_t sum = 0;

    for (int i = 0; i < NUM_WEIGHTS; i++) {
        wide_t idx = wide_t(i - mid);
        fixed_t x = SCALE * idx / wide_t(SPS);
        fixed_t pi_x = PI * x;

        fixed_t sinc;
        if (hls::abs(x) < EPS_X)
            sinc = fixed_t(1.0);
        else
            sinc = hls::sin(pi_x) / pi_x;

        fixed_t denom = fixed_t(1.0) - fixed_t(4.0) * ALPHA_FIXED * ALPHA_FIXED * x * x;
        if (hls::abs(denom) < EPS)
            denom = EPS;

        // FIXED this line to avoid synthesis error
        ap_fixed<32, 6> angle = PI * ALPHA_FIXED * x;
        fixed_t cos_part = hls::cos(angle);

        rc[i] = sinc * (cos_part / denom);
        sum += rc[i];
    }

    // Normalize so that the coefficients sum to 1
    for (int i = 0; i < NUM_WEIGHTS; i++) {
        rc[i] = rc[i] / sum;
    }
}

void convolve(const fixed_t data[DATA_LEN], const fixed_t filter[NUM_WEIGHTS], fixed_t result[DATA_LEN]) {
    int mid = NUM_WEIGHTS / 2;

    for (int i = 0; i < DATA_LEN; i++) {
        fixed_t acc = 0;
        for (int j = 0; j < NUM_WEIGHTS; j++) {
            int k = i - mid + j;
            if (k >= 0 && k < DATA_LEN)
                acc += data[k] * filter[j];
        }
        result[i] = acc;
    }
}

void pulse_shape(fixed_t i_data[DATA_LEN], fixed_t q_data[DATA_LEN], fixed_t i_out[DATA_LEN], fixed_t q_out[DATA_LEN]) {
#pragma HLS INTERFACE s_axilite port=return bundle=CTRL
#pragma HLS INTERFACE bram port=i_data
#pragma HLS INTERFACE bram port=q_data
#pragma HLS INTERFACE bram port=i_out
#pragma HLS INTERFACE bram port=q_out

    fixed_t rc_filter[NUM_WEIGHTS];
    raised_cosine_filter(rc_filter);
    convolve(i_data, rc_filter, i_out);
    convolve(q_data, rc_filter, q_out);
}
```

### Input:

- Two arrays:
  - i\_data[DATA\_LEN]: in-phase component
  - q\_data[DATA\_LEN]: quadrature component
- Each array contains baseband samples (fixed-point format)

### **Output:**

- Two arrays:
  - i\_out[DATA\_LEN]: filtered in-phase output
  - q\_out[DATA\_LEN]: filtered quadrature output

### Filter Design:

- Raised Cosine Filter with:
  - NUM\_WEIGHTS = 41 taps
  - ALPHA = 0.5 roll-off factor
  - SPS = 25 samples per symbol
- Coefficients computed with:
  - Normalized sinc() function
  - Cosine roll-off term
  - Avoids divide-by-zero using small epsilon
- Normalization step ensures filter gain = 1
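A double-precision reference is useful for checking the fixed-point coefficients. The sketch below is plain C++, not HLS code; the function name `raised_cosine_reference` and the guard value `1e-9` are assumptions of this example, and the 0.9999 scale factor of the HLS version is left out.

```cpp
#include <cmath>
#include <vector>

// Raised-cosine taps in double precision:
//   h(x) = sinc(x) * cos(pi * alpha * x) / (1 - 4 * alpha^2 * x^2),  x = n / sps
// normalized so that the taps sum to 1.
std::vector<double> raised_cosine_reference(int num_weights, int sps, double alpha) {
    const double pi = 3.14159265358979323846;
    const double eps = 1e-9;                    // guard value chosen for this sketch
    const int mid = num_weights / 2;
    std::vector<double> h(num_weights);
    double sum = 0.0;
    for (int i = 0; i < num_weights; ++i) {
        const double x = static_cast<double>(i - mid) / sps;
        const double sinc = (std::fabs(x) < eps) ? 1.0 : std::sin(pi * x) / (pi * x);
        double denom = 1.0 - 4.0 * alpha * alpha * x * x;
        if (std::fabs(denom) < eps) denom = eps;   // epsilon guard against division by zero
        h[i] = sinc * std::cos(pi * alpha * x) / denom;
        sum += h[i];
    }
    for (double& tap : h) tap /= sum;           // unity DC gain
    return h;
}
```

Calling `raised_cosine_reference(41, 25, 0.5)` gives 41 symmetric taps that sum to 1, with the largest tap in the center.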

### **Process:**

1. Compute rc\_filter[] coefficients using raised\_cosine\_filter()
2. Convolve i\_data[] and q\_data[] with rc\_filter[] using convolve()
3. Store results in i\_out[], q\_out[]

## Looping:

- Filter loop (NUM\_WEIGHTS times) for each output sample
- Handles signal edges using boundary checks
- Accumulates sum of products for FIR convolution
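The edge handling can be seen in a small software model. This is a sketch in plain C++ with `double` samples; `fir_same` is a hypothetical name, and it mirrors the index arithmetic `k = i - mid + j` of convolve().

```cpp
#include <cstddef>
#include <vector>

// Same-size FIR filtering: samples outside the signal are treated as zero.
std::vector<double> fir_same(const std::vector<double>& data, const std::vector<double>& taps) {
    const long mid = static_cast<long>(taps.size() / 2);
    const long n = static_cast<long>(data.size());
    std::vector<double> result(data.size(), 0.0);
    for (long i = 0; i < n; ++i) {
        double acc = 0.0;
        for (long j = 0; j < static_cast<long>(taps.size()); ++j) {
            const long k = i - mid + j;
            if (k >= 0 && k < n) acc += data[k] * taps[j];   // boundary check
        }
        result[i] = acc;
    }
    return result;
}
```

With data {1, 2, 3, 4} and taps {1, 1, 1}, the result is {3, 6, 9, 7}: the first and last outputs only see two samples because the third falls outside the signal.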

## Efficiency:

- Uses fixed-point math for FPGA compatibility
- Coefficients and output fit within 16-bit precision
- Fully synthesizable via Vivado HLS
- Interface-ready via AXI & BRAM pragmas

# Pre Distorter

# DIGITAL PRE DISTORTION BLOCK

#### **Pre Distorter**

The DPD circuit follows the Indirect Learning Architecture (ILA).



**Indirect Learning Architecture** 

#### Purpose:

To apply the calculated inverse distortion to the input signal.

### Input:

Pulse shaped I and Q values from the pulse shape filter.

Weight coefficients generated by the Pre Distortion Algorithm.

### **Output:**

Complex Multiplication product of input signal (I&Q) values and the weights generated by the DPD algorithm.

#### Process:

Receive I and Q shaped values from the pulse shape filter.

Collect the weighted coefficients from the Pre Distortion algorithm block.

Generate z values for pre distorted outputs.

```
sum_i += w_prev[k][m].real * real_phi[k][m] - w_prev[k][m].imag *
imag_phi[k][m];
sum_q += w_prev[k][m].real * imag_phi[k][m] + w_prev[k][m].imag *
real_phi[k][m];
```

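The assignments above implement a complex multiplication of each weight with its basis term: sum\_i accumulates the real part and sum\_q accumulates the imaginary part.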
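The same multiply-accumulate can be written in plain C++ and compared against `std::complex`. This is a sketch; the names `IQ` and `complex_mac` are hypothetical.

```cpp
#include <complex>

struct IQ { double i; double q; };

// Accumulate w * phi, written out in real arithmetic as in the HLS snippet.
IQ complex_mac(IQ acc, std::complex<double> w, std::complex<double> phi) {
    acc.i += w.real() * phi.real() - w.imag() * phi.imag();   // real part
    acc.q += w.real() * phi.imag() + w.imag() * phi.real();   // imaginary part
    return acc;
}
```

For w = 1 + 2j and phi = 3 + 4j the accumulator receives -5 + 10j, which matches the product computed with `std::complex<double>`.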

### Efficiency:

Generation of coefficients is done recursively using the LMS method.

# Pre-Distortion Algorithm

### 1. Orthogonal Polynomial Method:

### Purpose:

To generate orthogonal polynomials and the weighted coefficients for digital pre distortion.

### Inputs:

Delayed pulse shaped I and Q inputs (x), amplified output (y) from the feedback loop.

#### Outputs:

Weighted coefficients (w) for RLS applications.

#### Process:

The delayed signal is subtracted from the feedback output to calculate the error function (e(n)).

```
// Compute error: e = x_ref - y_model
data_t err_i = i_ref - *i_out;
data_t err_q = q_ref - *q_out;
```

```
// Compute |z|^2
phi_t abs2(data_t i, data_t q) {
    return i*i + q*q;
}

// Iterative shifted Legendre polynomial (or similar orthogonal poly)
phi_t iterative_P(int k, phi_t x) {
    phi_t P0 = 1, P1 = x - 1, Pk = 0;
    if (k == 0) return P0;
    if (k == 1) return P1;
    for (int n = 2; n <= k; n++) {
        Pk = ((2*n-1)*(x-1)*P1 - (n-1)*P0)/n;
        P0 = P1;
        P1 = Pk;
    }
    return Pk;
}

// Compute all orthogonal polynomial basis functions for all memory taps
void compute_phi_all(
    const data_t i_in[MEMORY_DEPTH], const data_t q_in[MEMORY_DEPTH],
    phi_t real_phi[K][MEMORY_DEPTH], phi_t imag_phi[K][MEMORY_DEPTH]
) {
    for (int m = 0; m < MEMORY_DEPTH; ++m) {
        phi_t x = abs2(i_in[m], q_in[m]);
        for (int k = 0; k < K; ++k) {
#pragma HLS UNROLL
```

The orthogonal polynomials are evaluated iteratively with a three-term recurrence to generate the basis functions. The basis functions and the error function are then used to compute the weighted coefficients.

### Efficiency:

Using the RLS method for the polynomial coefficients enables faster iterations. Calculation by the LMS method is quick and efficient.

The mean squared error is greatly reduced and the distortion is corrected.
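The report does not show the weight update itself. As an illustration only, a textbook complex LMS step is sketched below in plain C++; the names `lms_step` and `mu`, and the choice of LMS over RLS, are assumptions of this example rather than the implemented algorithm.

```cpp
#include <complex>
#include <cstddef>
#include <vector>

using cplx = std::complex<double>;

// One complex LMS step for the model y = sum_k w[k] * phi[k]:
//   e = d - y,   w[k] += mu * e * conj(phi[k])
// Returns the error measured before the update.
cplx lms_step(std::vector<cplx>& w, const std::vector<cplx>& phi, cplx d, double mu) {
    cplx y(0.0, 0.0);
    for (std::size_t k = 0; k < w.size(); ++k) y += w[k] * phi[k];
    const cplx e = d - y;
    for (std::size_t k = 0; k < w.size(); ++k) w[k] += mu * e * std::conj(phi[k]);
    return e;
}
```

Each step moves the weights along the error gradient, so with a small enough step size `mu` and noise-free data the weights converge towards the values that reproduce the reference signal.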

#### 2. Modified Differential Evolution

#### Purpose:

To generate a new vector (signal) value for digital predistortion.

#### Inputs:

Delayed pulse shaped I and Q inputs (x), amplified output (y) from the feedback loop.

#### Outputs:

The fittest vector, i.e. the one that yields the best digital pre-distortion.

#### Process:

Generate a population of values from the feedback outputs.

```
// 1. Initialize population
if (!init_done) {
    for (int i = 0; i < POPULATION_SIZE; i++) {
#pragma HLS UNROLL
        for (int k = 0; k < K; k++) {
#pragma HLS UNROLL
            for (int tap = 0; tap < MEMORY_DEPTH; tap++) {
#pragma HLS UNROLL
                data_t perturbation_real = (generate_random_float(&rand_seed)
                    - ap_fixed<24,8>(0.5)) * ap_fixed<24,8>(2.0);
                data_t perturbation_imag = (generate_random_float(&rand_seed)
                    - ap_fixed<24,8>(0.5)) * ap_fixed<24,8>(2.0);
                if (k == 0 && tap == 0) {
                    population[i][k][tap].real = ap_fixed<24,8>(1.0) + perturbation_real;
                    population[i][k][tap].imag = perturbation_imag;
                } else {
                    population[i][k][tap].real = perturbation_real;
                    population[i][k][tap].imag = perturbation_imag;
                }
            }
        }
        fitness[i] = 1000.0f;
    }
    init_done = true;
}
```

The vectors are generated randomly.

The best individual is then selected from the population based on fitness.

```
// 2. Find best individual
int best_idx = 0;
data_t best_fitness = fitness[0];
for (int i = 1; i < POPULATION_SIZE; i++) {
#pragma HLS UNROLL
    if (fitness[i] < best_fitness) {
        best_fitness = fitness[i];
        best_idx = i;
    }
}

// Print fitness progress every 10 generations
static int adapt_print_counter = 0;
adapt_print_counter++;
if (adapt_print_counter % 10 == 0) {
    std::cout << "[MDE] Generation " << generation_count
              << " best fitness: " << best_fitness << std::endl;
}
```

The pre-distorted output z is computed using the best individual, selected on the basis of fitness and cost factor.

The vectors are then mutated and the trial vectors are evaluated for fitness.

```
// 4. Output DPD for the batch using best individual
for (int b = 0; b < BATCH_SIZE; b++) {
    phi_t real_phi[K][MEMORY_DEPTH], imag_phi[K][MEMORY_DEPTH];
    compute_phi_all(i_in_batch[b], q_in_batch[b], real_phi, imag_phi);
    acc_t sum_i = 0, sum_q = 0;
    for (int k = 0; k < K; k++) {
        for (int tap = 0; tap < MEMORY_DEPTH; tap++) {
            sum_i += population[best_idx][k][tap].real * real_phi[k][tap]
                   - population[best_idx][k][tap].imag * imag_phi[k][tap];
            sum_q += population[best_idx][k][tap].real * imag_phi[k][tap]
                   + population[best_idx][k][tap].imag * real_phi[k][tap];
        }
    }
    z_i_batch[b] = data_t(double(sum_i) / g_dpd_scale);
    z_q_batch[b] = data_t(double(sum_q) / g_dpd_scale);
}

    // Evaluate trial fitness over batch
    data_t trial_total_fitness = 0;
    for (int b = 0; b < BATCH_SIZE; b++) {
        phi_t real_phi[K][MEMORY_DEPTH], imag_phi[K][MEMORY_DEPTH];
        compute_phi_all(i_in_batch[b], q_in_batch[b], real_phi, imag_phi);
        acc_t sum_i = 0, sum_q = 0;
        for (int k = 0; k < K; k++) {
            for (int tap = 0; tap < MEMORY_DEPTH; tap++) {
                sum_i += trial[k][tap].real * real_phi[k][tap]
                       - trial[k][tap].imag * imag_phi[k][tap];
                sum_q += trial[k][tap].real * imag_phi[k][tap]
                       + trial[k][tap].imag * real_phi[k][tap];
            }
        }
        data_t y_i = data_t(double(sum_i) / g_dpd_scale);
```

The weights are updated after one full DE cycle.

```
// 5. DE mutation/crossover/selection (using batch fitness)
for (int i = 0; i < POPULATION_SIZE; i++) {
    int r1 = generate_random_index(i, POPULATION_SIZE, &rand_seed);
    int r2 = generate_random_index(r1, POPULATION_SIZE, &rand_seed);
    int r3 = generate_random_index(r2, POPULATION_SIZE, &rand_seed);
    ccoef_t trial[K][MEMORY_DEPTH];
    for (int k = 0; k < K; k++) {
        for (int tap = 0; tap < MEMORY_DEPTH; tap++) {
            trial[k][tap].real = population[r1][k][tap].real + F_SCALE *
                (population[r2][k][tap].real - population[r3][k][tap].real);
            trial[k][tap].imag = population[r1][k][tap].imag + F_SCALE *
                (population[r2][k][tap].imag - population[r3][k][tap].imag);
            int cross_rand = generate_random_index(0, 1000, &rand_seed);
            data_t rand_val = (data_t)cross_rand * (data_t)0.001;
            if (rand_val > CR_PROB) {
                trial[k][tap] = population[i][k][tap];
```

```
        data_t y_i = data_t(double(sum_i) / g_dpd_scale);
        data_t y_q = data_t(double(sum_q) / g_dpd_scale);
        trial_total_fitness += compute_fitness(i_ref_batch[b], q_ref_batch[b], y_i, y_q);
    }
    data_t trial_fitness = trial_total_fitness / BATCH_SIZE;
    if (trial_fitness < fitness[i]) {
        for (int k = 0; k < K; k++) {
            for (int tap = 0; tap < MEMORY_DEPTH; tap++) {
                population[i][k][tap] = trial[k][tap];
            }
        }
        fitness[i] = trial_fitness;
    }
}

// 6. Update weights after full DE cycle
generation_count++;
if (generation_count >= MAX_GENERATIONS) {
    for (int k = 0; k < K; k++) {
        for (int tap = 0; tap < MEMORY_DEPTH; tap++) {
            w[k][tap] = population[best_idx][k][tap];
        }
    }
    generation_count = 0;
}
```
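The mutation, crossover and selection steps are easier to follow on a toy problem. The sketch below is plain C++ with real-valued vectors and a simple squared-distance cost standing in for the batch fitness; `cost`, `de_generation` and the use of `std::mt19937` are assumptions of this example, and unlike the usual DE formulation the three random indices are not forced to be distinct.

```cpp
#include <cstddef>
#include <random>
#include <vector>

using Vec = std::vector<double>;

// Toy cost: squared distance to a target vector (lower is fitter).
double cost(const Vec& v, const Vec& target) {
    double c = 0.0;
    for (std::size_t d = 0; d < v.size(); ++d) c += (v[d] - target[d]) * (v[d] - target[d]);
    return c;
}

// One DE generation: mutate, cross over, keep the trial only if it is fitter.
void de_generation(std::vector<Vec>& pop, Vec& fitness, const Vec& target,
                   double f_scale, double cr_prob, std::mt19937& rng) {
    std::uniform_int_distribution<std::size_t> pick(0, pop.size() - 1);
    std::uniform_real_distribution<double> unit(0.0, 1.0);
    for (std::size_t i = 0; i < pop.size(); ++i) {
        const std::size_t r1 = pick(rng), r2 = pick(rng), r3 = pick(rng);
        Vec trial = pop[i];
        for (std::size_t d = 0; d < trial.size(); ++d) {
            if (unit(rng) <= cr_prob) {   // crossover: take the mutated component
                trial[d] = pop[r1][d] + f_scale * (pop[r2][d] - pop[r3][d]);
            }
        }
        const double trial_fitness = cost(trial, target);
        if (trial_fitness < fitness[i]) {   // greedy selection
            pop[i] = trial;
            fitness[i] = trial_fitness;
        }
    }
}
```

Because a trial vector only replaces an individual when it is fitter, the best fitness in the population can never get worse from one generation to the next.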

# DAC

# **Digital To Analog Converter**

### Inputs:

It takes discrete digital inputs from the pre-distorter.

### Outputs:

It outputs discrete analog values.

#### **Process:**

The discrete input values are scaled for the expected 8-bit output.

The input range is [-1, 1], while the output is 8 bits wide.

Hence the values are scaled to the 8-bit range [-128, 127].

```
// Multi-bit DAC with channel select (8-bit output)
void dac_multibit_with_select(data_ty din, ap_fixed<24,8> &dout, bool channel_select) {
#pragma HLS INLINE
    // Scale input from [-1, 1) to [-128, 127]
    data_ty quantized = din * data_ty(128);
    if (quantized > 127) quantized = 127;
    if (quantized < -128) quantized = -128;
    dout = ap_fixed<24,8>(quantized);
    printf("dout value is %f", dout.to_float());
}
```

Values that fall outside this range are saturated to -128 or 127.
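The scaling and saturation can be tried in plain C++. This is a sketch under assumptions: `dac_quantize` is a hypothetical name, the input is a `double`, and the result is truncated to an integer code, whereas the HLS version keeps an `ap_fixed<24,8>` value.

```cpp
#include <cstdint>

// Scale a sample from [-1, 1) to 8-bit codes and saturate at the limits.
std::int8_t dac_quantize(double din) {
    double scaled = din * 128.0;
    if (scaled > 127.0) scaled = 127.0;
    if (scaled < -128.0) scaled = -128.0;
    return static_cast<std::int8_t>(scaled);   // drops any fractional part
}
```

An input of 0.5 maps to 64, while 1.0 would scale to 128 and is therefore saturated to 127.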

This module is used so that the data can be easily quadrature modulated and upconverted.

# **Quadrature Modulation**

### Inputs:

Analog discrete I and Q values generated by the DAC (Digital to Analog Converter).

### **Outputs:**

Real RF quadrature modulated signal with both magnitude and phase preserved.

#### Process:

The NCO (Numerically Controlled Oscillator) uses a lookup table of 64 cosine values covering the first quadrant (0 to pi/2); the sine values and the other three quadrants are derived from it by symmetry.

```
// Get base values from LUT
ap_fixed<16,4> lut_val = cos_lut[index];
// Generate cos and sin based on quadrant
switch(quadrant) {
    case 0: // 0 to pi/2
        cos_val = (data_t)lut_val;
        sin_val = (data_t)cos_lut[63-index]; // sin = cos(pi/2 - theta)
        break;
    case 1: // pi/2 to pi
        cos_val = (data_t)(-cos_lut[63-index]);
        sin_val = (data_t)lut_val;
        break;
    case 2: // pi to 3pi/2
        cos_val = (data_t)(-lut_val);
        sin_val = (data_t)(-cos_lut[63-index]);
        break;
    case 3: // 3pi/2 to 2pi
        cos_val = (data_t)cos_lut[63-index];
        sin_val = (data_t)(-lut_val);
        break;
}
cos_lo = cos_val;
sin_lo = sin_val;
phase += phase_inc;
```

The quadrant and the LUT index are extracted from the phase value: the upper 2 bits select the quadrant and the lower 6 bits address the LUT.

```
// Extract quadrant and index
ap_uint<2> quadrant = phase >> 6; // Upper 2 bits for quadrant
ap_uint<6> index = phase & 0x3F; // Lower 6 bits for LUT index
data_t cos_val, sin_val;
```

The resulting cos and sin values are then used in the digital\_qm mixer, which returns the real quadrature modulated signal with phase and magnitude preserved.

```
data_t digital_qm(data_t I, data_t Q, data_t cos_lo, data_t sin_lo) {
#pragma HLS INLINE
    acc_t mix = (acc_t)I * cos_lo - (acc_t)Q * sin_lo;
    return (data_t)mix;
}
```
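The quadrant logic can be verified in double precision against `std::cos` and `std::sin`. The sketch below is plain C++; the names are hypothetical, and the half-step offset of the table entries, cos((k + 0.5)·pi/128), is an assumption of this example chosen so that entry 63 - k is exactly the sine of the same angle. The contents of the original `cos_lut` are not shown in the report.

```cpp
#include <cmath>

struct CosSin { double cos_lo; double sin_lo; };

// Entry k of a 64-entry first-quadrant cosine table (half-step offset assumed).
double cos_lut_entry(unsigned k) {
    const double pi = 3.14159265358979323846;
    return std::cos((k + 0.5) * pi / 128.0);
}

// Rebuild cos/sin for an 8-bit phase from the quarter-wave table.
CosSin nco_lookup(unsigned phase) {
    const unsigned quadrant = (phase >> 6) & 0x3u;   // upper 2 bits
    const unsigned index = phase & 0x3Fu;            // lower 6 bits
    const double c = cos_lut_entry(index);
    const double s = cos_lut_entry(63 - index);      // sin(theta) = cos(pi/2 - theta)
    switch (quadrant) {
        case 0:  return {c, s};
        case 1:  return {-s, c};
        case 2:  return {-c, -s};
        default: return {s, -c};
    }
}

// Mixer: real part of (I + jQ) * (cos + j*sin).
double digital_qm(double i, double q, const CosSin& lo) {
    return i * lo.cos_lo - q * lo.sin_lo;
}
```

Sweeping `phase` from 0 to 255 reproduces one full period of the carrier from a table that only stores a quarter of it.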

### Efficiency:

The quadrature modulated real values can easily be upconverted and amplified.

# Digital Upconverter

### Inputs:

The input to the block is taken from the output of the Quadrature Modulator.

### **Outputs:**

The outputs of the block are upsampled, filtered and upconverted real RF values.

#### Process:

The inputs are interpolated (upsampled) by a factor of 8 using an FIR filter.

The filter coefficients are applied to the zero-stuffed inputs for interpolation.

Cos and sin values are generated by a numerically controlled oscillator and are stored in LUTs.

The sin and cos values are then multiplied with the interpolated I and Q samples to complete the upconversion.
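The interpolation step on its own can be modelled in plain C++. This is a sketch: `interpolate_fir` is a hypothetical name, samples are `double`, and the tap vector is assumed to be non-empty.

```cpp
#include <cstddef>
#include <vector>

// Interpolate by an integer factor: each input sample is followed by
// (factor - 1) zeros, and the sequence is passed through a direct-form FIR.
std::vector<double> interpolate_fir(const std::vector<double>& in, std::size_t factor,
                                    const std::vector<double>& taps) {
    std::vector<double> shift_reg(taps.size(), 0.0);
    std::vector<double> out;
    out.reserve(in.size() * factor);
    for (double sample : in) {
        for (std::size_t phase = 0; phase < factor; ++phase) {
            // Shift the delay line and insert the new sample (or a zero).
            for (std::size_t k = shift_reg.size() - 1; k > 0; --k) shift_reg[k] = shift_reg[k - 1];
            shift_reg[0] = (phase == 0) ? sample : 0.0;
            double acc = 0.0;
            for (std::size_t k = 0; k < taps.size(); ++k) acc += taps[k] * shift_reg[k];
            out.push_back(acc);
        }
    }
    return out;
}
```

With a factor of 8 and eight taps equal to 1, every input sample is simply held for 8 output samples; a proper low-pass design replaces the hold with a smooth interpolation.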

```
// Read one input sample
sample_type i_sample = 0, q_sample = 0;
if (!i_in.empty() && !q_in.empty()) {
    i_in >> i_sample;
    q_in >> q_sample;
} else {
    return;
}

// Perform interpolation and upconversion for each interpolated sample
for (int phase = 0; phase < INTERPOLATION_FACTOR; phase++) {
    #pragma HLS PIPELINE II=1

    // For simple interpolation, only use new input on phase 0, else use 0
    sample_type i_fir_in = (phase == 0) ? i_sample : sample_type(0);
    sample_type q_fir_in = (phase == 0) ? q_sample : sample_type(0);
    sample_type i_interp, q_interp;
    fir_filter(i_shift_reg, i_fir_in, i_interp);
    fir_filter(q_shift_reg, q_fir_in, q_interp);

    // Numerically controlled oscillator
    phase_acc += freq_control_word;
    phase_type lut_addr = phase_acc >> (32 - NCO_LUT_BITS);
    sample_type sin_val = sine_lut[lut_addr];
    sample_type cos_val = cosine_lut[lut_addr];

    // Complex multiplication (upconversion)
    sample_type upconverted = i_interp * cos_val - q_interp * sin_val;

    // Output the upconverted sample
    signal_out << upconverted;
}
```
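The value of `freq_control_word` sets the carrier frequency. The sketch below shows the usual relation for a 32-bit phase accumulator in plain C++; the function names and the parameters `f_out_hz`, `f_clk_hz` and `lut_bits` are assumptions of this example, with f_out below f_clk and `lut_bits` between 1 and 32.

```cpp
#include <cmath>
#include <cstdint>

// Frequency control word for a 32-bit phase accumulator:
//   fcw = round(f_out / f_clk * 2^32)
std::uint32_t freq_control_word(double f_out_hz, double f_clk_hz) {
    return static_cast<std::uint32_t>(std::llround(f_out_hz / f_clk_hz * 4294967296.0));
}

// The top lut_bits bits of the accumulator address the sine/cosine tables.
std::uint32_t lut_address(std::uint32_t phase_acc, unsigned lut_bits) {
    return phase_acc >> (32u - lut_bits);
}
```

For a 25 MHz carrier at a 100 MHz sample clock the control word is 2^30, so the accumulator wraps around every four samples.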

### Efficiency:

Since the sin and cos values are precomputed and stored in LUTs, this module is not computationally intensive.

# Power Amplifier

### Inputs:

The inputs to the PA are real RF values from the Upconverter.

The Q component is not an input: it is synthesized from the simulated phase distortion.

### **Outputs:**

The outputs are the amplified I and Q values, the output magnitude, the linear gain and the gain in dB.

### **Process:**

RF inputs are taken from the Quadrature Modulator, and these values, along with constant parameters, are used to compute the Saleh amplitude and phase responses. Amplitude compression is then applied to the RF values. Non-linear (cubic and quintic) distortion terms are added to create harmonics and intermodulation distortion (IMD), and AM-PM effects are included. The resulting I values are stored as out\_i. The phase distortion is used to generate a realistic out\_q component. Finally, the outputs are scaled by a factor of 5 and set.

```
// Saleh amplitude model (creates compression)
float A_r = (alpha_a * rf_magnitude) / (1 + beta_a * rf_magnitude * rf_magnitude);
// Saleh phase model (creates AM-PM distortion)
float P_r = (alpha_p * rf_magnitude * rf_magnitude) / (1 + beta_p * rf_magnitude * rf_magnitude);
// Apply amplitude compression to the RF signal
float gain_factor = A_r / rf_magnitude;
float compressed_rf = in_i.to_float() * gain_factor;
// Add nonlinear distortion effects (creates harmonics and IMD)
float distorted_rf = compressed_rf;
if (std::abs(compressed_rf) > 0.1f) {
    // Add cubic and quintic nonlinearities
    float norm_signal = compressed_rf / 3.0f; // Normalize to prevent overflow
    float cubic_term = norm_signal * norm_signal * norm_signal * 0.2f;
    float quintic_term = norm_signal * norm_signal * norm_signal *
                         norm_signal * norm_signal * 0.05f;
    distorted_rf = compressed_rf + cubic_term + quintic_term;
}
// Put distorted RF in out_i, create some Q due to phase distortion
out_i = data_t(distorted_rf);
// Phase distortion creates small Q component (realistic for RF PA)
float q_distortion = distorted_rf * P_r * 0.1f * std::sin(P_r * 5.0f);
out_q = data_t(q_distortion);
// Apply the 5.0 scaling
out_i = data_t(out_i.to_float());
out_q = data_t(out_q.to_float());
// Set other outputs
magnitude = data_t(A_r);
gain_lin = data_t(gain_factor);
// FIXED: HLS-compatible dB calculation using pre-computed constant
const float INV_LN10 = 0.434294481903f; // 1/ln(10) to avoid division
if (gain_factor > 0.001f) {
    gain_db = data_t(20.0f * std::log(gain_factor * 5.0f) * INV_LN10);
} else {
    gain_db = data_t(-60.0f); // Very low gain
}
```
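The dB conversion relies on the identity log10(x) = ln(x) · (1 / ln 10). A plain C++ sketch of the same calculation follows; `gain_to_db` and its floor parameters are hypothetical names, with 0.001 and -60 dB being the values used in the snippet above.

```cpp
#include <cmath>

// Gain in dB from a linear voltage gain, using ln(x) * (1 / ln 10) instead of log10(x).
// Gains at or below floor_gain return floor_db.
double gain_to_db(double gain_lin, double floor_gain, double floor_db) {
    const double inv_ln10 = 0.434294481903;   // 1 / ln(10)
    if (gain_lin > floor_gain) return 20.0 * std::log(gain_lin) * inv_ln10;
    return floor_db;
}
```

A linear gain of 10 gives 20 dB and a gain of 1 gives 0 dB; multiplying by a precomputed constant avoids a division in hardware.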

## **ADC**

# Analog to Digital Converter

### Inputs:

Down converted and demodulated analog I and Q value arrays.

### **Outputs:**

Discrete digital I and Q value arrays.

### **Process:**

Based on Vref values the maximum and minimum values of the range are computed.

```
const ap_fixed<24,8> V_REF = 1.0;    // Update type
const ap_fixed<24,8> V_MIN = -V_REF; // Update type
const ap_fixed<24,8> V_MAX = V_REF;  // Update type
const int scale = (1 << (W-1)) - 1;  // 32767 for 16-bit
```

Scale is computed to map the normalized voltage range to the digital range. All input samples are iterated over and compared with the maximum and minimum of the range. After the values are clamped to the acceptable range, they are scaled accordingly.
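The clamp-and-scale step can be modelled for a single sample in plain C++. This is a sketch: `adc_quantize` is a hypothetical name, a 16-bit output width is assumed, and the division by `v_ref` generalizes the V\_REF = 1.0 case of the design.

```cpp
#include <cstdint>

// Clamp to [-v_ref, +v_ref], then scale to a signed 16-bit code.
std::int16_t adc_quantize(double v_in, double v_ref) {
    const int scale = (1 << 15) - 1;   // 32767 for 16 bits
    double clamped = v_in;
    if (clamped > v_ref) clamped = v_ref;
    if (clamped < -v_ref) clamped = -v_ref;
    return static_cast<std::int16_t>(clamped / v_ref * scale);   // truncates toward zero
}
```

An input of 1.5 V with a 1.0 V reference is clamped and returns 32767, while 0.5 V returns 16383.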

```
for (int i = 0; i < N; i++) {
#pragma HLS PIPELINE II=1
    // I channel
    ap_fixed<24,8> clamped_I = (I_analog_in[i] > V_MAX) ? V_MAX :
                               ((I_analog_in[i] < V_MIN) ? V_MIN : I_analog_in[i]);
    double scaled_I = clamped_I.to_double() * scale;
    I_digital_out[i] = (ap_int<W>)scaled_I;
    // Q channel
    ap_fixed<24,8> clamped_Q = (Q_analog_in[i] > V_MAX) ? V_MAX :
                               ((Q_analog_in[i] < V_MIN) ? V_MIN : Q_analog_in[i]);
    double scaled_Q = clamped_Q.to_double() * scale;
    Q_digital_out[i] = (ap_int<W>)scaled_Q; // same W-bit integer cast as the I channel
    // Debug output during simulation
    #ifndef __SYNTHESIS__
    if (i < 10 || i > N-10) {
        printf("Sample %d: I_analog=%f, I_digital=%d | Q_analog=%f, Q_digital=%d\n",
               i, clamped_I.to_double(), (int)I_digital_out[i],
               clamped_Q.to_double(), (int)Q_digital_out[i]);
    }
    #endif
}
```

# **DDC**

### **Demodulation and Downconversion**

### Inputs:

Amplified (RF) values from the Power Amplifier.

### **Outputs:**

Demodulated and Down converted I and Q complex baseband signals.

#### **Process:**

I and Q static buffers are initialized to zero. A Numerically Controlled Oscillator generates sine and cosine values. RF samples are multiplied by cosine and negative sine to produce I and Q values. These mixed I and Q values are then added to the buffer. For demodulation, the buffer values are multiplied by the low-pass filter coefficients. The resulting output values are scaled down by a factor of 8 and stored as I\_out and Q\_out.
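The mix, filter and decimate chain can be sketched as a double-precision model in plain C++. All names here (`IQ`, `ddc_reference`, `cycles_per_sample`, `decim_factor`) are hypothetical, the oscillator uses `std::cos`/`std::sin` instead of a LUT, and the tap vector is assumed to be non-empty.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

struct IQ { double i; double q; };

// Mix the RF samples with cos / -sin, low-pass filter, and keep every
// decim_factor-th filtered sample.
std::vector<IQ> ddc_reference(const std::vector<double>& rf, double cycles_per_sample,
                              const std::vector<double>& lpf_taps, std::size_t decim_factor) {
    const double two_pi = 6.283185307179586;
    std::vector<double> i_buf(lpf_taps.size(), 0.0), q_buf(lpf_taps.size(), 0.0);
    std::vector<IQ> out;
    for (std::size_t n = 0; n < rf.size(); ++n) {
        const double phase = two_pi * cycles_per_sample * static_cast<double>(n);
        // Shift the delay lines and insert the newly mixed samples.
        for (std::size_t k = lpf_taps.size() - 1; k > 0; --k) {
            i_buf[k] = i_buf[k - 1];
            q_buf[k] = q_buf[k - 1];
        }
        i_buf[0] = rf[n] * std::cos(phase);
        q_buf[0] = -rf[n] * std::sin(phase);
        if ((n + 1) % decim_factor == 0) {   // decimation: one output per block
            double acc_i = 0.0, acc_q = 0.0;
            for (std::size_t k = 0; k < lpf_taps.size(); ++k) {
                acc_i += lpf_taps[k] * i_buf[k];
                acc_q += lpf_taps[k] * q_buf[k];
            }
            out.push_back({acc_i, acc_q});
        }
    }
    return out;
}
```

For an input cos(2πn/8) mixed at the same frequency and averaged with eight taps of 1/8, the double-frequency term cancels and each output is I = 0.5, Q = 0.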

```
// FIXED: Initialize buffers only once
if (!buffers_initialized) {
    for (int i = 0; i < FIR_TAPS; i++) {
        #pragma HLS UNROLL
        i_buffer[i] = 0;
        q_buffer[i] = 0;
    }
    buffers_initialized = true;
}
// FIXED: Static NCO phase accumulator (persists between calls)
static ap_uint<32> phase_acc = 0;
// Track output sample count
int out_sample_idx = 0;
int decim_count = 0;
// Calculate expected output samples
const int expected_outputs = num_samples / DECIM_FACTOR;
// FIXED: Reasonable scaling - your gain is way too high!
const filter_accum_t REASONABLE_GAIN = filter_accum_t(gain.to_float() / 1000.0f); // Scale down by 1000x
// Main processing loop
for (int n = 0; n < num_samples; n++) {
    #pragma HLS PIPELINE II=1
    #pragma HLS LOOP_TRIPCOUNT min=8192 max=65536 avg=32768
    // Bounds check
    if (n >= num_samples) break;
    // Get input sample
    rf_sample_t rf_sample = rf_in[n];
```