#### NANYANG TECHNOLOGICAL UNIVERSITY

# Techniques for Multi-Standard Cognitive Radios on FPGAs

Pham Hung Thinh

### School of Computer Engineering

A thesis submitted to Nanyang Technological University in partial fulfilment of the requirements for the degree of Doctor of Philosophy

March 2015

## Acknowledgments

It is a great pleasure to express my gratitude to all those who have continuously supported me and contributed to making this thesis possible.

I would like to express my sincere thanks and appreciation to my supervisors, Prof. Suhaib Fahmy and Prof. Ian McLoughlin, for their constant trust throughout my Ph.D. research, for their helpful suggestions and advice, and for their continuous support and teaching, which were essential to achieving this objective.

I express my sincere gratitude to Prof. Samarjit Chakraborty at the Institute for Real-Time Computer Systems, TU Munich, for providing me with an internship opportunity in the final stage of my PhD.

I also wish to thank all colleagues and technical staff in CHiPES for their prompt support and for providing all the facilities required for my research work.

Last but not least, I would like to acknowledge my family in Viet Nam, for their constant love and encouragement.

### Abstract

This thesis studies and explores Orthogonal Frequency Division Multiplexing (OFDM) techniques for cognitive radio. A cognitive radio (CR) is a wireless node that is able to adapt its parameters to optimise performance based on interaction with the environment, as well as to perform dynamic spectrum access, which can improve the efficiency with which radio spectrum resources are used. OFDM has been widely adopted in cognitive radio implementations. It is an efficient multicarrier modulation technique that provides robustness to frequency-selective channels and the possibility of employing spectrum pooling. Cognitive radios that support multiple standards and modify their operation depending on environmental conditions are becoming more important as the demand for higher bandwidth grows. Multiple Standard Cognitive Radios (MSCRs) are hence a more flexible generalisation of CRs, as they can operate across different bands with different specified standards to increase the efficiency of spectrum usage and avoid spectrum access congestion. Despite the advantages of OFDM, there are several issues with this technique that become more challenging in the case of MSCRs. OFDM performance is very sensitive to synchronisation: carrier frequency offset (CFO) causes inter-subcarrier interference, and errors in timing synchronisation can lead to inter-symbol interference. In particular, an MSCR must access a wide range of frequencies depending on the standard in operation, leading to an increased CFO. MSCR synchronisation therefore has to be more robust to large CFO. An MSCR also demands a flexible and strict spectral leakage filter to suppress the inherently large side lobes that are characteristic of OFDM, in order to avoid Inter Channel Interference (ICI). Moreover, an MSCR requires a short reconfiguration time when switching baseband processing from one standard to another, so as not to interrupt communication. This research focuses on three critical challenges for implementing an MSCR baseband:

First, a robust and efficient synchronisation method is proposed with the aim of maximising the performance of OFDM-based systems and reducing their power dissipation. The proposed method is robust to large CFO and provides more accurate fractional frequency offset (FFO) estimation, as well as accurate integer frequency offset (IFO) estimation, compared to conventional autocorrelation-based schemes. A multiplierless technique is applied to the proposed method to reduce computational complexity and resource utilisation in hardware implementation, particularly when using reconfigurable logic devices.

Second, the research proposes a novel method that embeds baseband filtering within a cognitive radio (CR) architecture and is able to meet the most stringent spectral leakage specifications of recent standards. The proposed method, performed at baseband, relaxes the otherwise strict RF front-end filter requirements, significantly reducing total system cost.

Last but not least, the research explores the feasibility of designing low-power, low-cost multi-standard radios to improve bandwidth efficiency and avoid spectrum congestion. The proposed system is based on OFDM and implemented on an FPGA, coupling parameterised modules and partial reconfiguration (PR) modules to achieve flexibility while minimising reconfiguration time.

## Contents

- Acknowledgments
- Abstract
- List of Abbreviations
- List of Notation
- 1 Introduction
  - 1.1 Background
  - 1.2 Objective and Motivation
  - 1.3 Research Contributions
  - 1.4 Organization
  - 1.5 Publication
- 2 Background Literature
  - 2.1 Cognitive and Software Defined Radio
    - 2.1.1 Multi-Standard Cognitive Radios
    - 2.1.2 Existing Radio Platforms
  - 2.2 Orthogonal Frequency Division Multiplexing
    - 2.2.1 The Cyclic Prefix
    - 2.2.2 OFDM Radio System
    - 2.2.3 Evaluating OFDM
    - 2.2.4 OFDM Synchronisation
      - 2.2.4.1 Timing Offsets
      - 2.2.4.2 Frequency Offset
      - 2.2.4.3 Phase Noise
    - 2.2.5 Shaping OFDM Spectral Leakage
      - 2.2.5.1 Spectrum Emission Masks in Recent Standards
      - 2.2.5.2 Dynamic Channel Requirements
      - 2.2.5.3 Filtering in OFDM Implementations
  - 2.3 Field Programmable Gate Arrays
    - 2.3.1 FPGAs for Radio Platforms
    - 2.3.2 Power Dissipation on FPGA
    - 2.3.3 Power Estimation
  - 2.4 Summary
- 3 Multiplierless Correlator Design for low-power systems
  - 3.1 Introduction
  - 3.2 Implementing Correlators
    - 3.2.1 Design of DSP48E1 Based Correlator
    - 3.2.2 Design of Multiplierless Correlator
    - 3.2.3 Implementation Results
  - 3.3 Simulation and Discussion
  - 3.4 Summary
- 4 A Method for OFDM Timing Synchronisation
  - 4.1 Introduction
  - 4.2 Related Work
    - 4.2.1 Coarse STO and Fractional CFO Estimation
    - 4.2.2 Fractional CFO Compensation
    - 4.2.3 Fine STO Estimation
  - 4.3 Proposed Fractional CFO Estimation and Synchronisation
    - 4.3.1 Frame Synchronisation and Fractional CFO Estimation
    - 4.3.2 Fractional CFO Compensation
    - 4.3.3 Simulation Results and Discussion
      - 4.3.3.1 Performance in AWGN
      - 4.3.3.2 Performance in Fading Channels
      - 4.3.3.3 Performance with Large Frequency Offset
    - 4.3.4 Hardware Implementation
      - 4.3.4.1 Implementation of Conventional Synchronizer
      - 4.3.4.2 Implementation of Proposed Synchronizer
      - 4.3.4.3 Effect of Reduced Precision
      - 4.3.4.4 Optimized Alternatives
  - 4.4 Summary
- 5 A CFO Estimation Method for OFDM Synchronisation
  - 5.1 Introduction
  - 5.2 Related Work
  - 5.3 Enhanced OFDM Synchronization Through Novel IFO Estimation Architecture
    - 5.3.1 Proposed Algorithm
    - 5.3.2 Proposed Architecture
    - 5.3.3 Simulation
      - 5.3.3.1 Performance Comparison
      - 5.3.3.2 Wordlength Optimisation
    - 5.3.4 FPGA Implementation
      - 5.3.4.1 Conventional Approach
      - 5.3.4.2 Proposed Approach
    - 5.3.5 Implementation Results
  - 5.4 Summary
- 6 A Spectrum Efficient Shaping Method
  - 6.1 Introduction
  - 6.2 … Model for Spectral Leakage Filtering
    - 6.2.1 Signal Model
    - 6.2.2 802.11p Signal and Channel Models
    - 6.2.3 802.11af Signal and Channel Models
  - 6.3 Related Work
    - 6.3.1 Pulse Shaping
    - 6.3.2 Image Spectrum Cancellation By FIR Filter
  - 6.4 A Spectrum Efficient Shaping Method
    - 6.4.1 New Spectral Leakage Filtering Method
    - 6.4.2 Novel CR Filtering Architecture
  - 6.5 Simulation Results and Discussion
    - 6.5.1 Configuration and Performance Evaluation for 802.11p
    - 6.5.2 Configuration and Performance Evaluation for 802.11af
    - 6.5.3 802.11af Spectral Efficiency
  - 6.6 Summary
- 7 A Novel Architecture for Multiple Standard Cognitive Radios
  - 7.1 Introduction
  - 7.2 Related Work
  - 7.3 Proposed OFDM-based baseband modulation for MSCR
    - 7.3.1 System Description
    - 7.3.2 Module Description
  - 7.4 Performance Analysis and Discussion
    - 7.4.1 Analysing the latency and halting time of PR module-based systems
    - 7.4.2 Analysing results of proposed OFDM-based MSCR architecture
  - 7.5 Summary
- 8 Future Work and Conclusion
  - 8.1 Conclusion
  - 8.2 Future Work
    - 8.2.1 Efficiently adaptive shaping spectral leakage
    - 8.2.2 Flexible and efficient MSCR platform
- Bibliography

# List of Figures

| Figure | Caption |
|--------|---------|
| 2.1 | The spectrum of subcarriers in OFDM [1] |
| 2.2 | OFDM transmission without cyclic prefix results in ISI among adjacent symbols |
| 2.3 | OFDM transmission with cyclic prefix avoids ISI among adjacent symbols |
| 2.4 | Inserting Cyclic Prefix in the OFDM symbol |
| 2.5 | An OFDM system model |
| 2.6 | Block diagram of an OFDM radio system |
| 2.7 | OFDM received symbol with timing offsets of -1, 1, -5 and 5 in (a), (b), (c), (d), respectively |
| 2.8 | Inter carrier interference (ICI) caused by frequency offset $\Delta f$ |
| 2.9 | The constellations of an OFDM received symbol with frequency offsets of $0.025$, $0.5$, $0.1$ and $0.25$ subcarrier spacings in (a), (b), (c), (d), respectively |
| 2.10 | The constellations of 5 consecutive OFDM received symbols with frequency offsets of $0.025$ and $0.05$ in (a), (b), respectively |
| 2.11 | The constellations of an OFDM received symbol and 5 consecutive OFDM received symbols with phase noise variance of $0.25\ \mathrm{rad}^2$ in (a), (b), respectively |
| 3.1 | Downlink preamble symbols for IEEE 802.16 |
| 3.2 | Transposed direct form correlator |
| 3.3 | Structure of DSP48E1 block inside the Virtex-6 [2] |
| 3.4 | Pipeline structure of the complex number multiply-add |
| 3.5 | Pipeline structure of correlator using DSP48E1 blocks |
| 3.6 | Structure of multiplierless correlators |
| 3.7 | Correlator power consumption at different frequencies |
| 3.8 | Correlator output with SNR = 10 dB |
| 3.9 | Detection failure rate with increasing SNR |
| 4.1 | The timing metric in [3] applied to the IEEE 802.16-2009 preamble in the AWGN channel (SNR = 10 dB) |
| 4.2 | The timing metric in [4] applied to the IEEE 802.16 preamble in the AWGN channel (SNR = 10 dB) |
| 4.3 | Proposed timing metrics applied to the IEEE 802.16 preamble in AWGN (SNR = 10 dB, CFO = 10.5) |
| 4.4 | The synchronisation flow according to the received samples within the preamble, showing its packet format above the conventional synchronisation scheme flow, and the proposed scheme below |
| 4.5 | Performance of the frame synchronisation versus the selected threshold in the AWGN channel (SNR = 10 dB) |
| 4.6 | Performance of time synchronization in AWGN channels with a frequency offset of 0.5 times the subcarrier spacing |
| 4.7 | Performance of fractional frequency offset estimation in AWGN channels |
| 4.8 | Frame synchronization performance of various methods in an SUI1 channel with respect to SNR |
| 4.9 | Frame synchronization performance of various methods in an SUI2 channel with respect to SNR |
| 4.10 | Performance of frame synchronization in an AWGN channel with uniform random frequency offset varying from -10 to 10 times carrier spacing, with respect to SNR |
| 4.11 | Architecture of the conventional synchronization FPGA implementation |
| 4.12 | Architecture for the proposed synchronization method implemented on FPGA |
| 4.13 | Implementation of energy correlator on FPGA |
| 4.15 | … $P'$ … |
| 5.… | Baseband processing block diagram |
| 5.… | Pilots in the long preamble of IEEE 802.16-2009 |
| 5.… | Circuit of known pilots shift register |
| 6.1 | Pulse Shaping operation performed on OFDM symbols |
| 6.2 | Spectral envelope due to pulse shaping OFDM symbols using three smoothing functions and different roll-off factors for 802.11p. Class C and D spectral emission mask limits are overlaid as dotted lines |
| 6.3 | Spectrum of 802.11p OFDM symbols shaped with different window functions, with the image spectrum included |
| 6.4 | Spectra of OFDM symbols for 802.11p using different FIR interpolation filters, with $L=8$ |
| 6.5 | The CR-Based architecture for adaptive OFDM spectral leakage shaping |
| 6.6 | Spectrum of 802.11p signal of the proposed CR architecture after interpolation |
| 6.7 | Spectrum of 802.11p signal using option *Prop1* with 20th order FIR filtering |
| 6.8 | Spectrum of 802.11p signal for *Prop2* with 12th order FIR filtering |
| 6.9 | Spectrum of 802.11af signal using the proposed CR architecture |
| 6.10 | Fitting Filtered Spectrum of 802.11af signal to SEMs |
| 7.1 | The structure of a generic MSCR system |
| 7.2 | The received FIFO module |
| 7.3 | The block diagram of Synchronisation module |
| 7.4 | The block diagram of frequency compensation module |
| 7.5 | The block diagram of fine STO estimation module |
| 7.6 | The block diagram of IFO estimation and channel equalisation |
| 7.7 | The block diagram of phase tracking module |
| 7.8 | Comparison of system latency |
| 7.9 | The bitstream size of PR modules |
| 7.10 | The latency of sub-modules for three standards |
| 7.11 | The configuration time and latency of sub-modules for OFDM-based MSCR system |
| 7.12 | A scenario of a transmission |
| 7.13 | The halting time comparison of the system for three different approaches |
| 7.14 | A comparison of the three approaches in terms of system latency and FIFO requirement |

# List of Tables

| Table | Caption |
|-------|---------|
| 3.1 | Resource utilisation summary |
| 3.2 | Correlator power consumption at 50 MHz |
| 4.1 | Resources required for computing $P'$ on FPGA with different word lengths, $Q1.f1$ |
| 4.2 | Resources required for computing $R'$ on FPGA with different word lengths, $Q1.f2$ |
| 4.3 | Total resources consumed by a full word length implementation of SoA and four reduced complexity instances of the proposed method. Dynamic (Dpwr) and quiescent power (Qpwr) consumption are reported in mA. Maximum frequency is reported in MHz |
| 4.4 | Detailed Resource Comparison between two synchronization methods |
| 5.1 | Resource utilisation and dynamic power consumption of IFO estimators |
| 6.1 | Major parameters of 802.11p and 802.11af OFDM PHYs |
| 6.2 | Popular window-based FIR filter lengths |
| 7.1 | System specifications of three supported OFDM-based standards |
| 7.2 | Parameterised values according to supported standards |
| 7.3 | Allocation vector coding |
| 7.4 | Resources for 802.22 OFDM-based implementation |

### List of Abbreviations

ADC Analogue to Digital Converter

ASICs Application Specific Integrated Circuits

AWGN Additive White Gaussian Noise

AXI Advanced eXtensible Interface

BCU Basic Channel Unit

BER Bit Error Rate

BPSK Binary Phase Shift Keying

CFO Carrier Frequency Offset

CIR Channel Impulse Response

COTS Commercial Off-The-Shelf

CP Cyclic Prefix

CPE Common Phase Error

CR Cognitive Radio

DAC Digital to Analogue Converter

DFT Discrete Fourier Transform

DMT Discrete Multi Tone

DSA Dynamic Spectrum Access

DSP Digital Signal Processing

DSRC Dedicated Short-Range Communications

FBMC Filter Bank Multi-Carrier

FDM Frequency Division Multiplexing

FFO Fractional Frequency Offset

FFT Fast Fourier Transform

FPGA Field Programmable Gate Array

ICAP Internal Configuration Access Port

ICI Inter Carrier Interference

IDFT Inverse Discrete Fourier Transform

IFFT Inverse Fast Fourier Transform

IFO Integer Frequency Offset

IQ In-phase - Quadrature

ISI Inter Symbol Interference

IUs Incumbent Users

LNA Low Noise Amplifier

MAN Metropolitan Area Networks

MCM Multi-Carrier Modulation

MIMO Multi-Input Multi-Output

ML Maximum Likelihood

MPoC Multiple Processors on Chip

MSB Most Significant Bit

MSCRs Multiple Standard Cognitive Radios

MSE Mean Squared Error

NC-OFDM Non-Contiguous Orthogonal Frequency Division Multiplexing

OFDM Orthogonal Frequency Division Multiplexing

PAPR Peak-to-Average Power Ratio

PAR Place And Route

PDF Probability Density Function

PLL Phase-Locked Loop

PR Partial Reconfiguration

PUs Primary Users

QAM Quadrature Amplitude Modulation

QPSK Quadrature Phase Shift Keying

RF Radio Frequency

RTL Register Transfer Level

RTO Residual Timing Offset

RTV Road to Vehicle

SEM Spectrum Emission Mask

SER Symbol Error Rate

SNR Signal to Noise Ratio

STO Symbol Timing Offset

SUI Stanford University Interim

SUs Secondary Users

TVBD TV Band Devices

TVWS Television White Spaces

V2V Vehicle-to-Vehicle

WLAN Wireless Local Area Networks

XPA Xilinx Power Analyzer

XPE Xilinx Power Estimator

# List of Notation

| Symbol           | Meaning                                                         |
|------------------|-----------------------------------------------------------------|
| $*$              | Convolution                                                     |
| $(\cdot)^T$      | Matrix or vector transpose                                      |
| $i$              | imaginary unit                                                  |
| $\lvert a \rvert$ | absolute value of the number $a$                               |
| $\angle a$       | argument of a complex number in $[0, 2\pi]$                     |
| $\hat{a}$        | compensated value of the parameter $a$                          |
| $x(t)$           | continuous-time signal                                          |
| $x[d]$           | discrete-time signal; $d$ is the time index                     |
| $x[d]'$          | offset discrete signal                                          |
| $\widehat{x[d]}$ | compensated discrete signal                                     |
| $\Delta f_C$     | carrier frequency offset normalized to the intercarrier spacing |
| $\mathrm{sinc}(t)$ | $\triangleq \frac{\sin(\pi t)}{\pi t}$                        |
| $(f*g)(m)$       | $\triangleq \sum_{n} f(n)g(m-n)$, convolution product           |

## Chapter 1

### Introduction

Wireless transmission plays a key role in our everyday lives, and has enabled the exponential growth in connectivity that we have witnessed over the last few decades. An exponential increase in the number of users and nodes, and the throughput demanded between them, means significant developments are crucial in the fundamental methods by which this wireless communication is enabled. The previous approach of defining fixed wireless standards for use in fixed portions of radio spectrum is giving way to a more dynamic approach to exploiting this scarce resource. Practical studies have shown that many licensed bands are relatively unused across time and frequency [5]. To improve the efficiency of radio spectrum use, the concept of unlicensed users temporarily reusing unused spectrum in licensed bands is currently being researched. This concept is known as dynamic spectrum access (DSA) [6]. Wireless communication systems for realising DSA must be reconfigurable, to support different radio standards in different environments, and adaptive, to react to changing channel conditions without interfering with licensed users and other unlicensed opportunistic users.

A cognitive radio (CR) is a node that is able to adapt its parameters to optimise performance based on interaction with the environment, as well as to perform DSA. A cognitive radio can modify parameters such as transmit power, coding rate, frame size, bandwidth, and centre frequency, in real time to obtain suitable performance in a given, changing, environment. The radio baseband should also be reconfigurable to enable support for multiple standards such as WiFi, WiMAX, GSM, WCDMA, and other defined access schemes. More advanced adaptive standards are intrinsically flexible and this flexibility should be managed to optimise performance in the channel.

This thesis explores techniques for enabling the mapping and implementation of dynamic cognitive radios on FPGA platforms. By dynamic, we are referring to the ability to modify baseband processing to suit different transmission standards. The cognitive portion of a radio is where decisions are made to modify properties of the communication. In the context of our work, we are interested in providing a generalised interface for cognitive radio designers to be able to leverage the dynamic capabilities of the hardware platform at a higher level.

FPGAs are silicon devices that allow us to build customised hardware datapaths for a variety of applications. By exploiting parallelism inherent in many algorithms, it is possible to develop implementations that are significantly faster than equivalent software running on general purpose processors. FPGAs have long been established as a platform of choice in signal processing due to their suitability for parallel bit-level architectures that align well with many signal processing algorithms [7]. Another key capability of FPGAs that makes them attractive for cognitive radios is their reconfigurability. The hardware implemented on an FPGA can be modified at runtime, thereby enabling the dynamic capability required for implementing cognitive radios.

#### 1.1 Background

Orthogonal Frequency Division Multiplexing (OFDM) has been adopted for implementation within the Cognitive Radio field, and is a prime candidate for DSA networks in which the radio is required to be spectrally aware and able to dynamically access idle parts of the spectrum. OFDM is an efficient multi-carrier modulation (MCM) technique that provides robustness to frequency selective channels and a DSA capability based upon spectrum pooling, where unlicensed users may temporarily access spectral resources during the idle periods of licensed users [8]. When utilising a given bandwidth of spectrum, OFDM subcarriers that cause interference with licensed users can be selectively disabled. This technique is defined as non-contiguous orthogonal frequency division multiplexing (NC-OFDM) [9].

The lower priority of CRs raises a challenge in terms of transmission capability and quality of service. When the spectrum allowed for a CR system is fully occupied by primary users (PUs) and incumbent users (IUs), the transmission of CRs can be blocked. Multiple Standard Cognitive Radios (MSCRs) are able to operate in multiple frequency bands with different specified standards. MSCRs are hence a more flexible generalisation of CRs as they can operate across different bands and standards to increase transmission capability and enhance bandwidth efficiency.

#### 1.2 Objective and Motivation

An MSCR requires a platform that provides sufficient flexibility, high computational throughput, and power efficiency. Most practical CRs are built using powerful general purpose processors to achieve flexibility through software, but they fail to offer the required computational throughput and also tend to suffer from high power consumption. Multiple processor-on-chip (MPoC) architectures can enhance the throughput; however, these platforms require problems to be formulated in a way suitable for parallel programming and tend to need additional memories to buffer data transferred between parallel processes. Moving data around these processes may consume significant power. Conversely, custom hardware designs such as application specific integrated circuits (ASICs) offer highly efficient computation, high throughput and possibly low power dissipation; however, they suffer from a lack of flexibility, in that operating parameters tend to need to be decided at design time. In an application area with fast moving standards and occupied by multiple standards, ASICs are likely to be inefficient in terms of both time and cost for MSCR. However, modern FPGAs which support partial reconfiguration (PR) are an attractive candidate for cognitive radios. They provide not only the high performance of a custom data-path implementation, but also flexibility with low power dissipation.

The feasibility of MSCR implementation depends on the flexibility of switching the system parameters specified by one standard to those of another. Multi-carrier modulation based techniques are widely investigated for state of the art and also next-generation wireless standards. These techniques divide communication channels into multiple sub-channels that allow a system to perform parallel data transmission over smaller sub-channels to combat amplitude and phase distortion, impulse noise, multipath propagation and so on. OFDM and Filter Bank Multi Carrier (FBMC) are two types of multi-carrier modulation. OFDM modulation has been the dominant technique adopted for multiple applications in high bit-rate wireless communication systems such as Wireless Local Area Networks (WLAN) standardized in IEEE 802.11 and Metropolitan Area Networks (MAN) in IEEE 802.16. It is also effective for performing spectral sensing and carrier allocation for CRs. Furthermore, in comparison to FBMC modulation, OFDM modulation requires only a simple and low cost implementation and can be effectively parameterized to flexibly switch the system from one standard to another. OFDM should therefore be a suitable candidate for an MSCR system. The advantages of coupling OFDM modulation and the FPGA platform are investigated for the feasibility of implementing the proposed low cost, low power MSCR system. The ability to perform PR in FPGAs, together with effective parameterization in OFDM modulation, can enable an MSCR to be built not only for current OFDM-based standards such as 802.11, 802.16, and 802.22, but also potentially to accept soft upgrades for future OFDM-based standards.

OFDM-based systems have two main intrinsic disadvantages, related to synchronization and spectral leakage. They become critical challenges in scenarios of OFDM-based MSCR. State of the art synchronization for OFDM-based systems can tolerate only a small carrier frequency offset (CFO), which leads to highly strict constraints on RF front-end design. In MSCR systems, an RF front-end is required to have the ability to switch carrier frequencies across a wide frequency range according to spectral operating areas. This tends to suggest that the constraint of a small CFO may not be feasible. Therefore, new synchronization methods that are robust to large CFO should be researched for OFDM-based MSCR systems. Another challenge is that CRs normally demand small spectral leakage for both in-band and out-of-band transmitted signals, whereas OFDM is well known to induce significant amounts of spectral leakage. Pulse shaping techniques have therefore been widely studied to limit the spectral leakage caused by the OFDM signal. However, pulse shaping techniques cannot effectively filter the out-of-band spectrum when the frequency guard is small, because of the presence of image spectra caused by interpolation or DAC operation. Fortunately, MSCR systems provide flexible parameters that allow the possibility of a frequency guard extending technique which, when combined with an effective pulse shaping technique, can meet the spectral leakage requirements.

The motivation for this research is to study a low power architecture for MSCR based on coupling PR modules and parameterised modules to minimise the adaptation time. In addition, a novel method for synchronization and an effective pulse shaping technique have been proposed and shown to be suitable for the requirements of MSCR systems.

#### 1.3 Research Contributions

This thesis addresses the design of low complexity, low power wideband radios with the flexibility to support multiple standards. The contributions of the thesis, which are published or under review as listed in the "Publications" section, are as follows:

1. A multiplierless cross-correlation is proposed in [J1] to perform OFDM synchronisation for low-power, low-complexity systems. The conventional approach, given the availability of embedded DSP blocks on modern FPGAs, is to use standard multiplier-based cross-correlation. However, this can consume a significant number of DSP blocks, and may not fit on low-power devices. A comparison of a DSP48E1 Slice-based design and four different quantisations of multiplierless correlation is investigated in terms of resource utilisation and power consumption on Xilinx Virtex-6 and Spartan-6 FPGA devices. OFDM timing synchronisation accuracy is evaluated for each system at varying signal-to-noise ratios. This research shows that even relatively coarse multiplierless coefficient quantisation can yield accurate timing synchronisation, and do so at high clock speeds. Multiplierless designs enjoy reduced power consumption over the DSP48E1 Slice-based design, and can be used where DSP Slice resources are insufficient, such as on low-power FPGA devices.

2. A novel OFDM synchronization method that combines robust performance with computational efficiency is proposed in [J2]. FPGA prototyping is used to investigate the trade-off between the number of computations to be performed and computation word length with respect to both synchronization performance and power consumption. Through simulation, the proposed method is shown to provide accurate fractional CFO estimation as well as STO estimation in a range of channels. In particular, it can yield excellent synchronization performance in the face of a CFO that is larger than many state-of-the-art synchronization implementations can handle. The system implementation demonstrates efficient resource usage and reduced power consumption compared to existing methods and this is explored as a fine-grained trade-off between performance and power consumption. The proposed method is robust and suitable for use in low-power radios or some multi-standard radios, enabling less precise analogue front-ends to be used.

3. The third contribution, presented in [J3], proposes a novel approach for implementing IFO estimation, which is shown to reduce both the power and computational cost of OFDM implementations. Performing the IFO cross-correlation using four-fold resource sharing reduces the estimation cost. Meanwhile, adopting a multiplierless technique and carefully optimising word lengths yields significant power reduction, while maintaining sufficient accuracy to meet performance requirements. The method is studied theoretically, using numerical simulations, as well as with post place-and-route analysis. The novel method is shown to achieve excellent performance, similar to the theoretically achievable bound. In fact, performance is significantly better than conventional techniques, while being much more efficient. When applied to IEEE 802.16-2009, the proposed method saves significant power over the conventional technique on low-power FPGA devices. The method is also applicable to IEEE 802.11 and IEEE 802.22. Coupling the robust OFDM synchronisation presented in [J2] with IFO estimation at baseband is important because it allows the RF front-end specification to be relaxed, thus reducing system cost. In fact, for some multi-standard radios, and applications suffering significant Doppler shift, RF constraints may be infeasible without techniques such as IFO estimation.

4. The fourth contribution, first published in [C1] and then extended in [J4], proposes a novel method that exploits the CR architecture in a new filtering scheme for adaptively shaping the spectral leakage of an OFDM signal according to the transmitted power and Spectrum Emission Mask (SEM) requirements. OFDM presents a disadvantage in terms of spectral leakage due to large side lobes in the signal spectrum. In addition, some recent OFDM-based standards, such as 802.11p for vehicular communication and 802.11af for reusing Television White Spaces (TVWS), impose strict requirements on spectral leakage, which raises a tough challenge for radio frequency (RF) front-end circuits. The proposed method can achieve the specification for the most stringent SEM of 802.11p. For 802.11af, it not only meets the requirement for strict SEM filtering but also feasibly increases the spectrum usage by an additional 10 sub-carriers in a basic channel band, compared to conventional techniques, without violating the SEM specifications. The proposed method, performed at baseband to relax the strict constraints of the RF front-end, also allows the RF front-end to be implemented using commercial off-the-shelf (COTS) RF hardware from older standards such as 802.11a or 802.11ac, resulting in a much reduced total system cost.

5. Cognitive radios that support multiple standards and modify operation depending on environmental conditions are becoming more important as the demand for higher bandwidth and efficient spectrum use increases. Traditional implementations in custom ASICs cannot support such flexibility, with standards changing at a faster pace, while software implementations of baseband communication fail to achieve the performance required. Hence, FPGAs offer an ideal platform bringing together flexibility, performance, and efficiency. The fifth contribution, presented in [C2, J5], proposes and explores possible techniques for designing multi-standard radios on FPGAs. This contribution presents a mathematical analysis of the performance of the proposed architecture for MSCR based on a heterogeneous mixture of PR modules and parameterised modules. The calculated results based on FPGA synthesis show that the proposed architecture achieves a significant reduction in system latency compared to conventional structures. The proposed method also requires a much smaller FIFO than conventional structures. This allows an MSCR to be implemented on an FPGA platform, yielding a low cost, low power system.

#### 1.4 Organization

This thesis is organized as follows: Chapter 2 presents the comprehensive background of this research. Power consumption on FPGA devices is investigated. Power estimation tools and low power design strategies, particularly those focused on implementing OFDM systems on FPGAs, are studied and discussed. This chapter provides the background for OFDM in terms of mathematical representation and functionality, and then the advantages and limitations of OFDM are discussed. Moreover, the chapter contains an introduction to MSCR. Related work on MSCR research is presented and discussed to show the main challenges of implementing an MSCR. The synchronisation issue of OFDM systems is considered in depth, and related work focusing on achieving good synchronisation performance is discussed in terms of its merits as well as its limitations. The challenge of shaping spectral leakage is studied for the stringent SEM constraints of state of the art wireless standards.

Chapter 3 presents the design of several correlators for timing synchronisation with preamble symbols based upon the IEEE 802.16d standard. The comparison between a DSP48E1 Slice-based design and four different quantisations of multiplierless correlation is shown and discussed in terms of resource utilisation and power consumption. OFDM timing synchronisation accuracy is evaluated for each system at varying signal-to-noise ratios.

Chapter 4 researches the issue of synchronisation in OFDM system receivers in terms of timing offset and frequency offset. A robust and efficient synchronisation method is proposed and discussed. The performance of the proposed method is evaluated in comparison to methods in previous works in terms of robustness to large CFO, accuracy of time synchronisation and fractional CFO estimation in both additive white Gaussian noise (AWGN) and frequency selective channels. The results show that the proposed method can achieve a more accurate fractional CFO estimation and be robust to large CFO while still obtaining acceptable accuracy for frame synchronisation. In addition, this chapter presents a novel approach for implementing IFO estimation, which reduces both power and computational cost in implementation. The efficiency and hardware reduction are shown in the simulation results as well as implementation reports.

Chapter 6 studies a novel baseband scheme for shaping spectral leakage that relies upon the cognitive radio architecture. The proposed method can meet the specification of class D, the most stringent of the four 802.11p SEMs, as well as the stringent SEM of 802.11af. The proposed method can also enhance spectral efficiency in the case of reusing the Television White Spaces in the 802.11af standard.

Chapter 7 explores the feasibility of designing efficient multi-standard radios to improve bandwidth efficiency and avoid spectrum congestion. A novel MSCR architecture is proposed and investigated in terms of performance and hardware reduction.

Finally, Chapter 8 gives a brief summary of the research contributions as well as the conclusion of the work presented in this thesis. Some research directions for future work related to the contributions are also identified.

#### 1.5 Publication

Some of the work presented in the thesis has been written up in a number of published and submitted papers listed below:

1. T. H. Pham, S. A. Fahmy, and I. V. McLoughlin, "Low-power correlation for IEEE 802.16 synchronisation on FPGA," in *IEEE Transactions on Very Large Scale Integration (VLSI) Systems*, vol. 21, no. 8, pp. 1549-1553, Aug. 2013.
2. T. H. Pham, I. V. McLoughlin, and S. A. Fahmy, "Robust and Efficient OFDM Synchronisation for FPGA-Based Radios," in *Circuits, Systems, and Signal Processing*, vol. 33, no. 8, pp. 2475-2493, Aug. 2014, Springer.
3. T. H. Pham, I. V. McLoughlin, and S. A. Fahmy, "Shaping Spectral Leakage for IEEE 802.11p Vehicular Communications," to appear in *Proceedings of IEEE Vehicular Technology Conference (VTC Spring)*, Seoul, Korea, May 2014.
4. T. H. Pham, S. A. Fahmy, and I. V. McLoughlin, "Efficient Multi-Standard Cognitive Radios on FPGAs," PhD Forum Poster in *Proceedings of the International Conference on Field Programmable Logic and Applications (FPL)*, Munich, Germany, September 2014.
5. T. H. Pham, S. A. Fahmy, and I. V. McLoughlin, "Efficient Integer Frequency Offset Estimation Architecture for Enhanced OFDM Synchronization," to be submitted to *IEEE Transactions on Very Large Scale Integration (VLSI) Systems*.
6. T. H. Pham, S. A. Fahmy, and I. V. McLoughlin, "Spectrally Efficient Emission Mask Shaping for OFDM Cognitive Radios," to be submitted to *IEEE Transactions on Communications (TCOM)*.
7. T. H. Pham, S. A. Fahmy, and I. V. McLoughlin, "Efficient OFDM-based baseband processing for Multi-Standard Cognitive Radios on FPGAs," to be prepared.

## Chapter 2

### Background Literature

#### 2.1 Cognitive and Software Defined Radio

Software Defined Radios (SDRs) are radio communication systems in which the processing components (e.g. mixers, filters, modulators/demodulators, etc.) are realised by means of software on a computer or embedded system instead of in hardware circuits. SDR provides a flexible platform on which application-specific radio systems can be implemented. The rapidly increasing level of flexibility and functionality provided by software platforms has led to an increase in the variety of radio applications realizable by SDR. Cognitive Radio (CR) is an evolution of SDR incorporating intelligence in adapting the radio to a changing environment. A significant portion of CR research, as in early work by Mitola [10], focused primarily on upper layer adaptation, in which the radio platform can adapt to anticipated user or application requirements. According to Mitola [11], the evolution from SDR to CR can be illustrated by the gaining of three main capabilities [12]: awareness, adaptation, and cognition. Awareness allows the radio to predict or enhance information from the environment. For example, RF-location awareness allows a wireless terminal to correlate information from different types of sensing to determine location. Adaptation can be performed once a terminal is aware of the environment. A location aware radio, when moved, can prioritise free spectrum search in bands that are historically inactive in the new location. Cognition learns from the environment and deduces adaptation rules based on its experience on a general set of objectives, for example obtaining the best quality of service or the lowest communication cost at a baseline service quality.

Other researchers have taken a more system level view of cognitive radio. Research at Virginia Tech [13] explores how to exploit the capabilities of SDR platforms to maximise different aspects of performance. Rieser et al. [14, 15] proposed the concept of a cognitive engine, separating the cognition from the radios and focusing on the physical layer. They developed this component in a way that it could intelligently control multiple radios. Cognitive radios can be built using fixed function radios, but the flexibility provided by SDRs allows more complex applications to be explored. Intelligent algorithms coupled with capable radio platforms allow a CR to change functionality to adapt to dynamic conditions [13]. With fixed platforms, CRs are limited to supporting simple tasks like spectrum sensing, followed by setting parameters of hardware components in response. SDR platforms allow cognitive radios to significantly change configuration and role according to a variety of stimuli. Such adaptation is important in situations where there are finite resources or when the desired behaviour might be re-defined after deployment.

Recently, CR has gained further importance due to spectrum scarcity and inefficient spectrum usage [5, 16]. CRs enable a situation where spectrum allocated to licensed users, known as Primary Users (PUs), can be reused by unlicensed users, referred to as Secondary Users (SUs), when the PUs are not using it. SUs using locally unoccupied spectrum can improve overall utilisation efficiency of licensed spectrum. The Federal Communications Commission (FCC) has also given a definition for CR as "a radio or system that senses its operational electromagnetic environment and can dynamically and autonomously adjust its radio operating parameters to modify system operation, such as maximize throughput, mitigate interference, facilitate interoperability, access secondary markets." [17].

#### 2.1.1 Multi-Standard Cognitive Radios

Building cognitive radios to act as secondary users (SUs) requires that they are able to find and transmit in unoccupied spectrum assigned to primary users (PUs), and this must be done without causing harmful interference to the PUs. Other incumbent users (IUs) must also be avoided. Apart from the critical issues of sensing for unused spectrum and allocating bands for transmission, the lower priority of SUs presents a problem in terms of transmission capability and quality of service. If the spectrum bands allowed for a CR are fully occupied by PUs and IUs, transmission might be blocked. Multi-standard cognitive radios can operate in multiple frequency bands with different specified standards, providing greater flexibility.

Multi-carrier modulation techniques offer an ideal opportunity for such systems due to their regularity and parameterisation. OFDM and Filter Bank Multi Carrier (FBMC) are two types of multi-carrier modulations. OFDM modulation has been the dominant technique adopted for many wireless standards and has been investigated in terms of spectral sensing and carrier allocation for CRs. Furthermore, OFDM system implementation is simple, low cost, and can be more effectively parameterised in comparison to FBMC systems. A single baseband implementation can be made to flexibly support multiple standards like 802.11 [18], 802.16 [19], and 802.22 [20], as well as supporting future OFDM-based standards.

This requires the ability to switch baseband processing from one standard to another. This in turn means performing variable length FFT/IFFT operations, inserting cyclic prefixes of configurable length, and handling different pilot vectors as well as different preambles. As a result, the processing modules should be designed to support all requirements of the different standards.
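The idea of one parameterised baseband serving several standards can be sketched in software. The following is a minimal illustration only, not the architecture proposed in this thesis: the `OfdmParams` class, the profile names, and the FFT and cyclic prefix lengths are hypothetical placeholders, and pilots and preambles are omitted. Real values would have to be taken from the relevant standards.

```python
from dataclasses import dataclass
import numpy as np

@dataclass(frozen=True)
class OfdmParams:
    """Hypothetical per-standard parameter set (illustrative values only)."""
    name: str
    fft_size: int   # number of subcarriers N
    cp_len: int     # cyclic prefix length in samples

# Illustrative profiles; real values must be taken from the standards.
PROFILES = {
    "profile_a": OfdmParams("profile_a", fft_size=64, cp_len=16),
    "profile_b": OfdmParams("profile_b", fft_size=256, cp_len=64),
}

def modulate_symbol(X, params):
    """IFFT followed by cyclic prefix insertion for one OFDM symbol."""
    if len(X) != params.fft_size:
        raise ValueError("expected %d subcarrier values" % params.fft_size)
    s = np.fft.ifft(X)
    return np.concatenate([s[-params.cp_len:], s])  # CP is a copy of the tail

def demodulate_symbol(r, params):
    """Cyclic prefix removal followed by FFT."""
    return np.fft.fft(r[params.cp_len:params.cp_len + params.fft_size])

# The same two functions serve both profiles; only the parameters change.
results = {}
for key, p in PROFILES.items():
    X = np.exp(2j * np.pi * np.arange(p.fft_size) / 4)  # QPSK-like test data
    tx = modulate_symbol(X, p)
    results[key] = (len(tx), bool(np.allclose(demodulate_symbol(tx, p), X)))
```

Switching standard in this sketch amounts to selecting a different parameter record, which mirrors the role of parameterised modules; structural changes that parameters cannot express are where reconfiguration would be needed.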

Two additional challenges must be addressed. OFDM systems typically tolerate only a small carrier frequency offset (CFO), leading to strict constraints on the design of the RF front-end. In a multi-standard system, the RF front-end accesses a wide range of frequencies depending on the standard in operation. Such a precise and yet wide ranging frequency requirement makes RF front-end design difficult and requires very expensive components. CRs also demand small spectral leakage for both in-band and out-of-band transmitted signals to avoid causing harmful interference to primary users, while OFDM signals have intrinsically large side lobes leading to a potentially large degree of spectral leakage. Hence, synchronisation and leakage management are more pressing issues in multi-standard radios.
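The sensitivity to CFO can be illustrated with a short numerical experiment. This is a simplified sketch under stated assumptions: a noiseless channel, perfect timing, and a CFO modelled as a phase rotation of $2\pi\varepsilon n/N$ at sample $n$, where $\varepsilon$ is the offset normalised to the subcarrier spacing. The function names are hypothetical.

```python
import numpy as np

def apply_cfo(s, eps):
    """Rotate time samples by a CFO of eps subcarrier spacings (assumed model)."""
    N = len(s)
    return s * np.exp(2j * np.pi * eps * np.arange(N) / N)

rng = np.random.default_rng(2)
N = 64
X = (rng.choice([-1, 1], N) + 1j * rng.choice([-1, 1], N)) / np.sqrt(2)
s = np.fft.ifft(X)  # one OFDM symbol, no cyclic prefix

def error_power(eps):
    """Mean squared error between demodulated and transmitted symbols."""
    return np.mean(np.abs(np.fft.fft(apply_cfo(s, eps)) - X) ** 2)

err_small = error_power(0.01)  # small fractional CFO: mild distortion
err_large = error_power(0.30)  # large fractional CFO: severe interference

# An integer CFO causes no inter-carrier interference in this model, but it
# shifts every data symbol by a whole number of subcarriers.
X_shifted = np.fft.fft(apply_cfo(s, 1.0))
```

In this model a fractional offset destroys orthogonality, while an integer offset leaves the subcarriers orthogonal but misaligned, which is why the two components are usually estimated separately.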

The interface to higher layer processing is another important factor. Many hardware radio platforms are extremely difficult to design for or to modify. Hence, only hardware experts can use them. While detailed optimisation of low level blocks is important, providing a general interface for implementing higher layer processing is also important. This ensures that radio experts can use the system to investigate cognitive radio techniques without the need for specific advanced low-level FPGA expertise. Our work tries to offer well-designed and documented parameterised signal processing blocks in hardware, with a high level management interface to enable radio designers to benefit from the dynamic capabilities of FPGA platforms.

#### 2.1.2 Existing Radio Platforms

There have been comprehensive efforts at many research institutions and labs in the areas of SDR and CR. The Kansas University Agile Radio (KUAR) hardware platform [21] is a mature radio platform built around a fully-featured Pentium PC with a Xilinx Virtex II FPGA. Trinity College Dublin also worked on radio platform development with the Iris architecture for cognitive systems [22]. Iris [23] received some limited support for FPGAs and partial reconfiguration [24, 25] but the hardware-software interface had significant overhead. More recently, they have shown a desire to support the Zynq architecture [26]. Rutgers University developed the WiNC2R platform [27] that integrates FPGAs for both baseband and network layer implementation. The Berkeley Wireless Research Center developed a CR network emulator on FPGA [28]. WARP [29] is another well-established FPGA based platform developed at Rice University, with a recent revision (v3) hosting a Xilinx Virtex 6 FPGA and up to 2 RF interfaces. The baseband can be implemented in the FPGA fabric, while the higher layers are coded in C as a standalone application on an embedded MicroBlaze processor. Microsoft Research Asia launched SORA [30], a PCI card designed to allow powerful radios to be implemented in Windows desktop computers. A Xilinx Virtex 5 FPGA is used to provide high bandwidth deterministic communication between the PC and the radio front-end, but the physical layer is designed to be implemented in software on high performance desktop class processors. After much promise, little traction was gained due to the overhead of programming complex signal processing on general purpose processors.

GNU Radio [31] is a widely used platform in academia, and is typically coupled with the Ettus USRP radio front end. It is a software application designed to run on general purpose processors. Embedded implementation is also possible, e.g. on the Ettus USRP E100, where GNU Radio is executed on an embedded ARM processor. The radio systems are modelled by flowgraphs that represent the flow of data through the components of the system. These are designed by connecting processing blocks in the GNU Radio Companion (GRC) graphical tool [32]. GReasy is an extension of GNU Radio that supports the computational benefits of FPGA acceleration while aiming to reduce FPGA compile times [33]. Within the GReasy environment, FPGA-based processing units are added to the GNU Radio library, allowing a user to arbitrarily insert optimized hardware and software modules into a given design. For back-end bitstream generation, GReasy employs TFlow, a toolset developed at Virginia Tech that allows the rapid assembly of FPGA accelerator modules through a precompiled hardware library, and which places and routes parameterized pre-compiled modules into a final FPGA bitstream [34].

FPGAs have gained wider traction in SDR and CR frameworks in recent years. Recent devices can process at rates of over 5000 GMAC/s (giga integer multiply-accumulate operations per second). This enables highly advanced baseband systems to be implemented at a low power budget compared with other programmable architectures. The dynamic reconfiguration ability of FPGAs offers an opportunity to take this integration to the next level with flexible baseband processing. Most platforms described here see the FPGA as a platform for static baseband implementation. We believe that partial reconfiguration offers the flexibility required for CR implementation with the performance benefits of hardware processing.

FIGURE 2.1: The spectrum of subcarriers in OFDM [1].

#### 2.2 Orthogonal Frequency Division Multiplexing

OFDM is a multicarrier modulation scheme used in both wireline and wireless communication in which a high-rate data stream is split into multiple parallel low-rate streams that are modulated onto multiple sub-carriers. Adjacent modulated sub-carriers are theoretically orthogonal, with zero mutual interference. OFDM signals are modulated using sub-carriers across the frequency range, similar to frequency division multiplexing (FDM). The main difference is that FDM conventionally multiplexes the signals into separate small bands, in which the signal in each band is modulated using a specific sinusoidal carrier, while OFDM signals are modulated using overlapping orthogonal sub-carriers. In the frequency domain, each sub-carrier is mathematically represented by a sinc pulse, which overlaps with the other sub-carriers as shown in Fig. 2.1. Note that each sinc pulse has nulls at the centre frequencies of all the other sub-carriers. Ideally, there is thus zero inter-carrier interference (ICI) in an OFDM signal.
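The null-at-neighbours property can be checked numerically. The following is a minimal sketch, assuming a rectangular symbol window and a frequency axis normalised to the subcarrier spacing, so that subcarrier $k$ is centred at the integer $k$. The helper name `subcarrier_spectrum` is hypothetical; NumPy's `np.sinc` uses the normalised definition $\sin(\pi x)/(\pi x)$ given in the List of Notation.

```python
import numpy as np

def subcarrier_spectrum(f, k):
    """Spectrum of subcarrier k on a frequency axis normalised to the
    subcarrier spacing (assumption: rectangular symbol window)."""
    return np.sinc(f - k)  # np.sinc(x) = sin(pi x) / (pi x)

N = 8
centres = np.arange(N)
# Row k holds subcarrier k sampled at every subcarrier centre frequency.
leak = np.array([subcarrier_spectrum(centres, k) for k in range(N)])

# Each subcarrier is 1 at its own centre and numerically 0 at all others.
max_offdiag = np.max(np.abs(leak - np.eye(N)))

# Between the nulls the side lobes are far from zero: midway between the
# first and second nulls the magnitude is roughly 13 dB below the main lobe.
sidelobe_db = 20 * np.log10(abs(subcarrier_spectrum(1.5, 0)))
print(max_offdiag, sidelobe_db)
```

The first quantity illustrates why ideal OFDM has no ICI at the subcarrier centres, while the second hints at why the slowly decaying side lobes become a spectral leakage concern outside the band.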

An OFDM symbol signal can be expressed at baseband as a sum of modulated complex exponentials:

$$s(t) = \sum_{k=0}^{N-1} X_k e^{i2\pi k\Delta f t}, \tag{2.1}$$

where $X_k$ is the complex data symbol (e.g. BPSK, QPSK, or QAM) carried by the $k$th of $N$ subcarriers, and $\Delta f$ is the subcarrier spacing. Sampling this OFDM symbol signal with a sampling period of $T_S$ gives:

$$s(nT_S) = \sum_{k=0}^{N-1} X_k e^{i2\pi k\Delta f nT_S}, \tag{2.2}$$

With a subcarrier spacing of $\Delta f = 1/(NT_S)$, the sampled OFDM signal is, up to a scaling factor, the inverse N-point discrete Fourier transform (IDFT) of the sequence $X_k$, where each $X_k$ is treated as a discrete point in the frequency domain. Conversely, the sampled OFDM symbol signal can be demodulated using the discrete Fourier transform (DFT). OFDM modulation and demodulation are hence performed by computing the IDFT and DFT, respectively, expressed as:

$$s[n] = \frac{1}{N} \sum_{k=0}^{N-1} X[k] e^{i2\pi \frac{k}{N}n}, \qquad (2.3)$$

$$X[k] = \sum_{n=0}^{N-1} s[n]e^{-i2\pi \frac{k}{N}n}, \qquad (2.4)$$

To compute these efficiently, OFDM systems implement the inverse Fast Fourier Transform (IFFT) and Fast Fourier Transform (FFT) to modulate and demodulate the signal instead of the direct IDFT and DFT, respectively.



FIGURE 2.2: OFDM transmission without a cyclic prefix results in ISI between adjacent symbols.

These optimised algorithms generally rely on the number of points, and hence carriers, being a power of 2.
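Equations 2.3 and 2.4 can be checked numerically. The following minimal Python sketch evaluates both sums directly on eight illustrative QPSK symbols; the symbol values and the helper names `idft` and `dft` are chosen here for illustration only. A practical system would use an FFT rather than this direct O(N²) evaluation.

```python
import cmath

def idft(X):
    """Inverse DFT as in Eq. 2.3: time samples from sub-carrier symbols."""
    N = len(X)
    return [sum(X[k] * cmath.exp(2j * cmath.pi * k * n / N) for k in range(N)) / N
            for n in range(N)]

def dft(s):
    """Forward DFT as in Eq. 2.4: sub-carrier symbols from time samples."""
    N = len(s)
    return [sum(s[n] * cmath.exp(-2j * cmath.pi * k * n / N) for n in range(N))
            for k in range(N)]

# Illustrative QPSK symbols on N = 8 sub-carriers (arbitrary values).
X = [1 + 1j, 1 - 1j, -1 + 1j, -1 - 1j, -1 - 1j, 1 + 1j, 1 - 1j, 1 + 1j]

s = idft(X)        # OFDM modulation
X_hat = dft(s)     # OFDM demodulation

max_err = max(abs(a - b) for a, b in zip(X, X_hat))
print(f"largest reconstruction error: {max_err:.2e}")
```

Over an ideal channel the demodulated symbols match the transmitted ones to within floating-point rounding, which is the orthogonality property in numerical form.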

#### 2.2.1 The Cyclic Prefix

When transmitting OFDM symbols over a delay-dispersive multi-path channel, the received signal is the linear convolution of the transmitted symbol with the channel impulse response (CIR):

$$y[n] = h * s[n], \tag{2.5}$$

where h denotes the equivalent impulse response of the channel, assumed to have length L, and \* is the convolution operation. Since each transmitted symbol s[n] has length N, each received symbol y[n] has length N+L-1. The received signal is the concatenation of these received symbols, so the L-1 trailing samples of each received symbol overlap with the start of the next. This overlap introduces inter-symbol interference (ISI) in the received signal, as shown in Fig. 2.2.

In order to avoid ISI, a guard interval (or cyclic prefix), having a length of  $L_{CP}$ , must be added before each OFDM symbol as demonstrated in Fig. 2.3. If the



FIGURE 2.3: OFDM transmission with a cyclic prefix avoids ISI between adjacent symbols.



FIGURE 2.4: Inserting Cyclic Prefix in the OFDM symbol.

length of CIR, L, is smaller than that of the guard interval, $L_{CP}$, the overlapping tail of each received symbol falls entirely within the guard interval of the next symbol and does not interfere with its useful part. ISI is hence absent from the received symbol. In many OFDM standards, the guard interval is filled with a copy of the last $L_{CP}$ samples of the symbol, as shown in Fig. 2.4; this is called a cyclic prefix (CP).

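In addition, the use of a CP preserves the orthogonality of subcarriers, avoiding ICI: the CP makes the linear convolution with the channel appear circular over the DFT window, so the channel reduces to a single complex gain per subcarrier. Performing the DFT followed by a single-tap equalizer per subcarrier therefore allows recovery of the transmitted symbols [1].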
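The role of the CP can be illustrated with a small noise-free example. The sketch below is a minimal illustration, assuming a hypothetical three-tap channel `h`, N = 8 sub-carriers and a CP of four samples; none of these values come from a standard. It passes one CP-extended symbol through the channel, discards the CP, and divides each DFT output by the channel gain on that sub-carrier.

```python
import cmath

def idft(X):
    N = len(X)
    return [sum(X[k] * cmath.exp(2j * cmath.pi * k * n / N) for k in range(N)) / N
            for n in range(N)]

def dft(s):
    N = len(s)
    return [sum(s[n] * cmath.exp(-2j * cmath.pi * k * n / N) for n in range(N))
            for k in range(N)]

N, L_CP = 8, 4
h = [1.0, 0.5, 0.25]          # hypothetical CIR, length L = 3 < L_CP
X = [1 + 1j, 1 - 1j, -1 + 1j, -1 - 1j, -1 - 1j, 1 + 1j, 1 - 1j, 1 + 1j]

s = idft(X)
tx = s[-L_CP:] + s            # prepend the last L_CP samples as the CP

# Linear convolution with the channel, truncated to the transmitted length.
rx = [sum(h[i] * tx[n - i] for i in range(len(h)) if n - i >= 0)
      for n in range(len(tx))]

y = rx[L_CP:]                         # discard the CP
Y = dft(y)
H = dft(h + [0.0] * (N - len(h)))     # channel gain on each sub-carrier
X_hat = [Y[k] / H[k] for k in range(N)]   # single-tap equaliser per sub-carrier

max_err = max(abs(a - b) for a, b in zip(X, X_hat))
print(f"largest equalisation error: {max_err:.2e}")
```

Because the CP is at least as long as the channel memory, every sample kept by the receiver depends only on samples of the same symbol, which is why one complex division per sub-carrier is sufficient here.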

#### 2.2.2 OFDM Radio System

An OFDM system model can be considered as shown in Fig. 2.5.

In the transmitter, the data modulated symbols are grouped into blocks of N sub-carrier symbols, each known as an OFDM symbol and expressed by a vector $X = (X[0], X[1], ..., X[N-1])^T$. Next, the OFDM symbol signal in the time domain is



FIGURE 2.5: An OFDM system model.

modulated by performing the IDFT on each OFDM symbol, and a cyclic prefix of length $L_{CP}$ is inserted at the beginning of the OFDM signal. The complex baseband signal of the $m$th OFDM symbol in discrete time can thus be expressed as

$$s_m[n] = \frac{1}{N} \sum_{k=0}^{N-1} X[k] e^{i2\pi k \frac{n-L_{CP}}{N}}, \qquad (2.6)$$

where n is the discrete time index, m denotes the index of the OFDM symbol. The complete transmitted signal in the discrete time domain, s[n], is given by the concatenation of all OFDM symbols,  $s_m[n]$ ,

$$s[n] = \sum_{m=0}^{\infty} s_m [n - m(N + L_{CP})], \qquad (2.7)$$

When transmitting OFDM signals over a multi-path channel, the received signal is obtained through the linear convolution of the transmitted signal with the CIR h[i], plus additive white Gaussian noise (AWGN) n. Assume that synchronisation between the transmitter and receiver is perfect, that the channel fading is slow enough for the channel to be considered time invariant during one OFDM symbol interval, and that the cyclic prefix is longer than the CIR ($h[i] = 0$ for $i < 0$ and $i > L_{CP} - 1$). The received signal is then:



FIGURE 2.6: Block diagram of an OFDM radio system.

$$y[n] = \sum_{i=0}^{L_{CP}-1} h[i]s[n-i] + n[n], \qquad (2.8)$$

In the receiver, the incoming samples y[n] are synchronously grouped into blocks of OFDM symbols and then the cyclic prefix in each OFDM symbol is removed. The $m$th received symbol can be expressed as a vector $y_m = (y_m[0], y_m[1], ..., y_m[N-1])$, with $y_m[n] = y[m(N+L_{CP})+L_{CP}+n]$. The received data symbols associated with the $m$th OFDM symbol, $R_m[k]$, are retrieved by performing an N-point DFT:

$$R_m[k] = \sum_{n=0}^{N-1} y_m[n] e^{-i2\pi \frac{nk}{N}}, \qquad (2.9)$$
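The framing in Eq. 2.7 and the receiver-side indexing $y_m[n] = y[m(N+L_{CP})+L_{CP}+n]$ can be sketched as follows. This is a minimal illustration over an ideal, noise-free channel; the block sizes, the two symbol vectors and the function name `receive_symbol` are assumptions made for the example.

```python
import cmath

N, L_CP = 4, 2   # illustrative sizes only

def idft(X):
    return [sum(X[k] * cmath.exp(2j * cmath.pi * k * n / N) for k in range(N)) / N
            for n in range(N)]

def dft(s):
    return [sum(s[n] * cmath.exp(-2j * cmath.pi * k * n / N) for n in range(N))
            for k in range(N)]

# Two hypothetical OFDM symbols (blocks of N data symbols).
symbols = [[1 + 0j, 1j, -1 + 0j, -1j],
           [1j, 1j, -1 + 0j, 1 + 0j]]

# Eq. 2.7: concatenate the CP-extended time-domain symbols.
stream = []
for X in symbols:
    s = idft(X)
    stream += s[-L_CP:] + s

def receive_symbol(y, m):
    """Extract the m-th symbol from the stream, drop its CP and apply the DFT."""
    start = m * (N + L_CP) + L_CP   # skip m whole symbols plus the current CP
    return dft(y[start:start + N])

recovered = [receive_symbol(stream, m) for m in range(len(symbols))]
worst = max(abs(a - b) for X, R in zip(symbols, recovered) for a, b in zip(X, R))
print(f"largest error over both symbols: {worst:.2e}")
```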

Fig. 2.6 presents a common block diagram of an OFDM radio system. OFDM is performed in the baseband, where IFFT and FFT blocks compute the IDFT and DFT for OFDM modulation and demodulation, respectively. In the transmitter, the channel coded data from higher layers is mapped to data symbols (BPSK, QPSK, QAM, ...). The data symbols are then grouped together with the pilots to form N parallel IFFT input points. After performing the IFFT, the CP is inserted, and the OFDM samples are serialised and split into in-phase (I) and quadrature (Q) channels corresponding to the real and imaginary parts of the OFDM sample. After digital to analogue conversion, the I and Q signals are modulated onto an intermediate frequency, $f_{IF}$, and the result is then up-converted to an RF carrier frequency, $f_C$. Before transmission, the signal is amplified by a power amplifier (PA).

In the receiver, after down-converting and I/Q demodulation, the signal is sampled. The samples are formed from I and Q channels corresponding to the real and imaginary parts of the OFDM sample. The timing and frequency synchronisation blocks detect the frame start, recover timing of the frame and estimate the frequency offset. The received samples are then compensated with estimated frequency offset in the discrete time domain. After demodulating an OFDM symbol using the FFT, the channel is estimated and channel equalisation as well as phase error compensation are performed to improve performance. The block of parallel samples in an OFDM symbol is serialised to data symbol sequences that are then demodulated, and coded data is sent to higher layers to decode.

## 2.2.3 Evaluating OFDM

The main advantages of OFDM are its spectral efficiency and robustness against multi-path propagation. This makes OFDM suitable for high performance wireless applications. OFDM uses multiple sub-carriers that overlap in the frequency domain, resulting in greater spectral efficiency than FDM. Performing OFDM is equivalent to splitting a data stream into several parallel low-rate streams before transmission, which makes the OFDM signal more robust against fading when transmitted through the channel. Thanks to the cyclic prefix, the ISI and ICI caused by the multi-path channel can be eliminated. The CP creates a guard period for each OFDM symbol, which should be longer than the CIR to ensure no ISI. By repeating samples of the OFDM symbol in the guard period, the CP also helps to maintain the orthogonality of subcarriers, avoiding ICI. Thus, performing the DFT followed by a single-tap equalizer per subcarrier allows recovery of the transmitted symbols.

On the other hand, OFDM has some disadvantages. Firstly, an OFDM signal is the sum of multiple modulated sub-carriers, and thus suffers from a high peak-to-average power ratio (PAPR). This demands amplifiers with high power capability and a wide linear range, increasing the cost of OFDM systems. Secondly, the use of a guard period reduces bandwidth efficiency. Last but not least, OFDM performance is sensitive to receiver synchronisation. Frequency offset causes inter-subcarrier interference and errors in timing synchronisation can lead to inter-symbol interference. Much effort is needed to improve the accuracy of both frequency and time synchronisers for OFDM.
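The PAPR issue can be made concrete with two extreme cases. The sketch below is a minimal illustration with N = 16 chosen arbitrarily and `papr_db` a helper name introduced here. It compares a symbol in which all sub-carriers carry the same value, so that they add in phase at one sample, with a symbol that has a single active sub-carrier.

```python
import cmath
import math

def idft(X):
    N = len(X)
    return [sum(X[k] * cmath.exp(2j * cmath.pi * k * n / N) for k in range(N)) / N
            for n in range(N)]

def papr_db(samples):
    """Peak-to-average power ratio of a block of samples, in dB."""
    powers = [abs(v) ** 2 for v in samples]
    return 10 * math.log10(max(powers) / (sum(powers) / len(powers)))

N = 16
aligned = idft([1 + 0j] * N)                # all sub-carriers add in phase at n = 0
single = idft([1 + 0j] + [0j] * (N - 1))    # one active sub-carrier: constant envelope

print(f"all sub-carriers in phase: {papr_db(aligned):.2f} dB")
print(f"single sub-carrier:        {papr_db(single):.2f} dB")
```

In the in-phase case the peak power is N times the mean power (about 12 dB for N = 16), whereas a single sub-carrier has a constant envelope. Practical OFDM symbols fall between these extremes, and the upper bound grows with the number of sub-carriers.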

## 2.2.4 OFDM Synchronisation

OFDM performance is sensitive to receiver synchronisation. Frequency offset causes inter-subcarrier interference, and errors in timing synchronisation can lead to inter-symbol interference, making synchronisation critical in OFDM systems. Two main errors must be handled in synchronisation: sample clock timing offsets and carrier frequency offsets. In order to obtain good synchronisation performance, timing offsets and frequency offsets must be studied in terms of their causes and their effects on the degradation of received OFDM data symbols. Additionally, common phase error (CPE), generated by clock jitter and phase noise, causes a random rotation of the entire signal constellation. This must also be taken into account and compensated for in order to achieve good performance.

### 2.2.4.1 Timing Offsets

When sampling a signal at the receiver, the difference between the sampling instants at the receiver and those at the transmitter is referred to as timing error. In a single carrier system, the symbol clock of the transmitter can be recovered at the receiver using a phase-locked loop (PLL) [1]. This can correct a timing error in the receiver relatively easily.

In OFDM, however, timing errors fall into two categories: fractional and integer. Fractional timing errors, which are smaller than one sample period, are caused by the phase difference between the sampling clock of the analogue to digital converter (ADC) in the receiver and the transmitted signal. Integer timing errors are greater than one sample period and cause an index shift, or offset, in the sample sequence.

Timing error in the time domain is equivalent to a phase rotation in the frequency domain:

$$s(t-\tau) \Leftrightarrow e^{-i2\pi f \tau} R(f), \tag{2.10}$$

where $\tau$ denotes the timing error, which results in a phase shift of $e^{-i2\pi f\tau}$, s(t) is the received signal in the time domain, and R(f) is the spectrum of s(t) in the frequency domain. The phase shift is proportional to both the timing error and the carrier frequency. In a multi-carrier signal, the phase shift therefore grows with the sub-carrier frequency, producing a progressive phase rotation across the sub-carriers. Like rotations caused by fading, sub-carrier rotations caused by a fractional timing error $\Delta t$ can be estimated by a channel estimator and compensated for after performing the DFT:

$$\widehat{R[n]} = R[n]e^{\frac{i2\pi n\Delta t}{N}},\tag{2.11}$$

where R[n] and $\widehat{R[n]}$ denote the received data symbol on sub-carrier n before and after compensation, respectively, $\Delta t$ is the estimated fractional timing error, and N is the number of sub-carriers.

Moreover, the received samples in the receiver are synchronously grouped into blocks of OFDM symbols. Integer timing errors lead to a symbol timing offset (STO), the difference between the correct and the actual sample index of the received samples, which misaligns the window used for DFT demodulation in the receiver. If the timing offset is late, samples of the following symbol are included in the window for the current symbol, resulting in ISI and hence degrading the performance of the OFDM system. The effect of ISI caused by late timing offsets on the received OFDM symbol is illustrated in Fig. 2.7(a) and Fig. 2.7(c). If the timing offset is early, some samples in the CP of the current symbol are used to calculate the DFT, leading to the sub-carrier rotation expressed in Eq. 2.10 in the frequency domain. The effect of sub-carrier rotation caused by early timing offsets on the received OFDM symbol is illustrated in Fig. 2.7(b) and Fig. 2.7(d).

$$s[n-t_{off}] \Leftrightarrow R[n]e^{-\frac{i2\pi nt_{off}}{N}}, \tag{2.12}$$

where s[n] and R[n] denote the received data symbols in the time domain and the frequency domain, respectively, $t_{off}$ is the timing offset, and N is the number of sub-carriers.

Fig. 2.7 illustrates the effect of timing offset on a single 256 sub-carrier OFDM symbol utilising QPSK sub-carriers (based on the IEEE 802.16 standard). As can be seen, early timing, for instance timing offsets of 1 and 5, shown respectively in Figs. 2.7(b) and 2.7(d), causes a sub-carrier rotation similar to that caused by fractional timing errors and fading, which can be estimated by a channel estimator. However, late timing, for instance timing offsets of -1 and -5, as shown in Figs. 2.7(a) and 2.7(c), leads to ISI that prevents the OFDM constellation from being recovered.
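The asymmetry between early and late timing can be reproduced in a few lines. The following sketch is a minimal illustration over an ideal channel, with N = 8, a CP of two samples and two arbitrary QPSK symbols; the `early` and `late` arguments count samples and are a convention of this sketch, not the signed offsets used in Fig. 2.7. The demodulator removes the linear phase that a pure circular shift would cause, so any remaining error is ISI.

```python
import cmath

N, L_CP = 8, 2   # illustrative sizes only

def idft(X):
    return [sum(X[k] * cmath.exp(2j * cmath.pi * k * n / N) for k in range(N)) / N
            for n in range(N)]

def dft(s):
    return [sum(s[n] * cmath.exp(-2j * cmath.pi * k * n / N) for n in range(N))
            for k in range(N)]

def add_cp(s):
    return s[-L_CP:] + s

# Two consecutive OFDM symbols carrying arbitrary QPSK data.
X0 = [1 + 1j, 1 - 1j, -1 + 1j, -1 - 1j, -1 - 1j, 1 + 1j, 1 - 1j, 1 + 1j]
X1 = [-1 + 1j, 1 + 1j, 1 + 1j, -1 - 1j, 1 - 1j, -1 + 1j, 1 - 1j, -1 - 1j]
tx = add_cp(idft(X0)) + add_cp(idft(X1))

def demod_first_symbol(early=0, late=0):
    """Demodulate symbol 0 with the DFT window displaced by a few samples."""
    shift = late - early                 # net window displacement in samples
    start = L_CP + shift
    Y = dft(tx[start:start + N])
    # Remove the linear phase that a circular shift of `shift` samples causes.
    return [Y[k] * cmath.exp(-2j * cmath.pi * k * shift / N) for k in range(N)]

def max_error(X_hat):
    return max(abs(a - b) for a, b in zip(X0, X_hat))

err_early = max_error(demod_first_symbol(early=1))   # window starts inside the CP
err_late = max_error(demod_first_symbol(late=1))     # window runs into the next symbol

print(f"early by one sample: {err_early:.2e}")
print(f"late by one sample:  {err_late:.2e}")
```

With an early window the only effect is a rotation that can be undone exactly, while a late window leaves a residual error on every sub-carrier that no per-sub-carrier rotation can remove.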

Therefore, timing synchronisation is required to correct the timing offset, and avoid ISI. Correlation is commonly performed to estimate timing offset [35, 36, 37]. One of the contributions in the thesis proposes a novel low-power correlation for OFDM synchronisation. The related work and proposed correlation are discussed in Chapter 3. The accuracy of timing synchronisation also depends on the timing metrics and the algorithms applied on these metrics. A novel efficient and robust timing synchronisation based on the low-power correlation is proposed in this thesis. The related work [3, 4, 38, 39, 40] and proposed method for timing synchronisation will be discussed in Chapter 4.



FIGURE 2.7: OFDM received symbol with timing offsets of -1, 1, -5 and 5 in a, b, c, d, respectively.

### 2.2.4.2 Frequency Offset

Carrier frequency offset (CFO) refers to the difference between the carrier frequency assumed by the receiver and the 'correct' frequency of the carriers in a transmitted OFDM symbol. CFO is introduced by an imperfect clock in the RF front-end, as well as by frequency variation caused by the Doppler effect in the channel. This misaligns the sampling points of the sub-carriers in the frequency domain and causes a loss of orthogonality, because at the offset sampling point of one sub-carrier the other sub-carriers are no longer null as expected (shown in Fig. 2.8). CFO is normalised by the sub-carrier spacing and usually divided into an integer part (IFO), a multiple of the sub-carrier spacing, and a fractional part (FFO). IFO causes a circular shift of



FIGURE 2.8: Inter carrier interference (ICI) caused by frequency offset  $\Delta f$ .

the sub-carrier in the frequency domain while FFO results in ICI because of lost orthogonality between sub-carriers.

With no frequency offset, each frequency bin of the DFT samples the peak of one sub-carrier's sinc(x) pulse, where all the other pulses are null. However, if a frequency offset is introduced, each frequency bin of the DFT also collects energy from the other sub-carriers. This means that adjacent sub-carriers introduce interference components, resulting in ICI.

As can be seen, the adjacent sub-carrier introduces an interference component that is about half the amplitude of the sub-carrier of interest. All other sub-carriers introduce an interference component of much lower amplitude. This is known as a loss of orthogonality, and must be compensated for in order to properly demodulate the OFDM symbol. The effect of CFO can be readily analysed in the time domain by taking the inverse Fourier transform:

$$R(f - \Delta f) \Leftrightarrow e^{i2\pi\Delta ft} s(t), \tag{2.13}$$

In the discrete time domain, the signal sample sequence can be expressed:

$$s[n]' = s[n]e^{\frac{i2\pi\Delta fn}{N}}, \tag{2.14}$$



FIGURE 2.9: The constellations of a received OFDM symbol with frequency offsets of 0.025, 0.5, 0.1 and 0.25 sub-carrier spacings in a, b, c, d, respectively.

where s[n]' and s[n] are the frequency offset samples and the original samples, respectively, $\Delta f$ denotes the frequency offset normalised by the sub-carrier spacing, and N is the number of sub-carriers.

The effect of frequency offset is shown in Fig. 2.9. Each plot illustrates the constellation of QPSK symbols demodulated from one 256 sub-carrier OFDM symbol based on the IEEE 802.16 standard.

As can be seen, OFDM performance is sensitive to even small frequency offsets. CFO causes dispersion, similar to AWGN, and also phase rotation in the QPSK constellation demodulated from the OFDM symbol. If multiple data symbols are transmitted in a packet, the phase rotation accumulates from one OFDM symbol to the next, and even a small CFO will lead to a large drift of constellation points, shown



FIGURE 2.10: The constellations of 5 consecutive OFDM received symbols with frequency offsets of 0.025 and 0.05 in a, b respectively.

in Fig. 2.10, degrading the performance of demodulation. CFO must be estimated and compensated for, in order to properly demodulate the OFDM symbol.
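The two components of CFO can be separated in a short experiment based on Eq. 2.14. The sketch below is a minimal illustration over an ideal channel with N = 8; the helper `apply_cfo` and the test signals are assumptions made for the example. It first applies an integer offset of one sub-carrier spacing, then a fractional offset of 0.25 to a symbol with a single active sub-carrier so that the leakage into neighbouring bins is directly visible.

```python
import cmath

def idft(X):
    N = len(X)
    return [sum(X[k] * cmath.exp(2j * cmath.pi * k * n / N) for k in range(N)) / N
            for n in range(N)]

def dft(s):
    N = len(s)
    return [sum(s[n] * cmath.exp(-2j * cmath.pi * k * n / N) for n in range(N))
            for k in range(N)]

def apply_cfo(s, eps):
    """Eq. 2.14: rotate sample n by 2*pi*eps*n/N, eps in sub-carrier spacings."""
    N = len(s)
    return [s[n] * cmath.exp(2j * cmath.pi * eps * n / N) for n in range(N)]

N = 8
X = [1 + 1j, 1 - 1j, -1 + 1j, -1 - 1j, -1 - 1j, 1 + 1j, 1 - 1j, 1 + 1j]

# Integer offset (IFO): every data symbol moves to the next bin, circularly.
Y_int = dft(apply_cfo(idft(X), 1))
shift_err = max(abs(Y_int[k] - X[(k - 1) % N]) for k in range(N))

# Fractional offset (FFO): one active sub-carrier leaks into all other bins.
tone = [1 + 0j] + [0j] * (N - 1)
Y_frac = dft(apply_cfo(idft(tone), 0.25))

print(f"IFO circular-shift error: {shift_err:.2e}")
print("FFO bin magnitudes:", [round(abs(v), 3) for v in Y_frac])
```

The integer offset leaves the symbols intact but in the wrong bins, whereas the fractional offset reduces the wanted bin and spreads energy into the others, which is the ICI described above.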

Most published OFDM implementations limit their synchronisation to dealing with small CFO that can be corrected along with FFO. Although IFO estimation methods have been explored in theory [41, 42, 43, 44, 45], their implementation in hardware has not been published in the research literature primarily due to high computational complexity. In Chapter 5, we present a low cost, low power IFO estimation architecture for more robust OFDM synchronization.

### 2.2.4.3 Phase Noise

The intrinsic imperfection of the local clock at the RF front-end of a receiver or clock jitter of an ADC may introduce parasitic phase noise which can affect the performance of baseband data symbols, such as QPSK, QAM, ..., during demodulation. Phase noise can be considered in two different parts: common phase error and inter-carrier interference [46]. The effect of phase noise can be expressed in the discrete time domain as:

$$s[n]' = s[n]e^{i\phi[n]}, \tag{2.15}$$



FIGURE 2.11: The constellations of an OFDM received symbol and 5 consecutive OFDM received symbols with phase noise variance of  $0.25 \ rad^2$  in (a), (b) respectively.

where s[n]' and s[n] are the samples affected by phase noise and the original samples, respectively, and $\phi[n]$ denotes the phase noise.

If the phase noise varies slowly relative to the OFDM symbol interval, it can be considered a constant phase term added to each sample, resulting in CPE [46]. However, if the phase noise varies much faster than the OFDM symbol interval, a different phase is added to each sample, causing a loss of orthogonality and thus inter-carrier interference. Fig. 2.11 illustrates the effect of phase noise on baseband data symbol demodulation in a received OFDM symbol and in 5 consecutive received OFDM symbols.

As can be seen in Fig. 2.11, phase noise degrades OFDM demodulation in two different ways. First, the constellations of the data symbols are rotated, similarly to the effect of fractional timing error or residual frequency offset. Second, the constellations of the data symbols are dispersed, as with AWGN, owing to the loss of orthogonality between sub-carriers. However, unlike frequency offset, which produces a different constellation rotation for each symbol, phase noise produces a constant constellation rotation for all OFDM symbols.
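The slowly varying case can be sketched by setting $\phi[n]$ in Eq. 2.15 to a constant. The example below is a minimal illustration with an arbitrary phase of 0.3 rad and N = 8. As a simplifying assumption, it treats every sub-carrier as a known pilot when estimating the rotation; a practical receiver would use only a few pilot sub-carriers, and this estimator is not taken from the thesis.

```python
import cmath

def idft(X):
    N = len(X)
    return [sum(X[k] * cmath.exp(2j * cmath.pi * k * n / N) for k in range(N)) / N
            for n in range(N)]

def dft(s):
    N = len(s)
    return [sum(s[n] * cmath.exp(-2j * cmath.pi * k * n / N) for n in range(N))
            for k in range(N)]

X = [1 + 1j, 1 - 1j, -1 + 1j, -1 - 1j, -1 - 1j, 1 + 1j, 1 - 1j, 1 + 1j]
phi = 0.3   # hypothetical constant phase error in radians

# Eq. 2.15 with a phase that is constant over the whole symbol.
s_rot = [v * cmath.exp(1j * phi) for v in idft(X)]
Y = dft(s_rot)

# Assumption: all sub-carriers are known, so the common rotation is the
# angle of the correlation between received and transmitted symbols.
phi_hat = cmath.phase(sum(y * x.conjugate() for y, x in zip(Y, X)))
X_hat = [y * cmath.exp(-1j * phi_hat) for y in Y]

max_err = max(abs(a - b) for a, b in zip(X, X_hat))
print(f"estimated CPE: {phi_hat:.3f} rad, residual error: {max_err:.2e}")
```

A constant phase rotates every sub-carrier by the same angle, so one estimate per OFDM symbol is enough to remove it; the fast-varying case does not reduce to a single angle and appears as ICI instead.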

## 2.2.5 Shaping OFDM Spectral Leakage

One drawback of OFDM is spectral leakage due to the summation of sinusoidal sub-carriers windowed by a rectangular function. Some recent OFDM-based standards demand very strict requirements on spectral leakage to avoid inter-channel interference between adjacent transmission channels. This raises a significant challenge in terms of how to shape the spectrum of the OFDM signal.
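The size of this leakage can be estimated from the sinc-shaped spectrum of a single rectangularly windowed sub-carrier. The sketch below is a minimal illustration that evaluates the normalised sinc magnitude near the mid-points between nulls, which lie close to the side-lobe peaks; the helper names and the chosen offsets are assumptions made for the example.

```python
import math

def sinc(x):
    """Normalised sinc, sin(pi x) / (pi x), with sinc(0) = 1."""
    return 1.0 if x == 0 else math.sin(math.pi * x) / (math.pi * x)

def level_db(offset):
    """Spectrum level of one sub-carrier, `offset` sub-carrier spacings away."""
    return 20 * math.log10(abs(sinc(offset)))

# Offsets half-way between nulls, measured in sub-carrier spacings.
for offset in (1.5, 2.5, 5.5, 10.5):
    print(f"{offset:5.1f} spacings from the centre: {level_db(offset):6.1f} dB")
```

The first side lobe is only about 13 dB below the main lobe, and the envelope decays in proportion to the inverse of the frequency offset, so the level is still around -30 dB ten sub-carrier spacings away. Additional spectral shaping is therefore needed whenever a mask demands much deeper attenuation close to the channel edge.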

### 2.2.5.1 Spectrum Emission Masks in Recent Standards

In 2009, the FCC issued regulatory rules for reusing television white space (TVWS) spectrum. IEEE 802.11af was developed under the 802.11 Working Group as a standard that enables a Wi-Fi service in TVWS spectral regions [47]. The scope of the standard is to define amendments to the high throughput 802.11's PHY and MAC layers to meet the requirements for channel access and coexistence in the TVWS regions. One of the main challenges is the stringent spectral emission mask (SEM) requirements that are mandated by the FCC for these services. Re-using existing 802.11 standards means hardware can be cheap, but the high throughput 802.11 scaled SEM is significantly inferior to the required spectral emissions shape for TVWS [48]. For instance, the 802.11 scaled SEM requires an attenuation of 20 dB at the edge of the channel whereas the equivalent requirement for portable TV band devices (TVBD) is 55 dB.

In 2010, the IEEE defined a standard for PHY and MAC layers [49], named IEEE 802.11p, for Dedicated Short-Range Communications (DSRC), the wireless channel for new vehicular safety applications through vehicle-to-vehicle (V2V) and Road to Vehicle (RTV) communications. The PHY in 802.11p is largely inherited from the well-established IEEE 802.11a OFDM PHY, with several changes aimed at improving performance in vehicular environments. The advantage of building on 802.11a is a potential significant reduction in the cost and development effort necessary to develop new 802.11p hardware and software. It also plays an important role in allowing backwards compatibility from 802.11p to 802.11a [50, 51].

Essentially, three changes are made in IEEE 802.11p [52]. First, 802.11p defines a 10 MHz channel width instead of the 20 MHz used by 802.11a. Halving the bandwidth lengthens the OFDM symbol and extends the guard interval to address the effects of Doppler spread and inter-symbol interference in a vehicular communication (VC) channel. Secondly, 802.11p defines several improvements in receiver adjacent channel rejection performance to reduce the effect of cross channel interference, which is especially important in dense vehicle communication channels. Finally, 802.11p defines four SEMs corresponding to class A to D operations that are specified and issued in FCC CFR47 Sections 90.377 and 95.1509. These are more stringent than for current 802.11 radios, in order to improve performance in urban vehicle scenarios. In addition, 802.11p will operate in the 5.9 GHz DSRC spectrum divided into seven 10 MHz bands. This channelization allows the MAC layer to perform multi-channel operations [53]. The mechanism allows safety and other applications to occupy separate channels to reduce interference. The four strict 802.11p SEMs are defined to reduce the effect of ICI between the channels. Wu et al. [54] showed that transmitters on adjacent service channels still cause inter-channel interference (ICI) in the safety channel, even if they satisfy the class C requirement. Shaping 802.11p spectral leakage is thus potentially important in helping to eliminate ICI.
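The effect of the narrower channel on the guard interval follows from simple arithmetic. The sketch below is a minimal illustration, assuming the 64-point FFT and 16-sample guard interval commonly quoted for the 802.11a OFDM PHY and a sample rate equal to the channel width; the function name `ofdm_timing` is introduced here for the example.

```python
def ofdm_timing(bandwidth_hz, n_fft=64, n_cp=16):
    """Derive basic OFDM timing from the channel width.

    Assumes the sample rate equals the channel width, with the FFT size and
    guard length commonly quoted for the 802.11a OFDM PHY.
    """
    sample_period = 1.0 / bandwidth_hz
    return {
        "subcarrier_spacing_hz": bandwidth_hz / n_fft,
        "symbol_s": n_fft * sample_period,
        "guard_s": n_cp * sample_period,
    }

timing_a = ofdm_timing(20e6)   # 802.11a channel width
timing_p = ofdm_timing(10e6)   # 802.11p channel width

for name, t in (("802.11a", timing_a), ("802.11p", timing_p)):
    print(f"{name}: spacing {t['subcarrier_spacing_hz'] / 1e3:.2f} kHz, "
          f"guard {t['guard_s'] * 1e6:.1f} us")
```

Under these assumptions, halving the channel width doubles the guard interval from 0.8 µs to 1.6 µs and halves the sub-carrier spacing, so the longer guard interval comes at the cost of more closely spaced sub-carriers.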

### 2.2.5.2 Dynamic Channel Requirements

For static wireless devices operating in licensed spectral regions, the characteristics of communication systems that are licensed to occupy adjacent bands may be known. Hence, spectral leakage masks for ICI avoidance in neighbouring systems can be statically specified. However, in the case of shared and reused spectrum, the authors of [55] argue that SEMs should be defined in a more general and flexible way. In other words, CRs operating in dynamic spectrum access (DSA) environments must adapt their current transmission SEMs based upon their current operating region, which is another argument for baseband digital filtering. More sophisticated examples of SEMs are studied in [56], which deals with the broader concept of dynamic SEMs.

Time-varying SEMs may also need to consider that neighbouring systems are themselves able to change their SEMs, and hence through negotiation with each other change their masks for in- and out-of-band emission levels separately in accordance with their mutual temporal variations (for example, to adapt to communications traffic density or spatial deployment density). In such future systems, an SEM defined by the regulator may simply be a starting point in a collaborative process in which neighbouring communication systems negotiate and renegotiate new SEMs as their status changes (e.g. to optimise computational power or increasing throughput).

Unfortunately, the authors of [56] did not present a complete solution for dynamic filtering of spectral emissions. Instead, they discussed deactivating sub-carriers or changing transmission power to satisfy the requirements of a dynamic SEM. The former solution leads to a reduction in throughput due to reduced spectral band occupancy, whereas the latter impacts range. A combined approach was presented in [57], where some sub-carriers are reduced in power instead of being deactivated, in order to reduce spectral leakage to adjacent channels.

### 2.2.5.3 Filtering in OFDM Implementations

Modern OFDM implementations tend to favour subsuming as much processing as possible within the baseband digital components, in order to simplify the frontend RF hardware. Although many alternative transmitter designs exist, direct up-conversion (DUC) architectures are commonly selected due to inherent implementation, cost and performance advantages [58]. Within the transmitter, orthogonal intermediate frequency (IF) signals are generated directly by digital baseband hardware, high-pass filtered and then quadrature up-converted to RF for transmission. This contrasts with more traditional digital radio implementations in which the digital hardware generates baseband signals which are up-converted to IF in one or more analogue steps before conversion to RF. Those systems would perform channel filtering predominantly with analogue filters, which require discrete precision components, and which tend to be inflexible in terms of carrier frequency and

other characteristics. For cognitive radio (CR) systems, where frequency agility is a requirement, and in SDR (software defined radio), both up-conversion to IF as well as channelisation filtering are performed in the digital domain [59, 60]. Typically, this enables a relaxation of stringent IF and RF filtering requirements, which in turn allows a reduction in system cost, though requiring more complex signal processing. A further advantage of baseband filtering is agility and flexibility. In a CR context in particular, both channel and time agility are required, and this can be best achieved in the digital domain.

Within the baseband, OFDM symbols are constructed in the frequency domain and then transformed to a complex time domain representation through the IFFT. A critically sampled construction process requires a sample rate of double the signal bandwidth. This signal is up-converted to IF using an interpolation process, during which images of the original OFDM frequency response are created at integer multiples of the original sampling rate. Image rejection filtering must then be performed on this signal prior to being output by the digital-to-analogue converters (DACs) and subsequent transmission, since the images lie out-of-band (OOB) and hence are a cause of ICI.
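The creation of images during interpolation can be seen with a single active sub-carrier. The sketch below is a minimal illustration, assuming interpolation by zero insertion with a factor of two and N = 8; the helper `upsample` and the chosen sizes are assumptions made for the example, and no image rejection filter is applied.

```python
import cmath

def idft(X):
    N = len(X)
    return [sum(X[k] * cmath.exp(2j * cmath.pi * k * n / N) for k in range(N)) / N
            for n in range(N)]

def dft(s):
    N = len(s)
    return [sum(s[n] * cmath.exp(-2j * cmath.pi * k * n / N) for n in range(N))
            for k in range(N)]

def upsample(s, factor):
    """Raise the sample rate by inserting factor - 1 zeros after each sample."""
    out = []
    for v in s:
        out.append(v)
        out.extend([0j] * (factor - 1))
    return out

N = 8
X = [0j] * N
X[1] = 1 + 0j                        # single active sub-carrier in bin 1

U = dft(upsample(idft(X), 2))        # spectrum at twice the original sample rate
occupied = [k for k, v in enumerate(U) if abs(v) > 1e-9]
print("occupied bins after interpolation:", occupied)
```

Besides the wanted component in bin 1, a copy appears N bins higher, that is, offset by exactly the original sampling rate. It is this copy that the image rejection filter has to suppress before the signal reaches the DACs.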

However, any such filtering induces time-domain smearing of the transmitted signals [61], which adds to the similar effects caused by the channel impulse response (CIR) between transmitter and receiver, all of which can potentially induce intersymbol interference (ISI).

OFDM systems combat ISI by dividing information finely in the frequency domain across sub-channels to implement narrow (in frequency) but long (in time) transmitted symbols, and then providing a guard interval between successive symbol blocks. The guard interval is determined by the duration of the expected channel and filter impulse responses that are traversed by each symbol on the path from transmitter baseband to receiver baseband. The guard interval typically contains a cyclic prefix (CP) which is inserted to combat another cause of ISI: received frequency components ringing during the CIR due to the abrupt onset of modulation at the beginning of each symbol. The nearly rectangular OFDM

symbols in the time domain naturally have a frequency domain response consisting of overlapping sinc shapes, complete with large side lobes that lie outside the main frequency channel. These are another source of OOB interference which contributes to ICI. As noted previously, both 802.11p and 802.11af (in common with most OFDM-based standards) specify an SEM which requires that ICI is controlled.

# 2.3 Field Programmable Gate Arrays

Field programmable gate arrays (FPGAs) are silicon devices with an architecture designed to be flexible enough to implement any type of circuit [62]. They consist of programmable logic components coupled with configurable routing, and flexible I/O for interfacing with external components. How these components and connections are set up is determined according to the contents of a configuration memory. To implement a circuit, the designer typically describes it using a hardware description language like Verilog or VHDL. Alternatively, tools exist that can map higher level languages like C or MATLAB Simulink directly [63]. This description is then processed through synthesis and implementation tools, to determine a suitable mapping of the circuit to the components on the FPGA, and the connections between them. This results in a bitstream that can be loaded into the configuration memory to set up the circuit described [64].

### 2.3.1 FPGAs for Radio Platforms

FPGAs have a long history of use in digital signal processing (DSP) applications. They offer a way of exploiting the inherent parallelism in many DSP algorithms through custom architectures, resulting in significant speedup and efficiency compared to software based implementations. On recent FPGAs, additional components like DSP blocks for high-performance signal processing improve performance further. Typically, the designer describes the architecture at a general level, and the implementation tools decide when to use these components [65].

An exciting recent development is the emergence of new platforms that couple high-performance processors with a flexible FPGA fabric. For example, the Xilinx Zynq family is a new generation of reconfigurable devices, which has a tightly coupled hardened 1 GHz dual-core ARM Cortex-A9 processor and reconfigurable fabric architecture along with several built-in peripherals [66]. In such systems, the ARM can host a fully functioning software stack while the baseband can be implemented in the reconfigurable fabric. The connectivity between the two is very tight and offers high bandwidth and low latency. This represents an ideal platform for cognitive radio systems, offering both high computational performance and flexibility, while also facilitating a simple software programmable interface to higher layers of the radio. A number of research groups are exploring ways to use these platforms in cognitive radios [67].

Partial reconfiguration (PR) is an advanced technique available in recent FPGAs that provides further flexibility [68, 69]. By modifying only a portion of the configuration memory, it is possible to modify only some system functionality while the remainder continues to operate without interruption. This allows portions of a circuit to be changed at runtime, and is ideally suited to applications like cognitive radio [70, 71]. It is clear that PR has significant benefits for dynamic systems like cognitive radios. However, the design process remains suitable only for FPGA experts. Recent efforts have begun to bridge this gap, allowing a modular approach to PR design, with automated partitioning and floorplanning of the FPGA [72, 73]. Combined with the improving efficiency of high-level synthesis, this will enable radio designers with minimal FPGA experience to leverage the power and flexibility of FPGAs in the design of cognitive radios.

Reduced power consumption is another benefit of PR that is important in radio systems. FPGA based systems consume power in proportion to the size of the device, resource utilisation and operating frequency. PR based designs save resources, and may even allow use of a smaller FPGA, hence saving power.

### 2.3.2 Power Dissipation on FPGA

The increasing computational requirements, and growing requirements for deployment in portable devices have prompted serious attention to reducing the power consumption of systems, including radios with a focus on the physical layer. High power dissipation by a system introduces high operating temperatures and follow-on effects that increase the cost of packaging as well as requiring larger batteries in the case of portable devices, or other power supply solutions. Power dissipation has thus come to be considered as one of the most important design metrics, along with area and performance. Reducing power consumption has been investigated at all levels of the digital design flow: from the system design stage to the layout stage. At every stage, the designer can act to optimise the power consumption of the design.

However, small changes at the lower levels may require more work and time to update the higher level optimisations, thus increasing the complexity of the optimisation process. Moreover, at the system design stage, the designer has more options and can reduce power dissipation to a greater degree. At the lower levels, the architecture of the system is already fixed in terms of data flow between registers, leaving few options to optimise the system. Thus, most power saving opportunities exist at the higher levels of design abstraction [74].

FPGAs, with their highly parallel architecture, are suitable for the increasing computational requirements of signal processing in wireless communication systems. In order to optimise the power dissipation of systems on FPGAs, the power consumption of FPGAs needs to be clearly understood, and accurate, flexible power estimation tools are needed to provide power profiles of the system that help the designer make the best decisions. In addition, low power design techniques must be investigated to optimise power consumption when a system design is implemented.

Total FPGA power is calculated as follows:

$$P_{total} = P_{devicestatic} + P_{designstatic} + P_{dynamic}, \tag{2.16}$$

where  $P_{total}$  represents the total power consumption,  $P_{devicestatic}$  refers to the leakage power which is proportional to the size of the device and resource utilisation when the device is powered,  $P_{designstatic}$  is the additional power dissipation when the design is configured on the device and has no switching activity. Leakage current is the only source of this static power dissipation.  $P_{dynamic}$  represents the additional average power consumption from user logic utilization and switching activity caused by signal transitions in the design. Increasing operating frequency leads to a rise in switching activity and thus results in increased dynamic power consumption. The most significant source of dynamic power dissipation is the charging and discharging of parasitic capacitance as a result of signal transitions.

Efficient power-aware designs for FPGA-based systems require estimation tools that gauge power consumption at an early stage in the design flow. These tools allow design tradeoffs to be considered at a high level of abstraction, thus reducing design effort and cost. Significant research on FPGA power consumption has appeared in the literature [75, 76, 77, 78, 79]. These papers have shown that power consumption of FPGA devices is predominantly confined to the programmable interconnect. In the Xilinx Virtex-II family, for instance, it was reported that between 50-70% of total power is dissipated in the interconnection network [75]. The majority of power dissipation in FPGAs is dynamic power dissipation [75] as characterized by:

$$P_{dynamic} = \frac{1}{2} \sum_{i \in allnets} C_i \cdot f_i \cdot V^2, \tag{2.17}$$

where  $P_{dynamic}$  represents average power consumption,  $C_i$  is the capacitance of a net,  $f_i$  is the average transition rate (switching activity) of the corresponding net, and V is the supply voltage. Thus, estimating dynamic power using (2.17) requires two parameters for each net: switching activity and capacitance. The net's capacitance is the parasitic capacitance of the interconnection wires, and depends on the interconnect resources used. The switching activity represents the signal transitions on the net, and its value depends on how circuit delays are accounted for. Zero delay activity is calculated assuming logic and routing delays are zero. Logic delay activity is calculated using logic delays only, with routing delays assumed to be zero. Routed delay activity is calculated using complete logic and routing delays. When delays are accounted for, the presence of glitches, which are spurious logic transitions caused by unequal path delays to the inputs of the net's driving gate, leads to an increase in switching activity. The additional activity due to glitching causes a significant increase in dynamic power [76]. Path delays in FPGAs are also dominated by the interconnect, whose delays differ from those of the primitive logic. Therefore, the effect of glitching on total power consumption may be more severe in FPGAs than in ASICs.
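As a rough illustration of (2.17), the following Python sketch sums the per-net contributions. The function name and the example capacitance, activity and voltage values are hypothetical; they are not taken from any device model or vendor tool.

```python
def dynamic_power(nets, voltage):
    """Average dynamic power in watts, following the form of (2.17).

    nets    -- iterable of (capacitance_farads, transitions_per_second) pairs
    voltage -- supply voltage in volts
    """
    return 0.5 * voltage ** 2 * sum(c * f for c, f in nets)


# Hypothetical nets: values chosen only to make the arithmetic easy to follow.
example_nets = [(1e-12, 50e6), (2e-12, 25e6)]
example_power = dynamic_power(example_nets, 1.0)
```

Because glitches raise the effective transition rate of a net, an activity figure that accounts for delays would generally yield a larger result than a zero delay figure for the same set of nets.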

It is generally accepted that the most effective way to reduce power dissipation on FPGAs is to reduce dynamic power dissipation. Dynamic power dissipation is proportional to switching activity and the equivalent capacitance of the circuit. Many techniques and strategies to reduce power dissipation in FPGAs have been presented in the literature [80, 81, 82, 83, 84]. However, only the strategies which appear to be suitable for wireless systems are discussed here. These strategies will be discussed in detail for specific implementations in the following chapters; only the general approach is outlined in this section.

Firstly, in signal processing dominated systems, storing and transferring data between functional modules consumes a large proportion of total energy [83]. The power consumption thus heavily depends on the way a system is partitioned and modularized and hence the use of data buffers to synchronously transfer data between functional modules. In order to optimise power dissipation at a system level, such systems are often realized to support stream processing in which buffering data between functional modules requires much less memory. The energy for storing and transferring data is thus reduced significantly.

Secondly, latching registers with enable signals should be used at the inputs of functional modules. These can reduce switching activity inside functional modules when they are waiting for synchronization in a streaming process. Power estimation tools should be used to evaluate the power dissipation of the system at each stage of the flow. This provides information to assist in making decisions for reducing power dissipation.

### 2.3.3 Power Estimation

Several studies investigating power estimation techniques at different levels from the circuit level up to the system level have appeared in the literature. At the circuit level, a circuit simulator such as SPICE [85] provides one of the most accurate power estimation methods, but the computational overhead involved in this is not really suitable for highly integrated and dense FPGA based systems and complex designs.

Several studies have shown that the efficiency of power optimisation is significantly better at the higher levels [86, 87, 88, 79]. High-level power estimation is needed to conveniently validate power budgets for the different parts of the design, to identify the most power hungry parts of the design, and to quickly evaluate the effects of optimisations on the overall system power budget. Furthermore, the long run time and the complexity of synthesising and validating a gate-level netlist make lower level estimation approaches highly inefficient for exploring high-level design tradeoffs.

The contemporary Xilinx FPGA design suite supports two power estimation tools: XPower Estimator (XPE) and XPower Analyzer (XPA). These tools provide the power profile of a system in the early stages, while the RTL description is being implemented, which helps the designer improve power characteristics when implementing the system. XPE is a Microsoft Excel spreadsheet used to estimate power distribution, typically in the pre-design and pre-implementation phases of the design flow [89]. XPE supports selecting the relevant power supply and thermal management components based on the system's architecture evaluation and device selection. XPE reads the resource usage of the implemented system, toggle rates, I/O loading, and many other factors from a designer's input and combines these with the appropriate device models to estimate power distribution. The device models are obtained from measurements, simulation, and/or extrapolation. Two primary components contribute to the accuracy of XPE. One is the designer's input estimates, such as toggle rates and I/O loading. The other is the device data models integrated into the spreadsheet, which are selected based on the device selection. Realistic input must be provided to obtain an accurate estimate.

As an early exploration tool, XPE's accuracy is heavily influenced by the numbers input by the designer. After synthesis and implementation, XPA provides a more accurate estimation and power analysis.

XPA analyses real design data from the system, such as the native circuit description (NCD) file output once the design has been fully implemented. XPA employs a vectorless estimation algorithm in which the switching activity of nodes is assigned appropriate values even if they are not defined in the input file. However, simulation activity files, such as a Value Change Dump (VCD) file from a functional simulation in timing mode, are required for accurate power analysis. These offer low-level switching activity for all signals in the design, factoring in timing delays.

The power profile of the system, including resource usage and the capacitance of nodes and nets, is extracted from the NCD file, while switching activities are computed with high accuracy based on timing simulation from the VCD files. Much more realistic power information is obtained at this stage; thus, XPA can achieve more accurate results in terms of power estimation. XPA generates a text-based power report that shows the power distribution of the system, which can be used for optimisation.

Finally, it is possible to measure the power of a design loaded and functioning in an FPGA using lab equipment to monitor the currents on the various voltage rails. This can give a better view of overall board-level power.

# 2.4 Summary

Reducing power dissipation has become a crucial issue in wireless communication systems, especially for portable devices. In this chapter, the power consumption of FPGA systems is discussed. The power estimation and analysis tools of Xilinx are also studied and employed in this research to evaluate the power dissipation of the researched system for power consumption optimisation. Some low-power design strategies are suggested. This chapter also provides the background of OFDM in terms of its mathematical representation and functionality, and then discusses the advantages and limitations of OFDM. The concept of an MSCR based on OFDM techniques is presented. The challenges of implementing the MSCR system are introduced regarding the architecture and its performance. The effects of synchronisation on OFDM performance are also considered. Last but not least, the challenges of OFDM spectral leakage are discussed in light of the strict requirements imposed by recent wireless standards.

# Chapter 3

# Multiplierless Correlator Design for low-power systems

# 3.1 Introduction

The correlation operation plays an important role in a number of signal processing algorithms, and is commonly used to perform basic synchronisation in OFDM systems. Auto-correlation based techniques are preferred for implementing OFDM synchronisation on FPGA because of their lower hardware costs. Dick and Harris [35] reported on an FPGA implementation of an OFDM transceiver including such synchronisation. Wang et al. [37] also presented an FPGA implementation of an OFDM-WLAN synchronizer, in which the timing synchronisation is performed by double auto-correlation based on short training symbols, allowing a reduction in hardware cost on FPGA. Fort et al. [36] compared the performance and complexity of FPGA implementations of auto-correlation and cross-correlation, showing that the accuracy of cross-correlation algorithms is better, but at a significant hardware cost. They proposed a new cross-correlation, but it is still at least 5 times more complex to implement than auto-correlation, due to the fact that several multipliers are required.

Cross-correlation between received samples and a known preamble can achieve highly accurate timing synchronisation; however, this requires significant resources. Multiplierless correlators for timing synchronisation were introduced in [90], designed for IEEE 802.11a OFDM frames, based on expressing the correlator coefficients as sums of powers of two that only require shift and add operations. The authors identified a correlator that eliminates the need for multiplication, requiring only 26 additions/subtractions per output while maintaining similar synchronisation accuracy to a multiplier-based implementation.

Modern FPGAs contain various resources that can be used to implement cross-correlation. This chapter presents the design of several correlators for timing synchronisation with preamble symbols based upon the IEEE 802.16d standard. We compare designs using specialised DSP Slices to a multiplierless approach on Xilinx Virtex-6 and low cost Spartan-6 FPGA devices. Attempting to implement correlation on FPGAs without considering and designing for the underlying architecture is likely to result in an inefficient implementation. In this chapter, we show optimised designs, built to fit the FPGA architecture, and evaluate performance, timing synchronisation accuracy, resource utilisation, and power consumption, to understand whether a multiplier-based mapping is beneficial when using modern devices.

The work presented in this chapter has also been discussed in:

• T. H. Pham, S. A. Fahmy, and I. V. McLoughlin, "Low-power correlation for IEEE 802.16 synchronisation on FPGA," in *IEEE Transactions on Very Large Scale Integration (VLSI) Systems*, vol. 21, no. 8, pp. 1549 - 1553, Aug. 2013.

# 3.2 Implementing Correlators

The downlink preamble in IEEE 802.16d [19] contains two consecutive OFDM symbols as shown in Fig. 3.1. The short symbol consists of 4 identical 64-sample fragments in time preceded by a cyclic prefix (CP). This is followed by the long symbol which contains two repetitions of a 128-sample fragment and a CP [19].

FIGURE 3.1: Downlink preamble symbols for IEEE 802.16.

FIGURE 3.2: Transposed direct form correlator.

The 64 samples in the short symbol are used to perform cross-correlation with received samples for timing synchronisation. Therefore, the correlators are designed to compute cross-correlation with 64 constant coefficients. In this chapter, we explore two approaches to implementing such correlators. The first is based on Xilinx Virtex-6 FPGA DSP48E1 Slices, the standard approach to such an implementation. The second is using multiplierless correlation implemented on both a Xilinx Virtex-6 and a low power Xilinx Spartan-6 device. Both implementations process real and imaginary 16-bit samples in Q1.15 fixed point format. The output is the sum of 64 coefficient products, each being smaller than unity. So, the complex output words are in 21-bit fixed point Q6.15 format.

If such a design was implemented blindly, with no consideration for the FPGA architecture, the synthesis tools would infer the use of embedded DSP blocks for multiplication, but would likely achieve poor timing performance due to an inability to optimise the use of the DSP block. The DSP48E1 primitives on the Virtex-6 FPGAs have additional circuitry within them that enables the design of optimised datapaths, but this must be done manually, through writing code in a particular style. Otherwise, the synthesis tools cannot always infer the most efficient structure. In our design, we have taken into account the internal structure of the DSP block, and made the design as lean as possible. The multiplierless design is specified manually, and cannot be inferred by the tools from a higher level description.

### 3.2.1 Design of DSP48E1 Based Correlator

The DSP48E1 block inside the Virtex-6, shown in Fig. 3.3, contains a multiplier followed by a configurable arithmetic unit to provide many independent functions, e.g., multiply, multiply-accumulate, multiply-add, three-input add, and more [2]. It also allows the datapath to be configured for various input combinations and register stages; a three-stage pipeline offers maximum performance. Since the DSP block is designed to mirror the structure of an FIR filter tap, it is ideally suited to implement correlation, and would hence be the method of choice for this application. Our first design uses non-pipelined DSP48E1 blocks in transposed direct form, as shown in Fig. 3.2, with 64 coefficients corresponding to the 64 samples in the preamble. The coefficients are pre-computed according to the IEEE 802.16d specification. The second design spreads the complex multiply-adds over a five-stage pipeline, shown in Fig. 3.4, consisting of DSP48E1 blocks configured for three-stage internal pipelining.  $Ri\_Re$ ,  $Ri\_Im$  are the real and imaginary parts of the received sample, respectively.  $Pr\_Re$ ,  $Pr\_Im$  similarly represent the known preamble sample. The pipeline registers for  $Pr\_Re$  and  $Pr\_Im$  are eliminated because these values are constant.  $Re$ ,  $Im$  are the real and imaginary parts of the previous multiply-add. Fig. 3.5 presents the pipeline structure of the correlator. The additional pipeline registers are required to handle the received sample. Adding pipeline registers should increase performance significantly.
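To make the transposed direct form of Fig. 3.2 concrete, the sketch below models it in floating-point Python with NumPy and compares it against `numpy.correlate`. It is a behavioural illustration only: the function name and the random test data are assumptions, and it does not model the Q1.15 arithmetic or the DSP48E1 pipeline registers.

```python
import numpy as np


def transposed_correlator(samples, preamble):
    """Cross-correlate samples with a known preamble, one sample per step.

    Every tap multiplies the current input sample by its constant
    coefficient and adds the partial sum held in the neighbouring register.
    """
    taps = np.conj(np.asarray(preamble)[::-1])   # correlation as an FIR filter
    regs = np.zeros(len(taps), dtype=complex)    # partial-sum registers
    out = np.empty(len(samples), dtype=complex)
    for n, x in enumerate(samples):
        products = taps * x                      # all taps see the same sample
        out[n] = products[0] + regs[0]
        regs[:-1] = products[1:] + regs[1:]      # shift partial sums along
        regs[-1] = 0.0                           # last register has no input
    return out
```

Once the first `len(preamble) - 1` outputs have passed, each output equals the direct sum of products between the most recent received samples and the conjugated preamble.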




FIGURE 3.3: Structure of DSP48E1 block inside the Virtex-6 [2].



FIGURE 3.4: Pipeline structure of the complex number multiply-add.



FIGURE 3.5: Pipeline structure of correlator using DSP48E1 blocks.



Figure 3.6: Structure of multiplierless correlators.

### 3.2.2 Design of Multiplierless Correlator

The principle of multiplierless correlators is to represent the coefficients and round them in the form of summed powers of two. Hence, shifts and additions are performed instead of multiplying by coefficients. It is expected that multiplierless correlation is more efficient, but with embedded hard multipliers in modern FPGAs, it is unclear whether they should still be considered favourable. Furthermore, synchronisation accuracy must be considered. To explore this, four alternative multiplierless correlators are implemented using four coefficient sets with an increasing degree of rounding, to compare the cost and performance and evaluate against multiplier-based correlators. The coefficient sets are found by quantising the 64 normalised preamble samples with quantisations of 1, 0.5, 0.25, and 0.125.

The proposed structure for multiplierless correlators is shown in Fig. 3.6. This structure is based on the transposed direct form in Fig. 3.2. Instead of using multipliers to multiply input samples by coefficients, the  $Shift\_Add$  block and multiplexers are used to perform the equivalent operation without an actual multiplication. The  $Shift\_Add$  block, multiplexers and value of Pr[n] differ depending upon the quantised coefficient set being used, and these perform shifts and additions on received samples according to the degree of quantisation that is applied. To optimise resources when the coefficients are quantised to a small number of bits, one common  $Shift\_Add$  block is used for all 64 coefficients instead of 64 separate  $Shift\_Add$  blocks. This common  $Shift\_Add$  block calculates all possible values for the 64 coefficients. The multiplexers are used to select the corresponding  $Shift\_Add$  outputs to accumulate in order to generate the correlator output. The selections are determined by the quantised coefficients Pr[n], which are pre-computed by quantising the 64 preamble samples. Since the Pr[n] values are constants, after synthesising the design, the multiplexer is optimised as hard-wired logic, and the preamble cannot be changed. To support different OFDM preambles, the Pr[n] could be stored in a register, and a real multiplexer used instead of hard-wired logic. This results in increased resource utilisation but provides a more flexible solution.
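The shift-and-add principle can be illustrated with a small Python sketch operating on integer samples. The helper names are hypothetical, and the rounding rule used to quantise coefficients (Python's built-in `round`) is an assumption, since the rounding applied to the coefficient sets is not specified here.

```python
def quantise(coeff, step):
    """Round a normalised coefficient to the nearest multiple of step."""
    return round(coeff / step) * step


def shift_add_product(sample, coeff_q, frac_bits):
    """Multiply an integer sample by a quantised coefficient using only
    shifts and additions. coeff_q must be a multiple of 2**-frac_bits."""
    k = round(coeff_q * (1 << frac_bits))   # coefficient as a signed integer
    negative = k < 0
    k = abs(k)
    acc = 0
    shift = 0
    while k:
        if k & 1:                 # this power of two is present
            acc += sample << shift
        k >>= 1
        shift += 1
    if negative:
        acc = -acc
    return acc >> frac_bits       # rescale with an arithmetic right shift
```

With a step of 0.25, for instance, a coefficient of 0.75 is formed as 0.5 + 0.25, so the product costs one addition and no multiplier. Finer steps add more terms per coefficient, which is consistent with the larger logic area of the finer-quantised designs.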

### 3.2.3 Implementation Results

The designs presented were synthesised and fully implemented using Xilinx ISE 13.2, targeting Xilinx Virtex-6 (V6) and Spartan-6 (S6) devices. The results of implementation are reported in terms of the number of occupied slices, DSP48E1 blocks and maximum frequency as summarised in Table 3.1.

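| Design    | DSP48E1s (V6) | Slices (V6) | Slices (S6) | Freq. (MHz, V6) | Freq. (MHz, S6) |
|-----------|---------------|-------------|-------------|-----------------|-----------------|
| DSPc      | 256 (88%)     | 742 (6%)    | -           | 119             | -               |
| $DSP\_pp$ | 256 (88%)     | 1,110 (9%)  | -           | 398             | -               |
| ML1       | 0 (0%)        | 661 (5%)    | 762 (6%)    | 309             | 174             |
| ML2       | 0 (0%)        | 983 (8%)    | 1,071 (9%)  | 268             | 158             |
| ML3       | 0 (0%)        | 1,191 (10%) | 1,257 (10%) | 234             | 136             |
| ML4       | 0 (0%)        | 1,496 (12%) | 1,517 (13%) | 208             | 124             |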

Table 3.1: Resource utilisation summary.

DSPc, DSP\_pp are correlator designs using DSP Slices, in non-pipelined and pipelined structures, respectively. ML1, ML2, ML3, ML4 are multiplierless correlators with coefficient quantisations of 1, 0.5, 0.25, 0.125, respectively.

Table 3.1 reveals that  $DSP\_pp$  uses more logic slices than DSPc due to its pipelined structure. The slices in the DSP48E1-based designs are used for registers and route-thrus, while the slices in the multiplierless designs are mostly used as logic. The number of slices used in the multiplierless designs increases as the coefficient quantisation becomes finer. The DSP48E1-based designs use 256 DSP Slices, 4 for each complex multiply, plus 6-9% of logic resources. The multiplierless designs use only logic to compute the cross correlation with 64 complex coefficients. The total logic area is a small fraction of the whole device: around 5-12% of total resources in the Virtex-6, and around 6-13% of total resources in the equivalent Spartan-6. While Spartan-6 devices do include DSP Slices, their number is insufficient to implement the full 64 sample complex cross correlation. This shows an ideal scenario where multiplierless correlation makes sense, and hence serves as a motivation for this study.

The maximum frequencies, reported after place and route, decrease for multiplierless designs according to the degree of coefficient quantisation. Meanwhile the non-pipelined DSP48E1 design is slower than the multiplierless designs. However, the pipelined DSP48E1 design can achieve significantly higher frequency.

A post-place-and-route simulation in ModelSim was used to estimate the power consumption of the system using the Xilinx XPower tool. Table 3.2 shows the power dissipation of the designs running at 50 MHz. The DSP48E1-based correlators consume more power than the multiplierless correlators, primarily because of increased dynamic power when using the DSP48E1s on the Virtex-6. The dynamic power of the non-pipelined DSP48E1-based correlator *DSPc* is greatest at 846 mW, but pipelining reduces this by a factor of more than 2.5, due to reduced switching activity between the multiplier and adder. The dynamic power of the multiplierless designs increases from 133 mW to 203 mW on the Virtex-6 and from 149 mW to 294 mW on the Spartan-6 as finer coefficient quantisation is used. It is important to note that the quiescent power of the Spartan-6 is much lower by design. Hence, we can see that using this multiplierless technique allows us to implement synchronisation on a Spartan-6 device, where a multiplier-based design is not possible due to resource limitations, while also saving significant power. In the next section, we will evaluate the impact on accuracy.

| Correlators | Quiescent V-6 (mW) | Quiescent S-6 (mW) | Dynamic V-6 (mW) | Dynamic S-6 (mW) | Total V-6 (mW) | Total S-6 (mW) |
|-------------|--------------------|--------------------|------------------|------------------|----------------|----------------|
| DSPc        | 1312               | -                  | 846              | -                | 2158           | -              |
| $DSP\_pp$   | 1300               | -                  | 328              | -                | 1628           | -              |
| ML1         | 1296               | 67                 | 133              | 149              | 1429           | 216            |
| ML2         | 1296               | 68                 | 160              | 197              | 1456           | 265            |
| ML3         | 1297               | 70                 | 182              | 239              | 1479           | 309            |
| ML4         | 1297               | 71                 | 203              | 294              | 1500           | 365            |

Table 3.2: Correlator power consumption at 50 MHz.



Figure 3.7: Correlator power consumption at different frequencies.

We also investigated how total power consumption varies with frequency, as shown in Fig. 3.7. As frequency increases, the finer quantisations and DSP48E1-based designs begin to consume proportionally more power. Overall, multiplierless designs on the Spartan-6 consume 75% to 85% less power than the same designs on the Virtex-6, and a 0.25 quantisation design on the Spartan-6 consumes 81% to 85% less power than the DSP48E1-based design on a Virtex-6.
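The savings quoted above can be cross-checked at one operating point using the 50 MHz totals from Table 3.2. This Python sketch covers only that single frequency, so it indicates where the quoted ranges come from rather than reproducing Fig. 3.7; the function and variable names are illustrative.

```python
# Total power in mW at 50 MHz, copied from Table 3.2.
total_v6 = {"DSPc": 2158, "DSP_pp": 1628,
            "ML1": 1429, "ML2": 1456, "ML3": 1479, "ML4": 1500}
total_s6 = {"ML1": 216, "ML2": 265, "ML3": 309, "ML4": 365}


def saving_percent(low, high):
    """Percentage by which `low` undercuts `high`."""
    return 100.0 * (1.0 - low / high)


# Same multiplierless design on Spartan-6 versus Virtex-6.
same_design = {name: saving_percent(total_s6[name], total_v6[name])
               for name in total_s6}

# 0.25 quantisation (ML3) on Spartan-6 versus the DSP48E1 designs on Virtex-6.
ml3_vs_dsp = {name: saving_percent(total_s6["ML3"], total_v6[name])
              for name in ("DSPc", "DSP_pp")}
```

At 50 MHz these evaluate to roughly 76% to 85% for the same design on both devices, and to roughly 81% and 86% for ML3 against the pipelined and non-pipelined DSP48E1 designs, respectively.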

The *DSPc* implementation represents how an architecture-independent design would be mapped. Our architecture-aware designs show significantly improved performance, reduced area, and reduced power consumption.



Figure 3.8: Correlator output with  $SNR = 10 \, dB$ .

# 3.3 Simulation and Discussion

In order to validate our designs at the application level, we simulate them using ModelSim with an IEEE 802.16 OFDM frame created using MATLAB, including the preamble symbols, data symbols and effects of an AWGN channel. Cross-correlation results using the correlator designs are compared to corresponding results in MATLAB to verify the correctness of implementation. To evaluate the accuracy of timing synchronisation achievable by these designs, the correlation outputs are plotted in Fig. 3.8 for random data frames at 10 dB SNR. The output of each correlator is slightly different because of rounding, but the timing synchronisation depends upon the location of the peaks being at the position of the preamble. All correlator designs achieve this most of the time, as shown at indices 34 and 98 for the CP samples and the first preamble samples respectively, for a single frame.

In order to evaluate the synchronisation accuracy of these approaches, we simulate 10,000 correlation operations in an AWGN channel with the following detection strategy. First, find the first peak, P1, over 64 samples. Next, find the second peak, P2, in the 64 samples following the first peak and compute the average value, avg, of the samples between the two peaks. If  $(P1 - avg) \leq 0.75 * (P2 - avg)$ , the start of frame is detected and the correctness of the position can be checked. It should be noted that in all cases, the peaks were known to be located within the two search regions and that the detection strategy described above is compatible with those in other published work such as [4] and [90]. Fig. 3.9 plots the results in terms of failure rate against AWGN SNR and shows that the designs are able to accurately detect the start of frame even in low SNR conditions. The failure rate of ML1 is the highest, as expected, due to the coarse quantisation. For SNRs above 4 dB, the failure rates of ML2, ML3, ML4, and DSPc differ by less than 0.05% from each other. This suggests that sacrificing arithmetic accuracy by using multiplierless cross-correlation is feasible and has negligible impact on synchronisation accuracy. Combined with the results in the previous section, we can be confident that low-power FPGAs, such as the Spartan-6, with insufficient resources for multiplier-based correlation, are still feasible for implementing robust OFDM receivers.

FIGURE 3.9: Detection failure rate with increasing SNR.
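The detection strategy used in this section can be sketched in Python as follows. The function name, the use of correlator output magnitudes as input, and the default parameters are assumptions made for illustration; the comparison itself is taken exactly as stated in the text.

```python
import numpy as np


def detect_frame(mag, window=64, ratio=0.75):
    """Return (detected, p1_index, p2_index) for correlator magnitudes."""
    mag = np.asarray(mag, dtype=float)
    i1 = int(np.argmax(mag[:window]))                        # first peak P1
    start = i1 + 1
    i2 = start + int(np.argmax(mag[start:start + window]))   # second peak P2
    between = mag[start:i2]                                  # samples between peaks
    avg = between.mean() if between.size else 0.0
    detected = (mag[i1] - avg) <= ratio * (mag[i2] - avg)
    return bool(detected), i1, i2
```

For a hypothetical magnitude trace with a floor of 0.1, a peak of 0.5 at index 34 and a peak of 1.0 at index 98, the function reports a detection with the peaks at those two indices.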

# 3.4 Summary

The DSP48E1 blocks on modern Virtex-6 FPGA devices seem to offer the ideal resource for implementing correlation-based frame synchronisers. However, as we have discovered, in the context of synchronisation for IEEE 802.16 OFDM systems, simplified multiplierless designs offer comparable synchronisation performance. While the DSP48E1-based correlators can obtain higher clock speeds, this is only possible through a detailed pipelined design. Furthermore, their power consumption and resource usage are considerably greater. Since low-power, low-cost devices such as the Xilinx Spartan-6 do not include sufficient DSP blocks, this suggests adopting multiplierless designs for low-power implementations. We have shown that while very low quantisation resolution does impact synchronisation performance, with a step size of just 0.5, synchronisation accuracy is on par with multiplier-based correlation. Multiplierless correlation on a Spartan-6 can save over 85% power compared to a DSP block design on a Virtex-6 FPGA.

# Chapter 4

# A Method for OFDM Timing Synchronisation

# 4.1 Introduction

OFDM performance is sensitive to receiver synchronisation [91]. Frequency offset causes inter-subcarrier interference, and errors in timing synchronisation can lead to inter-symbol interference. Therefore, accurate synchronisation is critical to the performance of OFDM systems. This chapter first summarises conventional synchronisation methods used for OFDM systems, discussing their merits and drawbacks. The following sections then present a new and efficient method that is robust to large frequency offsets, achieves accurate synchronisation, and has relatively low computational complexity. It is suitable for hardware implementation on reconfigurable systems.

The method presented in this chapter has also been discussed in:

• T. H. Pham, I. V. McLoughlin, and S. A. Fahmy, "Robust and Efficient OFDM Synchronisation for FPGA-Based Radios," in *Circuits, Systems, and Signal Processing, vol. 33, no. 8, pp. 2475 - 2493, Aug. 2014, Springer* [92].

# 4.2 Related Work

The autocorrelation-based method is commonly preferred for implementing synchronisation because of its low computational requirements. In [3], the timing metric M[d] is defined as a normalised correlation power:

$$M[d] = \frac{|P[d]|^2}{(R[d])^2},\tag{4.1}$$

where d denotes a time index corresponding to the first sample in a search space comprising 2L samples of received signal r. The power measure R, which represents the energy of the second half of the receiver search window, is defined as:

$$R[d] = \sum_{m=0}^{L-1} |r[d+m+L]|^2, \tag{4.2}$$

and the correlation term P measures the correlation between the two periodic halves of the search window:

$$P[d] = \sum_{m=0}^{L-1} (r^*[d+m]r[d+m+L]). \tag{4.3}$$

The metric M[d] for a typical channel is illustrated in Fig. 4.1, where it can be seen to form a distinct plateau when the preamble is present within a received window. Clearly, M[d] accurately detects the preamble; however, the flat top of the plateau introduces uncertainty in position, and hence degrades the accuracy of determining the exact start of the frame in practice.
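The metrics in (4.1) to (4.3) can be computed directly, as in the Python sketch below. It is a straightforward, unoptimised evaluation for illustration; the function name is an assumption, and a hardware implementation would normally update the sums recursively rather than recompute them for every d.

```python
import numpy as np


def timing_metric(r, L):
    """Evaluate M[d], P[d] and R[d] for every full 2L-sample window of r."""
    r = np.asarray(r, dtype=complex)
    n = len(r) - 2 * L + 1
    P = np.empty(n, dtype=complex)
    R = np.empty(n)
    for d in range(n):
        first = r[d:d + L]
        second = r[d + L:d + 2 * L]
        P[d] = np.sum(np.conj(first) * second)   # (4.3)
        R[d] = np.sum(np.abs(second) ** 2)       # (4.2)
    M = np.abs(P) ** 2 / R ** 2                  # (4.1)
    return M, P, R
```

In a noise-free case where the window holds two identical halves, P[d] equals R[d] and the metric is exactly 1. With a cyclic prefix in front of the repeated fragments, this holds for a run of consecutive d values, which produces the plateau.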



FIGURE 4.1: The timing metric in [3] applied to IEEE 802.16-2009 preamble in the AWGN channel (SNR = 10dB)

To improve the accuracy of synchronisation, a combined scheme is often used, as in [93, 94, 38, 39, 40]. It uses the above metric to detect the frame and to estimate the coarse STO and fractional CFO, and then applies additional cross-correlation operations to compute the fine STO and integer CFO.

#### 4.2.1 Coarse STO and Fractional CFO Estimation

First, the plateau of the M[d] metric is used to detect a frame start and estimate coarse STO by comparing the magnitude of M[d] to a given threshold. After this, the CFO is estimated using the P metric in (4.3). We now assume that we are receiving a signal s[d] which has a normalised carrier frequency offset  $\xi$  with respect to r[d]:

$$s[d] = e^{j2\pi\xi \frac{d}{N}} * r[d]. \tag{4.4}$$

where N is the number of subcarriers of an OFDM symbol. At the plateau of M[d], the received samples in the window will be periodic:

$$r[d] = r[d+L]. \tag{4.5}$$

Substituting (4.4) and (4.5) in (4.3)

$$\begin{aligned}
P[d] &= \sum_{m=0}^{L-1} (s^*[d+m]s[d+m+L]) \\
&= e^{j2\pi\xi \frac{L}{N}} \sum_{m=0}^{L-1} |r[d+m]|^2
\end{aligned} \tag{4.6}$$

The CFO can be determined as follows:

$$\xi = \frac{\angle P[d] + 2\pi z}{2\pi \frac{L}{N}} \tag{4.7}$$

where  $\angle P[d]$  is the angle of P[d] within the range  $-\pi$  to  $\pi$ . The CFO consists of two parts: the fractional part,  $\hat{\lambda} = \frac{\angle P[d]}{2\pi \frac{L}{N}}$ , which can be estimated from  $\angle P[d]$ , and the integer part (IFO),  $\hat{\epsilon} = \frac{zN}{L}$ , which is determined by an integer z. The fractional CFO estimate is limited to the following range:

$$-\frac{N}{2L} < \hat{\lambda} < \frac{N}{2L} \tag{4.8}$$

Clearly, the range of fractional CFO estimation can be increased by decreasing the period length L. However, because of channel noise, a smaller value of L may degrade the accuracy of the fractional CFO estimate  $\hat{\lambda}$ . In the case of the IEEE 802.16-2009 preamble [19], the number of subcarriers N and the period length L are commonly chosen to be 256 and 64 samples, respectively [95], so the fractional CFO estimation range is -2 to +2 subcarrier spacings, and the IFO part  $\hat{\epsilon}$  is an integer multiple of 4.
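A minimal numerical sketch of (4.4) to (4.8) makes the range limit concrete. It assumes illustrative values N = 256 and L = 64, for which N/(2L) = 2 matches the range of -2 to +2 subcarrier spacings quoted above, and it uses a hypothetical chirp pattern in place of the real preamble; the function names are not from the source.

```python
import cmath
import math

def apply_cfo(r, xi, N):
    """Rotate samples by a normalised CFO xi, as in (4.4)."""
    return [cmath.exp(2j * math.pi * xi * d / N) * x for d, x in enumerate(r)]

def estimate_fractional_cfo(s, d, L, N):
    """Estimate the CFO from the angle of P[d], i.e. (4.7) with z = 0."""
    P = sum(s[d + m].conjugate() * s[d + m + L] for m in range(L))
    return cmath.phase(P) / (2 * math.pi * L / N)

N, L = 256, 64  # illustrative values: N / (2L) = 2
pattern = [cmath.exp(2j * math.pi * k * k / L) for k in range(L)]  # hypothetical
r = pattern * 2  # two identical periods, so r[d] = r[d + L] as in (4.5)

# A CFO inside the range of (4.8) is recovered directly
inside = estimate_fractional_cfo(apply_cfo(r, 1.5, N), 0, L, N)
# A CFO outside the range wraps around by a multiple of N / L
outside = estimate_fractional_cfo(apply_cfo(r, 2.5, N), 0, L, N)
print(inside, outside)
```

The wrapped estimate differs from the true offset by N/L = 4 subcarrier spacings, which is exactly the ambiguity that the integer term z in (4.7) has to resolve.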

This method is fast and robust for estimating coarse STO and fractional CFO. But it has some drawbacks. First, the coarse STO is estimated by comparing the metric to a threshold. It is not certain that the samples for estimating the fractional CFO are within the plateau of M[d] in which the received samples in the window will be periodic. This degrades the performance of fractional CFO estimation. Second, the CFO estimation only performs correctly in the limited range of the fractional CFO shown in (4.8). If the CFO is outside this range, it is necessary to estimate the integer CFO separately.

#### 4.2.2 Fractional CFO Compensation

Before fine STO estimation is performed using cross-correlation, which is sensitive to CFO, the estimated fractional CFO must be used for frequency offset compensation by phase de-rotating the received samples in the time domain:

$$\widehat{r[d]} = e^{-j2\pi\xi \frac{d'}{N}} * s[d]; \quad d' = d + t_{off}. \tag{4.9}$$

 $\widehat{r[d]}$  and s[d] are the compensated samples and frequency offset samples, respectively. d' is the coarse estimated timing index, which differs from the correct timing index by  $t_{off}$  samples. Substituting (4.4) in (4.9), we get:

$$\widehat{r[d]} = e^{-j2\pi\xi \frac{(d+t_{off})}{N}} * e^{j2\pi\xi \frac{d}{N}} * r[d]$$

$$= e^{-j2\pi\xi \frac{t_{off}}{N}} * r[d]. \tag{4.10}$$

When compensating the fractional CFO, the fine STO has still not been estimated. This results in a common phase error within the coarse CFO compensated samples as shown in (4.10).
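The common phase error in (4.10) can be checked numerically. The sketch below de-rotates a CFO-affected signal with a timing index that is off by `t_off` samples; the sample values, N = 256, and the helper names are assumptions chosen only for illustration.

```python
import cmath
import math

def apply_cfo(r, xi, N):
    """Model the received signal s[d] of (4.4)."""
    return [cmath.exp(2j * math.pi * xi * d / N) * x for d, x in enumerate(r)]

def derotate(s, xi, N, t_off=0):
    """Phase de-rotate as in (4.9); t_off models the residual timing error."""
    return [cmath.exp(-2j * math.pi * xi * (d + t_off) / N) * x
            for d, x in enumerate(s)]

N, xi, t_off = 256, 0.5, 10          # illustrative values
r = [cmath.exp(0.3j * d * d) for d in range(32)]  # arbitrary unit-magnitude samples
s = apply_cfo(r, xi, N)
r_hat = derotate(s, xi, N, t_off)

# Ratio of compensated to original samples: the same complex factor for every d
residual = [y / x for x, y in zip(r, r_hat)]
expected = cmath.exp(-2j * math.pi * xi * t_off / N)
print(abs(residual[0] - expected), abs(residual[-1] - expected))
```

With `t_off=0` the de-rotation returns the original samples, so the residual rotation depends only on the timing error and not on the sample index.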

#### 4.2.3 Fine STO Estimation

The purpose of timing synchronisation is to find the correct starting point for receiver demodulation. This starting point will generally mark the beginning of a discrete Fourier transform (DFT) window. Coarse time synchronisation is commonly based on auto-correlation as mentioned above, leading to relatively simple hardware implementation. However, with purely coarse STO estimation, it is almost impossible to achieve sufficient accuracy to correctly detect the starting point. Thus, fine STO estimation is necessary to refine the accuracy of timing synchronisation. To obtain a more accurate fine STO estimate, cross-correlation is usually performed between the received samples and the known transmitted preamble. The cross-correlation peak occurs when the received samples match the known preamble. Fine STO can thus be found by locating this peak within a search window. In order to reduce the overhead of cross-correlation, a sign-bit multiplier can be used instead of a full multiplier to calculate the cross-correlation. However, this is much more sensitive to CFO in practice [93]. Some researchers employ enhanced methods based on cross-correlation for time synchronisation. For example, Kishore and Reddy [4] presented an algorithm using cross-correlation between the known transmitted and received preamble symbols. The timing metric for this method is again the normalised M[d], with the same denominator R as in (4.1) and (4.2), but the numerator P is redefined as follows:

$$P[d] = \sum_{m=0}^{L-1} (r[d+m]a[m])^* (r[d+m+L]a[m]), \tag{4.11}$$

where d again denotes the time index corresponding to the first sample in a window of 2L samples of received signal r and superscript "\*" again denotes complex conjugation. In (4.11) the samples a are now the known, transmitted, time domain preamble samples.



FIGURE 4.2: The timing metric in [4] applied to IEEE 802.16 preamble in the AWGN channel (SNR = 10dB)

The cross-correlation now produces distinct peaks at the times when the received samples match the known transmitted samples of the preamble. Fig. 4.2 illustrates this by plotting the metric M against the sample index d for an example AWGN channel with SNR = 10 dB. A frame is detected when M crosses a threshold, and the start of the frame is found by searching for the peaks of M. This can accurately determine the start of the frame even at low SNRs. Moreover, this method is robust to large CFO for time synchronisation: simulations in [4] show that time synchronisation can be performed with a CFO of 10.5 subcarrier spacings. However, the cross-correlation operation requires complex computation. Although the complexity of cross-correlation can be reduced using multiplierless correlation [90], the multiplierless correlator degrades the precision of the metric P, which is used for estimating the frequency offset. In general, this method is appropriate for time synchronisation but requires significant hardware resources for cross-correlation when implemented.
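The basic idea of locating the frame start from a cross-correlation peak, and the sensitivity of a plain cross-correlation to CFO, can be sketched as follows. This is a generic matched correlator, not the metric (4.11) of [4]; the 16-sample chirp preamble, the zero padding, and the CFO value are hypothetical.

```python
import cmath
import math

def cross_correlate(r, a):
    """Return |sum_m conj(a[m]) r[d + m]| for every valid offset d."""
    K = len(a)
    return [abs(sum(a[m].conjugate() * r[d + m] for m in range(K)))
            for d in range(len(r) - K + 1)]

K = 16
a = [cmath.exp(-1j * math.pi * k * k / K) for k in range(K)]  # hypothetical preamble
true_start = 5
r = [0j] * true_start + a + [0j] * 7  # noise-free toy frame

corr = cross_correlate(r, a)
fine_sto = max(range(len(corr)), key=corr.__getitem__)  # index of the peak

# The same frame with an uncompensated CFO (illustrative xi and N)
xi, N = 2.0, 64
s = [cmath.exp(2j * math.pi * xi * d / N) * x for d, x in enumerate(r)]
corr_cfo = cross_correlate(s, a)
print(fine_sto, corr[true_start], corr_cfo[true_start])
```

The peak at the true start shrinks once the CFO is applied, because the phase rotation across the preamble prevents the products from adding coherently.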

# 4.3 Proposed Fractional CFO Estimation and Synchronisation

Considering the limitations of previous work, we propose new timing metrics that take advantage of preamble characteristics such as its periodicity and energy distribution. Again, the proposed method is illustrated for the specific case of the IEEE 802.16-2009 preamble, as shown in Fig. 4.4.

Firstly, we define the autocorrelation between the two halves of the receiver window for normalisation purposes:

$$P'[d] = \sum_{m=0}^{M-1} (r^*[d+m]r[d+m+L]) \tag{4.12}$$

where d denotes a time index of the received signal r that now corresponds to the first sample in a window of L + M samples, L is the length of the preamble period (64 for the IEEE 802.16 preamble), and M is the number of received samples used for estimation.

Next, a power measure, R', is proposed which takes account of both received and transmitted symbol power:

$$R'[d] = \sum_{m=0}^{M-1} |r[d+m+L]|^2 |a[m]|^2 \tag{4.13}$$

In common with the metric developed by Schmidl and Cox [3], our P'[d] detects the periodic characteristic of the preamble and can thus be used for estimating the fractional CFO. P' forms a plateau, evident in Fig. 4.3, when the preamble is present within the receiver window. R' is then used to find the starting point of the preamble based on its energy distribution. R' produces the peaks shown in Fig. 4.3 at the points where the energy distribution of the received samples matches that of the transmitted preamble. Since the R' metric that is used to estimate time synchronisation is computed on the square of the amplitude of received samples,



FIGURE 4.3: Proposed timing metrics applied to the IEEE 802.16 preamble in AWGN (SNR =  $10\,\mathrm{dB}$ , CFO = 10.5).

the time synchronisation is insensitive to the CFO, which affects only the phase of the received samples. Moreover, computing on the squared amplitude, a real number, requires fewer resources than using complex numbers as in alternative approaches such as [4]. In addition, the time synchronisation is performed based on the peak of R' rather than its absolute value. Therefore R' is a good candidate for implementation using a multiplierless correlator to reduce computational complexity. Fig. 4.4 illustrates the flow of the proposed synchronisation according to the received samples of the preamble, compared to that of the conventional scheme. As can be seen, in the conventional scheme the coarse STO and fractional CFO estimation are performed in the first two periods of the short training symbol from (i) to (iii), and then the estimated fractional CFO is used for compensation. The fine STO is estimated in the last period of the short training symbol from (iv) to (vi), and the integer CFO should be computed in the long training symbol from (vii) to (viii). By contrast, the proposed method performs STO synchronisation and fractional CFO estimation in the first three periods of the short training symbol from (i) to (iv), and the start of frame will be detected at (ii). Then the fractional CFO compensation is computed, and integer CFO estimation is performed in the long training symbol from (vii) to (viii).
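The CFO insensitivity of the energy metric in (4.13) follows from the fact that a phase rotation leaves |r|^2 unchanged. The sketch below illustrates this with a hypothetical four-sample pattern of unequal amplitudes, M = L, and an illustrative CFO; none of these values come from the thesis.

```python
import cmath
import math

def energy_metric(r, a, L):
    """R'[d] = sum_m |r[d + m + L]|^2 |a[m]|^2, as in (4.13), with M = len(a)."""
    M = len(a)
    a_energy = [abs(x) ** 2 for x in a]
    return [sum(abs(r[d + m + L]) ** 2 * a_energy[m] for m in range(M))
            for d in range(len(r) - L - M + 1)]

L = 4
amplitudes = [2.0, 1.0, 0.5, 0.25]  # hypothetical uneven energy distribution
pattern = [amp * cmath.exp(1.1j * k) for k, amp in enumerate(amplitudes)]
r = [0j] * L + pattern * 3          # silence followed by three preamble periods

# Apply a large CFO (illustrative values) to the same samples
N, xi = 16, 10.5
s = [cmath.exp(2j * math.pi * xi * d / N) * x for d, x in enumerate(r)]

R_clean = energy_metric(r, pattern, L)
R_cfo = energy_metric(s, pattern, L)
print(max(abs(x - y) for x, y in zip(R_clean, R_cfo)))
print(R_clean[3], R_clean[4], R_clean[5])
```

The metric is the same with and without the CFO, and it peaks where the energy profile of the window lines up with that of the known pattern (every L samples in this toy signal).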



FIGURE 4.4: The synchronisation flow according to the received samples within the preamble showing its packet format above the conventional synchronisation scheme flow, and proposed scheme below.

The method used for synchronisation based upon these metrics is as follows:

#### 4.3.1 Frame Synchronisation and Fractional CFO Estimation

First, frame detection is performed by comparing the metric P' to R' scaled by a threshold thr, as shown in (4.14):

$$|P'[d]| > thr * R'[d]. \tag{4.14}$$



FIGURE 4.5: Performance of the frame synchronisation versus selecting threshold in the AWGN channel (SNR =10dB)

The threshold is different for each channel and needs to be determined empirically by simulation (in common with other systems such as [4]). Fig. 4.5 shows the performance of the frame synchronisation in the given channel for different threshold values. For each threshold value, 1000 frame detections are simulated. The number of correct detections indicates how often the time synchronisation has been found correctly at the start of the frame. The number of false detections indicates how often the time synchronisation finds the wrong start of the frame. Otherwise, the frame is not detected (i.e. |P'[d]| never exceeds thr \* R'[d]), and a missed detection is declared. From the results plotted, the optimum value can be seen to be around 0.55. If the threshold is too small, noise may cause a false detection. On the other hand, when the threshold is large, the frame detection may be missed if noise reduces the amplitude of the timing metric so that it does not cross the threshold. Assuming the channel is known, the threshold can be selected from a look-up table. In the simulations reported in this chapter, each channel threshold has been pre-computed, based upon the current channel model and conditions.
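The detection rule (4.14) and the three outcomes counted in Fig. 4.5 can be expressed as a short sketch. The function names, the toy metric values, and the exact-match criterion for a correct detection are assumptions made for illustration.

```python
def detect_frame(P_abs, R, thr):
    """Return the first index d with |P'[d]| > thr * R'[d], or None if none exists."""
    for d, (p, r) in enumerate(zip(P_abs, R)):
        if p > thr * r:
            return d
    return None

def classify(estimated_start, true_start):
    """Label one trial as 'correct', 'false', or 'miss' (assumed exact-match criterion)."""
    if estimated_start is None:
        return "miss"
    return "correct" if estimated_start == true_start else "false"

# Hypothetical metric traces: |P'| ramps up while R' stays constant
P_abs = [0.1, 0.2, 0.5, 0.9, 1.0]
R = [1.0] * 5

print(detect_frame(P_abs, R, 0.55))  # threshold crossed part-way up the ramp
print(detect_frame(P_abs, R, 1.5))   # threshold too large: nothing detected
print(classify(None, 3), classify(3, 3), classify(2, 3))
```

In a full simulation the estimated start would come from the peak search on R' that follows detection, and the three labels would be counted over many noisy trials for each candidate threshold.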

After a frame is detected, the starting point of the short preamble is found by searching for the peaks in R'[d]. As can be seen in Fig. 4.3, two peaks in R'[d], at sample indices of -31 and 33, bracket the transition to the plateau in P'[d]. The second peak is used as the starting point for the DFT window. It is located by searching for the maximum value of R'[d] + R'[d - L] over the next L samples after frame detection. Second, the fractional CFO is estimated from the P'[d] metric as in (4.15), similarly to other work such as [3, 93, 94, 38, 39, 40]:

$$\hat{\lambda} = \frac{\angle P'[d]}{2\pi \frac{L}{N}} \tag{4.15}$$

where  $\hat{\lambda}$  is the estimated fractional CFO,  $\angle P'[d]$  denotes the angle of P'[d] and N is the number of subcarriers. In this proposed method, the fractional CFO is estimated at the starting point of the preamble to guarantee that the correct angle of P'[d] is taken for estimation. Using the conventional techniques, the inherent uncertainty in the coarse STO does not allow this to be achieved in practice. Thus, the estimated fractional CFO in the proposed method will, on average, tend to be more accurate. In addition, this method separates the length of preamble period L and the length of received samples for estimation M, and because of that L can be small to extend the range of fractional CFO, and M can be longer to increase the robustness to channel noise. For the IEEE802.16 preamble, M is set equal to 2L to improve the overall precision of synchronisation.
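The peak search used in this subsection, which maximises R'[d] + R'[d - L] over the L samples following frame detection, might be sketched as below. The function name, the toy values of R', and the handling of the array boundaries are assumptions.

```python
def find_frame_start(R, d_detect, L):
    """Return d in (d_detect, d_detect + L] that maximises R[d] + R[d - L].

    Assumes at least one candidate index lies inside the array.
    """
    first = max(d_detect + 1, L)             # R[d - L] must exist
    last = min(d_detect + L, len(R) - 1)     # stay inside the array
    return max(range(first, last + 1), key=lambda d: R[d] + R[d - L])

# Hypothetical R' trace with two peaks spaced L = 4 samples apart
L = 4
R = [0.0, 0.1, 1.0, 0.2, 0.1, 0.2, 1.1, 0.3, 0.1, 0.2, 0.1]
d_detect = 3  # index at which the threshold test fired (assumed)

start = find_frame_start(R, d_detect, L)
print(start)
```

Summing the two values one period apart rewards the index at which both peaks line up, so the second peak is selected even if a single noisy sample of R' is slightly larger elsewhere.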

#### 4.3.2 Fractional CFO Compensation

Compensation of the fractional CFO is performed in the traditional way by phase de-rotating the received samples in the time domain:

$$\begin{aligned}
\widehat{r[d]} &= e^{-j2\pi\Delta f_c dT_s} * s[d] \\
&= e^{-j2\pi\xi \frac{d}{N}} * e^{j2\pi\xi \frac{d}{N}} * r[d] \\
&= r[d].
\end{aligned} \tag{4.16}$$

In this proposed method, however, frame synchronisation is achieved before fractional CFO compensation is performed. The timing offset  $t_{off}$  is therefore removed, and the common phase error that fractional CFO compensation introduces in the conventional scheme, shown in (4.10), is eliminated. The compensated received samples are then demodulated using the DFT to obtain the received symbols in the frequency domain.

#### 4.3.3 Simulation Results and Discussion

We evaluate the performance of the proposed synchronization method, applied to the IEEE 802.16-2009 downlink preamble, in MATLAB under both AWGN channels as well as the more realistic SUI channels that are widely used in the research literature [4, 95].

The Stanford University Interim (SUI) channel model [96] is used to simulate a frequency selective channel and takes into account many wireless channel effects including delay spread, Doppler spread, phase noise, and channel interference. In addition, the values of the metrics computed in MATLAB are later verified against the corresponding values from the FPGA simulation, to ensure practical functional equivalence.

In total, 100,000 OFDM frames, each preceded by randomly seeded AWGN and containing preamble and data symbols, are used to evaluate synchronization performance for each method. The proposed method is compared to the state of the art method, in terms of accuracy of both time synchronization and fractional CFO estimation. The performance of STO estimation is measured in terms of failure rate (%), and the accuracy of CFO estimation is evaluated in terms of mean square error (MSE). We separately evaluate the robustness of time synchronization against large CFO for each method. Three versions of the proposed approach are constructed by varying the length, C, of received samples for estimation (denoted M in (4.12)). These are investigated to determine the tradeoff between accuracy and computational cost, with C defined as follows in each case:

- Prop 1: C = L
- Prop 2: C = 2L
- Prop 3: C = 3L

These are compared to the state of the art method (denoted as SoA) [83, 40], and the method of Kishore and Reddy [4] (denoted as K&R) for a number of evaluation scenarios. First, the performance of each method is found for AWGN channels beginning with CFO = 0.5, then AWGN channels with CFO varying from -10 to +10 times the subcarrier spacing, then SUI1, and finally SUI2 channels.

#### 4.3.3.1 Performance in AWGN



FIGURE 4.6: Performance of time synchronization in AWGN channels with a frequency offset of 0.5 times the subcarrier spacing.



Figure 4.7: Performance of fractional frequency offset estimation in AWGN channels.

Fig. 4.6 and Fig. 4.7 plot the performance results of STO and CFO estimation, respectively, in AWGN with a frequency offset of 0.5 subcarrier spacings. SoA and the proposed methods have much better performance than K&R in these tests, achieving perfect synchronization when SNR exceeds 5 dB. Prop1's STO estimation has slightly better accuracy than SoA with SNR below 3 dB but is worse at higher SNR. Increasing the length of received samples for estimation, i.e., setting C = 2L, allows Prop2 to obtain a remarkable improvement in STO estimation, clearly better than the estimation achievable by SoA. Prop3, with C = 3L, demonstrates diminishing gains: it is not able to enhance accuracy as much as Prop2, despite a considerable hardware cost incurred when increasing C. In addition, the CFO estimation of Prop3 and Prop2 achieves a significant improvement compared to the other methods in Fig. 4.7, while the accuracy of Prop1 and SoA is identical (the curve for Prop1 is hidden behind the curve for SoA as a result).

The gap between Prop1 and Prop2 is much larger than that between Prop2 and Prop3, again demonstrating diminishing gains as C is extended. Thus, increasing C from L to 2L is a more effective improvement than extending C from 2L to 3L. The accuracy of CFO estimation in Prop2 is improved by about  $5\,\mathrm{dB}$  in comparison to SoA and K&R. This improvement is a result of the increased length of received samples for estimation, C. These results show the proposed method to be competitive with state of the art methods.

#### 4.3.3.2 Performance in Fading Channels



FIGURE 4.8: Frame synchronization performance of various methods in an SUI1 channel with respect to SNR.



Figure 4.9: Frame synchronization performance of various methods in an SUI2 channel with respect to SNR.

Fig. 4.8 and Fig. 4.9 present the performance results of STO estimation in SUI1 and SUI2 channels, respectively. The proposed methods are seen to achieve much better accuracy than the K&R method. Compared to SoA, Fig. 4.8 reveals that the estimation of Prop1, Prop2 and Prop3 is more accurate when SNR is below 3 dB. However for higher SNRs, the accuracy of SoA is slightly better than that of the proposed method. Increasing the length of received samples for estimation achieves an improvement when SNR is below about 5 dB, although the results for Prop1, Prop2 and Prop3 saturate and become almost identical at higher SNRs, as does K&R.

#### 4.3.3.3 Performance with Large Frequency Offset



FIGURE 4.10: Performance of frame synchronization in an AWGN channel with uniform random frequency offset varying from -10 to 10 times carrier spacing, with respect to SNR.

The proposed method is designed to work even with large frequency offsets. Fig. 4.10 explores performance over 100,000 tests where the frequency offset is chosen randomly (with uniform distribution) from -10 to +10 times the subcarrier spacing for each test, in an AWGN channel. This experiment is specifically designed to investigate the robustness of STO estimation in large CFO conditions and shows that the proposed methods still maintain good performance. K&R exhibits some accuracy degradation; however, all of these methods are seen to outperform SoA. The proposed methods are therefore seen to offer robustness of STO estimation against large CFO. This robustness arises because the proposed metric is computed on magnitude values that are insensitive to phase errors. CFO estimation is not evaluated here because large CFO estimation requires an integer CFO estimator, which is investigated in the subsequent section.

In summary, simulation results show that Prop3 and Prop2 have better STO and CFO estimation accuracy compared to other methods. Prop3 enjoys just a small improvement in terms of STO estimation compared to Prop2 but this improvement incurs a significant hardware cost because the length of received samples for estimation must increase from 2L to 3L. Given the results described in this sub-section, Prop2 is selected as an implementation candidate in the subsequent sub-section, where the trade-off between accuracy and hardware cost is explored in more detail.

#### 4.3.4 Hardware Implementation

This sub-section discusses the implementation of the proposed method and investigates its hardware optimization in terms of word size. The target FPGA is a low-power Xilinx Spartan-6 XC6SLX45 device, with ISE 13.2 used to evaluate both hardware resource and power consumption. The results illustrate the trade-off between hardware consumption and the accuracy of the proposed method. The above sub-section already revealed that the performance of K&R is generally worse than the other methods; moreover, it requires many complex multiply operations to compute the cross-correlation for the P metric. For this reason, K&R is not implemented, and only the proposed method and SoA are compared here, in terms of hardware resources, power consumption, and accuracy. As mentioned above we set C = 2L since Prop2 consistently showed performance close to Prop3, and significantly better than Prop1. It therefore offers a good balance between increased hardware cost and performance.

#### 4.3.4.1 Implementation of Conventional Synchronizer

First, let us consider a conventional synchronizer. We have implemented this as shown in Fig. 4.11, following the efficient implementation methods presented in [94, 37, 38, 83]. The P and R timing metrics in (4.3) and (4.2), respectively, are computed using delay and summation, whilst a very efficient sign-bit multiplier [93] is used to significantly reduce the computational overhead of the cross-correlation for fine timing synchronization.



Figure 4.11: Architecture of the conventional synchronization FPGA implementation.

We assume a 16-bit two's complement fixed point representation with 15 fractional bits (i.e. Q1.15 format in Q-notation). The auto-correlation and squared amplitude of received samples are computed with a complex multiplier IP core that uses DSP slices. Results are scaled to Q1.15 to reduce resource usage in subsequent pipeline stages. Since this synchronizer is for the IEEE 802.16 preamble, the length of the delay is 64 samples. Each element is scaled to be less than 1, guaranteeing that the final summation result is less than 64, needing just 7 bits for representation in two's complement. Hence, the P and R values are represented in Q7.15 format. For the |P| metric, the auto-correlation uses a complex multiplier IP core to multiply the current received sample and the 64th delayed sample. The magnitude of the P metric is approximated to reduce hardware complexity as per [83]. For the R metric, another complex multiplier is used to compute the squared magnitude of the received sample. The threshold is commonly chosen to be 0.5, which can be implemented using a right shift by 1 bit [95] instead of using a multiplier. After the frame is detected, the P metric is used to estimate and correct the fractional CFO. A CORDIC IP core is used to determine the phase of the P metric, and to derive the estimated fractional CFO. The received samples are then compensated using phase accumulation and phase rotation. These compensated samples are now used to determine fine timing synchronization.
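The word-length reasoning and the shift-based threshold described above can be checked with a small integer sketch. The helper names are hypothetical, rounding to nearest is an assumption, and Python integers stand in for the hardware registers.

```python
FRAC_BITS = 15  # fractional bits of the Q1.15 and Q7.15 formats described above

def to_fixed(x, frac_bits=FRAC_BITS):
    """Convert a real value to an integer with frac_bits fractional bits (rounding assumed)."""
    return int(round(x * (1 << frac_bits)))

def integer_bits_for_sum(n_terms):
    """Two's complement integer bits (sign included) for a sum of n_terms
    values whose magnitudes are each below 1."""
    return (n_terms - 1).bit_length() + 1

def exceeds_half(p_mag, r_energy):
    """Threshold test |P| > 0.5 * R using a right shift by one bit instead of a multiplier."""
    return p_mag > (r_energy >> 1)

print(integer_bits_for_sum(64))                       # integer bits for a 64-sample sum
print(exceeds_half(to_fixed(0.6), to_fixed(1.0)))     # metric above half of R
print(exceeds_half(to_fixed(0.4), to_fixed(1.0)))     # metric below half of R
```

The same bit-count helper gives one extra integer bit when the summation window is doubled to 128 samples.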

#### 4.3.4.2 Implementation of Proposed Synchronizer

The architecture for our proposed method is shown in Fig. 4.12.



FIGURE 4.12: Architecture for the proposed synchronization method implemented on FPGA.

The format of received samples is similar to the conventional design. The R' metric is determined using an energy correlator as illustrated in Fig. 4.13. The number of samples used to compute it is 128 and since the 128 samples of the short preamble are arranged in two identical spans of 64 samples each, the correlator only needs 64 taps. Eqn. (4.17) shows the derivation of this optimization:

$$\begin{aligned}
R'(z) &= I(z)A_{127} + I(z)z^{-1}A_{126} + \dots + I(z)z^{-63}A_{64} \\
&\quad + I(z)z^{-64}A_{63} + \dots + I(z)z^{-127}A_{0} \\
&= I(z)A_{63} + I(z)z^{-1}A_{62} + \dots + I(z)z^{-63}A_{0} \\
&\quad + I(z)z^{-64}A_{63} + \dots + I(z)z^{-127}A_{0} \\
&= (I(z) + I(z)z^{-64})A_{63} + \dots + (I(z) + I(z)z^{-64})z^{-63}A_{0} \\
&= I(z)(1 + z^{-64})A_{63} + z^{-1}(I(z)(1 + z^{-64})A_{62} \\
&\quad + z^{-1}(\dots + z^{-1}I(z)(1 + z^{-64})A_{0})),
\end{aligned} \tag{4.17}$$

where I is the squared amplitude of the received sample and  $A_n$  denotes the normalised squared amplitude of known preambles. Following this, a multiplierless correlator, as described in detail in [97], is used to compute the output, shown in block diagram form in Fig. 4.13.
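The tap-halving step in (4.17) can be verified numerically: a correlator whose coefficients repeat with period H gives the same output as an H-tap correlator fed with I[n] + I[n - H]. The sketch below uses H = 4 instead of 64 and arbitrary values for I and A; the function names are hypothetical.

```python
def direct_correlator(I, A_full):
    """y[n] = sum_k A_full[T-1-k] * I[n-k], i.e. the first line of (4.17) with T taps."""
    T = len(A_full)
    return [sum(A_full[T - 1 - k] * I[n - k] for k in range(T) if n - k >= 0)
            for n in range(len(I))]

def folded_correlator(I, A_half):
    """Same output using H taps on the pre-summed input I[n] + I[n-H]."""
    H = len(A_half)
    # Pre-adder: implements the factor (1 + z^-H) applied to I(z)
    pre = [I[n] + (I[n - H] if n >= H else 0.0) for n in range(len(I))]
    return [sum(A_half[H - 1 - k] * pre[n - k] for k in range(H) if n - k >= 0)
            for n in range(len(I))]

A_half = [0.5, 1.0, 0.0, 0.5]                 # hypothetical coefficients A_0..A_3
I = [0.25, 1.0, 0.5, 2.0, 0.75, 0.125, 1.5, 0.5, 1.0, 0.25, 2.0, 0.5]

full = direct_correlator(I, A_half * 2)       # coefficients repeat: A_{n+H} = A_n
folded = folded_correlator(I, A_half)
print(max(abs(x - y) for x, y in zip(full, folded)))
```

The folded form needs one extra adder and an H-sample delay on the input, in exchange for halving the number of coefficient taps.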



Figure 4.13: Implementation of energy correlator on FPGA.

The values of  $A_n$  are quantised to 0.5 and a shift/multiplex operation replaces the multiplication. R' is always positive and smaller than twice the sum of all  $A_n$ elements, i.e., 63. So the R' metric requires just 6 bits to represent its integer part.

The P' metric is computed in an identical way to the conventional one, but the number of samples used in the computation is 128 instead of 64 since C = 2L. So, the moving summation uses a delay buffer of 128 samples, and the result value now requires 8 bits to represent the integer part.

#### 4.3.4.3 Effect of Reduced Precision

Let f1 and f2 be the number of bits representing the fractional part for computing P' and R' respectively. Thus P' has fixed point format Q8.f1 and R' has fixed point format Q6.f2. The effects of reducing the number of bits used to represent these fractional components of P' and R' will now be investigated, with the aim of optimising the reduced precision against hardware savings.
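As a rough illustration of what reducing f1 or f2 means numerically, the sketch below quantises a value to a given number of fractional bits. Truncation toward minus infinity is an assumption; the rounding mode of the actual hardware is not stated here, and `quantise` is a hypothetical helper.

```python
import math

def quantise(x, frac_bits):
    """Keep frac_bits fractional bits of x by truncation (assumed rounding mode)."""
    scale = 1 << frac_bits
    return math.floor(x * scale) / scale

value = 0.7  # arbitrary example value
for frac_bits in (15, 7, 4):
    q = quantise(value, frac_bits)
    # The truncation error is always below one least significant bit, 2**-frac_bits
    print(frac_bits, q, value - q)
```

Each bit removed doubles the worst-case quantisation step, which is the precision cost weighed against the hardware savings in the following analysis.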

Recall the results of CFO estimation in the simulation sub-section where Prop2 exhibited a significant improvement in CFO estimation accuracy compared to SoA thanks to an increase in the evaluation window size obtained by setting C = 2L. This requires more multiplications, leading to increased hardware cost to compute the P' metric. Reducing f1 allows a reduction in this extra hardware cost by making each individual computation simpler. The question is to determine how much reduction in f1 can be sustained without losing the performance advantage enjoyed by Prop2.

To understand precisely how f1 reduction can save hardware, Table 4.1 details the hardware resources required for implementing five representative word sizes. Meanwhile, Fig. 4.14 plots CFO performance curves for the corresponding sizes of f1. It is clear from the graph that all degraded precision computations perform well; even the lowest Q1.4 precision computation (Prop-4b) can outperform SoA below about 7 dB. The optimal choice for this range of SNRs is probably f1 = 7 bits (Prop-7b), which suffers just a slight decrease in accuracy compared to the 'full length' 15 bit version (*Prop-15b*). This allows a reduction of 21%, 24%, and 33% in the number of flip-flops (FF), look-up tables (LUT), and BRAM blocks, respectively. Moreover, *Prop-7b* still achieves excellent performance when compared to the state of the art method, *SoA*.



FIGURE 4.14: Performance of CFO estimation in an AWGN channel against SNR, with different numbers of fractional bits used in the computation of P'.

Similarly, the optimized tradeoff between the accuracy of STO estimation and hardware usage for computation of the R' metric is obtained by reducing f2. In this case, Table 4.2 reveals the corresponding reduction achieved in computation resources and Fig. 4.15 plots the frame synchronization fail rate against SNR for several values of f2. The performance of the proposed method with f2 = 6 bits,

Table 4.1: Resources required for computing P' on FPGA with different word lengths, Q1.f1.

| f1       | FF  | LUT | BRAM | DSP |
|----------|-----|-----|------|-----|
| Prop-15b | 304 | 427 | 96   | 3   |
| Prop-7b  | 240 | 323 | 64   | 3   |
| Prop-6b  | 232 | 311 | 60   | 3   |
| Prop-5b  | 224 | 297 | 56   | 3   |
| Prop-4b  | 216 | 269 | 52   | 3   |

Table 4.2: Resources required for computing R' on FPGA with different word lengths, Q1.f2.

| f2       | FF   | LUT  | BRAM | DSP |
|----------|------|------|------|-----|
| Prop-15b | 1404 | 1017 | 16   | 2   |
| Prop-7b  | 884  | 633  | 8    | 2   |
| Prop-6b  | 818  | 589  | 7    | 2   |
| Prop-5b  | 753  | 537  | 6    | 2   |
| Prop-4b  | 689  | 504  | 5    | 2   |



FIGURE 4.15: Performance of frame synchronization in an AWGN channel against SNR, with different numbers of fractional bits used in the computation of R'.

Prop-6b, can be seen to be almost identical to the 'full length' computation using 15 fractional bits, Prop-15b. Overall, Prop-6b achieves much more accurate estimation compared to the state of the art method, SoA. Reducing f2 to 6 bits allows a reduction of 41%, 42%, and 56% in the number of FFs, LUTs, and BRAM blocks.

#### 4.3.4.4 Optimized Alternatives

The preceding results are now used to define four alternative implementations of the proposed method to compare against the state-of-the-art method, SoA which uses full length Q1.15 arithmetic. These alternatives are namely:

- Prop-A1: a non-optimized instance of the proposed method with both f1 and f2 set to 15.
- Prop-A2: only P' is optimized with f1 = 7 while f2 remains set to 15.
- Prop-A3: only R' is optimized with f2 = 6 while f1 remains set to 15.
- Prop-A4: both P' and R' are optimized by setting f1 = 7 and f2 = 6.

Table 4.3: Total resources consumed by a full word length implementation of SoA and four reduced complexity instances of the proposed method. Dynamic (Dpwr) and quiescent (Qpwr) power consumption are reported in mW. Maximum frequency is reported in MHz.

|         | Slices | BRAM | DSP | Qpwr | Dpwr | Frequency |
|---------|--------|------|-----|------|------|-----------|
| SoA     | 930    | 112  | 13  | 37   | 41   | 121       |
| Prop-A1 | 1000   | 118  | 14  | 37   | 43   | 133       |
| Prop-A2 | 923    | 86   | 14  | 37   | 39   | 142       |
| Prop-A3 | 869    | 109  | 14  | 37   | 38   | 137       |
| Prop-A4 | 777    | 77   | 14  | 37   | 35   | 142       |

Table 4.3 reports the overall hardware resources required for these instances of the synchronizer, as well as detailing the power consumption of each. It should be noted that the CFO estimation and frame synchronization performance of these instances can be seen by choosing the corresponding word length from plots of P' & R' in Figs. 4.14 & 4.15 respectively. In other words, all instances have been simulated and reported in the previous plots. From the table, it is evident that reducing word length can yield a significant reduction in both hardware requirement and power consumption. The fully optimized alternative, Prop-A4, achieves a reduction of 16.4%, 31.2%, and 14.6% in the number of occupied slices, BRAMs, and in dynamic power consumption, respectively, when compared to SoA. The maximum frequencies of the SoA and proposed method implementations are also reported. The required frequency for baseband processing in IEEE 802.16 ranges from 5.6 to 22.4 MHz, and this requirement is easily met by all tested implementations.
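The quoted savings follow directly from the SoA and Prop-A4 rows of Table 4.3. A short check, with a hypothetical helper name:

```python
def reduction_percent(baseline, optimised):
    """Relative saving of `optimised` with respect to `baseline`, in percent."""
    return 100.0 * (baseline - optimised) / baseline

# Values copied from the SoA and Prop-A4 rows of Table 4.3
soa = {"slices": 930, "bram": 112, "dpwr": 41}
prop_a4 = {"slices": 777, "bram": 77, "dpwr": 35}

savings = {key: reduction_percent(soa[key], prop_a4[key]) for key in soa}
for key, value in savings.items():
    print(f"{key}: {value:.2f}%")
```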

Table 4.4 details the comparison between SoA and Prop-A4 in terms of their constituent building block resources (where the function names in this table correspond to the block diagrams of Figs. 4.11 and 4.12).

Table 4.4: Detailed Resource Comparison between two synchronization methods.

|         | Function    | FF   | LUT  | BRAM | DSP |
|---------|-------------|------|------|------|-----|
| SoA     | P metric    | 303  | 427  | 64   | 3   |
|         | R metric    | 168  | 186  | 16   | 2   |
|         | CFO comp    | 1478 | 1517 | 0    | 8   |
|         | Coarse time | 7    | 33   | 0    | 0   |
|         | Fine time   | 942  | 1009 | 32   | 0   |
|         | Total       | 2898 | 3172 | 112  | 13  |
| Prop-A4 | P' metric   | 240  | 323  | 64   | 3   |
|         | R' metric   | 818  | 600  |      | 2   |

The function named 'CFO comp', which performs frequency offset estimation and compensation, clearly consumes the largest amount of hardware in both methods. The R' metric computation in Prop-A4 uses more hardware than the computation of R in SoA. However fine timing estimation, 'Fine time', takes a large proportion of the total hardware cost in SoA while the alternative in the proposed method (timing synchronization, known as 'Time sync'), requires much less hardware.

# 4.4 Summary

Although state-of-the-art synchronisation methods achieve good performance when the CFO lies within the range of fractional CFO estimation, they cannot work with a larger CFO. Some methods employ cross-correlation for time synchronisation; these methods are robust to large CFO and can obtain acceptable performance at low SNR. However, the much higher computational resources needed for cross-correlation tend to make such methods unsuitable for hardware implementation. The method presented in this chapter addresses these drawbacks of previously reported work. It takes advantage of the periodicity and energy distribution characteristics of the preamble to perform time synchronisation. The synchronisation performance results, obtained through simulation, demonstrate good performance and robustness to large CFO. Although the method estimates and compensates only the fractional CFO, its timing synchronisation still performs well when a larger CFO, which may include an integer CFO, is present. An enhanced OFDM synchronisation method that provides efficient and low-cost IFO estimation is presented in the next chapter.

# Chapter 5

# A CFO Estimation Method for OFDM Synchronisation

## 5.1 Introduction

In the previous chapter, an OFDM synchronisation method that is robust to large CFO was presented. Although its timing synchronisation performs well in the case of large CFO, its CFO estimation covers only the fractional CFO. A large CFO can occur as a result of the Doppler effect and/or local oscillator instability. The CFO is normalised by the subcarrier spacing and is usually divided into an integer part (IFO), a multiple of the subcarrier spacing, and a fractional part (FFO). IFO causes a circular shift of the subcarriers in the frequency domain, while FFO results in ICI because orthogonality between subcarriers is lost. Fig. 5.1 illustrates FFO and IFO estimation in the baseband processing of a typical OFDM system. IFO estimation is performed after the FFT and relies upon cross-correlation, which consumes significant hardware resources. A typical OFDM receiver avoids the need for IFO estimation by limiting CFO tolerance to be smaller than the range of FFO estimation, resulting in a set of very strict constraints on the design of the RF front-end. However, if the CFO exceeds the range of FFO estimation, IFO is present and the system cannot function correctly. This is more pronounced in applications that subject the front-end to intensive Doppler effects, that use very high carrier frequencies, or that support multiple frequency bands for different standards.

In this chapter, a novel method for IFO estimation is presented to overcome this challenge, targeting efficient and low-power systems. The method has also been discussed in:

T. H. Pham, S. A. Fahmy, and I. V. McLoughlin, "Efficient Integer Frequency
Offset Estimation Architecture for Enhanced OFDM Synchronization," to
be submitted to *IEEE Transactions on Very Large Scale Integration (VLSI) Systems*.

#### 5.2 Related Work

As mentioned in the previous chapter, the estimation range of the fractional CFO is limited, as shown in (4.8); for instance, it extends to $\pm 2$ subcarrier spacings in the case of the IEEE 802.16 preamble. The IFO is therefore a multiple of the width of this range (4 subcarrier spacings for IEEE 802.16). It is commonly estimated by performing cross-correlation [98, 95] in the frequency domain.

FIGURE 5.1: Baseband processing block diagram.

Consider a signal that is transmitted over a frequency selective channel with channel impulse response (CIR) h of length L ( $L < N_{CP}$ ) and that is also corrupted by AWGN. The received signal, with frequency offset and timing offset, can be expressed in the time domain as

$$y[n] = \sum_{l=0}^{L-1} h[l]x[n - \tau - l]e^{i(2\pi\xi\frac{n-\tau}{N} + \phi_0)} + w[n]$$
 (5.1)

where w[n] denotes AWGN in the time domain,  $\tau$  and  $\phi_0$  are the residual timing offset (RTO) and the phase error, respectively, and  $\xi$  is the normalised CFO, which can be divided into an FFO part  $\lambda$  and an IFO part  $\epsilon$  as  $\xi = \lambda + \epsilon$ .

Assuming that FFO and STO have been compensated by earlier stages of synchronisation, as investigated in detail by other authors [95, 92], the received preamble symbol at the FFT output, after CP removal, is

$$Y[k] = e^{i(\phi_0 - 2\pi \frac{\tau k}{N})} H[k - \epsilon] X[k - \epsilon] + W[k]$$
(5.2)

where W[k] and H[k] are the frequency domain representations of AWGN and CIR, respectively.

As mentioned previously, IFO results in a cyclic shift in the frequency domain and is commonly estimated by performing cross-correlation in the frequency domain. By contrast, RTO causes a linear phase rotation across samples in the frequency domain, which may degrade the cross-correlation used for IFO estimation. Based on differential demodulation of the FFT output, the IFO can be determined with improved robustness to frequency selective channels and RTO using the correlation function [99] expressed by:

$$\hat{\epsilon} = \underset{\tilde{\epsilon}}{\operatorname{argmax}} \left| \sum_{k=1}^{N} Y^*[k-1]Y[k]X^*[k-\tilde{\epsilon}]X[k-1-\tilde{\epsilon}] \right|$$
 (5.3)

where (.)\* denotes complex conjugation,  $\hat{\epsilon}$  and  $\tilde{\epsilon}$  are the estimated and trial values of  $\epsilon$ , respectively, Y[k] and X[k] denote the  $k^{th}$  frequency-domain sample of the received symbol and of the known transmitted preamble, respectively, and the symbol size N is equal to the FFT size.

The estimated IFO can achieve high precision using cross-correlation in the frequency domain, however implementing cross-correlation clearly involves a significant hardware overhead, with a multiplier needed for each element in the cross-correlation. Sign-bit cross-correlation [93] is a widely adopted approach to reducing correlation complexity using only the most significant bit (MSB) of signed numbers in the correlation computation. In this way, complexity is reduced at the cost of performance degradation. Despite the adoption of such methods, cross-correlation remains computationally expensive, especially when dealing with a large FFT size. It should be noted here that several IFO estimation methods have been published which claim robustness to frequency selective channels and RTO. However published FPGA implementations of these methods are lacking to date, possibly because the hardware costs are considerable – even when adopting a sign-bit cross-correlation approach.

There are few practical implementations of IFO estimation and correction. A notable exception was presented in [95], in which time-domain cross-correlation is performed between received samples and pre-rotated versions of the preamble corresponding to the possible IFO values. Although the method uses efficient sign-bit correlation to reduce hardware cost (at the cost of decreased estimation accuracy and increased sensitivity to frequency selective channels), it is not efficient when the possible IFO range is large, since it effectively performs an exhaustive search.

State-of-the-art OFDM synchronisation methods typically have a small CFO tolerance, which requires the hardware to be strictly constrained so that the CFO remains small and no IFO is introduced. This requirement leads to an increase in total system cost. In the case of MSCR in particular, such a need for precision, coupled with the requirement for a wide frequency range, would make the RF front-end design difficult, if not impossible.

# 5.3 Enhanced OFDM Synchronization Through Novel IFO Estimation Architecture

In this section, we present a novel method in which hardware resources and power consumption are reduced by determining IFO estimates for only a subset of possible IFO values. This in turn enables an efficient resource sharing folded architecture to be adopted. Adjusting the precision of individual correlation computations within this novel architecture leads to a fine degree of control on the trade-off between performance and power consumption. The implementation uses the long preamble of IEEE 802.16-2009 [19] for estimating IFO values. This has a symbol size of 256 with 100 pilots, as illustrated in Fig. 5.2. These 100 pilots are distributed, 50 pilots per side, at even sub-carrier spacings from 2 to 100 and from 156 to 254. The remaining sub-carriers are null. In OFDM-based systems such as 802.11, 802.16, and 802.22, the short preamble is used to estimate and compensate for FFO. Typically, fractional CFO estimation has a limited range of up to  $\pm 2$  sub-carrier spacings depending on the format of the short preamble. This results in the possible IFO values being a multiple of 4. Readers should refer to [95] to understand in detail the limitation of fractional CFO and possible IFO values. It should be noted that although the proposed method is applied to 802.16 for evaluating the accuracy of IFO estimation, this method can be applied in other OFDM-based systems having similar preamble structures, such as 802.11 and 802.22.



FIGURE 5.2: Pilots in the long preamble of IEEE 802.16-2009.

#### 5.3.1 Proposed Algorithm

Firstly, a subset of possible IFO values is determined. We assume that the RF front end can provide CFO stability in a range from -14 to +18 subcarrier spacings, which is relaxed compared to the strict RF front end constraints in 802.16 that would typically otherwise lead to increased RF hardware costs. Generally, the larger the range of IFO estimation that is performed, the larger the CFO that the baseband system can tolerate. A wider CFO tolerance in turn leads to a relaxation in RF front-end specification, reducing system cost. In this system, there are 8 possible values for IFO estimation in the assumed CFO range. The subset of possible IFO values is denoted  $S_{IFO} = \{-12, -8, -4, 0, 4, 8, 12, 16\}$ . Moreover, samples are pre-offset by 12 subcarrier spacings prior to calculation to ensure that all possible values of IFO are positive,  $S'_{IFO} = \{0:4:28\}$ . This means that received symbols will only ever need to be shifted right to compensate IFO, thus reducing buffer memory requirements.
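The construction of the candidate set can be sketched in a few lines. The function name `ifo_candidates` and its parameters are illustrative assumptions; the sketch simply keeps the multiples of the FFO window width whose $\pm 2$ window fits inside the assumed CFO range, and then applies the 12-subcarrier pre-offset.

```python
import math

def ifo_candidates(cfo_min, cfo_max, ffo_range=2):
    """Multiples of the FFO window width whose +/- ffo_range window fits in the CFO range."""
    step = 2 * ffo_range                                   # FFO window spans 4 subcarriers
    first = math.ceil((cfo_min + ffo_range) / step) * step
    last = math.floor((cfo_max - ffo_range) / step) * step
    return list(range(first, last + step, step))

s_ifo = ifo_candidates(-14, 18)                 # assumed RF front-end CFO range
pre_offset = -s_ifo[0]                          # shift that makes every candidate non-negative
s_ifo_shifted = [eps + pre_offset for eps in s_ifo]
print(s_ifo, pre_offset, s_ifo_shifted)
```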

Secondly, a resource sharing folded architecture is designed to significantly reduce the hardware cost. Conventionally, to obtain high accuracy, IFO estimation is computed across all pilots in the preamble. This results in considerable hardware overhead, especially with a large number of pilots, as is the case for IEEE 802.16-2009. Evidently, as the number of pilots increases, the correlation result shows greater robustness to noise. However, we will demonstrate in Section 5.3.3 that the beneficial effect of calculating across additional pilots leads to decreasing gains in performance. In fact the performance improvement reaches a plateau. We therefore propose making use of only a subset of pilots, while maintaining estimation accuracy as much as possible. Then, by spreading the chosen subset of pilots carefully in the time domain, it becomes possible to share resources when computing the cross-correlation, hence reducing area and power consumption. When the proposed method is applied to IEEE 802.16-2009 offset estimation, the pilots used for the IFO computation are selected at subcarrier indices that are multiples of 4, leading to a natural four-fold resource sharing architecture. Hence, the IFO estimation can be expressed as:

$$\hat{\epsilon} = \underset{\tilde{\epsilon} \in S'_{IFO}}{\operatorname{argmax}} \left| \sum_{k=1}^{N/4} P(4k) A(4k - \tilde{\epsilon}) \right|$$
(5.4)

where  $P(4k) = Y^*(4k-2)Y(4k)$  denotes the correlation of two consecutive received pilots, and  $A(4k) = X(4k-2)X^*(4k)$  is the correlation of two consecutive transmitted pilots that can be pre-computed.
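As a rough illustration of (5.4), the following sketch evaluates the metric for each trial offset in $S'_{IFO}$ in floating point. Several things here are assumptions made only for the example: the preamble uses random QPSK pilots rather than the sequence defined in IEEE 802.16-2009, the channel is flat and noise-free, subcarrier indices are wrapped modulo N, and the names `make_preamble` and `estimate_ifo` are hypothetical.

```python
import numpy as np

N = 256                                  # FFT size of the long preamble
TRIAL_OFFSETS = np.arange(0, 32, 4)      # S'_IFO = {0, 4, ..., 28}

def make_preamble(rng):
    """Hypothetical preamble: random QPSK on the pilot positions (not the 802.16 sequence)."""
    X = np.zeros(N, dtype=complex)
    pilots = np.r_[2:101:2, 156:255:2]   # even subcarriers 2..100 and 156..254
    bits = rng.choice([-1.0, 1.0], size=(2, pilots.size))
    X[pilots] = (bits[0] + 1j * bits[1]) / np.sqrt(2)
    return X

def estimate_ifo(Y, X):
    """Evaluate the metric of (5.4) for every trial offset and return the argmax."""
    idx = (4 * np.arange(1, N // 4 + 1)) % N          # subcarriers 4k, wrapped to 0..N-1
    P = np.conj(Y[(idx - 2) % N]) * Y[idx]            # P(4k) = Y*(4k-2) Y(4k)
    metric = []
    for eps in TRIAL_OFFSETS:
        m = (idx - eps) % N
        A = X[(m - 2) % N] * np.conj(X[m])            # A(4k - eps), can be precomputed
        metric.append(abs(np.sum(P * A)))
    return int(TRIAL_OFFSETS[int(np.argmax(metric))])

rng = np.random.default_rng(0)
X = make_preamble(rng)
true_ifo, rto = 12, 3
k = np.arange(N)
# Noise-free, flat-channel form of (5.2): cyclic shift plus the RTO phase ramp
Y = np.exp(-2j * np.pi * rto * k / N) * np.roll(X, true_ifo)
print(estimate_ifo(Y, X))
```

Because P(4k) is a product of two received pilots that are two subcarriers apart, the RTO phase ramp reduces to a constant phase factor on every term, which the magnitude in (5.4) removes.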

Thirdly, although sign-bit cross-correlation is often used in conventional implementations [95, 93], as it can significantly reduce computational complexity, it also leads to reduced precision and hence reduced estimation performance, especially in the case of frequency selective channels. For this reason, we instead apply multiplierless correlation to enhance the accuracy of estimation compared to the sign-bit approach. In [97], the authors demonstrated a trade-off between cost and accuracy for multiplierless OFDM synchronisation. We further investigate this effect in the current chapter in terms of the trade-off between cost and accuracy when reducing the wordlength used to represent P(4k).

#### 5.3.2 Proposed Architecture

The proposed method divides IFO estimation into multiple repeated computations with resource sharing based upon the four-sample timing between selected spread pilots. The estimation of IFO in (5.4) can now be rewritten as follows:

$$\hat{\epsilon} = \underset{\tilde{\epsilon} \in S'_{IFO}}{\operatorname{argmax}} |V_{\tilde{\epsilon}}|,$$

$$V_{\tilde{\epsilon}} = \sum_{k=0}^{25} \left[ P(4k) A_{\tilde{\epsilon}}(4k) + P(L+4k) A_{\tilde{\epsilon}}(L+4k) \right],$$
(5.5)

where  $A_{\tilde{\epsilon}}$  denotes the correlation of two consecutive pre-rotated known pilots corresponding to one IFO value, and  $V_{\tilde{\epsilon}}$  is the cross-correlation between received pilots and pre-rotated known pilots. Since the pilots of the long preamble are distributed on two sides of the OFDM symbol in the frequency domain, at even sub-carrier spacings from 2 to 100 and 156 to 254, L denotes the index of the first pilot in the second half. In the case of IEEE 802.16, L equals 156. Each value,  $V_{\tilde{\epsilon}}$ , can be computed simultaneously and separately using a multiply-add accumulation scheme. The challenge in implementation is to find a way to share resources efficiently when computing  $V_{\tilde{\epsilon}}$ . The pilots used to compute the correlation arrive every four cycles, so four clock cycles are available for each computed pilot, allowing one multiply accumulate block to compute 4 separate correlations sequentially.

FIGURE 5.3: Circuit of known pilots shift register

There are 8 sets of  $A_{\tilde{\epsilon}}$  corresponding to 8 possible IFOs. These sets of  $A_{\tilde{\epsilon}}$  can be precomputed and stored separately. Thanks to the spreading of the computed pilots, the  $A_{\tilde{\epsilon}}$  sets have many identical pilots – this naturally allows sharing between prerotated pilot sets – so that only 64 memory locations are required instead of 400. Thus, an 84% reduction in the memory used to store  $A_{\tilde{\epsilon}}$  sets is achieved. Fig. 5.3 illustrates the  $A_{\tilde{\epsilon}}$  sets and circuitry for combining all  $A_{\tilde{\epsilon}}$  sets.
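One way to arrive at the figure of 64 locations is to count the distinct positions of A that the 8 trial offsets touch. The sketch below assumes that the 50 computed pilots sit at multiples of 4 within 4..100 and 156..252, and that positions are treated as linear shift-register taps without wrap-around; both are interpretations made for this example.

```python
# Positions of the 50 spread pilots (multiples of 4 on each side of the symbol)
pilot_idx = list(range(4, 101, 4)) + list(range(156, 253, 4))
trial_offsets = range(0, 32, 4)                    # S'_IFO = {0, 4, ..., 28}

# Distinct positions of A needed across all trial offsets (linear, non-wrapped indices)
needed = {m - eps for eps in trial_offsets for m in pilot_idx}

separate = len(pilot_idx) * len(trial_offsets)     # 8 sets stored independently
shared = len(needed)                               # one shared store
saving = 100 * (1 - shared / separate)
print(separate, shared, round(saving))
```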

Multiply accumulate blocks are shared for computing four sequential  $V_{\tilde{\epsilon}}$  values over four successive clock periods, as shown in Fig. 5.4.  $P_k$  is received in every clock cycle.  $P_{4k}$  is a subsampled version of  $P_k$ , formed by taking a subset of the most significant bits of  $P_k$  every four cycles. The cross-correlation is performed with the values of  $P_{4k}$ . Two multipliers, M1 and M2, are used to compute the 8 cross-correlations  $V_{\tilde{\epsilon}}$  in parallel. Each multiplier sequentially multiplies  $P_{4k}$  by the corresponding transmitted pilots in 4 sets of  $A_{\tilde{\epsilon}}$ , and the products are accumulated into the values of  $V_{\tilde{\epsilon}}$ . When all pilots have been processed, the maximum operation, argmax|V|, is performed on the 8  $V_{\tilde{\epsilon}}$  values to estimate the IFO.

FIGURE 5.4: Resource sharing approach for computing  $V_{\tilde{\epsilon}}$ .
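The time multiplexing described above can be mimicked with a cycle-by-cycle software model. This is only a behavioural sketch: the function name `folded_correlation`, the layout of `a_sets` (one list of reference values per trial offset) and the assignment of offsets 0 to 3 to M1 and 4 to 7 to M2 are assumptions made for illustration.

```python
def folded_correlation(p_stream, a_sets):
    """Accumulate 8 correlations with two multipliers, each reused over 4 cycles."""
    v = [0j] * len(a_sets)
    held = 0j
    for cycle, p in enumerate(p_stream):
        slot = cycle % 4
        if slot == 0:
            held = p                                   # P_4k: keep every fourth sample
        pilot = cycle // 4
        v[slot] += held * a_sets[slot][pilot]          # multiplier M1: offsets 0..3
        v[slot + 4] += held * a_sets[slot + 4][pilot]  # multiplier M2: offsets 4..7
    return v

# Small integer-valued example so that the comparison below is exact
p_stream = [complex(k, -k) for k in range(16)]
a_sets = [[complex((i + j) % 3 - 1, (i * j) % 3 - 1) for j in range(4)] for i in range(8)]
folded = folded_correlation(p_stream, a_sets)
direct = [sum(p_stream[4 * j] * a_sets[i][j] for j in range(4)) for i in range(8)]
print(folded == direct)
```

Only two multiplications are issued in any cycle, yet all 8 accumulators are updated once per held sample.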

To design an optimised architecture for IFO estimation, the multiply-add formula of  $V_{\tilde{\epsilon}}$  is mathematically manipulated into what is effectively a multiply-accumulate form. When one received sample is taken,  $V_{\tilde{\epsilon}}$  can be expressed in the form of accumulation as

$$\begin{aligned}
V_{\tilde{\epsilon}}(n) &= A_{\tilde{\epsilon}}(n)P(n) + V_{\tilde{\epsilon}}(n-1) \\
&= (\Re\{c\} - i\Im\{c\})(\Re\{P(n)\} + i\Im\{P(n)\}) + (\Re\{V_{\tilde{\epsilon}}(n-1)\} + i\Im\{V_{\tilde{\epsilon}}(n-1)\})
\end{aligned}$$
(5.6)



FIGURE 5.5: Architecture of proposed IFO estimator.

where  $\Re\{.\}$  and  $\Im\{.\}$  denote the real and imaginary parts, respectively, and P(n),  $A_{\tilde{\epsilon}}(n)$  denote the current values of the correlation of two consecutive received pilots and corresponding transmitted pilots, respectively.  $V_{\tilde{\epsilon}}(n-1)$  is the accumulated value of  $V_{\tilde{\epsilon}}$ . To employ multiplierless correlation,  $A_{\tilde{\epsilon}}$  are normalized to values c whose real and imaginary parts have values in  $\{-1, 0, 1\}$ , and the wordlength of P(n) and  $V_{\tilde{\epsilon}}$  in fixed point format can be adjusted to increase estimation accuracy at the cost of increased hardware resource consumption. The real and imaginary parts of  $V_{\tilde{\epsilon}}$  can be computed as

$$\Re\{V_{\tilde{\epsilon}}(n)\} = \Re\{c\}\Re\{P(n)\} + \Im\{c\}\Im\{P(n)\} + \Re\{V_{\tilde{\epsilon}}(n-1)\},$$

$$\Im\{V_{\tilde{\epsilon}}(n)\} = \Re\{c\}\Im\{P(n)\} - \Im\{c\}\Re\{P(n)\} + \Im\{V_{\tilde{\epsilon}}(n-1)\}$$
(5.7)
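Since the real and imaginary parts of c are restricted to $\{-1, 0, 1\}$, each product in (5.7) reduces to passing, negating or zeroing an operand. A minimal software sketch of one accumulation step is given below; the function names `select` and `mac_step` are illustrative, and the final lines compare the result against ordinary complex arithmetic.

```python
def select(coef, x):
    """Multiply x by a coefficient in {-1, 0, 1} without a multiplier."""
    if coef == 1:
        return x
    if coef == -1:
        return -x
    return 0

def mac_step(v_re, v_im, p_re, p_im, c_re, c_im):
    """One accumulation step of (5.7): V(n) = conj(c) * P(n) + V(n-1)."""
    new_re = select(c_re, p_re) + select(c_im, p_im) + v_re
    new_im = select(c_re, p_im) - select(c_im, p_re) + v_im
    return new_re, new_im

# Comparison with complex arithmetic for c = 1 - 1j, P = 3 + 4j, V(n-1) = 2 - 5j
step = mac_step(2, -5, 3, 4, 1, -1)
reference = complex(1, -1).conjugate() * complex(3, 4) + complex(2, -5)
print(step, reference)
```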

The proposed resource sharing architecture for the IFO estimator is shown in Fig. 5.5. The ArgMax module finds the maximum of the 8  $V_{\tilde{\epsilon}}$  values in order to identify the corresponding IFO estimate. The novelty of this IFO estimator lies in its algorithmic improvements and architecture optimisation, achieved by reformulating the estimation equation into an expression that can be computed with efficient shared resource circuitry. Thanks to the significant hardware reduction achieved, this IFO estimator can feasibly be implemented on a low-power FPGA with limited hardware resources, while performance is maintained by trading off the number of pilots against word length.

We are able to demonstrate that the algorithm and structure optimisations mentioned above retain competitive estimation accuracy compared to conventional approaches, while also offering significant reductions in hardware resource usage. This makes it possible to implement a high-performance OFDM receiver on a low-power FPGA that has a limited number of DSP blocks available.

#### 5.3.3 Simulation

Many variants of the proposed method were simulated in MATLAB using different channel models and the parameter set of the IEEE 802.16-2009 downlink preamble. Performance of the implementation was compared to the theoretical performance of some state-of-the-art methods. This was assessed primarily in terms of the probability of failed estimation (POFE) with respect to channel SNR. POFE, which is widely used to evaluate the performance of IFO estimation [99, 41, 42], is the number of failed estimations divided by the total number of IFO estimations. Overall, 100,000 IFO estimations were simulated in AWGN and Stanford University Interim (SUI) [96] frequency selective channels. IFO estimation is performed with non-ideal FFO compensation, with FFO determined and compensated using the method of Kim and Park [95]. The simulation also verifies the performance of the proposed method under the effect of RTO caused by imperfect STO estimation (assuming that STO estimation is still within the CP and does not cause ISI). For this purpose, a randomly generated amount of STO is added in the range from 0 to  $N_{CP} - L - 1$ , where L is the length of the CIR.
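For reference, the POFE definition used here amounts to a simple ratio. The sketch below uses a hypothetical helper `pofe` and made-up trial values purely to illustrate the calculation; it is not the MATLAB code used for the simulations.

```python
import numpy as np

def pofe(true_ifo, estimated_ifo):
    """Probability of failed estimation: failed estimates divided by total estimates."""
    true_ifo = np.asarray(true_ifo)
    estimated_ifo = np.asarray(estimated_ifo)
    return float(np.mean(true_ifo != estimated_ifo))

# Illustrative values only: one failure in four trials
example = pofe([0, 4, 8, 12], [0, 4, 4, 12])
print(example)
```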

We first investigate the performance degradation compared to theoretical performance as a result of reducing the number of pilots as proposed. Next, we will investigate the effect of wordlength optimisation. In both cases, comparisons are made with established methods in the literature that can be simulated but are otherwise infeasible for hardware implementation, namely the conventional method in [99] (PCH) that is applied to one training block with 100 pilots. In addition, two state of the art methods are also simulated for comparison. Firstly, metric SY from [41] as defined by,

$$\mu_{SY}(\tilde{\epsilon}) = \Re \left\{ \sum_{k=1}^{\frac{N}{2}} Y_{(2k-2)}^* Y_{(2k)} X_{(2k-\tilde{\epsilon})}^* X_{(2k-2-\tilde{\epsilon})} \right\}$$
 (5.8)

where  $\hat{\epsilon} = \underset{\tilde{\epsilon} \in S_{IFO}}{\operatorname{argmax}} \{ \mu(\tilde{\epsilon}) \}$ . Secondly, metric MM from [42],

$$\mu_{MM}(\tilde{\epsilon}) = \Re \left\{ e^{i\frac{\pi}{4}} \sum_{k=1}^{\frac{N}{2}} Y_{(2k-2)}^* Y_{(2k)} X_{(2k-\tilde{\epsilon})}^* X_{(2k-2-\tilde{\epsilon})} \right\}$$
 (5.9)

where  $\Re\{.\}$  denotes the real part.

We are unaware of any published circuits for these methods, presumably because of their computational complexity. The very large hardware requirement of the respective metrics makes these methods unsuitable for implementation on a low-cost, low-power FPGA (unlike the proposed method).

#### 5.3.3.1 Performance Comparison

The performance of the proposed method, denoted *Prop*, is evaluated in comparison to the theoretical performance of the state-of-the-art methods by Park et al. [99], Shim and You [41], and Morelli and Moretti [42], denoted *PCH*, *SY* and *MM*, respectively, as introduced in the previous subsection. The theoretical performance is computed with full precision using full multiplication. However, it must be noted that implementing this directly in hardware would be prohibitive due to the large number of multiplication operations needed. In fact, a hardware implementation would conventionally use sign-bit correlation instead of full precision correlation, as mentioned previously. The full multiplication results shown here are therefore better than those achievable in practice, and can be considered as upper performance bounds. For more realistic data, we also provide results from sign-bit correlation versions of the above, denoted *PCH\_sb*, *SY\_sb*, and *MM\_sb*, respectively.

The proposed method, evaluated against these, uses 50 spread pilots with indices that are multiples of 4. For the sake of comparison, an additional implementation of *PCH* is reported which uses the same number of pilots as the proposed method, but takes 50 contiguous pilots. This is denoted *PCH\_50*. Figs. 5.6 and 5.7 plot performance results for all methods in an AWGN channel, without and with RTO respectively, and reveal that the proposed method generally performs well, especially at higher SNRs. Considering more realistic channel models, Figs. 5.8 and 5.9 plot the performance of all methods in SUI1 and SUI2 channels respectively, and similarly show that the proposed method performs well, especially at higher SNRs.

Under these experimental conditions, PCH and SY achieve equivalent performance in AWGN without RTO and in SUI1 channels. However, SY degrades more drastically than PCH in the SUI2 channel at SNRs above 1 dB. SY also deteriorates in the case of AWGN with RTO. This method appears to be very sensitive to large RTO, while MM and PCH exhibit better robustness to RTO. The accuracy of MM is slightly lower than that of PCH at SNRs below 0 dB, while performance is very similar at larger SNRs. Also note the performance of the conventional approach implementations, *PCH\_sb*, *SY\_sb* and *MM\_sb*, which is significantly degraded, especially in the SUI1 and SUI2 channels.

FIGURE 5.6: Fail rate of IFO estimation methods in AWGN channel without RTO.

FIGURE 5.7: Fail rate of IFO estimation methods in AWGN channel with RTO.

FIGURE 5.8: Fail rate of IFO estimation methods in SUI1 channel.

FIGURE 5.9: Fail rate of IFO estimation methods in SUI2 channel.

Apart from at very low SNRs, the proposed method, *Prop*, achieves almost identical performance to the simulated upper bound *PCH*, even in the presence of RTO. It should be noted that *Prop* achieves this while allowing the use of resource sharing through sparse pilot computation, which is exploited to achieve a significant hardware saving. According to simulation results, *Prop*, with its spread pilots, is also more accurate than *PCH\_50*, which uses the same number of contiguous pilots.

#### 5.3.3.2 Wordlength Optimisation

Since sign-bit correlation degrades IFO estimation performance, especially in frequency selective channels, the proposed method instead employs multiplierless correlation to improve accuracy and robustness. The complexity of this approach depends on the wordlength chosen for the correlation computation. We now investigate how wordlength affects the performance of the proposed implementation, again compared to the theoretical bound, *PCH*, as well as to the performance of a conventional sign-bit implementation, *PCH\_sb*. We denote wordlength using the notation Q1.f, meaning a single integer bit and f fractional bits. Evaluations are performed for f of 1, 2, 7, and 15 bits. These results are plotted in the following figures with the labels *Prop\_1b*, *Prop\_2b*, *Prop\_7b*, and *Prop\_15b*, respectively. Figs. 5.10 and 5.11 plot the performance in AWGN (without and with RTO, respectively), with all tested wordlengths in the proposed method performing comparably to *PCH* (and being much better than *PCH\_sb* at SNRs exceeding about 2 dB). Figs. 5.12 and 5.13 plot the results when using the more realistic SUI1 and SUI2 channel models.

It can be seen from the plots that each of the tested wordlengths achieves much better performance and exhibits greater robustness to frequency selective channels than the sign-bit realisation of the conventional approach, *PCH\_sb*. Additionally, these realisations of the proposed method do not suffer as much degradation in the presence of RTO. Moreover, it is possible to improve low SNR performance by adopting a longer wordlength with the proposed method, at a cost of increased hardware complexity. Increasing wordlength does improve results slightly, with the step from 1 to 2 bits giving the most significant gain. By contrast, increasing from 2 to 7 or from 7 to 15 bits has little impact. In general, *Prop\_2b* achieves an estimation accuracy close to that of the theoretical performance bound, *PCH*, at intermediate and higher SNRs, even though it involves computation with fewer bits, and can hence be implemented more efficiently.

FIGURE 5.10: Fail rate for different wordlengths in AWGN channel without RTO.

FIGURE 5.11: Fail rate for different wordlengths in AWGN channel with RTO.

FIGURE 5.12: Fail rate for different wordlengths in SUI1 channel.

FIGURE 5.13: Fail rate for different wordlengths in SUI2 channel.

#### 5.3.4 FPGA Implementation

The analysis in Section 5.3.3 suggests that the proposed method offers comparable estimation performance to existing methods in the literature. As a result of the simplifications inherent in the proposed approach, this should be achievable at a reduced hardware cost. This section now quantifies this hardware cost, for an FPGA-based implementation. It is important to note that these hardware savings are accessible for a number of target implementation devices, although we are interested primarily in FPGA implementation as part of our work on leveraging FPGA reconfigurability in cognitive radios.

#### 5.3.4.1 Conventional Approach

To obtain the theoretical performance previously discussed in Section 5.3.3 and denoted *PCH*, the computation of the estimation metric in [99], using 100 pilots, would require about 200 complex multipliers, resulting in the use of over 600 DSP blocks. This may exceed the available resources on small devices, or leave insufficient resources for other tasks on larger devices. As the number of multiplications required for a full implementation of the approach is prohibitive, the conventional approach for implementation, as we have discussed, uses sign-bit correlation [95]. This conventional implementation uses all 100 pilots in the long preamble to perform sign-bit correlation, and multiply-adds are eliminated at taps where the pilots of the long preamble are not used. This implementation mirrors *PCH\_sb* in Section 5.3.3, and allows us to quantify the benefits of our proposed approach against a known reference benchmark.

FIGURE 5.14: DSP block based 3-input adder for correlation.

#### 5.3.4.2 Proposed Approach

The proposed IFO estimator architecture is implemented with several different wordlengths for P(n) and  $V_{\tilde{\epsilon}}$  in (5.6), and the implementations are compared to explore their respective hardware costs. Four fixed point formats for P(n) are investigated: Q1.1, Q1.2, Q1.7, and Q1.15.  $V_{\tilde{\epsilon}}$  is represented accordingly in Q7.1, Q7.2, Q7.7, and Q7.15 formats to avoid overflow.
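The effect of the Q1.f formats on a sample can be illustrated with a small quantisation sketch. It assumes that the single integer bit is the two's complement sign bit, so that representable values lie in $[-1, 1 - 2^{-f}]$, and that the lower bits are simply truncated; the function name `quantise_q1` is hypothetical.

```python
import numpy as np

def quantise_q1(x, f):
    """Truncate x to Q1.f (assumed: sign bit plus f fractional bits, saturating)."""
    scale = 2 ** f
    q = np.floor(np.asarray(x, dtype=float) * scale) / scale   # drop the lower bits
    return np.clip(q, -1.0, 1.0 - 1.0 / scale)                 # saturate to the Q1.f range

samples = [0.3, -0.3, 0.99]
for f in (1, 2, 7, 15):
    print(f, quantise_q1(samples, f))
```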

In order to obtain a comprehensively optimised implementation, these circuits are each implemented using two different structures. The first uses only logic elements (LE) for computation, while the second uses Xilinx DSP48A1 [100] primitives. Considering (5.7),  $\Re\{V_{\tilde{\epsilon}}(n)\}$  and  $\Im\{V_{\tilde{\epsilon}}(n)\}$  can be computed effectively using two DSP blocks as 3-input adders, instead of 4 blocks with multipliers as would be usual. Fig. 5.14 illustrates how this is done for  $\Re\{V_{\tilde{\epsilon}}(n)\}$ ; the computation of  $\Im\{V_{\tilde{\epsilon}}(n)\}$  is similar. Note that the solution presented in Fig. 5.14 is optimised for QPSK modulated pilots (since their amplitudes are identical), as specified in IEEE 802.16 as well as in most OFDM-based standards. It is the normalisation applied in (5.7) that allows the correlation to be reduced in this way. These implementations correspond to *Prop\_1b*, *Prop\_2b*, *Prop\_7b*, and *Prop\_15b*, which were investigated for estimation accuracy in Section 5.3.3.

#### 5.3.5 Implementation Results

The circuits were synthesised and fully implemented using Xilinx ISE 13.2, targeting the low-power Xilinx Spartan-6 XC6SLX75T FPGA. The results are reported in terms of the number of flip-flops (FF), look-up tables (LUT), and DSP48 blocks, along with dynamic power consumption, as summarised in Table 5.1.

Table 5.1: Resource utilisation and dynamic power consumption of IFO estimators.

| IFO est. Cir. | #FF       | #LUT      | #DSP | Freq. (MHz) | D. Power |
|---------------|-----------|-----------|------|-------------|----------|
| conv\_100p    | 3270 (3%) | 1837 (3%) | 3    | 142         | 42 mW    |
| Prop\_1b\_LE  | 328 (1%)  | 370 (1%)  | 3    | 136         | 9 mW     |
| Prop\_2b\_LE  | 350 (1%)  | 390 (1%)  | 3    | 136         | 10 mW    |
| Prop\_7b\_LE  | 460 (1%)  | 471 (1%)  | 3    | 136         | 12 mW    |
| Prop\_15b\_LE | 735 (1%)  | 696 (1%)  | 3    | 134         | 17 mW    |
| Prop\_1b\_DSP | 328 (1%)  | 306 (1%)  | 7    | 78          | 11 mW    |
| Prop\_2b\_DSP | 350 (1%)  | 319 (1%)  | 7    | 78          | 12 mW    |
| Prop\_7b\_DSP | 460 (1%)  | 379 (1%)  | 7    | 77          | 14 mW    |
| Prop\_15b\_DSP | 735 (1%) | 591 (1%)  | 7    | 77          | 18 mW    |

$conv\_100p$ refers to the conventional approach, implemented using sign-bit correlation over 100 pilots. $Prop\_fb\_LE$ and $Prop\_fb\_DSP$, where f = 1, 2, 7 and 15 (corresponding to received sample format Q1.f), denote the circuits of the corresponding wordlengths, implemented using logic elements and DSP48 blocks, respectively. The table shows that the proposed method achieves a significant reduction in resource usage and dynamic power consumption.

The hardware resources used by  $Prop\_fb\_LE$  and  $Prop\_fb\_DSP$  increase gradually, in terms of FFs and LUTs, as the wordlength increases. The number of FFs used in  $Prop\_fb\_DSP$  and  $Prop\_fb\_LE$  is equal, while  $Prop\_fb\_DSP$  uses fewer LUTs since the DSP blocks are used for the 3-input additions.

The $Prop\_fb\_LE$ implementations use 3 DSP blocks to compute P(n), while the $Prop\_fb\_DSP$ implementations require an additional 4 DSP blocks to perform the correlation. Both consume far fewer LUT and FF resources than the conventional $conv\_100p$ implementation.

For  $Prop_2b_LE$ , the number of FFs and LUTs is reduced by 90% and 79% respectively compared to the conventional  $conv_100p$  approach.
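As a quick arithmetic check, the savings quoted above can be recomputed from the FF and LUT counts in Table 5.1. The short sketch below uses only those table entries; the helper name `reduction_percent` is introduced here for illustration.

```python
# FF and LUT counts copied from Table 5.1.
conv_100p = {"ff": 3270, "lut": 1837}
prop_2b_le = {"ff": 350, "lut": 390}


def reduction_percent(baseline, proposed):
    """Percentage saving of `proposed` relative to `baseline`."""
    return 100.0 * (baseline - proposed) / baseline


ff_saving = reduction_percent(conv_100p["ff"], prop_2b_le["ff"])
lut_saving = reduction_percent(conv_100p["lut"], prop_2b_le["lut"])

print(f"FF saving:  {ff_saving:.1f}%")
print(f"LUT saving: {lut_saving:.1f}%")
```

The FF saving evaluates to roughly 89.3%, which the text rounds to 90%, and the LUT saving to roughly 78.8%.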

The maximum frequencies of the circuits, reported after place and route, comfortably exceed the timing requirements for IEEE 802.16 synchronisation, whose sampling frequency is below 25 MHz. A post-place-and-route simulation was used to estimate the power consumption of the system at a clock rate of 50 MHz using the Xilinx XPower tool; the results are also shown in Table 5.1.

 $Prop\_fb\_LE$  implementations consume less power than the equivalent  $Prop\_fb\_DSP$  implementations. All implementations of the proposed method consume significantly less power than the conventional implementation.  $Prop\_2b\_LE$  consumes just 22% of the power consumed by  $conv\_100p$ .

In subsection 5.3.3, we established that $Prop\_2b\_LE$ easily outperforms the conventional approach in terms of estimation accuracy. In this subsection, we have shown that it does so with a significant hardware resource saving, and with significantly reduced power consumption. In fact, the estimation performance of $Prop\_2b\_LE$, in AWGN and SUI channels (except at very low SNR), is close to the theoretical bound of PCH, which would demand a significant proportion of the FPGA's resources if it were implemented conventionally. Meanwhile, $Prop\_2b\_LE$ is extremely efficient, consuming less than 1% of the resources available on a low-power Spartan-6 XC6SLX75T FPGA. Beyond IEEE 802.16-2009, the folded resource sharing architecture, which leverages sub-sample spaced OFDM pilots, can be adopted for use in other OFDM standards (including IEEE 802.11 and IEEE 802.22).

## 5.4 Summary

This chapter investigated IFO estimation in OFDM-based systems such as IEEE 802.16. A technique is proposed for efficient implementation of IFO estimation, aimed in particular at low power and low resource utilisation. Since IFO estimation contributes significantly to the complexity of a robust synchroniser design, this work is important for multistandard radios, or applications where significant frequency variation is expected. Robust IFO estimation allows for relaxed analogue RF constraints, leading to reduced cost. A modified timing metric is derived which allows for resource sharing to reduce both resource requirements and power consumption. The proposed implementation makes use of a four-fold resource sharing architecture to significantly reduce hardware cost, while multiplierless correlation with optimised wordlength is used to improve estimation accuracy in comparison to a conventional implementation approach using sign-bit correlation. The method is shown to perform as well as current state-of-the-art methods that employ multiplier-based correlation, and yet with significantly reduced power and resource requirements. Dynamic power consumption is reduced by 78% over even a sign-bit version of the conventional approach, yet it offers better estimation performance in both AWGN and frequency selective channels. Beyond IEEE 802.16-2009, the folded resource sharing method, which leverages sub-sample spaced OFDM pilots, can be adopted for use in other OFDM standards (including IEEE 802.11 and IEEE 802.22). Multiplierless correlation with wordlength optimisation is already used within communication systems; however, the trade-off between the number of pilots and computational accuracy for timing synchronisation is relatively unexplored in the research literature.

## Chapter 6

# A Spectrum Efficient Shaping Method

## 6.1 Introduction

This chapter is concerned with the spectral leakage challenge for OFDM-based CRs. OFDM signals typically cause large amounts of spectral leakage, whereas CRs demand a shaped spectrum confined within the allocated channel in order to reuse free spectral bands without causing ICI to other users occupying adjacent bands. Some recent OFDM-based standards define extremely stringent spectral leakage requirements in an effort to avoid ICI. Spectral leakage filtering may affect the transmitted signal in ways that reduce the effective timing guard. Therefore, implementations of spectral leakage filtering need to take into account the parameters of the underlying OFDM signal and its channel characteristics, to avoid negative effects on the transmitted signal such as distortion and ISI.

In this chapter, a novel method that embeds baseband filtering within a cognitive radio (CR) architecture is proposed. The method is able to meet the specification of the most stringent 802.11p spectral emission mask (SEM), meet the specification of the strict 802.11af SEM requirement, and furthermore allow ten additional 802.11af subcarriers to occupy a single basic channel without violating SEM specifications. In addition, the method can adaptively change filter performance according to the transmission power to reduce the computation cost while guaranteeing that the emission spectrum remains smaller than the allowed spectral leakage. The method, performed at baseband, relaxes the otherwise strict RF front-end requirements. This allows the RF subsystem to be implemented based upon much less stringent 802.11a designs, which can significantly reduce total cost.

The work presented in this chapter has also been discussed in:

- T. H. Pham, I. V. McLoughlin, and S. A. Fahmy, "Shaping Spectral Leakage for IEEE 802.11p Vehicular Communications," to appear in *Proceedings of IEEE Vehicular Technology Conference (VTC Spring), Seoul, Korea, May 2014* [101].
- T. H. Pham, S. A. Fahmy, and I. V. McLoughlin, "Spectrally Efficient Emission Mask Shaping for OFDM Cognitive Radios," to be submitted to *IEEE Transactions on Communications (TCOM)*.

## 6.2 Signal Model for Spectral Leakage Filtering

Conventionally, two methods can be employed to compress the spectral leakage of OFDM-based systems, namely pulse shaping and image spectrum compression. Pulse shaping, recommended in 802.11a, is effective at reducing side lobes. Spectral leakage filtering must be designed with respect to the signal model and channel model to avoid the negative effects of the filter. In this section, we present the OFDM signal model and the 802.11p and 802.11af channel models used for compressing the spectral leakage.

#### 6.2.1 Signal Model

We define an OFDM symbol to have inverse fast Fourier transform (IFFT) length and cyclic prefix (CP) length N and  $N_{CP}$ , respectively, so that the length of the symbol including its CP is  $N_T = N + N_{CP}$ . A sample x(m) of the OFDM symbol  $(0 \le m \le N_T - 1)$  can be expressed in the time domain as

$$x(m) = \frac{1}{N} \sum_{k=0}^{N-1} X(k) e^{i2\pi \frac{k}{N}(m-N_{CP})},$$
(6.1)

where X(k) denotes the frequency domain representation of the data sub-carriers. Since OFDM symbol samples are generally transmitted sequentially, this is equivalent to multiplying symbols with a rectangular window function, p. Then the transmitted OFDM samples can be expressed as

$$x(n) = \frac{1}{N} \sum_{l=-\infty}^{\infty} \sum_{k=0}^{N-1} X(k) p(n - lN_T) e^{i2\pi \frac{k}{N}(n - N_{CP} - lN_T)}.$$
 (6.2)

In a conventional OFDM system, the window function, p(m), is rectangular and simply described as

$$p(m) = \begin{cases} 1, & m = 0, 1, \ldots, N_T - 1 \\ 0, & \text{otherwise} \end{cases}$$

$$(6.3)$$
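The signal model in (6.1) can be illustrated with a short NumPy sketch. It assumes QPSK-like symbols on every subcarrier and uses N = 64 and $N_{CP}$ = 16 purely as illustrative values; the function names are introduced here and are not part of the original design. The fast version exploits the fact that (6.1) is an IFFT whose last $N_{CP}$ samples are copied to the front, and the direct version evaluates the sum term by term for comparison.

```python
import numpy as np


def ofdm_symbol(X, n_cp):
    """One OFDM symbol with cyclic prefix, following (6.1)."""
    N = len(X)
    body = np.fft.ifft(X)  # (1/N) * sum_k X(k) * exp(i*2*pi*k*n/N)
    # The shift by N_CP in (6.1) is a cyclic copy of the symbol tail.
    return np.concatenate([body[N - n_cp:], body])


def ofdm_symbol_direct(X, n_cp):
    """Direct evaluation of the sum in (6.1), used only as a cross-check."""
    N = len(X)
    k = np.arange(N)
    m = np.arange(N + n_cp)
    phase = np.exp(2j * np.pi * np.outer(m - n_cp, k) / N)
    return phase @ X / N


rng = np.random.default_rng(0)
N, N_CP = 64, 16  # illustrative values
X = rng.choice([1 + 1j, 1 - 1j, -1 + 1j, -1 - 1j], size=N)
x = ofdm_symbol(X, N_CP)
print(len(x), np.allclose(x, ofdm_symbol_direct(X, N_CP)))
```

Transmitting such symbols back to back corresponds to the rectangular window p(m) of (6.3), which is what gives rise to the large side lobes discussed in this chapter.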

The channel impulse response (CIR), of length h, is derived from the delay spread of the channel. If the CP is shorter than the channel delay, ISI will be present in received symbols. The channels experienced by the two standards discussed in this chapter differ, but both tend to experience high levels of delay spread: 802.11p because it is primarily a vehicular communications standard, and 802.11af because it operates in lower attenuation UHF and VHF bands.

The PHYs specified in 802.11p and 802.11af are largely inherited from the well-established 802.11a and 802.11ac OFDM PHYs, respectively. The major parameters of both new PHYs are presented in Table 6.1. However, since the new standards operate in different channel regions and environments, they are subject to different, and much more stringent, SEM requirements than their parent standards.

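TABLE 6.1: Major parameters of 802.11p and 802.11af OFDM PHYs

| Parameters                         | 802.11p | 802.11af (6 MHz) | 802.11af (7 MHz) | 802.11af (8 MHz) |
|------------------------------------|---------|------------------|------------------|------------------|
| Bandwidth, MHz                     | 10      | 6                | 7                | 8                |
| Used subcarriers, $N_C$            | 52      | 114              | 114              | 114              |
| Total subcarriers, $N_T$           | 64      | 144              | 168              | 144              |
| FFT points, $N_{FFT}$              | 64      | 128              | 128              | 128              |
| Subcarrier spacing $\Delta f$, MHz | 10/64   | 6/144            | 7/168            | 8/144            |
| Sampling frequency, MHz            | 10      | 5.33             | 5.33             | 7.11             |
| Fourier transform size             | 6.4 us  | 24 us            | 24 us            | 18 us            |
| CP length                          | 1.6 us  | 6 us             | 6 us             | 4.5 us           |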

#### 6.2.2 802.11p Signal and Channel Models

802.11p is defined for vehicular channels, which tend to experience a larger delay spread than WLAN channels. While the 802.11p symbol has 16 CP samples (i.e. the same as in 802.11a), the guard interval is lengthened to avoid ISI by reducing the bandwidth from 20 MHz to 10 MHz (i.e., a 10 MHz sampling frequency). However, this raises some challenges in the frequency domain. First, reducing the bandwidth requires a front-end filter circuit with a higher quality factor for the higher frequency carrier compared to 802.11a. Second, 802.11p shares with 802.11a a frequency guard of 6 subcarrier spacings per side. Given the reduced sampling frequency, the absolute frequency guard is correspondingly narrower.

Generally, vehicular communication channels with large delay spread require an increased timing guard, hence narrowing the frequency guard, which leads to stricter filtering constraints. Empirical channel models in [102, 103] reveal how maximum delay spread varies depending on different propagation models and traffic environments. For example, the RTV models for suburban street, urban canyon, and expressway have maximum excess delays of 700, 501, and 401 ns, respectively [102]. For the V2V model, measurements in [103] reveal that the 90% largest delay spread (found in urban areas) is near 600 ns, which is equivalent to 6 samples. Given that the CP is 16 samples, this leaves 10 samples (1 us) remaining. Any spectral leakage filtering necessary to meet the stringent SEM specification must not encroach further than this into the guard time.

#### 6.2.3 802.11af Signal and Channel Models

On the other hand, 802.11af is defined to reuse white space in the UHF band, with three basic channel units (BCUs) of 6 MHz, 7 MHz, and 8 MHz. Within this chapter, we will confine our consideration to the narrowest (and hence possibly most problematic) 6 MHz BCU for investigating the performance of the proposed filtering method for 802.11af.

In the 802.11af channel, the measured delay spread is less than 1 us [104], which is equivalent to the duration of 6 samples in the CP. Therefore, the 802.11af guard interval of 6 us is sufficient to avoid ISI, with the remaining 5 us (i.e., 26 samples) being available for filtering spectral leakage, if necessary. In the US, FCC rules mandate a very strict SEM to avoid ICI on the adjacent channels of primary users in the UHF band. For 6 MHz channels, the signal transmitted by TV band devices (TVBDs) shall maintain at least 55 dB attenuation at the edge of the channel, which is significantly higher than the requirement of the parent 802.11ac standard. In the UK, the Ofcom requirement for 8 MHz channels is similarly strict.
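The guard-interval budgets quoted for the two standards follow from simple sample counting. The sketch below reproduces them under the figures given in the text and in Table 6.1 (10 MHz sampling and a 1.6 us CP for 802.11p; a 6/144 MHz subcarrier spacing with 128 FFT points and a 6 us CP for the 6 MHz 802.11af channel). The function name `guard_budget` is introduced here for illustration, and exact fractions are used so that rounding does not disturb the sample counts.

```python
from fractions import Fraction
from math import ceil

US = Fraction(1, 10**6)  # one microsecond, in seconds
NS = Fraction(1, 10**9)  # one nanosecond, in seconds


def guard_budget(fs_hz, cp_s, delay_s):
    """Return (CP samples, channel samples, samples left for filtering)."""
    cp_samples = int(fs_hz * cp_s)
    delay_samples = ceil(fs_hz * delay_s)  # a partial sample costs a whole one
    return cp_samples, delay_samples, cp_samples - delay_samples


# 802.11p: 10 MHz sampling, 1.6 us CP, 600 ns delay spread.
p = guard_budget(Fraction(10_000_000), Fraction(16, 10) * US, 600 * NS)

# 802.11af, 6 MHz channel: fs = (6/144 MHz) * 128, 6 us CP, 1 us delay spread.
af = guard_budget(Fraction(6_000_000 * 128, 144), 6 * US, 1 * US)

print("802.11p :", p)
print("802.11af:", af)
```

For 802.11p this leaves 10 of the 16 CP samples, and for 802.11af it leaves 26 of the 32 CP samples, in line with the values stated above.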

## 6.3 Related Work

This section investigates state-of-the-art methods from the 802.11a and 802.11ac domain, and considers their application to the newer standards. Specifically, each method is evaluated and shown to be unsuitable for meeting the strict SEM criteria for 802.11p (and hence very unlikely to satisfy the even more stringent 802.11af SEM).

#### 6.3.1 Pulse Shaping

Pulse shaping (using a smooth rather than rectangular pulse), recommended in 802.11a, is effective at reducing side lobes, although it induces distortion in the subcarriers. One way to avoid the distortion is to add extending parts, i.e. CP and cyclic suffix (CS), concatenated to the conventional OFDM symbol before the beginning and after the end respectively. The extended symbol is then multiplied with a smoothing function. While the CP in conventional OFDM is used as a guard interval, here it is also occupied, along with the CS, for pulse shaping.

Pulse shaping extends the  $N_T$  length of the OFDM signal by a roll-off factor,  $\beta$ . One effect of extending the symbol is to reduce spectral efficiency, and thus the CP and CS of consecutive symbols can be overlapped, as shown in Fig. 6.1. This, in turn, causes ISI in the overlapped region.

In practical terms, pulse shaping using the overlapping method is effectively shortening the OFDM guard interval. A larger  $\beta$  means reduced spectral leakage, at the cost of reducing the effective guard interval since a number of guard interval samples are taken for pulse shaping. If  $\beta N_T$  is increased to equal the CP length, the effective guard interval is reduced to zero (i.e. there is no guard interval to prevent channel-induced ISI). In this chapter, three state-of-the-art smoothing functions for pulse shaping are investigated. We present each in discrete form, before investigating their performance with different roll-off factors. The first smoothing function, denoted  $p_1$ , is present in the IEEE 802.11a standard:

$$p_{1} = \begin{cases} \sin^{2}(\frac{\pi}{2}(0.5 + \frac{m}{2\beta N_{T}})), & 0 \leq m < \beta N_{T} \\ 1, & \beta N_{T} \leq m < N_{T} \\ \sin^{2}(\frac{\pi}{2}(0.5 - \frac{m - N_{T}}{2\beta N_{T}})), & N_{T} \leq m < (1 + \beta)N_{T} \end{cases}$$

$$(6.4)$$

The second, proposed by Bala et al. [105], is based on a raised cosine function, denoted here as  $p_2$ :

$$p_{2} = \begin{cases} \frac{1}{2} + \frac{1}{2}cos(\pi(1 + \frac{m}{\beta N_{T}})), & 0 \leq m < \beta N_{T} \\ 1, & \beta N_{T} \leq m < N_{T} \\ \frac{1}{2} + \frac{1}{2}cos(\pi(1 + \frac{m - N_{T}}{\beta N_{T}})), & N_{T} \leq m < (1 + \beta)N_{T} \end{cases}$$

$$(6.5)$$

The third, denoted  $p_3$ , is based on the characteristics of functions with vestigial symmetry as derived by Castanheira and Gameiro [106]:

$$p_{3} = \begin{cases} \frac{1}{2} + \frac{9}{16}\cos\left(\pi\left(1 - \frac{m}{\beta N_{T}}\right)\right) - \frac{1}{16}\cos\left(3\pi\left(1 - \frac{m}{\beta N_{T}}\right)\right), & 0 \le m < \beta N_{T} \\ 1, & \beta N_{T} \le m < N_{T} \\ \frac{1}{2} + \frac{9}{16}\cos\left(\pi \frac{m - N_{T}}{\beta N_{T}}\right) - \frac{1}{16}\cos\left(3\pi \frac{m - N_{T}}{\beta N_{T}}\right), & N_{T} \le m < (1 + \beta)N_{T} \end{cases}$$

$$(6.6)$$
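To make the discrete form concrete, the following sketch evaluates the third smoothing function exactly as written in (6.6). The symbol length of 80 samples and the roll-off of 5 samples are illustrative choices matching the 802.11p numbers used in this chapter, and the function and argument names are assumptions introduced here.

```python
import numpy as np


def p3_window(n_t, rolloff):
    """Smoothing function p3 of (6.6); `rolloff` is beta * N_T in samples."""
    m = np.arange(n_t + rolloff)
    w = np.ones(n_t + rolloff)

    head = m < rolloff   # rising edge, 0 <= m < beta*N_T
    tail = m >= n_t      # falling edge, N_T <= m < (1 + beta)*N_T

    u = 1.0 - m[head] / rolloff
    w[head] = 0.5 + 9 / 16 * np.cos(np.pi * u) - 1 / 16 * np.cos(3 * np.pi * u)

    v = (m[tail] - n_t) / rolloff
    w[tail] = 0.5 + 9 / 16 * np.cos(np.pi * v) - 1 / 16 * np.cos(3 * np.pi * v)
    return w


w = p3_window(n_t=80, rolloff=5)
print(np.round(w[:5], 4))    # rising edge
print(np.round(w[80:], 4))   # falling edge
```

When consecutive symbols are overlapped as in Fig. 6.1, the falling edge of one symbol coincides with the rising edge of the next. For this function the two edges sum to one at every overlapped sample, which can be checked with `w[:5] + w[80:]`.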

The compression of OFDM spectral side lobes as a consequence of pulse shaping is investigated by first assuming that the effect of the image spectrum caused by interpolation or digital-to-analogue conversion (DAC) is negligible. This assumption is noted because the band gap between the wanted spectrum and its image is relatively narrow. Thus the overlapping image spectrum can influence the effectiveness of the shaped spectral leakage. The issue will be discussed later in the chapter, where image cancellation is presented separately for 802.11p and 802.11af.

The three smoothing functions are simulated for otherwise identical channels and signals, and compared in Fig. 6.2. The figure reveals the spectral envelope attenuation achieved using the three smoothing functions.

In Fig. 6.2, os shows the original OFDM spectrum without pulse shaping. $p_1$-$\beta 1$, $p_2$-$\beta 1$, $p_3$-$\beta 1$ show the spectra of the OFDM signal using smoothing functions $p_1(m)$ to $p_3(m)$, respectively, and roll-off factor $\beta N_T = 1$. Plots are also shown for the same functions with roll-off factor $\beta N_T = 5$. When one guard interval sample is used, the spectral leakage is reduced compared to the original OFDM signal, and $p_1(m)$ obtains better results than the other two methods; however, the shaped spectra do not meet the emission requirement of class C. When 5 CP samples are used for pulse shaping, $p_2(m)$ and $p_3(m)$ achieve a significant improvement, and in fact, $p_2$-$\beta 5$ satisfies class C and almost meets the requirement of class D.

FIGURE 6.1: Pulse Shaping operation performed on OFDM symbols.

FIGURE 6.2: Spectral envelope due to pulse shaping OFDM symbols using three smoothing functions and different roll-off factors for 802.11p. Class C and D spectral emission mask limits are overlaid as dotted lines.

FIGURE 6.3: Spectrum of 802.11p OFDM symbols shaped with different window functions, with the image spectrum included.

Thus, we can state that, ignoring the presence of an image spectrum as noted previously, the pulse shaping method can take part of the guard interval to apply a smoothing function in order to shape the spectral leakage and nearly meet stringent SEM requirements.

To investigate further, Fig. 6.3 plots simulation results for pulse shaped 802.11p OFDM symbols with the image spectrum included. The image is a consequence of interpolation or the DAC response. The plot shows that pulse shaping yields a response that is similar to, but slightly better than, the original OFDM signal. However, when the image spectrum is considered, pulse shaping clearly cannot achieve meaningful adjacent channel signal compression. In fact, the band gap between the main spectrum and the image spectrum is insufficient for pulse shaping to achieve any significant spectral leakage decay.

In a practical system, the consequence is that almost all of the side-lobe attenuation may need to be contributed by sharp and hence both high order and accurate analogue filters. Such filters contribute design complexity, increased component count, manufacturing difficulty, and additional cost to a product.

#### 6.3.2 Image Spectrum Cancellation By FIR Filter

The critical issue for 802.11p signals to meet the stringent mask requirements is that the frequency guards are narrow and the carrier frequency is relatively high (5.9 GHz) compared to 802.11a. Similarly, the 802.11af guard bands are even narrower, and the sub-channels are also much narrower. Interpolation can be used at baseband to increase the sampling frequency, and thereby expand the baseband bandwidth. This provides a wider frequency transition band, into which a filter roll-off response fits more easily; however, the resulting image spectra (repeats of the original baseband spectrum) must then be removed by filtering. Such filters are commonly implemented in finite impulse response (FIR) form. A cascaded integrator comb (CIC) implementation is sometimes chosen, since this can combine the interpolation and filtering steps; however, while it is computationally efficient, it lacks flexibility. Since this chapter is concerned with the tradeoff between the filter order, duration of impulse response, degree of oversampling, and filter transition band sharpness, flexibility is important and thus general FIR form filters are assumed.

The tradeoff mentioned above exists because the narrow band gap between the main and adjacent image spectra mandates a sharp filter to remove ICI, which generally implies a high order and thus a long impulse response. Unfortunately, the long impulse response of the filter has a similar effect to the impulse response of the overall channel in terms of inducing ISI. Thus the FIR filter also reduces the effective guard interval of OFDM symbols [1]. Consequently, its design must contribute to the wider tradeoff between ISI avoidance, spectral efficiency (the transition band width), and the degree of filter attenuation needed to meet the SEM requirement.

Several widely used FIR filter implementations are listed in Table 6.2. These are all investigated for image spectrum attenuation, as applied to 802.11p symbols initially, and then evaluated for 802.11af. An empirical formula [107] is used to estimate the length of each filter in terms of attenuation A and transition band $\Delta\omega$. The specifications of the stringent 802.11p class D SEM are used to calculate the required number of taps with L-fold interpolation, in terms of L.

Table 6.2: Popular window-based FIR filter lengths

| Window        | Stopband attenuation | Filter length, N                                                                    | Length for 802.11p |
|---------------|----------------------|-------------------------------------------------------------------------------------|--------------------|
| Hamming, HM   | $-26.5\,\mathrm{dB}$ | $\frac{6.22\pi}{\Delta\omega}$                                                      | $N \approx 31L$    |
| Hanning, HN   | $-31.5\,\mathrm{dB}$ | $\frac{6.65\pi}{\Delta\omega}$                                                      | $N \approx 33L$    |
| Blackman, BM  | $-42.7\,\mathrm{dB}$ | $\frac{11.1\pi}{\Delta\omega}$                                                      | $N \approx 55L$    |
| Kaiser, KS    | -                    | $\frac{A-7.95}{2.23\Delta\omega}$ for $A > 21$; $\frac{5.79}{\Delta\omega}$ for $A < 21$ | $N \approx 33L$    |
| Chebyshev, CW | -                    | $\frac{2.06A - 16.5}{2.29\Delta\omega}$                                             | $N \approx 67L$    |
It is noticeable that the required lengths of these FIR filters for 802.11p are all longer than the 1.6 us guard interval of the 802.11p symbol. For example, the Hanning window requires a filter length of $N \approx 33 \times L$ samples (each of duration $\frac{1}{L \times 10\,\text{MHz}}$), equivalent to a duration of 3.3 us. To avoid ISI, the maximum length of the FIR filter is derived by taking into account the guard interval and the CIR. Assuming that the delay spread of the vehicular channel is constrained to a maximum of 600 ns (as discussed in Section 6.2), based on the results stated in [102, 103], the CIR is equivalent to 6 samples of the 802.11p guard interval (the sampling frequency of 802.11p is 10 MHz). Therefore, a remaining effective guard interval of 10 samples is available for filtering. However, when the filter is used in a transmitter, a matched filter of equivalent length is required at the receiver [1], meaning that the remaining guard interval is effectively halved: only 5 samples remain for transmitter filtering.
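The length formulas of Table 6.2 are easy to evaluate numerically. The sketch below implements them as written in the table. The inputs, an attenuation of A = 55 dB and a transition band of $\Delta\omega = 0.2\pi$ rad/sample at L = 1, are assumptions chosen here because they give lengths near the last column of the table; the text does not state the exact values used, so they should not be read as the original design parameters.

```python
import math


def window_fir_lengths(atten_db, delta_omega):
    """Estimated FIR lengths from the Table 6.2 formulas.

    atten_db    -- required attenuation A in dB
    delta_omega -- transition band width in rad/sample
    """
    if atten_db > 21:
        kaiser = (atten_db - 7.95) / (2.23 * delta_omega)
    else:
        kaiser = 5.79 / delta_omega
    return {
        "Hamming": 6.22 * math.pi / delta_omega,
        "Hanning": 6.65 * math.pi / delta_omega,
        "Blackman": 11.1 * math.pi / delta_omega,
        "Kaiser": kaiser,
        "Chebyshev": (2.06 * atten_db - 16.5) / (2.29 * delta_omega),
    }


# Assumed, illustrative inputs (not taken from the text).
lengths = window_fir_lengths(atten_db=55.0, delta_omega=0.2 * math.pi)
for name, n in lengths.items():
    print(f"{name:10s} N ~ {n:.1f}")
```

With L-fold interpolation the transition band in rad/sample shrinks by a factor of L, so each estimate scales linearly with L, which is why the table expresses the lengths as multiples of L.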



FIGURE 6.4: Spectra of OFDM symbols for 802.11p using different FIR interpolation filters, with L=8.

Given that L-fold interpolation is used at the transmitter, the permitted FIR filter length becomes  $5 \times L$ , constituting one of the rules for the FIR filter design process.
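The design rule above can be checked against the table with a few lines of arithmetic. This sketch uses the figures from the text (16 CP samples, 6 samples of CIR, half of the remainder reserved for the receiver's matched filter) together with the per-L lengths from Table 6.2; the function name is introduced here for illustration.

```python
def permitted_fir_taps(cp_samples, cir_samples, interp_factor):
    """Longest transmit FIR filter, in taps at the interpolated rate."""
    remaining = cp_samples - cir_samples  # guard samples not used by the channel
    tx_share = remaining // 2             # other half goes to the matched filter
    return tx_share * interp_factor


# Required lengths per unit of L, from the last column of Table 6.2.
required_per_L = {"Hamming": 31, "Hanning": 33, "Blackman": 55,
                  "Kaiser": 33, "Chebyshev": 67}

L = 8
budget = permitted_fir_taps(cp_samples=16, cir_samples=6, interp_factor=L)
too_long = {name: n * L for name, n in required_per_L.items() if n * L > budget}

print("permitted taps:", budget)
print("filters exceeding the budget:", too_long)
```

For L = 8 the budget is 40 taps, whereas every window in Table 6.2 requires several hundred taps to reach the class D specification.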

Theoretically, in the conventional approach, a value of L can be chosen based on the baseband sampling frequency ( $F_s = 10 \,\text{MHz}$  for 802.11p) and the DAC maximum frequency,  $F_{DAC}^{max}$ , such that  $F_{DAC} = L.F_s \leq F_{DAC}^{max}$ .

In the proposed method, the baseband frequency of the OFDM symbol is increased by a factor of M over the critical sampling frequency (where M is a power of two). Thus, to maintain the same DAC sampling frequency, $F_{DAC} = L' \cdot M \cdot F_s$, only L'-fold interpolation is required. It is clearly preferable that $L' = L/M$ is an integer. For comparison between the proposed and conventional approaches, L should also be an integer, and in this chapter, assuming $F_{DAC}^{max} = 80 \text{ MHz}$, we choose L = 8 for the conventional method, and compare with two values of L' (2 and 4).

To visualise the spectral performance, a simulation is performed, with L=8, to evaluate filtered spectra using each window function, for 802.11p symbols. The spectral responses are plotted in Fig. 6.4, where the same OFDM signal as in Fig. 6.3 has been filtered by the FIR interpolation filters, and compared to the SEMs. In the figure, the filtered spectra obtained by using Kaiser, Hamming, Hanning, Blackman, and Chebyshev windows are compared (denoted using the abbreviations in Table 6.2). In each case, two prominent auxiliary peaks, visible beside the main spectrum, are the biggest impediments to satisfying the SEM criteria. In detail, the Blackman filtered spectrum slightly exceeds the class A limits, whereas the remaining filters are able to meet the requirements of classes A and B but not of classes C and D. In fact, none of the filters are even close to class C and D compliance. Hence, given the effective guard interval of 802.11p, FIR filtering clearly does not provide a solution.

The simulation results show that the common filtering methods used at interpolated baseband are not even close to meeting the strict SEM requirements of 802.11p. Although not shown here, this is of course equally true of the more stringent 802.11af SEM. This result implies that 802.11p and 802.11af implementations must rely on sharp front end RF and analogue filtering, which typically results in an increased total system cost and reduced power efficiency.

## 6.4 A Spectrum Efficient Shaping Method

The discussions above have revealed that the main challenge to conforming with strict SEMs is the narrow frequency guard which must accommodate a very sharp filter transition between pass and stop bands, since the stop-band attenuation is high. In this section, we briefly explain the method before building upon its foundation to derive a CR architecture for OFDM spectral leakage mitigation which we will then evaluate for both 802.11p and 802.11af.

#### 6.4.1 New Spectral Leakage Filtering Method

In a conventional approach, pulse shaping is only employed with small roll-off factors. This is because large roll-off factors involve longer filters, reducing the effective guard interval. Given the narrow frequency guard of the OFDM spectrum for the new standards, and the amount left after accounting for the effects of the CIR and matching filters, pulse shaping under such constraints is unable to cancel the image spectrum: in fact, even if the entire guard interval were to be used, this may be insufficient for the very stringent class D SEM in 802.11p, and the SEM of 802.11af. Thus our new method takes a different approach. Instead of using a large proportion of the guard interval for FIR filtering, we allow the pulse shaping to occupy a significant portion of the guard space, with large roll-off factors. To obtain the significant spectral leakage reduction necessary, the frequency guard needs to be increased, which is achieved by introducing a frequency guard extension technique. Then, given a wider frequency guard, pulse shaping with large roll-off factors can achieve significant side lobe compression of the OFDM signal, and the required transition band for FIR filtering is extended, which means a shorter FIR filter is able to attenuate the image spectrum.

The method thus involves three steps:

- The IFFT length is multiplied by a factor M, and the sampling frequency similarly increased by a factor M, to maintain the same subcarrier spacing. Given this, the allocation vector is formed to add data symbols at lower subcarriers that are the same as those in the original IFFT, while the remaining sub-carriers are zero-padded.
- Next, pulse shaping is applied in the normal way, to meet the given SEM constraint.
- Finally, L'-fold interpolation is used to allow simple filtering to remove the image spectrum.
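The first of the steps above, the frequency guard extension, can be sketched with NumPy as follows. The sketch assumes FFT bin ordering, in which the lower half of the bins holds the non-negative frequencies and the upper half the negative ones, so "lower subcarriers" is interpreted as the bins closest to DC on both sides. The function name, the values N = 64 and M = 4, and the fully loaded random QPSK-like allocation are illustrative assumptions, not the allocation vector of the actual design.

```python
import numpy as np


def extend_allocation(X, M):
    """Place the N original subcarriers in an M*N-point spectrum, zero elsewhere."""
    N = len(X)          # assumed even
    half = N // 2
    Xe = np.zeros(M * N, dtype=complex)
    Xe[:half] = X[:half]                  # non-negative frequency bins
    Xe[M * N - half:] = X[N - half:]      # negative frequency bins
    return Xe


rng = np.random.default_rng(1)
N, M = 64, 4  # illustrative values
X = rng.choice([1 + 1j, 1 - 1j, -1 + 1j, -1 - 1j], size=N)

x_base = np.fft.ifft(X)                           # original N-point symbol
x_ext = M * np.fft.ifft(extend_allocation(X, M))  # M*N-point symbol, rescaled

# Same subcarrier spacing: every M-th sample of the extended symbol
# coincides with the original symbol.
print(np.allclose(x_ext[::M], x_base))
```

Because the sampling frequency is raised by the same factor M, the subcarrier spacing and symbol duration are unchanged, while the zero-padded bins widen the gap between the wanted spectrum and its first image.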

Based on the proposed approach, we build a flexible CR architecture which is demonstrated to achieve the SEM requirements for both 802.11p (all classes A, B, C and D) as well as 802.11af in the UHF band.



FIGURE 6.5: The CR-Based architecture for adaptive OFDM spectral leakage shaping.

#### 6.4.2 Novel CR Filtering Architecture

A CR architecture for OFDM implies the agility to handle non-contiguous (NC) transmission of symbols (NC-OFDM) as well as the ability to adapt to different frequency bands, bandwidths and timing synchronisation regimes. For the purposes of this chapter, a CR architecture is developed which combines transmission of NC-OFDM symbols with switched sampling frequency, supporting both 802.11p and 802.11af. In particular, the architecture adaptively extends the frequency guard as required, and performs both pulse shaping and FIR interpolation filtering to meet the most strict SEM requirements of both standards. It should be understood that this CR architecture is designed to demonstrate compliance with the more difficult standards: it could trivially be de-rated to the much less stringent 802.11a and 802.11ac parent standards.

The proposed CR-based transmitter architecture is presented in Fig. 6.5, as implemented on a Xilinx FPGA for experimental purposes. As can be seen, the architecture consists of the baseband sub-modules, including $Pil\_Insert$, which flexibly inserts data symbols and pilots from the data modulator, $DAT\_Mod$, into an OFDM symbol according to the current allocation vector ($Alloc\_Vec$). $Pre\_Insert$ inserts the preamble symbol, while IFFT+CP Insert is an IP core with flexibly reconfigurable IFFT length and CP insertion. PulseShaping performs pulse shaping with a smoothing function for which the roll-off factor can be changed from small, for relaxed spectral shaping, to large, for more stringent spectral shaping. L-Fold Inter&FIR performs L'-fold interpolation, with L' being controllable on a symbol-by-symbol basis. After interpolation, the FIR block is used to filter out the image spectrum. In addition, for the CR architecture, the cognitive control sub-module $(CR\_Ctrl)$ is used to modify sub-module parameters to match SEM requirements imposed from the higher layers (i.e. it adjusts timing, bandwidth, frequency band and SEM requirements). The mixed-mode clock manager (MMCM) is another integrated IP core used to manage the sampling clock $(F_s)$, which is set according to the filter performance requirements and operating frequency band.

In addition, the *MMCM* allows the transmitter to reduce the degree of filtering (i.e. the degree of spectral leakage shaping) when transmission power is reduced, since transmit power reduction naturally reduces ICI. This is particularly important for lower power operating modes in which a lower sampling frequency, less filtering complexity, and reduced transmission amplitude all contribute to power savings.

Compare this with the 802.11p prototype presented in [108], which was adopted for direct device-to-device communication between smartphones. That innovative prototype was able to adaptively increase transmission power to extend communication range. However, the system was based on an 802.11a hardware solution and baseband, and did not investigate the increased spectral leakage when the transmitted signal was amplified to increase range (at which point it would not be likely to meet the 802.11p SEM requirement).

By contrast, the method proposed and implemented in this chapter is able to apply a more stringent SEM filter when transmission power is increased such that ICI exceeds a given threshold. In particular, $CR\_Ctrl$ is invoked to change the IFFT length to M times the original IFFT, while $Alloc\_Vec$ extends the frequency guard and MMCM increases $F_s$ according to the required IFFT length. Moreover, $CR\_Ctrl$ changes PulseShaping to use a large roll-off factor, reduces the L-fold interpolation (since $L' \times M$ is constant) and shortens the FIR length to meet the more stringent SEMs. On the other hand, when a device and access point are in closer proximity, the transmission power can be reduced such that the spectral leakage is small, and thus filtering can be relaxed. In this case, $CR\_Ctrl$ is invoked to change the IFFT length back to the original, and to employ PulseShaping with a small roll-off factor. Moreover, L-Fold Inter&FIR switches back to a normal range in order to reduce the amount of computation. It should be noted that the additional computation needed for signal processing in the baseband (which uses low cost, low power components) can be more than compensated for by relaxing the specification of the RF front-end design, since the analogue filtering requirements are so much less strict.

The following section presents the application of the proposed CR architecture to stringent filtering that achieves the SEM specifications of both 802.11p and 802.11af.

#### 6.5 Simulation Results and Discussion

## 6.5.1 Configuration and Performance Evaluation for 802.11p

Based on the signal model and environmental factors discussed in Section 6.2, the CIR length is assumed to not exceed 60 ns. Therefore, the effective guard interval, equivalent to the length of 10 samples of the original CP, i.e.,  $10 \times 100 = 1000$  ns, is used for the pulse shaping and FIR filter. By choosing a DAC sampling frequency of 80 MHz, the sampling frequency for 802.11p is increased by 8 times (L'.M = 8) over the original nominal rate of 10 MHz. Based on this configuration, two optional systems, denoted as Prop1 and Prop2, will now be explored for 802.11p:

*Prop1* doubles the size of the IFFT, i.e., M=2, which means doubling the sampling frequency to extend the frequency guard, after which 4-fold interpolation, i.e., L'=4, is required to obtain a sampling frequency of 80 MHz.



Figure 6.6: Spectrum of 802.11p signal of the proposed CR architecture after interpolation.

*Prop2* quadruples the size of the IFFT, i.e., M=4, and then applies 2-fold interpolation, i.e., L'=2, to achieve the 80 MHz sampling frequency.

Based on the results in Subsection 6.3.1, $p_2(m)$ is employed with $\beta N_T = 5 \times M$, i.e. equivalent to the length of 5 samples of the original CP (500 ns). It should be noted that, after extending the frequency guard, the number of samples in the symbol, including CP, is increased M times. Fig. 6.6 plots the shaped spectrum of the proposed method after interpolation in the baseband, at a sampling frequency of 80 MHz. The original spectrum, denoted Conv, and the specifications for classes C and D are also shown. The main spectra of Prop1 and Prop2 almost satisfy class D. The image spectrum of Prop1 is present at $\pm 20 \,\mathrm{MHz}$ and $\pm 40 \,\mathrm{MHz}$, whereas Prop2 has an image spectrum at $\pm 40 \,\mathrm{MHz}$ only.
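
The image locations quoted above can be checked with a short calculation. The following is a minimal sketch, assuming that L'-fold interpolation places spectral images at integer multiples of the IFFT output rate, folded into the baseband range of the DAC rate; the function name `interpolation_plan` is hypothetical and is not part of the implemented design.

```python
def interpolation_plan(nominal_fs_mhz, dac_fs_mhz, m):
    """Return (L', IFFT output rate in MHz, image centres in MHz).

    Assumption: images of the baseband spectrum appear at k * f_ifft for
    k = 1 .. L'-1, folded into [-dac_fs/2, +dac_fs/2]. Only the positive
    centres are returned; the spectrum is symmetric about 0 Hz.
    """
    l_fold = dac_fs_mhz // (nominal_fs_mhz * m)
    if l_fold * m * nominal_fs_mhz != dac_fs_mhz:
        raise ValueError("L' x M must equal the DAC-to-nominal rate ratio")
    ifft_fs = nominal_fs_mhz * m
    images = set()
    for k in range(1, l_fold):
        f = k * ifft_fs
        # Fold the image centre into the baseband range of the DAC rate.
        folded = ((f + dac_fs_mhz / 2) % dac_fs_mhz) - dac_fs_mhz / 2
        images.add(abs(folded))
    return l_fold, ifft_fs, sorted(images)


prop1 = interpolation_plan(10, 80, m=2)   # L' = 4, IFFT output at 20 MHz
prop2 = interpolation_plan(10, 80, m=4)   # L' = 2, IFFT output at 40 MHz
print(prop1, prop2)
```

Under this assumption the sketch gives images at 20 MHz and 40 MHz for Prop1 and at 40 MHz only for Prop2, in agreement with Fig. 6.6.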

A simpler, shorter-length FIR filter is needed to cancel the image spectra, since they lie much further away in frequency than for the original approach, Conv. The remaining guard interval for the transmitter filter and matched filter is 500 ns. Therefore, the maximum impulse response available for the image rejection FIR filter is 250 ns, which is equivalent to  $2.5 \times M.L'$  samples at the 80 MHz sampling frequency.



FIGURE 6.7: Spectrum of 802.11p signal using option *Prop1* with 20th order FIR filtering.

FIR filters are designed for *Prop1* and *Prop2* respectively, using a Kaiser window. For *Prop1*, since the frequency guard is still relatively narrow, an FIR filter with an impulse length of 20 samples is required to cancel the image spectrum. Fig. 6.7 shows the result of spectrum filtering for *Prop1*, with the original OFDM spectrum (*Conv*) and SEMs for classes C and D overlaid. It can be seen that there are still two small peaks caused by the image spectrum, but these are compressed by the FIR filter to meet the class D requirement. Some distortion is noticeable in the main spectrum due to the effects of the FIR filter.

Prop2 has a wider frequency guard compared to Prop1 and thus its FIR filter only requires a length of 12 samples to cancel the image spectrum. A remaining effective guard interval of 200 ns is reserved. Fig. 6.8 plots the result of this degree of spectral filtering for Prop2 and with respect to the class C and D SEMs. Clearly the image spectrum of Prop2 can be cancelled by a short-length FIR filter, while suffering less distortion (rounding) to the main passband.
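
The guard interval bookkeeping for the two options can be sketched as follows. This is an illustrative calculation only: it assumes that the transmitter filter and the matched filter have equal length and that each consumes guard time equal to its impulse response length, which is how the 250 ns and 200 ns figures above are interpreted here. The helper names are hypothetical.

```python
def max_fir_taps(filter_budget_ns, fs_mhz):
    """Longest impulse response (in samples) fitting in the time budget."""
    return int(filter_budget_ns * fs_mhz // 1000)


def leftover_guard_ns(guard_ns, taps, fs_mhz, n_filters=2):
    """Guard time remaining after n_filters filters of `taps` samples each.

    Assumption: transmitter filter and matched filter are the same length.
    """
    return guard_ns - n_filters * taps * 1000 / fs_mhz


# 802.11p at an 80 MHz DAC rate, with 500 ns left after pulse shaping.
print(max_fir_taps(250, 80))            # per-filter limit in samples
print(leftover_guard_ns(500, 20, 80))   # Prop1: 20-sample filter
print(leftover_guard_ns(500, 12, 80))   # Prop2: 12-sample filter
```

With these assumptions, the 20-sample filter of Prop1 uses the whole remaining guard interval, whereas the 12-sample filter of Prop2 leaves 200 ns in reserve.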

Overall, the simulation results demonstrate that the proposed CR architecture can meet the specification of class D, the most stringent of the four 802.11p SEMs. Prop2 obtains better performance in terms of distortion and effective guard interval compared to Prop1, but pays the cost of a higher computational requirement due to the increased IFFT size.

FIGURE 6.8: Spectrum of 802.11p signal for *Prop2* with 12th order FIR filtering.

#### 6.5.2 Configuration and Performance Evaluation for 802.11af

Based on the discussion in Section 6.2, if the CIR length is assumed to not exceed 1 µs, this is equivalent to 6 samples in the 802.11af CP. Therefore, the effective guard interval, equivalent to a length of 26 samples of the original CP, is available for use by the pulse shaping and FIR filters.

By choosing a DAC sampling frequency of 48 MHz, the output sampling frequency of 802.11af is increased by 8 times ($L' \times M = 8$) compared to the original nominal frequency (6 MHz). Simulation results for shaping the spectral leakage of 802.11af are presented here. The proposed method is compared to the conventional approach, which makes use of state-of-the-art pulse shaping and FIR filtering. In the conventional approach, pulse shaping uses 2 samples in the CP for a smoothing function, and the length of the FIR filter for cancelling the image spectrum is allowed to extend to 96 ($\frac{26-2}{2} \times 8$) to avoid ISI. Again, the FIR filter is designed using a Kaiser window, with a cut-off frequency set to attempt to compress the signal spectrum to meet the FCC mandated SEM.



FIGURE 6.9: Spectrum of 802.11af signal using the proposed CR architecture.

Our proposed method quadruples the size of the IFFT (i.e. M=4), to extend the frequency guard. Pulse shaping is configured to employ  $p_2(m)$  from Subsection 6.3.1, with  $\beta N_T = 20 \times M$ . That is equivalent to the length of 20 samples of the original CP. Then 2-fold interpolation (i.e. L'=2), is required to obtain the sampling frequency of 48 MHz. The maximum allowed FIR filter length for cancelling the image spectrum without inducing ISI is equivalent to 3 samples of the original CP. Fig. 6.9 illustrates the results of shaping the spectral leakage for 802.11af using this proposed CR architecture.
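
The difference between the two filter budgets can be made explicit with a small calculation. The sketch below assumes, as in the expression $\frac{26-2}{2} \times 8$ quoted for the conventional approach, that the guard interval left after pulse shaping is shared equally between the transmitter filter and the matched filter; `fir_budget` is a hypothetical helper name.

```python
def fir_budget(effective_guard_cp, shaping_cp, upsampling, n_filters=2):
    """Return the per-filter budget as (original CP samples, DAC-rate samples).

    effective_guard_cp: guard available after allowing for the CIR,
                        in samples of the original CP
    shaping_cp:         CP samples consumed by the pulse shaping function
    upsampling:         L' x M, the ratio of DAC rate to nominal rate
    """
    per_filter_cp = (effective_guard_cp - shaping_cp) / n_filters
    return per_filter_cp, per_filter_cp * upsampling


conv = fir_budget(26, 2, 8)    # conventional: 2 CP samples for shaping
prop = fir_budget(26, 20, 8)   # proposed: 20 CP samples for shaping
print(conv, prop)
```

The proposed configuration therefore trades most of the guard interval for pulse shaping and leaves a much shorter image rejection filter, which is acceptable only because the extended frequency guard moves the image spectrum further from the main spectrum.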

Because of the limited length, the transition band of the FIR filter is not narrow enough. This results in the spectrum of the conventional method, *Conv*, exhibiting two side lobes, as well as introducing visible distortion in the main spectrum. Thus, the conventional method is far from able to meet the SEM requirement for 802.11af. One method that might be considered for achieving this is to deactivate the outer sub-carriers (i.e. to use null sub-carriers instead). This can extend the frequency guard, but clearly results in a loss of spectral efficiency. Reducing transmission power is also possible, but is similarly unattractive since it would adversely impact range; in fact, the power reduction needed to bring the *Conv* transmission within the SEM envelope is not small. For example, the 802.11af prototype hardware in [104] requires the transmitted power to be attenuated by 20 dB to satisfy the SEM specifications.

On the other hand, the spectrum of the proposed CR architecture, denoted in Fig. 6.9 as *Prop*, comfortably meets the SEM specification without impacting either range or spectral efficiency.

#### 6.5.3 802.11af Spectral Efficiency

Another way of investigating performance is to compute spectral efficiency, given a flexible allocation of subcarriers. In other words, the number of occupied subcarriers is adjusted until the transmission profile fits within the 802.11af SEM, with the unoccupied subcarrier space used for filter roll-off. In the conventional method (Fig. 6.9), about 35 dB of additional filtering would be needed to suppress the image spectrum at the edge of the channel bandwidth (3 MHz), from -20 dBc to -55 dBc.

With a 96th order FIR filter as mentioned above for *Conv*, the transition band needed to achieve such suppression is estimated, based on the Kaiser window formula in Table 6.2, as being 0.83 MHz. This is equivalent to 21 subcarrier spacings which would need to be trimmed from each side of the 802.11af channel. Thus the number of occupied sub-carriers would need to be reduced from the standard 114 subcarriers to below 100 in order to give *Conv* a sufficient guard interval for FIR filtering.

However, the window formula is only an estimate, and hence the system is simulated here to explore further. In this case, the number of sub-carriers is reduced step-wise in pairs, from the edges working inwards, until the SEM is just satisfied. Fig. 6.10 plots results with 94 and 92 sub-carriers occupied ($Conv\_94s$ and $Conv\_92s$), showing that 92 sub-carriers meet the SEM requirement whereas 94 sub-carriers do not, by a small margin.

Referring back to Fig. 6.9, we also see a clear frequency gap between the spectrum of the proposed approach and the SEM. This gap could potentially be exploited to pack in several more occupied sub-carriers. We therefore undertake simulations to explore this phenomenon, and find that the proposed method is able to pack in up to 124 employed sub-carriers while still satisfying the SEM. This is also illustrated in Fig. 6.10 as $Prop\_124s$.

FIGURE 6.10: Fitting Filtered Spectrum of 802.11af signal to SEMs.

The results demonstrate that the proposed CR architecture can not only meet the stringent SEM requirement of 802.11af but could also be used to enhance spectral efficiency beyond that. Compared to the approach of trimming off edge carriers needed by an equivalent transmission power conventional system in order to satisfy the SEM, the proposed approach to CR spectrum shaping increases spectral efficiency by 32%.

## 6.6 Summary

In this chapter, shaping the OFDM leakage spectrum has been investigated at baseband within a CR architecture, in order to meet stringent spectral emission mask (SEM) requirements. In particular, this research considers two relatively new standards, 802.11p and 802.11af, which are defined for the physical layer and largely based upon existing standards. In both cases, the extended physical layers are scaled to encourage reuse of existing hardware, devices and designs, but the resulting systems are then subject to much more stringent SEMs. The research relies upon a combination of interpolation, IFFT length adjustment, pulse shaping and FIR image suppression filtering to mitigate spectral leakage into adjacent channels. Simulations show that the proposed architecture can meet the specification of the four 802.11p classes A to D, as well as the stringent FCC-imposed SEM for 802.11af in the UHF band. The proposed method is also shown to improve the achievable spectral efficiency for reuse of Television White Spaces in the 802.11af standard by 32%, given equivalent transmission power, compared to conventional approaches which need to drop the outermost subcarriers in order to meet SEM requirements. In addition, the architecture has the ability to adaptively change the degree of spectral leakage filtering in response to transmission power. The filtering computation can be reduced when transmission power is low, but when transmission power is high, it can be extended to meet the strict SEM specifications of both 802.11p and 802.11af. Furthermore, the architecture is capable of adjusting clock rate, bandwidth, and frequency band on a symbol-by-symbol basis, in order to implement an agile CR solution.

# Chapter 7

# A Novel Architecture for Multiple Standard Cognitive Radios

## 7.1 Introduction

Cognitive radios that support multiple bands and multiple standards, and that adapt operation according to environmental conditions, are becoming more attractive as the demand for higher bandwidth and more efficient spectrum use increases. Traditional implementations in custom ASICs cannot support such flexibility, with standards changing at a faster pace, while software implementations of baseband communications fail to achieve the performance required. Hence, FPGAs offer an ideal platform bringing together flexibility, performance, and efficiency. Our research focuses on designing the baseband processing of an MSCR system. This work explores the advantages of coupling PR and parameterised functional units in one architecture to offer flexibility for OFDM-based baseband processing while minimising reconfiguration time. To the best of our knowledge, there is no published research on dynamic reconfiguration for OFDM-based baseband processing of multiple standard cognitive radios on FPGAs.

The work presented in this chapter has also been discussed in:

- T. H. Pham, S. A. Fahmy, and I. V. McLoughlin, "Efficient Multi-Standard Cognitive Radios on FPGAs," PhD Forum Poster in Proceedings of the International Conference on Field Programmable Logic and Applications (FPL), Munich, Germany, September 2014 [109].
- T. H. Pham, S. A. Fahmy, and I. V. McLoughlin, "Efficient OFDM-based baseband processing for Multi-Standard Cognitive Radios on FPGAs," to be prepared.

## 7.2 Related Work

Most practical CRs are built using powerful general purpose processors to achieve flexibility through software, but they can fail to offer the computational throughput required for advanced modulation and coding techniques and they often have high power consumption. GNU Radio [31] has been a widely used platform in academia. It is a software application that runs on a computer or an embedded ARM processor platform, e.g. on the Ettus USRP E100. Computational limitations mean that while it has been successful for investigating CR ideas, it is not feasible for implementing advanced embedded radios using complex algorithms. Other software-based frameworks like Iris [23] have some limited support for FPGAs but suffer from poor bandwidth between software and hardware. Moreover, the compilation and reconfiguration times of software defined radio based systems are quite long, so they may not be suitable for CRs that must adapt to rapidly changing conditions.

In an application area with fast-moving standards that requires support for multiple standards, custom ASIC implementation is unlikely to be either agile or cost-effective enough to cope with fast-changing standards and operating requirements. In order to address this, Delorme et al. [71] presented a heterogeneous reconfigurable hardware platform for Cognitive Radio. It can adapt its hardware structure to support standards like GSM, UMTS, and wireless LAN. Most processing components run as embedded software on the nodes in a network on chip processor, while the channel coder and the mapping of the RX chain are implemented inside an FPGA. Partial reconfiguration (PR) is used to switch the channel coder from one context to another depending on SNR. A processor manages data movement between the different processors, the ASIC, and the FPGA. The need for a large data buffer and inefficient data transfer mechanisms lead to increased power consumption and reduced throughput. There are few studies on FPGA-based platforms for radio implementation. KUAR [21] is a mature radio platform built around a fully-featured Pentium PC with a Xilinx Virtex II FPGA. The baseband processing is accelerated on the FPGA, but is limited to NC-OFDM signals based on the IEEE 802.16 standard. Projects at Virginia Tech [110] have shown dynamically assembled radio structures on FPGAs, where the target radio system is defined at a high level with datapaths connecting relatively large functional modules. The modules are wrapped, and each of them consists of a PR module with compiled partial bitstreams stored in a dynamic library. Using PR eliminates the need for run time compilation, thus affording flexibility. A flexible radio controller can insert and remove compiled modules to adapt to current conditions.

# 7.3 Proposed OFDM-based baseband modulation for MSCR

Fig. 7.1 illustrates the proposed structure of baseband modulation for our OFDM-based MSCR. A mix of partially reconfigurable and parameterised modules make up the baseband implementation. FIFOs are included to help overcome the reconfiguration latency when PR modules are reconfigured. Since these modules are now just a small part of the system, buffering is significantly reduced over a more general implementation.



FIGURE 7.1: The structure of a generic MSCR system

### 7.3.1 System Description

We developed a prototype MSCR baseband modulation that supports transceiving non-contiguous OFDM (NC-OFDM) signals. This system can operate with the different OFDM symbol lengths and frame formats specified by multiple standards such as IEEE 802.11 [18], IEEE 802.16 [19], and IEEE 802.22 [20]. The main specifications of these standards are summarised in Table 7.1.

Table 7.1: System specifications of three supported OFDM-based standards.

| Specifications          | IEEE 802.11         | IEEE 802.16          | IEEE 802.22          |
|-------------------------|---------------------|----------------------|----------------------|
| Frequency band          | 2.4–2.5 GHz         | 5–6 GHz              | 54–862 MHz           |
| Channel Width           | 10 MHz              | 10 MHz               | 8 MHz                |
| Sampling Frequency      | 10 MHz              | 11.52 MHz            | 9.136 MHz            |
| FFT size $(N_{FFT})$    | 64                  | 256                  | 2048                 |
| CP Length               | 16                  | 32                   | 512                  |
| Number of data carriers | 48                  | 192                  | 1440                 |
| Number of pilots        | 4                   | 8                    | 240                  |

The CR implementation is divided into a control plane and a data plane. The data plane performs streaming data processing for the radio system. It transceives data streams to/from the RF front end through two AXI (Advanced eXtensible Interface) stream interfaces. For transmission, data sent from higher-level layers is processed and modulated by the data plane. The modulated sample streams are then transferred to the RF front end, where they are converted to analogue signals and subsequently up-converted to the RF band before transmission on the RF channel. On the receiver side, the received signals are down-converted to baseband, followed by analogue-to-digital conversion to form a sample stream. The sample stream is demodulated and processed in the data plane before being transferred to a higher-level layer. To support multiple standards, the data plane has to provide the flexibility to switch functions according to the parameters specified in different standards. The AXI stream interfaces are used for the overall data plane system as well as for each sub-module. This unified interface is required to make partial reconfiguration possible, since modules that are reconfigured should share the same connections. This interface also allows the data from higher layers to be processed in a streaming format. This reduces the requirement for buffering, hence optimising resource usage and total power consumption [83]. Each module has one slave interface to receive data from the previous module and one master interface to send processed data to the subsequent module.

The control plane is a cognitive radio (CR) engine that executes the adaptive algorithm, based on sensing data from the receiver, to decide the allocation vector and the required standard. It monitors events in the data plane via a set of registers connected to an AXI bus interface, and issues commands to the data plane through the same register interface. In addition, the CR engine contains the PR controller, which is responsible for downloading the precompiled bitstreams stored in DRAM into the corresponding PR modules through the ICAP (Internal Configuration Access Port) interface [72] when the CR engine decides to change the standard. The control plane can be implemented in one of a number of ways. It can be standalone software running on a processor core. It can also be hardware in a separate part of the FPGA. Alternatively, for maximum flexibility and programming support, it can run on top of an operating system on the processor. By ensuring that symbol data is processed and moved through the data plane independently of the processor in the control plane, we are able to achieve a high throughput. Normally, the data plane processes data streams based on a specified standard to transceive data symbols according to the allocation vectors. The CR engine must be able to perform sensing and adaptive tasks to update suitable allocation vectors based on the current channel status. When the frequency band of the current operating standard is mostly occupied by PUs and IUs, the CR engine switches to another standard or frequency band that is currently (or will soon be) free for transmission. The PR controller inside the CR engine is required to download the bitstreams of PR modules according to the new standard. In the meantime, the CR engine configures the system by writing relevant parameters to registers in CR-Regs, such as allocation vectors (Alloc-vec), symbol modulation type (MOD), and standard (STD).

This chapter focuses on the baseband modulation, in order to deal with the challenge of long reconfiguration times. The system takes advantage of coupling PR and parameterised functional units to offer flexibility while minimising reconfiguration time. The CR system is assumed to be capable of operating in burst mode, in which the system transmits packets as soon as data is available [111]. The transmitter subsystem only performs transmissions initiated by the higher layer. If the system needs to transmit data during a reconfiguration period, the data packet is stored in the FIFO. The stored data packet is flushed out of the FIFO for transmission after the reconfiguration of the transmitter has completed. Moreover, the hardware usage of the transmitter side is so small compared to the receiver side that it has a much faster reconfiguration time. The hardware usage and reconfiguration time of the transmitter are presented later in Section 7.4. The reconfiguration time does not cause a critical disruption of the processing chain or loss of transmitted packets. Therefore, a monolithic PR module is implemented for the transmitter side because of its simplicity and flexibility. In contrast, the receiver subsystem is required to continuously receive the signal and detect data frames. Furthermore, the hardware usage of the receiver subsystem is very large, resulting in a long reconfiguration time. To avoid losing data frames, a huge FIFO would be required to store received samples from the RF front end during reconfiguration, which is not practical. The critical issue is therefore to minimise the reconfiguration time of the receiver subsystem. To achieve this, the coupling of PR and parameterised functional units is applied to implement the receiver subsystem. The design of each module in the receiver subsystem to obtain flexibility, while minimising reconfiguration time, is investigated in the next subsection. In addition, after reconfiguring the receiver subsystem, the FIFO needs to be flushed in preparation for the next reconfiguration. A Mixed-Mode Clock Manager (MMCM) is employed to increase the receiver processing rate (rx\_clk). After the samples stored during reconfiguration have been read out, the receiver processing rate is reduced to the sampling rate (sys\_clk) to reduce power consumption. The MMCM module is also required to change the processing rate of the data plane according to the different sampling rates specified in different standards, shown in Table 7.1.
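
The relationship between reconfiguration time, FIFO depth and the raised receiver clock can be illustrated with a simple rate calculation. The sketch below is not taken from the implementation: it assumes one sample is written and one is read per clock cycle, and the 1000 µs reconfiguration time and the clock ratio of 3 are hypothetical values chosen only for illustration.

```python
import math


def fifo_depth_samples(fs_mhz, reconfig_us):
    """Samples that accumulate while the receiver is being reconfigured."""
    return math.ceil(fs_mhz * reconfig_us)


def flush_time_us(depth_samples, fs_mhz, clk_ratio):
    """Time to drain the FIFO when rx_clk = clk_ratio * sampling rate.

    Samples keep arriving at fs while the FIFO is read at clk_ratio * fs,
    so the net drain rate is (clk_ratio - 1) * fs.
    """
    if clk_ratio <= 1:
        raise ValueError("rx_clk must exceed the sampling rate to drain the FIFO")
    return depth_samples / ((clk_ratio - 1) * fs_mhz)


# Hypothetical example: 10 MHz sampling rate, 1000 us reconfiguration time.
depth = fifo_depth_samples(10, 1000)
print(depth, flush_time_us(depth, 10, clk_ratio=3))
```

The calculation shows why both quantities matter: the required FIFO depth grows linearly with reconfiguration time, while the time spent at the elevated clock rate shrinks as the clock ratio increases.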

## 7.3.2 Module Description

In this subsection, each functional module in the receiver processing chain, shown in Fig. 7.1, is investigated and commonalities across standards are analysed. In terms of design options, there is a tradeoff between the simplicity, flexibility, and long reconfiguration latency of PR modules and the increased hardware overhead, complexity, but faster configuration of parameterised modules. The hardware usage of a parameterised module supporting multiple standards is compared with that of the individual standard-specific modules. If the hardware overhead required to parameterise the module is less than a given percentage above the hardware size of the standalone module, then it is judged better to parameterise it. For those modules that require significant changes (i.e. the parameterisation overhead is greater than this threshold) or that are required to support parameters not yet specified for future standards (such as the preamble), PR can be used on a per-module level. The bitstream for each standard is implemented and precompiled separately. As a result, when switching from one standard to another, only part of the FPGA needs to be reconfigured.

1. FIFO buffer (FIFO): There are two FIFOs, one at the transmitter side and one at the receiver side, to buffer the data sent from the higher layer and from the RF front end, respectively. The FIFO buffers are implemented using FIFO IP cores configured with two AXI stream interface ports. In normal streaming operation, one data word is written to the buffer by the higher layer or the RF front end while another is read out from the buffer by the transmitter side or the receiver side in every system clock cycle. Therefore, the FIFO buffers normally operate in an almost empty state. When reconfiguration is required to switch the underlying standard, transmitter and receiver processing of the data streams is suspended. The FIFO buffers store the incoming data streams to avoid losing frames. Because the system operates in burst mode, the transmitted streams are not continuous, and the transmitter FIFO can therefore flush the stored data during the gaps between bursts. Data buffering is thus not a critical issue for the transmitter side. However, the receiver side needs to continuously process data streams from the RF front end to detect incoming frames. The receiver side does not have spare time in which to flush the data buffered in the FIFO during reconfiguration. Therefore, the receiver FIFO is configured with independent clocks, as shown in Fig. 7.2. In order to flush the data stored in the receiver FIFO after a standard-switching reconfiguration, the MMCM increases the receiver processing rate (rx\_clk) to a multiple of the sampling rate of the RF front end, so that the FIFO becomes almost empty in time for the next reconfiguration. When the receiver FIFO is almost empty, the receiver processing rate is reduced back to the sampling rate to minimise power consumption.

FIGURE 7.2: The received FIFO module

2. Synchronisation (Synch): The CR system operates in burst mode. Synch detects the presence of a frame and computes the quantities required to estimate the frequency offset, based upon the preamble of the received frame. This module performs estimation based on the timing metrics defined by Schmidl and Cox [3], expressed as

$$P[d] = \sum_{m=0}^{L-1} r^*[d+m]\, r[d+m+L]$$

$$R[d] = \sum_{m=0}^{L-1} |r[d+m+L]|^2$$

$$M[d] = \frac{|P[d]|^2}{(R[d])^2}, \tag{7.1}$$

where d denotes a time sequential index of received samples r, L is the periodic length of the short preamble, \* denotes complex conjugation.

As discussed in Chapter 4, frame detection is performed by finding the plateau of M when a frame is present at the receiver. P is also used to estimate the fractional CFO in the subsequent module. Fig. 7.3 shows the block diagram of the synchronisation implementation. The timing metrics are calculated using auto-correlation of the received samples. Coarse Time detects a new frame and roughly estimates the start of the frame by comparing the magnitude of the metric P with the value $\frac{R}{2}$, which is equivalent to detecting M greater than a threshold of 0.25. The method is a blind estimation, which gives it the generality to be applied to multiple standards. By parameterising the length L, the synchronisation module can operate effectively for the three currently supported standards as well as being extensible to future standards. Therefore, this module is implemented as a parameterised version with parameter L, whose values are defined together with the length of the FFT (NFFT), as shown in Table 7.2. The parameterised value combinations of L and NFFT allow for the support of multiple standards. The parameterised values consist not only of the required combinations for 802.11, 802.16, and 802.22, but also support other combinations for future standards.

Figure 7.3: The block diagram of Synchronisation module

Table 7.2: Parameterised values according to supported standards

| NFFT (rows) vs. L (columns) | 16     | 32 | 64     | 128 | 256 | 512    |
|------|--------|----|--------|-----|-----|--------|
| 64   | 802.11 |    |        |     |     |        |
| 128  |        |    |        |     |     |        |
| 256  |        |    | 802.16 |     |     |        |
| 512  |        |    |        |     |     |        |
| 1024 |        |    |        |     |     |        |
| 2048 |        |    |        |     |     | 802.22 |
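
The metrics in (7.1) and the coarse detection rule can be sketched in software as follows. This is a floating-point illustration only, not the streaming hardware structure of Fig. 7.3; the four-sample preamble period and the noise-free test signal are hypothetical, and the function names are chosen for this example.

```python
def schmidl_cox(r, L):
    """Return a list of (P[d], R[d]) pairs computed as in (7.1)."""
    metrics = []
    for d in range(len(r) - 2 * L + 1):
        P = sum(r[d + m].conjugate() * r[d + m + L] for m in range(L))
        R = sum(r[d + m + L].real ** 2 + r[d + m + L].imag ** 2
                for m in range(L))
        metrics.append((P, R))
    return metrics


def coarse_detect(metrics):
    """First index d at which |P| > R/2, i.e. M[d] > 0.25; None if absent."""
    for d, (P, R) in enumerate(metrics):
        if abs(P) > R / 2:
            return d
    return None


# Hypothetical test signal: silence followed by two periods of a short preamble.
L = 4
period = [1 + 1j, -1 + 1j, 1 - 1j, -1 - 1j]
frame_start = 10
r = [0j] * frame_start + period * 2

metrics = schmidl_cox(r, L)
first = coarse_detect(metrics)
P_at_start, R_at_start = metrics[frame_start]
M_at_start = abs(P_at_start) ** 2 / R_at_start ** 2
print(first, M_at_start)
```

Comparing $|P|$ with $R/2$ avoids the division in $M[d]$ and remains well defined when no signal is present ($R = 0$). Changing the supported standard only changes the value of `L`, which mirrors the parameterisation described above.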

3. Frequency Compensation: The FreComp module performs fractional CFO estimation based on the value of P passed from the previous module, Synch. Fractional CFO estimation and compensation are expressed mathematically as

$$\widehat{\Delta f} = \frac{\angle P}{2\pi L}$$

$$\widehat{r[d]} = r[d]e^{-j2\pi\widehat{\Delta f}d}, \tag{7.2}$$

where $\widehat{\Delta f}$ is the estimated fractional CFO, $\angle P$ denotes the angle of P, and N is the number of FFT points.

Figure 7.4: The block diagram of frequency compensation module

Fig. 7.4 illustrates the block diagram of frequency compensation. A phase rotation sub-module is used to compensate the fractional CFO by rotating the phase of each received sample by the appropriate angle. These angles are calculated and accumulated based on the estimated fractional CFO:

$$\phi[d] = \phi[d-1] + \frac{\angle P}{L}$$

$$\widehat{r[d]} = r[d]e^{-j\phi[d]} \tag{7.3}$$

According to (7.3), the computation in FreComp depends on the periodic length L of the short preamble that is used to calculate P. Assuming that L is a power of two, as is normally the case, the division by L can be computed efficiently with a right shift. Therefore, this module can support multiple standards efficiently by parameterising a right shifter according to the value of L.
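The accumulation in (7.3) and the shift-based division can be illustrated with a short Python sketch. It assumes a fixed-point phase word with `FRAC_BITS` fractional bits and an accumulator that starts at zero; both are assumptions made for this example and are not taken from the hardware design.

```python
import numpy as np

FRAC_BITS = 24  # assumed fraction width of the fixed-point phase word

def frac_cfo_compensate(r, P, L):
    """Rotate r[d] by -phi[d], where phi[d] = phi[d-1] + angle(P)/L and phi[0] = 0."""
    shift = L.bit_length() - 1
    assert 1 << shift == L, "L is assumed to be a power of two"
    angle_fx = int(round(float(np.angle(P)) * (1 << FRAC_BITS)))
    step_fx = angle_fx >> shift                       # division by L as a right shift
    phi = step_fx * np.arange(len(r)) / (1 << FRAC_BITS)  # accumulated phase
    return r * np.exp(-1j * phi)
```

Changing the supported standard only changes `shift`, which mirrors the parameterised right shifter described above.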

4. Fine STO Estimation: FineSTO\_Est estimates the starting sample of each OFDM symbol. The RF front-end for an MSCR needs to access a wide range of frequencies, shown in Table 7.3.1. Depending on the standard in operation, the CFO may be large, resulting in the presence of IFO. Therefore, the implementation of FineSTO\_Est is based on the algorithm presented in [92], which is also robust to IFO. The metric of fine STO estimation is expressed



FIGURE 7.5: The block diagram of fine STO estimation module

as follows:

$$S[d] = \sum_{m=0}^{L-1} |r[d+m+L]|^2 |a[m]|^2, \tag{7.4}$$

where |a[m]| denotes the normalised amplitude of the preamble at the transmitter. Fig. 7.5 shows the block diagram of fine STO estimation. The metric is calculated using a multiplierless correlation between the received samples and the transmitted preamble. Cross-correlation provides high precision for STO estimation. However, this operation requires a large number of DSP blocks, which increases the hardware resource usage and power consumption of the system. Multiplierless correlation [97] is used to reduce both. Peak Detect finds the maximum value of the correlation, which is used to estimate the STO accurately, and Fine Time determines the exact first sample of the next OFDM symbol (the long preamble symbol). However, multiplierless correlation is not flexible, because it depends on the preambles, which differ for each standard. Therefore, we employ a PR module for FineSTO\_Est to obtain flexibility. Each supported standard is implemented separately and precompiled to a stored bitstream. The suitable precompiled bitstream is then downloaded to the PR module by the CR engine when the underlying standard is switched. The PR module must evidently be large enough to contain the largest bitstream among the supported standards.
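A direct floating-point evaluation of the metric in (7.4) can be sketched as follows. This is an illustrative reference model only: it uses ordinary multiplications rather than the multiplierless hardware correlator, and the names `fine_sto_metric` and `peak_detect` are hypothetical.

```python
import numpy as np

def fine_sto_metric(r, a):
    """S[d] = sum_m |r[d+m+L]|^2 |a[m]|^2 for every valid d, as in (7.4)."""
    L = len(a)
    energy = np.abs(r) ** 2       # |r[n]|^2, unaffected by any phase rotation
    weights = np.abs(a) ** 2      # |a[m]|^2 of the known preamble period
    return np.correlate(energy[L:], weights, mode="valid")

def peak_detect(S):
    """Index of the largest metric value."""
    return int(np.argmax(S))
```

Since the metric uses only magnitudes, a frequency offset that rotates the samples leaves it unchanged, which is in line with the robustness to IFO noted above.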

5. Remove Cyclic Prefix: RemoveCP removes the cyclic prefix attached to each OFDM symbol. The operation of this module depends on the CP length  $L_{CP}$, which is specified differently in each standard. RemoveCP consists of a counter that counts from the beginning of each symbol and removes the CP samples if the counted value is smaller than  $L_{CP}$. This module can be parameterised by adjusting  $L_{CP}$  to support multiple standards.
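The counter behaviour of RemoveCP can be sketched in a few lines of Python. The sketch assumes that the first input sample is the first sample of an OFDM symbol (as delivered by the fine timing stage) and that each symbol occupies `n_fft + l_cp` samples; the function name is hypothetical.

```python
def remove_cp(samples, n_fft, l_cp):
    """Drop the first l_cp samples of every (n_fft + l_cp)-sample OFDM symbol."""
    out = []
    count = 0
    for s in samples:
        if count >= l_cp:              # past the prefix: keep the sample
            out.append(s)
        count += 1
        if count == n_fft + l_cp:      # end of symbol: restart the counter
            count = 0
    return out
```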

6. FFT: FFT is based on the Xilinx FFT/IFFT IP core and supports run-time reconfiguration of the FFT length according to the selected standard. When the standard is changed, the FFT length of the core is reconfigured using the configuration input. Reconfiguring the FFT/IFFT IP core is very fast, and a switch is accomplished within a few clock cycles.

7. IFO Estimation and Channel Equalisation: IFO\_Est&Ch\_EstEqu corrects the IFO and performs channel equalisation. IFO results in a cyclic shift in the frequency domain. Using differential demodulation of the FFT output, the IFO can be determined conventionally, with robustness to frequency selective channels, using the correlation function [99] on the second preamble (long preamble) symbol. The function is expressed as follows:

$$\hat{\epsilon} = \underset{\tilde{\epsilon}}{\operatorname{argmax}} \left| \sum_{k=0}^{N_{FFT}-1} Y^*[k-1]Y[k]X^*[k-\tilde{\epsilon}]X[k-1-\tilde{\epsilon}] \right| \tag{7.5}$$

where  $\epsilon$  denotes the value of IFO,  $\hat{\epsilon}$  and  $\tilde{\epsilon}$  are the estimated and trial values of  $\epsilon$, respectively, Y[k] and X[k] denote the  $k^{th}$  subcarrier of the received symbol and of the known transmitted preamble, respectively, and the OFDM symbol size  $N_{FFT}$  is equal to the FFT size. Fig. 7.6 illustrates the block diagram of IFO estimation and channel equalisation. The IFO can be estimated with high precision using cross-correlation in the frequency domain. The low-power, low-cost IFO estimation architecture presented in Chapter 5 is applied for this module. IFO Correction is performed efficiently by cyclically shifting the OFDM symbol by the estimated IFO.
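The search in (7.5) can be illustrated with a brute-force Python sketch. It treats all subcarrier indices as cyclic (modulo the FFT size) and searches trial values in a symmetric range `[-max_ifo, max_ifo]`; both choices, and the function names, are assumptions for this example. It does not model the resource-shared architecture of Chapter 5.

```python
import numpy as np

def estimate_ifo(Y, X, max_ifo):
    """Return the trial shift maximising the differential correlation of (7.5)."""
    diff_rx = np.conj(np.roll(Y, 1)) * Y             # Y*[k-1] Y[k]
    best, best_val = 0, -1.0
    for trial in range(-max_ifo, max_ifo + 1):
        Xs = np.roll(X, trial)                        # X[k - trial]
        diff_tx = np.conj(Xs) * np.roll(Xs, 1)        # X*[k-trial] X[k-1-trial]
        val = abs(np.sum(diff_rx * diff_tx))
        if val > best_val:
            best, best_val = trial, val
    return best

def correct_ifo(Y, ifo):
    """Undo the cyclic shift caused by the IFO."""
    return np.roll(Y, -ifo)
```

Because the correlation uses products of adjacent subcarriers, a channel that varies slowly across frequency contributes an almost constant factor to every term, which is why the differential form tolerates frequency selectivity.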

After compensating for IFO, the effects of the channel and residual STO must be taken into account to compensate the received subcarriers. By employing the information of the second preamble symbol, the effect of the channel can be estimated. The estimation and compensation of channel and residual



FIGURE 7.6: The block diagram of IFO estimation and channel equalisation

effect can be expressed as follows:

$$Y[k] = X[k]H[k] + N[k]$$

$$H[k] = \frac{Y[k] - N[k]}{X[k]}$$

$$\hat{H}[k] = \frac{Y[k]}{X[k]} \tag{7.6}$$

$$\hat{R}[k] = \frac{R[k]}{\hat{H}[k]} \tag{7.7}$$

where X[k] and Y[k] are the transmitted and received carriers in the preamble, respectively, H[k] represents the channel and residual STO effect, and N[k] is the AWGN. The equalisation taps are estimated in (7.6), and the compensation for received data carriers is given in (7.7), in which R[k] and  $\hat{R}[k]$  denote the received and compensated data carriers, respectively. Because the symbols are QPSK modulated, the symbol amplitude is not of concern, so the complex divisions in channel estimation and compensation can be performed equivalently by multiplying by the conjugates of X[k] and  $\hat{H}[k]$, respectively. The operation of this module depends on the second preamble, which is specified differently in each standard. Therefore, PR is used for the  $IFO\_Est\&Ch\_EstEqu$  module to obtain an efficient standard-specific implementation. Suitable precompiled bitstreams are downloaded to the PR module by the CR engine when the underlying standard is changed. The PR module must evidently be large enough to contain the largest bitstream among the supported standards.
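The replacement of complex division by conjugate multiplication can be sketched in Python as below. The sketch assumes constant-amplitude preamble carriers and QPSK data, so that only the phase of the result matters; the function names are hypothetical.

```python
import numpy as np

def estimate_channel(Y_pre, X_pre):
    """Y * conj(X): same phase as Y / X, scaled by |X|^2."""
    return Y_pre * np.conj(X_pre)

def equalise(R, H_hat):
    """R * conj(H_hat): same phase as R / H_hat; the amplitude is not restored."""
    return R * np.conj(H_hat)
```

For a noiseless carrier, the output equals the transmitted symbol scaled by the positive real factor |H[k]|^2 |X[k]|^2, so the signs of its real and imaginary parts, which are all a QPSK decision needs, are preserved.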

8. Phase Tracking: PhaseTrack estimates the residual common phase error in each OFDM symbol after channel equalisation. The *PhaseTrack* implementation is based on the algorithm presented by Troya et al. [112]. The estimate is computed on the pilot symbols inserted in the OFDM symbol. The transmitted pilots are typically assigned the values {1}. The residual phase error causes a phase rotation of the received pilots:

$$P_{k,l} = \cos\theta_l - \alpha k\sin\theta_l + j(\sin\theta_l - \alpha k\cos\theta_l), \tag{7.8}$$

where  $P_{k,l}$  denotes the phase of the received pilot with frequency index k in the  $l^{th}$  OFDM symbol,  $\cos\theta_l + j\sin\theta_l$  is the residual common phase error of the  $l^{th}$  OFDM symbol, and  $\alpha$  is the slope of the phase distortion. The residual common phase error is estimated in the same way for all supported standards, as follows:

$$\cos\theta_{l} = \frac{1}{N_{P}} \sum_{k \in S_{P}} \Re\{P_{k,l}\},$$

$$\sin\theta_{l} = \frac{1}{N_{P}} \sum_{k \in S_{P}} \Im\{P_{k,l}\}, \tag{7.9}$$

where  $N_P$  denotes the number of received pilots employed for estimation and  $S_P$  is the set of pilot frequency indices used. Fig. 7.7 shows the block diagram of phase tracking. Pilot Extract finds the pilots employed for phase tracking in the OFDM symbol based on the allocation vector (Alloc. Bits). Phase Accumulator and Phase Accumulator compute the residual common phase error according to (7.9). The phase error is compensated simply by multiplying the data carriers by the complex conjugate of the estimated common phase error. To support multiple standards,  $N_P$  is parameterised and  $S_P$  is determined through the allocation vector, with the values shown in Table 7.3.
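The averaging in (7.9) and the conjugate compensation can be sketched as follows. The allocation codes are those of Table 7.3; multiplying negative pilots by -1 before averaging is an assumption of this sketch, as are the function names.

```python
import numpy as np

# Allocation codes from Table 7.3; applying the pilot sign here is an assumption.
PILOT_SIGN = {"01": 1.0, "11": -1.0}

def estimate_cpe(symbol, alloc):
    """Average the sign-corrected pilots: cos(theta) + j*sin(theta) as in (7.9)."""
    pilots = [symbol[k] * PILOT_SIGN[a] for k, a in enumerate(alloc) if a in PILOT_SIGN]
    return np.mean(pilots)

def compensate_cpe(symbol, cpe):
    """Rotate all carriers back by multiplying by the conjugate of the estimate."""
    return symbol * np.conj(cpe)
```

Supporting another standard in this model only requires a different allocation vector, which fixes both the pilot set and the number of pilots.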

9. Data symbol demodulation (*DatSymDem*): In the final step, the received bits are extracted from the data symbols by a data symbol demodulation block named *DatSymDem*. In the present implementation, it supports only QPSK modulation, but it can be extended to support different data symbol

Table 7.3: Allocation vector coding.

| Subcarrier Type | Allocation Bits |
|-----------------|-----------------|
| Null            | 00              |
| Data            | 10              |
| Positive pilot  | 01              |
| Negative pilot  | 11              |



FIGURE 7.7: The block diagram of phase tracking module

modulations such as 16-QAM or 64-QAM in future, using the same basic interface. All data symbols go through this module, and two bits are assigned to the output according to the sign bits of the real and imaginary parts of each data symbol.
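The sign-based QPSK decision can be sketched in Python. The bit order (real part first) and the mapping of a negative component to bit 1, as for a two's-complement sign bit, are assumptions of this example.

```python
def qpsk_demod(symbols):
    """Emit two bits per symbol from the signs of the real and imaginary parts."""
    bits = []
    for s in symbols:
        bits.append(1 if s.real < 0 else 0)   # sign bit of the real part
        bits.append(1 if s.imag < 0 else 0)   # sign bit of the imaginary part
    return bits
```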

## 7.4 Performance Analysis and Discussion

### 7.4.1 Analysing the latency and halting time of PR module-based systems

One crucial challenge when implementing CR systems on reconfigurable hardware is the long adaptation time, including compilation time and reconfiguration time, when changing standards. Although the PR technique eliminates the compilation time (since PR uses a database of precompiled modules), the partial reconfiguration time is still relatively long, particularly for a large monolithic module. In the case of the CR receiver in particular, the system is halted during reconfiguration, potentially causing the loss of data packets and possibly even the loss of synchronisation. A huge FIFO may be required to store a stream of received samples to overcome the



FIGURE 7.8: Comparison of system latency

reconfiguration latency when a PR module is reconfigured. A longer reconfiguration time demands a larger FIFO, which significantly increases hardware resource usage and power consumption. Mathematical analysis is used to evaluate the system latency for monolithic PR modules, for a system employing a finer granularity with multiple PR modules, and for a mix of PR and parameterised modules, for the case when the system switches to a new channel condition. Fig. 7.8 illustrates the system latency for monolithic PR modules and for a system with finer granularity with multiple PR modules. We consider a system consisting of N modules.  $T_c$  refers to the reconfiguration time. The assumption is that the system or module cannot process data during its reconfiguration time.  $L_i$  is the computation latency of the  $i^{th}$  module. Received data can, of course, be processed during the computation latency. In the case of a large monolithic PR module employed for the system, the system latency,  $L_{sys}$, and the halting time,  $T_{hlt}$, during which a FIFO is required to buffer the received data that would otherwise be lost, are calculated as follows:

$$L_{sys} = T_c + \sum_{i=1}^{N} (L_i)$$

$$T_{hlt} = T_c, \qquad (7.10)$$

Finer granularity approaches divide the system into multiple sub-modules, each of which employs a PR module. When a module is completely configured, it can process the received data while the following module begins to be configured. Therefore, the system latency and halting time for the case of multiple PR modules can be calculated as follows:

$$L_{sys} = \sum_{i=1}^{N} (T_{ci}) + T_{dN} + L_{N}$$

$$T_{hlt} = \sum_{i=1}^{N} (T_{si}), \qquad (7.11)$$

where  $T_{di}$  refers to the processing delay of the following module and  $T_{si}$  is the stalling time spent waiting for the configuration of the following module. If the computation latency of a module,  $L_i$, is greater than the reconfiguration time of the following module,  $T_{ci+1}$, the following module has to delay operation by a duration  $T_{di}$  before it receives input data for processing. Otherwise, the previous module is halted for a duration  $T_{si}$  until the following module is completely configured, and the following module begins processing data immediately after its configuration is done  $(T_{di} = 0)$.

$$T_{di} = \begin{cases} \max(T_{ci}, T_{di-1} + L_{i-1}) - T_{ci}, & i = 2..N, \\ 0, & i = 1 \end{cases} \tag{7.12}$$

$$T_{si} = \begin{cases} T_{ci} - \min(T_{ci}, T_{di-1} + L_{i-1}), & i = 2..N, \\ T_{c1}, & i = 1 \end{cases} \tag{7.13}$$

Substituting the above equations into (7.11),

$$L_{sys} = \sum_{i=1}^{N} (T_{ci}) + L_{N} + (Max(T_{cN}, (Max(...) - T_{cN-1}) + L_{N-1}) - T_{cN})$$

$$T_{hlt} = \sum_{i=1}^{N} (T_{ci}) - \sum_{i=2}^{N} (min(T_{ci}, T_{di-1} + L_{i-1})), \qquad (7.14)$$

As can be seen from these equations, the system latency and halting time in the case of multiple PR modules are theoretically reduced, because the reconfiguration and data processing periods can overlap. In practice, the reconfiguration times are usually much greater than the processing latencies. This leads to  $T_{di} = 0$  and  $min(T_{ci}, T_{di-1} + L_{i-1}) = L_{i-1}$, so that (7.14) can be approximated as follows:

$$L_{sys} = \sum_{i=1}^{N} (T_{ci}) + L_{N}$$

$$T_{hlt} = \sum_{i=1}^{N} (T_{ci}) - \sum_{i=1}^{N-1} (L_{i}), \qquad (7.15)$$

In addition, because of optimisation during hardware compilation, partitioning into multiple PR modules carries an overhead, so that  $\sum_{i=1}^{N} (T_{ci})$  is clearly greater than  $T_c$  in (7.10). Therefore, the system latency,  $L_{sys}$, and halting time,  $T_{hlt}$, in (7.14) may be greater than those in (7.10). In general, the finer granularity approach improves the system latency and halting time only if the gain from overlapping reconfiguration and data processing is greater than the overhead of partitioning into multiple PR modules.
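The recurrences in (7.10) to (7.13) are easy to evaluate numerically. The following Python sketch is a minimal model under the stated assumptions (modules are configured one after another, and a module cannot process data while it is being configured); the function names and the example numbers are hypothetical and carry no units from the measured design.

```python
def monolithic(t_c, lat):
    """(7.10): one PR region. Returns (L_sys, T_hlt)."""
    return t_c + sum(lat), t_c

def multi_pr(t_cfg, lat):
    """(7.11)-(7.13): one PR region per module. Returns (L_sys, T_hlt)."""
    t_d, t_hlt = 0, 0
    for i, t_c in enumerate(t_cfg):
        if i == 0:
            t_d, t_s = 0, t_c                  # first module: T_d1 = 0, T_s1 = T_c1
        else:
            ready = t_d + lat[i - 1]           # T_d(i-1) + L(i-1)
            t_s = t_c - min(t_c, ready)        # (7.13)
            t_d = max(t_c, ready) - t_c        # (7.12)
        t_hlt += t_s
    return sum(t_cfg) + t_d + lat[-1], t_hlt   # (7.11)

print(monolithic(30, [1, 1, 1]))           # (33, 30)
print(multi_pr([10, 10, 10], [1, 1, 1]))   # (31, 28): no partitioning overhead
print(multi_pr([12, 12, 12], [1, 1, 1]))   # (37, 34): overhead outweighs the overlap
```

In this model, a module that switches mode with negligible configuration time can be represented by a zero entry in `t_cfg`; for example, `multi_pr([0, 10, 0], [1, 1, 1])` gives a latency of 12 and a halting time of 9.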

Based on the above analysis and comparison of large monolithic PR modules with a finer granularity approach, we propose a new method employing a mix of PR modules and parameterised modules to obtain a significant reduction in system latency and halting time. Each module in the processing chain of the finer granularity approach is investigated, and commonalities across different operation modes are analysed. For modules requiring only minor modifications, parameterised versions are created. When the  $i^{th}$  module is parameterised, its configuration time can be eliminated, because a parameterised module can switch operation mode within one clock period. This results approximately in the following simplified equations:

$$T_{ci} \approx 0$$

$$T_{si} \approx 0$$

$$T_{di} \approx T_{di-1} + L_{i-1}, \qquad (7.16)$$

The above equations show that overlapping reconfiguration with data processing becomes more effective, leading to a significant reduction in the system latency and halting time.

### 7.4.2 Analysing results of proposed OFDM-based MSCR architecture

The system latency and halting time of OFDM-based MSCR systems are investigated based on the hardware resource usage and system performance obtained when implemented on a Virtex-6 FPGA (XC6VLX240T). A comparison of the large monolithic PR module, finer granularity, and proposed approaches is given to show the advantages of the proposed architecture. To compute the configuration time of a PR module, the bitstream of each implementation must first be obtained. The area of the PR module, i.e., the occupied and reserved hardware resources, must satisfy the needs of the largest implementation. For the monolithic PR module approach, the PR module must be able to contain the 802.22 OFDM-based implementation, which is the largest receiver implementation among the three target implementations. Similarly, for the finer granularity approach, the configurations of the PR modules are computed from the sub-modules of the 802.22 OFDM-based implementation. Table 7.4 reports the hardware resource usage of each sub-module and of the transmitter and receiver systems for 802.22 on the Virtex-6 FPGA (XC6VLX240T).  $M_1$,  $M_2$,  $M_3$,  $M_4$,  $M_5$,  $M_6$,  $M_7$,  $M_8$  denote the functional modules of the OFDM-based system: synchronisation, frequency compensation, fine STO estimation, remove CP, FFT, IFO estimation and channel equalisation, phase tracking, and data symbol demodulation, respectively.  $M_R$  and  $M_T$  are the monolithic receiver and transmitter sub-systems, respectively. These results are used to calculate the bitstream size of the PR module for each functional block. Fig. 7.9 illustrates the bitstream size of each PR module, which is used later to calculate the configuration time. The bitstream sizes of the PR modules for the functional modules are relatively small compared to that of the monolithic PR module for the receiver sub-system. The  $M_3$  bitstream is the largest among the functional granular modules. Interestingly, the bitstream of the receiver sub-system is nearly triple that of the transmitter sub-system.

Table 7.4: Resources for 802.22 OFDM-based implementation

| Module                         | Slices | DSP | BRAM |
|--------------------------------|--------|-----|------|
| $M_1(Synch)$                   | 498    | 5   | 0    |
| $M_2(FreComp)$                 | 474    | 4   | 0    |
| $M_3(FineSTO\_Est)$            | 2414   | 0   | 0    |
| $M_4(RemoveCP)$                | 23     | 0   | 0    |
| $M_5(FFT)$                     | 1179   | 15  | 11   |
| $M_6(IFO\_Est \& Ch\_Est Equ)$ | 1249   | 6   | 0    |
| $M_7(Phasetrack)$              | 523    | 3   | 0    |
| $M_8(DatSymDem)$               | 4      | 0   | 0    |
| $M_R(\text{Receiver})$         | 6363   | 33  | 11   |
| $M_T(\text{Transmitter})$      | 1668   | 15  | 11   |

The latencies of the functional modules are considered for the three standards. Fig. 7.10 shows the latencies in number of clock cycles. As can be seen, the latencies for the 802.11 standard are the smallest, because this standard uses the shortest FFT length, i.e., the shortest symbol length for OFDM modulation. It should be noted that during the latency time the module still receives input data for processing. The processing chain is halted when the latency time has ended but the reconfiguration of the



FIGURE 7.9: The bitstream size of PR modules



FIGURE 7.10: The latency of sub-modules for three standards

following module has not been completed yet. Therefore, the worst-case halting time occurs with the shortest latency and the longest reconfiguration time, so the latencies for 802.11 are used to calculate the system halting time. The latency of the synchronisation module depends on the timing offset, which is the duration from the time input samples start to be received to the time when the first sample of the incoming frame is received. The synchronisation module does not output data if no frame is detected. In general, the synchronisation latency is calculated as  $L_1 = \text{Offset} + \text{processing time}$. To evaluate the processing latency of the synchronisation module alone, the timing offset is taken to be 0. The presence of the timing offset is considered later when calculating the system halting time and FIFO requirement.



Figure 7.11: The configuration time and latency of sub-modules for OFDM-based MSCR system

Partial reconfiguration is performed through the ICAP interface, which supports download speeds of up to 400 MBps. In practice, because of the overhead of the PR controller, a speed of 380 MBps is used for downloading the PR bitstream to the FPGA. Computing the reconfiguration time of a PR module by dividing the bitstream size by the download speed should therefore be acceptably accurate. In addition, the sampling frequency of the radio signal varies according to the currently chosen standard. We use a sampling frequency of 10 MHz (i.e., a clock period of 0.1 µs), which is typically defined for the 802.11p standard. The latency is calculated from the 802.11 results in Fig. 7.10 multiplied by the clock period of 0.1 µs. Fig. 7.11 illustrates the configuration time and latency of each module. The latency values are scaled by a factor of 10 for better visibility. As can be seen, the latency is very small in comparison to the configuration time, so overlapping reconfiguration with data processing yields little gain. Therefore, the finer granularity approach may not obtain a good improvement.
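The two conversions used here, bitstream size to configuration time and clock cycles to latency, can be written out as a short Python sketch. The constants come from the text, with 1 MB taken as 10^6 bytes (an assumption); the example inputs are hypothetical values chosen for round results and are not the measured bitstream sizes or latencies.

```python
ICAP_RATE = 380e6        # bytes per second: effective download speed from the text
CLOCK_PERIOD = 0.1e-6    # seconds: clock period for the 10 MHz sampling frequency

def config_time(bitstream_bytes):
    """Reconfiguration time as bitstream size divided by download speed."""
    return bitstream_bytes / ICAP_RATE

def latency_time(cycles):
    """Processing latency as the cycle count multiplied by the clock period."""
    return cycles * CLOCK_PERIOD

# Hypothetical inputs: a 190,000-byte bitstream and a 200-cycle latency.
print(config_time(190_000))   # about 5e-4 s
print(latency_time(200))      # about 2e-5 s
```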

The system halting time is the accumulated value of the halting time of each module, described in (7.13). During the halting time, the processing chain is halted and a FIFO is required to buffer input samples. Because the halting time of the synchronisation module depends on the time when a new frame is detected, the timing offset must be taken into account. Given a scenario of a transmission,



FIGURE 7.12: A scenario of a transmission



FIGURE 7.13: The halting time comparison of the system for three different approaches

shown in Fig. 7.12, when a standard switch is required, both the transmitter and receiver spend time reconfiguring the system for the new operating standard. In the proposed receiver, the synchronisation module is a parameterised module, so it can change its function very quickly, within one clock period, and hence can process input samples almost immediately. However, a new frame cannot be sent so quickly because the transmitter is still being reconfigured, resulting in a timing offset at the receiver. It is thus reasonable to choose the minimum timing offset as the configuration time of the transmitter, whose hardware characteristics are reported in Table 7.4.

Fig. 7.13 shows the system halting time of the three approaches. Mon, Mul, Pro denote the halting time of the monolithic PR module approach, the multiple PR module approach, and the proposed approach, respectively.  $Ts_n$  is the halting



FIGURE 7.14: A comparison of the three approaches in terms of system latency and FIFO requirement

time of the corresponding  $M_n$  functional module.  $Ts_R$  is the halting time of the monolithic receiver sub-system. As can be seen, the halting time of the multiple PR module approach is greater than that of the monolithic PR module approach, because the gain achieved by overlapping reconfiguration and data processing is smaller than the overhead of partitioning into multiple PR modules. The proposed approach significantly reduces the halting time, to less than one-third of that of the monolithic PR module approach. This results in a reduction in the FIFO requirement.

Fig. 7.14 compares the three approaches in terms of system latency and the FIFO requirement to avoid losing data frames. The system latencies are computed for the worst case, which is given by the latencies for the 802.22 standard (the longest latencies). The FIFO requirement is calculated by multiplying the sampling frequency by the halting time and rounding up to the next power of two. The system latency of the proposed method is significantly reduced, to 18 % and 27 % of the system latency of the monolithic PR module approach and the multiple PR module approach, respectively. The FIFO requirement for the proposed approach is only 16 kilo-samples (KSs), while the two other approaches require a FIFO that must store up to 64 KSs.
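The FIFO sizing rule described above can be sketched in Python. The halting times used in the example are hypothetical values, chosen only because they land on 16 and 64 kilo-samples at a 10 MHz sampling frequency; they are not the measured halting times.

```python
import math

def fifo_depth(sample_rate_hz, halting_time_s):
    """Samples arriving during the halt, rounded up to the next power of two."""
    samples = math.ceil(sample_rate_hz * halting_time_s)
    return 1 << max(samples - 1, 0).bit_length()

# Hypothetical halting times of 1.5 ms and 5 ms at 10 MHz.
print(fifo_depth(10e6, 1.5e-3))   # 16384 samples (16 KSs)
print(fifo_depth(10e6, 5e-3))     # 65536 samples (64 KSs)
```

Because of the rounding, any halting time between roughly 0.82 ms and 1.64 ms would map to the same 16 KS FIFO at this sampling frequency.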

## 7.5 Summary

Our research explores the feasibility of designing efficient multi-standard radios to enhance bandwidth efficiency and avoid spectrum congestion. Our proposed architecture assumes the use of OFDM radio and is implemented on an FPGA, coupling parameterized modules and PR modules to achieve flexibility while minimizing reconfiguration time. A mathematical analysis based on the FPGA synthesis results shows that the proposed architecture achieves a significant reduction of 82 % in system latency compared to conventional approaches. The FIFO requirement of the proposed method is also decreased to 25 % of that required for the conventional method.

# Chapter 8

# Future Work and Conclusion

This chapter gives a brief summary of the research contributions as well as providing an overall conclusion to the research presented in this thesis. Some directions for future work related to these contributions are also identified.

## 8.1 Conclusion

This thesis began with a literature survey which investigates the power consumption of FPGA devices, along with some low-power design strategies suitable for OFDM-based radio systems. The background discussion studied OFDM in terms of mathematical representation and functionality, while the performance of OFDM was analysed through its system model. Within this framework, the advantages and limitations of OFDM were discussed, and several challenges identified. The synchronisation issues related to OFDM systems were considered in depth, and the related work focusing on achieving good performance in terms of synchronisation was discussed on its merits as well as limitations. Meanwhile, the issue of spectral leakage of the OFDM signal was also investigated along with its trade-off effects in terms of ICI, especially related to pulse shaping and filtering. The challenges of implementing a MSCR in terms of improving bandwidth efficiency and avoiding spectrum congestion were identified and considered.

In order to reduce computational complexity and power dissipation, a multiplierless design was adopted for the timing synchronisation correlator and compared to conventional approaches for performing correlation-based frame synchronisers that use DSP48E1 slices. We discovered, in the context of synchronisation for IEEE802.16 OFDM systems, that simplified multiplierless designs offer comparable synchronisation performance, even for realistic models of channel conditions. While the DSP48E1-based correlators can support higher clock speeds, this is only possible through a detailed pipelined design. Furthermore, the power consumption and resource usage of multiplierless designs are considerably reduced. Since low-power, low-cost devices such as the Xilinx Spartan-6 do not include sufficient DSP Slices, this suggests adopting multiplierless designs for low-power implementations using such devices. We have shown that while very low quantisation resolution does impact synchronisation performance, with a step size of just 0.5, synchronisation accuracy is on a par with multiplier-based correlation. Multiplierless correlation on a Spartan-6 can save over 85% power compared to a DSP slice design on a Virtex-6 FPGA.

For OFDM synchronisation, conventional schemes achieve good performance when the CFO is within the range of fractional CFO estimation. Some methods employ cross-correlation for time synchronisation; these methods are robust to large CFO and can obtain acceptable performance at low SNR. However, the much higher computational resources needed for cross-correlation tend to make such methods unsuitable for hardware implementation, despite their good performance. Therefore, a new method was proposed to overcome these drawbacks of previously reported works. The proposed method takes advantage of the periodicity and energy distribution characteristics of the preamble. The synchronisation performance results, obtained through simulation, demonstrate good performance and robustness to large CFO. Moreover, the complexity of the timing metric is much less than that of cross-correlation-based methods, making the proposed method more suitable for hardware implementation. A novel IFO estimation was adopted using efficient and low cost circuitry based on a four-fold resource sharing architecture. The novel IFO estimation method yields significant power and resource

reduction, which makes an IFO estimation implementation on FPGA practical while also achieving excellent performance, close to the theoretically achievable bound. Coupling the robust OFDM synchronisation with IFO estimation at baseband is important because it allows the RF front-end specification to be relaxed, thus reducing system cost. In fact, for some multi-standard radios, and for applications suffering significant Doppler shift, RF constraints may be infeasible unless techniques such as this IFO estimation are applied.

The novel method for shaping the OFDM leakage spectrum at baseband was also studied within a CR architecture, in order to meet stringent operational SEM requirements. In particular, the research considered two relatively new standards, 802.11p and 802.11af, which specify physical layers which are largely based upon existing standards. In both cases, the extended physical layers are scaled to encourage reuse of existing hardware, devices and designs, but the resulting systems are then subject to much more stringent SEMs. To date, there have been no published implementation solutions for either system. An architecture was proposed which is able to dynamically change the degree of spectral leakage filtering according to transmission power, in which the computation of filtering can be reduced when transmission power is low, but when transmission power is high, it is able to meet the strict SEM specification of both 802.11p and 802.11af. Furthermore, the architecture has the ability to adjust clock rate, bandwidth, and frequency band on a symbol-by-symbol basis, in order to implement an agile CR solution.

The research investigated the feasibility of MSCR as well as the possible techniques for designing multi-standard radios on FPGAs. Traditional implementations in custom ASICs cannot support such flexibility for MSCR systems with standards changing at a faster pace, while software implementations of baseband communication fail to achieve the performance required. Hence, FPGAs offer an ideal platform bringing together flexibility, performance, and efficiency. The mathematical analysis explores the performance of the proposed architecture for MSCR based on mixing PR modules and parameterised modules. The PR modules provide flexibility, easy implementation and low resource usage but require long configuration times. The parameterised modules can be employed to reduce the configuration time of the system but lead to greater implementation complexity and increased resource usage. Coupling PR modules and parameterised modules in the proposed architecture for MSCR achieves both flexibility and a significant decrease in configuration time. The results calculated from the FPGA synthesis show that the proposed architecture achieves a significant reduction in system latency compared to conventional structures. The proposed method also requires a very small FIFO in comparison to conventional structures. This allows an MSCR to be implemented on an FPGA platform for low-cost, low-power systems.

This thesis has worked towards the implementation of an effective MSCR on an FPGA platform, utilising OFDM-based wireless standards and a careful balance between partial reconfiguration and parameterisation. In order to implement the MSCR demonstrator, which marks the endpoint of this research, several major challenges were identified in this research field, particularly in terms of timing synchronisation, frequency offset compensation and spectral emission masks. Each of these has been solved using novel contributions to the field. The use of an FPGA platform has enabled this research to take different approaches, which have improved on the state-of-the-art solutions from other researchers, have been recognised by publication in good journals and conferences, and have yielded solutions to these challenges that enabled the eventual implementation of the final MSCR demonstrator.

## 8.2 Future Work

In this thesis, a feasible implementation of an OFDM-based MSCR was presented, with proposed solutions to overcome challenges in terms of configuration time, synchronisation, and shaping of spectral leakage. A robust and efficient synchronisation method is proposed and evaluated. A novel filtering scheme is presented to meet the strict specifications of recent standards for CRs. A new architecture, based on coupling PR and parameterised modules, is investigated to reduce the reconfiguration time of MSCR on FPGA. The results presented in this research also raise some interesting research questions and potential future research directions, which are discussed as follows:

#### 8.2.1 Efficient adaptive shaping of spectral leakage

Static, strict SEMs in radio systems can guarantee that the systems mitigate the effect of ICI. However, the implementation and power cost may be significant. To meet a strict SEM, the number of subcarriers may need to be reduced, resulting in decreased spectrum efficiency and throughput; alternatively, the transmission power may be turned down, leading to a reduction in communication range. Extending the frequency guard instead can maintain the throughput and transmission power but involves increased computational cost. In practice, however, the adjacent channels are not always occupied and the communication range can change from time to time. Therefore, statically maintaining a very strict SEM may be redundant. An interesting research question is whether and how the CR can reduce this redundancy. The spectrum sensing ability of the CR allows the system to recognise the performance of adjacent systems as well as the currently allowed communication range. A novel method is needed to calculate a dynamic SEM based on the performance of adjacent systems and the currently allowed communication range. The dynamic SEM could then be temporarily more relaxed than the static SEM while still guaranteeing that it does not cause ICI to adjacent systems. In addition, an optimisation approach is needed that takes into account the costs and benefits of the above leakage reduction methods, so that the dynamic SEM can be satisfied with the smallest cost in terms of throughput, computation and power consumption.
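As a purely illustrative sketch of the kind of optimisation described above, the following Python fragment selects, from a set of candidate leakage reduction options, the cheapest one that still meets a required adjacent-channel attenuation. The class `LeakageOption`, the function `cheapest_option`, the weighted-sum cost, and all field names are assumptions made for this example; they are not part of the thesis, and a real formulation would need measured costs and a proper model of the dynamic SEM.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Tuple


@dataclass(frozen=True)
class LeakageOption:
    """One hypothetical way of reducing spectral leakage.

    All costs are assumed to be normalised to the range 0..1.
    """
    name: str
    attenuation_db: float   # adjacent-channel suppression achieved
    throughput_loss: float  # e.g. from switching off subcarriers
    compute_cost: float     # e.g. from longer filters
    power_cost: float       # e.g. extra power drawn by the design


def cheapest_option(
    options: Iterable[LeakageOption],
    required_db: float,
    weights: Tuple[float, float, float] = (1.0, 1.0, 1.0),
) -> Optional[LeakageOption]:
    """Return the feasible option with the smallest weighted cost.

    An option is feasible if it meets the attenuation currently required
    by the (dynamic) mask. Returns None if no option is feasible.
    """
    w_thr, w_cmp, w_pwr = weights
    feasible = [o for o in options if o.attenuation_db >= required_db]
    if not feasible:
        return None
    return min(
        feasible,
        key=lambda o: (w_thr * o.throughput_loss
                       + w_cmp * o.compute_cost
                       + w_pwr * o.power_cost),
    )
```

When the sensed environment relaxes the required attenuation, more options become feasible and a cheaper one can be chosen; the weights express which of throughput, computation and power matters most to the application.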

#### 8.2.2 Flexible and efficient MSCR platform

The interface to higher layer processing is another important factor in building a radio platform for MSCR. While optimisation of low-level blocks is important, providing a general interface for implementing higher layer processing is equally important. This allows radio experts to use the system to investigate cognitive radio techniques without the need for substantial low-level FPGA expertise. In future work, we aim to build a standardised software interface for this purpose that simplifies the process of retrieving system status and initiating reconfiguration. The standardised software interface would also allow the MSCR platform to be flexibly and rapidly extended to support new standards. In addition, as the number of supported standards increases, pre-determining the next standard in a switching pattern relies upon global (or co-operative distributed) spectrum sensing. Dual PR regions for a PR module could be a good solution to reduce reconfiguration time. One PR region contains the PR module for the currently operating standard while the other region can be reconfigured in readiness for the next standard in the switching pattern. When frequency bands change, the baseband processing can therefore be switched instantaneously to the second standard's PR module. To do so, a new method for pre-determining the next standard needs to be studied, based on spectrum sensing and possibly making use of a predictive system. The interface between the processing chain and the dual PR regions, as well as the switching mechanism, also needs to be investigated.
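The dual PR region idea can be sketched in software as a simple two-slot ("ping-pong") model. The Python fragment below is only a behavioural illustration under stated assumptions: the class `DualRegionSwitcher` and its methods are hypothetical names, the slow partial reconfiguration step is reduced to an assignment, and the prediction of the next standard is assumed to be supplied from elsewhere (for example, from spectrum sensing).

```python
class DualRegionSwitcher:
    """Behavioural model of two PR regions used in a ping-pong fashion.

    One region holds the module for the active standard; the other (idle)
    region can be preloaded with the predicted next standard.
    """

    def __init__(self, initial_standard: str) -> None:
        # regions[i] holds the name of the standard loaded in region i
        self.regions = [initial_standard, None]
        self.active = 0  # index of the region currently in the data path

    @property
    def active_standard(self) -> str:
        return self.regions[self.active]

    def preload(self, standard: str) -> None:
        """Reconfigure the idle region ahead of time (the slow step)."""
        self.regions[1 - self.active] = standard

    def switch(self, standard: str) -> bool:
        """Switch processing to `standard`.

        Returns True if the idle region was already preloaded correctly
        (fast switch), or False if the prediction missed and the idle
        region had to be reconfigured at switch time.
        """
        idle = 1 - self.active
        hit = self.regions[idle] == standard
        if not hit:
            # Misprediction: full reconfiguration latency is incurred here.
            self.regions[idle] = standard
        self.active = idle
        return hit
```

In this model the benefit of the second region depends entirely on how often the prediction is correct, which is why the method for pre-determining the next standard is identified above as an open question.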

# **Bibliography**

- [1] B. Farhang-Boroujeny, Signal Processing Techniques for Software Radios. Lulu Publishing House, 2008.
- [2] "Virtex-6 FPGA DSP48E1 Slice: User Guide." Xilinx, San Jose, CA. www.xilinx.com/support/documentation/user\_guides/ug369.pdf, 2011.
- [3] T. Schmidl and D. Cox, "Robust frequency and timing synchronization for OFDM," *IEEE Transactions on Communications*, vol. 45, pp. 1613–1621, Dec. 1997.
- [4] C. N. Kishore and V. U. Reddy, "A frame synchronization and frequency offset estimation algorithm for OFDM system and its analysis," *EURASIP Journal on Wireless Communications and Networking*, vol. 2006, pp. 1–16, 2006.
- [5] Federal Communications Commission, "Spectrum policy task force report," Tech. Rep. No. 02135, 2002.
- [6] R. Rajbanshi, A. M. Wyglinski, and G. J. Minden, "OFDM-based cognitive radios for dynamic spectrum access networks," in *Cognitive Wireless Communication Networks*, pp. 165–188, Springer US, 2007.
- [7] M. Cummings and S. Haruyama, "FPGA in the software radio," *IEEE Communications Magazine*, vol. 37, pp. 108–112, Feb. 1999.
- [8] T. A. Weiss and F. K. Jondral, "Spectrum pooling: An innovative strategy for the enhancement of spectrum efficiency," *IEEE Radio Communications*, March 2004.

- [9] R. Rajbanshi, A. M. Wyglinski, and G. J. Minden, "An efficient implementation of NC-OFDM transceivers for cognitive radios," in *Proceedings of the 1st International Conference on Cognitive Radio Oriented Wireless Networks and Communications*, June 2006.

- [10] J. Mitola and G. Q. Maguire Jr., "Cognitive radio: making software radios more personal," *IEEE Personal Communications*, vol. 6, pp. 13–18, Aug 1999.
- [11] J. Mitola, Cognitive Radio Architecture: The Engineering Foundations of Radio XML. Wiley, 2006.
- [12] A. L. Recio, Spectrum-Aware Orthogonal Frequency Division Multiplexing. PhD thesis, Virginia Polytechnic Institute and State University, 2010.
- [13] A. MacKenzie, J. Reed, P. Athanas, C. Bostian, R. M. Buehrer, L. DaSilva, S. Ellingson, Y. Hou, M. Hsiao, J.-M. Park, C. Patterson, S. Raman, and C. da Silva, "Cognitive Radio and Networking Research at Virginia Tech," *Proceedings of the IEEE*, vol. 97, pp. 660–688, April 2009.
- [14] C. J. Rieser, T. W. Rondeau, C. Bostian, W. R. Cyre, and T. M. Gallagher, "Cognitive radio engine based on genetic algorithms in a network." US Patent 7,289,972, Oct. 2007.
- [15] T. W. Rondeau, B. Le, C. J. Rieser, and C. W. Bostian, "Cognitive radios with genetic algorithms: Intelligent control of software defined radios," in Software Defined Radio Forum Technical Conference, pp. C3–C8, 2004.
- [16] A. Ghasemi and E. S. Sousa, "Spectrum sensing in cognitive radio networks: requirements, challenges and design trade-offs," *IEEE Communications Magazine*, vol. 46, no. 4, pp. 32–39, 2008.
- [17] Federal Communications Commission, "Notice of proposed rule making and order: facilitating opportunities for flexible, efficient, and reliable spectrum use employing cognitive radio technologies," Tech. Rep. 03-108, 2005.
- [18] IEEE std.802.11-2005, IEEE Standard 802 Part11: Wireless LAN Medium Access Control (MAC) and Physical Layer (PHY) Specifications.

- [19] IEEE std.802.16-2009, IEEE Standard for Local and Metropolitan Area Networks Part16: Air Interface for Fixed Broadband Wireless Access Systems.

- [20] IEEE std.802.22-2011, IEEE Standard for Wireless Regional Area Networks Part22:Cognitive Wireless RAN Medium Access Control (MAC) and Physical Layer (PHY) Specifications: Policies and Procedures for Operation in the TV Bands.
- [21] G. J. Minden, J. B. Evans, L. Searl, D. DePardo, V. R. Petty, R. Rajbanshi, T. Newman, Q. Chen, F. Weidling, J. Guffey, D. Datla, B. Barker, M. Peck, B. Cordill, A. M. Wyglinski, and A. Agah, "KUAR: A flexible software-defined radio development platform," in *Proceedings of IEEE International Symposium on New Frontiers in Dynamic Spectrum Access Networks (DySPAN)*, pp. 17–22, Apr. 2007.
- [22] P. Sutton, L. Doyle, and K. Nolan, "A reconfigurable platform for cognitive networks," in Proceeding of the International Conference on Cognitive Radio Oriented Wireless Network Communications (CROWNCOM), 2006.
- [23] P. Sutton, J. Lotze, H. Lahlou, S. Fahmy, K. Nolan, B. Özgül, T. Rondeau, J. Noguera, and L. Doyle, "Iris: an architecture for cognitive radio networking testbeds," *IEEE Communications Magazine*, vol. 48, pp. 114–122, Sep 2010.
- [24] S. Fahmy, J. Lotze, J. Noguera, L. Doyle, and R. Esser, "Generic software framework for adaptive applications on FPGAs," in *Proceedings of IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM)*, pp. 55–62, 2009.
- [25] J. Lotze, S. A. Fahmy, J. Noguera, B. Özgül, L. Doyle, and R. Esser, "Development framework for implementing FPGA-based cognitive network nodes," in *IEEE Global Communications Conference (GLOBECOM)*, 2009.
- [26] J. van de Belt, P. D. Sutton, and L. E. Doyle, "Accelerating software radio: Iris on the Zynq SoC," in *Proceedings of IFIP/IEEE International Conference on Very Large Scale Integration (VLSI-SoC)*, pp. 294–295, 2013.

- [27] Z. Miljanic, I. Seskar, K. Le, and D. Raychaudhuri, "The WINLAB network centric cognitive radio hardware platform: WiNC2R," Mobile Networks and Applications, vol. 13, no. 5, pp. 533–541, 2008, Springer.

- [28] H. So, A. Tkachenko, and R. W. Brodersen, "A unified hardware/software runtime environment for FPGA based reconfigurable computers using BORPH," ACM Transactions on Embedded Computing Systems, vol. 7, no. 2, 2008.
- [29] K. Amiri, Y. Sun, P. Murphy, C. Hunter, J. R. Cavallaro, and A. Sabharwal, "WARP, a unified wireless network testbed for education and research," in Proceedings of IEEE International Conference on Microelectronic Systems Education (MSE), pp. 53–54, June 2007.
- [30] K. Tan, H. Liu, J. Zhang, Y. Zhang, J. Fang, and G. M. Voelker, "Sora: high-performance software radio using general-purpose multi-core processors," Communications of the ACM, vol. 54, no. 1, pp. 99–107, 2011.
- [31] Free Software Foundation, Inc., "GNU Radio: The GNU Software Radio," 2009.
- [32] Free Software Foundation, Inc., "GNU Radio Companion," 2009.
- [33] R. Marlow, C. Dobson, and P. Athanas, "An Enhanced and Embedded GNU Radio Flow," in *Proceedings of the International Conference on Field Programmable Logic and Applications (FPL)*, pp. 1–4, 2014.
- [34] A. Love, W. Zha, and P. Athanas, "In pursuit of instant gratification for FPGA design," in *Proceedings of the International Conference on Field Programmable Logic and Applications (FPL)*, pp. 1–8, 2013.
- [35] C. Dick and F. Harris, "FPGA implementation of an OFDM PHY," in Conference Record of the Thirty-Seventh Asilomar Conference on Signals, Systems and Computers, vol. 1, pp. 905–909, Nov. 2003.

- [36] A. Fort, J.-W. Weijers, V. Derudder, W. Eberle, and A. Bourdoux, "A performance and complexity comparison of auto-correlation and cross-correlation for OFDM burst synchronization," in *Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP)*, vol. 2, pp. II–341–4, April 2003.

- [37] K. Wang, J. Singh, and M. Faulkner, "FPGA implementation of an OFDM-WLAN synchronizer," in *IEEE International Workshop on Electronic Design, Test and Applications (DELTA)*, pp. 89–94, Jan. 2004.
- [38] J. Guffey, A. Wyglinski, and G. Minden, "Agile radio implementation of OFDM physical layer for dynamic spectrum access research," in *Proceedings of IEEE Global Telecommunications Conference (GLOBECOM)*, pp. 4051–4055, Nov. 2007.
- [39] Z. Huang, B. Li, and M. Liu, "A proposed timing synchronization method for 802.16e downlink," in *International Symposium on Intelligent Signal Processing and Communication Systems (ISPACS)*, pp. 1–4, Dec. 2010.
- [40] A. Recio and P. Athanas, "Physical layer for spectrum-aware reconfigurable OFDM on an FPGA," in Proceedings of the Euromicro Conference on Digital System Design: Architectures, Methods and Tools (DSD), pp. 321–327, Sep. 2010.
- [41] E.-S. Shim and Y.-H. You, "OFDM integer frequency offset estimator in rapidly time-varying channels," in *Proceedings of the Asia-Pacific Conference on Communications*, pp. 1–4, 2006.
- [42] M. Morelli and M. Moretti, "Integer frequency offset recovery in OFDM transmissions over selective channels," *IEEE Transactions on Wireless Communications*, vol. 7, pp. 5220–5226, Dec. 2008.
- [43] Y.-H. You and K.-W. Kwon, "Multiplication-Free Estimation of Integer Frequency Offset for OFDM-Based DRM Systems," *IEEE Signal Processing Letters*, vol. 17, pp. 851–854, Oct. 2010.

- [44] K. Lee, S.-H. Moon, S. Kim, and I. Lee, "Sequence Designs for Robust Consistent Frequency-Offset Estimation in OFDM Systems," *IEEE Transactions on Vehicular Technology*, vol. 62, pp. 1389–1394, March 2013.

- [45] M. Morelli, L. Marchetti, and M. Moretti, "Integer frequency offset estimation and preamble identification in WiMAX systems," in *Proceedings of IEEE International Conference on Communications (ICC)*, pp. 5610–5615, June 2014.
- [46] A. G. Armada and M. Calvo, "Phase noise and sub-carrier spacing effects on the performance of an OFDM communication system," *IEEE Communications Letters*, vol. 2, 1998.
- [47] "IEEE 802 Standard; Part 11; Amendment 5: Television White Spaces (TVWS) Operation," Dec. 2013.
- [48] S. Shellhammer, A. Sadek, and W. Zhang, "Technical challenges for cognitive radio in the TV white space spectrum," in *Information Theory and Applications Workshop*, pp. 323–333, Feb 2009.
- [49] "IEEE 802 Standard; Part 11; Amendment 6: Wireless Access in Vehicular Environments," Jul. 2010.
- [50] W. Vandenberghe, I. Moerman, and P. Demeester, "Approximation of the IEEE 802.11p standard using commercial off-the-shelf IEEE 802.11a hardware," in *Proceedings of the International Conference on ITS Telecommunications (ITST)*, pp. 21–26, 2011.
- [51] J. Fernandez, K. Borries, L. Cheng, B. Kumar, D. Stancil, and F. Bai, "Performance of the 802.11p Physical Layer in Vehicle-to-Vehicle Environments," IEEE Transactions on Vehicular Technology, vol. 61, no. 1, pp. 3–14, 2012.
- [52] D. Jiang and L. Delgrossi, "IEEE 802.11p: Towards an international standard for wireless access in vehicular environments," in *Proceedings of IEEE Vehicular Technology Conference (VTC)*, pp. 2036–2040, 2008.

- [53] "IEEE Standard for Wireless Access in Vehicular Environments-Multi-Channel Operations," Dec. 2010.

- [54] X. Wu, S. Subramanian, R. Guha, R. White, J. Li, K. Lu, A. Bucceri, and T. Zhang, "Vehicular Communications Using DSRC: Challenges, Enhancements, and Evolution," *IEEE Journal on Selected Areas in Communications*, vol. 31, no. 9, pp. 399–408, 2013.
- [55] I. Macaluso, B. Ozgul, T. Forde, P. Sutton, and L. Doyle, "Spectrum and Energy Efficient Block Edge Mask-Compliant Waveforms for Dynamic Environments," *IEEE Journal on Selected Areas in Communications*, vol. 32, pp. 307–321, February 2014.
- [56] T. Forde, L. Doyle, and B. Ozgul, "Dynamic Block-Edge Masks (BEMs) for Dynamic Spectrum Emission Masks (SEMs)," in *IEEE Symposium on New Frontiers in Dynamic Spectrum*, pp. 1–10, April 2010.
- [57] P. Kryszkiewicz and H. Bogucka, "Dynamic determination of spectrum emission masks in the varying cognitive radio environment," in *Proceedings of IEEE International Conference on Communications (ICC)*, pp. 2733–2737, June 2013.
- [58] C. Masse, "A direct-conversion transmitter for WiMAX and WiBro applications," *RF Design*, vol. 29, no. 1, pp. 42–46, 2006.
- [59] Y.-M. Chen and I.-Y. Kuo, "Design of lowpass filter for digital down converter in OFDM receivers," in *Proceedings of the International Conference on Wireless Networks, Communications and Mobile Computing*, vol. 2, pp. 1094–1099, June 2005.
- [60] J. Dowle, S. H. Kuo, K. Mehrotra, and I. V. McLoughlin, "FPGA-based MIMO and space-time processing platform," in *EURASIP Journal of Applied Signal Processing*, pp. 137–137, Jan. 2006.

- [61] M. Faulkner, "The effect of filtering on the performance of OFDM systems," IEEE Transactions on Vehicular Technology, vol. 49, pp. 1877–1884, Sep. 2000.

- [62] I. Kuon, R. Tessier, and J. Rose, "FPGA architecture: Survey and challenges," Foundations and Trends in Electronic Design Automation, vol. 2, pp. 135–253, Feb. 2008.
- [63] P. Coussy and A. Morawiec, eds., High-Level Synthesis: From Algorithm to Digital Circuit. Springer, 2008 ed., Oct. 2008.
- [64] Xilinx Inc., UG470: 7 Series FPGAs Configuration User Guide, Jan. 2013.
- [65] Xilinx Inc., DS190: Zynq-7000 All Programmable SoC Overview, Mar. 2013.
- [66] Xilinx Inc., UG585: Zynq-7000 All Programmable SoC Technical Reference Manual, Mar. 2013.
- [67] C. Dobson, K. Rooks, and P. Athanas, "A Power-Efficient FPGA-Based Self-Adaptive Software Defined Radio," in *International Workshop on Power and Timing Modeling, Optimization and Simulation (PATMOS)*, 2014.
- [68] P. Sedcole, B. Blodget, T. Becker, J. Anderson, and P. Lysaght, "Modular dynamic reconfiguration in Virtex FPGAs," *IEE Proceedings - Computers and Digital Techniques*, vol. 153, pp. 157–164, May 2006.
- [69] E. McDonald, "Runtime FPGA Partial Reconfiguration," in *Proceedings of IEEE Aerospace Conference*, pp. 1–7, Mar. 2008.
- [70] J.-P. Delahaye, J. Palicot, C. Moy, and P. Leray, "Partial reconfiguration of FPGAs for dynamical reconfiguration of a software radio platform," in Proceedings of IST Mobile and Wireless Communications Summit, July 2007.
- [71] J. Delorme, J. Martin, A. Nafkha, C. Moy, F. Clermidy, P. Leray, and J. Palicot, "A FPGA partial reconfiguration design approach for cognitive radio based on NoC architecture," in *Joint International IEEE Northeast Workshop on Circuits and Systems and TAISA Conference (NEWCAS-TAISA)*, pp. 355–358, June 2008.

- [72] K. Vipin and S. Fahmy, "A high speed open source controller for FPGA Partial Reconfiguration," in *Proceedings of the International Conference on Field-Programmable Technology (FPT)*, pp. 61–66, Dec 2012.

- [73] K. Vipin and S. A. Fahmy, "Automated partitioning for partial reconfiguration design of adaptive systems," in *Proceedings of the Reconfigurable Architectures Workshop (RAW)*, May 2013.
- [74] A. Raghunathan, N. K. Jha, and S. Dey, *High-level power analysis and optimization*. Boston: Kluwer Academic, 1998.
- [75] L. Shang, A. S. Kaviani, and B. Kusuma, "Dynamic power consumption in Virtex-II FPGA family," in ACM International Symposium on FPGAs, pp. 157–164, ACM Press, 2002.
- [76] J. Anderson and F. Najm, "Interconnect capacitance estimation for FPGAs," in Proceedings of the Asia and South Pacific Design Automation Conference (ASP-DAC), pp. 713–718, 2004.
- [77] J. Anderson and F. Najm, "Power estimation techniques for FPGAs," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 12, no. 10, pp. 1015–1027, 2004.
- [78] E. Todorovich, E. Boemo, A. F., and V. J., "Statistical power estimation for FPGAs," in *Proceedings of the International Conference on Field Programmable Logic and Applications (FPL)*, pp. 515–518, 2005.
- [79] A. Reimer, A. Schulz, and N. W., "Modelling macromodules for high-level dynamic power estimation of FPGA-based digital designs," in *Proceedings of the International Symposium on Low Power Electronics and Design (ISLPED)*, pp. 151–154, 2006.
- [80] K. Danckaert, K. Masselos, F. Catthoor, H. De Man, and C. Goutis, "Strategy for power-efficient design of parallel systems," *IEEE Transactions on Very Large Scale Integration (VLSI) Systems*, vol. 7, no. 2, pp. 258–265, Jun 1999.

- [81] L. Varga, G. Hosszu, and F. Kovacs, "A low-power design technique for digital signal processing applications," in *Proceedings of the Mediterranean Electrotechnical Conference (MEleCon)*, vol. II, pp. 827–830, 2000.

- [82] P. P. Czapski and A. Sluzek, "Power optimization techniques in FPGA devices: A combination of system and low levels," *International Journal of Electrical, Computer and Systems Engineering*, vol. 1, no. 3, pp. 148–154, 2007.
- [83] Q. Liu, G. Constantinides, K. Masselos, and P. Cheung, "Data-reuse exploration under an on-chip memory constraint for low-power FPGA-based systems," *IET Computers & Digital Techniques*, vol. 3, no. 3, pp. 235–246, 2009.
- [84] S. Ahuja, High Level Power Estimation and Reduction Techniques for Power Aware Hardware Design. PhD thesis, Virginia Polytechnic Institute and State University, 2010.
- [85] C. Deng, "Power analysis for CMOS/BiCMOS circuits," in *International Symposium on Low Power Electronics and Design*, 1994.
- [86] P. Landman and J. Rabaey, "Activity-sensitive architectural power analysis," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 15, no. 6, pp. 571–587, 1996.
- [87] S. Gupta and F. Najm, "Analytical models for RTL power estimation of combinational and sequential circuits," *IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems*, vol. 19, pp. 808–814, 2000.
- [88] A. Raghunathan, S. Dey, and N. K. Jha, "High-level macro-modeling and estimation techniques for switching activity and power consumption," *IEEE Transactions on Very Large Scale Integration (VLSI) Systems*, vol. 11, no. 4, pp. 538–557, 2003.

- [89] "Xpower estimator userguide." Xilinx, San Jose, CA. http://www.xilinx.com/support/documentation/sw\_manuals/xilinx13\_2/ug440.pdf, 2012.

- [90] K.-W. Yip, Y.-C. Wu, and T.-S. Ng, "Design of multiplierless correlators for timing synchronization in IEEE 802.11a wireless LANs," *IEEE Transactions on Consumer Electronics*, vol. 49, pp. 107–114, Feb. 2003.
- [91] L. Hanzo and T. Keller, *OFDM and MC-CDMA: A Primer*. Wiley-IEEE Press, 2006.
- [92] T. H. Pham, I. V. McLoughlin, and S. A. Fahmy, "Robust and Efficient OFDM Synchronisation for FPGA-Based Radios," Circuits, Systems, and Signal Processing, vol. 33, pp. 2475–2493, Aug. 2014, Springer.
- [93] L. Schwoerer, "VLSI suitable synchronization algorithms and architecture for IEEE 802.11a physical layer," in *Proceedings of IEEE International Symposium on Circuits and Systems (ISCAS)*, vol. 5, pp. V–721, 2002.
- [94] F. Manavi and Y. Shayan, "Implementation of OFDM modem for the physical layer of IEEE 802.11a standard based on Xilinx Virtex-II FPGA," in Proceedings of IEEE Vehicular Technology Conference (VTC), vol. 3, pp. 1768–1772, May 2004.
- [95] T.-H. Kim and I.-C. Park, "Low-power and high-accurate synchronization for IEEE 802.16d systems," *IEEE Transactions on Very Large Scale Integration (VLSI) Systems*, vol. 16, pp. 1620–1630, Dec. 2008.
- [96] V. Erceg, K. V. S. Hari, and M. S. Smith, "Channel models for fixed wireless applications," *Tech. Rep. IEEE 802.16a-03/01*, Jul. 2003.
- [97] T. H. Pham, S. A. Fahmy, and I. V. McLoughlin, "Low-power correlation for IEEE 802.16 OFDM synchronisation on FPGA," *IEEE Transactions on Very Large Scale Integration (VLSI) Systems*, vol. 21, pp. 1549–1553, August 2013.

- [98] K. Bang, N. Cho, J. Cho, H. Jun, K. Kim, H. Park, and D. Hong, "A coarse frequency offset estimation in an OFDM system using the concept of the coherence phase bandwidth," *IEEE Transactions on Communications*, vol. 49, pp. 1320–1324, Aug. 2001.

- [99] M. Park, N. Cho, J. Cho, and D. Hong, "Robust integer frequency offset estimator with ambiguity of symbol timing offset for OFDM systems," in Proceedings of IEEE Vehicular Technology Conference (VTC), pp. 2116–2120, 2002.
- [100] Xilinx Inc., UG389: Spartan-6 FPGA DSP48A1 Slice, August 2009.
- [101] T. H. Pham, I. V. McLoughlin, and S. A. Fahmy, "Shaping Spectral Leakage for IEEE 802.11p Vehicular Communications," in *IEEE Vehicular Technology Conference (VTC)*, May 2014.
- [102] G. Acosta-Marum and M.-A. Ingram, "Six time and frequency selective empirical channel models for vehicular wireless LANs," *IEEE Vehicular Technology Magazine*, vol. 2, no. 4, pp. 4–11, 2007.
- [103] I. Sen and D. Matolak, "Vehicle–Vehicle Channel Models for the 5-GHz Band," *IEEE Transactions on Intelligent Transportation Systems*, vol. 9, no. 2, pp. 235–245, 2008.
- [104] Z. Lan, K. Mizutani, G. Villardi, and H. Harada, "Design and implementation of a Wi-Fi prototype system in TVWS based on IEEE 802.11af," in Proceedings of IEEE Wireless Communications and Networking Conference (WCNC), pp. 750–755, April 2013.
- [105] E. Bala, J. Li, and R. Yang, "Shaping Spectral Leakage: A Novel Low-Complexity Transceiver Architecture for Cognitive Radio," *IEEE Vehicular Technology Magazine*, vol. 8, no. 3, pp. 38–46, 2013.
- [106] D. Castanheira and A. Gameiro, "Novel Windowing Scheme for Cognitive OFDM Systems," *IEEE Wireless Communications Letters*, vol. 2, no. 3, pp. 251–254, 2013.

- [107] R. J. Kapadia, Digital Filters Theory, Application and Design of Modern Filters. Wiley VCH, 2012.

- [108] P. Choi, J. Gao, N. Ramanathan, M. Mao, S. Xu, C.-C. Boon, S. A. Fahmy, and L.-S. Peh, "A case for leveraging 802.11p for direct phone-to-phone communications," in *Proceedings of the International Symposium on Low Power Electronics and Design (ISLPED)*, pp. 207–212, ACM, 2014.
- [109] T. H. Pham, S. A. Fahmy, and I. V. McLoughlin, "Efficient Multi-Standard Cognitive Radios on FPGAs," in Proceedings of the International Conference on Field Programmable Logic and Applications (FPL), 2014.
- [110] P. Athanas, J. Bowen, T. Dunham, C. Patterson, J. Rice, M. Shelburne, J. Suris, M. Bucciero, and J. Graf, "Wires on Demand: run-time communication synthesis for reconfigurable computing," in *Proceedings of the International Conference on Field Programmable Logic and Applications (FPL)*, 2007.
- [111] B. Ai, Z.-X. Yang, C.-Y. Pan, J.-H. Ge, Y. Wang, and Z. Lu, "On the synchronization techniques for wireless OFDM systems," *IEEE Transactions on Broadcasting*, vol. 52, pp. 236–244, June 2006.
- [112] A. Troya, K. Maharatna, M. Krstic, E. Grass, U. Jagdhold, and R. Kraemer, "Efficient Inner Receiver Design for OFDM-Based WLAN Systems: Algorithm and Architecture," *IEEE Transactions on Wireless Communications*, vol. 6, pp. 1374–1385, April 2007.