# Optimization of Multi-Channel BCH Error Decoding for Common Cases

by

Russell Dill

A Thesis Presented in Partial Fulfillment of the Requirements for the Degree Master of Science

Approved April 2015 by the Graduate Supervisory Committee:

Aviral Shrivastava, Chair; Hyunok Oh; Arunabha Sen

ARIZONA STATE UNIVERSITY

May 2015

©2015 Russell Dill

All Rights Reserved

#### **ABSTRACT**

Error correcting systems have put increasing demands on system designers, both due to increasing error correcting requirements and higher throughput targets. These requirements have led to greater silicon area and power consumption, and have forced system designers to make trade-offs in Error Correcting Code (ECC) functionality. Solutions to increase the efficiency of ECC systems are very important to system designers and have become a heavily researched area.

Many such systems incorporate the Bose-Chaudhuri-Hocquenghem (BCH) method of error correcting in a multi-channel configuration. BCH is a commonly used code because of its configurability, low storage overhead, and low decoding requirements when compared to other codes. Multi-channel configurations are popular with system designers because they offer a straightforward way to increase bandwidth. The ECC hardware is duplicated for each channel and the throughput increases linearly with the number of channels. The combination of these two technologies provides a configurable and high throughput ECC architecture.

This research proposes a new method to optimize a BCH error correction decoder in multi-channel configurations. In this thesis, I examine how error frequency affects the utilization of BCH hardware. Rather than implement each decoder as a single pipeline of independent decoding stages, the channels are considered together and served by a pool of decoding stages. Modified hardware blocks for handling common cases are included and the pool is sized based on an acceptable but negligible decrease in performance.

This thesis's experimental approach examines multi-channel configurations found in typical NAND flash systems. My experimental data shows that the proposed pooled group approach requires significantly fewer hardware blocks than a traditional multi-channel configuration. By allowing a 2% performance degradation and sizing the decoding pool appropriately, the scheme reduces hardware area by 47%-71% and dynamic power by 44%-59%.

Additionally, I examined what improvements were possible with the improved design using the same hardware area as the traditional implementation. My experiments show that an improved throughput of 3x-5x can be achieved or NAND flash lifetime can be extended by 1.4x-4.5x.

#### **DEDICATION**

This paper is dedicated to my loving wife, who has had both eternal patience with my own commitments and the energy to deal with her own struggles.

#### **ACKNOWLEDGEMENTS**

The road to completing a thesis is long, bumpy, often confusing, and yet also exciting, enriching, and rewarding. I could not have travelled this road alone and owe much of my success to those who have helped me along the way. I'd like to thank those who have helped me get to where I am today.

On such a journey, it is invaluable to have an excellent guide. Without such a guide, I would have meandered more than I did and I certainly would have never completed my research. I have great gratitude for my academic advisor and committee chair, Dr. Shrivastava. Dr. Shrivastava has provided invaluable input into both my research and my writing.

I'd also like to thank Dr. Oh who has been able to provide valuable insight and advice. He has proven an invaluable resource in my area of research and my thesis would be much poorer without his help.

Finally, I'd like to thank the entire graduate advising department, who have had to endure my countless questions, forms, requests, and overrides. Christina Sebring, Cynthia Donahue, and Martha Vander Berg have not only ensured that I met the necessary requirements but pushed when necessary.

# CONTENTS

- LIST OF TABLES
- LIST OF FIGURES
- CHAPTER
  - 1 INTRODUCTION
  - 2 BACKGROUND
    - 2.1 Error Rates
    - 2.2 Flash Memory Lifetime
    - 2.3 BCH Codes
      - 2.3.1 Finite Field Overview
      - 2.3.2 Finite Field Operations Utilizing LFSR
      - 2.3.3 Encoding
      - 2.3.4 Decoding
        - 2.3.4.1 Syndrome Computation
        - 2.3.4.2 Error Locator Polynomial Generation
        - 2.3.4.3 Root Finding
  - 3 RELATED WORKS
    - 3.1 Improving Throughput
    - 3.2 Improving Efficiency
  - 4 MAIN OBSERVATIONS
  - 5 MY APPROACH
    - 5.1 Architecture
      - 5.1.1 Syndromes
      - 5.1.2 Syndrome/Error Locator Polynomial Interconnect
      - 5.1.3 Error Locator Polynomial Generator
      - 5.1.4 Error Locator Polynomial/Root Solver Interconnect
      - 5.1.5 Traditional Chien Root Solver
      - 5.1.6 Reduced Root Solver
      - 5.1.7 Output Units
    - 5.2 Determining the Number of Units
  - 6 EXPERIMENTS
    - 6.1 Setup
    - 6.2 Baseline Configuration
    - 6.3 Area Optimized BCH Decoder
    - 6.4 Throughput Optimized BCH Decoder
    - 6.5 Flash Lifetime Optimized Design
  - 7 CONCLUSION AND FUTURE WORK
- REFERENCES

# LIST OF TABLES

| Table | Title                                                 |
|-------|-------------------------------------------------------|
| 1     | $x^3 + x + 1$ over $GF(2^3)$                          |
| 2     | Targeted ECC Range                                    |
| 3     | Hardware Units Required for Area Optimized Decoder    |
| 4     | Hardware Units Required for Lifetime Optimized Design |
| 5     | BER Achievable with Lifetime Optimized Design         |

# LIST OF FIGURES

| Figure | Title                                                                          |
|--------|--------------------------------------------------------------------------------|
| 1      | Basic BCH Decoder Structure                                                    |
| 2      | P/E Cycles, BER, and ECC Strength Relation                                     |
| 3      | Example LFSR                                                                   |
| 4      | LFSR with Input                                                                |
| 5      | Probabilities of Errors at BER of $1 \times 10^{-4}$                           |
| 6      | An Example of the Proposed BCH Decoder                                         |
| 7      | Probability that More than $m$ Blocks Contain at Least One Error Where $n = 8$ |
| 8      | Probability that More than $m$ Blocks Contain More than One Error Where $n = 8$ |
| 9      | Units Required for BER of $2 \times 10^{-4}$                                   |
| 10     | Area Saving Results                                                            |
| 11     | Power Saving Results                                                           |
| 12     | Requirements of $2 \times 10^{-5}$ Design                                      |
| 13     | Throughput Optimization Results                                                |
| 14     | Improved Lifetime                                                              |

# Chapter 1

### **INTRODUCTION**

Error rates in storage and communication channels are increasing (Luyi, Jinyi, and Xiaohua 2012). Forward Error Correction (FEC) is a commonly used method to decrease the error rates of those channels (Rate 1983). FEC adds redundant information to the message to allow the receiver to correct errors. BCH codes are very commonly used across a wide range of systems (Sun, Rose, and Zhang 2006). Systems that utilize BCH error correction include wireless communication links, NAND flash storage, magnetic storage, on-chip cache memories, DRAM memory arrays, and data buses.

Although encoding BCH is fairly straightforward, performing the decoding steps is much more complex (Zambelli et al. 2012). System designers must balance the high complexity of BCH decoders with their overall system requirements (Strukov 2006). The decoders must provide high throughput, either by running at high clock speeds or by implementing bit-parallel operation. The maximum clock speed of the decoder is limited by the process technology and the complexity of the decoder. Additionally, adding bit-parallel operation increases the area of the decoder and makes it more difficult to achieve high clock speeds. Limited available area for the decoder can also limit the number of errors that can be corrected.

By developing a more area-efficient BCH decoder, several possibilities open up besides simply reducing area. The area savings can be used to add bit-parallel operation to improve throughput. Alternatively, the decoder could be designed to correct more errors, extending the useful life of flash memory or increasing the bit-rate of a communication channel.

A typical BCH decoder implementation is essentially a 3-stage pipeline, as shown in figure 1. The three stages of the pipeline are syndrome calculation, generating the error locator polynomial, and finding the roots of the error locator polynomial (Hong and Vetterli 1995). Each pipeline stage operates simultaneously and independently. Data is passed between the stages when the current stage is complete and the next stage is ready to receive the data. This pipelined configuration allows the decoder to operate on three codewords simultaneously.

Figure 1. Basic BCH decoder structure

The first stage, syndrome calculation, is similar in fashion to encoding and comes at a similar cost. A simple logic circuit known as a Linear Feedback Shift Register (LFSR) is typically used for syndrome calculation. As LFSRs are used in both encoding and syndrome calculation, work has gone into optimizing high-speed bit-parallel LFSR operation for BCH.

Calculating the error locator polynomial is performed by successive approximation using the Berlekamp-Massey algorithm. The implementation of the algorithm requires many multipliers and dividers, and consumes a large portion of the decoder. Work has been done on optimized Berlekamp-Massey implementations in general, as well as on the sharing of Berlekamp-Massey units between BCH channels.

Solving for the roots of the error locator polynomial is typically performed by brute force using an algorithm known as a Chien search (Litwin 2001). This algorithm checks for a root at each possible value of x. The Chien search can be expanded to a bit-parallel architecture. Optimization of this algorithm has been researched heavily, especially in the bit-parallel case due to the large area requirements.

Previous works have concentrated on optimizing the stages of single-channel decoders. Much progress has been made on improving the performance and efficiency of individual stages of the BCH decoding process. Although syndrome calculation is the simplest step, it has still received much attention as similar hardware is also used for BCH encoding. As performing operations in a bit-parallel manner can be used to improve performance, Jun et al. (2005) have presented work on improving LFSR performance. Additionally, Lee, Yoo, and Park (2012) have presented work on improving the syndrome calculation techniques. Generating the error locator polynomial is the most algorithmically complex step of BCH decoding. Compounding the issue, it cannot be modified for bit-parallel operation to improve throughput. Jamro (1997) has demonstrated a method of preloading the initial two steps of the algorithm as well as utilizing basis rearrangement to combine two serial steps into one. The final stage of the algorithm is root finding, typically implemented by the Chien search. Kristian et al. (2010) have demonstrated the straightforward step to convert the Chien search from a purely serial operation to a bit-parallel operation. As moving to bit-parallel operation quickly increases hardware area, Chen and Parhi (2004) have developed a group matching scheme to reduce the hardware complexity in the bit-parallel case.

In order to achieve further advances in BCH decoding, I examine the decoding process as a whole and specifically as implemented in multi-channel architectures. A multi-channel BCH decoder is typically designed by putting several single-channel BCH decoders together in parallel. For each set of decoded blocks, only a small fraction of the full error correcting capability is used. For instance, if no error is present in a block, which can be detected during the syndrome calculation, no additional stages are required. If one error is present in a block, the error locator polynomial can be solved directly rather than through a brute force search. For a wide range of error rates, these two cases are very common. My idea then is to optimize a multi-channel architecture for the common case, rather than the worst case. I use these observations along with the reduced root solver to optimize the stages of the BCH decoder pipeline so that the area requirements are greatly reduced while the optimization incurs a negligible performance degradation. The proposed optimizations reduce power consumption and area requirements greatly. Additionally, by trading saved area for greater complexity, we can improve throughput and error correcting capability as well.

In this thesis, I examine a fixed architecture decoder configured for a representative range of error correction capability. The base configuration chosen for the decoder is 8 channels, each 4 bits wide running at 200 MHz. This provides a total throughput of 6.4 Gbit/s. Experiments cover decoding strengths of 5 bits, 7 bits, 8 bits, and 10 bits. This covers a typical range of error rates. For the design parameters examined in this research, I achieve an area savings of 47%-71% if I allow a 2% performance degradation. For my test platform, this translates to a dynamic power savings between 44% and 62%.

Rather than reducing the area of the optimized design, I can keep the area the same and instead improve performance. My technique increases throughput by 3x-5x with the same area. Also, the improvements can increase the error correcting capability of the decoder with the same area, which increases the usable life of flash memory. The ageing of flash memory is determined by the number of Program/Erase (P/E) cycles each block has undergone. As the number of P/E cycles increases, the error rate also increases. There is a threshold then where the number of P/E cycles and associated error rate exceeds the error correction capability of the BCH decoder. Although the raw error rate increases rapidly as flash memory ages, the optimized decoder can improve flash lifetime by 1.4x-4.5x.

# Chapter 2

### **BACKGROUND**

## 2.1 Error Rates

The key component to understanding FEC and the improvements in this research is understanding error rates. Information theory tells us that coding systems exist that allow us to use noisy communication channels reliably. From the central result of Claude Shannon's information theory (Shannon 1948):

Let a discrete channel have the capacity C and a discrete source the entropy per second H. If  $H \leq C$  there exists a coding system such that the output of the source can be transmitted over the channel with an arbitrarily small frequency of errors.

Typical FECs transform the input data by adding specially calculated redundant check bits to form a codeword. The appropriate code must be selected for a number of bits to be corrected and a chosen block size. Larger block sizes have lower storage overhead, but higher algorithmic complexity.

If the number of errors that occur within the codeword exceeds the capability of the chosen code, an uncorrectable error occurs. The probability of an uncorrectable error occurring within a codeword determines the new channel error rate. This rate is calculated by determining the probability that t or fewer errors will occur in a block (where t is the number of errors that can be corrected by the code) and then working backwards to obtain the new bit error rate of the channel. This calculation also accounts for the coding loss, the additional probability that an error will occur in the redundant bits of the codeword.

In order to perform these calculations, the necessary values are the Bit Error Rate (BER), p, the number of bits in the codeword, n, the error correcting capability of the code, t, and the desired uncorrectable BER. The most basic calculation is determining the probability that an error-free message is received. This is the case if every bit in the message is correct (Houghton 2001, p. 168). I will represent this probability with $P_0(n)$.

$$P_0(n) = (1-p)^n \tag{2.1}$$

It is straightforward to calculate from eq. 2.1 the probability that at least one error has occurred,  $\neg P_0(n)$ .

$$\neg P_0(n) = 1 - P_0(n) \tag{2.2}$$

$$\neg P_0(n) = 1 - (1 - p)^n \tag{2.3}$$

Moving on from this, one can calculate the probability that exactly m errors occur in a message,  $P_{eq}(m,n)$ .

$$P_{eq}(m,n) = p^m (1-p)^{n-m} \binom{n}{m} \tag{2.4}$$

By summing eq. 2.4 for various values of m, one can calculate the probability that m or fewer errors occur,  $P_{le}(m,n)$ :

$$P_{le}(m,n) = \sum_{k=0}^{m} P_{eq}(k,n) \tag{2.5}$$

$$P_{le}(m,n) = \sum_{k=0}^{m} \left[ p^k (1-p)^{n-k} \binom{n}{k} \right] \tag{2.6}$$

One can then use eq. 2.6 to find the probability that more than m errors occur,  $P_{gt}(m,n)$ .

$$P_{gt}(m,n) = 1 - P_{le}(m,n) \tag{2.7}$$

$$P_{gt}(m,n) = 1 - \sum_{k=0}^{m} \left[ p^k (1-p)^{n-k} \binom{n}{k} \right] \tag{2.8}$$

Eq. 2.8 is important in selecting a BCH code as it shows the probability that a block contains an uncorrectable error. One can then work backwards to find the uncorrectable bit error rate: the probability that a block contains no uncorrectable error, $1 - P_{gt}(t,n)$, takes the place of $P_0(n)$ in eq. 2.1, which is then solved for $p$.

$$p(t,n)_{uncorr} = 1 - \left[ 1 - P_{gt}(t,n) \right]^{1/n} \tag{2.9}$$

Thus given a BER, p, a block size n, and a designed uncorrectable error rate, a sufficient t can be found.
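As a minimal sketch of the calculation above, the following Python functions evaluate eqs. 2.4 and 2.8 and search for the smallest sufficient t. The function names are illustrative and do not come from the thesis. Two implementation choices are assumptions made for numerical reasons: the binomial term is evaluated in log space, and $P_{gt}$ is summed directly over the tail $k > m$ (mathematically equal to eq. 2.8) so that very small probabilities are not lost to rounding. The last step treats $1 - P_{gt}(t,n)$ as the probability of a block free of uncorrectable errors and inverts eq. 2.1.

```python
import math


def p_eq(m, n, p):
    """Probability of exactly m bit errors in n bits (eq. 2.4), for 0 < p < 1."""
    log_binom = math.lgamma(n + 1) - math.lgamma(m + 1) - math.lgamma(n - m + 1)
    return math.exp(log_binom + m * math.log(p) + (n - m) * math.log1p(-p))


def p_gt(m, n, p):
    """Probability of more than m bit errors in n bits (eq. 2.8, summed as a tail)."""
    return sum(p_eq(k, n, p) for k in range(m + 1, n + 1))


def uncorrectable_ber(t, n, p):
    """Bit error rate left after correction: solve (1 - x)^n = 1 - P_gt(t, n) for x."""
    return -math.expm1(math.log1p(-p_gt(t, n, p)) / n)


def required_t(n, p, target):
    """Smallest correction strength t whose uncorrectable BER meets the target."""
    t = 0
    while uncorrectable_ber(t, n, p) > target:
        t += 1
    return t


if __name__ == "__main__":
    # Illustrative inputs: 4096-bit block, raw BER of 1e-4, target of 1e-15.
    print(required_t(4096, 1e-4, 1e-15))
```

For simplicity the example treats n as a fixed input. In a real code the codeword grows with t because of the added check bits, so the search would need to update n for each candidate t.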

## 2.2 Flash Memory Lifetime

The push to maximize the storage capacity of NAND flash memory has led to a storage medium that requires extensive error correction in order to be reliable. The primary causes of increasing error rates in flash memory are a decreasing process size and an increase in the number of bits stored per cell. Both of these techniques are able to increase storage space well beyond the additional overhead required by ECC.

The properties that lead to high storage densities within flash memory also lead to a lower lifetime. The wearing out of flash memory cells is caused by the high voltages incurred during P/E cycles. These high voltages lead to a deterioration of the tunnel oxide within the cell which then allows leakage. Smaller process geometries have a smaller tunnel oxide layer which wears faster. The smaller process geometries leave less margin for damage that occurs to the cell.

The lifetime of flash memory is rated by the number of P/E cycles it is intended to endure before being retired. Typical P/E lifetimes are rated in thousands of cycles. The targeted lifetime in P/E cycles is chosen as a compromise between durability and ECC requirements. However, by reducing the area and power required by BCH decoding substantially, that compromise can be shifted and the lifetime of the flash memory extended.

The data collected by Cai et al. (2012) shows that the relation between P/E cycles and error rates generally follows a polynomial growth. The BER for 3x-nm technology Multilevel Cell (MLC) NAND flash examined in their research closely follows the relation:

$$BER = A \cdot age^2 \tag{2.10}$$

Where A is a constant specific to a given flash memory. In rearranging the equation to show the relation between age and BER, the constant is eliminated and the following relation is shown:

$$\frac{BER_2}{BER_1} = \left[\frac{age_2}{age_1}\right]^2 \tag{2.11}$$

Thus, a doubling of the P/E cycles leads to a quadrupling of the BER. Figure 2 shows the relation between P/E cycles, the BER, and the strength of the BCH code required (Cai et al. 2012).

The amount of ECC strength required is calculated by using a block size of 4096 bits and targeted uncorrectable bit error rate of  $1\times10^{-15}$ . However, the number of bits of ECC overhead scales at a much faster rate.
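The quadratic relation in eq. 2.11 can be inverted to estimate how much additional lifetime a stronger decoder buys. The sketch below is illustrative only: the function names are hypothetical, and it assumes the quadratic model holds across the whole range of interest.

```python
import math


def ber_ratio(age_ratio):
    """Eq. 2.11: the BER grows with the square of the P/E cycle count."""
    return age_ratio ** 2


def lifetime_ratio(tolerated_ber_ratio):
    """Inverse of eq. 2.11: lifetime gain from tolerating a higher raw BER."""
    return math.sqrt(tolerated_ber_ratio)


if __name__ == "__main__":
    print(ber_ratio(2))         # doubling the P/E cycles: 4
    print(lifetime_ratio(4.0))  # tolerating 4x the BER: 2.0
```

Under this model, each doubling of lifetime requires a decoder that tolerates four times the raw BER.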

## 2.3 BCH Codes

Figure 2. P/E cycles, BER, and ECC strength relation (Cai et al. 2012)

BCH is a block-based error correction code, meaning that it operates on a block of bits at a time (Bose and Ray-Chaudhuri 1960). It transforms the input data by adding specially calculated redundant check bits to form a codeword. The appropriate code can be selected for a number of bits to be corrected and a chosen block size. Larger block sizes have lower storage overhead, but higher algorithmic complexity. This gives BCH a number of advantages, including:

- Configurability for number of bits to be corrected.
- Scales to different word sizes.
- Algebraic method for decoding.
- Original data embedded in codeword.

Each codeword within the code is constructed such that it is at least a minimum Hamming distance away from any other codeword. The minimum Hamming distance, $d_{min}$, is the smallest number of bits that must be changed within a valid codeword to transform it into another valid codeword. The number of bit errors that can be detected is thus one less than the minimum Hamming distance.

The function of the decoder is to determine which valid codeword the received codeword most closely resembles. If a codeword receives enough bit errors to cross half or more of the Hamming distance between two codewords, it will be incorrectly decoded. Thus the number of errors that can be corrected, t, is related to the minimum Hamming distance by the following relation:

$$d_{min} \ge 2t + 1 \tag{2.12}$$
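To make the relation concrete, the sketch below enumerates a small code and measures its minimum distance. The code used, all multiples of $x^3 + x + 1$ of length 7, is an assumed toy example (the cyclic (7,4) Hamming code, which is also a single-error-correcting BCH code) and is not a configuration examined in this thesis.

```python
from itertools import combinations


def clmul(a, b):
    """Carry-less (GF(2) polynomial) multiplication of two bit-packed polynomials."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        b >>= 1
    return result


def hamming_distance(a, b):
    """Number of bit positions in which two codewords differ."""
    return bin(a ^ b).count("1")


GENERATOR = 0b1011  # x^3 + x + 1

# Every codeword is a multiple of the generator polynomial.
codewords = [clmul(message, GENERATOR) for message in range(16)]

d_min = min(hamming_distance(a, b) for a, b in combinations(codewords, 2))
t = (d_min - 1) // 2  # largest t satisfying d_min >= 2t + 1

if __name__ == "__main__":
    print(d_min, t)
```

For this toy code the minimum distance is 3, so one error per block can be corrected and two can be detected.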

The encoding and decoding of BCH codes is performed by using finite fields. A short overview of finite fields is necessary in understanding both the mechanism of BCH codes and the proposed improvements.

### 2.3.1 Finite Field Overview

As the name implies, a finite field contains a finite number of elements. Within the set of elements, operations are defined such as addition, subtraction, multiplication, and division. All such operations on field elements result in another field element. Although a wide variety of finite fields can be defined, the use of binary finite fields makes for a straightforward implementation using digital systems.

A binary finite field is defined by its degree, n, denoted as $GF(2^n)$. The elements of a finite field are created by a generator polynomial. Each nonzero element in the field is a successive power of $x$, reduced modulo the generator polynomial. Thus the index of the element within the field is known as the power form. For example, for $GF(2^3)$, with a generator polynomial of $x^3 + x + 1$, the field shown in table 1 is produced:

Table 1.  $x^3 + x + 1$  over  $GF(2^3)$ 

| Power form | Polynomial form | Binary representation |
|------------|-----------------|-----------------------|
| 0          | 0               | b000                  |
| $x^0$      | 1               | b001                  |
| $x^1$      | x               | b010                  |
| $x^2$      | $x^2$           | b100                  |
| $x^3$      | x+1             | b011                  |
| $x^4$      | $x^2 + x$       | b110                  |
| $x^5$      | $x^2 + x + 1$   | b111                  |
| $x^6$      | $x^2 + 1$       | b101                  |

Finite field addition and subtraction are performed by adding or subtracting the polynomial forms. Because the characteristic of the field is two (binary field), addition and subtraction are equivalent. In either case, any two equal powers of x cancel out. For example, adding $x^2$ and $x^2 + x + 1$ produces x + 1. This is the equivalent of the logical exclusive or (XOR) operation.

Finite field multiplication is performed by multiplying the two polynomials together, performing elimination of terms as described above, and then taking the result modulo the generator polynomial. Finite field division is the inverse of finite field multiplication.
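The operations described above can be sketched in a few lines of Python. The sketch assumes elements are stored as integers whose bit i holds the coefficient of $x^i$, matching the binary representation column of table 1. The function names are illustrative.

```python
def gf_elements(poly, degree):
    """Return x^0, x^1, ... as integers, reduced modulo the generator polynomial."""
    elements = []
    value = 1
    for _ in range((1 << degree) - 1):
        elements.append(value)
        value <<= 1                  # multiply by x
        if value & (1 << degree):    # a term of x^degree appeared: reduce it
            value ^= poly
    return elements


def gf_add(a, b):
    """Addition and subtraction are both a bitwise XOR in a binary field."""
    return a ^ b


def gf_mul(a, b, poly, degree):
    """Shift-and-add polynomial multiplication, reduced modulo the generator."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a & (1 << degree):
            a ^= poly
    return result


if __name__ == "__main__":
    # x^3 + x + 1 is 0b1011; the output reproduces the rows of table 1.
    for power, value in enumerate(gf_elements(0b1011, 3)):
        print(f"x^{power} = b{value:03b}")
```

With these definitions, `gf_add(0b100, 0b111)` gives `0b011`, which is the $x^2 + (x^2 + x + 1) = x + 1$ example from the text.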

When utilizing finite fields for BCH codes, the number of nonzero elements in the field is equal to the number of bits within a codeword. For instance, $GF(2^8)$ contains 255 elements (excluding 0). The associated BCH block size would be 255 bits.

In order to make BCH codes easier to work with, only a portion of the codeword is used and the rest of the bits are set to zero. For instance, when using a block size of 16 bytes (128 bits), a BCH code with a block size of 255 bits would be selected. Throughout this thesis, codewords are assumed to be constructed in this way.

### 2.3.2 Finite Field Operations Utilizing LFSR

LFSRs are commonly used for finite field operations. The basic operation of an LFSR allows one to transform a finite field element to the next or previous element within the field. This is equivalent to multiplying or dividing by $x^1$. Thus repeated operation can multiply or divide by any power of x.

An LFSR consists of a set of registers interconnected in a ring configuration. Between each register there can be an XOR gate. The XOR gate combines the value of the previous register with feedback from the highest register. An example LFSR is shown in figure 3. The configuration shown can be used to produce the finite field shown in table 1. This is because the connections match the binary representation of the generator polynomial. In this configuration, the LFSR will cycle through each element of the field in order.



Figure 3. Example LFSR

LFSRs are commonly used for BCH operations, either in their default form, or in a slightly modified form that allows other operations, such as determining the quotient and remainder of a division (Saluja 1987). Such an LFSR is shown in figure 4. The numerator is fed into the input serially, and the XOR gates are chosen to represent the divisor.



Figure 4. LFSR with input

### 2.3.3 Encoding

BCH encoding is performed by dividing the input data by a specially formed polynomial. This is performed utilizing a modified LFSR that accepts a bit of input data per clock cycle. At the end of the operation, the LFSR contains the remainder of the operation which is the redundant code bits (J.-H. Lee et al. 2013).
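The following is a software model of serial polynomial division, offered as a sketch rather than as the exact circuit used in this work. It assumes the toy generator polynomial $x^3 + x + 1$, which yields a (7,4) single-error-correcting code. The function names and the message are illustrative.

```python
def lfsr_remainder(bits, poly, degree):
    """Shift bits in (highest order first) and return the remainder modulo poly."""
    state = 0
    for bit in bits:
        state = (state << 1) | bit
        if state & (1 << degree):   # feedback: subtract (XOR) the divisor
            state ^= poly
    return state


def encode(message_bits, poly, degree):
    """Systematic encoding: append the remainder of message * x^degree."""
    remainder = lfsr_remainder(message_bits + [0] * degree, poly, degree)
    parity = [(remainder >> i) & 1 for i in reversed(range(degree))]
    return message_bits + parity


if __name__ == "__main__":
    codeword = encode([1, 1, 0, 1], 0b1011, 3)
    print(codeword)
    # A valid codeword is divisible by the generator, so its remainder is zero.
    print(lfsr_remainder(codeword, 0b1011, 3))
```

The original data appears unchanged at the front of the codeword, which is the "original data embedded in codeword" property listed earlier.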

### 2.3.4 Decoding

The decoding process is broken into three stages. The input codeword is passed into the first stage and error locations are generated by the final stage. The stages operate independently and thus the process can be pipelined with three codewords being decoded simultaneously.

#### 2.3.4.1 Syndrome Computation

The first stage, syndrome computation, accepts the input data. The syndromes are a set of values that once computed, depend only on the error locations within the message, and not on the message itself. The number of syndromes is twice the number of errors that the BCH code can correct, t. This produces an underdetermined system, giving many possible solutions for error locations. It is up to the next stage to solve for the most likely situation.

The syndromes are generated by dividing the codeword by a set of minimal polynomials producing a set of remainders. Because of relations between the minimal polynomials, many syndrome elements can be easily derived from the other elements, reducing the amount of computation required. A useful property of the syndromes is that if all calculated syndromes are zero, then no errors exist in the received message.

Syndrome elements can be calculated by a modified LFSR or by repeated multiplication. The most efficient method for the given syndrome should be chosen. Both methods operate on one input bit at a time. This limits the overall bandwidth of the decoder to the clock rate of the syndrome units. However, syndrome calculation can be modified to perform bit-parallel operations, greatly increasing the throughput of the syndrome calculation stage at the cost of increased area and power.
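The repeated-multiplication method can be sketched as evaluating the received polynomial at successive powers of the field's primitive element using Horner's rule. The example below assumes the toy field of table 1 and the (7,4) code generated by $x^3 + x + 1$ with t = 1. All names and the sample codeword are illustrative.

```python
POLY, DEGREE = 0b1011, 3   # x^3 + x + 1 over GF(2^3), as in table 1
ALPHA = 0b010              # the element x


def gf_mul(a, b):
    """Multiply two field elements modulo the generator polynomial."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a & (1 << DEGREE):
            a ^= POLY
    return result


def gf_pow(a, exponent):
    """Raise a field element to a non-negative integer power."""
    result = 1
    for _ in range(exponent):
        result = gf_mul(result, a)
    return result


def syndromes(bits, t):
    """Evaluate the received polynomial at alpha^1 .. alpha^(2t), one bit per step."""
    values = []
    for j in range(1, 2 * t + 1):
        point = gf_pow(ALPHA, j)
        value = 0
        for bit in bits:                       # highest-order coefficient first
            value = gf_mul(value, point) ^ bit
        values.append(value)
    return values


if __name__ == "__main__":
    codeword = [1, 1, 0, 1, 0, 0, 1]           # a valid codeword of the toy code
    received = codeword[:]
    received[2] ^= 1                            # corrupt the coefficient of x^4
    print(syndromes(codeword, 1))               # all zero: no errors
    print(syndromes(received, 1))               # depends only on the error location
```

In the corrupted case the first syndrome equals $x^4$ (b110), the field element whose power matches the error position, and the second syndrome is the square of the first. This illustrates both the single-error shortcut and the way some syndromes can be derived from others.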

#### 2.3.4.2 Error Locator Polynomial Generation

The error locator polynomial is defined such that its roots give the locations of the errors within the message. The number of roots, or degree, of the error locator polynomial indicates the number of errors within the message. The second stage of the BCH decoding process is to generate the error locator polynomial from the set of syndromes.

The Berlekamp-Massey algorithm was developed to generate the error locator polynomial from a set of syndromes. It is an iterative algorithm which calculates a discrepancy at each stage, refining the approximation. This process requires several finite field multiplications, divisions, and additions per cycle of the algorithm which contributes to the overall complexity of the decoder.

One set of syndromes can produce multiple possible error locator polynomials, each with a different degree. It is assumed by the algorithm that the most likely occurrence, the fewest number of errors, indicates the most likely error locator polynomial. This highlights the fact that if more errors occur than the code is configured to handle, the decoder may decode the input data incorrectly.
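The iterative refinement can be illustrated with a textbook software formulation of the Berlekamp-Massey algorithm. This is a sketch under stated assumptions, not the hardware variant used in this work: it uses a toy field $GF(2^4)$ built from $x^4 + x + 1$, log/antilog tables for multiplication, and a hypothetical helper `syndromes_for` that builds the syndromes directly from known error positions.

```python
POLY, DEGREE = 0b10011, 4            # x^4 + x + 1 (assumed toy field)
ORDER = (1 << DEGREE) - 1            # 15 nonzero elements

EXP = [0] * (2 * ORDER)              # EXP[i] = alpha^i, doubled to avoid a modulo
LOG = [0] * (ORDER + 1)              # LOG[alpha^i] = i
value = 1
for power in range(ORDER):
    EXP[power] = EXP[power + ORDER] = value
    LOG[value] = power
    value <<= 1
    if value & (1 << DEGREE):
        value ^= POLY


def mul(a, b):
    return 0 if a == 0 or b == 0 else EXP[LOG[a] + LOG[b]]


def inv(a):
    return EXP[ORDER - LOG[a]]


def syndromes_for(positions, count):
    """Hypothetical helper: S_k is the sum of alpha^(k * position) over all errors."""
    result = []
    for k in range(1, count + 1):
        s = 0
        for pos in positions:
            s ^= EXP[(pos * k) % ORDER]
        result.append(s)
    return result


def berlekamp_massey(syndromes):
    """Return error locator coefficients [1, c1, ..., cv], lowest degree first."""
    size = len(syndromes) + 1
    current = [1] + [0] * (size - 1)   # current estimate of the locator
    previous = current[:]              # estimate before the last length change
    length, shift, prev_disc = 0, 1, 1
    for n, s in enumerate(syndromes):
        disc = s                       # discrepancy between prediction and syndrome
        for i in range(1, length + 1):
            disc ^= mul(current[i], syndromes[n - i])
        if disc == 0:
            shift += 1
            continue
        saved = current[:]
        scale = mul(disc, inv(prev_disc))
        for i in range(size - shift):
            current[i + shift] ^= mul(scale, previous[i])
        if 2 * length <= n:            # the estimate must grow in degree
            length, previous, prev_disc, shift = n + 1 - length, saved, disc, 1
        else:
            shift += 1
    return current[:length + 1]


if __name__ == "__main__":
    print(berlekamp_massey(syndromes_for([2, 5], 4)))
```

The degree of the returned polynomial equals the number of errors. For a single error the result reduces to $1 + S_1 x$, which is why that case can be solved directly.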

#### 2.3.4.3 Root Finding

To find error locations, roots of the error locator polynomial must be found. Since the degree of the polynomial can be as large as t, a brute force algorithm is used for hardware BCH implementations. An optimized algorithm used for this brute force search has been developed and is known as a Chien search. To implement the Chien search, a set of registers is loaded with the coefficients of the error locator polynomial. During each cycle of the Chien search, each register is multiplied by $x^n$, where n is the degree of x associated with the given coefficient. At the end of each cycle, all registers are summed. If the sum of all the registers is zero, then a root has been located. The cycle number indicates the index within the block of the error location.

The order of the Chien output can be made to match the order of the input message. Thus the output of the BCH decoder is a set of locations within the message that must be toggled to correct received errors.

As in syndrome computation, the Chien search operates one bit per cycle and the bandwidth is thus limited to the clock speed of the Chien unit. To improve bandwidth, multiple Chien search steps must be performed each cycle. The most straightforward way of performing this parallel operation is to duplicate the Chien search block for each bit of parallel output. Each stage must skip ahead by k cycles where k is the number of parallel outputs. While some logic can be shared between the parallel units, the cost in area and power of parallelizing the Chien search operation is high.
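The serial register-and-sum procedure described in this section can be sketched in a few lines of Python. This is a minimal software model, assuming the field GF(2^4) with primitive polynomial $x^4 + x + 1$ and assuming that a root $\alpha^{-i}$ marks an error at bit $i$. The names `chien_search`, `gf_mul`, and `gf_pow` are hypothetical, and real hardware would use constant multipliers rather than a general multiply.

```python
# Serial Chien search over GF(2^4); illustrative only.
POLY, M, N = 0b10011, 4, 15   # assumed field and codeword length
ALPHA = 0b0010                # the element x

def gf_mul(a, b):
    """Multiply two field elements (shift-and-add with reduction)."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a & (1 << M):
            a ^= POLY
    return r

def gf_pow(a, e):
    """Raise a to a non-negative integer power by repeated multiplication."""
    r = 1
    for _ in range(e):
        r = gf_mul(r, a)
    return r

def chien_search(coeffs):
    """coeffs[j] is the coefficient of x^j in the error locator polynomial."""
    regs = list(coeffs)
    steps = [gf_pow(ALPHA, j) for j in range(len(regs))]  # per-register constant
    locations = []
    for cycle in range(N):
        total = 0
        for reg in regs:
            total ^= reg            # finite field addition is XOR
        if total == 0:
            # Root found at alpha^cycle; assumed mapping to a bit index.
            locations.append((N - cycle) % N)
        # Register j is multiplied by alpha^j every cycle.
        regs = [gf_mul(reg, step) for reg, step in zip(regs, steps)]
    return sorted(locations)

found = chien_search([1, 15, 13])   # locator for errors at bits 3 and 10
```

A bit-parallel variant would evaluate several consecutive values of `cycle` per clock, which is where the duplicated multipliers and the associated area cost come from.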

# Chapter 3

### **RELATED WORKS**

Optimizing BCH decoders has generally followed two sometimes complementary and sometimes conflicting paths. These paths are to increase the throughput of the decoder and to increase the efficiency of the decoder. Here we examine the current state of the art and related research in those two areas.

## 3.1 Improving Throughput

Although increasing clock rate leads directly to an increase in throughput, there is a limit due to the complexity involved in the decoder. There are two other methods of increasing the throughput: implementing bit parallel operation in the syndrome calculation and root finding, and implementing multiple BCH decoders that operate in parallel within a system.

Bit parallel operation is a straightforward implementation and typically requires few modifications to an overall system to implement. However, as bit parallel operation increases the complexity of the decoder, it decreases the achievable clock rate and thus has limits. Additionally, bit parallel operation cannot be applied to generating the error locator polynomial, and thus the overall throughput of the system will come to be limited by this step.

Implementing multiple BCH channels bypasses these problems as it is simply a duplication of the BCH engine. Multiple channels require modification of the overall system to implement and can be made in two primary situations.

The first is the case of a multi-channel architecture, for example a system that has multiple data channels connected to flash memory (Abraham et al. 2010).

The second is to interleave the BCH code. Interleaving not only leads to increased throughput, but also offers error correction advantages in certain types of channels (K. Lee et al. 2010). This is because in many types of channels, errors tend to occur in bursts. With interleaved operation the burst is broken up across many codewords, decreasing the probability that a single burst will overwhelm the error capability of the chosen BCH code (Shi et al. 2004).
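The effect of interleaving on a burst can be seen with a small Python sketch. It assumes a simple round-robin interleave in which bit `i` of the channel stream belongs to codeword `i % depth`. The function name and this mapping are assumptions for illustration, not a scheme taken from the cited works.

```python
def errors_per_codeword(burst_start, burst_len, depth):
    """Count how many bits of a contiguous burst land in each interleaved codeword."""
    counts = [0] * depth
    for pos in range(burst_start, burst_start + burst_len):
        counts[pos % depth] += 1    # assumed round-robin bit assignment
    return counts

no_interleave = errors_per_codeword(burst_start=5, burst_len=8, depth=1)
four_way = errors_per_codeword(burst_start=5, burst_len=8, depth=4)
```

Without interleaving, the single codeword must absorb the entire burst. With a depth of four, each codeword sees only a quarter of it, so a weaker code per codeword can still correct the burst.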

Both methods of multi-channel operation scale each property of the system (throughput, area, power) in a purely linear fashion.

## 3.2 Improving Efficiency

Improving the efficiency of each stage of decoding can lead to lower area requirements, lower power consumption, and increased clock speeds leading to higher throughput. As such, many ideas have been put forth to improve the efficiency of each BCH decoding stage.

For instance, it has been shown that a relation exists between many of the syndromes (Lin and Costello 1983, p152). This makes it possible to only calculate a limited set of syndromes, and then apply the relations to expand them into the full set of syndromes. This decreases the overall area and power requirements of the decoder.
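For binary BCH codes, one well-known relation of this kind is $S_{2i} = S_i^2$, so even-numbered syndromes can be derived by squaring instead of being evaluated directly. The Python sketch below checks this on GF(2^4). The field, its primitive polynomial $x^4 + x + 1$, and the helper names are assumptions made for the example.

```python
# Check S_2i = S_i^2 for a binary received word over GF(2^4); illustrative only.
POLY, M, N = 0b10011, 4, 15   # assumed field and codeword length
ALPHA = 0b0010                # the element x

def gf_mul(a, b):
    """Multiply two field elements (shift-and-add with reduction)."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a & (1 << M):
            a ^= POLY
    return r

def gf_pow(a, e):
    """Raise a to a non-negative integer power by repeated multiplication."""
    r = 1
    for _ in range(e):
        r = gf_mul(r, a)
    return r

def syndrome(bits, j):
    """S_j = r(alpha^j), where bits[i] is the coefficient of x^i."""
    s = 0
    for i, bit in enumerate(bits):
        if bit:
            s ^= gf_pow(ALPHA, (i * j) % N)
    return s

received = [0] * N
received[3] = received[10] = 1      # all-zero codeword with two bit errors
s1 = syndrome(received, 1)
s2 = gf_mul(s1, s1)                 # derived from S1 instead of evaluated
s4 = gf_mul(s2, s2)                 # derived from S2
```

Only the odd-numbered syndromes then need dedicated evaluation logic, while the even-numbered ones cost one squaring each.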

Additionally, it has been shown that there are multiple methods of finding each syndrome element (p165). For a given element, it can be shown which method is the most efficient. This information can then be used to calculate each syndrome in the most efficient way possible. This not only decreases the overall area and power requirements of the decoder, but because it decreases complexity, can also increase clock speeds and throughput.

Work has also gone into decreasing the complexity of bit-parallel LFSRs. This work can be and has been applied to bit-parallel syndrome calculation.

As the step of generating the error locator polynomial can limit the overall throughput of the decoder, improving its efficiency, increasing the achievable clock rate, and decreasing the overall number of clock cycles required is important. General optimizations to finite field operations, such as more efficient multipliers and dividers, can be applied to generating the error locator polynomial.

Jamro (1997) has shown how linking multipliers which operate on different bases can lead to a reduction in the number of clock cycles required. This is done by linking a serial multiplier that takes parallel input and produces serial output with a multiplier that takes serial input and produces parallel output. However, as these two multipliers operate on different bases, an efficient basis conversion circuit linking the two multipliers is shown. Additionally, Jamro shows how the first two rounds of the algorithm can be skipped by precalculating the necessary state of the registers. Both of these optimizations reduce the latency of generating the error locator polynomial. The reduced latency allows the decoder to run at a higher overall throughput.

The Chien search requires a number of multipliers equal to the number of coefficients in the error locator polynomial (Chen and Parhi 2004). Additionally, bit parallel operation requires a duplication of this set of multipliers for each output bit as well as a multiplier to load each coefficient with the appropriate value.

Because of this high cost in complexity and area, two complementary methods have been put forth for improvement. The first is to combine the multiple parallel Chien operations together rather than considering them separately. Several multipliers are linked together serially, and the intermediate stages are summed for each output bit. While this decreases complexity, it greatly increases the critical path of the unit, decreasing possible clock rates. The second is a group matching scheme that has been applied to this structure to reduce complexity and the critical path. The scheme exploits the substructure sharing within a multiplier and among groups of multipliers (Chen and Parhi 2004).

# Chapter 4

### MAIN OBSERVATIONS

In order to push the uncorrectable error rate very low, BCH decoders are greatly oversized compared to the number of errors they typically correct. The common case is for only a fraction of the decoder to be used. This is shown clearly in figure 5.



Figure 5. Probabilities of errors at BER of  $1 \times 10^{-4}$ 

At the error rate of  $1 \times 10^{-4}$ , the decoder is required to correct up to 10 bit errors in order to push the uncorrectable bit error rate below  $1 \times 10^{-15}$ . However, the probability that any errors occur in a block of 4096 bits is less than one in three. This means that in a multichannel decoder, on average only a third of the decoding hardware is required. Moving beyond that, the probability that the entire error correcting capability of a single decoder will be required is exceedingly small, around 1 in 30 billion.
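The per-block probabilities discussed above can be explored with a short Python sketch. It assumes independent bit errors, so that the error count in a block follows a binomial distribution, and it takes the block length as 4096 bits without parity. Both are simplifying assumptions, and the function name `p_exact` is hypothetical, so the values may differ slightly from those plotted in figure 5.

```python
from math import comb

def p_exact(k, n, ber):
    """Probability of exactly k bit errors in n bits, assuming independent errors."""
    return comb(n, k) * ber**k * (1.0 - ber) ** (n - k)

n, ber, t = 4096, 1e-4, 10
p_clean = p_exact(0, n, ber)    # block needs no correction at all
p_single = p_exact(1, n, ber)   # block could be served by a single-error solver
p_full = p_exact(t, n, ber)     # block needs the full correction capability
```

Comparing `p_clean` and `p_single` against `p_full` shows how heavily the distribution is weighted toward the zero-error and one-error cases that the rest of this chapter targets.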

This observation alone does not allow us any improvement because at any time the full decoder may be required. I instead observe that on average only a small percentage of the decoder is required and then apply that observation to a multi-channel decoder. In a multi-channel decoder, at least one full BCH decoder must always be included. The remainder of the decoding hardware can be reduced decoders of some kind. These reduced decoders can reduce overall hardware requirements greatly.

To route data to the correct decoding block, the number of errors contained within a block must be considered. The result of the syndrome calculation can be used to determine if a block has any errors. All blocks must then at a minimum be passed through the syndrome calculation block. If the syndromes do evaluate to zero, then no further processing is necessary for that block.

To calculate the number of errors beyond zero, the error locator polynomial must be solved. Any reduction in the complexity of the decoder beyond zero errors must then be in the root search. The case of only one error is a very common case and a good target to optimize for. The optimization here is fairly straightforward as the error locator polynomial will only be of degree one in this case. Rather than a brute force search, the root can be found algebraically.

The trade-off with such a system is that there is a possibility that insufficient resources will be available to decode a certain set of blocks. If this occurs, decoding will be delayed until resources are available and performance will be degraded. Fortunately, it is fairly straightforward to calculate this performance drop and thus intelligently trade off a small drop in performance for a large reduction in hardware requirements.

# Chapter 5

### MY APPROACH

This section reviews my methods of acting on my observations. I first lay out the design of the decoder architecture. The decoder architecture is designed as pools of hardware blocks. This allows the pools to be sized appropriately and data to be assigned to units in each pool as they become ready. The design of a reduced root solver for blocks with only one error is also shown. Second, I show how the correct number of units can be chosen in order to meet a target miss rate.

## 5.1 Architecture

The basic design of a BCH decoder is broken down into three pipeline stages. For my multi-channel architecture, I implement those stages as stations fed by round robin arbitrators. The arbitrator collects data from each stage and then passes it to the next. The general layout of the decoder is shown in figure 6. In the example configuration, there are 3 error polynomial generator units  $(\Sigma)$ , one traditional Chien solver (C) and two reduced root solvers (c).

The overall architecture can be configured with the following compile time parameters:

- Number of channels.
- Number of error locator polynomial generators.
- Number of traditional Chien search units.
- Number of reduced root solver units.



Figure 6. An example of the proposed BCH decoder

The parameters must be chosen based on the allowed miss rate.

### 5.1.1 Syndromes

For every channel, the syndromes must be computed. This means that the number of syndrome units will be equal to the number of channels. I fix each syndrome unit to a channel and each unit contains a bit counter. The counter will be used to track how many bits the unit has received and if the syndrome is ready.

On the input side, the syndrome unit contains two control signals. An input to indicate that it should start accepting syndrome data, and an output that acknowledges that signal. If the unit is busy or contains processed syndrome data, it will not acknowledge the start signal.

On the output side, the syndrome unit contains an additional two control signals. One signal indicates that the syndrome unit contains processed syndrome data. The other control signal is an input that clears this state and allows the unit to accept new data.

Each unit can be configured with the following compile time parameters:

- Bit width.
- Code block size and number of correctable errors.
- Additional pipeline stages to meet timing.
- Additional register duplication to meet timing.

### 5.1.2 Syndrome/Error Locator Polynomial Interconnect

This interconnect passes data from the channel syndrome units to the pool of error locator polynomial generators. The unit primarily consists of a register to hold the syndromes, an index to the current syndrome input unit, and an index to the current error locator polynomial unit. Both indexes operate in a purely round robin fashion. The unit also contains circuitry to check its currently stored syndrome against zero. It determines if it is necessary to pass the syndrome data to the error locator polynomial unit or if it can be skipped.

The general operation is to wait on the currently indexed syndrome unit. When a syndrome is ready, it accepts the syndrome and stores it in its syndrome register. It also stores the index to associate the data with a channel. It then waits for the syndrome to be compared against zero. If the check indicates no errors are present, it sets a flag indicating that the current channel output should skip root finding for the next data set.

If the check indicates errors are present, it waits for the next error locator polynomial generator unit to become ready. When ready, it passes its syndromes to that unit and sets the start bit for that unit. It also passes the currently stored channel number so that the error locator polynomial will be associated with the correct channel.

### 5.1.3 Error Locator Polynomial Generator

If any error exists within the codeword, the error locator polynomial must be found. The control signals on this unit are similar to the control signals on the syndrome unit: a start and start acknowledge signal on the input, and a signal to indicate done state and a signal to clear the done state on the output.

The output of the error locator polynomial generator unit includes the error locator polynomial and also the number of errors detected within the codeword. The only configuration available to the error locator polynomial generator is the set of BCH code parameters.

The general architecture of the unit follows that presented by Jamro (1997) but overcomes two shortcomings. First it expands the basis conversion to support all pentanomials, not just the single case supported by Jamro. This allows a wider range of BCH block sizes to be tested. Secondly, the Jamro decoder requires a code generation step, and then the compilation of that code. My solution is compile time configurable requiring no code generation step. This allowed me to more easily debug timing issues and also shorten the overall development and experimental gathering cycles.

The only compile time parameters for this unit are the BCH code parameters.

### 5.1.4 Error Locator Polynomial/Root Solver Interconnect

This interconnect is similar to the syndrome interconnect except that it must serve two possible pools. The first destination pool consists of traditional Chien root solvers and the second destination pool consists of reduced root solvers. When the currently selected error locator polynomial is ready, the interconnect stores the error locator polynomial, the error count, and the associated channel number.

The interconnect must then determine based on the error count which pool to serve. It keeps two separate indexing counters, one for each pool. If the error count is 1 then the reduced root solver pool is used, otherwise the traditional Chien pool is used.

When the appropriate root solver is ready, the interconnect signals it to start and passes the error locator polynomial along with the associated channel number.
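The pool selection described in this subsection can be modelled behaviourally in Python. This is only a sketch: the class `Pool`, the function `route`, and the fallback to the Chien pool when no reduced solvers are configured are assumptions made for illustration, not a description of the Verilog implementation.

```python
# Behavioural sketch of the pool selection; illustrative only.
class Pool:
    """A pool of identical units served in strict round-robin order."""

    def __init__(self, name, size):
        self.name = name
        self.busy = [False] * size
        self.index = 0              # per-pool round-robin counter

    def try_start(self):
        """Start the currently indexed unit if idle; return its index or None."""
        if not self.busy or self.busy[self.index]:
            return None             # caller must wait and retry later
        unit = self.index
        self.busy[unit] = True
        self.index = (unit + 1) % len(self.busy)
        return unit

    def finish(self, unit):
        """Mark a unit as idle again once its block has been output."""
        self.busy[unit] = False

def route(error_count, chien_pool, reduced_pool):
    """Pick the destination pool from the error count and try to start a unit."""
    # Assumption: with no reduced solvers configured, single errors use the Chien pool.
    use_reduced = error_count == 1 and len(reduced_pool.busy) > 0
    pool = reduced_pool if use_reduced else chien_pool
    return pool.name, pool.try_start()

chien = Pool("chien", 1)
reduced = Pool("reduced", 2)
first = route(1, chien, reduced)    # single error: reduced pool
second = route(4, chien, reduced)   # multiple errors: traditional Chien pool
```

A `None` unit index corresponds to the interconnect waiting for the indexed solver to become ready, which is the situation the miss rate in section 5.2 quantifies.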

### 5.1.5 Traditional Chien Root Solver

The traditional Chien root solver units consist of a set of coefficient registers. Each register is wide enough to contain a finite field element from the given BCH configuration. The number of registers required is equal to the maximum number of errors that the code can correct. The registers are each multiplied by the appropriate degree of x each cycle and each cycle all registers are summed together. If the sum is zero, then an error has been located. This operation is duplicated for bit parallel operation, with the number of bits shared per register being configurable in order to meet timing. Additionally, the summing operation provides an opportunity for a configurable amount of pipelining.

The unit contains a start signal that is used to load new values in the coefficient registers, starting the algorithm. Due to the pipelined nature of the summation operation, an output signal is provided that indicates that the first bit (or set of bits) of errors is being output on the current cycle.

The glue logic surrounding the root solver contains a multiplexor that connects to the busy signal of the output stages. The output stage then counts the number of cycles necessary for the algorithm to complete.

Each unit can be configured with the following compile time parameters:

- Bit width.
- Code block size and number of correctable errors.
- Additional pipeline stages to meet timing.
- Additional register duplication to meet timing.

### 5.1.6 Reduced Root Solver

The reduced root solver can be used to find the error location for codewords with a single bit error. It offers large advantages over the traditional Chien search since it only requires a single register. It is also more efficient in the multi-bit case, since for each bit the register is simply compared against a constant.

If only one error exists in a codeword, the error locator polynomial is of degree 1 and of the form:

$$Ax + B = 0 \tag{5.1}$$

Which can be solved in a single step as:

$$x = -B/A \tag{5.2}$$

Because of the algorithm used to find the error locator polynomial, B is always 1. Additionally, negation is a null operation within finite fields. This reduces the equation further to the form:

$$x = 1/A \tag{5.3}$$

Although implementing an inverter would produce the value of x in a single cycle, the value would be of little use on its own. This is because the value is in the standard basis for the finite field and not the power form. The power form would give us a direct integer index to the location of the error. The binary representation of the sequencing of the standard basis (polynomial form) can be seen in table 1.

Converting from the standard basis to the power form is an algorithmically complex operation. It is generally on the order of O(N), where N is the number of elements in the field. Rather than attempt this conversion, I make two observations.

My first observation is that we need to cycle through each bit in the codeword in order to output error locations regardless of how the solver functions. My second observation is summed up by the following re-arrangement:

$$Ax = 1 \tag{5.4}$$

If we load a register with A and multiply it repeatedly by  $x^1$ , it will eventually reach the value of 1. Once it has, we have multiplied A by the correct power of x and found the root. Because we are only multiplying by  $x^1$  per cycle, we can use an LFSR instead of a multiplier.

To start, we load the LFSR with the value of *A*. Then during each cycle, we advance the LFSR and compare the value with 1. If they match we have found the location of the root.

Expanding this to support multiple bits scales very well. We advance the LFSR a number of cycles equal to the number of bits instead of just once. For each output bit, we compare the value in the LFSR with the next value in the finite field starting with 1 for the first bit.
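The LFSR-based search can be sketched in Python as follows. The field GF(2^4) with primitive polynomial $x^4 + x + 1$, the function names, and the lane constants used in the multi-bit variant are assumptions for this sketch. In particular, the constants here are chosen for this sketch's cycle ordering, and a hardware design may order its output bits differently.

```python
# Reduced root solver sketch over GF(2^4); illustrative only.
POLY, M, N = 0b10011, 4, 15   # assumed field and codeword length

def lfsr_step(v):
    """Advance the LFSR once, i.e. multiply the field element by x."""
    v <<= 1
    if v & (1 << M):
        v ^= POLY
    return v

def solve_serial(a):
    """Return the cycle k at which a * x^k == 1, or None if a is 0."""
    reg = a
    for cycle in range(N):
        if reg == 1:
            return cycle
        reg = lfsr_step(reg)
    return None

def solve_parallel(a, width):
    """Same search, testing `width` candidate cycles per clock."""
    powers = [1]
    for _ in range(N - 1):
        powers.append(lfsr_step(powers[-1]))
    # Lane j compares against x^(-j): a match means a * x^(clock*width + j) == 1.
    consts = [powers[(N - j) % N] for j in range(width)]
    reg = a
    for clock in range(-(-N // width)):     # ceiling division
        for lane, const in enumerate(consts):
            if reg == const:
                return clock * width + lane
        for _ in range(width):              # skip ahead by `width` cycles
            reg = lfsr_step(reg)
    return None

k = solve_serial(0b1000)      # A = x^3
k4 = solve_parallel(0b1000, 4)
```

In the multi-bit variant each lane needs only an equality comparison against a fixed constant, so widening the solver adds comparators rather than additional finite field multipliers.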

### 5.1.7 Output Units

The output units multiplex the data from the root solvers and output it from the decoder. Each channel has an associated output unit. The output units provide the data indicating which bits are in error as well as a signal to indicate the start of a new block. Within each output unit is a counter to keep track of when the output for the given block is complete and the next block can be processed.

The output units are driven by two flags. One flag indicates that the output unit should expect data from a root solver, the other flag indicates that the output unit should output one block's worth of error free data. Whenever the output unit completes its current block, it examines these flags to determine what it should output next.

Whenever the flag indicating that data from a root solver should be processed is set, the associated index of that solver is stored as well. This allows the output unit to assign its multiplexor to accept data from the appropriate solver.

## 5.2 Determining the Number of Units

Part of the design is to select the appropriate number of each unit type. The number of units included in a given design is determined by the expected error rate and the acceptable miss rate. The miss rate indicates the likelihood that within any given set of blocks, there would be insufficient hardware to process the data. In this case the affected input channel is stalled and the decoding of that block is deferred until hardware is available. This causes a performance drop in decoding that is equal to the chosen miss rate.

The number of units required is decided in two stages. The first stage is the error locator polynomial generator units. Units are only required for blocks with one or more errors. Therefore the number of units is chosen based on the probability that more than m blocks contain one or more errors. We start by using eq. 2.2 to determine the probability that a single block contains one or more errors. Then we plug this probability into eq. 2.8 and choose the message size n to be equal to the number of channels. By evaluating this equation for different values of m, we can find the number of blocks required to be below the miss rate probability.

The result of evaluating this equation for the chosen set of BCH parameters and an acceptable miss rate of 2% is shown in figure 7.



Figure 7. Probability that more than m blocks contain at least one error where n=8

Figure 7 shows that for a BER of  $5 \times 10^{-6}$ , only 1 unit is required with a 2% miss rate. For a BER of  $1 \times 10^{-4}$ , 5 units are required.

The next step determines the number of traditional Chien search units required. This is calculated similarly to the above, but we examine the probability that more than one error occurs since the reduced root solver can only handle one error. We first use eq. 2.8 to find the probability that more than one error occurs within a single block. We then use eq. 2.8 again, this time with the message size set to the number of channels and p set to the value found above.
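The two sizing steps above can be sketched in Python. Because eq. 2.2 and eq. 2.8 are not reproduced in this chapter, the sketch assumes both reduce to binomial probabilities with independent bit errors and takes the block length as the data bits only. All function names are hypothetical. Under these assumptions, borderline cases may land on a different unit count than figures 7 and 8.

```python
from math import comb

def pmf(k, n, p):
    """Probability of exactly k events in n independent trials."""
    return comb(n, k) * p**k * (1.0 - p) ** (n - k)

def tail(n, p, m):
    """P(X > m) for X ~ Binomial(n, p), via the complement to keep terms small."""
    return max(0.0, 1.0 - sum(pmf(k, n, p) for k in range(m + 1)))

def units_needed(p_block, channels, miss_rate):
    """Smallest m such that P(more than m blocks need a unit) <= miss_rate."""
    for m in range(channels + 1):
        if tail(channels, p_block, m) <= miss_rate:
            return m
    return channels

def size_pools(ber, block_bits, channels, miss_rate):
    """Return (error locator units, traditional Chien units, reduced root solvers)."""
    p_any = tail(block_bits, ber, 0)      # block has one or more errors
    p_multi = tail(block_bits, ber, 1)    # block has more than one error
    # Assumption: at least one locator is kept so the mandatory Chien unit is usable.
    locators = max(1, units_needed(p_any, channels, miss_rate))
    chien = max(1, units_needed(p_multi, channels, miss_rate))
    reduced = max(0, locators - chien)    # remaining slots use reduced solvers
    return locators, chien, reduced

locators, chien, reduced = size_pools(ber=5e-6, block_bits=4096,
                                      channels=8, miss_rate=0.02)
```

Sweeping `miss_rate` for a fixed BER with this function gives a quick view of how the pool sizes respond to the allowed performance penalty.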

The result of evaluating this equation across multiple values of m is shown in figure 8.



Figure 8. Probability that more than m blocks contain more than one error where n=8

Figure 8 shows that for a BER of  $1\times10^{-4}$ , 2 units are required. For all other examined error rates only 1 unit is required. The remaining units are filled in with reduced root solvers. Note that for any decoder at least one traditional Chien solver is required.

A miss rate of 2% is chosen for my experiments as it is a very small performance penalty, but still large enough for smaller unit counts to be used. In order to demonstrate the variability of units required for a given miss rate, the BER of  $2\times10^{-4}$  is examined. This BER provides a wide range of required units across a set of given miss rates.



Figure 9. Units required for BER of  $2 \times 10^{-4}$ 

As shown in Figure 9, the gain seen for a given miss rate falls off quickly beyond 2%. Although 2% was chosen for the experiments in this thesis, the additional gains achieved through much higher miss rates may still be desirable on extremely constrained designs.

# Chapter 6

#### **EXPERIMENTS**

## 6.1 Setup

In order to test my ideas and approach, I have implemented them on a Field Programmable Gate Array (FPGA) in Verilog. A Xilinx Virtex-6 FPGA has been chosen as a target as it has sufficient logic resources and Input/Outputs (IOs) for implementing the necessary experiments.

The Verilog code is written to be configured through Verilog parameters. This allows the properties of the decoder to be configured at compile time. The build tools can then compile and verify a variety of configurations in a batch form without modification to the codebase.

Validation of the design is performed with a series of testbenches. This verifies the correctness of the compiled code. The testbench operates by generating a stream of random input data as well as random bit errors. In order to find problems with the code sooner, the number of bit errors at or below the capability of the decoder are selected equally. The input data is fed to a BCH encoder and then bits are flipped in accordance with the generated error locations. The modified data is then passed to the BCH decoder and the error locations output are compared with the true error locations.
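The stimulus side of such a testbench can be sketched in Python. This is not the Verilog testbench used for the experiments. The helper names, the fixed seed, and the use of random bits as a stand-in for the encoder output are assumptions for illustration, and the BCH encoder and decoder themselves are omitted.

```python
import random

def inject_errors(codeword, t, rng):
    """Flip a uniformly chosen number of bits (0..t) at distinct random positions."""
    count = rng.randint(0, t)       # every error count up to t is equally likely
    positions = sorted(rng.sample(range(len(codeword)), count))
    received = list(codeword)
    for pos in positions:
        received[pos] ^= 1
    return received, positions

def check(decoded_positions, true_positions):
    """Compare the decoder's reported error locations with the injected ones."""
    return sorted(decoded_positions) == sorted(true_positions)

rng = random.Random(2015)           # fixed seed so a failure can be replayed
codeword = [rng.randint(0, 1) for _ in range(64)]   # stand-in for encoder output
received, truth = inject_errors(codeword, t=5, rng=rng)
```

Drawing the error count uniformly, rather than from the channel's actual distribution, exercises the rare high-error cases far more often than a realistic BER would.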

The area of a given design is calculated by implementing the design fully. All inputs and outputs of the design are assigned registers as would be done in a system design to meet IO timing. As the configurability of the design leads to a wide range of IO configurations, the tool is permitted to automatically assign IO locations. The comparative area of a design is then measured by FPGA slice usage.

Power estimation is performed using the Xilinx XPower Analyzer. Since the static power consumption of an FPGA does not vary significantly based on logic usage, dynamic power consumption is compared.

In order to ensure a fair comparison, all designs are constrained to run at at least  $200 \, \mathrm{MHz}$ . This ensures that complex designs will pay an area penalty as the tool will duplicate registers to meet timing.

## 6.2 Baseline Configuration

The baseline configuration is an 8 channel decoder. Not only do many systems contain a similar number of channels, but it also allows me to fully demonstrate the advantages of my approach.

Each channel is 4 bits wide. Most flash memory systems operate in an 8 bit wide configuration, but a 4 bit wide configuration was chosen for two reasons. First to allow the design to have headroom for demonstrating the increase in throughput possible in the optimized design. Second, many decoders operate at a higher clock rate than the data bus. For a decoder operating at double the clock rate of an 8 bit data bus, 4 bit wide operation would be required.

The baseline decoder operates on 4096 bit, or 512 B blocks. This is a typical block size for the error rates examined in this research (Cooke, Berrett, and Schulthies 2006; Cooke 2011). Similar results should be obtainable across a wide range of possible block sizes.

Flash manufacturers typically do not publish BER values for released flash memory. They instead specify the error correction strength required to reduce the error rate below an acceptable threshold, typically  $1 \times 10^{-15}$  (Cai et al. 2012). Knowing the error correction strength required, the block size, and the targeted uncorrectable error rate, we can work backwards to estimate the associated BER. The values chosen are shown in table 2.

Table 2. Targeted ECC range

| Strength (errors) | Estimated BER      | Bits of ECC required |
|-------------------|--------------------|----------------------|
| 5                 | $5 \times 10^{-6}$ | 65                   |
| 7                 | $2 \times 10^{-5}$ | 91                   |
| 8                 | $5 \times 10^{-5}$ | 104                  |
| 10                | $1 \times 10^{-4}$ | 130                  |

## 6.3 Area Optimized BCH Decoder

The area optimized BCH decoder reduces the hardware area while impacting performance by only 2%. The reduced number of units required is shown in table 3. Although 8 syndrome calculation blocks are always required, the number of error locators and traditional Chien search blocks decreases as BER decreases to meet the given error rate with a 2% miss rate.

Table 3. Hardware units required for area optimized decoder

| BER                | Syndrome | Error Locator | Traditional Chien | Reduced Root |
|--------------------|----------|---------------|-------------------|--------------|
| $5 \times 10^{-6}$ | 8        | 1             | 1                 | 0            |
| $2 \times 10^{-5}$ | 8        | 3             | 1                 | 2            |
| $5 \times 10^{-5}$ | 8        | 4             | 1                 | 3            |
| $1 \times 10^{-4}$ | 8        | 5             | 2                 | 3            |

The area is then compared with the baseline decoder and the results are shown in Figure 10. Note that the area includes all hardware components such as arbitrators to build the BCH decoders which are not required in the baseline implementation.

By optimizing the number of units and utilizing the reduced root solver my design can reduce required area by 47%–71% compared to the baseline implementation.



Figure 10. Area saving results

The smaller area of the decoder also translates to dynamic power savings. By profiling the designs I can estimate the power consumed by each design. The results are shown in figure 11. This equates to a 44%–59% reduction in dynamic power requirements.

## 6.4 Throughput Optimized BCH Decoder

While the proposed area optimized BCH decoder sacrifices a small amount of performance to reduce the required hardware area, it is possible to devise a throughput optimized BCH decoder while holding area constant to improve the performance. The optimization is achieved by increasing the bit parallel configuration parameter until a maximum throughput is found at the same area cost as the baseline configuration.



Figure 11. Power saving results



Figure 12. Requirements of 2 $\times 10^{-5}$  design

Figure 12 shows the process as applied to the  $2\times10^{-5}$  BER configuration. The area consumed by the baseline unoptimized design is shown by the red line. The discontinuity in results is due to an additional level of hardware duplication in the Chien search when moving to a 20 bit wide unit to meet timing.

While the unoptimized design consumes an area of 3168 slices, the optimized design consumes an area of only 1272 slices. Both designs accept 4 bits per cycle in the syndrome calculation stage, and output 4 bits per cycle in their output stage. I then implement the optimized design at 8 bits, 16 bits, 18 bits, and 20 bits per cycle. These designs increase throughput by operating on more input and output bits per clock cycle. We can see that the optimized design operating at 18 bits per cycle only consumes 2874 slices, which is less than the unoptimized design operating at only 4 bits per cycle.

Thus it is possible to implement an 18-bit design within the same area, leading to a 4.5x improvement in performance. Note that there is 2% performance degradation due to the miss rate, which is negligible compared with the performance gain of 450%. Similar improvements in performance are possible with the other configurations and are shown in figure 13. The amount of performance improvement is related to the area savings provided by the optimized decoder.
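As a worked check of these figures, assuming the miss-rate penalty simply scales throughput by $(1 - \text{miss rate})$:

$$\frac{18\ \text{bits/cycle}}{4\ \text{bits/cycle}} = 4.5, \qquad 4.5 \times (1 - 0.02) = 4.41$$

Under this assumption the net gain for the  $2\times10^{-5}$  configuration stays above 4.4x after the miss-rate penalty is applied.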



Figure 13. Throughput optimization results

## 6.5 Flash Lifetime Optimized Design

Similarly, the area reduction can be utilized to increase the lifetime by providing higher error correction strength. To provide stronger error correction, a larger hardware area is required. The area reduction in my approach is utilized to provide greater error correction strength in a smaller area. For an 8 channel unit and a 2% targeted miss rate, the hardware area requirement in my approach for a BER of  $1\times10^{-4}$  becomes similar to the area in the baseline approach for a BER of  $5\times10^{-6}$ . Table 4 shows the units required in my approach for different given BERs.

Table 4. Hardware units required for lifetime optimized design

| BER                  | Syndrome | Error Locator | Traditional Chien | Reduced Root |
|----------------------|----------|---------------|-------------------|--------------|
| $1.2 \times 10^{-4}$ | 8        | 6             | 3                 | 3            |
| $1.5 \times 10^{-4}$ | 8        | 7             | 3                 | 4            |
| $2.0 \times 10^{-4}$ | 8        | 7             | 4                 | 3            |

Table 5 compares the error correction capability between the baseline approach and the proposed optimization with a given hardware area constraint. For instance, for the hardware area with which the baseline approach can handle a BER of  $5\times10^{-6}$ , the proposed approach can handle a BER of  $1\times10^{-4}$ . Note that in the table, my approach requires no larger hardware area than the baseline approach. In addition to increased error correction capability, the implementation includes additional hardware units to meet the 2% miss rate. Therefore, my approach can correct more errors than the baseline approach without sacrificing performance, hardware area, or power consumption.

Table 5. BER achievable with lifetime optimized design

| Original BER       | Original $t$ | Optimized BER        | Optimized $t$ |
|--------------------|--------------|----------------------|---------------|
| $5 \times 10^{-6}$ | 5            | $1.0 \times 10^{-4}$ | 10            |
| $2 \times 10^{-5}$ | 7            | $1.2 \times 10^{-4}$ | 11            |
| $5 \times 10^{-5}$ | 8            | $1.5 \times 10^{-4}$ | 12            |
| $1 \times 10^{-4}$ | 10           | $2.0 \times 10^{-4}$ | 13            |

Equation 2.11 shows the relation between BER and ageing. Since the proposed scheme can correct more errors, allowing a decoder targeted for a higher BER, the lifetime of the same NAND flash memory is prolonged compared with the baseline implementation. Figure 14 shows the lifetime improvement over the baseline BCH decoder. As the BER decreases, more hardware reduction is achievable and more errors can be corrected by utilizing the reduced area. The flash lifetime is extended by 1.4x-4.5x.



Figure 14. Improved lifetime
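The link between the correction strength $t$ and the tolerable BER in Table 5 can be illustrated with a binomial tail: a codeword is uncorrectable when more than $t$ of its bits are in error. The sketch below assumes independent bit errors and a hypothetical codeword length `N_BITS`; neither assumption is taken from the thesis, so the printed values are only indicative of the trend.

```python
import math


def uncorrectable_prob(n_bits, ber, t):
    """P(more than t of n_bits are in error), assuming independent bit errors."""
    if not 0.0 < ber < 1.0:
        raise ValueError("ber must be strictly between 0 and 1")
    log_p = math.log(ber)
    log_q = math.log1p(-ber)
    total = 0.0
    # Sum the binomial tail directly in log space to keep tiny terms accurate.
    for i in range(t + 1, n_bits + 1):
        log_term = (
            math.lgamma(n_bits + 1)
            - math.lgamma(i + 1)
            - math.lgamma(n_bits - i + 1)
            + i * log_p
            + (n_bits - i) * log_q
        )
        total += math.exp(log_term)
    return min(total, 1.0)


N_BITS = 8192  # hypothetical codeword length, chosen only for illustration

# (BER, t) pairs from the first row of Table 5: baseline and optimized design.
for ber, t in ((5e-6, 5), (1e-4, 10)):
    print(f"BER={ber:g}, t={t}: {uncorrectable_prob(N_BITS, ber, t):.3e}")
```

For a fixed BER, raising $t$ lowers the probability of an uncorrectable codeword, which is why a stronger decoder can keep the same flash in service at the higher error rates that come with wear.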

### Chapter 7

#### CONCLUSION AND FUTURE WORK

My research goal was to improve the efficiency of ECC systems by concentrating on multi-channel BCH architectures. In this thesis I have presented a novel multi-channel BCH decoder optimization that reduces the hardware area requirement by considering a common error case. The proposed scheme utilizes a pooled group of shared decoding blocks. Compared with a traditional multi-channel implementation, it reduces the hardware area by 47%-71%. The area reduction also lowers dynamic power consumption by 44%-59%. If the saved area is instead used to increase performance, throughput improves by 3x-5x; if it is used to correct more errors, the lifetime of NAND flash increases by 1.4x-4.5x.

The approach does increase the complexity of the decoder by adding arbitrators between pipeline stages, and it incurs a small performance penalty when a miss occurs. However, the massive area savings provided are an excellent trade-off. Additionally, I have shown how the area savings can instead be used to increase overall performance.

Although I have already achieved significant gains, additional work could lead to further improvements across a wider range of bit error rates. The most straightforward extension of my work is to create reduced Chien solver units. These reduced units would function identically to traditional Chien search units but would support fewer coefficients. This would not offer the large savings seen with the reduced root solver, but would be fully configurable, leading to applicability across a wider BER range.
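To make the idea of a coefficient-limited Chien unit concrete, the following is a minimal software sketch, not the hardware design proposed here. It runs a register-style Chien search over the small field GF(16) (primitive polynomial $x^4 + x + 1$, chosen only to keep the example short), and the `max_coeffs` parameter is a hypothetical stand-in for a reduced unit that rejects error locator polynomials with too many coefficients.

```python
# GF(16) exp/log tables for the primitive polynomial x^4 + x + 1 (0x13).
EXP = [0] * 30
LOG = [0] * 16
_x = 1
for _i in range(15):
    EXP[_i] = _x
    LOG[_x] = _i
    _x <<= 1
    if _x & 0x10:
        _x ^= 0x13
for _i in range(15, 30):
    EXP[_i] = EXP[_i - 15]


def gf_mul(a, b):
    """Multiply two GF(16) elements using the log/antilog tables."""
    return 0 if a == 0 or b == 0 else EXP[LOG[a] + LOG[b]]


def locator_from_positions(positions):
    """Build the error locator: product of (1 + alpha^p * z) over error positions p."""
    poly = [1]
    for p in positions:
        x = EXP[p % 15]
        new = poly + [0]
        for k in range(len(poly)):
            new[k + 1] ^= gf_mul(poly[k], x)
        poly = new
    return poly


def chien_search(locator, n=15, max_coeffs=None):
    """Return positions i where locator(alpha^-i) == 0.

    max_coeffs models a reduced unit: it has one register per supported
    coefficient and cannot process a longer polynomial.
    """
    if max_coeffs is not None and len(locator) > max_coeffs:
        raise ValueError("locator exceeds the capacity of this reduced unit")
    regs = list(locator)
    positions = []
    for i in range(n):
        acc = 0
        for r in regs:
            acc ^= r  # sum of the registers equals locator(alpha^-i)
        if acc == 0:
            positions.append(i)
        # Advance each register k by a factor of alpha^-k for the next position.
        regs = [gf_mul(r, EXP[(-k) % 15]) for k, r in enumerate(regs)]
    return positions


print(chien_search(locator_from_positions([3, 10])))
```

The per-step work is proportional to the number of registers, which is the cost a reduced unit would save. A polynomial that exceeds `max_coeffs` would have to be routed to a full-size unit.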

A second improvement may be a speculative error polynomial generator. The generator could be sized such that it only supports up to a certain degree of polynomial. If during calculation the number of errors exceeded the capacity of the current generator, calculation would need to be restarted by a full unit. The viability of both improvements is untested, but they warrant further study.
