Permalink
Browse files

Added optimized CRC32 algorithm.

  • Loading branch information...
1 parent e536962 commit 48c90703d7c71af9de4398bdbc6d2bc7ef4911fa @Ezekeel committed Jan 5, 2012
Showing with 1,303 additions and 449 deletions.
  1. +2 −0 Documentation/00-INDEX
  2. +183 −0 Documentation/crc32.txt
  3. +1 −0 crypto/Kconfig
  4. +3 −91 crypto/crc32c.c
  5. +2 −0 include/linux/crc32.h
  6. +50 −4 lib/Kconfig
  7. +953 −326 lib/crc32.c
  8. +48 −8 lib/crc32defs.h
  9. +61 −20 lib/gen_crc32table.c
View
@@ -104,6 +104,8 @@ cpuidle/
- info on CPU_IDLE, CPU idle state management subsystem.
cputopology.txt
- documentation on how CPU topology info is exported via sysfs.
+crc32.txt
+ - brief tutorial on CRC computation
cris/
- directory with info about Linux on CRIS architecture.
crypto/
View
@@ -0,0 +1,183 @@
+A brief CRC tutorial.
+
+A CRC is a long-division remainder. You add the CRC to the message,
+and the whole thing (message+CRC) is a multiple of the given
+CRC polynomial. To check the CRC, you can either check that the
+CRC matches the recomputed value, *or* you can check that the
+remainder computed on the message+CRC is 0. This latter approach
+is used by a lot of hardware implementations, and is why so many
+protocols put the end-of-frame flag after the CRC.
+
+It's actually the same long division you learned in school, except that
+- We're working in binary, so the digits are only 0 and 1, and
+- When dividing polynomials, there are no carries. Rather than add and
+ subtract, we just xor. Thus, we tend to get a bit sloppy about
+ the difference between adding and subtracting.
+
+Like all division, the remainder is always smaller than the divisor.
+To produce a 32-bit CRC, the divisor is actually a 33-bit CRC polynomial.
+Since it's 33 bits long, bit 32 is always going to be set, so usually the
+CRC is written in hex with the most significant bit omitted. (If you're
+familiar with the IEEE 754 floating-point format, it's the same idea.)
+
+Note that a CRC is computed over a string of *bits*, so you have
+to decide on the endianness of the bits within each byte. To get
+the best error-detecting properties, this should correspond to the
+order they're actually sent. For example, standard RS-232 serial is
+little-endian; the most significant bit (sometimes used for parity)
+is sent last. And when appending a CRC word to a message, you should
+do it in the right order, matching the endianness.
+
+Just like with ordinary division, you proceed one digit (bit) at a time.
+Each step of the division, division, you take one more digit (bit) of the
+dividend and append it to the current remainder. Then you figure out the
+appropriate multiple of the divisor to subtract to being the remainder
+back into range. In binary, this is easy - it has to be either 0 or 1,
+and to make the XOR cancel, it's just a copy of bit 32 of the remainder.
+
+When computing a CRC, we don't care about the quotient, so we can
+throw the quotient bit away, but subtract the appropriate multiple of
+the polynomial from the remainder and we're back to where we started,
+ready to process the next bit.
+
+A big-endian CRC written this way would be coded like:
+for (i = 0; i < input_bits; i++) {
+ multiple = remainder & 0x80000000 ? CRCPOLY : 0;
+ remainder = (remainder << 1 | next_input_bit()) ^ multiple;
+}
+
+Notice how, to get at bit 32 of the shifted remainder, we look
+at bit 31 of the remainder *before* shifting it.
+
+But also notice how the next_input_bit() bits we're shifting into
+the remainder don't actually affect any decision-making until
+32 bits later. Thus, the first 32 cycles of this are pretty boring.
+Also, to add the CRC to a message, we need a 32-bit-long hole for it at
+the end, so we have to add 32 extra cycles shifting in zeros at the
+end of every message,
+
+These details lead to a standard trick: rearrange merging in the
+next_input_bit() until the moment it's needed. Then the first 32 cycles
+can be precomputed, and merging in the final 32 zero bits to make room
+for the CRC can be skipped entirely. This changes the code to:
+
+for (i = 0; i < input_bits; i++) {
+ remainder ^= next_input_bit() << 31;
+ multiple = (remainder & 0x80000000) ? CRCPOLY : 0;
+ remainder = (remainder << 1) ^ multiple;
+}
+
+With this optimization, the little-endian code is particularly simple:
+for (i = 0; i < input_bits; i++) {
+ remainder ^= next_input_bit();
+ multiple = (remainder & 1) ? CRCPOLY : 0;
+ remainder = (remainder >> 1) ^ multiple;
+}
+
+The most significant coefficient of the remainder polynomial is stored
+in the least significant bit of the binary "remainder" variable.
+The other details of endianness have been hidden in CRCPOLY (which must
+be bit-reversed) and next_input_bit().
+
+As long as next_input_bit is returning the bits in a sensible order, we don't
+*have* to wait until the last possible moment to merge in additional bits.
+We can do it 8 bits at a time rather than 1 bit at a time:
+for (i = 0; i < input_bytes; i++) {
+ remainder ^= next_input_byte() << 24;
+ for (j = 0; j < 8; j++) {
+ multiple = (remainder & 0x80000000) ? CRCPOLY : 0;
+ remainder = (remainder << 1) ^ multiple;
+ }
+}
+
+Or in little-endian:
+for (i = 0; i < input_bytes; i++) {
+ remainder ^= next_input_byte();
+ for (j = 0; j < 8; j++) {
+ multiple = (remainder & 1) ? CRCPOLY : 0;
+ remainder = (remainder >> 1) ^ multiple;
+ }
+}
+
+If the input is a multiple of 32 bits, you can even XOR in a 32-bit
+word at a time and increase the inner loop count to 32.
+
+You can also mix and match the two loop styles, for example doing the
+bulk of a message byte-at-a-time and adding bit-at-a-time processing
+for any fractional bytes at the end.
+
+To reduce the number of conditional branches, software commonly uses
+the byte-at-a-time table method, popularized by Dilip V. Sarwate,
+"Computation of Cyclic Redundancy Checks via Table Look-Up", Comm. ACM
+v.31 no.8 (August 1998) p. 1008-1013.
+
+Here, rather than just shifting one bit of the remainder to decide
+in the correct multiple to subtract, we can shift a byte at a time.
+This produces a 40-bit (rather than a 33-bit) intermediate remainder,
+and the correct multiple of the polynomial to subtract is found using
+a 256-entry lookup table indexed by the high 8 bits.
+
+(The table entries are simply the CRC-32 of the given one-byte messages.)
+
+When space is more constrained, smaller tables can be used, e.g. two
+4-bit shifts followed by a lookup in a 16-entry table.
+
+It is not practical to process much more than 8 bits at a time using this
+technique, because tables larger than 256 entries use too much memory and,
+more importantly, too much of the L1 cache.
+
+To get higher software performance, a "slicing" technique can be used.
+See "High Octane CRC Generation with the Intel Slicing-by-8 Algorithm",
+ftp://download.intel.com/technology/comms/perfnet/download/slicing-by-8.pdf
+
+This does not change the number of table lookups, but does increase
+the parallelism. With the classic Sarwate algorithm, each table lookup
+must be completed before the index of the next can be computed.
+
+A "slicing by 2" technique would shift the remainder 16 bits at a time,
+producing a 48-bit intermediate remainder. Rather than doing a single
+lookup in a 65536-entry table, the two high bytes are looked up in
+two different 256-entry tables. Each contains the remainder required
+to cancel out the corresponding byte. The tables are different because the
+polynomials to cancel are different. One has non-zero coefficients from
+x^32 to x^39, while the other goes from x^40 to x^47.
+
+Since modern processors can handle many parallel memory operations, this
+takes barely longer than a single table look-up and thus performs almost
+twice as fast as the basic Sarwate algorithm.
+
+This can be extended to "slicing by 4" using 4 256-entry tables.
+Each step, 32 bits of data is fetched, XORed with the CRC, and the result
+broken into bytes and looked up in the tables. Because the 32-bit shift
+leaves the low-order bits of the intermediate remainder zero, the
+final CRC is simply the XOR of the 4 table look-ups.
+
+But this still enforces sequential execution: a second group of table
+look-ups cannot begin until the previous groups 4 table look-ups have all
+been completed. Thus, the processor's load/store unit is sometimes idle.
+
+To make maximum use of the processor, "slicing by 8" performs 8 look-ups
+in parallel. Each step, the 32-bit CRC is shifted 64 bits and XORed
+with 64 bits of input data. What is important to note is that 4 of
+those 8 bytes are simply copies of the input data; they do not depend
+on the previous CRC at all. Thus, those 4 table look-ups may commence
+immediately, without waiting for the previous loop iteration.
+
+By always having 4 loads in flight, a modern superscalar processor can
+be kept busy and make full use of its L1 cache.
+
+Two more details about CRC implementation in the real world:
+
+Normally, appending zero bits to a message which is already a multiple
+of a polynomial produces a larger multiple of that polynomial. Thus,
+a basic CRC will not detect appended zero bits (or bytes). To enable
+a CRC to detect this condition, it's common to invert the CRC before
+appending it. This makes the remainder of the message+crc come out not
+as zero, but some fixed non-zero value. (The CRC of the inversion
+pattern, 0xffffffff.)
+
+The same problem applies to zero bits prepended to the message, and a
+similar solution is used. Instead of starting the CRC computation with
+a remainder of 0, an initial remainder of all ones is used. As long as
+you start the same way on decoding, it doesn't make a difference.
+
View
@@ -302,6 +302,7 @@ comment "Digest"
config CRYPTO_CRC32C
tristate "CRC32c CRC algorithm"
select CRYPTO_HASH
+ select CRC32
help
Castagnoli, et al Cyclic Redundancy-Check Algorithm. Used
by iSCSI for header and data digests and by others.
View
@@ -40,6 +40,7 @@
#include <linux/module.h>
#include <linux/string.h>
#include <linux/kernel.h>
+#include <linux/crc32.h>
#define CHKSUM_BLOCK_SIZE 1
#define CHKSUM_DIGEST_SIZE 4
@@ -53,95 +54,6 @@ struct chksum_desc_ctx {
};
/*
- * This is the CRC-32C table
- * Generated with:
- * width = 32 bits
- * poly = 0x1EDC6F41
- * reflect input bytes = true
- * reflect output bytes = true
- */
-
-static const u32 crc32c_table[256] = {
- 0x00000000L, 0xF26B8303L, 0xE13B70F7L, 0x1350F3F4L,
- 0xC79A971FL, 0x35F1141CL, 0x26A1E7E8L, 0xD4CA64EBL,
- 0x8AD958CFL, 0x78B2DBCCL, 0x6BE22838L, 0x9989AB3BL,
- 0x4D43CFD0L, 0xBF284CD3L, 0xAC78BF27L, 0x5E133C24L,
- 0x105EC76FL, 0xE235446CL, 0xF165B798L, 0x030E349BL,
- 0xD7C45070L, 0x25AFD373L, 0x36FF2087L, 0xC494A384L,
- 0x9A879FA0L, 0x68EC1CA3L, 0x7BBCEF57L, 0x89D76C54L,
- 0x5D1D08BFL, 0xAF768BBCL, 0xBC267848L, 0x4E4DFB4BL,
- 0x20BD8EDEL, 0xD2D60DDDL, 0xC186FE29L, 0x33ED7D2AL,
- 0xE72719C1L, 0x154C9AC2L, 0x061C6936L, 0xF477EA35L,
- 0xAA64D611L, 0x580F5512L, 0x4B5FA6E6L, 0xB93425E5L,
- 0x6DFE410EL, 0x9F95C20DL, 0x8CC531F9L, 0x7EAEB2FAL,
- 0x30E349B1L, 0xC288CAB2L, 0xD1D83946L, 0x23B3BA45L,
- 0xF779DEAEL, 0x05125DADL, 0x1642AE59L, 0xE4292D5AL,
- 0xBA3A117EL, 0x4851927DL, 0x5B016189L, 0xA96AE28AL,
- 0x7DA08661L, 0x8FCB0562L, 0x9C9BF696L, 0x6EF07595L,
- 0x417B1DBCL, 0xB3109EBFL, 0xA0406D4BL, 0x522BEE48L,
- 0x86E18AA3L, 0x748A09A0L, 0x67DAFA54L, 0x95B17957L,
- 0xCBA24573L, 0x39C9C670L, 0x2A993584L, 0xD8F2B687L,
- 0x0C38D26CL, 0xFE53516FL, 0xED03A29BL, 0x1F682198L,
- 0x5125DAD3L, 0xA34E59D0L, 0xB01EAA24L, 0x42752927L,
- 0x96BF4DCCL, 0x64D4CECFL, 0x77843D3BL, 0x85EFBE38L,
- 0xDBFC821CL, 0x2997011FL, 0x3AC7F2EBL, 0xC8AC71E8L,
- 0x1C661503L, 0xEE0D9600L, 0xFD5D65F4L, 0x0F36E6F7L,
- 0x61C69362L, 0x93AD1061L, 0x80FDE395L, 0x72966096L,
- 0xA65C047DL, 0x5437877EL, 0x4767748AL, 0xB50CF789L,
- 0xEB1FCBADL, 0x197448AEL, 0x0A24BB5AL, 0xF84F3859L,
- 0x2C855CB2L, 0xDEEEDFB1L, 0xCDBE2C45L, 0x3FD5AF46L,
- 0x7198540DL, 0x83F3D70EL, 0x90A324FAL, 0x62C8A7F9L,
- 0xB602C312L, 0x44694011L, 0x5739B3E5L, 0xA55230E6L,
- 0xFB410CC2L, 0x092A8FC1L, 0x1A7A7C35L, 0xE811FF36L,
- 0x3CDB9BDDL, 0xCEB018DEL, 0xDDE0EB2AL, 0x2F8B6829L,
- 0x82F63B78L, 0x709DB87BL, 0x63CD4B8FL, 0x91A6C88CL,
- 0x456CAC67L, 0xB7072F64L, 0xA457DC90L, 0x563C5F93L,
- 0x082F63B7L, 0xFA44E0B4L, 0xE9141340L, 0x1B7F9043L,
- 0xCFB5F4A8L, 0x3DDE77ABL, 0x2E8E845FL, 0xDCE5075CL,
- 0x92A8FC17L, 0x60C37F14L, 0x73938CE0L, 0x81F80FE3L,
- 0x55326B08L, 0xA759E80BL, 0xB4091BFFL, 0x466298FCL,
- 0x1871A4D8L, 0xEA1A27DBL, 0xF94AD42FL, 0x0B21572CL,
- 0xDFEB33C7L, 0x2D80B0C4L, 0x3ED04330L, 0xCCBBC033L,
- 0xA24BB5A6L, 0x502036A5L, 0x4370C551L, 0xB11B4652L,
- 0x65D122B9L, 0x97BAA1BAL, 0x84EA524EL, 0x7681D14DL,
- 0x2892ED69L, 0xDAF96E6AL, 0xC9A99D9EL, 0x3BC21E9DL,
- 0xEF087A76L, 0x1D63F975L, 0x0E330A81L, 0xFC588982L,
- 0xB21572C9L, 0x407EF1CAL, 0x532E023EL, 0xA145813DL,
- 0x758FE5D6L, 0x87E466D5L, 0x94B49521L, 0x66DF1622L,
- 0x38CC2A06L, 0xCAA7A905L, 0xD9F75AF1L, 0x2B9CD9F2L,
- 0xFF56BD19L, 0x0D3D3E1AL, 0x1E6DCDEEL, 0xEC064EEDL,
- 0xC38D26C4L, 0x31E6A5C7L, 0x22B65633L, 0xD0DDD530L,
- 0x0417B1DBL, 0xF67C32D8L, 0xE52CC12CL, 0x1747422FL,
- 0x49547E0BL, 0xBB3FFD08L, 0xA86F0EFCL, 0x5A048DFFL,
- 0x8ECEE914L, 0x7CA56A17L, 0x6FF599E3L, 0x9D9E1AE0L,
- 0xD3D3E1ABL, 0x21B862A8L, 0x32E8915CL, 0xC083125FL,
- 0x144976B4L, 0xE622F5B7L, 0xF5720643L, 0x07198540L,
- 0x590AB964L, 0xAB613A67L, 0xB831C993L, 0x4A5A4A90L,
- 0x9E902E7BL, 0x6CFBAD78L, 0x7FAB5E8CL, 0x8DC0DD8FL,
- 0xE330A81AL, 0x115B2B19L, 0x020BD8EDL, 0xF0605BEEL,
- 0x24AA3F05L, 0xD6C1BC06L, 0xC5914FF2L, 0x37FACCF1L,
- 0x69E9F0D5L, 0x9B8273D6L, 0x88D28022L, 0x7AB90321L,
- 0xAE7367CAL, 0x5C18E4C9L, 0x4F48173DL, 0xBD23943EL,
- 0xF36E6F75L, 0x0105EC76L, 0x12551F82L, 0xE03E9C81L,
- 0x34F4F86AL, 0xC69F7B69L, 0xD5CF889DL, 0x27A40B9EL,
- 0x79B737BAL, 0x8BDCB4B9L, 0x988C474DL, 0x6AE7C44EL,
- 0xBE2DA0A5L, 0x4C4623A6L, 0x5F16D052L, 0xAD7D5351L
-};
-
-/*
- * Steps through buffer one byte at at time, calculates reflected
- * crc using table.
- */
-
-static u32 crc32c(u32 crc, const u8 *data, unsigned int length)
-{
- while (length--)
- crc = crc32c_table[(crc ^ *data++) & 0xFFL] ^ (crc >> 8);
-
- return crc;
-}
-
-/*
* Steps through buffer one byte at at time, calculates reflected
* crc using table.
*/
@@ -179,7 +91,7 @@ static int chksum_update(struct shash_desc *desc, const u8 *data,
{
struct chksum_desc_ctx *ctx = shash_desc_ctx(desc);
- ctx->crc = crc32c(ctx->crc, data, length);
+ ctx->crc = __crc32c_le(ctx->crc, data, length);
return 0;
}
@@ -193,7 +105,7 @@ static int chksum_final(struct shash_desc *desc, u8 *out)
static int __chksum_finup(u32 *crcp, const u8 *data, unsigned int len, u8 *out)
{
- *(__le32 *)out = ~cpu_to_le32(crc32c(*crcp, data, len));
+ *(__le32 *)out = ~cpu_to_le32(__crc32c_le(*crcp, data, len));
return 0;
}
View
@@ -11,6 +11,8 @@
extern u32 crc32_le(u32 crc, unsigned char const *p, size_t len);
extern u32 crc32_be(u32 crc, unsigned char const *p, size_t len);
+extern u32 __crc32c_le(u32 crc, unsigned char const *p, size_t len);
+
#define crc32(seed, data, length) crc32_le(seed, (unsigned char const *)(data), length)
/*
View
@@ -51,14 +51,60 @@ config CRC_ITU_T
functions require M here.
config CRC32
- tristate "CRC32 functions"
+ tristate "CRC32/CRC32c functions"
default y
select BITREVERSE
help
This option is provided for the case where no in-kernel-tree
- modules require CRC32 functions, but a module built outside the
- kernel tree does. Such modules that use library CRC32 functions
- require M here.
+ modules require CRC32/CRC32c functions, but a module built outside
+ the kernel tree does. Such modules that use library CRC32/CRC32c
+ functions require M here.
+
+config CRC32_SELFTEST
+ bool "CRC32 perform self test on init"
+ default n
+ depends on CRC32
+ help
+ This option enables the CRC32 library functions to perform a
+ self test on initialization. The self test computes crc32_le
+ and crc32_be over byte strings with random alignment and length
+ and computes the total elapsed time and number of bytes processed.
+
+choice
+ prompt "CRC32 implementation"
+ depends on CRC32
+ default CRC32_SLICEBY8
+
+config CRC32_SLICEBY8
+ bool "Slice by 8 bytes"
+ help
+ Calculate checksum 8 bytes at a time with a clever slicing algorithm.
+ This is the fastest algorithm, but comes with a 8KiB lookup table.
+ Most modern processors have enough cache that this shouldn't be
+ a problem.
+
+ If you don't know which to choose, choose this one.
+
+config CRC32_SLICEBY4
+ bool "Slice by 4 bytes"
+ help
+ Calculate checksum 4 bytes at a time with a clever slicing algorithm.
+ This is a bit slower than slice by 8, but has a smaller 4KiB lookup
+ table.
+
+config CRC32_SARWATE
+ bool "Sarwate's Algorithm (one byte at a time)"
+ help
+ Calculate checksum a byte at a time using Sarwate's algorithm. This
+ is not particularly fast, but has a small 256 byte lookup table.
+
+config CRC32_BIT
+ bool "Classic Algorithm (one bit at a time)"
+ help
+ Calculate checksum one bit at a time. This is VERY slow, but has
+ no lookup table. This is provided as a debugging option.
+
+endchoice
config CRC7
tristate "CRC7 functions"
Oops, something went wrong.

0 comments on commit 48c9070

Please sign in to comment.