A high-performance, low-level cryptographic library built from scratch.
⚠️ SECURITY WARNING (The Golden Rule)⚠️ This project is strictly EDUCATIONAL AND EXPERIMENTAL. It MUST NOT be used in production environments, real-world systems, or to protect sensitive data. "Don't roll your own crypto" absolutely applies here. This library was created to study hardware-level optimization, side-channel mitigation, and CPU architecture.
- Features
- Supported Algorithms
- System Requirements
- Build & Installation
- Benchmarking Suite
- FAQ
- License
- Hardware Acceleration & SIMD: Automatic hardware detection and optimized implementations using C intrinsics and pure assembly directives (AES-NI, AVX2, VAES).
- Custom Assembly Kernels: Loop unrolling and parallel multi-block execution to maximize CPU throughput.
- Side-Channel Protection: SM4 implementation utilizing masked Substitution Boxes (S-Boxes) with runtime-generated entropy.
- Rigorous Benchmarking Suite: Nanosecond-precision performance testing using CPU barriers (`cpuid`), memory barriers (`mfence`), and direct clock cycle reads (`__rdtsc()`).
- Linux `perf` Integration: The build system (Makefile) is wired for automatic cache and instruction auditing via `perf stat`.
- Symmetric / Block Ciphers:
  - AES (native AES-NI support for ECB, CBC, and CTR modes).
  - SM4 (Chinese standard, featuring masked S-Boxes).
- Stream Ciphers:
  - ChaCha20 (fully functional and vectorizable).
- Hash Functions:
  - SHA-256.
- Random Number Generation (RNG):
  - Cross-platform interface (reads from `BCryptGenRandom` on Windows, `getrandom` or `/dev/urandom` on Linux).
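The cross-platform RNG interface listed above could look roughly like the following. This is a minimal sketch, not the library's actual code: the function name `os_random_bytes` is hypothetical, and the structure (try `getrandom` first, then fall back to `/dev/urandom`) is an assumption about how the two Linux sources are combined.

```c
#include <stddef.h>
#include <stdio.h>

#if defined(_WIN32)
#include <windows.h>
#include <bcrypt.h>          /* BCryptGenRandom; link with -lbcrypt */
#elif defined(__linux__)
#include <sys/types.h>
#include <sys/random.h>      /* getrandom(), available in glibc >= 2.25 */
#endif

/* Hypothetical helper: fill buf with len bytes from the OS CSPRNG.
 * Returns 0 on success, -1 on failure. */
static int os_random_bytes(unsigned char *buf, size_t len)
{
#if defined(_WIN32)
    NTSTATUS st = BCryptGenRandom(NULL, buf, (ULONG)len,
                                  BCRYPT_USE_SYSTEM_PREFERRED_RNG);
    return st >= 0 ? 0 : -1;
#else
    size_t done = 0;
#if defined(__linux__)
    /* getrandom() may return fewer bytes than requested, so loop. */
    while (done < len) {
        ssize_t n = getrandom(buf + done, len - done, 0);
        if (n < 0)
            break;           /* fall through to /dev/urandom */
        done += (size_t)n;
    }
    if (done == len)
        return 0;
#endif
    /* Fallback path: read the remaining bytes from /dev/urandom. */
    FILE *f = fopen("/dev/urandom", "rb");
    if (f == NULL)
        return -1;
    size_t got = fread(buf + done, 1, len - done, f);
    fclose(f);
    return (done + got == len) ? 0 : -1;
#endif
}
```

A caller would pass a key or nonce buffer, for example `os_random_bytes(key, sizeof key)`, and treat any non-zero return as fatal rather than continuing with a partially filled buffer.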
To compile and run the high-performance benchmarks, the following is recommended:
- OS: Linux (high-precision metrics and `mlock` require POSIX/Linux syscalls).
- Compiler: `gcc` (support for GNU extensions and `-march=native`).
- Tools: `make`, `perf` (linux-tools).
- Hardware: x86_64 processor with AES-NI and AVX2 support.
The project includes a smart Makefile that reads /proc/cpuinfo to dynamically inject the appropriate compilation flags (-maes, -mavx2, etc.).
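To illustrate the kind of check the Makefile performs, here is a small C sketch of the same logic: look for feature names in a `/proc/cpuinfo`-style `flags` line and map them to compiler flags. The helper names are hypothetical, and the real Makefile presumably does this with shell tools rather than C. The one subtle point is whole-word matching, so that `aes` is not falsely detected inside `vaes`.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Return true if `flag` appears as a whole whitespace-separated word
 * in `flags` (a "flags : ..." line as found in /proc/cpuinfo). */
static bool has_cpu_flag(const char *flags, const char *flag)
{
    size_t n = strlen(flag);
    if (n == 0)
        return false;
    for (const char *p = flags; (p = strstr(p, flag)) != NULL; p += n) {
        bool starts = (p == flags) || p[-1] == ' ' || p[-1] == '\t';
        bool ends = p[n] == '\0' || p[n] == ' ' || p[n] == '\n';
        if (starts && ends)
            return true;
    }
    return false;
}

/* Hypothetical helper: write the matching GCC flags into `out`. */
static void pick_cflags(const char *flags, char *out, size_t out_len)
{
    snprintf(out, out_len, "%s%s%s",
             has_cpu_flag(flags, "aes")  ? "-maes "  : "",
             has_cpu_flag(flags, "avx2") ? "-mavx2 " : "",
             has_cpu_flag(flags, "vaes") ? "-mvaes " : "");
}
```

Given the line `flags : fpu sse2 aes avx2`, `pick_cflags` produces `-maes -mavx2 ` and leaves out `-mvaes`.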
# 1. Clone the repository
git clone https://github.com/DarThunder/AxiomSSL
cd AxiomSSL
# 2. Check if your CPU supports the required extensions
make check_aes_support
# 3. Build the static (.a) and shared (.so) libraries
make all
# 4. Run basic integrity tests
make test
This project takes pride in its ability to measure itself. The test suite locks memory in RAM (`mlock`), switches to the real-time `SCHED_FIFO` scheduling policy, and filters outliers to reduce measurement noise.
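The outlier filtering mentioned above is not specified further in this README. One common approach, shown here purely as an assumption about how such filtering might work, is a trimmed mean: sort the per-iteration cycle counts, drop a fixed number of the smallest and largest samples, and average the rest. The function name `trimmed_mean` is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* qsort comparator for 64-bit unsigned samples. */
static int cmp_u64(const void *a, const void *b)
{
    uint64_t x = *(const uint64_t *)a;
    uint64_t y = *(const uint64_t *)b;
    return (x > y) - (x < y);
}

/* Sort `samples` in place, discard the `trim` smallest and `trim`
 * largest values, and return the mean of what remains.
 * Returns 0.0 if trimming would leave no samples. */
static double trimmed_mean(uint64_t *samples, size_t n, size_t trim)
{
    if (n <= 2 * trim)
        return 0.0;
    qsort(samples, n, sizeof samples[0], cmp_u64);
    double sum = 0.0;
    for (size_t i = trim; i < n - trim; i++)
        sum += (double)samples[i];
    return sum / (double)(n - 2 * trim);
}
```

With 20 iterations, trimming one or two samples from each end removes the occasional interrupt or cold-cache run without discarding much data.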
To run the benchmarks:
# Run the performance benchmark using the Linux `perf` tool
make performance
# If you want to audit cache-misses by forcing RAM flushes:
make audit
The benchmarking suite is designed for extreme precision, utilizing __rdtsc with CPU/memory barriers, mlock to prevent page faults, and SCHED_FIFO for low latency.
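As a rough illustration of the barrier-plus-TSC pattern described above, the sketch below fences memory with `_mm_mfence`, serializes the pipeline with `cpuid`, and then reads the counter with `__rdtsc`. The names `read_cycles`, `time_call`, and `spin` are hypothetical, the exact barrier ordering used by the project may differ, and the non-x86 fallback to `clock_gettime` exists only so the example compiles elsewhere (it returns nanoseconds, not cycles).

```c
#define _POSIX_C_SOURCE 200809L
#include <stdint.h>

#if defined(__x86_64__)
#include <cpuid.h>       /* __cpuid macro (GCC/Clang) */
#include <x86intrin.h>   /* __rdtsc, _mm_mfence */

static inline uint64_t read_cycles(void)
{
    unsigned int a, b, c, d;
    _mm_mfence();             /* make earlier loads/stores globally visible */
    __cpuid(0, a, b, c, d);   /* serializing: earlier instructions retire first */
    (void)a; (void)b; (void)c; (void)d;
    return __rdtsc();         /* TSC reference cycles */
}
#else
#include <time.h>

static inline uint64_t read_cycles(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}
#endif

/* Return the number of ticks spent inside fn(arg). */
static uint64_t time_call(void (*fn)(void *), void *arg)
{
    uint64_t start = read_cycles();
    fn(arg);
    uint64_t end = read_cycles();
    return end - start;
}

/* Example workload: a loop the compiler cannot optimize away. */
static void spin(void *arg)
{
    volatile uint64_t acc = 0;
    uint64_t n = *(uint64_t *)arg;
    for (uint64_t i = 0; i < n; i++)
        acc += i;
}
```

The `cpuid` call itself costs on the order of a hundred cycles or more, so a harness like this typically measures an empty call once and subtracts that overhead from each sample.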
All benchmarks were executed on a local machine with the following specifications:
- CPU: AMD Ryzen 5 5500U (Zen 2 Architecture)
- OS: NixOS (Linux-zen kernel)
- RAM: 16GB @ 2667 MT/s
- Storage: 512GB NVMe SSD
- WM: Hyprland 0.54.1
Tests executed on a 2047 MB buffer (33.5 million blocks) over 20 iterations (outliers filtered).
| Metric (ChaCha20) | Result |
|---|---|
| Throughput | 544.32 MB/s |
| Cycles per byte | 3.67 cpb |
| Cycles per block | 235.03 cycles |
| Time per block | 112.13 ns |
| Measurement Variance | 0.04% (Extremely stable) |
Developer Note: The current ChaCha20 implementation uses 128-bit `xmm` registers to process one 64-byte block at a time, achieving ~3.7 cpb. Future work includes implementing a 4-block or 8-block parallel AVX2 kernel (`ymm` registers) to push the throughput beyond the 2 GB/s threshold.
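For reference, the core operation being vectorized is the ChaCha20 quarter round from RFC 8439. The scalar sketch below is not the library's kernel. In a one-block SSE layout, each 128-bit register typically holds one row of the 4x4 state, so the four column quarter rounds execute as a single sequence of vector add, XOR, and rotate operations. Whether this project uses exactly that layout is an assumption.

```c
#include <stdint.h>

/* Rotate a 32-bit word left by n bits (0 < n < 32). */
static inline uint32_t rotl32(uint32_t x, int n)
{
    return (x << n) | (x >> (32 - n));
}

/* ChaCha20 quarter round on four state words (RFC 8439, section 2.1). */
static void quarter_round(uint32_t *a, uint32_t *b, uint32_t *c, uint32_t *d)
{
    *a += *b; *d ^= *a; *d = rotl32(*d, 16);
    *c += *d; *b ^= *c; *b = rotl32(*b, 12);
    *a += *b; *d ^= *a; *d = rotl32(*d, 8);
    *c += *d; *b ^= *c; *b = rotl32(*b, 7);
}
```

A full block applies 20 rounds (80 quarter rounds), alternating between column and diagonal groupings. A 4-block or 8-block AVX2 kernel would run the same sequence on several independent blocks at once.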
| Metric (AES-ECB) | Result |
|---|---|
| Throughput | 9752.39 MB/s (9.52 GB/s) |
| Cycles per byte | 0.205 cpb |
| Cycles per block | 3.28 cycles |
| Time per block | 1.56 ns |
| Measurement Variance | 0.03% (Extremely stable) |
Developer Note: This extreme throughput (~9.5 GB/s) is achieved by leveraging dedicated hardware instructions (AES-NI) alongside parallel block processing (12x block unrolling in custom assembly kernels) using 128-bit or 256-bit wide vector registers. Reaching 0.205 cycles per byte effectively saturates the CPU's cryptographic pipeline on the Zen 2 architecture.

Note on Cycle Counting: The cycles per byte metric is derived from the `__rdtsc` intrinsic (the `rdtsc` instruction). On modern microarchitectures (like Zen 2), this measures invariant Time-Stamp Counter (TSC) reference cycles, not actual core clock cycles, which vary with dynamic frequency scaling (AMD Precision Boost, Intel Turbo Boost).
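The table metrics are related by simple arithmetic, sketched below. Two things here are inferred from the published numbers rather than stated by the project: the TSC frequency is taken as roughly 2.1 GHz (3.28 cycles / 1.56 ns), and "MB" is treated as 2^20 bytes (9752.39 / 1024 = 9.52). The type and function names are hypothetical.

```c
#include <stddef.h>

typedef struct {
    double cycles_per_byte;
    double ns_per_block;
    double mib_per_s;
} bench_metrics;

/* Derive the reported metrics from a measured cycles-per-block figure.
 * tsc_hz is the TSC reference frequency (an assumed input here). */
static bench_metrics derive_metrics(double cycles_per_block,
                                    size_t block_bytes, double tsc_hz)
{
    bench_metrics m;
    m.cycles_per_byte = cycles_per_block / (double)block_bytes;
    m.ns_per_block = cycles_per_block / tsc_hz * 1e9;
    m.mib_per_s = (double)block_bytes / (m.ns_per_block * 1e-9)
                  / (1024.0 * 1024.0);
    return m;
}
```

For the AES-ECB row, `derive_metrics(3.28, 16, 2.1e9)` gives 0.205 cpb, about 1.56 ns per block, and a throughput within roughly 0.2% of the reported figure. The small gap is consistent with the real TSC frequency being slightly below the assumed 2.1 GHz.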
| Metric (AES-CBC) | Result |
|---|---|
| Throughput | 1445.19 MB/s (1.41 GB/s) |
| Cycles per byte | 1.38 cpb |
| Cycles per block | 22.13 cycles |
| Time per block | 10.55 ns |
| Measurement Variance | 0.08% (Extremely stable) |
Developer Note: The stark contrast between AES-ECB (~9.5 GB/s) and AES-CBC (~1.4 GB/s) shows what happens when a dependency chain removes instruction-level parallelism (ILP). CBC encryption is inherently serial ($C_i = E_K(P_i \oplus C_{i-1})$), meaning the CPU cannot start encrypting block $N$ until block $N-1$ is completely encrypted. While ECB measures the maximum throughput of the Zen 2 AES pipeline, this CBC benchmark measures its raw hardware latency (taking ~22 cycles to push a single block through 10 AES rounds).
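The dependency chain is easy to see in code. The sketch below uses a toy 64-bit mixing function as a stand-in for $E_K$ (it is not AES and not secure, and all names are hypothetical). The ECB loop has no dependency between iterations, while each CBC iteration needs the previous ciphertext before it can start.

```c
#include <stddef.h>
#include <stdint.h>

/* Toy stand-in for E_K on a 64-bit block. NOT a real cipher. */
static uint64_t toy_encrypt_block(uint64_t block, uint64_t key)
{
    block ^= key;
    block *= 0x9E3779B97F4A7C15ull;   /* odd multiplier keeps this step invertible */
    block ^= block >> 29;
    return block ^ key;
}

/* ECB: every iteration is independent, so blocks can be pipelined. */
static void ecb_encrypt(const uint64_t *pt, uint64_t *ct, size_t n,
                        uint64_t key)
{
    for (size_t i = 0; i < n; i++)
        ct[i] = toy_encrypt_block(pt[i], key);
}

/* CBC: C_i = E_K(P_i XOR C_{i-1}), with C_{-1} = IV.
 * Iteration i cannot begin until iteration i-1 has produced `prev`. */
static void cbc_encrypt(const uint64_t *pt, uint64_t *ct, size_t n,
                        uint64_t key, uint64_t iv)
{
    uint64_t prev = iv;
    for (size_t i = 0; i < n; i++) {
        prev = toy_encrypt_block(pt[i] ^ prev, key);
        ct[i] = prev;
    }
}
```

An unrolled kernel can keep many ECB blocks in flight at once, but unrolling the CBC loop gains nothing because the chain through `prev` remains. CBC decryption does not have this constraint, since every $C_{i-1}$ is already known.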
| Metric (AES-CTR) | Result |
|---|---|
| Throughput | 7904.97 MB/s (7.72 GB/s) |
| Cycles per byte | 0.253 cpb |
| Cycles per block | 4.05 cycles |
| Time per block | 1.93 ns |
| Measurement Variance | 0.21% (Extremely stable) |
Developer Note: CTR mode represents the sweet spot between raw hardware throughput and cryptographic security. By parallelizing the encryption of the counter blocks via the underlying ECB engine and subsequently XORing the resulting keystream with the plaintext using 256-bit AVX2 vectors (`_mm256_xor_si256`), this implementation reaches ~7.7 GB/s. The overhead compared to pure ECB (~9.5 GB/s) comes from the memory loads/stores required for the final XOR phase.
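The structure of CTR mode can be sketched in a few lines. As before, the keystream function is a toy 64-bit stand-in (not AES), the names are hypothetical, and the XOR is written in portable scalar C rather than with `_mm256_xor_si256`. The point is that each output block depends only on its own counter value, never on a neighboring block.

```c
#include <stddef.h>
#include <stdint.h>

/* Toy stand-in for E_K applied to a counter block. NOT a real cipher. */
static uint64_t toy_keystream_block(uint64_t counter, uint64_t key)
{
    uint64_t x = counter ^ key;
    x *= 0x9E3779B97F4A7C15ull;
    x ^= x >> 29;
    return x ^ key;
}

/* CTR: out[i] = in[i] XOR E_K(nonce + i).
 * The same routine encrypts and decrypts, and iterations are independent. */
static void ctr_xcrypt(const uint64_t *in, uint64_t *out, size_t n,
                       uint64_t key, uint64_t nonce)
{
    for (size_t i = 0; i < n; i++)
        out[i] = in[i] ^ toy_keystream_block(nonce + i, key);
}
```

Because the counter blocks are known up front, a kernel can encrypt a batch of them with the parallel ECB path and then XOR the keystream into the data 32 bytes at a time. The extra pass over memory is the cost the note above refers to.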
| Metric (SM4) | Result |
|---|---|
| Throughput | 98.30 MB/s |
| Cycles per byte | 20.33 cpb |
| Cycles per block | 325.36 cycles |
| Time per block | 155.23 ns |
| Measurement Variance | 0.03% (Extremely stable) |
Developer Note: Unlike AES, the Zen 2 architecture lacks native hardware instructions for the SM4 Chinese cipher. Furthermore, this specific implementation incorporates Masked S-Boxes to mitigate side-channel attacks (timing and power analysis). The overhead of runtime entropy generation and scalar execution explains the ~20 cpb cost, perfectly illustrating the classic cryptographic trade-off between extreme performance and side-channel resilience.
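A generic first-order table-masking scheme gives a feel for where the cost comes from. The sketch below is an assumption about the general technique, not this library's SM4 code: it uses a toy 8-bit permutation instead of the real SM4 S-box, fixed masks instead of runtime entropy, and hypothetical function names. The table is recomputed so that a masked input `x ^ m_in` maps to a masked output `S(x) ^ m_out`, which means the unmasked value never appears as a lookup index.

```c
#include <stdint.h>

/* Toy 8-bit permutation standing in for the S-box. NOT SM4's table. */
static uint8_t toy_sbox(uint8_t x)
{
    return (uint8_t)(x * 167u + 13u);   /* odd multiplier: a bijection on 0..255 */
}

/* Build a masked table T such that T[x ^ m_in] = S(x) ^ m_out.
 * In a real implementation the masks would come from the RNG at runtime. */
static void build_masked_sbox(uint8_t table[256], uint8_t m_in, uint8_t m_out)
{
    for (unsigned x = 0; x < 256; x++)
        table[(uint8_t)(x ^ m_in)] = (uint8_t)(toy_sbox((uint8_t)x) ^ m_out);
}

/* Look up a value that is already masked with m_in.
 * The result stays masked with m_out. */
static uint8_t masked_lookup(const uint8_t table[256], uint8_t masked_x)
{
    return table[masked_x];
}
```

Rebuilding a 256-entry table whenever the masks change, plus the extra XORs needed to carry masks through the linear layers, is the kind of overhead that a plain table-driven implementation avoids.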
| Metric (SHA-256) | Result |
|---|---|
| Throughput | 266.46 MB/s |
| Cycles per byte | 7.50 cpb |
| Cycles per block | 480.11 cycles |
| Time per block | 229.06 ns |
| Measurement Variance | 0.26% (Extremely stable) |
Developer Note: Unlike AES-ECB or ChaCha20, SHA-256 is an inherently serial algorithm (the state of block $N$ strictly depends on the output of block $N-1$). This prevents aggressive parallelization on a single data stream. Achieving ~266 MB/s (7.5 cpb) with standard bitwise operations in C serves as an optimized scalar baseline. Future work entails replacing the scalar `sha256_compress` function with hardware-accelerated SHA extensions (intrinsics enabled with `-msha`) to further reduce the cycles per byte.
Q: Why did you write your own crypto if everyone says not to? A: Because it's the only way to truly understand how it works under the hood. Managing SIMD registers, dealing with CPU out-of-order execution, and seeing how cache-misses destroy performance is an invaluable engineering experience.
Q: Is the RNG code secure? A: It delegates entropy to the OS via `getrandom` or `/dev/urandom` on Linux and `BCryptGenRandom` on Windows. That is sound for testing purposes, but as always, do not use it in production.
Q: Can I use this code for my university project? A: Absolutely! It's GPLv2 licensed. Use it, break it, and if you find something wrong — or something that could be done better — open an issue or PR. Feedback is genuinely welcome.
This project is licensed under the GPLv2 License. See the LICENSE file for details.