AxiomSSL

State: Experimental | Language: C / Assembly | Platform: Linux | License: GPLv2

A high-performance, low-level cryptographic library built from scratch.

⚠️ SECURITY WARNING (The Golden Rule) ⚠️ This project is strictly EDUCATIONAL AND EXPERIMENTAL. It MUST NOT be used in production environments, real-world systems, or to protect sensitive data. "Don't roll your own crypto" absolutely applies here. This library was created to study hardware-level optimization, side-channel mitigation, and CPU architecture.

Features

  • Hardware Acceleration & SIMD: Automatic hardware detection and optimized implementations using C intrinsics and pure assembly directives (AES-NI, AVX2, VAES).
  • Custom Assembly Kernels: Loop unrolling and parallel multi-block execution to maximize CPU throughput.
  • Side-Channel Protection: SM4 implementation utilizing masked Substitution Boxes (S-Boxes) with runtime-generated entropy.
  • Rigorous Benchmarking Suite: Nanosecond-resolution performance testing using serializing CPU barriers (cpuid), memory barriers (mfence), and direct time-stamp counter reads (__rdtsc()).
  • Linux perf Integration: The build system (Makefile) is wired for automatic cache and instruction auditing via perf stat.
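As an illustration of the automatic hardware detection mentioned above, here is a minimal sketch that uses the GCC/Clang builtins `__builtin_cpu_init` and `__builtin_cpu_supports`. The type `demo_cpu_features` and both function names are hypothetical and are not taken from the AxiomSSL sources; the library's real dispatch logic may look quite different.

```c
#include <stdbool.h>

/* Hypothetical feature record; not part of the AxiomSSL API. */
typedef struct {
    bool aesni;
    bool avx2;
} demo_cpu_features;

/* Query the running CPU. On non-x86 targets every flag stays false. */
demo_cpu_features demo_detect_cpu(void)
{
    demo_cpu_features f = { false, false };
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();
    f.aesni = __builtin_cpu_supports("aes") != 0;
    f.avx2  = __builtin_cpu_supports("avx2") != 0;
#endif
    return f;
}

/* Pick the fastest backend the CPU can run, falling back to plain C. */
const char *demo_pick_aes_backend(demo_cpu_features f)
{
    if (f.aesni && f.avx2) return "aesni+avx2";
    if (f.aesni)           return "aesni";
    return "portable-c";
}
```

Calling `demo_pick_aes_backend(demo_detect_cpu())` once at startup and caching the result keeps the detection cost out of the hot path.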

Supported Algorithms

  • Symmetric / Block Ciphers:
    • AES (Native AES-NI support for ECB, CBC, and CTR modes).
    • SM4 (Chinese standard, featuring masked S-Boxes).
  • Stream Ciphers:
    • ChaCha20 (Fully functional and vectorizable).
  • Hash Functions:
    • SHA-256.
  • Random Number Generation (RNG):
    • Cross-platform interface (reads from BCryptGenRandom on Windows, getrandom or /dev/urandom on Linux).
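The Linux side of the RNG interface described above could look roughly like the sketch below: try `getrandom` first and fall back to `/dev/urandom`. The function name `demo_fill_random` is an assumption made for illustration, and the Windows branch (BCryptGenRandom) is left out.

```c
#include <errno.h>
#include <stddef.h>
#include <stdio.h>
#include <sys/random.h>
#include <sys/types.h>

/* Fill buf with len bytes of OS entropy. Returns 0 on success, -1 on failure.
 * Hypothetical helper, not the AxiomSSL function. */
int demo_fill_random(void *buf, size_t len)
{
    unsigned char *p = buf;
    size_t done = 0;

    while (done < len) {
        ssize_t n = getrandom(p + done, len - done, 0);
        if (n > 0) {
            done += (size_t)n;          /* partial reads are allowed */
        } else if (n < 0 && errno == EINTR) {
            continue;                   /* interrupted by a signal: retry */
        } else {
            break;                      /* syscall unavailable: use fallback */
        }
    }
    if (done == len)
        return 0;

    FILE *f = fopen("/dev/urandom", "rb");
    if (f == NULL)
        return -1;
    size_t got = fread(p + done, 1, len - done, f);
    fclose(f);
    return got == len - done ? 0 : -1;
}
```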

System Requirements

To compile and run the high-performance benchmarks, the following is recommended:

  • OS: Linux (High-precision metrics and mlock require POSIX/Linux syscalls).
  • Compiler: gcc (support for GNU extensions and -march=native).
  • Tools: make, perf (linux-tools).
  • Hardware: x86_64 processor with AES-NI and AVX2 support.

Build & Installation

The project includes a smart Makefile that reads /proc/cpuinfo to dynamically inject the appropriate compilation flags (-maes, -mavx2, etc.).

# 1. Clone the repository
git clone https://github.com/DarThunder/AxiomSSL
cd AxiomSSL

# 2. Check if your CPU supports the required extensions
make check_aes_support

# 3. Build the static (.a) and shared (.so) libraries
make all

# 4. Run basic integrity tests
make test

Benchmarking Suite

This project takes pride in its ability to measure itself. The test suite locks memory in RAM (mlock), sets scheduler priorities (SCHED_FIFO), and filters outliers to give you raw, pure data.
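The environment setup described above (memory locking and real-time scheduling) can be sketched with the standard POSIX calls `mlock` and `sched_setscheduler`. The helper name and the flag constants are hypothetical. Both calls normally need elevated privileges or raised limits, so this sketch reports what succeeded instead of aborting.

```c
#include <sched.h>
#include <stddef.h>
#include <sys/mman.h>

/* Hypothetical result flags for this sketch. */
enum { DEMO_LOCKED = 1, DEMO_REALTIME = 2 };

/* Pin the benchmark buffer in RAM and request SCHED_FIFO.
 * Returns a bitmask of the steps that succeeded. */
int demo_prepare_benchmark(void *buf, size_t len)
{
    int got = 0;

    /* Locked pages cannot be swapped out, which avoids page faults
     * in the middle of a timed run. */
    if (mlock(buf, len) == 0)
        got |= DEMO_LOCKED;

    /* SCHED_FIFO keeps ordinary tasks from preempting the benchmark. */
    struct sched_param sp = { 0 };
    sp.sched_priority = sched_get_priority_max(SCHED_FIFO);
    if (sched_setscheduler(0, SCHED_FIFO, &sp) == 0)
        got |= DEMO_REALTIME;

    return got;
}
```

A caller can print a warning when `DEMO_REALTIME` is missing from the result, since timings taken under the default scheduler tend to be noisier.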

To run the benchmarks:

# Run the performance benchmark using the Linux `perf` tool
make performance

# If you want to audit cache-misses by forcing RAM flushes:
make audit

Performance Benchmarks

The benchmarking suite is designed for extreme precision, utilizing __rdtsc with CPU/memory barriers, mlock to prevent page faults, and SCHED_FIFO for low latency.
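A fenced time-stamp read of the kind described here might look like the following sketch. It issues `cpuid` and `mfence` before `__rdtsc()`, in the order the feature list names them; the exact barrier placement used by AxiomSSL is an assumption. All function names are hypothetical, and a `clock_gettime` fallback is included only so the sketch builds on non-x86 machines.

```c
#include <stdint.h>

#if defined(__x86_64__)
#include <cpuid.h>
#include <x86intrin.h>

/* Serialized TSC read: cpuid drains the pipeline, mfence orders memory. */
uint64_t demo_ticks_now(void)
{
    unsigned int eax, ebx, ecx, edx;
    __cpuid(0, eax, ebx, ecx, edx);
    (void)eax; (void)ebx; (void)ecx; (void)edx;
    _mm_mfence();
    return __rdtsc();
}
#else
#include <time.h>

/* Portable fallback: monotonic nanoseconds instead of TSC ticks. */
uint64_t demo_ticks_now(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}
#endif

/* Example workload: a loop the compiler cannot optimize away. */
void demo_spin(void *arg)
{
    volatile unsigned sink = 0;
    unsigned n = *(unsigned *)arg;
    for (unsigned i = 0; i < n; i++)
        sink += i;
}

/* Ticks elapsed while fn(arg) runs. */
uint64_t demo_time_call(void (*fn)(void *), void *arg)
{
    uint64_t start = demo_ticks_now();
    fn(arg);
    uint64_t end = demo_ticks_now();
    return end - start;
}
```

The measured interval includes the cost of the barriers themselves, so a real harness would typically time an empty call and subtract that baseline.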

Test Environment

All benchmarks were executed on a local machine with the following specifications:

  • CPU: AMD Ryzen 5 5500U (Zen 2 Architecture)
  • OS: NixOS (Linux-zen kernel)
  • RAM: 16GB @ 2667 MT/s
  • Storage: 512GB NVMe SSD
  • WM: Hyprland 0.54.1

Tests executed on a 2047 MB buffer (about 33.5 million 64-byte blocks, or about 134 million 16-byte blocks for the AES and SM4 tests) over 20 iterations (outliers filtered).
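The metrics in the result tables below are linked by simple arithmetic, sketched here. The struct and function names are hypothetical. The divisor of 2^20 bytes per MB is an inference from the published numbers (for example, 9752.39 MB/s is listed as 9.52 GB/s, a factor of 1024), not something the source states.

```c
/* Derived benchmark metrics (hypothetical struct for this sketch). */
typedef struct {
    double throughput_mb_s;   /* MB/s with 1 MB = 2^20 bytes (assumed) */
    double cycles_per_byte;
} demo_metrics;

/* Derive throughput and cycles per byte from per-block measurements. */
demo_metrics demo_derive_metrics(double cycles_per_block,
                                 double ns_per_block,
                                 unsigned block_bytes)
{
    demo_metrics m;
    double bytes_per_second = block_bytes / (ns_per_block * 1e-9);
    m.throughput_mb_s = bytes_per_second / (1024.0 * 1024.0);
    m.cycles_per_byte = cycles_per_block / block_bytes;
    return m;
}
```

Feeding in the ChaCha20 row (235.03 cycles and 112.13 ns per 64-byte block) gives roughly 544.3 MB/s and 3.67 cycles per byte, which agrees with the table.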

ChaCha20 (SSE 1-Block Implementation)

| Metric | Result |
|---|---|
| Throughput | 544.32 MB/s |
| Cycles per byte | 3.67 cpb |
| Cycles per block | 235.03 cycles |
| Time per block | 112.13 ns |
| Measurement Variance | 0.04% (Extremely stable) |

Developer Note: The current ChaCha20 implementation relies on 128-bit xmm registers to process one 64-byte block at a time, achieving a very respectable ~3.7 cpb. Future work includes implementing a 4-block or 8-block parallel AVX2 kernel (ymm registers) to push the throughput beyond the 2 GB/s threshold.
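To show why one 64-byte block fits naturally into 128-bit registers, here is a sketch of the ChaCha20 quarter-round (as specified in RFC 8439) applied lane-wise to four rows of four 32-bit words. Each array plays the role of one xmm register, so a single pass performs four quarter-rounds at once. This is plain C, not the SSE kernel from the repository, and the function names are hypothetical.

```c
#include <stdint.h>

static inline uint32_t demo_rotl32(uint32_t x, int n)
{
    return (x << n) | (x >> (32 - n));
}

/* Quarter-round on four lanes at once. a, b, c, d are the four rows of the
 * 4x4 ChaCha state; lane i of each row forms one column. */
void demo_chacha_qr_rows(uint32_t a[4], uint32_t b[4],
                         uint32_t c[4], uint32_t d[4])
{
    for (int i = 0; i < 4; i++) {
        a[i] += b[i]; d[i] = demo_rotl32(d[i] ^ a[i], 16);
        c[i] += d[i]; b[i] = demo_rotl32(b[i] ^ c[i], 12);
        a[i] += b[i]; d[i] = demo_rotl32(d[i] ^ a[i], 8);
        c[i] += d[i]; b[i] = demo_rotl32(b[i] ^ c[i], 7);
    }
}
```

In a SIMD version each statement becomes one vector add, XOR, or rotate over a whole row, and the diagonal rounds are handled by rotating the lanes of rows b, c, and d between passes.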

AES-ECB (Hardware Accelerated via AES-NI / AVX2)

| Metric | Result |
|---|---|
| Throughput | 9752.39 MB/s (9.52 GB/s) |
| Cycles per byte | 0.205 cpb |
| Cycles per block | 3.28 cycles |
| Time per block | 1.56 ns |
| Measurement Variance | 0.03% (Extremely stable) |

Developer Note: This throughput (~9.5 GB/s) is achieved by combining dedicated hardware instructions (AES-NI) with parallel block processing: the custom assembly kernels unroll 12 independent blocks across 128-bit or 256-bit wide vector registers, so the pipelined AES units always have work in flight. Reaching 0.205 cycles per byte effectively saturates the CPU's cryptographic pipeline on the Zen 2 architecture.

Note on Cycle Counting: The cycles per byte metric is derived from the __rdtsc instruction. On modern microarchitectures (like Zen 2), this counts invariant Time-Stamp Counter (TSC) reference cycles, which tick at a fixed rate. It does not count actual core clock cycles, whose rate varies with dynamic frequency scaling (Precision Boost on AMD, Turbo Boost on Intel).
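Because the counts are TSC reference cycles, the TSC rate can be recovered from any table row, and a reference-cycle count can be rescaled to approximate core cycles once a core frequency is assumed. The sketch below shows both conversions. The function names are hypothetical, and any core frequency passed in is a guess by the caller, since the source does not report the clock speed during the runs.

```c
/* TSC rate in GHz implied by one table row (cycles and ns per block). */
double demo_tsc_ghz(double tsc_cycles_per_block, double ns_per_block)
{
    return tsc_cycles_per_block / ns_per_block;
}

/* Rescale TSC reference cycles to core cycles for an ASSUMED core clock. */
double demo_core_cycles(double tsc_cycles, double tsc_ghz, double core_ghz)
{
    return tsc_cycles * core_ghz / tsc_ghz;
}
```

For the AES-ECB row, `demo_tsc_ghz(3.28, 1.56)` comes out near 2.1 GHz. If the core were running at a hypothetical 4.0 GHz, `demo_core_cycles(3.28, 2.1, 4.0)` would put one block at a little over 6 core cycles.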

AES-CBC (Hardware Accelerated via AES-NI)

| Metric | Result |
|---|---|
| Throughput | 1445.19 MB/s (1.41 GB/s) |
| Cycles per byte | 1.38 cpb |
| Cycles per block | 22.13 cycles |
| Time per block | 10.55 ns |
| Measurement Variance | 0.08% (Extremely stable) |

Developer Note: The stark contrast between AES-ECB (~9.5 GB/s) and AES-CBC (~1.4 GB/s) illustrates what happens when instruction-level parallelism (ILP) is unavailable. CBC encryption is inherently serial ($C_i = E_K(P_i \oplus C_{i-1})$), meaning the CPU cannot start encrypting block $N$ until block $N-1$ is completely encrypted. While ECB measures the maximum throughput of the Zen 2 AES pipeline, this CBC benchmark measures its raw hardware latency: about 22 TSC reference cycles to push a single block through 10 AES rounds (the AES-128 round count).
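The dependency chain can be made concrete with a small sketch that contrasts the two loop structures. `demo_toy_encrypt` is a deliberately trivial stand-in for $E_K$ and is NOT a cipher; it exists only so the example is self-contained. All names here are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DEMO_BLOCK 16

/* Stand-in for E_K. NOT secure, NOT AES. in and out must not overlap. */
static void demo_toy_encrypt(const uint8_t key[DEMO_BLOCK],
                             const uint8_t in[DEMO_BLOCK],
                             uint8_t out[DEMO_BLOCK])
{
    for (int i = 0; i < DEMO_BLOCK; i++)
        out[i] = (uint8_t)(in[(i + 1) % DEMO_BLOCK] ^ key[i]);
}

/* ECB: every iteration is independent, so blocks can be interleaved. */
void demo_ecb_encrypt(const uint8_t key[DEMO_BLOCK], const uint8_t *pt,
                      uint8_t *ct, size_t nblocks)
{
    for (size_t b = 0; b < nblocks; b++)
        demo_toy_encrypt(key, pt + b * DEMO_BLOCK, ct + b * DEMO_BLOCK);
}

/* CBC: iteration b needs the ciphertext produced by iteration b - 1. */
void demo_cbc_encrypt(const uint8_t key[DEMO_BLOCK],
                      const uint8_t iv[DEMO_BLOCK], const uint8_t *pt,
                      uint8_t *ct, size_t nblocks)
{
    uint8_t chain[DEMO_BLOCK];
    memcpy(chain, iv, DEMO_BLOCK);
    for (size_t b = 0; b < nblocks; b++) {
        uint8_t x[DEMO_BLOCK];
        for (int i = 0; i < DEMO_BLOCK; i++)
            x[i] = (uint8_t)(pt[b * DEMO_BLOCK + i] ^ chain[i]);
        demo_toy_encrypt(key, x, ct + b * DEMO_BLOCK);
        memcpy(chain, ct + b * DEMO_BLOCK, DEMO_BLOCK);  /* C_i feeds C_{i+1} */
    }
}
```

With two identical plaintext blocks, the ECB output repeats while the CBC output does not. The same chaining that hides the repetition is what forces each block to wait for the previous one.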

AES-CTR (Hardware Accelerated via AES-NI / AVX2)

| Metric | Result |
|---|---|
| Throughput | 7904.97 MB/s (7.72 GB/s) |
| Cycles per byte | 0.253 cpb |
| Cycles per block | 4.05 cycles |
| Time per block | 1.93 ns |
| Measurement Variance | 0.21% (Extremely stable) |

Developer Note: CTR mode represents the sweet spot between raw hardware throughput and cryptographic security. The counter blocks are independent of one another, so they are encrypted in parallel by the underlying ECB engine, and the resulting keystream is then XORed with the plaintext using 256-bit AVX2 vectors (_mm256_xor_si256). This reaches ~7.7 GB/s. The modest overhead compared to pure ECB (~9.5 GB/s) comes mainly from the memory loads/stores required for the final XOR phase.
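The two phases described in this note (encrypt the counter blocks, then XOR the keystream into the data) can be sketched in scalar C. `demo_toy_block` is a trivial placeholder for the AES block function and is NOT a cipher. The counter layout (12-byte nonce followed by a 32-bit big-endian counter) is an assumption made for the sketch, not a statement about AxiomSSL's format.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Placeholder for the block cipher. NOT secure, NOT AES. */
static void demo_toy_block(const uint8_t key[16], const uint8_t in[16],
                           uint8_t out[16])
{
    for (int i = 0; i < 16; i++)
        out[i] = (uint8_t)(in[(i + 1) % 16] ^ key[i]);
}

/* CTR: the same call encrypts and decrypts. in and out may be equal. */
void demo_ctr_xor(const uint8_t key[16], const uint8_t nonce[12],
                  const uint8_t *in, uint8_t *out, size_t len)
{
    uint32_t ctr = 0;
    for (size_t off = 0; off < len; off += 16, ctr++) {
        uint8_t block[16], ks[16];

        /* Phase 1: build and encrypt counter block number ctr. Nothing here
         * depends on earlier blocks, so this part parallelizes like ECB. */
        memcpy(block, nonce, 12);
        block[12] = (uint8_t)(ctr >> 24);
        block[13] = (uint8_t)(ctr >> 16);
        block[14] = (uint8_t)(ctr >> 8);
        block[15] = (uint8_t)ctr;
        demo_toy_block(key, block, ks);

        /* Phase 2: XOR the keystream into the data (the step the note
         * attributes to _mm256_xor_si256). */
        size_t n = len - off < 16 ? len - off : 16;
        for (size_t i = 0; i < n; i++)
            out[off + i] = (uint8_t)(in[off + i] ^ ks[i]);
    }
}
```

Because decryption is the same XOR, applying `demo_ctr_xor` twice with the same key and nonce returns the original data, and a final partial block needs no padding.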

SM4 (Software Implementation with Side-Channel Mitigation)

| Metric | Result |
|---|---|
| Throughput | 98.30 MB/s |
| Cycles per byte | 20.33 cpb |
| Cycles per block | 325.36 cycles |
| Time per block | 155.23 ns |
| Measurement Variance | 0.03% (Extremely stable) |

Developer Note: Unlike AES, the Zen 2 architecture lacks native hardware instructions for the SM4 Chinese cipher. Furthermore, this specific implementation incorporates Masked S-Boxes to mitigate side-channel attacks (timing and power analysis). The overhead of runtime entropy generation and scalar execution explains the ~20 cpb cost, perfectly illustrating the classic cryptographic trade-off between extreme performance and side-channel resilience.
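One common way to build a masked S-box is table recomputation: a fresh table is derived from random input and output masks so that the lookup never touches the unmasked value. The sketch below shows that idea only. Whether AxiomSSL uses this exact scheme is an assumption, `demo_sbox` is a toy bijection standing in for the real SM4 S-box, and all names are hypothetical.

```c
#include <stdint.h>

/* Toy 8-bit bijection standing in for the SM4 S-box (NOT the real table). */
uint8_t demo_sbox(uint8_t x)
{
    return (uint8_t)(x * 7u + 3u);
}

/* Recompute the table for masks (m_in, m_out) so that
 * table[x ^ m_in] == demo_sbox(x) ^ m_out for every x. */
void demo_build_masked_sbox(uint8_t table[256], uint8_t m_in, uint8_t m_out)
{
    for (unsigned x = 0; x < 256; x++)
        table[(uint8_t)(x ^ m_in)] = (uint8_t)(demo_sbox((uint8_t)x) ^ m_out);
}

/* Lookup on a masked input; the result stays masked with m_out. */
uint8_t demo_masked_lookup(const uint8_t table[256], uint8_t masked_x)
{
    return table[masked_x];
}
```

In a real implementation the masks would come from the RNG at runtime, and rebuilding the 256-entry table is part of the overhead this note attributes to entropy generation.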

SHA-256 (Scalar C Implementation)

| Metric | Result |
|---|---|
| Throughput | 266.46 MB/s |
| Cycles per byte | 7.50 cpb |
| Cycles per block | 480.11 cycles |
| Time per block | 229.06 ns |
| Measurement Variance | 0.26% (Extremely stable) |

Developer Note: Unlike AES-ECB or ChaCha20, SHA-256 is an inherently serial algorithm (the state of block $N$ strictly depends on the output of block $N-1$). This prevents aggressive parallelization on a single data stream. Achieving ~266 MB/s (7.5 cpb) via standard bitwise operations in C is a solid scalar baseline. Future work entails replacing the scalar sha256_compress function with the hardware SHA extensions (intrinsics enabled by the -msha compiler flag) to further reduce the cycles per byte.
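For reference, a scalar compression function in the style this note describes can be sketched as follows, based on the FIPS 180-4 definition. The name `demo_sha256_compress` and its signature are assumptions; the repository's own sha256_compress may differ. The serial dependency is visible in the signature: each call reads and updates the same eight-word state.

```c
#include <stdint.h>

#define DEMO_ROTR(x, n) (((x) >> (n)) | ((x) << (32 - (n))))

static const uint32_t demo_k[64] = {
    0x428a2f98, 0x71374491, 0xb5c0fbcf, 0xe9b5dba5, 0x3956c25b, 0x59f111f1, 0x923f82a4, 0xab1c5ed5,
    0xd807aa98, 0x12835b01, 0x243185be, 0x550c7dc3, 0x72be5d74, 0x80deb1fe, 0x9bdc06a7, 0xc19bf174,
    0xe49b69c1, 0xefbe4786, 0x0fc19dc6, 0x240ca1cc, 0x2de92c6f, 0x4a7484aa, 0x5cb0a9dc, 0x76f988da,
    0x983e5152, 0xa831c66d, 0xb00327c8, 0xbf597fc7, 0xc6e00bf3, 0xd5a79147, 0x06ca6351, 0x14292967,
    0x27b70a85, 0x2e1b2138, 0x4d2c6dfc, 0x53380d13, 0x650a7354, 0x766a0abb, 0x81c2c92e, 0x92722c85,
    0xa2bfe8a1, 0xa81a664b, 0xc24b8b70, 0xc76c51a3, 0xd192e819, 0xd6990624, 0xf40e3585, 0x106aa070,
    0x19a4c116, 0x1e376c08, 0x2748774c, 0x34b0bcb5, 0x391c0cb3, 0x4ed8aa4a, 0x5b9cca4f, 0x682e6ff3,
    0x748f82ee, 0x78a5636f, 0x84c87814, 0x8cc70208, 0x90befffa, 0xa4506ceb, 0xbef9a3f7, 0xc67178f2
};

/* Absorb one 64-byte block into the running state (updated in place). */
void demo_sha256_compress(uint32_t state[8], const uint8_t block[64])
{
    uint32_t w[64];

    /* Message schedule: 16 big-endian words expanded to 64. */
    for (int i = 0; i < 16; i++)
        w[i] = ((uint32_t)block[4 * i] << 24) | ((uint32_t)block[4 * i + 1] << 16) |
               ((uint32_t)block[4 * i + 2] << 8) | (uint32_t)block[4 * i + 3];
    for (int i = 16; i < 64; i++) {
        uint32_t s0 = DEMO_ROTR(w[i - 15], 7) ^ DEMO_ROTR(w[i - 15], 18) ^ (w[i - 15] >> 3);
        uint32_t s1 = DEMO_ROTR(w[i - 2], 17) ^ DEMO_ROTR(w[i - 2], 19) ^ (w[i - 2] >> 10);
        w[i] = w[i - 16] + s0 + w[i - 7] + s1;
    }

    uint32_t a = state[0], b = state[1], c = state[2], d = state[3];
    uint32_t e = state[4], f = state[5], g = state[6], h = state[7];

    /* 64 rounds; each round depends on the previous one. */
    for (int i = 0; i < 64; i++) {
        uint32_t S1 = DEMO_ROTR(e, 6) ^ DEMO_ROTR(e, 11) ^ DEMO_ROTR(e, 25);
        uint32_t ch = (e & f) ^ (~e & g);
        uint32_t t1 = h + S1 + ch + demo_k[i] + w[i];
        uint32_t S0 = DEMO_ROTR(a, 2) ^ DEMO_ROTR(a, 13) ^ DEMO_ROTR(a, 22);
        uint32_t maj = (a & b) ^ (a & c) ^ (b & c);
        uint32_t t2 = S0 + maj;
        h = g; g = f; f = e; e = d + t1;
        d = c; c = b; b = a; a = t1 + t2;
    }

    state[0] += a; state[1] += b; state[2] += c; state[3] += d;
    state[4] += e; state[5] += f; state[6] += g; state[7] += h;
}
```

Hashing a message means padding it, loading the standard initial state, and calling this function once per block in order, which is why a single stream cannot be split across independent lanes.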

FAQ

Q: Why did you write your own crypto if everyone says not to?
A: Because it's the only way to truly understand how it works under the hood. Managing SIMD registers, dealing with CPU out-of-order execution, and seeing how cache misses destroy performance is an invaluable engineering experience.

Q: Is the RNG code secure?
A: It delegates entropy to the OS: getrandom or /dev/urandom on Linux, BCryptGenRandom on Windows. That is statistically sound for testing purposes, but as always, do not use it in production.

Q: Can I use this code for my university project?
A: Absolutely! It's GPLv2 licensed. Use it, break it, and if you find something wrong, or something that could be done better, open an issue or PR. Feedback is genuinely welcome.

License

This project is licensed under the GPLv2 License. See the LICENSE file for details.
