Apple Color Emoji





Canine-Table

April 23, 2025

# Contents

| Definitions                |
|----------------------------|
| Constants                  |
| Mathematical Constants     |
| Physical Constants         |
| Astronomical Constants     |
| Engineering Constants      |
| Basic Operations           |
| Endian Macros XI           |
| Endian Swap Functions      |
| Architecture and Alignment |
| CPU Features               |
| CPU Core Detection         |



### **Definitions**

### **I Definitions**

- Constants: Includes mathematical, physical, astronomical, and engineering constants.
- Sasic Operations: Contains macros for arithmetic, bitwise operations, and geometry.
- ♥ Variable Types: Definitions of pointers, floating-point types, and integers.
- Endian Macros: Definitions for byte order detection and endian swapping.
- Architecture and Alignment: Macros and types for architecture detection and memory alignment.
- CPU Features: Detection of instruction sets, branch prediction, and prefetching capabilities.



### **I** Constants

### **I Constants**

This section defines constants for various domains, including mathematical, physical, astronomical, and engineering applications.

- Physical Constants: Includes speed of light, Planck's constant, and more.
- Astronomical Constants: Includes values like AU, parsec, and solar mass.
- Engineering Constants: Includes data size units and floating-point precision limits.



#### I Mathematical Constants

#### I Mathematical Constants

Mathematical constants include commonly used values for calculations, such as the golden ratio, Pi, and Euler's number.

### Mathematical Constants

```
#define NX GOLDEN RATIO 1.618033988749895
#define NX_PI 3.141592653589793
#define NX_TAU 6.283185307179586
#define NX E 2.718281828459045
#define NX_SQRT2 1.414213562373095
#define NX_LN2 0.6931471805599453
#define NX LN10 2.302585092994046
```

/\* Mathematical Constants \*/ define NX\_GOLDEN\_RATIO 1.618033988749895 /\* Golden Ratio (phi) \*/ define NX\_PI 3.141592653589793 /\* Value of Pi \*/ define NX\_TAU 6.283185307179586 /\* Tau (2 \* Pi) \*/ define NX\_E 2.718281828459045 /\* Euler's number \*/ define NX\_SQRT2 1.414213562373095 /\* Square root of 2 \*/ define NX\_LN2 0.6931471805599453 /\* Natural logarithm of 2 \*/ define NX\_LN10 2.302585092994046 /\* Natural logarithm of 10 \*/



### I Physical Constants

### I Physical Constants

Physical constants include fundamental values such as the speed of light, Planck's constant, and Boltzmann constant.

### Physical Constants

```
/* Physical Constants */
#define NX LIGHT SPEED 299792458
#define NX_GRAVITY 9.80665
#define NX_PLANCK 6.62607015e-34
#define NX BOLTZMANN 1.380649e-23
#define NX_AVOGADRO 6.02214076e23
#define NX_GAS_CONSTANT 8.314462618
#define NX_ELECTRON_MASS 9.10938356e-31
#define NX_PROTON_MASS 1.67262192369e-27
#define NX_ELEM_CHARGE 1.602176634e-19
#define NX_PERMITTIVITY 8.854187817e-12
#define NX PERMEABILITY 1.2566370614e-6
/* Physical Constants */ define NX_LIGHT_SPEED 299792458 /* Speed of light in vacuum (m/s) */
define NX_GRAVITY 9.80665 /* Standard gravity (m/s<sup>2</sup>) * /defineNX_PLANCK6.62607015e -
          Planck's constant(Js) * /define NX\_BOLTZMANN 1.380649e

Boltzmann constant(J/K) * /define NX\_AVOGADRO 6.02214076e 23
34/
                                       /defineNX\_AVOGADRO6.02214076e23/
23/
31/*Electronmass(kg)*/defineNX_PROTON_MASS1.67262192369e -
Protonmass(kg) * /defineNX\_ELEM\_CHARGE1.602176634e - 19 / * Elementary charge(C) * 
/defineNX\ PERMITTIVITY8.854187817e\ -\ 12/\ *\ Vacuum permittivity (F/m)
/defineNX\ PERMEABILITY 1.2566370614e - 6/* Vacuum permeability (H/m)*/
```

୶ୢୖୄ୶ଋ



#### I Astronomical Constants

#### I Astronomical Constants

Astronomical constants include values used for celestial calculations, such as the astronomical unit (AU), parsec, and solar mass.

#### **Astronomical Constants**

```
#define NX_AU 149597870700
#define NX_PARSEC 3.085677581e16
#define NX_SOLAR_MASS 1.989e30
#define NX_EARTH_MASS 5.972e24
#define NX LUNAR MASS 7.342e22
                                      /* Earth's mean radius (meters) */
#define NX EARTH RADIUS 6371000
#define NX_EARTH_ORBITAL_PERIOD 365.25
#define NX MOON DISTANCE 384400000
```

/\* Astronomy Constants \*/ define NX\_AU 149597870700 /\* Astronomical Unit (meters) \*/ define NX\_PARSEC 3.085677581e16 /\* Parsec (meters) \*/ define NX\_SOLAR\_MASS 1.989e30 /\* Mass of the Sun (kg) \*/ define NX\_EARTH\_MASS 5.972e24 /\* Mass of the Earth (kg) \*/ define NX\_LUNAR\_MASS 7.342e22 /\* Mass of the Moon (kg) \*/ define NX EARTH RADIUS 6371000 /\* Earth's mean radius (meters) \*/ define NX\_EARTH\_ORBITAL\_PERIOD 365.25 /\* Earth's orbital period (days) \*/ define NX\_MOON\_DISTANCE 384400000 /\* Average Earth-Moon distance (meters) \*/



### **I** Engineering Constants

### I Engineering Constants

Engineering constants include values such as data size units and floating-point precision limits.

## **Engineering Constants**

```
/* Engineering and Computer Science Constants */
#define NX_KILOBYTE 1024 /* Bytes in a kilobyte */
#define NX_MEGABYTE (1024 * NX_KILOBYTE) /* Bytes in a megabyte */
#define NX_GIGABYTE (1024 * NX_MEGABYTE) /* Bytes in a gigabyte */
#define NX_FLOAT_EPSILON 1.19209290e-7 /* Smallest float difference

$\times(32-\text{bit})\ */
#define NX_DOUBLE_EPSILON 2.22044605e-16 /* Smallest double difference

$\times(64-\text{bit})\ */
#define NX_MAX_INT 2147483647 /* Maximum value of a 32-bit int */
#define NX_MIN_INT -2147483648 /* Minimum value of a 32-bit int */
```

/\* Engineering and Computer Science Constants \*/ define NX\_KILOBYTE 1024 /\* Bytes in a kilobyte \*/ define NX\_MEGABYTE (1024 \* NX\_KILOBYTE) /\* Bytes in a megabyte \*/ define NX\_GIGABYTE (1024 \* NX\_MEGABYTE) /\* Bytes in a gigabyte \*/ define NX\_FLOAT\_EPSILON 1.19209290e-7 /\* Smallest float difference (32-bit) \*/ define NX\_DOUBLE\_EPSILON 2.22044605e-16 /\* Smallest double difference (64-bit) \*/ define NX\_MAX\_INT 2147483647 /\* Maximum value of a 32-bit int \*/ define NX\_MIN\_INT -2147483648 /\* Minimum value of a 32-bit int \*/



### I Basic Operations

#### I Basic Operations

This section includes macros for basic arithmetic operations, bitwise operations, range checks, and geometry-related formulas.

```
Basic Operations
 #define NX_SQUARE_N(N) ((N) * (N))
 #define NX_CUBE_N(N) (NX_SQUARE_N(N) * (N))
#define NX_MIN(A, B) ((A) < (B) ? (A) : (B))
#define NX_MAX(A, B) ((A) > (B) ? (A) : (B))
#define NX_AVG(A, B) (((A) + (B)) / 2)
#define NX_ODD_N(N) ((N) % 2 == 1)
#define NX_{EVEN_N(N)} ((N) % 2 == 0)
#define NX_{ABS}(N) ((N) < 0 ? -(N) : (N))
#define NX_IS_POWER_OF_TWO(N) ((N) && !((N) & ((N) - 1))) /* Checks if N is
 /* Basic Operations */ define NX_SQUARE_N(N) ((N) * (N)) /* Calculates the square of a number */ define
 NX_CUBE_N(N) (NX_SQUARE_N(N) * (N)) /* Calculates the cube of a number */ define NX_MIN(A, B)
 ((A) < (B) ? (A) : (B)) / *Returns the minimum of two numbers */ define NX_MAX(A, B) ((A) > (B) ? (A)
 : (B)) /* Returns the maximum of two numbers */ define NX_AVG(A, B) (((A) + (B)) / 2) /* Calculates the
 average of two numbers */ define NX_ODD_N(N) ((N) define NX_EVEN_N(N) ((N) define NX_ABS(N)
 ((N) < 0 ? -(N) : (N)) /* Returns the absolute value of a number */ define NX_IS_POWER_OF_TWO(N)
 ((N) !((N) -1))) /* Checks if N is a power of 2 */
```



#### I Endian Macros

Endian macros are used to detect the byte order (endianness) of the platform and provide utilities for converting between big-endian and little-endian formats.

### Why does Intel use little-endian while older processors use big-endian?

Intel's x86 architecture, which became dominant in personal computers, adopted little-endian because it made certain operations easier. Many older systems and RISC-based processors (like PowerPC, SPARC, and older ARM designs) started with big-endian for compatibility with older networking and data transmission standards. Big-endian was originally favored because it aligns with how humans typically write numbers—most significant digit first (like 1234, not 4321). Little-endian, on the other hand, makes certain low-level memory operations more efficient, like reading variable-sized numbers without needing adjustments.

### **Pros and Cons of Each Endianness**

- Big-Endian (Most Significant Byte First):
  - ② Easier to read for humans (matches how numbers are written).
  - ⊘ Often used in networking protocols (big-endian is standard in IP/TCP).
  - Can require extra steps when handling certain memory operations.
- Little-Endian (Least Significant Byte First, Used by Intel):

  - ✓ Easier to handle variable-length data structures.
  - Solution Feels counterintuitive when reading raw memory data (since the bytes are in "reverse" order).



**Why is it called "Little-Endian"?** The term "Little-Endian" comes from Jonathan Swift's *Gulliver's Travels*, where people argued over which end of an egg to break first: the Big End or the Little End. Computer scientists borrowed the term to describe data storage formats.

**How Does This Relate to Two's Complement?** Two's complement is a way computers store negative numbers, and while endianness doesn't affect the math of two's complement itself, it does influence how the bytes are arranged in memory.

- **Big-Endian:** The sign bit is stored at the beginning.
- Little-Endian: The sign bit is stored at the end.

#### **Endian Macro Definitions**

```
#if defined(__BYTE_ORDER__) && (__BYTE_ORDER__ == __ORDER_BIG_ENDIAN__)
        #define NX BIG ENDIAN 1
        #define NX_LITTLE_ENDIAN 0
    #elif defined(_BYTE_ORDER__) && (_BYTE_ORDER__ == _ORDER_LITTLE_ENDIAN__)
        #define NX BIG ENDIAN 0
        #define NX_LITTLE_ENDIAN 1
    #elif defined(__BIG_ENDIAN__) || defined(_BIG_ENDIAN) || defined(__ARMEB__) ||

→defined(__MIPSEB__) || defined(__POWERPC__) || defined(__sparc__)

        #define NX_BIG_ENDIAN 1
        #define NX_LITTLE_ENDIAN 0
    #else
        #define NX_BIG_ENDIAN 0
        #define NX_LITTLE_ENDIAN 1
    #endif
    #if NX BIG ENDIAN
        #define NX_TO_LITTLE16(X) nx_swap16(X)
        #define NX_TO_LITTLE32(X) nx_swap32(X)
        #define NX_TO_LITTLE64(X) nx_swap64(X)
        #define NX_TO_BIG16(X) (X)
#define NX_TO_BIG32(X) (X)
        #define NX_TO_BIG64(X) (X)
23
        #define NX_TO_LITTLE16(X) (X)
        #define NX_TO_LITTLE32(X) (X)
        #define NX_TO_LITTLE64(X) (X)
        \#define NX_{TO}BIG16(X) nx_{swap}16(X)
28
        #define NX_TO_BIG32(X) nx_swap32(X)
29
        #define NX_TO_BIG64(X) nx_swap64(X)
30
    #endif
```



/\* Endian Macros \*/ if defined(\_BYTE\_ORDER\_\_) (\_BYTE\_ORDER\_\_ == \_ORDER\_BIG\_ENDIAN\_\_) define NX\_BIG\_ENDIAN 1 define NX\_LITTLE\_ENDIAN 0 elif defined(\_BYTE\_ORDER\_\_) (\_BYTE\_ORDER\_\_ == \_ORDER\_LITTLE\_ENDIAN\_\_) define NX\_BIG\_ENDIAN 0 define NX\_LITTLE\_ENDIAN 1 elif defined(\_BIG\_ENDIAN\_\_) || defined(\_BIG\_ENDIAN) || defined(\_ARMEB\_\_) || defined(\_MIPSEB\_\_) || defined(\_POWERPC\_\_) || defined(\_sparc\_\_) define NX\_BIG\_ENDIAN 1 define NX\_LITTLE\_ENDIAN 0 else define NX\_BIG\_ENDIAN 0 define NX\_LITTLE\_ENDIAN 1 endif

/\* Conditional Endian Swap Based on Architecture \*/ if NX\_BIG\_ENDIAN define NX\_TO\_LITTLE16(X) nx\_swap16(X) define NX\_TO\_LITTLE32(X) nx\_swap32(X) define NX\_TO\_LITTLE64(X) nx\_swap64(X) define NX\_TO\_BIG16(X) (X) /\* Already big-endian \*/ define NX\_TO\_BIG32(X) (X) define NX\_TO\_BIG64(X) (X) else define NX\_TO\_LITTLE16(X) (X) /\* Already little-endian \*/ define NX\_TO\_LITTLE32(X) (X) define NX\_TO\_BIG16(X) nx\_swap16(X) define NX\_TO\_BIG32(X) nx\_swap32(X) define NX\_TO\_BIG64(X) nx\_swap64(X) endif

### I Endian Swap Functions

### I Endian Swap Functions

These inline functions provide utilities for swapping the byte order (endianness) of 16-bit, 32-bit, and 64-bit values. They are useful for ensuring compatibility across architectures with different endianness.

#### **Endian Swap Functions**

```
static inline nx_u16_t nx_swap16(nx_u16_t v)
        return (v >> 8) | (v << 8);
    static inline nx_u32_t nx_swap32(nx_u32_t v)
        return ((v >> 24) & 0x000000FF)
            ((v >> 8) & 0x0000FF00)
            ((v << 8) & 0x00FF0000)
            ((v \ll 24) \& 0xFF000000);
    }
    static inline nx_u64_t nx_swap64(nx_u64_t v)
        return ((v >> 56) & 0x0000000000000FF) |
19
            ((v >> 40) & 0x000000000000FF00)
20
            ((v \gg 24) \& 0x0000000000FF0000)
21
                       & 0x0000000FF000000)
22
```





### What Are These Functions Doing?

These functions take a number in memory and flip the order of its bytes. This is necessary when converting data between big-endian and little-endian systems.

### How does nx32 work? (32-bit swap)

A 32-bit integer is made up of 4 bytes. Imagine it like this:

Byte0 | Byte1 | Byte2 | Byte3

In big-endian, it would be stored like this:

[AB] [CD] [EF] [GH] (Most significant byte first)

In little-endian, it would be:

[GH] [EF] [CD] [AB] (Least significant byte first)

The nx32 function rearranges the bytes by shifting them left or right, then using bitwise operations (&) to isolate each chunk before placing it into its new position.

#### How does nx64 work? (64-bit swap)

This does the same thing, but for a 64-bit number (which has 8 bytes). It follows the same pattern:

Byte0 | Byte1 | Byte2 | Byte3 | Byte4 | Byte5 | Byte6 | Byte7

Just like before, it shifts and isolates bytes to reverse their order.

### Why Is This Important?

If you're working on cross-platform applications or network protocols, you might need to convert data between formats to ensure all systems interpret it correctly. That's why these functions exist—to swap the byte order and avoid misinterpretation.



### What happens when you use » (bit shifting)?

When you use », you're shifting bits to the right. Each shift moves everything one position to the right, and depending on the system, new bits are filled with either zeros or sign bits. **Example with** [AB] [CD] [EF] [GH] (assuming a 32-bit number): v » 8 means everything moves 8 bits to the right:

> Before: [AB] [CD] [EF] [GH] After »8: [00] [AB] [CD] [EF] (GH gets pushed out)

If you shift by 24:

Before: [AB] [CD] [EF] [GH] After »24: [00] [00] [00] [AB] (Only the most significant byte remains)

### What happens when you use & (bit masking)?

The & operator compares two numbers bit by bit. If both bits are 1, the result is 1; otherwise, it's 0. **Example:** v & 0x000000FF keeps only the last byte, because 0x000000FF in binary is: 0000000 0000000 0000000 11111111

> [AB] [CD] [EF] [GH] & 00000000 00000000 00000000 111111111 = [00] [00] [00] [GH]

Essentially, & helps extract specific parts of the number, while » moves things around.

### How this applies to swapping endianness

The swap functions use:

- > to move bytes into the correct position
- & to extract only the parts we want before combining them into a new order

#### Bitmasking in CIDR notation

CIDR (Classless Inter-Domain Routing) uses bitmasking to define IP address ranges and subnet sizes.



### **How Bitmasking Works in CIDR**

CIDR notation looks like this: 192.168.1.0/24 The "/24" part is the subnet mask, which means:

- The first 24 bits of the IP address are fixed.
- The remaining 8 bits can vary (allowing for 256 possible addresses).

The subnet mask for /24 in binary:

Using a bitwise AND (&) operation, you can filter out the network portion of an IP address. **Example:** 

```
IP: 192.168.1.73 -> 11000000 10101000 00000001 01001001

MASK: 255.255.255.0 -> 11111111 1111111 1111111 00000000
```

Result: 192.168.1.0 -> 11000000 10101000 00000001 00000000

This operation ensures that any device within the subnet keeps the same "network" part of the IP, while only the last portion changes.

### Why Bitmasking is Important in Networking

- Tt helps routers quickly determine which subnet an IP belongs to.
- Makes it easy to allocate specific IP ranges to different parts of a network.
- Prevents overlapping IP addresses in large-scale networking.

So, in essence, CIDR uses bitmasks to group IP addresses together efficiently. That means networking and endianness both rely on shifting and masking bits, but for different reasons!



### **Architecture and Alignment**

### I Architecture and Alignment

These macros define memory alignment attributes for variables and structures in C. The \_\_attribute ((aligned(N))) directive tells the compiler to ensure that the specified variable or data structure is aligned to an N-byte boundary in memory.

- NX\_ALIGN\_4, NX\_ALIGN\_8, NX\_ALIGN\_16, NX\_ALIGN\_32: Align data to 4, 8, 16, or 32 bytes, respectively.
- NX\_CACHE\_ALIGN\_64, NX\_CACHE\_ALIGN\_128, NX\_CACHE\_ALIGN\_256: Align data to cache line sizes of 64, 128, or 256 bytes, respectively.
- NX IS 64BIT: Set to 1 if the system is 64-bit, otherwise set to 0.
- nx\_size\_t: Defines an unsigned integer type for sizes, appropriate for the architecture.
- nx\_ptrdiff\_t: Defines a signed integer type for pointer differences, appropriate for the architecture.

### Types of Alignments

- NX\_ALIGN\_4, NX\_ALIGN\_8, NX\_ALIGN\_16, NX\_ALIGN\_32:
  - Align data to 4, 8, 16, or 32 bytes, respectively.
  - ② Ensures variables are stored efficiently in memory.
  - ☑ Improves CPU performance by avoiding penalties for unaligned memory access.
- NX\_CACHE\_ALIGN\_64, NX\_CACHE\_ALIGN\_128, NX\_CACHE\_ALIGN\_256:
  - Align data to 64-, 128-, or 256-byte boundaries.
  - Optimizes cache performance by aligning data to cache line sizes.
  - Reduces cache thrashing and improves CPU efficiency.



### **^ I Architecture and Alignment**

### Why is Alignment Important?

Alignment is crucial for performance in systems where memory access is optimized for specific boundaries, particularly in:

- Embedded systems, where memory and processing resources are constrained.
- High-performance computing, where unaligned memory access can significantly impact throughput.
- SIMD (Single Instruction Multiple Data) operations, which require tightly aligned memory for vectorized processing.

### **Analogies for Clarity**

- NX\_ALIGN\_4: Think of this as saying, "Put this data in a spot that's a multiple of 4 bytes." It's akin to arranging chocolates in rows of 4 for uniformity and easy access.
- NX\_CACHE\_ALIGN\_64: This is like saying, "Make sure this data fits into a special box that's 64 bytes big." This helps the computer grab the data faster since it matches the size of its "grabber" (the cache line).

### Cache Thrashig

Cache thrashing happens when a computer keeps swapping data in and out of its cache too often, instead of using it efficiently. This slows things down because the CPU spends more time moving data around than doing actual work.

#### Think of it like a cluttered desk

Imagine you have a small desk and need a few papers to work on. If you keep shuffling papers on and off the desk because there's not enough space, you spend more time rearranging than actually working. That's cache thrashing—the CPU keeps replacing data in the cache because it doesn't fit, making everything slower.

### Why does this happen?

- Not enough cache space: If a program frequently accesses large amounts of data that don't fit in the cache, it keeps replacing items.
- Poor memory access patterns: Some algorithms keep switching between memory locations instead of accessing them efficiently.
- O Conflicting cache lines: If multiple pieces of data keep landing in the same cache slot, they overwrite each other too quickly.



### **^** I Architecture and Alignment

#### How to Fix It?

- Optimize data structures so that frequently accessed data stays close together.
- Use cache-friendly algorithms that avoid unnecessary swaps.
- Increase cache size (if possible) to fit more data at once.

Basically, cache thrashing is the CPU equivalent of reorganizing your room so much that you never actually get to relax in it!

### Architecture and Alignment Macros

```
defined(__PPC64__) || defined(__mips64)

    #define NX_IS_64BIT 1
    typedef unsigned long long nx size t;
    typedef long long nx_ptrdiff_t;
 #else
    #define NX_IS_64BIT 0
    typedef unsigned long nx_size_t;
    typedef long nx_ptrdiff_t;
 #endif
 #define NX_ALIGN_16 __attribute__((aligned(16))) /* Align to 16 bytes */
 #define NX_ALIGN_32 __attribute__((aligned(32))) /* Align to 32 bytes */
 #define NX_CACHE_ALIGN_64 __attribute__((aligned(64))) /* Align to 64-byte
 #define NX_CACHE_ALIGN_128 __attribute__((aligned(128))) /* Align to 128-byte
 #define NX_CACHE_ALIGN_256 __attribute__((aligned(256))) /* Align to 256-byte
```

/\* Architecture Detection \*/ if defined(\_\_x86\_64\_\_) || defined(\_M\_X64) || defined(\_\_aarch64\_\_) || defined( PPC64 ) || defined( mips64) define NX IS 64BIT 1 typedef unsigned long long nx size t; typedef long long nx\_ptrdiff\_t; else define NX\_IS\_64BIT 0 typedef unsigned long nx\_size\_t; typedef long nx ptrdiff t; endif

/\* Memory Alignment Macros \*/ define NX\_ALIGN\_4 \_\_attribute\_\_((aligned(4))) /\* Align to 4 bytes \*/ define NX\_ALIGN\_8 \_\_attribute\_\_((aligned(8))) /\* Align to 8 bytes \*/ define NX\_ALIGN\_16 \_\_attribute\_\_((aligned(16))) /\* Align to 16 bytes \*/ define NX\_ALIGN\_32 \_\_attribute\_\_((aligned(32))) /\* Align to 32 bytes \*/ define NX CACHE ALIGN 64 attribute ((aligned(64))) /\* Align to 64-byte cache line  $^*/$  define NX\_CACHE\_ALIGN\_128 \_\_attribute\_\_((aligned(128)))  $/^*$  Align to 128-byte cache line  $^*/$  define NX\_CACHE\_ALIGN\_256 \_\_attribute\_\_((aligned(256)))  $/^*$  Align to 256-byte cache line  $^*/$ 



#### CPU Features

#### I CPU Features

This section includes macros for detecting CPU instruction sets, enabling branch prediction, and optimizing memory access through prefetching.

- NX HAS SSE: Set to 1 if SSE instructions are available, otherwise set to 0.
- NX\_HAS\_AVX: Set to 1 if AVX instructions are available, otherwise set to 0.
- NX\_HAS\_NEON: Set to 1 if NEON instructions are available, otherwise set to 0.

#### **CPU Features Macros**

```
/* CPU Instruction Set Detection */
 #if defined(__SSE__) || defined(__x86_64__) || defined(_M_X64)
     #define NX HAS SSE 1 /* SSE Instructions Available
 #else
     #define NX HAS SSE 0
 #endif
 #if defined(__AVX__) || defined(__AVX2__) || defined(__x86_64__) ||

→defined( M X64)
     #define NX_HAS_AVX 1 /* AVX Instructions Available */
 #else
     #define NX_HAS_AVX 0
 #endif
 #if defined(__ARM_NEON) || defined(__ARM_FEATURE_NEON)
     #define NX_HAS_NEON 1 /* NEON Instructions Available (ARM) */
     #define NX HAS NEON 0
 #endif
 #if defined( CACHE LINE SIZE)
     #define NX_CACHE_LINE_SIZE __CACHE_LINE_SIZE /* Cache line size */
 #elif defined(__GNUC__) && defined(__x86_64__)
     #define NX_CACHE_LINE_SIZE sysconf(_SC_LEVEL1_DCACHE_LINESIZE) /* Retrieve
 #else
     #define NX_CACHE_LINE_SIZE 64 /* Default to 64 bytes (common architecture)
 #endif
 #if defined(__GNUC__) || defined(__clang__)
     #define NX_LIKELY(x) __builtin_expect(!!(x), 1) /* Likely branch */
     #define NX_UNLIKELY(x) __builtin_expect(!!(x), 0) /* Unlikely branch */
```



```
#define NX_LIKELY(x) (x) /* No prediction available */
    \#define NX_UNLIKELY(x) (x)
#endif
#if defined(__GNUC__) || defined(__clang__)
    #define NX_PREFETCH(addr) __builtin_prefetch(addr)
#elif defined(_MSC_VER)
    #include <mmintrin.h}</pre>
    #define NX_PREFETCH(addr) _mm_prefetch((const char *)(addr), _MM_HINT_T0)
    #define NX_PREFETCH(addr) /* No prefetch available */
#endif
/* CPU Instruction Set Detection */ if defined(__SSE__) || defined(__x86_64__) || defined(_M_X64) define
NX_HAS_SSE 1 /* SSE Instructions Available */ else define NX_HAS_SSE 0 endif
if defined(_AVX__) || defined(_AVX2__) || defined(_x86_64__) || defined(_M_X64) define
NX_HAS_AVX 1 /* AVX Instructions Available */ else define NX_HAS_AVX 0 endif
if defined(__ARM_NEON) || defined(__ARM_FEATURE_NEON) define NX_HAS_NEON_1 /* NEON_In-
structions Available (ARM) */ else define NX_HAS_NEON 0 endif
    defined(__CACHE_LINE_SIZE)
                                  define
                                          NX_CACHE_LINE_SIZE
                                                                   CACHE LINE SIZE
Cache line size */ elif defined( GNUC )
                                              defined(__x86_64__) include <unistd.h> define
NX_CACHE_LINE_SIZE sysconf(_SC_LEVEL1_DCACHE_LINESIZE) /* Retrieve cache line size */
else define NX CACHE LINE SIZE 64 /* Default to 64 bytes (common architecture) */ endif
/* Branch Prediction Macros */ if defined(__GNUC__) || defined(__clang__) define NX_LIKELY(x)
 builtin expect(!!(x), 1) /* Likely branch */ define NX UNLIKELY(x) builtin expect(!!(x), 0) /* Unlikely
branch */ else define NX_LIKELY(x) (x) /* No prediction available */ define NX_UNLIKELY(x) (x) endif
/* Instruction Prefetching */ if defined(__GNUC__) || defined(__clang__) define NX_PREFETCH(addr)
 _builtin_prefetch(addr) elif defined(_MSC_VER) include <mmintrin.h define NX_PREFETCH(addr)
mm prefetch((const char *)(addr), MM HINT T0) else define NX PREFETCH(addr) /* No prefetch
available */ endif
```



### What is SSE (Streaming SIMD Extensions)?

SSE (Streaming SIMD Extensions) is a set of CPU instructions designed to improve performance when handling vectorized operations, especially for multimedia, gaming, and scientific computing. SIMD (Single Instruction, Multiple Data) means the CPU can apply the same operation to multiple data elements simultaneously—great for speed!

### **Key Features of SSE Instructions**

- Parallel Processing: SSE allows the CPU to process multiple numbers at once, instead of one at a time.
- Floating-Point Optimization: Speeds up tasks involving floating-point calculations, such as graphics rendering, physics simulations, and audio processing.
- **Enhanced Vector Math:** Ideal for operations on arrays of numbers—common in physics engines, machine learning, image processing, and more.

#### Evolution of SSE

Intel introduced SSE with the Pentium III (1999), and expanded it over time:

- SSE (1999): Introduced 128-bit registers for floating-point calculations.
- SSE2 (2001): Added support for integer operations alongside floating-point.
- SSE3 (2004): Improved multimedia performance and added horizontal operations.
- SSSE3 (2006): Advanced shuffle and blend operations for greater efficiency.
- SSE4 (2007-2008): Added text processing optimizations and more arithmetic functions.
- AVX (2011+): The successor to SSE, introducing 256-bit vector registers for even faster computation.

#### Real-World Uses of SSE

- Video encoding/decoding (e.g., accelerating codecs like H.264)
- Game physics calculations (e.g., handling objects in motion)
- AI and deep learning (matrix multiplications)
- Image processing and graphics rendering



In short, SSE makes complex mathematical operations much faster by allowing multiple calculations to happen at once instead of one at a time!

### What is AVX (Advanced Vector Extensions)?

AVX (Advanced Vector Extensions) is a set of CPU instructions that extends the capabilities of SSE (Streaming SIMD Extensions) to allow even more powerful parallel processing and high-performance computing. AVX is especially useful for workloads involving heavy mathematical calculations, like scientific simulations, deep learning, and multimedia processing.

### **How AVX Improves Performance**

Compared to SSE, AVX:

- Uses 256-bit registers (instead of 128-bit in SSE), allowing double the amount of data to be processed at once.
- Introduces new floating-point operations, boosting performance for graphics rendering, physics simulations, and machine learning.
- Supports fused multiply-add (FMA) operations, letting the CPU perform multiplication and addition in a single step—saving both time and energy.

#### **Evolution of AVX**

AVX has evolved through multiple generations:

- AVX (2011): Introduced 256-bit vector processing, ideal for floating-point calculations.
- **AVX2 (2013):** Added integer processing and better memory handling, improving workloads like video encoding and matrix operations.
- **AVX-512 (2017):** Uses 512-bit registers, supporting even larger parallel computations (common in high-performance computing and AI).

### **Real-World Uses of AVX**

- Cryptography Encrypting and decrypting data efficiently.
- Machine learning Accelerating deep learning operations.
- Video processing Faster encoding and decoding.
- Scientific simulations Modeling physics, fluid dynamics, and more.

Posix-Nexus C

**^ I CPU Features** 

Basically, AVX is SSE on steroids, making CPU calculations way faster and more efficient.





### What is FMA (Fused Multiply-Add)?

FMA stands for **Fused Multiply-Add**, a specialized CPU instruction that performs multiplication and addition in a single step. This improves both efficiency and precision in mathematical computations, especially in areas like graphics, physics simulations, machine learning, and scientific computing.

#### How Does FMA Work?

Instead of computing:

```
result = (a * b) + c;
result = (a * b) + c;
```

as two separate operations (multiplication followed by addition), FMA combines both into a single instruction:

```
result = FMA(a, b, c);
\overline{\text{result}} = \overline{\text{FMA}}(a, b, \underline{c});
```

### Why is this better?

- Normally, CPUs round after each arithmetic operation, which can reduce precision.
- With FMA, the multiplication and addition happen together, with only one rounding at the end.
- This reduces precision loss and improves accuracy.

### Why is FMA Important?

- Improves Performance FMA can process calculations twice as fast compared to separate multiplication and addition.
- Reduces Rounding Errors Floating-point math suffers from precision loss due to repeated rounding; FMA minimizes this.
- Optimized for Vectorized Math Works great in SIMD instruction sets like AVX, AVX2, and AVX-512, making complex computations much faster.

#### Where is FMA Used?

- Computer Graphics & Game Engines Faster lighting and physics calculations.
- Machine Learning & AI Accelerating matrix multiplications.
- DE Cryptography Efficient mathematical computations in encryption algorithms CPU Features
- Scientific Simulations Fluid dynamics, particle physics, financial modeling.



#### What is NEON?

NEON is ARM's SIMD (Single Instruction, Multiple Data) extension, designed to accelerate tasks like multimedia processing, cryptography, AI computations, and gaming on ARM-based processors (which are found in most smartphones and embedded systems).

### What Makes NEON Special?

- Vector Processing: NEON can process multiple data elements in parallel, making it much faster than handling data one-by-one.
- Optimized for Mobile: Unlike Intel's SSE and AVX, NEON is tailored for ARM-based CPUs, commonly used in mobile devices, tablets, and embedded systems.
- Efficient Floating-Point and Integer Computation: Speeds up operations like image filters, audio processing, physics calculations, and machine learning inference.

### **How NEON Improves Performance**

Imagine you need to add up multiple pairs of numbers:

- A regular CPU would do one addition at a time.
- NEON can process multiple additions at once using vectorized instructions.

#### NEON vs. SSE/AVX

- SSE/AVX (on x86 processors) also do SIMD, but AVX supports 256-bit and 512-bit registers for even higher parallelism.
- NEON (on ARM processors) uses 128-bit registers, optimized for lower power consumption (great for mobile devices).
- NEON is more power-efficient, making it the preferred SIMD option for smartphones and embedded systems.

#### Where NEON is Used

- Graphics & Image Processing Filters, scaling, transformations
- Audio Processing Sound effects, speech recognition
- Cryptography Secure hashing and encryption operations
- Machine Learning Accelerating deep learning inference
- Physics Engines Collision detection in mobile games



### **How SIMD Instructions Use Large Registers**

When we talk about SSE being 128-bit or AVX being 256-bit, we're referring to special registers inside the CPU designed for vectorized operations (**SIMD**: Single Instruction, Multiple Data). Even if the CPU architecture is 64-bit, it still has dedicated SIMD registers (like **XMM** for SSE and **YMM** for AVX) that hold larger chunks of data at once.

### **Example: SIMD Register Capacity**

Let's say we're doing math on four 32-bit numbers at once:

- A normal CPU register (64-bit) would only hold two 32-bit numbers.
- An SSE register (128-bit) can hold four 32-bit numbers.
- An AVX register (256-bit) can hold eight 32-bit numbers!

This lets the CPU process multiple data points in a single instruction, improving speed for tasks like image processing, AI computations, and physics simulations.

#### **Architectural Bits vs. SIMD Bits**

- Architecture (16-bit, 32-bit, 64-bit) refers to general CPU capabilities like memory addressing and instruction execution.
- SSE and AVX registers are special-purpose registers inside the CPU designed to handle larger data chunks, independently of the standard architecture.

That's how a 64-bit CPU can still have 128-bit SSE and 256-bit AVX instructions—they're simply using separate vector registers for fast parallel computation!

### Cache Line Size (NX\_CACHE\_LINE\_SIZE)

This macro defines the size of a cache line—the smallest chunk of memory a CPU loads into the cache when accessing data.

#### **How the Macro Works**

- If the system provides \_\_CACHE\_LINE\_SIZE, we use it.
- If using GNU C and x86\_64, it retrieves the L1 data cache line size via sysconf(\_SC\_LEVEL1\_DCACHE\_LINESIZE).
- Otherwise, we default to 64 bytes, a common cache line size in modern processors.



### Why Does Cache Line Size Matter?

Efficient cache use minimizes cache thrashing and improves memory access speed. If data structures are aligned properly to match the cache line size, we reduce unnecessary cache misses and improve performance.

### Branch Prediction Macros (NX LIKELY and NX UNLIKELY)

Branch prediction is a CPU technique that tries to guess which way an if statement or loop condition will go.

```
__builtin_expect(x, 1) tells the compiler that x is very likely to be true.
```

```
builtin_expect(x, 0) tells the compiler that x is unlikely to be true.
```

### Why is this important?

CPUs use pipelines to execute instructions efficiently. If the CPU incorrectly predicts a branch, it must flush its pipeline and restart—hurting performance.

Using branch prediction hints helps the compiler generate optimized machine code, reducing the number of branch mispredictions.

### **Branch Prediction Example**

```
if (NX_LIKELY(value > 0)) {
```

if (NX\_LIKELY(value > 0)) // Fast path: Most of the time, value is > 0 else // Slow path: Rare case

#### **^ I CPU Features**

This allows the CPU to optimize for the most common case, improving execution speed.

#### Final Thoughts

Both cache line optimizations and branch prediction hints help squeeze extra performance out of the CPU by minimizing wasted cycles. The goal is to make memory access efficient and ensure the processor doesn't waste time on incorrect predictions.

Posix-Nexus C



#### I CPU Core Detection

### I CPU Core Detection

This inline function retrieves the number of logical CPU cores on the current system. On macOS, it uses the sysctlbyname function to query the hw.logicalcpu property.

```
CPU Core Detection Function

#ifndef NX_CPU_CORES

static inline nx_u32_t nx_cpu_cores()

{
    nx_u32_t c;
    nx_size_t s = sizeof(c);
    sysctlbyname("hw.logicalcpu", &c, &s, NX_NULL, 0);
    return c;

}

#define NX_CPU_CORES nx_cpu_cores()

#endif

ifndef NX_CPU_CORES static inline nx_u32_t nx_cpu_cores() nx_u32_t c; nx_size_t s = sizeof(c); sysctl-
```

byname("hw.logicalcpu", c, s, NX\_NULL, 0); return c; define NX\_CPU\_CORES nx\_cpu\_cores() endif